Previously, prometheus_notifications_errors_total was incremented by
one whenever a batch of alerts was affected by an error during sending
to a specific alertmanager. However, the corresponding metric
prometheus_notifications_sent_total, counting all alerts that were
sent (including those where the send ended in error), is incremented
by the batch size, i.e. the number of alerts.
Therefore, the ratio used in the mixin for the
PrometheusErrorSendingAlertsToSomeAlertmanagers alert is inconsistent.
This commit changes the increment of
prometheus_notifications_errors_total to the number of alerts that
were sent in the attempt that ended in an error. It also adjusts the
metric's help string accordingly and makes the wording of the alert in
the mixin more precise.
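For illustration, the ratio used by the alert has roughly this shape (a
sketch, not the exact mixin expression; selectors and the threshold are
omitted or illustrative). It only makes sense if both counters count the
same unit, i.e. alerts:

```promql
# Fraction of alerts whose send attempt to a given Alertmanager ended in
# error. Consistent only if both counters are incremented per alert, not
# one per batch and the other per alert.
  rate(prometheus_notifications_errors_total[5m])
/
  rate(prometheus_notifications_sent_total[5m])
> 0.01
```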
Signed-off-by: beorn7 <beorn@grafana.com>
that was introduced in https://github.com/prometheus/prometheus/pull/13554
The same motivation for adding the metric applies: to avoid silent SD failures,
as existing logs may not be regularly checked and can be missed.
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Co-authored-by: Simon Pasquier <spasquie@redhat.com>
This alert will never return anything as the left side of the query has
the labels `[component, environment, instance, job, type]` while the
right side has `[component, environment, instance, job]`.
The `type` label was added to `prometheus_tsdb_head_samples_appended_total`
in https://github.com/prometheus/prometheus/pull/11395, but the mixin wasn't
updated for the new label.
This was found with [pint](https://github.com/cloudflare/pint) PromQL
linting.
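A minimal sketch of the kind of fix (not the exact mixin expression; the
comparison is illustrative): aggregate the new label away so both sides of
the binary operation carry the same label set again. Ignoring `type` in the
vector matching would be an alternative.

```promql
# Drop the new `type` label on the left-hand side so it matches the
# label set of the right-hand side again.
sum without (type) (
  rate(prometheus_tsdb_head_samples_appended_total[5m])
) <= 0
```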
Signed-off-by: Will Bollock <wbollock@linode.com>
* add alert for sd refresh failure
Due to a config error or the SD service being down, Prometheus may fail to refresh SD resources, which may lead to failed scrapes or irrelevant metrics (see the sketch after this commit message).
Signed-off-by: Leo Q <LeoQuote@users.noreply.github.com>
* apply suggestions
Signed-off-by: Leo Q <LeoQuote@users.noreply.github.com>
---------
Signed-off-by: Leo Q <LeoQuote@users.noreply.github.com>
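Referring to the SD refresh failure alert described above, a hedged sketch
of its shape; the metric name `prometheus_sd_refresh_failures_total`, the
window, and the threshold are assumptions, not necessarily what the mixin
ends up with:

```promql
# A refresh-based service discovery mechanism has recently failed to refresh.
increase(prometheus_sd_refresh_failures_total[10m]) > 0
```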
Currently we're hardcoding `job="prometheus-k8s"` as the selector. This
doesn't work if your Prometheus runs elsewhere. Fortunately we have
`prometheusSelector` in `$._config`, which all the other alerts use.
Use that here too.
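For illustration only (`some_metric` is a placeholder, and the rendered
selector depends on how `$._config.prometheusSelector` is set, e.g. to
`job="prometheus"`):

```promql
# Before: the job selector is hardcoded in the rule.
sum(rate(some_metric{job="prometheus-k8s"}[5m]))

# After: the selector comes from $._config.prometheusSelector and is
# rendered when the mixin is built.
sum(rate(some_metric{job="prometheus"}[5m]))
```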
Signed-off-by: Iain Lane <iain@orangesquash.org.uk>
This commit adds an alert to the Prometheus mixin which triggers when
Prometheus has failed scrapes that exceeded the configured
sample_limit for that job.
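A sketch of the shape of such an alert, assuming the counter
`prometheus_target_scrapes_exceeded_sample_limit_total` tracks scrapes
rejected for exceeding sample_limit (selector and window are illustrative):

```promql
# Scrapes have recently failed because sample_limit was exceeded.
increase(prometheus_target_scrapes_exceeded_sample_limit_total[5m]) > 0
```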
Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>
Add cleanup of the lockfile when the db is cleanly closed
The metric describes the status of the lockfile on startup:
0: Already existed
1: Did not exist
-1: Disabled
Therefore, if the min value over time of this metric is 0, it means that executions have exited uncleanly.
We can then use that metric to set a much lower threshold on the crashlooping alert:
If the metric exists and has been zero, two restarts are enough to trigger the alert.
If it does not exist (e.g. an old Prometheus version), the current five-restart threshold remains.
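A hedged sketch of the resulting alert logic, following the thresholds
described above; the metric name `prometheus_tsdb_clean_start` is an
assumption and the final expression may differ (e.g. checking only the
latest value):

```promql
# Evidence of an unclean shutdown: two restarts within 30m already fire.
(
  prometheus_tsdb_clean_start == 0
and
  changes(process_start_time_seconds[30m]) > 1
)
or
# Otherwise (including old versions without the metric), keep the
# previous, higher five-restart threshold.
changes(process_start_time_seconds[30m]) > 4
```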
Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
* Change metric name + set unset value to -1
Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
* Only check the last value of the clean start alert
Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
* Fix test + nit
Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
* scrape: add label limits per scrape
Add three new limits to the scrape configuration to provide some
mechanism to defend against unbound number of labels and excessive
label lengths. If any of these limits are broken by a sample from a
scrape, the whole scrape will fail. For all of these configuration
options, a zero value means no limit.
The `label_limit` configuration will provide a mechanism to bound the
number of labels per-scrape of a certain sample to a user defined limit.
This limit will be tested against the sample labels plus the discovery
labels, but it will exclude the __name__ label from the count since it is a
mandatory Prometheus label to which applying constraints isn't
meaningful.
The `label_name_length_limit` and `label_value_length_limit` will
prevent labels of excessive length. These limits also skip the
__name__ label for the same reasons as the `label_limit` option and will
also make the scrape fail if any sample has a label name/value whose
length exceeds the predefined limits.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* scrape: add metrics and alert to label limits
Add three gauges, one for each label limit, to easily access the
limit set for a certain scrape target.
Also add a counter to count the number of targets that exceeded the
label limits and thus were dropped. This is useful for the
`PrometheusLabelLimitHit` alert that will notify the users that scraping
some targets failed because they had samples exceeding the label limits
defined in the scrape configuration (see the sketch after this commit message).
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* scrape: apply label limits to __name__ label
Apply limits to the __name__ label that was previously skipped and
truncate the label names and values in the error messages as they can be
very long.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* scrape: remove label limits gauges and refactor
Remove `prometheus_target_scrape_pool_label_limit`,
`prometheus_target_scrape_pool_label_name_length_limit`, and
`prometheus_target_scrape_pool_label_value_length_limit` as they are not
really useful since they don't carry information about the labels themselves.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
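As referenced in the label-limits commit above, a sketch of the shape of the
`PrometheusLabelLimitHit` alert; the counter name
`prometheus_target_scrape_pool_exceeded_label_limits_total` and the window
are assumptions:

```promql
# Targets were dropped because a scrape exposed samples violating the
# configured label_limit, label_name_length_limit, or
# label_value_length_limit.
increase(prometheus_target_scrape_pool_exceeded_label_limits_total[5m]) > 0
```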
Currently, it relies on `job, instance` being the labels completely
identifying a Prometheus instance. However, what's intended is to
simply not match on `remote_name, url`.
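A hedged illustration of the change in matching, using remote-write metric
names for concreteness; the exact alert expression and threshold may differ:

```promql
# How far ingestion is ahead of what each remote write queue has shipped.
# The matching simply ignores the queue-specific labels instead of assuming
# that (job, instance) completely identifies a Prometheus instance.
(
  max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
- ignoring (remote_name, url) group_right
  max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m])
) > 120
```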
Signed-off-by: beorn7 <beorn@grafana.com>
There is certainly a potential to add more of these. This is mostly
meant to introduce the concept and cover a few critical parts.
Signed-off-by: beorn7 <beorn@grafana.com>
The counter is only increased when tsdb.Open() is called which
Prometheus does only once in its lifetime (when it initializes). If the
corruption can't be recovered, tsdb.Open() returns an error and
Prometheus exits. Hence the metric is either 0 (no corruption) or 1
(corruption detected and repaired). If the latter, the alert isn't
actionable and the only way to resolve it is to restart Prometheus which
would reset the counter.
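For illustration (assuming the counter in question is
`prometheus_tsdb_wal_corruptions_total`; this is not necessarily the exact
mixin expression), an alert of this shape fires once a corruption was
repaired and can only resolve via a restart:

```promql
# Stays firing forever after a corruption was repaired at startup, because
# the counter only changes in tsdb.Open() and only a restart resets it.
prometheus_tsdb_wal_corruptions_total > 0
```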
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
While doing so, re-introduce the summary/description
annotations. Also, add a few more rules and tweak a few of the
existing ones.
Signed-off-by: beorn7 <beorn@grafana.com>