Previously, prometheus_notifications_errors_total was incremented by
one whenever a batch of alerts was affected by an error during sending
to a specific alertmanager. However, the corresponding metric
prometheus_notifications_sent_total, which counts all alerts that were
sent (including those where sending ended in an error), is incremented
by the batch size, i.e. the number of alerts.
Therefore, the ratio used in the mixin for the
PrometheusErrorSendingAlertsToSomeAlertmanagers alert is inconsistent.
This commit changes the increment of
prometheus_notifications_errors_total to the number of alerts that
were sent in the attempt that ended in an error. It also adjusts the
metric's help string accordingly and makes the wording of the alert in
the mixin more precise.
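To illustrate why both counters need the same unit, here is a minimal Go sketch. The `notifierCounters` type and its methods are hypothetical stand-ins for the two counters named above, not the actual notifier code.

```go
package main

import "fmt"

// notifierCounters is a hypothetical stand-in for the two counters
// discussed above; it is not the real notifier implementation.
type notifierCounters struct {
	sent   int // models prometheus_notifications_sent_total
	errors int // models prometheus_notifications_errors_total
}

// recordBatch accounts for one send attempt of batchSize alerts.
// Both counters advance by the number of alerts, so errors/sent is a
// per-alert ratio.
func (c *notifierCounters) recordBatch(batchSize int, failed bool) {
	c.sent += batchSize
	if failed {
		c.errors += batchSize
	}
}

// errorRatio returns errors/sent, or 0 when nothing has been sent yet.
func (c *notifierCounters) errorRatio() float64 {
	if c.sent == 0 {
		return 0
	}
	return float64(c.errors) / float64(c.sent)
}

func main() {
	var c notifierCounters
	c.recordBatch(10, true)  // a failed batch of 10 alerts
	c.recordBatch(30, false) // a successful batch of 30 alerts
	fmt.Printf("%.2f\n", c.errorRatio()) // prints 0.25
}
```

With the previous per-batch increment, the same sequence would yield 1/40 instead of 10/40, understating the share of alerts affected.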
Signed-off-by: beorn7 <beorn@grafana.com>
that was introduced in https://github.com/prometheus/prometheus/pull/13554
The same motivation for adding the metric applies: To avoid silent SD failures,
as existing logs may not be regularly checked and can be missed.
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Co-authored-by: Simon Pasquier <spasquie@redhat.com>
Allow users to opt out of the multi-cluster setup for the Prometheus
dashboard in environments where it isn't applicable.
Refer: https://github.com/prometheus/prometheus/pull/13180.
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
This alert will never return anything as the left side of the query has
the labels `[component, environment, instance, job, type]` while the
right side has `[component, environment, instance, job]`.
The `type` label was added to `prometheus_tsdb_head_samples_appended_total` in this PR but the mixin wasn't updated
for the new label: https://github.com/prometheus/prometheus/pull/11395
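The mismatch can be modelled with a small Go sketch of one-to-one matching, where two series pair up only if their label signatures are equal. `matchKey` and the sample label values are hypothetical and only loosely mirror PromQL semantics. Ignoring the extra label is shown as one possible remedy; the message above does not say which fix was applied.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// matchKey builds a signature from a series' labels, skipping any label
// listed in ignoring. In this simplified model, two series match only
// when their signatures are equal. Label values containing "," or "="
// are not handled; this is a sketch, not PromQL's real implementation.
func matchKey(labels map[string]string, ignoring ...string) string {
	skip := make(map[string]bool, len(ignoring))
	for _, name := range ignoring {
		skip[name] = true
	}
	parts := make([]string, 0, len(labels))
	for name, value := range labels {
		if !skip[name] {
			parts = append(parts, name+"="+value)
		}
	}
	sort.Strings(parts) // map iteration order is random, so sort for stability
	return strings.Join(parts, ",")
}

func main() {
	// Hypothetical label values for illustration only.
	left := map[string]string{"instance": "a:9090", "job": "prometheus", "type": "float"}
	right := map[string]string{"instance": "a:9090", "job": "prometheus"}

	fmt.Println(matchKey(left) == matchKey(right))                 // false: the extra label breaks the match
	fmt.Println(matchKey(left, "type") == matchKey(right, "type")) // true: ignoring it restores the match
}
```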
This was found with [pint](https://github.com/cloudflare/pint) PromQL
linting
Signed-off-by: Will Bollock <wbollock@linode.com>
* add alert for sd refresh failure
Due to a config error or an SD service being down, Prometheus may fail to refresh SD resources, which may lead to scrape failures or irrelevant metrics.
Signed-off-by: Leo Q <LeoQuote@users.noreply.github.com>
* apply suggestions
Signed-off-by: Leo Q <LeoQuote@users.noreply.github.com>
---------
Signed-off-by: Leo Q <LeoQuote@users.noreply.github.com>
Prior to this fix, the uptime column was interpreted as a plain number, and higher values were suffixed with raw units like `K`. This commit sets the column's unit to `second` to make visual interpretation easier.
Signed-off-by: Arunprasad Rajkumar <ar.arunprasad@gmail.com>
Currently we're hardcoding `job="prometheus-k8s"` as selector. This
doesn't work if your prometheus is elsewhere. Fortunately we have
`prometheusSelector` in `$._config` which all the other alerts use.
Use that here too.
Signed-off-by: Iain Lane <iain@orangesquash.org.uk>
This commit adds an alert in the prometheus mixin which triggers when
Prometheus has failed scrapes that have exceeded the configured
sample_limit for that job.
Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>
Add cleanup of the lockfile when the db is cleanly closed
The metric describes the status of the lockfile on startup
0: Already existed
1: Did not exist
-1: Disabled
Therefore, if the minimum value over time of this metric is 0, at least one execution has exited uncleanly.
We can then use that metric to set a much lower threshold on the crashlooping alert:
If the metric exists and has been zero, two restarts are enough to trigger the alert.
If it does not exist (an old Prometheus version, for example), the current threshold of five restarts remains.
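The threshold selection described above can be sketched in Go as follows. The function and constant names are hypothetical; in the mixin this logic is expressed as an alert rule, not Go code.

```go
package main

import "fmt"

// Values of the startup lockfile metric as listed above.
const (
	lockfileExisted  = 0  // lockfile already existed: previous exit was unclean
	lockfileAbsent   = 1  // lockfile did not exist: previous exit was clean
	lockfileDisabled = -1 // lockfile is disabled
)

// restartThreshold returns the number of restarts that should trigger the
// crashloop alert. minCleanStart is the minimum value of the metric over
// the observed window; present is false when the metric does not exist.
func restartThreshold(minCleanStart int, present bool) int {
	if present && minCleanStart == lockfileExisted {
		return 2
	}
	return 5
}

// crashLooping reports whether the observed restart count reaches the
// applicable threshold.
func crashLooping(restarts, minCleanStart int, present bool) bool {
	return restarts >= restartThreshold(minCleanStart, present)
}

func main() {
	fmt.Println(crashLooping(2, lockfileExisted, true))  // true: unclean exits seen, low threshold
	fmt.Println(crashLooping(2, lockfileAbsent, true))   // false: clean starts only, threshold stays at 5
	fmt.Println(crashLooping(4, lockfileDisabled, true)) // false: lockfile disabled, threshold stays at 5
	fmt.Println(crashLooping(5, 0, false))               // true: metric missing, five restarts reached
}
```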
Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
* Change metric name + set unset value to -1
Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
* Only check the last value of the clean start alert
Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
* Fix test + nit
Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
* scrape: add label limits per scrape
Add three new limits to the scrape configuration to provide a
mechanism to defend against an unbounded number of labels and excessive
label lengths. If any of these limits is exceeded by a sample from a
scrape, the whole scrape will fail. For all of these configuration
options, a zero value means no limit.
The `label_limit` configuration will provide a mechanism to bound the
number of labels on each scraped sample to a user-defined limit.
This limit will be tested against the sample labels plus the discovery
labels, but it will exclude the __name__ from the count since it is a
mandatory Prometheus label to which applying constraints isn't
meaningful.
The `label_name_length_limit` and `label_value_length_limit` will
prevent having labels of excessive lengths. These limits also skip the
__name__ label for the same reasons as the `label_limit` option and will
also make the scrape fail if any sample has a label name/value length
that exceeds the predefined limits.
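A minimal Go sketch of the checks described in this commit follows. The `labelLimits` type and `checkLabelLimits` function are hypothetical, not the actual scrape code; the sketch skips `__name__` as this commit describes, treats zero as "no limit", and measures lengths in bytes as an assumption.

```go
package main

import "fmt"

// labelLimits mirrors the three configuration options described above.
// A zero value disables the corresponding limit.
type labelLimits struct {
	labelLimit            int // models label_limit
	labelNameLengthLimit  int // models label_name_length_limit
	labelValueLengthLimit int // models label_value_length_limit
}

// checkLabelLimits returns an error if the sample's labels exceed any
// configured limit. The __name__ label is excluded from every check.
// Lengths are counted in bytes, which is an assumption of this sketch.
func checkLabelLimits(labels map[string]string, lim labelLimits) error {
	count := 0
	for name, value := range labels {
		if name == "__name__" {
			continue
		}
		count++
		if lim.labelNameLengthLimit > 0 && len(name) > lim.labelNameLengthLimit {
			return fmt.Errorf("label name %q exceeds length limit %d", name, lim.labelNameLengthLimit)
		}
		if lim.labelValueLengthLimit > 0 && len(value) > lim.labelValueLengthLimit {
			return fmt.Errorf("value of label %q exceeds length limit %d", name, lim.labelValueLengthLimit)
		}
	}
	if lim.labelLimit > 0 && count > lim.labelLimit {
		return fmt.Errorf("sample has %d labels, limit is %d", count, lim.labelLimit)
	}
	return nil
}

func main() {
	sample := map[string]string{"__name__": "http_requests_total", "job": "api", "instance": "a:9090"}

	fmt.Println(checkLabelLimits(sample, labelLimits{}) == nil)              // true: zero values disable every limit
	fmt.Println(checkLabelLimits(sample, labelLimits{labelLimit: 1}) != nil) // true: two labels besides __name__
	fmt.Println(checkLabelLimits(sample, labelLimits{labelLimit: 2}) == nil) // true: exactly at the limit
}
```

In the behaviour described above, a single sample failing such a check causes the whole scrape to fail.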
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* scrape: add metrics and alert to label limits
Add three gauges, one for each label limit, to easily access the
limit set for a certain scrape target.
Also add a counter to count the number of targets that exceeded the
label limits and thus were dropped. This is useful for the
`PrometheusLabelLimitHit` alert that will notify the users that scraping
some targets failed because they had samples exceeding the label limits
defined in the scrape configuration.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* scrape: apply label limits to __name__ label
Apply limits to the __name__ label, which was previously skipped, and
truncate the label names and values in the error messages as they can be
very long.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* scrape: remove label limits gauges and refactor
Remove `prometheus_target_scrape_pool_label_limit`,
`prometheus_target_scrape_pool_label_name_length_limit`, and
`prometheus_target_scrape_pool_label_value_length_limit` as they are not
really useful since we don't have the information on the labels in it.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
In its current form this configuration clashes with one of the most widely
used configurations (kube-prometheus). This patch scopes the
configuration to prevent this.
Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
Currently, it relies on `job, instance` being the labels completely
identifying a Prometheus instance. However, what's intended is to
simply not match on `remote_name, url`.
Signed-off-by: beorn7 <beorn@grafana.com>
There is certainly a potential to add more of these. This is mostly
meant to introduce the concept and cover a few critical parts.
Signed-off-by: beorn7 <beorn@grafana.com>
This should be the way forward when importing libraries in jsonnet. It's
closer to how Go imports look and makes it more obvious where packages
live.
This is not breaking anything, as the old imports were already symlinks
to the now directly used directories.
Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>
* Mixin: Ignore unset remote write timestamp
This pull request ignores the zero value of highest_sent_timestamp_seconds
in Highest Timestamp In vs. Highest Timestamp Sent, which just shows that
remote write has not been successful yet.
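The idea can be sketched in Go as a lag computation that reports "no data" while the sent timestamp is still zero. `remoteWriteLag` is a hypothetical helper; the actual change is made in the mixin's queries.

```go
package main

import "fmt"

// remoteWriteLag returns highestIn - highestSent in seconds.
// ok is false when highestSent is zero, which is treated as "nothing has
// been sent successfully yet" instead of as a huge lag.
func remoteWriteLag(highestIn, highestSent float64) (lag float64, ok bool) {
	if highestSent == 0 {
		return 0, false
	}
	return highestIn - highestSent, true
}

func main() {
	if _, ok := remoteWriteLag(1700000000, 0); !ok {
		fmt.Println("no successful send yet, lag ignored")
	}
	if lag, ok := remoteWriteLag(1700000000, 1699999970); ok {
		fmt.Printf("lag: %.0fs\n", lag) // prints "lag: 30s"
	}
}
```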
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* .circleci/config.yml: check mixins
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Run jsonnetfmt
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Install tools in the image instead of using coreos/jsonnet-ci
The latter is deprecated
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Update jsonnetfile.json
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Due to https://github.com/grafana/grafana/issues/15642, this prevents users putting this dashboard in a Grafana folder called 'Prometheus'.
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>