Previously, prometheus_notifications_errors_total was incremented by
one whenever a batch of alerts was affected by an error during sending
to a specific alertmanager. However, the corresponding metric
prometheus_notifications_sent_total, counting all alerts that were
sent (including those where the send ended in error), is incremented
by the batch size, i.e. the number of alerts.
Therefore, the ratio used in the mixin for the
PrometheusErrorSendingAlertsToSomeAlertmanagers alert is inconsistent.
This commit changes the increment of
prometheus_notifications_errors_total to the number of alerts that
were sent in the attempt that ended in an error. It also adjusts the
metric's help string accordingly and makes the wording of the alert in
the mixin more precise.
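For illustration, the ratio used by the alert has roughly this shape (a
sketch, not the exact mixin expression; selectors and the threshold are
omitted or illustrative). It only makes sense if both counters count the
same unit, i.e. alerts:

```promql
# Fraction of alerts whose send attempt to a given Alertmanager ended in
# error. Consistent only if both counters are incremented per alert, not
# one per batch and the other per alert.
  rate(prometheus_notifications_errors_total[5m])
/
  rate(prometheus_notifications_sent_total[5m])
> 0.01
```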
Signed-off-by: beorn7 <beorn@grafana.com>
that was introduced in https://github.com/prometheus/prometheus/pull/13554
The same motivation for adding the metric applies: to avoid silent SD failures,
as existing logs may not be regularly checked and can be missed.
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Co-authored-by: Simon Pasquier <spasquie@redhat.com>
This alert will never return anything as the left side of the query has
the labels `[component, environment, instance, job, type]` while the
right side has `[component, environment, instance, job]`.
The `type` label was added to `prometheus_tsdb_head_samples_appended_total`
in https://github.com/prometheus/prometheus/pull/11395, but the mixin wasn't
updated for the new label.
This was found with [pint](https://github.com/cloudflare/pint) PromQL
linting.
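A minimal sketch of the kind of fix (not the exact mixin expression; the
comparison is illustrative): aggregate the new label away so both sides of
the binary operation carry the same label set again. Ignoring `type` in the
vector matching would be an alternative.

```promql
# Drop the new `type` label on the left-hand side so it matches the
# label set of the right-hand side again.
sum without (type) (
  rate(prometheus_tsdb_head_samples_appended_total[5m])
) <= 0
```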
Signed-off-by: Will Bollock <wbollock@linode.com>
* add alert for sd refresh failure
Due to a config error or the SD service being down, Prometheus may fail to refresh SD resources, which may lead to failed scrapes or irrelevant metrics (see the sketch after this commit message).
Signed-off-by: Leo Q <LeoQuote@users.noreply.github.com>
* apply suggestions
Signed-off-by: Leo Q <LeoQuote@users.noreply.github.com>
---------
Signed-off-by: Leo Q <LeoQuote@users.noreply.github.com>
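Referring to the SD refresh failure alert described above, a hedged sketch
of its shape; the metric name `prometheus_sd_refresh_failures_total`, the
window, and the threshold are assumptions, not necessarily what the mixin
ends up with:

```promql
# A refresh-based service discovery mechanism has recently failed to refresh.
increase(prometheus_sd_refresh_failures_total[10m]) > 0
```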
Currently we're hardcoding `job="prometheus-k8s"` as the selector. This
doesn't work if your Prometheus runs elsewhere. Fortunately we have
`prometheusSelector` in `$._config`, which all the other alerts use.
Use that here too.
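For illustration only (`some_metric` is a placeholder, and the rendered
selector depends on how `$._config.prometheusSelector` is set, e.g. to
`job="prometheus"`):

```promql
# Before: the job selector is hardcoded in the rule.
sum(rate(some_metric{job="prometheus-k8s"}[5m]))

# After: the selector comes from $._config.prometheusSelector and is
# rendered when the mixin is built.
sum(rate(some_metric{job="prometheus"}[5m]))
```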
Signed-off-by: Iain Lane <iain@orangesquash.org.uk>
This commit adds an alert to the Prometheus mixin which triggers when
Prometheus has failed scrapes that exceeded the configured
sample_limit for that job.
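A sketch of the shape of such an alert, assuming the counter
`prometheus_target_scrapes_exceeded_sample_limit_total` tracks scrapes
rejected for exceeding sample_limit (selector and window are illustrative):

```promql
# Scrapes have recently failed because sample_limit was exceeded.
increase(prometheus_target_scrapes_exceeded_sample_limit_total[5m]) > 0
```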
Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>
Add cleanup of the lockfile when the db is cleanly closed
The metric describes the status of the lockfile on startup:
0: Already existed
1: Did not exist
-1: Disabled
Therefore, if the min value over time of this metric is 0, it means that executions have exited uncleanly.
We can then use that metric to set a much lower threshold on the crashlooping alert:
If the metric exists and has been zero, two restarts are enough to trigger the alert.
If it does not exist (e.g. an old Prometheus version), the current five-restart threshold remains.
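A hedged sketch of the resulting alert logic, following the thresholds
described above; the metric name `prometheus_tsdb_clean_start` is an
assumption and the final expression may differ (e.g. checking only the
latest value):

```promql
# Evidence of an unclean shutdown: two restarts within 30m already fire.
(
  prometheus_tsdb_clean_start == 0
and
  changes(process_start_time_seconds[30m]) > 1
)
or
# Otherwise (including old versions without the metric), keep the
# previous, higher five-restart threshold.
changes(process_start_time_seconds[30m]) > 4
```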
Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
* Change metric name + set unset value to -1
Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
* Only check the last value of the clean start alert
Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
* Fix test + nit
Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>
* scrape: add label limits per scrape
Add three new limits to the scrape configuration to provide some
mechanism to defend against unbound number of labels and excessive
label lengths. If any of these limits are broken by a sample from a
scrape, the whole scrape will fail. For all of these configuration
options, a zero value means no limit.
The `label_limit` configuration will provide a mechanism to bound the
number of labels per-scrape of a certain sample to a user defined limit.
This limit will be tested against the sample labels plus the discovery
labels, but it will exclude the __name__ label from the count since it is a
mandatory Prometheus label to which applying constraints isn't
meaningful.
The `label_name_length_limit` and `label_value_length_limit` will
prevent labels of excessive length. These limits also skip the
__name__ label for the same reasons as the `label_limit` option and will
also make the scrape fail if any sample has a label name/value whose
length exceeds the predefined limits.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* scrape: add metrics and alert to label limits
Add three gauges, one for each label limit, to easily access the
limit set for a certain scrape target.
Also add a counter to count the number of targets that exceeded the
label limits and thus were dropped. This is useful for the
`PrometheusLabelLimitHit` alert that will notify the users that scraping
some targets failed because they had samples exceeding the label limits
defined in the scrape configuration (see the sketch after this commit message).
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* scrape: apply label limits to __name__ label
Apply limits to the __name__ label that was previously skipped and
truncate the label names and values in the error messages as they can be
very long.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* scrape: remove label limits gauges and refactor
Remove `prometheus_target_scrape_pool_label_limit`,
`prometheus_target_scrape_pool_label_name_length_limit`, and
`prometheus_target_scrape_pool_label_value_length_limit` as they are not
really useful since they don't carry information about the labels themselves.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
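As referenced in the label-limits commit above, a sketch of the shape of the
`PrometheusLabelLimitHit` alert; the counter name
`prometheus_target_scrape_pool_exceeded_label_limits_total` and the window
are assumptions:

```promql
# Targets were dropped because a scrape exposed samples violating the
# configured label_limit, label_name_length_limit, or
# label_value_length_limit.
increase(prometheus_target_scrape_pool_exceeded_label_limits_total[5m]) > 0
```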
Currently, it relies on `job, instance` being the labels completely
identifying a Prometheus instance. However, what's intended is to
simply not match on `remote_name, url`.
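A hedged illustration of the change in matching, using remote-write metric
names for concreteness; the exact alert expression and threshold may differ:

```promql
# How far ingestion is ahead of what each remote write queue has shipped.
# The matching simply ignores the queue-specific labels instead of assuming
# that (job, instance) completely identifies a Prometheus instance.
(
  max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
- ignoring (remote_name, url) group_right
  max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m])
) > 120
```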
Signed-off-by: beorn7 <beorn@grafana.com>
There is certainly a potential to add more of these. This is mostly
meant to introduce the concept and cover a few critical parts.
Signed-off-by: beorn7 <beorn@grafana.com>
The counter is only increased when tsdb.Open() is called which
Prometheus does only once in its lifetime (when it initializes). If the
corruption can't be recovered, tsdb.Open() returns an error and
Prometheus exits. Hence the metric is either 0 (no corruption) or 1
(corruption detected and repaired). If the latter, the alert isn't
actionable and the only way to resolve it is to restart Prometheus which
would reset the counter.
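For illustration (assuming the counter in question is
`prometheus_tsdb_wal_corruptions_total`; this is not necessarily the exact
mixin expression), an alert of this shape fires once a corruption was
repaired and can only resolve via a restart:

```promql
# Stays firing forever after a corruption was repaired at startup, because
# the counter only changes in tsdb.Open() and only a restart resets it.
prometheus_tsdb_wal_corruptions_total > 0
```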
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
While doing so, re-introduce the summary/description
annotations. Also, add a few more rules and tweak a few of the
existing ones.
Signed-off-by: beorn7 <beorn@grafana.com>