prometheus

Commit Graph

Author	SHA1	Message	Date
beorn7	e01c5cefac	notifier: fix increment of metric prometheus_notifications_errors_total Previously, prometheus_notifications_errors_total was incremented by one whenever a batch of alerts was affected by an error during sending to a specific alertmanager. However, the corresponding metric prometheus_notifications_sent_total, counting all alerts that were sent (including those where the sent ended in error), is incremented by the batch size, i.e. the number of alerts. Therefore, the ratio used in the mixin for the PrometheusErrorSendingAlertsToSomeAlertmanagers alert is inconsistent. This commit changes the increment of prometheus_notifications_errors_total to the number of alerts that were sent in the attempt that ended in an error. It also adjusts the metrics help string accordingly and makes the wording in the alert in the mixin more precise. Signed-off-by: beorn7 <beorn@grafana.com>	2024-11-26 15:50:02 +01:00
machine424	f9ca6c4ae6	chore: add an alert based on the metric prometheus_sd_kubernetes_failures_total that was introduced in https://github.com/prometheus/prometheus/pull/13554 The same motivation for adding the metric applies: To avoid silent SD failures, as existing logs may not be regularly checked and can be missed. Signed-off-by: machine424 <ayoubmrini424@gmail.com> Co-authored-by: Simon Pasquier <spasquie@redhat.com>	2024-06-19 17:51:56 +02:00
Pranshu Srivastava	87427682fd	bugfix: allow opting-out of multi-cluster setups Allow users to opt-out of the multi-cluster setup for Prometheus dashboard, in environments where it isn't applicable. Refer: https://github.com/prometheus/prometheus/pull/13180. Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>	2024-05-07 23:46:10 +05:30
Will Bollock	839b9e5b53	fix: PrometheusNotIngestingSamples label matching This alert will never return anything as the left side of the query has the labels `[component, environment, instance, job, type]` while the right side has `[component, environment, instance, job]`. The `type` label was added to `prometheus_tsdb_head_samples_appended_total` in this PR but the mixin wasn't updated for the new label: https://github.com/prometheus/prometheus/pull/11395 This was found with [pint](https://github.com/cloudflare/pint) PromQL linting Signed-off-by: Will Bollock <wbollock@linode.com>	2024-01-31 09:08:36 -05:00
Erik Sommer	d09d77b62a	included instance in all necessary descriptions Signed-off-by: Erik Sommer <ersotech@posteo.de>	2024-01-10 19:40:16 +01:00
Erik Sommer	0e585bf5c0	add cluster variable to Overview dashboard Signed-off-by: Erik Sommer <ersotech@posteo.de>	2023-11-23 17:04:57 +01:00
Julien Pivotto	7a07a279c9	Merge pull request #10721 from ncauchois/fix_prometheus_remote_write_dashboard mixin: Use url filter on Remote Write dashboard	2023-11-03 15:49:42 -04:00
Julien Pivotto	95606830fd	Merge pull request #11498 from paulfantom/selector documentation/mixin: use prometheus metrics for dashboard variables	2023-07-11 13:36:00 +02:00
Leo Q	4268feb9d7	add alert for sd refresh failure (#12410) * add alert for sd refresh failure Due to config error or sd service down, prometheus may fail to refresh sd resource, which may lead to scrape fail or irrelevant metrics. Signed-off-by: Leo Q <LeoQuote@users.noreply.github.com> * apply suggestions Signed-off-by: Leo Q <LeoQuote@users.noreply.github.com> --------- Signed-off-by: Leo Q <LeoQuote@users.noreply.github.com>	2023-06-07 14:28:13 +02:00
Paweł Krupa (paulfantom)	b6caa6cabf	documentation/mixin: use prometheus metrics for dashboard variables Signed-off-by: Paweł Krupa (paulfantom) <pawel@krupa.net.pl>	2022-11-30 14:02:58 +01:00
Arunprasad Rajkumar	400d50eb7e	Add unit for uptime column in Prometheus stats dashboard Prior to this fix uptime column interpreted as number and the higher values are suffixed with raw units like `K`. This commit adds unit for the column as `second` to make visual interpretation easy. Signed-off-by: Arunprasad Rajkumar <ar.arunprasad@gmail.com>	2022-11-08 19:09:14 +05:30
Simon Pasquier	ec7929aaa8	documentation/prometheus-mixin: fix comment typo Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2022-09-09 17:03:53 +02:00
Iain Lane	e5cd5a33d0	PrometheusHighQueryLoad alert: use configured selector Currently we're hardcoding `job="prometheus-k8s"` as selector. This doesn't work if your prometheus is elsewhere. Fortunately we have `prometheusSelector` in `$._config` which all the other alerts use. Use that here too. Signed-off-by: Iain Lane <iain@orangesquash.org.uk>	2022-07-15 10:04:32 +01:00
Haoyu Sun	26a7f80aa1	add alert PrometheusHighQueryLoad. Signed-off-by: Haoyu Sun <hasun@redhat.com>	2022-07-13 14:08:24 +02:00
Nolwenn Cauchois	ff3d4e91dc	mixin: Use url filter on Remote Write dashboard Signed-off-by: Nolwenn Cauchois <nolwenn.cauchois@orange.com>	2022-05-20 15:04:55 +02:00
fpetkovski	501a8a7865	Address code review comments Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>	2022-03-30 09:35:08 +02:00
fpetkovski	877320784b	Add alert in mixin for exceeded sample limit This commit adds an alert in the prometheus mixin which triggers when Prometheus has failed scrapes that have exceeded the configured sample_limit for that job. Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>	2022-03-30 09:31:35 +02:00
Haoyu Sun	3c903af474	Add Alert PrometheusScrapeBodySizeLimitHit Signed-off-by: Haoyu Sun <hasun@redhat.com>	2022-03-22 15:13:00 +01:00
Björn Rabenstein	2234798f60	Merge pull request #9700 from nikosmeds/nikosmeds/hagroupcrashlooping-mixin-60m Increase time range for PrometheusHAGroupCrashlooping alert	2021-11-19 12:53:55 +01:00
Niko Smeds	53ca693f9e	Be specific Signed-off-by: Niko Smeds <nikosmeds@gmail.com>	2021-11-18 11:28:38 -08:00
Niko Smeds	0bc2cbdd7d	Leave time range for clean restarts as-is Signed-off-by: Niko Smeds <nikosmeds@gmail.com>	2021-11-17 15:14:26 -08:00
Fatih Sarhan	bc89e9e494	mixin: Reorder template variables on Remote Write dashboard Signed-off-by: f9n <f9n@protonmail.com>	2021-11-12 14:38:05 +03:00
Niko Smeds	fdcd423dfe	Increase time range for PrometheusHAGroupCrashlooping alert Signed-off-by: Niko Smeds <nikosmeds@gmail.com>	2021-11-08 15:06:42 -08:00
SuperQ	3cd2c033e2	Use Go 1.16+ install for mixin tests Use new `go install` syntax to fetch tools. Signed-off-by: SuperQ <superq@gmail.com>	2021-10-23 22:52:16 +02:00
Julien Pivotto	d5676fb9e0	Merge pull request #9254 from prometheus/superq/go1.17 Build with Go 1.17 / npm 7 / node 16	2021-08-28 18:36:42 +02:00
Frederic Hemberger	16b8911b1a	docs: Replace `go get` with `go install` for command installation (#9098) `go get` is deprecated for installation of commands as of go v1.17 Ref: https://go.googlesource.com/go/+/ced0fdbad0655d63d535390b1a7126fd1fef8348 Signed-off-by: Frederic Hemberger <mail@frederic-hemberger.de>	2021-08-27 11:08:21 +02:00
SuperQ	e167a45c65	Add new Go build tags. Add new go:build comments based on 1.17 formatting[0]. [0]: https://golang.org/doc/go1.17#gofmt Signed-off-by: SuperQ <superq@gmail.com>	2021-08-27 10:24:14 +02:00
Philip Gough	751ca03fad	mixin: Filter instance by job for Prometheus overview dashboard Signed-off-by: Philip Gough <philip.p.gough@gmail.com>	2021-07-28 14:34:26 +01:00
Julien Duchesne	8855c2e626	Add `prometheus_tsdb_clean_start` metric (#8824) Add cleanup of the lockfile when the db is cleanly closed The metric describes the status of the lockfile on startup 0: Already existed 1: Did not exist -1: Disabled Therefore, if the min value over time of this metric is 0, that means that executions have exited uncleanly We can then use that metric to have a much lower threshold on the crashlooping alert: If the metric exists and it has been zero, two restarts is enough to trigger the alarm If it does not exist (old prom version for example), the current five restarts threshold remains Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com> * Change metric name + set unset value to -1 Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com> * Only check the last value of the clean start alert Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com> * Fix test + nit Signed-off-by: Julien Duchesne <julien.duchesne@grafana.com>	2021-06-16 15:03:02 +05:30
hanjm	1df05bfd49	Add body_size_limit to prevent bad targets response large body cause Prometheus server OOM (#8827) Signed-off-by: hanjm <hanjinming@outlook.com>	2021-05-29 07:05:42 +08:00
Levi Harrison	2826fbeeb7	SD: Add target creation failure counter and change failure handling (#8786) * Added metric and changed failure/drop strategy Signed-off-by: Levi Harrison <git@leviharrison.dev>	2021-05-28 23:50:59 +02:00
Damien Grisonnet	b50f9c1c84	Add label scrape limits (#8777) * scrape: add label limits per scrape Add three new limits to the scrape configuration to provide some mechanism to defend against unbound number of labels and excessive label lengths. If any of these limits are broken by a sample from a scrape, the whole scrape will fail. For all of these configuration options, a zero value means no limit. The `label_limit` configuration will provide a mechanism to bound the number of labels per-scrape of a certain sample to a user defined limit. This limit will be tested against the sample labels plus the discovery labels, but it will exclude the __name__ from the count since it is a mandatory Prometheus label to which applying constraints isn't meaningful. The `label_name_length_limit` and `label_value_length_limit` will prevent having labels of excessive lengths. These limits also skip the __name__ label for the same reasons as the `label_limit` option and will also make the scrape fail if any sample has a label name/value length that exceed the predefined limits. Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com> * scrape: add metrics and alert to label limits Add three gauge, one for each label limit to easily access the limit set by a certain scrape target. Also add a counter to count the number of targets that exceeded the label limits and thus were dropped. This is useful for the `PrometheusLabelLimitHit` alert that will notify the users that scraping some targets failed because they had samples exceeding the label limits defined in the scrape configuration. Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com> * scrape: apply label limits to __name__ label Apply limits to the __name__ label that was previously skipped and truncate the label names and values in the error messages as they can be very very long. Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com> * scrape: remove label limits gauges and refactor Remove `prometheus_target_scrape_pool_label_limit`, `prometheus_target_scrape_pool_label_name_length_limit`, and `prometheus_target_scrape_pool_label_value_length_limit` as they are not really useful since we don't have the information on the labels in it. Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>	2021-05-06 09:56:21 +01:00
ravilr	adc8807851	Update remote-write alert rules mixin (#8423) Signed-off-by: ravilr <raviprasad_lr@yahoo.com>	2021-01-31 20:07:49 +00:00
Frederic Branczyk	62bc755733	mixin: Scope grafana config In its current form this configuration clashes in one of the most widely used configurations (kube-prometheus). This patch scopes the configuration to prevent this. Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>	2020-12-30 17:50:34 +01:00
Nicolas Lamirault	aa1ca13025	Add: Custom tags and prefix in Prometheus Mixin (#8287) * Add: custom tags and prefix Signed-off-by: Nicolas Lamirault <nicolas.lamirault@gmail.com> * Fix: fmt Signed-off-by: Nicolas Lamirault <nicolas.lamirault@gmail.com>	2020-12-16 18:49:06 +01:00
Björn Rabenstein	511511324a	Merge pull request #8235 from Allex1/master Update remote-write grafana mixin	2020-12-08 14:50:47 +01:00
beorn7	553f904f2d	mixin: Add a capability to exclude non-prod AM instances Signed-off-by: beorn7 <beorn@grafana.com>	2020-12-03 20:59:53 +01:00
birca	3ec4161575	Update remote-write grafana mixin Signed-off-by: birca <birca@adobe.com>	2020-12-02 09:50:15 +02:00
beorn7	638e99c814	prometheus-mixin: Make PrometheusRemoteWriteBehind more generic Currently, it relies on `job, instance` being the labels completely identifying a Prometheus instance. However, what's intended is to simply not match on `remote_name, url`. Signed-off-by: beorn7 <beorn@grafana.com>	2020-11-17 13:29:49 +01:00
beorn7	371ca9ff46	prometheus-mixin: add HA-group aware alerts There is certainly a potential to add more of these. This is mostly meant to introduce the concept and cover a few critical parts. Signed-off-by: beorn7 <beorn@grafana.com>	2020-11-11 19:45:34 +01:00
Matthias Loibl	13ba013a24	Use absolute jsonnet import paths This should be the way forward when importing libraries in jsonnet. It's closer to how Go imports look and makes it more obvious where packages live. This is not breaking anything, as the old imports were already symlinks to the now directly used directories. Signed-off-by: Matthias Loibl <mail@matthiasloibl.com>	2020-10-20 11:42:30 +02:00
Björn Rabenstein	d49f267f76	Merge pull request #8054 from simonpasquier/improve-not-ingesting-samples-alert documentation/prometheus-mixin: improve PrometheusNotIngestingSamples	2020-10-15 12:29:39 +02:00
Simon Pasquier	f381d8a9bd	documentation/prometheus-mixin: improve PrometheusNotIngestingSamples The alert shouldn't fire when there's no target and no rule configured. Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2020-10-15 11:13:17 +02:00
Julien Pivotto	4596abee4d	Mixin: Ignore unset remote write timestamp (#8046) * Mixin: Ignore unset remote write timestamp This pull request ignores the zero value of highest_sent_timestamp_seconds in Highest Timestamp In vs. Highest Timestamp Sent which just show that remote write has not been successful yet. Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>	2020-10-15 09:15:59 +02:00
Simon Pasquier	e693af6c01	.circleci/config.yml: check mixins (#6895) * .circleci/config.yml: check mixins Signed-off-by: Simon Pasquier <spasquie@redhat.com> * Run jsonnetfmt Signed-off-by: Simon Pasquier <spasquie@redhat.com> * Install tools in the image instead of using coreos/jsonnet-ci The latter is deprecated Signed-off-by: Simon Pasquier <spasquie@redhat.com> * Update jsonnetfile.json Signed-off-by: Simon Pasquier <spasquie@redhat.com>	2020-08-25 15:59:41 +02:00
Julien Pivotto	f482c7bdd7	Add per scrape-config targets limit (#7554) * Add per scrape-config targets limit Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>	2020-07-30 14:20:24 +02:00
Tom Wilkie	27b1009acd	Rename the dashboard in the mixin to 'Prometheus Overview'. (#7489) Due to https://github.com/grafana/grafana/issues/15642, this prevents users putting this dashboard in a Grafana folder called 'Prometheus'. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>	2020-06-30 15:45:44 +01:00
Manuel Fontan	6e7554639b	Update Readme since jsonnetfmt is available in the jsonnet go implementation since v0.16.0 Signed-off-by: Manuel Fontan <mfontangarcia@slack-corp.com>	2020-06-16 10:41:58 +01:00
Callum Styan	5400e71b91	Update mixin dashboards and alerts for new remote write label names. Signed-off-by: Callum Styan <callumstyan@gmail.com>	2020-04-08 12:56:00 -07:00
Marco Pracucci	1e1785690a	Fix queue in alerts annotation Signed-off-by: Marco Pracucci <marco@pracucci.com>	2020-02-12 12:48:13 +01:00


89 Commits (5352e48dddffdb995572503c32f739ba9e885180)