prometheus

Commit Graph

Author	SHA1	Message	Date
Tobias Schmidt	eaf33759fb	Register forgotten prometheus_evaluator_iterations_total metric	8 years ago
Tobias Schmidt	aaaba57184	Export number of missed rule evaluations In case the execution of all rules takes longer than the configured rule evaluation interval, one or more iterations will be skipped. This needs to be visible to the operator.	8 years ago
Fabian Reinartz	e68a3cf21f	rules: update annotations on each iteration	8 years ago
Jonathan Lange	d78dd3593d	Set evaluation interval on Group construction Prevents having object in invalid state, and allows users of public API to construct valid Groups.	8 years ago
Jonathan Lange	31fc357cd8	Make NewGroup and Group.Eval public Allows callers to evaluate lists of rules without first writing them to disk.	8 years ago
Jonathan Lange	2a2da40223	Make rule evaluation publicly available Means that a third-party can parse rules and run them with their own execution model.	8 years ago
Matt Bostock	926a5ab3dd	rules/manager.go: Fix race between reload and stop On one relatively large Prometheus instance (1.7M series), I noticed that upgrades were frequently resulting in Prometheus undergoing crash recovery on start-up. On closer examination, I found that Prometheus was panicking on shutdown. It seems that our configuration management (or misconfiguration thereof) is reloading Prometheus then immediately restarting it, which I suspect is causing this race: Sep 21 15:12:42 host systemd[1]: Reloading prometheus monitoring system. Sep 21 15:12:42 host prometheus[18734]: time="2016-09-21T15:12:42Z" level=info msg="Loading configuration file /etc/prometheus/config.yaml" source="main.go:221" Sep 21 15:12:42 host systemd[1]: Reloaded prometheus monitoring system. Sep 21 15:12:44 host systemd[1]: Stopping prometheus monitoring system... Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=warning msg="Received SIGTERM, exiting gracefully..." source="main.go:203" Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=info msg="See you next time!" source="main.go:210" Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=info msg="Stopping target manager..." source="targetmanager.go:90" Sep 21 15:12:52 host prometheus[18734]: time="2016-09-21T15:12:52Z" level=info msg="Checkpointing in-memory metrics and chunks..." 
source="persistence.go:548" Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=warning msg="Error on ingesting out-of-order samples" numDropped=1 source="scrape.go:467" Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=error msg="Error adding file watch for \"/etc/prometheus/targets\": no such file or directory" source="file.go:84" Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=error msg="Error adding file watch for \"/etc/prometheus/targets\": no such file or directory" source="file.go:84" Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping rule manager..." source="manager.go:366" Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Rule manager stopped." source="manager.go:372" Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping notification handler..." source="notifier.go:325" Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping local storage..." source="storage.go:381" Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping maintenance loop..." 
source="storage.go:383" Sep 21 15:13:01 host prometheus[18734]: panic: close of closed channel Sep 21 15:13:01 host prometheus[18734]: goroutine 7686074 [running]: Sep 21 15:13:01 host prometheus[18734]: panic(0xba57a0, 0xc60c42b500) Sep 21 15:13:01 host prometheus[18734]: /usr/local/go/src/runtime/panic.go:500 +0x1a1 Sep 21 15:13:01 host prometheus[18734]: github.com/prometheus/prometheus/rules.(*Manager).ApplyConfig.func1(0xc6645a9901, 0xc420271ef0, 0xc420338ed0, 0xc60c42b4f0, 0xc6645a9900) Sep 21 15:13:01 host prometheus[18734]: /home/build/packages/prometheus/tmp/build/gopath/src/github.com/prometheus/prometheus/rules/manager.go:412 +0x3c Sep 21 15:13:01 host prometheus[18734]: created by github.com/prometheus/prometheus/rules.(*Manager).ApplyConfig Sep 21 15:13:01 host prometheus[18734]: /home/build/packages/prometheus/tmp/build/gopath/src/github.com/prometheus/prometheus/rules/manager.go:423 +0x56b Sep 21 15:13:03 host systemd[1]: prometheus.service: main process exited, code=exited, status=2/INVALIDARGUMENT	8 years ago
Julius Volz	c187308366	storage: Contextify storage interfaces. This is based on https://github.com/prometheus/prometheus/pull/1997. This adds contexts to the relevant Storage methods and already passes PromQL's new per-query context into the storage's query methods. The immediate motivation is supporting multi-tenancy in Frankenstein, but this could also be used by Prometheus's normal local storage to support cancellations and timeouts at some point.	8 years ago
Julius Volz	ed5a0f0abe	promql: Allow per-query contexts. For Weaveworks' Frankenstein, we need to support multitenancy. In Frankenstein, we initially solved this without modifying the promql package at all: we constructed a new promql.Engine for every query and injected a storage implementation into that engine which would be primed to only collect data for a given user. This is problematic to upstream, however. Prometheus assumes that there is only one engine: the query concurrency gate is part of the engine, and the engine contains one central cancellable context to shut down all queries. Also, creating a new engine for every query seems like overkill. Thus, we want to be able to pass per-query contexts into a single engine. This change gets rid of the promql.Engine's built-in base context and allows passing in a per-query context instead. Central cancellation of all queries is still possible by deriving all passed-in contexts from one central one, but this is now the responsibility of the caller. The central query context is now created in main() and passed into the relevant components (web handler / API, rule manager). In a next step, the per-query context would have to be passed to the storage implementation, so that the storage can implement multi-tenancy or other features based on the contextual information.	8 years ago
beorn7	75bae065fd	Revert "Modify tests to adjust to reverting the /graph changes" This reverts commit `f1ea5bf232`. Part two necessary for reverting the /graph revert.	8 years ago
beorn7	f1ea5bf232	Modify tests to adjust to reverting the /graph changes These tests have been added after the /graph changes and therefore already test the new syntax. This commit has to be reverted together with the previous one to get back to the old new state. sigh	8 years ago
Julius Volz	fe7b8b7fd1	Add missing license header to alerting_test.go	8 years ago
Julius Volz	da7206ec29	Fix rule HTML escaping issues This was mentioned as part of https://github.com/prometheus/alertmanager/issues/452	8 years ago
Brian Brazil	6fc88d4b4d	Remove __name__ from alerts sent to AM. Fixes #1861	8 years ago
Dmitry Vorobev	273e457da4	web: return status code and error message for config resource	8 years ago
Brian Brazil	0509b0f2db	Expand alert templates at eval time. Fixes #1678 #1677	8 years ago
beorn7	064b57858e	Consistently use the `Seconds()` method for conversion of durations This also fixes one remaining case of recording integral numbers of seconds only for a metric, i.e. this will probably fix #1796.	9 years ago
beorn7	b95c096a45	Fix style issues in rules/...	9 years ago
beorn7	45e5775f9b	Add missing logging of out-of-order samples So far, out-of-order samples during rule evaluation were not logged, and neither were scrape health samples. The latter are unlikely to cause any errors. That's why I'm logging them always now. (It's always highly irregular should it happen.) For rules, I have used the same plumbing as for samples, just with a different wording in the message to mark them as a result of rule evaluation.	9 years ago
beorn7	4b574e8a61	Switch chunk encoding to type 2 where it was hardcoded type 1 before The chunk encoding was hardcoded there because it mostly doesn't matter what encoding is chosen in that test. Since type 1 is battle-hardened enough, I'm switching to type 2 here so that we can catch unexpected problems as a byproduct. My expectation is that the chunk encoding doesn't matter anyway, as said, but then "unexpected problems" contains the word "unexpected".	9 years ago
Fabian Reinartz	d89c254849	Make copying alerting state safer. This considers static labels in the equality of alerts to avoid falsely copying state from a different alert definition with the same name across reloads. To be safe, it also copies the state map rather than just its pointer so that remaining collisions disappear after one evaluation interval.	9 years ago
Fabian Reinartz	bfa8aaa017	Rename notification to notifier	9 years ago
beorn7	663a1550d0	Fix the instrumentation fixes	9 years ago
Tobias Schmidt	f1f8317fa5	Fix detection of flapping alerts Alerts in the resolve retention period must be transitioned to the active state again when their condition is met.	9 years ago
beorn7	ec08c9a391	Rework the way to communicate backpressure (AKA suspended ingestion) This gives up on the idea to communicate through the Append() call (by either not returning as it is now or returning an error as suggested/explored elsewhere). Here I have added a Throttled() call, which has the advantage that it can be called before a whole _batch_ of Append()'s. Scrapes will happen completely or not at all. Same for rule group evaluations. That's a highly desired behavior (as discussed elsewhere). The code is even simpler now as the whole ingestion buffer could be removed. Logging of throttled mode has been streamlined and will create at most one message per minute.	9 years ago
beorn7	a7408bfb47	Unify duration parsing It's actually happening in several places (and for flags, we use the standard Go time.Duration...). This at least reduces all our home-grown parsing to one place (in model).	9 years ago
Fabian Reinartz	a6935024e1	Remove old WITH clause in alert printing	9 years ago
Fabian Reinartz	b0adfea8d5	Fix swapped constants, improve instrumentation	9 years ago
Fabian Reinartz	a8c38c3ac5	Don't log rule evaluation failure on shutdown	9 years ago
Fabian Reinartz	6eee86dce8	Terminate rule groups during initial sleep When an evaluation group runs initially, it waits a deterministic amount of time. During that time it also has to accept a termination signal so shutdown doesn't hang during the first evaluation iteration after a configuration reload. Fixes #1307	9 years ago
Fabian Reinartz	26eb3ac2f8	Don't skip recording rule errors	9 years ago
Fabian Reinartz	37d80c4b25	Fix premature rule evaluation This commit prevents rule evaluation from starting until after the storage is ready.	9 years ago
Fabian Reinartz	0cf3c6a9ef	Add comments, rename a method	9 years ago
Fabian Reinartz	bf6abac8f4	Send resolved notifications	9 years ago
Fabian Reinartz	f69e668fc4	Improve rules/ instrumentation This commit adds a counter for the total number of rule evaluations and standardizes the units to seconds.	9 years ago
Fabian Reinartz	62075aa037	Reduce noisy no-alertmanager warning	9 years ago
Fabian Reinartz	52e5224f5a	Refactor rules/ package	9 years ago
Fabian Reinartz	e4fabe135a	Set StartsAt to time of first firing state	9 years ago
Fabian Reinartz	7c90db22ed	Use annotation based alerts in rules/ This commit breaks the previously used alert format.	9 years ago
Fabian Reinartz	e114ce0ff7	Refactor notification handler	9 years ago
Fabian Reinartz	e3b6ec9784	Switch to common/log	9 years ago
Fabian Reinartz	171f50706a	Fix unkeyed field errors.	9 years ago
Brian Brazil	3bcdb2bbba	rules: Allow for setting labels on LHS on scalars	9 years ago
Julius Volz	995d3b831d	Fix most golint warnings. This is with `golint -min_confidence=0.5`. I left several lint warnings untouched because they were either incorrect or I felt it was better not to change them at the moment.	9 years ago
Fabian Reinartz	d6b8da8d43	Switch promql types to common/model	9 years ago
Brian Brazil	fdf0d0642e	Cast value to float, as that's what the console templates expect.	9 years ago
Fabian Reinartz	438e232c9b	Fix grouping of import blocks	9 years ago
Fabian Reinartz	306e8468a0	Switch from client_golang/model to common/model	9 years ago
Brian Brazil	e6a67476c2	rules: Allow recorded rules expressions to be scalars. This is useful if you want to build up a constant metric, such as a set of alert thresholds that vary by label value.	9 years ago
Fabian Reinartz	7a67472fc1	Resolve relative paths on configuration loading This moves the concern of resolving the files relative to the config file into the configuration loading itself. It also fixes #921 which did not load the cert and token files relatively.	9 years ago


281 Commits (8097a3c52336de37d07822940b0d7fc4b7ef4e86)