prometheus

Commit Graph

Author	SHA1	Message	Date
gotjosh	4daaa59c08	Rule Manager: Only query once per alert rule when restoring alert state Prometheus restores alert state between restarts and updates. For each rule, it looks at the alerts that are meant to be active and then queries the `ALERTS_FOR_STATE` series for _each_ alert within the rules. If the alert rule has 120 instances (or series) it'll execute the same query with slightly different labels. This PR changes the approach so that we only query once per alert rule and then match the corresponding alert that we're about to restore against the series-set. While the approach might use a bit more memory at start-up (if even?) the restore process is only run once per restart so I'd consider this a big win. This builds on top of #13974 Signed-off-by: gotjosh <josue.abreu@gmail.com>	2024-04-24 18:46:05 +01:00
gotjosh	5beb2fe005	Improve the metric description Signed-off-by: gotjosh <josue.abreu@gmail.com>	2024-04-24 15:24:35 +01:00
gotjosh	381a77ac1e	Change variable name to `restoreStartTime` from `now` and introduce a log line to record total time Signed-off-by: gotjosh <josue.abreu@gmail.com>	2024-04-24 14:21:11 +01:00
György Krajcsovits	bcafa5f1f9	Merge remote-tracking branch 'upstream/main' into update-nhcb	2024-04-24 11:06:59 +02:00
gotjosh	e7219e3d36	Rule Manager: Add `rule_group_last_restore_duration_seconds` to measure restore time per rule group When a rule group changes or Prometheus is restarted we need to ensure we restore the active alerts that were firing for a corresponding rule; for that, Prometheus uses the `ALERTS_FOR_STATE` series to query the previous state and restore it. If a given rule has high cardinality (think 100s of 1000s of series) this process can take a bit of time - this is the first of a series of PRs to improve this problem and I'd like to start with exposing the time it takes to restore a rule group as a gauge. Signed-off-by: gotjosh <josue.abreu@gmail.com>	2024-04-23 09:57:08 +01:00
Björn Rabenstein	4ec5c25393	Merge pull request #13731 from suntala/suntala/native-histogram-template histograms: support expansion of native histogram values in templating	2024-04-11 13:24:26 +02:00
Matthieu MOREL	6f595c6762	golangci-lint: enable whitespace linter (#13905 ) Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>	2024-04-11 09:27:54 +01:00
suntala	44f385fd51	Support expansion of native histogram values in alert templates Co-authored-by: Aleks Fazlieva <britishrum@users.noreply.github.com> Signed-off-by: suntala <arati.rana@grafana.com>	2024-03-26 22:30:01 +01:00
György Krajcsovits	a3d1a46eda	Merge branch 'main' into nhcb	2024-03-22 14:51:48 +01:00
Łukasz Mierzwa	3bb27c33e9	Use consistent keys for logs Rule warnings are logged with numDropped=N while every other component uses num_dropped=N: ``` notifier/notifier.go: level.Warn(n.logger).Log("msg", "Alert batch larger than queue capacity, dropping alerts", "num_dropped", d) notifier/notifier.go: level.Warn(n.logger).Log("msg", "Alert notification queue full, dropping alerts", "num_dropped", d) storage/remote/write_handler.go: _ = level.Warn(h.logger).Log("msg", "Error on ingesting out-of-order exemplars", "num_dropped", outOfOrderExemplarErrs) rules/group.go: level.Warn(logger).Log("msg", "Error on ingesting out-of-order result from rule evaluation", "num_dropped", numOutOfOrder) rules/group.go: level.Warn(logger).Log("msg", "Error on ingesting too old result from rule evaluation", "num_dropped", numTooOld) rules/group.go: level.Warn(logger).Log("msg", "Error on ingesting results from rule evaluation with different value but same timestamp", "num_dropped", numDuplicates) scrape/scrape.go: level.Warn(sl.l).Log("msg", "Error on ingesting out-of-order samples", "num_dropped", appErrs.numOutOfOrder) scrape/scrape.go: level.Warn(sl.l).Log("msg", "Error on ingesting samples with different value but same timestamp", "num_dropped", appErrs.numDuplicates) scrape/scrape.go: level.Warn(sl.l).Log("msg", "Error on ingesting samples that are too old or are too far into the future", "num_dropped", appErrs.numOutOfBounds) scrape/scrape.go: level.Warn(sl.l).Log("msg", "Error on ingesting out-of-order exemplars", "num_dropped", appErrs.numExemplarOutOfOrder) ``` Rename numDropped to num_dropped for consistency. Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>	2024-03-21 15:59:20 +00:00
Charles Korn	4e77e8e5ef	Allow using alternative PromQL engines for rule evaluation Signed-off-by: Charles Korn <charles.korn@grafana.com>	2024-03-06 14:54:33 +11:00
machine424	f477e0539a	Move from golang.org/x/exp/slices into slices now that we only support Go >= 1.21 Prevent adding back golang.org/x/exp/slices. Signed-off-by: machine424 <ayoubmrini424@gmail.com>	2024-02-28 14:54:53 +01:00
György Krajcsovits	5d0a0a7542	Add custom buckets to native histogram model (#13592 ) * add custom buckets to native histogram model * simple copy for custom bounds * return errors for unsupported add/sub operations * add test cases for string and update appendhistogram in scrape to account for new schema * check fields which are supposed to be unused but may affect results in equals * allow appending custom buckets histograms regardless of max schema Signed-off-by: Jeanette Tan <jeanette.tan@grafana.com>	2024-02-28 14:06:43 +01:00
Bryan Boreham	3716326f3f	rules: call NewScratchBuilder Need to initialize ScratchBuilder with a SymbolTable. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2024-02-26 11:45:25 +00:00
Bryan Boreham	c0e36e6bb3	Standardise exemplar label as "trace_id" This is consistent with the OpenTelemetry standard, and an example in OpenMetrics. https://github.com/open-telemetry/opentelemetry-specification/blob/89aa01348139/specification/metrics/data-model.md#exemplars https://github.com/OpenObservability/OpenMetrics/blob/138654493130/specification/OpenMetrics.md#exemplars-1 Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2024-02-15 14:20:08 +00:00
Bryan Boreham	17f48f2b3b	Tests: use replacement DeepEquals in more places Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2024-02-08 19:32:33 +00:00
Bryan Boreham	39af788dbd	Tests: use replacement DeepEquals using go-cmp Use DeepEqual replacement using go-cmp, which is more flexible. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2024-02-08 19:30:20 +00:00
Marco Pracucci	5ee3fbe825	Decouple ruler dependency controller from concurrency controller Signed-off-by: Marco Pracucci <marco@pracucci.com>	2024-02-02 10:06:37 +01:00
Marco Pracucci	cbbbd6e70a	Remove superfluous nil check in Group.metrics Signed-off-by: Marco Pracucci <marco@pracucci.com>	2024-01-29 10:21:57 +01:00
Marco Pracucci	046cd7599f	Introduced sequentialRuleEvalController Signed-off-by: Marco Pracucci <marco@pracucci.com>	2024-01-29 10:19:18 +01:00
Marco Pracucci	23f89c18b2	Improved RuleConcurrencyController interface doc Signed-off-by: Marco Pracucci <marco@pracucci.com>	2024-01-29 10:18:29 +01:00
Marco Pracucci	2764c46531	Added more test cases to TestDependenciesEdgeCases Signed-off-by: Marco Pracucci <marco@pracucci.com>	2024-01-29 10:18:03 +01:00
Marco Pracucci	52bc568d04	Add more test cases to TestDependenciesEdgeCases Signed-off-by: Marco Pracucci <marco@pracucci.com>	2024-01-29 10:17:13 +01:00
Marco Pracucci	21a03dc018	Simplify the design to update concurrency controller once the rule evaluation has done Signed-off-by: Marco Pracucci <marco@pracucci.com>	2024-01-29 10:16:31 +01:00
Danny Kopping	7aa3b10c3f	Block until all rules, both sync & async, have completed evaluating Updated & added tests Review feedback nits Return empty map if not indeterminate Use highWatermark to track inflight requests counter Appease the linter Clarify feature flag Signed-off-by: Danny Kopping <danny.kopping@grafana.com>	2024-01-29 10:08:41 +01:00
Danny Kopping	f922534c4d	Refactoring for performance, and to allow controller to be overridden Signed-off-by: Danny Kopping <danny.kopping@grafana.com>	2024-01-29 10:08:41 +01:00
Danny Kopping	94cdfa30cd	Refactoring Signed-off-by: Danny Kopping <danny.kopping@grafana.com>	2024-01-29 10:08:41 +01:00
Danny Kopping	0dc7036db3	Optimising dependencies/dependents funcs to not produce new slices each request Signed-off-by: Danny Kopping <danny.kopping@grafana.com>	2024-01-29 10:08:41 +01:00
Danny Kopping	e7758d187e	Refactor concurrency control Signed-off-by: Danny Kopping <danny.kopping@grafana.com>	2024-01-29 10:08:39 +01:00
Danny Kopping	940f83a540	Implementation NOTE: Rebased from main after refactor in #13014 Signed-off-by: Danny Kopping <danny.kopping@grafana.com>	2024-01-29 10:07:15 +01:00
Filip Petkovski	583f3e587c	Optimize histogram iterators (#13340 ) Optimize histogram iterators Histogram iterators allocate new objects in the AtHistogram and AtFloatHistogram methods, which makes calculating rates over long ranges expensive. In #13215 we allowed an existing object to be reused when converting an integer histogram to a float histogram. This commit follows the same idea and allows injecting an existing object in the AtHistogram and AtFloatHistogram methods. When the injected value is nil, iterators allocate new histograms, otherwise they populate and return the injected object. The commit also adds a CopyTo method to Histogram and FloatHistogram which is used in the BufferedIterator to overwrite items in the ring instead of making new copies. Note that a specialized HPoint pool is needed for all of this to work (`matrixSelectorHPool`). --------- Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>	2024-01-23 17:02:14 +01:00
Filip Petkovski	10a82f87fd	Enable reusing memory when converting between histogram types The 'ToFloat' method on integer histograms currently allocates new memory each time it is called. This commit adds an optional *FloatHistogram parameter that can be used to reuse span and bucket slices. It is up to the caller to make sure the input float histogram is not used anymore after the call. Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>	2023-12-08 10:22:59 +01:00
Matthieu MOREL	9c4782f1cc	golangci-lint: enable testifylint linter (#13254 ) Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>	2023-12-07 11:35:01 +00:00
Björn Rabenstein	a43669e611	Merge pull request #12928 from alexandear/ci-enable-godot ci(lint): enable godot; append dot at the end of comments	2023-11-01 17:15:41 +01:00
Oleksandr Redko	fa90ca46e5	ci(lint): enable godot; append dot at the end of comments Signed-off-by: Oleksandr Redko <Oleksandr_Redko@epam.com>	2023-10-31 19:53:38 +02:00
Charles Korn	9a8dbf06bc	Address PR feedback Co-authored-by: Julien Pivotto <roidelapluie@o11y.eu> Signed-off-by: Charles Korn <charleskorn@users.noreply.github.com>	2023-10-31 09:56:05 +11:00
Charles Korn	667a1efb04	Add trace ID to log lines emitted during rule evaluation Signed-off-by: Charles Korn <charles.korn@grafana.com>	2023-10-26 16:14:54 +11:00
Charles Korn	fc132a4557	Use common logger instance to reduce duplication in `Group.Eval()` Signed-off-by: Charles Korn <charles.korn@grafana.com>	2023-10-26 16:14:12 +11:00
Danny Kopping	498b836654	Refactoring manager.go into separate concerns Signed-off-by: Danny Kopping <danny.kopping@grafana.com>	2023-10-21 11:11:11 +02:00
Goutham Veeramachaneni	86729d4d7b	Update exp package (#12650 )	2023-09-21 22:53:51 +02:00
Arve Knudsen	6daee89e5f	Add context argument to Querier.Select (#12660 ) Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>	2023-09-12 12:37:38 +02:00
Michael Hoffmann	4d8e380269	promql: allow tests to be imported (#12050 ) Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>	2023-08-18 20:48:59 +02:00
Julien Pivotto	782e6f64fb	Merge pull request #11295 from dimitarvdimitrov/dimitar/simplify-evalTimestamp Simplify rule group's EvalTimestamp formula	2023-07-18 13:21:20 +02:00
Bryan Boreham	5255bf06ad	Replace sort.Slice with faster slices.SortFunc The generic version is more efficient. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2023-07-02 22:17:08 +00:00
beorn7	5b53aa1108	style: Replace `else if` cascades with `switch` Wiser coders than myself have come to the conclusion that a `switch` statement is almost always superior to a statement that includes any `else if`. The exceptions that I have found in our codebase are just these two: * The `if else` is followed by an additional statement before the next condition (separated by a `;`). * The whole thing is within a `for` loop and `break` statements are used. In this case, using `switch` would require tagging the `for` loop, which probably tips the balance. Why are `switch` statements more readable? For one, fewer curly braces. But more importantly, the conditions all have the same alignment, so the whole thing follows the natural flow of going down a list of conditions. With `else if`, in contrast, all conditions but the first are "hidden" behind `} else if `, harder to spot and (for no good reason) presented differently from the first condition. I'm sure the aforementioned wise coders can list even more reasons. In any case, I like it so much that I have found myself recommending it in code reviews. I would like to make it a habit in our code base, without making it a hard requirement that we would test on the CI. But for that, there has to be a role model, so this commit eliminates all `if else` occurrences, unless it is autogenerated code or fits one of the exceptions above. Signed-off-by: beorn7 <beorn@grafana.com>	2023-04-19 17:22:31 +02:00
beorn7	c3c7d44d84	lint: Adjust to the lint warnings raised by current versions of golint-ci We haven't updated golint-ci in our CI yet, but this commit prepares for that. There are a lot of new warnings, and it is mostly because the "revive" linter got updated. I agree with most of the new warnings, mostly around not naming unused function parameters (although it is justified in some cases for documentation purposes – while things like mocks are a good example where not naming the parameter is clearer). I'm pretty upset about the "empty block" warning to include `for` loops. It's such a common pattern to do something in the head of the `for` loop and then have an empty block. There is still an open issue about this: https://github.com/mgechev/revive/issues/810 I have disabled "revive" altogether in files where empty blocks are used excessively, and I have made the effort to add individual `// nolint:revive` where empty blocks are used just once or twice. It's borderline noisy, though, but let's go with it for now. I should mention that none of the "empty block" warnings for `for` loop bodies were legitimate. Signed-off-by: beorn7 <beorn@grafana.com>	2023-04-19 17:10:10 +02:00
Ben Ye	fd3630b9a3	add ctx to QueryEngine interface Signed-off-by: Ben Ye <benye@amazon.com>	2023-04-17 21:32:38 -07:00
beorn7	c0879d64cf	promql: Separate `Point` into `FPoint` and `HPoint` In other words: Instead of having a “polymorphous” `Point` that can either contain a float value or a histogram value, use an `FPoint` for floats and an `HPoint` for histograms. This seemingly small change has a _lot_ of repercussions throughout the codebase. The idea here is to avoid the increase in size of `Point` arrays that happened after native histograms had been added. The higher-level data structures (`Sample`, `Series`, etc.) are still “polymorphous”. The same idea could be applied to them, but at each step the trade-offs needed to be evaluated. The idea with this change is to do the minimum necessary to get back to pre-histogram performance for functions that do not touch histograms. Here are comparisons for the `changes` function. The test data doesn't include histograms yet. Ideally, there would be no change in the benchmark result at all. First runtime v2.39 compared to directly prior to this commit: ``` name old time/op new time/op delta RangeQuery/expr=changes(a_one[1d]),steps=1-16 391µs ± 2% 542µs ± 1% +38.58% (p=0.000 n=9+8) RangeQuery/expr=changes(a_one[1d]),steps=10-16 452µs ± 2% 617µs ± 2% +36.48% (p=0.000 n=10+10) RangeQuery/expr=changes(a_one[1d]),steps=100-16 1.12ms ± 1% 1.36ms ± 2% +21.58% (p=0.000 n=8+10) RangeQuery/expr=changes(a_one[1d]),steps=1000-16 7.83ms ± 1% 8.94ms ± 1% +14.21% (p=0.000 n=10+10) RangeQuery/expr=changes(a_ten[1d]),steps=1-16 2.98ms ± 0% 3.30ms ± 1% +10.67% (p=0.000 n=9+10) RangeQuery/expr=changes(a_ten[1d]),steps=10-16 3.66ms ± 1% 4.10ms ± 1% +11.82% (p=0.000 n=10+10) RangeQuery/expr=changes(a_ten[1d]),steps=100-16 10.5ms ± 0% 11.8ms ± 1% +12.50% (p=0.000 n=8+10) RangeQuery/expr=changes(a_ten[1d]),steps=1000-16 77.6ms ± 1% 87.4ms ± 1% +12.63% (p=0.000 n=9+9) RangeQuery/expr=changes(a_hundred[1d]),steps=1-16 30.4ms ± 2% 32.8ms ± 1% +8.01% (p=0.000 n=10+10) RangeQuery/expr=changes(a_hundred[1d]),steps=10-16 37.1ms ± 2% 40.6ms ± 2% +9.64% (p=0.000 n=10+10) RangeQuery/expr=changes(a_hundred[1d]),steps=100-16 105ms ± 1% 117ms ± 1% +11.69% (p=0.000 n=10+10) RangeQuery/expr=changes(a_hundred[1d]),steps=1000-16 783ms ± 3% 876ms ± 1% +11.83% (p=0.000 n=9+10) ``` And then runtime v2.39 compared to after this commit: ``` name old time/op new time/op delta RangeQuery/expr=changes(a_one[1d]),steps=1-16 391µs ± 2% 547µs ± 1% +39.84% (p=0.000 n=9+8) RangeQuery/expr=changes(a_one[1d]),steps=10-16 452µs ± 2% 616µs ± 2% +36.15% (p=0.000 n=10+10) RangeQuery/expr=changes(a_one[1d]),steps=100-16 1.12ms ± 1% 1.26ms ± 1% +12.20% (p=0.000 n=8+10) RangeQuery/expr=changes(a_one[1d]),steps=1000-16 7.83ms ± 1% 7.95ms ± 1% +1.59% (p=0.000 n=10+8) RangeQuery/expr=changes(a_ten[1d]),steps=1-16 2.98ms ± 0% 3.38ms ± 2% +13.49% (p=0.000 n=9+10) RangeQuery/expr=changes(a_ten[1d]),steps=10-16 3.66ms ± 1% 4.02ms ± 1% +9.80% (p=0.000 n=10+9) RangeQuery/expr=changes(a_ten[1d]),steps=100-16 10.5ms ± 0% 10.8ms ± 1% +3.08% (p=0.000 n=8+10) RangeQuery/expr=changes(a_ten[1d]),steps=1000-16 77.6ms ± 1% 78.1ms ± 1% +0.58% (p=0.035 n=9+10) RangeQuery/expr=changes(a_hundred[1d]),steps=1-16 30.4ms ± 2% 33.5ms ± 4% +10.18% (p=0.000 n=10+10) RangeQuery/expr=changes(a_hundred[1d]),steps=10-16 37.1ms ± 2% 40.0ms ± 1% +7.98% (p=0.000 n=10+10) RangeQuery/expr=changes(a_hundred[1d]),steps=100-16 105ms ± 1% 107ms ± 1% +1.92% (p=0.000 n=10+10) RangeQuery/expr=changes(a_hundred[1d]),steps=1000-16 783ms ± 3% 775ms ± 1% -1.02% (p=0.019 n=9+9) ``` In summary, the runtime doesn't really improve with this change for queries with just a few steps. For queries with many steps, this commit essentially reinstates the old performance. This is good because the many-step queries are the one that matter most (longest absolute runtime). In terms of allocations, though, this commit doesn't make a dent at all (numbers not shown). The reason is that most of the allocations happen in the sampleRingIterator (in the storage package), which has to be addressed in a separate commit. Signed-off-by: beorn7 <beorn@grafana.com>	2023-04-13 19:25:16 +02:00
Soon-Ping	6cecb87941	Generalized rule group iteration evaluation hook (#11885 ) Signed-off-by: Soon-Ping Phang <soonping@amazon.com>	2023-04-04 20:21:13 +02:00
Bryan Boreham	b987afa7ef	labels: simplify call to get Labels from Builder It took a `Labels` where the memory could be re-used, but in practice this hardly ever benefitted. Especially after converting `relabel.Process` to `relabel.ProcessBuilder`. Comparing the parameter to `nil` was a bug; `EmptyLabels` is not `nil` so the slice was reallocated multiple times by `append`. Lastly `Builder.Labels()` now estimates that the final size will depend on labels added and deleted. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2023-03-22 17:05:20 +00:00
Björn Rabenstein	847093479b	Merge pull request #11978 from trevorwhitney/set-counter-hint Set `CounterResetHint` and use in recording rules	2023-03-14 21:52:41 +01:00
Trevor Whitney	c3e0a83725	rules: no longer force CounterResetHint to Gauge Signed-off-by: Trevor Whitney <trevorjwhitney@gmail.com>	2023-03-14 14:22:07 -06:00
Charles Korn	3db98d7dde	Avoid unnecessary allocations in recording rule evaluation (#11812 ) Re-use the Builder each time round the loop.	2023-03-08 12:57:19 +00:00
Bryan Boreham	3f7ba22bde	rules: two places need to call EmptyLabels Can't assume nil is a valid value. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2023-02-22 15:14:07 +00:00
Julien Pivotto	259bb5c692	Merge pull request #11826 from dannykopping/dannykopping/rule-eval Pass rule details in evaluation context	2023-02-14 21:38:19 +01:00
Justin Lei	af1d9e01c7	Refactor tsdbutil for tests/native histograms (#11948 ) * Add float histograms to ChunkFromSamplesGeneric Signed-off-by: Justin Lei <justin.lei@grafana.com> * Add GenerateSamples functions to tsdbutil Signed-off-by: Justin Lei <justin.lei@grafana.com> PR responses Signed-off-by: Justin Lei <justin.lei@grafana.com> --------- Signed-off-by: Justin Lei <justin.lei@grafana.com>	2023-02-10 17:09:33 +05:30
Danny Kopping	98c70e1817	Correcting NewAlertingRule args Signed-off-by: Danny Kopping <danny.kopping@grafana.com>	2023-01-26 13:21:50 +02:00
Danny Kopping	df078e0a84	Merge branch 'main' into dannykopping/rule-eval Signed-off-by: Danny Kopping <danny.kopping@grafana.com>	2023-01-26 13:10:18 +02:00
Julien Pivotto	e811d14963	Add comments Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>	2023-01-23 13:59:43 +01:00
Danny Kopping	c4ca791f18	Appeasing the linter Signed-off-by: Danny Kopping <danny.kopping@grafana.com>	2023-01-20 10:53:42 +02:00
Danny Kopping	6486d28c7a	Panic if rule type was not expected Signed-off-by: Danny Kopping <danny.kopping@grafana.com>	2023-01-20 10:27:50 +02:00
Julien Pivotto	c0724f4e62	New test Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>	2023-01-19 11:56:04 +01:00
Julien Pivotto	2c408289f8	Add stabilizing to UI Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>	2023-01-19 11:33:54 +01:00
Julien Pivotto	5ad74e6e71	Add tests Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>	2023-01-19 10:36:01 +01:00
Julien Pivotto	ce55e5074d	Add 'keep_firing_for' field to alerting rules This commit adds a new 'keep_firing_for' field to Prometheus alerting rules. The 'resolve_delay' field specifies the minimum amount of time that an alert should remain firing, even if the expression does not return any results. This feature was discussed at a previous dev summit, and it was determined that a feature like this would be useful in order to allow the expression time to stabilize and prevent confusing resolved messages from being propagated through Alertmanager. This approach is simpler than having two PromQL queries, as was sometimes discussed, and it should be easy to implement. This commit does not include tests for the 'resolve_delay' field. This is intentional, as the purpose of this commit is to gather comments on the proposed design of the 'resolve_delay' field before implementing tests. Once the design of the 'resolve_delay' field has been finalized, a follow-up commit will be submitted with tests." See https://github.com/prometheus/prometheus/issues/11570 Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>	2023-01-13 12:11:39 +01:00
Ganesh Vernekar	d82ea2eb1c	Merge pull request #11838 from codesome/histo-rec rules: Support native histograms	2023-01-12 12:35:15 +05:30
Ganesh Vernekar	98a0523e4a	rules: Test native histograms in recording rules Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>	2023-01-11 18:27:57 +05:30
Ganesh Vernekar	53a5071a72	rules: Support native histograms Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>	2023-01-10 19:07:24 +05:30
Danny Kopping	4d8478d9ac	Add license header to appease CI Signed-off-by: Danny Kopping <danny.kopping@grafana.com>	2023-01-09 11:05:56 +02:00
Danny Kopping	72527b5f12	Refactoring for simplicity Include labels Signed-off-by: Danny Kopping <danny.kopping@grafana.com>	2023-01-09 11:01:46 +02:00
Danny Kopping	d8f3e7d16c	gofumpt Signed-off-by: Danny Kopping <danny.kopping@grafana.com>	2023-01-09 11:01:25 +02:00
Danny Kopping	79300340af	Adding recording/alerting rule origin context This will allow correlation of executed rule queries with their associated rule names and type Signed-off-by: Danny Kopping <danny.kopping@grafana.com>	2023-01-09 11:01:24 +02:00
Ganesh Vernekar	f1a332c496	rules: Consider ErrTooOldSample in expected errors Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>	2023-01-05 14:49:30 +05:30
Bryan Boreham	cdbe7f462b	Update package rules for new labels.Labels type Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2022-12-19 15:22:09 +00:00
Bryan Boreham	3c7de69059	storage: allow re-use of iterators Patterned after `Chunk.Iterator()`: pass the old iterator in so it can be re-used to avoid allocating a new object. (This commit does not do any re-use; it is just changing all the method signatures so re-use is possible in later commits.) Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2022-12-15 18:32:45 +00:00
Julius Volz	1a2c645dfa	Correctly handle error unwrapping in rules and remote write receiver errors.Unwrap() actually dangerously returns nil if the error does not have an Unwrap() method, which is the case in at least one of these places where I noticed that no error was being logged at all when it should have. Signed-off-by: Julius Volz <julius.volz@gmail.com>	2022-12-15 12:50:55 +01:00
Dimitar Dimitrov	03ab8dcca0	Add comments on EvalTimestamp Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>	2022-10-12 14:16:22 +02:00
Ganesh Vernekar	648be89822	Merge remote-tracking branch 'upstream/main' into fix-conflict Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>	2022-10-12 14:20:02 +05:30
Ganesh Vernekar	46b26c4f09	Fix notifier relabel changing the labels of active alerts (#11427 ) Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>	2022-10-07 20:28:17 +05:30
Jesus Vazquez	e934d0f011	Merge 'main' into sparsehistogram Signed-off-by: Jesus Vazquez <jesus.vazquez@grafana.com>	2022-10-05 22:14:49 +02:00
Dimitar Dimitrov	3fb881af26	Simplify rule group's EvalTimestamp formula I found it hard to understand how EvalTimestamp works, so I wanted to simplify the math there. This PR should be a noop. Current formula is: ``` offset = g.hash % g.interval adjNow = startTime - offset base = adjNow - (adjNow % g.interval) EvalTimestamp = base + offset ``` I simplify `EvalTimestamp` ``` EvalTimestamp = base + offset # expand base = adjNow - (adjNow % g.interval) + offset # expand adjNow = startTime - offset - ((startTime - offset) % g.interval) + offset # cancel out offset = startTime - ((startTime - offset) % g.interval) # expand A+B (mod M) = (A (mod M) + B (mod M)) (mod M) = startTime - (startTime % g.interval - offset % g.interval) % g.interval # expand offset = startTime - (startTime % g.interval - ((g.hash % g.interval) % g.interval)) % g.interval # remove redundant mod g.interval = startTime - (startTime % g.interval - g.hash % g.interval) % g.interval # simplify (A (mod M) + B (mod M)) (mod M) = A+B (mod M) = startTime - (startTime - g.hash) % g.interval offset = (startTime - g.hash) % g.interval EvalTimestamp = startTime - offset ``` Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>	2022-09-13 10:52:32 +02:00
Bryan Boreham	8297f5cb6b	rules: in tests use labels.FromStrings And a number of `EmptyLabels()` instead of `nil`. Replacing code which assumes the internal structure of `Labels`. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2022-09-09 13:34:49 +02:00
Cosrider	bef6556ca5	delete redundant alias (#11180 ) Signed-off-by: Cosrider <cosrider7@gmail.com> Signed-off-by: Cosrider <cosrider7@gmail.com>	2022-08-31 15:50:38 +02:00
Bryan Boreham	8b863c42dd	Optimise relabeling by re-using memory (#11147 ) * model/relabel: Add benchmark Signed-off-by: Bryan Boreham <bjboreham@gmail.com> * model/relabel: re-use Builder across relabels Saves memory allocations. Signed-off-by: Bryan Boreham <bjboreham@gmail.com> * labels.Builder: allow re-use of result slice This reduces memory allocations where the caller has a suitable slice available. Signed-off-by: Bryan Boreham <bjboreham@gmail.com> * model/relabel: re-use source values slice To reduce memory allocations. Signed-off-by: Bryan Boreham <bjboreham@gmail.com> * Unwind one change causing test failures Restore original behaviour in PopulateLabels, where we must not overwrite the input set. Signed-off-by: Bryan Boreham <bjboreham@gmail.com> * relabel: simplify values optimisation Use a stack-based array for up to 16 source labels, which will be the vast majority of cases. Signed-off-by: Bryan Boreham <bjboreham@gmail.com> * lint Signed-off-by: Bryan Boreham <bjboreham@gmail.com> Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2022-08-19 15:27:52 +05:30
beorn7	c9fd3c235d	Merge branch 'main' into sparsehistogram	2022-08-10 17:54:37 +02:00
Jimmie Han	a5fea2cdd0	Use atomic field avoid (AlertingRule).mtx wait when template expanding (#10858 ) Use atomic field avoid (*AlertingRule).mtx wait when template expanding (#10703) Signed-off-by: hanjm <hanjinming@outlook.com>	2022-07-19 12:58:37 +02:00
beorn7	28f028e938	Merge branch 'main' into sparsehistogram	2022-07-12 19:07:13 +02:00
Matthieu MOREL	ddfa9a7cc5	refactor (rules): move from github.com/pkg/errors to 'errors' and 'fmt' (#10855 ) * refactor (rules): move from github.com/pkg/errors to 'errors' and 'fmt' Signed-off-by: Matthieu MOREL <mmorel-35@users.noreply.github.com>	2022-06-17 09:54:25 +02:00
beorn7	40ad5e284a	Merge branch 'main' into beorn7/sparsehistogram	2022-06-09 20:50:30 +02:00
Julien Pivotto	3a56817a30	Rules: set otel status to ERROR when a rule fails (#10745 ) Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>	2022-05-25 10:06:17 +02:00
Julien Pivotto	0d94cdf107	rules: remove classic UI code (#10730 ) Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>	2022-05-23 16:21:50 +02:00
Łukasz Mierzwa	d3c9c4f574	Stop rule manager before TSDB is stopped (#10680 ) During shutdown TSDB is stopped before rule manager is stopped. Since TSDB shutdown can take a long time (minutes or 10s of minutes) it keeps rule manager running while parts of Prometheus are already stopped (most notably scrape manager). This can cause false positive alerts to fire, mostly those that rely on absent() calls since new sample appends will stop while alert queries are still evaluated. Stop rules before stopping TSDB and scrape manager to avoid this problem. Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>	2022-05-20 23:26:06 +02:00
beorn7	3bc711e333	Merge branch 'main' into sparsehistogram	2022-05-04 13:37:13 +02:00
Matthieu MOREL	e2ede285a2	refactor: move from io/ioutil to io and os packages (#10528 ) * refactor: move from io/ioutil to io and os packages * use fs.DirEntry instead of os.FileInfo after os.ReadDir Signed-off-by: MOREL Matthieu <matthieu.morel@cnp.fr>	2022-04-27 11:24:36 +02:00
beorn7	7ee1836ef5	Merge branch 'main' into sparsehistogram	2022-04-05 18:31:19 +02:00
Wilbert Guo	83a2e52bc2	Add SyncForState Implementation for Ruler HA (#10070 ) * continuously syncing activeAt for alerts Signed-off-by: Yijie Qin <qinyijie@amazon.com> Signed-off-by: Wilbert Guo <wilbeguo@amazon.com> * add import Signed-off-by: Yijie Qin <qinyijie@amazon.com> Signed-off-by: Wilbert Guo <wilbeguo@amazon.com> * Refactor SyncForState and add unit tests Signed-off-by: Wilbert Guo <wilbeguo@amazon.com> * Format code Signed-off-by: Wilbert Guo <wilbeguo@amazon.com> * Add hook for syncForState Signed-off-by: Wilbert Guo <wilbeguo@amazon.com> Fix go lint Signed-off-by: Wilbert Guo <wilbeguo@amazon.com> Refactor syncForState override implementation Signed-off-by: Wilbert Guo <wilbeguo@amazon.com> Add syncForState override func as argument to Update() Signed-off-by: Wilbert Guo <wilbeguo@amazon.com> Fix go formatting Signed-off-by: Wilbert Guo <wilbeguo@amazon.com> Fix circleci test errors Signed-off-by: Wilbert Guo <wilbeguo@amazon.com> Remove overrideFunc as argument to run() Signed-off-by: Wilbert Guo <wilbeguo@amazon.com> * remove the syncForState Signed-off-by: Yijie Qin <qinyijie@amazon.com> * use the override function to decide if need to replace the activeAt or not Signed-off-by: Yijie Qin <qinyijie@amazon.com> * fix test case Signed-off-by: Yijie Qin <qinyijie@amazon.com> * fix format Signed-off-by: Yijie Qin <qinyijie@amazon.com> * Trigger build Signed-off-by: Yijie Qin <qinyijie@amazon.com> * fixing comments Signed-off-by: Yijie Qin <qinyijie@amazon.com> * return the result of map of alerts instead of single one Signed-off-by: Yijie Qin <qinyijie@amazon.com> * upper case the QueryforStateSeries Signed-off-by: Yijie Qin <qinyijie@amazon.com> * use a more generic rule group post process function type Signed-off-by: Yijie Qin <qinyijie@amazon.com> * fix indentation Signed-off-by: Yijie Qin <qinyijie@amazon.com> * fix gofmt Signed-off-by: Yijie Qin <qinyijie@amazon.com> * fix lint Signed-off-by: Yijie Qin <qinyijie@amazon.com> * fixing naming Signed-off-by: Yijie Qin <qinyijie@amazon.com> * fix comments Signed-off-by: Yijie Qin <qinyijie@amazon.com> * add the lastEvalTimestamp as parameter Signed-off-by: Yijie Qin <qinyijie@amazon.com> * fmt Signed-off-by: Yijie Qin <qinyijie@amazon.com> * change funcType to func Signed-off-by: Yijie Qin <qinyijie@amazon.com> Co-authored-by: Yijie Qin <qinyijie@amazon.com> Co-authored-by: Yijie Qin <63399121+qinxx108@users.noreply.github.com>	2022-03-29 02:16:46 +02:00
beorn7	4210aac74a	Merge branch 'main' into sparsehistogram	2022-03-22 14:47:42 +01:00
Alan Protasio	606ef33d91	Track and report Samples Queried per query We always track total samples queried and add those to the standard set of stats queries can report. We also allow optionally tracking per-step samples queried. This must be enabled both at the engine and query level to be tracked and rendered. The engine flag is exposed via a Prometheus feature flag, while the query flag is set when stats=all. Co-authored-by: Alan Protasio <approtas@amazon.com> Co-authored-by: Andrew Bloomgarden <blmgrdn@amazon.com> Co-authored-by: Harkishen Singh <harkishensingh@hotmail.com> Signed-off-by: Andrew Bloomgarden <blmgrdn@amazon.com>	2022-03-21 23:49:17 +01:00
Alvin Lin	cd739214dd	Log rule name when evaluating rule groups' Eval function logs anything (#10454 ) * Add benchmark test for rule group eval Signed-off-by: Alvin Lin <alvinlin@amazon.com>	2022-03-21 19:52:20 +01:00
Matej Gera	2c61d29b2a	Tracing: Migrate to OpenTelemetry library (#9724 ) Signed-off-by: Matej Gera <matejgera@gmail.com>	2022-01-25 11:08:04 +01:00
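
Note on commit 3fb881af26 (Simplify rule group's EvalTimestamp formula): the message argues that the four-step formula and the simplified one are equivalent. The sketch below is one way to check that claim numerically; it is not the code from the commit. The function names, the `mod` helper, and the use of plain `int64` values for `startTime`, `hash`, and `interval` are assumptions made for illustration. The helper is needed because Go's `%` truncates toward zero, while the derivation in the commit message relies on a non-negative modulus.

```go
package main

import "fmt"

// mod returns the non-negative remainder of a divided by m (m > 0).
// Go's % operator truncates toward zero, so a negative a would otherwise
// produce a negative remainder.
func mod(a, m int64) int64 {
	r := a % m
	if r < 0 {
		r += m
	}
	return r
}

// evalTimestampOld follows the original four-step formula from the
// commit message (hypothetical name, illustration only).
func evalTimestampOld(startTime, hash, interval int64) int64 {
	offset := mod(hash, interval)
	adjNow := startTime - offset
	base := adjNow - mod(adjNow, interval)
	return base + offset
}

// evalTimestampNew follows the simplified formula from the commit
// message (hypothetical name, illustration only).
func evalTimestampNew(startTime, hash, interval int64) int64 {
	offset := mod(startTime-hash, interval)
	return startTime - offset
}

func main() {
	const interval = 60
	// Compare both formulas over a small grid of start times and hashes.
	for hash := int64(0); hash < 200; hash += 7 {
		for start := int64(0); start < 500; start++ {
			if evalTimestampOld(start, hash, interval) != evalTimestampNew(start, hash, interval) {
				fmt.Println("mismatch at start", start, "hash", hash)
				return
			}
		}
	}
	fmt.Println("formulas agree")
}
```

For example, with `startTime = 100`, `hash = 7`, and `interval = 60`, both functions return 67: the most recent instant at or before 100 that is congruent to 7 modulo 60.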


618 Commits (faf398e38059083c22c1bc7b8a725b9274148e17)