Feature: Allow configuration of a rule evaluation delay (#14061)
* [PATCH] Allow having evaluation delay for rule groups
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* [PATCH] Fix lint
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* [PATCH] Move the option to ManagerOptions
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* [PATCH] Include evaluation_delay in the group config
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
* Fix comments
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Add a server configuration option.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Appease the linter #1
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Add the new server flag documentation
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Improve documentation of the new flag and configuration
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Use named parameters for clarity on the `Rule` interface
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Add `initial` to the flag help
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Change the CHANGELOG area from `ruler` to `rules`
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Rename `evaluation_delay` to `rule_query_offset`/`query_offset` and make it a global configuration option.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* more docs
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Improve wording on CHANGELOG
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Add `RuleQueryOffset` to the default config in tests in case it changes
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Update docs/configuration/recording_rules.md
Co-authored-by: Julius Volz <julius.volz@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Rename `RuleQueryOffset` to `QueryOffset` when in the group context.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
* Improve docstring and documentation on the `rule_query_offset`
Signed-off-by: gotjosh <josue.abreu@gmail.com>
---------
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Co-authored-by: Julius Volz <julius.volz@gmail.com>
* [CHANGE] Rules: Execute 1 query instead of N (where N is the number of alerts within the alert rule) when restoring alerts. #13980
* [FEATURE] Rules: Add new option `query_offset` for each rule group via the rule group configuration file, and `rule_query_offset` as part of the global configuration, to better tolerate remote-write delays. #14061
* [ENHANCEMENT] Rules: Add `rule_group_last_restore_duration_seconds` to measure the time it takes to restore a rule group. #13974
* [ENHANCEMENT] OTLP: Improve remote write format translation performance by using label set hashes for metric identifiers instead of string-based ones. #14006 #13991
* [ENHANCEMENT] TSDB: Optimize querying with regexp matchers. #13620
# Offset the rule evaluation timestamp by the specified duration into the past to ensure
# the underlying metrics have been received. Metric availability delays are more likely
# to occur when Prometheus is running as a remote write target, but can also occur when
# there are anomalies with scraping.
[ rule_query_offset: <duration> | default = 0s ]
# The labels to add to any time series or alerts when communicating with
# external systems (federation, remote storage, Alertmanager).
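For illustration, a minimal global configuration sketch using the new option; the 1m value is arbitrary, and (per the commits above) a group-level `query_offset` is assumed to override this global default:

```yaml
global:
  evaluation_interval: 1m
  # Evaluate every rule group 1 minute in the past, unless a group
  # sets its own query_offset.
  rule_query_offset: 1m
```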
@@ -148,6 +151,9 @@ the rule, active, pending, or inactive, are cleared as well. The event will be
recorded as an error in the evaluation, and as such no stale markers are
written.
# Rule query offset
Setting a rule query offset is useful to ensure the underlying metrics have been received and stored in Prometheus before the group is evaluated. Metric availability delays are more likely to occur when Prometheus is running as a remote write target due to the nature of distributed systems, but can also occur when there are anomalies with scraping and/or short evaluation intervals.
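For example, a rule group that tolerates remote-write lag might look like this sketch (the group name, offset value, and rule are illustrative):

```yaml
groups:
  - name: remote_write_lagged
    interval: 1m
    # Shift this group's evaluation timestamp 2 minutes into the past
    # so remote-written samples have time to arrive.
    query_offset: 2m
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```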
# Failed rule evaluations due to slow evaluation
If a rule group hasn't finished evaluating before its next evaluation is supposed to start (as defined by the `evaluation_interval`), the next evaluation will be skipped. Subsequent evaluations of the rule group will continue to be skipped until the initial evaluation either completes or times out. When this happens, there will be a gap in the metric produced by the recording rule. The `rule_group_iterations_missed_total` metric will be incremented for each missed iteration of the rule group.
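One way to catch this condition is to alert on that counter; the following is a sketch, assuming Prometheus scrapes itself (the metric is then exposed with the `prometheus_` prefix) and using an arbitrary window and labels:

```yaml
groups:
  - name: rule_health
    rules:
      - alert: RuleGroupIterationsMissed
        # Fires if any rule group skipped one or more iterations
        # over the last 15 minutes.
        expr: increase(prometheus_rule_group_iterations_missed_total[15m]) > 0
        labels:
          severity: warning
```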
var filteredRes promql.Vector // After removing 'ALERTS' samples.
for _, smpl := range res {
	smplName := smpl.Metric.Get("__name__")
	if smplName == "ALERTS_FOR_STATE" {
		filteredRes = append(filteredRes, smpl)
	} else {
		// If not 'ALERTS_FOR_STATE', it has to be 'ALERTS'.
		require.Equal(t, "ALERTS", smplName)
	}
}
for i := range test.result {
	test.result[i].T = timestamp.FromTime(evalTime)
	// Updating the expected 'for' state.
	if test.result[i].F >= 0 {
		test.result[i].F = forState
		for _, aa := range rule.ActiveAlerts() {
			require.Zero(t, aa.Labels.Get(model.MetricNameLabel), "%s label set on active alert: %s", model.MetricNameLabel, aa.Labels)
		}
	}
}
require.Equal(t, len(test.result), len(filteredRes), "%d. Number of samples in expected and actual output don't match (%d vs. %d)", i, len(test.result), len(res))
require.True(t, ok, "Series %s not returned.", metric)
require.True(t, value.IsStaleNaN(metricSample[2].F), "Appended second sample not as expected. Wanted: stale NaN Got: %x", math.Float64bits(metricSample[2].F))