This attempts to close#3973.
Handles cases where the length of the input vector to an aggregate topk
/ bottomk function is less than the K paramater. The change updates
Prometheus to allocate a result vector the same length as the input
vector in these cases.
Previously Prometheus would out-of-memory panic for large K values. This
change makes that unlikely unless the size of the input vector is
equally large.
Signed-off-by: David King <dave@davbo.org>
* parser test cleanup
- Test against the exported package functions instead of the private functions.
* Improves readability of TestParseSeries
- Moves package function closer to parser function
This adds a parameter to the storage selection interface which allows
query engine(s) to pass information about the operations surrounding a
data selection.
This can for example be used by remote storage backends to infer the
correct downsampling aggregates that need to be provided.
* TotalFoo suggested a comprehensive timing, but TotalEvalTime was part
of the Exec timings, together with Queue timings
* The other option was to rename ExecTotalTime to TotalExecTime, but
there was already ExecQueueTime, suggesting Exec to be some sort of
group
API consumers should be able to get insight into the query run times.
The UI currently measures total roundtrip times. This PR allows for more
fine grained metrics to be exposed.
* adds new timer for total execution time (queue + eval)
* expose new timer, queue timer, and eval timer in stats field of the
range query response:
```json
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [],
"stats": {
"execQueueTimeNs": 4683,
"execTotalTimeNs": 2086587,
"totalEvalTimeNs": 2077851
}
}
}
```
* stats field is optional, only set when query parameter `stats` is not
empty
Try it via
```sh
curl 'http://localhost:9090/api/v1/query_range?query=up&start=1486480279&end=1486483879&step=14000&stats=true'
```
Review feedback
* moved query stats json generation to query_stats.go
* use seconds for all query timers
* expose all timers available
* Changed ExecTotalTime string representation from Exec queue total time to Exec total time
This means that if there is no stale marker, only the usual staleness
delta (5m) applies.
It has occured to me that there is an oddity in the heurestic. It works
fine as long as you have 2 points within the last 5m, but breaks down
when the time window advances to the point where you have just 1 point.
Consider you had points at t=0 and t=10. With the heurestic it goes stale
at t=51, up until t=300. However from t=301 until t=310 we only
see the t=10 point and the series comes back to life. That is not
desirable.
I don't see a way to keep this form of heurestic working given this
issue, so thus I'm removing it.
* Re-add contexts to storage.Storage.Querier()
These are needed when replacing the storage by a multi-tenant
implementation where the tenant is stored in the context.
The 1.x query interfaces already had contexts, but they got lost in 2.x.
* Convert promql.Engine to use native contexts
With the squaring of the timestamp, we run into the
limitations of the 53bit mantissa for a 64bit float.
By subtracting away a timestamp of one of the samples (which is how the
intercept is used) we avoid this issue in practice as it's unlikely
that it is used over a very long time range.
Fixes#2674
* Fix error where we look into the future.
So currently we are adding values that are in the future for an older
timestamp. For example, if we have [(1, 1), (150, 2)] we will end up
showing [(1, 1), (2,2)].
Further it is not advisable to call .At() after Next() returns false.
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
* Retuen early if done
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
* Handle Seek() where we reach the end of iterator
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
* Simplify code
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
To cover the cases where stale markers may not be available,
we need to infer the interval and mark series stale based on that.
As we're lacking stale markers this is less accurate, however
it should be good enough for these cases.
We need 4 intervals as if say we had data at t=0 and t=10,
coming via federation. The next data point should be at t=20 however it
could take up to t=30 for it actually to be ingested, t=40 for it to be
scraped via federation and t=50 for it to be ingested.
We then add 10% on to that for slack, as we do elsewhere.
For instant vectors, if "stale" is the newest sample
ignore the timeseries.
For range vectors, filter out "stale" samples.
Make it possible to inject "stale" samples in promql tests.
Query and query_range should return the timestamp
at which an evaluation is performed, not the timestamp
of the data. This is as that's what query range asked
for, and we need to keep query consistent with that.
Query for a matrix remains unchanged, returning the literal
matrix.
Make the timestamp of instant vectors be the timestamp of the sample
rather than the evaluation. We were not using this anywhere, so this is
safe.
Add a function to return the timestamp of samples in an instant vector.
Fixes#1557
* Use request.Context() instead of a global map of contexts.
* Add some basic opentracing instrumentation on the query path.
* Remove tracehandler endpoint.
* Force buckets in a histogram to be monotonic for quantile estimation
The assumption that bucket counts increase monotonically with increasing
upperBound may be violated during:
* Recording rule evaluation of histogram_quantile, especially when rate()
has been applied to the underlying bucket timeseries.
* Evaluation of histogram_quantile computed over federated bucket
timeseries, especially when rate() has been applied
This is because scraped data is not made available to RR evalution or
federation atomically, so some buckets are computed with data from the N
most recent scrapes, but the other buckets are missing the most recent
observations.
Monotonicity is usually guaranteed because if a bucket with upper bound
u1 has count c1, then any bucket with a higher upper bound u > u1 must
have counted all c1 observations and perhaps more, so that c >= c1.
Randomly interspersed partial sampling breaks that guarantee, and rate()
exacerbates it. Specifically, suppose bucket le=1000 has a count of 10 from
4 samples but the bucket with le=2000 has a count of 7, from 3 samples. The
monotonicity is broken. It is exacerbated by rate() because under normal
operation, cumulative counting of buckets will cause the bucket counts to
diverge such that small differences from missing samples are not a problem.
rate() removes this divergence.)
bucketQuantile depends on that monotonicity to do a binary search for the
bucket with the qth percentile count, so breaking the monotonicity
guarantee causes bucketQuantile() to return undefined (nonsense) results.
As a somewhat hacky solution until the Prometheus project is ready to
accept the changes required to make scrapes atomic, we calculate the
"envelope" of the histogram buckets, essentially removing any decreases
in the count between successive buckets.
* Fix up comment docs for ensureMonotonic
* ensureMonotonic: Use switch statement
Use switch statement rather than if/else for better readability.
Process the most frequent cases first.
* Add max concurrent and current queries engine metrics
This commit adds two metrics to the promql/engine: the
number of max concurrent queries, as configured by the flag, and
the number of current queries being served+blocked in the engine.
This extracts Querier as an instantiateable and closeable object
rather than just defining extending methods of the storage interface.
This improves composability and allows abstracting query transactions,
which can be useful for transaction-level caches, consistent data views,
and encapsulating teardown.
This is based on https://github.com/prometheus/prometheus/pull/1997.
This adds contexts to the relevant Storage methods and already passes
PromQL's new per-query context into the storage's query methods.
The immediate motivation supporting multi-tenancy in Frankenstein, but
this could also be used by Prometheus's normal local storage to support
cancellations and timeouts at some point.
For Weaveworks' Frankenstein, we need to support multitenancy. In
Frankenstein, we initially solved this without modifying the promql
package at all: we constructed a new promql.Engine for every
query and injected a storage implementation into that engine which would
be primed to only collect data for a given user.
This is problematic to upstream, however. Prometheus assumes that there
is only one engine: the query concurrency gate is part of the engine,
and the engine contains one central cancellable context to shut down all
queries. Also, creating a new engine for every query seems like overkill.
Thus, we want to be able to pass per-query contexts into a single engine.
This change gets rid of the promql.Engine's built-in base context and
allows passing in a per-query context instead. Central cancellation of
all queries is still possible by deriving all passed-in contexts from
one central one, but this is now the responsibility of the caller. The
central query context is now created in main() and passed into the
relevant components (web handler / API, rule manager).
In a next step, the per-query context would have to be passed to the
storage implementation, so that the storage can implement multi-tenancy
or other features based on the contextual information.
The current separation between lexer and parser is a bit fuzzy when it
comes to operators, aggregators and other keywords. The lexer already
tries to determine the type of a token, even though that type might
change depending on the context.
This led to the problematic behavior that no tokens known to the lexer
could be used as label names, including operators (and, by, ...),
aggregators (count, quantile, ...) or other keywords (for, offset, ...).
This change additionally checks whether an identifier is one of these
types. We might want to check whether the specific item identification
should be moved from the lexer to the parser.
As described in #1898 'go test -race' detects a race in lexer code. This
pacth fixes it and also add '-race' option to test target to prevent
regression.
This was only relevant so far for the benchmark suite as it would
recycle Expr for repetitions. However, the append is unnecessary as
each node is only inspected once when populating iterators, and
population must always start from scratch.
This also introduces error checking during benchmarks and fixes the so
far undetected test errors during benchmarking.
Also, remove a style nit (two golint warnings less…).
See discussion in
https://groups.google.com/forum/#!topic/prometheus-developers/bkuGbVlvQ9g
The main idea is that the user of a storage shouldn't have to deal with
fingerprints anymore, and should not need to do an individual preload
call for each metric. The storage interface needs to be made more
high-level to not expose these details.
This also makes it easier to reuse the same storage interface for remote
storages later, as fewer roundtrips are required and the fingerprint
concept doesn't work well across the network.
NOTE: this deliberately gets rid of a small optimization in the old
query Analyzer, where we dedupe instants and ranges for the same series.
This should have a minor impact, as most queries do not have multiple
selectors loading the same series (and at the same offset).
tl;dr: This is not a fundamental solution to the indexing problem
(like tindex is) but it at least avoids utilizing the intersection
problem to the greatest possible amount.
In more detail:
Imagine the following query:
nicely:aggregating:rule{job="foo",env="prod"}
While it uses a nicely aggregating recording rule (which might have a
very low cardinality), Prometheus still intersects the low number of
fingerprints for `{__name__="nicely:aggregating:rule"}` with the many
thousands of fingerprints matching `{job="foo"}` and with the millions
of fingerprints matching `{env="prod"}`. This totally innocuous query
is dead slow if the Prometheus server has a lot of time series with
the `{env="prod"}` label. Ironically, if you make the query more
complicated, it becomes blazingly fast:
nicely:aggregating:rule{job=~"foo",env=~"prod"}
Why so? Because Prometheus only intersects with non-Equal matchers if
there are no Equal matchers. That's good in this case because it
retrieves the few fingerprints for
`{__name__="nicely:aggregating:rule"}` and then starts right ahead to
retrieve the metric for those FPs and checking individually if they
match the other matchers.
This change is generalizing the idea of when to stop intersecting FPs
and go into "retrieve metrics and check them individually against
remaining matchers" mode:
- First, sort all matchers by "expected cardinality". Matchers
matching the empty string are always worst (and never used for
intersections). Equal matchers are in general consider best, but by
using some crude heuristics, we declare some better than others
(instance labels or anything that looks like a recording rule).
- Then go through the matchers until we hit a threshold of remaining
FPs in the intersection. This threshold is higher if we are already
in the non-Equal matcher area as intersection is even more expensive
here.
- Once the threshold has been reached (or we have run out of matchers
that do not match the empty string), start with "retrieve metrics
and check them individually against remaining matchers".
A beefy server at SoundCloud was spending 67% of its CPU time in index
lookups (fingerprintsForLabelPairs), serving mostly a dashboard that
is exclusively built with recording rules. With this change, it spends
only 35% in fingerprintsForLabelPairs. The CPU usage dropped from 26
cores to 18 cores. The median latency for query_range dropped from 14s
to 50ms(!). As expected, higher percentile latency didn't improve that
much because the new approach is _occasionally_ running into the worst
case while the old one was _systematically_ doing so. The 99th
percentile latency is now about as high as the median before (14s)
while it was almost twice as high before (26s).
When converting `AlertStmt` to a string, the alert rule labels were
printed as `ANNOTATIONS` instead of the annotations themselves.
Fix and add a test to catch future regressions.
This offers new semantics in allowing on() for matching
two single-element vectors with no known common labels.
Previosuly this was often done using on(dummy).
This also allows making it explict that you meant
to do an aggregation without labels via by().
Fixes#1597.
PromQL only requires a much narrower interface than local.Storage in
order to run queries. Narrower interfaces are easier to replace and
test, too.
We could also change the web interface to use local.Querier, except that
we'll probably use appending functions from there in the future.
Currently the printer doesn't print the annotations of an `*AlertStmt`
declaration. I've added a test case as well, which fails for the current
master.
Since rule evaluations work via String(), this fixes evaluation of
rules containing GROUP_x modifiers without labels. This change is the
minimal bugfix (so that we can release a fixed version without
risk). It does not intend to implement any additional features (like
allowing `GROUP_LEFT()` or `ON()` or even `ON` - see discussion in
https://github.com/prometheus/prometheus/issues/1597 ).