This is a fairly easy attempt to dynamically evict chunks based on the
heap size. A target heap size has to be set as a command line flage,
so that users can essentially say "utilize 4GiB of RAM, and please
don't OOM".
The -storage.local.max-chunks-to-persist and
-storage.local.memory-chunks flags are deprecated by this
change. Backwards compatibility is provided by ignoring
-storage.local.max-chunks-to-persist and use
-storage.local.memory-chunks to set the new
-storage.local.target-heap-size to a reasonable (and conservative)
value (both with a warning).
This also makes the metrics intstrumentation more consistent (in
naming and implementation) and cleans up a few quirks in the tests.
Answers to anticipated comments:
There is a chance that Go 1.9 will allow programs better control over
the Go memory management. I don't expect those changes to be in
contradiction with the approach here, but I do expect them to
complement them and allow them to be more precise and controlled. In
any case, once those Go changes are available, this code has to be
revisted.
One might be tempted to let the user specify an estimated value for
the RSS usage, and then internall set a target heap size of a certain
fraction of that. (In my experience, 2/3 is a fairly safe bet.)
However, investigations have shown that RSS size and its relation to
the heap size is really really complicated. It depends on so many
factors that I wouldn't even start listing them in a commit
description. It depends on many circumstances and not at least on the
risk trade-off of each individual user between RAM utilization and
probability of OOMing during a RAM usage peak. To not add even more to
the confusion, we need to stick to the well-defined number we also use
in the targeting here, the sum of the sizes of heap objects.
Currently, if a series stops to exist, its head chunk will be kept
open for an hour. That prevents it from being persisted. Which
prevents it from being evicted. Which prevents the series from being
archived.
Most of the time, once no sample has been added to a series within the
staleness limit, we can be pretty confident that this series will not
receive samples anymore. The whole chain as described above can be
started after 5m instead of 1h. In the relaxed case, this doesn't
change a lot as the head chunk timeout is only checked during series
maintenance, and usually, a series is only maintained every six
hours. However, there is the typical scenario where a large service is
deployed, the deoply turns out to be bad, and then it is deployed
again within minutes, and quite quickly the number of time series has
tripled. That's the point where the Prometheus server is stressed and
switches (rightfully) into rushed mode. In that mode, time series are
processed as quickly as possible, but all of that is in vein if all of
those recently ended time series cannot be persisted yet for another
hour. In that scenario, this change will help most, and it's exactly
the scenario where help is most desperately needed.
Rationale: The default value for GOGC is 100, i.e. a garbage collected
is initialized once as many heap space has been allocated as was in
use after the last GC was done. This ratio doesn't make a lot of sense
in Prometheus, as typically about 60% of the heap is allocated for
long-lived memory chunks (most of which are around for many hours if
not days). Thus, short-lived heap objects are accumulated for quite
some time until they finally match the large amount of memory used by
bulk memory chunks and a gigantic GC cyle is invoked. With GOGC=40, we
are essentially reinstating "normal" GC behavior by acknowledging that
about 60% of the heap are used for long-term bulk storage.
The median Prometheus production server at SoundCloud runs a GC cycle
every 90 seconds. With GOGC=40, a GC cycle is run every 35 seconds
(which is still not very often). However, the effective RAM usage is
now reduced by about 30%. If settings are updated to utilize more RAM,
the time between GC cycles goes up again (as the heap size is larger
with more long-lived memory chunks, but the frequency of creating
short-lived heap objects does not change). On a quite busy large
Prometheus server, the timing changed from one GC run every 20s to one
GC run every 12s.
In the former case (just changing GOGC, leave everything else as it
is), the CPU usage increases by about 10% (on a mid-size referenc
server from 8.1 to 8.9). If settings are adjusted, the CPU
consumptions increases more drastically (from 8 cores to 13 cores on a
large reference server), despite GCs happening more rarely, presumably
because a 50% larger set of memory chunks is managed now. Having more
memory chunks is good in many regards, and most servers are running
out of memory long before they run out of CPU cycles, so the tradeoff
is overwhelmingly positive in most cases.
Power users can still set the GOGC environment variable as usual, as
the implementation in this commit honors an explicitly set variable.
This removes legacy support for specific remote storage systems in favor
of only offering the generic remote write protocol. An example bridge
application that translates from the generic protocol to each of those
legacy backends is still provided at:
documentation/examples/remote_storage/remote_storage_bridge
See also https://github.com/prometheus/prometheus/issues/10
The next step in the plan is to re-add support for multiple remote
storages.
This is a followup to https://github.com/prometheus/prometheus/pull/2011.
This publishes more of the methods and other names of the chunk code and
moves the chunk code to its own package. There's some unavoidable
ugliness: the chunk and chunkDesc metrics are used by both packages, so
I had to move them to the chunk package. That isn't great, but I don't
see how to do it better without a larger redesign of everything. Same
for the evict requests and some other types.
* Add config, HTTP Basic Auth and TLS support to the generic write path.
- Move generic write path configuration to the config file
- Factor out config.TLSConfig -> tlf.Config translation
- Support TLSConfig for generic remote storage
- Rename Run to Start, and make it non-blocking.
- Dedupe code in httputil for TLS config.
- Make remote queue metrics global.
This is based on https://github.com/prometheus/prometheus/pull/1997.
This adds contexts to the relevant Storage methods and already passes
PromQL's new per-query context into the storage's query methods.
The immediate motivation supporting multi-tenancy in Frankenstein, but
this could also be used by Prometheus's normal local storage to support
cancellations and timeouts at some point.