This is done by bucketing chunks by fingerprint. If persisting to
disk falls behind, more and more chunks pile up in the queue. As soon
as there are "double hits", i.e. two queued chunks for the same
series, we will now persist both chunks in one go, doubling the disk
throughput (assuming it is limited by disk seeks). Should even more
pile up so that we end up with "triple hits", we will persist those
first, and so on.
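
As a rough sketch only (the types and function names below are
invented for illustration and are not the actual storage code),
draining the queue by "hit count" could look like this:

package main

import "sort"

// Placeholder types; the real storage code has its own chunk and
// fingerprint representations.
type fingerprint uint64
type chunk struct{ data []byte }

// persistQueue buckets the chunks waiting to be persisted by the
// fingerprint of their series.
type persistQueue map[fingerprint][]chunk

// drainOrder returns the fingerprints sorted by the number of queued
// chunks, largest buckets first, so "triple hits" are persisted before
// "double hits", which are persisted before single chunks.
func (q persistQueue) drainOrder() []fingerprint {
	fps := make([]fingerprint, 0, len(q))
	for fp := range q {
		fps = append(fps, fp)
	}
	sort.Slice(fps, func(i, j int) bool {
		return len(q[fps[i]]) > len(q[fps[j]])
	})
	return fps
}

// persistSeries stands in for appending all queued chunks of one series
// to its chunk file, requiring only a single disk seek.
func persistSeries(fp fingerprint, cs []chunk) {}

func main() {
	q := persistQueue{
		17: make([]chunk, 3), // "triple hit"
		23: make([]chunk, 1),
		42: make([]chunk, 2), // "double hit"
	}
	for _, fp := range q.drainOrder() {
		persistSeries(fp, q[fp]) // one seek, several chunks written
		delete(q, fp)
	}
}
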
Even if we have millions of time series, this will still help,
assuming not all of them are growing at the same speed. Series that
get many samples and/or are not very compressible will accumulate
chunks faster, and they will soon get double- or triple-writes.
To improve the chance of double writes,
-storage.local.persistence-queue-capacity could be set to a higher
value. However, that would slow down shutdown a lot (as the whole
queue has to be worked through). So we leave it to the user to set it
to a really high value if needed. A more fundamental solution would be
to checkpoint not only head chunks, but also chunks still in the
persist queue. That would be quite complicated for a rather limited
use case (running many time series with a high ingestion rate on slow
spinning disks).
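
Purely as an illustration (the value is made up, not a
recommendation), a user who can afford the memory and a slower
shutdown could raise the capacity on the command line along these
lines:

./prometheus -storage.local.persistence-queue-capacity=1048576

For reference, the relevant flag definitions, with the change to the
persistence queue capacity default: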
 	numMemoryChunks            = flag.Int("storage.local.memory-chunks", 1024*1024, "How many chunks to keep in memory. While the size of a chunk is 1kiB, the total memory usage will be significantly higher than this value * 1kiB. Furthermore, for various reasons, more chunks might have to be kept in memory temporarily.")
 	persistenceRetentionPeriod = flag.Duration("storage.local.retention", 15*24*time.Hour, "How long to retain samples in the local storage.")
-	persistenceQueueCapacity   = flag.Int("storage.local.persistence-queue-capacity", 128*1024, "How many chunks can be waiting for being persisted before sample ingestion will stop.")
+	persistenceQueueCapacity   = flag.Int("storage.local.persistence-queue-capacity", 32*1024, "How many chunks can be waiting for being persisted before sample ingestion will stop.")
 	checkpointInterval         = flag.Duration("storage.local.checkpoint-interval", 5*time.Minute, "The period at which the in-memory index of time series is checkpointed.")
 	checkpointDirtySeriesLimit = flag.Int("storage.local.checkpoint-dirty-series-limit", 5000, "If approx. that many time series are in a state that would require a recovery operation after a crash, a checkpoint is triggered, even if the checkpoint interval hasn't passed yet. A recovery operation requires a disk seek. The default limit intends to keep the recovery time below 1min even on spinning disks. With SSD, recovery is much faster, so you might want to increase this value in that case to avoid overly frequent checkpoints.")