prometheus

Commit Graph

Author	SHA1	Message	Date
beorn7	40ad5e284a	Merge branch 'main' into beorn7/sparsehistogram	2022-06-09 20:50:30 +02:00
Matej Gera	1dd247f68b	Remote Write: Rename confusing `walDir` parameter to `dir` (#10464) * Rename walDir parameter to dir Signed-off-by: Matej Gera <matejgera@gmail.com> * Improve NewQueueManager comment Signed-off-by: Matej Gera <matejgera@gmail.com>	2022-05-30 21:45:30 -07:00
Bryan Boreham	4b9f248e85	unit tests: make all Labels sorted alphabetically (#10532) "Labels is a sorted set of labels. Order has to be guaranteed upon instantiation." says the comment, so fix all the tests that break this rule. For `BenchmarkLabelValuesWithMatchers()` and `BenchmarkHeadLabelValuesWithMatchers()` the amount of work done changes significantly if you put the labels in order, because all series refs get neatly partitioned by the `tens` label, so I renamed the labels to maintain the previous behaviour. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2022-05-04 23:41:36 +02:00
beorn7	3bc711e333	Merge branch 'main' into sparsehistogram	2022-05-04 13:37:13 +02:00
Matthieu MOREL	e2ede285a2	refactor: move from io/ioutil to io and os packages (#10528) * refactor: move from io/ioutil to io and os packages * use fs.DirEntry instead of os.FileInfo after os.ReadDir Signed-off-by: MOREL Matthieu <matthieu.morel@cnp.fr>	2022-04-27 11:24:36 +02:00
Chris Marchbanks	a11e73edda	Fix a deadlock between Batch and FlushAndShutdown (#10608) If FlushAndShutdown is called with a full batchQueue, and then Batch is called rather than the normal path of reading from a queue a deadlock might be encountered. Rather than having FlushAndShutdown having blocking code while holding a lock retry sending the batch every second. Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2022-04-20 20:50:41 +02:00
beorn7	4210aac74a	Merge branch 'main' into sparsehistogram	2022-03-22 14:47:42 +01:00
beorn7	79376c1e94	Merge branch 'release-2.33' into beorn7/release	2022-03-08 17:42:49 +01:00
Chris Marchbanks	e970acb085	Fix deadlock between adding to queue and getting batch Do not block when trying to write a batch to the queue. This can cause appends to lock forever if the only thing reading from the queue needs the mutex to write. Instead, if batchQueue is full pop the sample that was just added from the partial batch and return false. The code doing the appending already handles retries with backoff. Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2022-03-07 17:15:57 -07:00
Chris Marchbanks	afdc1decac	Write a test that reproduces the deadlock Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2022-03-07 17:15:51 -07:00
DrAuYueng	5a6e26556b	Add an option to use the external labels as selectors for the remote read endpoint (#10254) * An option to ignore external_labels Signed-off-by: DrAuYueng <ouyang1204@gmail.com>	2022-02-16 22:12:47 +01:00
Julien Pivotto	b0d70557b7	Merge pull request #10285 from prometheus/release-2.33	2022-02-12 00:02:24 +01:00
Chris Marchbanks	bfb1500a38	Fix deadlock when stopping a shard (#10279) If a queue is stopped and one of its shards happens to hit the batch_send_deadline at the same time a deadlock can occur where stop holds the mutex and will not release it until the send is finished, but the send needs the mutex to retrieve the most recent batch. This is fixed by using a second mutex just for writing. In addition, the test I wrote exposed a case where during shutdown a batch could be sent twice due to concurrent calls to queue.Batch() and queue.FlushAndShutdown(). Protect these with a mutex as well. Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2022-02-11 07:07:41 -07:00
Matej Gera	2c61d29b2a	Tracing: Migrate to OpenTelemetry library (#9724) Signed-off-by: Matej Gera <matejgera@gmail.com>	2022-01-25 11:08:04 +01:00
Eng Zer Jun	3e67654d37	refactor: use `T.TempDir()` and `B.TempDir` to create temporary directory The directory created by `T.TempDir()` and `B.TempDir()` is automatically removed when the test and all its subtests complete. Reference: https://pkg.go.dev/testing#T.TempDir Reference: https://pkg.go.dev/testing#B.TempDir Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>	2022-01-22 18:57:30 +08:00
Bryan Boreham	954c0e8020	remote_write: round desired shards up before check Previously we would reject an increase from 2 to 2.5 as being within 30%; by rounding up first we see this as an increase from 2 to 3. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2022-01-10 09:57:37 +00:00
Bryan Boreham	6d01ce8c4d	remote_write: shard up more when backlogged Change the coefficient from 1% to 5%, so instead of targetting to clear the backlog in 100s we target 20s. Update unit test to reflect the new behaviour. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2022-01-10 09:57:37 +00:00
Bryan Boreham	d588b14d9c	remote_write: detailed test for shard calculation Drive the input parameters to `calculateDesiredShards()` very precisely, to illustrate some questionable behaviour marked with `?!`. See https://github.com/prometheus/prometheus/issues/9178, https://github.com/prometheus/prometheus/issues/9207, Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2022-01-10 09:57:37 +00:00
Chris Marchbanks	ba03f7fc23	Merge pull request #10102 from prometheus/update-metrics-on-rw-fails Update sent timestamp when write irrecoverably fails	2022-01-05 10:46:09 -07:00
Goutham Veeramachaneni	6696b7a5f0	Don't update metrics on context cancellation Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>	2022-01-04 10:46:52 +01:00
Chris Marchbanks	dfa5cb7462	Merge pull request #10038 from charlesxsh/fix-TestReshardRaceWithStop add proper exit for loop	2022-01-03 09:02:45 -07:00
Goutham Veeramachaneni	1af81dc5c9	Update sent timestamp when write irrecoverably fails. We have an alert that fires when prometheus_remote_storage_highest_timestamp_in_seconds - prometheus_remote_storage_queue_highest_sent_timestamp_seconds becomes too high. But we have an agent that fires this when the remote "rate-limits" the user. This is because prometheus_remote_storage_queue_highest_sent_timestamp_seconds doesn't get updated when the remote sends a 429. I think we should update the metrics, and the change I made makes sense. Because if the requests fails because of connectivity issues, etc. we will never exit the `sendWriteRequestWithBackoff` function. It only exits the function when there is a non-recoverable error, like a bad status code, and in that case, I think the metric needs to be updated. Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>	2022-01-03 11:13:48 +01:00
Shihao Xia	c3e7bfb813	add proper exit for loop Signed-off-by: Shihao Xia <charlesxsh@hotmail.com>	2021-12-29 23:48:11 -05:00
beorn7	86cc83b13c	storage: iterator fixes after merge Signed-off-by: beorn7 <beorn@grafana.com>	2021-12-18 14:12:01 +01:00
beorn7	64c7bd2b08	Merge branch 'main' into sparsehistogram	2021-12-18 14:04:25 +01:00
Julien Pivotto	27343277fa	Merge release-2.32 forward into main (#10032) * storage: expose bug in iterators #10027 Signed-off-by: beorn7 <beorn@grafana.com> * storage: fix bug #10027 in iterators' Seek method Signed-off-by: beorn7 <beorn@grafana.com> * Append reporting metrics without limit If reporting metrics fails due to reaching the limit, this makes the target appear as UP in the UI, but the metrics are missing. This commit bypasses that limit for report metrics. Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu> * Remove check against cfg so interval/ timeout are always set (#10023) (#10031) Signed-off-by: Nicholas Blott <blottn@tcd.ie> Co-authored-by: Nicholas Blott <blottn@tcd.ie> * Cut v2.32.1 Signed-off-by: Julius Volz <julius.volz@gmail.com> * Apply suggestions from code review Signed-off-by: Julius Volz <julius.volz@gmail.com> Co-authored-by: Levi Harrison <git@leviharrison.dev> Co-authored-by: Julien Pivotto <roidelapluie@inuits.eu> Co-authored-by: Nicholas Blott <blottn@tcd.ie> Co-authored-by: Julius Volz <julius.volz@gmail.com> Co-authored-by: Levi Harrison <git@leviharrison.dev>	2021-12-17 23:18:38 +01:00
beorn7	0ede6ae321	storage: fix bug #10027 in iterators' Seek method Signed-off-by: beorn7 <beorn@grafana.com>	2021-12-16 12:07:35 +01:00
beorn7	b042e29569	storage: expose bug in iterators #10027 Signed-off-by: beorn7 <beorn@grafana.com>	2021-12-16 12:02:15 +01:00
beorn7	6f33ab2b35	Merge branch 'main' into sparsehistogram	2021-12-15 13:49:33 +01:00
Chris Marchbanks	0a8d28ea93	Merge pull request #9934 from bboreham/remote-write-struct remote-write: buffer struct instead of interface to reduce garbage-collection	2021-12-09 09:17:45 -07:00
Bryan Boreham	bd6436605d	Review feedback Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2021-12-09 14:40:44 +00:00
Sebastian Rabenhorst	d8b8678bd1	Log time series details for out-of-order samples in remote write receiver (#9894) * Improved out-of-order sample logs in write handler Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com> sign commit Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com> Inlined logAppendError Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com> Update storage/remote/write_handler.go Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com> Fixed fmt Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com> * Improved out-of-order sample logs in write handler Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com> sign commit Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com> Inlined logAppendError Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>	2021-12-08 15:07:51 +00:00
Bryan Boreham	50878ebe5e	remote-write: buffer struct instead of interface This reduces the amount of individual objects allocated, allowing sends to run a bit faster. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2021-12-03 14:30:42 +00:00
Bryan Boreham	c478d6477a	remote-write: benchmark just sending, on 20 shards Previously BenchmarkSampleDelivery spent a lot of effort checking each sample had arrived, so was largely showing the performance of test-only code. Increase the number of shards to be more realistic for a large workload. Signed-off-by: Bryan Boreham <bjboreham@gmail.com>	2021-12-03 14:02:10 +00:00
Chris Marchbanks	e95d4ec3f1	Merge pull request #9830 from prometheus/batch-queues Batch samples before sending them to channels	2021-12-02 08:37:41 -07:00
Chris Marchbanks	c655684142	Subtract from enqueued samples/exemplars upon send Right now the values for enqueuedSamples and enqueuedExemplars is never subtracted leading to inflated values for failedSamples/failedExemplars when a hard shutdown of a shard occurs. Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2021-11-30 12:54:50 -07:00
Chris Marchbanks	319249f9db	Batch samples before sending them to channels Channels can cause bottlenecks and tons of context switches when reading hundreds of thousands of samples per second from a single queue. Instead, pre-batch the samples to amortize the cost of the concurrency overhead. Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>	2021-11-30 12:54:45 -07:00
beorn7	68e02be963	Post-merge fixes Signed-off-by: beorn7 <beorn@grafana.com>	2021-11-30 17:20:28 +01:00
beorn7	e4e24453fa	Merge branch 'main' into beorn7/merge2	2021-11-30 17:19:06 +01:00
Björn Rabenstein	b866db009b	storage: Fix and improve the Seek method of various iterators (#9878) There was a subtle and nasty bug in listSeriesIterator.Seek. In addition, the Seek call is defined to be a no-op if the current position of the iterator is already pointing to a suitable sample. This commit adds fast paths for this case to several potentially expensive Seek calls. Another bug was in concreteSeriesIterator.Seek. It always searched the whole series and not from the current position of the iterator. Signed-off-by: beorn7 <beorn@grafana.com>	2021-11-29 15:17:56 +05:30
Björn Rabenstein	7e42acd3b1	tsdb: Rework iterators (#9877) - Pick At... method via return value of Next/Seek. - Do not clobber returned buckets. - Add partial FloatHistogram suppert. Note that the promql package is now _only_ dealing with FloatHistograms, following the idea that PromQL only knows float values. As a byproduct, I have removed the histogramSeries metric. In my understanding, series can have both float and histogram samples, so that metric doesn't make sense anymore. As another byproduct, I have converged the sampleBuf and the histogramSampleBuf in memSeries into one. The sample type stored in the sampleBuf has been extended to also contain histograms even before this commit. Signed-off-by: beorn7 <beorn@grafana.com>	2021-11-29 13:24:23 +05:30
Matheus Alcantara	e673805d67	storage/remote: use t.TempDir instead of ioutil.TempDir on tests (#9811) Signed-off-by: Matheus Alcantara <matheusssilv97@gmail.com>	2021-11-19 15:21:45 -05:00
Hu Shuai	eb43437d83	Fix golint issue (#9800) Signed-off-by: Hu Shuai <hus.fnst@fujitsu.com>	2021-11-18 09:26:07 +01:00
beorn7	5d4db805ac	Merge branch 'main' into sparsehistogram	2021-11-17 19:57:31 +01:00
beorn7	4c28d9fac7	Move to histogram.Histogram pointers This is to avoid copying the many fields of a histogram.Histogram all the time. This also fixes a bunch of formerly broken tests. Signed-off-by: beorn7 <beorn@grafana.com>	2021-11-12 23:17:35 +01:00
Mateusz Gozdek	d8561dbfd8	storage/remote: make tests use separate remote write configs So tests can be run in parallel without races. Signed-off-by: Mateusz Gozdek <mgozdekof@gmail.com>	2021-11-10 09:40:43 +01:00
Mateusz Gozdek	116552cc58	storage/remote: check errors from ApplyConfig in tests So tests do not produce obscure errors when applying configuration fails. Signed-off-by: Mateusz Gozdek <mgozdekof@gmail.com>	2021-11-10 09:40:43 +01:00
beorn7	c954cd9d1d	Move packages out of deprecated pkg directory This creates a new `model` directory and moves all data-model related packages over there: exemplar labels relabel rulefmt textparse timestamp value All the others are more or less utilities and have been moved to `util`: gate logging modetimevfs pool runtime Signed-off-by: beorn7 <beorn@grafana.com>	2021-11-09 08:03:10 +01:00
Dieter Plaetinck	cda025b5b5	TSDB: demistify SeriesRefs and ChunkRefs (#9536) * TSDB: demistify seriesRefs and ChunkRefs The TSDB package contains many types of series and chunk references, all shrouded in uint types. Often the same uint value may actually mean one of different types, in non-obvious ways. This PR aims to clarify the code and help navigating to relevant docs, usage, etc much quicker. Concretely: * Use appropriately named types and document their semantics and relations. * Make multiplexing and demuxing of types explicit (on the boundaries between concrete implementations and generic interfaces). * Casting between different types should be free. None of the changes should have any impact on how the code runs. TODO: Implement BlockSeriesRef where appropriate (for a future PR) Signed-off-by: Dieter Plaetinck <dieter@grafana.com> * feedback Signed-off-by: Dieter Plaetinck <dieter@grafana.com> * agent: demistify seriesRefs and ChunkRefs Signed-off-by: Dieter Plaetinck <dieter@grafana.com>	2021-11-06 15:40:04 +05:30
sniper	f82e56fbba	fix request bytes size and continue is useless (#9635) Signed-off-by: kalmanzhao <kalmanzhao@tencent.com> Co-authored-by: kalmanzhao <kalmanzhao@tencent.com>	2021-11-03 14:40:31 +05:30

397 Commits (ffaabea91a9a2440d042a10f60a3fd548091a9c5)