Commit Graph

107 Commits (b90350737180876c6318c57cc260b62f8f0c94cb)

Author SHA1 Message Date
Bjoern Rabenstein 89bb376bce Reduce lock-protected area during scrape.
Change-Id: Iaa7faa7c916b1890b568d05bd8bfff6299b6767d
2014-12-05 19:40:41 +01:00
Bjoern Rabenstein fee88a7a77 Remove the remaining races, new and old.
Also, resolve a few other TODOs.

Change-Id: Icb39b5a5e8ca22ebcb48771cd8951c5d9e112691
2014-12-03 18:07:23 +01:00
Bjoern Rabenstein 14bda4180c Changes after pair code review.
Change-Id: Ib72d40f8e9027818cfbbd32a7a7201eebda07455
2014-11-25 17:12:59 +01:00
Bjoern Rabenstein a2feed343a Convert another occurrence from chan bool to chan struct{}.
Change-Id: I11ba127a934ee3aec0fcd139ad32a7751cff77a0
2014-11-25 17:10:39 +01:00
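
The conversion named in this commit follows a common Go idiom: a channel that is used purely for signalling carries no data, so `chan struct{}` states that intent and its elements take no memory. A minimal sketch of the idiom follows; `countStopped` and the worker layout are illustrative assumptions and are not taken from the Prometheus code.

```go
package main

import "fmt"

// countStopped starts n workers that block until stop is closed and
// returns how many of them acknowledged the shutdown signal.
// Hypothetical helper, for illustration only.
func countStopped(n int) int {
	stop := make(chan struct{})    // signal only, carries no data
	done := make(chan struct{}, n) // one acknowledgement per worker

	for i := 0; i < n; i++ {
		go func() {
			<-stop             // unblocks in every worker once stop is closed
			done <- struct{}{} // acknowledge the shutdown
		}()
	}

	close(stop) // broadcast the shutdown signal to all workers

	stopped := 0
	for i := 0; i < n; i++ {
		<-done
		stopped++
	}
	return stopped
}

func main() {
	fmt.Println(countStopped(3))
}
```

Closing the channel, rather than sending on it, is what lets a single signal reach any number of receivers.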
Bjoern Rabenstein 74c143c4c9 Improve scraper shutdown time.
- Stop target pools in parallel.
- Stop individual scrapers in goroutines, too.
- Timing tweaks.

Change-Id: I9dff1ee18616694f14b04408eaf1625d0f989696
2014-11-25 17:10:39 +01:00
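
Stopping pools and scrapers in parallel, as the bullet points of this commit describe, can be sketched with a `sync.WaitGroup`. The function name `stopAll` and the use of plain `func()` values as stoppers are assumptions made for this example, not the actual retrieval code.

```go
package main

import (
	"fmt"
	"sync"
)

// stopAll calls every stop function in its own goroutine and returns once
// all of them have finished. Shutdown time is then bounded by the slowest
// stopper instead of the sum of all stoppers.
func stopAll(stops []func()) {
	var wg sync.WaitGroup
	for _, stop := range stops {
		wg.Add(1)
		go func(stop func()) {
			defer wg.Done()
			stop()
		}(stop)
	}
	wg.Wait()
}

func main() {
	stopped := make(chan string, 2)
	stopAll([]func(){
		func() { stopped <- "pool-a" },
		func() { stopped <- "pool-b" },
	})
	fmt.Println(len(stopped), "pools stopped")
}
```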
Bjoern Rabenstein 92156ee89d Drain the newBaseLabels channel upon shutdown.
This should help cut down shutdown times.

Change-Id: I6e70a598a9e49aa6eeeb2034105b1bc6e9014324
2014-11-25 17:10:39 +01:00
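
One plausible reading of this commit is that senders blocked on a full channel delay shutdown, so the receiving side empties the channel while shutting down. The sketch below shows a non-blocking drain under that assumption; the element type and the name `drain` are hypothetical, and the real code may instead range over the channel until it is closed.

```go
package main

import "fmt"

// drain removes every element currently queued on ch without blocking and
// returns how many elements were discarded.
func drain(ch <-chan string) int {
	n := 0
	for {
		select {
		case <-ch:
			n++
		default:
			// Nothing left to receive right now.
			return n
		}
	}
}

func main() {
	ch := make(chan string, 4)
	ch <- "a"
	ch <- "b"
	fmt.Println("discarded:", drain(ch))
}
```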
Bjoern Rabenstein 6b37e47f9e Remove unused metrics.
Change-Id: Icf03ba4ce92a5e38daf12930f9661daba79c83bb
2014-11-25 17:09:03 +01:00
Bjoern Rabenstein 4fc8ad6677 Fix retrieval unit tests.
Change-Id: I299b71406b59539230e5182ccc37bc8a83af60b3
2014-11-25 17:08:45 +01:00
Bjoern Rabenstein b3ed9aa7a2 Clean up start-up and shut-down.
Change-Id: Idff4bbb0a15a9f879bfbb3da5b1025179cab5e2c
2014-11-25 17:08:45 +01:00
Bjoern Rabenstein 4447708c9f Fix a race in target.go.
Also, fix problems in shutdown.
Start-up of serving and shutdown still have to be cleaned up properly.
It's a mess.

Change-Id: I51061db12064e434066446e6fceac32741c4f84c
2014-11-25 17:08:45 +01:00
Bjoern Rabenstein 38fc24d0ed Fix targetpool_test.go and other tests.
Change-Id: I91a4dd1d39e01f174e1aaae653ce1ed7aecaa624
2014-11-25 17:08:26 +01:00
Julius Volz 7f5d3c2c29 Fix and improve the fp locker.
Benchmark:
$ go test -bench 'Fingerprint' -test.run 'Fingerprint' -test.cpu=1,2,4

OLD
BenchmarkFingerprintLockerParallel        500000              3618 ns/op
BenchmarkFingerprintLockerParallel-2      100000             12257 ns/op
BenchmarkFingerprintLockerParallel-4      500000             10164 ns/op
BenchmarkFingerprintLockerSerial        10000000               283 ns/op
BenchmarkFingerprintLockerSerial-2      10000000               284 ns/op
BenchmarkFingerprintLockerSerial-4      10000000               288 ns/op

NEW
BenchmarkFingerprintLockerParallel       1000000              1018 ns/op
BenchmarkFingerprintLockerParallel-2     1000000              1164 ns/op
BenchmarkFingerprintLockerParallel-4     2000000               910 ns/op
BenchmarkFingerprintLockerSerial        50000000                56.0 ns/op
BenchmarkFingerprintLockerSerial-2      50000000                47.9 ns/op
BenchmarkFingerprintLockerSerial-4      50000000                54.5 ns/op

Change-Id: I3c65a43822840e7e64c3c3cfe759e1de51272581
2014-11-25 17:07:45 +01:00
Bjoern Rabenstein e0a6cb281e Fix the accept header.
A '/' is a separator and has to be in a quoted string.

Change-Id: If7a3a847f84f8f709074d05dc98b5b21e954030c
2014-11-25 17:02:00 +01:00
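
The rule cited in this commit message comes from the HTTP grammar: a media type parameter value is either a token or a quoted string, and '/' is not allowed in a token. Go's standard `mime` package applies the same rule when formatting. The sketch below uses a made-up parameter (`schema="example/value"`); it is not the header Prometheus actually sends.

```go
package main

import (
	"fmt"
	"mime"
)

// buildAccept formats a media type with one parameter whose value contains
// a '/'. The parameter name and value are invented for this example.
func buildAccept() string {
	return mime.FormatMediaType("application/json", map[string]string{
		"schema": "example/value",
	})
}

// schemaParam parses a header value and returns its "schema" parameter.
func schemaParam(header string) (string, error) {
	_, params, err := mime.ParseMediaType(header)
	if err != nil {
		return "", err
	}
	return params["schema"], nil
}

func main() {
	h := buildAccept()
	fmt.Println(h) // the value is emitted as a quoted string

	v, err := schemaParam(h)
	fmt.Println(v, err)
}
```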
Brian Brazil 5edf689133 Stagger scrapes to spread out load.
Change-Id: Ib141b271e4adfb817886871f86051c207b05cf35
2014-11-25 17:02:00 +01:00
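
Staggering means that targets sharing a scrape interval do not all fire at the same instant. The commit does not say how the offsets are chosen, so the sketch below assumes the simplest scheme, an even spread of start offsets across one interval; a random initial delay would serve the same purpose. `staggerOffset` is a hypothetical name.

```go
package main

import (
	"fmt"
	"time"
)

// staggerOffset returns the delay before the first scrape of target i out
// of n targets, spreading the first scrapes evenly over one interval.
func staggerOffset(i, n int, interval time.Duration) time.Duration {
	if n <= 0 {
		return 0
	}
	return interval * time.Duration(i) / time.Duration(n)
}

func main() {
	const targets = 4
	interval := 8 * time.Second
	for i := 0; i < targets; i++ {
		fmt.Printf("target %d starts after %v\n", i, staggerOffset(i, targets, interval))
	}
}
```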
Bjoern Rabenstein 1909686789 Make metrics exported by the Prometheus server itself more consistent.
- Always spell out the time unit (e.g. milliseconds instead of ms).

- Remove "_total" from the names of metrics that are not counters.

- Make use of the "Namespace" and "Subsystem" fields in the options.

- Remove the "capacity" facet from all metrics about channels/queues.
  Capacities are fixed via command line flags and never change
  during the runtime of a process. Also, they should not be part of
  the same metric family. I have added separate metrics for the
  capacity of queues as a convenience. (They are set only once and
  never change.)

- I left "metric_disk_latency_microseconds" unchanged, although that
  metric measures the latency of the storage device, even if it is not
  a spinning disk. "SSD" is read by many as "solid state disk", so
  it's not too far off. (It should be "solid state drive", of course,
  but "metric_drive_latency_microseconds" is probably confusing.)

- Brian suggested not mixing "failure" and "success" outcomes in the
  same metric family (distinguished by labels). For now, I left it as
  it is. This touches a bigger issue, especially as other
  parts of the Prometheus ecosystem follow the same
  principle. We still need to agree on an approach and then change
  things consistently everywhere.

Change-Id: If799458b450d18f78500f05990301c12525197d3
2014-11-25 17:02:00 +01:00
Brian Brazil 4a2b96f848 Remove backoff on scrape failure.
Having metrics with variable timestamps inconsistently
spaced when things fail will make it harder to write correct rules.

Update the status page, which requires some refactoring to insert a function.

Change-Id: Ie1c586cca53b8f3b318af8c21c418873063738a8
2014-11-25 17:02:00 +01:00
Julius Volz 1bb7074fec Fix HTTP connection leak upon non-OK status.
Change-Id: Ie7fbd7dcc089b8306b40631be3e3d736c23c1cd3
2014-11-25 17:02:00 +01:00
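
This fix and the earlier "Treat non-200 HTTP response as error" commit concern the same code path. A common way to get both right in Go is to defer closing the response body immediately after the request succeeds, before inspecting the status code. The sketch below is a generic illustration; `fetch` and `newStatusServer` are hypothetical names and the actual scraper code is not shown in this log.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetch retrieves url and returns the body. Any status other than 200 is
// reported as an error. The body is closed on every return path, so the
// underlying connection is never leaked.
func fetch(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close() // runs for error statuses too

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("server returned HTTP status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// newStatusServer starts a local test server that always answers with code.
func newStatusServer(code int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(code)
		fmt.Fprint(w, "payload")
	}))
}

func main() {
	srv := newStatusServer(http.StatusInternalServerError)
	defer srv.Close()

	_, err := fetch(srv.URL)
	fmt.Println("error:", err)
}
```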
Bjoern Rabenstein bacc31d5cc Remove work-around that required copying all bytes of a scrape.
Now that the subtle bug in matttproud/golang_protobuf_extensions is
fixed, we do not need to copy the bytes of a scrape into a buffer
first before starting to parse it.

Change-Id: Ib73ecae16173ddd219cda56388a8f853332f8853
2014-11-25 17:01:59 +01:00
Bjoern Rabenstein 8956faeccb Migrate to new client_golang.
This change will only be submitted when the new client_golang has been
moved to the new version.

Change-Id: Ifceb59333072a08286a8ac910709a8ba2e3a1581
2014-11-25 17:01:59 +01:00
Bjoern Rabenstein 814e479723 Treat non-200 HTTP response as error.
Change-Id: I2a9f3b47012b3c4839be53aa44c66d16dd41a24a
2014-11-25 17:01:59 +01:00
Bjoern Rabenstein ca6a4fccef Weed out our homegrown test.Tester.
The Go stdlib has testing.TB now, which fulfills the exact same
purpose.

Change-Id: I0db9c73400e208ca376b932a02b7e3402234b87c
2014-05-21 19:27:24 +02:00
Brian Brazil 23255f1499 Fix negative Next Retrieval on status page.
Change-Id: Ifa754034660a251fee71f166dbf057697ec4e872
2014-05-12 15:24:34 +01:00
Bjoern Rabenstein 64811caaec Make Prometheus announce its new super-power: text format!
Change-Id: Ia2ddfb28999c145e4d46c395381a9bf89d43148c
2014-04-22 18:44:52 +02:00
Julius Volz 84df022025 Cleanup server address handling, support IPv6.
This fixes https://github.com/prometheus/prometheus/issues/377, as
IPv6 server addresses are now handled correctly.

Change-Id: Iebde7cfdadb0a52041472517e6fdcff4303a25ab
2014-03-09 23:31:30 +01:00
Julius Volz b382e8b7bd Remove overly verbose DNS-SD logging line.
Change-Id: Ie4534437ab88b9a6b99f5cb6c2f32c9588c1fff6
2014-01-24 16:09:41 +01:00
Julius Volz 0378c2ca1f Nonexistent labels in BY-clauses shouldn't propagate to result.
This fixes bug 2. of https://github.com/prometheus/prometheus/issues/374

Change-Id: Ia4a13153616bafce5bf10597966b071434422d09
2014-01-24 16:05:30 +01:00
Stuart Nelson 48a6326d25 Added DNS-SD lookup counter for successful/unsuccessful lookups
Change-Id: I0a71e994a989cecace280b5134a31ebc2ace7591
2013-12-16 08:48:56 -05:00
Julius Volz fb44580110 Cleanup/fix program termination sequence.
Change-Id: I2bc58a2583fb079c9ef383cfc7a5e0fbe613f1cd
2013-12-11 15:40:32 +01:00
Julius Volz 740d448983 Use custom timestamp type for sample timestamps and related code.
So far we've been using Go's native time.Time for anything related to sample
timestamps. Since the range of time.Time is much bigger than what we need, this
has created two problems:

- there could be time.Time values which were out of the range/precision of the
  time type that we persist to disk, therefore causing incorrectly ordered keys.
  One bug caused by this was:

  https://github.com/prometheus/prometheus/issues/367

  It would be good to use a timestamp type that's more closely aligned with
  what the underlying storage supports.

- sizeof(time.Time) is 192 bits, while Prometheus should be ok with a single 64-bit
  Unix timestamp (possibly even a 32-bit one). Since we store samples in large
  numbers, this seriously affects memory usage. Furthermore, copying/working
  with the data will be faster if it's smaller.

*MEMORY USAGE RESULTS*
Initial memory usage comparisons for a running Prometheus with 1 timeseries and
100,000 samples show roughly a 13% decrease in total (VIRT) memory usage. In my
tests, this advantage for some reason decreased a bit the more samples the
timeseries had (to 5-7% for millions of samples). This I can't fully explain,
but perhaps garbage collection issues were involved.

*WHEN TO USE THE NEW TIMESTAMP TYPE*
The new clientmodel.Timestamp type should be used whenever time
calculations are either directly or indirectly related to sample
timestamps.

For example:
- the timestamp of a sample itself
- all kinds of watermarks
- anything that may become or is compared to a sample timestamp (like the timestamp
  passed into Target.Scrape()).

When to still use time.Time:
- for measuring durations/times not related to sample timestamps, like duration
  telemetry exporting, timers that indicate how frequently to execute some
  action, etc.

*NOTE ON OPERATOR OPTIMIZATION TESTS*
We don't use operator optimization code anymore, but it still lives in
the code as dead code. It still has tests, but I couldn't get all of them to
pass with the new timestamp format. I commented out the failing cases for now,
but we should probably remove the dead code soon. I just didn't want to do that
in the same change as this.

Change-Id: I821787414b0debe85c9fffaeb57abd453727af0f
2013-12-03 09:11:28 +01:00
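
The idea behind this commit can be sketched as a small integer type with explicit conversions to and from `time.Time`. The type below is hypothetical: it is not the real `clientmodel.Timestamp`, and the millisecond unit is an assumption made for the example.

```go
package main

import (
	"fmt"
	"time"
	"unsafe"
)

// Timestamp is a hypothetical sample timestamp: milliseconds since the Unix
// epoch stored in a single int64. The unit is an assumption of this sketch.
type Timestamp int64

// FromTime truncates t to the precision that Timestamp can represent.
func FromTime(t time.Time) Timestamp {
	return Timestamp(t.UnixNano() / int64(time.Millisecond))
}

// Time converts the timestamp back to a time.Time.
func (ts Timestamp) Time() time.Time {
	ms := int64(ts)
	return time.Unix(ms/1000, (ms%1000)*int64(time.Millisecond))
}

// Before reports whether ts is earlier than other.
func (ts Timestamp) Before(other Timestamp) bool { return ts < other }

func main() {
	ts := FromTime(time.Now())
	fmt.Println("bytes per Timestamp:", unsafe.Sizeof(ts))
	fmt.Println("bytes per time.Time:", unsafe.Sizeof(time.Time{}))
	fmt.Println(ts.Time().UTC())
}
```

Because every value is truncated to the stored precision at conversion time, two timestamps that compare as ordered in memory keep the same order once persisted, which addresses the key ordering problem the commit message describes.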
Johannes 'fish' Ziemke 8c08a5031f Add search domain support to SRV lookups
This adds search domain support by trying to resolve a name by
appending each search domain configured in /etc/resolv.conf until
the query succeeds (NOERROR) and has at least one answer.

Change-Id: Ibdc5138c5d8cc049e11fab90c3d5243d5a06852c
2013-10-29 17:19:49 +01:00
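
The search procedure described in this commit can be sketched independently of any DNS library by passing the lookup in as a function. All names below (`lookupFunc`, `resolveWithSearch`) are hypothetical, and reading the search list from /etc/resolv.conf is left out; the sketch only shows the "try each domain until one query succeeds with at least one answer" loop.

```go
package main

import (
	"fmt"
	"strings"
)

// lookupFunc resolves a fully qualified name and returns its answers.
// It stands in for a real SRV query in this sketch.
type lookupFunc func(fqdn string) ([]string, error)

// resolveWithSearch appends each search domain to name in order and returns
// the answers of the first lookup that succeeds with at least one answer.
// Names that already end in a dot are looked up as they are.
func resolveWithSearch(name string, search []string, lookup lookupFunc) ([]string, error) {
	if strings.HasSuffix(name, ".") {
		return lookup(name)
	}
	for _, domain := range search {
		fqdn := name + "." + strings.Trim(domain, ".") + "."
		answers, err := lookup(fqdn)
		if err == nil && len(answers) > 0 {
			return answers, nil
		}
	}
	return nil, fmt.Errorf("no answer for %q in any search domain", name)
}

func main() {
	// Fake resolver that only knows one name.
	lookup := func(fqdn string) ([]string, error) {
		if fqdn == "_svc._tcp.example.org." {
			return []string{"host1.example.org.:80"}, nil
		}
		return nil, nil
	}
	answers, err := resolveWithSearch("_svc._tcp", []string{"corp.test", "example.org"}, lookup)
	fmt.Println(answers, err)
}
```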
Julius Volz 274934bcd3 Revert "Revert "Merge pull request #317 from prometheus/fix/miekg-dns-for-srv""
This reverts commit 88099328d1.

Change-Id: I7bf74de5fda458e2e6f9eea2eacd0e256f95bdee
2013-09-10 17:48:05 +02:00
Johannes 'fish' Ziemke 88099328d1 Revert "Merge pull request #317 from prometheus/fix/miekg-dns-for-srv"
This reverts commit e3bc6fc9dc, reversing
changes made to 1cf9e5840a.

Conflicts:
	retrieval/target_provider.go

Change-Id: Icb6e98fb30419e9e2fe9b686c243702ced372014
2013-08-30 16:32:51 +02:00
Julius Volz 788587426b Make scrape timeouts configurable per job.
Change-Id: I77a7514ad9e7969771f873d63d6353ec50082a62
2013-08-19 12:21:47 +02:00
Julius Volz d69b85e6c9 Add global label support via Ingesters.
2013-08-13 16:54:15 +02:00
Julius Volz 0003027dce Add needed trailing spaces in logs.
2013-08-12 18:22:48 +02:00
Julius Volz aa5d251f8d Use github.com/golang/glog for all logging.
2013-08-12 17:54:36 +02:00
Matt T. Proud a5141e4d0a Depointerize storage conf. and chain ingester.
The storage builders need to work with the assumption that they have
a copy of the underlying configuration data if any mutations are made.
2013-08-12 17:07:03 +02:00
Julius Volz f8b20f30ac Make retrieval work with client's new Ingester interface.
2013-08-12 15:15:41 +02:00
Julius Volz 3b970c5133 Add variable interpolation to notification messages.
This includes required refactorings to enable replacing the http client (for
testing) and moving the NotificationReq type definitions to the "notifications"
package, so that this package doesn't need to depend on "rules" anymore and
that it can instead use a representation of the required data which only
includes the necessary fields.
2013-08-12 12:29:08 +02:00
Julius Volz 35ee2cd3cb Add alertmanager notification support to Prometheus.
Alert definitions now also have mandatory SUMMARY and DESCRIPTION fields
that get sent along a firing alert to the alert manager.
2013-07-30 17:23:41 +02:00
Julius Volz 81f0b85013 Return [] instead of null for empty result vectors.
2013-07-25 12:16:32 +02:00
Julius Volz 331be19af6 Fix broken retrieval tests.
These have been broken since 06b4a40661
2013-07-25 12:15:00 +02:00
Matt T. Proud f7704af4f8 Code Review: Formatting comments.
2013-07-15 15:12:01 +02:00
Matt T. Proud 06b4a40661 Represent targets in a tabular interface.
This commit represents a target group's endpoints in a tabular fashion for better differentiation
of their state in a concise manner.
2013-07-15 15:12:01 +02:00
Matt T. Proud e20e6980e9 Completely extract response payload for decoding.
This commit forces the extraction framework to read the entire response payload
into a buffer before attempting to decode it, for the underlying Protocol Buffer
message readers do not block on partial messages.
2013-07-14 23:04:08 +02:00
Julius Volz 9a48f57b66 Continue scraping old targets on SD fail.
When we have trouble resolving the targets for a job via service
discovery, we shouldn't just stop scraping the targets we currently
have.
2013-07-12 22:38:42 +02:00
juliusv 24715f0ee5 Merge pull request #322 from prometheus/refactor/client/new-model
Include Accept header for telemetry request.
2013-06-27 09:52:00 -07:00
Matt T. Proud b8c7fd8c34 Include Accept header for telemetry request.
This pull request introduces a HTTP Accept header to indicate a
preference for Protocol Buffer-encoded messages.
2013-06-27 18:32:28 +02:00
Johannes 'fish' Ziemke 4bdf1adb6c Use github.com/miekg/dns for resolving SRV records
2013-06-26 16:04:25 +02:00
Matt T. Proud 30b1cf80b5 WIP - Snapshot of Moving to Client Model.
2013-06-25 15:52:42 +02:00