prometheus

Author	SHA1	Message	Date
Julius Volz	c4adfc4f25	Minor code cleanups. Change-Id: Ib3729cf38b107b7f2186ccf410a745e0472e3630	2014-02-13 15:24:43 +01:00
Julius Volz	94666e20b7	Minor test error reporting cleanup. Change-Id: Ie11c16b4e60de7c179c6d2a86e063f4432e2000f	2014-02-03 12:27:01 +01:00
Julius Volz	fd2158e746	Store copy of metric during fingerprint caching Problem description: ==================== If a rule evaluation referencing a metric/timeseries M happens at a time when M doesn't have a memory timeseries yet, looking up the fingerprint for M (via TieredStorage.GetMetricForFingerprint()) will create a new Metric object for M which gets both: a) attached to a new empty memory timeseries (so we don't have to ask disk for the Metric's fingerprint next time), and b) returned to the rule evaluation layer. However, the rule evaluation layer replaces the name label (and possibly other labels) of the metric with the name of the recorded rule. Since both the rule evaluator and the memory storage share a reference to the same Metric object, the original memory timeseries will now also be incorrectly renamed. Fix: ==== Instead of storing a reference to a shared metric object, take a copy of the object when creating an empty memory timeseries for caching purposes. Change-Id: I9f2172696c16c10b377e6708553a46ef29390f1e	2014-02-02 17:11:08 +01:00
Bjoern Rabenstein	c342ad33a0	Fix OperatorError. This used to work with Go 1.1, but only because of a compiler bug. The bug is fixed in Go 1.2, so we have to fix our code now. Change-Id: I5a9f3a15878afd750e848be33e90b05f3aa055e1	2014-01-21 16:49:51 +01:00
Stuart Nelson	0c58e388f6	rename curation metrics to prometheus_curation Change-Id: I6a0bf277e88ea8eb737670b7e865ae20f2cbfb91	2013-12-13 17:45:01 -05:00
Stuart Nelson	28f59edf16	Added telemetry for counting stored samples Change-Id: I0f36f7c2738d070ca2f107fcb315f98e46803af3	2013-12-12 10:06:41 -05:00
Tobias Schmidt	6947ee9bc9	Try to create metrics root directory if missing This change tries to be nice and create the metrics directory first before erroring out. Change-Id: I72691cdc32469708cd671c6ef1fb7db55fe60430	2013-12-03 18:16:13 +07:00
Julius Volz	740d448983	Use custom timestamp type for sample timestamps and related code. So far we've been using Go's native time.Time for anything related to sample timestamps. Since the range of time.Time is much bigger than what we need, this has created two problems: - there could be time.Time values which were out of the range/precision of the time type that we persist to disk, therefore causing incorrectly ordered keys. One bug caused by this was: https://github.com/prometheus/prometheus/issues/367 It would be good to use a timestamp type that's more closely aligned with what the underlying storage supports. - sizeof(time.Time) is 192 bits (24 bytes), while Prometheus should be OK with a single 64-bit Unix timestamp (possibly even a 32-bit one). Since we store samples in large numbers, this seriously affects memory usage. Furthermore, copying/working with the data will be faster if it's smaller. MEMORY USAGE RESULTS Initial memory usage comparisons for a running Prometheus with 1 timeseries and 100,000 samples show roughly a 13% decrease in total (VIRT) memory usage. In my tests, this advantage for some reason decreased a bit the more samples the timeseries had (to 5-7% for millions of samples). This I can't fully explain, but perhaps garbage collection issues were involved. WHEN TO USE THE NEW TIMESTAMP TYPE The new clientmodel.Timestamp type should be used whenever time calculations are either directly or indirectly related to sample timestamps. For example: - the timestamp of a sample itself - all kinds of watermarks - anything that may become or is compared to a sample timestamp (like the timestamp passed into Target.Scrape()). When to still use time.Time: - for measuring durations/times not related to sample timestamps, like duration telemetry exporting, timers that indicate how frequently to execute some action, etc. NOTE ON OPERATOR OPTIMIZATION TESTS We don't use operator optimization code anymore, but it still lives in the code as dead code. It still has tests, but I couldn't get all of them to pass with the new timestamp format. I commented out the failing cases for now, but we should probably remove the dead code soon. I just didn't want to do that in the same change as this. Change-Id: I821787414b0debe85c9fffaeb57abd453727af0f	2013-12-03 09:11:28 +01:00
Julius Volz	6b7de31a3c	Upgrade to LevelDB 1.14.0 to fix LevelDB bugs. This tentatively fixes https://github.com/prometheus/prometheus/issues/368 due to an upstream bugfix in snapshotted LevelDB iterator handling, which got fixed in LevelDB 1.14.0: https://code.google.com/p/leveldb/issues/detail?id=200 Change-Id: Ib0cc67b7d3dc33913a1c16736eff32ef702c63bf	2013-12-03 09:07:15 +01:00
Julius Volz	db015de65b	Comment and "go fmt" fixups in compaction tests. Change-Id: Iaa0eda6a22a5caa0590bae87ff579f9ace21e80a	2013-10-30 17:06:17 +01:00
Julius Volz	51408bdfe8	Merge changes I3ffeb091,Idffefea4 * changes: Add chunk sanity checking to dumper tool. Add compaction regression tests.	2013-10-24 13:58:14 +02:00
Julius Volz	2162e57784	Merge "Fix watermarker default time / LevelDB key ordering bug."	2013-10-24 13:57:48 +02:00
Julius Volz	5e18255920	Merge "Fix chunk corruption compaction bug."	2013-10-24 13:57:31 +02:00
Julius Volz	eb461a707d	Add chunk sanity checking to dumper tool. Also, move codecs/filters to common location so they can be used in subsequent test. Change-Id: I3ffeb09188b8f4552e42683cbc9279645f45b32e	2013-10-23 01:06:49 +02:00
Julius Volz	6ea22f2bf9	Add compaction regression tests. This adds regression tests that catch the two error cases reported in https://github.com/prometheus/prometheus/issues/367 It also adds a commented-out test case for the crash in https://github.com/prometheus/prometheus/issues/368 but there's no fix for the latter crash yet. Change-Id: Idffefea4ed7cc281caae660bcad2e3c13ec3bd17	2013-10-23 01:06:28 +02:00
Conor Hennessy	9a48010cec	Add a check for metrics directory existence. Previously on startup the program would just quit without stating explicitly why. Change-Id: I833b85eb74d2dd27cdc3f0f2e65d7bb1c42caa39	2013-10-22 20:54:34 +02:00
Julius Volz	b5f6e3c90c	Fix watermarker default time / LevelDB key ordering bug. This fixes part 2) of https://github.com/prometheus/prometheus/issues/367 (uninitialized time.Time mapping to a higher LevelDB key than "normal" timestamps). Change-Id: Ib079974110a7b7c4757948f81fc47d3d29ae43c9	2013-10-21 14:32:21 +02:00
Julius Volz	a1a97ed064	Fix chunk corruption compaction bug. This fixes part 1) of https://github.com/prometheus/prometheus/issues/367 (the storing of samples with the wrong fingerprint into a compacted chunk, thus corrupting it). Change-Id: I4c36d0d2e508e37a0aba90b8ca2ecc78ee03e3f1	2013-10-21 14:30:22 +02:00
Matt T. Proud	86fcbe5bde	Retain DTO on each cycle. Change-Id: Ifc6f68f98eacb01097771d0dbf043c98bba1d518	2013-09-05 10:14:34 +02:00
Matt T. Proud	4a87c002e8	Update low-level i'faces to reflect wireformats. This commit fixes a critique of the old storage API design, whereby the input parameters were always raw bytes and never Protocol Buffer messages that encapsulated the data, meaning every place a read or mutation was conducted needed to manually perform said translations on its own. This is taxing. Change-Id: I4786938d0d207cefb7782bd2bd96a517eead186f	2013-09-04 17:13:58 +02:00
Matt T. Proud	7910f6e863	Prevent total storage locking during memory flush. While a hack, this change should allow us to serve queries expeditiously during a flush operation. Change-Id: I9a483fd1dd2b0638ab24ace960df08773c4a5079	2013-08-29 11:33:38 +02:00
Matt T. Proud	12d5e6ca5a	Curation should not starve user-interactive ops. The background curation should be staggered to ensure that disk I/O yields to user-interactive operations in a timely manner. The lack of routine prioritization necessitates this. Change-Id: I9b498a74ccd933ffb856e06fedc167430e521d86	2013-08-26 19:40:55 +02:00
Matt T. Proud	2b42fd0068	Snapshot of no more frontier. Change-Id: Icd52da3f52bfe4529829ea70b4865ed7c9f6c446	2013-08-23 17:13:58 +02:00
Matt T. Proud	7db518d3a0	Abstract high watermark cache into standard LRU. Conflicts: storage/metric/memory.go storage/metric/tiered.go storage/metric/watermark.go Change-Id: Iab2aedbd8f83dc4ce633421bd4a55990fa026b85	2013-08-19 12:26:55 +02:00
Matt T. Proud	d74c2c54d4	Interfacification of stream. Move the stream to an interface, for a number of additional changes around it are underway. Conflicts: storage/metric/memory.go Change-Id: I4a5fc176f4a5274a64ebdb1cad52600954c463c3	2013-08-16 17:35:21 +02:00
Matt T. Proud	c262907fec	Kill interface cruft. These pieces were never used and should be thusly removed. Change-Id: I8dd151ec4c40b6d3ccffad1bb9b8b75a92e9ee37	2013-08-15 11:39:07 +02:00
Matt T. Proud	b23acccea8	Kill AppendSample interface definition. AppendSample will be replaced with AppendSamples, which will take advantage of bulk appends. This is a necessary step for indexing pipeline decoupling. Change-Id: Ia83811a87bcc89973d3b64d64b85a28710253ebc	2013-08-15 11:35:50 +02:00
Matt T. Proud	aaaf3367d6	Include forgotten imports. This fixes the build. Change-Id: Id132f4342adb9ed20116191086f157ca7f7cf515	2013-08-14 18:52:55 +02:00
Matt T. Proud	acf91f38bd	Build layered indexers. The indexers will be extracted in a short while and wrapped accordingly with these types. Change-Id: I4d1abda4e46117210babad5aa0d42f9ca1f6594f	2013-08-14 13:32:53 +02:00
Matt T. Proud	972e856d9b	Kill the curation state channel. The use of the channels for curation state were always unidiomatic. Change-Id: I1cb1d7175ebfb4faf28dff84201066278d6a0d92	2013-08-13 17:20:22 +02:00
Matt T. Proud	1ceb25b701	Publication of LevelDBMetricPersistence Fields. This will enable us to break down the onerous construction method. Change-Id: Ia89337ba39d6745af6757180af2485ec8a990a3b	2013-08-13 00:36:12 +02:00
Julius Volz	0003027dce	Add needed trailing spaces in logs.	2013-08-12 18:22:48 +02:00
Julius Volz	aa5d251f8d	Use github.com/golang/glog for all logging.	2013-08-12 17:54:36 +02:00
Matt T. Proud	a5141e4d0a	Depointerize storage conf. and chain ingester. The storage builders need to work with the assumption that they have a copy of the underlying configuration data if any mutations are made.	2013-08-12 17:07:03 +02:00
Matt T. Proud	820e551988	Code Review: Nits.	2013-08-07 13:29:10 +02:00
Matt T. Proud	a3bf2efdd5	Replace index writes with wrapped interface. This commit is the first of several and should not be regarded as the desired end state for these cleanups. What this one does, however, is wrap the query index writing behind an interface type that can be injected into the storage stack and have its lifecycle managed separately as needed. It would also mean we can very easily swap out underlying implementations to support remote indexing, buffering, or no-op indexing. In the future, most of the individual index interface members in the tiered storage will go away in favor of agents that can query and resolve what they need from the datastore without the user knowing how and why they work.	2013-08-07 12:15:48 +02:00
Matt T. Proud	52664f701a	Hot Fix: Use extracted time.	2013-08-06 14:18:02 +02:00
Matt T. Proud	38dac35b3e	Code Review: Short name consistency.	2013-08-06 12:38:35 +02:00
Matt T. Proud	a00f18d78b	Code Review: Manual re-alignment.	2013-08-06 12:23:06 +02:00
Matt T. Proud	cc989c68e1	Replace direct curation table access with wrapper.	2013-08-06 12:02:52 +02:00
Matt T. Proud	07ac921aec	Code Review: First pass.	2013-08-05 17:31:49 +02:00
Matt T. Proud	d8792cfd86	Extract HighWatermarking. Clean up the rest.	2013-08-05 11:03:03 +02:00
Matt T. Proud	f4669a812c	Extract index storage into separate types.	2013-08-04 15:31:52 +02:00
Matt T. Proud	772d3d6b11	Consolidate LevelDB storage construction. There are too many parameters to constructing a LevelDB storage instance for a construction method, so I've opted to take an idiomatic approach of embedding them in a struct for easier mediation and versioning.	2013-08-03 17:25:03 +02:00
Julius Volz	e3415e953f	Add notifications telemetry.	2013-07-31 12:40:56 +02:00
juliusv	927435d68e	Merge pull request #333 from prometheus/round-time Round time to nearest second in memory storage.	2013-07-16 05:52:31 -07:00
Julius Volz	5d88e8cc45	Round time to nearest second in memory storage. When samples get flushed to disk, they lose sub-second precision anyways. By already dropping sub-second precision, data fetched from memory vs. disk will behave the same. Later, we should consider also storing a more compact representation than time.Time in memory if we're not going to use its full precision.	2013-07-16 14:51:54 +02:00
Matt T. Proud	f7704af4f8	Code Review: Formatting comments.	2013-07-15 15:12:01 +02:00
Julius Volz	a76a797f3f	Always treat series without watermarks as too old. Current series always get watermarks written out upon append now. This drops support for old series without any watermarks by always reporting them as too old (stale) during queries.	2013-06-27 17:10:06 +02:00
Julius Volz	d2da21121c	Implement getValueRangeAtIntervalOp for faster range queries. This also short-circuits optimize() for now, since it is complex to implement for the new operator, and ops generated by the query layer already fulfill the needed invariants. We should still investigate later whether to completely delete operator optimization code or extend it to support getValueRangeAtIntervalOp operators.	2013-06-26 18:10:36 +02:00
Julius Volz	e7f049c85b	Fix expunging of empty memory series (loop var pointerization bug)	2013-06-26 18:00:47 +02:00
Julius Volz	baa5b07829	Fix condition for dropping empty memory series.	2013-06-25 17:57:35 +02:00
Matt T. Proud	30b1cf80b5	WIP - Snapshot of Moving to Client Model.	2013-06-25 15:52:42 +02:00
juliusv	42198c1f1c	Merge pull request #311 from prometheus/fix/watermarking/on-first-write Ensure new metrics are watermarked early.	2013-06-25 03:13:58 -07:00
Matt T. Proud	4137c75523	Shrink default LRU cache sizes. Observing Prometheus in production confirms we can lower these values safely.	2013-06-24 12:09:16 +02:00
Matt T. Proud	ecb9c7bb9d	Code Review: Swap ordering of elements.	2013-06-21 21:17:50 +02:00
Matt T. Proud	5daa0a09ea	Code Review: Swap ordering of watermark getting. A test for Julius.	2013-06-21 18:34:08 +02:00
Matt T. Proud	ee840904d2	Code Review: !Before -> After.	2013-06-21 18:26:40 +02:00
Matt T. Proud	2d5de99fbf	Regard in-memory series as new. This commit ensures that series that exist only in-memory and not on-disk are not regarded as too old for operation exclusion.	2013-06-21 18:26:39 +02:00
Matt T. Proud	81c406630a	Merge pull request #312 from prometheus/fix/sample-append-logging Log correct sample count when appending to disk.	2013-06-21 08:55:51 -07:00
Matt T. Proud	a1a23fbaf8	Ensure new metrics are watermarked early. With the checking of fingerprint freshness to cull stale metrics from queries, we should write watermarks early to aid in more accurate responses.	2013-06-21 16:38:46 +02:00
Julius Volz	ba8c122147	Log correct sample count when appending to disk.	2013-06-21 12:23:27 +02:00
Julius Volz	f2b4067b7b	Speedup and clean up operation optimization.	2013-06-20 03:01:13 +02:00
Julius Volz	008bc09da8	Move check for empty memory series to separate method.	2013-06-19 14:19:53 +02:00
Julius Volz	16364eda37	Drop empty series from memory after flushing.	2013-06-19 12:14:23 +02:00
Julius Volz	71199e2c93	Cache disk fingerprint->metric lookups in memory.	2013-06-18 14:08:58 +02:00
Matt T. Proud	a73f061d3c	Persist solely Protocol Buffers. A design question that was open for me in the beginning was whether to serialize other types to disk, but Protocol Buffers quickly won out, which allows us to drop support for other types. This is a good start to cleaning up a lot of cruft in the storage stack and can let us eventually decouple the various moving parts into separate subsystems for easier reasoning. This commit is not strictly required, but it is a start to making the rest a lot more enjoyable to interact with.	2013-06-08 11:02:35 +02:00
juliusv	95400cb785	Merge pull request #290 from prometheus/fix/go-vet Minor "go tool vet" cleanups	2013-06-07 06:52:48 -07:00
Julius Volz	558281890b	Minor "go tool vet" cleanups	2013-06-07 15:34:41 +02:00
juliusv	615972dd01	Merge pull request #288 from prometheus/fix/curator/fallthrough-compaction-ordering Fix fallthrough compaction value ordering.	2013-06-07 05:46:15 -07:00
Matt T. Proud	86f63b078b	Fix fallthrough compaction value ordering. We discovered a regression whereby data chunks could be appended out of order if the fallthrough case was hit.	2013-06-07 14:41:00 +02:00
Julius Volz	7b9ee95030	Minor LevelDB watermark handling cleanups.	2013-06-06 23:56:31 +02:00
Julius Volz	84741b227d	Use LRU cache to avoid querying stale series.	2013-06-06 23:56:19 +02:00
Julius Volz	f98853d7b7	Fix type error in watermark list handling.	2013-06-06 23:56:14 +02:00
Matt T. Proud	ef1d5fd8a2	Introduce semaphores for tiered storage. This commit wraps the tiered storage access components in semaphores, since we can handle several concurrent memory reads.	2013-06-06 18:16:18 +02:00
Matt T. Proud	819045541e	Code Review: Make double-drain a panic.	2013-06-06 12:40:06 +02:00
Matt T. Proud	e217a9fb41	Race Work: Make memory arena locks more coarse. We can optimize these as needed later.	2013-06-06 12:08:20 +02:00
Matt T. Proud	beaaf386e7	Add storage state guards and transition callbacks. To ensure that we access tiered storage in the proper way, we have guards now.	2013-06-06 11:52:09 +02:00
Matt T. Proud	abb5353ade	Merge pull request #283 from prometheus/feature/storage/consult-watermark Include LRU cache for fingerprint watermarks.	2013-06-06 02:33:45 -07:00
Matt T. Proud	2c3df44af6	Ensure database access waits until it is started. This commit introduces a channel message to ensure serving state has been reached with the storage stack before anything attempts to use it.	2013-06-06 10:42:21 +02:00
Matt T. Proud	cbe2f3a7b1	Include LRU cache for fingerprint watermarks.	2013-06-06 10:13:18 +02:00
Julius Volz	51689d965d	Add debug timers to instant and range queries. This adds timers around several query-relevant code blocks. For now, the query timer stats are only logged for queries initiated through the UI. In other cases (rule evaluations), the stats are simply thrown away. My hope is that this helps us understand where queries spend time, especially in cases where they sometimes hang for unusual amounts of time.	2013-06-05 18:32:54 +02:00
Matt T. Proud	8339a189cb	Code Review: Fix seriesPresent scope. The seriesPresent scope should be constrained to the scope of a scanJob, since this is keyed to given series.	2013-06-04 13:16:59 +02:00
Matt T. Proud	fe41ce0b19	Conditionalize disk initializations. This commit conditionalizes the creation of the diskFrontier and seriesFrontier along with the iterator such that they are provisioned once something is actually required from disk.	2013-06-04 12:53:57 +02:00
Julius Volz	a8468a2e5e	Fix reversed disk flush cutoff behavior.	2013-05-28 16:14:30 +02:00
Julius Volz	eb1f956909	Revert "Revert "Ensure that all extracted samples are added to view."" This reverts commit `4b30fb86b4`.	2013-05-28 14:36:03 +02:00
Matt T. Proud	4b30fb86b4	Revert "Ensure that all extracted samples are added to view." This reverts commit `008314b5a8`. By running an automated git bisection described in https://gist.github.com/matttproud-soundcloud/22a371a8d2cba382ea64 this commit was found.	2013-05-23 13:36:22 +02:00
Julius Volz	750f862d9a	Use GetBoundaryValues() for non-counter deltas.	2013-05-22 19:13:47 +02:00
Julius Volz	f2b48b8c4a	Make getValuesAtIntervalOp consume all chunk data in one pass. This is mainly a small performance improvement, since we skip past the last extracted time immediately if it was also the last sample in the chunk, instead of trying to extract non-existent values before the chunk end again and again and only gradually approaching the end of the chunk.	2013-05-22 18:14:45 +02:00
Julius Volz	83d60bed89	extractValuesAroundTime() code simplification.	2013-05-22 18:14:45 +02:00
Julius Volz	008314b5a8	Ensure that all extracted samples are added to view. The current behavior only adds those samples to the view that are extracted by the last pass of the last processed op and throws other ones away. This is a bug. We need to append all samples that are extracted by each op pass. This also makes view.appendSamples() take an array of samples.	2013-05-22 18:14:37 +02:00
Matt T. Proud	b586801830	Code Review: Fix to-disk queue infinite growth. We discovered a bug while manually testing this branch on a live instance, whereby the to-disk queue was never actually dumped to disk.	2013-05-22 17:59:53 +02:00
Matt T. Proud	285a8b701b	Code Review: Extend lock.	2013-05-22 17:59:53 +02:00
Matt T. Proud	2526ab8c81	Code Review: Extend lock scope for appending.	2013-05-22 17:59:53 +02:00
Matt T. Proud	f994482d15	Code Review: Avenues for future improvement noted.	2013-05-22 17:59:53 +02:00
Matt T. Proud	298a90c143	Code Review: Initial arena size name.	2013-05-22 17:59:53 +02:00
Matt T. Proud	c07abf8521	Initial move away from skiplist.	2013-05-22 17:59:53 +02:00
Matt T. Proud	74a66fd938	Spawn grouping of fingerprints with free semaphore. The previous implementation spawned N goroutines to group samples together and would not start work until the semaphore unblocked. While this didn't leak, it polluted the scheduling space. Thusly, the routine only starts after a semaphore has been acquired.	2013-05-21 16:11:35 +02:00
Julius Volz	5b105c77fc	Repointerize fingerprints.	2013-05-21 14:28:14 +02:00
Matt T. Proud	ec5b5bae28	Fuck you, Travis.	2013-05-21 09:42:00 +02:00
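
The aliasing problem described in fd2158e746 ("Store copy of metric during fingerprint caching") comes down to Go map semantics: a map value refers to shared storage, so caching it and also handing it to the rule evaluator lets a rename in one place leak into the other. The Go sketch below reproduces the bug and the fix in miniature. `Metric`, `memorySeries`, `cacheShared`, and `cacheCopy` are hypothetical simplifications, not the actual Prometheus types of that era, and the label names are illustrative only.

```go
package main

import "fmt"

// Metric is a hypothetical stand-in for a label set. Go maps are reference
// types: assigning one copies the reference, not the contents.
type Metric map[string]string

// Clone returns an independent copy of m.
func (m Metric) Clone() Metric {
	c := make(Metric, len(m))
	for k, v := range m {
		c[k] = v
	}
	return c
}

// memorySeries is a hypothetical, empty in-memory series created only to
// cache the fingerprint -> metric mapping.
type memorySeries struct {
	metric Metric
}

// cacheShared reproduces the bug: the cache and the caller share one map.
func cacheShared(cache map[uint64]*memorySeries, fp uint64, m Metric) {
	cache[fp] = &memorySeries{metric: m}
}

// cacheCopy reproduces the fix: the cache keeps its own copy.
func cacheCopy(cache map[uint64]*memorySeries, fp uint64, m Metric) {
	cache[fp] = &memorySeries{metric: m.Clone()}
}

func main() {
	shared := map[uint64]*memorySeries{}
	copied := map[uint64]*memorySeries{}

	m1 := Metric{"name": "http_requests_total", "job": "api"}
	m2 := Metric{"name": "http_requests_total", "job": "api"}
	cacheShared(shared, 1, m1)
	cacheCopy(copied, 1, m2)

	// The rule evaluator renames the metric it was handed.
	m1["name"] = "http_requests:rate"
	m2["name"] = "http_requests:rate"

	fmt.Println(shared[1].metric["name"]) // http_requests:rate (cached series renamed too)
	fmt.Println(copied[1].metric["name"]) // http_requests_total
}
```

Copying on insertion costs one map allocation per newly cached series, which is paid once per series rather than once per query.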


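Commit 74a66fd938 describes acquiring the semaphore before the `go` statement, so that grouping goroutines are only created once a slot is free instead of spawning N goroutines that all park on the semaphore. A minimal sketch of that pattern follows, using a buffered channel as a counting semaphore. `sample` and `groupBatches` are hypothetical, and the `peak` counter exists only to make the concurrency bound observable.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// sample is a hypothetical (fingerprint, value) pair.
type sample struct {
	fp uint64
	v  float64
}

// groupBatches groups each batch by fingerprint concurrently with at most
// limit live workers. It returns the groups and the peak number of live workers.
func groupBatches(batches [][]sample, limit int) ([]map[uint64][]float64, int64) {
	sem := make(chan struct{}, limit)
	out := make([]map[uint64][]float64, len(batches))
	var wg sync.WaitGroup
	var live, peak int64

	for i, batch := range batches {
		sem <- struct{}{} // acquire BEFORE spawning: no goroutine exists while all slots are busy
		wg.Add(1)
		go func(i int, batch []sample) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when this worker finishes

			n := atomic.AddInt64(&live, 1)
			for { // record the highest number of concurrently live workers
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}

			groups := make(map[uint64][]float64)
			for _, s := range batch {
				groups[s.fp] = append(groups[s.fp], s.v)
			}
			out[i] = groups // each worker writes a distinct index: no data race
			atomic.AddInt64(&live, -1)
		}(i, batch)
	}
	wg.Wait()
	return out, atomic.LoadInt64(&peak)
}

func main() {
	batches := [][]sample{
		{{fp: 1, v: 0.5}, {fp: 2, v: 1.5}, {fp: 1, v: 2.5}},
		{{fp: 3, v: 4}},
	}
	groups, peak := groupBatches(batches, 1)
	fmt.Println(groups[0][1], groups[1][3], peak) // [0.5 2.5] [4] 1
}
```

Acquiring in the spawning loop applies back-pressure to the producer. Acquiring inside each goroutine would bound the work just as well, but would still leave up to N parked goroutines, which is the scheduler pollution the commit message mentions.
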
324 Commits (9b33cfc457cff3b51252e150f6e17d0f58d6a01e)