This change is conceptually very simple, although the diff is large. It
switches logging from "github.com/golang/glog" to
"github.com/prometheus/log", while not actually changing any log
messages. V(1)-style logging has been changed to be log.Debug*().
With this commit, sending SIGHUP to the Prometheus process will reload
and apply the configuration file. The different components attempt
to handle failing changes gracefully.
This commits renames the RuleManager to Manager as the package
name is 'rules' now. The unused layer of abstraction of the
RuleManager interface is removed.
This adds the population standard deviation and
variance as aggregation functions, useful for
spotting how many standard deviations some samples
are from the mean.
The one central sample ingestion channel has caused a variety of
trouble. This commit removes it. Targets and rule evaluation call an
Append method directly now. To incorporate multiple storage backends
(like OpenTSDB), storage.Tee forks the Append into two different
appenders.
Note that the tsdb queue manager had its own queue anyway. It was a
queue after a queue... Much queue, so overhead...
Targets have their own little buffer (implemented as a channel) to
avoid stalling during an http scrape. But a new scrape will only be
started once the old one is fully ingested.
The contraption of three pipelined ingesters was removed. A Target is
an ingester itself now. Despite more logic in Target, things should be
less confusing now.
Also, remove lint and vet warnings in ast.go.
- Move CONTRIBUTORS.md to the more common AUTHORS.
- Added the required NOTICE file.
- Changed "Prometheus Team" to "The Prometheus Authors".
- Reverted the erroneous changes to the Apache License.
Essentially:
- Remove unused code.
- Make it 'go vet' clean. The only remaining warnings are in generated code.
- Make it 'golint' clean. The only remaining warnings are in gerenated code.
- Smoothed out same minor things.
Change-Id: I3fe5c1fbead27b0e7a9c247fee2f5a45bc2d42c6
- Always spell out the time unit (e.g. milliseconds instead of ms).
- Remove "_total" from the names of metrics that are not counters.
- Make use of the "Namespace" and "Subsystem" fields in the options.
- Removed the "capacity" facet from all metrics about channels/queues.
These are all fixed via command line flags and will never change
during the runtime of a process. Also, they should not be part of
the same metric family. I have added separate metrics for the
capacity of queues as convenience. (They will never change and are
only set once.)
- I left "metric_disk_latency_microseconds" unchanged, although that
metric measures the latency of the storage device, even if it is not
a spinning disk. "SSD" is read by many as "solid state disk", so
it's not too far off. (It should be "solid state drive", of course,
but "metric_drive_latency_microseconds" is probably confusing.)
- Brian suggested to not mix "failure" and "success" outcome in the
same metric family (distinguished by labels). For now, I left it as
it is. We are touching some bigger issue here, especially as other
parts in the Prometheus ecosystem are following the same
principle. We still need to come to terms here and then change
things consistently everywhere.
Change-Id: If799458b450d18f78500f05990301c12525197d3
Add a function to bypass the new auto-escaping.
Add a function to workaround go's templates only allowing passing in one argument.
Change-Id: Id7aa3f95e7c227692dc22108388b1d9b1e2eec99
- Always spell out the time unit (e.g. milliseconds instead of ms).
- Remove "_total" from the names of metrics that are not counters.
- Make use of the "Namespace" and "Subsystem" fields in the options.
- Removed the "capacity" facet from all metrics about channels/queues.
These are all fixed via command line flags and will never change
during the runtime of a process. Also, they should not be part of
the same metric family. I have added separate metrics for the
capacity of queues as convenience. (They will never change and are
only set once.)
- I left "metric_disk_latency_microseconds" unchanged, although that
metric measures the latency of the storage device, even if it is not
a spinning disk. "SSD" is read by many as "solid state disk", so
it's not too far off. (It should be "solid state drive", of course,
but "metric_drive_latency_microseconds" is probably confusing.)
- Brian suggested to not mix "failure" and "success" outcome in the
same metric family (distinguished by labels). For now, I left it as
it is. We are touching some bigger issue here, especially as other
parts in the Prometheus ecosystem are following the same
principle. We still need to come to terms here and then change
things consistently everywhere.
Change-Id: If799458b450d18f78500f05990301c12525197d3
Add a function to bypass the new auto-escaping.
Add a function to workaround go's templates only allowing passing in one argument.
Change-Id: Id7aa3f95e7c227692dc22108388b1d9b1e2eec99
Move rulemanager to it's own package to break cicrular dependency.
Make NewTestTieredStorage available to tests, remove duplication.
Change-Id: I33b321245a44aa727bfc3614a7c9ae5005b34e03
This was initially motivated by wanting to distribute the rule checker
tool under `tools/rule_checker`. However, this was not possible without
also distributing the LevelDB dynamic libraries because the tool
transitively depended on Levigo:
rule checker -> query layer -> tiered storage layer -> leveldb
This change separates external storage interfaces from the
implementation (tiered storage, leveldb storage, memory storage) by
putting them into separate packages:
- storage/metric: public, implementation-agnostic interfaces
- storage/metric/tiered: tiered storage implementation, including memory
and LevelDB storage.
I initially also considered splitting up the implementation into
separate packages for tiered storage, memory storage, and LevelDB
storage, but these are currently so intertwined that it would be another
major project in itself.
The query layers and most other parts of Prometheus now have notion of
the storage implementation anymore and just use whatever implementation
they get passed in via interfaces.
The rule_checker is now a static binary :)
Change-Id: I793bbf631a8648ca31790e7e772ecf9c2b92f7a0
So far we've been using Go's native time.Time for anything related to sample
timestamps. Since the range of time.Time is much bigger than what we need, this
has created two problems:
- there could be time.Time values which were out of the range/precision of the
time type that we persist to disk, therefore causing incorrectly ordered keys.
One bug caused by this was:
https://github.com/prometheus/prometheus/issues/367
It would be good to use a timestamp type that's more closely aligned with
what the underlying storage supports.
- sizeof(time.Time) is 192, while Prometheus should be ok with a single 64-bit
Unix timestamp (possibly even a 32-bit one). Since we store samples in large
numbers, this seriously affects memory usage. Furthermore, copying/working
with the data will be faster if it's smaller.
*MEMORY USAGE RESULTS*
Initial memory usage comparisons for a running Prometheus with 1 timeseries and
100,000 samples show roughly a 13% decrease in total (VIRT) memory usage. In my
tests, this advantage for some reason decreased a bit the more samples the
timeseries had (to 5-7% for millions of samples). This I can't fully explain,
but perhaps garbage collection issues were involved.
*WHEN TO USE THE NEW TIMESTAMP TYPE*
The new clientmodel.Timestamp type should be used whenever time
calculations are either directly or indirectly related to sample
timestamps.
For example:
- the timestamp of a sample itself
- all kinds of watermarks
- anything that may become or is compared to a sample timestamp (like the timestamp
passed into Target.Scrape()).
When to still use time.Time:
- for measuring durations/times not related to sample timestamps, like duration
telemetry exporting, timers that indicate how frequently to execute some
action, etc.
*NOTE ON OPERATOR OPTIMIZATION TESTS*
We don't use operator optimization code anymore, but it still lives in
the code as dead code. It still has tests, but I couldn't get all of them to
pass with the new timestamp format. I commented out the failing cases for now,
but we should probably remove the dead code soon. I just didn't want to do that
in the same change as this.
Change-Id: I821787414b0debe85c9fffaeb57abd453727af0f
The ConsoleLinkForExpression() function now escapes console URLs in such a way
that works both in emails and in HTML.
Change-Id: I917bae0b526cbbac28ccd2a4ec3c5ac03ee4c647
This includes required refactorings to enable replacing the http client (for
testing) and moving the NotificationReq type definitions to the "notifications"
package, so that this package doesn't need to depend on "rules" anymore and
that it can instead use a representation of the required data which only
includes the necessary fields.
This commit adds telemetry for the Prometheus expression rule
evaluator, which will enable meta-Prometheus monitoring of customers
to ensure that no instance is falling behind in answering routine
queries.
A few other sundry simplifications are introduced, too.