Figuring out what's going on with the new service discovery
and labels is difficult. Add a popover with the labels
to the target table to make things simpler, and help
discovery of potentially useful labels.
Previously we redirected any non-existent path to the root (or path
prefix).
The new behavior:
With no path prefix:
- "" -> "/"
- "/biz" -> 404
With path prefix of "/foo/bar":
- "" -> "/foo/bar/"
- "/" -> "/foo/bar/"
- "/foo/bar" -> "/foo/bar/"
- "/biz" -> /foo/bar/biz"
(anything not starting with the path prefix gets the prefix prepended)
- "/foo/bar/biz" -> 404
This change is conceptually very simple, although the diff is large. It
switches logging from "github.com/golang/glog" to
"github.com/prometheus/log", while not actually changing any log
messages. V(1)-style logging has been changed to be log.Debug*().
Appending to the storage can block for a long time. Timing out
scrapes can also cause longer blocks. This commit avoids that those
blocks affect other compnents than the target itself.
Also the Target interface was removed.
The target implementation and interface contain methods only serving a
specific purpose of the templates. They were moved to the template
as they operate on more fundamental target data.
With this commit, sending SIGHUP to the Prometheus process will reload
and apply the configuration file. The different components attempt
to handle failing changes gracefully.
This commits renames the RuleManager to Manager as the package
name is 'rules' now. The unused layer of abstraction of the
RuleManager interface is removed.
This commit shifts responsibility for maintaining targets from providers and
pools to the target manager. Target groups have a source name that identifies
them for updates.
This adds the population standard deviation and
variance as aggregation functions, useful for
spotting how many standard deviations some samples
are from the mean.
Don't handle `0` as a special timestamp value for "now" anymore, except
in the `QueryRange()` case, where existing API consumers still expect
`0` to mean "now".
Also, properly return errors now for malformed timestamp/duration
float values.
/api/targets was undocumented and never used and also broken.
Showing instance and job labels on the status page (next to targets)
does not make sense as those labels are set in an obvious way.
Also add a doc comment to TargetStateToClass.
The one central sample ingestion channel has caused a variety of
trouble. This commit removes it. Targets and rule evaluation call an
Append method directly now. To incorporate multiple storage backends
(like OpenTSDB), storage.Tee forks the Append into two different
appenders.
Note that the tsdb queue manager had its own queue anyway. It was a
queue after a queue... Much queue, so overhead...
Targets have their own little buffer (implemented as a channel) to
avoid stalling during an http scrape. But a new scrape will only be
started once the old one is fully ingested.
The contraption of three pipelined ingesters was removed. A Target is
an ingester itself now. Despite more logic in Target, things should be
less confusing now.
Also, remove lint and vet warnings in ast.go.
- original series data is saved so it can be re-transformed after
Rickshaw's stacking modified the series data
- always reconstruct graphs from scratch instead of updating the
settings of an existing one (simplification)
- always wipe and recreate all graph-related DOM elements completely so
that no left-over event handlers cause background event handlers
This is related to #454. Queries now timeout after a duration set by
the -query.timeout flag. The TotalEvalTimer is now started/stopped
inside any of the ast.Eval* functions.
When Rickshaw was updated to 1.5.1 in
fd43daf82e,
the Rickshaw upstream package now contained 3 different D3 files:
d3.min.js
d3.v2.js
d3.v3.js
For details on why that is, see
https://groups.google.com/forum/#!topic/d3-js/lXQgKA7mtEw
For the 1.5.1 Rickshaw to work properly (being able to format dates with
D3 without causing a JS error), it needs d3.v2.js or d3.v3.js, not the
d3.min.js one. I chose to update us to d3.v3.js now, since that is the
most recent and minified version, and I didn't see any problems with it
(also, the current Rickshaw examples are using that D3 version).
Currently, displaying graphs with a range >14d is broken. This fixes
that.
While the recent commit 7e5745f solved the issue of having an
independent blob-stamp file, which was possible to become out of
sync with the necessary web/blob/files.go file, this change further
simplifies the setup by merging the two Makefile.
The only purpose of web/Makefile was to call targets in
web/blob/Makefile. As all dependencies for blob/files.go are
outside of the blob/ directory, the separation isn't logically
necessary.
- Use only the minified versions of bootstrap.
- Do not embed non-minified bootstrap files and bootstrap map files.
- Simplify the 'blob-stamp' Makefile contraption.
- Move CONTRIBUTORS.md to the more common AUTHORS.
- Added the required NOTICE file.
- Changed "Prometheus Team" to "The Prometheus Authors".
- Reverted the erroneous changes to the Apache License.
This provides the basic js, css and console template
templates required to build dashboards.
Included as an example are consoles for the node_exporter.
Change-Id: I4cfeea5e9691a9413f74ae98ca32a908df8e4a59
The "Address" is actually a URL which may contain username and
password. Calling this Address is misleading so we rename it.
Change-Id: I441c7ab9dfa2ceedc67cde7a47e6843a65f60511
This doesn't make the import order consistend everywhere, just where
it was touched by the previous commit.
Change-Id: I82fc75f8691da9901c7ceb808e6f6fe8e5d62c0e
Essentially:
- Remove unused code.
- Make it 'go vet' clean. The only remaining warnings are in generated code.
- Make it 'golint' clean. The only remaining warnings are in gerenated code.
- Smoothed out same minor things.
Change-Id: I3fe5c1fbead27b0e7a9c247fee2f5a45bc2d42c6
After many transformations, it doesn't make sense to keep the metric
names, since the result of the transformation is no longer that metric.
This drops the metric name after such transformations and makes the web
UI deal well with missing metric names.
This depends on the current branch on the following things:
- prometheus/client_golang needs to be at
e237cf15c6
in branch "julius/int-fingerprints" (to be merged with new storage)
- prometheus/promdash needs to be at
dd7691c9c2
Change-Id: Ib3c8cad8d647d9854e8c653c424b8c235ccc231d
This removes the dependancy on C leveldb and snappy.
It also takes care of fewer dependencies as they would
anyway not work on any non-Debian, non-Brew system.
Change-Id: Ia70dce1ba8a816a003587927e0b3a3f8ad2fd28c
Gracefully handle decimal values, by truncating them.
Limit amount of steps, to avoid accidentally pulling too much data.
This limit returns up to ~500kB per timeseries, and allows
for 60s granularity for a week and 1h granularity for a year.
Change-Id: Ie549fc24deb2eecbc6c5d1b6088a548a6b02e849
Having metrics with variable timestamps inconsistently
spaced when things fail will make it harder to write correct rules.
Update status page, requires some refactoring to insert a function.
Change-Id: Ie1c586cca53b8f3b318af8c21c418873063738a8
This fixes the problem where samples become temporarily unavailable for
queries while they are being flushed to disk. Although the entire
flushing code could use some major refactoring, I'm explicitly trying to
do the minimal change to fix the problem since there's a whole new
storage implementation in the pipeline.
Change-Id: I0f5393a30b88654c73567456aeaea62f8b3756d9
Due to the lack of a </a>, this makes the entire header render badly.
Accordingly it's safe to assume noone is using it, so remove it.
With the new console template support, we'll need to something a bit
more nuanced later.
Change-Id: I3424bed6aea18cbd4c63ad48f98808098dadc3ad
Add a function to bypass the new auto-escaping.
Add a function to workaround go's templates only allowing passing in one argument.
Change-Id: Id7aa3f95e7c227692dc22108388b1d9b1e2eec99
This is consistent with alertmanager, and more intiutive for users.
The graphs page just has graphs, so remove mention of consoles.
Change-Id: I87780a4ade33697a6095423e1a7de47d341d2838
Move rulemanager to it's own package to break cicrular dependency.
Make NewTestTieredStorage available to tests, remove duplication.
Change-Id: I33b321245a44aa727bfc3614a7c9ae5005b34e03
This was initially motivated by wanting to distribute the rule checker
tool under `tools/rule_checker`. However, this was not possible without
also distributing the LevelDB dynamic libraries because the tool
transitively depended on Levigo:
rule checker -> query layer -> tiered storage layer -> leveldb
This change separates external storage interfaces from the
implementation (tiered storage, leveldb storage, memory storage) by
putting them into separate packages:
- storage/metric: public, implementation-agnostic interfaces
- storage/metric/tiered: tiered storage implementation, including memory
and LevelDB storage.
I initially also considered splitting up the implementation into
separate packages for tiered storage, memory storage, and LevelDB
storage, but these are currently so intertwined that it would be another
major project in itself.
The query layers and most other parts of Prometheus now have notion of
the storage implementation anymore and just use whatever implementation
they get passed in via interfaces.
The rule_checker is now a static binary :)
Change-Id: I793bbf631a8648ca31790e7e772ecf9c2b92f7a0
The closing of Prometheus now using a sync.Once wrapper to prevent
any accidental multiple invocations of it, which could trigger
corruption or a race condition. The shutdown process is made more
verbose through logging.
A not-enabled by default web handler has been provided to trigger a
remote shutdown if requested for debugging purposes.
Change-Id: If4fee75196bbff1fb1e4a4ef7e1cfa53fef88f2e
This also fixes the compaction test, which before worked only because
the input sample sorting was accidentally equal to the resulting on-disk
sample sorting.
Change-Id: I2a21c4b46ba562424b27058fc02eba84fa6a6006
So far we've been using Go's native time.Time for anything related to sample
timestamps. Since the range of time.Time is much bigger than what we need, this
has created two problems:
- there could be time.Time values which were out of the range/precision of the
time type that we persist to disk, therefore causing incorrectly ordered keys.
One bug caused by this was:
https://github.com/prometheus/prometheus/issues/367
It would be good to use a timestamp type that's more closely aligned with
what the underlying storage supports.
- sizeof(time.Time) is 192, while Prometheus should be ok with a single 64-bit
Unix timestamp (possibly even a 32-bit one). Since we store samples in large
numbers, this seriously affects memory usage. Furthermore, copying/working
with the data will be faster if it's smaller.
*MEMORY USAGE RESULTS*
Initial memory usage comparisons for a running Prometheus with 1 timeseries and
100,000 samples show roughly a 13% decrease in total (VIRT) memory usage. In my
tests, this advantage for some reason decreased a bit the more samples the
timeseries had (to 5-7% for millions of samples). This I can't fully explain,
but perhaps garbage collection issues were involved.
*WHEN TO USE THE NEW TIMESTAMP TYPE*
The new clientmodel.Timestamp type should be used whenever time
calculations are either directly or indirectly related to sample
timestamps.
For example:
- the timestamp of a sample itself
- all kinds of watermarks
- anything that may become or is compared to a sample timestamp (like the timestamp
passed into Target.Scrape()).
When to still use time.Time:
- for measuring durations/times not related to sample timestamps, like duration
telemetry exporting, timers that indicate how frequently to execute some
action, etc.
*NOTE ON OPERATOR OPTIMIZATION TESTS*
We don't use operator optimization code anymore, but it still lives in
the code as dead code. It still has tests, but I couldn't get all of them to
pass with the new timestamp format. I commented out the failing cases for now,
but we should probably remove the dead code soon. I just didn't want to do that
in the same change as this.
Change-Id: I821787414b0debe85c9fffaeb57abd453727af0f
Due to on going issues, we've decided to remove gorest. It started with gorest
not being thread-safe (it does introspection to create a new handler which is
an easy process to mess up with multiple threads of execution):
https://code.google.com/p/gorest/issues/detail?id=15
While the issue has been marked fixed, it looks like the patch has introduced
more problems than the original issue and simply doesn't work properly.
I'm not sure the behaviour was thought through properly. If a new instance is
needed every request then a handler-factory is needed or the library needs to
set expectations about how the new objects should interact with their
constructor state.
While it was tempting to try out another routing library, I think for now
it's better to use dumb vanilla Go routing. At least until we decide which
URL format we intend to standardize on.
Change-Id: Ica3da135d05f8ab8fc206f51eeca4f684f8efa0e
This commit fixes a critique of the old storage API design, whereby
the input parameters were always as raw bytes and never Protocol
Buffer messages that encapsulated the data, meaning every place a
read or mutation was conducted needed to manually perform said
translations on its own. This is taxing.
Change-Id: I4786938d0d207cefb7782bd2bd96a517eead186f
An design question was open for me in the beginning was whether to
serialize other types to disk, but Protocol Buffers quickly won out,
which allows us to drop support for other types. This is a good
start to cleaning up a lot of cruft in the storage stack and
can let us eventually decouple the various moving parts into
separate subsystems for easier reasoning.
This commit is not strictly required, but it is a start to making
the rest a lot more enjoyable to interact with.
This adds timers around several query-relevant code blocks. For now, the
query timer stats are only logged for queries initiated through the UI.
In other cases (rule evaluations), the stats are simply thrown away.
My hope is that this helps us understand where queries spend time,
especially in cases where they sometimes hang for unusual amounts of
time.
In order to help corroborate whether a Prometheus instance has
flapped until meta-monitoring is in-place, we ought to provide the
instance's start time in the console to aid in diagnostics.
This commit simplifies the way that compactions across a database's
keyspace occur due to reading the LevelDB internals. Secondarily it
introduces the database size estimation mechanisms.
Include database health and help interfaces.
Add database statistics; remove status goroutines.
This commit kills the use of Go routines to expose status throughout
the web components of Prometheus. It also dumps raw LevelDB status
on a separate /databases endpoint.
This commit introduces three background compactors, which compact
sparse samples together.
1. Older than five minutes is grouped together into chunks of 50 every 30
minutes.
2. Older than 60 minutes is grouped together into chunks of 250 every 50
minutes.
3. Older than one day is grouped together into chunks of 5000 every 70
minutes.
Unfortunately ``cp`` on Darwin regards some flags as positional and
requires them to be in a specific place. The new Protocol Buffer
descriptor bundling fails on Mac OS.
The Protocol Buffer compiler supports generating a machine-readable
descriptor file encoded as a provided Protocol Buffer message type,
which can be used to decode messages that have been encoded with it
after-the-fact. The generated descriptor also bundles in dependent
message types.
We can use this to perform forensics on old Prometheus clients, if
necessary.
Go's time.Time represents time as UTC in its fundamental data type.
That said, when using ``time.Unix(...)``, it sets the zone for the
time representation to the local. Unfortunately with diagnosis and
our tests, it is a PITA to jump between various zones, even though
the serialized version remains the same.
To keep things easy, all places where times are generated or read
are converted into UTC. These conversions are cheap, for
``Time.In`` merely changes a pointer reference in the struct,
nothing more. This enables me to diagnose test failures with fixture
data very easily.
By setting Access-Control headers, the Prometheus metrics API can be
accessed by cross-origin javascript applications (e.g., an external
dashboard pulling Prometheus metrics).
To achieve that, this PR
- converts static/index.html ("console") and graph to templates
- moved the handlebars template to separated file to avoid escaping issues
Route changes:
/status -> /
/static -> /console
/static/graph.html -> /graph
The curator doesn't do anything yet; rather, this is the type
definition including the anciliary testing scaffold.
Improve Makefile and Git developer experience.
The top-level Makefile was a bit overloaded in terms of generation of
assets and their management. This has been offloaded into separate
Makefiles.
The Git developer experience sucked due to lack of .gitignore
policies.
Also: Fix faulty skiplist naming from old merge.