When using Kubernetes on cloud providers, nodes will have the
spec.providerID field populated to contain the cloud provider specific
name of the EC2/GCE/... instance.
Let's expose this information as an additional label, so that it's
easier to annotate metrics and alerts to contain the cloud provider
specific name of the instance to which they pertain.
Signed-off-by: Ed Schouten <eschouten@apple.com>
This can be useful when generating rules: a query may use a duration,
and it may be useful to template that duration into a URL parameter.
This allows interfacing with systems that don't implement
Prometheus-style duration parsing.
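
For illustration, the conversion could look like the following Python
sketch. It is not the code added by this change: the helper name
`duration_to_seconds`, the `window` parameter, and the unit table
(including a 365-day year) are assumptions made for the example.

```python
import re
from urllib.parse import urlencode

# Seconds per unit for Prometheus-style duration strings such as "1h30m".
# The table is an assumption for this sketch.
_UNIT_SECONDS = {
    "ms": 0.001, "s": 1, "m": 60, "h": 3600,
    "d": 86400, "w": 7 * 86400, "y": 365 * 86400,
}
# "ms" is listed before "m" so that "500ms" is not read as minutes.
_TOKEN = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def duration_to_seconds(text: str) -> float:
    """Convert a duration such as "1h30m" to a number of seconds."""
    if not text:
        raise ValueError("empty duration")
    pos, total = 0, 0.0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    return total


# Template the numeric value into a URL parameter for a system that
# only understands seconds ("window" is a made-up parameter name).
query = urlencode({"window": int(duration_to_seconds("5m"))})
```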
Signed-off-by: David Leadbeater <dgl@dgl.cx>
* remote-write: slow down retries to avoid DDOS
Increase the default max retry time from 100ms to 5 seconds.
Remote write calls are retried after a recoverable error such as the
back-end returning 500. Prometheus waits the minimum time and retries,
then doubles the wait on each subsequent retry until the maximum is
reached.
If some data is still getting through, remote-write will also increase
shards, and the default maximum is 200. 200 shards sending every 100ms
is 2,000 calls per second, to a back-end that is already in trouble.
5 seconds was chosen to match the default BatchSendDeadline: if we can
afford to wait that long for no response, then we can wait the same time
to retry. We will reach 5 seconds after 9 successive failures.
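
The arithmetic can be checked with a short Python sketch. The 30ms
minimum backoff is an assumption (this message only states the old and
new maximum), and the function names are illustrative, not Prometheus
code.

```python
from typing import List


def backoff_schedule(min_ms: int, max_ms: int, failures: int) -> List[int]:
    """Wait before each retry: start at min_ms, double, cap at max_ms."""
    waits, wait = [], min_ms
    for _ in range(failures):
        waits.append(wait)
        wait = min(wait * 2, max_ms)
    return waits


def calls_per_second(shards: int, interval_ms: int) -> float:
    """Request rate when every shard retries once per interval."""
    return shards * 1000 / interval_ms


# Assuming a 30ms minimum, the ninth wait is the first to hit the cap.
schedule = backoff_schedule(min_ms=30, max_ms=5000, failures=9)
# schedule == [30, 60, 120, 240, 480, 960, 1920, 3840, 5000]

old_rate = calls_per_second(shards=200, interval_ms=100)   # 2000.0
new_rate = calls_per_second(shards=200, interval_ms=5000)  # 40.0
```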
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
* Update config doc for max_backoff change
Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
We have been Puppet users for 10 years and we are users of
https://github.com/camptocamp/prometheus-puppetdb-sd
However, that file_sd implementation contains business logic and
assumptions around e.g. the modules which you are using.
This pull request adds a simple PuppetDB service discovery, which will
enable more use cases than the upstream sd.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* PromQL: Fix start and end keywords masking label and metric names
This commit fixes an issue with the "at modifier" that introduced two
new keywords: `start` and `end`. In grouping options and in metric
names, these keywords took precedence over metric or label names, so
that those metrics and labels could no longer be referenced.
Signed-off-by: Clayton Peters <clayton.peters@man.com>
* Add in additional tests for metrics and/or labels called start/end.
Signed-off-by: Clayton Peters <clayton.peters@man.com>
* *: Cut 2.29.0-rc.0
Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
* VERSION: bump to 2.29.0-rc.0
Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
* Remove experimental wording on size-based retention
Followup of #9004
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* Fix PR reference in changelog
Signed-off-by: George Brighton <george@gebn.co.uk>
* Describe EC2 availability zone IDs at most once per refresh (#9142)
Signed-off-by: George Brighton <george@gebn.co.uk>
* Describe EC2 availability zones at most once per SD load
Closes #9142.
Signed-off-by: George Brighton <george@gebn.co.uk>
* Incorporate feedback
Signed-off-by: George Brighton <george@gebn.co.uk>
* Integrate feedback
Signed-off-by: George Brighton <george@gebn.co.uk>
* Add a compatibility note for macOS users.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* *: Cut v2.29.0-rc.1
Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
* Fix `kuma_sd` targetgroup reporting (#9157)
* Bundle all xDS targets into a single group
Signed-off-by: austin ce <austin.cawley@gmail.com>
* *: cut v2.29.0-rc.2
Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
* Rename links
Signed-off-by: Levi Harrison <git@leviharrison.dev>
* bump codemirror-promql to 0.17.0
Signed-off-by: Augustin Husson <husson.augustin@gmail.com>
* *: cut v2.29.0
Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
* tsdb: align atomically accessed int64 (#9192)
This prevents a panic in 32-bit archs:
https://pkg.go.dev/sync/atomic#pkg-note-BUG
Fixed #9190
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* Release 2.29.1 (#9193)
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
Co-authored-by: Clayton Peters <clayton.peters@man.com>
Co-authored-by: Frederic Branczyk <fbranczyk@gmail.com>
Co-authored-by: George Brighton <george@gebn.co.uk>
Co-authored-by: Austin Cawley-Edwards <austin.cawley@gmail.com>
Co-authored-by: Levi Harrison <git@leviharrison.dev>
Co-authored-by: Augustin Husson <husson.augustin@gmail.com>
* optimize Linode SD by polling for event changes during refresh
Most accounts are fairly "static", in the sense that they're not cycling
through instances constantly. So rather than do a full refresh every
interval and potentially make several behind-the-scenes paginated API
calls, this will now poll the `/account/events/` endpoint every minute
with a list of events that we care about. If a matching event is found,
we then do a full refresh.
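
The control flow described above can be sketched in Python as follows.
This is not the Linode SD code: the event action names and the callable
parameters are hypothetical and stand in for the real API client.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical subset of event actions that change the instance list.
RELEVANT_ACTIONS = {"linode_create", "linode_delete", "linode_update"}


def refresh(
    cached: Optional[List[str]],
    list_event_actions: Callable[[], Iterable[str]],
    list_instances: Callable[[], List[str]],
) -> List[str]:
    """Return the target list, doing a full listing only when needed.

    list_event_actions is the cheap poll (one request); list_instances
    is the expensive, possibly paginated, full refresh.
    """
    actions = list_event_actions()
    if cached is None or any(a in RELEVANT_ACTIONS for a in actions):
        return list_instances()
    return cached
```

The first call always does a full listing because there is no cached
result yet; later calls reuse the cache until a relevant event appears.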
Co-authored-by: William Smith <wsmith@linode.com>
Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
Signed-off-by: William Smith <wsmith@linode.com>
* Added MaxSamplesPerSend
Signed-off-by: Levi Harrison <git@leviharrison.dev>
* Added tests
Signed-off-by: Levi Harrison <git@leviharrison.dev>
* Fixed order of require
Signed-off-by: Levi Harrison <git@leviharrison.dev>
* Added docs
Signed-off-by: Levi Harrison <git@leviharrison.dev>
* writes -> writesReceived
Signed-off-by: Levi Harrison <git@leviharrison.dev>
* Improved send loop
Signed-off-by: Levi Harrison <git@leviharrison.dev>
* Write exemplars to the WAL and send them over remote write.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Update example for exemplars, print data in a more obvious format.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Add metrics for remote write of exemplars.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Fix incorrect slices passed to send in remote write.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* We need to unregister the new metrics.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address review comments
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Order of exemplar append vs write exemplar to WAL needs to change.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Several fixes to prevent sending uninitialized or incorrect samples with an exemplar. Fix dropping exemplar for missing series. Add tests for queue_manager sending exemplars
Signed-off-by: Martin Disibio <mdisibio@gmail.com>
* Store both samples and exemplars in the same timeseries buffer to remove the alloc when building final request, keep sub-slices in separate buffers for re-use
Signed-off-by: Martin Disibio <mdisibio@gmail.com>
* Condense sample/exemplar delivery tests to parameterized sub-tests
Signed-off-by: Martin Disibio <mdisibio@gmail.com>
* Rename test methods for clarity now that they also handle exemplars
Signed-off-by: Martin Disibio <mdisibio@gmail.com>
* Rename counter variable. Fix instances where metrics were not updated correctly
Signed-off-by: Martin Disibio <mdisibio@gmail.com>
* Add exemplars to LoadWAL benchmark
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* last exemplars timestamp metric needs to convert value to seconds with
ms precision
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Process exemplar records in a separate go routine when loading the WAL.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address review comments related to clarifying comments and variable
names. Also refactor sample/exemplar to enqueue prompb types.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Regenerate types proto with comments, update protoc version again.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Put remote write of exemplars behind a feature flag.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address some of Ganesh's review comments.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Move exemplar remote write feature flag to a config file field.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address Bartek's review comments.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Don't allocate exemplar buffers in queue_manager if we're not going to
send exemplars over remote write.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Add ValidateExemplar function, validate exemplars when appending to head
and log them all to WAL before adding them to exemplar storage.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address more review comments from Ganesh.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Add exemplar total label length check.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Address a few last review comments
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Martin Disibio <mdisibio@gmail.com>
* scrape: add label limits per scrape
Add three new limits to the scrape configuration to provide some
mechanism to defend against an unbounded number of labels and excessive
label lengths. If any of these limits are broken by a sample from a
scrape, the whole scrape will fail. For all of these configuration
options, a zero value means no limit.
The `label_limit` configuration will provide a mechanism to bound the
number of labels of each sample in a scrape to a user-defined limit.
This limit will be tested against the sample labels plus the discovery
labels, but it will exclude the __name__ from the count since it is a
mandatory Prometheus label to which applying constraints isn't
meaningful.
The `label_name_length_limit` and `label_value_length_limit` will
prevent having labels of excessive lengths. These limits also skip the
__name__ label for the same reasons as the `label_limit` option and will
also make the scrape fail if any sample has a label name/value length
that exceeds the predefined limits.
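
A minimal Python sketch of the check as described in this message
(`__name__` excluded, zero meaning no limit). The function name,
signature, and error strings are assumptions for illustration and do
not mirror the Go implementation.

```python
from typing import Dict, Optional


def check_label_limits(
    labels: Dict[str, str],
    label_limit: int = 0,
    label_name_length_limit: int = 0,
    label_value_length_limit: int = 0,
) -> Optional[str]:
    """Return an error message if a limit is exceeded, otherwise None.

    A limit of 0 disables the corresponding check. A non-None result
    means the whole scrape should fail.
    """
    # __name__ is skipped, as described for this commit.
    checked = {k: v for k, v in labels.items() if k != "__name__"}

    if label_limit and len(checked) > label_limit:
        return f"label_limit exceeded: {len(checked)} > {label_limit}"

    for name, value in checked.items():
        if label_name_length_limit and len(name) > label_name_length_limit:
            return f"label name too long: {name}"
        if label_value_length_limit and len(value) > label_value_length_limit:
            return f"label value too long for {name}"
    return None
```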
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* scrape: add metrics and alert to label limits
Add three gauges, one for each label limit, to easily access the
limit set by a certain scrape target.
Also add a counter to count the number of targets that exceeded the
label limits and thus were dropped. This is useful for the
`PrometheusLabelLimitHit` alert that will notify the users that scraping
some targets failed because they had samples exceeding the label limits
defined in the scrape configuration.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* scrape: apply label limits to __name__ label
Apply limits to the __name__ label that was previously skipped and
truncate the label names and values in the error messages as they can be
very very long.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* scrape: remove label limits gauges and refactor
Remove `prometheus_target_scrape_pool_label_limit`,
`prometheus_target_scrape_pool_label_name_length_limit`, and
`prometheus_target_scrape_pool_label_value_length_limit` as they are not
really useful since we don't have the information on the labels in it.
Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
* Enable parsing strings in humanize functions
This is useful to humanize count_values or buckets labels.
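
As an illustration, a Python sketch of a humanize-style function that
accepts either a number or a numeric string, such as a label value
produced by count_values. The SI-prefix formatting below is an
approximation written for this example, not the template function
itself.

```python
import math
from typing import Union


def humanize(value: Union[str, float]) -> str:
    """Format a number, or a string holding a number, with SI prefixes."""
    v = float(value)  # raises ValueError for non-numeric strings
    if v == 0 or not math.isfinite(v):
        return f"{v:.4g}"

    prefix = ""
    if abs(v) >= 1:
        for p in ["k", "M", "G", "T", "P", "E", "Z", "Y"]:
            if abs(v) < 1000:
                break
            prefix = p
            v /= 1000
    else:
        for p in ["m", "u", "n", "p", "f", "a", "z", "y"]:
            if abs(v) >= 1:
                break
            prefix = p
            v *= 1000
    return f"{v:.4g}{prefix}"
```

With this sketch, `humanize("1234567")` and `humanize(1234567.0)` both
return `1.235M`.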
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
Prometheus adds the ability to read secrets from files. This adds
this feature for the Scaleway service discovery.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
* Contribute grafana/agent sigv4 code
* address review feedback
- move validation logic for RemoteWrite into unmarshal
- copy configuration fields from ec2 SD config
- remove enabled field, use pointer for enabling sigv4
* Update config/config.go
* Don't provide credentials if secret key / access key left blank
* Add SigV4 headers to the list of unchangeable headers.
* sigv4: don't include all headers in signature
* only test for equality in the authorization header, not the signed date
* address review feedback
1. s/httpClientConfigEnabled/httpClientConfigAuthEnabled
2. bearer_token tuples to "authorization"
3. Un-export NewSigV4RoundTripper
* add x-amz-content-sha256 to list of unchangeable headers
* Document sigv4 configuration
* add suggestion for using default AWS SDK credentials
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
Co-authored-by: Julien Pivotto <roidelapluie@gmail.com>
This PR introduces support for follow_redirect, to enable users to
disable following HTTP redirects.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
The label `__meta_digitalocean_image` exposes the `slug` of the image, and
the `slug` is only present for public images.
To refer to a user-generated image (`snapshot` or `custom`) we can use
the image's display name.
See: https://developers.digitalocean.com/documentation/v2/#images
Signed-off-by: Matteo Valentini <matteo.valentini@nethesis.it>
Improve the documentation to clarify the differences between rules in a
group and outside a group.
Signed-off-by: Thibault Jamet <tjamet@users.noreply.github.com>
commit 9875afc491 changed the type from
metric names to label values, we might as well adjust the description.
The alternative is to revert that commit and restrict names of alerting
rules again even if that was not really enforced.
Signed-off-by: Peter Wu <pwu@cloudflare.com>
The React app's assets are now served under /assets, while all old
custom web assets (including the ones for console templates) are now
served from /classic/static.
I tested different combinations of --web.external-url and
--web.route-prefix with proxies in front, and I couldn't find a problem
yet with the routing. Console templates also still work.
While migrating old endpoints to /classic, I noticed that /version was
being treated like a lot of the old UI pages, with readiness check
handler in front of it, etc. I kept it in /version and removed that
readiness wrapper, since it doesn't seem to be needed for that endpoint.
Signed-off-by: Julius Volz <julius.volz@gmail.com>
* OpenStack SD: Add availability config option, to choose endpoint type
In some environments Prometheus must query OpenStack via an alternative
endpoint type (gophercloud calls this `availability`).
This commit implements this option.
Co-Authored-By: Dennis Kuhn <d.kuhn@syseleven.de>
Signed-off-by: Steffen Neubauer <s.neubauer@syseleven.de>
Added optional configuration item role, defaults to 'container' (backwards-compatible).
Setting role to 'cn' will discover compute nodes instead.
Human-friendly compute node hostname discovery depends on cmon 1.7.0:
c1a2aeca36
Adjust testcases to use discovery config per case as two different types are now supported.
Updated documentation:
* new role setting
* clarify what the name 'container' covers as triton uses different names in different locations
Signed-off-by: jzinkweg <jzinkweg@gmail.com>
Add extra meta labels which will be useful when
Prometheus discovers hypervisors.
Signed-off-by: pzqu <pzqu@qq.com>
Co-authored-by: pzqu <pzqu@example.com>
One of our users today asked us if dashes were allowed in recording rule names.
We asserted that they were not, but also that we could not remember for certain.
After determining empirically that they are _not_ allowed, I realized that the
documentation could be slightly clearer about valid rule names.
This PR simply adds a note to the documentation re-iterating that the rules must
be valid metric names - and more importantly, adds a link to where a user can
read what those *are*, in case they were not aware (or did not know where to find it).
Signed-off-by: Andrew Hayworth <ahayworth@gmail.com>
* docs: update unit testing rules
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* More nits fixed
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
The desired shards calculation now properly keeps track of the rate of
pending samples, and uses the previously unused integralAccumulator to
adjust for missing information in the desired shards calculation.
Also, configure more capacity for each shard. The default 10 capacity
causes shards to block on each other while
sending remote requests. Default to a 500 sample capacity and explain in
the documentation that having more capacity will help throughput.
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
Previously, the wording could be misunderstood as setting honor_labels
to "false" for federation.
This also adds scraping the Pushgateway as a typical use case for
honor_labels=true.
Signed-off-by: beorn7 <beorn@grafana.com>
Document the behavior of an empty `ec2_sd_config` `region` setting. If this is
omitted or blank, the region is discovered from the instance metadata, if available.
If it is blank and instance region metadata is not available, an error will
result ("EC2 SD configuration requires a region").
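
The documented fallback order can be sketched in Python as follows.
The function and parameter names are hypothetical; only the error text
is taken from the message above.

```python
from typing import Callable, Optional


def resolve_region(
    configured: str,
    metadata_region: Callable[[], Optional[str]],
) -> str:
    """Pick the EC2 region: explicit setting first, then instance metadata."""
    if configured:
        return configured
    # Region omitted or blank: fall back to the instance metadata.
    # metadata_region returns None when no metadata is available.
    region = metadata_region()
    if not region:
        raise ValueError("EC2 SD configuration requires a region")
    return region
```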
Signed-off-by: Svend Sorensen <svend@svends.net>
With v0.16.0 Alertmanager introduced a new API (v2). This patch adds a
configuration option for Prometheus to send alerts to the v2 endpoint
instead of the default v1 endpoint.
Signed-off-by: Max Leonard Inden <IndenML@gmail.com>
Lots of alerts are based on ratios (eg. disk usage), and humans are used
to values in percentage in textual descriptions.
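
A rough Python sketch of the idea, turning a ratio into a percentage
string for use in an alert description. The function name and the
four-significant-digit formatting are assumptions for this example.

```python
from typing import Union


def humanize_percentage(ratio: Union[str, float]) -> str:
    """Render a ratio such as 0.1234 as a percentage string."""
    return f"{float(ratio) * 100:.4g}%"


# Example: a disk usage ratio in an alert description.
description = f"Disk is {humanize_percentage(0.9)} full"
```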
Signed-off-by: Jens Erat <email@jenserat.de>
Add extra meta labels which will be useful when
Prometheus discovers instances from all projects.
Signed-off-by: Kien Nguyen <kiennt2609@gmail.com>
Currently, when we access the modified pages with **HTTP**, we are
redirected to **HTTPS** automatically. So this commit replaces
**HTTP** with **HTTPS** for security.
Co-Authored-By: Nguyen Phuong An <AnNP@vn.fujitsu.com>
Signed-off-by: Kim Bao Long <longkb@vn.fujitsu.com>
Although these are only spelling mistakes, they may affect readability.
Co-Authored-By: Nguyen Phuong An <AnNP@vn.fujitsu.com>
Signed-off-by: Kim Bao Long <longkb@vn.fujitsu.com>
* discovery/kubernetes: fix support for password_file
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Create and pass custom RoundTripper to Kubernetes client
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Use inline HTTPClientConfig
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* add logic to check if an azure VM is deallocated or not
* update documentation with the new azure power state label
Signed-off-by: tariqibrahim <tariq.ibrahim@microsoft.com>
* Adding private_dns_name to the list of ec2 labels which can be used in node naming for dynamic environments
Signed-off-by: Serghei Anicheev <serghei@rentalcover.com>
Set __meta_ec2_platform label with the instance platform string. Set to 'windows' on Windows servers and absent otherwise.
Signed-off-by: Silvio Gissi <silvio@gissilabs.com>
By default, OpenStack SD only queries for instances
from the specified project. To discover instances from other
projects, users have to add more openstack_sd_configs, one for
each project.
This patch adds an `all_tenants` <bool> option to
openstack_sd_configs. For example:
- job_name: 'openstack_all_instances'
  openstack_sd_configs:
    - role: instance
      region: RegionOne
      identity_endpoint: http://<identity_server>/identity/v3
      username: <username>
      password: <super_secret_password>
      domain_name: Default
      all_tenants: true
Co-authored-by: Kien Nguyen <kiennt2609@gmail.com>
Signed-off-by: dmatosl <danielmatos.lima@gmail.com>
Additionally, add triton groups metadata to the discovery response
and correct a documentation error regarding the triton server id
metadata.
Signed-off-by: Richard Kiene <richard.kiene@joyent.com>
* Initial support for Azure VMSS
Signed-off-by: Johannes Scheuermann <johannes.scheuermann@inovex.de>
* Add documentation for the newly introduced label
Signed-off-by: Johannes M. Scheuermann <joh.scheuer@gmail.com>
Allowing a custom endpoint to be set makes it easy to monitor targets on non-AWS providers with EC2-compliant APIs.
Signed-off-by: Jannick Fahlbusch <git@jf-projects.de>