[docs] improve telegraf guide with objectives, summary, formatting (#5527)

pull/5541/head
Judith Malnick 2019-03-22 09:40:38 -07:00 committed by GitHub
parent f8bcfcfa46
commit 6a78e2ae55
1 changed files with 120 additions and 56 deletions


@ -8,47 +8,70 @@ description: |-
# Monitoring Consul with Telegraf # Monitoring Consul with Telegraf
Consul makes available a range of metrics in various formats in order to measure the health and stability of a cluster, and diagnose or predict potential issues. Consul makes a range of metrics in various formats available so operators can
measure the health and stability of a cluster, and diagnose or predict potential
issues.
There are a number of monitoring tools and options, but for the purposes of this guide we are going to use the [telegraf_plugin][] in conjunction with the Statsd protocol supported by Consul. There are a number of monitoring tools and options available, but for the purposes
of this guide we are going to use the [telegraf_plugin][] in conjunction with
the StatsD protocol supported by Consul.
You can read the full breakdown of metrics with Consul in the [telemetry documentation](/docs/agent/telemetry.html) You can read the full list of metrics available with Consul in the [telemetry
documentation](/docs/agent/telemetry.html).
In this guide you will:
- Configure Telegraf to collect StatsD and host level metrics
- Configure Consul to send metrics to Telegraf
- See an example of metrics visualization
- Understand important metrics to aggregate and alert on
## Installing Telegraf
The process for installing Telegraf depends on your operating system. We
recommend following the [official Telegraf installation
documentation][telegraf-install].
## Configuring Telegraf ## Configuring Telegraf
# Installing Telegraf Telegraf acts as a StatsD agent and can collect additional metrics about the
hosts where Consul agents are running. Telegraf itself ships with a wide range
of [input plugins][telegraf-input-plugins] to collect data from lots of sources
for this purpose.
Installing Telegraf is straightforward on most Linux distributions. We recommend following the [official Telegraf installation documentation][telegraf-install]. We're going to enable some of the most common input plugins to monitor CPU,
memory, disk I/O, networking, and process status, since these are useful for
# Configuring Telegraf debugging Consul cluster issues.
Besides acting as a statsd agent, Telegraf can collect additional metrics about the host that the Consul agent is running on. Telegraf itself ships with a wide range of [input plugins][telegraf-input-plugins] to collect data from lots of sources for this purpose.
We're going to enable some of the most common ones to monitor CPU, memory, disk I/O, networking, and process status, as these are useful for debugging Consul cluster issues.
The `telegraf.conf` file starts with global options: The `telegraf.conf` file starts with global options:
```ini ```toml
[agent] [agent]
interval = "10s" interval = "10s"
flush_interval = "10s" flush_interval = "10s"
omit_hostname = false omit_hostname = false
``` ```
We set the default collection interval to 10 seconds and ask Telegraf to include a `host` tag in each metric. We set the default collection interval to 10 seconds and ask Telegraf to include
a `host` tag in each metric.
As mentioned above, Telegraf also allows you to set additional tags on the metrics that pass through it. In this case, we are adding tags for the server role and datacenter. We can then use these tags in Grafana to filter queries (for example, to create a dashboard showing only servers with the `consul-server` role, or only servers in the `us-east-1` datacenter). As mentioned above, Telegraf also allows you to set additional tags on the
metrics that pass through it. In this case, we are adding tags for the server
role and datacenter. We can then use these tags in Grafana to filter queries
(for example, to create a dashboard showing only servers with the
`consul-server` role, or only servers in the `us-east-1` datacenter).
```ini ```toml
[global_tags] [global_tags]
role = "consul-server" role = "consul-server"
datacenter = "us-east-1" datacenter = "us-east-1"
``` ```
Next, we set up a statsd listener on UDP port 8125, with instructions to calculate percentile metrics and to Next, we set up a StatsD listener on UDP port 8125, with instructions to
parse DogStatsD-compatible tags, when they're sent: calculate percentile metrics and to parse DogStatsD-compatible tags, when
they're sent:
```ini ```toml
[[inputs.statsd]] [[inputs.statsd]]
protocol = "udp" protocol = "udp"
service_address = ":8125" service_address = ":8125"
@ -63,11 +86,15 @@ parse DogStatsD-compatible tags, when they're sent:
percentile_limit = 1000 percentile_limit = 1000
``` ```
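To see what the listener configured above would receive, here is a minimal Python sketch that formats a DogStatsD-style line and sends it over UDP. It is not part of the guide's configuration: the helper names, the metric name `consul.example.counter`, and the assumption that Telegraf is listening on `127.0.0.1:8125` are all illustrative.

```python
import socket


def format_dogstatsd(name, value, metric_type="c", tags=None):
    """Build one DogStatsD line: name:value|type|#key:val,key:val."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Sort the tags so the output is deterministic.
        tag_str = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
        line += f"|#{tag_str}"
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Send a single metric line to a StatsD listener over UDP.

    The default host and port are assumptions that match the
    listener address used in this guide.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

Calling `send_metric(format_dogstatsd("consul.example.counter", 1, "c", {"role": "consul-server"}))` on a host running Telegraf should make the counter appear with a `role` tag, provided the listener is set up to parse DogStatsD tags.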
The full reference to all the available statsd-related options in Telegraf is [here][telegraf-statsd-input]. The full reference to all the available StatsD-related options in Telegraf is
[here][telegraf-statsd-input].
Now we can configure inputs for things like CPU, memory, network I/O, and disk I/O. Most of them don't require any configuration, but make sure the `interfaces` list in `inputs.net` matches the interface names you see in `ifconfig`. Now we can configure inputs for things like CPU, memory, network I/O, and disk
I/O. Most of them don't require any configuration, but make sure the
`interfaces` list in `inputs.net` matches the interface names you see in
`ifconfig`.
```ini ```toml
[[inputs.cpu]] [[inputs.cpu]]
percpu = true percpu = true
totalcpu = true totalcpu = true
@ -106,18 +133,22 @@ Now we can configure inputs for things like CPU, memory, network I/O, and disk I
# no configuration # no configuration
``` ```
Another useful plugin is the [procstat][telegraf-procstat-input] plugin, which reports metrics for processes you select: Another useful plugin is the [procstat][telegraf-procstat-input] plugin, which
reports metrics for processes you select:
```ini ```toml
[[inputs.procstat]] [[inputs.procstat]]
pattern = "(consul)" pattern = "(consul)"
``` ```
Telegraf even includes a [plugin][telegraf-consul-input] that monitors the health checks associated with the Consul agent, using the Consul API to query the data. Telegraf even includes a [plugin][telegraf-consul-input] that monitors the
health checks associated with the Consul agent, using the Consul API to query the
data.
It's important to note: the plugin itself will not report the telemetry; Consul already reports those stats using the StatsD protocol. It's important to note: the plugin itself will not report the telemetry; Consul
already reports those stats using the StatsD protocol.
```ini ```toml
[[inputs.consul]] [[inputs.consul]]
address = "localhost:8500" address = "localhost:8500"
scheme = "http" scheme = "http"
@ -125,7 +156,8 @@ It's important to note: the plugin itself will not report the telemetry, Consul
## Telegraf Configuration for Consul ## Telegraf Configuration for Consul
Asking Consul to send telemetry to Telegraf is as simple as adding a `telemetry` section to your agent configuration: Asking Consul to send telemetry to Telegraf is as simple as adding a `telemetry`
section to your agent configuration:
```json ```json
{ {
@ -136,25 +168,32 @@ Asking Consul to send telemetry to Telegraf is as simple as adding a `telemetry`
} }
``` ```
As you can see, we only need to specify two options. The `dogstatsd_addr` specifies the hostname and port of the As you can see, we only need to specify two options. The `dogstatsd_addr`
statsd daemon. specifies the hostname and port of the StatsD daemon.
Note that we specify DogStatsD format instead of plain statsd, which tells Consul to send [tags][tagging] Note that we specify DogStatsD format instead of plain StatsD, which tells
with each metric. Tags can be used by Grafana to filter data on your dashboards (for example, displaying only Consul to send [tags][tagging] with each metric. Tags can be used by Grafana to
the data for which `role=consul-server`). Telegraf is compatible with the DogStatsD format and allows us to add filter data on your dashboards (for example, displaying only the data for which
our own tags too. `role=consul-server`). Telegraf is compatible with the DogStatsD format and
allows us to add our own tags too.
The second option tells Consul not to insert the hostname in the names of the metrics it sends to statsd, since the hostnames will be sent as tags. Without this option, the single metric `consul.raft.apply` would become multiple metrics: The second option tells Consul not to insert the hostname in the names of the
metrics it sends to StatsD, since the hostnames will be sent as tags. Without
this option, the single metric `consul.raft.apply` would become multiple
metrics:
consul.server1.raft.apply consul.server1.raft.apply
consul.server2.raft.apply consul.server2.raft.apply
consul.server3.raft.apply consul.server3.raft.apply
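The difference between the two naming schemes can be sketched in Python. The helper below is hypothetical and assumes the hostname is the second dot-separated component and contains no dots itself. It folds a hostname-embedded name back into one metric name plus a `host` tag, which is the shape you get directly when hostnames are sent as tags.

```python
def split_host_from_metric(name, prefix="consul"):
    """Turn 'consul.<host>.<rest>' into ('consul.<rest>', {'host': <host>}).

    Assumes the hostname is exactly one dot-separated component.
    """
    parts = name.split(".")
    if len(parts) < 3 or parts[0] != prefix:
        raise ValueError(f"unexpected metric name: {name!r}")
    host = parts[1]
    metric = ".".join([parts[0]] + parts[2:])
    return metric, {"host": host}
```

With tags, all three servers report the same metric name, so a dashboard query can aggregate across hosts or filter by `host` without any name rewriting.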
If you are using a different agent (e.g. Circonus, Statsite, or plain statsd), you may want to change this configuration, and you can find the configuration reference [here][consul-telemetry-config]. If you are using a different agent (e.g. Circonus, Statsite, or plain StatsD),
you may want to change this configuration, and you can find the configuration
reference [here][consul-telemetry-config].
## Visualising Telegraf Consul Metrics ## Visualising Telegraf Consul Metrics
There a number of ways of consuming the information from Telegraf. Generally they are visualised using a tool like [Grafana][] or [Chronograf][]. You can use a tool like [Grafana][] or [Chronograf][] to visualize metrics from
Telegraf.
Here is an example Grafana dashboard: Here is an example Grafana dashboard:
@ -173,7 +212,10 @@ Here is an example Grafana dashboard:
| `mem.used_percent` | Percentage of physical memory in use. | | `mem.used_percent` | Percentage of physical memory in use. |
| `swap.used_percent` | Percentage of swap space in use. | | `swap.used_percent` | Percentage of swap space in use. |
**Why they're important:** Consul keeps all of its data in memory. If Consul consumes all available memory, it will crash. You should also monitor total available RAM to make sure some RAM is available for other processes, and swap usage should remain at 0% for best performance. **Why they're important:** Consul keeps all of its data in memory. If Consul
consumes all available memory, it will crash. You should also monitor total
available RAM to make sure some RAM is available for other processes, and swap
usage should remain at 0% for best performance.
**What to look for:** If `mem.used_percent` is over 90%, or if **What to look for:** If `mem.used_percent` is over 90%, or if
`swap.used_percent` is greater than 0. `swap.used_percent` is greater than 0.
@ -185,9 +227,14 @@ Here is an example Grafana dashboard:
| `linux_sysctl_fs.file-nr` | Number of file handles being used across all processes on the host. | | `linux_sysctl_fs.file-nr` | Number of file handles being used across all processes on the host. |
| `linux_sysctl_fs.file-max` | Total number of available file handles. | | `linux_sysctl_fs.file-max` | Total number of available file handles. |
**Why it's important:** Practically anything Consul does -- receiving a connection from another host, sending data between servers, writing snapshots to disk -- requires a file descriptor handle. If Consul runs out of handles, it will stop accepting connections. See [the Consul FAQ][consul_faq_fds] for more details. **Why it's important:** Practically anything Consul does -- receiving a
connection from another host, sending data between servers, writing snapshots to
disk -- requires a file descriptor handle. If Consul runs out of handles, it
will stop accepting connections. See [the Consul FAQ][consul_faq_fds] for more
details.
By default, process and kernel limits are fairly conservative. You will want to increase these beyond the defaults. By default, process and kernel limits are fairly conservative. You will want to
increase these beyond the defaults.
**What to look for:** If `file-nr` exceeds 80% of `file-max`. **What to look for:** If `file-nr` exceeds 80% of `file-max`.
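As a rough illustration of the 80% rule, the following Python sketch reads the kernel's file handle counters directly on a Linux host. It assumes the standard three-field layout of `/proc/sys/fs/file-nr` (allocated, unused, maximum). The function names are hypothetical, and in practice you would alert on the Telegraf metrics rather than poll the file yourself.

```python
from pathlib import Path


def parse_file_nr(text):
    """Parse the contents of /proc/sys/fs/file-nr into three integers."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def handles_exceed_threshold(allocated, maximum, percent=80):
    """Return True when allocated handles exceed `percent` of the maximum."""
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return allocated * 100 > maximum * percent


def check_host(path="/proc/sys/fs/file-nr"):
    """Read the live counters (Linux only) and apply the threshold."""
    allocated, _unused, maximum = parse_file_nr(Path(path).read_text())
    return handles_exceed_threshold(allocated, maximum)
```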
@ -198,9 +245,10 @@ By default, process and kernel limits are fairly conservative. You will want to
| `cpu.user_cpu` | Percentage of CPU being used by user processes (such as Consul). | | `cpu.user_cpu` | Percentage of CPU being used by user processes (such as Consul). |
| `cpu.iowait_cpu` | Percentage of CPU time spent waiting for I/O tasks to complete. | | `cpu.iowait_cpu` | Percentage of CPU time spent waiting for I/O tasks to complete. |
**Why they're important:** Consul is not particularly demanding of CPU time, but a spike in CPU usage might **Why they're important:** Consul is not particularly demanding of CPU time, but
indicate too many operations taking place at once, and `iowait_cpu` is critical -- it means Consul is waiting a spike in CPU usage might indicate too many operations taking place at once,
for data to be written to disk, a sign that Raft might be writing snapshots to disk too often. and `iowait_cpu` is critical -- it means Consul is waiting for data to be
written to disk, a sign that Raft might be writing snapshots to disk too often.
**What to look for:** If `cpu.iowait_cpu` is greater than 10%. **What to look for:** If `cpu.iowait_cpu` is greater than 10%.
@ -211,14 +259,17 @@ for data to be written to disk, a sign that Raft might be writing snapshots to d
| `net.bytes_recv` | Bytes received on each network interface. | | `net.bytes_recv` | Bytes received on each network interface. |
| `net.bytes_sent` | Bytes transmitted on each network interface. | | `net.bytes_sent` | Bytes transmitted on each network interface. |
**Why they're important:** A sudden spike in network traffic to Consul might be the result of a misconfigured **Why they're important:** A sudden spike in network traffic to Consul might be
application client causing too many requests to Consul. This is the raw data from the system, rather than a specific Consul metric. the result of a misconfigured application client causing too many requests to
Consul. This is the raw data from the system, rather than a specific Consul
metric.
**What to look for:** **What to look for:** Sudden large changes to the `net` metrics (greater than
Sudden large changes to the `net` metrics (greater than 50% deviation from baseline). 50% deviation from baseline).
**NOTE:** The `net` metrics are counters, so in order to calculate rates (such as bytes/second), **NOTE:** The `net` metrics are counters, so in order to calculate rates (such
you will need to apply a function such as [non_negative_difference][]. as bytes/second), you will need to apply a function such as
[non_negative_difference][].
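The idea behind turning a counter into a rate can be sketched in Python. This is an illustration of the technique, not the InfluxDB implementation: it takes `(timestamp_seconds, counter_value)` samples, computes the difference between consecutive samples, drops negative differences (which typically mean the counter was reset), and divides by the elapsed time.

```python
def non_negative_rates(samples):
    """Convert counter samples into per-second rates.

    `samples` is a list of (timestamp_seconds, counter_value) pairs in
    time order. Negative differences are skipped rather than reported.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        delta = v1 - v0
        elapsed = t1 - t0
        if delta < 0 or elapsed <= 0:
            continue  # counter reset or bad timestamps: no rate for this pair
        rates.append((t1, delta / elapsed))
    return rates
```

For example, samples taken every 10 seconds with values 1000, 3000, 500, 1500 yield rates of 200 and 100 bytes/second; the drop from 3000 to 500 is discarded.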
### Disk activity ### Disk activity
@ -227,18 +278,30 @@ you will need to apply a function such as [non_negative_difference][].
| `diskio.read_bytes` | Bytes read from each block device. | | `diskio.read_bytes` | Bytes read from each block device. |
| `diskio.write_bytes` | Bytes written to each block device. | | `diskio.write_bytes` | Bytes written to each block device. |
**Why they're important:** If the Consul host is writing a lot of data to disk, such as under high volume workloads, there may be frequent major I/O spikes during leader elections. This is because under heavy load, **Why they're important:** If the Consul host is writing a lot of data to disk,
Consul is checkpointing Raft snapshots to disk frequently. such as under high volume workloads, there may be frequent major I/O spikes
during leader elections. This is because under heavy load, Consul is
checkpointing Raft snapshots to disk frequently.
It may also be caused by Consul having debug/trace logging enabled in production, which can impact performance. It may also be caused by Consul having debug/trace logging enabled in
production, which can impact performance.
Too much disk I/O can cause the rest of the system to slow down or become unavailable, as the kernel spends all its time waiting for I/O to complete. Too much disk I/O can cause the rest of the system to slow down or become
unavailable, as the kernel spends all its time waiting for I/O to complete.
**What to look for:** Sudden large changes to the `diskio` metrics (greater than 50% deviation from baseline, **What to look for:** Sudden large changes to the `diskio` metrics (greater than
or more than 3 standard deviations from baseline). 50% deviation from baseline, or more than 3 standard deviations from baseline).
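One way to express this rule is the short Python sketch below. It is hypothetical: it assumes the baseline is a list of recent rate values (at least two) that you have already collected, and it uses the sample standard deviation. It flags a value that deviates from the baseline mean by more than 50%, or by more than 3 standard deviations.

```python
from statistics import mean, stdev


def is_anomalous(baseline, current, max_fraction=0.5, max_sigmas=3.0):
    """Return True if `current` deviates too far from the baseline values."""
    base_mean = mean(baseline)
    base_std = stdev(baseline)  # requires at least two baseline points
    deviation = abs(current - base_mean)
    # Relative check: more than max_fraction away from the mean.
    over_fraction = base_mean != 0 and deviation > max_fraction * abs(base_mean)
    # Statistical check: more than max_sigmas standard deviations away.
    over_sigma = base_std > 0 and deviation > max_sigmas * base_std
    return over_fraction or over_sigma
```

The two checks complement each other: the percentage check catches large jumps on a noisy baseline, while the standard deviation check catches smaller jumps on a very steady one.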
**NOTE:** The `diskio` metrics are counters, so in order to calculate rates (such as bytes/second), **NOTE:** The `diskio` metrics are counters, so in order to calculate rates
you will need to apply a function such as [non_negative_difference][]. (such as bytes/second), you will need to apply a function such as
[non_negative_difference][].
## Summary
In this guide you learned how to set up Telegraf with Consul to collect metrics,
and considered your options for visualizing, aggregating, and alerting on those
metrics. To learn about other factors (in addition to monitoring) that you
should consider when running Consul in production, see the [Production Checklist][prod-checklist].
[non_negative_difference]: https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#non-negative-difference [non_negative_difference]: https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#non-negative-difference
[consul_faq_fds]: https://www.consul.io/docs/faq.html#q-does-consul-require-certain-user-process-resource-limits- [consul_faq_fds]: https://www.consul.io/docs/faq.html#q-does-consul-require-certain-user-process-resource-limits-
@ -254,3 +317,4 @@ you will need to apply a function such as [non_negative_difference][].
[telegraf-input-plugins]: https://docs.influxdata.com/telegraf/v1.6/plugins/inputs/ [telegraf-input-plugins]: https://docs.influxdata.com/telegraf/v1.6/plugins/inputs/
[Grafana]: https://www.influxdata.com/partners/grafana/ [Grafana]: https://www.influxdata.com/partners/grafana/
[Chronograf]: https://www.influxdata.com/time-series-platform/chronograf/ [Chronograf]: https://www.influxdata.com/time-series-platform/chronograf/
[prod-checklist]: https://learn.hashicorp.com/consul/advanced/day-1-operations/production-checklist