From 64e35777e044a9c6122093067eacc871f440b7db Mon Sep 17 00:00:00 2001
From: FFMMM
Date: Thu, 31 Mar 2022 13:04:33 -0700
Subject: [PATCH] docs: new rpc metric (#12608)

---
 website/content/docs/agent/telemetry.mdx | 120 ++++++++++++++++-------
 1 file changed, 83 insertions(+), 37 deletions(-)

diff --git a/website/content/docs/agent/telemetry.mdx b/website/content/docs/agent/telemetry.mdx
index e17b753965..4f4ef89837 100644
--- a/website/content/docs/agent/telemetry.mdx
+++ b/website/content/docs/agent/telemetry.mdx
@@ -289,7 +289,7 @@ performance degradations related to Bolt DB, these metrics will show the issue a

**What to look for:**

-The primary thing to look for are increases in the `consul.raft.boltdb.storeLogs` times. Its value will directly govern an 
+The primary thing to look for is an increase in the `consul.raft.boltdb.storeLogs` times. Its value will directly govern an
upper limit to the throughput of write operations within Consul.

In Consul each write operation will turn into a single Raft log to be committed. Raft will process these
@@ -313,7 +313,7 @@ to drastically increase disk write throughput, potentially beyond what the under
detect this situation you can look at the `consul.raft.boltdb.freelistBytes` metric. This metric is a count of the extra
bytes that are being written for each log storage operation beyond the log data itself. While not a clear indicator of
an actual issue, this metric can be used to diagnose why the `consul.raft.boltdb.storeLogs` metric
-is high. 
+is high.

If Bolt DB log storage performance becomes an issue and is caused by free list management then setting
[`raft_boltdb.NoFreelistSync`](/docs/agent/options#NoFreelistSync) to `true` in the server's configuration
@@ -519,47 +519,93 @@ These metrics are used to monitor the health of the Consul servers.
| `consul.grpc.server.streams` | Measures the number of active gRPC streams handled by the server. | streams | gauge |
| `consul.xds.server.streams` | Measures the number of active xDS streams handled by the server split by protocol version. | streams | gauge |
+
+## Server Workload
+
+**Requirements:**
+* Consul 1.12.0+
+
+Label-based RPC metrics were added in Consul 1.12.0 as a beta feature to better understand the workload on a Consul server and where that workload is coming from. The following metric provides that insight:
+
+| Metric                   | Description                                               | Unit | Type    |
+| ------------------------ | --------------------------------------------------------- | ---- | ------- |
+| `consul.rpc.server.call` | Measures the elapsed time taken to complete an RPC call.  | ms   | summary |
+
+Note that values of `consul.rpc.server.call` may be emitted as `0 ms`, which means that the elapsed time was less than `1 ms`.
+
+### Labels
+
+The server workload metrics above come with the following labels:
+
+| Label Name     | Description                                 | Possible values                          |
+| -------------- | ------------------------------------------- | ---------------------------------------- |
+| `method`       | The name of the RPC method.                 | The value of any RPC request in Consul.  |
+| `errored`      | Indicates whether the RPC call errored.     | `True` or `False`.                       |
+| `request_type` | Whether it is a `read` or `write` request.  | `read`, `write` or `unreported`.         |
+| `rpc_type`     | The RPC implementation.                     | `net/rpc` or `internal`.                 |
+
+#### Label Explanations
+
+The `internal` value for the `rpc_type` label in the table above refers to leader and cluster management RPC operations that Consul performs.
+Historically, `internal` RPC operation metrics were recorded under the same metric names.
+
+The `unreported` value for the `request_type` label in the table above refers to RPC requests within Consul where it is difficult to ascertain whether a request is a `read` or `write` type.
+
+Here is a Prometheus-style example of an RPC metric and its labels:
+
+```json
+ ...
+ consul_rpc_server_call{errored="false",method="Catalog.ListNodes",request_type="read",rpc_type="net/rpc",quantile="0.5"} 255
+ ...
+```
+
+Any metric in this section can be turned off with the [`prefix_filter`](/docs/agent/options#telemetry-prefix_filter).
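+
+For example, the following agent configuration snippet shows one possible way to block this metric with `prefix_filter`. The `prometheus_retention_time` value shown here is illustrative only and is not required for filtering:
+
+```json
+{
+  "telemetry": {
+    "prometheus_retention_time": "60s",
+    "prefix_filter": ["-consul.rpc.server.call"]
+  }
+}
+```
+
+Entries in `prefix_filter` are prefixed with `-` to block a metric prefix or `+` to allow one.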
+
## Cluster Health

These metrics give insight into the health of the cluster as a whole.

| Metric | Description | Unit | Type |
| ------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------- | ------- |
-| `consul.memberlist.degraded.probe` | Counts the number of times the agent has performed failure detection on another agent at a slower probe rate. The agent uses its own health metric as an indicator to perform this action. (If its health score is low, means that the node is healthy, and vice versa.) | probes / interval | counter |
-| `consul.memberlist.degraded.timeout` | Counts the number of times an agent was marked as a dead node, whilst not getting enough confirmations from a randomly selected list of agent nodes in an agent's membership. | occurrence / interval | counter |
-| `consul.memberlist.msg.dead` | Counts the number of times an agent has marked another agent to be a dead node. | messages / interval | counter |
-| `consul.memberlist.health.score` | Describes a node's perception of its own health based on how well it is meeting the soft real-time requirements of the protocol. This metric ranges from 0 to 8, where 0 indicates "totally healthy". This health score is used to scale the time between outgoing probes, and higher scores translate into longer probing intervals. For more details see section IV of the Lifeguard paper: https://arxiv.org/pdf/1707.00788.pdf | score | gauge |
-| `consul.memberlist.msg.suspect` | Increments when an agent suspects another as failed when executing random probes as part of the gossip protocol. These can be an indicator of overloaded agents, network problems, or configuration errors where agents can not connect to each other on the [required ports](/docs/agent/options#ports). | suspect messages received / interval | counter |
-| `consul.memberlist.tcp.accept` | Counts the number of times an agent has accepted an incoming TCP stream connection. | connections accepted / interval | counter |
-| `consul.memberlist.udp.sent/received` | Measures the total number of bytes sent/received by an agent through the UDP protocol. | bytes sent or bytes received / interval | counter |
-| `consul.memberlist.tcp.connect` | Counts the number of times an agent has initiated a push/pull sync with an other agent.
| push/pull initiated / interval | counter | -| `consul.memberlist.tcp.sent` | Measures the total number of bytes sent by an agent through the TCP protocol | bytes sent / interval | counter | -| `consul.memberlist.gossip` | Measures the time taken for gossip messages to be broadcasted to a set of randomly selected nodes. | ms | timer | -| `consul.memberlist.msg_alive` | Counts the number of alive messages, that the agent has processed so far, based on the message information given by the network layer. | messages / Interval | counter | -| `consul.memberlist.msg_dead` | The number of dead messages that the agent has processed so far, based on the message information given by the network layer. | messages / Interval | counter | -| `consul.memberlist.msg_suspect` | The number of suspect messages that the agent has processed so far, based on the message information given by the network layer. | messages / Interval | counter | -| `consul.memberlist.probeNode` | Measures the time taken to perform a single round of failure detection on a select agent. | nodes / Interval | counter | -| `consul.memberlist.pushPullNode` | Measures the number of agents that have exchanged state with this agent. | nodes / Interval | counter | -| `consul.serf.member.failed` | Increments when an agent is marked dead. This can be an indicator of overloaded agents, network problems, or configuration errors where agents cannot connect to each other on the [required ports](/docs/agent/options#ports). | failures / interval | counter | -| `consul.serf.member.flap` | Available in Consul 0.7 and later, this increments when an agent is marked dead and then recovers within a short time period. This can be an indicator of overloaded agents, network problems, or configuration errors where agents cannot connect to each other on the [required ports](/docs/agent/options#ports). | flaps / interval | counter | -| `consul.serf.member.join` | Increments when an agent joins the cluster. If an agent flapped or failed this counter also increments when it re-joins. | joins / interval | counter | -| `consul.serf.member.left` | Increments when an agent leaves the cluster. | leaves / interval | counter | -| `consul.serf.events` | Increments when an agent processes an [event](/commands/event). Consul uses events internally so there may be additional events showing in telemetry. There are also a per-event counters emitted as `consul.serf.events.`. | events / interval | counter | -| `consul.serf.msgs.sent` | This metric is sample of the number of bytes of messages broadcast to the cluster. In a given time interval, the sum of this metric is the total number of bytes sent and the count is the number of messages sent. | message bytes / interval | counter | -| `consul.autopilot.failure_tolerance` | Tracks the number of voting servers that the cluster can lose while continuing to function. | servers | gauge | -| `consul.autopilot.healthy` | Tracks the overall health of the local server cluster. If all servers are considered healthy by Autopilot, this will be set to 1. If any are unhealthy, this will be 0. All non-leader servers will report `NaN`. | boolean | gauge | -| `consul.session_ttl.active` | Tracks the active number of sessions being tracked. | sessions | gauge | -| `consul.catalog.service.query.` | Increments for each catalog query for the given service. | queries | counter | -| `consul.catalog.service.query-tag..` | Increments for each catalog query for the given service with the given tag. 
| queries | counter |
-| `consul.catalog.service.query-tags..` | Increments for each catalog query for the given service with the given tags. | queries | counter |
-| `consul.catalog.service.not-found.` | Increments for each catalog query where the given service could not be found. | queries | counter |
-| `consul.catalog.connect.query.` | Increments for each connect-based catalog query for the given service. | queries | counter |
-| `consul.catalog.connect.query-tag..` | Increments for each connect-based catalog query for the given service with the given tag. | queries | counter |
-| `consul.catalog.connect.query-tags..` | Increments for each connect-based catalog query for the given service with the given tags. | queries | counter |
-| `consul.catalog.connect.not-found.` | Increments for each connect-based catalog query where the given service could not be found. | queries | counter |
-| `consul.mesh.active-root-ca.expiry` | The number of seconds until the root CA expires, updated every hour. | seconds | gauge |
-| `consul.mesh.active-signing-ca.expiry` | The number of seconds until the signing CA expires, updated every hour. | seconds | gauge |
-| `consul.agent.tls.cert.expiry` | The number of seconds until the Agent TLS certificate expires, updated every hour. | seconds | gauge |
+| `consul.memberlist.degraded.probe` | Counts the number of times the agent has performed failure detection on another agent at a slower probe rate. The agent uses its own health metric as an indicator to perform this action. (If its health score is low, it means that the node is healthy, and vice versa.) | probes / interval | counter |
+| `consul.memberlist.degraded.timeout` | Counts the number of times an agent was marked as a dead node, whilst not getting enough confirmations from a randomly selected list of agent nodes in an agent's membership. | occurrence / interval | counter |
+| `consul.memberlist.msg.dead` | Counts the number of times an agent has marked another agent to be a dead node. | messages / interval | counter |
+| `consul.memberlist.health.score` | Describes a node's perception of its own health based on how well it is meeting the soft real-time requirements of the protocol. This metric ranges from 0 to 8, where 0 indicates "totally healthy". This health score is used to scale the time between outgoing probes, and higher scores translate into longer probing intervals. For more details see section IV of the Lifeguard paper: https://arxiv.org/pdf/1707.00788.pdf | score | gauge |
+| `consul.memberlist.msg.suspect` | Increments when an agent suspects another as failed when executing random probes as part of the gossip protocol. These can be an indicator of overloaded agents, network problems, or configuration errors where agents cannot connect to each other on the [required ports](/docs/agent/options#ports). | suspect messages received / interval | counter |
+| `consul.memberlist.tcp.accept` | Counts the number of times an agent has accepted an incoming TCP stream connection. | connections accepted / interval | counter |
+| `consul.memberlist.udp.sent/received` | Measures the total number of bytes sent/received by an agent through the UDP protocol. | bytes sent or bytes received / interval | counter |
+| `consul.memberlist.tcp.connect` | Counts the number of times an agent has initiated a push/pull sync with another agent. | push/pull initiated / interval | counter |
+| `consul.memberlist.tcp.sent` | Measures the total number of bytes sent by an agent through the TCP protocol. | bytes sent / interval | counter |
+| `consul.memberlist.gossip` | Measures the time taken for gossip messages to be broadcasted to a set of randomly selected nodes. | ms | timer |
+| `consul.memberlist.msg_alive` | Counts the number of alive messages that the agent has processed so far, based on the message information given by the network layer. | messages / interval | counter |
+| `consul.memberlist.msg_dead` | The number of dead messages that the agent has processed so far, based on the message information given by the network layer. | messages / interval | counter |
+| `consul.memberlist.msg_suspect` | The number of suspect messages that the agent has processed so far, based on the message information given by the network layer. | messages / interval | counter |
+| `consul.memberlist.probeNode` | Measures the time taken to perform a single round of failure detection on a select agent. | nodes / interval | counter |
+| `consul.memberlist.pushPullNode` | Measures the number of agents that have exchanged state with this agent. | nodes / interval | counter |
+| `consul.serf.member.failed` | Increments when an agent is marked dead. This can be an indicator of overloaded agents, network problems, or configuration errors where agents cannot connect to each other on the [required ports](/docs/agent/options#ports). | failures / interval | counter |
+| `consul.serf.member.flap` | Available in Consul 0.7 and later, this increments when an agent is marked dead and then recovers within a short time period. This can be an indicator of overloaded agents, network problems, or configuration errors where agents cannot connect to each other on the [required ports](/docs/agent/options#ports). | flaps / interval | counter |
+| `consul.serf.member.join` | Increments when an agent joins the cluster. If an agent flapped or failed this counter also increments when it re-joins. | joins / interval | counter |
+| `consul.serf.member.left` | Increments when an agent leaves the cluster. | leaves / interval | counter |
+| `consul.serf.events` | Increments when an agent processes an [event](/commands/event). Consul uses events internally so there may be additional events showing in telemetry. There are also per-event counters emitted as `consul.serf.events.`. | events / interval | counter |
+| `consul.serf.msgs.sent` | This metric is a sample of the number of bytes of messages broadcast to the cluster. In a given time interval, the sum of this metric is the total number of bytes sent and the count is the number of messages sent. | message bytes / interval | counter |
+| `consul.autopilot.failure_tolerance` | Tracks the number of voting servers that the cluster can lose while continuing to function. | servers | gauge |
+| `consul.autopilot.healthy` | Tracks the overall health of the local server cluster. If all servers are considered healthy by Autopilot, this will be set to 1. If any are unhealthy, this will be 0. All non-leader servers will report `NaN`. | boolean | gauge |
+| `consul.session_ttl.active` | Tracks the active number of sessions being tracked. | sessions | gauge |
+| `consul.catalog.service.query.` | Increments for each catalog query for the given service. | queries | counter |
+| `consul.catalog.service.query-tag..` | Increments for each catalog query for the given service with the given tag. | queries | counter |
+| `consul.catalog.service.query-tags..` | Increments for each catalog query for the given service with the given tags. | queries | counter |
+| `consul.catalog.service.not-found.` | Increments for each catalog query where the given service could not be found. | queries | counter |
+| `consul.catalog.connect.query.` | Increments for each connect-based catalog query for the given service. | queries | counter |
+| `consul.catalog.connect.query-tag..` | Increments for each connect-based catalog query for the given service with the given tag. | queries | counter |
+| `consul.catalog.connect.query-tags..` | Increments for each connect-based catalog query for the given service with the given tags. | queries | counter |
+| `consul.catalog.connect.not-found.` | Increments for each connect-based catalog query where the given service could not be found. | queries | counter |
+| `consul.mesh.active-root-ca.expiry` | The number of seconds until the root CA expires, updated every hour. | seconds | gauge |
+| `consul.mesh.active-signing-ca.expiry` | The number of seconds until the signing CA expires, updated every hour. | seconds | gauge |
+| `consul.agent.tls.cert.expiry` | The number of seconds until the Agent TLS certificate expires, updated every hour. | seconds | gauge |

## Connect Built-in Proxy Metrics