You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
consul/website/content/docs/agent/wal-logstore/monitoring.mdx

84 lines
4.7 KiB

---
layout: docs
page_title: Monitor Raft metrics and logs for WAL
description: >-
Learn how to monitor Raft metrics emitted the experimental WAL (write-ahead log) LogStore backend shipped in Consul 1.15.
---
# Monitor Raft metrics and logs for WAL
This topic describes how to monitor Raft metrics and logs for the WAL backend. We strongly recommend monitoring the Consul cluster, especially the target server, for evidence that the WAL backend is not functioning correctly.
## Monitor for checksum failures
Log store verification failures on any server, regardless of whether you are running the BoltDB or WAL backed, are unrecoverable errors. Consul may report the following errors in logs.
### Read failures: Disk Corruption
```log hideClipboard
2022-11-15T22:41:23.546Z [ERROR] agent.raft.logstore: verification checksum FAILED: storage corruption rangeStart=1234 rangeEnd=3456 leaderChecksum=0xc1... readChecksum=0x45...
```
This indicates that the server read back data that is different from what it wrote to disk. This indicates corruption in the storage backend or filesystem.
For convenience, Consul also increments a metric `consul.raft.logstore.verifier.read_checksum_failures` when this occurs.
### Write failures: In-flight Corruption
The following error indicates that the checksum on the follower did not match the leader when the follower received the logs. The error implies that the corruption happened in the network or software and not the log store:
```log hideClipboard
2022-11-15T22:41:23.546Z [ERROR] agent.raft.logstore: verification checksum FAILED: in-flight corruption rangeStart=1234 rangeEnd=3456 leaderChecksum=0xc1... followerWriteChecksum=0x45...
```
It is unlikely that this error indicates an issue with the storage backend, but you should take the same steps to resolve and report it.
The `consul.raft.logstore.verifier.write_checksum_failures` metric increments when this error occurs.
## Resolve checksum failures
If either type of corruption is detected, complete the steps to [revert the backend to BoltDB](/consul/docs/agent/wal-logstore/revert-to-boltdb). If the server already uses BoltDB, the errors likely indicate a latent bug in BoltDB or a bug in the verification code. In both cases, follow the instructions to revert to BoltDB.
Report all verification failures as a [GitHub
issue](https://github.com/hashicorp/consul/issues/new?assignees=&labels=&template=bug_report.md&title=WAL:%20Checksum%20Failure).
In your report, include the following:
- Details of your server cluster configuration and hardware
- Logs around the failure message
- Context for how long they have been running the configuration
- Any metrics or description of the workload you have. For example, how many raft
commits per second. Also include the performance metrics described on this page.
We recommend setting up an alert on Consul server logs containing `verification checksum FAILED` or on the `consul.raft.logstore.verifier.{read|write}_checksum_failures` metrics. The sooner you respond to a corrupt server, the lower the chance of any of the [potential risks](/consul/docs/agent/wal-logstore/enable#risks) causing problems in your cluster.
## Performance metrics
The key performance metrics to watch are:
- `consul.raft.commitTime` measures the time to commit new writes on a quorum of
servers. It should be the same or lower after deploying WAL. Even if WAL is
faster for your workload and hardware, it may not be reflected in `commitTime`
until enough followers are using WAL that the leader does not have to wait for
two slower followers in a cluster of five to catch up.
- `consul.raft.rpc.appendEntries.storeLogs` measures the time spent persisting
logs to disk on each _follower_. It should be the same or lower for
WAL-enabled followers.
- `consul.raft.replication.appendEntries.rpc` measures the time taken for each
`AppendEntries` RPC from the leader's perspective. If this is significantly
higher than `consul.raft.rpc.appendEntries` on the follower, it indicates a
known queuing issue in the Raft library and is unrelated to the backend.
Followers with WAL enabled should not be slower than the others. You can
determine which follower is associated with which metric by running the
`consul operator raft list-peers` command and matching the
`peer_id` label value to the server IDs listed.
- `consul.raft.compactLogs` measures the time take to truncate the logs after a
snapshot. WAL-enabled servers should not be slower than BoltDB servers.
- `consul.raft.leader.dispatchLog` measures the time spent persisting logs to
disk on the _leader_. It is only relevant if a WAL-enabled server becomes a
leader. It should be the same or lower than before when the leader was using
BoltDB.