mirror of https://github.com/hashicorp/consul
* Add WAL documentation. Also fix some minor metrics registration details
* Add tests to verify metrics are registered correctly
* refactor and move wal docs
* Updates to the WAL overview page
* updates to enable WAL usage topic
* updates to the monitoring WAL backend topic
* updates for revert WAL topic
* a few tweaks to overview and updated metadescriptions
* Apply suggestions from code review

  Co-authored-by: Paul Banks <pbanks@hashicorp.com>
* make revert docs consistent with enable
* Apply suggestions from code review

  Co-authored-by: Paul Banks <pbanks@hashicorp.com>
* address feedback
* address final feedback
* Apply suggestions from code review

  Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

---------

Co-authored-by: Paul Banks <pbanks@hashicorp.com>
Co-authored-by: trujillo-adam <ajosetru@gmail.com>
Co-authored-by: trujillo-adam <47586768+trujillo-adam@users.noreply.github.com>
Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

pull/13961/merge
Tu Nguyen
2 years ago
committed by GitHub
7 changed files with 481 additions and 15 deletions
@@ -0,0 +1,143 @@

---
layout: docs
page_title: Enable the experimental WAL LogStore backend
description: >-
  Learn how to safely configure and test the experimental WAL backend in your Consul deployment.
---

# Enable the experimental WAL LogStore backend

This topic describes how to safely configure and test the WAL backend in your Consul deployment.

The overall process for enabling the WAL LogStore backend for one server consists of the following steps. In production environments, we recommend starting by enabling the backend on a single server. If you eventually choose to expand the test to further servers, you must repeat these steps for each one.

1. Enable log verification.
1. Select target server to enable WAL.
1. Stop target server gracefully.
1. Remove data directory from target server.
1. Update target server's configuration.
1. Start the target server.
1. Monitor target server Raft metrics and logs.

!> **Experimental feature:** The WAL LogStore backend is experimental.

## Requirements

- Consul v1.15 or later is required for all servers in the datacenter. Refer to the [standard upgrade procedure](/consul/docs/upgrading/general-process) and the [1.15 upgrade notes](/consul/docs/upgrading/upgrade-specific#consul-1-15-x) for additional information.
- A Consul cluster with at least three nodes is required to safely test the WAL backend without downtime.

We recommend taking the following additional measures:

- Take a snapshot prior to testing, for example with the command shown after this list.
- Monitor Consul server metrics and logs, and set an alert on specific log events that occur when WAL is enabled. Refer to [Monitor Raft metrics and logs for WAL](/consul/docs/agent/wal-logstore/monitoring) for more information.
- Enable WAL in a pre-production environment and run it for several days before enabling it in production.
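
A minimal sketch of taking a snapshot before you begin, assuming the `consul` CLI can reach your servers and is authorized to use the snapshot API; the filename is only an example:

```shell-session
$ consul snapshot save pre-wal-test.snap
```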

## Risks

Although the likelihood of the following risks is low to very low, be aware of them before implementing the WAL backend:

- If WAL corrupts data on a Consul server agent, the server data cannot be recovered. Restart the server with an empty data directory and reload its state from the leader to resolve the issue.
- WAL may corrupt data or contain a defect that causes the server to panic and crash. WAL may not restart if the defect recurs when WAL reads from the logs on startup. Restart the server with an empty data directory and reload its state from the leader to resolve the issue.
- If WAL corrupts data, clients may read corrupted data from the Consul server, such as invalid IP addresses or unmatched tokens. This outcome is unlikely even if a recurring defect causes WAL to corrupt data because replication uses objects cached in memory instead of reads from disk. Restart the server with an empty data directory and reload its state from the leader to resolve the issue.
- If you enable WAL on a Consul OSS server or on a voting server with Consul Enterprise, and WAL corrupts the server's state, the server may become the leader and replicate the corrupted state to all other servers. In this case, restoring from backup is required to recover a completely uncorrupted state. Test WAL on a non-voting server in Enterprise to prevent this outcome. You can add a new non-voting server to the cluster to test with if there are no existing ones.

## Enable log verification

You must enable log verification on all voting servers in Enterprise and all servers in OSS because the leader writes verification checkpoints.

1. On each voting server, add the following to the server's configuration file:

   ```hcl
   raft_logstore {
     verification {
       enabled = true
       interval = "60s"
     }
   }
   ```

1. Restart the server to apply the changes. The `consul reload` command is not sufficient to apply `raft_logstore` configuration changes. Refer to the example command after this list.
1. Run the `consul operator raft list-peers` command to wait for each server to become a healthy voter before moving on to the next. This may take a few minutes for large snapshots.
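
For example, if you manage Consul with `systemd`, as the later examples in this topic assume, restart each server with the following command:

```shell-session
$ systemctl restart consul
```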

When complete, the server's logs should contain verifier reports similar to the following example:

```log hideClipboard
2023-01-31T14:44:31.174Z [INFO] agent.server.raft.logstore.verifier: verification checksum OK: elapsed=488.463268ms leaderChecksum=f15db83976f2328c rangeEnd=357802 rangeStart=298132 readChecksum=f15db83976f2328c
```

## Select target server to enable WAL

If you are using Consul OSS or Consul Enterprise without non-voting servers, select a follower server to enable WAL. As noted in [Risks](#risks), Consul Enterprise users with non-voting servers should first select a non-voting server, or consider adding another server as a non-voter to test on.

Retrieve the current state of the servers by running the following command:

```shell-session
$ consul operator raft list-peers
```
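
The command prints each server's Raft role and voter status, which helps you pick a follower (or non-voter) as the target. Output similar to the following indicates that `server-2` or `server-3` would be suitable follower targets; all values are illustrative only:

```shell-session hideClipboard
Node      ID         Address        State     Voter  RaftProtocol
server-1  <node-id>  10.0.0.1:8300  leader    true   3
server-2  <node-id>  10.0.0.2:8300  follower  true   3
server-3  <node-id>  10.0.0.3:8300  follower  true   3
```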

## Stop target server

Stop the target server gracefully. For example, if you are using `systemd`,
run the following command:

```shell-session
$ systemctl stop consul
```

If your environment uses configuration management automation that might interfere with this process, such as Chef or Puppet, you must disable it until you have completely enabled WAL as a storage backend.

## Remove data directory from target server

Temporarily moving the data directory to a different location is less destructive than deleting it, and it makes recovery easier if you are unable to enable WAL successfully. However, do not use the old data directory (`/data-dir/raft.bak`) for recovery after restarting the server. We recommend eventually deleting the old directory.

The following example assumes the `data_dir` in the server's configuration is `/data-dir` and moves the Raft directory to `/data-dir/raft.bak`.

```shell-session
$ mv /data-dir/raft /data-dir/raft.bak
```

When switching backends, you must always remove _the entire raft directory_, not just the `raft.db` file or `wal` directory. The log must always be consistent with the snapshots to avoid undefined behavior or data loss.
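
For reference, a BoltDB-backed Raft directory typically contains the `raft.db` file alongside a `snapshots` directory, so the backup you just created looks similar to the following. The listing is illustrative and assumes the example paths above:

```shell-session
$ ls /data-dir/raft.bak
raft.db  snapshots
```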

## Update target server configuration

Add the following to the target server's configuration file:

```hcl
raft_logstore {
  backend = "wal"
  verification {
    enabled = true
    interval = "60s"
  }
}
```
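
Optionally, verify that the updated configuration parses cleanly before starting the server with the `consul validate` command. The configuration directory path below is an assumption; use the path your server actually loads:

```shell-session
$ consul validate /etc/consul.d/
```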

## Start target server

Start the target server. For example, if you are using `systemd`, run the following command:

```shell-session
$ systemctl start consul
```

Watch for the server to become a healthy voter again.

```shell-session
$ consul operator raft list-peers
```
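
You can also confirm that the server is now writing logs with the WAL backend by checking for the `wal` directory it creates under the Raft data directory. A quick sketch, assuming the example `data_dir` used earlier; the exact listing is illustrative:

```shell-session
$ ls /data-dir/raft
snapshots  wal
```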

## Monitor target server Raft metrics and logs

Refer to [Monitor Raft metrics and logs for WAL](/consul/docs/agent/wal-logstore/monitoring) for details.

We recommend leaving the cluster in the test configuration for several days or weeks, as long as you observe no errors. An extended test provides more confidence that WAL operates correctly under varied workloads and during routine server restarts. If you observe any errors, end the test immediately and report them.

If you disabled configuration management automation, consider re-enabling it during the testing phase to pick up other updates for the host. You must ensure that it does not revert the Consul configuration file and remove the altered backend configuration. One way to do this is to add the `raft_logstore` block to a separate file that is not managed by your automation. You can either place this file in the directory specified by [`-config-dir`](/consul/docs/agent/config/cli-flags#_config_dir) or pass it as an additional [`-config-file`](/consul/docs/agent/config/cli-flags#_config_file) argument.
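
For example, a minimal sketch of keeping the backend settings in their own file, outside your automation's control; the directory and filename are assumptions, so substitute the configuration directory your servers actually use:

```shell-session
$ cat > /etc/consul.d/raft-logstore.hcl <<'EOF'
raft_logstore {
  backend = "wal"
  verification {
    enabled = true
    interval = "60s"
  }
}
EOF
```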

## Next steps

- If you observe any verification errors, performance anomalies, or other suspicious behavior from the target server during the test, you should immediately follow [the procedure to revert back to BoltDB](/consul/docs/agent/wal-logstore/revert-to-boltdb). Report failures through GitHub.

- If you do not see errors and would like to expand the test further, you can repeat the above procedure on another target server. We suggest waiting between each expansion and slowly rolling WAL out to other parts of your environment. Once the majority of your servers use WAL, any bugs not yet discovered may result in cluster unavailability.

- If you wish to permanently enable WAL on all servers, repeat the steps described in this topic for each server. Even if `backend = "wal"` is set in the configuration, servers continue to use BoltDB if they find an existing `raft.db` file in the data directory.
@@ -0,0 +1,48 @@

---
layout: docs
page_title: WAL LogStore Backend Overview
description: >-
  The experimental WAL (write-ahead log) LogStore backend shipped in Consul 1.15 is intended to replace the BoltDB backend, improving performance and resolving log storage issues.
---

# Experimental WAL LogStore backend overview

This topic provides an overview of the experimental WAL (write-ahead log) LogStore backend.

!> **Experimental feature:** The WAL LogStore backend is experimental.

## WAL versus BoltDB

WAL implements a traditional log with rotating, append-only log files. WAL resolves many issues with the existing `LogStore` provided by the BoltDB backend. The BoltDB `LogStore` is a copy-on-write BTree, which is not optimized for append-only, write-heavy workloads.

### BoltDB storage scalability issues

The existing BoltDB log store inefficiently stores append-only logs to disk because it was designed as a full key-value database. It is a single file that only ever grows. Deleting the oldest logs, which Consul does regularly when it makes new snapshots of the state, leaves free space in the file. The free space must be tracked in a `freelist` so that BoltDB can reuse it on future writes. By contrast, a simple segmented log can delete the oldest log files from disk.

A burst of writes at double or triple the normal volume can suddenly cause the log file to grow to several times its steady-state size. After Consul takes the next snapshot and truncates the oldest logs, the resulting file is mostly empty space.

To track the free space, Consul must write extra metadata to disk with every write. The metadata is proportional to the number of free pages, so after a large burst, write latencies tend to increase. In some cases, the latencies cause serious performance degradation to the cluster.

To mitigate risks associated with sudden bursts of log data, Consul tries to prevent too many logs from accumulating in the LogStore. Significantly larger BoltDB files are slower to append to because the tree is deeper and the freelist larger, and the larger the file, the more likely it is to have a large freelist or to suddenly form one after a burst of writes. For this reason, many of Consul's default options associated with snapshots, truncating logs, and keeping the log history have been aggressively set toward keeping BoltDB small rather than using disk IO optimally.

Other reliability issues, such as [raft replication capacity issues](/consul/docs/agent/telemetry#raft-replication-capacity-issues), are much simpler to solve without the performance concerns caused by storing more logs in BoltDB.

### WAL approaches storage issues differently

When directly measured, WAL is more performant than BoltDB because it solves a simpler storage problem. Despite this, some users may not notice a significant performance improvement from the upgrade with the same configuration and workload. In this case, the benefit of WAL is that retaining more logs does not affect write performance. As a result, strategies for reducing disk IO with slower snapshots, or for keeping more logs so that slower followers can catch up with cluster state, become possible and increase the reliability of the deployment.
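
For example, one such strategy is to retain more trailing Raft logs and snapshot less often so that followers can catch up from the log instead of installing a full snapshot. A minimal sketch using Consul's standard `raft_trailing_logs` and `raft_snapshot_interval` options in a drop-in configuration file; the path and values are illustrative only, not recommendations:

```shell-session
$ cat > /etc/consul.d/raft-tuning.hcl <<'EOF'
# Keep more logs between snapshots so slow followers can catch up
# from the log rather than a full snapshot install.
raft_trailing_logs     = 50000
raft_snapshot_interval = "120s"
EOF
```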

## WAL quality assurance

The WAL backend has been tested thoroughly during development:

- Every component in the WAL, from [metadata management](https://github.com/hashicorp/raft-wal/blob/main/types/meta.go) and [log file encoding](https://github.com/hashicorp/raft-wal/blob/main/types/segment.go) to the actual [file-system interaction](https://github.com/hashicorp/raft-wal/blob/main/types/vfs.go), is abstracted so unit tests can simulate difficult-to-reproduce disk failures.

- We used the [application-level intelligent crash explorer (ALICE)](https://github.com/hashicorp/raft-wal/blob/main/alice/README.md) to exhaustively simulate thousands of possible crash failure scenarios. WAL correctly recovered from all scenarios.

- We ran hundreds of tests in a performance testing cluster with checksum verification enabled and did not detect data loss or corruption. We will continue testing before making WAL the default backend.

We are aware of how complex and critical disk persistence is for your data.

We hope that many users at different scales will try WAL in their environments after upgrading to 1.15 or later and report success or failure so that we can confidently replace BoltDB as the default for new clusters in a future release.
@@ -0,0 +1,85 @@

---
layout: docs
page_title: Monitor Raft metrics and logs for WAL
description: >-
  Learn how to monitor Raft metrics emitted by the experimental WAL (write-ahead log) LogStore backend shipped in Consul 1.15.
---

# Monitor Raft metrics and logs for WAL

This topic describes how to monitor Raft metrics and logs if you are testing the WAL backend. We strongly recommend monitoring the Consul cluster, especially the target server, for evidence that the WAL backend is not functioning correctly. Refer to [Enable the experimental WAL LogStore backend](/consul/docs/agent/wal-logstore/index) for additional information about the WAL backend.

!> **Experimental feature:** The WAL LogStore backend is experimental.

## Monitor for checksum failures

Log store verification failures on any server, regardless of whether you are running the BoltDB or WAL backend, are unrecoverable errors. Consul may report the following errors in logs.

### Read failures: Disk Corruption

```log hideClipboard
2022-11-15T22:41:23.546Z [ERROR] agent.raft.logstore: verification checksum FAILED: storage corruption rangeStart=1234 rangeEnd=3456 leaderChecksum=0xc1... readChecksum=0x45...
```

This error indicates that the server read back data that is different from what it wrote to disk, which points to corruption in the storage backend or filesystem.

For convenience, Consul also increments a metric, `consul.raft.logstore.verifier.read_checksum_failures`, when this occurs.

### Write failures: In-flight Corruption

The following error indicates that the checksum on the follower did not match the leader when the follower received the logs. The error implies that the corruption happened in the network or software and not the log store:

```log hideClipboard
2022-11-15T22:41:23.546Z [ERROR] agent.raft.logstore: verification checksum FAILED: in-flight corruption rangeStart=1234 rangeEnd=3456 leaderChecksum=0xc1... followerWriteChecksum=0x45...
```

It is unlikely that this error indicates an issue with the storage backend, but you should take the same steps to resolve and report it.

The `consul.raft.logstore.verifier.write_checksum_failures` metric increments when this error occurs.

## Resolve checksum failures

If either type of corruption is detected, complete the instructions for [reverting to BoltDB](/consul/docs/agent/wal-logstore/revert-to-boltdb). If the server already uses BoltDB, the errors likely indicate a latent bug in BoltDB or a bug in the verification code. In both cases, you should follow the revert instructions.

Report all verification failures as a [GitHub issue](https://github.com/hashicorp/consul/issues/new?assignees=&labels=&template=bug_report.md&title=WAL:%20Checksum%20Failure).

In your report, include the following:

- Details of your server cluster configuration and hardware
- Logs around the failure message
- Context for how long you have been running the configuration
- Any metrics or a description of the workload you have, for example, how many Raft commits per second. Also include the performance metrics described on this page.

We recommend setting up an alert on Consul server logs containing `verification checksum FAILED` or on the `consul.raft.logstore.verifier.{read|write}_checksum_failures` metrics. The sooner you respond to a corrupt server, the lower the chance of any of the [potential risks](/consul/docs/agent/wal-logstore/enable#risks) causing problems in your cluster.
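
For example, if your servers log to the systemd journal, a simple way to check for these events, or to drive a log-based alert, is to search the Consul unit's logs. The unit name is an assumption; adjust it for your environment:

```shell-session
$ journalctl -u consul | grep 'verification checksum FAILED'
```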

## Performance metrics

The key performance metrics to watch are:

- `consul.raft.commitTime` measures the time to commit new writes on a quorum of servers. It should be the same or lower after deploying WAL. Even if WAL is faster for your workload and hardware, it may not be reflected in `commitTime` until enough followers are using WAL that the leader does not have to wait for two slower followers in a cluster of five to catch up.

- `consul.raft.rpc.appendEntries.storeLogs` measures the time spent persisting logs to disk on each _follower_. It should be the same or lower for WAL-enabled followers.

- `consul.raft.replication.appendEntries.rpc` measures the time taken for each `AppendEntries` RPC from the leader's perspective. If this is significantly higher than `consul.raft.rpc.appendEntries` on the follower, it indicates a known queuing issue in the Raft library and is unrelated to the backend. Followers with WAL enabled should not be slower than the others. You can determine which follower is associated with which metric by running the `consul operator raft list-peers` command and matching the `peer_id` label value to the server IDs listed.

- `consul.raft.compactLogs` measures the time taken to truncate the logs after a snapshot. WAL-enabled servers should not be slower than BoltDB servers.

- `consul.raft.leader.dispatchLog` measures the time spent persisting logs to disk on the _leader_. It is only relevant if a WAL-enabled server becomes a leader. It should be the same or lower than before, when the leader was using BoltDB.
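
One way to spot-check these metrics on a server is to query the agent's metrics endpoint directly. The sketch below assumes the agent listens on the default HTTP address and that Prometheus-format telemetry is enabled; otherwise, use whatever metrics sink you already ship Consul telemetry to:

```shell-session
$ curl -s 'http://127.0.0.1:8500/v1/agent/metrics?format=prometheus' | grep raft
```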
@@ -0,0 +1,76 @@

---
layout: docs
page_title: Revert to BoltDB
description: >-
  Learn how to revert Consul to the BoltDB backend after enabling the WAL (write-ahead log) LogStore backend shipped in Consul 1.15.
---

# Revert storage backend to BoltDB from WAL

This topic describes how to revert your Consul storage backend from the experimental WAL LogStore backend to the default BoltDB.

The overall process for reverting to BoltDB consists of the following steps. Repeat the steps for all Consul servers that you need to revert.

1. Stop target server gracefully.
1. Remove data directory from target server.
1. Update target server's configuration.
1. Start target server.

## Stop target server gracefully

Stop the target server gracefully. For example, if you are using `systemd`,
run the following command:

```shell-session
$ systemctl stop consul
```

If your environment uses configuration management automation that might interfere with this process, such as Chef or Puppet, you must disable it until you have completely reverted the storage backend.

## Remove data directory from target server

Temporarily moving the data directory to a different location is less destructive than deleting it, and it makes recovery easier if the revert does not go as planned. However, do not use the old data directory (`/data-dir/raft.wal.bak`) for recovery after restarting the server. We recommend eventually deleting the old directory.

The following example assumes the `data_dir` in the server's configuration is `/data-dir` and moves the Raft directory to `/data-dir/raft.wal.bak`.

```shell-session
$ mv /data-dir/raft /data-dir/raft.wal.bak
```

When switching backends, you must always remove _the entire raft directory_, not just the `raft.db` file or `wal` directory. The log must always be consistent with the snapshots to avoid undefined behavior or data loss.

## Update target server's configuration

Modify the `backend` in the target server's configuration file:

```hcl
raft_logstore {
  backend = "boltdb"
  verification {
    enabled = true
    interval = "60s"
  }
}
```

## Start target server

Start the target server. For example, if you are using `systemd`, run the following command:

```shell-session
$ systemctl start consul
```

Watch for the server to become a healthy voter again.

```shell-session
$ consul operator raft list-peers
```
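
You can also confirm that the server is back on the BoltDB backend by checking that it recreated a `raft.db` file in the Raft data directory. A quick sketch, assuming the example `data_dir` used earlier; the exact listing is illustrative:

```shell-session
$ ls /data-dir/raft
raft.db  snapshots
```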

### Clean up old data directories

If necessary, clean up any `raft.wal.bak` directories left over from the revert. Replace `/data-dir` with the value you specified in your configuration file.

```shell-session
$ rm -rf /data-dir/raft.wal.bak
```