From aa9a22df9ad7ab266e4f8b6218ea965f9045e2df Mon Sep 17 00:00:00 2001
From: Ryan Breen
Date: Sun, 1 Mar 2015 18:21:33 -0500
Subject: [PATCH] Website: rework docs/guides/outage.html to cover cases where recovery might be easier than manual removal of failed nodes from peers.json.

---
 .../source/docs/guides/outage.html.markdown | 29 ++++++++++++++-----
 .../docs/internals/consensus.html.markdown | 2 +-
 2 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/website/source/docs/guides/outage.html.markdown b/website/source/docs/guides/outage.html.markdown
index ac793cf0e2..f1a5708eb6 100644
--- a/website/source/docs/guides/outage.html.markdown
+++ b/website/source/docs/guides/outage.html.markdown
@@ -8,12 +8,14 @@ description: |-

 # Outage Recovery

-Don't panic! This is a critical first step. Depending on your
-[deployment configuration](/docs/internals/consensus.html#toc_4), it may
-take only a single server failure for cluster unavailability. Recovery
+Don't panic! This is a critical first step.
+
+Depending on your
+[deployment configuration](/docs/internals/consensus.html#deployment_table), it
+may take only a single server failure for cluster unavailability. Recovery
 requires an operator to intervene, but the process is straightforward.

-~> This page covers recovery from Consul becoming unavailable due to a majority
+~> This guide is for recovery from a Consul outage due to a majority
 of server nodes in a datacenter being lost. If you are just looking to
 add or remove a server, [see this guide](/docs/guides/servers.html).

@@ -28,15 +30,26 @@ See the [bootstrapping guide](/docs/guides/bootstrapping.html) for more detail.

 In the case of an unrecoverable server failure in a single server cluster, data
 loss is inevitable since data was not replicated to any other servers. This is
-why a single server deploy is never recommended.
+why a single server deploy is **never** recommended.

 Any services registered with agents will be re-populated when the new server
 comes online as agents perform anti-entropy.

 ## Failure of a Server in a Multi-Server Cluster

-In a multi-server deploy, there are at least N remaining servers. The first
-step is to simply stop all the servers. You can attempt a graceful leave,
+If you think the failed server is recoverable, the easiest option is to bring
+it back online and have it rejoin the cluster, returning the cluster to a fully
+healthy state. Similarly, even if you need to rebuild a new Consul server to
+replace the failed node, you may wish to do that immediately. Keep in mind that
+the rebuilt server needs to have the same IP as the failed server. Again, once
+this server is online, the cluster will return to a fully healthy state.
+
+Both of these strategies involve a potentially lengthy time to reboot or rebuild
+a failed server. If this is impractical, if building a new server with the same
+IP isn't an option, or if your failed server is unrecoverable, you need to remove
+the failed server from the `raft/peers.json` file on all remaining servers.
+
+To begin, stop all remaining servers. You can attempt a graceful leave,
 but it will not work in most cases. Do not worry if the leave exits with an
 error. The cluster is in an unhealthy state, so this is expected.

@@ -76,7 +89,7 @@ nodes should claim leadership and emit a log like:
 [INFO] consul: cluster leadership acquired
 ```

-Additional, the [`info`](/docs/commands/info.html) command can be a useful
+Additionally, the [`info`](/docs/commands/info.html) command can be a useful
 debugging tool:

 ```text
diff --git a/website/source/docs/internals/consensus.html.markdown b/website/source/docs/internals/consensus.html.markdown
index 6e04e5dd28..aa8ac7d32e 100644
--- a/website/source/docs/internals/consensus.html.markdown
+++ b/website/source/docs/internals/consensus.html.markdown
@@ -164,7 +164,7 @@ The three read modes are:

 For more documentation about using these various modes, see the [HTTP API](/docs/agent/http.html).

-## Deployment Table
+## Deployment Table

 Below is a table that shows for the number of servers how large the
 quorum is, as well as how many node failures can be tolerated. The