mirror of https://github.com/hashicorp/consul
Website: GH-730 and cleanup for docs/guides/outage.html
parent
53ee3ffba2
commit
c1e4eb2f2c
@ -3,38 +3,47 @@ layout: "docs"
page_title: "Outage Recovery"
sidebar_current: "docs-guides-outage"
description: |-
Do not panic! This is a critical first step. Depending on your deployment configuration, it may take only a single server failure for cluster unavailability. Recovery requires an operator to intervene, but is straightforward.
Don't panic! This is a critical first step. Depending on your deployment configuration, it may take only a single server failure for cluster unavailability. Recovery requires an operator to intervene, but recovery is straightforward.
---

# Outage Recovery

Do not panic! This is a critical first step. Depending on your
Don't panic! This is a critical first step. Depending on your
[deployment configuration](/docs/internals/consensus.html#toc_4), it may
take only a single server failure for cluster unavailability. Recovery
requires an operator to intervene, but is straightforward.
requires an operator to intervene, but the process is straightforward.

~> This page covers recovery from Consul becoming unavailable due to a majority
of server nodes in a datacenter being lost. If you are just looking to
add or remove a server [see this page](/docs/guides/servers.html).
add or remove a server, [see this guide](/docs/guides/servers.html).

## Failure of a Single Server Cluster

If you had only a single server and it has failed, simply restart it.
Note that a single server configuration requires the `-bootstrap` or
`-bootstrap-expect 1` flag. If that server cannot be recovered, you need to
bring up a new server.
See the [bootstrapping guide](/docs/guides/bootstrapping.html). Data loss
is inevitable, since data was not replicated to any other servers. This
is why a single server deploy is never recommended. Any services registered
with agents will be re-populated when the new server comes online, as
agents perform anti-entropy.
Note that a single server configuration requires the
[`-bootstrap`](/docs/agent/options.html#_bootstrap) or
[`-bootstrap-expect 1`](/docs/agent/options.html#_bootstrap_expect) flag. If
the server cannot be recovered, you need to bring up a new server.
See the [bootstrapping guide](/docs/guides/bootstrapping.html) for more detail.

In a multi-server deploy, there are at least N remaining servers. The first step
is to simply stop all the servers. You can attempt a graceful leave, but
it will not work in most cases. Do not worry if the leave exits with an
error, since the cluster is in an unhealthy state.
In the case of an unrecoverable server failure in a single server cluster, data
loss is inevitable since data was not replicated to any other servers. This is
why a single server deploy is never recommended.

The next step is to go to the `-data-dir` of each Consul server. Inside
that directory, there will be a `raft/` sub-directory. We need to edit
the `raft/peers.json` file. It should be something like:
Any services registered with agents will be re-populated when the new server
comes online as agents perform anti-entropy.

## Failure of a Server in a Multi-Server Cluster

In a multi-server deploy, there are at least N remaining servers. The first
step is to simply stop all the servers. You can attempt a graceful leave,
but it will not work in most cases. Do not worry if the leave exits with an
error. The cluster is in an unhealthy state, so this is expected.

The next step is to go to the [`-data-dir`](/docs/agent/options.html#_data_dir)
of each Consul server. Inside that directory, there will be a `raft/`
sub-directory. We need to edit the `raft/peers.json` file. It should look
something like:

```javascript
[
@ -45,29 +54,30 @@ the `raft/peers.json` file. It should be something like:
```

Simply delete the entries for all the failed servers. You must confirm
those servers have indeed failed, and will not later rejoin the cluster.
those servers have indeed failed and will not later rejoin the cluster.
Ensure that this file is the same across all remaining server nodes.
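
The edit described above can be sketched in Python. This is a minimal illustration, not an official tool. It assumes `raft/peers.json` is a JSON array of `"address:port"` strings (the sample in this diff is truncated, so the exact layout is an assumption). The function name `prune_peers` and the addresses are hypothetical.

```python
import json
from typing import Iterable


def prune_peers(peers_json: str, failed: Iterable[str]) -> str:
    """Return peers.json content with the failed servers' entries removed.

    Assumes the file is a JSON array of "address:port" strings.
    """
    failed_set = set(failed)
    peers = json.loads(peers_json)
    # Keep the original order so the file stays easy to compare across servers.
    remaining = [peer for peer in peers if peer not in failed_set]
    return json.dumps(remaining, indent=2)


if __name__ == "__main__":
    # Hypothetical cluster of three servers, one of which has failed.
    original = json.dumps(["10.0.1.8:8300", "10.0.1.6:8300", "10.0.1.7:8300"])
    print(prune_peers(original, ["10.0.1.6:8300"]))
```

Running the same function with the same list of failed servers on every remaining node is one way to keep the file identical across them.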

At this point, you can restart all the remaining servers. If any servers
managed to perform a graceful leave, you may need to have them rejoin
the cluster using the `join` command:
the cluster using the [`join`](/docs/commands/join.html) command:

```text
$ consul join <Node Address>
Successfully joined cluster by contacting 1 nodes.
```

It should be noted that any existing member can be used to rejoin the cluster,
It should be noted that any existing member can be used to rejoin the cluster
as the gossip protocol will take care of discovering the server nodes.

At this point the cluster should be in an operable state again. One of the
At this point, the cluster should be in an operable state again. One of the
nodes should claim leadership and emit a log like:

```text
[INFO] consul: cluster leadership acquired
```
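
If the server logs are captured to a file, the check for that message can be scripted. The sketch below is only an illustration: it matches the message text quoted above, and the helper name `leadership_acquired` and the second sample line are hypothetical.

```python
LEADER_MARKER = "consul: cluster leadership acquired"


def leadership_acquired(log_lines):
    """Return True if any log line contains the leadership message."""
    return any(LEADER_MARKER in line for line in log_lines)


if __name__ == "__main__":
    sample = [
        "[INFO] consul: cluster leadership acquired",
        "[INFO] hypothetical unrelated log line",
    ]
    print(leadership_acquired(sample))
```

Only one server is expected to log this message, so the check would be run against each server's log in turn.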

Additionally, the `info` command can be a useful debugging tool:
Additionally, the [`info`](/docs/commands/info.html) command can be a useful
debugging tool:

```text
$ consul info
@ -86,7 +96,7 @@ raft:
...
```

You should verify that one server claims to be the `Leader`, and all the
You should verify that one server claims to be the `Leader` and all the
others should be in the `Follower` state. All the nodes should agree on the
peer count as well. This count is (N-1), since a server does not count itself
as a peer.
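
The verification rules above can be written down as a small check. This is a sketch under stated assumptions: the state and peer count of each server are read by hand from its `consul info` output and passed in as a mapping, and the function name `cluster_looks_healthy` and the server names are hypothetical.

```python
def cluster_looks_healthy(reports):
    """Check the recovery conditions for a cluster of N servers.

    `reports` maps a server name to a (state, peer_count) tuple, where
    state is "Leader" or "Follower" and peer_count is what that server
    reports. Exactly one server must be the leader, the rest must be
    followers, and every server must report N-1 peers.
    """
    n = len(reports)
    states = [state for state, _ in reports.values()]
    one_leader = states.count("Leader") == 1
    rest_follow = states.count("Follower") == n - 1
    peers_agree = all(count == n - 1 for _, count in reports.values())
    return one_leader and rest_follow and peers_agree


if __name__ == "__main__":
    # Hypothetical three-server cluster after recovery.
    reports = {
        "server-a": ("Leader", 2),
        "server-b": ("Follower", 2),
        "server-c": ("Follower", 2),
    }
    print(cluster_looks_healthy(reports))
```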