Merge pull request #2283 from hashicorp/f-outage-docs

Updates docs for new peers.json behavior.
James Phillips 8 years ago committed by GitHub
commit 942b554abb

BACKWARDS INCOMPATIBILITIES:
header was added to allow clients to detect if translation is enabled for HTTP
responses, and a "lan" tag was added to `TaggedAddresses` for clients that need
the local address regardless of translation. [GH-2280]
* The behavior of the `peers.json` file is different in this version of Consul:
this file won't normally be present and is used only during outage recovery. Be
sure to read the [Outage Recovery Guide](/docs/guides/outage.html)
for details.
IMPROVEMENTS:

this server is online, the cluster will return to a fully healthy state.
Both of these strategies involve a potentially lengthy time to reboot or rebuild
a failed server. If this is impractical or if building a new server with the same
IP isn't an option, you need to remove the failed server using the `raft/peers.json`
recovery file on all remaining servers.
To begin, stop all remaining servers. You can attempt a graceful leave,
but it will not work in most cases. Do not worry if the leave exits with an
error. The cluster is in an unhealthy state, so this is expected.
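For example, on each remaining server you might attempt the leave and then stop
the agent. This is only a sketch; the `systemctl` unit name assumes a
systemd-managed deployment:

```text
$ consul leave                 # may fail with an error during an outage; that's expected
$ sudo systemctl stop consul   # unit name is deployment-specific
```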
~> Note that in Consul 0.7 and later, the `peers.json` file is no longer present
by default and is only used when performing recovery. Consul deletes this file
after it starts up and ingests it. Consul 0.7 also uses a new,
automatically-created `raft/peers.info` file to avoid ingesting the
`raft/peers.json` file on the first start after upgrading. Be sure to leave
`raft/peers.info` in place for proper operation.
<br>
<br>
Note that using `raft/peers.json` for recovery can cause uncommitted Raft log
entries to be committed, so this should only be used after an outage where no
other option is available to recover a lost server. Make sure you don't have
any automated processes that will put the peers file in place on a periodic
basis, for example.
<br>
<br>
When the final version of Consul 0.7 ships, it should include a command to
remove a dead peer without having to stop servers and edit the `raft/peers.json`
recovery file.
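Before editing anything, it can help to look at the Raft directory on one of
the servers. The `/var/lib/consul` data directory below is only an example; use
whatever your agents were started with:

```text
$ ls /var/lib/consul/raft/
# Expect to see peers.info (leave it in place) along with the Raft log and
# snapshots; peers.json will only exist if you have created it for recovery.
```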
The next step is to go to the [`-data-dir`](/docs/agent/options.html#_data_dir)
of each Consul server. Inside that directory, there will be a `raft/`
sub-directory. We need to create a `raft/peers.json` file. It should look
something like:
```javascript
[
  "10.0.1.8:8300",
  "10.0.1.6:8300",
  "10.0.1.7:8300"
]
```
Simply create entries for all remaining servers. You must confirm
that servers you do not include here have indeed failed and will not later
rejoin the cluster. Ensure that this file is the same across all remaining
server nodes.
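One simple way to confirm the recovery file matches on every server is to
compare a checksum on each one; again, the data directory path is just an
example:

```text
$ md5sum /var/lib/consul/raft/peers.json
```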
At this point, you can restart all the remaining servers. In Consul 0.7 and
later you will see them ingest the recovery file:
```text
...
2016/08/16 14:39:20 [INFO] consul: found peers.json file, recovering Raft configuration...
2016/08/16 14:39:20 [INFO] consul.fsm: snapshot created in 12.484µs
2016/08/16 14:39:20 [INFO] snapshot: Creating new snapshot at /tmp/peers/raft/snapshots/2-5-1471383560779.tmp
2016/08/16 14:39:20 [INFO] consul: deleted peers.json file after successful recovery
2016/08/16 14:39:20 [INFO] raft: Restored from snapshot 2-5-1471383560779
2016/08/16 14:39:20 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:10.212.15.121:8300 Address:10.212.15.121:8300}]
...
```
If any servers managed to perform a graceful leave, you may need to have them
rejoin the cluster using the [`join`](/docs/commands/join.html) command:
```text
$ consul join <Node Address>
```

You should verify that one server claims to be the `Leader` and all the
others should be in the `Follower` state. All the nodes should agree on the
peer count as well. This count is (N-1), since a server does not count itself
as a peer.
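One way to check this from each server is `consul info`, whose `raft` section
reports the server's role and peer count. This output is abbreviated and
indicative only; exact fields may vary by version:

```text
$ consul info
...
raft:
	num_peers = 2
	state = Leader
...
```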
## Failure of Multiple Servers in a Multi-Server Cluster
In the event that multiple servers are lost, causing a loss of quorum and a
complete outage, partial recovery is possible using data on the remaining
servers in the cluster. There may be data loss in this situation because multiple
servers were lost, so information about what's committed could be incomplete.
The recovery process implicitly commits all outstanding Raft log entries, so
it's also possible to commit data that was uncommitted before the failure.
The procedure is the same as for the single-server case above; you simply include
just the remaining servers in the `raft/peers.json` recovery file. The cluster
should be able to elect a leader once the remaining servers are all restarted with
an identical `raft/peers.json` configuration.
Any new servers you introduce later can be fresh with totally clean data directories
and joined using Consul's `join` command.
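For example, a replacement server could be started with an empty data directory
and pointed at a surviving node; the path and address here are illustrative:

```text
$ consul agent -server -data-dir=/var/lib/consul -join=10.212.15.121
```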
In extreme cases, it should be possible to recover with just a single remaining
server by starting that single server with itself as the only peer in the
`raft/peers.json` recovery file.
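In that case, the recovery file would contain a single entry; the address below
is illustrative:

```javascript
[
  "10.212.15.121:8300"
]
```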
Note that prior to Consul 0.7 it wasn't always possible to recover from certain
types of outages with `raft/peers.json` because this was ingested before any Raft
log entries were played back. In Consul 0.7 and later, the `raft/peers.json`
recovery file is final, and a snapshot is taken after it is ingested, so you are
guaranteed to start with your recovered configuration. This does implicitly commit
all Raft log entries, so it should only be used to recover from an outage, but it
should allow recovery from any situation where there's some cluster data available.

standard upgrade flow.
## Consul 0.7
Consul version 0.7 is a very large release with many important changes. Changes
to be aware of during an upgrade are categorized below.
#### Default Configuration Changes
The default behavior of [`skip_leave_on_interrupt`](/docs/agent/options.html#skip_leave_on_interrupt)
now depends on whether the agent is acting as a server or a client. When Consul is
started as a server, the default is `true`; when started as a client, it is `false`.
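If you prefer to pin this behavior rather than rely on the new default, the key
can be set explicitly in the agent's JSON configuration. For example, to keep
leave-on-interrupt enabled on a server:

```javascript
{
  "skip_leave_on_interrupt": false
}
```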
#### Dropped Support for Protocol Version 1
Consul version 0.7 dropped support for protocol version 1, which means it
itself. This feature enables using the distance sorting features of prepared
queries without explicitly providing the node to sort near in requests, but
requires the agent servicing a request to send additional information about
itself to the Consul servers when executing the prepared query. Agents prior
to 0.7 do not send this information, which means they are unable to properly
execute prepared queries configured with a `Near` parameter. Similarly, any
server nodes prior to version 0.7 are unable to store the `Near` parameter,
making them unable to properly serve requests for prepared queries using the
feature. It is recommended that all agents be running version 0.7 prior to
using this feature.
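As a sketch of the feature being discussed, a prepared query using `Near` can
be registered via a POST to the `/v1/query` endpoint; the name and service
below are illustrative, and the magic `_agent` value sorts results nearest the
agent that handles the query:

```javascript
{
  "Name": "nearest-redis",
  "Service": {
    "Service": "redis"
  },
  "Near": "_agent"
}
```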
#### WAN Address Translation in HTTP Endpoints
Consul version 0.7 added support for translating WAN addresses in certain
[HTTP endpoints](/docs/agent/options.html#translate_wan_addrs). The servers
and the agents need to be running version 0.7 or later in order to use this
feature.
These translated addresses could break clients that are expecting local
addresses. A new [`X-Consul-Translate-Addresses`](/docs/agent/http.html#translate_header)
header was added to allow clients to detect if translation is enabled for HTTP
responses, and a "lan" tag was added to `TaggedAddresses` for clients that need
the local address regardless of translation.
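A client can detect whether translation is enabled by inspecting the response
headers on any affected endpoint. A minimal sketch with `curl` (node name and
output abbreviated; the header value shown is an assumption):

```text
$ curl -i http://127.0.0.1:8500/v1/catalog/node/web-node-1
HTTP/1.1 200 OK
X-Consul-Translate-Addresses: true
...
```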
#### Changes to Outage Recovery and `peers.json`
The `peers.json` file is no longer present by default and is only used when
performing recovery; Consul deletes this file after it starts and ingests it.
Consul 0.7 also uses a new, automatically-created `raft/peers.info` file
to avoid ingesting the `peers.json` file on the first start after upgrading (the
`peers.json` file is simply deleted on that first start without being ingested).
Please be sure to review the [Outage Recovery Guide](/docs/guides/outage.html)
for more details before upgrading.
## Consul 0.6.4
Consul 0.6.4 made some substantial changes to how ACLs work with prepared
