
docs: Enterprise upgrade instruction (#20985)

* Upgrade general process updates

* Add alert + adjust structure

* typo
jm-net-5879
Jeff Boruszak committed 7 months ago via GitHub (commit dbc0889c6f)
1 file changed, 165 lines:
    website/content/docs/upgrading/instructions/general-process.mdx

@@ -8,9 +8,7 @@ description: >-
# General Upgrade Process
## Introduction
This document describes some best practices that you should follow when
This page describes the overall process and best practices that you should follow when
upgrading Consul. Some versions also have steps that are specific to that
version, so make sure you also review the [upgrade instructions](/consul/docs/upgrading/instructions)
for the version you are on.
@@ -46,117 +44,124 @@ If you are using Kubernetes, then please review our documentation for
## Prepare for the Upgrade
**1.** Take a snapshot:
1. Take a snapshot to ensure you have a safe fallback option in case something goes wrong.
```
consul snapshot save backup.snap
```
```shell-session
$ consul snapshot save backup.snap
```
You can inspect the snapshot to ensure it was successful with:
You can inspect the snapshot to ensure your cluster's Raft index was successfully captured.
```
consul snapshot inspect backup.snap
```
```shell-session
$ consul snapshot inspect backup.snap
```
Example output:
Example output:
```
ID 2-1182-1542056499724
Size 4115
Index 1182
Term 2
Version 1
```
```shell-session hideClipboard
ID 2-1182-1542056499724
Size 4115
Index 1182
Term 2
Version 1
```
This will ensure you have a safe fallback option in case something goes wrong. Store
this snapshot somewhere safe. More documentation on snapshot usage is available here:
Store this snapshot somewhere safe. For more information on snapshots, refer to the following:
- [consul.io/commands/snapshot](/consul/commands/snapshot)
- [Backup Consul Data and State tutorial](/consul/tutorials/production-deploy/backup-and-restore)
- [`consul snapshot` CLI command](/consul/commands/snapshot)
- [Backup Consul Data and State tutorial](/consul/tutorials/production-deploy/backup-and-restore)
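If the upgrade goes wrong, the snapshot is your fallback. As a sketch of the rollback path, you can restore the file you saved with the `consul snapshot restore` command (run it against a cluster on the same Consul version the snapshot was taken from):

```shell-session
$ consul snapshot restore backup.snap
```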
**2.** Temporarily modify your Consul configuration so that its [log_level](/consul/docs/agent/config/cli-flags#_log_level)
is set to `debug`. After doing this, issue the following command on your servers to
2. Temporarily modify your Consul configuration to set the agent's [`log_level`](/consul/docs/agent/config/cli-flags#_log_level) to `debug`. Then issue the following command on your servers to
reload the configuration:
```
consul reload
```
```shell-session
$ consul reload
```
This change will give you more information to work with in the event something goes wrong.
When you change the cluster's log level, Consul gives you more information to work with in the event that something goes wrong.
## Perform the Upgrade
### Enterprise upgrades
**1.** Issue the following command to discover which server is currently the leader:
<Note>
To experience a smoother upgrade process on Consul Enterprise, we recommend that you disable the upgrade migration feature.
</Note>
```
consul operator raft list-peers
```
Consul Enterprise supports [automated upgrades](/consul/docs/enterprise/upgrades), but the autopilot feature may cause a node running an updated Consul version to elect a new leader before the version is updated on the existing cluster leader.
You should receive output similar to this (exact formatting and content may differ based on version):
If your datacenter runs Consul Enterprise, update your server agent configuration file to disable autopilot's upgrade migration or run the following CLI command:
```
Node ID Address State Voter RaftProtocol
dc1-node1 ae15858f-7f5f-4dcb-b7d5-710fdcdd2745 10.11.0.2:8300 leader true 3
dc1-node2 20e6be1b-f1cb-4aab-929f-f7d2d43d9a96 10.11.0.3:8300 follower true 3
dc1-node3 658c343b-8769-431f-a71a-236f9dbb17b3 10.11.0.4:8300 follower true 3
```
```shell-session
$ consul operator autopilot set-config -disable-upgrade-migration=true
```
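If you manage agent settings in a configuration file instead of the CLI, the equivalent setting lives in the `autopilot` stanza. The snippet below is a sketch of that stanza in HCL, assuming your server agents are configured with HCL rather than JSON:

```hcl
# Server agent configuration fragment (Consul Enterprise).
# Equivalent to: consul operator autopilot set-config -disable-upgrade-migration=true
autopilot {
  disable_upgrade_migration = true
}
```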
Take note of which agent is the leader.
## Perform the Upgrade
**2.** Copy the new `consul` binary onto your servers and replace the existing
binary with the new one.
1. Issue the following command to discover which server is currently the leader:
**3.** The following steps must be done in order on the server agents, leaving the leader
agent for last. First, use a service management system (e.g., systemd, upstart, etc.) to restart the Consul service. If
you are not using a service management system, you must restart the agent manually.
```shell-session
$ consul operator raft list-peers
```
To validate that the agent has rejoined the cluster and is in sync with the leader, issue the
following command:
You should receive output similar to this (exact formatting and content may differ based on version):
```
consul info
```
```shell-session hideClipboard
Node ID Address State Voter RaftProtocol
dc1-node1 ae15858f-7f5f-4dcb-b7d5-710fdcdd2745 10.11.0.2:8300 leader true 3
dc1-node2 20e6be1b-f1cb-4aab-929f-f7d2d43d9a96 10.11.0.3:8300 follower true 3
dc1-node3 658c343b-8769-431f-a71a-236f9dbb17b3 10.11.0.4:8300 follower true 3
```
Check whether the `commit_index` and `last_log_index` fields have the same value. If done properly,
this should avoid an unexpected leadership election due to loss of quorum.
Take note of which agent is the leader.
**4.** Double-check that all servers are showing up in the cluster as expected and are on
the correct version by issuing:
2. Copy the new `consul` binary onto your servers and replace the existing binary with the new one.
```
consul members
```
3. Use a service management system such as systemd or upstart to restart the Consul service on each server. You must restart follower server agents first, leaving the leader agent for last. If you are not using a service management system, you must restart the agent manually.
You should receive output similar to this:
To validate that the agent has rejoined the cluster and is in sync with the leader after you restart it, issue the following command to the agent:
```
Node Address Status Type Build Protocol DC
dc1-node1 10.11.0.2:8301 alive server 1.8.3 2 dc1
dc1-node2 10.11.0.3:8301 alive server 1.8.3 2 dc1
dc1-node3 10.11.0.4:8301 alive server 1.8.3 2 dc1
```
```shell-session
$ consul info
```
Also double-check the raft state to make sure there is a leader and sufficient voters:
Check whether the `commit_index` and `last_log_index` fields have the same value. Matching values indicate that the restarted agent has caught up with the leader; waiting for them to match before restarting the next server avoids an unexpected leadership election due to loss of quorum.
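A quick way to compare the two fields is to filter the Raft section of the agent's output. The sketch below runs the filter against sample `consul info` output supplied through a heredoc (the index values are hypothetical); on a live server, pipe `consul info` into the same `grep` instead:

```shell
# Show only the two Raft index fields.
# On a live server: consul info | grep -E '(commit_index|last_log_index)'
grep -E '(commit_index|last_log_index)' <<'EOF'
raft:
        applied_index = 1182
        commit_index = 1182
        last_log_index = 1182
EOF
```

When the two printed values are equal, the agent is in sync with the leader and it is safe to move on to the next server.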
```
consul operator raft list-peers
```
4. Double-check that all servers joined the cluster as expected and run the correct version. Issue the following command:
You should receive output similar to this:
```shell-session
$ consul members
```
```
Node ID Address State Voter RaftProtocol
dc1-node1 ae15858f-7f5f-4dcb-b7d5-710fdcdd2745 10.11.0.2:8300 leader true 3
dc1-node2 20e6be1b-f1cb-4aab-929f-f7d2d43d9a96 10.11.0.3:8300 follower true 3
dc1-node3 658c343b-8769-431f-a71a-236f9dbb17b3 10.11.0.4:8300 follower true 3
```
You should receive output that lists the servers as `alive` and running the same updated Consul version.
**5.** Set your `log_level` back to its original value and issue the following command
```shell-session hideClipboard
Node Address Status Type Build Protocol DC
dc1-node1 10.11.0.2:8301 alive server 1.8.3 2 dc1
dc1-node2 10.11.0.3:8301 alive server 1.8.3 2 dc1
dc1-node3 10.11.0.4:8301 alive server 1.8.3 2 dc1
```
Also double-check the Raft state to make sure there is a leader and sufficient voters:
```shell-session
$ consul operator raft list-peers
```
You should receive output that lists one server as the `leader` and the rest as `follower`:
```shell-session hideClipboard
Node ID Address State Voter RaftProtocol
dc1-node1 ae15858f-7f5f-4dcb-b7d5-710fdcdd2745 10.11.0.2:8300 leader true 3
dc1-node2 20e6be1b-f1cb-4aab-929f-f7d2d43d9a96 10.11.0.3:8300 follower true 3
dc1-node3 658c343b-8769-431f-a71a-236f9dbb17b3 10.11.0.4:8300 follower true 3
```
5. Set your `log_level` back to its original value and issue the following command
on your servers to reload the configuration:
```
consul reload
```
```shell-session
$ consul reload
```
## Troubleshooting
@@ -164,7 +169,7 @@ Most problems with upgrading occur due to either failing to upgrade the leader agent last
or failing to wait for a follower agent to fully rejoin a cluster before moving
on to another server. This can cause a loss of quorum and occasionally can result in
all of your servers attempting to kick off leadership elections endlessly without ever
reaching a quorum and electing a leader.
reaching a quorum and electing a leader. Consul Enterprise users should [disable upgrade migration](#enterprise-upgrades) to prevent autopilot from prematurely electing a new cluster leader.
Most of these problems can be solved by following the steps outlined in our
[Disaster recovery for Consul clusters](/consul/tutorials/datacenter-operations/recovery-outage) document.
