From 8881d352aae4cbaf1507e42db836bca937cc1c95 Mon Sep 17 00:00:00 2001 From: kaitlincarter-hc <43049322+kaitlincarter-hc@users.noreply.github.com> Date: Tue, 27 Nov 2018 09:10:44 -0800 Subject: [PATCH] Updated autopilot guide (#4997) * Updated the autopilot guide to prepare for the migration to the learn platform. * Correcting typos --- website/source/docs/guides/autopilot.html.md | 283 ++++++++++++------- 1 file changed, 182 insertions(+), 101 deletions(-) diff --git a/website/source/docs/guides/autopilot.html.md b/website/source/docs/guides/autopilot.html.md index 5f42d564cd..b8987f1591 100644 --- a/website/source/docs/guides/autopilot.html.md +++ b/website/source/docs/guides/autopilot.html.md @@ -8,7 +8,7 @@ description: |- # Autopilot -Autopilot is a set of new features added in Consul 0.8 to allow for automatic +Autopilot features allow for automatic, operator-friendly management of Consul servers. It includes cleanup of dead servers, monitoring the state of the Raft cluster, and stable server introduction. @@ -19,11 +19,25 @@ the Agent configuration must be set to 3 or higher on all servers. In Consul information, see the [Version Upgrade section](/docs/upgrade-specific.html#raft_protocol) on Raft Protocol versions. -## Configuration +In this guide we will learn more about Autopilot's features. + +* Dead server cleanup +* Server stabilization +* Redundancy zone tags +* Upgrade migration + +Finally, we will review how to ensure Autopilot is healthy. + +Note that this guide uses examples from a Consul 1.4 cluster, where +Autopilot is enabled by default. + +## Default Configuration The configuration of Autopilot is loaded by the leader from the agent's [Autopilot settings](/docs/agent/options.html#autopilot) when initially -bootstrapping the cluster: +bootstrapping the cluster. Since Autopilot and its features are already +enabled, we only need to update the configuration to disable them. The +following are the defaults. 
``` { @@ -31,16 +45,22 @@ bootstrapping the cluster: "last_contact_threshold": "200ms", "max_trailing_logs": 250, "server_stabilization_time": "10s", - "redundancy_zone_tag": "az", + "redundancy_zone_tag": "", "disable_upgrade_migration": false, "upgrade_version_tag": "" } ``` +All Consul servers should have Autopilot and its features either enabled +or disabled to ensure consistency across servers in case of a failure. Additionally, +Autopilot must be enabled to use any of the features, but the features themselves +can be configured independently, meaning you can enable or disable any of the features +separately, at any time. + After bootstrapping, the configuration can be viewed or modified either via the [`operator autopilot`](/docs/commands/operator/autopilot.html) subcommand or the [`/v1/operator/autopilot/configuration`](/api/operator.html#autopilot-configuration) -HTTP endpoint: +HTTP endpoint. ``` $ consul operator autopilot get-config @@ -51,7 +71,29 @@ ServerStabilizationTime = 10s RedundancyZoneTag = "" DisableUpgradeMigration = false UpgradeVersionTag = "" +``` +In the example above, we used the `operator autopilot get-config` subcommand to check +the Autopilot configuration. You can see we still have all the defaults. + +## Dead Server Cleanup + +If Autopilot is disabled, it will take 72 hours for dead servers to be automatically reaped +or an operator will have to script a `consul force-leave`. If another server failure occurs, +it could jeopardize the quorum, even if the failed Consul server had been automatically +replaced. Autopilot helps prevent these kinds of outages by quickly removing failed +servers as soon as a replacement Consul server comes online. When servers are removed +by the cleanup process they will enter the "left" state. + +With Autopilot's dead server cleanup enabled, dead servers will periodically be +cleaned up and removed from the Raft peer set to prevent them from interfering with +the quorum size and leader elections. 
The cleanup process will also be automatically +triggered whenever a new server is successfully added to the cluster. + +To update the dead server cleanup feature use `consul operator autopilot set-config` +with the `-cleanup-dead-servers` flag. + +```sh $ consul operator autopilot set-config -cleanup-dead-servers=false Configuration updated! @@ -65,34 +107,143 @@ DisableUpgradeMigration = false UpgradeVersionTag = "" ``` -## Dead Server Cleanup +We have disabled dead server cleanup, but still have all the other Autopilot defaults. -Dead servers will periodically be cleaned up and removed from the Raft peer -set, to prevent them from interfering with the quorum size and leader elections. -This cleanup will also happen whenever a new server is successfully added to the -cluster. +## Server Stabilization -Prior to Autopilot, it would take 72 hours for dead servers to be automatically reaped, -or operators had to script a `consul force-leave`. If another server failure occurred, -it could jeopardize the quorum, even if the failed Consul server had been automatically -replaced. Autopilot helps prevent these kinds of outages by quickly removing failed -servers as soon as a replacement Consul server comes online. When servers are removed -by the cleanup process they will enter the "left" state. +When a new server is added to the cluster, there is a waiting period where it +must be healthy and stable for a certain amount of time before being promoted +to a full, voting member. This can be configured via the `ServerStabilizationTime` +setting. -This option can be disabled by running `consul operator autopilot set-config` -with the `-cleanup-dead-servers=false` option. +```sh +$ consul operator autopilot set-config -server-stabilization-time=5s +Configuration updated! 
+ +$ consul operator autopilot get-config +CleanupDeadServers = false +LastContactThreshold = 200ms +MaxTrailingLogs = 250 +ServerStabilizationTime = 5s +RedundancyZoneTag = "" +DisableUpgradeMigration = false +UpgradeVersionTag = "" +``` + +Now we have disabled dead server cleanup and set the server stabilization time to 5 seconds. +When a new server is added to our cluster, it will only need to be healthy and stable for +5 seconds. + +## Redundancy Zones + +Prior to Autopilot, it was difficult to deploy servers in a way that took advantage of +isolated failure domains such as AWS Availability Zones; users would be forced to either +have an overly-large quorum (2-3 nodes per AZ) or give up redundancy within an AZ by +deploying just one server in each. + +If the `RedundancyZoneTag` setting is set, Consul will use its value to look for a +zone in each server's specified [`-node-meta`](/docs/agent/options.html#_node_meta) +tag. For example, if `RedundancyZoneTag` is set to `zone`, and `-node-meta zone:east1a` +is used when starting a server, that server's redundancy zone will be `east1a`. + +``` +$ consul operator autopilot set-config -redundancy-zone-tag=uswest1 +Configuration updated! + +$ consul operator autopilot get-config +CleanupDeadServers = false +LastContactThreshold = 200ms +MaxTrailingLogs = 250 +ServerStabilizationTime = 5s +RedundancyZoneTag = "uswest1" +DisableUpgradeMigration = false +UpgradeVersionTag = "" +``` + +Our Autopilot configuration now has dead server cleanup disabled, a server stabilization time of 5 seconds, and +a redundancy zone tag of `uswest1`. + +Consul will then use these values to partition the servers by redundancy zone, and will +aim to keep one voting server per zone. Extra servers in each zone will stay as non-voters +on standby to be promoted if the active voter leaves or dies. + +## Upgrade Migrations + +Autopilot in Consul *Enterprise* supports upgrade migrations by default. 
To disable this +functionality, set `DisableUpgradeMigration` to true. + +```sh +$ consul operator autopilot set-config -disable-upgrade-migration=true +Configuration updated! + +$ consul operator autopilot get-config +CleanupDeadServers = false +LastContactThreshold = 200ms +MaxTrailingLogs = 250 +ServerStabilizationTime = 5s +RedundancyZoneTag = "uswest1" +DisableUpgradeMigration = true +UpgradeVersionTag = "" +``` + +With upgrade migration enabled, when a new server is added and Autopilot detects that +its Consul version is newer than that of the existing servers, Autopilot will avoid +promoting the new server until enough newer-versioned servers have been added to the +cluster. When the count of new servers equals or exceeds that of the old servers, +Autopilot will begin promoting the new servers to voters and demoting the old servers. +After this is finished, the old servers can be safely removed from the cluster. + +To check the consul version of the servers, you can either use the [autopilot health] +(/api/operator.html#autopilot-health) endpoint or the `consul members` +command. + +``` +$ consul members +Node Address Status Type Build Protocol DC Segment +node1 127.0.0.1:8301 alive server 1.4.0 2 dc1 +node2 127.0.0.1:8703 alive server 1.4.0 2 dc1 +node3 127.0.0.1:8803 alive server 1.4.0 2 dc1 +node4 127.0.0.1:8203 alive server 1.3.0 2 dc1 +``` + +### Migrations Without a Consul Version Change + +The `UpgradeVersionTag` can be used to override the version information used during +a migration, so that the migration logic can be used for updating the cluster when +changing configuration. + +If the `UpgradeVersionTag` setting is set, Consul will use its value to look for a +version in each server's specified [`-node-meta`](/docs/agent/options.html#_node_meta) +tag. For example, if `UpgradeVersionTag` is set to `build`, and `-node-meta build:0.0.2` +is used when starting a server, that server's version will be `0.0.2` when considered in +a migration. 
The upgrade logic will follow semantic versioning and the version string +must be in the form of either `X`, `X.Y`, or `X.Y.Z`. + +```sh +$ consul operator autopilot set-config -upgrade-version-tag=1.4.0 +Configuration updated! + +$ consul operator autopilot get-config +CleanupDeadServers = false +LastContactThreshold = 200ms +MaxTrailingLogs = 250 +ServerStabilizationTime = 5s +RedundancyZoneTag = "uswest1" +DisableUpgradeMigration = true +UpgradeVersionTag = "1.4.0" +``` ## Server Health Checking An internal health check runs on the leader to track the stability of servers. -
A server is considered healthy if all of the following conditions are true: +
A server is considered healthy if all of the following conditions are true. -- It has a SerfHealth status of 'Alive' +- It has a SerfHealth status of 'Alive'. - The time since its last contact with the current leader is below -`LastContactThreshold` -- Its latest Raft term matches the leader's term +`LastContactThreshold`. +- Its latest Raft term matches the leader's term. - The number of Raft log entries it trails the leader by does not exceed -`MaxTrailingLogs` +`MaxTrailingLogs`. The status of these health checks can be viewed through the [`/v1/operator/autopilot/health`] (/api/operator.html#autopilot-health) HTTP endpoint, with a top level @@ -136,83 +287,13 @@ $ curl localhost:8500/v1/operator/autopilot/health } ``` -## Stable Server Introduction +## Summary -When a new server is added to the cluster, there is a waiting period where it -must be healthy and stable for a certain amount of time before being promoted -to a full, voting member. This can be configured via the `ServerStabilizationTime` -setting. +In this guide we configured most of the Autopilot features: dead server cleanup, server +stabilization, redundancy zone tags, upgrade migration, and upgrade version tag. ---- - -~> The following Autopilot features are available only in - [Consul Enterprise](https://www.hashicorp.com/products/consul/) version 0.8.0 and later. - -## Server Read Scaling - -With the [`-non-voting-server`](/docs/agent/options.html#_non_voting_server) option, a -server can be explicitly marked as a non-voter and will never be promoted to a voting -member. This can be useful when more read scaling is needed; being a non-voter means -that the server will still have data replicated to it, but it will not be part of the -quorum that the leader must wait for before committing log entries. 
- -## Redundancy Zones - -Prior to Autopilot, it was difficult to deploy servers in a way that took advantage of -isolated failure domains such as AWS Availability Zones; users would be forced to either -have an overly-large quorum (2-3 nodes per AZ) or give up redundancy within an AZ by -deploying just one server in each. - -If the `RedundancyZoneTag` setting is set, Consul will use its value to look for a -zone in each server's specified [`-node-meta`](/docs/agent/options.html#_node_meta) -tag. For example, if `RedundancyZoneTag` is set to `zone`, and `-node-meta zone:east1a` -is used when starting a server, that server's redundancy zone will be `east1a`. - -Here's an example showing how to configure this: - -``` -$ consul operator autopilot set-config -redundancy-zone-tag=zone -Configuration updated! -``` - -Consul will then use these values to partition the servers by redundancy zone, and will -aim to keep one voting server per zone. Extra servers in each zone will stay as non-voters -on standby to be promoted if the active voter leaves or dies. - -## Upgrade Migrations - -Autopilot in Consul Enterprise supports upgrade migrations by default. To disable this -functionality, set `DisableUpgradeMigration` to true. - -When a new server is added and Autopilot detects that its Consul version is newer than -that of the existing servers, Autopilot will avoid promoting the new server until enough -newer-versioned servers have been added to the cluster. When the count of new servers -equals or exceeds that of the old servers, Autopilot will begin promoting the new servers -to voters and demoting the old servers. After this is finished, the old servers can be -safely removed from the cluster. 
- -To check the consul version of the servers, either the [autopilot health] -(/api/operator.html#autopilot-health) endpoint or the `consul members` -command can be used: - -``` -$ consul members -Node Address Status Type Build Protocol DC -node1 127.0.0.1:8301 alive server 0.7.5 2 dc1 -node2 127.0.0.1:8703 alive server 0.7.5 2 dc1 -node3 127.0.0.1:8803 alive server 0.7.5 2 dc1 -node4 127.0.0.1:8203 alive server 0.8.0 2 dc1 -``` - -### Migrations Without a Consul Version Change - -The `UpgradeVersionTag` can be used to override the version information used during -a migration, so that the migration logic can be used for updating the cluster when -changing configuration. - -If the `UpgradeVersionTag` setting is set, Consul will use its value to look for a -version in each server's specified [`-node-meta`](/docs/agent/options.html#_node_meta) -tag. For example, if `UpgradeVersionTag` is set to `build`, and `-node-meta build:0.0.2` -is used when starting a server, that server's version will be `0.0.2` when considered in -a migration. The upgrade logic will follow semantic versioning and the version string -must be in the form of either `X`, `X.Y`, or `X.Y.Z`. +To learn more about the Autopilot settings we did not configure, +[last_contact_threshold](https://www.consul.io/docs/agent/options.html#last_contact_threshold) +and [max_trailing_logs](https://www.consul.io/docs/agent/options.html#max_trailing_logs), +either read the agent configuration documentation or use the help flag with the +`operator autopilot` subcommand, `consul operator autopilot set-config -h`.
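
Reviewer note: the four health conditions listed under "Server Health Checking" can be read as a single predicate. The sketch below is only an illustration of that reading, not Consul's implementation. The `ServerStats` type, its field names, and the `is_healthy` helper are hypothetical; the default thresholds mirror the `LastContactThreshold` (200ms) and `MaxTrailingLogs` (250) values shown in the guide.

```python
from dataclasses import dataclass


@dataclass
class ServerStats:
    """Hypothetical snapshot of one server as seen by the leader."""
    serf_status: str        # Serf health status, e.g. "alive"
    last_contact_ms: float  # time since last contact with the leader
    last_term: int          # latest Raft term known to the server
    last_index: int         # latest Raft log index on the server


def is_healthy(server, leader_term, leader_index,
               last_contact_threshold_ms=200.0, max_trailing_logs=250):
    """Return True only if all four conditions from the guide hold."""
    # Number of Raft log entries the server trails the leader by.
    trailing_logs = leader_index - server.last_index
    return (
        server.serf_status.lower() == "alive"
        and server.last_contact_ms < last_contact_threshold_ms
        and server.last_term == leader_term
        and trailing_logs <= max_trailing_logs
    )


if __name__ == "__main__":
    # A server that is close behind the leader, and one that lags too far.
    caught_up = ServerStats("alive", 50.0, 5, 900)
    lagging = ServerStats("alive", 50.0, 5, 700)
    print(is_healthy(caught_up, leader_term=5, leader_index=1000))
    print(is_healthy(lagging, leader_term=5, leader_index=1000))
```

Because the conditions are combined with `and`, a single failing condition (for example, trailing the leader by more than `MaxTrailingLogs` entries) is enough to mark the server unhealthy.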