mirror of https://github.com/hashicorp/consul
Discuss available strategies for improving server-level and infrastructure-level fault tolerance in Consul.
pull/12893/head
Jared Kirschner
3 years ago
2 changed files with 143 additions and 0 deletions
@@ -0,0 +1,139 @@
---
layout: docs
page_title: Improving Consul Resilience
description: >-
  Fault tolerance is the ability of a system to continue operating without interruption
  despite the failure of one or more components. Consul's resilience, or fault tolerance,
  is determined by the configuration of its voting server agents. Recommended strategies for
  increasing Consul's fault tolerance include using 3 or 5 voting server agents, spreading
  server agents across infrastructure availability zones, and using Consul Enterprise
  redundancy zones to enable backup voting servers to automatically replace lost voters.
---

# Improving Consul Resilience

Fault tolerance is the ability of a system to continue operating without interruption
despite the failure of one or more components.
The most basic production deployment of Consul has 3 server agents and can lose a single
server without interruption.

As you continue to use Consul, your circumstances may change.
Perhaps a datacenter becomes more business critical or risk management policies change,
necessitating an increase in fault tolerance.
The sections below discuss options for how to improve Consul's fault tolerance.

## Fault Tolerance in Consul

Consul's fault tolerance is determined by the configuration of its voting server agents.

Each Consul datacenter depends on a set of Consul voting server agents.
The voting servers ensure Consul has a consistent, fault-tolerant state
by requiring a majority of voting servers, known as a quorum, to agree upon any state changes.
Examples of state changes include: adding or removing services,
adding or removing nodes, and changes in service or node health status.

Without a quorum, Consul experiences an outage:
it cannot provide most of its capabilities because they rely on
the availability of this state information.
If Consul has an outage, normal operation can be restored by following the
[outage recovery guide](https://learn.hashicorp.com/tutorials/consul/recovery-outage).

If Consul is deployed with 3 servers, the quorum size is 2. The deployment can lose 1
server and still maintain quorum, so it has a fault tolerance of 1.
If Consul is instead deployed with 5 servers, the quorum size increases to 3, so
the fault tolerance increases to 2.
To learn more about the relationship between the
number of servers, quorum, and fault tolerance, refer to the
[consensus protocol documentation](/docs/architecture/consensus#deployment_table).

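The arithmetic behind the 3-server and 5-server figures above can be sketched in a few lines of Python. This is an illustrative calculation only, not part of Consul; the function names `quorum_size` and `fault_tolerance` are assumptions made for this example.

```python
def quorum_size(voting_servers: int) -> int:
    """Return the smallest majority of the given number of voting servers."""
    if voting_servers < 1:
        raise ValueError("at least one voting server is required")
    return voting_servers // 2 + 1


def fault_tolerance(voting_servers: int) -> int:
    """Return how many voting servers can fail at once while a quorum remains."""
    return voting_servers - quorum_size(voting_servers)


if __name__ == "__main__":
    for servers in (3, 5):
        print(
            f"{servers} servers: quorum {quorum_size(servers)}, "
            f"fault tolerance {fault_tolerance(servers)}"
        )
```

Under the same formula, an even server count adds no tolerance over the odd count below it: 4 voting servers need a quorum of 3 and still tolerate only 1 failure.
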
Effectively mitigating your risk is more nuanced than just increasing the fault tolerance
metric described above. You must consider:

### Correlated Risks

Are you protected against correlated risks? Infrastructure-level failures can cause multiple servers to fail at the same time. This means that a single infrastructure-level failure could cause a Consul outage, even if your server-level fault tolerance is 2.

### Mitigation Costs

What are the costs of the mitigation? Different mitigation options present different trade-offs for operational complexity, computing cost, and Consul request performance.

## Strategies to Increase Fault Tolerance

The following sections explore several options for increasing Consul's fault tolerance.

HashiCorp recommends all production deployments consider:
- [Spreading Consul servers across availability zones](#spread-servers-across-infrastructure-availability-zones)
- <EnterpriseAlert inline /><a href="#use-backup-voting-servers-to-replace-lost-voters">Using backup voting servers to replace lost voters</a>

### Spread Servers Across Infrastructure Availability Zones

The cloud or on-premises infrastructure underlying your [Consul datacenter](/docs/install/glossary#datacenter)
may be split into several "availability zones".
An availability zone is meant to share no points of failure with other zones by:
- Having power, cooling, and networking systems independent from other zones
- Being physically distant enough from other zones so that large-scale disruptions
  such as natural disasters (flooding, earthquakes) are very unlikely to affect multiple zones

Availability zones are available in the regions of most cloud providers and in some on-premises installations.
If possible, spread your Consul voting servers across 3 availability zones
to protect your Consul datacenter from a single zone-level failure.
For example, if deploying 5 Consul servers across 3 availability zones, place no more than 2 servers in each zone.
If one zone fails, at most 2 servers are lost and quorum will be maintained by the 3 remaining servers.

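The placement rule in the example above can be checked mechanically. The following Python sketch is an illustration under stated assumptions, not a Consul tool: the function `survives_single_zone_failure` and the zone names are hypothetical, and every server in the mapping is assumed to be a voting server.

```python
def survives_single_zone_failure(servers_per_zone: dict[str, int]) -> bool:
    """Return True if losing any one whole zone still leaves a quorum.

    servers_per_zone maps a (hypothetical) zone name to the number of
    voting servers placed in that zone.
    """
    total = sum(servers_per_zone.values())
    if total < 1:
        raise ValueError("at least one voting server is required")
    quorum = total // 2 + 1
    # A zone failure removes every server in that zone at the same time.
    return all(total - lost >= quorum for lost in servers_per_zone.values())


if __name__ == "__main__":
    # 5 servers, no more than 2 per zone: any single zone can fail.
    print(survives_single_zone_failure({"zone-a": 2, "zone-b": 2, "zone-c": 1}))
    # 5 servers with 3 in one zone: losing that zone loses quorum.
    print(survives_single_zone_failure({"zone-a": 3, "zone-b": 1, "zone-c": 1}))
```

The second layout fails the check because losing the 3-server zone leaves only 2 of 5 servers, which is below the quorum of 3.
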
To distribute your Consul servers across availability zones, modify your infrastructure configuration with your infrastructure provider. No change is needed to your Consul server agent configuration.

Additionally, you should leverage resources that can automatically restore your compute instances,
such as autoscaling groups, virtual machine scale sets, or a compute engine autoscaler.
These autoscaling resources can be customized to re-deploy servers into specific availability zones
and ensure the desired number of servers is available at all times.

### Add More Voting Servers

For most production use cases, we recommend using either 3 or 5 voting servers,
yielding a server-level fault tolerance of 1 or 2 respectively.

Even though it would improve fault tolerance,
adding voting servers beyond 5 is **not recommended** because it decreases Consul's performance:
Consul must involve more servers in every state change or consistent read.

Consul Enterprise provides a way to improve fault tolerance without this performance penalty:
[using backup voting servers to replace lost voters](#use-backup-voting-servers-to-replace-lost-voters).

### <EnterpriseAlert inline /> Use Backup Voting Servers to Replace Lost Voters

Consul Enterprise [redundancy zones](/docs/enterprise/redundancy)
can be used to improve fault tolerance without the performance penalty of increasing the number of voting servers.

Each redundancy zone should be assigned 2 or more Consul servers.
If all servers are healthy, only one server per redundancy zone will be an active voter;
all other servers will be backup voters.
If a zone's voter is lost, it will be replaced by:
- A backup voter within the same zone, if any. Otherwise,
- A backup voter within another zone, if any.

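The replacement order described above can be sketched as a small selection function. This is a simplified illustration, not Consul's actual implementation: the `Server` type and `pick_replacement` function are hypothetical, and choosing the first listed backup from another zone is an arbitrary assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Server:
    """Hypothetical record for a healthy backup voter."""
    name: str
    zone: str


def pick_replacement(lost_zone: str, healthy_backups: list[Server]) -> Optional[Server]:
    """Choose a backup voter to replace the lost voter of lost_zone.

    Prefer a backup in the same zone; otherwise fall back to a backup
    in another zone; return None if no healthy backup exists.
    """
    for server in healthy_backups:
        if server.zone == lost_zone:
            return server
    # Assumption: with no same-zone backup, take the first one listed.
    return healthy_backups[0] if healthy_backups else None


if __name__ == "__main__":
    backups = [Server("server-b2", "zone-b"), Server("server-a2", "zone-a")]
    print(pick_replacement("zone-a", backups))  # same-zone backup is preferred
    print(pick_replacement("zone-c", backups))  # falls back to another zone
```
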
Consul can replace lost voters with backup voters within 30 seconds in most cases.
Because this replacement process is not instantaneous,
redundancy zones do not improve immediate fault tolerance:
the number of healthy voting servers that can fail at once without causing an outage.
Instead, redundancy zones improve optimistic fault tolerance:
the number of healthy active and backup voting servers that can fail gradually without causing an outage.

The relationship between these two types of fault tolerance is:

_Optimistic fault tolerance = immediate fault tolerance + the number of healthy backup voters_

For example, consider a Consul datacenter with 3 redundancy zones and 2 servers per zone.
There will be 3 voting servers (1 per zone), meaning a quorum size of 2 and an immediate fault tolerance of 1.
There will also be 3 backup voters (1 per zone), each of which increases the optimistic fault tolerance.
Therefore, the optimistic fault tolerance is 4.
This provides performance similar to a 3-server setup with fault tolerance similar to a 7-server setup.

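The worked example above follows directly from the formula. The Python sketch below reproduces it, assuming every server is healthy, every zone holds the same number of servers, and exactly one server per zone is an active voter; the function name `redundancy_zone_tolerance` is hypothetical.

```python
def redundancy_zone_tolerance(zones: int, servers_per_zone: int) -> tuple[int, int]:
    """Return (immediate, optimistic) fault tolerance for a healthy datacenter.

    Assumes one active voter per redundancy zone and treats every other
    server in a zone as a healthy backup voter.
    """
    if zones < 1 or servers_per_zone < 1:
        raise ValueError("zones and servers_per_zone must both be at least 1")
    voters = zones
    backups = zones * (servers_per_zone - 1)
    quorum = voters // 2 + 1
    immediate = voters - quorum
    # Optimistic fault tolerance = immediate fault tolerance + healthy backup voters.
    optimistic = immediate + backups
    return immediate, optimistic


if __name__ == "__main__":
    immediate, optimistic = redundancy_zone_tolerance(zones=3, servers_per_zone=2)
    print(f"immediate: {immediate}, optimistic: {optimistic}")
```

With 1 server per zone there are no backup voters, so the two values coincide.
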
We recommend associating each Consul redundancy zone with an infrastructure availability zone
to also gain the infrastructure-level fault tolerance benefits provided by availability zones.
However, Consul redundancy zones can be used even without the backing of infrastructure availability zones.

For more information on redundancy zones, refer to:
- [Redundancy zone documentation](/docs/enterprise/redundancy)
  for a more detailed explanation
- [Redundancy zone tutorial](https://learn.hashicorp.com/tutorials/consul/redundancy-zones?in=consul/enterprise)
  to learn how to use them