Fault tolerance is a system's ability to operate without interruption despite component failure. Learn how a set of Consul servers provides fault tolerance through the use of a quorum, and how to further improve control plane resilience through the use of infrastructure zones and Enterprise redundancy zones.
---
# Fault Tolerance
# Fault tolerance
Fault tolerance is the ability of a system to continue operating without interruption
despite the failure of one or more components.
The most basic production deployment of Consul has 3 server agents and can lose a single
server without interruption.
As you continue to use Consul, your circumstances may change.
Perhaps a datacenter becomes more business critical or risk management policies change,
necessitating an increase in fault tolerance.
The sections below discuss options for how to improve Consul's fault tolerance.
You must give careful consideration to reliability in the architecture frameworks that you build. A resilient platform minimizes the remediation actions you need to take when a failure occurs. This document describes how to design and operate a resilient Consul cluster, including the methods and features that support this goal.
Consul has many features, operating both locally and remotely, that can help you offer a resilient service across multiple datacenters.
## Fault Tolerance in Consul
Consul's fault tolerance is determined by the configuration of its voting server agents.
## Introduction
Fault tolerance is the ability of a system to continue operating without interruption
despite the failure of one or more components. In Consul, the number of server agents determines the fault tolerance.
Each Consul datacenter depends on a set of Consul voting server agents.
The voting servers ensure Consul has a consistent, fault-tolerant state
@ -42,28 +40,25 @@ number of servers, quorum, and fault tolerance, refer to the
Effectively mitigating your risk is more nuanced than just increasing the fault tolerance
metric described above. You must consider:
### Correlated Risks
because the infrastructure costs can outweigh the improved resiliency. You must also consider correlated risks at the infrastructure-level. There are occasions when multiple servers fail at the same time. That means that a single failure could cause a Consul outage, even if your server-level fault tolerance is 2.
Are you protected against correlated risks? Infrastructure-level failures can cause multiple servers to fail at the same time. This means that a single infrastructure-level failure could cause a Consul outage, even if your server-level fault tolerance is 2.
Different options for your resilient datacenter present trade-offs between operational complexity, computing cost, and Consul request performance. Consider these factors when designing your resilient architecture.
### Mitigation Costs
## Fault tolerance
What are the costs of the mitigation? Different mitigation options present different trade-offs for operational complexity, computing cost, and Consul request performance.
The following sections explore several options for increasing Consul's fault tolerance. For enhanced reliability, we recommend taking a holistic approach by layering several of these features together.
## Strategies to Increase Fault Tolerance
- Spread servers across infrastructure [availability zones](#availability-zones).
- Use a [minimum quorum size](#quorum-size) to avoid performance impacts.
- Use [redundancy zones](#redundancy-zones) to improve fault tolerance. <EnterpriseAlert inline />
- Use [Autopilot](#autopilot) to automatically prune failed servers and maintain quorum size.
- Use [cluster peering](#cluster-peering) to provide service redundancy.
The following sections explore several options for increasing Consul's fault tolerance.
### Availability zones
HashiCorp recommends all production deployments consider:
- [Spreading Consul servers across availability zones](#spread-servers-across-infrastructure-availability-zones)
- <EnterpriseAlert inline /><a href="#use-backup-voting-servers-to-replace-lost-voters">Using backup voting servers to replace lost voters</a>
### Spread Servers Across Infrastructure Availability Zones
The cloud or on-premise infrastructure underlying your [Consul datacenter](/consul/docs/install/glossary#datacenter) can run across multiple availability zones.
The cloud or on-premise infrastructure underlying your [Consul datacenter](/consul/docs/install/glossary#datacenter)
may be split into several "availability zones".
An availability zone is meant to share no points of failure with other zones by:
- Having power, cooling, and networking systems independent from other zones
- Being physically distant enough from other zones so that large-scale disruptions
@ -79,25 +74,25 @@ To distribute your Consul servers across availability zones, modify your infrast
Additionally, you should leverage resources that can automatically restore your compute instance,
such as autoscaling groups, virtual machine scale sets, or compute engine autoscaler.
The autoscaling resources can be customized to re-deploy servers into specific availability zones
and ensure the desired numbers of servers are available at all time.
Customize autoscaling resources to re-deploy servers into specific availability zones and ensure the desired numbers of servers are available at all times.
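
The effect of zone placement on correlated failures can be checked with a few lines of code. The following is a minimal Python sketch, not part of Consul: the function name `survives_any_zone_failure` and the server and zone names are hypothetical, and it assumes a non-empty mapping in which every server is a voter and quorum is a strict majority of voters.

```python
from collections import Counter


def survives_any_zone_failure(server_zones: dict[str, str]) -> bool:
    """Return True if quorum survives the loss of any single zone.

    `server_zones` maps a (hypothetical) server name to the zone it runs in.
    """
    total = len(server_zones)
    quorum = total // 2 + 1  # strict majority of voting servers
    # The worst case is losing the zone that hosts the most servers.
    largest_zone = max(Counter(server_zones.values()).values())
    return total - largest_zone >= quorum


# Three servers, one per zone: losing any zone leaves 2 of 3 voters.
spread = {"server-1": "zone-a", "server-2": "zone-b", "server-3": "zone-c"}

# Five servers in a single zone: server-level fault tolerance is 2,
# but one zone failure takes down every voter.
single_zone = {f"server-{i}": "zone-a" for i in range(1, 6)}

print(survives_any_zone_failure(spread))
print(survives_any_zone_failure(single_zone))
```

The second case illustrates the correlated risk described earlier: a server-level fault tolerance of 2 does not help when all servers share one point of failure.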
### Add More Voting Servers
### Quorum size
For most production use cases, we recommend using either 3 or 5 voting servers,
For most production use cases, we recommend using a minimum quorum of either 3 or 5 voting servers,
yielding a server-level fault tolerance of 1 or 2 respectively.
Even though it would improve fault tolerance,
adding voting servers beyond 5 is **not recommended** because it decreases Consul's performance—
it requires Consul to involve more servers in every state change or consistent read.
Consul Enterprise provides a way to improve fault tolerance without this performance penalty:
[using backup voting servers to replace lost voters](#use-backup-voting-servers-to-replace-lost-voters).
Consul Enterprise users can use redundancy zones to improve fault tolerance without this performance penalty.
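
The relationship between the number of voting servers and fault tolerance can be sketched as follows. This is an illustrative Python snippet, not Consul code; the helper names `quorum` and `fault_tolerance` are hypothetical, and it assumes quorum is a strict majority of voting servers.

```python
def quorum(voters: int) -> int:
    """Smallest strict majority of `voters` voting servers."""
    return voters // 2 + 1


def fault_tolerance(voters: int) -> int:
    """Number of voting servers that can fail while a majority remains."""
    return voters - quorum(voters)


for n in range(1, 8):
    print(f"{n} voters: quorum={quorum(n)}, fault tolerance={fault_tolerance(n)}")
```

Under this assumption, an even number of voters gives no more fault tolerance than the odd number below it, which is one reason odd sizes such as 3 and 5 are the usual choices.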
### Redundancy zones <EnterpriseAlert inline />
### <EnterpriseAlert inline /> Use Backup Voting Servers to Replace Lost Voters
Use Consul Enterprise [redundancy zones](/consul/docs/enterprise/redundancy) to improve fault tolerance without the performance penalty of increasing the number of voting servers.
Consul Enterprise [redundancy zones](/consul/docs/enterprise/redundancy)
can be used to improve fault tolerance without the performance penalty of increasing the number of voting servers.
![Reference architecture diagram for Consul Redundancy zones](/img/architecture/consul-redundancy-zones-light.png#light-theme-only)
![Reference architecture diagram for Consul Redundancy zones](/img/architecture/consul-redundancy-zones-dark.png#dark-theme-only)
Each redundancy zone should be assigned 2 or more Consul servers.
If all servers are healthy, only one server per redundancy zone will be an active voter;
@ -132,3 +127,51 @@ For more information on redundancy zones, refer to:
for a more detailed explanation
- [Redundancy zone tutorial](/consul/tutorials/enterprise/redundancy-zones)
to learn how to use them
### Autopilot
Autopilot is a set of functions that introduce servers to a cluster, clean up dead servers, and monitor the state of the Raft protocol in the Consul cluster.
When you enable Autopilot's dead server cleanup, Autopilot marks failed servers as `Left` and removes them from the Raft peer set to prevent them from interfering with the quorum size. Autopilot does that as soon as a replacement Consul server comes online. This behavior is beneficial when server nodes have failed and been redeployed, but Consul considers them new nodes because their IP addresses and hostnames have changed. Autopilot keeps the cluster peer set size correct and the quorum requirement simple.
To illustrate the Autopilot advantage, consider a scenario where Consul has a cluster of five server nodes. The quorum is three, which means the cluster can lose two server nodes before the cluster fails. The following events happen:
1. Two server nodes fail.
1. Two replacement nodes are deployed with new hostnames and IPs.
1. The two replacement nodes rejoin the Consul cluster.
1. Consul treats the replacement nodes as extra nodes, unrelated to the previously failed nodes.
_With Autopilot not enabled_, the following happens:
1. Consul does not immediately clean up the failed nodes when the replacement nodes join the cluster.
1. The cluster now has the three surviving nodes, the two failed nodes, and the two replacement nodes, for a total of seven nodes.
    - The quorum increases to four, which means the cluster can afford to lose only one node until the two failed nodes are removed after seventy-two hours.
- The redundancy level has decreased from its initial state.
_With Autopilot enabled_, the following happens:
1. Consul immediately cleans up the failed nodes when the replacement nodes join the cluster.
1. The cluster now has the three surviving nodes and the two replacement nodes, for a total of five nodes.
- The quorum stays at three, which means the cluster can afford to lose two nodes before it fails.
- The redundancy level remains the same.
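
The arithmetic behind the two outcomes above can be reproduced with a short Python sketch. It is illustrative only and does not call Consul: the helper names `quorum` and `tolerable_failures` are hypothetical, and it assumes quorum is a strict majority of the Raft peer set, which still counts failed servers until they are removed.

```python
def quorum(peers: int) -> int:
    """Strict majority of the Raft peer set."""
    return peers // 2 + 1


def tolerable_failures(healthy: int, failed_in_peer_set: int) -> int:
    """Healthy servers that can still be lost before quorum is lost.

    Failed servers that have not been cleaned up still count toward
    the peer set size, which raises the quorum requirement.
    """
    peers = healthy + failed_in_peer_set
    return healthy - quorum(peers)


# Three survivors plus two replacements are healthy in both cases.
without_cleanup = tolerable_failures(healthy=5, failed_in_peer_set=2)
with_cleanup = tolerable_failures(healthy=5, failed_in_peer_set=0)

print(f"without dead server cleanup: can lose {without_cleanup} more node(s)")
print(f"with dead server cleanup: can lose {with_cleanup} more node(s)")
```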
### Cluster peering
Linking multiple Consul clusters together to provide service redundancy is the most effective method to prevent disruption from failure. This method is enhanced when you design individual Consul clusters with resilience in mind. Consul clusters interconnect in two ways: WAN federation and cluster peering. We recommend using cluster peering whenever possible.
Cluster peering lets you connect two or more independent Consul clusters using mesh gateways, so that services can communicate between non-identical partitions in different datacenters.
![Reference architecture diagram for Consul cluster peering](/img/architecture/cluster-peering-diagram-light.png#light-theme-only)
![Reference architecture diagram for Consul cluster peering](/img/architecture/cluster-peering-diagram-dark.png#dark-theme-only)
Cluster peering is the preferred way to interconnect clusters because it is operationally easier to configure and manage than WAN federation. Cluster peering communication between two datacenters runs only on one port on the related Consul mesh gateway, which makes it operationally easy to expose for routing purposes.
When you use cluster peering to connect admin partitions between datacenters, use Consul’s dynamic traffic management configuration entries `service-splitter`, `service-router`, and `service-resolver` to configure your service mesh to automatically forward or fail over service traffic between peer clusters. Consul can then manage the traffic intended for the service and do [failover](/consul/docs/connect/config-entries/service-resolver#spec-failover), [load-balancing](/consul/docs/connect/config-entries/service-resolver#spec-loadbalancer), or [redirection](/consul/docs/connect/config-entries/service-resolver#spec-redirect).
Cluster peering also extends service discovery across different datacenters independent of service mesh functions. After you peer datacenters, you can refer to services between datacenters with `<service>.virtual.peer.consul` in Consul DNS. For Consul Enterprise, your query string may need to include the namespace, partition, or both. Refer to the [Consul DNS documentation](/consul/docs/services/discovery/dns-static-lookups#service-virtual-ip-lookups) for details on building virtual service lookups.
For more information on cluster peering, refer to: