Fault tolerance is a system's ability to operate without interruption despite component failure. Learn how a set of Consul servers provides fault tolerance through the use of a quorum, and how to further improve control plane resilience through the use of infrastructure availability zones and Consul Enterprise redundancy zones.
Give careful consideration to reliability when you design your architecture. A resilient platform minimizes the remediation actions you need to take when a failure occurs. This document describes how to design and operate a resilient Consul cluster, and the features that help you achieve that goal.
Consul has many features, operating both locally and remotely, that can help you offer a resilient service across multiple datacenters.
Adding more voting servers increases fault tolerance, but the infrastructure costs can outweigh the improved resiliency. You must also consider correlated risks at the infrastructure level. There are occasions when multiple servers fail at the same time, such as when they run on the same host or in the same availability zone. In that case, a single underlying failure could cause a Consul outage, even if your server-level fault tolerance is 2.
Different options for your resilient datacenter present trade-offs between operational complexity, computing cost, and Consul request performance. Consider these factors when designing your resilient architecture.
The following sections explore several options for increasing Consul's fault tolerance. For enhanced reliability, we recommend taking a holistic approach by layering these multiple functionalities together.
### Spread servers across infrastructure availability zones

The cloud or on-premises infrastructure underlying your [Consul datacenter](/consul/docs/install/glossary#datacenter) can run across multiple availability zones.
To distribute your Consul servers across availability zones, modify your infrastructure configuration with your infrastructure provider. No change is needed to your Consul server's agent configuration.
Customize autoscaling resources to re-deploy servers into specific availability zones and ensure the desired numbers of servers are available at all times.
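For example, the following Terraform sketch spreads a Consul server Auto Scaling group across three availability zones on AWS and keeps the desired number of servers running. The subnet and launch template resources are hypothetical placeholders, not part of the Consul documentation; adapt them to your environment and provider.

```hcl
# Sketch: run five Consul servers in an AWS Auto Scaling group that spans
# three availability zones. Subnets and the launch template are placeholders.
resource "aws_autoscaling_group" "consul_servers" {
  name             = "consul-servers"
  min_size         = 5
  max_size         = 5
  desired_capacity = 5

  # One subnet per availability zone, so a replacement server is
  # redeployed into the zone that lost capacity.
  vpc_zone_identifier = [
    aws_subnet.az1.id,
    aws_subnet.az2.id,
    aws_subnet.az3.id,
  ]

  launch_template {
    id      = aws_launch_template.consul_server.id
    version = "$Latest"
  }
}
```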
### Use Consul Enterprise redundancy zones

Use Consul Enterprise [redundancy zones](/consul/docs/enterprise/redundancy) to improve fault tolerance without the performance penalty of increasing the number of voting servers.
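As a sketch, the following Consul Enterprise server agent configuration assigns a server to a redundancy zone through its node metadata and tells Autopilot which metadata key defines the zones. The `zone` key and the `us-east-1a` value are illustrative.

```hcl
# Consul Enterprise server agent configuration (sketch).
# Autopilot keeps one voting server per redundancy zone and treats the
# other servers in the zone as non-voting standbys that it can promote
# if the zone's voter fails.
node_meta {
  zone = "us-east-1a"
}

autopilot {
  redundancy_zone_tag = "zone"
}
```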
### Use Autopilot

Autopilot is a set of functions that introduce servers to a cluster, clean up dead servers, and monitor the state of the Raft protocol in the Consul cluster.
When you enable Autopilot's dead server cleanup, Autopilot marks failed servers as `Left` and removes them from the Raft peer set to prevent them from affecting the quorum size. Autopilot does this as soon as a replacement Consul server comes online. This behavior is beneficial when failed server nodes have been redeployed but Consul considers them new nodes because their IP addresses and hostnames have changed. Autopilot keeps the cluster peer set size correct and the quorum requirement simple.
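Dead server cleanup is enabled by default. The following server agent configuration sketch shows the Autopilot settings involved; the threshold values shown are illustrative.

```hcl
# Server agent configuration (sketch): Autopilot settings that control
# how failed servers are detected and cleaned up.
autopilot {
  cleanup_dead_servers      = true    # remove failed servers when a replacement joins
  last_contact_threshold    = "200ms" # maximum time without leader contact before a server is unhealthy
  server_stabilization_time = "10s"   # time a new server must be healthy before it can become a voter
}
```

You can also update these settings on a running cluster with the `consul operator autopilot set-config` command.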
To illustrate the Autopilot advantage, consider a scenario where Consul has a cluster of five server nodes. The quorum is three, which is `(5 / 2) + 1` using integer division. This means the cluster can lose two server nodes before it fails. The following events happen:
1. Two server nodes fail.
1. Two replacement nodes are deployed with new hostnames and IPs.
1. The two replacement nodes rejoin the Consul cluster.
1. Consul treats the replacement nodes as extra nodes, unrelated to the previously failed nodes.
_Without Autopilot enabled_, the following happens:
1. Consul does not immediately clean up the failed nodes when the replacement nodes join the cluster.
1. The cluster now has the three surviving nodes, the two failed nodes, and the two replacement nodes, for a total of seven nodes.
- The quorum increases to four, which means the cluster can only afford to lose one more node until the two failed nodes are removed from the peer set after seventy-two hours.
- The redundancy level has decreased from its initial state.
_With Autopilot enabled_, the following happens:
1. Consul immediately cleans up the failed nodes when the replacement nodes join the cluster.
1. The cluster now has the three surviving nodes and the two replacement nodes, for a total of five nodes.
- The quorum stays at three, which means the cluster can afford to lose two nodes before it fails.
- The redundancy level remains the same.
### Cluster peering
Linking multiple Consul clusters together to provide service redundancy is the most effective method to prevent disruption from failure. This method is enhanced when you design individual Consul clusters with resilience in mind. Consul clusters interconnect in two ways: WAN federation and cluster peering. We recommend using cluster peering whenever possible.
Cluster peering lets you connect two or more independent Consul clusters using mesh gateways, so that services can communicate between non-identical partitions in different datacenters.
![Reference architecture diagram for Consul cluster peering](/img/architecture/cluster-peering-diagram-light.png#light-theme-only)
![Reference architecture diagram for Consul cluster peering](/img/architecture/cluster-peering-diagram-dark.png#dark-theme-only)
Cluster peering is the preferred way to interconnect clusters because it is operationally easier to configure and manage than WAN federation. Cluster peering communication between two datacenters runs only on one port on the related Consul mesh gateway, which makes it operationally easy to expose for routing purposes.
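If you also want peering control plane traffic to flow through those mesh gateways instead of exposing Consul servers directly, you can enable it in the `mesh` config entry. The following is a minimal sketch; apply it with `consul config write`.

```hcl
# Mesh config entry (sketch): establish and maintain cluster peering
# connections through this cluster's mesh gateways.
Kind = "mesh"

Peering {
  PeerThroughMeshGateways = true
}
```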
When you use cluster peering to connect admin partitions between datacenters, use Consul's dynamic traffic management config entries `service-splitter`, `service-router`, and `service-resolver` to configure your service mesh to automatically forward or fail over service traffic between peer clusters. Consul can then manage the traffic intended for the service and perform [failover](/consul/docs/connect/config-entries/service-resolver#spec-failover), [load balancing](/consul/docs/connect/config-entries/service-resolver#spec-loadbalancer), or [redirection](/consul/docs/connect/config-entries/service-resolver#spec-redirect).
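For example, the following `service-resolver` config entry sketch fails traffic for a hypothetical `api` service over to the same service in a peered cluster named `dc2-peer` when no healthy local instances are available. The service and peer names are placeholders.

```hcl
# Service resolver config entry (sketch): when local "api" instances are
# unhealthy, fail over to the "api" service exported by the peer "dc2-peer".
Kind = "service-resolver"
Name = "api"

Failover = {
  "*" = {
    Targets = [
      { Peer = "dc2-peer" }
    ]
  }
}
```

For failover to a peer to work, the upstream service must also be exported to this cluster with an `exported-services` config entry in the peer cluster.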
Cluster peering also extends service discovery across different datacenters independent of service mesh functions. After you peer datacenters, you can refer to services between datacenters with `<service>.virtual.peer.consul` in Consul DNS. For Consul Enterprise, your query string may need to include the namespace, partition, or both. Refer to the [Consul DNS documentation](/consul/docs/services/discovery/dns-static-lookups#service-virtual-ip-lookups) for details on building virtual service lookups.
For more information on cluster peering, refer to: