---
layout: "docs"
page_title: "Consensus Protocol"
sidebar_current: "docs-internals-consensus"
description: |-
  Consul uses a consensus protocol to provide Consistency as defined by CAP. This page documents the details of this internal protocol. The consensus protocol is based on Raft: In Search of an Understandable Consensus Algorithm. For a visual explanation of Raft, see The Secret Lives of Data.
---

# Consensus Protocol

Consul uses a [consensus protocol](http://en.wikipedia.org/wiki/Consensus_(computer_science))
to provide [Consistency](http://en.wikipedia.org/wiki/CAP_theorem) as defined by CAP.
This page documents the details of this internal protocol. The consensus protocol is based on
["Raft: In Search of an Understandable Consensus Algorithm"](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf). For a visual explanation of Raft, see [The Secret Lives of Data](http://thesecretlivesofdata.com/raft).

~> **Advanced Topic!** This page covers technical details of
the internals of Consul. You don't need to know these details to effectively
operate and use Consul. These details are documented here for those who wish
to learn about them without having to go spelunking through the source code.

## Raft Protocol Overview

Raft is a relatively new consensus algorithm that is based on Paxos,
but is designed to have fewer states and a simpler, more understandable
algorithm. There are a few key terms to know when discussing Raft:

* Log - The primary unit of work in a Raft system is a log entry. The problem
of consistency can be decomposed into a *replicated log*. A log is an ordered
sequence of entries. We consider the log consistent if all members agree on
the entries and their order.

* FSM - [Finite State Machine](http://en.wikipedia.org/wiki/Finite-state_machine).
An FSM is a collection of finite states with transitions between them. As new logs
are applied, the FSM is allowed to transition between states. Application of the
same sequence of logs must result in the same state, meaning behavior must be deterministic.

* Peer set - The peer set is the set of all members participating in log replication.
For Consul's purposes, all server nodes are in the peer set of the local datacenter.

* Quorum - A quorum is a majority of members from a peer set: for a set of size `n`,
quorum requires at least `(n/2)+1` members, with `n/2` rounded down.
For example, if there are 5 members in the peer set, we would need 3 nodes
to form a quorum. If a quorum of nodes is unavailable for any reason, then the
cluster becomes *unavailable*, and no new logs can be committed.

* Committed Entry - An entry is considered *committed* when it is durably stored
on a quorum of nodes. Once an entry is committed it can be applied.

* Leader - At any given time, the peer set elects a single node to be the leader.
The leader is responsible for ingesting new log entries, replicating to followers,
and managing when an entry is considered committed.
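
The relationship between the log and the FSM can be illustrated with a minimal sketch. The `LogEntry` and `KeyValueFSM` names and the `set`/`delete` operations are hypothetical and are not Consul's actual types; the point is only that replaying the same ordered entries always produces the same state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """One entry in the replicated log (hypothetical shape)."""
    index: int
    op: str          # "set" or "delete"
    key: str
    value: str = ""


class KeyValueFSM:
    """A toy deterministic state machine driven by log entries."""

    def __init__(self):
        self.state = {}
        self.last_applied = 0

    def apply(self, entry):
        # Entries must be applied strictly in log order.
        if entry.index != self.last_applied + 1:
            raise ValueError("entries must be applied in log order")
        if entry.op == "set":
            self.state[entry.key] = entry.value
        elif entry.op == "delete":
            self.state.pop(entry.key, None)
        else:
            raise ValueError(f"unknown op: {entry.op}")
        self.last_applied = entry.index


def replay(entries):
    """Build a fresh FSM by applying every entry in order."""
    fsm = KeyValueFSM()
    for entry in entries:
        fsm.apply(entry)
    return fsm


log = [
    LogEntry(1, "set", "web", "10.0.0.1"),
    LogEntry(2, "set", "db", "10.0.0.2"),
    LogEntry(3, "delete", "web"),
]

# Two independent replicas that apply the same log agree on the state.
replica_a = replay(log)
replica_b = replay(log)
```

Because `apply` depends only on the entry and the current state, any member that has applied the same prefix of the log holds the same data.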

Raft is a complex protocol, and will not be covered here in detail. For the full
specification, we recommend reading the [paper](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf). We will attempt to provide a high
level description, which may be useful for building a mental picture.

Raft nodes are always in one of three states: follower, candidate or leader. All
nodes initially start out as followers. In this state, nodes can accept log entries
from a leader and cast votes. If no entries are received for some time, nodes
self-promote to the candidate state. In the candidate state, nodes request votes from
their peers. If a candidate receives a quorum of votes, then it is promoted to leader.
The leader must accept new log entries and replicate them to all the other followers.
In addition, if stale reads are not acceptable, all queries must also be performed on
the leader.
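
The state transitions described above can be sketched as a small state machine. This is a simplified illustration, not Consul's implementation: the `Node` class and its method names are hypothetical, and details of real Raft such as terms, vote restrictions, and stepping down are deliberately omitted.

```python
from enum import Enum


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


class Node:
    """Tracks only the role of a single Raft node (hypothetical sketch)."""

    def __init__(self, peer_count):
        # peer_count is the size of the whole peer set, including this node.
        self.peer_count = peer_count
        self.role = Role.FOLLOWER  # every node starts as a follower

    def on_election_timeout(self):
        # No entries were received from a leader for some time.
        if self.role is Role.FOLLOWER:
            self.role = Role.CANDIDATE

    def on_votes_counted(self, votes):
        # votes includes the candidate's vote for itself.
        quorum = self.peer_count // 2 + 1
        if self.role is Role.CANDIDATE and votes >= quorum:
            self.role = Role.LEADER


node = Node(peer_count=5)
node.on_election_timeout()   # follower -> candidate
node.on_votes_counted(2)     # below quorum of 3, still a candidate
role_after_two_votes = node.role
node.on_votes_counted(3)     # quorum reached, becomes leader
```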

Once a cluster has a leader, it is able to accept new log entries. A client can
request that a leader append a new log entry, which is an opaque binary blob to
Raft. The leader then writes the entry to durable storage and attempts to replicate
to a quorum of followers. Once the log entry is considered *committed*, it can be
*applied* to a finite state machine. The finite state machine is application specific,
and in Consul's case, we use [LMDB](http://symas.com/mdb/) to maintain cluster state.
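
As a rough sketch of this write path, the following toy leader writes an entry locally, counts how many followers stored it, and applies it to a dictionary-based FSM once a quorum holds the entry. The `Leader` class and the `follower_acks` parameter are assumptions made for illustration; real replication is asynchronous and involves retries that are not modeled here.

```python
class Leader:
    """Toy leader: append, count acknowledgements, commit, apply."""

    def __init__(self, follower_count):
        self.follower_count = follower_count
        self.log = []            # stand-in for durable storage
        self.commit_index = 0
        self.fsm_state = {}      # stand-in for the application FSM

    def quorum(self):
        # Majority of the full peer set (followers plus the leader).
        return (self.follower_count + 1) // 2 + 1

    def append(self, key, value, follower_acks):
        if not 0 <= follower_acks <= self.follower_count:
            raise ValueError("follower_acks out of range")
        index = len(self.log) + 1
        self.log.append((index, key, value))   # leader writes first
        stored_on = 1 + follower_acks          # the leader's copy counts
        if stored_on < self.quorum():
            return False                       # not committed yet
        # A follower that stores this entry also stores every earlier one,
        # so committing it commits any earlier uncommitted entries too.
        for i, k, v in self.log[self.commit_index:index]:
            self.fsm_state[k] = v
        self.commit_index = index
        return True


leader = Leader(follower_count=4)             # 5 servers, quorum of 3
first = leader.append("a", "1", follower_acks=2)
second = leader.append("b", "2", follower_acks=1)
state_after_second = dict(leader.fsm_state)
third = leader.append("c", "3", follower_acks=4)
```

Note that the second entry stays in the leader's log without being applied until a later commit covers it.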

An obvious question relates to the unbounded nature of a replicated log. Raft provides
a mechanism by which the current state is snapshotted, and the log is compacted. Because
of the FSM abstraction, restoring the state of the FSM must result in the same state
as a replay of old logs. This allows Raft to capture the FSM state at a point in time,
and then remove all the logs that were used to reach that state. This is performed automatically
without user intervention, and prevents unbounded disk usage as well as minimizing
time spent replaying logs. One of the advantages of using LMDB is that it allows Consul
to continue accepting new transactions even while old state is being snapshotted,
preventing any availability issues.
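
The snapshot-and-compact idea can be shown with a minimal in-memory sketch. The `SnapshottingStore` class is hypothetical and ignores durability and concurrency; it only demonstrates that a snapshot plus the remaining log rebuilds the same state as the full log would.

```python
class SnapshottingStore:
    """Toy store that compacts its log by snapshotting FSM state."""

    def __init__(self):
        self.state = {}
        self.log = []            # list of (index, key, value)
        self.last_index = 0
        self.snapshot = None     # (last_included_index, state copy)

    def apply(self, key, value):
        self.last_index += 1
        self.log.append((self.last_index, key, value))
        self.state[key] = value

    def take_snapshot(self):
        # Capture the FSM state at this point in time...
        self.snapshot = (self.last_index, dict(self.state))
        # ...then drop every log entry the snapshot already covers.
        self.log = [e for e in self.log if e[0] > self.last_index]

    def restore(self):
        # Rebuild state from the snapshot plus any newer log entries.
        index, state = self.snapshot if self.snapshot else (0, {})
        rebuilt = dict(state)
        for i, key, value in self.log:
            if i > index:
                rebuilt[key] = value
        return rebuilt


store = SnapshottingStore()
store.apply("a", "1")
store.apply("b", "2")
store.apply("a", "3")
store.take_snapshot()        # log is now empty, state is preserved
store.apply("c", "4")        # new writes continue after the snapshot
```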

Lastly, there is the issue of updating the peer set when new servers are joining
or existing servers are leaving. As long as a quorum of nodes is available, this
is not an issue as Raft provides mechanisms to dynamically update the peer set.
If a quorum of nodes is unavailable, then this becomes a very challenging issue.
For example, suppose there are only 2 peers, A and B. The quorum size is also
2, meaning both nodes must agree to commit a log entry. If either A or B fails,
it is now impossible to reach quorum. This means the cluster is unable to add
or remove a node, or commit any additional log entries. This results in *unavailability*.
At this point, manual intervention would be required to remove either A or B,
and to restart the remaining node in bootstrap mode.

A Raft cluster of 3 nodes can tolerate a single node failure, while a cluster
of 5 can tolerate 2 node failures. The recommended configuration is to run
either 3 or 5 Consul servers per datacenter. This maximizes availability without
greatly sacrificing performance. See the [deployment table](#deployment_table) below.

In terms of performance, Raft is comparable to Paxos. Assuming stable leadership,
committing a log entry requires a single round trip to half of the cluster.
Thus performance is bound by disk I/O and network latency. Although Consul is
not designed to be a high-throughput write system, it should handle on the order
of hundreds to thousands of transactions per second depending on network and
hardware configuration.

## Raft in Consul

Only Consul server nodes participate in Raft and are part of the peer set. All
client nodes forward requests to servers. Part of the reason for this design is
that as more members are added to the peer set, the size of the quorum also increases.
This introduces performance problems as you may be waiting for hundreds of machines
to agree on an entry instead of a handful.

When getting started, a single Consul server is put into "bootstrap" mode. This mode
allows it to self-elect as a leader. Once a leader is elected, other servers can be
added to the peer set in a way that preserves consistency and safety. Eventually,
bootstrap mode can be disabled, once the first few servers are added. See [this
guide](/docs/guides/bootstrapping.html) for more details.

Since all servers participate as part of the peer set, they all know the current
leader. When an RPC request arrives at a non-leader server, the request is
forwarded to the leader. If the RPC is a *query* type, meaning it is read-only,
then the leader generates the result based on the current state of the FSM. If
the RPC is a *transaction* type, meaning it modifies state, then the leader
generates a new log entry and applies it using Raft. Once the log entry is committed
and applied to the FSM, the transaction is complete.
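
The request routing described above can be sketched as follows. The `Server` class and the `"query"`/`"transaction"` strings are hypothetical, and the Raft commit step is replaced by a direct dictionary write on the leader; replication to followers is not modeled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Server:
    """Toy server that forwards RPCs to the leader (hypothetical sketch)."""
    name: str
    is_leader: bool = False
    leader: Optional["Server"] = None
    fsm: dict = field(default_factory=dict)

    def handle(self, rpc_type, key, value=None):
        if not self.is_leader:
            if self.leader is None:
                raise RuntimeError("no known leader")
            # Non-leader servers forward the request to the leader.
            return self.leader.handle(rpc_type, key, value)
        if rpc_type == "query":
            # Read-only: answer from the leader's current FSM state.
            return self.fsm.get(key)
        if rpc_type == "transaction":
            # Stand-in for "append a log entry, commit it via Raft,
            # then apply it to the FSM".
            self.fsm[key] = value
            return True
        raise ValueError(f"unknown RPC type: {rpc_type}")


leader = Server("server-1", is_leader=True)
follower = Server("server-2", leader=leader)

write_ok = follower.handle("transaction", "service/web", "10.0.0.1")
read_back = follower.handle("query", "service/web")
```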

Because of the nature of Raft's replication, performance is sensitive to network
latency. For this reason, each datacenter elects an independent leader, and maintains
a disjoint peer set. Data is partitioned by datacenter, so each leader is responsible
only for data in its own datacenter. When a request is received for a remote datacenter,
the request is forwarded to the correct leader. This design allows for lower latency
transactions and higher availability without sacrificing consistency.

## Consistency Modes

Although all writes to the replicated log go through Raft, reads are more
flexible. To support various tradeoffs that developers may want, Consul
supports 3 different consistency modes for reads.

The three read modes are:

* `default` - Raft makes use of leader leasing, providing a time window
  in which the leader assumes its role is stable. However, if a leader
  is partitioned from the remaining peers, a new leader may be elected
  while the old leader is holding the lease. This means there are 2 leader
  nodes. There is no risk of a split-brain since the old leader will be
  unable to commit new logs. However, if the old leader services any reads
  the values are potentially stale. The default consistency mode relies only
  on leader leasing, exposing clients to potentially stale values. We make
  this trade-off because reads are fast, usually strongly consistent, and
  only stale in a hard-to-trigger situation. The time window of stale reads
  is also bounded, since the leader will step down due to the partition.

* `consistent` - This mode is strongly consistent without caveats. It requires
  that a leader verify with a quorum of peers that it is still leader. This
  introduces an additional round trip to all server nodes. The trade-off is
  always consistent reads, but increased latency due to the extra round trip.

* `stale` - This mode allows any server to service the read, regardless of whether
  it is the leader. This means reads can be arbitrarily stale, but are generally
  within 50 milliseconds of the leader. The trade-off is very fast and scalable
  reads, but values may be stale. This mode allows reads without a leader, meaning
  a cluster that is unavailable will still be able to respond.

For more documentation about using these various modes, see the [HTTP API](/docs/agent/http.html).
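
As a small illustration, the helper below builds a key/value read URL for each mode, using the `stale` and `consistent` query parameters of the HTTP API. The `kv_read_url` function is a hypothetical helper, and the base address assumes a local agent listening on the default HTTP port; adjust both to your environment.

```python
def kv_read_url(key, mode="default", base="http://127.0.0.1:8500"):
    """Return the URL for reading a key with the given consistency mode.

    The base address is an assumption (local agent, default port).
    """
    if mode not in ("default", "consistent", "stale"):
        raise ValueError(f"unknown consistency mode: {mode}")
    url = f"{base}/v1/kv/{key}"
    if mode != "default":
        # The non-default modes are selected with a bare query parameter.
        url += f"?{mode}"
    return url


default_url = kv_read_url("config/db")
consistent_url = kv_read_url("config/db", mode="consistent")
stale_url = kv_read_url("config/db", mode="stale")
```

A client that can tolerate slightly old data might use the `stale` URL for high-volume reads and reserve `consistent` for reads that must never observe an outdated value.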

## <a name="deployment_table"></a>Deployment Table

The table below shows, for each cluster size, how large the quorum is
and how many node failures can be tolerated. The
recommended deployment is either 3 or 5 servers. A single server deployment
is _**highly**_ discouraged as data loss is inevitable in a failure scenario.

<table class="table table-bordered table-striped">
  <tr>
    <th>Servers</th>
    <th>Quorum Size</th>
    <th>Failure Tolerance</th>
  </tr>
  <tr>
    <td>1</td>
    <td>1</td>
    <td>0</td>
  </tr>
  <tr>
    <td>2</td>
    <td>2</td>
    <td>0</td>
  </tr>
  <tr class="warning">
    <td>3</td>
    <td>2</td>
    <td>1</td>
  </tr>
  <tr>
    <td>4</td>
    <td>3</td>
    <td>1</td>
  </tr>
  <tr class="warning">
    <td>5</td>
    <td>3</td>
    <td>2</td>
  </tr>
  <tr>
    <td>6</td>
    <td>4</td>
    <td>2</td>
  </tr>
  <tr>
    <td>7</td>
    <td>4</td>
    <td>3</td>
  </tr>
</table>
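
The rows of this table follow directly from the majority rule. A short sketch, using hypothetical helper names, reproduces them:

```python
def quorum_size(servers):
    """Smallest majority of a peer set with the given number of servers."""
    return servers // 2 + 1


def failure_tolerance(servers):
    """How many servers can fail while a quorum is still reachable."""
    return servers - quorum_size(servers)


# (servers, quorum size, failure tolerance) for 1 through 7 servers.
rows = [(n, quorum_size(n), failure_tolerance(n)) for n in range(1, 8)]

for servers, quorum, tolerance in rows:
    print(f"{servers} servers: quorum {quorum}, tolerates {tolerance} failures")
```

The arithmetic also shows why even cluster sizes add little: moving from 3 to 4 servers raises the quorum size without raising the failure tolerance.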