mirror of https://github.com/hashicorp/consul
Website: cleanup for docs/internals/consensus.html, including removing LMDB references in favor of BoltDB per GH-857.
parent
9c85ea0c47
commit
2b9bb18921
|
@ -3,15 +3,16 @@ layout: "docs"
|
||||||
page_title: "Consensus Protocol"
|
page_title: "Consensus Protocol"
|
||||||
sidebar_current: "docs-internals-consensus"
|
sidebar_current: "docs-internals-consensus"
|
||||||
description: |-
|
description: |-
|
||||||
Consul uses a consensus protocol to provide Consistency as defined by CAP. This page documents the details of this internal protocol. The consensus protocol is based on Raft: In search of an Understandable Consensus Algorithm. For a visual explanation of Raft, see the The Secret Lives of Data.
|
Consul uses a consensus protocol to provide Consistency as defined by CAP. The consensus protocol is based on Raft: In search of an Understandable Consensus Algorithm. For a visual explanation of Raft, see The Secret Lives of Data.
|
||||||
---
|
---
|
||||||
|
|
||||||
# Consensus Protocol
|
# Consensus Protocol
|
||||||
|
|
||||||
Consul uses a [consensus protocol](http://en.wikipedia.org/wiki/Consensus_(computer_science))
|
Consul uses a [consensus protocol](http://en.wikipedia.org/wiki/Consensus_(computer_science))
|
||||||
to provide [Consistency](http://en.wikipedia.org/wiki/CAP_theorem) as defined by CAP.
|
to provide [Consistency](http://en.wikipedia.org/wiki/CAP_theorem) as defined by CAP.
|
||||||
This page documents the details of this internal protocol. The consensus protocol is based on
|
The consensus protocol is based on
|
||||||
["Raft: In search of an Understandable Consensus Algorithm"](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf). For a visual explanation of Raft, see the [The Secret Lives of Data](http://thesecretlivesofdata.com/raft).
|
["Raft: In search of an Understandable Consensus Algorithm"](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf).
|
||||||
|
For a visual explanation of Raft, see [The Secret Lives of Data](http://thesecretlivesofdata.com/raft).
|
||||||
|
|
||||||
~> **Advanced Topic!** This page covers technical details of
|
~> **Advanced Topic!** This page covers technical details of
|
||||||
the internals of Consul. You don't need to know these details to effectively
|
the internals of Consul. You don't need to know these details to effectively
|
||||||
|
@ -20,9 +21,12 @@ to learn about them without having to go spelunking through the source code.
|
||||||
|
|
||||||
## Raft Protocol Overview
|
## Raft Protocol Overview
|
||||||
|
|
||||||
Raft is a relatively new consensus algorithm that is based on Paxos,
|
Raft is a consensus algorithm that is based on
|
||||||
but is designed to have fewer states and a simpler, more understandable
|
[Paxos](http://en.wikipedia.org/wiki/Paxos_%28computer_science%29). Compared
|
||||||
algorithm. There are a few key terms to know when discussing Raft:
|
to Paxos, Raft is designed to have fewer states and a simpler, more
|
||||||
|
understandable algorithm.
|
||||||
|
|
||||||
|
There are a few key terms to know when discussing Raft:
|
||||||
|
|
||||||
* Log - The primary unit of work in a Raft system is a log entry. The problem
|
* Log - The primary unit of work in a Raft system is a log entry. The problem
|
||||||
of consistency can be decomposed into a *replicated log*. A log is an ordered
|
of consistency can be decomposed into a *replicated log*. A log is an ordered
|
||||||
|
@ -37,10 +41,11 @@ same sequence of logs must result in the same state, meaning behavior must be de
|
||||||
* Peer set - The peer set is the set of all members participating in log replication.
|
* Peer set - The peer set is the set of all members participating in log replication.
|
||||||
For Consul's purposes, all server nodes are in the peer set of the local datacenter.
|
For Consul's purposes, all server nodes are in the peer set of the local datacenter.
|
||||||
|
|
||||||
* Quorum - A quorum is a majority of members from a peer set, or (n/2)+1.
|
* Quorum - A quorum is a majority of members from a peer set: for a set of size `n`,
|
||||||
|
quorum requires at least `(n/2)+1` members.
|
||||||
For example, if there are 5 members in the peer set, we would need 3 nodes
|
For example, if there are 5 members in the peer set, we would need 3 nodes
|
||||||
to form a quorum. If a quorum of nodes is unavailable for any reason, then the
|
to form a quorum. If a quorum of nodes is unavailable for any reason, the
|
||||||
cluster becomes *unavailable*, and no new logs can be committed.
|
cluster becomes *unavailable* and no new logs can be committed.
|
||||||
|
|
||||||
* Committed Entry - An entry is considered *committed* when it is durably stored
|
* Committed Entry - An entry is considered *committed* when it is durably stored
|
||||||
on a quorum of nodes. Once an entry is committed it can be applied.
|
on a quorum of nodes. Once an entry is committed it can be applied.
|
||||||
|
@ -49,83 +54,84 @@ on a quorum of nodes. Once an entry is committed it can be applied.
|
||||||
The leader is responsible for ingesting new log entries, replicating to followers,
|
The leader is responsible for ingesting new log entries, replicating to followers,
|
||||||
and managing when an entry is considered committed.
|
and managing when an entry is considered committed.
|
||||||
|
|
||||||
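As a rough illustration of the quorum arithmetic in the Quorum entry above, the following Python sketch computes the quorum size and the number of tolerable failures for a few peer-set sizes. The function names are hypothetical and are not part of Consul; the sketch assumes `n/2` means integer division.

```python
def quorum_size(n: int) -> int:
    """Smallest majority of a peer set with n members: (n/2)+1, using integer division."""
    if n < 1:
        raise ValueError("peer set must have at least one member")
    return n // 2 + 1


def failure_tolerance(n: int) -> int:
    """Number of members that can be lost while a quorum can still be formed."""
    return n - quorum_size(n)


if __name__ == "__main__":
    # Print one row per peer-set size: servers, quorum size, failures tolerated.
    for servers in (1, 2, 3, 5):
        print(servers, quorum_size(servers), failure_tolerance(servers))
```

With 5 members, `quorum_size(5)` gives 3, matching the example in the text; with 2 members the quorum is also 2, so no failure can be tolerated.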
Raft is a complex protocol, and will not be covered here in detail. For the full
|
Raft is a complex protocol and will not be covered here in detail. For the full
|
||||||
specification, we recommend reading the [paper](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf). We will attempt to provide a high
|
specification, we recommend reading this [paper](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf).
|
||||||
level description, which may be useful for building a mental picture.
|
We will, however, attempt to provide a high-level description, which may be useful
|
||||||
|
for building a mental model.
|
||||||
|
|
||||||
Raft nodes are always in one of three states: follower, candidate or leader. All
|
Raft nodes are always in one of three states: follower, candidate, or leader. All
|
||||||
nodes initially start out as a follower. In this state, nodes can accept log entries
|
nodes initially start out as a follower. In this state, nodes can accept log entries
|
||||||
from a leader and cast votes. If no entries are received for some time, nodes
|
from a leader and cast votes. If no entries are received for some time, nodes
|
||||||
self-promote to the candidate state. In the candidate state nodes request votes from
|
self-promote to the candidate state. In the candidate state, nodes request votes from
|
||||||
their peers. If a candidate receives a quorum of votes, then it is promoted to a leader.
|
their peers. If a candidate receives a quorum of votes, then it is promoted to a leader.
|
||||||
The leader must accept new log entries and replicate to all the other followers.
|
The leader must accept new log entries and replicate to all the other followers.
|
||||||
In addition, if stale reads are not acceptable, all queries must also be performed on
|
In addition, if stale reads are not acceptable, all queries must also be performed on
|
||||||
the leader.
|
the leader.
|
||||||
|
|
||||||
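The state transitions described above can be sketched as a small Python function. This is a deliberately simplified model, not Consul's or any Raft library's implementation: the names `State` and `next_state` are hypothetical, `votes` is assumed to be the total number of votes a candidate has collected, and leader step-down is left out.

```python
from enum import Enum


class State(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


def next_state(state: State, *, heard_from_leader: bool, votes: int, peers: int) -> State:
    """Return the next state of a node after one timeout period (simplified)."""
    quorum = peers // 2 + 1
    if state is State.FOLLOWER:
        # A follower that receives no entries for some time self-promotes.
        return State.FOLLOWER if heard_from_leader else State.CANDIDATE
    if state is State.CANDIDATE:
        # A candidate is promoted to leader once it holds a quorum of votes.
        return State.LEADER if votes >= quorum else State.CANDIDATE
    # Leader: stepping down is outside the scope of this sketch.
    return state
```

For example, in a 3-node peer set a candidate holding 2 votes becomes the leader, while one holding a single vote remains a candidate.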
Once a cluster has a leader, it is able to accept new log entries. A client can
|
Once a cluster has a leader, it is able to accept new log entries. A client can
|
||||||
request that a leader append a new log entry, which is an opaque binary blob to
|
request that a leader append a new log entry: from Raft's perspective, a log entry
|
||||||
Raft. The leader then writes the entry to durable storage and attempts to replicate
|
is an opaque binary blob. The leader then writes the entry to durable storage and
|
||||||
to a quorum of followers. Once the log entry is considered *committed*, it can be
|
attempts to replicate to a quorum of followers. Once the log entry is considered
|
||||||
*applied* to a finite state machine. The finite state machine is application specific,
|
*committed*, it can be *applied* to a finite state machine. The finite state machine
|
||||||
and in Consul's case, we use [LMDB](http://symas.com/mdb/) to maintain cluster state.
|
is application specific, and in Consul's case, we use
|
||||||
|
[BoltDB](https://github.com/boltdb/bolt) to maintain cluster state.
|
||||||
|
|
||||||
An obvious question relates to the unbounded nature of a replicated log. Raft provides
|
Obviously, it would be undesirable to allow a replicated log to grow in an unbounded
|
||||||
a mechanism by which the current state is snapshotted, and the log is compacted. Because
|
fashion. Raft provides a mechanism by which the current state is snapshotted and the
|
||||||
of the FSM abstraction, restoring the state of the FSM must result in the same state
|
log is compacted. Because of the FSM abstraction, restoring the state of the FSM must
|
||||||
as a replay of old logs. This allows Raft to capture the FSM state at a point in time,
|
result in the same state as a replay of old logs. This allows Raft to capture the FSM
|
||||||
and then remove all the logs that were used to reach that state. This is performed automatically
|
state at a point in time and then remove all the logs that were used to reach that
|
||||||
without user intervention, and prevents unbounded disk usage as well as minimizing
|
state. This is performed automatically without user intervention and prevents unbounded
|
||||||
time spent replaying logs. One of the advantages of using LMDB is that it allows Consul
|
disk usage while also minimizing time spent replaying logs. One of the advantages of
|
||||||
to continue accepting new transactions even while old state is being snapshotted,
|
using BoltDB is that it allows Consul to continue accepting new transactions even while
|
||||||
preventing any availability issues.
|
old state is being snapshotted, preventing any availability issues.
|
||||||
|
|
||||||
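To make the snapshot argument concrete, here is a toy Python sketch of a deterministic key-value FSM. It only illustrates the property that restoring a snapshot and replaying the remaining log yields the same state as replaying the full log. The entry format and the function names are assumptions made for this example and do not reflect how Consul or BoltDB store data.

```python
from typing import Dict, List, Optional, Tuple

# A log entry is (operation, key, value); value is ignored for "delete".
Entry = Tuple[str, str, str]


def apply(state: Dict[str, str], entry: Entry) -> None:
    """Apply one committed entry to the FSM state in place."""
    op, key, value = entry
    if op == "set":
        state[key] = value
    elif op == "delete":
        state.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def replay(entries: List[Entry], start: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build FSM state by applying entries in order on top of an optional snapshot."""
    state = dict(start or {})
    for entry in entries:
        apply(state, entry)
    return state


log: List[Entry] = [
    ("set", "a", "1"),
    ("set", "b", "2"),
    ("delete", "a", ""),
    ("set", "c", "3"),
]

full_state = replay(log)                      # replay of the complete log
snapshot_index = 3                            # snapshot covers the first 3 entries
snapshot = replay(log[:snapshot_index])       # FSM state captured at that point
compacted_log = log[snapshot_index:]          # older entries can now be discarded
restored_state = replay(compacted_log, start=snapshot)
```

Because the FSM is deterministic, `restored_state` equals `full_state`, which is what allows the entries covered by the snapshot to be removed.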
Lastly, there is the issue of updating the peer set when new servers are joining
|
Consensus is fault-tolerant as long as a quorum is available.
|
||||||
or existing servers are leaving. As long as a quorum of nodes is available, this
|
If a quorum of nodes is unavailable, it is impossible to process log entries or reason
|
||||||
is not an issue as Raft provides mechanisms to dynamically update the peer set.
|
about peer membership. For example, suppose there are only 2 peers: A and B. The quorum
|
||||||
If a quorum of nodes is unavailable, then this becomes a very challenging issue.
|
size is also 2, meaning both nodes must agree to commit a log entry. If either A or B
|
||||||
For example, suppose there are only 2 peers, A and B. The quorum size is also
|
fails, it is now impossible to reach quorum. This means the cluster is unable to add
|
||||||
2, meaning both nodes must agree to commit a log entry. If either A or B fails,
|
or remove a node or to commit any additional log entries. This results in
|
||||||
it is now impossible to reach quorum. This means the cluster is unable to add,
|
*unavailability*. At this point, manual intervention would be required to remove
|
||||||
or remove a node, or commit any additional log entries. This results in *unavailability*.
|
either A or B and to restart the remaining node in bootstrap mode.
|
||||||
At this point, manual intervention would be required to remove either A or B,
|
|
||||||
and to restart the remaining node in bootstrap mode.
|
|
||||||
|
|
||||||
A Raft cluster of 3 nodes can tolerate a single node failure, while a cluster
|
A Raft cluster of 3 nodes can tolerate a single node failure while a cluster
|
||||||
of 5 can tolerate 2 node failures. The recommended configuration is to either
|
of 5 can tolerate 2 node failures. The recommended configuration is to
|
||||||
run 3 or 5 Consul servers per datacenter. This maximizes availability without
|
run 3 or 5 Consul servers per datacenter. This maximizes availability without
|
||||||
greatly sacrificing performance. See below for a deployment table.
|
greatly sacrificing performance. The [deployment table](#deployment_table) below
|
||||||
|
summarizes the potential cluster size options and the fault tolerance of each.
|
||||||
|
|
||||||
In terms of performance, Raft is comparable to Paxos. Assuming stable leadership,
|
In terms of performance, Raft is comparable to Paxos. Assuming stable leadership,
|
||||||
committing a log entry requires a single round trip to half of the cluster.
|
committing a log entry requires a single round trip to half of the cluster.
|
||||||
Thus performance is bound by disk I/O and network latency. Although Consul is
|
Thus, performance is bound by disk I/O and network latency. Although Consul is
|
||||||
not designed to be a high-throughput write system, it should handle on the order
|
not designed to be a high-throughput write system, it should handle on the order
|
||||||
of hundreds to thousands of transactions per second depending on network and
|
of hundreds to thousands of transactions per second depending on network and
|
||||||
hardware configuration.
|
hardware configuration.
|
||||||
|
|
||||||
## Raft in Consul
|
## Raft in Consul
|
||||||
|
|
||||||
Only Consul server nodes participate in Raft, and are part of the peer set. All
|
Only Consul server nodes participate in Raft and are part of the peer set. All
|
||||||
client nodes forward requests to servers. Part of the reason for this design is
|
client nodes forward requests to servers. Part of the reason for this design is
|
||||||
that as more members are added to the peer set, the size of the quorum also increases.
|
that, as more members are added to the peer set, the size of the quorum also increases.
|
||||||
This introduces performance problems as you may be waiting for hundreds of machines
|
This introduces performance problems as you may be waiting for hundreds of machines
|
||||||
to agree on an entry instead of a handful.
|
to agree on an entry instead of a handful.
|
||||||
|
|
||||||
When getting started, a single Consul server is put into "bootstrap" mode. This mode
|
When getting started, a single Consul server is put into "bootstrap" mode. This mode
|
||||||
allows it to self-elect as a leader. Once a leader is elected, other servers can be
|
allows it to self-elect as a leader. Once a leader is elected, other servers can be
|
||||||
added to the peer set in a way that preserves consistency and safety. Eventually,
|
added to the peer set in a way that preserves consistency and safety. Eventually,
|
||||||
bootstrap mode can be disabled, once the first few servers are added. See [this
|
once the first few servers are added, bootstrap mode can be disabled. See [this
|
||||||
guide](/docs/guides/bootstrapping.html) for more details.
|
guide](/docs/guides/bootstrapping.html) for more details.
|
||||||
|
|
||||||
Since all servers participate as part of the peer set, they all know the current
|
Since all servers participate as part of the peer set, they all know the current
|
||||||
leader. When an RPC request arrives at a non-leader server, the request is
|
leader. When an RPC request arrives at a non-leader server, the request is
|
||||||
forwarded to the leader. If the RPC is a *query* type, meaning it is read-only,
|
forwarded to the leader. If the RPC is a *query* type, meaning it is read-only,
|
||||||
then the leader generates the result based on the current state of the FSM. If
|
the leader generates the result based on the current state of the FSM. If
|
||||||
the RPC is a *transaction* type, meaning it modifies state, then the leader
|
the RPC is a *transaction* type, meaning it modifies state, the leader
|
||||||
generates a new log entry and applies it using Raft. Once the log entry is committed
|
generates a new log entry and applies it using Raft. Once the log entry is committed
|
||||||
and applied to the FSM, the transaction is complete.
|
and applied to the FSM, the transaction is complete.
|
||||||
|
|
||||||
Because of the nature of Raft's replication, performance is sensitive to network
|
Because of the nature of Raft's replication, performance is sensitive to network
|
||||||
latency. For this reason, each datacenter elects an independent leader, and maintains
|
latency. For this reason, each datacenter elects an independent leader and maintains
|
||||||
a disjoint peer set. Data is partitioned by datacenter, so each leader is responsible
|
a disjoint peer set. Data is partitioned by datacenter, so each leader is responsible
|
||||||
only for data in their datacenter. When a request is received for a remote datacenter,
|
only for data in their datacenter. When a request is received for a remote datacenter,
|
||||||
the request is forwarded to the correct leader. This design allows for lower latency
|
the request is forwarded to the correct leader. This design allows for lower latency
|
||||||
|
@ -134,7 +140,7 @@ transactions and higher availability without sacrificing consistency.
|
||||||
## Consistency Modes
|
## Consistency Modes
|
||||||
|
|
||||||
Although all writes to the replicated log go through Raft, reads are more
|
Although all writes to the replicated log go through Raft, reads are more
|
||||||
flexible. To support various tradeoffs that developers may want, Consul
|
flexible. To support various trade-offs that developers may want, Consul
|
||||||
supports 3 different consistency modes for reads.
|
supports 3 different consistency modes for reads.
|
||||||
|
|
||||||
The three read modes are:
|
The three read modes are:
|
||||||
|
@ -144,32 +150,33 @@ The three read modes are:
|
||||||
is partitioned from the remaining peers, a new leader may be elected
|
is partitioned from the remaining peers, a new leader may be elected
|
||||||
while the old leader is holding the lease. This means there are 2 leader
|
while the old leader is holding the lease. This means there are 2 leader
|
||||||
nodes. There is no risk of a split-brain since the old leader will be
|
nodes. There is no risk of a split-brain since the old leader will be
|
||||||
unable to commit new logs. However, if the old leader services any reads
|
unable to commit new logs. However, if the old leader services any reads,
|
||||||
the values are potentially stale. The default consistency mode relies only
|
the values are potentially stale. The default consistency mode relies only
|
||||||
on leader leasing, exposing clients to potentially stale values. We make
|
on leader leasing, exposing clients to potentially stale values. We make
|
||||||
this trade off because reads are fast, usually strongly consistent, and
|
this trade-off because reads are fast, usually strongly consistent, and
|
||||||
only stale in a hard to trigger situation. The time window of stale reads
|
only stale in a hard-to-trigger situation. The time window of stale reads
|
||||||
is also bounded, since the leader will step down due to the partition.
|
is also bounded since the leader will step down due to the partition.
|
||||||
|
|
||||||
* `consistent` - This mode is strongly consistent without caveats. It requires
|
* `consistent` - This mode is strongly consistent without caveats. It requires
|
||||||
that a leader verify with a quorum of peers that it is still leader. This
|
that a leader verify with a quorum of peers that it is still leader. This
|
||||||
introduces an additional round-trip to all server nodes. The trade off is
|
introduces an additional round-trip to all server nodes. The trade-off is
|
||||||
always consistent reads, but increased latency due to an extra round trip.
|
always consistent reads but increased latency due to the extra round trip.
|
||||||
|
|
||||||
* `stale` - This mode allows any server to service the read, regardless of if
|
* `stale` - This mode allows any server to service the read regardless of whether
|
||||||
it is the leader. This means reads can be arbitrarily stale, but are generally
|
it is the leader. This means reads can be arbitrarily stale but are generally
|
||||||
within 50 milliseconds of the leader. The trade off is very fast and scalable
|
within 50 milliseconds of the leader. The trade-off is very fast and scalable
|
||||||
reads but values will be stale. This mode allows reads without a leader, meaning
|
reads but with stale values. This mode allows reads without a leader, meaning
|
||||||
a cluster that is unavailable will still be able to respond.
|
a cluster that is unavailable will still be able to respond.
|
||||||
|
|
||||||
For more documentation about using these various modes, see the [HTTP API](/docs/agent/http.html).
|
For more documentation about using these various modes, see the
|
||||||
|
[HTTP API](/docs/agent/http.html).
|
||||||
|
|
||||||
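As a hedged sketch of how a client might select one of these modes, the Python helper below builds a key/value read URL and appends the `stale` or `consistent` query parameter used by the HTTP API. The helper name, the agent address `http://127.0.0.1:8500`, and the key `web/config` are assumptions for illustration; no request is actually sent.

```python
from urllib.parse import quote

READ_MODES = ("default", "consistent", "stale")


def kv_read_url(key: str, mode: str = "default", agent: str = "http://127.0.0.1:8500") -> str:
    """Build a KV read URL for the given consistency mode (hypothetical helper)."""
    if mode not in READ_MODES:
        raise ValueError(f"unknown consistency mode: {mode}")
    url = f"{agent}/v1/kv/{quote(key)}"
    # The default mode needs no parameter; the other two are bare query flags.
    return url if mode == "default" else f"{url}?{mode}"


if __name__ == "__main__":
    for mode in READ_MODES:
        print(kv_read_url("web/config", mode))
```

Keeping the mode as an explicit argument makes the trade-off visible at each call site: `stale` for fast, scalable reads and `consistent` when an extra round trip is acceptable.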
## <a name="deployment_table"></a>Deployment Table
|
## <a name="deployment_table"></a>Deployment Table
|
||||||
|
|
||||||
Below is a table that shows for the number of servers how large the
|
Below is a table that shows quorum size and failure tolerance for various
|
||||||
quorum is, as well as how many node failures can be tolerated. The
|
cluster sizes. The recommended deployment is either 3 or 5 servers. A single
|
||||||
recommended deployment is either 3 or 5 servers. A single server deployment
|
server deployment is _**highly**_ discouraged as data loss is inevitable in a
|
||||||
is _**highly**_ discouraged as data loss is inevitable in a failure scenario.
|
failure scenario.
|
||||||
|
|
||||||
<table class="table table-bordered table-striped">
|
<table class="table table-bordered table-striped">
|
||||||
<tr>
|
<tr>
|
||||||
|
|