mirror of https://github.com/hashicorp/consul
Armon Dadgar
11 years ago
5 changed files with 234 additions and 13 deletions
@@ -0,0 +1,117 @@
---
layout: "docs"
page_title: "Consul Architecture"
sidebar_current: "docs-internals-architecture"
---

# Consul Architecture

Consul is a complex system that has many different moving parts. To help
users and developers of Consul form a mental model of how it works, this
page documents the system architecture.

<div class="alert alert-block alert-warning">
<strong>Advanced Topic!</strong> This page covers technical details of
the internals of Consul. You don't need to know these details to effectively
operate and use Consul. These details are documented here for those who wish
to learn about them without having to go spelunking through the source code.
</div>

## Glossary

Before describing the architecture, we provide a glossary of terms to help
clarify what is being discussed:

* Agent - An agent is the long-running daemon on every member of the Consul cluster.
It is started by running `consul agent`. The agent is able to run in either *client*
or *server* mode. Since all nodes must be running an agent, it is simpler to refer to
a node as being either a client or a server, but all of them are instances of the agent. All
agents can run the DNS or HTTP interfaces, and are responsible for running checks and
keeping services in sync.

* Client - A client is an agent that forwards all RPCs to a server (a minimal sketch of
this forwarding appears after the glossary). The client is relatively stateless. The only
background activity a client performs is taking part in the LAN gossip pool.
This has a minimal resource overhead and consumes only a small amount of network bandwidth.

* Server - An agent that is in server mode. When in server mode, there is an expanded set
of responsibilities, including participating in the Raft quorum, maintaining cluster state,
responding to RPC queries, exchanging WAN gossip with other datacenters, and forwarding
queries to leaders or remote datacenters.

* Datacenter - A datacenter seems obvious, but there are subtle details such as multiple
availability zones in EC2. We define a datacenter to be a networking environment that is
private, low latency, and high bandwidth. This excludes communication that would traverse
the public internet.

* Consensus - When used in our documentation, we use consensus to mean agreement upon
the elected leader as well as agreement on the ordering of transactions. Since these
transactions are applied to an FSM, we implicitly include the consistency of a replicated
state machine. Consensus is described in more detail on [Wikipedia](http://en.wikipedia.org/wiki/Consensus_(computer_science)),
as well as in our [implementation here](/docs/internals/consensus.html).

* Gossip - Consul is built on top of [Serf](http://www.serfdom.io/), which provides a full
[gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol) that is used for multiple purposes.
Serf provides membership, failure detection, and event broadcast mechanisms. Our use of these
is described more in the [gossip documentation](/docs/internals/gossip.html). It is enough to know
that gossip involves random node-to-node communication, primarily over UDP.

* LAN Gossip - Refers to the LAN gossip pool, which contains nodes that
are all located on the same local area network or datacenter.

* WAN Gossip - Refers to the WAN gossip pool, which contains only servers. These servers
are primarily located in different datacenters and must communicate over the internet or
a wide area network.

* RPC - Remote Procedure Call. This is a request / response mechanism
allowing a client to make a request of a server.
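
To make the client/server distinction concrete, here is a minimal sketch of how an agent
in client mode simply hands RPCs off to a server while an agent in server mode answers them
itself. This is not taken from the Consul source; all type and field names are hypothetical.

```go
package glossary

import "errors"

// Mode distinguishes the two ways the Consul agent can run.
type Mode int

const (
	Client Mode = iota // stateless, forwards RPCs, joins LAN gossip only
	Server             // Raft participant, answers RPCs, joins LAN and WAN gossip
)

// Request and Response stand in for any Consul RPC payload.
type Request struct{ Query string }
type Response struct{ Result string }

// Agent is a hypothetical view of the long-running daemon on each node.
type Agent struct {
	Mode    Mode
	Servers []*Agent // known servers, discovered via LAN gossip
}

// RPC shows the request/response flow: clients do no local work,
// they just pick a known server and hand the request off.
func (a *Agent) RPC(req Request) (Response, error) {
	if a.Mode == Client {
		if len(a.Servers) == 0 {
			return Response{}, errors.New("no known servers")
		}
		return a.Servers[0].RPC(req) // forward to a server
	}
	// Server mode: answer locally (forwarding to the leader is covered below).
	return Response{Result: "handled " + req.Query}, nil
}
```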

## 10,000 foot view

From a 10,000 foot altitude the architecture of Consul looks like this:

![Consul Architecture](/images/consul-arch.png)

Let's break down this image and describe each piece. First of all, we can see
that there are two datacenters, labeled one and two. Consul has first-class
support for multiple datacenters and expects this to be the common case.

Within each datacenter we have a mixture of clients and servers. It is expected
that there be between three and five servers. This strikes a balance between
availability in the case of failure and performance, as consensus gets progressively
slower as more machines are added. However, there is no limit to the number of clients,
and they can easily scale into the thousands or tens of thousands.
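
The trade-off follows from standard majority-quorum arithmetic. The sketch below is
back-of-the-envelope reasoning for Raft-style consensus in general, not code from Consul:

```go
package sizing

// quorumSize returns the majority needed for a Raft-style cluster of n servers.
func quorumSize(n int) int { return n/2 + 1 }

// faultTolerance returns how many servers can fail while a majority survives.
func faultTolerance(n int) int { return n - quorumSize(n) }

// With 3 servers: quorum is 2, tolerating 1 failure.
// With 5 servers: quorum is 3, tolerating 2 failures.
// With 7 servers: quorum is 4, tolerating 3 failures, but every commit now
// waits on more replicas, which is why server counts are kept small.
```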

All the nodes that are in a datacenter participate in a [gossip protocol](/docs/internals/gossip.html).
This means there is a Serf cluster that contains all the nodes for a given datacenter. This serves
a few purposes: first, there is no need to configure clients with the addresses of servers;
discovery is done automatically using Serf. Second, the work of detecting node failures
is not placed on the servers but is distributed. This makes failure detection much more
scalable than naive heartbeating schemes. Third, it is used as a messaging layer to notify
when important events such as leader election take place.

The servers in each datacenter are all part of a single Raft peer set. This means that
they work together to elect a leader, which has extra duties. The leader is responsible for
processing all queries and transactions. Transactions must also be replicated to all peers
as part of the [consensus protocol](/docs/internals/consensus.html). Because of this requirement,
when a non-leader server receives an RPC request, it forwards it to the cluster leader.

The server nodes also operate as part of a WAN gossip pool. This pool is different from the LAN pool,
as it is optimized for the higher latency of the internet, and is expected to contain only
other Consul server nodes. The purpose of this pool is to allow datacenters to discover each
other in a low-touch manner. Bringing a new datacenter online is as easy as joining the existing
WAN gossip pool. Because the servers are all operating in this pool, it also enables cross-datacenter
requests. When a server receives a request for a different datacenter, it forwards it to a random server
in the correct datacenter. That server may then forward to the local leader.
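
Putting the two previous paragraphs together, the routing decision a server makes for an
incoming request can be sketched roughly as follows. This is a simplification with hypothetical
names; the real logic lives in the Consul source:

```go
package routing

import "math/rand"

type Request struct {
	Datacenter string
	Query      string
}

type Server struct {
	Datacenter string
	IsLeader   bool
	Leader     *Server              // current local leader, learned via Raft
	WANServers map[string][]*Server // servers per datacenter, learned via WAN gossip
}

// Handle sketches how a request is routed: remote datacenters get a random
// server discovered over WAN gossip, local requests go to the Raft leader.
func (s *Server) Handle(req Request) string {
	if req.Datacenter != s.Datacenter {
		remotes := s.WANServers[req.Datacenter]
		if len(remotes) == 0 {
			return "no known servers in " + req.Datacenter
		}
		target := remotes[rand.Intn(len(remotes))] // random server in that DC
		return target.Handle(req)
	}
	if !s.IsLeader {
		return s.Leader.Handle(req) // non-leaders forward to the leader
	}
	return "applied: " + req.Query // leader processes queries and transactions
}
```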

This results in very low coupling between datacenters, but thanks to Serf's failure detection,
connection caching, and multiplexing, cross-datacenter requests are relatively fast and reliable.

## Getting in depth

At this point we've covered the high level architecture of Consul, but there are many
more details for each of the subsystems. The [consensus protocol](/docs/internals/consensus.html) is
documented in detail, as is the [gossip protocol](/docs/internals/gossip.html). The [documentation](/docs/internals/security.html)
for the security model and the protocols used is also available.

For other details, either consult the code, ask in IRC, or reach out to the mailing list.

@@ -0,0 +1,103 @@
---
layout: "docs"
page_title: "Consensus Protocol"
sidebar_current: "docs-internals-consensus"
---

# Consensus Protocol

Serf uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol)
to broadcast messages to the cluster. This page documents the details of
this internal protocol. The gossip protocol is based on
["SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol"](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf),
with a few minor adaptations, mostly to increase propagation speed
and convergence rate.

<div class="alert alert-block alert-warning">
<strong>Advanced Topic!</strong> This page covers the technical details of
the internals of Serf. You don't need to know these details to effectively
operate and use Serf. These details are documented here for those who wish
to learn about them without having to go spelunking through the source code.
</div>

## SWIM Protocol Overview

Serf begins by joining an existing cluster or starting a new
cluster. If starting a new cluster, additional nodes are expected to join
it. New nodes in an existing cluster must be given the address of at
least one existing member in order to join the cluster. The new member
does a full state sync with the existing member over TCP and begins gossiping its
existence to the cluster.

Gossip is done over UDP with a configurable but fixed fanout and interval.
This ensures that network usage is constant with regard to the number of nodes.
Complete state exchanges with a random node are done periodically over
TCP, but much less often than gossip messages. This increases the likelihood
that the membership list converges properly since the full state is exchanged
and merged. The interval between full state exchanges is configurable or can
be disabled entirely.
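
As a rough illustration of why per-node network load stays constant, here is a sketch of a
fixed-fanout gossip loop. The fanout and interval values are assumptions for illustration,
not Serf's actual defaults:

```go
package gossip

import (
	"math/rand"
	"time"
)

// broadcast sends a UDP-sized message to at most fanout random peers,
// regardless of how large the cluster grows.
func broadcast(peers []string, msg []byte, fanout int, send func(peer string, msg []byte)) {
	perm := rand.Perm(len(peers))
	if fanout > len(peers) {
		fanout = len(peers)
	}
	for _, i := range perm[:fanout] {
		send(peers[i], msg)
	}
}

// gossipLoop broadcasts queued messages on a fixed interval; per-node traffic
// is bounded by fanout * message size / interval, independent of cluster size.
func gossipLoop(peers func() []string, queue func() [][]byte, send func(string, []byte)) {
	const (
		fanout   = 3                      // assumed value for illustration
		interval = 200 * time.Millisecond // assumed value for illustration
	)
	for range time.Tick(interval) {
		for _, msg := range queue() {
			broadcast(peers(), msg, fanout, send)
		}
	}
}
```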

Failure detection is done by periodic random probing using a configurable interval.
If the node fails to ack within a reasonable time (typically some multiple
of RTT), then an indirect probe is attempted. An indirect probe asks a
configurable number of random nodes to probe the same node, in case there
are network issues causing our own node to fail the probe. If both our
probe and the indirect probes fail within a reasonable time, then the
node is marked "suspicious" and this knowledge is gossiped to the cluster.
A suspicious node is still considered a member of the cluster. If the suspect member
of the cluster does not dispute the suspicion within a configurable period of
time, the node is finally considered dead, and this state is then gossiped
to the cluster.
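
The progression from a missed ack to a dead node can be sketched as a small state machine.
The names and timeouts below are placeholders, not the real implementation or configuration:

```go
package failuredetect

import "time"

type State int

const (
	Alive State = iota
	Suspect
	Dead
)

// Node tracks what the local member currently believes about a peer.
type Node struct {
	State       State
	SuspectedAt time.Time
}

// onProbeResult applies the logic described above: a direct probe failure
// triggers indirect probes; only when those also fail is the node suspected.
func (n *Node) onProbeResult(directAck bool, indirectAcks int) {
	if directAck || indirectAcks > 0 {
		n.State = Alive // any ack clears the suspicion
		return
	}
	if n.State == Alive {
		n.State = Suspect // gossip "suspect" to the cluster here
		n.SuspectedAt = time.Now()
	}
}

// onTick declares the node dead if it never refuted the suspicion in time.
func (n *Node) onTick(suspicionTimeout time.Duration) {
	if n.State == Suspect && time.Since(n.SuspectedAt) > suspicionTimeout {
		n.State = Dead // gossip "dead" to the cluster here
	}
}
```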

This is a brief and incomplete description of the protocol. For a better idea,
please read the
[SWIM paper](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf)
in its entirety, along with the Serf source code.

## SWIM Modifications

As mentioned earlier, the gossip protocol is based on SWIM but includes
minor changes, mostly to increase propagation speed and convergence rates.

The changes from SWIM are noted here:

* Serf does a full state sync over TCP periodically. SWIM only propagates
changes over gossip. While both are eventually consistent, Serf is able to
more quickly reach convergence, as well as gracefully recover from network
partitions.

* Serf has a dedicated gossip layer separate from the failure detection
protocol. SWIM only piggybacks gossip messages on top of probe/ack messages.
Serf uses piggybacking along with dedicated gossip messages. This
feature lets you have a higher gossip rate (for example, once per 200ms)
and a slower failure detection rate (such as once per second), resulting
in overall faster convergence rates and data propagation speeds (see the
sketch after this list).

* Serf keeps the state of dead nodes around for a set amount of time,
so that when full syncs are requested, the requester also receives information
about dead nodes. Because SWIM doesn't do full syncs, SWIM deletes dead node
state immediately upon learning that the node is dead. This change again helps
the cluster converge more quickly.
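
To illustrate the second point, decoupling the two rates is just a matter of running the
gossip broadcast and the failure-detection probe on independent timers. The sketch below
uses the example rates from the bullet above; the function names are hypothetical:

```go
package rates

import "time"

// run drives gossip and probing on independent timers, so data can spread
// quickly while the more expensive probe/ack cycle stays infrequent.
func run(gossipOnce, probeOnce func(), done <-chan struct{}) {
	gossipTick := time.NewTicker(200 * time.Millisecond) // fast dissemination
	probeTick := time.NewTicker(1 * time.Second)         // slower failure detection
	defer gossipTick.Stop()
	defer probeTick.Stop()

	for {
		select {
		case <-gossipTick.C:
			gossipOnce()
		case <-probeTick.C:
			probeOnce()
		case <-done:
			return
		}
	}
}
```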

## Serf-Specific Messages

On top of the SWIM-based gossip layer, Serf sends some custom message types.

Serf makes heavy use of [Lamport clocks](http://en.wikipedia.org/wiki/Lamport_timestamps)
to maintain some notion of message ordering despite being eventually
consistent. Every message sent by Serf contains a Lamport clock time.
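
A Lamport clock itself is tiny. The following is a minimal sketch of the two operations such
a clock needs (stamping outgoing messages and witnessing times seen on incoming ones), not
the actual Serf implementation:

```go
package clock

import "sync/atomic"

// LamportClock is a monotonically increasing logical counter.
type LamportClock struct{ counter uint64 }

// Time returns the current logical time.
func (c *LamportClock) Time() uint64 { return atomic.LoadUint64(&c.counter) }

// Increment stamps a locally generated message with a fresh time.
func (c *LamportClock) Increment() uint64 { return atomic.AddUint64(&c.counter, 1) }

// Witness advances the local clock past a time observed on a received
// message, so later local messages are ordered after it.
func (c *LamportClock) Witness(seen uint64) {
	for {
		cur := atomic.LoadUint64(&c.counter)
		if seen < cur {
			return
		}
		if atomic.CompareAndSwapUint64(&c.counter, cur, seen+1) {
			return
		}
	}
}
```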

When a node gracefully leaves the cluster, Serf sends a _leave intent_ through
the gossip layer. Because the underlying gossip layer makes no differentiation
between a node leaving the cluster and a node being detected as failed, this intent
allows the higher-level Serf layer to distinguish a graceful leave from a failure.

When a node joins the cluster, Serf sends a _join intent_. The purpose
of this intent is solely to attach a Lamport clock time to a join so that
it can be ordered properly in case a leave comes out of order.
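
A sketch of how the clock times on intents resolve out-of-order delivery: an intent only
takes effect if it is newer than the last one seen for that node. The types and field
names here are hypothetical, not Serf's actual message handling:

```go
package intents

// Intent is a join or leave message carrying a Lamport time.
type Intent struct {
	Node  string
	LTime uint64
	Leave bool // false means join
}

// latest remembers the newest intent time seen per node.
var latest = map[string]uint64{}

// apply returns true when the intent should take effect; a stale leave that
// arrives after a newer join (or vice versa) is simply ignored.
func apply(in Intent) bool {
	if in.LTime <= latest[in.Node] {
		return false // out of date, a newer intent was already processed
	}
	latest[in.Node] = in.LTime
	return true
}
```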

For custom events, Serf sends a _user event_ message. This message contains
a Lamport time, event name, and event payload. Because user events are sent
along the gossip layer, which uses UDP, the payload and entire message framing
must fit within a single UDP packet.
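
Because the whole user event must fit in one UDP packet, a sender would typically reject
oversized payloads up front. The sketch below assumes a made-up size budget purely for
illustration; the real limit depends on the implementation and the message framing:

```go
package events

import "fmt"

// maxPacket is an assumed per-message budget, not the real limit.
const maxPacket = 1024

type UserEvent struct {
	LTime   uint64
	Name    string
	Payload []byte
}

// sendUserEvent refuses events whose name and payload cannot fit alongside
// the message framing in a single UDP packet.
func sendUserEvent(e UserEvent, send func([]byte) error) error {
	if len(e.Name)+len(e.Payload) > maxPacket {
		return fmt.Errorf("user event %q exceeds the single-packet limit", e.Name)
	}
	// Encoding and framing omitted; in practice the event is gossiped like
	// any other message.
	return send(append([]byte(e.Name), e.Payload...))
}
```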

(New binary image: 77 KiB)