diff --git a/website/source/docs/internals/architecture.html.markdown b/website/source/docs/internals/architecture.html.markdown new file mode 100644 index 0000000000..4046508c27 --- /dev/null +++ b/website/source/docs/internals/architecture.html.markdown @@ -0,0 +1,117 @@ +--- +layout: "docs" +page_title: "Consul Architecture" +sidebar_current: "docs-internals-architecture" +--- + +# Consul Architecture + +Consul is a complex system that has many different moving parts. To help +users and developers of Consul form a mental model of how it works, this +page documents the system architecture. + +
+Advanced Topic! This page covers technical details of +the internals of Consul. You don't need to know these details to effectively +operate and use Consul. These details are documented here for those who wish +to learn about them without having to go spelunking through the source code. +
+ +## Glossary + +Before describing the architecture, we provide a glossary of terms to help +clarify what is being discussed: + +* Agent - An agent is the long-running daemon on every member of the Consul cluster. +It is started by running `consul agent`. The agent is able to run in either *client* +or *server* mode. Since all nodes must be running an agent, it is simpler to refer to +the node as either being a client or server, but they are all instances of the agent. All +agents can run the DNS or HTTP interfaces, and are responsible for running checks and +keeping services in sync. + +* Client - A client is an agent that forwards all RPCs to a server. The client is relatively +stateless. The only background activity a client performs is taking part in the LAN gossip pool. +This has a minimal resource overhead and consumes only a small amount of network bandwidth. + +* Server - An agent that is running in server mode. When in server mode, there is an expanded set +of responsibilities including participating in the Raft quorum, maintaining cluster state, +responding to RPC queries, exchanging WAN gossip with other datacenters, and forwarding queries to leaders +or remote datacenters. + +* Datacenter - A datacenter seems obvious, but there are subtle details such as multiple +availability zones in EC2. We define a datacenter to be a networking environment that is +private, low latency, and high bandwidth. This excludes communication that would traverse +the public internet. + +* Consensus - When used in our documentation we use consensus to mean agreement upon +the elected leader as well as agreement on the ordering of transactions. Since these +transactions are applied to an FSM, we implicitly include the consistency of a replicated +state machine. Consensus is described in more detail on [Wikipedia](http://en.wikipedia.org/wiki/Consensus_(computer_science)), +as well as in our [implementation here](/docs/internals/consensus.html). + +* Gossip - Consul is built on top of [Serf](http://www.serfdom.io/), which provides a full +[gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol) that is used for multiple purposes. +Serf provides membership, failure detection, and event broadcast mechanisms. Our use of these +is described more in the [gossip documentation](/docs/internals/gossip.html). It is enough to know that +gossip involves random node-to-node communication, primarily over UDP. + +* LAN Gossip - This refers to a gossip pool containing nodes that +are all located on the same local area network or datacenter. + +* WAN Gossip - This refers to a gossip pool containing servers that +are primarily located in different datacenters and must communicate over the internet or +wide area network. + +* RPC - RPC is short for Remote Procedure Call. This is a request/response mechanism +allowing a client to make a request of a server. + +## 10,000 foot view + +From a 10,000 foot altitude the architecture of Consul looks like this: + +![Consul Architecture](/images/consul-arch.png) + +Let's break down this image and describe each piece. First of all, we can see +that there are two datacenters, labeled one and two respectively. Consul has first-class +support for multiple datacenters and expects this to be the common case. + +Within each datacenter we have a mixture of clients and servers. It is expected +that there be between three and five servers.
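+
+To make the server-count recommendation concrete: Consul's servers rely on a Raft-style
+majority quorum, so a transaction commits once more than half of the servers acknowledge it.
+The following Go snippet is illustrative arithmetic only (it is not code from Consul itself)
+and shows how many server failures each cluster size can tolerate:
+
+```go
+package main
+
+import "fmt"
+
+// quorum returns the majority quorum size for a given number of servers.
+func quorum(servers int) int { return servers/2 + 1 }
+
+func main() {
+	for _, n := range []int{1, 3, 5, 7} {
+		// A cluster of n servers remains available while n-quorum(n) servers are down.
+		fmt.Printf("%d servers: quorum %d, tolerates %d failure(s)\n", n, quorum(n), n-quorum(n))
+	}
+}
+```
+
+Three servers tolerate one failure and five tolerate two.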
Staying in this range strikes a balance between +availability in the case of failure and performance, as consensus gets progressively +slower as more machines are added. However, there is no limit to the number of clients, +and they can easily scale into the thousands or tens of thousands. + +All the nodes that are in a datacenter participate in a [gossip protocol](/docs/internals/gossip.html). +This means there is a Serf cluster that contains all the nodes for a given datacenter. This serves +a few purposes: first, there is no need to configure clients with the addresses of servers; +discovery is done automatically using Serf. Second, the work of detecting node failures +is not placed on the servers but is distributed. This makes failure detection much more +scalable than naive heartbeating schemes. Third, it is used as a messaging layer to notify +when important events such as leader election take place. + +The servers in each datacenter are all part of a single Raft peer set. This means that +they work together to elect a leader, which has extra duties. The leader is responsible for +processing all queries and transactions. Transactions must also be replicated to all peers +as part of the [consensus protocol](/docs/internals/consensus.html). Because of this requirement, +when a non-leader server receives an RPC request it forwards it to the cluster leader. + +The server nodes also operate as part of a WAN gossip pool. This pool is different from the LAN pool, +as it is optimized for the higher latency of the internet, and is expected to contain only +other Consul server nodes. The purpose of this pool is to allow datacenters to discover each +other in a low-touch manner. Bringing a new datacenter online is as easy as joining the existing +WAN gossip pool. Because the servers are all operating in this pool, it also enables cross-datacenter requests. +When a server receives a request for a different datacenter, it forwards it to a random server +in the correct datacenter. That server may then forward the request to the local leader. + +This results in very low coupling between datacenters, but because of Serf's failure detection, +connection caching, and multiplexing, cross-datacenter requests are relatively fast and reliable. + +## Getting in depth + +At this point we've covered the high-level architecture of Consul, but there is much +more detail about each of the sub-systems. The [consensus protocol](/docs/internals/consensus.html) is +documented in detail, as is the [gossip protocol](/docs/internals/gossip.html). The [documentation](/docs/internals/security.html) +for the security model and the protocols used is also available. + +For other details, either consult the code, ask in IRC, or reach out to the mailing list. + diff --git a/website/source/docs/internals/consensus.html.markdown b/website/source/docs/internals/consensus.html.markdown new file mode 100644 index 0000000000..0477b369a7 --- /dev/null +++ b/website/source/docs/internals/consensus.html.markdown @@ -0,0 +1,103 @@ +--- +layout: "docs" +page_title: "Consensus Protocol" +sidebar_current: "docs-internals-consensus" +--- + +# Consensus Protocol + +Serf uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol) +to broadcast messages to the cluster. This page documents the details of +this internal protocol.
The gossip protocol is based on +["SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol"](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf), +with a few minor adaptations, mostly to increase propagation speed +and convergence rate. + +
+Advanced Topic! This page covers the technical details of +the internals of Serf. You don't need to know these details to effectively +operate and use Serf. These details are documented here for those who wish +to learn about them without having to go spelunking through the source code. +
+ +## SWIM Protocol Overview + +Serf begins by joining an existing cluster or starting a new +cluster. If starting a new cluster, additional nodes are expected to join +it. New nodes in an existing cluster must be given the address of at +least one existing member in order to join the cluster. The new member +does a full state sync with the existing member over TCP and begins gossiping its +existence to the cluster. + +Gossip is done over UDP with a configurable but fixed fanout and interval. +This ensures that network usage is constant with regard to the number of nodes. +Complete state exchanges with a random node are done periodically over +TCP, but much less often than gossip messages. This increases the likelihood +that the membership list converges properly since the full state is exchanged +and merged. The interval between full state exchanges is configurable or can +be disabled entirely. + +Failure detection is done by periodic random probing using a configurable interval. +If the node fails to ack within a reasonable time (typically some multiple +of RTT), then an indirect probe is attempted. An indirect probe asks a +configurable number of random nodes to probe the same node, in case there +are network issues causing our own node to fail the probe. If both our +probe and the indirect probes fail within a reasonable time, then the +node is marked "suspicious" and this knowledge is gossiped to the cluster. +A suspicious node is still considered a member of the cluster. If the suspect member +of the cluster does not dispute the suspicion within a configurable period of +time, the node is finally considered dead, and this state is then gossiped +to the cluster. + +This is a brief and incomplete description of the protocol. For a better idea, +please read the +[SWIM paper](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf) +in its entirety, along with the Serf source code. + +## SWIM Modifications + +As mentioned earlier, the gossip protocol is based on SWIM but includes +minor changes, mostly to increase propagation speed and convergence rates. + +The changes from SWIM are noted here: + +* Serf does a full state sync over TCP periodically. SWIM only propagates + changes over gossip. While both are eventually consistent, Serf is able to + more quickly reach convergence, as well as gracefully recover from network + partitions. + +* Serf has a dedicated gossip layer separate from the failure detection + protocol. SWIM only piggybacks gossip messages on top of probe/ack messages. + Serf uses piggybacking along with dedicated gossip messages. This + feature lets you have a higher gossip rate (for example, once per 200ms) + and a slower failure detection rate (such as once per second), resulting + in overall faster convergence rates and data propagation speeds. + +* Serf keeps the state of dead nodes around for a set amount of time, + so that when full syncs are requested, the requester also receives information + about dead nodes. Because SWIM doesn't do full syncs, SWIM deletes dead node + state immediately upon learning that the node is dead. This change again helps + the cluster converge more quickly. + +## Serf-Specific Messages + +On top of the SWIM-based gossip layer, Serf sends some custom message types. + +Serf makes heavy use of [Lamport clocks](http://en.wikipedia.org/wiki/Lamport_timestamps) +to maintain some notion of message ordering despite being eventually +consistent. Every message sent by Serf contains a Lamport clock time.
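+
+Since every message carries such a timestamp, a compact way to picture the mechanism is a
+counter that is incremented for each message a node sends and fast-forwarded whenever a
+larger time is observed on a received message. The Go sketch below is purely illustrative
+(a simplified, non-concurrent stand-in rather than Serf's actual implementation):
+
+```go
+package main
+
+import "fmt"
+
+// LamportClock is a minimal, illustrative logical clock (not safe for concurrent use).
+type LamportClock struct {
+	counter uint64
+}
+
+// Time returns the current clock value.
+func (l *LamportClock) Time() uint64 { return l.counter }
+
+// Increment advances the clock before sending a message and returns
+// the value to stamp on that outgoing message.
+func (l *LamportClock) Increment() uint64 {
+	l.counter++
+	return l.counter
+}
+
+// Witness fast-forwards the clock after observing a time on a received
+// message, keeping the local clock ahead of anything already seen.
+func (l *LamportClock) Witness(t uint64) {
+	if t >= l.counter {
+		l.counter = t + 1
+	}
+}
+
+func main() {
+	var c LamportClock
+	stamp := c.Increment()       // value attached to an outgoing message
+	c.Witness(42)                // a peer's message arrived stamped with 42
+	fmt.Println(stamp, c.Time()) // prints: 1 43
+}
+```
+
+Comparing stamps yields only a partial ordering (concurrent messages may carry the same
+time), which is the "some notion of message ordering" an eventually consistent system needs.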
+ +When a node gracefully leaves the cluster, Serf sends a _leave intent_ through +the gossip layer. The underlying gossip layer makes no differentiation +between a node leaving the cluster and a node being detected as failed, so the +leave intent lets the higher-level Serf layer distinguish a graceful +leave from a failure. + +When a node joins the cluster, Serf sends a _join intent_. The purpose +of this intent is solely to attach a Lamport clock time to a join so that +it can be ordered properly in case a leave comes out of order. + +For custom events, Serf sends a _user event_ message. This message contains +a Lamport time, event name, and event payload. Because user events are sent +along the gossip layer, which uses UDP, the payload and entire message framing +must fit within a single UDP packet. diff --git a/website/source/docs/internals/index.html.markdown b/website/source/docs/internals/index.html.markdown index 72e7caf794..01374e44ce 100644 --- a/website/source/docs/internals/index.html.markdown +++ b/website/source/docs/internals/index.html.markdown @@ -4,16 +4,13 @@ page_title: "Internals" sidebar_current: "docs-internals" --- -# Serf Internals +# Consul Internals -This section goes over some of the internals of Serf, such as the gossip -protocol, ordering of messages via lamport clocks, etc. This section -also contains a useful [convergence simulator](/docs/internals/simulator.html) -that can be used to see how fast a Serf cluster will converge under -various conditions with specific configurations. +This section goes over some of the internals of Consul, such as the architecture, +consensus and gossip protocol, security model, etc.
-Note that knowing about the internals of Serf is not necessary to +Note that knowing about the internals of Consul is not necessary to successfully use it, but we document it here to be completely transparent -about how the "magic" of Serf works. +about how Consul works.
diff --git a/website/source/images/consul-arch.png b/website/source/images/consul-arch.png new file mode 100644 index 0000000000..f4afa60008 Binary files /dev/null and b/website/source/images/consul-arch.png differ diff --git a/website/source/layouts/docs.erb b/website/source/layouts/docs.erb index 9bd2d89de2..f7b75faaf9 100644 --- a/website/source/layouts/docs.erb +++ b/website/source/layouts/docs.erb @@ -21,7 +21,15 @@ > Consul Internals -