website: document the high level architecture

11 years ago · 262c039425
5 changed files with 234 additions and 13 deletions
--- a/website/source/docs/internals/architecture.html.markdown
+++ b/website/source/docs/internals/architecture.html.markdown
@ -0,0 +1,117 @@
+---
+layout: "docs"
+page_title: "Consul Architecture"
+sidebar_current: "docs-internals-architecture"
+---
+
+# Consul Architecture
+
+Consul is a complex system that has many different moving parts. To help
+users and developers of Consul form a mental model of how it works, this
+page documents the system architecture.
+
+<div class="alert alert-block alert-warning">
+<strong>Advanced Topic!</strong> This page covers technical details of
+the internals of Consul. You don't need to know these details to effectively
+operate and use Consul. These details are documented here for those who wish
+to learn about them without having to go spelunking through the source code.
+</div>
+
+## Glossary
+
+Before describing the architecture, we provide a glossary of terms to help
+clarify what is being discussed:
+
+* Agent - An agent is the long running daemon on every member of the Consul cluster.
+It is started by running `consul agent`. The agent is able to run in either *client*,
+or *server* mode. Since all nodes must be running an agent, it is simpler to refer to
+the node as either being a client or server, but other are instances of the agent. All
+agents can run the DNS or HTTP interfaces, and are responsible for running checks and
+keeping services in sync.
+
+* Client - A client is an agent that forwards all RPC's to a server. The client is relatively
+stateless. The only background activity a client performs is taking part of LAN gossip pool.
+This has a minimal resource overhead and consumes only a small amount of network bandwidth.
+
+* Server - An agent that is server mode. When in server mode, there is an expanded set
+of responsibilities including participated in the Raft quorum, maintaining cluster state,
+responding to RPC queries, WAN gossip to other datacenters, forwarding of queries to leaders
+or remote datacenters.
+
+* Datacenter - A data center seems obvious, but there are subtle details such as multiple
+availability zones in EC2. We define a data center to be a networking environment that is
+private, low latency, and high badwidth. This excludes communication that would traverse
+the public internet.
+
+* Consensus - When used in our documentation we use consensus to mean agreement upon
+the elected leader as well as agreement on the ordering of transactions. Since these
+transactions are applied to a FSM, we implicitly include the consistency of a replicated
+state machine. Consensus is described in more detail on [Wikipedia](http://en.wikipedia.org/wiki/Consensus_(computer_science)),
+as well as our [implementation here](/docs/internals/consensus.html).
+
+* Gossip - Consul is built on top of [Serf](http://www.serfdom.io/), which provides a full
+[gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol) that is used for multiple purposes.
+Serf provides membership, failure detection, and event broadcast mechanisms. Our use of these
+is described more in the [gossip documentation](/docs/internals/gossip.html). It is enough to know
+gossip involves random node-to-node communication, primary over UDP.
+
+* LAN Gossip - This is used to mean that there is a gossip pool, containing nodes that
+are all located on the same local area network or datacenter.
+
+* WAN Gossip - This is used to mean that there is a gossip pool, containing servers that
+are primary located in different datacenters and must communicate over the internet or
+wide area network.
+
+* RPC - RPC is short for a Remote Procedure Call. This is a request / response mechanism
+allowing a client to make a request from a server.
+
+## 10,000 foot view
+
+From a 10,000 foot altitude the architecture of Consul looks like this:
+
+![Consul Architecture](/images/consul-arch.png)
+
+Lets break down this image and describe each piece. First of all we can see
+that there are two datacenters, one and two respectively. Consul has first
+class support for multiple data centers and expects this to be the common case.
+
+Within each datacenter we have a mixture of clients, and servers. It is expected
+that there be between three and five servers. This strikes a balance between
+availability in the case of failure and performance, as consensus gets progressively
+slower as more machines are added. However, there is no limit to the number of clients,
+and they can easily scale into the thousands or tens of thousands.
+
+All the nodes that are in a datacenter participate in a [gossip protocol](/docs/internals/gossip.html).
+This means is there is a Serf cluster that contains all the nodes for a given datacenter. This serves
+a few purposes: first, there is no need to configure clients with the addresses of servers,
+that discovery is done automatically using Serf. Second, the work of detecting node failures
+is not placed on the servers but is distributed. This makes the failure detection much more
+scalable than naive heartbeating schemes. Thirdly, it is used as a messaging layer to notify
+when important events such as leader election take place.
+
+The servers in each datacenter are all part of a single Raft peer set. This means that
+they work together to elect a leader, which has extra duties. The leader is responsible for
+processing all queries and transactions. Transactions must also be replicated to all peers
+as part of the [consensus protocol](/docs/internals/consensus.html). Because of this requirement,
+when a non-leader server receives an RPC request it forwards it to the cluster leader.
+
+The server nodes also operate as part of a WAN gossip. This pool is different from the LAN pool,
+as it is optimized for the higher latency of the internet, and is expected to only contain
+other Consul server nodes. The purpose of this pool is to allow datacenters to discover each
+other in a low touch manner. Bringing a new datacenter online is as easy as joining the existing
+WAN gossip. Because the servers are all operating in this pool, it also enables cross-dc requests.
+When a server receives a request for a different datacenter, it forwards it to a random server
+in the correct datacenter. That server may then forward to the local leader.
+
+This results in a very low coupling between datacenters, but because of a Serf failure detection,
+connection caching and multiplexing, cross-dc requests are relatively fast and reliable.
+
+## Getting in depth
+
+At this point we've covered the high level architecture of Consul, but there are much
+more details to each of the sub-systems. The [consensus protocol](/docs/internals/consensus.html) is
+documented in detail, as is the [gossip protocol](/docs/internals/gossip.html). The [documentation](/docs/internals/security.html)
+for the security model and protocols used for is also available.
+
+For other details, either consult the code, ask in IRC or reach out to the mailing list.
+
--- a/website/source/docs/internals/consensus.html.markdown
+++ b/website/source/docs/internals/consensus.html.markdown
@ -0,0 +1,103 @@
+---
+layout: "docs"
+page_title: "Consensus Protocol"
+sidebar_current: "docs-internals-consensus"
+---
+
+# Consensus Protocol
+
+Serf uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol)
+to broadcast messages to the cluster. This page documents the details of
+this internal protocol. The gossip protocol is based on
+["SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol"](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf),
+with a few minor adaptations, mostly to increase propagation speed
+and convergence rate.
+
+<div class="alert alert-block alert-warning">
+<strong>Advanced Topic!</strong> This page covers the technical details of
+the internals of Serf. You don't need to know these details to effectively
+operate and use Serf. These details are documented here for those who wish
+to learn about them without having to go spelunking through the source code.
+</div>
+
+## SWIM Protocol Overview
+
+Serf begins by joining an existing cluster or starting a new
+cluster. If starting a new cluster, additional nodes are expected to join
+it. New nodes in an existing cluster must be given the address of at
+least one existing member in order to join the cluster. The new member
+does a full state sync with the existing member over TCP and begins gossiping its
+existence to the cluster.
+
+Gossip is done over UDP with a configurable but fixed fanout and interval.
+This ensures that network usage is constant with regards to number of nodes.
+Complete state exchanges with a random node are done periodically over
+TCP, but much less often than gossip messages. This increases the likelihood
+that the membership list converges properly since the full state is exchanged
+and merged. The interval between full state exchanges is configurable or can
+be disabled entirely.
+
+Failure detection is done by periodic random probing using a configurable interval.
+If the node fails to ack within a reasonable time (typically some multiple
+of RTT), then an indirect probe is attempted. An indirect probe asks a
+configurable number of random nodes to probe the same node, in case there
+are network issues causing our own node to fail the probe. If both our
+probe and the indirect probes fail within a reasonable time, then the
+node is marked "suspicious" and this knowledge is gossiped to the cluster.
+A suspicious node is still considered a member of cluster. If the suspect member
+of the cluster does not dispute the suspicion within a configurable period of
+time, the node is finally considered dead, and this state is then gossiped
+to the cluster.
+
+This is a brief and incomplete description of the protocol. For a better idea,
+please read the
+[SWIM paper](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf)
+in its entirety, along with the Serf source code.
+
+## SWIM Modifications
+
+As mentioned earlier, the gossip protocol is based on SWIM but includes
+minor changes, mostly to increase propogation speed and convergence rates.
+
+The changes from SWIM are noted here:
+
+* Serf does a full state sync over TCP periodically. SWIM only propagates
+  changes over gossip. While both are eventually consistent, Serf is able to
+  more quickly reach convergence, as well as gracefully recover from network
+  partitions.
+
+* Serf has a dedicated gossip layer separate from the failure detection
+  protocol. SWIM only piggybacks gossip messages on top of probe/ack messages.
+  Serf uses piggybacking along with dedicated gossip messages. This
+  feature lets you have a higher gossip rate (for example once per 200ms)
+  and a slower failure detection rate (such as once per second), resulting
+  in overall faster convergence rates and data propagation speeds.
+
+* Serf keeps the state of dead nodes around for a set amount of time,
+  so that when full syncs are requested, the requester also receives information
+  about dead nodes. Because SWIM doesn't do full syncs, SWIM deletes dead node
+  state immediately upon learning that the node is dead. This change again helps
+  the cluster converge more quickly.
+
+## Serf-Specific Messages
+
+On top of the SWIM-based gossip layer, Serf sends some custom message types.
+
+Serf makes heavy use of [lamport clocks](http://en.wikipedia.org/wiki/Lamport_timestamps)
+to maintain some notion of message ordering despite being eventually
+consistent. Every message sent by Serf contains a lamport clock time.
+
+When a node gracefully leaves the cluster, Serf sends a _leave intent_ through
+the gossip layer. Because the underlying gossip layer makes no differentiation
+between a node leaving the cluster and a node being detected as failed, this
+allows the higher level Serf layer to detect a failure versus a graceful
+leave.
+
+When a node joins the cluster, Serf sends a _join intent_. The purpose
+of this intent is solely to attach a lamport clock time to a join so that
+it can be ordered properly in case a leave comes out of order.
+
+For custom events, Serf sends a _user event_ message. This message contains
+a lamport time, event name, and event payload. Because user events are sent
+along the gossip layer, which uses UDP, the payload and entire message framing
+must fit within a single UDP packet.
--- a/website/source/docs/internals/index.html.markdown
+++ b/website/source/docs/internals/index.html.markdown
@ -4,16 +4,13 @@ page_title: "Internals"
 sidebar_current: "docs-internals"
 ---

-# Serf Internals
+# Consul Internals

-This section goes over some of the internals of Serf, such as the gossip
-protocol, ordering of messages via lamport clocks, etc. This section
-also contains a useful [convergence simulator](/docs/internals/simulator.html)
-that can be used to see how fast a Serf cluster will converge under
-various conditions with specific configurations.
+This section goes over some of the internals of Consul, such as the architecture,
+consensus and gossip protocol, security model, etc.

 <div class="alert alert-block alert-info">
-Note that knowing about the internals of Serf is not necessary to
+Note that knowing about the internals of Consul is not necessary to
 successfully use it, but we document it here to be completely transparent
-about how the "magic" of Serf works.
+about how Consul works.
 </div>
--- a/website/source/images/consul-arch.png
+++ b/website/source/images/consul-arch.png
--- a/website/source/layouts/docs.erb
+++ b/website/source/layouts/docs.erb
@ -21,7 +21,15 @@

 				<li<%= sidebar_current("docs-internals") %>>
 				<a href="/docs/internals/index.html">Consul Internals</a>
-				<ul class="nav">
+                <ul class="nav">
+					<li<%= sidebar_current("docs-internals-architecture") %>>
+					<a href="/docs/internals/architecture.html">Architecture</a>
+                    </li>
+
+					<li<%= sidebar_current("docs-internals-consensus") %>>
+					<a href="/docs/internals/consensus.html">Consensus Protocol</a>
+                    </li>
+
 					<li<%= sidebar_current("docs-internals-gossip") %>>
 					<a href="/docs/internals/gossip.html">Gossip Protocol</a>
                    </li>
@ -29,10 +37,6 @@
 					<li<%= sidebar_current("docs-internals-security") %>>
 					<a href="/docs/internals/security.html">Security Model</a>
 					</li>
-
-					<li<%= sidebar_current("docs-internals-simulator") %>>
-					<a href="/docs/internals/simulator.html">Convergence Simulator</a>
-					</li>
 				</ul>
 				</li>