website: document the high level architecture

11 years ago · 262c039425
parent 779e6df7b8
commit 262c039425
5 changed files with 234 additions and 13 deletions
--- a/website/source/docs/internals/architecture.html.markdown
+++ b/website/source/docs/internals/architecture.html.markdown
@ -0,0 +1,117 @@
 ---
 layout: "docs"
 page_title: "Consul Architecture"
 sidebar_current: "docs-internals-architecture"
 ---
 # Consul Architecture
 Consul is a complex system that has many different moving parts. To help
 users and developers of Consul form a mental model of how it works, this
 page documents the system architecture.
 <div class="alert alert-block alert-warning">
 <strong>Advanced Topic!</strong> This page covers technical details of
 the internals of Consul. You don't need to know these details to effectively
 operate and use Consul. These details are documented here for those who wish
 to learn about them without having to go spelunking through the source code.
 </div>
 ## Glossary
 Before describing the architecture, we provide a glossary of terms to help
 clarify what is being discussed:
 * Agent - An agent is the long running daemon on every member of the Consul cluster.
 It is started by running `consul agent`. The agent is able to run in either *client*,
 or *server* mode. Since all nodes must be running an agent, it is simpler to refer to
 the node as either being a client or server, but other are instances of the agent. All
 agents can run the DNS or HTTP interfaces, and are responsible for running checks and
 keeping services in sync.
 * Client - A client is an agent that forwards all RPC's to a server. The client is relatively
 stateless. The only background activity a client performs is taking part of LAN gossip pool.
 This has a minimal resource overhead and consumes only a small amount of network bandwidth.
 * Server - An agent that is server mode. When in server mode, there is an expanded set
 of responsibilities including participated in the Raft quorum, maintaining cluster state,
 responding to RPC queries, WAN gossip to other datacenters, forwarding of queries to leaders
 or remote datacenters.
 * Datacenter - A data center seems obvious, but there are subtle details such as multiple
 availability zones in EC2. We define a data center to be a networking environment that is
 private, low latency, and high badwidth. This excludes communication that would traverse
 the public internet.
 * Consensus - When used in our documentation we use consensus to mean agreement upon
 the elected leader as well as agreement on the ordering of transactions. Since these
 transactions are applied to a FSM, we implicitly include the consistency of a replicated
 state machine. Consensus is described in more detail on [Wikipedia](http://en.wikipedia.org/wiki/Consensus_(computer_science)),
 as well as our [implementation here](/docs/internals/consensus.html).
 * Gossip - Consul is built on top of [Serf](http://www.serfdom.io/), which provides a full
 [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol) that is used for multiple purposes.
 Serf provides membership, failure detection, and event broadcast mechanisms. Our use of these
 is described more in the [gossip documentation](/docs/internals/gossip.html). It is enough to know
 gossip involves random node-to-node communication, primary over UDP.
 * LAN Gossip - This is used to mean that there is a gossip pool, containing nodes that
 are all located on the same local area network or datacenter.
 * WAN Gossip - This is used to mean that there is a gossip pool, containing servers that
 are primary located in different datacenters and must communicate over the internet or
 wide area network.
 * RPC - RPC is short for a Remote Procedure Call. This is a request / response mechanism
 allowing a client to make a request from a server.
 ## 10,000 foot view
 From a 10,000 foot altitude the architecture of Consul looks like this:
 ![Consul Architecture](/images/consul-arch.png)
 Lets break down this image and describe each piece. First of all we can see
 that there are two datacenters, one and two respectively. Consul has first
 class support for multiple data centers and expects this to be the common case.
 Within each datacenter we have a mixture of clients, and servers. It is expected
 that there be between three and five servers. This strikes a balance between
 availability in the case of failure and performance, as consensus gets progressively
 slower as more machines are added. However, there is no limit to the number of clients,
 and they can easily scale into the thousands or tens of thousands.
 All the nodes that are in a datacenter participate in a [gossip protocol](/docs/internals/gossip.html).
 This means is there is a Serf cluster that contains all the nodes for a given datacenter. This serves
 a few purposes: first, there is no need to configure clients with the addresses of servers,
 that discovery is done automatically using Serf. Second, the work of detecting node failures
 is not placed on the servers but is distributed. This makes the failure detection much more
 scalable than naive heartbeating schemes. Thirdly, it is used as a messaging layer to notify
 when important events such as leader election take place.
 The servers in each datacenter are all part of a single Raft peer set. This means that
 they work together to elect a leader, which has extra duties. The leader is responsible for
 processing all queries and transactions. Transactions must also be replicated to all peers
 as part of the [consensus protocol](/docs/internals/consensus.html). Because of this requirement,
 when a non-leader server receives an RPC request it forwards it to the cluster leader.
 The server nodes also operate as part of a WAN gossip. This pool is different from the LAN pool,
 as it is optimized for the higher latency of the internet, and is expected to only contain
 other Consul server nodes. The purpose of this pool is to allow datacenters to discover each
 other in a low touch manner. Bringing a new datacenter online is as easy as joining the existing
 WAN gossip. Because the servers are all operating in this pool, it also enables cross-dc requests.
 When a server receives a request for a different datacenter, it forwards it to a random server
 in the correct datacenter. That server may then forward to the local leader.
 This results in a very low coupling between datacenters, but because of a Serf failure detection,
 connection caching and multiplexing, cross-dc requests are relatively fast and reliable.
 ## Getting in depth
 At this point we've covered the high level architecture of Consul, but there are much
 more details to each of the sub-systems. The [consensus protocol](/docs/internals/consensus.html) is
 documented in detail, as is the [gossip protocol](/docs/internals/gossip.html). The [documentation](/docs/internals/security.html)
 for the security model and protocols used for is also available.
 For other details, either consult the code, ask in IRC or reach out to the mailing list.
--- a/website/source/docs/internals/consensus.html.markdown
+++ b/website/source/docs/internals/consensus.html.markdown
@ -0,0 +1,103 @@
 ---
 layout: "docs"
 page_title: "Consensus Protocol"
 sidebar_current: "docs-internals-consensus"
 ---
 # Consensus Protocol
 Serf uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol)
 to broadcast messages to the cluster. This page documents the details of
 this internal protocol. The gossip protocol is based on
 ["SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol"](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf),
 with a few minor adaptations, mostly to increase propagation speed
 and convergence rate.
 <div class="alert alert-block alert-warning">
 <strong>Advanced Topic!</strong> This page covers the technical details of
 the internals of Serf. You don't need to know these details to effectively
 operate and use Serf. These details are documented here for those who wish
 to learn about them without having to go spelunking through the source code.
 </div>
 ## SWIM Protocol Overview
 Serf begins by joining an existing cluster or starting a new
 cluster. If starting a new cluster, additional nodes are expected to join
 it. New nodes in an existing cluster must be given the address of at
 least one existing member in order to join the cluster. The new member
 does a full state sync with the existing member over TCP and begins gossiping its
 existence to the cluster.
 Gossip is done over UDP with a configurable but fixed fanout and interval.
 This ensures that network usage is constant with regards to number of nodes.
 Complete state exchanges with a random node are done periodically over
 TCP, but much less often than gossip messages. This increases the likelihood
 that the membership list converges properly since the full state is exchanged
 and merged. The interval between full state exchanges is configurable or can
 be disabled entirely.
 Failure detection is done by periodic random probing using a configurable interval.
 If the node fails to ack within a reasonable time (typically some multiple
 of RTT), then an indirect probe is attempted. An indirect probe asks a
 configurable number of random nodes to probe the same node, in case there
 are network issues causing our own node to fail the probe. If both our
 probe and the indirect probes fail within a reasonable time, then the
 node is marked "suspicious" and this knowledge is gossiped to the cluster.
 A suspicious node is still considered a member of cluster. If the suspect member
 of the cluster does not dispute the suspicion within a configurable period of
 time, the node is finally considered dead, and this state is then gossiped
 to the cluster.
 This is a brief and incomplete description of the protocol. For a better idea,
 please read the
 [SWIM paper](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf)
 in its entirety, along with the Serf source code.
 ## SWIM Modifications
 As mentioned earlier, the gossip protocol is based on SWIM but includes
 minor changes, mostly to increase propogation speed and convergence rates.
 The changes from SWIM are noted here:
 * Serf does a full state sync over TCP periodically. SWIM only propagates
  changes over gossip. While both are eventually consistent, Serf is able to
  more quickly reach convergence, as well as gracefully recover from network
  partitions.
 * Serf has a dedicated gossip layer separate from the failure detection
  protocol. SWIM only piggybacks gossip messages on top of probe/ack messages.
  Serf uses piggybacking along with dedicated gossip messages. This
  feature lets you have a higher gossip rate (for example once per 200ms)
  and a slower failure detection rate (such as once per second), resulting
  in overall faster convergence rates and data propagation speeds.
 * Serf keeps the state of dead nodes around for a set amount of time,
  so that when full syncs are requested, the requester also receives information
  about dead nodes. Because SWIM doesn't do full syncs, SWIM deletes dead node
  state immediately upon learning that the node is dead. This change again helps
  the cluster converge more quickly.
 ## Serf-Specific Messages
 On top of the SWIM-based gossip layer, Serf sends some custom message types.
 Serf makes heavy use of [lamport clocks](http://en.wikipedia.org/wiki/Lamport_timestamps)
 to maintain some notion of message ordering despite being eventually
 consistent. Every message sent by Serf contains a lamport clock time.
 When a node gracefully leaves the cluster, Serf sends a _leave intent_ through
 the gossip layer. Because the underlying gossip layer makes no differentiation
 between a node leaving the cluster and a node being detected as failed, this
 allows the higher level Serf layer to detect a failure versus a graceful
 leave.
 When a node joins the cluster, Serf sends a _join intent_. The purpose
 of this intent is solely to attach a lamport clock time to a join so that
 it can be ordered properly in case a leave comes out of order.
 For custom events, Serf sends a _user event_ message. This message contains
 a lamport time, event name, and event payload. Because user events are sent
 along the gossip layer, which uses UDP, the payload and entire message framing
 must fit within a single UDP packet.
--- a/website/source/docs/internals/index.html.markdown
+++ b/website/source/docs/internals/index.html.markdown
@ -4,16 +4,13 @@ page_title: "Internals"
 sidebar_current: "docs-internals"
 ---
-# Serf Internals
+# Consul Internals
-This section goes over some of the internals of Serf, such as the gossip
+This section goes over some of the internals of Consul, such as the architecture,
-protocol, ordering of messages via lamport clocks, etc. This section
+consensus and gossip protocol, security model, etc.
 also contains a useful [convergence simulator](/docs/internals/simulator.html)
 that can be used to see how fast a Serf cluster will converge under
 various conditions with specific configurations.
 <div class="alert alert-block alert-info">
-Note that knowing about the internals of Serf is not necessary to
+Note that knowing about the internals of Consul is not necessary to
 successfully use it, but we document it here to be completely transparent
-about how the "magic" of Serf works.
+about how Consul works.
 </div>
--- a/website/source/images/consul-arch.png
+++ b/website/source/images/consul-arch.png
--- a/website/source/layouts/docs.erb
+++ b/website/source/layouts/docs.erb
@ -22,6 +22,14 @@
 				<li<%= sidebar_current("docs-internals") %>>
 				<a href="/docs/internals/index.html">Consul Internals</a>
                <ul class="nav">
 					<li<%= sidebar_current("docs-internals-architecture") %>>
 					<a href="/docs/internals/architecture.html">Architecture</a>
                    </li>
 					<li<%= sidebar_current("docs-internals-consensus") %>>
 					<a href="/docs/internals/consensus.html">Consensus Protocol</a>
                    </li>
 					<li<%= sidebar_current("docs-internals-gossip") %>>
 					<a href="/docs/internals/gossip.html">Gossip Protocol</a>
                    </li>
@ -29,10 +37,6 @@
 					<li<%= sidebar_current("docs-internals-security") %>>
 					<a href="/docs/internals/security.html">Security Model</a>
 					</li>
 					<li<%= sidebar_current("docs-internals-simulator") %>>
 					<a href="/docs/internals/simulator.html">Convergence Simulator</a>
 					</li>
 				</ul>
 				</li>