# Kubernetes and Cluster Federation Control Plane Resilience

## Long Term Design and Current Status

### by Quinton Hoole, Mike Danese and Justin Santa-Barbara

### December 14, 2015

## Summary

Some amount of confusion exists around how we currently ensure, and in future
want to ensure, resilience of the Kubernetes (and by implication Kubernetes
Cluster Federation) control plane. This document is an attempt to capture that
definitively. It covers areas including self-healing, high availability,
bootstrapping and recovery. Most of the information in this document already
exists in the form of GitHub comments, PRs/proposals, scattered documents, and
corridor conversations, so this document is primarily a consolidation and
clarification of existing ideas.

## Terms

* **Self-healing:** automatically restarting or replacing failed
  processes and machines without human intervention
* **High availability:** continuing to be available and work correctly
  even if some components are down or uncontactable. This typically
  involves multiple replicas of critical services, and a reliable way
  to find available replicas. Note that it's possible (but not
  desirable) to have high availability properties (e.g. multiple
  replicas) in the absence of self-healing properties (e.g. if a
  replica fails, nothing replaces it). Fairly obviously, given enough
  time, such systems typically become unavailable (after enough
  replicas have failed).
* **Bootstrapping:** creating an empty cluster from nothing
* **Recovery:** recreating a non-empty cluster after perhaps
  catastrophic failure/unavailability/data corruption

## Overall Goals

1. **Resilience to single failures:** Kubernetes clusters constrained
   to single availability zones should be resilient to individual
   machine and process failures by being both self-healing and highly
   available (within the context of such individual failures).
1. **Ubiquitous resilience by default:** The default cluster creation
   scripts for (at least) GCE, AWS and basic bare metal should adhere
   to the above (self-healing and high availability) by default (with
   options available to disable these features to reduce control plane
   resource requirements if so required). It is hoped that other
   cloud providers will also follow the above guidelines, but the
   above 3 are the primary canonical use cases.
1. **Resilience to some correlated failures:** Kubernetes clusters
   which span multiple availability zones in a region should by
   default be resilient to complete failure of one entire availability
   zone (by similarly providing self-healing and high availability in
   the default cluster creation scripts as above).
1. **Default implementation shared across cloud providers:** The
   differences between the default implementations of the above for
   GCE, AWS and basic bare metal should be minimized. This implies
   using shared libraries across these providers in the default
   scripts in preference to highly customized implementations per
   cloud provider. This is not to say that highly differentiated,
   customized per-cloud cluster creation processes (e.g. for GKE on
   GCE, or some hosted Kubernetes provider on AWS) are discouraged.
   But those fall squarely outside the basic cross-platform OSS
   Kubernetes distro.
1. **Self-hosting:** Where possible, Kubernetes' existing mechanisms
   for achieving system resilience (replication controllers, health
   checking, service load balancing etc.) should be used in preference
   to building a separate set of mechanisms to achieve the same thing.
   This implies that self-hosting (the Kubernetes control plane on
   Kubernetes) is strongly preferred, with the caveat below. A rough
   sketch of what self-hosting a control plane component might look
   like appears after this list.
1. **Recovery from catastrophic failure:** The ability to quickly and
   reliably recover a cluster from catastrophic failure is critical,
   and should not be compromised by the above goal to self-host
   (i.e. it goes without saying that the cluster should be quickly and
   reliably recoverable, even if the cluster control plane is
   broken). This implies that such catastrophic failure scenarios
   should be carefully thought out, and be the subject of regular
   continuous integration testing and disaster recovery exercises.
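To make the self-hosting goal a little more concrete, below is a minimal sketch
of running one control plane component (the scheduler) under Kubernetes' own
resilience mechanisms. It assumes an already-working API server and kubelets on
the master nodes; the image, tag, API server endpoint and flag values are
illustrative assumptions, not the actual "kube-up" implementation.

```sh
# Hedged sketch only: a self-hosted, health-checked, replicated scheduler,
# managed by a ReplicationController. All names, images and endpoints below
# are illustrative assumptions.
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: ReplicationController
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  replicas: 2                  # warm standby; leader election keeps one active
  selector:
    k8s-app: kube-scheduler
  template:
    metadata:
      labels:
        k8s-app: kube-scheduler
    spec:
      containers:
      - name: kube-scheduler
        image: gcr.io/google_containers/kube-scheduler:v1.1.3  # hypothetical image/tag
        command:
        - kube-scheduler
        - --master=https://kube-apiserver.example.internal     # hypothetical HA apiserver endpoint
        - --leader-elect=true
        livenessProbe:         # Kubernetes health checking restarts failed replicas
          httpGet:
            path: /healthz
            port: 10251
          initialDelaySeconds: 15
EOF
```

If a replica or its node dies, the ReplicationController (together with the
node controller) replaces it, which is exactly the self-healing behavior this
document asks the default scripts to provide for the control plane itself.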

## Relative Priorities

1. **(Possibly manual) recovery from catastrophic failures:** having a
   Kubernetes cluster, and all applications running inside it, disappear forever
   is perhaps the worst possible failure mode. So it is critical that we be able to
   recover the applications running inside a cluster from such failures in some
   well-bounded time period.
    1. In theory a cluster can be recovered by replaying all API calls
       that have ever been executed against it, in order, but most
       often that state has been lost, and/or is scattered across
       multiple client applications or groups. So in general it is
       probably infeasible.
    1. In theory a cluster can also be recovered to some relatively
       recent non-corrupt backup/snapshot of the disk(s) backing the
       etcd cluster state. But we have no default consistent
       backup/snapshot, verification or restoration process. And we
       don't routinely test restoration, so even if we did routinely
       perform and verify backups, we have no hard evidence that we
       can in practice effectively recover from catastrophic cluster
       failure or data corruption by restoring from these backups
       (a rough manual example is sketched after this list). So
       there's more work to be done here.
1. **Self-healing:** Most major cloud providers provide the ability to
   easily and automatically replace failed virtual machines within a
   small number of minutes (e.g. GCE
   [Auto-restart](https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#autorestart)
   and Managed Instance Groups,
   AWS [Auto-recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/)
   and [Auto scaling](https://aws.amazon.com/autoscaling/) etc). This
   can fairly trivially be used to reduce control-plane down-time due
   to machine failure to a small number of minutes per failure
   (i.e. typically around "3 nines" availability; an illustrative GCE
   example appears after this list), provided that:
    1. cluster persistent state (i.e. etcd disks) is either:
        1. truly persistent (i.e. remote persistent disks), or
        1. reconstructible (e.g. using etcd [dynamic member
           addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member)
           or [backup and
           recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery)).
    1. and boot disks are either:
        1. truly persistent (i.e. remote persistent disks), or
        1. reconstructible (e.g. using boot-from-snapshot,
           boot-from-pre-configured-image or
           boot-from-auto-initializing image).
1. **High Availability:** This has the potential to increase
   availability above the approximately "3 nines" level provided by
   automated self-healing, but it's somewhat more complex, and
   requires additional resources (e.g. redundant API servers and etcd
   quorum members). In environments where cloud-assisted automatic
   self-healing might be infeasible (e.g. on-premise bare-metal
   deployments), it also gives cluster administrators more time to
   respond (e.g. replace/repair failed machines) without incurring
   system downtime.
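To make the backup/snapshot gap in point 1 above more concrete, here is a
minimal sketch of what a manual, etcd-2.x-style backup and disaster recovery
could look like. The paths, timestamp and data directory location are
assumptions for illustration; none of this is currently scripted, verified or
tested by the default OSS distro.

```sh
# Hedged sketch, assuming etcd 2.x with its data directory at /var/lib/etcd.

# Take a point-in-time backup of the etcd data directory:
etcdctl backup \
  --data-dir /var/lib/etcd \
  --backup-dir /var/backups/etcd-$(date +%Y%m%d-%H%M%S)

# Disaster recovery: copy a backup onto a replacement node, then start a
# single-member cluster from it; further members can be re-added afterwards
# via runtime reconfiguration.
etcd --data-dir /var/backups/etcd-20151214-120000 \
  --force-new-cluster
```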
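And for the self-healing point, a hedged illustration of cloud-assisted VM
replacement on GCE: running a master VM under a managed instance group of size
1 means GCE recreates it if it fails. The template contents, names, image and
zone below are invented for illustration (the real kube-up provisioning
differs), the provisos above about persistent or reconstructible etcd and boot
state still apply, and AWS Auto Scaling groups can be used analogously.

```sh
# Hedged sketch: let GCE replace a failed master VM automatically by running it
# under a managed instance group of size 1. Names, zone and image are
# illustrative assumptions.
gcloud compute instance-templates create k8s-master-template \
  --machine-type n1-standard-2 \
  --image-family debian-8 --image-project debian-cloud \
  --boot-disk-size 20GB

gcloud compute instance-groups managed create k8s-master-group \
  --zone us-central1-b \
  --template k8s-master-template \
  --size 1
```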

## Design and Status (as of December 2015)

<table>
<tr>
<td><b>Control Plane Component</b></td>
<td><b>Resilience Plan</b></td>
<td><b>Current Status</b></td>
</tr>
<tr>
<td><b>API Server</b></td>
<td>

Multiple stateless, self-hosted, self-healing API servers behind an HA
load balancer, built out by the default "kube-up" automation on GCE,
AWS and basic bare metal (BBM). Note that the single-host approach of
having etcd listen only on localhost to ensure that only the API server can
connect to it will no longer work, so alternative security will be
needed in that regard (either using firewall rules, SSL certs, or
something else). All necessary flags are currently supported to enable
SSL between API server and etcd (OpenShift runs like this out of the
box), but this needs to be woven into the "kube-up" and related
scripts (an illustrative set of flags is sketched below the table).
Detailed design of self-hosting and related bootstrapping
and catastrophic failure recovery will be detailed in a separate
design doc.

</td>
<td>

No scripted self-healing or HA on GCE, AWS or basic bare metal
currently exists in the OSS distro. To be clear, "no self-healing"
means that even if multiple e.g. API servers are provisioned for HA
purposes, if they fail, nothing replaces them, so eventually the
system will fail. Self-healing and HA can be set up
manually by following documented instructions, but this is not
currently an automated process, and it is not tested as part of
continuous integration. So it's probably safest to assume that it
doesn't actually work in practice.

</td>
</tr>
<tr>
<td><b>Controller manager and scheduler</b></td>
<td>

Multiple self-hosted, self-healing, warm-standby, stateless controller
managers and schedulers with leader election and automatic failover of API
server clients, automatically installed by default "kube-up" automation
(a rough leader-election example appears below the table).

</td>
<td>As above.</td>
</tr>
<tr>
<td><b>etcd</b></td>
<td>

Multiple (3-5) etcd quorum members behind a load balancer with session
affinity (to prevent clients from being bounced from one to another).

Regarding self-healing, if a node running etcd goes down, it is always necessary
to do three things:
<ol>
<li>allocate a new node (not necessary if running etcd as a pod, in
which case specific measures are required to prevent user pods from
interfering with system pods, for example using node selectors),
<li>start an etcd replica on that new node, and
<li>have the new replica recover the etcd state.
</ol>
In the case of local disk (which fails in concert with the machine), the etcd
state must be recovered from the other replicas. This is called
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member">
dynamic member addition</A> (a rough example appears below the table).

In the case of remote persistent disk, the etcd state can be recovered by
attaching the remote persistent disk to the replacement node, so the state is
recoverable even if all other replicas are down.

There are also significant performance differences between local disks and remote
persistent disks. For example, the
<A HREF="https://cloud.google.com/compute/docs/disks/#comparison_of_disk_types">
sustained throughput of local disks in GCE is approximately 20x that of remote
disks</A>.

Hence we suggest that self-healing be provided by remotely mounted persistent
disks in non-performance-critical, single-zone cloud deployments. For
performance-critical installations, faster local SSDs should be used, in which
case remounting on node failure is not an option, so
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">
etcd runtime configuration</A> should be used to replace the failed machine.
Similarly, for cross-zone self-healing, cloud persistent disks are zonal, so
automatic <A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">
runtime configuration</A> is required. Basic bare metal deployments also
cannot generally rely on remote persistent disks, so the same approach applies
there.

</td>
<td>

<A HREF="http://kubernetes.io/v1.1/docs/admin/high-availability.html">
Somewhat vague instructions exist</A> on how to set some of this up manually in
a self-hosted configuration. But automatic bootstrapping and self-healing are not
described (and are not implemented for the non-PD cases). This all still needs to
be automated and continuously tested.

</td>
</tr>
</table>
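Regarding the API Server row above, here is a hedged sketch of the kind of
flags involved in securing the API server ↔ etcd connection with SSL; all
hostnames, ports and certificate paths are illustrative assumptions, and
wiring them into "kube-up" is the outstanding work.

```sh
# Hedged sketch: etcd requiring TLS client certificates, and the API server
# presenting one. Hosts, ports and paths are illustrative assumptions.

# On an etcd member:
etcd --name etcd-0 \
  --listen-client-urls https://10.0.0.2:2379 \
  --advertise-client-urls https://10.0.0.2:2379 \
  --cert-file=/etc/etcd/server.crt \
  --key-file=/etc/etcd/server.key \
  --client-cert-auth \
  --trusted-ca-file=/etc/etcd/ca.crt

# On the API server (in addition to all of its other usual flags):
kube-apiserver \
  --etcd-servers=https://10.0.0.2:2379 \
  --etcd-cafile=/etc/etcd/ca.crt \
  --etcd-certfile=/etc/kubernetes/apiserver-etcd-client.crt \
  --etcd-keyfile=/etc/kubernetes/apiserver-etcd-client.key
```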
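For the controller manager and scheduler row, the warm-standby behavior comes
from running one instance of each component per master with leader election
enabled; a hedged sketch follows (the API server endpoint is an illustrative
assumption).

```sh
# Hedged sketch: on each master, run both components with leader election so
# that exactly one instance of each is active and a standby takes over on
# failure. The endpoint below is an illustrative assumption.
kube-controller-manager \
  --master=https://kube-apiserver.example.internal \
  --leader-elect=true

kube-scheduler \
  --master=https://kube-apiserver.example.internal \
  --leader-elect=true
```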
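And for the etcd row, a hedged sketch of dynamic member addition after a
member's local disk is lost for good; the member ID, names and URLs are
illustrative assumptions.

```sh
# Hedged sketch of dynamic member addition after a member is lost for good.
# First remove the dead member, then register the replacement, then start it.
etcdctl member remove 6e3bd23ae5f1eae0          # ID of the failed member
etcdctl member add etcd-3 http://10.0.0.4:2380  # name and peer URL of the new node

# On the replacement node, join the existing cluster (do NOT bootstrap a new one):
etcd --name etcd-3 \
  --data-dir /var/lib/etcd \
  --listen-peer-urls http://10.0.0.4:2380 \
  --initial-advertise-peer-urls http://10.0.0.4:2380 \
  --listen-client-urls http://10.0.0.4:2379 \
  --advertise-client-urls http://10.0.0.4:2379 \
  --initial-cluster "etcd-1=http://10.0.0.2:2380,etcd-2=http://10.0.0.3:2380,etcd-3=http://10.0.0.4:2380" \
  --initial-cluster-state existing
```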