Control Plane Component | Resilience Plan | Current Status
----------------------- | --------------- | --------------
API Server | Multiple stateless, self-hosted, self-healing API servers behind an HA load balancer, built out by the default "kube-up" automation on GCE, AWS and basic bare metal (BBM). Note that the single-host approach of having etcd listen only on localhost, to ensure that only the API server can connect to it, will no longer work, so alternative security will be needed in that regard (firewall rules, SSL certs, or something else). All necessary flags are currently supported to enable SSL between the API server and etcd (OpenShift runs like this out of the box), but this needs to be woven into the "kube-up" and related scripts; a flag sketch follows the table. Detailed design of self-hosting, and of the related bootstrapping and catastrophic failure recovery, will be covered in a separate design doc. | No scripted self-healing or HA on GCE, AWS or basic bare metal currently exists in the OSS distro. To be clear, "no self-healing" means that even if multiple replicas of a component (e.g. API servers) are provisioned for HA purposes, nothing replaces them when they fail, so eventually the system will fail. Self-healing and HA can be set up manually by following documented instructions, but this is not currently an automated process, and it is not tested as part of continuous integration. So it is probably safest to assume that it does not actually work in practice.
Controller manager and scheduler | Multiple self-hosted, self-healing, warm-standby, stateless controller managers and schedulers with leader election and automatic failover of API server clients, automatically installed by the default "kube-up" automation; a leader-election sketch follows the table. | As above.
etcd | Multiple (3-5) etcd quorum members behind a load balancer with session affinity (to prevent clients from being bounced from one member to another).<br><br>Regarding self-healing, if a node running etcd goes down, it is always necessary to do three things: (1) allocate a new node (not necessary if running etcd as a pod, in which case specific measures are required to prevent user pods from interfering with system pods, for example using node selectors); (2) start a replacement etcd replica on that node; and (3) have the new replica recover the etcd state. With local disks, the state must be recovered by having the new replica join the existing cluster as a new member, which requires automated dynamic member addition. With a remote persistent disk, the etcd state can be recovered by attaching the disk to the replacement node, so the state is recoverable even if all other replicas are down.<br><br>There are also significant performance differences between local disks and remote persistent disks; for example, the sustained throughput of local disks on GCE is approximately 20x that of remote disks.<br><br>Hence we suggest that self-healing be provided by remotely mounted persistent disks in non-performance-critical, single-zone cloud deployments. For performance-critical installations, faster local SSDs should be used, in which case remounting on node failure is not an option, and etcd runtime configuration should instead be used to replace the failed machine. Cross-zone self-healing likewise requires automatic runtime configuration, because cloud persistent disks are zonal, and basic bare metal deployments cannot generally rely on remote persistent disks at all, so the same approach applies there. A recovery sketch for both cases follows the table. | Somewhat vague instructions exist on how to set some of this up manually in a self-hosted configuration, but automatic bootstrapping and self-healing are not described (and are not implemented for the non-PD cases). This all still needs to be automated and continuously tested.
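
To make the flag plumbing for the API Server row concrete, here is a rough sketch of how a "kube-up"-style script might start an API server that speaks TLS to etcd rather than relying on a localhost-only listener. The endpoint names and certificate paths are invented for illustration; only the flag names themselves (`--etcd-servers`, `--etcd-cafile`, `--etcd-certfile`, `--etcd-keyfile`) come from the kube-apiserver flag set.

```sh
# Illustrative only: hostnames and paths below are placeholders, not values
# taken from any existing kube-up script.
kube-apiserver \
  --etcd-servers=https://etcd-1.internal:2379,https://etcd-2.internal:2379,https://etcd-3.internal:2379 \
  --etcd-cafile=/srv/kubernetes/etcd-ca.crt \
  --etcd-certfile=/srv/kubernetes/apiserver-etcd-client.crt \
  --etcd-keyfile=/srv/kubernetes/apiserver-etcd-client.key \
  --bind-address=0.0.0.0 \
  --secure-port=443
```

With a CA and client certificates in place, etcd can require authenticated TLS connections instead of trusting everything on localhost, which is what makes the multi-host API server layout workable.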
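
For the controller manager and scheduler row, the warm-standby behaviour comes from leader election: one copy of each runs per master node, only the elected leader does work, and a standby takes over when the leader's lease lapses. A minimal sketch is below; the `--master` endpoint is a placeholder assumption.

```sh
# Run one copy of each on every master node; --leader-elect makes the
# non-leaders idle until they win the election.
kube-controller-manager --master=https://kube-apiserver.internal --leader-elect=true
kube-scheduler          --master=https://kube-apiserver.internal --leader-elect=true
```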
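For the etcd row, the two recovery paths described in the table look roughly as follows. All instance, disk and member names are hypothetical, the `gcloud` command illustrates only the GCE persistent-disk case, and the member-replacement commands use v3 `etcdctl` syntax; none of this is taken from existing kube-up automation.

```sh
# Path 1: remote persistent disk (non-performance-critical, single-zone).
# Reattach the surviving etcd data disk to a replacement VM, then restart
# etcd against the same data directory; no membership change is needed.
gcloud compute instances attach-disk etcd-replacement-vm \
  --disk etcd-1-data --zone us-central1-b

# Path 2: local SSD, cross-zone, or bare metal. Use etcd runtime
# configuration to swap the dead member for a fresh one.
etcdctl member list
etcdctl member remove 8211f1d0f64f3269          # ID of the failed member (hypothetical)
etcdctl member add etcd-3 --peer-urls=https://etcd-3.internal:2380

# On the replacement node, join the existing quorum instead of
# bootstrapping a new cluster.
etcd --name etcd-3 \
  --initial-advertise-peer-urls https://etcd-3.internal:2380 \
  --listen-peer-urls https://0.0.0.0:2380 \
  --listen-client-urls https://0.0.0.0:2379 \
  --advertise-client-urls https://etcd-3.internal:2379 \
  --initial-cluster "etcd-1=https://etcd-1.internal:2380,etcd-2=https://etcd-2.internal:2380,etcd-3=https://etcd-3.internal:2380" \
  --initial-cluster-state existing
```

Note that the failed member should be removed before the replacement is added, otherwise the quorum size grows while one member is unreachable, making the cluster more fragile during the swap. Automating exactly this sequence is what "automatic runtime configuration" refers to above.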