Control Plane Component | Resilience Plan | Current Status
------------------------|-----------------|---------------
API Server | Multiple stateless, self-hosted, self-healing API servers behind an HA load balancer, built out by the default "kube-up" automation on GCE, AWS and basic bare metal (BBM). Note that the single-host approach of having etcd listen only on localhost, to ensure that only the API server can connect to it, will no longer work, so alternative security will be needed in that regard (either using firewall rules, SSL certs, or something else). All necessary flags are currently supported to enable SSL between the API server and etcd (OpenShift runs like this out of the box), but this needs to be woven into the "kube-up" and related scripts (see the TLS client sketch after this table). Detailed design of self-hosting and related bootstrapping and catastrophic failure recovery will be covered in a separate design doc. | No scripted self-healing or HA on GCE, AWS or basic bare metal currently exists in the OSS distro. To be clear, "no self-healing" means that even if multiple API servers are provisioned for HA purposes, nothing replaces them when they fail, so eventually the system will fail. Self-healing and HA can be set up manually by following documented instructions, but this is not currently an automated process, and it is not tested as part of continuous integration. So it's probably safest to assume that it doesn't actually work in practice.
Controller manager and scheduler | Multiple stateless, self-hosted, self-healing, warm-standby controller managers and schedulers with leader election and automatic failover of API server clients, automatically installed by the default "kube-up" automation (see the leader-election sketch after this table). | As above.
etcd | Multiple (3-5) etcd quorum members behind a load balancer with session affinity (to prevent clients from being bounced from one member to another). Regarding self-healing, if a node running etcd goes down, it is always necessary to do three things: (1) allocate a new node (not necessary if running etcd as a pod, in which case specific measures are required to prevent user pods from interfering with system pods, for example using node selectors); (2) start an etcd replica on that new node; and (3) have the new replica recover the etcd state. In the case of local disk, the state must be recovered from the other replicas; this is called dynamic member addition. In the case of remote persistent disk, the etcd state can be recovered by attaching the remote persistent disk to the replacement node, so the state is recoverable even if all other replicas are down. There are also significant performance differences between local disks and remote persistent disks; for example, the sustained throughput of local disks in GCE is approximately 20x that of remote disks. Hence we suggest that self-healing be provided by remotely mounted persistent disks in non-performance-critical, single-zone cloud deployments. For performance-critical installations, faster local SSDs should be used, in which case remounting on node failure is not an option, so etcd runtime configuration should be used to replace the failed machine (see the member-replacement sketch after this table). Similarly, for cross-zone self-healing, cloud persistent disks are zonal, so automatic runtime configuration is required. Likewise, basic bare metal deployments cannot generally rely on remote persistent disks, so the same approach applies there. | Somewhat vague instructions exist on how to set some of this up manually in a self-hosted configuration, but automatic bootstrapping and self-healing are not described (and are not implemented for the non-PD cases). This all still needs to be automated and continuously tested.
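
As an illustration of the "alternative security" mentioned in the API Server row, the following sketch opens a TLS-secured connection to an etcd cluster from Go using the `go.etcd.io/etcd/client/v3` package. This is not the API server's actual wiring; the endpoints and certificate paths are placeholder assumptions, and in a real deployment the equivalent behaviour is configured through the API server's etcd flags.

```go
// Illustrative sketch: connecting to etcd over TLS instead of relying on a
// localhost-only etcd. Endpoints and file paths below are placeholders.
package main

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func newSecureEtcdClient() (*clientv3.Client, error) {
	// Client certificate/key pair presented to etcd (placeholder paths).
	cert, err := tls.LoadX509KeyPair(
		"/etc/kubernetes/pki/apiserver-etcd-client.crt",
		"/etc/kubernetes/pki/apiserver-etcd-client.key")
	if err != nil {
		return nil, err
	}

	// Trust only the etcd cluster's CA.
	caBytes, err := os.ReadFile("/etc/kubernetes/pki/etcd-ca.crt")
	if err != nil {
		return nil, err
	}
	caPool := x509.NewCertPool()
	if !caPool.AppendCertsFromPEM(caBytes) {
		return nil, fmt.Errorf("invalid etcd CA certificate")
	}

	return clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0:2379", "https://etcd-1:2379", "https://etcd-2:2379"},
		DialTimeout: 5 * time.Second,
		TLS: &tls.Config{
			Certificates: []tls.Certificate{cert},
			RootCAs:      caPool,
		},
	})
}

func main() {
	cli, err := newSecureEtcdClient()
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Simple probe over the secured connection: list the cluster members.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	resp, err := cli.MemberList(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Printf("etcd cluster has %d members\n", len(resp.Members))
}
```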
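
To make the warm-standby plan for the controller manager and scheduler more concrete, here is a minimal leader-election sketch using client-go's `leaderelection` package, so that only one of several replicas is active at a time. The lease name, namespace, and timing values are illustrative assumptions, not the actual component configuration.

```go
// Illustrative sketch: leader election among warm standbys using client-go.
// Only the elected leader runs its control loops; a standby takes over when
// the leader loses its lease.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	// The standbys are assumed to run as pods on the cluster (self-hosted).
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	id, _ := os.Hostname() // unique identity per replica

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "example-controller-manager", // placeholder lease name
			Namespace: "kube-system",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Only the elected leader runs the control loops.
				klog.Infof("%s became leader; starting controllers", id)
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// Exit so a warm standby can take over automatically.
				klog.Fatalf("%s lost leadership; exiting", id)
			},
		},
	})
}
```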
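
Finally, a sketch of the etcd runtime-configuration step suggested above for local-SSD, cross-zone, and bare-metal deployments: remove the failed member, then add its replacement, which must be started with `--initial-cluster-state=existing` so it can catch up from its peers (dynamic member addition). Endpoints, member names, and peer URLs are placeholders.

```go
// Illustrative sketch: replacing a failed etcd member via runtime
// reconfiguration using the etcd v3 client.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func replaceFailedMember(ctx context.Context, cli *clientv3.Client, failedName, newPeerURL string) error {
	// Find the failed member by name in the current cluster membership.
	members, err := cli.MemberList(ctx)
	if err != nil {
		return err
	}
	var failedID uint64
	found := false
	for _, m := range members.Members {
		if m.Name == failedName {
			failedID, found = m.ID, true
			break
		}
	}
	if !found {
		return fmt.Errorf("member %q not found", failedName)
	}

	// Remove the dead member first so the quorum requirement does not grow.
	if _, err := cli.MemberRemove(ctx, failedID); err != nil {
		return err
	}

	// Add the replacement; the new etcd process must then be started with
	// --initial-cluster-state=existing to join and recover state from peers.
	if _, err := cli.MemberAdd(ctx, []string{newPeerURL}); err != nil {
		return err
	}
	return nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0:2379", "https://etcd-1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := replaceFailedMember(ctx, cli, "etcd-2", "https://etcd-2-replacement:2380"); err != nil {
		panic(err)
	}
}
```

Removing the dead member before adding its replacement keeps the cluster at its intended size; an automated self-healing controller would wrap this sequence, plus node allocation, in a reconciliation loop.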