# Kubernetes/Ubernetes Control Plane Resilience

## Long Term Design and Current Status

### by Quinton Hoole, Mike Danese and Justin Santa-Barbara

### December 14, 2015

## Summary

Some amount of confusion exists around how we currently ensure, and in
future want to ensure, resilience of the Kubernetes (and by implication
Ubernetes) control plane. This document is an attempt to capture that
definitively. It covers areas including self-healing, high
availability, bootstrapping and recovery. Most of the information in
this document already exists in the form of github comments,
PR's/proposals, scattered documents, and corridor conversations, so
this document is primarily a consolidation and clarification of
existing ideas.

## Terms

* **Self-healing:** automatically restarting or replacing failed
  processes and machines without human intervention
* **High availability:** continuing to be available and work correctly
  even if some components are down or uncontactable. This typically
  involves multiple replicas of critical services, and a reliable way
  to find available replicas. Note that it's possible (but not
  desirable) to have high availability properties (e.g. multiple
  replicas) in the absence of self-healing properties (e.g. if a
  replica fails, nothing replaces it). Fairly obviously, given enough
  time, such systems typically become unavailable (after enough
  replicas have failed).
* **Bootstrapping**: creating an empty cluster from nothing
* **Recovery**: recreating a non-empty cluster after perhaps
  catastrophic failure/unavailability/data corruption

## Overall Goals

1. **Resilience to single failures:** Kubernetes clusters constrained
   to single availability zones should be resilient to individual
   machine and process failures by being both self-healing and highly
   available (within the context of such individual failures).
1. **Ubiquitous resilience by default:** The default cluster creation
   scripts for (at least) GCE, AWS and basic bare metal should adhere
   to the above (self-healing and high availability) by default (with
   options available to disable these features to reduce control plane
   resource requirements if so required). It is hoped that other
   cloud providers will also follow the above guidelines, but the
   above 3 are the primary canonical use cases.
1. **Resilience to some correlated failures:** Kubernetes clusters
   which span multiple availability zones in a region should by
   default be resilient to complete failure of one entire availability
   zone (by similarly providing self-healing and high availability in
   the default cluster creation scripts as above).
1. **Default implementation shared across cloud providers:** The
   differences between the default implementations of the above for
   GCE, AWS and basic bare metal should be minimized. This implies
   using shared libraries across these providers in the default
   scripts in preference to highly customized implementations per
   cloud provider. This is not to say that highly differentiated,
   customized per-cloud cluster creation processes (e.g. for GKE on
   GCE, or some hosted Kubernetes provider on AWS) are discouraged.
   But those fall squarely outside the basic cross-platform OSS
   Kubernetes distro.
1. **Self-hosting:** Where possible, Kubernetes's existing mechanisms
   for achieving system resilience (replication controllers, health
   checking, service load balancing etc) should be used in preference
   to building a separate set of mechanisms to achieve the same thing.
   This implies that self-hosting (the kubernetes control plane on
   kubernetes) is strongly preferred, with the caveat below.
1. **Recovery from catastrophic failure:** The ability to quickly and
   reliably recover a cluster from catastrophic failure is critical,
   and should not be compromised by the above goal to self-host
   (i.e. it goes without saying that the cluster should be quickly and
   reliably recoverable, even if the cluster control plane is
   broken). This implies that such catastrophic failure scenarios
   should be carefully thought out, and be the subject of regular
   continuous integration testing and disaster recovery exercises.

## Relative Priorities

1. **(Possibly manual) recovery from catastrophic failures:** Having a
   Kubernetes cluster, and all applications running inside it,
   disappear forever is perhaps the worst possible failure mode. So it
   is critical that we be able to recover the applications running
   inside a cluster from such failures in some well-bounded time
   period.
   1. In theory a cluster can be recovered by replaying all API calls
      that have ever been executed against it, in order, but most
      often that state has been lost, and/or is scattered across
      multiple client applications or groups. So in general it is
      probably infeasible.
   1. In theory a cluster can also be recovered to some relatively
      recent non-corrupt backup/snapshot of the disk(s) backing the
      etcd cluster state. But we have no default consistent
      backup/snapshot, verification or restoration process. And we
      don't routinely test restoration, so even if we did routinely
      perform and verify backups, we have no hard evidence that we
      can in practice effectively recover from catastrophic cluster
      failure or data corruption by restoring from these backups. So
      there's more work to be done here.
1. **Self-healing:** Most major cloud providers provide the ability to
   easily and automatically replace failed virtual machines within a
   small number of minutes (e.g. GCE
   [Auto-restart](https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#autorestart)
   and Managed Instance Groups,
   AWS [Auto-recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/)
   and [Auto scaling](https://aws.amazon.com/autoscaling/) etc). This
   can fairly trivially be used to reduce control-plane down-time due
   to machine failure to a small number of minutes per failure
   (i.e. typically around "3 nines" availability), provided that:
   1. cluster persistent state (i.e. etcd disks) is either:
      1. truly persistent (i.e. remote persistent disks), or
      1. reconstructible (e.g. using etcd [dynamic member
         addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member)
         or [backup and
         recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery);
         see the sketch after this list).
   1. and boot disks are either:
      1. truly persistent (i.e. remote persistent disks), or
      1. reconstructible (e.g. using boot-from-snapshot,
         boot-from-pre-configured-image or
         boot-from-auto-initializing image).
1. **High Availability:** This has the potential to increase
   availability above the approximately "3 nines" level provided by
   automated self-healing, but it's somewhat more complex, and
   requires additional resources (e.g. redundant API servers and etcd
   quorum members). In environments where cloud-assisted automatic
   self-healing might be infeasible (e.g. on-premise bare-metal
   deployments), it also gives cluster administrators more time to
   respond (e.g. replace/repair failed machines) without incurring
   system downtime.
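
To make the "reconstructible" etcd option above concrete, below is a
minimal sketch of replacing a failed etcd member via runtime
reconfiguration, together with a simple backup, assuming an etcd 2.x
cluster and purely hypothetical member names, IDs, IP addresses and
paths:

    # On any surviving member: deregister the failed member, register its replacement.
    etcdctl member list                     # note the ID of the failed member
    etcdctl member remove 6e3bd23ae5f1eae0  # hypothetical member ID
    etcdctl member add etcd-2 http://10.0.0.12:2380

    # On the replacement node: join the existing cluster rather than bootstrapping a new one.
    etcd --name etcd-2 \
      --initial-cluster "etcd-0=http://10.0.0.10:2380,etcd-1=http://10.0.0.11:2380,etcd-2=http://10.0.0.12:2380" \
      --initial-cluster-state existing \
      --initial-advertise-peer-urls http://10.0.0.12:2380 \
      --listen-peer-urls http://10.0.0.12:2380 \
      --listen-client-urls http://10.0.0.12:2379 \
      --advertise-client-urls http://10.0.0.12:2379

    # Periodic consistent backup (which, per the priority above, still needs to be
    # verified and restore-tested routinely).
    etcdctl backup --data-dir /var/lib/etcd --backup-dir /mnt/etcd-backup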

## Design and Status (as of December 2015)

<table>
<tr>
<td><b>Control Plane Component</b></td>
<td><b>Resilience Plan</b></td>
<td><b>Current Status</b></td>
</tr>
<tr>
<td><b>API Server</b></td>
<td>

Multiple stateless, self-hosted, self-healing API servers behind a HA
load balancer, built out by the default "kube-up" automation on GCE,
AWS and basic bare metal (BBM). Note that the single-host approach of
having etcd listen only on localhost to ensure that only the API server
can connect to it will no longer work, so alternative security will be
needed in that regard (either using firewall rules, SSL certs, or
something else). All necessary flags are currently supported to enable
SSL between API server and etcd (OpenShift runs like this out of the
box), but this needs to be woven into the "kube-up" and related
scripts. Detailed design of self-hosting and related bootstrapping
and catastrophic failure recovery will be detailed in a separate
design doc.

</td>
<td>

No scripted self-healing or HA on GCE, AWS or basic bare metal
currently exists in the OSS distro. To be clear, "no self healing"
means that even if multiple e.g. API servers are provisioned for HA
purposes, if they fail, nothing replaces them, so eventually the
system will fail. Self-healing and HA can be set up
manually by following documented instructions, but this is not
currently an automated process, and it is not tested as part of
continuous integration. So it's probably safest to assume that it
doesn't actually work in practice.

</td>
</tr>
<tr>
<td><b>Controller manager and scheduler</b></td>
<td>

Multiple self-hosted, self-healing, warm-standby, stateless controller
managers and schedulers with leader election and automatic failover of
API server clients, automatically installed by the default "kube-up"
automation.

</td>
<td>As above.</td>
</tr>
<tr>
<td><b>etcd</b></td>
<td>

Multiple (3-5) etcd quorum members behind a load balancer with session
affinity (to prevent clients from being bounced from one to another).

Regarding self-healing, if a node running etcd goes down, it is always
necessary to do three things:
<ol>
<li>allocate a new node (not necessary if running etcd as a pod, in
which case specific measures are required to prevent user pods from
interfering with system pods, for example using node selectors as
described in the node-selector documentation),
<li>start an etcd replica on that new node,
<li>have the new replica recover the etcd state.
</ol>
In the case of local disk (which fails in concert with the machine), the etcd
state must be recovered from the other replicas. This is called <A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member">dynamic member
addition</A>.
In the case of remote persistent disk, the etcd state can be recovered
by attaching the remote persistent disk to the replacement node, thus
the state is recoverable even if all other replicas are down.

There are also significant performance differences between local disks and remote
persistent disks. For example, the <A HREF="https://cloud.google.com/compute/docs/disks/#comparison_of_disk_types">sustained throughput of
local disks in GCE is approximately 20x that of remote disks</A>.

Hence we suggest that self-healing be provided by remotely mounted persistent disks in
non-performance-critical, single-zone cloud deployments. For
performance-critical installations, faster local SSD's should be used,
in which case remounting on node failure is not an option, so
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">etcd runtime configuration</A>
should be used to replace the failed machine. Similarly, for
cross-zone self-healing, cloud persistent disks are zonal, so automatic
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">runtime configuration</A>
is required. Similarly, basic bare metal deployments cannot generally
rely on remote persistent disks, so the same approach applies there.

</td>
<td>
<A HREF="http://kubernetes.io/v1.1/docs/admin/high-availability.html">
Somewhat vague instructions exist</A>
on how to set some of this up manually in a self-hosted
configuration. But automatic bootstrapping and self-healing is not
described (and is not implemented for the non-PD cases). This all
still needs to be automated and continuously tested.
</td>
</tr>
</table>
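
To illustrate the "SSL between API server and etcd" point above, here
is a minimal sketch of the relevant kube-apiserver flags (the flags
already exist, as noted in the table; the host names and file paths
below are hypothetical, and the remaining work is to weave this into
"kube-up" and related scripts):

    kube-apiserver \
      --etcd-servers=https://etcd-0.internal:2379,https://etcd-1.internal:2379,https://etcd-2.internal:2379 \
      --etcd-cafile=/srv/kubernetes/etcd-ca.crt \
      --etcd-certfile=/srv/kubernetes/apiserver-etcd-client.crt \
      --etcd-keyfile=/srv/kubernetes/apiserver-etcd-client.key \
      ...   # other API server flags unchanged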
# Kubernetes Cluster Federation (a.k.a. "Ubernetes")

## Cross-cluster Load Balancing and Service Discovery

### Requirements and System Design

### by Quinton Hoole, Dec 3 2015

## Requirements

### Discovery, Load-balancing and Failover

1. **Internal discovery and connection**: Pods/containers (running in
   a Kubernetes cluster) must be able to easily discover and connect
   to endpoints for Kubernetes services on which they depend in a
   consistent way, irrespective of whether those services exist in a
   different kubernetes cluster within the same cluster federation.
   These are henceforth referred to as "cluster-internal clients", or
   simply "internal clients".
1. **External discovery and connection**: External clients (running
   outside a Kubernetes cluster) must be able to discover and connect
   to endpoints for Kubernetes services on which they depend.
   1. **External clients predominantly speak HTTP(S)**: External
      clients are most often, but not always, web browsers, or at
      least speak HTTP(S) - notable exceptions include Enterprise
      Message Buses (Java, TLS), DNS servers (UDP), SIP servers and
      databases.
1. **Find the "best" endpoint:** Upon initial discovery and
   connection, both internal and external clients should ideally find
   "the best" endpoint if multiple eligible endpoints exist. "Best"
   in this context implies the closest (by network topology) endpoint
   that is both operational (as defined by some positive health check)
   and not overloaded (by some published load metric). For example:
   1. An internal client should find an endpoint which is local to its
      own cluster if one exists, in preference to one in a remote
      cluster (if both are operational and non-overloaded).
      Similarly, one in a nearby cluster (e.g. in the same zone or
      region) is preferable to one further afield.
   1. An external client (e.g. in New York City) should find an
      endpoint in a nearby cluster (e.g. U.S. East Coast) in
      preference to one further away (e.g. Japan).
1. **Easy fail-over:** If the endpoint to which a client is connected
   becomes unavailable (no network response/disconnected) or
   overloaded, the client should reconnect to a better endpoint,
   somehow.
   1. In the case where there exist one or more connection-terminating
      load balancers between the client and the serving Pod, failover
      might be completely automatic (i.e. the client's end of the
      connection remains intact, and the client is completely
      oblivious of the fail-over). This approach incurs network speed
      and cost penalties (by traversing possibly multiple load
      balancers), but requires zero smarts in clients, DNS libraries,
      recursing DNS servers etc, as the IP address of the endpoint
      remains constant over time.
   1. In a scenario where clients need to choose between multiple load
      balancer endpoints (e.g. one per cluster), multiple DNS A
      records associated with a single DNS name enable even relatively
      dumb clients to try the next IP address in the list of returned
      A records (without even necessarily re-issuing a DNS resolution
      request). For example, all major web browsers will try all A
      records in sequence until a working one is found (TBD: justify
      this claim with details for Chrome, IE, Safari, Firefox).
   1. In a slightly more sophisticated scenario, upon disconnection, a
      smarter client might re-issue a DNS resolution query, and
      (modulo DNS record TTL's, which can typically be set as low as 3
      minutes, and buggy DNS resolvers, caches and libraries which
      have been known to completely ignore TTL's), receive updated A
      records specifying a new set of IP addresses to which to
      connect.

### Portability

A Kubernetes application configuration (e.g. for a Pod, Replication
Controller, Service etc) should be able to be successfully deployed
into any Kubernetes Cluster or Ubernetes Federation of Clusters,
without modification. More specifically, a typical configuration
should work correctly (although possibly not optimally) across any of
the following environments:

1. A single Kubernetes Cluster on one cloud provider (e.g. Google
   Compute Engine, GCE)
1. A single Kubernetes Cluster on a different cloud provider
   (e.g. Amazon Web Services, AWS)
1. A single Kubernetes Cluster on a non-cloud, on-premise data center
1. A Federation of Kubernetes Clusters all on the same cloud provider
   (e.g. GCE)
1. A Federation of Kubernetes Clusters across multiple different cloud
   providers and/or on-premise data centers (e.g. one cluster on
   GCE/GKE, one on AWS, and one on-premise).

### Trading Portability for Optimization

It should be possible to explicitly opt out of portability across some
subset of the above environments in order to take advantage of
non-portable load balancing and DNS features of one or more
environments. More specifically, for example:

1. For HTTP(S) applications running on GCE-only Federations,
   [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
   should be usable. These provide single, static global IP addresses
   which load balance and fail over globally (i.e. across both regions
   and zones). These allow for really dumb clients, but they only
   work on GCE, and only for HTTP(S) traffic.
1. For non-HTTP(S) applications running on GCE-only Federations within
   a single region,
   [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
   should be usable. These provide TCP (i.e. both HTTP/S and
   non-HTTP/S) load balancing and failover, but only on GCE, and only
   within a single region.
   [Google Cloud DNS](https://cloud.google.com/dns) can be used to
   route traffic between regions (and between different cloud
   providers and on-premise clusters, as it's plain DNS, IP only).
1. For applications running on AWS-only Federations,
   [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
   should be usable. These provide both L7 (HTTP(S)) and L4 load
   balancing, but only within a single region, and only on AWS
   ([AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be
   used to load balance and fail over across multiple regions, and is
   also capable of resolving to non-AWS endpoints).

## Component Cloud Services

Ubernetes cross-cluster load balancing is built on top of the following:

1. [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
   provide single, static global IP addresses which load balance and
   fail over globally (i.e. across both regions and zones). These
   allow for really dumb clients, but they only work on GCE, and only
   for HTTP(S) traffic.
1. [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
   provide both HTTP(S) and non-HTTP(S) load balancing and failover,
   but only on GCE, and only within a single region.
1. [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
   provide both L7 (HTTP(S)) and L4 load balancing, but only within a
   single region, and only on AWS.
1. [Google Cloud DNS](https://cloud.google.com/dns) (or any other
   programmable DNS service, like
   [CloudFlare](http://www.cloudflare.com)) can be used to route
   traffic between regions (and between different cloud providers and
   on-premise clusters, as it's plain DNS, IP only). Google Cloud DNS
   doesn't provide any built-in geo-DNS, latency-based routing, health
   checking, weighted round robin or other advanced capabilities.
   It's plain old DNS. We would need to build all the aforementioned
   on top of it. It can provide internal DNS services (i.e. serve RFC
   1918 addresses).
1. [AWS Route 53 DNS service](https://aws.amazon.com/route53/) can
   be used to load balance and fail over across regions, and is also
   capable of routing to non-AWS endpoints. It provides built-in
   geo-DNS, latency-based routing, health checking, weighted
   round robin and optional tight integration with some other
   AWS services (e.g. Elastic Load Balancers).
1. Kubernetes L4 Service Load Balancing: This provides both a
   [virtual cluster-local](http://kubernetes.io/v1.1/docs/user-guide/services.html#virtual-ips-and-service-proxies)
   and a
   [real externally routable](http://kubernetes.io/v1.1/docs/user-guide/services.html#type-loadbalancer)
   service IP which is load-balanced (currently simple round-robin)
   across the healthy pods comprising a service within a single
   Kubernetes cluster.
1. [Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html): A
   generic wrapper around cloud-provided L4 and L7 load balancing
   services, and roll-your-own load balancers run in pods,
   e.g. HAProxy. A minimal sketch follows this list.
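
As an illustration of the Kubernetes Ingress item above, here is a
minimal sketch of a single-cluster Ingress (using the
`extensions/v1beta1` API group; the service name, namespace and port
are hypothetical):

    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: my-service-ingress
      namespace: my-namespace
    spec:
      # Default backend: route all traffic for this Ingress to my-service:80.
      backend:
        serviceName: my-service
        servicePort: 80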

## Ubernetes API

The Ubernetes API for load balancing should be compatible with the
equivalent Kubernetes API, to ease porting of clients between
Ubernetes and Kubernetes. Further details below.

## Common Client Behavior

To be useful, our load balancing solution needs to work properly with
real client applications. There are a few different classes of
those...

### Browsers

These are the most common external clients, and are generally
well-written. See below.

### Well-written clients

1. Do a DNS resolution every time they connect.
1. Don't cache beyond the TTL (although a small percentage of the DNS
   servers on which they rely might).
1. Do try multiple A records (in order) to connect.
1. (in an ideal world) Do use SRV records rather than hard-coded port
   numbers (see the example record after this list).

Examples:

+ all common browsers (except for SRV records)
+ ...
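
For reference, an SRV record carries priority, weight, port and target,
so a client that honors SRV records need not hard-code the port. A
hypothetical record for the "client" port of the federated service used
later in this document might look like:

    $ dig +noall +answer _client._tcp.my-service.my-namespace.my-federation.my-domain.com SRV
    _client._tcp.my-service.my-namespace.my-federation.my-domain.com. 180 IN SRV 10 50 2379 lb.cluster-1.my-domain.com.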

### Dumb clients

1. Don't do a DNS resolution every time they connect (or do cache
   beyond the TTL).
1. Do try multiple A records.

Examples:

+ ...

### Dumber clients

1. Only do a DNS lookup once on startup.
1. Only try the first returned DNS A record.

Examples:

+ ...

### Dumbest clients

1. Never do a DNS lookup - are pre-configured with a single (or
   possibly multiple) fixed server IP(s). Nothing else matters.

## Architecture and Implementation

### General control plane architecture

Each cluster hosts one or more Ubernetes master components (Ubernetes
API servers, controller managers with leader election, and etcd quorum
members). This is documented in more detail in a
[separate design doc: Kubernetes/Ubernetes Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#).

In the description below, assume that 'n' clusters, named
'cluster-1'... 'cluster-n' have been registered against an Ubernetes
Federation "federation-1", each with their own set of Kubernetes API
endpoints, e.g.
"[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1),
[http://endpoint-2.cluster-1](http://endpoint-2.cluster-1)
... [http://endpoint-m.cluster-n](http://endpoint-m.cluster-n)".

### Federated Services

Ubernetes Services are pretty straight-forward. They're comprised of
multiple equivalent underlying Kubernetes Services, each with their
own external endpoint, and a load balancing mechanism across them.
Let's work through how exactly that works in practice.

Our user creates the following Ubernetes Service (against an Ubernetes
API endpoint):

    $ kubectl create -f my-service.yaml --context="federation-1"

where `my-service.yaml` contains the following:

    kind: Service
    metadata:
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
    spec:
      ports:
      - port: 2379
        protocol: TCP
        targetPort: 2379
        name: client
      - port: 2380
        protocol: TCP
        targetPort: 2380
        name: peer
      selector:
        run: my-service
      type: LoadBalancer

Ubernetes in turn creates one equivalent service (identical config to
the above) in each of the underlying Kubernetes clusters, each of
which results in something like this:

    $ kubectl get -o yaml --context="cluster-1" service my-service

    apiVersion: v1
    kind: Service
    metadata:
      creationTimestamp: 2015-11-25T23:35:25Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      resourceVersion: "147365"
      selfLink: /api/v1/namespaces/my-namespace/services/my-service
      uid: 33bfc927-93cd-11e5-a38c-42010af00002
    spec:
      clusterIP: 10.0.153.185
      ports:
      - name: client
        nodePort: 31333
        port: 2379
        protocol: TCP
        targetPort: 2379
      - name: peer
        nodePort: 31086
        port: 2380
        protocol: TCP
        targetPort: 2380
      selector:
        run: my-service
      sessionAffinity: None
      type: LoadBalancer
    status:
      loadBalancer:
        ingress:
        - ip: 104.197.117.10

Similar services are created in `cluster-2` and `cluster-3`, each of
which is allocated its own `spec.clusterIP` and
`status.loadBalancer.ingress.ip`.

In Ubernetes `federation-1`, the resulting federated service looks as follows:

    $ kubectl get -o yaml --context="federation-1" service my-service

    apiVersion: v1
    kind: Service
    metadata:
      creationTimestamp: 2015-11-25T23:35:23Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      resourceVersion: "157333"
      selfLink: /api/v1/namespaces/my-namespace/services/my-service
      uid: 33bfc927-93cd-11e5-a38c-42010af00007
    spec:
      clusterIP:
      ports:
      - name: client
        nodePort: 31333
        port: 2379
        protocol: TCP
        targetPort: 2379
      - name: peer
        nodePort: 31086
        port: 2380
        protocol: TCP
        targetPort: 2380
      selector:
        run: my-service
      sessionAffinity: None
      type: LoadBalancer
    status:
      loadBalancer:
        ingress:
        - hostname: my-service.my-namespace.my-federation.my-domain.com

Note that the federated service:

1. is API-compatible with a vanilla Kubernetes service,
1. has no clusterIP (as it is cluster-independent), and
1. has a federation-wide load balancer hostname.

In addition to the set of underlying Kubernetes services (one per
cluster) described above, Ubernetes has also created a DNS name
(e.g. on [Google Cloud DNS](https://cloud.google.com/dns) or
[AWS Route 53](https://aws.amazon.com/route53/), depending on
configuration) which provides load balancing across all of those
services. For example, in a very basic configuration:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.117.10
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157

Each of the above IP addresses (which are just the external load
balancer ingress IP's of each cluster service) is of course load
balanced across the pods comprising the service in each cluster.
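
To make the "very basic configuration" above concrete, the round-robin
A records could be programmed on Google Cloud DNS roughly as follows (a
hand-driven sketch of what the federation control plane would automate;
the managed zone name is hypothetical):

    $ gcloud dns record-sets transaction start --zone=my-federation-zone
    $ gcloud dns record-sets transaction add --zone=my-federation-zone \
        --name=my-service.my-namespace.my-federation.my-domain.com. \
        --type=A --ttl=180 104.197.117.10 104.197.74.77 104.197.38.157
    $ gcloud dns record-sets transaction execute --zone=my-federation-zone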

In a more sophisticated configuration (e.g. on GCE or GKE), Ubernetes
automatically creates a
[GCE Global L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
which exposes a single, globally load-balanced IP:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 107.194.17.44

Optionally, Ubernetes also configures the local DNS servers (SkyDNS)
in each Kubernetes cluster to preferentially return the local
clusterIP for the service in that cluster, with other clusters'
external service IP's (or a global load-balanced IP) also configured
for failover purposes:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 10.0.153.185
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157

If Ubernetes Global Service Health Checking is enabled, multiple
service health checkers running across the federated clusters
collaborate to monitor the health of the service endpoints, and
automatically remove unhealthy endpoints from the DNS record (e.g. a
majority quorum is required to vote a service endpoint unhealthy, to
avoid false positives due to individual health checker network
isolation).

### Federated Replication Controllers

So far we have a federated service defined, with a resolvable load
balancer hostname by which clients can reach it, but no pods serving
traffic directed there. So now we need a Federated Replication
Controller. These are also fairly straight-forward, being comprised
of multiple underlying Kubernetes Replication Controllers which do the
hard work of keeping the desired number of Pod replicas alive in each
Kubernetes cluster.

    $ kubectl create -f my-service-rc.yaml --context="federation-1"

where `my-service-rc.yaml` contains the following:

    kind: ReplicationController
    metadata:
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
    spec:
      replicas: 6
      selector:
        run: my-service
      template:
        metadata:
          labels:
            run: my-service
        spec:
          containers:
          - image: gcr.io/google_samples/my-service:v1
            name: my-service
            ports:
            - containerPort: 2379
              protocol: TCP
            - containerPort: 2380
              protocol: TCP

Ubernetes in turn creates one equivalent replication controller
(identical config to the above, except for the replica count) in each
of the underlying Kubernetes clusters, each of which results in
something like this:

    $ ./kubectl get -o yaml rc my-service --context="cluster-1"

    kind: ReplicationController
    metadata:
      creationTimestamp: 2015-12-02T23:00:47Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      selfLink: /api/v1/namespaces/my-namespace/replicationcontrollers/my-service
      uid: 86542109-9948-11e5-a38c-42010af00002
    spec:
      replicas: 2
      selector:
        run: my-service
      template:
        metadata:
          labels:
            run: my-service
        spec:
          containers:
          - image: gcr.io/google_samples/my-service:v1
            name: my-service
            ports:
            - containerPort: 2379
              protocol: TCP
            - containerPort: 2380
              protocol: TCP
            resources: {}
          dnsPolicy: ClusterFirst
          restartPolicy: Always
    status:
      replicas: 2

The exact number of replicas created in each underlying cluster will
of course depend on what scheduling policy is in force. In the above
example, the scheduler created an equal number of replicas (2) in each
of the three underlying clusters, to make up the total of 6 replicas
required. To handle entire cluster failures, various approaches are
possible, including (see the worked example after this list):

1. **simple overprovisioning**, such that sufficient replicas remain
   even if a cluster fails. This wastes some resources, but is simple
   and reliable.
2. **pod autoscaling**, where the replication controller in each
   cluster automatically and autonomously increases the number of
   replicas in its cluster in response to the additional traffic
   diverted from the failed cluster. This saves resources and is
   relatively simple, but there is some delay in the autoscaling.
3. **federated replica migration**, where the Ubernetes Federation
   Control Plane detects the cluster failure and automatically
   increases the replica count in the remaining clusters to make up
   for the lost replicas in the failed cluster. This does not seem to
   offer any benefits relative to pod autoscaling above, and is
   arguably more complex to implement, but we note it here as a
   possibility.
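
As an illustrative calculation of option 1 above (the numbers follow
the example in this section, not a fixed rule): with 3 clusters and a
requirement that at least 6 replicas remain serving, simple
overprovisioning means running 9 replicas (3 per cluster), since losing
any one cluster still leaves 2 x 3 = 6 replicas. More generally, with N
clusters, overprovisioning by a factor of roughly N/(N-1) tolerates the
loss of any single cluster.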

### Implementation Details

The implementation approach and architecture is very similar to
Kubernetes, so if you're familiar with how Kubernetes works, none of
what follows will be surprising. One additional design driver not
present in Kubernetes is that Ubernetes aims to be resilient to
individual cluster and availability zone failures. So the control
plane spans multiple clusters. More specifically:

+ Ubernetes runs its own distinct set of API servers (typically one
  or more per underlying Kubernetes cluster). These are completely
  distinct from the Kubernetes API servers for each of the underlying
  clusters.
+ Ubernetes runs its own distinct quorum-based metadata store (etcd,
  by default). Approximately 1 quorum member runs in each underlying
  cluster ("approximately" because we aim for an odd number of quorum
  members, and typically don't want more than 5 quorum members, even
  if we have a larger number of federated clusters, so 2 clusters->3
  quorum members, 3->3, 4->3, 5->5, 6->5, 7->5 etc).

Cluster Controllers in Ubernetes watch against the Ubernetes API
server/etcd state, and apply changes to the underlying kubernetes
clusters accordingly. They also provide the anti-entropy mechanism for
reconciling ubernetes "desired desired" state against kubernetes
"actual desired" state.
# Ubernetes Design Spec (phase one)

**Huawei PaaS Team**

## INTRODUCTION

In this document we propose a design for the "Control Plane" of
Kubernetes (K8S) federation (a.k.a. "Ubernetes"). For background on
this work please refer to
[this proposal](../../docs/proposals/federation.md).
The document is arranged as follows. First we briefly list scenarios
and use cases that motivate the K8S federation work. These use cases
drive the design, and they also verify the design. We summarize the
functionality requirements from these use cases, and define the "in
scope" functionalities that will be covered by this design (phase
one). After that we give an overview of the proposed architecture, API
and building blocks. We also go through several activity flows to
see how these building blocks work together to support the use cases.

## REQUIREMENTS

There are many reasons why customers may want to build a K8S
federation:

+ **High Availability:** Customers want to be immune to the outage of
  a single availability zone, region or even a cloud provider.
+ **Sensitive workloads:** Some workloads can only run on a particular
  cluster. They cannot be scheduled to or migrated to other clusters.
+ **Capacity overflow:** Customers prefer to run workloads on a
  primary cluster. But if the capacity of the cluster is not
  sufficient, workloads should be automatically distributed to other
  clusters.
+ **Vendor lock-in avoidance:** Customers want to spread their
  workloads on different cloud providers, and to be able to easily
  increase or decrease the workload proportion of a specific provider.
+ **Cluster Size Enhancement:** Currently a K8S cluster can only
  support a limited size. While the community is actively improving
  it, it can be expected that cluster size will be a problem if K8S is
  used for large workloads or public PaaS infrastructure. While we can
  separate different tenants onto different clusters, it would be good
  to have a unified view.

Here are the functionality requirements derived from the above use cases:

+ Clients of the federation control plane API server can register and
  deregister clusters.
+ Workloads should be spread to different clusters according to the
  workload distribution policy.
+ Pods are able to discover and connect to services hosted in other
  clusters (in cases where inter-cluster networking is necessary,
  desirable and implemented).
+ Traffic to these pods should be spread across clusters (in a manner
  similar to load balancing, although it might not be strictly
  speaking balanced).
+ The control plane needs to know when a cluster is down, and migrate
  the workloads to other clusters.
+ Clients have a unified view and a central control point for the above
  activities.

## SCOPE

It's difficult to have a perfect design in one pass that implements
all the above requirements. Therefore we will go with an iterative
approach to design and build the system. This document describes
phase one of the whole work. In phase one we will cover only the
following objectives:

+ Define the basic building blocks and API objects of the control plane
+ Implement a basic end-to-end workflow
  + Clients register federated clusters
  + Clients submit a workload
  + The workload is distributed to different clusters
  + Service discovery
  + Load balancing

The following parts are NOT covered in phase one:

+ Authentication and authorization (other than basic client
  authentication against the ubernetes API, and from the ubernetes
  control plane to the underlying kubernetes clusters)
+ Deployment units other than replication controller and service
+ Complex distribution policies for workloads
+ Service affinity and migration

## ARCHITECTURE

The overall architecture of the control plane is shown below:

![Ubernetes Architecture](ubernetes-design.png)

Some design principles we are following in this architecture:

1. Keep the underlying K8S clusters independent. They should have no
   knowledge of the control plane or of each other.
1. Keep the Ubernetes API interface compatible with the K8S API as much
   as possible.
1. Re-use concepts from K8S as much as possible. This reduces
   customers' learning curve and is good for adoption.

Below is a brief description of each module contained in the above
diagram.

## Ubernetes API Server

The API Server in the Ubernetes control plane works just like the API
Server in K8S. It talks to a distributed key-value store to persist,
retrieve and watch API objects. This store is completely distinct
from the kubernetes key-value stores (etcd) in the underlying
kubernetes clusters. We still use `etcd` as the distributed
storage so customers don't need to learn and manage a different
storage system, although it is envisaged that other storage systems
(Consul, ZooKeeper) will probably be developed and supported over
time.

## Ubernetes Scheduler

The Ubernetes Scheduler schedules resources onto the underlying
Kubernetes clusters. For example it watches for unscheduled Ubernetes
replication controllers (those that have not yet been scheduled onto
underlying Kubernetes clusters) and performs the global scheduling
work. For each unscheduled replication controller, it calls the policy
engine to decide how to split workloads among clusters. It creates a
Kubernetes Replication Controller for one or more underlying clusters,
and posts them back to `etcd` storage.

One subtlety worth noting here is that the scheduling decision is
arrived at by combining the application-specific request from the user
(which might include, for example, placement constraints) and the
global policy specified by the federation administrator (for example,
"prefer on-premise clusters over AWS clusters" or "spread load equally
across clusters").

## Ubernetes Cluster Controller

The cluster controller performs the following two kinds of work:

1. It watches all the sub-resources that are created by Ubernetes
   components, like a sub-RC or a sub-service, and creates the
   corresponding API objects on the underlying K8S clusters.
1. It periodically retrieves the available resource metrics from the
   underlying K8S cluster, and updates them in the status of the
   `cluster` API object. An alternative design might be to run a pod
   in each underlying cluster that reports metrics for that cluster to
   the Ubernetes control plane. Which approach is better remains an
   open topic of discussion.

## Ubernetes Service Controller

The Ubernetes service controller is a federation-level implementation
of the K8S service controller. It watches service resources created on
the control plane, and creates corresponding K8S services on each
involved K8S cluster. Besides interacting with service resources on
each individual K8S cluster, the Ubernetes service controller also
performs some global DNS registration work.

## API OBJECTS

## Cluster

Cluster is a new first-class API object introduced in this design. For
each registered K8S cluster there will be such an API resource in the
control plane. The way clients register or deregister a cluster is to
send corresponding REST requests to the following URL:
`/api/{$version}/clusters`. Because the control plane behaves like a
regular K8S client to the underlying clusters, the spec of a cluster
object contains necessary properties like the K8S cluster address and
credentials. The status of a cluster API object will contain the
following information:

1. Which phase of its lifecycle it is in
1. Cluster resource metrics for scheduling decisions
1. Other metadata like the version of the cluster

A sketch of such an object appears after the tables below.

$version.clusterSpec

<table>
<tbody>
<tr>
<td><b>Name</b></td>
<td><b>Description</b></td>
<td><b>Required</b></td>
<td><b>Schema</b></td>
<td><b>Default</b></td>
</tr>
<tr>
<td>Address</td>
<td>address of the cluster</td>
<td>yes</td>
<td>address</td>
<td></td>
</tr>
<tr>
<td>Credential</td>
<td>the type (e.g. bearer token, client certificate etc) and data of
the credential used to access the cluster. It's used for system
routines (not on behalf of users)</td>
<td>yes</td>
<td>string</td>
<td></td>
</tr>
</tbody>
</table>

$version.clusterStatus

<table>
<tbody>
<tr>
<td><b>Name</b></td>
<td><b>Description</b></td>
<td><b>Required</b></td>
<td><b>Schema</b></td>
<td><b>Default</b></td>
</tr>
<tr>
<td>Phase</td>
<td>the recently observed lifecycle phase of the cluster</td>
<td>yes</td>
<td>enum</td>
<td></td>
</tr>
<tr>
<td>Capacity</td>
<td>represents the available resources of a cluster</td>
<td>yes</td>
<td>any</td>
<td></td>
</tr>
<tr>
<td>ClusterMeta</td>
<td>Other cluster metadata like the version</td>
<td>yes</td>
<td>ClusterMeta</td>
<td></td>
</tr>
</tbody>
</table>
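
For illustration, a cluster registration object following the tables
above might look roughly like the sketch below, and would be registered
by sending it (e.g. via a POST) to `/api/{$version}/clusters`. This is
purely hypothetical: the exact field names and schema are those being
defined by the tables above and may change.

```
apiVersion: v1
kind: Cluster
metadata:
  name: cluster-1
spec:
  # Address and credential as described in $version.clusterSpec.
  address: https://endpoint-1.cluster-1:443
  credential: <bearer token or client certificate data>
status:
  phase: running
  capacity:        # aggregated from the cluster's nodes in phase one
    cpu: "400"
    memory: 640Gi
  clusterMeta:
    version: v1.1.3
```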

**For simplicity we didn't introduce a separate "cluster metrics" API
object here**. The cluster resource metrics are stored in the cluster
status section, just as is done for nodes in K8S. In phase one it
only contains available CPU and memory resources. The
cluster controller will periodically poll the underlying cluster API
Server to get cluster capacity. In phase one it gets the metrics by
simply aggregating metrics from all nodes. In future we will improve
this in more efficient ways, such as by leveraging heapster, and more
metrics will be supported. Similar to node phases in K8S, the "phase"
field includes the following values:

+ pending: newly registered clusters or clusters suspended by an admin
  for various reasons. They are not eligible for accepting workloads
+ running: clusters in normal status that can accept workloads
+ offline: clusters temporarily down or not reachable
+ terminated: clusters removed from the federation

Below is the state transition diagram.

![Cluster State Transition Diagram](ubernetes-cluster-state.png)

## Replication Controller

A global workload submitted to the control plane is represented as an
Ubernetes replication controller. When a replication controller
is submitted to the control plane, clients need a way to express its
requirements or preferences on clusters. Depending on the use
case this may be complex. For example:

+ This workload can only be scheduled to cluster Foo. It cannot be
  scheduled to any other clusters (use case: sensitive workloads).
+ This workload prefers cluster Foo. But if there is no available
  capacity on cluster Foo, it's OK to be scheduled to cluster Bar
  (use case: capacity overflow).
+ Seventy percent of this workload should be scheduled to cluster Foo,
  and thirty percent should be scheduled to cluster Bar (use case:
  vendor lock-in avoidance).

In phase one, we only introduce a _clusterSelector_ field to filter
acceptable clusters. In the default case there is no such selector,
which means any cluster is acceptable.

Below is a sample of the YAML to create such a replication controller.

```
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller
spec:
  replicas: 5
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
  clusterSelector:
    name in (Foo, Bar)
```

Currently clusterSelector (implemented as a
[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704))
only supports a simple list of acceptable clusters. Workloads will be
evenly distributed on these acceptable clusters in phase one. After
phase one we will define syntax to represent more advanced
constraints, like cluster preference ordering, desired number of
split workloads, desired ratio of workloads spread across different
clusters, etc.

Besides this explicit "clusterSelector" filter, a workload may have
some implicit scheduling restrictions. For example it may define a
"nodeSelector" which can only be satisfied on some particular
clusters. How to handle this will be addressed after phase one.

## Ubernetes Services

The Service API object exposed by Ubernetes is similar to service
objects on Kubernetes. It defines the access to a group of pods. The
Ubernetes service controller will create corresponding Kubernetes
service objects on underlying clusters. These are detailed in a
separate design document: [Federated Services](federated-services.md).

## Pod

In phase one we only support scheduling replication controllers. Pod
scheduling will be supported in a later phase. This is primarily in
order to keep the Ubernetes API compatible with the Kubernetes API.

## ACTIVITY FLOWS

## Scheduling

The diagram below shows how workloads are scheduled on the Ubernetes
control plane:

1. A replication controller is created by the client.
1. The API Server persists it into the storage.
1. The cluster controller periodically polls the latest available
   resource metrics from the underlying clusters.
1. The scheduler watches all pending RCs. It picks up an RC, makes
   policy-driven decisions and splits it into different sub-RCs.
1. Each cluster controller watches the sub-RCs bound to its
   corresponding cluster. It picks up the newly created sub-RC.
1. The cluster controller issues requests to the underlying cluster
   API Server to create the RC.

In phase one we don't support complex distribution policies. The
scheduling rule is basically:

1. If an RC does not specify any nodeSelector, it will be scheduled
   to the least loaded K8S cluster(s) that has enough available
   resources.
1. If an RC specifies _N_ acceptable clusters in the
   clusterSelector, all replicas will be evenly distributed among
   these clusters (see the worked example below).
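
As a worked example of rule 2 (illustrative numbers, reusing the
nginx-controller sample above): an RC with `replicas: 5` and
`clusterSelector: name in (Foo, Bar)` specifies _N_ = 2 acceptable
clusters, so the scheduler splits it into two sub-RCs, e.g. one with 3
replicas bound to Foo and one with 2 replicas bound to Bar (when the
replica count does not divide evenly, one cluster receives the extra
replica).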

There is a potential race condition here. Say at time _T1_ the control
plane learns that there are _m_ available resources in a K8S cluster.
As the cluster is working independently, it still accepts workload
requests from other K8S clients or even another Ubernetes control
plane. The Ubernetes scheduling decision is based on this snapshot of
available resources. However, when the actual RC creation happens on
the cluster at time _T2_, the cluster may not have enough resources
at that time. We will address this problem in later phases with
proposed solutions like resource reservation mechanisms.

![Ubernetes Scheduling](ubernetes-scheduling.png)

## Service Discovery

This part is covered in the "Federated Services" section of the
document
"[Ubernetes Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)". Please
refer to that document for details.