Merge pull request #19313 from quinton-hoole/2015-01-05-ube-design-docs

Auto commit by PR queue bot
pull/6/head
k8s-merge-robot 2016-03-14 22:25:28 -07:00
commit 536a30fe45
6 changed files with 1253 additions and 0 deletions

View File

@ -0,0 +1,269 @@
# Kubernetes/Ubernetes Control Plane Resilience
## Long Term Design and Current Status
### by Quinton Hoole, Mike Danese and Justin Santa-Barbara
### December 14, 2015
## Summary
Some confusion exists around how we currently ensure, and in future
intend to ensure, resilience of the Kubernetes (and by implication
Ubernetes) control plane. This document is an attempt to capture that
definitively. It covers areas including self-healing, high
availability, bootstrapping and recovery. Most of the information in
this document already exists in the form of github comments,
PRs/proposals, scattered documents, and corridor conversations, so
this document is primarily a consolidation and clarification of
existing ideas.
## Terms
* **Self-healing:** automatically restarting or replacing failed
processes and machines without human intervention
* **High availability:** continuing to be available and work correctly
even if some components are down or uncontactable. This typically
involves multiple replicas of critical services, and a reliable way
to find available replicas. Note that it's possible (but not
desirable) to have high
availability properties (e.g. multiple replicas) in the absence of
self-healing properties (e.g. if a replica fails, nothing replaces
it). Fairly obviously, given enough time, such systems typically
become unavailable (after enough replicas have failed).
* **Bootstrapping**: creating an empty cluster from nothing
* **Recovery**: recreating a non-empty cluster after perhaps
catastrophic failure/unavailability/data corruption
## Overall Goals
1. **Resilience to single failures:** Kubernetes clusters constrained
to single availability zones should be resilient to individual
machine and process failures by being both self-healing and highly
available (within the context of such individual failures).
1. **Ubiquitous resilience by default:** The default cluster creation
scripts for (at least) GCE, AWS and basic bare metal should adhere
to the above (self-healing and high availability) by default (with
options available to disable these features to reduce control plane
resource requirements if so required). It is hoped that other
cloud providers will also follow the above guidelines, but the
above 3 are the primary canonical use cases.
1. **Resilience to some correlated failures:** Kubernetes clusters
which span multiple availability zones in a region should by
default be resilient to complete failure of one entire availability
zone (by similarly providing self-healing and high availability in
the default cluster creation scripts as above).
1. **Default implementation shared across cloud providers:** The
differences between the default implementations of the above for
GCE, AWS and basic bare metal should be minimized. This implies
using shared libraries across these providers in the default
scripts in preference to highly customized implementations per
cloud provider. This is not to say that highly differentiated,
customized per-cloud cluster creation processes (e.g. for GKE on
GCE, or some hosted Kubernetes provider on AWS) are discouraged.
But those fall squarely outside the basic cross-platform OSS
Kubernetes distro.
1. **Self-hosting:** Where possible, Kubernetes's existing mechanisms
for achieving system resilience (replication controllers, health
checking, service load balancing etc) should be used in preference
to building a separate set of mechanisms to achieve the same thing.
This implies that self-hosting (running the Kubernetes control plane on
Kubernetes) is strongly preferred, with the caveat below.
1. **Recovery from catastrophic failure:** The ability to quickly and
reliably recover a cluster from catastrophic failure is critical,
and should not be compromised by the above goal to self-host
(i.e. it goes without saying that the cluster should be quickly and
reliably recoverable, even if the cluster control plane is
broken). This implies that such catastrophic failure scenarios
should be carefully thought out, and the subject of regular
continuous integration testing, and disaster recovery exercises.
## Relative Priorities
1. **(Possibly manual) recovery from catastrophic failures:** Having a Kubernetes cluster, and all
   applications running inside it, disappear forever is perhaps the worst
   possible failure mode. So it is critical that we be able to
   recover the applications running inside a cluster from such
   failures within some well-bounded time period.
1. In theory a cluster can be recovered by replaying all API calls
that have ever been executed against it, in order, but most
often that state has been lost, and/or is scattered across
multiple client applications or groups. So in general it is
probably infeasible.
1. In theory a cluster can also be recovered to some relatively
recent non-corrupt backup/snapshot of the disk(s) backing the
etcd cluster state. But we have no default consistent
backup/snapshot, verification or restoration process. And we
don't routinely test restoration, so even if we did routinely
perform and verify backups, we have no hard evidence that we
can in practice effectively recover from catastrophic cluster
failure or data corruption by restoring from these backups. So
there's more work to be done here.
1. **Self-healing:** Most major cloud providers provide the ability to
easily and automatically replace failed virtual machines within a
small number of minutes (e.g. GCE
[Auto-restart](https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#autorestart)
and Managed Instance Groups,
AWS[ Auto-recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/)
and [Auto scaling](https://aws.amazon.com/autoscaling/) etc). This
can fairly trivially be used to reduce control-plane down-time due
to machine failure to a small number of minutes per failure
(i.e. typically around "3 nines" availability), provided that:
    1. cluster persistent state (i.e. etcd disks) is either:
        1. truly persistent (i.e. remote persistent disks), or
        1. reconstructible (e.g. using etcd [dynamic member
           addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member),
           sketched briefly after this list, or [backup and
           recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery)).
    1. and boot disks are either:
        1. truly persistent (i.e. remote persistent disks), or
        1. reconstructible (e.g. using boot-from-snapshot,
           boot-from-pre-configured-image or
           boot-from-auto-initializing image).
1. **High Availability:** This has the potential to increase
availability above the approximately "3 nines" level provided by
automated self-healing, but it's somewhat more complex, and
requires additional resources (e.g. redundant API servers and etcd
quorum members). In environments where cloud-assisted automatic
self-healing might be infeasible (e.g. on-premise bare-metal
deployments), it also gives cluster administrators more time to
respond (e.g. replace/repair failed machines) without incurring
system downtime.
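
As a concrete illustration of the etcd "dynamic member addition" step referenced
above, here is a minimal Go sketch using the etcd v2 client library. The endpoint
and peer URLs are hypothetical, and a real replacement workflow would also remove
the dead member and start etcd on the new node; this only shows the
member-registration call.

```
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/client"
	"golang.org/x/net/context"
)

func main() {
	// Connect to a surviving etcd member (hypothetical endpoint).
	c, err := client.New(client.Config{
		Endpoints:               []string{"http://10.240.0.1:2379"},
		HeaderTimeoutPerRequest: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	mapi := client.NewMembersAPI(c)

	// Register the replacement node's peer URL with the quorum before
	// starting etcd on it ("dynamic member addition"). The new node is
	// then started with --initial-cluster-state=existing.
	m, err := mapi.Add(context.Background(), "http://10.240.0.4:2380")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("added member with ID %s; now start etcd on the new node\n", m.ID)
}
```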
## Design and Status (as of December 2015)
<table>
<tr>
<td><b>Control Plane Component</b></td>
<td><b>Resilience Plan</b></td>
<td><b>Current Status</b></td>
</tr>
<tr>
<td><b>API Server</b></td>
<td>
Multiple stateless, self-hosted, self-healing API servers behind an HA
load balancer, built out by the default "kube-up" automation on GCE,
AWS and basic bare metal (BBM). Note that the single-host approach of
having etcd listen only on localhost to ensure that only the API server can
connect to it will no longer work, so alternative security will be
needed in this regard (either using firewall rules, SSL certs, or
something else). All necessary flags are currently supported to enable
SSL between API server and etcd (OpenShift runs like this out of the
box), but this needs to be woven into the "kube-up" and related
scripts. Detailed design of self-hosting and the related bootstrapping
and catastrophic failure recovery will be covered in a separate
design doc.
</td>
<td>
No scripted self-healing or HA on GCE, AWS or basic bare metal
currently exists in the OSS distro. To be clear, "no self-healing"
means that even if multiple API servers (for example) are provisioned for HA
purposes, nothing replaces them when they fail, so eventually the
system will fail. Self-healing and HA can be set up
manually by following documented instructions, but this is not
currently an automated process, and it is not tested as part of
continuous integration. So it's probably safest to assume that it
doesn't actually work in practice.
</td>
</tr>
<tr>
<td><b>Controller manager and scheduler</b></td>
<td>
Multiple self-hosted, self-healing, warm-standby, stateless controller
managers and schedulers with leader election and automatic failover of API server
clients (see the sketch after this table), automatically installed by the
default "kube-up" automation.
<td>As above.</td>
</tr>
<tr>
<td><b>etcd</b></td>
<td>
Multiple (3-5) etcd quorum members behind a load balancer with session
affinity (to prevent clients from being bounced from one to another).
Regarding self-healing, if a node running etcd goes down, it is always necessary to do three
things:
<ol>
<li>allocate a new node (not necessary if running etcd as a pod, in
which case specific measures are required to prevent user pods from
interfering with system pods, for example using node selectors),
<li>start an etcd replica on that new node,
<li>have the new replica recover the etcd state.
</ol>
In the case of local disk (which fails in concert with the machine), the etcd
state must be recovered from the other replicas. This is called <A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member">dynamic member
addition</A>.
In the case of remote persistent disk, the etcd state can be recovered
by attaching the remote persistent disk to the replacement node, so
the state is recoverable even if all other replicas are down.
There are also significant performance differences between local disks and remote
persistent disks. For example, the <A HREF="https://cloud.google.com/compute/docs/disks/#comparison_of_disk_types">sustained throughput of
local disks in GCE is approximately 20x that of remote disks</A>.
Hence we suggest that self-healing be provided by remotely mounted persistent disks in
non-performance-critical, single-zone cloud deployments. For
performance-critical installations, faster local SSDs should be used,
in which case remounting on node failure is not an option, so
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">etcd runtime configuration</A>
should be used to replace the failed machine. Similarly, for
cross-zone self-healing, cloud persistent disks are zonal, so
automatic
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">runtime configuration</A>
is required. Basic bare metal deployments cannot generally
rely on
remote persistent disks, so the same approach applies there.
</td>
<td>
<A HREF="http://kubernetes.io/v1.1/docs/admin/high-availability.html">
Somewhat vague instructions exist</A>
on how to set some of this up manually in a self-hosted
configuration. But automatic bootstrapping and self-healing is not
described (and is not implemented for the non-PD cases). This all
still needs to be automated and continuously tested.
</td>
</tr>
</table>
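
To make the "automatic failover of API server clients" mentioned in the table
concrete, here is a minimal Go sketch of a client that probes a list of API
server replicas and uses the first healthy one. The endpoint URLs are
hypothetical, the sketch omits TLS and authentication, and it is not the actual
mechanism used by the Kubernetes client libraries; it only illustrates the idea.

```
package main

import (
	"fmt"
	"net/http"
	"time"
)

// pickAPIServer returns the first API server endpoint whose /healthz
// endpoint answers with HTTP 200, trying each candidate in order.
func pickAPIServer(endpoints []string) (string, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	for _, ep := range endpoints {
		resp, err := client.Get(ep + "/healthz")
		if err != nil {
			continue // endpoint down or unreachable; try the next one
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			return ep, nil
		}
	}
	return "", fmt.Errorf("no healthy API server among %v", endpoints)
}

func main() {
	// Hypothetical HA API server replicas (TLS/auth omitted for brevity).
	ep, err := pickAPIServer([]string{
		"http://apiserver-1.example.com:8080",
		"http://apiserver-2.example.com:8080",
	})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("using API server:", ep)
}
```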

View File

@ -0,0 +1,550 @@
# Kubernetes Cluster Federation (a.k.a. "Ubernetes")
## Cross-cluster Load Balancing and Service Discovery
### Requirements and System Design
### by Quinton Hoole, Dec 3 2015
## Requirements
### Discovery, Load-balancing and Failover
1. **Internal discovery and connection**: Pods/containers (running in
   a Kubernetes cluster), henceforth referred to as "cluster-internal
   clients" or simply "internal clients", must be able to easily
   discover and connect to endpoints for Kubernetes services on which
   they depend in a consistent way, irrespective of whether those
   services exist in a different Kubernetes cluster within the same
   cluster federation.
1. **External discovery and connection**: External clients (running
outside a Kubernetes cluster) must be able to discover and connect
to endpoints for Kubernetes services on which they depend.
   1. **External clients predominantly speak HTTP(S)**: External
      clients are most often, but not always, web browsers, or at
      least speak HTTP(S). Notable exceptions include enterprise
      message buses (Java, TLS), DNS servers (UDP),
      SIP servers and databases.
1. **Find the "best" endpoint:** Upon initial discovery and
connection, both internal and external clients should ideally find
"the best" endpoint if multiple eligible endpoints exist. "Best"
in this context implies the closest (by network topology) endpoint
that is both operational (as defined by some positive health check)
and not overloaded (by some published load metric). For example:
1. An internal client should find an endpoint which is local to its
own cluster if one exists, in preference to one in a remote
cluster (if both are operational and non-overloaded).
Similarly, one in a nearby cluster (e.g. in the same zone or
region) is preferable to one further afield.
1. An external client (e.g. in New York City) should find an
endpoint in a nearby cluster (e.g. U.S. East Coast) in
preference to one further away (e.g. Japan).
1. **Easy fail-over:** If the endpoint to which a client is connected
becomes unavailable (no network response/disconnected) or
overloaded, the client should reconnect to a better endpoint,
somehow.
1. In the case where there exist one or more connection-terminating
load balancers between the client and the serving Pod, failover
might be completely automatic (i.e. the client's end of the
connection remains intact, and the client is completely
oblivious of the fail-over). This approach incurs network speed
and cost penalties (by traversing possibly multiple load
balancers), but requires zero smarts in clients, DNS libraries,
recursing DNS servers etc, as the IP address of the endpoint
remains constant over time.
   1. In a scenario where clients need to choose between multiple load
      balancer endpoints (e.g. one per cluster), multiple DNS A
      records associated with a single DNS name enable even relatively
      dumb clients to try the next IP address in the list of returned
      A records (without even necessarily re-issuing a DNS resolution
      request); a sketch of this retry behavior appears after this
      list. For example, all major web browsers will try all A
      records in sequence until a working one is found (TBD: justify
      this claim with details for Chrome, IE, Safari, Firefox).
   1. In a slightly more sophisticated scenario, upon disconnection, a
      smarter client might re-issue a DNS resolution query and
      (modulo DNS record TTLs, which can typically be set as low as 3
      minutes, and buggy DNS resolvers, caches and libraries which
      have been known to ignore TTLs completely) receive updated A
      records specifying a new set of IP addresses to which to
      connect.
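
As a sketch of the multiple-A-record fallback described in the list above, the
following Go program resolves a service name and tries each returned address in
order until one accepts a connection. The hostname and port are hypothetical;
this is the behavior a "well-written client" is assumed to have, not code
shipped with Kubernetes.

```
package main

import (
	"fmt"
	"net"
	"time"
)

// dialService does a fresh DNS resolution and tries each returned A record
// in order, falling through to the next address on connection failure.
func dialService(host, port string) (net.Conn, error) {
	addrs, err := net.LookupHost(host)
	if err != nil {
		return nil, err
	}
	for _, addr := range addrs {
		conn, err := net.DialTimeout("tcp", net.JoinHostPort(addr, port), 3*time.Second)
		if err == nil {
			return conn, nil // first working endpoint wins
		}
	}
	return nil, fmt.Errorf("no reachable endpoint for %s", host)
}

func main() {
	conn, err := dialService("my-service.my-namespace.my-federation.my-domain.com", "2379")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```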
### Portability
A Kubernetes application configuration (e.g. for a Pod, Replication
Controller, Service etc) should be able to be successfully deployed
into any Kubernetes Cluster or Ubernetes Federation of Clusters,
without modification. More specifically, a typical configuration
should work correctly (although possibly not optimally) across any of
the following environments:
1. A single Kubernetes Cluster on one cloud provider (e.g. Google
Compute Engine, GCE)
1. A single Kubernetes Cluster on a different cloud provider
(e.g. Amazon Web Services, AWS)
1. A single Kubernetes Cluster on a non-cloud, on-premise data center
1. A Federation of Kubernetes Clusters all on the same cloud provider
(e.g. GCE)
1. A Federation of Kubernetes Clusters across multiple different cloud
providers and/or on-premise data centers (e.g. one cluster on
GCE/GKE, one on AWS, and one on-premise).
### Trading Portability for Optimization
It should be possible to explicitly opt out of portability across some
subset of the above environments in order to take advantage of
non-portable load balancing and DNS features of one or more
environments. More specifically, for example:
1. For HTTP(S) applications running on GCE-only Federations,
[GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
should be usable. These provide single, static global IP addresses
which load balance and fail over globally (i.e. across both regions
and zones). These allow for really dumb clients, but they only
work on GCE, and only for HTTP(S) traffic.
1. For non-HTTP(S) applications running on GCE-only Federations within
a single region,
[GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
should be usable. These provide TCP (i.e. both HTTP/S and
non-HTTP/S) load balancing and failover, but only on GCE, and only
within a single region.
[Google Cloud DNS](https://cloud.google.com/dns) can be used to
route traffic between regions (and between different cloud
providers and on-premise clusters, as it's plain DNS, IP only).
1. For applications running on AWS-only Federations,
[AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
should be usable. These provide both L7 (HTTP(S)) and L4 load
balancing, but only within a single region, and only on AWS
([AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be
used to load balance and fail over across multiple regions, and is
also capable of resolving to non-AWS endpoints).
## Component Cloud Services
Ubernetes cross-cluster load balancing is built on top of the following:
1. [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
provide single, static global IP addresses which load balance and
fail over globally (i.e. across both regions and zones). These
allow for really dumb clients, but they only work on GCE, and only
for HTTP(S) traffic.
1. [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
provide both HTTP(S) and non-HTTP(S) load balancing and failover,
but only on GCE, and only within a single region.
1. [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
provide both L7 (HTTP(S)) and L4 load balancing, but only within a
single region, and only on AWS.
1. [Google Cloud DNS](https://cloud.google.com/dns) (or any other
   programmable DNS service, like
   [CloudFlare](http://www.cloudflare.com)) can be used to route
   traffic between regions (and between different cloud providers and
   on-premise clusters, as it's plain DNS, IP only). Google Cloud DNS
doesn't provide any built-in geo-DNS, latency-based routing, health
checking, weighted round robin or other advanced capabilities.
It's plain old DNS. We would need to build all the aforementioned
on top of it. It can provide internal DNS services (i.e. serve RFC
1918 addresses).
1. [AWS Route 53 DNS service](https://aws.amazon.com/route53/) can
   be used to load balance and fail over across regions, and is also
   capable of routing to non-AWS endpoints. It provides built-in
geo-DNS, latency-based routing, health checking, weighted
round robin and optional tight integration with some other
AWS services (e.g. Elastic Load Balancers).
1. Kubernetes L4 Service Load Balancing: This provides both a
[virtual cluster-local](http://kubernetes.io/v1.1/docs/user-guide/services.html#virtual-ips-and-service-proxies)
and a
[real externally routable](http://kubernetes.io/v1.1/docs/user-guide/services.html#type-loadbalancer)
service IP which is load-balanced (currently simple round-robin)
across the healthy pods comprising a service within a single
Kubernetes cluster.
1. [Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html): A generic wrapper around cloud-provided L4 and L7 load balancing services, and roll-your-own load balancers run in pods, e.g. HAProxy.
## Ubernetes API
The Ubernetes API for load balancing should be compatible with the
equivalent Kubernetes API, to ease porting of clients between
Ubernetes and Kubernetes. Further details below.
## Common Client Behavior
To be useful, our load balancing solution needs to work properly with
real client applications. There are a few different classes of
those...
### Browsers
Browsers are the most common external clients. They generally behave as
"well-written clients"; see below.
### Well-written clients
1. Do a DNS resolution every time they connect.
1. Don't cache beyond TTL (although a small percentage of the DNS
servers on which they rely might).
1. Do try multiple A records (in order) to connect.
1. (in an ideal world) Do use SRV records rather than hard-coded port numbers.
Examples:
+ all common browsers (except for SRV records)
+ ...
### Dumb clients
1. Don't do a DNS resolution every time they connect (or do cache
beyond the TTL).
1. Do try multiple A records
Examples:
+ ...
### Dumber clients
1. Only do a DNS lookup once on startup.
1. Only try the first returned DNS A record.
Examples:
+ ...
### Dumbest clients
1. Never do a DNS lookup - are pre-configured with a single (or
possibly multiple) fixed server IP(s). Nothing else matters.
## Architecture and Implementation
### General control plane architecture
Each cluster hosts one or more Ubernetes master components (Ubernetes API servers, controller managers with leader election, and
etcd quorum members). This is documented in more detail in a
[separate design doc: Kubernetes/Ubernetes Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#).
In the description below, assume that 'n' clusters, named
'cluster-1' ... 'cluster-n', have been registered against an Ubernetes
Federation "federation-1", each with its own set of Kubernetes API
endpoints, i.e.
[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1),
[http://endpoint-2.cluster-1](http://endpoint-2.cluster-1)
... [http://endpoint-m.cluster-n](http://endpoint-m.cluster-n).
### Federated Services
Ubernetes Services are pretty straightforward. They consist of
multiple equivalent underlying Kubernetes Services, each with its
own external endpoint, and a load balancing mechanism across them.
Let's work through how exactly that works in practice.
Our user creates the following Ubernetes Service (against an Ubernetes
API endpoint):

    $ kubectl create -f my-service.yaml --context="federation-1"

where `my-service.yaml` contains the following:

    kind: Service
    metadata:
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
    spec:
      ports:
      - port: 2379
        protocol: TCP
        targetPort: 2379
        name: client
      - port: 2380
        protocol: TCP
        targetPort: 2380
        name: peer
      selector:
        run: my-service
      type: LoadBalancer
Ubernetes in turn creates one equivalent service (identical config to
the above) in each of the underlying Kubernetes clusters, each of
which results in something like this:

    $ kubectl get -o yaml --context="cluster-1" service my-service
    apiVersion: v1
    kind: Service
    metadata:
      creationTimestamp: 2015-11-25T23:35:25Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      resourceVersion: "147365"
      selfLink: /api/v1/namespaces/my-namespace/services/my-service
      uid: 33bfc927-93cd-11e5-a38c-42010af00002
    spec:
      clusterIP: 10.0.153.185
      ports:
      - name: client
        nodePort: 31333
        port: 2379
        protocol: TCP
        targetPort: 2379
      - name: peer
        nodePort: 31086
        port: 2380
        protocol: TCP
        targetPort: 2380
      selector:
        run: my-service
      sessionAffinity: None
      type: LoadBalancer
    status:
      loadBalancer:
        ingress:
        - ip: 104.197.117.10
Similar services are created in `cluster-2` and `cluster-3`, each of
which is allocated its own `spec.clusterIP` and
`status.loadBalancer.ingress.ip`.
In Ubernetes `federation-1`, the resulting federated service looks as follows:

    $ kubectl get -o yaml --context="federation-1" service my-service
    apiVersion: v1
    kind: Service
    metadata:
      creationTimestamp: 2015-11-25T23:35:23Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      resourceVersion: "157333"
      selfLink: /api/v1/namespaces/my-namespace/services/my-service
      uid: 33bfc927-93cd-11e5-a38c-42010af00007
    spec:
      clusterIP:
      ports:
      - name: client
        nodePort: 31333
        port: 2379
        protocol: TCP
        targetPort: 2379
      - name: peer
        nodePort: 31086
        port: 2380
        protocol: TCP
        targetPort: 2380
      selector:
        run: my-service
      sessionAffinity: None
      type: LoadBalancer
    status:
      loadBalancer:
        ingress:
        - hostname: my-service.my-namespace.my-federation.my-domain.com
Note that the federated service:
1. is API-compatible with a vanilla Kubernetes service,
1. has no clusterIP (as it is cluster-independent), and
1. has a federation-wide load balancer hostname.
In addition to the set of underlying Kubernetes services (one per
cluster) described above, Ubernetes has also created a DNS name
(e.g. on [Google Cloud DNS](https://cloud.google.com/dns) or
[AWS Route 53](https://aws.amazon.com/route53/), depending on
configuration) which provides load balancing across all of those
services. For example, in a very basic configuration:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.117.10
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157
Each of the above IP addresses (which are just the external load
balancer ingress IP's of each cluster service) is of course load
balanced across the pods comprising the service in each cluster.
In a more sophisticated configuration (e.g. on GCE or GKE), Ubernetes
automatically creates a
[GCE Global L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
which exposes a single, globally load-balanced IP:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 107.194.17.44
Optionally, Ubernetes also configures the local DNS servers (SkyDNS)
in each Kubernetes cluster to preferentially return the local
clusterIP for the service in that cluster, with other clusters'
external service IP's (or a global load-balanced IP) also configured
for failover purposes:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 10.0.153.185
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157
If Ubernetes Global Service Health Checking is enabled, multiple
service health checkers running across the federated clusters
collaborate to monitor the health of the service endpoints, and
automatically remove unhealthy endpoints from the DNS record (e.g. a
majority quorum is required to vote a service endpoint unhealthy, to
avoid false positives due to individual health checker network
isolation).
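
A minimal Go sketch of the majority-quorum vote described above; the checker
names are hypothetical and the real health checking protocol is not yet
specified, so this only illustrates how quorum voting avoids false positives
from an isolated checker.

```
package main

import "fmt"

// endpointUnhealthy returns true only if a strict majority of the federated
// health checkers report the endpoint as failed, so that a single
// network-isolated checker cannot remove a healthy endpoint from DNS.
func endpointUnhealthy(votes map[string]bool) bool {
	failed := 0
	for _, sawFailure := range votes {
		if sawFailure {
			failed++
		}
	}
	return failed*2 > len(votes)
}

func main() {
	votes := map[string]bool{
		"checker-cluster-1": true,  // this checker could not reach the endpoint
		"checker-cluster-2": false, // these two could
		"checker-cluster-3": false,
	}
	// Prints "remove from DNS: false": one failed probe is not a quorum.
	fmt.Println("remove from DNS:", endpointUnhealthy(votes))
}
```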
### Federated Replication Controllers
So far we have a federated service defined, with a resolvable load
balancer hostname by which clients can reach it, but no pods serving
traffic directed there. So now we need a Federated Replication
Controller. These are also fairly straightforward, consisting
of multiple underlying Kubernetes Replication Controllers which do the
hard work of keeping the desired number of Pod replicas alive in each
Kubernetes cluster.

    $ kubectl create -f my-service-rc.yaml --context="federation-1"

where `my-service-rc.yaml` contains the following:

    kind: ReplicationController
    metadata:
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
    spec:
      replicas: 6
      selector:
        run: my-service
      template:
        metadata:
          labels:
            run: my-service
        spec:
          containers:
          - image: gcr.io/google_samples/my-service:v1
            name: my-service
            ports:
            - containerPort: 2379
              protocol: TCP
            - containerPort: 2380
              protocol: TCP
Ubernetes in turn creates one equivalent replication controller
(identical config to the above, except for the replica count) in each
of the underlying Kubernetes clusters, each of which results in
something like this:

    $ ./kubectl get -o yaml rc my-service --context="cluster-1"
    kind: ReplicationController
    metadata:
      creationTimestamp: 2015-12-02T23:00:47Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      selfLink: /api/v1/namespaces/my-namespace/replicationcontrollers/my-service
      uid: 86542109-9948-11e5-a38c-42010af00002
    spec:
      replicas: 2
      selector:
        run: my-service
      template:
        metadata:
          labels:
            run: my-service
        spec:
          containers:
          - image: gcr.io/google_samples/my-service:v1
            name: my-service
            ports:
            - containerPort: 2379
              protocol: TCP
            - containerPort: 2380
              protocol: TCP
            resources: {}
          dnsPolicy: ClusterFirst
          restartPolicy: Always
    status:
      replicas: 2
The exact number of replicas created in each underlying cluster will
of course depend on what scheduling policy is in force. In the above
example, the scheduler created an equal number of replicas (2) in each
of the three underlying clusters, to make up the total of 6 replicas
required. To handle entire cluster failures, various approaches are possible,
including:
1. **simple overprovisioning**, such that sufficient replicas remain even if a
   cluster fails (see the sketch after this list). This wastes some resources,
   but is simple and reliable.
2. **pod autoscaling**, where the replication controller in each
cluster automatically and autonomously increases the number of
replicas in its cluster in response to the additional traffic
diverted from the
failed cluster. This saves resources and is relatively simple,
but there is some delay in the autoscaling.
3. **federated replica migration**, where the Ubernetes Federation
Control Plane detects the cluster failure and automatically
increases the replica count in the remaining clusters to make up
for the lost replicas in the failed cluster. This does not seem to
offer any benefits relative to pod autoscaling above, and is
arguably more complex to implement, but we note it here as a
possibility.
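
As a sketch of the simple overprovisioning arithmetic in option 1 above: to
survive the loss of any single cluster, each cluster must carry enough replicas
that the remaining clusters still meet the requirement. The function below is
illustrative only; the replica and cluster counts are taken from the example
above.

```
package main

import "fmt"

// replicasPerCluster returns how many replicas to run in each of "clusters"
// clusters so that at least "required" replicas survive the failure of any
// single cluster, i.e. ceil(required / (clusters - 1)).
func replicasPerCluster(required, clusters int) int {
	if clusters < 2 {
		return required // nothing to overprovision against
	}
	surviving := clusters - 1
	return (required + surviving - 1) / surviving
}

func main() {
	// 6 required replicas across 3 clusters: 3 per cluster, 9 in total,
	// versus 2 per cluster (6 total) without overprovisioning.
	per := replicasPerCluster(6, 3)
	fmt.Printf("%d replicas per cluster, %d total\n", per, per*3)
}
```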
### Implementation Details
The implementation approach and architecture are very similar to
those of Kubernetes, so if you're familiar with how Kubernetes works, none of
what follows will be surprising. One additional design driver not
present in Kubernetes is that Ubernetes aims to be resilient to
individual cluster and availability zone failures. So the control
plane spans multiple clusters. More specifically:
+ Ubernetes runs its own distinct set of API servers (typically one
  or more per underlying Kubernetes cluster). These are completely
  distinct from the Kubernetes API servers for each of the underlying
  clusters.
+ Ubernetes runs its own distinct quorum-based metadata store (etcd,
  by default). Approximately one quorum member runs in each underlying
  cluster ("approximately" because we aim for an odd number of quorum
  members, and typically don't want more than 5 quorum members, even
  if we have a larger number of federated clusters, so 2 clusters->3
  quorum members, 3->3, 4->3, 5->5, 6->5, 7->5 etc; a sketch of this
  sizing rule follows this list).
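
The quorum sizing rule in the second bullet can be written down directly; this
small Go sketch just reproduces the stated mapping (odd member count, at most
five) and is not code from the Ubernetes control plane.

```
package main

import "fmt"

// quorumMembers returns the number of etcd quorum members for a federation
// of n clusters: an odd number, roughly one per cluster, capped at five.
func quorumMembers(n int) int {
	if n >= 5 {
		return 5
	}
	if n < 3 || n%2 == 0 {
		return 3 // round up to the nearest odd count, minimum three
	}
	return n
}

func main() {
	for n := 2; n <= 7; n++ {
		// Prints 2->3, 3->3, 4->3, 5->5, 6->5, 7->5, matching the text above.
		fmt.Printf("%d clusters -> %d quorum members\n", n, quorumMembers(n))
	}
}
```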
Cluster Controllers in Ubernetes watch the Ubernetes API
server/etcd state, and apply changes to the underlying Kubernetes
clusters accordingly. They also provide the anti-entropy mechanism that
reconciles Ubernetes "desired desired" state against Kubernetes
"actual desired" state.

View File

@ -0,0 +1,434 @@
# Ubernetes Design Spec (phase one)
**Huawei PaaS Team**
## INTRODUCTION
In this document we propose a design for the “Control Plane” of
Kubernetes (K8S) federation (a.k.a. “Ubernetes”). For background of
this work please refer to
[this proposal](../../docs/proposals/federation.md).
The document is arranged as follows. First we briefly list the scenarios
and use cases that motivate K8S federation work. These use cases drive
the design, and they also serve to verify it. We then summarize the
functionality requirements from these use cases, and define the "in
scope" functionality that will be covered by this design (phase
one). After that we give an overview of the proposed architecture, API
and building blocks, and walk through several activity flows to
show how these building blocks work together to support the use cases.
## REQUIREMENTS
There are many reasons why customers may want to build a K8S
federation:
+ **High Availability:** Customers want to be immune to the outage of
a single availability zone, region or even a cloud provider.
+ **Sensitive workloads:** Some workloads can only run on a particular
cluster. They cannot be scheduled to or migrated to other clusters.
+ **Capacity overflow:** Customers prefer to run workloads on a
primary cluster. But if the capacity of the cluster is not
sufficient, workloads should be automatically distributed to other
clusters.
+ **Vendor lock-in avoidance:** Customers want to spread their
workloads on different cloud providers, and can easily increase or
decrease the workload proportion of a specific provider.
+ **Cluster Size Enhancement:** Currently a K8S cluster can only support
  a limited size. While the community is actively improving this, it can
  be expected that cluster size will be a problem if K8S is used for
  large workloads or public PaaS infrastructure. While we can separate
  different tenants into different clusters, it would be good to have a
  unified view.
Here are the functionality requirements derived from above use cases:
+ Clients of the federation control plane API server can register and deregister clusters.
+ Workloads should be spread to different clusters according to the
workload distribution policy.
+ Pods are able to discover and connect to services hosted in other
clusters (in cases where inter-cluster networking is necessary,
desirable and implemented).
+ Traffic to these pods should be spread across clusters (in a manner
similar to load balancing, although it might not be strictly
speaking balanced).
+ The control plane needs to know when a cluster is down, and migrate
the workloads to other clusters.
+ Clients have a unified view and a central control point for above
activities.
## SCOPE
It's difficult to arrive in one step at a perfect design that implements
all the above requirements. Therefore we will take an iterative
approach to designing and building the system. This document describes
phase one of the overall work. In phase one we will cover only the
following objectives:
+ Define the basic building blocks and API objects of control plane
+ Implement a basic end-to-end workflow
+ Clients register federated clusters
+ Clients submit a workload
+ The workload is distributed to different clusters
+ Service discovery
+ Load balancing
The following parts are NOT covered in phase one:
+ Authentication and authorization (other than basic client
authentication against the ubernetes API, and from ubernetes control
plane to the underlying kubernetes clusters).
+ Deployment units other than replication controller and service
+ Complex distribution policy of workloads
+ Service affinity and migration
## ARCHITECTURE
The overall architecture of the control plane is shown below:
![Ubernetes Architecture](ubernetes-design.png)
Some design principles we are following in this architecture:
1. Keep the underlying K8S clusters independent. They should have no
   knowledge of the control plane or of each other.
1. Keep the Ubernetes API interface compatible with the K8S API as much as
   possible.
1. Re-use concepts from K8S as much as possible. This reduces
   the learning curve for customers and is good for adoption.

Below is a brief description of each module contained in the above diagram.
## Ubernetes API Server
The API Server in the Ubernetes control plane works just like the API
Server in K8S. It talks to a distributed key-value store to persist,
retrieve and watch API objects. This store is completely distinct
from the Kubernetes key-value stores (etcd) in the underlying
Kubernetes clusters. We still use `etcd` as the distributed
storage so customers don't need to learn and manage a different
storage system, although it is envisaged that other storage systems
(Consul, ZooKeeper) will probably be developed and supported over
time.
## Ubernetes Scheduler
The Ubernetes Scheduler schedules resources onto the underlying
Kubernetes clusters. For example it watches for unscheduled Ubernetes
replication controllers (those that have not yet been scheduled onto
underlying Kubernetes clusters) and performs the global scheduling
work. For each unscheduled replication controller, it calls the policy
engine to decide how to split workloads among clusters. It creates a
Kubernetes Replication Controller on one or more underlying clusters,
and posts them back to `etcd` storage.
One subtlety worth noting here is that the scheduling decision is
arrived at by combining the application-specific request from the user (which might
include, for example, placement constraints), and the global policy specified
by the federation administrator (for example, "prefer on-premise
clusters over AWS clusters" or "spread load equally across clusters").
## Ubernetes Cluster Controller
The cluster controller
performs the following two kinds of work:
1. It watches all the sub-resources that are created by Ubernetes
components, like a sub-RC or a sub-service. And then it creates the
corresponding API objects on the underlying K8S clusters.
1. It periodically retrieves the available resource metrics from the
   underlying K8S cluster, and updates them in the object status of the
   `cluster` API object. An alternative design might be to run a pod
in each underlying cluster that reports metrics for that cluster to
the Ubernetes control plane. Which approach is better remains an
open topic of discussion.
## Ubernetes Service Controller
The Ubernetes service controller is a federation-level implementation
of the K8S service controller. It watches service resources created on the
control plane, and creates corresponding K8S services on each involved K8S
cluster. Besides interacting with service resources on each
individual K8S cluster, the Ubernetes service controller also
performs some global DNS registration work.
## API OBJECTS
## Cluster
Cluster is a new first-class API object introduced in this design. For
each registered K8S cluster there will be such an API resource in the
control plane. Clients register or deregister a cluster by
sending the corresponding REST requests to the following URL:
`/api/{$version}/clusters`. Because the control plane behaves like a
regular K8S client to the underlying clusters, the spec of a cluster
object contains necessary properties like the K8S cluster address and
credentials. The status of a cluster API object will contain the
following information (an illustrative sketch of these types appears
after the tables below):
1. Its current lifecycle phase
1. Cluster resource metrics for scheduling decisions
1. Other metadata, like the version of the cluster
$version.clusterSpec
<table style="border:1px solid #000000;border-collapse:collapse;">
<tbody>
<tr>
<td style="padding:5px;"><b>Name</b><br>
</td>
<td style="padding:5px;"><b>Description</b><br>
</td>
<td style="padding:5px;"><b>Required</b><br>
</td>
<td style="padding:5px;"><b>Schema</b><br>
</td>
<td style="padding:5px;"><b>Default</b><br>
</td>
</tr>
<tr>
<td style="padding:5px;">Address<br>
</td>
<td style="padding:5px;">address of the cluster<br>
</td>
<td style="padding:5px;">yes<br>
</td>
<td style="padding:5px;">address<br>
</td>
<td style="padding:5px;"><p></p></td>
</tr>
<tr>
<td style="padding:5px;">Credential<br>
</td>
<td style="padding:5px;">the type (e.g. bearer token, client
certificate etc) and data of the credential used to access cluster. Its used for system routines (not behalf of users)<br>
</td>
<td style="padding:5px;">yes<br>
</td>
<td style="padding:5px;">string <br>
</td>
<td style="padding:5px;"><p></p></td>
</tr>
</tbody>
</table>
$version.clusterStatus
<table style="border:1px solid #000000;border-collapse:collapse;">
<tbody>
<tr>
<td style="padding:5px;"><b>Name</b><br>
</td>
<td style="padding:5px;"><b>Description</b><br>
</td>
<td style="padding:5px;"><b>Required</b><br>
</td>
<td style="padding:5px;"><b>Schema</b><br>
</td>
<td style="padding:5px;"><b>Default</b><br>
</td>
</tr>
<tr>
<td style="padding:5px;">Phase<br>
</td>
<td style="padding:5px;">the recently observed lifecycle phase of the cluster<br>
</td>
<td style="padding:5px;">yes<br>
</td>
<td style="padding:5px;">enum<br>
</td>
<td style="padding:5px;"><p></p></td>
</tr>
<tr>
<td style="padding:5px;">Capacity<br>
</td>
<td style="padding:5px;">represents the available resources of a cluster<br>
</td>
<td style="padding:5px;">yes<br>
</td>
<td style="padding:5px;">any<br>
</td>
<td style="padding:5px;"><p></p></td>
</tr>
<tr>
<td style="padding:5px;">ClusterMeta<br>
</td>
<td style="padding:5px;">Other cluster metadata like the version<br>
</td>
<td style="padding:5px;">yes<br>
</td>
<td style="padding:5px;">ClusterMeta<br>
</td>
<td style="padding:5px;"><p></p></td>
</tr>
</tbody>
</table>
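To make the two tables above concrete, here is an illustrative Go sketch of what
the Cluster API types might look like. All field and type names are assumptions
for the sketch (the real schema is defined by the API once implemented); the
phase values match the lifecycle phases listed below.

```
// Package federation: illustrative types only, mirroring the clusterSpec and
// clusterStatus tables above; not the final API definition.
package federation

// ClusterPhase is the lifecycle phase of a registered cluster.
type ClusterPhase string

const (
	ClusterPending    ClusterPhase = "pending"
	ClusterRunning    ClusterPhase = "running"
	ClusterOffline    ClusterPhase = "offline"
	ClusterTerminated ClusterPhase = "terminated"
)

// ClusterSpec holds what the control plane needs to reach a member cluster.
type ClusterSpec struct {
	Address    string // address of the underlying K8S cluster's API server
	Credential string // type and data of the credential used by system routines
}

// ClusterStatus is the most recently observed state of a member cluster.
type ClusterStatus struct {
	Phase       ClusterPhase      // recently observed lifecycle phase
	Capacity    map[string]string // available resources (e.g. CPU, memory)
	ClusterMeta map[string]string // other metadata, e.g. the cluster version
}

// Cluster is the first-class API object registered at /api/{$version}/clusters.
type Cluster struct {
	Name   string
	Spec   ClusterSpec
	Status ClusterStatus
}
```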
**For simplicity we didn't introduce a separate "cluster metrics" API
object here**. The cluster resource metrics are stored in the cluster
status section, just as they are for nodes in K8S. In phase one it
only contains available CPU and memory resources. The
cluster controller will periodically poll the underlying cluster API
Server to get cluster capacity. In phase one it gets the metrics by
simply aggregating metrics from all nodes. In future we will improve
this with more efficient mechanisms, such as leveraging Heapster, and
more metrics will be supported. Similar to node phases in K8S, the "phase"
field includes the following values:
+ pending: newly registered clusters or clusters suspended by admin
for various reasons. They are not eligible for accepting workloads
+ running: clusters in normal status that can accept workloads
+ offline: clusters temporarily down or not reachable
+ terminated: clusters removed from federation
Below is the state transition diagram.
![Cluster State Transition Diagram](ubernetes-cluster-state.png)
## Replication Controller
A global workload submitted to control plane is represented as an
Ubernetes replication controller. When a replication controller
is submitted to control plane, clients need a way to express its
requirements or preferences on clusters. Depending on different use
cases it may be complex. For example:
+ This workload can only be scheduled to cluster Foo. It cannot be
scheduled to any other clusters. (use case: sensitive workloads).
+ This workload prefers cluster Foo. But if there is no available
capacity on cluster Foo, its OK to be scheduled to cluster Bar
(use case: workload )
+ Seventy percent of this workload should be scheduled to cluster Foo,
and thirty percent should be scheduled to cluster Bar (use case:
vendor lock-in avoidance). In phase one, we only introduce a
_clusterSelector_ field to filter acceptable clusters. In default
case there is no such selector and it means any cluster is
acceptable.
Below is a sample of the YAML to create such a replication controller.
```
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller
spec:
  replicas: 5
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
  clusterSelector:
    name in (Foo, Bar)
```
Currently clusterSelector (implemented as a
[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704))
only supports a simple list of acceptable clusters. Workloads will be
evenly distributed on these acceptable clusters in phase one. After
phase one we will define syntax to represent more advanced
constraints, like cluster preference ordering, desired number of
split workloads, desired ratio of workloads spread across different
clusters, etc.
Besides this explicit “clusterSelector” filter, a workload may have
some implicit scheduling restrictions. For example it defines
“nodeSelector” which can only be satisfied on some particular
clusters. How to handle this will be addressed after phase one.
## Ubernetes Services
The Service API object exposed by Ubernetes is similar to service
objects on Kubernetes. It defines the access to a group of pods. The
Ubernetes service controller will create corresponding Kubernetes
service objects on underlying clusters. These are detailed in a
separate design document: [Federated Services](federated-services.md).
## Pod
In phase one we only support scheduling replication controllers. Pod
scheduling will be supported in later phase. This is primarily in
order to keep the Ubernetes API compatible with the Kubernetes API.
## ACTIVITY FLOWS
## Scheduling
The diagram below shows how workloads are scheduled on the Ubernetes control plane:
1. A replication controller is created by the client.
1. The API Server persists it to storage.
1. The cluster controller periodically polls the latest available resource
   metrics from the underlying clusters.
1. The scheduler watches all pending RCs. It picks up an RC, makes
   policy-driven decisions and splits it into different sub-RCs.
1. Each cluster controller watches the sub-RCs bound to its
   corresponding cluster, and picks up any newly created sub-RC.
1. The cluster controller issues requests to the underlying cluster
   API Server to create the RC. In phase one we don't support complex
   distribution policies. The scheduling rule is basically:
   1. If an RC does not specify any nodeSelector, it will be scheduled
      to the least loaded K8S cluster(s) that have enough available
      resources.
   1. If an RC specifies _N_ acceptable clusters in the
      clusterSelector, all replicas will be evenly distributed among
      these clusters (see the sketch after this list).
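
The even-distribution rule in the last step can be sketched as follows. This is
illustrative Go, not the scheduler implementation; the cluster names come from
the clusterSelector example above.

```
package main

import "fmt"

// splitReplicas distributes a federated RC's replica count as evenly as
// possible across the acceptable clusters from its clusterSelector.
func splitReplicas(total int, clusters []string) map[string]int {
	out := make(map[string]int, len(clusters))
	base, extra := total/len(clusters), total%len(clusters)
	for i, c := range clusters {
		out[c] = base
		if i < extra {
			out[c]++ // spread any remainder one replica at a time
		}
	}
	return out
}

func main() {
	// 5 replicas constrained to clusters Foo and Bar: 3 and 2 respectively.
	for cluster, n := range splitReplicas(5, []string{"Foo", "Bar"}) {
		fmt.Printf("%s: %d replicas\n", cluster, n)
	}
}
```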
There is a potential race condition here. Say at time _T1_ the control
plane learns that there are _m_ available resources in a K8S cluster. As
the cluster is working independently, it still accepts workload
requests from other K8S clients, or even another Ubernetes control
plane. The Ubernetes scheduling decision is based on this snapshot of
available resources. However, when the actual RC creation happens on
the cluster at time _T2_, the cluster may no longer have enough resources.
We will address this problem in later phases with proposed
solutions like resource reservation mechanisms.
![Ubernetes Scheduling](ubernetes-scheduling.png)
## Service Discovery
This part has been included in the section “Federated Service” of
document
“[Ubernetes Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md))”. Please
refer to that document for details.

Binary file not shown.

After

Width:  |  Height:  |  Size: 14 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 20 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 38 KiB