# Kubernetes Multi-AZ Clusters

## (a.k.a. "Ubernetes-Lite")

## Introduction

Full Ubernetes will offer sophisticated federation between multiple Kubernetes
clusters, offering true high-availability, multiple provider support &
cloud-bursting, multiple region support, etc. However, many users have
expressed a desire for a "reasonably" highly available cluster that runs in
multiple zones on GCE or availability zones in AWS, and can tolerate the failure
of a single zone without the complexity of running multiple clusters.

Ubernetes-Lite aims to deliver exactly that functionality: to run a single
Kubernetes cluster in multiple zones. It will attempt to make reasonable
scheduling decisions, in particular so that a replication controller's pods are
spread across zones, and it will try to be aware of constraints - for example,
that a volume cannot be mounted on a node in a different zone.

Ubernetes-Lite is deliberately limited in scope; for many advanced functions
the answer will be "use Ubernetes (full)". For example, multiple-region
support is not in scope. Routing affinity (e.g. so that a webserver will
prefer to talk to a backend service in the same zone) is similarly not in
scope.

## Design

These are the main requirements:

1. kube-up must allow bringing up a cluster that spans multiple zones.
1. pods in a replication controller should attempt to spread across zones.
1. pods which require volumes should not be scheduled onto nodes in a different zone.
1. load-balanced services should work reasonably.

### kube-up support

kube-up support for multiple zones will initially be considered
advanced/experimental functionality, so the interface is not going to be
particularly user-friendly at first. As we design the evolution of kube-up, we
will make multiple zones better supported.

For the initial implementation, kube-up must be run multiple times, once for
each zone. The first kube-up will take place as normal, but then for each
additional zone the user must run kube-up again, specifying
`KUBE_USE_EXISTING_MASTER=true` and `KUBE_SUBNET_CIDR=172.20.x.0/24`. This will
then create additional nodes in a different zone, but will register them with
the existing master.

### Zone spreading

This will be implemented by modifying the existing scheduler priority function
`SelectorSpread`. Currently this priority function aims to put pods in an RC
on different hosts; it will be extended so that it spreads first across zones,
and then across hosts.

So that the scheduler does not need to call out to the cloud provider on every
scheduling decision, we must somehow record the zone information for each node.
The implementation of this will be described in the implementation section.

Note that zone spreading is 'best effort'; zones are just one of the factors
in making scheduling decisions, and thus it is not guaranteed that pods will
spread evenly across zones. However, this is likely desirable: if a zone is
overloaded or failing, we still want to schedule the requested number of pods.

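To make the intended behaviour concrete, here is a minimal, self-contained
sketch of zone-aware spreading. It is not the real `SelectorSpread` code or the
scheduler's actual plugin interface; the `Node` type, the 0-10 score range, and
the simple addition of per-node and per-zone counts are illustrative
assumptions only.

```go
package main

import "fmt"

// Node is a simplified stand-in for the scheduler's view of a node; Zone is
// assumed to come from the failure-domain.alpha.kubernetes.io/zone label.
type Node struct {
	Name string
	Zone string
}

// scoreNodes gives each candidate node a score from 0 to 10. Nodes whose host
// or zone already runs more pods from the same controller score lower, so new
// replicas tend to land on emptier hosts and emptier zones.
func scoreNodes(nodes []Node, podsPerNode, podsPerZone map[string]int) map[string]int {
	maxCount := 0
	for _, n := range nodes {
		if c := podsPerNode[n.Name] + podsPerZone[n.Zone]; c > maxCount {
			maxCount = c
		}
	}
	scores := make(map[string]int, len(nodes))
	for _, n := range nodes {
		if maxCount == 0 {
			scores[n.Name] = 10 // no existing replicas anywhere: all nodes equal
			continue
		}
		count := podsPerNode[n.Name] + podsPerZone[n.Zone]
		scores[n.Name] = 10 * (maxCount - count) / maxCount
	}
	return scores
}

func main() {
	nodes := []Node{
		{Name: "node-a", Zone: "us-central1-a"},
		{Name: "node-b", Zone: "us-central1-b"},
	}
	// Two replicas already run on node-a in zone us-central1-a.
	fmt.Println(scoreNodes(nodes,
		map[string]int{"node-a": 2},
		map[string]int{"us-central1-a": 2}))
	// prints map[node-a:0 node-b:10]: the empty zone is strongly preferred
}
```
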
### Volume affinity

Most cloud providers (at least GCE and AWS) cannot attach their persistent
volumes across zones. Thus when a pod that uses a volume is being scheduled,
the volume's zone dictates the zone of the pod. This will be implemented using
a new scheduler predicate (a hard constraint): `VolumeZonePredicate`.

When `VolumeZonePredicate` observes a pod scheduling request that includes a
volume, if that volume is zone-specific, `VolumeZonePredicate` will exclude any
nodes not in that zone.

Again, to avoid the scheduler calling out to the cloud provider, this will rely
on information attached to the volumes. This means that only
PersistentVolumeClaims will be supported, because direct mounts do not have a
place to attach zone information. PersistentVolumes will then include zone
information where volumes are zone-specific.

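The following is a minimal sketch of the predicate logic, assuming that
zone-specific PersistentVolumes and nodes carry the
`failure-domain.alpha.kubernetes.io/zone` label described in the implementation
section; the types here are simplified stand-ins, not the real scheduler
predicate interface.

```go
package main

import "fmt"

// zoneLabel is the reserved label described in the implementation section.
const zoneLabel = "failure-domain.alpha.kubernetes.io/zone"

// PersistentVolume and Node are simplified stand-ins for the real API objects.
type PersistentVolume struct {
	Name   string
	Labels map[string]string
}

type Node struct {
	Name   string
	Labels map[string]string
}

// fitsZone is the heart of the predicate: a node fits unless the volume is
// zone-specific and labelled with a different zone than the node.
func fitsZone(pv PersistentVolume, node Node) bool {
	pvZone, ok := pv.Labels[zoneLabel]
	if !ok {
		return true // volume is not zone-specific
	}
	return pvZone == node.Labels[zoneLabel]
}

func main() {
	pv := PersistentVolume{Name: "pd-1", Labels: map[string]string{zoneLabel: "us-central1-a"}}
	nodeA := Node{Name: "node-a", Labels: map[string]string{zoneLabel: "us-central1-a"}}
	nodeB := Node{Name: "node-b", Labels: map[string]string{zoneLabel: "us-central1-b"}}
	fmt.Println(fitsZone(pv, nodeA), fitsZone(pv, nodeB)) // true false
}
```
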
### Load-balanced services should operate reasonably

For both AWS & GCE, Kubernetes creates a native cloud load-balancer for each
service of type LoadBalancer. The native cloud load-balancers on both AWS &
GCE are region-level, and support load-balancing across instances in multiple
zones (in the same region). For both clouds, the behaviour of the native cloud
load-balancer is reasonable in the face of failures (indeed, this is why clouds
provide load-balancing as a primitive).

For Ubernetes-Lite we will therefore simply rely on the native cloud provider
load balancer behaviour, and we do not anticipate substantial code changes.

One notable shortcoming here is that load-balanced traffic still goes through
kube-proxy-controlled routing, and kube-proxy does not (currently) favor
targeting a pod running on the same instance or even in the same zone. This
will likely produce a lot of unnecessary cross-zone traffic (which is likely
slower and more expensive). This might be sufficiently low-hanging fruit that
we choose to address it in kube-proxy / Ubernetes-Lite, but it can be addressed
after the initial Ubernetes-Lite implementation.

## Implementation

The main implementation points are:

1. how to attach zone information to Nodes and PersistentVolumes
1. how nodes get zone information
1. how volumes get zone information

### Attaching zone information

We must attach zone information to Nodes and PersistentVolumes, and possibly to
other resources in future. There are two obvious alternatives: we can use
labels/annotations, or we can extend the schema to include the information.

For the initial implementation, we propose to use labels. The reasoning is:

1. It is considerably easier to implement.
1. We will reserve the two labels `failure-domain.alpha.kubernetes.io/zone` and
`failure-domain.alpha.kubernetes.io/region` for the two pieces of information
we need. By putting these under the `kubernetes.io` namespace there is no risk
of collision, and by putting them under `alpha.kubernetes.io` we clearly mark
this as an experimental feature.
1. We do not yet know whether these labels will be sufficient for all
environments, nor which entities will require zone information. Labels give us
more flexibility here.
1. Because the labels are reserved, we can move to schema-defined fields in
future using our cross-version mapping techniques.

### Node labeling

We do not want to require an administrator to manually label nodes. Instead, we
will modify the kubelet to include the appropriate labels when it registers
itself. The kubelet can easily obtain this information from the cloud provider.

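As a rough illustration (not the actual kubelet code), the registration path
might look something like the sketch below; the `Zones` interface is a
hypothetical stand-in for the cloud-provider lookup, and the label keys are the
reserved ones described above.

```go
package main

import "fmt"

const (
	labelZone   = "failure-domain.alpha.kubernetes.io/zone"
	labelRegion = "failure-domain.alpha.kubernetes.io/region"
)

// Zones is a hypothetical stand-in for the cloud-provider lookup the kubelet
// would use to discover where its instance is running.
type Zones interface {
	CurrentZone() (zone, region string, err error)
}

// Node is a simplified stand-in for the API object the kubelet registers.
type Node struct {
	Name   string
	Labels map[string]string
}

// buildNode attaches the reserved zone/region labels at registration time.
func buildNode(name string, cloud Zones) (Node, error) {
	zone, region, err := cloud.CurrentZone()
	if err != nil {
		return Node{}, err
	}
	return Node{
		Name: name,
		Labels: map[string]string{
			labelZone:   zone,
			labelRegion: region,
		},
	}, nil
}

// fakeCloud simulates a cloud provider for the example.
type fakeCloud struct{}

func (fakeCloud) CurrentZone() (string, string, error) { return "us-east-1a", "us-east-1", nil }

func main() {
	node, err := buildNode("node-1", fakeCloud{})
	if err != nil {
		panic(err)
	}
	fmt.Println(node.Labels)
}
```
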
### Volume labeling

As with nodes, we do not want to require an administrator to manually label
volumes. We will create an admission controller `PersistentVolumeLabel`.
`PersistentVolumeLabel` will intercept requests to create PersistentVolumes,
and will label them appropriately by calling into the cloud provider.

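A minimal sketch of the idea follows. It does not use the real Kubernetes
admission plugin interface; `CloudZones` and `GetVolumeZone` are hypothetical
stand-ins for the cloud-provider call, and the label keys are the reserved ones
described earlier.

```go
package main

import "fmt"

const (
	labelZone   = "failure-domain.alpha.kubernetes.io/zone"
	labelRegion = "failure-domain.alpha.kubernetes.io/region"
)

// PersistentVolume is a simplified stand-in for the real API object.
type PersistentVolume struct {
	Name   string
	Labels map[string]string
}

// CloudZones is a hypothetical stand-in for the cloud-provider call that
// reports where a volume lives (GCE PD or AWS EBS).
type CloudZones interface {
	GetVolumeZone(volumeName string) (zone, region string, err error)
}

// admitPersistentVolume mimics what the PersistentVolumeLabel admission
// controller would do on a create request: look up the volume's location and
// attach the reserved labels before the object is persisted.
func admitPersistentVolume(pv *PersistentVolume, cloud CloudZones) error {
	zone, region, err := cloud.GetVolumeZone(pv.Name)
	if err != nil {
		return err
	}
	if pv.Labels == nil {
		pv.Labels = map[string]string{}
	}
	pv.Labels[labelZone] = zone
	pv.Labels[labelRegion] = region
	return nil
}

// fakeCloud simulates the cloud provider for the example.
type fakeCloud struct{}

func (fakeCloud) GetVolumeZone(string) (string, string, error) {
	return "us-central1-a", "us-central1", nil
}

func main() {
	pv := &PersistentVolume{Name: "pd-1"}
	if err := admitPersistentVolume(pv, fakeCloud{}); err != nil {
		panic(err)
	}
	fmt.Println(pv.Labels)
}
```
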
## AWS Specific Considerations

The AWS implementation here is fairly straightforward. The AWS API is
region-wide, meaning that a single call will find instances and volumes in all
zones. In addition, instance IDs and volume IDs are unique per-region (and
hence also per-zone). I believe they are actually globally unique, but I do
not know if this is guaranteed; in any case we only need global uniqueness if
we are to span regions, which will not be supported by Ubernetes-Lite (doing
that correctly requires an Ubernetes-Full type approach).

## GCE Specific Considerations

The GCE implementation is more complicated than the AWS implementation because
GCE APIs are zone-scoped. To perform an operation, we must perform one REST
call per zone and combine the results, unless we can determine in advance that
an operation references a particular zone. For many operations we can make
that determination, but in some cases, such as listing all instances, we must
combine results from calls in all relevant zones.

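A sketch of the fan-out-and-merge pattern is shown below; `gceClient` is a
hypothetical stand-in rather than the real GCE API client, and error handling
is simplified.

```go
package main

import "fmt"

// gceClient is a hypothetical stand-in for the zone-scoped GCE API; it is not
// the real GCE SDK.
type gceClient interface {
	// ListInstances returns the instance names in a single zone.
	ListInstances(zone string) ([]string, error)
}

// listAllInstances issues one list call per managed zone and merges the
// results, which is the fan-out the cloud provider must do when an operation
// cannot be pinned to a single zone in advance.
func listAllInstances(client gceClient, zones []string) ([]string, error) {
	var all []string
	for _, zone := range zones {
		instances, err := client.ListInstances(zone)
		if err != nil {
			return nil, err
		}
		all = append(all, instances...)
	}
	return all, nil
}

// fakeGCE simulates per-zone instance lists for the example.
type fakeGCE map[string][]string

func (f fakeGCE) ListInstances(zone string) ([]string, error) { return f[zone], nil }

func main() {
	client := fakeGCE{
		"us-central1-a": {"node-a1"},
		"us-central1-b": {"node-b1", "node-b2"},
	}
	instances, err := listAllInstances(client, []string{"us-central1-a", "us-central1-b"})
	if err != nil {
		panic(err)
	}
	fmt.Println(instances) // [node-a1 node-b1 node-b2]
}
```
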
A further complexity is that GCE volume names are scoped per-zone, not
per-region. Thus it is permitted to have two volumes both named `myvolume` in
two different GCE zones. (Instance names are currently unique per-region, and
thus are not a problem for Ubernetes-Lite.)

The volume scoping leads to a (small) behavioural change for Ubernetes-Lite on
GCE. If you had two volumes both named `myvolume` in two different GCE zones,
this would not be ambiguous when Kubernetes is operating only in a single zone.
But if Ubernetes-Lite is operating in multiple zones, `myvolume` is no longer
sufficient to specify a volume uniquely. Worse, the fact that a volume happens
to be unambiguous at a particular time is no guarantee that it will continue to
be unambiguous in future, because a volume with the same name could
subsequently be created in a second zone. While perhaps unlikely in practice,
we cannot automatically enable Ubernetes-Lite for GCE users if this then causes
volume mounts to stop working.

This suggests that (at least on GCE) Ubernetes-Lite must be optional (i.e.
there must be a feature flag). It may be that we can make this feature
semi-automatic in future, by detecting whether nodes are running in multiple
zones, but it seems likely that kube-up could instead simply set this flag.

For the initial implementation, creating volumes with identical names will
yield undefined results. Later, we may add some way to specify the zone for a
volume (and possibly require that volumes have their zone specified when
running with Ubernetes-Lite). We could add a new `zone` field to the
PersistentVolume type for GCE PD volumes, or we could use a DNS-style dotted
name for the volume name (`<name>.<zone>`).

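If the dotted-name option were chosen, parsing it might look like the sketch
below. This is purely illustrative of the `<name>.<zone>` convention, which has
not been decided on.

```go
package main

import (
	"fmt"
	"strings"
)

// splitVolumeName separates an optional zone suffix from a GCE volume name
// written in the hypothetical "<name>.<zone>" form.
func splitVolumeName(qualified string) (name, zone string, hasZone bool) {
	if i := strings.LastIndex(qualified, "."); i >= 0 {
		return qualified[:i], qualified[i+1:], true
	}
	return qualified, "", false
}

func main() {
	fmt.Println(splitVolumeName("myvolume.us-central1-b")) // myvolume us-central1-b true
	fmt.Println(splitVolumeName("myvolume"))               // myvolume "" false
}
```
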
Initially therefore, the GCE changes will be to:

1. change kube-up to support creation of a cluster in multiple zones
1. pass a flag enabling Ubernetes-Lite with kube-up
1. change the Kubernetes cloud provider to iterate through relevant zones when resolving items
1. tag GCE PD volumes with the appropriate zone information