# Ubernetes Design Spec (phase one)

**Huawei PaaS Team**

## INTRODUCTION

In this document we propose a design for the “Control Plane” of
Kubernetes (K8S) federation (a.k.a. “Ubernetes”). For background on
this work, please refer to
[this proposal](../../docs/proposals/federation.md).

The document is arranged as follows. First we briefly list the
scenarios and use cases that motivate the K8S federation work. These
use cases both drive and verify the design. We then summarize the
functionality requirements derived from these use cases, and define
the “in scope” functionality that will be covered by this design
(phase one). After that we give an overview of the proposed
architecture, API and building blocks, and walk through several
activity flows to show how these building blocks work together to
support the use cases.

## REQUIREMENTS

There are many reasons why customers may want to build a K8S
federation:

+ **High Availability:** Customers want to be immune to the outage of
  a single availability zone, region or even a cloud provider.
+ **Sensitive workloads:** Some workloads can only run on a particular
  cluster. They cannot be scheduled to or migrated to other clusters.
+ **Capacity overflow:** Customers prefer to run workloads on a
  primary cluster. But if the capacity of that cluster is not
  sufficient, workloads should be automatically distributed to other
  clusters.
+ **Vendor lock-in avoidance:** Customers want to spread their
  workloads across different cloud providers, and to easily increase
  or decrease the workload proportion of a specific provider.
+ **Cluster Size Enhancement:** Currently a K8S cluster can only
  support a limited size. While the community is actively improving
  it, it can be expected that cluster size will be a problem if K8S is
  used for large workloads or public PaaS infrastructure. While we can
  separate different tenants into different clusters, it would be good
  to have a unified view.

Here are the functionality requirements derived from the above use cases:

+ Clients of the federation control plane API server can register and
  deregister clusters.
+ Workloads should be spread across different clusters according to the
  workload distribution policy.
+ Pods are able to discover and connect to services hosted in other
  clusters (in cases where inter-cluster networking is necessary,
  desirable and implemented).
+ Traffic to these pods should be spread across clusters (in a manner
  similar to load balancing, although it might not be strictly
  speaking balanced).
+ The control plane needs to know when a cluster is down, and migrate
  the workloads to other clusters.
+ Clients have a unified view and a central control point for the above
  activities.

## SCOPE

It’s difficult to deliver, in a single step, a complete design that
implements all of the above requirements. Therefore we will take an
iterative approach to designing and building the system. This document
describes phase one of the whole work. In phase one we will cover only
the following objectives:

+ Define the basic building blocks and API objects of the control plane
+ Implement a basic end-to-end workflow
  + Clients register federated clusters
  + Clients submit a workload
  + The workload is distributed to different clusters
  + Service discovery
  + Load balancing

The following parts are NOT covered in phase one:

+ Authentication and authorization (other than basic client
  authentication against the Ubernetes API, and from the Ubernetes
  control plane to the underlying Kubernetes clusters).
+ Deployment units other than replication controller and service
+ Complex distribution policies for workloads
+ Service affinity and migration

## ARCHITECTURE

The overall architecture of the control plane is shown below:

![Ubernetes Architecture](ubernetes-design.png)

Some design principles we are following in this architecture:

1. Keep the underlying K8S clusters independent. They should have no
   knowledge of the control plane or of each other.
1. Keep the Ubernetes API interface compatible with the K8S API as
   much as possible.
1. Re-use concepts from K8S as much as possible. This reduces
   customers’ learning curve and is good for adoption.

Below is a brief description of each module contained in the above diagram.

## Ubernetes API Server

The API Server in the Ubernetes control plane works just like the API
Server in K8S. It talks to a distributed key-value store to persist,
retrieve and watch API objects. This store is completely distinct
from the Kubernetes key-value stores (etcd) in the underlying
Kubernetes clusters. We still use `etcd` as the distributed
storage so customers don’t need to learn and manage a different
storage system, although it is envisaged that other storage systems
(Consul, ZooKeeper) will probably be developed and supported over
time.

## Ubernetes Scheduler

The Ubernetes Scheduler schedules resources onto the underlying
Kubernetes clusters. For example, it watches for unscheduled Ubernetes
replication controllers (those that have not yet been scheduled onto
underlying Kubernetes clusters) and performs the global scheduling
work. For each unscheduled replication controller, it calls the policy
engine to decide how to split the workload among clusters. It creates
a Kubernetes replication controller for one or more underlying
clusters, and posts them back to `etcd` storage.

One subtlety worth noting here is that the scheduling decision is
arrived at by combining the application-specific request from the user
(which might include, for example, placement constraints) and the
global policy specified by the federation administrator (for example,
"prefer on-premise clusters over AWS clusters" or "spread load equally
across clusters").
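
The even-spread case can be pictured as a small pure function. Below
is a minimal Go sketch of that splitting step, assuming a hypothetical
`Cluster` type and the invented name `splitEvenly`; the real policy
engine and API types are not defined by this design, so this is an
illustration rather than the actual implementation.

```go
package scheduler

// Cluster is a simplified, hypothetical view of a registered cluster as
// seen by the scheduler: its name plus the free capacity reported in the
// cluster object's status.
type Cluster struct {
	Name         string
	FreeCPUMilli int64
}

// splitEvenly distributes the desired replica count across the acceptable
// clusters (those that passed the clusterSelector filter), giving any
// remainder to the clusters listed first.
func splitEvenly(replicas int, clusters []Cluster) map[string]int {
	assignment := make(map[string]int, len(clusters))
	if len(clusters) == 0 {
		return assignment
	}
	base := replicas / len(clusters)
	extra := replicas % len(clusters)
	for i, c := range clusters {
		n := base
		if i < extra {
			n++ // earlier clusters absorb the remainder
		}
		assignment[c.Name] = n
	}
	return assignment
}
```

The scheduler would then write one sub-replication-controller per
non-zero entry of the returned map back into the control plane’s
`etcd`, where the cluster controllers pick them up.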

## Ubernetes Cluster Controller

The cluster controller performs the following two kinds of work:

1. It watches all the sub-resources that are created by Ubernetes
   components, such as a sub-RC or a sub-service, and creates the
   corresponding API objects on the underlying K8S clusters.
1. It periodically retrieves the available resource metrics from the
   underlying K8S cluster and updates them in the status of the
   `cluster` API object. An alternative design might be to run a pod
   in each underlying cluster that reports metrics for that cluster to
   the Ubernetes control plane. Which approach is better remains an
   open topic of discussion.
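
A rough sketch of that loop in Go is shown below. The `ControlPlane`
and `UnderlyingCluster` interfaces, the type names and the polling
interval are all invented for illustration; a real implementation
would use watches rather than polling for the first duty.

```go
package clustercontroller

import (
	"context"
	"time"
)

// SubRC and Status are simplified stand-ins for the real API objects.
type SubRC struct {
	Name     string
	Replicas int
}

type Status struct {
	FreeCPUMilli    int64
	FreeMemoryBytes int64
}

// ControlPlane and UnderlyingCluster are illustrative interfaces, not real
// client APIs.
type ControlPlane interface {
	PendingSubRCs(cluster string) []SubRC         // sub-RCs bound to this cluster
	UpdateClusterStatus(cluster string, s Status) // refresh the cluster object's status
}

type UnderlyingCluster interface {
	CreateRC(rc SubRC) error
	AggregateNodeMetrics() Status // sum available CPU/memory over all nodes
}

// run performs the two duties described above for one registered cluster.
func run(ctx context.Context, name string, plane ControlPlane, cluster UnderlyingCluster) {
	ticker := time.NewTicker(30 * time.Second) // the interval is an assumption
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Duty 1: realize sub-resources on the underlying cluster.
			for _, rc := range plane.PendingSubRCs(name) {
				_ = cluster.CreateRC(rc) // error handling elided in this sketch
			}
			// Duty 2: publish fresh resource metrics as cluster status.
			plane.UpdateClusterStatus(name, cluster.AggregateNodeMetrics())
		}
	}
}
```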

## Ubernetes Service Controller

The Ubernetes service controller is a federation-level implementation
of the K8S service controller. It watches service resources created on
the control plane and creates corresponding K8S services on each
involved K8S cluster. Besides interacting with the service resources
on each individual K8S cluster, the Ubernetes service controller also
performs some global DNS registration work.
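
Conceptually this is another fan-out loop: for every federation-level
service, ensure a per-cluster service exists and then publish one
global DNS name. The sketch below uses placeholder interfaces and a
placeholder naming scheme invented for illustration; the real
mechanics, including DNS provider integration, are specified in the
Federated Services design document referenced later.

```go
package servicecontroller

// ClusterServices and DNSProvider are placeholder interfaces invented for
// this sketch.
type ClusterServices interface {
	// EnsureService creates or updates the K8S service in one underlying
	// cluster and returns an externally reachable endpoint for it.
	EnsureService(svc FederatedService) (endpoint string, err error)
}

type DNSProvider interface {
	EnsureRecord(name string, endpoints []string) error
}

type FederatedService struct {
	Name    string
	DNSName string // global name registered for the service; the naming scheme is an assumption
}

// syncService fans the federation-level service out to every involved
// cluster and registers all resulting endpoints under one global DNS name.
func syncService(svc FederatedService, clusters map[string]ClusterServices, dns DNSProvider) error {
	var endpoints []string
	for _, c := range clusters {
		ep, err := c.EnsureService(svc)
		if err != nil {
			return err
		}
		endpoints = append(endpoints, ep)
	}
	return dns.EnsureRecord(svc.DNSName, endpoints)
}
```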

## API OBJECTS

## Cluster

Cluster is a new first-class API object introduced in this design. For
each registered K8S cluster there will be such an API resource in the
control plane. Clients register or deregister a cluster by sending
corresponding REST requests to the following URL:
`/api/{$version}/clusters`. Because the control plane behaves like a
regular K8S client to the underlying clusters, the spec of a cluster
object contains necessary properties like the K8S cluster address and
credentials. The status of a cluster API object contains the following
information:

1. Which phase of its lifecycle it is in
1. Cluster resource metrics for scheduling decisions
1. Other metadata, like the version of the cluster
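
As an illustration of the registration call mentioned above, the
sketch below POSTs a JSON-encoded cluster object to
`/api/{$version}/clusters`. The JSON field names, the concrete API
version and the addresses are assumptions of this sketch, not a
finalized wire format; deregistration would presumably be a `DELETE`
on the same resource.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// cluster mirrors the spec table below: a name, the address of the
// underlying K8S cluster, and the credential used to access it.
type cluster struct {
	Name string `json:"name"`
	Spec struct {
		Address    string `json:"address"`
		Credential string `json:"credential"`
	} `json:"spec"`
}

func main() {
	c := cluster{Name: "cluster-foo"}
	c.Spec.Address = "https://cluster-foo.example.com:6443" // illustrative address
	c.Spec.Credential = "bearer:<token>"                    // type and data of the credential

	body, _ := json.Marshal(c)
	// "v1" stands in for {$version}; the control plane address is invented here.
	resp, err := http.Post("https://ubernetes.example.com/api/v1/clusters",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("registration request returned:", resp.Status)
}
```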

$version.clusterSpec

| Name | Description | Required | Schema | Default |
|------|-------------|----------|--------|---------|
| Address | address of the cluster | yes | address | |
| Credential | the type (e.g. bearer token, client certificate etc.) and data of the credential used to access the cluster. It’s used for system routines (not on behalf of users) | yes | string | |

$version.clusterStatus

| Name | Description | Required | Schema | Default |
|------|-------------|----------|--------|---------|
| Phase | the recently observed lifecycle phase of the cluster | yes | enum | |
| Capacity | represents the available resources of a cluster | yes | any | |
| ClusterMeta | Other cluster metadata like the version | yes | ClusterMeta | |

**For simplicity we didn’t introduce a separate “cluster metrics” API
object here**. The cluster resource metrics are stored in the cluster
status section, just as is done for nodes in K8S. In phase one it only
contains available CPU and memory resources. The cluster controller
periodically polls the underlying cluster API Server to get the
cluster capacity. In phase one it gets the metrics by simply
aggregating the metrics from all nodes. In the future we will improve
this with more efficient approaches, such as leveraging Heapster, and
more metrics will be supported. Similar to node phases in K8S, the
“phase” field can take the following values:

+ pending: newly registered clusters or clusters suspended by the
  admin for various reasons. They are not eligible for accepting
  workloads.
+ running: clusters in normal status that can accept workloads.
+ offline: clusters temporarily down or not reachable.
+ terminated: clusters removed from the federation.

Below is the state transition diagram.

![Cluster State Transition Diagram](ubernetes-cluster-state.png)
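
Taken together, the cluster object described above might map onto Go
types roughly like the following. These are illustrative only and
deliberately simplified: for example, `Capacity` is shown as a plain
map rather than a full resource-list type, and standard object
metadata is reduced to a name.

```go
package federation

// ClusterPhase follows the lifecycle values listed above.
type ClusterPhase string

const (
	ClusterPending    ClusterPhase = "pending"    // registered or suspended; not accepting workloads
	ClusterRunning    ClusterPhase = "running"    // healthy and accepting workloads
	ClusterOffline    ClusterPhase = "offline"    // temporarily down or unreachable
	ClusterTerminated ClusterPhase = "terminated" // removed from the federation
)

// ClusterSpec mirrors the $version.clusterSpec table.
type ClusterSpec struct {
	Address    string // K8S API server address of the underlying cluster
	Credential string // type and data of the credential used by system routines
}

// ClusterStatus mirrors the $version.clusterStatus table.
type ClusterStatus struct {
	Phase       ClusterPhase
	Capacity    map[string]int64 // e.g. "cpu" (millicores) and "memory" (bytes) in phase one
	ClusterMeta struct{ Version string }
}

// Cluster is the first-class API object stored by the control plane.
type Cluster struct {
	Name   string
	Spec   ClusterSpec
	Status ClusterStatus
}
```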

## Replication Controller

A global workload submitted to the control plane is represented as an
Ubernetes replication controller. When a replication controller is
submitted to the control plane, clients need a way to express its
requirements or preferences on clusters. Depending on the use case
this may be complex. For example:

+ This workload can only be scheduled to cluster Foo. It cannot be
  scheduled to any other cluster (use case: sensitive workloads).
+ This workload prefers cluster Foo. But if there is no available
  capacity on cluster Foo, it’s OK for it to be scheduled to cluster
  Bar (use case: capacity overflow).
+ Seventy percent of this workload should be scheduled to cluster Foo,
  and thirty percent should be scheduled to cluster Bar (use case:
  vendor lock-in avoidance).

In phase one, we only introduce a _clusterSelector_ field to filter
acceptable clusters. By default there is no such selector, which means
any cluster is acceptable.

Below is a sample of the YAML to create such a replication controller.

```
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller
spec:
  replicas: 5
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
  clusterSelector:
    name in (Foo, Bar)
```

Currently clusterSelector (implemented as a
[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704))
only supports a simple list of acceptable clusters. Workloads will be
evenly distributed across these acceptable clusters in phase one.
After phase one we will define syntax to represent more advanced
constraints, like cluster preference ordering, the desired number of
workload splits, the desired ratio of workloads spread across
different clusters, etc.

Besides this explicit “clusterSelector” filter, a workload may have
some implicit scheduling restrictions. For example, it may define a
“nodeSelector” which can only be satisfied on some particular
clusters. How to handle this will be addressed after phase one.

## Ubernetes Services

The Service API object exposed by Ubernetes is similar to the service
object in Kubernetes. It defines the access to a group of pods. The
Ubernetes service controller will create corresponding Kubernetes
service objects on underlying clusters. These are detailed in a
separate design document: [Federated Services](federated-services.md).

## Pod

In phase one we only support scheduling replication controllers. Pod
scheduling will be supported in a later phase. This is primarily in
order to keep the Ubernetes API compatible with the Kubernetes API.

## ACTIVITY FLOWS

## Scheduling

The diagram below shows how workloads are scheduled on the Ubernetes
control plane:

1. A replication controller is created by the client.
1. The API Server persists it into the storage.
1. The cluster controller periodically polls the latest available
   resource metrics from the underlying clusters.
1. The scheduler watches all pending RCs. It picks up an RC, makes
   policy-driven decisions and splits it into different sub-RCs.
1. Each cluster controller watches the sub-RCs bound to its
   corresponding cluster and picks up the newly created sub-RC.
1. The cluster controller issues requests to the underlying cluster
   API Server to create the RC. In phase one we don’t support complex
   distribution policies. The scheduling rule is basically:
   1. If an RC does not specify any clusterSelector, it will be
      scheduled to the least-loaded K8S cluster(s) that has enough
      available resources.
   1. If an RC specifies _N_ acceptable clusters in the
      clusterSelector, all replicas will be evenly distributed among
      these clusters (see the short example after this list).
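
As a concrete illustration of rule 2, take the sample replication
controller from earlier (`replicas: 5`, `clusterSelector: name in
(Foo, Bar)`). The standalone sketch below shows the resulting split;
giving the remainder to the first listed cluster is an assumption of
this sketch rather than a rule fixed by the design.

```go
package main

import "fmt"

func main() {
	replicas := 5
	clusters := []string{"Foo", "Bar"} // acceptable clusters from the clusterSelector

	base, extra := replicas/len(clusters), replicas%len(clusters)
	for i, name := range clusters {
		n := base
		if i < extra {
			n++ // the first clusters absorb the remainder
		}
		fmt.Printf("sub-RC on %s: %d replicas\n", name, n)
	}
	// Output:
	// sub-RC on Foo: 3 replicas
	// sub-RC on Bar: 2 replicas
}
```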

There is a potential race condition here. Say at time _T1_ the control
plane learns there are _m_ available resources in a K8S cluster. As
the cluster works independently, it still accepts workload requests
from other K8S clients or even another Ubernetes control plane. The
Ubernetes scheduling decision is based on this snapshot of available
resources. However, when the actual RC creation happens on the cluster
at time _T2_, the cluster may not have enough resources at that time.
We will address this problem in later phases with proposed solutions
like resource reservation mechanisms.

![Ubernetes Scheduling](ubernetes-scheduling.png)

## Service Discovery

This part is covered in the “Federated Service” section of the
document
“[Ubernetes Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)”.
Please refer to that document for details.


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federation-phase-1.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->