# Ubernetes Design Spec (phase one)

**Huawei PaaS Team**

## INTRODUCTION

In this document we propose a design for the “Control Plane” of
Kubernetes (K8S) federation (a.k.a. “Ubernetes”). For background on
this work please refer to
[this proposal](../../docs/proposals/federation.md).

The document is arranged as follows. First we briefly list the scenarios
and use cases that motivate the K8S federation work. These use cases drive
the design and also serve to verify it. We summarize the
functionality requirements derived from these use cases, and define the “in
scope” functionality that will be covered by this design (phase
one). After that we give an overview of the proposed architecture, API
and building blocks, and then walk through several activity flows to
see how these building blocks work together to support the use cases.

## REQUIREMENTS

There are many reasons why customers may want to build a K8S
federation:

+ **High Availability:** Customers want to be immune to the outage of
  a single availability zone, region or even a cloud provider.
+ **Sensitive workloads:** Some workloads can only run on a particular
  cluster. They cannot be scheduled to or migrated to other clusters.
+ **Capacity overflow:** Customers prefer to run workloads on a
  primary cluster. But if the capacity of that cluster is not
  sufficient, workloads should be automatically distributed to other
  clusters.
+ **Vendor lock-in avoidance:** Customers want to spread their
  workloads across different cloud providers, and to be able to easily
  increase or decrease the workload proportion of a specific provider.
+ **Cluster Size Enhancement:** Currently a K8S cluster can only
  support a limited size. While the community is actively improving
  this, it can be expected that cluster size will become a problem if
  K8S is used for large workloads or public PaaS infrastructure. While
  we can separate different tenants into different clusters, it would
  be good to have a unified view.

Here are the functionality requirements derived from the above use cases:

+ Clients of the federation control plane API server can register and
  deregister clusters.
+ Workloads should be spread across different clusters according to the
  workload distribution policy.
+ Pods are able to discover and connect to services hosted in other
  clusters (in cases where inter-cluster networking is necessary,
  desirable and implemented).
+ Traffic to these pods should be spread across clusters (in a manner
  similar to load balancing, although it might not be strictly
  speaking balanced).
+ The control plane needs to know when a cluster is down, and migrate
  the workloads to other clusters.
+ Clients have a unified view and a central control point for the above
  activities.

## SCOPE

It’s difficult to produce, in one pass, a perfect design that implements
all of the above requirements. Therefore we will take an iterative
approach to designing and building the system. This document describes
phase one of the overall work. In phase one we will cover only the
following objectives:

+ Define the basic building blocks and API objects of the control plane
+ Implement a basic end-to-end workflow:
  + Clients register federated clusters
  + Clients submit a workload
  + The workload is distributed to different clusters
  + Service discovery
  + Load balancing

The following parts are NOT covered in phase one:

+ Authentication and authorization (other than basic client
  authentication against the Ubernetes API, and from the Ubernetes control
  plane to the underlying Kubernetes clusters)
+ Deployment units other than replication controller and service
+ Complex distribution policies for workloads
+ Service affinity and migration

## ARCHITECTURE

The overall architecture of the control plane is shown below:

![Ubernetes Architecture](ubernetes-design.png)

Some design principles we are following in this architecture:

1. Keep the underlying K8S clusters independent. They should have no
   knowledge of the control plane or of each other.
1. Keep the Ubernetes API interface compatible with the K8S API as much as
   possible.
1. Re-use concepts from K8S as much as possible. This reduces
   customers’ learning curve and is good for adoption.

Below is a brief description of each module contained in the above diagram.

## Ubernetes API Server

The API Server in the Ubernetes control plane works just like the API
Server in K8S. It talks to a distributed key-value store to persist,
retrieve and watch API objects. This store is completely distinct
from the Kubernetes key-value stores (etcd) in the underlying
Kubernetes clusters. We still use `etcd` as the distributed
storage so customers don’t need to learn and manage a different
storage system, although it is envisaged that other storage systems
(Consul, ZooKeeper) will probably be developed and supported over
time.

## Ubernetes Scheduler

The Ubernetes Scheduler schedules resources onto the underlying
Kubernetes clusters. For example, it watches for unscheduled Ubernetes
replication controllers (those that have not yet been scheduled onto
underlying Kubernetes clusters) and performs the global scheduling
work. For each unscheduled replication controller, it calls the policy
engine to decide how to split the workload among clusters. It creates
Kubernetes replication controllers for one or more underlying clusters,
and posts them back to the `etcd` storage.

One subtlety worth noting here is that the scheduling decision is
arrived at by combining the application-specific request from the user (which might
include, for example, placement constraints) with the global policy specified
by the federation administrator (for example, "prefer on-premise
clusters over AWS clusters" or "spread load equally across clusters").

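No configuration format for the policy engine is defined in this phase-one design. Purely as a hypothetical illustration of the kind of global policy a federation administrator might express, the two examples above could look roughly like this (all field names below are invented for illustration):

```
# Hypothetical sketch only: no policy configuration format is specified
# in this document. It merely illustrates administrator-level global policy.
federationPolicy:
  clusterPreferences:
  - clusterSelector: "location in (on-premise)"   # prefer on-premise clusters...
    weight: 100
  - clusterSelector: "provider in (aws)"          # ...over AWS clusters
    weight: 10
  defaultSpread: even                             # spread load equally across clusters
```
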
## Ubernetes Cluster Controller

The cluster controller performs the following two kinds of work:

1. It watches all the sub-resources that are created by Ubernetes
   components, like a sub-RC or a sub-service, and creates the
   corresponding API objects on the underlying K8S clusters.
1. It periodically retrieves the available resource metrics from the
   underlying K8S cluster, and updates them as the object status of the
   `cluster` API object. An alternative design might be to run a pod
   in each underlying cluster that reports metrics for that cluster to
   the Ubernetes control plane. Which approach is better remains an
   open topic of discussion.

## Ubernetes Service Controller

The Ubernetes service controller is a federation-level implementation
of the K8S service controller. It watches service resources created on
the control plane and creates corresponding K8S services on each involved
K8S cluster. Besides interacting with service resources on each
individual K8S cluster, the Ubernetes service controller also
performs some global DNS registration work.

## API OBJECTS

## Cluster

Cluster is a new first-class API object introduced in this design. For
each registered K8S cluster there will be such an API resource in the
control plane. Clients register or deregister a cluster by sending the
corresponding REST requests to the following URL:
`/api/{$version}/clusters`. Because the control plane behaves like a
regular K8S client to the underlying clusters, the spec of a cluster
object contains necessary properties like the K8S cluster address and
credentials. The status of a cluster API object will contain the
following information:

1. Which phase of its lifecycle it is in
1. Cluster resource metrics for scheduling decisions
1. Other metadata like the version of the cluster

$version.clusterSpec

| Name | Description | Required | Schema | Default |
| --- | --- | --- | --- | --- |
| Address | address of the cluster | yes | address | |
| Credential | the type (e.g. bearer token, client certificate etc.) and data of the credential used to access the cluster. It’s used for system routines (not on behalf of users) | yes | string | |

$version.clusterStatus

| Name | Description | Required | Schema | Default |
| --- | --- | --- | --- | --- |
| Phase | the recently observed lifecycle phase of the cluster | yes | enum | |
| Capacity | represents the available resources of a cluster | yes | any | |
| ClusterMeta | other cluster metadata like the version | yes | ClusterMeta | |

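To make the two tables above concrete, a registered cluster object (as POSTed to `/api/{$version}/clusters` and later updated with status by the cluster controller) might look roughly like the sketch below. The field names, `apiVersion`, and values are illustrative assumptions; the exact serialization is not finalized in this document.

```
# Hypothetical cluster API object, following the clusterSpec and
# clusterStatus tables above. Field names and values are illustrative only.
apiVersion: v1
kind: Cluster
metadata:
  name: cluster-foo
spec:
  address: https://cluster-foo.example.com:6443    # address of the cluster
  credential: <bearer-token-or-client-cert-data>   # for system routines, not on behalf of users
status:
  phase: running             # pending | running | offline | terminated
  capacity:                  # phase one: CPU and memory aggregated from all nodes
    cpu: "200"
    memory: 800Gi
  clusterMeta:
    version: v1.1.2
```
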
**For simplicity we didn’t introduce a separate “cluster metrics” API
object here**. The cluster resource metrics are stored in the cluster
status section, just as is done for nodes in K8S. In phase one the status
only contains available CPU and memory resources. The
cluster controller will periodically poll the underlying cluster API
Server to get the cluster capacity. In phase one it obtains the metrics by
simply aggregating the metrics from all nodes. In the future we will improve
this with more efficient approaches, such as leveraging Heapster, and more
metrics will be supported. Similar to node phases in K8S, the “phase”
field takes one of the following values:

+ pending: newly registered clusters, or clusters suspended by the admin
  for various reasons. They are not eligible for accepting workloads.
+ running: clusters in normal status that can accept workloads
+ offline: clusters temporarily down or not reachable
+ terminated: clusters removed from the federation

Below is the state transition diagram.

![Cluster State Transition Diagram](ubernetes-cluster-state.png)

## Replication Controller

A global workload submitted to the control plane is represented as an
Ubernetes replication controller. When a replication controller
is submitted to the control plane, clients need a way to express its
requirements or preferences on clusters. Depending on the use case this
may be complex. For example:

+ This workload can only be scheduled to cluster Foo. It cannot be
  scheduled to any other cluster (use case: sensitive workloads).
+ This workload prefers cluster Foo. But if there is no available
  capacity on cluster Foo, it’s OK for it to be scheduled to cluster Bar
  (use case: capacity overflow).
+ Seventy percent of this workload should be scheduled to cluster Foo,
  and thirty percent should be scheduled to cluster Bar (use case:
  vendor lock-in avoidance).

In phase one, we only introduce a _clusterSelector_ field to filter
acceptable clusters. By default there is no such selector, which means
any cluster is acceptable.

Below is a sample of the YAML to create such a replication controller.

```
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller
spec:
  replicas: 5
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
  clusterSelector:
    name in (Foo, Bar)
```

Currently clusterSelector (implemented as a
[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704))
only supports a simple list of acceptable clusters. Workloads will be
evenly distributed across these acceptable clusters in phase one. After
phase one we will define syntax to represent more advanced
constraints, like cluster preference ordering, the desired number of
workload splits, the desired ratio of workloads spread across different
clusters, etc.

Besides this explicit “clusterSelector” filter, a workload may have
some implicit scheduling restrictions. For example, it may define a
“nodeSelector” that can only be satisfied on some particular
clusters. How to handle this will be addressed after phase one.

## Ubernetes Services

The Service API object exposed by Ubernetes is similar to service
objects on Kubernetes. It defines the access to a group of pods. The
Ubernetes service controller will create corresponding Kubernetes
service objects on underlying clusters. These are detailed in a
separate design document: [Federated Services](federated-services.md).

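For example, a client could submit an ordinary Kubernetes-style service manifest such as the one below (selecting the nginx pods from the replication controller example above) to the Ubernetes API. How this service is then propagated and exposed across clusters is covered in the federated services design document, not here.

```
# A standard Kubernetes-style Service definition for the nginx example above,
# submitted unchanged to the Ubernetes control plane.
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80
```
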
## Pod

In phase one we only support scheduling replication controllers. Pod
scheduling will be supported in a later phase. This is primarily in
order to keep the Ubernetes API compatible with the Kubernetes API.

## ACTIVITY FLOWS

## Scheduling

The diagram below shows how workloads are scheduled on the Ubernetes control plane:

1. A replication controller is created by the client.
1. The API Server persists it into the storage.
1. The cluster controller periodically polls the latest available resource
   metrics from the underlying clusters.
1. The scheduler watches all pending RCs. It picks up an RC, makes
   policy-driven decisions and splits it into different sub-RCs.
1. Each cluster controller watches the sub-RCs bound to its
   corresponding cluster. It picks up the newly created sub-RC.
1. The cluster controller issues requests to the underlying cluster
   API Server to create the RC. In phase one we don’t support complex
   distribution policies. The scheduling rule is basically:
   1. If an RC does not specify any clusterSelector, it will be scheduled
      to the least loaded K8S cluster(s) that have enough available
      resources.
   1. If an RC specifies _N_ acceptable clusters in the
      clusterSelector, all replicas will be evenly distributed among
      these clusters (see the sketch below).

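As a purely illustrative sketch of the even-distribution rule (the exact representation of sub-RCs is not specified in this document), the 5-replica nginx replication controller above, with `clusterSelector: name in (Foo, Bar)`, might be split into two sub-RCs along the following lines, one bound to each acceptable cluster:

```
# Hypothetical sub-RCs produced by the Ubernetes scheduler for the 5-replica
# nginx example: an even split across the two acceptable clusters.
# The "targetCluster" annotation is illustrative only; how a sub-RC is bound
# to its cluster is not specified here.
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller-foo
  annotations:
    targetCluster: Foo
spec:
  replicas: 3          # 5 replicas split as evenly as possible: 3 + 2
  selector:
    app: nginx
  # template identical to the original nginx-controller template
---
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller-bar
  annotations:
    targetCluster: Bar
spec:
  replicas: 2
  selector:
    app: nginx
  # template identical to the original nginx-controller template
```
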
There is a potential race condition here. Say at time _T1_ the control
plane learns there are _m_ available resources in a K8S cluster. As
the cluster is working independently, it still accepts workload
requests from other K8S clients or even another Ubernetes control
plane. The Ubernetes scheduling decision is based on this view of
available resources. However, when the actual RC creation happens on
the cluster at time _T2_, the cluster may not have enough resources
at that time. We will address this problem in later phases with
proposed solutions such as resource reservation mechanisms.

![Ubernetes Scheduling](ubernetes-scheduling.png)

## Service Discovery

This part is covered in the section “Federated Service” of the
document
“[Ubernetes Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)”. Please
refer to that document for details.