# Ubernetes Design Spec (phase one)

**Huawei PaaS Team**

## INTRODUCTION

In this document we propose a design for the “Control Plane” of
Kubernetes (K8S) federation (a.k.a. “Ubernetes”). For background on
this work please refer to
[this proposal](../../docs/proposals/federation.md).

The document is arranged as follows. First we briefly list the scenarios
and use cases that motivate the K8S federation work. These use cases drive
the design and also serve to verify it. We summarize the
functionality requirements derived from these use cases, and define the “in
scope” functionality that will be covered by this design (phase
one). After that we give an overview of the proposed architecture, API
and building blocks, and then walk through several activity flows to
see how these building blocks work together to support the use cases.

## REQUIREMENTS

There are many reasons why customers may want to build a K8S
federation:

+ **High Availability:** Customers want to be immune to the outage of
  a single availability zone, region or even a cloud provider.
+ **Sensitive workloads:** Some workloads can only run on a particular
  cluster. They cannot be scheduled to or migrated to other clusters.
+ **Capacity overflow:** Customers prefer to run workloads on a
  primary cluster. But if the capacity of that cluster is not
  sufficient, workloads should be automatically distributed to other
  clusters.
+ **Vendor lock-in avoidance:** Customers want to spread their
  workloads across different cloud providers, and to be able to easily
  increase or decrease the workload proportion of a specific provider.
+ **Cluster Size Enhancement:** Currently a K8S cluster can only
  support a limited size. While the community is actively improving
  this, it can be expected that cluster size will become a problem if
  K8S is used for large workloads or public PaaS infrastructure. While
  we can separate different tenants into different clusters, it would
  be good to have a unified view.

Here are the functionality requirements derived from the above use cases:

+ Clients of the federation control plane API server can register and
  deregister clusters.
+ Workloads should be spread across different clusters according to the
  workload distribution policy.
+ Pods are able to discover and connect to services hosted in other
  clusters (in cases where inter-cluster networking is necessary,
  desirable and implemented).
+ Traffic to these pods should be spread across clusters (in a manner
  similar to load balancing, although it might not be strictly
  speaking balanced).
+ The control plane needs to know when a cluster is down, and migrate
  the workloads to other clusters.
+ Clients have a unified view and a central control point for the above
  activities.

## SCOPE

It’s difficult to produce, in one pass, a perfect design that implements
all of the above requirements. Therefore we will take an iterative
approach to designing and building the system. This document describes
phase one of the overall work. In phase one we will cover only the
following objectives:

+ Define the basic building blocks and API objects of the control plane
+ Implement a basic end-to-end workflow:
  + Clients register federated clusters
  + Clients submit a workload
  + The workload is distributed to different clusters
  + Service discovery
  + Load balancing

The following parts are NOT covered in phase one:

+ Authentication and authorization (other than basic client
  authentication against the Ubernetes API, and from the Ubernetes control
  plane to the underlying Kubernetes clusters)
+ Deployment units other than replication controller and service
+ Complex distribution policies for workloads
+ Service affinity and migration

## ARCHITECTURE

The overall architecture of the control plane is shown below:

![Ubernetes Architecture](ubernetes-design.png)

Some design principles we are following in this architecture:

1. Keep the underlying K8S clusters independent. They should have no
   knowledge of the control plane or of each other.
1. Keep the Ubernetes API interface compatible with the K8S API as much as
   possible.
1. Re-use concepts from K8S as much as possible. This reduces
   customers’ learning curve and is good for adoption.

Below is a brief description of each module contained in the above diagram.

## Ubernetes API Server

The API Server in the Ubernetes control plane works just like the API
Server in K8S. It talks to a distributed key-value store to persist,
retrieve and watch API objects. This store is completely distinct
from the Kubernetes key-value stores (etcd) in the underlying
Kubernetes clusters. We still use `etcd` as the distributed
storage so customers don’t need to learn and manage a different
storage system, although it is envisaged that other storage systems
(Consul, ZooKeeper) will probably be developed and supported over
time.

## Ubernetes Scheduler

The Ubernetes Scheduler schedules resources onto the underlying
Kubernetes clusters. For example, it watches for unscheduled Ubernetes
replication controllers (those that have not yet been scheduled onto
underlying Kubernetes clusters) and performs the global scheduling
work. For each unscheduled replication controller, it calls the policy
engine to decide how to split the workload among clusters. It creates
Kubernetes replication controllers for one or more underlying clusters,
and posts them back to the `etcd` storage.

One subtlety worth noting here is that the scheduling decision is
arrived at by combining the application-specific request from the user (which might
include, for example, placement constraints) with the global policy specified
by the federation administrator (for example, "prefer on-premise
clusters over AWS clusters" or "spread load equally across clusters").

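No configuration format for the policy engine is defined in this phase-one design. Purely as a hypothetical illustration of the kind of global policy a federation administrator might express, the two examples above could look roughly like this (all field names below are invented for illustration):

```
# Hypothetical sketch only: no policy configuration format is specified
# in this document. It merely illustrates administrator-level global policy.
federationPolicy:
  clusterPreferences:
  - clusterSelector: "location in (on-premise)"   # prefer on-premise clusters...
    weight: 100
  - clusterSelector: "provider in (aws)"          # ...over AWS clusters
    weight: 10
  defaultSpread: even                             # spread load equally across clusters
```
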
## Ubernetes Cluster Controller

The cluster controller performs the following two kinds of work:

1. It watches all the sub-resources that are created by Ubernetes
   components, like a sub-RC or a sub-service, and creates the
   corresponding API objects on the underlying K8S clusters.
1. It periodically retrieves the available resource metrics from the
   underlying K8S cluster, and updates them as the object status of the
   `cluster` API object. An alternative design might be to run a pod
   in each underlying cluster that reports metrics for that cluster to
   the Ubernetes control plane. Which approach is better remains an
   open topic of discussion.

## Ubernetes Service Controller

The Ubernetes service controller is a federation-level implementation
of the K8S service controller. It watches service resources created on
the control plane and creates corresponding K8S services on each involved
K8S cluster. Besides interacting with service resources on each
individual K8S cluster, the Ubernetes service controller also
performs some global DNS registration work.

## API OBJECTS

## Cluster

Cluster is a new first-class API object introduced in this design. For
each registered K8S cluster there will be such an API resource in the
control plane. Clients register or deregister a cluster by sending the
corresponding REST requests to the following URL:
`/api/{$version}/clusters`. Because the control plane behaves like a
regular K8S client to the underlying clusters, the spec of a cluster
object contains necessary properties like the K8S cluster address and
credentials. The status of a cluster API object will contain the
following information:

1. Which phase of its lifecycle it is in
1. Cluster resource metrics for scheduling decisions
1. Other metadata like the version of the cluster

$version.clusterSpec

| Name | Description | Required | Schema | Default |
| --- | --- | --- | --- | --- |
| Address | address of the cluster | yes | address | |
| Credential | the type (e.g. bearer token, client certificate etc.) and data of the credential used to access the cluster. It’s used for system routines (not on behalf of users) | yes | string | |

$version.clusterStatus

| Name | Description | Required | Schema | Default |
| --- | --- | --- | --- | --- |
| Phase | the recently observed lifecycle phase of the cluster | yes | enum | |
| Capacity | represents the available resources of a cluster | yes | any | |
| ClusterMeta | other cluster metadata like the version | yes | ClusterMeta | |

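To make the two tables above concrete, a registered cluster object (as POSTed to `/api/{$version}/clusters` and later updated with status by the cluster controller) might look roughly like the sketch below. The field names, `apiVersion`, and values are illustrative assumptions; the exact serialization is not finalized in this document.

```
# Hypothetical cluster API object, following the clusterSpec and
# clusterStatus tables above. Field names and values are illustrative only.
apiVersion: v1
kind: Cluster
metadata:
  name: cluster-foo
spec:
  address: https://cluster-foo.example.com:6443    # address of the cluster
  credential: <bearer-token-or-client-cert-data>   # for system routines, not on behalf of users
status:
  phase: running             # pending | running | offline | terminated
  capacity:                  # phase one: CPU and memory aggregated from all nodes
    cpu: "200"
    memory: 800Gi
  clusterMeta:
    version: v1.1.2
```
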
**For simplicity we didn’t introduce a separate “cluster metrics” API
object here**. The cluster resource metrics are stored in the cluster
status section, just as is done for nodes in K8S. In phase one the status
only contains available CPU and memory resources. The
cluster controller will periodically poll the underlying cluster API
Server to get the cluster capacity. In phase one it obtains the metrics by
simply aggregating the metrics from all nodes. In the future we will improve
this with more efficient approaches, such as leveraging Heapster, and more
metrics will be supported. Similar to node phases in K8S, the “phase”
field takes one of the following values:

+ pending: newly registered clusters, or clusters suspended by the admin
  for various reasons. They are not eligible for accepting workloads.
+ running: clusters in normal status that can accept workloads
+ offline: clusters temporarily down or not reachable
+ terminated: clusters removed from the federation

Below is the state transition diagram.

![Cluster State Transition Diagram](ubernetes-cluster-state.png)

## Replication Controller

A global workload submitted to the control plane is represented as an
Ubernetes replication controller. When a replication controller
is submitted to the control plane, clients need a way to express its
requirements or preferences on clusters. Depending on the use case this
may be complex. For example:

+ This workload can only be scheduled to cluster Foo. It cannot be
  scheduled to any other cluster (use case: sensitive workloads).
+ This workload prefers cluster Foo. But if there is no available
  capacity on cluster Foo, it’s OK for it to be scheduled to cluster Bar
  (use case: capacity overflow).
+ Seventy percent of this workload should be scheduled to cluster Foo,
  and thirty percent should be scheduled to cluster Bar (use case:
  vendor lock-in avoidance).

In phase one, we only introduce a _clusterSelector_ field to filter
acceptable clusters. By default there is no such selector, which means
any cluster is acceptable.

Below is a sample of the YAML to create such a replication controller.

```
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller
spec:
  replicas: 5
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
  clusterSelector:
    name in (Foo, Bar)
```

Currently clusterSelector (implemented as a
[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704))
only supports a simple list of acceptable clusters. Workloads will be
evenly distributed across these acceptable clusters in phase one. After
phase one we will define syntax to represent more advanced
constraints, like cluster preference ordering, the desired number of
workload splits, the desired ratio of workloads spread across different
clusters, etc.

Besides this explicit “clusterSelector” filter, a workload may have
some implicit scheduling restrictions. For example, it may define a
“nodeSelector” that can only be satisfied on some particular
clusters. How to handle this will be addressed after phase one.

## Ubernetes Services

The Service API object exposed by Ubernetes is similar to service
objects on Kubernetes. It defines the access to a group of pods. The
Ubernetes service controller will create corresponding Kubernetes
service objects on underlying clusters. These are detailed in a
separate design document: [Federated Services](federated-services.md).

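For example, a client could submit an ordinary Kubernetes-style service manifest such as the one below (selecting the nginx pods from the replication controller example above) to the Ubernetes API. How this service is then propagated and exposed across clusters is covered in the federated services design document, not here.

```
# A standard Kubernetes-style Service definition for the nginx example above,
# submitted unchanged to the Ubernetes control plane.
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80
```
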
## Pod

In phase one we only support scheduling replication controllers. Pod
scheduling will be supported in a later phase. This is primarily in
order to keep the Ubernetes API compatible with the Kubernetes API.

## ACTIVITY FLOWS

## Scheduling

The diagram below shows how workloads are scheduled on the Ubernetes control plane:

1. A replication controller is created by the client.
1. The API Server persists it into the storage.
1. The cluster controller periodically polls the latest available resource
   metrics from the underlying clusters.
1. The scheduler watches all pending RCs. It picks up an RC, makes
   policy-driven decisions and splits it into different sub-RCs.
1. Each cluster controller watches the sub-RCs bound to its
   corresponding cluster. It picks up the newly created sub-RC.
1. The cluster controller issues requests to the underlying cluster
   API Server to create the RC. In phase one we don’t support complex
   distribution policies. The scheduling rule is basically:
   1. If an RC does not specify any clusterSelector, it will be scheduled
      to the least loaded K8S cluster(s) that have enough available
      resources.
   1. If an RC specifies _N_ acceptable clusters in the
      clusterSelector, all replicas will be evenly distributed among
      these clusters (see the sketch below).

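As a purely illustrative sketch of the even-distribution rule (the exact representation of sub-RCs is not specified in this document), the 5-replica nginx replication controller above, with `clusterSelector: name in (Foo, Bar)`, might be split into two sub-RCs along the following lines, one bound to each acceptable cluster:

```
# Hypothetical sub-RCs produced by the Ubernetes scheduler for the 5-replica
# nginx example: an even split across the two acceptable clusters.
# The "targetCluster" annotation is illustrative only; how a sub-RC is bound
# to its cluster is not specified here.
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller-foo
  annotations:
    targetCluster: Foo
spec:
  replicas: 3          # 5 replicas split as evenly as possible: 3 + 2
  selector:
    app: nginx
  # template identical to the original nginx-controller template
---
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller-bar
  annotations:
    targetCluster: Bar
spec:
  replicas: 2
  selector:
    app: nginx
  # template identical to the original nginx-controller template
```
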
There is a potential race condition here. Say at time _T1_ the control
plane learns there are _m_ available resources in a K8S cluster. As
the cluster is working independently, it still accepts workload
requests from other K8S clients or even another Ubernetes control
plane. The Ubernetes scheduling decision is based on this view of
available resources. However, when the actual RC creation happens on
the cluster at time _T2_, the cluster may not have enough resources
at that time. We will address this problem in later phases with
proposed solutions such as resource reservation mechanisms.

![Ubernetes Scheduling](ubernetes-scheduling.png)

## Service Discovery

This part is covered in the section “Federated Service” of the
document
“[Ubernetes Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)”. Please
refer to that document for details.