Merge pull request #19313 from quinton-hoole/2015-01-05-ube-design-docs

Auto commit by PR queue bot
pull/6/head
k8s-merge-robot 2016-03-14 22:25:28 -07:00
commit 536a30fe45
6 changed files with 1253 additions and 0 deletions

View File

@ -0,0 +1,269 @@
# Kubernetes/Ubernetes Control Plane Resilience
## Long Term Design and Current Status
### by Quinton Hoole, Mike Danese and Justin Santa-Barbara
### December 14, 2015
## Summary
Some confusion exists around how we currently ensure, and in future
intend to ensure, resilience of the Kubernetes (and by implication
Ubernetes) control plane. This document is an attempt to capture that
definitively. It covers areas including self-healing, high
availability, bootstrapping and recovery. Most of the information in
this document already exists in the form of github comments,
PRs/proposals, scattered documents, and corridor conversations, so
this document is primarily a consolidation and clarification of
existing ideas.
## Terms
* **Self-healing:** automatically restarting or replacing failed
processes and machines without human intervention
* **High availability:** continuing to be available and work correctly
even if some components are down or uncontactable. This typically
involves multiple replicas of critical services, and a reliable way
to find available replicas. Note that it's possible (but not
desirable) to have high
availability properties (e.g. multiple replicas) in the absence of
self-healing properties (e.g. if a replica fails, nothing replaces
it). Fairly obviously, given enough time, such systems typically
become unavailable (after enough replicas have failed).
* **Bootstrapping**: creating an empty cluster from nothing
* **Recovery**: recreating a non-empty cluster after perhaps
catastrophic failure/unavailability/data corruption
## Overall Goals
1. **Resilience to single failures:** Kubernetes clusters constrained
to single availability zones should be resilient to individual
machine and process failures by being both self-healing and highly
available (within the context of such individual failures).
1. **Ubiquitous resilience by default:** The default cluster creation
scripts for (at least) GCE, AWS and basic bare metal should adhere
to the above (self-healing and high availability) by default (with
options available to disable these features to reduce control plane
resource requirements if so required). It is hoped that other
cloud providers will also follow the above guidelines, but the
above 3 are the primary canonical use cases.
1. **Resilience to some correlated failures:** Kubernetes clusters
which span multiple availability zones in a region should by
default be resilient to complete failure of one entire availability
zone (by similarly providing self-healing and high availability in
the default cluster creation scripts as above).
1. **Default implementation shared across cloud providers:** The
differences between the default implementations of the above for
GCE, AWS and basic bare metal should be minimized. This implies
using shared libraries across these providers in the default
scripts in preference to highly customized implementations per
cloud provider. This is not to say that highly differentiated,
customized per-cloud cluster creation processes (e.g. for GKE on
GCE, or some hosted Kubernetes provider on AWS) are discouraged.
But those fall squarely outside the basic cross-platform OSS
Kubernetes distro.
1. **Self-hosting:** Where possible, Kubernetes's existing mechanisms
for achieving system resilience (replication controllers, health
checking, service load balancing etc) should be used in preference
to building a separate set of mechanisms to achieve the same thing.
This implies that self-hosting (running the Kubernetes control plane on
Kubernetes) is strongly preferred, with the caveat below.
1. **Recovery from catastrophic failure:** The ability to quickly and
reliably recover a cluster from catastrophic failure is critical,
and should not be compromised by the above goal to self-host
(i.e. it goes without saying that the cluster should be quickly and
reliably recoverable, even if the cluster control plane is
broken). This implies that such catastrophic failure scenarios
should be carefully thought out, and the subject of regular
continuous integration testing, and disaster recovery exercises.
## Relative Priorities
1. **(Possibly manual) recovery from catastrophic failures:** Having a Kubernetes cluster, and all
   applications running inside it, disappear forever is perhaps the worst
   possible failure mode. So it is critical that we be able to
   recover the applications running inside a cluster from such
   failures within some well-bounded time period.
1. In theory a cluster can be recovered by replaying all API calls
that have ever been executed against it, in order, but most
often that state has been lost, and/or is scattered across
multiple client applications or groups. So in general it is
probably infeasible.
1. In theory a cluster can also be recovered to some relatively
recent non-corrupt backup/snapshot of the disk(s) backing the
etcd cluster state. But we have no default consistent
backup/snapshot, verification or restoration process. And we
don't routinely test restoration, so even if we did routinely
perform and verify backups, we have no hard evidence that we
can in practice effectively recover from catastrophic cluster
failure or data corruption by restoring from these backups. So
there's more work to be done here.
1. **Self-healing:** Most major cloud providers provide the ability to
easily and automatically replace failed virtual machines within a
small number of minutes (e.g. GCE
[Auto-restart](https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#autorestart)
and Managed Instance Groups,
AWS[ Auto-recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/)
and [Auto scaling](https://aws.amazon.com/autoscaling/) etc). This
can fairly trivially be used to reduce control-plane down-time due
to machine failure to a small number of minutes per failure
(i.e. typically around "3 nines" availability), provided that:
    1. cluster persistent state (i.e. etcd disks) is either:
        1. truly persistent (i.e. remote persistent disks), or
        1. reconstructible (e.g. using etcd [dynamic member
           addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member),
           sketched briefly after this list, or [backup and
           recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery)).
    1. and boot disks are either:
        1. truly persistent (i.e. remote persistent disks), or
        1. reconstructible (e.g. using boot-from-snapshot,
           boot-from-pre-configured-image or
           boot-from-auto-initializing image).
1. **High Availability:** This has the potential to increase
availability above the approximately "3 nines" level provided by
automated self-healing, but it's somewhat more complex, and
requires additional resources (e.g. redundant API servers and etcd
quorum members). In environments where cloud-assisted automatic
self-healing might be infeasible (e.g. on-premise bare-metal
deployments), it also gives cluster administrators more time to
respond (e.g. replace/repair failed machines) without incurring
system downtime.
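
As a concrete illustration of the etcd "dynamic member addition" step referenced
above, here is a minimal Go sketch using the etcd v2 client library. The endpoint
and peer URLs are hypothetical, and a real replacement workflow would also remove
the dead member and start etcd on the new node; this only shows the
member-registration call.

```
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/client"
	"golang.org/x/net/context"
)

func main() {
	// Connect to a surviving etcd member (hypothetical endpoint).
	c, err := client.New(client.Config{
		Endpoints:               []string{"http://10.240.0.1:2379"},
		HeaderTimeoutPerRequest: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	mapi := client.NewMembersAPI(c)

	// Register the replacement node's peer URL with the quorum before
	// starting etcd on it ("dynamic member addition"). The new node is
	// then started with --initial-cluster-state=existing.
	m, err := mapi.Add(context.Background(), "http://10.240.0.4:2380")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("added member with ID %s; now start etcd on the new node\n", m.ID)
}
```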
## Design and Status (as of December 2015)
<table>
<tr>
<td><b>Control Plane Component</b></td>
<td><b>Resilience Plan</b></td>
<td><b>Current Status</b></td>
</tr>
<tr>
<td><b>API Server</b></td>
<td>
Multiple stateless, self-hosted, self-healing API servers behind an HA
load balancer, built out by the default "kube-up" automation on GCE,
AWS and basic bare metal (BBM). Note that the single-host approach of
having etcd listen only on localhost to ensure that only the API server can
connect to it will no longer work, so alternative security will be
needed in this regard (either using firewall rules, SSL certs, or
something else). All necessary flags are currently supported to enable
SSL between API server and etcd (OpenShift runs like this out of the
box), but this needs to be woven into the "kube-up" and related
scripts. Detailed design of self-hosting and the related bootstrapping
and catastrophic failure recovery will be covered in a separate
design doc.
</td>
<td>
No scripted self-healing or HA on GCE, AWS or basic bare metal
currently exists in the OSS distro. To be clear, "no self-healing"
means that even if multiple API servers (for example) are provisioned for HA
purposes, nothing replaces them when they fail, so eventually the
system will fail. Self-healing and HA can be set up
manually by following documented instructions, but this is not
currently an automated process, and it is not tested as part of
continuous integration. So it's probably safest to assume that it
doesn't actually work in practice.
</td>
</tr>
<tr>
<td><b>Controller manager and scheduler</b></td>
<td>
Multiple self-hosted, self-healing, warm-standby, stateless controller
managers and schedulers with leader election and automatic failover of API server
clients (see the sketch after this table), automatically installed by the
default "kube-up" automation.
<td>As above.</td>
</tr>
<tr>
<td><b>etcd</b></td>
<td>
Multiple (3-5) etcd quorum members behind a load balancer with session
affinity (to prevent clients from being bounced from one to another).
Regarding self-healing, if a node running etcd goes down, it is always necessary to do three
things:
<ol>
<li>allocate a new node (not necessary if running etcd as a pod, in
which case specific measures are required to prevent user pods from
interfering with system pods, for example using node selectors),
<li>start an etcd replica on that new node,
<li>have the new replica recover the etcd state.
</ol>
In the case of local disk (which fails in concert with the machine), the etcd
state must be recovered from the other replicas. This is called <A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member">dynamic member
addition</A>.
In the case of remote persistent disk, the etcd state can be recovered
by attaching the remote persistent disk to the replacement node, so
the state is recoverable even if all other replicas are down.
There are also significant performance differences between local disks and remote
persistent disks. For example, the <A HREF="https://cloud.google.com/compute/docs/disks/#comparison_of_disk_types">sustained throughput of
local disks in GCE is approximately 20x that of remote disks</A>.
Hence we suggest that self-healing be provided by remotely mounted persistent disks in
non-performance-critical, single-zone cloud deployments. For
performance-critical installations, faster local SSDs should be used,
in which case remounting on node failure is not an option, so
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">etcd runtime configuration</A>
should be used to replace the failed machine. Similarly, for
cross-zone self-healing, cloud persistent disks are zonal, so
automatic
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">runtime configuration</A>
is required. Basic bare metal deployments cannot generally
rely on
remote persistent disks, so the same approach applies there.
</td>
<td>
<A HREF="http://kubernetes.io/v1.1/docs/admin/high-availability.html">
Somewhat vague instructions exist</A>
on how to set some of this up manually in a self-hosted
configuration. But automatic bootstrapping and self-healing is not
described (and is not implemented for the non-PD cases). This all
still needs to be automated and continuously tested.
</td>
</tr>
</table>
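
To make the "automatic failover of API server clients" mentioned in the table
concrete, here is a minimal Go sketch of a client that probes a list of API
server replicas and uses the first healthy one. The endpoint URLs are
hypothetical, the sketch omits TLS and authentication, and it is not the actual
mechanism used by the Kubernetes client libraries; it only illustrates the idea.

```
package main

import (
	"fmt"
	"net/http"
	"time"
)

// pickAPIServer returns the first API server endpoint whose /healthz
// endpoint answers with HTTP 200, trying each candidate in order.
func pickAPIServer(endpoints []string) (string, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	for _, ep := range endpoints {
		resp, err := client.Get(ep + "/healthz")
		if err != nil {
			continue // endpoint down or unreachable; try the next one
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			return ep, nil
		}
	}
	return "", fmt.Errorf("no healthy API server among %v", endpoints)
}

func main() {
	// Hypothetical HA API server replicas (TLS/auth omitted for brevity).
	ep, err := pickAPIServer([]string{
		"http://apiserver-1.example.com:8080",
		"http://apiserver-2.example.com:8080",
	})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("using API server:", ep)
}
```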

View File

@ -0,0 +1,550 @@
# Kubernetes Cluster Federation (a.k.a. "Ubernetes")
## Cross-cluster Load Balancing and Service Discovery
### Requirements and System Design
### by Quinton Hoole, Dec 3 2015
## Requirements
### Discovery, Load-balancing and Failover
1. **Internal discovery and connection**: Pods/containers (running in
   a Kubernetes cluster), henceforth referred to as "cluster-internal
   clients" or simply "internal clients", must be able to easily
   discover and connect to endpoints for Kubernetes services on which
   they depend in a consistent way, irrespective of whether those
   services exist in a different Kubernetes cluster within the same
   cluster federation.
1. **External discovery and connection**: External clients (running
outside a Kubernetes cluster) must be able to discover and connect
to endpoints for Kubernetes services on which they depend.
   1. **External clients predominantly speak HTTP(S)**: External
      clients are most often, but not always, web browsers, or at
      least speak HTTP(S). Notable exceptions include enterprise
      message buses (Java, TLS), DNS servers (UDP),
      SIP servers and databases.
1. **Find the "best" endpoint:** Upon initial discovery and
connection, both internal and external clients should ideally find
"the best" endpoint if multiple eligible endpoints exist. "Best"
in this context implies the closest (by network topology) endpoint
that is both operational (as defined by some positive health check)
and not overloaded (by some published load metric). For example:
1. An internal client should find an endpoint which is local to its
own cluster if one exists, in preference to one in a remote
cluster (if both are operational and non-overloaded).
Similarly, one in a nearby cluster (e.g. in the same zone or
region) is preferable to one further afield.
1. An external client (e.g. in New York City) should find an
endpoint in a nearby cluster (e.g. U.S. East Coast) in
preference to one further away (e.g. Japan).
1. **Easy fail-over:** If the endpoint to which a client is connected
becomes unavailable (no network response/disconnected) or
overloaded, the client should reconnect to a better endpoint,
somehow.
1. In the case where there exist one or more connection-terminating
load balancers between the client and the serving Pod, failover
might be completely automatic (i.e. the client's end of the
connection remains intact, and the client is completely
oblivious of the fail-over). This approach incurs network speed
and cost penalties (by traversing possibly multiple load
balancers), but requires zero smarts in clients, DNS libraries,
recursing DNS servers etc, as the IP address of the endpoint
remains constant over time.
   1. In a scenario where clients need to choose between multiple load
      balancer endpoints (e.g. one per cluster), multiple DNS A
      records associated with a single DNS name enable even relatively
      dumb clients to try the next IP address in the list of returned
      A records (without even necessarily re-issuing a DNS resolution
      request); a sketch of this retry behavior appears after this
      list. For example, all major web browsers will try all A
      records in sequence until a working one is found (TBD: justify
      this claim with details for Chrome, IE, Safari, Firefox).
   1. In a slightly more sophisticated scenario, upon disconnection, a
      smarter client might re-issue a DNS resolution query and
      (modulo DNS record TTLs, which can typically be set as low as 3
      minutes, and buggy DNS resolvers, caches and libraries which
      have been known to ignore TTLs completely) receive updated A
      records specifying a new set of IP addresses to which to
      connect.
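
As a sketch of the multiple-A-record fallback described in the list above, the
following Go program resolves a service name and tries each returned address in
order until one accepts a connection. The hostname and port are hypothetical;
this is the behavior a "well-written client" is assumed to have, not code
shipped with Kubernetes.

```
package main

import (
	"fmt"
	"net"
	"time"
)

// dialService does a fresh DNS resolution and tries each returned A record
// in order, falling through to the next address on connection failure.
func dialService(host, port string) (net.Conn, error) {
	addrs, err := net.LookupHost(host)
	if err != nil {
		return nil, err
	}
	for _, addr := range addrs {
		conn, err := net.DialTimeout("tcp", net.JoinHostPort(addr, port), 3*time.Second)
		if err == nil {
			return conn, nil // first working endpoint wins
		}
	}
	return nil, fmt.Errorf("no reachable endpoint for %s", host)
}

func main() {
	conn, err := dialService("my-service.my-namespace.my-federation.my-domain.com", "2379")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```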
### Portability
A Kubernetes application configuration (e.g. for a Pod, Replication
Controller, Service etc) should be able to be successfully deployed
into any Kubernetes Cluster or Ubernetes Federation of Clusters,
without modification. More specifically, a typical configuration
should work correctly (although possibly not optimally) across any of
the following environments:
1. A single Kubernetes Cluster on one cloud provider (e.g. Google
Compute Engine, GCE)
1. A single Kubernetes Cluster on a different cloud provider
(e.g. Amazon Web Services, AWS)
1. A single Kubernetes Cluster on a non-cloud, on-premise data center
1. A Federation of Kubernetes Clusters all on the same cloud provider
(e.g. GCE)
1. A Federation of Kubernetes Clusters across multiple different cloud
providers and/or on-premise data centers (e.g. one cluster on
GCE/GKE, one on AWS, and one on-premise).
### Trading Portability for Optimization
It should be possible to explicitly opt out of portability across some
subset of the above environments in order to take advantage of
non-portable load balancing and DNS features of one or more
environments. More specifically, for example:
1. For HTTP(S) applications running on GCE-only Federations,
[GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
should be usable. These provide single, static global IP addresses
which load balance and fail over globally (i.e. across both regions
and zones). These allow for really dumb clients, but they only
work on GCE, and only for HTTP(S) traffic.
1. For non-HTTP(S) applications running on GCE-only Federations within
a single region,
[GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
should be usable. These provide TCP (i.e. both HTTP/S and
non-HTTP/S) load balancing and failover, but only on GCE, and only
within a single region.
[Google Cloud DNS](https://cloud.google.com/dns) can be used to
route traffic between regions (and between different cloud
providers and on-premise clusters, as it's plain DNS, IP only).
1. For applications running on AWS-only Federations,
[AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
should be usable. These provide both L7 (HTTP(S)) and L4 load
balancing, but only within a single region, and only on AWS
([AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be
used to load balance and fail over across multiple regions, and is
also capable of resolving to non-AWS endpoints).
## Component Cloud Services
Ubernetes cross-cluster load balancing is built on top of the following:
1. [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
provide single, static global IP addresses which load balance and
fail over globally (i.e. across both regions and zones). These
allow for really dumb clients, but they only work on GCE, and only
for HTTP(S) traffic.
1. [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
provide both HTTP(S) and non-HTTP(S) load balancing and failover,
but only on GCE, and only within a single region.
1. [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
provide both L7 (HTTP(S)) and L4 load balancing, but only within a
single region, and only on AWS.
1. [Google Cloud DNS](https://cloud.google.com/dns) (or any other
   programmable DNS service, like
   [CloudFlare](http://www.cloudflare.com)) can be used to route
   traffic between regions (and between different cloud providers and
   on-premise clusters, as it's plain DNS, IP only). Google Cloud DNS
doesn't provide any built-in geo-DNS, latency-based routing, health
checking, weighted round robin or other advanced capabilities.
It's plain old DNS. We would need to build all the aforementioned
on top of it. It can provide internal DNS services (i.e. serve RFC
1918 addresses).
1. [AWS Route 53 DNS service](https://aws.amazon.com/route53/) can
   be used to load balance and fail over across regions, and is also
   capable of routing to non-AWS endpoints. It provides built-in
geo-DNS, latency-based routing, health checking, weighted
round robin and optional tight integration with some other
AWS services (e.g. Elastic Load Balancers).
1. Kubernetes L4 Service Load Balancing: This provides both a
[virtual cluster-local](http://kubernetes.io/v1.1/docs/user-guide/services.html#virtual-ips-and-service-proxies)
and a
[real externally routable](http://kubernetes.io/v1.1/docs/user-guide/services.html#type-loadbalancer)
service IP which is load-balanced (currently simple round-robin)
across the healthy pods comprising a service within a single
Kubernetes cluster.
1. [Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html): A generic wrapper around cloud-provided L4 and L7 load balancing services, and roll-your-own load balancers run in pods, e.g. HAProxy.
## Ubernetes API
The Ubernetes API for load balancing should be compatible with the
equivalent Kubernetes API, to ease porting of clients between
Ubernetes and Kubernetes. Further details below.
## Common Client Behavior
To be useful, our load balancing solution needs to work properly with
real client applications. There are a few different classes of
those...
### Browsers
Browsers are the most common external clients. They generally behave as
"well-written clients"; see below.
### Well-written clients
1. Do a DNS resolution every time they connect.
1. Don't cache beyond TTL (although a small percentage of the DNS
servers on which they rely might).
1. Do try multiple A records (in order) to connect.
1. (in an ideal world) Do use SRV records rather than hard-coded port numbers.
Examples:
+ all common browsers (except for SRV records)
+ ...
### Dumb clients
1. Don't do a DNS resolution every time they connect (or do cache
beyond the TTL).
1. Do try multiple A records
Examples:
+ ...
### Dumber clients
1. Only do a DNS lookup once on startup.
1. Only try the first returned DNS A record.
Examples:
+ ...
### Dumbest clients
1. Never do a DNS lookup - are pre-configured with a single (or
possibly multiple) fixed server IP(s). Nothing else matters.
## Architecture and Implementation
### General control plane architecture
Each cluster hosts one or more Ubernetes master components (Ubernetes API servers, controller managers with leader election, and
etcd quorum members). This is documented in more detail in a
[separate design doc: Kubernetes/Ubernetes Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#).
In the description below, assume that 'n' clusters, named
'cluster-1' ... 'cluster-n', have been registered against an Ubernetes
Federation "federation-1", each with its own set of Kubernetes API
endpoints, i.e.
[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1),
[http://endpoint-2.cluster-1](http://endpoint-2.cluster-1)
... [http://endpoint-m.cluster-n](http://endpoint-m.cluster-n).
### Federated Services
Ubernetes Services are pretty straightforward. They consist of
multiple equivalent underlying Kubernetes Services, each with its
own external endpoint, and a load balancing mechanism across them.
Let's work through how exactly that works in practice.
Our user creates the following Ubernetes Service (against an Ubernetes
API endpoint):

    $ kubectl create -f my-service.yaml --context="federation-1"

where `my-service.yaml` contains the following:

    kind: Service
    metadata:
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
    spec:
      ports:
      - port: 2379
        protocol: TCP
        targetPort: 2379
        name: client
      - port: 2380
        protocol: TCP
        targetPort: 2380
        name: peer
      selector:
        run: my-service
      type: LoadBalancer
Ubernetes in turn creates one equivalent service (identical config to
the above) in each of the underlying Kubernetes clusters, each of
which results in something like this:

    $ kubectl get -o yaml --context="cluster-1" service my-service
    apiVersion: v1
    kind: Service
    metadata:
      creationTimestamp: 2015-11-25T23:35:25Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      resourceVersion: "147365"
      selfLink: /api/v1/namespaces/my-namespace/services/my-service
      uid: 33bfc927-93cd-11e5-a38c-42010af00002
    spec:
      clusterIP: 10.0.153.185
      ports:
      - name: client
        nodePort: 31333
        port: 2379
        protocol: TCP
        targetPort: 2379
      - name: peer
        nodePort: 31086
        port: 2380
        protocol: TCP
        targetPort: 2380
      selector:
        run: my-service
      sessionAffinity: None
      type: LoadBalancer
    status:
      loadBalancer:
        ingress:
        - ip: 104.197.117.10
Similar services are created in `cluster-2` and `cluster-3`, each of
which is allocated its own `spec.clusterIP` and
`status.loadBalancer.ingress.ip`.
In Ubernetes `federation-1`, the resulting federated service looks as follows:

    $ kubectl get -o yaml --context="federation-1" service my-service
    apiVersion: v1
    kind: Service
    metadata:
      creationTimestamp: 2015-11-25T23:35:23Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      resourceVersion: "157333"
      selfLink: /api/v1/namespaces/my-namespace/services/my-service
      uid: 33bfc927-93cd-11e5-a38c-42010af00007
    spec:
      clusterIP:
      ports:
      - name: client
        nodePort: 31333
        port: 2379
        protocol: TCP
        targetPort: 2379
      - name: peer
        nodePort: 31086
        port: 2380
        protocol: TCP
        targetPort: 2380
      selector:
        run: my-service
      sessionAffinity: None
      type: LoadBalancer
    status:
      loadBalancer:
        ingress:
        - hostname: my-service.my-namespace.my-federation.my-domain.com
Note that the federated service:
1. is API-compatible with a vanilla Kubernetes service,
1. has no clusterIP (as it is cluster-independent), and
1. has a federation-wide load balancer hostname.
In addition to the set of underlying Kubernetes services (one per
cluster) described above, Ubernetes has also created a DNS name
(e.g. on [Google Cloud DNS](https://cloud.google.com/dns) or
[AWS Route 53](https://aws.amazon.com/route53/), depending on
configuration) which provides load balancing across all of those
services. For example, in a very basic configuration:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.117.10
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157
Each of the above IP addresses (which are just the external load
balancer ingress IP's of each cluster service) is of course load
balanced across the pods comprising the service in each cluster.
In a more sophisticated configuration (e.g. on GCE or GKE), Ubernetes
automatically creates a
[GCE Global L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
which exposes a single, globally load-balanced IP:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 107.194.17.44
Optionally, Ubernetes also configures the local DNS servers (SkyDNS)
in each Kubernetes cluster to preferentially return the local
clusterIP for the service in that cluster, with other clusters'
external service IP's (or a global load-balanced IP) also configured
for failover purposes:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 10.0.153.185
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157
If Ubernetes Global Service Health Checking is enabled, multiple
service health checkers running across the federated clusters
collaborate to monitor the health of the service endpoints, and
automatically remove unhealthy endpoints from the DNS record (e.g. a
majority quorum is required to vote a service endpoint unhealthy, to
avoid false positives due to individual health checker network
isolation).
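
A minimal Go sketch of the majority-quorum vote described above; the checker
names are hypothetical and the real health checking protocol is not yet
specified, so this only illustrates how quorum voting avoids false positives
from an isolated checker.

```
package main

import "fmt"

// endpointUnhealthy returns true only if a strict majority of the federated
// health checkers report the endpoint as failed, so that a single
// network-isolated checker cannot remove a healthy endpoint from DNS.
func endpointUnhealthy(votes map[string]bool) bool {
	failed := 0
	for _, sawFailure := range votes {
		if sawFailure {
			failed++
		}
	}
	return failed*2 > len(votes)
}

func main() {
	votes := map[string]bool{
		"checker-cluster-1": true,  // this checker could not reach the endpoint
		"checker-cluster-2": false, // these two could
		"checker-cluster-3": false,
	}
	// Prints "remove from DNS: false": one failed probe is not a quorum.
	fmt.Println("remove from DNS:", endpointUnhealthy(votes))
}
```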
### Federated Replication Controllers
So far we have a federated service defined, with a resolvable load
balancer hostname by which clients can reach it, but no pods serving
traffic directed there. So now we need a Federated Replication
Controller. These are also fairly straightforward, consisting
of multiple underlying Kubernetes Replication Controllers which do the
hard work of keeping the desired number of Pod replicas alive in each
Kubernetes cluster.

    $ kubectl create -f my-service-rc.yaml --context="federation-1"

where `my-service-rc.yaml` contains the following:

    kind: ReplicationController
    metadata:
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
    spec:
      replicas: 6
      selector:
        run: my-service
      template:
        metadata:
          labels:
            run: my-service
        spec:
          containers:
          - image: gcr.io/google_samples/my-service:v1
            name: my-service
            ports:
            - containerPort: 2379
              protocol: TCP
            - containerPort: 2380
              protocol: TCP
Ubernetes in turn creates one equivalent replication controller
(identical config to the above, except for the replica count) in each
of the underlying Kubernetes clusters, each of which results in
something like this:

    $ ./kubectl get -o yaml rc my-service --context="cluster-1"
    kind: ReplicationController
    metadata:
      creationTimestamp: 2015-12-02T23:00:47Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      selfLink: /api/v1/namespaces/my-namespace/replicationcontrollers/my-service
      uid: 86542109-9948-11e5-a38c-42010af00002
    spec:
      replicas: 2
      selector:
        run: my-service
      template:
        metadata:
          labels:
            run: my-service
        spec:
          containers:
          - image: gcr.io/google_samples/my-service:v1
            name: my-service
            ports:
            - containerPort: 2379
              protocol: TCP
            - containerPort: 2380
              protocol: TCP
            resources: {}
          dnsPolicy: ClusterFirst
          restartPolicy: Always
    status:
      replicas: 2
The exact number of replicas created in each underlying cluster will
of course depend on what scheduling policy is in force. In the above
example, the scheduler created an equal number of replicas (2) in each
of the three underlying clusters, to make up the total of 6 replicas
required. To handle entire cluster failures, various approaches are possible,
including:
1. **simple overprovisioning**, such that sufficient replicas remain even if a
   cluster fails (see the sketch after this list). This wastes some resources,
   but is simple and reliable.
2. **pod autoscaling**, where the replication controller in each
cluster automatically and autonomously increases the number of
replicas in its cluster in response to the additional traffic
diverted from the
failed cluster. This saves resources and is relatively simple,
but there is some delay in the autoscaling.
3. **federated replica migration**, where the Ubernetes Federation
Control Plane detects the cluster failure and automatically
increases the replica count in the remaining clusters to make up
for the lost replicas in the failed cluster. This does not seem to
offer any benefits relative to pod autoscaling above, and is
arguably more complex to implement, but we note it here as a
possibility.
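
As a sketch of the simple overprovisioning arithmetic in option 1 above: to
survive the loss of any single cluster, each cluster must carry enough replicas
that the remaining clusters still meet the requirement. The function below is
illustrative only; the replica and cluster counts are taken from the example
above.

```
package main

import "fmt"

// replicasPerCluster returns how many replicas to run in each of "clusters"
// clusters so that at least "required" replicas survive the failure of any
// single cluster, i.e. ceil(required / (clusters - 1)).
func replicasPerCluster(required, clusters int) int {
	if clusters < 2 {
		return required // nothing to overprovision against
	}
	surviving := clusters - 1
	return (required + surviving - 1) / surviving
}

func main() {
	// 6 required replicas across 3 clusters: 3 per cluster, 9 in total,
	// versus 2 per cluster (6 total) without overprovisioning.
	per := replicasPerCluster(6, 3)
	fmt.Printf("%d replicas per cluster, %d total\n", per, per*3)
}
```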
### Implementation Details
The implementation approach and architecture are very similar to
those of Kubernetes, so if you're familiar with how Kubernetes works, none of
what follows will be surprising. One additional design driver not
present in Kubernetes is that Ubernetes aims to be resilient to
individual cluster and availability zone failures. So the control
plane spans multiple clusters. More specifically:
+ Ubernetes runs its own distinct set of API servers (typically one
  or more per underlying Kubernetes cluster). These are completely
  distinct from the Kubernetes API servers for each of the underlying
  clusters.
+ Ubernetes runs its own distinct quorum-based metadata store (etcd,
  by default). Approximately one quorum member runs in each underlying
  cluster ("approximately" because we aim for an odd number of quorum
  members, and typically don't want more than 5 quorum members, even
  if we have a larger number of federated clusters, so 2 clusters->3
  quorum members, 3->3, 4->3, 5->5, 6->5, 7->5 etc; a sketch of this
  sizing rule follows this list).
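
The quorum sizing rule in the second bullet can be written down directly; this
small Go sketch just reproduces the stated mapping (odd member count, at most
five) and is not code from the Ubernetes control plane.

```
package main

import "fmt"

// quorumMembers returns the number of etcd quorum members for a federation
// of n clusters: an odd number, roughly one per cluster, capped at five.
func quorumMembers(n int) int {
	if n >= 5 {
		return 5
	}
	if n < 3 || n%2 == 0 {
		return 3 // round up to the nearest odd count, minimum three
	}
	return n
}

func main() {
	for n := 2; n <= 7; n++ {
		// Prints 2->3, 3->3, 4->3, 5->5, 6->5, 7->5, matching the text above.
		fmt.Printf("%d clusters -> %d quorum members\n", n, quorumMembers(n))
	}
}
```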
Cluster Controllers in Ubernetes watch the Ubernetes API
server/etcd state, and apply changes to the underlying Kubernetes
clusters accordingly. They also provide the anti-entropy mechanism that
reconciles Ubernetes "desired desired" state against Kubernetes
"actual desired" state.

View File

@ -0,0 +1,434 @@
# Ubernetes Design Spec (phase one)
**Huawei PaaS Team**
## INTRODUCTION
In this document we propose a design for the “Control Plane” of
Kubernetes (K8S) federation (a.k.a. “Ubernetes”). For background of
this work please refer to
[this proposal](../../docs/proposals/federation.md).
The document is arranged as follows. First we briefly list the scenarios
and use cases that motivate K8S federation work. These use cases drive
the design, and they also serve to verify it. We then summarize the
functionality requirements from these use cases, and define the "in
scope" functionality that will be covered by this design (phase
one). After that we give an overview of the proposed architecture, API
and building blocks, and walk through several activity flows to
show how these building blocks work together to support the use cases.
## REQUIREMENTS
There are many reasons why customers may want to build a K8S
federation:
+ **High Availability:** Customers want to be immune to the outage of
a single availability zone, region or even a cloud provider.
+ **Sensitive workloads:** Some workloads can only run on a particular
cluster. They cannot be scheduled to or migrated to other clusters.
+ **Capacity overflow:** Customers prefer to run workloads on a
primary cluster. But if the capacity of the cluster is not
sufficient, workloads should be automatically distributed to other
clusters.
+ **Vendor lock-in avoidance:** Customers want to spread their
workloads on different cloud providers, and can easily increase or
decrease the workload proportion of a specific provider.
+ **Cluster Size Enhancement:** Currently a K8S cluster can only support
  a limited size. While the community is actively improving this, it can
  be expected that cluster size will be a problem if K8S is used for
  large workloads or public PaaS infrastructure. While we can separate
  different tenants into different clusters, it would be good to have a
  unified view.
Here are the functionality requirements derived from above use cases:
+ Clients of the federation control plane API server can register and deregister clusters.
+ Workloads should be spread to different clusters according to the
workload distribution policy.
+ Pods are able to discover and connect to services hosted in other
clusters (in cases where inter-cluster networking is necessary,
desirable and implemented).
+ Traffic to these pods should be spread across clusters (in a manner
similar to load balancing, although it might not be strictly
speaking balanced).
+ The control plane needs to know when a cluster is down, and migrate
the workloads to other clusters.
+ Clients have a unified view and a central control point for above
activities.
## SCOPE
It's difficult to arrive in one step at a perfect design that implements
all the above requirements. Therefore we will take an iterative
approach to designing and building the system. This document describes
phase one of the overall work. In phase one we will cover only the
following objectives:
+ Define the basic building blocks and API objects of control plane
+ Implement a basic end-to-end workflow
+ Clients register federated clusters
+ Clients submit a workload
+ The workload is distributed to different clusters
+ Service discovery
+ Load balancing
The following parts are NOT covered in phase one:
+ Authentication and authorization (other than basic client
authentication against the ubernetes API, and from ubernetes control
plane to the underlying kubernetes clusters).
+ Deployment units other than replication controller and service
+ Complex distribution policy of workloads
+ Service affinity and migration
## ARCHITECTURE
The overall architecture of the control plane is shown below:
![Ubernetes Architecture](ubernetes-design.png)
Some design principles we are following in this architecture:
1. Keep the underlying K8S clusters independent. They should have no
   knowledge of the control plane or of each other.
1. Keep the Ubernetes API interface compatible with the K8S API as much as
   possible.
1. Re-use concepts from K8S as much as possible. This reduces
   the learning curve for customers and is good for adoption.

Below is a brief description of each module contained in the above diagram.
## Ubernetes API Server
The API Server in the Ubernetes control plane works just like the API
Server in K8S. It talks to a distributed key-value store to persist,
retrieve and watch API objects. This store is completely distinct
from the Kubernetes key-value stores (etcd) in the underlying
Kubernetes clusters. We still use `etcd` as the distributed
storage so customers don't need to learn and manage a different
storage system, although it is envisaged that other storage systems
(Consul, ZooKeeper) will probably be developed and supported over
time.
## Ubernetes Scheduler
The Ubernetes Scheduler schedules resources onto the underlying
Kubernetes clusters. For example it watches for unscheduled Ubernetes
replication controllers (those that have not yet been scheduled onto
underlying Kubernetes clusters) and performs the global scheduling
work. For each unscheduled replication controller, it calls the policy
engine to decide how to split workloads among clusters. It creates a
Kubernetes Replication Controller on one or more underlying clusters,
and posts them back to `etcd` storage.
One subtlety worth noting here is that the scheduling decision is
arrived at by combining the application-specific request from the user (which might
include, for example, placement constraints), and the global policy specified
by the federation administrator (for example, "prefer on-premise
clusters over AWS clusters" or "spread load equally across clusters").
## Ubernetes Cluster Controller
The cluster controller
performs the following two kinds of work:
1. It watches all the sub-resources that are created by Ubernetes
components, like a sub-RC or a sub-service. And then it creates the
corresponding API objects on the underlying K8S clusters.
1. It periodically retrieves the available resource metrics from the
   underlying K8S cluster, and updates them in the object status of the
   `cluster` API object. An alternative design might be to run a pod
in each underlying cluster that reports metrics for that cluster to
the Ubernetes control plane. Which approach is better remains an
open topic of discussion.
## Ubernetes Service Controller
The Ubernetes service controller is a federation-level implementation
of the K8S service controller. It watches service resources created on the
control plane, and creates corresponding K8S services on each involved K8S
cluster. Besides interacting with service resources on each
individual K8S cluster, the Ubernetes service controller also
performs some global DNS registration work.
## API OBJECTS
## Cluster
Cluster is a new first-class API object introduced in this design. For
each registered K8S cluster there will be such an API resource in the
control plane. Clients register or deregister a cluster by
sending the corresponding REST requests to the following URL:
`/api/{$version}/clusters`. Because the control plane behaves like a
regular K8S client to the underlying clusters, the spec of a cluster
object contains necessary properties like the K8S cluster address and
credentials. The status of a cluster API object will contain the
following information (an illustrative sketch of these types appears
after the tables below):
1. Its current lifecycle phase
1. Cluster resource metrics for scheduling decisions
1. Other metadata, like the version of the cluster
$version.clusterSpec
<table style="border:1px solid #000000;border-collapse:collapse;">
<tbody>
<tr>
<td style="padding:5px;"><b>Name</b><br>
</td>
<td style="padding:5px;"><b>Description</b><br>
</td>
<td style="padding:5px;"><b>Required</b><br>
</td>
<td style="padding:5px;"><b>Schema</b><br>
</td>
<td style="padding:5px;"><b>Default</b><br>
</td>
</tr>
<tr>
<td style="padding:5px;">Address<br>
</td>
<td style="padding:5px;">address of the cluster<br>
</td>
<td style="padding:5px;">yes<br>
</td>
<td style="padding:5px;">address<br>
</td>
<td style="padding:5px;"><p></p></td>
</tr>
<tr>
<td style="padding:5px;">Credential<br>
</td>
<td style="padding:5px;">the type (e.g. bearer token, client
certificate etc) and data of the credential used to access cluster. Its used for system routines (not behalf of users)<br>
</td>
<td style="padding:5px;">yes<br>
</td>
<td style="padding:5px;">string <br>
</td>
<td style="padding:5px;"><p></p></td>
</tr>
</tbody>
</table>
$version.clusterStatus
<table style="border:1px solid #000000;border-collapse:collapse;">
<tbody>
<tr>
<td style="padding:5px;"><b>Name</b><br>
</td>
<td style="padding:5px;"><b>Description</b><br>
</td>
<td style="padding:5px;"><b>Required</b><br>
</td>
<td style="padding:5px;"><b>Schema</b><br>
</td>
<td style="padding:5px;"><b>Default</b><br>
</td>
</tr>
<tr>
<td style="padding:5px;">Phase<br>
</td>
<td style="padding:5px;">the recently observed lifecycle phase of the cluster<br>
</td>
<td style="padding:5px;">yes<br>
</td>
<td style="padding:5px;">enum<br>
</td>
<td style="padding:5px;"><p></p></td>
</tr>
<tr>
<td style="padding:5px;">Capacity<br>
</td>
<td style="padding:5px;">represents the available resources of a cluster<br>
</td>
<td style="padding:5px;">yes<br>
</td>
<td style="padding:5px;">any<br>
</td>
<td style="padding:5px;"><p></p></td>
</tr>
<tr>
<td style="padding:5px;">ClusterMeta<br>
</td>
<td style="padding:5px;">Other cluster metadata like the version<br>
</td>
<td style="padding:5px;">yes<br>
</td>
<td style="padding:5px;">ClusterMeta<br>
</td>
<td style="padding:5px;"><p></p></td>
</tr>
</tbody>
</table>
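To make the two tables above concrete, here is an illustrative Go sketch of what
the Cluster API types might look like. All field and type names are assumptions
for the sketch (the real schema is defined by the API once implemented); the
phase values match the lifecycle phases listed below.

```
// Package federation: illustrative types only, mirroring the clusterSpec and
// clusterStatus tables above; not the final API definition.
package federation

// ClusterPhase is the lifecycle phase of a registered cluster.
type ClusterPhase string

const (
	ClusterPending    ClusterPhase = "pending"
	ClusterRunning    ClusterPhase = "running"
	ClusterOffline    ClusterPhase = "offline"
	ClusterTerminated ClusterPhase = "terminated"
)

// ClusterSpec holds what the control plane needs to reach a member cluster.
type ClusterSpec struct {
	Address    string // address of the underlying K8S cluster's API server
	Credential string // type and data of the credential used by system routines
}

// ClusterStatus is the most recently observed state of a member cluster.
type ClusterStatus struct {
	Phase       ClusterPhase      // recently observed lifecycle phase
	Capacity    map[string]string // available resources (e.g. CPU, memory)
	ClusterMeta map[string]string // other metadata, e.g. the cluster version
}

// Cluster is the first-class API object registered at /api/{$version}/clusters.
type Cluster struct {
	Name   string
	Spec   ClusterSpec
	Status ClusterStatus
}
```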
**For simplicity we didn't introduce a separate "cluster metrics" API
object here**. The cluster resource metrics are stored in the cluster
status section, just as they are for nodes in K8S. In phase one it
only contains available CPU and memory resources. The
cluster controller will periodically poll the underlying cluster API
Server to get cluster capacity. In phase one it gets the metrics by
simply aggregating metrics from all nodes. In future we will improve
this with more efficient mechanisms, such as leveraging Heapster, and
more metrics will be supported. Similar to node phases in K8S, the "phase"
field includes the following values:
+ pending: newly registered clusters or clusters suspended by admin
for various reasons. They are not eligible for accepting workloads
+ running: clusters in normal status that can accept workloads
+ offline: clusters temporarily down or not reachable
+ terminated: clusters removed from federation
Below is the state transition diagram.
![Cluster State Transition Diagram](ubernetes-cluster-state.png)
## Replication Controller
A global workload submitted to control plane is represented as an
Ubernetes replication controller. When a replication controller
is submitted to control plane, clients need a way to express its
requirements or preferences on clusters. Depending on different use
cases it may be complex. For example:
+ This workload can only be scheduled to cluster Foo. It cannot be
scheduled to any other clusters. (use case: sensitive workloads).
+ This workload prefers cluster Foo. But if there is no available
capacity on cluster Foo, its OK to be scheduled to cluster Bar
(use case: workload )
+ Seventy percent of this workload should be scheduled to cluster Foo,
and thirty percent should be scheduled to cluster Bar (use case:
vendor lock-in avoidance). In phase one, we only introduce a
_clusterSelector_ field to filter acceptable clusters. In default
case there is no such selector and it means any cluster is
acceptable.
Below is a sample of the YAML to create such a replication controller.
```
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller
spec:
  replicas: 5
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
  clusterSelector:
    name in (Foo, Bar)
```
Currently clusterSelector (implemented as a
[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704))
only supports a simple list of acceptable clusters. Workloads will be
evenly distributed on these acceptable clusters in phase one. After
phase one we will define syntax to represent more advanced
constraints, like cluster preference ordering, desired number of
split workloads, desired ratio of workloads spread across different
clusters, etc.
Besides this explicit “clusterSelector” filter, a workload may have
some implicit scheduling restrictions. For example it defines
“nodeSelector” which can only be satisfied on some particular
clusters. How to handle this will be addressed after phase one.
## Ubernetes Services
The Service API object exposed by Ubernetes is similar to service
objects on Kubernetes. It defines the access to a group of pods. The
Ubernetes service controller will create corresponding Kubernetes
service objects on underlying clusters. These are detailed in a
separate design document: [Federated Services](federated-services.md).
## Pod
In phase one we only support scheduling replication controllers. Pod
scheduling will be supported in later phase. This is primarily in
order to keep the Ubernetes API compatible with the Kubernetes API.
## ACTIVITY FLOWS
## Scheduling
The diagram below shows how workloads are scheduled on the Ubernetes control plane:
1. A replication controller is created by the client.
1. The API Server persists it to storage.
1. The cluster controller periodically polls the latest available resource
   metrics from the underlying clusters.
1. The scheduler watches all pending RCs. It picks up an RC, makes
   policy-driven decisions and splits it into different sub-RCs.
1. Each cluster controller watches the sub-RCs bound to its
   corresponding cluster, and picks up any newly created sub-RC.
1. The cluster controller issues requests to the underlying cluster
   API Server to create the RC. In phase one we don't support complex
   distribution policies. The scheduling rule is basically:
   1. If an RC does not specify any nodeSelector, it will be scheduled
      to the least loaded K8S cluster(s) that have enough available
      resources.
   1. If an RC specifies _N_ acceptable clusters in the
      clusterSelector, all replicas will be evenly distributed among
      these clusters (see the sketch after this list).
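
The even-distribution rule in the last step can be sketched as follows. This is
illustrative Go, not the scheduler implementation; the cluster names come from
the clusterSelector example above.

```
package main

import "fmt"

// splitReplicas distributes a federated RC's replica count as evenly as
// possible across the acceptable clusters from its clusterSelector.
func splitReplicas(total int, clusters []string) map[string]int {
	out := make(map[string]int, len(clusters))
	base, extra := total/len(clusters), total%len(clusters)
	for i, c := range clusters {
		out[c] = base
		if i < extra {
			out[c]++ // spread any remainder one replica at a time
		}
	}
	return out
}

func main() {
	// 5 replicas constrained to clusters Foo and Bar: 3 and 2 respectively.
	for cluster, n := range splitReplicas(5, []string{"Foo", "Bar"}) {
		fmt.Printf("%s: %d replicas\n", cluster, n)
	}
}
```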
There is a potential race condition here. Say at time _T1_ the control
plane learns that there are _m_ available resources in a K8S cluster. As
the cluster is working independently, it still accepts workload
requests from other K8S clients, or even another Ubernetes control
plane. The Ubernetes scheduling decision is based on this snapshot of
available resources. However, when the actual RC creation happens on
the cluster at time _T2_, the cluster may no longer have enough resources.
We will address this problem in later phases with proposed
solutions like resource reservation mechanisms.
![Ubernetes Scheduling](ubernetes-scheduling.png)
## Service Discovery
This part has been included in the section “Federated Service” of
document
“[Ubernetes Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md))”. Please
refer to that document for details.

Binary file not shown.

After

Width:  |  Height:  |  Size: 14 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 20 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 38 KiB