Updated API conventions and other details, per #6133.

2015-04-15 00:39:23 +00:00 · 2015-04-15 00:39:23 +00:00 · 7beb6ddc76
parent f7ae442a02
commit 7beb6ddc76
6 changed files with 103 additions and 83 deletions
--- a/docs/api-conventions.md
+++ b/docs/api-conventions.md
@@ -1,7 +1,9 @@
 API Conventions
 ===============

-The conventions of the Kubernetes API (and related APIs in the ecosystem) are intended to ease client development and ensure that configuration mechanisms can be implemented that work across a diverse set of use cases consistently.
+Updated: 4/14/2015
+
+The conventions of the [Kubernetes API](api.md) (and related APIs in the ecosystem) are intended to ease client development and ensure that configuration mechanisms can be implemented that work across a diverse set of use cases consistently.

 The general style of the Kubernetes API is RESTful - clients create, update, delete, or retrieve a description of an object via the standard HTTP verbs (POST, PUT, DELETE, and GET) - and those APIs preferentially accept and return JSON. Kubernetes also exposes additional endpoints for non-standard verbs and allows alternative content types. All of the JSON accepted and returned by the server has a schema, identified by the "kind" and "apiVersion" fields.

@@ -14,6 +16,8 @@ The following terms are defined:

 Each resource typically accepts and returns data of a single kind.  A kind may be accepted or returned by multiple resources that reflect specific use cases. For instance, the kind "pod" is exposed as a "pods" resource that allows end users to create, update, and delete pods, while a separate "pod status" resource (that acts on "pod" kind) allows automated processes to update a subset of the fields in that resource. A "restart" resource might be exposed for a number of different resources to allow the same action to have different results for each object.

+Resource collections should be all lowercase and plural, whereas kinds are CamelCase and singular.
+

 Types (Kinds)
 -------------
@@ -50,6 +54,7 @@ Kinds are grouped into three categories:

 The standard REST verbs (defined below) MUST return singular JSON objects. Some API endpoints may deviate from the strict REST pattern and return resources that are not singular JSON objects, such as streams of JSON objects or unstructured text log data.

+The term "kind" is reserved for these "top-level" API types. The term "type" should be used for distinguishing sub-categories within objects or subobjects.

 ### Resources

@@ -58,6 +63,7 @@ All JSON objects returned by an API MUST have the following fields:
 * kind: a string that identifies the schema this object should have
 * apiVersion: a string that identifies the version of the schema the object should have

+These fields are required for proper decoding of the object. They may be populated by the server by default from the specified URL path, but the client likely needs to know the values in order to construct the URL path.

 ### Objects

@@ -73,25 +79,51 @@ Every object SHOULD have the following metadata in a nested object field called

 * resourceVersion: a string that identifies the internal version of this object that can be used by clients to determine when objects have changed. This value MUST be treated as opaque by clients and passed unmodified back to the server. Clients should not assume that the resource version has meaning across namespaces, different kinds of resources, or different servers. (see [concurrency control](#concurrency-control-and-consistency), below, for more details)
 * creationTimestamp: a string representing an RFC 3339 date of the date and time an object was created
+* deletionTimestamp: a string representing an RFC 3339 date of the date and time after which this resource will be deleted. This field is set by the server when a graceful deletion is requested by the user, and is not directly settable by a client. The resource will be deleted (no longer visible from resource lists, and not reachable by name) after the time in this field. Once set, this value may not be unset or be set further into the future, although it may be shortened or the resource may be deleted prior to this time.
 * labels: a map of string keys and values that can be used to organize and categorize objects (see [labels.md](labels.md))
 * annotations: a map of string keys and values that can be used by external tooling to store and retrieve arbitrary metadata about this object (see [annotations.md](annotations.md))

-Labels are intended for organizational purposes by end users (select the pods that match this label query). Annotations enable third party automation and tooling to decorate objects with additional metadata for their own use.
+Labels are intended for organizational purposes by end users (select the pods that match this label query). Annotations enable third-party automation and tooling to decorate objects with additional metadata for their own use.

 #### Spec and Status

-By convention, the Kubernetes API makes a distinction between the specification of the desired state of an object (a nested object field called "spec") and the status of the object at the current time (a nested object field called "status"). The specification is persisted in stable storage with the API object and reflects user input. The status is summarizes the current state of the object in the system, and is usually persisted with the object by an automated processes (but may be created on the fly).
+By convention, the Kubernetes API makes a distinction between the specification of the desired state of an object (a nested object field called "spec") and the status of the object at the current time (a nested object field called "status"). The specification is a complete description of the desired state, including configuration settings provided by the user, [default values](#defaulting) expanded by the system, and properties initialized or otherwise changed after creation by other ecosystem components (e.g., schedulers, auto-scalers), and is persisted in stable storage with the API object. If the specification is deleted, the object will be purged from the system. The status summarizes the current state of the object in the system, and is usually persisted with the object by an automated process but may be generated on the fly. At some cost and perhaps some temporary degradation in behavior, the status could be reconstructed by observation if it were lost.

-For example, a pod object has a "spec" object field that defines how the pod should be run. The pod also has a "status" object field that shows details about what is happening on the host that is running the containers in the pod (if available) and a summarized "phase" string that indicates where the pod is in its lifecycle.
+When a new version of an object is POSTed or PUT, the "spec" is updated and available immediately. Over time the system will work to bring the "status" into line with the "spec". The system will drive toward the most recent "spec" regardless of previous versions of that stanza. For example, if a value is changed from 2 to 5 in one PUT and then back down to 3 in another PUT, the system is not required to 'touch base' at 5 before changing the "status" to 3. In other words, the system's behavior is *level-based* rather than *edge-based*. This enables robust behavior in the presence of missed intermediate state changes.

-When a new version of an object is POSTed or PUT, the "spec" is updated and available immediately. Over time the system will work to bring the "status" into line with the "spec". The system will drive toward the most recent "spec" regardless of previous versions of that stanza. In other words, if a value is changed from 2 to 5 in one PUT and then back down to 3 in another PUT the system is not required to 'touch base' at 5 before changing the "status" to 3.
+The Kubernetes API also serves as the foundation for the declarative configuration schema for the system. In order to facilitate level-based operation and expression of declarative configuration, fields in the specification should have declarative rather than imperative names and semantics -- they represent the desired state, not actions intended to yield the desired state.
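
The level-based behavior described above can be illustrated with a minimal sketch. The `Obj` class, its field names, and the `reconcile` function are hypothetical and are not part of the Kubernetes API; the point is only that each step compares the latest spec with the observed status and ignores any history of intermediate values.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    """Hypothetical object: a desired value (spec) and an observed value (status)."""
    spec_replicas: int
    status_replicas: int = 0


def reconcile(obj: Obj) -> None:
    """One level-based step: move the status one unit toward the current spec."""
    if obj.status_replicas < obj.spec_replicas:
        obj.status_replicas += 1
    elif obj.status_replicas > obj.spec_replicas:
        obj.status_replicas -= 1


obj = Obj(spec_replicas=2)
for _ in range(2):
    reconcile(obj)       # status converges to 2

obj.spec_replicas = 5    # first PUT
reconcile(obj)           # status is now 3, on its way to 5
obj.spec_replicas = 3    # second PUT arrives before 5 was ever reached
reconcile(obj)           # no-op: status already matches the latest spec
```

An edge-based design would instead react to each change event ("2 became 5", "5 became 3") and could misbehave if one of those events were missed.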

-The PUT and POST verbs on objects will ignore the "status" values. Otherwise, PUT expects the whole object to be specified. Therefore, if a field is omitted it is assumed that the client wants to clear that field's value.
+The PUT and POST verbs on objects will ignore the "status" values. A `/status` subresource is provided to enable system components to update statuses of resources they manage.

-The PUT verb does not accept partial updates. Modification of just part of an object may be achieved by GETting the resource, modifying part of the spec, labels, or annotations, and then PUTting it back. See [concurrency control](#concurrency-control-and-consistency), below, regarding read-modify-write consistency when using this pattern. Some objects may expose alternative resource representations that allow mutation of the status, or performing custom actions on the object.
+Otherwise, PUT expects the whole object to be specified. Therefore, if a field is omitted it is assumed that the client wants to clear that field's value. The PUT verb does not accept partial updates. Modification of just part of an object may be achieved by GETting the resource, modifying part of the spec, labels, or annotations, and then PUTting it back. See [concurrency control](#concurrency-control-and-consistency), below, regarding read-modify-write consistency when using this pattern. Some objects may expose alternative resource representations that allow mutation of the status, or performing custom actions on the object.

 All objects that represent a physical resource whose state may vary from the user's desired intent SHOULD have a "spec" and a "status".  Objects whose state cannot vary from the user's desired intent MAY have only "spec", and MAY rename "spec" to a more appropriate name.

+Objects that contain both spec and status should not contain additional top-level fields other than the standard metadata fields.
+
+##### Typical status properties
+
+* **phase**: The phase is a simple, high-level summary of the phase of the lifecycle of an object. The phase should progress monotonically. Typical phase values are `Pending` (not yet fully physically realized), `Running` or `Active` (fully realized and active, but not necessarily operating correctly), and `Terminated` (no longer active), but may vary slightly for different types of objects. New phase values should not be added to existing objects in the future. Like other status fields, it must be possible to ascertain the lifecycle phase by observation. Additional details regarding the current phase may be contained in other fields.
+* **conditions**: Conditions represent orthogonal observations of an object's current state. Objects may report multiple conditions, and new types of conditions may be added in the future. Condition status values may be `True`, `False`, or `Unknown`. Unlike the phase, conditions are not expected to be monotonic -- their values may change back and forth. A typical condition type is `Ready`, which indicates the object was believed to be fully operational at the time it was last probed. Conditions may carry additional information, such as the last probe time or last transition time. 
+
+TODO(@vishh): Reason and Message.
+
+Phases and conditions are observations and not, themselves, state machines, nor do we define comprehensive state machines for objects with behaviors associated with state transitions. The system is level-based and should assume an Open World. Additionally, new observations and details about these observations may be added over time. 
+
+In order to preserve extensibility, in the future, we intend to explicitly convey properties that users and components care about rather than requiring those properties to be inferred from observations.
+
+Note that historical status information (e.g., last transition time, failure counts) is only provided on a best-effort basis and may be lost.
+
+Status information that may be large (especially unbounded in size, such as lists of references to other objects -- see below) and/or rapidly changing, such as [resource usage](resources.md#usage-data), should be put into separate objects, with possibly a reference from the original object. This helps to ensure that GETs and watch remain reasonably efficient for the majority of clients, which may not need that data.
+
+#### References to related objects
+
+Loosely coupled sets of objects, such as [pods](pods.md) overseen by a [replication controller](replication-controller.md), are usually best referred to using a [label selector](labels.md). In order to ensure that GETs of individual objects remain bounded in time and space, these sets may be queried via separate API queries, but will not be expanded in the referring object's status.
+
+References to specific objects, especially specific resource versions and/or specific fields of those objects, are specified using the `ObjectReference` type. Unlike partial URLs, the ObjectReference type facilitates flexible defaulting of fields from the referring object or other contextual information.
+
+References in the status of the referee to the referrer may be permitted, when the references are one-to-one and do not need to be frequently updated, particularly in an edge-based manner.
+
 #### Lists of named subobjects preferred over maps

 Discussed in [#2004](https://github.com/GoogleCloudPlatform/kubernetes/issues/2004) and elsewhere. There are no maps of subobjects in any API objects. Instead, the convention is to use a list of subobjects containing name fields.
@@ -143,8 +175,8 @@ API resources should use the traditional REST pattern:

 * GET /&lt;resourceNamePlural&gt; - Retrieve a list of type &lt;resourceName&gt;, e.g. GET /pods returns a list of Pods.
 * POST /&lt;resourceNamePlural&gt; - Create a new resource from the JSON object provided by the client.
-* GET /&lt;resourceNamePlural&gt;/&lt;name&gt; - Retrieves a single resource with the given name, e.g. GET /pods/first returns a Pod named 'first'.
-* DELETE /&lt;resourceNamePlural&gt;/&lt;name&gt;  - Delete the single resource with the given name.
+* GET /&lt;resourceNamePlural&gt;/&lt;name&gt; - Retrieves a single resource with the given name, e.g. GET /pods/first returns a Pod named 'first'. Should be constant time, and the resource should be bounded in size.
+* DELETE /&lt;resourceNamePlural&gt;/&lt;name&gt;  - Delete the single resource with the given name. DeleteOptions may specify gracePeriodSeconds, the optional duration in seconds before the object should be deleted. Individual kinds may declare fields which provide a default grace period, and different kinds may have differing kind-wide default grace periods. A user-provided grace period overrides a default grace period, including the zero grace period ("now").
 * PUT /&lt;resourceNamePlural&gt;/&lt;name&gt; - Update or create the resource with the given name with the JSON object provided by the client.
 * PATCH /&lt;resourceNamePlural&gt;/&lt;name&gt; - Selectively modify the specified fields of the resource. See more information [below](#patch).
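
The grace period rule for DELETE above can be sketched as a small resolution function. This is an illustration only: the function and parameter names are hypothetical, and the precedence of a per-object default over a kind-wide default is an assumption, since the text only states that a user-provided value overrides any default.

```python
from typing import Optional


def effective_grace_period(
    requested: Optional[int],
    object_default: Optional[int],
    kind_default: int,
) -> int:
    """Pick the grace period, in seconds, for a DELETE request."""
    # A user-provided value always wins. Compare against None rather than
    # truthiness so that an explicit 0 ("now") is not mistaken for "unset".
    if requested is not None:
        return requested
    # Assumed ordering: a default declared on the object beats the kind-wide one.
    if object_default is not None:
        return object_default
    return kind_default
```

The explicit `is not None` check is the important detail: writing `requested or kind_default` would silently discard a requested grace period of zero.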

@@ -248,14 +280,15 @@ Idempotency

 All compatible Kubernetes APIs MUST support "name idempotency" and respond with an HTTP status code 409 when a request is made to POST an object that has the same name as an existing object in the system. See [identifiers.md](identifiers.md) for details.

-TODO: name generation
+Names generated by the system may be requested using `metadata.generateName`. GenerateName indicates that the name should be made unique by the server prior to persisting it. A non-empty value for the field indicates the name will be made unique (and the name returned to the client will be different from the name passed). The value of this field will be combined with a unique suffix on the server if the Name field has not been provided. The provided value must be valid within the rules for Name, and may be truncated by the length of the suffix required to make the value unique on the server. If this field is specified, and Name is not present, the server will NOT return a 409 if the generated name exists. Instead, it will either return 201 Created or 504 with Reason `ServerTimeout`, indicating that a unique name could not be found in the time allotted and that the client should retry (optionally after the time indicated in the Retry-After header).
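
The name generation rule above can be sketched roughly as follows. This is not the server's implementation: the suffix length, the suffix alphabet, and the bounded retry loop standing in for "the time allotted" are all assumptions made for illustration. The 253-character limit is the naming convention from identifiers.md.

```python
import random
import string

MAX_NAME_LENGTH = 253  # naming convention from identifiers.md
SUFFIX_LENGTH = 5      # assumed suffix length, for illustration only


def generate_name(prefix: str, existing: set, attempts: int = 10) -> str:
    """Return prefix plus a random suffix that is not already in `existing`."""
    # Truncate the prefix so that prefix + suffix still fits in a valid name.
    base = prefix[: MAX_NAME_LENGTH - SUFFIX_LENGTH]
    alphabet = string.ascii_lowercase + string.digits
    for _ in range(attempts):
        suffix = "".join(random.choice(alphabet) for _ in range(SUFFIX_LENGTH))
        candidate = base + suffix
        if candidate not in existing:
            return candidate
    # Stand-in for the 504 ServerTimeout response: the client should retry.
    raise TimeoutError("no unique name found; client should retry")
```

A collision here leads to another attempt or a timeout, never to a conflict error, which mirrors the statement that the server does not return 409 for generated names.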

 Defaulting
 ----------

 Default resource values are API version-specific, and they are applied during
 the conversion from API-versioned declarative configuration to internal objects
-representing the desired state (`Spec`) of the resource.
+representing the desired state (`Spec`) of the resource. Subsequent GETs of the
+resource will include the default values explicitly.

 Incorporating the default values into the `Spec` ensures that `Spec` depicts the
 full desired state so that it is easier for the system to determine how to
@@ -299,6 +332,12 @@ APIs may return alternative representations of any resource in response to an Ac
 All dates should be serialized as RFC3339 strings.


+Units
+-----
+
+Units must either be explicit in the field name (e.g., `timeoutSeconds`), or must be specified as part of the value (e.g., `resource.Quantity`). Which approach is preferred is TBD.
+
+
 Selecting Fields
 ----------------

@@ -514,9 +553,3 @@ Events
 TODO: Document events (refer to another doc for details)


-API Documentation
-----------------
-
-API documentation can be found at [http://kubernetes.io/third_party/swagger-ui/](http://kubernetes.io/third_party/swagger-ui/).
-
-
--- a/docs/identifiers.md
+++ b/docs/identifiers.md
@@ -1,8 +1,10 @@
 # Identifiers
-All objects in the Kubernetes REST API are identified by a Name and a UID.
+All objects in the Kubernetes REST API are unambiguously identified by a Name and a UID.
+
+For non-unique user-provided attributes, Kubernetes provides [labels](labels.md) and [annotations](annotations.md).

 ## Names
-Names are user-provided.  Only one object of a given kind can have a given name at a time.  But if you delete an object, you can make a new object with the same name.  Names are the used to refer to an object in a resource URL, such as `/api/v1beta3/pods/some.name`.   Names may be up to maximum length of 253 characters and consist of lower case alphanumeric characters, `-`, and `.`.  See the [identifiers design doc](design/identifiers.md) for the precise syntax rules for names.
+Names are generally client-provided.  Only one object of a given kind can have a given name at a time (i.e., they are spatially unique).  But if you delete an object, you can make a new object with the same name.  Names are used to refer to an object in a resource URL, such as `/api/v1beta3/pods/some-name`.   By convention, the names of Kubernetes resources should be up to a maximum length of 253 characters and consist of lower case alphanumeric characters, `-`, and `.`, but certain resources have more specific restrictions.  See the [identifiers design doc](design/identifiers.md) for the precise syntax rules for names.

 ## UIDs
-UID are generated by Kubernetes.  Every object created over the whole lifetime of a Kubernetes cluster has a distinct UID.
+UIDs are generated by Kubernetes.  Every object created over the whole lifetime of a Kubernetes cluster has a distinct UID (i.e., they are spatially and temporally unique).
--- a/docs/labels.md
+++ b/docs/labels.md
@@ -80,8 +80,8 @@ _Set-based_ requirements can be mixed with _equality-based_ requirements. For ex
 ## API

 LIST and WATCH operations may specify label selectors to filter the sets of objects returned using a query parameter. Both requirements are permitted:
-   - _equality-based_ requirements: `?labels=key1%3Dvalue1,key2%3Dvalue2`
-   - _set-based_ requirements: `?labels=key+in+%28value1%2Cvalue2%29%2Ckey2+notin+%28value3`
+   - _equality-based_ requirements: `?label-selector=key1%3Dvalue1,key2%3Dvalue2`
+   - _set-based_ requirements: `?label-selector=key+in+%28value1%2Cvalue2%29%2Ckey2+notin+%28value3%29`
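
As a rough illustration of how a client might build these query strings, the sketch below URL-encodes a selector with Python's standard `urllib.parse.quote_plus`. The helper name `label_selector_query` is hypothetical; only the `label-selector` parameter name comes from the text above.

```python
from urllib.parse import quote_plus


def label_selector_query(selector: str, safe: str = "") -> str:
    """Encode a label selector as a `label-selector` query string."""
    # quote_plus turns spaces into "+" and percent-encodes "=", "(", ")" and ",".
    # Characters listed in `safe` are left unencoded.
    return "?label-selector=" + quote_plus(selector, safe=safe)


# Equality-based: commas are kept literal here by marking them as safe.
equality = label_selector_query("key1=value1,key2=value2", safe=",")

# Set-based: commas inside the selector are percent-encoded as %2C.
set_based = label_selector_query("key in (value1,value2),key2 notin (value3)")
```

Both the literal and the percent-encoded comma decode to the same selector string, so either form should be equivalent once the server has decoded the query.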

 Kubernetes also currently supports two objects that use label selectors to keep track of their members, `service`s and `replicationController`s:
 - `service`: A [service](/docs/services.md) is a configuration unit for the proxies that run on every worker node.  It is named and points to one or more pods.
--- a/docs/pod-states.md
+++ b/docs/pod-states.md
@@ -1,103 +1,86 @@
 # The life of a pod

-Updated: 9/22/2014
+Updated: 4/14/2015

-This document covers the intersection of pod states, the PodStatus type, the life-cycle of a pod, events, restart policies, and replication controllers.  It is not an exhaustive document, but an introduction to the topics.
+This document covers the lifecycle of a pod.  It is not an exhaustive document, but an introduction to the topic.

-## What is PodStatus?
+## Pod Phase

-While `PodStatus` represents the state of a pod, it is not intended to form a state machine. `PodStatus` is an observation of the current state of a pod.  As such, we discourage people from thinking about "transitions" or "changes" or "future states".
+Consistent with the overall [API convention](api-conventions.md#typical-status-properties), phase is a simple, high-level summary of the phase of the lifecycle of a pod. It is not intended to be a comprehensive rollup of observations of container-level or even pod-level conditions or other state, nor is it intended to be a comprehensive state machine.

-## Events
+The number and meanings of `PodPhase` values are tightly guarded.  Other than what is documented here, nothing should be assumed about pods with a given `PodPhase`.

-Since `PodStatus` is not a state machine, there are no edges which can be considered the "reason" for the current state.  Reasons can be determined by examining the events for the pod.  Events that affect containers, e.g. OOM, are reported as pod events.
+* Pending: The pod has been accepted by the system, but one or more of the container images has not been created.  This includes time before being scheduled as well as time spent downloading images over the network, which could take a while.
+* Running: The pod has been bound to a node, and all of the containers have been created.  At least one container is still running, or is in the process of starting or restarting.
+* Succeeded: All containers in the pod have terminated in success, and will not be restarted.
+* Failed: All containers in the pod have terminated, and at least one container has terminated in failure (exited with non-zero exit status or was terminated by the system).

-TODO(@lavalamp) Event design
+## Pod Conditions

-## Controllers and RestartPolicy
+A pod containing containers that specify readiness probes will also report the Ready condition. Condition status values may be `True`, `False`, or `Unknown`.

-The only controller we have today is `ReplicationController`.  `ReplicationController` is *only* appropriate for pods with `RestartPolicy = Always`.  `ReplicationController` should refuse to instantiate any pod that has a different restart policy.
+## Container Statuses

-There is a legitimate need for a controller which keeps pods with other policies alive.  Both of the other policies (`OnFailure` and `Never`) eventually terminate, at which point the controller should stop recreating them.  Because of this fundamental distinction, let's hypothesize a new controller, called `JobController` for the sake of this document, which can implement this policy.
+More detailed information about the current (and previous) container statuses can be found in `containerStatuses`. The information reported depends on the current ContainerState, which may be Waiting, Running, or Termination (sic). 

-## Container termination
+## RestartPolicy

-Containers can terminate with one of two statuses:
-   1. success: The container exited voluntarily with a status code of 0.
-   1. failure: The container exited with any other status code or signal, or was stopped by the system.
+The RestartPolicy may be `Always`, `OnFailure`, or `Never`. RestartPolicy applies to all containers in the pod. RestartPolicy only refers to restarts of the containers by the Kubelet on the same node. As discussed in the [pods document](pods.md#durability-of-pods-or-lack-thereof), once bound to a node, a pod may never be rebound to another node. This means that some kind of controller is necessary in order for a pod to survive node failure, even if just a single pod at a time is desired.

-TODO(@dchen1107) Define ContainerStatus like PodStatus
+The only controller we have today is [`ReplicationController`](replication-controller.md).  `ReplicationController` is *only* appropriate for pods with `RestartPolicy = Always`.  `ReplicationController` should refuse to instantiate any pod that has a different restart policy.

-## PodStatus values and meanings
-
-The number and meanings of `PodStatus` values are tightly guarded.  Other than what is documented here, nothing should be assumed about pods with a given `PodStatus`.
-
-### pending
-
-The pod has been accepted by the system, but one or more of the containers has not been started.  This includes time before being scheduled as well as time spent downloading images over the network, which could take a while.
-
-### running
-
-The pod has been bound to a node, and all of the containers have been started.  At least one container is still running (or is in the process of restarting).
-
-### succeeded
-
-All containers in the pod have terminated in success.
-
-### failed
-
-All containers in the pod have terminated, at least one container has terminated in failure.
+There is a legitimate need for a controller which keeps pods with other policies alive.  Both of the other policies (`OnFailure` and `Never`) eventually terminate, at which point the controller should stop recreating them.  Because of this fundamental distinction, let's hypothesize a new controller, called [`JobController`](https://github.com/GoogleCloudPlatform/kubernetes/issues/1624) for the sake of this document, which can implement this policy.

 ## Pod lifetime

-In general, pods which are created do not disappear until someone destroys them.  This might be a human or a `ReplicationController`.  The only exception to this rule is that pods with a `PodStatus` of `succeeded` or `failed` for more than some duration (determined by the master) will expire and be automatically reaped.
+In general, pods which are created do not disappear until someone destroys them.  This might be a human or a `ReplicationController`.  The only exception to this rule is that pods with a `PodPhase` of `Succeeded` or `Failed` for more than some duration (determined by the master) will expire and be automatically reaped.

-If a node dies or is disconnected from the rest of the cluster, some entity within the system (call it the NodeController for now) is responsible for applying policy (e.g. a timeout) and marking any pods on the lost node as `failed`.
+If a node dies or is disconnected from the rest of the cluster, some entity within the system (call it the NodeController for now) is responsible for applying policy (e.g. a timeout) and marking any pods on the lost node as `Failed`.

 ## Examples

-   * Pod is `running`, 1 container, container exits success
+   * Pod is `Running`, 1 container, container exits success
     * Log completion event
     * If RestartPolicy is:
-       * Always: restart container, pod stays `running`
-       * OnFailure: pod becomes `succeeded`
-       * Never: pod becomes `succeeded`
+       * Always: restart container, pod stays `Running`
+       * OnFailure: pod becomes `Succeeded`
+       * Never: pod becomes `Succeeded`

-   * Pod is `running`, 1 container, container exits failure
+   * Pod is `Running`, 1 container, container exits failure
     * Log failure event
     * If RestartPolicy is:
-       * Always: restart container, pod stays `running`
-       * OnFailure: restart container, pod stays `running`
-       * Never: pod becomes `failed`
+       * Always: restart container, pod stays `Running`
+       * OnFailure: restart container, pod stays `Running`
+       * Never: pod becomes `Failed`

-   * Pod is `running`, 2 containers, container 1 exits failure
+   * Pod is `Running`, 2 containers, container 1 exits failure
     * Log failure event
     * If RestartPolicy is:
-       * Always: restart container, pod stays `running`
-       * OnFailure: restart container, pod stays `running`
-       * Never: pod stays `running`
+       * Always: restart container, pod stays `Running`
+       * OnFailure: restart container, pod stays `Running`
+       * Never: pod stays `Running`
     * When container 2 exits...
       * Log failure event
       * If RestartPolicy is:
-         * Always: restart container, pod stays `running`
-         * OnFailure: restart container, pod stays `running`
-         * Never: pod becomes `failed`
+         * Always: restart container, pod stays `Running`
+         * OnFailure: restart container, pod stays `Running`
+         * Never: pod becomes `Failed`

-   * Pod is `running`, container becomes OOM
+   * Pod is `Running`, container becomes OOM
     * Container terminates in failure
     * Log OOM event
     * If RestartPolicy is:
-       * Always: restart container, pod stays `running`
-       * OnFailure: restart container, pod stays `running`
-       * Never: log failure event, pod becomes `failed`
+       * Always: restart container, pod stays `Running`
+       * OnFailure: restart container, pod stays `Running`
+       * Never: log failure event, pod becomes `Failed`

-   * Pod is `running`, a disk dies
+   * Pod is `Running`, a disk dies
     * All containers are killed
     * Log appropriate event
-     * Pod becomes `failed`
+     * Pod becomes `Failed`
     * If running under a controller, pod will be recreated elsewhere

-   * Pod is `running`, its node is segmented out
+   * Pod is `Running`, its node is segmented out
     * NodeController waits for timeout
-     * NodeController marks pod `failed`
+     * NodeController marks pod `Failed`
     * If running under a controller, pod will be recreated elsewhere
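
The container-exit examples above can be condensed into a small sketch. The function names and the string encoding of container states (`"running"`, `"succeeded"`, `"failed"`) are hypothetical, and the sketch deliberately ignores `Pending` as well as node-level failures such as a dead disk or a partitioned node.

```python
from typing import List


def apply_restart_policy(policy: str, state: str) -> str:
    """Return the container state after the Kubelet applies the restart policy."""
    if state == "running":
        return state
    if policy == "Always":
        return "running"  # restarted regardless of how it exited
    if policy == "OnFailure" and state == "failed":
        return "running"  # restarted only after a failure
    return state          # Never, or OnFailure after a successful exit


def pod_phase(policy: str, container_states: List[str]) -> str:
    """Summarize the pod phase from its container states."""
    states = [apply_restart_policy(policy, s) for s in container_states]
    if "running" in states:
        return "Running"
    if "failed" in states:
        return "Failed"
    return "Succeeded"
```

For instance, `pod_phase("Never", ["failed", "running"])` gives `Running`, matching the two-container example in which the pod only becomes `Failed` once the second container has also exited.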
--- a/docs/pods.md
+++ b/docs/pods.md
@@ -60,7 +60,9 @@ That approach would provide co-location, but would not provide most of the benef

 Pods aren't intended to be treated as durable pets. They won't survive scheduling failures, node failures, or other evictions, such as due to lack of resources, or in the case of node maintenance. 

-In general, users shouldn't need to create pods directly. They should almost always use controllers (e.g., [replication controller](replication-controller.md)), even for singletons.  Controllers provide self-healing with a cluster scope, as well as replication and rollout management.
+In general, users shouldn't need to create pods directly. They should almost always use controllers (e.g., [replication controller](replication-controller.md)), even for singletons.  Controllers provide self-healing with a cluster scope, as well as replication and rollout management. 
+
+The use of collective APIs as the primary user-facing primitive is relatively common among cluster scheduling systems, including [Borg](http://eurosys2015.labri.fr/program/papers/), [Marathon](https://mesosphere.github.io/marathon/docs/rest-api.html) (see also [application.go](https://github.com/gambol99/go-marathon/blob/master/application.go)), [Aurora](http://aurora.apache.org/documentation/latest/configuration-reference/#job-schema), and [Tupperware](http://www.slideshare.net/Docker/aravindnarayanan-facebook140613153626phpapp02-37588997).

 Pod is exposed as a primitive in order to facilitate:
 * scheduler and controller pluggability
--- a/docs/replication-controller.md
+++ b/docs/replication-controller.md
@@ -2,7 +2,7 @@

 ## What is a _replication controller_?

-A _replication controller_ ensures that a specified number of pod "replicas" are running at any one time.  If there are too many, it will kill some.  If there are too few, it will start more. As opposed to just creating singleton pods or even creating pods in bulk, a replication controller replaces pods that are deleted or terminated for any reason, such as in the case of node failure. For this reason, we recommend that you use a replication controller even if your application requires only a single pod.
+A _replication controller_ ensures that a specified number of pod "replicas" are running at any one time.  If there are too many, it will kill some.  If there are too few, it will start more. Unlike pods created directly by a user, pods managed by a replication controller are replaced when they are deleted or terminated for any reason, for example after a node failure or disruptive node maintenance such as a kernel upgrade. For this reason, we recommend that you use a replication controller even if your application requires only a single pod. Think of it as similar to a process supervisor, except that it supervises multiple pods across multiple nodes instead of individual processes on a single node. The replication controller delegates local container restarts to some agent on the node (e.g., Kubelet or Docker).

 As discussed in [life of a pod](pod-states.md), `replicationController` is *only* appropriate for pods with `RestartPolicy = Always`.  `ReplicationController` should refuse to instantiate any pod that has a different restart policy. As discussed in [issue #503](https://github.com/GoogleCloudPlatform/kubernetes/issues/503#issuecomment-50169443), we expect other types of controllers to be added to Kubernetes to handle other types of workloads, such as build/test and batch workloads, in the future.