mirror of https://github.com/k3s-io/k3s
Separated user, dev, and design docs.
Renamed: logging.md -> devel/logging.md
Renamed: access.md -> design/access.md
Renamed: identifiers.md -> design/identifiers.md
Renamed: labels.md -> design/labels.md
Renamed: namespaces.md -> design/namespaces.md
Renamed: security.md -> design/security.md
Renamed: networking.md -> design/networking.md
Added abbreviated, user-focused documents in place of most moved docs.
Added docs/README.md, which explains how the docs are organized.
Added short, user-oriented documentation on labels.
Added a glossary.
Fixed up some links.
parent a18cdac616
commit c47693c0d5
@@ -0,0 +1,29 @@

# Kubernetes Documentation

Kubernetes documentation is organized into several categories.

- **Getting Started Guides**
  - for people who want to create a Kubernetes cluster
  - in [docs/getting-started-guides](./getting-started-guides)
- **User Documentation**
  - in [docs](./overview.md)
  - for people who want to run programs on Kubernetes
  - describes current features of the system (with brief mentions of planned features)
- **Developer Documentation**
  - in [docs/devel](./devel)
  - for people who want to contribute code to Kubernetes
  - covers development conventions
  - explains current architecture and project plans
- **Design Documentation**
  - in [docs/design](./design)
  - for people who want to understand the design choices made
  - describes tradeoffs and alternative designs
  - includes descriptions of planned features that are too long for a GitHub issue
- **Walkthroughs and Examples**
  - in [examples](../examples)
  - hands-on introduction and example config files
- **API Documentation**
  - in [api](../api)
  - automatically generated REST API documentation
- **Wiki**
  - in the [wiki](https://github.com/GoogleCloudPlatform/kubernetes/wiki)

@@ -0,0 +1,90 @@

# Identifiers and Names in Kubernetes

A summary of the goals and recommendations for identifiers in Kubernetes, as described in [GitHub issue #199](https://github.com/GoogleCloudPlatform/kubernetes/issues/199).

## Definitions

UID
: A non-empty, opaque, system-generated value guaranteed to be unique in time and space; intended to distinguish between historical occurrences of similar entities.

Name
: A non-empty string guaranteed to be unique within a given scope at a particular time; used in resource URLs; provided by clients at creation time and encouraged to be human friendly; intended to facilitate creation idempotence and space-uniqueness of singleton objects, distinguish distinct entities, and reference particular entities across operations.

[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) label (DNS_LABEL)
: An alphanumeric (a-z, A-Z, and 0-9) string, with a maximum length of 63 characters, with the '-' character allowed anywhere except the first or last character, suitable for use as a hostname or segment in a domain name.

[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) subdomain (DNS_SUBDOMAIN)
: One or more rfc1035/rfc1123 labels separated by '.', with a maximum total length of 253 characters.

[rfc4122](http://www.ietf.org/rfc/rfc4122.txt) universally unique identifier (UUID)
: A 128-bit generated value that is extremely unlikely to collide across time and space and requires no central coordination.
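
To make the DNS_LABEL and DNS_SUBDOMAIN rules concrete, here is a minimal Go sketch of validators for them; the function names are illustrative and not part of the Kubernetes API.

```
package identifiers

import (
	"regexp"
	"strings"
)

// DNS_LABEL: alphanumeric characters, with '-' allowed anywhere except the
// first or last character.
var dnsLabelRE = regexp.MustCompile(`^[a-zA-Z0-9]([-a-zA-Z0-9]*[a-zA-Z0-9])?$`)

// IsDNSLabel reports whether s satisfies the DNS_LABEL definition above
// (non-empty, at most 63 characters).
func IsDNSLabel(s string) bool {
	return len(s) <= 63 && dnsLabelRE.MatchString(s)
}

// IsDNSSubdomain reports whether s is one or more DNS_LABELs joined by '.',
// with a maximum total length of 253 characters.
func IsDNSSubdomain(s string) bool {
	if len(s) == 0 || len(s) > 253 {
		return false
	}
	for _, label := range strings.Split(s, ".") {
		if !IsDNSLabel(label) {
			return false
		}
	}
	return true
}
```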

## Objectives for names and UIDs

1. Uniquely identify (via a UID) an object across space and time
2. Uniquely name (via a name) an object across space
3. Provide human-friendly names in API operations and/or configuration files
4. Allow idempotent creation of API resources (#148) and enforcement of space-uniqueness of singleton objects
5. Allow DNS names to be automatically generated for some objects

## General design

1. When an object is created via an API, a Name string (a DNS_SUBDOMAIN) must be specified. The Name must be non-empty and unique within the apiserver. This enables idempotent and space-unique creation operations. Parts of the system (e.g. the replication controller) may join strings (e.g. a base name and a random suffix) to create a unique Name; a sketch of such a helper follows this list. For situations where generating a name is impractical, some or all objects may support a parameter to auto-generate a name. Note that generating random names will defeat idempotency.
   * Examples: "guestbook.user", "backend-x4eb1"

2. When an object is created via an API, a Namespace string (a DNS_SUBDOMAIN? format TBD via #1114) may be specified. Depending on the API receiver, namespaces might be validated (e.g. the apiserver might ensure that the namespace actually exists). If a namespace is not specified, one will be assigned by the API receiver. This assignment policy might vary across API receivers (e.g. the apiserver might have a default, while the kubelet might generate something semi-random).
   * Example: "api.k8s.example.com"

3. Upon acceptance of an object via an API, the object is assigned a UID (a UUID). The UID must be non-empty and unique across space and time.
   * Example: "01234567-89ab-cdef-0123-456789abcdef"
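
As an illustration of point 1, here is a minimal Go sketch of how a component such as the replication controller might derive a unique Name from a base name plus a random suffix. `GenerateName` is hypothetical and not the actual Kubernetes implementation.

```
package identifiers

import (
	"fmt"
	"math/rand"
)

// GenerateName joins a caller-supplied base name with a short random suffix.
// The result must still be a valid DNS_SUBDOMAIN, so callers should keep the
// base short enough (253 - 1 - suffixLen characters).
func GenerateName(base string) string {
	const suffixLen = 5
	const alphabet = "bcdfghjklmnpqrstvwxz2456789" // avoids ambiguous characters
	suffix := make([]byte, suffixLen)
	for i := range suffix {
		suffix[i] = alphabet[rand.Intn(len(alphabet))]
	}
	return fmt.Sprintf("%s-%s", base, suffix)
}
```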

## Case study: Scheduling a pod

Pods can be placed onto a particular node in a number of ways. This case study demonstrates how the above design can be applied to satisfy the objectives.

### A pod scheduled by a user through the apiserver

1. A user submits a pod with Namespace="" and Name="guestbook" to the apiserver.

2. The apiserver validates the input.
   1. A default Namespace is assigned.
   2. The pod name must be space-unique within the Namespace.
   3. Each container within the pod has a name which must be space-unique within the pod.

3. The pod is accepted.
   1. A new UID is assigned.

4. The pod is bound to a node.
   1. The kubelet on the node is passed the pod's UID, Namespace, and Name.

5. The kubelet validates the input.

6. The kubelet runs the pod.
   1. Each container is started up with enough metadata to distinguish the pod from whence it came.
   2. Each attempt to run a container is assigned a UID (a string) that is unique across time.
      * This may correspond to Docker's container ID.

### A pod placed by a config file on the node

1. A config file is stored on the node, containing a pod with UID="", Namespace="", and Name="cadvisor".

2. The kubelet validates the input.
   1. Since a UID is not provided, the kubelet generates one.
   2. Since a Namespace is not provided, the kubelet generates one.
      1. The generated namespace should be deterministic and cluster-unique for the source, such as a hash of the hostname and file path.
         * E.g. Namespace="file-f4231812554558a718a01ca942782d81"

3. The kubelet runs the pod.
   1. Each container is started up with enough metadata to distinguish the pod from whence it came.
   2. Each attempt to run a container is assigned a UID (a string) that is unique across time.
      1. This may correspond to Docker's container ID.

@@ -0,0 +1,68 @@

# Labels

_Labels_ are key/value pairs identifying client/user-defined attributes (and non-primitive system-generated attributes) of API objects, which are stored and returned as part of the [metadata of those objects](api-conventions.md). Labels can be used to organize and to select subsets of objects according to these attributes.

Each object can have a set of key/value labels set on it, with at most one label with a particular key.
```
"labels": {
  "key1" : "value1",
  "key2" : "value2"
}
```

Unlike [names and UIDs](identifiers.md), labels do not provide uniqueness. In general, we expect many objects to carry the same label(s).

Via a _label selector_, the client/user can identify a set of objects. The label selector is the core grouping primitive in Kubernetes.

Label selectors may also be used to associate policies with sets of objects.

We also [plan](https://github.com/GoogleCloudPlatform/kubernetes/issues/560) to make labels available inside pods and [lifecycle hooks](container-environment.md).

[Namespacing of label keys](https://github.com/GoogleCloudPlatform/kubernetes/issues/1491) is under discussion.

Valid labels follow a slightly modified RFC 952 format: at most 24 characters, all lowercase, beginning with an alphabetic character, ending with an alphanumeric character, with dashes (-) allowed in between.

## Motivation

Service deployments and batch processing pipelines are often multi-dimensional entities (e.g., multiple partitions or deployments, multiple release tracks, multiple tiers, multiple micro-services per tier). Management often requires cross-cutting operations, which breaks encapsulation of strictly hierarchical representations, especially rigid hierarchies determined by the infrastructure rather than by users. Labels enable users to map their own organizational structures onto system objects in a loosely coupled fashion, without requiring clients to store these mappings.

## Label selectors

Label selectors permit very simple filtering by label keys and values. The simplicity of label selectors is deliberate. It is intended to facilitate transparency for humans, easy set overlap detection, efficient indexing, and reverse-indexing (i.e., finding all label selectors matching an object's labels - https://github.com/GoogleCloudPlatform/kubernetes/issues/1348).

Currently the system supports selection by exact match of a map of keys and values. Matching objects must have all of the specified labels (both keys and values), though they may have additional labels as well.
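
For concreteness, the following is a minimal Go sketch (not the real selector implementation) of the exact-match selection just described: every key/value pair in the selector must be present on the object, which may carry additional labels.

```
package labels

// MatchesExact reports whether objectLabels satisfies the exact-match
// selector: all selector keys must be present with the same values.
func MatchesExact(selector, objectLabels map[string]string) bool {
	for k, v := range selector {
		if objectLabels[k] != v {
			return false
		}
	}
	return true
}
```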

We are in the process of extending the label selection specification (see [selector.go](../blob/master/pkg/labels/selector.go) and https://github.com/GoogleCloudPlatform/kubernetes/issues/341) to support conjunctions of requirements of the following forms:
```
key1 in (value11, value12, ...)
key1 not in (value11, value12, ...)
key1 exists
```

LIST and WATCH operations may specify label selectors to filter the sets of objects returned using a query parameter: `?labels=key1%3Dvalue1,key2%3Dvalue2,...`. We may extend such filtering to DELETE operations in the future.

Kubernetes also currently supports two objects that use label selectors to keep track of their members, `service`s and `replicationController`s:
- `service`: A [service](services.md) is a configuration unit for the proxies that run on every worker node. It is named and points to one or more pods.
- `replicationController`: A [replication controller](replication-controller.md) ensures that a specified number of pod "replicas" are running at any one time. If there are too many, it'll kill some. If there are too few, it'll start more.

The set of pods that a `service` targets is defined with a label selector. Similarly, the population of pods that a `replicationController` is monitoring is also defined with a label selector.

For management convenience and consistency, `services` and `replicationControllers` may themselves have labels and would generally carry the labels their corresponding pods have in common.

In the future, label selectors will be used to identify other types of distributed service workers, such as worker pool members or peers in a distributed application.

Individual labels are used to specify identifying metadata, and to convey the semantic purposes/roles of pods or containers. Examples of typical pod label keys include `service`, `environment` (e.g., with values `dev`, `qa`, or `production`), `tier` (e.g., with values `frontend` or `backend`), and `track` (e.g., with values `daily` or `weekly`), but you are free to develop your own conventions.

Sets identified by labels and label selectors can overlap (think Venn diagrams). For instance, a service might target all pods with `tier in (frontend), environment in (prod)`. Now say you have 10 replicated pods that make up this tier. But you want to be able to 'canary' a new version of this component. You could set up a `replicationController` (with `replicas` set to 9) for the bulk of the replicas with labels `tier=frontend, environment=prod, track=stable`, and another `replicationController` (with `replicas` set to 1) for the canary with labels `tier=frontend, environment=prod, track=canary`. Now the service is covering both the canary and non-canary pods. But you can mess with the `replicationControllers` separately to test things out, monitor the results, etc.
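
Written out as label maps, the sets in that example look like the following; this is an illustrative Go sketch, not actual configuration.

```
package labels

// The service selector constrains only "tier" and "environment", so it
// matches the pods of both replication controllers; "track" is what
// distinguishes the stable set from the canary set.
var (
	serviceSelector = map[string]string{"tier": "frontend", "environment": "prod"}

	stableLabels = map[string]string{"tier": "frontend", "environment": "prod", "track": "stable"} // replicas: 9
	canaryLabels = map[string]string{"tier": "frontend", "environment": "prod", "track": "canary"} // replicas: 1
)
```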

Note that the superset described in the previous example is also heterogeneous. In long-lived, highly available, horizontally scaled, distributed, continuously evolving service applications, heterogeneity is inevitable, due to canaries, incremental rollouts, live reconfiguration, simultaneous updates and auto-scaling, hardware upgrades, and so on.

Pods (and other objects) may belong to multiple sets simultaneously, which enables representation of service substructure and/or superstructure. In particular, labels are intended to facilitate the creation of non-hierarchical, multi-dimensional deployment structures. They are useful for a variety of management purposes (e.g., configuration, deployment) and for application introspection and analysis (e.g., logging, monitoring, alerting, analytics). Without the ability to form sets by intersecting labels, many implicitly related, overlapping flat sets would need to be created, for each subset and/or superset desired, which would lose semantic information and be difficult to keep consistent. Purely hierarchically nested sets wouldn't readily support slicing sets across different dimensions.

Pods may be removed from these sets by changing their labels. This flexibility may be used to remove pods from service for debugging, data recovery, etc.

Since labels can be set at pod creation time, no separate set add/remove operations are necessary, which makes them easier to use than manual set management. Additionally, since labels are directly attached to pods and label selectors are fairly simple, it's easy for users, clients, and tools to determine what sets an object belongs to (i.e., label membership is reversible). OTOH, with sets formed by just explicitly enumerating members, one would (conceptually) need to search all sets to determine which ones a pod belonged to.

## Labels vs. annotations

We'll eventually index and reverse-index labels for efficient queries and watches, use them to sort and group in UIs and CLIs, etc. We don't want to pollute labels with non-identifying, especially large and/or structured, data. Non-identifying information should be recorded using [annotations](annotations.md).

@@ -0,0 +1,193 @@

# Kubernetes Proposal - Namespaces

**Related PR:**

| Topic | Link |
| ---- | ---- |
| Identifiers.md | https://github.com/GoogleCloudPlatform/kubernetes/pull/1216 |
| Access.md | https://github.com/GoogleCloudPlatform/kubernetes/pull/891 |
| Indexing | https://github.com/GoogleCloudPlatform/kubernetes/pull/1183 |
| Cluster Subdivision | https://github.com/GoogleCloudPlatform/kubernetes/issues/442 |

## Background

High level goals:

* Enable an easy-to-use mechanism to logically scope Kubernetes resources
* Ensure extension resources to Kubernetes can share the same logical scope as core Kubernetes resources
* Ensure it aligns with the access control proposal
* Ensure the system scales (log n) with an increasing number of scopes

## Use cases

Actors:

1. k8s admin - administers a kubernetes cluster
2. k8s service - a k8s daemon that operates on behalf of another user (i.e. controller-manager)
3. k8s policy manager - enforces policies imposed on the k8s cluster
4. k8s user - uses a kubernetes cluster to schedule pods

User stories:

1. Ability to set an immutable namespace on k8s resources
2. Ability to list k8s resources scoped to a namespace
3. Restrict a namespace identifier to a DNS-compatible string to support compound naming conventions
4. Ability for a k8s policy manager to enforce a k8s user's access to a set of namespaces
5. Ability to set/unset a default namespace for use by the kubecfg client
6. Ability for a k8s service to monitor resource changes across namespaces
7. Ability for a k8s service to list resources across namespaces

## Proposed Design

### Model Changes

Introduce a new attribute *Namespace* for each resource that must be scoped in a Kubernetes cluster.

A *Namespace* is a DNS compatible subdomain.

```
// TypeMeta is shared by all objects sent to, or returned from the client
type TypeMeta struct {
	Kind              string    `json:"kind,omitempty" yaml:"kind,omitempty"`
	Uid               string    `json:"uid,omitempty" yaml:"uid,omitempty"`
	CreationTimestamp util.Time `json:"creationTimestamp,omitempty" yaml:"creationTimestamp,omitempty"`
	SelfLink          string    `json:"selfLink,omitempty" yaml:"selfLink,omitempty"`
	ResourceVersion   uint64    `json:"resourceVersion,omitempty" yaml:"resourceVersion,omitempty"`
	APIVersion        string    `json:"apiVersion,omitempty" yaml:"apiVersion,omitempty"`
	Namespace         string    `json:"namespace,omitempty" yaml:"namespace,omitempty"`
	Name              string    `json:"name,omitempty" yaml:"name,omitempty"`
}
```

An identifier, *UID*, is unique across time and space and is intended to distinguish between historical occurrences of similar entities.

A *Name* is unique within a given *Namespace* at a particular time, used in resource URLs; provided by clients at creation time and encouraged to be human friendly; intended to facilitate creation idempotence and space-uniqueness of singleton objects, distinguish distinct entities, and reference particular entities across operations.

As of this writing, the following resources MUST have a *Namespace* and *Name*:

* pod
* service
* replicationController
* endpoint

A *policy* MAY be associated with a *Namespace*.

If a *policy* has an associated *Namespace*, the resource paths it enforces are scoped to that particular *Namespace*.

## k8s API server

In support of namespace isolation, the Kubernetes API server will address resources by the following conventions.

The typical actors for the following requests are the k8s user or the k8s service.

| Action | HTTP Verb | Path | Description |
| ---- | ---- | ---- | ---- |
| CREATE | POST | /api/{version}/ns/{ns}/{resourceType}/ | Create instance of {resourceType} in namespace {ns} |
| GET | GET | /api/{version}/ns/{ns}/{resourceType}/{name} | Get instance of {resourceType} in namespace {ns} with {name} |
| UPDATE | PUT | /api/{version}/ns/{ns}/{resourceType}/{name} | Update instance of {resourceType} in namespace {ns} with {name} |
| DELETE | DELETE | /api/{version}/ns/{ns}/{resourceType}/{name} | Delete instance of {resourceType} in namespace {ns} with {name} |
| LIST | GET | /api/{version}/ns/{ns}/{resourceType} | List instances of {resourceType} in namespace {ns} |
| WATCH | GET | /api/{version}/watch/ns/{ns}/{resourceType} | Watch for changes to a {resourceType} in namespace {ns} |

The typical actor for the following requests is the k8s service or k8s admin, as enforced by k8s Policy.

| Action | HTTP Verb | Path | Description |
| ---- | ---- | ---- | ---- |
| WATCH | GET | /api/{version}/watch/{resourceType} | Watch for changes to a {resourceType} across all namespaces |
| LIST | GET | /api/{version}/list/{resourceType} | List instances of {resourceType} across all namespaces |

The legacy API patterns for k8s are an alias for interacting with the *default* namespace, as follows.

| Action | HTTP Verb | Path | Description |
| ---- | ---- | ---- | ---- |
| CREATE | POST | /api/{version}/{resourceType}/ | Create instance of {resourceType} in namespace *default* |
| GET | GET | /api/{version}/{resourceType}/{name} | Get instance of {resourceType} in namespace *default* |
| UPDATE | PUT | /api/{version}/{resourceType}/{name} | Update instance of {resourceType} in namespace *default* |
| DELETE | DELETE | /api/{version}/{resourceType}/{name} | Delete instance of {resourceType} in namespace *default* |
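
To make the path conventions in the tables above concrete, here is a small illustrative Go helper (hypothetical, not actual apiserver code) that builds namespaced resource paths and maps the legacy form onto the *default* namespace.

```
package apiserver

import "fmt"

// namespacedPath builds /api/{version}/ns/{ns}/{resourceType}[/{name}].
// name may be empty for CREATE and LIST operations.
func namespacedPath(version, ns, resourceType, name string) string {
	p := fmt.Sprintf("/api/%s/ns/%s/%s", version, ns, resourceType)
	if name != "" {
		p += "/" + name
	}
	return p
}

// legacyPath is the pre-namespace URL form, which aliases the *default* namespace.
func legacyPath(version, resourceType, name string) string {
	return namespacedPath(version, "default", resourceType, name)
}
```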

The k8s API server verifies that the *Namespace* on resource creation matches the *{ns}* on the path.

The k8s API server will enable efficient mechanisms to filter model resources based on the *Namespace*. This may require the creation of an index on *Namespace* that could support query by namespace with optional label selectors.

The k8s API server will associate a resource with a *Namespace* if not populated by the end-user, based on the *Namespace* context of the incoming request. If the *Namespace* of the resource being created or updated does not match the *Namespace* on the request, then the k8s API server will reject the request.

TODO: Update to discuss k8s api server proxy patterns

## k8s storage

A namespace provides a unique identifier space and therefore must be in the storage path of a resource.

In etcd, we want to continue to support efficient WATCH across namespaces.

Resources that persist content in etcd will have storage paths as follows:

/registry/{resourceType}/{resource.Namespace}/{resource.Name}

This enables a k8s service to WATCH /registry/{resourceType} for changes across namespaces for a particular {resourceType}.

Upon scheduling a pod to a particular host, the pod's namespace must be in the key path as follows:

/host/{host}/pod/{pod.Namespace}/{pod.Name}
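
As a small illustration (not actual registry code) of the key layout just described, assuming keys are simply joined path segments:

```
package registry

import "path"

// resourceKey places the namespace after the resource type so that a WATCH on
// /registry/{resourceType} still observes changes across all namespaces.
func resourceKey(resourceType, namespace, name string) string {
	return path.Join("/registry", resourceType, namespace, name)
}

// boundPodKey includes the pod's namespace in the per-host key path.
func boundPodKey(host, namespace, name string) string {
	return path.Join("/host", host, "pod", namespace, name)
}
```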

## k8s Authorization service

This design assumes the existence of an authorization service that filters incoming requests to the k8s API server in order to enforce user authorization to a particular k8s resource. It performs this action by associating the *subject* of a request with a *policy* for an associated HTTP path and verb. This design encodes the *namespace* in the resource path in order to enable external policy servers to function by resource path alone. If a request is made by an identity that is not allowed by policy to access the resource, the request is terminated. Otherwise, it is forwarded to the apiserver.

## k8s controller-manager

The controller-manager will provision pods in the same namespace as the associated replicationController.

## k8s Kubelet

There is no major change to the kubelet introduced by this proposal.

### kubecfg client

kubecfg supports the following:

```
kubecfg [OPTIONS] ns {namespace}
```

To set a namespace to use across multiple operations:

```
$ kubecfg ns ns1
```

To view the current namespace:

```
$ kubecfg ns
Using namespace ns1
```

To reset to the default namespace:

```
$ kubecfg ns default
```

In addition, each kubecfg request may explicitly specify a namespace for the operation via the `--ns` OPTION.

When loading resource files specified by the -c OPTION, the kubecfg client will ensure the namespace is set in the message body to match the client-specified default.

If no default namespace is applied, the client will assume the following default namespace:

* default

The kubecfg client would store default namespace information in the same manner it caches authentication information today, as a file on the user's file system.

@@ -0,0 +1,107 @@

# Networking

## Model and motivation

Kubernetes deviates from the default Docker networking model. The goal is for each pod to have an IP in a flat shared networking namespace that has full communication with other physical computers and containers across the network. IP-per-pod creates a clean, backward-compatible model where pods can be treated much like VMs or physical hosts from the perspectives of port allocation, networking, naming, service discovery, load balancing, application configuration, and migration.

OTOH, dynamic port allocation requires supporting both static ports (e.g., for externally accessible services) and dynamically allocated ports, requires partitioning centrally allocated and locally acquired dynamic ports, complicates scheduling (since ports are a scarce resource), is inconvenient for users, complicates application configuration, is plagued by port conflicts, reuse, and exhaustion, requires non-standard approaches to naming (e.g., etcd rather than DNS), requires proxies and/or redirection for programs using standard naming/addressing mechanisms (e.g., web browsers), requires watching and cache invalidation for address/port changes for instances in addition to watching group membership changes, and obstructs container/pod migration (e.g., using CRIU). NAT introduces additional complexity by fragmenting the addressing space, which breaks self-registration mechanisms, among other problems.

With the IP-per-pod model, all user containers within a pod behave as if they are on the same host with regard to networking. They can all reach each other's ports on localhost. Ports which are published to the host interface are done so in the normal Docker way. All containers in all pods can talk to all other containers in all other pods by their 10-dot addresses.

In addition to avoiding the aforementioned problems with dynamic port allocation, this approach reduces friction for applications moving from the world of uncontainerized apps on physical or virtual hosts to containers within pods. People running application stacks together on the same host have already figured out how to make ports not conflict (e.g., by configuring them through environment variables) and have arranged for clients to find them.

The approach does reduce isolation between containers within a pod -- ports could conflict, and there couldn't be private ports across containers within a pod -- but applications requiring their own port spaces could just run as separate pods, and processes requiring private communication could run within the same container. Besides, the premise of pods is that containers within a pod share some resources (volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation. Additionally, the user can control what containers belong to the same pod whereas, in general, they don't control what pods land together on a host.

When any container calls SIOCGIFADDR, it sees the same IP that its peer containers see it coming from -- each pod has its own IP address that other pods can know. By making IP addresses and ports the same within and outside the containers and pods, we create a NAT-less, flat address space. "ip addr show" should work as expected. This would enable all existing naming/discovery mechanisms to work out of the box, including self-registration mechanisms and applications that distribute IP addresses. (We should test that with etcd and perhaps one other option, such as Eureka (used by Acme Air) or Consul.) We should be optimizing for inter-pod network communication. Within a pod, containers are more likely to use communication through volumes (e.g., tmpfs) or IPC.

This is different from the standard Docker model. In that model, each container gets an IP in the 172-dot space and would only see that 172-dot address from SIOCGIFADDR. If these containers connect to another container, the peer would see the connection coming from a different IP than the container itself knows. In short -- you can never self-register anything from a container, because a container cannot be reached on its private IP.

An alternative we considered was an additional layer of addressing: pod-centric IP per container. Each container would have its own local IP address, visible only within that pod. This would perhaps make it easier for containerized applications to move from physical/virtual hosts to pods, but would be more complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS) and to reason about, due to the additional layer of address translation, and would break self-registration and IP distribution mechanisms.

## Current implementation

For the Google Compute Engine cluster configuration scripts, [advanced routing](https://developers.google.com/compute/docs/networking#routing) is set up so that each VM has an extra 256 IP addresses that get routed to it. This is in addition to the 'main' IP address assigned to the VM, which is NAT-ed for Internet access. The networking bridge (called `cbr0` to differentiate it from `docker0`) is set up outside of Docker proper and only does NAT for egress network traffic that isn't aimed at the virtual network.

Ports mapped in from the 'main IP' (and hence the internet if the right firewall rules are set up) are proxied in user mode by Docker. In the future, this should be done with `iptables` by either the Kubelet or Docker: [Issue #15](https://github.com/GoogleCloudPlatform/kubernetes/issues/15).

We start Docker with:

    DOCKER_OPTS="--bridge cbr0 --iptables=false"

We set up this bridge on each node with SaltStack, in [container_bridge.py](cluster/saltbase/salt/_states/container_bridge.py).

    cbr0:
      container_bridge.ensure:
        - cidr: {{ grains['cbr-cidr'] }}
    ...
    grains:
      roles:
        - kubernetes-pool
      cbr-cidr: $MINION_IP_RANGE

We make these addresses routable in GCE:

    gcutil addroute ${MINION_NAMES[$i]} ${MINION_IP_RANGES[$i]} \
      --norespect_terminal_width \
      --project ${PROJECT} \
      --network ${NETWORK} \
      --next_hop_instance ${ZONE}/instances/${MINION_NAMES[$i]} &

The minion IP ranges are /24s in the 10-dot space.

GCE itself does not know anything about these IPs, though.

These are not externally routable, though, so containers that need to communicate with the outside world need to use host networking. If you set up an external IP that forwards to the VM, it will only forward to the VM's primary IP (which is assigned to no pod). So we use Docker's -p flag to map published ports to the main interface. This has the side effect of disallowing two pods from exposing the same port. (More discussion on this in [Issue #390](https://github.com/GoogleCloudPlatform/kubernetes/issues/390).)

We create a container to use for the pod network namespace -- a single loopback device and a single veth device. All the user's containers get their network namespaces from this pod networking container.

Docker allocates IP addresses from a bridge we create on each node, using its "container" networking mode.

1. Create a normal (in the networking sense) container which uses a minimal image and runs a command that blocks forever. This is not a user-defined container, and gets a special well-known name.
   - creates a new network namespace (netns) and loopback device
   - creates a new pair of veth devices and binds them to the netns
   - auto-assigns an IP from Docker's IP range

2. Create the user containers and specify the name of the network container as their "net" argument. Docker finds the PID of the command running in the network container and attaches to the netns of that PID.

### Other networking implementation examples

With the primary aim of providing the IP-per-pod model, other implementations exist to serve the purpose outside of GCE.
- [OpenVSwitch with GRE/VxLAN](../ovs-networking.md)
- [Flannel](https://github.com/coreos/flannel#flannel)

## Challenges and future work

### Docker API

Right now, `docker inspect` doesn't show the networking configuration of the containers, since they derive it from another container. That information should be exposed somehow.

### External IP assignment

We want to be able to assign IP addresses externally from Docker ([Docker issue #6743](https://github.com/dotcloud/docker/issues/6743)) so that we don't need to statically allocate fixed-size IP ranges to each node, so that IP addresses can be made stable across network container restarts ([Docker issue #2801](https://github.com/dotcloud/docker/issues/2801)), and to facilitate pod migration. Right now, if the network container dies, all the user containers must be stopped and restarted because the netns of the network container will change on restart, and any subsequent user container restart will join that new netns, thereby not being able to see its peers. Additionally, a change in IP address would encounter DNS caching/TTL problems. External IP assignment would also simplify DNS support (see below).

### Naming, discovery, and load balancing

In addition to enabling self-registration with 3rd-party discovery mechanisms, we'd like to set up DDNS automatically ([Issue #146](https://github.com/GoogleCloudPlatform/kubernetes/issues/146)). hostname, $HOSTNAME, etc. should return a name for the pod ([Issue #298](https://github.com/GoogleCloudPlatform/kubernetes/issues/298)), and gethostbyname should be able to resolve names of other pods. Probably we need to set up a DNS resolver to do the latter ([Docker issue #2267](https://github.com/dotcloud/docker/issues/2267)), so that we don't need to keep /etc/hosts files up to date dynamically.

[Service](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/services.md) endpoints are currently found through environment variables. Both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) variables and kubernetes-specific variables ({NAME}_SERVICE_HOST and {NAME}_SERVICE_BAR) are supported, and resolve to ports opened by the service proxy. We don't actually use [the Docker ambassador pattern](https://docs.docker.com/articles/ambassador_pattern_linking/) to link containers because we don't require applications to identify all clients at configuration time, yet. While services today are managed by the service proxy, this is an implementation detail that applications should not rely on. Clients should instead use the [service portal IP](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/services.md) (which the above environment variables will resolve to). However, a flat service namespace doesn't scale and environment variables don't permit dynamic updates, which complicates service deployment by imposing implicit ordering constraints. We intend to register each service portal IP in DNS, and for that to become the preferred resolution protocol.

We'd also like to accommodate other load-balancing solutions (e.g., HAProxy), non-load-balanced services ([Issue #260](https://github.com/GoogleCloudPlatform/kubernetes/issues/260)), and other types of groups (worker pools, etc.). Providing the ability to Watch a label selector applied to pod addresses would enable efficient monitoring of group membership, which could be directly consumed or synced with a discovery mechanism. Event hooks ([Issue #140](https://github.com/GoogleCloudPlatform/kubernetes/issues/140)) for join/leave events would probably make this even easier.

### External routability

We want traffic between containers to use the pod IP addresses across nodes. Say we have Node A with a container IP space of 10.244.1.0/24 and Node B with a container IP space of 10.244.2.0/24. And we have Container A1 at 10.244.1.1 and Container B1 at 10.244.2.1. We want Container A1 to talk to Container B1 directly with no NAT. B1 should see the "source" in the IP packets of 10.244.1.1 -- not the "primary" host IP for Node A. That means that we want to turn off NAT for traffic between containers (and also between VMs and containers).

We'd also like to make pods directly routable from the external internet. However, we can't yet support the extra container IPs that we've provisioned talking to the internet directly. So, we don't map external IPs to the container IPs. Instead, we solve that problem by having traffic that isn't to the internal network (! 10.0.0.0/8) get NATed through the primary host IP address so that it can get 1:1 NATed by the GCE networking when talking to the internet. Similarly, incoming traffic from the internet has to get NATed/proxied through the host IP.

So we end up with 3 cases:

1. Container -> Container or Container <-> VM. These should use 10-dot addresses directly and there should be no NAT.

2. Container -> Internet. These have to get mapped to the primary host IP so that GCE knows how to egress that traffic. There are actually 2 layers of NAT here: Container IP -> Internal Host IP -> External Host IP. The first level happens in the guest with iptables and the second happens as part of GCE networking. The first one (Container IP -> internal host IP) does dynamic port allocation while the second maps ports 1:1.

3. Internet -> Container. This also has to go through the primary host IP and also has 2 levels of NAT, ideally. However, the path currently is a proxy with (External Host IP -> Internal Host IP -> Docker) -> (Docker -> Container IP). Once [issue #15](https://github.com/GoogleCloudPlatform/kubernetes/issues/15) is closed, it should be External Host IP -> Internal Host IP -> Container IP. But to get that second arrow we have to set up the port forwarding iptables rules per mapped port.
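
The egress decision above (case 1 vs. case 2) boils down to whether the destination lies inside the internal 10.0.0.0/8 network. Below is a tiny illustrative Go sketch of that check; it is not actual Kubernetes code, and the CIDR is specific to this example setup.

```
package network

import "net"

// internalNet is the 10-dot range used for pods and VMs in this example setup.
var internalNet = func() *net.IPNet {
	_, n, err := net.ParseCIDR("10.0.0.0/8")
	if err != nil {
		panic(err)
	}
	return n
}()

// needsSNAT reports whether traffic from a pod to dst should be NATed through
// the host's primary IP (case 2) rather than sent directly (case 1).
func needsSNAT(dst net.IP) bool {
	return !internalNet.Contains(dst)
}
```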

Another approach could be to create a new host interface alias for each pod, if we had a way to route an external IP to it. This would eliminate the scheduling constraints resulting from using the host's IP address.

### IPv6

IPv6 would be a nice option, also, but we can't depend on it yet. Docker support is in progress: [Docker issue #2974](https://github.com/dotcloud/docker/issues/2974), [Docker issue #6923](https://github.com/dotcloud/docker/issues/6923), [Docker issue #6975](https://github.com/dotcloud/docker/issues/6975). Additionally, direct IPv6 assignment to instances doesn't appear to be supported by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull requests from people running Kubernetes on bare metal, though. :-)

@@ -0,0 +1,26 @@

Logging Conventions
===================

The following conventions describe which glog levels to use. glog is globally preferred to "log" for better runtime control.

* glog.Errorf() - Always an error
* glog.Warningf() - Something unexpected, but probably not an error
* glog.Infof() has multiple levels:
  * glog.V(0) - Generally useful for this to ALWAYS be visible to an operator
    * Programmer errors
    * Logging extra info about a panic
    * CLI argument handling
  * glog.V(1) - A reasonable default log level if you don't want verbosity.
    * Information about config (listening on X, watching Y)
    * Errors that repeat frequently that relate to conditions that can be corrected (pod detected as unhealthy)
  * glog.V(2) - Useful steady state information about the service and important log messages that may correlate to significant changes in the system. This is the recommended default log level for most systems.
    * Logging HTTP requests and their exit code
    * System state changing (killing pod)
    * Controller state change events (starting pods)
    * Scheduler log messages
  * glog.V(3) - Extended information about changes
    * More info about system state changes
  * glog.V(4) - Debug level verbosity (for now)
    * Logging in particularly thorny parts of code where you may want to come back later and check it

As per the comments, the practical default level is V(2). Developers and QE environments may wish to run at V(3) or V(4). If you wish to change the log level, you can pass in `-v=X` where X is the desired maximum level to log.
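
A brief, hypothetical illustration of these conventions in Go (the messages are made up, but the glog calls are the library's standard API):

```
package main

import (
	"errors"
	"flag"

	"github.com/golang/glog"
)

func main() {
	flag.Parse() // glog registers -v, -logtostderr, etc. on the standard flag set
	defer glog.Flush()

	err := errors.New("connection refused")
	glog.Errorf("failed to sync pod %q: %v", "backend-x4eb1", err) // always an error
	glog.Warningf("pod %q missing from cache; re-listing", "backend-x4eb1")
	glog.V(1).Infof("listening on %s", ":8080")             // config information
	glog.V(2).Infof("killing pod %q", "backend-x4eb1")      // system state change
	glog.V(4).Infof("retrying watch after error: %v", err)  // debug-level detail
}
```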

@@ -0,0 +1,52 @@

# Glossary and Concept Index

**Authorization**
: Kubernetes does not currently have an authorization system. Anyone with the cluster password can do anything. We plan to add sophisticated authorization, and to make it pluggable. See the [access control design doc](./devel/access.md) and [this issue](https://github.com/GoogleCloudPlatform/kubernetes/issue/1430).

**Annotation**
: A key/value pair that can hold large (compared to a Label) and possibly non-human-readable data. Intended to store non-identifying metadata associated with an object, such as provenance information. Not indexed.

**Image**
: A [Docker Image](https://docs.docker.com/userguide/dockerimages/). See [images](./images.md).

**Label**
: A key/value pair conveying user-defined identifying attributes of an object, and used to form sets of related objects, such as pods which are replicas in a load-balanced service. Not intended to hold large or non-human-readable data. See [labels](./labels.md).

**Name**
: A user-provided name for an object. See [identifiers](identifiers.md).

**Namespace**
: A namespace is like a prefix to the name of an object. You can configure your client to use a particular namespace, so you do not have to type it all the time. Namespaces allow multiple projects to coexist and prevent naming collisions between unrelated teams.

**Pod**
: A collection of containers which will be scheduled onto the same node, which share an IP and port space, and which can be created/destroyed together. See [pods](./pods.md).

**Replication Controller**
: A _replication controller_ ensures that a specified number of pod "replicas" are running at any one time. It both allows for easy scaling of replicated systems and handles restarting of a Pod when the machine it is on reboots or otherwise fails.

**Resource**
: CPU, memory, and other things that a pod can request. See [resources](resources.md).

**Selector**
: An expression that matches Labels. Can identify related objects, such as pods which are replicas in a load-balanced service. See [labels](labels.md).

**Service**
: A load-balanced set of `pods` which can be accessed via a single stable IP address. See [services](./services.md).

**UID**
: An identifier on all Kubernetes objects that is set by the Kubernetes API server. Can be used to distinguish between historical occurrences of same-Name objects. See [identifiers](identifiers.md).

**Volume**
: A directory, possibly with some data in it, which is accessible to a Container as part of its filesystem. Kubernetes Volumes build upon [Docker Volumes](https://docs.docker.com/userguide/dockervolumes/), adding provisioning of the Volume directory and/or device. See [volumes](volumes.md).

@@ -1,90 +1,8 @@

# Identifiers

All objects in the Kubernetes REST API are identified by a Name and a UID.

## Names

Names are user-provided. Only one object of a given kind can have a given name at a time. However, if you delete an object, you can make a new object with the same name. Names are used to refer to an object in a resource URL, such as `/api/v1beta3/pods/some.name`. Names may be up to a maximum length of 253 characters and consist of alphanumeric characters, `-`, and `.`. See the [identifiers design doc](design/identifiers.md) for the precise syntax rules for names.

## UIDs

UIDs are generated by Kubernetes. Every object created over the whole lifetime of a Kubernetes cluster has a distinct UID.
@ -1,8 +1,9 @@
|
|||
# Labels
|
||||
|
||||
_Labels_ are key/value pairs identifying client/user-defined attributes (and non-primitive system-generated attributes) of API objects, which are stored and returned as part of the [metadata of those objects](api-conventions.md). Labels can be used to organize and to select subsets of objects according to these attributes.
|
||||
|
||||
Each object can have a set of key/value labels set on it, with at most one label with a particular key.
|
||||
_Labels_ are key/value pairs that are attached to objects, such as pods.
|
||||
Labels can be used to organize and to select subsets of objects. They are
|
||||
created by users at the same time as an object. Each object can have a set of
|
||||
key/value labels set on it, with at most one label with a particular key.
|
||||
```
|
||||
"labels": {
|
||||
"key1" : "value1",
|
||||
|
@ -14,55 +15,28 @@ Unlike [names and UIDs](identifiers.md), labels do not provide uniqueness. In ge
|
|||
|
||||
Via a _label selector_, the client/user can identify a set of objects. The label selector is the core grouping primitive in Kubernetes.
|
||||
|
||||
Label selectors may also be used to associate policies with sets of objects.
|
||||
|
||||
We also [plan](https://github.com/GoogleCloudPlatform/kubernetes/issues/560) to make labels available inside pods and [lifecycle hooks](container-environment.md).
|
||||
|
||||
[Namespacing of label keys](https://github.com/GoogleCloudPlatform/kubernetes/issues/1491) is under discussion.
|
||||
Labels let you categorize objects in a complex service deployment or batch processing pipelines along multiple
|
||||
dimensions, such as:
|
||||
- `release=stable`, `release=canary`, ...
|
||||
- `environment=dev`, `environment=qa`, `environment=production`
|
||||
- `tier=frontend`, `tier=backend`, ...
|
||||
- `partition=customerA`, `partition=customerB`, ...
|
||||
- `track=daily`, `track=weekly`
|
||||
These are just examples; you are free to develop your own conventions.
|
||||
|
||||
Valid labels follow a slightly modified RFC952 format: 24 characters or less, all lowercase, begins with alpha, dashes (-) are allowed, and ends with alphanumeric.
|
||||
|
||||
## Motivation
|
||||
|
||||
Service deployments and batch processing pipelines are often multi-dimensional entities (e.g., multiple partitions or deployments, multiple release tracks, multiple tiers, multiple micro-services per tier). Management often requires cross-cutting operations, which breaks encapsulation of strictly hierarchical representations, especially rigid hierarchies determined by the infrastructure rather than by users. Labels enable users to map their own organizational structures onto system objects in a loosely coupled fashion, without requiring clients to store these mappings.
|
||||
|
||||
## Label selectors
|
||||
|
||||
Label selectors permit very simple filtering by label keys and values. The simplicity of label selectors is deliberate. It is intended to facilitate transparency for humans, easy set overlap detection, efficient indexing, and reverse-indexing (i.e., finding all label selectors matching an object's labels - https://github.com/GoogleCloudPlatform/kubernetes/issues/1348).
|
||||
|
||||
Currently the system supports selection by exact match of a map of keys and values. Matching objects must have all of the specified labels (both keys and values), though they may have additional labels as well.
|
||||
|
||||
We are in the process of extending the label selection specification (see [selector.go](../blob/master/pkg/labels/selector.go) and https://github.com/GoogleCloudPlatform/kubernetes/issues/341) to support conjunctions of requirements of the following forms:
|
||||
Label selectors permit very simple filtering by label keys and values. Currently, label selectors only support these forms:
|
||||
```
|
||||
key1
|
||||
key1 = value11
|
||||
key1 != value11
|
||||
key1 in (value11, value12, ...)
|
||||
key1 not in (value11, value12, ...)
|
||||
key1 exists
|
||||
```
|
||||
|
||||
LIST and WATCH operations may specify label selectors to filter the sets of objects returned using a query parameter: `?labels=key1%3Dvalue1,key2%3Dvalue2,...`. We may extend such filtering to DELETE operations in the future.
|
||||
LIST and WATCH operations may specify label selectors to filter the sets of objects returned using a query parameter: `?labels=key1%3Dvalue1,key2%3Dvalue2,...`.
|
||||
|
||||
Kubernetes also currently supports two objects that use label selectors to keep track of their members, `service`s and `replicationController`s:
|
||||
- `service`: A [service](services.md) is a configuration unit for the proxies that run on every worker node. It is named and points to one or more pods.
|
||||
- `replicationController`: A [replication controller](replication-controller.md) ensures that a specified number of pod "replicas" are running at any one time. If there are too many, it'll kill some. If there are too few, it'll start more.
|
||||
The `service` and `replicationController` kinds of objects use selectors to match sets of pods that they operate on.
|
||||
|
||||
The set of pods that a `service` targets is defined with a label selector. Similarly, the population of pods that a `replicationController` is monitoring is also defined with a label selector.

For management convenience and consistency, `services` and `replicationControllers` may themselves have labels and would generally carry the labels their corresponding pods have in common.

In the future, label selectors will be used to identify other types of distributed service workers, such as worker pool members or peers in a distributed application.

Individual labels are used to specify identifying metadata, and to convey the semantic purposes/roles of pods or containers. Examples of typical pod label keys include `service`, `environment` (e.g., with values `dev`, `qa`, or `production`), `tier` (e.g., with values `frontend` or `backend`), and `track` (e.g., with values `daily` or `weekly`), but you are free to develop your own conventions.

Sets identified by labels and label selectors could be overlapping (think Venn diagrams). For instance, a service might target all pods with `tier in (frontend), environment in (prod)`. Now say you have 10 replicated pods that make up this tier. But you want to be able to 'canary' a new version of this component. You could set up a `replicationController` (with `replicas` set to 9) for the bulk of the replicas with labels `tier=frontend, environment=prod, track=stable` and another `replicationController` (with `replicas` set to 1) for the canary with labels `tier=frontend, environment=prod, track=canary`. Now the service is covering both the canary and non-canary pods. But you can mess with the `replicationControllers` separately to test things out, monitor the results, etc.
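A small Go sketch of the set relationships in the canary example above, using the same assumed exact-match helper as earlier; the label maps are taken from the example and the helper is an illustration, not Kubernetes code.

```
package main

import "fmt"

// matches is the same assumed exact-match helper as in the earlier sketch.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	serviceSelector := map[string]string{"tier": "frontend", "environment": "prod"}
	stablePod := map[string]string{"tier": "frontend", "environment": "prod", "track": "stable"}
	canaryPod := map[string]string{"tier": "frontend", "environment": "prod", "track": "canary"}

	// The service spans both tracks...
	fmt.Println(matches(serviceSelector, stablePod), matches(serviceSelector, canaryPod)) // true true

	// ...while each replicationController selects only its own track.
	stableRC := map[string]string{"tier": "frontend", "environment": "prod", "track": "stable"}
	fmt.Println(matches(stableRC, canaryPod)) // false
}
```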
Note that the superset described in the previous example is also heterogeneous. In long-lived, highly available, horizontally scaled, distributed, continuously evolving service applications, heterogeneity is inevitable, due to canaries, incremental rollouts, live reconfiguration, simultaneous updates and auto-scaling, hardware upgrades, and so on.

Pods (and other objects) may belong to multiple sets simultaneously, which enables representation of service substructure and/or superstructure. In particular, labels are intended to facilitate the creation of non-hierarchical, multi-dimensional deployment structures. They are useful for a variety of management purposes (e.g., configuration, deployment) and for application introspection and analysis (e.g., logging, monitoring, alerting, analytics). Without the ability to form sets by intersecting labels, many implicitly related, overlapping flat sets would need to be created, for each subset and/or superset desired, which would lose semantic information and be difficult to keep consistent. Purely hierarchically nested sets wouldn't readily support slicing sets across different dimensions.

Pods may be removed from these sets by changing their labels. This flexibility may be used to remove pods from service for debugging, data recovery, etc.

Since labels can be set at pod creation time, no separate set add/remove operations are necessary, which makes them easier to use than manual set management. Additionally, since labels are directly attached to pods and label selectors are fairly simple, it's easy for users and for clients and tools to determine what sets a pod belongs to (i.e., labels are reversible). OTOH, with sets formed by just explicitly enumerating members, one would (conceptually) need to search all sets to determine which ones a pod belonged to.

## Labels vs. annotations

We'll eventually index and reverse-index labels for efficient queries and watches, use them to sort and group in UIs and CLIs, etc. We don't want to pollute labels with non-identifying data, especially large and/or structured data. Non-identifying information should be recorded using [annotations](annotations.md).

See the [Labels Design Document](./design/labels.md) for more about how we expect labels and selectors to be used, and planned features.

@ -1,26 +1,12 @@
# Logging

## Logging by Kubernetes Components

Kubernetes components, such as kubelet and apiserver, use the [glog](https://godoc.org/github.com/golang/glog) logging library. Developer conventions for logging severity are described in [devel/logging.md](devel/logging.md).

## Logging in Containers

There are no Kubernetes-specific requirements for logging from within containers. A [search](https://www.google.com/?q=docker+container+logging) will turn up any number of articles about logging and Docker containers. However, we do provide an example of how to collect, index, and view pod logs [using Elasticsearch and Kibana](./getting-started-guides/logging.md).

## Logging Conventions

The following are conventions for which glog levels to use. glog is globally preferred to "log" for better runtime control.
* glog.Errorf() - Always an error
* glog.Warningf() - Something unexpected, but probably not an error
* glog.Infof() has multiple levels:
  * glog.V(0) - Generally useful for this to ALWAYS be visible to an operator
    * Programmer errors
    * Logging extra info about a panic
    * CLI argument handling
  * glog.V(1) - A reasonable default log level if you don't want verbosity.
    * Information about config (listening on X, watching Y)
    * Errors that repeat frequently that relate to conditions that can be corrected (pod detected as unhealthy)
  * glog.V(2) - Useful steady state information about the service and important log messages that may correlate to significant changes in the system. This is the recommended default log level for most systems.
    * Logging HTTP requests and their exit code
    * System state changing (killing pod)
    * Controller state change events (starting pods)
    * Scheduler log messages
  * glog.V(3) - Extended information about changes
    * More info about system state changes
  * glog.V(4) - Debug level verbosity (for now)
    * Logging in particularly thorny parts of code where you may want to come back later and check it

As per the comments, the practical default level is V(2). Developers and QE environments may wish to run at V(3) or V(4). If you wish to change the log level, you can pass in `-v=X` where X is the desired maximum level to log.
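A minimal Go sketch of how these conventions map onto glog calls in a standalone program; the messages are placeholders. glog registers its flags (including `-v`) on the standard flag set, so running this with `-v=2` would show everything up to the V(2) lines.

```
package main

import (
	"flag"

	"github.com/golang/glog"
)

func main() {
	flag.Parse() // glog registers -v, -logtostderr, etc. on the standard flag set

	glog.Errorf("something definitely went wrong: %v", "example") // always an error
	glog.Warningf("unexpected, but probably not an error")

	glog.V(1).Infof("listening on %s", ":8080")          // config information
	glog.V(2).Infof("killing pod %s", "mypod")           // steady-state / system state changes
	glog.V(3).Infof("extended detail about the change")  // extended information
	glog.V(4).Infof("debug-level detail")                // debug verbosity

	glog.Flush()
}
```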
@ -1,193 +1,7 @@
# Namespaces

Namespaces help different projects, teams, or customers to share a kubernetes cluster. First, they provide a scope for [Names](identifiers.md). Second, as our access control code develops, it is expected that it will be convenient to attach authorization and other policy to namespaces.

**Related PRs:**

| Topic | Link |
| ---- | ---- |
| Identifiers.md | https://github.com/GoogleCloudPlatform/kubernetes/pull/1216 |
| Access.md | https://github.com/GoogleCloudPlatform/kubernetes/pull/891 |
| Indexing | https://github.com/GoogleCloudPlatform/kubernetes/pull/1183 |
| Cluster Subdivision | https://github.com/GoogleCloudPlatform/kubernetes/issues/442 |
## Background

High level goals:

* Enable an easy-to-use mechanism to logically scope Kubernetes resources
* Ensure extension resources to Kubernetes can share the same logical scope as core Kubernetes resources
* Ensure it aligns with the access control proposal
* Ensure the system scales (on the order of log n) with an increasing number of scopes
## Use cases

Actors:

1. k8s admin - administers a kubernetes cluster
2. k8s service - a k8s daemon that operates on behalf of another user (i.e. controller-manager)
3. k8s policy manager - enforces policies imposed on the k8s cluster
4. k8s user - uses a kubernetes cluster to schedule pods

User stories:

1. Ability to set an immutable namespace on k8s resources
2. Ability to list k8s resources scoped to a namespace
3. Restrict a namespace identifier to a DNS-compatible string to support compound naming conventions
4. Ability for a k8s policy manager to enforce a k8s user's access to a set of namespaces
5. Ability to set/unset a default namespace for use by the kubecfg client
6. Ability for a k8s service to monitor resource changes across namespaces
7. Ability for a k8s service to list resources across namespaces
## Proposed Design

### Model Changes

Introduce a new attribute *Namespace* for each resource that must be scoped in a Kubernetes cluster.

A *Namespace* is a DNS-compatible subdomain.
```
// TypeMeta is shared by all objects sent to, or returned from the client
type TypeMeta struct {
	Kind              string    `json:"kind,omitempty" yaml:"kind,omitempty"`
	Uid               string    `json:"uid,omitempty" yaml:"uid,omitempty"`
	CreationTimestamp util.Time `json:"creationTimestamp,omitempty" yaml:"creationTimestamp,omitempty"`
	SelfLink          string    `json:"selfLink,omitempty" yaml:"selfLink,omitempty"`
	ResourceVersion   uint64    `json:"resourceVersion,omitempty" yaml:"resourceVersion,omitempty"`
	APIVersion        string    `json:"apiVersion,omitempty" yaml:"apiVersion,omitempty"`
	Namespace         string    `json:"namespace,omitempty" yaml:"namespace,omitempty"`
	Name              string    `json:"name,omitempty" yaml:"name,omitempty"`
}
```
An identifier, *UID*, is unique across time and space and is intended to distinguish between historical occurrences of similar entities.

A *Name* is unique within a given *Namespace* at a particular time, used in resource URLs; provided by clients at creation time and encouraged to be human friendly; intended to facilitate creation idempotence and space-uniqueness of singleton objects, distinguish distinct entities, and reference particular entities across operations.

As of this writing, the following resources MUST have a *Namespace* and *Name*:
* pod
* service
* replicationController
* endpoint

A *policy* MAY be associated with a *Namespace*.

If a *policy* has an associated *Namespace*, the resource paths it enforces are scoped to a particular *Namespace*.
## k8s API server

In support of namespace isolation, the Kubernetes API server will address resources by the following conventions:

The typical actors for the following requests are the k8s user or the k8s service.

| Action | HTTP Verb | Path | Description |
| ---- | ---- | ---- | ---- |
| CREATE | POST | /api/{version}/ns/{ns}/{resourceType}/ | Create instance of {resourceType} in namespace {ns} |
| GET | GET | /api/{version}/ns/{ns}/{resourceType}/{name} | Get instance of {resourceType} in namespace {ns} with {name} |
| UPDATE | PUT | /api/{version}/ns/{ns}/{resourceType}/{name} | Update instance of {resourceType} in namespace {ns} with {name} |
| DELETE | DELETE | /api/{version}/ns/{ns}/{resourceType}/{name} | Delete instance of {resourceType} in namespace {ns} with {name} |
| LIST | GET | /api/{version}/ns/{ns}/{resourceType} | List instances of {resourceType} in namespace {ns} |
| WATCH | GET | /api/{version}/watch/ns/{ns}/{resourceType} | Watch for changes to a {resourceType} in namespace {ns} |
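A small Go sketch that assembles namespace-scoped paths following the conventions table above; the helper name and the sample values are assumptions made for illustration only.

```
package main

import "fmt"

// resourcePath builds the namespace-scoped path from the conventions table
// above. The version string is whatever API version the cluster serves
// (shown here as a placeholder).
func resourcePath(version, ns, resourceType, name string) string {
	p := fmt.Sprintf("/api/%s/ns/%s/%s", version, ns, resourceType)
	if name != "" {
		p += "/" + name
	}
	return p
}

func main() {
	fmt.Println(resourcePath("{version}", "ns1", "pods", ""))      // CREATE / LIST
	fmt.Println(resourcePath("{version}", "ns1", "pods", "pod-a")) // GET / UPDATE / DELETE
}
```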
The typical actors for the following requests are the k8s service or k8s admin, as enforced by k8s Policy.

| Action | HTTP Verb | Path | Description |
| ---- | ---- | ---- | ---- |
| WATCH | GET | /api/{version}/watch/{resourceType} | Watch for changes to a {resourceType} across all namespaces |
| LIST | GET | /api/{version}/list/{resourceType} | List instances of {resourceType} across all namespaces |
The legacy API patterns for k8s are an alias to interacting with the *default* namespace as follows.

| Action | HTTP Verb | Path | Description |
| ---- | ---- | ---- | ---- |
| CREATE | POST | /api/{version}/{resourceType}/ | Create instance of {resourceType} in namespace *default* |
| GET | GET | /api/{version}/{resourceType}/{name} | Get instance of {resourceType} in namespace *default* |
| UPDATE | PUT | /api/{version}/{resourceType}/{name} | Update instance of {resourceType} in namespace *default* |
| DELETE | DELETE | /api/{version}/{resourceType}/{name} | Delete instance of {resourceType} in namespace *default* |

The k8s API server verifies the *Namespace* on resource creation matches the *{ns}* on the path.

The k8s API server will enable efficient mechanisms to filter model resources based on the *Namespace*. This may require the creation of an index on *Namespace* that could support query by namespace with optional label selectors.

The k8s API server will associate a resource with a *Namespace* if it is not populated by the end-user, based on the *Namespace* context of the incoming request. If the *Namespace* of the resource being created or updated does not match the *Namespace* on the request, then the k8s API server will reject the request.
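A Go sketch of the defaulting and verification rule just described; the function is an illustration of the rule, not the apiserver's actual implementation.

```
package main

import (
	"errors"
	"fmt"
)

// completeNamespace applies the rule described above: a resource without a
// Namespace inherits the namespace from the request path; a resource whose
// Namespace disagrees with the path is rejected.
func completeNamespace(resourceNS, requestNS string) (string, error) {
	if resourceNS == "" {
		return requestNS, nil
	}
	if resourceNS != requestNS {
		return "", errors.New("namespace on resource does not match namespace on request")
	}
	return resourceNS, nil
}

func main() {
	ns, _ := completeNamespace("", "ns1")
	fmt.Println(ns) // "ns1": defaulted from the request context

	_, err := completeNamespace("ns2", "ns1")
	fmt.Println(err) // mismatch: the apiserver would reject this request
}
```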
TODO: Update to discuss k8s api server proxy patterns

## k8s storage

A namespace provides a unique identifier space and therefore must be in the storage path of a resource.

In etcd, we want to continue to support efficient WATCH across namespaces.

Resources that persist content in etcd will have storage paths as follows:

    /registry/{resourceType}/{resource.Namespace}/{resource.Name}

This enables a k8s service to WATCH /registry/{resourceType} for changes to a particular {resourceType} across all namespaces.

Upon scheduling a pod to a particular host, the pod's namespace must be in the key path as follows:

    /host/{host}/pod/{pod.Namespace}/{pod.Name}
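A short Go sketch that assembles the two key layouts above; the helper names and sample values are assumptions for illustration.

```
package main

import "fmt"

// makeResourceKey keeps the namespace inside the per-type prefix so a WATCH on
// /registry/{resourceType} still sees changes across all namespaces.
func makeResourceKey(resourceType, namespace, name string) string {
	return fmt.Sprintf("/registry/%s/%s/%s", resourceType, namespace, name)
}

// makeBoundPodKey is the host-scoped key used once a pod is scheduled.
func makeBoundPodKey(host, namespace, name string) string {
	return fmt.Sprintf("/host/%s/pod/%s/%s", host, namespace, name)
}

func main() {
	fmt.Println(makeResourceKey("pods", "ns1", "frontend-1"))
	fmt.Println(makeBoundPodKey("node-a", "ns1", "frontend-1"))
}
```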
## k8s Authorization service

This design assumes the existence of an authorization service that filters incoming requests to the k8s API server in order to enforce user authorization to a particular k8s resource. It performs this action by associating the *subject* of a request with a *policy* to an associated HTTP path and verb. This design encodes the *namespace* in the resource path in order to enable external policy servers to function by resource path alone. If a request is made by an identity that is not allowed by policy to access the resource, the request is terminated. Otherwise, it is forwarded to the apiserver.

## k8s controller-manager

The controller-manager will provision pods in the same namespace as the associated replicationController.

## k8s Kubelet

There is no major change to the kubelet introduced by this proposal.
### kubecfg client

kubecfg supports the following:

```
kubecfg [OPTIONS] ns {namespace}
```

To set a namespace to use across multiple operations:

```
$ kubecfg ns ns1
```

To view the current namespace:

```
$ kubecfg ns
Using namespace ns1
```

To reset to the default namespace:

```
$ kubecfg ns default
```

In addition, each kubecfg request may explicitly specify a namespace for the operation via the following OPTION:

    --ns

When loading resource files specified by the -c OPTION, the kubecfg client will ensure the namespace is set in the message body to match the client-specified default.

If no default namespace is applied, the client will assume the following default namespace:

* default

The kubecfg client would store default namespace information in the same manner it caches authentication information today, as a file on the user's file system.

Use of multiple namespaces is optional. For small teams, they may not be needed.

Namespaces are still under development. For now, the best documentation is the [Namespaces Design Document](design/namespaces.md).

@ -1,107 +1,5 @@
# Networking

Kubernetes gives every pod its own IP address allocated from an internal network, so you do not need to explicitly create links between communicating pods.

However, since pods can fail and be rescheduled to different nodes, we do not recommend having a pod talk directly to the IP address of another pod. Instead, if a pod, or collection of pods, provides some service, then you should create a `service` object spanning those pods, and clients should connect to the IP of the service object. See [services](services.md).
## Model and motivation

Kubernetes deviates from the default Docker networking model. The goal is for each [pod](docs/pods.md) to have an IP in a flat shared networking namespace that has full communication with other physical computers and containers across the network. IP-per-pod creates a clean, backward-compatible model where pods can be treated much like VMs or physical hosts from the perspectives of port allocation, networking, naming, service discovery, load balancing, application configuration, and migration.

OTOH, dynamic port allocation requires supporting both static ports (e.g., for externally accessible services) and dynamically allocated ports, requires partitioning centrally allocated and locally acquired dynamic ports, complicates scheduling (since ports are a scarce resource), is inconvenient for users, complicates application configuration, is plagued by port conflicts, reuse, and exhaustion, requires non-standard approaches to naming (e.g., etcd rather than DNS), requires proxies and/or redirection for programs using standard naming/addressing mechanisms (e.g., web browsers), requires watching and cache invalidation for address/port changes for instances in addition to watching group membership changes, and obstructs container/pod migration (e.g., using CRIU). NAT introduces additional complexity by fragmenting the addressing space, which breaks self-registration mechanisms, among other problems.

With the IP-per-pod model, all user containers within a pod behave as if they are on the same host with regard to networking. They can all reach each other's ports on localhost. Ports which are published to the host interface are done so in the normal Docker way. All containers in all pods can talk to all other containers in all other pods by their 10-dot addresses.

In addition to avoiding the aforementioned problems with dynamic port allocation, this approach reduces friction for applications moving from the world of uncontainerized apps on physical or virtual hosts to containers within pods. People running application stacks together on the same host have already figured out how to make ports not conflict (e.g., by configuring them through environment variables) and have arranged for clients to find them.

The approach does reduce isolation between containers within a pod -- ports could conflict, and there couldn't be private ports across containers within a pod, but applications requiring their own port spaces could just run as separate pods, and processes requiring private communication could run within the same container. Besides, the premise of pods is that containers within a pod share some resources (volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation. Additionally, the user can control what containers belong to the same pod whereas, in general, they don't control what pods land together on a host.

When any container calls SIOCGIFADDR, it sees the same IP that any peer container would see it coming from -- each pod has its own IP address that other pods can know. By making IP addresses and ports the same within and outside the containers and pods, we create a NAT-less, flat address space. "ip addr show" should work as expected. This would enable all existing naming/discovery mechanisms to work out of the box, including self-registration mechanisms and applications that distribute IP addresses. (We should test that with etcd and perhaps one other option, such as Eureka (used by Acme Air) or Consul.) We should be optimizing for inter-pod network communication. Within a pod, containers are more likely to use communication through volumes (e.g., tmpfs) or IPC.

This is different from the standard Docker model. In that mode, each container gets an IP in the 172-dot space and would only see that 172-dot address from SIOCGIFADDR. If these containers connect to another container, the peer would see the connection coming from a different IP than the container itself knows. In short -- you can never self-register anything from a container, because a container cannot be reached on its private IP.

An alternative we considered was an additional layer of addressing: pod-centric IP per container. Each container would have its own local IP address, visible only within that pod. This would perhaps make it easier for containerized applications to move from physical/virtual hosts to pods, but would be more complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS) and to reason about, due to the additional layer of address translation, and would break self-registration and IP distribution mechanisms.
## Current implementation

For the Google Compute Engine cluster configuration scripts, [advanced routing](https://developers.google.com/compute/docs/networking#routing) is set up so that each VM has an extra 256 IP addresses that get routed to it. This is in addition to the 'main' IP address assigned to the VM that is NAT-ed for Internet access. The networking bridge (called `cbr0` to differentiate it from `docker0`) is set up outside of Docker proper and only does NAT for egress network traffic that isn't aimed at the virtual network.

Ports mapped in from the 'main IP' (and hence the internet if the right firewall rules are set up) are proxied in user mode by Docker. In the future, this should be done with `iptables` by either the Kubelet or Docker: [Issue #15](https://github.com/GoogleCloudPlatform/kubernetes/issues/15).

We start Docker with:

    DOCKER_OPTS="--bridge cbr0 --iptables=false"

We set up this bridge on each node with SaltStack, in [container_bridge.py](cluster/saltbase/salt/_states/container_bridge.py).

    cbr0:
      container_bridge.ensure:
        - cidr: {{ grains['cbr-cidr'] }}
    ...
    grains:
      roles:
        - kubernetes-pool
      cbr-cidr: $MINION_IP_RANGE
We make these addresses routable in GCE:

    gcutil addroute ${MINION_NAMES[$i]} ${MINION_IP_RANGES[$i]} \
      --norespect_terminal_width \
      --project ${PROJECT} \
      --network ${NETWORK} \
      --next_hop_instance ${ZONE}/instances/${MINION_NAMES[$i]} &

The minion IP ranges are /24s in the 10-dot space.

GCE itself does not know anything about these IPs, though.

These are not externally routable, though, so containers that need to communicate with the outside world need to use host networking. If you set up an external IP that forwards to the VM, it will only forward to the VM's primary IP (which is assigned to no pod). So we use Docker's -p flag to map published ports to the main interface. This has the side effect of disallowing two pods from exposing the same port. (More discussion on this in [Issue #390](https://github.com/GoogleCloudPlatform/kubernetes/issues/390).)

We create a container to use for the pod network namespace -- a single loopback device and a single veth device. All the user's containers get their network namespaces from this pod networking container.

Docker allocates IP addresses from a bridge we create on each node, using its “container” networking mode.

1. Create a normal (in the networking sense) container which uses a minimal image and runs a command that blocks forever. This is not a user-defined container, and gets a special well-known name.
   - creates a new network namespace (netns) and loopback device
   - creates a new pair of veth devices and binds them to the netns
   - auto-assigns an IP from docker’s IP range

2. Create the user containers and specify the name of the network container as their “net” argument. Docker finds the PID of the command running in the network container and attaches to the netns of that PID.
### Other networking implementation examples

With the primary aim of providing the IP-per-pod model, other implementations exist to serve the purpose outside of GCE:

- [OpenVSwitch with GRE/VxLAN](docs/ovs-networking.md)
- [Flannel](https://github.com/coreos/flannel#flannel)
## Challenges and future work

### Docker API

Right now, `docker inspect` doesn't show the networking configuration of the containers, since they derive it from another container. That information should be exposed somehow.

### External IP assignment

We want to be able to assign IP addresses externally from Docker ([Docker issue #6743](https://github.com/dotcloud/docker/issues/6743)) so that we don't need to statically allocate fixed-size IP ranges to each node, so that IP addresses can be made stable across network container restarts ([Docker issue #2801](https://github.com/dotcloud/docker/issues/2801)), and to facilitate pod migration. Right now, if the network container dies, all the user containers must be stopped and restarted because the netns of the network container will change on restart, and any subsequent user container restart will join that new netns, thereby not being able to see its peers. Additionally, a change in IP address would encounter DNS caching/TTL problems. External IP assignment would also simplify DNS support (see below).
### Naming, discovery, and load balancing

In addition to enabling self-registration with 3rd-party discovery mechanisms, we'd like to set up DDNS automatically ([Issue #146](https://github.com/GoogleCloudPlatform/kubernetes/issues/146)). hostname, $HOSTNAME, etc. should return a name for the pod ([Issue #298](https://github.com/GoogleCloudPlatform/kubernetes/issues/298)), and gethostbyname should be able to resolve names of other pods. Probably we need to set up a DNS resolver to do the latter ([Docker issue #2267](https://github.com/dotcloud/docker/issues/2267)), so that we don't need to keep /etc/hosts files up to date dynamically.

[Service](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/services.md) endpoints are currently found through environment variables. Both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) variables and kubernetes-specific variables ({NAME}_SERVICE_HOST and {NAME}_SERVICE_PORT) are supported, and resolve to ports opened by the service proxy. We don't actually use [the Docker ambassador pattern](https://docs.docker.com/articles/ambassador_pattern_linking/) to link containers because we don't require applications to identify all clients at configuration time, yet. While services today are managed by the service proxy, this is an implementation detail that applications should not rely on. Clients should instead use the [service portal IP](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/services.md) (which the above environment variables will resolve to). However, a flat service namespace doesn't scale, and environment variables don't permit dynamic updates, which complicates service deployment by imposing implicit ordering constraints. We intend to register each service portal IP in DNS, and for that to become the preferred resolution protocol.

We'd also like to accommodate other load-balancing solutions (e.g., HAProxy), non-load-balanced services ([Issue #260](https://github.com/GoogleCloudPlatform/kubernetes/issues/260)), and other types of groups (worker pools, etc.). Providing the ability to watch a label selector applied to pod addresses would enable efficient monitoring of group membership, which could be directly consumed or synced with a discovery mechanism. Event hooks ([Issue #140](https://github.com/GoogleCloudPlatform/kubernetes/issues/140)) for join/leave events would probably make this even easier.
### External routability

We want traffic between containers to use the pod IP addresses across nodes. Say we have Node A with a container IP space of 10.244.1.0/24 and Node B with a container IP space of 10.244.2.0/24. And we have Container A1 at 10.244.1.1 and Container B1 at 10.244.2.1. We want Container A1 to talk to Container B1 directly with no NAT. B1 should see the "source" in the IP packets of 10.244.1.1 -- not the "primary" host IP for Node A. That means that we want to turn off NAT for traffic between containers (and also between VMs and containers).

We'd also like to make pods directly routable from the external internet. However, we can't yet support the extra container IPs that we've provisioned talking to the internet directly. So, we don't map external IPs to the container IPs. Instead, we solve that problem by having traffic that isn't to the internal network (! 10.0.0.0/8) get NATed through the primary host IP address so that it can get 1:1 NATed by the GCE networking when talking to the internet. Similarly, incoming traffic from the internet has to get NATed/proxied through the host IP.

So we end up with 3 cases:

1. Container -> Container or Container <-> VM. These should use 10. addresses directly and there should be no NAT.

2. Container -> Internet. These have to get mapped to the primary host IP so that GCE knows how to egress that traffic. There are actually two layers of NAT here: Container IP -> Internal Host IP -> External Host IP. The first level happens in the guest with iptables and the second happens as part of GCE networking. The first one (Container IP -> internal host IP) does dynamic port allocation while the second maps ports 1:1.

3. Internet -> Container. This also has to go through the primary host IP and also has two levels of NAT, ideally. However, the path currently is a proxy with (External Host IP -> Internal Host IP -> Docker) -> (Docker -> Container IP). Once [issue #15](https://github.com/GoogleCloudPlatform/kubernetes/issues/15) is closed, it should be External Host IP -> Internal Host IP -> Container IP. But to get that second arrow we have to set up the port forwarding iptables rules per mapped port.
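A minimal Go sketch of the egress rule behind these cases: traffic to the internal 10.0.0.0/8 network keeps the pod IP as its source, and everything else is NATed through the host's primary IP. This illustrates the classification only; it is not the actual iptables configuration.

```
package main

import (
	"fmt"
	"net"
)

// internalNet is the flat 10-dot space used for pods and VMs in this setup.
var _, internalNet, _ = net.ParseCIDR("10.0.0.0/8")

// needsNAT mirrors the egress rule described above: destinations inside the
// internal network keep the pod IP; anything else goes through the host IP.
func needsNAT(dst net.IP) bool {
	return !internalNet.Contains(dst)
}

func main() {
	fmt.Println(needsNAT(net.ParseIP("10.244.2.1"))) // false: container -> container, no NAT
	fmt.Println(needsNAT(net.ParseIP("8.8.8.8")))    // true: container -> internet, NAT via host IP
}
```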
Another approach could be to create a new host interface alias for each pod, if we had a way to route an external IP to it. This would eliminate the scheduling constraints resulting from using the host's IP address.

### IPv6

IPv6 would be a nice option, also, but we can't depend on it yet. Docker support is in progress: [Docker issue #2974](https://github.com/dotcloud/docker/issues/2974), [Docker issue #6923](https://github.com/dotcloud/docker/issues/6923), [Docker issue #6975](https://github.com/dotcloud/docker/issues/6975). Additionally, direct IPv6 assignment to instances doesn't appear to be supported by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull requests from people running Kubernetes on bare metal, though. :-)

The networking model and its rationale, and our future plans, are described in more detail in the [networking design document](design/networking.md).

@ -0,0 +1,49 @@
# Kubernetes User Documentation

## Resources

The Kubernetes API currently manages 3 main resources: `pods`, `replicationControllers`, and `services`. Pods correspond to colocated groups of [Docker containers](http://docker.io) with shared volumes, as supported by [Google Cloud Platform container-vm images](https://developers.google.com/compute/docs/containers). Singleton pods can be created directly via the `/pods` endpoint. Sets of pods may be created, maintained, and scaled using replicationControllers. Services create load-balanced targets for sets of pods.

Each resource has two [identifiers](identifiers.md): a string `Name` and a string `UID`. The name is provided by the user. The UID is generated by the system and is guaranteed to be unique in space and time across all resources.

Each resource also has a set of key-value [labels](labels.md): `labels` is a map of string (key) to string (value). Individual labels are used to specify identifying metadata that can be used to define sets of resources by specifying required labels.
## Creation and Updates

Object creation is idempotent when the client remembers the name of the object it wants to create. Resources have a `desiredState` for the user-provided parameters and a `currentState` for the actual system state. When a new version of a resource is PUT, the `desiredState` is updated and available immediately. Over time the system will work to bring the `currentState` into line with the `desiredState`. The system will drive toward the most recent `desiredState` regardless of previous versions of that stanza. In other words, if a value is changed from 2 to 5 in one PUT and then back down to 3 in another PUT, the system is not required to 'touch base' at 5 before making 3 the `currentState`.

When doing an update, we assume that the entire `desiredState` stanza is specified. If a field is omitted, it is assumed that the user intends to delete that field. It is viable for a user to GET the resource, modify what they like in the `desiredState` or labels stanzas, and then PUT it back. If the `currentState` is included in the PUT it will be silently ignored.

Concurrent modification should be accomplished with optimistic locking of resources. All resources have a `ResourceVersion` as part of their metadata. If this is included with the PUT operation, the system will verify that there have not been other successful mutations to the resource during a read/modify/write cycle. The correct client action at this point is to GET the resource again, apply the changes afresh, and try submitting again.
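A hedged Go sketch of that read/modify/write cycle with `ResourceVersion`; the `Resource` type, the `get`/`put` placeholders, and the conflict error are all assumptions standing in for real API calls.

```
package main

import "fmt"

// Resource is a pared-down stand-in for a Kubernetes resource; only the fields
// relevant to optimistic locking are shown.
type Resource struct {
	Name            string
	ResourceVersion uint64
	DesiredState    map[string]interface{}
}

// errConflict stands in for the error returned when the ResourceVersion in a
// PUT no longer matches the stored resource.
var errConflict = fmt.Errorf("resource version conflict")

// updateWithRetry sketches the cycle described above: GET the resource, apply
// the change, PUT it back, and start over on a conflict.
func updateWithRetry(get func() Resource, put func(Resource) error, modify func(*Resource)) error {
	for i := 0; i < 3; i++ {
		r := get()
		modify(&r)
		if err := put(r); err != errConflict {
			return err // nil on success, or a non-conflict error
		}
		// Conflict: another writer got there first; re-GET and retry.
	}
	return fmt.Errorf("giving up after repeated conflicts")
}

func main() {
	stored := Resource{Name: "frontend", ResourceVersion: 7, DesiredState: map[string]interface{}{"replicas": 2}}

	get := func() Resource { return stored }
	put := func(r Resource) error {
		if r.ResourceVersion != stored.ResourceVersion {
			return errConflict
		}
		stored = r
		stored.ResourceVersion++
		return nil
	}

	err := updateWithRetry(get, put, func(r *Resource) { r.DesiredState["replicas"] = 5 })
	fmt.Println(stored.DesiredState["replicas"], err) // 5 <nil>
}
```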
@ -24,7 +24,7 @@ The Kubernetes user interface is a query-based visualization of the Kubernetes A
_GroupBy_ takes a label ```key``` as a parameter, and places all objects with the same value for that key within a single group. For example ```/groups/host/selector``` groups pods by host. ```/groups/name/selector``` groups pods by name. Groups are hierarchical; for example, ```/groups/name/host/selector``` first groups by pod name, and then by host.

#### Select

Select takes a [label selector](./labels.md) and uses it to filter, so only resources which match that label selector are displayed. For example, ```/groups/host/selector/name=frontend``` shows pods, grouped by host, which have a label with the name `frontend`.

## Rebuilding the UX