mirror of https://github.com/k3s-io/k3s
double dash replaced by html mdash
parent 43889c612c
commit 7b403edd6f
@@ -10,11 +10,11 @@ With the IP-per-pod model, all user containers within a pod behave as if they ar
In addition to avoiding the aforementioned problems with dynamic port allocation, this approach reduces friction for applications moving from the world of uncontainerized apps on physical or virtual hosts to containers within pods. People running application stacks together on the same host have already figured out how to make ports not conflict (e.g., by configuring them through environment variables) and have arranged for clients to find them.
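
To illustrate the environment-variable approach (a sketch only; the variable name `MYAPP_PORT` is invented for this example, not something Kubernetes defines), a server can take its listen port from its environment instead of hard-coding it:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Hypothetical variable name; the operator (or pod spec) decides what to export.
	port := os.Getenv("MYAPP_PORT")
	if port == "" {
		port = "8080" // fall back to a default when nothing is configured
	}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	// With IP-per-pod, two pods can both listen on the same port without conflict.
	http.ListenAndServe(":"+port, nil)
}
```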
-The approach does reduce isolation between containers within a pod -- ports could conflict, and there couldn't be private ports across containers within a pod, but applications requiring their own port spaces could just run as separate pods and processes requiring private communication could run within the same container. Besides, the premise of pods is that containers within a pod share some resources (volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation. Additionally, the user can control what containers belong to the same pod whereas, in general, they don't control what pods land together on a host.
+The approach does reduce isolation between containers within a pod — ports could conflict, and there couldn't be private ports across containers within a pod, but applications requiring their own port spaces could just run as separate pods and processes requiring private communication could run within the same container. Besides, the premise of pods is that containers within a pod share some resources (volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation. Additionally, the user can control what containers belong to the same pod whereas, in general, they don't control what pods land together on a host.
-When any container calls SIOCGIFADDR, it sees the IP that any peer container would see them coming from -- each pod has its own IP address that other pods can know. By making IP addresses and ports the same within and outside the containers and pods, we create a NAT-less, flat address space. "ip addr show" should work as expected. This would enable all existing naming/discovery mechanisms to work out of the box, including self-registration mechanisms and applications that distribute IP addresses. (We should test that with etcd and perhaps one other option, such as Eureka (used by Acme Air) or Consul.) We should be optimizing for inter-pod network communication. Within a pod, containers are more likely to use communication through volumes (e.g., tmpfs) or IPC.
+When any container calls SIOCGIFADDR, it sees the IP that any peer container would see them coming from — each pod has its own IP address that other pods can know. By making IP addresses and ports the same within and outside the containers and pods, we create a NAT-less, flat address space. "ip addr show" should work as expected. This would enable all existing naming/discovery mechanisms to work out of the box, including self-registration mechanisms and applications that distribute IP addresses. (We should test that with etcd and perhaps one other option, such as Eureka (used by Acme Air) or Consul.) We should be optimizing for inter-pod network communication. Within a pod, containers are more likely to use communication through volumes (e.g., tmpfs) or IPC.
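
As a minimal sketch of why self-registration works under this model (the registry write itself is left abstract rather than tied to etcd, Eureka, or Consul), the address a container finds on its own interfaces is the same address peers use to reach it:

```go
package main

import (
	"fmt"
	"net"
)

// podIP returns the first non-loopback IPv4 address, i.e. the pod IP that
// peers across the cluster can reach directly (no NAT in between).
func podIP() (net.IP, error) {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return nil, err
	}
	for _, a := range addrs {
		if ipnet, ok := a.(*net.IPNet); ok && !ipnet.IP.IsLoopback() {
			if ip4 := ipnet.IP.To4(); ip4 != nil {
				return ip4, nil
			}
		}
	}
	return nil, fmt.Errorf("no non-loopback address found")
}

func main() {
	ip, err := podIP()
	if err != nil {
		panic(err)
	}
	// This value can be written to a registry (etcd, Eureka, Consul, ...) and
	// used verbatim by clients, because it is externally meaningful.
	fmt.Printf("register me at %s\n", ip)
}
```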
-This is different from the standard Docker model. In that mode, each container gets an IP in the 172-dot space and would only see that 172-dot address from SIOCGIFADDR. If these containers connect to another container the peer would see the connect coming from a different IP than the container itself knows. In short - you can never self-register anything from a container, because a container can not be reached on its private IP.
+This is different from the standard Docker model. In that mode, each container gets an IP in the 172-dot space and would only see that 172-dot address from SIOCGIFADDR. If these containers connect to another container the peer would see the connect coming from a different IP than the container itself knows. In short — you can never self-register anything from a container, because a container can not be reached on its private IP.
An alternative we considered was an additional layer of addressing: pod-centric IP per container. Each container would have its own local IP address, visible only within that pod. This would perhaps make it easier for containerized applications to move from physical/virtual hosts to pods, but would be more complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS) and to reason about, due to the additional layer of address translation, and would break self-registration and IP distribution mechanisms.
@@ -53,7 +53,7 @@ GCE itself does not know anything about these IPs, though.
These are not externally routable, though, so containers that need to communicate with the outside world need to use host networking. An external IP that is set up to forward to the VM will only forward to the VM's primary IP (which is assigned to no pod). So we use docker's -p flag to map published ports to the main interface. This has the side effect of disallowing two pods from exposing the same port. (More discussion on this in [Issue #390](https://github.com/GoogleCloudPlatform/kubernetes/issues/390).)
-We create a container to use for the pod network namespace -- a single loopback device and a single veth device. All the user's containers get their network namespaces from this pod networking container.
+We create a container to use for the pod network namespace — a single loopback device and a single veth device. All the user's containers get their network namespaces from this pod networking container.
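
A rough sketch of what that pod networking container's namespace amounts to, expressed by shelling out to the `ip` tool (the namespace and interface names are illustrative; the real implementation drives Docker rather than running `ip` directly):

```go
package main

import (
	"fmt"
	"os/exec"
)

// run executes an ip(8) command and stops at the first failure.
func run(args ...string) error {
	out, err := exec.Command("ip", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("ip %v: %v: %s", args, err, out)
	}
	return nil
}

func main() {
	// One namespace per pod, holding a loopback device and one end of a veth pair.
	steps := [][]string{
		{"netns", "add", "podns"},
		{"link", "add", "veth-host", "type", "veth", "peer", "name", "veth-pod"},
		{"link", "set", "veth-pod", "netns", "podns"}, // move one end into the pod
		{"netns", "exec", "podns", "ip", "link", "set", "lo", "up"},
		{"netns", "exec", "podns", "ip", "link", "set", "veth-pod", "up"},
	}
	for _, s := range steps {
		if err := run(s...); err != nil {
			panic(err)
		}
	}
	// All of the pod's user containers would then join this namespace
	// instead of getting namespaces (and IPs) of their own.
}
```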
Docker allocates IP addresses from a bridge we create on each node, using its “container” networking mode.
@@ -89,7 +89,7 @@ We'd also like to accommodate other load-balancing solutions (e.g., HAProxy), no
### External routability
-We want traffic between containers to use the pod IP addresses across nodes. Say we have Node A with a container IP space of 10.244.1.0/24 and Node B with a container IP space of 10.244.2.0/24. And we have Container A1 at 10.244.1.1 and Container B1 at 10.244.2.1. We want Container A1 to talk to Container B1 directly with no NAT. B1 should see the "source" in the IP packets of 10.244.1.1 -- not the "primary" host IP for Node A. That means that we want to turn off NAT for traffic between containers (and also between VMs and containers).
+We want traffic between containers to use the pod IP addresses across nodes. Say we have Node A with a container IP space of 10.244.1.0/24 and Node B with a container IP space of 10.244.2.0/24. And we have Container A1 at 10.244.1.1 and Container B1 at 10.244.2.1. We want Container A1 to talk to Container B1 directly with no NAT. B1 should see the "source" in the IP packets of 10.244.1.1 — not the "primary" host IP for Node A. That means that we want to turn off NAT for traffic between containers (and also between VMs and containers).
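
One way such a flat route could be programmed on a node, sketched with the github.com/vishvananda/netlink library purely for illustration (on GCE the equivalent routes actually live in the cloud network rather than on the VM): Node A learns that Node B's pod range is reachable via Node B's primary IP.

```go
package main

import (
	"net"

	"github.com/vishvananda/netlink"
)

func main() {
	// Pod CIDR hosted on Node B, and Node B's primary (node) IP.
	_, podCIDR, err := net.ParseCIDR("10.244.2.0/24")
	if err != nil {
		panic(err)
	}
	nodeB := net.ParseIP("10.240.0.3") // illustrative node address

	// Route pod-to-pod traffic straight to the peer node; with no SNAT in the
	// path, Container B1 sees 10.244.1.1 as the source, not Node A's host IP.
	route := &netlink.Route{Dst: podCIDR, Gw: nodeB}
	if err := netlink.RouteAdd(route); err != nil {
		panic(err)
	}
}
```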
We'd also like to make pods directly routable from the external internet. However, we can't yet support the extra container IPs that we've provisioned talking to the internet directly. So, we don't map external IPs to the container IPs. Instead, we solve that problem by having traffic that isn't to the internal network (! 10.0.0.0/8) get NATed through the primary host IP address so that it can get 1:1 NATed by the GCE networking when talking to the internet. Similarly, incoming traffic from the internet has to get NATed/proxied through the host IP.
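
A hedged sketch of the corresponding masquerade rule, shelling out to iptables (the real setup is done by node provisioning scripts, and `eth0` stands in for whatever the node's primary interface is):

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// POSTROUTING: traffic leaving the node that is NOT destined for the
	// internal 10.0.0.0/8 range is hidden behind the host's primary IP, which
	// GCE then 1:1-NATs to the external IP. Intra-cluster traffic is untouched.
	args := []string{
		"-t", "nat", "-A", "POSTROUTING",
		"!", "-d", "10.0.0.0/8",
		"-o", "eth0",
		"-j", "MASQUERADE",
	}
	if out, err := exec.Command("iptables", args...).CombinedOutput(); err != nil {
		panic(fmt.Sprintf("iptables %v: %v: %s", args, err, out))
	}
}
```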
@@ -4,7 +4,7 @@ In Kubernetes, rather than individual application containers, _pods_ are the sma
## What is a _pod_?
-A _pod_ (as in a pod of whales or pea pod) corresponds to a colocated group of applications running with a shared context. Within that context, the applications may also have individual cgroup isolations applied. A pod models an application-specific "logical host" in a containerized environment. It may contain one or more applications which are relatively tightly coupled -- in a pre-container world, they would have executed on the same physical or virtual host.
+A _pod_ (as in a pod of whales or pea pod) corresponds to a colocated group of applications running with a shared context. Within that context, the applications may also have individual cgroup isolations applied. A pod models an application-specific "logical host" in a containerized environment. It may contain one or more applications which are relatively tightly coupled — in a pre-container world, they would have executed on the same physical or virtual host.
The context of the pod can be defined as the conjunction of several Linux namespaces:
@@ -71,8 +71,8 @@ Pod is exposed as a primitive in order to facilitate:
* scheduler and controller pluggability
* support for pod-level operations without the need to "proxy" them via controller APIs
* decoupling of pod lifetime from controller lifetime, such as for bootstrapping
-* decoupling of controllers and services -- the endpoint controller just watches pods
-* clean composition of Kubelet-level functionality with cluster-level functionality -- Kubelet is effectively the "pod controller"
+* decoupling of controllers and services — the endpoint controller just watches pods
+* clean composition of Kubelet-level functionality with cluster-level functionality — Kubelet is effectively the "pod controller"
* high-availability applications, which will expect pods to be replaced in advance of their termination and certainly in advance of deletion, such as in the case of planned evictions, image prefetching, or live pod migration [#3949](https://github.com/GoogleCloudPlatform/kubernetes/issues/3949)
The current best practice for pets is to create a replication controller with `replicas` equal to `1` and a corresponding service. If you find this cumbersome, please comment on [issue #260](https://github.com/GoogleCloudPlatform/kubernetes/issues/260).
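
For concreteness, here is a sketch of that best practice expressed with today's client-go API, which is an anachronism relative to this document; all names, labels, and the image are placeholders:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	one := int32(1)
	labels := map[string]string{"app": "my-pet"} // placeholder label

	// A replication controller that keeps exactly one copy of the pod alive.
	rc := &corev1.ReplicationController{
		ObjectMeta: metav1.ObjectMeta{Name: "my-pet"},
		Spec: corev1.ReplicationControllerSpec{
			Replicas: &one,
			Selector: labels,
			Template: &corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "app", Image: "example/app:latest"}},
				},
			},
		},
	}

	// A service giving the single pod a stable name and address across restarts.
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "my-pet"},
		Spec: corev1.ServiceSpec{
			Selector: labels,
			Ports:    []corev1.ServicePort{{Port: 80}},
		},
	}

	ctx := context.Background()
	if _, err := client.CoreV1().ReplicationControllers("default").Create(ctx, rc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	if _, err := client.CoreV1().Services("default").Create(ctx, svc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```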
@@ -2,7 +2,7 @@
# The Kubernetes resource model
-To do good pod placement, Kubernetes needs to know how big pods are, as well as the sizes of the nodes onto which they are being placed. The definition of "how big" is given by the Kubernetes resource model - the subject of this document.
+To do good pod placement, Kubernetes needs to know how big pods are, as well as the sizes of the nodes onto which they are being placed. The definition of "how big" is given by the Kubernetes resource model — the subject of this document.
The resource model aims to be:
* simple, for common cases;
@@ -57,7 +57,7 @@ resourceRequirementSpec: [
]
```
Where:
-* _request_ [optional]: the amount of resources being requested, or that were requested and have been allocated. Scheduler algorithms will use these quantities to test feasibility (whether a pod will fit onto a node). If a container (or pod) tries to use more resources than its _request_, any associated SLOs are voided - e.g., the program it is running may be throttled (compressible resource types), or the attempt may be denied. If _request_ is omitted for a container, it defaults to _limit_ if that is explicitly specified, otherwise to an implementation-defined value; this will always be 0 for a user-defined resource type. If _request_ is omitted for a pod, it defaults to the sum of the (explicit or implicit) _request_ values for the containers it encloses.
+* _request_ [optional]: the amount of resources being requested, or that were requested and have been allocated. Scheduler algorithms will use these quantities to test feasibility (whether a pod will fit onto a node). If a container (or pod) tries to use more resources than its _request_, any associated SLOs are voided — e.g., the program it is running may be throttled (compressible resource types), or the attempt may be denied. If _request_ is omitted for a container, it defaults to _limit_ if that is explicitly specified, otherwise to an implementation-defined value; this will always be 0 for a user-defined resource type. If _request_ is omitted for a pod, it defaults to the sum of the (explicit or implicit) _request_ values for the containers it encloses.
* _limit_ [optional]: an upper bound or cap on the maximum amount of resources that will be made available to a container or pod; if a container or pod uses more resources than its _limit_, it may be terminated. The _limit_ defaults to "unbounded"; in practice, this probably means the capacity of an enclosing container, pod, or node, but may result in non-deterministic behavior, especially for memory.
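
The defaulting rules above can be paraphrased in a few lines; this is only an illustration of the text, not the actual Kubernetes implementation (the types and the zero default for user-defined resources are simplified):

```go
package main

import "fmt"

// requirements holds the two quantities from a resourceRequirementSpec entry.
// nil means "omitted"; quantities are in arbitrary integer units here.
type requirements struct {
	request *int64
	limit   *int64
}

// effectiveRequest applies the documented defaulting for a single container:
// an omitted request falls back to the explicit limit, otherwise to an
// implementation-defined value (always 0 for user-defined resource types).
func effectiveRequest(r requirements, implementationDefault int64) int64 {
	if r.request != nil {
		return *r.request
	}
	if r.limit != nil {
		return *r.limit
	}
	return implementationDefault
}

// podRequest applies the pod-level defaulting: an omitted pod request is the
// sum of the (explicit or implicit) requests of the enclosed containers.
func podRequest(containers []requirements, implementationDefault int64) int64 {
	var sum int64
	for _, c := range containers {
		sum += effectiveRequest(c, implementationDefault)
	}
	return sum
}

func main() {
	lim := int64(500)
	containers := []requirements{
		{request: nil, limit: &lim}, // request defaults to its limit: 500
		{request: nil, limit: nil},  // request defaults to the implementation value: 0
	}
	fmt.Println(podRequest(containers, 0)) // 500
}
```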
@@ -90,13 +90,13 @@ The following resource types are predefined ("reserved") by Kubernetes in the `k
* Units: Kubernetes Compute Unit seconds/second (i.e., CPU cores normalized to a canonical "Kubernetes CPU")
* Internal representation: milli-KCUs
* Compressible? yes
-* Qualities: this is a placeholder for the kind of thing that may be supported in the future -- see [#147](https://github.com/GoogleCloudPlatform/kubernetes/issues/147)
+* Qualities: this is a placeholder for the kind of thing that may be supported in the future — see [#147](https://github.com/GoogleCloudPlatform/kubernetes/issues/147)
* [future] `schedulingLatency`: as per lmctfy
* [future] `cpuConversionFactor`: property of a node: the speed of a CPU core on the node's processor divided by the speed of the canonical Kubernetes CPU (a floating point value; default = 1.0).
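
A small sketch of what the milli-KCU representation and the per-node `cpuConversionFactor` imply (the function and its arguments are invented for illustration):

```go
package main

import "fmt"

// milliKCUs converts a quantity of local CPU cores on a particular node into
// the internal milli-KCU representation, using that node's conversion factor
// (speed of a local core divided by the speed of the canonical Kubernetes CPU).
func milliKCUs(localCores, cpuConversionFactor float64) int64 {
	return int64(localCores * cpuConversionFactor * 1000)
}

func main() {
	// On a node whose cores are 20% faster than the canonical KCU,
	// 2 local cores amount to 2400 milli-KCUs of capacity.
	fmt.Println(milliKCUs(2, 1.2)) // 2400
	// On a canonical node (factor 1.0, the default), half a core is 500 milli-KCUs.
	fmt.Println(milliKCUs(0.5, 1.0)) // 500
}
```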
To reduce performance portability problems for pods, and to avoid worst-case provisioning behavior, the units of CPU will be normalized to a canonical "Kubernetes Compute Unit" (KCU, pronounced ˈko͝oko͞o), which will roughly be equivalent to a single CPU hyperthreaded core for some recent x86 processor. The normalization may be implementation-defined, although some reasonable defaults will be provided in the open-source Kubernetes code.
-Note that requesting 2 KCU won't guarantee that precisely 2 physical cores will be allocated - control of aspects like this will be handled by resource _qualities_ (a future feature).
+Note that requesting 2 KCU won't guarantee that precisely 2 physical cores will be allocated — control of aspects like this will be handled by resource _qualities_ (a future feature).
### Memory