<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<!-- TAG RELEASE_LINK, added by the munger automatically -->
<strong>
The latest release of this document can be found
[here](http://releases.k8s.io/release-1.2/docs/design/networking.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Networking

There are 4 distinct networking problems to solve:

1. Highly-coupled container-to-container communications
2. Pod-to-Pod communications
3. Pod-to-Service communications
4. External-to-internal communications

## Model and motivation

Kubernetes deviates from the default Docker networking model (though as of
Docker 1.8 their network plugins are getting closer). The goal is for each pod
to have an IP in a flat shared networking namespace that has full communication
with other physical computers and containers across the network. IP-per-pod
creates a clean, backward-compatible model where pods can be treated much like
VMs or physical hosts from the perspectives of port allocation, networking,
naming, service discovery, load balancing, application configuration, and
migration.

Dynamic port allocation, on the other hand, requires supporting both static
ports (e.g., for externally accessible services) and dynamically allocated
ports, requires partitioning centrally allocated and locally acquired dynamic
ports, complicates scheduling (since ports are a scarce resource), is
inconvenient for users, complicates application configuration, is plagued by
port conflicts, reuse, and exhaustion, requires non-standard approaches to
naming (e.g., Consul or etcd rather than DNS), requires proxies and/or
redirection for programs using standard naming/addressing mechanisms (e.g.,
web browsers), requires watching and cache invalidation for address/port
changes for instances in addition to watching group membership changes, and
obstructs container/pod migration (e.g., using CRIU). NAT introduces additional
complexity by fragmenting the addressing space, which breaks self-registration
mechanisms, among other problems.

## Container to container

All containers within a pod behave as if they are on the same host with regard
to networking. They can all reach each other’s ports on localhost. This offers
simplicity (static ports known a priori), security (ports bound to localhost
are visible within the pod but never outside it), and performance. This also
reduces friction for applications moving from the world of uncontainerized apps
on physical or virtual hosts. People running application stacks together on
the same host have already figured out how to make ports not conflict and have
arranged for clients to find them.

The approach does reduce isolation between containers within a pod: ports
could conflict, and there can be no container-private ports. These seem to be
relatively minor issues with plausible future workarounds. Besides, the premise
of pods is that containers within a pod share some resources (volumes, CPU,
RAM, etc.) and therefore expect and tolerate reduced isolation. Additionally,
the user can control what containers belong to the same pod whereas, in
general, they don't control what pods land together on a host.

## Pod to pod

Because every pod gets a "real" (not machine-private) IP address, pods can
communicate without proxies or translations. The pod can use well-known port
numbers and can avoid the use of higher-level service discovery systems like
DNS-SD, Consul, or etcd.

When any container calls `ioctl(SIOCGIFADDR)` (get the address of an
interface), it sees the same IP that any peer container would see it coming
from: each pod has its own IP address that other pods can know. By making IP
addresses and ports the same both inside and outside the pods, we create a
NAT-less, flat address space. Running `ip addr show` should work as expected.
This would enable all existing naming/discovery mechanisms to work out of the
box, including self-registration mechanisms and applications that distribute
IP addresses. We should be optimizing for inter-pod network communication.
Within a pod, containers are more likely to use communication through volumes
(e.g., tmpfs) or IPC.

This is different from the standard Docker model. In that model, each container
gets an IP in the 172-dot space and would only see that 172-dot address from
`SIOCGIFADDR`. If these containers connect to another container, the peer would
see the connection coming from a different IP than the container itself knows.
In short, you can never self-register anything from a container, because a
container cannot be reached on its private IP.

An alternative we considered was an additional layer of addressing: pod-centric
IP per container. Each container would have its own local IP address, visible
only within that pod. This would perhaps make it easier for containerized
applications to move from physical/virtual hosts to pods, but would be more
complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS)
and to reason about, due to the additional layer of address translation, and
would break self-registration and IP distribution mechanisms.

As with Docker, ports can still be published to the host node's interface(s),
but the need for this is radically diminished.

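As a rough illustration of the flat address space described in this section, a
container can read its own address with `ip addr show` and expect peers to see
the same value. The sketch below only parses that output. The helper name
`pod_ip_from_ip_output`, the interface name `eth0`, and the sample address are
assumptions made for illustration and are not part of Kubernetes.

```sh
#!/usr/bin/env bash
# Hypothetical helper: extract the IPv4 address of one interface from the
# one-line-per-address output of `ip -o -4 addr show` read on stdin.
pod_ip_from_ip_output() {
  local iface="$1"
  # In -o mode, field 2 is the interface name and field 4 is ADDR/PREFIX.
  awk -v iface="$iface" \
    '$2 == iface && $3 == "inet" { split($4, a, "/"); print a[1]; exit }'
}

# Inside a pod this would typically be run as:
#   ip -o -4 addr show | pod_ip_from_ip_output eth0
# A canned sample line (assumed address) stands in for real output here so
# that the sketch runs anywhere.
sample='2: eth0    inet 10.244.1.5/24 brd 10.244.1.255 scope global eth0'
printf '%s\n' "$sample" | pod_ip_from_ip_output eth0
```

In the model above, the address printed inside the pod is the same one a peer
pod would see as the source of a connection, which is what makes
self-registration workable.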
## Implementation

For the Google Compute Engine cluster configuration scripts, we use [advanced
routing rules](https://developers.google.com/compute/docs/networking#routing)
and ip-forwarding-enabled VMs so that each VM has an extra 256 IP addresses that
get routed to it. This is in addition to the 'main' IP address assigned to the
VM that is NAT-ed for Internet access. The container bridge (called `cbr0` to
differentiate it from `docker0`) is set up outside of Docker proper.

Example of GCE's advanced routing rules:

```sh
# For node index $i: route that node's pod IP range to the node's VM.
# The trailing "&" runs the command in the background.
gcloud compute routes create "${NODE_NAMES[$i]}" \
  --project "${PROJECT}" \
  --destination-range "${NODE_IP_RANGES[$i]}" \
  --network "${NETWORK}" \
  --next-hop-instance "${NODE_NAMES[$i]}" \
  --next-hop-instance-zone "${ZONE}" &
```
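The `NODE_NAMES` and `NODE_IP_RANGES` arrays used in the command above are not
defined in this document. One way they might be populated, giving each node a
/24 (256 addresses) as described earlier, is sketched below. The `10.244`
prefix, the `CLUSTER_PREFIX` variable, and the node names are assumptions
chosen for illustration.

```sh
#!/usr/bin/env bash
# Hypothetical setup: carve one /24 (256 addresses) per node out of a /16.
# CLUSTER_PREFIX and the node names are assumed values, not real defaults.
CLUSTER_PREFIX="10.244"
NODE_NAMES=("node-0" "node-1" "node-2")
NODE_IP_RANGES=()

for i in "${!NODE_NAMES[@]}"; do
  # Node i receives <prefix>.i.0/24; this scheme only works for i in 0..255.
  NODE_IP_RANGES[$i]="${CLUSTER_PREFIX}.${i}.0/24"
done

# Show the resulting node-to-range mapping.
for i in "${!NODE_NAMES[@]}"; do
  echo "${NODE_NAMES[$i]} -> ${NODE_IP_RANGES[$i]}"
done
```

With ranges assigned this way, one route per node is enough for the network to
deliver any pod address to the VM that hosts it.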
GCE itself does not know anything about these IPs, though. This means that when
a pod tries to egress beyond GCE's project the packets must be SNAT'ed
(masqueraded) to the VM's IP, which GCE recognizes and allows.

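The masquerading step can be expressed as a single `iptables` NAT rule. The
sketch below only prints such a rule rather than applying it, since applying
it requires root on the node. The function name `masquerade_rule`, the cluster
range `10.0.0.0/8`, and the interface `eth0` are assumptions for illustration;
the right values depend on how the cluster network is laid out.

```sh
#!/usr/bin/env bash
# Hypothetical helper: build (but do not apply) a masquerade rule for
# traffic that leaves the cluster's address range.
masquerade_rule() {
  local cluster_cidr="$1" out_iface="$2"
  # Packets NOT destined for cluster_cidr that leave via out_iface get
  # their source address rewritten to the VM's own IP.
  echo "iptables -t nat -A POSTROUTING ! -d ${cluster_cidr} -o ${out_iface} -j MASQUERADE"
}

# Print the rule for review (assumed range and interface).
masquerade_rule "10.0.0.0/8" "eth0"
```

Traffic between pods stays un-NAT-ed under this rule because its destination
falls inside the excluded range.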
### Other implementations

With the primary aim of providing the IP-per-pod model, other implementations
exist to serve the purpose outside of GCE.

- [OpenVSwitch with GRE/VxLAN](../admin/ovs-networking.md)
- [Flannel](https://github.com/coreos/flannel#flannel)
- [L2 networks](http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/)
  ("With Linux Bridge devices" section)
- [Weave](https://github.com/zettio/weave) is yet another way to build an
  overlay network, primarily aiming at Docker integration.
- [Calico](https://github.com/Metaswitch/calico) uses BGP to enable real
  container IPs.

## Pod to service

The [service](../user-guide/services.md) abstraction provides a way to group
pods under a common access policy (e.g., load-balanced). The implementation of
this creates a virtual IP which clients can access and which is transparently
proxied to the pods in a Service. Each node runs a kube-proxy process which
programs `iptables` rules to trap access to service IPs and redirect them to
the correct backends. This provides a highly-available load-balancing solution
with low performance overhead by balancing client traffic from a node on that
same node.

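To make the "trap and redirect" idea concrete, the sketch below prints a
generic NAT rule that rewrites traffic for a service's virtual IP to one
backend pod. It is not the rule set kube-proxy actually generates, which uses
its own chains and differs between versions. The function name
`service_dnat_rule` and all addresses and ports are made up for illustration.

```sh
#!/usr/bin/env bash
# Hypothetical helper: print (but do not apply) a NAT rule that traps
# traffic to a service's virtual IP and rewrites it to one backend pod.
service_dnat_rule() {
  local service_ip="$1" service_port="$2" backend="$3"
  echo "iptables -t nat -A PREROUTING -p tcp -d ${service_ip}/32 --dport ${service_port} -j DNAT --to-destination ${backend}"
}

# Assumed virtual IP 10.0.0.10:80 mapped to an assumed pod at 10.244.1.5:8080.
service_dnat_rule "10.0.0.10" "80" "10.244.1.5:8080"
```

Because the rewrite happens on the node where the client runs, no extra hop to
a central load balancer is needed for in-cluster traffic.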
## External to internal

So far the discussion has been about how to access a pod or service from within
the cluster. Accessing a pod from outside the cluster is trickier. We want to
offer highly-available, high-performance load balancing to target Kubernetes
Services. Most public cloud providers are simply not flexible enough yet.

The way this is generally implemented is to set up external load balancers
(e.g., GCE's ForwardingRules or AWS's ELB) which target all nodes in a cluster.
When traffic arrives at a node, it is recognized as being part of a particular
Service and routed to an appropriate backend Pod. This does mean that some
traffic will get double-bounced on the network. Once cloud providers have
better offerings we can take advantage of those.

## Challenges and future work

### Docker API

Right now, `docker inspect` doesn't show the networking configuration of the
containers, since they derive it from another container. That information should
be exposed somehow.

### External IP assignment

We want to be able to assign IP addresses externally from Docker
([#6743](https://github.com/dotcloud/docker/issues/6743)) so that we don't need
to statically allocate fixed-size IP ranges to each node, so that IP addresses
can be made stable across pod infra container restarts
([#2801](https://github.com/dotcloud/docker/issues/2801)), and to facilitate
pod migration. Right now, if the pod infra container dies, all the user
containers must be stopped and restarted because the netns of the pod infra
container will change on restart, and any subsequent user container restart
will join that new netns, thereby not being able to see its peers.
Additionally, a change in IP address would encounter DNS caching/TTL problems.
External IP assignment would also simplify DNS support.

### IPv6

IPv6 would also be a nice option, but we can't depend on it yet. Docker support
is in progress: [Docker issue #2974](https://github.com/dotcloud/docker/issues/2974),
[Docker issue #6923](https://github.com/dotcloud/docker/issues/6923),
[Docker issue #6975](https://github.com/dotcloud/docker/issues/6975).
Additionally, direct IPv6 assignment to instances doesn't appear to be supported
by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull
requests from people running Kubernetes on bare metal, though. :-)
