When demonstrating runtime constraints, it's useful to show what happens when a node is under heavy load. For
this scenario, we use a single node with 2 CPUs and 1GB of memory, but the results extend to multi-node
scenarios.
### CPU requests
Each container in a pod may specify the amount of CPU it requests on a node. CPU requests are used at schedule time, and represent a minimum amount of CPU that should be reserved for your container to run.
When executing your container, the Kubelet maps your container's CPU requests to CFS shares in the Linux kernel. CFS CPU shares do not impose a ceiling on the actual amount of CPU the container can use. Instead, they define a relative weight across all containers on the system for how much CPU time each container should get when there is CPU contention.
Let's demonstrate this concept using a simple container that will consume as much CPU as possible.
```
$ cluster/kubectl.sh run cpuhog \
--image=busybox \
--requests=cpu=100m \
-- md5sum /dev/urandom
```
This will create a single pod on your minion that requests 1/10 of a CPU, but it has no limit on how much CPU it may actually consume
on the node.
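If you want to confirm that the request was recorded, you can read it back from the API server. This is a sketch; it assumes that kubectl run labeled the pod with run=cpuhog, which may differ across versions.
```
$ cluster/kubectl.sh get pods -l run=cpuhog -o yaml | grep -A 2 resources
```
You should see a request of cpu: 100m under the container spec and, unless a LimitRange applies defaults, no limits section.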
To demonstrate that there is no ceiling, SSH into your machine and you will see the container consuming as much CPU as possible on the node.
```
$ vagrant ssh minion-1
$ sudo docker stats $(sudo docker ps -q)
CONTAINER CPU % MEM USAGE/LIMIT MEM % NET I/O
6b593b1a9658 0.00% 1.425 MB/1.042 GB 0.14% 1.038 kB/738 B
ae8ae4ffcfe4 150.06% 831.5 kB/1.042 GB 0.08% 0 B/0 B
```
As you can see, it is consuming 150% CPU, far more than the 100m it requested, because nothing else is contending for CPU.
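You can also check how the kubelet translated the 100m request into CFS shares. The kubelet converts milli-CPUs to shares as milliCPU * 1024 / 1000, so 100m should map to roughly 102 shares. This is a sketch; substitute the container ID that docker ps reports for the cpuhog container, and note that on older Docker versions the field may live under .Config.CpuShares instead of .HostConfig.CpuShares.
```
$ sudo docker inspect --format '{{.HostConfig.CpuShares}}' <container-id>
```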
If we scale our replication controller to 20 pods, we should see that each container is given an equal proportion of CPU time.
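A sketch of that experiment, assuming your kubectl supports the scale command and that kubectl run created a replication controller named cpuhog:
```
$ cluster/kubectl.sh scale rc cpuhog --replicas=20
$ vagrant ssh minion-1
$ sudo docker stats $(sudo docker ps -q)
```
With 20 containers competing for 2 CPUs, each container should settle at a roughly equal slice of CPU time instead of the 150% a single container could grab on its own.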
### Memory requests and limits
If you only schedule __Guaranteed__ memory containers, where the request is equal to the limit, then you are not in major danger of
causing an OOM event on your node. If any individual container consumes more memory than its specified limit, it will be killed.
If you schedule __BestEffort__ memory containers, where neither the request nor the limit is specified, or __Burstable__ memory containers, where
the request is less than any specified limit, then it is possible for a container to try to use more memory than is actually available on the node.
If this occurs, the system will prioritize which containers are killed based on their quality of service. This is done
using the OOMScoreAdjust feature, which maps to the Linux kernel's oom_score_adj value and provides a heuristic to rank a process between -1000 and 1000. Processes
with lower values are preserved in favor of processes with higher values. The system daemons (kubelet, kube-proxy, docker) all run with
low OOMScoreAdjust values.
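If you are curious, you can inspect these values directly on the node. This is a sketch; the exact scores vary by Kubernetes version, and you would substitute a container ID from docker ps.
```
$ vagrant ssh minion-1
# oom_score_adj of the kubelet itself; expect a low (negative) value
$ cat /proc/$(pgrep -f -o kubelet)/oom_score_adj
# oom_score_adj of a container's main process
$ cat /proc/$(sudo docker inspect --format '{{.State.Pid}}' <container-id>)/oom_score_adj
```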
In simplest terms, __Guaranteed__ memory containers are given a lower value than __Burstable__ containers, which in turn have
a lower value than __BestEffort__ containers. As a consequence, __BestEffort__ containers should be killed before the other tiers.
To demonstrate this, let's spin up a set of different replication controllers that will overcommit the node.
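A minimal sketch of that experiment; the names, images, and memory sizes below are illustrative, and it assumes your kubectl run supports a --limits flag alongside --requests:
```
# Guaranteed: request == limit
$ cluster/kubectl.sh run mem-guaranteed --image=busybox \
    --requests=memory=300Mi --limits=memory=300Mi \
    -- sleep 3600

# Burstable: request < limit
$ cluster/kubectl.sh run mem-burstable --image=busybox \
    --requests=memory=100Mi --limits=memory=600Mi \
    -- sleep 3600

# BestEffort: no request or limit; the shell loop keeps doubling a string to eat memory
$ cluster/kubectl.sh run mem-besteffort --image=busybox \
    -- sh -c 'x=.; while true; do x="$x$x"; done'

# Check the pods periodically and watch the restart counts
$ cluster/kubectl.sh get pods
```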
You will see that our BestEffort pod goes into a restart cycle, while the pods with greater levels of quality of service continue to function.
As you can see, we rely on the kernel to react to system OOM events. Depending on how your host operating
system is configured, and which process the kernel ultimately decides to kill on your node, you may experience unstable results. In addition, during an OOM event, while the kernel is cleaning up processes, the system may experience significant periods of slowdown or appear unresponsive. As a result, while the system allows you to overcommit memory, we recommend that you do not induce a kernel system OOM.