k3s/docs/proposals/kubelet-eviction.md

<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<!-- TAG RELEASE_LINK, added by the munger automatically -->
<strong>
The latest release of this document can be found
[here](http://releases.k8s.io/release-1.4/docs/proposals/kubelet-eviction.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Kubelet - Eviction Policy

**Authors**: Derek Carr (@derekwaynecarr), Vishnu Kannan (@vishh)

**Status**: Proposed (memory evictions WIP)

This document presents a specification for how the `kubelet` evicts pods when compute resources are too low.

## Goals

The node needs a mechanism to preserve stability when available compute resources are low.

This is especially important when dealing with incompressible compute resources such
as memory or disk.  If either resource is exhausted, the node would become unstable.

The `kubelet` has some support for influencing system behavior in response to a system OOM by
having the system OOM killer see higher OOM score adjust scores for containers that have consumed
the largest amount of memory relative to their request.  System OOM events are very compute
intensive, and can stall the node until the OOM killing process has completed.  In addition,
the system is prone to return to an unstable state since the containers that are killed due to OOM
are either restarted or a new pod is scheduled on to the node.

Instead, we would prefer a system where the `kubelet` can pro-actively monitor for
and prevent against total starvation of a compute resource, and in cases of where it
could appear to occur, pro-actively fail one or more pods, so the workload can get
moved and scheduled elsewhere when/if its backing controller creates a new pod.

## Scope of proposal

This proposal defines a pod eviction policy for reclaiming compute resources.

As of now, memory and disk based evictions are supported.
The proposal focuses on a simple default eviction strategy
intended to cover the broadest class of user workloads.

## Eviction Signals

The `kubelet` will support the ability to trigger eviction decisions on the following signals.

| Eviction Signal  | Description                                                                     |
|------------------|---------------------------------------------------------------------------------|
| memory.available | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet |
| nodefs.available   | nodefs.available := node.stats.fs.available |
| nodefs.inodesFree | nodefs.inodesFree := node.stats.fs.inodesFree |
| imagefs.available | imagefs.available := node.stats.runtime.imagefs.available |
| imagefs.inodesFree | imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree |

Each of the above signals support either a literal or percentage based value.  The percentage based value
is calculated relative to the total capacity associated with each signal.

`kubelet` supports only two filesystem partitions.

1. The `nodefs` filesystem that kubelet uses for volumes, daemon logs, etc.
1. The `imagefs` filesystem that container runtimes uses for storing images and container writable layers.

`imagefs` is optional. `kubelet` auto-discovers these filesystems using cAdvisor.
`kubelet` does not care about any other filesystems. Any other types of configurations are not currently supported by the kubelet. For example, it is *not OK* to store volumes and logs in a dedicated `imagefs`.

## Eviction Thresholds

The `kubelet` will support the ability to specify eviction thresholds.

An eviction threshold is of the following form:

`<eviction-signal><operator><quantity | int%>`

* valid `eviction-signal` tokens as defined above.
* valid `operator` tokens are `<`
* valid `quantity` tokens must match the quantity representation used by Kubernetes
* an eviction threshold can be expressed as a percentage if ends with `%` token.

If threshold criteria are met, the `kubelet` will take pro-active action to attempt
to reclaim the starved compute resource associated with the eviction signal.

The `kubelet` will support soft and hard eviction thresholds.

For example, if a node has `10Gi` of memory, and the desire is to induce eviction
if available memory falls below `1Gi`, an eviction signal can be specified as either
of the following (but not both).

* `memory.available<10%`
* `memory.available<1Gi`

### Soft Eviction Thresholds

A soft eviction threshold pairs an eviction threshold with a required
administrator specified grace period.  No action is taken by the `kubelet`
to reclaim resources associated with the eviction signal until that grace
period has been exceeded.  If no grace period is provided, the `kubelet` will
error on startup.

In addition, if a soft eviction threshold has been met, an operator can
specify a maximum allowed pod termination grace period to use when evicting
pods from the node.  If specified, the `kubelet` will use the lesser value among
the `pod.Spec.TerminationGracePeriodSeconds` and the max allowed grace period.
If not specified, the `kubelet` will kill pods immediately with no graceful
termination.

To configure soft eviction thresholds, the following flags will be supported:

```
--eviction-soft="": A set of eviction thresholds (e.g. memory.available<1.5Gi) that if met over a corresponding grace period would trigger a pod eviction.
--eviction-soft-grace-period="": A set of eviction grace periods (e.g. memory.available=1m30s) that correspond to how long a soft eviction threshold must hold before triggering a pod eviction.
--eviction-max-pod-grace-period="0": Maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
```

### Hard Eviction Thresholds

A hard eviction threshold has no grace period, and if observed, the `kubelet`
will take immediate action to reclaim the associated starved resource.  If a
hard eviction threshold is met, the `kubelet` will kill the pod immediately
with no graceful termination.

To configure hard eviction thresholds, the following flag will be supported:

```
--eviction-hard="": A set of eviction thresholds (e.g. memory.available<1Gi) that if met would trigger a pod eviction.
```

## Eviction Monitoring Interval

The `kubelet` will initially evaluate eviction thresholds at the same
housekeeping interval as `cAdvisor` housekeeping.

In Kubernetes 1.2, this was defaulted to `10s`.

It is a goal to shrink the monitoring interval to a much shorter window.
This may require changes to `cAdvisor` to let alternate housekeeping intervals
be specified for selected data (https://github.com/google/cadvisor/issues/1247)

For the purposes of this proposal, we expect the monitoring interval to be no
more than `10s` to know when a threshold has been triggered, but we will strive
to reduce that latency time permitting.

## Node Conditions

The `kubelet` will support a node condition that corresponds to each eviction signal.

If a hard eviction threshold has been met, or a soft eviction threshold has been met
independent of its associated grace period, the `kubelet` will report a condition that
reflects the node is under pressure.

The following node conditions are defined that correspond to the specified eviction signal.

| Node Condition | Eviction Signal  | Description                                                      |
|----------------|------------------|------------------------------------------------------------------|
| MemoryPressure | memory.available | Available memory on the node has satisfied an eviction threshold |
| DiskPressure | nodefs.available, nodefs.inodesFree, imagefs.available, or imagefs.inodesFree | Available disk space and inodes on either the node's root filesytem or image filesystem has satisfied an eviction threshold |

The `kubelet` will continue to report node status updates at the frequency specified by
`--node-status-update-frequency` which defaults to `10s`.

### Oscillation of node conditions

If a node is oscillating above and below a soft eviction threshold, but not exceeding
its associated grace period, it would cause the corresponding node condition to
constantly oscillate between true and false, and could cause poor scheduling decisions
as a consequence.

To protect against this oscillation, the following flag is defined to control how
long the `kubelet` must wait before transitioning out of a pressure condition.

```
--eviction-pressure-transition-period=5m0s: Duration for which the kubelet has to wait
before transitioning out of an eviction pressure condition.
```

The `kubelet` would ensure that it has not observed an eviction threshold being met
for the specified pressure condition for the period specified before toggling the
condition back to `false`.

## Eviction scenarios

### Memory

Let's assume the operator started the `kubelet` with the following:

```
--eviction-hard="memory.available<100Mi"
--eviction-soft="memory.available<300Mi"
--eviction-soft-grace-period="memory.available=30s"
```

The `kubelet` will run a sync loop that looks at the available memory
on the node as reported from `cAdvisor` by calculating (capacity - workingSet).
If available memory is observed to drop below 100Mi, the `kubelet` will immediately
initiate eviction. If available memory is observed as falling below `300Mi`,
it will record when that signal was observed internally in a cache.  If at the next
sync, that criteria was no longer satisfied, the cache is cleared for that
signal.  If that signal is observed as being satisfied for longer than the
specified period, the `kubelet` will initiate eviction to attempt to
reclaim the resource that has met its eviction threshold.

### Disk

Let's assume the operator started the `kubelet` with the following:

```
--eviction-hard="nodefs.available<1Gi,nodefs.inodesFree<1,imagefs.available<10Gi,imagefs.inodesFree<10"
--eviction-soft="nodefs.available<1.5Gi,nodefs.inodesFree<10,imagefs.available<20Gi,imagefs.inodesFree<100"
--eviction-soft-grace-period="nodefs.available=1m,imagefs.available=2m"
```

The `kubelet` will run a sync loop that looks at the available disk
on the node's supported partitions as reported from `cAdvisor`.
If available disk space on the node's primary filesystem is observed to drop below 1Gi
or the free inodes on the node's primary filesystem is less than 1,
the `kubelet` will immediately initiate eviction.
If available disk space on the node's image filesystem is observed to drop below 10Gi
or the free inodes on the node's primary image filesystem is less than 10,
the `kubelet` will immediately initiate eviction.

If available disk space on the node's primary filesystem is observed as falling below `1.5Gi`,
or if the free inodes on the node's primary filesystem is less than 10,
or if available disk space on the node's image filesystem is observed as falling below `20Gi`,
or if the free inodes on the node's image filesystem is less than 100,
it will record when that signal was observed internally in a cache.  If at the next
sync, that criterion was no longer satisfied, the cache is cleared for that
signal.  If that signal is observed as being satisfied for longer than the
specified period, the `kubelet` will initiate eviction to attempt to
reclaim the resource that has met its eviction threshold.

## Eviction of Pods

If an eviction threshold has been met, the `kubelet` will initiate the
process of evicting pods until it has observed the signal has gone below
its defined threshold.

The eviction sequence works as follows:

* for each monitoring interval, if eviction thresholds have been met
 * find candidate pod
 * fail the pod
 * block until pod is terminated on node

If a pod is not terminated because a container does not happen to die
(i.e. processes stuck in disk IO for example), the `kubelet` may select
an additional pod to fail instead.  The `kubelet` will invoke the `KillPod`
operation exposed on the runtime interface.  If an error is returned,
the `kubelet` will select a subsequent pod.

## Eviction Strategy

The `kubelet` will implement a default eviction strategy oriented around
the pod quality of service class.

It will target pods that are the largest consumers of the starved compute
resource relative to their scheduling request.  It ranks pods within a
quality of service tier in the following order.

* `BestEffort` pods that consume the most of the starved resource are failed
first.
* `Burstable` pods that consume the greatest amount of the starved resource
relative to their request for that resource are killed first.  If no pod
has exceeded its request, the strategy targets the largest consumer of the
starved resource.
* `Guaranteed` pods that consume the greatest amount of the starved resource
relative to their request are killed first.  If no pod has exceeded its request,
the strategy targets the largest consumer of the starved resource.

A guaranteed pod is guaranteed to never be evicted because of another pod's
resource consumption.  That said, guarantees are only as good as the underlying
foundation they are built upon.  If a system daemon
(i.e. `kubelet`, `docker`, `journald`, etc.) is consuming more resources than
were reserved via `system-reserved` or `kube-reserved` allocations, and the node
only has guaranteed pod(s) remaining, then the node must choose to evict a
guaranteed pod in order to preserve node stability, and to limit the impact
of the unexpected consumption to other guaranteed pod(s).

## Disk based evictions

### With Imagefs

If `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:

1. Delete logs
1. Evict Pods if required.

If `imagefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:

1. Delete unused images
1. Evict Pods if required.

### Without Imagefs

If `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:

1. Delete logs
1. Delete unused images
1. Evict Pods if required.

Let's explore the different options for freeing up disk space.

### Delete logs of dead pods/containers

As of today, logs are tied to a container's lifetime. `kubelet` keeps dead containers around,
to provide access to logs.
In the future, if we store logs of dead containers outside of the container itself, then
`kubelet` can delete these logs to free up disk space.
Once the lifetime of containers and logs are split, kubelet can support more user friendly policies
around log evictions. `kubelet` can delete logs of the oldest containers first.
Since logs from the first and the most recent incarnation of a container is the most important for most applications,
kubelet can try to preserve these logs and aggresively delete logs from other container incarnations.

Until logs are split from container's lifetime, `kubelet` can delete dead containers to free up disk space.

### Delete unused images

`kubelet` performs image garbage collection based on thresholds today. It uses a high and a low watermark.
Whenever disk usage exceeds the high watermark, it removes images until the low watermark is reached.
`kubelet` employs a LRU policy when it comes to deleting images.

The existing policy will be replaced with a much simpler policy.
Images will be deleted based on eviction thresholds. If kubelet can delete logs and keep disk space availability
above eviction thresholds, then kubelet will not delete any images.
If `kubelet` decides to delete unused images, it will delete *all* unused images.

### Evict pods

There is no ability to specify disk limits for pods/containers today.
Disk is a best effort resource. When necessary, `kubelet` can evict pods one at a time.
`kubelet` will follow the [Eviction Strategy](#eviction-strategy) mentioned above for making eviction decisions.
`kubelet` will evict the pod that will free up the maximum amount of disk space on the filesystem that has hit eviction thresholds.
Within each QoS bucket, `kubelet` will sort pods according to their disk usage.
`kubelet` will sort pods in each bucket as follows:

#### Without Imagefs

If `nodefs` is triggering evictions, `kubelet` will sort pods based on their total disk usage
- local volumes + logs & writable layer of all its containers.

#### With Imagefs

If `nodefs` is triggering evictions, `kubelet` will sort pods based on the usage on `nodefs`
- local volumes + logs of all its containers.

If `imagefs` is triggering evictions, `kubelet` will sort pods based on the writable layer usage of all its containers.

## Minimum eviction reclaim

In certain scenarios, eviction of pods could result in reclamation of small amount of resources. This can result in
`kubelet` hitting eviction thresholds in repeated successions. In addition to that, eviction of resources like `disk`,
 is time consuming.

To mitigate these issues, `kubelet` will have a per-resource `minimum-reclaim`. Whenever `kubelet` observes
resource pressure, `kubelet` will attempt to reclaim at least `minimum-reclaim` amount of resource.

Following are the flags through which `minimum-reclaim` can be configured for each evictable resource:

`--eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"`

The default `eviction-minimum-reclaim` is `0` for all resources.

## Deprecation of existing features

`kubelet` has been freeing up disk space on demand to keep the node stable. As part of this proposal,
some of the existing features/flags around disk space retrieval will be deprecated in-favor of this proposal.

| Existing Flag | New Flag | Rationale |
| ------------- | -------- | --------- |
| `--image-gc-high-threshold` | `--eviction-hard` or `eviction-soft` | existing eviction signals can capture image garbage collection |
| `--image-gc-low-threshold` | `--eviction-minimum-reclaim` | eviction reclaims achieve the same behavior |
| `--maximum-dead-containers` | | deprecated once old logs are stored outside of container's context |
| `--maximum-dead-containers-per-container` | | deprecated once old logs are stored outside of container's context |
| `--minimum-container-ttl-duration` | | deprecated once old logs are stored outside of container's context |
| `--low-diskspace-threshold-mb` | `--eviction-hard` or `eviction-soft` | this use case is better handled by this proposal |
| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` | make the flag generic to suit all compute resources |

## Kubelet Admission Control

### Feasibility checks during kubelet admission

#### Memory

The `kubelet` will reject `BestEffort` pods if any of the memory
eviction thresholds have been exceeded independent of the configured
grace period.

Let's assume the operator started the `kubelet` with the following:

```
--eviction-soft="memory.available<256Mi"
--eviction-soft-grace-period="memory.available=30s"
```

If the `kubelet` sees that it has less than `256Mi` of memory available
on the node, but the `kubelet` has not yet initiated eviction since the
grace period criteria has not yet been met, the `kubelet` will still immediately
fail any incoming best effort pods.

The reasoning for this decision is the expectation that the incoming pod is
likely to further starve the particular compute resource and the `kubelet` should
return to a steady state before accepting new workloads.

#### Disk

The `kubelet` will reject all pods if any of the disk eviction thresholds have been met.

Let's assume the operator started the `kubelet` with the following:

```
--eviction-soft="nodefs.available<1500Mi"
--eviction-soft-grace-period="nodefs.available=30s"
```

If the `kubelet` sees that it has less than `1500Mi` of disk available
on the node, but the `kubelet` has not yet initiated eviction since the
grace period criteria has not yet been met, the `kubelet` will still immediately
fail any incoming pods.

The rationale for failing **all** pods instead of just best effort is because disk is currently
a best effort resource for all QoS classes.

Kubelet will apply the same policy even if there is a dedicated `image` filesystem.

## Scheduler

The node will report a condition when a compute resource is under pressure.  The
scheduler should view that condition as a signal to dissuade placing additional
best effort pods on the node.

In this case, the `MemoryPressure` condition if true should dissuade the scheduler
from placing new best effort pods on the node since they will be rejected by the `kubelet` in admission.

On the other hand, the `DiskPressure` condition if true should dissuade the scheduler from
placing **any** new pods on the node since they will be rejected by the `kubelet` in admission.

## Best Practices

### DaemonSet

It is never desired for a `kubelet` to evict a pod that was derived from
a `DaemonSet` since the pod will immediately be recreated and rescheduled
back to the same node.

At the moment, the `kubelet` has no ability to distinguish a pod created
from `DaemonSet` versus any other object.  If/when that information is
available, the `kubelet` could pro-actively filter those pods from the
candidate set of pods provided to the eviction strategy.

In general, it should be strongly recommended that `DaemonSet` not
create `BestEffort` pods to avoid being identified as a candidate pod
for eviction. Instead `DaemonSet` should ideally include Guaranteed pods only.

## Known issues

### kubelet may evict more pods than needed

The pod eviction may evict more pods than needed due to stats collection timing gap. This can be mitigated by adding
the ability to get root container stats on an on-demand basis (https://github.com/google/cadvisor/issues/1247) in the future.

### How kubelet ranks pods for eviction in response to inode exhaustion

At this time, it is not possible to know how many inodes were consumed by a particular container.  If the `kubelet` observes
inode exhaustion, it will evict pods by ranking them by quality of service.  The following issue has been opened in cadvisor
to track per container inode consumption (https://github.com/google/cadvisor/issues/1422) which would allow us to rank pods
by inode consumption.  For example, this would let us identify a container that created large numbers of 0 byte files, and evict
that pod over others.

<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubelet-eviction.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->