Merge pull request #25237 from vishh/disk-based-eviction-proposal

Automatic merge from submit-queue

Proposal for disk based evictions

cc @dchen1107 @derekwaynecarr
k8s-merge-robot 2016-05-20 17:57:18 -07:00
commit 12f7b81481
1 changed file with 173 additions and 10 deletions


# Kubelet - Eviction Policy

**Authors**: Derek Carr (@derekwaynecarr), Vishnu Kannan (@vishh)

**Status**: Proposed (memory evictions WIP)

This document presents a specification for how the `kubelet` evicts pods when compute resources are too low.
This proposal defines a pod eviction policy for reclaiming compute resources.

As of now, memory and disk based evictions are supported. The proposal focuses on
a simple default eviction strategy intended to cover the broadest class of user workloads.

## Eviction Signals
The `kubelet` will support the ability to trigger eviction decisions on the following signals:
| Eviction Signal   | Description                                                                      |
|-------------------|----------------------------------------------------------------------------------|
| memory.available  | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet  |
| nodefs.available | nodefs.available := node.stats.fs.available |
| imagefs.available | imagefs.available := node.stats.runtime.imagefs.available |
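
As a hedged illustration of how these signals evaluate (the capacity and usage figures below are assumptions, not values from this proposal):

```
# assume a node with 8Gi of memory capacity and a 7.5Gi working set:
memory.available = capacity[memory] - workingSet
                 = 8Gi - 7.5Gi
                 = 0.5Gi
```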
`kubelet` supports only two filesystem partitions:

1. The `nodefs` filesystem that the kubelet uses for volumes, daemon logs, etc.
1. The `imagefs` filesystem that container runtimes use for storing images and container writable layers.

`imagefs` is optional. `kubelet` auto-discovers these filesystems using cAdvisor.
`kubelet` does not care about any other filesystems. Any other configuration is not currently supported by the kubelet. For example, it is *not OK* to store volumes and logs on a dedicated `imagefs`.
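
The two supported layouts can be sketched as follows (the mount points are illustrative assumptions, not mandated by this proposal):

```
# layout 1: single partition
#   nodefs  (e.g. /var/lib/kubelet) -> volumes, daemon logs, images, writable layers
#
# layout 2: dedicated image filesystem
#   nodefs  (e.g. /var/lib/kubelet) -> volumes, daemon logs
#   imagefs (e.g. /var/lib/docker)  -> images, container writable layers
```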
## Eviction Thresholds
The following node conditions are defined that correspond to the specified eviction signal.

| Node Condition | Eviction Signal                          | Description                                                                                                        |
|----------------|------------------------------------------|--------------------------------------------------------------------------------------------------------------------|
| MemoryPressure | memory.available                         | Available memory on the node has satisfied an eviction threshold                                                    |
| DiskPressure   | nodefs.available (or) imagefs.available  | Available disk space on either the node's root filesystem or image filesystem has satisfied an eviction threshold   |
The `kubelet` will continue to report node status updates at the frequency specified by
`--node-status-update-frequency` which defaults to `10s`.
The `kubelet` would ensure that it has not observed an eviction threshold being met
for the specified pressure condition for the period specified before toggling the
condition back to `false`.
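
For example, an operator who wants to dampen such flapping could set the transition period flag introduced later in this proposal (the value shown is an assumption, not a proposed default):

```
--eviction-pressure-transition-period=5m
```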
## Eviction scenarios

### Memory
Let's assume the operator started the `kubelet` with the following:
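
For illustration, assume memory thresholds along the following lines (the specific values here are assumptions, not proposed defaults):

```
--eviction-hard="memory.available<100Mi"
--eviction-soft="memory.available<300Mi"
--eviction-soft-grace-period="memory.available=30s"
```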
If the `kubelet` observes the hard eviction threshold being met, it will immediately initiate
eviction. If it observes the soft eviction threshold being met, it will record when that
signal was observed internally in a cache. If at the next sync, that criterion is no longer
satisfied, the cache is cleared for that signal. If that signal is observed as being satisfied
for longer than the specified period, the `kubelet` will initiate eviction to attempt to
reclaim the resource that has met its eviction threshold.
### Disk
Let's assume the operator started the `kubelet` with the following:
```
--eviction-hard="nodefs.available<1Gi,imagefs.available<10Gi"
--eviction-soft="nodefs.available<1.5Gi,imagefs.available<20Gi"
--eviction-soft-grace-period="nodefs.available=1m,imagefs.available=2m"
```
The `kubelet` will run a sync loop that looks at the available disk
on the node's supported partitions as reported from `cAdvisor`.
If available disk space on the node's primary filesystem is observed to drop below 1Gi,
the `kubelet` will immediately initiate eviction.
If available disk space on the node's image filesystem is observed to drop below 10Gi,
the `kubelet` will immediately initiate eviction.
If available disk space on the node's primary filesystem is observed as falling below `1.5Gi`,
or if available disk space on the node's image filesystem is observed as falling below `20Gi`,
it will record when that signal was observed internally in a cache. If at the next
sync, that criterion was no longer satisfied, the cache is cleared for that
signal. If that signal is observed as being satisfied for longer than the
specified period, the `kubelet` will initiate eviction to attempt to
reclaim the resource that has met its eviction threshold.
## Eviction of Pods

If an eviction threshold has been met, the `kubelet` will initiate the
process of evicting pods to reclaim the resource that has met its eviction threshold.
If the node only has guaranteed pod(s) remaining, then the node must choose to evict a
guaranteed pod in order to preserve node stability, and to limit the impact
of the unexpected consumption to other guaranteed pod(s).
## Disk based evictions

### With Imagefs

If the `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:

1. Delete logs
1. Evict Pods if required.

If the `imagefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:

1. Delete unused images
1. Evict Pods if required.

### Without Imagefs

If the `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:

1. Delete logs
1. Delete unused images
1. Evict Pods if required.
Let's explore the different options for freeing up disk space.
### Delete logs of dead pods/containers
As of today, logs are tied to a container's lifetime. `kubelet` keeps dead containers around,
to provide access to logs.
In the future, if we store logs of dead containers outside of the container itself, then
`kubelet` can delete these logs to free up disk space.
Once the lifetimes of containers and logs are split, kubelet can support more user-friendly policies
around log evictions. `kubelet` can delete logs of the oldest containers first.
Since logs from the first and the most recent incarnations of a container are the most important for most applications,
kubelet can try to preserve these logs and aggressively delete logs from other container incarnations.

Until logs are split from the container's lifetime, `kubelet` can delete dead containers to free up disk space.
### Delete unused images
`kubelet` performs image garbage collection based on thresholds today. It uses a high and a low watermark.
Whenever disk usage exceeds the high watermark, it removes images until the low watermark is reached.
`kubelet` employs an LRU policy when it comes to deleting images.
The existing policy will be replaced with a much simpler policy.
Images will be deleted based on eviction thresholds. If kubelet can delete logs and keep disk space availability
above eviction thresholds, then kubelet will not delete any images.
If `kubelet` decides to delete unused images, it will delete *all* unused images.
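
As a hedged sketch, the shift might look like the following in practice (the numeric values are assumptions rather than proposed defaults):

```
# today: watermark based image garbage collection
--image-gc-high-threshold=90
--image-gc-low-threshold=80

# under this proposal: eviction thresholds plus a minimum reclamation amount
--eviction-hard="imagefs.available<10Gi"
--minimum-eviction-thresholds="imagefs.available=2Gi"
```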
### Evict pods
There is no ability to specify disk limits for pods/containers today.
Disk is a best effort resource. When necessary, `kubelet` can evict pods one at a time.
`kubelet` will follow the [Eviction Strategy](#eviction-strategy) mentioned above for making eviction decisions.
`kubelet` will evict the pod that will free up the maximum amount of disk space on the filesystem that has hit eviction thresholds.
Within each QoS bucket, `kubelet` will sort pods according to their disk usage, computed as follows:
#### Without Imagefs
If `nodefs` is triggering evictions, `kubelet` will sort pods based on their total disk usage
- local volumes + logs & writable layer of all its containers.
#### With Imagefs
If `nodefs` is triggering evictions, `kubelet` will sort pods based on the usage on `nodefs`
- local volumes + logs of all its containers.
If `imagefs` is triggering evictions, `kubelet` will sort pods based on the writable layer usage of all its containers.
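
As a hedged illustration of this ranking (pod names and usage figures are hypothetical), consider two pods in the same QoS bucket on a node with a dedicated `imagefs` while `nodefs` is under pressure:

```
# nodefs usage per pod = local volumes + logs of all its containers
#   pod-a: volumes 800Mi + logs 200Mi = 1000Mi   <- evicted first, frees the most nodefs space
#   pod-b: volumes 100Mi + logs  50Mi =  150Mi
```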
## Minimum eviction thresholds
In certain scenarios, eviction of pods could result in reclamation of only a small amount of a resource. This can result in
`kubelet` hitting eviction thresholds in repeated succession. In addition, reclaiming some resources, like `disk`,
is time consuming.

To mitigate these issues, `kubelet` will have a per-resource `minimum-threshold`. Whenever `kubelet` observes
resource pressure, `kubelet` will attempt to reclaim at least the `minimum-threshold` amount of that resource.
Following are the flags through which `minimum-thresholds` can be configured for each evictable resource:
`--minimum-eviction-thresholds="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"`
The default `minimum-eviction-threshold` is `0` for all resources.
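
A hedged worked example of this behavior, assuming the minimum thresholds above together with a hard threshold of `nodefs.available<1Gi`:

```
# nodefs.available drops to 950Mi -> the 1Gi hard threshold is crossed
# the nodefs minimum threshold is 500Mi, so kubelet keeps reclaiming until at least
# 500Mi has been freed (nodefs.available >= ~1450Mi), instead of stopping as soon as
# availability climbs back above 1Gi
```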
## Deprecation of existing features

`kubelet` has been freeing up disk space on demand to keep the node stable. As part of this proposal,
some of the existing features/flags around disk space reclamation will be deprecated in favor of the eviction configuration described here.

| Existing Flag | New Flag | Rationale |
|---------------|----------|-----------|
| `--image-gc-high-threshold` | `--eviction-hard` or `--eviction-soft` | existing eviction signals can capture image garbage collection |
| `--image-gc-low-threshold` | `--minimum-eviction-thresholds` | eviction thresholds achieve the same behavior |
| `--maximum-dead-containers` | | deprecated once old logs are stored outside of container's context |
| `--maximum-dead-containers-per-container` | | deprecated once old logs are stored outside of container's context |
| `--minimum-container-ttl-duration` | | deprecated once old logs are stored outside of container's context |
| `--low-diskspace-threshold-mb` | `--eviction-hard` or `--eviction-soft` | this use case is better handled by this proposal |
| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` | make the flag generic to suit all compute resources |
## Kubelet Admission Control

### Feasibility checks during kubelet admission

#### Memory

The `kubelet` will reject `BestEffort` pods if any of the memory
eviction thresholds have been exceeded independent of the configured
grace period.
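
For example, with a soft memory threshold like the following (the values are illustrative assumptions), an incoming `BestEffort` pod would be rejected as soon as `memory.available` falls below `300Mi`, even though the soft grace period has not yet elapsed for eviction purposes:

```
--eviction-soft="memory.available<300Mi"
--eviction-soft-grace-period="memory.available=30s"
```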
The reasoning for this decision is the expectation that the incoming pod is
likely to further starve the particular compute resource and the `kubelet` should
return to a steady state before accepting new workloads.
#### Disk
The `kubelet` will reject all pods if any of the disk eviction thresholds have been met.
Let's assume the operator started the `kubelet` with the following:
```
--eviction-soft="nodefs.available<1500Mi"
--eviction-soft-grace-period="nodefs.available=30s"
```
If the `kubelet` sees that it has less than `1500Mi` of disk available
on the node, but the `kubelet` has not yet initiated eviction since the
grace period criteria has not yet been met, the `kubelet` will still immediately
fail any incoming pods.
The rationale for failing **all** pods instead of just best effort pods is that disk is currently
a best effort resource for all QoS classes.

`kubelet` will apply the same policy even if there is a dedicated `imagefs`.
## Scheduler

The node will report a condition when a compute resource is under pressure. The
scheduler should view that condition as a signal to dissuade placing additional
best effort pods on the node.

In this case, a `MemoryPressure` condition that is true should dissuade the scheduler
from placing new best effort pods on the node, since they will be rejected by the `kubelet` in admission.

On the other hand, a `DiskPressure` condition that is true should dissuade the scheduler from
placing **any** new pods on the node, since they will be rejected by the `kubelet` in admission.
## Best Practices
If/when the `kubelet` can identify pods created by a `DaemonSet`, it could pro-actively filter those pods from the
candidate set of pods provided to the eviction strategy.
In general, it should be strongly recommended that `DaemonSet` not
create `BestEffort` pods to avoid being identified as a candidate pod
for eviction. Instead `DaemonSet` should ideally include Guaranteed pods only.