Merge pull request #25237 from vishh/disk-based-eviction-proposal

Automatic merge from submit-queue

Proposal for disk based evictions

cc @dchen1107 @derekwaynecarr

commit 12f7b81481

@@ -29,9 +29,9 @@ Documentation for other releases can be found at

# Kubelet - Eviction Policy

**Authors**: Derek Carr (@derekwaynecarr), Vishnu Kannan (@vishh)

**Status**: Proposed (memory evictions WIP)

This document presents a specification for how the `kubelet` evicts pods when compute resources are too low.

@@ -58,8 +58,8 @@ moved and scheduled elsewhere when/if its backing controller creates a new pod.

This proposal defines a pod eviction policy for reclaiming compute resources.
As of now, memory and disk based evictions are supported.
The proposal focuses on a simple default eviction strategy
intended to cover the broadest class of user workloads.

## Eviction Signals

@@ -69,6 +69,16 @@ The `kubelet` will support the ability to trigger eviction decisions on the following

| Eviction Signal   | Description                                                                      |
|-------------------|----------------------------------------------------------------------------------|
| memory.available  | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet |
| nodefs.available  | nodefs.available := node.stats.fs.available                                      |
| imagefs.available | imagefs.available := node.stats.runtime.imagefs.available                        |
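
As a worked illustration of the first signal (the numbers are invented), a node with 10Gi of memory capacity whose working set is currently 9Gi would report:

```
memory.available := node.status.capacity[memory] - node.stats.memory.workingSet
                  = 10Gi - 9Gi
                  = 1Gi
```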

`kubelet` supports only two filesystem partitions.

1. The `nodefs` filesystem that kubelet uses for volumes, daemon logs, etc.
1. The `imagefs` filesystem that container runtimes use for storing images and container writable layers.

`imagefs` is optional. `kubelet` auto-discovers these filesystems using cAdvisor.
`kubelet` does not care about any other filesystems. Any other types of configurations are not currently supported by the kubelet. For example, it is *not OK* to store volumes and logs in a dedicated `imagefs`.
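
For instance, on a typical Docker-based node (the paths below are illustrative defaults, not something this proposal mandates), the mapping would be:

```
nodefs:  the filesystem backing /var/lib/kubelet  (volumes, daemon logs, ...)
imagefs: the filesystem backing /var/lib/docker   (images, container writable layers)
```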

## Eviction Thresholds

@@ -151,6 +161,7 @@ The following node conditions are defined that correspond to the specified eviction

| Node Condition | Eviction Signal                         | Description                                                                                                        |
|----------------|-----------------------------------------|--------------------------------------------------------------------------------------------------------------------|
| MemoryPressure | memory.available                        | Available memory on the node has satisfied an eviction threshold                                                  |
| DiskPressure   | nodefs.available (or) imagefs.available | Available disk space on either the node's root filesystem or image filesystem has satisfied an eviction threshold |

The `kubelet` will continue to report node status updates at the frequency specified by
`--node-status-update-frequency` which defaults to `10s`.

@@ -174,7 +185,9 @@ The `kubelet` would ensure that it has not observed an eviction threshold being met
for the specified pressure condition for the period specified before toggling the
condition back to `false`.
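
For example, an operator who wants pressure conditions to clear less eagerly might pair the reporting frequency with the transition period flag referenced later in this proposal (the values here are illustrative only):

```
--node-status-update-frequency=10s
--eviction-pressure-transition-period=5m
```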

## Eviction scenarios

### Memory

Let's assume the operator started the `kubelet` with the following:
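
The concrete memory flag values for this example sit inside the elided hunk below; purely for illustration, a configuration in the same spirit as the disk example that follows might look like:

```
--eviction-hard="memory.available<100Mi"
--eviction-soft="memory.available<300Mi"
--eviction-soft-grace-period="memory.available=30s"
```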

@@ -194,6 +207,31 @@ signal. If that signal is observed as being satisfied for longer than the
specified period, the `kubelet` will initiate eviction to attempt to
reclaim the resource that has met its eviction threshold.

### Disk

Let's assume the operator started the `kubelet` with the following:

```
--eviction-hard="nodefs.available<1Gi,imagefs.available<10Gi"
--eviction-soft="nodefs.available<1.5Gi,imagefs.available<20Gi"
--eviction-soft-grace-period="nodefs.available=1m,imagefs.available=2m"
```

The `kubelet` will run a sync loop that looks at the available disk
on the node's supported partitions as reported from `cAdvisor`.
If available disk space on the node's primary filesystem is observed to drop below 1Gi,
the `kubelet` will immediately initiate eviction.
If available disk space on the node's image filesystem is observed to drop below 10Gi,
the `kubelet` will immediately initiate eviction.

If available disk space on the node's primary filesystem is observed as falling below `1.5Gi`,
or if available disk space on the node's image filesystem is observed as falling below `20Gi`,
it will record when that signal was observed internally in a cache. If at the next
sync, that criterion is no longer satisfied, the cache is cleared for that
signal. If that signal is observed as being satisfied for longer than the
specified period, the `kubelet` will initiate eviction to attempt to
reclaim the resource that has met its eviction threshold.
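
To make the soft-eviction grace period concrete, here is an illustrative timeline (the times are invented) for the `nodefs.available<1.5Gi` soft threshold with its `1m` grace period:

```
t=0s    nodefs.available drops below 1.5Gi  -> observation recorded in the cache
t=10s   next sync, still below 1.5Gi        -> observation retained
t=40s   next sync, back above 1.5Gi         -> cache cleared, no eviction
t=70s   below 1.5Gi again                   -> new observation recorded
t=135s  still below 1.5Gi, >1m has elapsed  -> kubelet initiates soft eviction
```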

## Eviction of Pods

If an eviction threshold has been met, the `kubelet` will initiate the

@@ -241,11 +279,111 @@ only has guaranteed pod(s) remaining, then the node must choose to evict a
guaranteed pod in order to preserve node stability, and to limit the impact
of the unexpected consumption to other guaranteed pod(s).

## Disk based evictions

### With Imagefs

If the `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:

1. Delete logs
1. Evict Pods if required.

If the `imagefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:

1. Delete unused images
1. Evict Pods if required.

### Without Imagefs

If the `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:

1. Delete logs
1. Delete unused images
1. Evict Pods if required.

Let's explore the different options for freeing up disk space.

### Delete logs of dead pods/containers

As of today, logs are tied to a container's lifetime. `kubelet` keeps dead containers around
to provide access to logs.
In the future, if we store logs of dead containers outside of the container itself, then
`kubelet` can delete these logs to free up disk space.
Once the lifetimes of containers and logs are split, kubelet can support more user-friendly policies
around log evictions. `kubelet` can delete logs of the oldest containers first.
Since logs from the first and the most recent incarnation of a container are the most important for most applications,
kubelet can try to preserve these logs and aggressively delete logs from other container incarnations.

Until logs are split from a container's lifetime, `kubelet` can delete dead containers to free up disk space.

### Delete unused images

`kubelet` performs image garbage collection based on thresholds today. It uses a high and a low watermark.
Whenever disk usage exceeds the high watermark, it removes images until the low watermark is reached.
`kubelet` employs an LRU policy when it comes to deleting images.

The existing policy will be replaced with a much simpler policy.
Images will be deleted based on eviction thresholds. If kubelet can delete logs and keep disk space availability
above eviction thresholds, then kubelet will not delete any images.
If `kubelet` decides to delete unused images, it will delete *all* unused images.

### Evict pods

There is no ability to specify disk limits for pods/containers today.
Disk is a best effort resource. When necessary, `kubelet` can evict pods one at a time.
`kubelet` will follow the [Eviction Strategy](#eviction-strategy) mentioned above for making eviction decisions.
`kubelet` will evict the pod that will free up the maximum amount of disk space on the filesystem that has hit eviction thresholds.
Within each QoS bucket, `kubelet` will sort pods according to their disk usage.
`kubelet` will sort pods in each bucket as follows:

#### Without Imagefs

If `nodefs` is triggering evictions, `kubelet` will sort pods based on their total disk usage
(local volumes + logs & writable layer of all its containers).

#### With Imagefs

If `nodefs` is triggering evictions, `kubelet` will sort pods based on the usage on `nodefs`
(local volumes + logs of all its containers).

If `imagefs` is triggering evictions, `kubelet` will sort pods based on the writable layer usage of all its containers.
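
As a sketch of how this ordering could play out (the pod names and usage figures are invented), consider three pods in the same QoS bucket on a node with a dedicated `imagefs` while `nodefs` is under pressure:

```
pod-a: local volumes 400Mi + container logs  50Mi = 450Mi on nodefs   <- evicted first
pod-b: local volumes 100Mi + container logs 200Mi = 300Mi on nodefs
pod-c: local volumes  20Mi + container logs  10Mi =  30Mi on nodefs
```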

## Minimum eviction thresholds

In certain scenarios, eviction of pods could result in reclamation of only a small amount of resources. This can result in
`kubelet` hitting eviction thresholds in repeated succession. In addition to that, eviction of resources like `disk`
is time consuming.

To mitigate these issues, `kubelet` will have a per-resource `minimum-threshold`. Whenever `kubelet` observes
resource pressure, `kubelet` will attempt to reclaim at least the `minimum-threshold` amount of the resource.

`minimum-thresholds` can be configured for each evictable resource via the following flag:

`--minimum-eviction-thresholds="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"`

The default `minimum-eviction-threshold` is `0` for all resources.
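
Reading this together with the disk example above (the numbers are illustrative, and this is one interpretation of the text rather than a specified algorithm):

```
--eviction-hard="nodefs.available<1Gi"
--minimum-eviction-thresholds="nodefs.available=500Mi"

eviction triggers once nodefs.available < 1Gi
kubelet then attempts to reclaim at least 500Mi on nodefs before it
considers the pressure handled, rather than stopping the moment
nodefs.available creeps back over 1Gi
```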

## Deprecation of existing features

`kubelet` has been freeing up disk space on demand to keep the node stable. As part of this proposal,
some of the existing features/flags around disk space reclamation will be deprecated in favor of this proposal.

| Existing Flag | New Flag | Rationale |
|---------------|----------|-----------|
| `--image-gc-high-threshold` | `--eviction-hard` or `--eviction-soft` | existing eviction signals can capture image garbage collection |
| `--image-gc-low-threshold` | `--minimum-eviction-thresholds` | eviction thresholds achieve the same behavior |
| `--maximum-dead-containers` | | deprecated once old logs are stored outside of container's context |
| `--maximum-dead-containers-per-container` | | deprecated once old logs are stored outside of container's context |
| `--minimum-container-ttl-duration` | | deprecated once old logs are stored outside of container's context |
| `--low-diskspace-threshold-mb` | `--eviction-hard` or `--eviction-soft` | this use case is better handled by this proposal |
| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` | make the flag generic to suit all compute resources |
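
As a rough sketch of what that migration could look like for an operator (the "before" values are assumed defaults rather than figures from this proposal, and the "after" values simply reuse numbers from the disk example above):

```
# before
--image-gc-high-threshold=90
--image-gc-low-threshold=80
--low-diskspace-threshold-mb=256

# after
--eviction-hard="nodefs.available<1Gi,imagefs.available<10Gi"
--minimum-eviction-thresholds="imagefs.available=2Gi"
```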

## Kubelet Admission Control

### Feasibility checks during kubelet admission

#### Memory

The `kubelet` will reject `BestEffort` pods if any of the memory
eviction thresholds have been exceeded independent of the configured
grace period.
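
For instance (the threshold values are illustrative, mirroring the hypothetical memory configuration sketched earlier):

```
--eviction-soft="memory.available<300Mi"
--eviction-soft-grace-period="memory.available=30s"

observed memory.available = 200Mi
=> new BestEffort pods are rejected immediately, even though the 30s
   soft-eviction grace period has not yet elapsed
```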

@@ -265,13 +403,38 @@ The reasoning for this decision is the expectation that the incoming pod is
likely to further starve the particular compute resource and the `kubelet` should
return to a steady state before accepting new workloads.

#### Disk

The `kubelet` will reject all pods if any of the disk eviction thresholds have been met.

Let's assume the operator started the `kubelet` with the following:

```
--eviction-soft="nodefs.available<1500Mi"
--eviction-soft-grace-period="nodefs.available=30s"
```

If the `kubelet` sees that it has less than `1500Mi` of disk available
on the node, but the `kubelet` has not yet initiated eviction since the
grace period criterion has not yet been met, the `kubelet` will still immediately
fail any incoming pods.

The rationale for failing **all** pods instead of just best effort pods is that disk is currently
a best effort resource for all QoS classes.

Kubelet will apply the same policy even if there is a dedicated `image` filesystem.

## Scheduler

The node will report a condition when a compute resource is under pressure. The
scheduler should view that condition as a signal to dissuade placing additional
best effort pods on the node.

In this case, the `MemoryPressure` condition, if true, should dissuade the scheduler
from placing new best effort pods on the node since they will be rejected by the `kubelet` in admission.

On the other hand, the `DiskPressure` condition, if true, should dissuade the scheduler from
placing **any** new pods on the node since they will be rejected by the `kubelet` in admission.

## Best Practices

@@ -288,7 +451,7 @@ candidate set of pods provided to the eviction strategy.

In general, it should be strongly recommended that `DaemonSet` not
create `BestEffort` pods to avoid being identified as a candidate pod
for eviction. Instead, `DaemonSet` should ideally include Guaranteed pods only.

<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubelet-eviction.md?pixel)]()