# Kubelet - Eviction Policy
**Authors**: Derek Carr (@derekwaynecarr), Vishnu Kannan (@vishh)

**Status**: Proposed (memory evictions WIP)

This document presents a specification for how the `kubelet` evicts pods when compute resources are too low.

This proposal defines a pod eviction policy for reclaiming compute resources.

As of now, memory and disk based evictions are supported.
The proposal focuses on a simple default eviction strategy
intended to cover the broadest class of user workloads.
## Eviction Signals
The `kubelet` will support the ability to trigger eviction decisions on the following eviction signals:

| Eviction Signal | Description |
|-------------------|----------------------------------------------------------------------------------|
| memory.available | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet |
| nodefs.available | nodefs.available := node.stats.fs.available |
| imagefs.available | imagefs.available := node.stats.runtime.imagefs.available |

`kubelet` supports only two filesystem partitions.

1. The `nodefs` filesystem that kubelet uses for volumes, daemon logs, etc.
1. The `imagefs` filesystem that container runtimes use for storing images and container writable layers.

`imagefs` is optional. `kubelet` auto-discovers these filesystems using cAdvisor.
`kubelet` does not care about any other filesystems. No other configuration is currently supported by the kubelet. For example, it is *not OK* to store volumes and logs in a dedicated `imagefs`.
## Eviction Thresholds
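Eviction thresholds are expressed per signal as a quantity, using the same flag syntax shown in the scenarios below. For illustration only (the quantities here are example values, not recommendations from this proposal), an operator might configure:

```
--eviction-hard="memory.available<100Mi,nodefs.available<1Gi,imagefs.available<10Gi"
--eviction-soft="memory.available<300Mi,nodefs.available<1.5Gi,imagefs.available<20Gi"
--eviction-soft-grace-period="memory.available=30s,nodefs.available=1m,imagefs.available=2m"
```

A hard threshold triggers eviction as soon as it is met, while a soft threshold must remain satisfied for its associated grace period before eviction is initiated, as described in the eviction scenarios below.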
## Node Conditions

The following node conditions are defined that correspond to the specified eviction signals.

| Node Condition | Eviction Signal | Description |
|----------------|------------------|------------------------------------------------------------------|
| MemoryPressure | memory.available | Available memory on the node has satisfied an eviction threshold |
| DiskPressure | nodefs.available (or) imagefs.available | Available disk space on either the node's root filesystem or image filesystem has satisfied an eviction threshold |

The `kubelet` will continue to report node status updates at the frequency specified by
`--node-status-update-frequency` which defaults to `10s`.
The `kubelet` would ensure that it has not observed an eviction threshold being met
for the specified pressure condition for the period specified via `--eviction-pressure-transition-period` before toggling the
condition back to `false`.

## Eviction scenarios

### Memory
Let's assume the operator started the `kubelet` with the following:
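(The quantities below are illustrative assumptions, mirroring the pattern of the disk scenario that follows.)

```
--eviction-hard="memory.available<100Mi"
--eviction-soft="memory.available<300Mi"
--eviction-soft-grace-period="memory.available=30s"
```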
The `kubelet` will run a sync loop that looks at the available memory
on the node as reported from `cAdvisor`.
If available memory is observed to fall below the configured hard threshold,
the `kubelet` will immediately initiate eviction.
If available memory is observed as falling below the configured soft threshold,
it will record when that signal was observed internally in a cache. If at the next
sync, that criterion was no longer satisfied, the cache is cleared for that
signal. If that signal is observed as being satisfied for longer than the
specified period, the `kubelet` will initiate eviction to attempt to
reclaim the resource that has met its eviction threshold.
### Disk
Let's assume the operator started the `kubelet` with the following:

```
--eviction-hard="nodefs.available<1Gi,imagefs.available<10Gi"
--eviction-soft="nodefs.available<1.5Gi,imagefs.available<20Gi"
--eviction-soft-grace-period="nodefs.available=1m,imagefs.available=2m"
```
The `kubelet` will run a sync loop that looks at the available disk
on the node's supported partitions as reported from `cAdvisor`.
If available disk space on the node's primary filesystem is observed to drop below 1Gi,
the `kubelet` will immediately initiate eviction.
If available disk space on the node's image filesystem is observed to drop below 10Gi,
the `kubelet` will immediately initiate eviction.

If available disk space on the node's primary filesystem is observed as falling below `1.5Gi`,
or if available disk space on the node's image filesystem is observed as falling below `20Gi`,
it will record when that signal was observed internally in a cache. If at the next
sync, that criterion was no longer satisfied, the cache is cleared for that
signal. If that signal is observed as being satisfied for longer than the
specified period, the `kubelet` will initiate eviction to attempt to
reclaim the resource that has met its eviction threshold.
## Eviction of Pods
If an eviction threshold has been met, the `kubelet` will initiate the
process of evicting pods until it observes that the eviction signal has gone below its threshold.

If the node is under resource pressure and
only has guaranteed pod(s) remaining, then the node must choose to evict a
guaranteed pod in order to preserve node stability, and to limit the impact
of the unexpected consumption to other guaranteed pod(s).
## Disk based evictions
### With Imagefs

If the `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:

1. Delete logs
1. Evict Pods if required.

If the `imagefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:

1. Delete unused images
1. Evict Pods if required.

### Without Imagefs

If the `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:

1. Delete logs
1. Delete unused images
1. Evict Pods if required.

Let's explore the different options for freeing up disk space.
### Delete logs of dead pods/containers
As of today, logs are tied to a container's lifetime. `kubelet` keeps dead containers around
to provide access to logs.
In the future, if we store logs of dead containers outside of the container itself, then
`kubelet` can delete these logs to free up disk space.
Once the lifetimes of containers and logs are split, kubelet can support more user-friendly policies
around log evictions. `kubelet` can delete logs of the oldest containers first.
Since logs from the first and the most recent incarnations of a container are the most important for most applications,
kubelet can try to preserve these logs and aggressively delete logs from other container incarnations.

Until logs are split from a container's lifetime, `kubelet` can delete dead containers to free up disk space.
### Delete unused images
`kubelet` performs image garbage collection based on thresholds today. It uses a high and a low watermark.
Whenever disk usage exceeds the high watermark, it removes images until the low watermark is reached.
`kubelet` employs an LRU policy when it comes to deleting images.

The existing policy will be replaced with a much simpler policy.
Images will be deleted based on eviction thresholds. If kubelet can delete logs and keep disk space availability
above eviction thresholds, then kubelet will not delete any images.
If `kubelet` decides to delete unused images, it will delete *all* unused images.
### Evict pods
There is no ability to specify disk limits for pods/containers today.
Disk is a best effort resource. When necessary, `kubelet` can evict pods one at a time.
`kubelet` will follow the [Eviction Strategy](#eviction-strategy) mentioned above for making eviction decisions.
`kubelet` will evict the pod that will free up the maximum amount of disk space on the filesystem that has hit eviction thresholds.
Within each QoS bucket, `kubelet` will sort pods according to their disk usage.
`kubelet` will sort pods in each bucket as follows:

#### Without Imagefs

If `nodefs` is triggering evictions, `kubelet` will sort pods based on their total disk usage
(local volumes + logs & writable layer of all their containers).

#### With Imagefs

If `nodefs` is triggering evictions, `kubelet` will sort pods based on their `nodefs` usage
(local volumes + logs of all their containers).

If `imagefs` is triggering evictions, `kubelet` will sort pods based on the writable layer usage of all their containers.
## Minimum eviction thresholds
In certain scenarios, eviction of pods could result in the reclamation of only a small amount of resources. This can result in
`kubelet` hitting eviction thresholds in repeated succession. In addition, reclaiming resources like `disk`
is time consuming.

To mitigate these issues, `kubelet` will have a per-resource `minimum-threshold`. Whenever `kubelet` observes
resource pressure, `kubelet` will attempt to reclaim at least the `minimum-threshold` amount of the resource.

The `minimum-threshold` for each evictable resource can be configured via the following flag:

`--minimum-eviction-thresholds="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"`

The default `minimum-eviction-threshold` is `0` for all resources.
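As a sketch of how the two settings interact (the quantities below are illustrative, reusing the flag names defined in this proposal), an operator might combine a hard threshold with a minimum reclaim amount:

```
--eviction-hard="nodefs.available<1Gi"
--minimum-eviction-thresholds="nodefs.available=500Mi"
```

With this configuration, once `nodefs.available` drops below `1Gi`, the `kubelet` attempts to reclaim at least `500Mi` of disk rather than stopping as soon as the signal rises back above the threshold.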
## Deprecation of existing features

`kubelet` has been freeing up disk space on demand to keep the node stable. As part of this proposal,
some of the existing features/flags around reclaiming disk space will be deprecated in favor of this proposal.

| Existing Flag | New Flag | Rationale |
|---------------|----------|-----------|
| `--image-gc-high-threshold` | `--eviction-hard` or `--eviction-soft` | existing eviction signals can capture image garbage collection |
| `--image-gc-low-threshold` | `--minimum-eviction-thresholds` | eviction thresholds achieve the same behavior |
| `--maximum-dead-containers` | | deprecated once old logs are stored outside of container's context |
| `--maximum-dead-containers-per-container` | | deprecated once old logs are stored outside of container's context |
| `--minimum-container-ttl-duration` | | deprecated once old logs are stored outside of container's context |
| `--low-diskspace-threshold-mb` | `--eviction-hard` or `--eviction-soft` | this use case is better handled by this proposal |
| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` | make the flag generic to suit all compute resources |
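For example, image garbage collection previously tuned with `--image-gc-high-threshold=90` and `--image-gc-low-threshold=80` (percentages of disk usage) could be approximated under this proposal with absolute quantities; the values below are illustrative assumptions for a node with a 100Gi image filesystem:

```
--eviction-hard="imagefs.available<10Gi"
--minimum-eviction-thresholds="imagefs.available=10Gi"
```

Crossing the hard threshold plays the role of the old high watermark, while the minimum threshold forces enough images to be deleted to land near the old low watermark.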
## Kubelet Admission Control

### Feasibility checks during kubelet admission

#### Memory

The `kubelet` will reject `BestEffort` pods if any of the memory
eviction thresholds have been exceeded independent of the configured
grace period.
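For illustration (the values below are assumptions that mirror the disk example that follows), assume the operator configured:

```
--eviction-soft="memory.available<1500Mi"
--eviction-soft-grace-period="memory.available=30s"
```

If available memory falls below `1500Mi`, incoming `BestEffort` pods are rejected immediately, even though the grace period means no pod has been evicted yet.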
The reasoning for this decision is the expectation that the incoming pod is
likely to further starve the particular compute resource and the `kubelet` should
return to a steady state before accepting new workloads.
#### Disk

The `kubelet` will reject all pods if any of the disk eviction thresholds have been met.

Let's assume the operator started the `kubelet` with the following:

```
--eviction-soft="nodefs.available<1500Mi"
--eviction-soft-grace-period="nodefs.available=30s"
```

If the `kubelet` sees that it has less than `1500Mi` of disk available
on the node, but the `kubelet` has not yet initiated eviction since the
grace period criteria have not yet been met, the `kubelet` will still immediately
fail any incoming pods.

The rationale for failing **all** pods instead of just best effort pods is that disk is currently
a best effort resource for all QoS classes.

`kubelet` will apply the same policy even if there is a dedicated `image` filesystem.
## Scheduler
The node will report a condition when a compute resource is under pressure. The
scheduler should view that condition as a signal to dissuade placing additional
best effort pods on the node.

In this case, the `MemoryPressure` condition, if true, should dissuade the scheduler
from placing new best effort pods on the node since they will be rejected by the `kubelet` in admission.

On the other hand, the `DiskPressure` condition, if true, should dissuade the scheduler from
placing **any** new pods on the node since they will be rejected by the `kubelet` in admission.
## Best Practices

### DaemonSet

It is never desired for the `kubelet` to evict a pod derived from a `DaemonSet`,
since the pod will immediately be recreated and rescheduled back to the same node.
If the `kubelet` could identify pods created by a `DaemonSet`, it could pro-actively filter them from the
candidate set of pods provided to the eviction strategy.

In general, it should be strongly recommended that `DaemonSet` not
create `BestEffort` pods to avoid being identified as a candidate pod
for eviction. Instead, `DaemonSet` should ideally include Guaranteed pods only.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubelet-eviction.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->