Merge pull request #25237 from vishh/disk-based-eviction-proposal

Automatic merge from submit-queue

Proposal for disk based evictions

cc @dchen1107 @derekwaynecarr

commit 12f7b81481

@@ -29,9 +29,9 @@ Documentation for other releases can be found at

# Kubelet - Eviction Policy

**Authors**: Derek Carr (@derekwaynecarr), Vishnu Kannan (@vishh)

**Status**: Proposed (memory evictions WIP)

This document presents a specification for how the `kubelet` evicts pods when compute resources are too low.

@@ -58,8 +58,8 @@ moved and scheduled elsewhere when/if its backing controller creates a new pod.

This proposal defines a pod eviction policy for reclaiming compute resources.
As of now, memory and disk based evictions are supported.
The proposal focuses on a simple default eviction strategy
intended to cover the broadest class of user workloads.

## Eviction Signals

@@ -69,6 +69,16 @@ The `kubelet` will support the ability to trigger eviction decisions on the following

| Eviction Signal   | Description                                                                      |
|-------------------|----------------------------------------------------------------------------------|
| memory.available  | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet |
| nodefs.available  | nodefs.available := node.stats.fs.available                                      |
| imagefs.available | imagefs.available := node.stats.runtime.imagefs.available                        |
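
As a worked illustration of the first signal (the numbers are invented), a node with 10Gi of memory capacity whose working set is currently 9Gi would report:

```
memory.available := node.status.capacity[memory] - node.stats.memory.workingSet
                  = 10Gi - 9Gi
                  = 1Gi
```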

`kubelet` supports only two filesystem partitions.

1. The `nodefs` filesystem that kubelet uses for volumes, daemon logs, etc.
1. The `imagefs` filesystem that container runtimes use for storing images and container writable layers.

`imagefs` is optional. `kubelet` auto-discovers these filesystems using cAdvisor.
`kubelet` does not care about any other filesystems. Any other types of configurations are not currently supported by the kubelet. For example, it is *not OK* to store volumes and logs in a dedicated `imagefs`.
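
For instance, on a typical Docker-based node (the paths below are illustrative defaults, not something this proposal mandates), the mapping would be:

```
nodefs:  the filesystem backing /var/lib/kubelet  (volumes, daemon logs, ...)
imagefs: the filesystem backing /var/lib/docker   (images, container writable layers)
```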

## Eviction Thresholds

@@ -151,6 +161,7 @@ The following node conditions are defined that correspond to the specified eviction

| Node Condition | Eviction Signal                         | Description                                                                                                        |
|----------------|-----------------------------------------|--------------------------------------------------------------------------------------------------------------------|
| MemoryPressure | memory.available                        | Available memory on the node has satisfied an eviction threshold                                                  |
| DiskPressure   | nodefs.available (or) imagefs.available | Available disk space on either the node's root filesystem or image filesystem has satisfied an eviction threshold |

The `kubelet` will continue to report node status updates at the frequency specified by
`--node-status-update-frequency` which defaults to `10s`.

@@ -174,7 +185,9 @@ The `kubelet` would ensure that it has not observed an eviction threshold being met
for the specified pressure condition for the period specified before toggling the
condition back to `false`.
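
For example, an operator who wants pressure conditions to clear less eagerly might pair the reporting frequency with the transition period flag referenced later in this proposal (the values here are illustrative only):

```
--node-status-update-frequency=10s
--eviction-pressure-transition-period=5m
```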

## Eviction scenarios

### Memory

Let's assume the operator started the `kubelet` with the following:
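
The concrete memory flag values for this example sit inside the elided hunk below; purely for illustration, a configuration in the same spirit as the disk example that follows might look like:

```
--eviction-hard="memory.available<100Mi"
--eviction-soft="memory.available<300Mi"
--eviction-soft-grace-period="memory.available=30s"
```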

@@ -194,6 +207,31 @@ signal. If that signal is observed as being satisfied for longer than the
specified period, the `kubelet` will initiate eviction to attempt to
reclaim the resource that has met its eviction threshold.

### Disk

Let's assume the operator started the `kubelet` with the following:

```
--eviction-hard="nodefs.available<1Gi,imagefs.available<10Gi"
--eviction-soft="nodefs.available<1.5Gi,imagefs.available<20Gi"
--eviction-soft-grace-period="nodefs.available=1m,imagefs.available=2m"
```

The `kubelet` will run a sync loop that looks at the available disk
on the node's supported partitions as reported from `cAdvisor`.
If available disk space on the node's primary filesystem is observed to drop below 1Gi,
the `kubelet` will immediately initiate eviction.
If available disk space on the node's image filesystem is observed to drop below 10Gi,
the `kubelet` will immediately initiate eviction.

If available disk space on the node's primary filesystem is observed as falling below `1.5Gi`,
or if available disk space on the node's image filesystem is observed as falling below `20Gi`,
it will record when that signal was observed internally in a cache. If at the next
sync, that criterion is no longer satisfied, the cache is cleared for that
signal. If that signal is observed as being satisfied for longer than the
specified period, the `kubelet` will initiate eviction to attempt to
reclaim the resource that has met its eviction threshold.
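
To make the soft-eviction grace period concrete, here is an illustrative timeline (the times are invented) for the `nodefs.available<1.5Gi` soft threshold with its `1m` grace period:

```
t=0s    nodefs.available drops below 1.5Gi  -> observation recorded in the cache
t=10s   next sync, still below 1.5Gi        -> observation retained
t=40s   next sync, back above 1.5Gi         -> cache cleared, no eviction
t=70s   below 1.5Gi again                   -> new observation recorded
t=135s  still below 1.5Gi, >1m has elapsed  -> kubelet initiates soft eviction
```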

## Eviction of Pods

If an eviction threshold has been met, the `kubelet` will initiate the

@@ -241,11 +279,111 @@ only has guaranteed pod(s) remaining, then the node must choose to evict a
guaranteed pod in order to preserve node stability, and to limit the impact
of the unexpected consumption to other guaranteed pod(s).

## Disk based evictions

### With Imagefs

If the `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:

1. Delete logs
1. Evict Pods if required.

If the `imagefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:

1. Delete unused images
1. Evict Pods if required.

### Without Imagefs

If the `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:

1. Delete logs
1. Delete unused images
1. Evict Pods if required.

Let's explore the different options for freeing up disk space.

### Delete logs of dead pods/containers

As of today, logs are tied to a container's lifetime. `kubelet` keeps dead containers around
to provide access to logs.
In the future, if we store logs of dead containers outside of the container itself, then
`kubelet` can delete these logs to free up disk space.
Once the lifetimes of containers and logs are split, kubelet can support more user-friendly policies
around log evictions. `kubelet` can delete logs of the oldest containers first.
Since logs from the first and the most recent incarnation of a container are the most important for most applications,
kubelet can try to preserve these logs and aggressively delete logs from other container incarnations.

Until logs are split from a container's lifetime, `kubelet` can delete dead containers to free up disk space.

### Delete unused images

`kubelet` performs image garbage collection based on thresholds today. It uses a high and a low watermark.
Whenever disk usage exceeds the high watermark, it removes images until the low watermark is reached.
`kubelet` employs an LRU policy when it comes to deleting images.

The existing policy will be replaced with a much simpler policy.
Images will be deleted based on eviction thresholds. If kubelet can delete logs and keep disk space availability
above eviction thresholds, then kubelet will not delete any images.
If `kubelet` decides to delete unused images, it will delete *all* unused images.

### Evict pods

There is no ability to specify disk limits for pods/containers today.
Disk is a best effort resource. When necessary, `kubelet` can evict pods one at a time.
`kubelet` will follow the [Eviction Strategy](#eviction-strategy) mentioned above for making eviction decisions.
`kubelet` will evict the pod that will free up the maximum amount of disk space on the filesystem that has hit eviction thresholds.
Within each QoS bucket, `kubelet` will sort pods according to their disk usage.
`kubelet` will sort pods in each bucket as follows:

#### Without Imagefs

If `nodefs` is triggering evictions, `kubelet` will sort pods based on their total disk usage
(local volumes + logs & writable layer of all its containers).

#### With Imagefs

If `nodefs` is triggering evictions, `kubelet` will sort pods based on the usage on `nodefs`
(local volumes + logs of all its containers).

If `imagefs` is triggering evictions, `kubelet` will sort pods based on the writable layer usage of all its containers.
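
As a sketch of how this ordering could play out (the pod names and usage figures are invented), consider three pods in the same QoS bucket on a node with a dedicated `imagefs` while `nodefs` is under pressure:

```
pod-a: local volumes 400Mi + container logs  50Mi = 450Mi on nodefs   <- evicted first
pod-b: local volumes 100Mi + container logs 200Mi = 300Mi on nodefs
pod-c: local volumes  20Mi + container logs  10Mi =  30Mi on nodefs
```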

## Minimum eviction thresholds

In certain scenarios, eviction of pods could result in reclamation of only a small amount of resources. This can result in
`kubelet` hitting eviction thresholds in repeated succession. In addition to that, eviction of resources like `disk`
is time consuming.

To mitigate these issues, `kubelet` will have a per-resource `minimum-threshold`. Whenever `kubelet` observes
resource pressure, `kubelet` will attempt to reclaim at least the `minimum-threshold` amount of the resource.

`minimum-thresholds` can be configured for each evictable resource via the following flag:

`--minimum-eviction-thresholds="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"`

The default `minimum-eviction-threshold` is `0` for all resources.
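
Reading this together with the disk example above (the numbers are illustrative, and this is one interpretation of the text rather than a specified algorithm):

```
--eviction-hard="nodefs.available<1Gi"
--minimum-eviction-thresholds="nodefs.available=500Mi"

eviction triggers once nodefs.available < 1Gi
kubelet then attempts to reclaim at least 500Mi on nodefs before it
considers the pressure handled, rather than stopping the moment
nodefs.available creeps back over 1Gi
```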

## Deprecation of existing features

`kubelet` has been freeing up disk space on demand to keep the node stable. As part of this proposal,
some of the existing features/flags around disk space reclamation will be deprecated in favor of this proposal.

| Existing Flag | New Flag | Rationale |
|---------------|----------|-----------|
| `--image-gc-high-threshold` | `--eviction-hard` or `--eviction-soft` | existing eviction signals can capture image garbage collection |
| `--image-gc-low-threshold` | `--minimum-eviction-thresholds` | eviction thresholds achieve the same behavior |
| `--maximum-dead-containers` | | deprecated once old logs are stored outside of container's context |
| `--maximum-dead-containers-per-container` | | deprecated once old logs are stored outside of container's context |
| `--minimum-container-ttl-duration` | | deprecated once old logs are stored outside of container's context |
| `--low-diskspace-threshold-mb` | `--eviction-hard` or `--eviction-soft` | this use case is better handled by this proposal |
| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` | make the flag generic to suit all compute resources |
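
As a rough sketch of what that migration could look like for an operator (the "before" values are assumed defaults rather than figures from this proposal, and the "after" values simply reuse numbers from the disk example above):

```
# before
--image-gc-high-threshold=90
--image-gc-low-threshold=80
--low-diskspace-threshold-mb=256

# after
--eviction-hard="nodefs.available<1Gi,imagefs.available<10Gi"
--minimum-eviction-thresholds="imagefs.available=2Gi"
```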

## Kubelet Admission Control

### Feasibility checks during kubelet admission

#### Memory

The `kubelet` will reject `BestEffort` pods if any of the memory
eviction thresholds have been exceeded independent of the configured
grace period.
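
For instance (the threshold values are illustrative, mirroring the hypothetical memory configuration sketched earlier):

```
--eviction-soft="memory.available<300Mi"
--eviction-soft-grace-period="memory.available=30s"

observed memory.available = 200Mi
=> new BestEffort pods are rejected immediately, even though the 30s
   soft-eviction grace period has not yet elapsed
```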

@@ -265,13 +403,38 @@ The reasoning for this decision is the expectation that the incoming pod is
likely to further starve the particular compute resource and the `kubelet` should
return to a steady state before accepting new workloads.

#### Disk

The `kubelet` will reject all pods if any of the disk eviction thresholds have been met.

Let's assume the operator started the `kubelet` with the following:

```
--eviction-soft="nodefs.available<1500Mi"
--eviction-soft-grace-period="nodefs.available=30s"
```

If the `kubelet` sees that it has less than `1500Mi` of disk available
on the node, but the `kubelet` has not yet initiated eviction since the
grace period criterion has not yet been met, the `kubelet` will still immediately
fail any incoming pods.

The rationale for failing **all** pods instead of just best effort pods is that disk is currently
a best effort resource for all QoS classes.

Kubelet will apply the same policy even if there is a dedicated `image` filesystem.

## Scheduler

The node will report a condition when a compute resource is under pressure. The
scheduler should view that condition as a signal to dissuade placing additional
best effort pods on the node.

In this case, the `MemoryPressure` condition, if true, should dissuade the scheduler
from placing new best effort pods on the node since they will be rejected by the `kubelet` in admission.

On the other hand, the `DiskPressure` condition, if true, should dissuade the scheduler from
placing **any** new pods on the node since they will be rejected by the `kubelet` in admission.

## Best Practices

@@ -288,7 +451,7 @@ candidate set of pods provided to the eviction strategy.

In general, it should be strongly recommended that `DaemonSet` not
create `BestEffort` pods to avoid being identified as a candidate pod
for eviction. Instead, `DaemonSet` should ideally include Guaranteed pods only.

<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubelet-eviction.md?pixel)]()