# Controlled Rescheduling in Kubernetes
## Overview
Although the Kubernetes scheduler(s) try to make good placement decisions for pods,
conditions in the cluster change over time (e.g. jobs finish and new pods arrive, nodes
are removed due to failures or planned maintenance or auto-scaling down, nodes appear due
to recovery after a failure or re-joining after maintenance or auto-scaling up or adding
new hardware to a bare-metal cluster), and schedulers are not omniscient (e.g. there are
some interactions between pods, or between pods and nodes, that they cannot predict). As
a result, the initial node selected for a pod may turn out to be a bad match, from the
perspective of the pod and/or the cluster as a whole, at some point after the pod has
started running.
Today (Kubernetes version 1.2), once a pod is scheduled to a node, it never moves unless
it terminates on its own, is deleted by the user, or experiences some unplanned event
(e.g. the node where it is running dies). Thus in a cluster with long-running pods, the
assignment of pods to nodes degrades over time, no matter how good an initial scheduling
decision the scheduler makes. This observation motivates "controlled rescheduling," a
mechanism by which Kubernetes will "move" already-running pods over time to improve their
placement. Controlled rescheduling is the subject of this proposal.
Note that the term "move" is not technically accurate -- the mechanism used is that
Kubernetes will terminate a pod that is managed by a controller, and the controller will
create a replacement pod that is then scheduled by the pod's scheduler. The terminated
pod and replacement pod are completely separate pods, and no pod migration is
implied. However, describing the process as "moving" the pod is approximately accurate
and easier to understand, so we will use this terminology in the document.
We use the term "rescheduling" to describe any action the system takes to move an
already-running pod. The decision may be made and executed by any component; we will
introduce the concept of a "rescheduler" component later, but it is not the only
component that can do rescheduling.
This proposal primarily focuses on the architecture and features/mechanisms used to
achieve rescheduling, and only briefly discusses example policies. We expect that community
experimentation will lead to a significantly better understanding of the range, potential,
and limitations of rescheduling policies.
## Example use cases
Example use cases for rescheduling are
* moving a running pod onto a node that better satisfies its scheduling criteria
* moving a pod onto an under-utilized node
* moving a pod onto a node that meets more of the pod's affinity/anti-affinity preferences
* moving a running pod off of a node in anticipation of a known or speculated future event
* draining a node in preparation for maintenance, decommissioning, auto-scale-down, etc.
* "preempting" a running pod to make room for a pending pod to schedule
* proactively/speculatively making room for large and/or exclusive pods to facilitate
fast scheduling in the future (often called "defragmentation")
* (note that these last two cases are the only use cases where the first-order intent
is to move a pod specifically for the benefit of another pod)
* moving a running pod off of a node from which it is receiving poor service
* anomalous crashlooping or other mysterious incompatibility between the pod and the node
* repeated out-of-resource killing (see #18724)
* repeated attempts by the scheduler to schedule the pod onto some node, but it is
rejected by Kubelet admission control due to incomplete scheduler knowledge
* poor performance due to interference from other containers on the node (CPU hogs,
cache thrashers, etc.) (note that in this case there is a choice of moving the victim
or the aggressor)
## Some axes of the design space
Among the key design decisions are
* how does a pod specify its tolerance for these system-generated disruptions, and how
does the system enforce such disruption limits
* for each use case, where is the decision made about when and which pods to reschedule
(controllers, schedulers, an entirely new component e.g. "rescheduler", etc.)
* rescheduler design issues: how much does a rescheduler need to know about pods'
schedulers' policies, how does the rescheduler specify its rescheduling
requests/decisions (e.g. just as an eviction, an eviction with a hint about where to
reschedule, or as an eviction paired with a specific binding), how does the system
implement these requests, does the rescheduler take into account the second-order
effects of decisions (e.g. whether an evicted pod will reschedule, will cause
a preemption when it reschedules, etc.), does the rescheduler execute multi-step plans
(e.g. evict two pods at the same time with the intent of moving one into the space
vacated by the other, or even more complex plans)
Additional musings on the rescheduling design space can be found [here](rescheduler.md).
## Design proposal
The key mechanisms and components of the proposed design are priority, preemption,
disruption budgets, the `/evict` subresource, and the rescheduler.
### Priority
#### Motivation
Just as it is useful to overcommit nodes to increase node-level utilization, it is useful
to overcommit clusters to increase cluster-level utilization. Scheduling priority (which
we abbreviate as *priority*), in combination with disruption budgets (described in the
next section), allows Kubernetes to safely overcommit clusters much as QoS levels allow
it to safely overcommit nodes.
Today, cluster sharing among users, workload types, etc. is regulated via the
[quota](../admin/resourcequota/README.md) mechanism. When allocating quota, a cluster
administrator has two choices: (1) the sum of the quotas is less than or equal to the
capacity of the cluster, or (2) the sum of the quotas is greater than the capacity of the
cluster (that is, the cluster is overcommitted). (1) is likely to lead to cluster
under-utilization, while (2) is unsafe in the sense that someone's pods may go pending
indefinitely even though they are still within their quota. Priority makes cluster
overcommitment (i.e. case (2)) safe by allowing users and/or administrators to identify
which pods should be allowed to run, and which should go pending, when demand for cluster
resources exceeds supply due to cluster overcommitment.
Priority is also useful in some special-case scenarios, such as ensuring that system
DaemonSets can always schedule and reschedule onto every node where they want to run
(assuming they are given the highest priority), e.g. see #21767.
#### Specifying priorities
We propose to add a required `Priority` field to `PodSpec`. Its value type is string, and
the cluster administrator defines a total ordering on these strings (for example
`Critical`, `Normal`, `Preemptible`). We choose string instead of integer so that it is
easy for an administrator to add new priority levels in between existing levels, to
encourage thinking about priority in terms of user intent and avoid magic numbers, and to
make the internal implementation more flexible.
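As a rough illustration, here is a minimal Go sketch of the proposed field together with an administrator-defined ordering; the `priorityOrdering` slice, the `rank` helper, and the demo in `main` are hypothetical, not part of the proposal:

```go
package main

import "fmt"

// PodSpec is a hypothetical fragment of the real object; only the
// proposed field is shown.
type PodSpec struct {
	// Priority is the required, administrator-defined level name.
	Priority string
}

// priorityOrdering is the administrator's total ordering, most
// important first; the names are the examples from this proposal.
var priorityOrdering = []string{"Critical", "Normal", "Preemptible"}

// rank maps a priority name to a comparable integer; lower is more
// important. Unknown names would be rejected by API validation.
func rank(p string) int {
	for i, name := range priorityOrdering {
		if name == p {
			return i
		}
	}
	return len(priorityOrdering) // unreachable after validation
}

func main() {
	preemptor := PodSpec{Priority: "Critical"}
	victim := PodSpec{Priority: "Preemptible"}
	// As described below, the preemptor may evict the victim because
	// the victim is at the same or lower priority (subject to
	// disruption budgets).
	fmt.Println(rank(preemptor.Priority) <= rank(victim.Priority)) // true
}
```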
When a scheduler is scheduling a new pod P and cannot find any node that meets all of P's
scheduling predicates, it is allowed to evict ("preempt") one or more pods that are at
the same or lower priority than P (subject to disruption budgets, see next section) from
a node in order to make room for P, i.e. in order to make the scheduling predicates
satisfied for P on that node. (Note that when we add cluster-level resources (#19080),
it might be necessary to preempt from multiple nodes, but that scenario is outside the
scope of this document.) The preempted pod(s) may or may not be able to reschedule. The
net effect of this process is that when demand for cluster resources exceeds supply, the
higher-priority pods will be able to run while the lower-priority pods will be forced to
wait. The detailed mechanics of preemption are described in a later section.
In addition to taking disruption budget into account, for equal-priority preemptions the
scheduler will try to enforce fairness (across victim controllers, services, etc.).
Priorities could be specified directly by users in the podTemplate, or assigned by an
admission controller using
properties of the pod. Either way, all schedulers must be configured to understand the
same priorities (names and ordering). This could be done by making them constants in the
API, or using ConfigMap to configure the schedulers with the information. The advantage of
the former (at least making the names, if not the ordering, constants in the API) is that
it allows the API server to do validation (e.g. to catch mis-spelling).
In the future, which priorities are usable for a given namespace and pods with certain
attributes may be configurable, similar to ResourceQuota, LimitRange, or security policy.
Priority and resource QoS are independent.
The priority we have described here might be used to prioritize the scheduling queue
(i.e. the order in which a scheduler examines pods in its scheduling loop), but the two
priority concepts do not have to be connected. It is somewhat logical to tie them
together, since a higher priority generally indicates that a pod is more urgent to get
running. Also, scheduling low-priority pods before high-priority pods might lead to
avoidable preemptions if the high-priority pods end up preempting the low-priority pods
that were just scheduled.
TODO: Are priority and preemption global or namespace-relative? See
[this discussion thread](https://github.com/kubernetes/kubernetes/pull/22217#discussion_r55737389).
#### Relationship of priority to quota
Of course, if the decision of what priority to give a pod is solely up to the user, then
users have no incentive to ever request any priority less than the maximum. Thus
priority is intimately related to quota, in the sense that resource quotas must be
allocated on a per-priority-level basis (X amount of RAM at priority A, Y amount of RAM
at priority B, etc.). The "guarantee" that highest-priority pods will always be able to
schedule can only be achieved if the sum of the quotas at the top priority level is less
than or equal to the cluster capacity. This is analogous to QoS, where safety can only be
achieved if the sum of the limits of the top QoS level ("Guaranteed") is less than or
equal to the node capacity. In terms of incentives, an organization could "charge" for
resources at a rate proportional to their priority.
The topic of how to allocate quota at different priority levels to achieve a desired
balance between utilization and probability of schedulability is an extremely complex
topic that is outside the scope of this document. For example, resource fragmentation and
RequiredDuringScheduling node and pod affinity and anti-affinity means that even if the
sum of the quotas at the top priority level is less than or equal to the total aggregate
capacity of the cluster, some pods at the top priority level might still go pending. In
general, priority provides a *probabilistic* guarantee of pod schedulability in the face
of overcommitment, by allowing prioritization of which pods should be allowed to run
when demand for cluster resources exceeds supply.
### Disruption budget
While priority can protect pods from one source of disruption (preemption by a
lower-priority pod), *disruption budgets* limit disruptions from all Kubernetes-initiated
causes, including preemption by an equal or higher-priority pod, or being evicted to
achieve other rescheduling goals. In particular, each pod is optionally associated with a
"disruption budget," a new API resource that limits Kubernetes-initiated terminations
across a set of pods (e.g. the pods of a particular Service might all point to the same
disruption budget object), regardless of cause. Initially we expect disruption budget
(e.g. `DisruptionBudgetSpec`) to consist of
* a rate limit on disruptions (preemption and other evictions) across the corresponding
set of pods, e.g. no more than one disruption per hour across the pods of a particular Service
* a minimum number of pods that must be up simultaneously (sometimes called "shard
strength") (of course this can also be expressed as the inverse, i.e. the number of
pods of the collection that can be down simultaneously)
The second item merits a bit more explanation. One use case is to specify a quorum size,
e.g. to ensure that at least 3 replicas of a quorum-based service with 5 replicas are up
at the same time. In practice, a service should ideally create enough replicas to survive
at least one planned and one unplanned outage. So in our quorum example, we would specify
that at least 4 replicas must be up at the same time; this allows for one intentional
disruption (bringing the number of live replicas down from 5 to 4 and consuming one unit
of shard strength budget) and one unplanned disruption (bringing the number of live
replicas down from 4 to 3) while still maintaining a quorum. Shard strength is also
useful for simpler replicated services; for example, you might not want more than 10% of
your front-ends to be down at the same time, so as to avoid overloading the remaining
replicas.
Initially, disruption budgets will be specified by the user. Thus as with priority,
disruption budgets need to be tied into quota, to prevent users from saying none of their
pods can ever be disrupted. The exact way of expressing and enforcing this quota is TBD,
though a simple starting point would be to have an admission controller assign a default
disruption budget based on priority level (more liberal with decreasing priority).
We also likely need a quota that applies to Kubernetes *components*, to limit the rate
at which any one component is allowed to consume disruption budget.
Of course there should also be a `DisruptionBudgetStatus` that indicates the current
disruption rate that the collection of pods is experiencing, and the number of pods that
are up.
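To make the shape concrete, here is a minimal Go sketch of what the spec and status might look like; the type names `DisruptionBudgetSpec` and `DisruptionBudgetStatus` come from this proposal, but the individual fields and their representations (e.g. a per-hour rate) are assumptions:

```go
package budget

// Hypothetical shape of the proposed resource; the field names are
// illustrative, not settled.
type DisruptionBudgetSpec struct {
	// Selector identifies the set of pods the budget covers, e.g. the
	// pods of a particular Service.
	Selector map[string]string
	// MaxDisruptionsPerHour rate-limits Kubernetes-initiated
	// disruptions (preemptions and other evictions) across the set.
	MaxDisruptionsPerHour int
	// MinAvailable is the "shard strength": the minimum number of pods
	// in the set that must be up simultaneously. The 5-replica quorum
	// service above, wanting to tolerate one planned and one unplanned
	// outage, would set this to 4.
	MinAvailable int
}

type DisruptionBudgetStatus struct {
	// DisruptionsInLastHour is the disruption rate the set of pods is
	// currently experiencing.
	DisruptionsInLastHour int
	// CurrentAvailable is the number of pods in the set that are up.
	CurrentAvailable int
}
```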
For the purposes of disruption budget, a pod is considered to be disrupted as soon as its
graceful termination period starts.
A pod that is not covered by a disruption budget but is managed by a controller
gets an implicitly infinite disruption budget (though the system should try not to
unduly victimize such pods). How a pod that is not managed by a controller is
handled is TBD.
TBD: In addition to `PodSpec`, where do we store the pointer to the disruption budget
(the podTemplate of the controller that manages the pod?)? Do we auto-generate a disruption
budget (e.g. when instantiating a Service), or require the user to create it manually
before they create a controller? Which objects should return the disruption budget object
as part of the output on `kubectl get` other than (obviously) `kubectl get` for the
disruption budget itself?
TODO: Clean up distinction between "down due to voluntary action taken by Kubernetes"
and "down due to unplanned outage" in spec and status.
For now, there is nothing to prevent clients from circumventing the disruption budget
protections. Of course, clients that do this are not being "good citizens." In the next
section we describe a mechanism that at least makes it easy for well-behaved clients to
obey the disruption budgets.
See #12611 for additional discussion of disruption budgets.
### /evict subresource and PreferAvoidPods
Although we could put the responsibility for checking and updating disruption budgets
solely on the client, it is safer and more convenient if we implement that functionality
in the API server. Thus we will introduce a new `/evict` subresource on pod. It is similar to
today's "delete" on pod except
* It will be rejected if the deletion would violate disruption budget. (See how
Deployment handles failure of /rollback for ideas on how clients could handle failure
of `/evict`.) There are two possible ways to implement this:
* For the initial implementation, this will be accomplished by the API server just
looking at the `DisruptionBudgetStatus` and seeing if the disruption would violate the
`DisruptionBudgetSpec`. In this approach, we assume a disruption budget controller
keeps the `DisruptionBudgetStatus` up-to-date by observing all pod deletions and
creations in the cluster, so that an approved disruption is quickly reflected in the
`DisruptionBudgetStatus`. Of course this approach does allow a race in which one or
more additional disruptions could be approved before the first one is reflected in the
`DisruptionBudgetStatus`.
* Thus a subsequent implementation will have the API server explicitly debit the
`DisruptionBudgetStatus` when it accepts an `/evict`. (There still needs to be a
controller, to keep the shard strength status up-to-date when replacement pods are
created after an eviction; the controller may also be necessary for the rate status
depending on how rate is represented, e.g. adding tokens to a bucket at a fixed rate.)
Once etcd supports multi-object transactions (etcd v3), the debit and pod deletion will
be placed in the same transaction.
* Note: For the purposes of disruption budget, a pod is considered to be disrupted as soon as its
graceful termination period starts (so when we say "delete" here we do not mean
"deleted from etcd" but rather "graceful termination period has started").
* It will allow clients to communicate additional parameters when they wish to delete a
pod. (In the absence of the `/evict` subresource, we would have to create a pod-specific
type analogous to `api.DeleteOptions`.)
We will make `kubectl delete pod` use `/evict` by default, and require a command-line
flag to delete the pod directly.
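As an illustration of the initial, status-based implementation from the first bullet above, the API server's check might look roughly like the following sketch; it reuses the hypothetical types from the disruption budget sketch, and the function name and exact rate representation are assumptions:

```go
package budget

import "fmt"

// admitEviction sketches the initial, status-based check: the API
// server consults the DisruptionBudgetStatus maintained by the
// disruption budget controller, so it is subject to the race noted
// above (further evictions may be approved before the status catches
// up with the first one).
func admitEviction(spec DisruptionBudgetSpec, status DisruptionBudgetStatus) error {
	if status.DisruptionsInLastHour+1 > spec.MaxDisruptionsPerHour {
		return fmt.Errorf("eviction would exceed the disruption rate limit")
	}
	if status.CurrentAvailable-1 < spec.MinAvailable {
		return fmt.Errorf("eviction would violate shard strength")
	}
	return nil // proceed: start the pod's graceful termination
}
```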
We will add to `NodeStatus` a bounded-sized list of signatures of pods that should avoid
that node (provisionally called `PreferAvoidPods`). One of the pieces of information
specified in the `/evict` subresource is whether the eviction should add the evicted
pod's signature to the corresponding node's `PreferAvoidPods`. Initially the pod
signature will be a
[controllerRef](https://github.com/kubernetes/kubernetes/issues/14961#issuecomment-183431648),
i.e. a reference to the pod's controller. Controllers are responsible for garbage
collecting, after some period of time, `PreferAvoidPods` entries that point to them, but the API
server will also enforce a bounded size on the list. All schedulers will have a
highest-weighted priority function that gives a node the worst priority if the pod it is
scheduling appears in that node's `PreferAvoidPods` list. Thus appearing in
`PreferAvoidPods` is similar to
[RequiredDuringScheduling node anti-affinity](../../docs/user-guide/node-selection/README.md)
but it takes precedence over all other priority criteria and is not explicitly listed in
the `NodeAffinity` of the pod.
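Here is a sketch of how such a priority function might consult the list, assuming the scheduler's usual 0-10 score range with 10 best; all type and function names are illustrative stand-ins for the real API objects:

```go
package scheduler

// ControllerRef and NodeStatus are hypothetical stand-ins; only the
// proposed field is shown.
type ControllerRef struct{ Kind, Name, UID string }

type NodeStatus struct {
	// PreferAvoidPods is the proposed bounded-size list of signatures
	// (initially controllerRefs) of pods that should avoid this node.
	PreferAvoidPods []ControllerRef
}

// preferAvoidPodsPriority sketches the proposed highest-weighted
// priority function: the worst score (assumed 0) if the pod's
// controllerRef appears on the node's avoid list, the best (assumed
// 10) otherwise.
func preferAvoidPodsPriority(podRef ControllerRef, node NodeStatus) int {
	for _, avoid := range node.PreferAvoidPods {
		if avoid.UID == podRef.UID {
			return 0
		}
	}
	return 10
}
```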
`PreferAvoidPods` is useful for the "moving a running pod off of a node from which it is
receiving poor service" use case, as it reduces the chance that the replacement pod will
end up on the same node (keep in mind that most of those cases are situations that the
scheduler does not have explicit priority functions for, for example it cannot know in
advance that a pod will be starved). Also, though we do not intend to implement any such
policies in the first version of the rescheduler, it is useful whenever the rescheduler evicts
two pods A and B with the intention of moving A into the space vacated by B (it prevents
B from rescheduling back into the space it vacated before A's scheduler has a chance to
reschedule A there). Note that these two uses are subtly different; in the first
case we want the avoidance to last a relatively long time, whereas in the second case we
may only need it to last until A schedules.
See #20699 for more discussion.
### Preemption mechanics
**NOTE: We expect a fuller design doc to be written on preemption before it is implemented.
However, a sketch of some ideas is presented here, since preemption is closely related to the
concepts discussed in this doc.**
Pod schedulers will decide and enact preemptions, subject to the priority and disruption
budget rules described earlier. (Though note that we currently do not have any mechanism
to prevent schedulers from bypassing either the priority or disruption budget rules.)
The scheduler does not concern itself with whether the evicted pod(s) can reschedule. The
eviction(s) use(s) the `/evict` subresource so that it is subject to the disruption
budget(s) of the victim(s), but it does not request to add the victim pod(s) to the
nodes' `PreferAvoidPods`.
Evicting the victim(s) and binding the pending pod that the evictions are intended to
enable to schedule are not transactional. We expect the scheduler to issue the operations
in sequence, but it is still possible that another scheduler could schedule its pod in
between the eviction(s) and the binding, or that the set of pods running on the node in
question could change between the time the scheduler made its decision and the time it
sent the operations to the API server, thereby making the eviction(s) insufficient to get
the pending pod to schedule. In general there are a number of race conditions that cannot be
avoided without (1) making the evictions and binding be part of a single transaction, and
(2) making the binding preconditioned on a version number that is associated with the
node and is incremented on every binding. We may or may not implement those mechanisms in
the future.
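As an illustration of mechanism (2), the precondition check might look like the following sketch; the `Binding` shape and the version bookkeeping are assumptions, not a settled API:

```go
package scheduler

import "fmt"

// Binding sketches mechanism (2): the scheduler records the node
// "binding version" it observed when it made its decision, and the
// API server rejects the binding if the node has been bound to since.
type Binding struct {
	PodName             string
	NodeName            string
	ObservedNodeVersion int64 // precondition
}

func applyBinding(b Binding, currentNodeVersion *int64) error {
	if b.ObservedNodeVersion != *currentNodeVersion {
		return fmt.Errorf("node %s changed since the scheduling decision", b.NodeName)
	}
	*currentNodeVersion++ // incremented on every successful binding
	return nil
}
```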
Given a choice between a node where scheduling a pod requires preemption and one where it
does not, all other things being equal, a scheduler should choose the one where
preemption is not required. (TBD: Also, if the selected node does require preemption, the
scheduler should preempt lower-priority pods before higher-priority pods (e.g. if the
scheduler needs to free up 4 GB of RAM, and the node has two 2 GB low-priority pods and
one 4 GB high-priority pod, all of which have sufficient disruption budget, it should
preempt the two low-priority pods). This is debatable, since all of the candidates have
sufficient disruption budget, but it seems better to err on the side of giving a better
disruption SLO to higher-priority pods when possible.)
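To illustrate, here is a hedged sketch of this victim-selection heuristic for a single resource; it reuses the hypothetical `rank` helper from the priority sketch earlier, and the `Pod` simplification and all names are assumptions:

```go
package scheduler

import "sort"

// Pod is a hypothetical simplification carrying only what the
// heuristic needs.
type Pod struct {
	Priority         string
	MemGB            int
	HasBudgetToSpare bool
}

// pickVictims considers only pods at or below the preemptor's
// priority that have disruption budget to spare, takes lower-priority
// victims first, and stops once enough memory is freed.
func pickVictims(candidates []Pod, preemptorRank, neededMemGB int) []Pod {
	var eligible []Pod
	for _, p := range candidates {
		if rank(p.Priority) >= preemptorRank && p.HasBudgetToSpare {
			eligible = append(eligible, p)
		}
	}
	// Least important first (a higher rank value is less important).
	sort.Slice(eligible, func(i, j int) bool {
		return rank(eligible[i].Priority) > rank(eligible[j].Priority)
	})
	var victims []Pod
	freed := 0
	for _, p := range eligible {
		if freed >= neededMemGB {
			break
		}
		victims = append(victims, p)
		freed += p.MemGB
	}
	if freed < neededMemGB {
		return nil // preemption on this node cannot make room
	}
	return victims
}
```

On the example above (4 GB needed from a node with two 2 GB low-priority pods and one 4 GB high-priority pod, all with sufficient budget), this picks the two low-priority pods.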
Preemption victims must be given their termination grace period. One possible sequence
of events is
1. The API server binds the preemptor to the node (i.e. sets `nodeName` on the
preempting pod) and sets `deletionTimestamp` on the victims
2. Kubelet sees that `deletionTimestamp` has been set on the victims; they enter their
graceful termination period
3. Kubelet sees the preempting pod. It runs the admission checks on the new pod
assuming all pods that are in their graceful termination period are gone and that
all pods that are in the waiting state (see (4)) are running.
4. If (3) fails, then the new pod is rejected. If (3) passes, then Kubelet holds the
new pod in a waiting state, and does not run it until the pod passes the
admission checks using the set of actually running pods.
Note that there are a lot of details to be figured out here; the above is just a very
hand-wavy sketch of one general approach that might work.
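Under the same hand-wavy assumptions, step 3 for a single resource might look like the following sketch; every name here is hypothetical:

```go
package kubelet

// PodOnNode is a hypothetical view of a pod as Kubelet tracks it for
// this check.
type PodOnNode struct {
	MemGB       int
	Terminating bool // graceful termination period has started
	Waiting     bool // admitted but held, not yet running (step 4)
}

// admitPreemptor runs the admission check assuming pods in their
// graceful termination period are already gone and pods held in the
// waiting state are already running.
func admitPreemptor(newPodMemGB int, nodePods []PodOnNode, allocatableMemGB int) bool {
	used := 0
	for _, p := range nodePods {
		if p.Terminating {
			continue // treated as gone
		}
		used += p.MemGB // running and waiting pods both count
	}
	return used+newPodMemGB <= allocatableMemGB
}
```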
See #22212 for additional discussion.
### Node drain
Node drain will be handled by one or more components not described in this document. They
will respect disruption budgets. Initially, we will just make `kubectl drain`
respect disruption budgets. See #17393 for other discussion.
### Rescheduler
All rescheduling other than preemption and node drain will be decided and enacted by a
new component called the *rescheduler*. It runs continuously in the background, looking
for opportunities to move pods to better locations. It acts when the degree of
improvement meets some threshold and is allowed by the pod's disruption budget. The
action is eviction of a pod using the `/evict` subresource, with the pod's signature
enqueued in the node's `PreferAvoidPods`. It does not force the pod to reschedule to any
particular node. Thus it is really an "unscheduler"; only in combination with the evicted
pod's scheduler, which schedules the replacement pod, do we get true "rescheduling." See
the "Example use cases" section earlier for some example use cases.
The rescheduler is a best-effort service that makes no guarantees about how quickly (or
whether) it will resolve a suboptimal pod placement.
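As a sketch of this loop, with the cluster interactions abstracted behind a hypothetical interface (none of the names below are settled APIs):

```go
package rescheduler

import "time"

// Cluster is a hypothetical abstraction over the API server.
type Cluster interface {
	CandidatePods() []string
	CurrentPlacementScore(pod string) float64
	BestPlacementScore(pod string) float64
	// Evict uses the /evict subresource; it fails if the eviction
	// would violate the pod's disruption budget.
	Evict(pod string, addToPreferAvoidPods bool) error
}

// run acts only when the available improvement clears a threshold,
// and leaves the choice of replacement node entirely to the evicted
// pod's scheduler.
func run(c Cluster, threshold float64) {
	for {
		for _, pod := range c.CandidatePods() {
			improvement := c.BestPlacementScore(pod) - c.CurrentPlacementScore(pod)
			if improvement < threshold {
				continue // a marginal gain is not worth the disruption
			}
			_ = c.Evict(pod, true) // enqueue signature in PreferAvoidPods
		}
		time.Sleep(time.Minute) // best-effort, continuous background scan
	}
}
```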
The first version of the rescheduler will not take into consideration where or whether an
evicted pod will reschedule. The evicted pod may go pending, consuming one unit of the
corresponding shard-strength disruption budget indefinitely. By using the `/evict`
subresource, the rescheduler ensures that the evicted pod has sufficient budget to go
(and stay) pending. We expect future versions of the rescheduler may be
linked with the "mandatory" predicate functions (currently, the ones that constitute the
Kubelet admission criteria), and will only evict if the rescheduler determines that the
pod can reschedule somewhere according to those criteria. (Note that this still does not
guarantee that the pod actually will be able to reschedule, for at least two reasons: (1)
the state of the cluster may change between the time the rescheduler evaluates it and
when the evicted pod's scheduler tries to schedule the replacement pod, and (2) the
evicted pod's scheduler may have additional predicate functions in addition to the
mandatory ones).
(Note: see [this comment](https://github.com/kubernetes/kubernetes/pull/22217#discussion_r54527968)).
The first version of the rescheduler will only implement two objectives: moving a pod
onto an under-utilized node, and moving a pod onto a node that meets more of the pod's
affinity/anti-affinity preferences than wherever it is currently running. (We assume that
nodes that are intentionally under-utilized, e.g. because they are being drained, are
marked unschedulable, thus the first objective will not cause the rescheduler to "fight"
a system that is draining nodes.) We assume that all schedulers sufficiently weight the
priority functions for affinity/anti-affinity and avoiding very packed nodes,
otherwise evicted pods may not actually move onto a node that is better according to
the criteria that caused them to be evicted. (But note that in all cases an evicted pod will move to a
node that is better according to the totality of its scheduler's priority functions,
except in the case where the node where it was already running was the only node
where it can run.) As a general rule, the rescheduler should only act when it sees
particularly bad situations, since (1) an eviction for a marginal improvement is likely
not worth the disruption--just because there is sufficient budget for an eviction doesn't
mean an eviction is painless to the application, and (2) rescheduling the pod might not
actually mitigate the identified problem if it is minor enough that other scheduling
factors dominate the decision of where the replacement pod is scheduled.
We assume schedulers' priority functions are at least vaguely aligned with the
rescheduler's policies; otherwise the rescheduler will never accomplish anything useful,
given that it relies on the schedulers to actually reschedule the evicted pods. (Even if
the rescheduler acted as a scheduler, explicitly rebinding evicted pods, we'd still want
this to be true, to prevent the schedulers and rescheduler from "fighting" one another.)
The rescheduler will be configured using ConfigMap; the cluster administrator can enable
or disable policies and can tune the rescheduler's aggressiveness (aggressive means it
will use a relatively low threshold for triggering an eviction and may consume a lot of
disruption budget, while non-aggressive means it will use a relatively high threshold for
triggering an eviction and will try to leave plenty of buffer in disruption budgets). The
first version of the rescheduler will not be extensible or pluggable, since we want to
keep the code simple while we gain experience with the overall concept. In the future, we
anticipate a version that will be extensible and pluggable.
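For illustration, the ConfigMap-backed configuration might deserialize into something like the following sketch; the policy names match the two objectives of the first version, but every field name is an assumption:

```go
package rescheduler

// ReschedulerConfig sketches the administrator-tunable knobs
// described above.
type ReschedulerConfig struct {
	// Enable or disable individual policies.
	MoveToUnderutilizedNode bool
	ImproveAffinity         bool
	// Aggressiveness in [0, 1]: higher means a lower eviction
	// threshold and more willingness to consume disruption budget.
	Aggressiveness float64
}
```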
We might want some way to force the evicted pod to the front of the scheduler queue,
independently of its priority.
See #12140 for additional discussion.
### Final comments
In general, the design space for this topic is huge. This document describes some of the
design considerations and proposes one particular initial implementation. We expect
certain aspects of the design to be "permanent" (e.g. the notion and use of priorities,
preemption, disruption budgets, and the `/evict` subresource) while others may change over time
(e.g. the partitioning of functionality between schedulers, controllers, rescheduler,
horizontal pod autoscaler, and cluster autoscaler; the policies the rescheduler implements;
the factors the rescheduler takes into account when making decisions (e.g. knowledge of
schedulers' predicate and priority functions, second-order effects like whether and where
an evicted pod will be able to reschedule, etc.); the way the rescheduler enacts its
decisions; and the complexity of the plans the rescheduler attempts to implement).
## Implementation plan
The highest-priority feature to implement is the rescheduler with the two use cases
highlighted earlier: moving a pod onto an under-utilized node, and moving a pod onto a
node that meets more of the pod's affinity/anti-affinity preferences. The former is
useful to rebalance pods after cluster auto-scale-up, and the latter is useful for
Ubernetes. This requires implementing disruption budgets and the `/evict` subresource,
but not priority or preemption.
Because the general topic of rescheduling is very speculative, we have intentionally
proposed that the first version of the rescheduler be very simple -- it only uses eviction
(no attempt to guide the replacement pod to any particular node), doesn't know schedulers'
predicate or priority functions, doesn't try to move two pods at the same time, and only
implements two use cases. As alluded to in the previous subsection, we expect the design
and implementation to evolve over time, and we encourage members of the community to
experiment with more sophisticated policies and to report their results from using them
on real workloads.
## Alternative implementations
TODO.
## Additional references
TODO.
TODO: Add reference to this doc from docs/proposals/rescheduler.md