k3s/docs/design/taint-toleration-dedicated.md

<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->


<!-- END MUNGE: UNVERSIONED_WARNING -->

# Taints, Tolerations, and Dedicated Nodes

## Introduction

This document describes *taints* and *tolerations*, which constitute a generic
mechanism for restricting the set of pods that can use a node. We also describe
one concrete use case for the mechanism, namely to limit the set of users (or
more generally, authorization domains) who can access a set of nodes (a feature
we call *dedicated nodes*). There are many other uses--for example, a set of
nodes with a particular piece of hardware could be reserved for pods that
require that hardware, or a node could be marked as unschedulable when it is
being drained before shutdown, or a node could trigger evictions when it
experiences hardware or software problems or abnormal node configurations; see
issues #17190 and #3885 for more discussion.

## Taints, tolerations, and dedicated nodes

A *taint* is a new type that is part of the `NodeSpec`; when present, it
prevents pods from scheduling onto the node unless the pod *tolerates* the taint
(tolerations are listed in the `PodSpec`). Note that there are actually multiple
flavors of taints: taints that prevent scheduling on a node, taints that cause
the scheduler to try to avoid scheduling on a node but do not prevent it, taints
that prevent a pod from starting on Kubelet even if the pod's `NodeName` was
written directly (i.e. pod did not go through the scheduler), and taints that
evict already-running pods.
[This comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375)
has more background on these different scenarios. We will focus on the first
kind of taint in this doc, since it is the kind required for the "dedicated
nodes" use case.

Implementing dedicated nodes using taints and tolerations is straightforward: in
essence, a node that is dedicated to group A gets taint `dedicated=A` and the
pods belonging to group A get toleration `dedicated=A`. (The exact syntax and
semantics of taints and tolerations are described later in this doc.) This keeps
all pods except those belonging to group A off of the nodes. This approach
easily generalizes to pods that are allowed to schedule into multiple dedicated
node groups, and nodes that are a member of multiple dedicated node groups.

Note that because tolerations are at the granularity of pods, the mechanism is
very flexible -- any policy can be used to determine which tolerations should be
placed on a pod. So the "group A" mentioned above could be all pods from a
particular namespace or set of namespaces, or all pods with some other arbitrary
characteristic in common. We expect that any real-world usage of taints and
tolerations will employ an admission controller to apply the tolerations. For
example, to give all pods from namespace A access to dedicated node group A, an
admission controller would add the corresponding toleration to all pods from
namespace A. Or to give all pods that require GPUs access to GPU nodes, an
admission controller would add the toleration for GPU taints to pods that
request the GPU resource.

Everything that can be expressed using taints and tolerations can be expressed
using [node affinity](https://github.com/kubernetes/kubernetes/pull/18261), e.g.
in the example in the previous paragraph, you could put a label `dedicated=A` on
the set of dedicated nodes and a node affinity `dedicated NotIn A` on all pods *not*
belonging to group A. But it is cumbersome to express exclusion policies using
node affinity because every time you add a new type of restricted node, all pods
that aren't allowed to use those nodes need to start avoiding those nodes using
node affinity. This means the node affinity list can get quite long in clusters
with lots of different groups of special nodes (lots of dedicated node groups,
lots of different kinds of special hardware, etc.). Moreover, you need to also
update any Pending pods when you add new types of special nodes. In contrast,
with taints and tolerations, when you add a new type of special node, "regular"
pods are unaffected, and you just need to add the necessary toleration to the
pods you subsequent create that need to use the new type of special nodes. To
put it another way, with taints and tolerations, only pods that use a set of
special nodes need to know about those special nodes; with the node affinity
approach, pods that have no interest in those special nodes need to know about
all of the groups of special nodes.

One final comment: in practice, it is often desirable to not only keep "regular"
pods off of special nodes, but also to keep "special" pods off of regular nodes.
An example in the dedicated nodes case is to not only keep regular users off of
dedicated nodes, but also to keep dedicated users off of non-dedicated (shared)
nodes. In this case, the "non-dedicated" nodes can be modeled as their own
dedicated node group (for example, tainted as `dedicated=shared`), and pods that
are not given access to any dedicated nodes ("regular" pods) would be given a
toleration for `dedicated=shared`. (As mentioned earlier, we expect tolerations
will be added by an admission controller.) In this case taints/tolerations are
still better than node affinity because with taints/tolerations each pod only
needs one special "marking", versus in the node affinity case where every time
you add a dedicated node group (i.e. a new `dedicated=` value), you need to add
a new node affinity rule to all pods (including pending pods) except the ones
allowed to use that new dedicated node group.

## API

```go
// The node this Taint is attached to has the effect "effect" on
// any pod that that does not tolerate the Taint.
type Taint struct {
  Key string  `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
  Value string  `json:"value,omitempty"`
  Effect TaintEffect  `json:"effect"`
}

type TaintEffect string

const (
  // Do not allow new pods to schedule unless they tolerate the taint,
  // but allow all pods submitted to Kubelet without going through the scheduler
  // to start, and allow all already-running pods to continue running. 
  // Enforced by the scheduler.
  TaintEffectNoSchedule TaintEffect = "NoSchedule"
  // Like TaintEffectNoSchedule, but the scheduler tries not to schedule
  // new pods onto the node, rather than prohibiting new pods from scheduling
  // onto the node. Enforced by the scheduler.
  TaintEffectPreferNoSchedule TaintEffect = "PreferNoSchedule"
  // Do not allow new pods to schedule unless they tolerate the taint,
  // do not allow pods to start on Kubelet unless they tolerate the taint,
  // but allow all already-running pods to continue running.
  // Enforced by the scheduler and Kubelet.
  TaintEffectNoScheduleNoAdmit TaintEffect = "NoScheduleNoAdmit"
  // Do not allow new pods to schedule unless they tolerate the taint,
  // do not allow pods to start on Kubelet unless they tolerate the taint,
  // and try to eventually evict any already-running pods that do not tolerate the taint.
  // Enforced by the scheduler and Kubelet.
  TaintEffectNoScheduleNoAdmitNoExecute = "NoScheduleNoAdmitNoExecute"
)

// The pod this Toleration is attached to tolerates any taint that matches
// the triple <key,value,effect> using the matching operator <operator>.
type Toleration struct {
  Key string  `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
  // operator represents a key's relationship to the value.
  // Valid operators are Exists and Equal. Defaults to Equal.
  // Exists is equivalent to wildcard for value, so that a pod can
  // tolerate all taints of a particular category.
  Operator TolerationOperator `json:"operator"`
  Value string                `json:"value,omitempty"`
  Effect TaintEffect          `json:"effect"`
  // TODO: For forgiveness (#1574), we'd eventually add at least a grace period
  // here, and possibly an occurrence threshold and period.
}

// A toleration operator is the set of operators that can be used in a toleration.
type TolerationOperator string

const (
  TolerationOpExists  TolerationOperator = "Exists"
  TolerationOpEqual   TolerationOperator = "Equal"
)

```

(See [this comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375)
to understand the motivation for the various taint effects.)

We will add:

```go
	// Multiple tolerations with the same key are allowed.
	Tolerations []Toleration  `json:"tolerations,omitempty"`
```

to `PodSpec`. A pod must tolerate all of a node's taints (except taints of type
TaintEffectPreferNoSchedule) in order to be able to schedule onto that node.

We will add:

```go
	// Multiple taints with the same key are not allowed.
	Taints []Taint  `json:"taints,omitempty"`
```

to both `NodeSpec` and `NodeStatus`. The value in `NodeStatus` is the union
of the taints specified by various sources. For now, the only source is
the `NodeSpec` itself, but in the future one could imagine a node inheriting
taints from pods (if we were to allow taints to be attached to pods), from
the node's startup coniguration, etc. The scheduler should look at the `Taints`
in `NodeStatus`, not in `NodeSpec`.

Taints and tolerations are not scoped to namespace.

## Implementation plan: taints, tolerations, and dedicated nodes

Using taints and tolerations to implement dedicated nodes requires these steps:

1. Add the API described above
1. Add a scheduler predicate function that respects taints and tolerations (for
TaintEffectNoSchedule) and a scheduler priority function that respects taints
and tolerations (for TaintEffectPreferNoSchedule).
1. Add to the Kubelet code to implement the "no admit" behavior of
TaintEffectNoScheduleNoAdmit and TaintEffectNoScheduleNoAdmitNoExecute
1. Implement code in Kubelet that evicts a pod that no longer satisfies
TaintEffectNoScheduleNoAdmitNoExecute. In theory we could do this in the
controllers instead, but since taints might be used to enforce security
policies, it is better to do in kubelet because kubelet can respond quickly and
can guarantee the rules will be applied to all pods. Eviction may need to happen
under a variety of circumstances: when a taint is added, when an existing taint
is updated, when a toleration is removed from a pod, or when a toleration is
modified on a pod.
1. Add a new `kubectl` command that adds/removes taints to/from nodes,
1. (This is the one step is that is specific to dedicated nodes) Implement an
admission controller that adds tolerations to pods that are supposed to be
allowed to use dedicated nodes (for example, based on pod's namespace).

In the future one can imagine a generic policy configuration that configures an
admission controller to apply the appropriate tolerations to the desired class
of pods and taints to Nodes upon node creation. It could be used not just for
policies about dedicated nodes, but also other uses of taints and tolerations,
e.g. nodes that are restricted due to their hardware configuration.

The `kubectl` command to add and remove taints on nodes will be modeled after
`kubectl label`. Examples usages:

```sh
# Update node 'foo' with a taint with key 'dedicated' and value 'special-user' and effect 'NoScheduleNoAdmitNoExecute'.
# If a taint with that key already exists, its value and effect are replaced as specified.
$ kubectl taint nodes foo dedicated=special-user:NoScheduleNoAdmitNoExecute

# Remove from node 'foo' the taint with key 'dedicated' if one exists.
$ kubectl taint nodes foo dedicated-
```

## Example: implementing a dedicated nodes policy

Let's say that the cluster administrator wants to make nodes `foo`, `bar`, and `baz` available
only to pods in a particular namespace `banana`. First the administrator does

```sh
$ kubectl taint nodes foo dedicated=banana:NoScheduleNoAdmitNoExecute
$ kubectl taint nodes bar dedicated=banana:NoScheduleNoAdmitNoExecute
$ kubectl taint nodes baz dedicated=banana:NoScheduleNoAdmitNoExecute

```

(assuming they want to evict pods that are already running on those nodes if those
pods don't already tolerate the new taint)

Then they ensure that the `PodSpec` for all pods created in namespace `banana` specify
a toleration with `key=dedicated`, `value=banana`, and `policy=NoScheduleNoAdmitNoExecute`.

In the future, it would be nice to be able to specify the nodes via a `NodeSelector` rather than having
to enumerate them by name.

## Future work

At present, the Kubernetes security model allows any user to add and remove any
taints and tolerations. Obviously this makes it impossible to securely enforce
rules like dedicated nodes. We need some mechanism that prevents regular users
from mutating the `Taints` field of `NodeSpec` (probably we want to prevent them
from mutating any fields of `NodeSpec`) and from mutating the `Tolerations`
field of their pods. #17549 is relevant.

Another security vulnerability arises if nodes are added to the cluster before
receiving their taint. Thus we need to ensure that a new node does not become
"Ready" until it has been configured with its taints. One way to do this is to
have an admission controller that adds the taint whenever a Node object is
created.

A quota policy may want to treat nodes differently based on what taints, if any,
they have. For example, if a particular namespace is only allowed to access
dedicated nodes, then it may be convenient to give the namespace unlimited
quota. (To use finite quota, you'd have to size the namespace's quota to the sum
of the sizes of the machines in the dedicated node group, and update it when
nodes are added/removed to/from the group.)

It's conceivable that taints and tolerations could be unified with
[pod anti-affinity](https://github.com/kubernetes/kubernetes/pull/18265).
We have chosen not to do this for the reasons described in the "Future work"
section of that doc.

## Backward compatibility

Old scheduler versions will ignore taints and tolerations. New scheduler
versions will respect them.

Users should not start using taints and tolerations until the full
implementation has been in Kubelet and the master for enough binary versions
that we feel comfortable that we will not need to roll back either Kubelet or
master to a version that does not support them. Longer-term we will use a
progamatic approach to enforcing this (#4855).

## Related issues

This proposal is based on the discussion in #17190. There are a number of other
related issues, all of which are linked to from #17190.

The relationship between taints and node drains is discussed in #1574.

The concepts of taints and tolerations were originally developed as part of the
Omega project at Google.


<!-- BEGIN MUNGE: IS_VERSIONED -->
<!-- TAG IS_VERSIONED -->
<!-- END MUNGE: IS_VERSIONED -->


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/taint-toleration-dedicated.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->