From 14c276306c6aab6464fabc85fa5a3bb472379d19 Mon Sep 17 00:00:00 2001
From: David Oppenheimer
Date: Sat, 5 Dec 2015 16:06:17 -0800
Subject: [PATCH] Dedicated nodes, taints, and tolerations design doc.

---
 docs/design/taint-toleration-dedicated.md | 301 ++++++++++++++++++++++
 1 file changed, 301 insertions(+)
 create mode 100644 docs/design/taint-toleration-dedicated.md

diff --git a/docs/design/taint-toleration-dedicated.md b/docs/design/taint-toleration-dedicated.md
new file mode 100644
index 0000000000..cca2ee448e
--- /dev/null
+++ b/docs/design/taint-toleration-dedicated.md
@@ -0,0 +1,301 @@

# Taints, Tolerations, and Dedicated Nodes

## Introduction

This document describes *taints* and *tolerations*, which constitute a generic mechanism for restricting
the set of pods that can use a node. We also describe one concrete use case for the mechanism,
namely to limit the set of users (or more generally, authorization domains)
who can access a set of nodes (a feature we call
*dedicated nodes*). There are many other uses--for example, a set of nodes with a particular
piece of hardware could
be reserved for pods that require that hardware, or a node could be marked as unschedulable
when it is being drained before shutdown, or a node could trigger evictions when it experiences
hardware or software problems or abnormal node configurations; see #17190 and #3885 for more discussion.

## Taints, tolerations, and dedicated nodes

A *taint* is a new type that is part of the `NodeSpec`; when present, it prevents pods
from scheduling onto the node unless the pod *tolerates* the taint (tolerations are listed
in the `PodSpec`). Note that there are actually multiple flavors of taints: taints that
prevent scheduling on a node, taints that cause the scheduler to try to avoid scheduling
on a node but do not prevent it, taints that prevent a pod from starting on Kubelet even
if the pod's `NodeName` was written directly (i.e. the pod did not go through the scheduler),
and taints that evict already-running pods.
[This comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375)
has more background on these different scenarios. We will focus on the first
kind of taint in this doc, since it is the kind required for the "dedicated nodes" use case.

Implementing dedicated nodes using taints and tolerations is straightforward: in essence, a node that
is dedicated to group A gets taint `dedicated=A` and the pods belonging to group A get
toleration `dedicated=A`. (The exact syntax and semantics of taints and tolerations are
described later in this doc.) This keeps all pods except those belonging to group A off of the nodes.
This approach easily generalizes to pods that are allowed to
schedule into multiple dedicated node groups, and nodes that are members of multiple
dedicated node groups.

Note that because tolerations are at the granularity of pods,
the mechanism is very flexible -- any policy can be used to determine which tolerations
should be placed on a pod. So the "group A" mentioned above could be all pods from a
particular namespace or set of namespaces, or all pods with some other arbitrary characteristic
in common. We expect that any real-world usage of taints and tolerations will employ an admission controller
to apply the tolerations. For example, to give all pods from namespace A access to dedicated
node group A, an admission controller would add the corresponding toleration to all
pods from namespace A. Or to give all pods that require GPUs access to GPU nodes, an admission
controller would add the toleration for GPU taints to pods that request the GPU resource.
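To make the admission controller idea concrete, here is a minimal sketch of the dedicated-nodes
case, assuming the `Toleration` type and constants proposed in the API section below; the function
name and the surrounding admission plumbing are hypothetical, not part of this proposal:

```go
// Sketch only: logic an admission controller might apply to every pod created
// in a namespace that is entitled to dedicated node group `group`. It appends
// the matching toleration to the pod's spec (PodSpec.Tolerations is the field
// proposed in the API section below).
func addDedicatedToleration(spec *PodSpec, group string) {
    spec.Tolerations = append(spec.Tolerations, Toleration{
        Key:      "dedicated",
        Operator: TolerationOpEqual,
        Value:    group,
        Effect:   TaintEffectNoScheduleNoAdmitNoExecute,
    })
}
```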
Everything that can be expressed using taints and tolerations can be expressed using
[node affinity](https://github.com/kubernetes/kubernetes/pull/18261); e.g. in the dedicated nodes
example above, you could put a label `dedicated=A` on the set of dedicated nodes and
a node affinity `dedicated NotIn A` on all pods *not* belonging to group A. But it is
cumbersome to express exclusion policies using node affinity, because every time you add
a new type of restricted node, all pods that aren't allowed to use those nodes need to start avoiding those
nodes using node affinity. This means the node affinity list can get quite long in clusters with many different
groups of special nodes (lots of dedicated node groups, lots of different kinds of special hardware, etc.).
Moreover, you also need to update any Pending pods when you add new types of special nodes.
In contrast, with taints and tolerations,
when you add a new type of special node, "regular" pods are unaffected, and you just need to add
the necessary toleration to the pods you subsequently create that need to use the new type of special nodes.
To put it another way, with taints and tolerations, only pods that use a set of special nodes
need to know about those special nodes; with the node affinity approach, pods that have
no interest in those special nodes need to know about all of the groups of special nodes.

One final comment: in practice, it is often desirable not only
to keep "regular" pods off of special nodes, but also to keep "special" pods off of
regular nodes. An example in the dedicated nodes case is to not only keep regular
users off of dedicated nodes, but also to keep dedicated users off of non-dedicated (shared)
nodes. In this case, the "non-dedicated" nodes can be modeled as their own dedicated node group
(for example, tainted as `dedicated=shared`), and pods that are not given access to any
dedicated nodes ("regular" pods) would be given a toleration for `dedicated=shared`. (As mentioned earlier,
we expect tolerations will be added by an admission controller.) In this case taints/tolerations
are still better than node affinity, because with taints/tolerations each pod only needs one special "marking",
whereas in the node affinity case, every time you add a dedicated node group (i.e. a new
`dedicated=` value), you need to add a new node affinity rule to all pods (including pending pods)
except the ones allowed to use that new dedicated node group.

## API

```go
// The node this Taint is attached to has the effect "effect" on
// any pod that does not tolerate the Taint.
type Taint struct {
    Key    string      `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
    Value  string      `json:"value,omitempty"`
    Effect TaintEffect `json:"effect"`
}

type TaintEffect string

const (
    // Do not allow new pods to schedule unless they tolerate the taint,
    // but allow all pods submitted to Kubelet without going through the scheduler
    // to start, and allow all already-running pods to continue running.
    // Enforced by the scheduler.
    TaintEffectNoSchedule TaintEffect = "NoSchedule"
    // Like TaintEffectNoSchedule, but the scheduler tries not to schedule
    // new pods onto the node, rather than prohibiting new pods from scheduling
    // onto the node. Enforced by the scheduler.
    TaintEffectPreferNoSchedule TaintEffect = "PreferNoSchedule"
    // Do not allow new pods to schedule unless they tolerate the taint,
    // do not allow pods to start on Kubelet unless they tolerate the taint,
    // but allow all already-running pods to continue running.
    // Enforced by the scheduler and Kubelet.
    TaintEffectNoScheduleNoAdmit TaintEffect = "NoScheduleNoAdmit"
    // Do not allow new pods to schedule unless they tolerate the taint,
    // do not allow pods to start on Kubelet unless they tolerate the taint,
    // and try to eventually evict any already-running pods that do not tolerate the taint.
    // Enforced by the scheduler and Kubelet.
    TaintEffectNoScheduleNoAdmitNoExecute TaintEffect = "NoScheduleNoAdmitNoExecute"
)

// The pod this Toleration is attached to tolerates any taint that matches
// the triple <key,value,effect> using the matching operator <operator>.
type Toleration struct {
    Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
    // Operator represents a key's relationship to the value.
    // Valid operators are Exists and Equal. Defaults to Equal.
    // Exists is equivalent to wildcard for value, so that a pod can
    // tolerate all taints of a particular category.
    Operator TolerationOperator `json:"operator"`
    Value    string             `json:"value,omitempty"`
    Effect   TaintEffect        `json:"effect"`
    // TODO: For forgiveness (#1574), we'd eventually add at least a grace period
    // here, and possibly an occurrence threshold and period.
}

// TolerationOperator is the set of operators that can be used in a toleration.
type TolerationOperator string

const (
    TolerationOpExists TolerationOperator = "Exists"
    TolerationOpEqual  TolerationOperator = "Equal"
)
```

(See [this comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375)
to understand the motivation for the various taint effects.)
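To illustrate the matching semantics these fields imply, here is a rough sketch of how a single
toleration matches a single taint. This is illustrative only, not the actual scheduler or Kubelet
code, and the helper name is hypothetical:

```go
// Sketch only: a toleration matches a taint when the keys and effects agree
// and either the operator is Exists (any value) or the values are equal.
func tolerationMatchesTaint(tol Toleration, taint Taint) bool {
    if tol.Key != taint.Key || tol.Effect != taint.Effect {
        return false
    }
    switch tol.Operator {
    case TolerationOpExists:
        return true // wildcard over values for this key
    case TolerationOpEqual, "": // Equal is the default when unspecified
        return tol.Value == taint.Value
    default:
        return false
    }
}
```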
We will add

```go
  // Multiple tolerations with the same key are allowed.
  Tolerations []Toleration `json:"tolerations,omitempty"`
```

to `PodSpec`. A pod must tolerate all of a node's taints (except taints
of type TaintEffectPreferNoSchedule) in order to be able
to schedule onto that node.

We will add

```go
  // Multiple taints with the same key are not allowed.
  Taints []Taint `json:"taints,omitempty"`
```

to both `NodeSpec` and `NodeStatus`. The value in `NodeStatus` is the union
of the taints specified by various sources. For now, the only source is
the `NodeSpec` itself, but in the future one could imagine a node inheriting
taints from pods (if we were to allow taints to be attached to pods), from
the node's startup configuration, etc. The scheduler should look at the `Taints`
in `NodeStatus`, not in `NodeSpec`.

Taints and tolerations are not scoped to namespace.

## Implementation plan: taints, tolerations, and dedicated nodes

Using taints and tolerations to implement dedicated nodes requires these steps:

1. Add the API described above.
1. Add a scheduler predicate function that respects taints and tolerations (for TaintEffectNoSchedule)
and a scheduler priority function that respects taints and tolerations (for TaintEffectPreferNoSchedule).
1. Add code to the Kubelet to implement the "no admit" behavior of TaintEffectNoScheduleNoAdmit and
TaintEffectNoScheduleNoAdmitNoExecute (a rough sketch of this check appears after this list).
1. Implement code in the Kubelet that evicts a pod that no longer satisfies
TaintEffectNoScheduleNoAdmitNoExecute. In theory we could do this in the controllers
instead, but since taints might be used to enforce security policies, it is better
to do it in the Kubelet, because the Kubelet can respond quickly and can guarantee the rules will
be applied to all pods.
Eviction may need to happen under a variety of circumstances: when a taint is added, when an existing
taint is updated, when a toleration is removed from a pod, or when a toleration is modified on a pod.
1. Add a new `kubectl` command that adds/removes taints to/from nodes.
1. (This is the one step that is specific to dedicated nodes.)
Implement an admission controller that adds tolerations to pods that are supposed
to be allowed to use dedicated nodes (for example, based on the pod's namespace).
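As a forward-looking sketch of step 3 (and the trigger for the eviction described in step 4),
the Kubelet-side check might look roughly like the following. The function name is hypothetical,
the real integration with the Kubelet's admission and eviction paths is not shown, and
`tolerationMatchesTaint` is the illustrative helper sketched in the API section above:

```go
// Sketch only: the Kubelet would refuse to admit (and, for the NoExecute
// variant, eventually evict) a pod that does not tolerate every taint on the
// node whose effect the Kubelet enforces.
func kubeletToleratesNodeTaints(tolerations []Toleration, nodeTaints []Taint) bool {
    for _, taint := range nodeTaints {
        if taint.Effect != TaintEffectNoScheduleNoAdmit &&
            taint.Effect != TaintEffectNoScheduleNoAdmitNoExecute {
            continue // NoSchedule and PreferNoSchedule are enforced by the scheduler only
        }
        tolerated := false
        for _, tol := range tolerations {
            if tolerationMatchesTaint(tol, taint) {
                tolerated = true
                break
            }
        }
        if !tolerated {
            return false
        }
    }
    return true
}
```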
In the future one can imagine a generic policy configuration that configures
an admission controller to apply the appropriate tolerations to the desired class of pods, and
taints to Nodes upon node creation. It could be used not just for policies about dedicated nodes,
but also for other uses of taints and tolerations, e.g. nodes that are restricted
due to their hardware configuration.

The `kubectl` command to add and remove taints on nodes will be modeled after `kubectl label`.
Example usages:

```sh
# Update node 'foo' with a taint with key 'dedicated' and value 'special-user' and effect 'NoScheduleNoAdmitNoExecute'.
# If a taint with that key already exists, its value and effect are replaced as specified.
$ kubectl taint nodes foo dedicated=special-user:NoScheduleNoAdmitNoExecute

# Remove from node 'foo' the taint with key 'dedicated' if one exists.
$ kubectl taint nodes foo dedicated-
```

## Example: implementing a dedicated nodes policy

Let's say that the cluster administrator wants to make nodes `foo`, `bar`, and `baz` available
only to pods in a particular namespace `banana`. First the administrator does

```sh
$ kubectl taint nodes foo dedicated=banana:NoScheduleNoAdmitNoExecute
$ kubectl taint nodes bar dedicated=banana:NoScheduleNoAdmitNoExecute
$ kubectl taint nodes baz dedicated=banana:NoScheduleNoAdmitNoExecute
```

(assuming they want to evict pods that are already running on those nodes if those
pods don't already tolerate the new taint).

Then they ensure that the `PodSpec` for all pods created in namespace `banana` specifies
a toleration with `key=dedicated`, `value=banana`, and `effect=NoScheduleNoAdmitNoExecute`.
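Expressed with the Go types from the API section (in practice this toleration would be injected by
an admission controller as described earlier, rather than written by users), it would look roughly
like:

```go
// Sketch only: the toleration that lets pods in namespace 'banana' use the
// dedicated nodes tainted above.
Toleration{
    Key:      "dedicated",
    Operator: TolerationOpEqual,
    Value:    "banana",
    Effect:   TaintEffectNoScheduleNoAdmitNoExecute,
}
```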
In the future, it would be nice to be able to specify the nodes via a `NodeSelector` rather than having
to enumerate them by name.

## Future work

At present, the Kubernetes security model allows any user to add and remove any taints and tolerations.
Obviously this makes it impossible to securely enforce
rules like dedicated nodes. We need some mechanism that prevents regular users from mutating the `Taints`
field of `NodeSpec` (probably we want to prevent them from mutating any fields of `NodeSpec`)
and from mutating the `Tolerations` field of their pods. #17549 is relevant.

Another security vulnerability arises if nodes are added to the cluster before receiving
their taint. Thus we need to ensure that a new node does not become "Ready" until it has been
configured with its taints. One way to do this is to have an admission controller that adds the taint whenever
a Node object is created.

A quota policy may want to treat nodes differently based on what taints, if any,
they have. For example, if a particular namespace is only allowed to access dedicated nodes,
then it may be convenient to give the namespace unlimited quota. (To use finite quota,
you'd have to size the namespace's quota to the sum of the sizes of the machines in the
dedicated node group, and update it when nodes are added to or removed from the group.)

It's conceivable that taints and tolerations could be unified with [pod anti-affinity](https://github.com/kubernetes/kubernetes/pull/18265).
We have chosen not to do this for the reasons described in the "Future work" section of that doc.

## Backward compatibility

Old scheduler versions will ignore taints and tolerations. New scheduler versions
will respect them.

Users should not start using taints and tolerations until the full implementation
has been in Kubelet and the master for enough binary versions that we
feel comfortable that we will not need to roll back either Kubelet or
master to a version that does not support them. Longer-term we will
use a programmatic approach to enforcing this (#4855).

## Related issues

This proposal is based on the discussion in #17190. There are a number of other
related issues, all of which are linked to from #17190.

The relationship between taints and node drains is discussed in #1574.

The concepts of taints and tolerations were originally developed as part of the
Omega project at Google.