* Can only schedule P onto nodes that are running pods that satisfy `P1`. (Assumes all nodes have a label with key "node" and value specifying their node name.)
* Should try to schedule P onto zones that are running pods that satisfy `P3`. (Assumes all nodes have a label with key "zone" and value specifying their zone.)
* Cannot schedule P onto any racks that are running pods that satisfy `P2`. (Assumes all nodes have a label with key "rack" and value specifying their rack name.)
* Should try not to schedule P onto any power domains that are running pods that satisfy `P4`. (Assumes all nodes have a label with key "power" and value specifying their power domain.)
When `RequiredDuringScheduling` has multiple elements, the requirements are ANDed.
For `PreferredDuringScheduling` the weights are added for the terms that are satisfied for each node, and
the node(s) with the highest weight(s) are the most preferred.
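As a rough illustration of how such a combined rule set would be expressed, the sketch below groups the four rules above into required and preferred affinity and anti-affinity terms. The type and field names are simplified stand-ins for the API described earlier, not the final API, and the `"P1"`..`"P4"` strings stand for label selectors matching those pods.

```go
// Simplified stand-ins for the affinity API; names and shapes are illustrative only.
type Term struct {
	Selector    string // shorthand for a label selector matching P1..P4
	TopologyKey string
}

type WeightedTerm struct {
	Weight int
	Term   Term
}

type Rules struct {
	RequiredDuringScheduling  []Term         // all terms must be satisfied (ANDed)
	PreferredDuringScheduling []WeightedTerm // weights of satisfied terms are summed per node
}

var podAffinity = Rules{
	RequiredDuringScheduling:  []Term{{Selector: "P1", TopologyKey: "node"}},
	PreferredDuringScheduling: []WeightedTerm{{Weight: 1, Term: Term{Selector: "P3", TopologyKey: "zone"}}},
}

var podAntiAffinity = Rules{
	RequiredDuringScheduling:  []Term{{Selector: "P2", TopologyKey: "rack"}},
	PreferredDuringScheduling: []WeightedTerm{{Weight: 1, Term: Term{Selector: "P4", TopologyKey: "power"}}},
}
```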
In reality there are two variants of `RequiredDuringScheduling`: one suffixed with
`RequiredDuringExecution` and one suffixed with `IgnoredDuringExecution`. For the
first variant, if the affinity/anti-affinity ceases to be met at some point during
pod execution (e.g. due to a pod label update), the system will try to eventually evict the pod
from its node. In the second variant, the system may or may not try to eventually
evict the pod from its node.
## A comment on symmetry
One thing that makes affinity and anti-affinity tricky is symmetry.
Imagine a cluster that is running pods from two services, S1 and S2. Imagine that the pods of S1 have a RequiredDuringScheduling anti-affinity rule
"do not run me on nodes that are running pods from S2." It is not sufficient just to check that there are no S2 pods on a node when
you are scheduling a S1 pod. You also need to ensure that there are no S1 pods on a node when you are scheduling a S2 pod,
*even though the S2 pod does not have any anti-affinity rules*. Otherwise if an S1 pod schedules before an S2 pod, the S1
pod's RequiredDuringScheduling anti-affinity rule can be violated by a later-arriving S2 pod. More specifically, if S1 has the aforementioned
RequiredDuringScheduling anti-affinity rule, then
* if a node is empty, you can schedule S1 or S2 onto the node
* if a node is running S1 (S2), you cannot schedule S2 (S1) onto the node
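The following sketch shows one way a scheduler could enforce this symmetry for node-level RequiredDuringScheduling anti-affinity. The types and helpers are hypothetical, not actual scheduler code, and for simplicity the topology domain is assumed to be a single node.

```go
// Pod and Term are simplified stand-ins; Selector stands in for a label selector.
type Term struct {
	Selector func(labels map[string]string) bool
}

type Pod struct {
	Labels           map[string]string
	HardAntiAffinity []Term // the pod's RequiredDuringScheduling anti-affinity terms
}

// canSchedule reports whether candidate may be placed on a node given the pods
// already running there.
func canSchedule(candidate Pod, running []Pod) bool {
	for _, existing := range running {
		// Forward direction: the candidate's own anti-affinity vs. existing pods.
		for _, t := range candidate.HardAntiAffinity {
			if t.Selector(existing.Labels) {
				return false
			}
		}
		// Symmetric direction: existing pods' anti-affinity vs. the candidate,
		// even if the candidate itself declares no rules (the S1/S2 case above).
		for _, t := range existing.HardAntiAffinity {
			if t.Selector(candidate.Labels) {
				return false
			}
		}
	}
	return true
}
```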
Note that while RequiredDuringScheduling anti-affinity is symmetric,
RequiredDuringScheduling affinity is *not* symmetric. That is, if the pods of S1 have a RequiredDuringScheduling affinity rule "run me on nodes that are running
pods from S2," it is not required that there be S1 pods on a node in order to schedule a S2 pod onto that node. More
specifically, if S1 has the aforementioned RequiredDuringScheduling affinity rule, then
* if a node is empty, you can schedule S2 onto the node
* if a node is empty, you cannot schedule S1 onto the node
* if a node is running S2, you can schedule S1 onto the node
* if a node is running S1+S2 and S1 terminates, S2 continues running
* if a node is running S1+S2 and S2 terminates, the system terminates S1 (eventually)
However, although RequiredDuringScheduling affinity is not symmetric, there is an implicit PreferredDuringScheduling affinity rule corresponding to every
RequiredDuringScheduling affinity rule: if the pods of S1 have a RequiredDuringScheduling affinity rule "run me on nodes that are running
pods from S2" then it is not required that there be S1 pods on a node in order to schedule a S2 pod onto that node,
but it would be better if there are.
PreferredDuringScheduling is symmetric.
If the pods of S1 had a PreferredDuringScheduling anti-affinity rule "try not to run me on nodes that are running pods from S2"
then we would prefer to keep a S1 pod that we are scheduling off of nodes that are running S2 pods, and also
to keep a S2 pod that we are scheduling off of nodes that are running S1 pods. Likewise if the pods of
S1 had a PreferredDuringScheduling affinity rule "try to run me on nodes that are running pods from S2" then we would prefer
to place a S1 pod that we are scheduling onto a node that is running a S2 pod, and also to place
a S2 pod that we are scheduling onto a node that is running a S1 pod.
## Examples
Here are some examples of how you would express various affinity and anti-affinity rules using the API we described.
### Affinity
In the examples below, the word "put" is intentionally ambiguous; the rules are the same
whether "put" means "must put" (RequiredDuringScheduling) or "try to put"
(PreferredDuringScheduling)--all that changes is which field the rule goes into.
Also, we only discuss scheduling time and ignore execution time.
Finally, some of the examples
use "zone" and some use "node," just to make the examples more interesting; any of the examples
with "zone" will also work for "node" if you change the `TopologyKey`, and vice-versa.
* **Put the pod in zone Z**:
Tricked you! It is not possible to express this using the API described here. For this you should use node affinity.
* **Put the pod in a zone that is running at least one pod from service S**:
`{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}`
* **Put the pod on a node that is already running a pod that requires a license for software package P**:
Assuming pods that require a license for software package P have a label `{key=license, value=P}`:
`{LabelSelector: "license" In "P", TopologyKey: "node"}`
* **Put this pod in the same zone as other pods from its same service**:
Assuming pods from this pod's service have some label `{key=service, value=S}`:
`{LabelSelector: "service" In "S", TopologyKey: "zone"}`
This last example illustrates a small issue with this API when it is used
with a scheduler that processes the pending queue one pod at a time, like the current
Kubernetes scheduler. The RequiredDuringScheduling rule
`{LabelSelector: "service" In "S", TopologyKey: "zone"}`
only "works" once one pod from service S has been scheduled. But if all pods in service
S have this RequiredDuringScheduling rule in their PodSpec, then the RequiredDuringScheduling rule
will block the first
pod of the service from ever scheduling, since it is only allowed to run in a zone with another pod from
the same service. And of course that means none of the pods of the service will be able
to schedule. This problem *only* applies to RequiredDuringScheduling affinity, not
PreferredDuringScheduling affinity or any variant of anti-affinity.
There are at least three ways to solve this problem:
* **short-term**: have the scheduler use a rule that if the RequiredDuringScheduling affinity requirement
matches a pod's own labels, and there are no other such pods anywhere, then disregard the requirement (sketched below).
This approach has a corner case when running parallel schedulers that are allowed to
schedule pods from the same replicated set (e.g. a single PodTemplate): both schedulers may try to
schedule pods from the set
at the same time and think there are no other pods from that set scheduled yet (e.g. they are
trying to schedule the first two pods from the set), but by the time
the second binding is committed, the first one has already been committed, leaving you with
two pods running that do not respect their RequiredDuringScheduling affinity. There is no
simple way to detect this "conflict" at scheduling time given the current system implementation.
* **longer-term**: when a controller creates pods from a PodTemplate, for exactly *one* of those
pods, it should omit any RequiredDuringScheduling affinity rules that select the pods of that PodTemplate.
* **very long-term/speculative**: controllers could present the scheduler with a group of pods from
the same PodTemplate as a single unit. This is similar to the first approach described above but
avoids the corner case. No special logic is needed in the controllers. Moreover, this would allow
the scheduler to do proper [gang scheduling](https://github.com/kubernetes/kubernetes/issues/16845)
since it could receive an entire gang simultaneously as a single unit.
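A minimal sketch of the short-term workaround above, assuming hypothetical helper types for the pod being scheduled and its hard affinity terms:

```go
// Term and Pod are simplified stand-ins; Matches stands in for the term's label selector.
type Term struct {
	Matches     func(labels map[string]string) bool
	TopologyKey string
}

type Pod struct {
	Labels map[string]string
}

// effectiveHardAffinity drops any term that matches the pod's own labels when no
// other pod in the system matches it yet, so the first pod of a set can schedule.
func effectiveHardAffinity(pod Pod, terms []Term, allOtherPods []Pod) []Term {
	var kept []Term
	for _, t := range terms {
		othersExist := false
		for _, q := range allOtherPods {
			if t.Matches(q.Labels) {
				othersExist = true
				break
			}
		}
		if t.Matches(pod.Labels) && !othersExist {
			continue // first pod of the set: disregard this term
		}
		kept = append(kept, t)
	}
	return kept
}
```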
### Anti-affinity
As with the affinity examples, the examples here can be RequiredDuringScheduling or
PreferredDuringScheduling anti-affinity, i.e.
"don't" can be interpreted as "must not" or as "try not to" depending on whether the rule appears
in `RequiredDuringScheduling` or `PreferredDuringScheduling`.
* **Spread the pods of this service S across nodes and zones**:
`{{LabelSelector: <selector that matches S's pods>, TopologyKey: "node"}, {LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}}`
(note that if this is specified as a RequiredDuringScheduling anti-affinity, then the first clause is redundant, since the second
clause will force the scheduler to not put more than one pod from S in the same zone, and thus by
definition it will not put more than one pod from S on the same node, assuming each node is in one zone.
This rule is more useful as PreferredDuringScheduling anti-affinity, e.g. one might expect it to be common in
multi-zone and federated clusters.)
* **Don't co-locate pods of this service with any other pods except other pods of this service**:
Assuming pods from this pod's service have some label `{key=service, value=S}`:
`{LabelSelector: "service" NotIn "S", TopologyKey: "node"}`
Note that this works because `"service" NotIn "S"` matches pods with no key "service"
as well as pods with key "service" and a corresponding value that is not "S."
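A minimal sketch of the `NotIn` matching behavior that makes this work; the helper below is illustrative, not the real selector implementation.

```go
// matchesNotIn reports whether a pod's labels satisfy a requirement of the form
// `key NotIn values`. A pod that does not carry the key at all also matches,
// which is why the rule above only excludes pods from other services.
func matchesNotIn(podLabels map[string]string, key string, values []string) bool {
	v, ok := podLabels[key]
	if !ok {
		return true // no "service" label at all: still matches NotIn
	}
	for _, excluded := range values {
		if v == excluded {
			return false
		}
	}
	return true
}
```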
## Algorithm
An example algorithm a scheduler might use to implement affinity and anti-affinity rules is as follows.
There are certainly more efficient ways to do it; this is just intended to demonstrate that the API's
semantics are implementable.
Terminology definition: We say a pod P is "feasible" on a node N if P meets all of the scheduler
predicates for scheduling P onto N. Note that this algorithm is only concerned about scheduling
time, thus it makes no distinction between RequiredDuringExecution and IgnoredDuringExecution.
To make the algorithm slightly more readable, we use the term "HardPodAffinity" as shorthand
for "RequiredDuringSchedulingScheduling pod affinity" and "SoftPodAffinity" as shorthand for
"PreferredDuringScheduling pod affinity." Analogously for "HardPodAntiAffinity" and "SoftPodAntiAffinity."
** TODO: Update this algorithm to take weight for SoftPod{Affinity,AntiAffinity} into account;
currently it assumes all terms have weight 1. **
```
Z = the pod you are scheduling
{N} = the set of all nodes in the system // this algorithm will reduce it to the set of all nodes feasible for Z
// Step 1a: Reduce {N} to the set of nodes satisfying Z's HardPodAffinity in the "forward" direction
X = {Z's PodSpec's HardPodAffinity}
foreach element H of {X}
P = {all pods in the system that match H.LabelSelector}
M map[string]int // topology value -> number of pods running on nodes with that topology value
foreach pod Q of {P}
L = {labels of the node on which Q is running, represented as a map from label key to label value}
M[L[H.TopologyKey]]++
{N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]>0]}
// Step 1b: Further reduce {N} to the set of nodes also satisfying Z's HardPodAntiAffinity
// This step is identical to Step 1a except the M[K] > 0 comparison becomes M[K] == 0
X = {Z's PodSpec's HardPodAntiAffinity}
foreach element H of {X}
P = {all pods in the system that match H.LabelSelector}
M map[string]int // topology value -> number of pods running on nodes with that topology value
foreach pod Q of {P}
L = {labels of the node on which Q is running, represented as a map from label key to label value}
M[L[H.TopologyKey]]++
{N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]==0]}
// Step 2: Further reduce {N} by enforcing symmetry requirement for other pods' HardPodAntiAffinity
foreach node A of {N}
foreach pod B that is bound to A
if any of B's HardPodAntiAffinity are currently satisfied but would be violated if Z runs on A, then remove A from {N}
// At this point, all nodes in {N} are feasible for Z.
// Step 3a: Soft version of Step 1a
Y map[string]int // node -> number of Z's soft affinity/anti-affinity preferences satisfied by that node
Initialize the keys of Y to all of the nodes in {N}, and the values to 0
X = {Z's PodSpec's SoftPodAffinity}
Repeat Step 1a except replace the last line with "foreach node W of {N} having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++"
// Step 3b: Soft version of Step 1b
X = {Z's PodSpec's SoftPodAntiAffinity}
Repeat Step 1b except replace the last line with "foreach node W of {N} not having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++"
// Step 4: Symmetric soft, plus treat forward direction of hard affinity as a soft
foreach node A of {N}
foreach pod B that is bound to A
increment Y[A] by the number of B's SoftPodAffinity, SoftPodAntiAffinity, and HardPodAffinity that are satisfied if Z runs on A but are not satisfied if Z does not run on A
// We're done. {N} contains all of the nodes that satisfy the affinity/anti-affinity rules, and Y is
// a map whose keys are the elements of {N} and whose values are how "good" of a choice N is for Z with
// respect to the explicit and implicit affinity/anti-affinity rules (larger number is better).
```
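For concreteness, Step 1a could be rendered in Go roughly as follows; `Pod`, `Node`, and `Term` are simplified stand-ins for the real scheduler data structures.

```go
// Term stands in for one HardPodAffinity element H; Matches stands in for H.LabelSelector.
type Term struct {
	Matches     func(podLabels map[string]string) bool
	TopologyKey string
}

type Pod struct {
	Labels   map[string]string
	NodeName string
}

type Node struct {
	Name   string
	Labels map[string]string
}

// filterByHardAffinity reduces nodes to those whose topology domain already runs
// at least one pod matching every hard affinity term of the pod being scheduled.
func filterByHardAffinity(nodes []Node, allPods []Pod, nodeByName map[string]Node, terms []Term) []Node {
	feasible := nodes
	for _, h := range terms {
		// M: topology value -> number of matching pods running in that domain.
		m := map[string]int{}
		for _, q := range allPods {
			if h.Matches(q.Labels) {
				m[nodeByName[q.NodeName].Labels[h.TopologyKey]]++
			}
		}
		var next []Node
		for _, n := range feasible {
			if m[n.Labels[h.TopologyKey]] > 0 {
				next = append(next, n)
			}
		}
		feasible = next
	}
	return feasible
}
```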
## Special considerations for RequiredDuringScheduling anti-affinity
In this section we discuss three issues with RequiredDuringScheduling anti-affinity:
Denial of Service (DoS), co-existing with daemons, and determining which pod(s) to kill.
See issue #18265 for additional discussion of these topics.
### Denial of Service
Without proper safeguards, a pod using RequiredDuringScheduling anti-affinity can intentionally
or unintentionally cause various problems for other pods, due to the symmetry property of anti-affinity.
The most notable danger is that a pod that arrives first in some topology domain can block all other pods from
scheduling there by stating a conflict with all other pods.
The standard approach
to preventing resource hogging is quota, but simple resource quota cannot prevent
this scenario because the pod may request very little resources. Addressing this
using quota requires a quota scheme that charges based on "opportunity cost" rather
than based simply on requested resources. For example, when handling a pod that expresses
RequiredDuringScheduling anti-affinity for all pods using a "node" `TopologyKey`
(i.e. exclusive access to a node), it could charge for the resources of the
average or largest node in the cluster. Likewise if a pod expresses RequiredDuringScheduling
anti-affinity for all pods using a "cluster" `TopologyKey`, it could charge for the resources of the entire cluster.
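A sketch of what such "opportunity cost" charging might look like; the policy and names are hypothetical, not an existing quota API.

```go
type Quantity = int64 // e.g. milli-CPU or bytes of memory

// chargeFor returns the amount to charge against quota for a pod that requests
// `requested` resources. If the pod claims exclusive use of a topology domain via
// RequiredDuringScheduling anti-affinity against all pods, it is charged for the
// largest domain it could monopolize (an average could be used instead).
func chargeFor(requested Quantity, domainCapacities []Quantity, exclusiveDomain bool) Quantity {
	if !exclusiveDomain {
		return requested
	}
	var largest Quantity
	for _, c := range domainCapacities {
		if c > largest {
			largest = c
		}
	}
	if largest > requested {
		return largest
	}
	return requested
}
```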
## Special considerations for RequiredDuringScheduling affinity
The Denial of Service concerns raised by RequiredDuringScheduling affinity are similar to the ones described at the
end of the Denial of Service section above, and can be addressed in similar ways.
The nature of affinity (as opposed to anti-affinity) means that there is no issue of
determining which pod(s) to kill
when a pod's labels change: it is obviously the pod with the affinity rule that becomes
violated that must be killed. (Killing a pod never "fixes" violation of an affinity rule;
it can only "fix" violation an anti-affinity rule.) However, affinity does have a
different question related to killing: how long should the system wait before declaring
that RequiredDuringSchedulingRequiredDuringExecution affinity is no longer met at runtime?
For example, if a pod P has such an affinity for a pod Q and pod Q is temporarily killed
so that it can be updated to a new binary version, should that trigger killing of P? More
generally, how long should the system wait before declaring that P's affinity is
violated? (Of course affinity is expressed in terms of label selectors, not for a specific
pod, but the scenario is easier to describe using a concrete pod.) This is closely related to
the concept of forgiveness (see issue #1574). In theory we could make this time duration be
configurable by the user on a per-pod basis, but for the first version of this feature we will
make it a configurable property of whichever component does the killing and that applies across
all pods using the feature. Making it configurable by the user would require a nontrivial change
to the API syntax (since the field would only apply to RequiredDuringSchedulingRequiredDuringExecution
affinity).
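A sketch of how such a component-level, uniform grace period might be applied by whichever component does the killing; the names and default value are hypothetical.

```go
import "time"

// affinityViolationGracePeriod would be a configurable property of the component
// that performs the eviction, applied uniformly to all pods (not per-pod).
var affinityViolationGracePeriod = 5 * time.Minute // illustrative default

// shouldEvict reports whether a pod's RequiredDuringSchedulingRequiredDuringExecution
// affinity has been unmet long enough to trigger eviction, given the last time the
// affinity terms were observed to be satisfied.
func shouldEvict(lastSatisfied, now time.Time) bool {
	return now.Sub(lastSatisfied) > affinityViolationGracePeriod
}
```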
## Implementation plan
1. Add the `Affinity` field to PodSpec and the `PodAffinity` and `PodAntiAffinity` types to the API along with all of their descendant types.
2. Implement a scheduler predicate that takes `RequiredDuringSchedulingIgnoredDuringExecution`
affinity and anti-affinity into account. Include a workaround for the issue described at the end of the Affinity section of the Examples section (can't schedule first pod).
3. Implement a scheduler priority function that takes `PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into account.
4. Implement admission controller that rejects requests that specify "all namespaces" with non-"node" TopologyKey for `RequiredDuringScheduling` anti-affinity.
This admission controller should be enabled by default.
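A sketch of the admission check described in step 4; the types are simplified stand-ins for the real API objects.

```go
// AntiAffinityTerm is a stand-in for a RequiredDuringScheduling anti-affinity term;
// AllNamespaces stands in for however "all namespaces" is expressed in the selector.
type AntiAffinityTerm struct {
	AllNamespaces bool
	TopologyKey   string
}

// admitPod returns false if any hard anti-affinity term could block scheduling of
// other namespaces' pods across a topology wider than a single node.
func admitPod(hardAntiAffinity []AntiAffinityTerm) bool {
	for _, t := range hardAntiAffinity {
		if t.AllNamespaces && t.TopologyKey != "node" {
			return false
		}
	}
	return true
}
```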