# Node affinity and NodeSelector

## Introduction

This document proposes a new label selector representation, called `NodeSelector`,
that is similar in many ways to `LabelSelector`, but is a bit more flexible and is
intended to be used only for selecting nodes.

In addition, we propose to replace the `map[string]string` in `PodSpec` that the scheduler
currently uses as part of restricting the set of nodes onto which a pod is
eligible to schedule, with a field of type `Affinity` that contains one or
more affinity specifications. In this document we discuss `NodeAffinity`, which
contains one or more of the following:

* a field called `RequiredDuringSchedulingRequiredDuringExecution` that will be
  represented by a `NodeSelector`, and thus generalizes the scheduling behavior of
  the current `map[string]string` but still serves the purpose of restricting
  the set of nodes onto which the pod can schedule. In addition, unlike the behavior
  of the current `map[string]string`, when it becomes violated the system will
  try to eventually evict the pod from its node.
* a field called `RequiredDuringSchedulingIgnoredDuringExecution` which is identical
  to `RequiredDuringSchedulingRequiredDuringExecution` except that the system
  may or may not try to eventually evict the pod from its node.
* a field called `PreferredDuringSchedulingIgnoredDuringExecution` that specifies which nodes are
  preferred for scheduling among those that meet all scheduling requirements.

(In practice, as discussed later, we will actually *add* the `Affinity` field
rather than replacing `map[string]string`, due to backward compatibility requirements.)

The affinity specifications described above allow a pod to request various properties
that are inherent to nodes, for example "run this pod on a node with an Intel CPU" or, in a
multi-zone cluster, "run this pod on a node in zone Z."
([This issue](https://github.com/kubernetes/kubernetes/issues/9044) describes
some of the properties that a node might publish as labels, which affinity expressions
can match against.)
They do *not* allow a pod to request to schedule
(or not schedule) on a node based on what other pods are running on the node. That
feature is called "inter-pod topological affinity/anti-affinity" and is described
[here](https://github.com/kubernetes/kubernetes/pull/18265).

## API

### NodeSelector

```go
// A node selector represents the union of the results of one or more label queries
// over a set of nodes; that is, it represents the OR of the selectors represented
// by the nodeSelectorTerms.
type NodeSelector struct {
	// nodeSelectorTerms is a list of node selector terms. The terms are ORed.
	NodeSelectorTerms []NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
}

// An empty node selector term matches all objects. A null node selector term
// matches no objects.
type NodeSelectorTerm struct {
	// matchExpressions is a list of node selector requirements. The requirements are ANDed.
	MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"`
}

// A node selector requirement is a selector that contains values, a key, and an operator
// that relates the key and values.
type NodeSelectorRequirement struct {
	// key is the label key that the selector applies to.
	Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
	// operator represents a key's relationship to a set of values.
	// Valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt.
	Operator NodeSelectorOperator `json:"operator"`
	// values is an array of string values. If the operator is In or NotIn,
	// the values array must be non-empty. If the operator is Exists or DoesNotExist,
	// the values array must be empty. If the operator is Gt or Lt, the values
	// array must have a single element, which will be interpreted as an integer.
	// This array is replaced during a strategic merge patch.
	Values []string `json:"values,omitempty"`
}

// A node selector operator is the set of operators that can be used in
// a node selector requirement.
type NodeSelectorOperator string

const (
	NodeSelectorOpIn           NodeSelectorOperator = "In"
	NodeSelectorOpNotIn        NodeSelectorOperator = "NotIn"
	NodeSelectorOpExists       NodeSelectorOperator = "Exists"
	NodeSelectorOpDoesNotExist NodeSelectorOperator = "DoesNotExist"
	NodeSelectorOpGt           NodeSelectorOperator = "Gt"
	NodeSelectorOpLt           NodeSelectorOperator = "Lt"
)
```
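
To illustrate the OR-of-ANDs semantics, here is a sketch of a `NodeSelector` that matches
nodes that are either in zone `Z1` or have more than 16 cores. The label keys `zone` and
`cores` are illustrative only, not part of this proposal.

```go
// The two NodeSelectorTerms are ORed; the requirements inside each
// term are ANDed. The label keys "zone" and "cores" are illustrative.
selector := NodeSelector{
	NodeSelectorTerms: []NodeSelectorTerm{
		{
			MatchExpressions: []NodeSelectorRequirement{
				{Key: "zone", Operator: NodeSelectorOpIn, Values: []string{"Z1"}},
			},
		},
		{
			MatchExpressions: []NodeSelectorRequirement{
				// Gt takes a single value, interpreted as an integer.
				{Key: "cores", Operator: NodeSelectorOpGt, Values: []string{"16"}},
			},
		},
	},
}
```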

### NodeAffinity

We will add one field to `PodSpec`

```go
Affinity *Affinity `json:"affinity,omitempty"`
```

The `Affinity` type is defined as follows

```go
type Affinity struct {
	NodeAffinity *NodeAffinity `json:"nodeAffinity,omitempty"`
}

type NodeAffinity struct {
	// If the affinity requirements specified by this field are not met at
	// scheduling time, the pod will not be scheduled onto the node.
	// If the affinity requirements specified by this field cease to be met
	// at some point during pod execution (e.g. due to a node label update),
	// the system will try to eventually evict the pod from its node.
	RequiredDuringSchedulingRequiredDuringExecution *NodeSelector `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
	// If the affinity requirements specified by this field are not met at
	// scheduling time, the pod will not be scheduled onto the node.
	// If the affinity requirements specified by this field cease to be met
	// at some point during pod execution (e.g. due to a node label update),
	// the system may or may not try to eventually evict the pod from its node.
	RequiredDuringSchedulingIgnoredDuringExecution *NodeSelector `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
	// The scheduler will prefer to schedule pods to nodes that satisfy
	// the affinity expressions specified by this field, but it may choose
	// a node that violates one or more of the expressions. The node that is
	// most preferred is the one with the greatest sum of weights, i.e.
	// for each node that meets all of the scheduling requirements (resource
	// request, RequiredDuringScheduling affinity expressions, etc.),
	// compute a sum by iterating through the elements of this field and adding
	// "weight" to the sum if the node matches the corresponding MatchExpressions; the
	// node(s) with the highest sum are the most preferred.
	PreferredDuringSchedulingIgnoredDuringExecution []PreferredSchedulingTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

// An empty preferred scheduling term matches all objects with implicit weight 0
// (i.e. it's a no-op). A null preferred scheduling term matches no objects.
type PreferredSchedulingTerm struct {
	// weight is in the range 1-100
	Weight int `json:"weight"`
	// matchExpressions is a list of node selector requirements. The requirements are ANDed.
	MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"`
}
```
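
To make the weighting concrete, the following sketch expresses "strongly prefer zone `Z1`,
mildly prefer SSD-backed nodes." The label keys and weight values are illustrative, not
part of the proposal.

```go
// A node carrying both labels scores 100, one with only the zone label
// scores 80, one with only the disk label scores 20, and so on; the
// node(s) with the highest sum are the most preferred.
affinity := Affinity{
	NodeAffinity: &NodeAffinity{
		PreferredDuringSchedulingIgnoredDuringExecution: []PreferredSchedulingTerm{
			{
				Weight: 80,
				MatchExpressions: []NodeSelectorRequirement{
					{Key: "zone", Operator: NodeSelectorOpIn, Values: []string{"Z1"}},
				},
			},
			{
				Weight: 20,
				MatchExpressions: []NodeSelectorRequirement{
					{Key: "disk", Operator: NodeSelectorOpIn, Values: []string{"ssd"}},
				},
			},
		},
	},
}
```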

Unfortunately, the name of the existing `map[string]string` field in PodSpec is `NodeSelector`
and we can't change it since this name is part of the API. Hopefully this won't
cause too much confusion.

## Examples

* Run this pod on a node with an Intel or AMD CPU
* Run this pod on a node in availability zone Z
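
As a sketch, these two requests could be expressed as follows. The label keys
`cpu-vendor` and `zone` are illustrative; the labels a node actually publishes
are discussed in #9044.

```go
// Run this pod on a node with an Intel or AMD CPU: a single term whose
// requirement uses the In operator with both acceptable values.
cpuAffinity := Affinity{
	NodeAffinity: &NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: &NodeSelector{
			NodeSelectorTerms: []NodeSelectorTerm{
				{
					MatchExpressions: []NodeSelectorRequirement{
						{Key: "cpu-vendor", Operator: NodeSelectorOpIn, Values: []string{"intel", "amd"}},
					},
				},
			},
		},
	},
}

// Run this pod on a node in availability zone Z.
zoneAffinity := Affinity{
	NodeAffinity: &NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: &NodeSelector{
			NodeSelectorTerms: []NodeSelectorTerm{
				{
					MatchExpressions: []NodeSelectorRequirement{
						{Key: "zone", Operator: NodeSelectorOpIn, Values: []string{"Z"}},
					},
				},
			},
		},
	},
}
```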

## Backward compatibility

When we add `Affinity` to PodSpec, we will deprecate, but not remove, the current field in PodSpec

```go
NodeSelector map[string]string `json:"nodeSelector,omitempty"`
```

Old versions of the scheduler will ignore the `Affinity` field.
New versions of the scheduler will apply their scheduling predicates to both `Affinity` and `nodeSelector`,
i.e. the pod can only schedule onto nodes that satisfy both sets of requirements. We will not
attempt to convert between `Affinity` and `nodeSelector`.
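
A minimal sketch of how a new scheduler might combine the two fields, assuming a
hypothetical `Pod` type wrapping the `PodSpec` above and a hypothetical
`nodeSelectorMatches` helper that evaluates a `NodeSelector` against a node's labels:

```go
// Both the legacy map and the new Affinity field must be satisfied
// for the pod to fit the node.
func podMatchesNodeLabels(pod *Pod, nodeLabels map[string]string) bool {
	// Legacy field: every key/value pair must be present on the node.
	for k, v := range pod.Spec.NodeSelector {
		if nodeLabels[k] != v {
			return false
		}
	}
	// New field: the RequiredDuringScheduling selector, if any, must match.
	if a := pod.Spec.Affinity; a != nil && a.NodeAffinity != nil {
		if req := a.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution; req != nil {
			if !nodeSelectorMatches(req, nodeLabels) {
				return false
			}
		}
	}
	return true
}
```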

Old versions of non-scheduling clients will not know how to do anything semantically meaningful
with `Affinity`, but we don't expect that this will cause a problem.

See [this comment](https://github.com/kubernetes/kubernetes/issues/341#issuecomment-140809259)
for more discussion.

Users should not start using `NodeAffinity` until the full implementation has been in Kubelet and the master
for enough binary versions that we feel comfortable that we will not need to roll back either Kubelet
or master to a version that does not support them. Longer-term we will use a programmatic approach to
enforcing this (#4855).

## Implementation plan

1. Add the `Affinity` field to PodSpec and the `NodeAffinity`, `PreferredDuringSchedulingIgnoredDuringExecution`,
   and `RequiredDuringSchedulingIgnoredDuringExecution` types to the API
2. Implement a scheduler predicate that takes `RequiredDuringSchedulingIgnoredDuringExecution` into account
3. Implement a scheduler priority function that takes `PreferredDuringSchedulingIgnoredDuringExecution` into account
   (a sketch follows this list)
4. At this point, the feature can be deployed and `PodSpec.NodeSelector` can be marked as deprecated
5. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to the API
6. Modify the scheduler predicate from step 2 to also take `RequiredDuringSchedulingRequiredDuringExecution` into account
7. Add `RequiredDuringSchedulingRequiredDuringExecution` to Kubelet's admission decision
8. Implement code in Kubelet *or* the controllers that evicts a pod that no longer satisfies
   `RequiredDuringSchedulingRequiredDuringExecution`
   (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
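
A sketch of the priority function in step 3, implementing the weighted sum described in
the `NodeAffinity` comments above; `termMatches`, which evaluates a term's
`MatchExpressions` against a node's labels, is an assumed helper, not part of the proposal.

```go
// Score a node for one pod: sum the weights of the preferred terms
// the node satisfies. Node(s) with the highest sum are most preferred.
func nodeAffinityScore(terms []PreferredSchedulingTerm, nodeLabels map[string]string) int {
	score := 0
	for _, t := range terms {
		if len(t.MatchExpressions) == 0 {
			continue // an empty term is a no-op (implicit weight 0)
		}
		if termMatches(t.MatchExpressions, nodeLabels) {
			score += t.Weight
		}
	}
	return score
}
```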

We assume Kubelet publishes labels describing the node's membership in all of the relevant scheduling
domains (e.g. node name, rack name, availability zone name, etc.). See #9044.
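
For instance, a node might carry a label set along these lines; the exact keys are still
being settled in #9044, so the ones shown here are illustrative only.

```go
// Illustrative node labels covering the scheduling domains above.
nodeLabels := map[string]string{
	"kubernetes.io/hostname": "node-17",
	"rack":                   "rack-3",
	"zone":                   "us-central1-a",
}
```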

## Extensibility

The design described here is the result of careful analysis of use cases, a decade of experience
with Borg at Google, and a review of similar features in other open-source container orchestration
systems. We believe that it properly balances the goal of expressiveness against the goals of
simplicity and efficiency of implementation. However, we recognize that
use cases may arise in the future that cannot be expressed using the syntax described here.
Although we are not implementing an affinity-specific extensibility mechanism for a variety
of reasons (simplicity of the codebase, simplicity of cluster deployment, desire for Kubernetes
users to get a consistent experience, etc.), the regular Kubernetes
annotation mechanism can be used to add or replace affinity rules. The way this would work is:

1. Define one or more annotations to describe the new affinity rule(s)
1. User (or an admission controller) attaches the annotation(s) to pods to request the desired scheduling behavior.
   If the new rule(s) *replace* one or more fields of `Affinity` then the user would omit those fields
   from `Affinity`; if they are *additional rules*, then the user would fill in `Affinity` as well as the
   annotation(s).
1. Scheduler takes the annotation(s) into account when scheduling.
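
As a hedged sketch, an extended scheduler might define its own annotation key and consult
it alongside `Affinity`; neither the key nor the value syntax below is defined by this
proposal, both are hypothetical.

```go
// Hypothetical annotation key carrying an extension affinity rule.
const customAffinityAnnotation = "scheduler.example.com/node-affinity"

// The extended scheduler reads the annotation off the pod's metadata and
// folds the parsed rule into its predicates and priority functions.
func customRule(annotations map[string]string) (rule string, ok bool) {
	rule, ok = annotations[customAffinityAnnotation]
	return rule, ok
}
```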

If some particular new syntax becomes popular, we would consider upstreaming it by integrating
it into the standard `Affinity`.

## Future work

Are there any other fields we should convert from `map[string]string` to `NodeSelector`?

## Related issues

The review for this proposal is in #18261.

The main related issue is #341. Issue #367 is also related. Those issues reference other
related issues.