k3s/pkg
Kubernetes Submit Queue a3f40dd8df
Merge pull request #60856 from jiayingz/race-fix
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Fixes the races around devicemanager Allocate() and endpoint deletion.

There is a race in predicateAdmitHandler Admit(): getNodeAnyWayFunc()
can return a Node with non-zero device plugin resource allocatable for a
non-existent endpoint. The race can happen when a device plugin fails,
but it is more likely when kubelet restarts because, with the current
registration model, there is a time gap between kubelet restart and device
plugin re-registration. During this window, even though devicemanager may
have removed the resource during the initial GetCapacity() call, kubelet
may overwrite the device plugin resource capacity/allocatable with the
old value when a node update from the API server comes in later. This
can cause a pod to be started without the proper device runtime config set.

To solve this problem, introduce endpointStopGracePeriod. When a device
plugin fails, don't immediately remove the endpoint but set stopTime in
its endpoint. During kubelet restart, create endpoints with stopTime set
for any checkpointed registered resource. An endpoint is considered to be
in stopGracePeriod if its stopTime is set. This allows us to track which
resources should be handled by devicemanager during the time gap.
When an endpoint's stopGracePeriod expires, we remove the endpoint and
its resource. This allows the resource to be exported through other channels
(e.g., by directly updating node status through the API server) if there is
such a use case. Currently endpointStopGracePeriod is set to 5 minutes.
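
The grace-period bookkeeping described above can be sketched roughly as
follows. This is a minimal illustration, not the code from this PR: the
`endpoint` and `manager` types, the `markStopped` and `reap` helpers, the
`restartTime` value, and the `example.com/gpu` resource name are all
hypothetical. Only the name endpointStopGracePeriod, the stopTime field, and
the 5-minute value come from the description.

```go
package main

import (
	"fmt"
	"time"
)

// endpointStopGracePeriod uses the 5-minute value named in the description.
const endpointStopGracePeriod = 5 * time.Minute

// restartTime is an arbitrary, hypothetical timestamp used by the example.
var restartTime = time.Date(2018, 3, 12, 0, 0, 0, 0, time.UTC)

// endpoint is a simplified, hypothetical stand-in for a device plugin endpoint.
type endpoint struct {
	resourceName string
	stopTime     time.Time // zero value means the endpoint is still running
}

// isStopped reports whether stopTime has been set.
func (e *endpoint) isStopped() bool { return !e.stopTime.IsZero() }

// stopGracePeriodExpired reports whether a stopped endpoint has outlived
// the grace period as of now.
func (e *endpoint) stopGracePeriodExpired(now time.Time) bool {
	return e.isStopped() && now.Sub(e.stopTime) > endpointStopGracePeriod
}

// manager tracks endpoints by resource name (hypothetical type).
type manager struct {
	endpoints map[string]*endpoint
}

// markStopped records the stop time instead of deleting the endpoint.
func (m *manager) markStopped(resource string, now time.Time) {
	if e, ok := m.endpoints[resource]; ok {
		e.stopTime = now
	}
}

// reap removes every endpoint whose grace period has expired and returns
// the names of the removed resources.
func (m *manager) reap(now time.Time) []string {
	var removed []string
	for name, e := range m.endpoints {
		if e.stopGracePeriodExpired(now) {
			delete(m.endpoints, name)
			removed = append(removed, name)
		}
	}
	return removed
}

func main() {
	m := &manager{endpoints: map[string]*endpoint{
		"example.com/gpu": &endpoint{resourceName: "example.com/gpu"},
	}}
	m.markStopped("example.com/gpu", restartTime)

	// One minute after the stop: still inside the grace period, nothing removed.
	fmt.Println(m.reap(restartTime.Add(time.Minute)))
	// Six minutes after the stop: the grace period has expired, endpoint removed.
	fmt.Println(m.reap(restartTime.Add(6 * time.Minute)))
}
```

Keeping the stopped endpoint in the map, rather than deleting it at once, is
what lets the manager keep claiming the resource during the gap.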

Given that an endpoint is no longer immediately removed upon disconnection,
mark all its devices unhealthy so that the resource allocatable change is
signaled to the scheduler, which then avoids scheduling more pods to the node.
While a device plugin endpoint is in stopGracePeriod, pods requesting the
corresponding resource will fail the admission handler.
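
As a rough sketch of the behavior above, assuming a much simplified model:
stopping an endpoint marks every device unhealthy (so allocatable drops to
zero), and admission is rejected while the endpoint is stopped. The `device`
and `endpoint` types, the `admit` function, and the error values are
hypothetical and are not the devicemanager API. Treating a resource with no
endpoint as "not managed here" is also an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// device is a hypothetical, simplified device record.
type device struct {
	id      string
	healthy bool
}

// endpoint is a hypothetical, simplified device plugin endpoint.
type endpoint struct {
	stopped bool
	devices []device
}

// stop marks the endpoint as stopped and all of its devices unhealthy,
// instead of removing the endpoint right away.
func (e *endpoint) stop() {
	e.stopped = true
	for i := range e.devices {
		e.devices[i].healthy = false
	}
}

// allocatable counts the healthy devices of the endpoint.
func (e *endpoint) allocatable() int {
	n := 0
	for _, d := range e.devices {
		if d.healthy {
			n++
		}
	}
	return n
}

var (
	errEndpointStopped = errors.New("device plugin endpoint is in its stop grace period")
	errInsufficient    = errors.New("not enough healthy devices")
)

// admit returns nil if a pod requesting count devices of resource may start.
// Assumption: a resource without an endpoint is not handled by this manager.
func admit(endpoints map[string]*endpoint, resource string, count int) error {
	e, ok := endpoints[resource]
	if !ok {
		return nil
	}
	if e.stopped {
		return errEndpointStopped
	}
	if count > e.allocatable() {
		return errInsufficient
	}
	return nil
}

func main() {
	e := &endpoint{devices: []device{{id: "dev0", healthy: true}, {id: "dev1", healthy: true}}}
	endpoints := map[string]*endpoint{"example.com/gpu": e}

	fmt.Println(admit(endpoints, "example.com/gpu", 1)) // admitted while the plugin is healthy
	e.stop()
	fmt.Println(e.allocatable())                        // no healthy devices remain
	fmt.Println(admit(endpoints, "example.com/gpu", 1)) // rejected during the grace period
}
```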

Tested:
Ran the GPUDevicePlugin e2e_node test 100 times; all runs now pass.



**Which issue(s) this PR fixes**:
Fixes https://github.com/kubernetes/kubernetes/issues/60176

**Release note**:

```release-note
Fixes the races around devicemanager Allocate() and endpoint deletion.
```
2018-03-12 02:50:13 -07:00
..
api API Changes for RunAsGroup and Implementation and e2e 2018-02-28 22:09:56 -08:00
apis Merge pull request #60682 from hanxiaoshuai/update0302 2018-03-07 17:20:04 -08:00
auth Autogenerated: hack/update-bazel.sh 2018-02-16 13:43:01 -08:00
capabilities Autogenerated: hack/update-bazel.sh 2018-02-16 13:43:01 -08:00
client Run hack/update-all.sh 2018-02-26 17:16:14 -08:00
cloudprovider Get external IP for azure standard nodes 2018-03-09 11:10:44 +08:00
controller Merge pull request #59862 from k82cn/k8s_59194_3 2018-03-11 06:19:27 -07:00
credentialprovider Autogenerated: hack/update-bazel.sh 2018-02-16 13:43:01 -08:00
features Task 2: Schedule DaemonSet Pods by default scheduler. 2018-03-08 17:36:49 +08:00
fieldpath Autogenerated: hack/update-bazel.sh 2018-02-16 13:43:01 -08:00
generated generated 2018-02-27 21:21:14 -08:00
kubeapiserver Merge pull request #55856 from miaoyq/replace-for-with-sets 2018-02-28 00:00:32 -08:00
kubectl Merge pull request #60950 from juanvallejo/jvallejo/use-temp-kubeconfig-file-tests 2018-03-09 15:00:21 -08:00
kubelet Merge pull request #60856 from jiayingz/race-fix 2018-03-12 02:50:13 -07:00
kubemark add nodeport-addresses flag for kube-proxy 2018-02-26 23:48:46 +08:00
master implement token authenticator for new id tokens 2018-02-27 17:20:46 -08:00
printers Add missing table converters for server side printing 2018-02-28 17:27:45 +01:00
probe Autogenerated: hack/update-bazel.sh 2018-02-16 13:43:01 -08:00
proxy Merge pull request #56880 from MrHohn/kube-proxy-ipv6-fix 2018-02-28 00:00:29 -08:00
quota Merge pull request #57375 from tianshapjq/cleanup-useless-func-core/services.go 2018-02-28 01:12:29 -08:00
registry Avoid reallocating of map in PodToSelectableFields 2018-03-07 12:26:02 +01:00
routes Remove /ui/ redirect 2018-02-12 10:54:33 -05:00
scheduler Fix a grammatical error in a comment 2018-03-02 21:30:44 +08:00
security Autogenerated: hack/update-bazel.sh 2018-02-16 13:43:01 -08:00
securitycontext API Changes for RunAsGroup and Implementation and e2e 2018-02-28 22:09:56 -08:00
serviceaccount implement token authenticator for new id tokens 2018-02-27 17:20:46 -08:00
ssh Autogenerated: hack/update-bazel.sh 2018-02-16 13:43:01 -08:00
util Merge pull request #56880 from MrHohn/kube-proxy-ipv6-fix 2018-02-28 00:00:29 -08:00
version Require boilerplate on Bazel Skylark source files 2018-02-16 13:44:04 -08:00
volume add remount logic for azure file plugin 2018-03-01 07:46:07 +00:00
watch/json remove outdate package 2018-01-15 23:17:19 +08:00
windows/service Add support for binaries to run as Windows services 2018-03-07 00:51:36 +01:00
.import-restrictions
BUILD Add support for binaries to run as Windows services 2018-03-07 00:51:36 +01:00
OWNERS