Lightweight Kubernetes
 
 
 
 
Kubernetes Submit Queue a3f40dd8df
Merge pull request #60856 from jiayingz/race-fix
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Fixes the races around devicemanager Allocate() and endpoint deletion.

There is a race in predicateAdmitHandler Admit(): getNodeAnyWayFunc()
can return a Node with non-zero deviceplugin resource allocatable for a
non-existent endpoint. The race can happen when a device plugin fails,
but is more likely when kubelet restarts because, with the current
registration model, there is a time gap between kubelet restart and device
plugin re-registration. During this window, even though devicemanager may
have removed the resource during the initial GetCapacity() call, kubelet
may overwrite the device plugin resource capacity/allocatable with the
old value when a node update from the API server arrives later. This
can cause a pod to be started without the proper device runtime config set.

To solve this problem, introduce endpointStopGracePeriod. When a device
plugin fails, don't immediately remove the endpoint but set stopTime in
its endpoint. During kubelet restart, create endpoints with stopTime set
for any checkpointed registered resource. The endpoint is considered to be
in stopGracePeriod if its stopTime is set. This allows us to track which
resources should be handled by devicemanager during the time gap.
When an endpoint's stopGracePeriod expires, we remove the endpoint and
its resource. This allows the resource to be exported through other channels
(e.g., by directly updating node status through the API server) if there is
such a use case. Currently endpointStopGracePeriod is set to 5 minutes.

Given that an endpoint is no longer immediately removed upon disconnection,
mark all its devices unhealthy so that we can signal the resource allocatable
change to the scheduler to avoid scheduling more pods to the node.
When a device plugin endpoint is in stopGracePeriod, pods requesting the
corresponding resource will be rejected by the admission handler.
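
As a rough sketch of that behavior, assuming a much simpler model than the
real devicemanager: the `pluginState` type, `markStopped`, `allocatable`, and
`admit` below are hypothetical names introduced only for illustration. They
show devices being marked unhealthy on disconnection and admission failing
while the endpoint is stopped.

```go
package main

import (
	"errors"
	"fmt"
)

// device is a hypothetical, simplified device record.
type device struct {
	id      string
	healthy bool
}

// pluginState is a hypothetical stand-in for one resource's endpoint state.
type pluginState struct {
	stopped bool // true while the endpoint is in its stop grace period
	devices []device
}

// markStopped records the disconnection and marks every device unhealthy,
// so the allocatable count reported for the resource drops to zero.
func (p *pluginState) markStopped() {
	p.stopped = true
	for i := range p.devices {
		p.devices[i].healthy = false
	}
}

// allocatable counts the healthy devices.
func (p *pluginState) allocatable() int {
	n := 0
	for _, d := range p.devices {
		if d.healthy {
			n++
		}
	}
	return n
}

var errEndpointStopped = errors.New("device plugin endpoint is in its stop grace period")

// admit rejects a pod that requests devices from a stopped endpoint or
// requests more devices than are currently allocatable.
func admit(p *pluginState, requested int) error {
	if requested == 0 {
		return nil
	}
	if p.stopped {
		return errEndpointStopped
	}
	if requested > p.allocatable() {
		return fmt.Errorf("requested %d devices, only %d allocatable", requested, p.allocatable())
	}
	return nil
}

func main() {
	p := &pluginState{devices: []device{{id: "dev-0", healthy: true}, {id: "dev-1", healthy: true}}}
	fmt.Println(admit(p, 1)) // admitted while the plugin is connected
	p.markStopped()
	fmt.Println(p.allocatable(), admit(p, 1)) // zero allocatable, admission fails
}
```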

Tested:
Ran the GPUDevicePlugin e2e_node test 100 times; all runs now pass.



**What this PR does / why we need it**:

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes https://github.com/kubernetes/kubernetes/issues/60176

**Special notes for your reviewer**:

**Release note**:

```release-note
Fixes the races around devicemanager Allocate() and endpoint deletion.
```
2018-03-12 02:50:13 -07:00
.github Merge pull request #54114 from xiangpengzhao/fix-pr-template 2017-10-30 18:37:06 -07:00
Godeps Merge pull request #60450 from verult/repd-beta-integration 2018-03-08 16:27:05 -08:00
api Merge pull request #60682 from hanxiaoshuai/update0302 2018-03-07 17:20:04 -08:00
build Rollback etcd server version to 3.1.11 due to #60589 2018-03-08 13:07:15 +01:00
cluster Merge pull request #60926 from crassirostris/audit-log-gce-config 2018-03-09 20:00:17 -08:00
cmd Merge pull request #60891 from shyamjvs/go-back-to-etcd-3.1.10 2018-03-08 12:45:46 -08:00
docs Merge pull request #60682 from hanxiaoshuai/update0302 2018-03-07 17:20:04 -08:00
examples Merge pull request #60365 from CaoShuFeng/example_test 2018-03-08 21:06:33 -08:00
hack Merge pull request #60866 from fisherxu/autodate 2018-03-10 06:40:44 -08:00
logo Don't use strokes in the logo SVG 2017-10-12 09:38:56 -07:00
pkg Merge pull request #60856 from jiayingz/race-fix 2018-03-12 02:50:13 -07:00
plugin Merge pull request #55019 from mikedanese/svcacct 2018-02-27 10:50:46 -08:00
staging Merge pull request #60943 from jennybuckley/webhook-https-url 2018-03-08 15:18:46 -08:00
test Merge pull request #60856 from jiayingz/race-fix 2018-03-12 02:50:13 -07:00
third_party Merge pull request #60506 from php-coder/fix_suppress_gdate_cmd 2018-02-28 07:20:25 -08:00
translations Merge pull request #51925 from zhanghuidinah/fix-broken-link 2018-02-27 21:40:21 -08:00
vendor Merge pull request #60450 from verult/repd-beta-integration 2018-03-08 16:27:05 -08:00
.bazelrc move build related files out of the root directory 2017-05-15 15:53:54 -07:00
.generated_files
.gitattributes Hide generated files only on github 2018-01-22 10:58:48 +01:00
.gitignore fix all the typos across the project 2018-02-11 11:04:14 +08:00
.kazelcfg.json Switch from gazel to kazel, and move kazelcfg into build/root 2017-07-18 12:48:51 -07:00
BUILD.bazel move build related files out of the root directory 2017-05-15 15:53:54 -07:00
CHANGELOG-1.2.md Update TOC of CHANGELOG 2017-09-09 13:38:29 +08:00
CHANGELOG-1.3.md fix the format for github error 2018-01-31 14:49:29 +08:00
CHANGELOG-1.4.md fix the format for github error 2018-02-02 18:44:27 +08:00
CHANGELOG-1.5.md fix typo in kubeadm 2018-02-06 13:48:18 +08:00
CHANGELOG-1.6.md Fix typo 2018-02-01 19:11:19 +08:00
CHANGELOG-1.7.md Update CHANGELOG-1.7.md for v1.7.13. 2018-03-01 09:06:35 +00:00
CHANGELOG-1.8.md Update CHANGELOG-1.8.md for v1.8.8. 2018-02-09 15:01:39 -08:00
CHANGELOG-1.9.md Fix incorrectly formatted URL 2018-02-22 12:20:54 -08:00
CHANGELOG-1.10.md Update CHANGELOG-1.10.md for v1.10.0-beta.2. 2018-03-07 21:17:37 +00:00
CHANGELOG.md Update release note links for 1.10 2018-01-17 22:45:12 +01:00
CONTRIBUTING.md Pointed to community/contributors/guide/README.md 2017-12-15 22:08:34 +05:30
LICENSE
Makefile move build related files out of the root directory 2017-05-15 15:53:54 -07:00
Makefile.generated_files move build related files out of the root directory 2017-05-15 15:53:54 -07:00
OWNERS Fix my incorrect username in #46649 2017-08-10 11:59:54 -07:00
OWNERS_ALIASES Merge pull request #60603 from m1093782566/milestone 2018-03-02 20:08:56 -08:00
README.md Update README.md 2018-02-11 04:34:01 +00:00
SUPPORT.md Add a SUPPORT.md file for github 2017-08-11 14:42:36 -04:00
WORKSPACE move build related files out of the root directory 2017-05-15 15:53:54 -07:00
code-of-conduct.md Update code-of-conduct.md 2017-12-20 13:33:36 -05:00
labels.yaml Merge pull request #51848 from xiangpengzhao/milestone-label 2017-09-05 15:46:19 -07:00

README.md

Kubernetes



Kubernetes is an open source system for managing containerized applications across multiple hosts. It provides basic mechanisms for deployment, maintenance, and scaling of applications.

Kubernetes builds upon a decade and a half of experience at Google running production workloads at scale using a system called Borg, combined with best-of-breed ideas and practices from the community.

Kubernetes is hosted by the Cloud Native Computing Foundation (CNCF). If you are a company that wants to help shape the evolution of technologies that are container-packaged, dynamically-scheduled and microservices-oriented, consider joining the CNCF. For details about who's involved and how Kubernetes plays a role, read the CNCF announcement.


To start using Kubernetes

See our documentation on kubernetes.io.

Try our interactive tutorial.

Take a free course on Scalable Microservices with Kubernetes.

To start developing Kubernetes

The community repository hosts all information about building Kubernetes from source, how to contribute code and documentation, who to contact about what, etc.

If you want to build Kubernetes right away there are two options:

You have a working Go environment.

    $ go get -d k8s.io/kubernetes
    $ cd $GOPATH/src/k8s.io/kubernetes
    $ make

You have a working Docker environment.

    $ git clone https://github.com/kubernetes/kubernetes
    $ cd kubernetes
    $ make quick-release

For the full story, head over to the developer's documentation.

Support

If you need support, start with the troubleshooting guide, and work your way through the process that we've outlined.

That said, if you have questions, reach out to us one way or another.
