k3s/hack
Kubernetes Submit Queue 6de28fab7d Merge pull request #42942 from vishh/gpu-cont-fix
Automatic merge from submit-queue (batch tested with PRs 42942, 42935)

[Bug] Handle container restarts and avoid using runtime pod cache while allocating GPUs

Fixes #42412

**Background**
Support for multiple GPUs is an experimental feature in v1.6.
Container restarts were handled incorrectly, which resulted in stranded GPUs.
Kubelet incorrectly uses the runtime cache to track running pods, which can result in race conditions (as it did in other parts of kubelet). As a result, the same GPU can be assigned to multiple pods.

**What does this PR do**
This PR tracks the assignment of GPUs to containers and, when a container restarts, returns its pre-allocated GPUs instead of (incorrectly) allocating new ones.
The GPU manager is updated to consume a list of active pods derived from the apiserver cache instead of the runtime cache.
Node e2e has been extended to validate this failure scenario.
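
The allocation behavior described above can be illustrated with a minimal Go sketch. It is not the kubelet code: `gpuManager`, `containerKey`, `Allocate`, and `reclaim` are hypothetical names, the device paths are placeholders, and the `activePods` argument stands in for the list of active pods derived from the apiserver cache.

```go
package main

import "errors"

// containerKey identifies one container of one pod (hypothetical type).
type containerKey struct {
	podUID    string
	container string
}

// gpuManager is a simplified, hypothetical stand-in for the kubelet GPU manager.
type gpuManager struct {
	free      []string                  // devices not assigned to any container
	allocated map[containerKey][]string // devices already assigned, per container
}

func newGPUManager(devices []string) *gpuManager {
	return &gpuManager{
		free:      append([]string(nil), devices...),
		allocated: make(map[containerKey][]string),
	}
}

// reclaim returns the GPUs of pods that are no longer active to the free list.
func (m *gpuManager) reclaim(activePods []string) {
	active := make(map[string]bool, len(activePods))
	for _, uid := range activePods {
		active[uid] = true
	}
	for key, devs := range m.allocated {
		if !active[key.podUID] {
			m.free = append(m.free, devs...)
			delete(m.allocated, key)
		}
	}
}

// Allocate returns n GPUs for the given container. A restarted container
// gets back the GPUs it already holds instead of consuming new ones.
func (m *gpuManager) Allocate(podUID, container string, n int, activePods []string) ([]string, error) {
	key := containerKey{podUID: podUID, container: container}
	if devs, ok := m.allocated[key]; ok {
		return devs, nil // container restart: reuse the earlier assignment
	}
	m.reclaim(activePods)
	if n > len(m.free) {
		return nil, errors.New("not enough free GPUs")
	}
	devs := append([]string(nil), m.free[:n]...) // copy to avoid aliasing the free list
	m.free = m.free[n:]
	m.allocated[key] = devs
	return devs, nil
}
```

In this sketch, calling `Allocate` twice for the same pod and container returns the same devices, so a restart does not strand a GPU, and devices held by pods missing from `activePods` are reclaimed before a new assignment is made.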

**Risk**
Minimal/None, since support for GPUs is an experimental feature that is turned off by default. The code is also isolated to the GPU manager in kubelet.

**Workarounds**
In the absence of this PR, users can mitigate the original issue by setting `RestartPolicyNever` in their pods.
There is, however, no workaround for the race condition caused by using the runtime cache.
Hence it is worth including this fix in v1.6.0.

cc @jianzhangbjz @seelam @kubernetes/sig-node-pr-reviews 

Replaces #42560
2017-03-14 10:19:17 -07:00
boilerplate Enable auto-generating sources rules 2017-01-05 14:14:13 -08:00
cmd/teststale
e2e-internal Split federation-{up,down} from e2e-{up,down}. 2017-02-24 14:27:31 -08:00
gen-swagger-doc
jenkins Add staging repos to GOPATH in verify-godeps 2017-03-01 10:23:30 -05:00
lib Merge pull request #42623 from liggitt/kubectl-version 2017-03-13 15:06:31 -07:00
make-rules Introduce new generator for apps/v1beta1 deployments 2017-03-10 12:08:01 +01:00
testdata add apply cmd tests for TPR 2017-02-02 15:20:45 -08:00
verify-flags Merge pull request #41794 from shashidharatd/federation-upgrade-tests-1 2017-03-10 22:02:15 -08:00
.linted_packages linter fixes 2017-03-13 10:58:26 -07:00
BUILD Add verify-gofmt as a Bazel test. 2017-02-10 17:00:28 -08:00
OWNERS Convert hack/e2e.go to a test-infra/kubetest shim 2017-02-02 17:42:46 -08:00
autogenerated_placeholder.txt
benchmark-go.sh unify newline format for benchmark-go.sh 2016-12-10 01:15:30 -08:00
benchmark-integration.sh
build-cross.sh
build-go.sh
build-ui.sh move swagger route to apiserver 2017-02-01 15:18:32 -05:00
cherry_pick_pull.sh hack/cherry_pick_pull.sh: cleanup patch files 2016-12-14 14:33:17 -08:00
dev-build-and-push.sh hack/dev-build-*: Run dev build instead of release build 2016-12-15 10:35:16 -07:00
dev-build-and-up.sh hack/dev-build-*: Run dev build instead of release build 2016-12-15 10:35:16 -07:00
dev-push-hyperkube.sh Rename build-tools/ back to build/ 2016-12-14 13:42:15 -08:00
e2e-node-test.sh
e2e.go Convert hack/e2e.go to a test-infra/kubetest shim 2017-02-02 17:42:46 -08:00
e2e_test.go hack/e2e_test.go's tester shouldn't stat files from the future 2017-02-15 15:59:47 -08:00
federated-ginkgo-e2e.sh [Federation] Unjoin only the joined clusters while bringing down the federation control plane. 2017-03-12 13:05:26 -07:00
generate-bindata.sh Run bindata generation from KUBE_ROOT 2017-01-10 14:28:19 -05:00
generate-docs.sh Move .generated_docs to docs/ so docs OWNERS can review / approve 2017-02-16 10:11:57 -08:00
get-build.sh
ginkgo-e2e.sh [Federation][init-11] Switch federation e2e tests to use the new federation control plane bootstrap via the `kubefed init` command. 2016-12-16 11:22:44 +05:30
godep-restore.sh hack/godep-restore.sh: use godep v79 which works 2017-03-12 18:43:10 +01:00
godep-save.sh Unify godep code in hack/*-godep*.sh 2017-03-09 15:03:13 +01:00
grab-profiles.sh Make all useage of sort deterministic 2016-10-20 16:47:20 -04:00
install-etcd.sh
list-feature-tests.sh Make all useage of sort deterministic 2016-10-20 16:47:20 -04:00
local-up-cluster.sh Merge pull request #42316 from feiskyer/cri-local 2017-03-01 07:09:19 -08:00
lookup_pull.py
print-workspace-status.sh bazel: save git version in kubernetes.tar.gz 2017-01-23 17:28:08 -08:00
run-in-gopath.sh
test-cmd.sh
test-go.sh
test-integration.sh
test-update-storage-objects.sh Update clusters to use 3.0.17 etcd 2017-02-23 10:08:50 +01:00
update-all.sh Unify godep code in hack/*-godep*.sh 2017-03-09 15:03:13 +01:00
update-api-reference-docs.sh update generation bash to handle vendor dir 2017-01-17 09:06:34 -05:00
update-bazel.sh update-bazel.sh to treat GOPATH as a path 2017-02-16 14:40:05 -08:00
update-codecgen.sh Make all useage of sort deterministic 2016-10-20 16:47:20 -04:00
update-codegen.sh Add settings API and admission controller 2017-03-01 13:04:28 -08:00
update-federation-api-reference-docs.sh update generation bash to handle vendor dir 2017-01-17 09:06:34 -05:00
update-federation-generated-swagger-docs.sh update generation bash to handle vendor dir 2017-01-17 09:06:34 -05:00
update-federation-openapi-spec.sh genericapiserver: move MasterCount and service options into master 2016-12-16 17:23:43 +01:00
update-federation-swagger-spec.sh Federation does not generate swagger spec correctly 2017-01-06 23:45:04 -05:00
update-generated-docs.sh Move .generated_docs to docs/ so docs OWNERS can review / approve 2017-02-16 10:11:57 -08:00
update-generated-protobuf-dockerized.sh spell check for test/* 2016-12-14 06:03:00 -08:00
update-generated-protobuf.sh Rename build-tools/ back to build/ 2016-12-14 13:42:15 -08:00
update-generated-runtime-dockerized.sh CRI: use more gogoprotobuf plugins 2017-01-25 13:52:24 -08:00
update-generated-runtime.sh Rename build-tools/ back to build/ 2016-12-14 13:42:15 -08:00
update-generated-swagger-docs.sh update generation bash to handle vendor dir 2017-01-17 09:06:34 -05:00
update-godep-licenses.sh make godep licenses/copyright check case insensitive 2016-10-24 18:00:08 -07:00
update-gofmt.sh hack/*.sh: re-add staging dirs to verify+update scripts 2017-02-17 08:51:31 +01:00
update-openapi-spec.sh Fix race in service IP allocation repair loop 2016-12-26 21:59:27 -08:00
update-staging-client-go.sh update-staging-{client-go,godeps}.sh: no godep-restore, pin godep, check workdir 2017-02-25 22:38:23 +01:00
update-staging-godeps.sh Don't try to run hack/verify-staging-* on dirty repository 2017-03-09 13:05:31 +01:00
update-swagger-spec.sh
update-translations.sh Update extraction script, sort messages, add .pot file. 2017-02-23 18:53:00 +00:00
update_owners.py updated test owner generation script to add sig column 2017-02-03 12:41:47 -08:00
verify-all.sh
verify-api-groups.sh add script to check for updates to the files for generation 2016-11-01 15:59:50 -04:00
verify-api-reference-docs.sh
verify-bazel.sh bump gazel to v14 2017-02-09 11:09:13 -08:00
verify-boilerplate.sh Add a build rule for the boilerplate unit test. 2017-01-01 22:54:32 -08:00
verify-cli-conventions.sh Tools for checking CLI conventions 2016-10-17 11:50:02 -02:00
verify-codecgen.sh add apiregistration types 2016-12-06 13:45:10 -05:00
verify-codegen.sh update scripts for new kube-aggregator location 2017-02-14 14:16:59 -05:00
verify-description.sh
verify-federation-openapi-spec.sh Add verify script federation OpenAPI spec generation 2016-11-07 02:41:50 -08:00
verify-flags-underscore.py ignore BUILD in the flags-underscore.py validation 2016-10-21 17:32:33 -07:00
verify-generated-docs.sh Move .generated_docs to docs/ so docs OWNERS can review / approve 2017-02-16 10:11:57 -08:00
verify-generated-protobuf.sh utils: Use macOS copatible copying method 2016-10-18 11:09:38 +02:00
verify-generated-runtime.sh add update-staging-client-go.sh and verify-staging-client-go.sh; 2016-10-29 14:20:39 -07:00
verify-generated-swagger-docs.sh docs generation: Use macos compatible copy method 2016-10-18 11:11:03 +02:00
verify-godep-licenses.sh
verify-godeps.sh Add staging repos to GOPATH in verify-godeps 2017-03-01 10:23:30 -05:00
verify-gofmt.sh hack/*.sh: re-add staging dirs to verify+update scripts 2017-02-17 08:51:31 +01:00
verify-golint.sh hack/verify-golint: enforce cleanup of old packages 2017-01-24 08:34:06 +01:00
verify-govet.sh
verify-import-boss.sh
verify-linkcheck.sh
verify-openapi-spec.sh verify-openapi-spec.sh should not ignore extra file in the spec folder api/openapi-spec 2016-11-01 01:13:11 -07:00
verify-pkg-names.sh hack/*.sh: re-add staging dirs to verify+update scripts 2017-02-17 08:51:31 +01:00
verify-readonly-packages.sh hack/*.sh: re-add staging dirs to verify+update scripts 2017-02-17 08:51:31 +01:00
verify-staging-client-go.sh hack/verify-staging-client-go.sh: fail on changes 2017-02-27 14:11:41 +01:00
verify-staging-godeps.sh update-staging-{client-go,godeps}.sh: no godep-restore, pin godep, check workdir 2017-02-25 22:38:23 +01:00
verify-staging-imports.sh add godep.json to staging repos 2017-02-21 09:38:55 -05:00
verify-swagger-spec.sh
verify-symbols.sh spell check for test/* 2016-12-14 06:03:00 -08:00
verify-test-images.sh Make all useage of sort deterministic 2016-10-20 16:47:20 -04:00
verify-test-owners.sh Disable verify-test-owners.sh and make `go vet` more obvious 2016-12-21 11:44:04 -08:00