k3s/hack
Kubernetes Submit Queue 1e2105808b Merge pull request #45136 from vishh/cos-nvidia-driver-install
Automatic merge from submit-queue

Enable "kick the tires" support for Nvidia GPUs in COS

This PR provides an installation daemonset that will install Nvidia CUDA drivers on Google Container Optimized OS (COS).
User space libraries and debug utilities from the Nvidia driver installation are made available on the host in a special directory on the host -
* `/home/kubernetes/bin/nvidia/lib` for libraries
*  `/home/kubernetes/bin/nvidia/bin` for debug utilities

Containers that run CUDA applications on COS are expected to consume the libraries and debug utilities (if necessary) from the host directories using `HostPath` volumes.

Note: This solution requires updating Pod Spec across distros. This is a known issue and will be addressed in the future. Until then CUDA workloads will not be portable.

This PR updates the COS base image version to m59. This is coupled with this PR for the following reasons:
1. Driver installation requires disabling a kernel feature in COS. 
2. The kernel API for disabling this interface changed across COS versions
3. If the COS image update is not handled in this PR, then a subsequent COS image update will break GPU integration and will require an update to the installation scripts in this PR.
4. Instead of having to post `3` PRs, one each for adding the basic installer, updating COS to m59, and then updating the installer again, this PR combines all the changes to reduce review overhead and latency, and additional noise that will be created when GPU tests break.

**Try out this PR**
1. Get Quota for GPUs in any region
2. `export `KUBE_GCE_ZONE=<zone-with-gpus>` KUBE_NODE_OS_DISTRIBUTION=gci`
3. `NODE_ACCELERATORS="type=nvidia-tesla-k80,count=1" cluster/kube-up.sh`
4. `kubectl create -f cluster/gce/gci/nvidia-gpus/cos-installer-daemonset.yaml`
5. Run your CUDA app in a pod.

**Another option is to run a e2e manually to try out this PR**
1. Get Quota for GPUs in any region
2. export `KUBE_GCE_ZONE=<zone-with-gpus>` KUBE_NODE_OS_DISTRIBUTION=gci
3. `NODE_ACCELERATORS="type=nvidia-tesla-k80,count=1"`
4. `go run hack/e2e.go -- --up` 
5. `hack/ginkgo-e2e.sh --ginkgo.focus="\[Feature:GPU\]"`
The e2e will install the drivers automatically using the daemonset and then run test workloads to validate driver integration.

TODO:
- [x] Update COS image version to m59 release.
- [x] Remove sleep from the install script and add it to the daemonset
- [x] Add an e2e that will run the daemonset and run a sample CUDA app on COS clusters.
- [x] Setup a test project with necessary quota to run GPU tests against HEAD to start with https://github.com/kubernetes/test-infra/pull/2759
- [x] Update node e2e serial configs to install nvidia drivers on COS by default
2017-05-23 10:46:10 -07:00
..
boilerplate Enable auto-generating sources rules 2017-01-05 14:14:13 -08:00
cmd/teststale
e2e-internal Split federation-{up,down} from e2e-{up,down}. 2017-02-24 14:27:31 -08:00
gen-swagger-doc
jenkins Export patch files to artifacts 2017-03-25 12:16:50 -07:00
lib add "admission" API group 2017-05-19 10:17:37 -06:00
make-rules PDB MaxUnavailable: kubectl changes 2017-05-23 07:18:44 -07:00
testdata Remove vestiges of defaulting from conversion path, switch to top-level default registration only 2017-04-12 13:36:15 -04:00
verify-flags PDB MaxUnavailable: kubectl changes 2017-05-23 07:18:44 -07:00
.linted_packages Merge pull request #38990 from mikedanese/go-genrule-sets 2017-05-22 19:06:58 -07:00
BUILD Add verify-gofmt as a Bazel test. 2017-02-10 17:00:28 -08:00
OWNERS Merge pull request #45996 from cblecker/hack-owner 2017-05-19 16:06:27 -07:00
autogenerated_placeholder.txt
benchmark-go.sh
build-cross.sh
build-go.sh
build-ui.sh move swagger route to apiserver 2017-02-01 15:18:32 -05:00
cherry_pick_pull.sh hack/cherry_pick_pull.sh: cleanup patch files 2016-12-14 14:33:17 -08:00
dev-build-and-push.sh hack/dev-build-*: Run dev build instead of release build 2016-12-15 10:35:16 -07:00
dev-build-and-up.sh hack/dev-build-*: Run dev build instead of release build 2016-12-15 10:35:16 -07:00
dev-push-hyperkube.sh Rename build-tools/ back to build/ 2016-12-14 13:42:15 -08:00
e2e-node-test.sh
e2e.go Convert hack/e2e.go to a test-infra/kubetest shim 2017-02-02 17:42:46 -08:00
e2e_test.go hack/e2e_test.go's tester shouldn't stat files from the future 2017-02-15 15:59:47 -08:00
federated-ginkgo-e2e.sh Default FEDERATION_KUBE_CONTEXT to FEDERATION_NAME in federation e2e up/down scripts. 2017-04-05 18:47:03 -07:00
generate-bindata.sh Adding an installer script that installs Nvidia drivers in Container Optimized OS 2017-05-20 21:17:19 -07:00
generate-docs.sh Move .generated_docs to docs/ so docs OWNERS can review / approve 2017-02-16 10:11:57 -08:00
get-build.sh
ginkgo-e2e.sh e2e test: test azure disk volume 2017-04-28 18:51:34 +00:00
godep-restore.sh hack/godep-restore.sh: use godep v79 which works 2017-03-12 18:43:10 +01:00
godep-save.sh wire new staging repo 2017-05-02 08:43:31 -04:00
grab-profiles.sh
install-etcd.sh
list-feature-tests.sh
local-up-cluster.sh Fix unbound variable 2017-05-19 00:29:50 -04:00
lookup_pull.py
print-workspace-status.sh Use munged semantic version for side-loaded docker tag 2017-04-27 15:05:40 -07:00
run-in-gopath.sh
test-cmd.sh
test-go.sh
test-integration.sh hack/test-integration.sh: provide a recommended command and exit 2017-02-17 08:44:49 -08:00
test-update-storage-objects.sh Update clusters to use 3.0.17 etcd 2017-02-23 10:08:50 +01:00
update-all.sh Add update-federation-* scripts to update-all.sh 2017-05-15 16:51:09 -07:00
update-api-reference-docs.sh update generation bash to handle vendor dir 2017-01-17 09:06:34 -05:00
update-bazel.sh Update gazel to v17 2017-04-27 15:01:34 -07:00
update-codecgen.sh Add internal audit API types 2017-05-18 10:30:21 -07:00
update-codegen.sh wire new staging repo 2017-05-02 08:43:31 -04:00
update-federation-api-reference-docs.sh update generation bash to handle vendor dir 2017-01-17 09:06:34 -05:00
update-federation-generated-swagger-docs.sh update generation bash to handle vendor dir 2017-01-17 09:06:34 -05:00
update-federation-openapi-spec.sh Fix hack/update-federation-openapi-spec.sh flakyness 2017-05-19 15:39:08 -07:00
update-federation-swagger-spec.sh Federation does not generate swagger spec correctly 2017-01-06 23:45:04 -05:00
update-generated-docs.sh Move .generated_docs to docs/ so docs OWNERS can review / approve 2017-02-16 10:11:57 -08:00
update-generated-protobuf-dockerized.sh spell check for test/* 2016-12-14 06:03:00 -08:00
update-generated-protobuf.sh Rename build-tools/ back to build/ 2016-12-14 13:42:15 -08:00
update-generated-runtime-dockerized.sh Reorganize kubelet tree so apis can be independently versioned 2017-05-12 10:02:33 -07:00
update-generated-runtime.sh Rename build-tools/ back to build/ 2016-12-14 13:42:15 -08:00
update-generated-swagger-docs.sh update generation bash to handle vendor dir 2017-01-17 09:06:34 -05:00
update-godep-licenses.sh
update-gofmt.sh hack/*.sh: re-add staging dirs to verify+update scripts 2017-02-17 08:51:31 +01:00
update-openapi-spec.sh Fix race in service IP allocation repair loop 2016-12-26 21:59:27 -08:00
update-staging-client-go.sh Use "hack/godep-restore.sh" instead of godep restore 2017-03-28 04:05:47 -04:00
update-staging-godeps.sh move metrics to staging 2017-05-01 16:43:50 -07:00
update-swagger-spec.sh wire in aggregation 2017-03-27 09:44:10 -04:00
update-translations.sh Extract a bunch more strings from kubectl 2017-04-06 20:12:50 -07:00
update_owners.py Make update_owners.py also emit a JSON sig-owners list. 2017-05-11 17:01:29 -07:00
verify-all.sh
verify-api-groups.sh add "admission" API group 2017-05-19 10:17:37 -06:00
verify-api-reference-docs.sh
verify-bazel.sh Update gazel to v17 2017-04-27 15:01:34 -07:00
verify-boilerplate.sh Add a build rule for the boilerplate unit test. 2017-01-01 22:54:32 -08:00
verify-cli-conventions.sh More cli sanity verifications 2017-05-18 15:44:49 -03:00
verify-codecgen.sh
verify-codegen.sh Update generated files 2017-05-18 10:39:04 -07:00
verify-description.sh
verify-federation-api-reference-docs.sh Adding verify-federation-api-reference-docs.sh 2017-05-14 17:20:24 -07:00
verify-federation-generated-swagger-docs.sh Adding verify-federation-generated-swagger-docs.sh 2017-05-14 17:20:24 -07:00
verify-federation-openapi-spec.sh
verify-federation-swagger-spec.sh Adding verify-federation-swagger-spec.sh 2017-05-14 17:15:41 -07:00
verify-flags-underscore.py
verify-generated-docs.sh Move .generated_docs to docs/ so docs OWNERS can review / approve 2017-02-16 10:11:57 -08:00
verify-generated-protobuf.sh Verify generated protobuf script should fail on staging/ changes too 2017-03-15 16:15:02 -07:00
verify-generated-runtime.sh Reorganize kubelet tree so apis can be independently versioned 2017-05-12 10:02:33 -07:00
verify-generated-swagger-docs.sh
verify-godep-licenses.sh
verify-godeps.sh Export patch files to artifacts 2017-03-25 12:16:50 -07:00
verify-gofmt.sh hack/*.sh: re-add staging dirs to verify+update scripts 2017-02-17 08:51:31 +01:00
verify-golint.sh hack/verify-golint: enforce cleanup of old packages 2017-01-24 08:34:06 +01:00
verify-govet.sh
verify-import-boss.sh
verify-linkcheck.sh
verify-no-vendor-cycles.sh Detect and prevent new vendor cycles 2017-05-12 16:56:08 -07:00
verify-openapi-spec.sh
verify-pkg-names.sh add "admission" API group 2017-05-19 10:17:37 -06:00
verify-readonly-packages.sh hack/*.sh: re-add staging dirs to verify+update scripts 2017-02-17 08:51:31 +01:00
verify-staging-client-go.sh hack/verify-staging-client-go.sh: fail on changes 2017-02-27 14:11:41 +01:00
verify-staging-godeps.sh update-staging-{client-go,godeps}.sh: no godep-restore, pin godep, check workdir 2017-02-25 22:38:23 +01:00
verify-staging-imports.sh hack/verify-staging-imports.sh: check that plugins are not imported by default 2017-03-12 19:51:31 +01:00
verify-swagger-spec.sh
verify-symbols.sh spell check for test/* 2016-12-14 06:03:00 -08:00
verify-test-images.sh
verify-test-owners.sh Disable verify-test-owners.sh and make `go vet` more obvious 2016-12-21 11:44:04 -08:00