Automatic merge from submit-queue
Enable "kick the tires" support for Nvidia GPUs in COS
This PR provides an installation daemonset that will install Nvidia CUDA drivers on Google Container Optimized OS (COS).
User space libraries and debug utilities from the Nvidia driver installation are made available on the host in a special directory on the host -
* `/home/kubernetes/bin/nvidia/lib` for libraries
* `/home/kubernetes/bin/nvidia/bin` for debug utilities
Containers that run CUDA applications on COS are expected to consume the libraries and debug utilities (if necessary) from the host directories using `HostPath` volumes.
Note: This solution requires updating Pod Spec across distros. This is a known issue and will be addressed in the future. Until then CUDA workloads will not be portable.
This PR updates the COS base image version to m59. This is coupled with this PR for the following reasons:
1. Driver installation requires disabling a kernel feature in COS.
2. The kernel API for disabling this interface changed across COS versions
3. If the COS image update is not handled in this PR, then a subsequent COS image update will break GPU integration and will require an update to the installation scripts in this PR.
4. Instead of having to post `3` PRs, one each for adding the basic installer, updating COS to m59, and then updating the installer again, this PR combines all the changes to reduce review overhead and latency, and additional noise that will be created when GPU tests break.
**Try out this PR**
1. Get Quota for GPUs in any region
2. `export `KUBE_GCE_ZONE=<zone-with-gpus>` KUBE_NODE_OS_DISTRIBUTION=gci`
3. `NODE_ACCELERATORS="type=nvidia-tesla-k80,count=1" cluster/kube-up.sh`
4. `kubectl create -f cluster/gce/gci/nvidia-gpus/cos-installer-daemonset.yaml`
5. Run your CUDA app in a pod.
**Another option is to run a e2e manually to try out this PR**
1. Get Quota for GPUs in any region
2. export `KUBE_GCE_ZONE=<zone-with-gpus>` KUBE_NODE_OS_DISTRIBUTION=gci
3. `NODE_ACCELERATORS="type=nvidia-tesla-k80,count=1"`
4. `go run hack/e2e.go -- --up`
5. `hack/ginkgo-e2e.sh --ginkgo.focus="\[Feature:GPU\]"`
The e2e will install the drivers automatically using the daemonset and then run test workloads to validate driver integration.
TODO:
- [x] Update COS image version to m59 release.
- [x] Remove sleep from the install script and add it to the daemonset
- [x] Add an e2e that will run the daemonset and run a sample CUDA app on COS clusters.
- [x] Setup a test project with necessary quota to run GPU tests against HEAD to start with https://github.com/kubernetes/test-infra/pull/2759
- [x] Update node e2e serial configs to install nvidia drivers on COS by default
Automatic merge from submit-queue (batch tested with PRs 46133, 46211, 46224, 46205, 45910)
Make CPU request for heapster in kubemark scale with the number of Nodes
Automatic merge from submit-queue
Make real proxier in hollow-proxy optional (default=true)
Ref https://github.com/kubernetes/kubernetes/pull/45622
This allows using real proxier for hollow proxy, but we use the fake one by default.
cc @kubernetes/sig-scalability-misc @wojtek-t @gmarek
Automatic merge from submit-queue (batch tested with PRs 45569, 45602, 45604, 45478, 45550)
Minor bug fix in start-kubemark-master script
cc @wojtek-t @gmarek
Per Clayton's suggestion, move stuff from cluster/lib/util.sh to
hack/lib/util.sh. Also consolidate ensure-temp-dir and use the
hack/lib/util.sh implementation rather than cluster/common.sh.
Automatic merge from submit-queue (batch tested with PRs 42672, 42770, 42818, 42820, 40849)
kubemark test: Bump addon-manager to v6.4-beta.1
Follow up PR of #42760. This PR bumps addon-manager to v6.4-beta.1 for kubemark test.
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 42456, 42457, 42414, 42480, 42370)
Update npd in kubemark since #42201 is merged.
Revert https://github.com/kubernetes/kubernetes/pull/41716.
#42201 has been merged, and #41713 is fixed. Now we could retry update npd in kubemark.
/cc @shyamjvs @wojtek-t @dchen1107
Automatic merge from submit-queue (batch tested with PRs 41980, 42192, 42223, 41822, 42048)
Modified kubemark startup scripts to restore master on reboot
Fixes#41735
As discussed in the issue, modified the scripts to satisfy the conditions of restoring master env, running non-idempotent operations only for the first time and persist important data like pki/auth files on a PD.
Also attached `start-kubemark-master.sh` as startup-script metadata to master instance (on GCE) so that it is called automatically on each boot.
cc @kubernetes/sig-scalability-misc @wojtek-t @gmarek
Automatic merge from submit-queue (batch tested with PRs 42316, 41618, 42201, 42113, 42191)
[Kubemark] Add init container in hollow node for setting inotify limit of node to 200
Fixes#41713
Along with adding the init container, I also changed the manifest to a yaml as otherwise the entire init container annotation would have to be in a single line (with escaped characters), as json doesn't allow multi-line strings.
cc @kubernetes/sig-scalability-misc @wojtek-t @gmarek @Random-Liu
Automatic merge from submit-queue (batch tested with PRs 40746, 41699, 42108, 42174, 42093)
Avoid fake node names in user info
Node usernames should follow the format `system:node:<node-name>`,
but if we don't know the node name, it's worse to put a fake one in.
In the future, we plan to have a dedicated node authorizer, which would
start rejecting requests from a user with a bogus node name like this.
The right approach is to either mint correct credentials per node, or use node bootstrapping so it requests a correct client certificate itself.
Automatic merge from submit-queue (batch tested with PRs 41962, 42055, 42062, 42019, 42054)
Update flag to --check-version-skew instead of --check_version_skew
https://github.com/kubernetes/test-infra/issues/2012
Also add a `--` to send the flags to kubetest without triggering a warning.
Automatic merge from submit-queue (batch tested with PRs 41349, 41532, 41256, 41587, 41657)
Update kubectl in addon-manager to use HPA in autoscaling/v1
Addon-manager is broken since HPA objects were removed from extensions api group.
Came across the logs from [the latest addon-manager on Jenkins](https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce/4290/artifacts/bootstrap-e2e-master/kube-addon-manager.log):
```
INFO: == Entering periodical apply loop at 2017-02-16T17:33:37+0000 ==
error: error pruning namespaced object extensions/v1beta1, Kind=HorizontalPodAutoscaler: the server could not find the requested resource
WRN: == Failed to execute /usr/local/bin/kubectl apply --namespace=kube-system -f /etc/kubernetes/addons --prune=true -l kubernetes.io/cluster-service=true --recursive >/dev/null at 2017-02-16T17:33:38+0000. 2 tries remaining. ==
error: error pruning namespaced object extensions/v1beta1, Kind=HorizontalPodAutoscaler: the server could not find the requested resource
WRN: == Failed to execute /usr/local/bin/kubectl apply --namespace=kube-system -f /etc/kubernetes/addons --prune=true -l kubernetes.io/cluster-service=true --recursive >/dev/null at 2017-02-16T17:33:46+0000. 1 tries remaining. ==
error: error pruning namespaced object extensions/v1beta1, Kind=HorizontalPodAutoscaler: the server could not find the requested resource
WRN: == Failed to execute /usr/local/bin/kubectl apply --namespace=kube-system -f /etc/kubernetes/addons --prune=true -l kubernetes.io/cluster-service=true --recursive >/dev/null at 2017-02-16T17:33:53+0000. 0 tries remaining. ==
WRN: == Kubernetes addon update completed with errors at 2017-02-16T17:33:58+0000 ==
```
And notice this commit (f66679a4e9) came in two weeks ago, which removed HorizontalPodAutoscaler from extensions/v1beta1.
Addon-manager is now partially functioning that it could successfully create and update addons, but will fail to prune objects, which means upgrade tests may mostly fail.
Pushed another version of addon-manager with kubectl v1.6.0-alpha.2 ([release 2 days ago](https://github.com/kubernetes/kubernetes/releases/tag/v1.6.0-alpha.2)) for fixing, including below images:
- gcr.io/google-containers/kube-addon-manager:v6.4-alpha.2
- gcr.io/google-containers/kube-addon-manager-amd64:v6.4-alpha.2
- gcr.io/google-containers/kube-addon-manager-arm:v6.4-alpha.2
- gcr.io/google-containers/kube-addon-manager-arm64:v6.4-alpha.2
- gcr.io/google-containers/kube-addon-manager-ppc64le:v6.4-alpha.2
- gcr.io/google-containers/kube-addon-manager-s390x:v6.4-alpha.2
@mikedanese
cc @wojtek-t @shyamjvs
Automatic merge from submit-queue (batch tested with PRs 41706, 39063, 41330, 41739, 41576)
[Kubemark] Add option to log hollow-node logs
Ref https://github.com/kubernetes/kubernetes/issues/41613
Added an option to log kubemark hollow-node logs which includes kubelet, kubeproxy and npd logs for each hollow-node.
Setting the env var `ENABLE_HOLLOW_NODE_LOGS=true` should now enable logging for tests.
cc @kubernetes/sig-scalability-misc @wojtek-t @gmarek @yujuhong @Random-Liu
Automatic merge from submit-queue
Fix kubemark default e2e test suite's name
Seems like the suite "[Feature:performance]" doesn't trigger tests anymore. Changed it to "[Feature:Performance]" in kubemark run-e2e-tests.sh.
cc @wojtek-t @gmarek