Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Fixes the races around devicemanager Allocate() and endpoint deletion.
There is a race in predicateAdmitHandler Admit() that getNodeAnyWayFunc()
could get Node with non-zero deviceplugin resource allocatable for a
non-existing endpoint. That race can happen when a device plugin fails,
but is more likely when kubelet restarts as with the current registration
model, there is a time gap between kubelet restart and device plugin
re-registration. During this time window, even though devicemanager could
have removed the resource initially during GetCapacity() call, Kubelet
may overwrite the device plugin resource capacity/allocatable with the
old value when node update from the API server comes in later. This
could cause a pod to be started without proper device runtime config set.
To solve this problem, introduce endpointStopGracePeriod. When a device
plugin fails, don't immediately remove the endpoint but set stopTime in
its endpoint. During kubelet restart, create endpoints with stopTime set
for any checkpointed registered resource. The endpoint is considered to be
in stopGracePeriod if its stoptime is set. This allows us to track what
resources should be handled by devicemanager during the time gap.
When an endpoint's stopGracePeriod expires, we remove the endpoint and
its resource. This allows the resource to be exported through other channels
(e.g., by directly updating node status through API server) if there is such
use case. Currently endpointStopGracePeriod is set as 5 minutes.
Given that an endpoint is no longer immediately removed upon disconnection,
mark all its devices unhealthy so that we can signal the resource allocatable
change to the scheduler to avoid scheduling more pods to the node.
When a device plugin endpoint is in stopGracePeriod, pods requesting the
corresponding resource will fail admission handler.
Tested:
Ran GPUDevicePlugin e2e_node test 100 times and all passed now.
**What this PR does / why we need it**:
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes https://github.com/kubernetes/kubernetes/issues/60176
**Special notes for your reviewer**:
**Release note**:
```release-note
Fixes the races around devicemanager Allocate() and endpoint deletion.
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
[e2e service] Fix CleanupGCEResources for regional test
*What this PR does / why we need it**:
From https://k8s-testgrid.appspot.com/google-gke-staging#gke-staging-1-8-1-9-upgrade-regional-cluster&width=20, regional cluster test is failing because the GCE resource cleanup function attempts to parse region from `--gcp-zone` while regional cluster only set `--gcp-region`.
This PR pipes region into the cleanup function as well. This will need to be cherrypicked to 1.8 and 1.9.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #NONE
**Special notes for your reviewer**:
/assign @bowei @wojtek-t
cc @nikhiljindal to see if there is anything should be fixed for federation.
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
[e2e service] Move apiserver restart validation logic into util
**What this PR does / why we need it**:
Follow up of #60906, on GKE apiserver pod is invisible on k8s, hence test is failing.
This PR bakes the restart validation logic into the util function instead so it could be env-awared.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes#60761
**Special notes for your reviewer**:
Sorry for the noise.
/assign @rramkumar1 @bowei
cc @krzyzacy
**Release note**:
```release-note
NONE
```
There is a race in predicateAdmitHandler Admit() that getNodeAnyWayFunc()
could get Node with non-zero deviceplugin resource allocatable for a
non-existing endpoint. That race can happen when a device plugin fails,
but is more likely when kubelet restarts as with the current registration
model, there is a time gap between kubelet restart and device plugin
re-registration. During this time window, even though devicemanager could
have removed the resource initially during GetCapacity() call, Kubelet
may overwrite the device plugin resource capacity/allocatable with the
old value when node update from the API server comes in later. This
could cause a pod to be started without proper device runtime config set.
To solve this problem, introduce endpointStopGracePeriod. When a device
plugin fails, don't immediately remove the endpoint but set stopTime in
its endpoint. During kubelet restart, create endpoints with stopTime set
for any checkpointed registered resource. The endpoint is considered to be
in stopGracePeriod if its stoptime is set. This allows us to track what
resources should be handled by devicemanager during the time gap.
When an endpoint's stopGracePeriod expires, we remove the endpoint and
its resource. This allows the resource to be exported through other channels
(e.g., by directly updating node status through API server) if there is such
use case. Currently endpointStopGracePeriod is set as 5 minutes.
Given that an endpoint is no longer immediately removed upon disconnection,
mark all its devices unhealthy so that we can signal the resource allocatable
change to the scheduler to avoid scheduling more pods to the node.
When a device plugin endpoint is in stopGracePeriod, pods requesting the
corresponding resource will fail admission handler.
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Revert "[Test change - don't merge] Skip load test"
This reverts commit ba6bb999f7.
This was accidentally merged as part of 60891.
/cc @wojtek-t
/sig scalability
/kind bug
/priority important-soon
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Use quotas in default performance tests
Better to use more features in default tests if possible.
LGTM whenever you think we're ready.
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
[e2e service] Refine apiserver restart logic
**What this PR does / why we need it**:
Ref https://github.com/kubernetes/kubernetes/issues/60761#issuecomment-371308569, wait for apiserver's restart count increases before proceeding the test.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes (hopefully) #60761
**Special notes for your reviewer**:
/assign @rramkumar1 @bowei
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 60891, 60935). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Rollback etcd server version to 3.1.11 due to #60589
Ref https://github.com/kubernetes/kubernetes/issues/60589#issuecomment-371171837
The dependencies were a bit complex (so many things relying on it) + the version was updated to 3.2.16 on top of the original bump.
So I had to mostly make manual reverting changes on a case-by-case basis - so likely to have errors :)
/cc @wojtek-t @jpbetz
```release-note
Downgrade default etcd server version to 3.1.11 due to #60589
```
(I'm not sure if we should instead remove release-notes of the original PRs)
Automatic merge from submit-queue (batch tested with PRs 60642, 60627). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Create fake /etc/hosts for conformance test
**What this PR does / why we need it**:
"KubeletManagedEtcHosts should test kubelet managed /etc/hosts file"
conformance test fails in the CI's Docker-In-Docker environment.
This test mounts a /etc/hosts file and checks if "# Kubernetes-managed
hosts file." string is present or not under various conditions. The
specific failure with DIND happens when the /etc/hosts picked up
from the box where e2e test are running already has this string. This
happens because our CI runs on kubernetes and the e2e tests are running
in a container that was started on kubernetes (and hence already has
that string)
To avoid this situation, we create a new /etc/hosts file with known
contents (and does not have the "# Kubernetes-managed hosts file."
string)
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #
**Special notes for your reviewer**:
Please see:
https://k8s-testgrid.appspot.com/sig-testing-misc#ci-kubernetes-local-e2ehttps://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-local-e2e/285
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Add cblecker to test OWNERS
**What this PR does / why we need it**:
- Adds an OWNERS file for `test/typecheck/` that was just added
- Add cblecker to approvers for `test/`
I won't touch anything I don't understand :)
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
improve daemonset's retry creating failed daemon pods e2e test
**What this PR does / why we need it**:
This PR improves daemonset's retry creating failed daemon pods e2e test by checking that the failed pod does not show up among existing daemon pods.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes#59076
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Fixing e2e CSI test, II
The fix for #60803 in commit 2ae33cc324 had a typo, so the "Server
rejected event" error still showed up in the external-provisioner log
of the "Sanity CSI plugin test using hostPath CSI driver" e2e test.
```release-note
None
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Bump etcd server patch version to 3.2.16
etcd 3.2.16 contains a critical fix for HA clusters: https://github.com/coreos/etcd/pull/9281
Also, update newly added tests to use `REGISTRY` make variable.
Release note:
```release-note
Upgrade the default etcd server version to 3.2.16
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Fix test failure and delete unused code
**What this PR does / why we need it**:
`Daemon set [Serial] should update pod when spec was updated and update strategy is RollingUpdate` is broken by recent updates as Template generation isn't supported by apps/v1.DaemonSet anymore
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes#60745
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
"KubeletManagedEtcHosts should test kubelet managed /etc/hosts file"
conformance test fails in the CI's Docker-In-Docker environment.
This test mounts a /etc/hosts file and checks if "# Kubernetes-managed
hosts file." string is present or not under various conditions. The
specific failure with DIND happens when the /etc/hosts picked up
from the box where e2e test are running already has this string. This
happens because our CI runs on kubernetes and the e2e tests are running
in a container that was started on kubernetes (and hence already has
that string)
To avoid this situation, we create a new /etc/hosts file with known
contents (and does not have the "# Kubernetes-managed hosts file."
string)
The fix for #60803 in commit 2ae33cc324 had a typo, so the "Server
rejected event" error still showed up in the external-provisioner log
of the "Sanity CSI plugin test using hostPath CSI driver" e2e test.
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Run server-side print tests only on k8s 1.10+
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes#60765
**Release note**:
```release-note
None
```
/assign @smarterclayton @deads2k
@juanvallejo fyi
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Prevent webhooks from affecting admission requests for WebhookConfiguration objects
**What this PR does / why we need it**:
As it stands now webhooks can be added to the system which make it impossible for a user to remove that webhook, or two webhooks could be registered which make it impossible to remove each other.
The first commit of this will add a test to make sure webhook deletion is never blocked by a webhook. This test will fail until the second commit is added which will prevent webhooks from affecting admission requests for ValidatingWebhookConfiguration and MutatingWebhookConfiguration objects in the admissionregistration.k8s.io group
- [x] Test that webhook deletion is never blocked by a webhook ([test fails before second commit](https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/59840/pull-kubernetes-e2e-gce/23731/))
- [x] Prevent webhooks from being called on admission requests for [Validating|Mutating]WebhookConfiguration objects
- [x] Document this new behavior maybe in another PR
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Part of fixing #59124 (Verifies that it can remove the broken webhook.)
**Release note**:
```release-note
ValidatingWebhooks and MutatingWebhooks will not be called on admission requests for ValidatingWebhookConfiguration and MutatingWebhookConfiguration objects in the admissionregistration.k8s.io group
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Fix DaemonSet e2e test for OnDelete
**What this PR does / why we need it**: DaemonSet `OnDelete` e2e test is broken after #59883 is merged, because default update strategy is different in apps/v1 API endpoint.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
xref #60003
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 60679, 60804). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Fixing e2e CSI test
Closes#60803
After switching to CSI 0.2.0 spec, additional RBAC permissions are required. This PR adds missing permissions.
```release-note
None
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Add node-e2e test for ShareProcessNamespace
**What this PR does / why we need it**: Adds a node-e2e test for kubernetes/features#495
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes#59554
**Special notes for your reviewer**: This requires a feature gate to be enabled in both the kubelet and API server. I'm not sure which jenkins configs need to be updated (or if these are even still used) so I just updated a pile of them.
opened kubernetes/test-infra#7030 for https://github.com/kubernetes/test-infra/blob/master/jobs/config.json
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Volume deletion should be idempotent
- Describe* calls should return `aws.Error` so caller can handle individual errors. `aws.Error` already has enough context (`"InvalidVolume.NotFound: The volume 'vol-0a06cc096e989c5a2' does not exist"`)
- Deletion of already deleted volume should succeed.
**Release note**:
Fixes: #60778
```release-note
NONE
```
/sig storage
/sig aws
/assign @justinsb @gnufied
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Add selector to DaemonSet in newDaemonSet function so that the v1 api…
**What this PR does / why we need it**:
When we upgraded the DaemonSet e2e to use apps v1 I neglected to add a selector to match the labels of the created Pods. This broke some apps Serial tests.
```release-note
NONE
```
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Update kubectl e2e test manifests to apps/v1
**What this PR does / why we need it**:
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 60159, 60731, 60720, 60736, 60740). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Cap max number of nodes to use for local PV e2e tests
**What this PR does / why we need it**:
Large scale tests have thousands of nodes, which will make some local PV tests that use each node take forever
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Partially addresses #60589
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```