Automatic merge from submit-queue
AllowedNotReadyNodes allowed to be not ready for absolutely *any* reason
It's as good as we allow those many nodes to be not part of the cluster at all, ever.
Btw - currently our 5k-node correctness test fails if "kubelet stopped posting node status" or "route not created", etc (ref: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/3/build-log.txt)
cc @kubernetes/sig-scalability-misc
Automatic merge from submit-queue (batch tested with PRs 50213, 50707, 49502, 51230, 50848)
StatefulSet: Deflake e2e `kubectl exec` commands.
This may help with another source of flakiness found while investigating #48031.
We seem to get a lot of flakes due to "connection refused" while running `kubectl exec`. I can't find any reason this would be caused by the test flow, so I'm adding retries to see if that helps.
Automatic merge from submit-queue (batch tested with PRs 51224, 51191, 51158, 50669, 51222)
StatefulSet: Deflake e2e "restart" phase.
This addresses another source of flakiness found while investigating #48031.
The test used to scale the StatefulSet down to 0, wait for ListPods to return 0 matching Pods, and then scale the StatefulSet back up.
This was prone to a race in which StatefulSet was told to scale back up before it had observed its own deletion of the last Pod, as evidenced by logs showing the creation of Pod ss-1 prior to the creation of the replacement Pod ss-0.
Instead, we now wait for the controller to observe all deletions before scaling it back up. This should fix flakes of the form:
```
Too many pods scheduled, expected 1 got 2
```
We seem to get a lot of flakes due to "connection refused" while running
`kubectl exec`. I can't find any reason this would be caused by the test
flow, so I'm adding retries to see if that helps.
Automatic merge from submit-queue (batch tested with PRs 51108, 51035, 50539, 51160, 50947)
Auto-calculate CLUSTER_IP_RANGE based on cluster size
In preparation for eliminating CLUSTER_IP_RANGE env var from job configs, making it less error prone while folks try to start their own large cluster tests (https://github.com/kubernetes/kubernetes/issues/50907).
/cc @kubernetes/sig-scalability-misc @wojtek-t @gmarek
The test used to scale the StatefulSet down to 0, wait for ListPods to
return 0 matching Pods, and then scale the StatefulSet back up.
This was prone to a race in which StatefulSet was told to scale back up
before it had observed its own deletion of the last Pod, as evidenced by
logs showing the creation of Pod ss-1 prior to the creation of the
replacement Pod ss-0.
We now wait for the controller to observe all deletions before
scaling it back up. This should fix flakes of the form:
```
Too many pods scheduled, expected 1 got 2
```
Automatic merge from submit-queue
StatefulSet: Deflake e2e "Saturate" phase.
This should reduce one source of flakiness found while investigating #48031.
The "Saturate" phase of StatefulSet e2e tests verifies orderly startup by controlling when each Pod is allowed to report Ready. If a Pod unexepectedly goes down during the test, the replacement Pod
created by the controller will forget if it was already allowed to report Ready.
After this change, the signal that allows each Pod to report Ready is persisted in the Pod's PVC. Thus, the replacement Pod will remember that it was already told to proceed to a Ready state.
Automatic merge from submit-queue (batch tested with PRs 50893, 50913, 50963, 50629, 50640)
Increase latency threshold for list api calls
This is only a short-term solution to make our density test green. In the long-term, we should measure as per our new SLIs.
From @wojtek-t's [doc](https://docs.google.com/document/d/1Q5qxdeBPgTTIXZxdsFILg7kgqWhvOwY8uROEf0j5YBw) on the new SLIs/SLOs, we have the following SLO for list calls:
```
SLO1: In default Kubernetes installation, 99th percentile of SLI2 per cluster-day:
<= 1s if total number of objects of the same type as resource in the system <= X
<= 5s if total number of objects of the same type as resource in the system <= Y
<= 30s if total number of objects of the same types as resource in the system <= Z
```
I would guess that 170,000 pods would fall into the 2nd bracket (at least) and hence the new value of 5s. WDYT?
cc @kubernetes/sig-scalability-misc @wojtek-t @gmarek
The "Saturate" phase of StatefulSet e2e tests verifies orderly startup
by controlling when each Pod is allowed to report Ready.
If a Pod unexepectedly goes down during the test, the replacement Pod
created by the controller will forget if it was already allowed to
report Ready.
After this change, the signal that allows each Pod to report Ready is
persisted in the Pod's PVC. Thus, the replacement Pod will remember that
it was already told to proceed to a Ready state.
Automatic merge from submit-queue (batch tested with PRs 46512, 50146)
Make metav1.(Micro)?Time functions take pointers
Is there any reason for those functions not to be on pointers?
What this PR does / why we need it:
This adds an e2e test for aggregation based on the sample-apiserver.
Currently is uses a sample-apiserver built as of 1.7.
This should ensure that the aggregation system works end-to-end.
It will also help detect if we break "old" extension api servers.
Which issue this PR fixes (optional, in fixes #<issue number>(, fixes
fixes#43714
Fixed bazel for the change.
Fixed # of args issue from govet.
Added code to test dynamic.Client.
Automatic merge from submit-queue (batch tested with PRs 50550, 50768)
Don't SSH to master for metrics in case of GKE
cc @kubernetes/sig-scalability-misc @crassirostris
Automatic merge from submit-queue (batch tested with PRs 49904, 50484, 50214)
Adding support for internal IP for e2e tests
Currently IssueSSHComand in util.go only checks for External IP address
to ssh, this PR adds check for internal IP too.
Closes#50630
Automatic merge from submit-queue (batch tested with PRs 50537, 49699, 50160, 49025, 50205)
When not using a CloudProvider, set both InternalIP and ExternalIP on Nodes
#36095 changed all of the cloudproviders to set both InternalIP and ExternalIP on Nodes, but the non-cloudprovider fallback code now only sets InternalIP.
This causes the test "should be able to create a functioning NodePort service" in test/e2e/service.go to fail on cloud-provider-less clusters, because (with LegacyHostIP gone), it now will only try to work with ExternalIPs, and will fail if the node has only an InternalIP.
There isn't much other code that assumes that ExternalIP will always be set (there's something in pkg/master/master.go, but I don't know what it's doing, so maybe it's only useful in the case where InternalIP != ExternalIP anyway). But given that several of the cloudproviders (mesos, ovirt, rackspace) now explicitly set both InternalIP and ExternalIP to the same value always, it seemed right to do that in the fallback case too.
@deads2k FYI
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue
Remove deprecated ESIPP beta annotations
**What this PR does / why we need it**:
Remove deprecated ESIPP beta annotations.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#50187
**Special notes for your reviewer**:
/assign @MrHohn
/sig network
**Release note**:
```release-note
Beta annotations `service.beta.kubernetes.io/external-traffic` and `service.beta.kubernetes.io/healthcheck-nodeport` have been removed. Please use fields `service.spec.externalTrafficPolicy` and `service.spec.healthCheckNodePort` instead.
```
Automatic merge from submit-queue
Migrate to controller references helpers in meta/v1
**What this PR does / why we need it**:
This is a follow up for #48319 that migrates all method usages to new methods in meta/v1.
**Special notes for your reviewer**:
Looking at each commit individually might be easier.
**Release note**:
```release-note
NONE
```
/sig api-machinery
/kind cleanup
Automatic merge from submit-queue (batch tested with PRs 45186, 50440)
Add functionality needed by Cluster Autoscaler to Kubemark Provider.
Make adding nodes asynchronous. Add method for getting target
size of node group. Add method for getting node group for node.
Factor out some common code.
**Release note**:
```
NONE
```
Make adding nodes asynchronous. Add method for getting target
size of node group. Add method for getting node group for node.
Factor out some common code.
Automatic merge from submit-queue
Add waitForFailure for e2e test framework
**What this PR does / why we need it**:
Add waitForFailure for e2e test framework, this could reduce the reliance on logs.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*:
Part of #44118. Refer https://github.com/kubernetes/kubernetes/pull/48858#discussion_r128331726
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue
Add a simple cloud provider for e2e tests on kubemark
**What this PR does / why we need it**:
Adds a simplified cloud provider for kubemark. This enables us to add and
remove nodes and operate on nodegroups while running tests on kubemark.
This is needed to run scalability tests for cluster autoscaler on kubemark.
See https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/kubemark_integration.md
**Release note**:
```
NONE
```
Automatic merge from submit-queue
StatefulSet scale subresource
**What this PR does / why we need it**: This PR implements scale subresource for StatefulSet.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#46005
**Special notes for your reviewer**:
**Release note**:
```release-note
StatefulSet uses scale subresource when scaling in accord with ReplicationController, ReplicaSet, and Deployment implementations.
```
**Feature Checklist**:
- [x] Introduce Registry interface for storage purpose
- [x] Introduce `ScaleREST New(), Get() and Update()` utility functions
- [x] Create a `ScaleREST` object at `NewREST()` and return it
- [x] Enable scale subresource by adding `/scale` field to the storage map
**Testing Checklist**:
- Unit testing
- [x] Modify `newStorage()` to call `NewStorage()`, and change all unit tests accordingly
- [x] Add unit tests for `ScaleREST Get() and Update()` utility functions
- [x] Add missing unit test for `ShortNames`
- Manual testing
- [x] Verify existence of the subresource using `kubectl proxy` command
- [x] Modify the subresource using `curl` via `POST`
- e2e testing
- [x] Add e2e tests using `RESTClient`
Automatic merge from submit-queue (batch tested with PRs 49524, 46760, 50206, 50166, 49603)
Handled taints on node in batch.
**What this PR does / why we need it**:
Enhanced helpers to handled taints on node in batch.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#49522
**Release note**:
```release-note
None
```
This is needed for cluster autoscaler e2e test to
run on kubemark. We need the ability to add and
remove nodes and operate on nodegroups. Kubemark
does not provide this at the moment.
Automatic merge from submit-queue (batch tested with PRs 50091, 50231, 50238, 50236, 50243)
Fix storage tests for multizone test configuration.
**What this PR does / why we need it**:
This PR modifies "[sig-storage] Volumes PD should be mountable with (ext3|ext4)" tests to schedule pods in zone, where PD is created.
This is to make the test work in multizone environment.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 50091, 50231, 50238, 50236, 50243)
Move the sig-instrumentation test to a dedicated folder
Move the last remaining test to sig-instrumentation folder, also move "metrics" package to the "framework" folder
Related issue: https://github.com/kubernetes/kubernetes/issues/49161
/cc @xiangpengzhao @piosz
Automatic merge from submit-queue (batch tested with PRs 50091, 50231, 50238, 50236, 50243)
Modify e2e.go to arbitrarily pick one of zones we have nodes in for multizone tests.
**What this PR does / why we need it**:
When e2e runs in multizone configuration, zone config property can be empty.
This PR, in that case, overrides an empty value with arbitrarily chosen zone that we have nodes in.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 50119, 48366, 47181, 41611, 49547)
Add basic install and mount flexvolumes e2e tests
fixes https://github.com/kubernetes/kubernetes/issues/47010
These two tests install a skeleton "dummy" flex driver, attachable and non-attachable respectively, then test that a pod can successfully use the flex driver. They are labeled disruptive because kubelet and controller-manager get restarted as part of the flex install. IMO it's important to keep this install procedure as part of the test to isolate any bugs with the startup plugin probe code.
There is a bit of an ugly dependency on cluster/gce/config-test.sh because --flex-volume-plugin-dir must be set to a dir that's readable from controller-manager container and writable by the flex e2e test. The default path is not writable on GCE masters with read-only root so I picked a location that looks okay.
In the "dummy" drivers I trick kubelet into thinking there is a mount point by doing "mount -t tmpfs none ${MNTPATH} >/dev/null 2>&1", hope that is okay.
I have only tested on GCE and theoretically they may work on AWS but I don't think there is a need to test on multiple cloudproviders.
-->
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 50000, 49954, 49943, 50018, 49607)
Update the DeleteReplicaSet in rs_util.go to use server side reaper
**What this PR does / why we need it**:
fix#47832
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#47832
**Special notes for your reviewer**:
**Release note**:
```
None
```
Automatic merge from submit-queue (batch tested with PRs 49990, 49997, 44278, 49936, 49891)
Allow mode in e2e-framework to gather metrics only from master
This should enable getting metrics for our 5k-node clusters.
cc @kubernetes/sig-scalability-misc @gmarek
Automatic merge from submit-queue (batch tested with PRs 49898, 49897, 49919, 48860, 49491)
Fix usage a make(struct, len()) followed by append()
A couple of places in the code we allocate with make() but then use
append(), instead of copy() or direct assignment. This results in a
slice with len() zero elements at the front followed by the expected
data. The correct form for such usage is `make(struct, 0, len())`.
I found these by running:
```
$ git grep -EI -A7 'make\([^,]*, len\(' | grep 'append(' -B7 | grep -v vendor
```
And then manually looking through the results. I'm sure something better
could exist.
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 49651, 49707, 49662, 47019, 49747)
improve detectability of deleted pods
**What this PR does / why we need it**:
Adds comment to `waitForPodTerminatedInNamespace` to better explain how it's implemented.
~~It improves pod deletion detection in the e2e framework as follows:~~
~~1. the `waitForPodTerminatedInNamespace` func looks for pod.Status.Phase == _PodFailed_ or _PodSucceeded_ since both values imply that all containers have terminated.~~
~~2. the `waitForPodTerminatedInNamespace` func also ignores the pod's Reason if the passed-in `reason` parm is "". Reason is not really relevant to the pod being deleted or not, but if the caller passes a non-blank `reason` then it will be lower-cased, de-blanked and compared to the pod's Reason (also lower-cased and de-blanked). The idea is to make Reason checking more flexible and to prevent a pod from being considered running when all of its containers have terminated just because of a Reason mis-match.~~
Releated to pr [49597](https://github.com/kubernetes/kubernetes/pull/49597) and issue [49529](https://github.com/kubernetes/kubernetes/issues/49529).
**Release note**:
```release-note
NONE
```
A couple of places in the code we allocate with make() but then use
append(), instead of copy() or direct assignment. This results in a
slice with len() zero elements at the front followed by the expected
data. The correct form for such usage is `make(struct, 0, len())`.
I found these by running:
```
$ git grep -EI -A7 'make\([^,]*, len\(' | grep 'append(' -B7 | grep -v vendor
```
And then manually looking through the results. I'm sure something better
could exist.
Automatic merge from submit-queue (batch tested with PRs 49538, 49708, 47665, 49750, 49528)
Add a support for GKE regional clusters in e2e tests.
**What this PR does / why we need it**:
Add a support for GKE regional clusters in e2e tests.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 45813, 49594, 49443, 49167, 47539)
Add node e2e tests for GKE environment
Ref: https://github.com/kubernetes/kubernetes/issues/46891
This PR adds node e2e tests for validating images used on GKE.
- We pass the `SYSTEM_SPEC_NAME` to the node e2e test process via the flag `--system-spec-name` so that we can skip the environment specific tests using `RunIfSystemSpecNameIs()`.
- Also added `SkipIfContainerRuntimeIs()` as the opposite of `RunIfContainerRuntimeIs()`.
**Release note**:
```
None
```
Automatic merge from submit-queue (batch tested with PRs 49619, 49598, 47267, 49597, 49638)
improve log for pod deletion poll loop
**What this PR does / why we need it**:
It improves some logging related to waiting for a pod to reach a passed-in condition. Specifically, related to issue [49529](https://github.com/kubernetes/kubernetes/issues/49529) where better logging may help to debug the root cause.
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 49420, 49296, 49299, 49371, 46514)
Refactoring taint functions to reduce sprawl
**What this PR does / why we need it**:
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#45060
**Special notes for your reviewer**:
@gmarek @timothysc @k82cn @jayunit100 - I moved some fn's to helpers and some to utils. LMK, if you are ok with this change.
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue
Add a new API version apps/v1beta2
xref: #49135
This PR adds a new API version `apps/v1beta2` which contains a copy (of types, conversions, and defaults) of `apps/v1beta1` StatefulSet, Deployment, and their subresources. Note that `apps/v1beta2` is still WIP and we will make breaking changes to it before releasing 1.8.
Moving core controllers (StatefulSet, Deployment, ReplicaSet, DaemonSet) to `apps/v1beta2` is the first step of moving them to `apps/v1` (GA).
This PR is a starting point for DaemonSet and ReplicaSet to move from `/extensions` to `/apps` and for Deployment and StatefulSet to make some breaking changes (e.g. new defaults and/or remove deprecated fields).
```release-note
Add a new API version apps/v1beta2
```
Automatic merge from submit-queue (batch tested with PRs 48377, 48940, 49144, 49062, 49148)
fixit: break sig-cluster-lifecycle tests into subpackage
this is part of fixit week. ref #49161
@kubernetes/sig-cluster-lifecycle-misc
Automatic merge from submit-queue (batch tested with PRs 47509, 46821, 45319, 49121, 49125)
Tolerate a missing MasterName (for GKE)
When testing, GKE created clusters don't provide a `MasterName`, so don't throw a warning and give up when that happens.
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 47509, 46821, 45319, 49121, 49125)
volume i/o tests for storage plugins
**What this PR does / why we need it**:
Addresses issues [25268](https://github.com/kubernetes/kubernetes/issues/25268) and [28367](https://github.com/kubernetes/kubernetes/issues/28367), though it may be weak re. the streaming i/o issue. @matchstick
**Special notes for your reviewer**:
This is a new file. Plugins other than NFS, GlusterFS, iSCSI, and Ceph-RBD code will need to be supported in a separate PR.
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 48043, 48200, 49139, 36238, 49130)
Implement equivalence cache by caching and re-using predicate result
The last part of #30844, I opened a new PR instead of overwrite the old one because we changed some basic assumption by allowing invalidating equivalence cache item by individual predicate.
The idea of this PR is based on discussion in https://github.com/kubernetes/kubernetes/issues/32024
- [x] Pods belong to same controllerRef considered to be equivalent
- [x] ` podFitsOnNode` will use cached predicate result if it's available
- [x] Equivalence cache will be updated when if a fresh new predicate is done
- [x] `factory.go` will invalid specific predicate cache(s) based on the object change
- [x] Since `schedule` and `bind` are async, we need to optimistically invalid affected cache(s) before `bind`
- [x] Fully unit test of affected files
- [x] e2e test to verify cache update/invalid workflow
- [x] performance test results
- [x] Some nits fixes related but expected to result in `needs-rebase` so they are split to: #36060#35968#37512
cc @wojtek-t @davidopp
Automatic merge from submit-queue
Azure PD (Managed/Blob)
This is exactly the same code as this [PR](https://github.com/kubernetes/kubernetes/pull/41950). It has a clean set of generated items. We created a separate PR to accelerate the accept/merge the PR
CC @colemickens
CC @brendandburns
**What this PR does / why we need it**:
1. Adds K8S support for Azure Managed Disks.
2. Adds support for dedicated blob disks (1:1 to storage account) in addition to shared blob disks (n:1 to storage account).
3. Automatically manages the underlying storage accounts. New storage accounts are created at 50% utilization. Max is 100 disks, 60 disks per storage account.
2. Addresses the current issues with Blob Disks:
..* Significantly faster attach process. Disks are now usually available for pods on nodes under 30 sec if formatted, under a min if not formatted.
..* Adds support to move disks between nodes.
..* Adds consistent attach/detach behavior, checks if the disk is leased/attached on a different node before attempting to attach to target nodes.
..* Fixes a random hang behavior on Azure VMs during mount/format (for both blob + managed disks).
..* Fixes a potential conflict by avoiding the use of disk names for mount paths. The new plugin uses hashed disk uri for mount path.
The existing AzureDisk is used as is. Additional "kind" property was added allowing the user to decide if the pd will be shared, dedicated or managed (Azure Managed Disks are used).
Due to the change in mounting paths, existing PDs need to be recreated as PV or PVCs on the new plugin.
Automatic merge from submit-queue (batch tested with PRs 48864, 48651, 47703)
Enable logexporter mechanism to dump logs from k8s nodes to GCS directly
Ref https://github.com/kubernetes/kubernetes/issues/48513
This adds support for logexporter from k8s side. Next I'll send a PR adding support from test-infra side.
/cc @kubernetes/sig-scalability-misc @kubernetes/test-infra-maintainers @fejta @wojtek-t @gmarek
Automatic merge from submit-queue
add dockershim checkpoint node e2e test
Add a bunch of disruptive cases to test kubelet/dockershim's checkpoint work flow.
Some steps are quite hacky. Not sure if there is better ways to do things.
Automatic merge from submit-queue (batch tested with PRs 48555, 48849)
GCE: Fix panic when service loadbalancer has static IP address
Fixes#48848
```release-note
Fix service controller crash loop when Service with GCP LoadBalancer uses static IP (#48848, @nicksardo)
```
Automatic merge from submit-queue (batch tested with PRs 48594, 47042, 48801, 48641, 48243)
Prepare to introduce websockets for exec and portforward
Refactor the code in remotecommand to better represent the structure of
what is common between portforward and exec.
Ref #48633
Automatic merge from submit-queue
[e2e-ingress] Get node tag from instance under GKE
**What this PR does / why we need it**: Making ingress CI green again.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#48167
**Special notes for your reviewer**:
/assign @nicksardo
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 47918, 47964, 48151, 47881, 48299)
Add ApiEndpoint support to GCE config.
**What this PR does / why we need it**:
Add the ability to change ApiEndpoint for GCE.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
None
```
Automatic merge from submit-queue (batch tested with PRs 47850, 47835, 46197, 47250, 48284)
Do not fail on error when deleting ingress
Fixes#48239
If the api server or master is unavailable, the test should manually teardown load balancer resources.
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 47850, 47835, 46197, 47250, 48284)
Allocate clusterIP when change service type from ExternalName to ClusterIP
**What this PR does / why we need it**:
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#35354#46190
**Special notes for your reviewer**:
/cc @smarterclayton @thockin
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 48214, 48154)
Adding a retry and traceroute to the master version checking
This is hitting a lot of connection refused errors in the e2e upgrade tests. We should make this more robust in case this is intermittent network errors. In the event of an error, attempt to log a traceroute to the master.
cc @kubernetes/sig-cluster-lifecycle-bugs @dchen1107
#47379
Automatic merge from submit-queue (batch tested with PRs 48139, 48042, 47645, 48054, 48003)
Pipe clusterID into gce_loadbalancer_external.go
**What this PR does / why we need it**: Small cleanup for GCE ELB codes.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#48002
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 47915, 47856, 44086, 47575, 47475)
deprecate created-by annotation for e2e test framework
**What this PR does / why we need it**: This PR deprecates created-by annotation for e2e test framework. This is needed as we now have ControllerRef.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: xref #44407
**Special notes for your reviewer**: This is the third PR for deprecating created-by annotation. Other PRs can be found here: #47469, #47471
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 47470, 47260, 47411, 46852, 46135)
Write reports for each upgrade test
Due to the way Ginkgo runs individual test cases and the level of coordination required for the upgrade tests, they were all run under a single Ginkgo test case. This PR generates and auxiliary report that break out the results of each upgrade test. This is accomplished by:
1) Wrapping `ginkgo.Fail` and `ginkgo.Skip` to get the actual failure or skip messages.
2) Recovering that info in the upgrade test to generate an auxiliary report.
I suggest reviewing commit by commit.
Sample report: https://storage.googleapis.com/krouseytestreports/logs/results/1/artifacts/junit_upgrades.xmlFixes: #47371
Automatic merge from submit-queue (batch tested with PRs 46835, 46856)
Made WaitForReplicas and EnsureDesiredReplicas use PollImmediate and improved logging.
**What this PR does / why we need it**: Most importantly, this results in better logging: timeout is logged at the level of the caller, not the helper function, helping debugging.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 46885, 47197)
Fix e2e ns deletion message for flake analysis
**What this PR does / why we need it**:
Let's us know when pods have a missing deletion timestamp.
**Special notes for your reviewer**:
helps https://github.com/kubernetes/kubernetes/issues/47135
Automatic merge from submit-queue (batch tested with PRs 47065, 47157, 47143)
Use actual hostname when creating network e2e test pod
**What this PR does / why we need it**:
This changes a e2e framework network test Pod use the actual hostname value to match the `kubernetes.io/hostname` label in it's `NodeSelector`. Currently it assumes the Node name will match that hostname label which is not true in all environments.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*:
Fixescoreos/tectonic-installer#1018
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 46235, 44786, 46833, 46756, 46669)
implements StatefulSet update
**What this PR does / why we need it**:
1. Implements rolling update for StatefulSets
2. Implements controller history for StatefulSets.
3. Makes StatefulSet status reporting consistent with DaemonSet and ReplicaSet.
https://github.com/kubernetes/features/issues/188
**Special notes for your reviewer**:
**Release note**:
```release-note
Implements rolling update for StatefulSets. Updates can be performed using the RollingUpdate, Paritioned, or OnDelete strategies. OnDelete implements the manual behavior from 1.6. status now tracks
replicas, readyReplicas, currentReplicas, and updatedReplicas. The semantics of replicas is now consistent with DaemonSet and ReplicaSet, and readyReplicas has the semantics that replicas did prior to this release.
```
Automatic merge from submit-queue (batch tested with PRs 46997, 47021)
Don't parse human-readable output from gcloud in tests
This is the reason `[k8s.io] Services should be able to change the type and ports of a service [Slow]` is currently failing on GKE e2e tests. For GKE jobs we run a prerelease version of gcloud, in which the default command output was changed.
gcloud's default output for commands is human readable, and is subject to change. Anything scripting against gcloud should always pass `--format=json|yaml|value(...)` so you get standardized output.
fixes: #46918
Implements history utilities for ControllerRevision in the controller/history package
StatefulSetStatus now has additional fields for consistency with DaemonSet and Deployment
StatefulSetStatus.Replicas now represents the current number of createdPods and StatefulSetStatus.ReadyReplicas is the current number of ready Pods
Automatic merge from submit-queue (batch tested with PRs 46972, 42829, 46799, 46802, 46844)
Multizone static pv test
**What this PR does / why we need it**:
Adds an e2e test for checking that pods get scheduled to the same zone as statically created PVs. This tests the PersistentVolumeLabel admission controller, which adds zone and region labels when PVs are created. As part of this, I also had to make changes to volume test utility code to pass in a zone parameter for creating PDs, and also had to add an argument to the e2e test program to accept a list of zones.
Fixes#46995
**Special notes for your reviewer**:
It's probably easier to review each commit separately.
**Release note**:
NONE
Automatic merge from submit-queue
Respect PDBs during node upgrades and add test coverage to the ServiceTest upgrade test.
This is still a WIP... needs to be squashed at least, and I don't think it's currently passing until I increase the scale of the RC, but please have a look at the general outline. Thanks!
Fixes#38336
@kow3ns @bdbauer @krousey @erictune @maisem @davidopp
```
On GCE, node upgrades will now respect PodDisruptionBudgets, if present.
```
Respect PDBs during node upgrades and add test coverage to the
ServiceTest upgrade test. Modified that test so that we include pod
anti-affinity constraints and a PDB.
Automatic merge from submit-queue (batch tested with PRs 46076, 43879, 44897, 46556, 46654)
Local storage plugin
**What this PR does / why we need it**:
Volume plugin implementation for local persistent volumes. Scheduler predicate will direct already-bound PVCs to the node that the local PV is at. PVC binding still happens independently.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*:
Part of #43640
**Release note**:
```
Alpha feature: Local volume plugin allows local directories to be created and consumed as a Persistent Volume. These volumes have node affinity and pods will only be scheduled to the node that the volume is at.
```
Automatic merge from submit-queue
Support grabbing test suite metrics
**What this PR does / why we need it**:
Add support for grabbing metrics that cover the entire test suite's execution.
Update the "interesting" controller-manager metrics to match the
current names for the garbage collector, and add namespace controller
metrics to the list.
If you enable `--gather-suite-metrics-at-teardown`, the metrics file is written to a file with a name such as `MetricsForE2ESuite_2017-05-25T20:25:57Z.json` in the `--report-dir`. If you don't specify `--report-dir`, the metrics are written to the test log output.
I'd like to enable this for some of the `pull-*` CI jobs, which will require a separate PR to test-infra.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
@kubernetes/sig-testing-pr-reviews @smarterclayton @wojtek-t @gmarek @derekwaynecarr @timothysc
Automatic merge from submit-queue (batch tested with PRs 45534, 37212, 46613, 46350)
[e2e]Fix define redundant parameter
When timeout to reach HTTP service, redundant parameter make the
error is nil.
Automatic merge from submit-queue (batch tested with PRs 46407, 46457)
GCE - Refactor API for firewall and backend service creation
**What this PR does / why we need it**:
- Currently, firewall creation function actually instantiates the firewall object; this is inconsistent with the rest of GCE api calls. The API normally gets passed in an existing object.
- Necessary information for firewall creation, (`computeHostTags`,`nodeTags`,`networkURL`,`subnetworkURL`,`region`) were private to within the package. These now have public getters.
- Consumers might need to know whether the cluster is running on a cross-project network. A new `OnXPN` func will make that information available.
- Backend services for regions have been added. Global ones have been renamed to specify global.
- NamedPort management of instance groups has been changed from an `AddPortsToInstanceGroup` func (and missing complementary `Remove...`) to a single, simple `SetNamedPortsOfInstanceGroup`
- Addressed nitpick review comments of #45524
ILB needs the regional backend services and firewall refactor. The ingress controller needs the new `OnXPN` func to decide whether to create a firewall.
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue
Apply KubeProxyEndpointLagTimeout to ESIPP tests
Fixes#46533.
The previous construction of ESIPP tests is weird, so I redo it a bit.
A 30 seconds `KubeProxyEndpointLagTimeout` is introduced, as these tests ain't verifying performance, may be better to not make it too tight.
/assign @thockin
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 46252, 45524, 46236, 46277, 46522)
Make GCE load-balancers create health checks for nodes
From #14661. Proposal on kubernetes/community#552. Fixes#46313.
Bullet points:
- Create nodes health check and firewall (for health checking) for non-OnlyLocal service.
- Create local traffic health check and firewall (for health checking) for OnlyLocal service.
- Version skew:
- Don't create nodes health check if any nodes has version < 1.7.0.
- Don't backfill nodes health check on existing LBs unless users explicitly trigger it.
**Release note**:
```release-note
GCE Cloud Provider: New created LoadBalancer type Service now have health checks for nodes by default.
An existing LoadBalancer will have health check attached to it when:
- Change Service.Spec.Type from LoadBalancer to others and flip it back.
- Any effective change on Service.Spec.ExternalTrafficPolicy.
```
Update the "interesting" controller-manager metrics to match the
current names for the garbage collector, and add namespace controller
metrics to the list.
Automatic merge from submit-queue (batch tested with PRs 45949, 46009, 46320, 46423, 46437)
Unregister some metrics
delete some registered metrics since they are not observed
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 45573, 46354, 46376, 46162, 46366)
Subresources are not included in apiserver prometheus metrics
Subresources are very often completely different code paths and errors
generated on those code paths are important to distinguish.
@kubernetes/sig-api-machinery-pr-reviews
```release-note
The Prometheus metrics for the kube-apiserver for tracking incoming API requests and latencies now return the `subresource` label for correctly attributing the type of API call.
```
Automatic merge from submit-queue (batch tested with PRs 38990, 45781, 46225, 44899, 43663)
Support parallel scaling on StatefulSets
Fixes#41255
```release-note
StatefulSets now include an alpha scaling feature accessible by setting the `spec.podManagementPolicy` field to `Parallel`. The controller will not wait for pods to be ready before adding the other pods, and will replace deleted pods as needed. Since parallel scaling creates pods out of order, you cannot depend on predictable membership changes within your set.
```