Since we weren't running the HPA with metrics REST clients by default,
we had no bootstrap policy enabling the HPA controller to talk to the
metrics APIs.
This adds permissions for the HPA controller to list pods in metrics.k8s.io,
and to list any resource in custom.metrics.k8s.io.
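For illustration, a minimal sketch of the kind of rules this adds, written against the rbac/v1 API types (the role name and exact rule layout here are assumptions, not the bootstrap policy's actual source):
```go
package bootstrapsketch

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hpaMetricsClusterRole sketches the permissions described above: list pods
// in the metrics.k8s.io group, and list any resource in custom.metrics.k8s.io.
func hpaMetricsClusterRole() rbacv1.ClusterRole {
	return rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{Name: "system:controller:horizontal-pod-autoscaler"},
		Rules: []rbacv1.PolicyRule{
			// Resource metrics: the HPA lists pod metrics.
			{
				APIGroups: []string{"metrics.k8s.io"},
				Resources: []string{"pods"},
				Verbs:     []string{"list"},
			},
			// Custom metrics: the HPA may list any exposed metric resource.
			{
				APIGroups: []string{"custom.metrics.k8s.io"},
				Resources: []string{"*"},
				Verbs:     []string{"list"},
			},
		},
	}
}
```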
Automatic merge from submit-queue (batch tested with PRs 51796, 52223)
Add bsalamat to sig-scheduling-maintainers
**What this PR does / why we need it**:
Adds bsalamat to sig-scheduling-maintainers.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes # N/A
**Release note**:
```release-note
NONE
```
@kubernetes/sig-scheduling-pr-reviews @davidopp @timothysc @k82cn @wojtek-t
Automatic merge from submit-queue (batch tested with PRs 51796, 52223)
Fix pod and node names switched around in error message.
**What this PR does / why we need it**: This PR fixes a pod name and a node name switched around in an error message from the daemon controller.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: No issue that I know of
**Special notes for your reviewer**: -
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue
Bugfix: Fix e2e Flaky Apps/Job BackoffLimit test
This fix is linked to PR #51153, which introduced `JobSpec.BackoffLimit`.
Previously the timeout used in the test was too aggressive and generated flaky test executions. Now it uses the default `framework.JobTimeout` used in other tests.
**What this PR does / why we need it**:
This PR should fix the flaky "[sig-apps] Job should exceed backoffLimit" test, which failed due to a too-short timeout duration.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #51153
**Special notes for your reviewer**:
**Release note**:
```release-note
```
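For illustration, a hedged sketch of the pattern (the constant value and helper shape are assumptions; the real test lives in the e2e apps suite):
```go
package e2esketch

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// jobTimeout stands in for framework.JobTimeout, the shared e2e timeout;
// the concrete value here is a placeholder, not the framework's constant.
const jobTimeout = 15 * time.Minute

// waitForBackoffLimitExceeded polls until check reports that the Job hit its
// backoff limit, giving slow clusters headroom to finish the pod restarts
// instead of timing out early and flaking.
func waitForBackoffLimitExceeded(check wait.ConditionFunc) error {
	return wait.Poll(5*time.Second, jobTimeout, check)
}
```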
Automatic merge from submit-queue (batch tested with PRs 52452, 52115, 52260, 52290)
Fixes device plugin re-registration handling logic to make sure:
- If a device plugin exits, its exported resource will be removed.
- No capacity change if a new device plugin instance comes up to replace the old instance.
**What this PR does / why we need it**:
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes https://github.com/kubernetes/kubernetes/issues/52510
**Special notes for your reviewer**:
**Release note**:
```release-note
```
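For illustration, a minimal sketch of the re-registration rule (types and names are illustrative, not the kubelet's actual device plugin manager API):
```go
package devicepluginsketch

// endpoint is a hypothetical stand-in for a device plugin connection.
type endpoint interface {
	stop()
}

type manager struct {
	endpoints map[string]endpoint // keyed by resource name
	capacity  map[string]int64    // advertised device counts per resource
}

// register handles (re-)registration: a new plugin instance replacing an old
// one for the same resource swaps the endpoint without changing capacity.
func (m *manager) register(resource string, ep endpoint) {
	if old, ok := m.endpoints[resource]; ok {
		old.stop()
	}
	m.endpoints[resource] = ep
}

// onEndpointExit removes the exported resource when its plugin goes away.
func (m *manager) onEndpointExit(resource string) {
	delete(m.endpoints, resource)
	delete(m.capacity, resource)
}
```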
Automatic merge from submit-queue (batch tested with PRs 52452, 52115, 52260, 52290)
fix azure disk mounter issue
**What this PR does / why we need it**:
Fixes an Azure disk mounter issue. It's a P1 bug that exists in the 1.7 and 1.8 releases and should be cherry-picked to 1.7 and 1.8.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #52261
Consider the following scenario:
1) A pod mounts an Azure disk on a k8s agent node.
2) The kubelet on that agent node is restarted.
3) The pod cannot start up; it always reports errors like the following:
4d 1m 3065 kubelet, 14777acs9000 Warning FailedMount MountVolume.SetUp failed for volume "pvc-7a0cdeb9-92c7-11e7-b86b-000d3a36d70c" : azureDisk - Not a mounting point for disk andykubewin175-dynamic-pvc-7a0cdeb9-92c7-11e7-b86b-000d3a36d70c on \var\lib\kubelet\pods\d146c023-92c7-11e7-b86b-000d3a36d70c\volumes\kubernetes.io~azure-disk\pvc-7a0cdeb9-92c7-11e7-b86b-000d3a36d70c
4d 1m 3157 kubelet, 14777acs9000 Warning FailedMount Error syncing pod
**Special notes for your reviewer**:
If you look at the corresponding implementations for vSphere and GCE, they return nil instead of an error:
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/vsphere_volume/vsphere_volume.go#L217-L220
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/gce_pd/gce_pd.go#L273-L275
The return-info parsing logic here wrongly treats this case as an error:
https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/operationexecutor/operation_generator.go#L469-L475
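For illustration, a hedged sketch of the pattern being adopted (the lookup helper is hypothetical, not the Azure plugin's real code): a missing mount point is a normal state and is reported as ("", nil) rather than as an error.
```go
package azuredisksketch

// deviceNameFromMount illustrates the fix: lookup is a hypothetical helper
// that fails when the disk has no mount point yet.
func deviceNameFromMount(lookup func() (string, error)) (string, error) {
	name, err := lookup()
	if err != nil {
		// The disk simply isn't mounted yet. Report that as a non-error so
		// the operation executor proceeds to mount it rather than failing
		// the retry loop, matching the vSphere and GCE plugins linked above.
		return "", nil
	}
	return name, nil
}
```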
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 52452, 52115, 52260, 52290)
Add env var to enable kubelet rotation in kube-up.sh.
Fixes https://github.com/kubernetes/kubernetes/issues/52114
```release-note
Adds ROTATE_CERTIFICATES environment variable to kube-up.sh script for GCE
clusters. When that var is set to true, the command line flag enabling kubelet
client certificate rotation will be added to the kubelet command line.
```
Automatic merge from submit-queue (batch tested with PRs 52452, 52115, 52260, 52290)
Fix support for updating quota on update
This PR implements support for properly handling quota when resources are updated: negative usage deltas are never added in.
Fixes https://github.com/kubernetes/kubernetes/issues/51736
cc @derekwaynecarr
/sig storage
```release-note
Make sure that resources being updated are handled correctly by Quota system
```
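For illustration, a hedged sketch of that rule (not the quota library's actual code): on update, the per-resource usage delta is clamped at zero before being charged.
```go
package quotasketch

import "k8s.io/apimachinery/pkg/api/resource"

// positiveDelta returns how much usage grew between the old and new object,
// clamping shrinkage to zero so an update never subtracts from charged quota.
func positiveDelta(oldQty, newQty resource.Quantity) resource.Quantity {
	delta := newQty.DeepCopy()
	delta.Sub(oldQty)
	if delta.Sign() < 0 {
		return resource.Quantity{}
	}
	return delta
}
```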
Automatic merge from submit-queue (batch tested with PRs 51824, 50476, 52451, 52009, 52237)
Improve apiserver metrics reporting
Normalize "WATCHLIST" to "WATCH", add "scope" to the other metrics (listing 50k pods is != listing pods in a namespace), and add a new scope "resource" to cover individual resource calls.
This roughly aligns metrics with our ACL model (technically resource scope is GET, but POST to a subresource and POST to a namespace are not the same thing).
```release-note
WATCHLIST calls are now reported as WATCH verbs in prometheus for the apiserver_request_* series. A new "scope" label is added to all apiserver_request_* values that is either 'cluster', 'resource', or 'namespace' depending on which level the query is performed at.
```
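For illustration, a hedged sketch of how the three scope values can be derived from request attributes (the real mapping lives in the apiserver metrics code):
```go
package metricssketch

// requestScope buckets a request the way the new "scope" label describes:
// calls against a named object are "resource", namespaced collection calls
// are "namespace", and cluster-wide collection calls are "cluster".
func requestScope(namespace, name string) string {
	switch {
	case name != "":
		return "resource"
	case namespace != "":
		return "namespace"
	default:
		return "cluster"
	}
}
```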
Automatic merge from submit-queue (batch tested with PRs 51824, 50476, 52451, 52009, 52237)
fix issue (#47976): Invalid value error when creating service from exported config
**What this PR does / why we need it**:
close issue #47976
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 51824, 50476, 52451, 52009, 52237)
Plumbing the proxy dialer to the webhook admission plugin
* Fixing https://github.com/kubernetes/kubernetes/issues/49987. Plumb the `Dial` function to the `transport.Config`
* Fixing https://github.com/kubernetes/kubernetes/issues/52366. Let the webhook admission plugin set the `TLSConfig.ServerName`.
I tested it in my gke setup. I don't have time to implement an e2e test before 1.8 release. I think it's ok to add the test later, because *i)* the change only affects the alpha webhook admission feature, and *ii)* the webhook feature is unusable without the fix. That said, it's up to my reviewer to decide.
Filed https://github.com/kubernetes/kubernetes/issues/52368 for the missing e2e test.
( The second commit is https://github.com/kubernetes/kubernetes/pull/52372, which is just a cleanup of client configuration in e2e tests. It removed a function that marshalled the client config to json and then unmarshalled it. It is a prerequisite of this PR, because this PR added the `Dial` function to the config which is not json marshallable.)
```release-note
Fixed the webhook admission plugin so that it works even if the apiserver and the nodes are in two networks (e.g., in GKE).
Fixed the webhook admission plugin so that webhook authors can use the DNS name of the service as the CommonName when generating the server cert for the webhook.
Action required:
Anyone who generated server certs for admission webhooks needs to regenerate them. Previously, when generating the server cert for an admission webhook, the CN value didn't matter. Now you must set it to the DNS name of the webhook service, i.e., `<service.Name>.<service.Namespace>.svc`.
```
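For illustration, a hedged sketch of the plumbing (not the PR's actual code; field names follow k8s.io/client-go/transport as of this era):
```go
package webhooksketch

import (
	"fmt"
	"net"

	"k8s.io/client-go/transport"
)

// buildTransportConfig carries the apiserver's proxy dialer into the
// client-go transport config and pins ServerName to the webhook service's
// DNS name, so the serving cert is verified against the right name even
// when the connection is dialed through the network proxy.
func buildTransportConfig(dial func(network, addr string) (net.Conn, error), svcName, svcNamespace string) *transport.Config {
	cfg := &transport.Config{Dial: dial}
	cfg.TLS.ServerName = fmt.Sprintf("%s.%s.svc", svcName, svcNamespace)
	return cfg
}
```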
Automatic merge from submit-queue (batch tested with PRs 51824, 50476, 52451, 52009, 52237)
Allow turning on the metadata firewall & proxy in GCE, off by default
**What this PR does / why we need it**: Add necessary variables in kube-env to allow a user to turn on metadata firewall and proxy for K8s on GCE.
Ref #8867.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*:
**Special notes for your reviewer**:
**Release note**:
```release-note
GCE users can enable the metadata firewall and metadata proxy with KUBE_FIREWALL_METADATA_SERVER and ENABLE_METADATA_PROXY, respectively.
```
Automatic merge from submit-queue (batch tested with PRs 52442, 52247, 46542, 52363, 51781)
Add more tests for pod preemption
**What this PR does / why we need it**:
Adds more e2e and integration tests for pod preemption.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
This PR is based on #50949. Only the last commit is new.
**Release note**:
```release-note
NONE
```
ref #47604
@kubernetes/sig-scheduling-pr-reviews @davidopp
Automatic merge from submit-queue (batch tested with PRs 52442, 52247, 46542, 52363, 51781)
Make CPU manager release CPUs when Pod enters completed phase.
**What this PR does / why we need it**: When the CPU manager is enabled, this PR releases allocated CPUs when a container is not running and is non-restartable.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #52351
**Special notes for your reviewer**:
This bug only reproduces for pods with `restartPolicy` = `Never` or `OnFailure`. The following output is from a 4-CPU node. The bug can be reproduced as long as >= half the cores are requested.
pod1.yaml:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod1
spec:
  containers:
  - image: ubuntu
    command: ["/bin/bash"]
    args: ["-c", "sleep 5"]
    name: test-container1
    resources:
      requests:
        cpu: 2
        memory: 100Mi
      limits:
        cpu: 2
        memory: 100Mi
  restartPolicy: "Never"
```
pod2.yaml:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod2
spec:
  containers:
  - image: ubuntu
    command: ["/bin/bash"]
    args: ["-c", "sleep 5"]
    name: test-container1
    resources:
      requests:
        cpu: 2
        memory: 100Mi
      limits:
        cpu: 2
        memory: 100Mi
  restartPolicy: "Never"
```
Run a local Kubernetes cluster with CPU manager enabled.
```sh
KUBELET_FLAGS='--feature-gates=CPUManager=true --cpu-manager-policy=static --cpu-manager-reconcile-period=1s --kube-reserved=cpu=500m' ./hack/local-up-cluster.sh
```
_Before:_
Create `test-pod1` using pod1.yaml.
```
./cluster/kubectl.sh create -f pod1.yaml
```
Wait for the pod to complete and wait another 90 seconds (giving GC enough time to kick in).
Create `test-pod2` using pod2.yaml.
```
./cluster/kubectl.sh create -f pod2.yaml
```
Get all pods in the cluster.
```
./cluster/kubectl.sh get pods -a
NAME READY STATUS RESTARTS AGE
test-pod1 0/1 Completed 0 1m
test-pod2 0/1 not enough cpus available to satisfy request 0 9s
```
_After:_
Create `test-pod1` using pod1.yaml.
```
./cluster/kubectl.sh create -f pod1.yaml
```
Wait for the pod to complete and wait another 90 seconds (giving GC enough time to kick in).
Create `test-pod2` using pod2.yaml.
```
./cluster/kubectl.sh create -f pod2.yaml
```
Get all pods in the cluster.
```
./cluster/kubectl.sh get pods -a
NAME READY STATUS RESTARTS AGE
test-pod1 0/1 Completed 0 1m
test-pod2 0/1 Completed 0 9s
```
Automatic merge from submit-queue (batch tested with PRs 52442, 52247, 46542, 52363, 51781)
Ignore pods for quota marked for deletion whose node is unreachable
**What this PR does / why we need it**:
Traditionally, we charge to quota all pods that are in a non-terminal phase. We have a user report that noted the behavior change in kube 1.5 for the node controller to no longer force delete pods whose nodes have been lost. Instead, the pod is marked for deletion, and the reason is updated to state that the node is unreachable. The user expected the quota to be released. If the user was at their quota limit, their application may not be able to create a new replica given the current behavior. As a result, this PR ignores pods marked for deletion that have exceeded their grace period.
**Which issue this PR fixes**
xref https://bugzilla.redhat.com/show_bug.cgi?id=1455743
fixes https://github.com/kubernetes/kubernetes/issues/52436
**Release note**:
```release-note
Ignore pods marked for deletion that exceed their grace period in ResourceQuota
```
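For illustration, a hedged sketch of the predicate (the real check lives in the pod quota evaluator):
```go
package quotasketch

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// ignoredForQuota reports whether a pod should stop being charged to quota:
// it has been marked for deletion and its grace period has fully elapsed.
func ignoredForQuota(pod *v1.Pod, now time.Time) bool {
	if pod.DeletionTimestamp == nil || pod.DeletionGracePeriodSeconds == nil {
		return false
	}
	gracePeriod := time.Duration(*pod.DeletionGracePeriodSeconds) * time.Second
	return now.After(pod.DeletionTimestamp.Time.Add(gracePeriod))
}
```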
Automatic merge from submit-queue (batch tested with PRs 52442, 52247, 46542, 52363, 51781)
Add some test cases in default_test.go
**What this PR does / why we need it**:
Add some test cases in default_test.go
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue
[fluentd-gcp addon] Remove some e2e tests out of blocking suites
Fixes https://github.com/kubernetes/kubernetes/issues/52433
Some Stackdriver Logging e2e tests are broken in release-blocking suites:
- Due to a change in Docker 1.13, on some systems logs are automatically split into 16K chunks. This PR removes an e2e test that assumes otherwise.
- In large clusters, it's not possible to ingest system logs from all nodes
Since it's not a Kubernetes problem per se, mitigating this by removing these tests from blocking suites.
Automatic merge from submit-queue
use specified discovery information if possible
Fixes https://github.com/kubernetes/kubernetes/issues/49948
This uses the discovery information when available, but it seems we never updated "normal" resources to report their singular name, so it's often not available. I've left this code backwards compatible.
@enisoc @ash2k
@kubernetes/sig-api-machinery-misc
```release-note
custom resources that use unconventional pluralization now work properly with kubectl and garbage collection
```
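For illustration, a hedged sketch of the fallback (not the PR's actual code):
```go
package discoverysketch

import (
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// singularName prefers the singular name reported by discovery and falls
// back to the lowercased kind when discovery doesn't supply one, which is
// common for built-in resources.
func singularName(res metav1.APIResource) string {
	if res.SingularName != "" {
		return res.SingularName
	}
	return strings.ToLower(res.Kind)
}
```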
Automatic merge from submit-queue
Delete the federation namespace from fcp instead of individual objects
**What this PR does / why we need it**:
This PR simplifies cleanup by deleting the entire namespace instead of individual objects.
This PR is linked to https://github.com/kubernetes/kubernetes/issues/50543. It may not solve the issue, but it tries an alternative approach.
**Release note**:
```release-note
NONE
```
/assign @madhusudancs
Automatic merge from submit-queue
Fix failing autoscaling test in GKE
This should fix `[sig-autoscaling] Cluster size autoscaling [Slow] should increase cluster size if pending pods are small and there is another node pool that is not autoscaled [Feature:ClusterSizeAutoscalingScaleUp]` by getting the list of nodes in a GKE node pool a different way (filtering nodes by label). The gcloud command currently used for this fails, because the test only has the GKE node pool name and not the actual MIG name.
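For illustration, a hedged sketch of the label-based lookup (the label key is the conventional GKE node pool label, and the List signature matches client-go of this era; both are assumptions about the test's actual code):
```go
package autoscalingsketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodesInPool filters nodes by the GKE node pool label instead of resolving
// the underlying MIG name, which the test cannot know.
func nodesInPool(client kubernetes.Interface, poolName string) (*v1.NodeList, error) {
	return client.CoreV1().Nodes().List(metav1.ListOptions{
		LabelSelector: "cloud.google.com/gke-nodepool=" + poolName,
	})
}
```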
Automatic merge from submit-queue (batch tested with PRs 52376, 52439, 52382, 52358, 52372)
Remove the conversion of client config
It was needed because the clientset code in client-go used to be a copy of the clientset code in Kubernetes. client-go is authoritative now, so we can remove the nasty copy.
Automatic merge from submit-queue (batch tested with PRs 52376, 52439, 52382, 52358, 52372)
Add new api groups to the GCE advanced audit policy
Fixes https://github.com/kubernetes/kubernetes/issues/52265
It adds the missing API groups that were introduced in the 1.8 release.
@piosz there's also the 'metrics' api group, should we audit it?
Automatic merge from submit-queue (batch tested with PRs 52376, 52439, 52382, 52358, 52372)
Workaround go-junit-report bug for TestApps
**What this PR does / why we need it**: Fix output from pkg/kubectl/apps/TestApps unit test
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #51253
**Special notes for your reviewer**: Literally copy-pasta of the approach taken in #45320. Maybe a sign that this should be extracted into something shared. I'm just trying to see if we can make https://k8s-testgrid.appspot.com/kubernetes-presubmits and https://k8s-testgrid.appspot.com/release-master-blocking a little more green for now.
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 52376, 52439, 52382, 52358, 52372)
Pass correct clientbuilder to cloudproviders
Fixes https://github.com/kubernetes/kubeadm/issues/425 by moving the Initialize call to after the start of the token controller and passing `clientBuilder` instead of `rootClientBuilder` to the cloudproviders.
/assign @bowei
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue
Remove 1.2.* release notes in CHANGELOG.md
**What this PR does / why we need it**:
Remove 1.2.* release notes in CHANGELOG.md to make the file smaller so its content can be shown.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
ref: https://github.com/kubernetes/kubernetes/issues/48985#issuecomment-328076817
**Special notes for your reviewer**:
This is just a quick fix before we have an ideal solution for #48985
/cc @jdumars
/priority important-soon
/sig release
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue
Make CPU constraint for l7-lb-controller in density test scale with #nodes
Just noticed that we changed the memory last time, but didn't change cpu. From the last run:
```
Sep 13 04:25:03.360: INFO: Unexpected error occurred: Container l7-lb-controller-v0.9.6-gce-scale-cluster-master/l7-lb-controller is using 0.642709233/0.15 CPU
```
Automatic merge from submit-queue
Fix swallowed errors in various volume packages
**What this PR does / why we need it**: Fixes swallowed errors in various volume packages.
**Release note**:
```release-note
NONE
```
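For illustration, the general shape of such a fix (a hedged sketch, not any specific call site in the volume packages):
```go
package volumesketch

// unmountAndReport shows the pattern: an error that was previously discarded
// is now propagated to the caller.
func unmountAndReport(unmount func(string) error, path string) error {
	// before: unmount(path) // return value silently dropped
	if err := unmount(path); err != nil {
		return err // after: the error is surfaced
	}
	return nil
}
```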
Automatic merge from submit-queue (batch tested with PRs 51601, 52153, 52364, 52362, 52342)
Make advanced audit policy on GCP configurable
Related to https://github.com/kubernetes/kubernetes/issues/52265
Make GCP audit policy configurable
/cc @tallclair
Automatic merge from submit-queue (batch tested with PRs 51601, 52153, 52364, 52362, 52342)
fix kubeadm token create error
**What this PR does / why we need it**:
fix kubeadm token create error
**Which issue this PR fixes**
[#436](https://github.com/kubernetes/kubeadm/issues/436)
**Special notes for your reviewer**:
CC @luxas
Automatic merge from submit-queue (batch tested with PRs 51601, 52153, 52364, 52362, 52342)
fix Kubeadm phase addon error
**What this PR does / why we need it**:
fix Kubeadm phase addon error
**Which issue this PR fixes**
[#437](https://github.com/kubernetes/kubeadm/issues/437)
**Special notes for your reviewer**:
CC @luxas @andrewrynhard
Automatic merge from submit-queue (batch tested with PRs 51601, 52153, 52364, 52362, 52342)
Improve kubeadm help text
* Replace 'misc' with the more specific at-mentions for bugs and feature-requests.
* Replace ReplicaSets with Deployments as example, because ReplicaSets are dated.
* Generalize join example.
Before:
```
┌──────────────────────────────────────────────────────────┐
│ KUBEADM IS BETA, DO NOT USE IT FOR PRODUCTION CLUSTERS! │
│ │
│ But, please try it out! Give us feedback at: │
│ https://github.com/kubernetes/kubeadm/issues │
│ and at-mention @kubernetes/sig-cluster-lifecycle-misc │
└──────────────────────────────────────────────────────────┘
Example usage:
Create a two-machine cluster with one master (which controls the cluster),
and one node (where your workloads, like Pods and ReplicaSets run).
┌──────────────────────────────────────────────────────────┐
│ On the first machine │
├──────────────────────────────────────────────────────────┤
│ master# kubeadm init │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ On the second machine │
├──────────────────────────────────────────────────────────┤
│ node# kubeadm join --token=<token> <ip-of-master>:<port> │
└──────────────────────────────────────────────────────────┘
You can then repeat the second step on as many other machines as you like.
```
After (changes highlighted with `<--`):
```
┌──────────────────────────────────────────────────────────┐
│ KUBEADM IS BETA, DO NOT USE IT FOR PRODUCTION CLUSTERS! │
│ │
│ But, please try it out! Give us feedback at: │
│ https://github.com/kubernetes/kubeadm/issues │
│ and at-mention @kubernetes/sig-cluster-lifecycle-bugs │ <--
│ or @kubernetes/sig-cluster-lifecycle-feature-requests │ <--
└──────────────────────────────────────────────────────────┘
Example usage:
Create a two-machine cluster with one master (which controls the cluster),
and one node (where your workloads, like Pods and Deployments run). <--
┌──────────────────────────────────────────────────────────┐
│ On the first machine │
├──────────────────────────────────────────────────────────┤
│ master# kubeadm init │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ On the second machine │
├──────────────────────────────────────────────────────────┤
│ node# kubeadm join <arguments-returned-from-init> │ <--
└──────────────────────────────────────────────────────────┘
You can then repeat the second step on as many other machines as you like.
```
cc @luxas