Automatic merge from submit-queue
Add node problem detector as an addon pod.
```release-note
Introduce a new add-on pod NodeProblemDetector.
NodeProblemDetector is a DaemonSet running on each node, monitoring node health and reporting
node problems as NodeCondition and Event. Currently it already supports kernel log monitoring, and
will support more problem detection in the future. It is enabled by default on gce now.
```
This PR enables NodeProblemDetector as an add-on pod.
/cc @mikedanese @kubernetes/sig-node
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/.github/PULL_REQUEST_TEMPLATE.md?pixel)]()
Automatic merge from submit-queue
Configuration for GCP webhook authentication and authorization
This PR adds configuration for GCP webhook authentication and authorization in ContainerVM and GCI. The change of configure-vm.sh and kube-apiserver.manifest is directly copied from @cjcullen's PR #25380 and #25296. The change in GCI script configure-helper.sh includes the support for webhook authentication and authorization, and also some code refactor to improve readability.
@cjcullen @roberthbailey @zmerlynn please review it. The original PRs are P1, please mark this as P1.
cc/ @fabioy @kubernetes/goog-image FYI.
I verified it by running e2e tests on GCI cluster. Without the GCI side change, cluster creation fails as being capture by GKE Jenkins tests. I don't test when the two env GCP_AUTHN_URL and GCP_AUTHZ_URL are set, because they are only set in GKE. After this PR is merged, @cjcullen will test in GKE.
Automatic merge from submit-queue
cluster/gce/coreos: Set service-cluster-ip-range
Broken by #19242
See also #26002
This is necessary to kube-up for me, but depending on how #26002 plays out, this PR might not be necessary. Happy to close this or merge or whatever depending on what's best.
cc @yifan-gu @sjpotter @mikedanese
Automatic merge from submit-queue
GCI: Fix the condition for using the default image
This PR revises the condition for using the default GCI image. The old logic is not convenient for manually run e2e tests in some cases (mainly for GCI team to test custom images). The new logic by this PR is very similar to the logic in using ContainerVM. When setting distro to "gci", if master or node image is unset, we use gci-dev for it. If either is set, we respect it.
@roberthbailey @zmerlynn @dchen1107 please review it, and we should cherry pick it in release-1.2 branch. Thanks!
cc/ @kubernetes/goog-image @adityakali FYI
Automatic merge from submit-queue
GCI/Trusty: Fix an issue in using 'find' commands
This PR makes the logic of 'find' command consistent with the 'cp' command afterwards, i.e., only check one layer of a given dir. Without this fix, we have seen a recent breakage after PR #25309 added the file cluster/addons/fluentd-elasticsearch/es-image/template-k8s-logstash.json. The 'find' command discovers this json file, but the 'cp' command fails.
@roberthbailey @dchen1107 @zmerlynn please review this fix, and mark it as a cherry pick candidate. I already verified this fix can resolve the breakage.
cc/ @wonderfly @fabioy @kubernetes/goog-image FYI
Automatic merge from submit-queue
GCI: Enable the log of upstart jobs
This PR enables the log of upstart jobs in master.yaml and node.yaml. By default, log of upstart jobs are enabled in Trusty and placed in /var/log/upstart, but not enabled in GCI. This change explicitly directs the log to the system logger. For trusty, they are in /var/log/syslog file. In GCI, we can check it using "journalctl". This change will be useful for debugging if cluster initialization fails.
@roberthbailey @maisem @dchen1107 please review it. This will be useful for issues like #23634. We should also cherry pick it in release-1.2
cc/ @fabioy @zmerlynn @wonderfly FYI.
Automatic merge from submit-queue
Salt configuration for the new Cluster Autoscaler for GCE
Adds support for cloud autoscaler from contrib/cloud-autoscaler in kube-up.sh GCE script.
cc: @fgrzadkowski @piosz
Automatic merge from submit-queue
Use --format='value(name)' with gcloud instead of grep/awk/cut
Fixing our fragile parsing of `gcloud` is getting old (#24746, #25159, maybe others?).
Instead, let's just get the proper output out of `gcloud` in the first place.
Automatic merge from submit-queue
Change default clusterCIDRs from /16 to /14 in GCE configs allowing 1000 Node clusters by default.
cc @thockin @roberthbailey @wojtek-t @zmerlynn @davidopp
Automatic merge from submit-queue
GCI: Add two GCI specific metadata pairs
This PR adds two GCI specific metadata pairs when using GCI image.
(1) "gci-update-strategy": by default the GCI in-place updater is enabled. It means that when a new image is released, the instance on the old image will be upgraded to the new image. In this change, we turn it off;
(2) "gci-ensure-gke-docker": GCI is built with two versions of docker. When this metadata is set to "true", the version satisfying kubernetes qualification will be used. Setting this metadata prevents from using incorrect docker version.
Automatic merge from submit-queue
Fix detect-node-names to not error out if there are no nodes
Fixes#21564.
Teardown was not working correctly in rare cases because `detect-node-names` was failing before any of the actual cleanup was run. I'm pretty sure the issue was that there was an instance group, but no instances in the instance group, so we bailed out when we tried to expand the bash array.
This PR adds a guard so we don't bail if the array is empty.
cc @jlowdermilk @spxtr
Automatic merge from submit-queue
Add support for running clusters on GCI
Google Container-VM Image (GCI) is the next revision of Container-VM. See documentation at https://cloud.google.com/compute/docs/containers/vm-image/. This change adds support for starting a Kubernetes cluster using GCI.
With this change, users can start a kubernetes cluster using the latest kubelet and kubectl release binary built in the GCI image by running:
$ KUBE_OS_DISTRIBUTION="gci" cluster/kube-up.sh
Or run a testing cluster on GCI by running:
$ KUBE_OS_DISTRIBUTION="gci" go run hack/e2e.go -v --up
The commands above will choose the latest GCI image by default.
Automatic merge from submit-queue
Switch to ABAC authorization from AllowAll
Switch from AllowAll to ABAC. All existing identities (that are created by deployment scripts) are given full permissions through ABAC. Manually created identities will need policies added to the `policy.jsonl` file on the master.
Automatic merge from submit-queue
don't source the kube-env in addon-manager
This was added in 2feb658ed7 which became unused after #23603 but wasn't removed
Automatic merge from submit-queue
Trusty: Add debug supports for docker and kubelet
This PR adds debug support in two aspects: (1) For a test cluster, docker command will have "--debug" flag. Recently we noticed that this is very helpful in debug e2e test failures; (2) The kubelet command line will be put in /etc/default/kubelet. If a developer wants to test kubelet flags without recreating a cluster, she/he only needs to revise this file and then run "initctl restart kubelet". In addition, this PR fixes a couple of small things like comments and alignment.
Test result:
(1) Manually verified changing /etc/default/kubelet and run "initctl restart kubelet";
(2) Verified docker command line flag "--debug";
(3) e2e on pure trusty cluster and hybrid cluster all passed.
@roberthbailey @dchen1107 @zmerlynn please review it.
cc/ @yujuhong @fabioy @wonderfly FYI.
Automatic merge from submit-queue
Trusty: Add retry in curl commands
This fix is for improving robustness in fetch critical metadata files when the metadata server is temporarily unreachable.
@roberthbailey @zmerlynn @dchen1107 please review it.
cc/ @fabioy @wonderfly FYI.
Automatic merge from submit-queue
jenkins: Allow configuration of release bucket
This allows others to leverage the existing E2E code to test some
patched kube binary by simply overriding the bucket and reusing many of
the existing scripts
Automatic merge from submit-queue
Trusty: Handle the new var in kube-proxy manifest
This is to capture the kube-proxy manifest change in PR #24429.
@roberthbailey @fabioy @zmerlynn please review this change and mark it as cherry pick candidate. We need to catch up 1.2.3 release.
cc/ @dchen1107 @wonderfly @cjcullen FYI.
I have verified this fix. Without this fix, kube-proxy pod in Trusty nodes cannot be started correctly, i.e., the command line has an unhadled variable. And some other kube-system pods do not work correctly as kube-proxy is not working well. After applying this fix, kube-proxy can be started correctly, and all kube-system pods run successfully.
Automatic merge from submit-queue
Strip comments from configure-vm.sh for gce
We are getting very close to the 32KiB limit on GCE metadata entry length. We used to strip comments before putting the value in metadata, but I think we removed it in a refactor because it wasn't absolutely necessary, and leaving it out made the scripts slightly cleaner. It's close to being necessary again.
Removing comments reduces the size from 31,609B to 27,221B: https://www.diffchecker.com/0xmmecvw.
This allows others to leverage the existing E2E code to test some
patched kube binary by simply overriding the bucket and reusing many of
the existing scripts
Automatic merge from submit-queue
Trusty: Fixes for running GKE master
This PR includes two fixes for running GKE master on our image:
(1) The kubelet command line assembly had a missing part for cbr0. We did not catch it because the code path is not covered by OSS k8s tests;
(2) Remove the "" from the variables in the cert files. It causes a parsing issue in GKE. Again, this code path is not covered by k8s tests.
This PR also refactors the code for assembling kubelet flag. I move all logic into a single function assemble_kubelet_flags in configure-helper.sh for better readability and also simplify node.yaml and master.yaml.
@roberthbailey @dchen1107 please review it, and mark it as cherrypick-candidate. This PR is verified by @maisem. Together with his CL for GKE, we can run GKE cluster with master on our image and nodes on ContainerVM.
cc/ @maisem @fabioy @wonderfly FYI
This only applies to gce kube-up. 60 seconds of open connection should
be sufficient for anything that we should be downloading. The release
tar is currently 255M.