With bug #27653, kubelet could remove mounted volumes and delete user data.
The bug itself is fixed, however our trust in kubelet is significantly lower.
Let's add an extra version of RemoveAll that does not cross mount boundary
(rm -rf --one-file-system).
It calls lstat(path) three times for each removed directory - once in
RemoveAllOneFilesystem and twice in IsLikelyNotMountPoint, however this way
it's platform independent and the directory that is being removed by kubelet
should be almost empty.
Automatic merge from submit-queue (batch tested with PRs 41116, 41804, 42104, 42111, 42120)
Remove SandboxReceived event
This PR removes SandboxReceived event in sync pod.
> This event seems somewhat meaningless, and clouds the event records for a pod. Do we actually need it? Pulling and pod received on the node are very relevant, this seems much less so. Would suggest we either remove it, or turn it into a message that clearly indicates why it has value.
Refer d65309399a (commitcomment-21052453).
cc @smarterclayton @yujuhong
Automatic merge from submit-queue (batch tested with PRs 41116, 41804, 42104, 42111, 42120)
Add support for attacher/detacher interface in Flex volume
Add support for attacher/detacher interface in Flex volume
This change breaks backward compatibility and requires to be release noted.
```release-note
Flex volume plugin is updated to support attach/detach interfaces. It broke backward compatibility. Please update your drivers and implement the new callouts.
```
Automatic merge from submit-queue (batch tested with PRs 41962, 42055, 42062, 42019, 42054)
dockershim puts pause container in pod cgroup
**What this PR does / why we need it**:
The CRI was not launching the pause container in the pod level cgroup. The non-CRI code path was.
Automatic merge from submit-queue
Admit critical pods under resource pressure
And evict critical pods that are not static.
Depends on #40952.
For #40573
Automatic merge from submit-queue (batch tested with PRs 41994, 41969, 41997, 40952, 40576)
Guaranteed admission for Critical Pods
This is the first step in implementing node-level preemption for critical pods.
It defines the AdmissionFailureHandler interface, which allows callers, like the kubelet, to define how failed predicates are handled, and take steps to correct failures if necessary.
In the kubelet's implementation, it triggers preemption if the pod being admitted is critical, and if the only failed predicates are InsufficientResourceErrors, then it prempts (not yet implemented) other other pods to allow admission of the critical pod.
cc: @vishh
Automatic merge from submit-queue (batch tested with PRs 41814, 41922, 41957, 41406, 41077)
Use consistent helper for getting secret names from pod
Kubelet secret-manager and mirror-pod admission both need to know what secrets a pod spec references. Eventually, a node authorizer will also need to know the list of secrets.
This creates a single (well, double, because api versions) helper that can be used to traverse the secret names referenced from a pod, optionally short-circuiting (for places that are just looking to see if any secrets are referenced, like admission, or are looking for a particular secret ref, like authorization)
Fixes:
* secret manager not handling secrets used by env/envFrom in initcontainers
* admission allowing mirror pods with secret references
@smarterclayton @wojtek-t
Automatic merge from submit-queue (batch tested with PRs 41621, 41946, 41941, 41250, 41729)
bug fix for hostport-syncer
fix a bug introduced by the previous refactoring of hostport-syncer. https://github.com/kubernetes/kubernetes/pull/39443
and fix some nits
Automatic merge from submit-queue
BestEffort QoS class has min cpu shares
**What this PR does / why we need it**:
BestEffort QoS class is given the minimum amount of CPU shares per the QoS design.
Automatic merge from submit-queue (batch tested with PRs 41714, 41510, 42052, 41918, 31515)
Switch scheduler to use generated listers/informers
Where possible, switch the scheduler to use generated listers and
informers. There are still some places where it probably makes more
sense to use one-off reflectors/informers (listing/watching just a
single node, listing/watching scheduled & unscheduled pods using a field
selector).
I think this can wait until master is open for 1.7 pulls, given that we're close to the 1.6 freeze.
After this and #41482 go in, the only code left that references legacylisters will be federation, and 1 bit in a stateful set unit test (which I'll clean up in a follow-up).
@resouer I imagine this will conflict with your equivalence class work, so one of us will be doing some rebasing 😄
cc @wojtek-t @gmarek @timothysc @jayunit100 @smarterclayton @deads2k @liggitt @sttts @derekwaynecarr @kubernetes/sig-scheduling-pr-reviews @kubernetes/sig-scalability-pr-reviews
Where possible, switch the scheduler to use generated listers and
informers. There are still some places where it probably makes more
sense to use one-off reflectors/informers (listing/watching just a
single node, listing/watching scheduled & unscheduled pods using a field
selector).
Automatic merge from submit-queue (batch tested with PRs 41812, 41665, 40007, 41281, 41771)
Kubelet-rkt: Add useful informations for Ops on the Kubelet Host
Create a Systemd SyslogIdentifier inside the [Service]
Create a Systemd Description inside the [Unit]
**What this PR does / why we need it**:
#### Overview
Logged against the host, it's difficult to identify who's who.
This PR add useful information to quickly get straight to the point with the **DESCRIPTION** field:
```
systemctl list-units "k8s*"
UNIT LOAD ACTIVE SUB DESCRIPTION
k8s_b5a9bdf7-e396-4989-8df0-30a5fda7f94c.service loaded active running kube-controller-manager-172.20.0.206
k8s_bec0d8a1-dc15-4b47-a850-e09cf098646a.service loaded active running nginx-daemonset-gxm4s
k8s_d2981e9c-2845-4aa2-a0de-46e828f0c91b.service loaded active running kube-apiserver-172.20.0.206
k8s_fde4b0ab-87f8-4fd1-b5d2-3154918f6c89.service loaded active running kube-scheduler-172.20.0.206
```
#### Overview and Journal
Always on the host, to easily retrieve the pods logs, this PR add a SyslogIdentifier named as the PodBaseName.
```
# A DaemonSet prometheus-node-exporter is running on the Kubernetes Cluster
systemctl list-units "k8s*" | grep prometheus-node-exporter
k8s_c60a4b1a-387d-4fce-afa1-642d6f5716c1.service loaded active running prometheus-node-exporter-85cpp
# Get the logs from the prometheus-node-exporter DaemonSet
journalctl -t prometheus-node-exporter | wc -l
278
```
Sadly the `journalctl` flag `-t` / `--identifier` doesn't allow a pattern to catch the logs.
Also this field improve any queries made by any tools who exports the Journal (E.g: ES, Kibana):
```
{
"__CURSOR" : "s=86fd390d123b47af89bb15f41feb9863;i=164b2c27;b=7709deb3400841009e0acc2fec1ebe0e;m=1fe822ca4;t=54635e6a62285;x=b2d321019d70f36f",
"__REALTIME_TIMESTAMP" : "1484572200411781",
"__MONOTONIC_TIMESTAMP" : "8564911268",
"_BOOT_ID" : "7709deb3400841009e0acc2fec1ebe0e",
"PRIORITY" : "6",
"_UID" : "0",
"_GID" : "0",
"_SYSTEMD_SLICE" : "system.slice",
"_SELINUX_CONTEXT" : "system_u:system_r:kernel_t:s0",
"_MACHINE_ID" : "7bbb4401667243da81671e23fd8a2246",
"_HOSTNAME" : "Kubelet-Host",
"_TRANSPORT" : "stdout",
"SYSLOG_FACILITY" : "3",
"_COMM" : "ld-linux-x86-64",
"_CAP_EFFECTIVE" : "3fffffffff",
"SYSLOG_IDENTIFIER" : "prometheus-node-exporter",
"_PID" : "88827",
"_EXE" : "/var/lib/rkt/pods/run/c60a4b1a-387d-4fce-afa1-642d6f5716c1/stage1/rootfs/usr/lib64/ld-2.21.so",
"_CMDLINE" : "stage1/rootfs/usr/lib/ld-linux-x86-64.so.2 stage1/rootfs/usr/bin/systemd-nspawn [....]",
"_SYSTEMD_CGROUP" : "/system.slice/k8s_c60a4b1a-387d-4fce-afa1-642d6f5716c1.service",
"_SYSTEMD_UNIT" : "k8s_c60a4b1a-387d-4fce-afa1-642d6f5716c1.service",
"MESSAGE" : "[ 8564.909237] prometheus-node-exporter[115]: time=\"2017-01-16T13:10:00Z\" level=info msg=\" - time\" source=\"node_exporter.go:157\""
}
```
Automatic merge from submit-queue (batch tested with PRs 41146, 41486, 41482, 41538, 41784)
fix issue #41746
**What this PR does / why we need it**:
**Which issue this PR fixes** : fixes#41746
**Special notes for your reviewer**:
cc @feiskyer
Automatic merge from submit-queue (batch tested with PRs 38957, 41819, 41851, 40667, 41373)
Change taints/tolerations to api fields
This PR changes current implementation of taints and tolerations from annotations to API fields. Taint and toleration are now part of `NodeSpec` and `PodSpec`, respectively. The annotation keys: `scheduler.alpha.kubernetes.io/tolerations` and `scheduler.alpha.kubernetes.io/taints` have been removed.
**Release note**:
Pod tolerations and node taints have moved from annotations to API fields in the PodSpec and NodeSpec, respectively. Pod tolerations and node taints that are defined in the annotations will be ignored. The annotation keys: `scheduler.alpha.kubernetes.io/tolerations` and `scheduler.alpha.kubernetes.io/taints` have been removed.
Automatic merge from submit-queue (batch tested with PRs 41349, 41532, 41256, 41587, 41657)
Enable pod level cgroups by default
**What this PR does / why we need it**:
It enables pod level cgroups by default.
**Special notes for your reviewer**:
This is intended to be enabled by default on 2/14/2017 per the plan outlined here:
https://github.com/kubernetes/community/pull/314
**Release note**:
```release-note
Each pod has its own associated cgroup by default.
```
Automatic merge from submit-queue
Log that debug handlers have been turned on.
**What this PR does / why we need it**: PR allows user to have a message in logs that debug handlers are on. It should allow the operator to know and automate a check for the case where debug has been left on.
**Release note**:
```
NONE
```
Automatic merge from submit-queue (batch tested with PRs 39373, 41585, 41617, 41707, 39958)
Fix ConfigMaps for Windows
**What this PR does / why we need it**: ConfigMaps were broken for Windows as the existing code used linux specific file paths. Updated the code in `kubelet_getters.go` to use `path/filepath` to get the directories. Also reverted back the code in `secret.go` as updating `kubelet_getters.go` to use `path/filepath` also fixes `secrets`
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes https://github.com/kubernetes/kubernetes/issues/39372
```release-note
Fix ConfigMap for Windows Containers.
```
cc: @pires
Changes the kubelet so it doesn't use the cert/key files directly for
starting the TLS server. Instead the TLS server reads the cert/key from
the new CertificateManager component, which is responsible for
requesting new certificates from the Certificate Signing Request API on
the API Server.
Automatic merge from submit-queue (batch tested with PRs 41649, 41658, 41266, 41371, 41626)
Understand why kubelet cannot cleanup orphaned pod dirs
**What this PR does / why we need it**:
Understand if we are unable to clean up orphaned pod directories due to a failure to read the directory versus paths still existing to improve ability to debug error situations.
Automatic merge from submit-queue (batch tested with PRs 40505, 34664, 37036, 40726, 41595)
dockertools: call TearDownPod when GC-ing infra pods
The docker runtime doesn't tear down networking when GC-ing pods.
rkt already does so make docker do it too. To ensure this happens,
infra pods are now always GC-ed rather than gating them by
containersToKeep.
This prevents IPAM from leaking when the pod gets killed for
some reason outside kubelet (like docker restart) or when pods
are killed while kubelet isn't running.
Fixes: https://github.com/kubernetes/kubernetes/issues/14940
Related: https://github.com/kubernetes/kubernetes/pull/35572
Automatic merge from submit-queue (batch tested with PRs 38101, 41431, 39606, 41569, 41509)
Report node not ready on failed PLEG health check
Report node not ready if PLEG health check fails.
Automatic merge from submit-queue (batch tested with PRs 38101, 41431, 39606, 41569, 41509)
optimize killPod() and syncPod() functions
make sure that one of the two arguments must be non-nil: runningPod, status ,just like the function note says
and judge the return value in syncPod() function before setting podKilled
Automatic merge from submit-queue (batch tested with PRs 38101, 41431, 39606, 41569, 41509)
[hairpin] fix argument of nsenter
**Release note**:
```release-note
None
```
We should use:
nsenter --net=netnsPath -- -F some_command
instend of:
nsenter -n netnsPath -- -F some_command
Because "nsenter -n netnsPath" get an error output:
# nsenter -n /proc/67197/ns/net ip addr
nsenter: neither filename nor target pid supplied for ns/net
If we really want use -n, we need to use -n in such format:
# sudo nsenter -n/proc/67197/ns/net ip addr
Dead infra containers may still have network resources allocated to
them and may not be GC-ed for a long time. But allowing SyncPod()
to restart an infra container before the old one is destroyed
prevents network plugins from carrying the old network details
(eg IPAM) over to the new infra container.
The docker runtime doesn't tear down networking when GC-ing pods.
rkt already does so make docker do it too. To ensure this happens,
networking is always torn down for the container even if the
container itself is not deleted.
This prevents IPAM from leaking when the pod gets killed for
some reason outside kubelet (like docker restart) or when pods
are killed while kubelet isn't running.
Fixes: https://github.com/kubernetes/kubernetes/issues/14940
Related: https://github.com/kubernetes/kubernetes/pull/35572
We need to tear down networking when garbage collecting containers too,
and GC is run from a different goroutine in kubelet. We don't want
container network operations running for the same pod concurrently.
The PluginManager almost duplicates the network plugin interface, but
not quite since the Init() function should be called by whatever
actually finds and creates the network plugin instance. Only then
does it get passed off to the PluginManager.
The Manager synchronizes pod-specific network operations like setup,
teardown, and pod network status. It passes through all other
operations so that runtimes don't have to cache the network plugin
directly, but can use the PluginManager as a wrapper.
Automatic merge from submit-queue (batch tested with PRs 41466, 41456, 41550, 41238, 41416)
Delay Deletion of a Pod until volumes are cleaned up
#41436 fixed the bug that caused #41095 and #40239 to have to be reverted. Now that the bug is fixed, this shouldn't cause problems.
@vishh @derekwaynecarr @sjenning @jingxu97 @kubernetes/sig-storage-misc
Automatic merge from submit-queue (batch tested with PRs 41531, 40417, 41434)
Always detach volumes in operator executor
**What this PR does / why we need it**:
Instead of marking a volume as detached immediately in Kubelet's
reconciler, delegate the marking asynchronously to the operator
executor. This is necessary to prevent race conditions with other
operations mutating the same volume state.
An example of one such problem:
1. pod is created, volume is added to desired state of the world
2. reconciler process starts
3. reconciler starts MountVolume, which is kicked off asynchronously via
operation_executor.go
4. MountVolume mounts the volume, but hasn't yet marked it as mounted
5. pod is deleted, volume is removed from desired state of the world
6. reconciler reaches detach volume section, detects volume is no longer in desired state of world,
removes it from volumes in use
7. MountVolume tries to mark mount, throws an error because
volume is no longer in actual state of world list. After this, kubelet isn't aware of the mount
so doesn't try to unmount again.
8. controller-manager tries to detach the volume, this fails because it
is still mounted to the OS.
9. EBS gets stuck indefinitely in busy state trying to detach.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#32881, fixes ##37854 (maybe)
**Special notes for your reviewer**:
**Release note**:
```release-note
```
make sure that one of the two arguments must be non-nil: runningPod, status ,just like the function note says
and judge the return value in syncPod() function before setting podKilled
Automatic merge from submit-queue
kubeadm: Migrate to client-go
**What this PR does / why we need it**: Finish the migration for kubeadm to use client-go wherever possible
**Which issue this PR fixes**: fixes #https://github.com/kubernetes/kubeadm/issues/52
**Special notes for your reviewer**: /cc @luxas @pires
**Release note**:
```release-note
NONE
```
Some imports dont exist yet (or so it seems) in client-go (examples
being:
- "k8s.io/kubernetes/pkg/api/validation"
- "k8s.io/kubernetes/pkg/util/initsystem"
- "k8s.io/kubernetes/pkg/util/node"
one change in kubelet to import to client-go
Automatic merge from submit-queue
Allow multipe DNS servers as comma-seperated argument for kubelet --dns
This PR explores how kubectls "--dns" could be extended to specify multiple DNS servers for in-cluster PODs. Testing on the local libvirt-coreos cluster shows that multiple DNS server are injected without issues.
Specifying multiple DNS servers increases resilience against
- Packet drops
- Single server failure
I am debugging services that do 50+ DNS requests for a single incoming interactive request, thus highly increase the chance of a slowdown (+5s) due to a single packet drop. Switching to two DNS servers will reduce the impact of the issues (roughly +1s on glibc, 0s on musl, error-rate goes down to error-rate^2).
Note that there is no need to change any runtime related code as far as I know. In the case of "default" dns the /etc/resolv.conf is parsed and multiple DNS server are send to the backend anyway. This only adds the same capability for the clusterFirst case.
I've heard from @thockin that multiple DNS entries are somehow considered. I've no idea what was considered, though. This is what I would like to see for our production use, though.
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 41360, 41423, 41430, 40647, 41352)
kubelet: reduce extraneous logging for pods using host network
For pods using the host network, kubelet/shim should not log
error/warning messages when determining the pod IP address.
Automatic merge from submit-queue (batch tested with PRs 41196, 41252, 41300, 39179, 41449)
record ReduceCPULimits result err info if err returned
record ReduceCPULimits result err info if err returned for debug
Automatic merge from submit-queue
Fix bug in status manager TerminatePod
In TerminatePod, we previously pass pod.Status to updateStatusInternal. This is a bug, since it is the original status that we are given. Not only does it skip updates made to container statuses, but in some cases it reverted the pod's status to an earlier version, since it was being passed a stale status initially.
This was the case in #40239 and #41095. As shown in #40239, the pod's status is set to running after it is set to failed, occasionally causing very long delays in pod deletion since we have to wait for this to be corrected.
This PR fixes the bug, adds some helpful debugging statements, and adds a unit test for TerminatePod (which for some reason didnt exist before?).
@kubernetes/sig-node-bugs @vish @Random-Liu
Automatic merge from submit-queue
Make EnableCRI default to true
This change makes kubelet to use the CRI implementation by default,
unless the users opt out explicitly by using --enable-cri=false.
For the rkt integration, the --enable-cri flag will have no effect
since rktnetes does not use CRI.
Also, mark the original --experimental-cri flag hidden and deprecated,
so that we can remove it in the next release. If both flags are specified,
the --enable-cri flag overrides the --experimental-cri flag.
Automatic merge from submit-queue
fix comment
**What this PR does / why we need it**:
fix comment
Thanks.
**Special notes for your reviewer**:
**Release note**:
```release-note
```
This change makes kubelet to use the CRI implementation by default,
unless the users opt out explicitly by using --enable-cri=false.
For the rkt integration, the --enable-cri flag will have no effect
since rktnetes does not use CRI.
Also, mark the original --experimental-cri flag hidden and deprecated,
so that we can remove it in the next release.
To safely mark a volume detached when the volume controller manager is used.
An example of one such problem:
1. pod is created, volume is added to desired state of the world
2. reconciler process starts
3. reconciler starts MountVolume, which is kicked off asynchronously via
operation_executor.go
4. MountVolume mounts the volume, but hasn't yet marked it as mounted
5. pod is deleted, volume is removed from desired state of the world
6. reconciler detects volume is no longer in desired state of world,
removes it from volumes in use
7. MountVolume tries to mark volume in use, throws an error because
volume is no longer in actual state of world list.
8. controller-manager tries to detach the volume, this fails because it
is still mounted to the OS.
9. EBS gets stuck indefinitely in busy state trying to detach.
Automatic merge from submit-queue (batch tested with PRs 40796, 40878, 36033, 40838, 41210)
Implement TTL controller and use the ttl annotation attached to node in secret manager
For every secret attached to a pod as volume, Kubelet is trying to refresh it every sync period. Currently Kubelet has a ttl-cache of secrets of its pods and the ttl is set to 1 minute. That means that in large clusters we are targetting (5k nodes, 30pods/node), given that each pod has a secret associated with ServiceAccount from its namespaces, and with large enough number of namespaces (where on each node (almost) every pod is from a different namespace), that resource in ~30 GETs to refresh all secrets every minute from one node, which gives ~2500QPS for GET secrets to apiserver.
Apiserver cannot keep up with it very easily.
Desired solution would be to watch for secret changes, but because of security we don't want a node watching for all secrets, and it is not possible for now to watch only for secrets attached to pods from my node.
So as a temporary solution, we are introducing an annotation that would be a suggestion for kubelet for the TTL of secrets in the cache and a very simple controller that would be setting this annotation based on the cluster size (the large cluster is, the bigger ttl is).
That workaround mean that only very local changes are needed in Kubelet, we are creating a well separated very simple controller, and once watching "my secrets" will be possible it will be easy to remove it and switch to that. And it will allow us to reach scalability goals.
@dchen1107 @thockin @liggitt
Automatic merge from submit-queue (batch tested with PRs 41074, 41147, 40854, 41167, 40045)
Add debug logging to eviction manager
**What this PR does / why we need it**:
This PR adds debug logging to eviction manager.
We need it to help users understand when/why eviction manager is/is not making decisions to support information gathering during support.
Automatic merge from submit-queue (batch tested with PRs 40873, 40948, 39580, 41065, 40815)
[CRI] Enable Hostport Feature for Dockershim
Commits:
1. Refactor common hostport util logics and add more tests
2. Add HostportManager which can ADD/DEL hostports instead of a complete sync.
3. Add Interface for retreiving portMappings information of a pod in Network Host interface.
Implement GetPodPortMappings interface in dockerService.
4. Teach kubenet to use HostportManager
Automatic merge from submit-queue (batch tested with PRs 38796, 40823, 40756, 41083, 41105)
Let ReadLogs return when there is a read error.
Fixes a bug in kuberuntime log.
Today, @yujuhong found that once we cancel `kubectl logs -f` with `Ctrl+C`, kuberuntime will keep complaining:
```
27939 kuberuntime_logs.go:192] Failed with err write tcp 10.240.0.4:10250->10.240.0.2:53913: write: broken pipe when writing log for log file "/var/log/pods/5bb76510-ed71-11e6-ad02-42010af00002/busybox_0.log": &{timestamp:{sec:63622095387 nsec:625309193 loc:0x484c440} stream:stdout log:[84 117 101 32 70 101 98 32 32 55 32 50 48 58 49 54 58 50 55 32 85 84 67 32 50 48 49 55 10]}
```
This is because kuberuntime keeps writing to the connection even though it is already closed. Actually, kuberuntime should return and report error whenever there is a writing error.
Ref the [docker code](3a4ae1f661/pkg/stdcopy/stdcopy.go (L159-L167))
I'm still creating the cluster and verifying this fix. Will post the result here after that.
/cc @yujuhong @kubernetes/sig-node-bugs
Automatic merge from submit-queue (batch tested with PRs 38796, 40823, 40756, 41083, 41105)
kubelet/network-cni-plugin: modify the log's info
**What this PR does / why we need it**:
Checking the startup logs of kubelet, i can always find a error like this:
"E1215 10:19:24.891724 2752 cni.go:163] error updating cni config: No networks found in /etc/cni/net.d"
It will appears, neither i use cni network-plugin or not.
After analysis codes, i thought it should be a warn log, because it will not produce any actions like as exit or abort, and just ignored when not any valid plugins exit.
thank you!
Automatic merge from submit-queue (batch tested with PRs 41103, 41042, 41097, 40946, 40770)
Use Clientset interface in KubeletDeps
**What this PR does / why we need it**:
This replaces the Clientset struct with the equivalent interface for the KubeClient injected via KubeletDeps. This is useful for testing and for accessing the Node and Pod status event stream without an API server.
**Special notes for your reviewer**:
Follow up to #4907
**Release note**:
`NONE`
Automatic merge from submit-queue (batch tested with PRs 41103, 41042, 41097, 40946, 40770)
dockershim: set security option separators based on the docker version
Also add a version cache to avoid hitting the docker daemon frequently.
This is part of #38164
Automatic merge from submit-queue (batch tested with PRs 40971, 41027, 40709, 40903, 39369)
Set docker opt separator correctly for SELinux options
This is based on @pmorie's commit from #40179
Automatic merge from submit-queue (batch tested with PRs 40930, 40951)
Fix CRI port forwarding
Websocket support was introduced #33684, which broke the CRI
implementation. This change fixes it.
Automatic merge from submit-queue (batch tested with PRs 40289, 40877, 40879, 39972, 40942)
Rename experimental-cgroups-per-pod flag
**What this PR does / why we need it**:
1. Rename `experimental-cgroups-per-qos` to `cgroups-per-qos`
1. Update hack/local-up-cluster to match `CGROUP_DRIVER` with docker runtime if used.
**Special notes for your reviewer**:
We plan to roll this feature out in the upcoming release. Previous node e2e runs were running with this feature on by default. We will default this feature on for all e2es next week.
**Release note**:
```release-note
Rename --experiemental-cgroups-per-qos to --cgroups-per-qos
```
Automatic merge from submit-queue
CRI: Handle cri in-place upgrade
Fixes https://github.com/kubernetes/kubernetes/issues/40051.
## How does this PR restart/remove legacy containers/sandboxes?
With this PR, dockershim will convert and return legacy containers and infra containers as regular containers/sandboxes. Then we can rely on the SyncPod logic to stop the legacy containers/sandboxes, and the garbage collector to remove the legacy containers/sandboxes.
To forcibly trigger restart:
* For infra containers, we manually set `hostNetwork` to opposite value to trigger a restart (See [here](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L389))
* For application containers, they will be restarted with the infra container.
## How does this PR avoid extra overhead when there is no legacy container/sandbox?
For the lack of some labels, listing legacy containers needs extra `docker ps`. We should not introduce constant performance regression for legacy container cleanup. So we added the `legacyCleanupFlag`:
* In `ListContainers` and `ListPodSandbox`, only do extra `ListLegacyContainers` and `ListLegacyPodSandbox` when `legacyCleanupFlag` is `NotDone`.
* When dockershim starts, it will check whether there are legacy containers/sandboxes.
* If there are none, it will mark `legacyCleanupFlag` as `Done`.
* If there are any, it will leave `legacyCleanupFlag` as `NotDone`, and start a goroutine periodically check whether legacy cleanup is done.
This makes sure that there is overhead only when there are legacy containers/sandboxes not cleaned up yet.
## Caveats
* In-place upgrade will cause kubelet to restart all running containers.
* RestartNever container will not be restarted.
* Garbage collector sometimes keep the legacy containers for a long time if there aren't too many containers on the node. In that case, dockershim will keep performing extra `docker ps` which introduces overhead.
* Manually remove all legacy containers will fix this.
* Should we garbage collect legacy containers/sandboxes in dockershim by ourselves? /cc @yujuhong
* Host port will not be reclaimed for the lack of checkpoint for legacy sandboxes. https://github.com/kubernetes/kubernetes/pull/39903 /cc @freehan
/cc @yujuhong @feiskyer @dchen1107 @kubernetes/sig-node-api-reviews
**Release note**:
```release-note
We should mention the caveats of in-place upgrade in release note.
```
Automatic merge from submit-queue
Optionally avoid evicting critical pods in kubelet
For #40573
```release-note
When feature gate "ExperimentalCriticalPodAnnotation" is set, Kubelet will avoid evicting pods in "kube-system" namespace that contains a special annotation - `scheduler.alpha.kubernetes.io/critical-pod`
This feature should be used in conjunction with the rescheduler to guarantee availability for critical system pods - https://kubernetes.io/docs/admin/rescheduler/
```
Automatic merge from submit-queue (batch tested with PRs 40795, 40863)
Use caching secret manager in kubelet
I just found that this is in my local branch I'm using for testing, but not in master :)
Automatic merge from submit-queue
Add websocket support for port forwarding
#32880
**Release note**:
```release-note
Port forwarding can forward over websockets or SPDY.
```
- adjust ports to int32
- CRI flows the websocket ports as query params
- Do not validate ports since the protocol is unknown
SPDY flows the ports as headers and websockets uses query params
- Only flow query params if there is at least one port query param
Automatic merge from submit-queue
securitycontext: move docker-specific logic into kubelet/dockertools
This change moves the code specific to docker to kubelet/dockertools,
while leaving the common utility functions at its current package
(pkg/securitycontext).
When we deprecate dockertools in the future, the code will be moved to
pkg/kubelet/dockershim instead.
Depending on an exact cluster setup multiple dns may make sense.
Comma-seperated lists of DNS server are quite common as DNS servers
are always plain IPs.
- split out port forwarding into its own package
Allow multiple port forwarding ports
- Make it easy to determine which port is tied to which channel
- odd channels are for data
- even channels are for errors
- allow comma separated ports to specify multiple ports
Add portfowardtester 1.2 to whitelist
Automatic merge from submit-queue (batch tested with PRs 40638, 40742, 40710, 40718, 40763)
move client/record
An attempt at moving client/record to client-go. It's proving very stubborn and needs a lot manual intervention and near as I can tell, no one actually gets any benefit from the sink and source complexity it adds.
@sttts @caesarchaoxu
Automatic merge from submit-queue
kuberuntime: remove the kubernetesManagedLabel label
The CRI shim should be responsible for returning only those
containers/sandboxes created through CRI. Remove this label in kubelet.
Automatic merge from submit-queue (batch tested with PRs 40527, 40738, 39366, 40609, 40748)
pkg/kubelet/dockertools/docker_manager.go: removing unused stuff
This PR removes unused constants and variables. I checked that neither kubernetes nor openshift code aren't using them.
Automatic merge from submit-queue (batch tested with PRs 40392, 39242, 40579, 40628, 40713)
optimize podSandboxChanged() function and fix some function notes
Automatic merge from submit-queue (batch tested with PRs 38443, 40145, 40701, 40682)
fix GetVolumeInUse() function
Since we just want to get volume name info, each volume name just need to added once. desiredStateOfWorld.GetVolumesToMount() will return volume and pod binding info,
if one volume is mounted to several pods, the volume name will be return several times. That is not what we want in this function.
We can add a new function to only get the volume name info or judge whether the volume name is added to the desiredVolumesMap array.
This change moves the code specific to docker to kubelet/dockertools,
while leaving the common utility functions at its current package
(pkg/securitycontext).
When we deprecate dockertools in the future, the code will be moved to
pkg/kubelet/dockershim instead.
Automatic merge from submit-queue (batch tested with PRs 40126, 40565, 38777, 40564, 40572)
docker-CRI: Remove legacy code for non-grpc integration
A minor cleanup to remove the code that is no longer in use to simplify the logic.
Automatic merge from submit-queue (batch tested with PRs 40126, 40565, 38777, 40564, 40572)
Add IsContainerNotFound in kube_docker_client
This PR added `IsContainerNotFound` function in kube_docker_client and changed dockershim to use it.
@yujuhong @freehan
Automatic merge from submit-queue (batch tested with PRs 40046, 40073, 40547, 40534, 40249)
fix a typo in cni log
**What this PR does / why we need it**:
fixes a typo s/unintialized/uninitialized in pkg/kubelet/network/cni/cni.go
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 40239, 40397, 40449, 40448, 40360)
CRI: Work around container create conflict.
Fixes https://github.com/kubernetes/kubernetes/issues/40443.
This PR added a random suffix in the container name when we:
* Failed to create the container because of "Conflict".
* And failed to remove the container because of "No such container".
@yujuhong @feiskyer
/cc @kubernetes/sig-node-bugs
Automatic merge from submit-queue (batch tested with PRs 40239, 40397, 40449, 40448, 40360)
move the discovery and dynamic clients
Moved the dynamic client, discovery client, testing/core, and testing/cache to `client-go`. Dependencies on api groups we don't have generated clients for have dropped out, so federation, kubeadm, and imagepolicy.
@caesarxuchao @sttts
approved based on https://github.com/kubernetes/kubernetes/issues/40363
Automatic merge from submit-queue (batch tested with PRs 40239, 40397, 40449, 40448, 40360)
CRI: use more gogoprotobuf plugins
Generate marshaler/unmarshaler code should help improve the performance.
This addresses #40098
Automatic merge from submit-queue
Delay deletion of pod from the API server until volumes are deleted
Depends on #37228, and will not pass tests until that PR is merged, and this is rebased.
Keeps all kubelet behavior the same, except the kubelet will not make the "Delete" call (kubeClient.Core().Pods(pod.Namespace).Delete(pod.Name, deleteOptions)) until the volumes associated with that pod are removed. I will perform some performance testing so that we better understand the latency impact of this change.
Is kubelet_pods.go the correct file to include the "when can I delete this pod" logic?
cc: @vishh @sjenning @derekwaynecarr
Automatic merge from submit-queue (batch tested with PRs 38739, 40480, 40495, 40172, 40393)
Use fnv hash in the CRI implementation
fnv is more stable than adler. This PR changes CRI implementation to
use fnv for generating container hashes, but leaving the old
implementation (dockertools/rkt). This is because hash is what kubelet
uses to identify a container -- changes to the hash will cause kubelet
to restart existing containers. This is ok for CRI implementation (which
requires a disruptive upgrade already), but not for older implementations.
#40140
Automatic merge from submit-queue (batch tested with PRs 39538, 40188, 40357, 38214, 40195)
Use SecretManager when getting secrets for EnvFrom
Merges crossed in the night which missed this needed change.
Leave the old implementation (dockertools/rkt) untouched so that
containers will not be restarted during kubelet upgrade. For CRI
implementation (kuberuntime), container restart is required for kubelet
upgrade.
Automatic merge from submit-queue
Move remaining *Options to metav1
Primarily delete options, but will remove all internal references to non-metav1 options (except ListOptions).
Still working through it @sttts @deads2k
Automatic merge from submit-queue (batch tested with PRs 39275, 40327, 37264)
dockertools: remove some dead code
Remove `dockerRoot` that's not used anywhere.
Automatic merge from submit-queue
Fix bad time values in kubelet FakeRuntimeService
These values don't affect tests but they can be confusing
for developers looking at the code for reference.
Automatic merge from submit-queue
Optional configmaps and secrets
Allow configmaps and secrets for environment variables and volume sources to be optional
Implements approved proposal c9f881b7bb
Release note:
```release-note
Volumes and environment variables populated from ConfigMap and Secret objects can now tolerate the named source object or specific keys being missing, by adding `optional: true` to the volume or environment variable source specifications.
```
Enforce the following limits:
12kb for total message length in container status
4kb for the termination message path file
2kb or 80 lines (whichever is shorter) from the log on error
Fallback to log output if the user requests it.
Automatic merge from submit-queue
Remove TODOs to refactor kubelet labels
To address #39650 completely.
Remove label refactoring TODOs, we don't need them since CRI rollout is on the way.
Automatic merge from submit-queue (batch tested with PRs 40250, 40134, 40210)
Typo fix: Change logging function to formatting version
**What this PR does / why we need it**:
Slightly broken logging message:
```
I0120 10:56:08.555712 7575 kubelet_node_status.go:135] Deleted old node object %qkubernetes-cit-kubernetes-cr0-0
```
Automatic merge from submit-queue (batch tested with PRs 40232, 40235, 40237, 40240)
move listers out of cache to reduce import tree
Moving the listers from `pkg/client/cache` snips links to all the different API groups from `pkg/storage`, but the dreaded `ListOptions` remains.
@sttts
Automatic merge from submit-queue (batch tested with PRs 37228, 40146, 40075, 38789, 40189)
Cleanup temp dirs
So funny story my /tmp ran out of space running the unit tests so I am cleaning up all the temp dirs we create.
Automatic merge from submit-queue (batch tested with PRs 37228, 40146, 40075, 38789, 40189)
kubelet: storage: teardown terminated pod volumes
This is a continuation of the work done in https://github.com/kubernetes/kubernetes/pull/36779
There really is no reason to keep volumes for terminated pods attached on the node. This PR extends the removal of volumes on the node from memory-backed (the current policy) to all volumes.
@pmorie raised a concern an impact debugging volume related issues if terminated pod volumes are removed. To address this issue, the PR adds a `--keep-terminated-pod-volumes` flag the kubelet and sets it for `hack/local-up-cluster.sh`.
For consideration in 1.6.
Fixes#35406
@derekwaynecarr @vishh @dashpole
```release-note
kubelet tears down pod volumes on pod termination rather than pod deletion
```
Automatic merge from submit-queue (batch tested with PRs 40011, 40159)
dockertools/nsenterexec: fix err shadow
The shadow of err meant the combination of `exec-handler=nsenter` +
`tty` + a non-zero exit code meant that the exit code would be LOST
FOREVER 👻
This isn't all that important since no one really used the nsenter exec
handler as I understand it
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 36693, 40154, 40170, 39033)
make client-go authoritative for pkg/client/restclient
Moves client/restclient to client-go and a util/certs, util/testing as transitives.
Automatic merge from submit-queue (batch tested with PRs 40168, 40165, 39158, 39966, 40190)
dockershim: add support for the 'nsenter' exec handler
This change simply plumbs the kubelet configuration
(--docker-exec-handler) to DockerService.
This fixes#35747.
Automatic merge from submit-queue (batch tested with PRs 40168, 40165, 39158, 39966, 40190)
CRI: upgrade protobuf to v3
For #38854, this PR upgrades CRI protobuf version to v3, and also updated related packages for confirming to new api.
**Release note**:
```
CRI: upgrade protobuf version to v3.
```