Automatic merge from submit-queue
Include security options in the container created event
New container creation events look like:
```
Created container with docker id /k8s_bar2.a4; Security:[seccomp=sub/subtest(md5:07c9bcb4db631f7ca191d6e0bca49f76)]
Created container with docker id /k8s_bar2.a4; Security:[seccomp=unconfined apparmor=foo-profile]
```
The goal is to provide enough information to confirm that the requseted security constraints were honored.
For https://github.com/kubernetes/kubernetes/issues/31284
/cc @dchen1107 @thockin @jfrazelle @pweil- @pmorie
---
Justification for v1.4:
- Risk: low. This appends some additional information to a human readable message. A bug here would probably not break any functionality
- Roll-back: I don't anticipate any more changes to this area of the code. No functionality depends on this change.
- Cost of not including: Users don't get any (positive) confirmation that the AppArmor or Seccomp profile they requested were actually enabled.
Automatic merge from submit-queue
Add log message in Kubelet when controller attach/detach is enabled
Adds a message to the Kubelet log indicating whether controller attach/detach is enabled for a node.
cc @kubernetes/sig-storage
Automatic merge from submit-queue
Set imagefs rank and reclaim functions when nodefs+imagefs share comm…
Fixes#31192
I decided that the behavior should match the current output of the kubelet summary API. With no dedicated imagefs, the ranking and reclaim functions will match the nodefs ranking and reclaim functions.
/cc @ronnielai @vishh
Automatic merge from submit-queue
Fix hang/websocket timeout when streaming container log with no content
When streaming and following a container log, no response headers are sent from the kubelet `containerLogs` endpoint until the first byte of content is written to the log. This propagates back to the API server, which also will not send response headers until it gets response headers from the kubelet. That includes upgrade headers, which means a websocket connection upgrade is not performed and can time out.
To recreate, create a busybox pod that runs `/bin/sh -c 'sleep 30 && echo foo && sleep 10'`
As soon as the pod starts, query the kubelet API:
```
curl -N -k -v 'https://<node>:10250/containerLogs/<ns>/<pod>/<container>?follow=true&limitBytes=100'
```
or the master API:
```
curl -N -k -v 'http://<master>:8080/api/v1/<ns>/pods/<pod>/log?follow=true&limitBytes=100'
```
In both cases, notice that the response headers are not sent until the first byte of log content is available.
This PR:
* does a 0-byte write prior to handing off to the container runtime stream copy. That commits the response header, even if the subsequent copy blocks waiting for the first byte of content from the log.
* fixes a bug with the "ping" frame sent to websocket streams, which was not respecting the requested protocol (it was sending a binary frame to a websocket that requested a base64 text protocol)
* fixes a bug in the limitwriter, which was not propagating 0-length writes, even before the writer's limit was reached
Automatic merge from submit-queue
rkt: Force `rkt fetch` to fetch from remote to conform the image pull policy.
Fix https://github.com/kubernetes/kubernetes/issues/27646
Use `--no-store` option for `rkt fetch` to force it to fetch from remote.
However, `--no-store` will fetch the remote image regardless of whether the content of the image has changed or not.
This causes performance downgrade when the image tag is ':latest' and the image pull policy is 'always'.
The issue is tracked in https://github.com/coreos/rkt/issues/2937.
Automatic merge from submit-queue
Filter internal Kubernetes labels from Prometheus metrics
**What this PR does / why we need it**:
Kubernetes uses Docker labels as storage for some internal labels. The
majority of these labels are not meaningful metric labels and a few of
them are even harmful as they're not static and cause wrong aggregation
results.
This change provides a custom labels func to only attach meaningful
labels to cAdvisor exported metrics.
**Which issue this PR fixes**
google/cadvisor#1312
**Special notes for your reviewer**:
Depends on google/cadvisor#1429. Once that is merged, I'll update the vendor update commit.
**Release note**:
```release-note
Remove environment variables and internal Kubernetes Docker labels from cAdvisor Prometheus metric labels.
Old behavior:
- environment variables explicitly whitelisted via --docker-env-metadata-whitelist were exported as `container_env_*=*`. Default is zero so by default non were exported
- all docker labels were exported as `container_label_*=*`
New behavior:
- Only `container_name`, `pod_name`, `namespace`, `id`, `image`, and `name` labels are exposed
- no environment variables will be exposed ever via /metrics, even if whitelisted
```
---
Given that we have full control over the exported label set, I shortened the pod_name, pod_namespace and container_name label names. Below an example of the change (reformatted for readability).
```
# BEFORE
container_cpu_cfs_periods_total{
container_label_io_kubernetes_container_hash="5af8c3b4",
container_label_io_kubernetes_container_name="sync",
container_label_io_kubernetes_container_restartCount="1",
container_label_io_kubernetes_container_terminationMessagePath="/dev/termination-log",
container_label_io_kubernetes_pod_name="popularsearches-web-3165456836-2bfey",
container_label_io_kubernetes_pod_namespace="popularsearches",
container_label_io_kubernetes_pod_terminationGracePeriod="30",
container_label_io_kubernetes_pod_uid="6a291e48-47c4-11e6-84a4-c81f66bdf8bd",
id="/docker/68e1f15353921f4d6d4d998fa7293306c4ac828d04d1284e410ddaa75cf8cf25",
image="redacted.com/popularsearches:42-16-ba6bd88",
name="k8s_sync.5af8c3b4_popularsearches-web-3165456836-2bfey_popularsearches_6a291e48-47c4-11e6-84a4-c81f66bdf8bd_c02d3775"
} 72819
# AFTER
container_cpu_cfs_periods_total{
container_name="sync",
pod_name="popularsearches-web-3165456836-2bfey",
namespace="popularsearches",
id="/docker/68e1f15353921f4d6d4d998fa7293306c4ac828d04d1284e410ddaa75cf8cf25",
image="redacted.com/popularsearches:42-16-ba6bd88",
name="k8s_sync.5af8c3b4_popularsearches-web-3165456836-2bfey_popularsearches_6a291e48-47c4-11e6-84a4-c81f66bdf8bd_c02d3775"
} 72819
```
Feedback requested on:
* Label names. Other suggestions? Should we keep these very long ones?
* Do we need to export io.kubernetes.pod.uid? It makes working with the metrics a bit more complicated and the pod name is already unique at any time (but not over time). The UID is aslo part of `name`.
As discussed with @timstclair, this should be added to v1.4 as the current labels are harmful.
PTAL @jimmidyson @fabxc @vishh
This refactor removes the legacy KubeletConfig object and adds a new
KubeletDeps object, which contains injected runtime objects and
separates them from static config. It also reduces NewMainKubelet to two
arguments: a KubeletConfiguration and a KubeletDeps.
Some mesos and kubemark code was affected by this change, and has been
modified accordingly.
And a few final notes:
KubeletDeps:
KubeletDeps will be a temporary bin for things we might consider
"injected dependencies", until we have a better dependency injection
story for the Kubelet. We will have to discuss this eventually.
RunOnce:
We will likely not pull new KubeletConfiguration from the API server
when in runonce mode, so it doesn't make sense to make this something
that can be configured centrally. We will leave it as a flag-only option
for now. Additionally, it is increasingly looking like nobody actually uses the
Kubelet's runonce mode anymore, so it may be a candidate for deprecation
and removal.
Automatic merge from submit-queue
rkt: Improve support for privileged pod (pod whose all containers are privileged)
Fix https://github.com/kubernetes/kubernetes/issues/31100
This takes advantage of https://github.com/coreos/rkt/pull/2983 . By appending the new `--all-run` insecure-options to `rkt run-prepared` command when all the containers are privileged. The pod now gets more privileged power.
Automatic merge from submit-queue
Add sysctl support
Implementation of proposal https://github.com/kubernetes/kubernetes/pull/26057, feature https://github.com/kubernetes/features/issues/34
TODO:
- [x] change types.go
- [x] implement docker and rkt support
- [x] add e2e tests
- [x] decide whether we want apiserver validation
- ~~[ ] add documentation~~: api docs exist. Existing PodSecurityContext docs is very light and links back to the api docs anyway: 6684555ed9/docs/user-guide/security-context.md
- [x] change PodSecurityPolicy in types.go
- [x] write admission controller support for PodSecurityPolicy
- [x] write e2e test for PodSecurityPolicy
- [x] make sure we are compatible in the sense of https://github.com/kubernetes/kubernetes/blob/master/docs/devel/api_changes.md
- [x] test e2e with rkt: it only works with kubenet, not with no-op network plugin. The later has no sysctl support.
- ~~[ ] add RunC implementation~~ (~~if that is already in kube,~~ it isn't)
- [x] update whitelist
- [x] switch PSC fields to annotations
- [x] switch PSP fields to annotations
- [x] decide about `--experimental-whitelist-sysctl` flag to be additive or absolute
- [x] decide whether to add a sysctl node whitelist annotation
### Release notes:
```release-note
The pod annotation `security.alpha.kubernetes.io/sysctls` now allows customization of namespaced and well isolated kernel parameters (sysctls), starting with `kernel.shm_rmid_forced`, `net.ipv4.ip_local_port_range`, `net.ipv4.tcp_max_syn_backlog` and `net.ipv4.tcp_syncookies` for Kubernetes 1.4.
The pod annotation `security.alpha.kubernetes.io/unsafeSysctls` allows customization of namespaced sysctls where isolation is unclear. Unsafe sysctls must be enabled at-your-own-risk on the kubelet with the `--experimental-allowed-unsafe-sysctls` flag. Future versions will improve on resource isolation and more sysctls will be considered safe.
```
Automatic merge from submit-queue
Increase request timeout based on termination grace period
When terminationGracePeriodSeconds is set to > 2 minutes (which is
the default request timeout), ContainerStop() times out at 2 minutes.
We should check the timeout being passed in and bump up the
request timeout if needed.
Fixes#31219
Automatic merge from submit-queue
Kubelet code move: volume / util
Addresses some odds and ends that I apparently missed earlier. Preparation for kubelet code-move ENDGAME.
cc @kubernetes/sig-node
Automatic merge from submit-queue
Kubelet: implement GetNetNS for new runtime api
Kubelet: implement GetNetNS for new runtime api.
CC @yujuhong @thockin @kubernetes/sig-node @kubernetes/sig-rktnetes
Automatic merge from submit-queue
Add kubelet --network-plugin-mtu flag for MTU selection
* Add network-plugin-mtu option which lets us pass down a MTU to a network provider (currently processed by kubenet)
* Add a test, and thus make sysctl testable
When terminationGracePeriodSeconds is set to > 2 minutes (which is
the default request timeout), ContainerStop() times out at 2 minutes.
We should check the timeout being passed in and bump up the
request timeout if needed.
Fixes#31219
Automatic merge from submit-queue
[kubelet test] Improve node status test debug info
I find the output format `%v` of glog couldn't output useful information of an `api.Node` object. The output of this line https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_node_status_test.go#L492
is
```
kubelet_node_status_test.go:491: expected
&TypeMeta{Kind:,APIVersion:,}
, got
&TypeMeta{Kind:,APIVersion:,}
```
- It's difficult for me to tell the difference between expected and got.
- I prefer to use `diff.ObjectDiff(expectedNode, updatedNode)` to output the debug information as it will point out the starting character of the different objects.
I think this line https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_node_status_test.go#L647 can use `diff.ObjectDiff()` as well.
The related issus is #30952
MTU selection is difficult, and if there is a transport such as IPSEC in
use may be impossible. So we allow specification of the MTU with the
network-plugin-mtu flag, and we pass this down into the network
provider.
Currently implemented by kubenet.
Automatic merge from submit-queue
Kubelet: pass pod name/namespace/uid in new runtime API
First part of #30463.
Pass pod name/namespace/uid in new runtime API and change dockershim to build unique sandbox/container name based on them.
CC @yujuhong @euank @yifan-gu @kubernetes/sig-node
Automatic merge from submit-queue
remove duplicate code in updatePodCIDR
As kl.runtimeState.podCIDR() is a sync method, need fetch lock and release lock, so we only invoke once here
Kubernetes uses Docker labels as storage for some internal labels. The
majority of these labels are not meaningful metric labels and a few of
them are even harmful as they're not static and cause wrong aggregation
results.
This change provides a custom labels func to only attach meaningful
labels to cAdvisor exported metrics.
Automatic merge from submit-queue
Kubelet: implement GetPods for new runtime API
Implement GetPods for kuberuntime. Part of #28789 .
CC @yujuhong @Random-Liu
Automatic merge from submit-queue
Kubelet: add --container-runtime-endpoint and --image-service-endpoint
Flag `--container-runtime-endpoint` (overrides `--container-runtime`) is introduced to identify the unix socket file of the remote runtime service. And flag `--image-service-endpoint` is introduced to identify the unix socket file of the image service.
This PR is part of #28789 Milestone 0.
CC @yujuhong @Random-Liu