WARNING WARNING WARNING WARNING WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

If you are using a released version of Kubernetes, you should refer to the docs that go with that version. Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). -- ## Abstract Presents a proposal for enhancing the security of Kubernetes clusters using SELinux and simplifying the implementation of SELinux support within the Kubelet by removing the need to label the Kubelet directory with an SELinux context usable from a container. ## Motivation The current Kubernetes codebase relies upon the Kubelet directory being labeled with an SELinux context usable from a container. This means that a container escaping namespace isolation will be able to use any file within the Kubelet directory without defeating kernel [MAC (mandatory access control)](https://en.wikipedia.org/wiki/Mandatory_access_control). In order to limit the attack surface, we should enhance the Kubelet to relabel any bind-mounts into containers into a usable SELinux context without depending on the Kubelet directory's SELinux context. ## Constraints and Assumptions 1. No API changes allowed 2. Behavior must be fully backward compatible 3. No new admission controllers - make incremental improvements without huge refactorings ## Use Cases 1. As a cluster operator, I want to avoid having to label the Kubelet directory with a label usable from a container, so that I can limit the attack surface available to a container escaping its namespace isolation 2. As a user, I want to run a pod without an SELinux context explicitly specified and be isolated using MCS (multi-category security) on systems where SELinux is enabled, so that the pods on each host are isolated from one another 3. As a user, I want to run a pod that uses the host IPC or PID namespace and want the system to do the right thing with regard to SELinux, so that no unnecessary relabel actions are performed ### Labeling the Kubelet directory As previously stated, the current codebase relies on the Kubelet directory being labeled with an SELinux context usable from a container. The Kubelet uses the SELinux context of this directory to determine what SELinux context `tmpfs` mounts (provided by the EmptyDir memory-medium option) should receive. The problem with this is that it opens an attack surface to a container that escapes its namespace isolation; such a container would be able to use any file in the Kubelet directory without defeating kernel MAC. ### SELinux when no context is specified When no SELinux context is specified, Kubernetes should just do the right thing, where doing the right thing is defined as isolating pods with a node- unique set of categories. Node-uniqueness means unique among the pods scheduled onto the node. Long-term, we want to have a cluster-wide allocator for MCS labels. Node-unique MCS labels are a good middle ground that is possible without a new, large, feature. ### SELinux and host IPC and PID namespaces Containers in pods that use the host IPC or PID namespaces need access to other processes and IPC mechanisms on the host. Therefore, these containers should be run with the `spc_t` SELinux type by the container runtime. The `spc_t` type is an unconfined type that other SELinux domains are allowed to connect to. In the case where a pod uses one of these host namespaces, it should be unnecessary to relabel the pod's volumes. ## Analysis ### Libcontainer SELinux library Docker and rkt both use the libcontainer SELinux library. This library provides a method, `GetLxcContexts`, that returns the a unique SELinux contexts for container processes and files used by them. `GetLxcContexts` reads the base SELinux context information from a file at `/etc/selinux//contexts/lxc_contexts` and then adds a process-unique MCS label. Docker and rkt both leverage this call to determine the 'starting' SELinux contexts for containers. ### Docker Docker's behavior when no SELinux context is defined for a container is to give the container a node-unique MCS label. #### Sharing IPC namespaces On the Docker runtime, the containers in a Kubernetes pod share the IPC and PID namespaces of the pod's infra container. Docker's behavior for containers sharing these namespaces is as follows: if a container B shares the IPC namespace of another container A, container B is given the SELinux context of container A. Therefore, for Kubernetes pods running on docker, in a vacuum the containers in a pod should have the same SELinux context. [**Known issue**](https://bugzilla.redhat.com/show_bug.cgi?id=1377869): When the seccomp profile is set on a docker container that shares the IPC namespace of another container, that container will not receive the other container's SELinux context. #### Host IPC and PID namespaces In the case of a pod that shares the host IPC or PID namespace, this flag is simply ignored and the container receives the `spc_t` SELinux type. The `spc_t` type is unconfined, and so no relabeling needs to be done for volumes for these pods. Currently, however, there is code which relabels volumes into explicitly specified SELinux contexts for these pods. This code is unnecessary and should be removed. #### Relabeling bind-mounts Docker is capable of relabeling bind-mounts into containers using the `:Z` bind-mount flag. However, in the current implementation of the docker runtime in Kubernetes, the `:Z` option is only applied when the pod's SecurityContext contains an SELinux context. We could easily implement the correct behaviors by always setting `:Z` on systems where SELinux is enabled. ### rkt rkt's behavior when no SELinux context is defined for a pod is similar to Docker's -- an SELinux context with a node-unique MCS label is given to the containers of a pod. #### Sharing IPC namespaces Containers (apps, in rkt terminology) in rkt pods share an IPC and PID namespace by default. #### Relabeling bind-mounts Bind-mounts into rkt pods are automatically relabeled into the pod's SELinux context. #### Host IPC and PID namespaces Using the host IPC and PID namespaces is not currently supported by rkt. ## Proposed Changes ### Refactor `pkg/util/selinux` 1. The `selinux` package should provide a method `SELinuxEnabled` that returns whether SELinux is enabled, and is built for all platforms (the libcontainer SELinux is only built on linux) 2. The `SelinuxContextRunner` interface should be renamed to `SELinuxRunner` and be changed to have the same method names and signatures as the libcontainer methods its implementations wrap 3. The `SELinuxRunner` interface only needs `Getfilecon`, which is used by the rkt code ```go package selinux // Note: the libcontainer SELinux package is only built for Linux, so it is // necessary to have a NOP wrapper which is built for non-Linux platforms to // allow code that links to this package not to differentiate its own methods // for Linux and non-Linux platforms. // // SELinuxRunner wraps certain libcontainer SELinux calls. For more // information, see: // // https://github.com/opencontainers/runc/blob/master/libcontainer/selinux/selinux.go type SELinuxRunner interface { // Getfilecon returns the SELinux context for the given path or returns an // error. Getfilecon(path string) (string, error) } ``` ### Kubelet Changes 1. The `relabelVolumes` method in `kubelet_volumes.go` is not needed and can be removed 2. The `GenerateRunContainerOptions` method in `kubelet_pods.go` should no longer call `relabelVolumes` 3. The `makeHostsMount` method in `kubelet_pods.go` should set the `SELinuxRelabel` attribute of the mount for the pod's hosts file to `true` ### Changes to `pkg/kubelet/dockertools/` 1. The `makeMountBindings` should be changed to: 1. No longer accept the `podHasSELinuxLabel` parameter 2. Always use the `:Z` bind-mount flag when SELinux is enabled and the mount has the `SELinuxRelabel` attribute set to `true` 2. The `runContainer` method should be changed to always use the `:Z` bind-mount flag on the termination message mount when SELinux is enabled ### Changes to `pkg/kubelet/rkt` The should not be any required changes for the rkt runtime; we should test to ensure things work as expected under rkt. ### Changes to volume plugins and infrastructure 1. The `VolumeHost` interface contains a method called `GetRootContext`; this is an artifact of the old assumptions about the Kubelet directory's SELinux context and can be removed 2. The `empty_dir.go` file should be changed to be completely agnostic of SELinux; no behavior in this plugin needs to be differentiated when SELinux is enabled ### Changes to `pkg/controller/...` The `VolumeHost` abstraction is used in a couple of PV controllers as NOP implementations. These should be altered to no longer include `GetRootContext`. [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/selinux-enhancements.md?pixel)]()