From 4a5781e4c86f87fb474be74cad786e7578213172 Mon Sep 17 00:00:00 2001
From: Paul Morie
Date: Mon, 3 Oct 2016 11:12:04 -0400
Subject: [PATCH] Proposal: SELinux enhancements

---
 docs/proposals/selinux-enhancements.md | 247 +++++++++++++++++++++++++
 1 file changed, 247 insertions(+)
 create mode 100644 docs/proposals/selinux-enhancements.md

diff --git a/docs/proposals/selinux-enhancements.md b/docs/proposals/selinux-enhancements.md
new file mode 100644
index 0000000000..e9d153f60b
--- /dev/null
+++ b/docs/proposals/selinux-enhancements.md
@@ -0,0 +1,247 @@
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+## Abstract
+
+Presents a proposal for enhancing the security of Kubernetes clusters using
+SELinux and simplifying the implementation of SELinux support within the
+Kubelet by removing the need to label the Kubelet directory with an SELinux
+context usable from a container.
+
+## Motivation
+
+The current Kubernetes codebase relies upon the Kubelet directory being
+labeled with an SELinux context usable from a container. This means that a
+container escaping namespace isolation will be able to use any file within the
+Kubelet directory without defeating kernel
+[MAC (mandatory access control)](https://en.wikipedia.org/wiki/Mandatory_access_control).
+In order to limit the attack surface, we should enhance the Kubelet to relabel
+any bind-mounts into containers into a usable SELinux context without depending
+on the Kubelet directory's SELinux context.
+
+## Constraints and Assumptions
+
+1. No API changes allowed
+2. Behavior must be fully backward compatible
+3. No new admission controllers - make incremental improvements without huge
+   refactorings
+
+## Use Cases
+
+1. As a cluster operator, I want to avoid having to label the Kubelet
+   directory with a label usable from a container, so that I can limit the
+   attack surface available to a container escaping its namespace isolation
+2. As a user, I want to run a pod without an SELinux context explicitly
+   specified and be isolated using MCS (multi-category security) on systems
+   where SELinux is enabled, so that the pods on each host are isolated from
+   one another
+3. As a user, I want to run a pod that uses the host IPC or PID namespace and
+   want the system to do the right thing with regard to SELinux, so that no
+   unnecessary relabel actions are performed
+
+### Labeling the Kubelet directory
+
+As previously stated, the current codebase relies on the Kubelet directory
+being labeled with an SELinux context usable from a container. The Kubelet
+uses the SELinux context of this directory to determine what SELinux context
+`tmpfs` mounts (provided by the EmptyDir memory-medium option) should receive.
+The problem with this is that it opens an attack surface to a container that
+escapes its namespace isolation; such a container would be able to use any
+file in the Kubelet directory without defeating kernel MAC.
+
+### SELinux when no context is specified
+
+When no SELinux context is specified, Kubernetes should just do the right
+thing, where doing the right thing is defined as isolating pods with a
+node-unique set of categories. Node-uniqueness means unique among the pods
+scheduled onto the node. Long-term, we want to have a cluster-wide allocator
+for MCS labels. Node-unique MCS labels are a good middle ground that is
+possible without a large new feature.
+
+### SELinux and host IPC and PID namespaces
+
+Containers in pods that use the host IPC or PID namespaces need access to
+other processes and IPC mechanisms on the host. Therefore, these containers
+should be run with the `spc_t` SELinux type by the container runtime. The
+`spc_t` type is an unconfined type that other SELinux domains are allowed to
+connect to. In the case where a pod uses one of these host namespaces, it
+should be unnecessary to relabel the pod's volumes.
+
+## Analysis
+
+### Libcontainer SELinux library
+
+Docker and rkt both use the libcontainer SELinux library. This library
+provides a method, `GetLxcContexts`, that returns unique SELinux contexts for
+container processes and the files they use. `GetLxcContexts`
+reads the base SELinux context information from a file at
+`/etc/selinux//contexts/lxc_contexts` and then adds a process-unique MCS label.
+
+Docker and rkt both leverage this call to determine the 'starting' SELinux
+contexts for containers.
+
+### Docker
+
+Docker's behavior when no SELinux context is defined for a container is to
+give the container a node-unique MCS label.
+
+#### Sharing IPC namespaces
+
+On the Docker runtime, the containers in a Kubernetes pod share the IPC and
+PID namespaces of the pod's infra container.
+
+Docker's behavior for containers sharing these namespaces is as follows: if a
+container B shares the IPC namespace of another container A, container B is
+given the SELinux context of container A. Therefore, for Kubernetes pods
+running on Docker, the containers in a pod should, absent other factors, have
+the same SELinux context.
+
+[**Known issue**](https://bugzilla.redhat.com/show_bug.cgi?id=1377869): When
+the seccomp profile is set on a Docker container that shares the IPC namespace
+of another container, that container will not receive the other container's
+SELinux context.
+
+#### Host IPC and PID namespaces
+
+In the case of a pod that shares the host IPC or PID namespace, this flag is
+simply ignored and the container receives the `spc_t` SELinux type. The
+`spc_t` type is unconfined, and so no relabeling needs to be done for volumes
+for these pods. Currently, however, there is code which relabels volumes into
+explicitly specified SELinux contexts for these pods. This code is unnecessary
+and should be removed.
+
+#### Relabeling bind-mounts
+
+Docker is capable of relabeling bind-mounts into containers using the `:Z`
+bind-mount flag. However, in the current implementation of the Docker runtime
+in Kubernetes, the `:Z` option is only applied when the pod's SecurityContext
+contains an SELinux context. We could easily implement the correct behaviors
+by always setting `:Z` on systems where SELinux is enabled.
+
+### rkt
+
+rkt's behavior when no SELinux context is defined for a pod is similar to
+Docker's: an SELinux context with a node-unique MCS label is given to the
+containers of a pod.
+
+#### Sharing IPC namespaces
+
+Containers (apps, in rkt terminology) in rkt pods share an IPC and PID
+namespace by default.
+
+#### Relabeling bind-mounts
+
+Bind-mounts into rkt pods are automatically relabeled into the pod's SELinux
+context.
+
+#### Host IPC and PID namespaces
+
+Using the host IPC and PID namespaces is not currently supported by rkt.
+
+## Proposed Changes
+
+### Refactor `pkg/util/selinux`
+
+1. The `selinux` package should provide a method `SELinuxEnabled` that returns
+   whether SELinux is enabled, and is built for all platforms (the
+   libcontainer SELinux library is only built on Linux)
+2. The `SelinuxContextRunner` interface should be renamed to `SELinuxRunner`
+   and be changed to have the same method names and signatures as the
+   libcontainer methods its implementations wrap
+3. The `SELinuxRunner` interface should have a new method added called
+   `GetLxcContexts`; this should return a **shared** (i.e., without MCS labels)
+   SELinux context usable by a container
+
+```go
+package selinux
+
+// Note: the libcontainer SELinux package is only built for Linux, so it is
+// necessary to have a NOP wrapper which is built for non-Linux platforms to
+// allow code that links to this package not to differentiate its own methods
+// for Linux and non-Linux platforms.
+//
+// SELinuxRunner wraps certain libcontainer SELinux calls. For more
+// information, see:
+//
+// https://github.com/opencontainers/runc/blob/master/libcontainer/selinux/selinux.go
+type SELinuxRunner interface {
+	// Setfilecon sets the SELinux context for the given path or returns an
+	// error.
+	Setfilecon(path, context string) error
+
+	// Getfilecon returns the SELinux context for the given path or returns an
+	// error.
+	Getfilecon(path string) (string, error)
+
+	// GetLxcContexts returns the process and file SELinux contexts to use for
+	// containers.
+	GetLxcContexts() (string, string)
+}
+```
+
+### Kubelet Changes
+
+1. The `relabelVolumes` method in `kubelet_volumes.go` is not needed and can
+   be removed
+2. The `GenerateRunContainerOptions` method in `kubelet_pods.go` should no
+   longer call `relabelVolumes`
+3. The `makeHostsMount` method in `kubelet_pods.go` should set the
+   `SELinuxRelabel` attribute of the mount for the pod's hosts file to `true`
+
+### Changes to `pkg/kubelet/dockertools/`
+
+1. The `makeMountBindings` method should be changed to:
+   1. No longer accept the `podHasSELinuxLabel` parameter
+   2. Always use the `:Z` bind-mount flag when SELinux is enabled and the mount
+      has the `SELinuxRelabel` attribute set to `true`
+2. The `runContainer` method should be changed to always use the `:Z`
+   bind-mount flag on the termination message mount when SELinux is enabled
+
+### Changes to `pkg/kubelet/rkt`
+
+There should not be any required changes for the rkt runtime; we should test to
+ensure things work as expected under rkt.
+
+### Changes to volume plugins and infrastructure
+
+1. The `VolumeHost` interface contains a method called `GetRootContext`; this
+   is an artifact of the old assumptions about the Kubelet directory's SELinux
+   context and can be removed
+2. The `empty_dir.go` file should be changed to create an `SELinuxRunner` and
+   call its `GetLxcContexts` method to determine the right SELinux context to
+   give `tmpfs` mounts
+
+### Changes to `pkg/controller/...`
+
+The `VolumeHost` abstraction is used in a couple of PV controllers as NOP
+implementations. These should be altered to no longer include `GetRootContext`.
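
The rule proposed for `makeMountBindings` (use `:Z` only when SELinux is enabled and the mount has `SELinuxRelabel` set) could look roughly like the sketch below. This is an illustration, not the Kubelet's actual code: the `Mount` struct, the `makeBind` helper, and the example paths are hypothetical and simplified, and the `selinuxEnabled` argument stands in for whatever the proposed `SELinuxEnabled` method would report.

```go
package main

import (
	"fmt"
	"strings"
)

// Mount is a hypothetical, simplified stand-in for the Kubelet's container
// mount type; only the fields needed for this illustration are included.
type Mount struct {
	HostPath       string
	ContainerPath  string
	ReadOnly       bool
	SELinuxRelabel bool
}

// makeBind (hypothetical helper) builds one Docker bind string of the form
// "hostPath:containerPath[:options]". The "Z" option is appended only when
// SELinux is enabled on the node AND the mount asks to be relabeled; it no
// longer depends on whether the pod specifies an SELinux context.
func makeBind(m Mount, selinuxEnabled bool) string {
	var opts []string
	if m.ReadOnly {
		opts = append(opts, "ro")
	}
	if selinuxEnabled && m.SELinuxRelabel {
		opts = append(opts, "Z")
	}
	bind := m.HostPath + ":" + m.ContainerPath
	if len(opts) > 0 {
		bind += ":" + strings.Join(opts, ",")
	}
	return bind
}

func main() {
	// Example paths are made up for illustration.
	mounts := []Mount{
		{HostPath: "/example/pod/etc-hosts", ContainerPath: "/etc/hosts", SELinuxRelabel: true},
		{HostPath: "/example/pod/config", ContainerPath: "/config", ReadOnly: true, SELinuxRelabel: true},
		{HostPath: "/example/host/data", ContainerPath: "/data"},
	}
	for _, m := range mounts {
		fmt.Println(makeBind(m, true))
	}
}
```

Passing the SELinux state in as a parameter keeps the helper free of platform-specific calls, which mirrors the proposal's goal of letting callers stay identical on Linux and non-Linux builds.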