Merge pull request #14192 from pmorie/generic-selinux

Pod level SELinux context and volumes
2015-10-05 14:43:44 -07:00 · 2015-10-05 14:43:44 -07:00 · a81545db15
parent fe185f5ea7 7f049b7deb
commit a81545db15
1 changed files with 347 additions and 0 deletions
--- a/docs/proposals/selinux.md
+++ b/docs/proposals/selinux.md
@@ -0,0 +1,347 @@
+<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
+
+<!-- BEGIN STRIP_FOR_RELEASE -->
+
+<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
+     width="25" height="25">
+
+<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+<strong>
+The latest 1.0.x release of this document can be found
+[here](http://releases.k8s.io/release-1.0/docs/proposals/selinux.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+</strong>
+--
+
+<!-- END STRIP_FOR_RELEASE -->
+
+<!-- END MUNGE: UNVERSIONED_WARNING -->
+
+## Abstract
+
+A proposal for enabling containers in a pod to share volumes using a pod level SELinux context.
+
+## Motivation
+
+Many users have a requirement to run pods on systems that have SELinux enabled.  Volume plugin
+authors should not have to explicitly account for SELinux except for volume types that require
+special handling of the SELinux context during setup.
+
+Currently, each container in a pod has an SELinux context.  This is not an ideal factoring for
+sharing resources using SELinux.
+
+We propose a pod-level SELinux context and a mechanism to support SELinux labeling of volumes in a
+generic way.
+
+Goals of this design:
+
+1.  Describe the problems with a container SELinux context
+2.  Articulate a design for generic SELinux support for volumes using a pod level SELinux context
+    which is backward compatible with the v1.0.0 API
+
+## Constraints and Assumptions
+
+1.  We will not support securing containers within a pod from one another
+2.  Volume plugins should not have to handle setting SELinux context on volumes
+3.  We will not deal with shared storage
+
+## Current State Overview
+
+### Docker
+
+Docker uses a base SELinux context and calculates a unique MCS label per container.  The SELinux
+context of a container can be overridden with the `SecurityOpt` API, which allows setting the
+different parts of the SELinux context individually.
+
+Docker has functionality to relabel bind-mounts with a usable SELinux context and supports two
+different use-cases:
+
+1.  The `:Z` bind-mount flag, which tells Docker to relabel a bind-mount with the container's
+    SELinux context
+2.  The `:z` bind-mount flag, which tells Docker to relabel a bind-mount with the container's
+    SELinux context, but remove the MCS labels, making the volume shareable between containers
+
+We should avoid using the `:z` flag, because it relaxes the SELinux context so that any container
+(from an SELinux standpoint) can use the volume.
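+
+The difference between the two flags can be illustrated with a minimal Go sketch of how a runtime
+integration might build a bind-mount string.  The `relabelMode` type, the `bindSpec` helper, and
+the paths are hypothetical names used for illustration only; they are not part of Docker or
+Kubernetes:
+
+```go
+package main
+
+import "fmt"
+
+// relabelMode is a hypothetical enum for this sketch, not a Docker or Kubernetes type.
+type relabelMode int
+
+const (
+    noRelabel      relabelMode = iota // leave the existing label untouched
+    privateRelabel                    // ":Z": relabel with the container's context, MCS label included
+    sharedRelabel                     // ":z": relabel without MCS labels (avoided by this proposal)
+)
+
+// bindSpec builds a bind-mount string of the form host:container[:flag].
+func bindSpec(hostPath, containerPath string, mode relabelMode) string {
+    spec := hostPath + ":" + containerPath
+    switch mode {
+    case privateRelabel:
+        return spec + ":Z"
+    case sharedRelabel:
+        return spec + ":z"
+    }
+    return spec
+}
+
+func main() {
+    // Hypothetical host and container paths.
+    fmt.Println(bindSpec("/tmp/example-volume", "/scratch", privateRelabel))
+    fmt.Println(bindSpec("/tmp/example-volume", "/scratch", noRelabel))
+}
+```
+
+Only the `privateRelabel` case keeps the MCS label on the volume, which is why it is the one this
+proposal relies on.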
+
+### Rocket
+
+Rocket currently reads the base SELinux context to use from `/etc/selinux/*/contexts/lxc_contexts`
+and allocates a unique MCS label per pod.
+
+### Kubernetes
+
+There is a [proposed change](https://github.com/GoogleCloudPlatform/kubernetes/pull/9844) to the
+EmptyDir plugin that adds SELinux relabeling capabilities to that plugin, which is also carried as a
+patch in [OpenShift](https://github.com/openshift/origin).  Solving the general problem of handling
+SELinux in Kubernetes is preferable to merging this PR.
+
+A new `PodSecurityContext` type has been added that carries information about security attributes
+that apply to the entire pod and to all of its containers.  See:
+
+1.  [Skeletal implementation](https://github.com/kubernetes/kubernetes/pull/13939)
+1.  [Proposal for inlining container security fields](https://github.com/kubernetes/kubernetes/pull/12823)
+
+## Use Cases
+
+1.  As a cluster operator, I want to support securing pods from one another using SELinux when
+    SELinux integration is enabled in the cluster
+2.  As a user, I want volume sharing to work correctly amongst containers in pods
+
+#### SELinux context: pod- or container- level?
+
+Currently, SELinux context is specifiable only at the container level.  This is an inconvenient
+factoring for sharing volumes and other SELinux-secured resources between containers because there
+is no way in SELinux to share resources between processes with different MCS labels except to
+remove MCS labels from the shared resource.  This is a big security risk: _any container_ in the
+system can work with a resource which has the same SELinux context as it and no MCS labels.  Since
+we are also not interested in isolating containers in a pod from one another, the SELinux context
+should be shared by all containers in a pod to facilitate isolation from the containers in other
+pods and sharing resources amongst all the containers of a pod.
+
+#### Volumes
+
+Kubernetes volumes can be divided into two broad categories:
+
+1.  Unshared storage:
+    1.  Volumes created by the kubelet on the host directory: empty directory, git repo, secret,
+        downward api.  All volumes in this category delegate to `EmptyDir` for their underlying
+        storage.
+    2.  Volumes based on network block devices: AWS EBS, iSCSI, RBD, etc, *when used exclusively
+        by a single pod*.
+2.  Shared storage:
+    1.  `hostPath` is shared storage because it is necessarily used by a container and the host
+    2.  Network file systems such as NFS, Glusterfs, Cephfs, etc.
+    3.  Block device based volumes in `ReadOnlyMany` or `ReadWriteMany` modes are shared because
+        they may be used simultaneously by multiple pods.
+
+For unshared storage, SELinux handling for most volumes can be generalized into running a `chcon`
+operation on the volume directory after running the volume plugin's `Setup` function.  For these
+volumes, the Kubelet can perform the `chcon` operation and keep SELinux concerns out of the volume
+plugin code.  Some volume plugins may need to use the SELinux context during a mount operation in
+certain cases.  To account for this, our design must have a way for volume plugins to state that
+a particular volume should or should not receive generic label management.
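+
+For the generic case, the relabel step amounts to one recursive `chcon` call on the volume
+directory once `Setup` has returned.  A minimal Go sketch, assuming the `chcon` binary is on the
+node's `PATH`; the `chconCommand` helper, the example context, and the directory are hypothetical:
+
+```go
+package main
+
+import (
+    "fmt"
+    "os/exec"
+)
+
+// chconCommand is a hypothetical helper.  It returns the command that would
+// recursively relabel dir with the given SELinux context; it does not run it.
+func chconCommand(context, dir string) *exec.Cmd {
+    return exec.Command("chcon", "-R", context, dir)
+}
+
+func main() {
+    // Example values only: the context and the directory are assumptions.
+    cmd := chconCommand("system_u:object_r:svirt_sandbox_file_t:s0:c1,c2", "/tmp/example-volume")
+    fmt.Println(cmd.Args)
+}
+```
+
+Calling `cmd.Run()` would perform the relabel.  A volume that opts out of generic label management
+would simply skip this step.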
+
+For shared storage, the picture is murkier.  Labels for existing shared storage will be managed
+outside Kubernetes and administrators will have to set the SELinux context of pods correctly.
+The problem of solving SELinux label management for new shared storage is outside the scope for
+this proposal.
+
+## Analysis
+
+The system needs to be able to:
+
+1.  Model correctly which volumes require SELinux label management
+1.  Relabel volumes with the correct SELinux context when required
+
+### Modeling whether a volume requires label management
+
+#### Unshared storage: volumes derived from `EmptyDir`
+
+Empty dir and volumes derived from it are created by the system, so Kubernetes must always ensure
+that the ownership and SELinux context (when relevant) are set correctly for the volume to be
+usable.
+
+#### Unshared storage: network block devices
+
+Volume plugins based on network block devices such as AWS EBS and RBD can be treated the same way
+as local volumes.  Since inodes are written to these block devices in the same way as `EmptyDir`
+volumes, permissions and ownership can be managed on the client side by the Kubelet when used
+exclusively by one pod.  When the volumes are used outside of a persistent volume, or with the
+`ReadWriteOnce` mode, they are effectively unshared storage.
+
+When used by multiple pods, there are many additional use-cases to analyze before we can be
+confident that we can support SELinux label management robustly with these file systems.  The right
+design is one that makes it easy to experiment and develop support for ownership management with
+volume plugins to enable developers and cluster operators to continue exploring these issues.
+
+#### Shared storage: hostPath
+
+The `hostPath` volume should only be used by effective-root users, and the permissions of paths
+exposed into containers via hostPath volumes should always be managed by the cluster operator.  If
+the Kubelet managed the SELinux labels for `hostPath` volumes, a user who could create a `hostPath`
+volume could effect changes in the state of arbitrary paths within the host's filesystem.  This
+would be a severe security risk, so we will consider hostPath a corner case for which the kubelet
+should never perform SELinux label management.
+
+#### Shared storage: network
+
+Ownership management of shared storage is a complex topic.  SELinux labels for existing shared
+storage will be managed externally from Kubernetes.  For this case, our API should make it simple to
+express whether a particular volume should have these concerns managed by Kubernetes.
+
+We will not attempt to address the concerns of new shared storage in this proposal.
+
+When a network block device is used as a persistent volume in `ReadWriteMany` or `ReadOnlyMany`
+modes, it is shared storage, and thus outside the scope of this proposal.
+
+#### API requirements
+
+From the above, we know that label management must be applied:
+
+1.  To some volume types always
+2.  To some volume types never
+3.  To some volume types *sometimes*
+
+Volumes should be relabeled with the correct SELinux context.  Docker has this capability today; it
+is desirable for other container runtime implementations to provide similar functionality.
+
+Relabeling should be an optional aspect of a volume plugin to accommodate:
+
+1.  volume types for which generalized relabeling support is not sufficient
+2.  testing for each volume plugin individually
+
+## Proposed Design
+
+Our design should minimize the SELinux labeling code required in the Kubelet and volume
+plugins.
+
+### Deferral: MCS label allocation
+
+Our short-term goal is to facilitate volume sharing and isolation with SELinux and expose the
+primitives for higher level composition; making these automatic is a longer-term goal.  Allocating
+groups and MCS labels are fairly complex problems in their own right, and so our proposal will not
+encompass either of these topics.  There are several problems that the solution for allocation
+depends on:
+
+1.  Users and groups in Kubernetes
+2.  General auth policy in Kubernetes
+3.  [security policy](https://github.com/GoogleCloudPlatform/kubernetes/pull/7893)
+
+### API changes
+
+The [inline container security attributes PR (12823)](https://github.com/kubernetes/kubernetes/pull/12823)
+adds a `pod.Spec.SecurityContext.SELinuxOptions` field.  The change to the API in this proposal is
+the addition of the semantics to this field:
+
+* When the `pod.Spec.SecurityContext.SELinuxOptions` field is set, volumes that support SELinux
+label management in the Kubelet have their SELinux context set from this field.
+
+```go
+package api
+
+type PodSecurityContext struct {
+    // SELinuxOptions captures the SELinux context for all containers in a Pod.  If a container's
+    // SecurityContext.SELinuxOptions field is set, that setting takes precedence for that
+    // container.
+    //
+    // This field will be used to set the SELinux context of volumes that support SELinux label
+    // management by the kubelet.
+    SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"`
+}
+```
+
+The V1 API is extended with the same semantics:
+
+```go
+package v1
+
+type PodSecurityContext struct {
+    // SELinuxOptions captures the SELinux context for all containers in a Pod.  If a container's
+    // SecurityContext.SELinuxOptions field is set, that setting takes precedence for that
+    // container.
+    //
+    // This field will be used to set the SELinux context of volumes that support SELinux label
+    // management by the kubelet.
+    SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"`
+}
+```
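+
+The precedence rule in the field comment can be sketched as follows.  This is an illustration
+under assumptions, not the kubelet's implementation: `effectiveSELinuxOptions` is a hypothetical
+helper, the local `SELinuxOptions` struct is a simplified stand-in for the API type, and the sketch
+assumes a container-level setting replaces the pod-level one wholesale rather than field by field:
+
+```go
+package main
+
+import "fmt"
+
+// SELinuxOptions is a simplified stand-in for the API type, declared locally
+// so that the sketch is self-contained.
+type SELinuxOptions struct {
+    User, Role, Type, Level string
+}
+
+// effectiveSELinuxOptions is a hypothetical helper: the container-level
+// setting wins when present, otherwise the pod-level setting applies.
+func effectiveSELinuxOptions(pod, container *SELinuxOptions) *SELinuxOptions {
+    if container != nil {
+        return container
+    }
+    return pod
+}
+
+func main() {
+    pod := &SELinuxOptions{Level: "s0:c1,c2"}
+    override := &SELinuxOptions{Level: "s0:c3,c4"}
+
+    fmt.Println(effectiveSELinuxOptions(pod, nil).Level)      // container inherits the pod context
+    fmt.Println(effectiveSELinuxOptions(pod, override).Level) // container-level setting wins
+}
+```
+
+Volumes are labeled from the pod-level field only, so a container that overrides the context with
+a different MCS label may not be able to use the pod's volumes.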
+
+#### API backward compatibility
+
+Old pods that do not have the `pod.Spec.SecurityContext.SELinuxOptions` field set will not receive
+SELinux label management for their volumes.  This is acceptable since old clients won't know about
+this field and won't have any expectation of their volumes being managed this way.
+
+The existing backward compatibility semantics for SELinux do not change at all with this proposal.
+
+### Kubelet changes
+
+The Kubelet should be modified to perform SELinux label management when required for a volume.  The
+criteria to activate the kubelet SELinux label management for volumes are:
+
+1.  SELinux integration is enabled in the cluster
+2.  SELinux is enabled on the node
+3.  The `pod.Spec.SecurityContext.SELinuxOptions` field is set
+4.  The volume plugin supports SELinux label management
+
+The `volume.Builder` interface should have a new method added that indicates whether the plugin
+supports SELinux label management:
+
+```go
+package volume
+
+type Builder interface {
+    // other methods omitted
+    SupportsSELinux() bool
+}
+```
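+
+Taken together, the four criteria and the `SupportsSELinux` method reduce to a single conjunction.
+A minimal sketch, in which the `labelInputs` struct, the `shouldManageLabels` function, and the
+`fakeBuilder` type are hypothetical names, and `Builder` is cut down to the one proposed method:
+
+```go
+package main
+
+import "fmt"
+
+// Builder is reduced here to the single method proposed above.
+type Builder interface {
+    SupportsSELinux() bool
+}
+
+// fakeBuilder is a hypothetical stand-in for a volume plugin's builder.
+type fakeBuilder bool
+
+func (f fakeBuilder) SupportsSELinux() bool { return bool(f) }
+
+// labelInputs is a hypothetical grouping of the first three criteria.
+type labelInputs struct {
+    clusterSELinuxEnabled bool // SELinux integration is enabled in the cluster
+    nodeSELinuxEnabled    bool // SELinux is enabled on the node
+    podHasSELinuxOptions  bool // pod.Spec.SecurityContext.SELinuxOptions is set
+}
+
+// shouldManageLabels reports whether the kubelet would relabel the volume:
+// all four criteria must hold.
+func shouldManageLabels(in labelInputs, b Builder) bool {
+    return in.clusterSELinuxEnabled &&
+        in.nodeSELinuxEnabled &&
+        in.podHasSELinuxOptions &&
+        b.SupportsSELinux()
+}
+
+func main() {
+    in := labelInputs{clusterSELinuxEnabled: true, nodeSELinuxEnabled: true, podHasSELinuxOptions: true}
+    fmt.Println(shouldManageLabels(in, fakeBuilder(true)))  // plugin opts in
+    fmt.Println(shouldManageLabels(in, fakeBuilder(false))) // plugin opts out
+}
+```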
+
+Individual volume plugins are responsible for correctly reporting whether they support label
+management in the kubelet.  In the first round of work, only `hostPath` and `emptyDir` and its
+derivations will be tested with SELinux label management support:
+
+| Plugin Name             | SupportsSELinux               |
+|-------------------------|-------------------------------|
+| `hostPath`              | false                         |
+| `emptyDir`              | true                          |
+| `gitRepo`               | true                          |
+| `secret`                | true                          |
+| `downwardAPI`           | true                          |
+| `gcePersistentDisk`     | false                         |
+| `awsElasticBlockStore`  | false                         |
+| `nfs`                   | false                         |
+| `iscsi`                 | false                         |
+| `glusterfs`             | false                         |
+| `persistentVolumeClaim` | depends on underlying volume and PV mode |
+| `rbd`                   | false                         |
+| `cinder`                | false                         |
+| `cephfs`                | false                         |
+
+Ultimately, the matrix will theoretically look like:
+
+| Plugin Name             | SupportsSELinux               |
+|-------------------------|-------------------------------|
+| `hostPath`              | false                         |
+| `emptyDir`              | true                          |
+| `gitRepo`               | true                          |
+| `secret`                | true                          |
+| `downwardAPI`           | true                          |
+| `gcePersistentDisk`     | true                          |
+| `awsElasticBlockStore`  | true                          |
+| `nfs`                   | false                         |
+| `iscsi`                 | true                          |
+| `glusterfs`             | false                         |
+| `persistentVolumeClaim` | depends on underlying volume and PV mode |
+| `rbd`                   | true                          |
+| `cinder`                | false                         |
+| `cephfs`                | false                         |
+
+In order to limit the amount of SELinux label management code in Kubernetes, we propose that it be a
+function of the container runtime implementations.  Initially, we will modify the Docker runtime
+implementation to correctly set the `:Z` flag on the appropriate bind-mounts in order to accomplish
+generic label management for Docker containers.
+
+Volume types that require SELinux context information at mount time must be told whether label
+management is enabled for the volume type, and must respect that setting.  The proposed
+`VolumeConfig` mechanism will be used to carry this information to the volume plugins that have
+to manage labels individually.
+
+This allows the volume plugins to determine when they do and don't want this type of support from
+the Kubelet, and allows the criteria each plugin uses to evolve without changing the Kubelet.
+