mirror of https://github.com/k3s-io/k3s
Merge pull request #33951 from pmorie/selinux-overhaul
Automatic merge from submit-queue

Proposal: SELinux enhancements

TLDR: Try to make SELinux support better by not requiring the Kubelet directory to be labeled with an SELinux type usable from the container.

cc @kubernetes/sig-node @yifan-gu
commit 6f78c0d912

<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).

--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

## Abstract

This document presents a proposal for enhancing the security of Kubernetes
clusters using SELinux and for simplifying the implementation of SELinux
support within the Kubelet by removing the need to label the Kubelet directory
with an SELinux context usable from a container.

## Motivation

The current Kubernetes codebase relies upon the Kubelet directory being
labeled with an SELinux context usable from a container. This means that a
container escaping namespace isolation will be able to use any file within the
Kubelet directory without defeating kernel
[MAC (mandatory access control)](https://en.wikipedia.org/wiki/Mandatory_access_control).
In order to limit the attack surface, we should enhance the Kubelet to relabel
any bind-mounts into containers into a usable SELinux context without depending
on the Kubelet directory's SELinux context.

## Constraints and Assumptions

1. No API changes allowed
2. Behavior must be fully backward compatible
3. No new admission controllers - make incremental improvements without huge
   refactorings

## Use Cases

1. As a cluster operator, I want to avoid having to label the Kubelet
   directory with a label usable from a container, so that I can limit the
   attack surface available to a container escaping its namespace isolation
2. As a user, I want to run a pod without an SELinux context explicitly
   specified and be isolated using MCS (multi-category security) on systems
   where SELinux is enabled, so that the pods on each host are isolated from
   one another
3. As a user, I want to run a pod that uses the host IPC or PID namespace and
   want the system to do the right thing with regard to SELinux, so that no
   unnecessary relabel actions are performed

### Labeling the Kubelet directory

As previously stated, the current codebase relies on the Kubelet directory
being labeled with an SELinux context usable from a container. The Kubelet
uses the SELinux context of this directory to determine what SELinux context
`tmpfs` mounts (provided by the EmptyDir memory-medium option) should receive.
The problem with this is that it opens an attack surface to a container that
escapes its namespace isolation; such a container would be able to use any
file in the Kubelet directory without defeating kernel MAC.

### SELinux when no context is specified

When no SELinux context is specified, Kubernetes should just do the right
thing, where doing the right thing is defined as isolating pods with a
node-unique set of categories. Node-uniqueness means unique among the pods
scheduled onto the node. Long-term, we want to have a cluster-wide allocator
for MCS labels. Node-unique MCS labels are a good middle ground that is
possible without a large new feature.

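To make "node-unique set of categories" concrete, the following is a minimal sketch of an allocator that hands out MCS category pairs no other pod on the node holds. The `mcsAllocator` type, its methods, and the `s0:cX,cY` level layout with 1024 categories are assumptions made for illustration; they are not part of Kubernetes or libcontainer, and a real implementation would more likely pick pairs at random rather than scanning in order.

```go
package main

import (
	"errors"
	"fmt"
)

// maxCategories is an assumed category count (c0..c1023).
const maxCategories = 1024

// mcsAllocator is a hypothetical type that tracks which MCS levels are
// currently held by pods on this node.
type mcsAllocator struct {
	inUse map[string]bool
}

func newMCSAllocator() *mcsAllocator {
	return &mcsAllocator{inUse: map[string]bool{}}
}

// Allocate returns the first level of the form "s0:cX,cY" (X < Y) that is
// not already in use, and marks it as in use.
func (a *mcsAllocator) Allocate() (string, error) {
	for i := 0; i < maxCategories; i++ {
		for j := i + 1; j < maxCategories; j++ {
			level := fmt.Sprintf("s0:c%d,c%d", i, j)
			if !a.inUse[level] {
				a.inUse[level] = true
				return level, nil
			}
		}
	}
	return "", errors.New("no free MCS category pairs on this node")
}

// Release returns a level to the pool, for example when a pod is deleted.
func (a *mcsAllocator) Release(level string) {
	delete(a.inUse, level)
}

func main() {
	alloc := newMCSAllocator()
	podA, _ := alloc.Allocate()
	podB, _ := alloc.Allocate()
	fmt.Println("pod A level:", podA)
	fmt.Println("pod B level:", podB)
}
```

Because the uniqueness check is local to one allocator, two nodes can hand out the same pair; that is the gap a cluster-wide allocator would close.
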
### SELinux and host IPC and PID namespaces

Containers in pods that use the host IPC or PID namespaces need access to
other processes and IPC mechanisms on the host. Therefore, these containers
should be run with the `spc_t` SELinux type by the container runtime. The
`spc_t` type is an unconfined type that other SELinux domains are allowed to
connect to. In the case where a pod uses one of these host namespaces, it
should be unnecessary to relabel the pod's volumes.

## Analysis

### Libcontainer SELinux library

Docker and rkt both use the libcontainer SELinux library. This library
provides a method, `GetLxcContexts`, that returns unique SELinux contexts for
container processes and the files they use. `GetLxcContexts` reads the base
SELinux context information from a file at
`/etc/selinux/<policy-name>/contexts/lxc_contexts` and then adds a
process-unique MCS label.

Docker and rkt both leverage this call to determine the 'starting' SELinux
contexts for containers.

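The two steps described above (read the base contexts, then attach an MCS label) can be sketched as follows. This is not the libcontainer implementation: `parseLxcContexts` and `withMCS` are hypothetical helpers, the `key = "value"` file layout is an assumption about `lxc_contexts`, and the sample types in `main` are illustrative data only.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseLxcContexts extracts the "process" and "file" entries from the text of
// an lxc_contexts file, assuming lines of the form: key = "value".
func parseLxcContexts(text string) (process, file string) {
	scanner := bufio.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		parts := strings.SplitN(line, "=", 2)
		if len(parts) != 2 {
			continue
		}
		key := strings.TrimSpace(parts[0])
		value := strings.Trim(strings.TrimSpace(parts[1]), `"`)
		switch key {
		case "process":
			process = value
		case "file":
			file = value
		}
	}
	return process, file
}

// withMCS replaces the level field of a user:role:type:level context with
// the given MCS level, leaving user, role, and type untouched.
func withMCS(context, level string) string {
	fields := strings.SplitN(context, ":", 4)
	if len(fields) < 3 {
		return context // not a recognizable context; leave it alone
	}
	return strings.Join(append(fields[:3], level), ":")
}

func main() {
	// Illustrative sample data, not read from a real system.
	sample := `process = "system_u:system_r:svirt_lxc_net_t:s0"
file = "system_u:object_r:svirt_sandbox_file_t:s0"
`
	process, file := parseLxcContexts(sample)
	fmt.Println(withMCS(process, "s0:c1,c2"))
	fmt.Println(withMCS(file, "s0:c1,c2"))
}
```

The values returned by `parseLxcContexts` alone correspond to the shared contexts (without MCS labels); applying `withMCS` gives the per-container form.
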
### Docker

Docker's behavior when no SELinux context is defined for a container is to
give the container a node-unique MCS label.

#### Sharing IPC namespaces

On the Docker runtime, the containers in a Kubernetes pod share the IPC and
PID namespaces of the pod's infra container.

Docker's behavior for containers sharing these namespaces is as follows: if a
container B shares the IPC namespace of another container A, container B is
given the SELinux context of container A. Therefore, for Kubernetes pods
running on Docker, the containers in a pod should, absent other configuration,
have the same SELinux context.

[**Known issue**](https://bugzilla.redhat.com/show_bug.cgi?id=1377869): When
the seccomp profile is set on a Docker container that shares the IPC namespace
of another container, that container will not receive the other container's
SELinux context.

#### Host IPC and PID namespaces

In the case of a pod that shares the host IPC or PID namespace, this flag is
simply ignored and the container receives the `spc_t` SELinux type. The
`spc_t` type is unconfined, and so no relabeling needs to be done for volumes
for these pods. Currently, however, there is code which relabels volumes into
explicitly specified SELinux contexts for these pods. This code is unnecessary
and should be removed.

#### Relabeling bind-mounts

Docker is capable of relabeling bind-mounts into containers using the `:Z`
bind-mount flag. However, in the current implementation of the Docker runtime
in Kubernetes, the `:Z` option is only applied when the pod's SecurityContext
contains an SELinux context. We could easily implement the correct behaviors
by always setting `:Z` on systems where SELinux is enabled.

### rkt

rkt's behavior when no SELinux context is defined for a pod is similar to
Docker's: an SELinux context with a node-unique MCS label is given to the
containers of a pod.

#### Sharing IPC namespaces

Containers (apps, in rkt terminology) in rkt pods share an IPC and PID
namespace by default.

#### Relabeling bind-mounts

Bind-mounts into rkt pods are automatically relabeled into the pod's SELinux
context.

#### Host IPC and PID namespaces

Using the host IPC and PID namespaces is not currently supported by rkt.

## Proposed Changes

### Refactor `pkg/util/selinux`

1. The `selinux` package should provide a method `SELinuxEnabled` that returns
   whether SELinux is enabled, and is built for all platforms (the
   libcontainer SELinux package is only built on Linux)
2. The `SelinuxContextRunner` interface should be renamed to `SELinuxRunner`
   and be changed to have the same method names and signatures as the
   libcontainer methods its implementations wrap
3. The `SELinuxRunner` interface should have a new method added called
   `GetLxcContexts`; this should return a **shared** (i.e., without MCS labels)
   SELinux context usable by a container

```go
package selinux

// Note: the libcontainer SELinux package is only built for Linux, so it is
// necessary to have a NOP wrapper which is built for non-Linux platforms to
// allow code that links to this package not to differentiate its own methods
// for Linux and non-Linux platforms.
//
// SELinuxRunner wraps certain libcontainer SELinux calls. For more
// information, see:
//
// https://github.com/opencontainers/runc/blob/master/libcontainer/selinux/selinux.go
type SELinuxRunner interface {
	// Setfilecon sets the SELinux context for the given path or returns an
	// error.
	Setfilecon(path, context string) error

	// Getfilecon returns the SELinux context for the given path or returns an
	// error.
	Getfilecon(path string) (string, error)

	// GetLxcContexts returns the process and file SELinux contexts to use for
	// containers.
	GetLxcContexts() (string, string)
}
```

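The NOP wrapper mentioned in the interface comment could look roughly like the sketch below. The `nopRunner` name and its return values are assumptions for illustration. In the real package the NOP and Linux implementations would presumably live in separate files selected by build constraints, which a single-file example cannot show, so the interface is repeated here to keep the sketch self-contained.

```go
package main

import "fmt"

// SELinuxRunner repeats the interface proposed above.
type SELinuxRunner interface {
	Setfilecon(path, context string) error
	Getfilecon(path string) (string, error)
	GetLxcContexts() (string, string)
}

// nopRunner is a hypothetical implementation for platforms without SELinux
// support. Every call succeeds and reports empty contexts.
type nopRunner struct{}

// Compile-time check that nopRunner satisfies the interface.
var _ SELinuxRunner = nopRunner{}

// Setfilecon does nothing and never fails.
func (nopRunner) Setfilecon(path, context string) error { return nil }

// Getfilecon reports an empty context for every path.
func (nopRunner) Getfilecon(path string) (string, error) { return "", nil }

// GetLxcContexts reports empty process and file contexts.
func (nopRunner) GetLxcContexts() (string, string) { return "", "" }

func main() {
	var runner SELinuxRunner = nopRunner{}
	process, file := runner.GetLxcContexts()
	fmt.Printf("process=%q file=%q\n", process, file)
}
```

Callers can then treat an empty context as "nothing to label" without checking the platform themselves.
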
### Kubelet Changes

1. The `relabelVolumes` method in `kubelet_volumes.go` is not needed and can
   be removed
2. The `GenerateRunContainerOptions` method in `kubelet_pods.go` should no
   longer call `relabelVolumes`
3. The `makeHostsMount` method in `kubelet_pods.go` should set the
   `SELinuxRelabel` attribute of the mount for the pod's hosts file to `true`

### Changes to `pkg/kubelet/dockertools/`

1. The `makeMountBindings` function should be changed to:
   1. No longer accept the `podHasSELinuxLabel` parameter
   2. Always use the `:Z` bind-mount flag when SELinux is enabled and the mount
      has the `SELinuxRelabel` attribute set to `true`
2. The `runContainer` method should be changed to always use the `:Z`
   bind-mount flag on the termination message mount when SELinux is enabled

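The rule in item 1 can be sketched as a small function that builds Docker bind strings. The `mount` struct and `makeBindings` function below are hypothetical, simplified stand-ins rather than the Kubelet's actual types, and the SELinux state is passed in as a parameter so the example runs on any platform.

```go
package main

import (
	"fmt"
	"strings"
)

// mount is a hypothetical, pared-down description of a container mount.
type mount struct {
	HostPath       string
	ContainerPath  string
	ReadOnly       bool
	SELinuxRelabel bool
}

// makeBindings returns one "host:container[:opts]" string per mount. The Z
// option is added only when SELinux is enabled and the mount asks for
// relabeling; it no longer depends on whether the pod has an SELinux label.
func makeBindings(mounts []mount, selinuxEnabled bool) []string {
	binds := make([]string, 0, len(mounts))
	for _, m := range mounts {
		bind := fmt.Sprintf("%s:%s", m.HostPath, m.ContainerPath)
		var opts []string
		if m.ReadOnly {
			opts = append(opts, "ro")
		}
		if selinuxEnabled && m.SELinuxRelabel {
			opts = append(opts, "Z")
		}
		if len(opts) > 0 {
			bind += ":" + strings.Join(opts, ",")
		}
		binds = append(binds, bind)
	}
	return binds
}

func main() {
	mounts := []mount{
		{HostPath: "/var/lib/kubelet/pods/example/etc-hosts", ContainerPath: "/etc/hosts", SELinuxRelabel: true},
		{HostPath: "/srv/config", ContainerPath: "/config", ReadOnly: true, SELinuxRelabel: true},
		{HostPath: "/srv/shared", ContainerPath: "/shared"},
	}
	for _, b := range makeBindings(mounts, true) {
		fmt.Println(b)
	}
}
```

Leaving `SELinuxRelabel` unset on a mount keeps its existing label intact, which matters for host paths that other processes also use.
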
### Changes to `pkg/kubelet/rkt`

There should not be any required changes for the rkt runtime; we should test to
ensure things work as expected under rkt.

### Changes to volume plugins and infrastructure

1. The `VolumeHost` interface contains a method called `GetRootContext`; this
   is an artifact of the old assumptions about the Kubelet directory's SELinux
   context and can be removed
2. The `empty_dir.go` file should be changed to create an `SELinuxRunner` and
   call its `GetLxcContexts` method to determine the right SELinux context to
   give `tmpfs` mounts

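As a rough sketch of item 2, the file context returned by `GetLxcContexts` could be turned into mount options for the `tmpfs` backing a memory-medium EmptyDir. The `tmpfsMountOptions` helper is hypothetical, and passing the context through the `rootcontext=` mount option is an assumption of this example rather than something the proposal specifies.

```go
package main

import "fmt"

// tmpfsMountOptions returns the mount options for a tmpfs volume.
// fileContext is assumed to be the second value returned by GetLxcContexts;
// an empty value (SELinux disabled, or a NOP runner) yields no options.
func tmpfsMountOptions(fileContext string) []string {
	if fileContext == "" {
		return nil
	}
	// Assumption: the context is applied with the rootcontext= mount option.
	return []string{fmt.Sprintf("rootcontext=\"%s\"", fileContext)}
}

func main() {
	// Illustrative context value, not read from a real system.
	opts := tmpfsMountOptions("system_u:object_r:svirt_sandbox_file_t:s0")
	fmt.Println("mount -t tmpfs -o", opts, "tmpfs <volume-dir>")
}
```

With this shape the volume plugin no longer needs to inspect the Kubelet directory's label, which is what makes `GetRootContext` removable.
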
### Changes to `pkg/controller/...`

The `VolumeHost` abstraction is used in a couple of PV controllers as NOP
implementations. These should be altered to no longer include `GetRootContext`.
