mirror of https://github.com/k3s-io/k3s
Merge pull request #33951 from pmorie/selinux-overhaul
Automatic merge from submit-queue Proposal: SELinux enhancements TLDR: Try to make SELinux support better by not requiring Kubelet directory to be labeled with an SELinux type usable from the container. cc @kubernetes/sig-node @yifan-gupull/6/head
commit
6f78c0d912
|
@ -0,0 +1,247 @@
|
||||||
|
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
|
||||||
|
|
||||||
|
<!-- BEGIN STRIP_FOR_RELEASE -->
|
||||||
|
|
||||||
|
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
|
||||||
|
width="25" height="25">
|
||||||
|
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
|
||||||
|
width="25" height="25">
|
||||||
|
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
|
||||||
|
width="25" height="25">
|
||||||
|
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
|
||||||
|
width="25" height="25">
|
||||||
|
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
|
||||||
|
width="25" height="25">
|
||||||
|
|
||||||
|
<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>
|
||||||
|
|
||||||
|
If you are using a released version of Kubernetes, you should
|
||||||
|
refer to the docs that go with that version.
|
||||||
|
|
||||||
|
Documentation for other releases can be found at
|
||||||
|
[releases.k8s.io](http://releases.k8s.io).
|
||||||
|
</strong>
|
||||||
|
--
|
||||||
|
|
||||||
|
<!-- END STRIP_FOR_RELEASE -->
|
||||||
|
|
||||||
|
<!-- END MUNGE: UNVERSIONED_WARNING -->
|
||||||
|
|
||||||
|
## Abstract
|
||||||
|
|
||||||
|
Presents a proposal for enhancing the security of Kubernetes clusters using
|
||||||
|
SELinux and simplifying the implementation of SELinux support within the
|
||||||
|
Kubelet by removing the need to label the Kubelet directory with an SELinux
|
||||||
|
context usable from a container.
|
||||||
|
|
||||||
|
## Motivation
|
||||||
|
|
||||||
|
The current Kubernetes codebase relies upon the Kubelet directory being
|
||||||
|
labeled with an SELinux context usable from a container. This means that a
|
||||||
|
container escaping namespace isolation will be able to use any file within the
|
||||||
|
Kubelet directory without defeating kernel
|
||||||
|
[MAC (mandatory access control)](https://en.wikipedia.org/wiki/Mandatory_access_control).
|
||||||
|
In order to limit the attack surface, we should enhance the Kubelet to relabel
|
||||||
|
any bind-mounts into containers into a usable SELinux context without depending
|
||||||
|
on the Kubelet directory's SELinux context.
|
||||||
|
|
||||||
|
## Constraints and Assumptions
|
||||||
|
|
||||||
|
1. No API changes allowed
|
||||||
|
2. Behavior must be fully backward compatible
|
||||||
|
3. No new admission controllers - make incremental improvements without huge
|
||||||
|
refactorings
|
||||||
|
|
||||||
|
## Use Cases
|
||||||
|
|
||||||
|
1. As a cluster operator, I want to avoid having to label the Kubelet
|
||||||
|
directory with a label usable from a container, so that I can limit the
|
||||||
|
attack surface available to a container escaping its namespace isolation
|
||||||
|
2. As a user, I want to run a pod without an SELinux context explicitly
|
||||||
|
specified and be isolated using MCS (multi-category security) on systems
|
||||||
|
where SELinux is enabled, so that the pods on each host are isolated from
|
||||||
|
one another
|
||||||
|
3. As a user, I want to run a pod that uses the host IPC or PID namespace and
|
||||||
|
want the system to do the right thing with regard to SELinux, so that no
|
||||||
|
unnecessary relabel actions are performed
|
||||||
|
|
||||||
|
### Labeling the Kubelet directory
|
||||||
|
|
||||||
|
As previously stated, the current codebase relies on the Kubelet directory
|
||||||
|
being labeled with an SELinux context usable from a container. The Kubelet
|
||||||
|
uses the SELinux context of this directory to determine what SELinux context
|
||||||
|
`tmpfs` mounts (provided by the EmptyDir memory-medium option) should receive.
|
||||||
|
The problem with this is that it opens an attack surface to a container that
|
||||||
|
escapes its namespace isolation; such a container would be able to use any
|
||||||
|
file in the Kubelet directory without defeating kernel MAC.
|
||||||
|
|
||||||
|
### SELinux when no context is specified
|
||||||
|
|
||||||
|
When no SELinux context is specified, Kubernetes should just do the right
|
||||||
|
thing, where doing the right thing is defined as isolating pods with a node-
|
||||||
|
unique set of categories. Node-uniqueness means unique among the pods
|
||||||
|
scheduled onto the node. Long-term, we want to have a cluster-wide allocator
|
||||||
|
for MCS labels. Node-unique MCS labels are a good middle ground that is
|
||||||
|
possible without a new, large, feature.
|
||||||
|
|
||||||
|
### SELinux and host IPC and PID namespaces
|
||||||
|
|
||||||
|
Containers in pods that use the host IPC or PID namespaces need access to
|
||||||
|
other processes and IPC mechanisms on the host. Therefore, these containers
|
||||||
|
should be run with the `spc_t` SELinux type by the container runtime. The
|
||||||
|
`spc_t` type is an unconfined type that other SELinux domains are allowed to
|
||||||
|
connect to. In the case where a pod uses one of these host namespaces, it
|
||||||
|
should be unnecessary to relabel the pod's volumes.
|
||||||
|
|
||||||
|
## Analysis
|
||||||
|
|
||||||
|
### Libcontainer SELinux library
|
||||||
|
|
||||||
|
Docker and rkt both use the libcontainer SELinux library. This library
|
||||||
|
provides a method, `GetLxcContexts`, that returns the a unique SELinux
|
||||||
|
contexts for container processes and files used by them. `GetLxcContexts`
|
||||||
|
reads the base SELinux context information from a file at `/etc/selinux/<policy-
|
||||||
|
name>/contexts/lxc_contexts` and then adds a process-unique MCS label.
|
||||||
|
|
||||||
|
Docker and rkt both leverage this call to determine the 'starting' SELinux
|
||||||
|
contexts for containers.
|
||||||
|
|
||||||
|
### Docker
|
||||||
|
|
||||||
|
Docker's behavior when no SELinux context is defined for a container is to
|
||||||
|
give the container a node-unique MCS label.
|
||||||
|
|
||||||
|
#### Sharing IPC namespaces
|
||||||
|
|
||||||
|
On the Docker runtime, the containers in a Kubernetes pod share the IPC and
|
||||||
|
PID namespaces of the pod's infra container.
|
||||||
|
|
||||||
|
Docker's behavior for containers sharing these namespaces is as follows: if a
|
||||||
|
container B shares the IPC namespace of another container A, container B is
|
||||||
|
given the SELinux context of container A. Therefore, for Kubernetes pods
|
||||||
|
running on docker, in a vacuum the containers in a pod should have the same
|
||||||
|
SELinux context.
|
||||||
|
|
||||||
|
[**Known issue**](https://bugzilla.redhat.com/show_bug.cgi?id=1377869): When
|
||||||
|
the seccomp profile is set on a docker container that shares the IPC namespace
|
||||||
|
of another container, that container will not receive the other container's
|
||||||
|
SELinux context.
|
||||||
|
|
||||||
|
#### Host IPC and PID namespaces
|
||||||
|
|
||||||
|
In the case of a pod that shares the host IPC or PID namespace, this flag is
|
||||||
|
simply ignored and the container receives the `spc_t` SELinux type. The
|
||||||
|
`spc_t` type is unconfined, and so no relabeling needs to be done for volumes
|
||||||
|
for these pods. Currently, however, there is code which relabels volumes into
|
||||||
|
explicitly specified SELinux contexts for these pods. This code is unnecessary
|
||||||
|
and should be removed.
|
||||||
|
|
||||||
|
#### Relabeling bind-mounts
|
||||||
|
|
||||||
|
Docker is capable of relabeling bind-mounts into containers using the `:Z`
|
||||||
|
bind-mount flag. However, in the current implementation of the docker runtime
|
||||||
|
in Kubernetes, the `:Z` option is only applied when the pod's SecurityContext
|
||||||
|
contains an SELinux context. We could easily implement the correct behaviors
|
||||||
|
by always setting `:Z` on systems where SELinux is enabled.
|
||||||
|
|
||||||
|
### rkt
|
||||||
|
|
||||||
|
rkt's behavior when no SELinux context is defined for a pod is similar to
|
||||||
|
Docker's -- an SELinux context with a node-unique MCS label is given to the
|
||||||
|
containers of a pod.
|
||||||
|
|
||||||
|
#### Sharing IPC namespaces
|
||||||
|
|
||||||
|
Containers (apps, in rkt terminology) in rkt pods share an IPC and PID
|
||||||
|
namespace by default.
|
||||||
|
|
||||||
|
#### Relabeling bind-mounts
|
||||||
|
|
||||||
|
Bind-mounts into rkt pods are automatically relabeled into the pod's SELinux
|
||||||
|
context.
|
||||||
|
|
||||||
|
#### Host IPC and PID namespaces
|
||||||
|
|
||||||
|
Using the host IPC and PID namespaces is not currently supported by rkt.
|
||||||
|
|
||||||
|
## Proposed Changes
|
||||||
|
|
||||||
|
### Refactor `pkg/util/selinux`
|
||||||
|
|
||||||
|
1. The `selinux` package should provide a method `SELinuxEnabled` that returns
|
||||||
|
whether SELinux is enabled, and is built for all platforms (the
|
||||||
|
libcontainer SELinux is only built on linux)
|
||||||
|
2. The `SelinuxContextRunner` interface should be renamed to `SELinuxRunner`
|
||||||
|
and be changed to have the same method names and signatures as the
|
||||||
|
libcontainer methods its implementations wrap
|
||||||
|
3. The `SELinuxRunner` interface should have a new method added called
|
||||||
|
`GetLxcContexts`; this should return a **shared** (ie, without MCS labels)
|
||||||
|
SELinux context usable by a container
|
||||||
|
|
||||||
|
```go
|
||||||
|
package selinux
|
||||||
|
|
||||||
|
// Note: the libcontainer SELinux package is only built for Linux, so it is
|
||||||
|
// necessary to have a NOP wrapper which is built for non-Linux platforms to
|
||||||
|
// allow code that links to this package not to differentiate its own methods
|
||||||
|
// for Linux and non-Linux platforms.
|
||||||
|
//
|
||||||
|
// SELinuxRunner wraps certain libcontainer SELinux calls. For more
|
||||||
|
// information, see:
|
||||||
|
//
|
||||||
|
// https://github.com/opencontainers/runc/blob/master/libcontainer/selinux/selinux.go
|
||||||
|
type SELinuxRunner interface {
|
||||||
|
// Setfilecon sets the SELinux context for the given path or returns an
|
||||||
|
// error.
|
||||||
|
Setfilecon(path, context string) error
|
||||||
|
|
||||||
|
// Getfilecon returns the SELinux context for the given path or returns an
|
||||||
|
// error.
|
||||||
|
Getfilecon(path string) (string, error)
|
||||||
|
|
||||||
|
// GetLxcContexts returns the process and file SELinux contexts to use for
|
||||||
|
// containers.
|
||||||
|
GetLxcContexts() (string, string)
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Kubelet Changes
|
||||||
|
|
||||||
|
1. The `relabelVolumes` method in `kubelet_volumes.go` is not needed and can
|
||||||
|
be removed
|
||||||
|
2. The `GenerateRunContainerOptions` method in `kubelet_pods.go` should no
|
||||||
|
longer call `relabelVolumes`
|
||||||
|
3. The `makeHostsMount` method in `kubelet_pods.go` should set the
|
||||||
|
`SELinuxRelabel` attribute of the mount for the pod's hosts file to `true`
|
||||||
|
|
||||||
|
### Changes to `pkg/kubelet/dockertools/`
|
||||||
|
|
||||||
|
1. The `makeMountBindings` should be changed to:
|
||||||
|
1. No longer accept the `podHasSELinuxLabel` parameter
|
||||||
|
2. Always use the `:Z` bind-mount flag when SELinux is enabled and the mount
|
||||||
|
has the `SELinuxRelabel` attribute set to `true`
|
||||||
|
2. The `runContainer` method should be changed to always use the `:Z`
|
||||||
|
bind-mount flag on the termination message mount when SELinux is enabled
|
||||||
|
|
||||||
|
### Changes to `pkg/kubelet/rkt`
|
||||||
|
|
||||||
|
The should not be any required changes for the rkt runtime; we should test to
|
||||||
|
ensure things work as expected under rkt.
|
||||||
|
|
||||||
|
### Changes to volume plugins and infrastructure
|
||||||
|
|
||||||
|
1. The `VolumeHost` interface contains a method called `GetRootContext`; this
|
||||||
|
is an artifact of the old assumptions about the Kubelet directory's SELinux
|
||||||
|
context and can be removed
|
||||||
|
2. The `empty_dir.go` file should be changed to create an `SELinuxRunner` and
|
||||||
|
call its `GetLxcContexts` method to determine the right SELinux context to
|
||||||
|
give `tmpfs` mounts
|
||||||
|
|
||||||
|
### Changes to `pkg/controller/...`
|
||||||
|
|
||||||
|
The `VolumeHost` abstraction is used in a couple of PV controllers as NOP
|
||||||
|
implementations. These should be altered to no longer include `GetRootContext`.
|
||||||
|
|
||||||
|
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
|
||||||
|
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/selinux-enhancements.md?pixel)]()
|
||||||
|
<!-- END MUNGE: GENERATED_ANALYTICS -->
|
Loading…
Reference in New Issue