<!-- BEGIN MUNGE: GENERATED_TOC -->

- [GPU support](#gpu-support)
  - [Objective](#objective)
  - [Background](#background)
  - [Detailed discussion](#detailed-discussion)
    - [Inventory](#inventory)
    - [Scheduling](#scheduling)
    - [The runtime](#the-runtime)
      - [NVIDIA support](#nvidia-support)
    - [Event flow](#event-flow)
    - [Too complex for now: nvidia-docker](#too-complex-for-now-nvidia-docker)
  - [Implementation plan](#implementation-plan)
    - [V0](#v0)
      - [Scheduling](#scheduling-1)
      - [Runtime](#runtime)
      - [Other](#other)
  - [Future work](#future-work)
    - [V1](#v1)
    - [V2](#v2)
    - [V3](#v3)
    - [Undetermined](#undetermined)
  - [Security considerations](#security-considerations)

<!-- END MUNGE: GENERATED_TOC -->

# GPU support

Author: @therc

Date: Apr 2016

Status: Design in progress, early implementation of requirements

## Objective

Users should be able to request GPU resources for their workloads, as easily as
for CPU or memory. Kubernetes should keep an inventory of machines with GPU
hardware, schedule containers on appropriate nodes and set up the container
environment with all that's necessary to access the GPU. All of this should
eventually be supported for clusters on either bare metal or cloud providers.

## Background

An increasing number of workloads, such as machine learning and seismic survey
processing, benefit from offloading computations to graphics hardware. While not
as tuned as traditional, dedicated high-performance computing systems such as
MPI clusters, a Kubernetes cluster can still be a great environment for
organizations that also need to run a variety of "classic" workloads, such as
databases, web serving, etc.

GPU support is hard to provide comprehensively and will thus take time to tame
completely, because:

- different vendors expose the hardware to users in different ways
- some vendors require fairly tight coupling between the kernel driver
  controlling the GPU and the libraries/applications that access the hardware
- it adds more resource types (whole GPUs, GPU cores, GPU memory)
- it can introduce new security pitfalls
- for systems with multiple GPUs, affinity matters, similarly to NUMA
  considerations for CPUs
- running GPU code in containers is still a relatively novel idea

## Detailed discussion

Currently, this document is mostly focused on the basic use case: run GPU code
on AWS `g2.2xlarge` EC2 machine instances using Docker. It constitutes a narrow
enough scenario that it does not require large amounts of generic code yet. GCE
doesn't support GPUs at all; bare metal systems throw a lot of extra variables
into the mix.

Later sections will outline future work to support a broader set of hardware,
environments and container runtimes.

### Inventory

Before any scheduling can occur, we need to know what's available out there. In
v0, the kubelet will report a hardcoded capacity, gated by a flag,
`--experimental-nvidia-gpu`. This will result in the user-defined resource
`alpha.kubernetes.io/nvidia-gpu` being reported for `NodeCapacity` and
`NodeAllocatable`, as well as exposed as a node label.

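A minimal sketch of the kubelet side, assuming a hypothetical `setGPUCapacity`
helper wired into node status setup; only the flag and resource names come from
this document:

```go
// Sketch only: surface the GPU resource when --experimental-nvidia-gpu is set.
// The helper name is hypothetical; the resource name comes from this proposal.
// Assumes the Capacity/Allocatable maps were already initialized by the caller.
package kubelet

import (
	"k8s.io/kubernetes/pkg/api"
	"k8s.io/kubernetes/pkg/api/resource"
)

const resourceNvidiaGPU = api.ResourceName("alpha.kubernetes.io/nvidia-gpu")

func setGPUCapacity(node *api.Node, experimentalNvidiaGPU bool) {
	if !experimentalNvidiaGPU {
		return
	}
	// v0: hardcode a single whole device; real detection comes later.
	one := *resource.NewQuantity(1, resource.DecimalSI)
	node.Status.Capacity[resourceNvidiaGPU] = one
	node.Status.Allocatable[resourceNvidiaGPU] = one
	// Also advertise the hardware through a node label, so selectors can use it.
	if node.ObjectMeta.Labels == nil {
		node.ObjectMeta.Labels = map[string]string{}
	}
	node.ObjectMeta.Labels["alpha.kubernetes.io/nvidia-gpu"] = "true"
}
```
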
### Scheduling

GPUs will be visible as first-class resources. In v0, we'll only assign whole
devices; sharing among multiple pods is left to future implementations. It's
probable that GPUs will exacerbate the need for [a rescheduler](rescheduler.md)
or pod priorities, especially if the nodes in a cluster are not homogeneous.
Consider these two cases:

> Only half of the machines have a GPU and they're all busy with other
> workloads. The other half of the cluster is doing very little work. A GPU
> workload comes in, but it can't schedule, because the devices are sitting idle
> on nodes that are running something else and the nodes with little load lack
> the hardware.

> Some or all of the machines have two graphics cards each. A number of jobs get
> scheduled, requesting one device per pod. The scheduler puts them all on
> different machines, spreading the load, perhaps by design. Then a new job comes
> in, requiring two devices per pod, but it can't schedule anywhere, because all
> we can find, at most, is one unused device per node.

### The runtime

Once we know where to run the container, it's time to set up its environment. At
a minimum, we'll need to map the host device(s) into the container. Because each
manufacturer exposes different device nodes (`/dev/ati/card0`, `/dev/nvidia0`,
but also the required `/dev/nvidiactl` and `/dev/nvidia-uvm`), some of the logic
needs to be hardware-specific, mapping from a logical device to a list of device
nodes necessary for software to talk to it.

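A minimal sketch of that hardware-specific mapping, assuming a simple per-vendor
switch; the function and vendor names are illustrative, while the device paths
are the ones mentioned above:

```go
// Illustrative only: map a logical GPU index to the /dev entries a container
// needs for that vendor. Not existing Kubernetes code.
package gpu

import "fmt"

// devicesFor returns the host device nodes required to use card `index`.
func devicesFor(vendor string, index int) ([]string, error) {
	switch vendor {
	case "nvidia":
		// Every container needs the control nodes plus its own card node.
		return []string{
			"/dev/nvidiactl",
			"/dev/nvidia-uvm",
			fmt.Sprintf("/dev/nvidia%d", index),
		}, nil
	case "amd":
		return []string{fmt.Sprintf("/dev/ati/card%d", index)}, nil
	default:
		return nil, fmt.Errorf("unsupported GPU vendor %q", vendor)
	}
}
```
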
Support binaries and libraries are often versioned along with the kernel module,
so there should be further hooks to project those under `/bin` and some kind of
`/lib` before the application is started. This can be done for Docker with the
use of a versioned [Docker
volume](https://docs.docker.com/engine/userguide/containers/dockervolumes/) or
with upcoming Kubernetes-specific hooks such as init containers and volume
containers. In v0, images are expected to bundle everything they need.

#### NVIDIA support

The first implementation and testing ground will be for NVIDIA devices, by far
the most common setup.

In v0, the `--experimental-nvidia-gpu` flag will also result in the host devices
(limited to those required to drive the first card, `nvidia0`) being mapped into
the container by the dockertools library.

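In terms of Docker's engine-api, that mapping amounts to filling in the
`Resources.Devices` list of the container's host configuration. A rough sketch;
the helper is made up, while the device paths and the field it populates come
from this document:

```go
// Sketch: expose the first NVIDIA card to a Docker container via engine-api.
// nvidiaDeviceMappings is a hypothetical helper, not existing kubelet code.
package dockertools

import "github.com/docker/engine-api/types/container"

func nvidiaDeviceMappings() []container.DeviceMapping {
	devices := []string{"/dev/nvidiactl", "/dev/nvidia-uvm", "/dev/nvidia0"}
	mappings := make([]container.DeviceMapping, 0, len(devices))
	for _, d := range devices {
		mappings = append(mappings, container.DeviceMapping{
			PathOnHost:        d,
			PathInContainer:   d,
			CgroupPermissions: "rwm", // read, write and mknod
		})
	}
	return mappings
}

// The result would then be assigned to hostConfig.Resources.Devices before the
// container is created.
```
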
### Event flow

This is what happens before and after a user schedules a GPU pod.

1. Administrator installs a number of Kubernetes nodes with GPUs. The correct
   kernel modules and device nodes under `/dev/` are present.

1. Administrator makes sure the latest CUDA/driver versions are installed.

1. Administrator enables `--experimental-nvidia-gpu` on kubelets.

1. Kubelets update node status with information about the GPU device, in
   addition to cAdvisor's usual data about CPU/memory/disk.

1. User creates a Docker image compiling their application for CUDA, bundling
   the necessary libraries. In v0, we ignore any versioning requirements declared
   in the image through labels based on [NVIDIA's
   conventions](https://github.com/NVIDIA/nvidia-docker/blob/64510511e3fd0d00168eb076623854b0fcf1507d/tools/src/nvidia-docker/utils.go#L13).

1. User creates a pod using the image, requiring
   `alpha.kubernetes.io/nvidia-gpu: 1` (see the sketch after this list).

1. Scheduler picks a node for the pod.

1. The kubelet notices the GPU requirement and maps the three devices. In
   Docker's engine-api, this means it'll add them to the `Resources.Devices`
   list.

1. Docker runs the container to completion.

1. The scheduler notices that the device is available again.

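To make step 6 concrete, here is a minimal sketch of such a pod expressed with
the internal API types; the object and image names are placeholders, and only
the resource name and quantity come from this document:

```go
// Sketch: a pod that requests one whole NVIDIA GPU. Names and image are
// placeholders; only the resource name comes from this proposal.
package example

import (
	"k8s.io/kubernetes/pkg/api"
	"k8s.io/kubernetes/pkg/api/resource"
)

func gpuPod() *api.Pod {
	return &api.Pod{
		ObjectMeta: api.ObjectMeta{Name: "cuda-job"},
		Spec: api.PodSpec{
			Containers: []api.Container{{
				Name:  "main",
				Image: "example.com/cuda-app:latest",
				Resources: api.ResourceRequirements{
					Limits: api.ResourceList{
						api.ResourceName("alpha.kubernetes.io/nvidia-gpu"): resource.MustParse("1"),
					},
				},
			}},
		},
	}
}
```
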
### Too complex for now: nvidia-docker

For v0, we discussed the [nvidia-docker plugin](https://github.com/NVIDIA/nvidia-docker)
at length, but decided to leave it aside initially. The plugin is an officially
supported solution that would avoid a lot of new low-level code, as it takes
care of functionality such as:

- creating a Docker volume with binaries such as `nvidia-smi` and shared
  libraries
- providing HTTP endpoints that monitoring tools can use to collect GPU metrics
- abstracting details such as `/dev` entry names for each device, as well as
  control ones like `nvidiactl`

The `nvidia-docker` wrapper also verifies that the CUDA version required by a
given image is supported by the host drivers, through inspection of well-known
image labels, if present. We should try to provide equivalent checks, either
for CUDA or OpenCL.

This is current sample output from `nvidia-docker-plugin`, wrapped for
readability:

```
$ curl -s localhost:3476/docker/cli
--device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0
--volume-driver=nvidia-docker
--volume=nvidia_driver_352.68:/usr/local/nvidia:ro
```

It runs as a daemon listening for HTTP requests on port 3476. The endpoint above
returns flags that need to be added to the Docker command line in order to
expose GPUs to the containers. There are optional URL arguments to request
specific devices, if more than one is present on the system, as well as specific
versions of the support software. An obvious improvement would be an additional
endpoint for JSON output.

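If Kubernetes consumed the plugin as an external binary, the integration could
start out as small as the sketch below; the function is hypothetical, while the
endpoint and port are the plugin's defaults shown above:

```go
// Sketch: fetch the Docker CLI flags that nvidia-docker-plugin suggests for
// exposing GPUs. Illustrative only, not existing kubelet code.
package nvidia

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"strings"
)

// DockerCLIFlags queries the plugin's /docker/cli endpoint and returns the
// suggested --device/--volume flags as a slice.
func DockerCLIFlags(host string) ([]string, error) {
	resp, err := http.Get(fmt.Sprintf("http://%s/docker/cli", host))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("nvidia-docker-plugin returned %s", resp.Status)
	}
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return strings.Fields(string(body)), nil
}

// Example: flags, err := DockerCLIFlags("localhost:3476")
```
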
The unresolved question is whether `nvidia-docker-plugin` would run standalone
as it does today (called over HTTP, perhaps with endpoints for a new Kubernetes
resource API) or whether the relevant code from its `nvidia` package should be
linked directly into kubelet. A partial list of tradeoffs:

| | External binary | Linked in |
|---------------------|------------------------------------------------------------------------------------------------------|--------------------------------------------------------------|
| Use of cgo | Confined to the binary | Linked into kubelet, but with lazy binding |
| Expandability | Limited if we run the plugin, increased if the library is used to build a Kubernetes-tailored daemon | Can reuse the `nvidia` library as we prefer |
| Bloat | None | Larger kubelet, even for systems without GPUs |
| Reliability | Need to handle the binary disappearing at any time | Fewer headaches |
| (Un)Marshalling | Need to talk over JSON | None |
| Administration cost | One more daemon to install, configure and monitor | No extra work required, other than perhaps configuring flags |
| Releases | Potentially on its own schedule | Tied to Kubernetes' |

## Implementation plan

### V0

The first two tracks can progress in parallel.

#### Scheduling

1. Define the new resource `alpha.kubernetes.io/nvidia-gpu` in `pkg/api/types.go`
   and co.
1. Plug the resource into the feasibility checks used by the kubelet, scheduler
   and schedulercache (see the sketch after this list). Maybe gated behind a
   flag?
1. Plug the resource into resource_helpers.go
1. Plug the resource into the limitranger

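As a rough illustration of the feasibility check mentioned above, a sketch of
the core comparison; the function is illustrative and not the actual predicate
code:

```go
// Sketch: does a node have enough unclaimed GPUs for a pod's request?
// Illustrative only; real checks live in the scheduler's predicates and in
// the kubelet's admission path.
package predicates

import "k8s.io/kubernetes/pkg/api"

const resourceNvidiaGPU = api.ResourceName("alpha.kubernetes.io/nvidia-gpu")

// fitsGPU returns true if `requested` GPUs plus those already in use fit
// within the node's allocatable GPU count.
func fitsGPU(node *api.Node, inUse, requested int64) bool {
	allocatable, ok := node.Status.Allocatable[resourceNvidiaGPU]
	if !ok {
		return requested == 0
	}
	return inUse+requested <= allocatable.Value()
}
```
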
#### Runtime

1. Add a kubelet config parameter to enable the resource
1. Make the kubelet's `setNodeStatusMachineInfo` report the resource
1. Add a `Devices` list to `container.RunContainerOptions` (see the sketch after
   this list)
1. Use it from DockerManager's `runContainer`
1. Do the same for rkt (stretch goal)
1. When a pod requests a GPU, add the devices to the container options

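A minimal sketch of the runtime-agnostic piece referenced above; the type and
field names are placeholders rather than the final API:

```go
// Sketch: a runtime-agnostic device description the kubelet could hand to any
// container runtime. Type and field names are placeholders, not the final API.
package kubecontainer

// DeviceInfo describes one host device node to expose to a container.
type DeviceInfo struct {
	PathOnHost      string
	PathInContainer string
	Permissions     string // e.g. "rwm" for read, write and mknod
}

// RunContainerOptions would grow a Devices field alongside its existing
// environment, mount and port-mapping fields (elided here).
type RunContainerOptions struct {
	Devices []DeviceInfo
	// ... existing fields elided ...
}
```
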
#### Other

1. Add new resource to `kubectl describe` output. Optional for non-GPU users?
1. Administrator documentation, with sample scripts
1. User documentation

## Future work

Above all, we need to collect feedback from real users and use that to set
priorities for any of the items below.

### V1

- Perform real detection of the installed hardware
- Figure out a standard way to avoid bundling shared libraries in images
- Support fractional resources so multiple pods can share the same GPU
- Support bare metal setups
- Report resource usage

### V2

- Support multiple GPUs with resource hierarchies and affinities
- Support versioning of resources (e.g. "CUDA v7.5+")
- Build resource plugins into the kubelet?
- Support other device vendors
- Support Azure?
- Support rkt?

### V3

- Support OpenCL (so images can be device-agnostic)

### Undetermined

It makes sense to turn the output of this project (external resource plugins,
etc.) into a more generic abstraction at some point.

## Security considerations

There should be knobs for the cluster administrator to only allow certain users
or roles to schedule GPU workloads. Overcommitting or sharing the same device
across different pods is not considered safe. It should be possible to segregate
such GPU-sharing pods by user, namespace or a combination thereof.