Commit Graph

11 Commits (07d1a8cb0c89d70572560611ad37e8986e9b6ead)

Author SHA1 Message Date
Rohit Agarwal 87dda3375b Delete in-tree support for NVIDIA GPUs.
This removes the alpha Accelerators feature gate which was deprecated in 1.10.
The alternative feature DevicePlugins went beta in 1.10.
2018-04-02 20:17:01 -07:00
Jiaying Zhang 5514a1f4dd Fixes the races around devicemanager Allocate() and endpoint deletion.
There is a race in predicateAdmitHandler Admit() that getNodeAnyWayFunc()
could get Node with non-zero deviceplugin resource allocatable for a
non-existing endpoint. That race can happen when a device plugin fails,
but is more likely when kubelet restarts as with the current registration
model, there is a time gap between kubelet restart and device plugin
re-registration. During this time window, even though devicemanager could
have removed the resource initially during GetCapacity() call, Kubelet
may overwrite the device plugin resource capacity/allocatable with the
old value when node update from the API server comes in later. This
could cause a pod to be started without proper device runtime config set.

To solve this problem, introduce endpointStopGracePeriod. When a device
plugin fails, don't immediately remove the endpoint but set stopTime in
its endpoint. During kubelet restart, create endpoints with stopTime set
for any checkpointed registered resource. The endpoint is considered to be
in stopGracePeriod if its stoptime is set. This allows us to track what
resources should be handled by devicemanager during the time gap.
When an endpoint's stopGracePeriod expires, we remove the endpoint and
its resource. This allows the resource to be exported through other channels
(e.g., by directly updating node status through API server) if there is such
use case. Currently endpointStopGracePeriod is set as 5 minutes.

Given that an endpoint is no longer immediately removed upon disconnection,
mark all its devices unhealthy so that we can signal the resource allocatable
change to the scheduler to avoid scheduling more pods to the node.
When a device plugin endpoint is in stopGracePeriod, pods requesting the
corresponding resource will fail admission handler.
2018-03-09 17:00:57 -08:00
Jiaying Zhang fee083feac Update device plugin e2e_node test to not changing Kubelet config
as DevicePlugins feature is enabled by default now.
2018-02-26 22:45:44 -08:00
Penghao Cen 386c077dc6 Move common functions together 2018-01-10 09:47:05 +08:00
Jiaying Zhang 8d9a2e09c4 Make sure node is ready before calling getLocalNode to fix test failure. 2017-11-28 15:18:17 -08:00
Jiaying Zhang 048bafdd0b Adds device plugin registration count metric and allocation latency metric. 2017-11-21 13:44:10 -08:00
Jiaying Zhang 990113ce60 Extends gpu_device_plugin e2e_node test to verify that scheduled pods
can continue to run even after device plugin deletion and kubelet
restarts.
2017-11-20 23:40:27 -08:00
Michael Taufen 131b419596 Make feature gates loadable from a map[string]bool
Command line flag API remains the same. This allows ComponentConfig
structures (e.g. KubeletConfiguration) to express the map structure
behind feature gates in a natural way when written as JSON or YAML.

For example:

KubeletConfiguration Before:
```
apiVersion: kubeletconfig/v1alpha1
kind: KubeletConfiguration
featureGates: "DynamicKubeletConfig=true,Accelerators=true"
```

KubeletConfiguration After:
```
apiVersion: kubeletconfig/v1alpha1
kind: KubeletConfiguration
featureGates:
  DynamicKubeletConfig: true
  Accelerators: true
```
2017-10-10 09:37:51 -07:00
Jiaying Zhang b73f4acdee Fixes test/e2e_node/gpu_device_plugin.go test failure. 2017-10-02 17:31:10 -07:00
Jiaying Zhang ba40bee5c1 Modified test/e2e_node/gpu-device-plugin.go to make sure it passes. 2017-09-22 20:21:26 +02:00
Renaud Gaubert 6993612cec Added device plugin e2e kubelet failure test
Signed-off-by: Renaud Gaubert <renaud.gaubert@gmail.com>
2017-09-22 01:24:01 +02:00