Kubelet restart process seems to get a bit slower recently. From running
the gpu_device_plugin e2e_node test on GCE, I saw it took ~37 seconds
for kubelet to start CM DeviceManager after it restarts, and then took
~12 seconds for the gpu device plugin to re-register. As the result,
this e2e_node test fails because the current 10 sec waiting time is too
small. Restarting a container also seems to get slower that it sometimes
exceeds the current 2 min waiting time in ensurePodContainerRestart().
This change increase both waiting time to 5 min to leave enough space
on slower machines.
There is a race in predicateAdmitHandler Admit() that getNodeAnyWayFunc()
could get Node with non-zero deviceplugin resource allocatable for a
non-existing endpoint. That race can happen when a device plugin fails,
but is more likely when kubelet restarts as with the current registration
model, there is a time gap between kubelet restart and device plugin
re-registration. During this time window, even though devicemanager could
have removed the resource initially during GetCapacity() call, Kubelet
may overwrite the device plugin resource capacity/allocatable with the
old value when node update from the API server comes in later. This
could cause a pod to be started without proper device runtime config set.
To solve this problem, introduce endpointStopGracePeriod. When a device
plugin fails, don't immediately remove the endpoint but set stopTime in
its endpoint. During kubelet restart, create endpoints with stopTime set
for any checkpointed registered resource. The endpoint is considered to be
in stopGracePeriod if its stoptime is set. This allows us to track what
resources should be handled by devicemanager during the time gap.
When an endpoint's stopGracePeriod expires, we remove the endpoint and
its resource. This allows the resource to be exported through other channels
(e.g., by directly updating node status through API server) if there is such
use case. Currently endpointStopGracePeriod is set as 5 minutes.
Given that an endpoint is no longer immediately removed upon disconnection,
mark all its devices unhealthy so that we can signal the resource allocatable
change to the scheduler to avoid scheduling more pods to the node.
When a device plugin endpoint is in stopGracePeriod, pods requesting the
corresponding resource will fail admission handler.
Command line flag API remains the same. This allows ComponentConfig
structures (e.g. KubeletConfiguration) to express the map structure
behind feature gates in a natural way when written as JSON or YAML.
For example:
KubeletConfiguration Before:
```
apiVersion: kubeletconfig/v1alpha1
kind: KubeletConfiguration
featureGates: "DynamicKubeletConfig=true,Accelerators=true"
```
KubeletConfiguration After:
```
apiVersion: kubeletconfig/v1alpha1
kind: KubeletConfiguration
featureGates:
DynamicKubeletConfig: true
Accelerators: true
```