github/k3s - k3s - https://git.xinac.net

Commit Graph

Author	SHA1	Message	Date
danielqsj	79a3eb816c	rename latency to duration in metrics	2019-02-18 17:40:04 +08:00
wackxu	f3823cc2cf	fix failing e2e tests that set PodPriority	2018-07-23 09:31:26 +08:00
Jiaying Zhang	265f3a48d3	Increase certain waiting time windows in gpu_device_plugin e2e_node test. Kubelet restart process seems to have gotten a bit slower recently. From running the gpu_device_plugin e2e_node test on GCE, I saw it took ~37 seconds for kubelet to start CM DeviceManager after it restarts, and then took ~12 seconds for the gpu device plugin to re-register. As a result, this e2e_node test fails because the current 10 sec waiting time is too small. Restarting a container also seems to have gotten slower, so that it sometimes exceeds the current 2 min waiting time in ensurePodContainerRestart(). This change increases both waiting times to 5 min to leave enough room on slower machines.	2018-06-27 11:00:36 -07:00
Yu-Ju Hong	7cbd897e3e	test/e2e_node: Add Node-exclusive feature tags to existing tests	2018-05-21 17:52:36 -07:00
vikaschoudhary16	b953f852f5	[Device-Plugin]: Extend e2e test to cover node allocatables	2018-05-03 14:19:29 -04:00
Rohit Agarwal	87dda3375b	Delete in-tree support for NVIDIA GPUs. This removes the alpha Accelerators feature gate which was deprecated in 1.10. The alternative feature DevicePlugins went beta in 1.10.	2018-04-02 20:17:01 -07:00
Jiaying Zhang	5514a1f4dd	Fixes the races around devicemanager Allocate() and endpoint deletion. There is a race in predicateAdmitHandler Admit() that getNodeAnyWayFunc() could get Node with non-zero deviceplugin resource allocatable for a non-existing endpoint. That race can happen when a device plugin fails, but is more likely when kubelet restarts because, with the current registration model, there is a time gap between kubelet restart and device plugin re-registration. During this time window, even though devicemanager could have removed the resource initially during the GetCapacity() call, Kubelet may overwrite the device plugin resource capacity/allocatable with the old value when a node update from the API server comes in later. This could cause a pod to be started without proper device runtime config set. To solve this problem, introduce endpointStopGracePeriod. When a device plugin fails, don't immediately remove the endpoint but set stopTime in its endpoint. During kubelet restart, create endpoints with stopTime set for any checkpointed registered resource. The endpoint is considered to be in stopGracePeriod if its stopTime is set. This allows us to track what resources should be handled by devicemanager during the time gap. When an endpoint's stopGracePeriod expires, we remove the endpoint and its resource. This allows the resource to be exported through other channels (e.g., by directly updating node status through API server) if there is such a use case. Currently endpointStopGracePeriod is set to 5 minutes. Given that an endpoint is no longer immediately removed upon disconnection, mark all its devices unhealthy so that we can signal the resource allocatable change to the scheduler to avoid scheduling more pods to the node. When a device plugin endpoint is in stopGracePeriod, pods requesting the corresponding resource will fail the admission handler.	2018-03-09 17:00:57 -08:00
Jiaying Zhang	fee083feac	Update device plugin e2e_node test to not change Kubelet config, as the DevicePlugins feature is enabled by default now.	2018-02-26 22:45:44 -08:00
Penghao Cen	386c077dc6	Move common functions together	2018-01-10 09:47:05 +08:00
Jiaying Zhang	8d9a2e09c4	Make sure node is ready before calling getLocalNode to fix test failure.	2017-11-28 15:18:17 -08:00
Jiaying Zhang	048bafdd0b	Adds device plugin registration count metric and allocation latency metric.	2017-11-21 13:44:10 -08:00
Jiaying Zhang	990113ce60	Extends gpu_device_plugin e2e_node test to verify that scheduled pods can continue to run even after device plugin deletion and kubelet restarts.	2017-11-20 23:40:27 -08:00
Michael Taufen	131b419596	Make feature gates loadable from a map[string]bool. Command line flag API remains the same. This allows ComponentConfig structures (e.g. KubeletConfiguration) to express the map structure behind feature gates in a natural way when written as JSON or YAML. For example: KubeletConfiguration Before: ``` apiVersion: kubeletconfig/v1alpha1 kind: KubeletConfiguration featureGates: "DynamicKubeletConfig=true,Accelerators=true" ``` KubeletConfiguration After: ``` apiVersion: kubeletconfig/v1alpha1 kind: KubeletConfiguration featureGates: DynamicKubeletConfig: true Accelerators: true ```	2017-10-10 09:37:51 -07:00
Jiaying Zhang	b73f4acdee	Fixes test/e2e_node/gpu_device_plugin.go test failure.	2017-10-02 17:31:10 -07:00
Jiaying Zhang	ba40bee5c1	Modified test/e2e_node/gpu-device-plugin.go to make sure it passes.	2017-09-22 20:21:26 +02:00
Renaud Gaubert	6993612cec	Added device plugin e2e kubelet failure test Signed-off-by: Renaud Gaubert <renaud.gaubert@gmail.com>	2017-09-22 01:24:01 +02:00

16 Commits (795ae352018d36b2bc2dfdd85b156919ec01013a)
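
Commit 131b419596 above keeps the command line flag format ("Name=true,Other=true") while letting config files express feature gates as a map[string]bool. The relationship between the two forms can be sketched as below. This is a minimal illustration, not the upstream implementation: parseFeatureGates is a hypothetical helper, and only the gate names and the flag string are taken from the commit message.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFeatureGates is a hypothetical helper that converts the flag-style
// string "A=true,B=false" into the map form used in config files.
func parseFeatureGates(s string) (map[string]bool, error) {
	gates := make(map[string]bool)
	for _, pair := range strings.Split(s, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue // tolerate empty input and trailing commas
		}
		parts := strings.SplitN(pair, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("missing '=' in %q", pair)
		}
		name := strings.TrimSpace(parts[0])
		enabled, err := strconv.ParseBool(strings.TrimSpace(parts[1]))
		if err != nil {
			return nil, fmt.Errorf("invalid value for %q: %v", name, err)
		}
		gates[name] = enabled
	}
	return gates, nil
}

func main() {
	gates, err := parseFeatureGates("DynamicKubeletConfig=true,Accelerators=true")
	if err != nil {
		panic(err)
	}
	// Both gates from the "Before" example end up as map entries.
	fmt.Println(gates["DynamicKubeletConfig"], gates["Accelerators"])
}
```

With the map form, a YAML or JSON decoder can fill the structure directly, so string parsing of this kind is only needed on the command line path.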