github/k3s - k3s - https://git.xinac.net

Commit Graph

Author	SHA1	Message	Date
Darren Shepherd	f68bedbafa	Remove KubeletPodResources	2019-08-19 08:33:10 -07:00
danielqsj	79a3eb816c	rename latency to duration in metrics	2019-02-18 17:40:04 +08:00
danielqsj	9fd99a48f5	Change kubelet metrics to conform guideline	2019-02-18 14:01:58 +08:00
Jiaying Zhang	00b88c14b0	Checks whether we have cached runtime state before starting a container that requests any device plugin resource. If not, re-issue Allocate grpc calls. This allows us to handle the edge case that a pod got assigned to a node even before it populates its extended resource capacity.	2019-02-07 11:12:36 -08:00
Kubernetes Prow Robot	03b434c9d4	Merge pull request #58122 from tianshapjq/nit-int-is-enough Len() is already int	2019-02-03 12:02:24 -08:00
Kubernetes Prow Robot	d88994cf9f	Merge pull request #71306 from ping035627/k8s-181121 fix some typos	2019-01-09 09:06:31 -08:00
yuexiao-wang	f3353c358d	[scheduler cleanup phase 2]: Rename to Signed-off-by: yuexiao-wang <wang.yuexiao@zte.com.cn>	2018-12-11 11:21:12 +08:00
saad-ali	a7c5582bba	Permit use of deprecated dir in device plugin.	2018-11-21 18:37:31 -08:00
saad-ali	8f666d9e41	Modify kubelet watcher to support old versions. Modify kubelet plugin watcher to support older CSI drivers that use the old plugins directory for socket registration. Also modify CSI plugin registration to support multiple versions of CSI registering with the same name.	2018-11-21 18:37:31 -08:00
PingWang	9d541911bb	fix some typos Signed-off-by: PingWang <wang.ping5@zte.com.cn> fix typo Signed-off-by: PingWang <wang.ping5@zte.com.cn>	2018-11-22 08:27:14 +08:00
David Ashpole	630cb53f82	add kubelet grpc server for pod-resources service	2018-11-15 09:43:20 -08:00
Davanum Srinivas	954996e231	Move from glog to klog - Move from the old github.com/golang/glog to k8s.io/klog - klog as explicit InitFlags() so we add them as necessary - we update the other repositories that we vendor that made a similar change from glog to klog * github.com/kubernetes/repo-infra * k8s.io/gengo/ * k8s.io/kube-openapi/ * github.com/google/cadvisor - Entirely remove all references to glog - Fix some tests by explicit InitFlags in their init() methods Change-Id: I92db545ff36fcec83afe98f550c9e630098b3135	2018-11-10 07:50:31 -05:00
Renaud Gaubert	8dd1d27c03	Updated the device manager pluginwatcher handler	2018-09-06 15:34:46 +02:00
Kubernetes Submit Queue	d017bebf6b	Merge pull request #67145 from jiayingz/reboot-fix. Automatic merge from submit-queue. Fail container start if its requested device plugin resource is unknown. With the change, Kubelet device manager now checks whether it has cached option state for the requested device plugin resource to make sure the resource is in ready state when we start the container. Fixes https://github.com/kubernetes/kubernetes/issues/67107. Release note: Fail container start if its requested device plugin resource hasn't registered after Kubelet restart.	2018-08-21 01:48:54 -07:00
tianshapjq	81081dc9e7	nits in manager.go	2018-08-15 08:16:04 +08:00
Jiaying Zhang	7b1ae66432	Fail container start if its requested device plugin resource doesn't have cached option state to make sure the device plugin resource is in ready state when we start the container.	2018-08-08 13:11:36 -07:00
hui luo	7101c17498	While reviewing devicemanager code, found the caching layer on endpoint is redundant. Here are the 3 related objects in picture: devicemanager <-> endpoint <-> plugin. Plugin is the source of truth for devices and device health status. devicemanager maintains healthyDevices, unhealthyDevices, and allocatedDevices based on updates from plugin, so there is no point in the endpoint caching devices; this patch removes that caching layer on endpoint. Also removing Manager.Devices() since I didn't find any caller of it other than tests; I am adding a notification channel to facilitate testing. If we need to get all devices from the manager in future, it just needs to return healthyDevices + unhealthyDevices; we don't have to call the endpoint after all. This patch makes the code more readable and simplifies the data model.	2018-07-29 21:07:14 -07:00
Kubernetes Submit Queue	32e38b6659	Merge pull request #58755 from vikaschoudhary16/probing-mode. Automatic merge from submit-queue (batch tested with PRs 58755, 66414). Use probe based plugin watcher mechanism in Device Manager. What this PR does / why we need it: Uses this probe based utility in the device plugin manager. Fixes #56944. Notes for reviewers: Changes are backward compatible and existing device plugins will continue to work. At the same time, any new plugins that have the required support for the probing model (Identity service implementation) will also work. Release note: Add support kubelet plugin watcher in device manager. /sig node /area hw-accelerators /cc @jiayingz @RenaudWasTaken @vishh @ScorpioCPH @sjenning @derekwaynecarr @jeremyeder @lichuqiang @tengqm @saad-ali @chakri-nelluri @ConnorDoyle	2018-07-27 15:20:06 -07:00
bingshen.wbs	b1bdd043c4	fix kubelet npe on device plugin return zero container Signed-off-by: bingshen.wbs <bingshen.wbs@alibaba-inc.com>	2018-07-25 10:15:30 +08:00
vikaschoudhary16	a5842503eb	Use probe based plugin discovery mechanism in device manager	2018-07-17 04:02:31 -04:00
Kubernetes Submit Queue	c399c306e2	Merge pull request #59174 from tianshapjq/todo-already-done. Automatic merge from submit-queue (batch tested with PRs 65230, 57355, 59174, 63698, 63659). TODO has already been implemented. What this PR does / why we need it: TODO has already been implemented, remove the TODO tag. Release note: NONE	2018-06-19 20:19:17 -07:00
Guoliang Wang	761cf41427	Move pkg/scheduler/schedulercache -> pkg/scheduler/cache	2018-05-31 22:55:34 +08:00
Kubernetes Submit Queue	15cc20630d	Merge pull request #60034 from pohly/device-manager-goroutine. Automatic merge from submit-queue (batch tested with PRs 58474, 60034, 62101, 63198). avoid race condition in device manager and plugin startup/shutdown: wait for goroutines. What this PR does / why we need it: Commit `1325c2f` worked around issue #59488, but it is still worthwhile to fix the underlying root cause properly. Which issue(s) this PR fixes: Fixes #59488. Special notes for your reviewer: This is an alternative to PR #59861, which used a different approach. Personally I tend to prefer this one now. Release note: NONE. /sig node /area hw-accelerators /assign vikaschoudhary16	2018-04-30 13:24:08 -07:00
vikaschoudhary16	c846d5fe63	Fix race between stopping old and starting new endpoint	2018-04-24 22:22:39 -04:00
vikaschoudhary16	d62bd9ef65	Node-level Checkpointing manager	2018-04-16 00:19:42 -04:00
Patrick Ohly	fcbb64b93d	avoid race condition in device manager and plugin startup/shutdown A flaky test exposed a race condition where shutting down one server instance broke the startup of the next instance when using the same socket path. Commit `1325c2f8be` removed the reuse of the same socket path and thus avoided the issue. But the real fix is to ensure that the listening socket is really closed once Stop returns. Two solutions were proposed in https://github.com/grpc/grpc-go/issues/1861: - waiting for the goroutine to complete - closing the socket The former is done here because it's cleaner to not keep lingering goroutines. While at it, the Stop methods are made idempotent (similar to e.g. Close on a socket) and no longer crash when called without prior Start. Fixes https://github.com/kubernetes/kubernetes/issues/59488	2018-04-12 17:59:10 +02:00
Kubernetes Submit Queue	0022bec3a2	Merge pull request #61525 from tianshapjq/place-consts-together. Automatic merge from submit-queue. move the const to the place it should be. What this PR does / why we need it: move the const to the place it should be.	2018-03-25 09:51:42 -07:00
hzxuzhonghu	70e45eccf2	Replace "golang.org/x/net/context" with "context"	2018-03-22 20:57:14 +08:00
tianshapjq	55921d0827	move the const to the place it should be	2018-03-22 14:20:15 +08:00
Jiaying Zhang	5514a1f4dd	Fixes the races around devicemanager Allocate() and endpoint deletion. There is a race in predicateAdmitHandler Admit() that getNodeAnyWayFunc() could get Node with non-zero deviceplugin resource allocatable for a non-existing endpoint. That race can happen when a device plugin fails, but is more likely when kubelet restarts as with the current registration model, there is a time gap between kubelet restart and device plugin re-registration. During this time window, even though devicemanager could have removed the resource initially during GetCapacity() call, Kubelet may overwrite the device plugin resource capacity/allocatable with the old value when node update from the API server comes in later. This could cause a pod to be started without proper device runtime config set. To solve this problem, introduce endpointStopGracePeriod. When a device plugin fails, don't immediately remove the endpoint but set stopTime in its endpoint. During kubelet restart, create endpoints with stopTime set for any checkpointed registered resource. The endpoint is considered to be in stopGracePeriod if its stoptime is set. This allows us to track what resources should be handled by devicemanager during the time gap. When an endpoint's stopGracePeriod expires, we remove the endpoint and its resource. This allows the resource to be exported through other channels (e.g., by directly updating node status through API server) if there is such use case. Currently endpointStopGracePeriod is set as 5 minutes. Given that an endpoint is no longer immediately removed upon disconnection, mark all its devices unhealthy so that we can signal the resource allocatable change to the scheduler to avoid scheduling more pods to the node. When a device plugin endpoint is in stopGracePeriod, pods requesting the corresponding resource will fail admission handler.	2018-03-09 17:00:57 -08:00
Jiaying Zhang	07beac6004	Made a couple API changes to deviceplugin/v1beta1 to avoid future incompatible changes: - Add GetDevicePluginOptions rpc call. This is needed when we switch from Registration service to probe-based plugin watcher. - Change AllocateRequest and AllocateResponse to allow device requests from multiple containers in a pod. Currently only made mechanical change on the devicemanager and test code to cope with the API but still issues an Allocate call per container. We can modify the devicemanager in 1.11 to issue a single Allocate call per pod. The change will also facilitate incremental API change to communicate pod level information through Allocate rpc if there is such future need.	2018-02-23 16:15:09 -08:00
vikaschoudhary16	e64517cd74	Migrate deviceplugin api from v1alpha to v1beta1	2018-02-21 01:26:20 -05:00
vikaschoudhary16	defcab81d5	Invoke PreStart RPC call before container start, if desired by plugin Signed-off-by: vikaschoudhary16 <vichoudh@redhat.com>	2018-02-21 01:25:24 -05:00
tianshapjq	21702e3c39	TODO has already been implemented	2018-02-01 14:38:29 +08:00
tianshapjq	e0f15bf5bf	Len() is already int	2018-01-29 09:01:23 +08:00
Connor Doyle	e5667cf426	Rename package deviceplugin => devicemanager.	2018-01-24 22:32:43 -08:00

36 Commits (1ba3e9873359f4ae34522f350e35f8d9e13b96e6)