github/k3s - k3s - https://git.xinac.net

Commit Graph

Author	SHA1	Message	Date
Kubernetes Submit Queue	204520b029	Merge pull request #63344 from RobertKrawitz/fix-process-kill-algorithm Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Correct kill logic for pod processes Correct the kill logic for processes in the pod's cgroup. os.FindProcess() does not check whether the process exists on POSIX systems.	2018-05-11 11:41:19 -07:00
Kubernetes Submit Queue	321201f672	Merge pull request #63406 from derekwaynecarr/label-pod-cgroups Automatic merge from submit-queue (batch tested with PRs 60200, 63623, 63406). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Apply pod name and namespace labels for pod cgroup for cadvisor metrics What this PR does / why we need it: 1. Enable Prometheus users to determine usage by pod name and namespace for pod cgroup sandbox. 1. Label cAdvisor metrics for pod cgroups by pod name and namespace. 1. Aligns with kubelet stats summary endpoint pod cpu and memory stats. Special notes for your reviewer: This provides parity with the summary API enhancements done here: https://github.com/kubernetes/kubernetes/pull/55969 Release note: ```release-note Apply pod name and namespace labels to pod cgroup in cAdvisor metrics ```	2018-05-10 08:33:11 -07:00
Derek Carr	a09990cd43	Apply pod name and namespace labels for pod cgroup for cadvisor metrics	2018-05-07 14:51:12 -04:00
Kubernetes Submit Queue	1929e0d86d	Merge pull request #63298 from dims/kubelet-remove-unused-code Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. kubelet - Remove unused code What this PR does / why we need it: Looks like we have a bunch of unused methods. Let's clean them up Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes # Special notes for your reviewer: Release note: ```release-note NONE ```	2018-05-04 04:20:06 -07:00
Kubernetes Submit Queue	592c39bccc	Merge pull request #62541 from filbranden/cgroupname1 Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Use a []string for CgroupName, which is a more accurate internal representation What this PR does / why we need it: This is purely a refactoring and should bring no essential change in behavior. It does clarify the cgroup handling code quite a bit. It is preparation for further changes we might want to do in the cgroup hierarchy. (But it's useful on its own, so even if we don't do any, it should still be considered.) Special notes for your reviewer: The slice of strings more precisely captures the hierarchic nature of the cgroup paths we use to represent pods and their groupings. It also ensures we're reducing the chances of passing an incorrect path format to a cgroup driver that requires a different path naming, since now explicit conversions are always needed. The new constructor `NewCgroupName` starts from an existing `CgroupName`, which enforces a hierarchy where a root is always needed. It also performs checking on the component names to ensure invalid characters ("/" and "_") are not in use. A `RootCgroupName` for the top of the cgroup hierarchy tree is introduced. This refactor results in a net reduction of around 30 lines of code, mainly with the demise of ConvertCgroupNameToSystemd which had fairly complicated logic in it and was doing just too many things. There's a small TODO in a helper `updateSystemdCgroupInfo` that was introduced to make this commit possible. That logic really belongs in libcontainer, I'm planning to send a PR there to include it there. (The API already takes a field with that information, only that field is only processed in cgroupfs and not systemd driver, we should fix that.) 
Tested: By running the e2e-node tests on both Ubuntu 16.04 (with cgroupfs driver) and CentOS 7 (with systemd driver.) NOTE: I only tested this with dockershim, we should double-check that this works with the CRI endpoints too, both in cgroupfs and systemd modes. /assign @derekwaynecarr /assign @dashpole /assign @Random-Liu Release note: ```release-note NONE ```	2018-05-03 08:16:45 -07:00
Kubernetes Submit Queue	4f56127582	Merge pull request #63073 from andyxning/refactor_grpc_dial_with_dialcontext Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. refactor device plugin grpc dial with dialcontext What this PR does / why we need it: Refactor grpc `dial` with `dialContext` as `grpc.WithTimeout` has been deprecated by: > use DialContext and context.WithTimeout instead. Special notes for your reviewer: Release note: ```release-note NONE ```	2018-05-03 01:16:34 -07:00
Kubernetes Submit Queue	186dd7beb1	Merge pull request #62903 from cofyc/fixfsgroupcheckinlocal Automatic merge from submit-queue (batch tested with PRs 62657, 63278, 62903, 63375). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Add more volume types in e2e and fix part of them. What this PR does / why we need it: - Add dir-link/dir-bindmounted/dir-link-bindmounted/blockfs volume types for e2e tests. - Fix fsGroup related e2e tests partially. - Return error if we cannot resolve volume path. - Because we should not fall back to volume path, if it's a symbolic link, we may get wrong results. To safely set fsGroup on local volume, we need to implement these two methods correctly for all volume types both on the host and in container: - get volume path kubelet can access - paths on the host and in container are different - get mount references - for directories, we cannot use its mount source (device field) to identify mount references, because directories on same filesystem have same mount source (e.g. tmpfs), we need to check filesystem's major:minor and directory root path on it Here is current status:
| | (A) volume-path (host) | (B) volume-path (container) | (C) mount-refs (host) | (D) mount-refs (container) |
| --- | --- | --- | --- | --- |
| (1) dir | OK | FAIL | FAIL | FAIL |
| (2) dir-link | OK | FAIL | FAIL | FAIL |
| (3) dir-bindmounted | OK | FAIL | FAIL | FAIL |
| (4) dir-link-bindmounted | OK | FAIL | FAIL | FAIL |
| (5) tmpfs | OK | FAIL | FAIL | FAIL |
| (6) blockfs | OK | FAIL | OK | FAIL |
| (7) block | NOTNEEDED | NOTNEEDED | NOTNEEDED | NOTNEEDED |
| (8) gce-localssd-scsi-fs | NOTTESTED | NOTTESTED | NOTTESTED | NOTTESTED |
- This PR uses `nsenter ... readlink` to resolve path in container as @msau42 @jsafrane [suggested](https://github.com/kubernetes/kubernetes/pull/61489#pullrequestreview-110032850). This fixes B1:B6 and D6; the rest will be addressed in https://github.com/kubernetes/kubernetes/pull/62102. - C5:D5 marked `FAIL` because `tmpfs` filesystems can share the same mount source, so we cannot rely on it to check mount references. The e2e tests pass because we use a unique mount source string in tests. - A7:D7 marked `NOTNEEDED` because we don't set fsGroup on block devices in local plugin. (TODO: Should we set fsGroup on block device?) - A8:D8 marked `NOTTESTED` because I didn't test it, I leave it to `pull-kubernetes-e2e-gce`. I think it should be the same as `blockfs`. Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes # Special notes for your reviewer: Release note: ```release-note NONE ```	2018-05-02 20:13:11 -07:00
Yecheng Fu	3748197876	Add more volume types in e2e and fix part of them. - Add dir-link/dir-bindmounted/dir-link-bindmounted/blockfs volume types for e2e tests. - Return error if we cannot resolve volume path. - Add GetFSGroup/GetMountRefs methods for mount.Interface. - Fix fsGroup related e2e tests partially.	2018-05-02 10:31:42 +08:00
Robert Krawitz	3f3c04d722	WIP: Correct kill logic for cgroup processes	2018-05-01 19:38:12 -04:00
Filipe Brandenburger	b230fb8ac4	Use a []string for CgroupName, which is a more accurate internal representation The slice of strings more precisely captures the hierarchic nature of the cgroup paths we use to represent pods and their groupings. It also ensures we're reducing the chances of passing an incorrect path format to a cgroup driver that requires a different path naming, since now explicit conversions are always needed. The new constructor NewCgroupName starts from an existing CgroupName, which enforces a hierarchy where a root is always needed. It also performs checking on the component names to ensure invalid characters ("/" and "_") are not in use. A RootCgroupName for the top of the cgroup hierarchy tree is introduced. This refactor results in a net reduction of around 30 lines of code, mainly with the demise of ConvertCgroupNameToSystemd which had fairly complicated logic in it and was doing just too many things. There's a small TODO in a helper updateSystemdCgroupInfo that was introduced to make this commit possible. That logic really belongs in libcontainer, I'm planning to send a PR there to include it there. (The API already takes a field with that information, only that field is only processed in cgroupfs and not systemd driver, we should fix that.) Tested by running the e2e-node tests on both Ubuntu 16.04 (with cgroupfs driver) and CentOS 7 (with systemd driver.)	2018-05-01 08:29:06 -07:00
Kubernetes Submit Queue	15cc20630d	Merge pull request #60034 from pohly/device-manager-goroutine Automatic merge from submit-queue (batch tested with PRs 58474, 60034, 62101, 63198). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. avoid race condition in device manager and plugin startup/shutdown: wait for goroutines What this PR does / why we need it: Commit `1325c2f` worked around issue #59488, but it is still worthwhile to fix the underlying root cause properly. Which issue(s) this PR fixes: Fixes #59488 Special notes for your reviewer: This is an alternative to PR #59861, which used a different approach. Personally I tend to prefer this one now. Release note: ```release-note NONE ``` /sig node /area hw-accelerators /assign vikaschoudhary16	2018-04-30 13:24:08 -07:00
Davanum Srinivas	4bacd77321	Remove unused code	2018-04-30 14:57:26 -04:00
Andy Xie	b01657d0c7	refactor device plugin grpc dial with dialcontext	2018-04-25 18:40:23 +08:00
vikaschoudhary16	c846d5fe63	Fix race between stopping old and starting new endpoint	2018-04-24 22:22:39 -04:00
choury	c1b19fce90	avoid double RLock() in cpumanager	2018-04-23 10:33:40 +08:00
Kubernetes Submit Queue	4d6a6ced8c	Merge pull request #56525 from tianshapjq/testcase-helpers_linux.go Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. new testcase to helpers_linux.go new testcase to helpers_linux.go, PTAL. ```release-note NONE ```	2018-04-20 18:55:13 -07:00
Kubernetes Submit Queue	e9374411d5	Merge pull request #62509 from sjenning/qos-reserved-feature-gate Automatic merge from submit-queue (batch tested with PRs 61962, 58972, 62509, 62606). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. kubelet: move QOSReserved from experimental to alpha feature gate Fixes https://github.com/kubernetes/kubernetes/issues/61665 Release note: ```release-note The --experimental-qos-reserve kubelet flag is replaced by the alpha level --qos-reserved flag or QOSReserved field in the kubeletconfig and requires the QOSReserved feature gate to be enabled. ``` /sig node /assign @derekwaynecarr /cc @mtaufen	2018-04-19 16:47:21 -07:00
Kubernetes Submit Queue	f3599ba3c9	Merge pull request #61962 from liggitt/flag-race Automatic merge from submit-queue (batch tested with PRs 61962, 58972, 62509, 62606). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Avoid data races in unit tests Setting global flags in unit tests leads to data races like this:
```
==================
WARNING: DATA RACE
Write at 0x0000028f5241 by goroutine 47:
  flag.(*boolValue).Set()
      /home/jliggitt/.gvm/gos/go1.9.5/src/flag/flag.go:91 +0x7b
  flag.(*FlagSet).Set()
      /home/jliggitt/.gvm/gos/go1.9.5/src/flag/flag.go:366 +0x10c
  flag.Set()
      /home/jliggitt/.gvm/gos/go1.9.5/src/flag/flag.go:379 +0x76
  k8s.io/kubernetes/pkg/kubelet/cm/devicemanager.TestPodContainerDeviceAllocation()
      /home/jliggitt/go/src/k8s.io/kubernetes/pkg/kubelet/cm/devicemanager/manager_test.go:549 +0x126
  testing.tRunner()
      /home/jliggitt/.gvm/gos/go1.9.5/src/testing/testing.go:746 +0x16c

Previous read at 0x0000028f5241 by goroutine 34:
  k8s.io/kubernetes/vendor/github.com/golang/glog.(*loggingT).output()
      /home/jliggitt/go/src/k8s.io/kubernetes/vendor/github.com/golang/glog/glog.go:682 +0x730
  k8s.io/kubernetes/vendor/github.com/golang/glog.(*loggingT).printf()
      /home/jliggitt/go/src/k8s.io/kubernetes/vendor/github.com/golang/glog/glog.go:655 +0x259
  k8s.io/kubernetes/vendor/github.com/golang/glog.Errorf()
      /home/jliggitt/go/src/k8s.io/kubernetes/vendor/github.com/golang/glog/glog.go:1118 +0x74
  k8s.io/kubernetes/pkg/kubelet/cm/devicemanager.(*endpointImpl).run()
      /home/jliggitt/go/src/k8s.io/kubernetes/pkg/kubelet/cm/devicemanager/endpoint.go:132 +0x1c7e
  k8s.io/kubernetes/pkg/kubelet/cm/devicemanager.(*ManagerImpl).addEndpoint.func1()
      /home/jliggitt/go/src/k8s.io/kubernetes/pkg/kubelet/cm/devicemanager/manager.go:378 +0x3f

Goroutine 47 (running) created at:
  testing.(*T).Run()
      /home/jliggitt/.gvm/gos/go1.9.5/src/testing/testing.go:789 +0x568
  testing.runTests.func1()
      /home/jliggitt/.gvm/gos/go1.9.5/src/testing/testing.go:1004 +0xa7
  testing.tRunner()
      /home/jliggitt/.gvm/gos/go1.9.5/src/testing/testing.go:746 +0x16c
  testing.runTests()
      /home/jliggitt/.gvm/gos/go1.9.5/src/testing/testing.go:1002 +0x521
  testing.(*M).Run()
      /home/jliggitt/.gvm/gos/go1.9.5/src/testing/testing.go:921 +0x206
  main.main()
      k8s.io/kubernetes/pkg/kubelet/cm/devicemanager/_test/_testmain.go:68 +0x1d3

Goroutine 34 (finished) created at:
  k8s.io/kubernetes/pkg/kubelet/cm/devicemanager.(*ManagerImpl).addEndpoint()
      /home/jliggitt/go/src/k8s.io/kubernetes/pkg/kubelet/cm/devicemanager/manager.go:377 +0x9d6
==================
--- FAIL: TestPodContainerDeviceAllocation (0.00s)
    testing.go:699: race detected during execution of test
FAIL
FAIL k8s.io/kubernetes/pkg/kubelet/cm/devicemanager 0.124s
```	2018-04-19 16:47:14 -07:00
Seth Jennings	9bcd986b23	kubelet: move QOSReserved from experimental to alpha feature gate	2018-04-16 13:08:40 -05:00
vikaschoudhary16	cedbd93255	Make 'pod' package to use unified checkpointManager Signed-off-by: vikaschoudhary16 <choudharyvikas16@gmail.com>	2018-04-16 01:30:20 -04:00
vikaschoudhary16	d62bd9ef65	Node-level Checkpointing manager	2018-04-16 00:19:42 -04:00
Patrick Ohly	fcbb64b93d	avoid race condition in device manager and plugin startup/shutdown A flaky test exposed a race condition where shutting down one server instance broke the startup of the next instance when using the same socket path. Commit `1325c2f8be` removed the reuse of the same socket path and thus avoided the issue. But the real fix is to ensure that the listening socket is really closed once Stop returns. Two solutions were proposed in https://github.com/grpc/grpc-go/issues/1861: - waiting for the goroutine to complete - closing the socket The former is done here because it's cleaner to not keep lingering goroutines. While at it, the Stop methods are made idempotent (similar to e.g. Close on a socket) and no longer crash when called without prior Start. Fixes https://github.com/kubernetes/kubernetes/issues/59488	2018-04-12 17:59:10 +02:00
Jordan Liggitt	b562263427	Avoid data races in unit tests	2018-03-30 17:19:40 -04:00
jianglingxia	583e4b61f5	fix format and typo of NodeAllocatableCgroups	2018-03-28 17:29:23 +08:00
Kubernetes Submit Queue	0022bec3a2	Merge pull request #61525 from tianshapjq/place-consts-together Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. move the const to the place it should be What this PR does / why we need it: move the const to the place it should be Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes # Special notes for your reviewer: Release note: ```release-note ```	2018-03-25 09:51:42 -07:00
hzxuzhonghu	70e45eccf2	Replace "golang.org/x/net/context" with "context"	2018-03-22 20:57:14 +08:00
tianshapjq	55921d0827	move the const to the place it should be	2018-03-22 14:20:15 +08:00
Derek Carr	f68f3ff783	Fix cpu cfs quota flag with pod cgroups	2018-03-16 15:27:11 -04:00
Kubernetes Submit Queue	3d1331f297	Merge pull request #61044 from liggitt/subpath-master Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. subpath fixes fixes #60813 for master / 1.10 ```release-note Fixes CVE-2017-1002101 - See https://issue.k8s.io/60813 for details ```	2018-03-12 11:51:59 -07:00
Kubernetes Submit Queue	a3f40dd8df	Merge pull request #60856 from jiayingz/race-fix Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Fixes the races around devicemanager Allocate() and endpoint deletion. There is a race in predicateAdmitHandler Admit() that getNodeAnyWayFunc() could get Node with non-zero deviceplugin resource allocatable for a non-existing endpoint. That race can happen when a device plugin fails, but is more likely when kubelet restarts as with the current registration model, there is a time gap between kubelet restart and device plugin re-registration. During this time window, even though devicemanager could have removed the resource initially during GetCapacity() call, Kubelet may overwrite the device plugin resource capacity/allocatable with the old value when node update from the API server comes in later. This could cause a pod to be started without proper device runtime config set. To solve this problem, introduce endpointStopGracePeriod. When a device plugin fails, don't immediately remove the endpoint but set stopTime in its endpoint. During kubelet restart, create endpoints with stopTime set for any checkpointed registered resource. The endpoint is considered to be in stopGracePeriod if its stoptime is set. This allows us to track what resources should be handled by devicemanager during the time gap. When an endpoint's stopGracePeriod expires, we remove the endpoint and its resource. This allows the resource to be exported through other channels (e.g., by directly updating node status through API server) if there is such use case. Currently endpointStopGracePeriod is set as 5 minutes. 
Given that an endpoint is no longer immediately removed upon disconnection, mark all its devices unhealthy so that we can signal the resource allocatable change to the scheduler to avoid scheduling more pods to the node. When a device plugin endpoint is in stopGracePeriod, pods requesting the corresponding resource will fail admission handler. Tested: Ran GPUDevicePlugin e2e_node test 100 times and all passed now. What this PR does / why we need it: Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes https://github.com/kubernetes/kubernetes/issues/60176 Special notes for your reviewer: Release note: ```release-note Fixes the races around devicemanager Allocate() and endpoint deletion. ```	2018-03-12 02:50:13 -07:00
Jiaying Zhang	5514a1f4dd	Fixes the races around devicemanager Allocate() and endpoint deletion. There is a race in predicateAdmitHandler Admit() that getNodeAnyWayFunc() could get Node with non-zero deviceplugin resource allocatable for a non-existing endpoint. That race can happen when a device plugin fails, but is more likely when kubelet restarts as with the current registration model, there is a time gap between kubelet restart and device plugin re-registration. During this time window, even though devicemanager could have removed the resource initially during GetCapacity() call, Kubelet may overwrite the device plugin resource capacity/allocatable with the old value when node update from the API server comes in later. This could cause a pod to be started without proper device runtime config set. To solve this problem, introduce endpointStopGracePeriod. When a device plugin fails, don't immediately remove the endpoint but set stopTime in its endpoint. During kubelet restart, create endpoints with stopTime set for any checkpointed registered resource. The endpoint is considered to be in stopGracePeriod if its stoptime is set. This allows us to track what resources should be handled by devicemanager during the time gap. When an endpoint's stopGracePeriod expires, we remove the endpoint and its resource. This allows the resource to be exported through other channels (e.g., by directly updating node status through API server) if there is such use case. Currently endpointStopGracePeriod is set as 5 minutes. Given that an endpoint is no longer immediately removed upon disconnection, mark all its devices unhealthy so that we can signal the resource allocatable change to the scheduler to avoid scheduling more pods to the node. When a device plugin endpoint is in stopGracePeriod, pods requesting the corresponding resource will fail admission handler.	2018-03-09 17:00:57 -08:00
Jan Safranek	5110db5087	Lock subPath volumes Users must not be allowed to step outside the volume with subPath. Therefore the final subPath directory must be "locked" somehow and checked to be inside the volume. On Windows, we lock the directories. On Linux, we bind-mount the final subPath into /var/lib/kubelet/pods/<uid>/volume-subpaths/<container name>/<subPathName>; the user can't change it to a symlink once it's bind-mounted.	2018-03-05 09:14:44 +01:00
Jing Xu	b2e744c620	Promote LocalStorageCapacityIsolation feature to beta The LocalStorageCapacityIsolation feature added a new resource type ResourceEphemeralStorage "ephemeral-storage" so that this resource can be allocated, limited, and consumed in the same way as CPU/memory. All the features related to resource management (resource request/limit, quota, limitrange) are available for local ephemeral storage. This local ephemeral storage represents the storage for the root file system, which will be consumed by containers' writable layer and logs. Some volumes such as emptyDir might also consume this storage.	2018-03-02 15:10:08 -08:00
Kubernetes Submit Queue	e31c8a2252	Merge pull request #60318 from jiayingz/api-change Automatic merge from submit-queue (batch tested with PRs 59159, 60318, 60079, 59371, 57415). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Made a couple API changes to deviceplugin/v1beta1 to avoid future incompatible API changes: - Add GetDevicePluginOptions rpc call. This is needed when we switch from Registration service to probe-based plugin watcher. - Change AllocateRequest and AllocateResponse to allow device requests from multiple containers in a pod. Currently only made mechanical change on the devicemanager and test code to cope with the API but still issues an Allocate call per container. We can modify the devicemanager in 1.11 to issue a single Allocate call per pod. The change will also facilitate incremental API change to communicate pod level information through Allocate rpc if there is such future need. What this PR does / why we need it: Made a couple API changes to deviceplugin/v1beta1 to avoid future incompatible API changes. Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes https://github.com/kubernetes/kubernetes/issues/59370 Special notes for your reviewer: Release note: ```release-note ```	2018-02-24 21:19:33 -08:00
Jiaying Zhang	07beac6004	Made a couple API changes to deviceplugin/v1beta1 to avoid future incompatible changes: - Add GetDevicePluginOptions rpc call. This is needed when we switch from Registration service to probe-based plugin watcher. - Change AllocateRequest and AllocateResponse to allow device requests from multiple containers in a pod. Currently only made mechanical change on the devicemanager and test code to cope with the API but still issues an Allocate call per container. We can modify the devicemanager in 1.11 to issue a single Allocate call per pod. The change will also facilitate incremental API change to communicate pod level information through Allocate rpc if there is such future need.	2018-02-23 16:15:09 -08:00
Kubernetes Submit Queue	d5aba0c6ca	Merge pull request #59088 from YuxiJin-tobeyjin/codeClean-merge-logfAndFailnow-to-fatalf Automatic merge from submit-queue (batch tested with PRs 60106, 59510, 60263, 60063, 59088). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. CodeClean, merge Logf And FailNow to Fatalf What this PR does / why we need it: Trivial changes to clean code, merge Logf And FailNow to Fatalf. Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes # Special notes for your reviewer: Release note: ```release-note "NONE" ```	2018-02-23 02:59:55 -08:00
Kubernetes Submit Queue	e8dd75f37d	Merge pull request #58282 from vikaschoudhary16/per-container-allocate Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Invoke preStart RPC call before container start, if desired by plugin What this PR does / why we need it: 1. Adds a new RPC `preStart` to device plugin API 2. Update `Register` RPC handling to receive a flag from the Device plugins as an indicator if kubelet should invoke `preStart` RPC before starting container. 3. Changes in device manager to invoke `preStart` before container start 4. Test case updates Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes #56943 #56307 Special notes for your reviewer: Release note: ```release-note None ``` /sig node /area hw-accelerators /cc @jiayingz @RenaudWasTaken @vishh @ScorpioCPH @sjenning @derekwaynecarr @jeremyeder @lichuqiang @tengqm	2018-02-21 13:07:26 -08:00
vikaschoudhary16	e64517cd74	Migrate deviceplugin api from v1alpha to v1beta1	2018-02-21 01:26:20 -05:00
vikaschoudhary16	defcab81d5	Invoke PreStart RPC call before container start, if desired by plugin Signed-off-by: vikaschoudhary16 <vichoudh@redhat.com>	2018-02-21 01:25:24 -05:00
ravisantoshgudimetla	a9a724d500	Test cases fix after path expansion	2018-02-20 14:23:09 -05:00
Kubernetes Submit Queue	96ec318718	Merge pull request #59842 from ixdy/update-rules_go-02-2018 Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Update bazelbuild/rules_go, kubernetes/repo-infra, and gazelle dependencies What this PR does / why we need it: updates our bazelbuild/rules_go dependency in order to bump everything to go1.9.4. I'm separating this effort into two separate PRs, since updating rules_go requires a large cleanup, removing an attribute from most build rules. Release note: ```release-note NONE ```	2018-02-19 22:23:05 -08:00
David Ashpole	960856f4e8	collect metrics on the /kubepods cgroup on-demand	2018-02-17 12:32:40 -08:00
Jeff Grafton	ef56a8d6bb	Autogenerated: hack/update-bazel.sh	2018-02-16 13:43:01 -08:00
David Ashpole	b259543985	collect ephemeral storage capacity on initialization	2018-02-15 17:33:22 -08:00
Kubernetes Submit Queue	58dcf3c533	Merge pull request #59489 from pohly/master-tmpdir Automatic merge from submit-queue (batch tested with PRs 59489, 59716). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. devicemanager testing: dynamically choose tmp dir This avoids the test issue #59488 that I was running into. I believe I have a reasonable explanation for the race condition in that issue (TLDR: it's probably part of the gRPC API and k8s can only avoid the issue until a proper solution gets worked out together with gRPC), therefore I suggest to merge this PR now both because it avoids the issue and because using fixed tmp directories is something that should be avoided anyway. /assign @jiayingz	2018-02-14 00:14:31 -08:00
Kubernetes Submit Queue	317853c90c	Merge pull request #59464 from dixudx/fix_all_typos Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. fix all the typos across the project What this PR does / why we need it: There are lots of typos across the project. We should avoid small PRs on fixing those annoying typos, which is time-consuming and inefficient. This PR fixes all the typos currently in the project. And with #59463, typos could be avoided when a new PR gets merged. Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes # Special notes for your reviewer: /sig testing /area test-infra /sig release /cc @ixdy /assign @fejta Release note: ```release-note None ```	2018-02-10 22:12:45 -08:00
Di Xu	48388fec7e	fix all the typos across the project	2018-02-11 11:04:14 +08:00
Patrick Ohly	0d828e061b	devicemanager testing: time out sooner Each individual step should not take longer than a second. Suggested by Vikas Choudhary (https://github.com/kubernetes/kubernetes/pull/59489#discussion_r167205672).	2018-02-09 20:51:54 +01:00
Kubernetes Submit Queue	76e6da25fa	Merge pull request #59481 from rojkov/dm-unittests Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. devicemanager: increase code coverage of endpoint's unit test Particularly cover the code path when an unhealthy device becomes healthy.	2018-02-09 10:35:22 -08:00
Patrick Ohly	1325c2f8be	devicemanager testing: dynamically choose tmp dir Hard-coding the tests to use /tmp/device_plugin for sockets is problematic because it prevents running tests in parallel on the same machine (perhaps because there are multiple developers, perhaps because testing is done independently on different code checkouts). /tmp/device_plugin also was not removed after testing. This is probably not that relevant. But more importantly, this change also fixes https://github.com/kubernetes/kubernetes/issues/59488. "make test" failed in TestDevicePluginReRegistration because something removed /tmp/device_plugin/device-plugin.sock while something else tried to connect to it:
    2018/02/07 14:34:39 Starting to serve on /tmp/device_plugin/device-plugin.sock
    [pid 29568] connect(14, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/server.sock"}, 33) = 0
    [pid 29568] unlinkat(AT_FDCWD, "/tmp/device_plugin/server.sock", 0) = 0
    [pid 29568] unlinkat(AT_FDCWD, "/tmp/device_plugin/device-plugin.sock", 0) = 0
    [pid 29568] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=29568, si_uid=1000} ---
    [pid 29568] connect(6, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/device-plugin.sock"}, 40) = -1 ENOENT (No such file or directory)
    E0207 14:34:39.961321 29568 endpoint.go:117] listAndWatch ended unexpectedly for device plugin mock with error rpc error: code = Unavailable desc = transport is closing
    strace: Process 29623 attached
    [pid 29574] connect(3, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/device-plugin.sock"}, 40) = -1 ENOENT (No such file or directory)
    [pid 29623] connect(3, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/device-plugin.sock"}, 40) = -1 ENOENT (No such file or directory)
    [pid 29574] connect(3, {sa_family=AF_UNIX, sun_path="/tmp/device_plugin/device-plugin.sock"}, 40) = -1 ENOENT (No such file or directory)
    E0207 14:34:49.961324 29568 endpoint.go:60] Can't create new endpoint with path /tmp/device_plugin/device-plugin.sock err failed to dial device plugin: context deadline exceeded
    E0207 14:34:49.961390 29568 manager.go:340] Failed to dial device plugin with request &RegisterRequest{Version:v1alpha2,Endpoint:device-plugin.sock,ResourceName:fake-domain/resource,}: failed to dial device plugin: context deadline exceeded
    panic: test timed out after 2m0s
It's not entirely certain which code was to blame for these unlinkat() calls (perhaps some cleanup code from a previous test running in a goroutine?) but this no longer happened after switching to per-test socket directories.	2018-02-09 14:01:13 +01:00


340 Commits (8cccc022b05dc6fb130295a97373b06735a636cc)
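
Commit b230fb8ac4 above describes representing a cgroup name as a []string, with a `NewCgroupName` constructor that starts from an existing name, rejects "/" and "_" in components, and a `RootCgroupName` at the top of the hierarchy. The Go sketch below illustrates that shape only. Returning an error from the constructor and the `cgroupfsPath` conversion method are assumptions made for this example, not the kubelet's actual signatures.

```go
package main

import (
	"fmt"
	"strings"
)

// CgroupName holds one path component per slice element.
type CgroupName []string

// RootCgroupName is the top of the hierarchy (no components).
var RootCgroupName = CgroupName{}

// NewCgroupName extends base with components, rejecting "/" and "_".
// Returning an error here is an assumption made for this sketch.
func NewCgroupName(base CgroupName, components ...string) (CgroupName, error) {
	for _, c := range components {
		if strings.ContainsAny(c, "/_") {
			return nil, fmt.Errorf("invalid character in cgroup component %q", c)
		}
	}
	// Copy so the new name never aliases the base slice's backing array.
	name := make(CgroupName, 0, len(base)+len(components))
	name = append(name, base...)
	return append(name, components...), nil
}

// cgroupfsPath is a hypothetical explicit conversion to a cgroupfs-style path.
func (n CgroupName) cgroupfsPath() string {
	return "/" + strings.Join(n, "/")
}

func main() {
	name, err := NewCgroupName(RootCgroupName, "kubepods", "burstable")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name.cgroupfsPath()) // prints "/kubepods/burstable"
}
```

Because the name stays a slice until a driver-specific conversion is called, a path formatted for one cgroup driver cannot be passed to another by accident.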