Wait for specific final state instead of waiting for specific number of
operations in volume unit tests. The tests are more readable and will survive
random goroutine ordering (PV and PVC controller have both their own
goroutine).
Automatic merge from submit-queue
allow lock acquisition injection for quota admission
Allows for custom lock acquisition when composing the quota admission controller.
@derekwaynecarr I'm still experimenting to make sure this satisfies the need downstream, but looking for agreement in principle
Changes:
* moved waiting for synced caches before starting any work
* refactored worker() to really quit on quit
* changed queue to a ratelimiting queue and added retries on errors
* deep-copy deployments before mutating - we still need to deep-copy
replica sets and pods
Automatic merge from submit-queue
Move deployment functions to deployment/util.go not widely used
If the function is not used in multiple areas, move it to deployment/util.go
fixes#26750
Automatic merge from submit-queue
Some scheduler optimizations
Ref #28590
This PR doesn't do anything fancy - it is just reducing amount of memory allocations in scheduler, which in turn significantly speeds up scheduler.
Automatic merge from submit-queue
allow handler to join after the informer has started
This allows an event handler to join after a SharedInformer has started. It can't add any indexes, but it can add its reaction functions.
This works by
1. stopping the flow of events from the reflector (thus stopping updates to our store)
1. registering the new handler
1. sending synthetic "add" events to the new handler only
1. unblocking the flow of events
It would be possible to
1. block
1. list
1. add recorder
1. unblock
1. play list to as-yet unregistered handler
1. block
1. remove recorder
1. play recording
1. add new handler
1. unblock
But that is considerably more complicated. I'd rather not start there since this ought to be the exception rather than the rule.
@wojtek-t who requested this power in the initial review
@smarterclayton @liggitt I think this resolves our all-in-one ordering problem.
@hongchaodeng since this came up on the call
* rolling.go (has all the logic for rolling deployments)
* recreate.go (has all the logic for recreate deployments)
* sync.go (has all the logic for getting and scaling replica sets)
* rollback.go (has all the logic for rolling back a deployment)
* util.go (contains all the utilities used throughout the controller)
Leave back at deployment_controller.go all the necessary bits for
creating, setting up, and running the controller loop.
Also add package documentation.
Automatic merge from submit-queue
Use `CreatedByAnnotation` constant
A nit but didn't want the strings to get out of sync.
Signed-off-by: Doug Davis <dug@us.ibm.com>
Automatic merge from submit-queue
Track object modifications in fake clientset
Fake clientset is used by unit tests extensively but it has some
shortcomings:
- no filtering on namespace and name: tests that want to test objects in
multiple namespaces end up getting all objects from this clientset,
as it doesn't perform any filtering based on name and namespace;
- updates and deletes don't modify the clientset state, so some tests
can get unexpected results if they modify/delete objects using the
clientset;
- it's possible to insert multiple objects with the same
kind/name/namespace, this leads to confusing behavior, as retrieval is
based on the insertion order, but anchors on the last added object as
long as no more objects are added.
This change changes core.ObjectRetriever implementation to track object
adds, updates and deletes.
Some unit tests were depending on the previous (and somewhat incorrect)
behavior. These are fixed in the following few commits.
Automatic merge from submit-queue
Kubelet should mark VolumeInUse before checking if it is Attached
Kubelet should mark VolumeInUse before checking if it is Attached.
Controller should fetch fresh copy of node object before detach instead of relying on node informer cache.
Fixes#27836
Ensure that kublet marks VolumeInUse before checking if it is Attached.
Also ensures that the attach/detach controller always fetches a fresh
copy of the node object before detach (instead ofKubelet relying on node
informer cache).
Using new fake clientset registry exposes the actual flow on pending
namespace finalization: get namespace, create finalizer, list pods and
delete namespace if there are no pods.
Since fake clientset now correctly tracks objects created by deployment
controller, it triggers different controller behavior: controller only
creates replica set once and updates deployment once.
Fake clientset no longer needs to be prepopulated with records: keeping
them in leads to the name conflict on creates. Also, since fake
clientset now respects namespaces, we need to correctly populate them.
Automatic merge from submit-queue
Convert service account token controller to use a work queue
Converts the service account token controller to use a work queue. This allows parallelization of token generation (useful when there are several simultaneous namespaces or service accounts being created). It also lets us requeue failures to be retried sooned than the next sync period (which can be very long).
Fixes an issue seen when a namespace is created with secrets quotaed, and the token controller tries to create a token secret prior to the quota status having been initialized. In that case, the secret is rejected at admission, and the token controller wasn't retrying until the resync period.
Automatic merge from submit-queue
Fix initialization of volume controller caches.
Fix `PersistentVolumeController.initializeCaches()` to pass pointers to volume or claim to `storeObjectUpdate()` and add extra functions to enforce that the right types are checked in the future.
Fixes#28076
Fix PersistentVolumeController.initializeCaches() to pass pointers to volume
or claim to storeObjectUpdate() and add extra functions to enforce that the
right types are checked in the future.
Fixes#28076
Automatic merge from submit-queue
[client-gen]Add Patch to clientset
* add the Patch() method to the clientset.
* I have to rename the existing Patch() method of `Event` to PatchWithEventNamespace() to avoid overriding.
* some minor changes to the fake Patch action.
cc @Random-Liu since he asked for the method
@kubernetes/sig-api-machinery
ref #26580
```release-note
Add the Patch method to the generated clientset.
```
Automatic merge from submit-queue
Fix startup type error in initializeCaches
The following error was getting logged:
PersistentVolumeController can't initialize caches, expected list of volumes, got:
&{TypeMeta:{Kind: APIVersion:} ListMeta:{SelfLink:/api/v1/persistentvolumes ResourceVersion:11} Items:[]}
The tests make extensive use of NewFakeControllerSource which uses api.List
instead of api.PersistentVolumeList. So use reflect to help iterate over the
items then assert the item type.
fixes#27757
Automatic merge from submit-queue
daemon/controller.go: refactor worker
1. function name is better to be verb or verb+noun
2. remove unnecessary func call
Automatic merge from submit-queue
Proportionally scale paused and rolling deployments
Enable paused and rolling deployments to be proportionally scaled.
Also have cleanup policy work for paused deployments.
Fixes#20853Fixes#20966Fixes#20754
@bgrant0607 @janetkuo @ironcladlou @nikhiljindal
<!-- Reviewable:start -->
---
This change is [<img src="http://reviewable.k8s.io/review_button.svg" height="35" align="absmiddle" alt="Reviewable"/>](http://reviewable.k8s.io/reviews/kubernetes/kubernetes/20273)
<!-- Reviewable:end -->
The following error was getting logged:
PersistentVolumeController can't initialize caches, expected list of volumes, got:
&{TypeMeta:{Kind: APIVersion:} ListMeta:{SelfLink:/api/v1/persistentvolumes ResourceVersion:11} Items:[]}
Automatic merge from submit-queue
let dynamic client handle non-registered ListOptions
And register v1.ListOptions in the policy group.
Fix#27622
@lavalamp @smarterclayton @krousey
Automatic merge from submit-queue
Prevent detach before node status update
The PR prevents the attach/detach controller from start a detach operation before updating the node status (to remove the volume from the list of attached volumes).
Fixes https://github.com/kubernetes/kubernetes/issues/27836
Automatic merge from submit-queue
AWS/GCE: Spread PetSet volume creation across zones, create GCE volumes in non-master zones
Long term we plan on integrating this into the scheduler, but in the
short term we use the volume name to place it onto a zone.
We hash the volume name so we don't bias to the first few zones.
If the volume name "looks like" a PetSet volume name (ending with
-<number>) then we use the number as an offset. In that case we hash
the base name.
This change implements the desiredStateOfWorld populator to sync up with
the pod informer. It periodically check each pod in the
desiredStateOfworld and verify whether it is still in pod informer
cache. If it not, remove it from the desiredStateOfWorld
Automatic merge from submit-queue
Deployment controller's cleanupUnhealthyReplicas should respect minReadySeconds
```release-note
Fixed an issue that Deployment may be scaled down further than allowed by maxUnavailable when minReadySeconds is set.
```
Fixes#26834
Detected by a flake in deployment rollover e2e test (the only test that specifies `minReadySeconds`).
cc @kubernetes/deployment @pwittrock
cc @mqliang who first added `cleanupUnhealthyReplicas` in deployment controller
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/.github/PULL_REQUEST_TEMPLATE.md?pixel)]()
Automatic merge from submit-queue
Kubelet Volume Manager Wait For Attach Detach Controller and Backoff on Error
* Closes https://github.com/kubernetes/kubernetes/issues/27483
* Modified Attach/Detach controller to report `Node.Status.AttachedVolumes` on successful attach (unique volume name along with device path).
* Modified Kubelet Volume Manager wait for Attach/Detach controller to report success before proceeding with attach.
* Closes https://github.com/kubernetes/kubernetes/issues/27492
* Implemented an exponential backoff mechanism for for volume manager and attach/detach controller to prevent operations (attach/detach/mount/unmount/wait for controller attach/etc) from executing back to back unchecked.
* Closes https://github.com/kubernetes/kubernetes/issues/26679
* Modified volume `Attacher.WaitForAttach()` methods to uses the device path reported by the Attach/Detach controller in `Node.Status.AttachedVolumes` instead of calling out to cloud providers.
Modify attach/detach controller to keep track of volumes to report
attached in Node VolumeToAttach status.
Modify kubelet volume manager to wait for volume to show up in Node
VolumeToAttach status.
Implement exponential backoff for errors in volume manager and attach
detach controller
Automatic merge from submit-queue
Allow disabling of dynamic provisioning
Allow administrators to opt-out of dynamic provisioning. Provisioning is still on by default, which is the current behavior.
Per a conversation with @jsafrane, a boolean toggle was added and plumbed through into the controller. Deliberate disabling will simply return nil from `provisionClaim` whereas a misconfigured provisioner will continue on and generate error events for the PVC.
@kubernetes/rh-storage @saad-ali @thockin @abhgupta
Automatic merge from submit-queue
Fill PV.Status.Message with deleter/recycler errors.
Instead of empty `Message` `kubectl describe pv` now shows:
```
Name: nfs
Labels: <none>
Status: Failed
Claim: default/nfs
Reclaim Policy: Recycle
Access Modes: RWX
Capacity: 1Mi
Message: Recycler failed: Pod was active on the node longer than specified deadline
Source:
Type: NFS (an NFS mount that lasts the lifetime of a pod)
Server: 10.999.999.999
Path: /
ReadOnly: false
```
This is actually a regression since 1.2
@kubernetes/sig-storage
Long term we plan on integrating this into the scheduler, but in the
short term we use the volume name to place it onto a zone.
We hash the volume name so we don't bias to the first few zones.
If the volume name "looks like" a PetSet volume name (ending with
-<number>) then we use the number as an offset. In that case we hash
the base name.
Fixes#27256
Automatic merge from submit-queue
fix updatePod() of RS and RC controllers
Fix updatePod of replication controller manager and replica set controller to handle pod label updates that match no RC or RS.
Fix#27405
This commit adds a new volume manager in kubelet that synchronizes
volume mount/unmount (and attach/detach, if attach/detach controller
is not enabled).
This eliminates the race conditions between the pod creation loop
and the orphaned volumes loops. It also removes the unmount/detach
from the `syncPod()` path so volume clean up never blocks the
`syncPod` loop.
Automatic merge from submit-queue
processor listener: fix locking in pop()
Currently the lock in processorListener is used to guard pendingNotifications. But in pop, it also locks around on select chan. This will block the goroutine with lock acquired.
This PR changes the lock to guard the correct section only.
Automatic merge from submit-queue
volume controller: Convert PersistentVolumes from Kubernetes 1.2
In Kubernetes 1.2 we used template PersistentVolume for provisioning. When a claim for dynamic volume was detected, Kubernetes did:
- create template PV for the claim with dummy pointer to storage asset
- allocate storage asset such as AWS EBS
- fill real pointer to the created storage asset to the template PV
In refactored volume provisioner, Kubernetes allocates the storage asset first and then creates a Kubernetes PV instance already with the correct pointer to the storage asset.
To support seamles upgrade from 1.2 to 1.3 we need to remove these unprovisioned template PVs. The new controller does not use them, it will see PVC for dynamic provisioning and create real PV instead.
See https://github.com/pmorie/pv-haxxz/pull/3 for pseudocode.
Automatic merge from submit-queue
Wait for all volumes/claims to get synced in unit test.
Controller.HasSynced() returns true when all initial claims/volumes were sent
to appropriate goroutines, not when the goroutine has actually processed them.
Fixes#26712
In Kubernetes 1.2 we used template PersistentVolume for provisioning. When a
claim for dynamic volume was detected, Kubernetes did:
- create template PV for the claim with dummy pointer to storage asset
- allocate storage asset such as AWS EBS
- fill real pointer to the created storage asset to the template PV
In refactored volume provisioner, Kubernetes allocates the storage asset first
and then creates a Kubernetes PV instance already with the correct pointer
to the storage asset.
To support seamles upgrade from 1.2 to 1.3 we need to remove these
unprovisioned template PVs. The new controller does not use them, it will see
PVC for dynamic provisioning and create real PV instead.
Automatic merge from submit-queue
Fix typo and linewrap comments in PV controller
Fix some typos and linewrap long comments that I found while going over this code investigating something.
Controller.HasSynced() returns true when all initial claims/volumes were sent
to appropriate goroutines, not when the goroutine has actually processed them.
Automatic merge from submit-queue
Attach/Detach Controller Kubelet Changes
This PR contains changes to enable attach/detach controller proposed in #20262.
Specifically it:
* Introduces a new `enable-controller-attach-detach` kubelet flag to enable control by attach/detach controller. Default enabled.
* Removes all references `SafeToDetach` annotation from controller.
* Adds the new `VolumesInUse` field to the Node Status API object.
* Modifies the controller to use `VolumesInUse` instead of `SafeToDetach` annotation to gate detachment.
* Modifies kubelet to set `VolumesInUse` before Mount and after Unmount.
* There is a bug in the `node-problem-detector` binary that causes `VolumesInUse` to get reset to nil every 30 seconds. Issue https://github.com/kubernetes/node-problem-detector/issues/9#issuecomment-221770924 opened to fix that.
* There is a bug here in the mount/unmount code that prevents resetting `VolumeInUse in some cases, this will be fixed by mount/unmount refactor.
* Have controller process detaches before attaches so that volumes referenced by pods that are rescheduled to a different node are detached first.
* Fix misc bugs in controller.
* Modify GCE attacher to: remove retries, remove mutex, and not fail if volume is already attached or already detached.
Fixes#14642, #19953
```release-note
Kubernetes v1.3 introduces a new Attach/Detach Controller. This controller manages attaching and detaching volumes on-behalf of nodes that have the "volumes.kubernetes.io/controller-managed-attach-detach" annotation.
A kubelet flag, "enable-controller-attach-detach" (default true), controls whether a node sets the "controller-managed-attach-detach" or not.
```
Automatic merge from submit-queue
Fill controller caches on startup
The controller needs to fill its caches before it starts binding/recycling/ deleting or provisioning volumes and claims. This was done using blocking initial 'xxx added' from going through syncClaim/syncVolume. However, when the caches were full, the controller waited for the next sync period to do actual binding/recycling etc.
In this patch, the controller fills its caches directly from etcd and then processes initial 'xxx added' events to reconcile the world and bind/recycle/ delete/provision stuff, resulting in faster binding after startup.
Fixes#25967 (properly)
This PR contains Kubelet changes to enable attach/detach controller control.
* It introduces a new "enable-controller-attach-detach" kubelet flag to
enable control by controller. Default enabled.
* It removes all references "SafeToDetach" annoation from controller.
* It adds the new VolumesInUse field to the Node Status API object.
* It modifies the controller to use VolumesInUse instead of SafeToDetach
annotation to gate detachment.
* There is a bug in node-problem-detector that causes VolumesInUse to
get reset every 30 seconds. Issue https://github.com/kubernetes/node-problem-detector/issues/9
opened to fix that.
Automatic merge from submit-queue
Fix data race in volume controller unit test.
Reactor must be locked when fiddling with reactor.volumes and reactor.claims. Therefore add new functions to add/delete volume/claim with sending an event.
Fixes#26345
Event recorder should wait for some time to get all expected events, the event
may be written by another goroutine that just have finished.
It should not slow down the test in most cases, only when there is a bug and
expected event is not sent.
Reactor must be locked when fiddling with reactor.volumes and reactor.claims.
Therefore add new functions to add/delete volume/claim with sending an event.
Automatic merge from submit-queue
Stabilize controller unit tests.
Remove test "5-1", it's flaky as it depends on order of execution of goroutines. When the controller starts, existing claim is enqueued as "initial sync event" and a new volume is enqueued to separate goroutine. It is not deterministic which goroutine processes its events first and there is no way how to tell that the claim event was processed.
Also, force resync of the controllers after the test to make sure all events are processed.
Fixes unit test flakes.
@kubernetes/sig-storage
Remove test "5-1", it's flaky as it depends on order of execution of
goroutines. When the controller starts, existing claim is enqueued as
"initial sync event" and a new volume is enqueued to separate goroutine.
It is not deterministic which goroutine processes its events first and
there is no way how to tell that the claim event was processed.
Also, force resync of the controllers after the test to make sure all
events are processed.
Automatic merge from submit-queue
daemonset handle DeletedFinalStateUnknown
During an e2e run in OpenShift we ran into the DS controller panic when handling `DeletedFinalStateUnknown`. This PR checks for `DeletedFinalStateUnknown` and queues the embedded object if it is a `DaemonSet`.
@mikedanese - would you mind taking a look?
@deads2k
```
panic: interface conversion: interface is cache.DeletedFinalStateUnknown, not *extensions.DaemonSet
goroutine 4369 [running]:
k8s.io/kubernetes/pkg/controller/daemon.func·005(0x2f8a0c0, 0xc20b559680)
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/controller/daemon/controller.go:160 +0x50
k8s.io/kubernetes/pkg/controller/framework.ResourceEventHandlerFuncs.OnDelete(0xc20a0ae090, 0xc20a0ae0a0, 0xc20a0ae0b0, 0x2f8a0c0, 0xc20b559680)
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/controller/framework/controller.go:178 +0x41
k8s.io/kubernetes/pkg/controller/framework.(*ResourceEventHandlerFuncs).OnDelete(0xc20b8ebf20, 0x2f8a0c0, 0xc20b559680)
<autogenerated>:25 +0xb5
k8s.io/kubernetes/pkg/controller/framework.func·001(0x2f8a280, 0xc20b5522e0, 0x0, 0x0)
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/controller/framework/controller.go:248 +0x4be
k8s.io/kubernetes/pkg/controller/framework.(*Controller).processLoop(0xc20bb727e0)
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/controller/framework/controller.go:122 +0x6f
k8s.io/kubernetes/pkg/controller/framework.*Controller.(k8s.io/kubernetes/pkg/controller/framework.processLoop)·fm()
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/controller/framework/controller.go:97 +0x27
k8s.io/kubernetes/pkg/util/wait.func·001()
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/util/wait/wait.go:66 +0x61
k8s.io/kubernetes/pkg/util/wait.JitterUntil(0xc209f8cfb8, 0x3b9aca00, 0x0, 0xc2080543c0)
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/util/wait/wait.go:67 +0x8f
k8s.io/kubernetes/pkg/util/wait.Until(0xc209f8cfb8, 0x3b9aca00, 0xc2080543c0)
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/util/wait/wait.go:47 +0x4a
k8s.io/kubernetes/pkg/controller/framework.(*Controller).Run(0xc20bb727e0, 0xc2080543c0)
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/controller/framework/controller.go:97 +0x1fb
created by k8s.io/kubernetes/pkg/controller/daemon.(*DaemonSetsController).Run
/data/src/github.com/openshift/origin/Godeps/_workspace/src/k8s.io/kubernetes/pkg/controller/daemon/controller.go:212 +0xae
```
https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_check/1002/artifact/origin/artifacts/test-cmd/logs/openshift.log
Automatic merge from submit-queue
Reduce volume controller sync period
fixes#24236 and most probably also fixes#25294.
Needs #25881! With the cache, binder is not affected by sync period. Without the cache, binding of 1000 PVCs takes more than 5 minutes (instead of ~70 seconds).
15 seconds were chosen by fair 2d10 roll :-)
The controller needs to fill its caches before it starts binding/recycling/
deleting or provisioning volumes and claims. This was done using blocking
initial 'xxx added' from going through syncClaim/syncVolume. However, when
the caches were full, the controller waited for the next sync period to do
actual binding/recycling etc.
In this patch, the controller fills its caches directly from etcd and then
processes initial 'xxx added' events to reconcile the world and bind/recycle/
delete/provision stuff, resulting in faster binding after startup.
Fixes#25967 (properly)
Automatic merge from submit-queue
volume controller: Add cache with the latest version of PVs and PVCs
When the controller binds a PV to PVC, it saves both objects to etcd. However, there is still an old version of these objects in the controller Informer cache. So, when a new PVC comes, the PV is still seen as available and may get bound to the new PVC. This will be blocked by etcd, still, it creates unnecessary traffic that slows everything down.
To make everything worse, when periodic sync with the old PVC is performed, this PVC is seen by the controller as Pending (while it's already Bound on etcd) and will be bound to a different PV. Writing to this PV won't be blocked by etcd, only subsequent write of the PVC fails. So, the controller will need to roll back the PV in another transaction(s). The controller can keep itself pretty busy this way.
Also, we save bound PVs (and PVCs) as two transactions - we save say PV.Spec first and then .Status. The controller gets "PV.Spec updated" event from etcd and tries to fix the Status, as it seems to the controller it's outdated. This write again fails - there already is a correct version in etcd.
As we can't influence the Informer cache, it is read-only to the controller, this patch introduces second cache in the controller, which holds latest and greatest version on PVs and PVCs to prevent these useless writes to etcd . It gets updated with events from etcd *and* after etcd confirms successful save of PV/PVC modified by the controller.
The cache stores only *pointers* to PVs/PVCs, so in ideal case it shares the actual object data with the informer cache. They will diverge only for a short time when the controller modifies something and the informer cache did not get update events yet.
@kubernetes/sig-storage
Automatic merge from submit-queue
Add a NodeCondition "NetworkUnavaiable" to prevent scheduling onto a node until the routes have been created
This is new version of #26267 (based on top of that one).
The new workflow is:
- we have an "NetworkNotReady" condition
- Kubelet when it creates a node, it sets it to "true"
- RouteController will set it to "false" when the route is created
- Scheduler is scheduling only on nodes that doesn't have "NetworkNotReady ==true" condition
@gmarek @bgrant0607 @zmerlynn @cjcullen @derekwaynecarr @danwinship @dcbw @lavalamp @vishh
Automatic merge from submit-queue
prevent namespace cleanup hotloop
Found chasing a sentry report. Looks like we hot-loop on namespace deletion failures.
@derekwaynecarr ptal
Automatic merge from submit-queue
Attach Detach Controller Business Logic
This PR adds the meat of the attach/detach controller proposed in #20262.
The PR splits the in-memory cache into a desired and actual state of the world.
Automatic merge from submit-queue
volume controller: use better operation names
Using volume/claim.UID in the operation name is not really useful, as UIDs are not logged by rest of the controller. On the other hand, volume.Name and claim.Namespace/Name is logged pretty often and it would help to log these also in operation name. Still, I'd prefer to have the operation name really unique to be protected from users deleting a volume and quickly creating another one with the same name, so UID is still part of the operation name.
This has been already proven to be very useful in controller debugging.
Automatic merge from submit-queue
Add some extra checking in the tests to prevent flakes.
Attempts to fix https://github.com/kubernetes/kubernetes/issues/25967
The hypothesis is that somehow waitTest() catches an idle that occurs before all changes have been applied. This will block until the expected number of changes have arrived.
Automatic merge from submit-queue
use monotonic now in TestDelNode
Fixes https://github.com/kubernetes/kubernetes/issues/24971.
Briefly, the rate_limited_queue uses a `container/heap` to store values, and use this data structure to ensure we can always fetch the value with the minimum `processAt`. However, in some extreme condition, the continuous call to `time.Now()` would get the same value, which causes some unpredictable order in the queue, this fix uses a monotonic `now()` to avoid that.
@smarterclayton please take a look.
Split controller cache into actual and desired state of world.
Controller will only operate on volumes scheduled to nodes that
have the "volumes.kubernetes.io/controller-managed-attach" annotation.
Automatic merge from submit-queue
ResourceQuota controller uses rate limiter to prevent hot-loops in error situations
Have resource quota controller use a rate limited queue to prevent hot-looping in error situations.
Automatic merge from submit-queue
volume recycler: Don't start a new recycler pod if one already exists.
Recycling is a long duration process and when the recycler controller is restarted in the meantime, it should not start a new recycler pod if there is one already running.
This means that the recycler pod must have deterministic name based on name of the recycled PV, we then get name conflicts when creating the pod.
Two things need to be changed:
- recycler controller and recycler plugins must pass the PV.Name to place, where the pod is created. This is most of the patch and it should be pretty straightforward.
- create recycler pod with deterministic name and check "already exists" error.
When at it, remove useless 'resourceVersion' argument and make log messages starting with lowercase.
There is an unit test to check the behavior + there is an e2e test that checks that regular recycling is not broken (it does not try to run two recycler pods in parallel as the recycler is single-threaded now).
Automatic merge from submit-queue
NodeController doesn't evict Pods if no Nodes are Ready
Fix#13412#24597
When NodeControllers don't see any Ready Node it goes into "network segmentation mode". In this mode it cancels all evictions and don't evict any Pods.
It leaves network segmentation mode when it sees at least one Ready Node. When leaving it resets all timers, so each Node has full grace period to reconnect to the cluster.
cc @lavalamp @davidopp @mml @wojtek-t @fgrzadkowski
The binder sorts all available volumes first, then it filters out volumes
that cannot be bound by processing each volume in a loop and then finds
the smallest matching volume by binary search.
So, if we process every available volume in a loop, we can also remember the
smallest matching one and save us potentially long sorting (and quick binary
search).
When the controller binds a PV to PVC, it saves both objects to etcd.
However, there is still an old version of these objects in the controller
Informer cache. So, when a new PVC comes, the PV is still seen as available
and may get bound to the new PVC. This will be blocked by etcd, still, it
creates unnecessary traffic that slows everything down.
Also, we save bound PV/PVC as two transactions - we save PV/PVC.Spec first
and then .Status. The controller gets "PV/PVC.Spec updated" event from etcd
and tries to fix the Status, as it seems to the controller it's outdated.
This write again fails - there already is a correct version in etcd.
We can't influence the Informer cache, it is read-only to the controller.
To prevent these useless writes to etcd, this patch introduces second cache
in the controller, which holds latest and greatest version on PVs and PVCs.
It gets updated with events from etcd *and* after etcd confirms successful
save of PV/PVC modified by the controller.
The cache stores only *pointers* to PVs/PVCs, so in ideal case it shares the
actual object data with the informer cache. They will diverge only when
the controller modifies something and the informer cache did not get update
events yet.
Using volume/claim.UID in the operation name is not really useful, as UIDs
are not logged by rest of the controller. On the other hand, volume.Name and
claim.Namespace/Name is logged pretty often and it would help to log these
also in operation name.
This has been already proven to be very useful in controller debugging.
Recycling is a long duration process and when the recycler controller is
restarted in the meantime, it should not start a new recycler pod if there is
one already running.
This means that the recycler pod must have deterministic name based on name
of the recycled PV, we then get name conflicts when creating the pod.
Two things need to be changed:
- recycler controller and recycler plugins must pass the PV.Name to place,
where the pod is created.
- create recycler pod with deterministic name and check "already exists" error.
When at it, remove useless 'resourceVersion' argument and make log messages
starting with lowercase.
Automatic merge from submit-queue
Refactor persistent volume controller
Here is complete persistent controller as designed in https://github.com/pmorie/pv-haxxz/blob/master/controller.go
It's feature complete and compatible with current binder/recycler/provisioner. No new features, it *should* be much more stable and predictable.
Testing
--
The unit test framework is quite complicated, still it was necessary to reach reasonable coverage (78% in `persistentvolume_controller.go`). The untested part are error cases, which are quite hard to test in reasonable way - sure, I can inject a VersionConflictError on any object update and check the error bubbles up to appropriate places, but the real test would be to run `syncClaim`/`syncVolume` again and check it recovers appropriately from the error in the next periodic sync. That's the hard part.
Organization
---
The PR starts with `rm -rf kubernetes/pkg/controller/persistentvolume`. I find it easier to read when I see only the new controller without old pieces scattered around.
[`types.go` from the old controller is reused to speed up matching a bit, the code looks solid and has 95% unit test coverage].
I tried to split the PR into smaller patches, let me know what you think.
~~TODO~~
--
* ~~Missing: provisioning, recycling~~.
* ~~Fix integration tests~~
* ~~Fix e2e tests~~
@kubernetes/sig-storage
<!-- Reviewable:start -->
---
This change is [<img src="http://reviewable.k8s.io/review_button.svg" height="35" align="absmiddle" alt="Reviewable"/>](http://reviewable.k8s.io/reviews/kubernetes/kubernetes/24331)
<!-- Reviewable:end -->
Fixes#15632
Automatic merge from submit-queue
Make name validators return string slices
Part of the larger validation PR, broken out for easier review and merge. Builds on previous PRs in the series.
- remove persistentvolume_ prefix from all files
- split controller.go into controller.go and controller_base.go (to have them
under 1500 lines for github)
This fixes e2e test for provisioning - it expects that provisioned volumes
are bound quickly.
Majority of this patch is update of test framework needs to initialize the
controller appropriately.
- Add reclaim policy to newVolume() call.
- Implement reactor Volumes().Get().
- Implement mock volume plugin.
- Add recycler tests.
- Add a synchronization condition to controller.scheduleOperation
- we need to pause the controller here, let the test to do some bad things
to the controller and test error cases in recycleVolumeOperation.
Test framework gets more and more complicated... But this is the last piece,
I promise.
We need to keep list of running recyclers, deleters and provisioners in
memory in order not to start a new recycling/deleting/provisioning twice
for the same volume/claim.
This will be eventually replaced by GoRoutineMap from PR #24838.
Automatic merge from submit-queue
prevent nil pointer when starting controllers before running the shar…
Fixes https://github.com/kubernetes/kubernetes/issues/25643.
https://github.com/kubernetes/kubernetes/pull/23795 changed initialization order, so the controller isn't guaranteed to be present at startup.
@mqliang @wojtek-t I'm pretty sure that we're not guaranteed to get back the correct `cache.Indexer` or `cache.Store` either. I'll look at re-plumbing the `AddIndexer` path to use the same instance so that its safe to use again.
Automatic merge from submit-queue
Horizontal autoscalaer tolerance breaching verifier
@piosz @jszczepkowski This is the original HPA test without the other modifications, it makes explicit the mathematics of the logic in case changes ever occur.
<!-- Reviewable:start -->
---
This change is [<img src="http://reviewable.k8s.io/review_button.svg" height="35" align="absmiddle" alt="Reviewable"/>](http://reviewable.k8s.io/reviews/kubernetes/kubernetes/24736)
<!-- Reviewable:end -->
Automatic merge from submit-queue
WIP v0 NVIDIA GPU support
```release-note
* Alpha support for scheduling pods on machines with NVIDIA GPUs whose kubelets use the `--experimental-nvidia-gpus` flag, using the alpha.kubernetes.io/nvidia-gpu resource
```
Implements part of #24071 for #23587
I am not familiar with the scheduler enough to know what to do with the scores. Mostly punting for now.
Missing items from the implementation plan: limitranger, rkt support, kubectl
support and docs
cc @erictune @davidopp @dchen1107 @vishh @Hui-Zhi @gopinatht
Automatic merge from submit-queue
Remove myself from a bunch of OWNERS files
For the time being I am too overloaded to do non scheduler/admission related reviews that aren't explicitly assigned to me.
cc/ @brendandburns
Automatic merge from submit-queue
Move internal types of hpa from pkg/apis/extensions to pkg/apis/autoscaling
ref #21577
@lavalamp could you please review or delegate to someone from CSI team?
@janetkuo could you please take a look into the kubelet changes?
cc @fgrzadkowski @jszczepkowski @mwielgus @kubernetes/autoscaling
Automatic merge from submit-queue
Let the dynamic client take runtime.Object instead of v1.ListOptions
so that I can pass whatever version of ListOptions to the List/Watch/DeleteCollection methods.
cc @krousey