Automatic merge from submit-queue
replace global registry in apimachinery with global registry in k8s.io/kubernetes
We'd like to remove all globals, but our immediate problem is that a shared registry between k8s.io/kubernetes and k8s.io/client-go doesn't work. Since client-go makes a copy, we can actually keep a global registry with other globals in pkg/api for now.
@kubernetes/sig-api-machinery-misc @lavalamp @smarterclayton @sttts
Automatic merge from submit-queue (batch tested with PRs 39803, 39698, 39537, 39478)
[scheduling] Moved pod affinity and anti-affinity from annotations to api fields #25319
Converted pod affinity and anti-affinity from annotations to api fields
Related: #25319
Related: #34508
**Release note**:
```Pod affinity and anti-affinity has moved from annotations to api fields in the pod spec. Pod affinity or anti-affinity that is defined in the annotations will be ignored.```
Automatic merge from submit-queue (batch tested with PRs 39803, 39698, 39537, 39478)
Use controller interface for everything in config factory
**What this PR does / why we need it**:
We want to replace controller structs with interfaces
- per the TODO in `ControllerInterface`
- Specifically this will make the decoupling from Config and reuse of the scheduler's subcomponents cleaner.
Automatic merge from submit-queue (batch tested with PRs 39230, 39718)
Remove special case for StatefulSets in scheduler
**What this PR does / why we need it**: Removes special case for StatefulSet in scheduler code
/ref: https://github.com/kubernetes/kubernetes/issues/39687
**Special notes for your reviewer**:
**Release note**:
```release-note
Scheduler treats StatefulSet pods as belonging to a single equivalence class.
```
Automatic merge from submit-queue (batch tested with PRs 39684, 39577, 38989, 39534, 39702)
Set PodStatus QOSClass field
This PR continues the work for https://github.com/kubernetes/kubernetes/pull/37968
It converts all local usage of the `qos` package class types to the new API level types (first commit) and sets the pod status QOSClass field in the at pod creation time on the API server in `PrepareForCreate` and in the kubelet in the pod status update path (second commit). This way the pod QOS class is set even if the pod isn't scheduled yet.
Fixes#33255
@ConnorDoyle @derekwaynecarr @vishh
Automatic merge from submit-queue
Fix comment and optimize code
**What this PR does / why we need it**:
Fix comment and optimize code.
Thanks.
**Special notes for your reviewer**:
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 39075, 39350, 39353)
Move pkg/api.{Context,RequestContextMapper} into pkg/genericapiserver/api/request
**Based on #39350**
Automatic merge from submit-queue
Ensure that assumed pod won't be expired until the end of binding
In case if api server is overloaded and will reply with 429 too many requests error, binding may take longer than ttl of scheduler cache for assumed pods 1199d42210/pkg/client/restclient/request.go (L787-L850)
This problem was mitigated by using this fix e4d215d508 and increased rate limit for api server. But it is possible that it will occur again.
In such cases when api server is overloaded and returns a lot of
429 (too many requests) errors - binding may take a lot of time
to succeed due to retry policy implemented in rest client.
In such events cache ttl for assumed pods wasn't big enough.
In order to minimize probability of such errors ttl for assumed pods
will be counted from the time when binding for particular pod is finished
(either with error or success)
Change-Id: Ib0122f8a76dc57c82f2c7c52497aad1bdd8be411
Automatic merge from submit-queue
Avoid unnecessary memory allocations
Low-hanging fruits in saving memory allocations. During our 5000-node kubemark runs I've see this:
ControllerManager:
- 40.17% k8s.io/kubernetes/pkg/util/system.IsMasterNode
- 19.04% k8s.io/kubernetes/pkg/controller.(*PodControllerRefManager).Classify
Scheduler:
- 42.74% k8s.io/kubernetes/plugin/pkg/scheduler/algrorithm/predicates.(*MaxPDVolumeCountChecker).filterVolumes
This PR is eliminating all of those.
Automatic merge from submit-queue
Optimize pod affinity when predicate
Optimize by returning as early as possible to avoid invoking priorityutil.PodMatchesTermsNamespaceAndSelector.
Automatic merge from submit-queue
Add use case to service affinity
Also part of nits in refactoring predicates, I found the explanation of `serviceaffinity` in its comment is very hard to understand. So I added example instead here to help user/developer to digest it.
Automatic merge from submit-queue (batch tested with PRs 38154, 38502)
Rename "release_1_5" clientset to just "clientset"
We used to keep multiple releases in the main repo. Now that [client-go](https://github.com/kubernetes/client-go) does the versioning, there is no need to keep releases in the main repo. This PR renames the "release_1_5" clientset to just "clientset", clientset development will be done in this directory.
@kubernetes/sig-api-machinery @deads2k
```release-note
The main repository does not keep multiple releases of clientsets anymore. Please find previous releases at https://github.com/kubernetes/client-go
```
Automatic merge from submit-queue (batch tested with PRs 38648, 38747)
fix typo
**What this PR does / why we need it**:
fix typo.
**Release note**:
```NONE
```
Automatic merge from submit-queue (batch tested with PRs 35436, 37090, 38700)
Make iscsi pv/pvc aware of nodiskconflict feature
Being iscsi a `RWO, ROX` volume we should conflict if more than one pod is using same iscsi LUN.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
Automatic merge from submit-queue (batch tested with PRs 38377, 36365, 36648, 37691, 38339)
Do not create selector and namespaces in a loop where possible
With 1000 nodes and 5000 pods (5 pods per node) with anti-affinity a lot of CPU wasted on creating LabelSelector and sets.String (map).
With this change we are able to deploy that number of pods in ~25 minutes. Without - it takes 30 minutes to deploy 500 pods with anti-affinity configured.
failureDomains are only used for PreferredDuringScheduling pod
anti-affinity, which is ignored by PodAffinityChecker.
This unnecessary requirement was making it hard to move
PodAffinityChecker to GeneralPredicates because that would require
passing --failure-domains to both kubelet and kube-controller-manager.
Automatic merge from submit-queue (batch tested with PRs 38076, 38137, 36882, 37634, 37558)
[scheduler] Use V(10) for anything which may be O(N*P) logging
Fixes#37014
This PR makes sure that logging statements which are capable of being called on a perNode / perPod basis (i.e. non essential ones that will just clog up logs at large scale) are at V(10) level.
I dreamt of a levenstein filter that built a weak map of word frequencies and alerted once log throughput increased w/o varying information content.... but then I woke up and realized this is probably all we really need for now :)
Automatic merge from submit-queue
Revert "Avoid hard-coding list of Node Conditions"
* we don't know how other API consumers are using node conditions (there was no prior expectation that the scheduler would block on custom conditions)
* not all conditions map directly to schedulability (e.g. `MemoryPressure`/`DiskPressure`)
* not all conditions use True to mean "unschedulable" (e.g. `Ready`)
This reverts commit 511b2ecaa8 to avoid breaking existing API users and to avoid constraining future uses of the node conditions API
Automatic merge from submit-queue
split scheduler priorities into separate files
In the current state it's really hard to find a thing one is looking for, if he doesn't know already know where to look. cc @davidopp
- Prevents kubelet from overwriting capacity during sync.
- Handles opaque integer resources in the scheduler.
- Adds scheduler predicate tests for opaque resources.
- Validates opaque int resources:
- Ensures supplied opaque int quantities in node capacity,
node allocatable, pod request and pod limit are integers.
- Adds tests for new validation logic (node update and pod spec).
- Added e2e tests for opaque integer resources.
Automatic merge from submit-queue
Make "Unschedulable" reason a constant in api
String "Unschedulable" is used in couple places in K8S:
* scheduler
* federation replicaset and deployment controllers
* cluster autoscaler
* rescheduler
This PR makes the string a part of API so it not changed.
cc: @davidopp @fgrzadkowski @wojtek-t
Automatic merge from submit-queue
Create restclient interface
Refactoring of code to allow replace *restclient.RESTClient with any RESTClient implementation that implements restclient.RESTClientInterface interface.
Automatic merge from submit-queue
Predicate cacheing and cleanup
Fix to #31795
First pass @ cleanup and caching of the CheckServiceAffinity function.
The cleanup IMO is necessary because the logic around the pod listing and the use of the "implicit selector" (which is reverse engineered to enable the homogenous pod groups).
Should still pass the E2Es.
@timothysc @wojtek-t
Comments addressed, Make emptyMetadataProducer a func to avoid casting,
FakeSvcLister: remove error return for len(svc)=0. New test for
predicatePrecomp to make method semantics explictly enforced when meta
is missing. Precompute wrapper.
Automatic merge from submit-queue
PVC informer lister supports listing
This will be used in follow-on PRs for quota evaluation backed by informers for pvcs.
/cc @deads2k @eparis @kubernetes/sig-storage
Most normal codec use should perform defaulting. DirectCodecs should not
perform defaulting. Update the defaulting_test to fuzz the list of known
defaulters. Use the new versioning.NewDefaultingCodec() method.
Automatic merge from submit-queue
Refactor scheduler to enable switch on equivalence cache
Part 2 of #30844
Refactoring to enable easier switch on of equivalence cache.
Automatic merge from submit-queue
Fix typos and englishify plugin/pkg
**What this PR does / why we need it**: Just typos
**Which issue this PR fixes**: `None`
**Special notes for your reviewer**: Just typos
**Release note**: `NONE`
Automatic merge from submit-queue
[Part 1] Implementation of equivalence pod
Part 1 of #30844
This PR:
- Refactored `predicate.go`, so that `GetResourceRequest` can be used in other places to `GetEquivalencePod`.
- Implement a `equivalence_cache.go` to deal with all information we need to calculate an equivalent pod.
- Define and register a `RegisterGetEquivalencePodFunction `.
Work in next PR:
- The update of `equivalence_cache.go`
- Unit test
- Integration/e2e test
I think we can begin from the `equivalence_cache.go`? Thanks. cc @wojtek-t @davidopp
If I missed any other necessary part, please point it out.
Automatic merge from submit-queue
Turn down hootloop logs in priorities.
Excessive spam is output once we near cluster capacity, sometimes a panic is triggered but that data is clipped in the logs see #33935 for more details.
Automatic merge from submit-queue
scheduler: initialize podsWithAffinity
Without initializing podsWithAffinity, scheduler panics when deleting
a pod from a node that has no pods with affinity ever scheduled to.
Fix#33772
Automatic merge from submit-queue
simplify RC and SVC listers
Make the RC and SVC listers use the common list functions that more closely match client APIs, are consistent with other listers, and avoid unnecessary copies.
Automatic merge from submit-queue
Fix minor nits in test cases
Found a group of nits when doing #30844, fixed them in a this PR since 30844 requires a long time to review.
Automatic merge from submit-queue
Exit scheduler retry loop correctly
The error was being eaten and shadowed, which means we would never exit
this loop. This might lead to a goroutine in the scheduler being used
forever without exiting at maximum backoff.
Switched to use the real client to make later refactors easier.
@wojtek-t this may lead to scheduler informer exhaustion - not that familiar with retries
The error was being eaten and shadowed, which means we would never exit
this loop. This might lead to a goroutine in the scheduler being used
forever without exiting at maximum backoff.
Switched to use the real client to make later refactors easier.
Automatic merge from submit-queue
Update scheduler config file compatibility tests
**What this PR does / why we need it**:
Added missing compatibility tests for scheduler config file options.
**Which issue this PR fixes**
fixes#30099
**Special notes for your reviewer**:
I came up with the options based on the contents of default.go in each branch.
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue
update comment info for scheduler binding fails
Since the process logic for scheduler binding failed has changed, I think we should update the comment information to avoid make people confused :)
The related issue is #30611.
@wojtek-t What do you think about it ?
Thanks!
Automatic merge from submit-queue
ClusterAutoscaler-friendly scheduler priority function that promotes well-used nodes
It will help cluster autoscaler to put pods on nodes that are unlikely to be deleted soon due to low usage. Otherwise a pod may be frequently kicked from one node to another. A flag that enables it when CA is on will be added in a separate PR.
Fixes: #28842
Convert single GV and lists of GVs into an interface that can handle
more complex scenarios (everything internal, nothing supported). Pass
the interface down into conversion.
Automatic merge from submit-queue
remove useless value copy
Copy something to values in parameters won't change them in go. So, remove it to avoid making people confused.
Automatic merge from submit-queue
SchedulerExtender: add failedPredicateMap in Filter() returns
Fix#25797. modify extender.Filter for adding extenders information to “failedPredicateMap” in findNodesThatFit.
When all the filtered nodes that passed "predicateFuncs" don’t pass the extenders filter, the failedPredicateMap hasn’t the extenders information, should add it, I think. So when the length of the “filteredNodes.Items” is 0, we can know the integral information. (The length of the “filteredNodes.Items” is 0, may be because the extenders filter failed.)
Automatic merge from submit-queue
Prepare for using "ControllerRef" in scheduler
This is part of a PR that I already have to avoid a bunch of rebases in the future (controller ref probably won't happen in 1.4 release).
@davidopp
Automatic merge from submit-queue
Initial support for pod eviction based on disk
This PR adds the following:
1. node reports disk pressure condition based on configured thresholds
1. scheduler does not place pods on nodes reporting disk pressure
1. kubelet will not admit any pod when it reports disk pressure
1. kubelet ranks pods for eviction when low on disk
1. kubelet evicts greediest pod
Follow-on PRs will need to handle:
1. integrate with new image gc PR (https://github.com/kubernetes/kubernetes/pull/27199)
1. container gc policy should always run (will not be launched from eviction, tbd who does that)
1. this means kill pod is fine for all eviction code paths since container gc will remove dead container
1. min reclaim support will just poll summary provider (derek will do follow-on)
1. need to know if imagefs is same device as rootfs from summary (derek follow-on)
/cc @vishh @kubernetes/sig-node
Automatic merge from submit-queue
First step of optimizing PodAffinity priority function
Ref #26144
This is obviously only a first step - I will continue working on this code. However, this is changing the general scheme of computations to what is described in: https://github.com/kubernetes/kubernetes/issues/26144#issuecomment-232612384
Automatic merge from submit-queue
Optimise the process of the CalculateSpreadPriority in selector_spreading.go
It had better inspect if the nodeLister is normal first in the CalculateSpreadPriority in selector_spreading.go. If the nodeLister.List return error, the function return directly, not need deal the serviceLister and controllerLister and replicaSetLister.
Automatic merge from submit-queue
Add hooks for cluster health detection
Separate a function that decides if zone is healthy. First real commit for preventing massive pod eviction.
Ref. #28832
cc @davidopp
Automatic merge from submit-queue
Change storeToNodeConditionLister to return []*api.Node instead of api.NodeList for performance
Currently copies that are made while copying/creating api.NodeList are significant part of scheduler profile, and a bunch of them are made in places, that are not-parallelizable.
Ref #28590
Automatic merge from submit-queue
Add test case to TestPodFitsResources() of scheduler algorithm
File "plugin\pkg\scheduler\algorithm\predicates", function "TestPodFitsResources()", line 199, only provide test case "one resource cpu fits but memory not", it should add test case "one resource memory fits but cpu not".
Automatic merge from submit-queue
Optimize priorities in scheduler
Ref #28590
It's probably easier to review it commit by commit, since those changes are kind of independent from each other.
@davidopp - FYI
Automatic merge from submit-queue
Add meta field to predicate signature to avoid computing the same things multiple times
This PR only uses it to avoid computing QOS of a pod for every node from scratch.
Ref #28590
Automatic merge from submit-queue
scheduler: change phantom pod test from integration into unit test
This is an effort for #24440.
Why this PR?
- Integration test is hard to debug. We could model the test as a unit test similar to [TestSchedulerForgetAssumedPodAfterDelete()](132ebb091a/plugin/pkg/scheduler/scheduler_test.go (L173)). Currently the test is testing expiring case, we can change that to delete.
- Add a test similar to TestSchedulerForgetAssumedPodAfterDelete() to test phantom pod.
- refactor scheduler tests to share the code between TestSchedulerNoPhantomPodAfterExpire() and TestSchedulerNoPhantomPodAfterDelete()
- Decouple scheduler tests from scheduler events: not to use events
Automatic merge from submit-queue
[Refactor] QOS to have QOS Class type for QoS classes
This PR adds a QOSClass type and initializes QOSclass constants for the three QoS classes.
It would be good to use this in all future QOS related features.
This would be good to have for the (Pod level cgroups isolation proposal)[https://github.com/kubernetes/kubernetes/pull/26751] that i am working on aswell.
@vishh PTAL
Signed-off-by: Buddha Prakash <buddhap@google.com>
Automatic merge from submit-queue
Fix #25606: Add the length detection of the "predicateFuncs" in generic_scheduler.go
Fix #25606
The PR add the length detection of the "predicateFuncs" for "findNodesThatFit" function of generic_scheduler.go.
In “findNodesThatFit” function, if the length of the "predicateFuncs" parameter is 0, it can set filtered equals nodes.Items, and needn't to traverse the nodes.Items.
When kubelet starts a pod that refers to non-existing PV, PVC or Node, it
should clearly show that the requested element does not exist.
Previous "PersistentVolumeClaim 'default/ceph-claim-wm' is not in cache"
looks like random kubelet hiccup, while "PersistentVolumeClaim
'default/ceph-claim-wm' not found" suggests that the object may not exist at
all and it might be an user error.
Fixes#27523
Automatic merge from submit-queue
scheduler: remove unused random generator
The way scheduler selecting host has been changed to round-robin.
Clean up leftover.
Automatic merge from submit-queue
Add a NodeCondition "NetworkUnavaiable" to prevent scheduling onto a node until the routes have been created
This is new version of #26267 (based on top of that one).
The new workflow is:
- we have an "NetworkNotReady" condition
- Kubelet when it creates a node, it sets it to "true"
- RouteController will set it to "false" when the route is created
- Scheduler is scheduling only on nodes that doesn't have "NetworkNotReady ==true" condition
@gmarek @bgrant0607 @zmerlynn @cjcullen @derekwaynecarr @danwinship @dcbw @lavalamp @vishh
Automatic merge from submit-queue
Introduce node memory pressure condition to scheduler
Following the work done by @derekwaynecarr at https://github.com/kubernetes/kubernetes/pull/21274, introducing memory pressure predicate for scheduler.
Missing:
* write down unit-test
* test the implementation
At the moment this is a heads up for further discussion how the new node's memory pressure condition should be handled in the generic scheduler.
**Additional info**
* Based on [1], only best effort pods are subject to filtering.
* Based on [2], best effort pods are those pods "iff requests & limits are not specified for any resource across all containers".
[1] 542668cc79/docs/proposals/kubelet-eviction.md (scheduler)
[2] https://github.com/kubernetes/kubernetes/pull/14943
The PR add the length detection of the "predicateFuncs" for "findNodesThatFit" function of generic_scheduler.go.
In “findNodesThatFit” function, if the length of the "predicateFuncs" parameter is 0, it can set filtered equals nodes.Items, and needn't to traverse the nodes.Items.
It should reduce the resource data after finding the pod in the pods, because perhaps no corresponding pod in the pods of the node, at this time it shouldn't reduce the resource data of the node.
Automatic merge from submit-queue
Make IsValidLabelValue return error strings
Part of the larger validation PR, broken out for easier review and merge. Builds on previous PRs in the series.
Automatic merge from submit-queue
Improve fatal error description in plugins.go of scheduler
The PR add more information for the fatal error in plugins.go of scheduler.
Automatic merge from submit-queue
Make IsQualifiedName return error strings
Part of the larger validation PR, broken out for easier review and merge.
@lavalamp FYI, but I know you're swamped, too.
In RegisterCustomFitPredicate, when policy.Argument is nil and fitPredicateMap has the policy.Name, it can return the policy.Name directly. Subsequent operations are redundant.
Automatic merge from submit-queue
WIP v0 NVIDIA GPU support
```release-note
* Alpha support for scheduling pods on machines with NVIDIA GPUs whose kubelets use the `--experimental-nvidia-gpus` flag, using the alpha.kubernetes.io/nvidia-gpu resource
```
Implements part of #24071 for #23587
I am not familiar with the scheduler enough to know what to do with the scores. Mostly punting for now.
Missing items from the implementation plan: limitranger, rkt support, kubectl
support and docs
cc @erictune @davidopp @dchen1107 @vishh @Hui-Zhi @gopinatht
Automatic merge from submit-queue
Add pod condition PodScheduled to detect situation when scheduler tried to schedule a Pod, but failed
Set `PodSchedule` condition to `ConditionFalse` in `scheduleOne()` if scheduling failed and to `ConditionTrue` in `/bind` subresource.
Ref #24404
@mml (as it seems to be related to "why pending" effort)
<!-- Reviewable:start -->
---
This change is [<img src="http://reviewable.k8s.io/review_button.svg" height="35" align="absmiddle" alt="Reviewable"/>](http://reviewable.k8s.io/reviews/kubernetes/kubernetes/24459)
<!-- Reviewable:end -->
Implements part of #24071
I am not familiar with the scheduler enough to know what to do with the scores. Punting for now.
Missing items from the implementation plan: limitranger, rkt support, kubectl
support and user docs
Automatic merge from submit-queue
add namespace index for cache
@wojtek-t
Implement in this approach make the change of lister.go small, but we should replace all `NewInformer()` to `NewIndexInformer()`, even when someone not want to filter by namespace(eg. gc_controller and scheduler). Any suggestion?