Automatic merge from submit-queue
Add plugin/pkg/scheduler to linted packages
**What this PR does / why we need it**:
Adds plugin/pkg/scheduler to linted packages to improve style correctness.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#41868
**Special notes for your reviewer**:
**Release note**:
```release-note
NONE
```
Automatic merge from submit-queue (batch tested with PRs 43681, 40423, 43562, 43008, 43381)
Changes for removing deadcode in taint_tolerations
**What this PR does / why we need it**:
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#43007
Code cleanup with some modifications and a test-case in taints and tolerations
Code cleanup with some modifications and a test-case in taints and tolerations
Removed unnecessary code from my last commit
Code cleanup with some modifications and a test-case in taints and tolerations
SUggested changes for taints_tolerations
Changes for removing deadcode in taint_tolerations
small changes again
small changes again
Small changes for clear documentation.
Automatic merge from submit-queue (batch tested with PRs 40404, 43134, 43117)
Improve code coverage for scheduler/api/validation
**What this PR does / why we need it**:
Improve code coverage for scheduler/api/validation from #39559
Thanks
**Special notes for your reviewer**:
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 42762, 42739, 42425, 42778)
Fixed potential OutOfSync of nodeInfo.
The cloned NodeInfo still share the same resource objects in cache; it may make `requestedResource` and Pods OutOfSync, for example, if the pod was deleted, the `requestedResource` is updated by Pods are not in cloned info. Found this when investigating #32531 , but seems not the root cause, as nodeInfo are readonly in predicts & priorities.
Sample codes for `&(*)`:
```
package main
import (
"fmt"
)
type Resource struct {
A int
}
type Node struct {
Res *Resource
}
func main() {
r1 := &Resource { A:10 }
n1 := &Node{Res: r1}
r2 := &(*n1.Res)
r2.A = 11
fmt.Printf("%t, %d %d\n", r1==r2, r1, r2)
}
```
Output:
```
true, &{11} &{11}
```
- Added schedulercache.Resource.SetOpaque helper.
- Amend kubelet allocatable sync so that when OIRs are removed from capacity
they are also removed from allocatable.
- Fixes#41861.
Automatic merge from submit-queue (batch tested with PRs 42200, 39535, 41708, 41487, 41335)
Add support for statefulset spreading to the scheduler
**What this PR does / why we need it**:
The scheduler SelectorSpread priority funtion didn't have the code to spread pods of StatefulSets. This PR adds StatefulSets to the list of controllers that SelectorSpread supports.
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes#41513
**Special notes for your reviewer**:
**Release note**:
```release-note
Add the support to the scheduler for spreading pods of StatefulSets.
```
Automatic merge from submit-queue (batch tested with PRs 41937, 41151, 42092, 40269, 42135)
Improve code coverage for scheduler/algorithmprovider/defaults
**What this PR does / why we need it**:
Improve code coverage for scheduler/algorithmprovider/defaults from #39559
Thanks for your review.
**Special notes for your reviewer**:
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 41954, 40528, 41875, 41165, 41877)
Move the scheduler Fake* to testing folder
Address this issue: https://github.com/kubernetes/kubernetes/issues/41662
@timothysc I moved the interface to the types.go, and the Fake* to the testing package, not sure whether you like or not.
Automatic merge from submit-queue (batch tested with PRs 41701, 41818, 41897, 41119, 41562)
Optimization of on-wire information sent to scheduler extender interfaces that are stateful
The changes are to address the issue described in https://github.com/kubernetes/kubernetes/issues/40991
```release-note
Added support to minimize sending verbose node information to scheduler extender by sending only node names and expecting extenders to cache the rest of the node information
```
cc @ravigadde
Automatic merge from submit-queue (batch tested with PRs 41701, 41818, 41897, 41119, 41562)
Allow updates to pod tolerations.
Opening this PR to continue discussion for pod spec tolerations updates when a pod has been scheduled already. This PR is built on top of https://github.com/kubernetes/kubernetes/pull/38957.
@kubernetes/sig-scheduling-pr-reviews @liggitt @davidopp @derekwaynecarr @kubernetes/rh-cluster-infra
Automatic merge from submit-queue (batch tested with PRs 41994, 41969, 41997, 40952, 40576)
Guaranteed admission for Critical Pods
This is the first step in implementing node-level preemption for critical pods.
It defines the AdmissionFailureHandler interface, which allows callers, like the kubelet, to define how failed predicates are handled, and take steps to correct failures if necessary.
In the kubelet's implementation, it triggers preemption if the pod being admitted is critical, and if the only failed predicates are InsufficientResourceErrors, then it prempts (not yet implemented) other other pods to allow admission of the critical pod.
cc: @vishh
node cache don't need to get all the information about every
candidate node. For huge clusters, sending full node information
of all nodes in the cluster on the wire every time a pod is scheduled
is expensive. If the scheduler is capable of caching node information
along with its capabilities, sending node name alone is sufficient.
These changes provide that optimization in a backward compatible way
- removed the inadvertent signature change of Prioritize() function
- added defensive checks as suggested
- added annotation specific test case
- updated the comments in the scheduler types
- got rid of apiVersion thats unused
- using *v1.NodeList as suggested
- backing out pod annotation update related changes made in the
1st commit
- Adjusted the comments in types.go and v1/types.go as suggested
in the code review
Where possible, switch the scheduler to use generated listers and
informers. There are still some places where it probably makes more
sense to use one-off reflectors/informers (listing/watching just a
single node, listing/watching scheduled & unscheduled pods using a field
selector).
Automatic merge from submit-queue
Add scheduler predicate to filter for max Azure disks attached
**What this PR does / why we need it**: This PR adds scheduler predicates for maximum Azure Disks count. This allows to use the environment variable KUBE_MAX_PD_VOLS on scheduler the same as it's already possible with GCE and AWS.
This is needed as we need a way to specify the maximum attachable disks on Azure to avoid permanently failing disk attachment in cases k8s scheduled too many PODs with AzureDisk volumes onto the same node.
I've chosen 16 as the default value for DefaultMaxAzureDiskVolumes even though it may be too high for many smaller VM types and too low for the larger VM types. This means, the default behavior may change for clusters with large VM types. For smaller VM types, the behavior will not change (it will keep failing attaching).
In the future, the value should be determined at run time on a per node basis, depending on the VM size. I know that this is already implemented in the ongoing Azure Managed Disks work, but I don't remember where to find this anymore and also forgot who was working on this. Maybe @colemickens can help here.
**Release note**:
```release-note
Support KUBE_MAX_PD_VOLS on Azure
```
CC @colemickens @brendandburns
Automatic merge from submit-queue
Adjust nodiskconflict support based on iscsi multipath.
With the multipath support is in place, to declare whether both iscsi disks are same, we need to only depend on IQN.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
Automatic merge from submit-queue
Improved code coverage for plugin/pkg/scheduler/algorithm/priorities…
…/most_requested.go
**What this PR does / why we need it**:
Part of #39559 , code coverage improved from 70+% to 80+%
Automatic merge from submit-queue (batch tested with PRs 41378, 41413, 40743, 41155, 41385)
'core' package to prevent dependency creep and isolate core functiona…
**What this PR does / why we need it**:
Solves these two problems:
- Top level Scheduler root directory has several files in it that are needed really by the factory and algorithm implementations. Thus they should be subpackages of scheduler.
- In addition scheduler.go and generic_scheduler.go don't naturally differentiate themselves when they are in the same package. scheduler.go is eseentially the daemon entry point and so it should be isolated from the core
*No release note needed*
Automatic merge from submit-queue
Removed a space in portforward.go.
**What this PR does / why we need it**:
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 40696, 39914, 40374)
Forgiveness library changes
**What this PR does / why we need it**:
Splited from #34825, contains library changes that are needed to implement forgiveness:
1. ~~make taints-tolerations matching respect timestamps, so that one toleration can just tolerate a taint for only a period of time.~~ As TaintManager is caching taints and observing taint changes, time-based checking is now outside the library (in TaintManager). see #40355.
2. make tolerations respect wildcard key.
3. add/refresh some related functions to wrap taints-tolerations operation.
**Which issue this PR fixes**:
Related issue: #1574
Related PR: #34825, #39469
~~Please note that the first 2 commits in this PR come from #39469 .~~
**Special notes for your reviewer**:
~~Since currently we have `pkg/api/helpers.go` and `pkg/api/v1/helpers.go`, there are some duplicated periods of code laying in these two files.~~
~~Ideally we should move taints-tolerations related functions into a separate package (pkg/util/taints), and make it a unified set of implementations. But I'd just suggest to do it in a follow-up PR after Forgiveness ones done, in case of feature Forgiveness getting blocked to long.~~
**Release note**:
```release-note
make tolerations respect wildcard key
```
Automatic merge from submit-queue (batch tested with PRs 40405, 38601, 40083, 40730)
fix typo
**What this PR does / why we need it**:
fix typo.
**Release note**:
```NONE
```
Automatic merge from submit-queue (batch tested with PRs 34543, 40606)
sync client-go and move util/workqueue
The vision of client-go is that it provides enough utilities to build a reasonable controller. It has been copying `util/workqueue`. This makes it authoritative.
@liggitt I'm getting really close to making client-go authoritative ptal.
approved based on https://github.com/kubernetes/kubernetes/issues/40363
Automatic merge from submit-queue
Don't require failureDomains in PodAffinityChecker
`failureDomains` are only used for `PreferredDuringScheduling` pod
anti-affinity, which is ignored by `PodAffinityChecker`.
This unnecessary requirement was making it hard to move
`PodAffinityChecker` to `GeneralPredicates` because that would require
passing `--failure-domains` to both `kubelet` and `kube-controller-manager`.
Automatic merge from submit-queue (batch tested with PRs 40543, 39999)
Improve code coverage for scheduler/algorithm/priorities
**What this PR does / why we need it**:
Improve code coverage for scheduler/algorithm/priorities from #39559
This is my first unit test for kubernetes , thanks for your review.
**Special notes for your reviewer**:
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 39538, 40188, 40357, 38214, 40195)
Decoupling scheduler creation from creation of scheduler.Config struc…
**What this PR does / why we need it**:
Adds functionality to the scheduler to initialize from an Configurator interface, rather then via a Config struct.
**Which issue this PR fixes**
Reduces coupling to `scheduler.Config` data structure format so that we can proliferate more interface driven composition of scheduler components.
Automatic merge from submit-queue
move client/cache and client/discovery to client-go
mechanical changes to move those packages. Had to create a `k8s.io/kubernetes/pkg/client/tests` package for tests that were blacklisted from client-go. We can rewrite these tests later and move them, but for now they'll still run at least.
@caesarxuchao @sttts
Automatic merge from submit-queue
Skip schedule deleting pod
Since binding a deleting pod will always return fail, we should skip that kind of pod early
These files have been created lately, so we don't have much information
about them anyway, so let's just:
- Remove assignees and make them approvers
- Copy approves as reviewers
Automatic merge from submit-queue (batch tested with PRs 36693, 40154, 40170, 39033)
make client-go authoritative for pkg/client/restclient
Moves client/restclient to client-go and a util/certs, util/testing as transitives.
Automatic merge from submit-queue (batch tested with PRs 36693, 40154, 40170, 39033)
Minor hygiene in scheduler.
**What this PR does / why we need it**:
Minor cleanups in scheduler, related to PR #31652.
- Unified lazy opaque resource caching.
- Deleted a commented-out line of code.
**Release note**:
```release-note
N/A
```
Automatic merge from submit-queue
move pkg/fields to apimachinery
Purely mechanical move of `pkg/fields` to apimachinery.
Discussed with @lavalamp on slack. Moving this an `labels` to apimachinery.
@liggitt any concerns? I think the idea of field selection should become generic and this ends up shared between client and server, so this is a more logical location.
Automatic merge from submit-queue (batch tested with PRs 39898, 39904)
[scheduler] interface for config
**What this PR fixes**
This PR converts the Scheduler configuration factory into an interface, so that
- the scheduler_perf and scheduler integration tests dont rely on the struct for their implementation
- the exported functionality of the factory (i.e. what it needs to provide to create a scheduler configuration) is completely explicit, rather then completely coupled to a struct.
- makes some parts of the factory immutable, again to minimize possible coupling.
This makes it easier to make a custom factory in instances where we might specifically want to import scheduler logic without actually reusing the entire scheduler codebase.
Automatic merge from submit-queue
Corrected a typo in scheduler factory.go.
**What this PR does / why we need it**:
**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
**Special notes for your reviewer**:
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 39945, 39601)
bugfix for PodToleratesNodeTaints
`PodToleratesNodeTaints`predicate func should return true if pod has no toleration annotations and node's taint effect is `PreferNoSchedule`
Automatic merge from submit-queue (batch tested with PRs 37680, 39968)
Update Owners for Scheduler
Update Owners file for scheduler component to spread the reviews around.
/cc @davidopp per previous sig-mtg.
Automatic merge from submit-queue (batch tested with PRs 39807, 37505, 39844, 39525, 39109)
Made cache.Controller to be interface.
**What this PR does / why we need it**:
#37504
Automatic merge from submit-queue
replace global registry in apimachinery with global registry in k8s.io/kubernetes
We'd like to remove all globals, but our immediate problem is that a shared registry between k8s.io/kubernetes and k8s.io/client-go doesn't work. Since client-go makes a copy, we can actually keep a global registry with other globals in pkg/api for now.
@kubernetes/sig-api-machinery-misc @lavalamp @smarterclayton @sttts
Automatic merge from submit-queue (batch tested with PRs 39803, 39698, 39537, 39478)
[scheduling] Moved pod affinity and anti-affinity from annotations to api fields #25319
Converted pod affinity and anti-affinity from annotations to api fields
Related: #25319
Related: #34508
**Release note**:
```Pod affinity and anti-affinity has moved from annotations to api fields in the pod spec. Pod affinity or anti-affinity that is defined in the annotations will be ignored.```
Automatic merge from submit-queue (batch tested with PRs 39803, 39698, 39537, 39478)
Use controller interface for everything in config factory
**What this PR does / why we need it**:
We want to replace controller structs with interfaces
- per the TODO in `ControllerInterface`
- Specifically this will make the decoupling from Config and reuse of the scheduler's subcomponents cleaner.
Automatic merge from submit-queue (batch tested with PRs 39230, 39718)
Remove special case for StatefulSets in scheduler
**What this PR does / why we need it**: Removes special case for StatefulSet in scheduler code
/ref: https://github.com/kubernetes/kubernetes/issues/39687
**Special notes for your reviewer**:
**Release note**:
```release-note
Scheduler treats StatefulSet pods as belonging to a single equivalence class.
```
Automatic merge from submit-queue (batch tested with PRs 39684, 39577, 38989, 39534, 39702)
Set PodStatus QOSClass field
This PR continues the work for https://github.com/kubernetes/kubernetes/pull/37968
It converts all local usage of the `qos` package class types to the new API level types (first commit) and sets the pod status QOSClass field in the at pod creation time on the API server in `PrepareForCreate` and in the kubelet in the pod status update path (second commit). This way the pod QOS class is set even if the pod isn't scheduled yet.
Fixes#33255
@ConnorDoyle @derekwaynecarr @vishh
Automatic merge from submit-queue
Fix comment and optimize code
**What this PR does / why we need it**:
Fix comment and optimize code.
Thanks.
**Special notes for your reviewer**:
**Release note**:
```release-note
```
Automatic merge from submit-queue (batch tested with PRs 39075, 39350, 39353)
Move pkg/api.{Context,RequestContextMapper} into pkg/genericapiserver/api/request
**Based on #39350**
Automatic merge from submit-queue
Ensure that assumed pod won't be expired until the end of binding
In case if api server is overloaded and will reply with 429 too many requests error, binding may take longer than ttl of scheduler cache for assumed pods 1199d42210/pkg/client/restclient/request.go (L787-L850)
This problem was mitigated by using this fix e4d215d508 and increased rate limit for api server. But it is possible that it will occur again.
In such cases when api server is overloaded and returns a lot of
429 (too many requests) errors - binding may take a lot of time
to succeed due to retry policy implemented in rest client.
In such events cache ttl for assumed pods wasn't big enough.
In order to minimize probability of such errors ttl for assumed pods
will be counted from the time when binding for particular pod is finished
(either with error or success)
Change-Id: Ib0122f8a76dc57c82f2c7c52497aad1bdd8be411
Automatic merge from submit-queue
Avoid unnecessary memory allocations
Low-hanging fruits in saving memory allocations. During our 5000-node kubemark runs I've see this:
ControllerManager:
- 40.17% k8s.io/kubernetes/pkg/util/system.IsMasterNode
- 19.04% k8s.io/kubernetes/pkg/controller.(*PodControllerRefManager).Classify
Scheduler:
- 42.74% k8s.io/kubernetes/plugin/pkg/scheduler/algrorithm/predicates.(*MaxPDVolumeCountChecker).filterVolumes
This PR is eliminating all of those.
Automatic merge from submit-queue
Optimize pod affinity when predicate
Optimize by returning as early as possible to avoid invoking priorityutil.PodMatchesTermsNamespaceAndSelector.
Automatic merge from submit-queue
Add use case to service affinity
Also part of nits in refactoring predicates, I found the explanation of `serviceaffinity` in its comment is very hard to understand. So I added example instead here to help user/developer to digest it.
Automatic merge from submit-queue (batch tested with PRs 38154, 38502)
Rename "release_1_5" clientset to just "clientset"
We used to keep multiple releases in the main repo. Now that [client-go](https://github.com/kubernetes/client-go) does the versioning, there is no need to keep releases in the main repo. This PR renames the "release_1_5" clientset to just "clientset", clientset development will be done in this directory.
@kubernetes/sig-api-machinery @deads2k
```release-note
The main repository does not keep multiple releases of clientsets anymore. Please find previous releases at https://github.com/kubernetes/client-go
```
Automatic merge from submit-queue (batch tested with PRs 38648, 38747)
fix typo
**What this PR does / why we need it**:
fix typo.
**Release note**:
```NONE
```
Automatic merge from submit-queue (batch tested with PRs 35436, 37090, 38700)
Make iscsi pv/pvc aware of nodiskconflict feature
Being iscsi a `RWO, ROX` volume we should conflict if more than one pod is using same iscsi LUN.
Signed-off-by: Humble Chirammal <hchiramm@redhat.com>
Automatic merge from submit-queue (batch tested with PRs 38377, 36365, 36648, 37691, 38339)
Do not create selector and namespaces in a loop where possible
With 1000 nodes and 5000 pods (5 pods per node) with anti-affinity a lot of CPU wasted on creating LabelSelector and sets.String (map).
With this change we are able to deploy that number of pods in ~25 minutes. Without - it takes 30 minutes to deploy 500 pods with anti-affinity configured.
failureDomains are only used for PreferredDuringScheduling pod
anti-affinity, which is ignored by PodAffinityChecker.
This unnecessary requirement was making it hard to move
PodAffinityChecker to GeneralPredicates because that would require
passing --failure-domains to both kubelet and kube-controller-manager.
Automatic merge from submit-queue (batch tested with PRs 38076, 38137, 36882, 37634, 37558)
[scheduler] Use V(10) for anything which may be O(N*P) logging
Fixes#37014
This PR makes sure that logging statements which are capable of being called on a perNode / perPod basis (i.e. non essential ones that will just clog up logs at large scale) are at V(10) level.
I dreamt of a levenstein filter that built a weak map of word frequencies and alerted once log throughput increased w/o varying information content.... but then I woke up and realized this is probably all we really need for now :)
Automatic merge from submit-queue
Revert "Avoid hard-coding list of Node Conditions"
* we don't know how other API consumers are using node conditions (there was no prior expectation that the scheduler would block on custom conditions)
* not all conditions map directly to schedulability (e.g. `MemoryPressure`/`DiskPressure`)
* not all conditions use True to mean "unschedulable" (e.g. `Ready`)
This reverts commit 511b2ecaa8 to avoid breaking existing API users and to avoid constraining future uses of the node conditions API
Automatic merge from submit-queue
split scheduler priorities into separate files
In the current state it's really hard to find a thing one is looking for, if he doesn't know already know where to look. cc @davidopp
- Prevents kubelet from overwriting capacity during sync.
- Handles opaque integer resources in the scheduler.
- Adds scheduler predicate tests for opaque resources.
- Validates opaque int resources:
- Ensures supplied opaque int quantities in node capacity,
node allocatable, pod request and pod limit are integers.
- Adds tests for new validation logic (node update and pod spec).
- Added e2e tests for opaque integer resources.
Automatic merge from submit-queue
Make "Unschedulable" reason a constant in api
String "Unschedulable" is used in couple places in K8S:
* scheduler
* federation replicaset and deployment controllers
* cluster autoscaler
* rescheduler
This PR makes the string a part of API so it not changed.
cc: @davidopp @fgrzadkowski @wojtek-t
Automatic merge from submit-queue
Create restclient interface
Refactoring of code to allow replace *restclient.RESTClient with any RESTClient implementation that implements restclient.RESTClientInterface interface.
Automatic merge from submit-queue
Predicate cacheing and cleanup
Fix to #31795
First pass @ cleanup and caching of the CheckServiceAffinity function.
The cleanup IMO is necessary because the logic around the pod listing and the use of the "implicit selector" (which is reverse engineered to enable the homogenous pod groups).
Should still pass the E2Es.
@timothysc @wojtek-t
Comments addressed, Make emptyMetadataProducer a func to avoid casting,
FakeSvcLister: remove error return for len(svc)=0. New test for
predicatePrecomp to make method semantics explictly enforced when meta
is missing. Precompute wrapper.
Automatic merge from submit-queue
PVC informer lister supports listing
This will be used in follow-on PRs for quota evaluation backed by informers for pvcs.
/cc @deads2k @eparis @kubernetes/sig-storage
Most normal codec use should perform defaulting. DirectCodecs should not
perform defaulting. Update the defaulting_test to fuzz the list of known
defaulters. Use the new versioning.NewDefaultingCodec() method.
Automatic merge from submit-queue
Refactor scheduler to enable switch on equivalence cache
Part 2 of #30844
Refactoring to enable easier switch on of equivalence cache.
Automatic merge from submit-queue
Fix typos and englishify plugin/pkg
**What this PR does / why we need it**: Just typos
**Which issue this PR fixes**: `None`
**Special notes for your reviewer**: Just typos
**Release note**: `NONE`