Automatic merge from submit-queue
Define interfaces for kubelet pod admission and eviction
There is too much code and logic in `kubelet.go` that makes it hard to test functions in discrete pieces.
I propose an interface that an internal module can implement that will let it make an admission decision for a pod. If folks are ok with the pattern, I want to move the a) predicate checking, b) out of disk, c) eviction preventing best-effort pods being admitted into their own dedicated handlers that would be easier for us to mock test. We can then just write tests to ensure that the `Kubelet` calls a call-out, and we can write easier unit tests to ensure that dedicated handlers do the right thing.
The second interface I propose was a `PodEvictor` that is invoked in the main kubelet sync loop to know if pods should be pro-actively evicted from the machine. The current active deadline check should move into a simple evictor implementation, and I want to plug the out of resource killer code path as an implementation of the same interface.
@vishh @timothysc - if you guys can ack on this, I will add some unit testing to ensure we do the call-outs.
/cc @kubernetes/sig-node @kubernetes/rh-cluster-infra
- Expand Attacher/Detacher interfaces to break up work more
explicitly.
- Add arguments to all functions to avoid having implementers store
the data needed for operations.
- Expand unit tests to check that Attach, Detach, WaitForAttach,
WaitForDetach, MountDevice, and UnmountDevice get call where
appropriet.
Automatic merge from submit-queue
Move predicates into library
This PR tries to implement #12744
Any suggestions/ideas are welcome. @davidopp
current state: integration test fails if including podCount check in Kubelet.
DONE:
1. refactor all predicates: predicates return fitOrNot(bool) and error(Error) in which the latter is of type PredicateFailureError or InsufficientResourceError
2. GeneralPredicates() is a predicate function, which includes serveral other predicate functions (PodFitsResource, PodFitsHost, PodFitsHostPort). It is registered as one of the predicates in DefaultAlgorithmProvider, and is also called in canAdmitPod() in Kubelet and should be called by other components (like rescheduler, etc if necessary. See discussion in issue #12744
TODO:
1. determine which predicates should be included in GeneralPredicates()
2. separate GeneralPredicates() into: a.) GeneralPredicatesEvictPod() and b.) GeneralPredicatesNotEvictPod()
3. DaemonSet should use GeneralPredicates()
DONE:
1. refactor all predicates: predicates return fitOrNot(bool) and error(Error) in which the latter is of type
PredicateFailureError or InsufficientResourceError. (For violation of either MaxEBSVolumeCount or
MaxGCEPDVolumeCount, returns one same error type as ErrMaxVolumeCountExceeded)
2. GeneralPredicates() is a predicate function, which includes serveral other predicate functions (PodFitsResource,
PodFitsHost, PodFitsHostPort). It is registered as one of the predicates in DefaultAlgorithmProvider, and
is also called in canAdmitPod() in Kubelet and should be called by other components (like rescheduler, etc)
if necessary. See discussion in issue #12744
3. remove podNumber check from GeneralPredicates
4. HostName is now verified in Kubelet's canAdminPod(). add TestHostNameConflicts in kubelet_test.go
5. add getNodeAnyWay() method in Kubelet to get node information in standaloneMode
TODO:
1. determine which predicates should be included in GeneralPredicates()
2. separate GeneralPredicates() into:
a. GeneralPredicatesEvictPod() and
b. GeneralPredicatesNotEvictPod()
3. DaemonSet should use GeneralPredicates()
Kubelet.cleanupOrphanedVolumes() compares list of volumes mounted to a node
with list of volumes that are required by pods scheduled on the node
("scheduled volume").
Both lists should contain real volumes, i.e. when a pod uses
PersistentVolumeClaim, the list must contain name of the bound volume instead
of name of the claim.
This change removes RuntimeCache in the pod workers and the syncPod() function.
Note that it doesn't deprecate RuntimeCache completely as other components
still rely on the cache.
Many users attempt to use 'kubectl logs' in order to find the logs
for a container, but receive no logs or an error telling them their
container is not running. The fix in this case is to run with '--previous',
but this does not match user expectations for the logs command.
This commit changes the behavior of the Kubelet to return the logs of
the currently running container or the previous running container unless
the user provides the "previous" flag. If the user specifies "follow"
the logs of the most recent container will be displayed, and if it is
a terminated container the logs will come to an end (the user can
repeatedly invoke 'kubectl logs --follow' and see the same output).
Clean up error messages in the kubelet log path to be consistent and
give users a more predictable experience.
Have the Kubelet return 400 on invalid requests
This updates the dockertools.dockerVersion to use a semantic versioning
library to more gracefully support engine versions which include
additional version fields.
Previously, go-dockerclient's APIVersion struct was use which only
handles plain numeric x.y.z version strings. With #19675, the library
was now used on the Docker engine string, however it is possible for the
engine string to include include additional information for beta, rc, or
distro specific builds.
This PR also enables the TestDockerRuntimeVersion test which was
previously just a FIXME and updates it to pass, and be used to test the
version string that cause #20005.
This negates the need for fsouza/go-dockerclient#451, since even with
that change, if a user was running Docker 1.10.0-rc1, this would cause
the kubelet to report it as simply 1.10.0.
- Ignore the "not found" error on deletion.
- Recognize the "already exists" error on creation and check if the existing
pod meets requirement. If so, don't report an error.
- Immediately create a mirror pod after a successful deletion, if needed.
This address a TODO when collecting the node version information so it
will properly report the configured runtime and its version. Previously,
this was hardcoded to "docker://" and the docker version, and would show
"docker://1.9.1" even when the kubelet was configured to use rkt.
With this change, it will use the runtime's Type() and Version() data.
This also changes the container.Runtime interface to add an APIVersion()
method. This can be used when the runtime has separate versions for the
engine and the API, such as with Docker. The Docker minimum version
validation has been updated to use APIVersion(), and
DockerManager.Version() now returns the engine version.
Add `kube-reserved` and `system-reserved` flags for configuration
reserved resources for usage outside of kubernetes pods. Allocatable is
provided by the Kubelet according to the formula:
```
Allocatable = Capacity - KubeReserved - SystemReserved
```
Also provides a method for estimating a reasonable default for
`KubeReserved`, but the current implementation probably is low and needs
more tuning.
Kubelet doesn't perform checkpointing and loses all its internal states after
restarts. It'd then mistaken pods from the api server as new pods and attempt
to go through the admission process. This may result in pods being rejected
even though they are running on the node (e.g., out of disk situation). This
change adds a condition to check whether the pod was seen before and categorize
such pods as updates. The change also removes freeze/unfreeze mechanism used to
work around such cases, since it is no longer needed and it stopped working
correctly ever since we switched to incremental updates.
Implement a flag that defines the frequency at which a node's out of
disk condition can change its status. Use this flag to suspend out of
disk status changes in the time period specified by the flag, after
the status is changed once.
Set the flag to 0 in e2e tests so that we can predictably test out of
disk node condition.
Also, use util.Clock interface for all time related functionality in
the kubelet. Calling time functions in unversioned package or time
package such as unversioned.Now() or time.Now() makes it really hard
to test such code. It also makes the tests flaky and sometimes
unnecessarily slow due to time.Sleep() calls used to simulate the
time elapsed. So use util.Clock interface instead which can be faked
in the tests.
Refactor Kubelet's server functionality into a server package. Most
notably, move pkg/kubelet/server.go into
pkg/kubelet/server/server.go. This will lead to better separation of
concerns and a more readable code hierarchy.
- Add volume.MetricsProvider function to Volume interface.
- Add volume.MetricsDu for providing metrics via executing "du".
- Add volulme.MetricsNil for unsupported Volumes.
Addresses a version skew issue where the last condition status is always
evaluated as the NodeReady status. As a workaround force the NodeReady
condition to be the last in the list of node conditions.
ref: https://github.com/kubernetes/kubernetes/issues/16961
This change introduces pod lifecycle event generator (PLEG), and adds a generic
PLEG. The generic PLEG relies on relisting to discover container events, and is
container-runtime-agnostic. Both docker and rkt are changed to use generic
PLEG.
- status.Manager always deals with the local (static) pod, but gets the
mirror pod when syncing
- This lets components like the probe workers ignore mirror pods
Now that kubelet checks sources seen correctly, there is no need to enforce the
initial order of pod updates and housekeeping. Use a ticker for housekeeping to
simplify the code.
Currently kubelet syncs all pods every 10s. This is not preferred because
* Some pods may have been sync'd recently.
* This may cause all the pods to be sync'd at once, causing undesirable
CPU spikes.
This PR replaces the global syncs with independent, periodic pod syncs. At the
end of syncing, each pod worker will enqueue itslef with a future timestamp (
current time + sync interval), when it will be due for another sync.
* If the pod worker encoutners an sync error, it may requeue with a different
timestamp to retry sooner.
* If a sync is triggered by the update channel (events or spec changes), the
pod worker would enqueue a new sync time.
This change is necessary for moving to long or no periodic sync period once pod
lifecycle event generator is completed. We will still rely on the mechanism to
requeue the pod on sync error.
This change also makes sure that if a sync does not succeed (either due to
real error or the per-container backoff mechanism), an error would be propagated
back to the pod worker, which is responsible for requeuing.
Define a new out of disk node condition and use it to report when node
goes out of disk.
Make a copy of loop range clause variable in node listers so that it
is available outside the for loop.
Also update/implement unit tests.
This commit builds on previous work and creates an independent
worker for every liveness probe. Liveness probes behave largely the same
as readiness probes, so much of the code is shared by introducing a
probeType paramater to distinguish the type when it matters. The
circular dependency between the runtime and the prober is broken by
exposing a shared liveness ResultsManager, owned by the
kubelet. Finally, an Updates channel is introduced to the ResultsManager
so the kubelet can react to unhealthy containers immediately.
Change all references to the container ID in pkg/kubelet/... to the
strong type defined in pkg/kubelet/container: ContainerID
The motivation for this change is to make the format of the ID
unambiguous, specifically whether or not it includes the runtime
prefix (e.g. "docker://").
The current implementation considers a source seen when it receives a SET at
kubelet/config/config.go. However, the main kubelet sync loop may not have
received the pod update from the source via the channel. This change ensures
that kubelet would consider all sources are ready only after the sync loop has
seen all the sources.
Each container with a readiness has an individual go-routine which
handles periodic probing for that container. The results are cached, and
written to the status.Manager in the pod sync path.
In many cases clients may wish to view not ready addresses for endpoints
in order to do set membership prior to a pod being ready. For instance,
a pod that uses the service endpoints to connect to other pods under
the same service, but does not want to signal ready before it has
contacted at least a minimal number of other pods.
This is backwards compatible with old servers and clients. There is
an additional cost in size of endpoints before services ramp up, which
will add minor CPU and memory use for services that have a significant
number of pods which have not become ready.
This refactor is in preparation for moving more state handling to the
status manager. It will become the canonical cache for the latest
information on running containers and probe status, as part of the
prober refactoring.
1. Make reason field of StatusReport objects in kubelet in CamelCase format.
2. Add Message field for ContainerStateWaiting to describe detail about Reason.
3. Make reason field of Events in kubelet in CamelCase format.
4. Update swagger,deep-copy and so on.
Both GetNode and the cache.ListWatch listfunc in the
kubelet package call List unnecessary.
GetNodeInfo is sufficient for GetNode and makes looping
through a list of nodes to check for a matching name
unnecessary.
resolves#13476
Allow the user to specify the resolver configuration file that is used
to determine the default DNS parameters. This defaults to the system's
/etc/resolv.conf.
Currently, whenever there is any update, kubelet would force all pod workers to
sync again, causing resource contention and hence performance degradation.
This commit flips kubelet to use incremental updates (as opposed to snapshots).
This allows us to know what pods have changed and send updates to those pod
workers only. The `SyncPods` function has been replaced with individual
handlers, each handling an operation (ADD, REMOVE, UPDATE). Pod workers are
still triggered periodically, and kubelet performs periodic cleanup as well.
This commit also spawns a new goroutine solely responsible for killing pods.
This is necessary because pod killing could hold up the sync loop for
indefinitely long amount of time now user can define the graceful termination
period in the container spec.
We chose to use podFullName (name_namespace) as key in the status manager
because mirror pod and static pod share the same status. This is no longer
needed because we do not store statuses for static pods anymore (we only
store statuses for their mirror pods). Also, reviously, a few fixes were
merged to ensure statuses are cleaned up so that a new pod with the same
name would not resuse an old status.
This change cleans up the code by using UID as key so that the code would
become less brittle.
Getting the public IP a container is supposed to use is O(hard),
and usually involves ugly gyrations in python or with interfaces.
Using the downward API means that the IP Kube is announcing to
other endpoints is also visible inside the container for pods to
identify themselves.