The kubelet sync loop relies on getting one update as the signal that the
specific source is ready. This change ensures that we don't send multiple
updates (ADD, UPDATE) for the first batch of pods. This is required to prevent
the cleanup routine from killing pods prematurely.
PLEG is responsible for listing the pods running on the node. If it's hung
due to a non-responsive container runtime or internal bugs, we should restart
the kubelet.
cleanupTerminatedPods is responsible for checking whether a pod has been
terminated and forcing a status update to trigger the pod deletion. However, this
function is called in the periodic cleanup routine, which runs every 2 seconds.
In other words, it forces a status update for each non-running (and not yet
deleted in the apiserver) pod. When batch deleting tens of pods, the rate of
new updates surpasses what the status manager can handle, causing numerous
redundant requests (and the status channel to be full).
This change forces a status update only when detecting that the DeletionTimestamp is
set for a terminated pod. Note that for other non-terminated pods, the pod
workers should be responsible for setting the correct status after killing all
the containers.
* Metrics will not be exposed until they are hooked up to a handler
* Metrics are not cached and expose a DoS vector; this must be fixed before release, or the stats should not be exposed through an API endpoint
Kubelet.cleanupOrphanedVolumes() compares list of volumes mounted to a node
with list of volumes that are required by pods scheduled on the node
("scheduled volume").
Both lists should contain real volumes, i.e. when a pod uses a
PersistentVolumeClaim, the list must contain the name of the bound volume instead
of the name of the claim.
This change removes RuntimeCache in the pod workers and the syncPod() function.
Note that it doesn't deprecate RuntimeCache completely as other components
still rely on the cache.
Many users attempt to use 'kubectl logs' in order to find the logs
for a container, but receive no logs or an error telling them their
container is not running. The fix in this case is to run with '--previous',
but this does not match user expectations for the logs command.
This commit changes the behavior of the Kubelet to return the logs of
the currently running container or the previous running container unless
the user provides the "previous" flag. If the user specifies "follow"
the logs of the most recent container will be displayed, and if it is
a terminated container the logs will come to an end (the user can
repeatedly invoke 'kubectl logs --follow' and see the same output).
Clean up error messages in the kubelet log path to be consistent and
give users a more predictable experience.
Have the Kubelet return 400 on invalid requests
- Ignore the "not found" error on deletion.
- Recognize the "already exists" error on creation and check if the existing
pod meets requirement. If so, don't report an error.
- Immediately create a mirror pod after a successful deletion, if needed.
This addresses a TODO when collecting the node version information so it
will properly report the configured runtime and its version. Previously,
this was hardcoded to "docker://" and the docker version, and would show
"docker://1.9.1" even when the kubelet was configured to use rkt.
With this change, it will use the runtime's Type() and Version() data.
This also changes the container.Runtime interface to add an APIVersion()
method. This can be used when the runtime has separate versions for the
engine and the API, such as with Docker. The Docker minimum version
validation has been updated to use APIVersion(), and
DockerManager.Version() now returns the engine version.
Add `kube-reserved` and `system-reserved` flags for configuring
reserved resources for usage outside of kubernetes pods. Allocatable is
provided by the Kubelet according to the formula:
```
Allocatable = Capacity - KubeReserved - SystemReserved
```
Also provides a method for estimating a reasonable default for
`KubeReserved`, but the current estimate is probably low and needs
more tuning.
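As a rough illustration of the formula only (hypothetical types and made-up quantities, not the Kubelet's actual resource handling):

```go
package main

import "fmt"

// Illustrative per-resource quantities (milli-CPU and bytes); the real Kubelet
// works with resource.Quantity values from the API types.
type resources struct {
	milliCPU int64
	memory   int64
}

// allocatable subtracts the reservations from capacity, mirroring
// Allocatable = Capacity - KubeReserved - SystemReserved.
func allocatable(capacity, kubeReserved, systemReserved resources) resources {
	return resources{
		milliCPU: capacity.milliCPU - kubeReserved.milliCPU - systemReserved.milliCPU,
		memory:   capacity.memory - kubeReserved.memory - systemReserved.memory,
	}
}

func main() {
	capacity := resources{milliCPU: 4000, memory: 8 << 30}       // 4 CPUs, 8 GiB
	kubeReserved := resources{milliCPU: 200, memory: 512 << 20}   // kubelet, runtime, etc.
	systemReserved := resources{milliCPU: 100, memory: 256 << 20} // sshd, systemd, etc.
	fmt.Printf("%+v\n", allocatable(capacity, kubeReserved, systemReserved))
}
```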
Kubelet doesn't perform checkpointing and loses all its internal state after
restarts. It would then mistake pods from the api server for new pods and attempt
to put them through the admission process. This may result in pods being rejected
even though they are running on the node (e.g., in an out-of-disk situation). This
change adds a condition to check whether the pod was seen before and categorizes
such pods as updates. The change also removes the freeze/unfreeze mechanism used to
work around such cases, since it is no longer needed and it stopped working
correctly ever since we switched to incremental updates.
There has been a recent regression causing kubelet to assume no containers are
running for the pod if kubelet has not seen the pod before. This would cause
all containers to be restarted after kubelet gets restarted. This change fixes
the bug.
Implement a flag that defines the frequency at which a node's out of
disk condition can change its status. Use this flag to suspend out of
disk status changes in the time period specified by the flag, after
the status is changed once.
Set the flag to 0 in e2e tests so that we can predictably test out of
disk node condition.
Also, use util.Clock interface for all time related functionality in
the kubelet. Calling time functions in unversioned package or time
package such as unversioned.Now() or time.Now() makes it really hard
to test such code. It also makes the tests flaky and sometimes
unnecessarily slow due to time.Sleep() calls used to simulate the
time elapsed. So use util.Clock interface instead which can be faked
in the tests.
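A minimal sketch of the pattern (illustrative types, not the actual util.Clock definition): an interface that production code backs with the real time package and tests back with a fake whose time is advanced explicitly.

```go
package main

import (
	"fmt"
	"time"
)

// Clock abstracts time so tests can control it; a sketch of the pattern,
// not the real util.Clock interface.
type Clock interface {
	Now() time.Time
	Since(t time.Time) time.Duration
}

// realClock defers to the standard time package.
type realClock struct{}

func (realClock) Now() time.Time                  { return time.Now() }
func (realClock) Since(t time.Time) time.Duration { return time.Since(t) }

// fakeClock lets tests advance time explicitly instead of calling time.Sleep.
type fakeClock struct{ now time.Time }

func (f *fakeClock) Now() time.Time                  { return f.now }
func (f *fakeClock) Since(t time.Time) time.Duration { return f.now.Sub(t) }
func (f *fakeClock) Step(d time.Duration)            { f.now = f.now.Add(d) }

func main() {
	var _ Clock = realClock{} // production code would use the real clock
	c := &fakeClock{now: time.Unix(0, 0)}
	start := c.Now()
	c.Step(5 * time.Second)     // no sleeping needed in tests
	fmt.Println(c.Since(start)) // 5s
}
```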
Refactor Kubelet's server functionality into a server package. Most
notably, move pkg/kubelet/server.go into
pkg/kubelet/server/server.go. This will lead to better separation of
concerns and a more readable code hierarchy.
The formatting function is used often in logging. This improves readability
by shortening the length of the call. Also change the formatted string to
include the pod UID.
Before this change we have a mish-mash of ways to pass field names around for
error generation: sometimes string field names, sometimes .Prefix(), sometimes
neither, and often wrong names or missing indexes where they are needed.
Instead of that mess, this is part one of a couple of commits that will make it
more strongly typed and hopefully encourage correct behavior. At least you
will have to think about field names, which is better than nothing.
It turned out to be really hard to do this incrementally.
Addresses a version skew issue where the last condition status is always
evaluated as the NodeReady status. As a workaround force the NodeReady
condition to be the last in the list of node conditions.
ref: https://github.com/kubernetes/kubernetes/issues/16961
Set resyncInterval to one minute now that we rely on the generic PLEG to trigger
pod syncs on container events. When there is an error during syncing, pod
workers need to wake up sooner to retry. Set the sync error backoff period to
10 seconds in this case.
This change introduces pod lifecycle event generator (PLEG), and adds a generic
PLEG. The generic PLEG relies on relisting to discover container events, and is
container-runtime-agnostic. Both docker and rkt are changed to use generic
PLEG.
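The core idea of the generic PLEG, sketched with hypothetical types (the real implementation is more involved): periodically relist containers from the runtime, diff against the previous relist, and emit lifecycle events on a channel.

```go
// Sketch of a relist-based event generator; hypothetical types, not the
// actual generic PLEG in pkg/kubelet/pleg.
package main

import (
	"fmt"
	"time"
)

type containerState string

type podRecord map[string]containerState // container ID -> state

type lifecycleEvent struct {
	ContainerID string
	State       containerState
}

type genericPLEG struct {
	relist func() podRecord // lists containers from the runtime
	last   podRecord
	events chan lifecycleEvent
}

// run relists on every tick and emits an event for each container whose
// state changed since the previous relist.
func (g *genericPLEG) run(period time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			current := g.relist()
			for id, state := range current {
				if g.last[id] != state {
					g.events <- lifecycleEvent{ContainerID: id, State: state}
				}
			}
			g.last = current
		}
	}
}

func main() {
	states := []podRecord{{"c1": "running"}, {"c1": "exited"}}
	i := 0
	pleg := &genericPLEG{
		relist: func() podRecord { r := states[i%len(states)]; i++; return r },
		last:   podRecord{},
		events: make(chan lifecycleEvent, 10),
	}
	stop := make(chan struct{})
	go pleg.run(10*time.Millisecond, stop)
	fmt.Println(<-pleg.events) // {c1 running}
	fmt.Println(<-pleg.events) // {c1 exited}
	close(stop)
}
```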
+ Fixes #14992
+ "When deploying a pod using an on-disk kubelet manifest (a la /etc/kubernetes/manifests), it appears that the network plugin setUpPod is notified of the new pod before the apiserver."
Push status updates as soon as readiness state changes for containers,
rather than waiting for the sync loop to update the status. In
particular, this should help new containers to come online faster.
Additionally, consolidates prober test helpers into a single file.
This commit fixes getting the logs from complete/failed pods after
a kubelet restart by falling back to the api server in case we fail
to resolve the pod status using the status cache.
- PeriodSeconds - How often to probe
- SuccessThreshold - Number of successful probes to go from failure to success state
- FailureThreshold - Number of failing probes to go from success to failure state
This commit includes two changes in behavior:
1. InitialDelaySeconds now defaults to 10 seconds, rather than the
kubelet sync interval (although that also defaults to 10 seconds).
2. Prober only retries on probe error, not failure. To compensate, the
default FailureThreshold is set to the maxRetries, 3.
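A minimal sketch of how SuccessThreshold and FailureThreshold might gate state transitions (hypothetical types, not the Kubelet's prober code):

```go
// Sketch of threshold-gated probe state transitions; illustrative only.
package main

import "fmt"

type probeSpec struct {
	SuccessThreshold int // consecutive successes to flip failure -> success
	FailureThreshold int // consecutive failures to flip success -> failure
}

type worker struct {
	spec      probeSpec
	healthy   bool
	successes int
	failures  int
}

// observe records one probe result and returns the (possibly updated) state.
func (w *worker) observe(success bool) bool {
	if success {
		w.successes, w.failures = w.successes+1, 0
		if !w.healthy && w.successes >= w.spec.SuccessThreshold {
			w.healthy = true
		}
	} else {
		w.failures, w.successes = w.failures+1, 0
		if w.healthy && w.failures >= w.spec.FailureThreshold {
			w.healthy = false
		}
	}
	return w.healthy
}

func main() {
	w := &worker{spec: probeSpec{SuccessThreshold: 1, FailureThreshold: 3}, healthy: true}
	for _, ok := range []bool{false, false, false} {
		fmt.Println(w.observe(ok)) // stays true until the third consecutive failure
	}
}
```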
This lays the groundwork for simple multizone capabilities.
In a cloud environment, nodes are typically created by the kubelet
registering with the API server. When creating a new node, we now query
the cloudprovider to see if it can provide Zone information, and if so
we add some well-known labels to the Node we are creating.
- status.Manager always deals with the local (static) pod, but gets the
mirror pod when syncing
- This lets components like the probe workers ignore mirror pods
Now that kubelet checks sources seen correctly, there is no need to enforce the
initial order of pod updates and housekeeping. Use a ticker for housekeeping to
simplify the code.
Currently kubelet syncs all pods every 10s. This is not preferred because
* Some pods may have been sync'd recently.
* This may cause all the pods to be sync'd at once, causing undesirable
CPU spikes.
This PR replaces the global syncs with independent, periodic pod syncs. At the
end of syncing, each pod worker will enqueue itself with a future timestamp
(current time + sync interval) at which it will be due for another sync.
* If the pod worker encounters a sync error, it may requeue with a different
timestamp to retry sooner.
* If a sync is triggered by the update channel (events or spec changes), the
pod worker would enqueue a new sync time.
This change is necessary for moving to long or no periodic sync period once pod
lifecycle event generator is completed. We will still rely on the mechanism to
requeue the pod on sync error.
This change also makes sure that if a sync does not succeed (either due to
real error or the per-container backoff mechanism), an error would be propagated
back to the pod worker, which is responsible for requeuing.
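A rough sketch of the requeuing idea with hypothetical types (not the Kubelet's actual work queue): each pod has a "due" timestamp, and only pods whose timestamp has passed are handed back to their workers.

```go
// Sketch of per-pod requeuing with a future timestamp; illustrative only.
package main

import (
	"fmt"
	"time"
)

type workQueue struct {
	due map[string]time.Time // pod UID -> earliest time the next sync is due
}

// enqueue schedules the pod's next sync `delay` from now.
func (q *workQueue) enqueue(uid string, delay time.Duration) {
	q.due[uid] = time.Now().Add(delay)
}

// getWork returns the pods whose next sync time has passed.
func (q *workQueue) getWork() []string {
	var ready []string
	now := time.Now()
	for uid, t := range q.due {
		if !t.After(now) {
			ready = append(ready, uid)
			delete(q.due, uid)
		}
	}
	return ready
}

func main() {
	q := &workQueue{due: map[string]time.Time{}}
	syncInterval := 50 * time.Millisecond
	backoff := 10 * time.Millisecond

	q.enqueue("pod-a", syncInterval) // normal periodic sync
	q.enqueue("pod-b", backoff)      // retry sooner after a sync error

	time.Sleep(backoff)
	fmt.Println(q.getWork()) // [pod-b]
	time.Sleep(syncInterval)
	fmt.Println(q.getWork()) // [pod-a]
}
```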
Define a new out of disk node condition and use it to report when node
goes out of disk.
Make a copy of the loop range clause variable in the node listers so that it
is available outside the for loop.
Also update/implement unit tests.
This commit builds on previous work and creates an independent
worker for every liveness probe. Liveness probes behave largely the same
as readiness probes, so much of the code is shared by introducing a
probeType parameter to distinguish the type when it matters. The
circular dependency between the runtime and the prober is broken by
exposing a shared liveness ResultsManager, owned by the
kubelet. Finally, an Updates channel is introduced to the ResultsManager
so the kubelet can react to unhealthy containers immediately.
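A minimal sketch of a results cache with an updates channel (illustrative types, not the real ResultsManager): results are stored per container, and a change is published so the kubelet can react right away.

```go
// Sketch of a results cache that publishes changes; illustrative only.
package main

import "fmt"

type result bool

type update struct {
	ContainerID string
	Result      result
}

type resultsManager struct {
	cache   map[string]result
	updates chan update
}

// set records the latest result and emits an update only when it changed.
func (m *resultsManager) set(id string, r result) {
	if prev, ok := m.cache[id]; ok && prev == r {
		return
	}
	m.cache[id] = r
	m.updates <- update{ContainerID: id, Result: r}
}

func main() {
	m := &resultsManager{cache: map[string]result{}, updates: make(chan update, 10)}
	m.set("c1", true)
	m.set("c1", true)  // no new update; result unchanged
	m.set("c1", false) // unhealthy -> the kubelet can act on it immediately
	close(m.updates)
	for u := range m.updates {
		fmt.Println(u.ContainerID, u.Result)
	}
}
```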
Change all references to the container ID in pkg/kubelet/... to the
strong type defined in pkg/kubelet/container: ContainerID
The motivation for this change is to make the format of the ID
unambiguous, specifically whether or not it includes the runtime
prefix (e.g. "docker://").
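Sketched below is the general shape of such a strong ID type (illustrative, not the exact definition in pkg/kubelet/container), which keeps the runtime prefix explicit instead of baking it into a bare string:

```go
// Illustrative strong container ID type; the real definition lives in
// pkg/kubelet/container.
package main

import (
	"fmt"
	"strings"
)

type ContainerID struct {
	Type string // runtime, e.g. "docker"
	ID   string // runtime-specific identifier
}

// String renders the ID with its runtime prefix, e.g. "docker://abc123".
func (c ContainerID) String() string {
	return fmt.Sprintf("%s://%s", c.Type, c.ID)
}

// parseContainerID splits a prefixed ID back into its parts.
func parseContainerID(s string) (ContainerID, error) {
	parts := strings.SplitN(s, "://", 2)
	if len(parts) != 2 {
		return ContainerID{}, fmt.Errorf("invalid container ID %q", s)
	}
	return ContainerID{Type: parts[0], ID: parts[1]}, nil
}

func main() {
	id := ContainerID{Type: "docker", ID: "abc123"}
	fmt.Println(id) // docker://abc123
	parsed, _ := parseContainerID("rkt://deadbeef")
	fmt.Println(parsed.Type, parsed.ID)
}
```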
The current implementation considers a source seen when it receives a SET in
kubelet/config/config.go. However, the main kubelet sync loop may not have
received the pod update from the source via the channel. This change ensures
that the kubelet considers all sources ready only after the sync loop has
seen all the sources.
Each container with a readiness probe has an individual goroutine which
handles periodic probing for that container. The results are cached and
written to the status.Manager in the pod sync path.
Network configuration error message while setting Kubelet status was
being written to "reasons" slice. Write this message to "messages" slice
instead.
Also remove "reasons" slice entirely since it is not used anywhere.
Now that kubelet has switched to incremental updates, it has complete
information of the pod update type (create, update, sync). This change pipes
this information to pod workers so that they don't have to derive the type
again.
Increase the supported controls on pod logging. Add validation to pod
log options. Ensure the Kubelet is using a consistent, structured way to
process pod log arguments.
Add the ?sinceSeconds=<durationInSeconds>, ?sinceTime=<RFC3339>, ?timestamps=<bool>,
?tailLines=<number>, and ?limitBytes=<number> query parameters.
In many cases clients may wish to view not-ready addresses for endpoints
in order to establish set membership prior to a pod being ready. For instance,
a pod that uses the service endpoints to connect to other pods under
the same service may not want to signal ready before it has
contacted at least a minimal number of other pods.
This is backwards compatible with old servers and clients. There is
an additional cost in size of endpoints before services ramp up, which
will add minor CPU and memory use for services that have a significant
number of pods which have not become ready.
The methods for registering a node and syncing node status to the apiserver
have grown large enough that it makes sense for them to live in a separate
place. This change adds a nodeManager to handle such interaction with the
apiserver.
This refactor is in preparation for moving more state handling to the
status manager. It will become the canonical cache for the latest
information on running containers and probe status, as part of the
prober refactoring.
1. Make the reason field of StatusReport objects in the kubelet CamelCase.
2. Add a Message field to ContainerStateWaiting to describe details about the Reason.
3. Make the reason field of Events in the kubelet CamelCase.
4. Update swagger, deep-copy, and so on.
A lot of packages use StringSet, but they don't use anything else from
the util package. Moving StringSet into another package will shrink
their dependency trees significantly.
Both GetNode and the cache.ListWatch ListFunc in the
kubelet package call List unnecessarily.
GetNodeInfo is sufficient for GetNode and makes looping
through a list of nodes to check for a matching name
unnecessary.
resolves #13476
1. Add EventRecordQps and EventBurst parameters to the kubelet.
2. If EventRecordQps and EventBurst are set, rate limit events in the kubelet
with an independent rate limiter configured accordingly.
PR #13293 added a safety check to not remove a pod directory if the child
volumes directory is not empty. This logic is faulty because kubelet may have
directory structures podUID/volumes/volumeKind/volumeName. E.g.,
`056db95d-50ee-11e5-a2e4-42010af0ba1d/volumes/kubernetes.io~empty-dir/default-token-al3r2`
This change fixes that by properly listing all volumes under a pod.
Before, kubelet performed global cleanup tasks every iteration. After
PR #13003, kubelet performs the tasks on every sync interval (10 seconds).
This PR decouples the housekeeping period from the sync interval to ensure
that kubelet cleans up promptly, while not too often (no more than once every
minimum housekeeping period).
Allow the user to specify the resolver configuration file that is used
to determine the default DNS parameters. This defaults to the system's
/etc/resolv.conf.
Currently, whenever there is any update, kubelet would force all pod workers to
sync again, causing resource contention and hence performance degradation.
This commit flips kubelet to use incremental updates (as opposed to snapshots).
This allows us to know what pods have changed and send updates to those pod
workers only. The `SyncPods` function has been replaced with individual
handlers, each handling an operation (ADD, REMOVE, UPDATE). Pod workers are
still triggered periodically, and kubelet performs periodic cleanup as well.
This commit also spawns a new goroutine solely responsible for killing pods.
This is necessary because pod killing could hold up the sync loop for an
indefinitely long amount of time, now that users can define the graceful
termination period in the container spec.
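A rough sketch of the dispatch pattern with hypothetical types (not the Kubelet's actual sync loop): each incremental update carries an operation, and only the matching handler runs.

```go
// Sketch of dispatching incremental pod updates to per-operation handlers;
// illustrative only.
package main

import "fmt"

type podOperation int

const (
	opAdd podOperation = iota
	opUpdate
	opRemove
)

type pod struct{ name string }

type podUpdate struct {
	Op   podOperation
	Pods []pod
}

type handler struct{}

func (handler) HandlePodAdditions(pods []pod) { fmt.Println("add:", pods) }
func (handler) HandlePodUpdates(pods []pod)   { fmt.Println("update:", pods) }
func (handler) HandlePodRemoves(pods []pod)   { fmt.Println("remove:", pods) }

// syncLoopIteration dispatches a single update to the matching handler,
// instead of re-syncing every pod on every change.
func syncLoopIteration(updates <-chan podUpdate, h handler) bool {
	u, ok := <-updates
	if !ok {
		return false
	}
	switch u.Op {
	case opAdd:
		h.HandlePodAdditions(u.Pods)
	case opUpdate:
		h.HandlePodUpdates(u.Pods)
	case opRemove:
		h.HandlePodRemoves(u.Pods)
	}
	return true
}

func main() {
	updates := make(chan podUpdate, 2)
	updates <- podUpdate{Op: opAdd, Pods: []pod{{name: "nginx"}}}
	updates <- podUpdate{Op: opRemove, Pods: []pod{{name: "nginx"}}}
	close(updates)
	for syncLoopIteration(updates, handler{}) {
	}
}
```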
We chose to use podFullName (name_namespace) as key in the status manager
because mirror pod and static pod share the same status. This is no longer
needed because we do not store statuses for static pods anymore (we only
store statuses for their mirror pods). Also, previously, a few fixes were
merged to ensure statuses are cleaned up so that a new pod with the same
name would not reuse an old status.
This change cleans up the code by using UID as key so that the code would
become less brittle.
The sync loop should check for terminated pods that are no longer
running and clear them. The status loop should never write status
if the pod UID changes. Mirror pods should be deleted immediately
rather than gracefully.
Avoid TTL by deleting pods immediately when they aren't
scheduled, and letting the Kubelet delete them otherwise.
Ensure the Kubelet uses pod.Spec.TerminationGracePeriodSeconds
when no pod.DeletionGracePeriodSeconds is available.
Getting the public IP a container is supposed to use is O(hard),
and usually involves ugly gyrations in python or with interfaces.
Using the downward API means that the IP Kube is announcing to
other endpoints is also visible inside the container for pods to
identify themselves.
Eventually we would like to replace the all-encompassing SyncPods function with
more well-defined, smaller functions. This would not only help with the
readability and profiling of the code, it would also set in motion the plan to
trigger pod workers individually based on the content of the pod updates.
This commit serves as the first step of that, while avoiding breaking all unit
tests by preserving the SyncPods function for the time being.