The kubelet sync loop relies on getting one update as the signal that the
specific source is ready. This change ensures that we don't send multiple
updates (ADD, UPDATE) for the first batch of pods. This is required to prevent
the cleanup routine from killing pods prematurely.
Kubelet doesn't perform checkpointing and loses all its internal states after
restarts. It'd then mistaken pods from the api server as new pods and attempt
to go through the admission process. This may result in pods being rejected
even though they are running on the node (e.g., out of disk situation). This
change adds a condition to check whether the pod was seen before and categorize
such pods as updates. The change also removes freeze/unfreeze mechanism used to
work around such cases, since it is no longer needed and it stopped working
correctly ever since we switched to incremental updates.
The formatting function is used often in logging. This improves the readability
by shortening the length of the call. Also change the fomartted string to
include the pod UID.
Before this change we have a mish-mash of ways to pass field names around for
error generation. Sometimes string fieldnames, sometimes .Prefix(), sometimes
neither, often wrong names or not indexed when it should be.
Instead of that mess, this is part one of a couple of commits that will make it
more strongly typed and hopefully encourage correct behavior. At least you
will have to think about field names, which is better than nothing.
It turned out to be really hard to do this incrementally.
In PodConfigNotificationIncremental PodConfig mode, when no pods are available
for a source, the Merge function correctly concluded that neither ADD, UPDATE nor
REMOVE updates are to be sent to the kubelet. But as a consequence the kubelet will
not mark that source as seen.
This is usually not a problem for the apiserver source. But it is a problem for
an empty "file" source, e.g. by passing an empty directory to the kubelet for
static pods. Then the file source will never be seen and the kubelet will stay
in its special not-all-source-seen mode.
The current implementation considers a source seen when it receives a SET at
kubelet/config/config.go. However, the main kubelet sync loop may not have
received the pod update from the source via the channel. This change ensures
that kubelet would consider all sources are ready only after the sync loop has
seen all the sources.
kubelet relies on source annotation to distinguish pods from different sources.
It should always annotate the source when receiving the pod to avoid any
confusion. This commit fixes that and also make sure we don't overwrite the
first seen time.
A lot of packages use StringSet, but they don't use anything else from
the util package. Moving StringSet into another package will shrink
their dependency trees significantly.
When the pod annotations are updated in the apiserver, update the pod.
Annotations may be used to convey attributes that are required to the
pod execution, such as networking parameters.
Currently, whenever there is any update, kubelet would force all pod workers to
sync again, causing resource contention and hence performance degradation.
This commit flips kubelet to use incremental updates (as opposed to snapshots).
This allows us to know what pods have changed and send updates to those pod
workers only. The `SyncPods` function has been replaced with individual
handlers, each handling an operation (ADD, REMOVE, UPDATE). Pod workers are
still triggered periodically, and kubelet performs periodic cleanup as well.
This commit also spawns a new goroutine solely responsible for killing pods.
This is necessary because pod killing could hold up the sync loop for
indefinitely long amount of time now user can define the graceful termination
period in the container spec.
Add a new latency metric for the time from seeing the pod for the first time
to starting a pod worker for it.
Also, change PodStartLatency to include this initial processing latency.
This commit wires together the graceful delete option for pods
on the Kubelet. When a pod is deleted on the API server, a
grace period is calculated that is based on the
Pod.Spec.TerminationGracePeriodInSeconds, the user's provided grace
period, or a default. The grace period can only shrink once set.
The value provided by the user (or the default) is set onto metadata
as DeletionGracePeriod.
When the Kubelet sees a pod with DeletionTimestamp set, it uses the
value of ObjectMeta.GracePeriodSeconds as the grace period
sent to Docker. When updating status, if the pod has DeletionTimestamp
set and all containers are terminated, the Kubelet will update the
status one last time and then invoke Delete(pod, grace: 0) to
clean up the pod immediately.