We chose to use podFullName (name_namespace) as the key in the status manager
because a mirror pod and its static pod share the same status. This is no
longer needed because we no longer store statuses for static pods (we only
store statuses for their mirror pods). Also, previously, a few fixes were
merged to ensure that statuses are cleaned up so that a new pod with the same
name would not reuse an old status.
This change cleans up the code by using UID as the key, which makes the code
less brittle.
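A minimal sketch of the resulting shape, with illustrative type names (the
real status manager stores the full pod status and does more bookkeeping):

    package kubelet

    import "sync"

    // UID and PodStatus are illustrative stand-ins for the real API types.
    type UID string

    type PodStatus struct {
        Phase string
    }

    // statusManager keeps the latest status per pod keyed by UID rather than
    // by the old "name_namespace" full name, so a recreated pod with the same
    // name can never pick up a stale status.
    type statusManager struct {
        mu          sync.Mutex
        podStatuses map[UID]PodStatus
    }

    func newStatusManager() *statusManager {
        return &statusManager{podStatuses: map[UID]PodStatus{}}
    }

    func (s *statusManager) SetPodStatus(uid UID, status PodStatus) {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.podStatuses[uid] = status
    }

    func (s *statusManager) GetPodStatus(uid UID) (PodStatus, bool) {
        s.mu.Lock()
        defer s.mu.Unlock()
        status, ok := s.podStatuses[uid]
        return status, ok
    }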
Getting the public IP a container is supposed to use is O(hard),
and usually involves ugly gyrations in python or with interfaces.
Using the downward API means that the IP Kube is announcing to
other endpoints is also visible inside the container for pods to
identify themselves.
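For reference, a hedged sketch of how the pod IP can be surfaced to a
container through the downward API, written against today's published
k8s.io/api/core/v1 types (the original change predates this package layout):

    package main

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
    )

    // podIPEnvVar builds an environment variable that exposes the pod's own
    // IP to its containers via a downward API field reference.
    func podIPEnvVar() v1.EnvVar {
        return v1.EnvVar{
            Name: "POD_IP",
            ValueFrom: &v1.EnvVarSource{
                FieldRef: &v1.ObjectFieldSelector{FieldPath: "status.podIP"},
            },
        }
    }

    func main() {
        fmt.Printf("%+v\n", podIPEnvVar())
    }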
This commit wires together the graceful delete option for pods
on the Kubelet. When a pod is deleted on the API server, a
grace period is calculated based on Pod.Spec.TerminationGracePeriodSeconds,
the user's provided grace period, or a default. The grace period can only
shrink once set.
The value provided by the user (or the default) is set onto metadata
as DeletionGracePeriod.
When the Kubelet sees a pod with DeletionTimestamp set, it uses the
value of ObjectMeta.GracePeriodSeconds as the grace period
sent to Docker. When updating status, if the pod has DeletionTimestamp
set and all containers are terminated, the Kubelet will update the
status one last time and then invoke Delete(pod, grace: 0) to
clean up the pod immediately.
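A simplified sketch of the grace-period selection described above (the
function name and the 30-second default are illustrative):

    package kubelet

    // effectiveGracePeriod picks the deletion grace period: the caller's
    // value if provided, otherwise the pod spec's termination grace period,
    // otherwise a default; once a grace period is set it can only shrink.
    func effectiveGracePeriod(requested, specSeconds, current *int64) int64 {
        const defaultGracePeriodSeconds int64 = 30 // illustrative default

        period := defaultGracePeriodSeconds
        if specSeconds != nil {
            period = *specSeconds
        }
        if requested != nil {
            period = *requested
        }
        if current != nil && *current < period {
            period = *current // the grace period can only shrink once set
        }
        return period
    }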
This generalizes the handling of containers in the
ContainerManager.
It also introduces the ability to determine the amount of resources
reserved for those system containers.
This test verifies that kubelet doesn't kill non-kubelet-managed containers.
This is obsolete now since the container runtime provides only the KillPod
function and has no support for killing an individual container.
* Start using FakeRuntime to replace FakeDockerClient in unit tests.
* Move and adapt docker-specific tests (e.g. creating/deleting infra
containers) to manager_test.go in dockertools.
- Delete nodes when they are no longer ready and don't exist in the
cloud provider.
- Label each node with its hostname.
- Add flag to skip node registration.
- Add a test for registering an existing node.
We recently changed `SyncPods` to filter out terminated pods at the beginning
for two reasons:
* performance: kubelet no longer keeps goroutines to check containers for
  terminated pods.
* correctness: kubelet relies on inspecting dead containers to generate
  pod status. Because dead containers may be garbage collected and
  kubelet does not have checkpoints yet, syncing a terminated pod could
  lead to the status of that terminated pod being modified.
However, even though kubelet should not *sync* the terminated pods, it
should not attempt to remove the directories and volumes for such pods as
long as they have not been deleted. This change fixes the overly aggressive
directory removal by passing all pods (including terminated pods) to the
cleanup functions.
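A sketch of the resulting split, with illustrative types: sync only the
non-terminated pods, but hand the full pod list to cleanup so the directories
and volumes of terminated-but-not-deleted pods survive:

    package kubelet

    // trackedPod is an illustrative stand-in for the kubelet's pod record.
    type trackedPod struct {
        UID        string
        Terminated bool
    }

    func syncAndCleanup(allPods []trackedPod, syncPod func(trackedPod), cleanup func([]trackedPod)) {
        // Dispatch only non-terminated pods to the per-pod sync workers.
        for _, p := range allPods {
            if !p.Terminated {
                syncPod(p)
            }
        }
        // Cleanup sees every pod that is still desired, terminated or not, so
        // it only removes directories and volumes of pods that are truly gone.
        cleanup(allPods)
    }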
Per-pod workers have sufficient knowledge to determine whether a pod has
exceeded the active deadline, and they set the status at the end of each sync.
Move the active deadline check to generatePodStatus so that per pod workers
can update the pod status directly. This eliminates the possibility of a race
condition where both SyncPods and the pod worker are updating the status, which
could lead to temporary erratic pod status behavior (pod phase: failed ->
running -> failed).
Pod statuses are periodically written to the status manager, and the status
manager sets the start time of the pod. All non-status-modifying code should
perform a cache lookup and should not attempt to generate pod status on its
own.
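A sketch of the deadline check that the per-pod workers can now apply while
generating status (field names simplified):

    package kubelet

    import "time"

    // pastActiveDeadline reports whether the pod has outlived its
    // ActiveDeadlineSeconds; the status generator marks such a pod Failed.
    func pastActiveDeadline(activeDeadlineSeconds *int64, startTime, now time.Time) bool {
        if activeDeadlineSeconds == nil || startTime.IsZero() {
            return false
        }
        allowed := time.Duration(*activeDeadlineSeconds) * time.Second
        return now.Sub(startTime) >= allowed
    }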
Kubelet will stop accepting new pods if it detects low disk space on the root
filesystem or the filesystem holding docker images. Running pods are not
affected. The low-diskspace-threshold-mb flag configures the low disk space
threshold.
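A rough, Linux-only sketch of the kind of check involved (the real kubelet
gets filesystem stats from cAdvisor rather than calling statfs itself):

    package kubelet

    import "syscall"

    // hasLowDiskSpace reports whether the filesystem backing path has less
    // free space than thresholdMB megabytes.
    func hasLowDiskSpace(path string, thresholdMB uint64) (bool, error) {
        var st syscall.Statfs_t
        if err := syscall.Statfs(path, &st); err != nil {
            return false, err
        }
        availableBytes := st.Bavail * uint64(st.Bsize)
        return availableBytes < thresholdMB*1024*1024, nil
    }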
This change instructs kubelet to switch to using the Runtime interface. In order
to do it, the change moves the Prober instantiation to DockerManager.
Note that most of the tests in kubelet_test.go need to be migrated to
dockertools. For now, we use type assertion to convert the Runtime interface
to DockerManager in most tests.
This change removes docker-specific code in killUnwantedPods. It also moves
the cleanup code away from interacting with containers directly; the cleanup
code should always deal with the pod-level abstraction if at all possible.
Once a pod reaches a terminated state (whether failed or succeeded), it
should never transition out of that state. Currently, kubelet relies on
examining the dead containers to verify that the container has already been
run. This is fine in most cases, but if the dead containers were garbage
collected, kubelet may falsely conclude that the pod has never been run and
would then try to restart all the containers.
This change eliminates most of such possibilities by pre-filtering out the pods
in the final states before sending updates to per-pod workers.
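A minimal sketch of the pre-filter (types and phase strings illustrative):

    package kubelet

    // podRecord pairs a pod identifier with its last known phase.
    type podRecord struct {
        UID   string
        Phase string
    }

    // filterOutTerminatedPods drops pods already in a final phase before
    // updates are dispatched to per-pod workers, so a garbage-collected dead
    // container can never cause a terminated pod to be restarted.
    func filterOutTerminatedPods(pods []podRecord) []podRecord {
        var out []podRecord
        for _, p := range pods {
            if p.Phase == "Failed" || p.Phase == "Succeeded" {
                continue
            }
            out = append(out, p)
        }
        return out
    }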
Remove GetDockerServerVersion() from the DockerContainerCommandRunner
interface, replacing it with runtime.Version(). Also add a Version type in
runtime for version comparison.
Use go-dockerclient's APIVersion to check the minimum required Docker
version, as it contains methods for parsing the ApiVersion response from
the Docker daemon and for comparing two APIVersion objects.
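For example (a sketch; the "1.15" minimum is illustrative), the comparison
with go-dockerclient looks roughly like this:

    package main

    import (
        "fmt"

        docker "github.com/fsouza/go-dockerclient"
    )

    // checkDockerAPIVersion compares the daemon's reported API version
    // against a required minimum.
    func checkDockerAPIVersion(reported string) error {
        minimum, err := docker.NewAPIVersion("1.15")
        if err != nil {
            return err
        }
        actual, err := docker.NewAPIVersion(reported)
        if err != nil {
            return err
        }
        if actual.LessThan(minimum) {
            return fmt.Errorf("docker API version %s is older than the required %s", reported, minimum)
        }
        return nil
    }

    func main() {
        fmt.Println(checkDockerAPIVersion("1.18"))
    }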
Currently, restart counts are generated by examining dead docker containers,
which are subject to background garbage collection. As a result, the restart
count is capped at 5 and can decrease when GC happens.
This change leverages the container statuses recorded in the pod status as a
reference point. If a container finished after the last observation, the
restart count is incremented on top of the last observed count. If a container
is created after the last observation but GC'd before the current observation
time, kubelet would not be aware of the existence of such a container and
would not increase the restart count accordingly. However, the chance of this
should be low, given that pod statuses are reported frequently. Also, the
restart count would still increase monotonically (with the exception of
container inspect errors).
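A sketch of the bookkeeping, with illustrative types:

    package kubelet

    import "time"

    // deadContainer is an illustrative record of a dead container observed
    // on the node.
    type deadContainer struct {
        Name       string
        FinishedAt time.Time
    }

    // restartCount starts from the count recorded in the last observed pod
    // status and adds one for each dead container of the given name that
    // finished after that observation, keeping the count monotonic even when
    // older dead containers have been garbage collected.
    func restartCount(name string, lastCount int, lastObserved time.Time, dead []deadContainer) int {
        count := lastCount
        for _, c := range dead {
            if c.Name == name && c.FinishedAt.After(lastObserved) {
                count++
            }
        }
        return count
    }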
Container creation/start failures cannot be reproduced by inspecting the
containers. This change caches such errors so that kubelet can retrieve them
later.
This change also extends FakeDockerClient to support setting an error
response for a specific function.
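A minimal sketch of such a cache, keyed by pod UID and container name (names
are illustrative):

    package kubelet

    import "sync"

    // reasonCache remembers creation/start failures that leave no dead
    // container behind, so the status generator can report them later.
    type reasonCache struct {
        mu      sync.Mutex
        reasons map[string]string // "<podUID>_<containerName>" -> error message
    }

    func (rc *reasonCache) Add(podUID, containerName, reason string) {
        rc.mu.Lock()
        defer rc.mu.Unlock()
        if rc.reasons == nil {
            rc.reasons = map[string]string{}
        }
        rc.reasons[podUID+"_"+containerName] = reason
    }

    func (rc *reasonCache) Get(podUID, containerName string) (string, bool) {
        rc.mu.Lock()
        defer rc.mu.Unlock()
        reason, ok := rc.reasons[podUID+"_"+containerName]
        return reason, ok
    }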
If a static pod changes, delete the corresponding mirror pod. When kubelet
no longer sees the mirror pod in updates from the API server, it will attempt
to create a new mirror pod with the up-to-date spec.
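A hedged sketch of the decision, assuming a small mirror-pod client interface
(names illustrative):

    package kubelet

    // mirrorClient is the minimal surface needed for the sketch; the real
    // client talks to the API server.
    type mirrorClient interface {
        CreateMirrorPod(staticPodFullName string) error
        DeleteMirrorPod(podFullName string) error
    }

    // syncMirrorPod deletes the mirror pod when its spec hash no longer
    // matches the static pod's, and creates a fresh one when none is visible.
    func syncMirrorPod(c mirrorClient, fullName, staticHash, mirrorHash string, mirrorExists bool) error {
        if mirrorExists && staticHash != mirrorHash {
            return c.DeleteMirrorPod(fullName)
        }
        if !mirrorExists {
            return c.CreateMirrorPod(fullName)
        }
        return nil
    }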
Standardize how our fakes are used so that a test case can use a
simpler mechanism for providing large, complex data sets, as well
as represent queries over time.
We want to stop leaking more docker details into kubelet, and we also want to
consolidate some of the existing docker interfaces/structs. This change creates
DockerManager as the new home of some functions in dockertools/docker.go. It
also absorbs containerRunner. In addition, GetDockerPodStatus is renamed to
GetPodStatus and is passed the entire pod so that it is similar to what is
defined in the container Runtime interface.
Eventually, DockerManager should implement the container Runtime interface, and
integrate DockerCache with a flag to turn on/off caching. Code in kubelet.go
should not be using docker client directly.
* Improper format specifier (e.g. %s for bools or %s for ints)
* More or less parameters than format specifiers
* Calling a non-formatting function where a formatting one was needed (e.g. Error() instead of Errorf())
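For illustration, hedged before/after examples of these classes of mistakes
(the bad calls are shown as comments):

    package main

    import "fmt"

    func examples(count int, enabled bool) error {
        // Improper format specifier:
        //   fmt.Printf("%s items\n", count)          // %s used for an int
        fmt.Printf("%d items\n", count)

        // More parameters than format specifiers:
        //   fmt.Printf("%d items\n", count, enabled)
        fmt.Printf("%d items (enabled=%t)\n", count, enabled)

        // A non-formatting call where a formatting one was needed,
        // e.g. Error("bad count: %d", count) instead of Errorf(...):
        return fmt.Errorf("bad count: %d", count)
    }

    func main() {
        fmt.Println(examples(3, true))
    }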
This fixes TestSyncPodsDeletesWithNoPodInfraContainer.
Since we need to sync two pods in parallel, we should not verify
the docker calls in strict order.
This change cleans up the pod manager extensively so that
* Mirror pods are actually stored in the pod manager.
* Both (non-mirror) pods and mirror pods are indexed by UID and full name for
easy lookup and mapping. This is required for the next change to send
full pod along with the pod status update.
This change also renames mirrorManager to mirrorClient since it is merely a
client that contacts the API server to create/delete mirror pods.
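A sketch of the indexing, with simplified types:

    package kubelet

    // indexedPod is an illustrative stand-in for the real API object.
    type indexedPod struct {
        UID       string
        Name      string
        Namespace string
    }

    func fullName(p *indexedPod) string { return p.Name + "_" + p.Namespace }

    // basicPodManager keeps both regular pods and mirror pods indexed by UID
    // and by full name, so either kind of lookup is a single map access.
    type basicPodManager struct {
        podByUID            map[string]*indexedPod
        podByFullName       map[string]*indexedPod
        mirrorPodByUID      map[string]*indexedPod
        mirrorPodByFullName map[string]*indexedPod
    }

    func (pm *basicPodManager) setPods(pods, mirrorPods []*indexedPod) {
        pm.podByUID = map[string]*indexedPod{}
        pm.podByFullName = map[string]*indexedPod{}
        pm.mirrorPodByUID = map[string]*indexedPod{}
        pm.mirrorPodByFullName = map[string]*indexedPod{}
        for _, p := range pods {
            pm.podByUID[p.UID] = p
            pm.podByFullName[fullName(p)] = p
        }
        for _, p := range mirrorPods {
            pm.mirrorPodByUID[p.UID] = p
            pm.mirrorPodByFullName[fullName(p)] = p
        }
    }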
This change moves the pod array and mirrorPods into podManager, along with
all methods accessing these internal pod storages. This is the first step of
the refactoring and involves no functional change.
Kubelet supports retrieving stats for pods/containers with and without UID.
This does not always work for the static pods because users may get the UIDs of
the mirror pods from the API server, and use them to query Kubelet. In this
case, Kubelet would fail to locate the containers due to mismatched UIDs.
This change adds an internal mirror-to-static pod UID mapping and teaches all
public-facing functions to perform UID lookup before proceeding. This allows
users to use either the mirror pod's or the static pod's UID to retrieve
stats.
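A sketch of the translation step applied by the public-facing functions (the
map is built from the pod manager's mirror/static pairing):

    package kubelet

    // translatePodUID maps a mirror pod's UID to the corresponding static
    // pod's UID; any other UID is returned unchanged, so callers may use
    // either one.
    func translatePodUID(uid string, mirrorToStaticUID map[string]string) string {
        if staticUID, ok := mirrorToStaticUID[uid]; ok {
            return staticUID
        }
        return uid
    }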
Currently, the API server is not aware of the static pods (i.e., pods created
from manifests from sources other than the API server, e.g. file and http) at
all. This is inconvenient since users cannot check the static pods through
kubectl. It is also sub-optimal because the scheduler is unaware of the
resources consumed by these static pods on the node.
This change syncs the information back to the API server by creating a
mirror pod via API server for each static pod.
- Kubelet creates containers for the static pod, as it would do
normally.
- If a mirror pod gets deleted, Kubelet will re-create one. The
containers are sync'd to the static pods, so they will not be
affected.
- If a static pod gets removed from the source (e.g. manifest file
removed from the directory), the orphaned mirror pod will be deleted.
Note that because events are associated with UID, and the mirror pod has
a different UID than the original static pod, the events will not be
shown for the mirror pod when running `kubectl describe pod
<mirror_pod>`.
These containers may be caused by a change in the Kubernetes naming
convention. The old containers are killed, the new ones started, but the
old ones are never GC'd. This change makes Kubelet GC all Kubernetes
containers, old and new.
Fixes #5372.
This will make it easier to start running the real cAdvisor alongside
Kubelet. This change is primarily no-op refactoring. The main behavioral
change is that we always create a cAdvisor interface and expect it to
always be available. When we make a request, if cAdvisor is not
connected the request fails with a connection error. This failure is
handled today as well.
There are two main goals for this change.
1. Fix the naming scheme in kubelet so that it accepts DNS subdomain
name/namespaces correctly (#4920). The design is discussed in #3453.
2. Prepare for syncing the static pods back to the apiserver (#4090). This
includes
- Eliminate the source component in the internal full pod name (#4922). Pods
no longer need sources as they will all be sync'd via apiserver.
- Changing the naming scheme for the static (file-, http-, and etcd-based)
pods such that they are distinguishable when syncing back to the apiserver.
The changes include:
* name = <pod.Name>-<hostname>
* namespace = <cluster_namespace> (i.e. "default" for now).
* container_name = k8s_<container_name>.<hash_of_container>_<pod_name>_<namespace>_<uid>_<random>
Note that this is not backward-compatible, meaning the kubelet won't recognize
existing running containers using the old naming scheme.
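For illustration, a sketch of how a container name in this scheme could be
assembled (the hash and random-suffix formatting are illustrative):

    package dockertools

    import (
        "fmt"
        "math/rand"
    )

    // buildDockerName assembles
    // k8s_<container_name>.<hash_of_container>_<pod_name>_<namespace>_<uid>_<random>.
    func buildDockerName(containerName string, containerHash uint64, podName, namespace, podUID string) string {
        return fmt.Sprintf("k8s_%s.%x_%s_%s_%s_%08x",
            containerName, containerHash, podName, namespace, podUID, rand.Uint32())
    }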
When a host port conflict is detected, kubelet should set the pod status to
Failed. The failed status will then be polled by other components at a later
time, which allows the replication controller to create a new pod if
necessary.
To achieve this, this change stores the pod status information in a status map
upon the detection of a port conflict. GetPodStatus() consults this status map
before attempting to query docker. Entries in the status map are removed when
the pod is no longer associated with the node.
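A minimal sketch of the lookup order (types and phase strings simplified):

    package kubelet

    // podStatusCache holds statuses set by the kubelet itself, e.g. Failed on
    // a host port conflict; illustrative.
    type podStatusCache map[string]string // pod full name -> phase

    // getPodStatus consults the cached status before querying docker, so a
    // pod rejected for a port conflict is reported as Failed even though it
    // never had containers to inspect.
    func getPodStatus(cache podStatusCache, podFullName string, queryDocker func(string) (string, error)) (string, error) {
        if phase, ok := cache[podFullName]; ok {
            return phase, nil
        }
        return queryDocker(podFullName)
    }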
Currently, kubelet silently ignores pods that caused host port conflict. This
commit surfaces the error by recording an event.
It also makes sure that kubelet iterates through the pods in order of their
creation timestamps, which ensures that the more recently created pod is the
one ignored when a conflict occurs.
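A sketch of the ordering step (types illustrative; sort.Slice used for
brevity):

    package kubelet

    import (
        "sort"
        "time"
    )

    // podMeta carries only what the sort needs.
    type podMeta struct {
        Name              string
        CreationTimestamp time.Time
    }

    // sortByCreationTime orders pods oldest first, so that when two pods
    // claim the same host port, the more recently created one is rejected.
    func sortByCreationTime(pods []podMeta) {
        sort.Slice(pods, func(i, j int) bool {
            return pods[i].CreationTimestamp.Before(pods[j].CreationTimestamp)
        })
    }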
Set the oom_score_adj of the PID of the POD container to -100, which is less
than the default of 0. This makes this PID one of the last OOM victims chosen
by the kernel.
Fixes #3067.
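A Linux-only sketch of the mechanism (the real code applies it to the PID of
the pod infrastructure container):

    package kubelet

    import (
        "fmt"
        "os"
    )

    // applyOOMScoreAdj writes value (e.g. -100) to /proc/<pid>/oom_score_adj,
    // making the process one of the kernel OOM killer's last choices.
    func applyOOMScoreAdj(pid, value int) error {
        path := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
        return os.WriteFile(path, []byte(fmt.Sprintf("%d", value)), 0644)
    }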
Since the parsing function doesn't return an error, all the components
returned empty strings. This caused us to enforce the MaxContainerLimit
as a global limit instead of a per-container limit.
Fixes #4413.