# Pod initialization

@smarterclayton

March 2016

## Proposal and Motivation

Within a pod there is a need to initialize local data or adapt to the current
cluster environment that is not easily achieved in the current container model.
Containers start in parallel after volumes are mounted, leaving no opportunity
for coordination between containers without specialization of the image. If
two containers need to share common initialization data, both images must
be altered to cooperate using filesystem or network semantics, which introduces
coupling between images. Likewise, if an image requires configuration in order
to start and that configuration is environment dependent, the image must be
altered to add the necessary templating or retrieval.

This proposal introduces the concept of an **init container**, one or more
containers started in sequence before the pod's normal containers are started.
These init containers may share volumes, perform network operations, and perform
computation prior to the start of the remaining containers. They may also, by
virtue of their sequencing, block or delay the startup of application containers
until some precondition is met. In this document we refer to the existing pod
containers as **app containers**.

This proposal also provides a high level design of **volume containers**, which
initialize a particular volume, as a feature that specializes some of the tasks
defined for init containers. The init container design anticipates the existence
of volume containers and highlights where they will require future work.

## Design Points

* Init containers should be able to:
  * Perform initialization of shared volumes
  * Download binaries that will be used in app containers as execution targets
  * Inject configuration or extension capability to generic images at startup
  * Perform complex templating of information available in the local environment
  * Initialize a database by starting a temporary execution process and applying
    schema info.
  * Delay the startup of application containers until preconditions are met
  * Register the pod with other components of the system
* Reduce coupling:
  * Between application images, eliminating the need to customize those images for
    Kubernetes generally or specific roles
  * Inside of images, by specializing which containers perform which tasks
    (install git into init container, use filesystem contents
    in web container)
  * Between initialization steps, by supporting multiple sequential init containers
* Init containers allow simple start preconditions to be implemented that are
  decoupled from application code
* The order init containers start should be predictable and allow users to easily
  reason about the startup of a container
* Complex ordering and failure will not be supported - all complex workflows can
  if necessary be implemented inside of a single init container, and this proposal
  aims to enable that ordering without adding undue complexity to the system.
  Pods in general are not intended to support DAG workflows.
* Both run-once and run-forever pods should be able to use init containers
* As much as possible, an init container should behave like an app container
  to reduce complexity for end users, for clients, and for divergent use cases.
  An init container is a container with the minimum alterations to accomplish
  its goal.
* Volume containers should be able to:
  * Perform initialization of a single volume
  * Start in parallel
  * Perform computation to initialize a volume, and delay start until that
    volume is initialized successfully.
  * Using a volume container that does not populate a volume to delay pod start
    (in the absence of init containers) would be an abuse of the goal of volume
    containers.
* Container pre-start hooks are not sufficient for all initialization cases:
  * They cannot easily coordinate complex conditions across containers
  * They can only function with code in the image or code in a shared volume,
    which would have to be statically linked (not a common pattern in wide use)
  * They cannot be implemented with the current Docker implementation - see
    [#140](https://github.com/kubernetes/kubernetes/issues/140)

## Alternatives

* Any mechanism that runs user code on a node before regular pod containers
  should itself be a container and modeled as such - we explicitly reject
  creating new mechanisms for running user processes.
* The container pre-start hook (not yet implemented) requires execution within
  the container's image and so cannot adapt existing images. It also cannot
  block startup of containers.
* Running a "pre-pod" would defeat the purpose of the pod being an atomic
  unit of scheduling.

## Design

Each pod may have 0..N init containers defined along with the existing
1..M app containers.

On startup of the pod, after the network and volumes are initialized, the
init containers are started in order. Each container must exit successfully
before the next is invoked. If a container fails to start (due to the runtime)
or exits with failure, it is retried according to the pod RestartPolicy.
RestartPolicyNever pods will immediately fail and exit. RestartPolicyAlways
pods will retry the failing init container with increasing backoff until it
succeeds. To align with the design of application containers, init containers
will only support "infinite retries" (RestartPolicyAlways) or "no retries"
(RestartPolicyNever).
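
The sequencing and retry rules above can be illustrated with a small simulation. This is only a sketch, not Kubelet code: `run_init_containers`, the `(name, callable)` representation of a container, and the `max_attempts` cap are all hypothetical. The proposal describes unbounded retries with increasing backoff for RestartPolicyAlways; the cap exists only so the simulation terminates, and no real backoff delay is applied.

```python
from typing import Callable, List, Tuple


def run_init_containers(
    init_containers: List[Tuple[str, Callable[[], bool]]],
    restart_policy: str,
    max_attempts: int = 10,  # simulation-only cap; the proposal retries forever
) -> Tuple[bool, List[str]]:
    """Run hypothetical init containers strictly in order.

    Each entry is (name, run) where run() returns True on a successful exit.
    Returns (initialized, log), with one "name#attempt" log entry per run.
    """
    log: List[str] = []
    for name, run in init_containers:
        attempt = 0
        while True:
            attempt += 1
            log.append(f"{name}#{attempt}")
            if run():
                break  # exited successfully: the next init container may start
            if restart_policy == "Never":
                return False, log  # no retries: the pod fails immediately
            if attempt >= max_attempts:
                return False, log  # simulation cap reached
    return True, log
```

With a "Never" policy the first failure ends initialization; with any other policy the failing step is rerun before later steps are attempted, so later init containers never observe a partially initialized pod.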

A pod cannot be ready until all init containers have succeeded. The ports
on an init container are not aggregated under a service. A pod that is
being initialized is in the `Pending` phase but should have a distinct
condition. Each app container and all future init containers should have
the reason `PodInitializing`. The pod should have a condition `Initializing`
set to `false` until all init containers have succeeded, and `true` thereafter.
If the pod is restarted, the `Initializing` condition should be set to `false`.

If the pod is "restarted" (all containers stopped and started due to
a node restart, a change to the pod definition, or admin interaction), all
init containers must execute again. Restartable conditions are defined as:

* An init container image is changed
* The pod infrastructure container is restarted (shared namespaces are lost)
* The Kubelet detects that all containers in a pod are terminated AND
  no record of init container completion is available on disk (due to GC)
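
The three restartable conditions above combine into a single predicate. The sketch below is illustrative only: the `PodObservation` type and its field names are hypothetical and do not correspond to any Kubelet data structure.

```python
from dataclasses import dataclass


@dataclass
class PodObservation:
    """Hypothetical snapshot of what the Kubelet has observed for one pod."""
    init_image_changed: bool
    infra_container_restarted: bool
    all_containers_terminated: bool
    init_completion_recorded: bool  # False if the record was lost, e.g. to GC


def needs_reinitialization(obs: PodObservation) -> bool:
    """Return True if all init containers must execute again."""
    lost_record = obs.all_containers_terminated and not obs.init_completion_recorded
    return obs.init_image_changed or obs.infra_container_restarted or lost_record
```

Note that terminated containers alone do not force a rerun in this sketch; the completion record must also be missing.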

Changes to the init container spec are limited to the container image field.
Altering the container image field is equivalent to restarting the pod.

Because init containers can be restarted, retried, or reexecuted, container
authors should make their init behavior idempotent by handling volumes that
are already populated or the possibility that this instance of the pod has
already contacted a remote system.
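
One way an init container might handle an already populated volume is to write a marker file after the work completes and skip the work when the marker is present. This is a minimal sketch of that pattern, not something the proposal prescribes; `populate_volume` and the `.initialized` marker name are hypothetical.

```python
from pathlib import Path
from typing import Dict

MARKER_NAME = ".initialized"  # hypothetical marker file name


def populate_volume(volume: Path, files: Dict[str, str]) -> bool:
    """Write files into the volume unless a previous run already finished.

    Returns True if the volume was populated by this call and False if the
    work was skipped because the marker file already exists.
    """
    marker = volume / MARKER_NAME
    if marker.exists():
        return False  # a previous execution completed; nothing to do
    for name, content in files.items():
        (volume / name).write_text(content)
    # The marker is written last, so a run interrupted part-way is redone.
    marker.write_text("done\n")
    return True
```

Writing the marker last means a retried container repeats the whole step after a partial failure, which is safe here because each file write simply overwrites the previous content.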

Each init container has all of the fields of an app container. The following
fields are prohibited from being used on init containers by validation:

* `readinessProbe` - init containers must exit for pod startup to continue,
  are not included in rotation, and so cannot define readiness distinct from
  completion.

Init container authors may use `activeDeadlineSeconds` on the pod and
`livenessProbe` on the container to prevent init containers from failing
forever. The active deadline includes init containers.

Because init containers are semantically different in lifecycle from app
containers (they are run serially, rather than in parallel), for backwards
compatibility and design clarity they will be identified as distinct fields
in the API:

    pod:
      spec:
        containers: ...
        initContainers:
        - name: init-container1
          image: ...
          ...
        - name: init-container2
          ...
      status:
        containerStatuses: ...
        initContainerStatuses:
        - name: init-container1
          ...
        - name: init-container2
          ...

This separation also serves to make the order of container initialization
clear - init containers are executed in the order that they appear, then all
app containers are started at once.

The name of each app and init container in a pod must be unique - it is a
validation error for any container to share a name.
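
The two validation rules stated in this section (no `readinessProbe` on init containers, and unique names across init and app containers) could be checked roughly as follows. This is a sketch over plain dictionaries shaped like the pod spec above; `validate_pod_spec` and its error strings are hypothetical and are not the API server's validation code.

```python
from typing import Dict, List


def validate_pod_spec(spec: Dict) -> List[str]:
    """Return a list of validation error messages (empty if the spec is valid)."""
    errors: List[str] = []
    seen = set()
    # Names must be unique across init containers and app containers together.
    for field in ("initContainers", "containers"):
        for container in spec.get(field, []):
            name = container.get("name")
            if name in seen:
                errors.append(f"duplicate container name: {name}")
            seen.add(name)
    # readinessProbe is prohibited on init containers.
    for container in spec.get("initContainers", []):
        if "readinessProbe" in container:
            errors.append(
                f"init container {container.get('name')} must not set readinessProbe"
            )
    return errors
```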

While init containers are in alpha state, they will be serialized as an annotation
on the pod with the name `pod.alpha.kubernetes.io/init-containers` and the status
of the containers will be stored as `pod.alpha.kubernetes.io/init-container-statuses`.
Mutation of these annotations is prohibited on existing pods.

### Resources

Given the ordering and execution for init containers, the following rules
for resource usage apply:

* The highest of any particular resource request or limit defined on all init
  containers is the **effective init request/limit**
* The pod's **effective request/limit** for a resource is the higher of:
  * sum of all app containers request/limit for a resource
  * effective init request/limit for a resource
* Scheduling is done based on effective requests/limits, which means
  init containers can reserve resources for initialization that are not used
  during the life of the pod.
* The lowest QoS tier of init containers per resource is the **effective init QoS tier**,
  and the highest QoS tier of both init containers and regular containers is the
  **effective pod QoS tier**.

So the following pod:

    pod:
      spec:
        initContainers:
        - limits:
            cpu: 100m
            memory: 1GiB
        - limits:
            cpu: 50m
            memory: 2GiB
        containers:
        - limits:
            cpu: 10m
            memory: 1100MiB
        - limits:
            cpu: 10m
            memory: 1100MiB

has an effective pod limit of `cpu: 100m`, `memory: 2200MiB` (highest init
container cpu is larger than sum of all app containers, sum of container
memory is larger than the max of all init containers). The scheduler, node,
and quota must respect the effective pod request/limit.
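
The effective limit rule can be checked against the example pod with a few lines of arithmetic. The sketch below is illustrative only: `effective_limits` and the two parsers are hypothetical helpers, the parsers understand only the quantity notation used in this document (`m`, `MiB`, `GiB`), and a missing value is treated as zero for simplicity.

```python
from typing import Dict, List


def parse_cpu_millis(quantity: str) -> int:
    """Parse a CPU quantity such as '100m' (or '1') into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def parse_memory_mib(quantity: str) -> int:
    """Parse a memory quantity written as 'NGiB' or 'NMiB' into MiB."""
    if quantity.endswith("GiB"):
        return int(quantity[:-3]) * 1024
    if quantity.endswith("MiB"):
        return int(quantity[:-3])
    raise ValueError(f"unsupported memory quantity: {quantity}")


def effective_limits(
    init_limits: List[Dict[str, str]], app_limits: List[Dict[str, str]]
) -> Dict[str, int]:
    """Per resource: max(highest init container value, sum of app containers)."""

    def normalize(limits: Dict[str, str]) -> Dict[str, int]:
        return {
            "cpu": parse_cpu_millis(limits.get("cpu", "0m")),
            "memory": parse_memory_mib(limits.get("memory", "0MiB")),
        }

    init = [normalize(entry) for entry in init_limits]
    app = [normalize(entry) for entry in app_limits]
    result: Dict[str, int] = {}
    for resource in ("cpu", "memory"):
        init_max = max((c[resource] for c in init), default=0)
        app_sum = sum(c[resource] for c in app)
        result[resource] = max(init_max, app_sum)
    return result


example = effective_limits(
    init_limits=[{"cpu": "100m", "memory": "1GiB"}, {"cpu": "50m", "memory": "2GiB"}],
    app_limits=[{"cpu": "10m", "memory": "1100MiB"}, {"cpu": "10m", "memory": "1100MiB"}],
)
print(example)  # {'cpu': 100, 'memory': 2200}  (millicores, MiB)
```

For CPU the init maximum (100m) exceeds the app sum (20m); for memory the app sum (2200MiB) exceeds the init maximum (2GiB = 2048MiB), matching the figures in the text.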

In the absence of a defined request or limit on a container, the effective
request/limit will be applied. For example, the following pod:

    pod:
      spec:
        initContainers:
        - limits:
            cpu: 100m
            memory: 1GiB
        containers:
        - request:
            cpu: 10m
            memory: 1100MiB

will have an effective request of `10m / 1100MiB`, and an effective limit
of `100m / 1GiB`, i.e.:

    pod:
      spec:
        initContainers:
        - request:
            cpu: 10m
            memory: 1GiB
        - limits:
            cpu: 100m
            memory: 1100MiB
        containers:
        - request:
            cpu: 10m
            memory: 1GiB
        - limits:
            cpu: 100m
            memory: 1100MiB

and thus have the QoS tier **Burstable** (because request is not equal to
limit).

Quota and limits will be applied based on the effective pod request and
limit.

Pod level cGroups will be based on the effective pod request and limit, the
same as the scheduler.

### Kubelet and container runtime details

Container runtimes should treat the set of init and app containers as one
large pool. An individual init container execution should be identical to
an app container, including all standard container environment setup
(network, namespaces, hostnames, DNS, etc).

All app container operations are permitted on init containers. The
logs for an init container should be available for the duration of the pod
lifetime or until the pod is restarted.

During initialization, app container status should be shown with the reason
`PodInitializing` if any init containers are present. Each init container
should show appropriate container status, and all init containers that are
waiting for earlier init containers to finish should have the `reason`
`PendingInitialization`.

The container runtime should aggressively prune failed init containers.
The container runtime should record whether all init containers have
succeeded internally, and only invoke new init containers if a pod
restart is needed (for Docker, if all containers terminate or if the pod
infra container terminates). Init containers should follow backoff rules
as necessary. The Kubelet *must* preserve at least the most recent instance
of an init container to serve logs and data for end users and to track
failure states. The Kubelet *should* prefer to garbage collect completed
init containers over app containers, as long as the Kubelet is able to
track that initialization has been completed. In the future, container
state checkpointing in the Kubelet may remove or reduce the need to
preserve old init containers.

For the initial implementation, the Kubelet will use the last termination
container state of the highest indexed init container to determine whether
the pod has completed initialization. During a pod restart, initialization
will be restarted from the beginning (all initializers will be rerun).
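
The completion check described for the initial implementation might look like the sketch below. It is an assumption-laden illustration rather than Kubelet code: the status dictionaries, the `terminated` and `exitCode` keys, and the rule that a pod with no init containers counts as initialized are all hypothetical choices made for the example.

```python
from typing import Dict, List, Optional


def initialization_complete(init_statuses: List[Dict]) -> bool:
    """Decide completion from the highest indexed init container only.

    Each status is a dict whose optional "terminated" entry holds the last
    termination state, e.g. {"exitCode": 0}.
    """
    if not init_statuses:
        return True  # assumption: nothing to initialize
    last = init_statuses[-1]  # highest indexed init container
    terminated: Optional[Dict] = last.get("terminated")
    return terminated is not None and terminated.get("exitCode") == 0
```

Because init containers run strictly in order, a successful exit of the last one implies that every earlier one also succeeded, which is why inspecting a single status is sufficient in this sketch.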

### API Behavior

All APIs that access containers by name should operate on both init and
app containers. Because names are unique the addition of the init container
should be transparent to use cases.

A client with no knowledge of init containers should see appropriate
container status `reason` and `message` fields while the pod is in the
`Pending` phase, and so be able to communicate that to end users.
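
Because names are unique across both lists, a by-name lookup can search init and app containers with one code path. The helper below is a hypothetical sketch over a plain dictionary shaped like the pod examples in this document, not an existing API.

```python
from typing import Dict, Optional, Tuple


def find_container(pod: Dict, name: str) -> Optional[Tuple[str, Dict]]:
    """Return ("init" or "app", container) for the given name, or None."""
    spec = pod.get("spec", {})
    for kind, field in (("init", "initContainers"), ("app", "containers")):
        for container in spec.get(field, []):
            if container.get("name") == name:
                return kind, container
    return None
```

A caller that only needs the container can ignore the first element of the result, so existing by-name operations would not need to distinguish the two kinds.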

### Example init containers

* Wait for a service to be created

        pod:
          spec:
            initContainers:
            - name: wait
              image: centos:centos7
              command: ["/bin/sh", "-c", "for i in {1..100}; do sleep 1; if dig myservice; then exit 0; fi; done; exit 1"]
            containers:
            - name: run
              image: application-image
              command: ["/my_application_that_depends_on_myservice"]

* Register this pod with a remote server

        pod:
          spec:
            initContainers:
            - name: register
              image: centos:centos7
              command: ["/bin/sh", "-c", "curl -X POST http://$MANAGEMENT_SERVICE_HOST:$MANAGEMENT_SERVICE_PORT/register -d 'instance=$(POD_NAME)&ip=$(POD_IP)'"]
              env:
              - name: POD_NAME
                valueFrom:
                  field: metadata.name
              - name: POD_IP
                valueFrom:
                  field: status.podIP
            containers:
            - name: run
              image: application-image
              command: ["/my_application_that_depends_on_myservice"]

* Wait for an arbitrary period of time

        pod:
          spec:
            initContainers:
            - name: wait
              image: centos:centos7
              command: ["/bin/sh", "-c", "sleep 60"]
            containers:
            - name: run
              image: application-image
              command: ["/static_binary_without_sleep"]

* Clone a git repository into a volume (can be implemented by volume containers in the future):

        pod:
          spec:
            initContainers:
            - name: download
              image: image-with-git
              command: ["git", "clone", "https://github.com/myrepo/myrepo.git", "/var/lib/data"]
              volumeMounts:
              - mountPath: /var/lib/data
                volumeName: git
            containers:
            - name: run
              image: centos:centos7
              command: ["/var/lib/data/binary"]
              volumeMounts:
              - mountPath: /var/lib/data
                volumeName: git
            volumes:
            - emptyDir: {}
              name: git

* Execute a template transformation based on environment (can be implemented by volume containers in the future):

        pod:
          spec:
            initContainers:
            - name: copy
              image: application-image
              command: ["/bin/cp", "mytemplate.j2", "/var/lib/data/"]
              volumeMounts:
              - mountPath: /var/lib/data
                volumeName: data
            - name: transform
              image: image-with-jinja
              command: ["/bin/sh", "-c", "jinja /var/lib/data/mytemplate.j2 > /var/lib/data/mytemplate.conf"]
              volumeMounts:
              - mountPath: /var/lib/data
                volumeName: data
            containers:
            - name: run
              image: application-image
              command: ["/myapplication", "-conf", "/var/lib/data/mytemplate.conf"]
              volumeMounts:
              - mountPath: /var/lib/data
                volumeName: data
            volumes:
            - emptyDir: {}
              name: data

* Perform a container build

        pod:
          spec:
            initContainers:
            - name: copy
              image: base-image
              workingDir: /home/user/source-tree
              command: ["make"]
            containers:
            - name: commit
              image: image-with-docker
              command:
              - /bin/sh
              - -c
              - docker commit $(complex_bash_to_get_container_id_of_copy) \
                docker push $(commit_id) myrepo:latest
              volumeMounts:
              - mountPath: /var/run/docker.sock
                volumeName: dockersocket

## Backwards compatibility implications

Since this is a net new feature in the API and Kubelet, new API servers during upgrade may not
be able to rely on Kubelets implementing init containers. The management of feature skew between
master and Kubelet is tracked in issue [#4855](https://github.com/kubernetes/kubernetes/issues/4855).

## Future work

* Unify pod QoS class with init containers
* Implement container / image volumes to make composition of runtime from images efficient