Merge pull request #23666 from smarterclayton/initctrproposal
Automatic merge from submit-queue

Proposal for implementing init containers

Addresses #1589. Implemented in #23567. Docs in https://github.com/kubernetes/kubernetes.github.io/pull/679

```release-note
Init containers enable pod authors to perform tasks before their normal containers start. Each init container is started in order, and failing containers will prevent the application from starting.
```
commit
d03a5fab29
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Pod initialization

@smarterclayton

March 2016

## Proposal and Motivation

Within a pod there is a need to initialize local data or adapt to the current
cluster environment that is not easily achieved in the current container model.
Containers start in parallel after volumes are mounted, leaving no opportunity
for coordination between containers without specialization of the image. If
two containers need to share common initialization data, both images must
be altered to cooperate using filesystem or network semantics, which introduces
coupling between images. Likewise, if an image requires configuration in order
to start and that configuration is environment dependent, the image must be
altered to add the necessary templating or retrieval.

This proposal introduces the concept of an **init container**, one or more
containers started in sequence before the pod's normal containers are started.
These init containers may share volumes, perform network operations, and perform
computation prior to the start of the remaining containers. They may also, by
virtue of their sequencing, block or delay the startup of application containers
until some precondition is met. In this document we refer to the existing pod
containers as **app containers**.

This proposal also provides a high level design of **volume containers**, which
initialize a particular volume, as a feature that specializes some of the tasks
defined for init containers. The init container design anticipates the existence
of volume containers and highlights where they will require future work.

## Design Points

* Init containers should be able to:
  * Perform initialization of shared volumes
  * Download binaries that will be used in app containers as execution targets
  * Inject configuration or extension capability to generic images at startup
  * Perform complex templating of information available in the local environment
  * Initialize a database by starting a temporary execution process and applying
    schema info.
  * Delay the startup of application containers until preconditions are met
  * Register the pod with other components of the system
* Reduce coupling:
  * Between application images, eliminating the need to customize those images for
    Kubernetes generally or specific roles
  * Inside of images, by specializing which containers perform which tasks
    (install git into init container, use filesystem contents
    in web container)
  * Between initialization steps, by supporting multiple sequential init containers
* Init containers allow simple start preconditions to be implemented that are
  decoupled from application code
* The order init containers start should be predictable and allow users to easily
  reason about the startup of a container
* Complex ordering and failure will not be supported - all complex workflows can
  if necessary be implemented inside of a single init container, and this proposal
  aims to enable that ordering without adding undue complexity to the system.
  Pods in general are not intended to support DAG workflows.
* Both run-once and run-forever pods should be able to use init containers
* As much as possible, an init container should behave like an app container
  to reduce complexity for end users, for clients, and for divergent use cases.
  An init container is a container with the minimum alterations to accomplish
  its goal.
* Volume containers should be able to:
  * Perform initialization of a single volume
  * Start in parallel
  * Perform computation to initialize a volume, and delay start until that
    volume is initialized successfully.
  * Using a volume container that does not populate a volume to delay pod start
    (in the absence of init containers) would be an abuse of the goal of volume
    containers.
* Container pre-start hooks are not sufficient for all initialization cases:
  * They cannot easily coordinate complex conditions across containers
  * They can only function with code in the image or code in a shared volume,
    which would have to be statically linked (not a common pattern in wide use)
  * They cannot be implemented with the current Docker implementation - see
    [#140](https://github.com/kubernetes/kubernetes/issues/140)


## Alternatives

* Any mechanism that runs user code on a node before regular pod containers
  should itself be a container and modeled as such - we explicitly reject
  creating new mechanisms for running user processes.
* The container pre-start hook (not yet implemented) requires execution within
  the container's image and so cannot adapt existing images. It also cannot
  block startup of containers
* Running a "pre-pod" would defeat the purpose of the pod being an atomic
  unit of scheduling.


## Design

Each pod may have 0..N init containers defined along with the existing
1..M app containers.

On startup of the pod, after the network and volumes are initialized, the
init containers are started in order. Each container must exit successfully
before the next is invoked. If a container fails to start (due to the runtime)
or exits with failure, it is retried according to the pod RestartPolicy.
RestartPolicyNever pods will immediately fail and exit. RestartPolicyAlways
pods will retry the failing init container with increasing backoff until it
succeeds. To align with the design of application containers, init containers
will only support "infinite retries" (RestartPolicyAlways) or "no retries"
(RestartPolicyNever).

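The startup sequence above can be illustrated with a small simulation. This is only a sketch under stated assumptions: `run_init_containers`, `flaky`, and the `max_attempts` bound are hypothetical names introduced here, containers are modeled as plain Python callables, and backoff is not actually slept.

```python
from typing import Callable, List, Tuple

# Hypothetical constants; the proposal does not define a Python API.
RESTART_ALWAYS = "Always"
RESTART_NEVER = "Never"


def run_init_containers(
    init_containers: List[Tuple[str, Callable[[], bool]]],
    restart_policy: str,
    max_attempts: int = 10,
) -> Tuple[bool, List[str]]:
    """Run init containers in order; return (initialized, log of attempts).

    Each container is modeled as (name, run), where run() returns True on a
    successful exit. max_attempts only bounds this simulation; the proposal
    describes unbounded retries with backoff for RestartPolicyAlways.
    """
    log: List[str] = []
    for name, run in init_containers:
        attempts = 0
        while True:
            attempts += 1
            ok = run()
            log.append(f"{name}:{'ok' if ok else 'fail'}")
            if ok:
                break  # the next init container may now start
            if restart_policy == RESTART_NEVER or attempts >= max_attempts:
                return False, log  # pod fails; later containers never start
            # A real kubelet would wait with increasing backoff here.
    return True, log


def flaky(failures: int) -> Callable[[], bool]:
    """Return a fake container that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def run() -> bool:
        if state["left"] > 0:
            state["left"] -= 1
            return False
        return True

    return run
```

With `RESTART_ALWAYS`, a container that fails twice is retried until it succeeds and only then does the next one run; with `RESTART_NEVER`, the first failure ends initialization.
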
A pod cannot be ready until all init containers have succeeded. The ports
on an init container are not aggregated under a service. A pod that is
being initialized is in the `Pending` phase but should have a distinct
condition. Each app container and all future init containers should have
the reason `PodInitializing`. The pod should have a condition `Initializing`
set to `false` until all init containers have succeeded, and `true` thereafter.
If the pod is restarted, the `Initializing` condition should be set to `false`.

If the pod is "restarted" (all containers stopped and started due to
a node restart, a change to the pod definition, or admin interaction), all
init containers must execute again. Restartable conditions are defined as:

* An init container image is changed
* The pod infrastructure container is restarted (shared namespaces are lost)
* The Kubelet detects that all containers in a pod are terminated AND
  no record of init container completion is available on disk (due to GC)

Changes to the init container spec are limited to the container image field.
Altering the container image field is equivalent to restarting the pod.

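The three restartable conditions can be read as a single predicate. The sketch below is illustrative only: the function name and its parameters are hypothetical, and it assumes the caller has already gathered these facts about the pod.

```python
from typing import List


def must_rerun_init(
    old_init_images: List[str],
    new_init_images: List[str],
    infra_container_restarted: bool,
    all_containers_terminated: bool,
    init_completion_recorded: bool,
) -> bool:
    """Return True if any restartable condition from the list above holds."""
    # An init container image is changed.
    if old_init_images != new_init_images:
        return True
    # The pod infrastructure container is restarted (shared namespaces lost).
    if infra_container_restarted:
        return True
    # All containers are terminated AND no completion record survives GC.
    if all_containers_terminated and not init_completion_recorded:
        return True
    return False
```
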
Because init containers can be restarted, retried, or reexecuted, container
authors should make their init behavior idempotent by handling volumes that
are already populated or the possibility that this instance of the pod has
already contacted a remote system.

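One way an init container could handle an already populated volume is sketched below. It is not part of the proposal: the `.init-complete` marker file, the `data.bin` file name, and the `populate_volume` function are hypothetical conventions chosen for illustration.

```python
import os
import tempfile


def populate_volume(mount_path: str, payload: bytes) -> bool:
    """Populate a volume once; return True if work was done, False if skipped.

    The marker file name is a hypothetical convention, not part of the proposal.
    """
    marker = os.path.join(mount_path, ".init-complete")
    if os.path.exists(marker):
        return False  # an earlier run already finished; do nothing
    data_path = os.path.join(mount_path, "data.bin")
    # Write to a temporary file first so a crash never leaves partial data
    # under the final name, then rename it into place.
    fd, tmp_path = tempfile.mkstemp(dir=mount_path)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(payload)
    os.replace(tmp_path, data_path)
    # Record completion last, so a retry after a crash repeats the work.
    with open(marker, "w") as f:
        f.write("done\n")
    return True
```

Running the function a second time against the same directory leaves the existing data untouched, which is the behavior a retried or reexecuted init container needs.
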
Each init container has all of the fields of an app container. The following
fields are prohibited from being used on init containers by validation:

* `readinessProbe` - init containers must exit for pod startup to continue,
  are not included in rotation, and so cannot define readiness distinct from
  completion.

Init container authors may use `activeDeadlineSeconds` on the pod and
`livenessProbe` on the container to prevent init containers from failing
forever. The active deadline includes init containers.

Because init containers are semantically different in lifecycle from app
containers (they are run serially, rather than in parallel), for backwards
compatibility and design clarity they will be identified as distinct fields
in the API:

    pod:
      spec:
        containers: ...
        initContainers:
        - name: init-container1
          image: ...
          ...
        - name: init-container2
          ...
      status:
        containerStatuses: ...
        initContainerStatuses:
        - name: init-container1
          ...
        - name: init-container2
          ...

This separation also serves to make the order of container initialization
clear - init containers are executed in the order that they appear, then all
app containers are started at once.

The name of each app and init container in a pod must be unique - it is a
validation error for any container to share a name.

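The uniqueness rule can be sketched as a validation pass over both container lists. This is a minimal illustration, assuming the pod spec is a plain dictionary with `initContainers` and `containers` keys; the function name and the error message format are hypothetical.

```python
from typing import Dict, List


def validate_unique_names(pod_spec: Dict[str, List[dict]]) -> List[str]:
    """Return validation errors for names shared by any two containers."""
    seen = set()
    errors = []
    # Init and app containers share one namespace for names.
    for field in ("initContainers", "containers"):
        for index, container in enumerate(pod_spec.get(field, [])):
            name = container["name"]
            if name in seen:
                errors.append(
                    f"spec.{field}[{index}].name: duplicate value {name!r}"
                )
            seen.add(name)
    return errors
```
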
While init containers are in alpha state, they will be serialized as an annotation
on the pod with the name `pod.alpha.kubernetes.io/init-containers` and the status
of the containers will be stored as `pod.alpha.kubernetes.io/init-container-statuses`.
Mutation of these annotations is prohibited on existing pods.


### Resources

Given the ordering and execution for init containers, the following rules
for resource usage apply:

* The highest of any particular resource request or limit defined on all init
  containers is the **effective init request/limit**
* The pod's **effective request/limit** for a resource is the higher of:
  * sum of all app containers request/limit for a resource
  * effective init request/limit for a resource
* Scheduling is done based on effective requests/limits, which means
  init containers can reserve resources for initialization that are not used
  during the life of the pod.
* The lowest QoS tier of init containers per resource is the **effective init QoS tier**,
  and the highest QoS tier of both init containers and regular containers is the
  **effective pod QoS tier**.

So the following pod:

    pod:
      spec:
        initContainers:
        - limits:
            cpu: 100m
            memory: 1GiB
        - limits:
            cpu: 50m
            memory: 2GiB
        containers:
        - limits:
            cpu: 10m
            memory: 1100MiB
        - limits:
            cpu: 10m
            memory: 1100MiB

has an effective pod limit of `cpu: 100m`, `memory: 2200MiB` (highest init
container cpu is larger than sum of all app containers, sum of container
memory is larger than the max of all init containers). The scheduler, node,
and quota must respect the effective pod request/limit.

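The arithmetic in this example can be checked with a short sketch. The function below is illustrative rather than the scheduler's implementation: `effective_pod_resources` is a hypothetical name, quantities are plain integers (CPU in millicores, memory in MiB), and a resource missing from a container is treated as 0, which is a simplifying assumption.

```python
from typing import Dict, List

Resources = Dict[str, int]  # resource name -> quantity (cpu in m, memory in MiB)


def effective_pod_resources(init: List[Resources], app: List[Resources]) -> Resources:
    """Combine per-container values into the pod's effective value.

    For each resource: max(highest init container value, sum of app values).
    A resource absent from a container counts as 0 (an assumption made here).
    """
    names = {name for container in init + app for name in container}
    effective: Resources = {}
    for name in names:
        init_max = max((c.get(name, 0) for c in init), default=0)
        app_sum = sum(c.get(name, 0) for c in app)
        effective[name] = max(init_max, app_sum)
    return effective


# The limits from the pod above, with 1GiB = 1024MiB and 2GiB = 2048MiB.
init_limits = [{"cpu": 100, "memory": 1024}, {"cpu": 50, "memory": 2048}]
app_limits = [{"cpu": 10, "memory": 1100}, {"cpu": 10, "memory": 1100}]
effective = effective_pod_resources(init_limits, app_limits)
```

Here `effective` holds `cpu: 100` (the init maximum beats the app sum of 20) and `memory: 2200` (the app sum beats the init maximum of 2048), matching the figures quoted above.
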
In the absence of a defined request or limit on a container, the effective
request/limit will be applied. For example, the following pod:

    pod:
      spec:
        initContainers:
        - limits:
            cpu: 100m
            memory: 1GiB
        containers:
        - request:
            cpu: 10m
            memory: 1100MiB

will have an effective request of `10m / 1100MiB`, and an effective limit
of `100m / 1GiB`, i.e.:

    pod:
      spec:
        initContainers:
        - request:
            cpu: 10m
            memory: 1GiB
        - limits:
            cpu: 100m
            memory: 1100MiB
        containers:
        - request:
            cpu: 10m
            memory: 1GiB
        - limits:
            cpu: 100m
            memory: 1100MiB

and thus have the QoS tier **Burstable** (because request is not equal to
limit).

Quota and limits will be applied based on the effective pod request and
limit.

Pod level cGroups will be based on the effective pod request and limit, the
same as the scheduler.


### Kubelet and container runtime details

Container runtimes should treat the set of init and app containers as one
large pool. An individual init container execution should be identical to
an app container, including all standard container environment setup
(network, namespaces, hostnames, DNS, etc).

All app container operations are permitted on init containers. The
logs for an init container should be available for the duration of the pod
lifetime or until the pod is restarted.

During initialization, app container status should be shown with the reason
`PodInitializing` if any init containers are present. Each init container
should show appropriate container status, and all init containers that are
waiting for earlier init containers to finish should have the `reason`
`PendingInitialization`.

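As a rough illustration of the status rules in this section, the sketch below computes the waiting reason for each container while one init container is running. The function name and the `running_index` parameter are hypothetical; only the two reason strings come from the text above.

```python
from typing import Dict, List


def container_reasons(
    init_names: List[str], app_names: List[str], running_index: int
) -> Dict[str, str]:
    """Waiting reasons while init container number `running_index` runs.

    Init containers at or before running_index report their own container
    status and are therefore left out of the result.
    """
    reasons: Dict[str, str] = {}
    for index, name in enumerate(init_names):
        if index > running_index:
            # Still waiting for an earlier init container to finish.
            reasons[name] = "PendingInitialization"
    for name in app_names:
        # App containers cannot start until initialization completes.
        reasons[name] = "PodInitializing"
    return reasons
```
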
The container runtime should aggressively prune failed init containers.
The container runtime should record whether all init containers have
succeeded internally, and only invoke new init containers if a pod
restart is needed (for Docker, if all containers terminate or if the pod
infra container terminates). Init containers should follow backoff rules
as necessary. The Kubelet *must* preserve at least the most recent instance
of an init container to serve logs and data for end users and to track
failure states. The Kubelet *should* prefer to garbage collect completed
init containers over app containers, as long as the Kubelet is able to
track that initialization has been completed. In the future, container
state checkpointing in the Kubelet may remove or reduce the need to
preserve old init containers.

For the initial implementation, the Kubelet will use the last termination
container state of the highest indexed init container to determine whether
the pod has completed initialization. During a pod restart, initialization
will be restarted from the beginning (all initializers will be rerun).


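The completion check described for the initial implementation can be sketched as follows. This is an assumption-laden illustration, not Kubelet code: `initialization_complete` is a hypothetical name, and each init container is reduced to the exit code of its last terminated instance (or `None` when no terminated instance is recorded).

```python
from typing import List, Optional


def initialization_complete(last_exit_codes: List[Optional[int]]) -> bool:
    """Decide completion from the highest indexed init container only.

    last_exit_codes[i] is the exit code of the last terminated instance of
    init container i, or None if no terminated instance is recorded.
    """
    if not last_exit_codes:
        return True  # no init containers, so nothing to wait for
    # Init containers run in order, so success of the last one implies that
    # every earlier one already succeeded.
    return last_exit_codes[-1] == 0
```

Because only the last entry is consulted, the check still works after records of earlier init containers have been garbage collected.
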
### API Behavior

All APIs that access containers by name should operate on both init and
app containers. Because names are unique, the addition of the init container
should be transparent to use cases.

A client with no knowledge of init containers should see appropriate
container status `reason` and `message` fields while the pod is in the
`Pending` phase, and so be able to communicate that to end users.


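A name-based lookup that covers both lists might look like the sketch below. It assumes the pod spec is a plain dictionary with `initContainers` and `containers` keys; `find_container` is a hypothetical helper, not an existing API.

```python
from typing import Dict, List, Optional, Tuple


def find_container(
    pod_spec: Dict[str, List[dict]], name: str
) -> Optional[Tuple[str, dict]]:
    """Return (field, container) for the named container, or None.

    Names are unique across both lists, so the first match is the only one.
    """
    for field in ("initContainers", "containers"):
        for container in pod_spec.get(field, []):
            if container["name"] == name:
                return field, container
    return None
```
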
### Example init containers

* Wait for a service to be created

        pod:
          spec:
            initContainers:
            - name: wait
              image: centos:centos7
              command: ["/bin/sh", "-c", "for i in {1..100}; do sleep 1; if dig myservice; then exit 0; fi; done; exit 1"]
            containers:
            - name: run
              image: application-image
              command: ["/my_application_that_depends_on_myservice"]

* Register this pod with a remote server

        pod:
          spec:
            initContainers:
            - name: register
              image: centos:centos7
              command: ["/bin/sh", "-c", "curl -X POST http://$MANAGEMENT_SERVICE_HOST:$MANAGEMENT_SERVICE_PORT/register -d 'instance=$(POD_NAME)&ip=$(POD_IP)'"]
              env:
              - name: POD_NAME
                valueFrom:
                  field: metadata.name
              - name: POD_IP
                valueFrom:
                  field: status.podIP
            containers:
            - name: run
              image: application-image
              command: ["/my_application_that_depends_on_myservice"]

* Wait for an arbitrary period of time

        pod:
          spec:
            initContainers:
            - name: wait
              image: centos:centos7
              command: ["/bin/sh", "-c", "sleep 60"]
            containers:
            - name: run
              image: application-image
              command: ["/static_binary_without_sleep"]

* Clone a git repository into a volume (can be implemented by volume containers in the future):

        pod:
          spec:
            initContainers:
            - name: download
              image: image-with-git
              command: ["git", "clone", "https://github.com/myrepo/myrepo.git", "/var/lib/data"]
              volumeMounts:
              - mountPath: /var/lib/data
                volumeName: git
            containers:
            - name: run
              image: centos:centos7
              command: ["/var/lib/data/binary"]
              volumeMounts:
              - mountPath: /var/lib/data
                volumeName: git
            volumes:
            - emptyDir: {}
              name: git

* Execute a template transformation based on environment (can be implemented by volume containers in the future):

        pod:
          spec:
            initContainers:
            - name: copy
              image: application-image
              command: ["/bin/cp", "mytemplate.j2", "/var/lib/data/"]
              volumeMounts:
              - mountPath: /var/lib/data
                volumeName: data
            - name: transform
              image: image-with-jinja
              command: ["/bin/sh", "-c", "jinja /var/lib/data/mytemplate.j2 > /var/lib/data/mytemplate.conf"]
              volumeMounts:
              - mountPath: /var/lib/data
                volumeName: data
            containers:
            - name: run
              image: application-image
              command: ["/myapplication", "-conf", "/var/lib/data/mytemplate.conf"]
              volumeMounts:
              - mountPath: /var/lib/data
                volumeName: data
            volumes:
            - emptyDir: {}
              name: data

* Perform a container build

        pod:
          spec:
            initContainers:
            - name: copy
              image: base-image
              workingDir: /home/user/source-tree
              command: ["make"]
            containers:
            - name: commit
              image: image-with-docker
              command:
              - /bin/sh
              - -c
              - docker commit $(complex_bash_to_get_container_id_of_copy) &&
                docker push $(commit_id) myrepo:latest
              volumeMounts:
              - mountPath: /var/run/docker.sock
                volumeName: dockersocket

## Backwards compatibility implications

Since this is a net new feature in the API and Kubelet, new API servers during upgrade may not
be able to rely on Kubelets implementing init containers. The management of feature skew between
master and Kubelet is tracked in issue [#4855](https://github.com/kubernetes/kubernetes/issues/4855).


## Future work

* Unify pod QoS class with init containers
* Implement container / image volumes to make composition of runtime from images efficient
