Given a work list, like in the [work list examples](#work-list-examples),
the information from the work list needs to get into each Pod of the Job.
Users will typically not want to create a new image for each job they
run. They will want to use existing images. So, the image is not the place
for the work list.
A work list can be stored on networked storage, and mounted by pods of the job.
Also, as a shortcut for small work lists, the list can be included in an annotation on the Job object,
which is then exposed in the pod as a file via a downward API volume.
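For illustration, here is a minimal sketch of the annotation shortcut, assuming the work list is
placed as an annotation on the pod template (so the pods carry it and the downward API can see it);
the annotation name, mount path and work items below are only examples:
```
apiVersion: extensions/v1beta1
kind: Job
metadata:
  name: process-files
spec:
  completions: 3
  template:
    metadata:
      annotations:
        # Hypothetical annotation carrying a small work list.
        worklist: "12342.dat 97283.dat 38732.dat"
    spec:
      containers:
      - name: worker
        image: myfileprocessor
        volumeMounts:
        - name: podinfo
          mountPath: /etc/podinfo
      volumes:
      - name: podinfo
        downwardAPI:
          items:
          # All annotations are written into a single file (see the discussion
          # of annotation extraction later in this doc).
          - path: annotations
            fieldRef:
              fieldPath: metadata.annotations
      restartPolicy: OnFailure
```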
### What Varies Between Pods of a Job
Pods need to differ in some way to do something different. (They do not
differ in the work-queue style of Job, but that style has ease-of-use issues).
A general approach would be to allow pods to differ from each other in arbitrary ways.
For example, the Job object could have a list of PodSpecs to run.
However, this is so general that it provides little value. It would:
- make the Job Spec very verbose, especially for jobs with thousands of work items
- make Job such a vague concept that it is hard to explain to users
- solve a problem we do not see in practice: many pods which differ across many fields of their
  specs, yet need to run as a group with no ordering constraints
- require CLIs and UIs to support more options for creating a Job
- complicate monitoring and accounting: it is useful for such databases to aggregate data for pods
  with the same controller, but pods with very different specs may not make sense
  to aggregate
- mean that profiling, debugging, accounting, auditing and monitoring tools cannot assume common
  images/files, behaviors, provenance and so on between Pods of a Job.
Also, variety has another cost. Pods which differ in ways that affect scheduling
(node constraints, resource requirements, labels) prevent the scheduler
from treating them as fungible, which is an important optimization for the scheduler.
Therefore, we will not allow Pods from the same Job to differ arbitrarily
(anyway, users can use multiple Job objects for that case). We will try to
allow as little as possible to differ between pods of the same Job, while
still allowing users to express common parallel patterns easily.
For users who need to run jobs which differ in other ways, they can create multiple
Jobs, and manage them as a group using labels.
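For example, a user might give several related Jobs a common label and operate on them together
(the label key and value here are just an example):
```
$ kubectl create -f analysis-part1.yaml -f analysis-part2.yaml   # each carries "group: my-analysis" in metadata.labels
$ kubectl get jobs -l group=my-analysis
$ kubectl delete jobs -l group=my-analysis
```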
From the above work lists, we see a need for Pods which differ in their command
lines, and in their environment variables. These work lists do not require the
pods to differ in other ways.
Experience with a [similar system](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf) has shown this model to be applicable
to a very broad range of problems, despite this restriction.
Therefore, we propose to allow pods in the same Job to differ **only** in the following aspects:
- command line
- environment variables
### Composition of existing images
The docker image that is used in a job may not be maintained by the person
running the job. Over time, the Dockerfile may change the ENTRYPOINT or CMD.
If we require people to specify the complete command line to use Indexed Job,
then they will not automatically pick up changes in the default
command or args.
This needs more thought.
### Running Ad-Hoc Jobs using kubectl
A user should be able to easily start an Indexed Job using `kubectl`.
For example, to run [work list 1](#work-list-1), a user should be able
to type something simple like:
```
kubectl run process-files --image=myfileprocessor \
    --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
    ...
```
How exactly this happens is discussed later in the doc: this is a sketch of the user experience.
In practice, the list of files might be much longer and stored in a file
on the user's local host, like:
```
$ cat files-to-process.txt
12342.dat
97283.dat
38732.dat
...
```
So, the user could specify instead: `--per-completion-env=F="$(cat files-to-process.txt)"`.
However, `kubectl` should also support a format like:
`--per-completion-env=F=@files-to-process.txt`.
That allows `kubectl` to parse the file, point out any syntax errors, and avoid running up against command-line length limits (2MB is common, and a limit as low as 4kB is still POSIX-compliant).
One case we do not try to handle is where the file of work is stored on a cloud filesystem, and not accessible from the user's local host. Then we cannot easily use Indexed Job, because we do not know the number of completions. The user needs to copy the file locally first, or use the Work-Queue style of Job (already supported).
Another case we do not try to handle is where the input file does not exist yet because this Job is to be run at a future time, or depends on another job. The workflow and scheduled job proposal need to consider this case. For that case, you could use an indexed job which runs a program which shards the input file (map-reduce-style).
To run a job at a future time, kubectl should build the same JobSpec, then put it into a ScheduledJob (#11980) and create that.
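A sketch of what that might look like, assuming ScheduledJob fields roughly as in that proposal
(the apiVersion, schedule and names below are placeholders, not a settled API):
```
apiVersion: batch/v2alpha1   # illustrative only
kind: ScheduledJob
metadata:
  name: process-files-nightly
spec:
  schedule: "0 2 * * *"      # placeholder schedule
  jobTemplate:
    spec:
      # the same JobSpec that kubectl run would have created directly
      completions: 3
      template:
        spec:
          containers:
          - name: worker
            image: myfileprocessor
          restartPolicy: OnFailure
```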
For [workflow type jobs](../../docs/user-guide/jobs.md#job-patterns), creating a complete workflow from a single command line would be messy, because of the need to specify all the arguments multiple times.
For that use case, the user could create a workflow message by hand.
Or the user could create a job template, and then make a workflow from the templates, perhaps like this:
```
$ kubectl run process-files --image=myfileprocessor \
    ...
```
One option, the *multiple substitution* approach, is to extend the JobSpec to hold a list of
parameter tuples (which are more easily expressed as a list of lists of individual parameters).
For example:
```
apiVersion: extensions/v1beta1
kind: Job
...
spec:
  completions: 3
  ...
  template:
    ...
  perCompletionArgs:
    container: 0
    -
      - "-f apple.txt"
      - "-f banana.txt"
      - "-f cherry.txt"
    -
      - "--remove-seeds"
      - ""
      - "--remove-pit"
  perCompletionEnvVars:
  - name: "FRUIT_COLOR"
    - "green"
    - "yellow"
    - "red"
```
However, just providing custom env vars, and not arguments, is sufficient
for many use cases: parameters can be put into env vars, and then
substituted on the command line.
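For example, a sketch of a container spec where the per-completion value arrives as an env var and
is substituted into the command line (the env var name `F` and the binary path are only illustrative):
```
containers:
- name: worker
  image: myfileprocessor
  env:
  # Set differently for each completion index; "apple.txt" would be the
  # value for index 0 in the example above.
  - name: F
    value: "apple.txt"
  # Kubelet expands $(F) in command/args without requiring a shell in the image.
  command: ["/myfileprocessor", "-f", "$(F)"]
```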
#### Comparison
The multiple substitution approach:
- keeps the *per completion parameters* in the JobSpec.
- Drawback: makes the job spec large for jobs with thousands of completions. (But for very large jobs, the work-queue style or another type of controller, such as map-reduce or spark, may be a better fit.)
- Drawback: is a form of server-side templating, which we want in Kubernetes but have not fully designed
(see the [PetSets proposal](https://github.com/kubernetes/kubernetes/pull/18016/files?short_path=61f4179#diff-61f41798f4bced6e42e45731c1494cee)).
The index-only approach (each pod receives only its completion index, and looks up its own parameters):
- requires that the user keep the *per completion parameters* in separate storage, such as a configData or networked storage.
- makes no changes to the JobSpec.
- Drawback: while in separate storage, the parameters could be mutated mid-job, which would have unexpected effects.
However, in order to inject other env vars from the parameter list,
kubectl still needs to edit the command line.
Parameter lists could be passed via a configData volume instead of a secret.
Kubectl can be changed to work that way once the configData implementation is
complete.
Parameter lists could be passed inside an EnvVar. This would have length
limitations, and would pollute the output of `kubectl describe pods` and `kubectl
get pods -o json`.
Parameter lists could be passed inside an annotation. This would have length
limitations, and would pollute the output of `kubectl describe pods` and `kubectl
get pods -o json`. Also, currently annotations can only be extracted into a
single file. Complex logic is then needed to filter out exactly the desired
annotation data.
Bash array variables could simplify extraction of a particular parameter from a
list of parameters. However, some popular base images do not include
`/bin/bash`. For example, `busybox` uses a compact `/bin/sh` implementation
that does not support array syntax.
Kubelet does support [expanding variables without a
shell](http://kubernetes.io/v1.1/docs/design/expansion.html). But it does not
allow for recursive substitution, which is required to extract the correct
parameter from a list based on the completion index of the pod. The syntax
could be extended, but doing so seems complex and will be an unfamiliar syntax
for users.
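To make the limitation concrete: selecting a parameter by index would need a nested expansion along
the following lines, which the current syntax does not allow (all variable names here are hypothetical):
```
env:
- name: INDEX        # this pod's completion index
  value: "2"
- name: PARAM_0
  value: "apple.txt"
- name: PARAM_1
  value: "banana.txt"
- name: PARAM_2
  value: "cherry.txt"
# Not supported: $(INDEX) would have to be expanded first, and the result
# expanded again, i.e. recursive substitution.
command: ["/myfileprocessor", "-f", "$(PARAM_$(INDEX))"]
```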
Putting all the command line editing into a script and running that causes
the least pollution to the original command line, and it allows
for complex error handling.
Kubectl could store the script in an [Inline Volume](
https://github.com/kubernetes/kubernetes/issues/13610) if that proposal
is approved. That would remove the need to manage the lifetime of the
configData/secret, and prevent the case where someone changes the
configData mid-job, and breaks things in a hard-to-debug way.
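A minimal sketch of such a wrapper script, assuming the work list is mounted at a known path and
the completion index is injected as an env var (the paths, variable name and error handling are all
hypothetical):
```
#!/bin/sh
# Hypothetical wrapper generated by kubectl and mounted into the pod.
# INDEX: this pod's completion index; /params/worklist.txt: one work item per line.
set -e
total=$(wc -l < /params/worklist.txt)
if [ "$INDEX" -ge "$total" ]; then
  echo "completion index $INDEX out of range ($total work items)" >&2
  exit 1
fi
# Pick the line for this index (sed works in busybox /bin/sh; no bash arrays needed).
F=$(sed -n "$((INDEX + 1))p" /params/worklist.txt)
exec /myfileprocessor -f "$F"
```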
## Interactions with other features
#### Supporting Work Queue Jobs too
For Work Queue Jobs, `.spec.completions` has no meaning: parallelism should be allowed to exceed it, and pods have no identity. So, the job controller should not create a scoreboard in the JobStatus, just a count. Therefore, we need to add one of the following to JobSpec:
- allow unset `.spec.completions` to indicate no scoreboard, and no index for tasks (identical tasks)
- allow `.spec.completions=-1` to indicate the same.
- add `.spec.indexed` to Job to indicate need for a scoreboard (see the sketch below).
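For instance, the third option might look like this in the JobSpec (the field name and its
placement are only illustrative, not a settled API):
```
spec:
  completions: 3
  parallelism: 3
  indexed: true    # hypothetical field: ask for per-completion indexes and a scoreboard
```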
#### Interaction with vertical autoscaling
Since pods of the same job will not be created with different resources,
a vertical autoscaler will need to:
- if it has index-specific initial resource suggestions, suggest those at admission
time; it will need to understand indexes.
- mutate resource requests on already created pods based on usage trend or previous container failures
- modify the job template, affecting all indexes.
#### Comparison to PetSets
The *Index substitution-only* option corresponds roughly to PetSet Proposal 1b.
The `perCompletionArgs` approach is similar to PetSet Proposal 1e, but more restrictive and thus less verbose.
It would be easier for users if Indexed Job and PetSet are similar where possible.
However, PetSet differs in several key respects:
- PetSet is for ones to tens of instances. Indexed job should work with tens of
thousands of instances.
- When you have few instances, you may want to give them pet names. When you have many
instances, integer indexes make more sense.
- When you have thousands of instances, storing the work-list in the JobSpec
is verbose. For PetSet, this is less of a problem.
- PetSets (apparently) need to differ in more fields than indexed Jobs.
In short, Indexed Job differs from PetSet in that PetSet uses names and not indexes,
and PetSet is intended to support ones to tens of instances.