In upstream the kubelet is responsible for all pods which have the spec.NodeName
set. In Mesos we have a two-stage scheduling process:
1. pods with a pre-set spec.NodeName are still scheduled by the scheduler.
2. The kubelet will only see them when a Mesos task was started and the executor
passes the pod to the kubelet.
With this PR a pod with spec.NodeName which is gracefully terminated, but not
yet scheduled, e.g.
- because the termination happened just after creation and the scheduler was
not fast enough
- because the NodeSelector does not match
is deleted by the Mesos scheduler.
- pre-create node api objects from the scheduler when offers arrive
- decline offers until nodes a registered
- turn slave attributes as k8s.mesosphere.io/attribute-* labels
- update labels from executor Register/Reregister
- watch nodes in scheduler to make non-Mesos labels available for NodeSelector matching
- add unit tests for label predicate
- add e2e test to check that slave attributes really end up as node labels
- remove bleeding of registry-internal objects, without any locking
- rename from SlaveStorage to Registry which fits much better to what
it actually does
A lot of packages use StringSet, but they don't use anything else from
the util package. Moving StringSet into another package will shrink
their dependency trees significantly.
- new: introduce AllocationStrategy, Predicate, and Procurement to scheduler pkg
- new: --contain-pod-resources flag (workaround for docker+systemd+mesos problems)
- new: --account-for-pod-resources flag (for testing overcommitment)
- bugfix: forward -v flag from minion controller to executor
- The TASK_ERROR task status was introduced with Mesos 0.21 and is actually used since 0.22.
It was not handled at all before this patch, leaving errored task in the registry in phase
"Pending". This will lead to task status updates from the Mesos Master on reconciliation with empty
slaveId fields, leading to scheduler crashes eventually.
- Handle terminal task with empty slaveId.
The slave id can be empty for TASK_ERROR.
The modified code path does not use the slaveId.
- move assigned slave to T.Spec.AssignedSlave
- only create the BindingHost annoation in prepareTaskForLaunch
- recover the assigned slave from annotation and write it back to the T.Spec field
Before this patch the annotation were used to store the assign slave. But due
to the cloning of tasks in the registry, this value was never persisted in the
registry.
This patch adds it to the Spec of a task and only creates the annotation
last-minute before launching.
Without this patch pods which fail before binding will stay in the registry,
but they are never rescheduled again. The reason: the BindingHost annotation does
not exist in the registry and not on the apiserver (compare reconcilePod function).