# High Availability
Kubernetes on Mesos will eventually support two HA modes:

* [Hot-standby](#hot-standby) (*work-in-progress*)
* [Cold-standby](#cold-standby)

Hot-standby mode is still a work in progress, as the controller manager is not
yet HA-aware (the work is being tracked [here][2]). Nevertheless, we describe
below how hot-standby mode is intended to work. Until that work is done, it is
recommended to use cold-standby mode for HA. In hot-standby mode, all master
components (apiserver, controller manager, and scheduler) actively run on every
master node. Additional logic is added to the controller manager and scheduler
to coordinate their access to the etcd backend and deal with concurrency issues
when modifying cluster state. As the apiserver does not modify cluster state,
multiple instances can run concurrently without coordination. When the leader
(i.e., the node whose scheduler is active) crashes, the other master nodes
detect the failure after some time and then elect a new leader.

In cold-standby mode, as in hot-standby mode, the apiserver actively runs on
every master node. However, only one scheduler and one controller manager run
at any given time. This is coordinated by a small external program called
`podmaster` that uses etcd to perform leader election; only on the leader node
does the `podmaster` start the scheduler and controller manager. Cold-standby
mode is how standalone Kubernetes supports HA, and more information can be
found [here][1].
## Hot-standby
### Scheduler
The command line parameters that affect the hash calculation are listed below.
- `--proxy-*`
- `--static-pods-config`
## Cold-standby
Setting up Kubernetes on Mesos in cold-standby mode is similar to setting up
standalone Kubernetes as described in [Kubernetes HA][1]. However, special
attention is needed when setting up the K8sm scheduler so that when the
currently active scheduler crashes or dies, a new one can be instantiated and
take over the work. More precisely, the new scheduler needs to be compatible
with the executors that were started by the previous, now-dead scheduler.
### Environment Variables
We will set up the K8sm master on 2 nodes in HA mode. The same steps can be
extended to more master nodes in order to tolerate more concurrent failures.
First, we define a few environment variables that describe the testbed
environment.
```
MESOS_IP=192.168.0.1
MESOS_PORT=5050
ETCD_IP=192.168.0.2
ETCD_PORT=4001
K8S_1_IP=192.168.0.3
K8S_2_IP=192.168.0.4
K8S_APISERVER_PORT=8080
K8S_SCHEDULER_PORT=10251
NGINX_IP=192.168.0.5
NGINX_APISERVER_PORT=80
NGINX_SCHEDULER_PORT=81
```
In addition to the 2 K8sm master nodes (`192.168.0.3` and `192.168.0.4`), we
also have a Mesos master at `192.168.0.1`, an etcd server at `192.168.0.2`, and
an Nginx server at `192.168.0.5` that load balances between the 2 K8sm master
nodes.
### K8sm Container Image
We use podmaster to coordinate leader election amongst the K8sm masters.
Podmaster runs in a container (preferably in a pod), and on the leader node it
starts the scheduler and controller manager, each in its own pod. The podmaster
image is pre-built and can be obtained from `gcr.io/google_containers/podmaster`.
An official image containing the `km` binary, which starts the apiserver,
scheduler, and controller manager, is not yet available, but it can be built
fairly easily.
```shell
$ cat <<EOF >Dockerfile
FROM ubuntu
MAINTAINER Hai Huang <haih@us.ibm.com>
RUN mkdir -p /opt/kubernetes
COPY kubernetes/_output/dockerized/bin/linux/amd64/ /opt/kubernetes
ENTRYPOINT ["/opt/kubernetes/km"]
EOF
$ cat <<EOF >build.sh
#!/bin/bash
K8SM_IMAGE_NAME=haih/k8sm
git clone https://github.com/mesosphere/kubernetes
cd kubernetes
git checkout release-v0.7-v1.1
KUBERNETES_CONTRIB=mesos build/run.sh hack/build-go.sh
cd ..
sudo docker build -t $K8SM_IMAGE_NAME --no-cache .
EOF
$ chmod 755 build.sh
$ ./build.sh
```
Make sure the Docker engine is running locally, as Kubernetes is compiled
inside a Docker container. One can also change the image name and the
Kubernetes release to compile by modifying the script. After the script has
finished running, there should be a local Docker image called `haih/k8sm` (use
`docker images` to check).

Optionally, we can also push the image to Docker Hub (i.e., `docker push
$K8SM_IMAGE_NAME`) so we do not have to compile the image on every K8sm master
node.
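As a quick sanity check (assuming the default image name used by the script
above; adjust if you changed `K8SM_IMAGE_NAME`, and note that pushing requires
access to that Docker Hub namespace):

```shell
$ docker images | grep haih/k8sm   # the freshly built K8sm image should be listed
$ docker push haih/k8sm            # optional: push so other master nodes can pull it
```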
**IMPORTANT:** The Mesosphere team is currently maintaining the stable K8sm
release in a separate [fork][3]. At the time of this writing, the latest stable
release is `release-v0.7-v1.1`.
### Configure ETCD
We assume there is an etcd server at `$ETCD_IP`. Ideally this would be a
cluster of etcd servers running in HA mode, backed by redundant persistent
storage. For testing purposes, one can spin up a single etcd instance in a
Docker container on the etcd server.
```shell
$ docker run -d --hostname $(uname -n) --name etcd \
-p ${ETCD_PORT}:${ETCD_PORT} \
quay.io/coreos/etcd:v2.0.12 \
--listen-client-urls http://0.0.0.0:${ETCD_PORT} \
--advertise-client-urls http://${ETCD_IP}:${ETCD_PORT}
```
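To confirm the etcd instance is reachable from the K8sm master nodes, a simple
check such as the following should suffice:

```shell
$ curl -s http://${ETCD_IP}:${ETCD_PORT}/version     # prints the etcd version string
$ curl -s http://${ETCD_IP}:${ETCD_PORT}/v2/keys/    # lists the (initially empty) key space
```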
### Configure Podmaster
Since we plan to run all K8sm components and podmaster in pods, we can use
`kubelet` to bootstrap these pods by specifying a manifests directory.
```shell
$ mkdir -p /etc/kubernetes/manifests/
$ mkdir -p /srv/kubernetes/manifests/
```
Once the kubelet has started, it will check the manifests directory periodically
to see if it needs to start or stop pods. Pods can be started by putting their
specification yaml files into the manifests directory, and subsequently they
can be stopped by removing these yaml files.
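For example (the file name below is purely illustrative), adding or removing a
manifest is all that is needed to start or stop a static pod:

```shell
$ cp mypod.yaml /etc/kubernetes/manifests/    # kubelet starts the pod on its next directory check
$ rm /etc/kubernetes/manifests/mypod.yaml     # kubelet stops and removes the pod
```

The podmaster pod below is bootstrapped in exactly this way.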
```shell
$ cat <<EOF > /etc/kubernetes/manifests/podmaster.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-podmaster
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: scheduler-elector
    image: gcr.io/google_containers/podmaster:1.1
    command:
    - /podmaster
    - --etcd-servers=http://${ETCD_IP}:${ETCD_PORT}
    - --key=scheduler
    - --whoami=${MY_IP}
    - --source-file=/src/manifests/scheduler.yaml
    - --dest-file=/dst/manifests/scheduler.yaml
    volumeMounts:
    - mountPath: /src/manifests
      name: manifest-src
      readOnly: true
    - mountPath: /dst/manifests
      name: manifest-dst
  - name: controller-manager-elector
    image: gcr.io/google_containers/podmaster:1.1
    command:
    - /podmaster
    - --etcd-servers=http://${ETCD_IP}:${ETCD_PORT}
    - --key=controller
    - --whoami=${MY_IP}
    - --source-file=/src/manifests/controller-mgr.yaml
    - --dest-file=/dst/manifests/controller-mgr.yaml
    terminationMessagePath: /dev/termination-log
    volumeMounts:
    - mountPath: /src/manifests
      name: manifest-src
      readOnly: true
    - mountPath: /dst/manifests
      name: manifest-dst
  volumes:
  - hostPath:
      path: /srv/kubernetes/manifests
    name: manifest-src
  - hostPath:
      path: /etc/kubernetes/manifests
    name: manifest-dst
EOF
```
One must change `$MY_IP` to either `$K8S_1_IP` or `$K8S_2_IP`, depending on
which master node the podmaster is currently being set up on. The podmasters
will compete with each other for leadership, and the winner will copy the
scheduler's and controller manager's pod specification yaml files from
`/srv/kubernetes/manifests/` to `/etc/kubernetes/manifests/`. When the kubelet
detects these new yaml files, it starts the corresponding pods.
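The election itself is essentially a TTL-protected compare-and-swap against
etcd. The sketch below only illustrates that pattern with plain etcd v2
commands; the key name and TTL are arbitrary examples, not podmaster's actual
values or implementation:

```shell
# Illustrative sketch of TTL-based leader election, not podmaster's real code.
# Create the key only if it does not already exist; whoever succeeds is the leader.
if etcdctl --peers http://${ETCD_IP}:${ETCD_PORT} mk /election/scheduler ${MY_IP} --ttl 30 >/dev/null 2>&1; then
  echo "this node (${MY_IP}) is now the leader"
  # The leader must keep refreshing the TTL, or standbys will take over once it expires.
else
  echo "standby; current leader: $(etcdctl --peers http://${ETCD_IP}:${ETCD_PORT} get /election/scheduler)"
fi
```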
### Configure Scheduler
The scheduler pod specification will be put into `/srv/kubernetes/manifests/`.
```shell
$ cat <<EOF > /srv/kubernetes/manifests/scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-scheduler
    image: haih/k8sm:latest
    imagePullPolicy: IfNotPresent
    command:
    - /opt/kubernetes/km
    - scheduler
    - --address=${MY_IP}
    - --advertised-address=${NGINX_IP}:${NGINX_SCHEDULER_PORT}
    - --mesos-master=${MESOS_IP}:${MESOS_PORT}
    - --etcd-servers=http://${ETCD_IP}:${ETCD_PORT}
    - --api-servers=${NGINX_IP}:${NGINX_APISERVER_PORT}
    - --v=10
EOF
```
Again, one must change `$MY_IP` to either `$K8S_1_IP` or `$K8S_2_IP`, depending
on which master node is currently being worked on. Even though we have not set
up Nginx yet, we can already specify `--api-servers` and `--advertised-address`
using Nginx's address and ports (just make sure Nginx is running before
turning on the scheduler). Having `--api-servers` point to Nginx allows
executors to maintain connectivity to one of the apiservers even when one or
more apiservers are down, as Nginx automatically re-routes requests to a
working apiserver.

It is critically important to point `--advertised-address` to Nginx so that all
schedulers are assigned the same executor ID. If we instead set
`--advertised-address=${K8S_1_IP}` on the first K8s master and
`--advertised-address=${K8S_2_IP}` on the second K8s master, they would
generate different executor IDs, and during a fail-over the new scheduler would
not be able to use the executors started by the failed scheduler. In that case,
one could see this error message in the scheduler log:

> Declining incompatible offer...
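If a fail-over does not go smoothly, the quickest way to check for this symptom
is to inspect the active scheduler's log directly through Docker on the leader
node (the container ID below is a placeholder taken from the `docker ps`
output):

```shell
$ docker ps | grep kube-scheduler                                    # find the scheduler container
$ docker logs <scheduler-container-id> 2>&1 | grep -i "incompatible offer"
```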
### Configure Controller Manager
The controller manager pod specification will also be put into `/srv/kubernetes/manifests/`.
```shell
$ cat <<EOF > /srv/kubernetes/manifests/controller-mgr.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-controller-manager
    image: haih/k8sm:latest
    imagePullPolicy: IfNotPresent
    command:
    - /opt/kubernetes/km
    - controller-manager
    - --master=http://${NGINX_IP}:${NGINX_APISERVER_PORT}
    - --cloud-provider=mesos
    - --cloud-config=/etc/kubernetes/mesos-cloud.conf
    volumeMounts:
    - mountPath: /etc/kubernetes
      name: kubernetes-config
      readOnly: true
  volumes:
  - hostPath:
      path: /etc/kubernetes
    name: kubernetes-config
EOF
```
The controller manager also needs a Mesos cloud configuration file as one of
its parameters; this configuration file is written to
`/etc/kubernetes/mesos-cloud.conf`.
```shell
$ cat <<EOF >/etc/kubernetes/mesos-cloud.conf
[mesos-cloud]
mesos-master = ${MESOS_IP}:${MESOS_PORT}
EOF
```
### Configure Apiserver
The apiserver runs on every master node, so its pod specification file is put
directly into `/etc/kubernetes/manifests/`.
```shell
$ cat <<EOF > /etc/kubernetes/manifests/apiserver.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: haih/k8sm:latest
    imagePullPolicy: IfNotPresent
    command:
    - /opt/kubernetes/km
    - apiserver
    - --insecure-bind-address=0.0.0.0
    - --etcd-servers=http://${ETCD_IP}:${ETCD_PORT}
    - --allow-privileged=true
    - --service-cluster-ip-range=10.10.10.0/24
    - --insecure-port=${K8S_APISERVER_PORT}
    - --cloud-provider=mesos
    - --cloud-config=/etc/kubernetes/mesos-cloud.conf
    - --advertise-address=${MY_IP}
    ports:
    - containerPort: ${K8S_APISERVER_PORT}
      hostPort: ${K8S_APISERVER_PORT}
      name: local
    volumeMounts:
    - mountPath: /etc/kubernetes
      name: kubernetes-config
      readOnly: true
  volumes:
  - hostPath:
      path: /etc/kubernetes
    name: kubernetes-config
EOF
```
Again, one must change `$MY_IP` to either `$K8S_1_IP` or `$K8S_2_IP`, depending
on which master node is currently being worked on.
To summarize our current setup: the apiserver's and podmaster's pod
specification files are put into `/etc/kubernetes/manifests/` so they run on
every master node. The scheduler's and controller manager's pod specification
files are put into `/srv/kubernetes/manifests/`, and they are copied into
`/etc/kubernetes/manifests/` by podmaster if and only if that node's podmaster
was elected leader.
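In other words, after configuration each master node should contain the
following files (leaving aside what podmaster copies at runtime; output order
may vary):

```shell
$ find /etc/kubernetes /srv/kubernetes -type f
/etc/kubernetes/manifests/podmaster.yaml
/etc/kubernetes/manifests/apiserver.yaml
/etc/kubernetes/mesos-cloud.conf
/srv/kubernetes/manifests/scheduler.yaml
/srv/kubernetes/manifests/controller-mgr.yaml
```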
### Configure Nginx
Nginx needs to be configured to load balance for both the apiserver and the
scheduler. For testing purposes, one can start Nginx in a Docker container.
```shell
$ cat <<EOF >nginx.conf
events {
  worker_connections 4096;  ## Default: 1024
}
http {
  upstream apiservers {
    server ${K8S_1_IP}:${K8S_APISERVER_PORT};
    server ${K8S_2_IP}:${K8S_APISERVER_PORT};
  }
  upstream schedulers {
    server ${K8S_1_IP}:${K8S_SCHEDULER_PORT};
    server ${K8S_2_IP}:${K8S_SCHEDULER_PORT};
  }
  server {
    listen ${NGINX_APISERVER_PORT};
    location / {
      proxy_pass http://apiservers;
      proxy_next_upstream error timeout invalid_header http_500;
      proxy_connect_timeout 2;
      proxy_buffering off;
      proxy_read_timeout 12h;
      proxy_send_timeout 12h;
    }
  }
  server {
    listen ${NGINX_SCHEDULER_PORT};
    location / {
      proxy_pass http://schedulers;
      proxy_next_upstream error timeout invalid_header http_500;
      proxy_connect_timeout 2;
      proxy_buffering off;
      proxy_read_timeout 12h;
      proxy_send_timeout 12h;
    }
  }
}
EOF
$ docker run \
-p $NGINX_APISERVER_PORT:$NGINX_APISERVER_PORT \
-p $NGINX_SCHEDULER_PORT:$NGINX_SCHEDULER_PORT \
--name nginx \
-v `pwd`/nginx.conf:/etc/nginx/nginx.conf:ro \
-d nginx:latest
```
For the sake of clarity, configuring Nginx to support HTTP over TLS/SPDY is
outside the scope of this document. However, keep in mind that without TLS/SPDY
properly configured, some `kubectl` commands might not work properly. This
problem is documented [here][4].
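Once the apiservers have been started by the kubelets in the next section, one
can verify that Nginx is proxying requests correctly:

```shell
$ curl -s http://${NGINX_IP}:${NGINX_APISERVER_PORT}/healthz    # should print "ok"
$ curl -s http://${NGINX_IP}:${NGINX_APISERVER_PORT}/version    # should print the apiserver's version info
```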
### Start Kubelet
To start everything up, we need to start the kubelet on the K8s master nodes so
they can launch the apiserver and podmaster pods. On the leader node, podmaster
will subsequently start the scheduler and controller manager.
```shell
$ mkdir -p /var/log/kubernetes
$ kubelet \
    --api-servers=http://127.0.0.1:${K8S_APISERVER_PORT} \
    --register-node=false \
    --allow-privileged=true \
    --config=/etc/kubernetes/manifests \
    1>/var/log/kubernetes/kubelet.log 2>&1 &
```
### Verification
On each of the K8s master nodes, one can run `docker ps` to verify that an
apiserver pod and a podmaster pod are running; on exactly one of the K8s master
nodes, a controller manager pod and a scheduler pod should be running as well.
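Another way to tell which node currently holds leadership is to look at what
podmaster has copied into the kubelet's manifests directory:

```shell
$ ls /etc/kubernetes/manifests/
# on the leader:       apiserver.yaml  controller-mgr.yaml  podmaster.yaml  scheduler.yaml
# on a standby node:   apiserver.yaml  podmaster.yaml
```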
One should also verify that user pods can be created in the K8sm cluster:
```shell
$ export KUBERNETES_MASTER=http://${NGINX_IP}:${NGINX_APISERVER_PORT}
$ kubectl create -f <userpod yaml file>
$ kubectl get pods
```
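For example, a minimal user pod specification (the pod name and image below are
arbitrary choices for the test):

```shell
$ cat <<EOF >testpod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: testpod
spec:
  containers:
  - name: testpod
    image: nginx
EOF
$ kubectl create -f testpod.yaml
$ kubectl get pods
```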
The pod should be shown in the `Running` state after a short amount of time.
### Tuning
During a fail-over, cold-standby mode takes some time before a new scheduler
can be started to take over the work of the failed one. However, one can tune
various parameters to shorten this time.

Podmaster's `--sleep` and `--ttl-secs` parameters can both be tuned for faster
failure detection. However, `--ttl-secs` should not be set too low, or false
positives become more likely.

The kubelet's `--file-check-frequency` parameter controls how frequently it
checks the manifests directory. It is set to 20 seconds by default.
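For example (the values below are illustrative only, not recommendations), the
podmaster flags would be added to both elector containers' `command` in
`podmaster.yaml`, and the kubelet could be restarted with a shorter file-check
interval:

```shell
# In /etc/kubernetes/manifests/podmaster.yaml, under each elector's command (illustrative values):
#   - --sleep=5s
#   - --ttl-secs=15
# Kubelet, restarted with a shorter manifest check interval:
$ kubelet \
    --api-servers=http://127.0.0.1:${K8S_APISERVER_PORT} \
    --register-node=false \
    --allow-privileged=true \
    --config=/etc/kubernetes/manifests \
    --file-check-frequency=5s \
    1>/var/log/kubernetes/kubelet.log 2>&1 &
```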
[1]: http://kubernetes.io/v1.0/docs/admin/high-availability.html
[2]: https://github.com/mesosphere/kubernetes-mesos/issues/457
[3]: https://github.com/mesosphere/kubernetes
[4]: https://github.com/kubernetes/kubernetes/blob/master/contrib/mesos/docs/issues.md#kubectl