# High Availability
Kubernetes on Mesos will eventually support two HA modes:

* [Hot-standby](#hot-standby) (*work-in-progress*)
* [Cold-standby](#cold-standby)

Hot-standby mode is still a work in progress, as the controller manager is not
yet HA-aware (the work is being tracked [here][2]). Nevertheless, we describe
below how hot-standby mode is intended to work. Until that work is done, it is
recommended to use cold-standby mode for HA. In hot-standby mode, all master
components (apiserver, controller manager, and scheduler) actively run on every
master node. Additional logic is added to the controller manager and scheduler
to coordinate their access to the etcd backend and deal with concurrency issues
when modifying cluster state. As the apiserver does not modify cluster state,
multiple instances can run concurrently without coordination. When the leader
(i.e., the node whose scheduler is active) crashes, the other master nodes
detect the failure after some time and then elect a new leader.

In cold-standby mode, as in hot-standby mode, the apiserver actively runs on
every master node. However, only one scheduler and one controller manager run
at any given time. This is coordinated by a small external program called
`podmaster` that uses etcd to perform leader election; only on the leader node
does the `podmaster` start the scheduler and controller manager. Cold-standby
mode is how standalone Kubernetes supports HA, and more information can be
found [here][1].
## Hot-standby
### Scheduler
The command line parameters that affect the hash calculation are listed below.
- `--proxy-*`
- `--static-pods-config`
## Cold-standby
Setting up Kubernetes on Mesos in cold-standby mode is similar to setting up
standalone Kubernetes as described in [Kubernetes HA][1]. However, special
attention is needed when setting up the K8sm scheduler so that when the
currently active scheduler crashes or dies, a new one can be instantiated and
take over the work. More precisely, the new scheduler needs to be compatible
with the executors that were started by the previous, now-dead scheduler.
### Environment Variables
We will set up the K8sm master on 2 nodes in HA mode. The same steps can be
extended to more master nodes in order to tolerate more concurrent failures.
First, we define a few environment variables that describe the testbed
environment.
```
MESOS_IP=192.168.0.1
MESOS_PORT=5050
ETCD_IP=192.168.0.2
ETCD_PORT=4001
K8S_1_IP=192.168.0.3
K8S_2_IP=192.168.0.4
K8S_APISERVER_PORT=8080
K8S_SCHEDULER_PORT=10251
NGINX_IP=192.168.0.5
NGINX_APISERVER_PORT=80
NGINX_SCHEDULER_PORT=81
```
In addition to the 2 K8sm master nodes (`192.168.0.3` and `192.168.0.4`), we
also have a Mesos master at `192.168.0.1`, an etcd server at `192.168.0.2`, and
an Nginx server at `192.168.0.5` that load balances between the 2 K8sm master
nodes.
### K8sm Container Image
We use podmaster to coordinate leader election amongst the K8sm masters.
Podmaster runs in a container (preferably in a pod), and on the leader node it
starts the scheduler and controller manager, each in its own pod. The podmaster
image is pre-built and can be obtained from `gcr.io/google_containers/podmaster`.
An official image containing the `km` binary, which starts the apiserver,
scheduler, and controller manager, is not yet available, but it can be built
fairly easily.
```shell
$ cat <<EOF >Dockerfile
FROM ubuntu
MAINTAINER Hai Huang <haih@us.ibm.com>
RUN mkdir -p /opt/kubernetes
COPY kubernetes/_output/dockerized/bin/linux/amd64/ /opt/kubernetes
ENTRYPOINT ["/opt/kubernetes/km"]
EOF
$ cat <<EOF >build.sh
#!/bin/bash
K8SM_IMAGE_NAME=haih/k8sm
git clone https://github.com/mesosphere/kubernetes
cd kubernetes
git checkout release-v0.7-v1.1
KUBERNETES_CONTRIB=mesos build/run.sh hack/build-go.sh
cd ..
sudo docker build -t $K8SM_IMAGE_NAME --no-cache .
EOF
$ chmod 755 build.sh
$ ./build.sh
```
Make sure the Docker engine is running locally, as Kubernetes is compiled
inside a Docker container. One can also change the image name and the
Kubernetes release to compile by modifying the script. After the script has
finished running, there should be a local Docker image called `haih/k8sm` (use
`docker images` to check).

Optionally, we can also push the image to Docker Hub (i.e., `docker push
$K8SM_IMAGE_NAME`) so we do not have to compile the image on every K8sm master
node.
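As a quick sanity check (assuming the default image name used by the script
above; adjust if you changed `K8SM_IMAGE_NAME`, and note that pushing requires
access to that Docker Hub namespace):

```shell
$ docker images | grep haih/k8sm   # the freshly built K8sm image should be listed
$ docker push haih/k8sm            # optional: push so other master nodes can pull it
```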
**IMPORTANT:** The Mesosphere team is currently maintaining the stable K8sm
release in a separate [fork][3]. At the time of this writing, the latest stable
release is `release-v0.7-v1.1`.
### Configure ETCD
We assume there is an etcd server at `$ETCD_IP`. Ideally this would be a
cluster of etcd servers running in HA mode, backed by redundant persistent
storage. For testing purposes, one can spin up a single etcd instance in a
Docker container on the etcd server.
```shell
$ docker run -d --hostname $(uname -n) --name etcd \
-p ${ETCD_PORT}:${ETCD_PORT} \
quay.io/coreos/etcd:v2.0.12 \
--listen-client-urls http://0.0.0.0:${ETCD_PORT} \
--advertise-client-urls http://${ETCD_IP}:${ETCD_PORT}
```
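To confirm the etcd instance is reachable from the K8sm master nodes, a simple
check such as the following should suffice:

```shell
$ curl -s http://${ETCD_IP}:${ETCD_PORT}/version     # prints the etcd version string
$ curl -s http://${ETCD_IP}:${ETCD_PORT}/v2/keys/    # lists the (initially empty) key space
```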
### Configure Podmaster
Since we plan to run all K8sm components and podmaster in pods, we can use
`kubelet` to bootstrap these pods by specifying a manifests directory.
```shell
$ mkdir -p /etc/kubernetes/manifests/
$ mkdir -p /srv/kubernetes/manifests/
```
Once the kubelet has started, it will check the manifests directory periodically
to see if it needs to start or stop pods. Pods can be started by putting their
specification yaml files into the manifests directory, and subsequently they
can be stopped by removing these yaml files.
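For example (the file name below is purely illustrative), adding or removing a
manifest is all that is needed to start or stop a static pod:

```shell
$ cp mypod.yaml /etc/kubernetes/manifests/    # kubelet starts the pod on its next directory check
$ rm /etc/kubernetes/manifests/mypod.yaml     # kubelet stops and removes the pod
```

The podmaster pod below is bootstrapped in exactly this way.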
```shell
$ cat <<EOF > /etc/kubernetes/manifests/podmaster.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-podmaster
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: scheduler-elector
    image: gcr.io/google_containers/podmaster:1.1
    command:
    - /podmaster
    - --etcd-servers=http://${ETCD_IP}:${ETCD_PORT}
    - --key=scheduler
    - --whoami=${MY_IP}
    - --source-file=/src/manifests/scheduler.yaml
    - --dest-file=/dst/manifests/scheduler.yaml
    volumeMounts:
    - mountPath: /src/manifests
      name: manifest-src
      readOnly: true
    - mountPath: /dst/manifests
      name: manifest-dst
  - name: controller-manager-elector
    image: gcr.io/google_containers/podmaster:1.1
    command:
    - /podmaster
    - --etcd-servers=http://${ETCD_IP}:${ETCD_PORT}
    - --key=controller
    - --whoami=${MY_IP}
    - --source-file=/src/manifests/controller-mgr.yaml
    - --dest-file=/dst/manifests/controller-mgr.yaml
    terminationMessagePath: /dev/termination-log
    volumeMounts:
    - mountPath: /src/manifests
      name: manifest-src
      readOnly: true
    - mountPath: /dst/manifests
      name: manifest-dst
  volumes:
  - hostPath:
      path: /srv/kubernetes/manifests
    name: manifest-src
  - hostPath:
      path: /etc/kubernetes/manifests
    name: manifest-dst
EOF
```
One must change `$MY_IP` to either `$K8S_1_IP` or `$K8S_2_IP`, depending on
which master node the podmaster is currently being set up on. The podmasters
will compete with each other for leadership, and the winner will copy the
scheduler's and controller manager's pod specification yaml files from
`/srv/kubernetes/manifests/` to `/etc/kubernetes/manifests/`. When the kubelet
detects these new yaml files, it starts the corresponding pods.
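The election itself is essentially a TTL-protected compare-and-swap against
etcd. The sketch below only illustrates that pattern with plain etcd v2
commands; the key name and TTL are arbitrary examples, not podmaster's actual
values or implementation:

```shell
# Illustrative sketch of TTL-based leader election, not podmaster's real code.
# Create the key only if it does not already exist; whoever succeeds is the leader.
if etcdctl --peers http://${ETCD_IP}:${ETCD_PORT} mk /election/scheduler ${MY_IP} --ttl 30 >/dev/null 2>&1; then
  echo "this node (${MY_IP}) is now the leader"
  # The leader must keep refreshing the TTL, or standbys will take over once it expires.
else
  echo "standby; current leader: $(etcdctl --peers http://${ETCD_IP}:${ETCD_PORT} get /election/scheduler)"
fi
```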
### Configure Scheduler
The scheduler pod specification will be put into `/srv/kubernetes/manifests/`.
```shell
$ cat <<EOF > /srv/kubernetes/manifests/scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-scheduler
    image: haih/k8sm:latest
    imagePullPolicy: IfNotPresent
    command:
    - /opt/kubernetes/km
    - scheduler
    - --address=${MY_IP}
    - --advertised-address=${NGINX_IP}:${NGINX_SCHEDULER_PORT}
    - --mesos-master=${MESOS_IP}:${MESOS_PORT}
    - --etcd-servers=http://${ETCD_IP}:${ETCD_PORT}
    - --api-servers=${NGINX_IP}:${NGINX_APISERVER_PORT}
    - --v=10
EOF
```
Again, one must change `$MY_IP` to either `$K8S_1_IP` or `$K8S_2_IP`, depending
on which master node is currently being worked on. Even though we have not set
up Nginx yet, we can already specify `--api-servers` and `--advertised-address`
using Nginx's address and ports (just make sure Nginx is running before
turning on the scheduler). Having `--api-servers` point to Nginx allows
executors to maintain connectivity to one of the apiservers even when one or
more apiservers are down, as Nginx automatically re-routes requests to a
working apiserver.

It is critically important to point `--advertised-address` to Nginx so that all
schedulers are assigned the same executor ID. If we instead set
`--advertised-address=${K8S_1_IP}` on the first K8s master and
`--advertised-address=${K8S_2_IP}` on the second K8s master, they would
generate different executor IDs, and during a fail-over the new scheduler would
not be able to use the executors started by the failed scheduler. In that case,
one could see this error message in the scheduler log:

> Declining incompatible offer...
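If a fail-over does not go smoothly, the quickest way to check for this symptom
is to inspect the active scheduler's log directly through Docker on the leader
node (the container ID below is a placeholder taken from the `docker ps`
output):

```shell
$ docker ps | grep kube-scheduler                                    # find the scheduler container
$ docker logs <scheduler-container-id> 2>&1 | grep -i "incompatible offer"
```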
### Configure Controller Manager
The controller manager pod specification will also be put into `/srv/kubernetes/manifests/`.
```shell
$ cat <<EOF > /srv/kubernetes/manifests/controller-mgr.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-controller-manager
    image: haih/k8sm:latest
    imagePullPolicy: IfNotPresent
    command:
    - /opt/kubernetes/km
    - controller-manager
    - --master=http://${NGINX_IP}:${NGINX_APISERVER_PORT}
    - --cloud-provider=mesos
    - --cloud-config=/etc/kubernetes/mesos-cloud.conf
    volumeMounts:
    - mountPath: /etc/kubernetes
      name: kubernetes-config
      readOnly: true
  volumes:
  - hostPath:
      path: /etc/kubernetes
    name: kubernetes-config
EOF
```
The controller manager also needs a Mesos cloud configuration file as one of
its parameters; this configuration file is written to
`/etc/kubernetes/mesos-cloud.conf`.
```shell
$ cat <<EOF >/etc/kubernetes/mesos-cloud.conf
[mesos-cloud]
mesos-master = ${MESOS_IP}:${MESOS_PORT}
EOF
```
### Configure Apiserver
The apiserver runs on every master node, so its pod specification file is put
directly into `/etc/kubernetes/manifests/`.
```shell
$ cat <<EOF > /etc/kubernetes/manifests/apiserver.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: haih/k8sm:latest
    imagePullPolicy: IfNotPresent
    command:
    - /opt/kubernetes/km
    - apiserver
    - --insecure-bind-address=0.0.0.0
    - --etcd-servers=http://${ETCD_IP}:${ETCD_PORT}
    - --allow-privileged=true
    - --service-cluster-ip-range=10.10.10.0/24
    - --insecure-port=${K8S_APISERVER_PORT}
    - --cloud-provider=mesos
    - --cloud-config=/etc/kubernetes/mesos-cloud.conf
    - --advertise-address=${MY_IP}
    ports:
    - containerPort: ${K8S_APISERVER_PORT}
      hostPort: ${K8S_APISERVER_PORT}
      name: local
    volumeMounts:
    - mountPath: /etc/kubernetes
      name: kubernetes-config
      readOnly: true
  volumes:
  - hostPath:
      path: /etc/kubernetes
    name: kubernetes-config
EOF
```
Again, one must change `$MY_IP` to either `$K8S_1_IP` or `$K8S_2_IP`, depending
on which master node is currently being worked on.
To summarize our current setup: the apiserver's and podmaster's pod
specification files are put into `/etc/kubernetes/manifests/` so they run on
every master node. The scheduler's and controller manager's pod specification
files are put into `/srv/kubernetes/manifests/`, and they are copied into
`/etc/kubernetes/manifests/` by podmaster if and only if that node's podmaster
was elected leader.
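In other words, after configuration each master node should contain the
following files (leaving aside what podmaster copies at runtime; output order
may vary):

```shell
$ find /etc/kubernetes /srv/kubernetes -type f
/etc/kubernetes/manifests/podmaster.yaml
/etc/kubernetes/manifests/apiserver.yaml
/etc/kubernetes/mesos-cloud.conf
/srv/kubernetes/manifests/scheduler.yaml
/srv/kubernetes/manifests/controller-mgr.yaml
```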
### Configure Nginx
Nginx needs to be configured to load balance for both the apiserver and the
scheduler. For testing purposes, one can start Nginx in a Docker container.
```shell
$ cat <<EOF >nginx.conf
events {
  worker_connections 4096;  ## Default: 1024
}
http {
  upstream apiservers {
    server ${K8S_1_IP}:${K8S_APISERVER_PORT};
    server ${K8S_2_IP}:${K8S_APISERVER_PORT};
  }
  upstream schedulers {
    server ${K8S_1_IP}:${K8S_SCHEDULER_PORT};
    server ${K8S_2_IP}:${K8S_SCHEDULER_PORT};
  }
  server {
    listen ${NGINX_APISERVER_PORT};
    location / {
      proxy_pass http://apiservers;
      proxy_next_upstream error timeout invalid_header http_500;
      proxy_connect_timeout 2;
      proxy_buffering off;
      proxy_read_timeout 12h;
      proxy_send_timeout 12h;
    }
  }
  server {
    listen ${NGINX_SCHEDULER_PORT};
    location / {
      proxy_pass http://schedulers;
      proxy_next_upstream error timeout invalid_header http_500;
      proxy_connect_timeout 2;
      proxy_buffering off;
      proxy_read_timeout 12h;
      proxy_send_timeout 12h;
    }
  }
}
EOF
$ docker run \
-p $NGINX_APISERVER_PORT:$NGINX_APISERVER_PORT \
-p $NGINX_SCHEDULER_PORT:$NGINX_SCHEDULER_PORT \
--name nginx \
-v `pwd`/nginx.conf:/etc/nginx/nginx.conf:ro \
-d nginx:latest
```
For the sake of clarity, configuring Nginx to support HTTP over TLS/SPDY is
outside the scope of this document. However, keep in mind that without TLS/SPDY
properly configured, some `kubectl` commands might not work properly. This
problem is documented [here][4].
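Once the apiservers have been started by the kubelets in the next section, one
can verify that Nginx is proxying requests correctly:

```shell
$ curl -s http://${NGINX_IP}:${NGINX_APISERVER_PORT}/healthz    # should print "ok"
$ curl -s http://${NGINX_IP}:${NGINX_APISERVER_PORT}/version    # should print the apiserver's version info
```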
### Start Kubelet
To start everything up, we need to start the kubelet on the K8s master nodes so
they can launch the apiserver and podmaster pods. On the leader node, podmaster
will subsequently start the scheduler and controller manager.
```shell
$ mkdir -p /var/log/kubernetes
$ kubelet \
    --api-servers=http://127.0.0.1:${K8S_APISERVER_PORT} \
    --register-node=false \
    --allow-privileged=true \
    --config=/etc/kubernetes/manifests \
    1>/var/log/kubernetes/kubelet.log 2>&1 &
```
### Verification
On each of the K8s master nodes, one can run `docker ps` to verify that an
apiserver pod and a podmaster pod are running; on exactly one of the K8s master
nodes, a controller manager pod and a scheduler pod should be running as well.
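Another way to tell which node currently holds leadership is to look at what
podmaster has copied into the kubelet's manifests directory:

```shell
$ ls /etc/kubernetes/manifests/
# on the leader:       apiserver.yaml  controller-mgr.yaml  podmaster.yaml  scheduler.yaml
# on a standby node:   apiserver.yaml  podmaster.yaml
```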
One should also verify that user pods can be created in the K8sm cluster:
```shell
$ export KUBERNETES_MASTER=http://${NGINX_IP}:${NGINX_APISERVER_PORT}
$ kubectl create -f <userpod yaml file>
$ kubectl get pods
```
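For example, a minimal user pod specification (the pod name and image below are
arbitrary choices for the test):

```shell
$ cat <<EOF >testpod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: testpod
spec:
  containers:
  - name: testpod
    image: nginx
EOF
$ kubectl create -f testpod.yaml
$ kubectl get pods
```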
The pod should be shown in the `Running` state after a short amount of time.
### Tuning
During a fail-over, cold-standby mode takes some time before a new scheduler
can be started to take over the work of the failed one. However, one can tune
various parameters to shorten this time.

Podmaster's `--sleep` and `--ttl-secs` parameters can both be tuned for faster
failure detection. However, `--ttl-secs` should not be set too low, or false
positives become more likely.

The kubelet's `--file-check-frequency` parameter controls how frequently it
checks the manifests directory. It is set to 20 seconds by default.
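For example (the values below are illustrative only, not recommendations), the
podmaster flags would be added to both elector containers' `command` in
`podmaster.yaml`, and the kubelet could be restarted with a shorter file-check
interval:

```shell
# In /etc/kubernetes/manifests/podmaster.yaml, under each elector's command (illustrative values):
#   - --sleep=5s
#   - --ttl-secs=15
# Kubelet, restarted with a shorter manifest check interval:
$ kubelet \
    --api-servers=http://127.0.0.1:${K8S_APISERVER_PORT} \
    --register-node=false \
    --allow-privileged=true \
    --config=/etc/kubernetes/manifests \
    --file-check-frequency=5s \
    1>/var/log/kubernetes/kubelet.log 2>&1 &
```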
[1]: http://kubernetes.io/v1.0/docs/admin/high-availability.html
[2]: https://github.com/mesosphere/kubernetes-mesos/issues/457
[3]: https://github.com/mesosphere/kubernetes
[4]: https://github.com/kubernetes/kubernetes/blob/master/contrib/mesos/docs/issues.md#kubectl