# Automated HA master deployment
**Authors:** filipg@, jsz@
# Introduction
We want to allow users to easily replicate kubernetes masters to get a highly available cluster,
initially using `kube-up.sh` and `kube-down.sh`.
This document describes the technical design of this feature. It assumes that we are using the
aforementioned scripts for cluster deployment. All of the ideas described in the following sections
should be easy to implement on GCE, AWS and other cloud providers.
It is a non-goal to design a specific setup for a bare-metal environment, which
might be very different.
# Overview
In a cluster with a replicated master, we will have N VMs, each running regular master components
such as apiserver, etcd, scheduler or controller manager. These components will interact in the
following way:
* All etcd replicas will be clustered together and will use master election
and quorum mechanisms to agree on the state. All of these mechanisms are integral
parts of etcd and we will only have to configure them properly.
* All apiserver replicas will work independently, talking to etcd on
127.0.0.1 (i.e. the local etcd replica), which if needed will forward requests to the current etcd master
(as explained [here](https://coreos.com/etcd/docs/latest/getting-started-with-etcd.html)).
* We will introduce provider-specific solutions to load balance traffic between master replicas
(see section `load balancing`).
* Controller manager, scheduler & cluster autoscaler will use a lease mechanism and
only a single instance will be an active master. All others will be waiting in standby mode.
* All add-on managers will work independently and each of them will try to keep add-ons in sync.
# Detailed design
## Components
### etcd
```
Note: This design for etcd clustering is quite pet-set like - each etcd
replica has its name which is explicitly used in etcd configuration etc. In
medium-term future we would like to have the ability to run masters as part of
autoscaling-group (AWS) or managed-instance-group (GCE) and add/remove replicas
automatically. This is pretty tricky and this design does not cover this.
It will be covered in a separate doc.
```
All etcd instances will be clustered together and one of them will be an elected master.
In order to commit any change, a quorum of the cluster will have to confirm it. Etcd will be
configured in such a way that all writes and reads will go through the master (requests
will be forwarded by the local etcd server so that this is invisible to the user). This will
affect latency for all operations, but it should not increase by much more than the network
latency between master replicas (latency between GCE zones within a region is < 10ms).
Currently etcd exposes its port only on the localhost interface. In order to allow clustering
and inter-VM communication we will also have to use the public interface. To secure the
communication we will use SSL (as described [here](https://coreos.com/etcd/docs/latest/security.html)).
When generating the command line for etcd we will always assume it is part of a cluster
(initially of size 1) and list all existing kubernetes master replicas.
Based on that, we will set the following flags:
* `-initial-cluster` - list of all hostnames/DNS names for master replicas (including the new one)
* `-initial-cluster-state` (keep in mind that we are adding master replicas one by one):
  * `new` if we are adding the first replica, i.e. the list of existing master replicas is empty
  * `existing` if we are adding a subsequent replica, i.e. the list of existing master replicas is non-empty.
This will allow us to have exactly the same logic for HA and non-HA master. The list of DNS names for VMs
with master replicas will be generated in the `kube-up.sh` script and passed as an env variable
`INITIAL_ETCD_CLUSTER`.
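To make this concrete, below is a minimal sketch of the flags that could be generated for the
second replica joining an existing cluster; the hostnames, ports and certificate paths are
illustrative assumptions, not the exact values produced by `kube-up.sh`:
```
# Illustrative sketch only: second etcd replica joining an existing cluster.
# Hostnames, ports and certificate paths are assumptions.
etcd --name kubernetes-master-1 \
  --initial-cluster "kubernetes-master-0=https://kubernetes-master-0:2380,kubernetes-master-1=https://kubernetes-master-1:2380" \
  --initial-cluster-state existing \
  --initial-advertise-peer-urls https://kubernetes-master-1:2380 \
  --listen-peer-urls https://0.0.0.0:2380 \
  --listen-client-urls http://127.0.0.1:4001 \
  --advertise-client-urls http://127.0.0.1:4001 \
  --peer-cert-file /etc/srv/kubernetes/etcd-peer.crt \
  --peer-key-file /etc/srv/kubernetes/etcd-peer.key
```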
### apiservers
All apiservers will work independently. They will contact etcd on 127.0.0.1, i.e. they will always contact
the etcd replica running on the same VM. If needed, such requests will be forwarded by the etcd server to the
etcd leader. This functionality is completely hidden from the client (the apiserver
in our case).
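A minimal sketch of the corresponding apiserver flag, assuming the etcd client port used elsewhere in this document:
```
# Illustrative sketch only: every apiserver talks to its local etcd replica.
kube-apiserver --etcd-servers=http://127.0.0.1:4001 ...
```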
The caching mechanism implemented in the apiserver will not be affected by
replicating the master because:
* GET requests go directly to etcd
* LIST requests go either directly to etcd or to a cache populated via watch
(depending on the ResourceVersion in ListOptions). In the second scenario,
after a PUT/POST request, changes might not be visible in the LIST response.
This is, however, not worse than with the current single master.
* WATCH does not give any guarantees about when a change will be delivered.
#### load balancing
With multiple apiservers we need a way to load balance traffic to/from master replicas. As different cloud
providers have different capabilities and limitations, we will not try to find a common lowest
denominator that will work everywhere. Instead we will document various options and apply different
solutions for different deployments. Below we list possible approaches:
1. `Managed DNS` - the user needs to specify a domain name during cluster creation. DNS entries will be managed
automatically by the deployment tool, which will be integrated with solutions like Route53 (AWS)
or Google Cloud DNS (GCP). For load balancing we will have two options:
1.1. create an L4 load balancer in front of all apiservers and update the DNS name appropriately
1.2. use the round-robin DNS technique to access all apiservers directly
2. `Unmanaged DNS` - this is very similar to `Managed DNS`, with the exception that DNS entries
will be manually managed by the user. We will provide detailed documentation for the entries we
expect.
3. [GCP only] `Promote master IP` - in GCP, when we create the first master replica, we generate a static
external IP address that is later assigned to the master VM. When creating additional replicas we
will create a load balancer in front of them and reassign the aforementioned IP to point to the load balancer
instead of a single master (a sketch of the involved commands follows this list). When removing the
second-to-last replica we will reverse this operation (assign the IP address to the remaining master VM and
delete the load balancer). That way the user will not have to provide a domain name and all client
configurations will keep working.
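The following is a hedged sketch of how the `Promote master IP` option could be realized with `gcloud`;
the instance, pool and rule names, the region/zone and the port are illustrative assumptions:
```
# Illustrative sketch only: move the static IP from the master VM to a
# target-pool based load balancer. Names, region, zone and port are assumptions.
gcloud compute instances delete-access-config kubernetes-master \
  --zone us-central1-b --access-config-name external-nat
gcloud compute target-pools create kubernetes-master-pool --region us-central1
gcloud compute target-pools add-instances kubernetes-master-pool \
  --instances kubernetes-master --instances-zone us-central1-b
gcloud compute forwarding-rules create kubernetes-master-rule \
  --region us-central1 --ports 443 \
  --address <static-master-ip> --target-pool kubernetes-master-pool
```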
This will also impact `kubelet <-> master` communication, which should go through the
load-balanced endpoint as well. Depending on the chosen method we will configure
the kubelet accordingly.
#### `kubernetes` service
Kubernetes maintains a special service called `kubernetes`. Currently it keeps a
list of IP addresses for all apiservers. As it uses a command line flag
`--apiserver-count` it is not very dynamic and would require restarting all
masters to change the number of master replicas.
To allow dynamic changes to the number of apiservers in the cluster, we will
introduce a `ConfigMap` in `kube-system` namespace, that will keep an expiration
time for each apiserver (keyed by IP). Each apiserver will do three things:
1. periodically update the expiration time for its own IP address
2. remove all the stale IP addresses from the endpoints list
3. add its own IP address if it's not on the list yet.
That way we will not only solve the problem of dynamically changing number
of apiservers in the cluster, but also the problem of non-responsive apiservers
that should be removed from the `kubernetes` service endpoints list.
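A minimal sketch of what such a `ConfigMap` could look like is shown below; the ConfigMap name,
key format and timestamp encoding are illustrative assumptions, not a finalized schema:
```
# Illustrative sketch only: apiserver lease ConfigMap, keyed by apiserver IP,
# with the expiration time as the value. Name and value format are assumptions.
kubectl --namespace=kube-system create configmap apiserver-leases \
  --from-literal=10.240.0.2=2016-07-01T12:00:00Z \
  --from-literal=10.240.0.3=2016-07-01T12:00:30Z
```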
#### Certificates
Certificate generation will work as today. In particular, on GCE, we will
generate it for the public IP used to access the cluster (see `load balancing`
section) and local IP of the master replica VM.
That means that with multiple master replicas and a load balancer in front
of them, accessing one of the replicas directly (using its ephemeral public
IP) will not work on GCE without appropriate flags:
- `kubectl --insecure-skip-tls-verify=true`
- `curl --insecure`
- `wget --no-check-certificate`
For other deployment tools and providers the details of certificate generation
may be different, but it must be possible to access the cluster by using either
the main cluster endpoint (DNS name or IP address) or the internal service called
`kubernetes` that points directly to the apiservers.
### controller manager, scheduler & cluster autoscaler
Controller manager and scheduler will by default use a lease mechanism to choose an active instance
among all masters. Only one instance will be performing any operations;
all others will be waiting in standby mode.
We will use the same configuration in non-replicated mode to simplify deployment scripts.
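As a hedged example, this corresponds to running both components with leader election enabled;
the exact flags and manifests generated by the deployment scripts may differ:
```
# Illustrative sketch only: leader election enabled for both components.
kube-controller-manager --leader-elect=true --master=http://127.0.0.1:8080 ...
kube-scheduler --leader-elect=true --master=http://127.0.0.1:8080 ...
```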
### add-on manager
All add-on managers will be working independently. Each of them will observe current state of
add-ons and will try to sync it with files on disk. As a result, due to races, a single add-on
can be updated multiple times in a row after upgrading the master. Long-term we should fix this
by using a similar mechanism to the one used by the controller manager or scheduler. However, currently the
add-on manager is just a bash script and adding a master election mechanism would not be easy.
## Adding replica
Command to add a new replica on GCE using the kube-up script:
```
KUBE_REPLICATE_EXISTING_MASTER=true KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-up.sh
```
The pseudo-code for adding a new master replica using managed DNS and a load balancer is the following:
```
1. If there is no load balancer for this cluster:
   1. Create load balancer using ephemeral IP address
   2. Add existing apiserver to the load balancer
   3. Wait until load balancer is working, i.e. all data is propagated, in GCE up to 20 min (sic!)
   4. Update DNS to point to the load balancer.
2. Clone existing master (create a new VM with the same configuration) including
   all env variables (certificates, IP ranges etc), with the exception of
   `INITIAL_ETCD_CLUSTER`.
3. SSH to an existing master and run the following command to extend etcd cluster
   with the new instance:
   `curl <existing_master>:4001/v2/members -XPOST -H "Content-Type: application/json" -d '{"peerURLs":["http://<new_master>:2380"]}'`
4. Add IP address of the new apiserver to the load balancer.
```
A simplified algorithm for adding a new master replica and promoting the master IP to the load balancer
is identical to the one when using DNS, with a different step to set up the load balancer:
```
1. If there is no load balancer for this cluster:
   1. Unassign IP from the existing master replica
   2. Create load balancer using static IP reclaimed in the previous step
   3. Add existing apiserver to the load balancer
   4. Wait until load balancer is working, i.e. all data is propagated, in GCE up to 20 min (sic!)
...
```
## Deleting replica
Command to delete one replica on GCE using the kube-down script:
```
KUBE_DELETE_NODES=false KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-down.sh
```
The pseudo-code for deleting an existing master replica is the following:
```
1. Remove replica IP address from the load balancer or DNS configuration
2. SSH to one of the remaining masters and run the following command to remove replica from the cluster:
`curl etcd-0:4001/v2/members/<id> -XDELETE -L`
3. Delete replica VM
4. If load balancer has only a single target instance, then delete load balancer
5. Update DNS to point to the remaining master replica, or [on GCE] assign static IP back to the master VM.
```
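The etcd member id used in step 2 can be looked up via the etcd members API. A minimal sketch, assuming
the client port used elsewhere in this document, run from one of the remaining masters:
```
# Illustrative sketch only: find the member id of the replica being removed,
# then remove it from the etcd cluster.
curl http://127.0.0.1:4001/v2/members
curl http://127.0.0.1:4001/v2/members/<id> -XDELETE
```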
## Upgrades
Upgrading a replicated master will be possible by upgrading the replicas one by one using existing tools
(e.g. `upgrade.sh` for GCE). This will work out of the box because:
* Requests from nodes will be correctly served by either new or old master because apiserver is backward compatible.
* Requests from scheduler (and controllers) go to a local apiserver via localhost interface, so both components
will be in the same version.
* Apiserver talks only to a local etcd replica which will be in a compatible version.
* We assume we will introduce this setup after we upgrade to etcd v3 so we don't need to cover upgrading the database.
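For illustration, on GCE upgrading the replicas one by one could look roughly like the following;
the exact script invocation, flags and version are assumptions:
```
# Illustrative sketch only: upgrade each master replica in turn on GCE.
KUBE_GCE_ZONE=us-central1-b cluster/gce/upgrade.sh -M v1.5.0
KUBE_GCE_ZONE=us-central1-c cluster/gce/upgrade.sh -M v1.5.0
```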