Merge pull request #10804 from saad-ali/clusterScaleDoc

Add documentation for creating large kubernetes clusters
Tim Hockin 2015-07-13 16:19:42 -07:00
commit 32699e873a
2 changed files with 115 additions and 8 deletions

docs/cluster-large.md Normal file

@@ -0,0 +1,60 @@
# Kubernetes Large Cluster
## Support
At v1.0, Kubernetes supports clusters up to 100 nodes with 30-50 pods per node and 1-2 containers per pod (as defined in the [1.0 roadmap](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/roadmap.md#reliability-and-performance)).
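At those limits, a full-size v1.0 cluster therefore tops out at roughly 100 × 50 = 5,000 pods and on the order of 10,000 containers.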
## Setup
Normally the number of nodes in a cluster is controlled by the value `NUM_MINIONS` in the platform-specific `config-default.sh` file (for example, see [GCE's `config-default.sh`](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/cluster/gce/config-default.sh)).
Simply changing that value to something very large, however, may cause the setup script to fail for many cloud providers. A GCE deployment, for example, will run into quota issues and fail to bring the cluster up.
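As a minimal sketch, the node count can be overridden at bring-up time; the value `150` is an arbitrary illustrative number, and this assumes your provider's `config-default.sh` reads `NUM_MINIONS` from the environment, as GCE's does:
```
# Sketch only: override the default node count before bringing the cluster up.
# Assumes the provider's config-default.sh honors NUM_MINIONS from the environment.
export NUM_MINIONS=150
cluster/kube-up.sh
```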
When setting up a large Kubernetes cluster, the following must be taken into consideration.
### Quota Issues
To avoid running into cloud provider quota issues when creating a cluster with many nodes, consider:
* Increasing the quota for resources such as CPUs, IP addresses, etc.
* In [GCE, for example,](https://cloud.google.com/compute/docs/resource-quotas) you'll want to increase the quota for:
* CPUs
* VM instances
* Total persistent disk reserved
* In-use IP addresses
* Firewall Rules
* Forwarding rules
* Routes
* Target pools
* Gating the setup script so that it brings up new node VMs in smaller batches with waits in between, because some cloud providers limit the number of VMs you can create during a given period (see the sketch after this list).
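A rough sketch of both ideas on GCE follows. The quota listing uses `gcloud compute project-info describe`; the batching loop is purely illustrative (`create_node_vm`, the batch size, and the sleep interval are hypothetical) and is not how the stock setup scripts work:
```
# List the project's current quotas (CPUs, in-use IP addresses, routes, etc.).
gcloud compute project-info describe --project my-project

# Illustrative batching sketch: bring up node VMs ten at a time with a pause
# between batches instead of all at once. create_node_vm is a hypothetical helper.
for batch in 0 1 2 3 4; do
  for i in $(seq 1 10); do
    create_node_vm "node-$((batch * 10 + i))" &
  done
  wait       # wait for this batch of VMs to come up
  sleep 60   # let the provider's rate limits recover before the next batch
done
```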
### Addon Resources
To prevent memory leaks or other resource issues in [cluster addons](https://github.com/GoogleCloudPlatform/kubernetes/tree/master/cluster/addons/) from consuming all of the resources available on a node, Kubernetes sets resource limits on addon containers to cap the CPU and memory they can consume (see PRs [#10653](https://github.com/GoogleCloudPlatform/kubernetes/pull/10653/files) and [#10778](https://github.com/GoogleCloudPlatform/kubernetes/pull/10778/files)).
For example:
```YAML
containers:
  - image: gcr.io/google_containers/heapster:v0.15.0
    name: heapster
    resources:
      limits:
        cpu: 100m
        memory: 200Mi
```
These limits, however, are based on data collected from addons running on 4-node clusters (see [#10335](https://github.com/GoogleCloudPlatform/kubernetes/issues/10335#issuecomment-117861225)). The addons consume a lot more resources when running on large deployment clusters (see [#5880](https://github.com/GoogleCloudPlatform/kubernetes/issues/5880#issuecomment-113984085)). So, if a large cluster is deployed without adjusting these values, the addons may continuously get killed because they keep hitting the limits.
To avoid running into cluster addon resource issues when creating a cluster with many nodes, consider the following (an illustrative scaled-up Heapster example follows the list):
* Scale the memory and CPU limits for each of the following addons, if used, with the size of the cluster (there is one replica of each handling the entire cluster, so memory and CPU usage tends to grow proportionally with cluster size and load):
* Heapster ([GCM/GCL backed](../cluster/addons/cluster-monitoring/google/heapster-controller.yaml), [InfluxDB backed](../cluster/addons/cluster-monitoring/influxdb/heapster-controller.yaml), [InfluxDB/GCL backed](../cluster/addons/cluster-monitoring/googleinfluxdb/heapster-controller-combined.yaml), [standalone](../cluster/addons/cluster-monitoring/standalone/heapster-controller.yaml))
* [InfluxDB and Grafana](../cluster/addons/cluster-monitoring/influxdb/influxdb-grafana-controller.yaml)
* [skydns, kube2sky, and dns etcd](../cluster/addons/dns/skydns-rc.yaml.in)
* [Kibana](../cluster/addons/fluentd-elasticsearch/kibana-controller.yaml)
* Scale the number of replicas for the following addons, if used, with the size of the cluster (there are multiple replicas of each, so increasing the replica count should help handle increased load; since the load per replica also grows slightly, consider increasing the CPU/memory limits as well):
* [elasticsearch](../cluster/addons/fluentd-elasticsearch/es-controller.yaml)
* Increase the memory and CPU limits slightly for each of the following addons, if used, with the size of the cluster (there is one replica per node, but CPU/memory usage grows slightly with cluster size and load as well):
* [FluentD with ElasticSearch Plugin](../cluster/saltbase/salt/fluentd-es/fluentd-es.yaml)
* [FluentD with GCP Plugin](../cluster/saltbase/salt/fluentd-gcp/fluentd-gcp.yaml)
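For example, the Heapster limits from the earlier snippet might be raised along these lines for a larger cluster; the numbers below are purely illustrative, and the right values should be derived from the usage you actually observe at your cluster size:
```YAML
containers:
  - image: gcr.io/google_containers/heapster:v0.15.0
    name: heapster
    resources:
      limits:
        # Illustrative values only, scaled up from the 4-node defaults.
        cpu: 500m
        memory: 1Gi
```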
For directions on how to detect if addon containers are hitting resource limits, see the [Troubleshooting section of Compute Resources](compute_resources.md#troubleshooting).
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/cluster-large.md?pixel)]()

docs/compute_resources.md

@@ -21,6 +21,7 @@ certainly want the docs that go with that version.</h1>
- [How Pods with Resource Limits are Run](#how-pods-with-resource-limits-are-run)
- [Monitoring Compute Resource Usage](#monitoring-compute-resource-usage)
- [Troubleshooting](#troubleshooting)
- [Detecting Resource Starved Containers](#detecting-resource-starved-containers)
- [Planned Improvements](#planned-improvements)
When specifying a [pod](pods.md), you can optionally specify how much CPU and memory (RAM) each
@@ -108,18 +109,14 @@ When using Docker:
**TODO: document behavior for rkt**
If a container exceeds its memory limit, it may be terminated. If it is restartable, it will be
-restarted by kubelet, as will any other type of runtime failure. If it is killed for exceeding its
-memory limit, you will see the reason `OOM Killed`, as in this example:
-```
-$ kubectl get pods/memhog
-NAME      READY     REASON       RESTARTS   AGE
-memhog    0/1       OOM Killed   0          1h
-```
-*OOM* stands for Out Of Memory.
+restarted by kubelet, as will any other type of runtime failure.
A container may or may not be allowed to exceed its CPU limit for extended periods of time.
However, it will not be killed for excessive CPU usage.
To determine if a container cannot be scheduled or is being killed due to resource limits, see the
"Troubleshooting" section below.
## Monitoring Compute Resource Usage
The resource usage of a pod is reported as part of the Pod status.
@@ -154,6 +151,56 @@ The [resource quota](resource_quota_admin.md) feature can be configured
to limit the total amount of resources that can be consumed. If used in conjunction
with namespaces, it can prevent one team from hogging all the resources.
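As a minimal sketch of such a quota (the namespace name and the numbers are illustrative, not recommendations):
```YAML
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a   # hypothetical namespace for one team
spec:
  hard:
    cpu: "20"         # total CPU across all pods in the namespace
    memory: 100Gi     # total memory across all pods in the namespace
    pods: "10"        # maximum number of pods in the namespace
```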
### Detecting Resource Starved Containers
To check if a container is being killed because it is hitting a resource limit, call `kubectl describe pod`
on the pod you are interested in:
```
[12:54:41] $ ./cluster/kubectl.sh describe pod simmemleak-hra99
Name: simmemleak-hra99
Namespace: default
Image(s): saadali/simmemleak
Node: kubernetes-minion-tf0f/10.240.216.66
Labels: name=simmemleak
Status: Running
Reason:
Message:
IP: 10.244.2.75
Replication Controllers: simmemleak (1/1 replicas created)
Containers:
simmemleak:
Image: saadali/simmemleak
Limits:
cpu: 100m
memory: 50Mi
State: Running
Started: Tue, 07 Jul 2015 12:54:41 -0700
Ready: False
Restart Count: 5
Conditions:
Type Status
Ready False
Events:
FirstSeen LastSeen Count From SubobjectPath Reason Message
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51 -0700 1 {scheduler } scheduled Successfully assigned simmemleak-hra99 to kubernetes-minion-tf0f
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51 -0700 1 {kubelet kubernetes-minion-tf0f} implicitly required container POD pulled Pod container image "gcr.io/google_containers/pause:0.8.0" already present on machine
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51 -0700 1 {kubelet kubernetes-minion-tf0f} implicitly required container POD created Created with docker id 6a41280f516d
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51 -0700 1 {kubelet kubernetes-minion-tf0f} implicitly required container POD started Started with docker id 6a41280f516d
Tue, 07 Jul 2015 12:53:51 -0700 Tue, 07 Jul 2015 12:53:51 -0700 1 {kubelet kubernetes-minion-tf0f} spec.containers{simmemleak} created Created with docker id 87348f12526a
```
The `Restart Count: 5` indicates that the `simmemleak` container in this pod was terminated and restarted 5 times.
Once [#10861](https://github.com/GoogleCloudPlatform/kubernetes/issues/10861) is resolved, the reason for the termination of the last container will also be printed in this output.
Until then, you can call `kubectl get pod` with the `-o template -t ...` option to fetch the status of previously terminated containers:
```
[13:59:01] $ ./cluster/kubectl.sh get pod -o template -t '{{range.status.containerStatuses}}{{"Container Name: "}}{{.name}}{{"\r\nLastState: "}}{{.lastState}}{{end}}' simmemleak-60xbc
Container Name: simmemleak
LastState: map[terminated:map[exitCode:137 reason:OOM Killed startedAt:2015-07-07T20:58:43Z finishedAt:2015-07-07T20:58:43Z containerID:docker://0e4095bba1feccdfe7ef9fb6ebffe972b4b14285d5acdec6f0d3ae8a22fad8b2]][13:59:03] clusterScaleDoc ~/go/src/github.com/GoogleCloudPlatform/kubernetes $
```
We can see that this container was terminated because of `reason:OOM Killed`, where *OOM* stands for Out Of Memory.
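The `exitCode:137` in that output is consistent with this: 137 = 128 + 9, i.e. the process received SIGKILL, which is the signal the kernel's OOM killer delivers.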
## Planned Improvements
The current system only allows resource quantities to be specified on a container.