
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<strong>
The latest release of this document can be found
[here](http://releases.k8s.io/release-1.1/docs/admin/cluster-troubleshooting.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Cluster Troubleshooting

This doc covers cluster troubleshooting; it assumes you have already ruled out your application as the root cause of the
problem you are experiencing. See
the [application troubleshooting guide](../user-guide/application-troubleshooting.md) for tips on application debugging.
You may also visit the [troubleshooting document](../troubleshooting.md) for more information.

## Listing your cluster

The first thing to check in your cluster is whether all of your nodes are registered correctly.

Run

```sh
kubectl get nodes
```

Then verify that all of the nodes you expect to see are present and that they are all in the `Ready` state.
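
On a large cluster, scanning that output by eye is error-prone. The following is a minimal sketch of a filter that prints only the nodes whose status is not exactly `Ready`. The function name `not_ready_nodes` is hypothetical, and the sketch assumes the status is the second whitespace-separated column, as in recent `kubectl` releases; older releases print the columns in a different order, so adjust the field number if needed.

```sh
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print the name of every node whose STATUS column is not exactly "Ready".
# Assumption: NAME is column 1 and STATUS is column 2.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}
```

It could be used as `kubectl get nodes --no-headers | not_ready_nodes`. Because the comparison is exact, a node reported as `Ready,SchedulingDisabled` is also listed, which is usually worth a look.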

## Looking at logs

For now, digging deeper into the cluster requires logging into the relevant machines. Here are the locations
of the relevant log files. (Note that on systemd-based systems, you may need to use `journalctl` instead.)

### Master

   * /var/log/kube-apiserver.log - API Server, responsible for serving the API
   * /var/log/kube-scheduler.log - Scheduler, responsible for making scheduling decisions
   * /var/log/kube-controller-manager.log - Controller that manages replication controllers

### Worker Nodes

   * /var/log/kubelet.log - Kubelet, responsible for running containers on the node
   * /var/log/kube-proxy.log - Kube Proxy, responsible for service load balancing
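
The choice between a log file and `journalctl` can be wrapped in a small helper. This is only a sketch: the function name `show_component_log` and the `LOG_DIR` variable are hypothetical, and it assumes the systemd unit carries the same name as the component (for example `kubelet`), which depends on how your cluster was installed.

```sh
# Hypothetical helper: print the last lines of a component's log.
# Usage: show_component_log <component> [line-count]
# LOG_DIR defaults to /var/log, the location listed above.
show_component_log() {
  component="$1"
  lines="${2:-50}"
  log_file="${LOG_DIR:-/var/log}/${component}.log"
  if [ -r "$log_file" ]; then
    # A plain log file exists, so read it directly.
    tail -n "$lines" "$log_file"
  elif command -v journalctl >/dev/null 2>&1; then
    # Assumption: the systemd unit is named after the component.
    journalctl -u "$component" -n "$lines" --no-pager
  else
    echo "no log found for $component" >&2
    return 1
  fi
}
```

For example, `show_component_log kube-apiserver 100` on the master, or `show_component_log kubelet` on a worker node.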

## A general overview of cluster failure modes

This is an incomplete list of things that could go wrong, and how to adjust your cluster setup to mitigate the problems.

Root causes:
  - VM(s) shutdown
  - Network partition within cluster, or between cluster and users
  - Crashes in Kubernetes software
  - Data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume)
  - Operator error, e.g. misconfigured Kubernetes software or application software

Specific scenarios:
  - Apiserver VM shutdown or apiserver crashing
    - Results
      - unable to stop, update, or start new pods, services, or replication controllers
      - existing pods and services should continue to work normally, unless they depend on the Kubernetes API
  - Apiserver backing storage lost
    - Results
      - apiserver should fail to come up
      - kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
      - manual recovery or recreation of apiserver state necessary before apiserver is restarted
  - Supporting services (node controller, replication controller manager, scheduler, etc.) VM shutdown or crashes
    - currently these are colocated with the apiserver, and their unavailability has consequences similar to apiserver unavailability
    - in the future, these will be replicated as well and may not be colocated
    - they do not have their own persistent state
  - Individual node (VM or physical machine) shuts down
    - Results
      - pods on that Node stop running
  - Network partition
    - Results
      - partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down. (Assuming the master VM ends up in partition A.)
  - Kubelet software fault
    - Results
      - crashing kubelet cannot start new pods on the node
      - kubelet might or might not delete the pods
      - node marked unhealthy
      - replication controllers start new pods elsewhere
  - Cluster operator error
    - Results
      - loss of pods, services, etc.
      - loss of apiserver backing store
      - users unable to read API
      - etc.

Mitigations:
- Action: Use IaaS provider's automatic VM restarting feature for IaaS VMs
  - Mitigates: Apiserver VM shutdown or apiserver crashing
  - Mitigates: Supporting services VM shutdown or crashes

- Action: Use IaaS provider's reliable storage (e.g. GCE PD or AWS EBS volume) for VMs with apiserver+etcd
  - Mitigates: Apiserver backing storage lost

- Action: Use (experimental) [high-availability](high-availability.md) configuration
  - Mitigates: Master VM shutdown or master components (scheduler, API server, controller manager) crashing
    - Will tolerate one or more simultaneous node or component failures
  - Mitigates: Apiserver backing storage (i.e., etcd's data directory) lost
    - Assuming you used clustered etcd.

- Action: Snapshot apiserver PDs/EBS-volumes periodically
  - Mitigates: Apiserver backing storage lost
  - Mitigates: Some cases of operator error
  - Mitigates: Some cases of Kubernetes software fault

- Action: Use replication controllers and services in front of pods
  - Mitigates: Node shutdown
  - Mitigates: Kubelet software fault

- Action: Design applications (containers) to tolerate unexpected restarts
  - Mitigates: Node shutdown
  - Mitigates: Kubelet software fault

- Action: Use [multiple independent clusters](multi-cluster.md) (and avoid making risky changes to all clusters at once)
  - Mitigates: Everything listed above.
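
For the periodic snapshot mitigation above, each snapshot needs a unique name. The following sketch shows one possible naming scheme; the function name `snapshot_name` and the timestamp format are assumptions for illustration, not anything Kubernetes or a cloud provider prescribes.

```sh
# Hypothetical helper: build a timestamped snapshot name for a disk.
# Usage: snapshot_name <disk-name>
snapshot_name() {
  disk="$1"
  # A UTC timestamp keeps names sortable and unique to the second.
  printf '%s-%s\n' "$disk" "$(date -u +%Y%m%d-%H%M%S)"
}
```

On GCE this might be combined with the `gcloud` CLI, for instance `gcloud compute disks snapshot master-pd --zone us-central1-b --snapshot-names "$(snapshot_name master-pd)"`, run from a scheduled job. The disk name and zone here are placeholders for your own values.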

