k3s/docs/admin/cluster-troubleshooting.md

147 lines
5.9 KiB
Markdown
Raw Normal View History

2015-07-12 04:04:52 +00:00
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
<!-- BEGIN STRIP_FOR_RELEASE -->
2015-07-16 17:02:26 +00:00
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>
If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.
<strong>
The latest 1.0.x release of this document can be found
[here](http://releases.k8s.io/release-1.0/docs/admin/cluster-troubleshooting.md).
Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--
2015-07-13 22:15:35 +00:00
2015-07-12 04:04:52 +00:00
<!-- END STRIP_FOR_RELEASE -->
<!-- END MUNGE: UNVERSIONED_WARNING -->
2015-07-17 22:35:41 +00:00
# Cluster Troubleshooting
2015-07-17 22:35:41 +00:00
2015-07-17 18:31:41 +00:00
This doc is about cluster troubleshooting; we assume you have already ruled out your application as the root cause of the
problem you are experiencing. See
the [application troubleshooting guide](../user-guide/application-troubleshooting.md) for tips on application debugging.
2015-07-24 21:52:18 +00:00
You may also visit [troubleshooting document](../troubleshooting.md) for more information.
## Listing your cluster
2015-07-17 22:35:41 +00:00
The first thing to debug in your cluster is if your nodes are all registered correctly.
Run
2015-07-17 02:01:02 +00:00
```sh
kubectl get nodes
```
And verify that all of the nodes you expect to see are present and that they are all in the `Ready` state.
## Looking at logs
2015-07-17 22:35:41 +00:00
For now, digging deeper into the cluster requires logging into the relevant machines. Here are the locations
of the relevant log files. (note that on systemd-based systems, you may need to use `journalctl` instead)
### Master
2015-07-17 22:35:41 +00:00
* /var/log/kube-apiserver.log - API Server, responsible for serving the API
* /var/log/kube-scheduler.log - Scheduler, responsible for making scheduling decisions
* /var/log/kube-controller-manager.log - Controller that manages replication controllers
### Worker Nodes
2015-07-17 22:35:41 +00:00
* /var/log/kubelet.log - Kubelet, responsible for running containers on the node
* /var/log/kube-proxy.log - Kube Proxy, responsible for service load balancing
## A general overview of cluster failure modes
This is an incomplete list of things that could go wrong, and how to adjust your cluster setup to mitigate the problems.
Root causes:
- VM(s) shutdown
- Network partition within cluster, or between cluster and users
2015-07-24 21:52:18 +00:00
- Crashes in Kubernetes software
- Data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume)
- Operator error, e.g. misconfigured Kubernetes software or application software
Specific scenarios:
- Apiserver VM shutdown or apiserver crashing
- Results
- unable to stop, update, or start new pods, services, replication controller
- existing pods and services should continue to work normally, unless they depend on the Kubernetes API
- Apiserver backing storage lost
- Results
- apiserver should fail to come up
- kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
- manual recovery or recreation of apiserver state necessary before apiserver is restarted
- Supporting services (node controller, replication controller manager, scheduler, etc) VM shutdown or crashes
- currently those are colocated with the apiserver, and their unavailability has similar consequences as apiserver
- in future, these will be replicated as well and may not be co-located
- they do not have their own persistent state
- Individual node (VM or physical machine) shuts down
- Results
- pods on that Node stop running
- Network partition
- Results
- partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down. (Assuming the master VM ends up in partition A.)
- Kubelet software fault
- Results
- crashing kubelet cannot start new pods on the node
- kubelet might delete the pods or not
- node marked unhealthy
- replication controllers start new pods elsewhere
- Cluster operator error
- Results
- loss of pods, services, etc
- lost of apiserver backing store
- users unable to read API
- etc.
Mitigations:
- Action: Use IaaS provider's automatic VM restarting feature for IaaS VMs
- Mitigates: Apiserver VM shutdown or apiserver crashing
- Mitigates: Supporting services VM shutdown or crashes
- Action use IaaS providers reliable storage (e.g GCE PD or AWS EBS volume) for VMs with apiserver+etcd
- Mitigates: Apiserver backing storage lost
- Action: Use (experimental) [high-availability](high-availability.md) configuration
- Mitigates: Master VM shutdown or master components (scheduler, API server, controller-managing) crashing
- Will tolerate one or more simultaneous node or component failures
- Mitigates: Apiserver backing storage (i.e., etcd's data directory) lost
- Assuming you used clustered etcd.
- Action: Snapshot apiserver PDs/EBS-volumes periodically
- Mitigates: Apiserver backing storage lost
- Mitigates: Some cases of operator error
- Mitigates: Some cases of Kubernetes software fault
- Action: use replication controller and services in front of pods
- Mitigates: Node shutdown
- Mitigates: Kubelet software fault
- Action: applications (containers) designed to tolerate unexpected restarts
- Mitigates: Node shutdown
- Mitigates: Kubelet software fault
2015-07-17 00:26:19 +00:00
- Action: [Multiple independent clusters](multi-cluster.md) (and avoid making risky changes to all clusters at once)
- Mitigates: Everything listed above.
2015-07-14 00:13:09 +00:00
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/admin/cluster-troubleshooting.md?pixel)]()
2015-07-14 00:13:09 +00:00
<!-- END MUNGE: GENERATED_ANALYTICS -->