Fix bad config in flaky test documentation and add script to help check for flakes.
pull/6/head
Alex Robinson 2015-02-11 12:16:16 -08:00
parent 15c57efde2
commit d3d71df943
1 changed file with 18 additions and 6 deletions


@@ -11,7 +11,7 @@ There is a testing image ```brendanburns/flake``` up on the docker hub. We will
Create a replication controller with the following config:
```yaml
-id: flakeController
+id: flakecontroller
kind: ReplicationController
apiVersion: v1beta1
desiredState:
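(The hunk cuts the config off at ```desiredState:```, so the full file isn't visible on this page. As a reference point only, a minimal sketch of a complete v1beta1 replication controller along these lines might look like the following; the ```id```, ```kind```, ```apiVersion```, image, and replica count come from this page, while the selector and label names are assumptions.)
```yaml
id: flakecontroller
kind: ReplicationController
apiVersion: v1beta1
desiredState:
  replicas: 24                        # matches "spin up 24 instances" below
  replicaSelector:
    name: flake                       # assumed label; must match podTemplate labels
  podTemplate:
    desiredState:
      manifest:
        version: v1beta1
        id: flakecontroller
        containers:
          - name: flake
            image: brendanburns/flake # testing image named at the top of the doc
    labels:
      name: flake
labels:
  name: flake
```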
@@ -41,14 +41,26 @@ labels:
```./cluster/kubectl.sh create -f controller.yaml```
-This will spin up 100 instances of the test. They will run to completion, then exit, the kubelet will restart them, eventually you will have sufficient
-runs for your purposes, and you can stop the replication controller by setting the ```replicas``` field to 0 and then running:
+This will spin up 24 instances of the test. They will run to completion, then exit, and the kubelet will restart them, accumulating more and more runs of the test.
+You can examine the recent runs of the test by calling ```docker ps -a``` and looking for tasks that exited with non-zero exit codes. Unfortunately, ```docker ps -a``` only keeps around the exit status of the last 15-20 containers with the same image, so you have to check them frequently.
+You can use this script to automate checking for failures, assuming your cluster is running on GCE and has four nodes:
```sh
-./cluster/kubectl.sh update -f controller.yaml
-./cluster/kubectl.sh delete -f controller.yaml
+echo "" > output.txt
+for i in {1..4}; do
+  echo "Checking kubernetes-minion-${i}"
+  echo "kubernetes-minion-${i}:" >> output.txt
+  gcloud compute ssh "kubernetes-minion-${i}" --command="sudo docker ps -a" >> output.txt
+done
+grep "Exited ([^0])" output.txt
```
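Since the loop above appends each node's full ```docker ps -a``` listing to ```output.txt```, the same file can also be used to tally results. A small follow-up sketch (a hypothetical helper, assuming the ```Exited (N) ...``` status strings that ```docker ps -a``` prints for completed containers):
```sh
# Tally results collected in output.txt by the loop above.
# Assumes docker ps -a prints "Exited (N) ..." for completed containers.
total=$(grep -c "Exited (" output.txt)
failed=$(grep -c "Exited ([^0])" output.txt)
echo "${failed} failures out of ${total} completed runs"
```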
-Now examine the machines with ```docker ps -a``` and look for tasks that exited with non-zero exit codes (ignore those that exited -1, since that's what happens when you stop the replica controller)
+Eventually you will have sufficient runs for your purposes. At that point you can stop and delete the replication controller by running:
+```sh
+./cluster/kubectl.sh stop replicationcontroller flakecontroller
+```
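To confirm the controller is actually gone before doing a last pass, a quick look at the controller list should suffice (a sketch using the standard ```kubectl get``` verb via the same wrapper script):
```sh
# flakecontroller should no longer appear in this listing
./cluster/kubectl.sh get replicationcontrollers
```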
+If you do a final check for flakes with ```docker ps -a```, ignore tasks that exited -1, since that's what happens when you stop the replication controller.
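For that final pass, one way to cut the noise on each node (a sketch, again assuming the ```Exited (N)``` status format shown by ```docker ps -a```):
```sh
# Final sweep on one node: list completed containers, dropping clean
# exits and the -1 exits caused by stopping the replication controller.
sudo docker ps -a | grep "Exited (" | grep -v "Exited (0)" | grep -v "Exited (-1)"
```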
Happy flake hunting!