Merge pull request #1050 from brendandburns/docs

Add initial docs for flake hunting.
Daniel Smith 2014-08-27 18:18:19 -07:00
commit 0965c4ee77
2 changed files with 55 additions and 0 deletions


@ -167,6 +167,9 @@ cd kubernetes
hack/e2e-test.sh
```
### Testing out flaky tests
[Instructions here](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/flaky-tests.md)
### Add/Update dependencies
Kubernetes uses [godep](https://github.com/tools/godep) to manage dependencies. To add or update a package, please follow the instructions in [godep's documentation](https://github.com/tools/godep).

docs/flaky-tests.md Normal file

@ -0,0 +1,52 @@
# Hunting flaky tests in Kubernetes
Sometimes unit tests are flaky. This means that due to (usually) race conditions, they will occasionally fail, even though most of the time they pass.
We have a goal of 99.9% flake-free tests. This means that a test should flake at most once in one thousand runs.
Running a test 1000 times on your own machine can be tedious and time-consuming. Fortunately, there is a better way to achieve this using Kubernetes.
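For comparison, the local approach amounts to a loop like the sketch below. The helper name `count_failures` is hypothetical and not part of the Kubernetes repository; it runs a command a given number of times and reports how many runs failed.

```sh
# Run a command N times and report how many of the runs failed.
# Usage: count_failures N command [args...]
count_failures() {
  runs=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Discard the command's output; only its exit status matters here.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "$failures/$runs runs failed"
}

# Example (assumes a working Go environment and the pkg/tools package):
#   count_failures 1000 go test ./pkg/tools/
```

Because the runs happen one after another on a single machine, this is slow, which is what the cluster-based approach avoids.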
_Note: these instructions are mildly hacky for now. As we get run-once semantics and logging, they will get better._
There is a testing image ```brendanburns/flake``` on the Docker Hub. We will use this image to test our fix.
Create a replication controller with the following config:
```yaml
id: flakeController
desiredState:
  replicas: 24
  replicaSelector:
    name: flake
  podTemplate:
    desiredState:
      manifest:
        version: v1beta1
        id: ""
        volumes: []
        containers:
        - name: flake
          image: brendanburns/flake
          env:
          - name: TEST_PACKAGE
            value: pkg/tools
          - name: REPO_SPEC
            value: https://github.com/GoogleCloudPlatform/kubernetes
      restartpolicy: {}
    labels:
      name: flake
labels:
  name: flake
```
```./cluster/kubecfg.sh -c controller.yaml create replicaControllers```
This will spin up 24 instances of the test. They will run to completion and then exit, and the kubelet will restart them. Eventually you will have
sufficient runs for your purposes, and you can stop and remove the replication controller:
```sh
./cluster/kubecfg.sh stop flakeController
./cluster/kubecfg.sh rm flakeController
```
Now examine the machines with ```docker ps -a``` and look for containers that exited with non-zero exit codes (ignore those that exited -1, since that's what happens when you stop the replication controller).
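Scanning the `docker ps -a` output by eye gets tedious with many restarts, so the filtering can be sketched as below. The function names `list_exit_codes` and `failed_containers` are hypothetical, and the sketch assumes that `docker inspect --format '{{.State.ExitCode}}'` is available on the machine's Docker version and reports the same exit codes described above.

```sh
# List "<container-id> <exit-code>" for every container on this machine.
list_exit_codes() {
  for id in $(docker ps -a -q); do
    echo "$id $(docker inspect --format '{{.State.ExitCode}}' "$id")"
  done
}

# Keep only containers whose exit code is neither 0 (passed or still running)
# nor -1 (stopped along with the replication controller).
failed_containers() {
  awk '$2 != 0 && $2 != -1 { print $1 " exited with code " $2 }'
}

# Usage, on each machine in the cluster:
#   list_exit_codes | failed_containers
```

Each container that this prints is a candidate flake; `docker logs` with its ID should show the failing test output.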
Happy flake hunting!