The image prepulling pod calls docker directly to pull images. If the pod
hasn't finished before the resource usage tracking test runs, there would be a
CPU spike in docker. We'd rather wait, and fail if prepulling still isn't done,
before running the test.
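A minimal sketch of what such a wait could look like with client-go; the namespace, pod name, and timeout are hypothetical, and older client-go versions take `Get(name, options)` while newer ones also take a context:

```go
// Sketch: poll the (hypothetical) image-prepulling pod until it finishes,
// and fail if it has not succeeded within the timeout.
package e2esketch

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForPrepullPod blocks until the named pod succeeds, or returns an error
// if it fails or the timeout elapses. Pod and namespace names are placeholders.
func waitForPrepullPod(c kubernetes.Interface, ns, name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		pod, err := c.CoreV1().Pods(ns).Get(name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		switch pod.Status.Phase {
		case corev1.PodSucceeded:
			return nil
		case corev1.PodFailed:
			return fmt.Errorf("prepull pod %s/%s failed", ns, name)
		}
		time.Sleep(10 * time.Second)
	}
	return fmt.Errorf("timed out waiting for prepull pod %s/%s", ns, name)
}
```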
Automatic merge from submit-queue
Kubelet: Add docker operation timeout
For #23563.
Based on #24748, only the last 2 commits are new.
This PR:
1) Add a timeout for all docker operations (sketched below).
2) Add docker operation timeout metrics.
3) Clean up kubelet stats and add runtime operation error and timeout rate monitoring.
4) Monitor runtime operation error and timeout rates in kubelet perf.
@yujuhong
/cc @gmarek Because of the metrics change.
/cc @kubernetes/sig-node
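Roughly, the timeout wrapping can be pictured with the sketch below. This is not the kubelet's actual code; the helper and metric names (`runDockerOp`, `opLatency`, `opTimeouts`) are made up for illustration, and metric registration is omitted.

```go
// Sketch: run a docker operation with a hard timeout and record how it went.
package dockersketch

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Illustrative metrics only; registration with a registry is omitted.
	opLatency = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{Name: "docker_operation_latency_seconds"},
		[]string{"operation"},
	)
	opTimeouts = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "docker_operation_timeouts_total"},
		[]string{"operation"},
	)
)

// runDockerOp executes op with a timeout and records latency, counting the
// operation as a timeout when the context deadline was exceeded.
func runDockerOp(name string, timeout time.Duration, op func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	start := time.Now()
	err := op(ctx)
	opLatency.WithLabelValues(name).Observe(time.Since(start).Seconds())
	if ctx.Err() == context.DeadlineExceeded {
		opTimeouts.WithLabelValues(name).Inc()
	}
	return err
}
```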
This commit switches most functions in kubelet_stats.go to use the new API.
However, the functions that perform one-time resource usage retrieval remain
unchanged to stay compatible with resource_usage_gatherer.go. They should be
handled separately.
Also, the new summary API does not provide RSS memory yet, so all memory-checking
tests will *always* pass. We plan to add this metric to the API and
restore the functionality of the test.
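For illustration, reading the kubelet summary endpoint from a test could look roughly like the sketch below; the read-only port (10255) and the trimmed-down struct are assumptions, not the real stats types.

```go
// Sketch: fetch /stats/summary from a kubelet and decode a few node-level
// fields. The struct mirrors only a small, assumed subset of the API.
package statssketch

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type summary struct {
	Node struct {
		CPU struct {
			UsageNanoCores uint64 `json:"usageNanoCores"`
		} `json:"cpu"`
		Memory struct {
			WorkingSetBytes uint64 `json:"workingSetBytes"`
		} `json:"memory"`
	} `json:"node"`
}

// getNodeSummary queries the kubelet's read-only port (assumed to be 10255).
func getNodeSummary(host string) (*summary, error) {
	resp, err := http.Get(fmt.Sprintf("http://%s:10255/stats/summary", host))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var s summary
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return nil, err
	}
	return &s, nil
}
```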
This allows the resource usage monitoring test to launch 100 test pods per node, in
addition to the add-on pods.
Also reduce the test duration, since results over the shorter period are
representative enough.
- remove skip list from conformance-test.sh and filter by the new tag instead
(illustrated below)
- remove experimental API tests from the conformance test suite
- remove all tests from the conformance test suite that are either restricted to
specific providers (e.g. gce, gke, aws) or require SSH
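As a rough illustration of the tag-based filtering, a conformance test carries the tag in its Ginkgo description and the suite is then selected with a focus regexp such as `--ginkgo.focus='\[Conformance\]'`; the test body here is a stub, not a real test.

```go
// Sketch: tag a test as [Conformance] in its Ginkgo description so it can be
// selected by a focus regexp instead of a hand-maintained skip list.
package conformancesketch

import (
	. "github.com/onsi/ginkgo"
)

var _ = Describe("Pods", func() {
	It("should be submitted and removed [Conformance]", func() {
		// ... actual test logic lives in the e2e framework ...
	})
})
```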
This scopes down the initially ambitious PR
https://github.com/kubernetes/kubernetes/pull/14960 to replacing just
`pause` and `fluentd-elasticsearch` so that they come through `beta.gcr.io`.
The v2 versions have been pushed under new tags, `pause:2.0` and
`fluentd-elasticsearch:1.12`.
NOTE: `beta.gcr.io` will still serve images using v1 until they are repushed with v2. Pulls through `gcr.io` will still work after pushing through `beta.gcr.io`, but will be served over v1 (via compat logic).
Some tests in this test suite expect --max-pods (i.e. the maximum pod capacity
on the kubelet) to be greater than the default, which applies only to the GCE test
environment. Split the tests into two sets so that we can better categorize
the tests in the jenkins setup, without making the test itself aware of the
environment.
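One way such a test can guard against the environment, sketched below with hypothetical names, is to check the node's reported pod capacity (driven by --max-pods) and bail out when it is lower than what the test needs.

```go
// Sketch: decide whether a node can hold the pods a test needs, based on the
// pod capacity it reports. Names here are illustrative only.
package capacitysketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// enoughPodCapacity reports whether the node can hold at least `needed` pods.
func enoughPodCapacity(node *corev1.Node, needed int64) (bool, error) {
	capacity, ok := node.Status.Capacity[corev1.ResourcePods]
	if !ok {
		return false, fmt.Errorf("node %s does not report pod capacity", node.Name)
	}
	return capacity.Value() >= needed, nil
}
```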
This test tracks kubelet resource usage over a long period of time (1hr)
when running N pods (e.g., N=0,50), and prints out the resource usage. This
would give us an idea of how much management overhead kubelet incurs in a stable
cluster.
Some followup items:
* Use a more realistic workload (e.g., including probing)
* Fail the test if the resource usage is too high.
Caveat:
* We assume the scheduler will do a decent job of distributing the pause pods,
but we should double-check.
* Cluster add-on pods could be unevenly distributed and skew the resource
usage on nodes.
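A minimal sketch of the periodic sampling idea, with made-up helper names; the real test relies on the e2e framework's resource monitors rather than anything like this.

```go
// Sketch: sample a (hypothetical) usage-gathering function at fixed intervals
// over a long window and keep the samples for a final report.
package monitorsketch

import "time"

// sample is a single point-in-time reading; fields are illustrative.
type sample struct {
	When        time.Time
	CPUCores    float64
	MemoryBytes uint64
}

// collect calls getUsage every interval until the total duration elapses.
// getUsage stands in for whatever actually queries the kubelet.
func collect(duration, interval time.Duration, getUsage func() (float64, uint64)) []sample {
	var samples []sample
	end := time.Now().Add(duration)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for now := range ticker.C {
		if now.After(end) {
			break
		}
		cpu, mem := getUsage()
		samples = append(samples, sample{When: now, CPUCores: cpu, MemoryBytes: mem})
	}
	return samples
}
```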