Previously, we used the docker config digest (also called "image ID"
by Docker) for the value of the `ImageID` field in the container status.
This was not particularly useful, since the config manifest is not
what's used to identify the image in a registry, which uses the manifest
digest instead. Docker 1.12+ always populates the RepoDigests field
with the manifest digests, and Docker 1.10 and 1.11 populate it when
images are pulled by digest.
This commit changes `ImageID` to point to the the manifest digest when
available, using the prefix `docker-pullable://` (instead of
`docker://`)
Automatic merge from submit-queue
Fix summary test
Issue was comparing an `unversioned.Time` rather than `time.Time`. I temporarily removed the `[Flaky]` tag so the PR builder will run the test. I will revert that change before submitting.
Automatic merge from submit-queue
Change minion to node
Continuation of #1111
I tried to keep this PR down to just a simple search-n-replace to keep
things simple. I may have gone too far in some spots but its easy to
roll those back if needed - just let me know.
I avoided renaming `contrib/mesos/pkg/minion` because there's already
a `contrib/mesos/pkg/node` dir and fixing that will require a bit of work
due to a circular import chain that pops up. So I'm saving that for a
follow-on PR.
Signed-off-by: Doug Davis <dug@us.ibm.com>
Contination of #1111
I tried to keep this PR down to just a simple search-n-replace to keep
things simple. I may have gone too far in some spots but its easy to
roll those back if needed.
I avoided renaming `contrib/mesos/pkg/minion` because there's already
a `contrib/mesos/pkg/node` dir and fixing that will require a bit of work
due to a circular import chain that pops up. So I'm saving that for a
follow-on PR.
I rolled back some of this from a previous commit because it just got
to big/messy. Will follow up with additional PRs
Signed-off-by: Doug Davis <dug@us.ibm.com>
Automatic merge from submit-queue
Allow garbage collection to work against different API prefixes
The GC needs to build clients based only on Resource or Kind. Hoist the
restmapper out of the controller and the clientpool, support a new
ClientForGroupVersionKind and ClientForGroupVersionResource, and use the
appropriate one in both places.
Allows OpenShift to use the GC
Automatic merge from submit-queue
Fix node performance benchmark by using latest containervm image (docker 1.11.2)
Also add two more tests for resource tracking.
cc/ @Random-Liu @coufon
The GC needs to build clients based only on Resource or Kind. Hoist the
restmapper out of the controller and the clientpool, support a new
ClientForGroupVersionKind and ClientForGroupVersionResource, and use the
appropriate one in both places.
Automatic merge from submit-queue
Node E2E: Change the disk eviction test to pull images again after the test.
Fixes https://github.com/kubernetes/kubernetes/issues/32022#issuecomment-248677706.
This PR changes the disk eviction test to pull test images again in `AfterEach`, because images may be evicted during the test.
@yujuhong
/cc @kubernetes/sig-node
Automatic merge from submit-queue
Make node E2E tests more transparent
Add some logging and minor code reorg to make the node E2E tests a little more transparent and understandable.
Automatic merge from submit-queue
Update the containervm image to the latest one (container-v1-3-v20160…
Node e2e is running with old containervm image which only has docker 1.9.1. This pr fixed such issue.
Automatic merge from submit-queue
Fix the properties file for node e2e cri validation.
I fixed this locally before, but accidentally missed in the PR. Sorry about that.
This time, I've tried myself, it should work.
@yujuhong
Automatic merge from submit-queue
Bump up GCI version.
```release-note
Upgrading Container-VM base image for k8s on GCE. Brief changelog as follows:
- Fixed performance regression in veth device driver
- Docker and related binaries are statically linked
- Fixed the issue of systemd being oom-killable
```
Fixes#32596
This needs a cherrypick into v1.4 release branch because it is fixing v1.4 release blocking issues. This patch is easy and safe to rollback in case of emergencies.
@vishh can you please review?
Fixes#32596 and many other issues.
cc/ @kubernetes/goog-image FYI
Brief changelog compared to gci-dev-54-8743-3-0:
- Fixed performance regression in veth device driver
- Docker and related binaries are statically linked
- Fixed the issue of systemd being oom-killable
- Updated built-in kubelet version to 1.3.7
- add ethtool and ebtables binaries expected by kubelet
Fixes#32596
Automatic merge from submit-queue
Node E2E: Add image white list
This is part of #29081. Fixes#29155.
As is discussed with @yujuhong in #29155, it is difficult to maintain the prepull image list if it is not enforced.
This PR added an image white list in the test framework, only images in the white list could be used in the test. If the image is not in the white list, the test will fail with reason:
```
Image "XXX" is not in the white list, consider adding it to CommonImageWhiteList in test/e2e/common/util.go or NodeImageWhiteList in test/e2e_node/image_list.go
```
Notice that if image pull policy is `PullAlways`, the image is not necessary to be in the white list or prepulled, because the test expects the image to be pulled during the test.
Currently, the image white list is only enabled in node e2e, because the image puller in e2e test is not integrated with the image white list yet.
/cc @kubernetes/sig-node
Automatic merge from submit-queue
[kubelet] Fix oom-score-adj policy in kubelet
Fixes#32238
We have been having this regression since v1.3. It is critical for GKE/GCE deployments of k8s because docker daemon has a high likelihood of being OOM killed which will end up nuking all containers.
The reason for moving from mnt to pid is that docker daemon moves itself into a new mnt namespace with systemd based deployments.
Automatic merge from submit-queue
change the error log for empty resource usage
This PR changes the error log for empty resource usage buffer for a container to be more clear. It happens when the container name is wrong, or cAdvisor somehow does not response.
Automatic merge from submit-queue
Get image and machine info from apiserver in node e2e test
This PR changes node e2e test to get image and machine information from API server instead of pass them from Jenkins test framework. The original format to pass image and machine info is naming the test node as "machine-image-uuid", which is hard to parse because "-" occurs a lot in both machine and image names.
Now we add two labels "image" and "machine" into performance data. The machine type has the format "cpu:1core,memory:3.6GB".
This PR is based on #32250.
Automatic merge from submit-queue
Add node e2e density test using 60 QPS for benchmark
This PR adds a new benchmark node e2e density test which sets Kubelet API QPS limit from default 5 to 60, through ConfigMap.
The latency caused by API QPS limit is as large as ~30% when creating a large batch of pods (e.g. 105). It makes the pod startup latency, as well creation throughput underestimated. This test helps us to know the real performance of Kubelet core.
Automatic merge from submit-queue
Update container image version for downward api volume tests
Some tests were using 0.7, and some were using 0.6, so updating all to 0.7.
@kubernetes/rh-cluster-infra
Automatic merge from submit-queue
Automated Docker Validation: Change wrong name in perf config.
The config key `containervm-density*` is improper, remove it.
/cc @coufon
Automatic merge from submit-queue
Log pressure condition, memory usage, events in memory eviction test
I want to log this to help us debug some of the latest memory eviction test flakes, where we are seeing burstable "fail" before the besteffort. I saw (in the logs) attempts by the eviction manager to evict besteffort a while before burstable phase changed to "Failed", but the besteffort's phase appeared to remain "Running". I want to see the pressure condition interleaved with the pod phases to get a sense of the eviction manager's knowledge vs. pod phase.
Automatic merge from submit-queue
Plumb --feature-gates from TEST_ARGS to components in node e2e tests
This means you can set `TEST_ARGS` on the command line, in a `.properties` config for a Jenkins job, etc, to toggle gated features. For example:
`TEST_ARGS='--feature-gates=DynamicKubeletConfig=true'`
/cc @vishh @jlowdermilk
Automatic merge from submit-queue
Rename ConnectToDockerOrDie to CreateDockerClientOrDie
This function does not actually attempt to connect to the docker daemon, it just creates a client object that can be used to do so later. The old name was confusing, as it implied that a failure to touch the docker daemon could cause program termination (rather than just a failure to create the client).
Automatic merge from submit-queue
Node E2E: Fix wrong permission bit for log file.
When creating log for logs from journald, we use `0755` which is weird to me.
This PR changes it to `0666`.
Automatic merge from submit-queue
Move wait for pressure to subside to AfterEach
so we still wait if the test part of the test for eviction order fails.
Prior to this change, a K8s branch (master as well as release) was
pinned to a GCI milestone. It would pick up the latest GCI release on
that milestone at the time of cluster creation. The rationale was the
K8s users would automatically get the bug fixes in newer versions of
GCI. However in practice, it makes the runtime environment
non-deterministic, and lack of continuous e2e tests mean we would run
into breakages sooner or later.
With this change, each K8s release will pick a specific version
of GCI by default (similar to how the Debian-based container-vm gets used).
Users can override the default version through KUBE_GCE_MASTER_IMAGE and
KUBE_GCE_NODE_IMAGE environment variables.
We expect the default GCI version will be updated relatively frequently stay
updated with newer GCI releases. We can also automate the process to
automatically bump the hard-coded GCI version in future.
This commit enables the dynamic kubelet configuration feature for the
node e2e Jenkins serial tests, which is where the test for dynamic kubelet
configuration currently runs.
This gives the node e2e test binary a --feature-gates flag that populates a
FeatureGates field on the test context. The value of this field is forwarded
to the kubelet's --feature-gates flag and is also used to populate the global
DefaultFeatureGate object so that statically-linked components see the same
feature gate settings as provided via the flag.
This means that you can set feature gates via the TEST_ARGS environment
variable when running node e2e tests. For example:
TEST_ARGS='--feature-gates=DynamicKubeletConfig=true'
This should stop the test from flaking while we figure out why there is
a mismatch between the reported pressure condition and the eviction
manager's decision to evict due to memory pressure.
Automatic merge from submit-queue
Node E2E: Move host info around test result.
Discussed offline with @yujuhong and @dchen1107. Currently, the node e2e result is organized as:
```
================================================================
Success Finished Host tmp-node-e2e-b6c375c7-e2e-node-containervm-v20160321-image Test Suite
{ginkgo-output}
{framework-error}
================================================================
```
This makes it painful to find which image the test is failing on. The `{ginkgo-output}` is usually quite long, so we have to scroll mouse up and down to find the host name.
This PR changes the test result to:
```
================================================================
Start Host tmp-node-e2e-b6c375c7-e2e-node-containervm-v20160321-image Test Suite
{ginkgo-output}
Success Finished Host tmp-node-e2e-b6c375c7-e2e-node-containervm-v20160321-image Test Suite
{framework-error}
================================================================
```
This is not perfect, but much better than before. We can easily find the host name under the ginkgo test result, like this:
```
================================================================
Start Host test-gci-dev-54-8743-3-0 Test Suite
Running Suite: E2eNode Suite
============================
Random Seed: 1472511489 - Will randomize all specs
Will run 0 of 131 specs
Running in parallel across 8 nodes
I0829 22:58:13.727764 1143 e2e_node_suite_test.go:98] Pre-pulling images so that they are cached for the tests.
I0829 22:58:28.562459 1143 e2e_node_suite_test.go:111] Node services started. Running tests...
I0829 22:58:28.562477 1143 e2e_node_suite_test.go:116] Wait for the node to be ready
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
------------------------------
I0829 22:58:29.742596 1143 e2e_node_suite_test.go:136] Stopping node services...
I0829 22:58:29.742650 1143 services.go:673] Killing process 1423 (services) with -TERM
I0829 22:58:29.860893 1143 e2e_node_suite_test.go:141] Tests Finished
Ran 0 of 131 Specs in 16.185 seconds
SUCCESS! -- 0 Passed | 0 Failed | 0 Pending | 131 Skipped
Ginkgo ran 1 suite in 19.939034297s
Test Suite Passed
Success Finished Host test-gci-dev-54-8743-3-0 Test Suite
================================================================
```
In a following PR, I'll print the test result from different images into different files to make it more clear for debugging. Mark v1.4 because this helps us de-flake test.
/cc @kubernetes/sig-node
Automatic merge from submit-queue
Explicitly delete pods in node performance tests
This PR explicitly deletes all created pods at the end in node e2e performance related tests.
The large number of pods may cause namespace cleanup times out (in #30878), therefore we explicitly delete all pods for cleaning up.
Automatic merge from submit-queue
increase latency and resource limit accroding to test results
This PR increases the latency limit of node e2e density test according to previous test results.
Fixed#30878
Automatic merge from submit-queue
test/node-e2e: Update CoreOS update disabling
Previously in this saga... #25004
This disables update-engine and locksmithd with ignition instead of
cloud-init so that they're really totally 100% disabled. Our ignition guy promises.
Pretty much every way of disabling them with cloud-init is mildly racy.
Fixes#31633
I think @vishh can say "I told you so" after the comment on https://github.com/kubernetes/kubernetes/pull/30023#discussion-diff-73431324 .. he was right, but it turns out "stop" there doesn't really work either because of the mess that is cloud-init. Fortunately, converting our cloud-init to json and calling it "ignition" works quite well 😄
Testing done: I ssh'd in and verified that yes, they're disabled. I didn't wait on the e2e tests to pass, so we'll let this PR check that.