The detailed dumps of original and patched item content was useful
while developing the feature, but is less relevant now and too
verbose. It might be relevant again, so it's left in the code as
comments.
What gets logged now is just a single-line "creating" resp. "deleting"
message with the type of the item and its unique name.
This also enhances up some other aspects of the original logging:
- the namespace is included for item types that are namespaced
- the "deleting" message no longer gets replicated in each factory
method
Fixes: #70448
- Scale down based on custom metric was flaking. Increase target value
of the metric.
- Scale down based on CPU was flaking during stabilization. Increase
tolerance of stabilization (caused by resource consumer using more CPU
than requested).
Debugging the CSI driver tests depends a lot on the output of the CSI
sidecar containers and the CSI driver, but that information was not
captured automatically and thus unavailable after a test run. This is
particularly bad when running in a remote CI system, but also manually
watching the cluster during a test was cumbersome.
Now pod events and log messages get copied to the test's output at the
time that they happen (when running without report directory) or get
written to individual log files (when running with report directory in
the CI).
Ensuring that CSI drivers get deployed for testing exactly as intended
was problematic because the original .yaml files had to be converted
into code. e2e/manifest helped a bit, but not enough:
- could not load all entities
- didn't handle loading .yaml files with multiple entities
- actually creating and deleting entities still had to be done in tests
The new framework utility code handles all of that, including the
tricky cleanup operation that tests got wrong (AfterEach does not get
called after test failures!).
In addition, it is ensuring that each test gets its own instance of the
entities.
The PSP role binding for hostpath is now necessary because we switch
from creating a pod directly to creation via the StatefulSet
controller, which runs with less privileges.
Without this, the hostpath test runs into these errors in the
kubernetes-e2e-gce job:
Oct 19 16:30:09.225: INFO: At 2018-10-19 16:25:07 +0000 UTC - event for csi-hostpath-attacher: {statefulset-controller } FailedCreate: create Pod csi-hostpath-attacher-0 in StatefulSet csi-hostpath-attacher failed error: pods "csi-hostpath-attacher-0" is forbidden: unable to validate against any pod security policy: []
Oct 19 16:30:09.225: INFO: At 2018-10-19 16:25:07 +0000 UTC - event for csi-hostpath-provisioner: {statefulset-controller } FailedCreate: create Pod csi-hostpath-provisioner-0 in StatefulSet csi-hostpath-provisioner failed error: pods "csi-hostpath-provisioner-0" is forbidden: unable to validate against any pod security policy: []
Oct 19 16:30:09.225: INFO: At 2018-10-19 16:25:07 +0000 UTC - event for csi-hostpathplugin: {daemonset-controller } FailedCreate: Error creating: pods "csi-hostpathplugin-" is forbidden: unable to validate against any pod security policy: []
The extra role binding is silently ignored on clusters which don't
have this particular role.
When we get an unsupported provider message, it often isn't clear what
method actually failed - add more information to the error message.
Issue #70280
Setting command line arguments via env variables that are not needed
by the binaries is just unnecessarily complex. The driver renaming
code in the E2E manifest PR would have to be made more complex to deal
with such a deployment. It is easier for that code and humans who look
at the .yaml file to remove the indirection.
I increased request from 500mCPU to 1CPU in
https://github.com/kubernetes/kubernetes/pull/70125 but this interacts
poorly with other e2e tests (higher requests mean pods fil to schedule).
So revert that change.
Force ELB to ensureHealthCheck when target pool exists.
Add e2e test to ensure that HC interval will be reconciled when
kube-controller-manager restarts.
Health checks with bigger thresholds and larger intervals will not be reconciled.
Add unittest for ILB and ELB to ensure HC reconciles and is configurable.
With CRI-O we've been hitting a lot of flakes with the following test:
[sig-apps] CronJob should remove from active list jobs that have been deleted
The events shown in the test failures in both kube and openshift were the following:
STEP: Found 13 events.
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:02 +0000 UTC - event for forbid: {cronjob-controller } SuccessfulCreate: Created job forbid-1540412040
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:02 +0000 UTC - event for forbid-1540412040: {job-controller } SuccessfulCreate: Created pod: forbid-1540412040-z7n7t
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:02 +0000 UTC - event for forbid-1540412040-z7n7t: {default-scheduler } Scheduled: Successfully assigned e2e-tests-cronjob-rjr2m/forbid-1540412040-z7n7t to 127.0.0.1
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:03 +0000 UTC - event for forbid-1540412040-z7n7t: {kubelet 127.0.0.1} Pulled: Container image "docker.io/library/busybox:1.29" already present on machine
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:03 +0000 UTC - event for forbid-1540412040-z7n7t: {kubelet 127.0.0.1} Created: Created container
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:03 +0000 UTC - event for forbid-1540412040-z7n7t: {kubelet 127.0.0.1} Started: Started container
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:12 +0000 UTC - event for forbid: {cronjob-controller } MissingJob: Active job went missing: forbid-1540412040
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:15:02 +0000 UTC - event for forbid: {cronjob-controller } SuccessfulCreate: Created job forbid-1540412100
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:15:02 +0000 UTC - event for forbid-1540412100: {job-controller } SuccessfulCreate: Created pod: forbid-1540412100-rq89l
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:15:02 +0000 UTC - event for forbid-1540412100-rq89l: {default-scheduler } Scheduled: Successfully assigned e2e-tests-cronjob-rjr2m/forbid-1540412100-rq89l to 127.0.0.1
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:15:06 +0000 UTC - event for forbid-1540412100-rq89l: {kubelet 127.0.0.1} Started: Started container
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:15:06 +0000 UTC - event for forbid-1540412100-rq89l: {kubelet 127.0.0.1} Created: Created container
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:15:06 +0000 UTC - event for forbid-1540412100-rq89l: {kubelet 127.0.0.1} Pulled: Container image "docker.io/library/busybox:1.29" already present on machine
The code in test is racy because the Forbid policy can still let the controller to create
a new pod for the cronjob. CRI-O is fast at re-creating the pod and by the time
the test code reaches the check, it fails. The events are as follow:
[It] should remove from active list jobs that have been deleted
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/apps/cronjob.go:192
STEP: Creating a ForbidConcurrent cronjob
STEP: Ensuring a job is scheduled
STEP: Ensuring exactly one is scheduled
STEP: Deleting the job
STEP: deleting Job.batch forbid-1540412040 in namespace e2e-tests-cronjob-rjr2m, will wait for the garbage collector to delete the pods
Oct 24 20:14:02.533: INFO: Deleting Job.batch forbid-1540412040 took: 2.699182ms
Oct 24 20:14:02.634: INFO: Terminating Job.batch forbid-1540412040 pods took: 100.223228ms
STEP: Ensuring job was deleted
STEP: Ensuring there are no active jobs in the cronjob
[AfterEach] [sig-apps] CronJob
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:148
It looks clear that by the time we're ensuring that there are no more active jobs, there
could be _already_ a new job spinned, making the test flakes.
This PR fixes all the above by making sure that the _deleted_ job is not in the Active
list anymore, besides other pod already running but with different UUID which is
going to be fine anyway for the purpose of the test.
Signed-off-by: Antonio Murdaca <runcom@linux.com>
The E2E refactoring tightened the sanity checking of the --provider
parameter such that it only allowed known providers. That seemed to
make sense because it catches typos, but it turned out that various
callers depended on the "accept arbitrary provider value" behavior,
therefore it gets restored.
Provisioning test for retain policy requires each driver's backend volume
deletion logic. Without it, volume leakage happens. Move this test back
to volume_provisioning.go and test it only for gce, until general
backend volume deletion code for each driver becomes available.
Fixes: #70191
Resource consumer might use slightly more CPU than requested. That
resulted in HPA sometimes increasing size of deployments during e2e
tests. Deflake tests by:
- Scaling up CPU requests in those tests. Resource consumer might go a fixed
number of milli CPU seconds above target. Having higher requests makes
the test less sensitive.
- On scale down consume CPU in the middle between what would generate
recommendation of expexted size and 1 pod fewer (instead of righ on
edge beween expected and expected +1).
Some variables were int32 but always cast to int before use. Make them
int.
The service account authenticator isn't the only authenticator that
should respect API audience. The authentication config structure should
reflect that.
If dig fails to reach the DNS, it will output errors to stdout, which
can cause the test -n to falsely pass:
ubuntu@ubuntu:~$ dig +notcp +noall +answer +search kubernetes.default A
;; connection timed out; no servers could be reached
command terminated with exit code 9
ubuntu@ubuntu:~$ test -n "$(dig +notcp +noall +answer +search kubernetes.default A)" && echo OK
OK
This patch solves this issue by making sure the dig command actually succeeds
before checking its output.
This change updates the ETCD storage test so that its data is
exported. Thus it can be used by other tests. The dry run test was
updated to consume this data instead of having a duplicate copy.
The code to start a master that can be used for "one of every
resource" style tests was also factored out. It is reused in the
dry run test as well.
This prevents these tests from drifting in the future and reduces
the long term maintenance burden.
Signed-off-by: Monis Khan <mkhan@redhat.com>
In some storage tests, kubelet is stopped first and the test check node
NotReady state. However, if it fails to have this state, kubelet could
not be restarted because the defer function is placed after the stop
kubelet command. This PR fixes this issue.
Make CreatePrivilegedPSPBinding reentrant so tests using it (e.g. DNS) can be
executed more than once against a cluster. Without this change, such tests will
fail because the PSP already exists, short circuiting test setup.
Not all users of the E2E framework want to run cloud-provider specific
tests. By splitting out the code it becomes possible to decide in
a E2E test suite which providers are supported.
This is achieved in two ways:
- the framework calls certain functions through a provider
interface instead of calling specific cloud provider functions
directly
- tests that are cloud-provider specific directly import the
new provider packages
The ingress test utilities are only needed by a few tests. Splitting
them out into a separate package makes the framework simpler for test
suites not using those tests.
Fixes: #66649