Tests for scaling down based on an external metric are flaky. I think
this is because they:
- Start with 2 replicas,
- Export metric value == 1/2 target,
- Expect scale down to 1.
Since the expected recommendation is exactly 1, the tests might flake
(and with scale-down stabilization, any recommendation higher than 1
will persist).
Change the expected value of the metric so that the recommended size,
before rounding up, comes out below 1. This should make those tests
less flaky.
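For reference, the HPA recommendation is roughly
desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric).
The sketch below, with made-up metric values, shows why sitting exactly
on that boundary flakes and why a lower exported value gives margin:

    package main

    import (
        "fmt"
        "math"
    )

    // recommend mirrors the basic HPA rule:
    // desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric).
    func recommend(currentReplicas int, currentMetric, targetMetric float64) int {
        return int(math.Ceil(float64(currentReplicas) * currentMetric / targetMetric))
    }

    func main() {
        // Old setup: 2 replicas, metric == 1/2 of a target of 300.
        fmt.Println(recommend(2, 150, 300)) // 1, but exactly on the boundary
        fmt.Println(recommend(2, 151, 300)) // 2 -> a little noise causes a flake

        // New setup: export a lower value so the raw ratio is clearly below 1.
        fmt.Println(recommend(2, 100, 300)) // 1, with margin against noise
    }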
The detailed dumps of the original and patched item content were
useful while developing the feature, but are less relevant now and too
verbose. They might become relevant again, so they are left in the
code as comments.
What gets logged now is just a single-line "creating" or "deleting"
message with the type of the item and its unique name.
This also improves some other aspects of the original logging:
- the namespace is included for item types that are namespaced
- the "deleting" message is no longer replicated in each factory
method
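An illustrative helper showing the kind of single-line message
described above (the format and the helper itself are assumptions, not
the actual framework code):

    package main

    import "fmt"

    // logItem emits the single-line "creating" / "deleting" message: the verb,
    // the item's type, and its unique name, with the namespace included for
    // namespaced item types. Purely illustrative of the format.
    func logItem(verb, kind, namespace, name string) {
        if namespace != "" {
            fmt.Printf("%s %s: %s/%s\n", verb, kind, namespace, name)
            return
        }
        fmt.Printf("%s %s: %s\n", verb, kind, name)
    }

    func main() {
        logItem("creating", "ServiceAccount", "e2e-csi", "csi-attacher")
        logItem("deleting", "ClusterRole", "", "external-attacher-runner")
    }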
Fixes: #70448
- Scale down based on a custom metric was flaking. Increase the target
value of the metric.
- Scale down based on CPU was flaking during stabilization. Increase
the tolerance of the stabilization check (the flakes were caused by the
resource consumer using more CPU than requested).
Debugging the CSI driver tests depends a lot on the output of the CSI
sidecar containers and the CSI driver, but that information was not
captured automatically and thus was unavailable after a test run. This
is particularly bad when running in a remote CI system, and even
watching the cluster manually during a test was cumbersome.
Now pod events and log messages get copied to the test's output at the
time that they happen (when running without a report directory) or get
written to individual log files (when running with a report directory,
as in the CI).
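A minimal sketch of the log-copying part using plain client-go; the
real helper also copies pod events, and the package and function names
here are illustrative:

    package podlogs

    import (
        "context"
        "io"

        v1 "k8s.io/api/core/v1"
        "k8s.io/client-go/kubernetes"
    )

    // CopyContainerLog streams one container's log to the given writer, which
    // can be the test's output (e.g. ginkgo.GinkgoWriter) or a per-container
    // file in the report directory.
    func CopyContainerLog(ctx context.Context, cs kubernetes.Interface, ns, pod, container string, out io.Writer) error {
        stream, err := cs.CoreV1().Pods(ns).GetLogs(pod, &v1.PodLogOptions{
            Container: container,
            Follow:    true, // keep copying while the test runs
        }).Stream(ctx)
        if err != nil {
            return err
        }
        defer stream.Close()
        _, err = io.Copy(out, stream)
        return err
    }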
Ensuring that CSI drivers get deployed for testing exactly as intended
was problematic because the original .yaml files had to be converted
into code. e2e/manifest helped a bit, but not enough:
- could not load all entities
- didn't handle loading .yaml files with multiple entities
- actually creating and deleting entities still had to be done in tests
The new framework utility code handles all of that, including the
tricky cleanup operation that tests got wrong (AfterEach does not get
called after test failures!).
In addition, it ensures that each test gets its own instance of the
entities.
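An illustrative sketch of the usage pattern; createFromManifests here
is a stand-in stub, and its name, signature, and the manifest path are
assumptions rather than the actual new API:

    package storage

    import "fmt"

    // createFromManifests is a stand-in for the new utility code: it loads the
    // given .yaml files (each of which may contain multiple entities), creates
    // every entity for the current test, and returns a matching cleanup
    // function.
    func createFromManifests(files ...string) (cleanup func(), err error) {
        fmt.Println("creating items from", files)
        return func() { fmt.Println("deleting items from", files) }, nil
    }

    func testHostPathDriver() error {
        cleanup, err := createFromManifests(
            "test/e2e/testing-manifests/storage-csi/hostpath/hostpath/csi-hostpathplugin.yaml",
        )
        if err != nil {
            return err
        }
        // Cleanup is registered immediately, so the entities are removed even
        // when a later step fails, instead of relying on AfterEach.
        defer cleanup()

        // ... exercise the freshly deployed driver, which belongs only to
        // this test instance ...
        return nil
    }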
The PSP role binding for hostpath is now necessary because we switch
from creating the pod directly to creating it via the StatefulSet
controller, which runs with fewer privileges.
Without this, the hostpath test runs into these errors in the
kubernetes-e2e-gce job:
Oct 19 16:30:09.225: INFO: At 2018-10-19 16:25:07 +0000 UTC - event for csi-hostpath-attacher: {statefulset-controller } FailedCreate: create Pod csi-hostpath-attacher-0 in StatefulSet csi-hostpath-attacher failed error: pods "csi-hostpath-attacher-0" is forbidden: unable to validate against any pod security policy: []
Oct 19 16:30:09.225: INFO: At 2018-10-19 16:25:07 +0000 UTC - event for csi-hostpath-provisioner: {statefulset-controller } FailedCreate: create Pod csi-hostpath-provisioner-0 in StatefulSet csi-hostpath-provisioner failed error: pods "csi-hostpath-provisioner-0" is forbidden: unable to validate against any pod security policy: []
Oct 19 16:30:09.225: INFO: At 2018-10-19 16:25:07 +0000 UTC - event for csi-hostpathplugin: {daemonset-controller } FailedCreate: Error creating: pods "csi-hostpathplugin-" is forbidden: unable to validate against any pod security policy: []
The extra role binding is silently ignored on clusters which don't
have this particular role.
When we get an "unsupported provider" message, it often isn't clear
which method actually failed, so add more information to the error
message.
Issue #70280
Setting command line arguments via env variables that the binaries
themselves don't need is just unnecessarily complex. The driver
renaming code in the E2E manifest PR would have to be made more complex
to deal with such a deployment. It is easier, both for that code and
for humans looking at the .yaml file, to remove the indirection.
I increased the request from 500m CPU to 1 CPU in
https://github.com/kubernetes/kubernetes/pull/70125, but this interacts
poorly with other e2e tests (higher requests mean pods fail to
schedule). So revert that change.
Force ELB to ensureHealthCheck when the target pool exists.
Add an e2e test to ensure that the HC interval will be reconciled when
kube-controller-manager restarts.
Health checks with bigger thresholds and larger intervals will not be
reconciled.
Add unit tests for ILB and ELB to ensure the HC reconciles and is
configurable.
With CRI-O we've been hitting a lot of flakes with the following test:
[sig-apps] CronJob should remove from active list jobs that have been deleted
The events shown in the test failures in both kube and openshift were the following:
STEP: Found 13 events.
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:02 +0000 UTC - event for forbid: {cronjob-controller } SuccessfulCreate: Created job forbid-1540412040
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:02 +0000 UTC - event for forbid-1540412040: {job-controller } SuccessfulCreate: Created pod: forbid-1540412040-z7n7t
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:02 +0000 UTC - event for forbid-1540412040-z7n7t: {default-scheduler } Scheduled: Successfully assigned e2e-tests-cronjob-rjr2m/forbid-1540412040-z7n7t to 127.0.0.1
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:03 +0000 UTC - event for forbid-1540412040-z7n7t: {kubelet 127.0.0.1} Pulled: Container image "docker.io/library/busybox:1.29" already present on machine
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:03 +0000 UTC - event for forbid-1540412040-z7n7t: {kubelet 127.0.0.1} Created: Created container
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:03 +0000 UTC - event for forbid-1540412040-z7n7t: {kubelet 127.0.0.1} Started: Started container
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:12 +0000 UTC - event for forbid: {cronjob-controller } MissingJob: Active job went missing: forbid-1540412040
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:15:02 +0000 UTC - event for forbid: {cronjob-controller } SuccessfulCreate: Created job forbid-1540412100
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:15:02 +0000 UTC - event for forbid-1540412100: {job-controller } SuccessfulCreate: Created pod: forbid-1540412100-rq89l
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:15:02 +0000 UTC - event for forbid-1540412100-rq89l: {default-scheduler } Scheduled: Successfully assigned e2e-tests-cronjob-rjr2m/forbid-1540412100-rq89l to 127.0.0.1
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:15:06 +0000 UTC - event for forbid-1540412100-rq89l: {kubelet 127.0.0.1} Started: Started container
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:15:06 +0000 UTC - event for forbid-1540412100-rq89l: {kubelet 127.0.0.1} Created: Created container
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:15:06 +0000 UTC - event for forbid-1540412100-rq89l: {kubelet 127.0.0.1} Pulled: Container image "docker.io/library/busybox:1.29" already present on machine
The test code is racy because the Forbid policy can still let the
controller create a new pod for the cronjob. CRI-O is fast at
re-creating the pod, and by the time the test code reaches the check,
it fails. The events are as follows:
[It] should remove from active list jobs that have been deleted
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/apps/cronjob.go:192
STEP: Creating a ForbidConcurrent cronjob
STEP: Ensuring a job is scheduled
STEP: Ensuring exactly one is scheduled
STEP: Deleting the job
STEP: deleting Job.batch forbid-1540412040 in namespace e2e-tests-cronjob-rjr2m, will wait for the garbage collector to delete the pods
Oct 24 20:14:02.533: INFO: Deleting Job.batch forbid-1540412040 took: 2.699182ms
Oct 24 20:14:02.634: INFO: Terminating Job.batch forbid-1540412040 pods took: 100.223228ms
STEP: Ensuring job was deleted
STEP: Ensuring there are no active jobs in the cronjob
[AfterEach] [sig-apps] CronJob
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:148
It looks clear that by the time we're ensuring that there are no more
active jobs, a new job could _already_ have been spun up, making the
test flake.
This PR fixes the above by only making sure that the _deleted_ job is
no longer in the Active list; another pod may already be running with a
different UID, which is fine for the purpose of the test.
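A sketch of the less racy check, using current client-go signatures;
the helper name and timeouts are illustrative:

    package cronjob

    import (
        "context"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/apimachinery/pkg/util/wait"
        clientset "k8s.io/client-go/kubernetes"
    )

    // waitForDeletedJobGone waits until the deleted job is no longer referenced
    // in the cronjob's Active list. A new job with a different UID may already
    // be active; that is fine for this test, so it is not treated as a failure.
    func waitForDeletedJobGone(c clientset.Interface, ns, cronJobName string, deletedUID types.UID) error {
        return wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
            cj, err := c.BatchV1().CronJobs(ns).Get(context.TODO(), cronJobName, metav1.GetOptions{})
            if err != nil {
                return false, err
            }
            for _, ref := range cj.Status.Active {
                if ref.UID == deletedUID {
                    return false, nil // the deleted job is still listed as active
                }
            }
            return true, nil
        })
    }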
Signed-off-by: Antonio Murdaca <runcom@linux.com>
The E2E refactoring tightened the sanity checking of the --provider
parameter such that it only allowed known providers. That seemed to
make sense because it catches typos, but it turned out that various
callers depended on the "accept arbitrary provider value" behavior,
therefore it gets restored.
The provisioning test for the retain policy requires each driver's
backend volume deletion logic. Without it, volumes leak. Move this test
back to volume_provisioning.go and run it only for gce, until general
backend volume deletion code for each driver becomes available.
Fixes: #70191
The resource consumer might use slightly more CPU than requested. That
resulted in the HPA sometimes increasing the size of deployments during
e2e tests. Deflake the tests by:
- Scaling up CPU requests in those tests. The resource consumer might
go a fixed number of milli CPU seconds above target, so having higher
requests makes the test less sensitive.
- On scale down, consuming CPU in the middle between what would
generate a recommendation of the expected size and one pod fewer
(instead of right on the edge between expected and expected + 1); see
the worked example below.
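A worked example of the midpoint rule, with made-up numbers and
assuming the CPU recommendation is roughly
ceil(totalConsumed / targetPerPod):

    package main

    import "fmt"

    func main() {
        // With a per-pod target of 500 mCPU and an expected size of 3 pods:
        //   1500 mCPU -> ratio exactly 3.0, right on the edge to 4 pods
        //   1000 mCPU -> ratio exactly 2.0, right on the edge to 3 pods
        // Consuming the midpoint keeps the recommendation at 3 even if the
        // consumer overshoots a little.
        const targetPerPod = 500 // mCPU
        expectedPods := 3

        upperEdge := expectedPods * targetPerPod       // 1500 mCPU
        lowerEdge := (expectedPods - 1) * targetPerPod // 1000 mCPU
        midpoint := (upperEdge + lowerEdge) / 2        // 1250 mCPU -> ratio 2.5

        fmt.Println("consume", midpoint, "mCPU for an expected size of", expectedPods)
    }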
Some variables were int32 but always cast to int before use. Make them
int.
The service account authenticator isn't the only authenticator that
should respect API audience. The authentication config structure should
reflect that.
If dig fails to reach the DNS server, it outputs errors to stdout,
which can cause the "test -n" check to falsely pass:
ubuntu@ubuntu:~$ dig +notcp +noall +answer +search kubernetes.default A
;; connection timed out; no servers could be reached
command terminated with exit code 9
ubuntu@ubuntu:~$ test -n "$(dig +notcp +noall +answer +search kubernetes.default A)" && echo OK
OK
This patch solves this issue by making sure the dig command actually succeeds
before checking its output.
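A sketch of how the probe command could be assembled so that the
output is only inspected when dig itself succeeded; the helper name
and file paths are illustrative:

    package dns

    import "fmt"

    // buildProbeCmd returns a shell one-liner for the DNS probe. The "&&"
    // before "test -n" ensures the output is only inspected when dig exited
    // successfully; a failing dig (e.g. "connection timed out", exit code 9)
    // would otherwise print its error to stdout and make "test -n" pass.
    func buildProbeCmd(lookup, okFile string) string {
        dig := fmt.Sprintf("dig +notcp +noall +answer +search %s A", lookup)
        return fmt.Sprintf(`%s > /tmp/out && test -n "$(cat /tmp/out)" && echo OK > %s`, dig, okFile)
    }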
This change updates the ETCD storage test so that its data is
exported. Thus it can be used by other tests. The dry run test was
updated to consume this data instead of having a duplicate copy.
The code to start a master that can be used for "one of every
resource" style tests was also factored out. It is reused in the
dry run test as well.
This prevents these tests from drifting in the future and reduces
the long term maintenance burden.
Signed-off-by: Monis Khan <mkhan@redhat.com>
In some storage tests, kubelet is stopped first and the test then
checks for the node's NotReady state. However, if the node fails to
reach this state, kubelet could not be restarted because the defer
function is placed after the stop-kubelet command. This PR fixes this
issue.
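A minimal sketch of the safer ordering, with hypothetical stand-ins
for the real SSH-based helpers:

    package storage

    import "fmt"

    // Hypothetical helpers standing in for the real SSH-based ones.
    func stopKubelet(nodeIP string) error         { fmt.Println("stopping kubelet on", nodeIP); return nil }
    func startKubelet(nodeIP string) error        { fmt.Println("restarting kubelet on", nodeIP); return nil }
    func waitForNodeNotReady(nodeIP string) error { fmt.Println("waiting for NotReady on", nodeIP); return nil }

    // testWithStoppedKubelet shows the safer ordering: the restart is deferred
    // right after kubelet is stopped, so it also runs when the NotReady check
    // fails and aborts the test.
    func testWithStoppedKubelet(nodeIP string) error {
        if err := stopKubelet(nodeIP); err != nil {
            return err
        }
        defer startKubelet(nodeIP) // registered before anything that can fail

        if err := waitForNodeNotReady(nodeIP); err != nil {
            return err // kubelet is still restarted via the defer above
        }

        // ... rest of the test ...
        return nil
    }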
Make CreatePrivilegedPSPBinding reentrant so tests using it (e.g. DNS)
can be executed more than once against a cluster. Without this change,
such tests will fail because the PSP already exists, short-circuiting
test setup.
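The reentrancy pattern boils down to treating "already exists" as
success during setup. Sketched here for the role binding part with
current client-go calls; the helper name is illustrative, and the PSP
and role are handled the same way:

    package framework

    import (
        "context"

        rbacv1 "k8s.io/api/rbac/v1"
        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        clientset "k8s.io/client-go/kubernetes"
    )

    // createBindingIfMissing creates the cluster role binding and ignores
    // "already exists" errors so that repeated test runs against the same
    // cluster do not fail during setup.
    func createBindingIfMissing(c clientset.Interface, binding *rbacv1.ClusterRoleBinding) error {
        _, err := c.RbacV1().ClusterRoleBindings().Create(context.TODO(), binding, metav1.CreateOptions{})
        if apierrors.IsAlreadyExists(err) {
            return nil // created by a previous run; reuse it
        }
        return err
    }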
Not all users of the E2E framework want to run cloud-provider specific
tests. By splitting out the code it becomes possible to decide in an
E2E test suite which providers are supported.
This is achieved in two ways:
- the framework calls certain functions through a provider
interface (see the sketch after this list) instead of calling specific
cloud provider functions directly
- tests that are cloud-provider specific directly import the
new provider packages
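A rough sketch of the first point; the interface and function names
here are illustrative, not the full method set of the real provider
interface:

    package framework

    // Framework is elided here; only the provider plumbing is sketched.
    type Framework struct{ /* ... */ }

    // ProviderInterface is what the framework calls instead of reaching into
    // cloud-provider specific code directly.
    type ProviderInterface interface {
        FrameworkBeforeEach(f *Framework)
        FrameworkAfterEach(f *Framework)
    }

    var providers = map[string]ProviderInterface{}

    // RegisterProvider is called from init() in the individual provider
    // packages (gce, aws, ...), which provider-specific tests import directly.
    func RegisterProvider(name string, p ProviderInterface) {
        providers[name] = p
    }

    // SetupProviderConfig picks the implementation matching --provider.
    func SetupProviderConfig(name string) (ProviderInterface, bool) {
        p, ok := providers[name]
        return p, ok
    }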
The ingress test utilities are only needed by a few tests. Splitting
them out into a separate package makes the framework simpler for test
suites not using those tests.
Fixes: #66649
Some commands used in tests are Linux-specific and do not exist or do
not behave the same on Windows nodes. This can cause those tests to
fail on Windows nodes.
Replace those commands with ones that behave the same on both Linux
and Windows.
The test "Should be able to deny attaching pod" can randomly fail
because it never waits for the pod to enter a "Running" state. Because
of this, the "kubectl attach" command can potentially fail, if the pod is
not in the correct state.
Additionally, if the "kubectl attach" actually manages to attach, then the
test will hang. The command should be executed with a timeout.
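A sketch of the two fixes, using plain client-go and os/exec; helper
names and timeouts are illustrative:

    package webhook

    import (
        "context"
        "os/exec"
        "time"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        clientset "k8s.io/client-go/kubernetes"
    )

    // waitForPodRunning blocks until the pod reaches the Running phase, so the
    // subsequent attach has a pod in the right state to talk to.
    func waitForPodRunning(c clientset.Interface, ns, name string) error {
        return wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
            pod, err := c.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
            if err != nil {
                return false, err
            }
            return pod.Status.Phase == v1.PodRunning, nil
        })
    }

    // attachWithTimeout runs "kubectl attach" with a deadline: if the webhook
    // fails to deny the attach, the command is killed when the context expires
    // instead of hanging the test forever.
    func attachWithTimeout(ns, pod string, timeout time.Duration) ([]byte, error) {
        ctx, cancel := context.WithTimeout(context.Background(), timeout)
        defer cancel()
        return exec.CommandContext(ctx, "kubectl", "attach", "-n", ns, pod, "-i").CombinedOutput()
    }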
After the call to ioutil.TempDir, the directory has already been
created, and MkdirAll therefore can't do anything. The mode argument
in particular is misleading.
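For illustration, ioutil.TempDir already creates the directory (with
mode 0700), so a following MkdirAll is a no-op and its mode argument
never takes effect:

    package main

    import (
        "fmt"
        "io/ioutil"
        "os"
    )

    func main() {
        // The directory exists as soon as TempDir returns.
        dir, err := ioutil.TempDir("", "example")
        if err != nil {
            panic(err)
        }
        defer os.RemoveAll(dir)

        // MkdirAll on an existing directory does nothing and ignores the mode.
        if err := os.MkdirAll(dir, 0755); err != nil {
            panic(err)
        }

        info, _ := os.Stat(dir)
        fmt.Println(dir, "mode:", info.Mode().Perm()) // still 0700 from TempDir
    }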
Test settings should be defined in the test source code itself
because conceptually the framework is a separate entity that not all
test authors can modify.
For the sake of backwards compatibility, the names of the command line
flags are not changed.
Test settings should be defined in the test source code itself
because conceptually the framework is a separate entity that not all
test authors can modify.
Using the new framework/config code also has several advantages:
- defaults can be set with less code
- no confusion around what's a duration
- the options can also be set via command line flags
While at it, a minor bug gets fixed:
- readConfig() returns only defaults when called while
registering Ginkgo tests because Viperize() gets called later,
so the scale in the logging soak test couldn't really be configured;
now the value is read when the test runs and thus can be changed
The options get moved into the "instrumentation.logging" and
"instrumentation.monitoring" groups, respectively, to make it more
obvious where they are used. This is a breaking change, but one that
was already necessary to change the duration setting from a plain
integer to a proper time duration.
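An illustrative registration of the logging soak options with
framework/config; the struct tags, field names, and defaults are
assumptions showing the pattern (one struct, one AddOptions call), not
the exact code:

    package logging

    import (
        "time"

        "k8s.io/kubernetes/test/e2e/framework/config"
    )

    // The values become settable both as command line flags (with the
    // "instrumentation.logging.soak." prefix) and via a Viper config file.
    var loggingSoak struct {
        Scale            int           `default:"1" usage:"number of waves of pods"`
        TimeBetweenWaves time.Duration `default:"5000ms" usage:"time to wait before the next wave"`
    }

    var _ = config.AddOptions(&loggingSoak, "instrumentation.logging.soak")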
Tests shouldn't have to use the central context for their settings,
because conceptually tests and framework get developed independently.
This does not yet use the new framework/config utility code because
that code still needs to be reviewed.
Besides moving the flags, they also get renamed from the top-level
"--csiImage{Version|Registry}" to
"--storage.csi.image.{version|registry}". These flags were introduced
fairly recently and shouldn't be in use much, so now is a good time to
introduce a hierarchical naming for storage flags, in particular
because more flags will be added soon.
Storing settings in the framework's TestContext is not something that
out-of-tree test authors can do because for them the framework is a
read-only upstream component. Conceptually the same is true for
in-tree tests, so the recommended approach is to define configuration
settings in the code that uses them.
How to do that is a bit uncertain. Viper has several
drawbacks (maintenance status uncertain, cannot list supported
options, cannot validate the configuration file). How to handle
configuration files is currently getting discussed for kubeadm, with
similar concerns about
Viper (https://github.com/kubernetes/kubeadm/issues/1040).
Instead of making a choice now for E2E, the recommendation is that
test authors continue to define command line flags as before, except
that they should do it in their own code and with better flag names.
But the ability to read options also from a file is useful, so
several enhancements get added:
- all settings defined via flags can also be read from a
configuration file, without extra work for test authors
- framework/config makes it possible to populate a struct directly
and define flags with a single function call
- a path and file suffix can be given to --viper-config (as in
"--viper-config /tmp/e2e.json") instead of expecting the file in
the current directory; as before, just plain "--viper-config e2e"
still works
- if "--viper-config" is set, the file must exist; otherwise the
"e2e" config is optional (as before)
- errors from Viper are no longer silently ignored, so syntax errors
are detected early
- Viper support is optional: test suite authors who don't want
it are not forced to use it by the e2e/framework
Individual implementations are not yet being moved.
Fixed all dependencies which call the interface.
Fixed golint exceptions to reflect the move.
Added project info as per @dims and
https://github.com/kubernetes/kubernetes-template-project.
Added dims to the security contacts.
Fixed minor issues.
Added missing template files.
Copied ControllerClientBuilder interface to cp.
This allows us to break the only dependency on K8s/K8s.
Added TODO to ControllerClientBuilder.
Fixed GoDeps.
Factored in feedback from JustinSB.