The detailed dumps of original and patched item content was useful
while developing the feature, but is less relevant now and too
verbose. It might be relevant again, so it's left in the code as
comments.
What gets logged now is just a single-line "creating" resp. "deleting"
message with the type of the item and its unique name.
This also enhances up some other aspects of the original logging:
- the namespace is included for item types that are namespaced
- the "deleting" message no longer gets replicated in each factory
method
Fixes: #70448
- Scale down based on custom metric was flaking. Increase target value
of the metric.
- Scale down based on CPU was flaking during stabilization. Increase
tolerance of stabilization (caused by resource consumer using more CPU
than requested).
Debugging the CSI driver tests depends a lot on the output of the CSI
sidecar containers and the CSI driver, but that information was not
captured automatically and thus unavailable after a test run. This is
particularly bad when running in a remote CI system, but also manually
watching the cluster during a test was cumbersome.
Now pod events and log messages get copied to the test's output at the
time that they happen (when running without report directory) or get
written to individual log files (when running with report directory in
the CI).
Ensuring that CSI drivers get deployed for testing exactly as intended
was problematic because the original .yaml files had to be converted
into code. e2e/manifest helped a bit, but not enough:
- could not load all entities
- didn't handle loading .yaml files with multiple entities
- actually creating and deleting entities still had to be done in tests
The new framework utility code handles all of that, including the
tricky cleanup operation that tests got wrong (AfterEach does not get
called after test failures!).
In addition, it is ensuring that each test gets its own instance of the
entities.
The PSP role binding for hostpath is now necessary because we switch
from creating a pod directly to creation via the StatefulSet
controller, which runs with less privileges.
Without this, the hostpath test runs into these errors in the
kubernetes-e2e-gce job:
Oct 19 16:30:09.225: INFO: At 2018-10-19 16:25:07 +0000 UTC - event for csi-hostpath-attacher: {statefulset-controller } FailedCreate: create Pod csi-hostpath-attacher-0 in StatefulSet csi-hostpath-attacher failed error: pods "csi-hostpath-attacher-0" is forbidden: unable to validate against any pod security policy: []
Oct 19 16:30:09.225: INFO: At 2018-10-19 16:25:07 +0000 UTC - event for csi-hostpath-provisioner: {statefulset-controller } FailedCreate: create Pod csi-hostpath-provisioner-0 in StatefulSet csi-hostpath-provisioner failed error: pods "csi-hostpath-provisioner-0" is forbidden: unable to validate against any pod security policy: []
Oct 19 16:30:09.225: INFO: At 2018-10-19 16:25:07 +0000 UTC - event for csi-hostpathplugin: {daemonset-controller } FailedCreate: Error creating: pods "csi-hostpathplugin-" is forbidden: unable to validate against any pod security policy: []
The extra role binding is silently ignored on clusters which don't
have this particular role.
When we get an unsupported provider message, it often isn't clear what
method actually failed - add more information to the error message.
Issue #70280
Setting command line arguments via env variables that are not needed
by the binaries is just unnecessarily complex. The driver renaming
code in the E2E manifest PR would have to be made more complex to deal
with such a deployment. It is easier for that code and humans who look
at the .yaml file to remove the indirection.
I increased request from 500mCPU to 1CPU in
https://github.com/kubernetes/kubernetes/pull/70125 but this interacts
poorly with other e2e tests (higher requests mean pods fil to schedule).
So revert that change.
Force ELB to ensureHealthCheck when target pool exists.
Add e2e test to ensure that HC interval will be reconciled when
kube-controller-manager restarts.
Health checks with bigger thresholds and larger intervals will not be reconciled.
Add unittest for ILB and ELB to ensure HC reconciles and is configurable.
With CRI-O we've been hitting a lot of flakes with the following test:
[sig-apps] CronJob should remove from active list jobs that have been deleted
The events shown in the test failures in both kube and openshift were the following:
STEP: Found 13 events.
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:02 +0000 UTC - event for forbid: {cronjob-controller } SuccessfulCreate: Created job forbid-1540412040
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:02 +0000 UTC - event for forbid-1540412040: {job-controller } SuccessfulCreate: Created pod: forbid-1540412040-z7n7t
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:02 +0000 UTC - event for forbid-1540412040-z7n7t: {default-scheduler } Scheduled: Successfully assigned e2e-tests-cronjob-rjr2m/forbid-1540412040-z7n7t to 127.0.0.1
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:03 +0000 UTC - event for forbid-1540412040-z7n7t: {kubelet 127.0.0.1} Pulled: Container image "docker.io/library/busybox:1.29" already present on machine
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:03 +0000 UTC - event for forbid-1540412040-z7n7t: {kubelet 127.0.0.1} Created: Created container
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:03 +0000 UTC - event for forbid-1540412040-z7n7t: {kubelet 127.0.0.1} Started: Started container
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:14:12 +0000 UTC - event for forbid: {cronjob-controller } MissingJob: Active job went missing: forbid-1540412040
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:15:02 +0000 UTC - event for forbid: {cronjob-controller } SuccessfulCreate: Created job forbid-1540412100
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:15:02 +0000 UTC - event for forbid-1540412100: {job-controller } SuccessfulCreate: Created pod: forbid-1540412100-rq89l
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:15:02 +0000 UTC - event for forbid-1540412100-rq89l: {default-scheduler } Scheduled: Successfully assigned e2e-tests-cronjob-rjr2m/forbid-1540412100-rq89l to 127.0.0.1
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:15:06 +0000 UTC - event for forbid-1540412100-rq89l: {kubelet 127.0.0.1} Started: Started container
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:15:06 +0000 UTC - event for forbid-1540412100-rq89l: {kubelet 127.0.0.1} Created: Created container
Oct 24 20:20:05.541: INFO: At 2018-10-24 20:15:06 +0000 UTC - event for forbid-1540412100-rq89l: {kubelet 127.0.0.1} Pulled: Container image "docker.io/library/busybox:1.29" already present on machine
The code in test is racy because the Forbid policy can still let the controller to create
a new pod for the cronjob. CRI-O is fast at re-creating the pod and by the time
the test code reaches the check, it fails. The events are as follow:
[It] should remove from active list jobs that have been deleted
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/apps/cronjob.go:192
STEP: Creating a ForbidConcurrent cronjob
STEP: Ensuring a job is scheduled
STEP: Ensuring exactly one is scheduled
STEP: Deleting the job
STEP: deleting Job.batch forbid-1540412040 in namespace e2e-tests-cronjob-rjr2m, will wait for the garbage collector to delete the pods
Oct 24 20:14:02.533: INFO: Deleting Job.batch forbid-1540412040 took: 2.699182ms
Oct 24 20:14:02.634: INFO: Terminating Job.batch forbid-1540412040 pods took: 100.223228ms
STEP: Ensuring job was deleted
STEP: Ensuring there are no active jobs in the cronjob
[AfterEach] [sig-apps] CronJob
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:148
It looks clear that by the time we're ensuring that there are no more active jobs, there
could be _already_ a new job spinned, making the test flakes.
This PR fixes all the above by making sure that the _deleted_ job is not in the Active
list anymore, besides other pod already running but with different UUID which is
going to be fine anyway for the purpose of the test.
Signed-off-by: Antonio Murdaca <runcom@linux.com>
The E2E refactoring tightened the sanity checking of the --provider
parameter such that it only allowed known providers. That seemed to
make sense because it catches typos, but it turned out that various
callers depended on the "accept arbitrary provider value" behavior,
therefore it gets restored.
Provisioning test for retain policy requires each driver's backend volume
deletion logic. Without it, volume leakage happens. Move this test back
to volume_provisioning.go and test it only for gce, until general
backend volume deletion code for each driver becomes available.
Fixes: #70191
Resource consumer might use slightly more CPU than requested. That
resulted in HPA sometimes increasing size of deployments during e2e
tests. Deflake tests by:
- Scaling up CPU requests in those tests. Resource consumer might go a fixed
number of milli CPU seconds above target. Having higher requests makes
the test less sensitive.
- On scale down consume CPU in the middle between what would generate
recommendation of expexted size and 1 pod fewer (instead of righ on
edge beween expected and expected +1).
Some variables were int32 but always cast to int before use. Make them
int.
The service account authenticator isn't the only authenticator that
should respect API audience. The authentication config structure should
reflect that.
If dig fails to reach the DNS, it will output errors to stdout, which
can cause the test -n to falsely pass:
ubuntu@ubuntu:~$ dig +notcp +noall +answer +search kubernetes.default A
;; connection timed out; no servers could be reached
command terminated with exit code 9
ubuntu@ubuntu:~$ test -n "$(dig +notcp +noall +answer +search kubernetes.default A)" && echo OK
OK
This patch solves this issue by making sure the dig command actually succeeds
before checking its output.
The test "should run with the expected status" passes with and without
the set SELinuxOptions, but removing it will ensure that the test will
be able to run and pass on Windows nodes as well.
This change updates the ETCD storage test so that its data is
exported. Thus it can be used by other tests. The dry run test was
updated to consume this data instead of having a duplicate copy.
The code to start a master that can be used for "one of every
resource" style tests was also factored out. It is reused in the
dry run test as well.
This prevents these tests from drifting in the future and reduces
the long term maintenance burden.
Signed-off-by: Monis Khan <mkhan@redhat.com>
In some storage tests, kubelet is stopped first and the test check node
NotReady state. However, if it fails to have this state, kubelet could
not be restarted because the defer function is placed after the stop
kubelet command. This PR fixes this issue.
Make CreatePrivilegedPSPBinding reentrant so tests using it (e.g. DNS) can be
executed more than once against a cluster. Without this change, such tests will
fail because the PSP already exists, short circuiting test setup.
Not all users of the E2E framework want to run cloud-provider specific
tests. By splitting out the code it becomes possible to decide in
a E2E test suite which providers are supported.
This is achieved in two ways:
- the framework calls certain functions through a provider
interface instead of calling specific cloud provider functions
directly
- tests that are cloud-provider specific directly import the
new provider packages
The ingress test utilities are only needed by a few tests. Splitting
them out into a separate package makes the framework simpler for test
suites not using those tests.
Fixes: #66649
When facing an issue which is due to lack of PV/PVC Protection
finalizer on the e2e tests, the error message is just like
Expected
<bool>: false
to be true
Actually we cannot understand what happened during the tests.
This adds more debugging info on the tests.
Some commands used in tests are Linux specific and do not exist
or do not behave the same on Windows nodes. This can cause those
tests to fail on Windows nodes.
Replaces the mentioned commands with ones that behave the same on
both Linux and Windows.
The test "Should be able to deny attaching pod" can randomly fail
because it never waits for the pod to enter a "Running" state. Because
of this, the "kubectl attach" command can potentially fail, if the pod is
not in the correct state.
Additionally, if the "kubectl attach" actually manages to attach, then the
test will hang. The command should be executed with a timeout.
After the call to ioutil.TempDir, the directory has already been
created, and MkdirAll therefore can't do anything. The mode argument
in particular is misleading.
Tests settings should be defined in the test source code itself
because conceptually the framework is a separate entity that not all
test authors can modify.
For the sake of backwards compatibility the name of the command line
flags are not changed.
Tests settings should be defined in the test source code itself
because conceptually the framework is a separate entity that not all
test authors can modify.
Using the new framework/config code also has several advantages:
- defaults can be set with less code
- no confusion around what's a duration
- the options can also be set via command line flags
While at it, a minor bug gets fixed:
- readConfig() returns only defaults when called while
registering Ginkgo tests because Viperize() gets called later,
so the scale in the logging soak test couldn't really be configured;
now the value is read when the test runs and thus can be changed
The options get moved into the "instrumentation.logging"
resp. "instrumentation.monitoring" group to make it more obvious where
they are used. This is a breaking change, but that was already
necessary to improve the duration setting from plain integer to a
proper time duration.
Tests shouldn't have to use the central context for their settings,
because conceptually tests and framework get developed independently.
This does not yet use the new framework/config utility code because
that code still needs to be reviewed.
Besides moving the flags, they also get renamed from the top-level
"--csiImage{Version|Registry}" to
"--storage.csi.image.{version|registry}". These flags were introduced
fairly recently and shouldn't be in use much, so now is a good time to
introduce a hierarchical naming for storage flags, in particular
because more flags will be added soon.
Storing settings in the framework's TestContext is not something that
out-of-tree test authors can do because for them the framework is a
read-only upstream component. Conceptually the same is true for
in-tree tests, so the recommended approach is to define configuration
settings in the code that uses them.
How to do that is a bit uncertain. Viper has several
drawbacks (maintenance status uncertain, cannot list supported
options, cannot validate the configuration file). How to handle
configuration files is currently getting discussed for kubeadm, with
similar concerns about
Viper (https://github.com/kubernetes/kubeadm/issues/1040).
Instead of making a choice now for E2E, the recommendation is that
test authors continue to define command line flags as before, except
that they should do it in their own code and with better flag names.
But the ability to read options also from a file is useful, so
several enhancements get added:
- all settings defined via flags can also be read from a
configuration file, without extra work for test authors
- framework/config makes it possible to populate a struct directly
and define flags with a single function call
- a path and file suffix can be given to --viper-config (as in
"--viper-config /tmp/e2e.json") instead of expecting the file in
the current directory; as before, just plain "--viper-config e2e"
still works
- if "--viper-config" is set, the file must exist; otherwise the
"e2e" config is optional (as before)
- errors from Viper are no longer silently ignored, so syntax errors
are detected early
- Viper support is optional: test suite authors who don't want
it are not forced to use it by the e2e/framework
Individual implementations are not yet being moved.
Fixed all dependencies which call the interface.
Fixed golint exceptions to reflect the move.
Added project info as per @dims and
https://github.com/kubernetes/kubernetes-template-project.
Added dims to the security contacts.
Fixed minor issues.
Added missing template files.
Copied ControllerClientBuilder interface to cp.
This allows us to break the only dependency on K8s/K8s.
Added TODO to ControllerClientBuilder.
Fixed GoDeps.
Factored in feedback from JustinSB.
Some e2e tests are skipped by depending on Linux distribution of
master and node, and the options can be one of "debian", "ubuntu",
"gci" or "custom". This updates the help message of the options.
The words "error" and "fail" are magic in test output, and are
highlighted in the build logs. Change some test strings to avoid
hitting the highlighting in normal operation.
The new test/e2e/framework/testfiles package makes it possible to
write tests that do not depend on a specific way of providing
additional test files at runtime. Such tests and the framework are
then more easily reused in other test suites.
In the test/e2e suite file access is enabled based on the existing
"repo-root" command line parameter and the built-in bindata. Tests
using the new API will first check for files under "repo-root" and
then fall back to the builtin data. This way, users of a test binary
can modify those files without having to rebuild the binary.
"repo-root" is still needed because at least some tests check for
additional files (secret.yaml, via ingress_utils.go) that are not part
of the upstream source code and thus may or may not be built into a
test binary.
Tests using bindata or repo-root directly get modified to use the new
API, or removed when they are obsolete: test/e2e/examples.go depended
on files that were removed in
https://github.com/kubernetes/kubernetes/pull/61246 and thus can no
longer be run in Kubernetes. Moving the tests to kubernetes/examples
is tracked in https://github.com/kubernetes/examples/issues/214.
The file removal did not break the automated E2E testing probably
because the tests are under the Feature:Example tag and thus not
enabled during normal CI runs.
Removing also the obsolete tests makes it simpler to rework the
"repo-root" setting because less code uses it.
Related-to: #66649 and #23987
New command is now `kubectl diff` rather than `kubectl alpha diff` since
it's moving out of alpha soon, and will be using dry-run apply to
produce the diff rather than the custom merge logic.
This change makes the schema structs consistently use non-pointer
receivers. This makes it easier to call these methods since these
structs are used as values instead of pointers.
Signed-off-by: Monis Khan <mkhan@redhat.com>
Bootstrap initializes the necessary vSphere objects before the tests are
run. A call to Bootstrap was missing in persistent_volumes-vsphere.go's
BeforeEach. This results in Panic while running e2e tests for 'vsphere'
provider with a stack trace like this:
/usr/local/go/src/runtime/panic.go:502 +0x229
github.com/docker/kube-e2e-image/vendor/k8s.io/kubernetes/test/e2e/storage/vsphere.glob..func1.1()
/go/src/github.com/docker/kube-e2e-image/vendor/k8s.io/kubernetes/test/e2e/storage/vsphere/persistent_volumes-vsphere.go:77
+0xa21
github.com/docker/kube-e2e-image/vendor/github.com/onsi/ginkgo/internal/leafnodes.(*runner).runSync(0xc4217c9b60,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/docker/kube-e2e-image/e2e_test.go:88 +0x2c8
testing.tRunner(0xc4206e01e0, 0x4212900)
/usr/local/go/src/testing/testing.go:777 +0xd0
created by testing.(*T).Run
/usr/local/go/src/testing/testing.go:824 +0x2e0
This change fixes the Panic by calling Bootstrap.
Testing:
After this change, tests with FOCUS set to "PersistentVolumes:vsphere"
dont Panic. They pass as expected.
Signed-off-by: Anusha Ragunathan <anusha.ragunathan@docker.com>
Normally the pod would get created via a DaemonSet controller, but
during testing it is easier to create it directly. We just need to
ignore errors (like 'No API token found for service account
"csi-service-account"') and retry for a while. If the error persists,
the error will still abort and report it eventually.
This problem also occurs elsewhere, so an utility function in the
framework for it seems justified.
Fixes: #68776