Commit Graph

1668 Commits (1a01154162341dfff89f9526324b30f1cfccec64)

Author SHA1 Message Date
Kubernetes Prow Robot 2d20b57594
Merge pull request #77832 from anfernee/release-1.14
Bump ip-masq-agent version to v2.3.0
2019-06-20 17:40:36 -07:00
Yongkun Gui d55560f29a Bump ip-masq-agent version to v2.3.0 2019-05-13 12:04:31 -07:00
Steve Coffman 68cff866d1 Update k8s-dns-node-cache image version
This revised image resolves kubernetes dns#292 by updating the image from `k8s-dns-node-cache:1.15.2` to `k8s-dns-node-cache:1.15.2`
2019-05-13 11:55:57 -07:00
Marek Siarkowicz 65b0557d31 Pick up security patches for fluentd-gcp-scaler by upgrading to version 0.5.2 2019-05-02 16:08:23 +02:00
Marek Siarkowicz a98d10ba0b Restore metrics-server using of IP addresses
This preference list matches is used to pick prefered field from k8s
node object. It was introduced in metrics-server 0.3 and changed default
behaviour to use DNS instead of IP addresses. It was merged into k8s
1.12 and caused breaking change by introducing dependency on DNS
configuration.
2019-04-19 16:55:38 +02:00
Zhen Wang 412a892706 Use Node-Problem-Detector v0.6.3 on GCI 2019-04-08 13:16:28 -07:00
Marek Siarkowicz 93388c9fc2 Update gcp images with security patches
[stackdriver addon] Bump prometheus-to-sd to v0.5.0 to pick up security fixes.
[fluentd-gcp addon] Bump fluentd-gcp-scaler to v0.5.1 to pick up security fixes.
[fluentd-gcp addon] Bump event-exporter to v0.2.4 to pick up security fixes.
[fluentd-gcp addon] Bump prometheus-to-sd to v0.5.0 to pick up security fixes.
[metatada-proxy addon] Bump prometheus-to-sd v0.5.0 to pick up security fixes.
2019-03-29 13:31:44 +01:00
Kubernetes Prow Robot d778b9308a
Merge pull request #75063 from wangzhen127/npd-test-fix
Fix NPD e2e test on Ubuntu node and update NPD container version
2019-03-08 14:19:09 -08:00
Tim Allclair 63f61a6714 Migrate RuntimeClass to internal API 2019-03-07 11:07:54 -08:00
Zhen Wang f4d9e7d992 Fix NPD e2e test on Ubuntu node and update NPD container version 2019-03-06 22:42:47 -08:00
Kubernetes Prow Robot 45e5f6053b
Merge pull request #74424 from liggitt/drop-k8s-io-node-labels
Clean up self-set node labels
2019-03-06 08:24:26 -08:00
Kubernetes Prow Robot 95cd1d59e4
Merge pull request #74209 from monotek/fluentd-helm-readme
added production note about EFK stack to the readme
2019-03-04 17:55:12 -08:00
Kubernetes Prow Robot ccf33be0cc
Merge pull request #73940 from jiayingz/nvidia-dp-update
Update nvidia-gpu-device-plugin addon.
2019-02-27 17:13:01 -08:00
Kubernetes Prow Robot 1942c1ccb0
Merge pull request #71251 from monotek/kibana
updated kibana to 6.6.1
2019-02-26 23:40:33 -08:00
Kubernetes Prow Robot 7a4496429d
Merge pull request #71252 from monotek/elasticsearch
updated elasticsearch to 6.6.1
2019-02-26 09:33:44 -08:00
Jordan Liggitt 0174e043c5 Prepare switch from beta.kubernetes.io/masq-agent-ds-ready to node.kubernetes.io/masq-agent-ds-ready 2019-02-26 11:43:10 -05:00
Jordan Liggitt 943b32a289 Prepare switch from beta.kubernetes.io/kube-proxy-ds-ready to node.kubernetes.io/kube-proxy-ds-ready 2019-02-26 11:42:23 -05:00
Jordan Liggitt d6664a2365 Prepare switch from beta.kubernetes.io/metadata-proxy-ready to cloud.google.com/metadata-proxy-ready 2019-02-26 11:42:23 -05:00
Jordan Liggitt 8975233788 Finish migration of fluentd to daemonset 2019-02-26 11:42:23 -05:00
André Bauer 9e2d9cfbb0 changed es image repo
Signed-off-by: André Bauer <monotek23@gmail.com>
2019-02-26 09:09:21 +01:00
Florent Delannoy e627474e8f Fix fluentd-gcp addon liveness probe
Fix three issues with the fluentd-gcp liveness probe:

h1. STUCK_THRESHOLD_SECONDS was overridden by LIVENESS_THRESHOLD_SECONDS
if defined

Probably a copy/paste issue introduced in edf1ffc074

h1. `[[` is [a bashism](https://stackoverflow.com/a/47576482), and will always failed when called with `/bin/sh`

Introduced by a844523c20

Given that we call the liveness probe with `/bin/sh`, we cannot use the
double-bracketed `[[` syntax for test, as it is not POSIX-compliant and
will throw an error.

Annoyingly, even through it prints an error, `sh` returns with exit code 0
in this case:

```bash
root@fluentd-7mprs:/# sh liveness.sh
liveness.sh: 8: liveness.sh: [[: not found
liveness.sh: 15: liveness.sh: [[: not found
root@fluentd-7mprs:/# echo $?
0
```

Which means the liveness probe is considered successful by Kubernetes,
despite failing to test things as it was intended. This is also
probably the reason why this bug wasn't reported sooner :)

Thankfully, the test in this case can just as easily be written as
POSIX-compliant as it doesn't use any bash-specific features within the
`[[` block.

h1. Buffers are transient and cannot be relied upon for monitoring

Finally, after fixing the above issue, we started seeing the fluentd
containers being restarted very often, and found an issue with the
underlying logic of the liveness probe.

The probe checks that the pod is still alive by running the following
command:

`find /var/log/fluentd-buffers -type f -newer /tmp/marker-stuck -print -quit`

This checks if any _regular_ file exists under `/var/log/fluentd-buffers`
that is more recent than a predetermined time, and will return an empty
string otherwise.

The issue is that these buffers are temporary and volatile, they get created and
deleted constantly. Here is an example of running that check every second on a
running fluentd:

```
root@fluentd-eks-playground-jdc8m:/# LIVENESS_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-300};
root@fluentd-eks-playground-jdc8m:/# STUCK_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-900};
root@fluentd-eks-playground-jdc8m:/# touch -d "${STUCK_THRESHOLD_SECONDS} seconds ago" /tmp/marker-stuck;
root@fluentd-eks-playground-jdc8m:/# touch -d "${LIVENESS_THRESHOLD_SECONDS} seconds ago" /tmp/marker-liveness;
root@fluentd-eks-playground-jdc8m:/# while true; do date ; find /var/log/fluentd-buffers -type f -newer /tmp/marker-stuck -print -quit ; sleep 1 ; done
Fri Feb 22 10:52:57 UTC 2019
Fri Feb 22 10:52:58 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964ccf4c7004103c3fa7c8533f85.log
Fri Feb 22 10:52:59 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964ccf4c7004103c3fa7c8533f85.log
Fri Feb 22 10:53:00 UTC 2019
Fri Feb 22 10:53:01 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964fb8b2eedcccd2763ea7775cc2.log
Fri Feb 22 10:53:02 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964fb8b2eedcccd2763ea7775cc2.log
Fri Feb 22 10:53:03 UTC 2019
Fri Feb 22 10:53:04 UTC 2019
Fri Feb 22 10:53:05 UTC 2019
Fri Feb 22 10:53:06 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827965564883997b673d703af54848b.log
Fri Feb 22 10:53:07 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827965564883997b673d703af54848b.log
Fri Feb 22 10:53:08 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827965564883997b673d703af54848b.log
Fri Feb 22 10:53:09 UTC 2019
Fri Feb 22 10:53:10 UTC 2019
Fri Feb 22 10:53:11 UTC 2019
Fri Feb 22 10:53:12 UTC 2019
Fri Feb 22 10:53:13 UTC 2019
Fri Feb 22 10:53:14 UTC 2019
Fri Feb 22 10:53:15 UTC 2019
Fri Feb 22 10:53:16 UTC 2019
```

We can see buffers being created, then disappearing. The LivenessProbe running
under these conditions has a ~50% chance of failing, despite fluentd being
perfectly happy.

I believe that check is probably ok for fluentd installs using large
amounts of buffers, in which case the liveness probe will be correct more
often than not, but fluentd installs that use buffering less intensively
will be negatively impacted by this.

My solution to fix this is to check the last updated time of buffering
_folders_ within `/var/log/fluentd_buffers`. These _do_ get updated when
buffers are created, and do not get deleted as buffers are emptied,
making them the perfect candidate for our use.

Here's an example with the `-d` flag for directories:
```
root@fluentd-eks-playground-jdc8m:/# while true; do date ; find /var/log/fluentd-buffers -type d -newer /tmp/marker-stuck -print -quit ; sleep 1 ; done
Fri Feb 22 10:57:51 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:52 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:53 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:54 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:55 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:56 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:57 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:58 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:59 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:00 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:01 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:02 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:03 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
```

And example of the directory being updated as new buffers come in:
```
root@fluentd-eks-playground-jdc8m:/# ls -lah /var/log/fluentd-buffers/kubernetes.system.buffer
total 0
drwxr-xr-x 2 root root  6 Feb 22 11:17 .
drwxr-xr-x 3 root root 38 Feb 22 11:14 ..
root@fluentd-eks-playground-jdc8m:/# ls -lah /var/log/fluentd-buffers/kubernetes.system.buffer
total 16K
drwxr-xr-x 2 root root  224 Feb 22 11:18 .
drwxr-xr-x 3 root root   38 Feb 22 11:14 ..
-rw-r--r-- 1 root root 1.8K Feb 22 11:18 buffer.b58279be6e21e8b29fc333a7d50096ed0.log
-rw-r--r-- 1 root root  215 Feb 22 11:18 buffer.b58279be6e21e8b29fc333a7d50096ed0.log.meta
-rw-r--r-- 1 root root  429 Feb 22 11:18 buffer.b58279be6f09bdfe047a96486a525ece2.log
-rw-r--r-- 1 root root  195 Feb 22 11:18 buffer.b58279be6f09bdfe047a96486a525ece2.log.meta
root@fluentd-eks-playground-jdc8m:/# ls -lah /var/log/fluentd-buffers/kubernetes.system.buffer
total 0
drwxr-xr-x 2 root root  6 Feb 22 11:18 .
drwxr-xr-x 3 root root 38 Feb 22 11:14 ..
```
2019-02-25 11:48:31 +00:00
André Bauer 2bd6d3dc12 use image version 6.6.1
Signed-off-by: André Bauer <monotek23@gmail.com>
2019-02-25 11:05:52 +01:00
André Bauer 2d15ffc9cc updated to 6.5.2
Signed-off-by: André Bauer <monotek23@gmail.com>
2019-02-25 10:56:50 +01:00
André Bauer 0c29ea1a2e Update es-statefulset.yaml 2019-02-25 10:55:23 +01:00
André Bauer 53a936c359 Update Makefile 2019-02-25 10:55:23 +01:00
André Bauer 0e44fa6359 updated elasticsearch to 6.5.0 2019-02-25 10:55:23 +01:00
André Bauer fc850b5ecd fixed wording
Signed-off-by: André Bauer <monotek23@gmail.com>
2019-02-25 10:49:43 +01:00
André Bauer 421fcd8262 added prodution note to readme
Signed-off-by: André Bauer <monotek23@gmail.com>
2019-02-25 10:47:26 +01:00
Xiang Dai 36065c6dd7 delete all duplicate empty blanks
Signed-off-by: Xiang Dai <764524258@qq.com>
2019-02-23 10:28:04 +08:00
Kubernetes Prow Robot 743f864310
Merge pull request #73819 from coffeepac/move-fluentd-es-images
Move fluentd es images
2019-02-22 17:58:12 -08:00
Patrick Christopher 1bd45ba6eb review updates 2019-02-22 10:00:10 -08:00
Kubernetes Prow Robot 125dc6c8ea
Merge pull request #74187 from xichengliudui/fixgolint0218
Fix shellcheck lint errors in cluster/addons/fluentd-elasticsearch/fl……uentd-es-image/run.sh
2019-02-21 20:51:13 -08:00
Kubernetes Prow Robot 042f9ed3af
Merge pull request #74093 from blakebarnett/lower-neg-cache-ttl
Lowers the default nodelocaldns denial cache TTL
2019-02-21 17:47:16 -08:00
Blake 46c299c1b1 Match default cache size of 10000
https://github.com/coredns/coredns/blob/master/plugin/cache/cache.go#L236
This gets rounded down to the nearest multiple of 256: 9984
2019-02-21 15:03:30 -08:00
xichengliudui 053332ad46 Fix shellcheck lint errors in cluster/addons/fluentd-elasticsearch/fluentd-es-image/run.sh
update pull request

update pull request

update pull request

update pull request

update pull request
2019-02-21 02:00:48 -05:00
Kubernetes Prow Robot 7b203c6809
Merge pull request #74137 from rajansandeep/readinessprobe
Add readinessProbe to CoreDNS
2019-02-19 16:24:04 -08:00
Kubernetes Prow Robot cbf45eea13
Merge pull request #74138 from rramkumar1/ingress-docs-fix
Update docs for Ingress-GCE related cluster addon
2019-02-19 15:05:50 -08:00
Sandeep Rajan 37c3d68a91 Add readinessProbe 2019-02-19 10:14:12 -05:00
Rohit Ramkumar a50752ceb7 Update docs for Ingress-GCE related cluster addon 2019-02-18 13:17:01 -08:00
André Bauer d82d5fda35 updated kibana to 6.6.0
Signed-off-by: André Bauer <monotek23@gmail.com>
2019-02-18 11:00:02 +01:00
André Bauer fa859e4644 Merge branch 'master' into kibana 2019-02-18 10:58:49 +01:00
Kubernetes Prow Robot a22763b24e
Merge pull request #74063 from huynq0911/fix_wrong_format_yaml_influxdb
Fix incorrect influxdb yaml file
2019-02-15 16:46:18 -08:00
Ben Moss 34ac4d9ee9 Update deprecated links 2019-02-15 09:13:07 -05:00
Nguyen Quang Huy ac8466444c Fix incorrect influxdb yamle file
Remove redundant attribute in container declaration
2019-02-14 14:26:05 +07:00
Blake e51c9025ac Lowers the default nodelocaldns denial cache TTL
Similar to `--no-negcache` on dnsmasq, this prevents issues which poll DNS for orchestration such as operators with StatefulSets. It can also be very confusing for users when negative caching results in a change they just made seeming to be broken until the cache expires. This assumes that 5 seconds is reasonable and will still catch repeated AAAA negative responses. We could also set the denial cache size to zero which should effectively fully disable it like dnsmasq in kube-dns but testing shows this approach seems to work well in our (albeit small) test clusters.
2019-02-13 13:23:53 -08:00
Jeff Grafton e216995ef1 Update repo-infra, bazel-skylib, rules_docker, and rules_go dependencies
Also require bazel 0.18.0+
2019-02-12 17:55:10 -08:00
Kubernetes Prow Robot aa00afe231
Merge pull request #73649 from ojmhetar/coredns-priorityclass
Add priority class to CoreDNS pods
2019-02-11 22:55:45 -08:00
Jiaying Zhang 52e92ab4b9 Update nvidia-gpu-device-plugin addon.
This includes changes from GoogleCloudPlatform/container-engine-accelerators#102
2019-02-11 15:52:33 -08:00
Kubernetes Prow Robot b50c643be0
Merge pull request #73540 from rlenferink/patch-5
Updated OWNERS files to include link to docs
2019-02-08 09:05:56 -08:00
patc 0e219f4caa boilerplate fix 2019-02-07 21:12:46 -08:00