Automatic merge from submit-queue (batch tested with PRs 35300, 36709, 37643, 37813, 37697)
[etcd] test cleanup: remove unnecessary AddPrefix()
What?
Remove etcdtest.AddPrefix() in tests. They will be automatically prepended in etcd storage.
Why?
ref: #36290#36374
After the change, it will double prepend.
Automatic merge from submit-queue
Reuse fields and labels
This should significantly reduce memory allocations in apiserver in large cluster.
Explanation:
- every kubelet is refreshing watch every 5-10 minutes (this generally is not causing relist - it just renews watch)
- that means, in 5000-node cluster, we are issuing ~10 watches per second
- since we don't have "watch heartbets", the watch is issued from previously received resourceVersion
- to make some assumption, let's assume pods are evenly spread across pods, and writes for them are evenly spread - that means, that a given kubelet is interested in 1 per 5000 pod changes
- with that assumption, each watch, has to process 2500 (on average) previous watch events
- for each of such even, we are currently computing fields.
This PR is fixing this problem.
Automatic merge from submit-queue
Make etcd cache size configurable
Instead of the prior 50K limit, allow users to specify a more sensible size for their cluster.
I'm not sure what a sensible default is here. I'm still experimenting on my own clusters. 50 gives me a 270MB max footprint. 50K caused my apiserver to run out of memory as it exceeded >2GB. I believe that number is far too large for most people's use cases.
There are some other fundamental issues that I'm not addressing here:
- Old etcd items are cached and potentially never removed (it stores using modifiedIndex, and doesn't remove the old object when it gets updated)
- Cache isn't LRU, so there's no guarantee the cache remains hot. This makes its performance difficult to predict. More of an issue with a smaller cache size.
- 1.2 etcd entries seem to have a larger memory footprint (I never had an issue in 1.1, even though this cache existed there). I suspect that's due to image lists on the node status.
This is provided as a fix for #23323