Add ADR for etcd snapshot CRD migration

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Brad Davidson 2023-07-28 00:19:34 +00:00 committed by Brad Davidson
parent 9bb1ce1253
commit 22065affa2
1 changed files with 53 additions and 0 deletions

@@ -0,0 +1,53 @@
# Store etcd snapshot metadata in a Custom Resource
Date: 2023-07-27
## Status
Accepted
## Context
K3s currently stores a list of etcd snapshots and associated metadata in a ConfigMap. Other downstream
projects and controllers consume the content of this ConfigMap to present cluster administrators with
a list of snapshots that can be restored.
On clusters with more than a handful of nodes, and reasonable snapshot intervals and retention periods, the snapshot
list ConfigMap frequently reaches the maximum size allowed by Kubernetes, and fails to store any additional information.
The snapshots are still created, but they cannot be discovered by users or accessed by tools that consume information
from the ConfigMap.
When this occurs, the K3s service log shows errors such as:
```
level=error msg="failed to save local snapshot data to configmap: ConfigMap \"k3s-etcd-snapshots\" is invalid: []: Too long: must have at most 1048576 bytes"
```
References:
* https://github.com/rancher/rke2/issues/4495
* https://github.com/k3s-io/k3s/blob/36645e7311e9bdbbf2adb79ecd8bd68556bc86f6/pkg/etcd/etcd.go#L1503-L1516
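
The size limit in the error above can be turned into a rough capacity estimate. The following is a minimal sketch, not K3s code: the `snapshotEntry` type, its field names, and the sample values are hypothetical stand-ins for the real metadata, and it assumes the limit applies to the total bytes of the serialized values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// maxConfigMapBytes is the limit quoted in the K3s error message.
const maxConfigMapBytes = 1048576

// snapshotEntry is a hypothetical stand-in for per-snapshot metadata.
// The field names are illustrative, not the real K3s snapshotFile type.
type snapshotEntry struct {
	Name      string `json:"name"`
	Location  string `json:"location"`
	NodeName  string `json:"nodeName"`
	CreatedAt string `json:"createdAt"`
	Size      int64  `json:"size"`
}

// valueBytes returns the size of the JSON value stored for one entry.
// Assumption: only value bytes count toward the limit.
func valueBytes(e snapshotEntry) (int, error) {
	v, err := json.Marshal(e)
	if err != nil {
		return 0, err
	}
	return len(v), nil
}

// maxEntries reports how many entries of perEntry bytes fit under the limit.
func maxEntries(perEntry int) int {
	if perEntry <= 0 {
		return 0
	}
	return maxConfigMapBytes / perEntry
}

func main() {
	// Sample values are made up for illustration.
	e := snapshotEntry{
		Name:      "etcd-snapshot-node-1-1690416000",
		Location:  "file:///var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-node-1-1690416000",
		NodeName:  "node-1",
		CreatedAt: "2023-07-27T00:00:00Z",
		Size:      4096000,
	}
	n, err := valueBytes(e)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Printf("one entry: %d bytes, at most %d entries fit\n", n, maxEntries(n))
}
```

Because every node's snapshots share the one ConfigMap, the total entry count grows with both node count and retention, so the ceiling is shared across the whole cluster rather than per node.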
### Existing Work
Rancher already has a `rke.cattle.io/v1 ETCDSnapshot` Custom Resource that contains the same information after it's been
imported by the management cluster:
* https://github.com/rancher/rancher/blob/027246f77f03b82660dc2e91df6bf2cd549163f0/pkg/apis/rke.cattle.io/v1/etcd.go#L48-L74
It is unlikely that we would want to use this custom resource in its current package; we may be able to negotiate moving
it into a neutral project for use by both projects.
## Decision
1. Instead of populating snapshots into a ConfigMap using the JSON serialization of the private `snapshotFile` type, K3s
will manage creation of a new Custom Resource Definition with similar fields.
2. Metadata on each snapshot will be stored in a distinct Custom Resource.
3. The new Custom Resource will be cluster-scoped, as etcd and its snapshots are a cluster-level resource.
4. Downstream consumers of etcd snapshot lists will migrate to watching the Custom Resource, instead of the ConfigMap.
5. K3s will observe a transition period of three minor versions, during which both the new Custom Resource and the
existing ConfigMap will be used.
6. During the transition period, older snapshot metadata may be removed from the ConfigMap while those snapshots still
exist and are referenced by new Custom Resources, if the ConfigMap exceeds a preset size or key count limit.
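
The pruning rule in item 6 could look roughly like the sketch below. It is a sketch under stated assumptions, not the K3s implementation: `SnapshotRecord`, `pruneLegacy`, `sampleRecords`, and the key count limit are all hypothetical names, and a real Custom Resource type would carry the standard Kubernetes object metadata rather than these bare fields. Only entries that are also tracked by a record are removed from the legacy map, so no snapshot becomes undiscoverable.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// SnapshotRecord is a hypothetical, simplified stand-in for the
// per-snapshot Custom Resource. It is cluster-scoped, so it has no
// namespace field.
type SnapshotRecord struct {
	Name      string
	NodeName  string
	CreatedAt time.Time
}

// pruneLegacy deletes the oldest entries from the legacy ConfigMap data
// until at most maxKeys remain. Only entries that also have a record are
// eligible for removal. It returns the removed names, oldest first.
func pruneLegacy(legacy map[string]string, records []SnapshotRecord, maxKeys int) []string {
	// Collect the records that still have a legacy entry.
	present := make([]SnapshotRecord, 0, len(records))
	for _, r := range records {
		if _, ok := legacy[r.Name]; ok {
			present = append(present, r)
		}
	}
	// Oldest snapshots are pruned first.
	sort.Slice(present, func(i, j int) bool {
		return present[i].CreatedAt.Before(present[j].CreatedAt)
	})
	var removed []string
	for _, r := range present {
		if len(legacy) <= maxKeys {
			break
		}
		delete(legacy, r.Name)
		removed = append(removed, r.Name)
	}
	return removed
}

// sampleRecords returns made-up records used by the demonstration.
func sampleRecords() []SnapshotRecord {
	base := time.Date(2023, 7, 27, 0, 0, 0, 0, time.UTC)
	return []SnapshotRecord{
		{Name: "snap-c", NodeName: "node-1", CreatedAt: base.Add(2 * time.Hour)},
		{Name: "snap-a", NodeName: "node-1", CreatedAt: base},
		{Name: "snap-b", NodeName: "node-2", CreatedAt: base.Add(time.Hour)},
	}
}

func main() {
	legacy := map[string]string{"snap-a": "{}", "snap-b": "{}", "snap-c": "{}"}
	removed := pruneLegacy(legacy, sampleRecords(), 2)
	fmt.Println("removed from ConfigMap:", removed)
	fmt.Println("keys left in ConfigMap:", len(legacy))
}
```

A size-based limit could follow the same pattern by summing value lengths instead of counting keys.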
## Consequences
* Snapshot metadata will no longer be lost when the number of snapshots exceeds what can be stored in the ConfigMap.
* Managing the new Custom Resource, and working with other projects to migrate to it, will add some complexity.