Commit Graph

172 Commits (8262c02cdda5d394e5967c885a830c29b5f96490)

Author SHA1 Message Date
Brad Davidson 8262c02cdd Fix issue caused by sole server marked as failed under load
If health checks are failing for all servers, make a second pass through the server list with health-checks ignored before returning failure

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit ca39614d4e)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-05-31 09:16:55 -07:00
Brad Davidson 2e7b394713 Fix netpol crash when node remains tained unintialized
It is concievable that users might take more than 60 seconds to deploy their own cloud-provider. Instead of exiting, we should wait forever, but with more logging to indicate what's being waited on.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit ed23a2bb48)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-05-31 09:16:55 -07:00
Brad Davidson 0a728b8ff9 Convert remaining http handlers over to use util.SendError
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit f8e0648304)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-05-31 09:16:55 -07:00
Brad Davidson 2b63eb4a27 Fix issue with k3s-etcd informers not starting
Start shared informer caches when k3s-etcd controller wins leader election. Previously, these were only started when the main k3s apiserver controller won an election. If the leaders ended up going to different nodes, some informers wouldn't be started

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
(cherry picked from commit 3d14092f76)
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-05-31 09:16:55 -07:00
Brad Davidson 94e29e2ef5 Make /db/info available anonymously from localhost
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-04-22 19:34:43 -07:00
Brad Davidson 5b431ca531 Fix on-demand snapshots not honoring folder
Also fix etcd s3 tests to actually check that the files are saved to s3 🙃

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-04-19 23:26:51 -07:00
Brad Davidson 7d9abc9f07 Improve etcd load-balancer startup behavior
Prefer the address of the etcd member being joined, and seed the full address list immediately on startup.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-04-09 15:36:33 -07:00
Brad Davidson fe465cc832 Move etcd snapshot management CLI to request/response
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-04-09 15:21:26 -07:00
Derek Nola 14f54d0b26
Transition from deprecated pointer library to ptr (#9801)
Signed-off-by: Derek Nola <derek.nola@suse.com>
2024-03-28 10:07:02 -07:00
Vitor Savian 5d69d6e782 Add tls for kine
Signed-off-by: Vitor Savian <vitor.savian@suse.com>

Bump kine

Signed-off-by: Vitor Savian <vitor.savian@suse.com>

Add integration tests for kine with tls

Signed-off-by: Vitor Savian <vitor.savian@suse.com>
2024-03-28 11:12:07 -03:00
Brad Davidson c51d7bfbd1 Add health-check support to loadbalancer
* Adds support for health-checking loadbalancer servers. If a
  health-check fails when dialing, all existing connections to the
  server will be closed.
* Wires up a remotedialer tunnel connectivity check as the health check
  for supervisor/apiserver connections.
* Wires up a simple ping request to the supervisor port as the health
  check for etcd connections.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-03-27 16:50:27 -07:00
Brad Davidson edb0440017 Fix etcd snapshot reconcile for agentless nodes
Disable cleanup of orphaned snapshots and patching of node annotations if running agentless

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-03-27 16:44:36 -07:00
Brad Davidson d7cdbb7d4d Send error response if member list cannot be retrieved
Prevents joining nodes from being stuck with bad initial member list if there is a transient failure, or if they try to join themselves

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-03-26 15:17:15 -07:00
Brad Davidson 3576ed4327 Clean up snapshotDir create/exists logic
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-03-04 12:09:29 -08:00
Brad Davidson 82432a2df7 Fix issue with etcd node name missing hostname
* Set ServerNodeName in snapshot CLI setup
* Raise errer if ServerNodeName ends up empty some other way
* Fix status controller to use etcd node name annotation instead of prefix checking

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-03-01 13:52:53 -08:00
Derek Nola fae41a8b2a Rename AgentReady to ContainerRuntimeReady for better clarity
Signed-off-by: Derek Nola <derek.nola@suse.com>
2024-02-21 12:21:19 -08:00
Brad Davidson de825845b2 Bump kine and set NotifyInterval to what the apiserver expects
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-02-09 14:22:38 -08:00
Vitor Savian f9ee66f4d8 Changed how lastHeartBeatTime works in the etcd condition
Signed-off-by: Vitor Savian <vitor.savian@suse.com>
2024-02-07 15:05:33 -03:00
Brad Davidson 8224a3a7f6 Fix ipv6 endpoint address selection for on-demand snapshots
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-02-06 18:02:36 -08:00
Brad Davidson 6ec1926f88 Add check for etcd-snapshot-dir and fix panic in Walk
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-02-06 17:47:33 -08:00
Brad Davidson 4005600d4e Fix excessive retry on snapshot reconcile
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-02-06 17:46:24 -08:00
Vitor Savian 9a70021a9e Error getting node in setEtcdStatusCondition
Signed-off-by: Vitor Savian <vitor.savian@suse.com>

Added retry and changed nodes for

Signed-off-by: Vitor Savian <vitor.savian@suse.com>
2024-01-11 22:06:36 -03:00
Vitor Savian 4a92ced8ee Handle etcd status condition when cluster reset and disable etcd
Signed-off-by: Vitor Savian <vitor.savian@suse.com>

Set condition if node is unhealthy

Signed-off-by: Vitor Savian <vitor.savian@suse.com>
2024-01-09 11:20:41 -03:00
Brad Davidson 319dca3e82 Fix nil map in full snapshot configmap reconcile
If a full reconcile wins the race against sync of an individual snapshot resource, or someone intentionally deletes the configmap, the data map could be nil and cause a crash.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-01-04 16:49:58 -08:00
Brad Davidson 1e663622d2 Fix the OTHER log message that prints the wrong variable
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2024-01-04 15:23:39 -08:00
Brad Davidson 6d3a92a658 Print key instead of file path in snapshot metadata log message
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-11-21 14:03:27 -08:00
Brad Davidson b23e70d519 Don't apply s3 retention if S3 client failed to initialize
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-11-21 14:03:27 -08:00
Brad Davidson a92c4a0f17 Don't request metadata when listing objects
While some implementations may support it, it appears that most don't,
and some may in fact return an error if it is requested.

We already stat the object to get the metadata anyway, so this was
unnecessary if harmless on most implementations.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-11-21 14:03:27 -08:00
Brad Davidson 1e0a7044cf Reorder snapshot configmap reconcile to reduce log spew during initial startup
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-11-17 10:09:01 -08:00
Vitor Savian e53c189587
Handle nil pointer when runtime core is not ready in etcd
Signed-off-by: Vitor <vitor.savian@suse.com>
2023-11-16 15:58:42 -08:00
Brad Davidson 2088218c5f Fix issue with snapshot metadata configmap
Omit snapshot list configmap entries for snapshots without extra metadata; reduce log level of warnings about missing s3 metadata files.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-11-15 14:25:28 -08:00
Vitor Savian c5cd7b3d65
Added etcd status condition
Signed-off-by: Vitor <vitor.savian@suse.com>
2023-11-13 06:39:24 -08:00
Brad Davidson 49411e7084 Don't try to read token hash and cluster id during cluster-reset
These fields are only necessary when saving snapshots to S3, and will block restoration if attempted

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-10-27 15:06:29 -07:00
Brad Davidson 5b6b9685e9 Manually requeue configmap reconcile when no nodes have reconciled snapshots
Silences error message from lasso - this is a normal startup condition
when no snapshots exist so we shouldn't log nasty looking errors.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-10-18 15:09:25 -07:00
Brad Davidson 3db1d33282 Re-enable etcd endpoint auto-sync
Removing this in 002e6c43ee regressed
control-plane-only nodes, as we rely on the etcd client to update its
endpoint list internally so that we can use it to sync the load-balancer
address list.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-10-18 08:33:03 -07:00
Brad Davidson 9597ea1183 Start etcd client before ensuring self removal
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-10-13 23:24:16 -07:00
Brad Davidson d885162967 Add server token hash to CR and S3
This required pulling the token hash stuff out of the cluster package, into util.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-10-12 15:04:45 -07:00
Brad Davidson 550ab36ab7 Switch to managing ETCDSnapshotFile resources
Reconcile snapshot CRs instead of ConfigMap; manage ConfigMap downstream from CR list

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-10-12 15:04:45 -07:00
Brad Davidson 5cd4f69bfa Move snapshot delete into local/s3 functions
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-10-12 15:04:45 -07:00
Brad Davidson 7464007037 Store extra metadata and cluster ID for snapshots
Write the extra metadata both locally and to S3. These files are placed such that they will not be used by older versions of K3s that do not make use of them.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-10-12 15:04:45 -07:00
Brad Davidson 80f909d0ca Move s3 snapshot list functionality to s3.go
Also, don't list ONLY s3 snapshots if S3 is enabled.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-10-12 15:04:45 -07:00
Brad Davidson 8d47645312 Consistently set snapshotFile timestamp
Attempt to use timestamp from creation or filename instead of file/object modification times

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-10-12 15:04:45 -07:00
Brad Davidson f1afe153a3 Tidy s3 upload functions
Consistently refer to object keys as such, simplify error handling.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-10-12 15:04:45 -07:00
Brad Davidson 2b0e2e8ada Elide old snapshot data when apiserver rejects configmap with ErrRequestEntityTooLarge
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-10-12 15:04:45 -07:00
Brad Davidson 676b00aa0e Move etcd snapshot code into separate file
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-10-12 15:04:45 -07:00
Brad Davidson 8705a88bf4 Clear remove annotations on cluster reset; refuse to delete last member from cluster
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-09-25 11:54:23 -07:00
Brad Davidson 002e6c43ee Reorganize Driver interface and etcd driver to avoid passing context and config into most calls
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-09-25 11:54:23 -07:00
Brad Davidson 890645924f Don't export functions not needed outside the etcd package
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-09-25 11:54:23 -07:00
Brad Davidson 8c73fd670b Disable HTTP on main etcd client port
Fixes performance issue under load, ref: https://github.com/etcd-io/etcd/issues/15402 and https://github.com/kubernetes/kubernetes/pull/118460

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
2023-09-25 08:29:57 -07:00
Vitor Savian e83b1ba4aa
Fixed the etcd retention to delete orphaned snapshots based on the date (#8177)
* Fix retention using name instead of date

Signed-off-by: Vitor <vitor.savian@suse.com>
2023-08-14 18:48:59 -03:00