* Set replication metric to 0 when losing leadership
* Fix replication metrics on replication.go also
---------
Co-authored-by: sarahalsmiller <100602640+sarahalsmiller@users.noreply.github.com>
* NET-5879 - move the filter for non-passing to occur in the health RPC layer rather than the callers of the RPC
* fix import of slices
* NET-5879 - expose sameness group param on service health endpoint and move sameness group health fallback logic into HealthService RPC layer
* fixing deepcopy
* fix license headers
This refactors and relocates the following packages to live under internal/gossip instead of either in the toplevel lib or agent/consul:
- librtt : related to serf coordinates
- libserf : random serf stuff
* Define file-system-certificate config entry
* Collect file-system-certificate(s) referenced by api-gateway onto snapshot
* Add file-system-certificate to config entry kind allow lists
* Remove inapplicable validation
This validation makes sense for inline certificates since Consul server is holding the certificate; however, for file system certificates, Consul server never actually sees the certificate.
* Support file-system-certificate as source for listener TLS certificate
* Add more required mappings for the new config entry type
* Construct proper TLS context based on certificate kind
* Add support or SDS in xdscommon
* Remove unused param
* Adds back verification of certs for inline-certificates
* Undo tangential changes to TLS config consumption
* Remove stray curly braces
* Undo some more tangential changes
* Improve function name for generating API gateway secrets
* Add changelog entry
* Update .changelog/20873.txt
Co-authored-by: Jared Kirschner <85913323+jkirschner-hashicorp@users.noreply.github.com>
* Add some nil-checking, remove outdated TODO
* Update test assertions to include file-system-certificate
* Add documentation for file-system-certificate config entry
Add new doc to nav
* Fix grammar mistake
* Rename watchmaps, remove outdated TODO
---------
Co-authored-by: Melisa Griffin <melisa.griffin@hashicorp.com>
Co-authored-by: Jared Kirschner <85913323+jkirschner-hashicorp@users.noreply.github.com>
This operation would previously fail due to unconsumed bytes in the
decoder buffer when reading the Ent snapshot (the first byte of the
record would be misinterpreted as a type indicator, and the remaining
bytes would fail to be deserialized or read as invalid data).
Ensure restore succeeds by decoding the ignored record as an
interface{}, which will consume the record bytes without requiring a
concrete target struct, then moving on to the next record.
* put conditionals are hcp initialization for consul server
* put more things behind configuration flags
* add changelog
* TestServer_hcpManager
* fix TestAgent_scadaProvider
Currently, when a client starts a blocking query and an ACL token expires within
that time, Consul will return ACL not found error with a 403 status code. However,
sometimes if an ACL token is invalidated at the same time as the query's deadline is reached,
Consul will instead return an empty response with a 200 status code.
This is because of the events being executed.
1. Client issues a blocking query request with timeout `t`.
2. ACL is deleted.
3. Server detects a change in ACLs and force closes the gRPC stream.
4. Client resubscribes with the same token and resets its state (view).
5. Client sees "ACL not found" error.
If ACL is deleted before step 4, the client is unaware that the stream was closed due to
an ACL error and will return an empty view (from the reset state) with the 200 status code.
To fix this problem, we introduce another state to the subsciption to indicate when a change
to ACLs has occured. If the server sees that there was an error due to ACL change, it will
re-authenticate the request and return an error if the token is no longer valid.
Fixes#20790
* disable terminating gateway auto host rewrite
* add changelog
* clean up unneeded additional snapshot fields
* add new field to docs
* squash
* fix test
Ensure all topics are refreshed on FSM restore and add supervisor loop to v1 controller subscriptions
This PR fixes two issues:
1. Not all streams were force closed whenever a snapshot restore happened. This means that anything consuming data from the stream (controllers, queries, etc) were unaware that the data they have is potentially stale / invalid. This first part ensures that all topics are purged.
2. The v1 controllers did not properly handle stream errors (which are likely to appear much more often due to 1 above) and so it introduces a supervisor thread to restart the watches when these errors occur.
* Add function to get update channel for watching HCP Link
* Add MonitorHCPLink function
This function can be called in a goroutine to manage the lifecycle
of the HCP manager.
* Update HCP Manager config in link monitor before starting
This updates HCPMonitorLink so it updates the HCP manager
with an HCP client and management token when a Link is upserted.
* Let MonitorHCPManager handle lifecycle instead of link controller
* Remove cleanup from Link controller and move it to MonitorHCPLink
Previously, the Link Controller was responsible for cleaning up the
HCP-related files on the file system. This change makes it so
MonitorHCPLink handles this cleanup. As a result, we are able to remove
the PlacementEachServer placement strategy for the Link controller
because it no longer needs to do this per-node cleanup.
* Remove HCP Manager dependency from Link Controller
The Link controller does not need to have HCP Manager
as a dependency anymore, so this removes that dependency
in order to simplify the design.
* Add Linked prefix to Linked status variables
This is in preparation for adding a new status type to the
Link resource.
* Add new "validated" status type to link resource
The link resource controller will now set a "validated" status
in addition to the "linked" status. This is needed so that other
components (eg the HCP manager) know when the Link is ready to link
with HCP.
* Fix tests
* Handle new 'EndOfSnapshot' WatchList event
* Fix watch test
* Remove unnecessary config from TestAgent_scadaProvider
Since the Scada provider is now started on agent startup
regardless of whether a cloud config is provided, this removes
the cloud config override from the relevant test.
This change is not exactly related to the changes from this PR,
but rather is something small and sort of related that was noticed
while working on this PR.
* Simplify link watch test and remove sleep from link watch
This updates the link watch test so that it uses more mocks
and does not require setting up the infrastructure for the HCP Link
controller.
This also removes the time.Sleep delay in the link watcher loop in favor
of an error counter. When we receive 10 consecutive errors, we shut down
the link watcher loop.
* Add better logging for link validation. Remove EndOfSnapshot test.
* Refactor link monitor test into a table test
* Add some clarifying comments to link monitor
* Simplify link watch test
* Test a bunch more errors cases in link monitor test
* Use exponential backoff instead of errorCounter in LinkWatch
* Move link watch and link monitor into a single goroutine called from server.go
* Refactor HCP link watcher to use single go-routine.
Previously, if the WatchClient errored, we would've never recovered
because we never retry to create the stream. With this change,
we have a single goroutine that runs for the life of the server agent
and if the WatchClient stream ever errors, we retry the creation
of the stream with an exponential backoff.
Decouple xds capacity controller and autopilot
This prevents a potential bug where autopilot deadlocks while attempting
to execute `AutopilotDelegate.NotifyState()` on an xdscapacity controller
that stopped consuming messages.
* Panic when controllers attempt to make invalid requests to the resource service
This will help to catch bugs in tests that could cause infinite errors to be emitted.
* Disable the API GW v2 controller
With the previous commit, this would cause a server to panic due to watching a type which has not yet been created/registered.
* Ensure that a test server gets the full type registry instead of constructing its own
* Skip TestServer_ControllerDependencies
* Fix peering tests so that they use the full resource registry.
Fix issue with persisting proxy-defaults
This resolves an issue introduced in hashicorp/consul#19829
where the proxy-defaults configuration entry with an HTTP protocol
cannot be updated after it has been persisted once and a router
exists. This occurs because the protocol field is not properly
pre-computed before being passed into validation functions.
* Trigger the v1 compat exported-services controller when the v1 config entry is modified.
* Hook up exported-services config entries to the event publisher.
* Add tests to the v2 exported services shim.
* Use the local materializer trigger updates on the v1 compat exported services controller when exported-services config entries are modified.
* stop sleeping when context is cancelled
* Add Stop method to telemetry provider
Stop the main loop of the provider and set the config
to disabled.
* Add interface for telemetry provider
Added for easier testing. Also renamed Run to Start, which better
fits with Stop.
* Add Stop method to HCP manager
* Add manager interface, rename implementation
Add interface for easier testing, rename existing Manager to HCPManager.
* Stop HCP manager in link Finalizer
* Attempt to cleanup if resource has been deleted
The link should be cleaned up by the finalizer, but there's an edge
case in a multi-server setup where the link is fully deleted on one
server before the other server reconciles. This will cover the case
where the reconcile happens after the resource is deleted.
* Add a delete mananagement token function
Passes a function to the HCP manager that deletes the management token
that was initially created by the manager.
* Delete token as part of stopping the manager
* Lock around disabling config, remove descriptions
* Check for ACL write permissions on write
Link eventually will be creating a token, so require acl:write.
* Convert Run to Start, only allow to start once
* Always initialize HCP components at startup
* Support for updating config and client
* Pass HCP manager to controller
* Start HCP manager in link resource
Start as part of link creation rather than always starting. Update
the HCP manager with values from the link before starting as well.
* Fix metrics sink leaked goroutine
* Remove the hardcoded disabled hostname prefix
The HCP metrics sink will always be enabled, so the length of sinks will
always be greater than zero. This also means that we will also always
default to prefixing metrics with the hostname, which is what our
documentation states is the expected behavior anyway.
* Add changelog
* Check and set running status in one method
* Check for primary datacenter, add back test
* Clarify merge reasoning, fix timing issue in test
* Add comment about controller placement
* Expand on breaking change, fix typo in changelog
* API Gateway proto
* fix lint issue
* new line
* run make proto format
* checkpoint
* stub
* Update internal/mesh/internal/controllers/apigateways/controller.go
* Move config-dependent methods to separate package
In order to reuse the fetching and file creation part of the
bootstrap package, move the code that would cause cyclical
dependencies to a different package.
* Export needed bootstrap methods and variables
Also add back validating persisted config and update tests.
* Add support to check for just management token
Add a new method that fetches the bootstrap configuration only if
there isn't a valid management token file instead of checking for
all the hcp-config files.
* Pass data dir as a dependency to link controller
The link controller needs to check the data directory for
the hcp-config files.
* Fetch bootstrap config for token in controller
Load the management token when reconciling a link resource, which will
fetch the agent boostrap configuration if the token is not already
persisted locally. Skip this step if the cluster is in read-only mode.
* Validate resource ID format in link creation
* Handle unauthorized and forbidden errors
Check for 401 and 403s when making GNM requests, exit bootstrap fetch
loop and return specific failure statuses for link.
* Move test function to a testing file
* Log load and status write errors
* Exported services api implemented
* Tests added, refactored code
* Adding server tests
* changelog added
* Proto gen added
* Adding codegen changes
* changing url, response object
* Fixing lint error by having namespace and partition directly
* Tests changes
* refactoring tests
* Simplified uniqueness logic for exported services, sorted the response in order of service name
* Fix lint errors, refactored code
Ultimately we will have to rectify wan federation with v2 catalog adjacent
experiments, but for now blanket prevent usage of the resource-apis,
v2dns, and v2tenancy experiments in secondary datacenters.
* Create HCP management token in HCP manager
* Change InitializeManagementToken to ManagementTokenUpserter
* Implement and use management token upsert function
* Fix race condition in test
* Add idea for improvement as comment
* Return early in upsertManagementToken if token exists
* Add Initializer to the controller
The Initializer adds support for running any required initialization
steps when the controller is first started.
* Implement HCP Link initializer
The link initializer will create a Link resource if the
cloud configuration has been set.
* Simplify retry logic and testing
* Remove internal retry, replace with logging logic