Commit Graph

355 Commits (0ac8ae6c3beb8641927f0880d5c8c4e9754e2d82)

Author SHA1 Message Date
Derek Menteer 0ac8ae6c3b
Fix xDS deadlock due to syncLoop termination. (#20867)
* Fix xDS deadlock due to syncLoop termination.

This fixes an issue where agentless xDS streams can deadlock permanently until
a server is restarted. When this issue occurs, no new proxies are able to
successfully connect to the server.

Effectively, the trigger for this deadlock stems from the following return
statement:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L199-L202

When this happens, the entire `syncLoop()` terminates and stops consuming from
the following channel:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L182-L192

Which results in the `ConfigSource.cleanup()` function never receiving a
response and holding a mutex indefinitely:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L241-L247

Because this mutex is shared, it effectively deadlocks the server's ability to
process new xDS streams.

----

The fix to this issue involves removing the `chan chan struct{}` used like an
RPC-over-channels pattern and replacing it with two distinct channels:

+ `stopSyncLoopCh` - indicates that the `syncLoop()` should terminate soon.  +
`syncLoopDoneCh` - indicates that the `syncLoop()` has terminated.

Splitting these two concepts out and deferring a `close(syncLoopDoneCh)` in the
`syncLoop()` function ensures that the deadlock above should no longer occur.

We also now evict xDS connections of all proxies for the corresponding
`syncLoop()` whenever it encounters an irrecoverable error. This is done by
hoisting the new `syncLoopDoneCh` upwards so that it's visible to the xDS delta
processing. Prior to this fix, the behavior was to simply orphan them so they
would never receive catalog-registration or service-defaults updates.

* Add changelog.
2024-03-15 13:57:11 -05:00
Matt Keeler 8fcafb139c
Add `consul snapshot decode` command (#20824)
Add snapshot decoding command
2024-03-14 12:59:06 -04:00
Michael Zalimeni d4761c0ccd
security: upgrade google.golang.org/protobuf to 1.33.0 (#20801)
Resolves CVE-2024-24786.
2024-03-06 23:04:42 +00:00
Matt Keeler 1a454b7b47
Make routes controller tests more robust when testing with multiple tenancies (#20743)
When executing the test with multiple tenancies the various mesh controllers remained running across multiple sub-tests. While the resourcetest.Client cleans up resources created with it, resources created by the controllers were not being cleaned up. This could result in subsequent sub tests encountering unexpected values and failing when they should pass.
2024-02-28 11:03:12 -05:00
Semir Patel 8ba919f913
v2tenancy: make CE specific version of `resource.Registration` (#20681) 2024-02-20 15:38:06 -08:00
Semir Patel 943426bc79
v2tenancy: add optional LicenseFeature to type Registration struct (#20673) 2024-02-20 14:42:31 -06:00
Chris S. Kim 53afd8f4c5
[CE] Test tenancies for exported-services config manager (#20678)
Sync controller tests from ENT
2024-02-20 10:03:49 -07:00
John Maguire 2ed67ba9ae
[NET-7375] Add Skeleton for APIGW ProxyStateTemplateBuilder (#20628)
* WIP

* got empty state working and cleaned up additional code

* Moved api gateway mapper to it's own file, removed mesh port usage for
api gateway, allow gateway to reconcile without routes

* Update internal/mesh/internal/controllers/gatewayproxy/controller.go

Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>

* Feedback from PR Review:
  - rename referencedTCPRoutes variable to be more descriptive
  - fetchers are in alphabetical order
  - move log statements to be more indicative of internal state

---------

Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
2024-02-16 10:38:08 -05:00
Chris S. Kim 248969c2a7
Update ComputedTrafficPermissions ACL hooks (#20622) 2024-02-13 15:16:54 -05:00
Chris S. Kim 0c2c7ca9c7
Add samenessgroup registration for enterprise (#20624) 2024-02-13 18:23:19 +00:00
Chris S. Kim ab3c6cf1e5
Add BoundReferences to ComputedTrafficPermissions (#20593) 2024-02-13 17:27:24 +00:00
Chris S. Kim 7575b53737
Refactor xTP tests (#20597) 2024-02-13 12:02:42 -05:00
Semir Patel b716a9ef6b
resource: reconcile managed types every ~8hrs (#20606) 2024-02-13 10:51:54 -06:00
R.B. Boyer 671c436415
mesh: use ComputedImplicitDestinations resource in the sidecar controller (#20553)
Wire the ComputedImplicitDestinations resource into the sidecar controller, replacing the inline version already present.

Also:

- Rewrite the controller to use the controller cache
- Rewrite it to no longer depend on ServiceEndpoints
- Remove the fetcher and (local) cache abstraction
2024-02-12 14:10:33 -06:00
John Murret 7e8f2e5f08
NET-7644/NET-7634 - Implement query lookup for tagged addresses on nodes and services including WAN translation. (#20583)
NET-7644 - Implement tagged addresses and wan translation
2024-02-12 14:27:25 -05:00
John Maguire ec76090be9
[NET-7450] Fix listenerToProtocol function for input (#20536)
* make listenerProtocolToCatalogProtocol function more forgiving for
different cased input

* update tests
2024-02-12 11:24:56 -05:00
Nick Cellino 5fb6ab6a3a
Move HCP Manager lifecycle management out of Link controller (#20401)
* Add function to get update channel for watching HCP Link

* Add MonitorHCPLink function

This function can be called in a goroutine to manage the lifecycle
of the HCP manager.

* Update HCP Manager config in link monitor before starting

This updates HCPMonitorLink so it updates the HCP manager
with an HCP client and management token when a Link is upserted.

* Let MonitorHCPManager handle lifecycle instead of link controller

* Remove cleanup from Link controller and move it to MonitorHCPLink

Previously, the Link Controller was responsible for cleaning up the
HCP-related files on the file system. This change makes it so
MonitorHCPLink handles this cleanup. As a result, we are able to remove
the PlacementEachServer placement strategy for the Link controller
because it no longer needs to do this per-node cleanup.

* Remove HCP Manager dependency from Link Controller

The Link controller does not need to have HCP Manager
as a dependency anymore, so this removes that dependency
in order to simplify the design.

* Add Linked prefix to Linked status variables

This is in preparation for adding a new status type to the
Link resource.

* Add new "validated" status type to link resource

The link resource controller will now set a "validated" status
in addition to the "linked" status. This is needed so that other
components (eg the HCP manager) know when the Link is ready to link
with HCP.

* Fix tests

* Handle new 'EndOfSnapshot' WatchList event

* Fix watch test

* Remove unnecessary config from TestAgent_scadaProvider

Since the Scada provider is now started on agent startup
regardless of whether a cloud config is provided, this removes
the cloud config override from the relevant test.

This change is not exactly related to the changes from this PR,
but rather is something small and sort of related that was noticed
while working on this PR.

* Simplify link watch test and remove sleep from link watch

This updates the link watch test so that it uses more mocks
and does not require setting up the infrastructure for the HCP Link
controller.

This also removes the time.Sleep delay in the link watcher loop in favor
of an error counter. When we receive 10 consecutive errors, we shut down
the link watcher loop.

* Add better logging for link validation. Remove EndOfSnapshot test.

* Refactor link monitor test into a table test

* Add some clarifying comments to link monitor

* Simplify link watch test

* Test a bunch more errors cases in link monitor test

* Use exponential backoff instead of errorCounter in LinkWatch

* Move link watch and link monitor into a single goroutine called from server.go

* Refactor HCP link watcher to use single go-routine.

Previously, if the WatchClient errored, we would've never recovered
because we never retry to create the stream. With this change,
we have a single goroutine that runs for the life of the server agent
and if the WatchClient stream ever errors, we retry the creation
of the stream with an exponential backoff.
2024-02-12 10:48:23 -05:00
Tauhid Anjum 9d8f9a5470
NET-7783: Fix sameness group expansion to 0 sources error CE (#20584) 2024-02-12 17:04:18 +05:30
R.B. Boyer 6742340878
mesh: add ComputedImplicitDestinations resource for future use (#20547)
Creates a new controller to create ComputedImplicitDestinations resources by 
composing ComputedRoutes, Services, and ComputedTrafficPermissions to 
infer all ParentRef services that could possibly send some portion of traffic to a 
Service that has at least one accessible Workload Identity. A followup PR will 
rewire the sidecar controller to make use of this new resource.

As this is a performance optimization, rather than a security feature the following 
aspects of traffic permissions have been ignored:

- DENY rules
- port rules (all ports are allowed)

Also:

- Add some v2 TestController machinery to help test complex dependency mappers.
2024-02-09 15:42:10 -06:00
Matt Keeler 6c4b83c119
Allow reuse of cache indexes (#20562)
Previously calling `index.New` would return an object with the index information such as the Indexer, whether it was required, and the name of the index as well as a radix tree to store indexed data.

Now the main `Index` type doesn’t contain the radix tree for indexed data. Instead the `IndexedData` method can be used to combine the main `Index` with a radix tree in the `IndexedData` structure.

The cache still only allows configuring the `Index` type and will invoke the `IndexedData` method on the provided indexes to get the structure that the cache can use for actual data management.

All of this makes it now safe to reuse the `index.Index` types.
2024-02-09 13:00:21 -05:00
John Maguire 7c3a379e48
Fixes gatewayproxy controller tests for ent (#20543)
fix tests for ent
2024-02-08 18:34:44 +00:00
Eric Haberkorn b26282568f
Move sameness groups to v2beta1 version (#20531) 2024-02-08 11:05:06 -05:00
John Maguire 2ee32b1980
Add tests for gw proxy controller (#20510)
* Added basic tests for gatewayproxy controller

* add copyright header

* Clean up tests
2024-02-07 17:01:10 -05:00
Nathan Coleman 45d645471b
[NET-7414] Reconcile PST for mesh gateway workloads on change to ComputedExportedServices (#20271)
* Reconcile ProxyStateTemplate on change to ComputedExportedServices

* gofmt changeset

---------

Co-authored-by: NiniOak <anita.akaeze@hashicorp.com>
2024-02-07 21:27:13 +00:00
skpratt 57bad0df85
add traffic permissions excludes and tests (#20453)
* add traffic permissions tests

* review fixes

* Update internal/mesh/internal/controllers/sidecarproxy/builder/local_app.go

Co-authored-by: John Landa <jonathanlanda@gmail.com>

---------

Co-authored-by: John Landa <jonathanlanda@gmail.com>
2024-02-07 20:21:44 +00:00
Eric Haberkorn 1bd253021b
V1 Compat Exported Services Controller Optimizations (#20517)
V1 compat exported services controller optimizations

* Don't start the v2 exported services controller in v1 mode.
* Use the controller cache.
2024-02-07 14:05:42 -05:00
Matt Keeler 3ca4f39fa1
Register the multicluster types for the catalogtest integration tests (#20516)
In particular the failover controller needs these in Consul Enterprise
2024-02-07 13:35:02 -05:00
John Maguire 24e9603d9b
Fix Gatewayproxy Controller and Re-Enable APIGW v2 Controller (#20508)
re-enable apigw controller, fix typo in key name for metadata for
gatewayproxy
2024-02-06 18:55:55 +00:00
Matt Keeler 49e6c0232d
Panic for unregistered types (#20476)
* Panic when controllers attempt to make invalid requests to the resource service

This will help to catch bugs in tests that could cause infinite errors to be emitted.

* Disable the API GW v2 controller

With the previous commit, this would cause a server to panic due to watching a type which has not yet been created/registered.

* Ensure that a test server gets the full type registry instead of constructing its own

* Skip TestServer_ControllerDependencies

* Fix peering tests so that they use the full resource registry.
2024-02-06 11:23:06 -05:00
John Maguire 54c974748e
[NET-7280] Add APIGW support to the gatewayproxy controller (#20484)
* Add APIGW support to the gatewayproxy controller

* update copywrite headers
2024-02-06 11:03:37 -05:00
Tauhid Anjum 88b8a1cc36
NET-6776 - Update Routes controller to use ComputedFailoverPolicy CE (#20496)
Update Routes controller to use ComputedFailoverPolicy
2024-02-06 13:28:18 +05:30
R.B. Boyer deca6a49bd
catalog: improve the bound workload identity encoding on services (#20458)
The endpoints controller currently encodes the list of unique workload identities 
referenced by all workload matched by a Service into a special data-bearing 
status condition on that Service. This allows a downstream controller to avoid an 
expensive watch on the ServiceEndpoints type just to get this data.

The current encoding does not lend itself well to machine parsing, which is what 
the field is meant for, so this PR simplifies the encoding from:

    "blah blah: " + strings.Join(ids, ",") + "."

to

    strings.Join(ids, ",")

It also provides an exported utility function to easily extract this data.
2024-02-02 16:28:39 -06:00
Nick Ethier 9d4ad74a63
internal/hcp: prevent write loop on telemetrystate resource updates (#20435)
* internal/hcp: prevent write loop on telemetrystate resource updates

* Update controller.go

Co-authored-by: Nick Cellino <nick.cellino@hashicorp.com>

* internal/hcp: add assertion for looping controller

---------

Co-authored-by: Nick Cellino <nick.cellino@hashicorp.com>
2024-02-02 16:28:20 -05:00
R.B. Boyer c029b20615
v2: ensure the controller caches are fully populated before first use (#20421)
The new controller caches are initialized before the DependencyMappers or the 
Reconciler run, but importantly they are not populated. The expectation is that 
when the WatchList call is made to the resource service it will send an initial 
snapshot of all resources matching a single type, and then perpetually send 
UPSERT/DELETE events afterward. This initial snapshot will cycle through the 
caching layer and will catch it up to reflect the stored data.

Critically the dependency mappers and reconcilers will race against the restoration 
of the caches on server startup or leader election. During this time it is possible a
 mapper or reconciler will use the cache to lookup a specific relationship and 
not find it. That very same reconciler may choose to then recompute some 
persisted resource and in effect rewind it to a prior computed state.

Change

- Since we are updating the behavior of the WatchList RPC, it was aligned to 
  match that of pbsubscribe and pbpeerstream using a protobuf oneof instead of the enum+fields option.

- The WatchList rpc now has 3 alternating response events: Upsert, Delete, 
  EndOfSnapshot. When set the initial batch of "snapshot" Upserts sent on a new 
  watch, those operations will be followed by an EndOfSnapshot event before beginning 
  the never-ending sequence of Upsert/Delete events.

- Within the Controller startup code we will launch N+1 goroutines to execute WatchList 
  queries for the watched types. The UPSERTs will be applied to the nascent cache
   only (no mappers will execute).

- Upon witnessing the END operation, those goroutines will terminate.

- When all cache priming routines complete, then the normal set of N+1 long lived 
watch routines will launch to officially witness all events in the system using the 
primed cached.
2024-02-02 15:11:05 -06:00
Eric Haberkorn 543c6a30af
Trigger the V1 Compat exported-services Controller when V1 Config Entries are Updated (#20456)
* Trigger the v1 compat exported-services controller when the v1 config entry is modified.

* Hook up exported-services config entries to the event publisher.
* Add tests to the v2 exported services shim.
* Use the local materializer trigger updates on the v1 compat exported services controller when exported-services config entries are modified.

* stop sleeping when context is cancelled
2024-02-02 15:30:04 -05:00
Eric Haberkorn d0243b618d
Change the multicluster group to v2 (#20430) 2024-02-01 12:08:26 -05:00
Melisa Griffin 7c00d396cf
[NET-6417] Add validation of MeshGateway name + listeners (#20425)
* Add validation of MeshGateway name + listeners

* Adds test for ValidateMeshGateway

* Fixes data fetcher test for gatewayproxy

---------

Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
2024-01-31 18:47:57 -05:00
Nick Ethier 383d92e9ab
hcp.v2.TelemetryState resource and controller implementation (#20257)
* pbhcp: add TelemetryState resource

* agent/hcp: add GetObservabilitySecrets to client

* internal/hcp: add TelemetryState controller logic

* hcp/telemetry-state: added config options for hcp sdk and debug key to skip deletion during reconcile

* pbhcp: update proto documentation

* hcp: address PR feedback, additional validations and code cleanup

* internal/hcp: fix type sig change in test

* update testdata/v2-resource-dependencies
2024-01-31 14:47:05 -05:00
Nathan Coleman 74e4200d07
[NET-6429] Program ProxyStateTemplate to route cross-partition traffi… (#20410)
[NET-6429] Program ProxyStateTemplate to route cross-partition traffic to the correct destination mesh gateway

* Program mesh port to route wildcarded gateway SNI to the appropriate remote partition's mesh gateway

* Update target + route ports in service endpoint refs when building PST

* Use proper name of local datacenter when constructing SNI for gateway target

* Use destination identities for TLS when routing L4 traffic through the mesh gateway

* Use new constants, move comment to correct location

* Use new constants for port names

* Update test assertions

* Undo debug logging change
2024-01-31 10:46:04 -05:00
Ronald 8799c36410
[NET-6231] Handle Partition traffic permissions when reconciling traffic permissions (#20408)
[NET-6231] Partition traffic permissions

Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
2024-01-30 22:14:32 +00:00
Chris S. Kim 7cc88a1577
Handle NamespaceTrafficPermissions when reconciling TrafficPermissions (#20407) 2024-01-30 21:31:25 +00:00
Nathan Coleman 21b3c18d5d
Use a full EndpointRef on ComputedRoutes targets instead of just the ID (#20400)
* Use a full EndpointRef on ComputedRoutes targets instead of just the ID

Today, the `ComputedRoutes` targets have the appropriate ID set for their `ServiceEndpoints` reference; however, the `MeshPort` and `RoutePort` are assumed to be that of the target when adding the endpoints reference in the sidecar's `ProxyStateTemplate`.

This is problematic when the target lives behind a `MeshGateway` and the `Mesh/RoutePort` used in the sidecar's `ProxyStateTemplate` should be that of the `MeshGateway` instead of the target.

Instead of assuming the `MeshPort` and `RoutePort` when building the `ProxyStateTemplate` for the sidecar, let's just add the full `EndpointRef` -- including the ID and the ports -- when hydrating the computed destinations.

* Make sure the UID from the existing ServiceEndpoints makes it onto ComputedRoutes

* Update test assertions

* Undo confusing whitespace change

* Remove one-line function wrapper

* Use plural name for endpoints ref

* Add constants for gateway name, kind and port names
2024-01-30 16:25:44 -05:00
Ronald 783f33db3b
[NET-7074] Exported Services typo fix (#20402) 2024-01-30 21:08:36 +00:00
Ganesh S 4ca6573384
Add status for exported services controller (#20376) 2024-01-30 22:20:09 +05:30
Melissa Kam b0e87dbe13
[CC-7049] Stop the HCP manager when link is deleted (#20351)
* Add Stop method to telemetry provider

Stop the main loop of the provider and set the config
to disabled.

* Add interface for telemetry provider

Added for easier testing. Also renamed Run to Start, which better
fits with Stop.

* Add Stop method to HCP manager

* Add manager interface, rename implementation

Add interface for easier testing, rename existing Manager to HCPManager.

* Stop HCP manager in link Finalizer

* Attempt to cleanup if resource has been deleted

The link should be cleaned up by the finalizer, but there's an edge
case in a multi-server setup where the link is fully deleted on one
server before the other server reconciles. This will cover the case
where the reconcile happens after the resource is deleted.

* Add a delete mananagement token function

Passes a function to the HCP manager that deletes the management token
that was initially created by the manager.

* Delete token as part of stopping the manager

* Lock around disabling config, remove descriptions
2024-01-30 09:40:36 -06:00
Melissa Kam 3b9bb8d6f9
[CC-7044] Start HCP manager as part of link creation (#20312)
* Check for ACL write permissions on write

Link eventually will be creating a token, so require acl:write.

* Convert Run to Start, only allow to start once

* Always initialize HCP components at startup

* Support for updating config and client

* Pass HCP manager to controller

* Start HCP manager in link resource

Start as part of link creation rather than always starting. Update
the HCP manager with values from the link before starting as well.

* Fix metrics sink leaked goroutine

* Remove the hardcoded disabled hostname prefix

The HCP metrics sink will always be enabled, so the length of sinks will
always be greater than zero. This also means that we will also always
default to prefixing metrics with the hostname, which is what our
documentation states is the expected behavior anyway.

* Add changelog

* Check and set running status in one method

* Check for primary datacenter, add back test

* Clarify merge reasoning, fix timing issue in test

* Add comment about controller placement

* Expand on breaking change, fix typo in changelog
2024-01-29 16:31:44 -06:00
Matt Keeler d350115e7f
Fix filename with two periods (#20389) 2024-01-29 15:38:40 -05:00
Matt Keeler 34a32d4ce5
Remove V2 PeerName field from pbresource.Tenancy (#19865)
The peer name will eventually show up elsewhere in the resource. For now though this rips it out of where we don’t want it to be.
2024-01-29 15:08:31 -05:00
Nitya Dhanushkodi 92aab7ea31
[NET-5586][rebased] v2: Support virtual port references in config (#20371)
[OG Author: michael.zalimeni@hashicorp.com, rebase needed a separate PR]

* v2: support virtual port in Service port references

In addition to Service target port references, allow users to specify a
port by stringified virtual port value. This is useful in environments
such as Kubernetes where typical configuration is written in terms of
Service virtual ports rather than workload (pod) target port names.

Retaining the option of referencing target ports by name supports VMs,
Nomad, and other use cases where virtual ports are not used by default.

To support both uses cases at once, we will strictly interpret port
references based on whether the value is numeric. See updated
`ServicePort` docs for more details.

* v2: update service ref docs for virtual port support

Update proto and generated .go files with docs reflecting virtual port
reference support.

* v2: add virtual port references to L7 topo test

Add coverage for mixed virtual and target port references to existing
test.

* update failover policy controller tests to work with computed failover policy and assert error conditions against FailoverPolicy and ComputedFailoverPolicy resources

* accumulate services; don't overwrite them in enterprise

---------

Co-authored-by: Michael Zalimeni <michael.zalimeni@hashicorp.com>
Co-authored-by: R.B. Boyer <rb@hashicorp.com>
2024-01-29 10:43:41 -08:00
Chris S. Kim a2d50af1bd
Fix panic on error (#20377) 2024-01-29 17:44:13 +00:00