Commit Graph

361 Commits (7554f8da374625917344b44f4c00cd7f2e58a76b)

Author SHA1 Message Date
hc-github-team-consul-core 993f2d21c0
Backport of Fix to not create a watch to `Internal.ServiceDump` when mesh gateway is not used into release/1.17.x (#20268)
This add a fix to properly verify the gateway mode before creating a watch specific to mesh gateways. This watch have a high performance cost and when mesh gateways are not used is not used.

This also adds an optimization to only return the nodes when watching the Internal.ServiceDump RPC to avoid unnecessary disco chain compilation. As watches in proxy config only need the nodes.

* backport of commit b0ce20b5e2
* backport of commit 3d4bde00cf
* backport of commit b2c77246b9
* backport of commit e7ab4d418d
* backport of commit d00d9c5da4
* backport of commit b2db3d5eb4
* backport of commit 50fb45ac74
* backport of commit 7b41a61c17
* backport of commit 2fa0e0a629
* backport of commit 88849c9030
* backport of commit 4ac54f10bc
* backport of commit 2a9dfc37f2

---------

Co-authored-by: Dhia Ayachi <dhia@hashicorp.com>
2024-01-19 08:17:22 -06:00
Derek Menteer 212485578c
Backport of: Fix ClusterLoadAssignment timeouts dropping endpoints. into 1.17 (#19884)
Fix ClusterLoadAssignment timeouts dropping endpoints.

When a large number of upstreams are configured on a single envoy
proxy, there was a chance that it would timeout when waiting for
ClusterLoadAssignments. While this doesn't always immediately cause
issues, consul-dataplane instances appear to consistently drop
endpoints from their configurations after an xDS connection is
re-established (the server dies, random disconnect, etc).

This commit adds an `xds_fetch_timeout_ms` config to service registrations
so that users can set the value higher for large instances that have
many upstreams. The timeout can be disabled by setting a value of `0`.

This configuration was introduced to reduce the risk of causing a
breaking change for users if there is ever a scenario where endpoints
would never be received. Rather than just always blocking indefinitely
or for a significantly longer period of time, this config will affect
only the service instance associated with it.
2023-12-11 10:02:33 -06:00
hc-github-team-consul-core a34009b7c1
Backport of parse config protocol on write to optimize disco-chain compilation into release/1.17.x (#19859)
* parse config protocol on write to optimize disco-chain compilation (#19829)

* parse config protocol on write to optimize disco-chain compilation

* add changelog

* add test fixes from PR

* adding missing config field

---------

Co-authored-by: Dhia Ayachi <dhia@hashicorp.com>
2023-12-07 15:35:26 -05:00
Chris S. Kim 92ce814693
Remove old build tags (#19128) 2023-10-10 10:58:06 -04:00
Thomas Eckert 342306c312
Allow connections through Terminating Gateways from peered clusters NET-3463 (#18959)
* Add InboundPeerTrustBundle maps to Terminating Gateway

* Add notify and cancelation of watch for inbound peer trust bundles

* Pass peer trust bundles to the RBAC creation function

* Regenerate Golden Files

* add changelog, also adds another spot that needed peeredTrustBundles

* Add basic test for terminating gateway with peer trust bundle

* Add intention to cluster peered golden test

* rerun codegen

* update changelog

* really update the changelog

---------

Co-authored-by: Melisa Griffin <melisa.griffin@hashicorp.com>
2023-10-05 21:54:23 +00:00
Dhia Ayachi b1688ad856
Run copyright after running deep-copy as part of the Makefile/CI (#18741)
* execute copyright headers after performing deep-copy generation.

* fix copyright install

* Apply suggestions from code review

Co-authored-by: Semir Patel <semir.patel@hashicorp.com>

* Apply suggestions from code review

Co-authored-by: Semir Patel <semir.patel@hashicorp.com>

* rename steps to match codegen naming

* remove copywrite install category

---------

Co-authored-by: Semir Patel <semir.patel@hashicorp.com>
2023-09-11 13:50:52 -04:00
Derek Menteer a698142325
Add extra logging for mesh health endpoints. (#18647) 2023-09-01 12:29:09 -05:00
John Maguire 9876923e23
Add the plumbing for APIGW JWT work (#18609)
* Add the plumbing for APIGW JWT work

* Remove unneeded import

* Add deep equal function for HTTPMatch

* Added plumbing for status conditions

* Remove unneeded comment

* Fix comments

* Add calls in xds listener for apigateway to setup listener jwt auth
2023-08-31 12:23:59 -04:00
Ashwin Venkatesh 797e42dc24
Watch the ProxyTracker from xDS controller (#18611) 2023-08-29 14:39:29 -07:00
John Murret 0e606504bc
NET-4944 - wire up controllers with proxy tracker (#18603)
Co-authored-by: github-team-consul-core <github-team-consul-core@hashicorp.com>
2023-08-29 09:15:34 -06:00
John Murret 051f250edb
NET-5338 - NET-5338 - Run a v2 mode xds server (#18579)
* NET-5338 - NET-5338 - Run a v2 mode xds server

* fix linting
2023-08-24 16:44:14 -06:00
John Maguire 59ab57f350
NET-5147: Added placeholder structs for JWT functionality (#18575)
* Added placeholder structs for JWT functionality

* Added watches for CE vs ENT

* Add license header

* Undo plumbing work

* Add context arg
2023-08-24 15:07:14 -04:00
Semir Patel 53e28a4963
OSS -> CE (community edition) changes (#18517) 2023-08-22 09:46:03 -05:00
hashicorp-copywrite[bot] 5fb9df1640
[COMPLIANCE] License changes (#18443)
* Adding explicit MPL license for sub-package

This directory and its subdirectories (packages) contain files licensed with the MPLv2 `LICENSE` file in this directory and are intentionally licensed separately from the BSL `LICENSE` file at the root of this repository.

* Adding explicit MPL license for sub-package

This directory and its subdirectories (packages) contain files licensed with the MPLv2 `LICENSE` file in this directory and are intentionally licensed separately from the BSL `LICENSE` file at the root of this repository.

* Updating the license from MPL to Business Source License

Going forward, this project will be licensed under the Business Source License v1.1. Please see our blog post for more details at <Blog URL>, FAQ at www.hashicorp.com/licensing-faq, and details of the license at www.hashicorp.com/bsl.

* add missing license headers

* Update copyright file headers to BUSL-1.1

* Update copyright file headers to BUSL-1.1

* Update copyright file headers to BUSL-1.1

* Update copyright file headers to BUSL-1.1

* Update copyright file headers to BUSL-1.1

* Update copyright file headers to BUSL-1.1

* Update copyright file headers to BUSL-1.1

* Update copyright file headers to BUSL-1.1

* Update copyright file headers to BUSL-1.1

* Update copyright file headers to BUSL-1.1

* Update copyright file headers to BUSL-1.1

* Update copyright file headers to BUSL-1.1

* Update copyright file headers to BUSL-1.1

* Update copyright file headers to BUSL-1.1

* Update copyright file headers to BUSL-1.1

---------

Co-authored-by: hashicorp-copywrite[bot] <110428419+hashicorp-copywrite[bot]@users.noreply.github.com>
2023-08-11 09:12:13 -04:00
Dan Stough 284e3bdb54
[OSS] test: xds coverage for routes (#18369)
test: xds coverage for routes
2023-08-03 15:03:02 -04:00
cui fliter 18a5edd232
docs: Fix some comments (#17118)
Signed-off-by: cui fliter <imcusg@gmail.com>
2023-07-31 10:56:09 -07:00
Nathan Coleman 5caa0ae3f5
api-gateway: subscribe to bound-api-gateway only after receiving api-gateway (#18291)
* api-gateway: subscribe to bound-api-gateway only after receiving api-gateway

This fixes a race condition due to our dependency on having the listener(s) from the api-gateway config entry in order to fully and properly process the resources on the bound-api-gateway config entry.

* Apply suggestions from code review

* Add changelog entry
2023-07-26 16:02:04 -04:00
Dan Stough 8e3a1ddeb6
[OSS] Improve xDS Code Coverage - Endpoints and Misc (#18222)
test: improve xDS endpoints code coverage
2023-07-21 17:48:25 -04:00
Dan Stough 2793761702
[OSS] Improve xDS Code Coverage - Clusters (#18165)
test: improve xDS cluster code coverage
2023-07-20 18:02:21 -04:00
Dan Stough 33d898b857
[OSS] test: improve xDS listener code coverage (#18138)
test: improve xDS listener code coverage
2023-07-17 13:49:40 -04:00
Dan Stough 1b08626358
[OSS] Fix initial_fetch_timeout to wait for all xDS resources (#18024)
* fix(connect): set initial_fetch_time to wait indefinitely

* changelog

* PR feedback 1
2023-07-10 17:08:06 -04:00
Ronald 80394278b8
Expose JWKS cluster config through JWTProviderConfigEntry (#17978)
* Expose JWKS cluster config through JWTProviderConfigEntry

* fix typos, rename trustedCa to trustedCA
2023-07-04 09:12:06 -04:00
Eric Haberkorn a3ba559149
Make locality aware routing xDS changes (#17826) 2023-06-21 12:39:53 -04:00
R.B. Boyer 72f991d8d3
agent: remove agent cache dependency from service mesh leaf certificate management (#17075)
* agent: remove agent cache dependency from service mesh leaf certificate management

This extracts the leaf cert management from within the agent cache.

This code was produced by the following process:

1. All tests in agent/cache, agent/cache-types, agent/auto-config,
   agent/consul/servercert were run at each stage.

    - The tests in agent matching .*Leaf were run at each stage.

    - The tests in agent/leafcert were run at each stage after they
      existed.

2. The former leaf cert Fetch implementation was extracted into a new
   package behind a "fake RPC" endpoint to make it look almost like all
   other cache type internals.

3. The old cache type was shimmed to use the fake RPC endpoint and
   generally cleaned up.

4. I selectively duplicated all of Get/Notify/NotifyCallback/Prepopulate
   from the agent/cache.Cache implementation over into the new package.
   This was renamed as leafcert.Manager.

    - Code that was irrelevant to the leaf cert type was deleted
      (inlining blocking=true, refresh=false)

5. Everything that used the leaf cert cache type (including proxycfg
   stuff) was shifted to use the leafcert.Manager instead.

6. agent/cache-types tests were moved and gently replumbed to execute
   as-is against a leafcert.Manager.

7. Inspired by some of the locking changes from derek's branch I split
   the fat lock into N+1 locks.

8. The waiter chan struct{} was eventually replaced with a
   singleflight.Group around cache updates, which was likely the biggest
   net structural change.

9. The awkward two layers or logic produced as a byproduct of marrying
   the agent cache management code with the leaf cert type code was
   slowly coalesced and flattened to remove confusion.

10. The .*Leaf tests from the agent package were copied and made to work
    directly against a leafcert.Manager to increase direct coverage.

I have done a best effort attempt to port the previous leaf-cert cache
type's tests over in spirit, as well as to take the e2e-ish tests in the
agent package with Leaf in the test name and copy those into the
agent/leafcert package to get more direct coverage, rather than coverage
tangled up in the agent logic.

There is no net-new test coverage, just coverage that was pushed around
from elsewhere.
2023-06-13 10:54:45 -05:00
R.B. Boyer ec347ef01d
sort some imports that are wonky between oss and ent (#17637) 2023-06-09 11:30:56 -05:00
Andrew Stucki 9a4f503b2b
[API Gateway] Fix trust domain for external peered services in synthesis code (#17609)
* [API Gateway] Fix trust domain for external peered services in synthesis code

* Add changelog
2023-06-08 12:18:17 -04:00
Michael Zalimeni ad03a5d0f2
Avoid panic applying TProxy Envoy extensions (#17537)
When UpstreamEnvoyExtender was introduced, some code was left duplicated
between it and BasicEnvoyExtender. One path in that code panics when a
TProxy listener patch is attempted due to no upstream data in
RuntimeConfig matching the local service (which would only happen in
rare cases).

Instead, we can remove the special handling of upstream VIPs from
BasicEnvoyExtender entirely, greatly simplifying the listener filter
patch code and avoiding the panic. UpstreamEnvoyExtender, which needs
this code to function, is modified to ensure a panic does not occur.

This also fixes a second regression in which the Lua extension was not
applied to TProxy outbound listeners.
2023-06-01 13:04:39 -04:00
sarahalsmiller b147323fb0
xds: Remove APIGateway ToIngress function (#17453)
* xds generation for routes api gateway

* Update gateway.go

* move buildHttpRoute into xds package

* Update agent/consul/discoverychain/gateway.go

* remove unneeded function

* convert http route code to only run for http protocol to future proof code path

* Update agent/consul/discoverychain/gateway.go

Co-authored-by: Mike Morris <mikemorris@users.noreply.github.com>

* fix tests, clean up http check logic

* clean up todo

* Fix casing in docstring

* Fix import block, adjust docstrings

* Rename func

* Consolidate docstring onto single line

* Remove ToIngress() conversion for APIGW, which generates its own xDS now

* update name and comment

* use constant value

* use constant

* rename readyUpstreams to readyListeners to better communicate what that function is doing

---------

Co-authored-by: Mike Morris <mikemorris@users.noreply.github.com>
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
2023-05-25 15:16:37 +00:00
Dan Stough d935c7b466
[OSS] gRPC Blocking Queries (#17426)
* feat: initial grpc blocking queries

* changelog and docs update
2023-05-23 17:29:10 -04:00
sarahalsmiller e2a81aa8bd
xds: generate listeners directly from API gateway snapshot (#17398)
* API Gateway XDS Primitives, endpoints and clusters (#17002)

* XDS primitive generation for endpoints and clusters

Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>

* server_test

* deleted extra file

* add missing parents to test

---------

Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>

* Routes for API Gateway (#17158)

* XDS primitive generation for endpoints and clusters

Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>

* server_test

* deleted extra file

* add missing parents to test

* checkpoint

* delete extra file

* httproute flattening code

* linting issue

* so close on this, calling for tonight

* unit test passing

* add in header manip to virtual host

* upstream rebuild commented out

* Use consistent upstream name whether or not we're rebuilding

* Start working through route naming logic

* Fix typos in test descriptions

* Simplify route naming logic

* Simplify RebuildHTTPRouteUpstream

* Merge additional compiled discovery chains instead of overwriting

* Use correct chain for flattened route, clean up + add TODOs

* Remove empty conditional branch

* Restore previous variable declaration

Limit the scope of this PR

* Clean up, improve TODO

* add logging, clean up todos

* clean up function

---------

Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>

* checkpoint, skeleton, tests not passing

* checkpoint

* endpoints xds cluster configuration

* resources test fix

* fix reversion in resources_test

* checkpoint

* Update agent/proxycfg/api_gateway.go

Co-authored-by: John Maguire <john.maguire@hashicorp.com>

* unit tests passing

* gofmt

* add deterministic sorting to appease the unit test gods

* remove panic

* Find ready upstream matching listener instead of first in list

* Clean up, improve TODO

* Modify getReadyUpstreams to filter upstreams by listener (#17410)

Each listener would previously have all upstreams from any route that bound to the listener. This is problematic when a route bound to one listener also binds to other listeners and so includes upstreams for multiple listeners. The list for a given listener would then wind up including upstreams for other listeners.

* clean up todos, references to api gateway in listeners_ingress

* merge in Nathan's fix

* Update agent/consul/discoverychain/gateway.go

* cleanup current todos, remove snapshot manipulation from generation code

* Update agent/structs/config_entry_gateways.go

Co-authored-by: Thomas Eckert <teckert@hashicorp.com>

* Update agent/consul/discoverychain/gateway.go

Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>

* Update agent/consul/discoverychain/gateway.go

Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>

* Update agent/proxycfg/snapshot.go

Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>

* clarified header comment for FlattenHTTPRoute, changed RebuildHTTPRouteUpstream to BuildHTTPRouteUpstream

* simplify cert logic

* Delete scratch

* revert route related changes in listener PR

* Update agent/consul/discoverychain/gateway.go

* Update agent/proxycfg/snapshot.go

* clean up uneeded extra lines in endpoints

---------

Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
Co-authored-by: John Maguire <john.maguire@hashicorp.com>
Co-authored-by: Thomas Eckert <teckert@hashicorp.com>
2023-05-22 17:36:29 -04:00
Ronald 113202d541
JWT Authentication with service intentions: xds package update (#17414)
* JWT Authentication with service intentions: update xds package to translate config to envoy
2023-05-19 18:14:16 -04:00
sarahalsmiller 134aac7c26
xds: generate endpoints directly from API gateway snapshot (#17390)
* endpoints xds cluster configuration

* resources test fix

* fix reversion in resources_test

* Update agent/proxycfg/api_gateway.go

Co-authored-by: John Maguire <john.maguire@hashicorp.com>

* gofmt

* Modify getReadyUpstreams to filter upstreams by listener (#17410)

Each listener would previously have all upstreams from any route that bound to the listener. This is problematic when a route bound to one listener also binds to other listeners and so includes upstreams for multiple listeners. The list for a given listener would then wind up including upstreams for other listeners.

* Update agent/proxycfg/api_gateway.go

Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>

* Restore import blocking

* Skip to next route if route has no upstreams

* cleanup

* change set from bool to empty struct

---------

Co-authored-by: John Maguire <john.maguire@hashicorp.com>
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
2023-05-19 18:50:59 +00:00
Kyle Havlovitz 2904d0a431
Pull virtual IPs for filter chains from discovery chains (#17375) 2023-05-17 11:18:39 -07:00
Connor 0789661ce5
Rename hcp-metrics-collector to consul-telemetry-collector (#17327)
* Rename hcp-metrics-collector to consul-telemetry-collector

* Fix docs

* Fix doc comment

---------

Co-authored-by: Ashvitha Sridharan <ashvitha.sridharan@hashicorp.com>
2023-05-16 14:36:05 -04:00
Freddy 7c3e9cd862
Hash namespace+proxy ID when creating socket path (#17204)
UNIX domain socket paths are limited to 104-108 characters, depending on
the OS. This limit was quite easy to exceed when testing the feature on
Kubernetes, due to how proxy IDs encode the Pod ID eg:
metrics-collector-59467bcb9b-fkkzl-hcp-metrics-collector-sidecar-proxy

To ensure we stay under that character limit this commit makes a
couple changes:
- Use a b64 encoded SHA1 hash of the namespace + proxy ID to create a
  short and deterministic socket file name.
- Add validation to proxy registrations and proxy-defaults to enforce a
  limit on the socket directory length.
2023-05-09 12:20:26 -06:00
Derek Menteer 4f6da20fe5
Fix multiple issues related to proxycfg health queries. (#17241)
Fix multiple issues related to proxycfg health queries.

1. The datacenter was not being provided to a proxycfg query, which resulted in
bypassing agentless query optimizations and using the normal API instead.

2. The health rpc endpoint would return a zero index when insufficient ACLs were
detected. This would result in the agent cache performing an infinite loop of
queries in rapid succession without backoff.
2023-05-09 12:37:58 -05:00
Semir Patel 5eaeb7b8e5
Support Envoy's MaxEjectionPercent and BaseEjectionTime config entries for passive health checks (#15979)
* Add MaxEjectionPercent to config entry

* Add BaseEjectionTime to config entry

* Add MaxEjectionPercent and BaseEjectionTime to protobufs

* Add MaxEjectionPercent and BaseEjectionTime to api

* Fix integration test breakage

* Verify MaxEjectionPercent and BaseEjectionTime in integration test upstream confings

* Website docs for MaxEjectionPercent and BaseEjection time

* Add `make docs` to browse docs at http://localhost:3000

* Changelog entry

* so that is the difference between consul-docker and dev-docker

* blah

* update proto funcs

* update proto

---------

Co-authored-by: Maliz <maliheh.monshizadeh@hashicorp.com>
2023-04-26 15:59:48 -07:00
Eric Haberkorn b1fae05983
Add sameness groups to service intentions. (#17064) 2023-04-20 12:16:04 -04:00
Eric Haberkorn 44b39240a8
move enterprise test cases out of open source (#16985) 2023-04-13 09:07:06 -04:00
cskh a319953576
docs: add envoy to the proxycfg diagram (#16834)
* docs: add envoy to the proxycfg diagram
2023-04-04 09:42:42 -04:00
Eric Haberkorn a6d69adcf5
Add default resolvers to disco chains based on the default sameness group (#16837) 2023-03-31 14:35:56 -04:00
Eric Haberkorn 0d1d2fc4c9
add order by locality failover to Consul enterprise (#16791) 2023-03-30 10:08:38 -04:00
Ronald 94ec4eb2f4
copyright headers for agent folder (#16704)
* copyright headers for agent folder

* Ignore test data files

* fix proto files and remove headers in agent/uiserver folder

* ignore deep-copy files
2023-03-28 14:39:22 -04:00
Derek Menteer 2236975011
Change partition for peers in discovery chain targets (#16769)
This commit swaps the partition field to the local partition for
discovery chains targeting peers. Prior to this change, peer upstreams
would always use a value of default regardless of which partition they
exist in. This caused several issues in xds / proxycfg because of id
mismatches.

Some prior fixes were made to deal with one-off id mismatches that this
PR also cleans up, since they are no longer needed.
2023-03-24 15:40:19 -05:00
Eric Haberkorn 495ad4c7ef
add enterprise xds tests (#16738) 2023-03-22 14:56:18 -04:00
John Maguire 8dd1d73874
Remove unused are hosts set check (#16691)
* Remove unused are hosts set check

* Remove all traces of unused 'AreHostsSet' parameter

* Remove unused Hosts attribute

* Remove commented out use of snap.APIGateway.Hosts
2023-03-21 16:23:23 +00:00
Nitya Dhanushkodi b9bd2c3780
peering: peering partition failover fixes (#16673)
add local source partition for peered upstreams
2023-03-20 10:00:29 -07:00
John Maguire 1ef9f4dade
Fix route subscription when using namespaces (#16677)
* Fix route subscription when using namespaces

* Update changelog

* Fix changelog entry to reference that the bug was enterprise only
2023-03-20 12:42:30 -04:00
Ashvitha f95ffe0355
Allow HCP metrics collection for Envoy proxies
Co-authored-by: Ashvitha Sridharan <ashvitha.sridharan@hashicorp.com>
Co-authored-by: Freddy <freddygv@users.noreply.github.com>

Add a new envoy flag: "envoy_hcp_metrics_bind_socket_dir", a directory
where a unix socket will be created with the name
`<namespace>_<proxy_id>.sock` to forward Envoy metrics.

If set, this will configure:
- In bootstrap configuration a local stats_sink and static cluster.
  These will forward metrics to a loopback listener sent over xDS.

- A dynamic listener listening at the socket path that the previously
  defined static cluster is sending metrics to.

- A dynamic cluster that will forward traffic received at this listener
  to the hcp-metrics-collector service.


Reasons for having a static cluster pointing at a dynamic listener:
- We want to secure the metrics stream using TLS, but the stats sink can
  only be defined in bootstrap config. With dynamic listeners/clusters
  we can use the proxy's leaf certificate issued by the Connect CA,
  which isn't available at bootstrap time.

- We want to intelligently route to the HCP collector. Configuring its
  addreess at bootstrap time limits our flexibility routing-wise. More
  on this below.

Reasons for defining the collector as an upstream in `proxycfg`:
- The HCP collector will be deployed as a mesh service.

- Certificate management is taken care of, as mentioned above.

- Service discovery and routing logic is automatically taken care of,
  meaning that no code changes are required in the xds package.

- Custom routing rules can be added for the collector using discovery
  chain config entries. Initially the collector is expected to be
  deployed to each admin partition, but in the future could be deployed
  centrally in the default partition. These config entries could even be
  managed by HCP itself.
2023-03-10 13:52:54 -07:00
R.B. Boyer 9a485cdb49
proxycfg: ensure that an irrecoverable error in proxycfg closes the xds session and triggers a replacement proxycfg watcher (#16497)
Receiving an "acl not found" error from an RPC in the agent cache and the
streaming/event components will cause any request loops to cease under the
assumption that they will never work again if the token was destroyed. This
prevents log spam (#14144, #9738).

Unfortunately due to things like:

- authz requests going to stale servers that may not have witnessed the token
  creation yet

- authz requests in a secondary datacenter happening before the tokens get
  replicated to that datacenter

- authz requests from a primary TO a secondary datacenter happening before the
  tokens get replicated to that datacenter

The caller will get an "acl not found" *before* the token exists, rather than
just after. The machinery added above in the linked PRs will kick in and
prevent the request loop from looping around again once the tokens actually
exist.

For `consul-dataplane` usages, where xDS is served by the Consul servers
rather than the clients ultimately this is not a problem because in that
scenario the `agent/proxycfg` machinery is on-demand and launched by a new xDS
stream needing data for a specific service in the catalog. If the watching
goroutines are terminated it ripples down and terminates the xDS stream, which
CDP will eventually re-establish and restart everything.

For Consul client usages, the `agent/proxycfg` machinery is ahead-of-time
launched at service registration time (called "local" in some of the proxycfg
machinery) so when the xDS stream comes in the data is already ready to go. If
the watching goroutines terminate it should terminate the xDS stream, but
there's no mechanism to re-spawn the watching goroutines. If the xDS stream
reconnects it will see no `ConfigSnapshot` and will not get one again until
the client agent is restarted, or the service is re-registered with something
changed in it.

This PR fixes a few things in the machinery:

- there was an inadvertent deadlock in fetching snapshot from the proxycfg
  machinery by xDS, such that when the watching goroutine terminated the
  snapshots would never be fetched. This caused some of the xDS machinery to
  get indefinitely paused and not finish the teardown properly.

- Every 30s we now attempt to re-insert all locally registered services into
  the proxycfg machinery.

- When services are re-inserted into the proxycfg machinery we special case
  "dead" ones such that we unilaterally replace them rather that doing that
  conditionally.
2023-03-03 14:27:53 -06:00