Commit Graph

590 Commits (0ac8ae6c3beb8641927f0880d5c8c4e9754e2d82)

Author SHA1 Message Date
Derek Menteer 0ac8ae6c3b
Fix xDS deadlock due to syncLoop termination. (#20867)
* Fix xDS deadlock due to syncLoop termination.

This fixes an issue where agentless xDS streams can deadlock permanently until
a server is restarted. When this issue occurs, no new proxies are able to
successfully connect to the server.

Effectively, the trigger for this deadlock stems from the following return
statement:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L199-L202

When this happens, the entire `syncLoop()` terminates and stops consuming from
the following channel:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L182-L192

Which results in the `ConfigSource.cleanup()` function never receiving a
response and holding a mutex indefinitely:
https://github.com/hashicorp/consul/blob/v1.18.0/agent/proxycfg-sources/catalog/config_source.go#L241-L247

Because this mutex is shared, it effectively deadlocks the server's ability to
process new xDS streams.

----

The fix to this issue involves removing the `chan chan struct{}` used like an
RPC-over-channels pattern and replacing it with two distinct channels:

+ `stopSyncLoopCh` - indicates that the `syncLoop()` should terminate soon.  +
`syncLoopDoneCh` - indicates that the `syncLoop()` has terminated.

Splitting these two concepts out and deferring a `close(syncLoopDoneCh)` in the
`syncLoop()` function ensures that the deadlock above should no longer occur.

We also now evict xDS connections of all proxies for the corresponding
`syncLoop()` whenever it encounters an irrecoverable error. This is done by
hoisting the new `syncLoopDoneCh` upwards so that it's visible to the xDS delta
processing. Prior to this fix, the behavior was to simply orphan them so they
would never receive catalog-registration or service-defaults updates.

* Add changelog.
2024-03-15 13:57:11 -05:00
Derek Menteer eabff257d7
Various bug-fixes and improvements (#20866)
* Shuffle the list of servers returned by `pbserverdiscovery.WatchServers`.

This randomizes the list of servers to help reduce the chance of clients
all connecting to the same server simultaneously. Consul-dataplane is one
such client that does not randomize its own list of servers.

* Fix potential goroutine leak in xDS recv loop.

This commit ensures that the goroutine which receives xDS messages from
proxies will not block forever if the stream's context is cancelled but
the `processDelta()` function never consumes the message (due to being
terminated).

* Add changelog.
2024-03-15 13:10:48 -05:00
sarahalsmiller 262f435800
NET-6821 Disable Terminating Gateway Auto Host Header Rewrite (#20802)
* disable terminating gateway auto host rewrite

* add changelog

* clean up unneeded additional snapshot fields

* add new field to docs

* squash

* fix test
2024-03-12 15:37:20 -05:00
skpratt 57bad0df85
add traffic permissions excludes and tests (#20453)
* add traffic permissions tests

* review fixes

* Update internal/mesh/internal/controllers/sidecarproxy/builder/local_app.go

Co-authored-by: John Landa <jonathanlanda@gmail.com>

---------

Co-authored-by: John Landa <jonathanlanda@gmail.com>
2024-02-07 20:21:44 +00:00
Chris S. Kim b6f10bc58f
Skip filter chain created by permissive mtls (#20406) 2024-01-31 16:39:12 -05:00
Derek Menteer 3e8ec8d18e
Fix SAN matching on terminating gateways (#20417)
Fixes issue: hashicorp/consul#20360

A regression was introduced in hashicorp/consul#19954 where the SAN validation
matching was reduced from 4 potential types down to just the URI.

Terminating gateways will need to match on many fields depending on user
configuration, since they make egress calls outside of the cluster. Having more
than one matcher behaves like an OR operation, where any match is sufficient to
pass the certificate validation. To maintain backwards compatibility with the
old untyped `match_subject_alt_names` Envoy behavior, we should match on all 4
enum types.

https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/transport_sockets/tls/v3/common.proto#enum-extensions-transport-sockets-tls-v3-subjectaltnamematcher-santype
2024-01-31 12:17:45 -06:00
Matt Keeler 34a32d4ce5
Remove V2 PeerName field from pbresource.Tenancy (#19865)
The peer name will eventually show up elsewhere in the resource. For now though this rips it out of where we don’t want it to be.
2024-01-29 15:08:31 -05:00
skpratt 0abf8f8426
Net 5092/internal l7 traffic permissions (#20276)
* wire up L7 Traffic Permissions

* testing

* update comment
2024-01-23 20:07:58 -06:00
Lord-Y 758ddf84e9
Case sensitive route match (#19647)
Add case insensitive param on service route match

This commit adds in a new feature that allows service routers to specify that
paths and path prefixes should ignore upper / lower casing when matching URLs.

Co-authored-by: Derek Menteer <105233703+hashi-derek@users.noreply.github.com>
2024-01-22 09:23:24 -06:00
John Murret 7a410d7c5b
NET-6945 - Replace usage of deprecated Envoy field envoy.config.core.v3.HeaderValueOption.append (#20078)
* NET-6945 - Replace usage of deprecated Envoy field envoy.config.core.v3.HeaderValueOption.append

* update proto for v2 and then update xds v2 logic

* add changelog

* Update 20078.txt to be consistent with existing changelog entries

* swap enum values tomatch envoy.
2024-01-04 00:36:25 +00:00
John Murret d925e4b812
NET-6946 / NET-6941 - Replace usage of deprecated Envoy fields envoy.config.route.v3.HeaderMatcher.safe_regex_match and envoy.type.matcher.v3.RegexMatcher.google_re2 (#20013)
* NET-6946 - Replace usage of deprecated Envoy field envoy.config.route.v3.HeaderMatcher.safe_regex_match

* removing unrelated changes

* update golden files

* do not set engine type
2024-01-03 09:53:39 -07:00
John Murret 2f335113f8
NET-6943 - Replace usage of deprecated Envoy field envoy.config.router.v3.WeightedCluster.total_weight. (#20011) 2023-12-22 19:49:44 +00:00
John Murret 90cd56c5c3
NET-4774 - replace usage of deprecated Envoy field match_subject_alt_names (#19954) 2023-12-22 18:34:44 +00:00
John Murret 21ea5c92fd
NET-6944 - Replace usage of deprecated Envoy field envoy.extensions.filters.http.lua.v3.Lua.inline_code (#20012) 2023-12-22 17:20:41 +00:00
Nitya Dhanushkodi 9975b8bd73
[NET-5455] Allow disabling request and idle timeouts with negative values in service router and service resolver (#19992)
* add coverage for testing these timeouts
2023-12-19 15:36:07 -08:00
Derek Menteer dfab5ade50
Fix ClusterLoadAssignment timeouts dropping endpoints. (#19871)
When a large number of upstreams are configured on a single envoy
proxy, there was a chance that it would timeout when waiting for
ClusterLoadAssignments. While this doesn't always immediately cause
issues, consul-dataplane instances appear to consistently drop
endpoints from their configurations after an xDS connection is
re-established (the server dies, random disconnect, etc).

This commit adds an `xds_fetch_timeout_ms` config to service registrations
so that users can set the value higher for large instances that have
many upstreams. The timeout can be disabled by setting a value of `0`.

This configuration was introduced to reduce the risk of causing a
breaking change for users if there is ever a scenario where endpoints
would never be received. Rather than just always blocking indefinitely
or for a significantly longer period of time, this config will affect
only the service instance associated with it.
2023-12-11 09:25:11 -06:00
Derek Menteer 0ac958f27b
Fix xDS missing endpoint race condition. (#19866)
This fixes the following race condition:
- Send update endpoints
- Send update cluster
- Recv ACK endpoints
- Recv ACK cluster

Prior to this fix, it would have resulted in the endpoints NOT existing in
Envoy. This occurred because the cluster update implicitly clears the endpoints
in Envoy, but we would never re-send the endpoint data to compensate for the
loss, because we would incorrectly ACK the invalid old endpoint hash. Since the
endpoint's hash did not actually change, they would not be resent.

The fix for this is to effectively clear out the invalid pending ACKs for child
resources whenever the parent changes. This ensures that we do not store the
child's hash as accepted when the race occurs.

An escape-hatch environment variable `XDS_PROTOCOL_LEGACY_CHILD_RESEND` was
added so that users can revert back to the old legacy behavior in the event
that this produces unknown side-effects. Visit the following thread for some
extra context on why certainty around these race conditions is difficult:
https://github.com/envoyproxy/envoy/issues/13009

This bug report and fix was mostly implemented by @ksmiley with some minor
tweaks.

Co-authored-by: Keith Smiley <ksmiley@salesforce.com>
2023-12-08 11:37:12 -06:00
Thomas Eckert 8125a32a4e
Add CE version of Gateway Upstream Disambiguation (#19860)
* Add CE version of gateway-upstream-disambiguation

* Use NamespaceOrDefault and PartitionOrDefault

* Add Changelog entry

* Remove the unneeded reassignment

* Use c.ID()
2023-12-07 17:56:14 -05:00
Dhia Ayachi d93f7f730d
parse config protocol on write to optimize disco-chain compilation (#19829)
* parse config protocol on write to optimize disco-chain compilation

* add changelog
2023-12-07 13:46:46 -05:00
John Murret 780e91688d
Migrate remaining individual resource tests for service mesh to TestAllResourcesFromSnapshot (#19583)
* migrate expose checks and paths  tests to resources_test.go

* fix failing expose paths tests

* fix the way endpoint resources get created to make expose tests pass.

* remove endpoint resources that are already inlined on local_app clusters

* renaiming and comments

* migrate remaining service mesh tests to resources_test.go

* cleanup

* update proxystateconverter to skip ading alpn to clusters and listener filterto match v1 behavior
2023-11-09 20:08:37 +00:00
John Murret f5bf256425
Migrate individual resource tests for API Gateway to TestAllResourcesFromSnapshot (#19584)
migrate individual api gateway tests to resources_test.go
2023-11-09 17:01:54 +00:00
John Murret a94fa4c3ed
Migrate individual resource tests for Mesh Gateway to TestAllResourcesFromSnapshot (#19502)
migrate mesh-gateway tests to resources_test.go
2023-11-09 16:39:16 +00:00
John Murret 4aa95f3d1f
Migrate individual resource tests for Ingress Gateway to TestAllResourcesFromSnapshot (#19506)
migrate ingress-gateway tests to resources_test.go
2023-11-09 16:08:07 +00:00
John Murret 2553d6e8b9
Migrate individual resource tests for Terminating Gateway to TestAllResourcesFromSnapshot (#19505)
migrate terminating-gateway tests to resources_test.go
2023-11-09 08:38:33 -07:00
John Murret 7de0b45ba4
Fix xds v2 from creating envoy endpoint resources when already inlined in the cluster (#19580)
* migrate expose checks and paths  tests to resources_test.go

* fix failing expose paths tests

* fix the way endpoint resources get created to make expose tests pass.

* wip

* remove endpoint resources that are already inlined on local_app clusters

* renaiming and comments
2023-11-08 22:18:51 +00:00
John Murret 5aff19f9bc
Migrate individual resource tests for JWT Provider to TestAllResourcesFromSnapshot (#19511)
migrate jwt provider tests to resources_test.go
2023-11-08 14:34:40 -07:00
John Murret 903ff7fccb
Migrate individual resource tests for custom configuration to TestAllResourcesFromSnapshot (#19512)
* Configure TestAllResourcesFromSnapshot to run V2 tests

* migrate custom configuration tests to resources_test.go
2023-11-08 10:34:23 -07:00
John Murret 09f73d1abf
Migrate individual resource tests for expose paths and checks to TestAllResourcesFromSnapshot (#19513)
* migrate expose checks and paths  tests to resources_test.go

* fix failing expose paths tests
2023-11-08 14:24:27 +00:00
John Murret 7bc2581c81
Migrate individual resource tests for Discovery Chains to TestAllResourcesFromSnapshot (#19508)
migrate disco chain tests to resources_test.go
2023-11-08 01:34:42 +00:00
John Murret f115cdb1d5
NET-6385 - Static routes that are inlined in listener filters are also created as a resource. (#19459)
* cover all protocols in local_app golden tests

* fix xds tests

* updating latest

* fix broken test

* add sorting of routers to TestBuildLocalApp to get rid of the flaking

* cover all protocols in local_app golden tests

* cover all protocols in local_app golden tests

* cover all protocols in local_app golden tests

* process envoy resource by walking the map.  use a map rather than array for envoy resource to prevent duplication.

* cleanup.  doc strings.

* update to latest

* fix broken test

* update tests after adding sorting of routers in local_app builder tests

* do not make endpoints for local_app

* fix catalog destinations only by creating clusters for any cluster not already created by walking the graph.

* Configure TestAllResourcesFromSnapshot to run V2 tests

* wip

* fix processing of failover groups

* add endpoints and clusters for any clusters that were not created from walking the listener -> path

* fix xds v2 golden files for clusters to include failover group clusters
2023-11-07 08:00:08 -07:00
John Murret 74daaa5043
XDS V1 should not make runs for TCP Disco Chains. (#19496)
* XDS V1 should not make runs for TCP Disco Chains.

* update TestEnvoyExtenderWithSnapshot
2023-11-03 14:53:17 -06:00
John Murret f0cf8f2f40
NET-6294 - v1 Agentless proxycfg datasource errors after v2 changes (#19365) 2023-10-27 14:06:38 -06:00
Michael Zalimeni a7803bd829
[NET-6305] xds: Ensure v2 route match and protocol are populated for gRPC (#19343)
* xds: Ensure v2 route match is populated for gRPC

Similar to HTTP, ensure that route match config (which is required by
Envoy) is populated when default values are used.

Because the default matches generated for gRPC contain a single empty
`GRPCRouteMatch`, and that proto does not directly support prefix-based
config, an interpretation of the empty struct is needed to generate the
same output that the `HTTPRouteMatch` is explicitly configured to
provide in internal/mesh/internal/controllers/routes/generate.go.

* xds: Ensure protocol set for gRPC resources

Add explicit protocol in `ProxyStateTemplate` builders and validate it
is always set on clusters. This ensures that HTTP filters and
`http2_protocol_options` are populated in all the necessary places for
gRPC traffic and prevents future unintended omissions of non-TCP
protocols.

Co-authored-by: John Murret <john.murret@hashicorp.com>

---------

Co-authored-by: John Murret <john.murret@hashicorp.com>
2023-10-25 17:43:58 +00:00
Andrew Stucki e414cbee4a
Use strict DNS for mesh gateways with hostnames (#19268)
* Use strict DNS for mesh gateways with hostnames

* Add changelog
2023-10-24 15:04:14 -04:00
Michael Zalimeni 5e517c5980
[NET-6221] Ensure LB policy set for locality-aware routing (CE) (#19283)
Ensure LB policy set for locality-aware routing (CE)

`overprovisioningFactor` should be overridden with the expected value
(100,000) when there are multiple endpoint groups. Update code and
tests to enforce this.

This is an Enterprise feature. This commit represents the CE portions of
the change; tests are added in the corresponding `consul-enterprise`
change.
2023-10-19 10:13:27 -04:00
John Maguire b78465b491
[NET-5810] CE changes for multiple virtual hosts (#19246)
CE changes for multiple virtual hosts
2023-10-17 15:08:04 +00:00
Thomas Eckert 76c60fdfac
Golden File Tests for TermGW w/ Cluster Peering (#19096)
Add intention to create golden file for terminating gateway peered trust bundle
2023-10-13 11:56:58 -04:00
Nitya Dhanushkodi 95d9b2c7e4
[NET-4931] xdsv2, sidecarproxycontroller, l4 trafficpermissions: support L7 (#19185)
* xdsv2: support l7 by adding xfcc policy/headers, tweaking routes, and make a bunch of listeners l7 tests pass

* sidecarproxycontroller: add l7 local app support 

* trafficpermissions: make l4 traffic permissions work on l7 workloads

* rename route name field for consistency with l4 cluster name field

* resolve conflicts and rebase

* fix: ensure route name is used in l7 destination route name as well. previously it was only in the route names themselves, now the route name and l7 destination route name line up
2023-10-12 23:45:45 +00:00
John Maguire 7a323c492b
[NET-5457] Golden Files for Multiple Virtual Hosts (#19131)
* Add new golden file tests

* Update with latest deterministic code
2023-10-11 18:11:29 +00:00
John Maguire 8bebfc147d
[NET-5457] Fix CE code for jwt multiple virtual hosts bug (#19123)
* Fix CE code for jwt multiple virtual hosts bug

* Fix struct definition

* fix bug with always appending route to jwt config

* Update comment to be correct

* Update comment
2023-10-10 16:25:36 -04:00
Chris S. Kim 92ce814693
Remove old build tags (#19128) 2023-10-10 10:58:06 -04:00
Thomas Eckert 342306c312
Allow connections through Terminating Gateways from peered clusters NET-3463 (#18959)
* Add InboundPeerTrustBundle maps to Terminating Gateway

* Add notify and cancelation of watch for inbound peer trust bundles

* Pass peer trust bundles to the RBAC creation function

* Regenerate Golden Files

* add changelog, also adds another spot that needed peeredTrustBundles

* Add basic test for terminating gateway with peer trust bundle

* Add intention to cluster peered golden test

* rerun codegen

* update changelog

* really update the changelog

---------

Co-authored-by: Melisa Griffin <melisa.griffin@hashicorp.com>
2023-10-05 21:54:23 +00:00
Eric Haberkorn f2b7b4591a
Fix Traffic Permissions Default Deny (#19028)
Whenver a traffic permission exists for a given workload identity, turn on default deny.

Previously, this was only working at the port level.
2023-10-04 09:58:28 -04:00
sarahalsmiller 9addd9ed7c
[NET-5788] Fix needed for JWTAuth in Consul Enterprise (#19038)
change needed for fix in consul-enterprise
2023-10-03 09:48:50 -05:00
Nitya Dhanushkodi 9a48266712
remove log (#19029) 2023-09-29 16:11:50 -07:00
Eric Haberkorn 7ce6ebaeb3
Handle Traffic Permissions With Empty Sources Properly (#19024)
Fix issues with empty sources

* Validate that each permission on traffic permissions resources has at least one source.
* Don't construct RBAC policies when there aren't any principals. This resulted in Envoy rejecting xDS updates with a validation error.

```
error=
  | rpc error: code = Internal desc = Error adding/updating listener(s) public_listener: Proto constraint validation failed (RBACValidationError.Rules: embedded message failed validation | caused by RBACValidationError.Policies[consul-intentions-layer4-1]: embedded message failed validation | caused by PolicyValidationError.Principals: value must contain at least 1 item(s)): rules {
```
2023-09-28 15:11:59 -04:00
Iryna Shustava e6b724d062
catalog,mesh,auth: Move resource types to the proto-public module (#18935) 2023-09-22 15:50:56 -06:00
Iryna Shustava d88888ee8b
catalog,mesh,auth: Bump versions to v2beta1 (#18930) 2023-09-22 10:51:15 -06:00
Nitya Dhanushkodi 0a11499588
net-5689 fix disabling panic threshold logic (#18958) 2023-09-21 15:52:30 -07:00
Nitya Dhanushkodi 3a2e62053a
v2: various fixes to make K8s tproxy multiport acceptance tests and manual explicit upstreams (single port) tests pass (#18874)
Adding coauthors who mobbed/paired at various points throughout last week.
Co-authored-by: Dan Stough <dan.stough@hashicorp.com>
Co-authored-by: Iryna Shustava <iryna@hashicorp.com>
Co-authored-by: John Murret <john.murret@hashicorp.com>
Co-authored-by: Michael Zalimeni <michael.zalimeni@hashicorp.com>
Co-authored-by: Ashwin Venkatesh <ashwin@hashicorp.com>
Co-authored-by: Michael Wilkerson <mwilkerson@hashicorp.com>
2023-09-20 00:02:01 +00:00