Commit Graph

2172 Commits (ec0072cec3a85bd7bdc4daa52f0f2f2cf1ab6d1c)

Author SHA1 Message Date
Derek Menteer 2facf50923
Fix configuration merging for implicit tproxy upstreams. (#16000)
Fix configuration merging for implicit tproxy upstreams.

Change the merging logic so that the wildcard upstream has correct proxy-defaults
and service-defaults values combined into it. It did not previously merge all fields,
and the wildcard upstream did not exist unless service-defaults existed (it ignored
proxy-defaults, essentially).

Change the way we fetch upstream configuration in the xDS layer so that it falls back
to the wildcard when no matching upstream is found. This is what allows implicit peer
upstreams to have the correct "merged" config.

Change proxycfg to always watch local mesh gateway endpoints whenever a peer upstream
is found. This simplifies the logic so that we do not have to inspect the "merged"
configuration on peer upstreams to extract the mesh gateway mode.
2023-01-18 13:43:53 -06:00
Dan Upton 7a55de375c
xds: don't attempt to load-balance sessions for local proxies (#15789)
Previously, we'd begin a session with the xDS concurrency limiter
regardless of whether the proxy was registered in the catalog or in
the server's local agent state.

This caused problems for users who run `consul connect envoy` directly
against a server rather than a client agent, as the server's locally
registered proxies wouldn't be included in the limiter's capacity.

Now, the `ConfigSource` is responsible for beginning the session and we
only do so for services in the catalog.

Fixes: https://github.com/hashicorp/consul/issues/15753
2023-01-18 12:33:21 -06:00
Dhia Ayachi 87ff8c1c95
avoid logging RPC errors when it's specific rate limiter errors (#15968)
* avoid logging RPC errors when it's specific rate limiter errors

* simplify if statements
2023-01-16 12:08:09 -05:00
Matt Keeler 5afd4657ec
Protobuf Modernization (#15949)
* Protobuf Modernization

Remove direct usage of golang/protobuf in favor of google.golang.org/protobuf

Marshallers (protobuf and json) needed some changes to account for different APIs.

Moved to using the google.golang.org/protobuf/types/known/* for the well known types including replacing some custom Struct manipulation with whats available in the structpb well known type package.

This also updates our devtools script to install protoc-gen-go from the right location so that files it generates conform to the correct interfaces.

* Fix go-mod-tidy make target to work on all modules
2023-01-11 09:39:10 -05:00
Chris S. Kim a7b34d50fc
Output user-friendly name for anonymous token (#15884) 2023-01-09 12:28:53 -06:00
Dan Upton 644cd864a5
Rate limit improvements and fixes (#15917)
- Fixes a panic when Operation.SourceAddr is nil (internal net/rpc calls)
- Adds proper HTTP response codes (429 and 503) for rate limit errors
- Makes the error messages clearer
- Enables automatic retries for rate-limit errors in the net/rpc stack
2023-01-09 10:20:05 +00:00
Semir Patel 40c0bb24ae
emit metrics for global rate limiting (#15891) 2023-01-06 17:49:33 -06:00
Dhia Ayachi 233eacf0a4
inject logger and create logdrop sink (#15822)
* inject logger and create logdrop sink

* init sink with an empty struct instead of nil

* wrap a logger instead of a sink and add a discard logger to avoid double logging

* fix compile errors

* fix linter errors

* Fix bug where log arguments aren't properly formatted

* Move log sink construction outside of handler

* Add prometheus definition and docs for log drop counter

Co-authored-by: Daniel Upton <daniel@floppy.co>
2023-01-06 11:33:53 -07:00
Dan Upton d53ce39c32
grpc: switch servers and retry on error (#15892)
This is the OSS portion of enterprise PR 3822.

Adds a custom gRPC balancer that replicates the router's server cycling
behavior. Also enables automatic retries for RESOURCE_EXHAUSTED errors,
which we now get for free.
2023-01-05 10:21:27 +00:00
Florian Apolloner 077b0a48a3
Allow Operator Generated bootstrap token (#14437)
Add support to provide an initial token via the bootstrap HTTP API, similar to hashicorp/nomad#12520
2023-01-04 20:19:33 +00:00
Semir Patel a6482341a5
Wire up the rate limiter to net/rpc calls (#15879) 2023-01-04 13:38:44 -06:00
Dan Upton 7747384f1f
Wire in rate limiter to handle internal and external gRPC calls (#15857) 2022-12-23 13:42:16 -06:00
John Murret f5e01f8c6b
Rate Limit Handler - ensure rate limiting is not in the code path when not configured (#15819)
* Rate limiting handler - ensure configuration has changed before modifying limiters

* Updating test to validate arguments to UpdateConfig

* Removing duplicate test.  Updating mock.

* Renaming NullRateLimiter to NullRequestLimitsHandler

* Rate Limit Handler - ensure rate limiting is not in the code path when not configured

* Update agent/consul/rate/handler.go

Co-authored-by: Dhia Ayachi <dhia@hashicorp.com>

* formatting handler.go

* Rate limiting handler - ensure configuration has changed before modifying limiters

* Updating test to validate arguments to UpdateConfig

* Removing duplicate test.  Updating mock.

* adding logging for when UpdateConfig is called but the config has not changed.

* Update agent/consul/rate/handler.go

Co-authored-by: Dhia Ayachi <dhia@hashicorp.com>

* Update agent/consul/rate/handler_test.go

Co-authored-by: Dan Upton <daniel@floppy.co>

* modifying existing variable name based on pr feedback

* updating a broken merge conflict;

Co-authored-by: Dhia Ayachi <dhia@hashicorp.com>
Co-authored-by: Dan Upton <daniel@floppy.co>
2022-12-20 15:00:22 -07:00
John Murret aba43d85d9
Rate limiting handler - ensure configuration has changed before modifying limiters (#15805)
* Rate limiting handler - ensure configuration has changed before modifying limiters

* Updating test to validate arguments to UpdateConfig

* Removing duplicate test.  Updating mock.

* adding logging for when UpdateConfig is called but the config has not changed.

* Update agent/consul/rate/handler.go

Co-authored-by: Dhia Ayachi <dhia@hashicorp.com>

Co-authored-by: Dhia Ayachi <dhia@hashicorp.com>
2022-12-20 14:12:03 -07:00
Derek Menteer 74b11c416c
Fix incorrect protocol check on discovery chains with peer targets. (#15833) 2022-12-20 10:15:03 -06:00
Semir Patel 799b34f1a9
Map net/rpc endpoints to a read/write/exempt op for rate-limiting (#15825)
Also fixed TestRequestRecorder flaky tests due to loss of precision in elapsed time in the test.
2022-12-19 16:04:52 -06:00
Nitya Dhanushkodi d382ca0aec
extensions: refactor serverless plugin to use extensions from config entry fields (#15817)
docs: update config entry docs and the Lambda manual registration docs

Co-authored-by: Nitya Dhanushkodi <nitya@hashicorp.com>
Co-authored-by: Eric <eric@haberkorn.co>
2022-12-19 12:19:37 -08:00
Andrew Stucki ab199a11b0
Add async reconciliation controller subpackage (#15534)
* Add async reconciliation controller subpackage

* Address initial feedback

* Add tests for panic assertions

* Fix comment
2022-12-16 16:49:26 -05:00
Dhia Ayachi f04f88e4b9
add missing code and fix enterprise specific code (#15375)
* add missing code and fix enterprise specific code

* fix retry

* fix flaky tests

* fix linter error in test
2022-12-16 16:31:05 -05:00
Dhia Ayachi 6468e3e09c
Server side rate limiter: handle the race condition for limiters tree write in multilimiter (#15767)
* change to perform all tree writes in the same go routine to avoid race condition.

* rename runStoreOnce to reconcile

* Apply suggestions from code review

Co-authored-by: Dan Upton <daniel@floppy.co>

* reduce nesting

Co-authored-by: Dan Upton <daniel@floppy.co>
2022-12-14 17:32:11 +00:00
Semir Patel bafa5c7156
Pass remote addr of incoming HTTP requests through to RPC(..) calls (#15700) 2022-12-14 09:24:22 -06:00
John Murret e027c94b52
adding config for request_limits (#15531)
* server: add placeholder glue for rate limit handler

This commit adds a no-op implementation of the rate-limit handler and
adds it to the `consul.Server` struct and setup code.

This allows us to start working on the net/rpc and gRPC interceptors and
config logic.

* Add handler errors

* Set the global read and write limits

* fixing multilimiter moving packages

* Fix typo

* Simplify globalLimit usage

* add multilimiter and tests

* exporting LimitedEntity

* Apply suggestions from code review

Co-authored-by: John Murret <john.murret@hashicorp.com>

* add config update and rename config params

* add doc string and split config

* Apply suggestions from code review

Co-authored-by: Dan Upton <daniel@floppy.co>

* use timer to avoid go routine leak and change the interface

* add comments to tests

* fix failing test

* add prefix with config edge, refactor tests

* Apply suggestions from code review

Co-authored-by: Dan Upton <daniel@floppy.co>

* refactor to apply configs for limiters under a prefix

* add fuzz tests and fix bugs found. Refactor reconcile loop to have a simpler logic

* make KeyType an exported type

* split the config and limiter trees to fix race conditions in config update

* rename variables

* fix race in test and remove dead code

* fix reconcile loop to not create a timer on each loop

* add extra benchmark tests and fix tests

* fix benchmark test to pass value to func

* server: add placeholder glue for rate limit handler

This commit adds a no-op implementation of the rate-limit handler and
adds it to the `consul.Server` struct and setup code.

This allows us to start working on the net/rpc and gRPC interceptors and
config logic.

* Set the global read and write limits

* fixing multilimiter moving packages

* add server configuration for global rate limiting.

* remove agent test

* remove added stuff from handler

* remove added stuff from multilimiter

* removing unnecessary TODOs

* Removing TODO comment from handler

* adding in defaulting to infinite

* add disabled status in there

* adding in documentation for disabled mode.

* make disabled the default.

* Add mock and agent test

* addig documentation and missing mock file.

* Fixing test TestLoad_IntegrationWithFlags

* updating docs based on PR feedback.

* Updating Request Limits mode to use int based on PR feedback.

* Adding RequestLimits struct so we have a nested struct in ReloadableConfig.

* fixing linting references

* Update agent/consul/rate/handler.go

Co-authored-by: Dan Upton <daniel@floppy.co>

* Update agent/consul/config.go

Co-authored-by: Dan Upton <daniel@floppy.co>

* removing the ignore of the request limits in JSON.  addingbuilder logic to convert any read rate or write rate less than 0 to rate.Inf

* added conversion function to convert request limits object to handler config.

* Updating docs to reflect gRPC and RPC are rate limit and as a result, HTTP requests are as well.

* Updating values for TestLoad_FullConfig() so that they were different and discernable.

* Updating TestRuntimeConfig_Sanitize

* Fixing TestLoad_IntegrationWithFlags test

* putting nil check in place

* fixing rebase

* removing change for missing error checks.  will put in another PR

* Rebasing after default multilimiter config change

* resolving rebase issues

* updating reference for incomingRPCLimiter to use interface

* updating interface

* Updating interfaces

* Fixing mock reference

Co-authored-by: Daniel Upton <daniel@floppy.co>
Co-authored-by: Dhia Ayachi <dhia@hashicorp.com>
2022-12-13 13:09:55 -07:00
Derek Menteer e87d35e313
Fix DialedDirectly configuration for Consul dataplane. (#15760)
Fix DialedDirectly configuration for Consul dataplane.
2022-12-13 09:16:31 -06:00
Dan Upton c692802dec
grpc: add rate-limiting middleware (#15550)
Implements the gRPC middleware for rate-limiting as a tap.ServerInHandle
function (executed before the request is unmarshaled).

Mappings between gRPC methods and their operation type are generated by
a protoc plugin introduced by #15564.
2022-12-13 15:01:56 +00:00
Dan Upton eef38c2199
server: add placeholder glue for rate limit handler (#15539)
Adds a no-op implementation of the rate-limit handler and exposes
it on the consul.Server struct.

It allows us to start working on the net/rpc and gRPC interceptors
and config (re)loading logic, without having to implement the full
handler up-front.

Co-authored-by: John Murret <john.murret@hashicorp.com>
Co-authored-by: Dhia Ayachi <dhia@hashicorp.com>
2022-12-13 11:41:54 +00:00
Dhia Ayachi 81e40c1fac
add multilimiter and tests (#15467)
* add multilimiter and tests

* exporting LimitedEntity

* go mod tidy

* Apply suggestions from code review

Co-authored-by: John Murret <john.murret@hashicorp.com>

* add config update and rename config params

* add doc string and split config

* Apply suggestions from code review

Co-authored-by: Dan Upton <daniel@floppy.co>

* use timer to avoid go routine leak and change the interface

* add comments to tests

* fix failing test

* add prefix with config edge, refactor tests

* Apply suggestions from code review

Co-authored-by: Dan Upton <daniel@floppy.co>

* refactor to apply configs for limiters under a prefix

* add fuzz tests and fix bugs found. Refactor reconcile loop to have a simpler logic

* make KeyType an exported type

* split the config and limiter trees to fix race conditions in config update

* rename variables

* fix race in test and remove dead code

* fix reconcile loop to not create a timer on each loop

* add extra benchmark tests and fix tests

* fix benchmark test to pass value to func

* use a separate go routine to write limiters (#15643)

* use a separate go routine to write limiters

* Add updating limiter when another limiter is created

* fix waiter to be a ticker, so we commit more than once.

* fix tests and add tests for coverage

* unexport members and add tests

* make UpdateConfig thread safe and multi call to Run safe

* replace swith with if

* fix review comments

* replace time.sleep with retries

* fix flaky test and remove unnecessary init

* fix test races

* remove unnecessary negative test case

* remove fixed todo

Co-authored-by: John Murret <john.murret@hashicorp.com>
Co-authored-by: Dan Upton <daniel@floppy.co>
2022-12-08 14:42:07 -05:00
R.B. Boyer 900584ca82
connect: ensure all vault connect CA tests use limited privilege tokens (#15669)
All of the current integration tests where Vault is the Connect CA now use non-root tokens for the test. This helps us detect privilege changes in the vault model so we can keep our guides up to date.

One larger change was that the RenewIntermediate function got refactored slightly so it could be used from a test, rather than the large duplicated function we were testing in a test which seemed error prone.
2022-12-06 10:06:36 -06:00
R.B. Boyer 4940a728ab
Detect Vault 1.11+ import in secondary datacenters and update default issuer (#15661)
The fix outlined and merged in #15253 fixed the issue as it occurs in the primary
DC. There is a similar issue that arises when vault is used as the Connect CA in a
secondary datacenter that is fixed by this PR.

Additionally: this PR adds support to run the existing suite of vault related integration
tests against the last 4 versions of vault (1.9, 1.10, 1.11, 1.12)
2022-12-05 15:39:21 -06:00
Chris S. Kim c046d1a4d8
Add warn log when all ACL policies are filtered out (#15632) 2022-12-05 11:26:10 -05:00
Freddy 941f6da202
Remove log line about server mgmt token init (#15610)
* Remove log line about server mgmt token init

Currently the server management token is only being bootstrapped in the
primary datacenter. That means that servers on the secondary datacenter
will never have this token available, and would log this line any time a
token is resolved.

Bootstrapping the token in secondary datacenters will be done in a
follow-up.

* Add changelog entry
2022-11-29 17:56:03 -05:00
cskh 97c9432843
fix(peering): increase the gRPC limit to 8MB (#15503)
* fix(peering): increase the gRPC limit to 50MB

* changelog

* update gRPC limit to 8MB
2022-11-28 17:48:43 -05:00
Chris S. Kim 27c53f6c82
Use backport-compatible assertion (#15546)
* Use backport-compatible assertion

* Add workaround for broken apt-get
2022-11-24 11:44:20 -05:00
Chris S. Kim 386da5439a
Use rpcHoldTimeout to calculate blocking timeout (#15541)
Adds buffer to clients so that servers have time to respond to blocking queries.
2022-11-24 10:13:02 -05:00
Kyle Havlovitz f4c3e54b11
auto-config: relax node name validation for JWT authorization (#15370)
* auto-config: relax node name validation for JWT authorization

This changes the JWT authorization logic to allow all non-whitespace,
non-quote characters when validating node names. Consul had previously
allowed these characters in node names, until this validation was added
to fix a security vulnerability with whitespace/quotes being passed to
the `bexpr` library. This unintentionally broke node names with
characters like `.` which aren't related to this vulnerability.

* Update website/content/docs/agent/config/cli-flags.mdx

Co-authored-by: trujillo-adam <47586768+trujillo-adam@users.noreply.github.com>

Co-authored-by: trujillo-adam <47586768+trujillo-adam@users.noreply.github.com>
2022-11-14 18:24:40 -06:00
Dhia Ayachi 225ae55e83
Leadership transfer cmd (#14132)
* add leadership transfer command

* add RPC call test (flaky)

* add missing import

* add changelog

* add command registration

* Apply suggestions from code review

Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>

* add the possibility of providing an id to raft leadership transfer. Add few tests.

* delete old file from cherry pick

* rename changelog filename to PR #

* rename changelog and fix import

* fix failing test

* check for OperatorWrite

Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>

* rename from leader-transfer to transfer-leader

* remove version check and add test for operator read

* move struct to operator.go

* first pass

* add code for leader transfer in the grpc backend and tests

* wire the http endpoint to the new grpc endpoint

* remove the RPC endpoint

* remove non needed struct

* fix naming

* add mog glue to API

* fix comment

* remove dead code

* fix linter error

* change package name for proto file

* remove error wrapping

* fix failing test

* add command registration

* add grpc service mock tests

* fix receiver to be pointer

* use defined values

Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>

* reuse MockAclAuthorizer

* add documentation

* remove usage of external.TokenFromContext

* fix failing tests

* fix proto generation

* Apply suggestions from code review

Co-authored-by: Jared Kirschner <85913323+jkirschner-hashicorp@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Jared Kirschner <85913323+jkirschner-hashicorp@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Jared Kirschner <85913323+jkirschner-hashicorp@users.noreply.github.com>

* Apply suggestions from code review

* add more context in doc for the reason

* Apply suggestions from docs code review

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

* regenerate proto

* fix linter errors

Co-authored-by: github-team-consul-core <github-team-consul-core@hashicorp.com>
Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>
Co-authored-by: Jared Kirschner <85913323+jkirschner-hashicorp@users.noreply.github.com>
Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
2022-11-14 15:35:12 -05:00
Freddy c58f86a00f
Fixup authz for data imported from peers (#15347)
There are a few changes that needed to be made to to handle authorizing
reads for imported data:

- If the data was imported from a peer we should not attempt to read the
  data using the traditional authz rules. This is because the name of
  services/nodes in a peer cluster are not equivalent to those of the
  importing cluster.

- If the data was imported from a peer we need to check whether the
  token corresponds to a service, meaning that it has service:write
  permissions, or to a local read only token that can read all
  nodes/services in a namespace.

This required changes at the policyAuthorizer level, since that is the
only view available to OSS Consul, and at the enterprise
partition/namespace level.
2022-11-14 11:36:27 -07:00
Dan Stough 626249fbf5
[OSS] fix: wait and try longer to peer through mesh gw (#15328) 2022-11-10 13:54:00 -05:00
Kyle Schochenmaier bf0f61a878
removes ioutil usage everywhere which was deprecated in go1.16 (#15297)
* update go version to 1.18 for api and sdk, go mod tidy
* removes ioutil usage everywhere which was deprecated in go1.16 in favour of io and os packages. Also introduces a lint rule which forbids use of ioutil going forward.
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
2022-11-10 10:26:01 -06:00
malizz b51f0e25e9
update ACLs for cluster peering (#15317)
* update ACLs for cluster peering

* add changelog

* Update .changelog/15317.txt

Co-authored-by: Eric Haberkorn <erichaberkorn@gmail.com>

Co-authored-by: Eric Haberkorn <erichaberkorn@gmail.com>
2022-11-09 13:02:58 -08:00
Derek Menteer 418bd62c44
Fix mesh gateway configuration with proxy-defaults (#15186)
* Fix mesh gateway proxy-defaults not affecting upstreams.

* Clarify distinction with upstream settings

Top-level mesh gateway mode in proxy-defaults and service-defaults gets
merged into NodeService.Proxy.MeshGateway, and only gets merged with
the mode attached to an an upstream in proxycfg/xds.

* Fix mgw mode usage for peered upstreams

There were a couple issues with how mgw mode was being handled for
peered upstreams.

For starters, mesh gateway mode from proxy-defaults
and the top-level of service-defaults gets stored in
NodeService.Proxy.MeshGateway, but the upstream watch for peered data
was only considering the mesh gateway config attached in
NodeService.Proxy.Upstreams[i]. This means that applying a mesh gateway
mode via global proxy-defaults or service-defaults on the downstream
would not have an effect.

Separately, transparent proxy watches for peered upstreams didn't
consider mesh gateway mode at all.

This commit addresses the first issue by ensuring that we overlay the
upstream config for peered upstreams as we do for non-peered. The second
issue is addressed by re-using setupWatchesForPeeredUpstream when
handling transparent proxy updates.

Note that for transparent proxies we do not yet support mesh gateway
mode per upstream, so the NodeService.Proxy.MeshGateway mode is used.

* Fix upstream mesh gateway mode handling in xds

This commit ensures that when determining the mesh gateway mode for
peered upstreams we consider the NodeService.Proxy.MeshGateway config as
a baseline.

In absense of this change, setting a mesh gateway mode via
proxy-defaults or the top-level of service-defaults will not have an
effect for peered upstreams.

* Merge service/proxy defaults in cfg resolver

Previously the mesh gateway mode for connect proxies would be
merged at three points:

1. On servers, in ComputeResolvedServiceConfig.
2. On clients, in MergeServiceConfig.
3. On clients, in proxycfg/xds.

The first merge returns a ServiceConfigResponse where there is a
top-level MeshGateway config from proxy/service-defaults, along with
per-upstream config.

The second merge combines per-upstream config specified at the service
instance with per-upstream config specified centrally.

The third merge combines the NodeService.Proxy.MeshGateway
config containing proxy/service-defaults data with the per-upstream
mode. This third merge is easy to miss, which led to peered upstreams
not considering the mesh gateway mode from proxy-defaults.

This commit removes the third merge, and ensures that all mesh gateway
config is available at the upstream. This way proxycfg/xds do not need
to do additional overlays.

* Ensure that proxy-defaults is considered in wc

Upstream defaults become a synthetic Upstream definition under a
wildcard key "*". Now that proxycfg/xds expect Upstream definitions to
have the final MeshGateway values, this commit ensures that values from
proxy-defaults/service-defaults are the default for this synthetic
upstream.

* Add changelog.

Co-authored-by: freddygv <freddy@hashicorp.com>
2022-11-09 10:14:29 -06:00
Dan Upton 7b2d08d461
chore: remove unused argument from MergeNodeServiceWithCentralConfig (#15024)
Previously, the MergeNodeServiceWithCentralConfig method accepted a
ServiceSpecificRequest argument, of which only the Datacenter and
QueryOptions fields were used.

Digging a little deeper, it turns out these fields were only passed
down to the ComputeResolvedServiceConfig method (through the
ServiceConfigRequest struct) which didn't actually use them.

As such, not all call-sites passed a valid ServiceSpecificRequest
so it's safer to remove the argument altogether to prevent future
changes from depending on it.
2022-11-09 14:54:57 +00:00
Derek Menteer b64972d486
Bring back parameter ServerExternalAddresses in GenerateToken endpoint (#15267)
Re-add ServerExternalAddresses parameter in GenerateToken endpoint

This reverts commit 5e156772f6
and adds extra functionality to support newer peering behaviors.
2022-11-08 14:55:18 -06:00
Chris S. Kim 985a4ee1b1
Update hcp-scada-provider to fix diamond dependency problem with go-msgpack (#15185) 2022-11-07 11:34:30 -05:00
Dan Stough 553312ef61
fix: persist peering CA updates to dialing clusters (#15243)
fix: persist peering CA updates to dialing clusters
2022-11-04 12:53:20 -04:00
Derek Menteer fa5d87c116 Decrease retry time for failed peering connections. 2022-10-31 14:30:27 -05:00
R.B. Boyer 97b9fcbf48
test: fix flaky TestSubscribeBackend_IntegrationWithServer_DeliversAllMessages test (#15195)
Allow for some message duplication in subscription events during assertions.

I'm pretty sure the subscriptions machinery allows for messages to occasionally
be duplicated instead of dropping them, as a once-and-only-once queue is a pipe
dream and you have to pick one of the other two options.
2022-10-31 12:10:43 -05:00
Derek Menteer 693c8a4706 Allow peering endpoints to bypass verify_incoming. 2022-10-31 09:56:30 -05:00
Eric Haberkorn cf50bdbe20
Fix peering metrics bug (#15178)
This bug was caused by the peering health metric being set to NaN.
2022-10-28 10:51:12 -04:00
Luke Kysow 9999672fd7
autoencrypt: helpful error for clients with wrong dc (#14832)
* autoencrypt: helpful error for clients with wrong dc

If clients have set a different datacenter than the servers they're
connecting with for autoencrypt, give a helpful error message.
2022-10-25 10:13:41 -07:00
Chris S. Kim bde57c0dd0 Regenerate files according to 1.19.2 formatter 2022-10-24 16:12:08 -04:00
Iryna Shustava 176abb5ff2
proxycfg: watch service-defaults config entries (#15025)
To support Destinations on the service-defaults (for tproxy with terminating gateway), we need to now also make servers watch service-defaults config entries.
2022-10-24 12:50:28 -06:00
Chris S. Kim b236e86030 Move oss-only test to its own file 2022-10-24 14:17:43 -04:00
R.B. Boyer 300860412c
chore: update golangci-lint to v1.50.1 (#15022) 2022-10-24 11:48:02 -05:00
Venu Yanamandra efc813e92d
Update error message when restoring ENT snapshot in OSS (#15066) 2022-10-24 11:40:26 -04:00
Chris S. Kim a7ea26192b Update expected encoding in test
go-memdb was updated in v1.3.3 to make integers in indexes sortable, which changed how integers were encoded.
2022-10-20 14:32:42 -04:00
freddygv 6d9be5fb15 Use plain TaggedAddressWAN 2022-10-19 16:32:44 -06:00
freddygv 8d211cc9cc Add unit test 2022-10-19 16:26:15 -06:00
cskh 058ee4fb84 fix: wan address isn't used by peering token 2022-10-19 16:33:25 -04:00
cskh d562d363fc
peering: skip registering duplicate node and check from the peer (#14994)
* peering: skip register duplicate node and check from the peer

* Prebuilt the nodes map and checks map to avoid repeated for loop

* use key type to struct: node id, service id, and check id
2022-10-18 16:19:24 -04:00
Chris S. Kim 29a297d3e9
Refactor client RPC timeouts (#14965)
Fix an issue where rpc_hold_timeout was being used as the timeout for non-blocking queries. Users should be able to tune read timeouts without fiddling with rpc_hold_timeout. A new configuration `rpc_read_timeout` is created.

Refactor some implementation from the original PR 11500 to remove the misleading linkage between RPCInfo's timeout (used to retry in case of certain modes of failures) and the client RPC timeouts.
2022-10-18 15:05:09 -04:00
Derek Menteer 2a33d0ff96 Fix issue with incorrect method signature on test. 2022-10-14 11:04:57 -05:00
Freddy 24d0c8801a
Merge pull request #14981 from hashicorp/peering/dial-through-gateways 2022-10-14 09:44:56 -06:00
Derek Menteer 29ebcf5ff0 Add tests for peering state snapshots / restores. 2022-10-14 09:48:04 -05:00
Derek Menteer e3ff9912d0 Add test for ExportedServicesForAllPeersByName 2022-10-14 09:48:04 -05:00
freddygv 573aa408a1 Lint 2022-10-13 15:55:55 -06:00
freddygv 2c99a21596 Update leader routine to maybe use gateways 2022-10-13 14:58:00 -06:00
freddygv e69bc727ec Update peering establishment to maybe use gateways
When peering through mesh gateways we expect outbound dials to peer
servers to flow through the local mesh gateway addresses.

Now when establishing a peering we get a list of dial addresses as a
ring buffer that includes local mesh gateway addresses if the local DC
is configured to peer through mesh gateways. The ring buffer includes
the mesh gateway addresses first, but also includes the remote server
addresses as a fallback.

This fallback is present because it's possible that direct egress from
the servers may be allowed. If not allowed then the leader will cycle
back to a mesh gateway address through the ring.

When attempting to dial the remote servers we retry up to a fixed
timeout. If using mesh gateways we also have an initial wait in
order to allow for the mesh gateways to configure themselves.

Note that if we encounter a permission denied error we do not retry
since that error indicates that the secret in the peering token is
invalid.
2022-10-13 14:57:55 -06:00
malizz b0b0cbb8ee
increase protobuf size limit for cluster peering (#14976) 2022-10-13 13:46:51 -07:00
Derek Menteer 8742fbe14f Prevent consul peer-exports by discovery chain. 2022-10-13 12:45:09 -05:00
Derek Menteer f366edcb8d Prevent the "consul" service from being exported. 2022-10-13 12:45:09 -05:00
Derek Menteer caa1396255 Add remote peer partition and datacenter info. 2022-10-13 10:37:41 -05:00
Dan Upton 0af9f16343
bug: fix goroutine leaks caused by incorrect usage of `WatchCh` (#14916)
memdb's `WatchCh` method creates a goroutine that will publish to the
returned channel when the watchset is triggered or the given context
is canceled. Although this is called out in its godoc comment, it's
not obvious that this method creates a goroutine who's lifecycle you
need to manage.

In the xDS capacity controller, we were calling `WatchCh` on each
iteration of the control loop, meaning the number of goroutines would
grow on each autopilot event until there was catalog churn.

In the catalog config source, we were calling `WatchCh` with the
background context, meaning that the goroutine would keep running after
the sync loop had terminated.
2022-10-13 12:04:27 +01:00
Paul Glass d17af23641
gRPC server metrics (#14922)
* Move stats.go from grpc-internal to grpc-middleware
* Update grpc server metrics with server type label
* Add stats test to grpc-external
* Remove global metrics instance from grpc server tests
2022-10-11 17:00:32 -05:00
cskh e0356e1502
fix(peering): add missing grpc_tls_port for server address reconciliation (#14944) 2022-10-11 10:56:29 -04:00
Chris S. Kim b0a4c5c563 Include stream-related information in peering endpoints 2022-10-10 13:20:14 -06:00
Paul Glass c0c187f1c5
Merge central config for GetEnvoyBootstrapParams (#14869)
This fixes GetEnvoyBootstrapParams to merge in proxy-defaults and service-defaults.

Co-authored-by: Dan Upton <daniel@floppy.co>
2022-10-10 12:40:27 -05:00
freddygv 7d4da6eb22 Fixup test 2022-10-07 09:34:16 -06:00
freddygv 3034df6a5c Require Connect and TLS to generate peering tokens
By requiring Connect and a gRPC TLS listener we can automatically
configure TLS for all peering control-plane traffic.
2022-10-07 09:06:29 -06:00
freddygv fac3ddc857 Use internal server certificate for peering TLS
A previous commit introduced an internally-managed server certificate
to use for peering-related purposes.

Now the peering token has been updated to match that behavior:
- The server name matches the structure of the server cert
- The CA PEMs correspond to the Connect CA

Note that if Conect is disabled, and by extension the Connect CA, we
fall back to the previous behavior of returning the manually configured
certs and local server SNI.

Several tests were updated to use the gRPC TLS port since they enable
Connect by default. This means that the peering token will embed the
Connect CA, and the dialer will expect a TLS listener.
2022-10-07 09:05:32 -06:00
John Murret 79a541fd7d
Upgrade serf to v0.10.1 and memberlist to v0.5.0 to get memberlist size metrics and broadcast queue depth metric (#14873)
* updating to serf v0.10.1 and memberlist v0.5.0 to get memberlist size metrics and memberlist broadcast queue depth metric

* update changelog

* update changelog

* correcting changelog

* adding "QueueCheckInterval" for memberlist to test

* updating integration test containers to grab latest api
2022-10-04 17:51:37 -06:00
Eric Haberkorn 1b565444be
Rename `PeerName` to `Peer` on prepared queries and exported services (#14854) 2022-10-04 14:46:15 -04:00
freddygv a8c4d6bc55 Share mgw addrs in peering stream if needed
This commit adds handling so that the replication stream considers
whether the user intends to peer through mesh gateways.

The subscription will return server or mesh gateway addresses depending
on the mesh configuration setting. These watches can be updated at
runtime by modifying the mesh config entry.
2022-10-03 11:42:20 -06:00
freddygv 4ff9d475b0 Return mesh gateway addrs if peering through mgw 2022-10-03 11:35:10 -06:00
Eric Haberkorn 80e51ff907
Add exported services event to cluster peering replication. (#14797) 2022-09-29 15:37:19 -04:00
malizz 84b0f408fa
Support Stale Queries for Trust Bundle Lookups (#14724)
* initial commit

* add tags, add conversations

* add test for query options utility functions

* update previous tests

* fix test

* don't error out on empty context

* add changelog

* update decode config
2022-09-28 09:56:59 -07:00
Nick Ethier 1c1b0994b8
add HCP integration component (#14723)
* add HCP integration

* lint: use non-deprecated logging interface
2022-09-26 14:58:15 -04:00
Chris S. Kim 2203cdc4db Add new internal endpoint to list exported services to a peer 2022-09-23 09:43:56 -04:00
freddygv 02d3ce1039 Add server certificate manager
This certificate manager will request a leaf certificate for server
agents and then keep them up to date.
2022-09-16 17:57:10 -06:00
freddygv 0e5131bd33 Generate ACL token for server management
This commit introduces a new ACL token used for internal server
management purposes.

It has a few key properties:
- It has unlimited permissions.
- It is persisted through Raft as System Metadata rather than in the
ACL tokens table. This is to avoid users seeing or modifying it.
- It is re-generated on leadership establishment.
2022-09-16 17:54:34 -06:00
Kyle Havlovitz 0d9ae52643
Merge pull request #14598 from hashicorp/root-removal-fix
connect/ca: Don't discard old roots on primaryInitialize
2022-09-15 14:36:01 -07:00
Kyle Havlovitz 6105a7fd9f connect/ca: don't discard old roots on primaryInitialize 2022-09-15 12:59:09 -07:00
DanStough 2a2debee64 feat(peering): validate server name conflicts on establish 2022-09-14 11:37:30 -04:00
Derek Menteer 0aa13733a0
Add CSR check for number of URIs. (#14579)
Add CSR check for number of URIs.
2022-09-13 14:21:47 -05:00
Derek Menteer db83ff4fa6 Add input validation for auto-config JWT authorization checks. 2022-09-13 11:16:36 -05:00
skpratt b761589340
add non-double-prefixed metrics (#14193) 2022-09-09 12:13:43 -05:00
Dan Upton 1c2c975b0b
xDS Load Balancing (#14397)
Prior to #13244, connect proxies and gateways could only be configured by an
xDS session served by the local client agent.

In an upcoming release, it will be possible to deploy a Consul service mesh
without client agents. In this model, xDS sessions will be handled by the
servers themselves, which necessitates load-balancing to prevent a single
server from receiving a disproportionate amount of load and becoming
overwhelmed.

This introduces a simple form of load-balancing where Consul will attempt to
achieve an even spread of load (xDS sessions) between all healthy servers.
It does so by implementing a concurrent session limiter (limiter.SessionLimiter)
and adjusting the limit according to autopilot state and proxy service
registrations in the catalog.

If a server is already over capacity (i.e. the session limit is lowered),
Consul will begin draining sessions to rebalance the load. This will result
in the client receiving a `RESOURCE_EXHAUSTED` status code. It is the client's
responsibility to observe this response and reconnect to a different server.

Users of the gRPC client connection brokered by the
consul-server-connection-manager library will get this for free.

The rate at which Consul will drain sessions to rebalance load is scaled
dynamically based on the number of proxies in the catalog.
2022-09-09 15:02:01 +01:00
Derek Menteer f7c884f0af Merge branch 'main' of github.com:hashicorp/consul into derekm/split-grpc-ports 2022-09-08 14:53:08 -05:00
Derek Menteer 80d31458e5 Various cleanups. 2022-09-08 10:51:50 -05:00
Chris S. Kim 1c4a6eef4f
Merge pull request #14285 from hashicorp/NET-638-push-server-address-updates-to-the-peer
peering: Subscribe to server address changes and push updates to peers
2022-09-07 09:30:45 -04:00
Freddy f4dfd42e0a
Add SpiffeID for Consul server agents (#14485)
Co-authored-by: Eric Haberkorn <erichaberkorn@gmail.com>

By adding a SpiffeID for server agents, servers can now request a leaf
certificate from the Connect CA.

This new Spiffe ID has a key property: servers are identified by their
datacenter name and trust domain. All servers that share these
attributes will share a ServerURI.

The aim is to use these certificates to verify the server name of ANY
server in a Consul datacenter.
2022-09-06 17:58:13 -06:00
Daniel Upton a31738f76f proxycfg-glue: server-local implementation of ResolvedServiceConfig
This is the OSS portion of enterprise PR 2460.

Introduces a server-local implementation of the proxycfg.ResolvedServiceConfig
interface that sources data from a blocking query against the server's state
store.

It moves the service config resolution logic into the agent/configentry package
so that it can be used in both the RPC handler and data source.

I've also done a little re-arranging and adding comments to call out data
sources for which there is to be no server-local equivalent.
2022-09-06 23:27:25 +01:00
Derek Menteer bf769daae4 Merge branch 'main' of github.com:hashicorp/consul into derekm/split-grpc-ports 2022-09-06 10:51:04 -05:00
Derek Menteer 02ae66bda8 Add kv txn get-not-exists operation. 2022-09-06 10:28:59 -05:00
Chris S. Kim ddb9375cb6 Add testcase for parsing grpc_port 2022-09-06 10:17:44 -04:00
Kyle Havlovitz d97ccccdd5
Merge pull request #14429 from hashicorp/ca-prune-intermediates
Prune old expired intermediate certs when appending a new one
2022-09-02 15:34:33 -07:00
Derek Menteer f64771c707 Address PR comments. 2022-09-01 16:54:24 -05:00
Kyle Havlovitz 0c2fb7252d Prune intermediates before appending new one 2022-09-01 14:24:30 -07:00
malizz b3ac8f48ca
Add additional parameters to envoy passive health check config (#14238)
* draft commit

* add changelog, update test

* remove extra param

* fix test

* update type to account for nil value

* add test for custom passive health check

* update comments and tests

* update description in docs

* fix missing commas
2022-09-01 09:59:11 -07:00
Chris S. Kim f2b147e575 Add Internal.ServiceDump support for querying by PeerName 2022-09-01 10:32:59 -04:00
Derek Menteer cf7f24a6ec Change serf-tag references to field references. 2022-08-31 16:38:42 -05:00
Kyle Havlovitz 113454645d Prune old expired intermediate certs when appending a new one 2022-08-31 11:41:58 -07:00
Chris S. Kim 560d410c6d Merge branch 'main' into NET-638-push-server-address-updates-to-the-peer
# Conflicts:
#	agent/grpc-external/services/peerstream/stream_test.go
2022-08-30 11:09:25 -04:00
Freddy 97d1db759f
Merge pull request #13496 from maxb/fix-kv_entries-metric 2022-08-29 15:35:11 -06:00
Freddy 829a2a8722
Merge pull request #14364 from hashicorp/peering/term-delete 2022-08-29 15:33:18 -06:00
Max Bowsher decc9231ee Merge branch 'main' into fix-kv_entries-metric 2022-08-29 22:22:10 +01:00
Chris S. Kim 5010fa5c03
Merge pull request #14371 from hashicorp/kisunji/peering-metrics-update
Adjust metrics reporting for peering tracker
2022-08-29 17:16:19 -04:00
Chris S. Kim 74ddf040dd Add heartbeat timeout grace period when accounting for peering health 2022-08-29 16:32:26 -04:00
Derek Menteer 0ceec9017b Expose `grpc_tls` via serf for cluster peering. 2022-08-29 13:43:49 -05:00
Derek Menteer 1255a8a20d Add separate grpc_tls port.
To ease the transition for users, the original gRPC
port can still operate in a deprecated mode as either
plain-text or TLS mode. This behavior should be removed
in a future release whenever we no longer support this.

The resulting behavior from this commit is:
  `ports.grpc > 0 && ports.grpc_tls > 0` spawns both plain-text and tls ports.
  `ports.grpc > 0 && grpc.tls == undefined` spawns a single plain-text port.
  `ports.grpc > 0 && grpc.tls != undefined` spawns a single tls port (backwards compat mode).
2022-08-29 13:43:43 -05:00
freddygv 310608fb19 Add validation to prevent switching dialing mode
This prevents unexpected changes to the output of ShouldDial, which
should never change unless a peering is deleted and recreated.
2022-08-29 12:31:13 -06:00
Eric Haberkorn 1099665473
Update the structs and discovery chain for service resolver redirects to cluster peers. (#14366) 2022-08-29 09:51:32 -04:00
Chris S. Kim 4d97e2f936 Adjust metrics reporting for peering tracker 2022-08-26 17:34:17 -04:00
freddygv 650e48624d Allow terminated peerings to be deleted
Peerings are terminated when a peer decides to delete the peering from
their end. Deleting a peering sends a termination message to the peer
and triggers them to mark the peering as terminated but does NOT delete
the peering itself. This is to prevent peerings from disappearing from
both sides just because one side deleted them.

Previously the Delete endpoint was skipping the deletion if the peering
was not marked as active. However, terminated peerings are also
inactive.

This PR makes some updates so that peerings marked as terminated can be
deleted by users.
2022-08-26 10:52:47 -06:00
Chris S. Kim 87962b9713 Merge branch 'main' into catalog-service-list-filter 2022-08-26 11:16:06 -04:00
Chris S. Kim e2fe8b8d65 Fix tests for enterprise 2022-08-26 11:14:02 -04:00
Chris S. Kim 1c43a1a7b4 Merge branch 'main' into NET-638-push-server-address-updates-to-the-peer
# Conflicts:
#	agent/grpc-external/services/peerstream/stream_test.go
2022-08-26 10:43:56 -04:00
Chris S. Kim 6ddcc04613
Replace ring buffer with async version (#14314)
We need to watch for changes to peerings and update the server addresses which get served by the ring buffer.

Also, if there is an active connection for a peer, we are getting up-to-date server addresses from the replication stream and can safely ignore the token's addresses which may be stale.
2022-08-26 10:27:13 -04:00
alex 30ff2e9a35
peering: add peer health metric (#14004)
Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>
2022-08-25 16:32:59 -07:00
Chris S. Kim 181063cd23 Exit loop when context is cancelled 2022-08-25 11:48:25 -04:00
skpratt 919da33331
no-op: refactor usagemetrics tests for clarity and DRY cases (#14313) 2022-08-24 12:00:09 -05:00
Dan Upton 3b993f2da7
dataplane: update envoy bootstrap params for consul-dataplane (#14017)
Contains 2 changes to the GetEnvoyBootstrapParams response to support
consul-dataplane.

Exposing node_name and node_id:

consul-dataplane will support providing either the node_id or node_name in its
configuration. Unfortunately, supporting both in the xDS meta adds a fair amount
of complexity (partly because most tables are currently indexed on node_name)
so for now we're going to return them both from the bootstrap params endpoint,
allowing consul-dataplane to exchange a node_id for a node_name (which it will
supply in the xDS meta).

Properly setting service for gateways:

To avoid the need to special case gateways in consul-dataplane, service will now
either be the destination service name for connect proxies, or the gateway
service name. This means it can be used as-is in Envoy configuration (i.e. as a
cluster name or in metric tags).
2022-08-24 12:03:15 +01:00
Chris S. Kim 81e965479b PR feedback to specify Node name in test mock 2022-08-23 11:51:04 -04:00
Eric Haberkorn 58901ad7df
Cluster peering failover disco chain changes (#14296) 2022-08-23 09:13:43 -04:00
Chris S. Kim 547fb9570e Add missing mock assertions 2022-08-22 13:55:01 -04:00
cskh 060531a29a
Fix: add missing ent meta for test (#14289) 2022-08-22 13:51:04 -04:00
Chris S. Kim df951bd601 Expose external gRPC port in autopilot
The grpc_port was added to a NodeService's meta in ea58f235f5
2022-08-22 10:07:00 -04:00
cskh 527ebd068a
fix: missing MaxInboundConnections field in service-defaults config entry (#14072)
* fix:  missing max_inbound_connections field in merge config
2022-08-19 14:11:21 -04:00
cskh e84e4b8868
Fix: upgrade pkg imdario/merg to prevent merge config panic (#14237)
* upgrade imdario/merg to prevent merge config panic

* test: service definition takes precedence over service-defaults in merged results
2022-08-17 21:14:04 -04:00
James Hartig f92883bbce Use the maximum jitter when calculating the timeout
The timeout should include the maximum possible
jitter since the server will randomly add to it's
timeout a jitter. If the server's timeout is less
than the client's timeout then the client will
return an i/o deadline reached error.

Before:
```
time curl 'http://localhost:8500/v1/catalog/service/service?dc=other-dc&stale=&wait=600s&index=15820644'
rpc error making call: i/o deadline reached
real    10m11.469s
user    0m0.018s
sys     0m0.023s
```

After:
```
time curl 'http://localhost:8500/v1/catalog/service/service?dc=other-dc&stale=&wait=600s&index=15820644'
[...]
real    10m35.835s
user    0m0.021s
sys     0m0.021s
```
2022-08-17 10:24:09 -04:00
cskh d46b515b64
fix: missing segment and partition (#14194) 2022-08-12 15:21:39 -04:00
cskh 81931e52c3
feat(telemetry): add labels to serf and memberlist metrics (#14161)
* feat(telemetry): add labels to serf and memberlist metrics
* changelog
* doc update

Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
2022-08-11 22:09:56 -04:00
Chris S. Kim 4c928cb2f7
Handle breaking change for ServiceVirtualIP restore (#14149)
Consul 1.13.0 changed ServiceVirtualIP to use PeeredServiceName instead of ServiceName which was a breaking change for those using service mesh and wanted to restore their snapshot after upgrading to 1.13.0.

This commit handles existing data with older ServiceName and converts it during restore so that there are no issues when restoring from older snapshots.
2022-08-11 14:47:10 -04:00
Chris S. Kim 3926009405 Add test to verify forwarding 2022-08-11 11:16:02 -04:00
Chris S. Kim 1ef22360c3 Register peerStreamServer internally to enable RPC forwarding 2022-08-11 11:16:02 -04:00
Chris S. Kim de73171202 Handle wrapped errors in isFailedPreconditionErr 2022-08-11 11:16:02 -04:00
Daniel Kimsey 3c4fa9b468 Add support for filtering the 'List Services' API
1. Create a bexpr filter for performing the filtering
2. Change the state store functions to return the raw (not aggregated)
   list of ServiceNodes.
3. Move the aggregate service tags by name logic out of the state store
   functions into a new function called from the RPC endpoint
4. Perform the filtering in the endpoint before aggregation.
2022-08-10 16:52:32 -05:00
Kyle Havlovitz 6938b8c755
Merge pull request #13958 from hashicorp/gateway-wildcard-fix
Fix wildcard picking up services it shouldn't for ingress/terminating gateways
2022-08-08 12:54:40 -07:00
Kyle Havlovitz fe1fcea34f Add some extra handling for destination deletes 2022-08-08 11:38:13 -07:00
freddygv d421e18172 Update snapshot test 2022-08-08 09:17:15 -06:00
freddygv 1031ffc3c7 Re-validate existing secrets at state store
Previously establishment and pending secrets were only checked at the
RPC layer. However, given that these are Check-and-Set transactions we
should ensure that the given secrets are still valid when persisting a
secret exchange or promotion.

Otherwise it would be possible for concurrent requests to overwrite each
other.
2022-08-08 09:06:07 -06:00
freddygv 0ea4bfae94 Test fixes 2022-08-08 08:31:47 -06:00
freddygv c04515a844 Use proto message for each secrets write op
Previously there was a field indicating the operation that triggered a
secrets write. Now there is a message for each operation and it contains
the secret ID being persisted.
2022-08-08 01:41:00 -06:00
Kyle Havlovitz 6580566c3b Update ingress/terminating wildcard logic and handle destinations 2022-08-05 07:56:10 -07:00
freddygv 8067890787 Inherit active secret when exchanging 2022-08-03 17:32:53 -05:00
freddygv 60d6e28c97 Pass explicit signal with op for secrets write
Previously the updates to the peering secrets UUID table relied on
inferring what action triggered the update based on a reconciliation
against the existing secrets.

Instead we now explicitly require the operation to be given so that the
inference isn't necessary. This makes the UUID table logic easier to
reason about and fixes some related bugs.

There is also an update so that the peering secrets get handled on
snapshots/restores.
2022-08-03 17:25:12 -05:00
freddygv 9ca687bc7c Avoid deleting peering secret UUIDs at dialers
Dialers do not keep track of peering secret UUIDs, so they should not
attempt to clean up data from that table when their peering is deleted.

We also now keep peer server addresses when marking peerings for
deletion. Peer server addresses are used by the ShouldDial() helper
when determining whether the peering is for a dialer or an acceptor.
We need to keep this data so that peering secrets can be cleaned up
accordingly.
2022-08-03 16:34:57 -05:00
Kyle Havlovitz 499211f907 Fix wildcard picking up services it shouldn't for ingress/terminating gateways 2022-08-02 09:41:31 -07:00
Freddy 42996411cc
Various peering fixes (#13979)
* Avoid logging StreamSecretID
* Wrap additional errors in stream handler
* Fix flakiness in leader test and rename servers for clarity. There was
  a race condition where the peering was being deleted in the test
  before the stream was active. Now the test waits for the stream to be
  connected on both sides before deleting the associated peering.
* Run flaky test serially
2022-08-01 15:06:18 -06:00
Luke Kysow 988e1fd35d
peering: default to false (#13963)
* defaulting to false because peering will be released as beta
* Ignore peering disabled error in bundles cachetype

Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>
Co-authored-by: freddygv <freddy@hashicorp.com>
Co-authored-by: Matt Keeler <mjkeeler7@gmail.com>
2022-08-01 15:22:36 -04:00
Freddy dacf703d20
Merge branch 'main' into fix-kv_entries-metric 2022-08-01 13:19:27 -06:00
Matt Keeler f74d0cef7a
Implement/Utilize secrets for Peering Replication Stream (#13977) 2022-08-01 10:33:18 -04:00
alex a45bb1f06b
block PeerName register requests (#13887)
Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>
2022-07-29 14:36:22 -07:00
Luke Kysow 95096e2c03
peering: retry establishing connection more quickly on certain errors (#13938)
When we receive a FailedPrecondition error, retry that more quickly
because we expect it will resolve shortly. This is particularly
important in the context of Consul servers behind a load balancer
because when establishing a connection we have to retry until we
randomly land on a leader node.

The default retry backoff goes from 2s, 4s, 8s, etc. which can result in
very long delays quite quickly. Instead, this backoff retries in 8ms
five times, then goes exponentially from there: 16ms, 32ms, ... up to a
max of 8152ms.
2022-07-29 13:04:32 -07:00
acpana eae4e71492
sync more acl enforcement
sync w ent at 32756f7

Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>
2022-07-28 12:01:52 -07:00
Ashwin Venkatesh eef9edaed9
Add peer counts to emitted metrics. (#13930) 2022-07-27 18:34:04 -04:00
Luke Kysow 465a9801e1
Merge pull request #13924 from hashicorp/lkysow/util-metric-peering
peering: don't track imported services/nodes in usage
2022-07-27 14:49:55 -07:00
Chris S. Kim 0999e05a7d Reduce arm64 flakes for TestConnectCA_ConfigurationSet_ChangeKeyConfig_Primary
There were 16 combinations of tests but 4 of them were duplicates since the default key type and bits were "ec" and 256. That entry was commented out to reduce the subtest count to 12.

testrpc.WaitForLeader was failing on arm64 environments; the cause is unknown but it might be due to the environment being flooded with parallel tests making RPC calls. The RPC polling+retry was replaced with a simpler check for leadership based on raft.
2022-07-27 13:54:34 -04:00
Chris S. Kim 8ead1caf53 Retry checks for virtual IP metadata 2022-07-27 13:54:34 -04:00
Chris S. Kim 62ed0250c3 Sort slice of ServiceNames deterministically 2022-07-27 13:54:34 -04:00
Luke Kysow 740d54e730 peering: don't track imported services/nodes in usage
Services/nodes that are imported from other peers are stored in
state. We don't want to count those as part of our own cluster's usage.
2022-07-27 09:08:51 -07:00
cskh 4e292b7b72
chore: clarify the error message: service.service must not be empty (#13907)
- when register service using catalog endpoint, the key of service
  name actually should be "service". Add this information to the
  error message will help user to quickly fix in the request.
2022-07-27 10:16:46 -04:00
Chris S. Kim 73a84f256f
Preserve PeeringState on upsert (#13666)
Fixes a bug where if the generate token is called twice, the second call upserts the zero-value (undefined) of PeeringState.
2022-07-25 14:37:56 -04:00
DanStough 2da8949d78 feat: convert destination address to slice 2022-07-25 12:31:58 -04:00
freddygv b544ce6485 Add ACL enforcement to peering endpoints 2022-07-25 09:34:29 -06:00
Luke Kysow a1e6d69454
peering: add config to enable/disable peering (#13867)
* peering: add config to enable/disable peering

Add config:

```
peering {
  enabled = true
}
```

Defaults to true. When disabled:
1. All peering RPC endpoints will return an error
2. Leader won't start its peering establishment goroutines
3. Leader won't start its peering deletion goroutines
2022-07-22 15:20:21 -07:00
alex 927cee692b
peering: emit exported services count metric (#13811)
Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>
2022-07-22 12:05:08 -07:00
Eric Haberkorn 501089292e
Add Cluster Peering Failover Support to Prepared Queries (#13835)
Add peering failover support to prepared queries
2022-07-22 09:14:43 -04:00
acpana 12b773ab02
Rename peering internal to ~
sync ENT to 5679392c81

Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>
2022-07-21 10:51:05 -07:00
Daniel Upton 3655802fdc proxycfg-glue: server-local implementation of `PeeredUpstreams`
This is the OSS portion of enterprise PR 2352.

It adds a server-local implementation of the proxycfg.PeeredUpstreams interface
based on a blocking query against the server's state store.

It also fixes an omission in the Virtual IP freeing logic where we were never
updating the max index (and therefore blocking queries against
VirtualIPsForAllImportedServices would not return on service deletion).
2022-07-21 13:51:59 +01:00
Paul Glass 77afe0e76e
Extract AWS auth implementation out of Consul (#13760) 2022-07-19 16:26:44 -05:00
alex de5a991d8c
peering: refactor reconcile, cleanup (#13795)
Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>
2022-07-19 11:43:29 -07:00
alex a9ae2ff4fa
peering: track exported services (#13784)
Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>
2022-07-18 10:20:04 -07:00
Luke Kysow 46381b1a7f
Add docs for peerStreamServer vs peeringServer. (#13781) 2022-07-15 12:23:05 -07:00
Luke Kysow ca3d7c964c
peerstream: dialer should reconnect when stream closes (#13745)
* peerstream: dialer should reconnect when stream closes

If the stream is closed unexpectedly (i.e. when we haven't received
a terminated message), the dialer should attempt to re-establish the
stream.

Previously, the `HandleStream` would return `nil` when the stream
was closed. The caller then assumed the stream was terminated on purpose
and so didn't reconnect when instead it was stopped unexpectedly and
the dialer should have attempted to reconnect.
2022-07-15 11:58:33 -07:00
R.B. Boyer bb4d4040fb
server: ensure peer replication can successfully use TLS over external gRPC (#13733)
Ensure that the peer stream replication rpc can successfully be used with TLS activated.

Also:

- If key material is configured for the gRPC port but HTTPS is not
  enabled now TLS will still be activated for the gRPC port.

- peerstream replication stream opened by the establishing-side will now
  ignore grpc.WithBlock so that TLS errors will bubble up instead of
  being awkwardly delayed or suppressed
2022-07-15 13:15:50 -05:00
alex adb5ffa1a6
peering: track imported services (#13718) 2022-07-15 10:20:43 -07:00
Dan Stough 49f3dadb8f feat: connect proxy xDS for destinations
Signed-off-by: Dhia Ayachi <dhia@hashicorp.com>
2022-07-14 15:27:02 -04:00
Daniel Upton 3c533ceea8 proxycfg-glue: server-local implementation of `ServiceList`
This is the OSS portion of enterprise PR 2242.

This PR introduces a server-local implementation of the proxycfg.ServiceList
interface, backed by streaming events and a local materializer.
2022-07-14 18:22:12 +01:00
Dan Upton b9e525d689
grpc: rename public/private directories to external/internal (#13721)
Previously, public referred to gRPC services that are both exposed on
the dedicated gRPC port and have their definitions in the proto-public
directory (so were considered usable by 3rd parties). Whereas private
referred to services on the multiplexed server port that are only usable
by agents and other servers.

Now, we're splitting these definitions, such that external/internal
refers to the port and public/private refers to whether they can be used
by 3rd parties.

This is necessary because the peering replication API needs to be
exposed on the dedicated port, but is not (yet) suitable for use by 3rd
parties.
2022-07-13 16:33:48 +01:00
R.B. Boyer 30fffd0c90
peerstream: some cosmetic refactors to make this easier to follow (#13732)
- Use some protobuf construction helper methods for brevity.
- Rename a local variable to avoid later shadowing.
- Rename the Nonce field to be more like xDS's naming.
- Be more explicit about which PeerID fields are empty.
2022-07-13 10:00:35 -05:00
R.B. Boyer f0e6e4e697
state: prohibit changing an exported tcp discovery chain in a way that would break SAN validation (#13727)
For L4/tcp exported services the mesh gateways will not be terminating
TLS. A caller in one peer will be directly establishing TLS connections
to the ultimate exported service in the other peer.

The caller will be doing SAN validation using the replicated SpiffeID
values shipped from the exporting side. There are a class of discovery
chain edits that could be done on the exporting side that would cause
the introduction of a new SpiffeID value. In between the time of the
config entry update on the exporting side and the importing side getting
updated peer stream data requests to the exported service would fail due
to SAN validation errors.

This is unacceptable so instead prohibit the exporting peer from making
changes that would break peering in this way.
2022-07-12 11:17:33 -05:00
R.B. Boyer 2317f37b4d
state: prohibit exported discovery chains to have cross-datacenter or cross-partition references (#13726)
Because peerings are pairwise, between two tuples of (datacenter,
partition) having any exported reference via a discovery chain that
crosses out of the peered datacenter or partition will ultimately not be
able to work for various reasons. The biggest one is that there is no
way in the ultimate destination to configure an intention that can allow
an external SpiffeID to access a service.

This PR ensures that a user simply cannot do this, so they won't run
into weird situations like this.
2022-07-12 11:03:41 -05:00
Chris S. Kim a6634db4a5
Return error if ServerAddresses is empty (#13714) 2022-07-12 11:09:00 -04:00
R.B. Boyer af04851637
peering: move peer replication to the external gRPC port (#13698)
Peer replication is intended to be between separate Consul installs and
effectively should be considered "external". This PR moves the peer
stream replication bidirectional RPC endpoint to the external gRPC
server and ensures that things continue to function.
2022-07-08 12:01:13 -05:00
R.B. Boyer ea58f235f5
server: broadcast the public grpc port using lan serf and update the consul service in the catalog with the same data (#13687)
Currently servers exchange information about their WAN serf port
and RPC port with serf tags, so that they all learn of each other's
addressing information. We intend to make larger use of the new
public-facing gRPC port exposed on all of the servers, so this PR
addresses that by passing around the gRPC port via serf tags and
then ensuring the generated consul service in the catalog has
metadata about that new port as well for ease of non-serf-based lookup.
2022-07-07 13:55:41 -05:00
R.B. Boyer 2a945facec
test: update mockery use to put mocks into test files (#13656)
--testonly doesn't do anything anymore so switch to --filename instead
2022-07-05 16:57:15 -05:00
Chris S. Kim f07132dacc
Revise possible states for a peering. (#13661)
These changes are primarily for Consul's UI, where we want to be more
specific about the state a peering is in.

- The "initial" state was renamed to pending, and no longer applies to
  peerings being established from a peering token.

- Upon request to establish a peering from a peering token, peerings
  will be set as "establishing". This will help distinguish between the
  two roles: the cluster that generates the peering token and the
  cluster that establishes the peering.

- When marked for deletion, peering state will be set to "deleting".
  This way the UI determines the deletion via the state rather than the
  "DeletedAt" field.

Co-authored-by: freddygv <freddy@hashicorp.com>
2022-07-04 10:47:58 -04:00
Daniel Upton 45886848b4 proxycfg: server-local intention upstreams data source
This is the OSS portion of enterprise PR 2157.

It builds on the local blocking query work in #13438 to implement the
proxycfg.IntentionUpstreams interface using server-local data.

Also moves the ACL filtering logic from agent/consul into the acl/filter
package so that it can be reused here.
2022-07-04 10:48:36 +01:00
Daniel Upton 37ccbd2826 proxycfg: server-local intentions data source
This is the OSS portion of enterprise PR 2141.

This commit provides a server-local implementation of the `proxycfg.Intentions`
interface that sources data from streaming events.

It adds events for the `service-intentions` config entry type, and then consumes
event streams (via materialized views) for the service's explicit intentions and
any applicable wildcard intentions, merging them into a single list of intentions.

An alternative approach I considered was to consume _all_ intention events (via
`SubjectWildcard`) and filter out the irrelevant ones. This would admittedly
remove some complexity in the `agent/proxycfg-glue` package but at the expense
of considerable overhead from waking potentially many thousands of connect
proxies every time any intention is updated.
2022-07-04 10:48:36 +01:00
Daniel Upton 653b8c4f9d proxycfg: server-local config entry data sources
This is the OSS portion of enterprise PR 2056.

This commit provides server-local implementations of the proxycfg.ConfigEntry
and proxycfg.ConfigEntryList interfaces, that source data from streaming events.

It makes use of the LocalMaterializer type introduced for peering replication,
adding the necessary support for authorization.

It also adds support for "wildcard" subscriptions (within a topic) to the event
publisher, as this is needed to fetch service-resolvers for all services when
configuring mesh gateways.

Currently, events will be emitted for just the ingress-gateway, service-resolver,
and mesh config entry types, as these are the only entries required by proxycfg
— the events will be emitted on topics named IngressGateway, ServiceResolver,
and MeshConfig topics respectively.

Though these events will only be consumed "locally" for now, they can also be
consumed via the gRPC endpoint (confirmed using grpcurl) so using them from
client agents should be a case of swapping the LocalMaterializer for an
RPCMaterializer.
2022-07-04 10:48:36 +01:00