Auto-encrypt meant to fallback to the default port when it wasn't provided, but it hadn't been because of an issue with the error handling. We were checking against an incomplete error value:
"missing port in address" vs "address $HOST: missing port in address"
Additionally, all RPCs to AutoEncrypt.Sign were using a.config.ServerPort, so those were updated to use ports resolved by resolveAddrs, if they are available.
* Allow setting the mesh gateway mode for an upstream in config files
* Add envoy integration test for mesh gateways
This necessitated many supporting changes in most of the other test cases.
Add remote mode mesh gateways integration test
The main change is that we no longer filter service instances by health,
preferring instead to render all results down into EDS endpoints in
envoy and merely label the endpoints as HEALTHY or UNHEALTHY.
When OnlyPassing is set to true we will force consul checks in a
'warning' state to render as UNHEALTHY in envoy.
Fixes#6171
* Display nicely Networks (CIDR) in runtime configuration
CIDR mask is displayed in binary in configuration.
This add support for nicely displaying CIDR in runtime configuration.
Currently, if a configuration contains the following lines:
"http_config": {
"allow_write_http_from": [
"127.0.0.0/8",
"::1/128"
]
}
A call to `/v1/agent/self?pretty` would display
"AllowWriteHTTPFrom": [
{
"IP": "127.0.0.0",
"Mask": "/wAAAA=="
},
{
"IP": "::1",
"Mask": "/////////////////////w=="
}
]
This PR fixes it and it will now display:
"AllowWriteHTTPFrom": [ "127.0.0.0/8", "::1/128" ]
* Added test for cidr nice rendering in `TestSanitize()`.
* Update go-bexpr to v0.1.1
This brings in:
• `in`/`not in` operators to do substring matching
• `matches` / `not matches` operators to perform regex string matching.
* Add the capability to auto-generate the filtering selector ops tables for our docs
This fixes pathological cases where the write throughput and snapshot size are both so large that more than 10k log entries are written in the time it takes to restore the snapshot from disk. In this case followers that restart can never catch up with leader replication again and enter a loop of constantly downloading a full snapshot and restoring it only to find that snapshot is already out of date and the leader has truncated its logs so a new snapshot is sent etc.
In general if you need to adjust this, you are probably abusing Consul for purposes outside its design envelope and should reconsider your usage to reduce data size and/or write volume.
All these changes should have no side-effects or change behavior:
- Use bytes.Buffer's String() instead of a conversion
- Use time.Since and time.Until where fitting
- Drop unnecessary returns and assignment
I can only assume we want to check for the retrieved `updatedToken` to not be
nil, before accessing it below.
`token` can't possibly be nil at this point, as we accessed `token.AccessorID`
just before.
* Ensure the mesh gateway configuration comes back in the api within each upstream
* Add a test for the MeshGatewayConfig in the ToAPI functions
* Ensure we don’t use gateways for dc local connections
* Update the svc kind index for deletions
* Replace the proxycfg.state cache with an interface for testing
Also start implementing proxycfg state testing.
* Update the state tests to verify some gateway watches for upstream-targets of a discovery chain.
Also:
- add back an internal http endpoint to dump a compiled discovery chain for debugging purposes
Before the CompiledDiscoveryChain.IsDefault() method would test:
- is this chain just one resolver step?
- is that resolver step just the default?
But what I forgot to test:
- is that resolver step for the same service that the chain represents?
This last point is important because if you configured just one config
entry:
kind = "service-resolver"
name = "web"
redirect {
service = "other"
}
and requested the chain for "web" you'd get back a **default** resolver
for "other". In the xDS code the IsDefault() method is used to
determine if this chain is "empty". If it is then we use the
pre-discovery-chain logic that just uses data embedded in the Upstream
object (and still lets the escape hatches function).
In the example above that means certain parts of the xDS code were going
to try referencing a cluster named "web..." despite the other parts of
the xDS code maintaining clusters named "other...".
Both 'consul config write' and server bootstrap config entries take a
decoding detour through mapstructure on the way from HCL to an actual
struct. They both may take in snake_case or CamelCase (for consistency)
so need very similar handling.
Unfortunately since they are operating on mirror universes of structs
(api.* vs structs.*) the code cannot be identitical, so try to share the
kind-configuration and duplicate the rest for now.
* Ensure we MapWalk the proxy config in the NodeService and ServiceNode structs
This gets rid of some json encoder errors in the catalog endpoints
* Allow passing explicit bind addresses to envoy
* Move map walking to the ConnectProxyConfig struct
Any place where this struct gets JSON encoded will benefit as opposed to having to implement it everywhere.
* Fail when a non-empty address is provided and not bindable
* camel case
* Update command/connect/envoy/envoy.go
Co-Authored-By: Paul Banks <banks@banksco.de>
* Retry the creation of the test server three times.
* Reduce the retry timeout for the API wait to 2 seconds, opting to fail faster and start over.
* Remove wait for leader from server creation. This wait can be added on a test by test basis now that the function is being exported.
* Remove wait for anti-entropy sync. This is built into the existing WaitForSerfCheck func, so that can be used if the anti-entropy wait is needed
Previously a sequence of events like:
Start
Stop
Start
Stop
would segfault on the second stop because the original ctx and cancel func were only initialized during the constructor and not during Start.
* Make cluster names SNI always
* Update some tests
* Ensure we check for prepared query types
* Use sni for route cluster names
* Proper mesh gateway mode defaulting when the discovery chain is used
* Ignore service splits from PatchSliceOfMaps
* Update some xds golden files for proper test output
* Allow for grpc/http listeners/cluster configs with the disco chain
* Update stats expectation
* connect: allow overriding envoy listener bind_address
* Update agent/xds/config.go
Co-Authored-By: Kyle Havlovitz <kylehav@gmail.com>
* connect: allow overriding envoy listener bind_port
* envoy: support unix sockets for grpc in bootstrap
Add AgentSocket BootstrapTplArgs which if set overrides the AgentAddress
and AgentPort to generate a bootstrap which points Envoy to a unix
socket file instead of an ip:port.
* Add a test for passing the consul addr as a unix socket
* Fix config formatting for envoy bootstrap tests
* Fix listeners test cases for bind addr/port
* Update website/source/docs/connect/proxies/envoy.md
maxIndexWatchTxn was only watching the IndexEntry of the max index of all the entries. It needed to watch all of them regardless of which was the max.
Also plumbed the query source through in the proxy config to help better track requests.
The clusters/endpoints test were still relying on deterministic ordering of clusters/endpoints which cannot be relied upon due to golang purposefully not providing any guarantee about consistent interation ordering of maps.
Also fixed a small bug in the connect proxy cluster generation that was causing the clusters slice to be double the size it needed to with the first half being all nil pointers.
* Fix some tests that I broke when refactoring the ConfigSnapshot
* Make sure the MeshGateway config is added to all the right api structs
* Fix some more tests
The general problem was that a the CA config which contained the trust domain was happening outside of the blocking mechanism so if the client started the blocking query before the primary dcs roots had been set then a state trust domain was being pushed down.
This was fixed here but in the future we should probably fixup the CA initialization code to not initialize the CA config twice when it doesn’t need to.
* Prune Servers from WAN and LAN
* cleaned up and fixed LAN to WAN
* moving things around
* force-leave remove from serfWAN, create pruneSerfWAN
* removed serfWAN remove, reduced complexity, fixed comments
* add another place to remove from serfWAN
* add nil check
* Update agent/consul/server.go
Co-Authored-By: Paul Banks <banks@banksco.de>
With this you should be able to fetch all of the relevant discovery
chain config entries from the state store in one query and then feed
them into the compiler outside of a transaction.
There are a lot of TODOs scattered through here, but they're mostly
around handling fun edge cases and can be deferred until more of the
plumbing works completely.
With ACLs enabled if an agent is wiped and restarted without a leave
it can no longer deregister the services it had previously registered
because it no longer has the tokens the services were registered with.
To remedy that we allow service deregistration from tokens with node
write permission.
* Add ui-content-path flag
* tests complete, regex validator on string, index.html updated
* cleaning up debugging stuff
* ui: Enable ember environment configuration to be set via the go binary at runtime (#5934)
* ui: Only inject {{.ContentPath}} if we are makeing a prod build...
...otherwise we just use the current rootURL
This gets injected into a <base /> node which solves the assets path
problem but not the ember problem
* ui: Pull out the <base href=""> value and inject it into ember env
See previous commit:
The <base href=""> value is 'sometimes' injected from go at index
serve time. We pass this value down to ember by overwriting the ember
config that is injected via a <meta> tag. This has to be done before
ember bootup.
Sometimes (during testing and development, basically not production)
this is injected with the already existing value, in which case this
essentially changes nothing.
The code here is slightly abstracted away from our specific usage to
make it easier for anyone else to use, and also make sure we can cope
with using this same method to pass variables down from the CLI through
to ember in the future.
* ui: We can't use <base /> move everything to javascript (#5941)
Unfortuantely we can't seem to be able to use <base> and rootURL
together as URL paths will get doubled up (`ui/ui/`).
This moves all the things that we need to interpolate with .ContentPath
to the `startup` javascript so we can conditionally print out
`{{.ContentPath}}` in lots of places (now we can't use base)
* fixed when we serve index.html
* ui: For writing a ContentPath, we also need to cope with testing... (#5945)
...and potentially more environments
Testing has more additional things in a separate index.html in `tests/`
This make the entire thing a little saner and uses just javascriopt
template literals instead of a pseudo handbrake synatx for our
templating of these files.
Intead of just templating the entire file this way, we still only
template `{{content-for 'head'}}` and `{{content-for 'body'}}`
in this way to ensure we support other plugins/addons
* build: Loosen up the regex for retrieving the CONSUL_VERSION (#5946)
* build: Loosen up the regex for retrieving the CONSUL_VERSION
1. Previously the `sed` replacement was searching for the CONSUL_VERSION
comment at the start of a line, it no longer does this to allow for
indentation.
2. Both `grep` and `sed` where looking for the omment at the end of the
line. We've removed this restriction here. We don't need to remove it
right now, but if we ever put the comment followed by something here the
searching would break.
3. Added `xargs` for trimming the resulting version string. We aren't
using this already in the rest of the scripts, but we are pretty sure
this is available on most systems.
* ui: Fix erroneous variable, and also force an ember cache clean on build
1. We referenced a variable incorrectly here, this fixes that.
2. We also made sure that every `make` target clears ember's `tmp` cache
to ensure that its not using any caches that have since been edited
everytime we call a `make` target.
* added docs, fixed encoding
* fixed go fmt
* Update agent/config/config.go
Co-Authored-By: R.B. Boyer <public@richardboyer.net>
* Completed Suggestions
* run gofmt on http.go
* fix testsanitize
* fix fullconfig/hcl by setting correct 'want'
* ran gofmt on agent/config/runtime_test.go
* Update website/source/docs/agent/options.html.md
Co-Authored-By: Hans Hasselberg <me@hans.io>
* Update website/source/docs/agent/options.html.md
Co-Authored-By: kaitlincarter-hc <43049322+kaitlincarter-hc@users.noreply.github.com>
* remove contentpath from redirectFS struct
* Support for maximum size for Output of checks
This PR allows users to limit the size of output produced by checks at the agent
and check level.
When set at the agent level, it will limit the output for all checks monitored
by the agent.
When set at the check level, it can override the agent max for a specific check but
only if it is lower than the agent max.
Default value is 4k, and input must be at least 1.
If a KVSet is performed but does not update the entry, do not trigger
watches for this key.
This avoids releasing blocking queries for KV values that did not
actually changed.
Read requests performed during anti antropy full sync currently target
the leader only. This generates a non-negligible load on the leader when
the DC is large enough and can be offloaded to the followers following
the "eventually consistent" policy for the agent state.
We switch the AE read calls to use stale requests with a small (2s)
MaxStaleDuration value and make sure we do not read too fast after a
write.
This allows addresses to be tagged at the service level similar to what we allow for nodes already. The address translation that can be enabled with the `translate_wan_addrs` config was updated to take these new addresses into account as well.
* Upgrade xDS (go-control-plane) API to support Envoy 1.10.
This includes backwards compatibility shim to work around the ext_authz package rename in 1.10.
It also adds integration test support in CI for 1.10.0.
* Fix go vet complaints
* go mod vendor
* Update Envoy version info in docs
* Update website/source/docs/connect/proxies/envoy.md
The observed bug was that a full restart of a consul datacenter (servers
and clients) in conjunction with a restart of a connect-flavored
application with bring-your-own-service-registration logic would very
frequently cause the envoy sidecar service check to never reflect the
aliased service.
Over the course of investigation several bugs and unfortunate
interactions were corrected:
(1)
local.CheckState objects were only shallow copied, but the key piece of
data that gets read and updated is one of the things not copied (the
underlying Check with a Status field). When the stock code was run with
the race detector enabled this highly-relevant-to-the-test-scenario field
was found to be racy.
Changes:
a) update the existing Clone method to include the Check field
b) copy-on-write when those fields need to change rather than
incrementally updating them in place.
This made the observed behavior occur slightly less often.
(2)
If anything about how the runLocal method for node-local alias check
logic was ever flawed, there was no fallback option. Those checks are
purely edge-triggered and failure to properly notice a single edge
transition would leave the alias check incorrect until the next flap of
the aliased check.
The change was to introduce a fallback timer to act as a control loop to
double check the alias check matches the aliased check every minute
(borrowing the duration from the non-local alias check logic body).
This made the observed behavior eventually go away when it did occur.
(3)
Originally I thought there were two main actions involved in the data race:
A. The act of adding the original check (from disk recovery) and its
first health evaluation.
B. The act of the HTTP API requests coming in and resetting the local
state when re-registering the same services and checks.
It took awhile for me to realize that there's a third action at work:
C. The goroutines associated with the original check and the later
checks.
The actual sequence of actions that was causing the bad behavior was
that the API actions result in the original check to be removed and
re-added _without waiting for the original goroutine to terminate_. This
means for brief windows of time during check definition edits there are
two goroutines that can be sending updates for the alias check status.
In extremely unlikely scenarios the original goroutine sees the aliased
check start up in `critical` before being removed but does not get the
notification about the nearly immediate update of that check to
`passing`.
This is interlaced wit the new goroutine coming up, initializing its
base case to `passing` from the current state and then listening for new
notifications of edge triggers.
If the original goroutine "finishes" its update, it then commits one
more write into the local state of `critical` and exits leaving the
alias check no longer reflecting the underlying check.
The correction here is to enforce that the old goroutines must terminate
before spawning the new one for alias checks.
* Improve startup message to avoid confusing users when no error occurs
Several times, some users not very familiar with Consul get confused
by error message at startup:
`[INFO] agent: (LAN) joined: 1 Err: <nil>`
Having `Err: <nil>` seems weird to many users, I propose to have the
following instead:
* Success: `[INFO] agent: (LAN) joined: 1`
* Error: `[WARN] agent: (LAN) couldn't join: %d Err: ERROR`