This is the OSS portion of enterprise PR 2265.
This PR provides a server-local implementation of the
proxycfg.FederationStateListMeshGateways interface based on blocking queries.
This is the OSS portion of enterprise PR 2259.
This PR provides a server-local implementation of the proxycfg.GatewayServices
interface based on blocking queries.
This is the OSS portion of enterprise PR 2250.
This PR provides server-local implementations of the proxycfg.TrustBundle and
proxycfg.TrustBundleList interfaces, based on local blocking queries.
This is the OSS portion of enterprise PR 2249.
This PR introduces an implementation of the proxycfg.Health interface based on a
local materialized view of the health events.
It reuses the view and request machinery from agent/rpcclient/health, which made
it super straightforward.
This is the OSS portion of enterprise PR 2242.
This PR introduces a server-local implementation of the proxycfg.ServiceList
interface, backed by streaming events and a local materializer.
Previously, public referred to gRPC services that are both exposed on
the dedicated gRPC port and have their definitions in the proto-public
directory (so were considered usable by 3rd parties). Whereas private
referred to services on the multiplexed server port that are only usable
by agents and other servers.
Now, we're splitting these definitions, such that external/internal
refers to the port and public/private refers to whether they can be used
by 3rd parties.
This is necessary because the peering replication API needs to be
exposed on the dedicated port, but is not (yet) suitable for use by 3rd
parties.
Currently servers exchange information about their WAN serf port
and RPC port with serf tags, so that they all learn of each other's
addressing information. We intend to make larger use of the new
public-facing gRPC port exposed on all of the servers, so this PR
addresses that by passing around the gRPC port via serf tags and
then ensuring the generated consul service in the catalog has
metadata about that new port as well for ease of non-serf-based lookup.
This is the OSS portion of enterprise PR 2157.
It builds on the local blocking query work in #13438 to implement the
proxycfg.IntentionUpstreams interface using server-local data.
Also moves the ACL filtering logic from agent/consul into the acl/filter
package so that it can be reused here.
This is the OSS portion of enterprise PR 2141.
This commit provides a server-local implementation of the `proxycfg.Intentions`
interface that sources data from streaming events.
It adds events for the `service-intentions` config entry type, and then consumes
event streams (via materialized views) for the service's explicit intentions and
any applicable wildcard intentions, merging them into a single list of intentions.
An alternative approach I considered was to consume _all_ intention events (via
`SubjectWildcard`) and filter out the irrelevant ones. This would admittedly
remove some complexity in the `agent/proxycfg-glue` package but at the expense
of considerable overhead from waking potentially many thousands of connect
proxies every time any intention is updated.
This is the OSS portion of enterprise PR 2056.
This commit provides server-local implementations of the proxycfg.ConfigEntry
and proxycfg.ConfigEntryList interfaces, that source data from streaming events.
It makes use of the LocalMaterializer type introduced for peering replication,
adding the necessary support for authorization.
It also adds support for "wildcard" subscriptions (within a topic) to the event
publisher, as this is needed to fetch service-resolvers for all services when
configuring mesh gateways.
Currently, events will be emitted for just the ingress-gateway, service-resolver,
and mesh config entry types, as these are the only entries required by proxycfg
— the events will be emitted on topics named IngressGateway, ServiceResolver,
and MeshConfig topics respectively.
Though these events will only be consumed "locally" for now, they can also be
consumed via the gRPC endpoint (confirmed using grpcurl) so using them from
client agents should be a case of swapping the LocalMaterializer for an
RPCMaterializer.
For initial cluster peering TProxy support we consider all imported services of a partition to be potential upstreams.
We leverage the VirtualIP table because it stores plain service names (e.g. "api", not "api-sidecar-proxy").
Having this type live in the agent/consul package makes it difficult to
put anything that relies on token resolution (e.g. the new gRPC services)
in separate packages without introducing import cycles.
For example, if package foo imports agent/consul for the ACLResolveResult
type it means that agent/consul cannot import foo to register its service.
We've previously worked around this by wrapping the ACLResolver to
"downgrade" its return type to an acl.Authorizer - aside from the
added complexity, this also loses the resolved identity information.
In the future, we may want to move the whole ACLResolver into the
acl/resolver package. For now, putting the result type there at least,
fixes the immediate import cycle issues.
Mesh gateways will now enable tcp connections with SNI names including peering information so that those connections may be proxied.
Note: this does not change the callers to use these mesh gateways.
This is the OSS portion of enterprise PR 1994
Rather than directly interrogating the agent-local state for HTTP
checks using the `HTTPCheckFetcher` interface, we now rely on the
config snapshot containing the checks.
This reduces the number of changes required to support server xDS
sessions.
It's not clear why the fetching approach was introduced in
931d167ebb.
Envoy's SPIFFE certificate validation extension allows for us to
validate against different root certificates depending on the trust
domain of the dialing proxy.
If there are any trust bundles from peers in the config snapshot then we
use the SPIFFE validator as the validation context, rather than the
usual TrustedCA.
The injected validation config includes the local root certificates as
well.
There are a handful of changes in this commit:
* When querying trust bundles for a service we need to be able to
specify the namespace of the service.
* The endpoint needs to track the index because the cache watches use
it.
* Extracted bulk of the endpoint's logic to a state store function
so that index tracking could be tested more easily.
* Removed check for service existence, deferring that sort of work to ACL authz
* Added the cache type
For mTLS to work between two proxies in peered clusters with different root CAs,
proxies need to configure their outbound listener to use different root certificates
for validation.
Up until peering was introduced proxies would only ever use one set of root certificates
to validate all mesh traffic, both inbound and outbound. Now an upstream proxy
may have a leaf certificate signed by a CA that's different from the dialing proxy's.
This PR makes changes to proxycfg and xds so that the upstream TLS validation
uses different root certificates depending on which cluster is being dialed.
This is the OSS portion of enterprise PRs 1904, 1905, 1906, 1907, 1949,
and 1971.
It replaces the proxycfg manager's direct dependency on the agent cache
with interfaces that will be implemented differently when serving xDS
sessions from a Consul server.
OSS port of enterprise PR 1822
Includes the necessary changes to the `proxycfg` and `xds` packages to enable
Consul servers to configure arbitrary proxies using catalog data.
Broadly, `proxycfg.Manager` now has public methods for registering,
deregistering, and listing registered proxies — the existing local agent
state-sync behavior has been moved into a separate component that makes use of
these methods.
When an xDS session is started for a proxy service in the catalog, a goroutine
will be spawned to watch the service in the server's state store and
re-register it with the `proxycfg.Manager` whenever it is updated (and clean
it up when the client goes away).
OSS portion of enterprise PR 1857.
This removes (most) references to the `cache.UpdateEvent` type in the
`proxycfg` package.
As we're going to be direct usage of the agent cache with interfaces that
can be satisfied by alternative server-local datasources, it doesn't make
sense to depend on this type everywhere anymore (particularly on the
`state.ch` channel).
We also plan to extract `proxycfg` out of Consul into a shared library in
the future, which would require removing this dependency.
Aside from a fairly rote find-and-replace, the main change is that the
`cache.Cache` and `health.Client` types now accept a callback function
parameter, rather than a `chan<- cache.UpdateEvents`. This allows us to
do the type conversion without running another goroutine.
- Introduce a new telemetry configurable parameter retry_failed_connection. User can set the value to true to let consul agent continue its start process on failed connection to datadog server. When set to false, agent will stop on failed start. The default behavior is true.
Co-authored-by: Dan Upton <daniel@floppy.co>
Co-authored-by: Evan Culver <eculver@users.noreply.github.com>
Changes to how the version string was handled created small regression with the release of consul 1.12.0 enterprise.
Many tools use the Config:Version field reported by the agent/self resource to determine whether Consul is an enterprise or OSS instance, expect something like 1.12.0+ent for enterprise and simply 1.12.0 for OSS. This was accidentally broken during the runup to 1.12.x
This work fixes the value returned by both the self endpoint in ["Config"]["Version"] and the metrics consul.version field.
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* update raft to v1.3.7
* add changelog
* fix compilation error
* fix HeartbeatTimeout
* fix ElectionTimeout to reload only if value is valid
* fix default values for `ElectionTimeout` and `HeartbeatTimeout`
* fix test defaults
* bump raft to v1.3.8
If a service is automatically registered because it has a critical health check
for longer than deregister_critical_service_after, the error message will now
include:
- mention of the deregister_critical_service_after option
- the value of deregister_critical_service_after for that check
* add config watcher to the config package
* add logging to watcher
* add test and refactor to add WatcherEvent.
* add all API calls and fix a bug with recreated files
* add tests for watcher
* remove the unnecessary use of context
* Add debug log and a test for file rename
* use inode to detect if the file is recreated/replaced and only listen to create events.
* tidy ups (#1535)
* tidy ups
* Add tests for inode reconcile
* fix linux vs windows syscall
* fix linux vs windows syscall
* fix windows compile error
* increase timeout
* use ctime ID
* remove remove/creation test as it's a use case that fail in linux
* fix linux/windows to use Ino/CreationTime
* fix the watcher to only overwrite current file id
* fix linter error
* fix remove/create test
* set reconcile loop to 200 Milliseconds
* fix watcher to not trigger event on remove, add more tests
* on a remove event try to add the file back to the watcher and trigger the handler if success
* fix race condition
* fix flaky test
* fix race conditions
* set level to info
* fix when file is removed and get an event for it after
* fix to trigger handler when we get a remove but re-add fail
* fix error message
* add tests for directory watch and fixes
* detect if a file is a symlink and return an error on Add
* rename Watcher to FileWatcher and remove symlink deref
* add fsnotify@v1.5.1
* fix go mod
* do not reset timer on errors, rename OS specific files
* rename New func
* events trigger on write and rename
* add missing test
* fix flaking tests
* fix flaky test
* check reconcile when removed
* delete invalid file
* fix test to create files with different mod time.
* back date file instead of sleeping
* add watching file in agent command.
* fix watcher call to use new API
* add configuration and stop watcher when server stop
* add certs as watched files
* move FileWatcher to the agent start instead of the command code
* stop watcher before replacing it
* save watched files in agent
* add add and remove interfaces to the file watcher
* fix remove to not return an error
* use `Add` and `Remove` to update certs files
* fix tests
* close events channel on the file watcher even when the context is done
* extract `NotAutoReloadableRuntimeConfig` is a separate struct
* fix linter errors
* add Ca configs and outgoing verify to the not auto reloadable config
* add some logs and fix to use background context
* add tests to auto-config reload
* remove stale test
* add tests to changes to config files
* add check to see if old cert files still trigger updates
* rename `NotAutoReloadableRuntimeConfig` to `StaticRuntimeConfig`
* fix to re add both key and cert file. Add test to cover this case.
* review suggestion
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* add check to static runtime config changes
* fix test
* add changelog file
* fix review comments
* Apply suggestions from code review
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* update flag description
Co-authored-by: FFMMM <FFMMM@users.noreply.github.com>
* fix compilation error
* add static runtime config support
* fix test
* fix review comments
* fix log test
* Update .changelog/12329.txt
Co-authored-by: Dan Upton <daniel@floppy.co>
* transfer tests to runtime_test.go
* fix filewatcher Replace to not deadlock.
* avoid having lingering locks
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* split ReloadConfig func
* fix warning message
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* convert `FileWatcher` into an interface
* fix compilation errors
* fix tests
* extract func for adding and removing files
* add a coalesceTimer with a very small timer
* extract coaelsce Timer and add a shim for testing
* add tests to coalesceTimer fix to send remaining events
* set `coalesceTimer` to 1 Second
* support symlink, fix a nil deref.
* fix compile error
* fix compile error
* refactor file watcher rate limiting to be a Watcher implementation
* fix linter issue
* fix runtime config
* fix runtime test
* fix flaky tests
* fix compile error
* Apply suggestions from code review
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* fix agent New to return an error if File watcher New return an error
* quit timer loop if ctx is canceled
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: Ashwin Venkatesh <ashwin@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
Co-authored-by: FFMMM <FFMMM@users.noreply.github.com>
Co-authored-by: Daniel Upton <daniel@floppy.co>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
* add config watcher to the config package
* add logging to watcher
* add test and refactor to add WatcherEvent.
* add all API calls and fix a bug with recreated files
* add tests for watcher
* remove the unnecessary use of context
* Add debug log and a test for file rename
* use inode to detect if the file is recreated/replaced and only listen to create events.
* tidy ups (#1535)
* tidy ups
* Add tests for inode reconcile
* fix linux vs windows syscall
* fix linux vs windows syscall
* fix windows compile error
* increase timeout
* use ctime ID
* remove remove/creation test as it's a use case that fail in linux
* fix linux/windows to use Ino/CreationTime
* fix the watcher to only overwrite current file id
* fix linter error
* fix remove/create test
* set reconcile loop to 200 Milliseconds
* fix watcher to not trigger event on remove, add more tests
* on a remove event try to add the file back to the watcher and trigger the handler if success
* fix race condition
* fix flaky test
* fix race conditions
* set level to info
* fix when file is removed and get an event for it after
* fix to trigger handler when we get a remove but re-add fail
* fix error message
* add tests for directory watch and fixes
* detect if a file is a symlink and return an error on Add
* rename Watcher to FileWatcher and remove symlink deref
* add fsnotify@v1.5.1
* fix go mod
* do not reset timer on errors, rename OS specific files
* rename New func
* events trigger on write and rename
* add missing test
* fix flaking tests
* fix flaky test
* check reconcile when removed
* delete invalid file
* fix test to create files with different mod time.
* back date file instead of sleeping
* add watching file in agent command.
* fix watcher call to use new API
* add configuration and stop watcher when server stop
* add certs as watched files
* move FileWatcher to the agent start instead of the command code
* stop watcher before replacing it
* save watched files in agent
* add add and remove interfaces to the file watcher
* fix remove to not return an error
* use `Add` and `Remove` to update certs files
* fix tests
* close events channel on the file watcher even when the context is done
* extract `NotAutoReloadableRuntimeConfig` is a separate struct
* fix linter errors
* add Ca configs and outgoing verify to the not auto reloadable config
* add some logs and fix to use background context
* add tests to auto-config reload
* remove stale test
* add tests to changes to config files
* add check to see if old cert files still trigger updates
* rename `NotAutoReloadableRuntimeConfig` to `StaticRuntimeConfig`
* fix to re add both key and cert file. Add test to cover this case.
* review suggestion
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* add check to static runtime config changes
* fix test
* add changelog file
* fix review comments
* Apply suggestions from code review
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* update flag description
Co-authored-by: FFMMM <FFMMM@users.noreply.github.com>
* fix compilation error
* add static runtime config support
* fix test
* fix review comments
* fix log test
* Update .changelog/12329.txt
Co-authored-by: Dan Upton <daniel@floppy.co>
* transfer tests to runtime_test.go
* fix filewatcher Replace to not deadlock.
* avoid having lingering locks
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* split ReloadConfig func
* fix warning message
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* convert `FileWatcher` into an interface
* fix compilation errors
* fix tests
* extract func for adding and removing files
Co-authored-by: Ashwin Venkatesh <ashwin@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
Co-authored-by: FFMMM <FFMMM@users.noreply.github.com>
Co-authored-by: Daniel Upton <daniel@floppy.co>
Introduces the capability to configure TLS differently for Consul's
listeners/ports (i.e. HTTPS, gRPC, and the internal multiplexed RPC
port) which is useful in scenarios where you may want the HTTPS or
gRPC interfaces to present a certificate signed by a well-known/public
CA, rather than the certificate used for internal communication which
must have a SAN in the form `server.<dc>.consul`.
* Remove some usage of md5 from the system
OSS side of https://github.com/hashicorp/consul-enterprise/pull/1253
This is a potential security issue because an attacker could conceivably manipulate inputs to cause persistence files to collide, effectively deleting the persistence file for one of the colliding elements.
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
As part of removing the legacy ACL system ACL upgrading and the flag for
legacy ACLs is removed from Clients.
This commit also removes the 'acls' serf tag from client nodes. The tag is only ever read
from server nodes.
This commit also introduces a constant for the acl serf tag, to make it easier to track where
it is used.
The DebugConfig in the self endpoint can change at any time. It's not a stable API.
This commit adds the XDSPort to a stable part of the XDS api, and changes the envoy command to read
this new field.
It includes support for the old API as well, in case a newer CLI is used with an older API, and
adds a test for both cases.
Signed-off-by: Jakub Sokołowski <jakub@status.im>
* agent: add failures_before_warning setting
The new setting allows users to specify the number of check failures
that have to happen before a service status us updated to be `warning`.
This allows for more visibility for detected issues without creating
alerts and pinging administrators. Unlike the previous behavior, which
caused the service status to not update until it reached the configured
`failures_before_critical` setting, now Consul updates the Web UI view
with the `warning` state and the output of the service check when
`failures_before_warning` is breached.
The default value of `FailuresBeforeWarning` is the same as the value of
`FailuresBeforeCritical`, which allows for retaining the previous default
behavior of not triggering a warning.
When `FailuresBeforeWarning` is set to a value higher than that of
`FailuresBeforeCritical it has no effect as `FailuresBeforeCritical`
takes precedence.
Resolves: https://github.com/hashicorp/consul/issues/10680
Signed-off-by: Jakub Sokołowski <jakub@status.im>
Co-authored-by: Jakub Sokołowski <jakub@status.im>
This field has been unnecessary for a while now. It was always set to the same value
as PrimaryDatacenter. So we can remove the duplicate field and use PrimaryDatacenter
directly.
This change was made by GoLand refactor, which did most of the work for me.
The blocking query backend sets the default value on the server side.
The streaming backend does not using blocking queries, so we must set the timeout on
the client.
This field was documented as enabling TLS for outgoing RPC, but that was not the case.
All this field did was set the use_tls serf tag.
Instead of setting this field in a place far from where it is used, move the logic to where
the serf tag is set, so that the code is much more obvious.
tlsutil.Config already presents an excellent structure for this
configuration. Copying the runtime config fields to agent/consul.Config
makes code harder to trace, and provides no advantage.
Instead of copying the fields around, use the tlsutil.Config struct
directly instead.
This is one small step in removing the many layers of duplicate
configuration.
The bulk of this commit is moving the LeaderRoutineManager from the agent/consul package into its own package: lib/gort. It also got a renaming and its Start method now requires a context. Requiring that context required updating a whole bunch of other places in the code.
* Save exposed HTTP or GRPC ports to the agent's store
* Add those the health checks API so we can retrieve them from the API
* Change redirect-traffic command to also exclude those ports from inbound traffic redirection when expose.checks is set to true.
* WIP reloadable raft config
* Pre-define new raft gauges
* Update go-metrics to change gauge reset behaviour
* Update raft to pull in new metric and reloadable config
* Add snapshot persistance timing and installSnapshot to our 'protected' list as they can be infrequent but are important
* Update telemetry docs
* Update config and telemetry docs
* Add note to oldestLogAge on when it is visible
* Add changelog entry
* Update website/content/docs/agent/options.mdx
Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>
Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>
This adds support for the Incremental xDS protocol when using xDS v3. This is best reviewed commit-by-commit and will not be squashed when merged.
Union of all commit messages follows to give an overarching summary:
xds: exclusively support incremental xDS when using xDS v3
Attempts to use SoTW via v3 will fail, much like attempts to use incremental via v2 will fail.
Work around a strange older envoy behavior involving empty CDS responses over incremental xDS.
xds: various cleanups and refactors that don't strictly concern the addition of incremental xDS support
Dissolve the connectionInfo struct in favor of per-connection ResourceGenerators instead.
Do a better job of ensuring the xds code uses a well configured logger that accurately describes the connected client.
xds: pull out checkStreamACLs method in advance of a later commit
xds: rewrite SoTW xDS protocol tests to use protobufs rather than hand-rolled json strings
In the test we very lightly reuse some of the more boring protobuf construction helper code that is also technically under test. The important thing of the protocol tests is testing the protocol. The actual inputs and outputs are largely already handled by the xds golden output tests now so these protocol tests don't have to do double-duty.
This also updates the SoTW protocol test to exclusively use xDS v2 which is the only variant of SoTW that will be supported in Consul 1.10.
xds: default xds.Server.AuthCheckFrequency at use-time instead of construction-time
Setting this field to a value is equivalent to using the 'near' query paramter.
The intent is to sort the results by proximity to the node requesting
them. However with connect we send the results to envoy, which doesn't
care about the order, so setting this field is increasing the work
performed for no gain.
It is necessary to unset this field now because we would like connect
to use streaming, but streaming does not support sorting by proximity.
* add http2 ping checks
* fix test issue
* add h2ping check to config resources
* add new test and docs for h2ping
* fix grammatical inconsistency in H2PING documentation
* resolve rebase conflicts, add test for h2ping tls verification failure
* api documentation for h2ping
* update test config data with H2PING
* add H2PING to protocol buffers and update changelog
* fix typo in changelog entry
The streaming cache type for service health has no way to handle v1/health/ingress/:service queries as there is no equivalent topic that would return the appropriate data.
Ensure that attempts to use this endpoint will use the old cache-type for now so that they return appropriate data when streaming is enabled.
Some TLS servers require SNI, but the Golang HTTP client doesn't
include it in the ClientHello when connecting to an IP address. This
change adds a new TLSServerName field to health check definitions to
optionally set it. This fixes#9473.
Previously the ServiceManager had to run a separate goroutine so that it could block on a channel
send/receive instead of a lock. Using this mutex with TryLock allows us to cancel the lock when
the serviceConfigWatch is stopped.
Without this change removing the ServiceManager.Start goroutine would not be possible because
when AddService is called it acquires the stateLock. While that lock is held, if there are
existing watches for the service, the old watch will be stopped, and the goroutine holding the
lock will attempt to wait for that watcher goroutine to exit.
If the goroutine is handling an update (serviceConfigWatch.handleUpdate) then it can block on
acquiring the stateLock and deadlock the agent. With this change the context is cancelled as
and the goroutine will exit instead of waiting on the stateLock.
The ServiceManager.Start goroutine was used to serialize calls to
agent.addServiceInternal.
All the goroutines which sent events to the channel would block waiting
for a response from that same goroutine, which is effectively the same
as a synchronous call without any channels.
This commit removes the goroutine and channels, and instead calls
addServiceInternal directly. Since all of these goroutines will need to
take the agent.stateLock, the mutex handles the serializing of calls.
Move the field into the struct for addServiceLocked. Also don't require
setting a default value, so that the callers can leave it as nil if they
don't already have a snapshot.
Handle the decision to use ServiceManager in a single place. Instead of
calling ServiceManager.AddService, then calling back into
addServiceInternal, only call ServiceManager.AddService if we are going
to use it.
This change removes some small duplication and removes a branch from the
AddService flow.
The temprorary variables make it much harder to trace where and how struct
fields are used. If a field is only used a small number of times than
refer to the field directly.
The method is only used in tests, and only exists for legacy calls.
There was one other package which used this method in tests. Export
the AddServiceRequest and a couple of its fields so the new function can
be used in those tests.
I believe this commit also fixes a bug. Previously RPCMaxConnsPerClient was not being re-read from the RuntimeConfig, so passing it to Server.ReloadConfig was never changing the value.
Also improve the test runtime by not doing a lot of unnecessary work.
* create consul version metric with version label
* agent/agent.go: add pre-release Version as well as label
Co-Authored-By: Radha13 <kumari.radha3@gmail.com>
* verion and pre-release version labels.
* hyphen/- breaks prometheus
* Add Prometheus gauge defintion for version metric
* Add new metric to telemetry docs
Co-authored-by: Radha Kumari <kumari.radha3@gmail.com>
Co-authored-by: Aestek <thib.gilles@gmail.com>
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Previously the listener was being passed to a closure in a loop without
capturing the loop variable. The result is only the last listener is
used, so the http/https servers only listen on one address.
This problem is fixed by capturing the variable by passing it into a
function.
When a service is deregistered, we check whever matching services were
registered as sidecar along with it and deregister them as well.
To determine if a service is indeed a sidecar we check the
structs.ServiceNode.LocallyRegisteredAsSidecar property. However, to
avoid interal API leakage, it is excluded from JSON serialization,
meaning it is not persisted to disk either.
When the agent is restarted, this property lost and sidecars are no
longer deregistered along with their parent service.
To fix this, we now specifically save this property in the persisted
service file.
This allows for client agent to be run in a more stateless manner where they may be abruptly terminated and not expected to come back. If advertising a per-agent reconnect timeout using the advertise_reconnect_timeout configuration when that agent leaves, other agents will wait only that amount of time for the agent to come back before reaping it.
This has the advantageous side effect of causing servers to deregister the node/services/checks for that agent sooner than if the global reconnect_timeout was used.
This new package provides a client agent implementation of an interface
for fetching the health of services.
This approach has a number of benefits:
1. It provides a much more explicit interface. Instead of everything
dependency on `RPC()` and `Cache.Get()` for many unrelated things
they can depend on a type that are named according to the behaviour
it provides.
2. It gives us a single place to vary the behaviour and migrate to
a new form of RPC (gRPC). The current implementation has two options
(cache, or direct RPC), and in the future we will have more.
It is also a great opporunity to start adding `context.Context` args
to these operations, which in the future will allow us to cancel
the operations.
3. As a concequence of the first, in the Server agent where we make
these calls we can replace the current in-memory RPC calls with
a thin adapter for the real method. This removes the `net/rpc`
machinery from the call in places where it is not needed.
This new package is quite small right now, but I think we can expect it
to grow to a more reasonable size as other RPC calls are replaced.
This change also happens to replace two very similar implementations with
a single implementation.
In an upcoming change we will need to pass a grpc.ClientConnPool from
BaseDeps into Server. While looking at that change I noticed all of the
existing consulOption fields are already on BaseDeps.
Instead of duplicating the fields, we can create a struct used by
agent/consul, and use that struct in BaseDeps. This allows us to pass
along dependencies without translating them into different
representations.
I also looked at moving all of BaseDeps in agent/consul, however that
created some circular imports. Resolving those cycles wouldn't be too
bad (it was only an error in agent/consul being imported from
cache-types), however this change seems a little better by starting to
introduce some structure to BaseDeps.
This change is also a small step in reducing the scope of Agent.
Also remove some constants that were only used by tests, and move the
relevant comment to where the live configuration is set.
Removed some validation from NewServer and NewClient, as these are not
really runtime errors. They would be code errors, which will cause a
panic anyway, so no reason to handle them specially here.