consul

R.B. Boyer 40336fd353 agent: fix several data races and bugs related to node-local alias checks (#5876)

The observed bug was that a full restart of a consul datacenter (servers and clients) in conjunction with a restart of a connect-flavored application with bring-your-own-service-registration logic would very frequently cause the envoy sidecar service check to never reflect the aliased service.

Over the course of investigation several bugs and unfortunate interactions were corrected:

(1) local.CheckState objects were only shallow copied, but the key piece of data that gets read and updated is one of the things not copied (the underlying Check with a Status field). When the stock code was run with the race detector enabled, this highly-relevant-to-the-test-scenario field was found to be racy.

Changes:

a) update the existing Clone method to include the Check field
b) copy-on-write when those fields need to change rather than incrementally updating them in place.

This made the observed behavior occur slightly less often.

(2) If the runLocal method for node-local alias check logic was ever flawed in any way, there was no fallback option. Those checks are purely edge-triggered, and failure to properly notice a single edge transition would leave the alias check incorrect until the next flap of the aliased check.

The change was to introduce a fallback timer to act as a control loop that double-checks that the alias check matches the aliased check every minute (borrowing the duration from the non-local alias check logic body).

This made the observed behavior eventually go away when it did occur.

(3) Originally I thought there were two main actions involved in the data race:

A. The act of adding the original check (from disk recovery) and its first health evaluation.
B. The act of the HTTP API requests coming in and resetting the local state when re-registering the same services and checks.

It took a while for me to realize that there's a third action at work:

C. The goroutines associated with the original check and the later checks.

The actual sequence of actions that was causing the bad behavior was that the API actions result in the original check being removed and re-added _without waiting for the original goroutine to terminate_. This means that for brief windows of time during check definition edits there are two goroutines that can be sending updates for the alias check status.

In extremely unlikely scenarios the original goroutine sees the aliased check start up in `critical` before being removed but does not get the notification about the nearly immediate update of that check to `passing`.

This is interlaced with the new goroutine coming up, initializing its base case to `passing` from the current state and then listening for new notifications of edge triggers.

If the original goroutine "finishes" its update, it then commits one more write into the local state of `critical` and exits, leaving the alias check no longer reflecting the underlying check.

The correction here is to enforce that the old goroutines must terminate before spawning the new one for alias checks.

2019-05-24 13:36:56 -05:00
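Fixes (2) and (3) in the commit message above can be illustrated with a minimal Go sketch. This is not the code from the commit: `statusStore`, `aliasCheck`, `replace`, and `defaultFallback` are hypothetical names invented for this example, and the store is a simplified stand-in for the agent's local state. The sketch shows an edge-triggered loop backed by a fallback ticker, plus a `Stop` that blocks until the old goroutine has exited so it cannot commit a stale write after its replacement has started.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// defaultFallback is the re-sync interval; the commit message mentions one minute.
const defaultFallback = time.Minute

// statusStore is a hypothetical stand-in for the agent's local check state.
type statusStore struct {
	mu     sync.Mutex
	status map[string]string
}

func newStatusStore() *statusStore {
	return &statusStore{status: make(map[string]string)}
}

func (s *statusStore) Set(id, status string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.status[id] = status
}

func (s *statusStore) Get(id string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.status[id]
}

// aliasCheck (hypothetical) mirrors the status of targetID onto aliasID.
type aliasCheck struct {
	store             *statusStore
	aliasID, targetID string
	notify            chan struct{} // edge-triggered wake-ups
	fallback          time.Duration
	stopCh, doneCh    chan struct{}
}

func newAliasCheck(store *statusStore, aliasID, targetID string) *aliasCheck {
	return &aliasCheck{
		store: store, aliasID: aliasID, targetID: targetID,
		notify: make(chan struct{}, 1), fallback: defaultFallback,
	}
}

// Start launches the goroutine that keeps the alias in sync.
func (c *aliasCheck) Start() {
	c.stopCh = make(chan struct{})
	c.doneCh = make(chan struct{})
	go c.run(c.stopCh, c.doneCh)
}

// Stop signals the goroutine and waits for it to exit, so no write from
// this check can land after Stop returns. Calling it twice is harmless.
func (c *aliasCheck) Stop() {
	if c.stopCh == nil {
		return
	}
	close(c.stopCh)
	<-c.doneCh
	c.stopCh = nil
}

func (c *aliasCheck) run(stopCh <-chan struct{}, doneCh chan<- struct{}) {
	defer close(doneCh)
	ticker := time.NewTicker(c.fallback)
	defer ticker.Stop()
	for {
		// Re-read the target on every wake-up, whatever the trigger was.
		c.store.Set(c.aliasID, c.store.Get(c.targetID))
		select {
		case <-stopCh:
			return
		case <-c.notify: // edge notification
		case <-ticker.C: // fallback control loop catches missed edges
		}
	}
}

// replace enforces the ordering from fix (3): the old goroutine must
// terminate before the new one is spawned.
func replace(old, next *aliasCheck) {
	if old != nil {
		old.Stop()
	}
	next.Start()
}

func main() {
	store := newStatusStore()
	store.Set("web", "passing")

	first := newAliasCheck(store, "web-alias", "web")
	first.Start()

	second := newAliasCheck(store, "web-alias", "web")
	replace(first, second) // first has fully exited before second starts
	second.Stop()

	fmt.Println(store.Get("web-alias"))
}
```

Because `Stop` waits on `doneCh`, at most one goroutine can ever be writing the alias status, which is the invariant the commit describes. The ticker only bounds how long a missed edge can leave the alias stale; it does not replace the edge notifications.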
..
acl.go	acl: a role binding rule for a role that does not exist should be ignored (#5778)	2019-05-03 14:22:44 -05:00
acl_cache.go	acl: adding support for kubernetes auth provider login (#5600)	2019-04-26 14:49:25 -05:00
acl_cache_test.go	acl: adding support for kubernetes auth provider login (#5600)	2019-04-26 14:49:25 -05:00
acl_legacy.go	acl: ACL Tokens can now be assigned an optional set of service identities (#5390)	2019-04-26 14:48:04 -05:00
acl_legacy_test.go	New ACLs (#4791)	2018-10-19 12:04:07 -04:00
acl_test.go	acl: adding support for kubernetes auth provider login (#5600)	2019-04-26 14:49:25 -05:00
catalog.go	agent: remove ConnectProxyServiceName	2018-06-14 09:41:49 -07:00
check_definition.go	agent: fix formatting	2018-11-07 02:16:03 -08:00
check_definition_test.go	agent: fix formatting	2018-11-07 02:16:03 -08:00
check_type.go	agent/structs: check is alias if node is empty	2018-07-12 09:36:11 -07:00
config_entry.go	Fix ConfigEntryResponse binary marshaller and ensure we watch the chan in ConfigEntry.Get even when no entry exists. (#5773)	2019-05-02 15:25:29 -04:00
config_entry_test.go	Fix ConfigEntryResponse binary marshaller and ensure we watch the chan in ConfigEntry.Get even when no entry exists. (#5773)	2019-05-02 15:25:29 -04:00
connect.go	fix typos reported by golangci-lint:misspell (#5434)	2019-03-06 11:13:28 -06:00
connect_ca.go	connect: tame thundering herd of CSRs on CA rotation (#5228)	2019-01-22 17:19:36 +00:00
connect_ca_test.go	connect: tame thundering herd of CSRs on CA rotation (#5228)	2019-01-22 17:19:36 +00:00
connect_proxy_config.go	Connect: allow configuring Envoy for L7 Observability (#5558)	2019-04-29 17:27:57 +01:00
connect_proxy_config_test.go	Add Proxy Upstreams to Service Definition (#4639)	2018-10-10 16:55:34 +01:00
connect_test.go	Added connect proxy config and local agent state setup on boot.	2018-06-14 09:41:57 -07:00
errors.go	Implement /v1/agent/health/service/<service name> endpoint (#3551)	2019-01-07 09:39:23 -05:00
intention.go	fsm: add Intention operations to transactions for internal use	2018-10-19 10:02:28 -07:00
intention_test.go	agent/consul: set precedence value on struct itself	2018-06-25 12:24:16 -07:00
operator.go	Move autopilot to a standalone package	2017-12-11 16:45:33 -08:00
prepared_query.go	Improve Connect with Prepared Queries (#5291)	2019-02-04 09:36:51 -05:00
prepared_query_test.go	agent: move agent/consul/structs to agent/structs	2017-08-09 14:32:12 +02:00
sanitize_oss.go	Update to use a consulent build tag instead of just ent (#5759)	2019-05-01 11:11:27 -04:00
service_definition.go	fix typos reported by golangci-lint:misspell (#5434)	2019-03-06 11:13:28 -06:00
service_definition_test.go	Add Proxy Upstreams to Service Definition (#4639)	2018-10-10 16:55:34 +01:00
snapshot.go	agent: move agent/consul/structs to agent/structs	2017-08-09 14:32:12 +02:00
structs.go	agent: fix several data races and bugs related to node-local alias checks (#5876)	2019-05-24 13:36:56 -05:00
structs_filtering_test.go	Implement data filtering of some endpoints (#5579)	2019-04-16 12:00:15 -04:00
structs_test.go	Add integration test for central config; fix central config WIP (#5752)	2019-05-01 16:39:31 -07:00
testing_catalog.go	Add SidecarService Syntax sugar to Service Definition (#4686)	2018-10-10 16:55:34 +01:00
testing_connect_proxy_config.go	Add -sidecar-for and new /agent/service/:service_id endpoint (#4691)	2018-10-10 16:55:34 +01:00
testing_intention.go	agent: use testing intention to get valid intentions	2018-06-14 09:41:43 -07:00
testing_service_definition.go	Add Proxy Upstreams to Service Definition (#4639)	2018-10-10 16:55:34 +01:00
txn.go	txn: update existing txn api docs with new operations	2019-01-15 16:54:07 -08:00