We fixed a few related issues while we were in here. We now only let
services register checks with a matching token, and we also close out
service and check delete operations if the catalog deregister claims
it doesn't know about the ID of the service or check being deleted.
Log a warning instead of a success message when attempting to deregister a nonexistent service. In Consul 0.8 this can be changed to giving an error outright, but for now we can keep the idempotent delete behavior.
It turns out the indexer can only use strings as arguments when
creating a query. Cast `types.CheckID` to a `string` before calling
into `memdb`.
Ideally the indexer would be smart enough to do this at compile-time,
but I need to look into how to do this without reflection and the
runtime package. For the time being statically cast `types.CheckID`
to a `string` at the call sites.
This experiment was brought about because of variable naming
confusion where name and checkIDs were interchanged. Gave CheckID
an Qualified Type Name and chased downstream changes.
This knob tells consul whether it should prefer the WAN address (if set)
when making service lookups in remote datacenters. This enables
reachability for remote services which are behind a NAT.
Consolidate code duplication and tests into a single lib package. Most of these functions were from various **/util.go functions that couldn't be imported due to cyclic imports. The consul/lib package is intended to be a terminal node in an import DAG and a place to stash various consul-only helper functions. Pulled in hashicorp/go-uuid instead of consolidating UUID access.
see: https://github.com/hashicorp/consul/issues/1173#1173
Reasoning: somewhere during consul development Pause()/Resume() and
PauseSync()/ResumeSync() were added to protect larger changes to
agent's localState. A few of the places that it tries to protect are:
- (a *Agent) AddService(...) # part of the method
- (c *Command) handleReload(...) # almost the whole method
- (l *localState) antiEntropy(...)# isPaused() prevents syncChanges()
The main problem is, that in the middle of handleReload(...)'s
critical section it indirectly (loadServices()) calls AddService(...).
AddService() in turn calls Pause() to protect itself against
syncChanges(). At the end of AddService() a defered call to Resume() is
made.
With the current implementation, this releases
isPaused() "lock" in the middle of handleReload() allowing antiEntropy
to kick in while configuration reload is still in progress.
Specifically almost all services and probably all check are unloaded
when syncChanges() is allowed to run.
This in turn can causes massive service/check de-/re-registration,
and since checks are by default registered in the critical state,
a majority of services on a node can be marked as failing.
It's made worse with automation, often calling `consul reload` in close
proximity on many nodes in the cluster.
This change basically turns Pause()/Resume() into P()/V() of
a garden-variety semaphore. Allowing Pause() to be called multiple times,
and releasing isPaused() only after all matching/defered Resumes() are
called as well.
TODO/NOTE: as with many semaphore implementations, it might be reasonable
to panic() if l.paused ever becomes negative.