Hi HashiCorp team,
I work at eBay in Amsterdam and have written fabio, a zero-conf, Consul-aware HTTP(S) load balancer in Go that can be used instead of consul-template plus HAProxy, Varnish, Apache, or nginx. It builds its routing table from the service status and from the host/path prefixes that services publish via tags. When it detects a change, it switches the routing table dynamically without a restart. It also supports canary testing by routing N% of traffic to a variable number of instances of a service.
https://github.com/eBay/fabio
We route all of marktplaats.nl (5-10k+ req/sec at peak) and parts of kijiji.it, eBay's classifieds sites in the Netherlands and Italy, through it.
The code has been under development for the last five months, is now running in production, and I was able to open-source it a couple of days ago.
Consul has quickly become the state and coordination backend of our microservices architecture across multiple platforms in several countries. We're very happy with the quality and ease of use of your products, and I'm personally looking forward to testing Nomad.
It would be cool if you could list the project on your Consul tools page. Please let me know whether it meets your standards and expectations for listed tools. Feel free to ask questions.
Two of the changes are in tests; the one of consequence is in the API.
As explained in #1308, this can cause conflicts with downstream programs.
Fixes #1308.
see: https://github.com/hashicorp/consul/issues/1173#1173
Reasoning: at some point during Consul development, Pause()/Resume()
and PauseSync()/ResumeSync() were added to protect larger changes to
the agent's localState. A few of the places they protect are:
- (a *Agent) AddService(...)        # part of the method
- (c *Command) handleReload(...)    # almost the whole method
- (l *localState) antiEntropy(...)  # isPaused() prevents syncChanges()
The main problem is that, in the middle of its critical section,
handleReload(...) indirectly calls AddService(...) via loadServices().
AddService() in turn calls Pause() to protect itself against
syncChanges(), and at the end of AddService() a deferred call to
Resume() is made.
With the current implementation, this releases the isPaused() "lock"
in the middle of handleReload(), allowing antiEntropy to kick in while
the configuration reload is still in progress. Specifically, almost
all services and probably all checks are unloaded at the moment
syncChanges() is allowed to run.
This in turn can cause massive service/check de-/re-registration, and
since checks are registered in the critical state by default, a
majority of services on a node can be marked as failing.
This is made worse by automation, which often calls `consul reload` on
many nodes in the cluster at nearly the same time.
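The failure mode described above can be reproduced in a few lines. This is a hypothetical model, not Consul's code: it assumes the pre-change pause state behaves like a single flag that Resume() simply clears, and `boolPauseState` and `addService` are illustrative stand-ins for localState and AddService().

```go
package main

import "fmt"

// boolPauseState is a hypothetical stand-in for the agent's localState
// before this change, assuming Resume() simply clears the pause flag.
type boolPauseState struct{ paused bool }

func (l *boolPauseState) Pause()         { l.paused = true }
func (l *boolPauseState) Resume()        { l.paused = false }
func (l *boolPauseState) isPaused() bool { return l.paused }

// addService mimics AddService(): it pauses and defers Resume().
func addService(l *boolPauseState) {
	l.Pause()
	defer l.Resume()
	// ... register the service ...
}

func main() {
	l := &boolPauseState{}
	l.Pause()     // handleReload() enters its critical section
	addService(l) // loadServices() -> AddService()
	// The reload is still in progress, but antiEntropy would now see
	// isPaused() == false and be free to run syncChanges().
	fmt.Println("paused during reload:", l.isPaused()) // prints: paused during reload: false
	l.Resume() // handleReload() leaves its critical section
}
```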
This change basically turns Pause()/Resume() into the P()/V() of a
garden-variety semaphore: Pause() can be called multiple times, and
isPaused() is released only after all matching (deferred) Resume()
calls have been made as well.
TODO/NOTE: as with many semaphore implementations, it might be reasonable
to panic() if l.paused ever becomes negative.
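A minimal sketch of the counter-based approach, assuming a mutex-guarded `paused int` field on a simplified `localState` stand-in (the real struct, its locking, and the exact method bodies may differ). Unlike a textbook P(), Pause() here never blocks. It only records that one more caller holds the pause, and isPaused() stays true until every hold has been released. The panic on a negative count follows the TODO above.

```go
package main

import (
	"fmt"
	"sync"
)

// localState is a simplified, hypothetical stand-in; only the pause
// counter is modelled, and the mutex is an assumption of this sketch.
type localState struct {
	mu     sync.Mutex
	paused int // number of outstanding Pause() calls
}

// Pause records one more hold on the pause (the "P" side).
func (l *localState) Pause() {
	l.mu.Lock()
	l.paused++
	l.mu.Unlock()
}

// Resume releases one hold (the "V" side) and panics on an unmatched call.
func (l *localState) Resume() {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.paused--
	if l.paused < 0 {
		panic("localState: Resume() called without matching Pause()")
	}
}

// isPaused reports whether any caller still holds the pause.
func (l *localState) isPaused() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.paused > 0
}

func main() {
	l := &localState{}
	l.Pause() // handleReload() enters its critical section
	func() {  // AddService(), reached via loadServices()
		l.Pause()
		defer l.Resume()
	}()
	fmt.Println(l.isPaused()) // prints: true (reload still holds its pause)
	l.Resume()
	fmt.Println(l.isPaused()) // prints: false
}
```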