diff --git a/website/source/api/index.html.md b/website/source/api/index.html.md
index 0740fe3364..aea0c3e750 100644
--- a/website/source/api/index.html.md
+++ b/website/source/api/index.html.md
@@ -77,6 +77,51 @@ to the supplied maximum `wait` time to spread out the wake up time of any
 concurrent requests. This adds up to `wait / 16` additional time to the maximum
 duration.
 
+### Implementation Details
+
+While the mechanism is relatively simple to work with, there are a few edge
+cases that must be handled correctly.
+
+  * **Reset the index if it goes backwards**. While indexes in general are
+    monotonically increasing (i.e. they should only ever increase as time passes),
+    there are several real-world scenarios in
+    which they can go backwards for a given query. Implementations must check
+    to see if a returned index is lower than the previous value,
+    and if it is, should reset the index to `0`, effectively restarting their blocking loop.
+    Failure to do so may cause the client to miss future updates for an unbounded
+    time, or to use an invalid index value that causes no blocking and increases
+    load on the servers. Cases where this can occur include:
+      * If a raft snapshot is restored on the servers with an older version of the data.
+      * KV list operations where an item with the highest index is removed.
+      * A Consul upgrade changes the way watches work to optimize them with more
+        granular indexes.
+
+  * **Sanity check index is greater than zero**. After the initial request (or a
+    reset as above) the `X-Consul-Index` returned _should_ always be greater than zero. It
+    is a bug in Consul if it is not; however, this has happened a few times and can
+    still be triggered on some older Consul versions. It's especially bad because it
+    causes blocking clients that do not guard against it to enter a busy loop, using excessive
+    client CPU and causing high load on servers.
+    It is _always_ safe to use an index of `1` to wait for updates when the data
+    being requested doesn't exist yet, so clients _should_ sanity check that their
+    index is at least 1 after each blocking response is handled to be sure they
+    actually block on the next request.
+
+  * **Rate limit**. The blocking query mechanism is reasonably efficient when updates
+    are relatively rare (on the order of tens of seconds to minutes between updates). In cases
+    where a result gets updated very fast, however (possibly during an outage or incident
+    with a badly behaved client), blocking query loops degrade into busy loops that
+    consume excessive client CPU and cause high server load. While it's possible to just add a sleep
+    to every iteration of the loop, this is **not** recommended since it causes update
+    delivery to be delayed in the happy case, and it can exacerbate the problem since
+    it increases the chance that the index has changed on the next request. Clients
+    _should_ instead rate limit the loop so that in the happy case they proceed without
+    waiting, but when values start to churn quickly they degrade into polling at a
+    reasonable rate (say every 15 seconds). Ideally this is done with an algorithm that
+    allows a couple of quick successive deliveries before it starts to limit the rate. A
+    [token bucket](https://en.wikipedia.org/wiki/Token_bucket) with a burst of 2 is a simple
+    way to achieve this.
+
 ### Hash-based Blocking Queries
 
 A limited number of agent endpoints also support blocking however because the