Update raft-boltdb to pull in new writeCapacity metric (#12646)

3 years ago · 15ddbbc686
4 changed files with 18 additions and 15 deletions
--- a/.changelog/12646.txt
+++ b/.changelog/12646.txt
@@ -0,0 +1,3 @@
+```release-note:improvement
+metrics: The `consul.raft.boltdb.writeCapacity` metric was added and indicates a theoretical number of writes/second that can be performed to Consul.
+```
--- a/go.mod
+++ b/go.mod
@@ -55,7 +55,7 @@ require (
 	github.com/hashicorp/raft v1.3.6
 	github.com/hashicorp/raft-autopilot v0.1.5
 	github.com/hashicorp/raft-boltdb v0.0.0-20211202195631-7d34b9fb3f42 // indirect
-	github.com/hashicorp/raft-boltdb/v2 v2.2.0
+	github.com/hashicorp/raft-boltdb/v2 v2.2.2
 	github.com/hashicorp/serf v0.9.7
 	github.com/hashicorp/vault/api v1.0.5-0.20200717191844-f687267c8086
 	github.com/hashicorp/vault/sdk v0.1.14-0.20200519221838-e0cfd64bc267
--- a/go.sum
+++ b/go.sum
@@ -304,8 +304,8 @@ github.com/hashicorp/raft-boltdb v0.0.0-20171010151810-6e5ba93211ea/go.mod h1:pN
 github.com/hashicorp/raft-boltdb v0.0.0-20210409134258-03c10cc3d4ea/go.mod h1:qRd6nFJYYS6Iqnc/8HcUmko2/2Gw8qTFEmxDLii6W5I=
 github.com/hashicorp/raft-boltdb v0.0.0-20211202195631-7d34b9fb3f42 h1:Ye8SofeDHJzu9xvvaMmpMkqHELWW7rTcXwdUR0CWW48=
 github.com/hashicorp/raft-boltdb v0.0.0-20211202195631-7d34b9fb3f42/go.mod h1:wcXL8otVu5cpJVLjcmq7pmfdRCdaP+xnvu7WQcKJAhs=
-github.com/hashicorp/raft-boltdb/v2 v2.2.0 h1:/CVN9LSAcH50L3yp2TsPFIpeyHn1m3VF6kiutlDE3Nw=
-github.com/hashicorp/raft-boltdb/v2 v2.2.0/go.mod h1:SgPUD5TP20z/bswEr210SnkUFvQP/YjKV95aaiTbeMQ=
+github.com/hashicorp/raft-boltdb/v2 v2.2.2 h1:rlkPtOllgIcKLxVT4nutqlTH2NRFn+tO1wwZk/4Dxqw=
+github.com/hashicorp/raft-boltdb/v2 v2.2.2/go.mod h1:N8YgaZgNJLpZC+h+by7vDu5rzsRgONThTEeUS3zWbfY=
 github.com/hashicorp/serf v0.9.6/go.mod h1:TXZNMjZQijwlDvp+r0b63xZ45H7JmCmgg4gpTwn9UV4=
 github.com/hashicorp/serf v0.9.7 h1:hkdgbqizGQHuU5IPqYM1JdSMV8nKfpuOnZYXssk9muY=
 github.com/hashicorp/serf v0.9.7/go.mod h1:TXZNMjZQijwlDvp+r0b63xZ45H7JmCmgg4gpTwn9UV4=
--- a/website/content/docs/agent/telemetry.mdx
+++ b/website/content/docs/agent/telemetry.mdx
@@ -273,9 +273,10 @@ This metric should be monitored to ensure that the license doesn't expire to pre

 | Metric Name                       | Description                                                      | Unit  | Type  |
 | :-------------------------------- | :--------------------------------------------------------------- | :---- | :---- |
-| `consul.raft.boltdb.freelistBytes`                  | Represents the number of bytes necessary to encode the freelist metadata. When [`raft_boltdb.NoFreelistSync`](/docs/agent/options#NoFreelistSync) is set to `false` these metadata bytes must also be written to disk for each committed log. | bytes | gauge   |
-| `consul.raft.boltdb.logsPerBatch`                   | Measures the number of logs being written per batch to the db. | logs | sample |
-| `consul.raft.boltdb.storeLogs`                      | Measures the amount of time spent writing logs to the db. | ms | timer |
+| `consul.raft.boltdb.freelistBytes` | Represents the number of bytes necessary to encode the freelist metadata. When [`raft_boltdb.NoFreelistSync`](/docs/agent/options#NoFreelistSync) is set to `false` these metadata bytes must also be written to disk for each committed log. | bytes | gauge   |
+| `consul.raft.boltdb.logsPerBatch`  | Measures the number of logs being written per batch to the db. | logs | sample |
+| `consul.raft.boltdb.storeLogs`     | Measures the amount of time spent writing logs to the db. | ms | timer |
+| `consul.raft.boltdb.writeCapacity` | Theoretical write capacity in terms of the number of logs that can be written per second. Each sample outputs what the capacity would be if future batched log write operations were similar to this one. This similarity encompasses four things: batch size, byte size, disk performance, and BoltDB performance. While none of these will be static and it's highly likely that individual samples of this metric will vary, aggregating this metric over a larger time window should provide a decent picture of how this BoltDB store can perform. | logs/second  | sample |


 ** Requirements: **
@@ -293,15 +294,13 @@ upper limit to the throughput of write operations within Consul.

 In Consul each write operation will turn into a single Raft log to be committed. Raft will process these
 logs and store them within Bolt DB in batches. Each call to store logs within Bolt DB is measured to record how long
-it took as well as how many logs were contained in the batch. Writing logs is this fashion is serialized so that
-a subsequent log storage operation can only be started after the previous one completed. Therefore the maximum number
-of log storage operations that can be performed each second can be calculated with the following equation: 
-`(1000 ms) / (consul.raft.boltdb.storeLogs ms/op)`. From there we can extrapolate the maximum number of Consul writes
-per second by multiplying that value by the `consul.raft.boltdb.logsPerBatch` metric's value. When log storage 
-operations are becoming slower you may not see an immediate decrease in write throughput to Consul due to increased 
-batch sizes of the each operation. However, the max batch size allowed is 64 logs. Therefore if the `logsPerBatch`
-metric is near 64 and the `storeLogs` metric is seeing increased time to write each batch to disk, then it is likely 
-that increased write latencies and other errors may occur.
+it took as well as how many logs were contained in the batch. Writing logs in this fashion is serialized so that
+a subsequent log storage operation can only be started after the previous one completed. The maximum number
+of logs that can be stored each second is represented by the `consul.raft.boltdb.writeCapacity`
+metric. When log storage operations become slower, you may not see an immediate decrease in write capacity
+due to the increased batch size of each operation. However, the max batch size allowed is 64 logs. Therefore, if
+the `logsPerBatch` metric is near 64 and the `storeLogs` metric is seeing increased time to write each batch to disk,
+then it is likely that increased write latencies and other errors may occur.

 There can be a number of potential issues that can cause this. Often times it could be performance of the underlying
 disks that is the issue. Other times it may be caused by Bolt DB behavior. Bolt DB keeps track of free space within
@@ -421,6 +420,7 @@ These metrics are used to monitor the health of the Consul servers.
 | `consul.raft.boltdb.txstats.split`                  | Counts the number of nodes split in the db since Consul was started. | splits | counter |
 | `consul.raft.boltdb.txstats.write`                  | Counts the number of writes to the db since Consul was started. | writes | counter |
 | `consul.raft.boltdb.txstats.writeTime`              | Measures the amount of time spent performing writes to the db. | ms  | timer |
+| `consul.raft.boltdb.writeCapacity`                  | Theoretical write capacity in terms of the number of logs that can be written per second. Each sample outputs what the capacity would be if future batched log write operations were similar to this one. This similarity encompasses four things: batch size, byte size, disk performance, and BoltDB performance. While none of these will be static and it's highly likely that individual samples of this metric will vary, aggregating this metric over a larger time window should provide a decent picture of how this BoltDB store can perform. | logs/second  | sample |
 | `consul.raft.commitNumLogs`                         | Measures the count of logs processed for application to the FSM in a single batch.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | logs                              | gauge   |
 | `consul.raft.commitTime`                            | Measures the time it takes to commit a new entry to the Raft log on the leader.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | ms                                | timer   |
 | `consul.raft.fsm.lastRestoreDuration`               | Measures the time taken to restore the FSM from a snapshot on an agent restart or from the leader calling installSnapshot. This is a gauge that holds it's value since most servers only restore during restarts which are typically infrequent.                                                                                                                                                                                                                                                                                                                                                                                                              | ms                                | gauge   |