---
layout: docs
page_title: Recommendations for operating Consul at scale
description: >-
---
|
Consul can be deployed in either a single datacenter or across multiple datacenters by establishing WAN federation or peering connections. In this context, a datacenter refers to a named environment whose hosts can communicate with low networking latency. Typically, users map a Consul datacenter to a cloud provider region such as AWS us-east-1 or Azure East US.
|
To ensure consistency and high availability, Consul servers share data using the [Raft consensus protocol](/consul/docs/architecture/consensus). When persisting data, Consul uses BoltDB to store Raft logs and a custom file format for state snapshots. For more information, refer to [Consul architecture](/consul/docs/architecture).
|
|
## General deployment recommendations

### Data plane resiliency

To make service-to-service communication resilient against outages and failures, we recommend spreading multiple service instances for a service across fault domains:

- Spread service instances across _multiple infrastructure-level availability zones_.
- Spread service instances across _multiple runtime platform instances_, such as multiple Kubernetes clusters.
- Spread service instances across _multiple Consul datacenters_.

In the event that any individual domain experiences a failure, service failover ensures that healthy instances in other domains remain discoverable. Consul automatically provides service failover between instances within a single admin partition or datacenter.

Service failover across Consul datacenters must be configured in the datacenters before you can use it. Use one of the following methods to configure failover across datacenters:

- **If you are using Consul service mesh**: Implement failover using [service-resolver configuration entries](/consul/docs/connect/config-entries/service-resolver#failover). A configuration sketch follows this list.
- **If you are using Consul service discovery without service mesh**: Implement [geo-redundant failover using prepared queries](/consul/tutorials/developer-discovery/automate-geo-failover).
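
For example, a minimal `service-resolver` configuration entry might express cross-datacenter failover as in the following sketch. The service name `api` and the failover datacenter `dc2` are illustrative assumptions, not values taken from this guide:

```hcl
# service-resolver config entry (sketch): fail over the "api" service to
# datacenter dc2 when no healthy instances remain in the local datacenter.
Kind = "service-resolver"
Name = "api"

Failover = {
  "*" = {
    Datacenters = ["dc2"]
  }
}
```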
|
|
### Control plane resiliency

#### Datacenter size

To ensure resiliency, we recommend limiting deployments to a maximum of 5,000 Consul client agents per Consul datacenter. There are two reasons for this recommendation:

1. **Blast radius reduction**: When Consul suffers a server outage in a datacenter or region, _blast radius_ refers to the number of Consul clients or dataplanes attached to that datacenter that can no longer communicate as a result. We recommend limiting the total number of clients attached to a single Consul datacenter in order to reduce the size of its blast radius. Even though Consul is able to run clusters with 10,000 or more nodes, it takes longer to bring larger deployments back online after an outage, which impacts time to recovery.
2. **Agent gossip management**: Consul agents use the [gossip protocol](/consul/docs/architecture/gossip) to share membership information in a gossip pool. By default, all client agents in a single Consul datacenter are in a single gossip pool. Whenever an agent joins or leaves the gossip pool, the other agents propagate that event throughout the pool. If a Consul datacenter experiences _agent churn_, or a consistently high rate of agents joining and leaving a single pool, cluster performance may be affected by gossip messages being generated faster than they can be transmitted. The result is an ever-growing message queue.

To mitigate these risks, we recommend a maximum of 5,000 Consul client agents in a single gossip pool. There are several strategies for making gossip pools smaller:

1. Run exactly one Consul agent per host in the infrastructure.
1. Break up the single Consul datacenter into multiple smaller datacenters.
1. Enterprise users can define [network segments](/consul/docs/enterprise/network-segments) to divide the single gossip pool in the Consul datacenter into multiple smaller pools. A configuration sketch follows this list.
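
The following sketch shows how network segments might be declared; the segment names and ports are assumptions for the example, and the server and client settings belong in separate configuration files even though they are shown together here:

```hcl
# Server agent configuration (Consul Enterprise): define additional
# gossip segments, each listening on its own port.
segments = [
  { name = "alpha", port = 8303 },
  { name = "beta", port = 8304 }
]

# Client agent configuration: place the agent in a single segment.
segment = "alpha"
```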
|
|
If appropriate for your use case, we recommend breaking up a single Consul datacenter into multiple smaller datacenters. Running multiple datacenters reduces your network’s blast radius more than applying network segments.

Be aware that the number 5,000 is a heuristic for deployments. The number of agents you deploy per datacenter is limited by performance, not Consul itself. Because gossip stability risk is determined by _the rate of agent churn_ rather than _the number of nodes_, a gossip pool with mostly static nodes may be able to operate effectively with more than 5,000 agents. Meanwhile, a gossip pool with highly dynamic agents, such as spot fleet instances and serverless functions where 10% of agents are replaced each day, may need to be smaller than 5,000 agents.

For most use cases, a limit of 5,000 agents is appropriate. When the `consul.serf.queue.Intent` metric is consistently high, it is an indication that the gossip pool cannot keep up with the sustained level of churn. In this situation, reduce the churn by lowering the number of agents per datacenter.

#### Kubernetes-specific guidance

In Kubernetes, even though it is possible to deploy Consul agents inside pods alongside services running in the same pod, this unsupported deployment pattern has known performance issues at scale. At large volumes, pod registration and deregistration in Kubernetes causes gossip instability that can lead to cascading failures as services are marked unhealthy, resulting in further cluster churn.

In Consul v1.14 and higher, Consul on Kubernetes does not need to run client agents on every node in a cluster for service discovery and service mesh. This deployment configuration lowers Consul’s resource usage in the data plane. To learn more, refer to [simplified service mesh with Consul Dataplane](/consul/docs/connect/dataplane).

**If you use Kubernetes and Consul as a backend for Vault**: Use Vault’s integrated storage backend instead of Consul. A runtime dependency conflict prevents Consul dataplanes from being compatible with Vault. If you need to use Consul v1.14 and higher as a backend for Vault in your Kubernetes deployment, create a separate Consul datacenter that is not federated or peered to your other Consul servers. You can size this datacenter according to your needs and use it exclusively for backend storage for Vault.
|
|
## Consul server deployment recommendations

### Consul server runtimes

Consul servers can be deployed on a few different runtimes:

- **HashiCorp Cloud Platform (HCP) Consul (Managed)**. These Consul servers are deployed in a hosted environment managed by HCP. To get started with HCP Consul servers in Kubernetes or VM deployments, refer to the Deploy HCP Consul tutorial.
- **VMs or bare metal servers (Self-managed)**. To get started with Consul on VMs or bare metal servers, refer to the Deploy Consul server tutorial. For a full list of configuration options, refer to Agents Overview.
- **Kubernetes (Self-managed)**. To get started with Consul on Kubernetes, refer to the Deploy Consul on Kubernetes tutorial.
- **Other container environments, including Docker, Rancher, and Mesos (Self-managed)**.

When operating Consul at scale, self-managed VM or bare metal server deployments offer the most flexibility. Some Consul Enterprise features that can enhance fault tolerance and read scalability, such as [redundancy zones](/consul/docs/enterprise/redundancy) and [read replicas](/consul/docs/enterprise/read-scale), are not available to server agents on Kubernetes runtimes. To learn more, refer to [Consul Enterprise feature availability by runtime](/consul/docs/enterprise#feature-availability-by-runtime).
|
|
### Number of Consul servers

Determining the number of Consul servers to deploy on your network has two key considerations:

- **Fault tolerance**: The number of server outages your deployment can tolerate while maintaining quorum. Additional servers increase a network’s fault tolerance.
- **Performance scalability**: Adding servers to handle more requests also adds latency and slows the quorum process. Too many servers impede your network instead of helping it.

Fault tolerance should determine your initial decision for how many Consul server agents to deploy. Our recommendation for the number of servers to deploy depends on whether you have access to Consul Enterprise redundancy zones:

- **With redundancy zones**: Deploy 6 Consul servers across 3 availability zones. This deployment provides the performance of a 3 server deployment with the fault tolerance of a 7 server deployment. A configuration sketch follows this list.
- **Without redundancy zones**: Deploy 5 Consul servers across 3 availability zones. All 5 servers should be voting servers, not [read replicas](/consul/docs/enterprise/read-scale).
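
As an illustration of the redundancy zone option, the following sketch shows the relevant Consul Enterprise server settings; the metadata key `zone` and the availability zone name are assumptions for the example:

```hcl
# Server agent configuration (Consul Enterprise), repeated on each server.
# Servers in the same availability zone share the same node_meta value.
node_meta {
  zone = "us-east-1a"
}

autopilot {
  # Tell autopilot which node metadata key identifies the redundancy zone.
  redundancy_zone_tag = "zone"
}
```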
|
|
For more details, refer to [Improving Consul Resilience](/consul/docs/architecture/improving-consul-resilience).

### Server requirements

To ensure your server nodes are a sufficient size, we recommend reviewing [hardware sizing for Consul servers](/consul/tutorials/production-deploy/reference-architecture#hardware-sizing-for-consul-servers). If your network needs to handle heavy workloads, refer to our recommendations in [read-heavy workload sources and solutions](#read-heavy-workload-sources-and-solutions) and [write-heavy workload sources and solutions](#write-heavy-workload-sources-and-solutions).

#### File descriptors

#### Auto scaling groups

Auto scaling groups (ASGs) are infrastructure associations in cloud providers used to ensure a specific number of replicas are available for a deployment. When using ASGs for Consul servers, there are specific requirements and processes for bootstrapping Raft and maintaining quorum.

We recommend using the [`bootstrap-expect` command-line flag](/consul/docs/agent/config/cli-flags#_bootstrap_expect) during cluster creation, but when spawning new servers to add to a cluster or upgrading servers, do not configure them to automatically bootstrap. If `bootstrap-expect` is set on these replicas, it is possible for them to create a separate Raft system, which causes a _split brain_ and leads to errors and general cluster instability.
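
For example, an initial server configuration might set `bootstrap_expect` as in the following sketch; the cluster size and join addresses are placeholder assumptions:

```hcl
# Initial server configuration (sketch): expect three servers before
# bootstrapping Raft. Omit bootstrap_expect on servers added later.
server           = true
bootstrap_expect = 3
retry_join       = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
```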
|
|
#### NUMA architecture awareness

### Consistency modes

Consul offers different [consistency modes](/consul/api-docs/features/consistency#stale) for both its DNS and HTTP APIs.

#### DNS

We strongly recommend utilizing the recommendations in the [DNS Caching tutorial](/consul/tutorials/networking/dns-caching) for stale consistency. Consul uses [stale consistency mode](/consul/api-docs/features/consistency#stale) to optimize for performance over consistency in DNS API responses.

If you need to improve consistency at the expense of an increased workload, you can disable `dns_config.allow_stale` so that DNS lookups use the `default` consistency mode. In addition, you can configure TTLs for services and nodes to improve DNS resolving at the client and avoid server round trips.
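
A sketch of the related `dns_config` settings; the TTL values are illustrative assumptions rather than recommendations from this guide:

```hcl
dns_config {
  # Stale reads let any server answer DNS queries (the default behavior).
  allow_stale = true

  # Client-side TTLs on DNS answers reduce server round trips.
  node_ttl = "30s"
  service_ttl {
    "*" = "10s"
  }
}
```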
|
|
We also recommend that you do not configure [`dns_config.max_stale` to limit the staleness of DNS responses](/consul/api-docs/features/consistency#limiting-staleness-advanced-usage), as it may result in a prolonged outage if your Consul servers become overloaded. If bounded result consistency is required by a service, consider modifying the service to use consistent service discovery HTTP API queries instead of DNS lookups.

Avoid using [`dns_config.use_cache`](/consul/docs/agent/config/config-files#dns_use_cache) when operating Consul at scale. The agent cache allocates memory for each requested route and each allocation can live up to 3 days, resulting in severe memory issues.

#### HTTP API

By default, all HTTP API read requests use the [`default` consistency mode](/consul/api-docs/features/consistency#default-1) unless overridden on a per-request basis. We do not recommend changing the default consistency mode for HTTP API requests.

We also recommend that you do not configure [`http_config.discovery_max_stale`](/consul/api-docs/features/consistency#changing-the-default-consistency-mode-advanced-usage) to limit the staleness of HTTP responses.

## Resource usage and metrics recommendations

### Read-heavy workload sources and solutions

The highest CPU load usually belongs to the current leader. If the CPU load is high, request load is likely a major contributor. Check the following [server health metrics](/consul/docs/agent/telemetry#server-health):

- `consul.rpc.*` - Traditional RPC metrics. The most relevant metrics for understanding server CPU load in read-heavy workloads are `consul.rpc.query` and `consul.rpc.queries_blocking`.
- `consul.grpc.server.*` - Metrics for the number of streams being processed by the server.
- `consul.xds.server.*` - Metrics for the Envoy xDS resources being processed by the server. In Consul v1.14 and higher, these metrics are a significant source of read load. Refer to [Consul dataplanes](/consul/docs/connect/dataplane) for more information.
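
If you scrape these metrics with an external monitoring system, the agent's telemetry block can expose them; the Prometheus option and retention value below are assumptions for the sketch:

```hcl
telemetry {
  # Expose metrics at /v1/agent/metrics?format=prometheus for scraping.
  prometheus_retention_time = "60s"

  # Emit metric names without the host prefix so dashboards aggregate cleanly.
  disable_hostname = true
}
```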
|
|
Depending on your needs, choose one of the following strategies to mitigate server CPU load:

- The fastest mitigation strategy is to vertically scale servers. However, this strategy increases compute costs and does not scale indefinitely.
- The most effective long term mitigation strategy is to use [stale consistency mode](/consul/api-docs/features/consistency#stale) for as many read requests as possible. In Consul v1.12 and higher, operators can use the [`consul.rpc.server.call` metric](/consul/docs/agent/telemetry#server-workload) to identify the most frequent type of read requests made to the Consul servers. Cross reference the results with each endpoint’s [HTTP API documentation](/consul/api-docs) and use stale consistency for endpoints that support it.
- If most read requests already use stale consistency mode and you still need to reduce your request load, add more non-voting servers to your deployment. You can use either [redundancy zones](/consul/docs/enterprise/redundancy) or [read replicas](/consul/docs/enterprise/read-scale) to scale reads without impacting write latency. We recommend adding more servers to redundancy zones because they improve both fault tolerance and stale read scalability.
- In Consul v1.14 and higher, servers handle Envoy XDS streams for [Consul Dataplane deployments](/consul/docs/connect/dataplane) in stale consistency mode. As a result, server consistency mode is not configurable. Use the `consul.xds.server.*` metrics to identify issues related to XDS streams.
|
|
### Write-heavy workload sources and solutions

Consul is write-limited by disk I/O. For write-heavy workloads, we recommend using NVMe disks.

As a starting point, you should make sure your hardware meets the requirements for [large size server clusters](/consul/tutorials/production-deploy/reference-architecture#hardware-sizing-for-consul-servers): 7,500+ IOPS and 250+ MB/s disk throughput. IOPS should be around 5 to 10 times the expected write rate. Conduct further analysis around disk sizing and your expected write rates to understand your network’s specific needs.

If you use network storage, such as AWS EBS, we recommend provisioned I/O volumes. While general purpose volumes function properly, their burstable IOPS make it harder to capacity plan. A small peak in writes may not trigger alerts, but as usage grows you may reach a point where the burst limit runs out and workload performance worsens.

For more information, refer to the [server performance read/write tuning](/consul/docs/install/performance#read-write-tuning) documentation.
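
As one example of read/write tuning, the following sketch pins the Raft timing multiplier to the value typically used for production hardware; confirm it against the performance documentation for your environment:

```hcl
performance {
  # 1 uses the fastest Raft timing, recommended for production-grade servers.
  # Higher values tolerate slower servers at the cost of failure detection
  # and leader election speed.
  raft_multiplier = 1
}
```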
|
|
### Raft database performance sources and solutions

Consul servers use the [Raft consensus protocol](/consul/docs/architecture/consensus) to maintain a consistent and fault-tolerant state. Raft stores most Consul data in a MemDB database, which is an in-memory database with indexing. In order to tolerate restarts and power outages, Consul writes Raft logs to disk using BoltDB. Refer to [Agent telemetry](/consul/docs/agent/telemetry) for more information on metrics for detecting write health.

To monitor overall transaction performance, check for spikes in the [Transaction timing metrics](/consul/docs/agent/telemetry#transaction-timing). You can also use the [Raft replication capacity issues metrics](/consul/docs/agent/telemetry#raft-replication-capacity-issues) to monitor Raft log snapshots and restores, as spikes and longer durations can be symptoms of overall write and disk contention issues.

In Consul v1.11 and higher, you can also monitor Raft performance with the [`consul.raft.boltdb.*` metrics](/consul/docs/agent/telemetry#bolt-db-performance). We recommend monitoring `consul.raft.boltdb.storeLogs` for increased activity above normal operating patterns.

Refer to [Consul agent telemetry](/consul/docs/agent/telemetry#bolt-db-performance) for more information on agent metrics and how to use them.
|
|
#### Raft database size

Raft writes logs to BoltDB, which is designed as a single grow-only file. As a result, if you add 1GB of log entries and then you take a snapshot, only a small number of recent log entries may appear in the file. However, the actual file on disk never shrinks smaller than the 1GB size it grew.

If you need to reclaim disk space, use the `bbolt` CLI to copy the data to a new database and repoint to the new database in the process. However, be aware that the `bbolt compact` command requires the database to be offline while being pointed to the new database.

In many cases disk space is not a primary concern because Raft logs rarely grow larger than a small number of GiB, even in large clusters. However, an inflated file with lots of free space significantly degrades write performance overall due to _freelist management_.

After they are written to disk, Raft logs are eventually captured in a snapshot and log nodes are removed from BoltDB. BoltDB keeps track of the pages for the removed nodes in its freelist. BoltDB also writes this freelist to disk every time there is a Raft write. When the Raft log grows large quickly and then gets truncated, the size of the freelist can become very large. In the worst case reported to us, the freelist was over 10MB. When this large freelist is written to disk on every Raft commit, the result is a large write amplification for what should be a small Raft commit.

To figure out if a Consul server’s disk performance issues are the result of BoltDB’s freelist, try the following strategies:

- Compare network bandwidth inbound to the server against disk write bandwidth. If _disk write bandwidth_ is greater than or equal to 5 times the _inbound network bandwidth_, the disks are likely experiencing freelist management performance issues. While BoltDB freelist may cause problems at ratios lower than 5 to 1, high write bandwidth to inbound bandwidth ratios are a reliable indicator that BoltDB freelist is causing a problem.
- Use the [`consul.raft.leader.dispatchLog` metric](/consul/docs/agent/telemetry#server-health) to get information about how long it takes to write a batch of logs to disk.
- In Consul v1.13 and higher, you can use [Raft thread saturation metrics](/consul/docs/agent/telemetry#raft-thread-saturation) to figure out if Raft is experiencing back pressure and is unable to accept new work due to disk limitations.

In Consul v1.11 and higher, you can prevent BoltDB from writing the freelist to disk by setting [`raftboltdb.NoFreelistSync`](/consul/docs/agent/config/config-files#NoFreelistSync) to `true`. This setting causes BoltDB to retain the freelist in memory instead. However, be aware that when BoltDB restarts, it needs to scan the database file to manually create the freelist. Small delays in startup may occur. On a fast disk, we measured these delays at the order of tens of seconds for a raft.db file that was 5GiB in size with only 250MiB of used pages.
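
A sketch of that setting in the server agent configuration:

```hcl
raft_boltdb {
  # Keep the BoltDB freelist in memory instead of syncing it to disk on
  # every Raft write. Reduces write amplification; slows server startup.
  NoFreelistSync = true
}
```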
|
In general, set [`raftboltdb.NoFreelistSync`](/consul/docs/agent/config/config-files#NoFreelistSync) to `true` to produce the following effects:

- Reduce the amount of data written to disk
- Increase the amount of time it takes to load the raft.db file on startup

We recommend operators optimize networks according to their individual concerns. For example, if your server runs into disk performance issues but Consul servers do not restart often, setting [`raftboltdb.NoFreelistSync`](/consul/docs/agent/config/config-files#NoFreelistSync) to `true` may solve your problems but it causes issues for deployments with large database files and frequent server restarts.

#### Raft snapshots

Each state change produces a Raft log entry, and each Consul server receives the same sequence of log entries, which results in servers sharing the same state. The sequence of Raft logs is periodically compacted by the leader into a _snapshot_ of state history. These snapshots are internal to Raft and are not the same as the snapshots generated through Consul's API, although they contain the same data. Raft snapshots are stored in the server's data directory in the `raft/` folder, alongside the logs in `raft.db`.

When you add a new Consul server, it must catch up to the current state. It receives the latest snapshot from the leader followed by the sequence of logs between that snapshot and the leader’s current state. Each Raft log has a sequence number and each snapshot contains the last sequence number included in the snapshot. A combination of write-heavy workloads, a large state, congested networks, or busy servers makes it possible for new servers to struggle to catch up to the current state before the next log they need from the leader has already been truncated. The result is a _snapshot install loop_.

For example, if snapshot A on the leader has an index of 99 and the current index is 150, then when a new server comes online the leader streams snapshot A to the new server for it to restore. However, this snapshot only enables the new server to catch up to index 99. Not only does the new server still need to catch up to index 150, but the leader continued to commit Raft logs in the meantime.

When the leader takes snapshot B at index 199, it truncates the logs that accumulated between snapshot A and snapshot B.

Because the new server restored snapshot A, the new server has a current index of 99. It requests logs 100 to 150 because index 150 was the current index when it started the replication restore process. At this point, the leader recognizes that it only has logs 200 and higher, and does not have logs for indexes 100 to 150. The leader determines that the new server’s state is stale and starts the process over by sending the new server the latest snapshot, snapshot B.

Consul keeps a configurable number of [Raft trailing logs](/consul/docs/agent/config/config-files#raft_trailing_logs) to prevent the snapshot install loop from repeating. The trailing logs are the last logs that went into the snapshot, and the new server can more easily catch up to the current state using these logs. The default Raft trailing logs configuration value is suitable for most deployments.

In Consul v1.10 and higher, operators can try to prevent a snapshot install loop by monitoring and comparing Consul servers’ `consul.raft.rpc.installSnapshot` and `consul.raft.leader.oldestLogAge` timing metrics. Monitor these metrics for the following situations:

- After truncation, the lowest number on `consul.raft.leader.oldestLogAge` should always be at least two times higher than the lowest number for `consul.raft.rpc.installSnapshot`.
- If these metrics are too close, increase the number of Raft trailing logs, which increases `consul.raft.leader.oldestLogAge`. Do not set the Raft trailing logs higher than necessary, as it can negatively affect write throughput and latency. A configuration sketch follows this list.
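
For example, raising the trailing log count is a single server setting; the value below is purely illustrative and should only be changed when the metrics above indicate a problem:

```hcl
# Server agent configuration (sketch). Larger values retain more Raft logs
# after each snapshot, at the cost of write throughput and latency.
raft_trailing_logs = 20480
```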
|
|
For more information, refer to [Raft Replication Capacity Issues](/consul/docs/agent/telemetry#raft-replication-capacity-issues).

## Performance considerations for specific use cases

### Service discovery

Several factors influence Consul performance at scale when used primarily for its service discovery and health check features. The factors you have control over include:

- The overall number of registered service instances
- The use of [stale reads](/consul/api-docs/features/consistency#consul-dns-queries) for DNS queries
- Max number of service instances for a given service - Are there services that are registered in every node or in a large percentage of nodes in the fleet?
- Number of [watches](/consul/docs/dynamic-app-config/watches) monitoring for changes to a service.
- Rate of catalog updates, which is affected by the following events:
  - A service instance’s health check status changes
  - A service instance’s node loses connectivity to Consul servers
  - The contents of the [service definition file](/consul/docs/discovery/services#service-definition) changes
  - Service instances are registered or deregistered
  - Orchestrators such as Kubernetes or Nomad move a service to a new node

These factors can occur in combination with one another. Overall, the amount of work the servers complete for service discovery is the product of these factors:

- Data size, which changes as the number of services and service instances increases
- The catalog update rate
- The number of active watches

Because it is typical for these factors to increase in number as clusters grow, the CPU and network resources the servers require to distribute updates may eventually exceed linear growth.

In situations where you can’t run a Consul client agent alongside the service instance you want to register with Consul, such as instances hosted externally or on legacy infrastructure, we recommend using [Consul ESM](https://github.com/hashicorp/consul-esm).

Consul ESM enables health checks and monitoring for external services. When using Consul ESM, we recommend running multiple instances to ensure redundancy.
|
|
### Service mesh

Because Consul’s service mesh uses service discovery subsystems, service mesh performance is also optimized by deploying multiple small clusters with consistent numbers of service instances and watches. Service mesh performance is influenced by the following additional factors:

- The [transparent proxy](/consul/docs/connect/transparent-proxy) feature causes client agents to listen for service instance updates across all services instead of a subset. To prevent performance issues, we recommend that you do not use the permissive intention, `default: allow`, with the transparent proxy feature. When combined, every service instance update propagates to every proxy, which causes additional server load.
- When you use the [built in CA provider](/consul/docs/connect/ca/consul#built-in-ca), Consul leaders are responsible for signing certificates used for mTLS across the service mesh. The impact on CPU utilization depends on the total number of service instances and configured certificate TTLs. You can use the [CA provider configuration options](/consul/docs/agent/config/config-files#common-ca-config-options) to control the number of requests a server processes. We recommend adjusting [`csr_max_concurrent`](/consul/docs/agent/config/config-files#ca_csr_max_concurrent) and [`csr_max_per_second`](/consul/docs/agent/config/config-files#ca_csr_max_per_second) to suit your environment. A configuration sketch follows this list.
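
A sketch of those CA options in the agent configuration; the numeric values are placeholders that illustrate shape rather than tuning advice:

```hcl
connect {
  enabled = true

  ca_config {
    # Throttle certificate signing requests handled by the servers.
    csr_max_per_second = 50

    # Or cap concurrent signing operations instead (0 disables the cap).
    csr_max_concurrent = 0
  }
}
```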
|
||||||
|
|
||||||
### K/V store
|
### K/V store
|
||||||
|
|
||||||
While the K/V store in Consul has some similarities to object stores we recommend that you do not use it as a primary application data store.
|
While the K/V store in Consul has some similarities to object stores we recommend that you do not use it as a primary application data store.
|
||||||
|
|
||||||
When using Consul's K/V store for application configuration and metadata, we recommend the following to optimize performance:
|
When using Consul's K/V store for application configuration and metadata, we recommend the following to optimize performance:
|
||||||
Values must be below 512 KB and transactions should be below 64 operations.
|
|
||||||
The keyspace must be well bound. While 10,000 keys may not affect performance, millions of keys are more likely to cause performance issues.
|
|
||||||
Total data size must fit in memory, with additional room for indexes. We recommend that the in-memory size is 3 times the raw key value size.
|
|
||||||
Total data size should remain below 1 GB. Larger snapshots are possible on suitably fast hardware, but they significantly increase recovery times and the operational complexity needed for replication. We recommend limiting data size to keep the cluster healthy and able to recover during maintenance and outages.
|
|
||||||
The K/V store is optimized for reading. To know when you need to make changes to server resources and capacity, we recommend carefully monitoring update rates after they exceed more than a hundred updates per second across the cluster.
|
|
||||||
We recommend that you do not use the K/V store as a general purpose database or object store.
|
|
||||||
|
|
||||||
In addition ,we recommend that you do not use the blocking query mechanism to listen for updates when your K/V store’s update rate is high. When a K/V result is updated too fast, blocking query loops degrade into busy loops. These loops consume excessive client CPU and cause high server load until appropriately throttled. Watching large key prefixes is unlikely to solve the issue because returning the entire key prefix every time it updates can quickly consume a lot of bandwidth.
|
- Values must be below 512 KB and transactions should be below 64 operations.
|
||||||
|
- The keyspace must be well bound. While 10,000 keys may not affect performance, millions of keys are more likely to cause performance issues.
|
||||||
|
- Total data size must fit in memory, with additional room for indexes. We recommend that the in-memory size is 3 times the raw key value size.
|
||||||
|
- Total data size should remain below 1 GB. Larger snapshots are possible on suitably fast hardware, but they significantly increase recovery times and the operational complexity needed for replication. We recommend limiting data size to keep the cluster healthy and able to recover during maintenance and outages.
|
||||||
|
- The K/V store is optimized for reading. To know when you need to make changes to server resources and capacity, we recommend carefully monitoring update rates after they exceed more than a hundred updates per second across the cluster.
|
||||||
|
- We recommend that you do not use the K/V store as a general purpose database or object store.
|
||||||
|
|
||||||
|
In addition, we recommend that you do not use the [blocking query mechanism](/consul/api-docs/features/blocking) to listen for updates when your K/V store’s update rate is high. When a K/V result is updated too fast, blocking query loops degrade into busy loops. These loops consume excessive client CPU and cause high server load until appropriately throttled. Watching large key prefixes is unlikely to solve the issue because returning the entire key prefix every time it updates can quickly consume a lot of bandwidth.
|
||||||
|
|
||||||
### Backend for Vault
|
### Backend for Vault
|
||||||
|
|
||||||
At scale, using Consul as a backend for Vault results in increased memory and CPU utilization on Consul servers. It also produces unbounded growth in Consul’s data persistence layer that is proportional to both the amount of data being stored in Vault and the rate the data is updated.
|
At scale, using Consul as a backend for Vault results in increased memory and CPU utilization on Consul servers. It also produces unbounded growth in Consul’s data persistence layer that is proportional to both the amount of data being stored in Vault and the rate the data is updated.
|
||||||
|
|
||||||
In situations where Consul handles large amounts of data and has high write throughput, we recommend adding monitoring for the capacity and health of raft replication on servers. If the server experiences heavy load when the size of its stored data is large enough, a follower may be unable to catch up on replication and become a voter after restarting. This situation occurs when the time it takes for a server to restore from disk takes longer than it takes for the leader to write a new snapshot and truncate its logs. Refer to Raft Snapshots for more information.
|
In situations where Consul handles large amounts of data and has high write throughput, we recommend adding monitoring for the [capacity and health of raft replication on servers](/consul/docs/agent/telemetry#raft-replication-capacity-issues). If the server experiences heavy load when the size of its stored data is large enough, a follower may be unable to catch up on replication and become a voter after restarting. This situation occurs when the time it takes for a server to restore from disk takes longer than it takes for the leader to write a new snapshot and truncate its logs. Refer to [Raft snapshots](#raft-snapshots) for more information.
|
||||||
Vault v1.4 and higher provides integrated storage as its recommended storage option. If you want to use Consul as a storage backend for Vault, we recommend switching to integrated storage, as detailed in the storage migration guide. Integrated storage improves resiliency by preventing a Consul outage from also affecting Vault functionality.
|
|
||||||
|
Vault v1.4 and higher provides [integrated storage](/vault/docs/concepts/integrated-storage) as its recommended storage option. If you want to use Consul as a storage backend for Vault, we recommend switching to integrated storage, as detailed in the [storage migration guide](/vault/tutorials/raft/raft-migration). Integrated storage improves resiliency by preventing a Consul outage from also affecting Vault functionality.
|
Loading…
Reference in New Issue