* Added the docs for all the Grafana dashboards.

Author: Yasmin Lorin Kaygalak <ykaygala@villanova.edu>
Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
Co-authored-by: Blake Covarrubias <blake@covarrubi.as>

@@ -0,0 +1,3 @@

```release-note:feature
docs: added the docs for the Grafana dashboards
```

@@ -0,0 +1,133 @@

---
layout: docs
page_title: Dashboard for Consul dataplane metrics
description: >-
  This Grafana dashboard provides Consul dataplane metrics on Kubernetes deployments. Learn about the Grafana queries that produce the metrics and visualizations in this dashboard.
---

# Consul dataplane monitoring dashboard

This page provides reference information about the [Grafana dashboard configuration included in the `hashicorp/consul` GitHub repository](https://github.com/hashicorp/consul/blob/main/grafana/consuldataplanedashboard.json). The Consul dataplane dashboard provides a comprehensive view of service health, performance, and resource utilization within the Consul service mesh. Use it to monitor key metrics at both the cluster and service levels and to help ensure service reliability and performance.

![Preview of the Consul dataplane dashboard](/public/img/grafana/consul-dataplane-dashboard.png)

This image provides an example of the dashboard's visual layout and contents.

## Grafana queries overview

The Consul dataplane dashboard provides the following information about service mesh operations.

### Live service count

**Description:** Displays the total number of live Envoy proxies currently running in the service mesh. It helps you track the overall availability of services and identify outages or other widespread issues in the service mesh.

```promql
sum(envoy_server_live{app=~"$service"})
```
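
The `$service` selector in this and the following queries is a Grafana dashboard template variable that Grafana expands at query time. To run a query directly against Prometheus instead, substitute a concrete service name. The following sketch assumes a hypothetical service named `frontend`:

```promql
# Live Envoy proxies for a single, hypothetical service named "frontend"
sum(envoy_server_live{app=~"frontend"})
```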

### Total request success rate

**Description:** Tracks the percentage of successful requests across the service mesh. It excludes 4xx and 5xx response codes to focus on operational success. Use it to monitor the overall reliability of your services.

```promql
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!~"5|4",consul_destination_service=~"$service"}[10m])) / sum(irate(envoy_cluster_upstream_rq_xx{consul_destination_service=~"$service"}[10m]))
```

### Total failed requests

**Description:** This pie chart shows the total number of failed requests within the service mesh, categorized by service. It provides a visual breakdown of where failures are occurring, allowing operators to focus on problematic services.

```promql
sum(increase(envoy_cluster_upstream_rq_xx{envoy_response_code_class=~"4|5", consul_destination_service=~"$service"}[10m])) by (local_cluster)
```

### Requests per second

**Description:** This metric shows the rate of incoming HTTP requests per second to the selected services. It helps operators understand the current load on services and how much traffic they are processing.

```promql
sum(rate(envoy_http_downstream_rq_total{service=~"$service",envoy_http_conn_manager_prefix="public_listener"}[5m])) by (service)
```

### Unhealthy clusters

**Description:** This metric tracks the number of unhealthy clusters in the mesh, helping operators identify services that are experiencing issues and need attention to ensure operational health.

```promql
(sum(envoy_cluster_membership_healthy{app=~"$service",envoy_cluster_name=~"$cluster"}) - sum(envoy_cluster_membership_total{app=~"$service",envoy_cluster_name=~"$cluster"}))
```

### Heap size

**Description:** This metric displays the total memory heap size of the Envoy proxies. Monitoring heap size is essential to detect memory issues and ensure that services are operating efficiently.

```promql
sum(envoy_server_memory_heap_size{app=~"$service"})
```

### Allocated memory

**Description:** This metric shows the amount of memory allocated by the Envoy proxies. It helps operators monitor the resource usage of services to prevent memory overuse and optimize performance.

```promql
sum(envoy_server_memory_allocated{app=~"$service"})
```

### Avg uptime per node

**Description:** This metric calculates the average uptime of Envoy proxies across all nodes. It helps operators monitor the stability of services and detect potential issues with service restarts or crashes.

```promql
avg(envoy_server_uptime{app=~"$service"})
```

### Cluster state

**Description:** This metric indicates whether all clusters are healthy. The `bool` modifier makes the comparison return `1` when all cluster members are healthy and `0` otherwise. It provides a quick overview of the cluster state to ensure that there are no issues affecting service performance.

```promql
(sum(envoy_cluster_membership_total{app=~"$service",envoy_cluster_name=~"$cluster"})-sum(envoy_cluster_membership_healthy{app=~"$service",envoy_cluster_name=~"$cluster"})) == bool 0
```

### CPU throttled seconds by namespace

**Description:** This metric tracks the number of seconds during which CPU usage was throttled. Monitoring CPU throttling helps operators identify when services are exceeding their allocated CPU limits and may need optimization.

```promql
rate(container_cpu_cfs_throttled_seconds_total{namespace=~"$namespace"}[5m])
```
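
The query above returns one series per container. If you prefer a single line per pod, you can aggregate the same counter; this is an illustrative variant rather than a panel in the shipped dashboard:

```promql
# Total throttled CPU seconds per pod in the selected namespace
sum(rate(container_cpu_cfs_throttled_seconds_total{namespace=~"$namespace"}[5m])) by (pod)
```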

### Memory usage by pod limits

**Description:** This metric shows memory usage as a percentage of the memory limit set for each pod. The `label_replace` call copies the `exported_pod` label on the limit series into `pod` so that it can be joined with the usage series on `container` and `pod`. It helps operators ensure that services are staying within their allocated memory limits to avoid performance degradation.

```promql
100 * max (container_memory_working_set_bytes{namespace=~"$namespace"} / on(container, pod) label_replace(kube_pod_container_resource_limits{resource="memory"}, "pod", "$1", "exported_pod", "(.+)")) by (pod)
```

### CPU usage by pod limits

**Description:** This metric displays CPU usage as a percentage of the CPU limit set for each pod. Monitoring CPU usage helps operators optimize service performance and prevent CPU exhaustion.

```promql
100 * max(
  rate(container_cpu_usage_seconds_total{namespace=~"$namespace"}[5m]) /
  on(container, pod) label_replace(kube_pod_container_resource_limits{resource="cpu"}, "pod", "$1", "exported_pod", "(.+)")
) by (pod)
```

### Total active upstream connections

**Description:** This metric tracks the total number of active upstream connections to other services in the mesh. It provides insight into service dependencies and network load.

```promql
sum(envoy_cluster_upstream_cx_active{app=~"$service",envoy_cluster_name=~"$cluster"}) by (app, envoy_cluster_name)
```

### Total active downstream connections

**Description:** This metric tracks the total number of active downstream connections from services to clients. It helps operators monitor service load and ensure that services are able to handle the traffic effectively.

```promql
sum(envoy_http_downstream_cx_active{app=~"$service"})
```

@@ -0,0 +1,128 @@

---
layout: docs
page_title: Dashboard for Consul k8s control plane metrics
description: >-
  This documentation provides an overview of the Consul on Kubernetes Grafana dashboard. Learn about the metrics it displays and the queries that produce the metrics.
---

# Consul on Kubernetes control plane monitoring dashboard

This page provides reference information about the [Grafana dashboard configuration included in the `hashicorp/consul` GitHub repository](https://github.com/hashicorp/consul/blob/main/grafana/consul-k8s-control-plane-monitoring.json).

## Grafana queries overview

This dashboard provides the following information about service mesh operations.

### Number of Consul servers

**Description:** Displays the number of Consul servers currently active. This metric provides insight into the cluster's health and the number of Consul nodes running in the environment.

```promql
consul_consul_server_0_consul_members_servers{pod="consul-server-0"}
```

### Number of connected Consul dataplanes

**Description:** Tracks the number of connected Consul dataplanes. This metric helps operators understand how many Envoy sidecars are actively connected to the mesh.

```promql
count(consul_dataplane_envoy_connected)
```

### CPU usage in seconds (Consul servers)

**Description:** This metric shows the CPU usage of the Consul servers over time, helping operators monitor resource consumption.

```promql
rate(container_cpu_usage_seconds_total{container="consul", pod=~"consul-server-.*"}[5m])
```

### Memory usage (Consul servers)

**Description:** Displays the memory usage of the Consul servers. This metric helps ensure that the servers have sufficient memory resources for proper operation.

```promql
container_memory_working_set_bytes{container="consul", pod=~"consul-server-.*"}
```

### Disk read/write total per 5 minutes (Consul servers)

**Description:** Tracks the total bytes written to and read from disk by Consul servers within a 5 minute window. This metric helps assess the storage I/O load on Consul nodes.

```promql
sum(rate(container_fs_writes_bytes_total{pod=~"consul-server-.*", container="consul"}[5m])) by (pod, device)
```

```promql
sum(rate(container_fs_reads_bytes_total{pod=~"consul-server-.*", container="consul"}[5m])) by (pod, device)
```
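
If you want a single combined I/O series per server pod rather than separate read and write panels, you can add the two rates together. This is an illustrative variant rather than a panel in the shipped dashboard:

```promql
# Combined read + write throughput per Consul server pod
sum(rate(container_fs_writes_bytes_total{pod=~"consul-server-.*", container="consul"}[5m])) by (pod)
+ sum(rate(container_fs_reads_bytes_total{pod=~"consul-server-.*", container="consul"}[5m])) by (pod)
```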

### Received bytes total per 5 minutes (Consul servers)

**Description:** Tracks the total network bytes received by Consul servers within a 5 minute window. This metric helps assess the network load on Consul nodes.

```promql
sum(rate(container_network_receive_bytes_total{pod=~"consul-server-.*"}[5m])) by (pod)
```

### Memory limit (Consul servers)

**Description:** Displays the memory limit for Consul servers. This metric helps ensure that memory usage stays within the defined limits for each Consul server.

```promql
kube_pod_container_resource_limits{resource="memory", pod="consul-server-0"}
```

### CPU limit in seconds (Consul servers)

**Description:** Displays the CPU limit for Consul servers. Monitoring CPU limits helps operators ensure that the services are not constrained by resource limitations.

```promql
kube_pod_container_resource_limits{resource="cpu", pod="consul-server-0"}
```

### Disk usage (Consul servers)

**Description:** Shows the amount of filesystem storage used by Consul servers. This metric helps operators track disk usage and plan for capacity.

```promql
sum(container_fs_usage_bytes{}) by (pod)
```

```promql
sum(container_fs_usage_bytes{pod="consul-server-0"})
```

### CPU usage in seconds (Connect injector)

**Description:** Tracks the CPU usage of the Connect injector, which is responsible for injecting Envoy sidecars and performing other operations within the mesh. Monitoring this metric helps ensure that the Connect injector has adequate CPU resources.

```promql
rate(container_cpu_usage_seconds_total{pod=~".*-connect-injector-.*", container="sidecar-injector"}[5m])
```

### CPU limit in seconds (Connect injector)

**Description:** Displays the CPU limit for the Connect injector. Monitoring the CPU limit ensures that the Connect injector is not constrained by resource limitations.

```promql
max(kube_pod_container_resource_limits{resource="cpu", container="sidecar-injector"})
```

### Memory usage (Connect injector)

**Description:** Tracks the memory usage of the Connect injector. Monitoring this metric helps ensure the Connect injector has sufficient memory resources.

```promql
container_memory_working_set_bytes{pod=~".*-connect-injector-.*", container="sidecar-injector"}
```

### Memory limit (Connect injector)

**Description:** Displays the memory limit for the Connect injector, helping you monitor whether the service is nearing its resource limits.

```promql
max(kube_pod_container_resource_limits{resource="memory", container="sidecar-injector"})
```

@@ -0,0 +1,164 @@

---
layout: docs
page_title: Dashboard for Consul server metrics
description: >-
  This documentation provides an overview of the Consul server dashboard. Learn about the metrics it displays and the queries that produce the metrics.
---

# Consul server monitoring dashboard

This page provides reference information about the [Grafana dashboard configuration included in the `hashicorp/consul` GitHub repository](https://github.com/hashicorp/consul/blob/main/grafana/consul-server-monitoring.json).

## Grafana queries overview

This dashboard provides the following information about service mesh operations.

### Raft commit time

**Description:** This metric measures the time it takes to commit Raft log entries. Stable values are expected for a healthy cluster. High values can indicate issues with resources such as memory, CPU, or disk space.

```promql
consul_raft_commitTime
```
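
Consul's Prometheus telemetry typically exposes timers such as `consul_raft_commitTime` as summaries with quantile series plus `_sum` and `_count` counters. If that is the case in your setup, a sketch like the following charts the average commit time over 5 minute windows instead of the raw quantiles:

```promql
# Average Raft commit time over the last 5 minutes (assumes summary _sum/_count series exist)
rate(consul_raft_commitTime_sum[5m]) / rate(consul_raft_commitTime_count[5m])
```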

### Raft commits per 5 minutes

**Description:** This metric tracks the rate of Raft log commits emitted by the leader, showing how quickly changes are being applied across the cluster.

```promql
rate(consul_raft_apply[5m])
```

### Last contacted leader

**Description:** Measures the duration since the last contact with the Raft leader. Spikes in this metric can indicate network issues or an unavailable leader, which may affect cluster stability.

```promql
consul_raft_leader_lastContact != 0
```

### Election events

**Description:** Tracks Raft state transitions, which indicate leadership elections. Frequent transitions might suggest cluster instability and require investigation.

```promql
rate(consul_raft_state_candidate[1m])
```

```promql
rate(consul_raft_state_leader[1m])
```

### Autopilot health

**Description:** A boolean metric that shows a value of 1 when Autopilot is healthy and 0 when issues are detected. It helps confirm that the cluster has sufficient resources and an operational leader.

```promql
consul_autopilot_healthy
```
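
When building an alert or a panel that only highlights problems, you can filter the same metric down to the unhealthy state. This is an illustrative sketch rather than a panel in the shipped dashboard:

```promql
# Returns a series only while Autopilot reports an unhealthy cluster
consul_autopilot_healthy == 0
```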

### DNS queries per 5 minutes

**Description:** This metric tracks the rate of DNS queries per node, bucketed into 5 minute intervals. It helps monitor the query load on Consul's DNS service.

```promql
rate(consul_dns_domain_query_count[5m])
```

### DNS domain query time

**Description:** Measures the time spent handling DNS domain queries. Spikes in this metric may indicate high contention in the catalog or too many concurrent queries.

```promql
consul_dns_domain_query
```

### DNS reverse query time

**Description:** Tracks the time spent processing reverse DNS queries. Spikes in query time may indicate performance bottlenecks or increased workload.

```promql
consul_dns_ptr_query
```

### KV applies per 5 minutes

**Description:** This metric tracks the rate of key-value store applies over 5 minute intervals, indicating the operational load on Consul's KV store.

```promql
rate(consul_kvs_apply_count[5m])
```

### KV apply time

**Description:** Measures the time taken to apply updates to the key-value store. Spikes in this metric might suggest resource contention or client overload.

```promql
consul_kvs_apply
```

### Transaction apply time

**Description:** Tracks the time spent applying transaction operations in Consul, providing insights into potential bottlenecks in transaction operations.

```promql
consul_txn_apply
```

### ACL resolves per 5 minutes

**Description:** This metric tracks the rate of ACL token resolutions over 5 minute intervals. It provides insights into the activity related to ACL tokens within the cluster.

```promql
rate(consul_acl_ResolveToken_count[5m])
```

### ACL resolve token time

**Description:** Measures the time taken to resolve ACL tokens into their associated policies.

```promql
consul_acl_ResolveToken
```

### ACL updates per 5 minutes

**Description:** Tracks the rate of ACL updates over 5 minute intervals. This metric helps monitor changes in ACL configurations over time.

```promql
rate(consul_acl_apply_count[5m])
```

### ACL apply time

**Description:** Measures the time spent applying ACL changes. Spikes in apply time might suggest resource constraints or high operational load.

```promql
consul_acl_apply
```

### Catalog operations per 5 minutes

**Description:** Tracks the rate of register and deregister operations in the Consul catalog, providing insights into the churn of services within the cluster.

```promql
rate(consul_catalog_register_count[5m])
```

```promql
rate(consul_catalog_deregister_count[5m])
```
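
To visualize total catalog churn as a single series, you can add the two rates together. This is an illustrative variant rather than a panel in the shipped dashboard:

```promql
# Combined register + deregister operations per second
sum(rate(consul_catalog_register_count[5m])) + sum(rate(consul_catalog_deregister_count[5m]))
```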

### Catalog operation time

**Description:** Measures the time taken to complete catalog register or deregister operations. Spikes in this metric may indicate performance issues within the catalog.

```promql
consul_catalog_register
```

```promql
consul_catalog_deregister
```

@@ -0,0 +1,91 @@

---
layout: docs
page_title: Service Mesh Observability - Dashboards
description: >-
  This documentation provides an overview of several dashboards designed for monitoring and managing services within a Consul-managed Envoy service mesh. Learn how to enable access logs and configure key performance and operational metrics to ensure the reliability and performance of services in the service mesh.
---

# Dashboards for service mesh observability

This topic describes the configuration and usage of dashboards for monitoring and managing services within a Consul-managed Envoy service mesh. These dashboards provide critical insights into the health, performance, and resource utilization of services, and they are essential tools for ensuring the stability, efficiency, and reliability of your service mesh environment.

This page provides reference information about the Grafana dashboard configurations included in the [`grafana` directory in the `hashicorp/consul` GitHub repository](https://github.com/hashicorp/consul/tree/main/grafana).

## Dashboards overview

The repository includes the following dashboards:

- **Consul service-to-service dashboard**: Provides a detailed view of service-to-service communications, monitoring key metrics such as access logs, HTTP requests, error counts, response code distributions, and request success rates. The dashboard includes customizable filters for focusing on specific services and namespaces.

- **Consul service dashboard**: Tracks key metrics for Envoy proxies at the cluster and service levels, ensuring the performance and reliability of individual services within the mesh.

- **Consul dataplane dashboard**: Offers a comprehensive overview of service health and performance, including request success rates, resource utilization (CPU and memory), active connections, and cluster health. It helps operators maintain service reliability and optimize resource usage.

- **Consul k8s dashboard**: Focuses on monitoring the health and resource usage of the Consul control plane within a Kubernetes environment, ensuring the stability of the control plane.

- **Consul server dashboard**: Provides detailed monitoring of Consul servers, tracking key metrics such as server health, CPU and memory usage, disk I/O, and network performance. This dashboard is critical for ensuring the stability and performance of Consul servers within the service mesh.

## Enable Prometheus

Add the following configuration to your Consul Helm chart values to enable Prometheus metrics collection.

<CodeBlockConfig>

```yaml
global:
  metrics:
    enabled: true
    provider: "prometheus"
    enableAgentMetrics: true
    agentMetricsRetentionTime: "10m"

prometheus:
  enabled: true

ui:
  enabled: true
  metrics:
    enabled: true
    provider: "prometheus"
    baseURL: http://prometheus-server.consul
```

</CodeBlockConfig>
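
After Helm applies this configuration and Prometheus begins scraping the mesh, you can confirm that Envoy metrics are arriving by running a query the dashboards themselves rely on. For example, the following returns the number of live Envoy proxies reporting metrics; a non-empty result indicates that scraping is working:

```promql
# Count of live Envoy proxies currently reporting metrics to Prometheus
sum(envoy_server_live)
```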

## Enable access logs

Access log configuration is defined globally in the [`proxy-defaults`](/consul/docs/connect/config-entries/proxy-defaults#accesslogs) configuration entry.

The following example is a minimal configuration for enabling access logs:

<CodeTabs tabs={[ "HCL", "Kubernetes YAML", "JSON" ]}>

```hcl
Kind = "proxy-defaults"
Name = "global"
AccessLogs {
  Enabled = true
}
```

```yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global
spec:
  accessLogs:
    enabled: true
```

```json
{
  "Kind": "proxy-defaults",
  "Name": "global",
  "AccessLogs": {
    "Enabled": true
  }
}
```

</CodeTabs>
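
Once access logs are enabled, the Envoy sidecars emit them on the `consul-dataplane` container's standard output. If you ship container logs to a datasource such as Loki, you can verify that access logs are flowing with a log query like the one below; the `default` namespace is a placeholder for the namespace your services run in:

```promql
sum(count_over_time({container="consul-dataplane", namespace="default"}[5m]))
```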

@@ -0,0 +1,183 @@

---
layout: docs
page_title: Dashboard for monitoring Consul service-to-service mesh
description: >-
  This documentation provides an overview of the service-to-service dashboard. Learn about the metrics it displays and the queries that produce the metrics.
---

# Service-to-service dashboard

This page provides reference information about the [Grafana dashboard configuration included in the `hashicorp/consul` GitHub repository](https://github.com/hashicorp/consul/blob/main/grafana/consulservicetoservicedashboard.json). The service-to-service dashboard provides deep visibility into the traffic and interactions between services within the Consul service mesh. It focuses on critical metrics such as logs, error rates, traffic patterns, and success rates, all of which help operators maintain smooth and reliable service-to-service communication.

![Preview of the service to service mesh dashboard](/public/img/grafana/service-to-service-1.png)

## Grafana queries overview

This dashboard provides the following information about service mesh operations.

### Access logs and errors monitoring

**Description:** This section provides visibility into logs and errors related to service-to-service communications. It tracks and displays the number of logs generated, errors encountered, and the percentage of logs matching specific patterns.

### Total logs

**Description:** This metric counts the total number of log lines produced by Consul dataplane containers. It provides an overview of the volume of logs being generated for a specific namespace.

```promql
sum(count_over_time(({container="consul-dataplane",namespace=~"$namespace"})[$__interval]))
```

### Total logs containing "$searchable_pattern"

**Description:** This metric tracks the number of logs containing the specified pattern. It is useful for filtering and monitoring specific log events across the service mesh.

```promql
sum(count_over_time({container="consul-dataplane", namespace=~"$namespace"} |~ (?i)(?i)$searchable_pattern [$__interval]))
```

### Percentage of logs containing "$searchable_pattern"

**Description:** This metric calculates the percentage of logs containing the specified search pattern within the total log volume. It helps gauge the proportion of specific log events.

```promql
(sum(count_over_time({container="consul-dataplane", namespace=~"$namespace"} |~ (?i)(?i)$searchable_pattern [$__interval])) * 100) / sum(count_over_time({container="consul-dataplane", namespace=~"$namespace"} [$__interval]))
```

### Total response code distribution

**Description:** This pie chart visualizes the distribution of HTTP response codes, helping identify any 4xx and 5xx error codes generated by the services.

```promql
sum by(response_code) (count_over_time({container="consul-dataplane", namespace="$namespace"} | json | response_code != "0" | __error__= [$__range]))
```

### Rate of logs containing "$searchable_pattern" per service

**Description:** This metric monitors the rate at which specific patterns appear in logs per service, helping to detect trends and anomalies in log data.

```promql
sum by(app) (rate({container="consul-dataplane", namespace=~"$namespace"} |~ (?i)(?i)$searchable_pattern [$__range]))
```

### TCP metrics - service level

### TCP inbound and outbound bytes

**Description:** This metric tracks the inbound and outbound TCP bytes transferred between services. It is essential for understanding the network traffic flow between source and destination services.

```promql
sum(rate(envoy_tcp_downstream_cx_rx_bytes_total{}[10m])) by (service, destination_service)
```
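
The query above uses the received-bytes counter. To chart the transmit direction as well, Envoy exposes a matching counter for bytes sent downstream; the following is an illustrative counterpart rather than a guaranteed panel in the shipped dashboard:

```promql
# Outbound (transmitted) TCP bytes between services
sum(rate(envoy_tcp_downstream_cx_tx_bytes_total{}[10m])) by (service, destination_service)
```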

### TCP inbound and outbound bytes buffered

**Description:** This metric monitors the amount of TCP bytes buffered for inbound and outbound traffic between services. It helps identify potential network performance bottlenecks.

```promql
sum(rate(envoy_tcp_downstream_cx_rx_bytes_buffered{}[10m])) by (service, destination_service)
```

### TCP downstream connections

**Description:** This metric counts the number of active TCP downstream connections from the source service to the destination service, providing visibility into the volume of connections between services.

```promql
sum(envoy_tcp_downstream_cx_total) by (service, destination_service)
```

### Outbound traffic monitoring

![Preview of the outbound traffic monitoring](/public/img/grafana/service-to-service-2.png)

### Upstream traffic

**Description:** This metric monitors the upstream traffic from the source service to the destination service. It shows how much traffic is being sent between services.

```promql
sum(irate(envoy_cluster_upstream_rq_total{local_cluster=~"$source_service",consul_destination_service=~"$destination_service"}[10m]))
```

### Upstream request response timeliness

**Description:** This metric calculates the 95th percentile of upstream request response times between the source and destination services. It helps ensure that service communications are handled promptly.

```promql
histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{local_cluster=~"$source_service",consul_destination_target!=""}[10m])) by (le, consul_destination_target))
```

### Upstream request success rate

**Description:** This metric tracks the success rate of requests from the source service to the destination service, excluding 5xx errors. It helps assess the reliability of service communications.

```promql
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!="5",local_cluster=~"$source_service",consul_destination_service=~"$destination_service"}[10m]))
```
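
As written, this query returns the rate of non-5xx requests rather than a ratio. If you want to express it as a percentage of all upstream requests, you can divide by the total request rate, mirroring the approach used in the "Request success rate" panel below; this is an illustrative sketch:

```promql
# Share of upstream requests that did not return a 5xx response
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!="5",local_cluster=~"$source_service",consul_destination_service=~"$destination_service"}[10m]))
/ sum(irate(envoy_cluster_upstream_rq_xx{local_cluster=~"$source_service",consul_destination_service=~"$destination_service"}[10m]))
```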

### Inbound traffic monitoring

![Preview of the inbound traffic monitoring](/public/img/grafana/service-to-service-3.png)

### Requests sent

**Description:** This metric tracks the number of requests sent between the source service and destination service within the service mesh.

```promql
sum(irate(envoy_cluster_upstream_rq_total{consul_destination_datacenter="dc1",local_cluster=~"$source_service",consul_destination_service=~"$destination_service"}[10m])) by (consul_destination_service, local_cluster)
```

### Request success rate

**Description:** This metric tracks the success rate of requests from the source service to the destination service, helping identify failures or bottlenecks in communication.

```promql
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!="5",local_cluster=~"$source_service",consul_destination_service=~"$destination_service"}[10m])) by (local_cluster, consul_destination_service) / sum(irate(envoy_cluster_upstream_rq_xx{consul_destination_service=~"$destination_service"}[10m])) by (local_cluster, consul_destination_service)
```

### Response success by status code

**Description:** This metric tracks response success by status code for requests sent by the source service to the destination service.

```promql
sum(increase(envoy_http_downstream_rq_xx{local_cluster=~"$source_service",envoy_http_conn_manager_prefix="public_listener"}[10m])) by (local_cluster, envoy_response_code_class)
```

### Request duration

**Description:** This metric tracks the request duration between the source and destination services, helping monitor performance and response times.

```promql
histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{consul_destination_datacenter="dc1", consul_destination_service=~"$destination_service",local_cluster=~"$source_service"}[10m])) by (le, cluster, local_cluster, consul_destination_service))
```

### Response success

**Description:** This metric tracks the success of responses for the source service's requests across the service mesh.

```promql
sum(increase(envoy_http_downstream_rq_total{local_cluster=~"$source_service",envoy_http_conn_manager_prefix="public_listener"}[10m])) by (local_cluster)
```

### Request response rate

**Description:** This metric tracks the rate at which responses are being generated by the source service, providing insight into service activity and performance.

```promql
sum(irate(envoy_http_downstream_rq_total{local_cluster=~"$source_service",envoy_http_conn_manager_prefix="public_listener"}[10m])) by (local_cluster)
```

## Customization options

![Preview of the nginx service selection as a customization option on the service to service dashboard](/public/img/grafana/service-to-service-4.png)

The service-to-service dashboard includes a variety of customization options to help you analyze specific aspects of service-to-service communications, tailor the dashboard for more targeted monitoring, and enhance visibility into the service mesh.

- **Filter by source service:** You can filter the dashboard to focus on traffic originating from a specific source service, allowing you to analyze interactions from the source service to all destination services.

- **Filter by destination service:** Similarly, you can filter the dashboard by destination service to track and analyze the traffic received by specific services. This helps pinpoint communication issues or performance bottlenecks related to specific services.

- **Filter by namespace:** The dashboard can be customized to focus on service interactions within a particular namespace. This is especially useful for isolating issues in multi-tenant environments or clusters that operate with strict namespace isolation.

- **Log pattern search:** You can apply custom search patterns to logs to filter out specific log events of interest, such as error messages or specific HTTP status codes. This enables you to narrow down on specific log entries and identify patterns that may indicate issues.

- **Time range selection:** The dashboard supports dynamic time range selection, allowing you to focus on service interactions over specific time intervals. This helps in analyzing traffic trends, troubleshooting incidents, and understanding the timing of service communications.

By using these customization options, you can tailor the dashboard to your specific needs and ensure you are always monitoring the most relevant data for maintaining a healthy and performant service mesh.

@@ -0,0 +1,157 @@

---
layout: docs
page_title: Dashboard for monitoring Consul service mesh
description: >-
  This documentation provides an overview of the service dashboard. Learn about the metrics it displays and the queries that produce the metrics.
---

# Service dashboard

This page provides reference information about the [Grafana dashboard configuration included in the `hashicorp/consul` GitHub repository](https://github.com/hashicorp/consul/blob/main/grafana/consulservicedashboard.json). The service dashboard offers an overview of the performance and health of individual services within the Consul service mesh. It provides insights into service availability, request success rates, latency, and connection metrics. This dashboard is essential for maintaining optimal service performance and quickly identifying any issues with service communications.

![Preview of the service dashboard](/public/img/grafana/service-dashboard-2.png)

## Grafana queries overview

This dashboard provides the following information about service mesh operations.

### Total running services

**Description:** This gauge tracks the total number of running services within the mesh that are not labeled as `traffic-generator`. It provides an overall view of active services, helping operators maintain visibility into service availability.

```promql
sum(envoy_server_live{app!="traffic-generator"})
```

### Total request success rate

**Description:** This stat visualizes the success rate of upstream requests to the selected service. It filters out 4xx and 5xx response codes, providing a clearer picture of how well the service is handling requests successfully.

```promql
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!="5", envoy_response_code_class!="4", consul_destination_service=~"$service"}[10m])) / sum(irate(envoy_cluster_upstream_rq_xx{consul_destination_service=~"$service"}[10m]))
```

### Total failed request rate

**Description:** This stat tracks the rate of failed requests for the selected service, based on 4xx and 5xx response codes. It helps operators quickly identify whether there are issues with client requests or server errors for a specific service.

```promql
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class=~"4|5", consul_destination_service=~"$service"}[10m])) / sum(irate(envoy_cluster_upstream_rq_xx{consul_destination_service=~"$service"}[10m]))
```

### Average request response time in milliseconds

**Description:** This gauge displays the average response time for requests to the selected service, providing an overview of the service's performance and responsiveness.

```promql
sum(rate(envoy_cluster_upstream_rq_time_sum{consul_destination_service=~"$service"}[10m])) / sum(rate(envoy_cluster_upstream_rq_total{consul_destination_service=~"$service"}[10m]))
```

### Total failed requests

**Description:** This gauge tracks the total number of failed requests over a 10 minute window, categorized by service. It allows for easy identification of services that are experiencing high failure rates.

```promql
sum(increase(envoy_cluster_upstream_rq_xx{envoy_response_code_class=~"4|5", consul_destination_service=~"$service"}[10m])) by(local_cluster)
```

### Dataplane latency

**Description:** This stat tracks the dataplane latency percentiles (p50, p75, p90, p99.9) for the selected service. It gives detailed insights into the distribution of latency within the service's request handling, helping identify performance bottlenecks.

![Preview of the dataplane latency metrics](/public/img/grafana/service-dashboard-1.png)

```promql
histogram_quantile(0.50, sum by(le) (rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace=~"$namespace", local_cluster=~"$service"}[5m])))
```

```promql
histogram_quantile(0.75, sum by(le) (rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace=~"$namespace", local_cluster=~"$service"}[5m])))
```

```promql
histogram_quantile(0.90, sum by(le) (rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace=~"$namespace", local_cluster=~"$service"}[5m])))
```

```promql
histogram_quantile(0.999, sum by(le) (rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace=~"$namespace", local_cluster=~"$service"}[5m])))
```

### Total TCP inbound and outbound bytes

**Description:** This time series shows the total number of inbound and outbound TCP bytes for services within the mesh. It provides visibility into the data transfer patterns and volume between services.

```promql
sum(rate(envoy_tcp_downstream_cx_rx_bytes_total{}[10m])) by (local_cluster)
```

### Total TCP inbound and outbound bytes buffered

**Description:** This metric tracks the amount of TCP traffic buffered during inbound and outbound communications. It helps in identifying whether there is any potential latency caused by packet buffering or congestion.

```promql
sum(rate(envoy_tcp_downstream_cx_rx_bytes_buffered{}[10m])) by (local_cluster)
```

### Total TCP downstream active connections

**Description:** This metric counts the total number of active TCP downstream connections, providing an overview of the current connection load on the services within the mesh.

```promql
sum(rate(envoy_tcp_downstream_cx_total{}[10m])) by(local_cluster)
```

### Total active HTTP upstream connections

**Description:** This time series tracks the total number of active HTTP upstream connections for the selected service. It helps monitor connection patterns and assess load.

```promql
sum(envoy_cluster_upstream_cx_active{app=~"$service"}) by (app)
```

### Total active HTTP downstream connections

**Description:** This time series monitors the number of active HTTP downstream connections for the selected service, providing visibility into the current active user or client load on the service.

```promql
sum(envoy_http_downstream_cx_active{app=~"$service"}) by (app)
```

### Upstream requests by status code

**Description:** This metric tracks the number of upstream requests, grouped by HTTP status codes, giving insight into the health of the requests being made to upstream services from the selected service.

```promql
sum by(namespace,app,envoy_response_code_class) (rate(envoy_cluster_upstream_rq_xx[5m]))
```
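
To focus this panel on problems only, you can restrict the same query to the 4xx and 5xx response code classes, the same label filter other panels in this dashboard use. This is an illustrative variant rather than a panel in the shipped dashboard:

```promql
# Upstream error responses only, grouped by status code class
sum by(namespace,app,envoy_response_code_class) (rate(envoy_cluster_upstream_rq_xx{envoy_response_code_class=~"4|5"}[5m]))
```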

### Downstream requests by status code

**Description:** This time series tracks downstream HTTP requests by status code, showing how well the selected service is responding to downstream requests from clients.

```promql
sum(rate(envoy_http_downstream_rq_xx{envoy_http_conn_manager_prefix="public_listener"}[5m])) by (namespace, app, envoy_response_code_class)
```

### Connections rejected

**Description:** This metric tracks the number of connections rejected due to overload or overflow conditions on listeners. Monitoring these values helps identify if the service is under too much load or has insufficient capacity to handle the incoming connections.

```promql
rate(envoy_listener_downstream_cx_overload_reject{}[$__interval])
```

## Customization options

The service dashboard offers various customization options to help you analyze specific services and metrics. Use these options to tailor the dashboard to your needs and improve your ability to monitor and troubleshoot service health.

- **Filter by service:** You can filter the dashboard by the service you want to monitor. This helps narrow down the metrics to the service of interest and provides a more targeted view of its performance.

- **Filter by namespace:** The namespace filter allows operators to focus on a particular namespace in a multi-tenant or multi-namespace environment, isolating the service metrics within that specific context.

- **Time range selection:** The dashboard supports flexible time range selection, allowing operators to analyze service behavior over different time periods. This is helpful for pinpointing issues that may occur at specific times or during high-traffic periods.

- **Percentile latency tracking:** The dashboard allows operators to track multiple latency percentiles (p50, p75, p90, p99.9) to get a more detailed view of how the service performs across different levels of traffic load.