Added the docs for all the grafana dashboards. (#21795)

* Added the docs for all the grafana dashboards.

 Author:   Yasmin Lorin Kaygalak <ykaygala@villanova.edu>

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

Co-authored-by: Blake Covarrubias <blake@covarrubi.as>
pull/21919/head
Yasmin Lorin Kaygalak 2024-11-05 10:06:29 -05:00 committed by GitHub
parent f376b6a227
commit 32515c77f2
15 changed files with 889 additions and 1 deletions

.changelog/21795.txt

@ -0,0 +1,3 @@
```release-note:feature
docs: added the docs for the grafana dashboards
```


@ -0,0 +1,133 @@
---
layout: docs
page_title: Dashboard for Consul dataplane metrics
description: >-
  This Grafana dashboard provides Consul dataplane metrics on Kubernetes deployments. Learn about the Grafana queries that produce the metrics and visualizations in this dashboard.
---
# Consul dataplane monitoring dashboard
This page provides reference information about the [Grafana dashboard configuration included in the `hashicorp/consul` GitHub repository](https://github.com/hashicorp/consul/blob/main/grafana/consuldataplanedashboard.json). The Consul dataplane dashboard provides a comprehensive view of the service health, performance, and resource utilization within the Consul service mesh. You can monitor key metrics at both the cluster and service levels with this dashboard. It can help you ensure service reliability and performance.
![Preview of the Consul dataplane dashboard](/public/img/grafana/consul-dataplane-dashboard.png)
This image provides an example of the dashboard's visual layout and contents.
## Grafana queries overview
The Consul dataplane dashboard provides the following information about service mesh operations.
### Live service count
**Description:** Displays the total number of live Envoy proxies currently running in the service mesh. It helps track the overall availability of services and identify any outages or other widespread issues in the service mesh.
```promql
sum(envoy_server_live{app=~"$service"})
```
### Total request success rate
**Description:** Tracks the percentage of successful requests across the service mesh. It excludes 4xx and 5xx response codes to focus on operational success. Use it to monitor the overall reliability of your services.
```promql
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!~"5|4",consul_destination_service=~"$service"}[10m])) / sum(irate(envoy_cluster_upstream_rq_xx{consul_destination_service=~"$service"}[10m]))
```
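The ratio in the query above can be sketched in Python to make the arithmetic explicit. This is only an illustration: the function name and the sample rates are hypothetical, and the input stands in for the per-class request rates that `irate(envoy_cluster_upstream_rq_xx[10m])` would produce.

```python
from typing import Dict, Optional


def success_rate(rates_by_class: Dict[str, float]) -> Optional[float]:
    """Share of the request rate whose response class is neither 4xx nor 5xx.

    Keys are values of the envoy_response_code_class label ("2", "3", "4", "5").
    """
    total = sum(rates_by_class.values())
    if total == 0:
        # No traffic: the ratio is undefined, so report nothing.
        return None
    ok = sum(rate for cls, rate in rates_by_class.items() if cls not in ("4", "5"))
    return ok / total


# Hypothetical per-second request rates, keyed by response code class.
example_rates = {"2": 90.0, "3": 2.0, "4": 5.0, "5": 3.0}
print(success_rate(example_rates))
```

The result is a fraction between 0 and 1, so a panel that shows a percentage has to scale it by 100.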
### Total failed requests
**Description:** This pie chart shows the total number of failed requests within the service mesh, categorized by service. It provides a visual breakdown of where failures are occurring, allowing operators to focus on problematic services.
```promql
sum(increase(envoy_cluster_upstream_rq_xx{envoy_response_code_class=~"4|5", consul_destination_service=~"$service"}[10m])) by (local_cluster)
```
### Requests per second
**Description:** This metric shows the rate of incoming HTTP requests per second to the selected services. It helps operators understand the current load on services and how much traffic they are processing.
```promql
sum(rate(envoy_http_downstream_rq_total{service=~"$service",envoy_http_conn_manager_prefix="public_listener"}[5m])) by (service)
```
### Unhealthy clusters
**Description:** This metric tracks the number of unhealthy clusters in the mesh, helping operators identify services that are experiencing issues and need attention to ensure operational health.
```promql
(sum(envoy_cluster_membership_healthy{app=~"$service",envoy_cluster_name=~"$cluster"}) - sum(envoy_cluster_membership_total{app=~"$service",envoy_cluster_name=~"$cluster"}))
```
### Heap size
**Description:** This metric displays the total memory heap size of the Envoy proxies. Monitoring heap size is essential to detect memory issues and ensure that services are operating efficiently.
```promql
SUM(envoy_server_memory_heap_size{app=~"$service"})
```
### Allocated memory
**Description:** This metric shows the amount of memory allocated by the Envoy proxies. It helps operators monitor the resource usage of services to prevent memory overuse and optimize performance.
```promql
SUM(envoy_server_memory_allocated{app=~"$service"})
```
### Avg uptime per node
**Description:** This metric calculates the average uptime of Envoy proxies across all nodes. It helps operators monitor the stability of services and detect potential issues with service restarts or crashes.
```promql
avg(envoy_server_uptime{app=~"$service"})
```
### Cluster state
**Description:** This metric indicates whether all clusters are healthy. It provides a quick overview of the cluster state to ensure that there are no issues affecting service performance.
```promql
(sum(envoy_cluster_membership_total{app=~"$service",envoy_cluster_name=~"$cluster"})-sum(envoy_cluster_membership_healthy{app=~"$service",envoy_cluster_name=~"$cluster"})) == bool 0
```
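The `== bool 0` comparison returns 1 or 0 instead of filtering series. A minimal Python sketch of the same check, with a hypothetical function name and inputs that stand in for the two summed gauges:

```python
def cluster_state(total_members: int, healthy_members: int) -> int:
    """Mimic `(total - healthy) == bool 0`.

    Returns 1 when every cluster member is healthy and 0 otherwise.
    """
    return 1 if (total_members - healthy_members) == 0 else 0


print(cluster_state(total_members=6, healthy_members=6))
print(cluster_state(total_members=6, healthy_members=5))
```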
### CPU throttled seconds by namespace
**Description:** This metric tracks the number of seconds during which CPU usage was throttled. Monitoring CPU throttling helps operators identify when services are exceeding their allocated CPU limits and may need optimization.
```promql
rate(container_cpu_cfs_throttled_seconds_total{namespace=~"$namespace"}[5m])
```
### Memory usage by pod limits
**Description:** This metric shows memory usage as a percentage of the memory limit set for each pod. It helps operators ensure that services are staying within their allocated memory limits to avoid performance degradation.
```promql
100 * max (container_memory_working_set_bytes{namespace=~"$namespace"} / on(container, pod) label_replace(kube_pod_container_resource_limits{resource="memory"}, "pod", "$1", "exported_pod", "(.+)")) by (pod)
```
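As a rough illustration of what this query computes, the following Python sketch divides each container's working set by its limit and keeps the highest percentage per pod. The function name, the dictionary layout, and the sample values are assumptions made for the example. Containers without a matching limit are skipped, which mirrors how the `on(container, pod)` match produces no result for unmatched series.

```python
from typing import Dict, Tuple

ContainerKey = Tuple[str, str]  # (pod, container)


def memory_pct_by_pod(
    usage_bytes: Dict[ContainerKey, float],
    limit_bytes: Dict[ContainerKey, float],
) -> Dict[str, float]:
    """Highest usage-to-limit percentage among the containers of each pod."""
    result: Dict[str, float] = {}
    for key, used in usage_bytes.items():
        limit = limit_bytes.get(key)
        if not limit:
            # No limit series (or a zero limit): skipped in this sketch.
            continue
        pod = key[0]
        pct = 100 * used / limit
        result[pod] = max(result.get(pod, pct), pct)
    return result


# Hypothetical samples for one pod with two containers.
usage = {("web-0", "app"): 50.0, ("web-0", "consul-dataplane"): 30.0}
limits = {("web-0", "app"): 100.0, ("web-0", "consul-dataplane"): 40.0}
print(memory_pct_by_pod(usage, limits))
```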
### CPU usage by pod limits
**Description:** This metric displays CPU usage as a percentage of the CPU limit set for each pod. Monitoring CPU usage helps operators optimize service performance and prevent CPU exhaustion.
```promql
100 * max(
container_memory_working_set_bytes{namespace=~"$namespace"} /
on(container, pod) label_replace(kube_pod_container_resource_limits{resource="memory"}, "pod", "$1", "exported_pod", "(.+)")
) by (pod)
```
### Total active upstream connections
**Description:** This metric tracks the total number of active upstream connections to other services in the mesh. It provides insight into service dependencies and network load.
```promql
sum(envoy_cluster_upstream_cx_active{app=~"$service",envoy_cluster_name=~"$cluster"}) by (app, envoy_cluster_name)
```
### Total active downstream connections
**Description:** This metric tracks the total number of active downstream connections from services to clients. It helps operators monitor service load and ensure that services are able to handle the traffic effectively.
```promql
sum(envoy_http_downstream_cx_active{app=~"$service"})
```


@ -0,0 +1,128 @@
---
layout: docs
page_title: Dashboard for Consul k8s control plane metrics
description: >-
  This documentation provides an overview of the Consul on Kubernetes Grafana Dashboard. Learn about the metrics it displays and the queries that produce the metrics.
---
# Consul on Kubernetes control plane monitoring dashboard
This page provides reference information about the [Grafana dashboard configuration included in the `hashicorp/consul` GitHub repository](https://github.com/hashicorp/consul/blob/main/grafana/consul-k8s-control-plane-monitoring.json).
## Grafana queries overview
This dashboard provides the following information about service mesh operations.
### Number of Consul servers
**Description:** Displays the number of Consul servers currently active. This metric provides insight into the cluster's health and the number of Consul nodes running in the environment.
```promql
consul_consul_server_0_consul_members_servers{pod="consul-server-0"}
```
### Number of connected Consul dataplanes
**Description:** Tracks the number of connected Consul dataplanes. This metric helps operators understand how many Envoy sidecars are actively connected to the mesh.
```promql
count(consul_dataplane_envoy_connected)
```
### CPU usage in seconds (Consul servers)
**Description:** This metric shows the CPU usage of the Consul servers over time, helping operators monitor resource consumption.
```promql
rate(container_cpu_usage_seconds_total{container="consul", pod=~"consul-server-.*"}[5m])
```
### Memory usage (Consul servers)
**Description:** Displays the memory usage of the Consul servers. This metric helps ensure that the servers have sufficient memory resources for proper operation.
```promql
container_memory_working_set_bytes{container="consul", pod=~"consul-server-.*"}
```
### Disk read/write total per 5 minutes (Consul servers)
**Description:** Tracks the rate of bytes written to and read from disk by Consul servers, averaged over a 5 minute window and grouped by pod and device. This metric helps assess the disk I/O load on Consul nodes.
```promql
sum(rate(container_fs_writes_bytes_total{pod=~"consul-server-.*", container="consul"}[5m])) by (pod, device)
```
```promql
sum(rate(container_fs_reads_bytes_total{pod=~"consul-server-.*", container="consul"}[5m])) by (pod, device)
```
### Received bytes total per 5 minutes (Consul servers)
**Description:** Tracks the total network bytes received by Consul servers within a 5 minute window. This metric helps assess the network load on Consul nodes.
```promql
sum(rate(container_network_receive_bytes_total{pod=~"consul-server-.*"}[5m])) by (pod)
```
### Memory limit (Consul servers)
**Description:** Displays the memory limit for Consul servers. This metric ensures that memory usage stays within the defined limits for each Consul server.
```promql
kube_pod_container_resource_limits{resource="memory", pod="consul-server-0"}
```
### CPU limit in seconds (Consul servers)
**Description:** Displays the CPU limit for Consul servers. Monitoring CPU limits helps operators ensure that the services are not constrained by resource limitations.
```promql
kube_pod_container_resource_limits{resource="cpu", pod="consul-server-0"}
```
### Disk usage (Consul servers)
**Description:** Shows the amount of filesystem storage used by Consul servers. This metric helps operators track disk usage and plan for capacity.
```promql
sum(container_fs_usage_bytes{}) by (pod)
```
```promql
sum(container_fs_usage_bytes{pod="consul-server-0"})
```
### CPU usage in seconds (Connect injector)
**Description:** Tracks the CPU usage of the Connect injector, which is responsible for injecting Envoy sidecars and other operations within the mesh. Monitoring this helps ensure that Connect injector has adequate CPU resources.
```promql
rate(container_cpu_usage_seconds_total{pod=~".*-connect-injector-.*", container="sidecar-injector"}[5m])
```
### CPU limit in seconds (Connect injector)
**Description:** Displays the CPU limit for the Connect injector. Monitoring the CPU limits ensures that Connect injector is not constrained by resource limitations.
```promql
max(kube_pod_container_resource_limits{resource="cpu", container="sidecar-injector"})
```
### Memory usage (Connect injector)
**Description:** Tracks the memory usage of the Connect injector. Monitoring this helps ensure the Connect injector has sufficient memory resources.
```promql
container_memory_working_set_bytes{pod=~".*-connect-injector-.*", container="sidecar-injector"}
```
### Memory limit (Connect injector)
**Description:** Displays the memory limit for the Connect injector, helping to monitor if the service is nearing its resource limits.
```promql
max(kube_pod_container_resource_limits{resource="memory", container="sidecar-injector"})
```


@ -0,0 +1,164 @@
---
layout: docs
page_title: Dashboard for Consul server metrics
description: >-
  This documentation provides an overview of the Consul Server Dashboard. Learn about the metrics it displays and the queries that produce the metrics.
---
# Consul server monitoring dashboard
This page provides reference information about the [Grafana dashboard configuration included in the `hashicorp/consul` GitHub repository](https://github.com/hashicorp/consul/blob/main/grafana/consul-server-monitoring.json).
## Grafana queries overview
This dashboard provides the following information about service mesh operations.
### Raft commit time
**Description:** This metric measures the time it takes to commit Raft log entries. Stable values are expected for a healthy cluster. High values can indicate issues with resources such as memory, CPU, or disk space.
```promql
consul_raft_commitTime
```
### Raft commits per 5 minutes
**Description:** This metric tracks the rate of Raft log commits emitted by the leader, showing how quickly changes are being applied across the cluster.
```promql
rate(consul_raft_apply[5m])
```
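The dashboard applies `rate()` to several counters. The following Python sketch shows the core idea under simplifying assumptions: it sums counter growth across the window, treats any drop as a counter reset, and divides by the elapsed time. Prometheus additionally extrapolates to the window boundaries, which this sketch leaves out. The function name and the samples are hypothetical.

```python
from typing import List, Optional, Tuple


def simple_rate(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Per-second increase of a counter.

    samples: (timestamp_seconds, counter_value) pairs, oldest first.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        # A lower value means the counter reset, so count growth from zero.
        increase += value if value < previous else value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical scrape results with a counter reset before the third sample.
raft_apply_samples = [(0, 100), (60, 160), (120, 10), (180, 40)]
print(simple_rate(raft_apply_samples))
```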
### Last contacted leader
**Description:** Measures the duration since the last contact with the Raft leader. Spikes in this metric can indicate network issues or an unavailable leader, which may affect cluster stability.
```promql
consul_raft_leader_lastContact != 0
```
### Election events
**Description:** Tracks Raft state transitions, which indicate leadership elections. Frequent transitions might suggest cluster instability and require investigation.
```promql
rate(consul_raft_state_candidate[1m])
```
```promql
rate(consul_raft_state_leader[1m])
```
### Autopilot health
**Description:** A boolean metric that shows a value of 1 when Autopilot is healthy and 0 when issues are detected. Use it to confirm that the cluster has sufficient resources and an operational leader.
```promql
consul_autopilot_healthy
```
### DNS queries per 5 minutes
**Description:** This metric tracks the rate of DNS queries per node, bucketed into 5 minute intervals. It helps monitor the query load on Consul's DNS service.
```promql
rate(consul_dns_domain_query_count[5m])
```
### DNS domain query time
**Description:** Measures the time spent handling DNS domain queries. Spikes in this metric may indicate high contention in the catalog or too many concurrent queries.
```promql
consul_dns_domain_query
```
### DNS reverse query time
**Description:** Tracks the time spent processing reverse DNS queries. Spikes in query time may indicate performance bottlenecks or increased workload.
```promql
consul_dns_ptr_query
```
### KV applies per 5 minutes
**Description:** This metric tracks the rate of key-value store applies over 5 minute intervals, indicating the operational load on Consul's KV store.
```promql
rate(consul_kvs_apply_count[5m])
```
### KV apply time
**Description:** Measures the time taken to apply updates to the key-value store. Spikes in this metric might suggest resource contention or client overload.
```promql
consul_kvs_apply
```
### Transaction apply time
**Description:** Tracks the time spent applying transaction operations in Consul, providing insights into potential bottlenecks in transaction operations.
```promql
consul_txn_apply
```
### ACL resolves per 5 minutes
**Description:** This metric tracks the rate of ACL token resolutions over 5 minute intervals. It provides insights into the activity related to ACL tokens within the cluster.
```promql
rate(consul_acl_ResolveToken_count[5m])
```
### ACL resolve token time
**Description:** Measures the time taken to resolve ACL tokens into their associated policies.
```promql
consul_acl_ResolveToken
```
### ACL updates per 5 minutes
**Description:** Tracks the rate of ACL updates over 5 minute intervals. This metric helps monitor changes in ACL configurations over time.
```promql
rate(consul_acl_apply_count[5m])
```
### ACL apply time
**Description:** Measures the time spent applying ACL changes. Spikes in apply time might suggest resource constraints or high operational load.
```promql
consul_acl_apply
```
### Catalog operations per 5 minutes
**Description:** Tracks the rate of register and deregister operations in the Consul catalog, providing insights into the churn of services within the cluster.
```promql
rate(consul_catalog_register_count[5m])
```
```promql
rate(consul_catalog_deregister_count[5m])
```
### Catalog operation time
**Description:** Measures the time taken to complete catalog register or deregister operations. Spikes in this metric may indicate performance issues within the catalog.
```promql
consul_catalog_register
```
```promql
consul_catalog_deregister
```


@ -0,0 +1,91 @@
---
layout: docs
page_title: Service Mesh Observability - Dashboards
description: >-
  This documentation provides an overview of several dashboards designed for monitoring and managing services within a Consul-managed Envoy service mesh. Learn how to enable access logs and configure key performance and operational metrics to ensure the reliability and performance of services in the service mesh.
---
# Dashboards for service mesh observability
This topic describes the configuration and usage of dashboards for monitoring and managing services within a Consul-managed Envoy service mesh. These dashboards provide critical insights into the health, performance, and resource utilization of services. The dashboards described here are essential tools for ensuring the stability, efficiency, and reliability of your service mesh environment.
This page provides reference information about the Grafana dashboard configurations included in the [`grafana` directory in the `hashicorp/consul` GitHub repository](https://github.com/hashicorp/consul/tree/main/grafana).
## Dashboards overview
The repository includes the following dashboards:
- **Consul service-to-service dashboard**: Provides a detailed view of service-to-service communications, monitoring key metrics like access logs, HTTP requests, error counts, response code distributions, and request success rates. The dashboard includes customizable filters for focusing on specific services and namespaces.
- **Consul service dashboard**: Tracks key metrics for Envoy proxies at the cluster and service levels, ensuring the performance and reliability of individual services within the mesh.
- **Consul dataplane dashboard**: Offers a comprehensive overview of service health and performance, including request success rates, resource utilization (CPU and memory), active connections, and cluster health. It helps operators maintain service reliability and optimize resource usage.
- **Consul k8s dashboard**: Focuses on monitoring the health and resource usage of the Consul control plane within a Kubernetes environment, ensuring the stability of the control plane.
- **Consul server dashboard**: Provides detailed monitoring of Consul servers, tracking key metrics like server health, CPU and memory usage, disk I/O, and network performance. This dashboard is critical for ensuring the stability and performance of Consul servers within the service mesh.
## Enable Prometheus
Add the following configuration to your Consul Helm chart to enable Prometheus.
<CodeBlockConfig>
```yaml
global:
  metrics:
    enabled: true
    provider: "prometheus"
    enableAgentMetrics: true
    agentMetricsRetentionTime: "10m"
prometheus:
  enabled: true
ui:
  enabled: true
  metrics:
    enabled: true
    provider: "prometheus"
    baseURL: http://prometheus-server.consul
```
</CodeBlockConfig>
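Once Prometheus is running, any query from these dashboards can also be sent to its HTTP API. The sketch below only builds the request URL for the instant-query endpoint (`/api/v1/query`). It assumes that the `baseURL` value from the configuration above is reachable from where you run it, and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# The base URL is taken from the Helm values above and may differ in your cluster.
print(instant_query_url("http://prometheus-server.consul", "consul_autopilot_healthy"))
```

Fetching the resulting URL with any HTTP client returns a JSON document that contains the current value of the queried expression.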
## Enable access logs
Access log configuration is defined globally in the [`proxy-defaults`](/consul/docs/connect/config-entries/proxy-defaults#accesslogs) configuration entry.
The following example is a minimal configuration for enabling access logs:
<CodeTabs tabs={[ "HCL", "Kubernetes YAML", "JSON" ]}>
```hcl
Kind = "proxy-defaults"
Name = "global"
AccessLogs {
  Enabled = true
}
```
```yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global
spec:
  accessLogs:
    enabled: true
```
```json
{
  "Kind": "proxy-defaults",
  "Name": "global",
  "AccessLogs": {
    "Enabled": true
  }
}
```
</CodeTabs>


@ -0,0 +1,183 @@
---
layout: docs
page_title: Dashboard for monitoring Consul service-to-service mesh
description: >-
  This documentation provides an overview of the Service-to-service dashboard. Learn about the metrics it displays and the queries that produce the metrics.
---
# Service-to-service dashboard
This page provides reference information about the [Grafana dashboard configuration included in the `hashicorp/consul` GitHub repository](https://github.com/hashicorp/consul/blob/main/grafana/consulservicetoservicedashboard.json). The service-to-service dashboard provides deep visibility into the traffic and interactions between services within the Consul service mesh. It focuses on critical metrics such as logs, error rates, traffic patterns, and success rates, all of which help operators maintain smooth and reliable service-to-service communication.
![Preview of the service to service mesh dashboard](/public/img/grafana/service-to-service-1.png)
## Grafana queries overview
This dashboard provides the following information about service mesh operations.
### Access logs and errors monitoring
**Description:** This section provides visibility into logs and errors related to service-to-service communications. It tracks and displays the number of logs generated, errors encountered, and the percentage of logs matching specific patterns.
### Total logs
**Description:** This metric counts the total number of log lines produced by Consul dataplane containers. It provides an overview of the volume of logs being generated for a specific namespace.
```promql
sum(count_over_time(({container="consul-dataplane",namespace=~"$namespace"})[$__interval]))
```
### Total logs containing "$searchable_pattern"
**Description:** This metric tracks the number of logs containing the specified pattern. It is useful for filtering and monitoring specific log events across the service mesh.
```promql
sum(count_over_time({container="consul-dataplane", namespace=~"$namespace"} |~ (?i)(?i)$searchable_pattern [$__interval]))
```
### Percentage of logs containing "$searchable_pattern"
**Description:** This metric calculates the percentage of logs containing the specified search pattern within the total log volume. It helps gauge the proportion of specific log events.
```promql
(sum(count_over_time({container="consul-dataplane", namespace=~"$namespace"} |~ (?i)(?i)$searchable_pattern [$__interval])) * 100) / sum(count_over_time({container="consul-dataplane", namespace=~"$namespace"} [$__interval]))
```
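The percentage above is the number of matching log lines divided by the number of all log lines in the interval. A minimal Python sketch of that calculation follows, using a case-insensitive regular expression in place of the `(?i)` flag. The function name and the sample log lines are hypothetical.

```python
import re
from typing import List, Optional


def pattern_percentage(lines: List[str], pattern: str) -> Optional[float]:
    """Percentage of log lines that match the pattern, ignoring case."""
    if not lines:
        return None
    matcher = re.compile(pattern, re.IGNORECASE)
    matched = sum(1 for line in lines if matcher.search(line))
    return 100 * matched / len(lines)


# Hypothetical log lines from consul-dataplane containers.
sample_lines = [
    "starting proxy",
    "ERROR upstream connect failed",
    "listener ready",
    "request completed",
]
print(pattern_percentage(sample_lines, "error"))
```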
### Total response code distribution
**Description:** This pie chart visualizes the distribution of HTTP response codes, helping identify any 4xx and 5xx error codes generated by the services.
```promql
sum by(response_code) (count_over_time({container="consul-dataplane", namespace="$namespace"} | json | response_code != "0" | __error__="" [$__range]))
```
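To illustrate what this panel aggregates, the following Python sketch parses JSON access log lines, drops lines that fail to parse, ignores entries with a response code of `0`, and counts the rest by code. The `response_code` field name comes from the query above. The function name and the sample lines are hypothetical, and skipping entries without the field is a simplification of this sketch.

```python
import json
from collections import Counter
from typing import Iterable


def response_code_counts(lines: Iterable[str]) -> Counter:
    """Count access log entries by response code."""
    counts: Counter = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Lines that are not valid JSON are dropped.
            continue
        if not isinstance(entry, dict):
            continue
        code = str(entry.get("response_code", ""))
        if code and code != "0":
            counts[code] += 1
    return counts


# Hypothetical access log lines.
access_log_lines = [
    '{"response_code": 200, "path": "/"}',
    '{"response_code": 200, "path": "/api"}',
    '{"response_code": 503, "path": "/api"}',
    '{"response_code": 0}',
    "not json",
]
print(response_code_counts(access_log_lines))
```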
### Rate of logs containing "$searchable_pattern" per service
**Description:** This metric monitors the rate at which specific patterns appear in logs per service, helping to detect trends and anomalies in log data.
```promql
sum by(app) (rate({container="consul-dataplane", namespace=~"$namespace"} |~ (?i)(?i)$searchable_pattern [$__range]))
```
### TCP metrics - service level
### TCP inbound and outbound bytes
**Description:** This metric tracks the inbound and outbound TCP bytes transferred between services. It is essential for understanding the network traffic flow between source and destination services.
```promql
sum(rate(envoy_tcp_downstream_cx_rx_bytes_total{}[10m])) by (service, destination_service)
```
### TCP inbound and outbound bytes buffered
**Description:** This metric monitors the amount of TCP bytes buffered for inbound and outbound traffic between services. It helps identify potential network performance bottlenecks.
```promql
sum(rate(envoy_tcp_downstream_cx_rx_bytes_buffered{}[10m])) by (service, destination_service)
```
### TCP downstream connections
**Description:** This metric counts the total number of TCP downstream connections opened from the source service to the destination service, providing visibility into the volume of connections between services.
```promql
sum(envoy_tcp_downstream_cx_total) by (service, destination_service)
```
### Outbound traffic monitoring
![Preview of the outbound traffic monitoring](/public/img/grafana/service-to-service-2.png)
### Upstream traffic
**Description:** This metric monitors the upstream traffic from the source service to the destination service. It shows how much traffic is being sent between services.
```promql
sum(irate(envoy_cluster_upstream_rq_total{local_cluster=~"$source_service",consul_destination_service=~"$destination_service"}[10m]))
```
### Upstream request response timeliness
**Description:** This metric calculates the 95th percentile of upstream request response times between the source and destination services. It helps ensure that service communications are handled promptly.
```promql
histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{local_cluster=~"$source_service",consul_destination_target!=""}[10m])) by (le, consul_destination_target))
```
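`histogram_quantile` estimates a percentile from cumulative histogram buckets by linear interpolation inside the bucket that contains the target rank. The Python sketch below reproduces that idea for a single set of buckets. It is a simplified illustration and not the Prometheus implementation: it assumes at least one finite bucket plus a `+Inf` bucket, and the bucket bounds and counts are hypothetical.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The list must include a (math.inf, total_count) bucket.
    """
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative request counts per latency bucket.
latency_buckets = [(10, 50), (50, 90), (100, 98), (math.inf, 100)]
print(histogram_quantile(0.95, latency_buckets))
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the real latencies.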
### Upstream request success rate
**Description:** This metric tracks the rate of requests from the source service to the destination service that did not return a 5xx response. It helps assess the reliability of service communications.
```promql
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!="5",local_cluster=~"$source_service",consul_destination_service=~"$destination_service"}[10m]))
```
### Inbound traffic monitoring
![Preview of the inbound traffic monitoring](/public/img/grafana/service-to-service-3.png)
### Requests sent
**Description:** This metric tracks the number of requests sent between the source service and destination service within the service mesh.
```promql
sum(irate(envoy_cluster_upstream_rq_total{consul_destination_datacenter="dc1",local_cluster=~"$source_service",consul_destination_service=~"$destination_service"}[10m])) by (consul_destination_service, local_cluster)
```
### Request success rate
**Description:** This metric tracks the success rate of requests from the source service to the destination service, helping identify failures or bottlenecks in communication.
```promql
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!="5",local_cluster=~"$source_service",consul_destination_service=~"$destination_service"}[10m])) by (local_cluster, consul_destination_service) / sum(irate(envoy_cluster_upstream_rq_xx{consul_destination_service=~"$destination_service"}[10m])) by (local_cluster, consul_destination_service)
```
### Response success by status code
**Description:** This metric tracks response success by status code for requests sent by the source service to the destination service.
```promql
sum(increase(envoy_http_downstream_rq_xx{local_cluster=~"$source_service",envoy_http_conn_manager_prefix="public_listener"}[10m])) by (local_cluster, envoy_response_code_class)
```
### Request duration
**Description:** This metric tracks the request duration between the source and destination services, helping monitor performance and response times.
```promql
histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{consul_destination_datacenter="dc1", consul_destination_service=~"$destination_service",local_cluster=~"$source_service"}[10m])) by (le, cluster, local_cluster, consul_destination_service))
```
### Response success
**Description:** This metric tracks the success of responses for the source service's requests across the service mesh.
```promql
sum(increase(envoy_http_downstream_rq_total{local_cluster=~"$source_service",envoy_http_conn_manager_prefix="public_listener"}[10m])) by (local_cluster)
```
### Request response rate
**Description:** This metric tracks the rate at which responses are being generated by the source service, providing insight into service activity and performance.
```promql
sum(irate(envoy_http_downstream_rq_total{local_cluster=~"$source_service",envoy_http_conn_manager_prefix="public_listener"}[10m])) by (local_cluster)
```
## Customization options
![Preview of the nginx service selection as a customization option on the service to service dashboard](/public/img/grafana/service-to-service-4.png)
The service-to-service dashboard includes a variety of customization options to help you analyze specific aspects of service-to-service communications, tailor the dashboard for more targeted monitoring, and enhance visibility into the service mesh.
- **Filter by source service:** You can filter the dashboard to focus on traffic originating from a specific source service, allowing you to analyze interactions from the source service to all destination services.
- **Filter by destination service:** Similarly, you can filter the dashboard by destination service to track and analyze the traffic received by specific services. This helps pinpoint communication issues or performance bottlenecks related to specific services.
- **Filter by namespace:** The dashboard can be customized to focus on service interactions within a particular namespace. This is especially useful for isolating issues in multi-tenant environments or clusters that operate with strict namespace isolation.
- **Log pattern search:** You can apply custom search patterns to logs to filter out specific log events of interest, such as error messages or specific HTTP status codes. This enables you to narrow down on specific log entries and identify patterns that may indicate issues.
- **Time range selection:** The dashboard supports dynamic time range selection, allowing you to focus on service interactions over specific time intervals. This helps in analyzing traffic trends, troubleshooting incidents, and understanding the timing of service communications.
By using these customization options, you can tailor the dashboard to your specific needs and ensure that you are always monitoring the most relevant data for maintaining a healthy and performant service mesh.


@ -0,0 +1,157 @@
---
layout: docs
page_title: Dashboard for monitoring Consul service mesh
description: >-
  This documentation provides an overview of the Service Dashboard. Learn about the metrics it displays and the queries that produce the metrics.
---
# Service dashboard
This page provides reference information about the [Grafana dashboard configuration included in the `hashicorp/consul` GitHub repository](https://github.com/hashicorp/consul/blob/main/grafana/consulservicedashboard.json). The service dashboard offers an overview of the performance and health of individual services within the Consul service mesh. It provides insights into service availability, request success rates, latency, and connection metrics. This dashboard is essential for maintaining optimal service performance and quickly identifying any issues with service communications.
![Preview of the service dashboard](/public/img/grafana/service-dashboard-2.png)
## Grafana queries overview
This dashboard provides the following information about service mesh operations.
### Total running services
**Description:** This gauge tracks the total number of running services within the mesh that are not labeled as `traffic-generator`. It provides an overall view of active services, helping operators maintain visibility into service availability.
```promql
sum(envoy_server_live{app!="traffic-generator"})
```
### Total request success rate
**Description:** This stat visualizes the success rate of upstream requests to the selected service. It filters out 4xx and 5xx response codes, providing a clearer picture of how well the service is performing in terms of handling requests successfully.
```promql
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!="5", envoy_response_code_class!="4", consul_destination_service=~"$service"}[10m])) / sum(irate(envoy_cluster_upstream_rq_xx{consul_destination_service=~"$service"}[10m]))
```
### Total failed request rate
**Description:** This stat tracks the rate of failed requests for the selected service according to 4xx and 5xx errors. It helps operators quickly identify if there are issues with client requests or server errors for a specific service.
```promql
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class=~"4|5", consul_destination_service=~"$service"}[10m])) / sum(irate(envoy_cluster_upstream_rq_xx{consul_destination_service=~"$service"}[10m]))
```
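To separate server-side failures from client errors, a variant of the query above can restrict the numerator to the 5xx class. The following is a sketch that reuses the metric, labels, and `$service` variable from the dashboard query. It is an illustrative variant, not part of the dashboard configuration described on this page.

```promql
# Hypothetical variant: share of requests that fail with 5xx responses only.
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class="5", consul_destination_service=~"$service"}[10m])) / sum(irate(envoy_cluster_upstream_rq_xx{consul_destination_service=~"$service"}[10m]))
```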
### Average request response time in milliseconds
**Description:** This gauge displays the average response time for requests to the selected service, providing an overview of the service's performance and responsiveness.
```promql
sum(rate(envoy_cluster_upstream_rq_time_sum{consul_destination_service=~"$service"}[10m])) / sum(rate(envoy_cluster_upstream_rq_total{consul_destination_service=~"$service"}[10m]))
```
### Total failed requests
**Description:** This gauge tracks the total number of failed requests over a 10-minute window, categorized by service. It helps you identify services that are experiencing high failure rates.
```promql
sum(increase(envoy_cluster_upstream_rq_xx{envoy_response_code_class=~"4|5", consul_destination_service=~"$service"}[10m])) by(local_cluster)
```
### Dataplane latency
**Description:** This stat tracks the dataplane latency percentiles (p50, p75, p90, p99.9) for the selected service. It gives detailed insights into the distribution of latency within the service's request handling, helping identify performance bottlenecks.
![Preview of the dataplane latency metrics](/public/img/grafana/service-dashboard-1.png)
```promql
histogram_quantile(0.50, sum by(le) (rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace=~"$namespace", local_cluster=~"$service"}[5m])))
```
```promql
histogram_quantile(0.75, sum by(le) (rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace=~"$namespace", local_cluster=~"$service"}[5m])))
```
```promql
histogram_quantile(0.90, sum by(le) (rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace=~"$namespace", local_cluster=~"$service"}[5m])))
```
```promql
histogram_quantile(0.999, sum by(le) (rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace=~"$namespace", local_cluster=~"$service"}[5m])))
```
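The queries above aggregate every matching series into a single latency value per percentile. If `$service` matches more than one service, keeping `local_cluster` in the grouping clause is one way to get a separate series for each service. The following is a sketch based on the p90 query. It assumes the same labels and variables as the dashboard queries and is not part of the dashboard configuration.

```promql
# Hypothetical variant: p90 latency as one series per service.
histogram_quantile(0.90, sum by(le, local_cluster) (rate(envoy_cluster_upstream_rq_time_bucket{kubernetes_namespace=~"$namespace", local_cluster=~"$service"}[5m])))
```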
### Total TCP inbound and outbound bytes
**Description:** This time series shows the total number of inbound and outbound TCP bytes for services within the mesh. It provides visibility into the data transfer patterns and volume between services.
```promql
sum(rate(envoy_tcp_downstream_cx_rx_bytes_total{}[10m])) by (local_cluster)
```
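The query above covers inbound (received) bytes. Assuming your Envoy proxies also export the corresponding `envoy_tcp_downstream_cx_tx_bytes_total` counter, you could chart outbound (transmitted) bytes the same way. The following is a sketch that mirrors the inbound query and is not copied from the dashboard configuration.

```promql
# Assumed counterpart of the inbound query: outbound TCP bytes per service.
sum(rate(envoy_tcp_downstream_cx_tx_bytes_total{}[10m])) by (local_cluster)
```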
### Total TCP inbound and outbound bytes buffered
**Description:** This metric tracks the amount of TCP traffic buffered during inbound and outbound communications. It helps identify potential latency caused by buffering or congestion.
```promql
sum(rate(envoy_tcp_downstream_cx_rx_bytes_buffered{}[10m])) by (local_cluster)
```
### Total TCP downstream active connections
**Description:** This metric tracks TCP downstream connections for each service. The query calculates the per-second rate of new connections over a 10-minute window, providing an overview of the connection load on the services within the mesh.
```promql
sum(rate(envoy_tcp_downstream_cx_total{}[10m])) by(local_cluster)
```
### Total active HTTP upstream connections
**Description:** This time series tracks the total number of active HTTP upstream connections for the selected service. It helps monitor connection patterns and assess load.
```promql
sum(envoy_cluster_upstream_cx_active{app=~"$service"}) by (app)
```
### Total active HTTP downstream connections
**Description:** This time series monitors the number of active HTTP downstream connections for the selected service, providing visibility into the current active user or client load on the service.
```promql
sum(envoy_http_downstream_cx_active{app=~"$service"}) by (app)
```
### Upstream requests by status code
**Description:** This metric tracks the number of upstream requests, grouped by HTTP status codes, giving insight into the health of the requests being made to upstream services from the selected service.
```promql
sum by(namespace,app,envoy_response_code_class) (rate(envoy_cluster_upstream_rq_xx[5m]))
```
### Downstream requests by status code
**Description:** This time series tracks downstream HTTP requests by status code, showing how well the selected service is responding to downstream requests from clients.
```promql
sum(rate(envoy_http_downstream_rq_xx{envoy_http_conn_manager_prefix="public_listener"}[5m])) by (namespace, app, envoy_response_code_class)
```
### Connections rejected
**Description:** This metric tracks the number of connections rejected due to overload or overflow conditions on listeners. Monitoring these values helps identify if the service is under too much load or has insufficient capacity to handle the incoming connections.
```promql
rate(envoy_listener_downstream_cx_overload_reject{}[$__interval])
```
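The query above covers connections rejected because of overload. For overflow conditions, Envoy listeners also report a `downstream_cx_overflow` counter. Assuming it is exported to Prometheus as `envoy_listener_downstream_cx_overflow` in your deployment, a matching query could look like the following sketch, which is not copied from the dashboard configuration.

```promql
# Assumed counterpart of the overload query: connections rejected due to listener overflow.
rate(envoy_listener_downstream_cx_overflow{}[$__interval])
```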
## Customization options
The service dashboard offers various customization options to help you analyze specific services and metrics. Use these options to tailor the dashboard to your needs and improve your ability to monitor and troubleshoot service health.
- **Filter by service:** You can filter the dashboard by the service you want to monitor. This helps narrow down the metrics to the service of interest and provides a more targeted view of its performance.
- **Filter by namespace:** The namespace filter allows operators to focus on a particular namespace in a multi-tenant or multi-namespace environment, isolating the service metrics within that specific context.
- **Time range selection:** The dashboard supports flexible time range selection, allowing operators to analyze service behavior over different time periods. This is helpful for pinpointing issues that may occur at specific times or during high-traffic periods.
- **Percentile latency tracking:** The dashboard allows operators to track multiple latency percentiles (p50, p75, p90, p99.9) to see how latency is distributed across requests, from typical response times (p50) to the slowest requests (p99.9).

@ -418,7 +418,7 @@
"title": "Cache DNS lookups",
"path": "services/discovery/dns-cache"
},
{
{
"title": "Enable dynamic DNS lookups",
"path": "services/discovery/dns-dynamic-lookups"
}
@ -690,6 +690,35 @@
{
"title": "UI Visualization",
"path": "connect/observability/ui-visualization"
},
{
"title": "Grafana Dashboards",
"routes": [
{
"title": "Overview",
"path": "connect/observability/grafanadashboards"
},
{
"title": "Service to Service Dashboard",
"path": "connect/observability/grafanadashboards/service-to-servicedashboard"
},
{
"title": "Service Dashboard",
"path": "connect/observability/grafanadashboards/servicedashboard"
},
{
"title": "Consul Dataplane Dashboard",
"path": "connect/observability/grafanadashboards/consuldataplanedashboard"
},
{
"title": "Consul K8s Dashboard",
"path": "connect/observability/grafanadashboards/consulk8sdashboard"
},
{
"title": "Consul Server Dashboard",
"path": "connect/observability/grafanadashboards/consulserverdashboard"
}
]
}
]
},
