---
layout: docs
page_title: Dashboard for monitoring Consul service-to-service mesh
description: >-
  This documentation provides an overview of the Service-to-service dashboard. Learn about the metrics it displays and the queries that produce the metrics.
---

# Service-to-service dashboard

This page provides reference information about the [Grafana dashboard configuration included in the `hashicorp/consul` GitHub repository](https://github.com/hashicorp/consul/blob/main/grafana/consulservicetoservicedashboard.json). The service-to-service dashboard provides deep visibility into the traffic and interactions between services within the Consul service mesh. It focuses on critical metrics such as logs, error rates, traffic patterns, and success rates, all of which help operators maintain smooth and reliable service-to-service communication.

![Preview of the service to service mesh dashboard](/public/img/grafana/service-to-service-1.png)

## Grafana queries overview

This dashboard provides the following information about service mesh operations.

### Access logs and errors monitoring

**Description:** This section provides visibility into logs and errors related to service-to-service communications. It tracks and displays the number of logs generated, errors encountered, and the percentage of logs matching specific patterns.

### Total logs

**Description:** This metric counts the total number of log lines produced by Consul dataplane containers. It provides an overview of the volume of logs being generated for a specific namespace.

```promql
sum(count_over_time(({container="consul-dataplane",namespace=~"$namespace"})[$__interval]))
```

### Total logs containing "$searchable_pattern"

**Description:** This metric tracks the number of logs containing the specified pattern. The match is case-insensitive. It is useful for filtering and monitoring specific log events across the service mesh.

```promql
sum(count_over_time({container="consul-dataplane", namespace=~"$namespace"} |~ "(?i)$searchable_pattern" [$__interval]))
```

### Percentage of logs containing "$searchable_pattern"

**Description:** This metric calculates the percentage of logs containing the specified search pattern within the total log volume. It helps gauge the proportion of specific log events.

```promql
(sum(count_over_time({container="consul-dataplane", namespace=~"$namespace"} |~ "(?i)$searchable_pattern" [$__interval])) * 100) / sum(count_over_time({container="consul-dataplane", namespace=~"$namespace"} [$__interval]))
```
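
The arithmetic behind this query can be checked by hand. The following Python sketch reproduces it under the assumption that the two `count_over_time` results are already available as plain integers. The function name `pattern_percentage` and the fallback for an empty interval are illustrative choices, not part of the dashboard.

```python
def pattern_percentage(matching_lines: int, total_lines: int) -> float:
    """Return the share of log lines that matched the pattern, as a percentage."""
    if total_lines == 0:
        # Assumption: report 0.0 for an empty interval instead of dividing by zero.
        return 0.0
    return matching_lines * 100 / total_lines


# Hypothetical counts for one interval: 25 matching lines out of 200 total.
print(pattern_percentage(25, 200))
```

The multiplication by 100 happens before the division, matching the order of operations in the query.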

### Total response code distribution

**Description:** This pie chart visualizes the distribution of HTTP response codes, helping identify any 4xx and 5xx error codes generated by the services. The query parses each log line as JSON, ignores entries with a response code of `0`, and drops lines that fail to parse.

```promql
sum by(response_code) (count_over_time({container="consul-dataplane", namespace="$namespace"} | json | response_code != "0" | __error__="" [$__range]))
```
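
To make the pipeline stages concrete, the following Python sketch applies the same steps to a handful of log lines: parse JSON, drop unparsable lines, skip response code `0`, and count per code. It assumes each access log line is a JSON object with a `response_code` field. The function name and the sample lines are hypothetical.

```python
import json
from collections import Counter


def response_code_distribution(log_lines):
    """Count log lines per response code, skipping unparsable lines and code 0."""
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # Comparable to the `__error__=""` filter: drop parse failures.
        if not isinstance(entry, dict):
            continue
        code = str(entry.get("response_code", ""))
        # Simplification: lines without a response_code field are skipped here.
        if code in ("", "0"):
            continue
        counts[code] += 1
    return dict(counts)


# Hypothetical access log lines, including one unparsable entry.
lines = [
    '{"response_code": 200}',
    '{"response_code": 200}',
    '{"response_code": 503}',
    '{"response_code": 0}',
    "not json",
]
print(response_code_distribution(lines))
```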

### Rate of logs containing "$searchable_pattern" per service

**Description:** This metric monitors the rate at which specific patterns appear in logs per service, helping to detect trends and anomalies in log data.

```promql
sum by(app) (rate({container="consul-dataplane", namespace=~"$namespace"} |~ "(?i)$searchable_pattern" [$__range]))
```

### TCP metrics - service level

### TCP inbound and outbound bytes

**Description:** This metric tracks the inbound and outbound TCP bytes transferred between services. It is essential for understanding the network traffic flow between source and destination services. The query shown here uses the received-bytes (`rx`) counter.

```promql
sum(rate(envoy_tcp_downstream_cx_rx_bytes_total{}[10m])) by (service, destination_service)
```

### TCP inbound and outbound bytes buffered

**Description:** This metric monitors the number of TCP bytes buffered for inbound and outbound traffic between services. It helps identify potential network performance bottlenecks.

```promql
sum(rate(envoy_tcp_downstream_cx_rx_bytes_buffered{}[10m])) by (service, destination_service)
```

### TCP downstream connections

**Description:** This metric counts the total number of TCP downstream connections from the source service to the destination service, providing visibility into the volume of connections between services.

```promql
sum(envoy_tcp_downstream_cx_total) by (service, destination_service)
```

### Outbound traffic monitoring
![Preview of the outbound traffic monitoring](/public/img/grafana/service-to-service-2.png)

### Upstream traffic

**Description:** This metric monitors the upstream traffic from the source service to the destination service. It shows how much traffic is being sent between services.

```promql
sum(irate(envoy_cluster_upstream_rq_total{local_cluster=~"$source_service",consul_destination_service=~"$destination_service"}[10m]))
```
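
Several queries on this page use `irate`, which looks only at the last two samples in the range, while others use `rate`, which averages over the whole window. The following Python sketch contrasts the two on a single counter series. It is a simplification: it ignores counter resets and the window-boundary extrapolation that Prometheus applies, and the function names and sample values are hypothetical.

```python
def average_rate(samples):
    """Per-second increase between the first and last sample in the window."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def instant_rate(samples):
    """Per-second increase between the last two samples in the window."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    return (v_last - v_prev) / (t_last - t_prev)


# Hypothetical (timestamp_seconds, counter_value) samples for one series.
samples = [(0, 100), (60, 160), (120, 400)]
print(average_rate(samples))  # smoothed over the full window
print(instant_rate(samples))  # reacts to the most recent change only
```

Because the instant rate depends on two samples, it responds quickly to bursts but produces a noisier graph than the windowed average.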

### Upstream request response timeliness

**Description:** This metric calculates the 95th percentile of upstream request response times between the source and destination services. It helps ensure that service communications are handled promptly.

```promql
histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{local_cluster=~"$source_service",consul_destination_target!=""}[10m])) by (le, consul_destination_target))
```
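
The percentile is an estimate derived from cumulative histogram buckets, not an exact measurement. The following Python sketch shows the general idea of linear interpolation inside the bucket that contains the target rank. It is a simplified illustration rather than the Prometheus implementation, it assumes `0 < q <= 1`, and the bucket bounds and counts are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with a +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: return the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (upper_bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper_bound, count
    return math.nan


# Hypothetical cumulative buckets: (upper bound in ms, requests at or below it).
buckets = [(5, 10), (10, 30), (25, 90), (50, 100), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))
```

In this example the 95th request falls in the 25 ms to 50 ms bucket, so the estimate lies between those bounds. Wider buckets therefore give coarser percentile estimates.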

### Upstream request success rate

**Description:** This metric tracks the rate of requests from the source service to the destination service that did not return a 5xx response. It helps assess the reliability of service communications.

```promql
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!="5",local_cluster=~"$source_service",consul_destination_service=~"$destination_service"}[10m]))
```

### Inbound traffic monitoring
![Preview of the inbound traffic monitoring](/public/img/grafana/service-to-service-3.png)

### Requests sent

**Description:** This metric tracks the per-second rate of requests sent between the source service and destination service within the service mesh. The query is scoped to the `dc1` datacenter.

```promql
sum(irate(envoy_cluster_upstream_rq_total{consul_destination_datacenter="dc1",local_cluster=~"$source_service",consul_destination_service=~"$destination_service"}[10m])) by (consul_destination_service, local_cluster)
```

### Request success rate

**Description:** This metric tracks the success rate of requests from the source service to the destination service, helping identify failures or bottlenecks in communication.

```promql
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!="5",local_cluster=~"$source_service",consul_destination_service=~"$destination_service"}[10m])) by (local_cluster, consul_destination_service) / sum(irate(envoy_cluster_upstream_rq_xx{consul_destination_service=~"$destination_service"}[10m])) by (local_cluster, consul_destination_service)
```
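
The query divides the rate of non-5xx responses by the rate of all responses. The following Python sketch reproduces that ratio for one source and destination pair, assuming the per-class request rates are already known. The function name, the sample rates, and the `None` result for idle services are illustrative assumptions.

```python
def success_ratio(rate_by_class):
    """Fraction of requests whose response code class is not 5xx."""
    total = sum(rate_by_class.values())
    if total == 0:
        return None  # Assumption: with no traffic, the ratio is undefined.
    non_5xx = sum(rate for cls, rate in rate_by_class.items() if cls != "5")
    return non_5xx / total


# Hypothetical per-second request rates keyed by `envoy_response_code_class`.
rates = {"2": 90.0, "4": 5.0, "5": 5.0}
print(success_ratio(rates))
```

Note that 4xx responses count toward the numerator, because the query excludes only the `5` response code class.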

### Response success by status code

**Description:** This metric tracks response success by status code for requests sent by the source service to the destination service.

```promql
sum(increase(envoy_http_downstream_rq_xx{local_cluster=~"$source_service",envoy_http_conn_manager_prefix="public_listener"}[10m])) by (local_cluster, envoy_response_code_class)
```

### Request duration

**Description:** This metric tracks the 95th percentile of request duration between the source and destination services, helping monitor performance and response times. The query is scoped to the `dc1` datacenter.

```promql
histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{consul_destination_datacenter="dc1", consul_destination_service=~"$destination_service",local_cluster=~"$source_service"}[10m])) by (le, cluster, local_cluster, consul_destination_service))
```

### Response success

**Description:** This metric tracks responses to the source service's requests across the service mesh. The query counts the increase in total requests handled by the `public_listener` over a 10-minute window.

```promql
sum(increase(envoy_http_downstream_rq_total{local_cluster=~"$source_service",envoy_http_conn_manager_prefix="public_listener"}[10m])) by (local_cluster)
```

### Request response rate

**Description:** This metric tracks the rate at which responses are being generated by the source service, providing insight into service activity and performance.

```promql
sum(irate(envoy_http_downstream_rq_total{local_cluster=~"$source_service",envoy_http_conn_manager_prefix="public_listener"}[10m])) by (local_cluster)
```

## Customization options

![Preview of the nginx service selection as a customization option on the service to service dashboard](/public/img/grafana/service-to-service-4.png)

The service-to-service dashboard includes a variety of customization options to help you analyze specific aspects of service-to-service communications, tailor the dashboard for more targeted monitoring, and enhance visibility into the service mesh.

- **Filter by source service:** You can filter the dashboard to focus on traffic originating from a specific source service, allowing you to analyze interactions from the source service to all destination services.

- **Filter by destination service:** Similarly, you can filter the dashboard by destination service to track and analyze the traffic received by specific services. This helps pinpoint communication issues or performance bottlenecks related to specific services.

- **Filter by namespace:** The dashboard can be customized to focus on service interactions within a particular namespace. This is especially useful for isolating issues in multi-tenant environments or clusters that operate with strict namespace isolation.

- **Log pattern search:** You can apply custom search patterns to logs to filter for specific log events of interest, such as error messages or specific HTTP status codes. This enables you to narrow down to specific log entries and identify patterns that may indicate issues.

- **Time range selection:** The dashboard supports dynamic time range selection, allowing you to focus on service interactions over specific time intervals. This helps in analyzing traffic trends, troubleshooting incidents, and understanding the timing of service communications.

By using these customization options, you can tailor the dashboard to your specific needs and ensure that you are always monitoring the most relevant data for maintaining a healthy and performant service mesh.
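
These filters work by substituting dashboard variables such as `$namespace` into the query text before it is sent to the data source. Grafana performs this substitution itself. The following Python sketch only illustrates the idea, using the standard library's `string.Template`, which happens to share the `$name` placeholder syntax. The helper name `render_query` and the sample values are hypothetical.

```python
from string import Template

# Query text of the "Total logs" panel, with dashboard variables left in place.
TOTAL_LOGS = Template(
    'sum(count_over_time(({container="consul-dataplane",namespace=~"$namespace"})[$__interval]))'
)


def render_query(template, values):
    """Fill in the variables that are provided and leave any others untouched."""
    return template.safe_substitute(values)


# Hypothetical selections: the "default" namespace and a one-minute interval.
print(render_query(TOTAL_LOGS, {"namespace": "default", "__interval": "1m"}))
```

Because the selectors use the `=~` regular expression matcher, a value such as `default|payments` would match either namespace.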