---
layout: docs
page_title: Architecture - AWS ECS
description: >-
  Architecture of Consul Service Mesh on AWS ECS (Elastic Container Service).
---
# Architecture
The following diagram shows the main components of the Consul architecture when deployed to an ECS cluster:
![Consul on ECS Architecture](/img/consul-ecs-arch.png)
1. **Consul servers:** Production-ready Consul server cluster
1. **Application tasks:** Runs user application containers along with two helper containers:
   1. **Consul client:** The Consul client container runs Consul. The Consul client communicates
      with the Consul server and configures the Envoy proxy sidecar. This communication
      is called _control plane_ communication.
   1. **Sidecar proxy:** The sidecar proxy container runs [Envoy](https://envoyproxy.io/). All requests
      to and from the application container(s) run through the sidecar proxy. This communication
      is called _data plane_ communication.
1. **ACL Controller:** Automatically provisions Consul ACL tokens for Consul clients and service mesh services
   in an ECS Cluster.
For more information about how Consul works in general, see Consul's [Architecture Overview](/docs/architecture).
In addition to the long-running Consul client and sidecar proxy containers, the `mesh-init` container runs
at startup and sets up initial configuration for Consul and Envoy.
### Task Startup
This diagram shows the timeline of a task starting up and all its containers:
![Task Startup Timeline](/img/ecs-task-startup.png)
- **T0:** ECS starts the task. The `consul-client` and `mesh-init` containers start:
  - `consul-client` uses the `retry-join` option to join the Consul cluster.
  - `mesh-init` registers the service for this task and its sidecar proxy into Consul. It runs `consul connect envoy -bootstrap` to generate Envoy's bootstrap JSON file and write it to a shared volume. After registration and bootstrapping, `mesh-init` exits.
- **T1:** The `sidecar-proxy` container starts. It runs Envoy by executing `envoy -c <path-to-bootstrap-json>`.
- **T2:** The `sidecar-proxy` container is marked as healthy by ECS. It uses a health check that detects if its public listener port is open. At this point, ECS starts your application containers because all Consul machinery is ready to service requests. The only running containers are now `consul-client`, `sidecar-proxy`, and your application container(s).
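
The port-based health check described at T2 can be approximated with a small shell sketch. This is not the health check that the module actually configures: it assumes `nc` (netcat) with `-z` support is available in the container, and the helper names `listener_open` and `wait_for_listener`, as well as the port `20000` in the usage line, are assumptions made for illustration.

```bash
# Hypothetical helper: succeeds when something accepts TCP connections
# on the given host and port. Assumes `nc` (netcat) supports -z.
listener_open() {
  host="$1"
  port="$2"
  nc -z "$host" "$port" >/dev/null 2>&1
}

# Hypothetical helper: poll until the listener is open, giving up after
# `attempts` tries and sleeping `delay` seconds between tries.
wait_for_listener() {
  host="$1"
  port="$2"
  attempts="${3:-30}"
  delay="${4:-1}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if listener_open "$host" "$port"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

A call such as `wait_for_listener 127.0.0.1 20000 && echo "proxy ready"` returns once the assumed listener port accepts connections, or fails after the attempts run out.
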
### Task Shutdown
Graceful shutdown is supported when deploying Consul on ECS, which means the following:
* No incoming traffic from the mesh is directed to this task during shutdown.
* Outgoing traffic to the mesh is possible during shutdown.
This diagram shows an example timeline of a task shutting down:
<img alt="Task Shutdown Timeline" src="/img/ecs-task-shutdown.svg" style={{display: "block", maxWidth: "400px"}} />
- **T0**: ECS sends a TERM signal to all containers.
  - `consul-client` begins to gracefully leave the Consul cluster, since it is configured with `leave_on_terminate = true`.
  - `health-sync` stops syncing health status from ECS into Consul checks.
  - `sidecar-proxy` ignores the TERM signal and continues running until it notices that the `user-app` container has exited. This allows the application container to continue to make outgoing requests through the proxy to the mesh. This is possible due to an entrypoint override for the container, `consul-ecs envoy-entrypoint`.
  - `user-app` exits since, in this example, it does not intercept the TERM signal.
- **T1**:
  - `health-sync` updates its Consul checks to critical status, and then exits. This ensures this service instance is marked unhealthy.
  - The `sidecar-proxy` container checks the ECS task metadata. It notices the `user-app` container has stopped, and exits.
- **T2**:
  - `consul-client` finishes leaving the Consul cluster and exits.
  - Updates about this task have reached the rest of the Consul cluster, which means downstream proxies are updated to stop sending traffic to this task.
- **T3**: All containers have exited.
  - `consul-client` finishes gracefully leaving the Consul datacenter and exits.
  - ECS notices all containers have exited, and will soon change the Task status to `STOPPED`.
- **T4**: (Not applicable to this example, but if any containers are still running at this point, ECS forcefully stops them by sending a KILL signal.)
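
The `sidecar-proxy` behavior at T0 and T1 (ignore TERM, keep proxying, exit once the application has stopped) can be sketched in shell. This is an illustration of the idea only, not the `consul-ecs envoy-entrypoint` implementation. The function names are hypothetical, and a marker file stands in for the ECS task metadata lookup, which is an assumption made to keep the sketch self-contained.

```bash
# Hypothetical check: succeeds while the application container is still
# running. A marker file stands in for the ECS task metadata lookup
# (an assumption for this sketch).
app_is_running() {
  [ -e "${APP_MARKER:-/tmp/user-app.running}" ]
}

# Start the proxy command in the background, ignore TERM in this wrapper,
# and stop the proxy only after the application is reported as stopped.
run_proxy_until_app_exits() {
  "$@" &                        # start the proxy before changing TERM handling
  proxy_pid=$!
  trap '' TERM                  # the wrapper now ignores TERM
  while app_is_running; do
    sleep "${POLL_INTERVAL:-1}"
  done
  kill "$proxy_pid" 2>/dev/null || true
  wait "$proxy_pid" 2>/dev/null || true
  trap - TERM                   # restore default TERM handling
  return 0
}
```

The proxy is started before `trap '' TERM` on purpose: an ignored signal is inherited by child processes, so setting the trap first would also make the proxy ignore the final `kill`. A call might look like `run_proxy_until_app_exits envoy -c <path-to-bootstrap-json>`.
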
#### Task Shutdown: Completely Avoiding Application Errors
Because Consul service mesh is a distributed, eventually consistent system that is subject to network latency, it is hard to achieve a perfect graceful shutdown.
In particular, you may have noticed the following issue in the example above, where an application that has already exited can still receive incoming traffic:
* The `user-app` container exits in **T0**
* Afterwards in **T2**, downstream services are updated to no longer send traffic to this task
As a result, downstream applications will see errors when requests are directed to this instance. This can occur for a short period (seconds or less) at the beginning of task shutdown, until the rest of the Consul cluster knows to avoid sending traffic to this instance.
Here are a couple of approaches to address this issue:
1. Modify your application container to continue running for a short period of time into task shutdown. By doing this, the application is still running and can respond successfully to incoming requests at the beginning of task shutdown. This allows time for the Consul cluster to update downstream proxies to stop sending traffic to this task.
One way to accomplish this is with an entrypoint override for your application container that ignores the TERM signal sent by ECS. Here is an example shell script:
```bash
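# Run the provided command in a background subprocess.
# When invoked as `/bin/sh -c "<script>" cmd args...`, $0 is the command
# and "$@" holds its arguments.
$0 "$@" &
export PID=$!

# On TERM, keep the application running for 10 more seconds, then exit.
onterm() {
    echo "Caught sigterm. Sleeping 10s..."
    sleep 10
    exit 0
}

# On exit, stop the application subprocess and wait for it to finish.
onexit() {
    if [ -n "$PID" ]; then
        kill "$PID"
        wait "$PID"
    fi
}

trap onterm TERM
trap onexit EXIT
wait "$PID"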
```
This script runs the application in a subprocess. It uses a `trap` to intercept the TERM signal. It then sleeps for ten seconds before exiting normally. This allows the application process to continue running after receiving the TERM signal.
If this script is saved as `./app-entrypoint.sh`, then you can use it for your ECS tasks using the `mesh-task` Terraform module:
```hcl
module "my_task" {
  source  = "hashicorp/consul-ecs/aws//modules/mesh-task"
  version = "<latest version>"
  container_definitions = [
    {
      name       = "my-app"
      image      = "..."
      entryPoint = ["/bin/sh", "-c", file("./app-entrypoint.sh")]
      command    = ["python", "manage.py", "runserver", "127.0.0.1:8080"]
      ...
    }
  ]
  ...
}
```
This example sets the `entryPoint` for the container, which overrides the default entrypoint from the image. When the container starts in ECS, the `command` list is passed as arguments to the `entryPoint` command. Putting this together, the container would start with the command, `/bin/sh -c "<app-entrypoint-contents>" python manage.py runserver 127.0.0.1:8080`.
2. If the traffic is HTTP(S), you can enable retry logic through Consul Connect [Service Router](/docs/connect/config-entries/service-router). This configures proxies to retry when they receive an error. When Envoy receives a failed request from an upstream service, it can retry the request against a different instance of that service that may be able to respond successfully.
To enable retries through Service Router for a service named `example`, first ensure the configured protocol is set to `http`:
```hcl
Kind = "service-defaults"
Name = "example"
Protocol = "http"
```
Then apply the config entry:
```shell-session
$ consul config write example-defaults.hcl
```
Then add retry settings for the service:
```hcl
Kind = "service-router"
Name = "example"
Routes = [
  {
    Match {
      HTTP {
        PathPrefix = "/"
      }
    }
    Destination {
      NumRetries = 5
      RetryOnConnectFailure = true
      RetryOnStatusCodes = [503]
    }
  }
]
```
To apply this, run the following:
```shell-session
$ consul config write example-router.hcl
```
This Service Router configuration sets `PathPrefix = "/"`, which matches all requests to the `example` service. It sets the `NumRetries`, `RetryOnConnectFailure`, and `RetryOnStatusCodes = [503]` fields so that failed requests are retried. We've seen Envoy return a 503 when its application container has exited, but other error codes are possible depending on your environment.
See the Consul Connect [Configuration Entries](/docs/connect/config-entries/index) documentation for more detail.
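
After writing both entries, it can be useful to read them back and confirm what the cluster stored. The following is a minimal sketch that assumes the `consul` CLI is installed and can reach the Consul servers; the helper name `show_retry_config` is hypothetical.

```bash
# Hypothetical helper: print the stored service-defaults and service-router
# config entries for a service so the retry settings can be inspected.
# Assumes the `consul` CLI can reach the cluster.
show_retry_config() {
  service="$1"
  consul config read -kind service-defaults -name "$service" &&
    consul config read -kind service-router -name "$service"
}
```

For the example above, `show_retry_config example` prints both entries, and fails if either one cannot be read.
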
### Automatic ACL Token Provisioning
Consul ACL tokens secure communication between agents and services.
The following containers in a task require an ACL token:
- `consul-client`: The Consul client uses a token to authorize itself with Consul servers.
All `consul-client` containers share the same token.
- `mesh-init`: The `mesh-init` container uses a token to register the service with Consul.
This token is unique for the Consul service, and is shared by instances of the service.
The ACL controller automatically creates ACL tokens for mesh-enabled tasks in an ECS cluster.
The `acl-controller` Terraform module creates the ACL controller task. The controller creates the
ACL token used by `consul-client` containers at startup and then watches for tasks in the cluster. It checks tags
to determine if the task is mesh-enabled. If so, it creates the service ACL token for the task, if the
token does not yet exist.
The ACL controller stores all ACL tokens in AWS Secrets Manager, and tasks are configured to pull these
tokens from AWS Secrets Manager when they start.
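
Tasks receive these tokens as part of their ECS configuration, so no manual step is needed. For debugging, an operator could inspect a stored token with the AWS CLI. This is a sketch under assumptions: it requires the `aws` CLI with permission to read the secret, the helper name `read_acl_token` is hypothetical, and the secret name must be looked up in your own deployment because this page does not specify one.

```bash
# Hypothetical helper: print the string value of a secret stored in
# AWS Secrets Manager. The secret ID is supplied by the caller; the
# names used by the ACL controller are not specified on this page.
read_acl_token() {
  secret_id="$1"
  aws secretsmanager get-secret-value \
    --secret-id "$secret_id" \
    --query SecretString \
    --output text
}
```

Because the output is a credential, avoid running this where the terminal output is logged or shared.
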