mirror of https://github.com/hashicorp/consul
remove guides that were moved to learn
parent
4a5d67a24e
commit
af07d9f006
@ -1,472 +0,0 @@
---
layout: docs
page_title: Bootstrapping ACLs
description: >-
  Consul provides an optional Access Control List (ACL) system which can be used
  to control access to data and APIs. The ACL system is a Capability-based
  system that relies on tokens which can have fine grained rules applied to
  them. It is very similar to AWS IAM in many ways.
---

# Bootstrapping the ACL System

Consul uses Access Control Lists (ACLs) to secure the UI, API, CLI, service communications, and agent communications. For securing gossip and RPC communication please review [this guide](/docs/guides/agent-encryption). When securing your cluster you should configure the ACLs first.

At the core, ACLs operate by grouping rules into policies, then associating one or more policies with a token.

To complete this guide, you should have an operational Consul 1.4+ cluster. We also recommend reading the [ACL System documentation](/docs/agent/acl-system). For securing Consul version 1.3 and older, please read the [legacy ACL documentation](/docs/guides/acl-legacy).

Bootstrapping the ACL system is a multi-step process. This guide covers all the necessary steps:

- [Enable ACLs on all the servers](/docs/guides/acl#step-1-enable-acls-on-all-the-consul-servers).
- [Create the initial bootstrap token](/docs/guides/acl#step-2-create-the-bootstrap-token).
- [Create the agent policy](/docs/guides/acl#step-3-create-an-agent-token-policy).
- [Create the agent token](/docs/guides/acl#step-4-create-an-agent-token).
- [Apply the new token to the servers](/docs/guides/acl#step-5-add-the-agent-token-to-all-the-servers).
- [Enable ACLs on the clients and apply the agent token](/docs/guides/acl#step-6-enable-acls-on-the-consul-clients).

At the end of this guide, there are also several additional and optional steps.

## Step 1: Enable ACLs on all the Consul Servers

The first step for bootstrapping the ACL system is to enable ACLs on the Consul servers in the agent configuration file. In this example, we are configuring the default policy of "deny", which means we are in whitelist mode, and a down policy of "extend-cache", which means that we will ignore token TTLs during an outage.

```json
{
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "down_policy": "extend-cache"
  }
}
```

The servers will need to be restarted to load the new configuration. Please take care
to restart the servers one at a time and ensure each server has joined and is operating
correctly before restarting another.

If ACLs are enabled correctly, we will now see the following warnings and info in the leader's logs.

```shell
2018/12/12 01:36:40 [INFO] acl: Created the anonymous token
2018/12/12 01:36:40 [INFO] consul: ACL bootstrap enabled
2018/12/12 01:36:41 [INFO] agent: Synced node info
2018/12/12 01:36:58 [WARN] agent: Coordinate update blocked by ACLs
2018/12/12 01:37:40 [INFO] acl: initializing acls
2018/12/12 01:37:40 [INFO] consul: Created ACL 'global-management' policy
```

If you do not see ACL bootstrap enabled, the anonymous token creation, and the `global-management` policy creation message in the logs, ACLs have not been properly enabled.

Now that ACLs are enabled, every operation requires a token. We can't do anything else to the cluster until we bootstrap the ACL system and generate the first master token. For simplicity, we will use the master token created during the bootstrap for the remainder of the guide.

## Step 2: Create the Bootstrap Token

Once ACLs have been enabled we can bootstrap our first token, the bootstrap token.
The bootstrap token is a management token with unrestricted privileges. It will
be shared with all the servers in the quorum, since it will be added to the
state store.

```bash
$ consul acl bootstrap
AccessorID:   edcaacda-b6d0-1954-5939-b5aceaca7c9a
SecretID:     4411f091-a4c9-48e6-0884-1fcb092da1c8
Description:  Bootstrap Token (Global Management)
Local:        false
Create Time:  2018-12-06 18:03:23.742699239 +0000 UTC
Policies:
   00000000-0000-0000-0000-000000000001 - global-management
```
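
The `SecretID` in this output is the value used as the token in later steps. As a minimal sketch, and not part of the original guide, the snippet below shows one way to capture it without copying it by hand. The helper name `extract_secret_id` is hypothetical, and the sample text is trimmed from the output above; against a live cluster you would pipe `consul acl bootstrap` into the helper instead.

```shell
# Hypothetical helper: read `consul acl bootstrap` output on stdin and
# print only the SecretID value.
extract_secret_id() {
  awk '$1 == "SecretID:" { print $2 }'
}

# Sample output trimmed from the bootstrap step in this guide.
sample_output='AccessorID:   edcaacda-b6d0-1954-5939-b5aceaca7c9a
SecretID:     4411f091-a4c9-48e6-0884-1fcb092da1c8
Description:  Bootstrap Token (Global Management)'

bootstrap_token=$(printf '%s\n' "$sample_output" | extract_secret_id)
echo "$bootstrap_token"
```

The captured value can then be exported as `CONSUL_HTTP_TOKEN` so that it never has to appear on the command line.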

On the server where the `bootstrap` command was issued we should see the following log message.

```shell
2018/12/11 15:30:23 [INFO] consul.acl: ACL bootstrap completed
2018/12/11 15:30:23 [DEBUG] http: Request PUT /v1/acl/bootstrap (2.347965ms) from=127.0.0.1:40566
```

Since ACLs have been enabled, we will need to use the bootstrap token to complete any additional operations.
For example, even checking the member list will require a token.

```shell
$ consul members -token "4411f091-a4c9-48e6-0884-1fcb092da1c8"
Node  Address            Status  Type    Build  Protocol  DC  Segment
fox   172.20.20.10:8301  alive   server  1.4.0  2         kc  <all>
bear  172.20.20.11:8301  alive   server  1.4.0  2         kc  <all>
wolf  172.20.20.12:8301  alive   server  1.4.0  2         kc  <all>
```

Note that passing the token on the command line with the `-token` flag is not
recommended. Instead, we will set it once as an environment variable.

```shell
$ export CONSUL_HTTP_TOKEN=4411f091-a4c9-48e6-0884-1fcb092da1c8
```

The bootstrap token can also be used in the server configuration file as
the [`master`](/docs/agent/options#acl_tokens_master) token.

Note that the bootstrap token can only be created once; bootstrapping is disabled after the master token has been created. Once the ACL system is bootstrapped, ACL tokens can be managed through the
[ACL API](/api/acl/acl).

## Step 3: Create an Agent Token Policy

Before we can create a token, we will need to create its associated policy. A policy is a set of rules that can be used to specify granular permissions. To learn more about rules, read the ACL rule specification [documentation](/docs/agent/acl-rules).

```hcl
# agent-policy.hcl contains the following:
node_prefix "" {
  policy = "write"
}
service_prefix "" {
  policy = "read"
}
```

This policy will allow all nodes to be registered and accessed and any service to be read.
Note, this simple policy is not recommended in production.
It is best practice to create separate node policies and tokens for each node in the cluster
with an exact-match node rule.

We only need to create one policy and can do this on any of the servers. If you have not set the
`CONSUL_HTTP_TOKEN` environment variable to the bootstrap token, please refer to the previous step.

```
$ consul acl policy create -name "agent-token" -description "Agent Token Policy" -rules @agent-policy.hcl
ID:           5102b76c-6058-9fe7-82a4-315c353eb7f7
Name:         agent-token
Description:  Agent Token Policy
Datacenters:
Rules:
node_prefix "" {
  policy = "write"
}
service_prefix "" {
  policy = "read"
}
```

The returned value is the newly-created policy that we can now use when creating our agent token.
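
Step 3 notes that production clusters should use a separate policy per node with an exact-match node rule. The following is a rough sketch of how such a policy could be generated for one node; the helper name `render_node_policy` is hypothetical, and `fox` is simply one of the example node names used in this guide.

```shell
# Hypothetical helper: print an agent policy that grants write access to
# exactly one node name (an exact-match `node` rule rather than `node_prefix`).
render_node_policy() {
  node_name=$1
  cat <<EOF
node "${node_name}" {
  policy = "write"
}
service_prefix "" {
  policy = "read"
}
EOF
}

# Render the policy for the example node "fox" and show it.
policy=$(render_node_policy fox)
printf '%s\n' "$policy"
```

Saved to a file, the rendered rules could be passed to `consul acl policy create` with `-rules @<file>`, in the same way as `agent-policy.hcl` above.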

## Step 4: Create an Agent Token

Using the newly created policy, we can create an agent token. Again we can complete this process on any of the servers. For this guide, all agents will share the same token. Note, the `SecretID` is the token used to authenticate API and CLI commands.

```shell
$ consul acl token create -description "Agent Token" -policy-name "agent-token"
AccessorID:   499ab022-27f2-acb8-4e05-5a01fff3b1d1
SecretID:     da666809-98ca-0e94-a99c-893c4bf5f9eb
Description:  Agent Token
Local:        false
Create Time:  2018-10-19 14:23:40.816899 -0400 EDT
Policies:
   fcd68580-c566-2bd2-891f-336eadc02357 - agent-token
```

## Step 5: Add the Agent Token to all the Servers

Our final step for configuring the servers is to assign the token to all of our
Consul servers via the configuration file and reload the Consul service
on all of the servers, one last time.

```json
{
  "primary_datacenter": "dc1",
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "tokens": {
      "agent": "da666809-98ca-0e94-a99c-893c4bf5f9eb"
    }
  }
}
```

~> Note: In Consul version 1.4.2 and older any ACL updates
in the agent configuration file will require a full restart of the
Consul service.

At this point we should no longer see the coordinate warning in the servers' logs. However, we should continue to see that the node information is in sync.

```shell
2018/12/11 15:34:20 [DEBUG] agent: Node info in sync
```

It is important to ensure the servers are configured properly before enabling ACLs
on the clients. This avoids duplicate work and troubleshooting if there
is a misconfiguration.

#### Ensure the ACL System is Configured Properly

Before configuring the clients, we should check that the servers are healthy. To do this, let's view the catalog.

```shell
$ curl http://127.0.0.1:8500/v1/catalog/nodes -H 'x-consul-token: 4411f091-a4c9-48e6-0884-1fcb092da1c8'
[
  {
    "Address": "172.20.20.10",
    "CreateIndex": 7,
    "Datacenter": "kc",
    "ID": "881cfb69-2bcd-c2a9-d87c-cb79fc454df9",
    "Meta": {
      "consul-network-segment": ""
    },
    "ModifyIndex": 10,
    "Node": "fox",
    "TaggedAddresses": {
      "lan": "172.20.20.10",
      "wan": "172.20.20.10"
    }
  }
]
```

All the values should be as expected. Particularly, if `TaggedAddresses` is `null` it is likely we have not configured ACLs correctly. A good place to start debugging is reviewing the Consul logs on all the servers.
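
The `TaggedAddresses` check described above can be scripted. This is only a sketch under stated assumptions: the helper name `tagged_addresses_ok` is hypothetical, the two sample documents are shortened stand-ins for a real `/v1/catalog/nodes` response, and a plain `grep` is used instead of a JSON parser to keep the example dependency-free.

```shell
# Hypothetical check: read catalog JSON on stdin and fail if any node
# reports a null TaggedAddresses field.
tagged_addresses_ok() {
  if grep -q '"TaggedAddresses":[[:space:]]*null'; then
    return 1
  fi
  return 0
}

# Shortened stand-ins for a healthy and a misconfigured catalog response.
healthy='[{"Node": "fox", "TaggedAddresses": {"lan": "172.20.20.10", "wan": "172.20.20.10"}}]'
broken='[{"Node": "fox", "TaggedAddresses": null}]'

if printf '%s\n' "$healthy" | tagged_addresses_ok; then
  echo "catalog looks healthy"
fi
if ! printf '%s\n' "$broken" | tagged_addresses_ok; then
  echo "TaggedAddresses is null: review the ACL configuration"
fi
```

Against a running server, the `curl` command shown earlier would be piped into the helper in place of the sample strings.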

If you encounter issues that are unresolvable, or misplace the bootstrap token, you can reset the ACL system by updating the index. First re-run the bootstrap command to get the index number.

```
$ consul acl bootstrap
Failed ACL bootstrapping: Unexpected response code: 403 (Permission denied: ACL bootstrap no longer allowed (reset index: 13))
```

Then write the reset index (13 in this example) into the bootstrap reset file:

```
$ echo 13 >> <data-directory>/acl-bootstrap-reset
```
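
To avoid transcribing the index by hand, the number can be pulled out of the error text. The sketch below is an illustration only: `extract_reset_index` is a hypothetical helper, and the sample message is the one shown above. It assumes the message keeps the `reset index: N` wording.

```shell
# Hypothetical helper: read the bootstrap error on stdin and print the
# number that follows "reset index: ".
extract_reset_index() {
  sed -n 's/.*reset index: \([0-9][0-9]*\).*/\1/p'
}

# Error text copied from the failed bootstrap attempt in this guide.
error_message='Failed ACL bootstrapping: Unexpected response code: 403 (Permission denied: ACL bootstrap no longer allowed (reset index: 13))'

reset_index=$(printf '%s\n' "$error_message" | extract_reset_index)
echo "$reset_index"
```

On a live server the input would come from `consul acl bootstrap 2>&1` (assuming the error is written to standard error), and the result would be written to the `acl-bootstrap-reset` file in the data directory as shown above.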

After resetting the ACL system you can start again at Step 2.

## Step 6: Enable ACLs on the Consul Clients

Since ACL enforcement also occurs on the Consul clients, we need to also restart them
with a configuration file that enables ACLs. We can use the same ACL agent token that we created for the servers. The same token can be used because we did not specify any node or service prefixes.

```json
{
  "acl": {
    "enabled": true,
    "down_policy": "extend-cache",
    "tokens": {
      "agent": "da666809-98ca-0e94-a99c-893c4bf5f9eb"
    }
  }
}
```

To ensure the agents are configured correctly, we can again use the `/catalog` endpoint.

## Additional ACL Configuration

Now that the nodes have been configured to use ACLs, we can configure the CLI, UI, and nodes to use specific tokens. All of the following steps are optional examples. In your own environment you will likely need to create more fine grained policies.

#### Configure the Anonymous Token (Optional)

The anonymous token is created during the bootstrap process, `consul acl bootstrap`. It is implicitly used if no token is supplied. In this section we will update the existing token with a newly created policy.

At this point ACLs are bootstrapped with ACL agent tokens configured, but there are no
other policies set up. Even basic operations like `consul members` will be restricted
by the ACL default policy of "deny":

```
$ consul members
```

We will not receive an error, since the ACL has filtered what we see and we are not allowed to
see any nodes by default.

If we supply the agent token we created above we will be able to see a listing of nodes because
it has write privileges to an empty `node` prefix, meaning it has access to all nodes:

```bash
$ CONSUL_HTTP_TOKEN=da666809-98ca-0e94-a99c-893c4bf5f9eb consul members
Node  Address            Status  Type    Build  Protocol  DC  Segment
fox   172.20.20.10:8301  alive   server  1.4.0  2         kc  <all>
bear  172.20.20.11:8301  alive   server  1.4.0  2         kc  <all>
wolf  172.20.20.12:8301  alive   server  1.4.0  2         kc  <all>
```

It is common in many environments to allow listing of all nodes, even without a
token. The policies associated with the special anonymous token can be updated to
configure Consul's behavior when no token is supplied. The anonymous token is managed
like any other ACL token, except that `anonymous` is used for the ID. In this example
we will give the anonymous token read privileges for all nodes:

```bash
$ consul acl policy create -name 'list-all-nodes' -rules 'node_prefix "" { policy = "read" }'
ID:           e96d0a33-28b4-d0dd-9b3f-08301700ac72
Name:         list-all-nodes
Description:
Datacenters:
Rules:
node_prefix "" { policy = "read" }

$ consul acl token update -id 00000000-0000-0000-0000-000000000002 -policy-name list-all-nodes -description "Anonymous Token - Can List Nodes"
Token updated successfully.
AccessorID:   00000000-0000-0000-0000-000000000002
SecretID:     anonymous
Description:  Anonymous Token - Can List Nodes
Local:        false
Create Time:  0001-01-01 00:00:00 +0000 UTC
Hash:         ee4638968d9061647ac8c3c99e9d37bfdd2af4d1eaa07a7b5f80af0389460948
Create Index: 5
Modify Index: 38
Policies:
   e96d0a33-28b4-d0dd-9b3f-08301700ac72 - list-all-nodes
```

The anonymous token is implicitly used if no token is supplied, so now we can run
`consul members` without supplying a token and we will be able to see the nodes:

```bash
$ consul members
Node  Address            Status  Type    Build  Protocol  DC  Segment
fox   172.20.20.10:8301  alive   server  1.4.0  2         kc  <all>
bear  172.20.20.11:8301  alive   server  1.4.0  2         kc  <all>
wolf  172.20.20.12:8301  alive   server  1.4.0  2         kc  <all>
```

The anonymous token is also used for DNS lookups since there is no way to pass a
token as part of a DNS request. Here's an example lookup for the "consul" service:

```
$ dig @127.0.0.1 -p 8600 consul.service.consul

; <<>> DiG 9.8.3-P1 <<>> @127.0.0.1 -p 8600 consul.service.consul
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 9648
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;consul.service.consul. IN A

;; AUTHORITY SECTION:
consul. 0 IN SOA ns.consul. postmaster.consul. 1499584110 3600 600 86400 0
```

Now we get an `NXDOMAIN` error because the anonymous token doesn't have access to the
"consul" service. Let's update the anonymous token's policy to allow for service reads of the "consul" service.

```bash
$ consul acl policy create -name 'service-consul-read' -rules 'service "consul" { policy = "read" }'
ID:           3c93f536-5748-2163-bb66-088d517273ba
Name:         service-consul-read
Description:
Datacenters:
Rules:
service "consul" { policy = "read" }

$ consul acl token update -id 00000000-0000-0000-0000-000000000002 --merge-policies -description "Anonymous Token - Can List Nodes" -policy-name service-consul-read
Token updated successfully.
AccessorID:   00000000-0000-0000-0000-000000000002
SecretID:     anonymous
Description:  Anonymous Token - Can List Nodes
Local:        false
Create Time:  0001-01-01 00:00:00 +0000 UTC
Hash:         2c641c4f73158ef6d62f6467c68d751fccd4db9df99b235373e25934f9bbd939
Create Index: 5
Modify Index: 43
Policies:
   e96d0a33-28b4-d0dd-9b3f-08301700ac72 - list-all-nodes
   3c93f536-5748-2163-bb66-088d517273ba - service-consul-read
```

With that new policy in place, the DNS lookup will succeed:

```
$ dig @127.0.0.1 -p 8600 consul.service.consul

; <<>> DiG 9.8.3-P1 <<>> @127.0.0.1 -p 8600 consul.service.consul
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46006
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;consul.service.consul. IN A

;; ANSWER SECTION:
consul.service.consul. 0 IN A 127.0.0.1
```

The next section shows an alternative to the anonymous token.

#### Set Agent-Specific Default Tokens (Optional)

An alternative to the anonymous token is the [`acl.tokens.default`](/docs/agent/options#acl_tokens_default)
configuration item. When a request is made to a particular Consul agent and no token is
supplied, the [`acl.tokens.default`](/docs/agent/options#acl_tokens_default) will be used for the token, instead of being left empty which would normally invoke the anonymous token.

This behaves very similarly to the anonymous token, but can be configured differently on each
agent, if desired. For example, this allows more fine grained control of what DNS requests a
given agent can service or can give the agent read access to some key-value store prefixes by
default.

If using [`acl.tokens.default`](/docs/agent/options#acl_tokens_default), then it's likely the anonymous token will have a more restrictive policy than shown in these examples.
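
As an illustration of where this setting lives, the sketch below prints an agent configuration fragment with a default token, mirroring the `acl.tokens.agent` layout used earlier in this guide. The helper name `render_default_token_config` and the placeholder token are hypothetical; substitute the `SecretID` of a token created for this purpose.

```shell
# Hypothetical helper: print an agent configuration fragment that sets
# acl.tokens.default to the given token.
render_default_token_config() {
  default_token=$1
  cat <<EOF
{
  "acl": {
    "tokens": {
      "default": "${default_token}"
    }
  }
}
EOF
}

# The argument is a placeholder, not a real token.
render_default_token_config "<default-token-secret-id>"
```

Because the agent merges multiple configuration files, a fragment like this could live in its own file alongside the main agent configuration.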

#### Create Tokens for UI Use (Optional)

If you utilize the Consul UI with a restrictive ACL policy, as above, the UI will not function fully using the anonymous ACL token. It is recommended that a UI-specific ACL token is used, which can be set in the UI during the web browser session to authenticate the interface.

First create the new policy.

```bash
$ consul acl policy create -name "ui-policy" \
  -description "Necessary permissions for UI functionality" \
  -rules 'key_prefix "" { policy = "write" } node_prefix "" { policy = "read" } service_prefix "" { policy = "read" }'
ID:           9cb99b2b-3c20-81d4-a7c0-9ffdc2fbf08a
Name:         ui-policy
Description:  Necessary permissions for UI functionality
Datacenters:
Rules:
key_prefix "" { policy = "write" } node_prefix "" { policy = "read" } service_prefix "" { policy = "read" }
```

With the new policy, create a token.

```shell
$ consul acl token create -description "UI Token" -policy-name "ui-policy"
AccessorID:   56e605cf-a6f9-5f9d-5c08-a0e1323cf016
SecretID:     117842b6-6208-446a-0d1e-daf93854857d
Description:  UI Token
Local:        false
Create Time:  2018-10-19 14:55:44.254063 -0400 EDT
Policies:
   9cb99b2b-3c20-81d4-a7c0-9ffdc2fbf08a - ui-policy
```

The token can then be set on the "settings" page of the UI.

Note, in this example, we have also given full write access to the KV through the UI.

## Summary

The [ACL API](/api/acl/acl) can be used to create tokens for applications specific to their intended use and to create more specific ACL agent tokens for each agent's expected role.

Now that you have bootstrapped ACLs, learn more about [ACL rules](/docs/agent/acl-rules).

### Notes on Security

In this guide we configured a basic ACL environment with the ability to see all nodes
by default, but with limited access to discover only the "consul" service. If your environment has stricter security requirements we would like to note the following and make some additional recommendations.

1. In this guide we added the agent token to the configuration file. This means the tokens are now saved on disk. If this is a security concern, tokens can be added to agents using the [Consul CLI](/docs/commands/acl/acl-set-agent-token). However, this process is more complicated and takes additional care.

2. It is recommended that each client get an ACL agent token with `node` write privileges for just its own node name, and `service` read privileges for just the service prefixes expected to be registered on that client.

3. [Anti-entropy](/docs/internals/anti-entropy) syncing requires the ACL agent token
   to have `service:write` privileges for all services that may be registered with the agent.
   We recommend providing `service:write` for each separate service via a separate token that
   is used when registering via the API, or provided along with the [registration in the
   configuration file](/docs/agent/services). Note that `service:write`
   is the privilege required to assume the identity of a service and so Consul Connect's
   intentions are only enforceable to the extent that each service instance is unable to gain
   `service:write` on any other service name. For more details see the Connect security
   [documentation](/docs/connect/security).

@ -1,179 +0,0 @@
---
layout: docs
page_title: Multiple Datacenters - Advanced Federation with Network Areas
description: >-
  One of the key features of Consul is its support for multiple datacenters. The
  architecture of Consul is designed to promote low coupling of datacenters so
  that connectivity issues or failure of any datacenter does not impact the
  availability of Consul in other datacenters. This means each datacenter runs
  independently, each having a dedicated group of servers and a private LAN
  gossip pool.
---

# [Enterprise] Multiple Datacenters: Advanced Federation with Network Areas

~> The network area functionality described here is available only in [Consul Enterprise](https://www.hashicorp.com/products/consul/) version 0.8.0 and later.

One of the key features of Consul is its support for multiple datacenters.
The [architecture](/docs/internals/architecture) of Consul is designed to
promote a low coupling of datacenters so that connectivity issues or
failure of any datacenter does not impact the availability of Consul in other
datacenters. This means each datacenter runs independently, each having a dedicated
group of servers and a private LAN [gossip pool](/docs/internals/gossip).

This guide covers the advanced form of federating Consul clusters using the new
network areas capability added in [Consul Enterprise](https://www.hashicorp.com/products/consul/)
version 0.8.0. For the basic form of federation available in the open source version
of Consul, please see the [Basic Federation Guide](/docs/guides/datacenters)
for more details.

## Network Area Overview

Consul's [Basic Federation](/docs/guides/datacenters) support relies on all
Consul servers in all datacenters having full mesh connectivity via server RPC
(8300/tcp) and Serf WAN (8302/tcp and 8302/udp). Securing this setup requires TLS
in combination with managing a gossip keyring. With massive Consul deployments, it
becomes tricky to support a full mesh with all Consul servers, and to manage the
keyring.

Consul Enterprise version 0.8.0 added support for a new federation model based on
operator-created network areas. Network areas specify a relationship between a
pair of Consul datacenters. Operators create reciprocal areas on each side of the
relationship and then join them together, so a given Consul datacenter can participate
in many areas, even when some of the peer areas cannot contact each other. This
allows for more flexible relationships between Consul datacenters, such as hub/spoke
or more general tree structures. Traffic between areas is all performed via server
RPC (8300/tcp) so it can be secured with just TLS.

Currently, Consul will only route RPC requests to datacenters it is immediately adjacent
to via an area (or via the WAN), but future versions of Consul may add routing support.

The following can be used to manage network areas:

- [Network Areas HTTP Endpoint](/api/operator/area)
- [Network Areas CLI](/docs/commands/operator/area)

### Network Areas and the WAN Gossip Pool

Network areas can be used alongside Consul's [Basic Federation](/docs/guides/datacenters)
model and the WAN gossip pool. This helps ease migration, and clusters like the
[primary datacenter](/docs/agent/options#primary_datacenter) are more easily managed via
the WAN because they need to be available to all Consul datacenters.

A peer datacenter can be connected via the WAN gossip pool and a network area at the
same time, and RPCs will be forwarded as long as servers are available in either.

## Configure Advanced Federation

To get started, follow the [Deployment guide](https://learn.hashicorp.com/consul/advanced/day-1-operations/deployment-guide/) to
start each datacenter. After bootstrapping, we should have two datacenters now which
we can refer to as `dc1` and `dc2`. Note that datacenter names are opaque to Consul;
they are simply labels that help human operators reason about the Consul clusters.

### Create Areas in both Datacenters

A compatible pair of areas must be created in each datacenter:

```shell
(dc1) $ consul operator area create -peer-datacenter=dc2
Created area "cbd364ae-3710-1770-911b-7214e98016c0" with peer datacenter "dc2"!
```

```shell
(dc2) $ consul operator area create -peer-datacenter=dc1
Created area "2aea3145-f1e3-cb1d-a775-67d15ddd89bf" with peer datacenter "dc1"!
```

Now you can query for the members of the area:

```shell
(dc1) $ consul operator area members
Area                                  Node        Address         Status  Build         Protocol  DC   RTT
cbd364ae-3710-1770-911b-7214e98016c0  node-1.dc1  127.0.0.1:8300  alive   0.8.0_entrc1  2         dc1  0s
```

### Join Servers

Consul will automatically make sure that all servers within the datacenter where
the area was created are joined to the area using the LAN information. We need to
join with at least one Consul server in the other datacenter to complete the area:

```shell
(dc1) $ consul operator area join -peer-datacenter=dc2 127.0.0.2
Address    Joined  Error
127.0.0.2  true    (none)
```

With a successful join, we should now see the remote Consul servers as part of the
area's members:

```shell
(dc1) $ consul operator area members
Area                                  Node        Address         Status  Build         Protocol  DC   RTT
cbd364ae-3710-1770-911b-7214e98016c0  node-1.dc1  127.0.0.1:8300  alive   0.8.0_entrc1  2         dc1  0s
cbd364ae-3710-1770-911b-7214e98016c0  node-2.dc2  127.0.0.2:8300  alive   0.8.0_entrc1  2         dc2  581.649µs
```

### Route RPCs

Now we can route RPC commands in both directions. Here's a sample command to set a KV
entry in dc2 from dc1:

```shell
(dc1) $ consul kv put -datacenter=dc2 hello world
Success! Data written to: hello
```

### DNS Lookups

The DNS interface supports federation as well:

```shell
(dc1) $ dig @127.0.0.1 -p 8600 consul.service.dc2.consul

; <<>> DiG 9.8.3-P1 <<>> @127.0.0.1 -p 8600 consul.service.dc2.consul
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49069
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;consul.service.dc2.consul. IN A

;; ANSWER SECTION:
consul.service.dc2.consul. 0 IN A 127.0.0.2

;; Query time: 3 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Wed Mar 29 11:27:35 2017
;; MSG SIZE rcvd: 59
```

There are a few networking requirements that must be satisfied for this to
work. Of course, all server nodes must be able to talk to each other via their server
RPC ports (8300/tcp). If service discovery is to be used across datacenters, the
network must be able to route traffic between IP addresses across regions as well.
Usually, this means that all datacenters must be connected using a VPN or other
tunneling mechanism. Consul does not handle VPN or NAT traversal for you.

The [`translate_wan_addrs`](/docs/agent/options#translate_wan_addrs) configuration
provides a basic address rewriting capability.

## Data Replication

In general, data is not replicated between different Consul datacenters. When a
request is made for a resource in another datacenter, the local Consul servers forward
an RPC request to the remote Consul servers for that resource and return the results.
If the remote datacenter is not available, then those resources will also not be
available, but that won't otherwise affect the local datacenter. There are some special
situations where a limited subset of data can be replicated, such as with Consul's built-in
[ACL replication](/docs/guides/acl#outages-and-acl-replication/) capability, or
external tools like [consul-replicate](https://github.com/hashicorp/consul-replicate/).

## Summary

In this guide, you set up advanced federation using
network areas. You then learned how to route RPC commands and use
the DNS interface with multiple datacenters.
|
|
@ -1,204 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Agent Communication Encryption
|
||||
description: This guide covers how to encrypt both gossip and RPC communication.
|
||||
---
|
||||
|
||||
# Agent Communication Encryption
|
||||
|
||||
There are two different systems that need to be configured separately to encrypt communication within the cluster: gossip encryption and TLS. TLS is used to secure the RPC calls between agents. Gossip encryption is secured with a symmetric key, since gossip between nodes is done over UDP. In this guide we will configure both.
|
||||
|
||||
To complete the RPC encryption section, you must have [configured agent certificates](/docs/guides/creating-certificates).
|
||||
|
||||
## Gossip Encryption
|
||||
|
||||
To enable gossip encryption, you need to use an encryption key when starting the Consul agent. The key can simply be set with the `encrypt` parameter in the agent configuration file. Alternatively, the encryption key can be placed in a separate configuration file with only the `encrypt` field, since the agent can merge multiple configuration files. The key must be 32 bytes, Base64 encoded.
|
||||
|
||||
You can use the Consul CLI command, [`consul keygen`](/docs/commands/keygen), to generate a cryptographically suitable key.
|
||||
|
||||
```shell
|
||||
$ consul keygen
|
||||
pUqJrVyVRj5jsiYEkM/tFQYfWyJIv4s3XkvDwy7Cu5s=
|
||||
```
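As a quick sanity check before distributing a key, the size requirement above can be verified by decoding it. The following is a minimal Python sketch, not part of Consul; the helper name `is_valid_gossip_key` is hypothetical.

```python
import base64


def is_valid_gossip_key(key):
    """Return True if key is valid Base64 that decodes to exactly 32 bytes."""
    try:
        raw = base64.b64decode(key, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    return len(raw) == 32


# The sample key from this guide decodes to 32 bytes.
print(is_valid_gossip_key("pUqJrVyVRj5jsiYEkM/tFQYfWyJIv4s3XkvDwy7Cu5s="))
```

A check like this only confirms the encoding and length; it says nothing about how the key was generated, so `consul keygen` remains the recommended source.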
|
||||
|
||||
### Enable Gossip Encryption: New Cluster
|
||||
|
||||
To enable gossip encryption on a new cluster, we will add the encryption key parameter to the
|
||||
agent configuration file and then pass the file at startup with the [`-config-dir`](/docs/agent/options#_config_dir) flag.
|
||||
|
||||
```javascript
|
||||
{
|
||||
"data_dir": "/opt/consul",
|
||||
"log_level": "INFO",
|
||||
"node_name": "bulldog",
|
||||
"server": true,
|
||||
"encrypt": "pUqJrVyVRj5jsiYEkM/tFQYfWyJIv4s3XkvDwy7Cu5s="
|
||||
}
|
||||
```
|
||||
|
||||
```shell
|
||||
$ consul agent -config-dir=/etc/consul.d/
|
||||
==> Starting Consul agent...
|
||||
==> Starting Consul agent RPC...
|
||||
==> Consul agent running!
|
||||
Node name: 'Armons-MacBook-Air.local'
|
||||
Datacenter: 'dc1'
|
||||
Server: false (bootstrap: false)
|
||||
Client Addr: 127.0.0.1 (HTTP: 8500, HTTPS: -1, DNS: 8600, RPC: 8400)
|
||||
Cluster Addr: 10.1.10.12 (LAN: 8301, WAN: 8302)
|
||||
Gossip encrypt: true, RPC-TLS: false, TLS-Incoming: false
|
||||
...
|
||||
```
|
||||
|
||||
"Gossip encrypt: true" will be included in the output if encryption is properly configured.
|
||||
|
||||
Note: all nodes within a cluster must share the same encryption key in order to send and receive cluster information, including clients and servers. Additionally, if you're using multiple WAN joined datacenters, be sure to use _the same encryption key_ in all datacenters.
|
||||
|
||||
### Enable Gossip Encryption: Existing Cluster
|
||||
|
||||
Gossip encryption can also be enabled on an existing cluster, but this requires several extra steps and two additional agent configuration parameters: [`encrypt_verify_incoming`](/docs/agent/options#encrypt_verify_incoming) and [`encrypt_verify_outgoing`](/docs/agent/options#encrypt_verify_outgoing).
|
||||
|
||||
**Step 1**: Generate an encryption key using `consul keygen`.
|
||||
|
||||
```shell
|
||||
$ consul keygen
|
||||
pUqJrVyVRj5jsiYEkM/tFQYfWyJIv4s3XkvDwy7Cu5s=
|
||||
```
|
||||
|
||||
**Step 2**: Set the [`encrypt`](/docs/agent/options#_encrypt) key, and set `encrypt_verify_incoming` and `encrypt_verify_outgoing` to `false` in the agent configuration file. Then initiate a rolling update of the cluster with these new values. After this step, the agents will be able to decrypt gossip but will not yet be sending encrypted traffic.
|
||||
|
||||
```javascript
|
||||
{
|
||||
"data_dir": "/opt/consul",
|
||||
"log_level": "INFO",
|
||||
"node_name": "bulldog",
|
||||
"server": true,
|
||||
"encrypt": "pUqJrVyVRj5jsiYEkM/tFQYfWyJIv4s3XkvDwy7Cu5s=",
|
||||
"encrypt_verify_incoming": false,
|
||||
"encrypt_verify_outgoing": false
|
||||
}
|
||||
```
|
||||
|
||||
A rolling update can be made by restarting the Consul agents (clients and servers) in turn. `consul reload` or `kill -HUP <process_id>` is _not_ sufficient to change the gossip configuration.
|
||||
|
||||
**Step 3**: Update the `encrypt_verify_outgoing` setting to `true` and perform another rolling update of the cluster by restarting Consul on each agent. The agents will now be sending encrypted gossip but will still allow incoming unencrypted traffic.
|
||||
|
||||
```javascript
|
||||
{
|
||||
"data_dir": "/opt/consul",
|
||||
"log_level": "INFO",
|
||||
"node_name": "bulldog",
|
||||
"server": true,
|
||||
"encrypt": "pUqJrVyVRj5jsiYEkM/tFQYfWyJIv4s3XkvDwy7Cu5s=",
|
||||
"encrypt_verify_incoming": false,
|
||||
"encrypt_verify_outgoing": true
|
||||
}
|
||||
```
|
||||
|
||||
**Step 4**: The previous step, enabling verify outgoing, must be completed on all agents before continuing. Update the `encrypt_verify_incoming` setting to `true` and perform a final rolling update of the cluster.
|
||||
|
||||
```javascript
|
||||
{
|
||||
"data_dir": "/opt/consul",
|
||||
"log_level": "INFO",
|
||||
"node_name": "bulldog",
|
||||
"server": true,
|
||||
"encrypt": "pUqJrVyVRj5jsiYEkM/tFQYfWyJIv4s3XkvDwy7Cu5s=",
|
||||
"encrypt_verify_incoming": true,
|
||||
"encrypt_verify_outgoing": true
|
||||
}
|
||||
```
|
||||
|
||||
All the agents will now be strictly enforcing encrypted gossip. Note, the default
|
||||
behavior of both `encrypt_verify_incoming` and `encrypt_verify_outgoing` is `true`.
|
||||
We have set them in the configuration file as an explicit example.
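The three rolling updates in Steps 2 through 4 differ only in the two verify flags, so the configuration for each pass can be generated from one base file. The sketch below is illustrative only; `ROLLOUT_STAGES` and `staged_gossip_configs` are hypothetical names, and the base settings are copied from the examples above.

```python
import json

# (encrypt_verify_incoming, encrypt_verify_outgoing) for Steps 2, 3, and 4.
ROLLOUT_STAGES = [(False, False), (False, True), (True, True)]


def staged_gossip_configs(base_config, key):
    """Return one config dict per rolling update, in the order to apply them."""
    stages = []
    for verify_incoming, verify_outgoing in ROLLOUT_STAGES:
        config = dict(base_config)  # copy so the caller's dict is untouched
        config["encrypt"] = key
        config["encrypt_verify_incoming"] = verify_incoming
        config["encrypt_verify_outgoing"] = verify_outgoing
        stages.append(config)
    return stages


base = {
    "data_dir": "/opt/consul",
    "log_level": "INFO",
    "node_name": "bulldog",
    "server": True,
}
key = "pUqJrVyVRj5jsiYEkM/tFQYfWyJIv4s3XkvDwy7Cu5s="
for step, config in enumerate(staged_gossip_configs(base, key), start=2):
    print(f"Step {step}:", json.dumps(config))
```

Each generated configuration still has to be rolled out with a full agent restart on every node before moving on to the next one.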
|
||||
|
||||
## TLS Encryption for RPC
|
||||
|
||||
Consul supports using TLS to verify the authenticity of servers and clients. To enable TLS,
|
||||
Consul requires that all servers have certificates that are signed by a single
|
||||
Certificate Authority. Clients may optionally authenticate with a client certificate generated by the same CA. Please see
|
||||
[this tutorial on creating a CA and signing certificates](/docs/guides/creating-certificates).
|
||||
|
||||
TLS can be used to verify the authenticity of the servers with [`verify_outgoing`](/docs/agent/options#verify_outgoing) and [`verify_server_hostname`](/docs/agent/options#verify_server_hostname). It can also optionally verify client certificates when using [`verify_incoming`](/docs/agent/options#verify_incoming).
|
||||
|
||||
Review the [docs for specifics](/docs/agent/encryption).
|
||||
|
||||
In Consul version 0.8.4 and newer, migrating to TLS encrypted traffic on a running cluster
|
||||
is supported.
|
||||
|
||||
### Enable TLS: New Cluster
|
||||
|
||||
After TLS has been configured on all the agents, you can start the agents and RPC communication will be encrypted.
|
||||
|
||||
```javascript
|
||||
{
|
||||
"data_dir": "/opt/consul",
|
||||
"log_level": "INFO",
|
||||
"node_name": "bulldog",
|
||||
"server": true,
|
||||
"encrypt": "pUqJrVyVRj5jsiYEkM/tFQYfWyJIv4s3XkvDwy7Cu5s=",
|
||||
"verify_incoming": true,
|
||||
"verify_outgoing": true,
|
||||
"verify_server_hostname": true,
|
||||
"ca_file": "consul-agent-ca.pem",
|
||||
"cert_file": "dc1-server-consul-0.pem",
|
||||
"key_file": "dc1-server-consul-0-key.pem"
|
||||
}
|
||||
```
|
||||
|
||||
Note, for clients, the default `cert_file` and `key_file` will be named according to their cluster. For example, `dc1-client-consul-0.pem`.
|
||||
|
||||
The `verify_outgoing` parameter enables agents to verify the authenticity of Consul servers for outgoing connections. The `verify_server_hostname` parameter requires outgoing connections to perform hostname verification and is critically important to prevent compromised client agents from becoming servers and revealing all state to the attacker. Finally, the `verify_incoming` parameter enables the servers to verify the authenticity of all incoming connections.
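Because these settings depend on each other, it can help to lint an agent configuration before restarting. The following Python sketch is a hypothetical checker, not a Consul feature; the rules it applies are a simplified reading of this guide, and treating a missing `verify_server_hostname` as a problem is an assumption based on the warning above.

```python
def tls_config_problems(config):
    """List human-readable problems with an agent's TLS settings."""
    problems = []
    # A CA is needed to verify peers; either setting name is accepted.
    if not (config.get("ca_file") or config.get("ca_path")):
        problems.append("no ca_file or ca_path set")
    for field in ("cert_file", "key_file"):
        if not config.get(field):
            problems.append(f"no {field} set")
    # Assumption: flag outgoing verification without hostname verification.
    if config.get("verify_outgoing") and not config.get("verify_server_hostname"):
        problems.append("verify_outgoing is enabled without verify_server_hostname")
    return problems


server_config = {
    "verify_incoming": True,
    "verify_outgoing": True,
    "verify_server_hostname": True,
    "ca_file": "consul-agent-ca.pem",
    "cert_file": "dc1-server-consul-0.pem",
    "key_file": "dc1-server-consul-0-key.pem",
}
print(tls_config_problems(server_config))
```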
|
||||
|
||||
### Enable TLS: Existing Cluster
|
||||
|
||||
Enabling TLS on an existing cluster is supported. This process assumes a starting point of a running cluster with no TLS settings configured, and involves an intermediate step in order to get to full TLS encryption.
|
||||
|
||||
**Step 1**: [Generate the necessary keys and certificates](/docs/guides/creating-certificates), then set the `ca_file` or `ca_path`, `cert_file`, and `key_file` settings in the configuration for each agent. Make sure the `verify_outgoing` and `verify_incoming` options are set to `false`. HTTPS for the API can be enabled at this point by setting the [`https`](/docs/agent/options#http_port) port.
|
||||
|
||||
```javascript
|
||||
{
|
||||
"data_dir": "/opt/consul",
|
||||
"log_level": "INFO",
|
||||
"node_name": "bulldog",
|
||||
"server": true,
|
||||
"encrypt": "pUqJrVyVRj5jsiYEkM/tFQYfWyJIv4s3XkvDwy7Cu5s=",
|
||||
"verify_incoming": false,
|
||||
"verify_outgoing": false,
|
||||
"ca_file": "consul-agent-ca.pem",
|
||||
"cert_file": "dc1-server-consul-0.pem",
|
||||
"key_file": "dc1-server-consul-0-key.pem"
|
||||
}
|
||||
```
|
||||
|
||||
Next, perform a rolling restart of each agent in the cluster. After this step, TLS should be enabled everywhere but the agents will not yet be enforcing TLS. Again, `consul reload` or `kill -HUP <process_id>` is _not_ sufficient to update the configuration.
|
||||
|
||||
**Step 2**: (Optional, Enterprise-only) If applicable, set the `Use TLS` setting in any network areas to `true`. This can be done either through the [`consul operator area update`](/docs/commands/operator/area) command or the [Operator API](/api/operator/area).
|
||||
|
||||
**Step 3**: Change the `verify_incoming`, `verify_outgoing`, and `verify_server_hostname` settings to `true`, then perform another rolling restart of each agent in the cluster.
|
||||
|
||||
```javascript
|
||||
{
|
||||
"data_dir": "/opt/consul",
|
||||
"log_level": "INFO",
|
||||
"node_name": "bulldog",
|
||||
"server": true,
|
||||
"encrypt": "pUqJrVyVRj5jsiYEkM/tFQYfWyJIv4s3XkvDwy7Cu5s=",
|
||||
"verify_incoming": true,
|
||||
"verify_outgoing": true,
|
||||
"verify_server_hostname": true,
|
||||
"ca_file": "consul-agent-ca.pem",
|
||||
"cert_file": "dc1-server-consul-0.pem",
|
||||
"key_file": "dc1-server-consul-0-key.pem"
|
||||
}
|
||||
|
||||
```
|
||||
|
||||
At this point, full TLS encryption for RPC communication is enabled. To disable `HTTP`
|
||||
connections, which may still be in use by clients for API and CLI communications, update
|
||||
the [agent configuration](/docs/agent/options#ports).
|
||||
|
||||
## Summary
|
||||
|
||||
In this guide we configured both gossip encryption and TLS for RPC. Securing agent communication is a recommended step in setting up a production-ready cluster.
|
|
@ -1,297 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Autopilot
|
||||
description: This guide covers how to configure and use Autopilot features.
|
||||
---
|
||||
|
||||
# Autopilot
|
||||
|
||||
Autopilot features allow for automatic,
|
||||
operator-friendly management of Consul servers. It includes cleanup of dead
|
||||
servers, monitoring the state of the Raft cluster, and stable server introduction.
|
||||
|
||||
To enable Autopilot features (with the exception of dead server cleanup),
|
||||
the [`raft_protocol`](/docs/agent/options#_raft_protocol) setting in
|
||||
the Agent configuration must be set to 3 or higher on all servers. In Consul
|
||||
0.8 this setting defaults to 2; in Consul 1.0 it will default to 3. For more
|
||||
information, see the [Version Upgrade section](/docs/upgrade-specific#raft_protocol)
|
||||
on Raft Protocol versions.
|
||||
|
||||
In this guide we will learn more about Autopilot's features.
|
||||
|
||||
- Dead server cleanup
|
||||
- Server Stabilization
|
||||
- Redundancy zone tags
|
||||
- Upgrade migration
|
||||
|
||||
Finally, we will review how to ensure Autopilot is healthy.
|
||||
|
||||
Note, in this guide we are using examples from a Consul 1.4 cluster, so we
|
||||
are starting with Autopilot enabled by default.
|
||||
|
||||
## Default Configuration
|
||||
|
||||
The configuration of Autopilot is loaded by the leader from the agent's
|
||||
[Autopilot settings](/docs/agent/options#autopilot) when initially
|
||||
bootstrapping the cluster. Since Autopilot and its features are already
|
||||
enabled, we only need to update the configuration to disable them. The
|
||||
following are the defaults.
|
||||
|
||||
```
|
||||
{
|
||||
"cleanup_dead_servers": true,
|
||||
"last_contact_threshold": "200ms",
|
||||
"max_trailing_logs": 250,
|
||||
"server_stabilization_time": "10s",
|
||||
"redundancy_zone_tag": "",
|
||||
"disable_upgrade_migration": false,
|
||||
"upgrade_version_tag": ""
|
||||
}
|
||||
```
|
||||
|
||||
All Consul servers should have Autopilot and its features either enabled
|
||||
or disabled to ensure consistency across servers in case of a failure. Additionally,
|
||||
Autopilot must be enabled to use any of the features, but the features themselves
|
||||
can be configured independently, meaning you can enable or disable any of the features
|
||||
separately, at any time.
|
||||
|
||||
After bootstrapping, the configuration can be viewed or modified either via the
|
||||
[`operator autopilot`](/docs/commands/operator/autopilot) subcommand or the
|
||||
[`/v1/operator/autopilot/configuration`](/api/operator#autopilot-configuration)
|
||||
HTTP endpoint.
|
||||
|
||||
```
|
||||
$ consul operator autopilot get-config
|
||||
CleanupDeadServers = true
|
||||
LastContactThreshold = 200ms
|
||||
MaxTrailingLogs = 250
|
||||
ServerStabilizationTime = 10s
|
||||
RedundancyZoneTag = ""
|
||||
DisableUpgradeMigration = false
|
||||
UpgradeVersionTag = ""
|
||||
```
|
||||
|
||||
In the example above, we used the `operator autopilot get-config` subcommand to check
|
||||
the autopilot configuration. You can see we still have all the defaults.
|
||||
|
||||
## Dead Server Cleanup
|
||||
|
||||
If Autopilot is disabled, it will take 72 hours for dead servers to be automatically reaped
|
||||
or an operator must script a `consul force-leave`. If another server failure occurred
|
||||
it could jeopardize the quorum, even if the failed Consul server had been automatically
|
||||
replaced. Autopilot helps prevent these kinds of outages by quickly removing failed
|
||||
servers as soon as a replacement Consul server comes online. When servers are removed
|
||||
by the cleanup process they will enter the "left" state.
|
||||
|
||||
With Autopilot's dead server cleanup enabled, dead servers will periodically be
|
||||
cleaned up and removed from the Raft peer set to prevent them from interfering with
|
||||
the quorum size and leader elections. The cleanup process will also be automatically
|
||||
triggered whenever a new server is successfully added to the cluster.
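To see why a dead server left in the Raft peer set matters, it helps to work through the quorum arithmetic. The sketch below uses the standard Raft majority rule (more than half of the peers); the function names are hypothetical.

```python
def quorum_size(peers):
    """Smallest majority of a Raft peer set with the given number of peers."""
    return peers // 2 + 1


def failure_tolerance(peers, alive):
    """How many more alive servers can fail before quorum is lost."""
    return alive - quorum_size(peers)


# Three servers, one has died but is still in the peer set.
print(failure_tolerance(peers=3, alive=2))
# A replacement joins, but the dead server has not been cleaned up.
print(failure_tolerance(peers=4, alive=3))
# The dead server is removed from the peer set.
print(failure_tolerance(peers=3, alive=3))
```

In the second case the cluster still cannot survive another failure even though a replacement is online. Only once the dead server is removed does the tolerance return to one.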
|
||||
|
||||
To update the dead server cleanup feature use `consul operator autopilot set-config`
|
||||
with the `-cleanup-dead-servers` flag.
|
||||
|
||||
```shell
|
||||
$ consul operator autopilot set-config -cleanup-dead-servers=false
|
||||
Configuration updated!
|
||||
|
||||
$ consul operator autopilot get-config
|
||||
CleanupDeadServers = false
|
||||
LastContactThreshold = 200ms
|
||||
MaxTrailingLogs = 250
|
||||
ServerStabilizationTime = 10s
|
||||
RedundancyZoneTag = ""
|
||||
DisableUpgradeMigration = false
|
||||
UpgradeVersionTag = ""
|
||||
```
|
||||
|
||||
We have disabled dead server cleanup, but still have all the other Autopilot defaults.
|
||||
|
||||
## Server Stabilization
|
||||
|
||||
When a new server is added to the cluster, there is a waiting period where it
|
||||
must be healthy and stable for a certain amount of time before being promoted
|
||||
to a full, voting member. This can be configured via the `ServerStabilizationTime`
|
||||
setting.
|
||||
|
||||
```shell
|
||||
$ consul operator autopilot set-config -server-stabilization-time=5s
|
||||
Configuration updated!
|
||||
|
||||
$ consul operator autopilot get-config
|
||||
CleanupDeadServers = false
|
||||
LastContactThreshold = 200ms
|
||||
MaxTrailingLogs = 250
|
||||
ServerStabilizationTime = 5s
|
||||
RedundancyZoneTag = ""
|
||||
DisableUpgradeMigration = false
|
||||
UpgradeVersionTag = ""
|
||||
```
|
||||
|
||||
Now we have disabled dead server cleanup and set the server stabilization time to 5 seconds.
|
||||
When a new server is added to our cluster, it will only need to be healthy and stable for
|
||||
5 seconds.
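The waiting period can be pictured as a simple time comparison. This is a simplified Python model of the rule described above, not Autopilot's actual implementation; `ready_for_promotion` and its parameters are hypothetical, and the default of 10 seconds mirrors the default `ServerStabilizationTime`.

```python
from datetime import datetime, timedelta, timezone


def ready_for_promotion(healthy, stable_since, now,
                        stabilization_time=timedelta(seconds=10)):
    """True once a healthy server has been stable for the full waiting period."""
    return healthy and now - stable_since >= stabilization_time


stable_since = datetime(2017, 3, 28, 18, 29, 10, tzinfo=timezone.utc)
now = stable_since + timedelta(seconds=7)

# With the stabilization time lowered to 5s, 7 seconds of stability is enough.
print(ready_for_promotion(True, stable_since, now, timedelta(seconds=5)))
# With the default of 10s, the same server is still waiting.
print(ready_for_promotion(True, stable_since, now))
```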
|
||||
|
||||
## Redundancy Zones
|
||||
|
||||
Prior to Autopilot, it was difficult to deploy servers in a way that took advantage of
|
||||
isolated failure domains such as AWS Availability Zones; users would be forced to either
|
||||
have an overly-large quorum (2-3 nodes per AZ) or give up redundancy within an AZ by
|
||||
deploying just one server in each.
|
||||
|
||||
If the `RedundancyZoneTag` setting is set, Consul will use its value to look for a
|
||||
zone in each server's specified [`-node-meta`](/docs/agent/options#_node_meta)
|
||||
tag. For example, if `RedundancyZoneTag` is set to `zone`, and `-node-meta zone:east1a`
|
||||
is used when starting a server, that server's redundancy zone will be `east1a`.
|
||||
|
||||
```
|
||||
$ consul operator autopilot set-config -redundancy-zone-tag=zone
|
||||
Configuration updated!
|
||||
|
||||
$ consul operator autopilot get-config
|
||||
CleanupDeadServers = false
|
||||
LastContactThreshold = 200ms
|
||||
MaxTrailingLogs = 250
|
||||
ServerStabilizationTime = 5s
|
||||
RedundancyZoneTag = "zone"
|
||||
DisableUpgradeMigration = false
|
||||
UpgradeVersionTag = ""
|
||||
```
|
||||
|
||||
For our Autopilot features, we have now disabled dead server cleanup, set the server stabilization time to 5 seconds, and
|
||||
set the redundancy zone tag to `zone`.
|
||||
|
||||
Consul will then use these values to partition the servers by redundancy zone, and will
|
||||
aim to keep one voting server per zone. Extra servers in each zone will stay as non-voters
|
||||
on standby to be promoted if the active voter leaves or dies.
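The partitioning described above can be sketched as grouping servers by their zone metadata. This Python example is only an illustration: it assumes every server carries the tag, and it picks the first server seen in each zone as the voter, which is an assumption rather than Autopilot's real selection logic. The name `voters_by_zone` and the server dictionaries are hypothetical.

```python
def voters_by_zone(servers, zone_tag):
    """Split servers into one voter per zone plus standby non-voters."""
    voters = {}   # zone -> name of the voting server
    standby = []  # names of non-voters waiting for promotion
    for server in servers:
        zone = server["meta"][zone_tag]
        if zone not in voters:
            voters[zone] = server["name"]
        else:
            standby.append(server["name"])
    return voters, standby


servers = [
    {"name": "node1", "meta": {"zone": "east1a"}},
    {"name": "node2", "meta": {"zone": "east1a"}},
    {"name": "node3", "meta": {"zone": "east1b"}},
]
print(voters_by_zone(servers, "zone"))
```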
|
||||
|
||||
## Upgrade Migrations
|
||||
|
||||
Autopilot in Consul _Enterprise_ supports upgrade migrations by default. To disable this
|
||||
functionality, set `DisableUpgradeMigration` to true.
|
||||
|
||||
```shell
|
||||
$ consul operator autopilot set-config -disable-upgrade-migration=true
|
||||
Configuration updated!
|
||||
|
||||
$ consul operator autopilot get-config
|
||||
CleanupDeadServers = false
|
||||
LastContactThreshold = 200ms
|
||||
MaxTrailingLogs = 250
|
||||
ServerStabilizationTime = 5s
|
||||
RedundancyZoneTag = "zone"
|
||||
DisableUpgradeMigration = true
|
||||
UpgradeVersionTag = ""
|
||||
```
|
||||
|
||||
With upgrade migration enabled, when a new server is added and Autopilot detects that
|
||||
its Consul version is newer than that of the existing servers, Autopilot will avoid
|
||||
promoting the new server until enough newer-versioned servers have been added to the
|
||||
cluster. When the count of new servers equals or exceeds that of the old servers,
|
||||
Autopilot will begin promoting the new servers to voters and demoting the old servers.
|
||||
After this is finished, the old servers can be safely removed from the cluster.
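The promotion threshold described above is a count comparison between newer and older servers. Here is a minimal Python sketch of that rule, assuming versions are represented as integer tuples; `migration_ready` is a hypothetical name and the real Autopilot logic may consider more than this.

```python
def migration_ready(versions):
    """True when servers on the newest version equal or outnumber the rest."""
    newest = max(versions)
    new_count = sum(1 for version in versions if version == newest)
    old_count = len(versions) - new_count
    # With no older servers there is nothing to migrate.
    return old_count > 0 and new_count >= old_count


# One 1.4.0 server has joined two 1.3.0 servers: keep waiting.
print(migration_ready([(1, 4, 0), (1, 3, 0), (1, 3, 0)]))
# Two 1.4.0 servers and two 1.3.0 servers: promotion can begin.
print(migration_ready([(1, 4, 0), (1, 4, 0), (1, 3, 0), (1, 3, 0)]))
```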
|
||||
|
||||
To check the Consul version of the servers, you can either use the [autopilot health](/api/operator#autopilot-health) endpoint or the `consul members`
|
||||
command.
|
||||
|
||||
```
|
||||
$ consul members
|
||||
Node Address Status Type Build Protocol DC Segment
|
||||
node1 127.0.0.1:8301 alive server 1.4.0 2 dc1 <all>
|
||||
node2 127.0.0.1:8703 alive server 1.4.0 2 dc1 <all>
|
||||
node3 127.0.0.1:8803 alive server 1.4.0 2 dc1 <all>
|
||||
node4 127.0.0.1:8203 alive server 1.3.0 2 dc1 <all>
|
||||
```
|
||||
|
||||
### Migrations Without a Consul Version Change
|
||||
|
||||
The `UpgradeVersionTag` can be used to override the version information used during
|
||||
a migration, so that the migration logic can be used for updating the cluster when
|
||||
changing configuration.
|
||||
|
||||
If the `UpgradeVersionTag` setting is set, Consul will use its value to look for a
|
||||
version in each server's specified [`-node-meta`](/docs/agent/options#_node_meta)
|
||||
tag. For example, if `UpgradeVersionTag` is set to `build`, and `-node-meta build:0.0.2`
|
||||
is used when starting a server, that server's version will be `0.0.2` when considered in
|
||||
a migration. The upgrade logic will follow semantic versioning and the version string
|
||||
must be in the form of either `X`, `X.Y`, or `X.Y.Z`.
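The accepted version forms can be checked before tagging a server. The following Python sketch is a hypothetical validator for the `X`, `X.Y`, and `X.Y.Z` forms named above; padding missing components with zeros is an assumption made here so that versions compare consistently.

```python
def parse_version(text):
    """Parse 'X', 'X.Y', or 'X.Y.Z' into a 3-tuple of integers."""
    parts = text.split(".")
    if not 1 <= len(parts) <= 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"unsupported version string: {text!r}")
    numbers = [int(part) for part in parts]
    # Assumption: missing components count as zero, so '1.4' equals '1.4.0'.
    return tuple(numbers + [0] * (3 - len(numbers)))


print(parse_version("0.0.2"))
print(parse_version("1.4") < parse_version("1.10"))
```

Comparing the parsed tuples rather than the raw strings avoids ordering `1.10` before `1.4`.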
|
||||
|
||||
```shell
|
||||
$ consul operator autopilot set-config -upgrade-version-tag=1.4.0
|
||||
Configuration updated!
|
||||
|
||||
$ consul operator autopilot get-config
|
||||
CleanupDeadServers = false
|
||||
LastContactThreshold = 200ms
|
||||
MaxTrailingLogs = 250
|
||||
ServerStabilizationTime = 5s
|
||||
RedundancyZoneTag = "zone"
|
||||
DisableUpgradeMigration = true
|
||||
UpgradeVersionTag = "1.4.0"
|
||||
```
|
||||
|
||||
## Server Health Checking
|
||||
|
||||
An internal health check runs on the leader to track the stability of servers.
|
||||
|
||||
<br />A server is considered healthy if all of the following conditions are
|
||||
true.
|
||||
|
||||
- It has a SerfHealth status of 'Alive'.
|
||||
- The time since its last contact with the current leader is below
|
||||
`LastContactThreshold`.
|
||||
- Its latest Raft term matches the leader's term.
|
||||
- The number of Raft log entries it trails the leader by does not exceed
|
||||
`MaxTrailingLogs`.
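The four conditions above combine into a single boolean check. This Python sketch restates them under a few assumptions: durations are given in milliseconds, the threshold defaults mirror the default Autopilot configuration, and the exact boundary handling is a guess. `ServerStats` and `is_healthy` are hypothetical names, not Consul APIs.

```python
from dataclasses import dataclass


@dataclass
class ServerStats:
    serf_status: str        # for example "alive"
    last_contact_ms: float  # time since last contact with the leader
    last_term: int          # latest Raft term seen by the server
    last_index: int         # latest Raft log index on the server


def is_healthy(stats, leader_term, leader_index,
               last_contact_threshold_ms=200.0, max_trailing_logs=250):
    """Apply the four health conditions listed above to one server."""
    return (
        stats.serf_status == "alive"
        and stats.last_contact_ms < last_contact_threshold_ms
        and stats.last_term == leader_term
        and leader_index - stats.last_index <= max_trailing_logs
    )


follower = ServerStats("alive", 35.371007, last_term=2, last_index=10)
print(is_healthy(follower, leader_term=2, leader_index=10))
```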
|
||||
|
||||
The status of these health checks can be viewed through the [`/v1/operator/autopilot/health`](/api/operator#autopilot-health) HTTP endpoint, with a top level
|
||||
`Healthy` field indicating the overall status of the cluster:
|
||||
|
||||
```
|
||||
$ curl localhost:8500/v1/operator/autopilot/health
|
||||
{
|
||||
"Healthy": true,
|
||||
"FailureTolerance": 0,
|
||||
"Servers": [
|
||||
{
|
||||
"ID": "e349749b-3303-3ddf-959c-b5885a0e1f6e",
|
||||
"Name": "node1",
|
||||
"Address": "127.0.0.1:8300",
|
||||
"SerfStatus": "alive",
|
||||
"Version": "0.8.0",
|
||||
"Leader": true,
|
||||
"LastContact": "0s",
|
||||
"LastTerm": 2,
|
||||
"LastIndex": 10,
|
||||
"Healthy": true,
|
||||
"Voter": true,
|
||||
"StableSince": "2017-03-28T18:28:52Z"
|
||||
},
|
||||
{
|
||||
"ID": "e35bde83-4e9c-434f-a6ef-453f44ee21ea",
|
||||
"Name": "node2",
|
||||
"Address": "127.0.0.1:8705",
|
||||
"SerfStatus": "alive",
|
||||
"Version": "0.8.0",
|
||||
"Leader": false,
|
||||
"LastContact": "35.371007ms",
|
||||
"LastTerm": 2,
|
||||
"LastIndex": 10,
|
||||
"Healthy": true,
|
||||
"Voter": false,
|
||||
"StableSince": "2017-03-28T18:29:10Z"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
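A response like the one above is easy to summarize in a script. The sketch below parses a trimmed copy of that JSON and uses only field names that appear in it; `summarize_health` is a hypothetical helper.

```python
import json


def summarize_health(payload):
    """Reduce an autopilot health response to the fields an operator checks first."""
    data = json.loads(payload)
    servers = data["Servers"]
    return {
        "healthy": data["Healthy"],
        "failure_tolerance": data["FailureTolerance"],
        "voters": [s["Name"] for s in servers if s["Voter"]],
        "non_voters": [s["Name"] for s in servers if not s["Voter"]],
        "unhealthy": [s["Name"] for s in servers if not s["Healthy"]],
    }


# Trimmed version of the response shown above.
sample = """
{
  "Healthy": true,
  "FailureTolerance": 0,
  "Servers": [
    {"Name": "node1", "Healthy": true, "Voter": true},
    {"Name": "node2", "Healthy": true, "Voter": false}
  ]
}
"""
print(summarize_health(sample))
```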
|
||||
|
||||
## Summary
|
||||
|
||||
In this guide we configured most of the Autopilot features: dead server cleanup, server
|
||||
stabilization, redundancy zone tags, upgrade migration, and upgrade version tag.
|
||||
|
||||
To learn more about the Autopilot settings we did not configure,
|
||||
[last_contact_threshold](https://www.consul.io/docs/agent/options.html#last_contact_threshold)
|
||||
and [max_trailing_logs](https://www.consul.io/docs/agent/options.html#max_trailing_logs),
|
||||
either read the agent configuration documentation or use the help flag with the
|
||||
operator autopilot command: `consul operator autopilot set-config -h`.
|
|
@ -1,89 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Datacenter Backups
|
||||
description: >-
|
||||
Consul provides the snapshot tool for backing up and restoring data. In this
|
||||
guide you will learn how to use both.
|
||||
---
|
||||
|
||||
# Datacenter Backups
|
||||
|
||||
Creating datacenter backups is an important step in production deployments. Backups provide a mechanism for the Consul server to recover from an outage (network loss, operator error, or a corrupted data directory). All servers write to the `-data-dir` before commit on write requests. The same directory is used on client agents to persist local state too, but this is not critical and can be rebuilt when recreating an agent. Local client state is not backed up in this guide and doesn't need to be in general, only the server's Raft store state.
|
||||
|
||||
Consul provides the [snapshot](https://consul.io/docs/commands/snapshot.html) command which can be run using the CLI or the API. The `snapshot` command saves a point-in-time snapshot of the state of the Consul servers which includes, but is not limited to:
|
||||
|
||||
- KV entries
|
||||
- the service catalog
|
||||
- prepared queries
|
||||
- sessions
|
||||
- ACLs
|
||||
|
||||
With [Consul Enterprise](/docs/commands/snapshot/agent), the `snapshot agent` command runs periodically and writes to local or remote storage (such as Amazon S3).
|
||||
|
||||
By default, all snapshots are taken using `consistent` mode where requests are forwarded to the leader which verifies that it is still in power before taking the snapshot. Snapshots will not be saved if the datacenter is degraded or if no leader is available. To reduce the burden on the leader, it is possible to [run the snapshot](/docs/commands/snapshot/save) on any non-leader server using `stale` consistency mode.
|
||||
|
||||
This spreads the load across nodes at the possible expense of losing full consistency guarantees. Typically this means that a very small number of recent writes may not be included. The omitted writes are typically limited to data written in the last `100ms` or less from the recovery point. This is usually suitable for disaster recovery. However, the system can’t guarantee how stale this may be if executed against a partitioned server.
|
||||
|
||||
## Create Your First Backup
|
||||
|
||||
The `snapshot save` command for backing up the datacenter state has many configuration options. In a production environment, you will want to configure ACL tokens and client certificates for security. The configuration options also allow you to specify the datacenter and server to collect the backup data from. Below are several examples.
|
||||
|
||||
First, we will run the basic snapshot command on one of our servers with all the defaults, including `consistent` mode.
|
||||
|
||||
```shell
|
||||
$ consul snapshot save backup.snap
|
||||
Saved and verified snapshot to index 1176
|
||||
```
|
||||
|
||||
The backup will be saved locally in the directory where we ran the command.
|
||||
|
||||
You can view metadata about the backup with the `inspect` subcommand.
|
||||
|
||||
```shell
|
||||
$ consul snapshot inspect backup.snap
|
||||
ID 2-1182-1542056499724
|
||||
Size 4115
|
||||
Index 1182
|
||||
Term 2
|
||||
Version 1
|
||||
```
|
||||
|
||||
To understand each field review the inspect [documentation](https://www.consul.io/docs/commands/snapshot/inspect.html). Notably, the `Version` field does not correspond to the version of the data. Rather it is the snapshot format version.
|
||||
|
||||
Next, let’s collect the datacenter data from a non-leader server by specifying stale mode.
|
||||
|
||||
```shell
|
||||
$ consul snapshot save -stale backup.snap
|
||||
Saved and verified snapshot to index 2276
|
||||
```
|
||||
|
||||
Once ACLs and agent certificates are configured, they can be passed in as environment variables or flags.
|
||||
|
||||
```shell
|
||||
$ export CONSUL_HTTP_TOKEN=<your ACL token>
|
||||
$ consul snapshot save -stale -ca-file=</path/to/file> backup.snap
|
||||
Saved and verified snapshot to index 2287
|
||||
```
|
||||
|
||||
In the above example, we set the token as an environment variable and passed the CA file with a command line flag.
|
||||
|
||||
For production use, the `snapshot save` command or [API](https://www.consul.io/api/snapshot.html) should be scripted and run frequently. In addition to frequently backing up the datacenter state, there are several use cases when you would also want to manually execute `snapshot save`. First, you should always back up the datacenter before upgrading. If the upgrade does not go according to plan, it is often not possible to downgrade due to changes in the state store format. Restoring from a backup is then the only option, so taking one before the upgrade will ensure you have the latest data. Second, if the datacenter loses quorum, it may be beneficial to save the state before the servers become divergent. Finally, you can manually snapshot a datacenter and use that to bootstrap a new datacenter with the same state.
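A scripted backup can be as small as a wrapper that builds the `snapshot save` command and gives each file a timestamped name. The Python sketch below uses only the `-stale` and `-ca-file` flags shown earlier in this guide; the function names and the file naming scheme are hypothetical, and scheduling (cron, systemd timers, and so on) is left out.

```python
import subprocess
from datetime import datetime, timezone


def snapshot_command(path, stale=False, ca_file=None):
    """Build the argument list for `consul snapshot save`."""
    command = ["consul", "snapshot", "save"]
    if stale:
        command.append("-stale")
    if ca_file:
        command.append(f"-ca-file={ca_file}")
    command.append(path)
    return command


def snapshot_filename(now):
    """Timestamped file name so repeated backups do not overwrite each other."""
    return now.strftime("backup-%Y%m%dT%H%M%SZ.snap")


def save_snapshot(stale=True, ca_file=None):
    """Run the backup; expects the consul binary and CONSUL_HTTP_TOKEN to be set up."""
    path = snapshot_filename(datetime.now(timezone.utc))
    subprocess.run(snapshot_command(path, stale, ca_file), check=True)
    return path


print(snapshot_command("backup.snap", stale=True))
```

Passing `check=True` makes the script fail loudly if the snapshot cannot be saved, which matters for unattended runs.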
|
||||
|
||||
Operationally, the backup process does not need to be executed on every server. Additionally, you can use the configuration options to save the backups to a mounted filesystem. The mounted filesystem can even be cloud storage, such as Amazon S3. The enterprise command `snapshot agent` automates this process.
|
||||
|
||||
## Restore from Backup
|
||||
|
||||
Running the `restore` process should be straightforward. However, there are a couple of actions you can take to ensure the process goes smoothly. First, make sure the datacenter you are restoring is stable and has a leader. You can see this using `consul operator raft list-peers` and checking server logs and telemetry for signs of leader elections or network issues.
|
||||
|
||||
You will only need to run the process once, on the leader. The Raft consensus protocol ensures that all servers restore the same state.
|
||||
|
||||
```shell
|
||||
$ consul snapshot restore backup.snap
|
||||
Restored snapshot
|
||||
```
|
||||
|
||||
Like the `save` subcommand, restore has many configuration options. In production, you would again want to use ACLs and certificates for security.
|
||||
|
||||
## Summary
|
||||
|
||||
In this guide, we learned about the `snapshot save` and `snapshot restore` commands. If you are testing the backup and restore process, you can add an extra dummy value to Consul KV. Another indicator that the backup was saved correctly is the size of the backup artifact.
|
|
@ -1,263 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Consul Cluster Monitoring & Metrics
|
||||
description: >-
|
||||
After setting up your first datacenter, it is an ideal time to make sure your
|
||||
cluster is healthy and establish a baseline.
|
||||
---
|
||||
|
||||
# Consul Cluster Monitoring and Metrics
|
||||
|
||||
After setting up your first datacenter, it is an ideal time to make sure your cluster is healthy and establish a baseline. This guide will cover several types of metrics in two sections: Consul health and server health.
|
||||
|
||||
**Consul health**:
|
||||
|
||||
- Transaction timing
|
||||
- Leadership changes
|
||||
- Autopilot
|
||||
- Garbage collection
|
||||
|
||||
**Server health**:
|
||||
|
||||
- File descriptors
|
||||
- CPU usage
|
||||
- Network activity
|
||||
- Disk activity
|
||||
- Memory usage
|
||||
|
||||
For each type of metric, we will review their importance and help identify when a metric is indicating a healthy or unhealthy state.
|
||||
|
||||
First, we need to understand the three methods for collecting metrics. We will briefly cover using SIGUSR1, the HTTP API, and telemetry.
|
||||
|
||||
Before starting this guide, we recommend configuring [ACLs](/docs/guides/acl).
|
||||
|
||||
## How to Collect Metrics
|
||||
|
||||
There are three methods for collecting metrics. The first, and simplest, is to use `SIGUSR1` for a one-time dump of current telemetry values. The second method is to get a similar one-time dump using the HTTP API. The third method, and the one most commonly used for long-term monitoring, is to enable telemetry in the Consul configuration file.
|
||||
|
||||
### SIGUSR1 for Local Use
|
||||
|
||||
To get a one-time dump of current metric values, we can send the `SIGUSR1` signal to the Consul process.
|
||||
|
||||
```shell
|
||||
$ kill -USR1 <process_id>
|
||||
```
|
||||
|
||||
This will send the output to the system logs, such as `/var/log/messages` or to `journald`. If you are monitoring the Consul process in the terminal via `consul monitor`, you will see the metrics in the output.
|
||||
|
||||
Although this is the easiest way to get a quick read of a single Consul agent’s health, it is much more useful to look at how the values change over time.
|
||||
|
||||
### API GET Request
|
||||
|
||||
Next let’s use the HTTP API to quickly collect metrics with curl.
|
||||
|
||||
```shell
|
||||
$ curl http://127.0.0.1:8500/v1/agent/metrics
|
||||
```
|
||||
|
||||
In production you will want to set up credentials with an ACL token and [enable TLS](/docs/agent/encryption) for secure communications. Once ACLs have been configured, you can pass a token with the request.
|
||||
|
||||
```shell
|
||||
$ curl \
|
||||
--header "X-Consul-Token: <YOUR_ACL_TOKEN>" \
|
||||
https://127.0.0.1:8500/v1/agent/metrics
|
||||
```
|
||||
|
||||
In addition to being a good way to quickly collect metrics, it can be added to a script or it can be used with monitoring agents that support HTTP scraping, such as Prometheus, to visualize the data.
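
As a rough illustration of scripting against this endpoint, the sketch below pulls a single gauge out of the JSON response. It assumes `jq` is installed and that the response carries a top-level `Gauges` array whose entries have `Name` and `Value` fields. The `gauge_value` helper name is hypothetical.

```shell
# gauge_value NAME: read /v1/agent/metrics JSON on stdin and print the value
# of the gauge called NAME (prints nothing if no such gauge is present).
# Assumes jq is available on the host.
gauge_value() {
  jq -r --arg name "$1" '.Gauges[] | select(.Name == $name) | .Value'
}

# Example usage against a local agent (not run here):
#   curl -s http://127.0.0.1:8500/v1/agent/metrics | gauge_value consul.runtime.alloc_bytes
```

Keeping the HTTP call outside the helper means the same function can be used with or without an ACL token and with either `http` or `https`.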
|
||||
|
||||
### Enable Telemetry
|
||||
|
||||
Finally, Consul can be configured to send telemetry data to a remote monitoring system. This allows you to monitor the health of agents over time, spot trends, and plan for future needs. You will need a monitoring agent and console for this.
|
||||
|
||||
Consul supports the following telemetry agents:
|
||||
|
||||
- Circonus
|
||||
- DataDog (via `dogstatsd`)
|
||||
- StatsD (via `statsd`, `statsite`, `telegraf`, etc.)
|
||||
|
||||
If you are using StatsD, you will also need a compatible database and server, such as Grafana, Chronograf, or Prometheus.
|
||||
|
||||
Telemetry can be enabled in the agent configuration file, for example `server.hcl`. Telemetry can be enabled on any agent, client or server. Normally, you would at least enable it on all the servers (both voting and non-voting) to monitor the health of the entire cluster.
|
||||
|
||||
An example snippet of `server.hcl` to send telemetry to DataDog looks like this:
|
||||
|
||||
```hcl
telemetry {
  dogstatsd_addr = "localhost:8125"
  disable_hostname = true
}
```
|
||||
|
||||
When enabling telemetry on an existing cluster, the Consul process will need to be reloaded. This can be done with `consul reload` or `kill -HUP <process_id>`. It is recommended to reload the servers one at a time, starting with the non-leaders.
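
One way to work out a "non-leaders first" order is to read it from the Raft peer list. The sketch below is only an illustration: it assumes the input is the tabular output of `consul operator raft list-peers` with the server state in the fourth column, and the `reload_order` helper name is hypothetical. It prints the order only; it does not reload anything.

```shell
# reload_order: read `consul operator raft list-peers` output on stdin and
# print node names with followers first and the leader last.
# Assumption: a header row, then rows of Node, ID, Address, State, ...
reload_order() {
  awk '
    NR > 1 && $4 != "leader" { print $1 }     # followers, in listed order
    NR > 1 && $4 == "leader" { leader = $1 }  # remember the leader for last
    END { if (leader != "") print leader }
  '
}

# Example usage (not run here):
#   consul operator raft list-peers | reload_order
```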
|
||||
|
||||
## Consul Health
|
||||
|
||||
The Consul health metrics reveal information about the Consul cluster. They include performance metrics for the key value store, transactions, raft, leadership changes, autopilot tuning, and garbage collection.
|
||||
|
||||
### Transaction Timing
|
||||
|
||||
The following metrics indicate how long it takes to complete write operations
|
||||
in various parts of the Consul cluster, including Consul KV and Raft on the Consul servers. Generally, these values should remain reasonably consistent and be no more than a few milliseconds each.
|
||||
|
||||
| Metric Name | Description |
|
||||
| :----------------------- | :------------------------------------------------------------------------------ |
|
||||
| `consul.kvs.apply` | Measures the time it takes to complete an update to the KV store. |
|
||||
| `consul.txn.apply` | Measures the time spent applying a transaction operation. |
|
||||
| `consul.raft.apply` | Counts the number of Raft transactions occurring over the interval. |
|
||||
| `consul.raft.commitTime` | Measures the time it takes to commit a new entry to the Raft log on the leader. |
|
||||
|
||||
Sudden changes in any of the timing values could be due to unexpected load on the Consul servers or due to problems on the hosts themselves. Specifically, if any of these metrics deviate more than 50% from the baseline over the previous hour, this indicates an issue. Below are examples of healthy transaction metrics.
|
||||
|
||||
```shell
|
||||
'consul.raft.apply': Count: 1 Sum: 1.000 LastUpdated: 2018-11-16 10:55:03.673805766 -0600 CST m=+97598.238246167
|
||||
'consul.raft.commitTime': Count: 1 Sum: 0.017 LastUpdated: 2018-11-16 10:55:03.673840104 -0600 CST m=+97598.238280505
|
||||
```
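
The 50% rule of thumb can be expressed as a small check. This is a minimal sketch, assuming you already have a current value and an hourly baseline from your monitoring system; the `deviates` helper name and its argument order are hypothetical.

```shell
# deviates CURRENT BASELINE [THRESHOLD_PCT]: succeed (exit 0) when CURRENT
# differs from BASELINE by more than THRESHOLD_PCT percent (default 50).
deviates() {
  awk -v cur="$1" -v base="$2" -v pct="${3:-50}" 'BEGIN {
    # Any change away from a zero baseline counts as a deviation.
    if (base == 0) exit (cur != 0 ? 0 : 1)
    diff = cur - base
    if (diff < 0) diff = -diff
    exit (diff / base * 100 > pct ? 0 : 1)
  }'
}

# Example: compare a commitTime sample against its baseline.
if deviates 0.030 0.017; then
  echo "consul.raft.commitTime deviates more than 50% from baseline"
fi
```

The check is symmetric, so a sudden drop is flagged as well as a sudden rise.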
|
||||
|
||||
### Leadership Changes
|
||||
|
||||
In a healthy environment, your Consul cluster should have a stable leader. There shouldn’t be any leadership changes unless you manually change leadership (by taking a server out of the cluster, for example). If there are unexpected elections or leadership changes, you should investigate possible network issues between the Consul servers. Another possible cause could be that the Consul servers are unable to keep up with the transaction load.
|
||||
|
||||
Note: These metrics are reported by the follower nodes, not by the leader.
|
||||
|
||||
| Metric Name | Description |
|
||||
| :------------------------------- | :------------------------------------------------------------------------------------------------------------- |
|
||||
| `consul.raft.leader.lastContact` | Measures the time since the leader was last able to contact the follower nodes when checking its leader lease. |
|
||||
| `consul.raft.state.candidate` | Increments when a Consul server starts an election process. |
|
||||
| `consul.raft.state.leader` | Increments when a Consul server becomes a leader. |
|
||||
|
||||
If the `candidate` or `leader` metrics are greater than 0 or the `lastContact` metric is greater than 200ms, you should look into one of the possible causes described above. Below are examples of healthy leadership metrics.
|
||||
|
||||
```shell
|
||||
'consul.raft.leader.lastContact': Count: 4 Min: 10.000 Mean: 31.000 Max: 50.000 Stddev: 17.088 Sum: 124.000 LastUpdated: 2018-12-17 22:06:08.872973122 +0000 UTC m=+3553.639379498
|
||||
'consul.raft.state.leader': Count: 1 Sum: 1.000 LastUpdated: 2018-12-17 22:05:49.104580236 +0000 UTC m=+3533.870986584
|
||||
'consul.raft.state.candidate': Count: 1 Sum: 1.000 LastUpdated: 2018-12-17 22:05:49.097186444 +0000 UTC m=+3533.863592815
|
||||
```
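
To show how the 200ms guideline might be checked against a metrics dump, the sketch below extracts the `Max` field from a `lastContact` line in the format shown above. Comparing `Max` rather than `Mean` is our assumption, and both helper names are hypothetical.

```shell
# last_contact_max: read metric dump lines on stdin and print the Max value
# (in milliseconds) reported for consul.raft.leader.lastContact.
last_contact_max() {
  awk '/consul\.raft\.leader\.lastContact/ {
    for (i = 1; i < NF; i++) if ($i == "Max:") print $(i + 1)
  }'
}

# last_contact_ok MAX_MS: succeed when MAX_MS is at or below 200 ms.
last_contact_ok() {
  awk -v max="$1" 'BEGIN { exit (max <= 200 ? 0 : 1) }'
}
```

For the sample line above, `last_contact_max` prints `50.000`, which `last_contact_ok` accepts.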
|
||||
|
||||
### Autopilot
|
||||
|
||||
The autopilot metric is a boolean. A value of 1 indicates a healthy cluster and 0 indicates an unhealthy state.
|
||||
|
||||
| Metric Name | Description |
|
||||
| :------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `consul.autopilot.healthy` | Tracks the overall health of the local server cluster. If all servers are considered healthy by autopilot, this will be set to 1. If any are unhealthy, this will be 0. |
|
||||
|
||||
An alert should be set up for a returned value of 0. Below is an example of a healthy cluster according to the autopilot metric.
|
||||
|
||||
```shell
|
||||
[2018-12-17 13:03:40 -0500 EST][G] 'consul.autopilot.healthy': 1.000
|
||||
```
|
||||
|
||||
### Garbage Collection
|
||||
|
||||
Garbage collection (GC) pauses are "stop-the-world" events: all runtime threads are blocked until GC completes. In a healthy environment these pauses should only last a few nanoseconds. If memory usage is high, the Go runtime may start the GC process so frequently that it will slow down Consul. You might observe more frequent leader elections or longer write times.
|
||||
|
||||
| Metric Name | Description |
|
||||
| :--------------------------------- | :---------------------------------------------------------------------------------------------------- |
|
||||
| `consul.runtime.total_gc_pause_ns` | Number of nanoseconds consumed by stop-the-world garbage collection (GC) pauses since Consul started. |
|
||||
|
||||
If the value returned is more than 2 seconds per minute, you should start investigating the cause. If it exceeds 5 seconds per minute, you should consider the cluster to be in a critical state, make sure failure recovery procedures are up to date, and continue investigating. Below is an example of a healthy GC pause.
|
||||
|
||||
```shell
|
||||
'consul.runtime.total_gc_pause_ns': 136603664.000
|
||||
```
|
||||
|
||||
Note, `total_gc_pause_ns` is a cumulative counter, so in order to calculate rates, such as GC/minute, you will need to apply a function such as [non_negative_difference](https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#non-negative-difference).
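
If your monitoring system does not provide such a function, the same calculation can be done by hand from two samples. The following is a minimal sketch, assuming two readings of `consul.runtime.total_gc_pause_ns` taken a known number of seconds apart; the `gc_pause_status` helper name is hypothetical.

```shell
# gc_pause_status PREV_NS CURR_NS INTERVAL_SECONDS
# Prints "ok", "warning" (more than 2 s of pause per minute) or
# "critical" (more than 5 s of pause per minute).
gc_pause_status() {
  awk -v prev="$1" -v curr="$2" -v secs="$3" 'BEGIN {
    delta = curr - prev
    if (delta < 0) delta = 0                   # counter reset, e.g. agent restart
    per_min = delta / 1000000000 * 60 / secs   # seconds of GC pause per minute
    if (per_min > 5) print "critical"
    else if (per_min > 2) print "warning"
    else print "ok"
  }'
}

# Example: two samples taken 60 seconds apart.
gc_pause_status 136603664 136703664 60
```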
|
||||
|
||||
## Server Health
|
||||
|
||||
The server metrics provide information about the health of your cluster including file handles, CPU usage, network activity, disk activity, and memory usage.
|
||||
|
||||
### File Descriptors
|
||||
|
||||
The majority of Consul operations require a file descriptor handle, including receiving a connection from another host, sending data between servers, and writing snapshots to disk. If Consul runs out of handles, it will stop accepting connections.
|
||||
|
||||
| Metric Name | Description |
|
||||
| :------------------------- | :------------------------------------------------------------------ |
|
||||
| `linux_sysctl_fs.file-nr` | Number of file handles being used across all processes on the host. |
|
||||
| `linux_sysctl_fs.file-max` | Total number of available file handles. |
|
||||
|
||||
By default, process and kernel limits are conservative, so you may want to increase the limits beyond the defaults. If the `linux_sysctl_fs.file-nr` value exceeds 80% of `linux_sysctl_fs.file-max`, the file handle limit should be increased. Below is an example of a file handle metric.
|
||||
|
||||
```shell
|
||||
linux_sysctl_fs, host=statsbox, file-nr=768i, file-max=96763i
|
||||
```
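
The 80% guideline is a simple ratio of the two values. The sketch below shows one way to compute it; the helper names are hypothetical, and reading `/proc/sys/fs/file-nr` directly is an assumption that only applies to Linux hosts.

```shell
# fd_usage_pct USED MAX: print file handle usage as a percentage (one decimal).
fd_usage_pct() {
  awk -v used="$1" -v max="$2" 'BEGIN { printf "%.1f\n", used / max * 100 }'
}

# fd_over_limit USED MAX [LIMIT_PCT]: succeed when usage exceeds LIMIT_PCT
# percent (default 80).
fd_over_limit() {
  awk -v used="$1" -v max="$2" -v lim="${3:-80}" \
    'BEGIN { exit (used / max * 100 > lim ? 0 : 1) }'
}

# On Linux the kernel exposes the same counters (not run here):
#   read -r used _ max < /proc/sys/fs/file-nr
#   fd_usage_pct "$used" "$max"
```

With the sample values above (768 of 96763), usage is under 1%, well below the limit.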
|
||||
|
||||
### CPU Usage
|
||||
|
||||
Consul should not be demanding of CPU time on either servers or clients. A spike in CPU usage could indicate too many operations taking place at once.
|
||||
|
||||
| Metric Name | Description |
|
||||
| :--------------- | :------------------------------------------------------------------------ |
|
||||
| `cpu.user_cpu` | Percentage of CPU being used by user processes (such as Vault or Consul). |
|
||||
| `cpu.iowait_cpu` | Percentage of CPU time spent waiting for I/O tasks to complete. |
|
||||
|
||||
If `cpu.iowait_cpu` is greater than 10%, it should be considered critical as Consul is waiting for data to be written to disk. This could be a sign that Raft is writing snapshots to disk too often. Below is an example of a healthy CPU metric.
|
||||
|
||||
```shell
|
||||
cpu, cpu=cpu-total, usage_idle=99.298, usage_user=0.400, usage_system=0.300, usage_iowait=0, usage_steal=0
|
||||
```
|
||||
|
||||
### Network Activity
|
||||
|
||||
Network activity should be consistent. A sudden spike in network traffic to Consul might be the result of a misconfigured client, such as Vault, that is causing too many requests.
|
||||
|
||||
Most agents will report separate metrics for each network interface, so be sure you are monitoring the right one.
|
||||
|
||||
| Metric Name | Description |
|
||||
| :--------------- | :------------------------------------------- |
|
||||
| `net.bytes_recv` | Bytes received on each network interface. |
|
||||
| `net.bytes_sent` | Bytes transmitted on each network interface. |
|
||||
|
||||
Sudden increases in the `net` metrics, greater than 50% deviation from baseline, indicate too many requests that are not being handled. Below is an example of a network activity metric.
|
||||
|
||||
```shell
|
||||
net, interface=enp0s5, bytes_sent=6183357i, bytes_recv=262313256i
|
||||
```
|
||||
|
||||
Note: The `net` metrics are counters, so in order to calculate rates, such as bytes/second,
|
||||
you will need to apply a function such as [non_negative_difference](https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#non-negative-difference).
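
For readers without such a function to hand, the idea can be sketched in a few lines: take two counter samples, ignore the pair if the counter went backwards, and divide the difference by the elapsed time. The `counter_rate` helper name is hypothetical.

```shell
# counter_rate PREV CURR INTERVAL_SECONDS
# Prints the per-second rate between two counter samples, or nothing when the
# counter went backwards (for example after a reboot), which mirrors the
# behaviour of a non-negative difference.
counter_rate() {
  awk -v prev="$1" -v curr="$2" -v secs="$3" 'BEGIN {
    delta = curr - prev
    if (delta >= 0) printf "%.1f\n", delta / secs
  }'
}

# Example: bytes_sent sampled twice, 10 seconds apart.
counter_rate 6183357 6193357 10
```

The same helper applies to any cumulative counter, including the `diskio` byte counters.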
|
||||
|
||||
### Disk Activity
|
||||
|
||||
Normally, there is low disk activity, because Consul keeps everything in memory. If the Consul host is writing a large amount of data to disk, it could mean that Consul is under heavy write load and consequently is checkpointing Raft snapshots to disk frequently. It could also mean that debug/trace logging has accidentally been enabled in production, which can impact performance.
|
||||
|
||||
| Metric Name | Description |
|
||||
| :------------------- | :-------------------------------------------------------- |
|
||||
| `diskio.read_bytes` | Bytes read from each block device. |
|
||||
| `diskio.write_bytes` | Bytes written to each block device. |
|
||||
| `diskio.read_time` | Time spent reading from disk, in cumulative milliseconds. |
|
||||
| `diskio.write_time` | Time spent writing to disk, in cumulative milliseconds. |
|
||||
|
||||
Sudden, large changes to the `diskio` metrics, greater than 50% deviation from baseline or more than 3 standard deviations from baseline, indicate that Consul has too much disk I/O. Too much disk I/O can cause the rest of the system to slow down or become unavailable, as the kernel spends all its time waiting for I/O to complete. Below are examples of disk activity metrics.
|
||||
|
||||
```shell
|
||||
diskio, name=sda5, read_bytes=522298368i, write_bytes=1726865408i, read_time=7248i, write_time=133364i
|
||||
```
|
||||
|
||||
Note: The `diskio` metrics are counters, so in order to calculate rates (such as bytes/second), you will need to apply a function such as [non_negative_difference](https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#non-negative-difference).
|
||||
|
||||
### Memory Usage
|
||||
|
||||
As noted previously, Consul keeps all of its data (the KV store, the catalog, and so on) in memory. If Consul consumes all available memory, it will crash. You should monitor total available RAM to make sure some RAM is available for other system processes. Swap usage should remain at 0% for best performance.
|
||||
|
||||
| Metric Name | Description |
|
||||
| :--------------------------- | :------------------------------------------------------------- |
|
||||
| `consul.runtime.alloc_bytes` | Measures the number of bytes allocated by the Consul process. |
|
||||
| `consul.runtime.sys_bytes` | The total number of bytes of memory obtained from the OS. |
|
||||
| `mem.total` | Total amount of physical memory (RAM) available on the server. |
|
||||
| `mem.used_percent` | Percentage of physical memory in use. |
|
||||
| `swap.used_percent` | Percentage of swap space in use. |
|
||||
|
||||
Consul servers are running low on memory if `consul.runtime.sys_bytes` exceeds 90% of `mem.total`, `mem.used_percent` is over 90%, or `swap.used_percent` is greater than 0. You should increase the memory available to Consul if any of these three conditions is met. Below are examples of memory usage metrics.
|
||||
|
||||
```shell
|
||||
'consul.runtime.alloc_bytes': 11199928.000
|
||||
'consul.runtime.sys_bytes': 24627448.000
|
||||
mem, used_percent=31.492, total=1036312576i
|
||||
swap, used_percent=1.343
|
||||
```
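
The three memory conditions can be combined into a single check. The sketch below is illustrative only: it assumes the "total" that `sys_bytes` is compared against is `mem.total`, and the `memory_warnings` helper name is hypothetical.

```shell
# memory_warnings SYS_BYTES MEM_TOTAL MEM_USED_PCT SWAP_USED_PCT
# Prints one line per condition that is met; prints nothing when all are fine.
memory_warnings() {
  awk -v sys="$1" -v total="$2" -v used="$3" -v swap="$4" 'BEGIN {
    if (sys > 0.9 * total) print "sys_bytes above 90% of total memory"
    if (used > 90)         print "mem.used_percent above 90"
    if (swap > 0)          print "swap in use"
  }'
}

# Example with the sample values shown above.
memory_warnings 24627448 1036312576 31.492 1.343
```

With those sample values only the swap condition fires, because `swap.used_percent` is 1.343.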
|
||||
|
||||
## Summary
|
||||
|
||||
In this guide we reviewed the three methods for collecting metrics. SIGUSR1 and agent HTTP API are both quick methods for collecting metrics, but enabling telemetry is the best method for moving data into monitoring software. Additionally, we outlined the various metrics collected and their significance.
|
|
@ -1,252 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Using Envoy with Connect
|
||||
description: This guide walks through getting started running Envoy as a Connect Proxy.
|
||||
---
|
||||
|
||||
# Using Connect with Envoy Proxy
|
||||
|
||||
Consul Connect has first class support for using
|
||||
[Envoy](https://www.envoyproxy.io/) as a proxy. This guide will describe how to
|
||||
set up a development-mode Consul server and two services that use Envoy proxies
|
||||
on a single machine with [Docker](https://www.docker.com/). The aim of this
|
||||
guide is to demonstrate a minimal working setup and the moving parts involved;
|
||||
it is not intended for production deployments.
|
||||
|
||||
For reference documentation on how the integration works and is configured,
|
||||
please see our [Envoy documentation](/docs/connect/proxies/envoy).
|
||||
|
||||
## Setup Overview
|
||||
|
||||
We'll start all containers using Docker's `host` network mode and will have a
|
||||
total of five containers running by the end of this guide.
|
||||
|
||||
1. A single Consul server
|
||||
2. An example TCP `echo` service as a destination
|
||||
3. An Envoy sidecar proxy for the `echo` service
|
||||
4. An Envoy sidecar proxy for the `client` service
|
||||
5. An example `client` service (netcat)
|
||||
|
||||
We choose to run in Docker since Envoy is only distributed as a Docker image so
|
||||
it's the quickest way to get a demo running. The same commands used here will
|
||||
work in just the same way outside of Docker if you build an Envoy binary
|
||||
yourself.
|
||||
|
||||
## Building an Envoy Image
|
||||
|
||||
Starting Envoy requires a bootstrap configuration file that points Envoy to the
|
||||
local agent for discovering the rest of its configuration. The Consul binary
|
||||
includes the [`consul connect envoy` command](/docs/commands/connect/envoy)
|
||||
which can generate the bootstrap configuration for Envoy and optionally run it
|
||||
directly.
|
||||
|
||||
Envoy's official Docker image can be used with Connect directly; however, it
|
||||
requires some additional steps to generate bootstrap configuration and inject it
|
||||
into the container.
|
||||
|
||||
Instead, we'll use Docker multi-stage builds (added in version 17.05) to make a
|
||||
local image that has both `envoy` and `consul` binaries.
|
||||
|
||||
First, create a `Dockerfile` containing the following:
|
||||
|
||||
```dockerfile
|
||||
FROM consul:latest
|
||||
FROM envoyproxy/envoy:v1.10.0
|
||||
COPY --from=0 /bin/consul /bin/consul
|
||||
ENTRYPOINT ["dumb-init", "consul", "connect", "envoy"]
|
||||
```
|
||||
|
||||
This takes the Consul binary from the latest release image and copies it into a
|
||||
new image based on the official Envoy image.
|
||||
|
||||
This can be built locally with:
|
||||
|
||||
```shell
|
||||
docker build -t consul-envoy .
|
||||
```
|
||||
|
||||
We will use the `consul-envoy` image we just made to configure and run Envoy
|
||||
processes later.
|
||||
|
||||
## Deploying a Consul Server
|
||||
|
||||
Next we need a Consul server. We'll work with a single Consul server in `-dev`
|
||||
mode for simplicity.
|
||||
|
||||
-> **Note:** `-dev` mode enables the gRPC server on port 8502 by default. For a
|
||||
production agent you'll need to [explicitly configure the gRPC
|
||||
port](/docs/agent/options#grpc_port).
|
||||
|
||||
In order to start a proxy instance, a [proxy service
|
||||
definition](/docs/connect/proxies) must exist on the local Consul agent.
|
||||
We'll create one using the [sidecar service
|
||||
registration](/docs/connect/proxies/sidecar-service) syntax.
|
||||
|
||||
Create a configuration file called `envoy_demo.hcl` containing the following
|
||||
service definitions.
|
||||
|
||||
```hcl
|
||||
services {
|
||||
name = "client"
|
||||
port = 8080
|
||||
connect {
|
||||
sidecar_service {
|
||||
proxy {
|
||||
upstreams {
|
||||
destination_name = "echo"
|
||||
local_bind_port = 9191
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
services {
|
||||
name = "echo"
|
||||
port = 9090
|
||||
connect {
|
||||
sidecar_service {}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The Consul container can now be started with that configuration.
|
||||
|
||||
```shell
|
||||
$ docker run --rm -d -v$(pwd)/envoy_demo.hcl:/etc/consul/envoy_demo.hcl \
|
||||
--network host --name consul-agent consul:latest \
|
||||
agent -dev -config-file /etc/consul/envoy_demo.hcl
|
||||
1c90f7fcc83f5390332d7a4fdda2f1bf74cf62762de9ea2f67cd5a09c0573641
|
||||
```
|
||||
|
||||
Running with `-d` like this puts the container into the background so we can
|
||||
continue in the same terminal. Log output can be seen using the name we gave.
|
||||
|
||||
```shell
|
||||
docker logs -f consul-agent
|
||||
```
|
||||
|
||||
Note that the Consul server has registered two services `client` and `echo`, but
|
||||
also registered two proxies `client-sidecar-proxy` and `echo-sidecar-proxy`.
|
||||
Next we'll need to run those services and proxies.
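
If you prefer to confirm the registrations through the HTTP API rather than the logs, a sketch along these lines may help. It assumes `jq` is installed and that `/v1/agent/services` returns a JSON object keyed by service ID; the `service_ids` helper name is hypothetical.

```shell
# service_ids: read /v1/agent/services JSON on stdin and print each service ID,
# one per line, in sorted order. Assumes jq is available on the host.
service_ids() {
  jq -r 'keys[]'
}

# Example usage against the dev agent started above (not run here):
#   curl -s http://127.0.0.1:8500/v1/agent/services | service_ids
```

For this demo we would expect to see `client`, `client-sidecar-proxy`, `echo`, and `echo-sidecar-proxy` in the list.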
|
||||
|
||||
## Running the Echo Service
|
||||
|
||||
Next we'll run the `echo` service. We can use an existing TCP echo utility image
|
||||
for this.
|
||||
|
||||
Start the echo service on port 9090 as registered before.
|
||||
|
||||
```shell
|
||||
$ docker run -d --network host abrarov/tcp-echo --port 9090
|
||||
1a0b0c569016d00aadc4fc2b2954209b32b510966083f2a9e17d3afc6d185d87
|
||||
```
|
||||
|
||||
## Running the Proxies
|
||||
|
||||
We can now run "sidecar" proxy instances.
|
||||
|
||||
```shell
|
||||
$ docker run --rm -d --network host --name echo-proxy \
|
||||
consul-envoy -sidecar-for echo
|
||||
3f213a3cf9b7583a194dd0507a31e0188a03fc1b6e165b7f9336b0b1bb2baccb
|
||||
$ docker run --rm -d --network host --name client-proxy \
|
||||
consul-envoy -sidecar-for client -admin-bind localhost:19001
|
||||
d8399b54ee0c1f67d729bc4c8b6e624e86d63d2d9225935971bcb4534233012b
|
||||
```
|
||||
|
||||
The `-admin-bind` flag on the second proxy command is needed because both
|
||||
proxies are running on the host network and so can't bind to the same port for
|
||||
their admin API (which cannot be disabled).
|
||||
|
||||
Again we can see the output using docker logs. To see more verbose information
|
||||
from Envoy you can add `-- -l debug` to the end of the commands above. This
|
||||
passes the `-l` (log level) option directly through to Envoy. With debug level
|
||||
logs you should see the config being delivered to the proxy in the output.
|
||||
|
||||
The [`consul connect envoy` command](/docs/commands/connect/envoy) here is
|
||||
connecting to the local agent, getting the proxy configuration from the proxy
|
||||
service registration and generating the required Envoy bootstrap configuration
|
||||
before `exec`ing the envoy binary directly to run it with the generated
|
||||
configuration.
|
||||
|
||||
Envoy uses the bootstrap configuration to connect to the local agent directly
|
||||
via gRPC and use its xDS protocol to retrieve the actual configuration for
|
||||
listeners, TLS certificates, upstream service instances and so on. The xDS API
|
||||
allows the Envoy instance to watch for any changes so certificate rotations or
|
||||
changes to the upstream service instances are immediately sent to the proxy.
|
||||
|
||||
## Running the Client Service
|
||||
|
||||
Finally, we can see the connectivity by running a dummy "client" service. Rather
|
||||
than run a full service that itself can listen, we'll simulate the service with
|
||||
a simple netcat process that will only talk to the `client-sidecar-proxy` Envoy
|
||||
instance.
|
||||
|
||||
Recall that we configured the `client` sidecar with one declared "upstream"
|
||||
dependency (the `echo` service). In that declaration we also requested that the
|
||||
`echo` service should be exposed to the client on local port 9191.
|
||||
|
||||
This configuration causes the `client-sidecar-proxy` to start a TCP proxy
|
||||
listening on `localhost:9191` and proxying to the `echo` service. Importantly,
|
||||
the listener will use the correct `client` service mTLS certificate to authorize
|
||||
the connection. It discovers the IP addresses of instances of the echo service
|
||||
via Consul service discovery.
|
||||
|
||||
We can now see this working if we run netcat.
|
||||
|
||||
```shell
|
||||
$ docker run -ti --rm --network host gophernet/netcat localhost 9191
|
||||
Hello World!
|
||||
Hello World!
|
||||
^C
|
||||
```
|
||||
|
||||
## Testing Authorization
|
||||
|
||||
To demonstrate that Connect is controlling authorization for the echo service,
|
||||
we can add an explicit deny rule.
|
||||
|
||||
```shell
|
||||
$ docker run -ti --rm --network host consul:latest intention create -deny client echo
|
||||
Created: client => echo (deny)
|
||||
```
|
||||
|
||||
Now, new connections will be denied. Depending on a few factors, netcat may not
|
||||
see the connection being closed but will not get a response from the service.
|
||||
|
||||
```shell
|
||||
$ docker run -ti --rm --network host gophernet/netcat localhost 9191
|
||||
Hello?
|
||||
Anyone there?
|
||||
^C
|
||||
```
|
||||
|
||||
-> **Note:** Envoy will not currently re-authenticate already established TCP
|
||||
connections so if you still have the netcat terminal open from before, that will
|
||||
still be able to communicate with "echo". _New_ connections should be denied
|
||||
though.
|
||||
|
||||
Removing the intention restores connectivity.
|
||||
|
||||
```shell
|
||||
$ docker run -ti --rm --network host consul:latest intention delete client echo
|
||||
Intention deleted.
|
||||
$ docker run -ti --rm --network host gophernet/netcat localhost 9191
|
||||
Hello?
|
||||
Hello?
|
||||
^C
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
In this guide we walked through getting a minimal working example of two plain
|
||||
TCP processes communicating over mTLS using Envoy sidecars configured by
|
||||
Connect.
|
||||
|
||||
For more details on how the Envoy integration works, please see the [Envoy
|
||||
reference documentation](/docs/connect/proxies/envoy).
|
||||
|
||||
To see how to get Consul Connect working in different environments like
|
||||
Kubernetes see the [Connect Getting
|
||||
Started](/docs/connect#getting-started-with-connect) overview.
|
|
@ -1,184 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Connect in Production
|
||||
description: This guide describes best practices for running Consul Connect in production.
|
||||
---
|
||||
|
||||
# Running Connect in Production
|
||||
|
||||
Consul Connect can secure all inter-service communication with mutual TLS. It's
|
||||
designed to work with [minimal configuration out of the
|
||||
box](https://learn.hashicorp.com/consul/getting-started/connect), however, completing the [security
|
||||
checklist](/docs/connect/security) and understanding the [Consul security
|
||||
model](/docs/internals/security) are prerequisites for production
|
||||
deployments.
|
||||
|
||||
After completing this guide, you will be able to configure Connect to
|
||||
secure services. First, you will secure your Consul cluster with ACLs and
|
||||
TLS encryption. Next, you will configure Connect on the servers and host.
|
||||
Finally, you will configure your services to use Connect.
|
||||
|
||||
~> Note: To complete this guide you should already have a Consul cluster
|
||||
with an appropriate number of servers and
|
||||
clients deployed according to the other reference material including the
|
||||
[deployment](/docs/guides/deployment) and
|
||||
[performance](/docs/install/performance) guides.
|
||||
|
||||
The steps we need to get to a secure Connect cluster are:
|
||||
|
||||
1. [Configure ACLs](#configure-acls)
|
||||
1. [Configure Agent Transport Encryption](#configure-agent-transport-encryption)
|
||||
1. [Bootstrap Connect's Certificate Authority](#bootstrap-certificate-authority)
|
||||
1. [Setup Host Firewall](#setup-host-firewall)
|
||||
1. [Configure Service Instances](#configure-service-instances)
|
||||
|
||||
For existing Consul deployments, it may be necessary to incrementally adopt Connect
|
||||
service-by-service. In this case, step one and two should already be complete.
|
||||
However, we recommend reviewing all steps since the final deployment goal is to be compliant with all the security recommendations in this guide.
|
||||
|
||||
## Configure ACLs
|
||||
|
||||
Consul Connect's security is based on service identity. In practice, the identity
|
||||
of the service is only enforceable with sufficiently restrictive ACLs.
|
||||
|
||||
This section will not replace reading the full [ACL
|
||||
guide](/docs/guides/acl) but will highlight the specific requirements
|
||||
Connect relies on to ensure its security properties.
|
||||
|
||||
A service's identity, in the form of an x.509 certificate, will only be issued
|
||||
to an API client that has `service:write` permission for that service. In other
|
||||
words, any client that has permission to _register_ an instance of a service
|
||||
will be able to identify as that service and access all of the resources that the
|
||||
service is allowed to access.
|
||||
|
||||
A secure ACL setup must meet the following criteria.
|
||||
|
||||
1. **[ACL default
|
||||
policy](/docs/agent/options#acl_default_policy)
|
||||
must be `deny`.** If for any reason you cannot use the default policy of
|
||||
`deny`, you must add an explicit ACL denying anonymous `service:write`. Note, in this case the Connect intention graph will also default to
|
||||
`allow` and explicit `deny` intentions will be needed to restrict service
|
||||
access. Also note that explicit rules to limit who can manage intentions are
|
||||
necessary in this case. It is assumed for the remainder of this guide that
|
||||
ACL policy defaults to `deny`.
|
||||
2. **Each service must have a unique ACL token** that is restricted to
|
||||
`service:write` only for the named service. You can review the [Securing Consul with ACLs](https://learn.hashicorp.com/consul/advanced/day-1-operations/production-acls#apply-individual-tokens-to-the-services) guide for a
|
||||
service token example. Note, it is best practice for each instance to get a unique token as described below.
|
||||
|
||||
~> Individual Service Tokens: It is best practice to create a unique ACL token per service _instance_ because
|
||||
it limits the blast radius of a compromise. However, since Connect intentions manage access based only on service identity, it is
|
||||
possible to create only one ACL token per _service_ and share it between
|
||||
instances.
|
||||
|
||||
In practice, managing per-instance tokens requires automated ACL provisioning,
|
||||
for example using [HashiCorp's
|
||||
Vault](https://www.vaultproject.io/docs/secrets/consul).
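
To make the "`service:write` only for the named service" requirement concrete, the sketch below generates the rules for a single service. It is an illustration rather than a complete policy: the `service_write_rules` helper and the policy name in the usage comment are hypothetical, and a sidecar proxy registered under its own service name would likely need a matching rule.

```shell
# service_write_rules SERVICE: print ACL rules that grant write access to one
# named service and nothing else.
service_write_rules() {
  printf 'service "%s" {\n  policy = "write"\n}\n' "$1"
}

# Example usage (not run here; "web-service" is a placeholder policy name):
#   service_write_rules web | consul acl policy create -name web-service -rules -
service_write_rules web
```

Generating the rules from the service name keeps per-service policies uniform, which helps when tokens are provisioned automatically.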
|
||||
|
||||
## Configure Agent Transport Encryption
|
||||
|
||||
Consul's gossip (UDP) and RPC (TCP) communications need to be encrypted
|
||||
otherwise attackers may be able to see ACL tokens while in flight
|
||||
between the server and client agents (RPC) or between client agent and
|
||||
application (HTTP). Certificate private keys never leave the host they
|
||||
are used on but are delivered to the application or proxy over local
|
||||
HTTP so local agent traffic should be encrypted where potentially
|
||||
untrusted parties might be able to observe localhost agent API traffic.
|
||||
|
||||
Follow the [encryption guide](https://learn.hashicorp.com/consul/advanced/day-1-operations/agent-encryption) to ensure
|
||||
both gossip encryption and RPC/HTTP TLS are configured securely.
|
||||
|
||||
## Bootstrap Connect's Certificate Authority
|
||||
|
||||
Consul Connect comes with a built-in Certificate Authority (CA) that will
|
||||
bootstrap by default when you first [enable](https://www.consul.io/docs/agent/options.html#connect_enabled) Connect on your servers.
|
||||
|
||||
To use the built-in CA, enable it in the server's configuration.
|
||||
|
||||
```text
|
||||
connect {
|
||||
enabled = true
|
||||
}
|
||||
```
|
||||
|
||||
This configuration change requires a Consul server restart, which you can perform one server at a time
|
||||
to maintain availability in an existing cluster.
|
||||
|
||||
As soon as a server that has Connect enabled becomes the leader, it will
|
||||
bootstrap a new CA and generate its own private key which is written to the
|
||||
Raft state.
|
||||
|
||||
Alternatively, an external private key can be provided via the [CA
|
||||
configuration](/docs/connect/ca#specifying-a-private-key-and-root-certificate).
|
||||
|
||||
~> External CAs: Connect has been designed with a pluggable CA component so external CAs can be
|
||||
integrated. For production workloads we recommend using [Vault or another external
|
||||
CA](/docs/connect/ca#external-ca-certificate-authority-providers) once
|
||||
available such that the root key is not stored within Consul state at all.
|
||||
|
||||
## Setup Host Firewall
|
||||
|
||||
In order to enable inbound connections to Connect proxies, you may need to
|
||||
configure host or network firewalls to allow incoming connections to proxy
|
||||
ports.
|
||||
|
||||
In addition to Consul agent's [communication
|
||||
ports](/docs/agent/options#ports) any
|
||||
[proxies](/docs/connect/proxies) will need to have
|
||||
ports open to accept incoming connections.
|
||||
|
||||
If using [sidecar service
|
||||
registration](/docs/connect/proxies/sidecar-service) Consul will by default
|
||||
assign ports from [a configurable
|
||||
range](/docs/agent/options#sidecar_min_port). The default range is 21000 to 21255.
|
||||
If this feature is used, the agent assumes all ports in that range are
|
||||
both free to use (no other processes listening on them) and are exposed in the
|
||||
firewall to accept connections from other service hosts.
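
As a rough illustration of exposing the sidecar range, the sketch below prints an `iptables` rule covering the default ports. The helper name `sidecar_port_rule` is hypothetical, and the choice of `iptables` is an assumption; your hosts may use a different firewall or a cloud security group instead. The rule is printed rather than applied so it can be reviewed first.

```shell
# Print an iptables rule that admits inbound TCP traffic to the sidecar port range.
# Nothing is applied here; the output is meant to be reviewed and run by an operator.
sidecar_port_rule() {
  min_port="${1:-21000}"   # default sidecar_min_port from this guide
  max_port="${2:-21255}"   # default sidecar_max_port from this guide
  printf 'iptables -A INPUT -p tcp --dport %s:%s -j ACCEPT\n' "$min_port" "$max_port"
}

sidecar_port_rule
```

If you narrow the range with `sidecar_min_port` and `sidecar_max_port`, pass the same values to the helper so the firewall and the agent stay in agreement.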
|
||||
|
||||
It is possible to prevent automated port selection by [configuring
|
||||
`sidecar_min_port` and
|
||||
`sidecar_max_port`](/docs/agent/options#sidecar_min_port) to both be `0`,
|
||||
forcing any sidecar service registrations to need an explicit port configured.
|
||||
|
||||
It then becomes the same problem as opening ports necessary for any other
|
||||
application and might be managed by configuration management or a scheduler.
|
||||
|
||||
## Configure Service Instances
|
||||
|
||||
With [necessary ACL tokens](#configure-acls) in place, all service registrations
|
||||
need to have an appropriate ACL token present.
|
||||
|
||||
For on-disk configuration the `token` parameter of the service definition must
|
||||
be set.
|
||||
|
||||
```json
|
||||
{
|
||||
"service": {
|
||||
"name": "cassandra_db",
|
||||
"port": 9002,
|
||||
"token": "<your_token_here>"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
For registration via the API the token is passed in the [request
|
||||
header](/api#authentication), `X-Consul-Token`, or by using the [Go
|
||||
client configuration](https://godoc.org/github.com/hashicorp/consul/api#Config).
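
Registration through the HTTP API might look like the following sketch. The function names `service_payload` and `register_service` are hypothetical, and the default address `http://127.0.0.1:8500` assumes an agent listening on its standard HTTP port. The token is read from `CONSUL_HTTP_TOKEN` and sent in the `X-Consul-Token` header.

```shell
# Build a minimal registration payload for the agent's service registration endpoint.
service_payload() {
  printf '{"Name": "%s", "Port": %s}\n' "$1" "$2"
}

# Send the payload to the local agent, passing the ACL token in the request header.
# Assumes CONSUL_HTTP_TOKEN is set to a token allowed to register this service.
register_service() {
  service_payload "$1" "$2" |
    curl --silent --fail --request PUT \
      --header "X-Consul-Token: ${CONSUL_HTTP_TOKEN}" \
      --data @- \
      "${CONSUL_HTTP_ADDR:-http://127.0.0.1:8500}/v1/agent/service/register"
}

# Show the payload that would be sent for the service used earlier in this guide.
service_payload cassandra_db 9002
# To register against a running agent: register_service cassandra_db 9002
```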
|
||||
|
||||
To avoid the overhead of a proxy, applications may [natively
|
||||
integrate](/docs/connect/native) with Connect.
|
||||
|
||||
~> Protect Application Listener: If using any kind of proxy for Connect, the application must ensure no untrusted
|
||||
connections can be made to its unprotected listening port. This is typically
|
||||
done by binding to `localhost` and only allowing loopback traffic, but may also
|
||||
be achieved using firewall rules or network namespacing.
|
||||
|
||||
For examples of proxy service definitions see the [proxy
|
||||
documentation](/docs/connect/proxies).
|
||||
|
||||
## Summary
|
||||
|
||||
After securing your Consul cluster with ACLs and TLS encryption, you
|
||||
can use Connect to secure service-to-service communication. If you
|
||||
encounter any issues while setting up Consul Connect, there are
|
||||
many [community](https://www.consul.io/community.html) resources where you can find help.
|
|
@ -1,89 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Consul-AWS
|
||||
description: >-
|
||||
Consul-AWS provides a tool, which syncs Consul's and AWS Cloud Map's service
|
||||
catalog
|
||||
---
|
||||
|
||||
# Consul-AWS
|
||||
|
||||
[Consul-AWS](https://github.com/hashicorp/consul-aws/) syncs the services in an AWS Cloud Map namespace to a Consul datacenter. Consul services will be created in AWS Cloud Map and the other way around. This enables native service discovery across Consul and AWS Cloud Map.
|
||||
This guide will describe how to configure and how to start the sync.
|
||||
|
||||
## Authentication
|
||||
|
||||
`consul-aws` needs access to Consul and AWS for uni- and bidirectional sync.
|
||||
|
||||
For Consul, the process accepts both the standard CLI flag `-token` and the environment variable `CONSUL_HTTP_TOKEN`. This should be set to a Consul ACL token if ACLs are enabled.
|
||||
|
||||
For AWS, `consul-aws` uses the default credential provider chain to find AWS credentials. The default provider chain looks for credentials in the following order:
|
||||
|
||||
1. Environment variables.
|
||||
2. Shared credentials file.
|
||||
3. If your application is running on an Amazon EC2 instance, IAM role for Amazon EC2.
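
Before starting a sync, it can help to confirm which credentials are present in the environment. The sketch below covers only the environment-variable path; the helper name `missing_vars` is hypothetical, and the AWS variable names are the conventional ones read by the AWS SDKs. If you rely on the shared credentials file or an EC2 IAM role, the AWS variables may legitimately be unset.

```shell
# Print the names of the given environment variables that are unset or empty.
missing_vars() {
  missing=""
  for name in "$@"; do
    # Indirectly read the variable called $name, treating unset as empty.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  # Drop the leading space before printing the list.
  printf '%s\n' "${missing# }"
}

missing_vars CONSUL_HTTP_TOKEN AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_REGION
```

An empty line of output means every listed variable is set.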
|
||||
|
||||
## Configuration
|
||||
|
||||
There are two subcommands available on `consul-aws`:
|
||||
|
||||
- version: display version number
|
||||
- sync-catalog: start syncing the catalogs
|
||||
|
||||
The version subcommand doesn’t do anything besides showing the version, so let's focus on sync-catalog. The following flags are available:
|
||||
|
||||
- A set of parameters to connect to your Consul Cluster like `-http-addr`, `-token`, `-ca-file`, `-client-cert`, and everything else you might need in order to do that
|
||||
- `-aws-namespace-id`: The AWS namespace to sync with Consul services.
|
||||
- `-aws-service-prefix`: A prefix to prepend to all services written to AWS from Consul. If this is not set then services will have no prefix.
|
||||
- `-consul-service-prefix`: A prefix to prepend to all services written to Consul from AWS. If this is not set then services will have no prefix.
|
||||
- `-to-aws`: If true, Consul services will be synced to AWS (defaults to false).
|
||||
- `-to-consul`: If true, AWS services will be synced to Consul (defaults to false).
|
||||
- `-aws-pull-interval`: The interval between fetching from AWS Cloud Map. Accepts a sequence of decimal numbers, each with optional fraction and a unit suffix, such as "300ms", "10s", "1.5m" (defaults to 30s).
|
||||
- `-aws-dns-ttl`: DNS TTL for services created in AWS Cloud Map in seconds (defaults to 60).
|
||||
|
||||
Independent of how you want to use `consul-aws`, it needs to be able to connect to Consul and AWS. Apart from making sure you set up authenticated access, `-aws-namespace-id` is mandatory.
|
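
The mandatory namespace and the two direction flags can be captured in a small wrapper. This is only a sketch: the function name `sync_catalog_args` and its `to-aws`, `to-consul`, and `both` keywords are hypothetical conveniences, while the subcommand and flags are the ones listed above.

```shell
# Assemble sync-catalog arguments, refusing to continue without a namespace ID.
sync_catalog_args() {
  namespace_id="$1"
  direction="$2"
  if [ -z "$namespace_id" ]; then
    echo "error: -aws-namespace-id is mandatory" >&2
    return 1
  fi
  case "$direction" in
    to-aws)    flags="-to-aws" ;;
    to-consul) flags="-to-consul" ;;
    both)      flags="-to-aws -to-consul" ;;
    *)
      echo "error: direction must be to-aws, to-consul, or both" >&2
      return 1
      ;;
  esac
  printf 'sync-catalog -aws-namespace-id %s %s\n' "$namespace_id" "$flags"
}

sync_catalog_args ns-hjrgt3bapp7phzff both
# To start the sync: consul-aws $(sync_catalog_args ns-hjrgt3bapp7phzff both)
```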
||||
|
||||
## Syncing Consul services to AWS Cloud Map
|
||||
|
||||
Assuming authenticated access is set up, there is little left to do before starting the sync. Using `-to-aws` command line flag will start the sync to AWS Cloud Map. If `-aws-service-prefix` is provided, every imported service from Consul will be prefixed. For example:
|
||||
|
||||
```shell
|
||||
$ consul-aws sync-catalog -aws-namespace-id ns-hjrgt3bapp7phzff -to-aws -consul-service-prefix consul_
|
||||
```
|
||||
|
||||
At this point `consul-aws` will start importing services into AWS Cloud Map. A service in Consul named `web` will end up becoming `consul_web` in AWS. The individual service instances from Consul will be created in AWS as well.
|
||||
|
||||
Services in AWS Cloud Map that were imported from Consul have the following properties:
|
||||
|
||||
- Description: “Imported from Consul”
|
||||
- Record types: A and SRV
|
||||
- DNS routing policy: Multivalue answer routing
|
||||
|
||||
## Syncing AWS Cloud Map services to Consul
|
||||
|
||||
Similar to the previous section, there are two relevant flags: `-to-consul` to turn on the sync and optionally `-consul-service-prefix` to prefix every service imported into Consul. For example:
|
||||
|
||||
```shell
|
||||
$ consul-aws sync-catalog -aws-namespace-id ns-hjrgt3bapp7phzff -to-consul -aws-service-prefix aws_
|
||||
```
|
||||
|
||||
At this point `consul-aws` will start importing services into Consul. A service in AWS named `redis` will end up becoming `aws_redis` in Consul. The individual service instances from AWS will be created in Consul as well.
|
||||
|
||||
Services in Consul that were imported from AWS Cloud Map have the following properties:
|
||||
- Tag: aws
|
||||
- Meta-Data: the source is set to aws, together with the aws-id, the aws-namespace, and every custom attribute the instance had in AWS Cloud Map
|
||||
- Node: the node name is consul-aws
|
||||
|
||||
## Syncing both directions
|
||||
|
||||
To enable bidirectional sync, combine the previous two sections: provide both `-to-consul` and `-to-aws`, and optionally `-aws-service-prefix` and `-consul-service-prefix`:
|
||||
|
||||
```shell
|
||||
$ consul-aws sync-catalog -aws-namespace-id ns-hjrgt3bapp7phzff -to-consul -aws-service-prefix aws_ -to-aws -consul-service-prefix consul_
|
||||
```
|
||||
|
||||
At this point `consul-aws` will start importing services into Consul from AWS Cloud Map and into AWS Cloud Map from Consul.
|
||||
|
||||
## Summary
|
||||
|
||||
At this point, either uni- or bidirectional sync is set up and service discovery is available across Consul and AWS seamlessly. If you haven’t enabled [ACL](/docs/guides/acl), now is a good time to read about it.
|
|
@ -1,120 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Using Consul with Containers
|
||||
description: >-
|
||||
This guide describes how to run Consul on containers, with Docker as the
|
||||
primary focus. It also describes best practices when running a Consul cluster
|
||||
in production on Docker.
|
||||
---
|
||||
|
||||
# Consul with Containers
|
||||
|
||||
This guide describes critical aspects of operating a Consul cluster that's run inside containers. It primarily focuses on the Docker container runtime, but the principles largely apply to rkt, oci, and other container runtimes as well.
|
||||
|
||||
## Consul Official Docker Image
|
||||
|
||||
Consul's official Docker images are tagged with version numbers. For example, `docker pull consul:1.4.4` will pull the 1.4.4 Consul release image.
|
||||
|
||||
For major releases, make sure to read our [upgrade guides](/docs/upgrade-specific) before upgrading a cluster.
|
||||
|
||||
To get a development mode Consul instance running the latest version, run `docker run consul`.
|
||||
|
||||
More instructions on how to get started using this image are available at the [official Docker repository page](https://store.docker.com/images/consul).
|
||||
|
||||
## Data Directory Persistence
|
||||
|
||||
The container exposes its data directory, `/consul/data`, as a [volume](https://docs.docker.com/engine/tutorials/dockervolumes/). This is where Consul will store its persisted state.
|
||||
|
||||
For clients, this stores some information about the cluster and the client's services and health checks in case the container is restarted. If the volume on a client disappears, it doesn't affect cluster operations.
|
||||
|
||||
For servers, this stores the client information plus snapshots and data related to the consensus algorithm and other state like Consul's key/value store and catalog. **Servers need the volume's data to be available when restarting containers to recover from outage scenarios.** Therefore, operators must make sure that volumes containing Consul cluster data are not destroyed during container restarts.
|
||||
|
||||
~> We also recommend taking additional backups via [`consul snapshot`](/docs/commands/snapshot), and storing them externally.
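
One way to keep server data across container restarts is a named Docker volume mounted at `/consul/data`. The sketch below only assembles the `docker run` arguments; the helper name `server_run_args` and the volume name `consul-data` are hypothetical, and the image tag is the one used earlier in this guide.

```shell
# Assemble `docker run` arguments for a server whose data lives in a named volume.
server_run_args() {
  volume_name="${1:-consul-data}"   # hypothetical volume name
  image_tag="${2:-consul:1.4.4}"    # image tag from the example above
  printf 'run -d -v %s:/consul/data %s agent -server -bootstrap-expect=3\n' \
    "$volume_name" "$image_tag"
}

server_run_args
# To start the container: docker $(server_run_args)
```

Because the volume is named rather than anonymous, removing and recreating the container with the same arguments should reattach the existing data.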
|
||||
|
||||
## Configuration
|
||||
|
||||
The container has a Consul configuration directory set up at `/consul/config` and the agent will load any configuration files placed here by binding a volume or by composing a new image and adding files.
|
||||
|
||||
Note that the configuration directory is not exposed as a volume, and will not persist. Consul uses it only during start up and does not store any state there.
|
||||
|
||||
Configuration can also be added by passing the configuration JSON via the environment variable `CONSUL_LOCAL_CONFIG`. Example:
|
||||
|
||||
```shell
|
||||
$ docker run \
|
||||
-d \
|
||||
-e CONSUL_LOCAL_CONFIG='{
|
||||
"datacenter":"us_west",
|
||||
"server":true,
|
||||
"enable_debug":true
|
||||
}' \
|
||||
consul agent -server -bootstrap-expect=3
|
||||
```
|
||||
|
||||
## Networking
|
||||
|
||||
When running inside a container, Consul must be configured with an appropriate _cluster address_ and _client address_. In some cases, it may also require configuring an _advertise address_.
|
||||
|
||||
- **Cluster Address** - The address at which other Consul agents may contact a given agent. This is also referred to as the bind address.
|
||||
|
||||
- **Client Address** - The address where other processes on the host contact Consul in order to make HTTP or DNS requests. Consider setting this to localhost or `127.0.0.1` to only allow processes on the same container to make HTTP/DNS requests.
|
||||
|
||||
- **Advertise Address** - The advertise address is used to change the address that we advertise to other nodes in the cluster. This defaults to the bind address. Consider using this if you use NAT in your environment, or in scenarios where you have a routable address that cannot be bound.
|
||||
|
||||
You will need to tell Consul what its cluster address is when starting so that it binds to the correct interface and advertises a workable interface to the rest of the Consul agents. There are two ways of doing this:
|
||||
|
||||
1. Environment Variables: Use the `CONSUL_CLIENT_INTERFACE` and `CONSUL_BIND_INTERFACE` environment variables. In the following example `eth0` is the network interface of the container.
|
||||
|
||||
```shell
|
||||
$ docker run \
|
||||
-d \
|
||||
-e CONSUL_CLIENT_INTERFACE='eth0' \
|
||||
-e CONSUL_BIND_INTERFACE='eth0' \
|
||||
consul agent -server -bootstrap-expect=3
|
||||
```
|
||||
|
||||
2. Address Templates: You can declaratively specify the client and cluster addresses using the formats described in the [go-sockaddr](https://github.com/hashicorp/go-sockaddr) library.
|
||||
In the following example, the client and bind addresses are declaratively specified for the container network interface `eth0`:
|
||||
|
||||
```shell
|
||||
$ docker run \
|
||||
consul agent -server \
|
||||
-client='{{ GetInterfaceIP "eth0" }}' \
|
||||
-bind='{{ GetInterfaceIP "eth0" }}' \
|
||||
-bootstrap-expect=3
|
||||
```
|
||||
|
||||
## Stopping and Restarting Containers
|
||||
|
||||
The official Consul container supports stopping, starting, and restarting. To stop a container, run `docker stop`:
|
||||
|
||||
```shell
|
||||
$ docker stop <container_id>
|
||||
```
|
||||
|
||||
To start a container, run `docker start`:
|
||||
|
||||
```shell
|
||||
$ docker start <container_id>
|
||||
```
|
||||
|
||||
To do an in-memory reload, send a SIGHUP to the container:
|
||||
|
||||
```shell
|
||||
$ docker kill --signal=HUP <container_id>
|
||||
```
|
||||
|
||||
As long as there are enough servers in the cluster to maintain [quorum](/docs/internals/consensus#deployment-table), Consul's [Autopilot](/docs/guides/autopilot) feature will handle removing servers whose containers were stopped. Autopilot's default settings are already configured correctly. If you override them, make sure that the following [settings](/docs/agent/options#autopilot) are appropriate.
|
||||
|
||||
- `cleanup_dead_servers` must be set to true to make sure that a stopped container is removed from the cluster.
|
||||
- `last_contact_threshold` should be reasonably small, so that dead servers are removed quickly.
|
||||
- `server_stabilization_time` should be sufficiently large (on the order of several seconds) so that unstable servers are not added to the cluster until they stabilize.
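
If you do override these settings at runtime, the `consul operator autopilot set-config` command accepts matching flags. The sketch below prints the arguments rather than running them; the helper name `autopilot_args` is hypothetical and the threshold values are illustrative, not recommendations.

```shell
# Assemble arguments for `consul operator autopilot set-config`.
autopilot_args() {
  last_contact="${1:-200ms}"    # illustrative value for last_contact_threshold
  stabilization="${2:-10s}"     # illustrative value for server_stabilization_time
  printf 'operator autopilot set-config -cleanup-dead-servers=true -last-contact-threshold=%s -server-stabilization-time=%s\n' \
    "$last_contact" "$stabilization"
}

autopilot_args
# To apply against a running cluster: consul $(autopilot_args)
```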
|
||||
|
||||
If the container running the currently-elected Consul server leader is stopped, a leader election will trigger. This event will cause a new Consul server in the cluster to assume leadership.
|
||||
|
||||
When a previously stopped server container is restarted using `docker start <container_id>`, and it is configured to obtain a new IP, Autopilot will add it back to the set of Raft peers with the same node-id and the new IP address, after which it can participate as a server again.
|
||||
|
||||
## Known Issues
|
||||
|
||||
**All nodes changing IP addresses** Prior to Consul 0.9.3, Consul did not gracefully handle the situation where all nodes in the cluster running inside a container are restarted at the same time, and they all obtain new IP addresses. This has been [fixed](https://github.com/hashicorp/consul/issues/1580) since Consul 0.9.3, and requires `"raft_protocol"` to be set to `"3"` in the configs in Consul 0.9.3. Consul 1.0 makes raft protocol 3 the default.
|
||||
|
||||
**Snapshot close error** Due to a [known issue](https://github.com/docker/libnetwork/issues/1204) with half close support in Docker, you will see an error message `[ERR] consul: Failed to close snapshot: write tcp <source>-><destination>: write: broken pipe` when saving snapshots. This does not affect saving and restoring snapshots when running in Docker.
|
|
@ -1,191 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Consul Template
|
||||
description: >-
|
||||
Consul template provides a programmatic method for rendering configuration
|
||||
files from Consul data.
|
||||
---
|
||||
|
||||
# Consul Template
|
||||
|
||||
The Consul template tool provides a programmatic method
|
||||
for rendering configuration files from a variety of locations,
|
||||
including Consul KV. It is an ideal option for replacing complicated API
|
||||
queries that often require custom formatting.
|
||||
The template tool is based on Go templates and shares many
|
||||
of the same attributes.
|
||||
|
||||
Consul template is a useful tool with several uses; we will focus on two
|
||||
of its use cases.
|
||||
|
||||
1. _Update configuration files_. The Consul template tool can be used
|
||||
to update service configuration files. A common use case is managing load
|
||||
balancer configuration files that need to be updated regularly in a dynamic
|
||||
infrastructure on machines which may not be able to directly connect to the Consul cluster.
|
||||
|
||||
1. _Discover data about the Consul cluster and service_. It is possible to collect
|
||||
information about the services in your Consul cluster. For example, you could
|
||||
collect a list of all services running on the cluster or you could discover all
|
||||
service addresses for the Redis service. Note, this use case has limited
|
||||
scope for production.
|
||||
|
||||
In this guide we will briefly discuss how `consul-template` works,
|
||||
how to install it, and two use cases.
|
||||
|
||||
Before completing this guide, we assume some familiarity with
|
||||
[Consul KV](https://learn.hashicorp.com/consul/getting-started/kv)
|
||||
and [Go templates](https://golang.org/pkg/text/template/).
|
||||
|
||||
## Introduction to Consul Template
|
||||
|
||||
Consul template is a simple, yet powerful tool. When initiated, it
|
||||
reads one or more template files and queries Consul for all
|
||||
data needed to render them. Typically, you run `consul-template` as a
|
||||
daemon which will fetch the initial values and then continue to watch
|
||||
for updates, re-rendering the template whenever there are relevant changes in
|
||||
the cluster. You can alternatively use the `-once` flag to fetch and render
|
||||
the template once which is useful for testing and
|
||||
setup scripts that are triggered by some other automation, for example a
|
||||
provisioning tool. Finally, the template can also run arbitrary commands after the update
|
||||
process completes. For example, it can send the HUP signal to the
|
||||
load balancer service after a configuration change has been made.
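
The command to run after rendering is given as an optional third field of the `-template` value, in the form `source:destination:command`. A minimal sketch follows; the helper name `template_spec`, the file paths, and the reload command are all hypothetical.

```shell
# Join a template source, destination, and post-render command into one -template value.
template_spec() {
  printf '%s:%s:%s\n' "$1" "$2" "$3"
}

template_spec haproxy.ctmpl /etc/haproxy/haproxy.cfg "service haproxy reload"
# To render once and run the command afterwards:
#   consul-template -template "$(template_spec haproxy.ctmpl /etc/haproxy/haproxy.cfg 'service haproxy reload')" -once
```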
|
||||
|
||||
The Consul template tool is flexible; it can fit into many
|
||||
different environments and workflows. Depending on the use-case, you
|
||||
may have a single `consul-template` instance on a handful of hosts
|
||||
or may need to run several instances on every host. Each `consul-template`
|
||||
process can manage multiple unrelated files though and will de-duplicate
|
||||
the fetches as needed if those files share data dependencies so it can
|
||||
reduce the load on Consul servers to share where possible.
|
||||
|
||||
## Install Consul Template
|
||||
|
||||
For this guide, we are using a local Consul agent in development
|
||||
mode which can be started with `consul agent -dev`. To quickly set
|
||||
up a local Consul agent, refer to the getting started [guide](https://learn.hashicorp.com/consul/getting-started/install). The
|
||||
Consul agent must be running to complete all of the following
|
||||
steps.
|
||||
|
||||
The Consul template tool is not included with the Consul binary and will
|
||||
need to be installed separately. It can be installed from a precompiled
|
||||
binary or compiled from source. We will be installing the precompiled binary.
|
||||
|
||||
First, download the binary from the [Consul Template releases page](https://releases.hashicorp.com/consul-template/).
|
||||
|
||||
```shell
|
||||
curl -O https://releases.hashicorp.com/consul-template/0.19.5/consul-template<_version_OS>.tgz
|
||||
```
|
||||
|
||||
Next, extract the binary and move it into your `$PATH`.
|
||||
|
||||
```shell
|
||||
tar -zxf consul-template<_version_OS>.tgz
|
||||
```
|
||||
|
||||
To compile from source, please see the instructions in the
|
||||
[contributing section in GitHub](https://github.com/hashicorp/consul-template#contributing).
|
||||
|
||||
## Use Case: Consul KV
|
||||
|
||||
In this first use case example, we will render a template that pulls the HashiCorp address
|
||||
from Consul KV. To do this we will create a simple template that contains the HashiCorp
|
||||
address, run `consul-template`, add a value to Consul KV for HashiCorp's address, and
|
||||
finally view the rendered file.
|
||||
|
||||
First, we will need to create a template file `find_address.tpl` to query
|
||||
Consul's KV store:
|
||||
|
||||
```liquid
|
||||
{{ key "/hashicorp/street_address" }}
|
||||
```
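
As a variation, the `keyOrDefault` template function renders a fallback value when the key does not exist yet, whereas `key` waits for the key to appear before rendering. The sketch below writes such a template to a separate, hypothetically named file so it does not replace `find_address.tpl`; the fallback text is also just an example.

```shell
# Write a template that falls back to a placeholder until the key is created.
cat > find_address_default.tpl <<'EOF'
{{ keyOrDefault "hashicorp/street_address" "address not set" }}
EOF

# Show the template that was written.
cat find_address_default.tpl
```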
|
||||
|
||||
Next, we will run `consul-template` specifying both
|
||||
the template to use and the file to update.
|
||||
|
||||
```shell
|
||||
$ consul-template -template "find_address.tpl:hashicorp_address.txt"
|
||||
```
|
||||
|
||||
The `consul-template` process will continue to run until you kill it with `CTRL+c`.
|
||||
For now, we will leave it running.
|
||||
|
||||
Finally, open a new terminal so we can write data to the key in Consul using the command
|
||||
line interface.
|
||||
|
||||
```shell
|
||||
$ consul kv put hashicorp/street_address "101 2nd St"
|
||||
|
||||
Success! Data written to: hashicorp/street_address
|
||||
```
|
||||
|
||||
We can ensure the data was written by viewing the `hashicorp_address.txt`
|
||||
file which will be located in the same directory where `consul-template`
|
||||
was run.
|
||||
|
||||
```shell
|
||||
$ cat hashicorp_address.txt
|
||||
|
||||
101 2nd St
|
||||
```
|
||||
|
||||
If you update the key `hashicorp/street_address`, you can see the changes to the file
|
||||
immediately. Go ahead and try `consul kv put hashicorp/street_address "22b Baker ST"`.
|
||||
|
||||
You can see that this simple process can have powerful implications. For example, it is
|
||||
possible to use this same process for updating your [HAProxy load balancer
|
||||
configuration](https://github.com/hashicorp/consul-template/blob/master/examples/haproxy.md).
|
||||
|
||||
You can now kill the `consul-template` process with `CTRL+c`.
|
||||
|
||||
## Use Case: Discover All Services
|
||||
|
||||
In this use case example, we will discover all the services running in the Consul cluster.
|
||||
To follow along, you can use the local development agent from the previous example.
|
||||
|
||||
First, we will need to create a new template `all-services.tpl` to query all services.
|
||||
|
||||
```liquid
|
||||
{{range services}}# {{.Name}}{{range service .Name}}
|
||||
{{.Address}}{{end}}
|
||||
|
||||
{{end}}
|
||||
```
|
||||
|
||||
Next, run Consul template specifying the template we just created and the `-once` flag.
|
||||
The `-once` flag will tell the process to run once and then quit.
|
||||
|
||||
```shell
|
||||
$ consul-template -template="all-services.tpl:all-services.txt" -once
|
||||
```
|
||||
|
||||
If you complete this on your local development agent, you should
|
||||
still see the `consul` service when viewing `all-services.txt`.
|
||||
|
||||
```text
|
||||
# consul
|
||||
127.0.0.1
|
||||
```
|
||||
|
||||
On a development or production cluster, you would see a list of all the services.
|
||||
For example:
|
||||
|
||||
```text
|
||||
# consul
|
||||
104.131.121.232
|
||||
|
||||
# redis
|
||||
104.131.86.92
|
||||
104.131.109.224
|
||||
104.131.59.59
|
||||
|
||||
# web
|
||||
104.131.86.92
|
||||
104.131.109.224
|
||||
104.131.59.59
|
||||
```
|
||||
|
||||
## Conclusion
|
||||
|
||||
In this guide we learned how to set up and use the Consul template tool.
|
||||
To see additional examples, refer to the examples folder
|
||||
in [GitHub](https://github.com/hashicorp/consul-template/tree/master/examples).
|
|
@ -1,427 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Creating and Configuring TLS Certificates
|
||||
description: Learn how to create certificates for Consul.
|
||||
---
|
||||
|
||||
# Creating and Configuring TLS Certificates
|
||||
|
||||
Setting your cluster up with TLS is an important step towards a secure
|
||||
deployment. Correct TLS configuration is a prerequisite of our [Security
|
||||
Model](/docs/internals/security). Correctly configuring TLS can be a
|
||||
complex process however, especially given the wide range of deployment
|
||||
methodologies. This guide will provide you with a production ready TLS
|
||||
configuration.
|
||||
|
||||
~> More advanced topics like key management and rotation are not covered by this
|
||||
guide. [Vault][vault] is the suggested solution for key generation and
|
||||
management.
|
||||
|
||||
This guide has the following chapters:
|
||||
|
||||
1. [Creating Certificates](#creating-certificates)
|
||||
1. [Configuring Agents](#configuring-agents)
|
||||
1. [Configuring the Consul CLI for HTTPS](#configuring-the-consul-cli-for-https)
|
||||
1. [Configuring the Consul UI for HTTPS](#configuring-the-consul-ui-for-https)
|
||||
|
||||
This guide is structured in a way that you build knowledge with every step. It is
|
||||
recommended to read the whole guide before starting with the actual work,
|
||||
because you can save time if you are aware of some of the more advanced things
|
||||
in Chapter [3](#configuring-the-consul-cli-for-https) and
|
||||
[4](#configuring-the-consul-ui-for-https).
|
||||
|
||||
### Reference Material
|
||||
|
||||
- [Encryption](/docs/agent/encryption)
|
||||
- [Security Model](/docs/internals/security)
|
||||
|
||||
## Creating Certificates
|
||||
|
||||
### Estimated Time to Complete
|
||||
|
||||
2 minutes
|
||||
|
||||
### Prerequisites
|
||||
|
||||
This guide assumes you have Consul 1.4.1 (or newer) in your PATH.
|
||||
|
||||
### Introduction
|
||||
|
||||
The first step to configuring TLS for Consul is generating certificates. In
|
||||
order to prevent unauthorized cluster access, Consul requires all certificates
|
||||
be signed by the same Certificate Authority (CA). This should be a _private_ CA
|
||||
and not a public one like [Let's Encrypt][letsencrypt] as any certificate
|
||||
signed by this CA will be allowed to communicate with the cluster.
|
||||
|
||||
### Step 1: Create a Certificate Authority
|
||||
|
||||
There are a variety of tools for managing your own CA, [like the PKI secret
|
||||
backend in Vault][vault-pki], but for the sake of simplicity this guide will
|
||||
use Consul's built-in TLS helpers:
|
||||
|
||||
```shell
|
||||
$ consul tls ca create
|
||||
==> Saved consul-agent-ca.pem
|
||||
==> Saved consul-agent-ca-key.pem
|
||||
```
|
||||
|
||||
The CA certificate (`consul-agent-ca.pem`) contains the public key necessary to
|
||||
validate Consul certificates and therefore must be distributed to every node
|
||||
that runs a Consul agent.
|
||||
|
||||
~> The CA key (`consul-agent-ca-key.pem`) will be used to sign certificates for Consul
|
||||
nodes and must be kept private. Possession of this key allows anyone to run Consul as
|
||||
a trusted server and access all Consul data including ACL tokens.
|
||||
|
||||
### Step 2: Create individual Server Certificates
|
||||
|
||||
Create a server certificate for datacenter `dc1` and domain `consul`. If your
|
||||
datacenter or domain is different, please use the appropriate flags:
|
||||
|
||||
```shell
|
||||
$ consul tls cert create -server
|
||||
==> WARNING: Server Certificates grants authority to become a
|
||||
server and access all state in the cluster including root keys
|
||||
and all ACL tokens. Do not distribute them to production hosts
|
||||
that are not server nodes. Store them as securely as CA keys.
|
||||
==> Using consul-agent-ca.pem and consul-agent-ca-key.pem
|
||||
==> Saved dc1-server-consul-0.pem
|
||||
==> Saved dc1-server-consul-0-key.pem
|
||||
```
|
||||
|
||||
Please repeat this process until there is an _individual_ certificate for each
|
||||
server. The command can be called over and over again; it will automatically add
|
||||
a suffix.
|
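
When planning distribution, it can be useful to list the file names you expect after repeating the command. The sketch below follows the naming pattern seen in the output above; the helper name `server_cert_files` is hypothetical, and the assumption is that the numeric suffix starts at 0 and increases by one per invocation.

```shell
# List the certificate and key file names expected for a number of servers.
server_cert_files() {
  count="$1"
  dc="${2:-dc1}"          # datacenter used in this guide
  domain="${3:-consul}"   # domain used in this guide
  i=0
  while [ "$i" -lt "$count" ]; do
    printf '%s-server-%s-%s.pem %s-server-%s-%s-key.pem\n' \
      "$dc" "$domain" "$i" "$dc" "$domain" "$i"
    i=$((i + 1))
  done
}

server_cert_files 3
# To generate the certificates themselves, repeat the command once per server:
#   for n in 1 2 3; do consul tls cert create -server; done
```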
||||
|
||||
In order to authenticate Consul servers, servers are provided with a special
|
||||
certificate - one that contains `server.dc1.consul` in the `Subject Alternative Name`. If you enable
|
||||
[`verify_server_hostname`](/docs/agent/options#verify_server_hostname),
|
||||
only agents that provide such a certificate are allowed to boot as a server.
|
||||
Without `verify_server_hostname = true` an attacker could compromise a Consul
|
||||
client agent and restart the agent as a server in order to get access to all the
|
||||
data in your cluster! This is why server certificates are special, and only
|
||||
servers should have them provisioned.
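
To confirm that a certificate carries the server name, you can inspect its `Subject Alternative Name` entries with OpenSSL. The sketch below assumes `openssl` is installed; the helper name `has_server_san` is hypothetical and simply searches the text dump it receives on standard input.

```shell
# Succeed if the certificate text on stdin lists server.<dc>.<domain> as a DNS name.
has_server_san() {
  dc="${1:-dc1}"
  domain="${2:-consul}"
  grep -q "DNS:server\.${dc}\.${domain}"
}

# Usage against a real certificate (requires openssl and the generated file):
#   openssl x509 -in dc1-server-consul-0.pem -noout -text | has_server_san dc1 consul
```

A client certificate should fail this check, which is what prevents it from being used to start a server when `verify_server_hostname` is enabled.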
|
||||
|
||||
~> Server keys, like the CA key, must be kept private - they effectively allow
|
||||
access to all Consul data.
|
||||
|
||||
### Step 3: Create Client Certificates
|
||||
|
||||
Create a client certificate:
|
||||
|
||||
```shell
|
||||
$ consul tls cert create -client
|
||||
==> Using consul-agent-ca.pem and consul-agent-ca-key.pem
|
||||
==> Saved dc1-client-consul-0.pem
|
||||
==> Saved dc1-client-consul-0-key.pem
|
||||
```
|
||||
|
||||
Client certificates are also signed by your CA, but they do not have that
|
||||
special `Subject Alternative Name` which means that if `verify_server_hostname`
|
||||
is enabled, they cannot start as a server.
|
||||
|
||||
## Configuring Agents
|
||||
|
||||
### Prerequisites
|
||||
|
||||
For this section you need access to your existing or new Consul cluster and have
|
||||
the certificates from the previous chapters available.
|
||||
|
||||
### Notes on example configurations
|
||||
|
||||
The example configurations from this as well as the following chapters are in
|
||||
JSON. You can copy each of the examples into its own file in a directory
|
||||
([`-config-dir`](/docs/agent/options#_config_dir)) from which Consul will
|
||||
load all the configuration. This is just one way to do it; you can also put them
|
||||
all into one file if you prefer that.
|
||||
|
||||
### Introduction
|
||||
|
||||
By now you have created the certificates you need to enable TLS in your cluster.
|
||||
The next steps show how to configure TLS for a brand new cluster. If you already
|
||||
have a cluster in production without TLS please see the [encryption
|
||||
guide][guide] for the steps needed to introduce TLS without downtime.
|
||||
|
||||
### Step 1: Setup Consul servers with certificates
|
||||
|
||||
This step describes how to set up one of your Consul servers; you will want to make
|
||||
sure to repeat the process for the other ones as well with their individual
|
||||
certificates.
|
||||
|
||||
The following files need to be copied to your Consul server:
|
||||
|
||||
- `consul-agent-ca.pem`: CA public certificate.
|
||||
- `dc1-server-consul-0.pem`: Consul server node public certificate for the `dc1` datacenter.
|
||||
- `dc1-server-consul-0-key.pem`: Consul server node private key for the `dc1` datacenter.
|
||||
|
||||
Here is an example agent TLS configuration for Consul servers which mentions the
|
||||
copied files:
|
||||
|
||||
```json
|
||||
{
|
||||
"verify_incoming": true,
|
||||
"verify_outgoing": true,
|
||||
"verify_server_hostname": true,
|
||||
"ca_file": "consul-agent-ca.pem",
|
||||
"cert_file": "dc1-server-consul-0.pem",
|
||||
"key_file": "dc1-server-consul-0-key.pem",
|
||||
"ports": {
|
||||
"http": -1,
|
||||
"https": 8501
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
This configuration disables the HTTP port to make sure there is only encrypted
|
||||
communication. Existing clients that are not yet prepared to talk HTTPS won't be
|
||||
able to connect afterwards. This also affects built-in tooling like `consul members` and the UI. The next chapters will demonstrate how to set up secure
|
||||
access.
|
||||
|
||||
After a Consul agent restart, your servers should only be talking TLS.
|
||||
|
||||
### Step 2: Setup Consul clients with certificates
|
||||
|
||||
Now copy the following files to your Consul clients:
|
||||
|
||||
- `consul-agent-ca.pem`: CA public certificate.
|
||||
- `dc1-client-consul-0.pem`: Consul client node public certificate.
|
||||
- `dc1-client-consul-0-key.pem`: Consul client node private key.
|
||||
|
||||
Here is an example agent TLS configuration for Consul agents which mentions the
|
||||
copied files:
|
||||
|
||||
```json
|
||||
{
|
||||
"verify_incoming": true,
|
||||
"verify_outgoing": true,
|
||||
"verify_server_hostname": true,
|
||||
"ca_file": "consul-agent-ca.pem",
|
||||
"cert_file": "dc1-client-consul-0.pem",
|
||||
"key_file": "dc1-client-consul-0-key.pem",
|
||||
"ports": {
|
||||
"http": -1,
|
||||
"https": 8501
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
This configuration disables the HTTP port to make sure there is only encrypted
|
||||
communication. Existing clients that are not yet prepared to talk HTTPS won't be
|
||||
able to connect afterwards. This also affects built-in tooling like `consul members` and the UI. The next chapters will demonstrate how to set up secure
|
||||
access.
|
||||
|
||||
After a Consul agent restart, your agents should only be talking TLS.
|
||||
|
||||
## Configuring the Consul CLI for HTTPS
|
||||
|
||||
If your cluster is configured to only communicate via HTTPS, you will need to
|
||||
create additional certificates in order to be able to continue to access the API
|
||||
and the UI:
|
||||
|
||||
```shell
|
||||
$ consul tls cert create -cli
|
||||
==> Using consul-agent-ca.pem and consul-agent-ca-key.pem
|
||||
==> Saved dc1-cli-consul-0.pem
|
||||
==> Saved dc1-cli-consul-0-key.pem
|
||||
```
|
||||
|
||||
If you try to list the members of your cluster, the CLI will return an error:
|
||||
|
||||
```shell
|
||||
$ consul members
|
||||
Error retrieving members:
|
||||
Get http://127.0.0.1:8500/v1/agent/members?segment=_all:
|
||||
dial tcp 127.0.0.1:8500: connect: connection refused
|
||||
$ consul members -http-addr="https://localhost:8501"
|
||||
Error retrieving members:
|
||||
Get https://localhost:8501/v1/agent/members?segment=_all:
|
||||
x509: certificate signed by unknown authority
|
||||
```
|
||||
|
||||
But it will work again if you provide the certificates you created:
|
||||
|
||||
```shell
|
||||
$ consul members -ca-file=consul-agent-ca.pem -client-cert=dc1-cli-consul-0.pem \
|
||||
-client-key=dc1-cli-consul-0-key.pem -http-addr="https://localhost:8501"
|
||||
Node Address Status Type Build Protocol DC Segment
|
||||
...
|
||||
```
|
||||
|
||||
This process can be cumbersome to type each time, so the Consul CLI also
|
||||
searches environment variables for default values. Set the following
|
||||
environment variables in your shell:
|
||||
|
||||
```shell
|
||||
$ export CONSUL_HTTP_ADDR=https://localhost:8501
|
||||
$ export CONSUL_CACERT=consul-agent-ca.pem
|
||||
$ export CONSUL_CLIENT_CERT=dc1-cli-consul-0.pem
|
||||
$ export CONSUL_CLIENT_KEY=dc1-cli-consul-0-key.pem
|
||||
```
|
||||
|
||||
- `CONSUL_HTTP_ADDR` is the URL of the Consul agent and sets the default for
|
||||
`-http-addr`.
|
||||
- `CONSUL_CACERT` is the location of your CA certificate and sets the default
|
||||
for `-ca-file`.
|
||||
- `CONSUL_CLIENT_CERT` is the location of your CLI certificate and sets the
|
||||
default for `-client-cert`.
|
||||
- `CONSUL_CLIENT_KEY` is the location of your CLI key and sets the default for
|
||||
`-client-key`.
|
||||
|
||||
After these environment variables are correctly configured, the CLI will
|
||||
respond as expected.
|
||||
|
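The exports above use relative file names, so they only resolve when the CLI is run from the directory that holds the certificates. As a minimal sketch, the helper below prints the same four exports with a directory prefix so they can be saved to a file and sourced from anywhere. The function name `consul_cli_env` and the directory argument are illustrative assumptions, not part of Consul.

```shell
# Print the four CLI environment exports for a given certificate directory.
# `consul_cli_env` is a hypothetical helper; the file names are the ones
# created earlier in this guide.
consul_cli_env() (
  cert_dir="${1:?usage: consul_cli_env <cert-dir> [http-addr]}"
  addr="${2:-https://localhost:8501}"
  printf 'export CONSUL_HTTP_ADDR="%s"\n' "$addr"
  printf 'export CONSUL_CACERT="%s/consul-agent-ca.pem"\n' "$cert_dir"
  printf 'export CONSUL_CLIENT_CERT="%s/dc1-cli-consul-0.pem"\n' "$cert_dir"
  printf 'export CONSUL_CLIENT_KEY="%s/dc1-cli-consul-0-key.pem"\n' "$cert_dir"
)
```

For example, `consul_cli_env /etc/consul.d/certs > consul-cli.env` followed by `. ./consul-cli.env` would load the variables into the current shell; the directory shown is only a placeholder.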
||||
### Note on SANs for Server and Client Certificates
|
||||
|
||||
Using `localhost` and `127.0.0.1` as `Subject Alternative Names` in server
|
||||
and client certificates allows tools like `curl` to be able to communicate with
|
||||
Consul's HTTPS API when run on the same host. Other SANs may be added during
|
||||
server/client certificate creation with `-additional-dnsname` or
|
||||
`-additional-ipaddress` to allow remote HTTPS requests from other hosts.
|
||||
|
||||
## Configuring the Consul UI for HTTPS
|
||||
|
||||
If your servers and clients are now configured as above, you won't be able to
|
||||
access the built-in UI anymore. We recommend that you pick one (or two for
|
||||
availability) Consul agent you want to run the UI on and follow the instructions
|
||||
to get the UI up and running again.
|
||||
|
||||
### Step 1: Which interface to bind to?
|
||||
|
||||
Depending on your setup, you might need to change the interface you are
|
||||
binding to, because that is `127.0.0.1` by default for the UI. You can do so via either the
|
||||
[`addresses.https`](/docs/agent/options#https) or
|
||||
[`client_addr`](/docs/agent/options#client_addr) option, which also impacts
|
||||
the DNS server. The Consul UI is unprotected, which means you need to put some
|
||||
auth in front of it if you want to make it publicly available!
|
||||
|
||||
Binding to `0.0.0.0` should work:
|
||||
|
||||
```json
|
||||
{
|
||||
"ui": true,
|
||||
"client_addr": "0.0.0.0",
|
||||
"enable_script_checks": false,
|
||||
"disable_remote_exec": true
|
||||
}
|
||||
```
|
||||
|
||||
~> Since your Consul agent is now available to the network, please make sure
|
||||
that [`enable_script_checks`](/docs/agent/options#_enable_script_checks) is
|
||||
set to `false` and
|
||||
[`disable_remote_exec`](https://www.consul.io/docs/agent/options.html#disable_remote_exec)
|
||||
is set to `true`.
|
||||
|
||||
### Step 2: verify_incoming_rpc
|
||||
|
||||
Your Consul agent will deny the connection straight away because
|
||||
`verify_incoming` is enabled.
|
||||
|
||||
> If set to true, Consul requires that all incoming connections make use of TLS
|
||||
> and that the client provides a certificate signed by a Certificate Authority
|
||||
> from the ca_file or ca_path. This applies to both server RPC and to the HTTPS
|
||||
> API.
|
||||
|
||||
Since the browser doesn't present a certificate signed by our CA, you cannot
|
||||
access the UI. If you `curl` your HTTPS UI the following happens:
|
||||
|
||||
```shell
|
||||
$ curl https://localhost:8501/ui/ -k -I
|
||||
curl: (35) error:14094412:SSL routines:SSL3_READ_BYTES:sslv3 alert bad certificate
|
||||
```
|
||||
|
||||
This is the Consul HTTPS server denying your connection because you are not
|
||||
presenting a client certificate signed by your Consul CA. There is a combination
|
||||
of options however that allows us to keep using `verify_incoming` for RPC, but
|
||||
not for HTTPS:
|
||||
|
||||
```json
|
||||
{
|
||||
"verify_incoming": false,
|
||||
"verify_incoming_rpc": true
|
||||
}
|
||||
```
|
||||
|
||||
~> This is the only time we are changing the value of the existing option
|
||||
`verify_incoming` to false. Make sure to only change it on the agent running the
|
||||
UI!
|
||||
|
||||
With the new configuration, it should work:
|
||||
|
||||
```shell
|
||||
$ curl https://localhost:8501/ui/ -k -I
|
||||
HTTP/2 200
|
||||
...
|
||||
```
|
||||
|
||||
### Step 3: Subject Alternative Name
|
||||
|
||||
This step will take care of setting up the domain you want to use to access the
|
||||
Consul UI. Unless you only need to access the UI over localhost or 127.0.0.1 you
|
||||
will need to complete this step.
|
||||
|
||||
```shell
|
||||
$ curl https://consul.example.com:8501/ui/ \
|
||||
--resolve 'consul.example.com:8501:127.0.0.1' \
|
||||
--cacert consul-agent-ca.pem
|
||||
curl: (51) SSL: no alternative certificate subject name matches target host name 'consul.example.com'
|
||||
...
|
||||
```
|
||||
|
||||
The above command simulates the request a browser makes when you try to
|
||||
use the domain `consul.example.com` to access your UI. The problem this time is
|
||||
that your domain is not in the `Subject Alternative Name` of the certificate. We can
|
||||
fix that by creating a certificate that has our domain:
|
||||
|
||||
```shell
|
||||
$ consul tls cert create -server -additional-dnsname consul.example.com
|
||||
...
|
||||
```
|
||||
|
||||
And if you put your new cert into the configuration of the agent you picked to
|
||||
serve the UI and restart Consul, it works now:
|
||||
|
||||
```shell
|
||||
$ curl https://consul.example.com:8501/ui/ \
|
||||
--resolve 'consul.example.com:8501:127.0.0.1' \
|
||||
--cacert consul-agent-ca.pem -I
|
||||
HTTP/2 200
|
||||
...
|
||||
```
|
||||
|
||||
### Step 4: Trust the Consul CA
|
||||
|
||||
So far we have provided curl with our CA so that it can verify the connection,
|
||||
but if we stop doing that it will complain, and so will your browser if you visit
|
||||
your UI on https://consul.example.com:
|
||||
|
||||
```shell
|
||||
$ curl https://consul.example.com:8501/ui/ \
|
||||
--resolve 'consul.example.com:8501:127.0.0.1'
|
||||
curl: (60) SSL certificate problem: unable to get local issuer certificate
|
||||
...
|
||||
```
|
||||
|
||||
You can fix that by trusting your Consul CA (`consul-agent-ca.pem`) on your machine.
|
||||
The exact steps depend on your operating system, so consult its documentation.
|
||||
|
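As a rough illustration of what trusting the CA usually involves, the sketch below prints (but does not run) the commonly used commands for three platforms so you can review them first. The helper name `trust_ca_commands` and the `debian`, `rhel`, and `macos` labels are assumptions made for this example; check your platform's documentation before running anything, and note that some browsers keep their own trust store.

```shell
# Print the commands that add a CA certificate to the system trust store.
# Nothing is executed; review the output and run it yourself.
trust_ca_commands() (
  family="${1:?usage: trust_ca_commands <debian|rhel|macos> [ca-file]}"
  ca="${2:-consul-agent-ca.pem}"
  case "$family" in
    debian)
      # update-ca-certificates only picks up files ending in .crt
      echo "sudo cp $ca /usr/local/share/ca-certificates/consul-agent-ca.crt"
      echo "sudo update-ca-certificates"
      ;;
    rhel)
      echo "sudo cp $ca /etc/pki/ca-trust/source/anchors/"
      echo "sudo update-ca-trust extract"
      ;;
    macos)
      echo "sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain $ca"
      ;;
    *)
      echo "unsupported family: $family" >&2
      return 1
      ;;
  esac
)
```

Running `trust_ca_commands debian` lists the two commands for Debian-based systems. Once the CA is trusted, `curl` no longer needs `--cacert`.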
||||
```shell
|
||||
$ curl https://consul.example.com:8501/ui/ \
|
||||
--resolve 'consul.example.com:8501:127.0.0.1' -I
|
||||
HTTP/2 200
|
||||
...
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
When you have completed this guide, your Consul cluster will have TLS enabled
|
||||
and will encrypt all RPC and HTTP traffic (assuming you disabled the HTTP port).
|
||||
The other prerequisites for a secure Consul deployment are:
|
||||
|
||||
- [Enable gossip encryption](/docs/agent/encryption#gossip-encryption)
|
||||
- [Configure ACLs][acl] with default deny
|
||||
|
||||
[letsencrypt]: https://letsencrypt.org/
|
||||
[vault]: https://www.vaultproject.io/
|
||||
[vault-pki]: https://www.vaultproject.io/docs/secrets/pki
|
||||
[guide]: /docs/agent/encryption.html#configuring-tls-on-an-existing-cluster
|
||||
[acl]: /docs/guides/acl.html
|
|
@ -1,182 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Multiple Datacenters - Basic Federation with the WAN Gossip Pool
|
||||
description: >-
|
||||
One of the key features of Consul is its support for multiple datacenters. The
|
||||
architecture of Consul is designed to promote low coupling of datacenters so
|
||||
that connectivity issues or failure of any datacenter does not impact the
|
||||
availability of Consul in other datacenters. This means each datacenter runs
|
||||
independently, each having a dedicated group of servers and a private LAN
|
||||
gossip pool.
|
||||
---
|
||||
|
||||
# Multiple Datacenters: Basic Federation with the WAN Gossip Pool
|
||||
|
||||
One of the key features of Consul is its support for multiple datacenters.
|
||||
The [architecture](/docs/internals/architecture) of Consul is designed to
|
||||
promote a low coupling of datacenters so that connectivity issues or
|
||||
failure of any datacenter does not impact the availability of Consul in other
|
||||
datacenters. This means each datacenter runs independently, each having a dedicated
|
||||
group of servers and a private LAN [gossip pool](/docs/internals/gossip).
|
||||
|
||||
## The WAN Gossip Pool
|
||||
|
||||
This guide covers the basic form of federating Consul clusters using a single
|
||||
WAN gossip pool, interconnecting all Consul servers.
|
||||
[Consul Enterprise](https://www.hashicorp.com/products/consul/) version 0.8.0 added support
|
||||
for an advanced multiple datacenter capability. Please see the
|
||||
[Advanced Federation Guide](/docs/guides/advanced-federation) for more details.
|
||||
|
||||
## Setup Two Datacenters
|
||||
|
||||
To get started, follow the [
|
||||
Deployment guide](https://learn.hashicorp.com/consul/advanced/day-1-operations/deployment-guide/) to
|
||||
start each datacenter. After bootstrapping, we should have two datacenters now which
|
||||
we can refer to as `dc1` and `dc2`. Note that datacenter names are opaque to Consul;
|
||||
they are simply labels that help human operators reason about the Consul clusters.
|
||||
|
||||
To query the known WAN nodes, we use the [`members`](/docs/commands/members)
|
||||
command with the `-wan` parameter on either datacenter.
|
||||
|
||||
```shell
|
||||
$ consul members -wan
|
||||
```
|
||||
|
||||
This will provide a list of all known members in the WAN gossip pool. In
|
||||
this case, we have not connected the servers so there will be no output.
|
||||
|
||||
`consul members -wan` should
|
||||
only contain server nodes. Client nodes send requests to a datacenter-local server,
|
||||
so they do not participate in WAN gossip. Client requests are forwarded by local
|
||||
servers to a server in the target datacenter as necessary.
|
||||
|
||||
## Join the Servers
|
||||
|
||||
The next step is to ensure that all the server nodes join the WAN gossip pool (including all the servers in all the datacenters).
|
||||
|
||||
```shell
|
||||
$ consul join -wan <server 1> <server 2> ...
|
||||
```
|
||||
|
||||
The [`join`](/docs/commands/join) command is used with the `-wan` flag to indicate
|
||||
we are attempting to join a server in the WAN gossip pool. As with LAN gossip, you only
|
||||
need to join a single existing member, and the gossip protocol will be used to exchange
|
||||
information about all known members. For the initial setup, however, each server
|
||||
will only know about itself and must be added to the cluster. Consul 0.8.0 added WAN join
|
||||
flooding, so if one Consul server in a datacenter joins the WAN, it will automatically
|
||||
join the other servers in its local datacenter that it knows about via the LAN.
|
||||
|
||||
### Persist Join with Retry Join
|
||||
|
||||
In order to persist the `join` information, the following can be added to each server's configuration file, in both datacenters. For example, on the `dc1` server nodes:
|
||||
|
||||
```json
|
||||
"retry_join_wan":[
|
||||
"dc2-server-1",
|
||||
"dc2-server-2"
|
||||
],
|
||||
```
|
||||
|
||||
## Verify Multi-DC Configuration
|
||||
|
||||
Once the join is complete, the [`members`](/docs/commands/members) command can be
|
||||
used to verify that all server nodes are gossiping over the WAN.
|
||||
|
||||
```shell
|
||||
$ consul members -wan
|
||||
Node Address Status Type Build Protocol DC Segment
|
||||
dc1-server-1 127.0.0.1:8701 alive server 1.4.3 2 dc1 <all>
|
||||
dc2-server-1 127.0.0.1:8702 alive server 1.4.3 2 dc2 <all>
|
||||
```
|
||||
|
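To make this check scriptable, the sketch below counts the alive servers per datacenter from the `members` output. It assumes the column layout shown above (`Node Address Status Type Build Protocol DC Segment`); the function name `wan_alive_by_dc` is a hypothetical helper, not a Consul command.

```shell
# Read `consul members -wan` output on stdin and print "<dc> <alive servers>".
# Column positions are an assumption based on the sample output above.
wan_alive_by_dc() {
  awk 'NR > 1 && $3 == "alive" && $4 == "server" { count[$7]++ }
       END { for (dc in count) print dc, count[dc] }' | sort
}
```

It would typically be used as `consul members -wan | wan_alive_by_dc`. A datacenter that is missing from the result, or that shows fewer servers than expected, points at a join that did not complete.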
||||
We can also verify that both datacenters are known using the
|
||||
[HTTP Catalog API](/api/catalog#catalog_datacenters):
|
||||
|
||||
```shell
|
||||
$ curl http://localhost:8500/v1/catalog/datacenters
|
||||
["dc1", "dc2"]
|
||||
```
|
||||
|
||||
As a simple test, you can try to query the nodes in each datacenter:
|
||||
|
||||
```shell
|
||||
$ curl http://localhost:8500/v1/catalog/nodes?dc=dc1
|
||||
{
|
||||
"ID": "ee8b5f7b-9cc1-a382-978c-5ce4b1219a55",
|
||||
"Node": "dc1-server-1",
|
||||
"Address": "127.0.0.1",
|
||||
"Datacenter": "dc1",
|
||||
"TaggedAddresses": {
|
||||
"lan": "127.0.0.1",
|
||||
"wan": "127.0.0.1"
|
||||
},
|
||||
"Meta": {
|
||||
"consul-network-segment": ""
|
||||
},
|
||||
"CreateIndex": 12,
|
||||
"ModifyIndex": 14
|
||||
}
|
||||
```
|
||||
|
||||
```shell
|
||||
$ curl http://localhost:8500/v1/catalog/nodes?dc=dc2
|
||||
{
|
||||
"ID": "ee8b5f7b-9cc1-a382-978c-5ce4b1219a55",
|
||||
"Node": "dc2-server-1",
|
||||
"Address": "127.0.0.1",
|
||||
"Datacenter": "dc2",
|
||||
"TaggedAddresses": {
|
||||
"lan": "127.0.0.1",
|
||||
"wan": "127.0.0.1"
|
||||
},
|
||||
"Meta": {
|
||||
"consul-network-segment": ""
|
||||
},
|
||||
"CreateIndex": 11,
|
||||
"ModifyIndex": 16
|
||||
}
|
||||
```
|
||||
|
||||
## Network Configuration
|
||||
|
||||
There are a few networking requirements that must be satisfied for this to
|
||||
work. Of course, all server nodes must be able to talk to each other. Otherwise,
|
||||
the gossip protocol as well as RPC forwarding will not work. If service discovery
|
||||
is to be used across datacenters, the network must be able to route traffic
|
||||
between IP addresses across regions as well. Usually, this means that all datacenters
|
||||
must be connected using a VPN or other tunneling mechanism. Consul does not handle
|
||||
VPN or NAT traversal for you.
|
||||
|
||||
Note that for RPC forwarding to work the bind address must be accessible from remote nodes.
|
||||
Configuring `serf_wan`, `advertise_wan_addr` and `translate_wan_addrs` can lead to a
|
||||
situation where `consul members -wan` lists remote nodes but RPC operations fail with one
|
||||
of the following errors:
|
||||
|
||||
- `No path to datacenter`
|
||||
- `rpc error getting client: failed to get conn: dial tcp <LOCAL_ADDR>:0-><REMOTE_ADDR>:<REMOTE_RPC_PORT>: i/o timeout`
|
||||
|
||||
The most likely cause of these errors is that `bind_addr` is set to a private address, preventing
|
||||
the RPC server from accepting connections across the WAN. Setting `bind_addr` to a public
|
||||
address (or one that can be routed across the WAN) will resolve this issue. Be aware that
|
||||
exposing the RPC server on a public port should only be done **after** firewall rules have
|
||||
been established.
|
||||
|
||||
The [`translate_wan_addrs`](/docs/agent/options#translate_wan_addrs) configuration
|
||||
provides a basic address rewriting capability.
|
||||
|
||||
## Data Replication
|
||||
|
||||
In general, data is not replicated between different Consul datacenters. When a
|
||||
request is made for a resource in another datacenter, the local Consul servers forward
|
||||
an RPC request to the remote Consul servers for that resource and return the results.
|
||||
If the remote datacenter is not available, then those resources will also not be
|
||||
available, but that won't otherwise affect the local datacenter. There are some special
|
||||
situations where a limited subset of data can be replicated, such as with Consul's built-in
|
||||
[ACL replication](/docs/guides/acl#outages-and-acl-replication) capability, or
|
||||
external tools like [consul-replicate](https://github.com/hashicorp/consul-replicate/).
|
||||
|
||||
## Summary
|
||||
|
||||
In this guide you set up WAN gossip across two datacenters to create
|
||||
basic federation. You also used the Consul HTTP API to ensure the
|
||||
datacenters were properly configured.
|
|
@ -1,279 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Consul Deployment Guide
|
||||
description: |-
|
||||
This deployment guide covers the steps required to install and
|
||||
configure a single HashiCorp Consul cluster as defined in the
|
||||
Consul Reference Architecture.
|
||||
ea_version: 1.4
|
||||
---
|
||||
|
||||
# Consul Deployment Guide
|
||||
|
||||
This deployment guide covers the steps required to install and configure a single HashiCorp Consul cluster as defined in the [Consul Reference Architecture](/docs/guides/deployment).
|
||||
|
||||
These instructions are for installing and configuring Consul on Linux hosts running the systemd system and service manager.
|
||||
|
||||
## Reference Material
|
||||
|
||||
This deployment guide is designed to work in combination with the [Consul Reference Architecture](/docs/guides/deployment). Although following the Consul Reference Architecture is not a strict requirement, please ensure you are familiar with the overall architecture design; for example, installing Consul server agents on multiple physical or virtual (with correct anti-affinity) hosts for high availability.
|
||||
|
||||
## Overview
|
||||
|
||||
To provide a highly-available single cluster architecture, we recommend Consul server agents be deployed to more than one host, as shown in the [Consul Reference Architecture](/docs/guides/deployment).
|
||||
|
||||
![Reference Diagram](/img/consul-arch-single.png 'Reference Diagram')
|
||||
|
||||
These setup steps should be completed on all Consul hosts.
|
||||
|
||||
- [Download Consul](#download-consul)
|
||||
- [Install Consul](#install-consul)
|
||||
- [Configure systemd](#configure-systemd)
|
||||
- Configure Consul [(server)](#configure-consul-server-) or [(client)](#configure-consul-client-)
|
||||
- [Start Consul](#start-consul)
|
||||
|
||||
## Download Consul
|
||||
|
||||
Precompiled Consul binaries are available for download at [https://releases.hashicorp.com/consul/](https://releases.hashicorp.com/consul/) and Consul Enterprise binaries are available for download by following the instructions made available to HashiCorp Consul customers.
|
||||
|
||||
You should perform checksum verification of the zip packages using the SHA256SUMS and SHA256SUMS.sig files available for the specific release version. HashiCorp provides [a guide on checksum verification](https://www.hashicorp.com/security.html) for precompiled binaries.
|
||||
|
||||
```text
|
||||
CONSUL_VERSION="x.x.x"
|
||||
curl --silent --remote-name https://releases.hashicorp.com/consul/${CONSUL_VERSION}/consul_${CONSUL_VERSION}_linux_amd64.zip
|
||||
curl --silent --remote-name https://releases.hashicorp.com/consul/${CONSUL_VERSION}/consul_${CONSUL_VERSION}_SHA256SUMS
|
||||
curl --silent --remote-name https://releases.hashicorp.com/consul/${CONSUL_VERSION}/consul_${CONSUL_VERSION}_SHA256SUMS.sig
|
||||
```
|
||||
|
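The commands above download the checksum file but do not use it. A minimal sketch of the checksum step follows, assuming GNU `sha256sum` is installed and that the zip and the `SHA256SUMS` file sit in the current directory. The helper name `verify_consul_zip` is illustrative. Verifying the signature of the `SHA256SUMS` file itself is a separate step that is covered by the HashiCorp guide linked above and is not shown here.

```shell
# Check the downloaded zip against its line in the SHA256SUMS file.
# `verify_consul_zip` is a hypothetical helper, not part of Consul.
verify_consul_zip() (
  version="${1:?usage: verify_consul_zip <version>}"
  zip="consul_${version}_linux_amd64.zip"
  sums="consul_${version}_SHA256SUMS"
  # Keep only the entry for our zip; sha256sum -c exits non-zero on a
  # mismatch or when no matching entry is found.
  grep -F "$zip" "$sums" | sha256sum -c -
)
```

With `CONSUL_VERSION` set as above, `verify_consul_zip "${CONSUL_VERSION}"` should print the file name followed by `OK` when the archive is intact.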
||||
## Install Consul
|
||||
|
||||
Unzip the downloaded package and move the `consul` binary to `/usr/local/bin/`. Check `consul` is available on the system path.
|
||||
|
||||
```text
|
||||
unzip consul_${CONSUL_VERSION}_linux_amd64.zip
|
||||
sudo chown root:root consul
|
||||
sudo mv consul /usr/local/bin/
|
||||
consul --version
|
||||
```
|
||||
|
||||
The `consul` command features opt-in autocompletion for flags, subcommands, and arguments (where supported). Enable autocompletion.
|
||||
|
||||
```text
|
||||
consul -autocomplete-install
|
||||
complete -C /usr/local/bin/consul consul
|
||||
```
|
||||
|
||||
Create a unique, non-privileged system user to run Consul and create its data directory.
|
||||
|
||||
```text
|
||||
sudo useradd --system --home /etc/consul.d --shell /bin/false consul
|
||||
sudo mkdir --parents /opt/consul
|
||||
sudo chown --recursive consul:consul /opt/consul
|
||||
```
|
||||
|
||||
## Configure systemd
|
||||
|
||||
Systemd uses [documented sane defaults](https://www.freedesktop.org/software/systemd/man/systemd.directives.html) so only non-default values must be set in the configuration file.
|
||||
|
||||
Create a Consul service file at `/etc/systemd/system/consul.service`.
|
||||
|
||||
```text
|
||||
sudo touch /etc/systemd/system/consul.service
|
||||
```
|
||||
|
||||
Add this configuration to the Consul service file:
|
||||
|
||||
```text
|
||||
[Unit]
|
||||
Description="HashiCorp Consul - A service mesh solution"
|
||||
Documentation=https://www.consul.io/
|
||||
Requires=network-online.target
|
||||
After=network-online.target
|
||||
ConditionFileNotEmpty=/etc/consul.d/consul.hcl
|
||||
|
||||
[Service]
|
||||
Type=notify
|
||||
User=consul
|
||||
Group=consul
|
||||
ExecStart=/usr/local/bin/consul agent -config-dir=/etc/consul.d/
|
||||
ExecReload=/usr/local/bin/consul reload
|
||||
KillMode=process
|
||||
Restart=on-failure
|
||||
LimitNOFILE=65536
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
```
|
||||
|
||||
The following parameters are set for the `[Unit]` stanza:
|
||||
|
||||
- [`Description`](https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Description=) - Free-form string describing the consul service
|
||||
- [`Documentation`](https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Documentation=) - Link to the consul documentation
|
||||
- [`Requires`](https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Requires=) - Configure a requirement dependency on the network service
|
||||
- [`After`](https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Before=) - Configure an ordering dependency on the network service being started before the consul service
|
||||
- [`ConditionFileNotEmpty`](https://www.freedesktop.org/software/systemd/man/systemd.unit.html#ConditionArchitecture=) - Check for a non-zero sized configuration file before consul is started
|
||||
|
||||
The following parameters are set for the `[Service]` stanza:
|
||||
|
||||
- [`User`, `Group`](https://www.freedesktop.org/software/systemd/man/systemd.exec.html#User=) - Run consul as the consul user
|
||||
- [`ExecStart`](https://www.freedesktop.org/software/systemd/man/systemd.service.html#ExecStart=) - Start consul with the `agent` argument and path to the configuration file
|
||||
- [`ExecReload`](https://www.freedesktop.org/software/systemd/man/systemd.service.html#ExecReload=) - Send consul a reload signal to trigger a configuration reload in consul
|
||||
- [`KillMode`](https://www.freedesktop.org/software/systemd/man/systemd.kill.html#KillMode=) - Treat consul as a single process
|
||||
- [`Restart`](https://www.freedesktop.org/software/systemd/man/systemd.service.html#RestartSec=) - Restart consul unless it returned a clean exit code
|
||||
- [`LimitNOFILE`](https://www.freedesktop.org/software/systemd/man/systemd.exec.html#Process%20Properties) - Set an increased Limit for File Descriptors
|
||||
|
||||
The following parameters are set for the `[Install]` stanza:
|
||||
|
||||
- [`WantedBy`](https://www.freedesktop.org/software/systemd/man/systemd.unit.html#WantedBy=) - Creates a weak dependency on consul being started by the multi-user run level
|
||||
|
||||
## Configure Consul (server)
|
||||
|
||||
Consul uses [documented sane defaults](/docs/agent/options) so only non-default values must be set in the configuration file. Configuration can be read from multiple files and is loaded in lexical order. See the [full description](/docs/agent/options) for more information about configuration loading and merge semantics.
|
||||
|
||||
Consul server agents typically require a superset of configuration required by Consul client agents. We will specify common configuration used by all Consul agents in `consul.hcl` and server specific configuration in `server.hcl`.
|
||||
|
||||
### General configuration
|
||||
|
||||
Create a configuration file at `/etc/consul.d/consul.hcl`:
|
||||
|
||||
```text
|
||||
sudo mkdir --parents /etc/consul.d
|
||||
sudo touch /etc/consul.d/consul.hcl
|
||||
sudo chown --recursive consul:consul /etc/consul.d
|
||||
sudo chmod 640 /etc/consul.d/consul.hcl
|
||||
```
|
||||
|
||||
Add this configuration to the `consul.hcl` configuration file:
|
||||
|
||||
~> **NOTE** Replace the `datacenter` parameter value with the identifier you will use for the datacenter this Consul cluster is deployed in. Replace the `encrypt` parameter value with the output from running `consul keygen` on any host with the `consul` binary installed.
|
||||
|
||||
```hcl
|
||||
datacenter = "dc1"
|
||||
data_dir = "/opt/consul"
|
||||
encrypt = "pUqJrVyVRj5jsiYEkM/tFQYfWyJIv4s3XkvDwy7Cu5s="
|
||||
```
|
||||
|
||||
- [`datacenter`](/docs/agent/options#_datacenter) - The datacenter in which the agent is running.
|
||||
- [`data_dir`](/docs/agent/options#_data_dir) - The data directory for the agent to store state.
|
||||
- [`encrypt`](/docs/agent/options#_encrypt) - Specifies the secret key to use for encryption of Consul network traffic.
|
||||
|
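Before distributing an `encrypt` value, it can help to confirm that it decodes cleanly. The sketch below prints the decoded length of a base64 key in bytes. The helper name `gossip_key_bytes` is an assumption for this example, and the expectation that a valid key decodes to an AES key size (16, 24, or 32 bytes) should be checked against the documentation for your Consul version.

```shell
# Print the decoded length, in bytes, of a base64-encoded gossip key.
# `gossip_key_bytes` is a hypothetical helper, not a Consul command.
gossip_key_bytes() {
  printf '%s' "$1" | base64 --decode | wc -c | tr -d ' '
}
```

For the example key shown above, `gossip_key_bytes "pUqJrVyVRj5jsiYEkM/tFQYfWyJIv4s3XkvDwy7Cu5s="` reports 32 bytes. Never reuse the example key in a real cluster; generate your own with `consul keygen`.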
||||
### ACL configuration
|
||||
|
||||
The [ACL](/docs/guides/acl) guide provides instructions on configuring and enabling ACLs.
|
||||
|
||||
### Cluster auto-join
|
||||
|
||||
The `retry_join` parameter allows you to configure all Consul agents to automatically form a cluster using a common Consul server accessed via DNS address, IP address or using Cloud Auto-join. This removes the need to manually join the Consul cluster nodes together.
|
||||
|
||||
Add the retry_join parameter to the `consul.hcl` configuration file:
|
||||
|
||||
~> **NOTE** Replace the `retry_join` parameter value with the correct DNS address, IP address or [cloud auto-join configuration](/docs/agent/cloud-auto-join) for your environment.
|
||||
|
||||
```hcl
|
||||
retry_join = ["172.16.0.11"]
|
||||
```
|
||||
|
||||
- [`retry_join`](/docs/agent/options#retry-join) - Address of another agent to join upon starting up.
|
||||
|
||||
### Performance stanza
|
||||
|
||||
The [`performance`](/docs/agent/options#performance) stanza allows tuning the performance of different subsystems in Consul.
|
||||
|
||||
Add the performance configuration to the `consul.hcl` configuration file:
|
||||
|
||||
```hcl
|
||||
performance {
|
||||
raft_multiplier = 1
|
||||
}
|
||||
```
|
||||
|
||||
- [`raft_multiplier`](/docs/agent/options#raft_multiplier) - An integer multiplier used by Consul servers to scale key Raft timing parameters. Setting this to a value of 1 will configure Raft to its highest-performance mode, equivalent to the default timing of Consul prior to 0.7, and is recommended for production Consul servers.
|
||||
|
||||
For more information on Raft tuning and the `raft_multiplier` setting, see the [server performance](/docs/install/performance) documentation.
|
||||
|
||||
### Telemetry stanza
|
||||
|
||||
The [`telemetry`](/docs/agent/options#telemetry) stanza specifies various configurations for Consul to publish metrics to upstream systems.
|
||||
|
||||
If you decide to configure Consul to publish telemetry data, you should review the [telemetry configuration section](/docs/agent/options#telemetry) of our documentation.
|
||||
|
||||
### TLS configuration
|
||||
|
||||
The [Creating Certificates](/docs/guides/creating-certificates) guide provides instructions on configuring and enabling TLS.
|
||||
|
||||
### Server configuration
|
||||
|
||||
Create a configuration file at `/etc/consul.d/server.hcl`:
|
||||
|
||||
```text
|
||||
sudo mkdir --parents /etc/consul.d
|
||||
sudo touch /etc/consul.d/server.hcl
|
||||
sudo chown --recursive consul:consul /etc/consul.d
|
||||
sudo chmod 640 /etc/consul.d/server.hcl
|
||||
```
|
||||
|
||||
Add this configuration to the `server.hcl` configuration file:
|
||||
|
||||
~> **NOTE** Replace the `bootstrap_expect` value with the number of Consul servers you will use; three or five [is recommended](/docs/internals/consensus#deployment-table).
|
||||
|
||||
```hcl
|
||||
server = true
|
||||
bootstrap_expect = 3
|
||||
```
|
||||
|
||||
- [`server`](/docs/agent/options#_server) - This flag is used to control if an agent is in server or client mode.
|
||||
- [`bootstrap_expect`](/docs/agent/options#_bootstrap_expect) - This flag provides the number of expected servers in the datacenter. Either this value should not be provided or the value must agree with other servers in the cluster.
|
||||
|
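The recommendation of three or five servers follows from how quorum works: a majority of servers must be available, so the cluster tolerates losing only the remainder. The sketch below computes both numbers for a given server count. The function names are illustrative and the arithmetic is the usual majority rule, not output from Consul.

```shell
# Majority quorum and the number of server failures a cluster can absorb.
quorum() { echo $(( $1 / 2 + 1 )); }
failure_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

# Print one line per cluster size from 1 to 7 servers.
print_quorum_table() {
  for n in 1 2 3 4 5 6 7; do
    echo "servers=$n quorum=$(quorum "$n") failure_tolerance=$(failure_tolerance "$n")"
  done
}
```

Calling `print_quorum_table` shows that four servers tolerate no more failures than three, which is why odd cluster sizes are preferred.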
||||
### Consul UI
|
||||
|
||||
Consul features a web-based user interface, allowing you to easily view all services, nodes, intentions and more using a graphical user interface, rather than the CLI or API.
|
||||
|
||||
~> **NOTE** You should consider running the Consul UI on select Consul hosts rather than all hosts.
|
||||
|
||||
Optionally, add the UI configuration to the `server.hcl` configuration file to enable the Consul UI:
|
||||
|
||||
```hcl
|
||||
ui = true
|
||||
```
|
||||
|
||||
## Configure Consul (client)
|
||||
|
||||
Consul client agents typically require a subset of configuration required by Consul server agents. All Consul clients can use the `consul.hcl` file created when [configuring the Consul servers](#general-configuration). If you have added host-specific configuration such as identifiers, you will need to set these individually.
|
||||
|
||||
## Start Consul
|
||||
|
||||
Enable and start Consul using the systemctl command responsible for controlling systemd managed services. Check the status of the consul service using systemctl.
|
||||
|
||||
```text
|
||||
sudo systemctl enable consul
|
||||
sudo systemctl start consul
|
||||
sudo systemctl status consul
|
||||
```
|
||||
|
||||
## Backups
|
||||
|
||||
Creating server backups is an important step in production deployments. Backups provide a mechanism for the server to recover from an outage (network loss, operator error, or a corrupted data directory). All agents write to the `-data-dir` before commit. This directory persists the local agent’s state and — in the case of servers — it also holds the Raft information.
|
||||
|
||||
Consul provides the [snapshot](/docs/commands/snapshot) command which can be run using the CLI command or the API. The `snapshot` command saves a point-in-time snapshot of the state of the Consul servers which includes KV entries, the service catalog, prepared queries, sessions, and ACLs.
|
||||
|
||||
With [Consul Enterprise](/docs/commands/snapshot/agent), the `snapshot agent` command runs periodically and writes to local or remote storage (such as Amazon S3).
|
||||
|
||||
By default, all snapshots are taken using `consistent` mode where requests are forwarded to the leader which verifies that it is still in power before taking the snapshot. Snapshots will not be saved if the cluster is degraded or if no leader is available. To reduce the burden on the leader, it is possible to [run the snapshot](/docs/commands/snapshot/save) on any non-leader server using `stale` consistency mode:
|
||||
|
||||
```text
|
||||
consul snapshot save -stale backup.snap
|
||||
```
|
||||
|
||||
This spreads the load across nodes at the possible expense of losing full consistency guarantees. Typically this means that a very small number of recent writes may not be included. The omitted writes are typically limited to data written in the last `100ms` or less from the recovery point. This is usually suitable for disaster recovery. However, the system can’t guarantee how stale this may be if executed against a partitioned server.
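
If backups are taken from a scheduled job, the command line can be assembled programmatically. The following Python sketch only wraps the `consul snapshot save` command shown above; the helper names and the timestamped file naming scheme are assumptions for this example, not Consul conventions.

```python
import subprocess
from datetime import datetime, timezone


def snapshot_command(path: str, stale: bool = False) -> list[str]:
    """Build the argv for `consul snapshot save`, optionally with -stale."""
    cmd = ["consul", "snapshot", "save"]
    if stale:
        cmd.append("-stale")
    cmd.append(path)
    return cmd


def backup_name(prefix: str = "backup") -> str:
    """Hypothetical naming scheme: <prefix>-<UTC timestamp>.snap."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix}-{stamp}.snap"


def take_snapshot(stale: bool = False) -> str:
    """Run the snapshot; raises CalledProcessError if consul exits non-zero."""
    path = backup_name()
    subprocess.run(snapshot_command(path, stale=stale), check=True)
    return path
```

Calling `take_snapshot(stale=True)` on a non-leader server corresponds to the `-stale` example above.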
|
||||
|
||||
## Next Steps
|
||||
|
||||
- Read [Monitoring Consul with Telegraf](/docs/guides/monitoring-telegraf)
|
||||
for an example guide to monitoring Consul for improved operational visibility.
|
||||
|
||||
- Read [Outage Recovery](/docs/guides/outage) to learn the steps required
|
||||
for recovery from a Consul outage due to a majority of server nodes in a
|
||||
datacenter being lost.
|
||||
|
||||
- Read [Server Performance](/docs/install/performance) to learn about
|
||||
additional configuration that benefits production deployments.
|
|
@ -1,121 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Consul Reference Architecture
|
||||
description: |-
|
||||
This document provides recommended practices and a reference
|
||||
architecture for HashiCorp Consul production deployments.
|
||||
ea_version: 1.4
|
||||
---
|
||||
|
||||
# Consul Reference Architecture
|
||||
|
||||
As applications are migrated to dynamically provisioned infrastructure, scaling services and managing the communications between them becomes challenging. Consul’s service discovery capabilities provide the connectivity between dynamic applications. Consul also monitors the health of each node and its applications to ensure that only healthy service instances are discovered. Consul’s distributed runtime configuration store allows updates across global infrastructure.
|
||||
|
||||
This document provides recommended practices and a reference architecture, including system requirements, datacenter design, networking, and performance optimizations for Consul production deployments.
|
||||
|
||||
## Infrastructure Requirements
|
||||
|
||||
### Consul Servers
|
||||
|
||||
Consul server agents are responsible for maintaining the cluster state, responding to RPC queries (read operations), and for processing all write operations. Given that Consul server agents do most of the heavy lifting, server sizing is critical for the overall performance efficiency and health of the Consul cluster.
|
||||
|
||||
The following table provides high-level server guidelines. Of particular
|
||||
note is the strong recommendation to avoid non-fixed performance CPUs,
|
||||
or "Burstable CPU".
|
||||
|
||||
| Type  | CPU      | Memory       | Disk   | Typical Cloud Instance Types              |
| ----- | -------- | ------------ | ------ | ----------------------------------------- |
| Small | 2 core   | 8-16 GB RAM  | 50 GB  | **AWS**: m5.large, m5.xlarge              |
|       |          |              |        | **Azure**: Standard_A4_v2, Standard_A8_v2 |
|       |          |              |        | **GCE**: n1-standard-8, n1-standard-16    |
| Large | 4-8 core | 32-64 GB RAM | 100 GB | **AWS**: m5.2xlarge, m5.4xlarge           |
|       |          |              |        | **Azure**: Standard_D4_v3, Standard_D5_v3 |
|       |          |              |        | **GCE**: n1-standard-32, n1-standard-64   |
|
||||
|
||||
#### Hardware Sizing Considerations
|
||||
|
||||
- The small size would be appropriate for most initial production
|
||||
deployments, or for development/testing environments.
|
||||
|
||||
- The large size is for production environments where there is a
|
||||
consistently high workload.
|
||||
|
||||
~> **NOTE** For large workloads, ensure that the disks support a high number of IOPS to keep up with the rapid Raft log update rate.
|
||||
|
||||
For more information on server requirements, review the [server performance](/docs/install/performance) documentation.
|
||||
|
||||
## Infrastructure Diagram
|
||||
|
||||
![Reference Diagram](/img/consul-arch.png 'Reference Diagram')
|
||||
|
||||
## Datacenter Design
|
||||
|
||||
A Consul cluster (typically three or five servers plus client agents) may be deployed in a single physical datacenter or it may span multiple datacenters. For a large cluster with high runtime reads and writes, deploying servers in the same physical location improves performance. In cloud environments, a single datacenter may be deployed across multiple availability zones, for example with each server on its own host in a separate availability zone. Consul also supports multi-datacenter deployments via separate clusters joined by WAN links. In some cases, one may also deploy two or more Consul clusters in the same LAN environment.
|
||||
|
||||
### Single Datacenter
|
||||
|
||||
A single Consul cluster is recommended for applications deployed in the same datacenter. Consul supports traditional three-tier applications as well as microservices.
|
||||
|
||||
Typically, there must be three or five servers to balance between availability and performance. These servers together run the Raft-driven consistent state store for catalog, session, prepared query, ACL, and KV updates.
|
||||
|
||||
The recommended maximum cluster size for a single datacenter is 5,000 nodes. For a write-heavy and/or a read-heavy cluster, the maximum number of nodes may need to be reduced further, considering the impact of the number and the size of KV pairs and the number of watches. The time taken for gossip to converge increases as more client machines are added. Similarly, the time taken by a new server to join an existing multi-thousand node cluster with a large KV store and update rate may increase, because the existing data must be replicated to the new server’s log.
|
||||
|
||||
-> **TIP** For write-heavy clusters, consider scaling vertically with larger machine instances and lower latency storage.
|
||||
|
||||
One must take care to use service tags in a way that assists with the kinds of queries that will be run against the cluster. If two services (e.g. blue and green) are running on the same cluster, appropriate service tags must be used to distinguish between them. If a query is made without tags, nodes running both blue and green services may show up in the results of the query.
|
||||
|
||||
In cases where a full mesh among all agents cannot be established due to network segmentation, Consul’s own [network segments](/docs/enterprise/network-segments) can be used. Network segments is a Consul Enterprise feature that allows the creation of multiple tenants which share Raft servers in the same cluster. Each tenant has its own gossip pool and doesn’t communicate with the agents outside this pool. The KV store, however, is shared between all tenants. If Consul network segments cannot be used, isolation between agents can be accomplished by creating discrete [Consul datacenters](/docs/guides/datacenters).
|
||||
|
||||
### Multiple Datacenters
|
||||
|
||||
Consul clusters in different datacenters running the same service can be joined by WAN links. The clusters operate independently and only communicate over the WAN on port `8302`. Unless explicitly configured via CLI or API, the Consul server will only return results from the local datacenter. Consul does not replicate data between multiple datacenters. The [consul-replicate](https://github.com/hashicorp/consul-replicate) tool can be used to replicate the KV data periodically.
|
||||
|
||||
-> A good practice is to enable TLS server name checking to avoid accidental cross-joining of agents.
|
||||
|
||||
Advanced federation can be achieved with the [network areas](/api/operator/area) feature in Consul Enterprise.
|
||||
|
||||
A typical use case is where datacenter1 (dc1) hosts shared services like LDAP (or the ACL datacenter) which are leveraged by all other datacenters. However, due to compliance issues, servers in dc2 must not connect with servers in dc3. This cannot be accomplished with the basic WAN federation. Basic federation requires that all the servers in dc1, dc2 and dc3 are connected in a full mesh and opens both gossip (`8302 tcp/udp`) and RPC (`8300`) ports for communication.
|
||||
|
||||
Network areas allow peering between datacenters to make the services discoverable over WAN. With network areas, servers in dc1 can communicate with those in dc2 and dc3. However, no connectivity needs to be established between dc2 and dc3 which meets the compliance requirement of the organization in this use case. Servers that are part of the network area communicate over RPC only. This removes the overhead of sharing and maintaining the symmetric key used by the gossip protocol across datacenters. It also reduces the attack surface at the gossip ports since they no longer need to be opened in security gateways or firewalls.
|
||||
|
||||
#### Prepared Queries
|
||||
|
||||
Consul’s [prepared queries](/api/query) allow clients to do a datacenter failover for service discovery. For example, if a service `payment` in the local datacenter dc1 goes down, a prepared query lets users define a geographic fallback order to the nearest datacenter to check for healthy instances of the same service.
|
||||
|
||||
~> **NOTE** Consul clusters must be WAN linked for a prepared query to work across datacenters.
|
||||
|
||||
Prepared queries, by default, resolve the query in the local datacenter first. Querying KV store features is not supported by the prepared query. Prepared queries work with ACL. Prepared query config/templates are maintained consistently in Raft and are executed on the servers.
|
||||
|
||||
#### Connect
|
||||
|
||||
Consul [Connect](/docs/guides/connect-production) supports multi-datacenter connections and replicates [intentions](/docs/connect/intentions). This allows WAN federated DCs to provide connections from source and destination proxies in any DC.
|
||||
|
||||
## Network Connectivity
|
||||
|
||||
LAN gossip occurs between all agents in a single datacenter with each agent sending a periodic probe to random agents from its member list. Agents run in either client or server mode, and both participate in the gossip. The initial probe is sent over UDP every second. If a node fails to acknowledge within `200ms`, the agent pings over TCP. If the TCP probe fails (10 second timeout), it asks a configurable number of random nodes to probe the same node (also known as an indirect probe). If there is no response from the peers regarding the status of the node, that agent is marked as down.
|
||||
|
||||
The agent's status directly affects the service discovery results. If an agent is down, the services it is monitoring will also be marked as down.
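
The probe escalation can be modeled in a few lines. This Python sketch follows the order given in the text (direct UDP, direct TCP, then indirect probes through peers); it is a simplified model, not the actual gossip implementation, and the probe callables are hypothetical stand-ins for real network checks.

```python
from typing import Callable, Iterable

# A probe takes a node name and reports whether the node acknowledged.
Probe = Callable[[str], bool]


def node_is_alive(node: str, udp_probe: Probe, tcp_probe: Probe,
                  indirect_probes: Iterable[Probe]) -> bool:
    """Escalate from a direct UDP probe to TCP, then to indirect probes."""
    if udp_probe(node):   # acknowledged within the UDP timeout (200ms in the text)
        return True
    if tcp_probe(node):   # direct TCP fallback (10 second timeout in the text)
        return True
    # Ask each selected peer to probe the node on our behalf.
    return any(peer(node) for peer in indirect_probes)
```

If every step fails, the function returns `False`, which corresponds to the agent being marked as down.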
|
||||
|
||||
In addition, the agent periodically performs a full state sync over TCP which gossips each agent’s understanding of the member list around it (node names, IP addresses, and health status). These operations are expensive relative to the standard gossip protocol mentioned above and are synced at a rate determined by cluster size to keep overhead low. It's typically between 30 seconds and 5 minutes. For more details, refer to [Serf Gossip docs](https://www.serf.io/docs/internals/gossip.html).
|
||||
|
||||
In a larger network that spans L3 segments, traffic typically traverses through a firewall and/or a router. ACL or firewall rules must be updated to allow the following ports:
|
||||
|
||||
| Name          | Port | Flag                                        | Description                                                                   |
| ------------- | ---- | ------------------------------------------- | ----------------------------------------------------------------------------- |
| Server RPC    | 8300 |                                             | Used by servers to handle incoming requests from other agents. TCP only.      |
| Serf LAN      | 8301 |                                             | Used to handle gossip in the LAN. Required by all agents. TCP and UDP.        |
| Serf WAN      | 8302 | `-1` to disable (available in Consul 1.0.7) | Used by servers to gossip over the LAN and WAN to other servers. TCP and UDP. |
| HTTP API      | 8500 | `-1` to disable                             | Used by clients to talk to the HTTP API. TCP only.                            |
| DNS Interface | 8600 | `-1` to disable                             | Used to resolve DNS queries. TCP and UDP.                                     |
|
||||
|
||||
-> As mentioned in the [datacenter design section](#datacenter-design), network areas and network segments can be used to prevent opening up firewall ports between different subnets.
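
After updating firewall rules, it can be useful to confirm that the TCP side of each port is reachable. The following Python sketch uses only the standard library and the default port numbers from the table above; the dictionary and function names are made up for this example, and UDP reachability is not checked.

```python
import socket

# Default ports from the table above (TCP side only).
CONSUL_TCP_PORTS = {
    "Server RPC": 8300,
    "Serf LAN": 8301,
    "Serf WAN": 8302,
    "HTTP API": 8500,
    "DNS Interface": 8600,
}


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_host(host: str) -> dict[str, bool]:
    """Check every default Consul TCP port on one host."""
    return {name: tcp_port_open(host, port) for name, port in CONSUL_TCP_PORTS.items()}
```

Keep in mind that client agents do not serve the server RPC or Serf WAN ports, and that HTTP and DNS listen only on the local interface by default, so a closed port is not always a firewall problem.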
|
||||
|
||||
By default agents will only listen for HTTP and DNS traffic on the local interface.
|
||||
|
||||
## Next steps
|
||||
|
||||
- Read [Deployment Guide](/docs/guides/deployment-guide) to learn
|
||||
the steps required to install and configure a single HashiCorp Consul cluster.
|
||||
|
||||
- Read [Server Performance](/docs/install/performance) to learn about
|
||||
additional configuration that benefits production deployments.
|
|
@ -1,181 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: DNS Caching
|
||||
description: >-
|
||||
One of the main interfaces to Consul is DNS. Using DNS is a simple way to
|
||||
integrate Consul into an existing infrastructure without any high-touch
|
||||
integration.
|
||||
---
|
||||
|
||||
# DNS Caching
|
||||
|
||||
One of the main interfaces to Consul is DNS. Using DNS is a simple way to
|
||||
integrate Consul into an existing infrastructure without any high-touch
|
||||
integration.
|
||||
|
||||
By default, Consul serves all DNS results with a 0 TTL value. This prevents
|
||||
any caching. The advantage is that each DNS lookup is always re-evaluated,
|
||||
so the most timely information is served. However, this adds a latency hit
|
||||
for each lookup and can potentially exhaust the query throughput of a cluster.
|
||||
For this reason, Consul provides a number of tuning parameters that can
|
||||
customize how DNS queries are handled.
|
||||
|
||||
In this guide, we will review important parameters for tuning
|
||||
stale reads, negative response caching, and TTL. All of the DNS config
|
||||
parameters must be set in the agent's configuration file.
|
||||
|
||||
<a name="stale"></a>
|
||||
|
||||
## Stale Reads
|
||||
|
||||
Stale reads can be used to reduce latency and increase the throughput
|
||||
of DNS queries. The [settings](/docs/agent/options) used to control stale reads
|
||||
are:
|
||||
|
||||
- [`dns_config.allow_stale`](/docs/agent/options#allow_stale) must be
|
||||
set to true to enable stale reads.
|
||||
- [`dns_config.max_stale`](/docs/agent/options#max_stale) limits how stale results
|
||||
are allowed to be when querying DNS.
|
||||
|
||||
With these two settings you can allow or prevent stale reads. Below we will discuss
|
||||
the advantages and disadvantages of both.
|
||||
|
||||
### Allow Stale Reads
|
||||
|
||||
Since Consul 0.7.1, `allow_stale` is enabled by default and uses a `max_stale`
|
||||
value that defaults to a near-indefinite threshold (10 years).
|
||||
This allows DNS queries to continue to be served in the event
|
||||
of a long outage with no leader. A new telemetry counter has also been added at
|
||||
`consul.dns.stale_queries` to track when agents serve DNS queries that are stale
|
||||
by more than 5 seconds.
|
||||
|
||||
```hcl
|
||||
"dns_config" {
|
||||
"allow_stale" = true
|
||||
"max_stale" = "87600h"
|
||||
}
|
||||
```
|
||||
|
||||
~> NOTE: The above example is the default setting. You do not need to set it explicitly.
|
||||
|
||||
Doing a stale read allows any Consul server to
|
||||
service a query, but non-leader nodes may return data that is
|
||||
out-of-date. By allowing data to be slightly stale, we get horizontal
|
||||
read scalability. Now any Consul server can service the request, so we
|
||||
increase throughput by the number of servers in a cluster.
|
||||
|
||||
### Prevent Stale Reads
|
||||
|
||||
If you want to prevent stale reads or limit how stale they can be, you can set `allow_stale`
|
||||
to false or use a lower value for `max_stale`. Doing the first will ensure that
|
||||
all reads are serviced by a [single leader node](/docs/internals/consensus).
|
||||
The reads will then be strongly consistent but will be limited by the throughput
|
||||
of a single node.
|
||||
|
||||
```hcl
|
||||
"dns_config" {
|
||||
"allow_stale" = false
|
||||
}
|
||||
```
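
The interaction between `allow_stale` and `max_stale` can be summarized as a small decision function. This Python sketch is a simplified model of the behavior described in this section, not Consul's code. It assumes that a result staler than `max_stale` is re-queried against the leader, and it uses the 5 second threshold mentioned above for the `consul.dns.stale_queries` counter; the function name and return values are made up for this example.

```python
from datetime import timedelta

# Staleness above which a served query is counted in consul.dns.stale_queries.
STALE_COUNTER_THRESHOLD = timedelta(seconds=5)


def dns_read_plan(allow_stale: bool, max_stale: timedelta,
                  last_contact: timedelta) -> tuple[str, bool]:
    """Return (who answers, whether the stale counter is incremented).

    last_contact is how long ago the answering server last heard from the leader.
    """
    if not allow_stale:
        return ("leader", False)      # strongly consistent read
    if last_contact > max_stale:
        return ("leader", False)      # too stale: assumed to be retried on the leader
    return ("any_server", last_contact > STALE_COUNTER_THRESHOLD)
```

With the default `max_stale` of `87600h`, the second branch is effectively never taken, so DNS keeps answering during a long outage with no leader.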
|
||||
|
||||
## Negative Response Caching
|
||||
|
||||
Some DNS clients cache negative responses - that is, Consul returning a "not
|
||||
found" style response because a service exists but there are no healthy
|
||||
endpoints. In practice, this could mean that the cached negative responses may
|
||||
cause that service to appear "down" for longer than it is actually unavailable
|
||||
when using DNS for service discovery.
|
||||
|
||||
### Configure SOA
|
||||
|
||||
In Consul 1.3.0 and newer, it is now possible to tune SOA
|
||||
responses and modify the negative TTL cache for some resolvers. It can
|
||||
be achieved using the [`soa.min_ttl`](/docs/agent/options#soa_min_ttl)
|
||||
configuration within the [`soa`](/docs/agent/options#soa) configuration.
|
||||
|
||||
```hcl
|
||||
"dns_config" {
|
||||
"soa" {
|
||||
"min_ttl" = "60s"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
One common example is that Windows will default to caching negative responses
|
||||
for 15 minutes. DNS forwarders may also cache negative responses, with the same
|
||||
effect. To avoid this problem, check the negative response cache defaults for
|
||||
your client operating system and any DNS forwarder on the path between the
|
||||
client and Consul and set the cache values appropriately. In many cases
"appropriately" simply means turning negative response caching off to get the best
|
||||
recovery time when a service becomes available again.
|
||||
|
||||
<a name="ttl"></a>
|
||||
|
||||
## TTL Values
|
||||
|
||||
TTL values can be set to allow DNS results to be cached downstream of Consul. Higher
|
||||
TTL values reduce the number of lookups on the Consul servers and speed lookups for
|
||||
clients, at the cost of increasingly stale results. By default, all TTLs are zero,
|
||||
preventing any caching.
|
||||
|
||||
```javascript
{
  "dns_config": {
    "service_ttl": {
      "*": "0s"
    },
    "node_ttl": "0s"
  }
}
```
|
||||
|
||||
### Enable Caching
|
||||
|
||||
To enable caching of node lookups (e.g. "foo.node.consul"), we can set the
|
||||
[`dns_config.node_ttl`](/docs/agent/options#node_ttl) value. This can be set to
|
||||
"10s" for example, and all node lookups will serve results with a 10 second TTL.
|
||||
|
||||
Service TTLs can be specified in a more granular fashion. You can set TTLs
|
||||
per-service, with a wildcard TTL as the default. This is specified using the
|
||||
[`dns_config.service_ttl`](/docs/agent/options#service_ttl) map. The `*` wildcard
is supported at the end of any prefix and has a lower precedence than a strict match,
so `my-service-x` has precedence over `my-service-*`. When performing a wildcard
match, the longest path is taken into account, thus the `my-service-*` TTL will
be used instead of `my-*` or `*`. With the same rule, `*` is the default value
when nothing else matches. If no match is found the TTL defaults to 0.
|
||||
|
||||
For example, a [`dns_config`](/docs/agent/options#dns_config) that provides
|
||||
a wildcard TTL and a specific TTL for a service might look like this:
|
||||
|
||||
```javascript
|
||||
{
|
||||
"dns_config": {
|
||||
"service_ttl": {
|
||||
"*": "5s",
|
||||
"web": "30s",
|
||||
"db*": "10s",
|
||||
"db-master": "3s"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
This sets all lookups to "web.service.consul" to use a 30 second TTL
|
||||
while lookups to "api.service.consul" will use the 5 second TTL from the wildcard.
|
||||
All lookups matching "db\*" would get a 10 second TTL except "db-master"
that would have a 3 second TTL.
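
The precedence rules can be expressed as a short lookup function. This Python sketch mirrors the rules described in this section (exact match first, then the longest matching wildcard prefix, then 0); it is an illustration, not Consul's implementation, and the function name is made up for this example.

```python
def service_ttl(service: str, ttls: dict[str, str]) -> str:
    """Pick the TTL for a service from a service_ttl style map."""
    # A strict (exact) match always wins.
    if service in ttls:
        return ttls[service]
    # Otherwise take the wildcard entry with the longest matching prefix.
    # The bare "*" entry has an empty prefix, so it matches everything last.
    best = None
    for pattern, ttl in ttls.items():
        if not pattern.endswith("*"):
            continue
        prefix = pattern[:-1]
        if service.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, ttl)
    # No match at all: the TTL defaults to zero.
    return best[1] if best else "0s"


EXAMPLE = {"*": "5s", "web": "30s", "db*": "10s", "db-master": "3s"}

if __name__ == "__main__":
    for name in ("web", "api", "db-replica", "db-master"):
        print(name, service_ttl(name, EXAMPLE))
```

Here `db-replica` is a hypothetical service name used to show the `db*` wildcard taking precedence over the bare `*` entry.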
|
||||
|
||||
### Prepared Queries
|
||||
|
||||
[Prepared Queries](/api/query) provide an additional
|
||||
level of control over TTL. They allow for the TTL to be defined along with
|
||||
the query, and they can be changed on the fly by updating the query definition.
|
||||
If a TTL is not configured for a prepared query, then it will fall back to the
|
||||
service-specific configuration defined in the Consul agent as described above,
|
||||
and ultimately to 0 if no TTL is configured for the service in the Consul agent.
|
||||
|
||||
## Summary
|
||||
|
||||
In this guide we covered several of the parameters for tuning DNS queries. We reviewed
|
||||
how to enable or disable stale reads and how to configure the amount of time when stale
|
||||
reads are allowed. We also looked at the minimum TTL configuration options
|
||||
for negative responses from services. Finally, we reviewed how to set up TTLs
|
||||
for service lookups.
|
|
@ -1,81 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: External Services
|
||||
description: >-
|
||||
Very few infrastructures are entirely self-contained. Most rely on a multitude
|
||||
of external service providers. Consul supports this by allowing for the
|
||||
definition of external services, services that are not provided by a local
|
||||
node.
|
||||
---
|
||||
|
||||
# Registering an External Service
|
||||
|
||||
Very few infrastructures are entirely self-contained. Most rely on a multitude
|
||||
of external service providers. Consul supports this by allowing for the definition
|
||||
of external services, services that are not provided by a local node. There's also a
|
||||
companion project called [Consul ESM](https://github.com/hashicorp/consul-esm) which
|
||||
is a daemon that functions as an external service monitor that can help run health
|
||||
checks for external services.
|
||||
|
||||
Most services are registered in Consul through the use of a
|
||||
[service definition](/docs/agent/services). However, this approach registers
|
||||
the local node as the service provider. In the case of external services, we must
|
||||
instead register the service with the catalog rather than as part of a standard
|
||||
node service definition.
|
||||
|
||||
Once registered, the DNS interface will be able to return the appropriate A
|
||||
records or CNAME records for the service. The service will also appear in standard
|
||||
queries against the API. Consul must be configured with a list of
|
||||
[recursors](/docs/agent/options#recursors) for it to be able to resolve
|
||||
external service addresses.
|
||||
|
||||
Let us suppose we want to register a "search" service that is provided by
|
||||
"www.google.com". We might accomplish that like so:
|
||||
|
||||
```text
|
||||
$ curl -X PUT -d '{"Datacenter": "dc1", "Node": "google",
   "Address": "www.google.com",
   "Service": {"Service": "search", "Port": 80}}' \
   http://127.0.0.1:8500/v1/catalog/register
|
||||
```
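
The same registration can be scripted. The following Python sketch builds an equivalent `PUT` request with the standard library; it assumes a local agent on the default HTTP port, and the helper names are made up for this example.

```python
import json
import urllib.request

# Assumption: a local Consul agent listening on the default HTTP port.
CONSUL_HTTP = "http://127.0.0.1:8500"


def register_request(node: str, address: str, service: str, port: int,
                     datacenter: str = "dc1") -> urllib.request.Request:
    """Build the PUT request for /v1/catalog/register without sending it."""
    body = {
        "Datacenter": datacenter,
        "Node": node,
        "Address": address,
        "Service": {"Service": service, "Port": port},
    }
    return urllib.request.Request(
        f"{CONSUL_HTTP}/v1/catalog/register",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

For example, `send(register_request("google", "www.google.com", "search", 80))` performs the registration shown in the `curl` command.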
|
||||
|
||||
Add an upstream DNS server to the list of recursors in Consul's configuration. For example, to use Google's public DNS server:
|
||||
|
||||
```text
|
||||
"recursors":["8.8.8.8"]
|
||||
```
|
||||
|
||||
If we do a DNS lookup now, we can see the new search service:
|
||||
|
||||
```text
|
||||
; <<>> DiG 9.8.3-P1 <<>> @127.0.0.1 -p 8600 search.service.consul.
|
||||
; (1 server found)
|
||||
;; global options: +cmd
|
||||
;; Got answer:
|
||||
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13313
|
||||
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 0
|
||||
|
||||
;; QUESTION SECTION:
|
||||
;search.service.consul. IN A
|
||||
|
||||
;; ANSWER SECTION:
|
||||
search.service.consul. 0 IN CNAME www.google.com.
|
||||
www.google.com. 264 IN A 74.125.239.114
|
||||
www.google.com. 264 IN A 74.125.239.115
|
||||
www.google.com. 264 IN A 74.125.239.116
|
||||
|
||||
;; Query time: 41 msec
|
||||
;; SERVER: 127.0.0.1#8600(127.0.0.1)
|
||||
;; WHEN: Tue Feb 25 17:45:12 2014
|
||||
;; MSG SIZE rcvd: 178
|
||||
```
|
||||
|
||||
If at any time we want to deregister the service, we simply do:
|
||||
|
||||
```text
|
||||
$ curl -X PUT -d '{"Datacenter": "dc1", "Node": "google"}' http://127.0.0.1:8500/v1/catalog/deregister
|
||||
```
|
||||
|
||||
This will deregister the `google` node along with all services it provides.
|
||||
|
||||
For more information, please see the [HTTP Catalog API](/api/catalog).
|
|
@ -1,354 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Forwarding
|
||||
description: >-
|
||||
By default, DNS is served from port 53. On most operating systems, this
|
||||
requires elevated privileges. Instead of running Consul with an administrative
|
||||
or root account, it is possible to instead forward appropriate queries to
|
||||
Consul, running on an unprivileged port, from another DNS server or port
|
||||
redirect.
|
||||
---
|
||||
|
||||
# Forwarding DNS
|
||||
|
||||
By default, DNS is served from port 53. On most operating systems, this
|
||||
requires elevated privileges. Instead of running Consul with an administrative
|
||||
or root account, it is possible to instead forward appropriate queries to Consul,
|
||||
running on an unprivileged port, from another DNS server or port redirect.
|
||||
|
||||
In this guide, we will demonstrate forwarding from:
|
||||
|
||||
- [BIND](#bind-setup)
|
||||
- [dnsmasq](#dnsmasq-setup)
|
||||
- [Unbound](#unbound-setup)
|
||||
- [systemd-resolved](#systemd-resolved-setup)
|
||||
- [iptables](#iptables-setup)
|
||||
- [macOS](#macos-setup)
|
||||
|
||||
After configuring forwarding, we will demonstrate how to test the configuration. Finally, we will also provide some troubleshooting
|
||||
guidance.
|
||||
|
||||
~> Note, by default, Consul does not resolve DNS
|
||||
records outside the `.consul.` zone unless the
|
||||
[recursors](/docs/agent/options#recursors) configuration option
|
||||
has been set. As an example of how this changes Consul's behavior,
|
||||
suppose a Consul DNS reply includes a CNAME record pointing outside
|
||||
the `.consul` TLD. The DNS reply will only include CNAME records by
|
||||
default. By contrast, when `recursors` is set and the upstream resolver is
|
||||
functioning correctly, Consul will try to resolve CNAMEs and include
|
||||
any records (e.g. A, AAAA, PTR) for them in its DNS reply.
|
||||
|
||||
## BIND Setup
|
||||
|
||||
Note, in this example, BIND and Consul are running on the same machine.
|
||||
|
||||
First, you have to disable DNSSEC so that Consul and [BIND](https://www.isc.org/downloads/bind/) can communicate. Here is an example of such a configuration:
|
||||
|
||||
```text
|
||||
options {
|
||||
listen-on port 53 { 127.0.0.1; };
|
||||
listen-on-v6 port 53 { ::1; };
|
||||
directory "/var/named";
|
||||
dump-file "/var/named/data/cache_dump.db";
|
||||
statistics-file "/var/named/data/named_stats.txt";
|
||||
memstatistics-file "/var/named/data/named_mem_stats.txt";
|
||||
allow-query { localhost; };
|
||||
recursion yes;
|
||||
|
||||
dnssec-enable no;
|
||||
dnssec-validation no;
|
||||
|
||||
/* Path to ISC DLV key */
|
||||
bindkeys-file "/etc/named.iscdlv.key";
|
||||
|
||||
managed-keys-directory "/var/named/dynamic";
|
||||
};
|
||||
|
||||
include "/etc/named/consul.conf";
|
||||
```
|
||||
|
||||
### Zone File
|
||||
|
||||
Then we set up a zone for our Consul managed records in `consul.conf`:
|
||||
|
||||
```text
|
||||
zone "consul" IN {
|
||||
type forward;
|
||||
forward only;
|
||||
forwarders { 127.0.0.1 port 8600; };
|
||||
};
|
||||
```
|
||||
|
||||
Here we assume Consul is running with default settings and is serving
|
||||
DNS on port 8600.
|
||||
|
||||
## Dnsmasq Setup
|
||||
|
||||
[Dnsmasq](http://www.thekelleys.org.uk/dnsmasq/doc.html) is typically configured via a `dnsmasq.conf` or a series of files in
|
||||
the `/etc/dnsmasq.d` directory. In Dnsmasq's configuration file
|
||||
(e.g. `/etc/dnsmasq.d/10-consul`), add the following:
|
||||
|
||||
```text
|
||||
# Enable forward lookup of the 'consul' domain:
|
||||
server=/consul/127.0.0.1#8600
|
||||
|
||||
# Uncomment and modify as appropriate to enable reverse DNS lookups for
|
||||
# common netblocks found in RFC 1918, 5735, and 6598:
|
||||
#rev-server=0.0.0.0/8,127.0.0.1#8600
|
||||
#rev-server=10.0.0.0/8,127.0.0.1#8600
|
||||
#rev-server=100.64.0.0/10,127.0.0.1#8600
|
||||
#rev-server=127.0.0.1/8,127.0.0.1#8600
|
||||
#rev-server=169.254.0.0/16,127.0.0.1#8600
|
||||
#rev-server=172.16.0.0/12,127.0.0.1#8600
|
||||
#rev-server=192.168.0.0/16,127.0.0.1#8600
|
||||
#rev-server=224.0.0.0/4,127.0.0.1#8600
|
||||
#rev-server=240.0.0.0/4,127.0.0.1#8600
|
||||
```
|
||||
|
||||
Once that configuration is created, restart the `dnsmasq` service.
|
||||
|
||||
Additional useful settings in `dnsmasq` to consider include (see
|
||||
[`dnsmasq(8)`](http://www.thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html)
|
||||
for additional details):
|
||||
|
||||
```
|
||||
# Accept DNS queries only from hosts whose address is on a local subnet.
|
||||
#local-service
|
||||
|
||||
# Don't poll /etc/resolv.conf for changes.
|
||||
#no-poll
|
||||
|
||||
# Don't read /etc/resolv.conf. Get upstream servers only from the command
|
||||
# line or the dnsmasq configuration file (see the "server" directive below).
|
||||
#no-resolv
|
||||
|
||||
# Specify IP address(es) of other DNS servers for queries not handled
|
||||
# directly by consul. There is normally one 'server' entry set for every
|
||||
# 'nameserver' parameter found in '/etc/resolv.conf'. See dnsmasq(8)'s
|
||||
# 'server' configuration option for details.
|
||||
#server=1.2.3.4
|
||||
#server=208.67.222.222
|
||||
#server=8.8.8.8
|
||||
|
||||
# Set the size of dnsmasq's cache. The default is 150 names. Setting the
|
||||
# cache size to zero disables caching.
|
||||
#cache-size=65536
|
||||
```
|
||||
|
||||
## Unbound Setup
|
||||
|
||||
[Unbound](https://www.unbound.net/) is typically configured via an `unbound.conf` or a series of files in
|
||||
the `/etc/unbound/unbound.conf.d` directory. In an Unbound configuration file
|
||||
(e.g. `/etc/unbound/unbound.conf.d/consul.conf`), add the following:
|
||||
|
||||
```text
|
||||
#Allow insecure queries to local resolvers
|
||||
server:
|
||||
do-not-query-localhost: no
|
||||
domain-insecure: "consul"
|
||||
|
||||
#Add consul as a stub-zone
|
||||
stub-zone:
|
||||
name: "consul"
|
||||
stub-addr: 127.0.0.1@8600
|
||||
```
|
||||
|
||||
You may have to add the following line to the bottom of your
|
||||
`/etc/unbound/unbound.conf` file for the new configuration to be included:
|
||||
|
||||
```text
|
||||
include: "/etc/unbound/unbound.conf.d/*.conf"
|
||||
```
|
||||
|
||||
## systemd-resolved Setup
|
||||
|
||||
[`systemd-resolved`](https://www.freedesktop.org/wiki/Software/systemd/resolved/) is typically configured with `/etc/systemd/resolved.conf`.
|
||||
To configure systemd-resolved to send queries for the consul domain to
|
||||
Consul, configure resolved.conf to contain the following:
|
||||
|
||||
```
|
||||
DNS=127.0.0.1
|
||||
Domains=~consul
|
||||
```
|
||||
|
||||
The main limitation with this configuration is that the DNS field
|
||||
cannot contain ports. So for this to work either Consul must be
|
||||
[configured to listen on port 53](https://www.consul.io/docs/agent/options.html#dns_port)
|
||||
instead of 8600 or you can use iptables to map port 53 to 8600.
|
||||
The following iptables commands are sufficient to do the port
|
||||
mapping.
|
||||
|
||||
```
|
||||
[root@localhost ~]# iptables -t nat -A OUTPUT -d localhost -p udp -m udp --dport 53 -j REDIRECT --to-ports 8600
|
||||
[root@localhost ~]# iptables -t nat -A OUTPUT -d localhost -p tcp -m tcp --dport 53 -j REDIRECT --to-ports 8600
|
||||
```
|
||||
|
||||
Binding to port 53 will usually require running as a privileged user (or, on Linux, running with the
|
||||
`CAP_NET_BIND_SERVICE` capability). If you are using the Consul Docker image, you will need to add the following to the
|
||||
environment to allow Consul to use the port: `CONSUL_ALLOW_PRIVILEGED_PORTS=yes`
|
||||
|
||||
Note: With this setup, PTR record queries will still be sent out to the other configured resolvers in
|
||||
addition to Consul. If you wish to restrict this behavior, your `resolved.conf` should be modified to contain:
|
||||
|
||||
```
|
||||
DNS=127.0.0.1
|
||||
Domains=~consul ~0.10.in-addr.arpa
|
||||
```
|
||||
|
||||
where the example corresponds to reverse lookups of addresses in the IP range `10.0.0.0/16`. Your
|
||||
configuration should match your networks.
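
To derive the routing domain for a different network, reverse the fixed octets of the network prefix. The following is a minimal Python sketch of that arithmetic, assuming an IPv4 network whose prefix length falls on an octet boundary (/8, /16, or /24). The function name `reverse_routing_domain` is illustrative and is not part of systemd or Consul.

```python
import ipaddress


def reverse_routing_domain(cidr):
    """Return the routing domain (for example '~0.10.in-addr.arpa') that
    covers reverse lookups for an IPv4 network on an octet boundary."""
    network = ipaddress.ip_network(cidr, strict=True)
    if network.version != 4 or network.prefixlen not in (8, 16, 24):
        raise ValueError("expected an IPv4 /8, /16, or /24 network")
    fixed_octets = network.prefixlen // 8
    # Keep only the octets fixed by the prefix, then reverse their order.
    octets = str(network.network_address).split(".")[:fixed_octets]
    return "~" + ".".join(reversed(octets)) + ".in-addr.arpa"


print(reverse_routing_domain("10.0.0.0/16"))  # ~0.10.in-addr.arpa
```

The returned value can be appended to the `Domains=` line shown above.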
|
||||
|
||||
## iptables Setup
|
||||
|
||||
Note, for iptables, the rules must be set on the same host as the Consul
|
||||
instance and relay hosts should not be on the same host or the redirects will
|
||||
intercept the traffic.
|
||||
|
||||
On Linux systems that support it, incoming requests and requests to
|
||||
the local host can use [`iptables`](http://www.netfilter.org/) to forward ports on the same machine
|
||||
without a secondary service. Since Consul, by default, only resolves
|
||||
the `.consul` TLD, it is especially important to use the `recursors`
|
||||
option if you wish the `iptables` setup to resolve for other domains.
|
||||
The recursors should not include the local host as the redirects would
|
||||
just intercept the requests.
|
||||
|
||||
The iptables method is suited for situations where an external DNS
|
||||
service is already running in your infrastructure and is used as the
|
||||
recursor or if you want to use an existing DNS server as your query
|
||||
endpoint and forward requests for the consul domain to the Consul
|
||||
server. In both of those cases you may want to query the Consul server
|
||||
but not need the overhead of a separate service on the Consul host.
|
||||
|
||||
```
|
||||
[root@localhost ~]# iptables -t nat -A PREROUTING -p udp -m udp --dport 53 -j REDIRECT --to-ports 8600
|
||||
[root@localhost ~]# iptables -t nat -A PREROUTING -p tcp -m tcp --dport 53 -j REDIRECT --to-ports 8600
|
||||
[root@localhost ~]# iptables -t nat -A OUTPUT -d localhost -p udp -m udp --dport 53 -j REDIRECT --to-ports 8600
|
||||
[root@localhost ~]# iptables -t nat -A OUTPUT -d localhost -p tcp -m tcp --dport 53 -j REDIRECT --to-ports 8600
|
||||
```
|
||||
|
||||
## macOS Setup
|
||||
|
||||
On macOS systems, you can use the macOS system resolver to point all `.consul` requests to Consul.
|
||||
To do so, add a resolver entry in `/etc/resolver/` that points at Consul.
|
||||
Documentation for this feature is available via `man 5 resolver`.
|
||||
To set this up, create a new file `/etc/resolver/consul` (you will need sudo/root access) and put in the file:
|
||||
|
||||
```
|
||||
nameserver 127.0.0.1
|
||||
port 8600
|
||||
```
|
||||
|
||||
This tells the macOS resolver daemon to send all requests for the `.consul` TLD to 127.0.0.1 on port 8600.
|
||||
|
||||
## Testing
|
||||
|
||||
First, perform a DNS query against Consul directly to be sure that the record exists:
|
||||
|
||||
```text
|
||||
[root@localhost ~]# dig @localhost -p 8600 primary.redis.service.dc-1.consul. A
|
||||
|
||||
; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.23.rc1.32.amzn1 <<>> @localhost primary.redis.service.dc-1.consul. A
|
||||
; (1 server found)
|
||||
;; global options: +cmd
|
||||
;; Got answer:
|
||||
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11536
|
||||
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
|
||||
|
||||
;; QUESTION SECTION:
|
||||
;primary.redis.service.dc-1.consul. IN A
|
||||
|
||||
;; ANSWER SECTION:
|
||||
primary.redis.service.dc-1.consul. 0 IN A 172.31.3.234
|
||||
|
||||
;; Query time: 4 msec
|
||||
;; SERVER: 127.0.0.1#8600(127.0.0.1)
|
||||
;; WHEN: Wed Apr 9 17:36:12 2014
|
||||
;; MSG SIZE rcvd: 76
|
||||
```
|
||||
|
||||
Then run the same query against your BIND instance and make sure you get a
|
||||
valid result:
|
||||
|
||||
```text
|
||||
[root@localhost ~]# dig @localhost -p 53 primary.redis.service.dc-1.consul. A
|
||||
|
||||
; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.23.rc1.32.amzn1 <<>> @localhost primary.redis.service.dc-1.consul. A
|
||||
; (1 server found)
|
||||
;; global options: +cmd
|
||||
;; Got answer:
|
||||
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11536
|
||||
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
|
||||
|
||||
;; QUESTION SECTION:
|
||||
;primary.redis.service.dc-1.consul. IN A
|
||||
|
||||
;; ANSWER SECTION:
|
||||
primary.redis.service.dc-1.consul. 0 IN A 172.31.3.234
|
||||
|
||||
;; Query time: 4 msec
|
||||
;; SERVER: 127.0.0.1#53(127.0.0.1)
|
||||
;; WHEN: Wed Apr 9 17:36:12 2014
|
||||
;; MSG SIZE rcvd: 76
|
||||
```
|
||||
|
||||
If desired, verify reverse DNS using the same methodology:
|
||||
|
||||
```text
|
||||
[root@localhost ~]# dig @127.0.0.1 -p 8600 133.139.16.172.in-addr.arpa. PTR
|
||||
|
||||
; <<>> DiG 9.10.3-P3 <<>> @127.0.0.1 -p 8600 133.139.16.172.in-addr.arpa. PTR
|
||||
; (1 server found)
|
||||
;; global options: +cmd
|
||||
;; Got answer:
|
||||
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3713
|
||||
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
|
||||
;; WARNING: recursion requested but not available
|
||||
|
||||
;; QUESTION SECTION:
|
||||
;133.139.16.172.in-addr.arpa. IN PTR
|
||||
|
||||
;; ANSWER SECTION:
|
||||
133.139.16.172.in-addr.arpa. 0 IN PTR consul1.node.dc1.consul.
|
||||
|
||||
;; Query time: 3 msec
|
||||
;; SERVER: 127.0.0.1#8600(127.0.0.1)
|
||||
;; WHEN: Sun Jan 31 04:25:39 UTC 2016
|
||||
;; MSG SIZE rcvd: 109
|
||||
[root@localhost ~]# dig @127.0.0.1 +short -x 172.16.139.133
|
||||
consul1.node.dc1.consul.
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
If you don't get an answer from your DNS server (e.g. BIND, Dnsmasq) but you
|
||||
do get an answer from Consul, your best bet is to turn on your DNS server's
|
||||
query log to see what's happening.
|
||||
|
||||
For BIND:
|
||||
|
||||
```text
|
||||
[root@localhost ~]# rndc querylog
|
||||
[root@localhost ~]# tail -f /var/log/messages
|
||||
```
|
||||
|
||||
The log may show errors like this:
|
||||
|
||||
```text
|
||||
error (no valid RRSIG) resolving
|
||||
error (no valid DS) resolving
|
||||
```
|
||||
|
||||
This indicates that DNSSEC is not disabled properly.
|
||||
|
||||
If you see errors about network connections, verify that there are no firewall
|
||||
or routing problems between the servers running BIND and Consul.
|
||||
|
||||
For Dnsmasq, see the `log-queries` configuration option and the `USR1`
|
||||
signal.
|
||||
|
||||
## Summary
|
||||
|
||||
In this guide we provided examples of configuring DNS forwarding with many
|
||||
common, third-party tools. It is the responsibility of the operator to ensure
|
||||
whichever tool they select is configured properly prior to integration
|
||||
with Consul.
|
|
@ -1,191 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Geo Failover
|
||||
description: >-
|
||||
Consul provides a prepared query capability that makes it easy to implement
|
||||
automatic geo failover policies for services.
|
||||
---
|
||||
|
||||
# Geo Failover with Prepared Queries
|
||||
|
||||
Within a single datacenter, Consul provides automatic failover for services by omitting failed service instances from DNS lookups and by providing service health information in APIs.
|
||||
|
||||
When there are no more instances of a service available in the local datacenter, it can be challenging to implement failover policies to other datacenters because typically that logic would need to be written into each application. Fortunately, Consul has a [prepared query](/api/query) API
|
||||
that provides the capability to let users define failover policies in a centralized way. It's easy to expose these to applications using Consul's DNS interface and it's also available to applications that consume Consul's APIs.
|
||||
|
||||
Failover policies are flexible and can be applied in a variety of ways including:
|
||||
|
||||
- Fully static lists of alternate datacenters.
|
||||
- Fully dynamic policies that make use of Consul's [network coordinate](/docs/internals/coordinates) subsystem.
|
||||
- Policies that automatically determine the next best datacenter to fail over to based on network round trip time.
|
||||
|
||||
Prepared queries can be made with policies specific to certain services and prepared query templates can allow one policy to apply to many, or even all services, with just a small number of templates.
|
||||
|
||||
This guide shows how to build geo failover policies using prepared queries through a set of examples. It also includes information on how to use prepared
|
||||
query templates to simplify the failover process.
|
||||
|
||||
## Prepared Query Introduction
|
||||
|
||||
Prepared queries are objects that are defined at the datacenter level. They
|
||||
only need to be created once and are stored on the Consul servers. This is similar to how values are stored in Consul's KV store.
|
||||
|
||||
Once created, prepared queries can then be invoked by applications to perform the query and get the latest results.
|
||||
|
||||
Here's an example request to create a prepared query:
|
||||
|
||||
```shell
|
||||
$ curl \
|
||||
--request POST \
|
||||
--data \
|
||||
'{
|
||||
"Name": "payments",
|
||||
"Service": {
|
||||
"Service": "payments",
|
||||
"Tags": ["v1.2.3"]
|
||||
}
|
||||
}' http://127.0.0.1:8500/v1/query
|
||||
|
||||
{"ID":"fe3b8d40-0ee0-8783-6cc2-ab1aa9bb16c1"}
|
||||
```
|
||||
|
||||
This creates a prepared query called "payments" that does a lookup for all instances of the "payments" service with the tag "v1.2.3". This policy could be used to control, in a centralized way, which version of the "payments" service applications should be using. By [updating this prepared query](/api/query#update-prepared-query) to look for the tag "v1.2.4", applications could start to find the newer version of the service without having to be reconfigured.
|
||||
|
||||
Applications can make use of this query in two ways.
|
||||
|
||||
1. Since we gave the prepared query a name, they can simply do a DNS lookup for "payments.query.consul" instead of "payments.service.consul". Now with the prepared query, there's the additional filter policy working behind the scenes that the application doesn't have to know about.
|
||||
|
||||
1. Queries can also be executed using the [prepared query execute API](/api/query#execute-prepared-query) for applications that integrate with Consul's APIs directly.
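
For the second option, a minimal Python sketch of calling the execute endpoint might look like the following. It assumes an agent reachable at `http://127.0.0.1:8500` with no ACL token required. The helper names `execute_prepared_query` and `service_endpoints` are illustrative and are not part of any Consul client library. The response fields read here (`Nodes`, `Node.Address`, `Service.Address`, `Service.Port`) follow the documented execute response.

```python
import json
import urllib.parse
import urllib.request


def execute_prepared_query(name, base_url="http://127.0.0.1:8500"):
    """Execute a prepared query by name (or ID) and return the decoded JSON."""
    url = f"{base_url}/v1/query/{urllib.parse.quote(name)}/execute"
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def service_endpoints(result):
    """Extract (address, port) pairs from an execute response.

    Falls back to the node address when the service has no address of its own.
    """
    endpoints = []
    for entry in result.get("Nodes") or []:
        service = entry["Service"]
        address = service.get("Address") or entry["Node"]["Address"]
        endpoints.append((address, service["Port"]))
    return endpoints
```

An application would call `service_endpoints(execute_prepared_query("payments"))` and connect to one of the returned pairs.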
|
||||
|
||||
## Failover Policy Types
|
||||
|
||||
Using the techniques in this section you will develop prepared queries with failover policies where simply changing application configurations to look up "payments.query.consul" instead of "payments.service.consul" via DNS will result in automatic geo failover to the next closest [federated](/docs/guides/datacenters) Consul datacenters, in order of increasing network round trip time.
|
||||
|
||||
Failover is just another policy choice for a prepared query. It works in the same manner as the previous example and is similarly transparent to applications. The failover policy is configured using the `Failover` structure, which contains two optional fields that determine what happens if no healthy nodes are available in the local datacenter when the query is executed.
|
||||
|
||||
- `NearestN` `(int: 0)` - Specifies that the query will be forwarded to up to `NearestN` other datacenters based on their estimated network round trip time using [network coordinates](/docs/internals/coordinates).
|
||||
|
||||
- `Datacenters` `(array<string>: nil)` - Specifies a fixed list of remote datacenters to forward the query to if there are no healthy nodes in the local datacenter. Datacenters are queried in the order given in the list.
|
||||
|
||||
The following examples use those fields to implement different geo failover methods.
|
||||
|
||||
### Static Policy
|
||||
|
||||
A static failover policy includes a fixed list of datacenters to contact once there are no healthy instances in the local datacenter.
|
||||
|
||||
Here's the example from the introduction, expanded with a static failover policy:
|
||||
|
||||
```shell
|
||||
$ curl \
|
||||
--request POST \
|
||||
--data \
|
||||
'{
|
||||
"Name": "payments",
|
||||
"Service": {
|
||||
"Service": "payments",
|
||||
"Tags": ["v1.2.3"],
|
||||
"Failover": {
|
||||
"Datacenters": ["dc2", "dc3"]
|
||||
}
|
||||
}
|
||||
}' http://127.0.0.1:8500/v1/query
|
||||
|
||||
{"ID":"fe3b8d40-0ee0-8783-6cc2-ab1aa9bb16c1"}
|
||||
```
|
||||
|
||||
When this query is executed, such as with a DNS lookup to "payments.query.consul", the following actions will occur:
|
||||
|
||||
1. Consul servers in the local datacenter will attempt to find healthy instances of the "payments" service with the required tag.
|
||||
2. If none are available locally, the Consul servers will make an RPC request to the Consul servers in "dc2" to perform the query there.
|
||||
3. If none are available in "dc2", then an RPC will be made to the Consul servers in "dc3" to perform the query there.
|
||||
4. Finally an error will be returned if none of these datacenters had any instances available.
|
||||
|
||||
### Dynamic Policy
|
||||
|
||||
In a complex federated environment with many Consul datacenters, it can be cumbersome to set static failover policies, so Consul offers a dynamic option based on Consul's [network coordinate](/docs/internals/coordinates) subsystem.
|
||||
|
||||
Consul continuously maintains an estimate of the network round trip time from the local datacenter to the servers in other datacenters it is federated with. Each server uses the median round trip time from itself to the servers in the remote datacenter. This means that failover can simply try other remote datacenters in order of increasing network round trip time, and if datacenters come and go, or experience network issues, this order will adjust automatically.
|
||||
|
||||
Here's the example from the introduction, expanded with a dynamic failover policy:
|
||||
|
||||
```shell
|
||||
$ curl \
|
||||
--request POST \
|
||||
--data \
|
||||
'{
|
||||
"Name": "payments",
|
||||
"Service": {
|
||||
"Service": "payments",
|
||||
"Tags": ["v1.2.3"],
|
||||
"Failover": {
|
||||
"NearestN": 2
|
||||
}
|
||||
}
|
||||
}' http://127.0.0.1:8500/v1/query
|
||||
|
||||
{"ID":"fe3b8d40-0ee0-8783-6cc2-ab1aa9bb16c1"}
|
||||
```
|
||||
|
||||
This query is resolved in a similar fashion to the previous example, except the choice of "dc2" or "dc3", or possibly some other datacenter, is made automatically.
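
The ordering described above can be modeled in a few lines of Python. This is a simplified sketch and not Consul's implementation. It assumes each remote datacenter already has a list of round trip time estimates (in seconds) from the local server to the servers in that datacenter. The function name and the sample numbers are hypothetical.

```python
from statistics import median


def nearest_datacenters(rtts_by_dc, nearest_n):
    """Return up to nearest_n datacenter names ordered by median RTT.

    rtts_by_dc maps a remote datacenter name to RTT estimates (seconds)
    from this server to each server in that datacenter.
    """
    ranked = sorted(rtts_by_dc, key=lambda dc: median(rtts_by_dc[dc]))
    return ranked[:nearest_n]


# Hypothetical estimates for three federated datacenters.
estimates = {
    "dc2": [0.080, 0.082, 0.079],
    "dc3": [0.020, 0.025, 0.021],
    "dc4": [0.150, 0.140, 0.160],
}
print(nearest_datacenters(estimates, 2))  # ['dc3', 'dc2']
```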
|
||||
|
||||
### Hybrid Policy
|
||||
|
||||
It is possible to combine `Datacenters` and `NearestN` in the same policy. The `NearestN` queries will be performed first, followed by the list given by `Datacenters`.
|
||||
|
||||
```shell
|
||||
$ curl \
|
||||
--request POST \
|
||||
--data \
|
||||
'{
|
||||
"Name": "payments",
|
||||
"Service": {
|
||||
"Service": "payments",
|
||||
"Tags": ["v1.2.3"],
|
||||
"Failover": {
|
||||
"NearestN": 2,
|
||||
"Datacenters": ["dc2", "dc3"]
|
||||
}
|
||||
}
|
||||
}' http://127.0.0.1:8500/v1/query
|
||||
|
||||
{"ID":"fe3b8d40-0ee0-8783-6cc2-ab1aa9bb16c1"}
|
||||
```
|
||||
|
||||
Note that a given datacenter will only be queried one time during a failover, even if it is both selected by `NearestN` and listed in `Datacenters`. This is useful for allowing a limited number of round trip-based attempts, followed by a static configuration for some known datacenter to fail over to.
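
The resulting order of attempts can be illustrated with a short Python sketch. It models only the ordering and de-duplication rules stated above. The function name is hypothetical, and the RTT-sorted input is assumed to be supplied by the caller.

```python
def failover_order(nearest_by_rtt, datacenters, nearest_n):
    """Return remote datacenters in the order a hybrid policy would try them.

    nearest_by_rtt: remote datacenters already sorted by increasing RTT.
    datacenters:    the static `Datacenters` list from the policy.
    nearest_n:      the `NearestN` value from the policy.
    """
    order = []
    for dc in list(nearest_by_rtt[:nearest_n]) + list(datacenters):
        if dc not in order:  # each datacenter is queried at most once
            order.append(dc)
    return order


# Hypothetical RTT ordering: dc4 is closest, then dc2, then dc3.
print(failover_order(["dc4", "dc2", "dc3"], ["dc2", "dc3"], 2))
# ['dc4', 'dc2', 'dc3']
```

Here "dc2" appears in both lists but is tried only once, in the position given by `NearestN`.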
|
||||
|
||||
### Prepared Query Template
|
||||
|
||||
For datacenters with many services, it can be challenging to define a geo failover policy for each service. To relieve this challenge, Consul provides a [prepared query template](/api/query#prepared-query-templates) that allows one prepared query to apply to many, and even all, services.
|
||||
|
||||
Templates can match on prefixes or use full regular expressions to determine which services they match.
|
||||
|
||||
Below is an example request to create a prepared query template that applies a dynamic geo failover policy to all services. We've chosen the `name_prefix_match` type and given it an empty name, which means that it will match any service.
|
||||
|
||||
```shell
|
||||
$ curl \
|
||||
--request POST \
|
||||
--data \
|
||||
'{
|
||||
"Name": "",
|
||||
"Template": {
|
||||
"Type": "name_prefix_match"
|
||||
},
|
||||
"Service": {
|
||||
"Service": "${name.full}",
|
||||
"Failover": {
|
||||
"NearestN": 2
|
||||
}
|
||||
}
|
||||
}' http://127.0.0.1:8500/v1/query
|
||||
|
||||
{"ID":"fe3b8d40-0ee0-8783-6cc2-ab1aa9bb16c1"}
|
||||
```
|
||||
|
||||
~> Note: If multiple queries are registered, the most specific one will be selected, so it's possible to have a template like this as a catch-all, and then apply more specific policies to certain services.
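
As a rough mental model of that selection, the Python sketch below picks the longest matching name prefix among `name_prefix_match` templates. It assumes that "most specific" means the longest prefix and that queries with an exact name are handled before templates are consulted. It is an illustration and not Consul's code, and the function name is hypothetical.

```python
def select_template(query_name, template_prefixes):
    """Return the longest template prefix matching query_name, or None.

    An empty prefix ("") matches every name and therefore acts as a catch-all.
    """
    matches = [prefix for prefix in template_prefixes
               if query_name.startswith(prefix)]
    return max(matches, key=len) if matches else None


print(select_template("payments", ["", "pay"]))  # pay
```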
|
||||
|
||||
With this one prepared query template in place, simply changing application configurations to look up "payments.query.consul" instead of "payments.service.consul" via DNS will result in automatic geo failover to the next closest federated Consul datacenters, in order of increasing network round trip time.
|
||||
|
||||
## Summary
|
||||
|
||||
In this guide you learned how to use three different policy types for failover:
|
||||
static, dynamic, and hybrid. You also learned how to create a prepared query template which will help you reduce some complexity of creating policies for
|
||||
services.
|
|
@ -1,188 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Kubernetes Consul Reference Architecture
|
||||
description: This document provides recommended practices and a reference architecture.
|
||||
---
|
||||
|
||||
# Consul and Kubernetes Reference Architecture
|
||||
|
||||
Preparing your Kubernetes cluster to successfully deploy and run Consul is an
|
||||
important first step in your production deployment process. In this guide you
|
||||
will prepare your Kubernetes cluster, which can be running on any platform
|
||||
(AKS, EKS, GKE, etc.). We will call out cloud-specific differences when
|
||||
applicable. Before starting this guide you should have experience with
|
||||
Kubernetes, and have `kubectl` and helm configured locally.
|
||||
|
||||
By the end of this guide, you will be able to select the right resource limits
|
||||
for Consul pods, select the Consul datacenter design that meets your use case,
|
||||
and understand the minimum networking requirements.
|
||||
|
||||
## Infrastructure Requirements
|
||||
|
||||
Consul server agents are responsible for the cluster state, responding to RPC
|
||||
queries, and processing all write operations. Since the Consul servers are
|
||||
highly active and are responsible for maintaining the cluster state, server
|
||||
sizing is critical for the overall performance, efficiency, and health of the
|
||||
Consul cluster. Review the [Consul Reference
|
||||
Architecture](/consul/advanced/day-1-operations/reference-architecture#consul-servers)
|
||||
guide for sizing recommendations for small and large Consul datacenters.
|
||||
|
||||
The CPU and memory recommendations can be used when you select the resource
|
||||
limits for the Consul pods. The disk recommendations can also be used when
|
||||
selecting the resource limits and configuring persistent volumes. You will
|
||||
need to set both `limits` and `requests` in the Helm chart. Below is an example
|
||||
snippet of Helm config for a Consul server in a large environment.
|
||||
|
||||
```yaml
|
||||
# values.yaml
|
||||
|
||||
server:
|
||||
  resources: |
|
||||
    requests:
|
||||
      memory: "32Gi"
|
||||
      cpu: "4"
|
||||
    limits:
|
||||
      memory: "32Gi"
|
||||
      cpu: "4"
|
||||
|
||||
  storage: 50Gi
|
||||
...
|
||||
```
|
||||
|
||||
You should also set [resource limits for Consul
|
||||
clients](https://www.consul.io/docs/platform/k8s/helm.html#v-client-resources),
|
||||
so that the client pods do not consume more resources than
|
||||
expected.
|
||||
|
||||
[Persistent
|
||||
volumes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) (PV)
|
||||
allow you to have a fixed disk location for the Consul data. This ensures that
|
||||
if a Consul server is lost, the data will not be lost. This is an important
|
||||
feature of Kubernetes, but may take some additional configuration. If you are
|
||||
running Kubernetes on one of the major cloud platforms, persistent volumes
|
||||
should already be configured for you; be sure to read their documentation for more
|
||||
details. If you are setting up the persistent volumes resource in Kubernetes, you may need
|
||||
to map the Consul server to that volume with the [storage class
|
||||
parameter](https://www.consul.io/docs/platform/k8s/helm.html#v-server-storageclass).
|
||||
|
||||
Finally, you will need to enable RBAC on your Kubernetes cluster. Review
|
||||
the [Kubernetes
|
||||
RBAC](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) documentation. You
|
||||
should also review RBAC and authentication documentation if your Kubernetes cluster
|
||||
is running on a major cloud platform.
|
||||
|
||||
- [AWS](https://docs.aws.amazon.com/eks/latest/userguide/managing-auth.html).
|
||||
- [GCP](https://cloud.google.com/kubernetes-engine/docs/how-to/role-based-access-control).
|
||||
- [Azure](https://docs.microsoft.com/en-us/cli/azure/aks?view=azure-cli-latest#az-aks-create). In Azure, RBAC is enabled by default.
|
||||
|
||||
## Datacenter Design
|
||||
|
||||
There are many possible configurations for running Consul with Kubernetes. In this guide
|
||||
we will cover three of the most common.
|
||||
|
||||
1. Consul agents can be solely deployed within Kubernetes.
|
||||
1. Consul servers
|
||||
can be deployed outside of Kubernetes and clients inside of Kubernetes.
|
||||
1. Multiple Consul datacenters with agents inside and outside of Kubernetes.
|
||||
|
||||
Review the Consul Kubernetes-specific
|
||||
[documentation](https://www.consul.io/docs/platform/k8s#use-cases)
|
||||
for additional use case information.
|
||||
|
||||
Since all three use cases will also need catalog sync, review the
|
||||
implementation [details for catalog sync](https://www.consul.io/docs/platform/k8s/service-sync.html).
|
||||
|
||||
### Consul Datacenter Deployed in Kubernetes
|
||||
|
||||
Deploying a Consul cluster, servers and clients, in Kubernetes can be done with
|
||||
the official [Helm
|
||||
chart](https://www.consul.io/docs/platform/k8s/helm.html#using-the-helm-chart).
|
||||
This configuration is useful for managing services within Kubernetes and is
|
||||
common for users who do not already have a production Consul datacenter.
|
||||
|
||||
![Reference Diagram](/img/k8s-consul-simple.png 'Consul in Kubernetes Reference Diagram')
|
||||
|
||||
The Consul datacenter in Kubernetes will function the same as a platform
|
||||
independent Consul datacenter, such as Consul clusters deployed on bare metal servers
|
||||
or virtual machines. Agents will communicate over LAN gossip, servers
|
||||
will participate in the Raft consensus, and client requests will be
|
||||
forwarded to the servers via RPCs.
|
||||
|
||||
### Consul Datacenter with a Kubernetes Cluster
|
||||
|
||||
To use an existing Consul cluster to manage services in Kubernetes, Consul
|
||||
clients can be deployed within the Kubernetes cluster. This will also allow
|
||||
Kubernetes-defined services to be synced to Consul. This design allows Consul tools
|
||||
such as envconsul, consul-template, and more to work on Kubernetes.
|
||||
|
||||
![Reference Diagram](/img/k8s-cluster-consul-datacenter.png 'Consul and Kubernetes Reference Diagram')
|
||||
|
||||
This type of deployment in Kubernetes can also be set up with the official Helm
|
||||
chart.
|
||||
|
||||
### Multiple Consul Clusters with a Kubernetes Cluster
|
||||
|
||||
Consul clusters in different datacenters running the same service can be joined
|
||||
by WAN links. The clusters can operate independently and only communicate over
|
||||
the WAN. This type of datacenter design is detailed in the [Reference Architecture
|
||||
guide](/consul/advanced/day-1-operations/reference-architecture#multiple-datacenters).
|
||||
In this setup, you can have a Consul cluster running outside of Kubernetes and
|
||||
a Consul cluster running inside of Kubernetes.
|
||||
|
||||
### Catalog Sync
|
||||
|
||||
To use catalog sync, you must enable it in the [Helm
|
||||
chart](https://www.consul.io/docs/platform/k8s/helm.html#v-synccatalog).
|
||||
Catalog sync allows you to sync services between Consul and Kubernetes. The
|
||||
sync can be unidirectional in either direction or bidirectional. Read the
|
||||
[documentation](https://www.consul.io/docs/platform/k8s/service-sync.html) to
|
||||
learn more about the configuration.
|
||||
|
||||
Services synced from Kubernetes to Consul will be discoverable, like any other
|
||||
service within the Consul datacenter. See the [network
|
||||
connectivity](#networking-connectivity) section to learn more about related
|
||||
Kubernetes configuration. Services synced from Consul to Kubernetes will be
|
||||
discoverable with the built-in Kubernetes DNS once a [Consul stub
|
||||
domain](https://www.consul.io/docs/platform/k8s/dns.html) is deployed. When
|
||||
bidirectional catalog sync is enabled, it will behave like the two
|
||||
unidirectional setups.
|
||||
|
||||
## Networking Connectivity
|
||||
|
||||
When running Consul as a pod inside of Kubernetes, the Consul servers will be
|
||||
automatically configured with the appropriate addresses. However, when running
|
||||
Consul servers outside of the Kubernetes cluster and clients inside Kubernetes
|
||||
as pods, there are additional [networking
|
||||
considerations](/consul/advanced/day-1-operations/reference-architecture#network-connectivity).
|
||||
|
||||
### Network Connectivity for Services
|
||||
|
||||
When using Consul catalog sync, to sync Kubernetes services to Consul, you will
|
||||
need to ensure the Kubernetes services use supported [service
|
||||
types](https://www.consul.io/docs/platform/k8s/service-sync.html#kubernetes-service-types)
|
||||
and are configured correctly in Kubernetes. If the service is configured correctly,
|
||||
it will be discoverable by Consul like any other service in the datacenter.
|
||||
|
||||
~> Warning: You are responsible for ensuring that external services can communicate
|
||||
with services deployed in the Kubernetes cluster. For example, `ClusterIP` type services
|
||||
may not be directly accessible by IP address from outside the Kubernetes cluster
|
||||
for some Kubernetes configurations.
|
||||
|
||||
### Network Security
|
||||
|
||||
Finally, you should consider securing your Consul datacenter with
|
||||
[ACLs](/consul/advanced/day-1-operations/production-acls). ACLs should be used with [Consul
|
||||
Connect](https://www.consul.io/docs/platform/k8s/connect.html) to secure
|
||||
service to service communication. The Kubernetes cluster should also be
|
||||
secured.
|
||||
|
||||
## Summary
|
||||
|
||||
You are now prepared to deploy Consul with Kubernetes. In this
|
||||
guide, you were introduced to several datacenter designs for a variety of use
|
||||
cases. This guide also outlined the Kubernetes prerequisites, resource
|
||||
requirements for Consul, and networking considerations. Continue on to the
|
||||
[Deploying Consul with Kubernetes
|
||||
guide](/consul/getting-started-k8s/helm-deploy) for
|
||||
information on deploying Consul with the official Helm chart or continue
|
||||
reading about Consul Operations in the [Day 1 Path](https://learn.hashicorp.com/consul/?track=advanced#advanced).
|
|
@ -1,164 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Application Leader Election with Sessions
|
||||
description: >-
|
||||
This guide describes how to build client-side leader election using Consul. If
|
||||
you are interested in the leader election used internally to Consul, please
|
||||
refer to the consensus protocol documentation instead.
|
||||
---
|
||||
|
||||
# Application Leader Election with Sessions
|
||||
|
||||
For some applications, like HDFS, it is necessary to set one instance as
|
||||
a leader. This ensures the application data is current and stable.
|
||||
|
||||
This guide describes how to build client-side leader elections for service
|
||||
instances, using Consul. Consul's support for
|
||||
[sessions](/docs/internals/sessions) allows you to build a system that can gracefully handle failures.
|
||||
|
||||
If you
|
||||
are interested in the leader election used internally by Consul, please refer to the
|
||||
[consensus protocol](/docs/internals/consensus) documentation instead.
|
||||
|
||||
## Contending Service Instances
|
||||
|
||||
Imagine you have a set of MySQL service instances that are attempting to acquire leadership. All service instances that are participating should agree on a given
|
||||
key to coordinate. A good pattern is simply:
|
||||
|
||||
```text
|
||||
service/<service name>/leader
|
||||
```
|
||||
|
||||
This key will be used for all requests to the Consul KV API.
|
||||
|
||||
We will use the same, simple pattern for the MySQL services for the remainder of the guide.
|
||||
|
||||
```text
|
||||
service/mysql/leader
|
||||
```
|
||||
|
||||
### Create a Session
|
||||
|
||||
The first step is to create a session using the
|
||||
[Session HTTP API](/api/session#session_create).
|
||||
|
||||
```shell
|
||||
$ curl -X PUT -d '{"Name": "mysql-session"}' http://localhost:8500/v1/session/create
|
||||
```
|
||||
|
||||
This will return a JSON object containing the session ID:
|
||||
|
||||
```json
|
||||
{
|
||||
"ID": "4ca8e74b-6350-7587-addf-a18084928f3c"
|
||||
}
|
||||
```
|
||||
|
||||
### Acquire a Session
|
||||
|
||||
The next step is to acquire a session for a given key from this instance
|
||||
using the PUT method on a [KV entry](/api/kv) with the
|
||||
`?acquire=<session>` query parameter.
|
||||
|
||||
The `<body>` of the PUT should be a
|
||||
JSON object representing the local instance. This value is opaque to
|
||||
Consul, but it should contain whatever information clients require to
|
||||
communicate with your application (e.g., it could be a JSON object
|
||||
that contains the node's name and the application's port).
|
||||
|
||||
```shell
|
||||
$ curl -X PUT -d <body> http://localhost:8500/v1/kv/service/mysql/leader?acquire=4ca8e74b-6350-7587-addf-a18084928f3c
|
||||
```
|
||||
|
||||
This will either return `true` or `false`. If `true`, the lock has been acquired and
|
||||
the local service instance is now the leader. If `false` is returned, some other node has acquired
|
||||
the lock.
|
||||
|
||||
### Watch the Session
|
||||
|
||||
All instances now remain in an idle waiting state. In this state, they watch for changes
|
||||
on the key `service/mysql/leader`, because the lock may be released or the leader instance could fail.
|
||||
|
||||
The leader must also watch for changes since its lock may be released by an operator
|
||||
or automatically released due to a false positive in the failure detector.
|
||||
|
||||
By default, the session makes use of only the gossip failure detector. That
|
||||
is, the session is considered held by a node as long as the default Serf health check
|
||||
has not declared the node unhealthy. Additional checks can be specified if desired.
|
||||
|
||||
Watching for changes is done via a blocking query against the key. If they ever
|
||||
notice that the `Session` field in the response is blank, there is no leader, and they should
|
||||
retry lock acquisition. Each attempt to acquire the key should be separated by a timed
|
||||
wait. This is because Consul may be enforcing a [`lock-delay`](/docs/internals/sessions).
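The watch-and-retry decision described above can be sketched in Python. This is a minimal illustration, not part of Consul or any client library: the function names, the injected `try_acquire` callable, and the 15-second wait are assumptions made for this example. The caller is assumed to have already fetched the response body of the blocking query against the key, and to handle a missing key (HTTP 404) before calling `current_leader_session`.

```python
import json
import time

# Assumption for this sketch: wait as long as a typical lock-delay before retrying.
RETRY_WAIT_SECONDS = 15


def current_leader_session(kv_response_body):
    """Return the session ID holding the key, or None when there is no leader."""
    entries = json.loads(kv_response_body)
    if not entries:
        return None
    # A released or invalidated lock leaves the Session field absent or empty.
    return entries[0].get("Session") or None


def wait_then_retry(try_acquire, wait_seconds=RETRY_WAIT_SECONDS, sleep=time.sleep):
    """Pause before calling try_acquire() so that any lock-delay can expire first."""
    sleep(wait_seconds)
    return try_acquire()
```

An instance would call `current_leader_session` on every blocking-query result and call `wait_then_retry` only when it returns `None`. The `sleep` parameter exists so the wait can be replaced in tests.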
|
||||
|
||||
### Release the Session
|
||||
|
||||
If the leader ever wishes to step down voluntarily, this should be done by simply
|
||||
releasing the lock:
|
||||
|
||||
```shell
|
||||
$ curl -X PUT http://localhost:8500/v1/kv/service/mysql/leader?release=4ca8e74b-6350-7587-addf-a18084928f3c
|
||||
```
|
||||
|
||||
## Discover the Leader
|
||||
|
||||
It is possible to identify the leader of a set of service instances participating in the election process.
|
||||
|
||||
As with leader election, all instances that are participating should agree on the key being used to coordinate.
|
||||
|
||||
### Retrieve the Key
|
||||
|
||||
Instances have a very simple role: they read the Consul KV key to discover the current leader. If the key has an associated `Session`, then there is a leader.
|
||||
|
||||
```shell
|
||||
$ curl -X GET http://localhost:8500/v1/kv/service/mysql/leader
|
||||
[
|
||||
{
|
||||
"Session": "4ca8e74b-6350-7587-addf-a18084928f3c",
|
||||
"Value": "Ym9keQ==",
|
||||
"Flags": 0,
|
||||
"Key": "service/mysql/leader",
|
||||
"LockIndex": 1,
|
||||
"ModifyIndex": 29,
|
||||
"CreateIndex": 29
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
If there is a leader then the value of the key will provide all the
|
||||
application-dependent information required as a Base64 encoded blob in
|
||||
the `Value` field.
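As a small illustration of reading that field, the following Python sketch decodes the `Value` of the first entry in a KV response, using only the standard library. The function name `leader_value` is hypothetical, and the sketch assumes the stored value is UTF-8 text, which depends entirely on what your application wrote.

```python
import base64
import json


def leader_value(kv_response_body):
    """Return the decoded leader payload, or None when no session holds the key."""
    entry = json.loads(kv_response_body)[0]
    if not entry.get("Session"):
        return None  # no leader, so the value should not be trusted as current
    # Assumption: the application stored UTF-8 text in the key.
    return base64.b64decode(entry["Value"]).decode("utf-8")
```

Applied to the example response above, the `Value` of `Ym9keQ==` decodes to the string `body`.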
|
||||
|
||||
### Retrieve Session Information
|
||||
|
||||
You can query the
|
||||
[`/v1/session/info`](/api/session#session_info)
|
||||
endpoint to get details about the session:
|
||||
|
||||
```shell
|
||||
$ curl -X GET http://localhost:8500/v1/session/info/4ca8e74b-6350-7587-addf-a18084928f3c
|
||||
[
|
||||
{
|
||||
"LockDelay": 1.5e+10,
|
||||
"Checks": [
|
||||
"serfHealth"
|
||||
],
|
||||
"Node": "consul-primary-bjsiobmvdij6-node-lhe5ihreel7y",
|
||||
"Name": "mysql-session",
|
||||
"ID": "4ca8e74b-6350-7587-addf-a18084928f3c",
|
||||
"CreateIndex": 28
|
||||
}
|
||||
]
|
||||
```
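The `LockDelay` in this response is expressed in nanoseconds, so `1.5e+10` corresponds to 15 seconds. A minimal Python sketch of the conversion follows; the helper name `lock_delay_seconds` is made up for this example.

```python
import json

NANOSECONDS_PER_SECOND = 1_000_000_000


def lock_delay_seconds(session_info_body):
    """Return the lock-delay of the first session in the response, in seconds."""
    session = json.loads(session_info_body)[0]
    return session["LockDelay"] / NANOSECONDS_PER_SECOND
```

A waiting instance could use this value as the minimum pause between acquisition attempts.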
|
||||
|
||||
## Summary
|
||||
|
||||
In this guide you used a session to initiate manual leader election for a
|
||||
set of service instances. To fully benefit from this process, instances should also watch the key using a blocking query for any
|
||||
changes. If the leader steps down or fails, the `Session` associated
|
||||
with the key will be cleared. When a new leader is elected, the key
|
||||
value will also be updated.
|
||||
|
||||
Using the `acquire` parameter is optional. This means
|
||||
that if you use leader election to update a key, you must not update the key
|
||||
without the acquire parameter.
|
|
@ -1,256 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Minikube
|
||||
description: Consul can be installed to the Kubernetes minikube tool for local development.
|
||||
---
|
||||
|
||||
# Consul Installation to Minikube via Helm
|
||||
|
||||
In this guide, you'll start a local Kubernetes cluster with minikube. You'll install Consul with only a few commands, then deploy two custom services that use Consul to discover each other over encrypted TLS via Consul Connect. Finally, you'll tighten down Consul Connect so that only the approved applications can communicate with each other.
|
||||
|
||||
[Demo code](https://github.com/hashicorp/demo-consul-101) is available.
|
||||
|
||||
- [Task 1: Start Minikube and Install Consul with Helm](#task-1-start-minikube-and-install-consul-with-helm)
|
||||
- [Task 2: Deploy a Consul Aware Application to the Cluster](#task-2-deploy-a-consul-aware-application-to-the-cluster)
|
||||
- [Task 3: Use Consul Connect](#task-3-use-consul-connect)
|
||||
|
||||
## Prerequisites
|
||||
|
||||
Let's install Consul on Kubernetes with minikube. This is a relatively quick and easy way to try out Consul on your local machine without the need for any cloud credentials. You'll be able to use most Consul features right away.
|
||||
|
||||
First, you'll need to follow the directions for [installing minikube](https://kubernetes.io/docs/tasks/tools/install-minikube/), including VirtualBox or similar.
|
||||
|
||||
You'll also need to install `kubectl` and `helm`.
|
||||
|
||||
Mac users can install `helm` and `kubectl` with Homebrew.
|
||||
|
||||
```shell
|
||||
$ brew install kubernetes-cli
|
||||
$ brew install kubernetes-helm
|
||||
```
|
||||
|
||||
Windows users can use Chocolatey with the same package names:
|
||||
|
||||
```shell
|
||||
$ choco install kubernetes-cli
|
||||
$ choco install kubernetes-helm
|
||||
```
|
||||
|
||||
For more on Helm, see [helm.sh](https://helm.sh/).
|
||||
|
||||
## Task 1: Start Minikube and Install Consul with Helm
|
||||
|
||||
### Step 1: Start Minikube
|
||||
|
||||
Start minikube. You can use the `--memory` option with the equivalent of 4GB to 8GB so there is plenty of memory for all the pods we will run. This may take several minutes, since minikube will download 100-300MB of dependencies and container images.
|
||||
|
||||
```
|
||||
$ minikube start --memory 4096
|
||||
```
|
||||
|
||||
Next, let's view the local Kubernetes dashboard with `minikube dashboard`. Even if the previous step completed successfully, you may have to wait a minute or two for minikube to be available. If you see an error, try again after a few minutes.
|
||||
|
||||
Once it spins up, you'll see the dashboard in your web browser. You can view pods, nodes, and other resources.
|
||||
|
||||
```
|
||||
$ minikube dashboard
|
||||
```
|
||||
|
||||
![Minikube Dashboard](/img/guides/minikube-dashboard.png 'Minikube Dashboard')
|
||||
|
||||
### Step 2: Install the Consul Helm Chart to the Cluster
|
||||
|
||||
To perform the steps in this lab exercise, clone the [hashicorp/demo-consul-101](https://github.com/hashicorp/demo-consul-101) repository from GitHub. Go into the `demo-consul-101/k8s` directory.
|
||||
|
||||
```
|
||||
$ git clone https://github.com/hashicorp/demo-consul-101.git
|
||||
|
||||
$ cd demo-consul-101/k8s
|
||||
```
|
||||
|
||||
Now we're ready to install Consul to the cluster, using the `helm` tool. Initialize Helm with `helm init`. You'll see a note that Tiller (the server-side component) has been installed. You can ignore the policy warning.
|
||||
|
||||
```
|
||||
$ helm init
|
||||
|
||||
$HELM_HOME has been configured at /Users/geoffrey/.helm.
|
||||
```
|
||||
|
||||
Now we need to install Consul with Helm. To get the freshest copy of the Helm chart, clone the [hashicorp/consul-helm](https://github.com/hashicorp/consul-helm) repository.
|
||||
|
||||
```
|
||||
$ git clone https://github.com/hashicorp/consul-helm.git
|
||||
```
|
||||
|
||||
The chart works on its own, but we'll override a few values to help things go more smoothly with minikube and to enable useful features.
|
||||
|
||||
We've created `helm-consul-values.yaml` for you with overrides. See `values.yaml` in the Helm chart repository for other possible values.
|
||||
|
||||
We've given a name to the datacenter running this Consul cluster. We've enabled the Consul web UI via a `NodePort`. When deploying to a hosted cloud that implements load balancers, we could use `LoadBalancer` instead. We'll enable secure communication between pods with Connect. We also need to enable `grpc` on the client for Connect to work properly. Finally, specify that this Consul cluster should only run one server (suitable for local development).
|
||||
|
||||
```yaml
# Choose an optional name for the datacenter
global:
  datacenter: minidc

# Enable the Consul Web UI via a NodePort
ui:
  service:
    type: 'NodePort'

# Enable Connect for secure communication between nodes
connectInject:
  enabled: true

client:
  enabled: true
  grpc: true

# Use only one Consul server for local development
server:
  replicas: 1
  bootstrapExpect: 1
  disruptionBudget:
    enabled: true
    maxUnavailable: 0
```
|
||||
|
||||
Now, run `helm install` together with our overrides file and the cloned `consul-helm` chart. It will print a list of all the resources that were created.
|
||||
|
||||
```
|
||||
$ helm install -f helm-consul-values.yaml --name hedgehog ./consul-helm
|
||||
```
|
||||
|
||||
~> NOTE: If no `--name` is provided, the chart will create a random name for the installation. To reduce confusion, consider specifying a `--name`.
|
||||
|
||||
## Task 2: Deploy a Consul-aware Application to the Cluster
|
||||
|
||||
### Step 1: View the Consul Web UI
|
||||
|
||||
Verify the installation by going back to the Kubernetes dashboard in your web browser. Find the list of services. Several include `consul` in the name and have the `app: consul` label.
|
||||
|
||||
![Minikube Dashboard with Consul](/img/guides/minikube-dashboard-consul.png 'Minikube Dashboard with Consul')
|
||||
|
||||
There are a few differences between running Kubernetes on a hosted cloud vs locally with minikube. You may find that any load balancer resources don't work as expected on a local cluster. But we can still view the Consul UI and other deployed resources.
|
||||
|
||||
Run `minikube service list` to see your services. Find the one with `consul-ui` in the name.
|
||||
|
||||
```
|
||||
$ minikube service list
|
||||
```
|
||||
|
||||
Run `minikube service` with the `consul-ui` service name as the argument. It will open the service in your web browser.
|
||||
|
||||
```
|
||||
$ minikube service hedgehog-consul-ui
|
||||
```
|
||||
|
||||
You can now view the Consul web UI with a list of Consul's services, nodes, and other resources.
|
||||
|
||||
![Minikube Consul UI](/img/guides/minikube-consul-ui.png 'Minikube Consul UI')
|
||||
|
||||
### Step 2: Deploy Custom Applications
|
||||
|
||||
Now let's deploy our application. It consists of two services: a backend data service that returns a number (the `counting` service) and a front-end `dashboard` that pulls from the `counting` service over HTTP and displays the number. The Kubernetes part is a single line: `kubectl create -f 04-yaml-connect-envoy`. This is a directory with several YAML files, each defining one or more resources (pods, containers, etc).
|
||||
|
||||
```
|
||||
$ kubectl create -f 04-yaml-connect-envoy
|
||||
```
|
||||
|
||||
The output shows that they have been created. In reality, they may take a few seconds to spin up. Refresh the Kubernetes dashboard a few times and you'll see that the `counting` and `dashboard` services are running. You can also click a resource to view more data about it.
|
||||
|
||||
![Services](/img/guides/minikube-services.png 'Services')
|
||||
|
||||
### Step 3: View the Web Application
|
||||
|
||||
For the last step in this initial task, use the Kubernetes `port-forward` feature for the dashboard service running on port `9002`. We already know that the pod is named `dashboard` thanks to the metadata specified in the YAML we deployed.
|
||||
|
||||
```
|
||||
$ kubectl port-forward dashboard 9002:9002
|
||||
```
|
||||
|
||||
Visit http://localhost:9002 in your web browser. You'll see a running `dashboard` container in the Kubernetes cluster that displays a number retrieved from the `counting` service using Consul service discovery and secured over the network by TLS via an Envoy proxy.
|
||||
|
||||
![Application Dashboard](/img/guides/minikube-app-dashboard.png 'Application Dashboard')
|
||||
|
||||
### Addendum: Review the Code
|
||||
|
||||
Let's take a peek at the code. Relevant to this Kubernetes deployment are two YAML files in the `04-yaml-connect-envoy` directory. The `counting` service defines an `annotation` in the `metadata` section that instructs Consul to spin up a Consul Connect proxy for this service: `connect-inject`. The relevant port number is found in the `containerPort` section (`9001`). This Pod registers a Consul service that will be available via a secure proxy.
|
||||
|
||||
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: counting
  annotations:
    'consul.hashicorp.com/connect-inject': 'true'
spec:
  containers:
    - name: counting
      image: hashicorp/counting-service:0.0.2
      ports:
        - containerPort: 9001
          name: http
# ...
```
|
||||
|
||||
The other side is on the `dashboard` service. This declares the same `connect-inject` annotation but also adds another. The `connect-service-upstreams` in the `annotations` section configures Connect such that this Pod will have access to the `counting` service on `localhost` port `9001`. All the rest of the configuration and communication is taken care of by Consul and the Consul Helm chart.
|
||||
|
||||
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dashboard
  labels:
    app: 'dashboard'
  annotations:
    'consul.hashicorp.com/connect-inject': 'true'
    'consul.hashicorp.com/connect-service-upstreams': 'counting:9001'
spec:
  containers:
    - name: dashboard
      image: hashicorp/dashboard-service:0.0.3
      ports:
        - containerPort: 9002
          name: http
      env:
        - name: COUNTING_SERVICE_URL
          value: 'http://localhost:9001'
# ...
```
|
||||
|
||||
Within our `dashboard` application, we can access the `counting` service by communicating with `localhost:9001` as seen on the last line of this snippet. Here we are looking at an environment variable that is specific to the Go application running in a container in this Pod. Instead of providing an IP address or even a Consul service URL, we tell the application to talk to `localhost:9001` where our local end of the proxy is ready and listening. Because of the annotation to `counting:9001` earlier, we know that an instance of the `counting` service is on the other end.
|
||||
|
||||
This is what is happening in the cluster and over the network when we view the `dashboard` service in the browser.
|
||||
|
||||
-> TIP: The full source code for the Go-based web services and all code needed to build the Docker images are available in the [repo](https://github.com/hashicorp/demo-consul-101).
|
||||
|
||||
## Task 3: Use Consul Connect
|
||||
|
||||
### Step 1: Create an Intention that Denies All Service Communication by Default
|
||||
|
||||
For a final task, let's take this a step further by restricting service communication with intentions. We don't want any service to be able to communicate with any other service; only the ones we specify.
|
||||
|
||||
Begin by navigating to the _Intentions_ screen in the Consul web UI. Click the "Create" button and define an initial intention that blocks all communication between any services by default. Choose `*` as the source and `*` as the destination. Choose the _Deny_ radio button and add an optional description. Click "Save."
|
||||
|
||||
![Connect Deny](/img/guides/minikube-connect-deny.png 'Connect Deny')
|
||||
|
||||
Verify this by returning to the application dashboard where you will see that the "Counting Service is Unreachable."
|
||||
|
||||
![Application is Unreachable](/img/guides/minikube-connect-unreachable.png 'Application is Unreachable')
|
||||
|
||||
### Step 2: Allow the Application Dashboard to Connect to the Counting Service
|
||||
|
||||
Finally, the easy part. Click the "Create" button again and create an intention that allows the `dashboard` source service to talk to the `counting` destination service. Ensure that the "Allow" radio button is selected. Optionally add a description. Click "Save."
|
||||
|
||||
![Allow](/img/guides/minikube-connect-allow.png 'Allow')
|
||||
|
||||
This action does not require a reboot. It takes effect so quickly that by the time you visit the application dashboard, you'll see that it's successfully communicating with the backend `counting` service again.
|
||||
|
||||
And there we have Consul running on a Kubernetes cluster, as demonstrated by two services which communicate with each other via Consul Connect and an Envoy proxy.
|
||||
|
||||
![Success](/img/guides/minikube-connect-success.png 'Success')
|
||||
|
||||
## Reference
|
||||
|
||||
For more on Consul's integration with Kubernetes (including multi-cloud, service sync, and other features), see the [Consul with Kubernetes](/docs/platform/k8s) documentation.
|
|
@ -1,317 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Monitoring Consul with Telegraf
|
||||
description: >-
|
||||
Best practice approaches for monitoring a production Consul cluster with
|
||||
Telegraf
|
||||
---
|
||||
|
||||
# Monitoring Consul with Telegraf
|
||||
|
||||
Consul makes a range of metrics in various formats available so operators can
|
||||
measure the health and stability of a cluster, and diagnose or predict potential
|
||||
issues.
|
||||
|
||||
There are a number of monitoring tools and options available, but for the purposes
|
||||
of this guide we are going to use the [telegraf_plugin][] in conjunction with
|
||||
the StatsD protocol supported by Consul.
|
||||
|
||||
You can read the full list of metrics available with Consul in the [telemetry
|
||||
documentation](/docs/agent/telemetry).
|
||||
|
||||
In this guide you will:
|
||||
|
||||
- Configure Telegraf to collect StatsD and host level metrics
|
||||
- Configure Consul to send metrics to Telegraf
|
||||
- See an example of metrics visualization
|
||||
- Understand important metrics to aggregate and alert on
|
||||
|
||||
## Installing Telegraf
|
||||
|
||||
The process for installing Telegraf depends on your operating system. We
|
||||
recommend following the [official Telegraf installation
|
||||
documentation][telegraf-install].
|
||||
|
||||
## Configuring Telegraf
|
||||
|
||||
Telegraf acts as a StatsD agent and can collect additional metrics about the
|
||||
hosts where Consul agents are running. Telegraf itself ships with a wide range
|
||||
of [input plugins][telegraf-input-plugins] to collect data from lots of sources
|
||||
for this purpose.
|
||||
|
||||
We're going to enable some of the most common input plugins to monitor CPU,
|
||||
memory, disk I/O, networking, and process status, since these are useful for
|
||||
debugging Consul cluster issues.
|
||||
|
||||
The `telegraf.conf` file starts with global options:
|
||||
|
||||
```toml
|
||||
[agent]
|
||||
interval = "10s"
|
||||
flush_interval = "10s"
|
||||
omit_hostname = false
|
||||
```
|
||||
|
||||
We set the default collection interval to 10 seconds and ask Telegraf to include
|
||||
a `host` tag in each metric.
|
||||
|
||||
Telegraf also allows you to set additional tags on the
|
||||
metrics that pass through it. In this case, we are adding tags for the server
|
||||
role and datacenter. We can then use these tags in Grafana to filter queries
|
||||
(for example, to create a dashboard showing only servers with the
|
||||
`consul-server` role, or only servers in the `us-east-1` datacenter).
|
||||
|
||||
```toml
|
||||
[global_tags]
|
||||
role = "consul-server"
|
||||
datacenter = "us-east-1"
|
||||
```
|
||||
|
||||
Next, we set up a StatsD listener on UDP port 8125, with instructions to
|
||||
calculate percentile metrics and to parse DogStatsD-compatible tags, when
|
||||
they're sent:
|
||||
|
||||
```toml
|
||||
[[inputs.statsd]]
|
||||
protocol = "udp"
|
||||
service_address = ":8125"
|
||||
delete_gauges = true
|
||||
delete_counters = true
|
||||
delete_sets = true
|
||||
delete_timings = true
|
||||
percentiles = [90]
|
||||
metric_separator = "_"
|
||||
parse_data_dog_tags = true
|
||||
allowed_pending_messages = 10000
|
||||
percentile_limit = 1000
|
||||
```
|
||||
|
||||
The full reference to all the available StatsD-related options in Telegraf is
|
||||
[here][telegraf-statsd-input].
|
||||
|
||||
Now we can configure inputs for things like CPU, memory, network I/O, and disk
|
||||
I/O. Most of them don't require any configuration, but make sure the
|
||||
`interfaces` list in `inputs.net` matches the interface names you see in
|
||||
`ifconfig`.
|
||||
|
||||
```toml
|
||||
[[inputs.cpu]]
|
||||
percpu = true
|
||||
totalcpu = true
|
||||
collect_cpu_time = false
|
||||
|
||||
[[inputs.disk]]
|
||||
# mount_points = ["/"]
|
||||
# ignore_fs = ["tmpfs", "devtmpfs"]
|
||||
|
||||
[[inputs.diskio]]
|
||||
# devices = ["sda", "sdb"]
|
||||
# skip_serial_number = false
|
||||
|
||||
[[inputs.kernel]]
|
||||
# no configuration
|
||||
|
||||
[[inputs.linux_sysctl_fs]]
|
||||
# no configuration
|
||||
|
||||
[[inputs.mem]]
|
||||
# no configuration
|
||||
|
||||
[[inputs.net]]
|
||||
interfaces = ["enp0s*"]
|
||||
|
||||
[[inputs.netstat]]
|
||||
# no configuration
|
||||
|
||||
[[inputs.processes]]
|
||||
# no configuration
|
||||
|
||||
[[inputs.swap]]
|
||||
# no configuration
|
||||
|
||||
[[inputs.system]]
|
||||
# no configuration
|
||||
```
|
||||
|
||||
Another useful plugin is the [procstat][telegraf-procstat-input] plugin, which
|
||||
reports metrics for processes you select:
|
||||
|
||||
```toml
|
||||
[[inputs.procstat]]
|
||||
pattern = "(consul)"
|
||||
```
|
||||
|
||||
Telegraf even includes a [plugin][telegraf-consul-input] that monitors the
|
||||
health checks associated with the Consul agent, using the Consul API to query the
|
||||
data.
|
||||
|
||||
Note that the plugin itself does not report Consul telemetry; Consul
|
||||
already reports those stats using the StatsD protocol.
|
||||
|
||||
```toml
|
||||
[[inputs.consul]]
|
||||
address = "localhost:8500"
|
||||
scheme = "http"
|
||||
```
|
||||
|
||||
## Telegraf Configuration for Consul
|
||||
|
||||
Asking Consul to send telemetry to Telegraf is as simple as adding a `telemetry`
|
||||
section to your agent configuration:
|
||||
|
||||
```json
|
||||
{
|
||||
"telemetry": {
|
||||
"dogstatsd_addr": "localhost:8125",
|
||||
"disable_hostname": true
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
As you can see, we only need to specify two options. The `dogstatsd_addr`
|
||||
specifies the hostname and port of the StatsD daemon.
|
||||
|
||||
Note that we specify DogStatsD format instead of plain StatsD, which tells
|
||||
Consul to send [tags][tagging] with each metric. Tags can be used by Grafana to
|
||||
filter data on your dashboards (for example, displaying only the data for which
|
||||
`role=consul-server`). Telegraf is compatible with the DogStatsD format and
|
||||
allows us to add our own tags too.
|
||||
|
||||
The second option tells Consul not to insert the hostname in the names of the
|
||||
metrics it sends to StatsD, since the hostnames will be sent as tags. Without
|
||||
this option, the single metric `consul.raft.apply` would become multiple
|
||||
metrics:
|
||||
|
||||
consul.server1.raft.apply
|
||||
consul.server2.raft.apply
|
||||
consul.server3.raft.apply
|
||||
|
||||
If you are using a different agent (e.g. Circonus, Statsite, or plain StatsD),
|
||||
you may want to change this configuration, and you can find the configuration
|
||||
reference [here][consul-telemetry-config].
|
||||
|
||||
## Visualising Telegraf Consul Metrics
|
||||
|
||||
You can use a tool like [Grafana][] or [Chronograf][] to visualize metrics from
|
||||
Telegraf.
|
||||
|
||||
Here is an example Grafana dashboard:
|
||||
|
||||
[![Grafana Consul Cluster](/img/grafana-screenshot.png)](/img/grafana-screenshot.png)
|
||||
|
||||
## Metric Aggregates and Alerting from Telegraf
|
||||
|
||||
### Memory usage
|
||||
|
||||
| Metric Name | Description |
|
||||
| :------------------ | :------------------------------------------------------------- |
|
||||
| `mem.total` | Total amount of physical memory (RAM) available on the server. |
|
||||
| `mem.used_percent` | Percentage of physical memory in use. |
|
||||
| `swap.used_percent` | Percentage of swap space in use. |
|
||||
|
||||
**Why they're important:** Consul keeps all of its data in memory. If Consul
|
||||
consumes all available memory, it will crash. You should also monitor total
|
||||
available RAM to make sure some RAM is available for other processes, and swap
|
||||
usage should remain at 0% for best performance.
|
||||
|
||||
**What to look for:** If `mem.used_percent` is over 90%, or if
|
||||
`swap.used_percent` is greater than 0.
|
||||
|
||||
### File descriptors
|
||||
|
||||
| Metric Name | Description |
|
||||
| :------------------------- | :------------------------------------------------------------------ |
|
||||
| `linux_sysctl_fs.file-nr` | Number of file handles being used across all processes on the host. |
|
||||
| `linux_sysctl_fs.file-max` | Total number of available file handles. |
|
||||
|
||||
**Why it's important:** Practically anything Consul does -- receiving a
|
||||
connection from another host, sending data between servers, writing snapshots to
|
||||
disk -- requires a file descriptor handle. If Consul runs out of handles, it
|
||||
will stop accepting connections. See [the Consul FAQ][consul_faq_fds] for more
|
||||
details.
|
||||
|
||||
By default, process and kernel limits are fairly conservative. You will want to
|
||||
increase these beyond the defaults.
|
||||
|
||||
**What to look for:** If `file-nr` exceeds 80% of `file-max`.
|
||||
|
||||
### CPU usage
|
||||
|
||||
| Metric Name | Description |
|
||||
| :--------------- | :--------------------------------------------------------------- |
|
||||
| `cpu.user_cpu` | Percentage of CPU being used by user processes (such as Consul). |
|
||||
| `cpu.iowait_cpu` | Percentage of CPU time spent waiting for I/O tasks to complete. |
|
||||
|
||||
**Why they're important:** Consul is not particularly demanding of CPU time, but
|
||||
a spike in CPU usage might indicate too many operations taking place at once,
|
||||
and `iowait_cpu` is critical -- it means Consul is waiting for data to be
|
||||
written to disk, a sign that Raft might be writing snapshots to disk too often.
|
||||
|
||||
**What to look for:** If `cpu.iowait_cpu` is greater than 10%.
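The memory, file descriptor, and CPU thresholds listed so far can be collected into a single check. The Python sketch below is only an illustration of those rules: the function name `host_alerts` and the flat dictionary of metric values are assumptions for this example, not something Telegraf or Grafana provides. In practice you would express the same conditions in your alerting tool.

```python
def host_alerts(metrics):
    """Return human-readable alerts for one sample of host-level metrics.

    `metrics` is a hypothetical dict keyed by the metric names used in the
    tables above, holding the latest value of each metric.
    """
    alerts = []
    if metrics["mem.used_percent"] > 90:
        alerts.append("memory usage above 90%")
    if metrics["swap.used_percent"] > 0:
        alerts.append("swap in use")
    # file-nr should stay below 80% of file-max.
    if metrics["linux_sysctl_fs.file-nr"] > 0.8 * metrics["linux_sysctl_fs.file-max"]:
        alerts.append("file handles above 80% of maximum")
    if metrics["cpu.iowait_cpu"] > 10:
        alerts.append("CPU iowait above 10%")
    return alerts
```

An empty list means none of the thresholds described above has been crossed for that sample.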
|
||||
|
||||
### Network activity - Bytes Received
|
||||
|
||||
| Metric Name | Description |
|
||||
| :--------------- | :------------------------------------------- |
|
||||
| `net.bytes_recv` | Bytes received on each network interface. |
|
||||
| `net.bytes_sent` | Bytes transmitted on each network interface. |
|
||||
|
||||
**Why they're important:** A sudden spike in network traffic to Consul might be
|
||||
the result of a misconfigured application client causing too many requests to
|
||||
Consul. This is the raw data from the system, rather than a specific Consul
|
||||
metric.
|
||||
|
||||
**What to look for:** Sudden large changes to the `net` metrics (greater than
|
||||
50% deviation from baseline).
|
||||
|
||||
**NOTE:** The `net` metrics are counters, so in order to calculate rates (such
|
||||
as bytes/second), you will need to apply a function such as
|
||||
[non_negative_difference][].
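To make the counter-to-rate step concrete, here is a small Python sketch of the same idea: take the difference between consecutive samples, skip negative differences (which usually mean the counter was reset), and divide by the elapsed time. The function name and the `(timestamp, value)` sample format are assumptions for this example; it is not the InfluxDB implementation.

```python
def non_negative_rates(samples):
    """Convert counter samples into per-second rates.

    `samples` is a list of (timestamp_seconds, counter_value) tuples in
    chronological order. Intervals where the counter went backwards are skipped.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        delta = v1 - v0
        if delta < 0 or t1 <= t0:
            continue  # counter reset or unusable timestamps
        rates.append(delta / (t1 - t0))
    return rates
```

For example, samples taken every 10 seconds with values 1000, 6000, 500, and 2500 yield two rates, 500 and 200 bytes/second. The drop from 6000 to 500 is treated as a reset and ignored.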
|
||||
|
||||
### Disk activity
|
||||
|
||||
| Metric Name | Description |
|
||||
| :------------------- | :---------------------------------- |
|
||||
| `diskio.read_bytes` | Bytes read from each block device. |
|
||||
| `diskio.write_bytes` | Bytes written to each block device. |
|
||||
|
||||
**Why they're important:** If the Consul host is writing a lot of data to disk,
|
||||
such as under high volume workloads, there may be frequent major I/O spikes
|
||||
during leader elections. This is because under heavy load, Consul is
|
||||
checkpointing Raft snapshots to disk frequently.
|
||||
|
||||
It may also be caused by Consul having debug/trace logging enabled in
|
||||
production, which can impact performance.
|
||||
|
||||
Too much disk I/O can cause the rest of the system to slow down or become
|
||||
unavailable, as the kernel spends all its time waiting for I/O to complete.
|
||||
|
||||
**What to look for:** Sudden large changes to the `diskio` metrics (greater than
|
||||
50% deviation from baseline, or more than 3 standard deviations from baseline).
|
||||
|
||||
**NOTE:** The `diskio` metrics are counters, so in order to calculate rates
|
||||
(such as bytes/second), you will need to apply a function such as
|
||||
[non_negative_difference][].
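The "deviation from baseline" rule can be sketched as follows. This is one possible reading of the guidance above, assuming the baseline is a list of recent rate values (at least two) and that either condition, 50% away from the mean or more than 3 standard deviations, should raise a flag. The function name and parameters are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(baseline, current, max_fraction=0.5, max_sigmas=3.0):
    """Return True when `current` deviates too far from the baseline samples."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation; needs two or more values
    distance = abs(current - center)
    over_fraction = center != 0 and distance > max_fraction * abs(center)
    over_sigmas = spread > 0 and distance > max_sigmas * spread
    return bool(over_fraction or over_sigmas)
```

How the baseline window is chosen (last hour, same hour on previous days, and so on) is left open here and depends on your workload.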
|
||||
|
||||
## Summary
|
||||
|
||||
In this guide you learned how to set up Telegraf with Consul to collect metrics,
|
||||
and considered your options for visualizing, aggregating, and alerting on those
|
||||
metrics. To learn about other factors (in addition to monitoring) that you
|
||||
should consider when running Consul in production, see the [Production Checklist][prod-checklist].
|
||||
|
||||
[non_negative_difference]: https://docs.influxdata.com/influxdb/v1.5/query_language/functions/#non-negative-difference
|
||||
[consul_faq_fds]: https://www.consul.io/docs/faq.html#q-does-consul-require-certain-user-process-resource-limits-
|
||||
[telegraf_plugin]: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/consul
|
||||
[telegraf-install]: https://docs.influxdata.com/telegraf/v1.6/introduction/installation/
|
||||
[telegraf-consul-input]: https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/consul
|
||||
[telegraf-statsd-input]: https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/statsd
|
||||
[telegraf-procstat-input]: https://github.com/influxdata/telegraf/tree/release-1.6/plugins/inputs/procstat
|
||||
[telegraf-input-plugins]: https://docs.influxdata.com/telegraf/v1.6/plugins/inputs/
|
||||
[tagging]: https://docs.datadoghq.com/getting_started/tagging/
|
||||
[consul-telemetry-config]: https://www.consul.io/docs/agent/options.html#telemetry
|
||||
[consul-telemetry-ref]: https://www.consul.io/docs/agent/telemetry.html
|
||||
[grafana]: https://www.influxdata.com/partners/grafana/
|
||||
[chronograf]: https://www.influxdata.com/time-series-platform/chronograf/
|
||||
[prod-checklist]: https://learn.hashicorp.com/consul/advanced/day-1-operations/production-checklist
|
|
@ -1,273 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Partial LAN Connectivity - Configuring Network Segments
|
||||
description: >-
|
||||
Many advanced Consul users have the need to run clusters with segmented
|
||||
networks, meaning that
|
||||
|
||||
not all agents can be in a full mesh. This is usually the result of business
|
||||
policies enforced
|
||||
|
||||
via network rules or firewalls. Prior to Consul 0.9.3 this was only possible
|
||||
through federation,
|
||||
|
||||
which for some users is too heavyweight or expensive as it requires running
|
||||
multiple servers per
|
||||
|
||||
segment.
|
||||
---
|
||||
|
||||
# Network Segments [Enterprise Only]
|
||||
|
||||
~> Note, the network segment functionality described here is available only in [Consul Enterprise](https://www.hashicorp.com/products/consul/) version 0.9.3 and later.
|
||||
|
||||
Many advanced Consul users have the need to run clusters with segmented networks, meaning that
|
||||
not all agents can be in a full mesh. This is usually the result of business policies enforced
|
||||
via network rules or firewalls. Prior to Consul 0.9.3 this was only possible through federation,
|
||||
which for some users is too heavyweight or expensive as it requires running multiple servers per
|
||||
segment.
|
||||
|
||||
This guide will cover the basic configuration for setting up multiple segments, as well as
|
||||
how to configure a prepared query to limit service discovery to the services in the local agent's
|
||||
network segment.
|
||||
|
||||
To complete this guide you will need to complete the
|
||||
[Deployment Guide](https://learn.hashicorp.com/consul/advanced/day-1-operations/deployment-guide).
|
||||
|
||||
## Partial LAN Connectivity with Network Segments
|
||||
|
||||
By default, all Consul agents in one datacenter are part of a shared gossip pool over the LAN;
|
||||
this means that the partial connectivity caused by segmented networks would cause health flapping
|
||||
as nodes failed to communicate. In this guide we will cover the Network Segments feature, added
|
||||
in [Consul Enterprise](https://www.hashicorp.com/products/consul/) version 0.9.3, which allows users
|
||||
to configure Consul to support this kind of segmented network topology.
|
||||
|
||||
### Network Segments Overview
|
||||
|
||||
All Consul agents are part of the default network segment, unless a segment is specified in
|
||||
their configuration. In a standard cluster setup, all agents will normally be part of this default
|
||||
segment and as a result, part of one shared LAN gossip pool.
|
||||
|
||||
Network segments can be used to break
|
||||
up the LAN gossip pool into multiple isolated smaller pools by specifying the configuration for segments
|
||||
on the servers. Each desired segment must be given a name and port, as well as optionally a custom
|
||||
bind and advertise address for that segment's gossip listener to bind to on the server.
|
||||
|
||||
A few things to note:
|
||||
|
||||
1. Servers will be a part of all segments they have been configured with. They are the common point
|
||||
linking the different segments together. The configured list of segments is specified by the
|
||||
[`segments`](/docs/agent/options#segments) option.
|
||||
|
||||
2. Client agents can only be part of one segment at a given time, specified by the [`-segment`](/docs/agent/options#_segment) option.
|
||||
|
||||
3. Clients can only join agents in the same segment as them. If they attempt to join a client in
|
||||
another segment, or the listening port of another segment on a server, they will get a segment mismatch error.
|
||||
|
||||
Once the servers have been configured with the correct segment info, the clients only need to specify
|
||||
their own segment in the [Agent Config](/docs/agent/options#_segment) and join by connecting to another
|
||||
agent within the same segment. When joining a Consul server, the client needs to provide the server's
|
||||
port for their segment along with the address of the server when performing the join (for example,
|
||||
`consul agent -retry-join "consul.domain.internal:1234"`).
|
||||
|
||||
## Setup Network Segments
|
||||
|
||||
### Configure Consul Servers
|
||||
|
||||
To get started,
|
||||
start a server or group of servers, with the following section added to the configuration. Note, you may need to
|
||||
adjust the bind/advertise addresses for your setup.
|
||||
|
||||
```json
|
||||
{
|
||||
"segments": [
|
||||
{
|
||||
"name": "alpha",
|
||||
"bind": "{{GetPrivateIP}}",
|
||||
"advertise": "{{GetPrivateIP}}",
|
||||
"port": 8303
|
||||
},
|
||||
{
|
||||
"name": "beta",
|
||||
"bind": "{{GetPrivateIP}}",
|
||||
"advertise": "{{GetPrivateIP}}",
|
||||
"port": 8304
|
||||
}
|
||||
]
|
||||
}
|
||||
```
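
Before rolling a `segments` block out to every server, it can help to check it for clashes and to work out which address clients in each segment should join. The following is a minimal sketch, not part of Consul: `segment_join_addresses` is a hypothetical helper, and it assumes all segments bind to the same server IP, so each one needs its own port.

```python
import json


def segment_join_addresses(config, server_ip):
    """Return {segment name: "ip:port"} for a server's `segments` list.

    Hypothetical helper. Assumes every segment listener binds to the same
    address, so names and ports must both be unique.
    """
    seen_names, seen_ports, joins = set(), set(), {}
    for seg in config["segments"]:
        name, port = seg["name"], seg["port"]
        if name in seen_names:
            raise ValueError(f"duplicate segment name: {name}")
        if port in seen_ports:
            raise ValueError(f"duplicate segment port: {port}")
        seen_names.add(name)
        seen_ports.add(port)
        joins[name] = f"{server_ip}:{port}"
    return joins


# Same names and ports as the configuration above; the IP is the example
# server address used later in this guide.
config = json.loads("""
{
  "segments": [
    {"name": "alpha", "port": 8303},
    {"name": "beta", "port": 8304}
  ]
}
""")
joins = segment_join_addresses(config, "192.168.0.4")
print(joins["alpha"])
print(joins["beta"])
```

Each resulting address is what a client in that segment would pass to its join option.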
|
||||
|
||||
You should see a log message on the servers for each segment's listener as the agent starts up.
|
||||
|
||||
```shell
|
||||
2017/08/30 19:05:13 [INFO] serf: EventMemberJoin: server1.dc1 192.168.0.4
|
||||
2017/08/30 19:05:13 [INFO] serf: EventMemberJoin: server1 192.168.0.4
|
||||
2017/08/30 19:05:13 [INFO] consul: Started listener for LAN segment "alpha" on 192.168.0.4:8303
|
||||
2017/08/30 19:05:13 [INFO] serf: EventMemberJoin: server1 192.168.0.4
|
||||
2017/08/30 19:05:13 [INFO] consul: Started listener for LAN segment "beta" on 192.168.0.4:8304
|
||||
2017/08/30 19:05:13 [INFO] serf: EventMemberJoin: server1 192.168.0.4
|
||||
```
|
||||
|
||||
Running `consul members` should show the server as being part of all segments.
|
||||
|
||||
```shell
|
||||
(server1) $ consul members
|
||||
Node Address Status Type Build Protocol DC Segment
|
||||
server1 192.168.0.4:8301 alive server 0.9.3+ent 2 dc1 <all>
|
||||
```
|
||||
|
||||
### Configure Consul Clients in Different Network Segments
|
||||
|
||||
Next, start a client agent in the 'alpha' segment, with `-join` set to the server's
|
||||
address/port for that segment.
|
||||
|
||||
```shell
|
||||
(client1) $ consul agent ... -join 192.168.0.4:8303 -node client1 -segment alpha
|
||||
```
|
||||
|
||||
After the join is successful, we should see the client show up by running the `consul members` command
|
||||
on the server again.
|
||||
|
||||
```shell
|
||||
(server1) $ consul members
|
||||
Node Address Status Type Build Protocol DC Segment
|
||||
server1 192.168.0.4:8301 alive server 0.9.3+ent 2 dc1 <all>
|
||||
client1 192.168.0.5:8301 alive client 0.9.3+ent 2 dc1 alpha
|
||||
```
|
||||
|
||||
Now join another client in segment 'beta' and run the `consul members` command another time.
|
||||
|
||||
```shell
|
||||
(client2) $ consul agent ... -join 192.168.0.4:8304 -node client2 -segment beta
|
||||
```
|
||||
|
||||
```shell
|
||||
(server1) $ consul members
|
||||
Node Address Status Type Build Protocol DC Segment
|
||||
server1 192.168.0.4:8301 alive server 0.9.3+ent 2 dc1 <all>
|
||||
client1 192.168.0.5:8301 alive client 0.9.3+ent 2 dc1 alpha
|
||||
client2 192.168.0.6:8301 alive client 0.9.3+ent 2 dc1 beta
|
||||
```
|
||||
|
||||
### Filter Segmented Nodes
|
||||
|
||||
If we pass the `-segment` flag when running `consul members`, we can limit the view to agents
|
||||
in a specific segment.
|
||||
|
||||
```shell
|
||||
(server1) $ consul members -segment alpha
|
||||
Node Address Status Type Build Protocol DC Segment
|
||||
client1 192.168.0.5:8301 alive client 0.9.3+ent 2 dc1 alpha
|
||||
server1 192.168.0.4:8303 alive server 0.9.3+ent 2 dc1 alpha
|
||||
```
|
||||
|
||||
Using the `consul catalog nodes` command, we can filter on an internal metadata key,
|
||||
`consul-network-segment`, which stores the network segment of the node.
|
||||
|
||||
```shell
|
||||
(server1) $ consul catalog nodes -node-meta consul-network-segment=alpha
|
||||
Node ID Address DC
|
||||
client1 4c29819c 192.168.0.5 dc1
|
||||
```
|
||||
|
||||
With this metadata key, we can construct a [Prepared Query](/api/query) that can be used
|
||||
for DNS to return only services within the same network segment as the local agent.
|
||||
|
||||
## Configure a Prepared Query to Limit Service Discovery
|
||||
|
||||
### Create Services
|
||||
|
||||
First, register a service on each of the client nodes.
|
||||
|
||||
```shell
|
||||
(client1) $ curl \
|
||||
--request PUT \
|
||||
--data '{"Name": "redis", "Port": 8000}' \
|
||||
localhost:8500/v1/agent/service/register
|
||||
```
|
||||
|
||||
```shell
|
||||
(client2) $ curl \
|
||||
--request PUT \
|
||||
--data '{"Name": "redis", "Port": 9000}' \
|
||||
localhost:8500/v1/agent/service/register
|
||||
```
|
||||
|
||||
### Create the Prepared Query
|
||||
|
||||
Next, create the prepared query by sending the following JSON body to the `/v1/query` HTTP endpoint.
|
||||
|
||||
```shell
|
||||
(server1) $ curl \
|
||||
--request POST \
|
||||
--data \
|
||||
'{
|
||||
"Name": "",
|
||||
"Template": {
|
||||
"Type": "name_prefix_match"
|
||||
},
|
||||
"Service": {
|
||||
"Service": "${name.full}",
|
||||
"NodeMeta": {"consul-network-segment": "${agent.segment}"}
|
||||
}
|
||||
}' localhost:8500/v1/query
|
||||
|
||||
{"ID":"6f49dd24-de9b-0b6c-fd29-525eca069419"}
|
||||
```
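
If you would rather keep the query body in a file than inline it in the shell, a small script can generate it. This is only a sketch: the function names and the `query.json` location are assumptions chosen for illustration, and the body mirrors the one posted above.

```python
import json
import tempfile
from pathlib import Path


def segment_query_body():
    # Same fields as the request above: a template with an empty Name,
    # filtering on the node's consul-network-segment metadata.
    return {
        "Name": "",
        "Template": {"Type": "name_prefix_match"},
        "Service": {
            "Service": "${name.full}",
            "NodeMeta": {"consul-network-segment": "${agent.segment}"},
        },
    }


def write_query_file(path):
    """Serialize the query body to `path` (hypothetical helper)."""
    path = Path(path)
    path.write_text(json.dumps(segment_query_body(), indent=2) + "\n")
    return path


# Written to a temporary directory here so the sketch has no side effects.
with tempfile.TemporaryDirectory() as tmp:
    query_path = write_query_file(Path(tmp) / "query.json")
    loaded = json.loads(query_path.read_text())

print(loaded["Service"]["NodeMeta"])
```

A file produced this way can be sent with `curl --request POST --data @query.json localhost:8500/v1/query`.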
|
||||
|
||||
### Test the Segments with DNS Lookups
|
||||
|
||||
Now, we can replace any DNS lookups of the form `<service>.service.consul` with
|
||||
`<service>.query.consul` to look up only services within the same network segment.
|
||||
|
||||
**Client 1:**
|
||||
|
||||
```shell
|
||||
(client1) $ dig @127.0.0.1 -p 8600 redis.query.consul SRV
|
||||
|
||||
; <<>> DiG 9.8.3-P1 <<>> @127.0.0.1 -p 8600 redis.query.consul SRV
|
||||
; (1 server found)
|
||||
;; global options: +cmd
|
||||
;; Got answer:
|
||||
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3149
|
||||
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
|
||||
;; WARNING: recursion requested but not available
|
||||
|
||||
;; QUESTION SECTION:
|
||||
;redis.query.consul. IN SRV
|
||||
|
||||
;; ANSWER SECTION:
|
||||
redis.query.consul. 0 IN SRV 1 1 8000 client1.node.dc1.consul.
|
||||
|
||||
;; ADDITIONAL SECTION:
|
||||
client1.node.dc1.consul. 0 IN A 192.168.0.5
|
||||
```
|
||||
|
||||
**Client 2:**
|
||||
|
||||
```shell
|
||||
(client2) $ dig @127.0.0.1 -p 8600 redis.query.consul SRV
|
||||
|
||||
; <<>> DiG 9.8.3-P1 <<>> @127.0.0.1 -p 8600 redis.query.consul SRV
|
||||
; (1 server found)
|
||||
;; global options: +cmd
|
||||
;; Got answer:
|
||||
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3149
|
||||
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
|
||||
;; WARNING: recursion requested but not available
|
||||
|
||||
;; QUESTION SECTION:
|
||||
;redis.query.consul. IN SRV
|
||||
|
||||
;; ANSWER SECTION:
|
||||
redis.query.consul. 0 IN SRV 1 1 9000 client2.node.dc1.consul.
|
||||
|
||||
;; ADDITIONAL SECTION:
|
||||
client2.node.dc1.consul. 0 IN A 192.168.0.6
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
In this guide you configured the Consul agents to participate in partial
|
||||
LAN gossip based on network segments. You then set up a couple of services and
|
||||
a prepared query to test the segments.
|
|
@ -1,255 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Outage Recovery
|
||||
description: >-
|
||||
Don't panic! This is a critical first step. Depending on your deployment
|
||||
configuration, it may take only a single server failure for cluster
|
||||
unavailability. Recovery requires an operator to intervene, but recovery is
|
||||
straightforward.
|
||||
---
|
||||
|
||||
# Outage Recovery
|
||||
|
||||
Don't panic! This is a critical first step.
|
||||
|
||||
Depending on your
|
||||
[deployment configuration](/docs/internals/consensus#deployment_table), it
|
||||
may take only a single server failure for cluster unavailability. Recovery
|
||||
requires an operator to intervene, but the process is straightforward.
|
||||
|
||||
This guide is for recovery from a Consul outage due to a majority
|
||||
of server nodes in a datacenter being lost. There are several types
|
||||
of outages, depending on the number of server nodes and number of failed
|
||||
server nodes. We will outline how to recover from:
|
||||
|
||||
- Failure of a Single Server Cluster. This is when you have a single Consul
|
||||
server and it fails.
|
||||
- Failure of a Server in a Multi-Server Cluster. This is when one server fails and
|
||||
the Consul cluster has 3 or more servers.
|
||||
- Failure of Multiple Servers in a Multi-Server Cluster. This is when more than one
|
||||
Consul server fails in a cluster of 3 or more servers. This scenario is potentially
|
||||
the most serious, because it can result in data loss.
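
Which of these scenarios you are in depends on quorum arithmetic: a Raft cluster needs a majority of its servers to be available. The sketch below only illustrates that arithmetic; the function names are made up for this example and are not Consul APIs.

```python
def quorum(servers: int) -> int:
    """Smallest majority of a cluster with `servers` voting members."""
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """How many servers can fail while a majority remains."""
    return servers - quorum(servers)


for n in (1, 3, 5):
    print(f"servers={n} quorum={quorum(n)} tolerated failures={failure_tolerance(n)}")
```

A single-server cluster tolerates no failures, which is why the first scenario always results in an outage.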
|
||||
|
||||
## Failure of a Single Server Cluster
|
||||
|
||||
If you had only a single server and it has failed, simply restart it. A
|
||||
single server configuration requires the
|
||||
[`-bootstrap`](/docs/agent/options#_bootstrap) or
|
||||
[`-bootstrap-expect=1`](/docs/agent/options#_bootstrap_expect)
|
||||
flag.
|
||||
|
||||
```shell
|
||||
consul agent -bootstrap-expect=1
|
||||
```
|
||||
|
||||
If the server cannot be recovered, you need to bring up a new
|
||||
server using the [deployment guide](https://www.consul.io/docs/guides/deployment-guide.html).
|
||||
|
||||
In the case of an unrecoverable server failure in a single server cluster and
|
||||
no backup procedure, data loss is inevitable since data was not replicated
|
||||
to any other servers. This is why a single server deploy is **never** recommended.
|
||||
|
||||
Any services registered with agents will be re-populated when the new server
|
||||
comes online as agents perform [anti-entropy](/docs/internals/anti-entropy).
|
||||
|
||||
## Failure of a Server in a Multi-Server Cluster
|
||||
|
||||
If you think the failed server is recoverable, the easiest option is to bring
|
||||
it back online and have it rejoin the cluster with the same IP address, returning
|
||||
the cluster to a fully healthy state. Similarly, even if you need to rebuild a
|
||||
new Consul server to replace the failed node, you may wish to do that immediately.
|
||||
Keep in mind that the rebuilt server needs to have the same IP address as the failed
|
||||
server. Again, once this server is online and has rejoined, the cluster will return
|
||||
to a fully healthy state.
|
||||
|
||||
```shell
|
||||
consul agent -bootstrap-expect=3 -bind=192.172.2.4 -retry-join=192.172.2.3
|
||||
```
|
||||
|
||||
Both of these strategies involve a potentially lengthy time to reboot or rebuild
|
||||
a failed server. If this is impractical or if building a new server with the same
|
||||
IP isn't an option, you need to remove the failed server. Usually, you can issue
|
||||
a [`consul force-leave`](/docs/commands/force-leave) command to remove the failed
|
||||
server if it's still a member of the cluster.
|
||||
|
||||
```shell
|
||||
consul force-leave <node.name.consul>
|
||||
```
|
||||
|
||||
If [`consul force-leave`](/docs/commands/force-leave) isn't able to remove the
|
||||
server, you have two methods available to remove it, depending on your version of Consul:
|
||||
|
||||
- In Consul 0.7 and later, you can use the [`consul operator`](/docs/commands/operator#raft-remove-peer) command to remove the stale peer server on the fly with no downtime if the cluster has a leader.
|
||||
|
||||
- In versions of Consul prior to 0.7, you can manually remove the stale peer
|
||||
server using the `raft/peers.json` recovery file on all remaining servers. See
|
||||
the [section below](#peers.json) for details on this procedure. This process
|
||||
requires a Consul downtime to complete.
|
||||
|
||||
In Consul 0.7 and later, you can use the [`consul operator`](/docs/commands/operator#raft-list-peers)
|
||||
command to inspect the Raft configuration:
|
||||
|
||||
```
|
||||
$ consul operator raft list-peers
|
||||
Node ID Address State Voter RaftProtocol
|
||||
alice 10.0.1.8:8300 10.0.1.8:8300 follower true 3
|
||||
bob 10.0.1.6:8300 10.0.1.6:8300 leader true 3
|
||||
carol 10.0.1.7:8300 10.0.1.7:8300 follower true 3
|
||||
```
|
||||
|
||||
## Failure of Multiple Servers in a Multi-Server Cluster
|
||||
|
||||
In the event that multiple servers are lost, causing a loss of quorum and a
|
||||
complete outage, partial recovery is possible using data on the remaining
|
||||
servers in the cluster. There may be data loss in this situation because multiple
|
||||
servers were lost, so information about what's committed could be incomplete.
|
||||
The recovery process implicitly commits all outstanding Raft log entries, so
|
||||
it's also possible to commit data that was uncommitted before the failure.
|
||||
|
||||
See the section below on manual recovery using peers.json for details of the recovery procedure. You
|
||||
include only the remaining servers in the `raft/peers.json` recovery file.
|
||||
The cluster should be able to elect a leader once the remaining servers are all
|
||||
restarted with an identical `raft/peers.json` configuration.
|
||||
|
||||
Any new servers you introduce later can be fresh with totally clean data directories
|
||||
and joined using Consul's `join` command.
|
||||
|
||||
```shell
|
||||
consul agent -join=192.172.2.3
|
||||
```
|
||||
|
||||
In extreme cases, it should be possible to recover with just a single remaining
|
||||
server by starting that single server with itself as the only peer in the
|
||||
`raft/peers.json` recovery file.
|
||||
|
||||
Prior to Consul 0.7 it wasn't always possible to recover from certain
|
||||
types of outages with `raft/peers.json` because this was ingested before any Raft
|
||||
log entries were played back. In Consul 0.7 and later, the `raft/peers.json`
|
||||
recovery file is final, and a snapshot is taken after it is ingested, so you are
|
||||
guaranteed to start with your recovered configuration. This does implicitly commit
|
||||
all Raft log entries, so should only be used to recover from an outage, but it
|
||||
should allow recovery from any situation where there's some cluster data available.
|
||||
|
||||
<a name="peers.json"></a>
|
||||
|
||||
### Manual Recovery Using peers.json
|
||||
|
||||
To begin, stop all remaining servers. You can attempt a graceful leave,
|
||||
but it will not work in most cases. Do not worry if the leave exits with an
|
||||
error. The cluster is in an unhealthy state, so this is expected.
|
||||
|
||||
In Consul 0.7 and later, the `peers.json` file is no longer present
|
||||
by default and is only used when performing recovery. This file will be deleted
|
||||
after Consul starts and ingests this file. Consul 0.7 also uses a new, automatically-
|
||||
created `raft/peers.info` file to avoid ingesting the `raft/peers.json` file on the
|
||||
first start after upgrading. Be sure to leave `raft/peers.info` in place for proper
|
||||
operation.
|
||||
|
||||
Using `raft/peers.json` for recovery can cause uncommitted Raft log entries to be
|
||||
implicitly committed, so this should only be used after an outage where no
|
||||
other option is available to recover a lost server. Make sure you don't have
|
||||
any automated processes that will put the peers file in place on a
|
||||
periodic basis.
|
||||
|
||||
The next step is to go to the [`-data-dir`](/docs/agent/options#_data_dir)
|
||||
of each Consul server. Inside that directory, there will be a `raft/`
|
||||
sub-directory. We need to create a `raft/peers.json` file. The format of this file
|
||||
depends on what the server has configured for its
|
||||
[Raft protocol](/docs/agent/options#_raft_protocol) version.
|
||||
|
||||
For Raft protocol version 2 and earlier, this should be formatted as a JSON
|
||||
array containing the address and port of each Consul server in the cluster, like
|
||||
this:
|
||||
|
||||
```json
|
||||
["10.1.0.1:8300", "10.1.0.2:8300", "10.1.0.3:8300"]
|
||||
```
|
||||
|
||||
For Raft protocol version 3 and later, this should be formatted as a JSON
|
||||
array containing the node ID, address:port, and suffrage information of each
|
||||
Consul server in the cluster, like this:
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"id": "adf4238a-882b-9ddc-4a9d-5b6758e4159e",
|
||||
"address": "10.1.0.1:8300",
|
||||
"non_voter": false
|
||||
},
|
||||
{
|
||||
"id": "8b6dda82-3103-11e7-93ae-92361f002671",
|
||||
"address": "10.1.0.2:8300",
|
||||
"non_voter": false
|
||||
},
|
||||
{
|
||||
"id": "97e17742-3103-11e7-93ae-92361f002671",
|
||||
"address": "10.1.0.3:8300",
|
||||
"non_voter": false
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
- `id` `(string: <required>)` - Specifies the [node ID](/docs/agent/options#_node_id)
|
||||
of the server. This can be found in the logs when the server starts up if it was auto-generated,
|
||||
and it can also be found inside the `node-id` file in the server's data directory.
|
||||
|
||||
- `address` `(string: <required>)` - Specifies the IP and port of the server. The port is the
|
||||
server's RPC port used for cluster communications.
|
||||
|
||||
- `non_voter` `(bool: <false>)` - This controls whether the server is a non-voter, which is used
|
||||
in some advanced [Autopilot](/docs/guides/autopilot) configurations. If omitted, it will
|
||||
default to false, which is typical for most clusters.
|
||||
|
||||
Simply create entries for all servers. You must confirm that servers you do not include here have
|
||||
indeed failed and will not later rejoin the cluster. Ensure that this file is the same across all
|
||||
remaining server nodes.
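
Because the file must be identical on every remaining server, generating it once and copying it out can be less error-prone than editing it by hand. The following is a minimal sketch for the Raft protocol version 3 format described above; `build_peers_json` is a hypothetical helper, and the IDs and addresses are the example values from this guide.

```python
import json


def build_peers_json(servers):
    """Render peers.json content for Raft protocol version 3.

    `servers` is an iterable of (node_id, "ip:port") pairs for the servers
    that survived. Every entry is written as a voter.
    """
    peers = []
    for node_id, address in servers:
        host, _, port = address.rpartition(":")
        if not host or not port.isdigit():
            raise ValueError(f"expected ip:port, got {address!r}")
        peers.append({"id": node_id, "address": address, "non_voter": False})
    return json.dumps(peers, indent=2)


surviving = [
    ("adf4238a-882b-9ddc-4a9d-5b6758e4159e", "10.1.0.1:8300"),
    ("8b6dda82-3103-11e7-93ae-92361f002671", "10.1.0.2:8300"),
]
peers_json = build_peers_json(surviving)
print(peers_json)
```

The output would then be saved as `raft/peers.json` inside the data directory of each remaining server.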
|
||||
|
||||
At this point, you can restart all the remaining servers. In Consul 0.7 and
|
||||
later you will see them ingest the recovery file:
|
||||
|
||||
```text
|
||||
...
|
||||
2016/08/16 14:39:20 [INFO] consul: found peers.json file, recovering Raft configuration...
|
||||
2016/08/16 14:39:20 [INFO] consul.fsm: snapshot created in 12.484µs
|
||||
2016/08/16 14:39:20 [INFO] snapshot: Creating new snapshot at /tmp/peers/raft/snapshots/2-5-1471383560779.tmp
|
||||
2016/08/16 14:39:20 [INFO] consul: deleted peers.json file after successful recovery
|
||||
2016/08/16 14:39:20 [INFO] raft: Restored from snapshot 2-5-1471383560779
|
||||
2016/08/16 14:39:20 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:10.212.15.121:8300 Address:10.212.15.121:8300}]
|
||||
...
|
||||
```
|
||||
|
||||
If any servers managed to perform a graceful leave, you may need to have them
|
||||
rejoin the cluster using the [`join`](/docs/commands/join) command:
|
||||
|
||||
```text
|
||||
$ consul join <Node Address>
|
||||
Successfully joined cluster by contacting 1 nodes.
|
||||
```
|
||||
|
||||
It should be noted that any existing member can be used to rejoin the cluster
|
||||
as the gossip protocol will take care of discovering the server nodes.
|
||||
|
||||
At this point, the cluster should be in an operable state again. One of the
|
||||
nodes should claim leadership and emit a log like:
|
||||
|
||||
```text
|
||||
[INFO] consul: cluster leadership acquired
|
||||
```
|
||||
|
||||
In Consul 0.7 and later, you can use the [`consul operator`](/docs/commands/operator#raft-list-peers)
|
||||
command to inspect the Raft configuration:
|
||||
|
||||
```
|
||||
$ consul operator raft list-peers
|
||||
Node ID Address State Voter RaftProtocol
|
||||
alice 10.0.1.8:8300 10.0.1.8:8300 follower true 3
|
||||
bob 10.0.1.6:8300 10.0.1.6:8300 leader true 3
|
||||
carol 10.0.1.7:8300 10.0.1.7:8300 follower true 3
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
In this guide we reviewed how to recover from a Consul server outage. Depending on the
|
||||
quorum size and number of failed servers, the recovery process will vary. In the event of
|
||||
complete failure it is beneficial to have a
|
||||
[backup process](https://www.consul.io/docs/guides/deployment-guide.html#backups).
|
|
@ -1,177 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Semaphore
|
||||
description: >-
|
||||
This guide demonstrates how to implement a distributed semaphore using the
|
||||
Consul KV store.
|
||||
---
|
||||
|
||||
# Semaphore
|
||||
|
||||
A distributed semaphore can be useful when you want to coordinate many services, while
|
||||
restricting access to certain resources. In this guide we will focus on using Consul's support for
|
||||
sessions and Consul KV to build a distributed
|
||||
semaphore. Note that there are a number of ways to build a semaphore; we will not cover all of them in this guide.
|
||||
|
||||
To complete this guide successfully, you should have familiarity with
|
||||
[Consul KV](/docs/agent/kv) and Consul [sessions](/docs/internals/sessions).
|
||||
|
||||
~> If you only need mutual exclusion or leader election,
|
||||
[this guide](/docs/guides/leader-election)
|
||||
provides a simpler algorithm that can be used instead.
|
||||
|
||||
## Contending Nodes in the Semaphore
|
||||
|
||||
Let's imagine we have a set of nodes that are attempting to acquire a slot in the
|
||||
semaphore. All nodes that are participating should agree on three decisions:
|
||||
|
||||
- the prefix in the KV store used to coordinate.
|
||||
- a single key to use as a lock.
|
||||
- a limit on the number of slot holders.
|
||||
|
||||
### Session
|
||||
|
||||
The first step is for each contending node to create a session. Sessions allow us to build a system that
|
||||
can gracefully handle failures.
|
||||
|
||||
This is done using the
|
||||
[Session HTTP API](/api/session#session_create).
|
||||
|
||||
```shell
|
||||
curl -X PUT -d '{"Name": "db-semaphore"}' \
|
||||
http://localhost:8500/v1/session/create
|
||||
```
|
||||
|
||||
This will return a JSON object containing the session ID.
|
||||
|
||||
```json
|
||||
{
|
||||
"ID": "4ca8e74b-6350-7587-addf-a18084928f3c"
|
||||
}
|
||||
```
|
||||
|
||||
-> **Note:** Sessions by default only make use of the gossip failure detector. That is, the session is considered held by a node as long as the default Serf health check has not declared the node unhealthy. Additional checks can be specified at session creation if desired.
|
||||
|
||||
### KV Entry for Node Locks
|
||||
|
||||
Next, we create a lock contender entry. Each contender creates a KV entry that is tied
|
||||
to a session. This is done so that if a contender is holding a slot and fails, its session
|
||||
is detached from the key, which can then be detected by the other contenders.
|
||||
|
||||
Create the contender key by doing an `acquire` on `<prefix>/<session>` via `PUT`.
|
||||
|
||||
```shell
|
||||
curl -X PUT -d <body> http://localhost:8500/v1/kv/<prefix>/<session>?acquire=<session>
|
||||
```
|
||||
|
||||
`body` can be used to associate a meaningful value with the contender, such as its node’s name.
|
||||
This body is opaque to Consul but can be useful for human operators.
|
||||
|
||||
The `<session>` value is the ID returned by the call to
|
||||
[`/v1/session/create`](/api/session#session_create).
|
||||
|
||||
The call will either return `true` or `false`. If `true`, the contender entry has been
|
||||
created. If `false`, the contender entry was not created; it's likely that this indicates
|
||||
a session invalidation.
|
||||
|
||||
### Single Key for Coordination
|
||||
|
||||
The next step is to create a single key to coordinate which holders are currently
|
||||
reserving a slot. A good choice for this lock key is simply `<prefix>/.lock`. We will
|
||||
refer to this special coordinating key as `<lock>`.
|
||||
|
||||
```shell
|
||||
curl -X PUT -d <body> http://localhost:8500/v1/kv/<lock>?cas=0
|
||||
```
|
||||
|
||||
Since the lock is being created, a `cas` index of 0 is used so that the key is only put if it does not exist.
|
||||
|
||||
The `body` of the request should contain both the intended slot limit for the semaphore and the session IDs
|
||||
of the current holders (initially only of the creator). A simple JSON body like the following works.
|
||||
|
||||
```json
|
||||
{
|
||||
"Limit": 2,
|
||||
"Holders": ["<session>"]
|
||||
}
|
||||
```
|
||||
|
||||
## Semaphore Management
|
||||
|
||||
The current state of the semaphore is read by doing a `GET` on the entire `<prefix>`.
|
||||
|
||||
```shell
|
||||
curl http://localhost:8500/v1/kv/<prefix>?recurse
|
||||
```
|
||||
|
||||
Within the list of the entries, we should find two keys: the `<lock>` and the
|
||||
contender key `<prefix>/<session>`.
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"LockIndex": 0,
|
||||
"Key": "<lock>",
|
||||
"Flags": 0,
|
||||
"Value": "eyJMaW1pdCI6IDIsIkhvbGRlcnMiOlsiPHNlc3Npb24+Il19",
|
||||
"Session": "",
|
||||
"CreateIndex": 898,
|
||||
"ModifyIndex": 901
|
||||
},
|
||||
{
|
||||
"LockIndex": 1,
|
||||
"Key": "<prefix>/<session>",
|
||||
"Flags": 0,
|
||||
"Value": null,
|
||||
"Session": "<session>",
|
||||
"CreateIndex": 897,
|
||||
"ModifyIndex": 897
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
Note that the `Value` we embedded into `<lock>` is Base64 encoded when returned by the API.
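
As a quick illustration, the `Value` from the response above can be decoded with a few lines of Python. This is just a sketch using the standard library; the variable names are arbitrary.

```python
import base64
import json

# The Base64 string returned for <lock> in the example response above.
raw_value = "eyJMaW1pdCI6IDIsIkhvbGRlcnMiOlsiPHNlc3Npb24+Il19"

# Decode to bytes, then parse the JSON document the creator stored.
lock_body = json.loads(base64.b64decode(raw_value))

print(lock_body["Limit"])    # 2
print(lock_body["Holders"])  # ['<session>']
```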
|
||||
|
||||
When the `<lock>` is read and its `Value` is decoded, we can verify the `Limit` agrees with the `Holders` count.
|
||||
This is used to detect a potential conflict. The next step is to determine which of the current
|
||||
slot holders are still alive. As part of the results of the `GET`, we also have all the contender
|
||||
entries. By scanning those entries, we create a set of all the `Session` values. Any of the
|
||||
`Holders` that are not in that set are pruned. In effect, we are creating a set of live contenders
|
||||
based on the list results and doing a set difference with the `Holders` to detect and prune
|
||||
any potentially failed holders. In this example `<session>` is present in `Holders` and
|
||||
is attached to the key `<prefix>/<session>`, so no pruning is required.
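
The pruning step described above is a set intersection, which can be sketched as follows. The function name, the `service/db` prefix, and the session IDs `s1` and `s2` are all hypothetical; `entries` stands for the decoded JSON list returned by the recursive `GET`.

```python
def live_holders(holders, entries, lock_key):
    """Keep only the holders whose contender key still has a session.

    A contender entry with no session attached (or without a Session field)
    is treated as failed.
    """
    live_sessions = {
        entry["Session"]
        for entry in entries
        if entry["Key"] != lock_key and entry.get("Session")
    }
    # Preserve the original order of the Holders list.
    return [h for h in holders if h in live_sessions]


entries = [
    {"Key": "service/db/.lock", "Session": ""},
    {"Key": "service/db/s1", "Session": "s1"},
    {"Key": "service/db/s2", "Session": ""},  # session was invalidated
]
pruned = live_holders(["s1", "s2"], entries, "service/db/.lock")
print(pruned)  # ['s1']
```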
|
||||
|
||||
If the number of holders after pruning is less than the limit, a contender attempts acquisition
|
||||
by adding its own session to the `Holders` list and doing a Check-And-Set update of the `<lock>`.
|
||||
This performs an optimistic update.
|
||||
|
||||
This is done with:
|
||||
|
||||
```shell
|
||||
curl -X PUT -d <Updated Lock Body> http://localhost:8500/v1/kv/<lock>?cas=<lock-modify-index>
|
||||
```
|
||||
|
||||
`lock-modify-index` is the latest `ModifyIndex` value known for `<lock>`, 901 in this example.
|
||||
|
||||
If this request succeeds with `true`, the contender now holds a slot in the semaphore.
|
||||
If this fails with `false`, then likely there was a race with another contender to acquire the slot.
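
The decision a contender makes before issuing the Check-And-Set can be sketched like this. It is an illustration under stated assumptions rather than a complete client: `plan_acquire` is a hypothetical name, `live` is the already-pruned holder list, and the contender is assumed not to hold a slot yet.

```python
import json


def plan_acquire(lock_body, live, session, modify_index):
    """Return (cas_index, new_body_json), or None when no slot is free.

    `lock_body` is the decoded <lock> value, `live` the pruned holders,
    and `modify_index` the ModifyIndex last read for <lock>.
    """
    if len(live) >= lock_body["Limit"]:
        return None  # semaphore is full; watch <prefix> and retry later
    new_body = {"Limit": lock_body["Limit"], "Holders": list(live) + [session]}
    return modify_index, json.dumps(new_body)


plan = plan_acquire({"Limit": 2, "Holders": ["s1"]}, ["s1"], "s2", 901)
print(plan)

full = plan_acquire({"Limit": 2, "Holders": ["s1", "s3"]}, ["s1", "s3"], "s2", 905)
print(full)  # None
```

The returned index goes into the `cas` parameter and the JSON string becomes the request body. If the `PUT` returns `false`, the contender re-reads `<prefix>` and plans again.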
|
||||
|
||||
To re-attempt the acquisition, we watch for changes on `<prefix>`. This is because a slot
|
||||
may be released, a node may fail, etc. Watching for changes is done via a blocking query
|
||||
against `/kv/<prefix>?recurse`.
|
||||
|
||||
Slot holders **must** continuously watch for changes to `<prefix>` since their slot can be
|
||||
released by an operator or automatically released due to a false positive in the failure detector.
|
||||
On changes to `<prefix>` the lock’s `Holders` list must be re-checked to ensure the slot
|
||||
is still held. Additionally, if the watch fails to connect, the slot should be considered lost.
|
||||
|
||||
This semaphore system is purely _advisory_. Therefore it is up to the client to verify
|
||||
that a slot is held before (and during) execution of some critical operation.
|
||||
|
||||
Lastly, if a slot holder ever wishes to release its slot voluntarily, it should be done by doing a
|
||||
Check-And-Set operation against `<lock>` to remove its session from the `Holders` list.
|
||||
Once that is done, both its contender key `<prefix>/<session>` and session should be deleted.
|
||||
|
||||
## Summary
|
||||
|
||||
In this guide we created a distributed semaphore using Consul KV and Consul sessions. We also learned how to manage the newly created semaphore.
|
|
@ -1,75 +0,0 @@
|
|||
---
|
||||
layout: docs
|
||||
page_title: Windows Service
|
||||
description: >-
|
||||
By using the _sc_ command either on PowerShell or
|
||||
|
||||
the Windows command line, you can make Consul run as a service. For more
|
||||
details about the _sc_ command
|
||||
|
||||
the Windows page for
|
||||
[sc](https://msdn.microsoft.com/en-us/library/windows/desktop/ms682107(v=vs.85).aspx)
|
||||
|
||||
should help you get started.
|
||||
---
|
||||
|
||||
# Run Consul as a Service on Windows
|
||||
|
||||
By using the _sc_ command, either on PowerShell or
|
||||
the Windows command line, you can run Consul as a service. For more details about the _sc_ command,
|
||||
the Windows page for [sc](<https://msdn.microsoft.com/en-us/library/windows/desktop/ms682107(v=vs.85).aspx>)
|
||||
should help you get started.
|
||||
|
||||
Before installing Consul, you will need to create a permanent directory for storing the configuration files. Once that directory is created, you will set it when starting Consul with the `-config-dir` option.
|
||||
|
||||
In this guide, you will download the Consul binary, register the Consul service
|
||||
with the Service Manager, and finally start Consul.
|
||||
|
||||
The steps presented here assume that you have launched PowerShell with _Administrator_ privileges.
|
||||
|
||||
## Installing Consul as a Service
|
||||
|
||||
Download the Consul binary for your architecture.
|
||||
|
||||
Use the _sc_ command to create a service named **Consul** that loads configuration files from the directory passed to `-config-dir`. Read the agent configuration
|
||||
[documentation](/docs/agent/options#configuration-files) to learn more about configuration options.
|
||||
|
||||
```text
|
||||
sc.exe create "Consul" binPath= "<path to the Consul.exe> agent -config-dir <path to configuration directory>" start= auto
|
||||
[SC] CreateService SUCCESS
|
||||
```
|
||||
|
||||
If you get an output that is similar to the one above, then your service is
|
||||
registered with the Service Manager.
|
||||
|
||||
If you get an error, please check that
|
||||
you have specified the correct path to the binary and entered the arguments for the Consul service correctly.
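
Path and quoting mistakes are a common source of such errors, especially when a path contains spaces. The following is a minimal sketch of one way to assemble the `sc.exe create` arguments shown above. The helper name `build_sc_create_args` is hypothetical, and wrapping both paths in double quotes inside `binPath=` is an assumption made here to tolerate spaces, not something the command above requires.

```python
def build_sc_create_args(consul_exe, config_dir, service_name="Consul"):
    """Build the argument list for `sc.exe create`, mirroring the command above.

    Both paths are wrapped in double quotes inside the binPath value so that
    directories containing spaces stay intact (an assumption of this sketch).
    """
    bin_path = f'"{consul_exe}" agent -config-dir "{config_dir}"'
    # sc.exe expects "binPath=" and "start=" as separate tokens before their values.
    return ["sc.exe", "create", service_name, "binPath=", bin_path, "start=", "auto"]
```

The resulting list could be passed to `subprocess.run` from an elevated session. Printing it first makes it easy to compare against the command in this guide before anything is executed.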
|
||||
|
||||
## Running Consul as a Service
|
||||
|
||||
You have two options for starting the service.
|
||||
|
||||
The first option is to use the Windows Service Manager, and look for **Consul** under the service name. Click the _start_ button to start the service.
|
||||
|
||||
The second option is to use the _sc_ command.
|
||||
|
||||
```text
|
||||
sc.exe start "Consul"
|
||||
|
||||
SERVICE_NAME: Consul
|
||||
TYPE : 10 WIN32_OWN_PROCESS
|
||||
STATE : 4 RUNNING (STOPPABLE, NOT_PAUSABLE, ACCEPTS_SHUTDOWN)
|
||||
WIN32_EXIT_CODE : 0 (0x0)
|
||||
SERVICE_EXIT_CODE : 0 (0x0)
|
||||
CHECKPOINT : 0x0
|
||||
WAIT_HINT : 0x0
|
||||
PID : 8008
|
||||
FLAGS :
|
||||
```
|
||||
|
||||
Because the service was created with `start= auto`, it starts automatically at boot, so you don't need to
|
||||
launch Consul from the command-line again.
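
To confirm from a script that the service is running, the `STATE` line of the _sc_ output can be parsed. This is a minimal sketch that assumes the output format shown above. The function name `parse_service_state` is hypothetical, and capturing the text (for example from `sc.exe query "Consul"`) is left to the caller.

```python
import re

# Matches a line such as "STATE : 4 RUNNING (STOPPABLE, ...)".
_STATE_LINE = re.compile(r"^\s*STATE\s*:\s*(\d+)\s+(\w+)", re.MULTILINE)


def parse_service_state(sc_output):
    """Return (state_code, state_name) from sc.exe output, or None if absent."""
    match = _STATE_LINE.search(sc_output)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)
```

A health check could treat any result other than `(4, "RUNNING")` as a reason to inspect the service configuration or the Consul logs.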
|
||||
|
||||
## Summary
|
||||
|
||||
In this guide, you set up Consul as a service on Windows. This process can be repeated to set up an entire cluster of agents.
|