Also change how loadProxies works. Now it will load all persisted proxies into a map, then when loading config file proxies will look up the previous proxy token in that map.
UUID auto-generation here causes trouble in a few cases. The biggest being older
nodes reregistering will fail when the UUIDs are different and the names match
This reverts commit 0f70034082.
This reverts commit d1a8f9cb3f.
This reverts commit cf69ec42a4.
Since DNS is case insensitive and DB as issues when similar names with different
cases are added, check for unicity based on case insensitivity.
Following another big incident we had in our cluster, we also validate
that adding/renaming a not does not conflicts with case insensitive
matches.
We had the following error once:
- one node called: mymachine.MYDC.mydomain was shut off
- another node (different ID) was added with name: mymachine.mydc.mydomain before
72 hours
When restarting the consul server of domain, the consul server restarted failed
to start since it detected an issue in RAFT database because
mymachine.MYDC.mydomain and mymachine.mydc.mydomain had the same names.
Checking at registration time with case insensitivity should definitly fix
those issues and avoid Consul DB corruption.
It fixes the following warnings:
agent/config/builder.go:1201: Errorf format %q has arg s of wrong type *string
agent/config/builder.go:1240: Errorf format %q has arg s of wrong type *string
This also changes where the enforcement of the enable_additional_node_meta_txt configuration gets applied.
formatNodeRecord returns the main RRs and the meta/TXT RRs in separate slices. Its then up to the caller to add to the appropriate sections or not.
This just makes sure that if multiple services are registered with unique service addresses that we don’t blast back multiple CNAMEs for the same service DNS name and keeps us within the DNS specs.
It will allow the following:
* when connectivity is limited (saturated linnks between DCs), only one
single request to refresh an ACL will be sent to ACL master DC instead
of statcking ACL refresh queries
* when extend-cache is used for ACL, do not wait for result, but refresh
the ACL asynchronously, so no delay is not impacting slave DC
* When extend-cache is not used, keep the existing blocking mechanism,
but only send a single refresh request.
This will fix https://github.com/hashicorp/consul/issues/3524
These were only added as SPIFFE intends to use the in the future but currently does not mandate their usage due to patch support in common TLS implementations and some ambiguity over how to use them with URI SAN certificates. We included them because until now everything seem fine with it, however we've found the latest version of `openssl` (1.1.0h) fails to validate our certificats if its enabled. LibreSSL as installed on OS X by default doesn’t have these issues. For now it's most compatible not to have them and later we can find ways to add constraints with wider compatibility testing.
Few other fixes in here just to get a clean run locally - they are all also fixed in other PRs but shouldn't conflict.
This should be robust to timing between goroutines now.
- Dev mode assumed no persistence of services although proxy state is persisted which caused proxies to be killed on startup as their services were no longer registered. Fixed.
- Didn't snapshot the ProxyID which meant that proxies were adopted OK from snapshot but failed to restart if they died since there was no proxyID in the ENV on restart
- Dev mode with no persistence just kills all proxies on shutdown since it can't recover them later
- Naming things
This turns out to have a lot more subtelty than we accounted for. The test suite is especially prone to races now we can only poll the child and many extra levels of indirectoin are needed to correctly run daemon process without it becoming a Zombie.
I ran this test suite in a loop with parallel enabled to verify for races (-race doesn't find any as they are logical inter-process ones not actual data races). I made it through ~50 runs before hitting an error due to timing which is much better than before. I want to go back and see if we can do better though. Just getting this up.
agent/config will turn [{}] into {} (single element maps into a single
map) to work around HCL issues. These are resolved in HCL2 which I'm
sure Consul will switch to eventually.
This breaks the connect proxy configuration in service definition FILES
since we call this patch function. For now, let's just special-case skip
this. In the future we maybe Consul will adopt HCL2 and fix it, or we
can do something else if we want. This works and is tested.
This enables `consul agent -dev` to begin using Connect features with
the built-in CA. I think this is expected behavior since you can imagine
that new users would want to try.
There is no real downside since we're just using the built-in CA.
There are also a lot of small bug fixes found when testing lots of things end-to-end for the first time and some cleanup now it's integrated with real CA code.
- Includes some bug fixes for previous `api` work and `agent` that weren't tested
- Needed somewhat pervasive changes to support hash based blocking - some TODOs left in our watch toolchain that will explicitly fail on hash-based watches.
- Integration into `connect` is partially done here but still WIP
Intention de-duplication in previously merged PR actualy failed some tests that were not caught be me or CI. I ran the test files for state changes but they happened not to trigger this case so I made sure they did first and then fixed. That fixed some upstream intention endpoint tests that I'd not run as part of testing the previous fix.
Uses struct/interface embedding with the embedded structs/interfaces being empty for oss. Also methods on the server/client types are defaulted to do nothing for OSS
* Move settings to use the same service/route API as the rest of the app
* Put some ideas down for unit testing on adapters
* Favour `Model` over `Entity`
* Move away from using `reopen` to using Mixins
* Amend messages, comment/document some usage
* Make sure the returns are consistent in normalizePayload, also
Add some todo's in to remind me to think consider this further at a
later date. For example, is normalizePayload to be a hook or an
overridable method
* Start stripping back the HTML to semantics
* Use a variable rather than chaining
* Remove unused helpers
* Start picking through the new designs, start with listing pages
* First draft HTML for every page
* Making progress on the CSS
* Keep plugging away at the catalog css
* Looking at scrolling
* Wire up filtering
* Sort out filter counting, more or less done a few outstanding
* Start knocking the forms into shape
* Add in codemirror
* Keep moving forwards with the form like layouts
* Start looking at ACL editing page, add footer in
* Pull the filters back in, look at an autoresizer for scroll views
* First draft toggles
* 2nd draft healthcheck icons
* Tweak node healthcheck icons
* Looking at healthcheck detail icons
* Tweak the filter-bar and add selections to the in content tabs
* Add ACL create, pill-like acl type highlight
* Tweaking the main nav some more
* Working on the filter-bar and freetext-filter
* Masonry layout
* Stick with `checks` instead of healthy/unhealthy
* Fix up the filter numbers/counts
* Use the thead for a measure
* First draft tomography back in
* First draft DC dropdown
* Add a temporary create buttong to kv's
* Move KV and ACL to use a create page
* Move tags
* Run through old tests
* Injectable server
* Start adding test attributes
* Add some page objects
* More test attributes and pages
* Acl filter objects
* Add a page.. page object
* Clickable items in lists
* Add rest/spread babel plugin, remove mirage for now
* Add fix for ember-collection
* Keep track of acl filters
* ember-cli-page-object
* ember-test-selectors
* ui: update version of ui compile deps
* Update static assets
* Centralize radiogroup helper
* Rejig KV's and begin to clean it up
* Work around lack of Tags for the moment..
* Some little css tweaks and start to remove possibles
* Working on the dc page and incidentals
1. Sort the datacenter-picker list
2. Add a selected state to the datacenter-picker
3. Make dc an {Name: dc}
4. Add an env helper to get to 'env vars' from within templates
* Click outside stuff for the datacenter-picker, is-active on nav
* Make sure the dropdown CTA can be active
* Bump ember add pluralize helper
* Little try at sass based custom queries
* Rejig tablular collection so it deals with resizing, actions
1. WIP: start building actions dropdowns
2. Move tabular collection to deal with resizing to rule out differences
* First draft actions dropdowns
* Add ports, selectable IP's
* Flash messages, plus general cleanup/consistency
1. Add ember-cli-flash for flash messages
2. Move everything to get() instead of item.get
3. Spotted a few things that weren't consistent
* DOn't go lower than zero
* First draft vertical menu
* Missed a get, tweak dropmenu tick
* Big cleanup
1. this.get(), this.set() > get(), set()
2. assign > {...{}, ...{}}
3. Seperator > separator
* WIP: settings
* Moved things into a ui-v2 folder
* Decide on a way to do the settings page whilst maintaining the url + dc's
* Start some error pages
* Remove base64 polyfill
* Tie in settings, fix atob bug, tweak layout css
* Centralize confirmations into a component
* Allow switching between the old and new UI with the CONSUL_UI_BETA env var
Currently all the assets are packaged into a single AssetFS and a prefix is configured to switch between the two.
* Attempt at some updates to integrate the v2 ui build into the main infrastructure
* Add redirect to index.html for unknown paths
* Allow redictor to /index.html for new ui when using -ui-dir
* Take ACLs to the correct place on save
* First pass breadcrumbs
* Remove datacenter selector on the index page
* Tweak overall layout
* Make buttons 'resets'
* Tweak last DC stuff
* Validations plus kv keyname viewing tweaks
* Pull sessions back in
* Tweak the env vars to be more reusable
* Move isAnon to the view
* No items and disabled acl css
* ACL and KV details
1. Unauthorized page
2. Make sure the ACL is always selected when it needs it
3. Check record deletion with a changeset
* Few more acl tweaks/corrections
* Add no items view to node > services
* Tags for node > services
* Make sure we have tags
* Fix up the labels on the tomography graph
* Add node link (agent) to kv sessions
* Duplicate up `create` for KV 'root creation'
* Safety check for health checks
* Fix up the grids
* Truncate td a's, fix kv columns
* Watch for spaces in KV id's
* Move actions to their own mixins for now at least
* Link reset to settings incase I want to type it in
* Tweak error page
* Cleanup healthcheck icons in service listing
* Centralize errors and make getting back easier
* Nice numbers
* Compact buttons
* Some incidental css cleanups
* Use 'Key / Value' for root
* Tweak tomography layout
* Fix single healthcheck unhealthy resource
* Get loading screen ready
* Fix healthy healthcheck tick
* Everything in header starts white
* First draft loader
* Refactor the entire backend to use proper unique keys, plus..
1. Make unique keys form dc + slug (uid)
2. Fun with errors...
* Tweak header colors
* Add noopener noreferrer to external links
* Add supers to setupController
* Implement cloning, using ember-data...
* Move the more expensive down the switch order
* First draft empty record cleanup..
* Add the cusomt store test
* Temporarily use the htmlSafe prototype to remove the console warning
* Encode hashes in urls
* Go back to using title for errors for now
* Start removing unused bulma
* Lint
* WIP: Start looking at failing tests
* Remove single redirect test
* Finish off error message styling
* Add full ember-data cache invalidation to avoid stale data...
* Add uncolorable warning icons
* More info icon
* Rearrange single service, plus tag printing
* Logo
* No quotes
* Add a simple startup logo
* Tweak healthcheck statuses
* Fix border-color for healthchecks
* Tweak node tabs
* Catch 401 ACL errors and rethrow with the provided error message
* Remove old acl unauth and error routes
* Missed a super
* Make 'All' refer to number of checks, not services
* Remove ember-resizer, add autoprefixer
* Don't show tomography if its not worth it, viewify it more also
* Little model cleanup
* Chevrons
* Find a way to reliably set the class of html from the view
* Consistent html
* Make sure session id's are visible as long as possible
* Fix single service check count
* Add filters and searchs to the query string
* Don't remember the selected tab
* Change text
* Eror tweaking
* Use chevrons on all breadcrumbs even in kv's
* Clean up a file
* Tweak some messaging
* Makesure the footer overlays whats in the page
* Tweak KV errors
* Move json toggle over to the right
* feedback-dialog along with copy buttons
* Better confirmation dialogs
* Add git sha comment
* Same title as old UI
* Allow defaults
* Make sure value is a string
* WIP: Scrolling dropdowns/confirmations
* Add to kv's
* Remove set
* First pass trace
* Better table rows
* Pull over the hashi code editor styles
* Editor tweaks
* Responsive tabs
* Add number formatting to tomography
* Review whats left todo
* Lint
* Add a coordinate ember data triplet
* Bump in a v2.0.0
* Update old tests
* Get coverage working again
* Make sure query keys are also encoded
* Don't test console.error
* Unit test some more utils
* Tweak the size of the tabular collections
* Clean up gitignore
* Fix copy button rollovers
* Get healthcheck 'icon icons' onto the text baseline
* Tweak healthcheck padding and alignment
* Make sure commas kick in in rtt, probably never get to that
* Improve vertical menu
* Tweak dropdown active state to not have a bg
* Tweak paddings
* Search entire string not just 'startsWith'
* Button states
* Most buttons have 1px border
* More button tweaks
* You can only view kv folders
* CSS cleanup reduction
* Form input states and little cleanup
* More CSS reduction
* Sort checks by importance
* Fix click outside on datacenter picker
* Make sure table th's also auto calculate properly
* Make sure `json` isn't remembered in KV editing
* Fix recursive deletion in KV's
* Centralize size
* Catch updateRecord
* Don't double envode
* model > item consistency
* Action loading and ACL tweaks
* Add settings dependencies to acl tests
* Better loading
* utf-8 base64 encode/decode
* Don't hang off a prototype for htmlSafe
* Missing base64 files...
* Get atob/btoa polyfill right
* Shadowy rollovers
* Disabled button styling for primaries
* autofocuses only onload for now
* Fix footer centering
* Beginning of 'notices'
* Remove the isLocked disabling as we are letting you do what the API does
* Don't forget the documentation link for sessions
* Updates are more likely
* Use exported constant
* Dont export redirectFS and a few other PR updates
* Remove the old bootstrap config which was used for the old UI skin
* Use curlies for multiple properties
Will fix https://github.com/hashicorp/consul/issues/4036
Instead of removing one by one the entries, find the optimal
size using binary search.
For SRV records, with 5k nodes, duration of DNS lookups is
divided by 4 or more.
Update docs a little
Update/add tests. Make sure all the various ways of determining the source IP work
Update X-Forwarded-For header parsing. This can be a comma separated list with the first element being the original IP so we now handle csv data there.
Got rid of error return from sourceAddrFromRequest
Test HTTP/DNS source IP without header/extra EDNS data.
Add WARN log for when prepared query with near=_ip is executed without specifying the source ip
Also fixed an issue where we need to have the X-Forwarded-For header processed before the RemoteAddr. This shouldn’t have any functional difference for prod code but for mocked request objects it allows them to work.
The need has been spotted in issue https://github.com/hashicorp/consul/issues/3687.
Using "NYTimes/gziphandler", the http api responses can now be compressed if required.
The Go API requires compressed response if possible and handle the compressed response.
We here change only the http api (not the UI for instance).
Cosmetic fix to the agent's HTTP check function which always formats the result as "HTTP GET ...", ignoring any non-GET supplied HTTP method such as POST, PUT, etc.
Bugfix for https://github.com/hashicorp/consul/pull/3899
When a node level check is removed (example: maintenance),
some watchers on services might have to recompute their state.
If those nodes are performing blocking queries, they have to be notified.
While their state was updated when node-level state did change or was added
this was not the case when the check was removed. This fixes it.
The list of cipher suites included in this commit are consistent with
the values and precedence in the [Golang TLS documentation](https://golang.org/src/crypto/tls/cipher_suites.go).
> **Note:** Cipher suites with RC4 are still included within the list
> of accepted values for compatibility, but **these cipher suites are
> not safe to use** and should be deprecated with warnings and
> subsequently removed. Support for RC4 ciphers has already been
> removed or disabled by default in many prominent browsers and tools,
> including Golang.
>
> **References:**
>
> * [RC4 on Wikipedia](https://en.wikipedia.org/wiki/RC4)
> * [Mozilla Security Blog](https://blog.mozilla.org/security/2015/09/11/deprecating-the-rc4-cipher/)
This patch did give some better results, but break watches on
the services of a node.
It is possible to apply the same optimization for nodes than
to services (one index per instance), but it would complicate
further the patch.
Let's do it in another PR.
The root cause is actually that the agent's streaming HTTP API didn't flush until the first log line was found which commonly was pretty soon since the default level is INFO. In cases where there were no logs immediately due to level for instance, the client gets stuck in the HTTP code waiting on a response packet from the server before we enter the loop that checks the shutdown channel from the signal handler.
This fix flushes the initial status immediately on the streaming endpoint which lets the client code get into it's expected state where it's listening for shutdown or log lines.
This patch improves the watches for services on large cluster:
each service has now its own index, such watches on a specific service
are not modified by changes in the global catalog.
It should improve a lot the performance of tools such as consul-template
or libraries performing watches on very large clusters with many
services/watches.
- register endpoints with supported methods
- support OPTIONS requests, indicating supported methods
- extract method validation (error 405) from individual endpoints
- on 405 where multiple methods are allowed, create a single Allow
header with comma-separated values, not multiple Allow headers.
Because this code was doing pointer equality checks, it would work for
the case of a failed attempted RPC because the objects are from the
manager itself:
https://github.com/hashicorp/consul/blob/v1.0.3/agent/consul/rpc.go#L283-L302
But the pointer check would always fail for events coming in from the
Serf path because the server object is newly-created:
https://github.com/hashicorp/consul/blob/v1.0.3/agent/router/serf_adapter.go#L14-L40
This means that we didn't proactively shift RPC traffic away from a
failed server, we'd have to wait for an RPC to fail, which exposes
the error to the calling client.
By switching over to a name check vs. a pointer check we get the correct
behavior. We added a DEBUG log as well to help observe this behavior during
integrated testing.
Related to #3863 since the fix here needed the same logic duplicated, owing
to the complicated atomic stuff.
/cc @dadgar for a heads up in case this also affects Nomad.
Previously a change was made to make the file writing atomic,
but that wasn't enough to cover something like an OS crash so we
needed something here to handle the situation more gracefully.
Fixes#1221.
Docker/Openshift/Kubernetes mount the config file as a symbolic link and
IsDir returns true if the file is a symlink. Before calling IsDir, the
symlink should be resolved to determine if it points at a file or
directory.
Fixes#3753
Since commit 9685bdcd0b, service tags are added to the health checks.
Otherwise, when adding a service, tags are not added to its check.
In updateSyncState, we compare the checks of the local agent with the checks of the catalog.
It appears that the service tags are different (missing in one case), and so the check is synchronized.
That increase the ModifyIndex periodically when nothing changes.
Fixed it by adding serviceTags to the check.
Note that the issue appeared in version 0.8.2.
Looks related to #3259.
The lock isn't needed after we clean up the expire bin, and as seen
in #3700 we can get into a deadlock waiting to place the expire index
into the channel while holding this lock.
Fixes#3700
There were places where we still didn't have the script vs. args sorted
correctly so changed all the logging to be just based on check IDs and
also made everything uniform.
Also removed some annoying debug logging, and moved some of the large output
logging to TRACE level.
Closes#3602
* Refactors the HTTP listen path to create servers in the same spot.
* Adds HTTP/2 support to Consul's HTTPS server.
* Vendors Go HTTP/2 library and associated deps.
* config: refactor ReadPath(s) methods without side-effects
Return the sources instead of modifying the state.
* config: clean data dir before every test
* config: add tests for config-file and config-dir
* config: add -config-format option
Starting with Consul 1.0 all config files must have a '.json' or '.hcl'
extension to make it unambigous how the data should be parsed. Some
automation tools generate temporary files by appending a random string
to the generated file which obfuscates the extension and prevents the
file type detection.
This patch adds a -config-format option which can be used to override
the auto-detection behavior by forcing all config files or all files
within a config directory independent of their extension to be
interpreted as of this format.
Fixes#3620
* Relaxes Autopilot promotion logic.
When we defaulted the Raft protocol version to 3 in #3477 we made
the numPeers() routine more strict to only count voters (this is
more conservative and more correct). This had the side effect of
breaking rolling updates because it's at odds with the Autopilot
non-voter promotion logic.
That logic used to wait to only promote to maintain an odd quorum
of servers. During a rolling update (add one new server, wait, and
then kill an old server) the dead server cleanup would still count
the old server as a peer, which is conservative and the right thing
to do, and no longer count the non-voter. This would wait to promote,
so you could get into a stalemate. It is safer to promote early than
remove early, so by promoting as soon as possible we have chosen
that as the solution here.
Fixes#3611
* Gets rid of unnecessary extra not-a-voter check.
This patch adds a /v1/coordinate/node/:node endpoint to get the network
coordinates for a single node in the network.
Since Consul Enterprise supports network segments it is still possible
to receive mutiple entries for a single node - one per segment.
The Docker agent closes the connection during read after we have
read the body. This causes a "connection reset by peer" even though
the command was successful.
We ignore that error here since we got the correct status code
and a response body.
The anti-entropy tests relied on the side-effect of the StartSync()
method to perform a full sync instead of a partial sync. This lead to
multiple anti-entropy go routines being started unnecessary retry loops.
This change changes the behavior to perform synchronous full syncs when
necessary removing the need for all of the time.Sleep and most of the
retry loops.
The state of the service and health check records was spread out over
multiple maps guarded by a single lock. Access to the maps has to happen
in a coordinated effort and the tests often violated this which made
them brittle and racy.
This patch replaces the multiple maps with a single one for both checks
and services to make the code less fragile.
This is also necessary since moving the local state into its own package
creates circular dependencies for the tests. To avoid this the tests can
no longer access internal data structures which they should not be doing
in the first place.
The tests still don't compile but this is a ncessary step in that
direction.
The anti-entropy code manages background synchronizations of the local
state on a regular basis or on demand when either the state has changed
or a new consul server has been added.
This patch moves the anti-entropy code into its own package and
decouples it from the local state code since they are performing
two different functions.
To simplify code-review this revision does not make any optimizations,
renames or refactorings. This will happen in subsequent commits.
The `consul agent` command was ignoring extra command line arguments
which can lead to confusion when the user has for example forgotten to
add a dash in front of an argument or is not using an `=` when setting
boolean flags to `true`. `-bootstrap true` is not the same as
`-bootstrap=true`, for example.
Since all command line flags are known and we don't expect unparsed
arguments we can return an error. However, this may make it slightly
more difficult in the future if we ever wanted to have these kinds of
arguments.
Fixes#3397
DNS recursors can be added through go-sockaddr templates. Entries
are deduplicated while the order is maintained.
Originally proposed by @taylorchu
See #2932
The `consul agent` command was ignoring extra command line arguments
which can lead to confusion when the user has for example forgotten to
add a dash in front of an argument or is not using an `=` when setting
boolean flags to `true`. `-bootstrap true` is not the same as
`-bootstrap=true`, for example.
Since all command line flags are known and we don't expect unparsed
arguments we can return an error. However, this may make it slightly
more difficult in the future if we ever wanted to have these kinds of
arguments.
Fixes#3397
The `consul agent` command was ignoring extra command line arguments
which can lead to confusion when the user has for example forgotten to
add a dash in front of an argument or is not using an `=` when setting
boolean flags to `true`. `-bootstrap true` is not the same as
`-bootstrap=true`, for example.
Since all command line flags are known and we don't expect unparsed
arguments we can return an error. However, this may make it slightly
more difficult in the future if we ever wanted to have these kinds of
arguments.
Fixes#3397
The anti-entropy tests relied on the side-effect of the StartSync()
method to perform a full sync instead of a partial sync. This lead to
multiple anti-entropy go routines being started unnecessary retry loops.
This change changes the behavior to perform synchronous full syncs when
necessary removing the need for all of the time.Sleep and most of the
retry loops.
The state of the service and health check records was spread out over
multiple maps guarded by a single lock. Access to the maps has to happen
in a coordinated effort and the tests often violated this which made
them brittle and racy.
This patch replaces the multiple maps with a single one for both checks
and services to make the code less fragile.
This is also necessary since moving the local state into its own package
creates circular dependencies for the tests. To avoid this the tests can
no longer access internal data structures which they should not be doing
in the first place.
The tests still don't compile but this is a ncessary step in that
direction.
The anti-entropy code manages background synchronizations of the local
state on a regular basis or on demand when either the state has changed
or a new consul server has been added.
This patch moves the anti-entropy code into its own package and
decouples it from the local state code since they are performing
two different functions.
To simplify code-review this revision does not make any optimizations,
renames or refactorings. This will happen in subsequent commits.
DNS recursors can be added through go-sockaddr templates. Entries
are deduplicated while the order is maintained.
Originally proposed by @taylorchu
See #2932
This patch removes the porter tool which hands out free ports from a
given range with a library which does the same thing. The challenge for
acquiring free ports in concurrent go test runs is that go packages are
tested concurrently and run in separate processes. There has to be some
inter-process synchronization in preventing processes allocating the
same ports.
freeport allocates blocks of ports from a range expected to be not in
heavy use and implements a system-wide mutex by binding to the first
port of that block for the lifetime of the application. Ports are then
provided sequentially from that block and are tested on localhost before
being returned as available.
Queries to the DNS server can contain an optional datacenter
name in the query name. You can query for 'foo.service.consul'
or 'foo.service.dc.consul' to get a response for either the
default or a specific datacenter.
Datacenter names cannot have dots, therefore the datacenter
name can refer to only one element in the DNS query name.
The DNS server allowed extra labels between the optional
datacenter name and the domain and returned a valid response
instead of returning NXDOMAIN. For example, if the domain
is set to '.consul' then 'foo.service.dc1.extra.consul'
should return NXDOMAIN because of 'extra' being between
the datacenter name 'dc1' and the domain '.consul'.
Fixes#3200
DNS recursors can be added through go-sockaddr templates. Entries
are deduplicated while the order is maintained.
Originally proposed by @taylorchu
See #2932
* Updates the API client to support the current `ScriptArgs` parameter
for checks.
* Updates docs for checks to explain the `ScriptArgs` parameter issue.
* Adds mappings for "args" and "script-args" to give th API parity
with config.
* Adds checks on return codes.
* Removes debug logging that shows empty when args are used.
* Adds a brief wait and poll period to update the check status
if we get stucking waiting for the processes to terminate.
Fixes#3570
* Jumps out of timeout case and includes script output.
* Kill check processes after the timeout is reached
Kill the subprocess spawned by a script check once the timeout is reached. Previously Consul just marked the check critical and left the subprocess around.
Fixes#3565.
* Set err to non-nil when timeout occurs
* Fix check timeout test
* Kill entire process subtree on check timeout
* Add a docs note about windows subprocess termination
* agent: add option to discard health output
In high volatile environments consul will have checks with "noisy"
output which changes every time even though the status does not change.
Since the output is stored in the raft log every health check update
unblocks a blocking call on health checks since the raft index has
changed even though the status of the health checks may not have changed
at all. By discarding the output of the health checks the users can
choose a different tradeoff. Less visibility on why a check failed in
exchange for a reduced change rate on the raft log.
* agent: discard output also when adding a check
* agent: add test for discard check output
* agent: update docs
* go vet
* Adds discard_check_output to reloadable config table.
* Updates the change log.
* Fixes agent error handling when check definition is invalid. Distinguishes between empty checks vs invalid checks
* Made CheckTypes return Checks from service definition struct rather than a new copy, and other changes from code review. This also errors when json payload contains empty structs
* Simplify and improve validate method, and make sure that CheckTypes always returns a new copy of validated check definitions
* Tweaks some small style things and error messages.
* Updates the change log.
* doc: document discrepancy between id and CheckID
* doc: document enable_tag_override change
* config: add TranslateKeys helper
TranslateKeys makes it easier to map between different representations
of internal structures. It allows to recursively map alias keys to
canonical keys in structured maps.
* config: use TranslateKeys for config file
This also adds support for 'enabletagoverride' and removes
the need for a separate CheckID alias field.
* config: remove dead code
* agent: use TranslateKeys for FixupCheckType
* agent: translate enable_tag_override during service registration
* doc: add '.hcl' as valid extension
* config: map ScriptArgs to args
* config: add comment for TranslateKeys
* Adds client-side retry for no leader errors.
This paves over the case where the client was connected to the leader
when it loses leadership.
* Adds a configurable server RPC drain time and a fail-fast path for RPCs.
When a server leaves it gets removed from the Raft configuration, so it will
never know who the new leader server ends up being. Without this we'd be
doomed to wait out the RPC hold timeout and then fail. This makes things fail
a little quicker while a sever is draining, and since we added a client retry
AND since the server doing this has already shut down and left the Serf LAN,
clients should retry against some other server.
* Makes the RPC hold timeout configurable.
* Reorders struct members.
* Sets the RPC hold timeout default for test servers.
* Bumps the leave drain time up to 5 seconds.
* Robustifies retries with a simpler client-side RPC hold.
* Reverts untended delete.
* Fixes handling of stop channel and failed barrier attempts.
There were two issues here. First, we needed to not exit when there
was a timeout trying to write the barrier, because Raft might not
step down, so we'd be left as the leader but having run all the step
down actions.
Second, we didn't close over the stopCh correctly, so it was possible
to nil that out and have the leaderLoop never exit. We close over it
properly AND sequence the nil-ing of it AFTER the leaderLoop exits for
good measure, so the code is more robust.
Fixes#3545
* Cleans up based on code review feedback.
* Tweaks comments.
* Renames variables and removes comments.
* Clean up handling of subprocesses and make using a shell optional
* Update docs for subprocess changes
* Fix tests for new subprocess behavior
* More cleanup of subprocesses
* Minor adjustments and cleanup for subprocess logic
* Makes the watch handler reload test use the new path.
* Adds check tests for new args path, and updates existing tests to use new path.
* Adds support for script args in Docker checks.
* Fixes the sanitize unit test.
* Adds panic for unknown watch type, and reverts back to Run().
* Adds shell option back to consul lock command.
* Adds shell option back to consul exec command.
* Adds shell back into consul watch command.
* Refactors signal forwarding and makes Windows-friendly.
* Adds a clarifying comment.
* Changes error wording to a warning.
* Scopes signals to interrupt and kill.
This avoids us trying to send SIGCHILD to the dead process.
* Adds an error for shell=false for consul exec.
* Adds notes about the deprecated script and handler fields.
* De-nests an if statement.
* config: provide stable config for /v1/agent/self (#3530)
This patch adds a stable subset of the previous Config struct to the
agent/self response. The actual runtime configuration is moved into
DebugConfig and will be documented to change.
Fixes#3530
* config: fix tests
* doc: update api documentation for /v1/agent/self
- Lookup TXT records using recursive lookups
- Expect TXT record equal to value if key starts with rfc1035-
- Expect TXT record in rfc1464 otherwise, i.e. (k=v)
ref #2709
* vendor: add github.com/sergi/go-diff/diffmatchpatch for diff'ing test output
* config: refactor Sanitize to recursively clean runtime config and format complex fields
* Removes an extra int cast.
* Adds a top-level check test case for sanitization.
* Make sure that id and address are set in member created during reaping of catalog nodes that have been removed from serf
* Get address from node table in the state store rather than from service address
* Fix incorrect lookup by checkname instead of node name
* Make sure that serverlookup is called with the right address format, added unit test.
* Address code review comments
* Tweaks style stuff.
* metrics: replace statsite_prefix with service_prefix
The metrics prefix isn't statsite specific and is in fact used
for all metrics providers. Since we are deprecating fields
anyway we should fix this one as well.
Fixes#3293
* Updates docs and sorts telemetry section.
* Renames to "metrics_prefix" to disambiguate with Consul services.
* Updates the change log.
Remove an error in watch reloading that happens when http and https
are both disabled, and use an https address for running watches if
no http addresses are present.
Fixes#3425.
Since state.Checks() returns a shallow copy
its elements must not be modified. Copying
the elements in the handler does not guarantee
consistency since that list is guarded by a different
lock. Therefore, the only solution is to have state.Checks()
return a deep copy.
* agent: consolidate http method not allowed checks
This patch uses the error handling of the http handlers to handle HTTP
method not allowed errors across all available endpoints. It also adds a
test for testing whether the endpoints respond with the correct status
code.
* agent: do not panic on metrics tests
* agent: drop other tests for MethodNotAllowed
* agent: align /agent/join with reality
/agent/join uses PUT instead of GET as documented.
* agent: align /agent/check/{fail,warn,pass} with reality
/agent/check/{fail,warn,pass} uses PUT instead of GET as documented.
* fix some tests
* Drop more tests for method not allowed
* Align TestAgent_RegisterService_InvalidAddress with reality
* Changes API client join to use PUT instead of GET.
* Fixes agent endpoint verbs and removes obsolete tests.
* Updates the change log.
* Changes default Raft protocol to 3.
* Changes numPeers() to report only voters.
This should have been there before, but it's more obvious that this
is incorrect now that we default the Raft protocol to 3, which puts
new servers in a read-only state while Autopilot waits for them to
become healthy.
* Fixes TestLeader_RollRaftServer.
* Fixes TestOperator_RaftRemovePeerByAddress.
* Fixes TestServer_*.
Relaxed the check for a given number of voter peers and instead do
a thorough check that all servers see each other in their Raft
configurations.
* Fixes TestACL_*.
These now just check for Raft replication to be set up, and don't
care about the number of voter peers.
* Fixes TestOperator_Raft_ListPeers.
* Fixes TestAutopilot_CleanupDeadServerPeriodic.
* Fixes TestCatalog_ListNodes_ConsistentRead_Fail.
* Fixes TestLeader_ChangeServerID and adjusts the conn pool to throw away
sockets when it sees io.EOF.
* Changes version to 1.0.0 in the options doc.
* Makes metrics test more deterministic with autopilot metrics possible.
* new config parser for agent
This patch implements a new config parser for the consul agent which
makes the following changes to the previous implementation:
* add HCL support
* all configuration fragments in tests and for default config are
expressed as HCL fragments
* HCL fragments can be provided on the command line so that they
can eventually replace the command line flags.
* HCL/JSON fragments are parsed into a temporary Config structure
which can be merged using reflection (all values are pointers).
The existing merge logic of overwrite for values and append
for slices has been preserved.
* A single builder process generates a typed runtime configuration
for the agent.
The new implementation is more strict and fails in the builder process
if no valid runtime configuration can be generated. Therefore,
additional validations in other parts of the code should be removed.
The builder also pre-computes all required network addresses so that no
address/port magic should be required where the configuration is used
and should therefore be removed.
* Upgrade github.com/hashicorp/hcl to support int64
* improve error messages
* fix directory permission test
* Fix rtt test
* Fix ForceLeave test
* Skip performance test for now until we know what to do
* Update github.com/hashicorp/memberlist to update log prefix
* Make memberlist use the default logger
* improve config error handling
* do not fail on non-existing data-dir
* experiment with non-uniform timeouts to get a handle on stalled leader elections
* Run tests for packages separately to eliminate the spurious port conflicts
* refactor private address detection and unify approach for ipv4 and ipv6.
Fixes#2825
* do not allow unix sockets for DNS
* improve bind and advertise addr error handling
* go through builder using test coverage
* minimal update to the docs
* more coverage tests fixed
* more tests
* fix makefile
* cleanup
* fix port conflicts with external port server 'porter'
* stop test server on error
* do not run api test that change global ENV concurrently with the other tests
* Run remaining api tests concurrently
* no need for retry with the port number service
* monkey patch race condition in go-sockaddr until we understand why that fails
* monkey patch hcl decoder race condidtion until we understand why that fails
* monkey patch spurious errors in strings.EqualFold from here
* add test for hcl decoder race condition. Run with go test -parallel 128
* Increase timeout again
* cleanup
* don't log port allocations by default
* use base command arg parsing to format help output properly
* handle -dc deprecation case in Build
* switch autopilot.max_trailing_logs to int
* remove duplicate test case
* remove unused methods
* remove comments about flag/config value inconsistencies
* switch got and want around since the error message was misleading.
* Removes a stray debug log.
* Removes a stray newline in imports.
* Fixes TestACL_Version8.
* Runs go fmt.
* Adds a default case for unknown address types.
* Reoders and reformats some imports.
* Adds some comments and fixes typos.
* Reorders imports.
* add unix socket support for dns later
* drop all deprecated flags and arguments
* fix wrong field name
* remove stray node-id file
* drop unnecessary patch section in test
* drop duplicate test
* add test for LeaveOnTerm and SkipLeaveOnInt in client mode
* drop "bla" and add clarifying comment for the test
* split up tests to support enterprise/non-enterprise tests
* drop raft multiplier and derive values during build phase
* sanitize runtime config reflectively and add test
* detect invalid config fields
* fix tests with invalid config fields
* use different values for wan sanitiziation test
* drop recursor in favor of recursors
* allow dns_config.udp_answer_limit to be zero
* make sure tests run on machines with multiple ips
* Fix failing tests in a few more places by providing a bind address in the test
* Gets rid of skipped TestAgent_CheckPerformanceSettings and adds case for builder.
* Add porter to server_test.go to make tests there less flaky
* go fmt
* Added rate limiting for agent RPC calls.
* Initializes the rate limiter based on the config.
* Adds the rate limiter into the snapshot RPC path.
* Adds unit tests for the RPC rate limiter.
* Groups the RPC limit parameters under "limits" in the config.
* Adds some documentation about the RPC limiter.
* Sends a 429 response when the rate limiter kicks in.
* Adds docs for new telemetry.
* Makes snapshot telemetry look like RPC telemetry and cleans up comments.
When the metadata server is scanning the agents for potential servers
it is parsing the version number which the agent provided when it
joined. This version number has to conform to a certain format, i.e.
'n.n.n'. Without this version number properly set some tests fail with
error messages that disguise the root cause.
The default version number is currently set to 'unknown' in
version/version.go which does not parse and triggers the tests to fail.
The work around is to use a build tag 'consul' which will use the
version number set in version_base.go instead which has the correct
format and is set to the current release version.
In addition, some parts of the code also require the version number to
be of a certain value. Setting it to '0.0.0' for example makes some
tests pass and others fail since they don't pass the semantic check.
When using go build/install/test one has to remember to use '-tags
consul' or tests will fail with non-obvious error messages.
Using build tags makes the build process more complex and error prone
since it prevents the use of the plain go toolchain and - at least in
its current form - introduces subtle build and test issues. We should
try to eliminate build tags for anything else but platform specific
code.
This patch removes all references to specific version numbers in the
code and tests and sets the default version to '9.9.9' which is
syntactically correct and passes the semantic check. This solves the
issue of running go build/install/test without tags for the OSS build.
The error handling of the ACL code relies on the presence of certain
magic error messages. Since the error values are sent via RPC between
older and newer consul agents we cannot just replace the magic values
with typed errors and switch to type checks since this would break
compatibility with older clients.
Therefore, this patch moves all magic ACL error messages into the acl
package and provides default error values and helper functions which
determine the type of error.
This patch replaces the code which determines the list of servers in the
current cluster with an RPC call to get the list of active consul
service instances which only run on servers.
This replaces the previous implementation which was more complex and
relied on serf messages which can provide a different view than the
consistent response from the raft log.
As a side effect it makes the implementation independent of the server
and the agent which means it works consistently across both. Different
behavior for server and agent was the root cause for the bug in
http://github.com/hashicorp/consul/issue/3047.
Fixes#3407
This patch changes the behavior of the DNS server as follows:
* The SOA response contains the SOA record in the Answer section instead
of the Authority section. It also contains NS records in the Authority
and the corresponding A glue records in the Extra section.
In addition, CNAMEs are added to the Extra section to make the
MNAME of the SOA record resolvable.
AAAA glue records are not yet supported.
* The NS response returns up to three random servers from the
consul cluster in the Answer section and the glue A
records in the Extra section.
AAAA glue records are not yet supported.
Note that there is no test since the correct way to solve (and test)
this is to replace the different maps with a single one or to hide
that functionality behind a separate data structure. This will be
addressed in #3294.
Fixes#3265
This patch replaces the Docker client which is used
for health checks with a simplified version tailored
for that purpose.
See #3254
See #3257Fixes#3270
* Changes remote exec KV read to call GetTokenForAgent(), which can use
the acl_agent_token instead of the acl_token.
Fixes#3160.
* Fixes remote exec unit test with ACLs.
* Adds unhappy ACL path to unit tests for remote exec.
* Obfuscates ACL tokens appearing in /v1/acl APIs.
* Makes test positively identify the desired strings.
* Adds an example and explanation of the regular expression.
* Moves magic check and service constants into shared structs package.
* Removes the "consul" service from local state.
Since this service is added by the leader, it doesn't really make sense to
also keep it in local state (which requires special ACLs to configure), and
requires a bunch of special cases in the local state logic. This requires
fewer special cases and makes ACL bootstrapping cleaner.
* Makes coordinate update ACL log message a warning, similar to other AE warnings.
* Adds much more detailed examples for bootstrapping ACLs.
This can hopefully replace https://gist.github.com/slackpad/d89ce0e1cc0802c3c4f2d84932fa3234.
* Removes unnecessary set for model component which will be null.
* Returns a 404 for a missing node, not a 200 with an empty response.
* Updates built-in web assets.
* Removes GitHub reference.
* Doesn't display ACL token on the unauthorized page.
* Removes useless fetch for nodes and cleans up comments.
* Provides a path to reset the ACL token when it's invalid.
This included making the settings page global so it's reachable, and adding
some more information about an error on the error page.
* Updates built-in web assets.
This isn't racy, it's just a little dirty. The listen will happen and a port
will be selected and injected into the config once the Serf instance is
created, so we don't need the retry loop here.
This fixes TestServer_JoinSeparateLanAndWanAddresses which sets bogus
advertise addresses as part of the test. Port numbers uniquely identify
members since everything is running on localhost.
This patch creates a local config structure for the local state
which is independent from the agent but populated from its
configuration. This avoids data races between the agent configuration
which can change during tests and concurrent go routines using the
configuraiton at the same time.
The agent configuration for the consul server is a partial configuration
which needs to be cloned to avoid data races.
This is a stop-gap measure before moving the configuration into
a separate package.
The tests that use the localState of the agent access the internal
variables and call methods which are not guarded by locks creating
data races in tests. While the use of internal variables is somewhat
easy to spot the fact that not all methods are thread-safe is a
surprise.
A proper fix requires the localState struct to be moved into its own
package so that tests in the agent can only access the external
interface.
However, the localState is currently dependent on the agent.Config
which would create a circular dependency. Therefore, the Config
struct needs to be moved first for this to happen.
This patch literally monkey patches the use of the lock around the
cases which have data races and marks them with a
// todo(fs): data race comment.
The sessionTimers map was secured by a lock which wasn't used
properly in the tests. This lead to data races and failing tests
when accessing the length or the members of the map.
This patch adds a separate SessionTimers struct which is safe
for concurrent use and which ecapsulates the behavior of the
sessionTimers map.
When using dynamic ports for the serf clusters then
the actual bind port of the serf WAN cluster needs to
be discovered before the serf LAN cluster is started
since the serf LAN cluster announces the port of the WAN
cluster.
Host header must be set explicitely on http requests
Change-Id: I91a32f0fb1ec3fbc713adf0e10869797e91172c7
Signed-off-by: Grégoire Seux <g.seux@criteo.com>
The makeRecursor function was using an unreliable mechanism
to start a server with a random port. This patch changes this
so that the server starts on port 0 to let the kernel pick
a free port.
In addition, to similar functions for starting a test DNS
server were folded into one.
This patch fixes watch registration through the config file and a broken log line when the watch registration fails. It also plumbs all the watch loading through a common function and tweaks the
unit test to create the watch before the reload.
This patch adds an "http_config" object to the config file
and moves the "http_api_response_headers" option there.
"http_api_response_headers" is now deprecated in favor of
"http_config.response_headers"
When the agent is triggered to shutdown via an external 'consul leave'
command delivered via the HTTP API then the client expects to receive a
response when the agent is down. This creates a race on when to shutdown
the agent itself like the RPC server, the checks and the state and the
external endpoints like DNS and HTTP.
This patch splits the shutdown process into two parts:
* shutdown the agent
* shutdown the endpoints (http and dns)
They can be executed multiple times, concurrently and in any order but
should be executed first agent, then endpoints to provide consistent
behavior across all use cases. Both calls have to be executed for a
proper shutdown.
This could be partially hidden in a single function but would introduce
some magic that happens behind the scenes which one has to know of but
isn't obvious.
Fixes#2880
This patch hides the RPC handler overwrite mechanism from the
rest of the code so that it works in all cases and that there
is no cooperation required from the tested code, i.e. we can
drop a.getEndpoint().
Fix stale reads on server startup. Consistent reads will now wait for up to config.RPCHoldTimeout for the server to get past its raft log, before returning an error. Servers that are starting up will eventually catch up.
This fixes issue #2644
When the agent is triggered to shutdown via an external 'consul leave'
command delivered via the HTTP API then the client expects to receive a
response when the agent is down. This creates a race on when to shutdown
the agent itself like the RPC server, the checks and the state and the
external endpoints like DNS and HTTP. Ideally, the external endpoints
should be shutdown before the internal state but if the goal is to
respond reliably that the agent is down then this is not possible.
This patch splits the agent shutdown into two parts implemented in a
single method to keep it simple and unambiguos for the caller. The first
stage shuts down the internal state, checks, RPC server, ...
synchronously and then triggers the shutdown of the external endpoints
asychronously. This way the caller is guaranteed that the internal state
services are down when Shutdown returns and there remains enough time to
send a response.
Fixes#2880