* Fix CA pruning when CA config uses string durations.
The tl;dr here is:
- Configuring LeafCertTTL with a string like "72h" is how we do it by default and should be supported
- Most of our tests managed to escape this by defining them as time.Duration directly
- Out actual default value is a string
- Since this is stored in a map[string]interface{} config, when it is written to Raft it goes through a msgpack encode/decode cycle (even though it's written from server not over RPC).
- msgpack decode leaves the string as a `[]uint8`
- Some of our parsers required string and failed
- So after 1 hour, a default configured server would throw an error about pruning old CAs
- If a new CA was configured that set LeafCertTTL as a time.Duration, things might be OK after that, but if a new CA was just configured from config file, intialization would cause same issue but always fail still so would never prune the old CA.
- Mostly this is just a janky error that got passed tests due to many levels of complicated encoding/decoding.
tl;dr of the tl;dr: Yay for type safety. Map[string]interface{} combined with msgpack always goes wrong but we somehow get bitten every time in a new way :D
We already fixed this once! The main CA config had the same problem so @kyhavlov already wrote the mapstructure DecodeHook that fixes it. It wasn't used in several places it needed to be and one of those is notw in `structs` which caused a dependency cycle so I've moved them.
This adds a whole new test thta explicitly tests the case that broke here. It also adds tests that would have failed in other places before (Consul and Vaul provider parsing functions). I'm not sure if they would ever be affected as it is now as we've not seen things broken with them but it seems better to explicitly test that and support it to not be bitten a third time!
* Typo fix
* Fix bad Uint8 usage
If you provide an invalid HTTP configuration consul will still start again instead of failing. But if you do so the build-in proxy won't be able to start which you might need for connect.
This implements parts of RFC 7871 where Consul is acting as an authoritative name server (or forwarding resolver when recursors are configured)
If ECS opt is present in the request we will mirror it back and return a response with a scope of 0 (global) or with the same prefix length as the request (indicating its valid specifically for that subnet).
We only mirror the prefix-length (non-global) for prepared queries as those could potentially use nearness checks that could be affected by the subnet. In the future we could get more sophisticated with determining the scope bits and allow for better caching of prepared queries that don’t rely on nearness checks.
The other thing this does not do is implement the part of the ECS RFC related to originating ECS headers when acting as a intermediate DNS server (forwarding resolver). That would take a quite a bit more effort and in general provide very little value. Consul will currently forward the ECS headers between recursors and the clients transparently, we just don't originate them for non-ECS clients to get potentially more accurate "location aware" results.
Fixes: #4578
Prior to this fix if there was an error binding to ports for the DNS servers the error would be swallowed by the gated log writer and never output. This fix propagates the DNS server errors back to the shell with a multierror.
* Implementation of Weights Data structures
Adding this datastructure will allow us to resolve the
issues #1088 and #4198
This new structure defaults to values:
```
{ Passing: 1, Warning: 0 }
```
Which means, use weight of 0 for a Service in Warning State
while use Weight 1 for a Healthy Service.
Thus it remains compatible with previous Consul versions.
* Implemented weights for DNS SRV Records
* DNS properly support agents with weight support while server does not (backwards compatibility)
* Use Warning value of Weights of 1 by default
When using DNS interface with only_passing = false, all nodes
with non-Critical healthcheck used to have a weight value of 1.
While having weight.Warning = 0 as default value, this is probably
a bad idea as it breaks ascending compatibility.
Thus, we put a default value of 1 to be consistent with existing behaviour.
* Added documentation for new weight field in service description
* Better documentation about weights as suggested by @banks
* Return weight = 1 for unknown Check states as suggested by @banks
* Fixed typo (of -> or) in error message as requested by @mkeeler
* Fixed unstable unit test TestRetryJoin
* Fixed unstable tests
* Fixed wrong Fatalf format in `testrpc/wait.go`
* Added notes regarding DNS SRV lookup limitations regarding number of instances
* Documentation fixes and clarification regarding SRV records with weights as requested by @banks
* Rephrase docs
* Added log-file flag to capture Consul logs in a user specified file
* Refactored code.
* Refactored code. Added flags to rotate logs based on bytes and duration
* Added the flags for log file and log rotation on the webpage
* Fixed TestSantize from failing due to the addition of 3 flags
* Introduced changes : mutex, data-dir log writes, rotation logic
* Added test for logfile and updated the default log destination for docs
* Log name now uses UnixNano
* TestLogFile is now uses t.Parallel()
* Removed unnecessary int64Val function
* Updated docs to reflect default log name for log-file
* No longer writes to data-dir and adds .log if the filename has no extension
Fixes#4515
This just slightly refactors the logic to only attempt to set the serf wan reconnect timeout when the rest of the serf wan settings are configured - thus avoiding a segfault.
* Display more information about check being not properly added when it fails
It follows an incident where we add lots of error messages:
[WARN] consul.fsm: EnsureRegistration failed: failed inserting check: Missing service registration
That seems related to Consul failing to restart on respective agents.
Having Node information as well as service information would help diagnose the issue.
* Renamed ensureCheckIfNodeMatches() as requested by @banks
- Add WaitForTestAgent to tests flaky due to missing serfHealth registration
- Fix bug in retries calling Fatalf with *testing.T
- Convert TestLockCommand_ChildExitCode to table driven test
* Allow to rename nodes with IDs, will fix#3974 and #4413
This change allow to rename any well behaving recent agent with an
ID to be renamed safely, ie: without taking the name of another one
with case insensitive comparison.
Deprecated behaviour warning
----------------------------
Due to asceding compatibility, it is still possible however to
"take" the name of another name by not providing any ID.
Note that when not providing any ID, it is possible to have 2 nodes
having similar names with case differences, ie: myNode and mynode
which might lead to DB corruption on Consul server side and
lead to server not properly restarting.
See #3983 and #4399 for Context about this change.
Disabling registration of nodes without IDs as specified in #4414
should probably be the way to go eventually.
* Removed the case-insensitive search when adding a node within the else
block since it breaks the test TestAgentAntiEntropy_Services
While the else case is probably legit, it will be fixed with #4414 in
a later release.
* Added again the test in the else to avoid duplicated names, but
enforce this test only for nodes having IDs.
Thus most tests without any ID will work, and allows us fixing
* Added more tests regarding request with/without IDs.
`TestStateStore_EnsureNode` now test registration and renaming with IDs
`TestStateStore_EnsureNodeDeprecated` tests registration without IDs
and tests removing an ID from a node as well as updated a node
without its ID (deprecated behaviour kept for backwards compatibility)
* Do not allow renaming in case of conflict, including when other node has no ID
* Fixed function GetNodeID that was not working due to wrong type when searching node from its ID
Thus, all tests about renaming were not working properly.
Added the full test cas that allowed me to detect it.
* Better error messages, more tests when nodeID is not a valid UUID in GetNodeID()
* Added separate TestStateStore_GetNodeID to test GetNodeID.
More complete test coverage for GetNodeID
* Added new unit test `TestStateStore_ensureNoNodeWithSimilarNameTxn`
Also fixed comments to be clearer after remarks from @banks
* Fixed error message in unit test to match test case
* Use uuid.ParseUUID to parse Node.ID as requested by @mkeeler
* Fixes TestAgent_IndexChurn
* Fixes TestPreparedQuery_Wrapper
* Increased sleep in agent_test for IndexChurn to 500ms
* Made the comment about joinWAN operation much less of a cliffhanger
* Revert "BUGFIX: Unit test relying on WaitForLeader() did not work due to wrong test (#4472)"
This reverts commit cec5d72396.
* Revert "CA initialization while boostrapping and TestLeader_ChangeServerID fix. (#4493)"
This reverts commit 589b589b53.
- Improve resilience of testrpc.WaitForLeader()
- Add additionall retry to CI
- Increase "go test" timeout to 8m
- Add wait for cluster leader to several tests in the agent package
- Add retry to some tests in the api and command packages
* Fixes the DNS recursor properly resolving the requests
* Added a test case for the recursor bug
* Refactored code && added a test case for all failing recursors
* Inner indentation moved into else if check
Fixes: #4441
This fixes the issue with Connect Managed Proxies + ACLs being broken.
The underlying problem was that the token parsed for most http endpoints was sent untouched to the servers via the RPC request. These changes make it so that at the HTTP endpoint when parsing the token we additionally attempt to convert potential proxy tokens into regular tokens before sending to the RPC endpoint. Proxy tokens are only valid on the agent with the managed proxy so the resolution has to happen before it gets forwarded anywhere.
* New Providers added and updated vendoring for go-discover
* Vendor.json formatted using make vendorfmt
* Docs/Agent/auto-join: Added documentation for the new providers introduced in this PR
* Updated the golang.org/x/sys/unix in the vendor directory
* Agent: TestGoDiscoverRegistration updated to reflect the addition of new providers
* Deleted terraform.tfstate from vendor.
* Deleted terraform.tfstate.backup
Deleted terraform state file artifacts from unknown runs.
* Updated x/sys/windows vendor for Windows binary compilation
* Fix theoretical cache collision bug if/when we use more cache types with same result type
* Generalized fix for blocking query handling when state store methods return zero index
* Refactor test retry to only affect CI
* Undo make file merge
* Add hint to error message returned to end-user requests if Connect is not enabled when they try to request cert
* Explicit error for Roots endpoint if connect is disabled
* Fix tests that were asserting old behaviour
Also change how loadProxies works. Now it will load all persisted proxies into a map, then when loading config file proxies will look up the previous proxy token in that map.
UUID auto-generation here causes trouble in a few cases. The biggest being older
nodes reregistering will fail when the UUIDs are different and the names match
This reverts commit 0f70034082.
This reverts commit d1a8f9cb3f.
This reverts commit cf69ec42a4.
Since DNS is case insensitive and DB as issues when similar names with different
cases are added, check for unicity based on case insensitivity.
Following another big incident we had in our cluster, we also validate
that adding/renaming a not does not conflicts with case insensitive
matches.
We had the following error once:
- one node called: mymachine.MYDC.mydomain was shut off
- another node (different ID) was added with name: mymachine.mydc.mydomain before
72 hours
When restarting the consul server of domain, the consul server restarted failed
to start since it detected an issue in RAFT database because
mymachine.MYDC.mydomain and mymachine.mydc.mydomain had the same names.
Checking at registration time with case insensitivity should definitly fix
those issues and avoid Consul DB corruption.
It fixes the following warnings:
agent/config/builder.go:1201: Errorf format %q has arg s of wrong type *string
agent/config/builder.go:1240: Errorf format %q has arg s of wrong type *string
This also changes where the enforcement of the enable_additional_node_meta_txt configuration gets applied.
formatNodeRecord returns the main RRs and the meta/TXT RRs in separate slices. Its then up to the caller to add to the appropriate sections or not.
This just makes sure that if multiple services are registered with unique service addresses that we don’t blast back multiple CNAMEs for the same service DNS name and keeps us within the DNS specs.
It will allow the following:
* when connectivity is limited (saturated linnks between DCs), only one
single request to refresh an ACL will be sent to ACL master DC instead
of statcking ACL refresh queries
* when extend-cache is used for ACL, do not wait for result, but refresh
the ACL asynchronously, so no delay is not impacting slave DC
* When extend-cache is not used, keep the existing blocking mechanism,
but only send a single refresh request.
This will fix https://github.com/hashicorp/consul/issues/3524
These were only added as SPIFFE intends to use the in the future but currently does not mandate their usage due to patch support in common TLS implementations and some ambiguity over how to use them with URI SAN certificates. We included them because until now everything seem fine with it, however we've found the latest version of `openssl` (1.1.0h) fails to validate our certificats if its enabled. LibreSSL as installed on OS X by default doesn’t have these issues. For now it's most compatible not to have them and later we can find ways to add constraints with wider compatibility testing.
Few other fixes in here just to get a clean run locally - they are all also fixed in other PRs but shouldn't conflict.
This should be robust to timing between goroutines now.
- Dev mode assumed no persistence of services although proxy state is persisted which caused proxies to be killed on startup as their services were no longer registered. Fixed.
- Didn't snapshot the ProxyID which meant that proxies were adopted OK from snapshot but failed to restart if they died since there was no proxyID in the ENV on restart
- Dev mode with no persistence just kills all proxies on shutdown since it can't recover them later
- Naming things
This turns out to have a lot more subtelty than we accounted for. The test suite is especially prone to races now we can only poll the child and many extra levels of indirectoin are needed to correctly run daemon process without it becoming a Zombie.
I ran this test suite in a loop with parallel enabled to verify for races (-race doesn't find any as they are logical inter-process ones not actual data races). I made it through ~50 runs before hitting an error due to timing which is much better than before. I want to go back and see if we can do better though. Just getting this up.
agent/config will turn [{}] into {} (single element maps into a single
map) to work around HCL issues. These are resolved in HCL2 which I'm
sure Consul will switch to eventually.
This breaks the connect proxy configuration in service definition FILES
since we call this patch function. For now, let's just special-case skip
this. In the future we maybe Consul will adopt HCL2 and fix it, or we
can do something else if we want. This works and is tested.
This enables `consul agent -dev` to begin using Connect features with
the built-in CA. I think this is expected behavior since you can imagine
that new users would want to try.
There is no real downside since we're just using the built-in CA.
There are also a lot of small bug fixes found when testing lots of things end-to-end for the first time and some cleanup now it's integrated with real CA code.
- Includes some bug fixes for previous `api` work and `agent` that weren't tested
- Needed somewhat pervasive changes to support hash based blocking - some TODOs left in our watch toolchain that will explicitly fail on hash-based watches.
- Integration into `connect` is partially done here but still WIP
Intention de-duplication in previously merged PR actualy failed some tests that were not caught be me or CI. I ran the test files for state changes but they happened not to trigger this case so I made sure they did first and then fixed. That fixed some upstream intention endpoint tests that I'd not run as part of testing the previous fix.
Uses struct/interface embedding with the embedded structs/interfaces being empty for oss. Also methods on the server/client types are defaulted to do nothing for OSS
* Move settings to use the same service/route API as the rest of the app
* Put some ideas down for unit testing on adapters
* Favour `Model` over `Entity`
* Move away from using `reopen` to using Mixins
* Amend messages, comment/document some usage
* Make sure the returns are consistent in normalizePayload, also
Add some todo's in to remind me to think consider this further at a
later date. For example, is normalizePayload to be a hook or an
overridable method
* Start stripping back the HTML to semantics
* Use a variable rather than chaining
* Remove unused helpers
* Start picking through the new designs, start with listing pages
* First draft HTML for every page
* Making progress on the CSS
* Keep plugging away at the catalog css
* Looking at scrolling
* Wire up filtering
* Sort out filter counting, more or less done a few outstanding
* Start knocking the forms into shape
* Add in codemirror
* Keep moving forwards with the form like layouts
* Start looking at ACL editing page, add footer in
* Pull the filters back in, look at an autoresizer for scroll views
* First draft toggles
* 2nd draft healthcheck icons
* Tweak node healthcheck icons
* Looking at healthcheck detail icons
* Tweak the filter-bar and add selections to the in content tabs
* Add ACL create, pill-like acl type highlight
* Tweaking the main nav some more
* Working on the filter-bar and freetext-filter
* Masonry layout
* Stick with `checks` instead of healthy/unhealthy
* Fix up the filter numbers/counts
* Use the thead for a measure
* First draft tomography back in
* First draft DC dropdown
* Add a temporary create buttong to kv's
* Move KV and ACL to use a create page
* Move tags
* Run through old tests
* Injectable server
* Start adding test attributes
* Add some page objects
* More test attributes and pages
* Acl filter objects
* Add a page.. page object
* Clickable items in lists
* Add rest/spread babel plugin, remove mirage for now
* Add fix for ember-collection
* Keep track of acl filters
* ember-cli-page-object
* ember-test-selectors
* ui: update version of ui compile deps
* Update static assets
* Centralize radiogroup helper
* Rejig KV's and begin to clean it up
* Work around lack of Tags for the moment..
* Some little css tweaks and start to remove possibles
* Working on the dc page and incidentals
1. Sort the datacenter-picker list
2. Add a selected state to the datacenter-picker
3. Make dc an {Name: dc}
4. Add an env helper to get to 'env vars' from within templates
* Click outside stuff for the datacenter-picker, is-active on nav
* Make sure the dropdown CTA can be active
* Bump ember add pluralize helper
* Little try at sass based custom queries
* Rejig tablular collection so it deals with resizing, actions
1. WIP: start building actions dropdowns
2. Move tabular collection to deal with resizing to rule out differences
* First draft actions dropdowns
* Add ports, selectable IP's
* Flash messages, plus general cleanup/consistency
1. Add ember-cli-flash for flash messages
2. Move everything to get() instead of item.get
3. Spotted a few things that weren't consistent
* DOn't go lower than zero
* First draft vertical menu
* Missed a get, tweak dropmenu tick
* Big cleanup
1. this.get(), this.set() > get(), set()
2. assign > {...{}, ...{}}
3. Seperator > separator
* WIP: settings
* Moved things into a ui-v2 folder
* Decide on a way to do the settings page whilst maintaining the url + dc's
* Start some error pages
* Remove base64 polyfill
* Tie in settings, fix atob bug, tweak layout css
* Centralize confirmations into a component
* Allow switching between the old and new UI with the CONSUL_UI_BETA env var
Currently all the assets are packaged into a single AssetFS and a prefix is configured to switch between the two.
* Attempt at some updates to integrate the v2 ui build into the main infrastructure
* Add redirect to index.html for unknown paths
* Allow redictor to /index.html for new ui when using -ui-dir
* Take ACLs to the correct place on save
* First pass breadcrumbs
* Remove datacenter selector on the index page
* Tweak overall layout
* Make buttons 'resets'
* Tweak last DC stuff
* Validations plus kv keyname viewing tweaks
* Pull sessions back in
* Tweak the env vars to be more reusable
* Move isAnon to the view
* No items and disabled acl css
* ACL and KV details
1. Unauthorized page
2. Make sure the ACL is always selected when it needs it
3. Check record deletion with a changeset
* Few more acl tweaks/corrections
* Add no items view to node > services
* Tags for node > services
* Make sure we have tags
* Fix up the labels on the tomography graph
* Add node link (agent) to kv sessions
* Duplicate up `create` for KV 'root creation'
* Safety check for health checks
* Fix up the grids
* Truncate td a's, fix kv columns
* Watch for spaces in KV id's
* Move actions to their own mixins for now at least
* Link reset to settings incase I want to type it in
* Tweak error page
* Cleanup healthcheck icons in service listing
* Centralize errors and make getting back easier
* Nice numbers
* Compact buttons
* Some incidental css cleanups
* Use 'Key / Value' for root
* Tweak tomography layout
* Fix single healthcheck unhealthy resource
* Get loading screen ready
* Fix healthy healthcheck tick
* Everything in header starts white
* First draft loader
* Refactor the entire backend to use proper unique keys, plus..
1. Make unique keys form dc + slug (uid)
2. Fun with errors...
* Tweak header colors
* Add noopener noreferrer to external links
* Add supers to setupController
* Implement cloning, using ember-data...
* Move the more expensive down the switch order
* First draft empty record cleanup..
* Add the cusomt store test
* Temporarily use the htmlSafe prototype to remove the console warning
* Encode hashes in urls
* Go back to using title for errors for now
* Start removing unused bulma
* Lint
* WIP: Start looking at failing tests
* Remove single redirect test
* Finish off error message styling
* Add full ember-data cache invalidation to avoid stale data...
* Add uncolorable warning icons
* More info icon
* Rearrange single service, plus tag printing
* Logo
* No quotes
* Add a simple startup logo
* Tweak healthcheck statuses
* Fix border-color for healthchecks
* Tweak node tabs
* Catch 401 ACL errors and rethrow with the provided error message
* Remove old acl unauth and error routes
* Missed a super
* Make 'All' refer to number of checks, not services
* Remove ember-resizer, add autoprefixer
* Don't show tomography if its not worth it, viewify it more also
* Little model cleanup
* Chevrons
* Find a way to reliably set the class of html from the view
* Consistent html
* Make sure session id's are visible as long as possible
* Fix single service check count
* Add filters and searchs to the query string
* Don't remember the selected tab
* Change text
* Eror tweaking
* Use chevrons on all breadcrumbs even in kv's
* Clean up a file
* Tweak some messaging
* Makesure the footer overlays whats in the page
* Tweak KV errors
* Move json toggle over to the right
* feedback-dialog along with copy buttons
* Better confirmation dialogs
* Add git sha comment
* Same title as old UI
* Allow defaults
* Make sure value is a string
* WIP: Scrolling dropdowns/confirmations
* Add to kv's
* Remove set
* First pass trace
* Better table rows
* Pull over the hashi code editor styles
* Editor tweaks
* Responsive tabs
* Add number formatting to tomography
* Review whats left todo
* Lint
* Add a coordinate ember data triplet
* Bump in a v2.0.0
* Update old tests
* Get coverage working again
* Make sure query keys are also encoded
* Don't test console.error
* Unit test some more utils
* Tweak the size of the tabular collections
* Clean up gitignore
* Fix copy button rollovers
* Get healthcheck 'icon icons' onto the text baseline
* Tweak healthcheck padding and alignment
* Make sure commas kick in in rtt, probably never get to that
* Improve vertical menu
* Tweak dropdown active state to not have a bg
* Tweak paddings
* Search entire string not just 'startsWith'
* Button states
* Most buttons have 1px border
* More button tweaks
* You can only view kv folders
* CSS cleanup reduction
* Form input states and little cleanup
* More CSS reduction
* Sort checks by importance
* Fix click outside on datacenter picker
* Make sure table th's also auto calculate properly
* Make sure `json` isn't remembered in KV editing
* Fix recursive deletion in KV's
* Centralize size
* Catch updateRecord
* Don't double envode
* model > item consistency
* Action loading and ACL tweaks
* Add settings dependencies to acl tests
* Better loading
* utf-8 base64 encode/decode
* Don't hang off a prototype for htmlSafe
* Missing base64 files...
* Get atob/btoa polyfill right
* Shadowy rollovers
* Disabled button styling for primaries
* autofocuses only onload for now
* Fix footer centering
* Beginning of 'notices'
* Remove the isLocked disabling as we are letting you do what the API does
* Don't forget the documentation link for sessions
* Updates are more likely
* Use exported constant
* Dont export redirectFS and a few other PR updates
* Remove the old bootstrap config which was used for the old UI skin
* Use curlies for multiple properties