them, not in the steady state once they've been created. This makes it
much less likely that users will run into static IP quota issues.
Also add slightly more parallelism to the deletion of load balancers
now that I realize the static IPs can be deleted in parallel with
forwarding rules :)
Previously we'd just tear everything down and recreate it, which makes
for a pretty bad experience because it causes downtime whenever the
service controller restarts and has to make sure everything is in the
desired state.
This adds more code than I'd prefer, but makes it much cleaner and more
organized than it was before, in my opinion. I didn't bother
parallelizing anything because it's complex enough as it is, right now.
It's consistently passing the existing e2es and worked when I tested
manually, but this could definitely use additional e2e tests and/or some
serious refactoring to make real unit tests feasible. I'll follow up
with one or two e2e tests that make sense (updating an LB or killing the
controller manager, perhaps).
This will cut down on the amount of time it takes to delete an external
load balancer, which should reduce the likelihood of resource leaks when
clusters are deleted.
This code was in rough shape, so I've fixed the issues with the original
PR as well as a few other changes:
1. Clarify the error messages related to the "gce Addresses" to make it
clear we're talking about static IP addresses
2. Fix the bug in the original PR, which was a nil pointer dereference
from passing op to waitForRegionOp when the address doesn't exist.
3. Rearrange the steps of EnsureTCPLoadBalancerDeleted to be the reverse
of EnsureCreated, which mostly just seems like good practice to me.
This is also supported by the following two bugs I found :(
4. Fix an independent bug of returning too early if the target pool
doesn't exist, effectively stranding the firewall. This was likely
introduced because target pools used to be the last thing deleted,
so it was previously safe to return there.
5. Fix an independent bug of not returning an error waiting for the
target pool to be deleted failed. This was very possibly causing
target pool leaks in our e2e tests. This was similarly due to
assuming that the target pool was the last thing deleted in the
function, then having the firewall deletion stuck in after it.
Only takes the first available subnet in a AZ, ignore other subnets
and log warning about this.
Removes AWS region comparison for subnet AZs. A VPC is only in a single
AWS region.
Fixes#12381
This code was in rough shape, so I've fixed the issues with the original
PR as well as a few other changes:
1. Clarify the error messages related to the "gce Addresses" to make it
clear we're talking about static IP addresses
2. Fix the bug in the original PR, which was a nil pointer dereference
from passing op to waitForRegionOp when the address doesn't exist.
3. Rearrange the steps of EnsureTCPLoadBalancerDeleted to be the reverse
of EnsureCreated, which mostly just seems like good practice to me.
This is also supported by the following two bugs I found :(
4. Fix an independent bug of returning too early if the target pool
doesn't exist, effectively stranding the firewall. This was likely
introduced because target pools used to be the last thing deleted,
so it was previously safe to return there.
5. Fix an independent bug of not returning an error waiting for the
target pool to be deleted failed. This was very possibly causing
target pool leaks in our e2e tests. This was similarly due to
assuming that the target pool was the last thing deleted in the
function, then having the firewall deletion stuck in after it.
The ELB client lookup isn't necessary because the service
does not operate across regions. Instead the client should
be built like the others by querying for the region from
the master node's metadata service.
Inverting code path on CreateTcploadBalancer to avoid branch divergence
Removing useless variable vipAddr as vip have information needed
Renaming 'error' variable on EnsureTCPLoadBalancerDeleted to be consistent
A lot of packages use StringSet, but they don't use anything else from
the util package. Moving StringSet into another package will shrink
their dependency trees significantly.
Avoid creating a new 'err' variable in the 'if'-branch, shadowing the one
in the outer scope.
Any error from subsequent 'cloud, err = GetCloudProvider()' was not propagated
to 'err' variable in the outer scope and thus errors were never returned from
this function.
This is hard to debug error on OpenStack, when content of --cloud-config=
file is wrong or connection to OpenStack fails. Such error is never logged
and Kubernetes thinks everything is OK.
This will allows authentication with the AWS API using the
~/.aws/credentials file which is created by runnign 'aws configure' on
a node.
Signed-off-by: Sami Wagiaalla <swagiaal@redhat.com>
ELB will automatically create a health check, but if we update the
listeners the old health check port sticks around, and all the instances
are marked offline.
Update the health-checks to match the listeners: we just check the first
valid service port, with some hard-coded options for timeouts / retries etc.
This turned out to be a little convoluted, but is needed because deleting an ELB on AWS
is a painful UX - it won't have the same endpoint when it is recreated.
Also started splitting the provider into files, but only for new functions (so far!)
Previously the servicecontroller would do the delete, but by having the cloudprovider
take that task on, we can later remove it from the servicecontroller, and the
cloudprovider can do something more efficient.
The current code assumes the full domain name will not be included,
which is not always the case. This patch adds support for computing the
host tag from a fully qualified domain name.
This should ensure all load balancers get deleted even if a reordering of
watch events causes us to strand one after its service has been deleted,
because the sync will notice that the service controller's cache has a
service in it that no longer exists in the apiserver.
It could still leak in the case that the controller manager is killed
between when it leaks something and the sync runs, but this should
improve things.
We need to find the ID for a named security group, or create a new one.
We do this by listing the security groups, and then doing a create if we
cannot find one. This is a race though; against another thread if the
AWS API were consistent, but generally because the AWS API is actually
eventually consistent.
We wrap it in a retry loop.
When we create a security-group in the AWS API, there is sometimes
a delay before we can tag it (the AWS API is eventually consistent).
So we wrap CreateTags in a simple retry loop.
It seems impossible to determine from outside. Thankfully we're running
the attachment from inside the instance, so can check for /dev/sdX or
/dev/xvdX.
More modern images seem to be moving to /dev/xvdX
This is a partial reversion of #9728, and should fix#10612.
9728 used the AWS instance id as the node name. But proxy, logs
and exec all used the node name as the host name for contacting the minion.
It is possible to resolve a host to the IP, and this fixes logs. But
exec and proxy also require an SSL certificate match on the hostname,
and this is harder to fix.
So the sensible fix seems to be a minimal reversion of the changes in #9728,
and we can revisit this post 1.0.
ExternalID must return "", cloudprovider.InstanceNotFound if the instance
is not found, for nodecontroller to remove nodes corresponding to deleted instances.
We need the last state seen for interpreting the change-stream,
separately we need to track the last state we successfully applied to the
load balancer.
The most reliable way seems to be to deauthorize the LB security group from
other groups, then delete the LB itself, then repeatedly retry to delete the LB
security group.
We can't delete the LB security group until the LB is actually completely
deleted, but the LB is hidden from the API during deletion. So our only real
option is to retry deletion of the LB security group until the expected error
goes away when the LB is fully deleted.
Whenever we do a list we now filter on tags so we only see resources relating
to our cluster.
Also, rationalize all the DescribeX calls:
* They all take a request object (so that we can pass filters)
* They do paging if that is required (and return the underlying resources)
* They wrap any error with a "error while listing X: %v" message
Previously we always passed `Address: externalIP.String()` while
creating a loadbalancer VIP. This passed "0.0.0.0" when externalIP was
unspecified, effectively making it mandatory to specify an externalIP.
This change correctly leaves `Address` unspecified when externalIP is
unspecified (has a zero value).
(Thanks to @justinsb for the report)
These were introduced because the new official AWS SDK uses *string
where the old library used strings. We now use the helpers much
more (orEmpty and isNilOrEmpty).
Fixes#9123
This change allows EnsureTCPLoadBalancerDeleted to be called repeatedly
to reattempt deleting objects that may have failed on a previous run.
Specifically, if the VIP is already deleted, then an attempt is made to
lookup the pool by name. Returns success when both the VIP and pool are
not found.
Fixes#8352
Refactor GetNodeHostIP into pkg/util/node (instead of pkg/util to break import cycle).
Include internalIP in gce NodeAddresses. Remove NodeLegacyHostIP