You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
k3s/docs/adrs/remove-svclb-daemonset.md

8.5 KiB

Remove svclb daemonset

Date: 2024-09-26

Status

Not approved

Context

There are three types of services in Kubernetes:

  • ClusterIP
  • NodePort
  • LoadBalancer

If we want to expose the service to external clients, i.e. clients outside of the Kubernetes cluster, we need to use NodePort or Loadbalancer types. The latter uses an externalIP, normally a publicIP, which can be easily reached from external clients. To support Loadbalancer service types, an external controller (loadbalancer controller) is required.

The loadbalancer controller takes care of three tasks: 1 - Watches the kube-api for services of type LoadBalancer 2 - Sets up the infrastructure to provide the connectivity (externalIP ==> service) 3 - Sets the externalIP

K3s embeds a simple loadbalancer controller that we call svclb, which has been part of K3s since its inception. When a new service of type LoadBalancer comes up, this svclb creates a daemonset. That daemonset uses hostPort to reserve the service port in all nodes. Subsequently, the serviceLB controller queries the daemonset pods to know the node ips and sets those node ips as the externalIPs for the service

When an external client wants to reach the service, it needs to point to any of the node ips and use the service port. The flow of traffic would be the following: 1 - Traffic reaches the node 2 - Because hostport is reserving the service port in the node, traffic is forwarded to the daemonset pod 3 - The daemonset pod, using klipper-lb image, applies some iptables magic which replaces the destination IP with the clusterIP of the desired service 4 - Traffic gets routed to the service using regular kubernetes networking

However, after some investigation, it was found that traffic is never reaching the daemonset pod. The reason for this is that when a service gets an externalIP, kube-proxy reacts to this and adds a new rule in iptables chain KUBE-SERVICES. This rule also replaces the destination IP with the clusterIP of the desired service. Moreover, the KUBE-SERVICES chain comes before the hostPort logic and hence this is the path the traffic takes.

EXAMPLE:

Imagine a two node cluster. The traefik service uses type LoadBalancer for two ports: 80 and 443. It gets 4 external ips (2 IPv4 and 2 IPv6)

NAMESPACE     NAME             TYPE           CLUSTER-IP      EXTERNAL-IP                                                         PORT(S)                      AGE
default       kubernetes       ClusterIP      10.43.0.1       <none>                                                              443/TCP                      56m
kube-system   kube-dns         ClusterIP      10.43.0.10      <none>                                                              53/UDP,53/TCP,9153/TCP       56m
kube-system   metrics-server   ClusterIP      10.43.55.117    <none>                                                              443/TCP                      56m
kube-system   traefik          LoadBalancer   10.43.206.216   10.1.1.13,10.1.1.16,fd56:5da5:a285:eea0::6,fd56:5da5:a285:eea0::8   80:30235/TCP,443:32373/TCP   56m

In iptables, in the chain OUTPUT, we can observe that the KUBE-SERVICES chain comes before the CNI-HOSTPORT-DNAT, which is the chain taking care of the hostport functionality:

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */
CNI-HOSTPORT-DNAT  all  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL

In the KUBE-SERVICES chain, we can observe that there is one rule for each of the external-IP & port pairs, which start with KUBE-EXT-:

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination         
KUBE-SVC-Z4ANX4WAEWEBLCTM  tcp  --  0.0.0.0/0            10.43.55.117         /* kube-system/metrics-server:https cluster IP */ tcp dpt:443
KUBE-SVC-UQMCRMJZLI3FTLDP  tcp  --  0.0.0.0/0            10.43.206.216        /* kube-system/traefik:web cluster IP */ tcp dpt:80
KUBE-EXT-UQMCRMJZLI3FTLDP  tcp  --  0.0.0.0/0            10.1.1.13            /* kube-system/traefik:web loadbalancer IP */ tcp dpt:80
KUBE-EXT-UQMCRMJZLI3FTLDP  tcp  --  0.0.0.0/0            10.1.1.16            /* kube-system/traefik:web loadbalancer IP */ tcp dpt:80
KUBE-SVC-CVG3OEGEH7H5P3HQ  tcp  --  0.0.0.0/0            10.43.206.216        /* kube-system/traefik:websecure cluster IP */ tcp dpt:443
KUBE-EXT-CVG3OEGEH7H5P3HQ  tcp  --  0.0.0.0/0            10.1.1.13            /* kube-system/traefik:websecure loadbalancer IP */ tcp dpt:443
KUBE-EXT-CVG3OEGEH7H5P3HQ  tcp  --  0.0.0.0/0            10.1.1.16            /* kube-system/traefik:websecure loadbalancer IP */ tcp dpt:443
KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  0.0.0.0/0            10.43.0.1            /* default/kubernetes:https cluster IP */ tcp dpt:443
KUBE-SVC-JD5MR3NA4I4DYORP  tcp  --  0.0.0.0/0            10.43.0.10           /* kube-system/kube-dns:metrics cluster IP */ tcp dpt:9153
KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  0.0.0.0/0            10.43.0.10           /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  0.0.0.0/0            10.43.0.10           /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53
KUBE-NODEPORTS  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

Those KUBE-EXT chains, end up calling the rule starting with KUBE-SVC- which replaces the destination IP with the IP of one of pods implementing the service. For example:

Chain KUBE-EXT-CVG3OEGEH7H5P3HQ (4 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  all  --  0.0.0.0/0            0.0.0.0/0            /* masquerade traffic for kube-system/traefik:websecure external destinations */
KUBE-SVC-CVG3OEGEH7H5P3HQ  all  --  0.0.0.0/0            0.0.0.0/0   

As a consequence, the traffic never gets into the svclb daemonset pod. This can be additionally demonstrated by running a tcpdump on the svclb daemonset pod and no traffic will appear. This can also be demonstrated by tracing the iptables flow, where we will see how traffic is following the described path.

Therefore, if we replace the logic to find the node IPs of the serviceLB controller by something which does not require the svclb daemonset, we could get rid of that daemonset since traffic is never reaching it. That replacement should be easy because in the end a daemonset means all nodes, so we could basically query kube-api to provide the IPs of all nodes.

Decision

There is one use case where klipper-lb is used. When deploying in a public cloud and using the publicIP as the --node-external-ip, kube-proxy is expecting the publicIP to be the destination IP. However, public clouds are normally doing a DNAT, so the kube-proxy's rule will never be used because the incoming packet does not have the publicIP anymore. In that case, the packet is capable of reaching the service because of the hostPort functionality on the daemonset svclb pushing the packet to svclb and then, klipper-lb routing the packet to the service. Conclusion: klipper-lb is needed

Consequences

Positives

  • Less resource consumption as we won't need one daemonset per LoadBalancer type of service
  • One fewer repo to maintain (klipper-lb)
  • Easier to understand flow of traffic

Negatives

  • Possible confusion for users that have been using this feature for a long time ("Where is the daemonset?") or users relying on that daemonset for their automation
  • In today's solution, if two LoadBalancer type services are using the same port, it is rather easy to notice that things don't work because the second daemonset will not deploy, as the port is already being used by the first daemonset. Kube-proxy does not check if two services are using the same port and it will create both rules without any error. The service that gets its rules higher in the chain is the one that will be reached when querying $nodeIP:$port. Perhaps we could add some logic in the controller that warns users about a duplication of the pair ip&port