fix(k3s): pin --node-ip + --advertise-address to cp_private_ip (#1457)

prov #62 (cpx52, kernel 6.8.0-111): primary CP cilium init CrashLoop with "dial tcp 10.0.1.2:6443: i/o timeout".

k3s server auto-detects its node IP from the primary interface, which on Hetzner cpx52 binds to the public IPv4 (49.x.x.x) instead of the private network IP (10.0.1.2). kube-apiserver advertises 49.x.x.x and binds there; nothing answers on 10.0.1.2:6443. The Cilium agent's k8s client dials the private IP taken from cilium-config k8sServiceHost, times out, and goes into CrashLoop.

Worked by luck on cpx42 (earlier kernel + Hetzner network attach timing). cpx52 reproduces 100%.

Fix: pass --node-ip=${cp_private_ip} + --advertise-address=${cp_private_ip} in INSTALL_K3S_EXEC. k3s then binds kube-apiserver on the private IP AND advertises it as the node's INTERNAL-IP. Pods reaching ${cp_private_ip}:6443 (the value substituted into cilium-config) find the API server every time.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
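
A minimal sketch of how the two flags could be guarded before they reach INSTALL_K3S_EXEC. This helper is not part of the commit: the function name k3s_private_ip_flags and the validation rules are illustrative assumptions. Only --node-ip and --advertise-address come from the change itself. The idea is to fail early if the rendered private IP is empty or malformed, rather than hand k3s a flag with no usable value.

```shell
#!/bin/sh
# Hypothetical helper (assumption, not in the commit): print the two k3s
# server flags for a private IPv4 address, or return 1 if the value is
# empty or not a dotted quad with octets in 0..255.
k3s_private_ip_flags() {
  ip="$1"

  # Reject empty input, non-digit/dot characters, and stray dots.
  case "$ip" in
    ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;
  esac

  # Split on dots into positional parameters.
  old_ifs="$IFS"
  IFS=.
  set -- $ip
  IFS="$old_ifs"

  [ "$#" -eq 4 ] || return 1

  for octet in "$@"; do
    # More than three digits can never be a valid octet.
    case "$octet" in
      ????*) return 1 ;;
    esac
    [ "$octet" -le 255 ] || return 1
  done

  printf '%s\n' "--node-ip=$ip --advertise-address=$ip"
}
```

Calling k3s_private_ip_flags 10.0.1.2 prints the flag pair for the primary CP address used in this commit. After install, `kubectl get nodes -o wide` should list the private IP under INTERNAL-IP, and `kubectl get endpoints kubernetes -n default` should show the advertised API server address. Both are worth checking on a fresh node before relying on the fix.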
2026-05-13 17:34:30 +04:00
commit 5f4f9f2cb5
parent 6fac1481d3
1 changed file with 11 additions and 1 deletion
--- a/infra/hetzner/cloudinit-control-plane.tftpl
+++ b/infra/hetzner/cloudinit-control-plane.tftpl
@@ -1203,7 +1203,17 @@ runcmd:
  # becomes ready. Skip the taint when there are no workers; fall back
  # to k3s default (CP fully schedulable) so the solo node carries
  # everything.
-  - 'curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=${k3s_version} K3S_TOKEN=${k3s_token} INSTALL_K3S_EXEC="server --cluster-init --flannel-backend=none --disable-network-policy --disable=traefik --disable=servicelb --tls-san=${sovereign_fqdn} --tls-san=${cp_private_ip} --kube-apiserver-arg=oidc-issuer-url=https://auth.${sovereign_fqdn}/realms/sovereign --kube-apiserver-arg=oidc-client-id=kubectl --kube-apiserver-arg=oidc-username-claim=preferred_username --kube-apiserver-arg=oidc-username-prefix=oidc: --kube-apiserver-arg=oidc-groups-claim=groups --kube-apiserver-arg=oidc-groups-prefix=oidc: --node-label catalyst.openova.io/role=control-plane ${worker_count > 0 ? "--node-taint node-role.kubernetes.io/control-plane=true:NoSchedule " : ""}--write-kubeconfig-mode=0644" sh -'
+  #
+  # --node-ip + --advertise-address pin the API server to ${cp_private_ip}
+  # (10.0.1.2 primary; 10.0.<10+idx>.2 secondary). Without them k3s
+  # auto-detects the public interface (49.x.x.x), kube-apiserver
+  # advertises that IP, and any pod (cilium init/operator, coredns)
+  # dialing 10.0.1.2:6443 times out because nothing listens on it.
+  # Symptom on prov #62 (cpx52, kernel 6.8.0-111): cilium-agent init
+  # CrashLoop with "dial tcp 10.0.1.2:6443: i/o timeout" → primary
+  # cluster never makes a Ready node. Worked by luck on cpx42 (earlier
+  # kernel + network-init order); cpx52 reproduces reliably.
+  - 'curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=${k3s_version} K3S_TOKEN=${k3s_token} INSTALL_K3S_EXEC="server --cluster-init --flannel-backend=none --disable-network-policy --disable=traefik --disable=servicelb --node-ip=${cp_private_ip} --advertise-address=${cp_private_ip} --tls-san=${sovereign_fqdn} --tls-san=${cp_private_ip} --kube-apiserver-arg=oidc-issuer-url=https://auth.${sovereign_fqdn}/realms/sovereign --kube-apiserver-arg=oidc-client-id=kubectl --kube-apiserver-arg=oidc-username-claim=preferred_username --kube-apiserver-arg=oidc-username-prefix=oidc: --kube-apiserver-arg=oidc-groups-claim=groups --kube-apiserver-arg=oidc-groups-prefix=oidc: --node-label catalyst.openova.io/role=control-plane ${worker_count > 0 ? "--node-taint node-role.kubernetes.io/control-plane=true:NoSchedule " : ""}--write-kubeconfig-mode=0644" sh -'

  # Wait for the API server to be reachable. Cilium needs to come up before
  # nodes Ready, so we wait specifically for the API endpoint.