openova/products/catalyst/bootstrap
e3mrah 48f64a4992
fix(clustermesh): derive cluster name + ID at orchestrator if request unset (#1528)
When the operator submits the canonical multi-region body without
ClusterMeshName / ClusterMeshID, the in-memory dep.Request fields stay
empty. tofu's writeTfvars internally calls deriveClusterMeshName /
deriveClusterMeshID, so the cilium-config rendered on each region gets
the right cluster.name + cluster.id. The catalyst-api orchestrator,
however, was reading from dep.Request directly, so:

  - slot.clusterID stayed 0, a value cilium reserves. A kvstoremesh
    CrashLoopBackOff would follow if any deployment escaped a previous
    coalesce shim. We don't trip this today because cluster.id is set
    by chart values, but slot.clusterID=0 misreports in PeerStatus.
  - slot.clusterName stayed "" → peerEntries dict got "" keys →
    `Create Secret kube-system/cilium-clustermesh: ... a valid config
    key must consist of alphanumeric characters, '-', '_' or '.'`
    rejection → orchestrator wrote zero peers in every region.
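
The rejection in the second bullet can be reproduced without a cluster.
The sketch below is illustrative only: it encodes just the character rule
quoted in the error message, and the name isValidSecretKey is
hypothetical, not part of catalyst-api or Kubernetes. The real API-server
validator applies further checks that this sketch omits.

```go
package main

import "regexp"

// secretKeyPattern encodes the rule quoted in the rejection message:
// one or more alphanumeric characters, '-', '_' or '.'.
// Assumption: this covers only the character rule, not every check the
// API server performs on Secret data keys.
var secretKeyPattern = regexp.MustCompile(`^[-._a-zA-Z0-9]+$`)

// isValidSecretKey (hypothetical helper) reports whether key satisfies
// the character rule above. An empty key never matches, which is why an
// empty clusterName used as a map key makes the whole Secret invalid.
func isValidSecretKey(key string) bool {
	return secretKeyPattern.MatchString(key)
}
```

Under this rule isValidSecretKey("") is false, so a single empty-keyed
entry is enough for the Secret create to be rejected as a whole.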

Caught on t125 (590ab1490d00c452, 2026-05-16): all 3 regions had
clustermesh-apiserver Pod 3/3 Ready, LB IPs assigned, cilium-ca
present — but cilium-clustermesh Secret stayed absent after PR #1525
unblocked the kubeconfig-path resolution. Orchestrator logged 3x
"clustermesh: Secret apply failed ... data[]: Invalid value: """
with empty region/cluster fields.

This PR:

1. Exports DeriveClusterMeshName + DeriveClusterMeshID from the
   provisioner package so the orchestrator + tofu agree byte-identically
   on derivation (canonical seam — no duplicate logic).
2. buildRegionSlots now calls these exported helpers when dep.Request
   fields are empty. Lifts primary-mesh-name derivation out of the
   per-region loop.
3. Adds a defensive guard in the per-peer inner loop: a peer whose
   clusterName is empty fails with PeerStatus.Error and DOES NOT add
   empty-keyed entries to peerEntries (so even if a future regression
   bypasses the derivation, the Secret-Create error is no longer a
   blast-radius bug killing the whole region's write).
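
Items 2 and 3 above can be sketched roughly as follows. This is not the
code from the PR: the request and slot types, resolveSlot,
buildPeerEntries, and the entry payload are hypothetical stand-ins, and
the derive functions are passed in as parameters because the actual
DeriveClusterMeshName / DeriveClusterMeshID logic is not shown here.

```go
package main

import "fmt"

// request is a hypothetical stand-in for the two dep.Request fields.
type request struct {
	ClusterMeshName string
	ClusterMeshID   int
}

// slot is a hypothetical stand-in for the orchestrator's per-region slot.
type slot struct {
	clusterName string
	clusterID   int
}

// resolveSlot prefers values set on the request and falls back to the
// supplied derive functions when a field is unset (empty name, zero ID).
func resolveSlot(req request, deriveName func() string, deriveID func() int) slot {
	s := slot{clusterName: req.ClusterMeshName, clusterID: req.ClusterMeshID}
	if s.clusterName == "" {
		s.clusterName = deriveName()
	}
	if s.clusterID == 0 {
		s.clusterID = deriveID()
	}
	return s
}

// buildPeerEntries collects one entry per peer, keyed by cluster name.
// A peer with an empty name is recorded in errs (keyed by its index) and
// skipped, so it cannot add an empty key to the returned entries.
func buildPeerEntries(peers []slot) (entries map[string][]byte, errs map[int]string) {
	entries = make(map[string][]byte)
	errs = make(map[int]string)
	for i, p := range peers {
		if p.clusterName == "" {
			errs[i] = "clusterName is empty; peer skipped"
			continue
		}
		// Placeholder payload; the real entry content is not shown in the source.
		entries[p.clusterName] = []byte(fmt.Sprintf("cluster-id: %d\n", p.clusterID))
	}
	return entries, errs
}
```

Keeping the guard separate from the fallback means one bad peer is
reported on its own status while the remaining peers are still written.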

Refs DoD D10/D11. Same incident chain as PR #1525.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 16:36:25 +04:00
api fix(clustermesh): derive cluster name + ID at orchestrator if request unset (#1528) 2026-05-16 16:36:25 +04:00
ui fix(sovereign-ui): derive synthetic Apps/Handover stage status from deployment record + auto-redirect after handover (#1522) 2026-05-16 14:56:16 +04:00