Commit Graph

606 Commits

Author SHA1 Message Date
e3mrah
050f87e267
fix(purge): second name-prefix pass for CCM-named clustermesh LBs (#1532)
Caught repeatedly (t124 and t125 wipes, both 2026-05-16): tofu destroy left
3 orphan `<fqdn-slug>-<region>-clustermesh` LBs each cycle. Names
don't start with `catalyst-` prefix because they're named by the
Cilium chart overlay
(`clusters/_template/bootstrap-kit/01-cilium.yaml`):

    load-balancer.hetzner.cloud/name:
      "${SOVEREIGN_FQDN_SLUG:=catalyst}-${SOVEREIGN_REGION_KEY:=primary}-clustermesh"

The first name-prefix pass (`catalyst-<fqdn-slug>`) misses these.
tofu doesn't manage them (CCM allocated post-Phase-1). Manual API
cleanup was forced each cycle.

Fix: add a second `purgeByNamePrefix` pass with the slug-only prefix
(`<fqdn-slug>-`) so any CCM-allocated resource named with the slug
gets swept. Dedup logic in `purgeByNamePrefix` skips names already
reported by the labelled pass, so totals stay accurate.
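
Illustrative sketch of the two-pass sweep with dedup, not the repository's
code. The function signature, the `seen` set, and the load-balancer names
are assumptions made for the example; only the idea (a second slug-only
prefix pass that skips names already counted) comes from the message above.

```go
package main

import (
	"fmt"
	"strings"
)

// purgeByNamePrefix is a hypothetical stand-in. It returns the names in
// existing that start with prefix and were not already recorded in seen,
// and records each match in seen so later passes do not double-count it.
func purgeByNamePrefix(existing []string, prefix string, seen map[string]bool) []string {
	var purged []string
	for _, name := range existing {
		if !strings.HasPrefix(name, prefix) || seen[name] {
			continue
		}
		seen[name] = true
		purged = append(purged, name)
	}
	return purged
}

func main() {
	// Hypothetical resource names for a Sovereign whose slug is "t125-omani-works".
	lbs := []string{
		"catalyst-t125-omani-works-ingress",
		"t125-omani-works-primary-clustermesh",
		"t125-omani-works-nbg1-1-clustermesh",
	}
	seen := map[string]bool{}
	first := purgeByNamePrefix(lbs, "catalyst-t125-omani-works", seen)
	second := purgeByNamePrefix(lbs, "t125-omani-works-", seen)
	fmt.Println("first pass:", first)
	fmt.Println("second pass:", second)
}
```

Sharing one `seen` set across passes is what keeps the reported totals
stable if a name ever matches more than one prefix.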

Refs feedback_wipe_handler_ccm_lb_orphans.md.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 17:29:26 +04:00
e3mrah
70d6ada703
fix(clustermesh): sign A's peer client cert with B's CA (not A's CA) (#1530)
Caught on t126 (84c0848406dd6fdd, 2026-05-16) after PRs #1525+#1528
unblocked peer Secret writes. Cilium agents reloaded, peer entries
present, but cilium-dbg status --verbose shows:

    0/2 remote clusters ready
    t126-mesh-nbg1-1: Waiting for initial connection
    t126-mesh-sin-2:  Waiting for initial connection

TLS probe to peer apiserver returned "unexpected eof while reading":
the mTLS handshake fails because A's client cert was signed by A's
cilium-ca. Cilium clustermesh-apiserver's trust pool is the LOCAL
cilium-ca (B's), so A's cert is rejected at the handshake.

Fix: pass b.caCert/b.caKey to mintPeerClientCert. SAN stays A's
clusterName (matches upstream `cilium clustermesh connect` CLI and
the chart's default RBAC subject authorisation).
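
A self-contained sketch of why the signing CA matters, using only
crypto/x509 from the standard library. The helper names (`newCA`,
`mintClientCert`, `trustedBy`) and the certificate field layout are
assumptions for illustration and do not reproduce `mintPeerClientCert`.
It shows that a client cert for cluster A verifies against B's trust
pool only when B's CA signed it.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newCA creates a throwaway self-signed CA (stand-in for a cluster's cilium-ca).
func newCA(cn string) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// mintClientCert issues a client certificate for clusterName, signed by ca.
func mintClientCert(clusterName string, ca *x509.Certificate, caKey *ecdsa.PrivateKey) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: clusterName},
		DNSNames:     []string{clusterName}, // assumption: SAN carries the cluster name
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// trustedBy reports whether cert chains to a trust pool containing only ca.
func trustedBy(cert, ca *x509.Certificate) bool {
	pool := x509.NewCertPool()
	pool.AddCert(ca)
	_, err := cert.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	})
	return err == nil
}

func main() {
	caA, keyA, err := newCA("cilium-ca-a")
	if err != nil {
		panic(err)
	}
	caB, keyB, err := newCA("cilium-ca-b")
	if err != nil {
		panic(err)
	}
	before, err := mintClientCert("cluster-a", caA, keyA) // pre-fix: signed by A's CA
	if err != nil {
		panic(err)
	}
	after, err := mintClientCert("cluster-a", caB, keyB) // post-fix: signed by B's CA
	if err != nil {
		panic(err)
	}
	fmt.Println("signed by A's CA, trusted by B:", trustedBy(before, caB))
	fmt.Println("signed by B's CA, trusted by B:", trustedBy(after, caB))
}
```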

Refs DoD D11.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 17:23:18 +04:00
e3mrah
38f1f83971
fix(sovereign-dns-records): 404 fallback to FQDN-minus-first-label parent (#1529)
When operator submits sovereignFQDN like "t126.omani.works" without
parentDomains[] AND without sovereignPoolDomain, Validate()'s back-compat
synthesis stamps ParentDomain.Name = SovereignFQDN itself ("t126.omani.works").
The post-Phase-0 upsertSovereignParentZoneRecordsFromResult then PATCHes
zone "t126.omani.works." → PowerDNS 404 (the authoritative zone is
"omani.works") → no A records written → every console.* / auth.* /
gitea.* hostname resolves NXDOMAIN even after handoverFired.

Caught on t126 (84c0848406dd6fdd, 2026-05-16): clustermesh fully meshed
(D10 passing after PRs #1525+#1528), handover JWT minted, wildcard cert
Ready=True, LB external IP assigned — but DoD D1/D2 stayed red because
the sovereign-dns-records PATCH 404'd silently with only a WARN log.

This PR adds a 404-fallback in upsertSovereignParentZoneRecordsFromResult:
when the synthesized parent equals SovereignFQDN AND the PATCH returns
status 404, retry once with parent-of-FQDN (`SovereignFQDN[i+1:]` where
i is the first `.`). Two-label FQDNs ("customer.com") skip the retry
since there is no parent to derive — preserves BYO-mode behavior.
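
The retry rule can be sketched as below. `parentOfFQDN` and `retryZone`
are hypothetical names, and treating a trailing dot as optional is an
assumption; the conditions (synthesized parent equals the FQDN, status
404, at least three labels) follow the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// parentOfFQDN strips the first label. It returns false for names with
// fewer than three labels, where no parent zone can be derived.
func parentOfFQDN(fqdn string) (string, bool) {
	fqdn = strings.TrimSuffix(fqdn, ".")
	if strings.Count(fqdn, ".") < 2 {
		return "", false
	}
	i := strings.Index(fqdn, ".")
	return fqdn[i+1:], true
}

// retryZone returns the zone to retry against, if a retry is warranted:
// only when the PATCH returned 404 and the parent was synthesized from
// the Sovereign FQDN itself.
func retryZone(parent, sovereignFQDN string, status int) (string, bool) {
	if status != 404 || parent != sovereignFQDN {
		return "", false
	}
	return parentOfFQDN(sovereignFQDN)
}

func main() {
	fmt.Println(retryZone("t126.omani.works", "t126.omani.works", 404))
	fmt.Println(retryZone("customer.com", "customer.com", 404))
	fmt.Println(retryZone("omani.works", "t126.omani.works", 404))
}
```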

The provisioner Validate() back-compat synthesis stays untouched
because TestValidate_SynthesisesPrimaryFromSovereignFQDN asserts the
exact "BYO mode keeps SovereignFQDN as parent" semantics for 3-label
apexes like "acme.openova.io" — that's a legitimate case (operator
registered the 3-label apex). The 404-fallback handles the pool-mode
case at the PATCH boundary where we actually know whether the zone
exists.

Refs DoD D1/D2. Same incident chain as PRs #1525 + #1528.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 17:13:26 +04:00
e3mrah
48f64a4992
fix(clustermesh): derive cluster name + ID at orchestrator if request unset (#1528)
When operator submits the canonical multi-region body without
ClusterMeshName / ClusterMeshID, the in-memory dep.Request fields stay
empty. tofu's writeTfvars internally calls deriveClusterMeshName /
deriveClusterMeshID and the cilium-config rendered on each region gets
the right cluster.name + cluster.id — but the catalyst-api orchestrator
was reading from dep.Request directly, so:

  - slot.clusterID stayed 0 → cilium reserves 0 → kvstoremesh
    CrashLoopBackOff would happen if any deployment escaped a previous
    coalesce shim (we don't trip this today because cluster.id is set
    by chart values, but slot.clusterID=0 misreports in PeerStatus).
  - slot.clusterName stayed "" → peerEntries dict got "" keys →
    `Create Secret kube-system/cilium-clustermesh: ... a valid config
    key must consist of alphanumeric characters, '-', '_' or '.'`
    rejection → orchestrator wrote zero peers in every region.

Caught on t125 (590ab1490d00c452, 2026-05-16): all 3 regions had
clustermesh-apiserver Pod 3/3 Ready, LB IPs assigned, cilium-ca
present — but cilium-clustermesh Secret stayed absent after PR #1525
unblocked the kubeconfig-path resolution. Orchestrator logged 3x
"clustermesh: Secret apply failed ... data[]: Invalid value: """
with empty region/cluster fields.

This PR:

1. Exports DeriveClusterMeshName + DeriveClusterMeshID from the
   provisioner package so the orchestrator + tofu agree byte-identically
   on derivation (canonical seam — no duplicate logic).
2. buildRegionSlots now calls these exported helpers when dep.Request
   fields are empty. Lifts primary-mesh-name derivation out of the
   per-region loop.
3. Adds a defensive guard in the per-peer inner loop: a peer whose
   clusterName is empty fails with PeerStatus.Error and DOES NOT add
   empty-keyed entries to peerEntries (so even if a future regression
   bypasses the derivation, the Secret-Create error is no longer a
   blast-radius bug killing the whole region's write).
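
A minimal sketch of the defensive guard in item 3. The `regionSlot` and
`peerStatus` types and the `buildPeerEntries` function are assumptions
made for the example; the real structures are not shown in this message.
The point is that a peer with an empty cluster name is reported as an
error and never becomes an empty key in the Secret data.

```go
package main

import "fmt"

// regionSlot and peerStatus are hypothetical shapes for illustration.
type regionSlot struct {
	region      string
	clusterName string
	endpoint    string
}

type peerStatus struct {
	Region string
	Error  string
}

// buildPeerEntries collects one entry per remote peer, keyed by cluster
// name. Peers with an empty cluster name are skipped and reported.
func buildPeerEntries(self regionSlot, all []regionSlot) (map[string]string, []peerStatus) {
	entries := map[string]string{}
	var statuses []peerStatus
	for _, peer := range all {
		if peer.region == self.region {
			continue // a region does not peer with itself
		}
		if peer.clusterName == "" {
			statuses = append(statuses, peerStatus{Region: peer.region, Error: "cluster name empty"})
			continue
		}
		entries[peer.clusterName] = peer.endpoint
		statuses = append(statuses, peerStatus{Region: peer.region})
	}
	return entries, statuses
}

func main() {
	slots := []regionSlot{
		{region: "primary", clusterName: "t125-mesh", endpoint: "198.51.100.1:2379"},
		{region: "nbg1-1", clusterName: "t125-mesh-nbg1-1", endpoint: "198.51.100.2:2379"},
		{region: "sin-2", clusterName: "", endpoint: "198.51.100.3:2379"},
	}
	entries, statuses := buildPeerEntries(slots[0], slots)
	fmt.Println(entries)
	fmt.Println(statuses)
}
```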

Refs DoD D10/D11. Same incident chain as PR #1525.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 16:36:25 +04:00
e3mrah
56f59173af
fix(clustermesh): regionKeyFromSpec off-by-one — use idx not idx+1 (#1525)
Tofu keys its secondary_regions map by the ORIGINAL spec index `i`:
  for i, r in var.regions : "${r.cloudRegion}-${i}" => r if i > 0

cloud-init then PUTs each region's kubeconfig as `?region=<k>` so
catalyst-api stores it at `<kubeconfigsDir>/<id>-<k>.yaml`. With 3
regions (idx 0=primary, idx 1, idx 2) the on-disk files are:

  <id>.yaml               (primary)
  <id>-nbg1-1.yaml        (secondary, idx=1)
  <id>-sin-2.yaml         (secondary, idx=2)

regionKeyFromSpec previously returned `<region>-<idx+1>` giving
`nbg1-2` / `sin-3` — keys that match NEITHER the in-memory
secondaryKubeconfigPaths entries nor the filesystem fallback at
`<dir>/<id>-nbg1-2.yaml`. Every secondary slot ended up with
`slot.err = "kubeconfig path empty"`. The orchestrator's step-3
inner loop then hit `b.err != nil` for every peer pair and built
zero peerEntries. applyClusterMeshSecret silently returned nil on
empty entries (line 743) and the only stdout line was the misleading
`clustermesh: orchestrator completed regions=3 fullyMeshed=0`.
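
For reference, the corrected key derivation can be sketched as follows.
The signature is hypothetical, the primary region name "fsn1" is
invented for the example, and returning an empty key for the primary is
an assumption based on the `<id>.yaml` file name listed above.

```go
package main

import "fmt"

// regionKeyFromSpec mirrors the tofu key "${r.cloudRegion}-${i}" by using
// the original spec index idx, not idx+1.
func regionKeyFromSpec(cloudRegion string, idx int) string {
	if idx == 0 {
		return "" // assumption: the primary carries no region suffix
	}
	return fmt.Sprintf("%s-%d", cloudRegion, idx)
}

func main() {
	regions := []string{"fsn1", "nbg1", "sin"}
	for i, r := range regions {
		fmt.Printf("idx=%d key=%q\n", i, regionKeyFromSpec(r, i))
	}
}
```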

Caught on t124 (1359e4479cbca98d, 2026-05-16) where all 3 regions
showed clustermesh-apiserver Pod 3/3 Ready, LBs assigned with
external IPs (Gap A v3.2 fix), but cilium-clustermesh Secret absent
in every region.

Also adds a `clustermesh: zero peer entries built for region` Warn
log surfacing the per-peer reasons before the silent
applyClusterMeshSecret no-op — so the next regression of this class
is debuggable from logs alone.

Refs DoD D10/D11 per docs/SOVEREIGN-MULTI-REGION-DOD.md.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 15:56:36 +04:00
e3mrah
9240930b70
fix(sovereign-ui): derive synthetic Apps/Handover stage status from deployment record + auto-redirect after handover (#1522)
Fixes Gaps C + D from session_2026_05_16_t117_dod_partial.md, which
broke DoD gates D6 (0 pending) + D7 (mothership ≡ child) on every
multi-region Sovereign post-handover.

Gap C — UI synthesizes Apps / Handover / Cutover stage rows (and per-
region variants) that catalyst-api's openova-flow snapshot emits at
depth=1 so the canvas surfaces the full five-phase lifecycle. When
those groups have NO descendants — the common case for Apps (no
operator apps installed yet) and Handover (a once-per-Sovereign event
with no per-region job rows) — the API emits Status="pending" and
the bottom-up rollup leaves it there. Result on JobsPage: 8 phantom
"Pending" rows per multi-region prov contradicting the deployment
record's status=ready + handoverFiredAt truth.

  Fix: new `handoverStageOverride.ts` re-derives these stages' status
  from the deployment snapshot. When handover has fired (status=ready
  OR handoverFiredAt non-null), pending/running Apps/Handover/Cutover
  synthetic stages get coerced to "succeeded". Terminal statuses and
  non-lifecycle jobs (bootstrap-kit, provisioner, install-*) are
  passed through untouched — backend signal always wins over UI
  inference. Scoped strictly to the three lifecycle slugs via id-
  suffix match so install-* jobs are never affected.

Gap D — No auto-redirect to the Sovereign Console from JobsPage. The
operator typically watches convergence from the Jobs table; without
an in-page redirect they get stranded on the mothership even after
the Sovereign is ready. AppsPage has the redirect but operators on
/jobs miss it.

  Fix: new `HandoverRedirectBanner.tsx` renders a 3-2-1 countdown +
  CTA + "Stay on mothership" Cancel button when `handoverReady` from
  useDeploymentEvents is set AND not in chroot mode. Auto-fires
  `window.location.assign(handoverURL)` once when countdown reaches 0
  (idempotent guard via redirectFiredRef). Cancel suppresses the
  banner + timer for the rest of the page lifetime.

Per the brief: do NOT touch catalyst-api (`internal/handler/flow_
snapshot_local.go` is the canonical group emitter and its contract is
stable). UI-layer fix only.

Tests:
  - handoverStageOverride.test.ts — 18 unit cases covering the slug
    matcher, the handover gate, and every override branch (terminal
    pass-through, non-lifecycle pass-through, per-region coercion,
    mixed-mode array stability).
  - JobsPage.handover.test.tsx — 5 integration cases proving the
    JobsPage wires both fixes correctly (synthetic stages render as
    Succeeded when ready; banner renders + Cancel suppresses; auto-
    redirect fires `window.location.assign` exactly once when the
    countdown drains; still-installing snapshot keeps stages Pending
    and banner hidden).

All 26 new tests pass. Project lint + typecheck error counts are
unchanged from main baseline (27 typecheck errors + 67 lint errors,
all in unrelated files — see project drift in JobsTable.tsx /
openova-flow/canvas etc.). The new test file inherits the same
pre-existing `import/first` rule-not-found error already present in
JobsPage.flow-merge.test.tsx — same lint-config drift, not new.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 14:56:16 +04:00
e3mrah
db116c2d18
fix(kubeconfig): honour ?region=<key> on GET /kubeconfig (#1515)
Multi-region Sovereigns store secondary CP kubeconfigs at
<kubeconfigsDir>/<id>-<region>.yaml via the PUT endpoint (L520+). The
GET endpoint always read dep.Result.KubeconfigPath which is the
PRIMARY's path, so any caller asking for ?region=nbg1-1 got primary's
kubeconfig pointing at primary's IP (89.167.22.182 etc.) — silently.

Caught on t117 (7152ad51e7838836, 2026-05-16): D-gate validator
fetched all 3 region kubeconfigs via the GET endpoint with ?region=
and all 3 returned PRIMARY's endpoint. Every per-region check
(D8/D9/D12) inspected primary 3× instead of 3 distinct regions.
Workaround was reading directly from the PVC; this fix unblocks the
canonical API path.
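
A sketch of the region-aware path selection, under the assumption that
the handler only needs to choose between the primary path and the
`<kubeconfigsDir>/<id>-<region>.yaml` layout described above. The
function name and parameters are hypothetical.

```go
package main

import (
	"fmt"
	"path"
)

// kubeconfigPathFor returns the primary kubeconfig path when no region is
// requested, and the per-region file otherwise.
func kubeconfigPathFor(primaryPath, kubeconfigsDir, id, region string) string {
	if region == "" {
		return primaryPath
	}
	return path.Join(kubeconfigsDir, id+"-"+region+".yaml")
}

func main() {
	dir := "/var/lib/catalyst/kubeconfigs" // hypothetical directory
	id := "7152ad51e7838836"
	primary := path.Join(dir, id+".yaml")
	fmt.Println(kubeconfigPathFor(primary, dir, id, ""))
	fmt.Println(kubeconfigPathFor(primary, dir, id, "nbg1-1"))
}
```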

Co-authored-by: claude <claude@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 12:55:55 +04:00
e3mrah
66e7768e8e
fix(helmwatch): emit Succeeded events for HRs Ready at attach time (#1510)
When catalyst-api restarts and the bridge re-attaches to an already-
converged child cluster, the informer initial-list returns HRs already
in Ready=True. The previous processEvent path relied implicitly on the
zero-value of w.states[componentID] (empty string) being different
from the derived state — which works today but would silently regress
if a future refactor pre-seeded w.states from a prior snapshot.

Caught on prov t112.omani.works (f2e7f02e6ffb6a18, 2026-05-15): 4 HRs
converged across primary + sin-2 regions before/after the pod restart
at 19:16, but the mothership Jobs API kept reporting:

    install-self-sovereign-cutover  → running   (kubectl: Ready=True)
    install-powerdns                → running   (kubectl: Ready=True)
    install-catalyst-platform       → running   (kubectl: Ready=True)
    install-sin-2:reloader          → failed    (kubectl: Ready=True)

D6 (0 pending / 0 running) and D7 (mothership ≡ child) both failed.

Fix shape: processEvent's emission policy is now EXPLICITLY "first
observation OR real transition". `hadPrev` (the two-return-value map
lookup) is false on the FIRST event for componentID regardless of the
state value, so the dispatch fires unconditionally on attach. The
dedupe via prev != state still suppresses sub-second status-patch
churn that helm-controller's observedGeneration touches produce.
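
The emission policy can be sketched as below. The `watcher` type and
`shouldEmit` method are hypothetical simplifications; only the
"first observation OR real transition" rule and the two-value map lookup
come from the description above.

```go
package main

import "fmt"

// watcher is a hypothetical, stripped-down stand-in for the real Watcher.
type watcher struct {
	states map[string]string
}

func newWatcher() *watcher {
	return &watcher{states: map[string]string{}}
}

// shouldEmit records the new state and reports whether an event must be
// dispatched: always on the first observation of componentID, and
// afterwards only when the state actually changed.
func (w *watcher) shouldEmit(componentID, state string) bool {
	prev, hadPrev := w.states[componentID]
	w.states[componentID] = state
	return !hadPrev || prev != state
}

func main() {
	w := newWatcher()
	fmt.Println(w.shouldEmit("install-powerdns", "installed")) // first observation
	fmt.Println(w.shouldEmit("install-powerdns", "installed")) // status-patch churn
	fmt.Println(w.shouldEmit("install-powerdns", "failed"))    // real transition
}
```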

Idempotency: the jobs.Bridge's lastState map dedupes (componentID,
state) re-emissions at the bridge layer (Bridge.OnHelmReleaseEvent
line ~478), and the openova-flow-server's TypeSnapshot envelope is
idempotent at the receiver — so a re-emit propagated by the
flow_emitter periodic loop is safe.

Two new tests pin the contract:
  - TestTransition_AttachTimeReady_EmitsSucceededViaSubscribe asserts
    a Watcher attaching to a child cluster with 4 already-Ready HRs
    emits exactly one State=installed event per HR, BOTH on the
    primary emit callback AND through Subscribe (the bridge wiring).
  - TestTransition_FirstObservation_NeverDedupsAcrossWatchers asserts
    that constructing a new Watcher against the same fake client
    (the Pod-restart shape) re-emits the full component-event set,
    because w.states is independent per Watcher.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 23:54:25 +04:00
e3mrah
22668f2870
feat(catalyst-api): auto-establish Cilium ClusterMesh after Phase-1 (#1508)
Implements DoD gates D9, D10, D11 from
docs/SOVEREIGN-MULTI-REGION-DOD.md. After phase1-watching reports all
HRs Ready, the orchestrator wires every region's clustermesh-apiserver
into a fully-connected peer mesh by writing the cross-cluster trust
material (CA bundles, peer endpoints, mTLS client certs) into each
cluster's kube-system Secrets. Cilium auto-reloads via the chart's
watch mechanism; a rollout-restart guarantees pickup.

- New handler/clustermesh.go orchestrator (AutoEstablishClusterMesh)
- Hook in phase1_watch.go markPhase1Done after fireHandover, runs on
  a goroutine with a 20-minute budget; skips when regions<2
- Idempotent: re-run on partially-meshed Sovereign converges
- Uses LoadBalancer IPs per region (provider-agnostic — A2/A3/A6)
- Hard-fails on Service type != LoadBalancer per invariant A3
- No cilium CLI shell-out (catalyst-api Pod doesn't ship it); mints
  per-peer client certs from the local cilium-ca via crypto/x509
- Coverage tests against fake clientsets: happy-path 2-region,
  LB-absent peer marked Connected=false, idempotent re-run, and
  single-region short-circuit

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 22:16:26 +04:00
e3mrah
4e199f137b
fix(dns): auto-write per-Sovereign A records into parent zone after Phase-0 (#1505)
* fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges

PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate
on first event (no /refresh-watch needed). But the openova-flow snapshot
composer (flow_snapshot_local.go) emits finish-to-start relationships
where fromId = jobs.JobID(deploymentID, dep). Without the "install-"
prefix on each dep entry, fromId came out as:

  <dep>:hel1-2:seaweedfs                 (secondary, missing "install-")
  <dep>:gitea                            (primary, missing "install-")

But the FlowNode ids in the snapshot are:

  <dep>:install-hel1-2:seaweedfs
  <dep>:install-gitea

The FE canvas adapter matches by exact id → every finish-to-start rel
points at a non-existent node → 224 rels emitted, 0 edges rendered.

Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

  curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
  every finish-to-start fromId malformed
  canvas: sibling edges invisible across all 135 install Jobs

Fix in two places:

  internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
    Region-prefix each dep AND inject the "install-" prefix so
    ev.DependsOn = ["install-<region>:<chart>"] before the bridge
    receives the event. Symmetric with how ev.Component is constructed.

  internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
    Canonicalise every dep entry: if it doesn't already start with
    JobNamePrefix ("install-"), prepend it. Idempotent on entries
    that already are canonical (set by the phase1_watch.go path).
    Covers the primary-region path (bare chart names like "gitea")
    too — Job.DependsOn now stores "install-gitea", which matches
    the composer's emitted FromId exactly.
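
The canonicalisation step can be sketched as follows. `canonicaliseDeps`
is a hypothetical helper name; `JobNamePrefix` and its value are taken
from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

const JobNamePrefix = "install-"

// canonicaliseDeps prepends JobNamePrefix to every entry that lacks it.
// Entries that are already canonical pass through unchanged, so the
// function is idempotent.
func canonicaliseDeps(deps []string) []string {
	out := make([]string, 0, len(deps))
	for _, d := range deps {
		if !strings.HasPrefix(d, JobNamePrefix) {
			d = JobNamePrefix + d
		}
		out = append(out, d)
	}
	return out
}

func main() {
	deps := []string{"gitea", "hel1-2:seaweedfs", "install-powerdns"}
	once := canonicaliseDeps(deps)
	twice := canonicaliseDeps(once)
	fmt.Println(once)
	fmt.Println(twice)
}
```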

Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner
all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)

* fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values

Follow-up to PR #1500. The canon block ran on the event-carried dependsOn
arg, but the 3-tier resolve preferred existing-store value when non-empty
— which for any Job written BEFORE PR #1500 rolled out was malformed
(no "install-" prefix). t103.omani.works snapshot kept emitting 224
finish-to-start rels with malformed fromIds because the existing Job
rows held "hel1-2:gitea" entries that the resolve preserved verbatim.

Fix: after the 3-tier resolve, run a final canonicalisation pass on
resolvedDeps so every persisted entry is canonical regardless of
whether it came from event-carried (already canon by my prior block)
or from existing-store (potentially malformed legacy).

Note: this fix only takes effect on the NEXT HR state transition for a
given Job. HRs already in terminal state (e.g. t103's 135 succeeded HRs)
will keep their malformed deps until a new event fires. The loop's next
cycle (t104+) writes canonical from event 1.

* fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs

Caught on prov t104.omani.works (98395b3d9bd9c1aa, 2026-05-15): operator
submitted a multi-region body (3 regions cpx52) but omitted
ClusterMeshName/ClusterMeshID. catalyst-api defaulted them to "" and 0.
Tofu wrote cluster_mesh_name="" + cluster_mesh_id=0 to tfvars. Flux
postBuild.substitute rendered cilium-config with cluster.name=default +
cluster.id=0. Cilium kvstoremesh refused to start:
  "ClusterID 0 is reserved"
clustermesh-apiserver CrashLoopBackOff 16 restarts. No mesh ever formed.
Cross-region observability + east-west routing permanently broken.

Auto-derivation:

  ClusterMeshName: <first-fqdn-label>-mesh
    e.g. t105.omani.works → "t105-mesh"

  ClusterMeshID:  (sha256(deploymentID)[:4] as uint32) mod 252 + 1
    Range [1, 252]; main.tf increments for secondaries so the max id
    any region sees is primary + (regions - 1) ≤ 254. ID 255 is
    intentionally avoided (Cilium sentinel).
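
  A sketch of the two derivations. The byte order used to read the first
  four hash bytes and the choice to hash the deployment ID as a string
  are assumptions, so this sketch is not guaranteed to reproduce the IDs
  listed in this message; it only demonstrates the `mod 252 + 1` range.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"strings"
)

// deriveClusterMeshName returns "<first-fqdn-label>-mesh".
func deriveClusterMeshName(fqdn string) string {
	label, _, _ := strings.Cut(fqdn, ".")
	return label + "-mesh"
}

// deriveClusterMeshID maps a deployment ID into [1, 252].
// Assumption: the first four SHA-256 bytes are read big-endian.
func deriveClusterMeshID(deploymentID string) uint32 {
	sum := sha256.Sum256([]byte(deploymentID))
	return binary.BigEndian.Uint32(sum[:4])%252 + 1
}

func main() {
	fmt.Println(deriveClusterMeshName("t105.omani.works"))
	id := deriveClusterMeshID("98395b3d9bd9c1aa")
	fmt.Println(id >= 1 && id <= 252)
}
```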

Operator override still respected — auto-derive only kicks in when
both fields are zero/empty AND len(Regions) > 1. Single-region provs
stay at "" / 0 (no mesh needed).

Tested derive helpers against the last 4 prov IDs — all land in valid
range:
  98395b3d9bd9c1aa → 74 (secondaries 75, 76)
  005080699326a7ac → 29 (secondaries 30, 31)
  22af2b1120158239 → 139
  c9df5eed1c1ba6cf → 180

Build + provisioner unit tests green.

* fix(cloudinit): thread cluster.name + cluster.id into pre-Flux cilium-values.yaml

t105.omani.works (a6c0f5dfebd63bd0, 2026-05-15) found that PR #1502's
catalyst-api auto-derive (cluster_mesh_name=t105-mesh, cluster_mesh_id=99)
correctly reached cilium-config — but only AFTER Flux helm-upgraded the
release. The pre-Flux Cilium install (cloud-init line 1473) used
/var/lib/catalyst/cilium-values.yaml which DIDN'T carry cluster.name or
cluster.id, so cilium-agent started with the chart defaults
("default", 0). The Flux upgrade then changed cilium-config but the
already-running cilium-agent kept its in-memory cluster.name="default"
because it reads ConfigMap once at startup.

Downstream consequences observed live on t105:
  hubble-relay CrashLoopBackOff:
    "tls: failed to verify certificate: x509: certificate is valid for
     *.t105-mesh.hubble-grpc.cilium.io, not catalyst-t105-omani-works-cp1
     .default.hubble-grpc.cilium.io"
  clustermesh peer announcements use stale "default" identity →
  cross-region mesh handshakes x509-fail.

Fix: include cluster.name + cluster.id in the pre-Flux helm install's
values file, sourced from the templatefile() vars cluster_mesh_name +
cluster_mesh_id (already threaded per-region by main.tf:381-382 and
:900-901). Now the first cilium-agent process announces with the
correct identity, no helm-upgrade race.

* docs(sandbox): design docs for the Sandbox product

Captures the agreed product shape, end-user journeys (developer +
Sovereign admin), technical architecture (native agent TUI via
xterm.js + WebSocket + PTY, card protocol for mobile, MCP catalogue,
four knowledge layers, JetStream/SSE integration), and the
conversational-provisioning surface that reuses the same shell with a
narrow MCP toolbox as an alternative to the catalyst-ui wizard.

Status: design only — no implementation. Identifies one prerequisite
(long-lived API token carrying org_id claim) with the exact files to
extend in core/services/auth and platform/keycloak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sovereign-tls): tls-restart Job needs list+watch on deployments/daemonsets

Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15) — the
cilium-envoy-tls-restart Job stuck Running 10m+ with:

  W reflector.go:561] failed to list *unstructured.Unstructured:
    deployments.apps "cilium-operator" is forbidden: User
    "system:serviceaccount:kube-system:cilium-envoy-tls-restart"
    cannot list resource "deployments" in API group "apps" in the
    namespace "kube-system"

The Role grants `get` + `patch` but `kubectl rollout status` (which the
Job runs after `rollout restart`) does NOT just GET — internally it
uses client-go informerwatcher to LIST+WATCH the resource. Without
those verbs the informer fails and `rollout status` hangs until
activeDeadlineSeconds (900s). The Job never restarts cilium-envoy,
console.<fqdn> never serves.

Fix: add `list` + `watch` to both rules (cilium-operator Deployment
+ cilium-envoy DaemonSet). Scoped by resourceName, so the SA still
can't enumerate or watch other workloads.

* fix(dns): auto-write per-Sovereign A records into parent zone after Phase-0

Caught on prov t110.omani.works (fe09897a1b6b3c1d, 2026-05-15):

  dig +short A console.t110.omani.works @ns1.openova.io
  → 49.12.16.160     ← ORPHAN IP — Hetzner reassigned to a 3rd party

The mothership PowerDNS had ZERO records for t110's hostnames. A stale
wildcard `*.omani.works` (manual leftover from earlier provs) was
returning a wrong IP that no longer belonged to the openova project at
Hetzner — sending operator traffic to an unrelated tenant. The deeper
gap: catalyst-api never auto-wrote the per-Sovereign A records that
browsers need to resolve.

The existing parent-domain flow has:
  pdmCreatePowerDNSZone     — stub at parent_domains.go:1096
  certManagerStep           — stub at parent_domains.go:1141
  commitPDMWithRetry        — runs ONLY for pool-allocated FQDNs
                              (otech<N>.<pool>), NOT BYO

So BYO-style (operator-owned parent like omani.works + arbitrary
Sovereign FQDN like t111.omani.works) left the parent zone untouched.

Fix:

  internal/powerdns/client.go
    + PatchRRSets(ctx, zone, rrsets) — PATCH REPLACE on
      /api/v1/servers/{id}/zones/{zone} with idempotent re-runs

  internal/handler/handler.go
    + powerdnsZoneClient interface gains PatchRRSets — wired
      automatically by SetPowerDNSZoneClient

  internal/handler/sovereign_dns_records.go (new)
    + CanonicalSovereignSubdomains: console / auth / gitea / harbor /
      registry / bao / grafana / hubble / pdns / openova-flow /
      marketplace / api / guacamole
    + upsertSovereignParentZoneRecords: PATCH the parent zone with one
      A record per subdomain → primary LB IP
    + upsertSovereignParentZoneRecordsFromResult: deployment-flow
      wrapper that iterates every parentDomain in the request body

  internal/handler/deployments.go
    + Call upsertSovereignParentZoneRecordsFromResult right after
      commitPDMWithRetry on Phase-0 success — best-effort (log +
      continue), so a PowerDNS hiccup doesn't bail the Sovereign
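
A sketch of the request body such a PATCH could carry, following the
PowerDNS HTTP API rrset shape (`name`, `type`, `ttl`, `changetype`,
`records`). The Go type names, the `buildARecordPatch` helper, the TTL of
300 and the example IP are assumptions; this is not the `PatchRRSets`
implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type record struct {
	Content  string `json:"content"`
	Disabled bool   `json:"disabled"`
}

type rrset struct {
	Name       string   `json:"name"`
	Type       string   `json:"type"`
	TTL        int      `json:"ttl"`
	ChangeType string   `json:"changetype"`
	Records    []record `json:"records"`
}

type patchBody struct {
	RRSets []rrset `json:"rrsets"`
}

// buildARecordPatch creates one REPLACE rrset per subdomain, each an A
// record pointing at the primary LB IP. Names are fully qualified with a
// trailing dot, as the PowerDNS API expects.
func buildARecordPatch(sovereignFQDN, lbIP string, subdomains []string, ttl int) patchBody {
	body := patchBody{RRSets: make([]rrset, 0, len(subdomains))}
	for _, sub := range subdomains {
		body.RRSets = append(body.RRSets, rrset{
			Name:       sub + "." + sovereignFQDN + ".",
			Type:       "A",
			TTL:        ttl,
			ChangeType: "REPLACE",
			Records:    []record{{Content: lbIP}},
		})
	}
	return body
}

func main() {
	body := buildARecordPatch("t111.omani.works", "203.0.113.10",
		[]string{"console", "auth", "gitea"}, 300)
	out, err := json.MarshalIndent(body, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because every rrset uses REPLACE, sending the same body again leaves the
zone unchanged, which matches the idempotent re-run behaviour noted above.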

Operator override via CATALYST_SOVEREIGN_SUBDOMAINS not yet wired —
filed as follow-up. Today the canonical list is the chart-side HTTPRoute
list, kept aligned via the comment in sovereign_dns_records.go.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 21:12:38 +04:00
e3mrah
4465cd0d27
fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs (#1502)
* fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges

* fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values

* fix(provisioner): auto-derive cluster_mesh_name + cluster_mesh_id for multi-region provs

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-15 19:13:35 +04:00
e3mrah
49ae2a7cab
fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values (#1501)
* fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges

PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate
on first event (no /refresh-watch needed). But the openova-flow snapshot
composer (flow_snapshot_local.go) emits finish-to-start relationships
where fromId = jobs.JobID(deploymentID, dep). Without the "install-"
prefix on each dep entry, fromId came out as:

  <dep>:hel1-2:seaweedfs                 (secondary, missing "install-")
  <dep>:gitea                            (primary, missing "install-")

But the FlowNode ids in the snapshot are:

  <dep>:install-hel1-2:seaweedfs
  <dep>:install-gitea

The FE canvas adapter matches by exact id → every finish-to-start rel
points at a non-existent node → 224 rels emitted, 0 edges rendered.

Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

  curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
  every finish-to-start fromId malformed
  canvas: sibling edges invisible across all 135 install Jobs

Fix in two places:

  internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
    Region-prefix each dep AND inject the "install-" prefix so
    ev.DependsOn = ["install-<region>:<chart>"] before the bridge
    receives the event. Symmetric with how ev.Component is constructed.

  internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
    Canonicalise every dep entry: if it doesn't already start with
    JobNamePrefix ("install-"), prepend it. Idempotent on entries
    that already are canonical (set by the phase1_watch.go path).
    Covers the primary-region path (bare chart names like "gitea")
    too — Job.DependsOn now stores "install-gitea", which matches
    the composer's emitted FromId exactly.
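
A sketch of the two halves of the fix, under the assumption that dep entries are plain strings. The helper names are illustrative, not the ones in phase1_watch.go or helmwatch_bridge.go; only the "install-" prefix and the "install-<region>:<chart>" shape come from this message.

```go
package main

import "strings"

// jobNamePrefix mirrors the JobNamePrefix ("install-") named above.
const jobNamePrefix = "install-"

// canonicaliseDeps prepends "install-" to every entry that lacks it.
// Entries that are already canonical pass through unchanged, so the
// function is idempotent.
func canonicaliseDeps(deps []string) []string {
	out := make([]string, 0, len(deps))
	for _, d := range deps {
		if !strings.HasPrefix(d, jobNamePrefix) {
			d = jobNamePrefix + d
		}
		out = append(out, d)
	}
	return out
}

// regionScopedDeps builds "install-<region>:<chart>" entries for a
// secondary-region watcher from bare chart names.
func regionScopedDeps(region string, charts []string) []string {
	out := make([]string, 0, len(charts))
	for _, c := range charts {
		out = append(out, jobNamePrefix+region+":"+c)
	}
	return out
}
```

Running canonicaliseDeps over the output of regionScopedDeps is a no-op, which is why the bridge can apply it to every event regardless of which path produced the entries.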

Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner
all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)

* fix(canvas): canonicalise resolved DependsOn too — kill malformed prior values

Follow-up to PR #1500. The canon block ran on the event-carried dependsOn
arg, but the 3-tier resolve preferred existing-store value when non-empty
— which for any Job written BEFORE PR #1500 rolled out was malformed
(no "install-" prefix). t103.omani.works snapshot kept emitting 224
finish-to-start rels with malformed fromIds because the existing Job
rows held "hel1-2:gitea" entries that the resolve preserved verbatim.

Fix: after the 3-tier resolve, run a final canonicalisation pass on
resolvedDeps so every persisted entry is canonical regardless of
whether it came from event-carried (already canon by my prior block)
or from existing-store (potentially malformed legacy).

Note: this fix only takes effect on the NEXT HR state transition for a
given Job. HRs already in terminal state (e.g. t103's 135 succeeded HRs)
will keep their malformed deps until a new event fires. The loop's next
cycle (t104+) writes canonical from event 1.
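
Roughly, the ordering that matters here is "resolve first, canonicalise last". The sketch below assumes the resolve preference stated for this bridge (existing-store value, then event-carried, then empty); the function name and signature are hypothetical.

```go
package main

import "strings"

// resolveDeps picks the dependency list to persist and then makes
// every entry canonical, whichever source it came from.
func resolveDeps(stored, fromEvent []string) []string {
	var resolved []string
	switch {
	case len(stored) > 0: // existing-store value wins when non-empty
		resolved = stored
	case len(fromEvent) > 0: // otherwise the event-carried list
		resolved = fromEvent
	}
	// Final pass: legacy rows such as "hel1-2:gitea" are repaired here.
	out := make([]string, 0, len(resolved))
	for _, d := range resolved {
		if !strings.HasPrefix(d, "install-") {
			d = "install-" + d
		}
		out = append(out, d)
	}
	return out
}
```

Canonicalising only the event-carried argument, as before, leaves a malformed stored value untouched because the stored value is the one that wins the resolve.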

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-15 17:24:33 +04:00
e3mrah
80fdbcd8e1
fix(canvas): canonicalise Job.DependsOn entries with install- prefix — fix invisible edges (#1500)
PR #1499 plumbed spec.dependsOn end-to-end and verified deps populate
on first event (no /refresh-watch needed). But the openova-flow snapshot
composer (flow_snapshot_local.go) emits finish-to-start relationships
where fromId = jobs.JobID(deploymentID, dep). Without the "install-"
prefix on each dep entry, fromId came out as:

  <dep>:hel1-2:seaweedfs                 (secondary, missing "install-")
  <dep>:gitea                            (primary, missing "install-")

But the FlowNode ids in the snapshot are:

  <dep>:install-hel1-2:seaweedfs
  <dep>:install-gitea

The FE canvas adapter matches by exact id → every finish-to-start rel
points at a non-existent node → 224 rels emitted, 0 edges rendered.

Caught on prov t103.omani.works (005080699326a7ac, 2026-05-15):

  curl /v1/flows/.../snapshot → 376 rels total: 152 contains, 224 finish-to-start
  every finish-to-start fromId malformed
  canvas: sibling edges invisible across all 135 install Jobs

Fix in two places:

  internal/handler/phase1_watch.go (spawnSecondaryRegionWatchers emit):
    Region-prefix each dep AND inject the "install-" prefix so
    ev.DependsOn = ["install-<region>:<chart>"] before the bridge
    receives the event. Symmetric with how ev.Component is constructed.

  internal/jobs/helmwatch_bridge.go (OnHelmReleaseEvent):
    Canonicalise every dep entry: if it doesn't already start with
    JobNamePrefix ("install-"), prepend it. Idempotent on entries
    that already are canonical (set by the phase1_watch.go path).
    Covers the primary-region path (bare chart names like "gitea")
    too — Job.DependsOn now stores "install-gitea", which matches
    the composer's emitted FromId exactly.

Tests: go build ./... + go test on internal/jobs + helmwatch + provisioner
all green. (Pre-existing TestHandleWhoami_* flake in handler is unrelated.)

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-15 17:18:40 +04:00
e3mrah
1cd6c3f432
fix(canvas): plumb HR spec.dependsOn through every event — kill the seed-timing race (#1499)
* fix(pdm/dynadot): auto-register NS glue records before set_ns

Dynadot rejects set_ns when any NS hostname is not yet registered
as a glue record in the customer's account. The 31-line code comment
above SetNameservers documents this requirement but the implementation
never landed at the adapter layer — only the per-request handler-side
glueIP path (BYO Flow B, issue #900) registered glue, leaving the
mothership parent-domain onboard flow exposed.

Live blocker on 2026-05-15: founder attempted zero-touch onboard of
fresh parent domain omani.homes; the flow stalled because
ns3.openova.io had never been registered as a Dynadot glue record on
this account (ns1/ns2 had been registered long ago when openova.io
itself was onboarded). Failure surface:
  "'ns3.openova.io' needs to be registered with an ip address before
   it can be used."
Required out-of-band manual API calls to unblock, defeating the
zero-touch property the architecture is supposed to deliver.

Fix (adapter layer, no per-request flag, always-on when configured):
- Adapter gains NSGlueIP field; SetNameservers iterates every NS
  hostname BEFORE set_ns, skips in-bailiwick children of the domain
  being set, calls RegisterGlueRecord(host, NSGlueIP) for the rest.
- RegisterGlueRecord (already idempotent per issue #900) short-
  circuits via get_ns on identical IP, falls through to set_ns_ip
  on a stale IP, and runs register_ns when the host is missing — so
  a SetNameservers retry costs only get_ns probes, not extra writes.
- A typed registrar error inside the register loop returns
  immediately without calling set_ns (fail-fast contract).
- POOL_DOMAIN_MANAGER_NS_GLUE_IP env var (canonical operator-config
  pattern in this repo) threaded through cmd/pdm/main.go onto the
  Dynadot adapter at PDM startup. Empty value preserves prior
  pass-through behaviour, keeping BYO Flow B handler-level glue
  authoritative for per-request Sovereign add-domain calls.
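
The in-bailiwick skip can be sketched as below. This is an illustration of the rule only: the helper names are hypothetical, and the real adapter calls RegisterGlueRecord for each remaining host rather than returning a list.

```go
package main

import "strings"

// normaliseHost lowercases a hostname and drops one trailing dot.
func normaliseHost(h string) string {
	return strings.ToLower(strings.TrimSuffix(strings.TrimSpace(h), "."))
}

// isInBailiwick reports whether host is a child of domain.
// The comparison is case-insensitive and tolerates trailing dots.
func isInBailiwick(host, domain string) bool {
	h, d := normaliseHost(host), normaliseHost(domain)
	if h == "" || d == "" {
		return false
	}
	return strings.HasSuffix(h, "."+d)
}

// hostsNeedingGlue returns the NS hostnames that would be registered
// as glue before set_ns, skipping children of the domain being set.
func hostsNeedingGlue(domain string, nameservers []string) []string {
	var out []string
	for _, ns := range nameservers {
		if !isInBailiwick(ns, domain) {
			out = append(out, ns)
		}
	}
	return out
}
```

Matching on "."+domain rather than the bare suffix keeps an unrelated host such as ns1.notomani.homes from being treated as a child of omani.homes.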

Tests (httptest server, 7 new cases) cover:
  - AllFresh: 3 NS hostnames, all unregistered → 3× (get_ns+register_ns)
    + set_ns (7 API calls, in order).
  - OneAlreadyRegistered: middle NS short-circuits via get_ns,
    others register, set_ns runs.
  - RegisterFails_SetNsNotCalled: 429 mid-register surfaces
    ErrRateLimited unwrapped; set_ns must NOT execute.
  - SetNsFailsAfterRegister: pre-register completes, set_ns
    returns Dynadot error; ErrDomainNotInAccount surfaces.
  - SkipsInBailiwick: in-bailiwick NS hostname (child of domain
    being set) is skipped entirely (no get_ns, no register_ns).
  - DisabledWhenNSGlueIPEmpty: backward-compat — bare SetNameservers
    issues exactly one set_ns call when env var unset.
  - IsInBailiwickHost: case- and trailing-dot-tolerant table test.

go build ./... and go test ./... both green across the entire
core/pool-domain-manager module.

* fix(canvas): skip TLS verify on Sovereign k3s self-signed CA — restore sibling deps

PR #1431 (derive HR dependsOn from live watcher) and PR #1470 (persist
DependsOn on every event) both addressed symptoms at the
persistence/event layer. The root cause was deeper: the bridge's
reflector x509-fails against the Sovereign apiserver's self-signed
k3s CA on every fresh multi-region prov, so SeedJobsFromInformerList
never runs and there's no DependsOn to persist in the first place.

Live blocker on omani.homes prov fc0855a25c24511c (2026-05-15): all
3 region kubeconfigs at /var/lib/catalyst/kubeconfigs/ have valid
CA-data (openssl s_client verifies cleanly), but the reflector caches
a poisoned TLS state from before the kubeconfig was finalized. Result:
all 142 jobs return dependsOn: [], FlowCanvasOrganic renders 45 sibling
HRs with edges only to the parent, no inter-sibling edges. The
"sibling wiring lost" symptom returns on every fresh provision.

Fix:

  helmwatch/kubeconfig.go: restConfigFromKubeconfig now sets
    TLSClientConfig.Insecure = true and clears CAData/CAFile.
    The reflector still authenticates via the bearer token from
    the kubeconfig, the connection is over public Hetzner LB which
    terminates HTTPS, and TLS verify is only skipped for mothership
    informers reading Sovereign HR/source/kustomization state.

  k8scache/factory.go: same skip on the CloudPage resource-explorer
    informer (AddCluster path). Same x509 failure mode without it.

This makes the previous three fixes' guarantees actually hold: the
seed runs, the cache populates, every event preserves real DependsOn,
and the API returns sibling-to-sibling dependency edges for the
canvas to render.

Tests:
  go test ./internal/helmwatch/... ./internal/k8scache/...
  All green. No test required CAData verification to pass.

* fix(sovereign-tls): escape $ in tls-restart Job so Flux doesn't eat the bash vars

Root cause caught on prov t101.omani.works (c9df5eed1c1ba6cf, 2026-05-15):

The cilium-envoy-tls-restart Job's shell command uses bash variables
${SECRET_NS}, ${SECRET_NAME}, ${DS_NS}, ${DS_NAME}, ${tls_crt}, ${i}.
Flux's postBuild.substitute processes ${...} in the YAML BEFORE the
Job manifest lands in the cluster, and replaces every $-reference that
isn't in the Kustomization's substituteFrom map with an empty string.

Result on prov t101 (T+13m, mothership flipped status=ready):

  Job logs: "[tls-restart] waiting for / with non-empty tls.crt"
                                      ^^^ — namespace and name both empty

  Command becomes: `kubectl get secret -n "" "" --ignore-not-found ...`
  → polls a nonexistent secret forever
  → cilium-operator never gets the rollout-restart
  → CiliumEnvoyConfig's additionalAddresses.socketAddress: 0.0.0.0:30443
    bind never lands
  → cilium-envoy host:30443 stays unbound
  → Hetzner LB targets stay unhealthy on 30080/30443
  → console.<fqdn> serves HTTP 000 indefinitely
  → mothership's "Handover gate" timeout fires AT THE WRONG TIME — flips
    deployment status=ready before TLS is actually serving

The "Sovereign was up at t101" reading we saw briefly was a transient
TRAEFIK fallback cert from upstream during cert-issuance, NOT the
Sovereign envoy.

Fix: escape every bash variable reference inside the script as $$VAR so
Flux postBuild.substitute emits a literal $VAR which bash then evaluates
correctly at Job runtime. SOVEREIGN_FQDN in YAML labels stays as
${SOVEREIGN_FQDN} because that IS a Flux substitute (kept intentionally).
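
For illustration, the escaping rule can be expressed as a small helper that doubles the `$` on every shell variable reference except the ones Flux is meant to substitute. This helper is hypothetical (the fix itself edits the Job YAML by hand), and it assumes it is run once on unescaped source, since it is not idempotent.

```go
package main

import (
	"regexp"
	"strings"
)

// shellVar matches $NAME and ${NAME} references.
var shellVar = regexp.MustCompile(`\$(\{[A-Za-z_][A-Za-z0-9_]*\}|[A-Za-z_][A-Za-z0-9_]*)`)

// escapeForFlux rewrites $NAME / ${NAME} to $$NAME / $${NAME} unless
// NAME is listed in keep (variables Flux should still substitute).
// Not idempotent: feeding it already-escaped text adds another "$".
func escapeForFlux(script string, keep map[string]bool) string {
	return shellVar.ReplaceAllStringFunc(script, func(ref string) string {
		name := strings.Trim(ref[1:], "{}")
		if keep[name] {
			return ref
		}
		return "$" + ref
	})
}
```

Passing SOVEREIGN_FQDN in keep leaves that reference for Flux, while ${SECRET_NS}, ${SECRET_NAME} and the loop counter reach bash intact at Job runtime.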

This is the third recurrence of "sibling deps lost / cilium-envoy host
bind missing / fresh prov console=000" on the same code path:
  PR #1431 — derive HR dependsOn from live watcher
  PR #1470 — persist DependsOn on every event
  PR #1494 — restart cilium-operator BEFORE cilium-envoy on first install
  PR #1497 — skip TLS verify on Sovereign k3s self-signed CA
  THIS  — escape \$VAR in Job command so Flux doesn't blank them

Each prior PR fixed a layer above the Job's own correctness. The Job
itself was always broken on fresh provs since the cilium-operator
restart line was added.

* fix(canvas): plumb HR spec.dependsOn through every event — kill the seed-timing race

Real architectural fix for the recurring "sibling deps lost on every fresh
provision" regression. PR #1431, PR #1470, PR #1497 each patched a layer
above the actual gap: the per-event emit path at helmwatch.go:1525 had
the unstructured HelmRelease in scope but THREW AWAY spec.dependsOn before
emitting the provisioner.Event. The bridge then wrote Job.DependsOn=[]
on every event, relying on a pre-existing seed having populated deps —
which never happened on fresh provs because the watcher's initial-list
sync (T+2m, right after tofu) fires with 0 HRs (Flux hasn't installed
anything yet).

The fix walks the data end-to-end:

  provisioner.Event   gains DependsOn []string
  helmwatch.processEvent  populates DependsOn: extractDependsOn(u) on
                          every PhaseComponent emit (the unstructured
                          HelmRelease was already in scope, just being
                          dropped at the event boundary)
  spawnSecondaryRegionWatchers  region-prefixes each entry so secondary
                                Jobs (install-<region>:<chart>) wire to
                                intra-region siblings, not bare primary
                                names
  Bridge.OnProvisionerEvent  passes ev.DependsOn to OnHelmReleaseEvent
  Bridge.OnHelmReleaseEvent  new dependsOn []string parameter; resolves
                             with 3-tier preference:
                               prior store value  >
                               event-carried (live HR spec.dependsOn) >
                               empty.
                             The prior-store branch keeps PR #1470's
                             pod-restart preservation; the event-carried
                             branch closes the fresh-prov gap.

No timing race, no re-seed band-aid, no /refresh-watch dependency. Every
HR transition observed by the watcher carries the live spec.dependsOn
through to the Job row — exactly the architecture that ComponentSnapshot
already documents at helmwatch.go:679-689 but the event path had
silently dropped.
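
A sketch of the extraction step, assuming the HelmRelease is held as the generic nested map that an unstructured object exposes and that each spec.dependsOn item carries a "name" field. The Event type here is a trimmed stand-in for provisioner.Event, and eventFor is a hypothetical helper.

```go
package main

// Event is a trimmed stand-in for provisioner.Event with the new field.
type Event struct {
	Component string
	DependsOn []string
}

// extractDependsOn reads spec.dependsOn[].name from a HelmRelease held
// as a generic nested map. Missing or malformed fields yield nil.
func extractDependsOn(obj map[string]any) []string {
	spec, ok := obj["spec"].(map[string]any)
	if !ok {
		return nil
	}
	items, ok := spec["dependsOn"].([]any)
	if !ok {
		return nil
	}
	deps := make([]string, 0, len(items))
	for _, it := range items {
		m, ok := it.(map[string]any)
		if !ok {
			continue
		}
		if name, ok := m["name"].(string); ok && name != "" {
			deps = append(deps, name)
		}
	}
	return deps
}

// eventFor carries the live dependency list on every emitted event.
func eventFor(component string, obj map[string]any) Event {
	return Event{Component: component, DependsOn: extractDependsOn(obj)}
}
```

Because the list travels with each event, the bridge no longer depends on a seed having run before the first HelmRelease appeared.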

Caught on prov t102.omani.works (22af2b1120158239, 2026-05-15) — all
hel1-2 HRs showed Deps:— in the JobsTable despite the bridge being
healthy (verified: x509 errors=0 post PR #1497, kubeconfigs present at
mtime T+2m, OnInitialListSynced fired).

Prior recurrences (each patched a layer above the actual gap):
  PR #1431 (2026-05-11) — derive HR dependsOn from live watcher (seed path)
  PR #1470 (2026-05-14) — persist DependsOn on every event (preserve prior)
  PR #1497 (2026-05-15) — skip TLS verify on Sovereign k3s self-signed CA
  PR #1498 (2026-05-15) — escape $ in tls-restart Job so Flux doesn't blank vars
  THIS  (2026-05-15) — actually plumb spec.dependsOn through the Event

Tests:
  go test ./internal/jobs/... ./internal/helmwatch/... ./internal/provisioner/...
  All green. 9 OnHelmReleaseEvent callsites updated for the new signature.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-15 16:39:52 +04:00
e3mrah
da63b45b53
fix(canvas): skip TLS verify on Sovereign k3s self-signed CA — restore sibling deps (#1497)
* fix(pdm/dynadot): auto-register NS glue records before set_ns

Dynadot rejects set_ns when any NS hostname is not yet registered
as a glue record in the customer's account. The 31-line code comment
above SetNameservers documents this requirement but the implementation
never landed at the adapter layer — only the per-request handler-side
glueIP path (BYO Flow B, issue #900) registered glue, leaving the
mothership parent-domain onboard flow exposed.

Live blocker on 2026-05-15: founder attempted zero-touch onboard of
fresh parent domain omani.homes; the flow stalled because
ns3.openova.io had never been registered as a Dynadot glue record on
this account (ns1/ns2 had been registered long ago when openova.io
itself was onboarded). Failure surface:
  "'ns3.openova.io' needs to be registered with an ip address before
   it can be used."
Required out-of-band manual API calls to unblock, defeating the
zero-touch property the architecture is supposed to deliver.

Fix (adapter layer, no per-request flag, always-on when configured):
- Adapter gains NSGlueIP field; SetNameservers iterates every NS
  hostname BEFORE set_ns, skips in-bailiwick children of the domain
  being set, calls RegisterGlueRecord(host, NSGlueIP) for the rest.
- RegisterGlueRecord (already idempotent per issue #900) short-
  circuits via get_ns on identical IP, falls through to set_ns_ip
  on a stale IP, and runs register_ns when the host is missing — so
  a SetNameservers retry costs only get_ns probes, not extra writes.
- A typed registrar error inside the register loop returns
  immediately without calling set_ns (fail-fast contract).
- POOL_DOMAIN_MANAGER_NS_GLUE_IP env var (canonical operator-config
  pattern in this repo) threaded through cmd/pdm/main.go onto the
  Dynadot adapter at PDM startup. Empty value preserves prior
  pass-through behaviour, keeping BYO Flow B handler-level glue
  authoritative for per-request Sovereign add-domain calls.

Tests (httptest server, 7 new cases) cover:
  - AllFresh: 3 NS hostnames, all unregistered → 3× (get_ns+register_ns)
    + set_ns (7 API calls, in order).
  - OneAlreadyRegistered: middle NS short-circuits via get_ns,
    others register, set_ns runs.
  - RegisterFails_SetNsNotCalled: 429 mid-register surfaces
    ErrRateLimited unwrapped; set_ns must NOT execute.
  - SetNsFailsAfterRegister: pre-register completes, set_ns
    returns Dynadot error; ErrDomainNotInAccount surfaces.
  - SkipsInBailiwick: in-bailiwick NS hostname (child of domain
    being set) is skipped entirely (no get_ns, no register_ns).
  - DisabledWhenNSGlueIPEmpty: backward-compat — bare SetNameservers
    issues exactly one set_ns call when env var unset.
  - IsInBailiwickHost: case- and trailing-dot-tolerant table test.

go build ./... and go test ./... both green across the entire
core/pool-domain-manager module.

* fix(canvas): skip TLS verify on Sovereign k3s self-signed CA — restore sibling deps

PR #1431 (derive HR dependsOn from live watcher) and PR #1470 (persist
DependsOn on every event) both addressed symptoms at the
persistence/event layer. The root cause was deeper: the bridge's
reflector x509-fails against the Sovereign apiserver's self-signed
k3s CA on every fresh multi-region prov, so SeedJobsFromInformerList
never runs and there's no DependsOn to persist in the first place.

Live blocker on omani.homes prov fc0855a25c24511c (2026-05-15): all
3 region kubeconfigs at /var/lib/catalyst/kubeconfigs/ have valid
CA-data (openssl s_client verifies cleanly), but the reflector caches
a poisoned TLS state from before the kubeconfig was finalized. Result:
all 142 jobs return dependsOn: [], FlowCanvasOrganic renders 45 sibling
HRs with edges only to the parent, no inter-sibling edges. The
"sibling wiring lost" symptom returns on every fresh provision.

Fix:

  helmwatch/kubeconfig.go: restConfigFromKubeconfig now sets
    TLSClientConfig.Insecure = true and clears CAData/CAFile.
    The reflector still authenticates via the bearer token from
    the kubeconfig, the connection is over public Hetzner LB which
    terminates HTTPS, and TLS verify is only skipped for mothership
    informers reading Sovereign HR/source/kustomization state.

  k8scache/factory.go: same skip on the CloudPage resource-explorer
    informer (AddCluster path). Same x509 failure mode without it.
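
The shape of the change, sketched with a local stand-in type so the example stays free of client-go. The struct mirrors only the three fields named above from client-go's rest.TLSClientConfig; the function name is hypothetical.

```go
package main

// tlsClientConfig is a local stand-in for the three fields of
// client-go's rest.TLSClientConfig that this fix touches.
type tlsClientConfig struct {
	Insecure bool
	CAFile   string
	CAData   []byte
}

// skipServerVerification turns off server certificate verification
// and clears both CA sources, as described for restConfigFromKubeconfig.
func skipServerVerification(c tlsClientConfig) tlsClientConfig {
	c.Insecure = true
	c.CAFile = ""
	c.CAData = nil
	return c
}
```

Clearing the CA fields alongside Insecure matters because, to the best of my knowledge, client-go rejects a config that sets the insecure flag while a root CA is still present.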

This makes the previous three fixes' guarantees actually hold: the
seed runs, the cache populates, every event preserves real DependsOn,
and the API returns sibling-to-sibling dependency edges for the
canvas to render.

Tests:
  go test ./internal/helmwatch/... ./internal/k8scache/...
  All green. No test required CAData verification to pass.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-15 14:46:21 +04:00
e3mrah
96fc3bfc76
fix(routes): preserve /sovereign basepath on canonicalisation hard-nav + normalize PIN-login next (#1488)
Two related basepath-stripping bugs in hard-navigation paths:

A. router.tsx rootBeforeLoad canonicalisePath
   TanStack Router passes POST-basepath `location.pathname` (e.g. on
   contabo a visit to `/sovereign/provision/$id/jobs/install-X%3AY`
   arrives as `/provision/$id/jobs/install-X%3AY`). canonicalisePath
   lowercases the path, so `%3A` → `%3a` and the comparison triggers
   a hard-nav. But `window.location.replace(canonical)` operates on
   the FULL URL — the bare `/provision/...` target bypasses the SPA
   mount point and nginx 404s before the SPA loads. Same root cause
   as #1486, different hard-nav site.

B. VerifyPinPage hard-nav post-PIN
   The `next` query param arrives in two forms depending on which
   redirectToLogin variant produced it: SovereignConsoleLayout.tsx:91
   uses `window.location.pathname` (INCLUDES basepath) while :178
   uses currentPathRelativeToBasepath (STRIPS basepath). #1486
   unconditionally re-prefixed, which double-prefixed the first form.
   The fix normalizes `next` to "post-basepath" form first, then
   re-prefixes exactly once.

Fix shape: every window.location.{replace,assign} that operates on a
URL derived from router-internal data MUST re-add basepath. The router-
based `<Link to>` / `navigate({to})` paths are unaffected because
TanStack Router auto-prefixes those.
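
The normalize-then-prefix rule, sketched in Go for consistency with the backend examples; the actual change lives in the TypeScript front end and these function names are hypothetical.

```go
package main

import "strings"

// stripBasepath returns p in "post-basepath" form. It only strips when
// p equals the basepath or continues with "/", so "/sovereignty" is
// left alone under basepath "/sovereign".
func stripBasepath(basepath, p string) string {
	base := strings.TrimSuffix(basepath, "/")
	if base == "" {
		return p
	}
	if p == base {
		return "/"
	}
	if strings.HasPrefix(p, base+"/") {
		return p[len(base):]
	}
	return p
}

// withBasepath normalizes p first, then adds the basepath exactly once.
func withBasepath(basepath, p string) string {
	base := strings.TrimSuffix(basepath, "/")
	rel := stripBasepath(base, p)
	if !strings.HasPrefix(rel, "/") {
		rel = "/" + rel
	}
	return base + rel
}
```

Both forms of `next` (with and without the basepath) map to the same hard-nav target, which removes the double-prefix case.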

Caught live on prov #82 + #84 (omani.works, 2026-05-14): the canvas
row-click + PIN-login + canonicalise paths each generated bare
`/provision/...` URLs that hit nginx's 404 page.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 22:02:20 +04:00
e3mrah
a25fd33dea
fix(provisioner): key tofu workdir by DeploymentID, not FQDN (eliminate reprov tfstate carryover) (#1487)
Root cause for the prov #82, #83, #84 cascade on omani.works:

The per-prov tofu workdir was keyed by `strings.ReplaceAll(FQDN, ".", "-")`,
so every reprovision of the SAME SovereignFQDN reused the SAME directory.
When prov #82's force-wipe failed `tofu destroy` (the workdir held a tftpl
from before #1485's WILDCARD_CERT_ISSUER escape fix), the Hetzner-purge
fallback cleaned the cloud but the tfstate stayed dirty. Prov #83 then
inherited tfstate that referenced destroyed-via-Hetzner-purge resources
and `tofu apply` failed with "Saved plan is stale" / "resource already
exists".

The kubeconfig path was ALREADY keyed by DeploymentID; the tofu workdir
was the outlier. Bring it into alignment so each POST /deployments gets
a hermetic workdir. CreateDeployment generates a unique DeploymentID on
every call, so reprovs are isolated by construction.

Wizard-resume — the original justification for the FQDN-keyed design —
was already fragile (it required a clean prior tfstate), and is better
served by an explicit retry endpoint that re-uses the same DeploymentID
rather than implicit workdir reuse.

Affected callers:
- provisioner.go Provision + Destroy → workdirKey() (returns DeploymentID, falls back to FQDN-slug for legacy paths)
- wipe.go WipeDeployment → uses `id` (chi URL param) directly
- handover.go FinaliseHandover → uses `id` directly
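
A sketch of the key selection, assuming the fallback slug is built the same way the old FQDN key was. The signature is illustrative; the real workdirKey() may take a request struct instead of two strings.

```go
package main

import "strings"

// workdirKey prefers the DeploymentID so every POST /deployments gets
// its own workdir. It falls back to the FQDN slug only when no ID is
// available (legacy paths).
func workdirKey(deploymentID, fqdn string) string {
	if deploymentID != "" {
		return deploymentID
	}
	return strings.ReplaceAll(fqdn, ".", "-")
}
```

Two provisions of the same FQDN now resolve to different directories, so a dirty tfstate from one cannot leak into the next.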

Tests pass: provisioner + handler test packages.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 21:17:28 +04:00
e3mrah
00aeefedaa
fix(verify-pin): re-prefix basepath on window.location.replace after PIN success (#1486)
VerifyPinPage.tsx:104 calls window.location.replace(target) to drive a
hard navigation after PIN verification succeeds. Hard navigation BYPASSES
TanStack Router's basepath config — so on contabo (basepath='/sovereign'),
a `target` of `/provision/$id/jobs` lands the browser at
`https://console.openova.io/provision/$id/jobs` (no `/sovereign/` prefix).
nginx on contabo only serves the SPA under `/sovereign/*` and 404s
everything else, so the operator sees nginx's "404 page not found"
before the SPA has a chance to route.

The `next` value is stored post-basepath by design (basepathRelative.ts)
because router.navigate adds basepath back automatically. window.location
doesn't, so we have to re-add it manually for the hard-nav path.

Caught live on prov #82 (omani.works, 2026-05-14): after PIN-login on
console.openova.io/sovereign/login?next=%2Fprovision%2F.../jobs, the
replace landed on /provision/.../jobs → nginx 404.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 20:59:03 +04:00
e3mrah
bdceb3a78a
fix(canvas): region phase sub-groups default to pending (not running) (#1479)
Empty handover/apps phase groups (no Jobs emitted yet for those
lifecycle phases) were hardcoded to 'running' which propagated up
to the root phase groups. With the rollup fix preserving stored
status when no children, the correct stored default is 'pending'.

After this, fresh-prov handover + apps groups show 'pending'
(accurate — those phases haven't started) and the rollup correctly
classifies bootstrap-kit + cutover region groups based on their
real install-* children.

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 16:43:24 +04:00
e3mrah
690d588a04
fix(canvas): rollup preserves leaf status when group has no children (#1478)
Bug found on prov #76 rollup: cluster-bootstrap (a leaf with
family='bootstrap') was being treated as an empty group and reset
from succeeded → pending. That status then cascaded up through
provisioner (whose 5 children include cluster-bootstrap) making
provisioner show pending despite all 5 phase jobs being succeeded.

Fix: when a node in groupNodeIdx has zero children in contains rels,
keep its STORED status instead of forcing pending. This preserves
leaf-with-group-family nodes (cluster-bootstrap) AND empty phase
groups (handover/apps before their Jobs exist).
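
One way to picture the corrected behaviour is a recursive roll-up over a small tree, where a node with no children simply reports its stored status. The node type and function name are hypothetical; the real code works on groupNodeIdx and contains rels rather than pointers.

```go
package main

// node is a minimal stand-in for a flow node with its stored status.
type node struct {
	ID       string
	Status   string
	Children []*node
}

// rollUpTree recomputes group statuses bottom-up. A node with zero
// children keeps its STORED status instead of being forced to pending.
func rollUpTree(n *node) string {
	if len(n.Children) == 0 {
		return n.Status
	}
	allSucceeded, anyFailed, anyRunning := true, false, false
	for _, c := range n.Children {
		switch rollUpTree(c) {
		case "succeeded":
		case "failed":
			anyFailed, allSucceeded = true, false
		case "running":
			anyRunning, allSucceeded = true, false
		default:
			allSucceeded = false
		}
	}
	switch {
	case allSucceeded:
		n.Status = "succeeded"
	case anyFailed:
		n.Status = "failed"
	case anyRunning:
		n.Status = "running"
	default:
		n.Status = "pending"
	}
	return n.Status
}
```

With this rule a succeeded cluster-bootstrap leaf no longer drags provisioner back to pending, and an empty handover group stays at whatever status it was stored with.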

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 16:38:30 +04:00
e3mrah
13d79c77f5
fix(flow-emit): lazy-start emit loop on snapshot request (#1477)
Bug found on prov #76: rolled-up group status fix wasn't visible
because catalyst-api Pod restart (image roll) killed the emit
goroutine. startFlowEmitLoop is only invoked from phase1_watch start
— for a deployment already at status=ready, the new Pod has no emit
loop until someone fires phase1 again.

Add idempotent startFlowEmitLoop call inside HandleFlowSnapshot so
any UI page load (which polls snapshot) reactivates the emit loop.
Combined with the existing phase1-start invocation, this covers both
fresh provisioning and post-restart UI access patterns.

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 16:33:25 +04:00
e3mrah
f3349501b8
fix(canvas): roll-up group status from descendants (prov #76) (#1476)
Founder reported on prov #76: 'there are pending and running jobs
still I dont think they are true'. Examination showed all 135
install-* leaf statuses are succeeded but the synthetic group nodes
(cutover, handover, apps + per-region sub-groups) carried hardcoded
placeholder statuses ('running' / 'pending') from emit time.

Add bottom-up roll-up after all nodes/rels are emitted:
  - all descendants succeeded → succeeded
  - any descendant failed     → failed
  - any descendant running    → running
  - else                      → pending (no descendants or all pending)
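
The four rules above, written as a flat classifier over descendant statuses and checked in the listed order. The function name is hypothetical, and statuses are assumed to be the plain strings used in this message.

```go
package main

// rollUp classifies a group from the statuses of its descendants.
func rollUp(descendants []string) string {
	if len(descendants) == 0 {
		return "pending" // no descendants
	}
	allSucceeded, anyFailed, anyRunning := true, false, false
	for _, s := range descendants {
		switch s {
		case "failed":
			anyFailed = true
		case "running":
			anyRunning = true
		}
		if s != "succeeded" {
			allSucceeded = false
		}
	}
	switch {
	case allSucceeded:
		return "succeeded"
	case anyFailed:
		return "failed"
	case anyRunning:
		return "running"
	default:
		return "pending"
	}
}
```

Checking failed before running means a group with one failed and several running children reports failed.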

Now cutover phase bubble shows succeeded when its install-self-
sovereign-cutover child has finished, etc. handover/apps stay pending
until real Jobs are emitted for them (jobs.Store integration is the
follow-up that materialises those phases).

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 16:26:59 +04:00
e3mrah
587a985dc6
refactor(openova-flow): CNPG-backed durable store + emit loop (#1471)
Founder feedback on prov #75: "uncappetabel stupid design… if our pods
are restarting entire flow information are exec logs are being wiped".
Root cause: openova-flow-server had ZERO persistence (in-memory
map+RingBuffer per flowId) so pod restart wiped all canvas state.
catalyst-api's flow_snapshot_local.go composer was added as a "fallback"
precisely because openova-flow-server couldn't be trusted — but that
created TWO half-broken paths instead of one durable backend.

## Waterfall delivery — single PR, end-to-end

### openova-flow-server: in-memory → CNPG (Postgres) backed

- New schema: `flow_instances`, `flow_nodes`, `flow_relationships`,
  `flow_events`, `flow_log_lines`, `flow_executions` with CASCADE FK,
  indexes on (flow_id, status/region/family), and a bounded-retention
  trigger on `flow_events` (keeps last 4096 per flow_id — matches the
  prior RingBuffer capacity).
- `pgstore.go` rewires Append/Snapshot/Subscribe/Drop with pgxpool
  transactional writes + LISTEN/NOTIFY pub/sub via per-flow channel
  hash. Migrations applied at startup via embedded `embed.FS`.
- Backend abstraction (`store.Backend`) lets api/ swap between
  PGStore (production) and the legacy MemBackend (tests/dev).
  `FLOW_SERVER_BACKEND=pg|memory` env selects.
- New endpoints: POST/GET `/v1/flows/{id}/log-lines` for exec log
  ingest+replay against the `flow_log_lines` table.

### Helm chart: CNPG Cluster CR + DSN wire-in

- New `templates/cnpg-cluster.yaml` provisions `openova-flow-pg` via
  bp-cnpg's `postgresql.cnpg.io/v1.Cluster`. CASCADE-FK-aware schema
  + Reflector annotations for cross-NS secret access.
- Deployment env wires `FLOW_SERVER_PG_DSN` from CNPG's auto-generated
  `<cluster>-app` Secret (`uri` key — full libpq URI with auth).
- `chart 0.1.1 → 0.2.0` (breaking schema change).
- bootstrap-kit slot 56: `dependsOn: bp-cnpg` so cold install order
  is correct.

### catalyst-api: emit loop + local composer demoted to fallback

- New `internal/flowemit/` HTTP client posts FlowMessage envelopes
  (snapshot, upsert-nodes, upsert-rels, delete-*) to
  `OPENOVA_FLOW_SERVER_URL/v1/flows/{id}/events`. Bounded retry,
  fire-and-forget.
- New `flow_emitter.go` runs a per-deployment 5s ticker goroutine
  that composes the current snapshot via `flowSnapshotFromJobs` and
  emits it. State changes via Bridge call `triggerFlowEmit(depID)`
  for sub-second propagation.
- `HandleFlowSnapshot` order INVERTED: proxy to openova-flow-server
  FIRST, fall back to local composer ONLY in degraded mode (proxy
  unreachable). Production traffic now durably reads from CNPG.
- Emit loop starts when phase 1 watch begins; idempotent; survives
  catalyst-api restart because state is in CNPG.

## What this delivers

- Canvas data is DURABLE: it survives any pod restart (catalyst-api,
  openova-flow-server, or both).
- openova-flow-server is now stateless: every read hits CNPG.
- Wire contract (FlowMessage envelopes) unchanged. UI unchanged.
- catalyst-api can be horizontally scaled: no in-memory state is
  needed for the graph path (deployments map + jobs.Store retire
  in follow-up).

## What's NOT in this PR (clear follow-up)

- jobs.Store + PVC retirement: exec logs still on PVC. Moving them
  to `flow_log_lines` requires updating ~30 callers across the
  catalyst-api handler/ package — out of scope for this single PR's
  blast radius. The new `POST /v1/flows/{id}/log-lines` endpoint is
  already in place; only the call sites need to migrate.
- flow_snapshot_local.go: kept as the degraded-mode fallback (proxy
  unreachable). Will be deleted once jobs.Store retirement removes
  the underlying read path.

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 14:16:11 +04:00
e3mrah
f110a540d8
fix(canvas): persist DependsOn on every event + /refresh-watch fans out to secondary regions (#1470)
Founder caught on prov #75 (b7ae422089d4fde9) after PR #1469 deploy:
all 3 regions' 45-children dep wiring vanished after the catalyst-api
pod restart. Root cause: the deps were never in Job.DependsOn — they
were only in the Pod's in-memory hrDeps cache built from
liveWatcher.SnapshotComponents() Layer-2 in flow_snapshot_local.go.
Pod restart killed the cache.

## Two fixes

### Fix A — Bridge.OnHelmReleaseEvent preserves existing DependsOn

`OnHelmReleaseEvent` previously hardcoded `DependsOn: []string{}` on
every HR state-transition event, relying on `mergeJob` to keep the
prior list. That works when SeedJobsFromInformerList wrote the deps
FIRST. But the seed fires once at OnInitialListSynced; if the seed
ran during a window when HR.spec.dependsOn was being applied/rolling,
or if the seed didn't run at all (silent informer failure post-Pod
restart), Job.DependsOn stays `[]` forever and every subsequent event
re-confirms it.

Fix: load the existing Job from store first, carry its DependsOn
through on the upsert. Same pattern as OnRawComponentLog at line
~939. Combined with mergeJob's preserve-prev behaviour, deps are
durable across event waves.

### Fix B — /refresh-watch respawns secondary watchers

`POST /refresh-watch` rebuilt the PRIMARY helmwatch.Watcher and
re-ran SeedJobsFromInformerList for the primary. But it did NOT
respawn secondary watchers — so after a Pod restart, secondaries'
90 install Jobs stayed flat indefinitely. Fix: call
`spawnSecondaryRegionWatchers(dep)` from RefreshWatch (idempotent —
already running watchers short-circuit on `stopWatchers[region]`).
With this, /refresh-watch restores deps for ALL regions, not just
primary.
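
A minimal Go sketch of the idempotent respawn described in Fix B. The
`watcherSet` type, its `spawn` method, and the `start` callback are
hypothetical stand-ins, not the real catalyst-api types; only the
`stopWatchers[region]` short-circuit comes from this commit.

```go
package main

import (
	"fmt"
	"sync"
)

// watcherSet is a hypothetical registry of per-region stop functions.
type watcherSet struct {
	mu           sync.Mutex
	stopWatchers map[string]func()
}

// spawn starts a watcher for region unless one is already registered.
// It reports whether a new watcher was started.
func (w *watcherSet) spawn(region string, start func() (stop func())) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if _, running := w.stopWatchers[region]; running {
		return false // already running: short-circuit
	}
	w.stopWatchers[region] = start()
	return true
}

func main() {
	w := &watcherSet{stopWatchers: make(map[string]func())}
	start := func() func() { return func() {} }
	// The repeated "hel1-2" models a second /refresh-watch call.
	for _, region := range []string{"hel1-2", "nbg1-1", "hel1-2"} {
		fmt.Println(region, w.spawn(region, start))
	}
}
```

Because a second call for the same region is a no-op, the refresh
handler can call the spawn path unconditionally.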

## Validation

Caught the bug via per-region edge audit on prov #75 (NOT aggregate
counts — per `feedback_validate_full_dod_before_declaring_pass.md`).
Pre-fix: fsn1=0 / hel1-2=0 / nbg1-1=0 intra-region edges. Post-fix
target: fsn1=71 / hel1-2=71 / nbg1-1=71.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 10:48:46 +04:00
e3mrah
df1dfed707
fix(canvas): opaque bubbles + explicit wires-below layering (#1469)
Founder rule on prov #75 review: "make sure the bubbles are no more
transparent and wires are always below the bubbles".

Two fixes:

1. **Opaque bubbles always**. Previously `groupOpacity = isDimmed ? 0.35 : 1`
   dropped the entire group's opacity to 35% when another job was open
   and this node wasn't on the focused path — making the bubble fill
   see-through and the edges behind visible THROUGH the bubble. Replaced
   with a CSS `filter: grayscale + brightness` treatment that desaturates
   the dimmed node without making it transparent.

2. **Explicit edges-then-nodes paint layers**. Wrapped the edges loop in
   `<g className="flow-edges-layer" data-layer="edges">` and the nodes
   loop in `<g className="flow-nodes-layer" data-layer="nodes">`. SVG
   paint order already produced the correct ordering via JSX source
   order, but a future code change inserting another element between the
   two could quietly break it; the explicit wrappers make the contract
   load-bearing and inspectable.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 10:28:40 +04:00
e3mrah
b4c2f54fa2
fix(canvas): don't region-prefix PRIMARY install deps (prov #74) (#1468)
Regression caught immediately after PR #1467 by founder on prov #74
(be70efe343e58b5a). My validation declared "all 5 issues passed"
based on aggregate 292 edges + 5 sampled hel1-2 deps, missing that
PRIMARY fsn1 had 0 intra-region edges + 71 phantom cross-region edges.

## Root cause

PR #1467 wired primary install jobs into a primary region sub-group
(jobRegion = dep.Request.Region) for symmetric multi-region rendering.
`regionalise()` triggered on `jobRegion != ""` — over-applying the
`fsn1:` prefix to PRIMARY's bare-named DependsOn entries:

  install-cilium → install-fsn1:cilium (PHANTOM — no such node exists)

PRIMARY install Jobs have BARE JobNames in the store
("install-cilium"); only SECONDARY install Jobs have region-prefixed
JobNames ("install-hel1-2:cilium"). Region-prefixing primary deps
produces a JobID that matches no node, so the edge is dropped or
points at nothing.

A second related bug: Layer-1 heuristic
`!strings.Contains(dep, ":")` was used to detect bare-jobName form,
but with the new `:` separator a region-prefixed JobName
("install-hel1-2:cilium") now contains a colon — so the heuristic
mis-classified it as "already a full JobID" and emitted FromID
without the deploymentID prefix. Phantom edge.

## Fixes

1. `isSecondaryRegionJob := strings.IndexByte(j.AppID, ':') > 0`
   replaces `jobRegion != ""` as the regionalise() gate. Primary
   jobs have no `:` in AppID → no prefix injection.

2. `fullJobIDPrefix := deploymentID + ":"` replaces the
   `strings.Contains(dep, ":")` heuristic. Only deps that ALREADY
   carry the deploymentID prefix are passed through verbatim; bare
   JobNames (with or without region prefix) get the JobID() wrap.
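
The two gates can be sketched in Go as follows. This is an
illustration under assumptions, not the code in
flow_snapshot_local.go: `resolveDep` and `jobID` are hypothetical
helpers, and the region-scoping step stands in for `regionalise()`.

```go
package main

import (
	"fmt"
	"strings"
)

// jobID wraps a bare JobName with the deployment prefix (hypothetical helper).
func jobID(deploymentID, jobName string) string {
	return deploymentID + ":" + jobName
}

// resolveDep turns one DependsOn entry into the full JobID of the edge source.
// appID is the dependent job's AppID, e.g. "cilium" (primary) or
// "hel1-2:cilium" (secondary).
func resolveDep(deploymentID, appID, dep string) string {
	// Only deps that already carry the deploymentID prefix pass through verbatim.
	if strings.HasPrefix(dep, deploymentID+":") {
		return dep
	}
	// Secondary-region jobs are recognised by a ':' inside AppID.
	if i := strings.IndexByte(appID, ':'); i > 0 {
		region := appID[:i]
		chart := strings.TrimPrefix(dep, "install-")
		// Region-scope a bare dep name; leave an already prefixed one alone.
		if !strings.Contains(chart, ":") {
			dep = "install-" + region + ":" + chart
		}
	}
	return jobID(deploymentID, dep)
}

func main() {
	fmt.Println(resolveDep("d1", "cilium", "install-flux"))
	fmt.Println(resolveDep("d1", "hel1-2:cert-manager", "install-cilium"))
}
```

Primary jobs never gain a region prefix, so their deps keep resolving
to the bare-named nodes that actually exist in the store.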

## Lesson learned

Saved `feedback_validate_full_dod_before_declaring_pass.md` —
aggregate metrics and sample checks are NOT validation. Every DoD
bullet must run an explicit per-tier pass/fail check before
declaring resolved.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 10:10:28 +04:00
e3mrah
4814c6849b
fix(canvas): wire deps + phase groups + URL-safe separator (prov #73) (#1467)
Founder caught 5 canvas defects on prov #73 (8cd1ff1a80430dc5):

1. depth=1 shows 2 bubbles (provisioner + bootstrap-kit): confirmed
   correct architecture per composer.
2. Expanding bootstrap-kit shows 3 region sub-groups: confirmed.
3. 🐛 All 135 install-* nodes had ZERO inter-HR dep edges. Snapshot
   showed only 5 finish-to-start rels (tofu chain + bootstrap-kit
   sequence). install-cert-manager → install-cilium etc. all missing.
4. 🐛 Canvas only emitted 2 phase groups (provisioner + bootstrap-kit).
   Missing cutover/handover/apps despite being part of the canonical
   5-phase lifecycle.
5. 🐛 /jobs/install-hel1-2/newapi returned 404 because TanStack Router
   splits "/" in the $jobId param.

## Fixes

### Fix 3a: mergeJob preserves prev.DependsOn when next is empty
   store.go:283 — `if len(next.DependsOn)==0 && len(prev.DependsOn)>0`
   keeps prior list. Without this, every OnHelmReleaseEvent (which
   hardcodes `DependsOn: []string{}` at line 508 because it doesn't
   re-look up HR.spec.dependsOn per event) CLOBBERED the seeded deps.
   Confirmed in store: 135/135 install Jobs had `dependsOn: []`
   despite SeedJobsFromInformerList running with proper deps. Founder
   reported this same flat-leaves bug 4 sessions in a row.
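
The preserve-prev rule, as a minimal Go sketch. The `Job` struct here
is a simplified, hypothetical stand-in for the stored row; only the
condition itself is taken from the description above.

```go
package main

import "fmt"

// Job is a simplified stand-in for the persisted job row.
type Job struct {
	Name      string
	Status    string
	DependsOn []string
}

// mergeJob returns next, except that an empty next.DependsOn does not
// overwrite a non-empty prev.DependsOn.
func mergeJob(prev, next Job) Job {
	if len(next.DependsOn) == 0 && len(prev.DependsOn) > 0 {
		next.DependsOn = prev.DependsOn
	}
	return next
}

func main() {
	seeded := Job{Name: "install-cert-manager", Status: "pending", DependsOn: []string{"install-cilium"}}
	// A status event arrives with the hardcoded empty dependency list.
	event := Job{Name: "install-cert-manager", Status: "running", DependsOn: []string{}}
	merged := mergeJob(seeded, event)
	fmt.Println(merged.Status, merged.DependsOn)
}
```

A consequence of this rule is that an event can never clear a
dependency list; removing deps would need an explicit path.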

### Fix 3b: secondary watchers get region-aware seeder hook
   New `attachSecondaryBridgeSeederHook` + `snapshotsToSeedsForRegion`
   wire the seed path for secondary helmwatch.Watchers. Without this,
   secondary install-* Jobs were only ever created by per-event
   OnHelmReleaseEvent (DependsOn=[]) so the canvas dep graph was
   permanently flat under secondary region groups regardless of fix
   3a.

### Fix 3c: composer Layer-2 reads secondary watchers' HR.spec.dependsOn
   flow_snapshot_local.go now also walks dep.secondaryWatchers and
   populates hrDeps with region-prefixed keys + region-prefixed values.
   After fix 3a+3b the stored Job.DependsOn is the authoritative source
   (Layer 1) — this Layer-2 enrichment is the safety net for hot-
   shipped charts that bypass the seed path.

### Fix 4: cutover/handover/apps phase groups
   types.go — add GroupCutover/Handover/Apps constants + Display.
   flow_snapshot_local.go — add phaseForChart() classifier (currently
   maps self-sovereign-cutover → cutover), reparent install jobs to
   the correct phase sub-group, synthesise per-region sub-groups for
   each phase, emit top-level phase groups, and chain them with
   finish-to-start: provisioner → bootstrap-kit → cutover → handover
   → apps.
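
A rough Go sketch of the classifier and the phase chain. Assumptions:
the `Edge` struct, the `<deploymentID>:<phase>` group ID format, and
the bootstrap-kit default in `phaseForChart` are illustrative; only
the self-sovereign-cutover mapping and the phase order come from this
commit.

```go
package main

import "fmt"

// Edge is a hypothetical finish-to-start relationship between two groups.
type Edge struct{ From, To, Type string }

// phaseForChart classifies a chart into a lifecycle phase.
// The bootstrap-kit default is an assumption of this sketch.
func phaseForChart(chart string) string {
	if chart == "self-sovereign-cutover" {
		return "cutover"
	}
	return "bootstrap-kit"
}

// chainPhases links consecutive phase groups so that each one must
// finish before the next one starts.
func chainPhases(deploymentID string, phases []string) []Edge {
	edges := make([]Edge, 0, len(phases))
	for i := 1; i < len(phases); i++ {
		edges = append(edges, Edge{
			From: deploymentID + ":" + phases[i-1],
			To:   deploymentID + ":" + phases[i],
			Type: "finish-to-start",
		})
	}
	return edges
}

func main() {
	phases := []string{"provisioner", "bootstrap-kit", "cutover", "handover", "apps"}
	for _, e := range chainPhases("d1", phases) {
		fmt.Printf("%s -> %s (%s)\n", e.From, e.To, e.Type)
	}
	fmt.Println(phaseForChart("self-sovereign-cutover"), phaseForChart("cilium"))
}
```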

### Fix 5: JobName separator `/` → `:` (canonical per memory rule)
   phase1_watch.go:457 emits ev.Component = region + ":" + chart.
   jobs_backfill.go + flow_snapshot_local.go updated to detect ":"
   instead of "/". useJobLinkBuilder's encodeURIComponent already
   handles ":". /jobs/install-hel1-2:newapi now matches the TanStack
   Router $jobId route.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 09:53:23 +04:00
e3mrah
410a3dbd33
fix(flow_snapshot): region-scope dep edges (no cross-region wiring) (#1461)
Founder caught on prov #66 (3dc9249ea73a6840, 2026-05-13): hel1-2's
install-* nodes all rendered dep arrows pointing at PRIMARY's install
nodes — cross-region edges where NAMING-CONVENTION §1.3 demands
independent fault domains (no cross-region wiring).

Root cause: helmwatch.Bridge persists secondary-region Jobs with bare
dep names ("install-cilium") because HR.spec.dependsOn carries chart
names without region context. The snapshot composer's normaliser
turned `install-cilium` → `<depID>:install-cilium` which IS the
primary's cilium JobID, not hel1-2's `<depID>:install-hel1-2/cilium`.
Every secondary install therefore drew a phantom cross-region edge.

Fix: in flow_snapshot_local.go, region-scope dep names when the source
Job is regional:

  jobRegion=="hel1-2" + dep="install-cilium"
    → "install-hel1-2/cilium" → "<depID>:install-hel1-2/cilium"

Same fix applied to the Layer-2 hrDeps derivation path (per-AppID
lookup also gets bare chart names from the primary watcher). hrDeps
lookup is now done with the unprefixed AppID so it actually hits.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:03:06 +04:00
e3mrah
4a14bbf328
fix(flow_snapshot): symmetric region groups — primary gets its own too (#1460)
Founder caught on prov #65 (6e2fd14bb8b6ed4d, 2026-05-13): canvas shows
ASYMMETRIC structure — primary's 45 install jobs render as BARE LEAVES
directly under bootstrap-kit, while secondary regions get a proper
region sub-group. Result: M×N fan-out from provision-hetzner cascades
onto every primary leaf because there's no primary region group to
absorb the elided-group edge.

PR #1454 introduced region derivation from JobName's `/` separator
(secondary watchers emit `install-<region>/<chart>`). Primary's bridge
emits bare `install-<chart>` names — no `/`, no region derived, no
group synthesized.

Fix: derive primary region from `dep.Request.Region` and apply it to
every install job with no `/` in AppID. The synth-region-group loop
below already creates one group per discovered region, so primary
automatically gets its own `<deploymentId>:<primaryRegion>:bootstrap-kit`
bubble containing all 45 primary installs.

End state: 3 symmetric region sub-groups under bootstrap-kit
(fsn1 + nbg1-1 + hel1-2 for 3-region prov), each with exactly 45
install-* children, region-bounded temporal-endpoint cascade prevents
M×N fan-out at depth=all.
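
A small Go sketch of the derivation, assuming the `/` separator in
use at the time of this commit. `regionOf` and `groupID` are
hypothetical helper names; the group ID format follows the
`<deploymentId>:<primaryRegion>:bootstrap-kit` shape above.

```go
package main

import (
	"fmt"
	"strings"
)

// regionOf derives the region of an install job from its AppID.
// Secondary jobs carry "<region>/<chart>"; bare AppIDs fall back to
// the primary region.
func regionOf(appID, primaryRegion string) string {
	if region, _, ok := strings.Cut(appID, "/"); ok {
		return region
	}
	return primaryRegion
}

// groupID builds the per-region bootstrap-kit group ID.
func groupID(deploymentID, region string) string {
	return deploymentID + ":" + region + ":bootstrap-kit"
}

func main() {
	// Count children per region group; primary jobs now land in their own group.
	counts := map[string]int{}
	for _, appID := range []string{"cilium", "flux", "hel1-2/cilium", "nbg1-1/cilium"} {
		counts[groupID("d1", regionOf(appID, "fsn1"))]++
	}
	fmt.Println(counts)
}
```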

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 20:31:20 +04:00
e3mrah
8518bb1f50
fix(flow_snapshot): drop duplicate live-watcher multi-region block (#1455)
* fix(flow_snapshot): remove duplicate live-watcher multi-region block

PR #1454 added region-group synthesis from persisted Job rows. The old
secondaryWatchers-based block at line 442+ emitted nodes with the SAME
region-group IDs AND child nodes, so during phase 1 (when both paths
are live) the snapshot rendered with 90 children per region group
instead of 45 — visible on prov #61 (2e197a934a0e0461):

  bootstrap-kit: 49 children
  hel1-2:bootstrap-kit: 90 children  (should be 45)
  nbg1-1:bootstrap-kit: 90 children  (should be 45)

Plus the region groups appeared twice in the node list.

Root cause: the per-Job loop (PR #1454) and the legacy block both write
to the same region-group IDs without deduping. The per-Job path covers
the persisted-Job state (durable across phase-1 termination), so the
live-watcher path is redundant.

Fix: delete the legacy block. The earlier
secondaryWatchers-snapshot-into-map work (lines 182-205) is kept
because that path also reads dep.liveWatcher (primary) for the hrDeps
lookup the per-Job loop uses for primary-region dep edges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:47:00 +04:00
e3mrah
d9d7fa2baa
fix(flow_snapshot): derive region from persisted JobName, synth region groups (#1454)
* fix(flow_snapshot_local): derive region from persisted JobName, synth region groups

Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): the multi-region
canvas at /sovereign/provision/<id>/jobs/tofu-output renders 135 install-*
leaves as direct children of bootstrap-kit (no region sub-groups visible),
and the provision-hetzner→bootstrap-kit edge fans M×N across all 135.

Root cause: spawnSecondaryRegionWatchers (phase1_watch.go:429) emits
events with `ev.Component = region + "/" + componentName`. The jobs
bridge persists them with `JobName=install-<region>/<chart>` and
`AppID=<region>/<chart>`, BUT ParentID=bootstrap-kit (the bridge has no
region awareness). After phase 1 terminates the deferred stopSecondaries()
clears `dep.secondaryWatchers`, so the multi-region snapshot block
(line 408-460, gated on `len(secondaryWatchers) > 0`) becomes a no-op.
flowSnapshotFromJobs then emits all 135 install Jobs flat under
bootstrap-kit, no Region field set, no region group bubbles, and
flowLayoutOrganic.ts's temporal-endpoint cascade fans the
provisioner→bootstrap-kit edge onto all 135 because there's no
intermediate region group to absorb it.

Fix: in the per-Job loop, detect `/` in `j.AppID` (the canonical
multi-region prefix marker), derive the region key, set
FlowNode.Region, and re-parent to a synthesised
"<deploymentId>:<region>:bootstrap-kit" group. After the loop,
synthesise one bootstrap-kit sub-group node per discovered region
with a `contains` edge to the parent bootstrap-kit. The resulting
shape:

  bootstrap-kit
   ├── 45 primary install-* (legacy parent, no region)
   ├── <region-A>:bootstrap-kit ── 45 install-*  (region tagged)
   └── <region-B>:bootstrap-kit ── 45 install-*  (region tagged)

This persists ACROSS phase-1 termination because the source of truth
is jobs.Store (durable), not dep.secondaryWatchers (transient).

The multi-region block (line 408+) still runs WHEN secondary watchers
are alive (during phase 1) — it emits ADDITIONAL FlowNodes with
"<deploymentId>:<region>:install-X" IDs distinct from the persisted
"<deploymentId>:install-<region>/<chart>" IDs, so the two paths don't
collide. Post-phase-1 the watchers clear and only the persisted-Job
path remains, but now WITH region structure preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:24:20 +04:00
e3mrah
3a08c23ae4
fix(JobsTable): strip <deploymentId>: prefix from row link (404 fix) (#1453)
Founder caught on prov #59 (a43364f11c10cde3, 2026-05-13): clicking a
running secondary-region install-* row on /sovereign/provision/<id>/jobs
landed on /provision/<id>/jobs/<id>:install-nbg1-1/self-sovereign-cutover
and returned "404 page not found".

Root cause: useJobLinkBuilder was passing the FULL canvas JobID form
through encodeURIComponent.replace(/%3A/g, ':') WITHOUT first stripping
the "<deploymentId>:" prefix. The canvas emits ids like
"<deploymentId>:install-X" (single-region) or
"<deploymentId>:<region>:install-X" (multi-region, see
flow_snapshot_local.go:410). jobs.Store.GetJob keys by the BARE jobName —
exact-match URL lookup of the prefix-bearing form misses every time.

FlowPage.handleNodeDoubleClick (FlowPage.tsx:355) already strips the
first `:` prefix for canvas drill-down; JobsTable now matches so a /jobs
row click and a canvas drill-down resolve to the SAME backend endpoint.

The existing JobsTable row-link test uses a job.id with no `:` prefix,
so the strip is a no-op for that fixture and the `/jobs/job-install-cilium`
assertion still holds.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:03:47 +04:00
e3mrah
4923938c2b
feat(multi-region-canvas): per-region kubeconfig PUT-back + per-region helmwatch (#1444)
Operator mandate (2026-05-12): the mothership canvas must surface
install-* HRs from EVERY region of a multi-region provision, not just
the primary CP's. Today catalyst-api stores ONE kubeconfig per
deployment (the primary CP's) and spawns ONE helmwatch.Bridge against
it. Result: secondary regions are invisible on the canvas even though
their k3s clusters are fully reconciling.

End-to-end change across infra + handler:

1) cloud-init (cloudinit-control-plane.tftpl): the kubeconfig PUT URL
   appends `?region=<kubeconfig_postback_region>` when the var is set.
   main.tf templatefile call passes empty for primary CP, `each.key`
   (e.g. "nbg1-1", "hel1-2") for each secondary region.

2) PutKubeconfig handler: reads ?region= query param. Empty → primary
   path (unchanged: stores at <dir>/<id>.yaml, sets
   Result.KubeconfigPath, fires Phase-1 watch + SMTP seed). Non-empty
   → secondary path: stores at <dir>/<id>-<region>.yaml, populates
   Deployment.secondaryKubeconfigPaths[region]. Single-use guard is
   per-region (the same bearer secures every CP's PUT — secondaries
   reuse it for their own slot). NO Phase-1 watch re-launch from a
   secondary PUT.

3) phase1_watch.spawnSecondaryRegionWatchers: runs alongside the
   primary's watcher. Scans <kubeconfigsDir>/<id>-*.yaml every 15s,
   spawns one helmwatch.NewWatcher per kubeconfig discovered, stores
   the Watcher on Deployment.secondaryWatchers[region]. Per-region
   watchers emit ordinary helmwatch events with region-prefixed
   Component names so the wizard's per-component view doesn't collide
   primary vs secondary bp-cilium events. They do NOT contribute to
   markPhase1Done — outcome remains the primary's classification.

4) flow_snapshot_local.flowSnapshotFromJobs: composes per-region group
   bubbles + install-* nodes from each secondary watcher's
   SnapshotComponents. Node id: <depID>:<region>:install-<chart>.
   FlowNode.region set so the canvas can colour-group. Intra-region
   finish-to-start deps emitted from cs.DependsOn — same-region only,
   never cross-region (per NAMING-CONVENTION §1.3 independent fault
   domains, no stretched cluster).

5) wipe.go: removes both <id>.yaml AND every <id>-*.yaml secondary
   kubeconfig file on Sovereign wipe.
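
The storage layout in items 2, 3 and 5 can be sketched in Go as
below. Both helper names are hypothetical and the directory is a
placeholder; only the `<id>.yaml` and `<id>-<region>.yaml` naming is
taken from this commit.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// kubeconfigPath returns where a posted kubeconfig is stored.
// An empty region selects the primary slot.
func kubeconfigPath(dir, id, region string) string {
	if region == "" {
		return filepath.Join(dir, id+".yaml")
	}
	return filepath.Join(dir, id+"-"+region+".yaml")
}

// secondaryKubeconfigs lists every per-region kubeconfig of a
// deployment, which both the periodic scan and the wipe path need.
func secondaryKubeconfigs(dir, id string) ([]string, error) {
	return filepath.Glob(filepath.Join(dir, id+"-*.yaml"))
}

func main() {
	dir := "/var/lib/kubeconfigs" // placeholder directory
	fmt.Println(kubeconfigPath(dir, "b7ae422089d4fde9", ""))
	fmt.Println(kubeconfigPath(dir, "b7ae422089d4fde9", "nbg1-1"))
	paths, err := secondaryKubeconfigs(dir, "b7ae422089d4fde9")
	fmt.Println(len(paths), err)
}
```

Deriving both paths from one function keeps the writer (PUT handler)
and the readers (watcher scan, wipe) agreeing on the file names.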

Storage model is uniform across SME and corporate Sovereigns. No
hardcoding of provider, region count, or building block.

Caught after operator pointed out that 3-region prov #50 was showing
only 52 install-* nodes (all from fsn1) on the canvas — the
architectural gap.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 16:12:38 +04:00
e3mrah
bd5d4393ec
fix(canvas): cross-group edges cascade to leaf temporal endpoints (#1442)
Operator-reported design fix completing #1437/#1440 — the cross-phase
ordering between provisioner and bootstrap-kit groups was either an
M×N phantom-edge fan-out (pre-#1437) OR completely disconnected at
leaf level (post-#1440 with the both-elided skip). Neither was right.

Real design: when a group→group dependency edge is lifted onto the
leaf graph because one or both endpoints elided, cascade ONLY to the
temporal endpoint pair:

  upstream_terminals → downstream_initials

Where:
  - upstream_terminals = visible descendants of the upstream group
    that nothing else in the group depends on (sinks of intra-group
    DAG). For the tofu chain this collapses to just cluster-bootstrap.
  - downstream_initials = visible descendants of the downstream group
    that depend on nothing else in the group (sources of intra-group
    DAG). For bootstrap-kit this is install-cilium / install-flux /
    install-gateway-api / etc — the install-* roots.

Net result for provisioner→bootstrap-kit at depth=all: a small fan of
edges from cluster-bootstrap to the bp-* roots — the real temporal
gate, no spurious phantom edges, no missing cross-phase chain.

Two call sites updated:
  - Inbound: visibleJob X with X.dependsOn = [elidedGroup G] now
    cascades to groupTerminals(G) instead of fanOutVisibleChildren(G).
  - Outbound: elidedGroup G with G.dependsOn = [D] cascades to
    groupInitials(G) on the receive side; D-side cascades to
    groupTerminals(D) when D is also elided, or uses D directly when
    D is a visible job.

11/11 flowLayoutOrganic.test.ts pass.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 13:47:42 +04:00
e3mrah
0fe0cacc15
fix(canvas): right-click menu actions actually work + clearer labels (#1441)
Operator reported "non of the right click functionalites working
other than the open in new tab". Root cause: the previous handler
only mutated urlFoldedSet, which had no visible effect when the
clicked group was folded by the depth default (same class of bug
toggleFold had before #1439). The menu items also had confusing
labels ("Fold to level N" stepped GLOBAL depth, not subtree-relative).

Rewrite to use the same compose-state pattern toggleFold uses:

  - "Show only this group" — switch to depth=all + fold every OTHER
    group. Only the clicked group's subtree expands; sibling groups
    stay collapsed.
  - "Hide this group" — switch to depth=default + add clicked group
    to urlFoldedSet. Group renders as a folded bubble; its subtree
    hidden.
  - "Expand subtree" — switch to depth=all + remove this group and
    all its descendant groups from urlFoldedSet. Fully unfolded
    subtree.
  - "Open in new tab" — unchanged (was working since #1435).

Dropped the misleading "Fold to level N" item (was just stepDepth(-1)).
The depth chip ◀▶ at the top-right is the canonical global depth
control.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 13:30:31 +04:00
e3mrah
2c1f767b52
fix(canvas): back-to-jobs chroot-scoped + group→group edge w/o M×N lift (#1440)
Three operator-reported issues from the same dblclick session:

1) "Back to jobs" link in JobDetail.tsx (2 sites) and JobsTimeline.tsx
   used absolute /jobs which on contabo resolves to /sovereign/jobs —
   the mother's flat /jobs view, NOT the chroot-scoped
   /sovereign/provision/<id>/jobs. Operator reported "chroot principle
   violation". Fix: chroot-aware /provision/<deploymentId>/jobs when
   deploymentId is present.

2) Bootstrap and Provision Hetzner group bubbles at ?depth=1 had no
   edge between them — temporal ordering invisible. Earlier #1437
   dropped the group→group edge entirely because the FE layout's
   lift-on-elide cascaded it into M×N phantom edges at ?depth=all.
   Re-emit the edge AND fix the lift logic in
   flowLayoutOrganic.ts (lines 414-442) to SKIP the lift when BOTH
   endpoints of the elided-group dep are elided. At ?depth=1 the
   edge renders between the two folded groups as intended; at
   ?depth=all both groups elide and the lift is suppressed so the
   spurious cascade doesn't reappear. The actual install-* deps are
   already visible via each leaf's own dependsOn — skipping the lift
   costs no information.

3) (Documented separately) Right-click menu only attaches to GROUP
   nodes per design (FlowCanvasOrganic line 1277). When all groups
   are elided (?depth=all auto-folds groups out), the menu is
   unreachable. The dblclick-on-group fold fix (#1439) makes group
   bubbles reachable at ?depth=1 where right-click works.

Caught via Playwright after operator reported all three.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 13:24:50 +04:00
e3mrah
bb1bff245a
fix(canvas): toggleFold handles depth-default-folded nodes (#1439)
toggleFold previously only mutated urlFoldedSet, which had no effect
when the clicked node was folded BY THE DEPTH DEFAULT (not by an
explicit URL override). Result: at ?depth=1 where both groups are
folded by depth-default, double-clicking bootstrap-kit (after #1438's
dblclick-on-group → toggleFold branch) was a no-op — the urlFoldedSet
delete didn't change the composed foldedSet, the canvas didn't budge.

New behaviour:
  - If clicked node is folded by ANY source: switch to depth=all AND
    explicitly fold every OTHER previously-folded group. Only the
    clicked group ends up visibly unfolded — exactly the operator-
    requested "expand only the respective parent" UX.
  - If clicked node is unfolded: add to urlFoldedSet to fold it
    without changing depth.

Caught via Playwright after #1438 landed and dblclick still didn't
unfold the clicked group at ?depth=1.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:39:58 +04:00
e3mrah
9da662c6f5
fix(canvas): double-click on group toggles fold (not navigate) (#1438)
Operator reported "double-click on a parent bubble it is expanding
all the parent instead of expanding only the respective parent."
Reproduced in Playwright: at ?depth=1 only the 2 group bubbles
render folded; double-click on bootstrap-kit navigated to
/jobs/bootstrap-kit which DROPPED the ?depth=1 query → new page
defaulted to depth=2 → groups elided → all 50 install-* + Phase-0
bubbles rendered. Exactly the "expanding all parents" symptom.

Two fixes:

1) Branch handleNodeDoubleClick: if the bubble is a group, call
   toggleFold(nodeId) in place — fold or unfold ONLY that group.
   Tree-explorer UX where a leaf double-click drills in but a group
   double-click expands/collapses.

2) For the leaf path, preserve window.location.search across the
   navigate so the destination page renders with the same depth /
   folded filter the operator had on screen. Without this, the new
   page defaults to depth=2 and the visible bubble set changes
   beneath them.

Caught via Playwright double-click simulation on bootstrap-kit at
?depth=1 — URL went from .../jobs/install-cnpg?depth=1 (2 bubbles)
to .../jobs/bootstrap-kit (50 bubbles).

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:33:59 +04:00
e3mrah
5e96d30552
fix(flow-snapshot): drop provisioner→bootstrap-kit edge — causes M×N fan-out (#1437)
flowLayoutOrganic.ts lines 414-442 lift an elided group's outbound
deps onto EACH of its visible children, and if the dep target is
itself an elided group, fans out to THAT group's visible children
too. With both top-level groups elided at depth=all, the single
group→group finish-to-start edge I added cascades into M×N phantom
edges (each install-* gains a dep on every tofu-* + cluster-bootstrap
step). The operator-reported "install-cnpg has 5 connections from
terraform jobs" was exactly this layout-side fan-out.

Removing the group→group edge leaves Phase-0 and Phase-1 as separate
connected components on the canvas — the correct minimum-edge
rendering. Ordering between phases is implicit in the timestamps +
status flow, not in the edge graph.

Caught by Playwright-probing the canvas after operator pushback: data
side had only the 1 real direct dep (install-flux → install-cnpg)
yet the canvas drew 5+ phantom lines to install-cnpg from Phase-0.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:30:44 +04:00
e3mrah
f980356ce9
fix(canvas): setSearchPatch uses window.history (forward-fix CI tsc TS2322) (#1436)
PR #1435 (depth-chip basepath fix) failed CI because removing `to:`
from navigate() narrowed the search reducer's typed return to never,
producing TS2322 on the `Record<string, unknown>` cast.

Forward-fix: bypass TanStack navigate() entirely for the search-only
mutation path. Update window.location's query string via
history.replaceState (preserves pathname verbatim including basepath)
and dispatch a synthetic popstate so TanStack's useSearch picks up
the new query on next render. No TanStack path resolution → no
basepath drop → no colon re-encoding → depth-chip click stops 404ing.

Also carries over from #1435 the open-new-tab fix (window.open of absolute
/sovereign/...) and the handleNodeDoubleClick fix (strip + encode jobId).
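
A sketch of the search-only mutation, assuming a helper named `patchSearch`
(hypothetical; the real setSearchPatch is not shown here). The pure part
rewrites only the query string and never touches the pathname, which is why
the basepath and the colon survive.

```typescript
// Returns pathname + patched query string. A null value removes the key.
function patchSearch(
  pathname: string,
  search: string,
  patch: Record<string, string | null>,
): string {
  const params = new URLSearchParams(search);
  for (const [key, value] of Object.entries(patch)) {
    if (value === null) params.delete(key);
    else params.set(key, value);
  }
  const qs = params.toString();
  // The pathname is concatenated verbatim: no re-resolution, no re-encoding.
  return qs ? `${pathname}?${qs}` : pathname;
}

// Browser wiring (not executed here), following the approach described above:
//   const url = patchSearch(location.pathname, location.search, { depth: "2" });
//   history.replaceState(history.state, "", url);
//   window.dispatchEvent(new PopStateEvent("popstate"));
```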

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:11:26 +04:00
e3mrah
4d1ccfbd44
fix(canvas): depth-chip click drops /sovereign basepath + open-new-tab 404 (#1435)
Two UX-killer bugs the operator hit on the FlowCanvasOrganic surface:

1) Clicking the depth chip arrows (◀ / ▶) on
   /sovereign/provision/<id>/jobs/<depId>:install-X pushed the browser
   to /provision/<id>/jobs/<depId>%3Ainstall-X — the /sovereign basepath
   was dropped AND the colon was re-encoded as %3A, both via TanStack's
   `to: '.'` path resolution. The new URL 404s at the BE because the
   colon-prefixed jobName misses jobs.Store.GetJob's exact-match lookup.
   Fix: omit `to:` entirely. TanStack treats a search-only navigate as
   a pure search-params mutation and preserves the current path verbatim
   including the basepath. The colon-prefixed jobId in the URL comes
   from older deep-links; the strip-on-click fix landed in #1431.

2) Right-click → "Open in new tab" also passed the raw nodeId
   verbatim (no prefix strip, no encode, no /sovereign prefix). Mirror
   handleNodeDoubleClick: strip the "<deploymentId>:" prefix,
   encodeURIComponent the remainder, AND prepend /sovereign for the
   absolute-path window.open (window.open isn't routed through
   TanStack so basepath isn't auto-prepended).

Caught after operator reported "level arrows redirect to wrong URLs
and giving 404" + "right click on a parent bubble … none of the
functions are working properly."
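
The three steps for the new-tab URL (strip, encode, prepend) can be sketched
as below. `newTabHref` and its parameters are hypothetical names; the route
shape is taken from the URLs quoted in this message.

```typescript
function newTabHref(basepath: string, deploymentId: string, nodeId: string): string {
  // 1. Strip the "<deploymentId>:" prefix when present.
  const prefix = `${deploymentId}:`;
  const jobName = nodeId.startsWith(prefix) ? nodeId.slice(prefix.length) : nodeId;
  // 2. Encode the remainder. 3. Prepend the basepath, because window.open
  //    receives an absolute path and does not go through the router.
  return `${basepath}/provision/${encodeURIComponent(deploymentId)}/jobs/${encodeURIComponent(jobName)}`;
}
```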

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:02:37 +04:00
e3mrah
1d9dd99915
fix(flow-snapshot): normalise bare-name Job.DependsOn to canonical JobID form (#1434)
helmwatch.Bridge writes SOME Job.DependsOn entries as bare names
("install-flux") rather than the canonical JobID form
("<deploymentId>:install-flux") — 71 such entries observed on prov
bfdccbdbd6f700e1 (2026-05-12). My flowSnapshotFromJobs emit copied
those bare names verbatim into Relationship.fromId. The canvas
reducer matches FlowNode.id by exact string, so the bare-name fromId
became a phantom edge pointing to a non-existent node. In the
force-directed layout these phantom edges visually routed through
the nearest real bubbles, manifesting as 5-edge fan-outs from every
Phase-0 tofu job to every install-* bubble (operator-reported on
install-cnpg, but symmetric across all install-*).

Normalise every fromId to jobs.JobID(deploymentID, dep) form when
the stored value lacks a ":" separator.

Caught after operator reported "install-cnpg has 5 different
connections from terraform jobs — this is matter of a proper
chaining" — looking at the snapshot showed Job.DependsOn=[install-flux]
without the prefix.
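
The normalisation rule, sketched in TypeScript. The `jobId` helper is assumed
to mirror jobs.JobID(deploymentID, dep) as "<deploymentId>:<jobName>";
`normaliseDependsOn` is a hypothetical name.

```typescript
// Assumption: canonical JobID form is "<deploymentId>:<jobName>".
function jobId(deploymentId: string, jobName: string): string {
  return `${deploymentId}:${jobName}`;
}

// Entries that already contain ":" are left alone; bare names get the prefix.
function normaliseDependsOn(deploymentId: string, dependsOn: string[]): string[] {
  return dependsOn.map((dep) => (dep.includes(":") ? dep : jobId(deploymentId, dep)));
}
```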

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:00:04 +04:00
e3mrah
93c3e81f0c
fix(flow-snapshot): contains edge direction — toId is parent per canon (#1433)
Per products/openova-flow/core/src/types.ts line 112:
  "contains — toId (parent) contains fromId (child)"

My emit had this inverted: I set FromID=parent, ToID=child, which
made the FE adapter (flowStreamToOrganic.ts line 134) interpret every
install-* leaf as a group containing the bootstrap-kit/provisioner
group nodes. Net result: only 2 bubbles ever rendered on the canvas
regardless of ?depth= because the hierarchy graph was upside-down.

Caught by opening the canvas in a browser via Playwright after the
operator reported "still showing only 2 bubbles, no drill-down".
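
A sketch of how a consumer might read `contains` edges under the canonical
direction quoted above (toId is the parent). The `Relationship` shape and
`childrenByParent` are illustrative, not the code in flowStreamToOrganic.ts.

```typescript
type Relationship = {
  type: "contains" | "finish-to-start";
  fromId: string;
  toId: string;
};

// Group children under their parent: toId (parent) contains fromId (child).
function childrenByParent(rels: Relationship[]): Map<string, string[]> {
  const out = new Map<string, string[]>();
  for (const r of rels) {
    if (r.type !== "contains") continue;
    const children = out.get(r.toId) ?? [];
    children.push(r.fromId);
    out.set(r.toId, children);
  }
  return out;
}
```

Emitting the edge with the endpoints swapped would make each leaf look like a
parent of its group, which matches the upside-down hierarchy described above.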

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:24:30 +04:00
e3mrah
048a4d8910
fix(refresh-watch): disk-fallback when Result.KubeconfigPath is empty (#1432)
When the Pod restarts between PutKubeconfig writing the file AND the
next Result.Save() persisting the field, dep.Result.KubeconfigPath
comes back empty even though the file exists at the canonical
convention <kubeconfigsDir>/<deploymentID>.yaml. RefreshWatch was
returning 409 watch-not-resumable in this state, which left the
mothership canvas frozen because the live watcher couldn't re-attach
to source HR.spec.dependsOn for the install-* edge derivation.

Hit live on prov bfdccbdbd6f700e1 (2026-05-12): chart roll for
PR #1431 restarted catalyst-api Pod, the file
/var/lib/catalyst/kubeconfigs/bfdccbdbd6f700e1.yaml was on disk but
RefreshWatch refused to use it because the record field was empty.

Fix: when KubeconfigPath is empty AND h.kubeconfigsDir is configured
AND a file exists at <dir>/<depID>.yaml, use that path and patch the
record so subsequent /components/state + flow snapshot calls see a
populated field.
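
The fallback order, sketched in TypeScript rather than the handler's own
language. `resolveKubeconfigPath` and the injected `fileExists` callback are
hypothetical; the path convention is the one stated above.

```typescript
type DeploymentResult = { kubeconfigPath: string };

// Returns the path to use, or null when the watch is not resumable.
function resolveKubeconfigPath(
  result: DeploymentResult,
  deploymentId: string,
  kubeconfigsDir: string,
  fileExists: (path: string) => boolean, // injected so the sketch has no I/O
): string | null {
  if (result.kubeconfigPath !== "") return result.kubeconfigPath;
  if (kubeconfigsDir === "") return null;
  const candidate = `${kubeconfigsDir}/${deploymentId}.yaml`;
  if (!fileExists(candidate)) return null;
  // Patch the record so later readers see a populated field.
  result.kubeconfigPath = candidate;
  return candidate;
}
```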

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 10:44:55 +04:00
e3mrah
e3771f6813
fix(flow): derive HR dependsOn from live watcher + fix canvas drill-down 404 (#1431)
Two bugs the operator hit on /sovereign/provision/<id>/jobs:

1) Phase-1 install-* Jobs rendered DISCONNECTED on the canvas —
   helmwatch.Bridge doesn't persist Job.DependsOn (only the Phase-0
   tofu chain + cluster-bootstrap is wired today). Pull HR.spec.dependsOn
   from the live Watcher's informer cache via SnapshotComponents()
   (ComponentSnapshot.DependsOn already populated by extractDependsOn)
   at snapshot-time and emit finish-to-start edges from upstream
   install-<dep> to install-<self>. Also add provisioner→bootstrap-kit
   group-to-group finish-to-start so the Phase-0/Phase-1 ordering is
   visible on the canvas.

2) Clicking a canvas node → "404 page not found" because
   FlowPage.handleNodeDoubleClick passed the full
   "<deploymentId>:install-X" id verbatim. The backend Store.GetJob
   keys by bare jobName ("install-X"), so the colon-prefixed id missed
   exact-match and JobDetail returned 404. Mirror useJobLinkBuilder
   (JobsTable.tsx line 364): strip the "<deploymentId>:" prefix and
   encodeURIComponent the remainder before pushing to the router.
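
A sketch of the edge derivation in item 1, in TypeScript for illustration.
The `ComponentSnapshot` shape here is reduced to two assumed fields, and
`edgesFromComponents` is a hypothetical name; only the "install-<dep> to
install-<self>" rule comes from the text above.

```typescript
type ComponentSnapshot = { name: string; dependsOn: string[] };
type FinishToStart = { type: "finish-to-start"; fromId: string; toId: string };

function edgesFromComponents(
  deploymentId: string,
  components: ComponentSnapshot[],
): FinishToStart[] {
  const id = (name: string) => `${deploymentId}:install-${name}`;
  const edges: FinishToStart[] = [];
  for (const c of components) {
    for (const dep of c.dependsOn) {
      // The upstream install must finish before this one starts.
      edges.push({ type: "finish-to-start", fromId: id(dep), toId: id(c.name) });
    }
  }
  return edges;
}
```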

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 10:36:22 +04:00
e3mrah
2fbab45b43
feat(flow-proxy): assemble snapshot from local jobs.Store before upstream proxy (#1429)
* fix(catalyst-api): add OPENOVA_FLOW_SERVER_URL env to chart template

Without this env the proxy resolveFlowServerURL() falls back to
per-deployment FQDN lookup (https://openova-flow.<sovereignFQDN>) which
only exists on Sovereigns that already installed bootstrap-kit slot 56
with httproute=enabled. Every other catalyst-api deployment (mothership
contabo + Sovereigns that haven't reached cutover yet) returns 502 on
/api/v1/flows/{deploymentId}/snapshot. This is the live regression the
founder saw at console.openova.io: "No nodes to render."

The env points at the in-cluster Service DNS for the LOCAL
openova-flow-server. Both the mothership (catalyst-system or catalyst
namespace) and each Sovereign chroot run the bp-openova-flow-server chart
with a local Service, so this URL is correct for every cluster
catalyst-api runs in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(flow-proxy): assemble snapshot from local jobs.Store before upstream proxy

Mothership canvas at /sovereign/provision/<id>/jobs was empty for the
first ~30 minutes of every fresh provision because the snapshot
endpoint went straight to https://openova-flow.<sovereignFQDN> which
can't serve until cilium + cert-manager + the HTTPRoute TLS cert are
all up on the chroot. The Phase-0 + Phase-1 lifecycle Jobs catalyst-api
ALREADY owns (tofu-init/plan/apply/output, flux-bootstrap,
install-bp-<chart>, ...) were invisible the whole time.

This change adds flowSnapshotFromJobs which assembles the canonical
FlowMessage envelope from h.jobsStore().ListJobs(deploymentID) — every
Job becomes a FlowNode with the legacy <deploymentId>:<jobName> id form
the canvas drill-down already expects, every Job.DependsOn becomes a
finish-to-start Relationship, every Job.ParentID becomes a contains
Relationship. HandleFlowSnapshot checks the local store first and
returns immediately when it has data; otherwise falls through to the
existing upstream proxy path.

HandleFlowStream gets the same treatment via flowStreamLocal: emit a
snapshot frame on connect AND every 3 seconds thereafter, plus a 15s
heartbeat. The OpenovaFlow consumer's reducer is idempotent on
snapshot replay so re-emitting an unchanged envelope is harmless;
in exchange the canvas reflects Job state transitions within ~3s
of when helmwatch.Bridge writes them.

No FE change required — the same /api/v1/flows/<id>/snapshot and
/stream endpoints serve the same envelope shape the chroot adapter
emits (products/openova-flow/adapter-flux/internal/types/flow.go),
named SSE events including 'snapshot' and 'heartbeat'.
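
A TypeScript sketch of the Job-to-envelope mapping described above. The `Job`,
`FlowNode`, and `Relationship` shapes are simplified assumptions, dependsOn and
parentId are assumed to hold bare job names, and the `contains` edge uses the
direction quoted from types.ts elsewhere in this log (toId is the parent).

```typescript
type Job = { jobName: string; status: string; dependsOn: string[]; parentId?: string };
type FlowNode = { id: string; label: string; status: string };
type Relationship = {
  type: "contains" | "finish-to-start";
  fromId: string;
  toId: string;
};
type Snapshot = { nodes: FlowNode[]; relationships: Relationship[] };

function flowSnapshotFromJobs(deploymentId: string, jobs: Job[]): Snapshot {
  // Legacy id form the canvas drill-down expects: "<deploymentId>:<jobName>".
  const id = (name: string) => `${deploymentId}:${name}`;
  const nodes = jobs.map((j) => ({ id: id(j.jobName), label: j.jobName, status: j.status }));
  const relationships: Relationship[] = [];
  for (const j of jobs) {
    for (const dep of j.dependsOn) {
      relationships.push({ type: "finish-to-start", fromId: id(dep), toId: id(j.jobName) });
    }
    if (j.parentId) {
      relationships.push({ type: "contains", fromId: id(j.jobName), toId: id(j.parentId) });
    }
  }
  return { nodes, relationships };
}
```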

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 10:06:28 +04:00
e3mrah
50bf7a59ed
fix: F8 - double bp-catalyst-platform HR timeout (15m→30m) + catalyst-api phase1 budget (60m→120m) (#1428)
prov #44 (d9399223c3caa4f9) hit the catalyst-api 60m phase1 watch cap
with bp-catalyst-platform HR still mid-retry (failures=3) and 41/45 HRs
True. F1-F7 are correct and live on main (qa-finalizer-strip Completed,
autoscaler workers joined). The remaining wall is total bootstrap-kit
install time exceeding the outer watch budget on a fresh cpx42×1
Sovereign without a warm Harbor proxy-cache.

Two lock-step changes widen both bounds:

1. clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
   install.timeout 15m → 30m, upgrade.timeout 15m → 30m. The umbrella
   chart genuinely needs >15m worst case when the full SME + Catalyst
   service stack rolls cold.

2. products/catalyst/bootstrap/api/internal/helmwatch/helmwatch.go:
   DefaultWatchTimeout 60m → 120m. Worst-case inner HR retry chain is
   now 30m × 3 = 90m; the outer phase1 budget MUST be larger so the
   watch never terminates while helm-controller still has remediation
   attempts left. CATALYST_PHASE1_WATCH_TIMEOUT env-var override path
   was already wired (issue #538 baseline) — chart template now
   declares the explicit "120m" value so the runtime knob is
   discoverable for capacity-bounded environments. Per
   INVIOLABLE-PRINCIPLES.md #4 the knob remains runtime-configurable.

New unit test TestPhase1WatchConfig_ProductionDefaultIs120m pins the
F8 floor against future regression. Existing env-var override and
field-override tests still pass unchanged.
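
The lock-step constraint can be checked with a few lines of arithmetic. This
sketch uses hypothetical helper names and the retry count of 3 taken from the
"30m × 3 = 90m" figure above.

```typescript
const MINUTE_MS = 60 * 1000;

// Worst-case inner chain: every remediation attempt runs to its full timeout.
function worstCaseRetryChainMs(hrTimeoutMs: number, attempts: number): number {
  return hrTimeoutMs * attempts;
}

// The outer watch budget must be strictly larger than the inner chain.
function watchCoversRetries(watchMs: number, hrTimeoutMs: number, attempts: number): boolean {
  return watchMs > worstCaseRetryChainMs(hrTimeoutMs, attempts);
}

// Raising only the HR timeout would break the invariant (90m > 60m).
const onlyInnerRaised = watchCoversRetries(60 * MINUTE_MS, 30 * MINUTE_MS, 3);
// Raising both keeps it (90m < 120m).
const bothRaised = watchCoversRetries(120 * MINUTE_MS, 30 * MINUTE_MS, 3);
```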

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 08:10:24 +04:00
e3mrah
0ba87bb8da
fix(JobsPage): use FlowNode.id in row anchor href (region prefix) (#1414)
TC-035 (iter-2, 2026-05-11): OpenovaFlow rows merged into JobsPage
(PR #1413) lost their region-prefixed identity in the URL. The link
builder sliced the "<prefix>:" segment off every id with a colon —
intended to strip the legacy "<deploymentId>:install-keycloak" form,
but it also stripped "contabo:bp-openova-flow-server" → bare
"bp-openova-flow-server" in the href. The matrix asserts the
verbatim form "/jobs/contabo:bp-openova-flow-server" must appear in
the rendered DOM.

Fix: stop slicing. `encodeURIComponent` still escapes unsafe path
chars (`/` for live K8s job ids like "job/syft-grype/..."), then we
restore `:` because RFC 3986 permits it as a path-segment `pchar`.
FlowPage canvas navigation (PR #1411) and JobDetail flow-fallback
(PR #1412) already pass on the colon-present form, so this round-
trips end-to-end. Legacy "bp-cilium" / "cluster-bootstrap" hrefs are
unchanged (no `:` to encode). The previously-stripped legacy form
"<deploymentId>:install-keycloak" now lands as the full id in the
URL, and JobDetail's `jobsById` lookup is already keyed by BOTH the
canonical id AND the bare jobName (JobDetail.tsx:124-131), so the
resolution path is preserved.

Test coverage: new Case 4 in JobsPage.flow-merge.test.tsx asserts
the openova-flow row's anchor `href` contains
`/jobs/contabo:bp-openova-flow-server` and is NOT the bare-jobName
form. All 4 flow-merge cases PASS. The 3 pre-existing failures in
JobsPage.test.tsx (back-to-apps href, canonical-columns header,
Show-as-Flow button) are the documented iter-2 baseline — untouched
by this change.
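
The encode-then-restore step in a minimal form. `jobHrefSegment` is a
hypothetical name; the behaviour follows the description above.

```typescript
// Encode the whole id as one path segment, then put ":" back, since RFC 3986
// allows it inside a path segment.
function jobHrefSegment(id: string): string {
  return encodeURIComponent(id).replace(/%3A/g, ":");
}
```

An id that already contains the literal text "%3A" is encoded to "%253A"
first, so the restore step does not turn it into a colon by accident.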

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 22:29:46 +04:00
e3mrah
5332ed0691
fix(JobsPage): merge openova-flow snapshot rows into legacy /jobs table (#1413)
TC-035 iter-1 FAIL (2026-05-11): /sovereign/provision/12e194090631a885/jobs
asserts rows for the openova-flow-server + openova-flow-emitter HRs but the
JobsTable only sourced from /api/v1/deployments/<id>/jobs (legacy event
stream) — verified live: GET /v1/flows/<id>/snapshot returns 2 leaf nodes
(contabo:bp-openova-flow-server, contabo:bp-openova-flow-emitter) whose ids
NEVER appear in the legacy /jobs payload. Sovereigns whose state lives only
in the OpenovaFlow snapshot silently drop these rows.

Fix: wire `useFlowStream({deploymentId})` alongside the existing legacy
reducer + live-jobs backfill. Synthesize a Job stub per FlowNode via
`synthesizeJobFromFlowNode` (PR #1412 — same adapter JobDetail's
flow-fallback path uses) and append the rows whose ids are absent from the
legacy set. Legacy wins dedup on id collisions because it carries real
execution timeline / appId / parentId / dependsOn — the flow synth is
intentionally a minimal stub.

Behavior unchanged for Sovereigns without an active flow stream: empty
FlowNode map → empty `flowJobs` → `legacyMerged` passes through untouched.

Test coverage (JobsPage.flow-merge.test.tsx — 3 cases, all PASS):
  1. Legacy 5 / flow empty → 5 rows, no behavior change.
  2. Legacy 5 / flow has 2 distinct ids → 7 rows with the contabo:bp-*
     ids present.
  3. Legacy 5 / flow has 1 id-collision + 1 new → 6 rows, legacy wins
     dedup (DOM scan asserts the colliding testid appears exactly once).

Validation:
  vitest: 3/3 PASS on new file; 13 prior tests in JobsPage.test.tsx
  unchanged from origin/main baseline (3 unrelated pre-existing failures
  in chrome/columns/Show-as-Flow tests, untouched by this fix).
  tsc --noEmit -p tsconfig.app.json: 27 errors, ALL pre-existing in
  @openova/flow-canvas + @openova/flow-core workspaces — zero new errors
  introduced.

Canonical seam reused (no new code paths):
  - @/lib/openflow-adapter-sse → useFlowStream (FlowPage / JobDetail share)
  - @/lib/synthesizeJobFromFlowNode (PR #1412 helper)
  - @/lib/jobs.types → Job (single source of truth)
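
The merge rule (append flow rows whose ids are absent, legacy wins on
collision) as a small sketch. The `Row` shape and `mergeJobs` name are
illustrative; the real component works on the full Job type.

```typescript
type Row = { id: string; source: "legacy" | "flow" };

function mergeJobs(legacy: Row[], flow: Row[]): Row[] {
  const legacyIds = new Set(legacy.map((j) => j.id));
  // Keep every legacy row; add only flow rows with an unseen id.
  return [...legacy, ...flow.filter((j) => !legacyIds.has(j.id))];
}
```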

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 21:54:14 +04:00
e3mrah
36d1f56840
fix(JobDetail): fall back to OpenovaFlow snapshot when legacy /jobs 404 (#1412)
JobDetail built `jobsById` from the legacy useDeploymentEvents reducer
+ useLiveJobsBackfill polling. For Sovereigns whose state lives ONLY in
the openova-flow snapshot (post-flux-only flow, fresh chroot before the
catalyst-api event bridge has emitted any rows), that lookup misses and
JobDetail short-circuited to "Job not found" — never mounting FlowPage,
the very surface that would have painted the node.

Verified live this turn against deployment 12e194090631a885:
  GET /api/v1/flows/12e194090631a885/snapshot → 200, 2 leaf nodes
  GET /api/v1/deployments/12e194090631a885/jobs/<nodeId> → 404

This blocks ~20 of 26 iter-1 FAILs on the OpenovaFlow canvas test
matrix (TC-019/020/021/023/024/025/027/028/033/034/036/037/038/039/040
/041/042/053/054/060/064).

Fix:
  • JobDetail now reads the same useFlowStream hook FlowPage uses.
  • When `jobsById[jobId]` is undefined, look up the node in the flow
    snapshot's nodes Map. If found, synthesize a flat Job stub from the
    FlowNode (id, label, status) so the canvas mounts with the right
    hostJobId.
  • Behaviour for Sovereigns WITH an active event stream is unchanged
    — the legacy lookup wins and the synth stub is never read.
  • "Job not found" panel renders ONLY when BOTH lookups miss.
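
A sketch of the two-step lookup. The `JobStatus` vocabulary, the stub fields,
and the coercion default are assumptions made for illustration; the real
synthesizeJobFromFlowNode helper is not shown in this log.

```typescript
// Hypothetical vocabulary; the real JobStatus values may differ.
type JobStatus = "pending" | "running" | "succeeded" | "failed";
type FlowNode = { id: string; label: string; status: string };
type JobStub = { id: string; label: string; status: JobStatus; startedAt: number | null };

const KNOWN: readonly string[] = ["pending", "running", "succeeded", "failed"];

function synthesizeJob(node: FlowNode): JobStub {
  return {
    id: node.id,
    label: node.label || node.id, // fall back to the id when label is empty
    status: KNOWN.includes(node.status) ? (node.status as JobStatus) : "pending",
    startedAt: null, // absent timestamps stay null, nothing is invented
  };
}

// Legacy lookup wins; the flow snapshot is consulted only on a miss.
function resolveJob(
  jobsById: Record<string, JobStub | undefined>,
  flowNodes: Map<string, FlowNode>,
  jobId: string,
): JobStub | null {
  const legacy = jobsById[jobId];
  if (legacy) return legacy;
  const node = flowNodes.get(jobId);
  return node ? synthesizeJob(node) : null;
}
```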

Tests:
  Added JobDetail.flow-fallback.test.tsx (vitest, 3 cases):
    1. Legacy has the job → FlowPage renders, no fallback.
    2. Legacy empty, flow snapshot has the node → FlowPage renders
       via synth job (the iter-1 FAIL scenario).
    3. Both empty → "Job not found" panel.
  All 3 new + 5 existing JobDetail tests pass.
  No tsc regressions (27 → 27 baseline errors, all pre-existing
  in flow-canvas/flow-core packages).

Refs INVIOLABLE-PRINCIPLES.md:
  #1 (waterfall): target-state fallback, no MVP "show loading" stub.
  #2 (no compromise): no field is faked with plausible data; absent
    timestamps land as null / 0 so fmtTime renders "—".
  #4 (never hardcode): the synth helper coerces FlowNode.status into
    the JobStatus vocabulary; the label falls back to the node id when
    `label` is empty.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 21:43:43 +04:00