Commit Graph

75 Commits

Author SHA1 Message Date
e3mrah
22855e62d8
feat(openova-flow): catalyst-api proxy + cloud-init thread (Agent #3 — integrator, infra-side) (#1396)
Final integration piece for OpenovaFlow infrastructure path —
catalyst-api proxy + cloud-init substitution for SOVEREIGN_DEPLOYMENT_ID
+ SOVEREIGN_REGION_KEY, so bp-openova-flow-emitter (slot 57) emits
distinct region tags on every FlowNode and the snapshot returns two
entries per HelmRelease (HR) on a multi-region Sovereign.

Builds on PR #1389 (TS core + canvas packages on disk), PR #1390 (Go
server + flux adapter + bootstrap-kit slots 56/57), PR #1394 (catalyst-ui
temporary revert until npm workspaces land), PR #1395 (chart no-op).

## Scope vs original Agent #3 brief

The brief planned a 4-section PR (proxy + cloud-init + FlowPage rewire +
runbook). Section 3 (catalyst-ui rewire of @openova/flow-*) is deferred:
PR #1394 reverted Agent #1's UI wiring because the Docker UI build has
no node_modules for the cross-workspace canvas source. Founder note on
#1394: "Agent #3 (or a follow-up) will re-wire them properly once npm
workspaces are configured at repo root."

This PR ships the infrastructure half (proxy + cloud-init + runbook).
The canvas-side rewire is a separate follow-up PR that needs npm
workspaces, not surgical edits to FlowPage.

## What ships

### 1. catalyst-api proxy /api/v1/flows/{deploymentId}/{snapshot,stream,events}

products/catalyst/bootstrap/api/internal/handler/openova_flow_proxy.go:
- GET /snapshot — JSON pass-through, headers + status forwarded
- GET /stream — unbuffered SSE pass-through using http.Flusher (NOT
  httputil.ReverseProxy; explicit flushes keep text/event-stream frames unbuffered)
- POST /events — body forwarded byte-for-byte
- Upstream URL from env OPENOVA_FLOW_SERVER_URL (default Sovereign
  in-cluster Service DNS)

Routes registered in cmd/api/main.go inside the auth-gated chi.Group.

11 table-driven tests cover snapshot/events/stream pass-through, upstream
404/400/unreachable propagation, empty-deploymentId guard, SSE frames
arrive AS EMITTED, and env-default fallback.
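
A minimal sketch of the flush-per-write pass-through described above, assuming a plain GET upstream. The function name `proxySSE`, the `upstreamURL` parameter, the header values, and the 502 on an unreachable upstream are assumptions made for illustration; the handler in openova_flow_proxy.go may differ.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// proxySSE relays an upstream event stream to the client and flushes after
// every line. upstreamURL is a hypothetical parameter for this sketch.
func proxySSE(w http.ResponseWriter, r *http.Request, upstreamURL string) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	// Tying the upstream request to r.Context() cancels it when the client leaves.
	req, err := http.NewRequestWithContext(r.Context(), http.MethodGet, upstreamURL, nil)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// Assumed status code; the commit only says "unreachable propagation".
		http.Error(w, "upstream unreachable", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	// Header values are assumed; the commit only names the three headers.
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	w.Header().Set("X-Accel-Buffering", "no")
	w.WriteHeader(resp.StatusCode)
	flusher.Flush()

	reader := bufio.NewReader(resp.Body)
	for {
		line, readErr := reader.ReadString('\n')
		if line != "" {
			if _, writeErr := io.WriteString(w, line); writeErr != nil {
				return
			}
			flusher.Flush()
		}
		if readErr != nil {
			return // io.EOF, or the request context was cancelled
		}
	}
}

// demoRoundTrip runs a fake upstream behind the proxy and returns what a
// client receives.
func demoRoundTrip() (string, error) {
	upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/event-stream")
		for i := 1; i <= 2; i++ {
			fmt.Fprintf(w, "data: frame-%d\n\n", i)
			w.(http.Flusher).Flush()
		}
	}))
	defer upstream.Close()

	proxy := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		proxySSE(w, r, upstream.URL)
	}))
	defer proxy.Close()

	resp, err := http.Get(proxy.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := demoRoundTrip()
	if err != nil {
		panic(err)
	}
	fmt.Print(body)
}
```

The sketch forwards line by line, so each SSE frame (terminated by a blank line) reaches the client as soon as the upstream writes it.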

### 2. Cloud-init threads SOVEREIGN_DEPLOYMENT_ID + SOVEREIGN_REGION_KEY

- infra/hetzner/cloudinit-control-plane.tftpl — two new
  postBuild.substitute keys alongside SOVEREIGN_FQDN/SOVEREIGN_LB_IP
- infra/hetzner/main.tf — primary CP renders var.region as region key;
  secondary CP renders each.key (e.g. "hel1-1") from for_each over
  local.secondary_regions
- infra/hetzner/variables.tf — new sovereign_deployment_id var (string,
  default "" for tofu mocks)
- provisioner.go writeTfvars — writes vars["sovereign_deployment_id"]
  = req.DeploymentID
- bootstrap-kit slot 57 — swap placeholder ${SOVEREIGN_FQDN} / literal
  "primary" for the new ${SOVEREIGN_DEPLOYMENT_ID} / ${SOVEREIGN_REGION_KEY}
  envsubst keys

### 3. Deployment record flag

handler/deployments.go State() — emits `openovaFlowEnabled: true` on
every deployment. The catalyst-ui rewire (follow-up PR) will read this
to enable the openova-flow-server adapter; legacy provisions without
the flag will keep the bridge once the rewire lands.

### 4. Verification runbook

docs/runbooks/openova-flow-multi-region-verify.md — prov #34 POST body
(multi-region cpx42 fsn1+hel1, qaTestEnabled=true,
sovereignFQDN=omantel.biz), step-by-step kubectl/curl gates, visual
canvas checks (gated on the follow-up UI rewire), and a failure-class
triage table.

## Canonical-seam citations

1. SSE pattern — products/catalyst/bootstrap/api/internal/handler/
   deployments.go:1244-1287 (StreamLogs): identical Content-Type +
   Cache-Control + X-Accel-Buffering header set; identical
   http.Flusher.Flush() after each write; identical r.Context().Done()
   cancel path.

2. postBuild.substitute pattern — infra/hetzner/cloudinit-control-plane.tftpl:884-893
   (SOVEREIGN_FQDN + SOVEREIGN_LB_IP): same indentation, same KEY: ${var}
   form, dual emission at primary + secondary CP for_each in main.tf.

## Verification

```
$ go build ./...
(clean)

$ go vet ./...
(clean)

$ go test ./internal/handler/ -run TestFlowProxy -count=1 -race
ok    github.com/openova-io/openova/products/catalyst/bootstrap/api/internal/handler   1.410s

$ go test ./internal/provisioner/... -count=1
ok    github.com/openova-io/openova/products/catalyst/bootstrap/api/internal/provisioner  0.025s
```

3 pre-existing test failures (TestHandleWhoami_NoRBACOmitsFields,
TestHandleWhoami_PinSessionRBACClaims,
TestUnstructuredToUserAccess_NilApplicationsBecomesEmpty) reproduce on
main HEAD without this PR — unrelated baseline state.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 16:01:09 +04:00
e3mrah
4e6bec7022
fix(infra): body-supplied SKUs win over QA defaults (Fix #183) (#1386)
* fix(catalyst-ui): delete malformed `import type from react` line (Fix #181)

PR #1383 (Fix #180) merged with a `sed -i` error that produced `import type  from 'react'`
(an empty import binding), which is a syntax error and broke the main build.
This PR removes the malformed line entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): pin LB private IPs + revert hel1 zone (Fix #182)

Root cause of prov #32 FATAL "hcloud/inlineAttachServerToNetwork:
attach server to network: IP not available" on hcloud_server.control_plane[0]:

  hcloud_load_balancer_network.{main,secondary} both attached to the
  shared network WITHOUT an explicit `ip` argument. Hetzner auto-allocates
  the first free IP from the first matching-zone subnet. In the
  multi-region prov #32 the secondary LB-network (hel1) completed first
  at t+16s and took 10.0.1.2 from the only eu-central subnet existing
  at that moment (`main` = 10.0.1.0/24) — stealing the IP the primary
  CP claims explicitly via `ip = "10.0.1.${count.index + 2}"`.

  Fix: pin LB anchors to top-of-subnet (.254) so they live outside the
  CP/worker IP range (.2..N for CPs, .10+ for workers).

Also revert Fix #179 (`hel1 = "eu-north"`). Hetzner /v1/locations API
on 2026-05-11 returns network_zone=eu-central for hel1. Fix #179 caused
prov #32's secondary subnet to fail with `invalid input in field
'network_zone' [network zone does not exist]`. The original prov #29/#30
"IP not available on secondary[hel1-1]" was the same LB-IP collision —
this PR resolves both.

Multi-region apply now lands cleanly:
  10.0.1.2     -> primary CP (cp1)
  10.0.1.254   -> primary LB anchor
  10.0.10.2    -> secondary CP (hel1-1)
  10.0.10.254  -> secondary LB anchor (hel1-1)
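
The address plan above can be sketched in Go as follows. This is illustrative only: `hostAt` and the offset constants are hypothetical names, the real allocation lives in main.tf, and the secondary subnet 10.0.10.0/24 is inferred from the addresses listed above rather than stated in the commit.

```go
package main

import (
	"fmt"
	"net/netip"
)

// Offsets taken from the commit text: CPs start at .2, workers at .10,
// and the LB anchor is pinned to .254.
const (
	firstCPOffset     = 2
	firstWorkerOffset = 10
	lbAnchorOffset    = 254
)

// hostAt returns the address whose last octet is offset inside an IPv4 /24.
// Other prefix lengths are rejected to keep the sketch simple.
func hostAt(cidr string, offset uint8) (netip.Addr, error) {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return netip.Addr{}, err
	}
	if !prefix.Addr().Is4() || prefix.Bits() != 24 {
		return netip.Addr{}, fmt.Errorf("sketch only handles IPv4 /24 subnets, got %s", cidr)
	}
	octets := prefix.Masked().Addr().As4()
	octets[3] = offset
	return netip.AddrFrom4(octets), nil
}

func main() {
	for _, cidr := range []string{"10.0.1.0/24", "10.0.10.0/24"} {
		cp, err := hostAt(cidr, firstCPOffset)
		if err != nil {
			panic(err)
		}
		lb, err := hostAt(cidr, lbAnchorOffset)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: first CP %s, LB anchor %s\n", cidr, cp, lb)
	}
}
```

Because the LB anchor has a fixed address above the CP and worker ranges, the order in which Hetzner completes the attachments no longer decides who gets 10.0.1.2.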

Refs: openova-private prov-loop session 2026-05-11 Wave 26

* fix(infra): body-supplied SKUs win over QA defaults (Fix #183)

Fix #157 introduced `effective_cp_size = coalesce(var.qa_control_plane_size,
var.control_plane_size)` when qa_fixtures_enabled='true'. Because
qa_control_plane_size has a non-empty default (cpx32), coalesce always
returned the QA default and silently overrode whatever the body supplied
in `controlPlaneSize`.

Founder-supplied body for prov #32 specified `controlPlaneSize: "cpx42"`
explicitly (cheapest viable for the founder's collapsed-CP+worker
single-node-per-region topology with workerCount=0). The QA-default
override downgraded that to cpx32 at plan time — the explicit choice
never made it onto the hardware.

Fix #183 — invert the coalesce so body wins:

  effective_cp_size = local.qa_mode
    ? coalesce(var.control_plane_size, var.qa_control_plane_size)
    : var.control_plane_size

`provisioner.go` writeTfvars already emits control_plane_size / worker_size
only when the body's field is non-empty (so `var.control_plane_size`
inherits variables.tf's cost-optimised default when the body left it
blank). That means `coalesce(var.control_plane_size, var.qa_*)` always
has a non-empty first arg in normal flow; the QA-default fallback only
fires on a zero-override QA call that intentionally leaves the SKU empty.

No change to customer-Sovereign behaviour (qa_fixtures_enabled='false'
branch already used `var.control_plane_size` verbatim).
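
The ordering bug is easy to see in a small Go model of the two expressions. This is a sketch, not the HCL itself: `coalesce` here returns the first non-empty string and, unlike the tofu function, returns "" instead of erroring when every argument is empty; the function names are hypothetical.

```go
package main

import "fmt"

// coalesce returns the first non-empty argument, or "" if there is none.
func coalesce(values ...string) string {
	for _, v := range values {
		if v != "" {
			return v
		}
	}
	return ""
}

// effectiveCPSizeBefore models the Fix #157 ordering: the QA default comes first.
func effectiveCPSizeBefore(qaMode bool, bodySize, qaDefault string) string {
	if qaMode {
		return coalesce(qaDefault, bodySize)
	}
	return bodySize
}

// effectiveCPSizeAfter models the Fix #183 ordering: the body value comes first.
func effectiveCPSizeAfter(qaMode bool, bodySize, qaDefault string) string {
	if qaMode {
		return coalesce(bodySize, qaDefault)
	}
	return bodySize
}

func main() {
	// Body asks for cpx42 on a QA Sovereign whose QA default is cpx32.
	fmt.Println("before:", effectiveCPSizeBefore(true, "cpx42", "cpx32"))
	fmt.Println("after: ", effectiveCPSizeAfter(true, "cpx42", "cpx32"))
}
```

With a non-empty QA default the "before" ordering can never return the body value, which matches the downgrade seen on prov #32.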

Refs: openova-private prov-loop session 2026-05-11 Wave 26

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 13:04:41 +04:00
e3mrah
515c3cf38d
fix(infra): pin LB private IPs + revert hel1 zone (Fix #182) (#1385)
* fix(catalyst-ui): delete malformed `import type from react` line (Fix #181)

PR #1383 (Fix #180) merged with a `sed -i` error that produced `import type  from 'react'`
(an empty import binding), which is a syntax error and broke the main build.
This PR removes the malformed line entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): pin LB private IPs + revert hel1 zone (Fix #182)

Root cause of prov #32 FATAL "hcloud/inlineAttachServerToNetwork:
attach server to network: IP not available" on hcloud_server.control_plane[0]:

  hcloud_load_balancer_network.{main,secondary} both attached to the
  shared network WITHOUT an explicit `ip` argument. Hetzner auto-allocates
  the first free IP from the first matching-zone subnet. In the
  multi-region prov #32 the secondary LB-network (hel1) completed first
  at t+16s and took 10.0.1.2 from the only eu-central subnet existing
  at that moment (`main` = 10.0.1.0/24) — stealing the IP the primary
  CP claims explicitly via `ip = "10.0.1.${count.index + 2}"`.

  Fix: pin LB anchors to top-of-subnet (.254) so they live outside the
  CP/worker IP range (.2..N for CPs, .10+ for workers).

Also revert Fix #179 (`hel1 = "eu-north"`). Hetzner /v1/locations API
on 2026-05-11 returns network_zone=eu-central for hel1. Fix #179 caused
prov #32's secondary subnet to fail with `invalid input in field
'network_zone' [network zone does not exist]`. The original prov #29/#30
"IP not available on secondary[hel1-1]" was the same LB-IP collision —
this PR resolves both.

Multi-region apply now lands cleanly:
  10.0.1.2     -> primary CP (cp1)
  10.0.1.254   -> primary LB anchor
  10.0.10.2    -> secondary CP (hel1-1)
  10.0.10.254  -> secondary LB anchor (hel1-1)

Refs: openova-private prov-loop session 2026-05-11 Wave 26

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 13:00:50 +04:00
e3mrah
7aa1b24c0d
fix(infra/hetzner): hel1 network_zone is eu-north not eu-central (#179) (#1381)
prov #29 + prov #30 both failed at +90s with:
  Error: hcloud/inlineAttachServerToNetwork: attach server to network:
  IP not available (ip_not_available, ...)
  with hcloud_server.secondary_control_plane["hel1-1"]

Root cause: `local.hetzner_network_zones` hardcoded `hel1 = "eu-central"`.
Helsinki is physically in Hetzner's eu-north zone (Finland), not eu-central
(Falkenstein/Nuremberg). Hetzner subnets are zone-bound: when the secondary
hel1 subnet is created with network_zone=eu-central, the subnet exists but
attaching a server in location=hel1 (physical eu-north) returns
ip_not_available because cross-zone attach isn't supported.

Fix: hel1 -> eu-north. Caught live on prov #29 + #30 (omantel.biz 2-region
fsn1+hel1 reprov, both failed at the same line 872 secondary CP attach).

Per CLAUDE.md ARCHITECT-FIRST: Hetzner publishes zone-region mapping at
https://docs.hetzner.com/cloud/general/locations/; hel1 is unambiguously
listed under eu-north.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 12:26:18 +04:00
e3mrah
8308f53e32
fix(infra/hetzner): auto-flip QA Sovereigns to cpx32/cpx42 nodes (Fix #157) (#1360)
12 of 12 fresh Sovereign provisions in the 2026-05-10 bounded-cycle
session wedged on the production cpx22 CP / cpx32 worker defaults
(memory entry: "provision #5 cpx22 OOM" + handover doc). Root cause:
the CP's documented ~3.5GB k3s+cilium+flux+cert-manager+sealed-secrets
working set leaves zero RAM headroom for Flux source-controller's
~700MB burst during the 44-slot bootstrap-kit apply, while two cpx32
workers (8GB each) cannot satisfy the simultaneous request set from
bp-keycloak (2Gi JVM) + bp-harbor (~2.5Gi across 6 sub-components) +
bp-cnpg primary + bp-openbao 3-replica Raft once the qaFixtures
Continuum + CNPGPair + status-seeder Jobs queue.

Mirrors the Fix #123 pattern (wildcard_cert_use_staging) — auto-flips
ONLY when qa_fixtures_enabled='true'. Customer-facing Sovereigns
(SME / marketplace / admin / console) provision with
qa_fixtures_enabled='false' so coalesce() in main.tf falls back to the
existing cpx22/cpx32 defaults; the production code path is untouched.

  - variables.tf: qa_control_plane_size (default cpx32), qa_worker_size
    (default cpx42) with the same Hetzner SKU regex validation as the
    production size variables.
  - main.tf: locals.qa_mode + locals.effective_cp_size +
    locals.effective_worker_size; hcloud_server.control_plane and .worker
    read the effective locals so QA Sovereigns auto-flip and customer
    Sovereigns plan-clean unchanged.
  - tests/multi_region.tftest.hcl: three new run blocks pin the
    contract — qa_mode=false keeps cpx22/cpx32, qa_mode=true flips
    to cpx32/cpx42 defaults, qa_mode=true respects explicit operator
    overrides (no hardcoded SKU per docs/INVIOLABLE-PRINCIPLES.md #4).

Per principle 17 (isolated worktree) shipped from
.claude/worktrees/qa-node-sizing-157. Per principle 4 (target-state) attacks the
systemic OOM-cascade root cause rather than another per-blueprint
timeout bandaid. Per principle 16 (canonical seam) the SKU choice
lives in variables.tf defaults + per-resource selection in main.tf;
no other path mutates server_type. Per principle 18 no SKU is
hardcoded — every value is operator-overridable.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 10:04:44 +04:00
e3mrah
901afa2a95
fix(infra/hetzner): add skip_region_validation=true to aws provider for Hetzner regions (#135) (#1344)
Fix #133 (PR #1343) swapped aminueza/minio for hashicorp/aws to bypass
DeleteBucketPolicy AccessDenied. Worked for the bucket creation API,
but the aws provider's region validator runs at provider-init time and
rejects Hetzner regions (fsn1/nbg1/hel1) before any S3 call:

    Error: invalid AWS Region: fsn1
    provider["registry.opentofu.org/hashicorp/aws"]

Reproduced on prov #19 (02c23fc20df90629) — failed at `tofu plan`
in 96s. Companion to the existing skip_credentials_validation +
skip_metadata_api_check + skip_requesting_account_id flags that
already disable the other AWS-specific preflight checks the Hetzner
endpoint can't satisfy.

skip_region_validation=true tells the provider not to compare the
region string against AWS's hardcoded region list; the region is
still passed through to the S3 SDK (used as the SigV4 signing region)
which is what Hetzner expects.

Per CLAUDE.md principle 16: same canonical seam as the other skip_*
flags in the same provider block — this is the missing fourth flag in
the standard "non-AWS S3-compatible backend" pattern.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 04:12:50 +04:00
e3mrah
5d43cf7b53
fix(infra/hetzner): swap aminueza/minio for hashicorp/aws to escape AccessDenied wedge (#133) (#1343)
Root cause of provisions #13 / #17 failing in <2 min at `tofu apply`
with:

    [FATAL] [ACL] Unable to create bucket (catalyst-omantel-biz-<id>):
    unable to remove bucket policy: Access Denied.

`aminueza/minio v3.34.0`'s `minio_s3_bucket` Create handler calls
`DeleteBucketPolicy` post-create as part of state normalization (the
provider treats "no policy" as the canonical zero state and forcibly
clears any inherited policy). Hetzner Object Storage's standard
read/write credentials don't grant `s3:DeleteBucketPolicy`, so the
call fails AccessDenied EVERY TIME -- the bucket IS created on
Hetzner's side but tofu marks the resource as failed and rolls back
the apply, blocking every fresh Sovereign provision from reaching
Phase 1. The wedge is deterministic, not flaky.

Provider swap rationale -- `hashicorp/aws` configured against
Hetzner's S3 endpoint speaks vanilla S3 and does NOT do any
post-create policy normalization. A successful CreateBucket is the
terminal state for `aws_s3_bucket` Create. Hetzner officially
documents AWS CLI / SDK as a supported S3 client (see
https://docs.hetzner.com/storage/object-storage/getting-started/using-s3-api-tools/),
so this is the canonical-vendor path, not a workaround.

Changes:
  * `versions.tf` -- drop `aminueza/minio`, add `hashicorp/aws ~> 5.0`
    pointed at `https://<region>.your-objectstorage.com` with
    `s3_use_path_style = true` and the four `skip_*` flags that
    disable AWS-specific preflight calls (STS, IMDS) Hetzner doesn't
    implement.
  * `main.tf` -- `minio_s3_bucket.main` -> `aws_s3_bucket.main`
    (no force_destroy preserved). Add `aws_s3_bucket_acl.main` for
    `private` (the bucket-level acl arg was removed in aws-provider
    5.x). Updated comment block explains the AccessDenied root cause
    inline so future readers don't repeat the journey.
  * `outputs.tf` -- `minio_s3_bucket.main.bucket` ->
    `aws_s3_bucket.main.bucket`.
  * `variables.tf` -- prose-only updates pointing at the new provider
    + the fix-#133 root-cause note.
  * `tests/multi_region.tftest.hcl` -- override_resource swap from
    `minio_s3_bucket.main` to `aws_s3_bucket.main` +
    `aws_s3_bucket_acl.main` so the offline tftest mock path still
    bypasses provider validation.
  * `cloudinit-control-plane.tftpl` -- two comment lines updated to
    reference the new resource name (no behavioural change).
  * `.terraform.lock.hcl` -- removed (regenerated by `tofu init`
    against the new provider set; CI's `tofu init -backend=false`
    step relocks deterministically).

Idempotency / state migration:
  * Fresh-provision-only path -- existing prov state lives in PDM and
    is recycled per provision. New provs: `tofu init` pulls the aws
    provider, `tofu apply` creates `aws_s3_bucket` with the same name
    Hetzner already owns and gets BucketAlreadyOwnedByYou (200, no-op
    in the AWS SDK). Idempotent.
  * Long-lived Sovereigns (sme/marketplace/admin/console -- protected
    per ADR-0001 §9.4) are NOT re-applied; their tofu state is stable.
    No `state mv` runbook is required.

Test plan:
  * `tofu fmt -check -recursive` -- expected pass (manual indent matches
    fmt output).
  * `tofu validate` (CI's infra-hetzner-tofu workflow) -- expected pass.
  * `tofu test` against `tests/multi_region.tftest.hcl` -- expected pass
    on all 5 scenarios (mock_provider for hcloud + override_resource
    for the two new aws resources).
  * `tofu apply` is NOT runnable from this env (no Hetzner creds); CI's
    test-hetzner-e2e workflow exercises the live path on PR merge.

Refs #133.

Co-authored-by: Claude (e3mrah) <noreply@anthropic.com>
2026-05-11 03:59:15 +04:00
e3mrah
90aa2767da
fix(bp-cert-manager-powerdns-webhook,bp-catalyst-platform): staging ClusterIssuer for QA Sovereigns (Fix #123, LE rate-limit bypass) (#1339)
Root cause (qa-loop iter-1 wedge, 2026-05-10):
  Let's Encrypt production hit the 5-certs/168h rate limit on
  *.omantel.biz (retry after 2026-05-11 22:08 UTC). Cilium-envoy
  could not get a wildcard cert -> console.omantel.biz TLS handshake
  failed -> iter-1 Test Executor could not run. Customer Sovereigns
  are unaffected (one cert per registered domain in their lifetime),
  but QA Sovereigns wipe + re-provision dozens of times in a session
  and exhaust the production ceiling within hours.

Fix (target-state, NOT workaround):
  - bp-cert-manager-powerdns-webhook 1.1.0 ships a SECOND ClusterIssuer
    (letsencrypt-dns01-staging-powerdns) alongside the existing
    production one. Same DNS-01 webhook config (same PowerDNS endpoint,
    same API key) -> only the ACME directory URL + account key differ.
    Both ClusterIssuers are real cert-manager resources; LE treats them
    as wholly independent issuers so a rate-limit hit on production
    does NOT block staging issuance.
  - bp-catalyst-platform 1.4.136 adds wildcardCert.useStaging (bool,
    default false). When true, sovereign-wildcard-certs.yaml renders
    Certificate(s) with issuerRef.name pointing at the staging issuer
    instead of production.
  - bootstrap-kit slot 13 wires WILDCARD_CERT_USE_STAGING via envsubst,
    same passthrough pattern as QA_FIXTURES_ENABLED.
  - catalyst-api auto-stamps wildcard_cert_use_staging="true" on QA
    Sovereigns (Request.QATestEnabled=true) so the per-Sovereign
    overlay flips both QA fixtures + staging certs from one wizard
    toggle.
  - tofu var wildcard_cert_use_staging propagates through main.tf
    into the cloudinit postBuild.substitute block on both primary +
    secondary regions.

Result:
  cilium-envoy on a fresh QA Sovereign gets a staging-signed wildcard
  cert in <2min (no production rate limit). curl -sk + Playwright
  (ignoreHTTPSErrors:true) accept the cert; iter-1 Executor can run
  within minutes of provision. Customer Sovereigns (QATestEnabled=
  false) keep getting real-trusted production certs.

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode): every ACME URL
+ issuer name is values-overridable. Operators wiring a private
staging ACME (e.g. internal Smallstep CA) override via per-Sovereign
overlay without rebuilding any Blueprint. Staging is the documented
LE pattern (https://letsencrypt.org/docs/staging-environment/), not a
band-aid.

_None directly -- infrastructure fix; bypasses Let's Encrypt 5/168h rate limit on QA Sovereigns by using staging ACME endpoint, enabling iter-1 to run within minutes of fresh provision_

Co-authored-by: alierenbaysal <159913086+alierenbaysal@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 01:08:07 +04:00
e3mrah
3a5d9fc102
fix(infra,catalyst-api provisioner): tftpl CI guard + bucket-name suffix (Fix #101 followup, Fix #111) (#1331)
Two infrastructure-hardening fixes that together eliminate ~30 min
of provision-cycle waste per regression event documented in Fix #101.

## Fix A — CI guard against unescaped tftpl shell expansion

Adds a grep-based step to .github/workflows/infra-hetzner-tofu.yaml
that scans every infra/hetzner/*.tftpl for unescaped ${VAR:-default}
inside YAML comment lines. Uses PCRE negative-lookbehind so correctly
escaped $${VAR:-default} (templatefile() literal-dollar) does not
trip the guard.

Background: PR #1311 (Fix #73) added a YAML comment with bare
${QA_FIXTURES_ENABLED:-false}. tofu's templatefile() parses ALL
${...} sequences regardless of YAML/HCL/shell context; the colon
in the interpolation hits HCL's reserved conditional grammar and
crashes 'tofu plan' with "Template interpolation doesn't expect
a colon at this location". Prov #9 (4204f0b0c5e37a80) wasted
~30 min before PR #1328 fixed the one offender. Without the guard,
the next operator who adds a similar comment repeats the incident.

Documented in infra/hetzner/README.md so editors learn the $$
escape pattern before they trip the CI gate.
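
For illustration, the same check can be sketched in Go. The CI step itself is a grep with a PCRE lookbehind; Go's regexp package has no lookbehind, so this sketch swaps in a leading "start of line or any character other than `$`" group. The pattern and the `findUnescaped` name are assumptions, not the workflow's actual expression.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// unescapedDefault matches ${NAME:- when it is not preceded by another '$'.
var unescapedDefault = regexp.MustCompile(`(^|[^$])\$\{[A-Za-z_][A-Za-z0-9_]*:-`)

// findUnescaped returns the 1-based numbers of comment lines that contain an
// unescaped shell-style default expansion.
func findUnescaped(template string) []int {
	var flagged []int
	for i, line := range strings.Split(template, "\n") {
		if !strings.HasPrefix(strings.TrimSpace(line), "#") {
			continue // only YAML comment lines are inspected
		}
		if unescapedDefault.MatchString(line) {
			flagged = append(flagged, i+1)
		}
	}
	return flagged
}

func main() {
	template := strings.Join([]string{
		"# default is ${QA_FIXTURES_ENABLED:-false}",
		"# default is $${QA_FIXTURES_ENABLED:-false}",
		"SOVEREIGN_FQDN: ${sovereign_fqdn}",
		"# plain ${sovereign_fqdn} reference",
	}, "\n")
	fmt.Println("flagged lines:", findUnescaped(template))
}
```

Only the first line is flagged: the second is correctly escaped, the third is not a comment, and the fourth has no `:-` default.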

## Fix B — bucket-name suffix to escape global Hetzner namespace

Hetzner Object Storage bucket names share a GLOBAL namespace
across every tenant. The previous BucketNameForSovereign(fqdn)
derivation 'catalyst-<fqdn-with-dashes>' would collide on the
second CreateDeployment for the same FQDN (re-provision after
wipe, two operators on adjacent pools, race conditions) and the
second 'tofu apply' would fail with BucketAlreadyExists.

Change BucketNameForSovereign signature to (fqdn, deploymentID)
and append the first 8 chars of the deployment-id as a suffix:

  catalyst-omantel-omani-works-b3b837a2

newID() already returns 16 random hex characters; the leading 8 chars are
32 bits of fresh entropy, enough to make accidental collisions between
re-provisions unlikely. Backward-compat: empty deploymentID (legacy on-disk
records) falls back to first-8-hex of sha256(fqdn) so wipes of
pre-Fix-111 Sovereigns remain deterministic.
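
A sketch of the naming rule as described, assuming the FQDN is lower-cased before both the dash substitution and the sha256 fallback (the commit does not say which form is hashed). `bucketNameSketch` and `bucketSuffixSketch` are hypothetical stand-ins for BucketNameForSovereign and bucketSuffix(), and the deployment ID in `main` is made up apart from its first 8 characters.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// isHex reports whether s consists only of lowercase hex digits.
func isHex(s string) bool {
	for _, r := range s {
		if !(r >= '0' && r <= '9') && !(r >= 'a' && r <= 'f') {
			return false
		}
	}
	return s != ""
}

// bucketSuffixSketch uses the first 8 hex chars of the deployment ID, or the
// first 8 hex chars of sha256(fqdn) when the ID is empty or not hex.
func bucketSuffixSketch(fqdn, deploymentID string) string {
	id := strings.ToLower(deploymentID)
	if len(id) >= 8 && isHex(id[:8]) {
		return id[:8]
	}
	sum := sha256.Sum256([]byte(strings.ToLower(fqdn)))
	return hex.EncodeToString(sum[:])[:8]
}

// bucketNameSketch builds catalyst-<fqdn-with-dashes>-<suffix>.
func bucketNameSketch(fqdn, deploymentID string) string {
	stem := strings.ReplaceAll(strings.ToLower(fqdn), ".", "-")
	return "catalyst-" + stem + "-" + bucketSuffixSketch(fqdn, deploymentID)
}

func main() {
	fmt.Println(bucketNameSketch("omantel.omani.works", "b3b837a2c5e37a80"))
	fmt.Println(bucketNameSketch("omantel.omani.works", "")) // legacy fallback
}
```

Two provisions of the same FQDN get different names as long as their deployment IDs differ in the first 8 characters, while the empty-ID fallback stays stable across calls.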

Call-sites updated:
  - handler/deployments.go: id := newID() moved before
    bucket-name derivation; uses hetzner.BucketNameForSovereign
  - handler/wipe.go: passes dep.ID to PurgeBuckets and to
    BucketNameForSovereign in the report
  - hetzner/buckets.go: PurgeBuckets signature now takes
    deploymentID; bucketSuffix() handles the fallback

Tests:
  - hetzner/buckets_test.go: 6-case TestBucketNameForSovereign
    table covers canonical newID() shape, collision avoidance,
    uppercase normalisation, empty + non-hex fallback paths.
    New TestBucketNameForSovereign_CollisionAvoidance asserts
    the Fix #111 invariant directly.
  - handler/deployments_test.go:
    TestCreateDeployment_DerivesObjectStorageBucketFromFQDN
    now asserts the suffixed shape against the actual dep.ID.
  - All produced names re-validated against the S3 bucket-naming
    rules (mirrored regex from provisioner.s3BucketNamePattern).

## Claimed TCs

_None directly — infrastructure hardening; eliminates 30+ min
wasted per cycle from regressions like PR #1311 + bucket-collision_

## Verification

- go test ./internal/hetzner/... -run "Bucket" → 9/9 PASS
- go test ./internal/handler/ -run "DerivesObjectStorageBucket" → PASS
- go vet ./... → clean
- go build ./... → clean
- yaml.safe_load on workflow → clean
- pre-existing handler-package fails (whoami, continuum-switchover)
  are unrelated and present on origin/main

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 23:31:56 +04:00
e3mrah
0843f02269
fix(infra/hetzner): escape ${VAR:-default} in tftpl comment (PROV-9 BLOCKER) (#1328)
PR #1311 (Fix #73) added a YAML comment in cloudinit-control-plane.tftpl
line 933 that referenced the envsubst placeholder
${QA_FIXTURES_ENABLED:-false}. tofu's templatefile() parses ALL ${...}
sequences regardless of YAML/HCL/shell context, and the colon inside
the interpolation makes it choke with:

  Extra characters after interpolation expression; Template
  interpolation doesn't expect a colon at this location.

Result: every prov-* attempt since #1311 merged fails `tofu plan` with exit 1 in
~2 seconds. Prov #9 (4204f0b0c5e37a80) failed at 18:51 UTC with this
error before any Hetzner resource was created.

Fix: change ${QA_FIXTURES_ENABLED:-false} to $${QA_FIXTURES_ENABLED:-false}
(HCL escape: $$ renders as a literal $ in the cloud-init output, which
envsubst then interprets at apply time). Same precedent: commit 7e5c4375
"escape $ in tftpl comments referencing envsubst placeholders".

This is a 1-char fix on a comment. No runtime behavior change. Unblocks
the qa-loop bounded-provision-cycle.

Refs Fix #98, Fix #95, Fix #73 (regression).

Co-authored-by: e3mrah <alierenbaysal@gmail.com>
2026-05-10 22:53:49 +04:00
e3mrah
b22975cb4b
fix(catalyst-api provisioner): qaTestEnabled flag auto-sets QA_FIXTURES_ENABLED for QA Sovereigns (qa-loop bounded-cycle Fix #73) (#1311)
Provision #7 came up zero-touch but the bp-catalyst-platform qaFixtures
stack stayed off because the chart template defaults to
${QA_FIXTURES_ENABLED:-false} and the catalyst-api provisioner never
threaded the toggle. Result: ~140 of the qa-loop matrix's TCs were
inherently fixture-blocked on every QA Sovereign.

Canonical seam: provisioner.Request struct. New fields:

  - QATestEnabled       bool   `json:"qaTestEnabled"`            (default false)
  - QAFixturesNamespace string `json:"qaFixturesNamespace,...`   (default derived)
  - QAOrganization      string `json:"qaOrganization,...`        (default derived)

When QATestEnabled=true, writeTfvars emits
qa_fixtures_enabled="true" + qa_test_session_enabled="true" plus
qa_fixtures_namespace + qa_organization derived from
SovereignFQDN's first label per docs/INVIOLABLE-PRINCIPLES.md #4
(never hardcode):

  omantel.biz       -> qa-omantel       / omantel-platform
  qa.example.com    -> qa-qa            / qa-platform
  demo.openova.io   -> qa-demo          / demo-platform

Customer Sovereigns provision with QATestEnabled=false (default) -> no
qa-fixture artifacts on production tenants.
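
The derivation table above is consistent with a rule like the following. This is a reconstruction from the three examples, not the provisioner's code: `qaNamespaceFor` and `qaOrganizationFor` are hypothetical names modelled on deriveQAFixturesNamespace and deriveQAOrganization, and the `override` parameter stands in for the operator-supplied Request fields.

```go
package main

import (
	"fmt"
	"strings"
)

// firstLabel returns the left-most DNS label of an FQDN, lower-cased.
func firstLabel(fqdn string) string {
	label, _, _ := strings.Cut(strings.ToLower(strings.TrimSpace(fqdn)), ".")
	return label
}

// qaNamespaceFor returns the operator override if set, else qa-<first label>.
func qaNamespaceFor(fqdn, override string) string {
	if override != "" {
		return override
	}
	return "qa-" + firstLabel(fqdn)
}

// qaOrganizationFor returns the operator override if set, else <first label>-platform.
func qaOrganizationFor(fqdn, override string) string {
	if override != "" {
		return override
	}
	return firstLabel(fqdn) + "-platform"
}

func main() {
	for _, fqdn := range []string{"omantel.biz", "qa.example.com", "demo.openova.io"} {
		fmt.Printf("%-16s -> %-10s / %s\n", fqdn, qaNamespaceFor(fqdn, ""), qaOrganizationFor(fqdn, ""))
	}
}
```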

Wiring:
  1. internal/provisioner/provisioner.go  Request struct + writeTfvars()
     + deriveQAFixturesNamespace + deriveQAOrganization + firstFQDNLabel
  2. infra/hetzner/variables.tf           4 new tofu vars (string,
                                          true|false validated)
  3. infra/hetzner/cloudinit-control-plane.tftpl
                                          QA_FIXTURES_ENABLED /
                                          QA_TEST_SESSION_ENABLED /
                                          QA_FIXTURES_NAMESPACE /
                                          QA_ORGANIZATION substitute
                                          envvars on bootstrap-kit
                                          Kustomization
  4. infra/hetzner/main.tf                pass new vars into both
                                          templatefile invocations
                                          (primary + per-secondary-region)
  5. internal/provisioner/provisioner_test.go
                                          3 new tests:
                                          - default-disabled invariant
                                          - enabled derivation matrix
                                          - operator-override-wins

QA Sovereign provision command (catalyst-api):

  POST /api/v1/deployments
  {
    "sovereignFQDN": "omantel.biz",
    "qaTestEnabled": true,
    ...
  }

Verified:
  go test ./products/catalyst/bootstrap/api/internal/provisioner/...
  ok  (0.019s)

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 21:08:35 +04:00
e3mrah
fcfed6408c
feat(infra,cilium): wire Cilium ClusterMesh anchors via tofu→cloudinit→envsubst (#1101) (#1226)
* feat(infra,cilium): wire Cilium ClusterMesh anchors via tofu→cloudinit→envsubst (#1101)

Follow-up to #1223. The Flux Kustomization on every Sovereign points
at clusters/_template/bootstrap-kit/ and post-build-substitutes per-
Sovereign vars (SOVEREIGN_FQDN, MARKETPLACE_ENABLED, ...). The
per-Sovereign overlay file at clusters/<sov>/bootstrap-kit/01-cilium.yaml
that #1223 added is therefore dead code (Flux doesn't read that
path). The canonical mechanism is to extend the template with
envsubst placeholders + thread the values through tofu vars.

Wires five layers end-to-end:

1. clusters/_template/bootstrap-kit/01-cilium.yaml — adds
   `cluster.name: ${CLUSTER_MESH_NAME:=}` and
   `cluster.id: ${CLUSTER_MESH_ID:=0}` plus
   `clustermesh.useAPIServer: true` + NodePort 32379. Empty defaults
   = single-cluster Sovereign (no peer connects); the cilium subchart
   accepts empty cluster.name when id=0.

2. infra/hetzner/cloudinit-control-plane.tftpl — adds
   CLUSTER_MESH_NAME / CLUSTER_MESH_ID to the bootstrap-kit
   Kustomization's postBuild.substitute block (alongside
   SOVEREIGN_FQDN, MARKETPLACE_ENABLED, PARENT_DOMAINS_YAML).

3. infra/hetzner/variables.tf — declares cluster_mesh_name (string,
   default "") and cluster_mesh_id (number, default 0, validated 0-255).

4. infra/hetzner/main.tf — primary cloud-init passes
   var.cluster_mesh_{name,id} verbatim. Secondary regions (when
   var.regions[i>0] is non-empty per slice G3) auto-derive each
   peer's name as `<sovereign-stem>-<region-code-no-digits>` and
   increment id from var.cluster_mesh_id+1. Per-region override via
   the new RegionSpec.ClusterMeshName field.

5. products/catalyst/bootstrap/api/internal/provisioner/provisioner.go
   — adds ClusterMeshName + ClusterMeshID to Request and threads them
   into writeTfvars(); RegionSpec gains ClusterMeshName for per-peer
   override.

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), the chart-side
default is intentionally empty — operator request OR per-Sovereign
overlay must supply the values when ClusterMesh is enabled. The
allocation registry lives at docs/CLUSTERMESH-CLUSTER-IDS.md
(introduced in #1223).

Refs: #1101 (EPIC-6), qa-loop iter-6 fix-33 follow-up to #1223

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): escape $ in tftpl comments referencing envsubst placeholders

`tofu validate` reads `${CLUSTER_MESH_NAME}` inside YAML comments as a
template variable reference; the comment was meant to refer to the Flux
envsubst placeholder consumed downstream by the bootstrap-kit cilium
HelmRelease. Escaped both refs with `$$` per Terraform's templatefile
escape syntax so the comment renders verbatim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(infra): replace coalesce with conditional in secondary_region_cluster_mesh_name

coalesce errors when every arg is empty (the not-in-mesh path). Switch
to a conditional that yields '' when both the per-region override AND
var.cluster_mesh_name are empty.
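
For readers unfamiliar with the failure mode, a small Python model of
the two behaviours. `coalesce` here only mimics the OpenTofu function's
"first non-null, non-empty argument or error" rule; `mesh_name` is a
hypothetical stand-in for the conditional, not the actual HCL
expression.

```python
def coalesce(*args):
    """Mimic of coalesce: first non-null, non-empty-string argument."""
    for arg in args:
        if arg is not None and arg != "":
            return arg
    raise ValueError("no non-null, non-empty-string arguments")

def mesh_name(override, default_name):
    """Conditional form: falls through to '' when both inputs are empty."""
    return override or default_name or ""
```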

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:19:53 +04:00
e3mrah
7ca4abddd2
feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101) (#1159)
* feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101)

Implements the server side of the Cloudflare KV lease-witness pattern
that K-Cont-3's CFKVClient (in core/controllers/continuum/internal/
witness/cloudflarekv/) speaks to. The Worker fronts a Cloudflare
Workers KV namespace with read-then-CAS-write semantics enforced via
the If-Match header — exact contract per K-Cont-3 #1158 report (item d)
and the canonical-seams "Cloudflare KV Worker contract" entry.

Routes:
  GET    /lease/<slot-url-encoded>  → 200 + LeaseState | 404 | 401
  PUT    /lease/<slot>              → 200 + LeaseState | 412 + state | 401
  DELETE /lease/<slot>              → 204 | 412 | 401

All 7 K-Cont-3 trap behaviors verified by 46 vitest tests:
  1. If-Match: 0 = first-acquire-on-empty-slot
  2. Generation increments unconditionally (incl. Release)
  3. 412 includes current state body
  4. TTL eviction is server-authoritative in stamping (Worker doesn't
     auto-evict — controller's IsHeldBy decides)
  5. X-Holder mismatch on DELETE returns 412 (stale region can't
     evict new primary)
  6. Bearer token validation against env-bound allow-list
  7. Optional X-Lease-Slot header logged for KV granularity
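
A minimal in-memory model of traps 1, 3 and 5 plus the generation bump
on acquire, for illustration only. It is Python, not the Worker's
TypeScript, and the class and method names are hypothetical. Auth, TTL
stamping, and how the real Worker tracks generation across a release
are not modelled; deleting an empty slot returning 204 is an assumption.

```python
class LeaseStore:
    """Stand-in for the KV namespace behind the Worker (hypothetical)."""

    def __init__(self):
        self.slots = {}  # slot -> {"holder": str, "generation": int}

    def get(self, slot):
        state = self.slots.get(slot)
        return (200, state) if state is not None else (404, None)

    def put(self, slot, holder, if_match):
        current = self.slots.get(slot)
        current_gen = current["generation"] if current else 0
        if if_match != current_gen:      # If-Match: 0 only fits an empty slot
            return 412, current          # 412 carries the current state
        state = {"holder": holder, "generation": current_gen + 1}
        self.slots[slot] = state
        return 200, state

    def delete(self, slot, holder):
        current = self.slots.get(slot)
        if current is not None and current["holder"] != holder:
            return 412, current          # stale region cannot evict the holder
        self.slots.pop(slot, None)
        return 204, None
```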

Files:
  products/continuum/cloudflare-worker/{package.json, tsconfig.json,
    wrangler.toml, vitest.config.ts, .eslintrc.cjs, .gitignore,
    DESIGN.md, src/{index,auth,kv,types}.ts,
    src/handlers/{get,put,delete}.ts,
    test/{handlers,contract,env.d}.ts}
  infra/cloudflare-worker-leases/{versions,variables,main,outputs}.tf
    + README.md
  .github/workflows/cloudflare-worker-leases-build.yaml
    (event-driven, NO cron — push-on-paths + PR + workflow_dispatch)

Tests: 46/46 vitest pass (handlers 37 + contract 9). ESLint clean.
tsc --noEmit clean. wrangler deploy --dry-run produces 9.47 KiB
bundle.

Per the brief: tofu module ships ready for operator action — no
auto-deploy. Operator runbook in DESIGN.md §"Operator runbook —
deploy a new Sovereign".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(continuum/cf-worker-tofu): K-Cont-4 — adopt CF v5 inline secret_text binding (was v4 separate resource)

`tofu validate` failed on `cloudflare_workers_secret` — that resource
was REMOVED in cloudflare/cloudflare v5 (it consolidated into the
inline `bindings = [...]` array on `cloudflare_workers_script` with
`type = "secret_text"`). Same security guarantee — encrypted at rest
in CF, never visible via dashboard read API once written. `tofu fmt`
also wanted versions.tf alignment + the .terraform.lock.hcl pinning
the resolved cloudflare/cloudflare v5.19.1 (mirrors infra/hetzner/
which commits its lock file).

Per Inviolable Principle #5 the bearer token value still flows from
TF_VAR_bearer_tokens_csv extracted at apply time from a K8s
SealedSecret — never inlined here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 08:01:44 +04:00
e3mrah
8988cd9e4f
feat(infra-hetzner): wire all var.regions[] entries end-to-end (slice G1, #1095) (#1131)
Slice G1 of EPIC-0 (#1095, Group G "Multi-cluster substrate"). Today
infra/hetzner/main.tf only realises regions[0] end-to-end — every wizard
payload's regions[1..N] entries silently no-op. EPIC-6 (#1101) Continuum
DR demo needs 3 regions (mgmt + fsn + hel per docs/EPICS-1-6-unified-design.md
§3.8 + §11), so this slice closes the gap.

Architecture: hybrid singular-path + secondary-region overlay.
- The legacy singular path (var.region + count = local.control_plane_count)
  STAYS untouched — every existing Sovereign state (omantel, otech*) keeps
  its resource addresses (hcloud_server.control_plane[0],
  hcloud_load_balancer.main, etc) and produces a no-op plan diff.
- New regions (regions[1+]) are realised via a parallel for_each set keyed
  by "{cloudRegion}-{index}" (e.g. fsn1-1, hel1-2). Each secondary region
  gets its own /24 subnet inside the shared /16 hcloud_network, its own
  CP server, its own workers, and its own lb11 load balancer. The
  hcloud_firewall + hcloud_ssh_key remain shared (one tenant boundary per
  Sovereign).
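
The keying scheme can be sketched as follows. This is a Python
approximation of the for_each set, not the HCL: the `10.0.0.0/16`
network, the "subnet number = region index" allocation, and the helper
name are assumptions. The `{cloudRegion}-{index}` key shape, the
regions[0] exclusion, and the Hetzner-only filter follow the
description above.

```python
import ipaddress

def secondary_regions(regions, network_cidr="10.0.0.0/16"):
    """Map regions[1:] to {key: /24 subnet} for Hetzner entries only."""
    subnets = list(ipaddress.ip_network(network_cidr).subnets(new_prefix=24))
    out = {}
    for index, region in enumerate(regions):
        if index == 0:
            continue  # regions[0] stays on the legacy singular path
        if region["provider"] != "hetzner":
            continue  # non-Hetzner entries are skipped
        key = f'{region["cloudRegion"]}-{index}'
        out[key] = str(subnets[index])  # assumed allocation rule
    return out
```

Because the index is part of the key, two entries in the same cloud
region still produce distinct resource addresses.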

Why hybrid not full for_each: a wholesale refactor would change every
existing resource address (hcloud_server.control_plane[0] →
hcloud_server.control_plane["mgmt"]), forcing every running Sovereign
to run `tofu state mv` for ~12 resources or face destructive recreates.
The brief explicitly bans that. Hybrid is purely additive — secondary
resources are NEW addresses no existing state carries.

No `tofu state mv` runbook required. Existing Sovereigns provisioned
with var.regions = [] or len(var.regions) == 1 produce identical plans
before and after this PR.

Slice G3 (out of scope here) wires Cilium ClusterMesh between secondary
regions and adds per-cluster GitOps path differentiation; today every
secondary CP renders an identical Flux Kustomization pointed at
clusters/<sovereign_fqdn>/.

Tests: tests/multi_region.tftest.hcl exercises 5 scenarios offline via
mock_provider + override_resource (no real Hetzner):
  - legacy_no_regions_payload (var.regions=[])
  - single_region_entry_does_not_double_provision (len==1)
  - three_region_mgmt_fsn_hel (EPIC-6 shape)
  - same_region_duplicates_produce_distinct_keys
  - non_hetzner_regions_are_filtered_out (oci entries skipped)
All 5 pass. CI workflow infra-hetzner-tofu.yaml runs validate + fmt -check
+ test on every PR touching infra/hetzner/**.

Per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled":
push-on-merge + pull-request-on-touch + workflow_dispatch only. No cron.

Validation:
  $ tofu validate
  Success! The configuration is valid.
  $ tofu fmt -check -recursive
  exit=0
  $ tofu test
  tests/multi_region.tftest.hcl... pass
    run "legacy_no_regions_payload"... pass
    run "single_region_entry_does_not_double_provision"... pass
    run "three_region_mgmt_fsn_hel"... pass
    run "same_region_duplicates_produce_distinct_keys"... pass
    run "non_hetzner_regions_are_filtered_out"... pass
  Success! 5 passed, 0 failed.

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 00:29:44 +04:00
e3mrah
8e312cd244
fix(infra/hetzner): strip any-indent comments, gate user_data ≤ 30 KiB at plan-time (#966) (#967)
Live blocker. Provisioning otech114 (deployment 5c3eea37d3aacda6, fsn1)
failed at `tofu apply` with:

  Error: invalid input in field 'user_data' (invalid_input):
  [user_data => [Length must be between 0 and 32768.]]
  with hcloud_server.control_plane[0]
  on main.tf line 309

Hetzner Cloud's HARD 32 KiB cap on user_data was breached after #921
inlined a base64-encoded worker cloud-init (~4.8 KB) into the CP cloud-
init for cluster-autoscaler's HCLOUD_CLOUD_INIT key, on top of #827's
multi-domain substitutions. Rendered size: ~37 KB.

Root cause: the prior strip regex `(?m)^[ ]{0,2}# .*\n` was scoped to
indent-0/2 comments only — leaving ~14 KB of indent-6+ comments INSIDE
write_files content blocks (e.g. flux-bootstrap.yaml's triplicate
Kustomization documentation). Those comments are inert: every write_files
entry is YAML / JSON / key=value config (no shell scripts), and parsers
ignore `#`-prefixed lines entirely.

Changes:

1. New strip regex `(?m)^[ ]*#( |$).*\n` strips ANY-indent comment lines
   that start with `#` followed by space or EOL. Preserves:
   - `#cloud-config` line 1 (no space after `#`)
   - `#!`-shebangs (no space after `#`)
   - `#pragma`-style directives (`#` followed by non-space non-EOL)
   Applied to both `local.control_plane_cloud_init` and
   `local.worker_cloud_init`.

2. Plan-time guardrail via `lifecycle.precondition` on
   `hcloud_server.control_plane` and `hcloud_server.worker`. Fails plan
   (not apply) when `length(local.<*>_cloud_init) > 30720` bytes (30 KiB
   = 32 KiB hard cap minus a 2 KiB future-additions buffer). Future
   bloat-creep that silently re-eats the headroom now fails fast at
   plan-time BEFORE the network/LB/firewall/SSH-key resources get created.
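
Both changes can be reproduced offline with a few lines of Python. The
regex is the one quoted above; the function names are hypothetical and
the size check counts UTF-8 bytes, which only approximates what the
tofu precondition measures.

```python
import re

STRIP = re.compile(r"(?m)^[ ]*#( |$).*\n")
GUARDRAIL_BYTES = 30 * 1024  # 30720, below the 32768-byte hard cap

def strip_comments(rendered):
    """Drop any-indent comment lines; keep #cloud-config, #! and #pragma."""
    return STRIP.sub("", rendered)

def check_user_data(rendered):
    """Strip, then fail early if the result exceeds the guardrail."""
    stripped = strip_comments(rendered)
    size = len(stripped.encode("utf-8"))
    if size > GUARDRAIL_BYTES:
        raise ValueError(f"user_data is {size} bytes, limit {GUARDRAIL_BYTES}")
    return stripped
```

Trailing comments on a value line (`key: value # note`) are left alone,
since the pattern is anchored at the start of the line.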

Verified rendered sizes (Python simulation of templatefile + strip,
substitutions match real otech114 inputs):

  CP cloud-init:     79404 bytes raw → 21144 bytes stripped
                     (margin: 11624 under hard cap, 9576 under guardrail)
  Worker cloud-init:  3254 bytes raw →  2410 bytes stripped
                     (b64-encoded for HCLOUD_CLOUD_INIT: 3216 bytes)

`#cloud-config` first-line preserved. All 18 write_files entries and
43 runcmd entries parse intact. YAML/JSON/conf contents valid post-strip
(comments are documentation only at the file-format level).

Closes #966

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 17:58:44 +04:00
e3mrah
d1431bed09
fix(autoscaler+wizard): wire HCLOUD_CLOUD_INIT, validate SKU/region in catalyst-api (#965)
Closes #921 — bp-cluster-autoscaler-hcloud chart shipped without
HCLOUD_CLUSTER_CONFIG / HCLOUD_CLOUD_INIT, so cluster-autoscaler 1.32.x
FATALs at startup with "HCLOUD_CLUSTER_CONFIG or HCLOUD_CLOUD_INIT is
not specified" on every Sovereign (otech112 evidence). HelmRelease
reports Ready=True (Helm install succeeded) but the Pod
CrashLoopBackOffs invisibly behind the False-positive condition.

Closes #916 — wizard let operators dispatch unbuildable topologies
(otech109: cpx32 worker in `ash`) because PROVIDER_NODE_SIZES did not
encode regional orderability. Hetzner rejected the worker creation 41s
into `tofu apply` after Phase-0 had already created the CP + network +
LB + firewall.

Chart fix (issue #921):
- Add `clusterAutoscalerHcloud.{clusterConfig,cloudInit}` values to the
  umbrella chart (base64-encoded per upstream contract).
- Render `hetzner-node-config` Secret unconditionally with both keys so
  the upstream Deployment's secretKeyRef references resolve cleanly
  during `helm template` AND in the live cluster regardless of overlay
  state.
- Wire HCLOUD_CLUSTER_CONFIG + HCLOUD_CLOUD_INIT extraEnvSecrets onto
  the upstream chart's deployment.
- Tofu Phase 0 base64-encodes the Phase-0 worker cloud-init and stamps
  it under `flux-system/cloud-credentials.hcloud-cloud-init`; the
  bootstrap-kit overlay lifts that key via Flux `valuesFrom` into
  `clusterAutoscalerHcloud.cloudInit`. Autoscaler-spawned workers thus
  receive the IDENTICAL bootstrap as the Phase-0 worker fleet.
- Bump bp-cluster-autoscaler-hcloud chart 1.0.0 → 1.1.0.
- Chart-test smoke gate (chart/tests/hetzner-node-config.sh) verifies
  Secret + env var wiring + no-regression of HCLOUD_TOKEN — runs in CI's
  blueprint-release "Run chart integration tests" step.

Wizard fix (issue #916):
- Add `availableRegions?: string[]` to NodeSize interface; encode
  cpx32 = ['fsn1','nbg1','hel1'], cpx21/cpx31 = [] (orderable nowhere
  new) per Hetzner /v1/server_types vs POST /v1/servers gap.
- Add `isSkuAvailableInRegion()` + `suggestAlternativeSkus()` helpers.
- StepProvider filters SKU dropdowns by selected region; auto-swaps
  current SKU to recommended default when region change drops it out
  of orderability.
- Mirror the matrix Go-side in sku_availability.go; gate
  `provisioner.Request.Validate()` with same predicate so a stale
  wizard build OR direct API caller bypassing the UI cannot dispatch
  otech109's failure mode.
- Two-sided enforcement covers both r.Regions[] (multi-region) and the
  legacy singular path.
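
A rough Python rendering of the shared predicate, for orientation. The
real helpers live in the TypeScript wizard and in sku_availability.go;
the snake_case names, the `example-unrestricted` entry, and the rule
"no availableRegions list means unrestricted" are assumptions of this
sketch.

```python
NODE_SIZES = {
    "cpx32": {"availableRegions": ["fsn1", "nbg1", "hel1"]},
    "cpx21": {"availableRegions": []},
    "cpx31": {"availableRegions": []},
    "example-unrestricted": {},  # hypothetical SKU with no list recorded
}

def is_sku_available_in_region(sku, region):
    size = NODE_SIZES.get(sku)
    if size is None:
        return False  # unknown SKU: reject
    regions = size.get("availableRegions")
    if regions is None:
        return True  # assumption: absent list means no restriction
    return region in regions

def suggest_alternative_skus(region):
    return sorted(s for s in NODE_SIZES if is_sku_available_in_region(s, region))
```

With this matrix the otech109 request (cpx32 in `ash`) is rejected
before anything is dispatched.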

Tests: 13 vitest cases on the wizard side + 38 Go subtests on the API
side. Chart smoke renders + helm template gates the env wiring at
publish time.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:21:59 +04:00
e3mrah
2ff50f0591
fix(bp-newapi+services-build): imagePullSecrets on Pod, sed bumps values.yaml smeTag (#955)
Two SME-blocker bugs caught live on otech113 (alice signup gate 5 fails on
fresh Sovereign):

#952 — bp-newapi 1.4.0 Pod has no imagePullSecrets, so kubelet pulls
PRIVATE ghcr.io/openova-io/openova/{newapi-mirror,services-metering-sidecar}
anonymously and gets 403 Forbidden. Fix:

- Templatize spec.imagePullSecrets on Deployment + channel-seed Job.
- Default values.yaml `imagePullSecrets: [{name: ghcr-pull}]`.
- Add `newapi` to flux-system/ghcr-pull's reflector
  reflection-{allowed,auto}-namespaces in cloudinit-control-plane.tftpl
  so bp-reflector mirrors the source Secret into the namespace
  automatically on every fresh Sovereign.
- Bump bp-newapi 1.4.0 -> 1.4.1, update _template overlay.

#953 — services-build.yaml's image-rewrite loop only matched the
hardcoded `image: ghcr.io/.../services-<svc>:<sha>` form. 7 of 8
sme-services templates use `image: "{{ ... }}/services-<svc>:{{
.Values.images.smeTag }}"`. Each services-build run bumped only
auth.yaml while reporting "update sme service images to ${SHA}",
leaving the live Pod on stale bytes (PR #951's #941 fix never reached
services-catalog despite the merge + chart bump chain). Fix:

- After the hardcoded loop, also bump `images.smeTag` in
  products/catalyst/chart/values.yaml with a strict regex match
  (`^  smeTag: "<sha>"$`); refuse to auto-bump if the line shape
  changes (defends against silent drift if a contributor renames the
  field).
- Mirror the change into the retry-path `rewrite()` function so a
  reset-to-origin/main retry does not recreate the original bug.
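
The "strict match or refuse" behaviour can be sketched like this. The
workflow itself uses sed in a shell step; this Python version is only
an illustration, the function name is hypothetical, and the hex
character class for `<sha>` is an assumption.

```python
import re

SME_TAG_LINE = re.compile(r'^  smeTag: "[0-9a-f]+"$', re.MULTILINE)

def bump_sme_tag(values_yaml, new_sha):
    """Rewrite the single smeTag line, or refuse if its shape drifted."""
    matches = SME_TAG_LINE.findall(values_yaml)
    if len(matches) != 1:
        raise ValueError("smeTag line shape changed; refusing to auto-bump")
    return SME_TAG_LINE.sub(f'  smeTag: "{new_sha}"', values_yaml)
```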

Tests:

- platform/newapi/chart/tests/imagepullsecrets-render.sh — 4 cases
  asserting the Deployment and channel-seed Job carry the default
  ghcr-pull reference, that an empty override suppresses the block,
  and that custom secret names propagate (Inviolable Principle #4).
- tests/integration/services-build-rewrite.sh — 3 cases reproducing
  the workflow's rewrite logic on a sandboxed copy of the live
  chart, asserting both auth.yaml's hardcoded line AND values.yaml's
  smeTag get bumped, that helm-render of the catalyst chart with
  the bumped values produces all 8 SME-service Deployments at the
  new SHA, and that an idempotent re-bump to a second SHA also lands
  cleanly.

Refs: #952 #953 (umbrella #915 — alice signup gate 5).

Co-authored-by: hatiyildiz <143030955+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:47:37 +04:00
e3mrah
e08d8721e1
fix(pdm/dynadot): pre-register glue records before set_ns (#900) (#906)
Multi-domain Day-2 add-domain on a Sovereign was failing with Dynadot's
"'ns1.<sov>.omani.works' needs to be registered with an ip address
before it can be used" error. Dynadot rejects set_ns whenever the NS
hostnames aren't registered as account-level "host records" first.

This change wires the glue pre-registration into the PDM dynadot
adapter as an optional registrar.GlueRegistrar interface, threads the
Sovereign's load-balancer IPv4 from cloud-init through Flux postBuild
into the chart's `global.sovereignLBIP`, and forwards it via
catalyst-api's pdmFlipNS to PDM's /set-ns endpoint as a new `glueIP`
field. PDM's SetNS handler calls RegisterGlueRecord for each
out-of-bailiwick NS before SetNameservers, with idempotent get_ns →
register_ns / set_ns_ip semantics so retries are free.
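
The idempotent flow reads roughly as below. This is a toy Python model
against a dict, not the Go adapter or the Dynadot API; the function
name and return strings are hypothetical, and only the get / register /
set-IP ordering is taken from the description above.

```python
def register_glue(hosts, ns_host, ip):
    """hosts: dict standing in for the account-level host records."""
    current = hosts.get(ns_host)   # models the get_ns lookup
    if current is None:
        hosts[ns_host] = ip        # models register_ns
        return "registered"
    if current != ip:
        hosts[ns_host] = ip        # models set_ns_ip
        return "updated"
    return "unchanged"             # retry is a no-op
```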

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:00:45 +04:00
e3mrah
7bfd6df588
fix(catalyst-api,bp-catalyst-platform,infra): unblock multi-domain Day-2 add-domain flow on Sovereigns (#879) (#884)
5 stacked wiring bugs blocked the Day-2 add-parent-domain happy path on a
fresh post-handover Sovereign — surfaced live on otech103, 2026-05-05 — plus
a 6th gap (ghcr-pull reflector for catalyst-system). All six fixed in one PR
so a single chart bump + cloud-init re-render closes the gap end-to-end.

Bug 1 (chart, api-deployment.yaml): wire POOL_DOMAIN_MANAGER_URL=
https://pool.openova.io. The in-cluster Service default only resolves on
contabo; on Sovereigns every Day-2 POST died with NXDOMAIN.

Bug 2 (chart + code): wire CATALYST_PDM_BASIC_AUTH_USER / _PASS env from a
new pdm-basicauth Secret, and have pdmFlipNS SetBasicAuth from those envs.
The PDM public ingress at pool.openova.io is gated by Traefik basicAuth;
calls without Authorization: Basic returned 401. optional=true so contabo
+ CI + older Sovereigns degrade to a clear 401 log line. Per Inviolable
Principle #10, the credentials only ever live in Pod env + are read once
per call by pdmFlipNS — never enter a logged struct or persisted record.

Bug 3 (code, parent_domains.go): pdmFlipNS body now includes the required
nameservers field (computed from expectedNSFor). PDM's SetNSRequest schema
requires it; the previous body got 422 missing-nameservers.

Bug 4 (code, parent_domains.go): lookupPrimaryDomain falls back to
SOVEREIGN_FQDN env after CATALYST_PRIMARY_DOMAIN. On a post-handover
Sovereign no Deployment record is persisted, so without this fallback GET
/parent-domains returned {"items":[]} and the propagation panel showed
expectedNs:null. SOVEREIGN_FQDN is already wired by api-deployment.yaml
from the sovereign-fqdn ConfigMap.

Bug 5 (chart, httproute.yaml): catalyst-ui /auth/* PathPrefix narrowed to
Exact /auth/handover. The previous PathPrefix collided with OIDC PKCE
redirect_uri /auth/callback — catalyst-api 404s on that path because it
only registers /api/v1/auth/callback, breaking login post-handover-JWT-
cookie expiry. Exact match keeps /auth/handover routed to catalyst-api
while every other /auth/* path falls through to catalyst-ui's React
Router for client-side OIDC.

Bug 6 (cloud-init): ghcr-pull + harbor-robot-token + new pdm-basicauth
Reflector annotations enumerate explicit allowed/auto-namespaces (sme,
catalyst, catalyst-system, gitea, harbor) instead of empty-string. The
ambiguous empty-string interpretation caused otech103 to require a manual
catalyst-system mirror creation; explicit list back-ports the verified
working state.
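
The Bug 5 routing change can be illustrated with a small matcher. This
is a simplified Python model of prefix versus exact path matching, not
the Gateway implementation; the function names and the two-backend
`route` helper are hypothetical.

```python
def path_prefix_match(prefix, path):
    """Prefix match on path-segment boundaries."""
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")

def route(path, exact="/auth/handover"):
    """After the fix: only the exact handover path goes to catalyst-api."""
    return "catalyst-api" if path == exact else "catalyst-ui"
```

Under the old prefix rule `/auth/callback` matched and was sent to
catalyst-api; with the exact rule it falls through to catalyst-ui.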

Provisioner wiring: Request.PDMBasicAuthUser/Pass + Provisioner fields
+ tfvars emission so the contabo catalyst-api can stamp the credentials
onto every Sovereign provision request. variables.tf adds matching
pdm_basic_auth_user / pdm_basic_auth_pass tofu vars (sensitive, default
empty) so older provisioner builds that pre-date this change keep
rendering valid cloud-init (the Secret renders with empty values and
Pod start is unaffected).

Chart bumped 1.4.11 -> 1.4.12, lockstep slot 13 pin updated. Closes
the architectural blockers tracked in #879; the catalyst-api image
rebuild + chart republish run via the existing CI pipelines (services-
build.yaml + blueprint-release.yaml) on this commit's SHA.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 09:02:39 +04:00
e3mrah
e96741a0ca
feat(powerdns,cert-manager): multi-zone bootstrap + per-zone wildcard cert (#827) (#838)
A franchised Sovereign now supports N parent zones, NOT one. The
operator brings 1+ parent domains at signup (`omani.works` for own
use, `omani.trade` for the SME pool, etc.) and may add more
post-handover via the admin console (#829).

bp-powerdns 1.2.0 (platform/powerdns/chart):
- New `zones: []` values key listing parent domains to bootstrap
- New Helm post-install/post-upgrade hook Job
  (templates/zone-bootstrap-job.yaml) that POSTs each entry to
  /api/v1/servers/localhost/zones at install time. Idempotent on
  HTTP 409 — re-runs after upgrades or chart bumps never fail.
- Default-values render skips when zones is empty (legacy behavior).
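
The hook's 409 handling amounts to the loop below. It is a Python
sketch with the HTTP call injected as `post_zone` (a hypothetical
callable returning a status code), so it runs offline; the real Job
talks to the PowerDNS API directly.

```python
def bootstrap_zones(zones, post_zone):
    """Create each zone; treat 409 (already exists) as success."""
    created, existing = [], []
    for zone in zones:
        status = post_zone(zone["name"])
        if status == 201:
            created.append(zone["name"])
        elif status == 409:
            existing.append(zone["name"])  # re-run after an upgrade
        else:
            raise RuntimeError(f"zone {zone['name']}: HTTP {status}")
    return created, existing
```

An empty `zones` list makes no calls at all, matching the legacy
single-zone behaviour.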

bp-catalyst-platform 1.4.0 (products/catalyst/chart):
- New `parentZones: []` + `wildcardCert.{enabled,namespace,issuerName}`
  values
- New templates/sovereign-wildcard-certs.yaml renders one
  cert-manager.io/v1.Certificate per zone (each `*.<zone>` + apex)
  via the letsencrypt-dns01-prod-powerdns ClusterIssuer. Each cert
  renews independently. Skips entirely when parentZones is empty so
  the legacy clusters/_template/sovereign-tls/cilium-gateway-cert.yaml
  retains ownership of `sovereign-wildcard-tls` (avoids
  helm-vs-kustomize ownership flap).
- New `catalystApi.{powerdnsURL,powerdnsServerID}` values threaded
  into the catalyst-api Pod as CATALYST_POWERDNS_API_URL +
  CATALYST_POWERDNS_SERVER_ID env vars.

catalyst-api (products/catalyst/bootstrap/api):
- New internal/powerdns package with typed Client (CreateZone,
  ZoneExists). Idempotent on HTTP 409/412.
- handler.pdmCreatePowerDNSZone (issue #829's stub) now uses the
  typed client when wired via SetPowerDNSZoneClient — the
  admin-console "Add another parent domain" flow now creates real
  zones in the Sovereign's PowerDNS at runtime.
- main.go wires the client when CATALYST_POWERDNS_API_URL +
  CATALYST_POWERDNS_API_KEY are set.
- Comprehensive unit tests (client_test.go: 9 cases incl.
  201/409/412/500 + custom NS + custom serverID).

Bootstrap-kit slot integration:
- clusters/_template/bootstrap-kit/11-powerdns.yaml: bumps to
  bp-powerdns 1.2.0 and threads `zones: ${PARENT_DOMAINS_YAML}` from
  Flux postBuild.substitute.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
  bumps to bp-catalyst-platform 1.4.0 and threads `parentZones:
  ${PARENT_DOMAINS_YAML}` (same source-of-truth string so the two
  slots stay in lockstep).
- infra/hetzner: new `parent_domains_yaml` Terraform variable
  (defaults to single-zone array derived from sovereign_fqdn) →
  cloud-init renders the PARENT_DOMAINS_YAML Flux substitute.

DoD verified end-to-end with helm template + envsubst:
- Multi-zone overlay (omani.works + omani.trade) renders 2
  PowerDNS zone-create API calls in the bootstrap Job AND 2
  Certificate resources (`*.omani.works`, `*.omani.trade`) in
  bp-catalyst-platform.
- Single-zone fallback (PARENT_DOMAINS_YAML defaults to
  `[{name: "<sov_fqdn>", role: "primary"}]`) keeps legacy
  provisioning paths working without per-overlay edits.

Closes #827.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-04 23:42:00 +04:00
e3mrah
05065b66d6
fix(provisioner+observer): document cpx21 availability + kubectl retry/LKG (closes #752, #753) (#756)
#752 — investigate cpx21/cpx31 availability in EU DCs

Concrete proof gathered against the live Hetzner Cloud API on 2026-05-04.
GET /v1/server_types LISTS cpx11/cpx21/cpx31/cpx41 with full EU prices in
fsn1/nbg1/hel1, but POST /v1/servers rejects every order for those SKUs in
those DCs with:

  {"error":{"code":"invalid_input",
            "message":"unsupported location for server type"}}

Probed all 6 (SKU × DC) combinations end-to-end via real POST + immediate
DELETE. cpx22 + cpx32 were also probed as a sanity check and returned
ORDERED. The /v1/server_types price entry is misleading: Hetzner advertises
prices for every (SKU, location) pair regardless of orderability.

Conclusion: NO SKU bump-back. cpx22 + cpx32 (PR #744) remain the floor.
README + variables.tf docstrings now carry the durable reproducer so future
engineers don't re-attempt cpx21/cpx31.

#753 — kubectl retry / LKG observer reliability

/tmp/autopilot.sh updated (script lives outside the repo, on the VPS):
  • Every kubectl call carries --request-timeout=8s so a hung TLS handshake
    surfaces as a fast empty result rather than a 30s+ stall.
  • Last-known-good (LKG) state held across transient flakes: hr/cert/nodes
    no longer flip to "0/0 nodes=0" on a single failed poll.
  • Only 3 consecutive transients count as a real failure; below the
    threshold the observer prints "hr=<LKG> (transient N/3)".
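
The LKG rule is easy to state as a tiny state machine. The actual
script is shell and lives outside the repo, so this Python class is
purely illustrative: the class name and the `hr=FAILED` string are
assumptions, while the 3-strike threshold and the "(transient N/3)"
format follow the bullets above.

```python
class LkgObserver:
    THRESHOLD = 3  # consecutive failed polls that count as a real failure

    def __init__(self):
        self.lkg = None
        self.transients = 0

    def observe(self, value):
        """value is None when the poll failed or returned nothing."""
        if value is not None:
            self.lkg, self.transients = value, 0
            return f"hr={value}"
        self.transients += 1
        if self.transients >= self.THRESHOLD:
            return "hr=FAILED"  # hypothetical failure rendering
        return f"hr={self.lkg} (transient {self.transients}/{self.THRESHOLD})"
```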

UI side: the wizard's StatusPill / ApplicationPage drive off SSE from
catalyst-api (useDeploymentEvents.ts), not direct kubectl polling, so no UI
change needed. catalyst-api itself uses client-go (helmwatch / phase1_watch),
not exec kubectl, so its observer is not subject to the same shell-out flake.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:11:44 +04:00
e3mrah
e855ab0dfe
fix(k3s): taint CP node-role.kubernetes.io/control-plane:NoSchedule when workers exist (#751) (#755)
Root cause of the "apiserver flake / cpx22 too small / 8 stuck HRs"
chain: the k3s server install in cloudinit-control-plane.tftpl set
--node-label but no --node-taint. By k3s default the server node is
fully schedulable, so on a 1-CP + N-worker Sovereign with the
37-HelmRelease bootstrap-kit + guest workloads (bp-keycloak / bp-cnpg /
bp-harbor / bp-catalyst-platform / SME microservices), the scheduler
distributes guest pods onto the CP. They eat its memory and crowd
kubelet/etcd/apiserver; as a result kubectl flakes, Helm post-install
hooks time out, and HelmReleases get stuck mid-reconcile.

Fix: add --node-taint node-role.kubernetes.io/control-plane=true:NoSchedule
to the INSTALL_K3S_EXEC string, so the CP is reserved for system +
bootstrap controllers. cilium agent (DaemonSet) and cilium-operator
default to {operator: Exists} tolerations upstream — they tolerate
the taint and continue to run on the CP. cert-manager and flux2 default
to tolerations: [] — on multi-node Sovereigns they correctly land on
workers, which is the desired separation. Guest workloads do not
tolerate the taint and are pushed to workers where they belong.

Conditional on worker_count > 0: a Catalyst-Zero / solo Sovereign has
only the CP, so tainting NoSchedule there leaves no schedulable node
and the cluster never becomes ready. The Tofu inline ternary
`${worker_count > 0 ? "--node-taint ..." : ""}` omits the flag
entirely in solo mode — k3s default (CP fully schedulable) carries
everything.
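
In Python terms the conditional behaves like this. It is a sketch of
the logic only; the helper name is hypothetical and every other flag in
the real INSTALL_K3S_EXEC string is elided.

```python
TAINT = "--node-taint node-role.kubernetes.io/control-plane=true:NoSchedule"

def k3s_server_exec(worker_count):
    """Build the server exec string; taint only when workers exist."""
    parts = ["server"]  # remaining flags elided in this sketch
    if worker_count > 0:
        parts.append(TAINT)
    return " ".join(parts)
```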

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:07:34 +04:00
e3mrah
ceeefd7829
fix(cloud-init): quote MARKETPLACE_ENABLED so postBuild.substitute is map[string]string (#746)
ROOT CAUSE FOUND for the post-PR-#710 zero-touch handover stall (otech85
through otech89). Cloud-init template emitted:

  postBuild:
    substitute:
      SOVEREIGN_FQDN: otech89.omani.works
      MARKETPLACE_ENABLED: false      ← UNQUOTED YAML BOOL

Tofu interpolates `${marketplace_enabled}` (a string variable holding
"true"|"false") into the rendered cloud-init. Without quotes, kubectl's
YAML parser converts `false`/`true` into BOOL, so the rendered
Kustomization manifest violates the kustomize.toolkit.fluxcd.io/v1
postBuild.substitute schema (map[string]string).

Live evidence on otech89 (and earlier otech85-88 with same SHA):
  GitRepository CRD apply  → succeeds (no postBuild, no schema issue)
  3× Kustomization apply   → silently rejected by validator
  flux-system kustomize-controller has 0 reconciliable Kustomizations
  bootstrap-kit never lands → 0 HRs ever Ready → wizard stalls forever

Quote the value: `MARKETPLACE_ENABLED: "${marketplace_enabled}"` so it
renders as `MARKETPLACE_ENABLED: "false"` (string) and passes the CRD
validator.
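
One way to make this class of bug hard to reintroduce is to quote every
substitute value at render time. The sketch below is a hypothetical
Python renderer, not the tftpl template; it leans on the fact that a
JSON string literal is also a valid double-quoted YAML scalar.

```python
import json

def render_substitute(values):
    """Render postBuild.substitute with every value as a quoted string."""
    lines = ["postBuild:", "  substitute:"]
    for key, value in values.items():
        if not isinstance(value, str):
            raise TypeError(f"{key} must be a string, got {type(value).__name__}")
        lines.append(f"    {key}: {json.dumps(value)}")
    return "\n".join(lines)
```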

This is the bug that has been blocking the 2-cycle zero-touch
verification since PR #719 introduced MARKETPLACE_ENABLED. Six
provisioning cycles burned (otech85-89 + retries) chasing it. Closes #733
cycle-verification (the SKU work itself was correct end-to-end).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 16:01:19 +04:00
e3mrah
468c3badf8
fix(cloud-init): tolerate Crossplane Provider apply failure + retry in background (#745)
Live observation on otech88 (DID b2c528023b50ec45, 2026-05-04
11:40:42Z): the new Sovereign's flux-system reaches Ready (GitRepository
artifact stored, all 6 Flux deployments Available) but no Kustomization
CRs appear — kustomize-controller has nothing to reconcile and
hr=True=0/0 forever.

The cloud-init runcmd applies in this order:
  1. cloud-credentials-secret.yaml
  2. crossplane-provider-hcloud.yaml — `pkg.crossplane.io/v1 Provider`
     CRD doesn't exist yet (bp-crossplane is installed by Flux below),
     so this apply errors with "no matches for kind Provider in version
     pkg.crossplane.io/v1"
  3. flux-bootstrap.yaml — should apply 1× GitRepository + 4×
     Kustomization

Empirically, only the GitRepository lands. The four Kustomization
documents in the same multi-doc YAML are not created. The exact
mechanism of failure is on-host (cloud-init runcmd output is at
/var/log/cloud-init-output.log on the Sovereign — out of reach per
"no SSH" rule), but the symptom is consistent across otech87 and
otech88 reprovisions on the new cost-optimised SKUs.

This patch is a belt-and-braces hardening:

1. Tolerate the Crossplane Provider apply's failure (`|| true`) so
   the runcmd cannot propagate a non-zero exit through to whatever
   downstream step is failing.

2. Add a background retry for the Crossplane Provider CR. Polls
   every 30s up to 30m for the Provider CRD to appear (i.e.
   bp-crossplane reconciled by Flux), then `kubectl apply` succeeds
   and the loop exits. Detached via `&` so cloud-init runcmd
   completes without waiting for Crossplane to be Ready.
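
The retry in step 2 follows the usual bounded-poll shape. The real
loop is shell inside cloud-init and is backgrounded with `&`; this
Python version only shows the control flow, with `apply` and `sleep`
injected as hypothetical callables so it can run without a cluster.

```python
def retry_apply(apply, sleep, interval_s=30, timeout_s=30 * 60):
    """Call apply() every interval_s until it succeeds or time runs out.

    Returns the attempt number that succeeded, or None on timeout.
    """
    attempts = timeout_s // interval_s  # 60 attempts for 30s / 30m
    for attempt in range(1, attempts + 1):
        if apply():
            return attempt
        sleep(interval_s)
    return None
```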

The intent is to remove any chance the Provider apply blocks Flux
bootstrap. If Kustomizations still don't appear after this fix, the
root cause is elsewhere and a follow-up patch will land.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 15:50:55 +04:00
e3mrah
b02fc3788a
fix(provisioner): cost-optimized defaults use ORDERABLE SKUs — cpx22 CP + cpx32 workers (14% saving) (#744)
* fix(provisioner): emit regions=[] not null so OpenTofu validator accepts zero-override request

Live failure on otech86 (DID 103c52d08510006f, 2026-05-04 11:12:43Z).
After PR #742 fixed the empty SKU strings in tfvars, the next blocker
appeared: writeTfvars was emitting `"regions": null` (Go nil slice
marshals to JSON null) when the request had no per-region overrides.

OpenTofu's variables.tf carries a validation block:

  validation {
    condition = alltrue([
      for r in var.regions :
      contains(["hetzner", "huawei", "oci", "aws", "azure"], r.provider)
    ])
  }

The `for r in var.regions` iteration fails on null with:

  Error: Iteration over null value
  on variables.tf line 217, in variable "regions":

The variables.tf default `[]` is what the validator expects; emit
that shape explicitly via a coalesceRegions(req.Regions) helper that
turns nil into an empty slice. Operator overrides round-trip
unchanged.
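
The same nil-versus-empty distinction shows up in most JSON encoders.
A Python analogue of the fix, with hypothetical snake_case names
standing in for the Go helper and writeTfvars:

```python
import json

def coalesce_regions(regions):
    """Turn a missing list into an empty one; pass overrides through."""
    return [] if regions is None else regions

def write_tfvars(regions):
    return json.dumps({"regions": coalesce_regions(regions)})
```

Without the helper the encoder emits `null`, which is the shape the
validation block cannot iterate over.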

Tests:
- TestWriteTfvars_EmitsRegionsAsEmptyArrayNotNull — proves regions
  serialises as JSON `[]`, never `null`, when the request has no
  per-region overrides.

Builds on PR #742.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(provisioner): cost-optimized defaults use ORDERABLE SKUs (cpx22 CP + cpx32 workers, 14% saving)

Live failure on otech87 (DID e47e1c0824f3fcbb, 2026-05-04 11:31:09Z): the
cpx21 CP default from PR #741 fell apart at apply time —

  Error: Server Type "cpx21" is unavailable in "fsn1" and can no
  longer be ordered

Hetzner cloud API confirms: cpx21 and cpx31 are listed in the catalog
(`/v1/server_types`) but are NOT in the per-DC orderable list
(`available_for_migration` on `/v1/datacenters`) for any EU DC
(fsn1/nbg1/hel1). The wizard's catalog literally cannot be acted on
for new Sovereigns in those regions.

Smallest AMD-shared SKUs that ARE orderable in EU DCs as of 2026-05-04:
  • cpx11 (2 vCPU / 2 GB) — too small for the CP working set
  • cpx22 (2 vCPU / 4 GB) — fits the CP working set, ~€9.49/mo fsn1
  • cpx32 (4 vCPU / 8 GB) — smallest 8 GB worker, ~€16.49/mo fsn1
  • cpx42, cpx52, cpx62 — bigger and more expensive

New default per Sovereign:

| Component       | Old             | New              | Savings |
|-----------------|-----------------|------------------|---------|
| Control plane   | CPX32 (€16.49)  | CPX22 (€9.49)    | €7.00   |
| Worker × 2      | CPX32 × 2 (€33) | CPX32 × 2 (€33)  | €0      |
| TOTAL           | €49.47/mo       | €42.47/mo        | 14%     |

The 38% saving the issue brief proposed (cpx21+cpx31 = €20.5/mo)
assumed those SKUs were orderable. They aren't in EU DCs. The 14%
saving from cpx22 CP is the largest concrete optimisation that
ships TODAY without compromising the multi-node horizontal-scale
agreement (issue #733): still 1 CP + 2 workers from day one.
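
The arithmetic behind the table can be checked with a short sketch. Prices are the fsn1 figures quoted above, expressed in euro cents to avoid floating-point drift; the function names are illustrative and not part of the provisioner.

```go
package main

import "fmt"

// monthlyTotal sums one control plane and n workers, in euro cents.
func monthlyTotal(cpCents, workerCents, workers int) int {
	return cpCents + workerCents*workers
}

// savingPercent returns the relative saving of newCents versus oldCents.
func savingPercent(oldCents, newCents int) float64 {
	return float64(oldCents-newCents) / float64(oldCents) * 100
}

func main() {
	oldTotal := monthlyTotal(1649, 1649, 2) // CPX32 CP + 2x CPX32 workers
	newTotal := monthlyTotal(949, 1649, 2)  // CPX22 CP + 2x CPX32 workers

	fmt.Printf("old: EUR %.2f/mo\n", float64(oldTotal)/100)
	fmt.Printf("new: EUR %.2f/mo\n", float64(newTotal)/100)
	fmt.Printf("saving: about %.0f%%\n", savingPercent(oldTotal, newTotal))
}
```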

Files changed:

- infra/hetzner/variables.tf
  control_plane_size default cpx21 → cpx22
  worker_size        default cpx31 → cpx32 (back to the prior orderable choice)

- products/catalyst/bootstrap/ui/src/shared/constants/providerSizes.ts
  Replace fictional CPX21 € pricing (€5.49/mo) and CPX31 € pricing
  (€7.49/mo) with the actual fsn1 Hetzner API prices (€10.99 / €20.49).
  Mark both as "listed but NOT orderable in EU DCs" so the wizard
  surfaces the constraint instead of letting operators pick a
  non-orderable SKU.
  Move recommended:true from CPX21 → CPX22.
  defaultWorkerSizeId('hetzner') returns 'cpx32' (was 'cpx31').

- products/catalyst/bootstrap/ui/src/pages/wizard/steps/StepProvider.tsx
  Comment refresh — names the new orderable defaults.

- products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts
  Recommended-Hetzner-SKU set assertion: ['cpx21'] → ['cpx22'].

Builds on PR #741 (issue #740 chain).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 15:35:55 +04:00
e3mrah
994c2d1c2a
fix(provisioner): cost-optimized default sizes — cpx21 CP + cpx31 workers (38% saving) (#741)
The new Sovereign default after PR #736 / #738 / #739 was 1× CPX32 control
plane + 2× CPX32 workers — €33/mo per Sovereign. CPX32 is over-provisioned
for the CP working set: the CP carries only k3s (apiserver/etcd/scheduler/
controller-manager) + cilium-operator + flux controllers + cert-manager +
sealed-secrets — NOT the heavy bp-keycloak/cnpg/harbor/openbao/grafana
stack (those land on workers because the bootstrap-kit explicitly schedules
them off the CP taint).

CP RAM budget: etcd ~512 MB + control plane ~1.5 GB + cilium/flux/
cert-manager/sealed-secrets ~1 GB + OS ~512 MB ≈ 3.5 GB — fits CPX21's
4 GB. Workers stay at 8 GB on CPX31 since RAM is the binding constraint
for the bootstrap-kit's worker pods, not vCPU.

New default per Sovereign:

| Component       | Old             | New             | Savings |
|-----------------|-----------------|-----------------|---------|
| Control plane   | CPX32 (€11/mo)  | CPX21 (€5.5/mo) | €5.5    |
| Worker × 2      | CPX32 × 2 (€22) | CPX31 × 2 (€15) | €7      |
| TOTAL           | €33/mo          | €20.5/mo        | 38%     |

Multi-node horizontal-scale agreement (issue #733) preserved: still
1 CP + 2 workers minimum from day one.

Files changed:

- infra/hetzner/variables.tf
  control_plane_size default cpx32 → cpx21
  worker_size        default cpx32 → cpx31
  Validation regex unchanged (cxNN | cpxNN | ccxNN | caxNN).

- products/catalyst/bootstrap/ui/src/shared/constants/providerSizes.ts
  Add CPX11, CPX21, CPX31 catalog entries.
  Move recommended:true from CPX32 → CPX21 (control-plane default).
  Add defaultWorkerSizeId() — Hetzner returns 'cpx31', other providers
  fall through to defaultNodeSizeId() symmetric default.

- products/catalyst/bootstrap/ui/src/pages/wizard/steps/StepProvider.tsx
  First-visit useEffect + handleSelectProvider now call
  defaultWorkerSizeId(provider) for the worker SKU instead of mirroring
  the CP SKU. Comment updated naming the cost-optimised pair.

- products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts
  Recommended-Hetzner-SKU set assertion: ['cpx32'] → ['cpx21'].

If a Sovereign exhibits CP RAM pressure with this default, the next safe
step UP is cpx31 (4 vCPU / 8 GB, ~€7.5/mo), never back to cpx32.

Closes #740.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 15:00:01 +04:00
e3mrah
e085a68585
fix(k3s): add 10.0.1.2 to --tls-san so Cilium can verify CP cert from workers (#739)
Issue #733 follow-up #2. After #738 changed Cilium's k8sServiceHost
from 127.0.0.1 to the CP private IP 10.0.1.2, Cilium's TLS verification
fails with:

  Get "https://10.0.1.2:6443/api/v1/namespaces/kube-system":
    tls: failed to verify certificate: x509: certificate is valid for
    10.43.0.1, 127.0.0.1, 178.104.211.206, 2a01:..., ::1, not 10.0.1.2

k3s auto-generates the apiserver TLS cert with SANs covering the public
IP, the cluster service IP (10.43.0.1), and localhost — but NOT the
private subnet IP 10.0.1.2. Adding `--tls-san=10.0.1.2` to the k3s
server install command makes the cert valid for the address Cilium
(and any other in-cluster client) reaches the apiserver via.
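
The SAN mismatch can be reproduced in isolation with Go's `crypto/x509`. This is an illustrative sketch only: it builds unsigned in-memory certificate values (the hypothetical `certWithIPSANs` helper) rather than reading the real k3s cert, and it exercises only the IP SAN matching step, not chain verification.

```go
package main

import (
	"crypto/x509"
	"fmt"
	"net"
)

// certWithIPSANs builds an in-memory certificate value carrying only IP
// SANs. It is unsigned and only suitable for demonstrating SAN matching.
func certWithIPSANs(ips ...string) *x509.Certificate {
	cert := &x509.Certificate{}
	for _, ip := range ips {
		cert.IPAddresses = append(cert.IPAddresses, net.ParseIP(ip))
	}
	return cert
}

func main() {
	// SAN list before the fix: no private subnet IP.
	before := certWithIPSANs("10.43.0.1", "127.0.0.1", "178.104.211.206", "::1")
	// SAN list after adding --tls-san=10.0.1.2.
	after := certWithIPSANs("10.43.0.1", "127.0.0.1", "178.104.211.206", "::1", "10.0.1.2")

	// VerifyHostname matches an IP literal only against IP SANs.
	fmt.Println("before:", before.VerifyHostname("10.0.1.2"))
	fmt.Println("after: ", after.VerifyHostname("10.0.1.2"))
}
```

The first call returns an `x509.HostnameError` and the second returns nil, which is the same distinction a worker-side client hits when dialing the apiserver on the private IP.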

The sovereign FQDN is already in --tls-san; this change just adds the
private subnet anchor that the multi-node Cilium config in #738
introduced.

Verified live on otech51 (deploy SHA 69de64b): Cilium reached
"Establishing connection to apiserver host=https://10.0.1.2:6443"
correctly with the new k8sServiceHost, but TLS handshake failed on
cert SAN mismatch. After this fix the SAN list will include 10.0.1.2.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 14:35:20 +04:00
e3mrah
69de64ba19
fix(cilium): k8sServiceHost 127.0.0.1 → 10.0.1.2 so workers' Cilium can reach apiserver (#738)
Issue #733 follow-up. The default cpx32 multi-node Sovereign (1 CP + 2
workers) provisioned successfully, but worker nodes stuck NotReady
because cilium-agent on workers crashloop'd:

  Get "https://127.0.0.1:6443/api/v1/namespaces/kube-system":
    dial tcp 127.0.0.1:6443: connect: connection refused

Root cause: `k8sServiceHost: 127.0.0.1` works on the k3s SERVER node
(supervisor binds localhost:6443) but FAILS on every k3s AGENT node
(agent does NOT expose apiserver on localhost — only the supervisor
on :6444). Pre-#733 every Sovereign was solo (worker_count=0), so
this never fired.

Fix: point Cilium at `10.0.1.2`, the CP's stable private IP on the
Sovereign's 10.0.1.0/24 subnet (cp1=10.0.1.2 per main.tf network
block). No-op on the CP (10.0.1.2 IS its own private IP) and works
on workers (which already join the cluster via the same address per
cloudinit-worker.tftpl `K3S_URL=https://${cp_private_ip}:6443`).

Files:
- infra/hetzner/cloudinit-control-plane.tftpl — bootstrap helm install
  values file written to /var/lib/catalyst/cilium-values.yaml
- platform/cilium/chart/values.yaml — Flux bp-cilium HelmRelease
  values (cilium_values_parity_test.go enforces the two stay aligned)

Verified live on otech50: 3× CPX32 servers running, 1 CP Ready, 2
workers registered with k3s but NotReady due to cilium init failure.
After this fix workers should reach Ready, and the Phase-1 watcher
sees all components Ready=True across the multi-node cluster.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 14:23:51 +04:00
e3mrah
7ec25b9736
feat(provisioner): default Sovereign to 3x CPX32 (1 CP + 2 workers) — restore horizontal scale (#736)
Issue #733. Every Sovereign provisioned this week launched with a single
CPX52 control-plane and zero workers, completely discarding horizontal
scalability. Restore the originally agreed shape: 1 CPX32 control plane
+ 2 CPX32 workers (3 nodes × 4 vCPU/8 GB = 12 vCPU/24 GB total — same
aggregate footprint as a CPX52 vertical-scale, but with multi-node fault
tolerance and the architectural shape clusters/_template/ was designed
for).

Changes:
- infra/hetzner/variables.tf — defaults: control_plane_size cx42→cpx32,
  worker_size cx32→cpx32, worker_count 0→2.
- infra/hetzner/main.tf — add hcloud_load_balancer_target.workers so the
  Hetzner LB targets every node (CP + workers); Cilium Gateway DaemonSet
  on every node serves ingress on its NodePort, so any node can absorb
  traffic for genuine horizontal scale.
- infra/hetzner/README.md — sizing rationale rewritten around horizontal
  scale; CPX32 × 3 documented as canonical; CPX52 retained for solo dev.
- ui model — INITIAL_WIZARD_STATE.workerCount 0→2.
- ui StepProvider — first-visit + provider-change defaults workerCount 0→2.
- ui providerSizes — `recommended: true` flag moves cpx52→cpx32; CPX52
  description updated to "solo dev when worker_count=0".

Constraints honoured:
- Existing API requests with explicit controlPlaneSize: 'cpx52' / explicit
  workerCount: 0 keep working — only DEFAULTS change.
- Sub-CPX32 SKUs (cx21/cx31) still allowed via dropdown.
- Contabo single-node Catalyst-Zero is a different code path — unaffected.
- No cron triggers added (event-driven only).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:57:53 +04:00
e3mrah
4946ccd125
feat(bp-catalyst-platform): expose marketplace + tenant wildcard, bump 1.3.0 (closes #710) (#719)
Marketplace exposure for franchised Sovereigns. Otech becomes a SaaS
operator with a single overlay toggle.

Changes
=======

products/catalyst/chart:
- Chart.yaml 1.2.7 → 1.3.0
- values.yaml: ingress.marketplace.enabled toggle (default false) +
  marketplace.{brand,currency,paymentProvider,signupPolicy} surface
- templates/sme-services/marketplace-routes.yaml: HTTPRoute
  marketplace.<sov> with /api/ → marketplace-api, /back-office/ → admin,
  / → marketplace; HTTPRoute *.<sov> → console (per-tenant wildcard)
- templates/sme-services/marketplace-reference-grant.yaml: cross-
  namespace ReferenceGrant from catalyst-system HTTPRoute → sme Services
- .helmignore: stop excluding sme-services/* and marketplace-api/* (only
  *.kustomization.yaml + *.ingress.yaml remain Kustomize-only)
- All sme-services/* + marketplace-api/* manifests wrapped with
  {{ if .Values.ingress.marketplace.enabled }} so non-marketplace
  Sovereigns render the chart unchanged

clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
- chart version 1.2.7 → 1.3.0
- ingress.hosts.marketplace.host: marketplace.${SOVEREIGN_FQDN}
- ingress.marketplace.enabled: ${MARKETPLACE_ENABLED:-false}

infra/hetzner:
- variables.tf: marketplace_enabled var (string "true"/"false", default "false")
- main.tf: thread var into cloudinit-control-plane.tftpl
- cloudinit-control-plane.tftpl: postBuild.substitute.MARKETPLACE_ENABLED
  on bootstrap-kit, sovereign-tls, infrastructure-config Kustomizations

products/catalyst/bootstrap/api/internal/provisioner/provisioner.go:
- Request.MarketplaceEnabled bool (json:"marketplaceEnabled")
- writeTfvars: marketplace_enabled = "true"|"false"

core/pool-domain-manager/internal/allocator/allocator.go:
- canonicalRecordSet adds "marketplace" prefix → marketplace.<sov>
  resolves via PDM at zone-commit time (PR #710 explicit record so
  caches don't depend on the *.<sov> wildcard alone)

DoD ready
=========
- helm template with ingress.marketplace.enabled=false → identical
  manifest set to 1.2.7 (verified locally)
- helm template with ingress.marketplace.enabled=true → emits 17 extra
  resources: 13 sme-services workloads + 2 marketplace-api + 1
  HTTPRoute pair + 1 ReferenceGrant
- pdm tests: TestCanonicalRecordSet, TestCommitDNSShape green
- catalyst-api builds, provisioner cloudinit_path_test green

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 07:47:37 +04:00
e3mrah
6f3e15b1ec
fix(handover): provision JWK Secret on Sovereign + inject SOVEREIGN_FQDN env (Phase-8b followup) (#692)
Two handover bugs caught live on otech48 (2026-05-03):

1. Sovereign-side catalyst-api responded to GET /auth/handover with
   "server misconfiguration: public key unavailable". Root cause: the
   K8s Secret `catalyst-handover-jwt-public` (referenced by the chart's
   optional Secret-volume) was never materialised on the Sovereign,
   so the optional volume mount fell through and the JWK file was
   absent inside the container. 1.2.0 wired the mount but no
   provisioning step created the Secret. Fix mirrors the canonical
   pattern from PR #543 (ghcr-pull) and PR #680 (harbor-robot-token):
   cloud-init now writes the Secret manifest into catalyst-system NS
   and runcmd applies it BEFORE flux-bootstrap, so the Secret exists
   by the time bp-catalyst-platform reconciles. Also moves the chart
   volume mount off the catalyst-api PVC (mountPath
   /etc/catalyst/handover-jwt-public, no subPath) so a leftover empty
   directory in the PVC from pre-#606 installs cannot collide with
   the re-provisioned Secret mount.

2. /auth/handover validator rejected every valid JWT with 401
   "invalid audience" because SOVEREIGN_FQDN was unset on Sovereigns
   — the audience check collapsed to the literal "https://console."
   prefix. The bp-catalyst-platform HelmRelease overlay was already
   setting `global.sovereignFQDN` but the chart template never plumbed
   it through to the Pod env. Added a SOVEREIGN_FQDN env reading
   `.Values.global.sovereignFQDN` (default "" so Catalyst-Zero
   installs, where catalyst-api is the SIGNER not the validator,
   stay clean).
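
A hedged sketch of the failure mode in item 2. The function names are hypothetical and the real validator may differ; the audience format `https://console.<fqdn>` is inferred from the prefix quoted above. The point is that an empty FQDN should fail loudly rather than collapse the expected audience to the bare prefix.

```go
package main

import (
	"errors"
	"fmt"
)

// expectedAudience is a hypothetical helper: it derives the audience a
// handover JWT must carry from the Sovereign's FQDN.
func expectedAudience(sovereignFQDN string) (string, error) {
	if sovereignFQDN == "" {
		// Without this guard the result would be the bare prefix
		// "https://console.", which no real token matches.
		return "", errors.New("SOVEREIGN_FQDN is not set")
	}
	return "https://console." + sovereignFQDN, nil
}

// checkAudience compares the token's aud claim with the expected value.
func checkAudience(tokenAud, sovereignFQDN string) error {
	want, err := expectedAudience(sovereignFQDN)
	if err != nil {
		return err
	}
	if tokenAud != want {
		return fmt.Errorf("invalid audience: got %q, want %q", tokenAud, want)
	}
	return nil
}

func main() {
	aud := "https://console.otech49.omani.works"
	fmt.Println(checkAudience(aud, "otech49.omani.works")) // <nil>
	fmt.Println(checkAudience(aud, ""))                    // SOVEREIGN_FQDN is not set
}
```

Surfacing the missing env as its own error would have pointed at the chart template immediately instead of presenting as a token problem.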

Bumps:
- bp-catalyst-platform 1.2.4 -> 1.2.5
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml HelmRelease pin

Will be verified live on otech49 — fresh provision should reach
https://console.otech49.omani.works/auth/handover?token=... and
exchange to a Keycloak session WITHOUT manual Secret creation.

Issue #606 followup.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 19:47:21 +04:00
e3mrah
d0b574bd68
fix(hetzner-tofu): add powerdns_api_key to templatefile() vars (#687)
PR #686 added var.powerdns_api_key to variables.tf and referenced it as
${powerdns_api_key} in cloudinit-control-plane.tftpl, but missed wiring
it into the templatefile() vars dict in main.tf. Result on otech48:

  Invalid value for "vars" parameter: vars map does not contain key
  "powerdns_api_key", referenced at ./cloudinit-control-plane.tftpl:273

This commit closes the gap: powerdns_api_key now flows from var ->
templatefile vars -> cloud-init -> Secret manifest.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:34:36 +04:00
e3mrah
684759564e
fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager (PR #681 followup) (#686)
* fix(cilium-gateway): listener ports 80/443 → 30080/30443 + LB retarget

cilium-envoy refuses to bind privileged ports (80/443) on Sovereigns
even with all of:

- gatewayAPI.hostNetwork.enabled=true on the Cilium chart
- securityContext.privileged=true on the cilium-envoy DaemonSet
- securityContext.capabilities.add=[NET_BIND_SERVICE]
- envoy-keep-cap-netbindservice=true in cilium-config ConfigMap
- Gateway API CRDs at v1.3.0 (matching cilium 1.19.3 schema)

Repeatable error from cilium-envoy logs across otech45, otech46, otech47:

  listener 'kube-system/cilium-gateway-cilium-gateway/listener' failed
  to bind or apply socket options: cannot bind '0.0.0.0:80':
  Permission denied

The bind() syscall is intercepted by cilium-agent's BPF socket-LB
program in a way that does not honour container capabilities. Even
PID 1 with CapEff=0x000001ffffffffff (all caps) and uid=0 gets
"Permission denied". Bumping Cilium 1.16.5 → 1.19.3 made no difference
(F1, PR #684 still ships — the version bump is sound for other
reasons; the listener bind is just a separate fix).

This commit moves the listeners to high ports (30080/30443) and lets
the Hetzner LB do the public-facing port translation:

  HCLB :80   → CP node :30080  (cilium-gateway HTTP listener)
  HCLB :443  → CP node :30443  (cilium-gateway HTTPS listener)

External users still hit `https://console.<sov>.omani.works/auth/handover`
on port 443; the high port is invisible. High-port bind succeeds
without NET_BIND_SERVICE because the kernel only gates ports below
`net.ipv4.ip_unprivileged_port_start` (default 1024).

Will be verified on otech48: the next fresh provision should serve
console.otech48/auth/handover end-to-end without the 502/timeout
chain seen on otech45–47.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager

PR #681 followup. The new bp-cert-manager-powerdns-webhook (PR #681)
calls contabo's authoritative PowerDNS at pdns.openova.io to write
DNS-01 challenge TXT records for *.otech<N>.omani.works. That webhook
needs an X-API-Key Secret in the Sovereign's cert-manager namespace —
PR #681 didn't ship the materialization seam, so on otech43..otech47
the Secret was missing and the wildcard cert never issued.

This commit closes the seam from contabo to the Sovereign:

1. bp-powerdns chart 1.1.7 to 1.1.8: Reflector annotations on
   openova-system/powerdns-api-credentials extended from "external-dns"
   to "external-dns,catalyst" so contabo catalyst-api can mount the
   API key.

2. bp-powerdns: api.basicAuth.enabled flips default true to false.
   Layered Traefik basicAuth + PowerDNS X-API-Key was double auth that
   blocked machine-to-machine API access from Sovereigns. The X-API-Key
   contract is unchanged.

3. bp-catalyst-platform 1.2.3 to 1.2.4: api-deployment.yaml adds
   CATALYST_POWERDNS_API_KEY env from powerdns-api-credentials/api-key
   secret (optional=true so Sovereign-side catalyst-api Pods that don't
   reflect this still start clean).

4. catalyst-api provisioner.go: new Provisioner.PowerDNSAPIKey field
   reads from CATALYST_POWERDNS_API_KEY env at New(). Stamps onto every
   Request before Validate(). Forwards as tofu var powerdns_api_key.

5. infra/hetzner/variables.tf: new var.powerdns_api_key (sensitive,
   default "").

6. infra/hetzner/cloudinit-control-plane.tftpl: replaces the defunct
   dynadot-api-credentials Secret block (PR #681 dropped
   bp-cert-manager-dynadot-webhook) with a new
   cert-manager/powerdns-api-credentials Secret block. runcmd applies
   it BEFORE Flux reconciles bp-cert-manager-powerdns-webhook.

End-to-end seam mirrors PR #543 ghcr-pull and PR #680 harbor-robot-token.

Will be verified live on otech48 (next provision after this lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:23:27 +04:00
e3mrah
369c229408
fix(cilium-gateway): listener ports 80/443 → 30080/30443 + LB retarget (#685)
cilium-envoy refuses to bind privileged ports (80/443) on Sovereigns
even with all of:

- gatewayAPI.hostNetwork.enabled=true on the Cilium chart
- securityContext.privileged=true on the cilium-envoy DaemonSet
- securityContext.capabilities.add=[NET_BIND_SERVICE]
- envoy-keep-cap-netbindservice=true in cilium-config ConfigMap
- Gateway API CRDs at v1.3.0 (matching cilium 1.19.3 schema)

Repeatable error from cilium-envoy logs across otech45, otech46, otech47:

  listener 'kube-system/cilium-gateway-cilium-gateway/listener' failed
  to bind or apply socket options: cannot bind '0.0.0.0:80':
  Permission denied

The bind() syscall is intercepted by cilium-agent's BPF socket-LB
program in a way that does not honour container capabilities. Even
PID 1 with CapEff=0x000001ffffffffff (all caps) and uid=0 gets
"Permission denied". Bumping Cilium 1.16.5 → 1.19.3 made no difference
(F1, PR #684 still ships — the version bump is sound for other
reasons; the listener bind is just a separate fix).

This commit moves the listeners to high ports (30080/30443) and lets
the Hetzner LB do the public-facing port translation:

  HCLB :80   → CP node :30080  (cilium-gateway HTTP listener)
  HCLB :443  → CP node :30443  (cilium-gateway HTTPS listener)

External users still hit `https://console.<sov>.omani.works/auth/handover`
on port 443; the high port is invisible. High-port bind succeeds
without NET_BIND_SERVICE because the kernel only gates ports below
`net.ipv4.ip_unprivileged_port_start` (default 1024).

Will be verified on otech48: the next fresh provision should serve
console.otech48/auth/handover end-to-end without the 502/timeout
chain seen on otech45–47.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:14:32 +04:00
e3mrah
affcf37923
fix(bp-catalyst-platform): provision harbor-robot-token automatically on Sovereign install (RCA + permanent fix) (#680)
Caught live on otech43–46 — manual placeholder Secret was being created
each iteration. RCA:

The catalyst-api Pod template references the `harbor-robot-token`
Secret via a REQUIRED (non-optional) secretKeyRef. On Sovereign
clusters that Secret was never materialised — only `ghcr-pull` had
the canonical cloud-init + Reflector auto-mirror seam (PR #543). The
chart's old comment said "Reflector mirrors from openova-harbor
namespace into catalyst" but `openova-harbor` doesn't exist on
Sovereigns; that namespace lives only on contabo where the central
Harbor source Secret is administered. Result: every fresh Sovereign's
catalyst-api Pod stuck in CreateContainerConfigError until the
operator hand-created a placeholder Secret.

The token VALUE was already arriving on the Sovereign — Tofu
var.harbor_robot_token is interpolated into
/etc/rancher/k3s/registries.yaml at cloud-init time so containerd
can authenticate against harbor.openova.io. We just never materialised
the same value as a Kubernetes Secret for catalyst-api to mount.

Permanent fix mirrors the canonical `ghcr-pull` seam:

  1. infra/hetzner/cloudinit-control-plane.tftpl write_files block
     emits /var/lib/catalyst/harbor-robot-token-secret.yaml — a
     Secret in flux-system ns with auto-mirror Reflector annotations
     (`reflection-auto-enabled: "true"`).
  2. runcmd applies it BEFORE flux-bootstrap, so the Secret exists
     before any Helm release reconciles.
  3. bp-reflector (slot 05a, already deployed) propagates the Secret
     into every namespace — including catalyst-system — on first
     reconcile tick. catalyst-api's secretKeyRef resolves cleanly,
     Pod starts.
  4. Token rotation flows through `var.harbor_robot_token` →
     re-render Tofu → re-apply cloud-init; Reflector propagates the
     rotation to all mirrored copies on the next watch tick.

`harbor-robot-token` stays NOT optional in the chart: the architecture
mandate is every Sovereign image pull goes through harbor.openova.io;
falling through to docker.io is forbidden (anonymous rate-limit makes
a fresh Hetzner IP unbootable). A missing token must surface
immediately as Pod start failure, never silently mid-provision.

Bumps:
  - bp-catalyst-platform 1.2.2 → 1.2.3 (chart-side change is a
    comment-only update on the secretKeyRef explaining the new seam;
    the Pod spec still references the same Secret name and key).
  - clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml
    HelmRelease version pin → 1.2.3.

No bootstrap-kit dependency changes — bp-reflector's slot-05a position
is unchanged and was already a dependency for ghcr-pull. No
expected-bootstrap-deps.yaml edits needed.

Issue #557 follow-up. Closes the per-Sovereign manual workaround.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 16:54:37 +04:00
e3mrah
dd4148acb6
fix(cilium-gateway): hostNetwork mode + Hetzner LB→80/443 (chart 1.1.5) (#674)
The Cilium gateway-api L7LB nodePort chain was silently broken on
otech45: TCP to LB:443 succeeds, but TLS handshake never completes.
Root cause: Cilium 1.16.5's BPF L7LB Proxy Port (12869) doesn't match
what cilium-envoy actually listens on (verified via /proc/net/tcp on
the cilium-envoy pod — port 12869 not in listening sockets). The
nodePort indirection (31443→envoy:12869) is broken at the redirect
step.

Fix: bind cilium-envoy directly to the host's :80 and :443 via
gatewayAPI.hostNetwork.enabled=true. Hetzner LB forwards public
80→private:80 and 443→private:443 directly (no nodePort indirection).

Two coordinated changes:
  1. platform/cilium/chart/values.yaml: gatewayAPI.hostNetwork.enabled=true
  2. infra/hetzner/main.tf: LB destination_port = 80/443 (was 31080/31443)

bp-cilium chart bumped to 1.1.5.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 15:22:51 +04:00
e3mrah
1734979d74
fix(infra): bump kernel inotify limits (bao init was hitting EMFILE) (#656)
* fix(bp-harbor): grep-oE for password (multi-line tolerant) (chart 1.2.13)

* fix(wizard): blueprint deps from Flux HelmRelease.dependsOn (single source of truth)

The wizard's componentGroups.ts carried hand-maintained `dependencies:
[...]` arrays that deviated from the real Flux install graph in
clusters/_template/bootstrap-kit/*.yaml. Examples (otech34 surfaced
this):

  componentGroups.ts          Flux HelmRelease.dependsOn
  ----------------------      ---------------------------
  keycloak: [cnpg]            keycloak: [cert-manager, gateway-api]
  openbao:  []                openbao:  [spire, gateway-api, cnpg]
  harbor:   [cnpg, seaweedfs, harbor:   [cnpg, cert-manager,
              valkey]                    gateway-api]

Founder's directive: "all the real dependencies are related to real
flux related dependencies, if you are hosting irrelevant hardcoded
baseless wizard catalog dependencies, I dont know where they are
coming from. The single source of truth for the dependencies is
flux!!!" — 2026-05-03

This commit:
  1. Adds scripts/generate-blueprint-deps.sh that parses every
     bootstrap-kit HelmRelease and emits blueprint-deps.generated.json
     keyed by bare component id (bp- prefix stripped on both source
     and target side).
  2. Commits the generated JSON.
  3. Adds products/catalyst/bootstrap/ui/src/data/blueprintDeps.ts
     thin TS wrapper exporting BLUEPRINT_DEPS + depsFor(id).
  4. Patches componentGroups.ts so every RAW_COMPONENT's
     `dependencies` field is OVERRIDDEN at module load with the
     Flux-canonical list (the inline `dependencies: [...]` literals
     are now ignored — Flux is canonical).
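
The prefix-stripping step in item 1 can be sketched as follows. The actual generator is a shell script that emits JSON; this Go version is only an illustration of the mapping, with a hypothetical `normalizeDeps` name and input shape.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// normalizeDeps takes HelmRelease name -> dependsOn names and returns the
// same graph keyed by bare component id, with the "bp-" prefix removed
// on both the source and the target side.
func normalizeDeps(raw map[string][]string) map[string][]string {
	out := make(map[string][]string, len(raw))
	for name, deps := range raw {
		id := strings.TrimPrefix(name, "bp-")
		clean := make([]string, 0, len(deps))
		for _, d := range deps {
			clean = append(clean, strings.TrimPrefix(d, "bp-"))
		}
		sort.Strings(clean) // stable output keeps generated diffs small
		out[id] = clean
	}
	return out
}

func main() {
	raw := map[string][]string{
		"bp-keycloak": {"bp-cert-manager", "bp-gateway-api"},
		"bp-openbao":  {"bp-spire", "bp-gateway-api", "bp-cnpg"},
	}
	deps := normalizeDeps(raw)
	fmt.Println("keycloak:", deps["keycloak"])
	fmt.Println("openbao: ", deps["openbao"])
}
```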

Follow-ups (not in this PR):
  - CI drift check that re-runs the script and diffs the JSON.
  - Strip the inline `dependencies: [...]` arrays entirely once the
    drift check is green.
  - Wire the FlowPage edge-rendering to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(flowpage): replace second hardcoded BOOTSTRAP_KIT_DEPS table with Flux SoT

PR #652 fixed the wizard catalog. FlowPage.tsx had a SECOND independent
hardcoded dep map at lines 105-155 that the founder caught — most
visibly:
  keycloak: ['cert-manager', 'openbao']  ← FALSE; Flux says no openbao
That is why the founder kept seeing the spurious arrow on the Flow page.

Replace the local table with an import of BLUEPRINT_DEPS from
data/blueprintDeps.ts (single source of truth — generated from
clusters/_template/bootstrap-kit/*.yaml by
scripts/generate-blueprint-deps.sh).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(jobs): don't regress status to pending after exec started

helmwatch_bridge.go's OnHelmReleaseEvent unconditionally overwrote the
Job's Status with jobStatusFromHelmState(state) on every event. Flux
oscillates HelmReleases between Reconciling and DependencyNotReady
while a dependency (e.g. bp-openbao waiting on bp-spire) isn't Ready
— helmwatch maps both back to HelmStatePending. The bridge then flips
the row to status='pending' even though an active Execution is
streaming exec log lines (startedAt + latestExecutionId already set).

Founder caught this on otech34's install-external-secrets job:
status='pending' on the Jobs page while Exec Log was actively
tailing.

Fix: monotonic guard — once activeExecID[component] != "" (Execution
allocated), refuse to regress nextStatus to StatusPending. Treat
ongoing-after-start as Running so the row reflects the live stream.
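
A minimal sketch of the monotonic guard, assuming a string-typed status. Only `StatusPending` and the non-empty `activeExecID` check are named in this commit; the other identifiers are illustrative.

```go
package main

import "fmt"

// Status is a hypothetical job status type.
type Status string

const (
	StatusPending Status = "pending"
	StatusRunning Status = "running"
	StatusFailed  Status = "failed"
)

// guardStatus applies the monotonic rule: once an Execution has been
// allocated (activeExecID is non-empty), a pending status coming from the
// watcher is reported as running instead of regressing the row.
func guardStatus(next Status, activeExecID string) Status {
	if activeExecID != "" && next == StatusPending {
		return StatusRunning
	}
	return next
}

func main() {
	fmt.Println(guardStatus(StatusPending, ""))        // pending
	fmt.Println(guardStatus(StatusPending, "exec-42")) // running
	fmt.Println(guardStatus(StatusFailed, "exec-42"))  // failed
}
```

Terminal states pass through untouched, so the guard only suppresses the Reconciling/DependencyNotReady oscillation.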

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(jobs): cascade Failed status through dependsOn (fail-fast)

Founder caught on otech34: install-openbao=failed but
install-external-secrets stayed pending forever ('masking it and
waiting unnecessarily'). Flux's HelmRelease for external-secrets is
in DependencyNotReady, helmwatch maps that to StatePending,
bridge writes Status=pending — no signal that the upstream FAILED
rather than 'still installing'.

Add a post-rollup sweep in deriveTreeView that propagates Failed
through the dependsOn graph. Up to 8 sweeps cover the deepest
bootstrap-kit chain. Idempotent on read; reverses if openbao recovers
because it operates on the live snapshot.
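
The sweep can be sketched as a bounded fixed-point pass over the dependency graph. This is an illustration under assumptions, not the `deriveTreeView` code: statuses are plain strings, only pending jobs are flipped, and the function returns a copy so the stored snapshot is never mutated.

```go
package main

import "fmt"

// maxSweeps bounds the propagation; 8 is the figure quoted in this commit.
const maxSweeps = 8

// cascadeFailed returns a copy of status in which every pending job with
// a failed dependency (direct or transitive) is marked failed.
func cascadeFailed(status map[string]string, dependsOn map[string][]string) map[string]string {
	out := make(map[string]string, len(status))
	for job, s := range status {
		out[job] = s
	}
	for sweep := 0; sweep < maxSweeps; sweep++ {
		changed := false
		for job, deps := range dependsOn {
			if out[job] != "pending" {
				continue
			}
			for _, dep := range deps {
				if out[dep] == "failed" {
					out[job] = "failed"
					changed = true
					break
				}
			}
		}
		if !changed {
			break // fixed point reached before the sweep limit
		}
	}
	return out
}

func main() {
	status := map[string]string{
		"openbao":                 "failed",
		"external-secrets":        "pending",
		"external-secrets-stores": "pending",
	}
	dependsOn := map[string][]string{
		"external-secrets":        {"openbao"},
		"external-secrets-stores": {"external-secrets"},
	}
	fmt.Println(cascadeFailed(status, dependsOn))
}
```

Because the result is derived on every read, a recovered upstream job simply stops producing failed descendants on the next call.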

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): bump kernel inotify limits — bp-openbao init was crashing 'too many open files'

Diagnosed live during otech35: openbao-init pod crash-looped 4×
on 'bao operator init' with:
  failed to create fsnotify watcher: too many open files
Flux mapped this to InstallFailed → RetriesExceeded → cascading through
external-secrets and external-secrets-stores. The wizard masked the
OS-level root cause behind a generic InstallFailed.

Hetzner Ubuntu 24.04 ships fs.inotify.max_user_instances=128 — far
too low for a 35-component bootstrap-kit (k3s kubelet + Flux helm-
controller + 11 CNPG operators + Reflector + Cert-Manager + bao +
keycloak-config-cli + ... each grabs instance slots). The instance
count exhausts within minutes; the next process to ask for an
inotify slot gets EMFILE.

Bump well above k8s/k3s production guidance so future blueprints
don't tickle the same wall:
  fs.inotify.max_user_instances = 8192
  fs.inotify.max_user_watches   = 1048576
  fs.inotify.max_queued_events  = 16384

Applied via /etc/sysctl.d/99-catalyst-inotify.conf + 'sysctl --system'
in runcmd. Permanent across reboots.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 10:32:38 +04:00
e3mrah
40ca4e4d50
fix(infra): registries.yaml mirror needs rewrite — Harbor proxy is /v2/proj/, not /proj/v2/ (#640)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.
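
The boot-time lookup can be sketched as follows. This is an illustration in Python, not the cloud-init script itself (which uses curl); `fetch_text` and `resolve_public_ipv4` are hypothetical names, and only the metadata URL comes from the commit message. The fetcher is injectable so the parsing step can be exercised off-box.

```python
import ipaddress
import urllib.request

# URL quoted in the commit message above.
METADATA_URL = "http://169.254.169.254/hetzner/v1/metadata/public-ipv4"


def fetch_text(url: str, timeout: float = 5.0) -> str:
    """Plain HTTP GET; only meaningful when run on the server itself."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def resolve_public_ipv4(fetch=fetch_text) -> str:
    """Return the node's public IPv4, rejecting anything that is not IPv4."""
    raw = fetch(METADATA_URL).strip()
    # IPv4Address raises ValueError on HTML error pages or empty bodies.
    return str(ipaddress.IPv4Address(raw))
```

Validating the response before using it guards against writing an error page into the kubeconfig `server:` field.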

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wired it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).
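
The server-stamped pattern above can be sketched in Python (the real code is Go and uses a `json:"-"` tag). Everything here is illustrative: `Request`, its `region` field, `CLIENT_FIELDS` and the helper names are assumptions, not the provisioner's actual types.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    region: str = ""  # hypothetical client-settable field
    # Server-stamped: never read from client JSON (mirrors json:"-").
    handover_jwt_public_key: str = ""


# Fields a client payload may set; hypothetical allow-list for the sketch.
CLIENT_FIELDS = {"region"}


def request_from_client_json(payload: str) -> Request:
    """Build a Request, silently dropping anything outside the allow-list."""
    data = json.loads(payload)
    return Request(**{k: v for k, v in data.items() if k in CLIENT_FIELDS})


def stamp_handover_key(req: Request, signer_public_jwk: Optional[str]) -> Request:
    """With no signer configured, the empty default stays in place."""
    if signer_public_jwk:
        req.handover_jwt_public_key = signer_public_jwk
    return req


def tfvars_json(req: Request) -> str:
    """Emit the variables the template expects, including an empty key."""
    return json.dumps(
        {"region": req.region, "handover_jwt_public_key": req.handover_jwt_public_key},
        sort_keys=True,
    )
```

Always emitting the key, even when empty, is what keeps `templatefile()` from failing on a missing vars entry in the no-signer path.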

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauthenticated.
catalyst-api never received the kubeconfig, Phase 1 helmwatch never
started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.
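
A rough sketch of a bearer-hash gate of the kind described above, in Python rather than the handler's Go. It assumes the server stores only the SHA-256 hex digest and the client sends the raw token; the commit message does not spell out the exact scheme, and `hash_token` and `bearer_matches` are hypothetical names.

```python
import hashlib
import hmac


def hash_token(token: str) -> str:
    """SHA-256 hex digest of the raw token."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def bearer_matches(authorization_header: str, stored_sha256_hex: str) -> bool:
    """Constant-time check of a Bearer token against its stored SHA-256."""
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return False
    return hmac.compare_digest(hash_token(token.strip()), stored_sha256_hex)
```

Because this check needs no session cookie, a route guarded only by it can sit outside the session middleware without being left open.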

Caught live during otech23 first end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far was provisioned with an empty token in registries.yaml; containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and:

  Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.

Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN
     env at New().
  2. Stamped onto every Request in Provision() and Destroy() before
     writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted
     from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
     (mirrored from openova-harbor — Reflector-managed on Sovereign
     clusters; copied per-namespace on Catalyst-Zero contabo) as
     CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths
     still come up.

variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now
enforced end-to-end: every image on every Sovereign goes through
harbor.openova.io.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but
the handler only stamped req.HarborRobotToken from p.HarborRobotToken
inside Provision() — Validate() runs in the handler BEFORE Provision()
gets the chance to stamp. Result: every wizard launch returned

  Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays
as defense-in-depth.
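
The ordering bug is easier to see in a small sketch. This is Python, not the Go handler; `ProvisionRequest`, `ValidationError` and `create_deployment` are hypothetical stand-ins, and only the environment variable name comes from the commit message.

```python
import os
from dataclasses import dataclass


class ValidationError(Exception):
    pass


@dataclass
class ProvisionRequest:
    harbor_robot_token: str = ""

    def validate(self) -> None:
        if not self.harbor_robot_token:
            raise ValidationError("Harbor robot token is required")


def create_deployment(req: ProvisionRequest, env=os.environ) -> ProvisionRequest:
    # Stamp first: validate() must see the server-side value.
    req.harbor_robot_token = env.get("CATALYST_HARBOR_ROBOT_TOKEN", "")
    req.validate()
    return req
```

Swapping the two statements in `create_deployment` reproduces the rejection described above even when the variable is set.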

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like
  https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
  https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at
  https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(the project name comes right after /v2/, ahead of the image path, not as a prefix before /v2/).
Harbor returns its SPA UI HTML (status 200, content-type text/html) for the
wrong shape; containerd then errors with:
  "unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during
otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed. Verified manually:

  curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  -> 200 application/vnd.docker.distribution.manifest.list.v2+json

This unblocks every Sovereign image pull through the central Harbor.
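
To illustrate the URL shape, here is a small Python sketch of what a per-mirror rewrite does to the repository path. The rule `"^(.*)$" -> "proxy-dockerhub/$1"` is an assumption about how such a rewrite could be written, not a copy of the repository's registries.yaml, and `apply_rewrite` and `manifest_url` are hypothetical helpers.

```python
import re

HARBOR_HOST = "https://harbor.openova.io"  # bare endpoint, as in the fix


def apply_rewrite(repository: str, pattern: str, replacement: str) -> str:
    """Apply one rewrite rule; $1-style groups are translated for re.sub."""
    py_replacement = re.sub(r"\$(\d+)", r"\\\1", replacement)
    return re.sub(pattern, py_replacement, repository, count=1)


def manifest_url(repository: str, reference: str) -> str:
    """Registry v2 manifest URL for an already-rewritten repository path."""
    return f"{HARBOR_HOST}/v2/{repository}/manifests/{reference}"


repo = apply_rewrite("rancher/mirrored-pause", r"^(.*)$", "proxy-dockerhub/$1")
print(manifest_url(repo, "3.6"))
```

The printed URL has the project name directly after `/v2/`, matching the manually verified curl target above.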

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 23:22:21 +04:00
e3mrah
0ee309aa8b
fix(infra+api): wire handover_jwt_public_key end-to-end through tofu provisioning (#636)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wired it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 22:28:44 +04:00
e3mrah
96a5e3a20e
fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service (#635)
PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 22:14:23 +04:00
e3mrah
169ba2f20a
fix(infra): restore handover-jwt-public.jwk cloud-init write + variables.tf (#623)
PR #611 squash accidentally reverted the Phase-8b infra additions from PR #615
(92fdda42). Restores:
- cloudinit-control-plane.tftpl: write_files entry for /var/lib/catalyst/handover-jwt-public.jwk (mode 0600)
- variables.tf: handover_jwt_public_key variable (sensitive, default empty)

Without these, new Sovereign provisioning runs will not write the public key
to disk and auth/handover on the Sovereign will return 503 (key unavailable).

Co-authored-by: e3mrah <e3mrah@openova.io>
2026-05-02 19:21:16 +04:00
e3mrah
b5c9839da7
feat(phase-8b): sovereign wizard auth-gate + handover JWT minting + Playwright CI fixes (#611)
Squash of PR #611 (feat/607) + PR #615 (feat/605) Phase-8b deliverables:

UI:
- AuthCallbackPage: mode-aware dispatch (catalyst-zero → magic-link server
  callback; sovereign → client-side OIDC token exchange via oidc.ts)
- Router: sovereign console routes (/console/*), DETECTED_MODE index redirect,
  authCallbackRoute dedup fix, authHandoverRoute safety net
- StepSuccess: mints RS256 handover JWT via POST /deployments/{id}/mint-handover-token
  before redirecting operator to Sovereign console (falls back to plain URL on error)

API:
- main.go: wires handoverjwt.LoadOrGenerate signer from CATALYST_HANDOVER_KEY_PATH env
- deployments.go: stamps HandoverJWTPublicKey from signer.PublicJWK() at create time
- provisioner.go: injects HandoverJWTPublicKey into Tofu vars JSON
- auth.go: /auth/handover endpoint for seamless single-identity flow

Infra:
- cloudinit-control-plane.tftpl: writes handover JWT public JWK to /var/lib/catalyst/
- variables.tf: handover_jwt_public_key variable (sensitive, default empty)

Chart:
- api-deployment.yaml / ui-deployment.yaml / values.yaml: expose handover JWT env vars

Playwright CI fixes:
- playwright-smoke.yaml / cosmetic-guards.yaml: health-check URL /sovereign/wizard → /wizard
- playwright.config.ts: BASEPATH default /sovereign → / + baseURL construction fix
- cosmetic-guards.spec.ts: provision URL /sovereign/provision/* → /provision/*
- sovereign-wizard.spec.ts: WIZARD_URL /sovereign/wizard → /wizard

Closes #605, #606, #607. Fixes Playwright CI (#142 sovereign wizard smoke tests).

Co-authored-by: e3mrah <e3mrah@openova.io>
2026-05-02 19:17:56 +04:00
e3mrah
92fdda42d7
feat(catalyst-api+infra): Phase-8b handover JWT minting on Catalyst-Zero (Closes #605)
Self-merged per CLAUDE.md. Playwright UI smoke passes; the cosmetic-guards failure pre-exists on main and is unrelated to this PR. Resolves #605.
2026-05-02 19:07:27 +04:00
e3mrah
5a403e66b1
fix(tls): DNS-01 wildcard TLS chain — solverName pdns, NodePort 30053, dynadot test fix (#582)
* fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase

Harbor upstream always connects to a database named 'registry'
(harbor.database.external.coreDatabase default). The CNPG Cluster was
initialised with database='harbor', causing:

  FATAL: database "registry" does not exist (SQLSTATE 3D000)

Fix: change postgres.cluster.database default from 'harbor' → 'registry'
in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap
and Harbor's coreDatabase now use 'registry'.

Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run
against harbor-pg-1. harbor-core is now 1/1 Running.

Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tls): DNS-01 wildcard TLS chain — solverName, NodePort 30053, dynadot test fix

Five independent fixes that together complete the DNS-01 wildcard TLS chain
for per-Sovereign certificate autonomy:

1. cert-manager-powerdns-webhook solverName mismatch (root cause of #550 echo):
   - values.yaml: `webhook.solverName: powerdns` → `pdns`
   - The zachomedia binary's Name() returns "pdns" (hardcoded). cert-manager
     calls POST /apis/<groupName>/v1alpha1/<solverName>; when solverName is
     "powerdns" cert-manager gets 404 → "server could not find the resource".

2. cert-manager-dynadot-webhook solver_test.go mock format:
   - writeOK() and error injection used old ResponseHeader-wrapped format
   - Real api3.json returns ResponseCode/Status directly in SetDnsResponse
   - This caused the image build to fail at ccc38987 so the dynadot fix
     never shipped; solver tests now pass cleanly (go test ./... OK)

3. PowerDNS NodePort 30053 anycast overlay (bootstrap-kit and template):
   - _template/bootstrap-kit/11-powerdns.yaml: adds anycast NodePort values
   - omantel + otech bootstrap-kit: same NodePort 30053 overlay applied
   - anycast-endpoint.yaml: optional nodePort field rendered in port list

4. Hetzner LB + firewall for DNS port 53 (infra/hetzner/main.tf):
   - hcloud_load_balancer_service.dns: TCP:53 → NodePort 30053
   - Firewall: TCP+UDP :53 from 0.0.0.0/0,::/0

5. dynadot-client JSON parsing fix (core/pkg/dynadot-client):
   - AddRecord + SetFullDNS: struct no longer wraps respHeader in ResponseHeader
   - client_test.go: mock responses updated to real api3.json format

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:49:58 +04:00
e3mrah
73ae746637
fix(cloud-init): install Gateway API v1.1.0 CRDs before cilium so operator registers gateway controller (#581)
Root cause (otech22 2026-05-02): Cilium operator checks for Gateway API
CRDs at startup and disables its gateway controller if they are absent —
a static, one-shot decision. Cloud-init installs k3s+Cilium first, then
Flux reconciles bp-gateway-api minutes later, so the operator always
starts without CRDs and never recovers. All 8 HTTPRoutes were orphaned.

Three-part permanent fix:

1. cloud-init: apply Gateway API v1.1.0 experimental CRDs (incl.
   TLSRoute) BEFORE the Cilium helm install. Cilium 1.16.x requires
   TLSRoute CRD to be present; without it the operator's capability
   check fails entirely and disables the gateway controller.

2. bp-cilium (1.1.2 → 1.1.3): add gatewayAPI.gatewayClass.create: "true"
   to force GatewayClass creation regardless of CRD presence at Helm
   render time. Upstream default "auto" skips GatewayClass when the
   gateway API CRDs are absent at install time (Capabilities check).

3. bp-gateway-api (1.0.0 → 1.1.0): downgrade CRDs from v1.2.0 to v1.1.0
   and ship experimental channel (TLSRoute, TCPRoute, UDPRoute,
   BackendLBPolicy, BackendTLSPolicy). Gateway API v1.2.0 changed
   status.supportedFeatures from string[] to object[]; Cilium 1.16.5
   writes the old string format and the v1.2.0 CRD rejects the status
   patch with "must be of type object: string", leaving GatewayClass
   permanently Unknown/Pending. v1.1.0 retains string schema.

Upgrade path: bump bp-gateway-api + bp-cilium together when Cilium ≥ 1.17
adopts the v1.2.0 object schema for supportedFeatures.
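
The schema mismatch in part 3 can be mimicked with a toy type check. This Python sketch is not how the API server validates CRDs; `status_patch_errors` is a hypothetical helper and the sample values are illustrative only.

```python
def status_patch_errors(supported_features: list, item_type: type) -> list[str]:
    """Mimic a per-item type check on status.supportedFeatures."""
    errors = []
    for index, item in enumerate(supported_features):
        if not isinstance(item, item_type):
            errors.append(
                f"supportedFeatures[{index}]: expected {item_type.__name__}, "
                f"got {type(item).__name__}"
            )
    return errors


# A controller writing the older string form:
old_style = ["HTTPRoute", "Gateway"]
print(status_patch_errors(old_style, str))   # string schema (v1.1.0-style)
print(status_patch_errors(old_style, dict))  # object schema (v1.2.0-style)
```

The second call reports one error per item, which corresponds to the rejected status patch that left GatewayClass stuck.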

Closes #503

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:23:32 +04:00
e3mrah
9e53d9e127
feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (#557) (#563)
* docs(wbs): Mermaid DAG shows actual Phase-8a dependency cascade

Per founder corrective: existing diagram missed the real blockers
surfaced during otech10..otech22 burns. The image-pull-through gap
(#557) and the cross-namespace secret gap (#543, #544) gate every
workload pull from a public registry — without them, Sovereign hits
DockerHub anonymous rate-limit on first provision and 30+ HRs are
ImagePullBackOff/CreateContainerConfigError.

Adds:
- Phase 0b · Image pull-through (#557 + #557B Sovereign-Harbor swap +
  #557C charts global.imageRegistry templating). Edges to NATS / Gitea
  / Harbor / Grafana / Loki / Mimir / PowerDNS / Crossplane /
  cert-manager-powerdns-webhook / Trivy / Kyverno / SPIRE / OpenBao
- Phase 0c · Cross-namespace secrets (#543 ghcr-pull Reflector + #544
  powerdns-api-credentials reflect). Edges to bp-catalyst-platform and
  bp-cert-manager-powerdns-webhook
- Phase 1 additions: #542 kubeconfig CP-IP fix and #547 helmwatch
  38-HR threshold both gate Phase 8a integration test
- Phase 0b → Phase 8b edge: post-handover Sovereign-Harbor swap is
  what makes "zero contabo dependency" DoD-met possible

WBS now reflects the cascade observed live, not the pre-Phase-8a model.

* feat(platform): add global.imageRegistry to bp-cilium/cert-manager/cert-manager-powerdns-webhook/sealed-secrets (PR 1/3, #560)

- bp-cilium 1.1.1→1.1.2: global.imageRegistry stub added; upstream cilium
  subchart does not expose a single registry knob — per-Sovereign overlays
  wire specific image.repository fields alongside this value.
- bp-cert-manager 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  chart exposes per-component image.registry knobs documented in the comment.
- bp-cert-manager-powerdns-webhook 1.0.2→1.0.3: global.imageRegistry stub
  added + deployment.yaml templated to prefix the webhook image repository
  when the value is non-empty. Verified: helm template with
  --set global.imageRegistry=harbor.openova.io produces
  harbor.openova.io/zachomedia/cert-manager-webhook-pdns:<appVersion>.
- bp-sealed-secrets 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  subchart exposes sealed-secrets.image.registry for overlay wiring.

All four charts render clean with default values (empty imageRegistry).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(infra/hetzner): registries.yaml mirror + harbor_robot_token var (openova-io/openova#557)

Add /etc/rancher/k3s/registries.yaml to Sovereign cloud-init so containerd
transparently routes all five public-registry pulls through the central
harbor.openova.io pull-through proxy (Option A of #557).

- cloudinit-control-plane.tftpl: new write_files entry for
  /etc/rancher/k3s/registries.yaml (written BEFORE k3s install so
  containerd reads the mirror config at startup). Mirrors docker.io,
  quay.io, gcr.io, registry.k8s.io, ghcr.io through the respective
  harbor.openova.io/proxy-* projects. Auth via robot$openova-bot.
- variables.tf: new harbor_robot_token variable (sensitive, default "")
  for the robot account token stored in openova-harbor/harbor-robot-token
  K8s Secret on contabo and forwarded by catalyst-api at provision time.
- main.tf: wire harbor_robot_token into the templatefile() call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:49:13 +04:00
e3mrah
ccc38987c2
fix(tls): bp-cert-manager-dynadot-webhook slot 49b + DNS-01 JSON bug (Closes #550) (#558)
Root cause: bootstrap-kit installs bp-cert-manager-powerdns-webhook (slot 49)
but the letsencrypt-dns01-prod ClusterIssuer wires to the dynadot webhook
(groupName: acme.dynadot.openova.io). Without slot 49b the APIService for
acme.dynadot.openova.io does not exist → cert-manager gets "forbidden" on
every ChallengeRequest → sovereign-wildcard-tls stays in Issuing indefinitely
→ HTTPS gateway has no cert → SSL_ERROR_SYSCALL on the handover URL.

Changes:
- core/pkg/dynadot-client: fix SetDnsResponse JSON key (was SetDns2Response,
  API returns SetDnsResponse); change ResponseCode to json.Number (API returns
  integer 0, not string "0"); update tests to match real API response format
- platform/cert-manager-dynadot-webhook/chart:
  - rbac.yaml: add domain-solver ClusterRole + ClusterRoleBinding so
    cert-manager SA can CREATE on acme.dynadot.openova.io (the "forbidden" fix)
  - values.yaml: add certManager.{namespace,serviceAccountName}, clusterIssuer.*
    and privateKeySecretRefName; add rbac.create comment for domain-solver
  - certificate.yaml: trunc 64 on commonName (was 76 bytes, cert-manager rejects >64)
  - clusterissuer.yaml: new template (skip-render default, enabled via overlay)
  - deployment.yaml: add imagePullSecrets support (required for private GHCR)
  - Chart.yaml: bump to 1.1.0
- clusters/_template/bootstrap-kit:
  - 49b-bp-cert-manager-dynadot-webhook.yaml: new slot (PRE-handover issuer)
  - kustomization.yaml: add 49b entry
- infra/hetzner:
  - variables.tf: add dynadot_managed_domains variable
  - main.tf: pass dynadot_{key,secret,managed_domains} to cloud-init template
  - cloudinit-control-plane.tftpl: write cert-manager/dynadot-api-credentials
    Secret + apply it before Flux reconciles bootstrap-kit
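
A hedged sketch of the response parsing described in the dynadot-client change, in Python rather than Go. The envelope key `SetDnsResponse` and the integer `ResponseCode` come from the commit message; the `Status` value in the usage below and the helper name `parse_set_dns_response` are assumptions.

```python
import json


def parse_set_dns_response(body: str) -> tuple[int, str]:
    """Return (code, status) from a SetDnsResponse envelope."""
    envelope = json.loads(body)["SetDnsResponse"]
    # Accept 0 and "0" alike, similar in spirit to Go's json.Number.
    return int(envelope["ResponseCode"]), str(envelope.get("Status", ""))
```

Looking the envelope up under the old `SetDns2Response` key would raise `KeyError` here, the Python analogue of the silent mis-parse the fix addresses.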

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:42:13 +04:00
e3mrah
b2307e290d
fix: bp-reflector + rename ghcr-pull-secret->ghcr-pull (Closes #543) (#554)
Part A — bp-reflector blueprint:
- Add clusters/_template/bootstrap-kit/05a-reflector.yaml (slot 05a,
  dependsOn bp-cert-manager) — installs emberstack/reflector v7.1.288
  via the bp-reflector OCI wrapper chart.
- Register in bootstrap-kit/kustomization.yaml.
- Add platform/reflector/chart/ wrapper (Chart.yaml + values.yaml):
  single replica, 32Mi memory, ServiceMonitor off by default.

Part B — annotate flux-system/ghcr-pull + rename in charts:
- infra/hetzner/cloudinit-control-plane.tftpl: add four Reflector
  annotations to the ghcr-pull Secret written at cloud-init time so
  Reflector auto-mirrors it to every namespace on first boot.
- Rename imagePullSecrets from ghcr-pull-secret to ghcr-pull in:
  api-deployment.yaml, ui-deployment.yaml,
  marketplace-api/deployment.yaml, and all 11 sme-services/*.yaml
  (14 total occurrences).
- Bump bp-catalyst-platform chart 1.1.12->1.1.13; update bootstrap-kit
  HelmRelease version reference to match.

Root cause: the canonical secret name is ghcr-pull (written by
cloud-init as /var/lib/catalyst/ghcr-pull-secret.yaml). Charts were
referencing ghcr-pull-secret (wrong name), causing ImagePullBackOff
on all Catalyst pods on every new Sovereign.
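
A name mismatch like this is cheap to lint for. The sketch below is a hypothetical Python check over already-parsed pod specs, not something the repository ships; it treats `ghcr-pull` as the only expected pull secret, which would need loosening for charts that legitimately use others.

```python
CANONICAL_PULL_SECRET = "ghcr-pull"


def wrong_pull_secrets(pod_spec: dict) -> list[str]:
    """Return imagePullSecrets names that differ from the canonical one."""
    names = [ref.get("name", "") for ref in pod_spec.get("imagePullSecrets", [])]
    return [name for name in names if name != CANONICAL_PULL_SECRET]
```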

Runtime hotfix applied to otech22: both ghcr-pull and ghcr-pull-secret
propagated to 33 namespaces via kubectl; non-Running pods bounced.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 12:17:51 +04:00
e3mrah
5b55d65461
fix(infra): kubeconfig points at CP public IP not LB IP (Closes #542) (#546)
The Hetzner LB only forwards 80/443 (Cilium Gateway ingress); 6443 is
exposed directly on the CP node via firewall rule (main.tf:51-56,
0.0.0.0/0 → CP:6443). Previous cloud-init rewrote kubeconfig server: to
the LB's public IPv4, which silently failed with "connect: connection
refused" — catalyst-api helmwatch could never observe HelmReleases on
the new Sovereign, so the wizard jobs page stayed PENDING for every
install-* job for 50+ minutes after the cluster was actually healthy.

Pass control_plane_ipv4 (= hcloud_server.control_plane[0].ipv4_address)
through the templatefile() call and rewrite k3s.yaml's 127.0.0.1:6443 to
that IP instead. Same firewall already opens 6443 to 0.0.0.0/0 directly
on the CP, so this is reachable from contabo without any LB / firewall
changes.
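
The rewrite itself is a one-line substitution. A minimal Python sketch, assuming the kubeconfig contains the default `https://127.0.0.1:6443` server entry; `rewrite_server` is a hypothetical name and the cloud-init template may well do this with a shell tool instead.

```python
LOCAL_SERVER = "https://127.0.0.1:6443"


def rewrite_server(kubeconfig_text: str, control_plane_ipv4: str) -> str:
    """Point the kubeconfig at the CP node's public IP, keeping port 6443."""
    return kubeconfig_text.replace(LOCAL_SERVER, f"https://{control_plane_ipv4}:6443")
```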

Permanent: every otechN provisioning from this commit forward will PUT
back a kubeconfig that catalyst-api can actually connect to.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 11:55:48 +04:00
e3mrah
66ff717fbc
fix(infra): reduce bootstrap Kustomization timeouts 30m→5m to unblock iterative fixes (closes #492) (#500)
Phase-8a bug #17 (otech8 deployment 1bfc46347564467b, 2026-05-01):
when the FIRST apply of bootstrap-kit was unhealthy (cilium crash-loop
from issue #491), kustomize-controller held the revision lock for the
full 30m health-check timeout and refused to pick up new GitRepository
revisions. Even though Flux fetched fix `66ea39f0` from main within 1
minute, bootstrap-kit's lastAttemptedRevision stayed pinned to the OLD
SHA `0765e89a` for the full 30 minutes. With cilium broken, the wait
would never finish, no new revision would ever apply, and the operator
was forced to wipe + reprovision from scratch. The same pathology
would repeat on every iteration unless the timeout shape changed.

Approach: Option A (timeout reduction). Drops `spec.timeout` on all
three Flux Kustomizations in the cloud-init template — bootstrap-kit,
sovereign-tls, infrastructure-config — from 30m to 5m. We KEEP
`wait: true` so downstream `dependsOn: bootstrap-kit` declarations
still get a consolidated "every HR Ready=True" signal. We do NOT
adjust `interval` (5m is correct).

Why 5m specifically: matches the GitRepository poll interval. Failed
reconciles release the revision lock within ~6m worst case so a fresh
fix on main gets applied on the next poll. Anything shorter risks
tripping legitimately-slow CRD installs; anything longer re-introduces
the iteration-stall pathology #492 documents.

Why not Option B (wait: false): would break the dependsOn chain. The
infrastructure-config Kustomization needs bootstrap-kit's HRs Ready
before it applies Provider/ProviderConfig manifests that talk to
Hetzner. Flipping wait: false would let infra-config apply prematurely.

Why not Option C (tighter retryInterval): doesn't address the root
cause. retryInterval governs how often to retry AFTER a failure;
spec.timeout is what holds the revision lock during a failed wait.

Test: kustomization_timeout_test.go (new) locks all three timeouts at
exactly 5m AND blocks any operative `timeout: 30m` regression AND
asserts wait: true is retained. Three assertions, one for each failure
mode (regression to 30m, accidental 4th Kustomization without test
update, drive-by flip to wait: false).
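
The three guards can be approximated as below. This is a Python sketch of the idea only; the actual test is the Go file named above, and `check_kustomization_template` plus its regexes are assumptions about how such a check could be written against the rendered template text.

```python
import re


def check_kustomization_template(template: str, expected_count: int = 3) -> list[str]:
    """Return a list of problems; an empty list means the guard passes."""
    problems = []
    # Only operative lines match; commented-out 'timeout:' lines are ignored.
    timeouts = re.findall(r"^\s*timeout:\s*(\S+)\s*$", template, flags=re.MULTILINE)
    if any(value != "5m" for value in timeouts):
        problems.append(f"non-5m timeout found: {timeouts}")
    if len(timeouts) != expected_count:
        problems.append(f"expected {expected_count} timeouts, found {len(timeouts)}")
    if re.search(r"^\s*wait:\s*false\s*$", template, flags=re.MULTILINE):
        problems.append("wait: false found")
    return problems
```

Pinning the count as well as the value means a fourth Kustomization cannot be added without the check being updated alongside it.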

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 00:34:35 +04:00