Commit Graph

2464 Commits

Author SHA1 Message Date
hatiyildiz
1a65d9120d fix(controllers): NATS consume-leg for D35 (organization + sandbox)
PR #1626 wired the publish-leg (tenant + billing → NATS JetStream
catalyst.<domain>.<event>). The consume-leg was missing: no in-cluster
controller subscribed, so D35 (NATS round-trip end-to-end) stayed yellow
even though the publish leg shipped.

This PR adds:

- core/controllers/pkg/natsbus: minimal JetStream subscriber shared by
  Group-C controllers. Self-contained (no dep on core/services/shared
  which pulls in franz-go/Kafka the controllers never touch).
- core/controllers/organization/internal/controller/nats_bridge.go:
  subscribes to catalyst.tenant.created + catalyst.billing.order.placed,
  patches openova.io/last-event-observed-at + ...-subject annotations on
  the matching Organization CR. The annotation patch triggers an
  informer event → controller-runtime enqueues Reconcile within ~50ms
  instead of waiting for the 30s requeue fallback.
- core/controllers/sandbox/internal/controller/nats_bridge.go: same
  pattern for catalyst.tenant.sandbox_requested. Looks up Sandbox CR
  using the same `sandbox-<sanitised-email>` naming convention
  tenant-service's SandboxOrchestrator (PR #1633) writes under.
- main.go wiring in both controllers reads NATS_URL from env. Unset =
  log "consume-leg disabled" + continue (informer requeue fallback
  intact). The 30s RequeueAfter inside r.Reconcile is unchanged — NATS
  is an accelerator, not the only path.

Idempotency: ev.Timestamp is the broker-side timestamp, so duplicate
JetStream delivery produces a byte-stable annotation patch and
controller-runtime does NOT enqueue a redundant Reconcile.
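
The idempotency argument above can be illustrated with a minimal sketch. It assumes a hypothetical `envelope` type and a JSON merge patch that stamps only the `openova.io/last-event-observed-at` annotation; the helper names are illustrative and not taken from the natsbus package. Because the annotation value is derived solely from the envelope timestamp, a redelivered envelope yields the same patch bytes, which an API server would treat as a no-op.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Annotation key quoted in the commit message.
const observedAtKey = "openova.io/last-event-observed-at"

// envelope is a hypothetical stand-in for the event envelope; only the
// Timestamp field is named in the commit message.
type envelope struct {
	Subject   string
	Timestamp time.Time
}

// buildAnnotationPatch renders a JSON merge patch whose only input is the
// envelope timestamp, so the same envelope always produces the same bytes.
func buildAnnotationPatch(ev envelope) ([]byte, error) {
	patch := map[string]any{
		"metadata": map[string]any{
			"annotations": map[string]string{
				observedAtKey: ev.Timestamp.UTC().Format(time.RFC3339Nano),
			},
		},
	}
	return json.Marshal(patch)
}

// sampleEnvelope returns a fixed envelope used for the demonstration.
func sampleEnvelope() envelope {
	return envelope{
		Subject:   "catalyst.tenant.created",
		Timestamp: time.Date(2026, 5, 18, 21, 0, 0, 0, time.UTC),
	}
}

func main() {
	first, err := buildAnnotationPatch(sampleEnvelope())
	if err != nil {
		panic(err)
	}
	// Simulate a duplicate JetStream delivery of the same envelope.
	second, err := buildAnnotationPatch(sampleEnvelope())
	if err != nil {
		panic(err)
	}
	fmt.Println(string(first))
	fmt.Println("byte-stable:", string(first) == string(second))
}
```

Stamping the local receive time instead would make every redelivery a real change and trigger an extra Reconcile.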

Tests cover Ack/Nak/Ack-to-skip dispatch (subscriber_test.go), the
happy path, the no-matching-CR soft miss, duplicate-envelope no-churn,
malformed JSON poison-pill, and the publish-side ↔ consume-side name
derivation lockstep for Sandbox CRs.

HARD CONSTRAINT respected: no credential mutations — bridges read only
the envelope + the target CR, never Secrets or Keycloak SA creds.

Refs #1835 (D35 round-trip end-to-end), Refs #1776 (D35b sandbox).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:41:44 +02:00
github-actions[bot]
de53b39d13 deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.25 2026-05-18 21:05:55 +00:00
github-actions[bot]
8b33188019 deploy: bump bp-newapi upstream v0.13.2 chart 1.4.21 2026-05-18 21:04:48 +00:00
e3mrah
cf35b4a9b6
fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858)
A17 (#1855) hot-patched 6 drifted blueprints (cilium, cert-manager, flux,
openbao, keycloak, gitea) where blueprint.yaml spec.version had silently
fallen behind chart/Chart.yaml version, breaking
TestBootstrapKit_BlueprintCardsHaveRequiredFields. The structural root
cause: the TBD-A6 auto-bump hook in blueprint-release.yaml updated only
clusters/_template/bootstrap-kit/<N>-<chart>.yaml pins on every chart
publish — never the upstream platform/<bp>/blueprint.yaml.

This PR extends the auto-bump hook to lockstep platform/<bp>/blueprint.yaml
spec.version whenever Chart.yaml version bumps. Both file edits land in
the SAME commit (subject becomes `deploy(<chart>): bump bootstrap-kit pin
X -> Y (auto, Refs TBD-A6)` with a secondary line noting the blueprint
lockstep). Idempotent reset-and-rewrite retry preserved for the existing
parallel-matrix race case.

Workflow changes (.github/workflows/blueprint-release.yaml):
  * New step `bump_blueprint` after `bump_pin` — locates
    ${matrix.path}/blueprint.yaml OR ${matrix.path}/chart/blueprint.yaml
    (handles both platform-leaf and products-umbrella conventions),
    filters to kind:Blueprint (defensive against CRD yaml at the
    products/catalyst/chart/crds path), reads current spec.version at
    2-space indent, sed-rewrites to CHART_VERSION, verifies post-write.
  * Commit step renamed to "Commit + push bootstrap-kit pin bump +
    blueprint.yaml lockstep"; stages both files, single commit, with
    convergent retry on conflict.
  * Summary block surfaces both bumps separately.

Regression test (tests/e2e/bootstrap-kit/main_test.go):
  * New TestBootstrapKit_BlueprintVersionLockstepSweep — walks
    platform/* and products/*, discovers every Blueprint manifest with
    a sibling Chart.yaml, asserts spec.version == Chart.yaml version.
    Covers ALL ~70 blueprints, not just the canonical 10 kit ones the
    existing TestBootstrapKit_BlueprintCardsHaveRequiredFields gates.
  * Failure messages name the file, drift direction, and the exact sed
    command to fix — drift remediation is mechanical.
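
A rough sketch of the lockstep assertion described above, under stated assumptions: it uses line-based regular expressions rather than a YAML parser, the sample manifests are invented for illustration, and `checkLockstep` is a hypothetical helper, not the code in main_test.go.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Top-level "version:" key in Chart.yaml.
	chartVersionRe = regexp.MustCompile(`(?m)^version:[ \t]*"?([^"\s]+)"?[ \t]*$`)
	// "version:" at 2-space indent, searched only after the "spec:" line.
	specVersionRe = regexp.MustCompile(`(?m)^  version:[ \t]*"?([^"\s]+)"?[ \t]*$`)
)

func chartVersion(chartYAML string) string {
	if m := chartVersionRe.FindStringSubmatch(chartYAML); m != nil {
		return m[1]
	}
	return ""
}

func specVersion(blueprintYAML string) string {
	_, afterSpec, found := strings.Cut(blueprintYAML, "\nspec:\n")
	if !found {
		return ""
	}
	if m := specVersionRe.FindStringSubmatch(afterSpec); m != nil {
		return m[1]
	}
	return ""
}

// checkLockstep returns nil when both versions agree and an error naming
// the file and both values otherwise.
func checkLockstep(blueprintPath, chartYAML, blueprintYAML string) error {
	chart, spec := chartVersion(chartYAML), specVersion(blueprintYAML)
	if chart == "" || spec == "" {
		return fmt.Errorf("%s: could not read versions (chart=%q, spec=%q)", blueprintPath, chart, spec)
	}
	if chart != spec {
		return fmt.Errorf("%s: spec.version %s != Chart.yaml version %s", blueprintPath, spec, chart)
	}
	return nil
}

func main() {
	// Invented sample content; real manifests carry more fields.
	chart := "apiVersion: v2\nname: bp-velero\nversion: 1.2.2\n"
	blueprint := "kind: Blueprint\nmetadata:\n  name: bp-velero\nspec:\n  version: 1.2.1\n"
	fmt.Println(checkLockstep("platform/velero/blueprint.yaml", chart, blueprint))
}
```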

Drift cleanup (mandatory companion, same shape as A17/#1855):
  26 Application-Blueprint blueprints whose spec.version had been left
  at 1.0.0 / 0.1.0 while Chart.yaml moved forward — synced down to
  Chart.yaml as authoritative. All currently surface in the new sweep
  test; without the cleanup the test would block this PR (and every
  subsequent one). Affected: alloy, cert-manager-{dynadot,powerdns}-webhook,
  cluster-autoscaler-hcloud, cnpg, crossplane-claims, external-secrets[-stores],
  falco, grafana, guacamole, harbor, hcloud-csi, k8s-ws-proxy, mimir,
  netbird, newapi, openclaw, powerdns, seaweedfs, self-sovereign-cutover,
  trivy, valkey, velero, vpa, products/dmz-vcluster.

After this lands, the next chart-version bump in any platform/<bp>/ folder
auto-converges all three artifacts (Chart.yaml, blueprint.yaml,
bootstrap-kit pin) in a single bot commit. No more manual collector PRs;
no more silent drift between chart and Blueprint manifest.

Closes #1856.
Refs #1855 (A17 hot-patch this replaces structurally), #1713 (original TBD-A6 auto-bump hook).

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:04:22 +04:00
e3mrah
2484c8a3de
fix(bp-velero): bump 1.2.1 -> 1.2.2 to force a publish (Closes #1799) (#1846)
TBD-A13: `ghcr.io/openova-io/bp-velero:1.2.1` returns not-found because
the 1.2.1 bump in platform/velero/chart/Chart.yaml shipped only in the
initial-fill commit (`e5c2797c` "deploy: bump sandbox-mcp-server image
to cadc7b5") which never triggered the blueprint-release workflow. As a
result every fresh Sovereign's bp-velero HelmRelease (slot 34) is stuck
InProgress and the bootstrap-kit kustomization fails its health check.

GHCR currently has 1.0.0, 1.1.0, 1.2.0 — confirmed via
`/orgs/openova-io/packages/container/bp-velero/versions`.

Bump to 1.2.2 (chart + bootstrap-kit pin in lockstep so the A6 sync gate
stays GREEN) so blueprint-release.yaml fires on this push, publishes
`ghcr.io/openova-io/bp-velero:1.2.2`, and the auto-bump-pin step is a
no-op. No payload changes — same upstream vmware-tanzu/velero 12.0.1
subchart, same templates, same values.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:43:13 +04:00
hatiyildiz
9975e057da deploy(bp-newapi): bump bootstrap-kit pin 1.4.19 -> 1.4.20 (auto, Refs TBD-A6) 2026-05-18 20:38:15 +00:00
github-actions[bot]
9982dcafa8 deploy: bump bp-newapi upstream v0.13.2 chart 1.4.20 2026-05-18 20:37:26 +00:00
e3mrah
3d0c96a237
fix(bp-newapi): single-pod DB migration via startupProbe (Closes #1798) (#1857)
newapi-mirror:v0.13.2 hangs on first-boot GORM AutoMigrate against an
empty CNPG database: kubelet's pre-A12 liveness probe (initialDelay
30s, period 10s, failureThreshold 3 = ~50s ceiling) SIGKILLs the
binary mid-migration on every restart. The 28-CREATE-TABLE +
2-column-type AutoMigrate takes 60-120s on cpx21/cpx31 nodes with
sslmode=require — well over the kill window. On t22 chart 1.4.18 the
`newapi` DB had ZERO public-schema tables after 29 CrashLoopBackOff
restarts because every kill happened before the GORM connection
pool's first wire write completed (pg_stat_activity on the CNPG
primary showed no newapi-user connections).

Symptom (t22 verify, pod newapi-bp-newapi-6fd8799b6-lpsd2):
  [SYS] ... database migration started   ← last log line
  exitCode=2 finishedAt-startedAt = 50s exactly
  Readiness probe: connect: connection refused 10.42.0.185:3000
  DB: psql \dt → "Did not find any relations"
  CNPG: pg_stat_activity → no `newapi` user connections

Fix (canonical k8s pattern, Inviolable Principle #16 — own the
seam): add a startupProbe that gates BOTH liveness and readiness
until the binary opens :3000/api/status. Budget 30 × 10s = 5 min,
comfortably above the observed 60-120s ceiling and below operator-
impatience limits. Liveness's pre-A12 cadence (30s/10s/3) is
unchanged but only activates after startupProbe success per kubelet
semantics. The probe block is operator-tunable via
`.Values.newapi.probes.startup.*`; setting it to `null` skip-renders
the block so overlays against a pre-seeded DB can opt out
(Inviolable Principle #4).
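
The probe arithmetic above can be checked with a small sketch. It ignores probe timeouts and kubelet scheduling jitter, and the type and function names are illustrative only.

```go
package main

import "fmt"

// probe holds the three probe settings quoted in the commit message.
type probe struct {
	initialDelaySeconds int
	periodSeconds       int
	failureThreshold    int
}

// firstKillSeconds approximates when a never-passing liveness probe gives
// up: the first attempt runs after the initial delay and each further
// attempt one period later, until failureThreshold consecutive failures.
func firstKillSeconds(p probe) int {
	return p.initialDelaySeconds + (p.failureThreshold-1)*p.periodSeconds
}

// startupBudgetSeconds is the usual upper bound for a startupProbe:
// failureThreshold attempts spaced one period apart.
func startupBudgetSeconds(p probe) int {
	return p.failureThreshold * p.periodSeconds
}

func main() {
	liveness := probe{initialDelaySeconds: 30, periodSeconds: 10, failureThreshold: 3}
	startup := probe{periodSeconds: 10, failureThreshold: 30}

	fmt.Println("liveness kill window (s):", firstKillSeconds(liveness))
	fmt.Println("startup budget (s):", startupBudgetSeconds(startup))
	// The observed migration ceiling of 120s exceeds the liveness window
	// but fits inside the startup budget.
	fmt.Println("120s migration fits:", 120 <= startupBudgetSeconds(startup))
}
```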

Also bumps the bootstrap-kit pin 1.4.18 → 1.4.19 in slot 80 so
freshly franchised Sovereigns pull the new chart on next prov.

Render tested (smoke + override): startupProbe present with
failureThreshold=30 in defaults; suppressed when startup: null.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:37:00 +04:00
e3mrah
a8931db541
fix(ci): sync stale blueprint.yaml versions + soften push-mode pin-sync race (Closes #1849) (#1855)
Two disjoint regressions stack-failed test-bootstrap-kit.yaml on every push to main:

1. manifest-validation — TestBootstrapKit_BlueprintCardsHaveRequiredFields
   asserts platform/<bp>/blueprint.yaml spec.version == chart/Chart.yaml
   version. Six blueprints had drifted: cilium (1.3.0->1.3.5), cert-manager
   (1.2.0->1.2.2), flux (1.2.0->1.2.2), openbao (1.2.14->1.2.16), keycloak
   (1.5.0->1.4.5 — blueprint led chart, sync to authoritative Chart.yaml),
   gitea (1.2.5->1.2.7). Chart.yaml is canonical (drives bootstrap-kit pin
   -> Sovereign install); blueprint.yaml gets resynced down/up to match.

2. pin-sync-audit on push — full-sweep audit races the blueprint-release
   auto-bump hook. Chart-bump merge commit has chart=N pin=N-1 drift
   until the auto-bump bot commits the pin update ~60s later; the bot
   push (GITHUB_TOKEN convention) does not retrigger this workflow, so
   the failure remains in run history. Fix: set continue-on-error: true
   on push/workflow_dispatch events (PR remains blocking via
   --changed-only). The full-sweep output still surfaces drift on the
   run summary; it just doesn't fail the overall run while the heal-in-
   ~60s window is open. Documented inline in the job header.

Net effect: every push to main re-runs cleanly green. The 13 pre-existing
drifts called out in the existing job comment will continue to heal as
each lagging chart gets its next bump (auto-bump hook + this PR's
manifest-validation alignment).

Refs PRs #1666 #1687 #1695 #1698 #1706 #1707 (the manual collector PRs
TBD-A6 eliminated for bootstrap-kit pins; this PR extends the convergence
to blueprint.yaml versions which the test asserts but the auto-bump hook
does not yet update).

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
2026-05-19 00:34:48 +04:00
e3mrah
d36e54df74
test(chart): baseline CNP allow-list contract gate — guards #1785→#1803→#1847 cascade (Closes #1850) (#1854)
The May 2026 baseline-CNP cascade shipped three production bugs in
two days because nothing in CI rendered the chart and asserted on the
rendered CiliumNetworkPolicy shape:

  - #1785 (chart 1.4.171) — added the baseline CNP for catalyst-system
    with WORLD egress restricted to TCP/443 only AND no ingress allow
    for the `catalyst` namespace.
  - #1803 (chart 1.4.177) — re-added SMTP egress (587/465/25 TCP) after
    /api/v1/auth/pin-request 502'd on every fresh onboarding.
  - #1847 (chart 1.4.178) — re-added ingress from `catalyst` after t24
    fresh-prov handover hung at WAIT_TIMEOUT_SECONDS=1500s.

This adds products/catalyst/chart/tests/baseline-cnp-allowlist.sh —
a pure helm-template + grep/awk contract gate matching the existing
platform/self-sovereign-cutover/chart/tests/cutover-contract.sh
pattern. The Blueprint Release workflow already runs every *.sh under
chart/tests/ as a publish gate (see blueprint-release.yaml line 384),
so the gate is wired automatically and fails publish BEFORE the OCI
artifact reaches a Sovereign.

13 cases asserted:
  1. baseline-default-deny CNP renders + is namespaced to catalyst-system
  2. egress allows SMTP submission 587/TCP (#1803 regression guard)
  3. egress allows SMTPS 465/TCP (#1803 regression guard)
  4. egress allows legacy SMTP 25/TCP (#1803 regression guard)
  5. egress allows HTTPS 443/TCP to world
  6. egress allows kube-dns 53/UDP + 53/TCP
  7. ingress allows `catalyst` ns — cutover Pods → catalyst-api:8080 (#1847)
  8. ingress allows `flux-system` (HelmRelease readiness probes)
  9. ingress allows `kube-system` (operator + ccm + CoreDNS)
 10. ingress is namespace-scoped — no fromEntities:{cluster|world|all} wildcard
 11. catalyst-api Service exposes port 8080 (auto-trigger contract)
 12. CNP toggles off cleanly with security.baselineCnp.enabled=false
 13. allowedIngressNamespaces propagates via --set (operator-tunable)

Negative-test confirmation (executed locally before commit):
  - Remove SMTP 587 from template → Case 2 FAILS, exit 1
  - Remove `catalyst` from values.yaml default → Case 7 FAILS, exit 1
  - Add `fromEntities: [cluster]` wildcard → Case 10 FAILS, exit 1
  - Restore originals → all 13 cases PASS, exit 0

Refs: TBD-A18, PRs #1785 #1803 #1847, audit /tmp/audit-recent-prs-quality-report.json
Closes #1850

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-19 00:32:28 +04:00
github-actions[bot]
82e972fb77 deploy: update catalyst images to 75cb059 2026-05-18 20:26:21 +00:00
e3mrah
75cb059fc0
Merge pull request #1851 from openova-io/fix/a16-hetzner-ssh-key-sweep
fix(hetzner): sweep orphan SSH keys by public_key comment (TBD-A16)
2026-05-19 00:24:19 +04:00
github-actions[bot]
e78faa986c deploy: update catalyst images to f07312c 2026-05-18 20:23:49 +00:00
e3mrah
f07312c5ae
fix(cutover): RBAC + sovereign-fqdn ConfigMap + kubeconfig?region path — 3 t24 zero-touch P1 blockers (#1852)
Three Wave 36 P1 fresh-prov blockers ship together as one chart 1.4.179
+ bootstrap-kit pin bump + cloud-init substitute extension, because each
fix is small and they share the same fresh-prov verification cycle.

TBD-A14 (issue #1843) — catalyst-api-cutover-driver SA cannot list
networkpolicies cluster-scope. Add networking.k8s.io/networkpolicies
get/list/watch verbs to clusterrole-cutover-driver.yaml. Pre-fix the
chroot in-cluster fallback's k8sCache.Factory reflector emitted
continuous `networkpolicies is forbidden` errors at the cluster scope
because only update/patch/delete were granted (existing mutation block)
— the read path was never wired. Mirrors the existing
cilium.io/ciliumnetworkpolicies block; the two CRDs co-exist (k8s
NetworkPolicy = baseline L3/L4, CiliumNetworkPolicy = tier-3 L7).

TBD-A15 (issue #1844) — sovereign-fqdn ConfigMap fields
configuredRegions / controlPlaneIP / primaryRegion / replicaRegion /
selfDeploymentId / enableHotStandby / qaApplications empty on every
fresh prov. Pre-fix the envsubst placeholders resolved to empty because
nothing wrote them into the bootstrap-kit Kustomization postBuild
substitute map → the chart rendered empty strings → Dashboard
SovereignCard configured-regions chips, Settings page operator-identity,
/api/v1/sovereign/self, and the D31 active-hot-standby gating ALL
silently fell through to default behaviour. Wired via three coordinated
changes:
  - Chart values.yaml gains global.sovereignSelfDeploymentId default
  - bootstrap-kit slot 13 gains global.sovereignSelfDeploymentId,
    sovereign.configuredRegions, sovereign.qaApplications mappings
    (YAML inline-list shape `${SOVEREIGN_CONFIGURED_REGIONS_YAML:-[]}`)
  - cloud-init Kustomization substitute map gains SOVEREIGN_CONTROL_PLANE_IP
    (= load_balancer_ipv4), SOVEREIGN_PRIMARY_REGION /
    SOVEREIGN_REPLICA_REGION (canonical 4-segment labels),
    SOVEREIGN_ENABLE_HOT_STANDBY (reserved, default empty),
    SOVEREIGN_CONFIGURED_REGIONS_YAML (JSON-encoded cloudRegion list),
    QA_APPLICATIONS_YAML (reserved, default `[]`)
  - main.tf: new template inputs sovereign_configured_regions_yaml +
    replica_region_canonical_label (derived from local.secondary_regions),
    threaded into both primary CP and per-secondary-region cloud-init
    templatefile calls

TBD-A10b (issue #1845) — GET
/api/v1/deployments/{id}/kubeconfig?region=<cloudRegion> returns 409
kubeconfig-file-missing on fresh prov for every region. Pre-fix the
handler only resolved `<id>-<region>.yaml` exactly, but the cloud-init
PUT-back + mothership→chroot D16 fan-out use the tofu secondary-region
key shape `<cloudRegion>-<i>` (e.g. `hel1-1`, `nbg1-2`) — so on-disk
filenames look like `<id>-hel1-1.yaml`. Verifiers + operators commonly
call with the bare `cloudRegion` (`?region=hel1`) because that's the
matrix-doc-friendly form. Fall-back resolution order added to
GetKubeconfig: exact-name first (legacy + manual operator PUT), then
`<id>-<region>-*.yaml` glob (sort.Strings deterministic). Unit test
covers all three paths: exact match, slot-suffix glob, unknown-region
still 409. Closes the regression introduced when PR #1763
(mothership→chroot kubeconfig handover hook) started using the
cloud-init naming convention for fan-out exports.
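
The fall-back resolution order can be sketched as follows. This is a minimal illustration, not the GetKubeconfig handler itself: `resolveKubeconfigPath` and `errKubeconfigMissing` are hypothetical names, and the sketch assumes the deployment id and region contain no glob metacharacters.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

var errKubeconfigMissing = errors.New("kubeconfig-file-missing")

// resolveKubeconfigPath tries <id>-<region>.yaml first, then falls back to
// the lexically first match of <id>-<region>-*.yaml.
func resolveKubeconfigPath(dir, id, region string) (string, error) {
	exact := filepath.Join(dir, id+"-"+region+".yaml")
	if info, err := os.Stat(exact); err == nil && !info.IsDir() {
		return exact, nil
	}
	matches, err := filepath.Glob(filepath.Join(dir, id+"-"+region+"-*.yaml"))
	if err != nil {
		return "", err
	}
	if len(matches) == 0 {
		return "", errKubeconfigMissing
	}
	sort.Strings(matches) // deterministic choice among slot suffixes
	return matches[0], nil
}

// runDemo exercises the three paths: exact match, slot-suffix glob, and
// unknown region.
func runDemo() ([]string, error) {
	dir, err := os.MkdirTemp("", "kubeconfigs")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	for _, name := range []string{"dep-hel1-2.yaml", "dep-hel1-1.yaml", "dep-nbg1.yaml"} {
		if err := os.WriteFile(filepath.Join(dir, name), []byte("{}"), 0o600); err != nil {
			return nil, err
		}
	}
	var out []string
	for _, region := range []string{"nbg1", "hel1", "fsn1"} {
		p, err := resolveKubeconfigPath(dir, "dep", region)
		if err != nil {
			out = append(out, err.Error())
			continue
		}
		out = append(out, filepath.Base(p))
	}
	return out, nil
}

func main() {
	results, err := runDemo()
	if err != nil {
		panic(err)
	}
	fmt.Println(results)
}
```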

Closes #1843, Closes #1844, Closes #1845

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:21:38 +04:00
hatiyildiz
6e883c1f8b fix(hetzner): sweep orphan SSH keys by public_key comment (TBD-A16)
Third match pass for SSH keys whose name AND label both drifted from the
Tofu canonical emission. The OpenSSH public_key comment is the one piece
of metadata that survives Console-rename, partial tofu apply, and
out-of-band hcloud-cli edits — bootstrap-cli stamps the canonical
prefix into it at generation.

Caught in production 2026-05-18: catalyst-t24-omantel-biz blocked fresh
t25 provs because previous wipe cycles left it as an orphan. Label-pass
+ name-prefix-pass had no signal once the name/label drifted.

Adds boundary-aware HasPrefix check (the same P0 safety guard pinned by
TestPurge_NamePrefixFallback_DoesNotTouchOtherCustomers) so wiping
t2.omantel.biz cannot delete t20.omantel.biz's SSH key.
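
A minimal sketch of a boundary-aware prefix check, assuming a hypothetical separator set; the real separator rune table and function name are not shown in this commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// separators is an assumed set of characters that may legitimately follow
// the canonical prefix.
const separators = "-._@ "

// matchesAtBoundary reports whether comment starts with prefix and the
// prefix ends at a boundary: either the comment ends there or the next
// character is a separator. A plain strings.HasPrefix would let a "t2"
// prefix match a "t20" key.
func matchesAtBoundary(comment, prefix string) bool {
	if prefix == "" || !strings.HasPrefix(comment, prefix) {
		return false
	}
	rest := comment[len(prefix):]
	if rest == "" {
		return true
	}
	return strings.ContainsRune(separators, rune(rest[0]))
}

func main() {
	prefix := "catalyst-t2"
	for _, comment := range []string{
		"catalyst-t2-omantel-biz",  // same tenant: next character is a separator
		"catalyst-t20-omantel-biz", // different tenant: next character is a digit
		"catalyst-t2",              // exact match
	} {
		fmt.Printf("%-26s %v\n", comment, matchesAtBoundary(comment, prefix))
	}
}
```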

Tests:
  - PublicKeyCommentFallback_DeletesUnlabeled (the third-pass match)
  - PublicKeyCommentFallback_BoundarySafety (P0 t2 vs t20 safety pin)
  - PublicKeyCommentFallback_NoDoubleCount (idempotent against earlier passes)
  - PublicKeyCommentFallback_LeavesOtherKeys (other tenants untouched)
  - PublicKeyComment_ParsesFormats (OpenSSH parser unit pins)
  - CommentMatchesPrefix_BoundaryRules (separator rune table)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:15:51 +02:00
hatiyildiz
7a2cad9a47 deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.177 -> 1.4.178 (auto, Refs TBD-A6) 2026-05-18 19:46:12 +00:00
e3mrah
31b7dc5859
fix(cnp): allow ingress from catalyst ns (cutover Pods) — fresh-prov handover blocker (Refs PR #1785 regression, t24 zero-touch finding) (#1847)
PR #1785 (chart 1.4.171) shipped a baseline default-deny
CiliumNetworkPolicy in catalyst-system whose ingress allowlist was
limited to:

  - reserved.ingress: "" (cilium-gateway endpoint)
  - same-namespace catalyst-system Pods
  - host / remote-node / kube-apiserver entities

The bp-self-sovereign-cutover chart stamps Jobs into the `catalyst`
namespace, including the 10-auto-trigger Job whose Pod curls
catalyst-api.catalyst-system.svc.cluster.local:8080 to fire
/api/v1/internal/cutover/trigger.

With #1785 in effect on a FRESH prov, every auto-trigger Pod times
out at WAIT_TIMEOUT_SECONDS=1500s, handoverFiredAt stays null, and
the D0 auto-redirect to the Sovereign Console never happens — the
operator is stuck on mothership /jobs forever.

Caught by t24 zero-touch verification (2026-05-18):

  handover_status: "BLOCKED — cutover auto-trigger Pod in 'catalyst'
  ns cannot reach catalyst-api in 'catalyst-system' ns because
  baseline-default-deny CNP allows ingress only from {reserved.ingress,
  catalyst-system ns, host entities}"

The companion symptom on t22 was masked because t22's cutover Job
had already completed before the CNP rolled out — the CNP did not
gate ingress there.

Fix
─────────────────────────────────────────────────────────────────
Add a fourth ingress rule to baseline-default-deny allowing
fromEndpoints in the operator-tunable list
.Values.security.baselineCnp.allowedIngressNamespaces. Defaults:

  - catalyst       — cutover Pods (the load-bearing fix)
  - flux-system    — Helm/Kustomize/Source controllers probing
                     Service readiness for HelmRelease health
                     rollups (worked pre-#1785 via no-CNP default)
  - kube-system    — Cilium operator + hcloud-ccm + CoreDNS that
                     do cluster introspection calls (the
                     reserved.ingress gateway endpoint here is
                     still matched by rule 1's reserved.ingress: ""
                     selector — this rule covers non-gateway Pods)

The list mirrors the existing allowedPlatformNamespaces pattern on
the egress side. No other rule semantics change.

Chart bump 1.4.177 → 1.4.178. Companion regression to chart 1.4.177
(PR #1803, SMTP egress) — both are sub-regressions from the same
#1785 baseline-CNP ship.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:45:28 +04:00
hatiyildiz
61948474b5 deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.176 -> 1.4.177 (auto, Refs TBD-A6) 2026-05-18 19:28:52 +00:00
e3mrah
153fcf9419
fix(cnp): allow SMTP egress (587/465/25) from catalyst-system — fixes PIN-issue 502 regression from #1785 (#1803)
PR #1785 (chart 1.4.171) shipped a baseline-default-deny CiliumNetworkPolicy
in catalyst-system whose world-egress block was restricted to TCP/443 only.
That silently broke SMTP submission from catalyst-api to the operator
Stalwart relay (mail.openova.io), surfacing as 502s at
/api/v1/auth/pin-request — customer journey step 11/12 (PIN-issue email
delivery) is now blocked on every fresh Sovereign onboarding flow.

DIAGNOSTIC EVIDENCE
-------------------
- CNP `baseline-default-deny` in catalyst-system was created at
  2026-05-18 18:13:09Z (the moment chart 1.4.171 rolled out).
- Egress rule:
    toEntities: [world]
    toPorts:    [443/TCP]
  i.e. only HTTPS world egress permitted.
- A Pod in catalyst-system cannot `nc 45.151.123.50 587` (timeout).
- A Pod in the default namespace on the SAME node connects fine
  and receives the `220 Stalwart ESMTP` banner — confirming the
  block is policy-driven, not network/host-firewall driven.

FIX
---
Extend the world-egress block in
products/catalyst/chart/templates/network-policies/baseline-catalyst-system.yaml
to permit, in addition to the existing 443/TCP:

  - 587/TCP — SMTP submission (the production path to mail.openova.io)
  - 465/TCP — SMTPS (fallback)
  - 25/TCP  — legacy SMTP (fallback)

All four ports are scoped to `toEntities: [world]`, matching the
existing 443 allow. No other rule semantics change — same-namespace,
cluster-DNS, kube-apiserver, and platform-namespace allows are
untouched. The 25/TCP allow is included only as a legacy fallback;
production traffic is on 587.

A "Regression context — DO NOT NARROW THIS BLOCK WITHOUT REVIEW"
comment is added inline so the next reviewer who tightens the block
sees the failure mode that drove the widening.

CHART
-----
1.4.176 → 1.4.177. Changelog entry added under the 1.4.176 block,
above the version line, describing the regression + fix.

VERIFICATION
------------
`helm template products/catalyst/chart` renders the updated CNP with
four ports (443/587/465/25) under the world egress block; all other
rules byte-identical to 1.4.176.

Refs PR #1785 (the regression source), Issue #1746 (the original
baseline-CNP work).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-18 23:28:19 +04:00
github-actions[bot]
732f2363b9 deploy: update catalyst images to c422c97 2026-05-18 19:16:52 +00:00
e3mrah
c422c97b97
fix(catalyst-api): publish body→query translation + rbac/assign CRD-NotFound detection (Refs TBD-C4-fup, TBD-C6-006-followup) (#1802)
TBD-C4-fup — publish body→query translation regression guard:
- Adds sme_catalog_client_test.go pinning the wire shape on
  smeCatalogClient.SetPublished. The C4-012 / #1735 fix (PR #1789)
  translates the chroot's {"published":true} JSON body into the
  upstream catalog's ?value=true|false query param shape that
  services-catalog SetAppPublished (handlers.go:303-313) requires.
  Wave 35 cov-bench v3 surfaced 400 here because the deploy bot
  hadn't bumped catalyst-api past e2c56c3 (PR #1787) when the
  bench ran — PR #1789's translation was already in the merged
  code but not in the live image. The test pins URL +
  ?value=<bool> + empty body so any future revert fires.
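
The wire shape pinned by that test can be sketched as below. The URL path and HTTP method are hypothetical placeholders; only the `?value=<bool>` query parameter and the empty body come from the description above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strconv"
)

// newSetPublishedRequest builds a request that carries the published flag
// as a query parameter and sends no body. The "/apps/<id>/published" path
// and the PUT method are assumptions made for this sketch.
func newSetPublishedRequest(baseURL, appID string, published bool) (*http.Request, error) {
	u, err := url.Parse(baseURL + "/apps/" + url.PathEscape(appID) + "/published")
	if err != nil {
		return nil, err
	}
	q := u.Query()
	q.Set("value", strconv.FormatBool(published))
	u.RawQuery = q.Encode()
	return http.NewRequest(http.MethodPut, u.String(), nil)
}

func main() {
	req, err := newSetPublishedRequest("http://catalog.example", "demo-app", true)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println("has body:", req.Body != nil)
}
```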

TBD-C6-006-followup — RBAC assign 500 → 503:
- Root cause: UserAccess is a NAMESPACED Crossplane Claim per the
  XRD's claimNames block (platform/crossplane-claims/chart/
  templates/xrds/useraccess.yaml). rbacAssignNamespace = "" routed
  the dynamic Create to the apiserver's cluster-scoped REST path
  /apis/access.openova.io/v1alpha1/useraccesses, which the
  apiserver doesn't serve for a namespaced CRD — returns 404 with
  "the server could not find the requested resource". PR #1789's
  apierrors.IsNotFound→503 wrapper never fired because the 404 was
  for the route, not the resource.
- Fix: pin rbacAssignNamespace = "catalyst-system" and stamp it on
  every Create. Mirrors user_access_owner_seed.go's t134 D21 fix
  (userAccessOwnerNamespace = "catalyst-system"). Lists keep
  Namespace("") for cross-namespace listing (valid against a
  namespaced CRD — apiserver returns the union).
- Defense in depth: isCRDNotInstalledErr() string-fallback for
  "the server could not find the requested resource" / "no matches
  for kind" — apierrors.IsNotFound can lose StatusReasonNotFound
  through error-chain wrapping. Mirrors
  catalog_client_cluster_fallback.isVersionNotServed.
- user_access.go: same defect class — CreateUserAccess /
  UpdateUserAccess / tryDeleteUserAccess all called .Namespace("")
  on a namespaced CRD. CreateUserAccess now stamps
  rbacAssignNamespace; Update + Delete walk the all-namespaces
  list via findUserAccessByName() to discover the canonical ns
  before issuing the mutation against that exact REST path.

Tests:
- TestSetPublished_SendsQueryParamNotBody (regression guard for
  TBD-C4-fup)
- TestHandleRBACAssign_CreateStampsNamespace (regression guard for
  TBD-C6-006-followup namespace fix)
- TestIsCRDNotInstalledErr_StringFallback (regression guard for
  defense-in-depth detection)
- Existing test reads updated to use rbacAssignNamespace instead
  of Namespace("") (no behavioural change — the fake dynamic
  client routes accurately now)

Refs TBD-C4-fup
Refs TBD-C6-006-followup

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:14:40 +04:00
hatiyildiz
0293318a3a deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.175 -> 1.4.176 (auto, Refs TBD-A6) 2026-05-18 19:14:22 +00:00
github-actions[bot]
fbbf1b395f deploy: update sme service images to 989328d + bump chart to 1.4.176 2026-05-18 19:13:00 +00:00
hatiyildiz
da28ae6936 deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.174 -> 1.4.175 (auto, Refs TBD-A6) 2026-05-18 19:12:31 +00:00
e3mrah
989328d7c3
fix(provisioning): write tenant commits to sme-tenants branch (Closes #1794 — 5th C18 layer) (#1801)
Provisioning's per-tenant overlay commits no longer share the `main`
branch with the cutover-gitea-mirror Job. The mirror runs every <=10 min
and force-pushes upstream main into Gitea, clobbering every tenant
commit that landed in between mirror ticks — the Organization CR never
materialised and the customer journey hung at step 16 (live evidence on
t22 2026-05-18: commit 69d64e48 at 17:46:13Z disappeared from Gitea main
by the next mirror tick at 17:54:55Z).

Fix:
- New Flux GitRepository `openova-sme-tenants` tracks the dedicated
  `sme-tenants` branch (templates/sme-services/sme-tenants-gitrepository.yaml).
- sme-tenants Flux Kustomization repointed at the new GitRepository
  (sme-tenants-kustomization.yaml) so the tenant reconcile loop reads
  from the protected branch.
- Provisioning Deployment GITHUB_BRANCH default flipped to `sme-tenants`
  on Sovereign installs (Catalyst-Zero keeps `main` — no mirror Job
  exists there). Topology-aware default, operator-overridable.
- Provisioning Go client (commitOnceContents) gains an auto-create-
  branch fallback so the first commit on a fresh Sovereign self-
  bootstraps the branch from `main` — no out-of-band seeding step.
- Chart 1.4.174 -> 1.4.175.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:11:30 +04:00
github-actions[bot]
481bf60fd5 deploy: update catalyst images to 51be70e 2026-05-18 19:06:31 +00:00
e3mrah
51be70e865
fix(catalyst-api): k8sCache.Factory periodic rescan of on-disk kubeconfigs + chroot self-register recovery (Refs 30-row matrix rows 9+27) (#1800)
Two regressions caught on t22 (2026-05-18) by the 30-row matrix:

  row 9  /cloud/list?kind=nodes              — only 1 cluster instead of 3
  row 27 /dashboard/treemap (Layer=Region)   — only 1 cell instead of 3

Root cause is two layered races on a fresh-prov Sovereign chroot, both
invisible to PR #1705 / #1763's one-shot AddCluster path:

  (a) Pod restart with empty kubeconfigs PVC. The mothership's
      secondary-kubeconfig POST hook (deployment_handover_export.go)
      ONLY fires at handover. A catalyst-api Pod that restarts AFTER
      handover and BEFORE any operator re-trigger sees an empty
      /var/lib/catalyst/kubeconfigs/. LoadClustersFromDir at startup
      returns 0 entries, the Factory starts with sovereigns:0, and
      every /k8s/list response degrades silently to one cluster (the
      chroot itself, via resolveChrootClusterID's single-cluster
      fallback) — or zero on chroots where SOVEREIGN_FQDN env was
      also empty at start (race (b)).

  (b) sovereign-fqdn ConfigMap committed AFTER Pod start. On t22 the
      Pod started at 18:13:14 but the chart's sovereign-fqdn CM
      landed at 18:13:44 — 30s later. The Pod's SOVEREIGN_FQDN env
      stayed empty for the lifetime of the Pod (Reloader v1.4.16
      does not reload env vars per a longstanding upstream
      limitation), so FactoryFromEnv's chroot self-register branch
      returned false. Logs confirmed: "k8scache: data plane started
      sovereigns=0".

Fix: a periodic background goroutine (Factory.runKubeconfigsRescanLoop)
that ticks every Config.RescanInterval (default 30s) and:

  1. Walks Config.KubeconfigsDir for kubeconfigs whose stem isn't
     already a registered cluster ID and AddClusters each one.
     Cheap (one os.ReadDir per tick) and idempotent.

  2. When Config.HomeCoreClient is set, reads the on-cluster
     sovereign-fqdn ConfigMap directly via the typed client and
     re-runs buildChrootClusterRef when fqdn is non-empty. Recovers
     from the configmap-race on the next tick after the CM commits,
     without needing a Pod restart.
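
Step 1 of the rescan can be sketched as a single idempotent pass over the directory. The `registry` type, the `.yaml` extension filter, and the method name are assumptions for illustration; in the change described above a background loop would run such a pass on every tick.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// registry is a hypothetical stand-in for the set of registered cluster IDs.
type registry struct {
	clusters map[string]bool
}

// rescanOnce registers every kubeconfig whose file stem is not yet known
// and returns the newly added IDs. Repeating the call on an unchanged
// directory adds nothing.
func (r *registry) rescanOnce(dir string) ([]string, error) {
	entries, err := os.ReadDir(dir) // one directory read per pass
	if err != nil {
		return nil, err
	}
	var added []string
	for _, e := range entries {
		if e.IsDir() || filepath.Ext(e.Name()) != ".yaml" {
			continue
		}
		stem := strings.TrimSuffix(e.Name(), ".yaml")
		if r.clusters[stem] {
			continue
		}
		r.clusters[stem] = true
		added = append(added, stem)
	}
	return added, nil
}

// runDemo returns how many clusters the first and second pass registered.
func runDemo() (int, int, error) {
	dir, err := os.MkdirTemp("", "kubeconfigs")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	for _, name := range []string{"dep-hel1-1.yaml", "dep-nbg1-2.yaml"} {
		if err := os.WriteFile(filepath.Join(dir, name), []byte("{}"), 0o600); err != nil {
			return 0, 0, err
		}
	}
	r := &registry{clusters: map[string]bool{}}
	first, err := r.rescanOnce(dir)
	if err != nil {
		return 0, 0, err
	}
	second, err := r.rescanOnce(dir)
	if err != nil {
		return 0, 0, err
	}
	return len(first), len(second), nil
}

func main() {
	first, second, err := runDemo()
	if err != nil {
		panic(err)
	}
	fmt.Println("first pass:", first, "second pass:", second)
}
```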

FactoryFromEnv now persists the resolved KubeconfigsDir + HomeCoreClient
into the Config so the rescan loop reuses the same values without
re-reading env. Defaults: rescan interval 30s; both branches are no-ops
on the contabo mothership (KubeconfigsDir non-empty but no late-arriving
kubeconfigs; no sovereign-fqdn CM so the ConfigMap GET returns not-found
silently).

Two new tests in k8scache_test.go:

  - TestFactory_RescanRegistersNewKubeconfigs: drops a kubeconfig
    AFTER Start, asserts the Factory registers it within 3s of the
    rescan tick. Reproduces the (a) regression in unit test form.

  - TestFactory_RescanOnce_IdempotentForKnownClusters: re-runs
    rescanOnce on a directory whose entries are already registered;
    asserts no double-register, no log spam.

Operator-visible effect: post-handover Pod restarts on a multi-region
Sovereign chroot self-heal within one rescan interval (30s by default)
instead of staying stuck at sovereigns:0 until a manual operator re-POST.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:04:11 +04:00
github-actions[bot]
5f741f11a8 deploy: bump sandbox-controller image to 4f957c3 2026-05-18 18:56:25 +00:00
github-actions[bot]
3ecfc465f0 deploy: bump sandbox-mcp-server image to 4f957c3 2026-05-18 18:55:00 +00:00
e3mrah
4f957c3db2
fix(sandbox): per-Sandbox disable idle-scaling (Closes #1725) (#1797)
Wave 35 D8b — add `spec.idleScaling.enabled` to the Sandbox CR so
long-running agent workloads (idle-for-hours-then-resume) can opt
out of the cluster-wide idle scaler.

Renderer stamps `openova.io/sandbox-idle-scaling-disabled=true` on
the pty-server StatefulSet when enabled=false. IdleScaler skips any
StatefulSet carrying that annotation: no probe, no last-activity
stamp, no scale-to-zero decision.

Default behaviour (CR field omitted OR enabled=true) preserves the
existing tier-cap economics so the free/pro paths still scale to 0
after the timeout window.
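
The stamp-and-skip contract described above might look like the sketch
below. The annotation key is taken from this message; the `IdleScalingSpec`
type, the pointer-typed `Enabled` field, and both function names are
assumptions made for illustration.

```go
package main

import "fmt"

const idleScalingDisabledAnnotation = "openova.io/sandbox-idle-scaling-disabled"

// IdleScalingSpec mirrors spec.idleScaling. Enabled is a pointer (an
// assumption) so an omitted field can be told apart from an explicit false.
type IdleScalingSpec struct {
	Enabled *bool
}

// stampIdleScaling sets the opt-out annotation only for an explicit
// enabled=false; an omitted spec or enabled=true keeps the default behaviour.
func stampIdleScaling(spec *IdleScalingSpec, annotations map[string]string) {
	if spec != nil && spec.Enabled != nil && !*spec.Enabled {
		annotations[idleScalingDisabledAnnotation] = "true"
		return
	}
	delete(annotations, idleScalingDisabledAnnotation)
}

// skipIdleScaling is the check the scaler would run before probing.
func skipIdleScaling(annotations map[string]string) bool {
	return annotations[idleScalingDisabledAnnotation] == "true"
}

func main() {
	off := false
	annotations := map[string]string{}
	stampIdleScaling(&IdleScalingSpec{Enabled: &off}, annotations)
	fmt.Println("skip when disabled:", skipIdleScaling(annotations))
	stampIdleScaling(nil, annotations)
	fmt.Println("skip when omitted:", skipIdleScaling(annotations))
}
```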

Refs WBS row TBD-D8b in openova-private/docs/WBS.md.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:53:00 +04:00
hatiyildiz
2c1e628fb3 deploy(bp-newapi): bump bootstrap-kit pin 1.4.17 -> 1.4.18 (auto, Refs TBD-A6) 2026-05-18 18:31:05 +00:00
github-actions[bot]
ddcab439c8 deploy: bump bp-newapi upstream v0.13.2 chart 1.4.18 2026-05-18 18:30:18 +00:00
e3mrah
6c4d660058
fix(bp-newapi): dedicated HTTPRoute for newapi.<fqdn> (Closes #1778) (#1796)
Sandbox runtimes hit the LLM gateway at the URL the sandbox controller
mints into their environment:

  NEWAPI_BASE_URL=https://newapi.<sovereign-fqdn>/v1

On a Sovereign with the Catalyst marketplace enabled, the catalyst chart
ships a `tenant-wildcard` HTTPRoute (hostnames=`*.<fqdn>`) that backend-
refs to the `console` Service in the `sme` namespace. Without a
dedicated HTTPRoute for `newapi.<fqdn>`, every Sandbox request to the
LLM gateway got absorbed by the wildcard and 502'd at the storefront —
blocking the entire BYOS Claude Code journey (TBD-D35d).

Fix: add `templates/httproute.yaml` + `ingress.httpRoute` values block
to bp-newapi. The HTTPRoute lives in the `newapi` namespace (same as
the Service backend) so no cross-namespace ReferenceGrant is required;
Gateway API hostname matching prefers the most specific hostname, so an
exact `newapi.<fqdn>` HTTPRoute outranks the `*.<fqdn>` wildcard without
modifying the marketplace template.

Bootstrap-kit slot 80 overlay flips `ingress.httpRoute.enabled=true` and
supplies `host: newapi.${SOVEREIGN_FQDN}` so the route materialises on
every Sovereign install. Default OFF for contabo-style Traefik clusters
(unchanged behaviour).

- platform/newapi/chart/templates/httproute.yaml — new template, gated
  on `newapi.enabled && ingress.httpRoute.enabled` AND a resolvable
  hostname (explicit `ingress.httpRoute.host` OR derived from
  `sovereignFQDN`).
- platform/newapi/chart/values.yaml — new `ingress.httpRoute` block,
  default OFF.
- platform/newapi/chart/Chart.yaml — version 1.4.16 → 1.4.17.
- clusters/_template/bootstrap-kit/80-newapi.yaml — pin 1.4.16 → 1.4.17,
  values now enable `ingress.httpRoute` with host
  `newapi.${SOVEREIGN_FQDN}`.

helm template smoke (all four scenarios pass):
- default values → 0 HTTPRoutes rendered (chart safe for Traefik installs).
- httpRoute.enabled + sovereignFQDN → 1 HTTPRoute, hostname
  `newapi.<sovereignFQDN>`.
- httpRoute.enabled + explicit host → 1 HTTPRoute with that host.
- httpRoute.enabled, neither host nor sovereignFQDN → 0 HTTPRoutes
  (skip-render guard).
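
The four smoke scenarios reduce to a small decision table. The Go sketch
below restates the template gate outside Helm, purely as an illustration;
`resolveRouteHost` and its parameters are hypothetical names, not part of
the chart.

```go
package main

import "fmt"

// resolveRouteHost returns the hostname to render and whether an HTTPRoute
// should be rendered at all. An explicit host wins over the derived one.
func resolveRouteHost(newapiEnabled, routeEnabled bool, host, sovereignFQDN string) (string, bool) {
	if !newapiEnabled || !routeEnabled {
		return "", false // default values: nothing rendered
	}
	if host != "" {
		return host, true
	}
	if sovereignFQDN != "" {
		return "newapi." + sovereignFQDN, true
	}
	return "", false // skip-render guard: no resolvable hostname
}

func main() {
	cases := []struct {
		routeEnabled bool
		host, fqdn   string
	}{
		{false, "", ""},
		{true, "", "t22.omantel.biz"},
		{true, "llm.example.test", ""},
		{true, "", ""},
	}
	for _, c := range cases {
		h, ok := resolveRouteHost(true, c.routeEnabled, c.host, c.fqdn)
		fmt.Printf("render=%v host=%q\n", ok, h)
	}
}
```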

Closes #1778

Refs WBS row TBD-D35d in openova-private/docs/WBS.md.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:29:26 +04:00
github-actions[bot]
dfff1a695a deploy: update catalyst images to 2964930 2026-05-18 17:58:18 +00:00
e3mrah
29649309da
fix(marketplace): resolve _template path on chroot (Closes #1790) (#1792)
The marketplace settings handler hardcoded
clusters/<sovereignFQDN>/bootstrap-kit/13-bp-catalyst-platform.yaml.

That path exists in the openova-io/openova mothership repo (the
provisioner carves out a per-FQDN subtree per Sovereign) but NOT in the
chroot-local Gitea repo, which only carries the canonical
clusters/_template/bootstrap-kit/ subtree (see openova_flow_proxy.go,
phase1_watch.go, sme_tenant_gitops.go which all reference
clusters/_template/bootstrap-kit/...).

Wave 34 v2 cov-bench surfaced this: PR #1779 wired GITOPS_TOKEN through
to the chroot Pod, the marketplace toggle now reaches Gitea, and the
Gitea push fails with 500 "no such file or directory" because the
overlay path is wrong for the chroot's repo layout.

Fix: introduce resolveBootstrapKitDir(sovereignFQDN) which picks
clusters/_template/bootstrap-kit when SOVEREIGN_FQDN env is set (the
canonical "we are running on a chroot Pod" signal used across this
package - see auth_handover.go, deployments.go, jobs.go, rbac_matrix.go)
and clusters/<sovereignFQDN>/bootstrap-kit otherwise. A
CATALYST_BOOTSTRAP_KIT_PATH env overrides both, per
INVIOLABLE-PRINCIPLES.md #4 (never hardcode a path when a future repo
re-layout would then force a code ship).

Regression test TestResolveBootstrapKitDir covers all four detection
paths (mother / chroot / whitespace-treated-as-unset / runtime
override).
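
A minimal sketch of the resolution order, assuming the precedence described
above (override, then chroot signal, then per-FQDN path). The injected
`getenv` parameter is an addition for testability and is not the real
signature, which takes only sovereignFQDN.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// resolveBootstrapKitDir picks the bootstrap-kit directory.
// getenv is injected here only so the four detection paths can be
// exercised without touching the process environment.
func resolveBootstrapKitDir(sovereignFQDN string, getenv func(string) string) string {
	// Runtime override beats both detection branches.
	if override := strings.TrimSpace(getenv("CATALYST_BOOTSTRAP_KIT_PATH")); override != "" {
		return override
	}
	// A non-blank SOVEREIGN_FQDN is the "running on a chroot Pod" signal;
	// whitespace-only values count as unset.
	if strings.TrimSpace(getenv("SOVEREIGN_FQDN")) != "" {
		return "clusters/_template/bootstrap-kit"
	}
	return "clusters/" + sovereignFQDN + "/bootstrap-kit"
}

func main() {
	fmt.Println(resolveBootstrapKitDir("t22.omantel.biz", os.Getenv))
}
```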

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-18 21:56:11 +04:00
hatiyildiz
e2a3d46b66 deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.173 -> 1.4.174 (auto, Refs TBD-A6) 2026-05-18 17:52:29 +00:00
github-actions[bot]
7a7d5b4574 deploy: update sme service images to 5c71fb8 + bump chart to 1.4.174 2026-05-18 17:51:49 +00:00
e3mrah
5c71fb8f61
fix(catalyst-api+catalog): SME bridge token for publish toggle + chroot RBAC assign wrapper (#1789)
Closes #1735
Closes #1739

C4-012 / #1735 — Publish toggle 401:
- chroot's smeCatalog.SetPublished sent no Authorization header, so
  catalog.sme's JWTAuth middleware rejected with 401. Mint the canonical
  SME bridge token in HandleSovereignAppPublish (mirrors
  sme_billing_vouchers.go::mintSMEBridgeToken) and forward as Bearer.
- catalog requireAdmin now accepts sovereign-admin role (in addition to
  superadmin) so franchisee operators can manage their own Sovereign's
  catalog per docs/FRANCHISE-MODEL.md §3 — without this, the bridge
  token's sovereign-admin role would still 403.
- SetPublished now sends published state via ?value=true|false query
  param (matches the SME catalog's SetAppPublished route shape) rather
  than a JSON body the upstream ignores.

C6-006 / #1739 — RBAC assign 500:
- Add HandleSovereignRBACAssign at POST /api/v1/sovereign/rbac/assign,
  the chroot-friendly mirror of /api/v1/sovereigns/{id}/rbac/assign
  (resolves deployment id via resolveSovereignDeploymentID, mirroring
  HandleSovereignRBACMatrix). Extracts the existing handler body into
  serveRBACAssign so both surfaces share the same wire contract.
- Surface CRD-not-installed (apierrors.IsNotFound) from the dynamic
  create as 503 + sovereign-cluster-unavailable instead of a generic
  500 rbac-assign-failed — the previous shape hid the real chart-gap
  behind a misleading 500.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-18 21:50:24 +04:00
github-actions[bot]
3785f3aa4a deploy: update catalyst images to e2c56c3 2026-05-18 17:42:13 +00:00
e3mrah
e2c56c3811
fix(catalyst-api): mint HS256 bridge token for sovereign app publish proxy (Closes #1735) (#1787)
The chroot proxy at /api/v1/sovereign/apps/{slug}/publish forwards
to the SME catalog's PATCH /catalog/admin/apps/{slug}/publish endpoint
at http://catalog.sme.svc.cluster.local:8082. The pre-fix code
sent NO Authorization header at all, so:

  1. core/services/catalog/main.go's JWTAuth middleware (line 77, applied
     to every /catalog/admin/* path) rejected the request with 401
     BEFORE the handler ran ("missing or invalid authorization header").

  2. Even with a header, requireAdmin (core/services/catalog/handlers
     /handlers.go:21) would reject any caller without role="superadmin".

Result: every Publish toggle click in the Sovereign Console surfaced
as "sme-catalog-rejected upstream returned 401" with no actionable
hint — the operator could not toggle marketplace visibility for any
app on a production Sovereign.

Fix: mint a fresh HS256 bridge token via the existing
h.mintSMEBridgeToken helper (the same one sme_billing_vouchers.go's
proxySMEVoucher uses for the BSS Vouchers surface) and forward it as
the upstream Authorization header. The helper signs the token with
sme-secrets/JWT_SECRET — the same secret the SME catalog Pod loads
from its JWT_SECRET env (per products/catalyst/chart/templates
/sme-services/catalog.yaml:40-44). Operators with `catalyst-owner`
realm-role (per shared/auth.SMERoleFor) get role="superadmin" in the
bridge token, satisfying requireAdmin upstream.

  - Adds a `bearer` parameter to smeCatalogClient.SetPublished.
  - HandleSovereignAppPublish mints the bridge token BEFORE the
    upstream round-trip so an unwired bridge (Sovereign without
    marketplace, stale chart predating the reflector annotation
    on sme-secrets) surfaces 503 sme-jwt-bridge-unwired rather
    than the pre-fix silent 401.
  - Per docs/INVIOLABLE-PRINCIPLES.md #10 the token is NEVER logged.
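
For readers unfamiliar with HS256, the mint step amounts to an HMAC-SHA256
over the base64url header and payload. The sketch below uses only the
standard library; it is not h.mintSMEBridgeToken, and the claim set (role,
iat, exp), the 5-minute lifetime, and every function name are assumptions.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// errBridgeUnwired models the 503 sme-jwt-bridge-unwired case: no secret loaded.
var errBridgeUnwired = errors.New("sme-jwt-bridge-unwired")

func sign(secret []byte, signingInput string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(signingInput))
	return base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// mintBridgeToken builds header.payload.signature with HS256.
func mintBridgeToken(secret []byte, role string, nowUnix int64) (string, error) {
	if len(secret) == 0 {
		return "", errBridgeUnwired
	}
	header, err := json.Marshal(map[string]string{"alg": "HS256", "typ": "JWT"})
	if err != nil {
		return "", err
	}
	claims, err := json.Marshal(map[string]any{
		"role": role,
		"iat":  nowUnix,
		"exp":  nowUnix + 300, // assumed 5-minute lifetime
	})
	if err != nil {
		return "", err
	}
	enc := base64.RawURLEncoding.EncodeToString
	input := enc(header) + "." + enc(claims)
	return input + "." + sign(secret, input), nil
}

// verifyBridgeToken recomputes the signature in constant time.
func verifyBridgeToken(secret []byte, token string) bool {
	i := strings.LastIndex(token, ".")
	if i < 0 {
		return false
	}
	return hmac.Equal([]byte(sign(secret, token[:i])), []byte(token[i+1:]))
}

func main() {
	secret := []byte("example-secret")
	token, err := mintBridgeToken(secret, "superadmin", time.Now().Unix())
	if err != nil {
		fmt.Println("bridge unavailable:", err)
		return
	}
	// Print only the verification result; the token itself is never logged.
	fmt.Println("verifies:", verifyBridgeToken(secret, token))
}
```

Minting before the upstream call lets an empty secret fail fast with a
distinct error instead of surfacing as an upstream 401.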

Verified: build + go test ./internal/handler/ pass.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:39:40 +04:00
hatiyildiz
ebe9a5c1a2 deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.172 -> 1.4.173 (auto, Refs TBD-A6) 2026-05-18 17:33:15 +00:00
github-actions[bot]
b19d64f3f6 deploy: update sme service images to 3d06db5 + bump chart to 1.4.173 2026-05-18 17:32:36 +00:00
e3mrah
3d06db5625
fix(provisioning): use Gitea contents API for git writes (Closes #1781 — 4th C18 layer) (#1786)
Journey v4 Wave 33 (retry) — after #1712 fixed the singular→plural `/git/refs/`
path, provisioning's NEXT call landed on `POST /repos/.../git/blobs → 404`.
Gitea 1.22.3 simply does not implement the GitHub Git Data WRITE API
(`POST /git/blobs`, `POST /git/trees`, `POST /git/commits`, `PATCH /git/refs/...`).
All four return 404. Only the READ side (`GET /git/refs/...`, `GET /git/commits/...`,
`GET /git/trees/...?recursive=1`) is supported by Gitea.

This is the last blocker in the customer marketplace journey — steps 14→16→17
(Org CR + vCluster + WordPress) all stall on this single 404.

Fix
---
- New `commitOnceContents` path that batches creates/updates/deletes into one
  `POST /repos/{owner}/{repo}/contents` (Gitea ≥ 1.21 ChangeFiles endpoint).
  Files are base64-encoded; updates carry the existing blob SHA sourced from
  the recursive tree listing (which IS supported on Gitea).
- New `targetsGitea()` predicate: when `APIURL != ""` (Sovereign in-cluster
  Gitea), `commitOnce` routes through the contents API. When empty (upstream
  github.com / contabo path), it keeps the original Git Data blob+tree+
  commit+updateRef dance untouched — upstream GitHub does NOT expose a batch
  ChangeFiles endpoint, so we must not unconditionally switch.
- `isFastForwardRejection` extended to recognise Gitea's branch-moved wording
  (409 / "branch has been changed" / "stale base"), so the existing outer
  retry loop in `CommitFilesWithPruneAndRebuild` keeps working across both
  backends.
- Prune semantics preserved: any blob under a managed prefix that's not in
  the files map becomes a delete op in the same batch.
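
A rough sketch of how the batch could be assembled from the desired files
and the recursive tree listing. The JSON field names follow the request
shape shown under "API before vs after"; `buildOps`, the struct names, and
the map-based inputs are hypothetical, and the HTTP call itself is left out.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

type fileOp struct {
	Operation string `json:"operation"`
	Path      string `json:"path"`
	Content   string `json:"content,omitempty"`
	SHA       string `json:"sha,omitempty"`
}

type changeFilesRequest struct {
	Branch  string   `json:"branch"`
	Message string   `json:"message"`
	Files   []fileOp `json:"files"`
}

// buildOps turns desired (path -> content) and existing (path -> blob SHA)
// into create/update/delete operations. Existing paths under a managed
// prefix that are absent from desired are pruned.
func buildOps(desired map[string][]byte, existing map[string]string, managedPrefixes []string) []fileOp {
	var ops []fileOp
	for path, content := range desired {
		op := fileOp{Operation: "create", Path: path, Content: base64.StdEncoding.EncodeToString(content)}
		if sha, ok := existing[path]; ok {
			op.Operation, op.SHA = "update", sha // updates carry the existing blob SHA
		}
		ops = append(ops, op)
	}
	for path, sha := range existing {
		if _, keep := desired[path]; keep {
			continue
		}
		for _, prefix := range managedPrefixes {
			if strings.HasPrefix(path, prefix) {
				ops = append(ops, fileOp{Operation: "delete", Path: path, SHA: sha})
				break
			}
		}
	}
	// Sort for a deterministic request body.
	sort.Slice(ops, func(i, j int) bool { return ops[i].Path < ops[j].Path })
	return ops
}

func main() {
	ops := buildOps(
		map[string][]byte{"orgs/acme/org.yaml": []byte("kind: Organization\n")},
		map[string]string{"orgs/acme/old.yaml": "abc123"},
		[]string{"orgs/acme/"},
	)
	body, err := json.MarshalIndent(changeFilesRequest{Branch: "main", Message: "sync", Files: ops}, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```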

Test coverage
-------------
- `TestCommitFiles_GiteaTarget_UsesContentsAPI` asserts the new path POSTs
  to `/repos/.../contents` and never touches `/git/blobs|trees|commits` or
  `PATCH /git/refs/...`.
- `TestCommitFiles_GiteaTarget_UpdateUsesExistingSHA` asserts updates carry
  the existing blob SHA (Gitea 422s without it).
- `TestCommitFiles_UpstreamTarget_KeepsGitDataAPI` pins the upstream Git
  Data API path so the Gitea fork doesn't accidentally also fire on
  api.github.com.

API before vs after
-------------------
Before (Gitea path):
  POST /repos/{o}/{r}/git/blobs        404
  POST /repos/{o}/{r}/git/trees        404
  POST /repos/{o}/{r}/git/commits      404
  PATCH /repos/{o}/{r}/git/refs/h/main 404

After (Gitea path):
  POST /repos/{o}/{r}/contents
    {"branch":"main","message":"...","files":[
      {"operation":"create|update|delete","path":"...","content":"<b64>","sha":"<existing-sha-if-update>"}
    ]}

Refs TBD-C18d, WBS Wave 33 retry. Closes #1781.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:31:12 +04:00
hatiyildiz
39c8464554 deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.171 -> 1.4.172 (auto, Refs TBD-A6) 2026-05-18 17:30:37 +00:00
e3mrah
be1ad96f43
feat(security): baseline CNPs for cilium-gateway + catalyst-system namespaces (Closes #1746) (#1785)
Cov-bench confirmed only 2 CNPs cluster-wide and zero in either
critical namespace. WBS row C12-009 (TBD-Cov-12) fails until baseline
coverage lands. Ship two namespaced CiliumNetworkPolicies under
products/catalyst/chart/templates/network-policies/:

  - baseline-default-deny in catalyst-system: default-deny with
    explicit allow for cilium-gateway ingress + same-namespace +
    kubelet host probes; egress to kube-apiserver / kube-dns /
    same-namespace / 14 platform namespaces + world TCP/443.
  - baseline-cilium-gateway-allow in kube-system: scoped to the
    reserved:ingress endpoint, namespaced equivalent of the
    qaFixtures allow-gateway-world-ingress CCNP.

Both CNPs mirror the working bp-external-dns-apiserver +
qa-fixtures patterns (toEntities/reserved.ingress selectors,
label conventions, operator-tunable allow lists). Bundle is
helm-gated on .Values.security.baselineCnp.enabled (default true)
and independent of qaFixtures so it ships on every Sovereign.
Platform-namespace allow list tunable via
.Values.security.baselineCnp.allowedPlatformNamespaces.

Chart bump to 1.4.171.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:29:51 +04:00
github-actions[bot]
202bda36bf deploy: update catalyst images to b3b0539 2026-05-18 17:27:30 +00:00
e3mrah
b3b05391ac
fix(sovereign-tls): include all parent_domains in Gateway listeners (Closes #1772) (#1784)
Wave 32 D27-D31 verifier on t22 found tfvars carrying
parent_domains: [{omantel.biz, primary}, {omani.homes, sme-pool}]
but the live Cilium Gateway advertising only *.t22.omantel.biz —
*.omani.homes never rendered as a listener, so every sme-pool
tenant hit the envoy default fallback cert.

Root cause: writeTfvars emitted the structural `parent_domains`
JSON array but never set `parent_domains_yaml` — the YAML-string
variable infra/hetzner/variables.tf declares and that
infra/hetzner/main.tf locals.parent_domains_decoded actually
yamldecode()s to derive the listener pool. With the variable
empty, the terraform local fell through to the single-zone
fallback `[{name: "<sovereign_fqdn>", role: "primary"}]` and
every sme-pool zone the operator added was silently dropped
from the Gateway listener list.

Fix: writeTfvars now renders parent_domains_yaml as a JSON-flow
array literal (`[{"name":"x","role":"y"},...]`) carrying every
parent_domains entry. JSON is a subset of YAML (flow style), so
yamldecode() reads it natively. Empty ParentDomains still emits
"" so the single-zone fallback (derived from sovereign_fqdn)
keeps working for legacy payloads.
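
The literal can be produced with encoding/json alone, as in the sketch
below. The function and type names are illustrative, and defaulting an
empty role to "primary" is an assumption; the message only says that roles
are defaulted.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type parentDomain struct {
	Name string `json:"name"`
	Role string `json:"role"`
}

// parentDomainsYAMLLiteral renders the JSON-flow array that yamldecode()
// can read. An empty input yields "" so the single-zone fallback applies.
func parentDomainsYAMLLiteral(domains []parentDomain) (string, error) {
	if len(domains) == 0 {
		return "", nil
	}
	out := make([]parentDomain, 0, len(domains))
	for _, d := range domains {
		role := strings.ToLower(strings.TrimSpace(d.Role))
		if role == "" {
			role = "primary" // assumed default
		}
		out = append(out, parentDomain{
			Name: strings.ToLower(strings.TrimSpace(d.Name)),
			Role: role,
		})
	}
	b, err := json.Marshal(out)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	lit, err := parentDomainsYAMLLiteral([]parentDomain{
		{Name: "omantel.biz", Role: "primary"},
		{Name: "omani.homes", Role: "sme-pool"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(lit)
}
```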

Day-2 re-trigger note: AddParentDomain persists the new entry to
dep.Request.ParentDomains so a subsequent provisioner.Provision
re-write picks up the updated literal. The hcloud_server's
user_data has no `ignore_changes`, so an existing Sovereign
cannot get the new listener via tofu apply (the changed user_data
would force a destructive recreate). Instead, the handler now logs an
operator hint
pointing at the live Sovereign's Kustomization sovereign-tls
postBuild.substitute.PARENT_DOMAINS_LISTENERS_YAML field.

Tests:
- TestWriteTfvars_EmitsParentDomainsYAMLForSMEPool — regression
  guard for the exact t22 scenario (primary + sme-pool).
- TestWriteTfvars_EmitsParentDomainsYAMLEmptyOnSingleZone —
  fallback path preserved for legacy single-zone payloads.
- TestParentDomainsYAMLLiteral_RoundTripsCleanly — table-driven
  unit test (lowercasing, role defaulting, JSON-flow shape).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:25:07 +04:00
e3mrah
998fa67e41
fix(tenant+sandbox): wire K8s client SA + NEWAPI_DEFAULT_CHANNELS default (Closes #1775, #1777) (#1783)
Wave 32 D35 verifier caught two adjacent Sandbox-plane bugs on t26:

TBD-D35a (#1775): tenant service hosts the SandboxOrchestrator
(core/services/tenant/handlers/sandbox_consumer.go) which materialises
Sandbox.sandbox.openova.io CRs on every tenant.sandbox_requested
event. main.go buildDynamicClient logs
`sandbox-orchestrator: kubernetes client unavailable — orchestrator
disabled` and silently skips the consumer because the tenant SA carries
automountServiceAccountToken=false (zero blast-radius default from
#76) AND no Role grants verbs on sandbox.openova.io. Fix: flip the
flag to true on both the SA + the pod spec, plus a narrow Role +
RoleBinding granting get + create on sandboxes.sandbox.openova.io
scoped to the catalyst-system namespace
(handlers.DefaultSandboxNamespace). Verbs match what the orchestrator
actually exercises against the dynamic.Interface (Get for idempotency
pre-check, Create for CR materialisation) — a leaked tenant SA token
still cannot patch/delete Sandbox CRs or touch any other CRD group.

TBD-D35c (#1777): sandbox-controller fails per-Sandbox token mint
with NoAllowedChannels (sandbox_controller.go:191) because the
NEWAPI_DEFAULT_CHANNELS env defaulted to "" in
platform/sandbox/chart/values.yaml and bootstrap-kit slot 19a never
wired an envsubst placeholder. Fix: default chart value to "qwen"
(the only channel alias bp-newapi channel-seed-job.yaml writes on a
fresh Sovereign install — alias for qwen3.6-bankdhofar per
products/sandbox/docs/newapi-proxy-contract.md §2), AND add
`${SANDBOX_DEFAULT_CHANNELS:-qwen}` to slot 19a so per-Sovereign
overlays can extend without forking the chart (e.g.
SANDBOX_DEFAULT_CHANNELS=qwen,anthropic,openai).
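
The `${SANDBOX_DEFAULT_CHANNELS:-qwen}` fallback followed by list parsing
could be modelled as below. This is an illustrative sketch only:
`allowedChannels` and `errNoAllowedChannels` are hypothetical names, and
the comma-separated format is inferred from the example value above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var errNoAllowedChannels = errors.New("NoAllowedChannels")

// allowedChannels applies the ":-qwen" default (used when the overlay value
// is unset or empty) and splits the result on commas.
func allowedChannels(overlay string) ([]string, error) {
	raw := overlay
	if raw == "" {
		raw = "qwen"
	}
	var channels []string
	for _, part := range strings.Split(raw, ",") {
		if name := strings.TrimSpace(part); name != "" {
			channels = append(channels, name)
		}
	}
	if len(channels) == 0 {
		return nil, errNoAllowedChannels
	}
	return channels, nil
}

func main() {
	for _, overlay := range []string{"", "qwen,anthropic,openai"} {
		channels, err := allowedChannels(overlay)
		fmt.Println(channels, err)
	}
}
```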

Chart bump 1.4.170 → 1.4.171 + bootstrap-kit pin 13-bp-catalyst-
platform.yaml 1.4.170 → 1.4.171.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:23:28 +04:00
github-actions[bot]
7d8d99d3c7 deploy: update catalyst images to 5bb9032 2026-05-18 17:17:47 +00:00
e3mrah
5bb903275d
fix(catalyst-ui): Mothership auto-redirect on ?token= to Sovereign handover (Closes #1773) (#1782)
# Problem (DoD gate D0 — founder's #1 pinned gate per
# feedback_handover_redirect_is_critical_d0.md)

When the operator lands on `console.openova.io/sovereign/jobs?token=<JWT>`
(via fresh tab from the wizard SuccessPage, share-link, browser history),
the Mothership UI used to render its own Jobs page and strand the
operator there. The bundle had ZERO references to `mint-handover-token`,
`redirectURL`, or any `?token=` handler.

Verified live on t22 chart 1.4.168 (Wave 32 evidence):
  1. POST /sovereign/api/v1/deployments/{id}/mint-handover-token
     returns { redirectURL, token } as expected.
  2. Navigating to console.openova.io/sovereign/jobs?token=<JWT> stays
     on Mothership — never redirects to console.t22.omantel.biz/auth/handover.

Without this redirect, every other DoD gate is invisible to the operator
(memory: "the fucking successful handover is still not there ... end user
is not even aware if the sovereign environment is provisioned").

# Fix

New module `shared/lib/mothershipTokenRedirect.ts` runs at bootstrap
BEFORE the router, fetch interceptor, or DOM render:

  1. Only fires on Mothership host (console.openova.io).
  2. Reads `?token=<JWT>` from window.location.search.
  3. Decodes the JWT payload (no signature verification — the
     Sovereign-side /auth/handover does full RS256 verify + aud-binding).
  4. Extracts the `aud` claim. Per catalyst-api/handover_jwt.go, aud is
     `["https://console.<sovereignFqdn>"]` (array) or string form.
  5. Constructs `https://console.<sovereignFqdn>/auth/handover?token=<JWT>`
     and `window.location.replace()` to it.
  6. Self-loop guard: refuses to redirect if aud points back at the
     Mothership.

`main.tsx` calls `runMothershipTokenRedirect()` first; if it returns true
the rest of bootstrap is skipped (avoids Mothership UI flash during the
hard-nav).
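
The real module is TypeScript; the decision logic in steps 1 to 6 is
restated here in Go purely as an illustration. `redirectTarget`,
`audiences`, and `unsignedToken` are hypothetical names, and like the
original the sketch decodes the payload without verifying the signature.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"net/url"
	"strings"
)

const mothershipHost = "console.openova.io"

// audiences decodes the JWT payload and returns aud, which may be a
// string or an array. No signature verification happens here.
func audiences(token string) []string {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return nil
	}
	payload, err := base64.RawURLEncoding.DecodeString(strings.TrimRight(parts[1], "="))
	if err != nil {
		return nil
	}
	var claims struct {
		Aud json.RawMessage `json:"aud"`
	}
	if json.Unmarshal(payload, &claims) != nil {
		return nil
	}
	var one string
	if json.Unmarshal(claims.Aud, &one) == nil {
		return []string{one}
	}
	var many []string
	if json.Unmarshal(claims.Aud, &many) == nil {
		return many
	}
	return nil
}

// redirectTarget returns the Sovereign handover URL, or false for a no-op.
func redirectTarget(currentHost, search string) (string, bool) {
	if currentHost != mothershipHost {
		return "", false // host gate
	}
	q, err := url.ParseQuery(strings.TrimPrefix(search, "?"))
	if err != nil {
		return "", false
	}
	token := q.Get("token")
	if token == "" {
		return "", false
	}
	for _, aud := range audiences(token) {
		u, err := url.Parse(aud)
		if err != nil || u.Scheme != "https" || !strings.HasPrefix(u.Host, "console.") {
			continue
		}
		if u.Host == mothershipHost {
			continue // self-loop guard
		}
		return "https://" + u.Host + "/auth/handover?token=" + url.QueryEscape(token), true
	}
	return "", false
}

// unsignedToken builds a JWT-shaped string with a placeholder signature,
// for demonstration only.
func unsignedToken(payloadJSON string) string {
	enc := base64.RawURLEncoding.EncodeToString
	return enc([]byte(`{"alg":"RS256"}`)) + "." + enc([]byte(payloadJSON)) + ".sig"
}

func main() {
	tok := unsignedToken(`{"aud":["https://console.t22.omantel.biz"]}`)
	target, ok := redirectTarget("console.openova.io", "?token="+tok)
	fmt.Println(ok, target)
}
```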

# Tests

`mothershipTokenRedirect.test.ts` — 18 unit tests covering the
pure decision function:
  - aud as array vs string vs missing
  - chroot URL extraction (https-only, console.<host>, self-loop guard)
  - JWT preservation across redirect (no claim mutation)
  - Mothership host gate (no-op on Sovereign / dev hosts)
  - malformed-JWT no-op
  - missing-?token= no-op

All 18 tests pass. tsc + eslint clean. Pre-existing unrelated test
failures in StepComponents.test.tsx (CORTEX cascade) verified to also
fail on origin/main without these changes.

Refs: feedback_handover_redirect_is_critical_d0.md, Wave 32 evidence,
GitHub issue #1773.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:15:41 +04:00