openova

Author	SHA1	Message	Date
hatiyildiz	1a65d9120d	fix(controllers): NATS consume-leg for D35 (organization + sandbox) PR #1626 wired the publish-leg (tenant + billing → NATS JetStream catalyst.<domain>.<event>). The consume-leg was missing: no in-cluster controller subscribed, so D35 (NATS round-trip end-to-end) stayed yellow even though the publish leg shipped. This PR adds: - core/controllers/pkg/natsbus: minimal JetStream subscriber shared by Group-C controllers. Self-contained (no dep on core/services/shared which pulls in franz-go/Kafka the controllers never touch). - core/controllers/organization/internal/controller/nats_bridge.go: subscribes to catalyst.tenant.created + catalyst.billing.order.placed, patches openova.io/last-event-observed-at + ...-subject annotations on the matching Organization CR. The annotation patch triggers an informer event → controller-runtime enqueues Reconcile within ~50ms instead of waiting for the 30s requeue fallback. - core/controllers/sandbox/internal/controller/nats_bridge.go: same pattern for catalyst.tenant.sandbox_requested. Looks up Sandbox CR using the same `sandbox-<sanitised-email>` naming convention tenant-service's SandboxOrchestrator (PR #1633) writes under. - main.go wiring in both controllers reads NATS_URL from env. Unset = log "consume-leg disabled" + continue (informer requeue fallback intact). The 30s RequeueAfter inside r.Reconcile is unchanged — NATS is an accelerator, not the only path. Idempotency: ev.Timestamp is the broker-side time stamp, so duplicate JetStream delivery produces a byte-stable annotation patch and controller-runtime does NOT enqueue a redundant Reconcile. Tests cover Ack/Nak/Ack-to-skip dispatch (subscriber_test.go), the happy path, the no-matching-CR soft miss, duplicate-envelope no-churn, malformed JSON poison-pill, and the publish-side ↔ consume-side name derivation lockstep for Sandbox CRs. 
HARD CONSTRAINT respected: no credential mutations — bridges read only the envelope + the target CR, never Secrets or Keycloak SA creds. Refs #1835 (D35 round-trip end-to-end), Refs #1776 (D35b sandbox). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 23:41:44 +02:00
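The idempotency argument in the commit above (same broker timestamp, same annotation value, no redundant Reconcile) can be sketched in Go. This is a minimal illustration, not the bridge's actual code: `stampObservedAt` and the sample timestamp are hypothetical, and only the `openova.io/last-event-observed-at` key is taken from the message. The companion subject annotation is left out because its full key is elided there.

```go
package main

import "fmt"

// Annotation key quoted in the commit message.
const observedAtKey = "openova.io/last-event-observed-at"

// stampObservedAt (hypothetical helper) writes the envelope timestamp into a
// non-nil annotation map and reports whether the map changed. A redelivered
// envelope carries the same timestamp, so the second call changes nothing and
// a patch built from it would be byte-identical to the first.
func stampObservedAt(annotations map[string]string, envelopeTimestamp string) bool {
	if annotations[observedAtKey] == envelopeTimestamp {
		return false
	}
	annotations[observedAtKey] = envelopeTimestamp
	return true
}

func main() {
	ann := map[string]string{}
	ts := "2026-05-18T21:41:44Z" // hypothetical broker-side timestamp

	fmt.Println(stampObservedAt(ann, ts)) // true: first delivery
	fmt.Println(stampObservedAt(ann, ts)) // false: duplicate delivery
}
```

Keying the no-op check on the broker timestamp rather than the local receive time is what keeps duplicate JetStream deliveries from churning the CR.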
github-actions[bot]	de53b39d13	deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.25	2026-05-18 21:05:55 +00:00
github-actions[bot]	8b33188019	deploy: bump bp-newapi upstream v0.13.2 chart 1.4.21	2026-05-18 21:04:48 +00:00
e3mrah	cf35b4a9b6	fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856 ) (#1858 ) A17 (#1855) hot-patched 6 drifted blueprints (cilium, cert-manager, flux, openbao, keycloak, gitea) where blueprint.yaml spec.version had silently fallen behind chart/Chart.yaml version, breaking TestBootstrapKit_BlueprintCardsHaveRequiredFields. The structural root cause: the TBD-A6 auto-bump hook in blueprint-release.yaml updated only clusters/_template/bootstrap-kit/<N>-<chart>.yaml pins on every chart publish — never the upstream platform/<bp>/blueprint.yaml. This PR extends the auto-bump hook to lockstep platform/<bp>/blueprint.yaml spec.version whenever Chart.yaml version bumps. Both file edits land in the SAME commit (subject becomes `deploy(<chart>): bump bootstrap-kit pin X -> Y (auto, Refs TBD-A6)` with a secondary line noting the blueprint lockstep). Idempotent reset-and-rewrite retry preserved for the existing parallel-matrix race case. Workflow changes (.github/workflows/blueprint-release.yaml): * New step `bump_blueprint` after `bump_pin` — locates ${matrix.path}/blueprint.yaml OR ${matrix.path}/chart/blueprint.yaml (handles both platform-leaf and products-umbrella conventions), filters to kind:Blueprint (defensive against CRD yaml at the products/catalyst/chart/crds path), reads current spec.version at 2-space indent, sed-rewrites to CHART_VERSION, verifies post-write. * Commit step renamed to "Commit + push bootstrap-kit pin bump + blueprint.yaml lockstep"; stages both files, single commit, with convergent retry on conflict. * Summary block surfaces both bumps separately. Regression test (tests/e2e/bootstrap-kit/main_test.go): * New TestBootstrapKit_BlueprintVersionLockstepSweep — walks platform/* and products/, discovers every Blueprint manifest with a sibling Chart.yaml, asserts spec.version == Chart.yaml version. Covers ALL ~70 blueprints, not just the canonical 10 kit ones the existing TestBootstrapKit_BlueprintCardsHaveRequiredFields gates. 
Failure messages name the file, drift direction, and the exact sed command to fix — drift remediation is mechanical. Drift cleanup (mandatory companion, same shape as A17/#1855): 26 Application-Blueprint blueprints whose spec.version had been left at 1.0.0 / 0.1.0 while Chart.yaml moved forward — synced down to Chart.yaml as authoritative. All currently surface in the new sweep test; without the cleanup the test would block this PR (and every subsequent one). Affected: alloy, cert-manager-{dynadot,powerdns}-webhook, cluster-autoscaler-hcloud, cnpg, crossplane-claims, external-secrets[-stores], falco, grafana, guacamole, harbor, hcloud-csi, k8s-ws-proxy, mimir, netbird, newapi, openclaw, powerdns, seaweedfs, self-sovereign-cutover, trivy, valkey, velero, vpa, products/dmz-vcluster. After this lands, the next chart-version bump in any platform/<bp>/ folder auto-converges all three artifacts (Chart.yaml, blueprint.yaml, bootstrap-kit pin) in a single bot commit. No more manual collector PRs; no more silent drift between chart and Blueprint manifest. Closes #1856. Refs #1855 (A17 hot-patch this replaces structurally), #1713 (original TBD-A6 auto-bump hook). Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 01:04:22 +04:00
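The lockstep rule asserted by the sweep test (Blueprint `spec.version` must equal the `Chart.yaml` version) can be approximated with a line-based check. This is a sketch under stated assumptions, not the test in tests/e2e/bootstrap-kit/main_test.go: `specVersion`, `chartVersion` and `inLockstep` are hypothetical names, the manifests in `main` are invented samples, and the scan relies on the 2-space indent the commit mentions instead of a real YAML parser.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanValue strips whitespace and optional quotes from a YAML scalar.
func cleanValue(s string) string {
	return strings.Trim(strings.TrimSpace(s), `"'`)
}

// specVersion returns the two-space-indented "version:" value found inside
// the top-level "spec:" block of a manifest.
func specVersion(manifest string) (string, bool) {
	inSpec := false
	for _, line := range strings.Split(manifest, "\n") {
		line = strings.TrimRight(line, " \r")
		if line == "" || strings.HasPrefix(strings.TrimSpace(line), "#") {
			continue // blank lines and comments never open or close a block
		}
		if !strings.HasPrefix(line, " ") {
			inSpec = strings.HasPrefix(line, "spec:") // new top-level key
			continue
		}
		if inSpec && strings.HasPrefix(line, "  version:") {
			v := cleanValue(strings.TrimPrefix(line, "  version:"))
			return v, v != ""
		}
	}
	return "", false
}

// chartVersion returns the top-level "version:" value of a Chart.yaml.
func chartVersion(chart string) (string, bool) {
	for _, line := range strings.Split(chart, "\n") {
		if strings.HasPrefix(line, "version:") {
			v := cleanValue(strings.TrimPrefix(line, "version:"))
			return v, v != ""
		}
	}
	return "", false
}

// inLockstep reports whether both versions are present and equal.
func inLockstep(blueprint, chart string) bool {
	bv, okB := specVersion(blueprint)
	cv, okC := chartVersion(chart)
	return okB && okC && bv == cv
}

func main() {
	blueprint := "kind: Blueprint\nspec:\n  version: 1.0.0\n" // invented sample
	chart := "name: bp-example\nversion: 1.2.2\n"             // invented sample
	fmt.Println(inLockstep(blueprint, chart))                 // false: drift
}
```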
e3mrah	2484c8a3de	fix(bp-velero): bump 1.2.1 -> 1.2.2 to force a publish (Closes #1799) (#1846) TBD-A13: `ghcr.io/openova-io/bp-velero:1.2.1` returns not-found because the 1.2.1 bump in platform/velero/chart/Chart.yaml shipped only in the initial-fill commit (`e5c2797c` "deploy: bump sandbox-mcp-server image to cadc7b5") which never triggered the blueprint-release workflow. As a result every fresh Sovereign's bp-velero HelmRelease (slot 34) is stuck InProgress and the bootstrap-kit kustomization fails its health check. GHCR currently has 1.0.0, 1.1.0, 1.2.0 — confirmed via `/orgs/openova-io/packages/container/bp-velero/versions`. Bump to 1.2.2 (chart + bootstrap-kit pin in lockstep so the A6 sync gate stays GREEN) so blueprint-release.yaml fires on this push, publishes `ghcr.io/openova-io/bp-velero:1.2.2`, and the auto-bump-pin step is a no-op. No payload changes — same upstream vmware-tanzu/velero 12.0.1 subchart, same templates, same values. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 00:43:13 +04:00
hatiyildiz	9975e057da	deploy(bp-newapi): bump bootstrap-kit pin 1.4.19 -> 1.4.20 (auto, Refs TBD-A6)	2026-05-18 20:38:15 +00:00
github-actions[bot]	9982dcafa8	deploy: bump bp-newapi upstream v0.13.2 chart 1.4.20	2026-05-18 20:37:26 +00:00
e3mrah	3d0c96a237	fix(bp-newapi): single-pod DB migration via startupProbe (Closes #1798) (#1857) newapi-mirror:v0.13.2 hangs on first-boot GORM AutoMigrate against an empty CNPG database: kubelet's pre-A12 liveness probe (initialDelay 30s + period 10s + failureThreshold 3 = ~50s ceiling) SIGKILLs the binary mid-migration on every restart. The 28-CREATE-TABLE + 2-column-type AutoMigrate takes 60-120s on cpx21/cpx31 nodes with sslmode=require — well over the kill window. On t22 chart 1.4.18 the `newapi` DB had ZERO public-schema tables after 29 CrashLoopBackOff restarts because every kill happened before the GORM connection pool's first wire write completed (pg_stat_activity on the CNPG primary showed no newapi-user connections). Symptom (t22 verify, pod newapi-bp-newapi-6fd8799b6-lpsd2): [SYS] ... database migration started ← last log line exitCode=2 finishedAt-startedAt = 50s exactly Readiness probe: connect: connection refused 10.42.0.185:3000 DB: psql \dt → "Did not find any relations" CNPG: pg_stat_activity → no `newapi` user connections Fix (canonical k8s pattern, Inviolable Principle #16 — own the seam): add a startupProbe that gates BOTH liveness and readiness until the binary opens :3000/api/status. Budget 30 × 10s = 5 min, comfortably above the observed 60-120s ceiling and below operator-impatience limits. Liveness's pre-A12 cadence (30s/10s/3) is unchanged but only activates after startupProbe success per kubelet semantics. The probe block is operator-tunable via `.Values.newapi.probes.startup.*`; setting it to `null` skip-renders the block so overlays against a pre-seeded DB can opt out (Inviolable Principle #4). Also bumps the bootstrap-kit pin 1.4.18 → 1.4.19 in slot 80 so freshly franchised Sovereigns pull the new chart on next prov. Render tested (smoke + override): startupProbe present with failureThreshold=30 in defaults; suppressed when startup: null. 
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 00:37:00 +04:00
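The timing arithmetic in the commit above can be worked through in a few lines of Go. This is an approximation under stated assumptions: the first probe fires after the initial delay, the container is killed on the last consecutive failure, and probe timeouts and termination grace are ignored. The `probe` type and both helper names are hypothetical; only the numbers come from the message.

```go
package main

import "fmt"

// probe mirrors the three kubelet probe fields quoted in the commit message.
type probe struct {
	InitialDelaySeconds int
	PeriodSeconds       int
	FailureThreshold    int
}

// firstKillAfterSeconds approximates when a never-healthy container is
// killed: first probe at InitialDelaySeconds, then one probe per period until
// FailureThreshold consecutive failures have been counted.
func firstKillAfterSeconds(p probe) int {
	return p.InitialDelaySeconds + (p.FailureThreshold-1)*p.PeriodSeconds
}

// startupBudgetSeconds is the window a startupProbe grants before the kubelet
// gives up: failureThreshold x periodSeconds.
func startupBudgetSeconds(p probe) int {
	return p.FailureThreshold * p.PeriodSeconds
}

func main() {
	liveness := probe{InitialDelaySeconds: 30, PeriodSeconds: 10, FailureThreshold: 3}
	startup := probe{PeriodSeconds: 10, FailureThreshold: 30}

	fmt.Println(firstKillAfterSeconds(liveness)) // 50
	fmt.Println(startupBudgetSeconds(startup))   // 300
}
```

The 50s result matches the `finishedAt-startedAt = 50s exactly` symptom, and 300s sits well above the observed 60-120s migration time.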
e3mrah	a8931db541	fix(ci): sync stale blueprint.yaml versions + soften push-mode pin-sync race (Closes #1849) (#1855) Two disjoint regressions stack-failed test-bootstrap-kit.yaml on every push to main: 1. manifest-validation — TestBootstrapKit_BlueprintCardsHaveRequiredFields asserts platform/<bp>/blueprint.yaml spec.version == chart/Chart.yaml version. Six blueprints had drifted: cilium (1.3.0->1.3.5), cert-manager (1.2.0->1.2.2), flux (1.2.0->1.2.2), openbao (1.2.14->1.2.16), keycloak (1.5.0->1.4.5 — blueprint led chart, sync to authoritative Chart.yaml), gitea (1.2.5->1.2.7). Chart.yaml is canonical (drives bootstrap-kit pin -> Sovereign install); blueprint.yaml gets resynced down/up to match. 2. pin-sync-audit on push — full-sweep audit races the blueprint-release auto-bump hook. Chart-bump merge commit has chart=N pin=N-1 drift until the auto-bump bot commits the pin update ~60s later; the bot push (GITHUB_TOKEN convention) does not retrigger this workflow, so the failure remains in run history. Fix: set continue-on-error: true on push/workflow_dispatch events (PR remains blocking via --changed-only). The full-sweep output still surfaces drift on the run summary; it just doesn't fail the overall run while the heal-in-~60s window is open. Documented inline in the job header. Net effect: every push to main re-runs cleanly green. The 13 pre-existing drifts called out in the existing job comment will continue to heal as each lagging chart gets its next bump (auto-bump hook + this PR's manifest-validation alignment). Refs PRs #1666 #1687 #1695 #1698 #1706 #1707 (the manual collector PRs TBD-A6 eliminated for bootstrap-kit pins; this PR extends the convergence to blueprint.yaml versions which the test asserts but the auto-bump hook does not yet update). Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>	2026-05-19 00:34:48 +04:00
e3mrah	d36e54df74	test(chart): baseline CNP allow-list contract gate — guards #1785→#1803→#1847 cascade (Closes #1850) (#1854) The May 2026 baseline-CNP cascade shipped three production bugs in two days because nothing in CI rendered the chart and asserted on the rendered CiliumNetworkPolicy shape: - #1785 (chart 1.4.171) — added the baseline CNP for catalyst-system with WORLD egress restricted to TCP/443 only AND no ingress allow for the `catalyst` namespace. - #1803 (chart 1.4.177) — re-added SMTP egress (587/465/25 TCP) after /api/v1/auth/pin-request 502'd on every fresh onboarding. - #1847 (chart 1.4.178) — re-added ingress from `catalyst` after t24 fresh-prov handover hung at WAIT_TIMEOUT_SECONDS=1500s. This adds products/catalyst/chart/tests/baseline-cnp-allowlist.sh — a pure helm-template + grep/awk contract gate matching the existing platform/self-sovereign-cutover/chart/tests/cutover-contract.sh pattern. The Blueprint Release workflow already runs every *.sh under chart/tests/ as a publish gate (see blueprint-release.yaml line 384), so the gate is wired automatically and fails publish BEFORE the OCI artifact reaches a Sovereign. 13 cases asserted: 1. baseline-default-deny CNP renders + is namespaced to catalyst-system 2. egress allows SMTP submission 587/TCP (#1803 regression guard) 3. egress allows SMTPS 465/TCP (#1803 regression guard) 4. egress allows legacy SMTP 25/TCP (#1803 regression guard) 5. egress allows HTTPS 443/TCP to world 6. egress allows kube-dns 53/UDP + 53/TCP 7. ingress allows `catalyst` ns — cutover Pods → catalyst-api:8080 (#1847) 8. ingress allows `flux-system` (HelmRelease readiness probes) 9. ingress allows `kube-system` (operator + ccm + CoreDNS) 10. ingress is namespace-scoped — no fromEntities:{cluster|world|all} wildcard 11. catalyst-api Service exposes port 8080 (auto-trigger contract) 12. CNP toggles off cleanly with security.baselineCnp.enabled=false 13. 
allowedIngressNamespaces propagates via --set (operator-tunable) Negative-test confirmation (executed locally before commit): - Remove SMTP 587 from template → Case 2 FAILS, exit 1 - Remove `catalyst` from values.yaml default → Case 7 FAILS, exit 1 - Add `fromEntities: [cluster]` wildcard → Case 10 FAILS, exit 1 - Restore originals → all 13 cases PASS, exit 0 Refs: TBD-A18, PRs #1785 #1803 #1847, audit /tmp/audit-recent-prs-quality-report.json Closes #1850 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-19 00:32:28 +04:00
github-actions[bot]	82e972fb77	deploy: update catalyst images to `75cb059`	2026-05-18 20:26:21 +00:00
e3mrah	75cb059fc0	Merge pull request #1851 from openova-io/fix/a16-hetzner-ssh-key-sweep fix(hetzner): sweep orphan SSH keys by public_key comment (TBD-A16)	2026-05-19 00:24:19 +04:00
github-actions[bot]	e78faa986c	deploy: update catalyst images to `f07312c`	2026-05-18 20:23:49 +00:00
e3mrah	f07312c5ae	fix(cutover): RBAC + sovereign-fqdn ConfigMap + kubeconfig?region path — 3 t24 zero-touch P1 blockers (#1852 ) Three Wave 36 P1 fresh-prov blockers ship together as one chart 1.4.179 + bootstrap-kit pin bump + cloud-init substitute extension, because each fix is small and they share the same fresh-prov verification cycle. TBD-A14 (issue #1843) — catalyst-api-cutover-driver SA cannot list networkpolicies cluster-scope. Add networking.k8s.io/networkpolicies get/list/watch verbs to clusterrole-cutover-driver.yaml. Pre-fix the chroot in-cluster fallback's k8sCache.Factory reflector emitted continuous `networkpolicies is forbidden` errors at the cluster scope because only update/patch/delete were granted (existing mutation block) — the read path was never wired. Mirrors the existing cilium.io/ciliumnetworkpolicies block; the two CRDs co-exist (k8s NetworkPolicy = baseline L3/L4, CiliumNetworkPolicy = tier-3 L7). TBD-A15 (issue #1844) — sovereign-fqdn ConfigMap fields configuredRegions / controlPlaneIP / primaryRegion / replicaRegion / selfDeploymentId / enableHotStandby / qaApplications empty on every fresh prov. Pre-fix the envsubst placeholders resolved to empty because nothing wrote them into the bootstrap-kit Kustomization postBuild substitute map → the chart rendered empty strings → Dashboard SovereignCard configured-regions chips, Settings page operator-identity, /api/v1/sovereign/self, and the D31 active-hot-standby gating ALL silently fell through to default behaviour. 
Wired via three coordinated changes: - Chart values.yaml gains global.sovereignSelfDeploymentId default - bootstrap-kit slot 13 gains global.sovereignSelfDeploymentId, sovereign.configuredRegions, sovereign.qaApplications mappings (YAML inline-list shape `${SOVEREIGN_CONFIGURED_REGIONS_YAML:-[]}`) - cloud-init Kustomization substitute map gains SOVEREIGN_CONTROL_PLANE_IP (= load_balancer_ipv4), SOVEREIGN_PRIMARY_REGION / SOVEREIGN_REPLICA_REGION (canonical 4-segment labels), SOVEREIGN_ENABLE_HOT_STANDBY (reserved, default empty), SOVEREIGN_CONFIGURED_REGIONS_YAML (JSON-encoded cloudRegion list), QA_APPLICATIONS_YAML (reserved, default `[]`) - main.tf: new template inputs sovereign_configured_regions_yaml + replica_region_canonical_label (derived from local.secondary_regions), threaded into both primary CP and per-secondary-region cloud-init templatefile calls TBD-A10b (issue #1845) — GET /api/v1/deployments/{id}/kubeconfig?region=<cloudRegion> returns 409 kubeconfig-file-missing on fresh prov for every region. Pre-fix the handler only resolved `<id>-<region>.yaml` exactly, but the cloud-init PUT-back + mothership→chroot D16 fan-out use the tofu secondary-region key shape `<cloudRegion>-<i>` (e.g. `hel1-1`, `nbg1-2`) — so on-disk filenames look like `<id>-hel1-1.yaml`. Verifiers + operators commonly call with the bare `cloudRegion` (`?region=hel1`) because that's the matrix-doc-friendly form. Fall-back resolution order added to GetKubeconfig: exact-name first (legacy + manual operator PUT), then `<id>-<region>-*.yaml` glob (sort.Strings deterministic). Unit test covers all three paths: exact match, slot-suffix glob, unknown-region still 409. Closes the regression introduced when PR #1763 (mothership→chroot kubeconfig handover hook) started using the cloud-init naming convention for fan-out exports. 
Closes #1843, Closes #1844, Closes #1845 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 00:21:38 +04:00
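The fall-back resolution order described for GetKubeconfig (exact name first, then the `<id>-<region>-*.yaml` shape) can be sketched as a pure function over the file names on disk. This is not the handler's code: `resolveKubeconfigName` is a hypothetical name, the sample file names are invented, and picking the lexicographically first candidate after `sort.Strings` is an assumption about how the sorted glob result is used.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// resolveKubeconfigName picks a kubeconfig file for (id, region) from names.
// An exact "<id>-<region>.yaml" wins. Otherwise the first sorted name of the
// form "<id>-<region>-*.yaml" is returned. ok is false when nothing matches,
// the case where the commit says the endpoint still answers 409.
func resolveKubeconfigName(names []string, id, region string) (name string, ok bool) {
	exact := id + "-" + region + ".yaml"
	prefix := id + "-" + region + "-"

	var candidates []string
	for _, n := range names {
		if n == exact {
			return exact, true
		}
		if strings.HasPrefix(n, prefix) && strings.HasSuffix(n, ".yaml") {
			candidates = append(candidates, n)
		}
	}
	if len(candidates) == 0 {
		return "", false
	}
	sort.Strings(candidates) // deterministic choice regardless of readdir order
	return candidates[0], true
}

func main() {
	onDisk := []string{"dep1-hel1-2.yaml", "dep1-hel1-1.yaml", "dep1-nbg1.yaml"}

	fmt.Println(resolveKubeconfigName(onDisk, "dep1", "nbg1")) // dep1-nbg1.yaml true
	fmt.Println(resolveKubeconfigName(onDisk, "dep1", "hel1")) // dep1-hel1-1.yaml true
	fmt.Println(resolveKubeconfigName(onDisk, "dep1", "fsn1")) //  false
}
```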
hatiyildiz	6e883c1f8b	fix(hetzner): sweep orphan SSH keys by public_key comment (TBD-A16) Third match pass for SSH keys whose name AND label both drifted from the Tofu canonical emission. The OpenSSH public_key comment is the one piece of metadata that survives Console-rename, partial tofu apply, and out-of-band hcloud-cli edits — bootstrap-cli stamps the canonical prefix into it at generation. Caught in production 2026-05-18: catalyst-t24-omantel-biz blocked fresh t25 provs because previous wipe cycles left it as an orphan. Label-pass + name-prefix-pass had no signal once the name/label drifted. Adds boundary-aware HasPrefix check (the same P0 safety guard pinned by TestPurge_NamePrefixFallback_DoesNotTouchOtherCustomers) so wiping t2.omantel.biz cannot delete t20.omantel.biz's SSH key. Tests: - PublicKeyCommentFallback_DeletesUnlabeled (the third-pass match) - PublicKeyCommentFallback_BoundarySafety (P0 t2 vs t20 safety pin) - PublicKeyCommentFallback_NoDoubleCount (idempotent against earlier passes) - PublicKeyCommentFallback_LeavesOtherKeys (other tenants untouched) - PublicKeyComment_ParsesFormats (OpenSSH parser unit pins) - CommentMatchesPrefix_BoundaryRules (separator rune table) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 22:15:51 +02:00
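The boundary-aware prefix check that keeps a t2 wipe away from t20's key can be illustrated as follows. This is a sketch, not the purge code: `matchesPrefixAtBoundary` is a hypothetical name, the separator set is an assumption (the commit only says a separator rune table exists), and the key names in `main` are invented to show the t2/t20 collision.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesPrefixAtBoundary reports whether s starts with prefix and the prefix
// ends on a token boundary: either s equals prefix, or the next byte is one
// of the assumed separators.
func matchesPrefixAtBoundary(s, prefix string) bool {
	if prefix == "" || !strings.HasPrefix(s, prefix) {
		return false
	}
	rest := s[len(prefix):]
	if rest == "" {
		return true
	}
	return strings.ContainsRune("-._@ ", rune(rest[0]))
}

func main() {
	prefix := "catalyst-t2" // hypothetical canonical prefix for tenant t2

	// A plain HasPrefix matches both keys, which is the unsafe behaviour.
	fmt.Println(strings.HasPrefix("catalyst-t20-omantel-biz", prefix)) // true

	// The boundary-aware check only matches the t2 key.
	fmt.Println(matchesPrefixAtBoundary("catalyst-t2-omantel-biz", prefix))  // true
	fmt.Println(matchesPrefixAtBoundary("catalyst-t20-omantel-biz", prefix)) // false
}
```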
hatiyildiz	7a2cad9a47	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.177 -> 1.4.178 (auto, Refs TBD-A6)	2026-05-18 19:46:12 +00:00
e3mrah	31b7dc5859	fix(cnp): allow ingress from catalyst ns (cutover Pods) — fresh-prov handover blocker (Refs PR #1785 regression, t24 zero-touch finding) (#1847 ) PR #1785 (chart 1.4.171) shipped a baseline default-deny CiliumNetworkPolicy in catalyst-system whose ingress allowlist was limited to: - reserved.ingress: "" (cilium-gateway endpoint) - same-namespace catalyst-system Pods - host / remote-node / kube-apiserver entities The bp-self-sovereign-cutover chart stamps Jobs into the `catalyst` namespace, including the 10-auto-trigger Job whose Pod curls catalyst-api.catalyst-system.svc.cluster.local:8080 to fire /api/v1/internal/cutover/trigger. With #1785 in effect on a FRESH prov, every auto-trigger Pod times out at WAIT_TIMEOUT_SECONDS=1500s, handoverFiredAt stays null, and the D0 auto-redirect to the Sovereign Console never happens — the operator is stuck on mothership /jobs forever. Caught by t24 zero-touch verification (2026-05-18): handover_status: "BLOCKED — cutover auto-trigger Pod in 'catalyst' ns cannot reach catalyst-api in 'catalyst-system' ns because baseline-default-deny CNP allows ingress only from {reserved.ingress, catalyst-system ns, host entities}" The companion symptom on t22 was masked because t22's cutover Job had already completed before the CNP rolled out — the CNP did not gate ingress there. Fix ───────────────────────────────────────────────────────────────── Add a fourth ingress rule to baseline-default-deny allowing fromEndpoints in the operator-tunable list .Values.security.baselineCnp.allowedIngressNamespaces. 
Defaults: - catalyst — cutover Pods (the load-bearing fix) - flux-system — Helm/Kustomize/Source controllers probing Service readiness for HelmRelease health rollups (worked pre-#1785 via no-CNP default) - kube-system — Cilium operator + hcloud-ccm + CoreDNS that do cluster introspection calls (the reserved.ingress gateway endpoint here is still matched by rule 1's reserved.ingress: "" selector — this rule covers non-gateway Pods) The list mirrors the existing allowedPlatformNamespaces pattern on the egress side. No other rule semantics change. Chart bump 1.4.177 → 1.4.178. Companion regression to chart 1.4.177 (PR #1803, SMTP egress) — both are sub-regressions from the same #1785 baseline-CNP ship. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 23:45:28 +04:00
hatiyildiz	61948474b5	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.176 -> 1.4.177 (auto, Refs TBD-A6)	2026-05-18 19:28:52 +00:00
e3mrah	153fcf9419	fix(cnp): allow SMTP egress (587/465/25) from catalyst-system — fixes PIN-issue 502 regression from #1785 (#1803 ) PR #1785 (chart 1.4.171) shipped a baseline-default-deny CiliumNetworkPolicy in catalyst-system whose world-egress block was restricted to TCP/443 only. That silently broke SMTP submission from catalyst-api to the operator Stalwart relay (mail.openova.io), surfacing as 502s at /api/v1/auth/pin-request — customer journey step 11/12 (PIN-issue email delivery) is now blocked on every fresh Sovereign onboarding flow. DIAGNOSTIC EVIDENCE ------------------- - CNP `baseline-default-deny` in catalyst-system was created at 2026-05-18 18:13:09Z (the moment chart 1.4.171 rolled out). - Egress rule: toEntities: [world] toPorts: [443/TCP] i.e. only HTTPS world egress permitted. - A Pod in catalyst-system cannot `nc 45.151.123.50 587` (timeout). - A Pod in the default namespace on the SAME node connects fine and receives the `220 Stalwart ESMTP` banner — confirming the block is policy-driven, not network/host-firewall driven. FIX --- Extend the world-egress block in products/catalyst/chart/templates/network-policies/baseline-catalyst-system.yaml to permit, in addition to the existing 443/TCP: - 587/TCP — SMTP submission (the production path to mail.openova.io) - 465/TCP — SMTPS (fallback) - 25/TCP — legacy SMTP (fallback) All four ports are scoped to `toEntities: [world]`, matching the existing 443 allow. No other rule semantics change — same-namespace, cluster-DNS, kube-apiserver, and platform-namespace allows are untouched. The 25/TCP allow is included only as a legacy fallback; production traffic is on 587. A "Regression context — DO NOT NARROW THIS BLOCK WITHOUT REVIEW" comment is added inline so the next reviewer who tightens the block sees the failure mode that drove the widening. CHART ----- 1.4.176 → 1.4.177. Changelog entry added under the 1.4.176 block, above the version line, describing the regression + fix. 
VERIFICATION ------------ `helm template products/catalyst/chart` renders the updated CNP with four ports (443/587/465/25) under the world egress block; all other rules byte-identical to 1.4.176. Refs PR #1785 (the regression source), Issue #1746 (the original baseline-CNP work). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-18 23:28:19 +04:00
github-actions[bot]	732f2363b9	deploy: update catalyst images to `c422c97`	2026-05-18 19:16:52 +00:00
e3mrah	c422c97b97	fix(catalyst-api): publish body→query translation + rbac/assign CRD-NotFound detection (Refs TBD-C4-fup, TBD-C6-006-followup) (#1802) TBD-C4-fup — publish body→query translation regression guard: - Adds sme_catalog_client_test.go pinning the wire shape on smeCatalogClient.SetPublished. The C4-012 / #1735 fix (PR #1789) translates the chroot's {"published":true} JSON body into the upstream catalog's ?value=true|false query param shape that services-catalog SetAppPublished (handlers.go:303-313) requires. Wave 35 cov-bench v3 surfaced 400 here because the deploy bot hadn't bumped catalyst-api past `e2c56c3` (PR #1787) when the bench ran — PR #1789's translation was already in the merged code but not in the live image. The test pins URL + ?value=<bool> + empty body so any future revert fires. TBD-C6-006-followup — RBAC assign 500 → 503: - Root cause: UserAccess is a NAMESPACED Crossplane Claim per the XRD's claimNames block (platform/crossplane-claims/chart/ templates/xrds/useraccess.yaml). rbacAssignNamespace = "" routed the dynamic Create to the apiserver's cluster-scoped REST path /apis/access.openova.io/v1alpha1/useraccesses, which the apiserver doesn't serve for a namespaced CRD — returns 404 with "the server could not find the requested resource". PR #1789's apierrors.IsNotFound→503 wrapper never fired because the 404 was for the route, not the resource. - Fix: pin rbacAssignNamespace = "catalyst-system" and stamp it on every Create. Mirrors user_access_owner_seed.go's t134 D21 fix (userAccessOwnerNamespace = "catalyst-system"). Lists keep Namespace("") for cross-namespace listing (valid against a namespaced CRD — apiserver returns the union). - Defense in depth: isCRDNotInstalledErr() string-fallback for "the server could not find the requested resource" / "no matches for kind" — apierrors.IsNotFound can lose StatusReasonNotFound through error-chain wrapping. Mirrors catalog_client_cluster_fallback.isVersionNotServed. 
- user_access.go: same defect class — CreateUserAccess / UpdateUserAccess / tryDeleteUserAccess all called .Namespace("") on a namespaced CRD. CreateUserAccess now stamps rbacAssignNamespace; Update + Delete walk the all-namespaces list via findUserAccessByName() to discover the canonical ns before issuing the mutation against that exact REST path. Tests: - TestSetPublished_SendsQueryParamNotBody (regression guard for TBD-C4-fup) - TestHandleRBACAssign_CreateStampsNamespace (regression guard for TBD-C6-006-followup namespace fix) - TestIsCRDNotInstalledErr_StringFallback (regression guard for defense-in-depth detection) - Existing test reads updated to use rbacAssignNamespace instead of Namespace("") (no behavioural change — the fake dynamic client routes accurately now) Refs TBD-C4-fup Refs TBD-C6-006-followup Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 23:14:40 +04:00
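The string-fallback idea behind the defense-in-depth check can be sketched with the standard library alone. This is not the repository's `isCRDNotInstalledErr`: the helper names below are hypothetical, and only the two marker strings are taken from the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Marker strings quoted in the commit message.
var crdMissingMarkers = []string{
	"the server could not find the requested resource",
	"no matches for kind",
}

// messageSuggestsMissingCRD reports whether msg contains one of the markers.
func messageSuggestsMissingCRD(msg string) bool {
	for _, marker := range crdMissingMarkers {
		if strings.Contains(msg, marker) {
			return true
		}
	}
	return false
}

// errSuggestsMissingCRD applies the string check to an error's full message,
// so it still fires when wrapping has hidden the original typed status.
func errSuggestsMissingCRD(err error) bool {
	return err != nil && messageSuggestsMissingCRD(err.Error())
}

func main() {
	inner := errors.New("the server could not find the requested resource")
	wrapped := fmt.Errorf("create useraccess: %w", inner)

	fmt.Println(errSuggestsMissingCRD(wrapped))                          // true
	fmt.Println(errSuggestsMissingCRD(errors.New("connection refused"))) // false
}
```

String matching is brittle against upstream wording changes, which is presumably why the commit treats it as a fallback behind the typed check and not as the primary path.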
hatiyildiz	0293318a3a	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.175 -> 1.4.176 (auto, Refs TBD-A6)	2026-05-18 19:14:22 +00:00
github-actions[bot]	fbbf1b395f	deploy: update sme service images to `989328d` + bump chart to 1.4.176	2026-05-18 19:13:00 +00:00
hatiyildiz	da28ae6936	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.174 -> 1.4.175 (auto, Refs TBD-A6)	2026-05-18 19:12:31 +00:00
e3mrah	989328d7c3	fix(provisioning): write tenant commits to sme-tenants branch (Closes #1794 — 5th C18 layer) (#1801) Provisioning's per-tenant overlay commits no longer share the `main` branch with the cutover-gitea-mirror Job. The mirror runs every <=10 min and force-pushes upstream main into Gitea, clobbering every tenant commit that landed in between mirror ticks — the Organization CR never materialised and the customer journey hung at step 16 (live evidence on t22 2026-05-18: commit 69d64e48 at 17:46:13Z disappeared from Gitea main by the next mirror tick at 17:54:55Z). Fix: - New Flux GitRepository `openova-sme-tenants` tracks the dedicated `sme-tenants` branch (templates/sme-services/sme-tenants-gitrepository.yaml). - sme-tenants Flux Kustomization repointed at the new GitRepository (sme-tenants-kustomization.yaml) so the tenant reconcile loop reads from the protected branch. - Provisioning Deployment GITHUB_BRANCH default flipped to `sme-tenants` on Sovereign installs (Catalyst-Zero keeps `main` — no mirror Job exists there). Topology-aware default, operator-overridable. - Provisioning Go client (commitOnceContents) gains an auto-create-branch fallback so the first commit on a fresh Sovereign self-bootstraps the branch from `main` — no out-of-band seeding step. - Chart 1.4.174 -> 1.4.175. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 23:11:30 +04:00
github-actions[bot]	481bf60fd5	deploy: update catalyst images to `51be70e`	2026-05-18 19:06:31 +00:00
e3mrah	51be70e865	fix(catalyst-api): k8sCache.Factory periodic rescan of on-disk kubeconfigs + chroot self-register recovery (Refs 30-row matrix rows 9+27) (#1800 ) Two regressions caught on t22 (2026-05-18) by the 30-row matrix: row 9 /cloud/list?kind=nodes — only 1 cluster instead of 3 row 27 /dashboard/treemap (Layer=Region) — only 1 cell instead of 3 Root cause is two layered races on a fresh-prov Sovereign chroot, both invisible to PR #1705 / #1763's one-shot AddCluster path: (a) Pod restart with empty kubeconfigs PVC. The mothership's secondary-kubeconfig POST hook (deployment_handover_export.go) ONLY fires at handover. A catalyst-api Pod that restarts AFTER handover and BEFORE any operator re-trigger sees an empty /var/lib/catalyst/kubeconfigs/. LoadClustersFromDir at startup returns 0 entries, the Factory starts with sovereigns:0, and every /k8s/list response degrades silently to one cluster (the chroot itself, via resolveChrootClusterID's single-cluster fallback) — or zero on chroots where SOVEREIGN_FQDN env was also empty at start (race (b)). (b) sovereign-fqdn ConfigMap committed AFTER Pod start. On t22 the Pod started at 18:13:14 but the chart's sovereign-fqdn CM landed at 18:13:44 — 30s later. The Pod's SOVEREIGN_FQDN env stayed empty for the lifetime of the Pod (Reloader v1.4.16 does not reload env vars per a longstanding upstream limitation), so FactoryFromEnv's chroot self-register branch returned false. Logs confirmed: "k8scache: data plane started sovereigns=0". Fix: a periodic background goroutine (Factory.runKubeconfigsRescanLoop) that ticks every Config.RescanInterval (default 30s) and: 1. Walks Config.KubeconfigsDir for kubeconfigs whose stem isn't already a registered cluster ID and AddClusters each one. Cheap (one os.ReadDir per tick) and idempotent. 2. When Config.HomeCoreClient is set, reads the on-cluster sovereign-fqdn ConfigMap directly via the typed client and re-runs buildChrootClusterRef when fqdn is non-empty. 
Recovers from the configmap-race on the next tick after the CM commits, without needing a Pod restart. FactoryFromEnv now persists the resolved KubeconfigsDir + HomeCoreClient into the Config so the rescan loop reuses the same values without re-reading env. Defaults: rescan interval 30s; both branches are no-ops on the contabo mothership (KubeconfigsDir non-empty but no late-arriving kubeconfigs; no sovereign-fqdn CM so the ConfigMap GET returns not-found silently). Two new tests in k8scache_test.go: - TestFactory_RescanRegistersNewKubeconfigs: drops a kubeconfig AFTER Start, asserts the Factory registers it within 3s of the rescan tick. Reproduces the (a) regression in unit test form. - TestFactory_RescanOnce_IdempotentForKnownClusters: re-runs rescanOnce on a directory whose entries are already registered; asserts no double-register, no log spam. Operator-visible effect: post-handover Pod restarts on a multi-region Sovereign chroot self-heal within 30s instead of staying stuck at sovereigns:0 until manual operator re-POST. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 23:04:11 +04:00
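Step 1 of the rescan loop (register only kubeconfigs whose stem is not yet a known cluster ID) reduces to a set difference, sketched here as a pure function. This is illustrative only: `unregisteredStems` is a hypothetical name, the `.yaml` extension filter is an assumption, and the real loop reads `Config.KubeconfigsDir` with `os.ReadDir` where this sketch takes a list of names.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
	"strings"
)

// unregisteredStems returns the cluster IDs (file name minus ".yaml") present
// in fileNames but absent from registered. Once the returned IDs have been
// registered, a second call over the same directory returns nothing, which is
// what makes a periodic rescan idempotent.
func unregisteredStems(fileNames []string, registered map[string]bool) []string {
	var out []string
	seen := map[string]bool{}
	for _, name := range fileNames {
		ext := filepath.Ext(name)
		if ext != ".yaml" {
			continue // ignore anything that is not a kubeconfig file
		}
		stem := strings.TrimSuffix(name, ext)
		if stem == "" || registered[stem] || seen[stem] {
			continue
		}
		seen[stem] = true
		out = append(out, stem)
	}
	sort.Strings(out)
	return out
}

func main() {
	registered := map[string]bool{"dep1-hel1-1": true}
	onDisk := []string{"dep1-hel1-1.yaml", "dep1-nbg1-2.yaml", "notes.txt"}

	fresh := unregisteredStems(onDisk, registered)
	fmt.Println(fresh) // [dep1-nbg1-2]

	for _, id := range fresh {
		registered[id] = true // stands in for the AddCluster call
	}
	fmt.Println(len(unregisteredStems(onDisk, registered))) // 0
}
```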
github-actions[bot]	5f741f11a8	deploy: bump sandbox-controller image to `4f957c3`	2026-05-18 18:56:25 +00:00
github-actions[bot]	3ecfc465f0	deploy: bump sandbox-mcp-server image to `4f957c3`	2026-05-18 18:55:00 +00:00
e3mrah	4f957c3db2	fix(sandbox): per-Sandbox disable idle-scaling (Closes #1725) (#1797) Wave 35 D8b: add `spec.idleScaling.enabled` to the Sandbox CR so long-running agent workloads (idle-for-hours-then-resume) can opt out of the cluster-wide idle scaler. Renderer stamps `openova.io/sandbox-idle-scaling-disabled=true` on the pty-server StatefulSet when enabled=false. IdleScaler skips any StatefulSet carrying that annotation: no probe, no last-activity stamp, no scale-to-zero decision. Default behaviour (CR field omitted OR enabled=true) preserves the existing tier-cap economics so the free/pro paths still scale to 0 after the timeout window. Refs WBS row TBD-D8b in openova-private/docs/WBS.md. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 22:53:00 +04:00
hatiyildiz	2c1e628fb3	deploy(bp-newapi): bump bootstrap-kit pin 1.4.17 -> 1.4.18 (auto, Refs TBD-A6)	2026-05-18 18:31:05 +00:00
github-actions[bot]	ddcab439c8	deploy: bump bp-newapi upstream v0.13.2 chart 1.4.18	2026-05-18 18:30:18 +00:00
e3mrah	6c4d660058	fix(bp-newapi): dedicated HTTPRoute for newapi.<fqdn> (Closes #1778) (#1796) Sandbox runtimes hit the LLM gateway at the URL the sandbox controller mints into their environment: NEWAPI_BASE_URL=https://newapi.<sovereign-fqdn>/v1 On a Sovereign with the Catalyst marketplace enabled, the catalyst chart ships a `tenant-wildcard` HTTPRoute (hostnames=`*.<fqdn>`) that backend-refs to the `console` Service in the `sme` namespace. Without a dedicated HTTPRoute for `newapi.<fqdn>`, every Sandbox request to the LLM gateway got absorbed by the wildcard and 502'd at the storefront, blocking the entire BYOS Claude Code journey (TBD-D35d). Fix: add `templates/httproute.yaml` + `ingress.httpRoute` values block to bp-newapi. The HTTPRoute lives in the `newapi` namespace (same as the Service backend) so no cross-namespace ReferenceGrant is required; Gateway API hostname-matching prefers the most specific listener, so an exact `newapi.<fqdn>` HTTPRoute outranks the `*.<fqdn>` wildcard without modifying the marketplace template. Bootstrap-kit slot 80 overlay flips `ingress.httpRoute.enabled=true` and supplies `host: newapi.${SOVEREIGN_FQDN}` so the route materialises on every Sovereign install. Default OFF for contabo-style Traefik clusters (unchanged behaviour). - platform/newapi/chart/templates/httproute.yaml: new template, gated on `newapi.enabled && ingress.httpRoute.enabled` AND a resolvable hostname (explicit `ingress.httpRoute.host` OR derived from `sovereignFQDN`). - platform/newapi/chart/values.yaml: new `ingress.httpRoute` block, default OFF. - platform/newapi/chart/Chart.yaml: version 1.4.16 → 1.4.17. - clusters/_template/bootstrap-kit/80-newapi.yaml: pin 1.4.16 → 1.4.17, values now enable `ingress.httpRoute` with host `newapi.${SOVEREIGN_FQDN}`. helm template smoke (all four scenarios pass): - default values → 0 HTTPRoutes rendered (chart safe for Traefik installs).
- httpRoute.enabled + sovereignFQDN → 1 HTTPRoute, hostname `newapi.<sovereignFQDN>`. - httpRoute.enabled + explicit host → 1 HTTPRoute with that host. - httpRoute.enabled, neither host nor sovereignFQDN → 0 HTTPRoutes (skip-render guard). Closes #1778 Refs WBS row TBD-D35d in openova-private/docs/WBS.md. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 22:29:26 +04:00
github-actions[bot]	dfff1a695a	deploy: update catalyst images to `2964930`	2026-05-18 17:58:18 +00:00
e3mrah	29649309da	fix(marketplace): resolve _template path on chroot (Closes #1790) (#1792) The marketplace settings handler hardcoded clusters/<sovereignFQDN>/bootstrap-kit/13-bp-catalyst-platform.yaml. That path exists in the openova-io/openova mothership repo (the provisioner carves out a per-FQDN subtree per Sovereign) but NOT in the chroot-local Gitea repo, which only carries the canonical clusters/_template/bootstrap-kit/ subtree (see openova_flow_proxy.go, phase1_watch.go, sme_tenant_gitops.go which all reference clusters/_template/bootstrap-kit/...). Wave 34 v2 cov-bench surfaced this: PR #1779 wired GITOPS_TOKEN through to the chroot Pod, the marketplace toggle now reaches Gitea, and the Gitea push fails with 500 "no such file or directory" because the overlay path is wrong for the chroot's repo layout. Fix: introduce resolveBootstrapKitDir(sovereignFQDN) which picks clusters/_template/bootstrap-kit when SOVEREIGN_FQDN env is set (the canonical "we are running on a chroot Pod" signal used across this package; see auth_handover.go, deployments.go, jobs.go, rbac_matrix.go) and clusters/<sovereignFQDN>/bootstrap-kit otherwise. A CATALYST_BOOTSTRAP_KIT_PATH env overrides both, per INVIOLABLE-PRINCIPLES.md #4 (never hardcode a path that a future repo re-layout would force a code ship). Regression test TestResolveBootstrapKitDir covers all four detection paths (mother / chroot / whitespace-treated-as-unset / runtime override). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-18 21:56:11 +04:00
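The precedence described in 29649309da (override env, then chroot signal, then per-FQDN path) can be sketched as below. This is not the shipped function: the extra `getenv` parameter is an assumption added so the sketch is testable without mutating the process environment, and trimming whitespace on both variables is a guess at what "whitespace-treated-as-unset" covers.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// resolveBootstrapKitDir returns the bootstrap-kit directory to write into.
// Order of precedence, as described in the commit message:
//  1. CATALYST_BOOTSTRAP_KIT_PATH, when set, wins outright.
//  2. A non-empty SOVEREIGN_FQDN env means "running on a chroot Pod", so the
//     canonical _template subtree is used.
//  3. Otherwise the per-FQDN subtree of the mothership repo is used.
//
// getenv is injected (hypothetical signature) instead of calling os.Getenv.
func resolveBootstrapKitDir(sovereignFQDN string, getenv func(string) string) string {
	if override := strings.TrimSpace(getenv("CATALYST_BOOTSTRAP_KIT_PATH")); override != "" {
		return override
	}
	if strings.TrimSpace(getenv("SOVEREIGN_FQDN")) != "" {
		return "clusters/_template/bootstrap-kit"
	}
	return "clusters/" + sovereignFQDN + "/bootstrap-kit"
}

func main() {
	// With the real process environment the result depends on where this runs.
	fmt.Println(resolveBootstrapKitDir("t22.omantel.biz", os.Getenv))
}
```

Passing the lookup function in keeps the four detection paths (mother, chroot, whitespace, override) expressible as table rows over a plain map.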
hatiyildiz	e2a3d46b66	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.173 -> 1.4.174 (auto, Refs TBD-A6)	2026-05-18 17:52:29 +00:00
github-actions[bot]	7a7d5b4574	deploy: update sme service images to `5c71fb8` + bump chart to 1.4.174	2026-05-18 17:51:49 +00:00
e3mrah	5c71fb8f61	fix(catalyst-api+catalog): SME bridge token for publish toggle + chroot RBAC assign wrapper (#1789) Closes #1735 Closes #1739 C4-012 / #1735 (Publish toggle 401): - chroot's smeCatalog.SetPublished sent no Authorization header, so catalog.sme's JWTAuth middleware rejected with 401. Mint the canonical SME bridge token in HandleSovereignAppPublish (mirrors sme_billing_vouchers.go::mintSMEBridgeToken) and forward as Bearer. - catalog requireAdmin now accepts sovereign-admin role (in addition to superadmin) so franchisee operators can manage their own Sovereign's catalog per docs/FRANCHISE-MODEL.md §3; without this, the bridge token's sovereign-admin role would still 403. - SetPublished now sends published state via ?value=true|false query param (matches the SME catalog's SetAppPublished route shape) rather than a JSON body the upstream ignores. C6-006 / #1739 (RBAC assign 500): - Add HandleSovereignRBACAssign at POST /api/v1/sovereign/rbac/assign, the chroot-friendly mirror of /api/v1/sovereigns/{id}/rbac/assign (resolves deployment id via resolveSovereignDeploymentID, mirroring HandleSovereignRBACMatrix). Extracts the existing handler body into serveRBACAssign so both surfaces share the same wire contract. - Surface CRD-not-installed (apierrors.IsNotFound) from the dynamic create as 503 + sovereign-cluster-unavailable instead of a generic 500 rbac-assign-failed; the previous shape hid the real chart-gap behind a misleading 500. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-18 21:50:24 +04:00
github-actions[bot]	3785f3aa4a	deploy: update catalyst images to `e2c56c3`	2026-05-18 17:42:13 +00:00
e3mrah	e2c56c3811	fix(catalyst-api): mint HS256 bridge token for sovereign app publish proxy (Closes #1735) (#1787) The chroot proxy at /api/v1/sovereign/apps/{slug}/publish forwards to the SME catalog at http://catalog.sme.svc.cluster.local:8082's PATCH /catalog/admin/apps/{slug}/publish endpoint. The pre-fix code sent NO Authorization header at all, so: 1. core/services/catalog/main.go's JWTAuth middleware (line 77, applied to every /catalog/admin/* path) rejected the request with 401 BEFORE the handler ran ("missing or invalid authorization header"). 2. Even with a header, requireAdmin (core/services/catalog/handlers/handlers.go:21) would reject any caller without role="superadmin". Result: every Publish toggle click in the Sovereign Console surfaced as "sme-catalog-rejected upstream returned 401" with no actionable hint: the operator could not toggle marketplace visibility for any app on a production Sovereign. Fix: mint a fresh HS256 bridge token via the existing h.mintSMEBridgeToken helper (the same one sme_billing_vouchers.go's proxySMEVoucher uses for the BSS Vouchers surface) and forward it as the upstream Authorization header. The helper signs the token with sme-secrets/JWT_SECRET, the same secret the SME catalog Pod loads from its JWT_SECRET env (per products/catalyst/chart/templates/sme-services/catalog.yaml:40-44). Operators with `catalyst-owner` realm-role (per shared/auth.SMERoleFor) get role="superadmin" in the bridge token, satisfying requireAdmin upstream. - Adds a `bearer` parameter to smeCatalogClient.SetPublished. - HandleSovereignAppPublish mints the bridge token BEFORE the upstream round-trip so an unwired bridge (Sovereign without marketplace, stale chart predating the reflector annotation on sme-secrets) surfaces 503 sme-jwt-bridge-unwired rather than the pre-fix silent 401. - Per docs/INVIOLABLE-PRINCIPLES.md #10 the token is NEVER logged. Verified: build + go test ./internal/handler/ pass.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 21:39:40 +04:00
hatiyildiz	ebe9a5c1a2	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.172 -> 1.4.173 (auto, Refs TBD-A6)	2026-05-18 17:33:15 +00:00
github-actions[bot]	b19d64f3f6	deploy: update sme service images to `3d06db5` + bump chart to 1.4.173	2026-05-18 17:32:36 +00:00
e3mrah	3d06db5625	fix(provisioning): use Gitea contents API for git writes (Closes #1781, 4th C18 layer) (#1786) Journey v4 Wave 33 (retry): after #1712 fixed the singular→plural `/git/refs/` path, provisioning's NEXT call landed on `POST /repos/.../git/blobs → 404`. Gitea 1.22.3 simply does not implement the GitHub Git Data WRITE API (`POST /git/blobs`, `POST /git/trees`, `POST /git/commits`, `PATCH /git/refs/...`). All four return 404. Only the READ side (`GET /git/refs/...`, `GET /git/commits/...`, `GET /git/trees/...?recursive=1`) is supported by Gitea. This is the last blocker in the customer marketplace journey: steps 14→16→17 (Org CR + vCluster + WordPress) all stall on this single 404. Fix: - New `commitOnceContents` path that batches creates/updates/deletes into one `POST /repos/{owner}/{repo}/contents` (Gitea ≥ 1.21 ChangeFiles endpoint). Files are base64-encoded; updates carry the existing blob SHA sourced from the recursive tree listing (which IS supported on Gitea). - New `targetsGitea()` predicate: when `APIURL != ""` (Sovereign in-cluster Gitea), `commitOnce` routes through the contents API. When empty (upstream github.com / contabo path), it keeps the original Git Data blob+tree+commit+updateRef dance untouched, because upstream GitHub does NOT expose a batch ChangeFiles endpoint, so we must not unconditionally switch. - `isFastForwardRejection` extended to recognise Gitea's branch-moved wording (409 / "branch has been changed" / "stale base"), so the existing outer retry loop in `CommitFilesWithPruneAndRebuild` keeps working across both backends. - Prune semantics preserved: any blob under a managed prefix that's not in the files map becomes a delete op in the same batch. Test coverage: - `TestCommitFiles_GiteaTarget_UsesContentsAPI` asserts the new path POSTs to `/repos/.../contents` and never touches `/git/blobs|trees|commits` or `PATCH /git/refs/...`.
- `TestCommitFiles_GiteaTarget_UpdateUsesExistingSHA` asserts updates carry the existing blob SHA (Gitea 422s without it). - `TestCommitFiles_UpstreamTarget_KeepsGitDataAPI` pins the upstream Git Data API path so the Gitea fork doesn't accidentally also fire on api.github.com. API before vs after: Before (Gitea path): POST /repos/{o}/{r}/git/blobs → 404; POST /repos/{o}/{r}/git/trees → 404; POST /repos/{o}/{r}/git/commits → 404; PATCH /repos/{o}/{r}/git/refs/h/main → 404. After (Gitea path): POST /repos/{o}/{r}/contents {"branch":"main","message":"...","files":[ {"operation":"create|update|delete","path":"...","content":"<b64>","sha":"<existing-sha-if-update>"} ]} Refs TBD-C18d, WBS Wave 33 retry. Closes #1781. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 21:31:12 +04:00
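The batched request body that 3d06db5625 describes can be sketched as follows. The JSON field names come from the "After (Gitea path)" payload in the commit message; the Go type names, `buildChangeFiles`, and its parameters are hypothetical, and no HTTP call is made here.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// fileOp mirrors one entry of the "files" array shown in the commit message.
type fileOp struct {
	Operation string `json:"operation"`
	Path      string `json:"path"`
	Content   string `json:"content,omitempty"` // base64; omitted for deletes
	SHA       string `json:"sha,omitempty"`     // existing blob SHA for update/delete
}

type changeFilesRequest struct {
	Branch  string   `json:"branch"`
	Message string   `json:"message"`
	Files   []fileOp `json:"files"`
}

// buildChangeFiles turns a desired file set into one batch.
// desired maps path to content; existing maps path to blob SHA, as it could be
// read from a recursive tree listing. Paths under managedPrefix that are no
// longer desired become delete ops (the prune semantics from the commit).
// Note: an empty managedPrefix matches every path, so it prunes everything.
func buildChangeFiles(branch, message string, desired map[string][]byte, existing map[string]string, managedPrefix string) changeFilesRequest {
	req := changeFilesRequest{Branch: branch, Message: message}
	for path, content := range desired {
		op := fileOp{Operation: "create", Path: path, Content: base64.StdEncoding.EncodeToString(content)}
		if sha, ok := existing[path]; ok {
			op.Operation, op.SHA = "update", sha
		}
		req.Files = append(req.Files, op)
	}
	for path, sha := range existing {
		if _, keep := desired[path]; !keep && strings.HasPrefix(path, managedPrefix) {
			req.Files = append(req.Files, fileOp{Operation: "delete", Path: path, SHA: sha})
		}
	}
	// Map iteration order is random; sort so the batch is deterministic.
	sort.Slice(req.Files, func(i, j int) bool { return req.Files[i].Path < req.Files[j].Path })
	return req
}

func main() {
	req := buildChangeFiles("main", "sync tenant manifests",
		map[string][]byte{"tenants/acme/org.yaml": []byte("kind: Organization\n")},
		map[string]string{"tenants/acme/stale.yaml": "0123abcd"},
		"tenants/acme/")
	body, err := json.MarshalIndent(req, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Sorting the operations is a choice of this sketch, made so that retries after a branch-moved rejection produce byte-identical bodies.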
hatiyildiz	39c8464554	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.171 -> 1.4.172 (auto, Refs TBD-A6)	2026-05-18 17:30:37 +00:00
e3mrah	be1ad96f43	feat(security): baseline CNPs for cilium-gateway + catalyst-system namespaces (Closes #1746) (#1785) Cov-bench confirmed only 2 CNPs cluster-wide and zero in either critical namespace. WBS row C12-009 (TBD-Cov-12) fails until baseline coverage lands. Ship two namespaced CiliumNetworkPolicies under products/catalyst/chart/templates/network-policies/: - baseline-default-deny in catalyst-system: default-deny with explicit allow for cilium-gateway ingress + same-namespace + kubelet host probes; egress to kube-apiserver / kube-dns / same-namespace / 14 platform namespaces + world TCP/443. - baseline-cilium-gateway-allow in kube-system: scoped to the reserved:ingress endpoint, namespaced equivalent of the qaFixtures allow-gateway-world-ingress CCNP. Both CNPs mirror the working bp-external-dns-apiserver + qa-fixtures patterns (toEntities/reserved.ingress selectors, label conventions, operator-tunable allow lists). Bundle is helm-gated on .Values.security.baselineCnp.enabled (default true) and independent of qaFixtures so it ships on every Sovereign. Platform-namespace allow list tunable via .Values.security.baselineCnp.allowedPlatformNamespaces. Chart bump to 1.4.171. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 21:29:51 +04:00
github-actions[bot]	202bda36bf	deploy: update catalyst images to `b3b0539`	2026-05-18 17:27:30 +00:00
e3mrah	b3b05391ac	fix(sovereign-tls): include all parent_domains in Gateway listeners (Closes #1772) (#1784) Wave 32 D27-D31 verifier on t22 found tfvars carrying parent_domains: [{omantel.biz, primary}, {omani.homes, sme-pool}] but the live Cilium Gateway advertising only *.t22.omantel.biz; *.omani.homes never rendered as a listener, so every sme-pool tenant hit the envoy default fallback cert. Root cause: writeTfvars emitted the structural `parent_domains` JSON array but never set `parent_domains_yaml`, the YAML-string variable that infra/hetzner/variables.tf declares and that infra/hetzner/main.tf locals.parent_domains_decoded actually yamldecode()s to derive the listener pool. With the variable empty, the terraform local fell through to the single-zone fallback `[{name: "<sovereign_fqdn>", role: "primary"}]` and every sme-pool zone the operator added was silently dropped from the Gateway listener list. Fix: writeTfvars now renders parent_domains_yaml as a JSON-flow array literal (`[{"name":"x","role":"y"},...]`) carrying every parent_domains entry. JSON is a subset of YAML flow syntax, so yamldecode() reads it natively. Empty ParentDomains still emits "" so the single-zone fallback (derived from sovereign_fqdn) keeps working for legacy payloads. Day-2 re-trigger note: AddParentDomain persists the new entry to dep.Request.ParentDomains so a subsequent provisioner.Provision re-write picks up the updated literal. The hcloud_server's user_data has no `ignore_changes` so an existing Sovereign cannot get the new listener via tofu apply (would request destructive recreate); the handler now logs an operator hint pointing at the live Sovereign's Kustomization sovereign-tls postBuild.substitute.PARENT_DOMAINS_LISTENERS_YAML field. Tests: - TestWriteTfvars_EmitsParentDomainsYAMLForSMEPool: regression guard for the exact t22 scenario (primary + sme-pool). - TestWriteTfvars_EmitsParentDomainsYAMLEmptyOnSingleZone: fallback path preserved for legacy single-zone payloads.
- TestParentDomainsYAMLLiteral_RoundTripsCleanly: table-driven unit test (lowercasing, role defaulting, JSON-flow shape). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 21:25:07 +04:00
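A sketch of how the `parent_domains_yaml` literal from b3b05391ac might be rendered. The literal shape and the empty-input behaviour come from the commit message; the type name, the function signature, and the choice of "primary" as the default role are assumptions (the commit only says roles are defaulted).

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// parentDomain is a hypothetical stand-in for one parent_domains entry.
type parentDomain struct {
	Name string `json:"name"`
	Role string `json:"role"`
}

// parentDomainsYAMLLiteral renders every entry as one JSON array on a single
// line. JSON is valid YAML flow syntax, so a YAML decoder can read the result.
// An empty input yields "" so a caller can fall back to a single-zone default.
func parentDomainsYAMLLiteral(domains []parentDomain) (string, error) {
	if len(domains) == 0 {
		return "", nil
	}
	normalised := make([]parentDomain, 0, len(domains))
	for _, d := range domains {
		role := strings.TrimSpace(d.Role)
		if role == "" {
			role = "primary" // assumed default; not stated in the commit
		}
		normalised = append(normalised, parentDomain{
			Name: strings.ToLower(strings.TrimSpace(d.Name)),
			Role: role,
		})
	}
	out, err := json.Marshal(normalised)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	literal, err := parentDomainsYAMLLiteral([]parentDomain{
		{Name: "OmanTel.biz"},
		{Name: "omani.homes", Role: "sme-pool"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(literal)
	// prints: [{"name":"omantel.biz","role":"primary"},{"name":"omani.homes","role":"sme-pool"}]
}
```

Emitting through `json.Marshal` rather than string concatenation means quoting and escaping of domain names never has to be handled by hand.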
e3mrah	998fa67e41	fix(tenant+sandbox): wire K8s client SA + NEWAPI_DEFAULT_CHANNELS default (Closes #1775, #1777) (#1783) Wave 32 D35 verifier caught two adjacent Sandbox-plane bugs on t26: TBD-D35a (#1775): tenant service hosts the SandboxOrchestrator (core/services/tenant/handlers/sandbox_consumer.go) which materialises Sandbox.sandbox.openova.io CRs on every tenant.sandbox_requested event. main.go buildDynamicClient logs `sandbox-orchestrator: kubernetes client unavailable — orchestrator disabled` and silently skips the consumer because the tenant SA carries automountServiceAccountToken=false (zero blast-radius default from #76) AND no Role grants verbs on sandbox.openova.io. Fix: flip the flag to true on both the SA + the pod spec, plus a narrow Role + RoleBinding granting get + create on sandboxes.sandbox.openova.io scoped to the catalyst-system namespace (handlers.DefaultSandboxNamespace). Verbs match what the orchestrator actually exercises against the dynamic.Interface (Get for idempotency pre-check, Create for CR materialisation); a leaked tenant SA token still cannot patch/delete Sandbox CRs or touch any other CRD group. TBD-D35c (#1777): sandbox-controller fails per-Sandbox token mint with NoAllowedChannels (sandbox_controller.go:191) because the NEWAPI_DEFAULT_CHANNELS env defaulted to "" in platform/sandbox/chart/values.yaml and bootstrap-kit slot 19a never wired an envsubst placeholder. Fix: default chart value to "qwen" (the only channel alias bp-newapi channel-seed-job.yaml writes on a fresh Sovereign install, an alias for qwen3.6-bankdhofar per products/sandbox/docs/newapi-proxy-contract.md §2), AND add `${SANDBOX_DEFAULT_CHANNELS:-qwen}` to slot 19a so per-Sovereign overlays can extend without forking the chart (e.g. SANDBOX_DEFAULT_CHANNELS=qwen,anthropic,openai). Chart bump 1.4.170 → 1.4.171 + bootstrap-kit pin 13-bp-catalyst-platform.yaml 1.4.170 → 1.4.171.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 21:23:28 +04:00
github-actions[bot]	7d8d99d3c7	deploy: update catalyst images to `5bb9032`	2026-05-18 17:17:47 +00:00
e3mrah	5bb903275d	fix(catalyst-ui): Mothership auto-redirect on ?token= to Sovereign handover (Closes #1773) (#1782) # Problem (DoD gate D0, founder's #1 pinned gate per feedback_handover_redirect_is_critical_d0.md) When the operator lands on `console.openova.io/sovereign/jobs?token=<JWT>` (via fresh tab from the wizard SuccessPage, share-link, browser history), the Mothership UI used to render its own Jobs page and strand the operator there. The bundle had ZERO references to `mint-handover-token`, `redirectURL`, or any `?token=` handler. Verified live on t22 chart 1.4.168 (Wave 32 evidence): 1. POST /sovereign/api/v1/deployments/{id}/mint-handover-token returns { redirectURL, token } as expected. 2. Navigating to console.openova.io/sovereign/jobs?token=<JWT> stays on Mothership and never redirects to console.t22.omantel.biz/auth/handover. Without this redirect, every other DoD gate is invisible to the operator (memory: "the fucking successful handover is still not there ... end user is not even aware if the sovereign environment is provisioned"). # Fix New module `shared/lib/mothershipTokenRedirect.ts` runs at bootstrap BEFORE the router, fetch interceptor, or DOM render: 1. Only fires on Mothership host (console.openova.io). 2. Reads `?token=<JWT>` from window.location.search. 3. Decodes the JWT payload (no signature verification; the Sovereign-side /auth/handover does full RS256 verify + aud-binding). 4. Extracts the `aud` claim. Per catalyst-api/handover_jwt.go, aud is `["https://console.<sovereignFqdn>"]` (array) or string form. 5. Constructs `https://console.<sovereignFqdn>/auth/handover?token=<JWT>` and `window.location.replace()` to it. 6. Self-loop guard: refuses to redirect if aud points back at the Mothership. `main.tsx` calls `runMothershipTokenRedirect()` first; if it returns true the rest of bootstrap is skipped (avoids Mothership UI flash during the hard-nav).
# Tests `mothershipTokenRedirect.test.ts` — 18 unit tests covering the pure decision function: - aud as array vs string vs missing - chroot URL extraction (https-only, console.<host>, self-loop guard) - JWT preservation across redirect (no claim mutation) - Mothership host gate (no-op on Sovereign / dev hosts) - malformed-JWT no-op - missing-?token= no-op All 18 tests pass. tsc + eslint clean. Pre-existing unrelated test failures in StepComponents.test.tsx (CORTEX cascade) verified to also fail on origin/main without these changes. Refs: feedback_handover_redirect_is_critical_d0.md, Wave 32 evidence, GitHub issue #1773. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 21:15:41 +04:00


2464 Commits