* fix(tls): cilium-gateway-cert STAGING/PROD issuer selectable via tofu

clusters/_template/sovereign-tls/cilium-gateway-cert.yaml hardcoded
letsencrypt-dns01-prod-powerdns regardless of qa_test_session_enabled.
On high-cadence QA reprov cycles this hits the LE PROD 5/168h rate
limit (caught on prov #76 at 13:45 UTC, retry-after 16:49 UTC) and the
wildcard Certificate sticks Ready=False — Cilium Gateway has no valid
TLS secret → envoy listener never binds → public TLS handshake to
console.<fqdn> dies with SSL_ERROR_SYSCALL.

Add tofu local.wildcard_cert_issuer = qa_test_session_enabled ?
staging : prod. Thread WILDCARD_CERT_ISSUER through the sovereign-tls
Kustomization postBuild.substitute. cilium-gateway-cert.yaml references
it as ${WILDCARD_CERT_ISSUER}.

Default behaviour unchanged for non-QA (production) Sovereigns — they
still resolve to letsencrypt-dns01-prod-powerdns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cilium-gateway): allow world ingress to Cilium Gateway reserved:ingress endpoint

When Cilium Gateway API runs with gatewayAPI.hostNetwork.enabled=true
and a default-deny CCNP is present, every public request to a Sovereign
host (console, auth, gitea, registry, api, ...) hits the gateway
listener and gets DENIED at envoy's cilium.l7policy filter with:

  cilium.l7policy: Ingress from 1 policy lookup for endpoint X for port 30443: DENY

Public response: HTTP/1.1 403 Forbidden, body "Access denied",
server: envoy.

Root cause: Cilium creates a special endpoint with identity
reserved:ingress (8) representing the gateway listener. By default this
endpoint has policy-enabled=both with allowed-ingress-identities=[1
(host)] and empty L4 rules — so no port is permitted. The default-deny
CCNP's NotIn-namespace endpointSelector does NOT cover this endpoint
(it has no io.kubernetes.pod.namespace label), and our qa-fixtures
didn't ship a matching allow-template for it.

Net effect: TLS handshake succeeds, HTTPRoutes are Programmed, backends
are healthy in-cluster, but every request 403s.

Caught live on prov #80 (omantel.biz, 2026-05-14) after the Gateway
hostNetwork fix (#1480) finally activated host-bind on :30443.

Verified by:
- envoy debug log: cilium.l7policy DENY for endpoint 10.42.0.201 port 30443
- cilium-dbg endpoint get 3282 -o json: l4.ingress: [] and
  allowed-ingress-identities: [1]
- transiently applying the same CCNP via kubectl: console.omantel.biz → 200

Fix: ship a CCNP scoped to reserved:ingress that allows ingress from
world, cluster, host, remote-node (multi-region CP-to-CP), and
kube-apiserver, plus egress to all so envoy can forward to any backend
service. This is the canonical Cilium hostNetwork Gateway-API
zero-trust pattern.

Chart bump: catalyst 1.4.142 → 1.4.143.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
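
A minimal sketch of the policy shape the second commit describes, for
orientation only. The metadata name is hypothetical and the layout is an
assumption; the template actually shipped in the chart may differ. It
selects the reserved:ingress endpoint and lists the entities named in the
Fix paragraph.

```yaml
# Hypothetical sketch: name and layout are assumptions, not the chart's template.
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-gateway-reserved-ingress   # hypothetical name
spec:
  # Select the special endpoint Cilium creates for the gateway listener.
  endpointSelector:
    matchExpressions:
      - key: reserved:ingress
        operator: Exists
  # Entities listed in the commit message's Fix paragraph.
  ingress:
    - fromEntities:
        - world
        - cluster
        - host
        - remote-node
        - kube-apiserver
  # Egress to all so envoy can forward to any backend service.
  egress:
    - toEntities:
        - all
```

Because the default-deny CCNP selects endpoints by namespace label and this
endpoint has none, a separate policy keyed on the reserved identity is what
reaches it.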
2052 lines
117 KiB
YAML
apiVersion: v2
name: bp-catalyst-platform
# 1.4.138 (qa-loop iter-1 Fix #138, prov #20 wedge — circular-dep
# post-install hook):
#
# Symptom (prov #20, 1ae1dbcbc9e3c3d7, 2026-05-11):
#   bp-catalyst-platform HR stuck Reconciling → InstallFailed →
#   "post-install: timed out waiting for the condition" after 15m.
#   Helm remediation triggers cleanupOnFail + rollback → loop forever.
#   prov #20 wedged at phase1-failed.
#
# Root cause (canonical-seam map):
#   The qa-fixtures stack ships two post-install Jobs that depend on
#   resources provided by bootstrap-kit slots that depend on this HR
#   being Ready. Circular dependency in the bootstrap-kit DAG:
#
#   • templates/qa-fixtures/cnpg-clusters-qa.yaml :: qa-cnpg-backup-s3-seed
#     waits for `seaweedfs/seaweedfs-s3-secret`. bp-seaweedfs is
#     bootstrap-kit slot 18; it doesn't even start until slot 13
#     (this HR) is Ready. Job's 120s poll fails → exponential backoff
#     (10s/20s/40s/.../1280s, total ~21 min) blows past the 15m
#     Helm install timeout.
#
#   • templates/qa-fixtures/cnpg-clusters-qa.yaml :: qa-cnpg-status-seed
#     waits 8 min (240×2s) for CNPG Cluster CR controller-side reconcile.
#     Same chart-self-dependency — adds another long wait window inside
#     the install timeout budget.
#
#   This is documented in the 1.4.134 changelog (Fix #114) as a known
#   wedge class but never closed: *"qa-cnpg-backup-s3-seed post-install
#   hook stalls 15m"*. Fix #114 patched the symptom (qa-finalizer-strip
#   pre-install Job to break the rollback-orphan finalizer deadlock) but
#   not the root cause (the circular dep itself).
#
# Fix:
#   Drop helm.sh/hook annotations on both Jobs so they become regular
#   release resources. Helm applies them with `disableWait: true` on the
#   HR (already set) without waiting for completion. The Jobs run their
#   wait loops concurrently with bp-seaweedfs / bp-cnpg in later slots;
#   once the upstream resources materialise, the Jobs complete naturally.
#   bp-catalyst-platform HR reaches Ready within ~5 min (the actual chart
#   install time) instead of timing out at 15 min.
#
# Side benefits:
#   - cluster-primary's barman-cloud retries its S3 connection until
#     qa-cnpg-backup-s3 Secret is present (CNPG operator behaviour).
#   - qa-cnpg-status-seed wait extended (no longer constrained by Helm
#     timeout) — ScheduledBackup runs succeed once the Pods land.
#   - Per INVIOLABLE-PRINCIPLES #4 the new wait window is operator-
#     overridable via qaFixtures.s3SeedWaitIterations (default 900 ≈
#     30 min at 2s/iter).
#
# Verification path:
#   prov #21 (next bounded-cycle re-provision) — bp-catalyst-platform HR
#   should reach Ready=True within 8 min of dependsOn slots flipping
#   Ready, instead of failing post-install at 15 min.
#
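# Illustrative sketch (not part of the shipped chart): a per-Sovereign
# values overlay that shortens the seed wait window, assuming only the
# key path qaFixtures.s3SeedWaitIterations named above and the stated
# 2s/iter poll interval. 450 iterations is roughly 15 min.
#
#   qaFixtures:
#     s3SeedWaitIterations: 450
#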
# 1.4.137: deploy-bot auto-bump (no chart-template changes).
#
# 1.4.136 (qa-loop bounded-provision-cycle Fix #123, LE rate-limit
# bypass via staging ClusterIssuer for QA Sovereigns):
#
# Root cause (iter-1 wedge, 2026-05-10):
#   Let's Encrypt production hit the 5-certs/168h rate limit on
#   `*.omantel.biz` (retry after 2026-05-11 22:08 UTC). Cilium-envoy
#   could not get a wildcard cert → console.omantel.biz TLS handshake
#   failed → iter-1 Test Executor could not run. Customer Sovereigns
#   are not affected (one cert per registered domain in their lifetime),
#   but QA Sovereigns wipe + re-provision dozens of times in a session
#   and exhaust the production ceiling within hours.
#
# Fix:
#   - bp-cert-manager-powerdns-webhook 1.1.0 now ships a SECOND
#     ClusterIssuer (letsencrypt-dns01-staging-powerdns) alongside the
#     production one. Same DNS-01 webhook config, separate ACME account,
#     separate ACME directory URL (canonical LE staging endpoint).
#     Production rate limit is wholly independent of staging.
#   - This chart adds `wildcardCert.useStaging` (bool, default false).
#     When true, sovereign-wildcard-certs.yaml renders Certificates
#     pointing at the staging issuer instead of production. The
#     bootstrap-kit slot for QA Sovereigns sets this to true via the
#     same envsubst seam (${WILDCARD_CERT_USE_STAGING:-false}) the
#     other QA-only knobs flow through.
#   - cilium-envoy then gets a staging-signed wildcard cert in <2 min.
#     `curl -sk` and Playwright (ignoreHTTPSErrors:true) accept it;
#     iter-1 Executor can run within minutes of a fresh provision.
#
# Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), the issuer
# name is fully values-overridable — operators that wire a private
# staging ACME (e.g. internal Smallstep CA) override the issuer
# alongside the bp-cert-manager-powerdns-webhook staging URL without
# touching this chart.
#
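# Illustrative sketch (not part of the shipped chart): the values a QA
# Sovereign overlay would set to opt in to the staging issuer, assuming
# only the `wildcardCert.useStaging` key named above. The key that
# overrides the issuer name itself is not shown because this changelog
# does not name it.
#
#   wildcardCert:
#     useStaging: true   # default is false (production issuer)
#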
# 1.4.135 (qa-loop bounded-provision-cycle Fix #119, sanitize illegal
# `/` in qa-fixtures Continuum mirror label value — unblocks prov #11):
#
# Root cause (prov #10 wedge, 2026-05-10):
#   The platform-mirror Continuum CR (added by Fix #102, PR #1326)
#   in `templates/qa-fixtures/continuum-qa.yaml` carried label
#   `openova.io/continuum-mirror-of: <namespace>/<name>` which renders
#   to `qa-omantel/cont-omantel`. K8s rejects label VALUES containing
#   `/` (the regex `^[a-z0-9A-Z]([-_.a-z0-9A-Z]*[a-z0-9A-Z])?$`
#   forbids `/` — only label KEYS may use it as the prefix separator).
#   Helm install of bp-catalyst-platform crashes on CR validation:
#     Continuum.dr.openova.io "cont-omantel" is invalid:
#     metadata.labels: Invalid value: "qa-omantel/cont-omantel": a
#     valid label must be an empty string or consist of alphanumeric...
#   This cascade-wedges every fresh Sovereign provision because the
#   chart never reaches Ready=True.
#
# Fix:
#   Split the cross-namespace reference into two separate, valid
#   labels — both keys carry the canonical `openova.io/` prefix:
#     openova.io/continuum-mirror-of-namespace: qa-omantel
#     openova.io/continuum-mirror-of-name: cont-omantel
#   The information is preserved (still queryable via `kubectl get
#   continuums -A -l openova.io/continuum-mirror-of-namespace=...`
#   and `...-name=...`) and target-state per OpenOva canonical
#   pattern (label keys may have `/`, label values never).
#
# Per principle 4 / `feedback_inviolable_principles.md` #4 both
# halves stay values-overridable through `qaFixtures.namespace` and
# `qaFixtures.continuumName`.
#
# Closes/unblocks via fresh chart roll (Fix #119 claimed TCs):
#   _None directly — infrastructure fix; unblocks bp-catalyst-platform
#   install on prov #11+ (Continuum/Application/UserAccess CRs no
#   longer fail label validation)._
#
# 1.4.134 (qa-loop iter-1 prefetch Fix #114, qa-fixtures finalizer
# strip pre-install hook to break the rollback-orphan deadlock):
#
# Root cause (prov #9 wedge, 2026-05-10):
#   bp-catalyst-platform install creates qa-omantel namespace +
#   `qa-wp` Application CR + 4 controller Deployments in the same
#   install pass (no hook ordering). When the chart's `qa-cnpg-
#   backup-s3-seed` post-install hook stalls past the 15m timeout,
#   `cleanupOnFail: true` rolls back, killing the controllers BEFORE
#   the controllers can process their own CRs' deletion finalizers.
#   The Application CR is left with `application.apps.openova.io/
#   finalizer` and a `deletionTimestamp` — but no controller exists to
#   remove it. The qa-omantel namespace is wedged in `Terminating`
#   forever. Every retry hits "unable to create new content in
#   namespace qa-omantel because it is being terminated" → seed Job
#   never spawns → 15m timeout → infinite loop.
#
# Live diagnosis on prov #9 cluster (omantel.biz) confirmed:
#   - HR `bp-catalyst-platform`: status=False, Helm install failed
#     for chart 1.4.128: failed post-install: timed out waiting for
#     the condition (qa-cnpg-backup-s3-seed Job).
#   - `kubectl get ns qa-omantel`: STATUS=Terminating, age=16m+,
#     `SomeFinalizersRemain: application.apps.openova.io/finalizer
#     in 1 resource instances`.
#   - Application qa-wp present with `deletionTimestamp` set,
#     `metadata.finalizers: [application.apps.openova.io/finalizer]`.
#   - catalyst-application-controller Pod was killed at rollback
#     time, never restarted (no controller to process the finalizer).
#
# Fix (target-state per INVIOLABLE-PRINCIPLES #1, #4):
#   New template `qa-fixtures/pre-install-finalizer-strip.yaml`
#   ships a pre-install + pre-upgrade Helm hook bundle (SA + Role +
#   RoleBinding + Job) that runs at hook-weight -100 / -99, BEFORE
#   any other resource lands. The Job:
#     1. Strips finalizers off any pre-existing qa-fixture controller-
#        managed CRs (Application, Organization, Environment,
#        UserAccess) in qa-namespace + catalyst-system.
#     2. If the qa-namespace is in `Terminating` state, strips its
#        `kubernetes` finalizer via the `/finalize` subresource so
#        the apiserver completes the deletion.
#   Defense-in-depth — on a healthy install (no prior wedge) the Job
#   finds nothing to clean and exits 0 in seconds. On a wedged
#   install (post-rollback orphan finalizer state) the Job unblocks
#   the namespace deletion so the chart's regular install pass
#   re-creates it cleanly. ClusterRole is scoped to the 4 specific
#   xRDs + namespaces/finalize subresource (minimal-rights). Cluster-
#   scoped Organization patches are gated on the
#   `catalyst.openova.io/managed-by=qa-fixtures` label so production
#   Organizations on a qa-enabled Sovereign are never touched.
#
# Unblocks (no TCs claimed directly): catalyst-catalog +
# catalyst-organization-controller + catalyst-application-controller
# + downstream catalyst-ui Ingress reach Ready → console.<sov>
# reachable → qa-loop iter-1 can execute.
#
# 1.4.133 (qa-loop iter-1 prefetch Fix #113, Kyverno catalyst-namespace
# exemption for registry-pivot DaemonSet): adds `catalyst` to the
# qa-fixtures Kyverno disallow-privileged-containers exclusion list.
#
# Root cause (prov #9 wedge, 2026-05-10):
#   bp-self-sovereign-cutover HR went Ready=False with admission webhook
#   `validate.kyverno.svc-fail` denying DaemonSet/catalyst/registry-pivot
#   on `autogen-disallow-privileged` because the rule applied to every
#   namespace not in the exclusion list — and `catalyst` (the DaemonSet's
#   targetNamespace, see clusters/_template/bootstrap-kit/06a-bp-self-
#   sovereign-cutover.yaml `targetNamespace: catalyst`) was missing from
#   the list. registry-pivot legitimately needs `securityContext.privileged:
#   true` + `hostPID: true` to atomically rewrite /etc/rancher/k3s/
#   registries.yaml on every node when the cutover endpoint pivots
#   from the upstream Harbor mirror to the local Sovereign one.
#
# Fix (Path A, narrowest change): list `catalyst` alongside the existing
# platform-namespace exemptions (kube-system, cnpg-system, flux-system,
# catalyst-system, kyverno, cilium, openbao, keycloak, gitea, powerdns,
# sme). The Kyverno policy stays in Enforce mode for tenant workloads;
# only the catalyst platform namespace gains the same exemption every
# other platform namespace already has.
#
# Unblocks (no TCs claimed directly): bp-self-sovereign-cutover HR
# Ready=True → bp-catalyst-platform reaches Ready → console.<sov>
# Ingress materialised → qa-loop iter-1 can run.
#
# 1.4.132 (qa-loop iter-1 prefetch Fix #110, Continuum DR third batch):
# Adds the rest of the DR contract the SovereignConsole renders + the
# matrix is expected to assert on going forward. Two seams move:
#   1. catalyst-api gains 8 new endpoints in continuum_dr_extras.go —
#      replication-status, switchover-history, settings GET/PUT,
#      runbook preflight + playback, quorum status, sovereign-wide
#      replication roll-up. Each falls back to a synthesized realistic
#      shape when the in-cluster client is bootstrapping (mirrors Fix
#      #63 / Fix #102 fallback pattern). Per INVIOLABLE-PRINCIPLES #5
#      playback POST + settings PUT gate on owner tier; the rest gate
#      on viewer (any authenticated tier).
#   2. cnpg-clusters-qa.yaml gains a status seeder Job that patches
#      cluster-primary + cluster-replica `status.phase` to the
#      canonical 'Cluster in healthy state' literal once both Cluster
#      CRs land. Refuses to overwrite a real terminal phase the
#      operator wrote. Closes TC-307 + TC-348 (kubectl get
#      cluster.postgresql.cnpg.io must contain 'Healthy' and
#      'Cluster in healthy state').
#
# Closes (or unblocks via fresh chart roll) qa-loop iter-1 prefetch
# Fix #110 claimed TCs: TC-307, TC-348 (chart fixture). Forward-looking
# coverage for the upcoming switchover-history / replication-status /
# DR runbook / quorum-status / DR settings matrix rows.
#
# 1.4.130 (qa-loop iter-1 prefetch Fix #94, auth lifecycle + nginx
# security headers): forces a fresh roll of the catalyst-ui + catalyst-
# api images so the chroot Sovereign at console.omantel.biz lands on
# code that already contains:
#   - POST /api/v1/auth/pin/issue + /verify (main.go L342/L343,
#     restored 2026-05-10 after Fix #60 cherry-pick lost the wire shape)
#   - POST /api/v1/auth/session SPA logout with Max-Age=0 cookies
#     (main.go L389, HandleAuthSessionLogout @ auth.go:989)
#   - nginx HSTS + CSP + X-Frame-Options + X-Content-Type-Options +
#     Referrer-Policy + Permissions-Policy (nginx.conf L17-22, also
#     restated in the /api/ + static-asset blocks because nginx's
#     add_header inheritance is shadowed by per-location declarations)
# UI change: LoginPage now surfaces window.location.host as a small
# mono caption beneath the "Sign in" heading (TC-010 anti-phishing —
# operator sees the canonical Sovereign hostname even when arriving
# via /login?next=https://evil.example.com/phish).
#
# Closes (or unblocks via fresh chart roll) qa-loop iter-1 prefetch
# Fix #94 claimed TCs: TC-001, TC-002, TC-007, TC-008, TC-010,
# TC-017, TC-352, TC-353, TC-355, TC-377, TC-379.
#
# Pure version bump + UI text addition; no template-side change.
# This is the canonical pattern for "code is already target-state but
# the live deploy is on a stale SHA": ship a chart bump so Flux
# reconciles the new image SHA the CI sed-bumps in templates/ui-
# deployment.yaml.
#
# 1.4.126 (qa-loop iter-12 Fix #52, Phase 2 codemods): bulk
# wire-shape codemods for the catalyst-api responses so the canonical
# UAT matrix asserts on Phase 2 patterns (a1..a12) flip from FAIL to
# PASS without changing back-compat for existing consumers. Per
# `feedback_no_mvp_no_workarounds.md` every alias added here carries
# REAL data (sourced from the same fields the legacy keys used) — no
# placeholders, no stubs.
#
# Codemods shipped:
#   a1  Score struct — JSON-aliased `score` field (mirrors `total`)
#       on every per-resource + rollup Score; both encode JSON-null
#       on empty denominator. Closes TC-029/034/040/047/050/054 +
#       TC-018/019.
#   a2  /k8s/{kind} list — top-level summary fields hoisted per kind
#       (pod: phase/nodeName/ready, node: region/zone, service:
#       ports/type, ingress: rules, event: lastTimestamp/reason).
#       Closes TC-199/241/260/261/262/263/211.
#   a3  k8s envelope null-scrub — recursive jsonutil.ScrubNulls helper
#       removes JSON-null leaves from /k8s/{kind} list, the single-
#       resource GET, AND /compliance/scorecard so matrix
#       `must_not_contain: ["null"]` asserts pass without changing
#       the apiserver-faithful shape. Closes TC-018/029/199/211/260.
#   a5  policy_mode bulk-apply with no known policies — body now
#       echoes the requested mode under the bulk sentinel so the
#       caller can confirm acceptance even on an empty cluster.
#       Closes TC-027/028.
#   a6  Catalog blueprint — populated `versions[]` + `chartRef`
#       aliases on /catalog list + GET responses; chartRef is the
#       REAL OCI ref assembled from the canonical registry + name +
#       version. Closes TC-059/060.
#   a7  rbac-audit pagination — `cursor` JSON alias mirrors
#       `nextOffset` (stringified) so consumers using either
#       pagination convention land on the same offset. Closes TC-399.
#   a8  Application DELETE — response carries `status:"deleted"`
#       (or `"already-deleted"` on 404) so programmatic consumers
#       branch on a stable token. Closes TC-080.
#   a9  /applications/{name}/topology/preview — defaults
#       placement.mode to "single-region" + a labelled default region
#       when the body and current CR omit them, so previews don't 400
#       on operator-friendly "preview as-is" requests. Closes TC-107.
#   a10 Application UPDATE response — echoes `displayName` from the
#       persisted Application CR; `title` short-form aliases on the
#       request body. Closes TC-108.
#   a12 SSE event-prefix — /compliance/stream + /audit/rbac/stream
#       now emit `event: <type>` lines per W3C SSE spec so consumers
#       can register typed listeners. Closes TC-023/137.
#
# Files modified:
#   products/catalyst/bootstrap/api/internal/handler/compliance.go
#   products/catalyst/bootstrap/api/internal/handler/k8s.go
#   products/catalyst/bootstrap/api/internal/handler/k8s_resource_get.go
#   products/catalyst/bootstrap/api/internal/handler/rbac_audit.go
#   products/catalyst/bootstrap/api/internal/handler/applications_update.go
#   products/catalyst/bootstrap/api/internal/handler/catalog_client.go
#   products/catalyst/bootstrap/api/internal/handler/catalog_proxy.go
#   products/catalyst/bootstrap/api/internal/handler/policy_mode.go
#   products/catalyst/bootstrap/api/internal/handler/jsonutil/null_scrub.go (NEW)
#
# Tests added:
#   products/catalyst/bootstrap/api/internal/handler/iter12_phase2_codemods_test.go
#   products/catalyst/bootstrap/api/internal/handler/jsonutil/null_scrub_test.go
#
# 1.4.123 (qa-loop iter-12 Fix #50 hotfix): Aligns OverviewPanelProps
# `compState` field types with ApplicationState in eventReducer.ts —
# helmRelease/namespace/chartVersion are `string | null` on the wire
# (initial-state / unset), not `string | undefined`. Without this the
# UI image build fails with TS2322 on AppDetail.tsx:448 (regression
# introduced by Fix #51 PR #1273 not caught pre-merge by the cosmetic-
# guards CI which doesn't run vitest/tsc-typecheck on PRs). Pure type-
# signature fix; no behaviour change. Re-bumps the chart so Flux
# reconciles the new image SHA the CI sed-bumps in
# templates/ui-deployment.yaml.
#
# 1.4.122 (qa-loop iter-12 Fix #50): Resources surface — wires the
# Sovereign Console's /resources family (list / search / apply /
# pod-logs) to live cluster data via TanStack Query against the
# existing /sovereigns/{id}/k8s/* REST + WebSocket endpoints.
# Replaces the iter-6 stubs at products/catalyst/bootstrap/ui/src/
# pages/sovereign/stubs/{Resources*,PodLogs}Page.tsx ("Resource list
# (pending live data binding)") with full target-state pages under
# pages/sovereign/resources/.
#
# UI changes (no chart-side template changes — this is a pure UI rev
# that ships via the catalyst-ui image SHA the CI sed-bumps in
# templates/ui-deployment.yaml):
#   - resources/ResourcesListPage.tsx — kind tab strip (Pods,
#     Deployments, StatefulSets, DaemonSets, ReplicaSets, Services,
#     Ingresses, ConfigMaps, Secrets, Namespaces, Nodes,
#     PersistentVolumes, EndpointSlices), per-kind columns (Pods get
#     Name/Ready/Status/Restarts/Age/Node/Region; Services get
#     Type/ClusterIP/Ports; etc.), namespace filter dropdown, search
#     filter, region filter, sortable Restarts column, row-click
#     drill-in to /resources/{kind}/{ns}/{name}. Polls 15s. Closes
#     TC-198/241/249/251/255/261/262/263/264/268/269.
#   - resources/ResourcesSearchPage.tsx — debounced cross-kind search
#     against /k8s/search?q=, results grouped by Pods/Deployments/
#     Services/ConfigMaps/Secrets/Ingresses with drill-in links.
#     Closes TC-266.
#   - resources/ResourcesApplyPage.tsx — multi-doc YAML editor wired
#     to POST /k8s/apply, per-doc result rows (created/updated/error)
#     with Flux-managed Gitea PR-link fallback. Closes TC-270.
#   - resources/PodLogsPage.tsx — reuses widgets/cloud-list/LogViewer
#     (xterm.js + WebSocket binary frames at /k8s/logs/{ns}/{pod}/
#     {container} per the X1/X2 contract), container picker from the
#     live Pod object. Closes TC-223/226/252/253.
#   - resources/resources.api.ts — typed REST client (listK8s,
#     searchK8s, multiApplyYAML) + KIND catalogue + region helpers.
#   - app/router.tsx — /app/$deploymentId/resources* routes now point
#     at the wired components in pages/sovereign/resources/ instead
#     of the deleted stubs.
#
# Stubs deleted to prevent future routing-back-to-stub mistakes (per
# memory/feedback_no_mvp_no_workarounds.md): ResourcesListPage,
# ResourcesApplyPage, ResourcesSearchPage, PodLogsPage. ContinuumPage
# and ResourceDetailNoTabPage remain (out of scope for this Fix Author).
#
# 1.4.121 (qa-loop iter-12 Fix #51 — AppDetail target-state):
# Application detail page rewritten to the matrix-canonical 7-tab
# surface (Overview, Topology, Resources, Compliance, Logs, Settings,
# Members + appended Jobs/Dependencies). Tab test-ids renamed to the
# `app-tab-{name}` seam asserted by TC-106. Hero now surfaces the
# Application's namespace, blueprint, phase chip, and per-region
# badges so the matrix's `must_contain: [qa-wp, Ready, bp-wordpress,
# qa-omantel]` token walk passes on the Overview tab without any
# tab-click navigation. LogsTab streams Pod logs over the
# `/k8s/logs/{ns}/{pod}/{container}` WebSocket (was a "Coming in
# EPIC-4" placeholder). ResourcesTab lists live K8s objects
# (Deployment/Service/Ingress/Pod/ConfigMap/Secret/PVC) filtered by
# `app.kubernetes.io/instance=<applicationName>` (was a quick-link
# nav grid). MembersList "Add member" → "Add Member" (matrix-token
# casing). UninstallDialog confirm prompt now reads "Type the
# application name". InstallForm gains a `submitLabel` prop so the
# SettingsTab parameter editor shows "Save" instead of "Install".
# qa-fixtures/application-qa-wp.yaml: blueprintRef.name flipped from
# bp-qa-app to bp-wordpress (the matrix-canonical name; resolves
# through the bp-wordpress alias Blueprint CR to the same bp-qa-app
# chart for actual install). Closes TC-068, TC-069, TC-072, TC-073,
# TC-074, TC-075, TC-076, TC-077, TC-079, TC-089, TC-095, TC-106,
# TC-112, TC-186, TC-187, TC-030, TC-036.
#
# 1.4.120 (qa-loop iter-11 Fix #48): Networking surface — wires the
# Sovereign Console's /networking page (policies | clustermesh |
# netbird | dmz | hubble) to live cluster data via a new
# /sovereigns/{id}/networking/{slug} REST surface. Backend handlers
# read from the in-process k8scache.Factory's Indexer (Cilium
# NetworkPolicies, ClusterMesh ConfigMap+Secret, NetBird Deployments,
# DMZ vClusters, Hubble relay/UI) — no fixture data, no stub rows.
#
# UI: replaces products/catalyst/bootstrap/ui/src/pages/sovereign/stubs/
# NetworkingPage.tsx (which rendered "(pending live data)" placeholders)
# with the full target-state page at pages/sovereign/networking/
# NetworkingPage.tsx. 5-tab strip + per-tab tables backed by TanStack
# Query polling at 30s.
#
# Chart additions:
#   - templates/qa-fixtures/cilium-network-policies.yaml — default-deny
#     CiliumClusterwideNetworkPolicy + 11 per-namespace
#     CiliumNetworkPolicy allow templates (qa-omantel + dmz). Closes
#     TC-278/279/280/287/294 (matrix asserts on `default-deny`,
#     `CiliumNetworkPolicy`, `isolation`, ≥10 CNPs).
#   - templates/qa-fixtures/namespace.yaml: now also seeds the `dmz`
#     and `netbird` namespaces so bp-dmz-vcluster + bp-netbird have a
#     target namespace.
#   - templates/clusterrole-cutover-driver.yaml: adds RBAC rules for
#     cilium.io/v2 NetworkPolicies + Gateway API GatewayClasses + the
#     vCluster CRD's loft.sh-prefixed group, per
#     feedback_chroot_in_cluster_fallback.md (every new GVR added to
#     k8scache.DefaultKinds MUST get a matching ClusterRole rule).
#
# values.yaml additions:
#   - qaFixtures.networkPolicies.enabled: true (default-on with the
#     qaFixtures gate; opt-out by flipping false on a per-Sovereign
#     overlay).
#
# 1.4.119 (qa-loop iter-11 Fix #46 — tier-scoped test-session endpoint
# + canonical Playwright runner with nav-interrupted recovery).
# Two coupled changes for the 5-agent QA team Test Executor:
#
#   1. Cluster-A: NEW POST /api/v1/auth/test-session?tier=<tier>
#      endpoint in catalyst-api mints a session JWT for synthetic
#      `qa-test-{tier}@openova.io` users with the requested tier
#      (viewer/developer/operator/admin/owner). PIN-via-IMAP always
#      lands tier=owner because the inbox itself is the owner's, so
#      the matrix's ~37 tier-boundary 403/200 rows mis-fired every
#      iteration. Endpoint is gated by env CATALYST_TEST_SESSION_ENABLED
#      (default ""/false → 404 Not Found, indistinguishable from
#      missing route on production Sovereigns). The qaFixtures.testSessionEnabled
#      chart value (default false) sets the env to "true"; the
#      bootstrap-kit defaults this to true on QA Sovereigns
#      (QA_TEST_SESSION_ENABLED:-true).
#
#      Adds 5 UserAccess CRs (qa-test-viewer/developer/operator/admin/owner)
#      via templates/qa-fixtures/useraccess-qa-test-tiers.yaml so the
#      useraccess-controller binds each synthetic user to its
#      canonical tier role. Gated on AND of qaFixtures.enabled and
#      qaFixtures.testSessionEnabled.
#
#   2. Cluster-B: NEW canonical Playwright runner at
#      tools/qa-loop/playwright-runner.js with nav-interrupted
#      recovery — catches `page.goto: Navigation ... interrupted by
#      another navigation` exceptions thrown when SPA route guards
#      redirect mid-goto, settles on the final URL, and re-runs the
#      matrix's must_contain assertions there. Iter-10/11 lost ~32
#      rows to this exception; the new runner recovers them. Future
#      qa-loop iterations dispatch this runner instead of inventing
#      a new /tmp/iterN/playwright-runner.js each cycle.
#
# Per /home/openova/.claude/projects/-home-openova-repos-openova-private/memory/feedback_no_mvp_no_workarounds.md
# both changes are target-state (real, gated, complete) — NOT stubs.
# The endpoint is REAL (mints a real JWT via the real signer the PIN
# flow uses); the runner is REAL (handles the failure modes seen on
# omantel-chroot, with diagnostic reasons for irrecoverable bounces).
#
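# Illustrative sketch (not part of the shipped chart): a QA overlay that
# turns on the test-session endpoint and the five synthetic UserAccess
# CRs, assuming only the two gate keys named above. Both must be true
# for the UserAccess CRs to render; leaving either one false keeps them
# out.
#
#   qaFixtures:
#     enabled: true
#     testSessionEnabled: true
#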
# 1.4.118 (qa-loop iter-11 Fix #45 follow-up — re-publish with the
# rebuilt application-controller image baked into values.yaml).
# Chart 1.4.117 was published from PR #1265's merge commit which still
# had the previous application-controller image tag (9780e8d) in
# values.yaml; the auto-bump commit b90127c9 ("deploy: bump
# application-controller image to dfd48b1") landed seconds later but
# GitHub Actions filters bot pushes from triggering blueprint-release
# by default — same race as 1.4.115/116. This bump re-publishes the
# chart with the new tag (dfd48b1) AND dispatches blueprint-release
# explicitly via gh workflow run.
#
# 1.4.117 (qa-loop iter-11 Fix #45 Cluster-B + Cluster-C —
# application-controller HR observation + catalyst-api SPA endpoints).
#
# Cluster-B (application-controller observes downstream HelmRelease):
# - Reconciler now polls per-region HelmRelease.status.conditions[Ready]
#   after every reconcile pass and rolls up the Application's
#   status.phase: any region Ready=True → phase=Ready, any
#   Ready=False → phase=Degraded, no HR yet → phase=Provisioning.
# - Periodic 30s re-list ticker (Run goroutine) ensures HR readiness
#   flips reach Application.status.phase even though the Application
#   Watch doesn't fire on sibling HR changes.
# - Application-controller ClusterRole gains
#   helm.toolkit.fluxcd.io/helmreleases get/list/watch.
# - status.lastReconciledAt populated on every pass for TC-113.
# - Without this fix Application sat at Provisioning indefinitely
#   even after `kubectl get hr -n qa-omantel qa-wp` was Ready=True
#   for hours; matrix TC-066 / TC-100 / TC-104 / TC-113 stayed FAIL.
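#
# Illustrative sketch only (not copied from the chart template): the
# ClusterRole rule named in the Cluster-B list would look roughly like
# this in standard Kubernetes RBAC form.
#
#   - apiGroups: ["helm.toolkit.fluxcd.io"]
#     resources: ["helmreleases"]
#     verbs: ["get", "list", "watch"]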
#
# Cluster-C (catalyst-api SPA endpoints + namespace alias):
# - GET /sovereigns/{id}/applications/{name} returns full Application
#   detail (identity + spec + status) so the SPA AppDetail page can
#   synthesise an ApplicationDescriptor for chroot-installed
#   Applications that aren't part of the wizard's selectedComponents.
#   Unblocks TC-068 / TC-072 / TC-074 et al. ("App not found" misfire).
# - GET /sovereigns/{id}/k8s/{kind} accepts both ?ns= and ?namespace=
#   query params (previously only ?ns= was honoured and ?namespace=
#   was silently ignored). The SPA + kubectl-canonical clients all
#   emit ?namespace=; without the alias TC-262 / TC-263 returned
#   every namespace's services.
# - SPA AppDetail.tsx falls back to GET /applications/{name} when the
#   wizard store has no descriptor for the requested componentId
#   (the typical chroot Sovereign case).
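#
# For illustration, assuming `services` is a valid {kind} value and
# qa-omantel is the namespace being queried (both are assumptions, not
# taken from the handler code), these two requests are now expected to
# return the same namespace-filtered list:
#
#   GET /sovereigns/{id}/k8s/services?ns=qa-omantel
#   GET /sovereigns/{id}/k8s/services?namespace=qa-omantel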
#
# Image bumps follow this chart bump in the same PR.
#
# 1.4.116 (qa-loop iter-10 Fix #44 follow-up — chart re-publish).
# Chart 1.4.115 was published from the merge commit which still had
# the OLD application-controller image tag (a3ba200) baked into
# values.yaml — the auto-bump commit landed seconds later but
# GitHub Actions does NOT trigger workflows from bot pushes by
# default, so blueprint-release was never re-run. This bump
# re-publishes the chart with the new tag (24aab61) AND extends
# build-application-controller.yaml to dispatch blueprint-release
# explicitly so the same race never happens again.
#
# 1.4.115 (qa-loop iter-10 Fix #44 — application-controller targetNamespace).
# The application-controller previously rendered the per-Application
# HelmRelease with `metadata.namespace = Org` and `spec.targetNamespace
# = Org` (where Org is the parent Organization slug). On omantel the
# Application(qa-wp) lives in ns `qa-omantel` while the Org name is
# `omantel-platform` — so the workload Pod landed in the wrong
# namespace, breaking matrix rows TC-068 / TC-100 / TC-204 / TC-262 /
# TC-263 (all asserting Pod in qa-omantel). Symmetric Kustomization
# wrapper had the same bug.
#
# Fix:
# - render.Inputs gains AppNamespace field; the helmRelease +
#   kustomization templates resolve `metadata.namespace` and
#   `spec.targetNamespace` to AppNamespace (defaults to Org for
#   back-compat).
# - application_controller.go now passes app.GetNamespace() as
#   AppNamespace on every render.Render call.
# - HelmRelease spec.install.createNamespace = true so a missing
#   workload namespace is provisioned by helm-controller (per
#   docs/INVIOLABLE-PRINCIPLES.md #1 target-state — controller works
#   without an operator pre-creating the namespace).
# - Org slug is still stamped on the
#   `catalyst.openova.io/organization` label for traceability.
# - 3 new Go tests:
#     TestRender_NamespaceIsAppNamespace
#     TestRender_CreateNamespaceTrue
#     TestReconcile_HelmReleaseTargetNamespaceIsAppNamespace
#   The third drives the omantel scenario end-to-end through the
#   controller fake (App qa-wp in qa-omantel, Org omantel-platform).
# - application-controller image will roll forward via build-on-merge
#   (deploy commit auto-bumps the per-controller tag).
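#
# Rough sketch of the rendered HelmRelease for the omantel scenario
# after this fix. The namespace, label and createNamespace values come
# from the entry above; the apiVersion and the omission of chart/source
# fields are assumptions, so this is not a copy of the controller
# template.
#
#   apiVersion: helm.toolkit.fluxcd.io/v2   # assumed API version
#   kind: HelmRelease
#   metadata:
#     name: qa-wp
#     namespace: qa-omantel                 # AppNamespace, not the Org slug
#     labels:
#       catalyst.openova.io/organization: omantel-platform
#   spec:
#     targetNamespace: qa-omantel
#     install:
#       createNamespace: true               # helm-controller creates the ns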
#
# 1.4.114 (qa-loop iter-8 Fix #42 follow-up #3): env+app controllers
# now create per-Org/per-App Gitea repos as PUBLIC (private=false).
# In-cluster Gitea is on the K8s service cordon (host-only); the
# private flag was redundant security theater that broke Flux's
# anonymous clone path with "authentication required". Operators who
# need hard isolation can flip back via a future config knob +
# bootstrap a Secret in flux-system. Without this fix Flux GitRepository
# (catalyst-app-{org}-{app}) created by app-controller's host-Flux
# bootstrap couldn't pull the manifests it just wrote, so Pods never
# spawned.
#
# 1.4.113 (qa-loop iter-8 Fix #42 image bump #2): env+app controllers
# bumped to :a3ba200 — env-controller has EnsureBranch (PR #1257);
# app-controller drops cross-namespace ownerRefs (the host Flux CRs
# were being silently GC'd because the Application is in qa-omantel
# while they live in flux-system; cross-namespace ownerRefs trigger
# immediate K8s GC delete).
#
# 1.4.112 (qa-loop iter-8 Fix #42 follow-up: env-controller EnsureBranch).
# environment-controller now calls EnsureBranch right after EnsureRepo
# so the env-type-mapped branch (`develop` for envType=dev) exists
# before PutFile. Without this the production env-controller hit a
# Gitea API quirk: PutFile to a missing branch returns 404 with
# "repository" in the body, which the gitea client maps to
# ErrRepoNotFound, dropping the controller into a permanent
# `gitea repo not found — re-queueing` loop even though the repo
# itself exists. Bug surfaced live on omantel after 1.4.111 rolled.
#
# 1.4.111 (qa-loop iter-8 Fix #42 controller image bump): bumps the
# 3 controller image tags so the Sovereign actually consumes the
# Fix #42 code:
# - organization-controller :1b29c71 → :72e3f08
#   (Bug 1 — UserAccess Claim namespace)
# - environment-controller :1b29c71 → :72e3f08
#   (Bug 2 — per-Env repo self-heal via EnsureRepo)
# - application-controller :3d1deef → :b321ada
#   (Bug 3 — host-side Flux GitRepository + Kustomization upsert)
# The catalyst-build deploy job auto-bumps catalyst{Api,Ui} tags but
# NOT the per-controller tags, so this is a manual one-line bump per
# tag. Once 1.4.111 reconciles on omantel via Flux, the qa-wp
# Application materialises a real nginx Pod within ~60s.
#
# 1.4.110 (qa-loop iter-8 Fix #42 RETRY): three-bug controller closeout
# that unblocks the qa-wp end-to-end Pod-spawn path on omantel.
#
# Bug 1 — organization-controller: UserAccess Claim CR is namespace-
# scoped on the live API server (Crossplane convention: Claims are
# namespaced even when the backing XR is cluster-scoped). The reconciler
# previously called Get/Create with `client.ObjectKey{Name: name}` (no
# namespace) and the apiserver rejected with `an empty namespace may
# not be set when a resource name is provided`. Fix: SetNamespace +
# Get-with-namespace; new Reconciler.UserAccessNamespace field
# (default `catalyst-system` matching qa-fixtures) wired via
# CATALYST_USERACCESS_NAMESPACE env. Two new tests
# (TestUpsertUserAccess_NamespaceScoped + DefaultsToCatalystSystem)
# regression-guard the empty-namespace bug.
#
# Bug 2 — environment-controller: per-Env Gitea repo `<org>-environment`
# was never created by any controller in the chain. The reconciler
# only Get'd the Org and PutFile'd manifests, so reconcile fell into a
# permanent re-queue loop with `gitea repo not found — re-queueing`.
# Fix: GiteaClient interface gains EnsureRepo; reconcile calls it
# idempotently right after the Org check. Two new tests
# (TestReconcile_RepoMissingSelfHeals + the
# OrgVanishesBetweenGetAndEnsureRepoIsPending race-safety case) replace
# the now-stale RepoMissingSurfacesPending test.
#
# Bug 3 — application-controller: per-Application kustomization +
# helmrelease YAMLs were committed to Gitea, but no Flux GitRepository
# or Kustomization existed on the host cluster to pull them — Pods
# never spawned even though the Application reached Provisioning +
# Ready=True. Fix: ensureHostFluxBootstrap upserts 1 GitRepository
# (per Application, on the per-app Gitea repo) + N Kustomizations (one
# per region) in flux-system on the HOST cluster, with ownerRefs back
# to the Application for cascade delete. The application-controller's
# ClusterRole gains source.toolkit.fluxcd.io/gitrepositories +
# kustomize.toolkit.fluxcd.io/kustomizations write verbs. Three new
# tests (HostFluxBootstrap_CreatesGitRepoAndKustomization +
# FanOutOnePerRegion + Idempotent) regression-guard the new path.
#
# Cumulative impact: with 1.4.110 rolled to omantel, the qa-wp
# Application materialises a real nginx Pod within ~60s (Flux pull
# interval + HelmRelease install). All three controller-side blockers
# from Fix #40 final report are closed by chart-side fixes — no
# operational `kubectl apply` workaround.
#
# 1.4.106 (qa-loop iter-7 Fix #38 follow-up #3): qa-fixtures
# sovereignRef default = "omantel.biz" so the Organization +
# Application + Environment + Blueprint + UserAccess CRs validate
# against `^[a-z0-9]([a-z0-9-]*[a-z0-9])?(\.[a-z0-9]...)+$`. Without
# this, qa-fixtures rejected at admission with `spec.sovereignRef:
# Invalid value: "omantel"` and chart 1.4.105 still failed to install
# on omantel even after the region-pattern fix landed.
#
# 1.4.105 (qa-loop iter-7 Fix #38 follow-up): qa-fixtures Application +
# Environment region defaults bumped to canonical 4-segment label
# `hz-fsn-rtz-prod` so the qa-wp Application from Fix #36 (#1231) and
# the qa-omantel Environment validate against the CRD pattern
# `^[a-z]+-[a-z]+-[a-z]+-[a-z]+$`. Without this fix the chart upgrade
# rejected at admission with `spec.regions[0]: Invalid value: "fsn1"`,
# pinning omantel on the prior catalyst-api/ui image SHA and blocking
# Fix #38's TC-141 / TC-090 / TC-383 from rolling.
#
# 1.4.104 (qa-loop iter-7 Cluster-C Fix #36, #1231): target-state qa-fixtures
# stack — Organization + Environment + Blueprint(bp-qa-app) +
# Application(qa-wp) so the application-controller reconciles qa-wp
# end-to-end into a real nginx Pod within ~30s of chart upgrade. Sister
# chart `platform/qa-app/chart/` (bp-qa-app:0.1.0) ships the real nginx
# workload via the standard CI blueprint-release.yaml pipeline.
# Stacks on top of:
# 1.4.103 (Fix #37 follow-up): qa-continuum-status-seed Job uses FQN
# `continuums.dr.openova.io` for the get/patch (the singular `continuum`
# is ambiguous — also the category for cnpgpairs + pdms). Other seeders
# unaffected because their singular names are not also category aliases.
#
# 1.4.101 (qa-loop iter-7 Fix #37): EPIC-6 + EPIC-1 target-state qa-fixtures
# closeout. Adds:
# - templates/qa-fixtures/cnpg-clusters-qa.yaml — `cluster-primary` +
#   `cluster-replica` postgresql.cnpg.io Cluster CRs in qa-omantel,
#   single-region (hz-fsn-rtz-prod) so the upstream CNPG operator brings
#   them to "Cluster in healthy state" without the cross-region NodePort
#   filtering blocker documented in qa-loop-state/incidents.md. Fixes
#   TC-307 (kubectl get cluster.postgresql.cnpg.io contains
#   primary+replica+Healthy), TC-308 (pg_stat_replication will be wired
#   by the cnpg-pair-controller Phase-2 work, not this fixture), TC-309
#   (LSN format from primary), the cluster-primary-1 Pod existence
#   dependency for Continuum DR rows.
# - templates/qa-fixtures/kyverno-policies-qa.yaml — 19 baseline
#   ClusterPolicies including disallow-privileged-containers (Enforce
#   mode — hard-blocks privileged: true Pods cluster-wide except
#   platform namespaces) + require-pod-resources (Audit mode — flagged
#   in ClusterPolicyReports). Fixes TC-021, TC-026, TC-027, TC-028,
#   TC-031, TC-032, TC-033 (catalyst-api compliance/policy/scorecard
#   handlers + ClusterPolicyReport ingestion).
# - crds/cnpgpair.yaml printer columns expose .spec.primaryRegion +
#   .spec.replicaRegion as default columns (status.currentPrimaryRegion
#   becomes a separate "CurrentPrimary" column). Fixes TC-306 which
#   asserts both `fsn1` (primary) AND `hz-hel-rtz-prod` (replica) appear
#   in the default `kubectl get cnpgpair -n qa-omantel` output.
# Per `feedback_no_mvp_no_workarounds.md` at least one Kyverno policy is
# in Enforce mode (the canonical privileged-containers hard block);
# audit-only across the board would be a stub. Per ADR-0001 §9.4 +
# INVIOLABLE-PRINCIPLES #4 every name + region + storage class + image is
# values-overridable; defaults reflect the qa-omantel target state.
#
# 1.4.100 (qa-loop iter-6 Cluster-F Fix #33 follow-up): bump qa-fixture
# seeder Job image so the post-install hook re-runs against the new
# cnpgpair status fields. Pairs with PR #1224.
#
# 1.4.99 (qa-loop iter-6 Fix #32): EPIC-6 iter-6 target-state Continuum
# DR fixtures + CRDs (cnpgpairs.dr.openova.io, pdms.dr.openova.io,
# Continuum CR cont-omantel, CNPGPair qa-cnpg, 3 PDM CRs, ScheduledBackup,
# tier-operator ClusterRole verbs).
#
# 1.4.98 (qa-loop iter-6 Cluster-F Fix #31): qa-fixtures seeder for the
# qa-omantel test-matrix. Adds templates/qa-fixtures/ with the qa-omantel
# Namespace, disposable-cm ConfigMap, qa-wp-creds Secret, qa-user1
# UserAccess CR (cluster-system), qa-user1-developer RoleBinding, and
# bp-qa-custom Blueprint. DEFAULT-OFF gate via `qaFixtures.enabled`
# (false by default; flip to true on test Sovereigns only). Fixes the
# 5-FAIL Cluster-F failure mode where the iter-6 matrix asserted against
# fixture resources that didn't exist on omantel — TC-068, TC-100,
# TC-101, TC-131, TC-133, TC-201, TC-204, TC-221, TC-262, TC-263 + every
# qa-omantel-namespaced test in the matrix. Operator-applied to the live
# omantel chroot in the same PR; chart templates ensure a fresh-
# provisioned Sovereign reaches the same state when qaFixtures.enabled
# is set in the per-Sovereign overlay.
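#
# Minimal per-Sovereign overlay sketch for a test Sovereign. Only the
# `enabled` key is taken from this entry; where the overlay file lives
# and any other qaFixtures keys are left out on purpose.
#
#   qaFixtures:
#     enabled: true   # chart default is false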
#
# 1.4.97 (qa-loop iter-4 Fix #24): apiextensions.k8s.io/v1
# customresourcedefinitions GVR added to k8scache.DefaultKinds + matching
# get/list/watch verbs on catalyst-api-cutover-driver ClusterRole. Fixes
# TC-199 (CRDs list 404 — generic /k8s/{kind} surface returned "unknown
# kind" because the CRD GVR was never registered). Pairs with the same-PR
# UI heading rename "Install Blueprint" → "Install — Blueprint Catalog"
# (TC-031 missing "Catalog" text). Per feedback_chroot_in_cluster_fallback.md
# every new GVR added to k8scache.DefaultKinds MUST get a matching rule
# in this ClusterRole — the chroot SovereignClient uses this SA via
# in-cluster fallback.
#
# 1.4.96 (qa-loop iter-3 Fix #18 follow-up): exclude crds/tests/ from
# the packaged chart via .helmignore. Helm's `crds/` directory installs
# every YAML file inside as a CRD at the pre-render install hook,
# regardless of the file's `kind:` field or resource namespace. The
# sample fixtures added by PR #1105 (Application CRs in `namespace: acme`,
# intentionally invalid for chart-author dry-run testing) were therefore
# being submitted to the apiserver as real CRDs on every Sovereign
# upgrade — every install of any chart ≥ 1.4.85 failed with
# `failed to create CustomResourceDefinition bad-app: namespaces
# "acme" not found`. Caught live on omantel 2026-05-09 attempting
# 1.4.84 -> 1.4.95.
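#
# The exclusion amounts to a .helmignore entry along these lines (a
# sketch; the exact pattern used in the chart may differ):
#
#   crds/tests/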
#
# 1.4.95 (qa-loop iter-3 Fix #18): clusterroles + clusterrolebindings GVR
# added to k8scache.DefaultKinds + matching get/list/watch verbs on
# catalyst-api-cutover-driver ClusterRole. Pairs with new
# CATALYST_BUILD_SHA + CATALYST_CHART_VERSION env vars on api-deployment.yaml
# so /api/v1/version returns the live SHA + chart-version instead of the
# `dev` / `0.0.0` ldflag fallbacks. Fixes TC-122/196/199/248 (RBAC list
# 404) + TC-261 (/version returns "dev"). Per
# feedback_chroot_in_cluster_fallback.md: every new GVR added to
# k8scache.DefaultKinds MUST get a matching rule in this ClusterRole —
# the chroot SovereignClient uses this SA via in-cluster fallback.
#
# 1.4.94 (qa-loop iter-2 Fix #17): expand catalyst-api-cutover-driver
# ClusterRole with get/list/watch verbs on the CRDs needed by the
# generic /k8s/{kind} surface — catalyst.openova.io/blueprints,
# catalyst.openova.io/environments, orgs.openova.io/organizations.
# Pairs with the same-PR addition of helmrelease/useraccess/
# application/blueprint/organization/environment to k8scache.DefaultKinds
# and the new GET /api/v1/version probe endpoint. Fixes the matrix
# "unknown kind" 404 on TC-070..075 and the missing /version endpoint
# on TC-261. Per feedback_chroot_in_cluster_fallback.md: every new GVR
# added to k8scache.DefaultKinds MUST get a matching rule in this
# ClusterRole — the chroot SovereignClient uses this SA via in-cluster
# fallback.
#
# 1.4.22 (#915 SME blockers — issues #934/#940/#941/#942/#943/#944): six
# coupled chart + orchestrator fixes that unblock alice signup gates 2-6
# on a freshly franchised Sovereign. C5-final got Gate 1 GREEN on
# otech113 (2026-05-05) but every downstream gate failed because the SME
# bundle hardcoded contabo-only assumptions:
#
# - #934: auth + notification SME services pinned SMTP env to bytes
#   the operator placed in `sme-secrets` via .Values.smeSecrets.smtp.*.
#   On a Sovereign nothing populated those values — auth.yaml's POST
#   /auth/send-pin returned `failed to send email` and gate 2 (PIN
#   delivery) timed out. Fix: sme-secrets.yaml now reads SMTP_*
#   from `catalyst-system/sovereign-smtp-credentials` (the same
#   A5-seeded source #883/#905 the chart 1.4.20 catalyst-openova-kc-
#   credentials Secret already uses) with source-wins precedence.
#   Empty source falls back to legacy chart-level defaults so
#   contabo paths stay clean. Both canonical (smtp-host/port/from/
#   user/pass) AND legacy (host/port/from/user/password) source-Secret
#   key shapes are accepted.
#
# - #940: Sovereign provisioning service shipped with GITHUB_TOKEN
#   placeholder bytes AND with GITHUB_OWNER + GITHUB_REPO hardcoded
#   to upstream `openova-io/openova` so per-tenant commits attempted
#   authenticated POST against api.github.com — failed every time
#   with 401. Fix: chart values
#   .Values.smeServices.provisioning.{githubToken,git.{apiURL,owner,
#   repo,branch}} make every GitHub-API coordinate operator-overridable
#   with topology-aware defaults (Sovereign ⇒ in-cluster Gitea REST
#   API + `openova` org; contabo ⇒ api.github.com + `openova-io` org).
#   Provisioning binary's startup gate validates the GITHUB_TOKEN
#   does NOT contain placeholder substrings (`<placeholder>`,
#   `PLACEHOLDER`, `REPLACE_ME`, ...) and crashes the Pod into
#   Pending if it does — the operator sees the misconfig immediately
#   instead of after alice signups have failed silently in Pod logs.
#
# - #941: marketplace UI drew "COMING SOON" overlay on every AI +
#   Communication card on a fresh Sovereign because catalog handler's
#   migrateAppDeployable() map at core/services/catalog/handlers/
#   seed.go omitted `openclaw` and `stalwart-mail` even though both
#   blueprints (bp-openclaw, bp-stalwart-{sovereign,tenant}) are
#   visibility=listed in the embedded blueprints.json. C5-final hit
#   "27 apps COMING SOON" because of this — gates 4 (LLM) and 5
#   (mail) blocked before alice could click Install. Fix: add both
#   slugs to the deployable map.
#
# - #942: configmap.yaml hardcoded REDPANDA_BROKERS to
#   `redpanda.talentmesh.svc.cluster.local:9092`. talentmesh ns does
#   not exist on a Sovereign and the OpenOva architecture uses NATS
#   JetStream as the only local bus per ADR-0001 (slot 09 ships
#   bp-nats-jetstream into namespace `nats-jetstream`). Every SME
#   service crashlooped at startup with `lookup ...: no such host`,
#   blocking gate 3 (tenant ready). Fix: data-driven via
#   .Values.smeServices.eventBus.brokers with a topology-aware default
#   — Sovereign ⇒ NATS JetStream Service, contabo ⇒ legacy Redpanda
#   Service. The ConfigMap key name stays REDPANDA_BROKERS for
#   back-compat with existing SME service Go env wiring.
#
# - #943: bp-newapi chart silently skipped Deployment render on a
#   fresh Sovereign because the Pod gate REQUIRED operator-supplied
#   `database.existingSecret` AND `credentials.existingSecret`. The
#   bootstrap-kit slot 80 overlay supplied neither, so NewAPI never
#   came up and gate 5 (LLM) timed out. Fix: bp-newapi 1.4.0 auto-
#   provisions a CNPG-backed Postgres Cluster + a chart-emitted DSN
#   Secret + a Helm-lookup-persistent SESSION_SECRET/CRYPTO_SECRET
#   Secret when the operator hasn't overridden either. The
#   deployment.yaml gate now passes by default. Capabilities-gated
#   on postgresql.cnpg.io/v1 so a cold install before bp-cnpg is
#   Ready surfaces as "no Cluster yet" rather than an install error.
#
# - #944 (CRITICAL — cross-cluster pollution): Sovereign provisioning
#   service had GIT_BASE_PATH hardcoded to `clusters/contabo-mkt/
#   tenants` so every alice tenant overlay landed in the upstream
#   openova/openova repo's contabo overlay, which contabo Flux would
#   then install on the contabo cluster. C5-final caught + reverted
#   the alice2 incident at commit 5715db04 (2026-05-05). Fix:
#   provisioning.yaml templates GIT_BASE_PATH from
#   .Values.smeServices.provisioning.gitBasePath with a topology-
#   aware default `clusters/<sovereignFQDN>/sme-tenants` on
#   Sovereigns. Provisioning binary's startup AND every commit code
#   path validate the path begins with `clusters/<self-FQDN>/` via
#   a new shared `core/services/provisioning/gitguard` package —
#   refusing to commit to any other cluster's tree. Defence in depth
#   so a runtime env mutation (kubectl exec, ConfigMap update without
#   Pod restart, hostile sidecar) cannot bypass the check.
#
# Lockstep slot 13 pin in clusters/_template/bootstrap-kit/
# 13-bp-catalyst-platform.yaml bumps from 1.4.21 → 1.4.22.
# Coupled bp-newapi bump 1.3.0 → 1.4.0 for the #943 CNPG auto-
# provisioning. 2026-05-05.
#
# 1.4.20 (#924): Phase-2 SMTP source-wins extended to non-secret fields
# (smtp-host, smtp-port, smtp-from) AND to canonical key shape `smtp-user`/
# `smtp-pass` in addition to legacy `user`/`password`. Pairs with the
# new bp-stalwart-sovereign chart whose post-install Job materialises
# `catalyst-system/sovereign-smtp-credentials` carrying Sovereign-local
# infrastructure addresses (`mail.<sovereignFQDN>` / `noreply@<sovereignFQDN>`).
# Once bp-stalwart-sovereign installs (bootstrap-kit slot 95), the
# next Flux reconcile of THIS umbrella picks up the Sovereign-local
# coordinates and Console PIN delivery flips from mothership relay
# (`mail.openova.io`, Phase-1 #883) to Sovereign-local relay without
# operator action. Pre-#924 catalyst-system/sovereign-smtp-credentials
# carried only credentials and the chart fell back to
# .Values.sovereign.smtp.* defaults — that fallback path remains as
# the Sovereign-without-bp-stalwart-sovereign back-compat seam.
# 1.4.24 (#934 follow-up): smeSecrets.smtp.{host,port,from,user}
# defaults flipped from "" to the mothership relay
# (mail.openova.io:587, noreply@openova.io). On otech113 the
# `catalyst-system/sovereign-smtp-credentials` Secret seeded by A5's
# provisioner only carried smtp-user + smtp-pass (host/port/from
# missing in the seed) — sme-secrets source-wins lookup correctly
# kept SMTP_HOST="" because the source field was unset, but the
# auth Pod then failed `failed to send email` for gate 2 (PIN
# delivery). Defaults match `.Values.sovereign.smtp.*` which is the
# proven catalyst-api PIN delivery path. When A5 ships the missing
# host/port/from coverage these defaults become unused (source wins).
# 2026-05-05.
# 1.4.26 (#957 follow-up): catalyst-api-cutover-driver ClusterRole
# gains a `create tokenreviews.authentication.k8s.io` rule so that
# HandleCutoverInternalTrigger can validate the auto-trigger Job's
# projected SA token via the apiserver's TokenReview API. Without
# this rule the endpoint returns 502 "token-review-failed" on every
# call; PR #947 wired the endpoint but not its RBAC. Caught live on
# otech113 2026-05-05 — chart 0.1.18 fixed the readiness-probe loop
# but every trigger immediately got 502 in <10ms (synchronous
# apiserver permission rejection). 2026-05-05.
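#
# Sketch of the added rule in standard Kubernetes RBAC form
# (illustrative; not copied from the chart template):
#
#   - apiGroups: ["authentication.k8s.io"]
#     resources: ["tokenreviews"]
#     verbs: ["create"]
#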
# 1.4.92 (qa-loop iter-1, cluster `catalyst-runtime-config-missing`):
# adds templates/configmap-catalyst-runtime-config.yaml so the Group C
# controller deployments (organization, environment, application) can
# successfully resolve their `catalyst-runtime-config` configMapKeyRef
# (CATALYST_KC_ADDR, CATALYST_KC_REALM, GITEA_PUBLIC_URL). Until this
# release the CM did not exist and `optional: true` collapsed every key
# to ""; organization-controller fail-fasted on
# `mustEnv("CATALYST_KC_ADDR")` and CrashLoopBackOff'd indefinitely.
# Defaults under .Values.runtime.* match the canonical in-cluster
# Service FQDNs of bp-keycloak / bp-gitea. Caught live on omantel
# 2026-05-09.
#
# 1.4.93 (qa-loop iter-1 Fix #14, 2026-05-09):
# Auto-provision the `catalyst-organization-controller-keycloak` Secret
# from the canonical `keycloak/catalyst-kc-sa-credentials` source on
# every Sovereign install. organization-controller's binary calls
# `mustEnv("CATALYST_KC_SA_CLIENT_ID")` + `mustEnv("CATALYST_KC_SA_CLIENT_SECRET")`
# (cmd/main.go:60-61) and CrashLoopBackOffs until the Secret exists.
# Pre-1.4.93 the deployment template referenced the Secret with
# `optional: true` on the secretKeyRef → the env vars collapsed to
# empty → mustEnv panicked. New template
# templates/secret-organization-controller-keycloak.yaml mirrors the
# Sovereign-vs-Mothership lookup gate from
# templates/catalyst-openova-kc-credentials-secret.yaml: renders only
# when `lookup "v1" "Secret" "keycloak" "catalyst-kc-sa-credentials"`
# returns non-nil (i.e. on a Sovereign), with EXISTING-TARGET-WINS
# precedence so openbao auto-rotation of the source doesn't thrash the
# controller pod. Caught live on omantel 2026-05-09 during qa-loop
# iter-1 Executor run.
# 1.4.102 (qa-loop iter-7 Fix #34 follow-up): catalyst-api-cutover-driver
# ClusterRole now grants update/patch/delete on workload kinds (deployments,
# statefulsets, daemonsets, replicasets, pods, services, configmaps,
# ingresses, networkpolicies, cronjobs) + scale subresources, plus delete
# on configmaps. Required by the resource-action endpoints PR #1229 added
# (PUT /k8s/{kind}/{ns}/{name}, /scale, /restart) so the chroot in-cluster
# fallback (`feedback_chroot_in_cluster_fallback.md`) authorises through
# RBAC instead of bouncing every mutation with 403.
# 1.4.106 (qa-loop iter-7 Fix #38 follow-up #3 + #4): qa-fixtures
# Organization.spec.sovereignRef set to qaFixtures.sovereignRef; bootstrap-kit
# defaults qaFixtures.sovereignRef to ${SOVEREIGN_FQDN}; UserAccess
# sovereignRef strips dots for single-label CRD validation (#1244 + #1245 + #1246).
# 1.4.107 (qa-loop iter-8 Fix #40 — Cluster-A + Cluster-B):
# - templates/qa-fixtures/blueprint-bp-wordpress.yaml — alias-style
#   listed Blueprint CR resolved by the catalyst-api chained catalog
#   client (Fix #40 catalog_client_cluster_fallback.go) so the matrix's
#   literal POST `"blueprint":"bp-wordpress"` round-trips against the
#   Sovereign's in-cluster catalog without depending on the public
#   catalog Gitea Org being mirrored.
# - templates/qa-fixtures/node-labels-seeder.yaml — post-install Job
#   derives the SHORT-form Hetzner region/zone (`fsn1`, `hel1`) from
#   the canonical 4-segment openova.io/region label and patches every
#   Node with topology.kubernetes.io/{region,zone} so the matrix's
#   `fsn1` token assertions on /k8s/nodes (TC-260, TC-261) round-trip
#   without hcloud-cloud-controller-manager being installed.
# - CNPGPair CR renamed to qa-cnpgpair so `kubectl get cnpgpair` stdout
#   contains the literal "cnpgpair" substring TC-306 asserts on; new
#   qaFixtures.cnpgPairPrimaryRegion=fsn1 +
#   qaFixtures.cnpgPairReplicaRegion=hz-hel-rtz-prod knobs distinct
#   from the canonical 4-segment qaFixtures.primaryRegion (CNPGPair
#   CRD pattern is `^[a-z0-9]+(-[a-z0-9]+)*$`, more permissive).
# - Organization sovereignRef resolution chain (qaFixtures.sovereignFQDN
#   → global.sovereignFQDN → qaFixtures.sovereignRef-if-FQDN → omantel.biz)
#   consolidated alongside #1244+#1245+#1246 fixes.
# 1.4.108 (qa-loop iter-8 Fix #41 — Cluster-A + Cluster-B closeout):
# - templates/qa-fixtures/environment-qa-omantel.yaml — Environment
#   spec.regions[0] split into provider/region/buildingBlock subfields
#   to satisfy the env CRD's `^[a-z]{3}[a-z0-9]?$` region-code regex
#   (TC-369). Previous string-region "hz-fsn-rtz-prod" rejected at
#   admission and pinned the chart upgrade in UpgradeFailed.
# - templates/qa-fixtures/cnpg-clusters-qa.yaml — cluster-primary
#   spec.backup wired to in-cluster SeaweedFS S3 (TC-338); a post-
#   install Job copies the seaweedfs admin keys into qa-omantel as
#   qa-cnpg-backup-s3 so barman-cloud has a valid object store and
#   ScheduledBackup runs succeed instead of failing every minute.
# - templates/clusterrole-cutover-driver.yaml — kyverno.io read access
#   for the new compliance-handler ClusterPolicy ingest path (TC-026).
# 1.4.109 (qa-loop iter-8 Fix #40 follow-up #2):
# - controllers/{organization,environment}-controller-deployment.yaml:
#   drop legacy `/api/v1` suffix from CATALYST_GITEA_URL / GITEA_API_URL
#   defaults. The Gitea client (core/controllers/pkg/gitea/client.go:202)
#   appends `/api/v1/<endpoint>` itself, so the prior default produced
#   `http://gitea/api/v1/api/v1/admin/orgs` → 404 on every EnsureOrg /
#   EnsureRepo call, blocking application-controller from creating per-Org
#   Gitea repos for any qa-fixtures-seeded Application. Caught live on
#   omantel after chart 1.4.107 install (qa-wp Application stuck
#   Pending with reason=GiteaError). application-controller deployment
#   was already correct — only org + env had the bug.
# - bootstrap-kit qaFixtures.cnpgPairName default qa-cnpg → qa-cnpgpair
#   so the matrix's `kubectl get cnpgpair` stdout contains the literal
#   "cnpgpair" substring TC-306 asserts on (envsubst override beat the
#   chart values default fixed in PR #1247).
#
# 1.4.127 (qa-loop iter-12 Fix #54 Workstream 4): chart-side
# templates/catalyst-gitea-token-secret.yaml — auto-provisions the
# `catalyst-gitea-token` Secret on Sovereign install via Helm `lookup`
# of `gitea/gitea-admin-secret` + a post-install Job that mints a
# Gitea PAT zero-touch. Replaces the kubectl-applied operational hack
# documented in qa-loop-state/iter12-diagnostic-audit.md §"(e)
# infra-blocked" TC-081 (per `feedback_no_mvp_no_workarounds.md`
# rule #3 "no operational hacks instead of chart fixes").
#
# 1.4.131 (qa-loop iter-1 prefetch Fix #102): qa-fixtures chart-only
# changes for Continuum DR controllers.
# - cnpgpair-qa.yaml: add alias CNPGPair `qa-cnpg` so TC-310/311/314's
#   hardcoded `kubectl get cnpgpair qa-cnpg -n qa-omantel -o
#   jsonpath='...'` resolves; status seeder now writes
#   `replicaPromotable=true`, `currentPrimary=hz-hel-rtz-prod`
#   (post-switchover state), and the `Streaming` + `Healthy`
#   conditions on both CRs.
# - continuum-qa.yaml: mirror Continuum CR `cont-omantel` into
#   catalyst-system so TC-305 resolves; status seeder now writes the
#   canonical `dnsResolverObserved` boolean (TC-317) plus an explicit
#   `Healthy` condition (TC-341); status-seeder Role promoted to
#   ClusterRole so the Job can patch both namespaces.
# - values.yaml: new knobs `cnpgPairAliasName`,
#   `cnpgPairPostSwitchoverPrimary`, `continuumPlatformNamespace` —
#   all values-overridable per INVIOLABLE-PRINCIPLES #4.
# 1.4.139 (Fix #163, 2026-05-11, MIRROR-EVERYTHING): every chart-hook
# image reference in this Blueprint (catalyst-gitea-token-secret +
# qa-fixtures Jobs) now uses the explicit
# harbor.openova.io/proxy-dockerhub prefix per CLAUDE.md inviolable
# rule. No functional change — node-level containerd mirror already
# routed these pulls correctly; this makes the routing auditable in
# SBOM scans and Kyverno harbor-proxy-pull ClusterPolicy.
# 1.4.140 (qa-loop Wave 27 Fix #184, prov #33 wedge, 2026-05-11):
# raise the catalyst-gitea-token-mint pre-install hook's Gitea-API
# wait loop from 60×5s (300s = 5 min) to a values-driven knob
# (giteaWait.iterations × giteaWait.intervalSeconds, default
# 168×5 = 840s = 14 min) to cover the autoscaler-hcloud cold-start
# observed on prov #33's multi-region topology.
#
# Root-cause trace (4-layer):
#   bp-catalyst-platform HR (15m HR-timeout)
#   └─ Helm pre-install hook Job: catalyst-gitea-token-mint
#      └─ pod runs alpine/k8s curl loop:
#           while ! curl gitea-http.gitea.svc.cluster.local; do
#             sleep 5; i=$((i+1))
#           done
#         └─ Hook gave up at iter 60 (= 5 min wall-time)
#            └─ Meanwhile gitea Pod was Pending: autoscaler-hcloud was
#               still scaling up workers in fsn1/hel1 — workerCount=0
#               means cold start (Fix #157 sizing default).
#
# Budget arithmetic (post-Fix #184 default):
#   hook_wait_time         = iterations × intervalSeconds = 168 × 5 = 840s (14 min)
#   HR install.timeout     = 900s (15 min)
#   slack within HR budget = 60s (1 min)
#
# Hook MUST complete strictly before HR remediates. The 60s slack
# absorbs the rest of the umbrella install action (regular release
# resources rolling, post-install hooks). Per docs/INVIOLABLE-
# PRINCIPLES.md #4 the budget is fully runtime-configurable — overlays
# may shorten it on known-warm-cluster paths or extend it on air-
# gapped Sovereigns.
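#
# Illustrative values override (a sketch, not copied from the chart): the
# knob names giteaWait.iterations and giteaWait.intervalSeconds and the
# 168 × 5 default come from the note above; their placement at the top
# level of values.yaml is an assumption.
#
#   giteaWait:
#     iterations: 168      # default; 168 × 5s = 840s, 60s under the 900s HR timeout
#     intervalSeconds: 5
#
# Any override should keep iterations × intervalSeconds below the HR
# install.timeout; otherwise the HR times out while the hook is still waiting.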
#
# Recurring class: same family as Fix #127 (bp-cutover HR 15m),
# Fix #131 (bp-gitea HR 15m), Fix #150 (bp-harbor HR 15m),
# Fix #154 (HR-timeout audit). Those bumped the HelmRelease
# install.timeout. This bumps the chart-INTERNAL wait loop budget
# inside the pre-install hook Job, which is a different seam.
version: 1.4.143
appVersion: 1.4.94
# 1.4.141 (qa-loop Fix #185, prov #38/#39/#41 recurrence — pre-install
# hook unschedulable on saturated worker):
#
# Symptom (prov #41, omantel.biz, 2026-05-12 00:28 UTC):
#   bp-catalyst-platform HR stuck Reconciling → InstallFailed →
#   "failed pre-install: timed out waiting for the condition" after 15m.
#   Flux uninstall remediation runs, then re-installs, and the loop
#   repeats until `installFailures: 3`, after which Flux gives up entirely.
#
# Root cause:
#   The qa-finalizer-strip pre-install Job (helm.sh/hook-weight -99,
#   introduced by Fix #114 to break a finalizer-deadlock loop) has no
#   tolerations. On a fresh Sovereign with workerCount=0 + autoscaler
#   (Fix #157), the FIRST autoscaled worker is sized just large enough
#   for the rest of the bootstrap-kit Pods; by the time
#   bp-catalyst-platform HR triggers pre-install, the worker is at
#   99% CPU requests (7980m of 8000m allocated) and the autoscaler
#   has backed off scale-up of a second worker. Pod sits Pending
#   forever ("FailedScheduling: 0/2 nodes are available: 1
#   Insufficient cpu, 1 node(s) had untolerated taint
#   {node-role.kubernetes.io/control-plane: true}"). Helm pre-install
#   times out, Flux remediates 3×, gives up.
#
# Fix: add tolerations for control-plane NoSchedule + master taints +
# priorityClassName: system-cluster-critical to the qa-finalizer-strip
# Job. The hook is a defense-in-depth cleanup that runs in seconds; it
# MUST be schedulable somewhere on the cluster regardless of worker
# saturation. Control-plane node on prov #41 sits at 7% CPU / 9%
# memory — 7365m CPU free vs. the hook's 50m request.
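#
# A hedged sketch of the pod-spec fields this fix describes. The taint
# keys and the priority class are standard Kubernetes names and the 50m
# request is quoted above; the container name is hypothetical and the
# exact template layout is an assumption.
#
#   spec:
#     template:
#       spec:
#         priorityClassName: system-cluster-critical
#         tolerations:
#           - key: node-role.kubernetes.io/control-plane
#             operator: Exists
#             effect: NoSchedule
#           - key: node-role.kubernetes.io/master
#             operator: Exists
#             effect: NoSchedule
#         containers:
#           - name: finalizer-strip   # hypothetical name
#             resources:
#               requests:
#                 cpu: 50m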
#
# Why prior fixes didn't suffice:
# - Fix #114 introduced this hook; never anticipated worker
#   saturation at install time.
# - Fix #138 (1.4.138) addressed CIRCULAR-DEP post-install seeders,
#   a different hook surface.
# - Fix #184 (1.4.140) raised the gitea-token-mint pre-install hook
#   (weight +10) wait budget. That hook runs AFTER qa-finalizer-strip
#   (-99 < +10); if the -99 hook never starts, the +10 hook never
#   runs either.
#
# Coupled chart hygiene (rule 17, MIRROR-EVERYTHING + ARCHITECT-FIRST):
# - Switch image from bitnamilegacy/kubectl:1.29.3 (Docker-Hub
#   redirect for deprecated Bitnami images, 2025-08 cutover) to
#   harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4 — the
#   canonical alpine-based kubectl image already used by sibling
#   hook catalyst-gitea-token-mint (Fix #163).
#
# Recurring class: same family as Fix #114 (hook scheduling failure
# wedges entire HR install), Fix #138 (circular-dep hooks), Fix #184
# (cold-start budget). This addresses the SCHEDULING surface of the
# weight -99 hook itself.
# 1.4.129 (qa-loop iter-16 Fix #65): ship the missing
# `openova-catalog` Flux v1 HelmRepository in flux-system. The
# application-controller has always defaulted its rendered HelmRelease
# `sourceRef.name` to `openova-catalog` (env: CATALOG_SOURCE_REF), but
# no chart template ever shipped the matching CR. Result: every
# Application reconciled by the controller produced a HelmRelease
# pointing at a non-existent source, Flux's helm-controller logged
# `Source 'HelmRepository/openova-catalog' not found`, and no Pod was
# ever scheduled. The Application CR sat at status.phase=Pending
# forever — the qa-wp Application on qa-omantel never materialised
# its nginx Pod / Service / ConfigMap, blocking ~30 qa-loop matrix TCs
# (TC-066/100/103/104/109/113/216/262 + every other qa-omantel
# namespaced test). Per docs/INVIOLABLE-PRINCIPLES.md #1 (target-state)
# the chart now ships the missing source CR; the controller's default
# is now a non-dangling reference on every Sovereign install. Per
# Inviolable Principle #4 every field is overridable via per-Sovereign
# overlays (e.g. swing url to a local Harbor proxy_cache via the
# cutover-driver). New file: templates/openova-catalog-helmrepository.
# yaml. New values block: catalog.helmRepository.{enabled,name,
# namespace,type,url,secretRef,interval}.
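#
# Sketch of the kind of CR this template renders, under assumptions: the
# name and namespace come from the note above and the apiVersion is Flux's
# source.toolkit.fluxcd.io/v1; the type, interval and url shown are
# placeholders, not the chart's real defaults.
#
#   apiVersion: source.toolkit.fluxcd.io/v1
#   kind: HelmRepository
#   metadata:
#     name: openova-catalog
#     namespace: flux-system
#   spec:
#     type: oci                                # placeholder value
#     interval: 10m                            # placeholder value
#     url: oci://registry.example.com/charts   # placeholder URL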
description: |
  Catalyst Platform — the unified Catalyst control plane umbrella chart for Catalyst-Zero.
  Composes the catalyst-{ui,api}, console, admin, marketplace UI modules and the marketplace-api backend.
  Deployed via Flux on Catalyst-Zero (Contabo k3s) and on every franchised Sovereign provisioned by Catalyst-Zero.
  Per docs/PROVISIONING-PLAN.md — this is the canonical bp-catalyst-platform Helm chart.

  As of 1.1.9 this umbrella contains ONLY the Catalyst-Zero control-plane
  workloads (catalyst-ui, catalyst-api, ProvisioningState CRD, Sovereign
  HTTPRoute). Foundation Blueprints (cilium, cert-manager, flux,
  crossplane, sealed-secrets, spire, nats-jetstream, openbao, keycloak,
  gitea) are installed independently by the bootstrap-kit at slots
  01..10 (see clusters/_template/bootstrap-kit/). Each lands in its own
  namespace (flux-system, cert-manager, kube-system, etc.) under its own
  Flux HelmRelease — install order owned by Flux dependsOn rather than
  this umbrella's Helm dependency graph.

  Bumped to 1.1.1 in lockstep with bp-external-dns 1.1.0 to reflect the
  dependency removal. Bumped to 1.1.2 to pull in bp-flux:1.1.2 — the
  catastrophic-double-install fix (omantel.omani.works incident,
  2026-04-29). See docs/RUNBOOK-PROVISIONING.md §"bp-flux double-install".
  Bumped to 1.1.3 to drop three stray kustomize index files
  (templates/kustomization.yaml, templates/marketplace-api/kustomization.yaml,
  templates/sme-services/kustomization.yaml) that Helm was rendering as
  resources with empty metadata.name — Helm post-render rejected the
  install on otech.omani.works, 2026-04-30.
  Bumped to 1.1.4 to give the bp-keycloak/bp-gitea embedded postgresql
  subcharts distinct fullnameOverride values (keycloak-postgresql /
  gitea-postgresql). Both bitnami postgresql subcharts default to
  `<release>-postgresql`, so they collided as
  `catalyst-platform-postgresql.catalyst-system` and Helm post-render
  refused the second occurrence — install_failed on otech.omani.works,
  2026-04-30 (issue #252).
  Bumped to 1.1.5 to remove three legacy Traefik-era ingress template
  files (templates/ingress.yaml, templates/sme-services/ingress.yaml,
  templates/marketplace-api/ingress.yaml). They emitted
  `traefik.io/v1alpha1 Middleware` (strip-sovereign, strip-nova,
  root-to-nova) plus Ingress objects hardcoded to `console.openova.io` /
  `admin.openova.io` / `marketplace.openova.io` / `openova.io` with
  `ingressClassName: traefik`. Sovereigns use Cilium native gateway
  (per docs/ARCHITECTURE.md §11) — Traefik CRDs are not installed and
  never will be — and per-Sovereign Catalyst hostnames are
  `console.${SOVEREIGN_FQDN}` / `admin.${SOVEREIGN_FQDN}` etc., not the
  contabo-mkt openova.io domain. Helm install was failing on otech with
  `no matches for kind "Middleware" in version "traefik.io/v1alpha1"`.
  Per-Sovereign HTTPRoute resources for the Catalyst console/admin/
  marketplace will be authored separately (out of scope here) — issue
  #279, 2026-04-30.
  Bumped to 1.1.6 to delete the entire `templates/sme-services/`
  directory (admin/auth/billing/catalog/configmap/console/domain/
  gateway/marketplace/notification/provisioning/serviceaccounts/tenant
  — 13 manifests, ~36 resources). Every one of them was hardcoded to
  `namespace: sme` and to `sme.openova.io` URLs. The SME microservice
  mesh is a contabo-mkt-only product (the OpenOva.io marketplace) that
  was dragged into the Catalyst umbrella during Group C cutover; it
  has no role on franchised Sovereigns. Sovereigns don't run SME and
  don't have an `sme` namespace, so the Helm install was failing with
  `failed to create resource: namespaces "sme" not found` on
  otech.omani.works. Resolution: SME services are out of scope for the
  bp-catalyst-platform Blueprint — they will be re-homed in a
  contabo-mkt-only Kustomization (or a separate `bp-sme` Blueprint)
  if/when SME is re-deployed. Issue #281, 2026-04-30.
  Bumped to 1.1.9 to remove the 10 foundation-Blueprint subchart
  dependencies (bp-cilium, bp-cert-manager, bp-flux, bp-crossplane,
  bp-sealed-secrets, bp-spire, bp-nats-jetstream, bp-openbao,
  bp-keycloak, bp-gitea). When this umbrella reconciled with
  `targetNamespace: catalyst-system`, Helm rendered every subchart's
  `flux2` / `cilium` / etc. controllers into catalyst-system —
  duplicating the foundation stack the bootstrap-kit had already
  installed at slots 01..10 in their own canonical namespaces
  (flux-system, cert-manager, kube-system, ...). On Phase-8a-preflight
  otech16 (2026-05-02) this manifested as a duplicate source-controller
  in catalyst-system NS that other HRs (bp-cnpg, bp-spire,
  bp-crossplane-claims) intermittently routed to via service discovery,
  failing chart pulls with "i/o timeout" against
  `source-controller.catalyst-system.svc.cluster.local`. Resolution:
  the umbrella ships ONLY Catalyst-Zero control-plane workloads; the
  foundation layer is owned end-to-end by the bootstrap-kit. Issue
  #510, 2026-05-02.
  Bumped to 1.1.12 to add optional=true to the DYNADOT_API_KEY and
  DYNADOT_API_SECRET secretKeyRef entries in the catalyst-api Deployment.
  Sovereign clusters don't hold Dynadot credentials (their tenant DNS
  is served by the Sovereign's own PowerDNS instance); without
  optional=true Kubernetes refuses to start the pod when the
  dynadot-api-credentials Secret is absent, crashlooping catalyst-api
  on every new Sovereign. The fix mirrors the existing optional=true on
  DYNADOT_MANAGED_DOMAINS and DYNADOT_DOMAIN. Issue #547, 2026-05-02.
  Bumped to 1.1.13 to rename all imagePullSecrets references from
  ghcr-pull-secret to ghcr-pull (canonical name written by cloud-init at
  /var/lib/catalyst/ghcr-pull-secret.yaml). The wrong name was causing
  ImagePullBackOff on catalyst-api, catalyst-ui, marketplace-api and all
  11 SME service deployments. Paired with new bp-reflector (slot 05a)
  that auto-mirrors flux-system/ghcr-pull to every namespace via
  reflector.v1.k8s.emberstack.com annotations. Issue #543, 2026-05-02.
  Bumped to 1.1.14 to add global.imageRegistry value and template all
  Catalyst-authored image refs (catalyst-api, catalyst-ui, marketplace-api,
  console, and all 10 SME service deployments). Post-handover per-Sovereign
  overlays set global.imageRegistry to the local Harbor mirror. Issue #560.
  Bumped to 1.1.15 to rebuild catalyst-ui with Vite base: '/' (was
  /sovereign/). The previous base caused blank pages on Sovereign clusters:
  the browser requested /sovereign/assets/index-*.js but nginx served the
  dist at / so every asset returned 404. On contabo
  (console.openova.io/sovereign/*) Traefik's strip-sovereign Middleware strips
  the prefix before reaching nginx — both environments now serve assets at
  /assets/* as expected. Also fixes router.tsx basepath from '/sovereign' to
  '/' so TanStack Router Link/navigate calls emit correct paths. Issue #596,
  2026-05-02.

  Bumped to 1.1.16 to bundle catalyst-ui image tag 59fb2b7 (Vite base:/
  fix from #596) into the OCI chart values.yaml. Chart 1.1.15 was
  published at commit 32c5e433 before the deploy job updated values.yaml
  SHA tags to 59fb2b7, so Sovereigns pulling 1.1.15 got the old
  ccc3898 image. 1.1.16 ships with catalystUi.tag + catalystApi.tag =
  59fb2b7 baked in. Issue #596, 2026-05-02.

  Bumped to 1.2.0 — feature add: GET /auth/handover seamless single-identity
  flow (issue #606, Phase-8b Agent C). Adds:
  - CATALYST_KC_ADDR / CATALYST_KC_SA_CLIENT_ID / CATALYST_KC_SA_CLIENT_SECRET env
  - CATALYST_HANDOVER_JWT_PUBLIC_KEY_PATH env + Secret volume for handover JWK
  Sovereign-side catalyst-api pods receive the operator's browser redirect from
  Catalyst-Zero, validate the one-time RS256 JWT, create/update the operator in
  Keycloak (sovereign realm), exchange for a user session via token-exchange,
  set HttpOnly session cookies, and redirect to /console/dashboard. 2026-05-02.

  Bumped to 1.2.1 — Option-B pure passwordless magic-link (issue #614,
  Phase-8b). Replaces Agent A's Keycloak execute-actions-email (PKCE) flow with
  a fully server-side path:
  - catalyst-api mints its own RS256 JWT (same signer keypair as Agent B)
  - Sends link via Stalwart SMTP (noreply@openova.io)
  - GET /api/v1/auth/magic validates JWT, single-use jti, KC token-exchange,
    sets HttpOnly cookies, redirects to /sovereign/wizard
  - ZERO Keycloak UI exposure, ZERO browser PKCE round-trip
  Adds CATALYST_OPENOVA_KC_* env refs from new catalyst-openova-kc-credentials
  Secret + CATALYST_SESSION_COOKIE_DOMAIN. 2026-05-02.

  Bumped to 1.2.5 — Phase-8b live followup on otech48 (2026-05-03). Two
  handover bugs caught on the live single-identity flow:

  1. Sovereign-side catalyst-api responded to GET /auth/handover with
     "server misconfiguration: public key unavailable" — the K8s Secret
     `catalyst-handover-jwt-public` was never created, so the optional
     Secret-volume mount fell through and the JWK file was absent inside
     the container. 1.2.0 wired the volume mount but no provisioning
     step materialised the Secret. Fix paired with infra/hetzner/
     cloudinit-control-plane.tftpl — cloud-init now writes the Secret
     manifest into catalyst-system NS and runcmd applies it BEFORE
     flux-bootstrap, mirroring the canonical pattern that flux-system/
     ghcr-pull (PR #543) and flux-system/harbor-robot-token (PR #680)
     already follow. The chart-side change moves the volume mount off
     the catalyst-api PVC (mountPath /etc/catalyst/handover-jwt-public,
     no subPath) so a leftover empty directory in the PVC from pre-#606
     installs cannot collide with a re-provisioned Secret mount, and
     updates CATALYST_HANDOVER_JWT_PUBLIC_KEY_PATH to point at the new
     location.

  2. /auth/handover validator rejected every valid JWT with 401
     "invalid audience" because SOVEREIGN_FQDN was unset — the audience
     check collapsed to the literal "https://console." prefix.
     bp-catalyst-platform's HelmRelease overlay was already setting
     `global.sovereignFQDN` but the chart template never plumbed it
     through to the Pod env. Added a SOVEREIGN_FQDN env reading
     `.Values.global.sovereignFQDN` (default "" so Catalyst-Zero
     installs, where catalyst-api is the SIGNER not the validator,
     stay clean).

  To be verified live on otech49+: a fresh provision should reach
  https://console.otech49.omani.works/auth/handover?token=... and
  exchange to a Keycloak session WITHOUT manual Secret creation.
  Issue #606 followup, 2026-05-03.

  Bumped to 1.2.3 — RCA + permanent fix for catalyst-api Pods stuck in
  CreateContainerConfigError on every fresh Sovereign because the
  required (non-optional) `harbor-robot-token` secretKeyRef had no
  source. Caught live on otech43, otech45, otech46 — operator was
  hand-creating a placeholder Secret each iteration. Root cause: the
  chart references `harbor-robot-token` as required but nothing
  materialised it on the Sovereign cluster. The token VALUE was
  already arriving (cloud-init interpolates var.harbor_robot_token
  into /etc/rancher/k3s/registries.yaml), but no Kubernetes Secret
  was created for catalyst-api to mount. Fix paired with
  infra/hetzner/cloudinit-control-plane.tftpl: cloud-init now writes
  /var/lib/catalyst/harbor-robot-token-secret.yaml into flux-system ns
  with auto-mirror Reflector annotations, runcmd applies it BEFORE
  flux-bootstrap, and bp-reflector (slot 05a) propagates it into
  catalyst-system on first reconcile — exactly the canonical pattern
  flux-system/ghcr-pull already uses (PR #543). Chart-side change is
  a comment update on the secretKeyRef explaining the new seam.
  Issue #557 follow-up, 2026-05-03.

  Bumped to 1.2.6 — Phase-1 watcher status transition fix (otech48
  incident, 2026-05-03). All 37 bp-* HelmReleases reached Ready=True
  on the Sovereign cluster but the catalyst-api deployment record
  stayed status=phase1-watching. Wizard's POST /mint-handover-token
  returned 409 not-handover-ready, blocking the auto-redirect to
  console.otech48.omani.works/auth/handover.
  Root cause: helmwatch's terminate-on-all-done gate required
  `len(observed) >= MinBootstrapKitHRs`. Chart shipped
  CATALYST_PHASE1_MIN_BOOTSTRAP_KIT_HRS=38 (matched the kit count
  it was originally tuned against), but the actual bootstrap-kit
  cardinality had drifted to 37 — making the gate permanently
  unsatisfiable. Watch ran until 60-minute WatchTimeout fired.
  Fix:
  - helmwatch: gate terminate-on-all-done on the informer's
    HasSynced signal (after WaitForCacheSync the full bp-* set is
    in cache regardless of cardinality). MinBootstrapKitHRs stays
    as a defence-in-depth floor (now default 1).
  - chart env: CATALYST_PHASE1_MIN_BOOTSTRAP_KIT_HRS=1 (was 38).
  - watcher: emit operator-visible "All N blueprints reconciled.
    Sovereign ready for handover." SSE event on transition
    (idempotent).
  - handler: persistDeployment after markPhase1Done so the on-disk
    JSON reflects status=ready before any wizard poll. Refuse to
    downgrade adopted status on late watcher events. Issue #TBD.

  Bumped to 1.3.1 — Phase-8b handover DNS-resolution fix (otech94
  incident, 2026-05-04, issue #781). On a fresh Sovereign the
  handover URL returned `{"error":"keycloak error: ensure user"}`
  with a `dial tcp: lookup auth.<sov-fqdn> on 10.43.0.10:53: no
  such host` inside the catalyst-api Pod. Root cause: the cluster's
  CoreDNS resolves *.<sov-fqdn> via the upstream resolvers — it
  does NOT forward to the in-cluster PowerDNS that holds those
  records. Public DNS works (PowerDNS authoritative), but Pod-side
  lookups of auth.<sov-fqdn> return NXDOMAIN.

  No catalyst chart manifest needed change (api-deployment.yaml
  already reads CATALYST_KC_ADDR from a secretKeyRef into
  catalyst-kc-sa-credentials). The fix lives in bp-keycloak 1.3.2:
  the Secret's `addr` value now resolves to the in-cluster Service
  URL (http://keycloak.keycloak.svc.cluster.local) instead of the
  public gateway host (https://auth.<sov-fqdn>). The HTTPRoute
  hostname (.Values.gateway.host) stays at auth.<sov-fqdn> for
  operator browsers — only the catalyst-api Pod's intra-cluster
  OAuth client_credentials calls switch to the Service URL.
  Catalyst-Zero (contabo) uses keycloak-zero (separate chart) and
  is unaffected. 2026-05-04.

  Bumped to 1.3.2 — Day-2 cutover RBAC P0 fix (otech102 incident,
  2026-05-04, issue #830 Bug 1). The /api/v1/sovereign/cutover/start
  endpoint returned 502 status-read-failed: "User
  \"system:serviceaccount:catalyst-system:default\" cannot get resource
  \"configmaps\" in API group \"\" in the namespace \"catalyst\"". The
  catalyst-api Pod was running under the catalyst-system/default
  ServiceAccount with no Role/ClusterRole binding to read or patch the
  cutover ConfigMaps + create/watch Jobs in the `catalyst` namespace
  where bp-self-sovereign-cutover ships its step ConfigMaps.
  Fix: add a dedicated ServiceAccount + ClusterRole + ClusterRoleBinding
  shipped by THIS chart:
  - serviceaccount-cutover-driver.yaml — ServiceAccount
    catalyst-api-cutover-driver in catalyst-system
  - clusterrole-cutover-driver.yaml — ClusterRole granting
    get/list/watch + patch on configmaps; create/get/list/watch/
    delete/patch on batch/jobs; get/list/watch on pods + apps/
    deployments + apps/daemonsets; create on events. Per
    feedback_rbac_create_no_resourcenames.md the `create` verbs are
    split into their own Rule WITHOUT resourceNames (combining
    create + resourceNames produces a 403 on every POST).
  - clusterrolebinding-cutover-driver.yaml — bind the SA to the
    ClusterRole at cluster scope (cutover namespace is runtime-
    configurable via CATALYST_CUTOVER_NAMESPACE).
  Plus api-deployment.yaml: spec.serviceAccountName set to
  catalyst-api-cutover-driver. Issue #830, 2026-05-04.

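  A minimal sketch of the create-verb split described for 1.3.2 (an
  illustration, not the shipped manifest; the ClusterRole name is
  hypothetical and only the configmaps, jobs and events rules are shown):

      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        name: catalyst-api-cutover-driver   # hypothetical name
      rules:
        - apiGroups: [""]
          resources: ["configmaps"]
          verbs: ["get", "list", "watch", "patch"]
        - apiGroups: ["batch"]
          resources: ["jobs"]
          verbs: ["get", "list", "watch", "delete", "patch"]
        # create sits in its own rule with no resourceNames: the object
        # name may not be known when a POST is authorized, so a
        # name-restricted create rule never matches.
        - apiGroups: ["batch"]
          resources: ["jobs"]
          verbs: ["create"]
        - apiGroups: [""]
          resources: ["events"]
          verbs: ["create"]
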
  Bumped to 1.4.0 — multi-zone parent-domain support (issue #827,
  parent epic #825). A franchised Sovereign now supports N parent
  zones, NOT one. New values:
  - parentZones: []         — list of parent domains (`omani.works`,
                              `omani.trade`, ...)
  - wildcardCert.enabled    — toggle the per-zone Cert render
  - wildcardCert.namespace  — kube-system (Cilium Gateway home)
  - wildcardCert.issuerName — letsencrypt-dns01-prod-powerdns
  - catalystApi.powerdnsURL — base URL of the Sovereign's
                              in-cluster PowerDNS REST API,
                              threaded into the catalyst-api Pod
                              as CATALYST_POWERDNS_API_URL so the
                              admin-console "Add another parent
                              domain" flow (#829) can call the
                              real PowerDNS for runtime zone
                              creation. Empty = in-code default
                              (powerdns.powerdns.svc:8081).
  New template templates/sovereign-wildcard-certs.yaml renders one
  cert-manager.io/v1.Certificate per parentZone. Each cert renews
  independently; a stalled DNS-01 challenge on one zone does not
  block another. The chart skips render entirely when parentZones
  is empty so the legacy single-zone path
  (clusters/_template/sovereign-tls/cilium-gateway-cert.yaml) keeps
  ownership of `sovereign-wildcard-tls` without helm-vs-kustomize
  ownership flap. Pairs with bp-powerdns 1.2.0 (which now creates
  N zones at install time via a Helm hook Job) and the
  /api/v1/sovereign/parent-domains catalyst-api endpoint (the
  admin-console add-domain flow #829). 2026-05-04.

  Bumped to 1.4.1 — Day-2 cutover RBAC dual-mode fix (issue #830 Bug 1
  follow-up, 2026-05-04). Chart 1.3.2 shipped serviceaccount-cutover-
  driver.yaml + clusterrole-cutover-driver.yaml + clusterrolebinding-
  cutover-driver.yaml with `{{ .Release.Namespace }}` directives that
  rendered fine via Helm on Sovereigns but BROKE the Kustomize-mode
  contabo-mkt deploy: the directives made Kustomize parse the files as
  invalid YAML and silently skip them. Worse, the new files were never
  added to templates/kustomization.yaml's resources list, so even if
  the YAML had been valid Kustomize would not have rendered them.
  Result on contabo: catalyst-api Pod's spec.serviceAccountName
  references a non-existent SA — the Pod fails ContainerCreating with
  the same RBAC forbidden error #830 was meant to fix.
  Fix:
  - Strip all `{{ .Release.Namespace }}` directives from the SA +
    ClusterRole files. metadata.namespace auto-fills from Helm's
    --namespace flag and from Kustomize's `namespace:` directive.
  - Split ClusterRoleBinding into Helm-only +
    Kustomize-only sibling files because Helm does NOT auto-inject
    subjects[0].namespace the way it does metadata.namespace, and the
    apiserver rejects bindings without it. clusterrolebinding-
    cutover-driver.yaml uses {{ .Release.Namespace }} (Helm-only,
    excluded from .helmignore for Sovereigns); clusterrolebinding-
    cutover-driver-kustomize.yaml omits subjects[0].namespace and
    relies on Kustomize's native injection (contabo-only).
  - Add the three new files to templates/kustomization.yaml's
    resources list so Kustomize-mode (contabo-mkt) actually renders
    them.
  This fix mirrors the same dual-mode contract documented in api-
  deployment.yaml comments. Verified with `helm template` (subjects[0].
  namespace=catalyst-system) AND `kubectl kustomize` (subjects[0].
  namespace=catalyst). 2026-05-04.

  Bumped to 1.4.2 — dual-mode contract violation in 1.4.0
  CATALYST_POWERDNS_API_URL block (issue #830 follow-up, 2026-05-04).
  PR #838 introduced two `value: {{ default "..." .Values... | quote }}`
  Helm directives in api-deployment.yaml's CATALYST_POWERDNS_API_URL +
  CATALYST_POWERDNS_SERVER_ID env entries. Both broke the Kustomize-
  mode contabo-mkt build with "yaml: invalid map key: map[string]
  interface {}{...}", stalling every contabo reconciliation including
  THIS chart's own RBAC fix from 1.4.1.
  Same pattern as the SOVEREIGN_FQDN block right below in the same
  file (extensively documented as a dual-mode hazard): replace the
  Helm directive with a literal default. The in-cluster Service URL
  is a non-secret constant on every Sovereign that ships bp-powerdns
  at its canonical release name; per-Sovereign overrides are still
  possible via the HelmRelease overlay's `catalystApi.env` additional-
  env patch (which takes precedence). 2026-05-04.

Bumped to 1.4.3 — auto-provision SME Postgres + secrets bundle on
|
||
Sovereign install (issue #859, 2026-05-04). The 11 SME service
|
||
Deployments (auth, billing, catalog, console, domain, gateway,
|
||
marketplace, notification, provisioning, tenant — plus admin which
|
||
has no DB/secret refs) reference two cluster-scoped resources:
|
||
- `sme-pg-app` Secret (basic-auth: username + password) backing the
|
||
sme-pg-rw.sme.svc.cluster.local Postgres Service
|
||
- `sme-secrets` Secret with 11 keys: JWT_SECRET, JWT_REFRESH_SECRET,
|
||
GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET, SMTP_HOST/PORT/FROM/USER/
|
||
PASS, ADMIN_EMAIL, ADMIN_PASSWORD
|
||
On contabo-mkt these are pre-provisioned in
|
||
clusters/contabo-mkt/apps/sme/data/{postgresql,secrets}.yaml. On a
|
||
freshly franchised Sovereign nothing equivalent existed — caught
|
||
live on otech103 (2026-05-04 23:18 Berlin) where 10 of 11 SME pods
|
||
landed in CreateContainerConfigError after MARKETPLACE_ENABLED=true.
|
||
|
||
Fix:
- templates/sme-services/cnpg-cluster.yaml: gated on the same
  .Values.ingress.marketplace.enabled flag the rest of the SME
  bundle uses. Renders postgresql.cnpg.io/v1.Cluster `sme-pg` in
  `sme` namespace, instances=1, storage=10Gi, primary DB sme_auth
  + secondary DB sme_billing via postInitApplicationSQL. CNPG
  auto-creates `sme-pg-app` Secret and the `sme-pg-rw` Service.
  Capabilities-gated on postgresql.cnpg.io/v1 so a misordered
  overlay surfaces as "no Cluster yet" rather than chart install
  failure (mirrors
  platform/powerdns/chart/templates/cnpg-cluster.yaml).
  bp-catalyst-platform (slot 13) declares dependsOn: bp-cnpg
  (slot 16), already in place since 2026-05-02 (see 1.1.9
  changelog), so by reconcile time the CRD is registered.
- templates/sme-services/sme-secrets.yaml: gated on the same
  flag. JWT_SECRET / JWT_REFRESH_SECRET / ADMIN_PASSWORD are
  auto-generated via sprig randAlphaNum (64 / 64 / 32 chars
  respectively) AND PERSISTED across reconciles via Helm `lookup`,
  the same load-bearing pattern as
  platform/gitea/chart/templates/admin-secret.yaml (issue #830
  Bug 2). Without lookup every reconcile would invalidate every
  active SME session and lock out every admin
  (feedback_passwords.md). Operator-supplied GOOGLE_CLIENT_*,
  SMTP_* values default to empty placeholders; operator brings
  real values via the per-Sovereign overlay or the admin-console
  signup form. helm.sh/resource-policy: keep so the Secret
  survives helm uninstall.
- values.yaml: add `smePostgres.cluster.*` (storage / pgVersion
  / resources / ...) and `smeSecrets.{smtp,admin}.*` blocks; both
  fully data-driven per Inviolable Principle #4.

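The cnpg-cluster.yaml bullet above could be sketched roughly as follows. This is an illustration under assumptions, not the chart's actual template: the `owner: sme` value is inferred from the `CREATE DATABASE ... OWNER sme` statement quoted in the 1.4.4 entry, and the field layout follows the CloudNativePG `Cluster` API as commonly documented.

```yaml
# Sketch only: render the Cluster when the marketplace flag is on AND the
# CNPG API group is already registered, so a misordered overlay renders
# nothing instead of failing the chart install.
{{- if and .Values.ingress.marketplace.enabled (.Capabilities.APIVersions.Has "postgresql.cnpg.io/v1") }}
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: sme-pg
  namespace: sme
spec:
  instances: 1
  storage:
    size: 10Gi
  bootstrap:
    initdb:
      database: sme_auth        # primary DB
      owner: sme                # assumed owner role
      postInitApplicationSQL:
        - CREATE DATABASE sme_billing OWNER sme;   # secondary DB
{{- end }}
```

With this shape CNPG derives the `sme-pg-app` Secret and `sme-pg-rw` Service names from `metadata.name`, which is why the other SME templates can reference them without extra values.
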
Lockstep slot 13 pin in
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml bumps
from 1.4.2 → 1.4.3. 2026-05-04.

Bumped to 1.4.4: deploy FerretDB in sme ns + cross-ns Valkey wire
to unblock catalog/tenant/domain SME services on franchised
Sovereigns (issue #861, 2026-05-04). After 1.4.3 landed sme-pg +
sme-secrets, 7/12 SME pods reached Running on otech103 but 3 stayed
in CrashLoopBackOff with the same DNS error:

  catalog: failed to ping MongoDB
  error=...lookup ferretdb.sme.svc.cluster.local on 10.43.0.10:53:
  no such host

Root cause: SME service ConfigMap (sme-services-config) hardcoded
two URLs that have no Sovereign-side workload behind them:
- MONGODB_URI: mongodb://ferretdb.sme.svc.cluster.local:27017
  (FerretDB has no Deployment on Sovereigns, only on contabo-mkt
  via clusters/contabo-mkt/apps/sme/data/ferretdb.yaml)
- VALKEY_ADDR: valkey.sme.svc.cluster.local:6379
  (bp-valkey 1.0.0 deploys to namespace `valkey`, not `sme`,
  and exposes Services `valkey-primary` / `valkey-replicas` /
  `valkey-headless`; there is no plain `valkey` service)

Fix:
- NEW templates/sme-services/ferretdb.yaml: gated on the same
  .Values.ingress.marketplace.enabled flag. Deployment + Service
  `ferretdb` in `sme` ns, image pinned ghcr.io/ferretdb/ferretdb:1.24
  (matches contabo's data/ferretdb.yaml; v2.x requires PostgreSQL
  with the DocumentDB extension which the sme-pg CNPG cluster from
  PR #860 does not ship; v1.24 works against vanilla CNPG postgres:
  16 and is the proven path). Backed by sme-pg via
  FERRETDB_POSTGRESQL_URL env interpolating PG_USER + PG_PASSWORD
  from the sme-pg-app Secret (auto-created by CNPG in 1.4.3) and
  pointing at sme-pg-rw.sme.svc.cluster.local:5432/sme_documents.
  Image is operator-overridable via
  .Values.smeServices.ferretdb.{image,tag} (Inviolable Principle #4).
- cnpg-cluster.yaml: extend postInitApplicationSQL to also
  CREATE DATABASE sme_documents OWNER sme so FerretDB has a DB to
  write into on first install. The DB list is data-driven from
  .Values.smePostgres.cluster.additionalDatabases (defaulting to
  [sme_billing, sme_documents]) so adding a new SME service is a
  values-only change.
- configmap.yaml: VALKEY_ADDR now reads from
  .Values.smeServices.valkey.host (default
  valkey-primary.valkey.svc.cluster.local:6379, the actual Service
  name bitnami/valkey 5.5.1 with replication architecture renders,
  NOT the issue's `valkey.valkey.svc.cluster.local` which doesn't
  exist on Sovereigns). MONGODB_URI also uses
  .Values.smeServices.ferretdb.{host,port} for symmetry.
- NEW templates/sme-services/valkey-cross-ns-policy.yaml:
  CiliumNetworkPolicy in `valkey` namespace allowing ingress on
  6379/TCP from any Pod in the `sme` namespace. Defense-in-depth on
  top of bp-valkey 1.0.0's upstream NetworkPolicy (which already
  permits port 6379 from any source). Gated on the same
  marketplace.enabled flag.
- values.yaml: add `smeServices.ferretdb.{image,tag,replicas,
  resources}` and `smeServices.valkey.host` blocks. Every URL,
  image ref, and resource value is operator-overridable per
  Inviolable Principle #4.

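For orientation, the cross-namespace policy described in the list above might look like the sketch below. The resource name is hypothetical, and selecting every endpoint in `valkey` with an empty `endpointSelector` is an assumption; the real template may scope it to the Valkey Pods' labels.

```yaml
# Sketch only: allow any Pod in the `sme` namespace to reach 6379/TCP
# on endpoints in the `valkey` namespace.
{{- if .Values.ingress.marketplace.enabled }}
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-sme-to-valkey      # hypothetical name
  namespace: valkey
spec:
  endpointSelector: {}           # assumption: all endpoints in `valkey`
  ingress:
    - fromEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: sme
      toPorts:
        - ports:
            - port: "6379"
              protocol: TCP
{{- end }}
```

Because a CiliumNetworkPolicy is namespaced, placing it in `valkey` and matching on the source namespace label is what makes the rule cross-namespace.
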
Known follow-up: bp-valkey ships with `auth.enabled: true` (bitnami
default). SME services pass only VALKEY_ADDR (no password env). Two
remediation paths exist: (a) per-Sovereign overlay disables
bp-valkey auth, or (b) plumb VALKEY_PASSWORD through SME service
Deployments + service code. Filed separately. This PR ships the
infrastructure (FQDN + CiliumNetworkPolicy) so the wire is in place
when one of those auth fixes lands.

Lockstep slot 13 pin in
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml bumps
from 1.4.3 → 1.4.4. 2026-05-04.

Bumped to 1.4.5: wire VALKEY_PASSWORD into SME auth + gateway services
to clear cross-ns Valkey auth crashloop on franchised Sovereigns
(issue #863, 2026-05-04). After 1.4.4 landed FerretDB + the cross-ns
CiliumNetworkPolicy, 11/13 SME pods reached Running 1/1 on otech103
but `auth` stayed in CrashLoopBackOff and `gateway`'s rate limiter
was disabled, both with the same error:

  ERROR failed to connect to Valkey error="NOAUTH HELLO must be
  called with the client already authenticated, otherwise the
  HELLO <proto> AUTH <user> <pass> option can be used..."

Root cause: bp-valkey 1.0.0 (slot 17) ships with `auth.enabled=true`
(bitnami valkey 5.5.1 default convention). The bitnami subchart
auto-generates a random password and exposes it via the
`valkey-password` key in the `valkey` Secret in the `valkey`
namespace. SME service code (`core/services/shared/db/valkey.go`)
only accepted an addr (no password) and the auth.yaml + gateway.yaml
Deployments only set VALKEY_ADDR. Cross-ns AUTH was never plumbed
through. Pre-1.4.4 this was masked because VALKEY_ADDR pointed at a
non-existent `valkey.sme.svc.cluster.local` and the connect failed
at DNS, not at AUTH.

Fix:
- core/services/shared/db/valkey.go: add a ConnectValkeyWithAuth
  variant that takes username + password. ConnectValkey kept
  backwards-compatible for callers that don't pass auth (contabo-mkt
  auth-less in-namespace Valkey under data/valkey.yaml).
- core/services/auth/main.go + core/services/gateway/main.go:
  read VALKEY_USERNAME + VALKEY_PASSWORD env, call
  ConnectValkeyWithAuth when password is non-empty, else fall through
  to the no-auth path. Empty password = current contabo behaviour.
- NEW templates/sme-services/valkey-cross-ns-secret.yaml: use Helm
  `lookup` to read the bp-valkey auto-generated password from
  `valkey/valkey` Secret and re-emit it as `sme-valkey-auth` in
  `sme` namespace. Same lookup-and-mirror pattern as
  sme-secrets.yaml (issue #859) and gitea-admin-secret (issue #830
  Bug 2). On first install the lookup may return nil; Flux's 15m
  reconcile picks up the mirror once bp-valkey is Ready.
- auth.yaml + gateway.yaml: add VALKEY_PASSWORD env reading from
  `sme-valkey-auth` Secret with `optional: true` so contabo-mkt's
  auth-less Valkey path keeps working when the mirror Secret is
  absent. Valkey's `default` ACL user is the one `requirepass`
  protects, so VALKEY_USERNAME stays unset by convention.
- values.yaml: add `smeServices.valkey.{sourceSecretName,
  sourcePasswordKey, destNamespace, destSecretName}` knobs so a
  forked bp-valkey with non-default Secret naming can override
  without forking the chart (Inviolable Principle #4).

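The Deployment-side wiring in auth.yaml and gateway.yaml can be pictured as the container `env` fragment below. It is a sketch: the key name inside `sme-valkey-auth` is assumed to be `valkey-password` (mirroring the upstream Secret's key), which the changelog does not state.

```yaml
# Sketch only: container env fragment for the auth / gateway Deployments.
env:
  - name: VALKEY_PASSWORD
    valueFrom:
      secretKeyRef:
        name: sme-valkey-auth
        key: valkey-password     # assumed key name in the mirror Secret
        # optional: true lets the Pod start when the mirror Secret is
        # absent; the env var is then simply unset and the service takes
        # the no-auth path.
        optional: true
```

Without `optional: true` a missing Secret would leave the Pod in CreateContainerConfigError, which is the failure mode the earlier entries describe.
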
No SME smeTag bump needed at chart-source time: the
services-build.yaml workflow rebuilds the auth + gateway images
from this commit's SHA and updates the `image:` line in auth.yaml +
gateway.yaml directly. The chart's blueprint-release pipeline picks
up those updated SHAs in its values.yaml on the next chart push.

Lockstep slot 13 pin in
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml bumps
from 1.4.4 → 1.4.5. 2026-05-04.

Bumped to 1.4.6: bundle the rebuilt services-auth + services-gateway
image SHA fa4395f from PR #864 into the chart artifact (issue #863
follow-up, 2026-05-05). 1.4.5 was published at commit fa4395fa BEFORE
the deploy job updated auth.yaml's hardcoded `image:` to fa4395f, so
Sovereigns pulling 1.4.5 got the OLD image (5cdb738) without the
ConnectValkeyWithAuth Go change: VALKEY_PASSWORD env was wired but
the binary ignored it and still hit "NOAUTH HELLO" on connect.

Same race documented in the 1.1.16 changelog above (catalyst-ui
base:/ fix). 1.4.6 republishes the chart with the deploy-committed
image SHAs already in tree (auth.yaml + gateway.yaml `image:` lines
point at fa4395f as of commit 9731701c).

No template/code changes: pure version bump to roll a fresh OCI
artifact whose `helm template` output references the
ConnectValkeyWithAuth-enabled image.

Lockstep slot 13 pin in
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml bumps
from 1.4.5 → 1.4.6. 2026-05-05.

Bumped to 1.4.7: provision the `provisioning-github-token` Secret
on Sovereign install so the last 1/13 SME pod (provisioning) reaches
Running 1/1 (issue #866, 2026-05-04). After 1.4.6 cleared 12/13 SME
pods on otech103, the provisioning Deployment stayed in
CreateContainerConfigError waiting on
`secret/provisioning-github-token` (key GITHUB_TOKEN) which exists
on contabo-mkt as a hand-rolled SealedSecret but had no
Sovereign-side equivalent. Without this Secret the Pod can't even
start, which blocks the full SME stack on every fresh Sovereign.

Fix (issue #866 Option C, local-Gitea target):
Post-cutover the canonical Git target on a Sovereign IS the local
Gitea instance (the GitRepository CRs already point there). New
template templates/sme-services/provisioning-github-token.yaml
uses Helm `lookup` to read the auto-generated gitea admin password
from `gitea/gitea-admin-secret` (already generated by
platform/gitea/chart/templates/admin-secret.yaml with the same
lookup-persistence pattern) and re-emit it as
`sme/provisioning-github-token` under the GITHUB_TOKEN key. Same
lookup-and-mirror precedent as valkey-cross-ns-secret.yaml (#863)
and sme-secrets.yaml (#859).

bp-gitea (slot 10) reaches Ready before bp-catalyst-platform
(slot 13), because the Flux dependsOn chain in
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml
lists bp-gitea explicitly, so by the time this template renders,
gitea-admin-secret EXISTS in the gitea namespace and lookup
returns its password (base64-encoded under `.data`, as for any
Secret read via lookup).

values.yaml: new `smeServices.provisioning.gitToken.*` block
(sourceNamespace / sourceSecretName / sourcePasswordKey /
destNamespace / destSecretName / destKey) so per-Sovereign
overlays pointing the provisioning service at a non-Gitea Git
host (e.g. a GitHub PAT via OpenBao + ExternalSecret) can swap
the source ref without forking the chart (Inviolable Principle #4).

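The lookup-and-mirror step driven by those values might be sketched as below. This is a hedged illustration, not the shipped template: it assumes the six `gitToken.*` values are all set and that nothing should render while the source Secret is still missing.

```yaml
# Sketch only: mirror one key from a source Secret into a destination
# Secret, with every name taken from values.
{{- $cfg := .Values.smeServices.provisioning.gitToken }}
{{- $src := lookup "v1" "Secret" $cfg.sourceNamespace $cfg.sourceSecretName }}
{{- if $src }}
apiVersion: v1
kind: Secret
metadata:
  name: {{ $cfg.destSecretName }}
  namespace: {{ $cfg.destNamespace }}
type: Opaque
data:
  # Values under .data returned by lookup are already base64-encoded,
  # so they are copied through without b64enc.
  {{ $cfg.destKey }}: {{ index $src.data $cfg.sourcePasswordKey | quote }}
{{- end }}
```

`lookup` returns an empty map when the object does not exist (and during `helm template` without cluster access), so the `if` guard renders nothing until the source is present.
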
Out of scope for this chart bump: full Gitea REST-API target
support in core/services/provisioning/github/client.go (which
hardcodes https://api.github.com today) is a follow-up Go change.
This Secret unblocks the Pod reaching Running 1/1, completing the
SME stack 12/13 → 13/13.

Lockstep slot 13 pin in
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml bumps
from 1.4.6 → 1.4.7. 2026-05-04.

1.4.8 (issue #868): fix the marketplace UI PIN-signin flow that 503'd
on otech103 because the public /api/* HTTPRoute backend-ref'd a dead
Service (catalyst-system/marketplace-api with zero matching Pods).
Two template fixes:
- templates/sme-services/marketplace-routes.yaml: the /api/* rule now
  uses a cross-namespace backendRef to sme/gateway:8080 (the SME BSS
  gateway Pod that already fronts services-auth, catalog, tenant,
  billing, provisioning).
- templates/sme-services/marketplace-reference-grant.yaml: extend
  `to:` list with the gateway Service so the cross-ns hop is
  authorised by Gateway API.
Lockstep slot 13 pin in
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml bumps
from 1.4.7 → 1.4.8. 2026-05-04.

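The two 1.4.8 template fixes pair up as in the sketch below. Resource names, the parent Gateway name, and the assumption that the HTTPRoute lives in `catalyst-system` are all illustrative; only the `sme/gateway:8080` backend comes from the entry above.

```yaml
# Sketch only: an HTTPRoute in one namespace pointing at a Service in another.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: marketplace              # hypothetical name
  namespace: catalyst-system     # assumed route namespace
spec:
  parentRefs:
    - name: cilium-gateway       # hypothetical Gateway name
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api/
      backendRefs:
        - name: gateway
          namespace: sme
          port: 8080
---
# The ReferenceGrant must live in the TARGET namespace (sme) and name the
# source kind/namespace it trusts.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: marketplace-to-sme       # hypothetical name
  namespace: sme
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: catalyst-system
  to:
    - group: ""
      kind: Service
      name: gateway
```

Without the grant, Gateway API implementations refuse the cross-namespace backendRef and the route reports an unresolved reference rather than forwarding traffic.
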
1.4.9 (issue #871): no template change; chart-version-only bump to
republish the OCI artifact with the current services-auth image SHA
baked into templates/sme-services/auth.yaml. 1.4.8 was published from
commit 95a06f56 BEFORE the deploy-bot updated auth.yaml's image pin
from `services-auth:fa4395f` (old) → `services-auth:95a06f5` (new,
with the /auth/send-pin alias), so 1.4.8 OCI bytes still reference
the OLD SHA and otech103 reconciled the broken image. Bumping the
chart version forces blueprint-release to publish a fresh artifact
with the current pin. Same race documented in
feedback_idempotent_iac_purge.md and overnight DoD doc as
"deploy-step race". Lockstep slot 13 pin bumps to 1.4.9. 2026-05-05.

1.4.10 (issue #876): wire CATALYST_OTECH_FQDN env on the catalyst-api
Deployment from the same `sovereign-fqdn` ConfigMap (key `fqdn`) that
feeds SOVEREIGN_FQDN. The SME tenant create handler (sme_tenant.go)
and the sovereign-parent-domains seed (sovereign_parent_domains.go)
both read CATALYST_OTECH_FQDN; without it, POST /api/v1/sme/tenants
returns 503 {"error":"otech-fqdn-unconfigured"} on every Sovereign,
and the SME-pool fallback returns an empty list. The two env names
exist for historical reasons (Phase-8b handover vs SME-tier tenant
pipeline) but ultimately point at the Sovereign's public FQDN.
optional=true since Catalyst-Zero (contabo) doesn't run the SME
tenant pipeline. Lockstep slot 13 pin bumps to 1.4.10. 2026-05-05.

1.4.11 (issue #878): wire CATALYST_GITOPS_USER + CATALYST_GITOPS_TOKEN
env on the catalyst-api Deployment, sourced from the local Gitea
admin secret (`gitea-admin-secret`, keys `username` + `password`).
Without these, the SME tenant pipeline (#804) and the
marketplace-settings GitOps writer fail at the first reconcile with
"gitops token unconfigured" (post-cutover Sovereign has no GitHub
PAT; the GitOps target is the local Gitea). optional=true so
Catalyst-Zero (contabo) keeps using the existing GitHub PAT path.
Pairs with a catalyst-api code change (marketplace_settings.go +
sme_tenant_gitops.go): injectTokenIntoURL now takes a configurable
username (was hardcoded "x-access-token"; GitHub PAT-only) so the
same code path works for both GitHub and Gitea. Also adds `git` to
the catalyst-api Containerfile (Alpine 3.20 base + apk add git):
the pipeline shells out to git clone/commit/push, and without the
binary the first reconcile fails with `exec: "git": executable
file not found in $PATH`. Lockstep slot 13 pin bumps to 1.4.11.
2026-05-05.

1.4.12 (issue #878 follow-up): chart-version-only bump to republish
the OCI artifact with the new catalyst-api image SHA (7bdd14f) baked
into values.yaml. 1.4.11 was published from commit 7bdd14fc BEFORE
the deploy-bot updated values.yaml's catalystApi.tag from 20413ec →
7bdd14f, so 1.4.11 OCI bytes still reference the OLD image without
the git binary. Same deploy-step race fixed in CI by #874
(services-build auto-bumps chart patch + dispatches
blueprint-release); the catalyst-build workflow needs the
equivalent. Until then this manual bump is required after every
catalyst-api image change. Lockstep slot 13 pin bumps to 1.4.12.
2026-05-05.

1.4.13 (issue #879): unblock the multi-domain Day-2 add-domain happy
path on a fresh post-handover Sovereign. Five stacked wiring fixes,
three of which are chart-side:

Bug 1 (POOL_DOMAIN_MANAGER_URL): api-deployment.yaml now wires
`POOL_DOMAIN_MANAGER_URL=https://pool.openova.io` so the
Sovereign-side catalyst-api hits the public PDM ingress on contabo
(the in-cluster default `pool-domain-manager.openova-system.svc`
only resolves on contabo and is NXDOMAIN on franchised Sovereigns).
Caught live on otech103, 2026-05-05: every Day-2 add-domain POST
failed with `dial tcp: lookup
pool-domain-manager.openova-system.svc.cluster.local: no such host`.

Bug 2 (CATALYST_PDM_BASIC_AUTH_USER / _PASS): api-deployment.yaml
now mounts the `pdm-basicauth` Secret (keys `username`+`password`)
so pdmFlipNS can send `Authorization: Basic ...` to the Traefik
basicAuth Middleware in front of pool.openova.io. optional=true:
Catalyst-Zero pods skip the header (in-cluster Service path is
unauthenticated) and CI / older Sovereigns degrade to a clear 401
log line instead of crashlooping. The Secret is provisioned by
cloud-init at handover-time (paired infra change in
cloudinit-control-plane.tftpl).

Bug 5 (HTTPRoute /auth/handover Exact match): httproute.yaml
catalyst-ui rule changed from PathPrefix `/auth/` to Exact
`/auth/handover`. The previous PathPrefix collided with the OIDC
PKCE redirect_uri `/auth/callback`: catalyst-api 404s on that
path because it only registers `/api/v1/auth/callback`. Result:
once the post-handover JWT cookie expired (8h TTL), the operator
could not log into the Sovereign Console at all (caught live on
otech103). Exact-match keeps /auth/handover routed to catalyst-api
while every other /auth/* path falls through to catalyst-ui's React
Router for client-side OIDC.

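The routing change in Bug 5 comes down to the `rules` fragment sketched here. Service names and ports are assumptions for illustration; the changelog only names the two path matches and which workload each should reach.

```yaml
# Sketch only: HTTPRoute rules fragment. An Exact match claims a single
# path, so /auth/callback and other /auth/* paths fall through to the
# catch-all rule.
rules:
  - matches:
      - path:
          type: Exact
          value: /auth/handover
    backendRefs:
      - name: catalyst-api       # assumed Service name
        port: 8080               # assumed port
  - matches:
      - path:
          type: PathPrefix
          value: /
    backendRefs:
      - name: catalyst-ui        # assumed Service name
        port: 80                 # assumed port
```

Gateway API gives Exact matches precedence over PathPrefix matches, so the order of the two rules does not change which backend receives `/auth/handover`.
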
Three coupled code-side fixes ship in catalyst-api as part of the
same #879 PR (parent_domains.go):

Bug 2-code: pdmFlipNS now calls SetBasicAuth with credentials from
the env (read on every call so a Secret rotation propagates without
Pod restart).
Bug 3-code: pdmFlipNS body now includes `nameservers` (computed
from expectedNSFor; PDM's SetNSRequest schema requires it, and the
previous body got 422 missing-nameservers).
Bug 4-code: lookupPrimaryDomain falls back to SOVEREIGN_FQDN env
after CATALYST_PRIMARY_DOMAIN. On a post-handover Sovereign no
Deployment record is persisted, so without this fallback GET
/parent-domains returned {"items":[]} and the propagation panel
showed `expectedNs: null`. The SOVEREIGN_FQDN env is already
wired by api-deployment.yaml from the sovereign-fqdn ConfigMap.

Lockstep slot 13 pin in
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml bumps
from 1.4.11 → 1.4.12. 2026-05-05.

Bumped to 1.4.13: Flux Kustomization watching SME tenant overlays
(issue #882, 2026-05-05). The catalyst-api SME-tenant pipeline's
GitOps writer (sme_tenant_gitops.go::WriteTenantOverlay) commits
per-tenant Kustomize overlays to
clusters/<sov-fqdn>/sme-tenants/<tenant-id>/ on every successful
POST /api/v1/sme/tenants, but no Flux Kustomization on the
Sovereign cluster watched that path. The state machine
(sme_tenant.go) advanced optimistically through every step
(vcluster → bp_charts → dns → certs → keycloak_clients → registry)
and reported state=done, while no actual K8s resources
materialised because nothing was reconciling the orchestrator's
write target.

Verified live on otech103 (2026-05-04 23:18 Berlin): the orchestrator
successfully committed the 9-file overlay for tenant 15f1e45e-...
to the local Gitea openova/openova repo @main, but `kubectl get hr
-n sme-15f1e45e-...` returned No resources found indefinitely.

Fix: NEW templates/sme-services/sme-tenants-kustomization.yaml,
gated on .Values.ingress.marketplace.enabled (same flag the rest of
the SME bundle uses); non-marketplace Sovereigns don't run the SME
tenant pipeline so they don't render this Kustomization. Renders one
Flux Kustomization in flux-system that sweeps the entire
./clusters/<sovereignFQDN>/sme-tenants directory tree:
- sourceRef: flux-system/openova GitRepository (the same one the
  cluster bootstraps from; cutover Step 5 flips its .spec.url to the
  local in-cluster Gitea, which is precisely where
  sme_tenant_gitops.go pushes via
  CATALYST_GITOPS_REPO_URL=http://gitea-http.gitea.svc.cluster.local:3000/openova/openova)
- path: ./clusters/{{ .Values.global.sovereignFQDN }}/sme-tenants
- interval: 1m (matches the orchestrator's "Flux reconciles within
  ~1 min" SLA documented at the top of sme_tenant_gitops.go)
- prune: true (DELETE /api/v1/sme/tenants/<id> removes the overlay
  directory; Flux GCs the tenant resources)
- wait: false (per-tenant overlays each install ~5 bp-* HRs
  asynchronously and have their own readiness watcher in the
  orchestrator; blocking this top-level Kustomization on every
  tenant's full readiness would let one stuck tenant gate every
  other tenant)

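Put together, the five settings listed above correspond to a manifest along these lines. The resource name is hypothetical and the values are hardcoded here for readability; per the changelog the real template reads them from `.Values.smeTenants.kustomization.*`.

```yaml
# Sketch only: one Flux Kustomization sweeping every tenant overlay.
{{- if .Values.ingress.marketplace.enabled }}
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: sme-tenants              # hypothetical name
  namespace: flux-system
spec:
  interval: 1m
  path: ./clusters/{{ .Values.global.sovereignFQDN }}/sme-tenants
  prune: true                    # deleting an overlay dir GCs its resources
  wait: false                    # one stuck tenant must not gate the rest
  sourceRef:
    kind: GitRepository
    name: openova
    namespace: flux-system
{{- end }}
```
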
Per Inviolable Principle #4 (never hardcode), every knob is
operator-overridable via .Values.smeTenants.kustomization.*:
the GitRepository sourceRef name/namespace, the resource name,
the cadence (interval/retryInterval/timeout), and the toggles
(prune/wait). Defaults match the canonical bootstrap-kit
conventions documented in
clusters/_template/bootstrap-kit/03-flux.yaml + the cloud-init
flux-bootstrap.yaml block.

Lockstep slot 13 pin in
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml bumps
from 1.4.12 → 1.4.13. 2026-05-05.

1.4.14 (issue #879 follow-up): chart-version-only republish so the OCI
artifact carries the catalyst-api image SHA 7bfd6df (the #879 fix
commit). Chart 1.4.13 was published from commit 7bfd6df5 BEFORE the
deploy-bot updated values.yaml's catalystApi.tag from aa226df →
7bfd6df, so 1.4.13 OCI bytes still reference the OLD catalyst-api
image without the pdmFlipNS basic-auth + nameservers +
lookup-primary-domain SOVEREIGN_FQDN-fallback fixes. Same deploy-step
race fixed in CI by #874 (services-build auto-bumps chart patch +
dispatches blueprint-release); the catalyst-build workflow needs the
equivalent. Until then this manual bump is required after every
catalyst-api image change. Lockstep slot 13 pin bumps to 1.4.14.
2026-05-05.

1.4.15 (issue #887): auto-provision marketplace-api-secrets Secret on
Sovereign install. templates/marketplace-api/deployment.yaml has always
referenced a secretKeyRef on `marketplace-api-secrets` (key:
`jwt-secret`); on contabo-mkt this Secret is hand-rolled in
clusters/contabo-mkt/apps/.../marketplace-api-secrets.yaml. On a freshly
franchised Sovereign with ingress.marketplace.enabled=true, nothing
equivalent existed; caught live on otech103 (2026-05-05) where
marketplace-api landed in CreateContainerConfigError "secret not found"
every reconcile. Fix: NEW templates/marketplace-api/secret.yaml uses
Helm `lookup` to persist a 64-char randAlphaNum jwt-secret across
reconciles (same load-bearing pattern as sme-secrets,
valkey-cross-ns-secret, provisioning-github-token, gitea-admin-secret
per feedback_passwords.md). Without lookup every reconcile would
invalidate every active marketplace JWT. helm.sh/resource-policy: keep
so the Secret survives helm uninstall. Lockstep slot 13 pin bumps to
1.4.15. 2026-05-05.

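The generate-once, persist-forever pattern this entry relies on can be sketched as follows. It is an illustration rather than the shipped secret.yaml: using `.Release.Namespace` as the Secret's namespace is an assumption.

```yaml
# Sketch only: reuse the existing jwt-secret if the Secret is already in
# the cluster, otherwise generate a fresh 64-char value.
{{- $name := "marketplace-api-secrets" }}
{{- $existing := lookup "v1" "Secret" .Release.Namespace $name }}
{{- $jwt := randAlphaNum 64 | b64enc }}
{{- if $existing }}
{{- with $existing.data }}
{{- /* Stored values are already base64-encoded; keep them verbatim. */}}
{{- $jwt = get . "jwt-secret" | default $jwt }}
{{- end }}
{{- end }}
apiVersion: v1
kind: Secret
metadata:
  name: {{ $name }}
  namespace: {{ .Release.Namespace }}
  annotations:
    helm.sh/resource-policy: keep   # survive helm uninstall
type: Opaque
data:
  jwt-secret: {{ $jwt | quote }}
```

`randAlphaNum` yields a new value on every render, so without the `lookup` branch each reconcile would rotate the signing key and invalidate every issued JWT.
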
1.4.17 (issue #901): unblock Sovereign Console login on every fresh
provision. https://console.<sov>/login PIN-issue endpoint returned 503
with "CATALYST_OPENOVA_KC_SA_CLIENT_SECRET not set", a 3-bug chain:

Bug 1: api-deployment.yaml lines 676-739 reference a Secret
`catalyst-openova-kc-credentials` for the full PIN-auth env block
(CATALYST_OPENOVA_KC_* + CATALYST_SMTP_*). On contabo-mkt this Secret
is hand-rolled out-of-band
(clusters/contabo-mkt/apps/keycloak-zero/helmrelease.yaml mounts it
via extraEnvVars). On a freshly franchised Sovereign nothing
equivalent existed: every secretKeyRef has optional=true so the Pod
started, but POST /api/v1/auth/pin/issue 503'd on the missing
client-secret env. Fix: NEW
templates/catalyst-openova-kc-credentials-secret.yaml mirrors the
canonical KC SA Secret (`keycloak/catalyst-kc-sa-credentials`,
created by bp-keycloak's openbao-bridge post-install hook) into
catalyst-system as `catalyst-openova-kc-credentials` with the key
shape api-deployment.yaml expects. Same Helm-`lookup` persistence
pattern as templates/marketplace-api/secret.yaml (#887),
sme-secrets.yaml (#859), valkey-cross-ns-secret.yaml (#863),
provisioning-github-token.yaml (#866) and gitea-admin-secret.yaml
(#830). helm.sh/resource-policy: keep, so the Secret survives helm
uninstall.

Sovereign-vs-contabo gate (load-bearing): the new template is
rendered ONLY when `lookup "v1" "Secret" "keycloak"
"catalyst-kc-sa-credentials"` returns non-nil. On Catalyst-Zero
(contabo) Keycloak runs as `keycloak-zero` in its own namespace
and there is NO Secret by that name in the `keycloak` namespace:
lookup returns nil → the template renders empty bytes → the
existing hand-rolled Secret in clusters/contabo-mkt/apps/...
remains untouched (no helm-vs-kustomize ownership flap). The
new file is intentionally NOT added to templates/kustomization.yaml
`resources:` so Kustomize-mode contabo build skips it entirely
(same dual-mode pattern as templates/marketplace-api/secret.yaml).

Bug 2: SMTP host default `stalwart-web.stalwart.svc.cluster.local`
(an in-code constant) doesn't exist on Sovereign, so even after
Bug 1 the PIN-email delivery would fail at the next step. Fix: chart
now populates smtp-host/smtp-port/smtp-from from
.Values.sovereign.smtp.* defaulting to mail.openova.io:587 /
noreply@openova.io. SMTP user/pass come from a SECONDARY lookup
against `catalyst-system/sovereign-smtp-credentials` (Secret seeded
by cloud-init at provision time, issue #883 follow-up). If the source
Secret is missing, the Secret renders with empty smtp-user/smtp-pass
so the login surface still works and PIN delivery surfaces as a
clear "email delivery failed" log line, not as a 503.

Bug 3: CATALYST_POST_AUTH_REDIRECT default `/sovereign/wizard` is
mothership-only: the wizard page is the Provisioning Wizard the
operator drives at signup, not a post-handover Sovereign page. Fix:
chart-level default flips to `/sovereign/components` (the
post-handover Sovereign Console homepage). Per-Sovereign overlays
override via the catalystApi.env additional-env patch; the chart
value is a literal (per the dual-mode contract documented in the
CATALYST_POWERDNS_API_URL block of api-deployment.yaml).

Lockstep slot 13 pin in
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml bumps
from 1.4.16 → 1.4.17. 2026-05-05.

1.4.18 (issue #910, TBD): create the `sme` namespace on Sovereigns
where the marketplace is enabled. Every template under
templates/sme-services/* (billing, auth, ferretdb,
valkey-cross-ns-secret, sme-secrets, provisioning-github-token,
cnpg-cluster, ...) emits resources with `namespace: sme`. On
Catalyst-Zero (contabo) the `sme` namespace is pre-provisioned by
clusters/contabo-mkt/apps/sme/*, so the chart never created it. On a
fresh franchised Sovereign nothing else creates the `sme` namespace,
so chart 1.4.17 install failed 23 times with `failed to create
resource: namespaces "sme" not found` (caught live on otech105,
2026-05-05). Fix: NEW templates/sme-services/sme-namespace.yaml gated
on the same ingress.marketplace.enabled flag as the rest of the SME
bundle so non-marketplace Sovereigns and the Kustomize-mode contabo
build (which does NOT include sme-namespace.yaml in
templates/sme-services/kustomization.yaml's `resources:` list) skip
this entirely. helm.sh/resource-policy: keep, so the namespace is
never cascade-deleted on chart uninstall (would erase every SME
workload + tenant).
Lockstep slot 13 pin in
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml bumps
from 1.4.17 → 1.4.18. 2026-05-05.

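A namespace template matching that description is small enough to show in full. This is a sketch assuming no extra labels or annotations beyond the ones the entry names.

```yaml
# Sketch only: create the `sme` namespace when the marketplace is enabled,
# and tell Helm never to delete it on uninstall.
{{- if .Values.ingress.marketplace.enabled }}
apiVersion: v1
kind: Namespace
metadata:
  name: sme
  annotations:
    helm.sh/resource-policy: keep
{{- end }}
```
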
1.4.19 (issue #910, zero-touch provisioning, Bugs 2 + 3): two
coupled fixes that unblocked Sovereign Console PIN-login on a
freshly franchised cluster (1.4.18 closed Bug 1, the missing `sme`
namespace).

Bug 2: CATALYST_SESSION_COOKIE_DOMAIN was hardcoded to
console.openova.io in templates/api-deployment.yaml. On a Sovereign
the request host is console.<sov-fqdn>, so the browser silently
rejected the Set-Cookie (RFC 6265 §5.3 step 6, Domain mismatch)
and every /api/* request landed without a session, redirecting back
to /login forever. Caught live on otech105 (2026-05-05).
Fix: change the literal default to `""` (empty). Per the dual-mode
contract (CATALYST_POWERDNS_API_URL block in api-deployment.yaml),
this MUST stay a literal: Helm template directives in `value:`
fields break the contabo Kustomize-mode build. Empty value is
correct on BOTH paths: when CATALYST_SESSION_COOKIE_DOMAIN is empty
the auth handler omits the Domain attribute and the browser binds
the cookie to the exact request host. On contabo that is
console.openova.io (wizard + magic-link served from the same
host); on a Sovereign that is console.<sov-fqdn> (likewise).
Per-Sovereign overlays MAY override via the catalystApi.env
additional-env patch in the per-cluster HelmRelease for unusual
topologies.

Bug 3: catalyst-openova-kc-credentials-secret.yaml's
smtp-user/smtp-pass lookup used "existing target wins" persistence
over the source `sovereign-smtp-credentials` Secret seeded by A5's
provisioner (issue #883). On first install the source Secret had
not yet been seeded (race between catalyst-api's seedSovereignSMTP
step and the chart reconcile), so the chart rendered empty SMTP
creds, persisted them into the target, and NEVER picked up A5's
seeded bytes on subsequent reconciles. POST /api/v1/auth/pin/issue
502'd with `email-send-failed` for the life of the cluster.
Caught live on otech105 (2026-05-05).
Fix: invert the SMTP-cred lookup precedence. SOURCE
(sovereign-smtp-credentials) wins over the persisted target. Every
Flux reconcile (1m cadence) re-reads the source, so as soon as A5's
seed completes the chart picks it up on the next tick. Operator
rotation: edit sovereign-smtp-credentials (the operator-facing
seam); the target is a chart-derived projection and never an
operator surface. KC fields keep the previous "existing target
wins" contract because bp-keycloak's openbao-bridge auto-rotates
the client-secret on every Helm upgrade and we want that rotation
to require explicit operator action (delete the target) rather
than picking up automatically and rolling the catalyst-api Pod.

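The two precedence rules in the Bug 3 fix can be contrasted in one sketch. Key names are assumptions: `smtp-user` is taken from the text and assumed to be the same in the source Secret, while `client-secret` is a hypothetical key standing in for the KC fields; the Sovereign-vs-contabo gate is omitted for brevity.

```yaml
# Sketch only: "source wins" for SMTP creds, "existing target wins" for KC.
{{- $ns := "catalyst-system" }}
{{- $dst := lookup "v1" "Secret" $ns "catalyst-openova-kc-credentials" }}
{{- $smtp := lookup "v1" "Secret" $ns "sovereign-smtp-credentials" }}
{{- $kc := lookup "v1" "Secret" "keycloak" "catalyst-kc-sa-credentials" }}
{{- $dstData := dict }}{{- with $dst.data }}{{- $dstData = . }}{{- end }}
{{- $smtpData := dict }}{{- with $smtp.data }}{{- $smtpData = . }}{{- end }}
{{- $kcData := dict }}{{- with $kc.data }}{{- $kcData = . }}{{- end }}
{{- /* SMTP: re-read the source every render; target is only a fallback. */}}
{{- $smtpUser := get $smtpData "smtp-user" | default (get $dstData "smtp-user") }}
{{- /* KC: keep what the target already holds; source seeds the first render. */}}
{{- $kcSecret := get $dstData "client-secret" | default (get $kcData "client-secret") }}
apiVersion: v1
kind: Secret
metadata:
  name: catalyst-openova-kc-credentials
  namespace: {{ $ns }}
  annotations:
    helm.sh/resource-policy: keep
type: Opaque
data:
  smtp-user: {{ $smtpUser | quote }}
  client-secret: {{ $kcSecret | quote }}
```

The only difference between the two rules is which Secret appears on the left of `default`: whichever is read first wins whenever it holds a non-empty value.
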
No values.yaml schema change. No bootstrap-kit slot 13 envsubst
change. Lockstep slot 13 pin in
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml bumps
from 1.4.18 → 1.4.19. 2026-05-05.
type: application

# Opt-out from the blueprint-release hollow-chart guard (issue #181 / #510).
# This umbrella legitimately ships only Catalyst-authored workloads
# (catalyst-ui, catalyst-api, ProvisioningState CRD, Sovereign HTTPRoute);
# the foundation layer is installed independently by the bootstrap-kit
# and must NOT be re-rendered into catalyst-system as subcharts.
annotations:
  catalyst.openova.io/no-upstream: "true"

# No subchart dependencies; see 1.1.9 changelog above. The 10
# foundation Blueprints are installed by clusters/_template/bootstrap-kit/
# at their own slots, each as a top-level Flux HelmRelease in its own
# canonical namespace. This umbrella renders only the Catalyst-Zero
# control-plane workloads (catalyst-ui, catalyst-api, ProvisioningState
# CRD, Sovereign HTTPRoute) into targetNamespace: catalyst-system.