openova/products/catalyst/chart/templates/catalyst-gitea-token-secret.yaml
e3mrah b5181ec5d6
fix(catalyst-platform): gitea-token-mint hook 60->180 iters for autoscaler cold-start (Fix #184) (#1388)
* fix(catalyst-platform): gitea-token-mint hook 60->180 iters for autoscaler cold-start (Fix #184)

Raise the catalyst-gitea-token-mint pre-install hook's Gitea-API wait
loop from a hardcoded 60x5s (300s = 5m) budget to a values-driven knob
(giteaWait.iterations x giteaWait.intervalSeconds, default 168x5 =
840s = 14m). Pairs with HR install.timeout=15m to leave 60s slack for
the rest of the umbrella install action.

Root-cause trace (4-layer) on prov #33 (multi-region fsn1+hel1, cpx42
workerCount=0+autoscaler):

  bp-catalyst-platform HR (15m HR-timeout)
    -> Helm pre-install hook Job: catalyst-gitea-token-mint
         -> pod runs alpine/k8s curl loop:
              while ! curl gitea-http.gitea.svc.cluster.local; do
                sleep 5; i=$((i+1))
              done
         -> Hook gave up at iter 60 (= 5 min wall-time)
         -> Meanwhile gitea Pod is Pending: autoscaler-hcloud still
            scaling up workers in fsn1/hel1 (Fix #157 sizing default
            workerCount=0 means cold start).

Budget arithmetic (post-Fix #184 default):
  hook_wait_time = iterations x intervalSeconds = 168 x 5 = 840s (14 min)
  HR install.timeout =                                       900s (15 min)
  slack within HR budget =                                    60s ( 1 min)

The hook MUST complete strictly before HR remediates; the 60s slack
absorbs regular release resources rolling + post-install hooks after
the pre-install Job.
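
As a rough cross-check of the arithmetic above, the budget and slack can be recomputed in plain sh. The variable names are illustrative assumptions; the numbers are the defaults quoted in this message (168 iterations, 5s interval, 900s HR timeout).

```shell
# Defaults taken from the budget arithmetic in this commit message.
iterations=168
interval_seconds=5
hr_timeout_seconds=900

# Wall-clock budget of the hook's wait loop and what is left of the HR timeout.
budget_seconds=$((iterations * interval_seconds))
slack_seconds=$((hr_timeout_seconds - budget_seconds))
echo "hook wait budget: ${budget_seconds}s, slack: ${slack_seconds}s"

# The chart-internal budget must stay strictly below the HR timeout.
if [ "$budget_seconds" -ge "$hr_timeout_seconds" ]; then
  echo "giteaWait budget must be strictly less than the HR timeout" >&2
  exit 1
fi
```

Re-running the same check with overridden values (for example 240 x 10) is a quick way to see whether an overlay would exceed the 15m HR budget.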

Canonical-seam citations:
- The hook lives at products/catalyst/chart/templates/
  catalyst-gitea-token-secret.yaml (line ~303 pre-Fix), the
  catalyst-gitea-token-mint Job's `args` block.
- Prior pattern: bp-keycloak chart 1.4.5 (Fix #146) introduced
  keycloakConfigCli.availabilityCheck.timeout as a values knob -
  same shape (chart-internal hook timing knob, distinct from the
  outer HR timeout). See platform/keycloak/chart/values.yaml:413.
- The HR's install.timeout=15m lives at clusters/_template/
  bootstrap-kit/13-bp-catalyst-platform.yaml:484 - the chart-internal
  wait budget MUST stay strictly less than this.

Recurring class: same family as Fix #127 (bp-cutover HR 15m),
Fix #131 (bp-gitea HR 15m), Fix #150 (bp-harbor HR 15m), Fix #154
(HR-timeout audit). Those bumped the HelmRelease install.timeout.
This bumps the chart-INTERNAL wait loop budget inside the
pre-install hook Job, which is a different (lower) seam.

Per INVIOLABLE-PRINCIPLES #4 (never hardcode) the budget is fully
runtime-configurable via .Values.giteaWait. Operators may shorten on
known-warm-cluster overlays or extend on air-gapped Sovereigns.

Changes:
- products/catalyst/chart/templates/catalyst-gitea-token-secret.yaml:
  replace hardcoded `seq 1 60` + `sleep 5` with templated
  ITERATIONS/INTERVAL vars driven by .Values.giteaWait.{iterations,
  intervalSeconds}.
- products/catalyst/chart/values.yaml: add giteaWait block with
  defaults (iterations: 168, intervalSeconds: 5 = 14m budget).
- products/catalyst/chart/Chart.yaml: bump 1.4.139 -> 1.4.140 with
  changelog entry capturing the 4-layer trace + budget arithmetic.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: bump
  HelmRelease pin 1.4.138 -> 1.4.140 (skip 1.4.139 which is a no-op
  packaging bump on main).

Verification:
- helm template renders cleanly (2799 lines, exit 0).
- Force-render with lookup gate bypassed shows ITERATIONS=168 +
  INTERVAL=5 substituted into the rendered Job args.
- --set giteaWait.iterations=240 --set giteaWait.intervalSeconds=10
  override confirmed to emit ITERATIONS=240 + INTERVAL=10.

Test plan (post-merge, on prov #34):
- kubectl logs -n catalyst-system catalyst-gitea-token-mint-* should
  emit `waiting for gitea api ($i/168)` instead of `($i/60)`.
- bp-catalyst-platform HR reaches Ready=True within the 15m HR
  budget (previously installFailures: 2 on prov #33).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bootstrap-deps): reconcile pre-existing dep-graph audit drift

Two pre-existing drift items surfaced when dep-graph-audit ran on the
Fix #184 PR — both are in `main` already, not introduced here, but the
gate blocks any PR until the expected DAG matches the actual HRs.

1. `bp-catalyst-platform` (slot 13) — actual HR file declares
   `bp-crossplane-claims` as an additional dependsOn edge (added in
   chart-roll-rca iter-15, 2026-05-10, for the XRD-ordering race that
   caused the omantel.biz 90-min wedge). Update expected-deps to
   include it.

2. `bp-hcloud-ccm` (slot 55) — present on disk but absent from
   expected-deps. Cloud-provider seam, no upstream dependencies.
   Added with empty depends_on.

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
2026-05-11 14:44:54 +04:00

{{- /*
Auto-provision catalyst-gitea-token Secret on Sovereign install
(qa-loop iter-12 Fix #54 Workstream 4, follow-up to Fix #53B).

qa-loop iter-1 Fix #124 (chart 1.4.136) - PRE-INSTALL HOOK CONVERSION
=====================================================================
Prior post-install hook ordering caused a chicken-and-egg deadlock:
  1. Helm applies regular release resources (catalog Deployment,
     organization-controller Deployment, api Deployment etc.) which
     mount CATALYST_GITEA_TOKEN from data.token of this Secret.
  2. The Secret was created at this stage with `data.token: ""`
     (empty bytes, since `lookup` returned nil on a fresh install).
  3. catalyst-catalog and catalyst-organization-controller validate
     non-empty tokens at startup (config: CATALYST_GITEA_TOKEN is
     required) and immediately exit code=1 → CrashLoopBackOff.
  4. Helm's post-install hook then ran the mint Job, which patched
     the Secret with a real token. But:
     (a) The crashlooping Pods' env was already evaluated against
         the empty value at first start. K8s re-evaluates secretKeyRef
         env on each container restart, so eventually the back-off
         Pods could pick up the new token, but the back-off window
         grows exponentially (10s → 20s → 40s → ... → 5m) and on
         install #1 the post-install hook had to wait both for Gitea
         API to become ready AND for the apps to leave back-off.
     (b) The Helm install timeout (15m) lapsed before all Pods
         reached Ready, the release flipped to InstallFailed, and
         remediation kicked off `helm uninstall`, which itself
         timed out, and the loop repeated.

The fix below moves the entire token-bootstrap flow to a `pre-install,
pre-upgrade` hook so the Secret is populated with a real token BEFORE
any Deployment that consumes it is rolled out. Result: catalog +
organization-controller see a valid CATALYST_GITEA_TOKEN at first
container start; no CrashLoop, no backoff, no Helm timeout.

Why this template exists (background)
=====================================
api-deployment.yaml lines 538-543 reference a secretKeyRef on
`catalyst-gitea-token` for the CATALYST_GITEA_TOKEN env var consumed by
internal/handler/blueprints.go giteaClientFromEnv(). Without this
Secret the /api/v1/sovereigns/.../blueprints/{publish,curatable,curate,
edit-pr} endpoints return the 503 "Gitea client unconfigured" error
(blueprints.go:184 + blueprints.go:493) — matrix TC-081, TC-082,
TC-083, TC-085 FAIL. catalyst-catalog (services/catalog) and
catalyst-organization-controller (controllers/organization) consume
the same Secret for the same Gitea calls.
Pre-Fix #54 the Secret was kubectl-applied as an operational hack
(see qa-loop-state/iter12-diagnostic-audit.md §"(e) infra-blocked"
TC-081). Per `feedback_no_mvp_no_workarounds.md` rule #3 ("no
operational hacks instead of chart fixes") this template is the
chart-side canonical fix.

Sovereign-vs-Mothership gate (load-bearing)
===========================================
The canonical Gitea API token source on a Sovereign is the
`gitea-admin-secret` Secret in the `gitea` namespace (created by
bp-gitea's templates/admin-secret.yaml — see platform/gitea/chart/).
We use the admin password to call the Gitea API and either:
(1) Re-use a previously-minted catalyst-gitea-token in
catalyst-system (Helm lookup-existing-target idempotency); OR
(2) Mint a fresh PAT via POST /api/v1/users/{admin}/tokens at
first install, persist into catalyst-gitea-token.
On Catalyst-Zero (contabo) there is NO bp-gitea HelmRelease — the
Gitea install lives in the openova-private repo as a separate
manifest, and `gitea-admin-secret` lives in a different namespace.
We gate render on `lookup "v1" "Secret" "gitea" "gitea-admin-secret"`
returning non-nil. This means:
  - On a Sovereign: lookup returns the Secret → template renders the
    Secret + the pre-install Job that mints the PAT zero-touch.
  - On contabo: lookup returns nil → template renders empty bytes →
    the existing hand-rolled wiring (operator-managed via
    clusters/contabo-mkt/apps/.../secret.yaml) is untouched (no
    helm-vs-kustomize ownership flap).

Persistence across reconciles
=============================
The lookup contract is EXISTING-TARGET-WINS: once a PAT is minted into
catalyst-gitea-token, every subsequent reconcile re-emits the SAME
bytes via the lookup result, and the mint Job is not rendered at all
(it is gated on an empty token at template-render time). As a second
guard for a concurrent mint, the Job re-checks the in-cluster Secret
once the Gitea API is reachable and short-circuits (exit 0) when a
non-empty token is already present.
helm.sh/resource-policy: keep lets the Secret survive helm uninstall, so
a re-install picks up the same bytes via lookup.
Per docs/INVIOLABLE-PRINCIPLES.md #10: NO plaintext credentials in this
template. The PAT is minted by the Job at runtime via the Gitea API
(authenticated with the gitea-admin-secret bytes which themselves were
generated by `randAlphaNum 32` in bp-gitea, never echoed). The PAT
value lives only in the Secret bytes after the Job completes.
*/}}
{{- $giteaSrc := lookup "v1" "Secret" "gitea" "gitea-admin-secret" -}}
{{- if and $giteaSrc $giteaSrc.data -}}
{{- $secretName := "catalyst-gitea-token" -}}
{{- $namespace := .Release.Namespace -}}
{{- $existing := lookup "v1" "Secret" $namespace $secretName -}}
{{- /* ---- Token resolution: existing target wins (idempotent) ---- */ -}}
{{- $token := "" -}}
{{- if and $existing $existing.data (index $existing.data "token") -}}
{{- $token = index $existing.data "token" | b64dec -}}
{{- end -}}
---
{{- /* ---- Secret as pre-install hook ----
Created BEFORE regular release resources so that catalog +
organization-controller Deployments mount a populated token at
first container start.
Hook delete policy notes:
  - `before-hook-creation`: on next install the prior hook Secret
    is deleted before re-create. This is OK because the lookup
    above runs at template-render time and captures the existing
    token bytes; the new hook Secret is created with the same
    bytes.
  - We DO NOT include `hook-succeeded` because we want the Secret
    to persist after the hook completes (the Deployments need to
    mount it for the entire lifetime of the release).
  - `helm.sh/resource-policy: keep` is retained as
    belt-and-braces against any future helm-uninstall cleanup of
    hook resources.
*/}}
apiVersion: v1
kind: Secret
metadata:
  name: {{ $secretName }}
  namespace: {{ $namespace }}
  labels:
    catalyst.openova.io/blueprint: bp-catalyst-platform
    catalyst.openova.io/component: catalyst-gitea-token
    app.kubernetes.io/part-of: catalyst
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-weight: "5"
    helm.sh/hook-delete-policy: before-hook-creation
    # Survive helm uninstall: the Secret outlives the release. A
    # subsequent helm install picks up the token bytes via the lookup
    # against this Secret at template-render time.
    helm.sh/resource-policy: keep
type: Opaque
data:
  # `url` carries the in-cluster Gitea endpoint so consumers don't need
  # to hardcode it (matches the pattern catalyst-api uses via
  # CATALYST_GITEA_URL env). When empty, consumers fall through to
  # blueprints.go giteaClientFromEnv() which expects CATALYST_GITEA_URL
  # to be set explicitly; that env var is set in api-deployment.yaml
  # to the same in-cluster Service URL.
  url: {{ "http://gitea-http.gitea.svc.cluster.local:3000" | b64enc | quote }}
  token: {{ $token | b64enc | quote }}
{{- if not $token }}
---
{{- /* ---- First-install token-mint Job (pre-install hook) ----
Runs ONCE per fresh install (gated on token=="" at template-
render time). The Job:
(a) reads gitea-admin-secret to extract user+pass
(b) waits for Gitea API to be Ready
(c) POSTs /api/v1/users/{admin}/tokens to mint a Personal
Access Token with admin scope
(d) writes the PAT into catalyst-gitea-token via kubectl patch
on the Secret data.token field
Subsequent reconciles see token!="" in the Secret (via lookup) →
this Job branch is skipped at template-render time. Even if a
prior install left the Job lying around, hook-delete-policy
cleans it up before-hook-creation.
Per ADR-0001 §11.3: this is the canonical seam for
Sovereign-side credential-bootstrap that requires a runtime API
call rather than pure Helm-template generation. Mirrors
guacamole-recordings-pvc-migrate Job pattern (qa-loop iter-7
Fix #39) and bp-keycloak's keycloak-config-cli pre-install hook.
Why pre-install (not post-install) — qa-loop iter-1 Fix #124
=============================================================
See top-of-file note. tl;dr: catalog + organization-controller
refuse to start with empty CATALYST_GITEA_TOKEN, so the token
MUST be populated BEFORE those Deployments roll. Pre-install
hooks run before Helm applies regular release resources. */}}
apiVersion: v1
kind: ServiceAccount
metadata:
  name: catalyst-gitea-token-minter
  namespace: {{ $namespace }}
  labels:
    catalyst.openova.io/blueprint: bp-catalyst-platform
    catalyst.openova.io/component: catalyst-gitea-token-minter
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-weight: "5"
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: catalyst-gitea-token-minter
  namespace: {{ $namespace }}
  labels:
    catalyst.openova.io/blueprint: bp-catalyst-platform
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-weight: "5"
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["{{ $secretName }}"]
    verbs: ["get", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: catalyst-gitea-token-minter
  namespace: {{ $namespace }}
  labels:
    catalyst.openova.io/blueprint: bp-catalyst-platform
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-weight: "5"
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
subjects:
  - kind: ServiceAccount
    name: catalyst-gitea-token-minter
    namespace: {{ $namespace }}
roleRef:
  kind: Role
  name: catalyst-gitea-token-minter
  apiGroup: rbac.authorization.k8s.io
---
{{- /* Read-side Role on the gitea ns Secret holding the admin password.
The Job needs to GET the gitea-admin-secret to read the admin
password used in the Gitea API basic-auth call. */}}
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: catalyst-gitea-admin-reader
  namespace: gitea
  labels:
    catalyst.openova.io/blueprint: bp-catalyst-platform
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-weight: "5"
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["gitea-admin-secret"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: catalyst-gitea-admin-reader-binding
  namespace: gitea
  labels:
    catalyst.openova.io/blueprint: bp-catalyst-platform
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-weight: "5"
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
subjects:
  - kind: ServiceAccount
    name: catalyst-gitea-token-minter
    namespace: {{ $namespace }}
roleRef:
  kind: Role
  name: catalyst-gitea-admin-reader
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: batch/v1
kind: Job
metadata:
  name: catalyst-gitea-token-mint
  namespace: {{ $namespace }}
  labels:
    catalyst.openova.io/blueprint: bp-catalyst-platform
    catalyst.openova.io/component: catalyst-gitea-token-mint
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-weight: "10"
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
spec:
  ttlSecondsAfterFinished: 600
  backoffLimit: 8
  template:
    metadata:
      labels:
        catalyst.openova.io/blueprint: bp-catalyst-platform
        catalyst.openova.io/component: catalyst-gitea-token-mint
    spec:
      serviceAccountName: catalyst-gitea-token-minter
      restartPolicy: OnFailure
      containers:
        - name: mint
          # Fix #163 (2026-05-11, MIRROR-EVERYTHING): explicit
          # harbor.openova.io/proxy-dockerhub prefix per CLAUDE.md
          # inviolable rule.
          image: harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4
          command: ["sh","-c"]
          args:
            - |
              set -eu
              # Step 1: Wait for Gitea API to be reachable.
              #
              # Budget knob (qa-loop Wave 27 Fix #184, prov #33 wedge):
              # giteaWait.iterations × giteaWait.intervalSeconds defines the
              # wall-clock wait budget for the Gitea API to become
              # reachable. Default 168 × 5s = 840s (14 min), which covers
              # the worst-case autoscaler-hcloud cold-start observed on
              # multi-region prov #33 (Fix #157 sets workerCount=0 +
              # autoscaler so a fresh provision must wait for the
              # autoscaler to spawn the first worker before bp-gitea's Pod
              # can schedule). The HR's install.timeout / upgrade.timeout
              # is 15m (see
              # clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml);
              # we leave 60s slack within that budget so the rest of the
              # umbrella install action can complete after the hook.
              #
              # Pre-Fix #184 the loop was hardcoded `seq 1 60` (300s = 5
              # min) which was sized for warm-cluster installs
              # (workerCount>0, all worker nodes already up). With
              # workerCount=0 + autoscaler (Fix #157) the gitea Pod takes
              # 10-15 min to land on a freshly-spawned worker, so the
              # 300s budget always expired before Gitea was reachable and
              # the bp-catalyst-platform HR loop-rolled forever
              # (installFailures: 2 on prov #33). Same recurring class
              # as the HR-timeout audit (Fix #154); this is the
              # *chart-internal* layer of that same family.
              ITERATIONS={{ .Values.giteaWait.iterations | default 168 }}
              INTERVAL={{ .Values.giteaWait.intervalSeconds | default 5 }}
              GITEA_READY=0
              for i in $(seq 1 "$ITERATIONS"); do
                if curl -sSf -o /dev/null --max-time 3 \
                  http://gitea-http.gitea.svc.cluster.local:3000/api/v1/version; then
                  echo "gitea api reachable"
                  GITEA_READY=1
                  break
                fi
                echo "waiting for gitea api ($i/$ITERATIONS)"
                sleep "$INTERVAL"
              done
              # Fail explicitly when the budget is exhausted instead of
              # falling through to the mint steps against an unreachable API.
              if [ "$GITEA_READY" -ne 1 ]; then
                echo "FATAL: gitea api not reachable after $ITERATIONS attempts"
                exit 1
              fi
              # Step 2: Re-check existing token in catalyst-gitea-token;
              # in case a parallel reconcile beat us to mint, skip.
              EXISTING_TOKEN=$(kubectl -n {{ $namespace }} get secret {{ $secretName }} \
                -o jsonpath='{.data.token}' 2>/dev/null | base64 -d 2>/dev/null || true)
              if [ -n "$EXISTING_TOKEN" ]; then
                echo "catalyst-gitea-token.token already set; skipping mint"
                exit 0
              fi
              # Step 3: Read admin creds from gitea/gitea-admin-secret.
              GU=$(kubectl -n gitea get secret gitea-admin-secret \
                -o jsonpath='{.data.username}' | base64 -d)
              GP=$(kubectl -n gitea get secret gitea-admin-secret \
                -o jsonpath='{.data.password}' | base64 -d)
              if [ -z "$GU" ] || [ -z "$GP" ]; then
                echo "FATAL: gitea-admin-secret missing username/password keys"
                exit 1
              fi
              # Step 4: Mint a fresh PAT with admin scope.
              # Gitea PAT API: POST /api/v1/users/{username}/tokens
              # body {"name":"catalyst-api-<epoch>","scopes":["all"]}
              TOKEN_NAME="catalyst-api-$(date +%s)"
              MINT_OUT=$(curl -sS -u "${GU}:${GP}" \
                -H "Content-Type: application/json" \
                -X POST "http://gitea-http.gitea.svc.cluster.local:3000/api/v1/users/${GU}/tokens" \
                -d "{\"name\":\"${TOKEN_NAME}\",\"scopes\":[\"all\"]}")
              # Parse sha1 token field from JSON without jq dependency.
              TOKEN_SHA=$(echo "$MINT_OUT" | grep -oE '"sha1":"[^"]+"' | head -1 | cut -d'"' -f4)
              if [ -z "$TOKEN_SHA" ]; then
                echo "FATAL: no sha1 in mint response: $MINT_OUT"
                exit 1
              fi
              # Step 5: Patch catalyst-gitea-token with the minted token.
              TOKEN_B64=$(printf '%s' "$TOKEN_SHA" | base64 -w0)
              kubectl -n {{ $namespace }} patch secret {{ $secretName }} \
                --type=merge \
                -p "{\"data\":{\"token\":\"${TOKEN_B64}\"}}"
              echo "minted catalyst-gitea-token successfully (token name: $TOKEN_NAME)"
{{- end }}
{{- end }}