fix(self-sovereign-cutover): bump deadlines + HR pin to 0.1.27 (Fix #152) (#1356)

Provision #23 wedged with three consecutive DeadlineExceeded failures
on the auto-trigger Job because catalyst-api was not yet reachable
within the 14m Job deadline that Fix #127 set. A cold start of
catalyst-platform on a fresh Sovereign in a slow Hetzner region takes
longer than 14m end to end.

Two coupled changes:

1. Restore a 2× safety margin: HR install/upgrade timeout 15m → 30m,
   values.autoWaitForAPISeconds 720s → 1500s (25m),
   values.autoTimeoutSeconds 840s → 1740s (29m, 1m below the 30m HR
   cap). This keeps the canonical-seam alignment Fix #127 introduced
   (hook deadlines < HR timeout) and roughly doubles the cold-start
   budget.

2. Bump HR version pin 0.1.25 → 0.1.27. Fix #127 (commit 58f518ff)
   bumped Chart.yaml to 0.1.26 but left the HR pin at 0.1.25, so
   the post-#127 chart changes never actually shipped to any
   Sovereign. The pin bump here is what materialises BOTH Fix #127
   AND Fix #152 on the next provision.

Chart bump 0.1.26 → 0.1.27.
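
The pin drift described in item 2 is the kind of mismatch a small
pre-merge check could catch. A minimal sketch, assuming each file holds
a single uncommented `version:` key; `first_version` and
`pin_matches_chart` are hypothetical helpers, not part of this repo:

```python
import re

# Matches an uncommented "version: X" line. Comment lines start with "#"
# and "apiVersion:" does not start with "version:", so both are skipped.
VERSION_RE = re.compile(r"^[ \t]*version:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def first_version(yaml_text: str) -> str:
    """Return the value of the first uncommented `version:` key."""
    match = VERSION_RE.search(yaml_text)
    if match is None:
        raise ValueError("no uncommented 'version:' key found")
    return match.group(1).strip("\"'")


def pin_matches_chart(chart_yaml: str, helmrelease_yaml: str) -> bool:
    """True when the HelmRelease pin equals the Chart.yaml version."""
    return first_version(chart_yaml) == first_version(helmrelease_yaml)


# State after Fix #127: Chart.yaml at 0.1.26, HR pin still at 0.1.25.
chart_after_127 = "apiVersion: v2\nname: bp-self-sovereign-cutover\nversion: 0.1.26\n"
hr_after_127 = "# 0.1.26: HR install/upgrade timeout 15m\nversion: 0.1.25\n"

# State after this commit: both at 0.1.27.
chart_after_152 = chart_after_127.replace("0.1.26", "0.1.27")
hr_after_152 = "# 0.1.27: HR install/upgrade timeout 30m\nversion: 0.1.27\n"

print(pin_matches_chart(chart_after_127, hr_after_127))  # False
print(pin_matches_chart(chart_after_152, hr_after_152))  # True
```

The regex approach avoids a YAML parser dependency; a real check would
more likely parse both files and read the pin from its exact key path.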

Per CLAUDE.md principle 4: realistic deadline that matches observed
cold-start time, not a workaround.
Per CLAUDE.md principle 16: HR.timeout > Job.activeDeadlineSeconds >
Job.WAIT_TIMEOUT_SECONDS preserved.
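
The ordering in principle 16 can be checked with plain arithmetic. A
minimal sketch, assuming the three values are copied by hand from the
HelmRelease and values.yaml; `parse_minutes` and
`check_deadline_ordering` are hypothetical names, not part of the chart:

```python
def parse_minutes(timeout: str) -> int:
    """Convert a minutes-only duration such as '30m' to seconds."""
    if not timeout.endswith("m"):
        raise ValueError(f"expected a duration like '30m', got {timeout!r}")
    return int(timeout[:-1]) * 60


def check_deadline_ordering(hr_timeout: str, auto_timeout_s: int, auto_wait_s: int) -> list:
    """Return violations of HR timeout > Job deadline > API wait."""
    hr_s = parse_minutes(hr_timeout)
    problems = []
    if auto_timeout_s >= hr_s:
        problems.append(
            f"autoTimeoutSeconds={auto_timeout_s} is not below HR timeout {hr_s}s"
        )
    if auto_wait_s >= auto_timeout_s:
        problems.append(
            f"autoWaitForAPISeconds={auto_wait_s} is not below "
            f"autoTimeoutSeconds={auto_timeout_s}"
        )
    return problems


# Values from this commit: 1800s > 1740s > 1500s, so no violations.
print(check_deadline_ordering("30m", 1740, 1500))  # []

# Raising the Job values without raising the HR timeout breaks the ordering.
print(len(check_deadline_ordering("15m", 1740, 1500)))  # 1
```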

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e3mrah 2026-05-11 08:40:15 +04:00 committed by GitHub
parent 83f9fc429a
commit 045fe466bc
3 changed files with 35 additions and 11 deletions


@@ -234,19 +234,37 @@ spec:
# a 60s wait-loop for Reflector lag, and falls back to the
# source namespace (flux-system) if the local copy is still
# missing. Idempotent path unchanged.
-version: 0.1.25
+# 0.1.26: HR install/upgrade timeout 15m + values
+# autoWaitForAPISeconds=720, autoTimeoutSeconds=840 (Fix #127).
+# Provisions #12 + #14 wedged at phase1-watching because the
+# HR had no explicit timeout → Helm 5m default → hit before
+# the auto-trigger Job's 600s activeDeadline could complete.
+# 0.1.27: HR install/upgrade timeout 15m → 30m + values
+# autoWaitForAPISeconds 720→1500s (25m wait), autoTimeoutSeconds
+# 840→1740s (29m Job deadline) (Fix #152). Prov #23 wedged
+# identically with 3× consecutive DeadlineExceeded on the auto-
+# trigger Job: catalyst-api had not yet become reachable inside
+# the 14m Job deadline. Cold-start of catalyst-platform on a
+# fresh Sovereign exceeds 14m on slow Hetzner regions; 2×
+# headroom (29m Job, 30m HR) restores the safety margin Fix #127
+# intended. NOTE: also bumps HR version pin from 0.1.25 → 0.1.27
+# — Fix #127 (commit 58f518ff) bumped Chart.yaml to 0.1.26 but
+# left this pin at 0.1.25, so the new HR-timeout/values changes
+# never landed on any Sovereign. The pin update here is what
+# actually delivers BOTH Fix #127 and Fix #152.
+version: 0.1.27
sourceRef:
kind: HelmRepository
name: bp-self-sovereign-cutover
namespace: flux-system
install:
disableWait: true
-timeout: 15m
+timeout: 30m
remediation:
retries: 3
upgrade:
disableWait: true
-timeout: 15m
+timeout: 30m
remediation:
retries: 3
# Per-Sovereign overrides — the chart's values.yaml carries


@@ -1,6 +1,6 @@
apiVersion: v2
name: bp-self-sovereign-cutover
-version: 0.1.26
+version: 0.1.27
description: |
Catalyst Self-Sovereignty Cutover Blueprint. Installs DORMANT — this
chart ships eight step ConfigMaps (PodSpec ConfigMaps, one per step),


@@ -332,18 +332,24 @@ trigger:
# How long the auto-trigger Job will wait for catalyst-api to be
# reachable before giving up (and exiting 0 so the operator can fire
# manually). Must finish below the HelmRelease install/upgrade
-# timeout (15m for bp-self-sovereign-cutover) AND the activeDeadline
-# below so the Job exits cleanly even when catalyst-api never comes
-# up — 12 minutes leaves a healthy 3m buffer below the 15m HR cap.
-autoWaitForAPISeconds: 720
+# timeout (30m for bp-self-sovereign-cutover post-Fix-#152) AND the
+# activeDeadline below so the Job exits cleanly even when catalyst-
+# api never comes up — 25 minutes leaves a healthy 5m buffer below
+# the 30m HR cap. Bumped from 720s (12m) on Fix #152 (chart 0.1.27)
+# after prov #23 hit the 14m Job deadline before catalyst-api came
+# up — cold-start budget needs ~2× headroom on slow Sovereigns.
+autoWaitForAPISeconds: 1500
# Overall cap on the auto-trigger Job runtime. activeDeadlineSeconds
# on the Job spec — anything longer means catalyst-api is sick and
# the operator should investigate. The Job exiting at this deadline
# is non-fatal for the chart install (the cutover engine already
# runs detached inside catalyst-api once /start returns 200).
-# Must stay below the HelmRelease install/upgrade timeout (15m =
-# 900s) so the Job ends and the hook unblocks before Helm gives up.
-autoTimeoutSeconds: 840
+# Must stay below the HelmRelease install/upgrade timeout (30m =
+# 1800s post-Fix-#152) so the Job ends and the hook unblocks before
+# Helm gives up. Bumped from 840s (14m) on Fix #152 (chart 0.1.27)
+# after prov #23 wedged at 3 consecutive DeadlineExceeded — 29m
+# leaves a 1m buffer below the 30m HR cap.
+autoTimeoutSeconds: 1740
# TTL on the completed Job — kept for audit so operators can read
# the trigger Pod logs if something looks wrong.
autoJobTTLSeconds: 86400