Prov #23 wedged with 3× consecutive DeadlineExceeded failures on the
auto-trigger Job because catalyst-api was not yet reachable within the
14m Job deadline (autoTimeoutSeconds=840) that Fix #127 set. Cold start
of catalyst-platform on a fresh Sovereign in a slow Hetzner region takes
longer than 14m end-to-end.

Two coupled changes:

1. Restore the 2× safety margin: HR install/upgrade timeout 15m → 30m,
   values.autoWaitForAPISeconds 720s → 1500s (25m), autoTimeoutSeconds
   840s → 1740s (29m, 1m below the 30m HR cap). This keeps the
   canonical-seam alignment Fix #127 introduced (hook deadlines < HR
   timeout), with 2× the cold-start budget.
2. Bump the HR version pin 0.1.25 → 0.1.27. Fix #127 (commit 58f518ff)
   bumped Chart.yaml to 0.1.26 but left the HR pin at 0.1.25, so
   the post-#127 chart changes never shipped to any Sovereign. The pin
   bump here is what delivers BOTH Fix #127 AND Fix #152 on the next
   provision.

Chart bump 0.1.26 → 0.1.27.

Per CLAUDE.md principle 4: this is a realistic deadline that matches the
observed cold-start time, not a workaround.
Per CLAUDE.md principle 16: the ordering HR.timeout >
Job.activeDeadlineSeconds > Job.WAIT_TIMEOUT_SECONDS is preserved.

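The ordering above can be checked mechanically. A minimal sketch, assuming
the three values from this commit (30m HR timeout, autoTimeoutSeconds=1740,
autoWaitForAPISeconds=1500); the helpers parse_duration and check_ordering
are hypothetical and not part of the chart:

```python
import re


def parse_duration(text: str) -> int:
    """Convert a simple duration such as '30m', '90s' or '1h' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * {"s": 1, "m": 60, "h": 3600}[unit]


def check_ordering(hr_timeout: str, job_deadline_s: int, wait_s: int) -> list[str]:
    """Return violations of HR timeout > Job deadline > API wait (empty if none)."""
    problems = []
    hr_s = parse_duration(hr_timeout)
    if not job_deadline_s < hr_s:
        problems.append(f"Job deadline {job_deadline_s}s is not below HR timeout {hr_s}s")
    if not wait_s < job_deadline_s:
        problems.append(f"API wait {wait_s}s is not below Job deadline {job_deadline_s}s")
    return problems


# Values from this commit: 1500s < 1740s < 1800s, so no violations.
violations = check_ordering("30m", 1740, 1500)
```

Running the same check with the old 15m HR timeout against the new 1740s
Job deadline reports one violation, which is why the two values have to
move together.
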
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 83f9fc429a
commit 045fe466bc
@@ -234,19 +234,37 @@ spec:
 # a 60s wait-loop for Reflector lag, and falls back to the
 # source namespace (flux-system) if the local copy is still
 # missing. Idempotent path unchanged.
-version: 0.1.25
+# 0.1.26: HR install/upgrade timeout 15m + values
+# autoWaitForAPISeconds=720, autoTimeoutSeconds=840 (Fix #127).
+# Provisions #12 + #14 wedged at phase1-watching because the
+# HR had no explicit timeout → Helm 5m default → hit before
+# the auto-trigger Job's 600s activeDeadline could complete.
+# 0.1.27: HR install/upgrade timeout 15m → 30m + values
+# autoWaitForAPISeconds 720→1500s (25m wait), autoTimeoutSeconds
+# 840→1740s (29m Job deadline) (Fix #152). Prov #23 wedged
+# identically with 3× consecutive DeadlineExceeded on the auto-
+# trigger Job: catalyst-api had not yet become reachable inside
+# the 14m Job deadline. Cold-start of catalyst-platform on a
+# fresh Sovereign exceeds 14m on slow Hetzner regions; 2×
+# headroom (29m Job, 30m HR) restores the safety margin Fix #127
+# intended. NOTE: also bumps HR version pin from 0.1.25 → 0.1.27
+# — Fix #127 (commit 58f518ff) bumped Chart.yaml to 0.1.26 but
+# left this pin at 0.1.25, so the new HR-timeout/values changes
+# never landed on any Sovereign. The pin update here is what
+# actually delivers BOTH Fix #127 and Fix #152.
+version: 0.1.27
 sourceRef:
 kind: HelmRepository
 name: bp-self-sovereign-cutover
 namespace: flux-system
 install:
 disableWait: true
-timeout: 15m
+timeout: 30m
 remediation:
 retries: 3
 upgrade:
 disableWait: true
-timeout: 15m
+timeout: 30m
 remediation:
 retries: 3
 # Per-Sovereign overrides — the chart's values.yaml carries
@@ -1,6 +1,6 @@
 apiVersion: v2
 name: bp-self-sovereign-cutover
-version: 0.1.26
+version: 0.1.27
 description: |
 Catalyst Self-Sovereignty Cutover Blueprint. Installs DORMANT — this
 chart ships eight step ConfigMaps (PodSpec ConfigMaps, one per step),
@@ -332,18 +332,24 @@ trigger:
 # How long the auto-trigger Job will wait for catalyst-api to be
 # reachable before giving up (and exiting 0 so the operator can fire
 # manually). Must finish below the HelmRelease install/upgrade
-# timeout (15m for bp-self-sovereign-cutover) AND the activeDeadline
-# below so the Job exits cleanly even when catalyst-api never comes
-# up — 12 minutes leaves a healthy 3m buffer below the 15m HR cap.
-autoWaitForAPISeconds: 720
+# timeout (30m for bp-self-sovereign-cutover post-Fix-#152) AND the
+# activeDeadline below so the Job exits cleanly even when catalyst-
+# api never comes up — 25 minutes leaves a healthy 5m buffer below
+# the 30m HR cap. Bumped from 720s (12m) on Fix #152 (chart 0.1.27)
+# after prov #23 hit the 14m Job deadline before catalyst-api came
+# up — cold-start budget needs ~2× headroom on slow Sovereigns.
+autoWaitForAPISeconds: 1500
 # Overall cap on the auto-trigger Job runtime. activeDeadlineSeconds
 # on the Job spec — anything longer means catalyst-api is sick and
 # the operator should investigate. The Job exiting at this deadline
 # is non-fatal for the chart install (the cutover engine already
 # runs detached inside catalyst-api once /start returns 200).
-# Must stay below the HelmRelease install/upgrade timeout (15m =
-# 900s) so the Job ends and the hook unblocks before Helm gives up.
-autoTimeoutSeconds: 840
+# Must stay below the HelmRelease install/upgrade timeout (30m =
+# 1800s post-Fix-#152) so the Job ends and the hook unblocks before
+# Helm gives up. Bumped from 840s (14m) on Fix #152 (chart 0.1.27)
+# after prov #23 wedged at 3 consecutive DeadlineExceeded — 29m
+# leaves a 1m buffer below the 30m HR cap.
+autoTimeoutSeconds: 1740
 # TTL on the completed Job — kept for audit so operators can read
 # the trigger Pod logs if something looks wrong.
 autoJobTTLSeconds: 86400
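
The values comments describe a wait that gives up cleanly instead of
failing the hook. A rough sketch of that pattern, assuming a hypothetical
probe callable that reports whether catalyst-api answers. The real trigger
script is not shown in this diff; wait_for_api and the 10s poll interval
are illustrative only:

```python
import time
from typing import Callable


def wait_for_api(
    probe: Callable[[], bool],
    wait_seconds: float = 1500,  # autoWaitForAPISeconds in this commit
    interval: float = 10,        # assumed poll interval, not taken from the chart
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() until it succeeds or wait_seconds have elapsed."""
    deadline = clock() + wait_seconds
    while True:
        if probe():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            # Budget exhausted: report failure and let the caller decide.
            return False
        # Never sleep past the deadline, so the wait ends on time.
        sleep(min(interval, remaining))
```

A caller following the comments above would exit 0 whether this returns
True or False, so that an unreachable catalyst-api leaves the operator free
to fire the trigger manually. The clock and sleep parameters exist only so
the loop can be exercised without real waiting.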