Two infrastructure-hardening fixes that together eliminate ~30 min
of provision-cycle waste per regression event documented in Fix#101.
## Fix A — CI guard against unescaped tftpl shell expansion
Adds a grep-based step to .github/workflows/infra-hetzner-tofu.yaml
that scans every infra/hetzner/*.tftpl for unescaped \${VAR:-default}
inside YAML comment lines. Uses PCRE negative-lookbehind so correctly
escaped \$\${VAR:-default} (templatefile() literal-dollar) does not
trip the guard.
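
The workflow step itself is grep with a PCRE negative lookbehind. To show the matching rule in isolation, here is a minimal Go sketch of an equivalent check. Go's regexp package (RE2) has no lookbehind, so the sketch swaps in a "start of line or any non-$ character" prefix instead. The function name unescapedCommentLines, the sample template, and the rule that a comment line is one whose first non-blank character is # are assumptions for illustration, not taken from the workflow.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// unescaped matches ${NAME:-...} only when the '$' is NOT preceded by
// another '$'. This stands in for the PCRE lookbehind (?<!\$), which
// Go's RE2 engine does not support.
var unescaped = regexp.MustCompile(`(?:^|[^$])\$\{[A-Za-z_][A-Za-z0-9_]*:-`)

// unescapedCommentLines returns the 1-based numbers of comment lines
// that contain an unescaped shell-style default expansion.
// Assumption: a comment line is one whose first non-blank char is '#'.
func unescapedCommentLines(template string) []int {
	var hits []int
	for i, line := range strings.Split(template, "\n") {
		if !strings.HasPrefix(strings.TrimSpace(line), "#") {
			continue // the guard described above only looks at comments
		}
		if unescaped.MatchString(line) {
			hits = append(hits, i+1)
		}
	}
	return hits
}

func main() {
	// Hypothetical template content, for illustration only.
	sample := strings.Join([]string{
		"#cloud-config",
		"# set ${QA_FIXTURES_ENABLED:-false} to enable fixtures",  // unescaped
		"# set $${QA_FIXTURES_ENABLED:-false} to enable fixtures", // escaped
		"hostname: ${hostname}",                                   // normal interpolation
	}, "\n")
	fmt.Println(unescapedCommentLines(sample)) // prints [2]
}
```

Only line 2 is reported: the doubled-dollar form on line 3 is skipped, and line 4 is not a comment.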
Background: PR #1311 (Fix#73) added a YAML comment with bare
\${QA_FIXTURES_ENABLED:-false}. tofu's templatefile() parses ALL
\${...} sequences regardless of YAML/HCL/shell context; the colon
in the interpolation hits HCL's reserved conditional grammar and
crashes 'tofu plan' with "Template interpolation doesn't expect
a colon at this location". Prov #9 (4204f0b0c5e37a80) wasted
~30 min before PR #1328 fixed the one offender. Without the guard,
the next operator who adds a similar comment repeats the incident.
Documented in infra/hetzner/README.md so editors learn the \$\$
escape pattern before they trip the CI gate.
## Fix B — bucket-name suffix to escape global Hetzner namespace
Hetzner Object Storage bucket names share a GLOBAL namespace
across every tenant. The previous BucketNameForSovereign(fqdn)
derivation 'catalyst-<fqdn-with-dashes>' would collide on the
second CreateDeployment for the same FQDN (re-provision after
wipe, two operators on adjacent pools, race conditions) and the
second 'tofu apply' would fail with BucketAlreadyExists.
Change BucketNameForSovereign signature to (fqdn, deploymentID)
and append the first 8 chars of the deployment-id as a suffix:

    catalyst-omantel-omani-works-b3b837a2

newID() already returns 16 random hex characters, so the leading 8
chars carry 32 bits of fresh entropy: two deployments of the same
FQDN collide with probability 1 in 2^32 (about 4.3 billion), which is
negligible in practice. Backward-compat: an empty deploymentID (legacy
on-disk records) falls back to the first 8 hex chars of sha256(fqdn)
so wipes of pre-Fix-111 Sovereigns remain deterministic.
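
The derivation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in hetzner/buckets.go: the isHex helper, the lowercasing of both inputs, the treatment of short or non-hex IDs as fallback cases, and the full example deployment ID (only its first 8 chars appear in this description) are all guesses.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// isHex reports whether s consists only of lowercase hex digits.
// Hypothetical helper, not named in the PR.
func isHex(s string) bool {
	for _, r := range s {
		if !strings.ContainsRune("0123456789abcdef", r) {
			return false
		}
	}
	return true
}

// bucketSuffix returns the first 8 hex chars of the deployment ID, or,
// when the ID is empty, too short, or not hex, the first 8 hex chars of
// sha256(fqdn) so legacy records still map to a deterministic name.
func bucketSuffix(fqdn, deploymentID string) string {
	id := strings.ToLower(deploymentID)
	if len(id) >= 8 && isHex(id[:8]) {
		return id[:8]
	}
	sum := sha256.Sum256([]byte(fqdn))
	return hex.EncodeToString(sum[:])[:8]
}

// BucketNameForSovereign builds catalyst-<fqdn-with-dashes>-<suffix>.
func BucketNameForSovereign(fqdn, deploymentID string) string {
	fqdn = strings.ToLower(fqdn) // assumption: names are normalised to lowercase
	host := strings.ReplaceAll(fqdn, ".", "-")
	return "catalyst-" + host + "-" + bucketSuffix(fqdn, deploymentID)
}

func main() {
	// The ID below is made up apart from its first 8 characters.
	fmt.Println(BucketNameForSovereign("omantel.omani.works", "b3b837a2c4d5e6f7"))
	// prints catalyst-omantel-omani-works-b3b837a2

	// Legacy record with no deployment ID: same input, same name every time.
	fmt.Println(BucketNameForSovereign("omantel.omani.works", ""))
}
```

Two deployments of the same FQDN with different IDs get different names, while the empty-ID path depends only on the FQDN, which is what keeps wipes of older records reproducible.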
Call-sites updated:
- handler/deployments.go: id := newID() moved before
bucket-name derivation; uses hetzner.BucketNameForSovereign
- handler/wipe.go: passes dep.ID to PurgeBuckets and to
BucketNameForSovereign in the report
- hetzner/buckets.go: PurgeBuckets signature now takes
deploymentID; bucketSuffix() handles the fallback
Tests:
- hetzner/buckets_test.go: 6-case TestBucketNameForSovereign
table covers canonical newID() shape, collision avoidance,
uppercase normalisation, empty + non-hex fallback paths.
New TestBucketNameForSovereign_CollisionAvoidance asserts
the Fix#111 invariant directly.
- handler/deployments_test.go:
TestCreateDeployment_DerivesObjectStorageBucketFromFQDN
now asserts the suffixed shape against the actual dep.ID.
- All produced names re-validated against the S3 bucket-naming
rules (mirrored regex from provisioner.s3BucketNamePattern).
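
For orientation, a name check of the kind the last bullet refers to might look like the sketch below. The pattern is a simplified reading of the S3 naming rules (3 to 63 characters; lowercase letters, digits, dots and hyphens; a letter or digit at both ends). It is an assumption and may differ from the real provisioner.s3BucketNamePattern. validBucketName is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// bucketNamePattern is a simplified S3-style rule set: 3-63 chars,
// lowercase alphanumerics plus '.' and '-', alphanumeric at both ends.
// Assumption: the real mirrored regex may be stricter.
var bucketNamePattern = regexp.MustCompile(`^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$`)

// validBucketName reports whether name satisfies the simplified pattern.
func validBucketName(name string) bool {
	return bucketNamePattern.MatchString(name)
}

func main() {
	names := []string{
		"catalyst-omantel-omani-works-b3b837a2", // suffixed shape from this PR
		"Catalyst-Omantel",                      // uppercase is rejected
		"-catalyst",                             // must start with a letter or digit
	}
	for _, n := range names {
		fmt.Printf("%-40s %v\n", n, validBucketName(n))
	}
}
```

Under a 63-character limit, the fixed parts of the new shape (the catalyst- prefix, one dash, and the 8-char suffix) take 18 characters, which would leave 45 for the dashed FQDN.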
## Claimed TCs
_None directly. Infrastructure hardening: removes the ~30 min lost
per provision cycle to regressions like PR #1311 and to bucket-name collisions._
## Verification
- go test ./internal/hetzner/... -run "Bucket" → 9/9 PASS
- go test ./internal/handler/ -run "DerivesObjectStorageBucket" → PASS
- go vet ./... → clean
- go build ./... → clean
- yaml.safe_load on workflow → clean
- pre-existing handler-package failures (whoami, continuum-switchover)
are unrelated and present on origin/main
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>