Commit Graph

758 Commits

Author SHA1 Message Date
e3mrah
115c58885b
fix(cilium-gateway): allow world ingress to reserved:ingress (unblocks Sovereign public surfaces) (#1482)
* fix(tls): cilium-gateway-cert STAGING/PROD issuer selectable via tofu

clusters/_template/sovereign-tls/cilium-gateway-cert.yaml hardcoded
letsencrypt-dns01-prod-powerdns regardless of qa_test_session_enabled.
On high-cadence QA reprovisioning cycles this hits the Let's Encrypt PROD
5/168h rate limit (caught on prov #76 at 13:45 UTC, retry-after 16:49 UTC)
and the wildcard Certificate sticks at Ready=False. Cilium Gateway then has
no valid TLS secret → envoy listener never binds → public TLS handshake
to console.<fqdn> dies with SSL_ERROR_SYSCALL.

Add tofu local.wildcard_cert_issuer = qa_test_session_enabled ?
staging : prod. Thread WILDCARD_CERT_ISSUER through the
sovereign-tls Kustomization postBuild.substitute. cilium-gateway-cert.yaml
references it as ${WILDCARD_CERT_ISSUER}.

Default behaviour unchanged for non-QA (production) Sovereigns —
they still resolve to letsencrypt-dns01-prod-powerdns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cilium-gateway): allow world ingress to Cilium Gateway reserved:ingress endpoint

When Cilium Gateway API runs with gatewayAPI.hostNetwork.enabled=true and
a default-deny CCNP is present, every public request to a Sovereign host
(console, auth, gitea, registry, api, ...) hits the gateway listener and
gets DENIED at envoy's cilium.l7policy filter with:

    cilium.l7policy: Ingress from 1 policy lookup for endpoint X for port 30443: DENY

Public response: HTTP/1.1 403 Forbidden, body "Access denied", server: envoy.

Root cause: Cilium creates a special endpoint with identity reserved:ingress (8)
representing the gateway listener. By default this endpoint has
policy-enabled=both with allowed-ingress-identities=[1 (host)] and empty
L4 rules — so no port is permitted. The default-deny CCNP's NotIn-namespace
endpointSelector does NOT cover this endpoint (it has no
io.kubernetes.pod.namespace label), and our qa-fixtures didn't ship a
matching allow-template for it. Net effect: TLS handshake succeeds, HTTPRoutes
are Programmed, backends are healthy in-cluster, but every request 403s.

Caught live on prov #80 (omantel.biz, 2026-05-14) after the Gateway hostNetwork
fix (#1480) finally activated host-bind on :30443. Verified by:
- envoy debug log: cilium.l7policy DENY for endpoint 10.42.0.201 port 30443
- cilium-dbg endpoint get 3282 -o json: l4.ingress: [] and allowed-ingress-identities: [1]
- transiently applying the same CCNP via kubectl: console.omantel.biz → 200

Fix: ship a CCNP scoped to reserved:ingress that allows ingress from world,
cluster, host, remote-node (multi-region CP-to-CP), and kube-apiserver,
plus egress to all so envoy can forward to any backend service. This is
the canonical Cilium hostNetwork Gateway-API zero-trust pattern.

Chart bump: catalyst 1.4.142 → 1.4.143.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
2026-05-14 18:50:34 +04:00
github-actions[bot]
fb99ae5fd0 deploy: update catalyst images to a88e132 2026-05-14 14:27:51 +00:00
github-actions[bot]
5752fc751f deploy: update catalyst images to bdceb3a 2026-05-14 12:45:34 +00:00
github-actions[bot]
0e4cb67319 deploy: update catalyst images to 690d588 2026-05-14 12:40:44 +00:00
github-actions[bot]
195c6b5bc5 deploy: update catalyst images to 13d79c7 2026-05-14 12:35:31 +00:00
github-actions[bot]
5527652b49 deploy: update catalyst images to f334950 2026-05-14 12:29:07 +00:00
github-actions[bot]
fb8303766e deploy: update catalyst images to 587a985 2026-05-14 10:18:12 +00:00
github-actions[bot]
bb2726bcf9 deploy: update catalyst images to f110a54 2026-05-14 06:51:04 +00:00
github-actions[bot]
b4c96a6d0d deploy: update catalyst images to df1dfed 2026-05-14 06:30:40 +00:00
github-actions[bot]
331e6b2834 deploy: update catalyst images to b4c2f54 2026-05-14 06:12:28 +00:00
github-actions[bot]
2f5b1cd0ee deploy: update catalyst images to 4814c68 2026-05-14 05:55:28 +00:00
github-actions[bot]
f5929e6114 deploy: update catalyst images to 2626d40 2026-05-14 04:27:53 +00:00
e3mrah
2626d40117
chore(catalyst-chart): bump 1.4.141 → 1.4.142 — propagate prov #72 fixes (#1466)
PR #1465 added `catalyst` + `newapi` to default-deny allowlist and
shipped `allow-kube-apiserver` CNP for qa-omantel, but the chart
version wasn't bumped so HRs across active provisions kept resolving
the OLD 1.4.141 artifact (with the broken allowlist). Bumping to
1.4.142 forces Flux on every Sovereign to upgrade and pick up the fix.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 08:25:55 +04:00
github-actions[bot]
edf8e6fd18 deploy: update catalyst images to c267ab5 2026-05-14 04:20:59 +00:00
e3mrah
c267ab5338
fix(qa-fixtures): allow catalyst+newapi NS + kube-apiserver egress (prov #72) (#1465)
* fix(flow_snapshot): region-scope dep edges (no cross-region wiring)

Founder caught on prov #66 (3dc9249ea73a6840, 2026-05-13): hel1-2's
install-* nodes all rendered dep arrows pointing at PRIMARY's install
nodes — cross-region edges where NAMING-CONVENTION §1.3 demands
independent fault domains (no cross-region wiring).

Root cause: helmwatch.Bridge persists secondary-region Jobs with bare
dep names ("install-cilium") because HR.spec.dependsOn carries chart
names without region context. The snapshot composer's normaliser
turned `install-cilium` → `<depID>:install-cilium` which IS the
primary's cilium JobID, not hel1-2's `<depID>:install-hel1-2/cilium`.
Every secondary install therefore drew a phantom cross-region edge.

Fix: in flow_snapshot_local.go, region-scope dep names when the source
Job is regional:

  jobRegion=="hel1-2" + dep="install-cilium"
    → "install-hel1-2/cilium" → "<depID>:install-hel1-2/cilium"

Same fix applied to the Layer-2 hrDeps derivation path (per-AppID
lookup also gets bare chart names from the primary watcher). hrDeps
lookup is now done with the unprefixed AppID so it actually hits.
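
The region-scoping rule described above can be sketched in Go as follows. This is a minimal illustration, not the code in flow_snapshot_local.go: the function names `scopeDep` and `jobID` are hypothetical, and treating an empty region string as "primary" is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// scopeDep rewrites a bare dependency name such as "install-cilium" into its
// region-scoped form ("install-hel1-2/cilium") when the source Job is regional.
// Assumption: an empty jobRegion stands for the primary region.
func scopeDep(jobRegion, dep string) string {
	const prefix = "install-"
	if jobRegion == "" || !strings.HasPrefix(dep, prefix) {
		return dep
	}
	chart := strings.TrimPrefix(dep, prefix)
	if strings.Contains(chart, "/") {
		return dep // already region-scoped, leave untouched
	}
	return prefix + jobRegion + "/" + chart
}

// jobID builds the "<depID>:<name>" form used in the commit message.
func jobID(depID, name string) string {
	return depID + ":" + name
}

func main() {
	// Secondary-region Job: the dep now resolves inside the same region.
	fmt.Println(jobID("dep", scopeDep("hel1-2", "install-cilium")))
	// Primary Job: the bare name is kept as-is.
	fmt.Println(jobID("dep", scopeDep("", "install-cilium")))
}
```

With this shape, a secondary install can only ever point at a Job ID in its own region, which is the property the fix is after.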

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cloud-init): wait for private NIC before k3s install (prov #71)

Hetzner Cloud hot-attaches the private-network NIC ~10-20s AFTER server
create. cloud-init init-local fetches /hetzner/v1/metadata/private-networks
BEFORE the NIC is ready, renders netplan with only eth0, and the
private NIC (kernel-renamed eth1 → enp7s0 by udev) stays DOWN.

Effect on secondary CPs: k3s server starts with
  --node-ip=10.0.<10+idx>.2 --advertise-address=10.0.<10+idx>.2
and fatals on
  "listen tcp 10.0.11.2:2380: bind: cannot assign requested address"
then crashloops. Caught on prov #71/omantel.biz/nbg1-1-cp1: k3s.service
restart counter reached 5394, kubeconfig never PUT back to mothership,
canvas showed secondary region as a permanent black hole. Diagnosed via
Hetzner rescue mode SSH 2026-05-14. The primary CP works only by luck:
the fsn1 zone attaches the NIC faster.

Fix: in cloud-init runcmd, BEFORE the k3s install, poll up to 120s for
the expected private IP (control plane) or a route to it (worker). If
the NIC appears DOWN with no netplan stanza, generate one with dhcp4:true
and `netplan apply`. Bail loudly if the IP/route never appears — failures
surface in cloud-init.log instead of disguising as a slow boot.

Symmetric fix in worker template covers autoscaler-spawned secondary
workers when worker_count > 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(qa-fixtures): allow catalyst+newapi NS + kube-apiserver egress (prov #72)

The qa-fixtures chart's `default-deny` CiliumClusterwideNetworkPolicy
excluded `catalyst-system` from its NotIn list but FORGOT `catalyst`
(where bp-self-sovereign-cutover's Jobs live: auto-trigger,
gitea-mirror, harbor-projects, registry-pivot) and `newapi` (where
bp-newapi's Application pods live).

Effect on prov #72:
- bp-self-sovereign-cutover-auto-trigger Job stuck 20m+ on HTTP 000000
  curling http://catalyst-api.catalyst-system.svc → DNS resolution + TCP
  egress both denied by default-deny. Cutover never fires → handover
  blocked → bp-catalyst-platform's --wait never completes.
- newapi-bp-newapi pod fails with `secret newapi-oidc not found`, and its
  inability to resolve apiserver compounds the issue.
- qa-omantel cnpg cluster-primary/replica stuck "Setting up primary"
  for 18m because initdb fails with `dial tcp 10.43.0.1:443 i/o timeout`:
  the ClusterIP-rewritten kube-apiserver address has no allow-egress.

Fixes:
1. Add `catalyst` + `newapi` to $excludedNamespaces — they're first-party
   blueprint namespaces analogous to catalyst-system.
2. Add `allow-kube-apiserver` CNP in qa-omantel using Cilium's canonical
   `toEntities: [kube-apiserver]` directive so cnpg initdb can reach the
   apiserver regardless of whether traffic resolves to ClusterIP, node
   IP, or Service VIP.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 08:18:54 +04:00
github-actions[bot]
5f2298c550 deploy: update catalyst images to a75463f 2026-05-14 03:42:19 +00:00
github-actions[bot]
af3a1e6375 deploy: update catalyst images to 410a3db 2026-05-13 18:05:18 +00:00
github-actions[bot]
3c38565951 deploy: update catalyst images to 4a14bbf 2026-05-13 16:34:30 +00:00
github-actions[bot]
cd5ace8dcb deploy: update catalyst images to 32e0b40 2026-05-13 15:42:13 +00:00
github-actions[bot]
55edb953d5 deploy: update catalyst images to 44913d8 2026-05-13 14:40:02 +00:00
github-actions[bot]
b6e6470ccf deploy: update catalyst images to 5f4f9f2 2026-05-13 14:01:04 +00:00
e3mrah
6fac1481d3
fix(catalyst-api): bump memory limit 1Gi → 4Gi for multi-region snapshot load (#1456)
prov #61 (2e197a934a0e0461, 2026-05-13): catalyst-api OOMKilled 6× during
phase-1 watch on a 3-region Sovereign. The in-memory state has grown
substantially since the 1Gi limit was set:

- 1 primary helmwatch.Watcher (45 HRs + informer cache)
- N secondary helmwatch.Watchers (45 HRs × 2 secondary regions, each
  with its own informer cache)
- jobs.Store backed by on-disk + in-memory tree
- per-/snapshot poll: composes per-region groups across all
  Job rows + cross-references hrDeps from the live primary watcher

Combined steady-state exceeds 1Gi on cpx32-equivalent loads. Bumped
limits to 4Gi (request 512Mi up from 128Mi). The mothership node has
8GB+ of memory and no other tight constraint. Future fix: persist region
in Job rows so secondary watchers don't need to be retained post
phase-1 (orthogonal cleanup).

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:20:00 +04:00
github-actions[bot]
2c6374b200 deploy: update catalyst images to 8518bb1 2026-05-13 12:48:59 +00:00
github-actions[bot]
ed4f66438f deploy: update catalyst images to d9d7fa2 2026-05-13 12:26:59 +00:00
github-actions[bot]
6f50bc0a4a deploy: update catalyst images to 3a08c23 2026-05-13 12:05:56 +00:00
github-actions[bot]
16f41bef56 deploy: update catalyst images to 68372d7 2026-05-12 16:13:41 +00:00
github-actions[bot]
1c6e82b83b deploy: update catalyst images to be47815 2026-05-12 16:03:56 +00:00
github-actions[bot]
034da82c00 deploy: update catalyst images to cdcc50a 2026-05-12 15:58:30 +00:00
github-actions[bot]
fc71800a52 deploy: update catalyst images to 19a847e 2026-05-12 12:30:55 +00:00
github-actions[bot]
bc0f56eb4e deploy: update catalyst images to 4923938 2026-05-12 12:15:30 +00:00
github-actions[bot]
effd75e4a7 deploy: update catalyst images to c5d891a 2026-05-12 11:26:54 +00:00
github-actions[bot]
5fb99be8e8 deploy: update catalyst images to bd5d439 2026-05-12 10:00:04 +00:00
github-actions[bot]
064fc3073f deploy: update catalyst images to 0fe0cac 2026-05-12 09:32:31 +00:00
github-actions[bot]
c80d43c6d8 deploy: update catalyst images to 2c1f767 2026-05-12 09:27:06 +00:00
github-actions[bot]
fe337d571c deploy: update catalyst images to bb1bff2 2026-05-12 08:42:18 +00:00
github-actions[bot]
24a2b13870 deploy: update catalyst images to 9da662c 2026-05-12 08:36:45 +00:00
github-actions[bot]
41787d66c6 deploy: update catalyst images to 5e96d30 2026-05-12 08:33:55 +00:00
github-actions[bot]
732949bc73 deploy: update catalyst images to f980356 2026-05-12 08:14:36 +00:00
github-actions[bot]
1a0333a43f deploy: update catalyst images to 93c3e81 2026-05-12 07:27:29 +00:00
github-actions[bot]
9011d1b635 deploy: update catalyst images to 048a4d8 2026-05-12 06:46:54 +00:00
github-actions[bot]
7e4f38ec62 deploy: update catalyst images to e3771f6 2026-05-12 06:38:32 +00:00
github-actions[bot]
59b6940c18 deploy: update catalyst images to 2fbab45 2026-05-12 06:08:41 +00:00
github-actions[bot]
4ceb74067f deploy: update catalyst images to 50bf7a5 2026-05-12 04:12:24 +00:00
e3mrah
50bf7a59ed
fix: F8 - double bp-catalyst-platform HR timeout (15m→30m) + catalyst-api phase1 budget (60m→120m) (#1428)
prov #44 (d9399223c3caa4f9) hit the catalyst-api 60m phase1 watch cap
with bp-catalyst-platform HR still mid-retry (failures=3) and 41/45 HRs
True. F1-F7 are correct and live on main (qa-finalizer-strip Completed,
autoscaler workers joined). The remaining wall is total bootstrap-kit
install time exceeding the outer watch budget on a fresh cpx42×1
Sovereign without a warm Harbor proxy-cache.

Two lock-step changes widen both bounds:

1. clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
   install.timeout 15m → 30m, upgrade.timeout 15m → 30m. The umbrella
   chart genuinely needs >15m worst case when the full SME + Catalyst
   service stack rolls cold.

2. products/catalyst/bootstrap/api/internal/helmwatch/helmwatch.go:
   DefaultWatchTimeout 60m → 120m. Worst-case inner HR retry chain is
   now 30m × 3 = 90m; the outer phase1 budget MUST be larger so the
   watch never terminates while helm-controller still has remediation
   attempts left. CATALYST_PHASE1_WATCH_TIMEOUT env-var override path
   was already wired (issue #538 baseline) — chart template now
   declares the explicit "120m" value so the runtime knob is
   discoverable for capacity-bounded environments. Per
   INVIOLABLE-PRINCIPLES.md #4 the knob remains runtime-configurable.

New unit test TestPhase1WatchConfig_ProductionDefaultIs120m pins the
F8 floor against future regression. Existing env-var override +
field-override tests still pass unchanged.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 08:10:24 +04:00
github-actions[bot]
dd095b8597 deploy: update catalyst images to b743b64 2026-05-12 02:13:30 +00:00
github-actions[bot]
d4d05f16f6 deploy: update catalyst images to 8c7d326 2026-05-12 00:38:43 +00:00
e3mrah
8c7d32616e
fix(bp-catalyst-platform): qa-finalizer-strip hook unschedulable on saturated worker (Fix #185, prov #38/#39/#41 recurrence) (#1426)
Root cause (4-layer trace on prov #41, omantel.biz, 2026-05-12 00:28 UTC):

  bp-catalyst-platform HR install.timeout=15m
    → Helm pre-install hook: qa-finalizer-strip Job (weight -99)
      → Pod requests 50m CPU + 64Mi memory (tiny)
        → BUT no tolerations → scheduler restricted to worker
          → worker cpx32 (8vCPU/16GB) at 99% CPU requests
            (7980m of 8000m allocated) after bootstrap-kit fan-out
            → FailedScheduling: "0/2 nodes are available: 1
              Insufficient cpu, 1 node(s) had untolerated taint
              {node-role.kubernetes.io/control-plane: true}"
            → autoscaler triggers scale-up worker 2→3 → "1 in backoff
              after failed scale-up" → still Pending → 15m timeout
              → InstallFailed → Flux uninstall+rollback → installFailures: 3
              → Flux gives up entirely

Live evidence quoted from chroot kubeconfig on prov #41:
  - bp-catalyst-platform HR `Reconciling=True, reason=Progressing,
    message="Running 'install' action with timeout of 15m0s"`
  - HR `Released=False, reason=InstallFailed, message="Helm install
    failed for release catalyst-system/catalyst-platform with chart
    bp-catalyst-platform@1.4.140: failed pre-install: 1 error occurred:
    * timed out waiting for the condition"`
  - Pod `qa-finalizer-strip-m2hdb` status=Pending; events:
    `Warning  FailedScheduling 108s default-scheduler 0/2 nodes are
    available: 1 Insufficient cpu, 1 node(s) had untolerated taint
    {node-role.kubernetes.io/control-plane: true}`
  - Worker `Allocated cpu 7980m (99%) of 8000m capacity`
  - Control-plane `Allocated cpu 635m (7%) of 8000m capacity` (idle)

Fix: add tolerations for the control-plane NoSchedule taint +
priorityClassName: system-cluster-critical so the qa-finalizer-strip
Job can ALWAYS schedule regardless of worker-node CPU saturation.
The hook is a defense-in-depth cleanup that runs in seconds on a
clean cluster; it legitimately belongs anywhere with free capacity
including the control-plane node (which on prov #41 had 7365m CPU
free vs. the hook's 50m request).
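
The CPU arithmetic behind the trace and the fix can be sketched in Go. This is a deliberately simplified model that looks only at CPU requests and the single control-plane taint; the real scheduler weighs far more. The `node` type and `fits` function are hypothetical names for illustration.

```go
package main

import "fmt"

// node holds CPU figures in millicores, as quoted in the trace above.
type node struct {
	name         string
	capacity     int  // CPU capacity, millicores
	allocated    int  // sum of existing CPU requests, millicores
	controlPlane bool // carries the control-plane NoSchedule taint
}

// free returns the CPU still available for new requests.
func (n node) free() int { return n.capacity - n.allocated }

// fits reports whether a pod with the given CPU request could land on n.
// Simplification: only CPU and one taint are considered.
func fits(n node, request int, toleratesControlPlane bool) bool {
	if n.controlPlane && !toleratesControlPlane {
		return false
	}
	return n.free() >= request
}

func main() {
	worker := node{name: "worker", capacity: 8000, allocated: 7980}
	cp := node{name: "control-plane", capacity: 8000, allocated: 635, controlPlane: true}
	const hookRequest = 50 // qa-finalizer-strip CPU request, millicores

	for _, tolerates := range []bool{false, true} {
		for _, n := range []node{worker, cp} {
			fmt.Printf("tolerates=%t node=%s free=%dm fits=%t\n",
				tolerates, n.name, n.free(), fits(n, hookRequest, tolerates))
		}
	}
}
```

Without the toleration neither node qualifies (20m free on the worker, taint on the control plane); with it, the control plane's 7365m of headroom absorbs the 50m request.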

Why prior fixes didn't suffice:
  - Fix #114 introduced this hook to break a finalizer-deadlock loop
    on prov #9. Correct fix for that wedge; never anticipated worker
    saturation as a scheduling failure mode for the hook itself.
  - Fix #138 (chart 1.4.138) converted the qa-cnpg-backup-s3-seed +
    qa-cnpg-status-seed hooks (weight 0/post-install) to regular
    release resources to break a circular DAG dep. Different hook
    surface.
  - Fix #184 (chart 1.4.140) raised the gitea-token-mint pre-install
    hook (weight +10) wait budget for cold-start autoscaler. That
    hook runs AFTER qa-finalizer-strip (-99 < +10); if the -99 hook
    never starts, the +10 hook never runs.

Recurring class: same family as Fix #114 (hook scheduling failure
wedges entire HR install). 3 consecutive recurrences (prov #38, #39,
#41) on chart pin 1.4.140 trigger the category-level audit threshold
(CLAUDE.md rule "CATEGORY-LEVEL THINKING"). Coupled chart hygiene
swept in same commit:

  - Switch image from bitnamilegacy/kubectl:1.29.3 (Docker-Hub
    redirect for deprecated Bitnami images, 2025-08 cutover
    documented at platform/self-sovereign-cutover/chart/values.yaml:252)
    → harbor.openova.io/proxy-dockerhub/alpine/k8s:1.31.4,
    the canonical alpine-based kubectl image already used by sibling
    hook catalyst-gitea-token-mint (Fix #163). MIRROR-EVERYTHING +
    ARCHITECT-FIRST rules.

Coordinator follow-up tickets:
  - Sibling Jobs in templates/qa-fixtures/cnpg-clusters-qa.yaml
    (qa-cnpgpair-status-seed) still reference
    bitnamilegacy/kubectl:1.29.3, the same Bitnami-deprecation class. Out of scope for this
    Fix (not part of the recurrence cluster); flagged for a sweep.
  - Worker cpx32 may be undersized for the bootstrap-kit
    fan-out on omantel.biz; separate sizing ticket, not blocking.

Changes:
  - products/catalyst/chart/templates/qa-fixtures/pre-install-finalizer-strip.yaml:
    add tolerations + priorityClassName;
    switch image to alpine/k8s:1.31.4. Inline doc comments explain
    the 4-layer trace and the Fix #114/#138/#184 history.
  - products/catalyst/chart/Chart.yaml: bump 1.4.140 → 1.4.141 with
    changelog entry capturing root cause + budget arithmetic.
  - clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
    bump HR pin 1.4.140 → 1.4.141.

Verification:
  - helm template renders cleanly (exit 0, ~6700 lines).
  - kubectl apply --dry-run=client validates the rendered Job
    manifest (job.batch/qa-finalizer-strip created (dry run)).
  - Rendered Job contains tolerations[control-plane Exists NoSchedule],
    priorityClassName: system-cluster-critical, image: alpine/k8s:1.31.4.

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 04:36:35 +04:00
github-actions[bot]
5fdd33b7c0 deploy: update catalyst images to 0ba87bb 2026-05-11 18:32:08 +00:00
github-actions[bot]
5c987309b5 deploy: update catalyst images to 5332ed0 2026-05-11 17:56:31 +00:00
github-actions[bot]
1f05e52e77 deploy: update catalyst images to 36d1f56 2026-05-11 17:47:04 +00:00