fix(cnp): allow ingress from catalyst ns (cutover Pods) — fresh-prov handover blocker (Refs PR #1785 regression, t24 zero-touch finding) (#1847)

PR #1785 (chart 1.4.171) shipped a baseline default-deny
CiliumNetworkPolicy in catalyst-system whose ingress allowlist was
limited to:

  - reserved.ingress: "" (cilium-gateway endpoint)
  - same-namespace catalyst-system Pods
  - host / remote-node / kube-apiserver entities

The bp-self-sovereign-cutover chart stamps Jobs into the `catalyst`
namespace, including the 10-auto-trigger Job whose Pod curls
catalyst-api.catalyst-system.svc.cluster.local:8080 to fire
/api/v1/internal/cutover/trigger.

With #1785 in effect on a FRESH prov, every auto-trigger Pod times
out at WAIT_TIMEOUT_SECONDS=1500s, handoverFiredAt stays null, and
the D0 auto-redirect to the Sovereign Console never happens — the
operator is stuck on mothership /jobs forever.

Caught by t24 zero-touch verification (2026-05-18):

  handover_status: "BLOCKED — cutover auto-trigger Pod in 'catalyst'
  ns cannot reach catalyst-api in 'catalyst-system' ns because
  baseline-default-deny CNP allows ingress only from {reserved.ingress,
  catalyst-system ns, host entities}"

The companion symptom on t22 was masked because t22's cutover Job
had already completed before the CNP rolled out — the CNP did not
gate ingress there.

Fix
─────────────────────────────────────────────────────────────────
Add a fourth ingress rule to baseline-default-deny allowing
fromEndpoints in the operator-tunable list
.Values.security.baselineCnp.allowedIngressNamespaces. Defaults:

  - catalyst       — cutover Pods (the load-bearing fix)
  - flux-system    — Helm/Kustomize/Source controllers probing
                     Service readiness for HelmRelease health
                     rollups (worked pre-#1785 via no-CNP default)
  - kube-system    — Cilium operator + hcloud-ccm + CoreDNS that
                     do cluster introspection calls (the
                     reserved.ingress gateway endpoint here is
                     still matched by rule 1's reserved.ingress: ""
                     selector — this rule covers non-gateway Pods)

The list mirrors the existing allowedPlatformNamespaces pattern on
the egress side. No other rule semantics change.

Chart bump 1.4.177 → 1.4.178. Companion regression to chart 1.4.177
(PR #1803, SMTP egress) — both are sub-regressions from the same
#1785 baseline-CNP ship.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
e3mrah 2026-05-18 23:45:28 +04:00 committed by GitHub
parent 61948474b5
commit 31b7dc5859
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
3 changed files with 82 additions and 2 deletions

View File

@ -1169,6 +1169,29 @@ name: bp-catalyst-platform
# overwritten by the mirror rerun at 17:54:55Z, Organization CR never
# materialised, customer-journey step 16 hung indefinitely. (Closes
# #1794 — 5th C18 layer, the last customer-journey blocker.)
# 1.4.178 (TBD-CNP-Catalyst-NS-Ingress, 2026-05-18): SECOND regression from
# #1785 (1.4.171 baseline CNPs). The catalyst-system baseline-default-deny
# CNP allowed ingress only from {reserved.ingress (cilium-gateway),
# same-namespace, host/remote-node/kube-apiserver}. The bp-self-sovereign-
# cutover chart stamps Jobs into the `catalyst` namespace, including the
# 10-auto-trigger Job whose Pod curls catalyst-api in `catalyst-system` to
# fire /api/v1/internal/cutover/trigger. With #1785 in effect on a FRESH
# prov, every auto-trigger Pod times out at WAIT_TIMEOUT_SECONDS=1500s →
# handoverFiredAt stays null → D0 auto-redirect to Sovereign Console never
# happens → operator is stuck on mothership /jobs forever. Caught by t24
# zero-touch verification (2026-05-18). The companion symptom on t22 was
# masked because t22's cutover Job had already completed before the CNP
# rolled out, so the CNP did not gate ingress.
#
# Fix: add an explicit fromEndpoints rule allowing ingress from the
# `catalyst` namespace (cutover Pods), `flux-system` (Helm/Kustomize/Source
# controllers probing Service readiness for HelmRelease health rollups),
# and `kube-system` Pods (Cilium operator + hcloud-ccm + CoreDNS that do
# cluster introspection calls — the reserved.ingress gateway endpoint also
# in kube-system is matched separately via reserved.ingress: ""). The list
# is operator-tunable via `.Values.security.baselineCnp.allowedIngressNamespaces`,
# mirroring the existing allowedPlatformNamespaces pattern on the egress
# side. Refs PR #1785 (1.4.171), t24 zero-touch finding.
# 1.4.177 (TBD-CNP-SMTP-Egress, 2026-05-18): fix regression introduced by
# #1785 (1.4.171 baseline CNPs). The catalyst-system default-deny CNP
# limited world egress to TCP/443 only, which silently broke SMTP
@ -1186,8 +1209,8 @@ name: bp-catalyst-platform
# 25/TCP (legacy SMTP fallback). All three are explicitly scoped to
# `toEntities: world`, matching the existing 443/TCP allow. No other
# rule semantics change. (Fixes PIN-issue 502 regression from #1785.)
version: 1.4.177
appVersion: 1.4.177
version: 1.4.178
appVersion: 1.4.178
# 1.4.172 — feat(security): baseline CiliumNetworkPolicies for
# catalyst-system + cilium-gateway (Closes #1746, Refs TBD-Cov-12 /
# C12-009). Cov-bench audit on the live Sovereign cluster confirmed

View File

@ -71,6 +71,7 @@ when qaFixtures is disabled.
*/}}
{{- if .Values.security.baselineCnp.enabled }}
{{- $allowedPlatform := .Values.security.baselineCnp.allowedPlatformNamespaces | default (list "keycloak" "gitea" "powerdns" "cnpg-system" "openbao" "harbor" "nats-system" "loki" "mimir" "tempo" "alloy" "opentelemetry" "external-secrets-system" "cert-manager") }}
{{- $allowedIngressNs := .Values.security.baselineCnp.allowedIngressNamespaces | default (list "catalyst" "flux-system" "kube-system") }}
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
@ -116,6 +117,46 @@ spec:
- host
- remote-node
- kube-apiserver
# 4. Allow ingress from operator-tunable adjacent namespaces.
# Defaults cover three load-bearing senders:
#
# - `catalyst` (the cutover namespace) — the
# bp-self-sovereign-cutover chart stamps Jobs into this
# namespace, INCLUDING the 10-auto-trigger Job whose Pod
# curls `catalyst-api.catalyst-system.svc.cluster.local:8080`
# to fire /api/v1/internal/cutover/trigger. Without this
# allow the auto-trigger Pod cannot reach the API service,
# the trigger never fires, handoverFiredAt stays null, and
# D0 auto-redirect to the Sovereign Console never happens —
# the operator is stuck on mothership /jobs forever
# (regression caught by t24 zero-touch verification, fresh
# prov 2026-05-18).
#
# - `flux-system` (the Flux controllers) — Helm Controller,
# Kustomize Controller, and Source Controller probe the
# backend Services they reconcile (e.g. catalyst-api
# readinessProbes for HelmRelease health rollups). Pre-#1785
# this worked implicitly because no CNP existed.
#
# - `kube-system` (the Cilium operator + kube-proxy + CCM) —
# beyond the kubelet-host probes already covered by
# rule 3, kube-system Pods (cilium-operator, hcloud-ccm,
# coredns recursive probes) need to be allowed for cluster
# introspection. The reserved.ingress endpoint that handles
# public traffic is also in kube-system but is matched by
# rule 1's `reserved.ingress: ""` selector — this rule
# covers the non-gateway kube-system Pods.
#
# Operators may extend or shrink this list via
# `.Values.security.baselineCnp.allowedIngressNamespaces`.
- fromEndpoints:
- matchExpressions:
- key: k8s:io.kubernetes.pod.namespace
operator: In
values:
{{- range $allowedIngressNs }}
- {{ . | quote }}
{{- end }}
egress:
# 1. kube-apiserver — every controller informer (organization,
# application, environment, useraccess, blueprint, tenant)

View File

@ -1509,3 +1509,19 @@ security:
- opentelemetry
- external-secrets-system
- cert-manager
# Adjacent namespaces whose Pods are allowed to INGRESS into
# catalyst-system. Defaults cover:
# - catalyst — bp-self-sovereign-cutover Jobs (incl. the
# 10-auto-trigger Pod that fires /api/v1/internal/cutover/
# trigger; without this allow handover NEVER FIRES on a
# fresh prov — regression caught by t24 zero-touch).
# - flux-system — Helm/Kustomize/Source controllers probing
# Service readiness for HelmRelease health rollups.
# - kube-system — Cilium operator + hcloud-ccm + CoreDNS that
# do cluster introspection calls into catalyst-system. The
# reserved.ingress gateway endpoint also in kube-system is
# matched by a separate rule via reserved.ingress: "" label.
allowedIngressNamespaces:
- catalyst
- flux-system
- kube-system