From eb8693e1ceee509940b92b12a4c75a9241ee1ce1 Mon Sep 17 00:00:00 2001
From: Codex <codex@openai.com>
Date: Mon, 11 May 2026 10:30:05 -0500
Subject: [PATCH] fix(multus): bump kube-multus-ds memory 50Mi/50Mi ->
 1Gi/512Mi (prevent OOM cascade)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Cluster outage 2026-05-10T17:43 through 2026-05-11 ~10:30 (~21h). Root cause:
FlowerCore.RemoteDesktop emitted 219 orphan rd-browser-only-* pods in fc-desktop
(missing OwnerReferences — see companion fix in FlowerCore.RemoteDesktop).
Kubelet's continuous CNI ADD retries for those pending pods built up a request
queue whose memory footprint exceeded the upstream default 50Mi limit on
kube-multus-ds. Multus was OOMKilled (exit 137), restarted into an even bigger
backlog, and was OOMKilled again: a positive feedback loop. Restart counts
climbed to 276 / 412 / 261 across the 3 RKE2 nodes.

Downstream blast radius: both Traefik pods stuck in ContainerCreating (101m and
4h35m), all Longhorn CSI attacher/provisioner/instance-manager pods stuck,
every Prometheus blackbox probe for *.iamworkin.lan failing, a critical
UpdateCenterPublicEdgeDown alert on update.flowercore.io, and every ArgoCD app
showing sync=Unknown because repo-server lost git connectivity. 45 Prometheus
alerts were firing.

Recovery sequence (Q-MR-1 from FlowerCore.Notes morning routine):
1. kubectl patch kube-multus-ds memory live (this commit locks it in git so
   ArgoCD doesn't revert on next sync)
2. Force-delete the 219 orphan pods (kubectl delete --grace-period=0 --force)
   to break the avalanche
3. Rollout restart kube-multus-ds — STABLE after restart with new limit
4. Restart Traefik + Longhorn CSI to clear stuck ContainerCreating
5. Verify update.flowercore.io returns 200 + ArgoCD apps reconcile
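
Steps 1-3 above can be sketched roughly as the shell below. This is an
illustration, not the exact commands that were run during the incident. It
assumes kube-multus-ds lives in kube-system (the upstream default), it uses
`kubectl set resources` in place of the raw `kubectl patch` from step 1, and
it selects orphans purely by the rd-browser-only- name prefix. The MULTUS_NS
and DRY_RUN variables and the run / recover_multus helpers are hypothetical
names introduced here. `xargs -r` is a GNU extension.

```shell
#!/bin/sh
# Sketch of recovery steps 1-3. Namespaces and helper names are assumptions.
MULTUS_NS="${MULTUS_NS:-kube-system}"  # assumption: upstream default namespace
ORPHAN_NS="fc-desktop"
DRY_RUN="${DRY_RUN:-1}"                # default: print commands, change nothing

run() {
  # Print the command in dry-run mode; execute it otherwise.
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

recover_multus() {
  # 1. Raise the memory request/limit on the live DaemonSet.
  run kubectl -n "$MULTUS_NS" set resources daemonset/kube-multus-ds \
    --containers=kube-multus --requests=memory=512Mi --limits=memory=1Gi

  # 2. Force-delete the orphan pods, matched by name prefix only.
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ kubectl -n $ORPHAN_NS delete <rd-browser-only-* pods> --grace-period=0 --force"
  else
    kubectl -n "$ORPHAN_NS" get pods -o name \
      | grep '^pod/rd-browser-only-' \
      | xargs -r kubectl -n "$ORPHAN_NS" delete --grace-period=0 --force
  fi

  # 3. Restart multus and wait for the rollout to settle.
  run kubectl -n "$MULTUS_NS" rollout restart daemonset/kube-multus-ds
  run kubectl -n "$MULTUS_NS" rollout status daemonset/kube-multus-ds
}
```

Calling `recover_multus` with the default DRY_RUN=1 only prints the commands;
`DRY_RUN=0 recover_multus` would execute them. The live change is still
reverted by ArgoCD on the next sync unless the manifest in git matches.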

Tested incrementally: a 256Mi limit was insufficient (still OOMed on the
catchup burst), 512Mi was insufficient on rke2-agent1 (most pods concentrated
there), and a 1Gi limit / 512Mi request handled the full 200+ pending pod CNI
catchup cleanly with 0 multus restarts after rollout. Nodes are 64GB with <25%
used in steady state, so the higher request is affordable, and the ~256Mi
typical working set is well within the new limit.

Companion change: FlowerCore.RemoteDesktop must set OwnerReferences on every
worker pod so future operator crashes don't leak orphans (Q-MR-2). Preventive
alerts (Q-MR-3) MultusMemoryPressure + NamespacePendingPodBacklog are coming
in a follow-up commit to apps/monitoring/.

Memory: feedback_multus_50mi_limit_oom_orphan_pod_avalanche
Decisions card: docs/dashboards/decisions-waiting.html Q-MR-1..3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 apps/multus/multus.yaml | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/apps/multus/multus.yaml b/apps/multus/multus.yaml
index 15e2a58..2a0d802 100644
--- a/apps/multus/multus.yaml
+++ b/apps/multus/multus.yaml
@@ -188,13 +188,24 @@ spec:
         - name: kube-multus
           image: ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick
           command: [ "/usr/src/multus-cni/bin/multus-daemon" ]
+          # 2026-05-11: upstream default of 50Mi memory limit OOM-cascades when
+          # an operator-owned namespace accumulates >100 pending pods retrying
+          # CNI ADD. RemoteDesktop emitted 219 orphan rd-browser-only pods
+          # (missing OwnerReferences), kubelet's CNI ADD avalanche pushed multus
+          # over 50Mi, OOMKilled, restarted with even bigger backlog → loop.
+          # 21h cluster outage. See FlowerCore.Notes:
+          #   feedback_multus_50mi_limit_oom_orphan_pod_avalanche.md
+          # 1Gi limit / 512Mi request comfortably handles a 200+ pod CNI
+          # catchup burst on 64GB nodes (nodes are <25% used in steady-state).
+          # Drop back toward 256Mi only after MultusMemoryPressure alert
+          # proves steady-state working set sits well below 200Mi.
           resources:
             requests:
               cpu: "100m"
-              memory: "50Mi"
+              memory: "512Mi"
             limits:
               cpu: "100m"
-              memory: "50Mi"
+              memory: "1Gi"
           securityContext:
             privileged: true
           terminationMessagePolicy: FallbackToLogsOnError