From eb8693e1ceee509940b92b12a4c75a9241ee1ce1 Mon Sep 17 00:00:00 2001
From: Codex
Date: Mon, 11 May 2026 10:30:05 -0500
Subject: [PATCH] fix(multus): bump kube-multus-ds memory request/limit
 50Mi/50Mi -> 512Mi/1Gi (prevent OOM cascade)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Cluster outage 2026-05-10T17:43 through 2026-05-11 ~10:30 (~21h).

Root cause: FlowerCore.RemoteDesktop emitted 219 orphan
rd-browser-only-* pods in fc-desktop (missing OwnerReferences; see the
companion fix in FlowerCore.RemoteDesktop). Kubelet's continuous CNI ADD
retries for those pending pods built up a request queue that pushed
kube-multus-ds past the upstream default 50Mi memory limit. Multus was
OOMKilled (exit 137), restarted into an even bigger backlog, and was
OOMKilled again: a positive feedback loop. Restart counts climbed to
276 / 412 / 261 across the 3 RKE2 nodes.

Downstream blast radius: both Traefik pods stuck in ContainerCreating
(for 101m and 4h35m), all Longhorn CSI attacher/provisioner/
instance-manager pods stuck, every Prometheus blackbox probe for
*.iamworkin.lan failing, UpdateCenterPublicEdgeDown critical on
update.flowercore.io, and every ArgoCD app showing sync=Unknown because
repo-server lost git connectivity. 45 firing Prometheus alerts.

Recovery sequence (Q-MR-1 from FlowerCore.Notes morning routine):
  1. kubectl patch kube-multus-ds memory live (this commit locks it in
     git so ArgoCD doesn't revert on next sync)
  2. Force-delete the 219 orphan pods
     (kubectl delete pod --grace-period=0 --force) to break the
     avalanche
  3. Rollout restart kube-multus-ds; STABLE after restart with the new
     limit
  4. Restart Traefik + Longhorn CSI to clear stuck ContainerCreating
  5. Verify update.flowercore.io returns 200 + ArgoCD apps reconcile

Tested incrementally: a 256Mi limit was insufficient (still OOMed on the
catchup burst), 512Mi was insufficient on rke2-agent1 (most pods were
concentrated there), and a 1Gi limit / 512Mi request handled the full
200+ pending pod CNI catchup cleanly with 0 multus restarts after
rollout.
Nodes are 64GB with <25% used in steady-state, so the ~256Mi typical
working-set is well within the new limit.

Companion change: FlowerCore.RemoteDesktop must set OwnerReferences on
every worker pod so future operator crashes don't leak orphans (Q-MR-2).

Preventive alerts (Q-MR-3) MultusMemoryPressure +
NamespacePendingPodBacklog are coming in a follow-up commit to
apps/monitoring/.

Memory: feedback_multus_50mi_limit_oom_orphan_pod_avalanche
Decisions card: docs/dashboards/decisions-waiting.html Q-MR-1..3

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 apps/multus/multus.yaml | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/apps/multus/multus.yaml b/apps/multus/multus.yaml
index 15e2a58..2a0d802 100644
--- a/apps/multus/multus.yaml
+++ b/apps/multus/multus.yaml
@@ -188,13 +188,24 @@ spec:
         - name: kube-multus
           image: ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick
           command: [ "/usr/src/multus-cni/bin/multus-daemon" ]
+          # 2026-05-11: upstream default of 50Mi memory limit OOM-cascades when
+          # an operator-owned namespace accumulates >100 pending pods retrying
+          # CNI ADD. RemoteDesktop emitted 219 orphan rd-browser-only pods
+          # (missing OwnerReferences), kubelet's CNI ADD avalanche pushed multus
+          # over 50Mi, OOMKilled, restarted with even bigger backlog → loop.
+          # 21h cluster outage. See FlowerCore.Notes:
+          # feedback_multus_50mi_limit_oom_orphan_pod_avalanche.md
+          # 1Gi limit / 512Mi request comfortably handles a 200+ pod CNI
+          # catchup burst on 64GB nodes (nodes are <25% used in steady-state).
+          # Drop back toward 256Mi only after MultusMemoryPressure alert
+          # proves steady-state working set sits well below 200Mi.
           resources:
             requests:
               cpu: "100m"
-              memory: "50Mi"
+              memory: "512Mi"
             limits:
               cpu: "100m"
-              memory: "50Mi"
+              memory: "1Gi"
           securityContext:
             privileged: true
           terminationMessagePolicy: FallbackToLogsOnError
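The orphan condition named in the commit message (pending rd-browser-only-* pods with no OwnerReferences) could be checked before force-deleting anything. The following is a minimal sketch in Python, not the procedure used during the recovery. It assumes the input is the parsed JSON of a pod list, as produced by `kubectl get pods -n fc-desktop -o json`, and the function name `find_orphan_pending_pods` is hypothetical.

```python
from typing import Any


def find_orphan_pending_pods(
    pod_list: dict[str, Any], name_prefix: str = "rd-browser-only-"
) -> list[str]:
    """Return names of Pending pods that match name_prefix and have no owner.

    pod_list is assumed to be a Kubernetes PodList as a parsed JSON dict.
    The default prefix comes from the commit message; adjust it as needed.
    """
    orphans = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        name = meta.get("name", "")
        phase = pod.get("status", {}).get("phase")
        # A missing or empty ownerReferences list means nothing will
        # garbage-collect this pod when its creator goes away.
        has_owner = bool(meta.get("ownerReferences"))
        if name.startswith(name_prefix) and phase == "Pending" and not has_owner:
            orphans.append(name)
    return orphans
```

One way to use it is to pipe the kubectl output into a small script that calls `find_orphan_pending_pods(json.load(sys.stdin))` and prints the names for review. Keeping the check read-only separates deciding what counts as an orphan from the destructive force-delete step.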