fix(multus): bump kube-multus-ds memory 50Mi/50Mi -> 1Gi/512Mi (prevent OOM cascade)

Cluster outage 2026-05-10T17:43 through 2026-05-11 ~10:30 (~21h). Root cause:
FlowerCore.RemoteDesktop emitted 219 orphan rd-browser-only-* pods in fc-desktop
(missing OwnerReferences — see companion fix in FlowerCore.RemoteDesktop).
Kubelet's continuous CNI ADD retries for those pending pods drove a request
queue that exceeded the upstream default 50Mi limit on kube-multus-ds. Multus
OOMKilled (exit 137), restarted with an even bigger backlog, OOMKilled again,
positive feedback loop. Restart counts climbed to 276 / 412 / 261 across the
3 RKE2 nodes.

Downstream blast radius: both Traefik pods stuck ContainerCreating (101m +
4h35m), all Longhorn CSI attacher/provisioner/instance-manager stuck, every
Prometheus blackbox probe for *.iamworkin.lan failing, UpdateCenterPublicEdgeDown
critical on update.flowercore.io, every ArgoCD app showing sync=Unknown
because repo-server lost git connectivity. 45 firing Prometheus alerts.

Recovery sequence (Q-MR-1 from FlowerCore.Notes morning routine):
1. kubectl patch kube-multus-ds memory live (this commit locks it in git so
   ArgoCD doesn't revert on next sync)
2. Force-delete the 219 orphan pods (kubectl --grace-period=0 --force) to
   break the avalanche
3. Rollout restart kube-multus-ds — STABLE after restart with new limit
4. Restart Traefik + Longhorn CSI to clear stuck ContainerCreating
5. Verify update.flowercore.io returns 200 + ArgoCD apps reconcile

Tested incrementally: 256Mi limit was insufficient (still OOMed on catchup
burst), 512Mi was insufficient on rke2-agent1 (most pods concentrated there),
1Gi/512Mi handled the full 200+ pending pod CNI catchup cleanly with 0 multus
restarts after rollout. Nodes are 64GB with <25% used in steady-state, so the
~256Mi typical working-set is well within the new limit.

Companion change: FlowerCore.RemoteDesktop must set OwnerReferences on every
worker pod so future operator crashes don't leak orphans (Q-MR-2). Preventive
alerts (Q-MR-3) MultusMemoryPressure + NamespacePendingPodBacklog are coming
in a follow-up commit to apps/monitoring/.

Memory: feedback_multus_50mi_limit_oom_orphan_pod_avalanche
Decisions card: docs/dashboards/decisions-waiting.html Q-MR-1..3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Codex
2026-05-11 10:30:05 -05:00
parent 667777a653
commit eb8693e1ce

View File

@@ -188,13 +188,24 @@ spec:
- name: kube-multus
image: ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick
command: [ "/usr/src/multus-cni/bin/multus-daemon" ]
# 2026-05-11: upstream default of 50Mi memory limit OOM-cascades when
# an operator-owned namespace accumulates >100 pending pods retrying
# CNI ADD. RemoteDesktop emitted 219 orphan rd-browser-only pods
# (missing OwnerReferences), kubelet's CNI ADD avalanche pushed multus
# over 50Mi, OOMKilled, restarted with even bigger backlog → loop.
# 21h cluster outage. See FlowerCore.Notes:
# feedback_multus_50mi_limit_oom_orphan_pod_avalanche.md
# 1Gi limit / 512Mi request comfortably handles a 200+ pod CNI
# catchup burst on 64GB nodes (nodes are <25% used in steady-state).
# Drop back toward 256Mi only after MultusMemoryPressure alert
# proves steady-state working set sits well below 200Mi.
resources:
requests:
cpu: "100m"
memory: "50Mi"
memory: "512Mi"
limits:
cpu: "100m"
memory: "50Mi"
memory: "1Gi"
securityContext:
privileged: true
terminationMessagePolicy: FallbackToLogsOnError