Cluster outage 2026-05-10T17:43 through 2026-05-11 ~10:30 (~17h). Root cause:
FlowerCore.RemoteDesktop emitted 219 orphan rd-browser-only-* pods in fc-desktop
(missing OwnerReferences; see companion fix in FlowerCore.RemoteDesktop).
Kubelet's continuous CNI ADD retries for those pending pods built up a request
queue whose memory footprint exceeded the upstream default 50Mi memory limit on
kube-multus-ds. Multus was OOMKilled (exit 137), restarted into an even bigger
backlog, and was OOMKilled again: a positive feedback loop. Restart counts
climbed to 276 / 412 / 261 across the 3 RKE2 nodes.
Downstream blast radius: both Traefik pods stuck in ContainerCreating (101m and
4h35m), all Longhorn CSI attacher/provisioner/instance-manager pods stuck, every
Prometheus blackbox probe for *.iamworkin.lan failing, UpdateCenterPublicEdgeDown
critical firing on update.flowercore.io, and every ArgoCD app showing
sync=Unknown because repo-server lost git connectivity. 45 Prometheus alerts
were firing.
Recovery sequence (Q-MR-1 from FlowerCore.Notes morning routine):
1. Patch the kube-multus-ds memory limit live with kubectl patch (this commit
locks it in git so ArgoCD doesn't revert it on the next sync)
2. Force-delete the 219 orphan pods (kubectl delete pod --grace-period=0
--force) to break the avalanche
3. Rollout restart kube-multus-ds — STABLE after restart with new limit
4. Restart Traefik + Longhorn CSI to clear stuck ContainerCreating
5. Verify update.flowercore.io returns 200 + ArgoCD apps reconcile
Tested incrementally: a 256Mi limit was insufficient (still OOMed on the catchup
burst), 512Mi was insufficient on rke2-agent1 (most pods concentrated there),
and 1Gi/512Mi handled the full 200+ pending-pod CNI catchup cleanly with 0
multus restarts after rollout. Nodes have 64GB with <25% used in steady state,
so the extra headroom is affordable, and the ~256Mi typical working set is well
within the new limit.
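
The tested setting can be pictured as the following fragment of the kube-multus-ds pod template. This is only a sketch, not the manifest in this commit: it assumes "1Gi/512Mi" means a 1Gi memory limit with a 512Mi memory request, and the container name is an assumption as well.

```yaml
# Sketch only: fragment of the kube-multus-ds DaemonSet pod template.
# Assumption: "1Gi/512Mi" = memory limit / memory request.
spec:
  template:
    spec:
      containers:
        - name: kube-multus          # assumed container name
          resources:
            requests:
              memory: 512Mi          # assumed to be the request half of the pair
            limits:
              memory: 1Gi            # replaces the upstream default of 50Mi
```

Keeping the request below the limit would let the scheduler account for the steady-state footprint while still leaving room for a catchup burst.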
Companion change: FlowerCore.RemoteDesktop must set OwnerReferences on every
worker pod so future operator crashes don't leak orphans (Q-MR-2). The
preventive alerts (Q-MR-3), MultusMemoryPressure and NamespacePendingPodBacklog,
are coming in a follow-up commit to apps/monitoring/.
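
As an illustration of the OwnerReferences requirement, a worker pod's metadata might carry an entry like the one below. The owner's apiVersion, kind, and name are hypothetical placeholders (the actual owning resource is defined in FlowerCore.RemoteDesktop, not here), and the pod name is made up for the example.

```yaml
# Hypothetical worker pod metadata; owner apiVersion/kind/name are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: rd-browser-only-example      # made-up name matching the rd-browser-only-* pattern
  namespace: fc-desktop
  ownerReferences:
    - apiVersion: example.flowercore.io/v1   # hypothetical API group/version
      kind: RemoteDesktopSession             # hypothetical owner kind
      name: example-session                  # hypothetical owner name
      # Must be the owner's real UID, filled in by the operator at creation time.
      uid: 00000000-0000-0000-0000-000000000000
      controller: true
      blockOwnerDeletion: true
# spec omitted
```

A namespaced owner has to live in the same namespace as the pod; with the reference in place, the Kubernetes garbage collector removes the pod once its owner is deleted.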
Memory: feedback_multus_50mi_limit_oom_orphan_pod_avalanche
Decisions card: docs/dashboards/decisions-waiting.html Q-MR-1..3
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds three new bluejay-infra apps that the ApplicationSet picks up automatically
(apps/* directory generator on main):
* apps/multus/multus.yaml — Multus CNI v4.2.2 thick-plugin daemonset (verbatim
upstream, project-annotated). Enables KubeVirt VMs to attach additional
network interfaces. Required by ci1 to bridge onto PROD VLAN 57.
* apps/cdi/{cdi-operator.yaml,cdi-cr.yaml,README.md} — Containerized Data
Importer v1.65.0 (verbatim upstream). Operator + CR pattern. Enables
populating PVCs from HTTP/registry/upload sources, used to load the Windows
Server 2025 ISO into the windows-server-2025-iso PVC.
* apps/kubevirt-vms/prod-vlan57-nad.yaml — NetworkAttachmentDefinition for
PROD VLAN 57 bridge. **Deploy gated on Phase 1.5 host work**: requires
br-prod bridge enslaving enp86s0.57 on each RKE2 node (Puppet config-as-code).
ci1.yaml continues to use pod-network masquerade until that lands; switching
to multus.networkName: kubevirt-vms/prod-vlan57 is a one-line YAML edit
followed by a GitOps push.
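
For orientation, a bridge-type NetworkAttachmentDefinition and the VM-side reference to it could look roughly like this. It is a sketch under assumptions, not the contents of prod-vlan57-nad.yaml or ci1.yaml: the CNI config fields, the empty IPAM block, and the network name `prod` are illustrative.

```yaml
# Sketch of a bridge-type NetworkAttachmentDefinition; not the vendored file.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: prod-vlan57
  namespace: kubevirt-vms
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "prod-vlan57",
      "type": "bridge",
      "bridge": "br-prod",
      "ipam": {}
    }
---
# Fragment of a KubeVirt VirtualMachine (spec.template.spec) after the switch.
# Assumption: the matching interface entry uses a binding compatible with a
# secondary network (for example bridge) rather than masquerade.
networks:
  - name: prod                       # illustrative network name
    multus:
      networkName: kubevirt-vms/prod-vlan57
```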
Cluster verification (2026-05-08):
- KubeVirt LIVE (3 nodes, virt-api/controller/handler/operator all Running)
- Calico CNI on /etc/cni/net.d + /opt/cni/bin (Multus default paths)
- ApplicationSet `bluejay-infra` already watches `apps/*` on main
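
The auto-pickup relies on an Argo CD git directory generator. A minimal sketch of such an ApplicationSet follows; the repository URL, Argo CD namespace, project, and destination namespace mapping are placeholders, not values from the live `bluejay-infra` definition.

```yaml
# Sketch of a git directory generator; repoURL, namespaces, and project are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: bluejay-infra
  namespace: argocd                  # assumed Argo CD namespace
spec:
  generators:
    - git:
        repoURL: https://git.example.invalid/bluejay-infra.git   # hypothetical URL
        revision: main
        directories:
          - path: apps/*             # one Application per directory under apps/
  template:
    metadata:
      name: '{{path.basename}}'
    spec:
      project: default               # assumed project
      source:
        repoURL: https://git.example.invalid/bluejay-infra.git   # hypothetical URL
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'   # assumed namespace mapping
```

With a generator of this shape, adding a directory such as apps/multus on main is enough for a new Application to be created.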
Reproducibility: upstream YAMLs are vendored verbatim; the only diffs are
project headers. Bumping a version means re-running curl against the new
upstream release and pushing to git. Nothing is fetched from the internet at
deploy time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>