Two new preventive alert rules added to the kubernetes-state group of the
K8s migration target ConfigMap. The live Podman Prometheus on noc1 has
already been updated via FlowerCore.Notes/scripts/monitoring/alerts.yml +
sudo cp + podman pod restart monitoring (this commit only locks them into
the bluejay-infra K8s mirror so a future migration carries them forward).

MultusMemoryPressure (critical, thermal_print): fires when the kube-multus
working set exceeds 80% of its memory limit for 5m. Catches the next
multus OOM cascade before it kills the daemon cluster-wide. The 2026-05-10
21h outage occurred because no alert fired on the rising multus working
set; only downstream blackbox / Traefik / service alerts triggered, after
the fact.
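
A minimal sketch of how MultusMemoryPressure could be expressed, not the
exact rule text from alerts.yml. It assumes cAdvisor's
container_memory_working_set_bytes and kube-state-metrics'
kube_pod_container_resource_limits are both scraped, and that the
container label value is kube-multus. How the thermal_print tag is
attached is not stated here, so it appears only as a comment.

```yaml
# Sketch only: metric sources and the container label value are assumptions.
- alert: MultusMemoryPressure
  expr: |
    max by (namespace, pod) (
      container_memory_working_set_bytes{container="kube-multus"}
    )
    /
    max by (namespace, pod) (
      kube_pod_container_resource_limits{container="kube-multus", resource="memory"}
    )
    > 0.80
  for: 5m
  labels:
    severity: critical
    # thermal_print routing label or annotation goes here (form not shown)
  annotations:
    summary: "kube-multus working set above 80% of its memory limit"
```

Aggregating both sides by (namespace, pod) keeps the division one-to-one,
so each multus pod is evaluated against its own limit.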

NamespacePendingPodBacklog (warning): fires when any single namespace has
more than 25 Pending pods for a sustained 30m. Catches the operator-leak
avalanche pattern (orphan pods from a crashed reconciler creating child
pods without ownerReferences) before it cascades into a CNI OOM.
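
A minimal sketch of how NamespacePendingPodBacklog could be expressed,
not the exact rule text from alerts.yml. It assumes kube-state-metrics'
kube_pod_status_phase gauge is available; the annotation wording is
illustrative.

```yaml
# Sketch only: assumes kube-state-metrics is the source of pod phase data.
- alert: NamespacePendingPodBacklog
  # kube_pod_status_phase is 1 for the phase a pod is currently in,
  # so the sum counts Pending pods per namespace.
  expr: |
    sum by (namespace) (
      kube_pod_status_phase{phase="Pending"}
    ) > 25
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "More than 25 Pending pods in {{ $labels.namespace }} for 30m"
```

The 30m hold keeps ordinary rollouts and scale-ups, which clear their
Pending pods quickly, from tripping the alert.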

See FlowerCore.Notes:
- feedback_multus_50mi_limit_oom_orphan_pod_avalanche
- feedback_monitoring_k8s_target_vs_live_podman (workflow)

Companion commits:
- bluejay-infra@eb8693e (multus memory limit)
- FlowerCore.RemoteDesktop@b02c59b (OwnerReferences fix)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>