feat(infra): retire fc-redis bootstrap after Redis Manager adoption

feat(fc-redis): add SignalR backplane for cross-product event bus (Q-SO-1 Phase A)
Per Q-SO-1 operator resolution 2026-05-11 PM, Redis SignalR backplane lands in Phase A (was Phase C deferral). Treats Redis as a managed FC infrastructure component, not a deferred scaling escalation.

Lands the minimal Phase A surface:
- Namespace fc-redis
- Single Redis 7-alpine pod with 1Gi Longhorn RWO PVC
- ConfigMap with AOF persistence (everysec), 256Mi maxmemory, allkeys-lru
- ClusterIP Service `redis.fc-redis.svc.cluster.local:6379` (in-cluster only)
- No AUTH Phase A (Phase B add via 1Password Connect rotation)
- No IngressRoute (backplane is server-to-server)

Consumers (Phase A IMPL across FC services) add:

    services.AddSignalR().AddStackExchangeRedis(
        "redis.fc-redis.svc.cluster.local:6379",
        opts => opts.Configuration.ChannelPrefix =
            StackExchange.Redis.RedisChannel.Literal("fc-opsconsole"));

Phase B/C follow-ons (not in this commit): Sentinel for HA, AUTH password from 1Password, redis_exporter sidecar for Prometheus, network policies.

See FlowerCore.Notes/docs/signage/operations-console-phase-2-design.md section 3.5 (rewritten) and decisions-waiting.html Q-SO-1 (RESOLVED).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
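
For orientation, the ConfigMap and Service bullets above could look roughly like the sketch below. This is not the manifest from the commit, which is not shown in this diff. The ConfigMap name `redis-config`, the `redis.conf` key, and the `app: redis` selector label are assumptions made for illustration. Only the namespace, the Service DNS name and port, and the Redis settings come from the message.

```yaml
# Illustrative sketch only; names marked "assumed" do not come from the commit.
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-config            # assumed name
  namespace: fc-redis
data:
  redis.conf: |                 # assumed key
    appendonly yes
    appendfsync everysec
    maxmemory 256mb
    maxmemory-policy allkeys-lru
---
apiVersion: v1
kind: Service
metadata:
  name: redis                   # resolves as redis.fc-redis.svc.cluster.local
  namespace: fc-redis
spec:
  type: ClusterIP               # in-cluster only, no IngressRoute
  selector:
    app: redis                  # assumed pod label
  ports:
    - name: redis
      port: 6379
      targetPort: 6379
```

In this sketch `maxmemory 256mb` is the redis.conf spelling of the 256Mi cap, since Redis reads `mb` as 1024*1024 bytes. With `allkeys-lru`, Redis evicts least-recently-used keys once that cap is reached rather than rejecting writes.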
2026-05-12 02:22:47 -05:00 · 2026-05-11 19:02:58 -05:00 · 2026-05-11 18:37:15 -05:00 · 2026-05-11 10:42:27 -05:00 · 2026-05-11 10:30:05 -05:00 · 2026-05-08 21:35:00 -05:00
6 changed files with 335 additions and 33 deletions
--- a/apps/fc-updater/fc-updater.yaml
+++ b/apps/fc-updater/fc-updater.yaml
@@ -58,7 +58,7 @@ spec:
      nodeName: rke2-server
      containers:
        - name: web
-          image: localhost/fc-updater-web:v20260507-public-privacy
+          image: localhost/fc-updater-web:v20260509-4162dca-authgate
          imagePullPolicy: Never
          ports:
            - containerPort: 8080
--- a/apps/kubevirt-vms/ci1.yaml
+++ b/apps/kubevirt-vms/ci1.yaml
@@ -6,14 +6,29 @@
 # `bluejay-ws-sandbox-1` runner placeholder. Andrew explicitly does NOT want
 # BLUEJAY-WS registered as a runner (workstation has personal/operator state).
 #
-# Status (2026-05-08): STAGED ONLY — DO NOT APPLY without operator review.
+# Storage layout (2026-05-08):
-# See docs/infrastructure/windows-server-build-runner-plan.md "Phase 1 readiness gate".
+#   * ISO is now sourced from Synology NFS (Path B) — see
 #     win2025-iso-nfs-pv.yaml. The Longhorn Filesystem PVC
 #     `windows-server-2025-iso` below is RETAINED but UNUSED so the prior
 #     CDI upload state is preserved as a fallback (and so ArgoCD doesn't
 #     prune it on this commit). It can be deleted in a follow-up commit
 #     after the NFS path is proven on a successful Windows install.
 #
-# Prerequisites that MUST be satisfied first:
+# Status (2026-05-08): LIVE — Phase 1 prereqs satisfied:
-#   1. Windows Server 2025 ISO populated into the `windows-server-2025-iso` PVC
+#   * Multus CNI v4.2.2 thick-plugin DaemonSet running on all 3 RKE2 nodes
-#      (operator interactive step — Microsoft Evaluation Center download).
+#     (apps/multus/multus.yaml; ApplicationSet `infra-multus` Synced/Healthy)
-#   2. Either Multus + PROD VLAN NAD (preferred) OR pod-network only (this YAML).
+#   * CDI v1.65.0 operator + CR Deployed (apps/cdi/; ApplicationSet
-#   3. KubeVirt CR feature gates: none required for non-persistent vTPM.
+#     `infra-cdi` Synced/Healthy; uploadproxy reachable via kubectl port-forward)
 #   * Windows Server 2025 ISO uploaded via CDI virtctl image-upload to
 #     PVC windows-server-2025-iso (7.7 GiB → 10Gi PVC, Bound, Upload Complete)
 #   * Local Administrator password generated, stored in 1Password vault
 #     IAmWorkin (qaphopopkryhbg353ukzhhuqoq) item id h3ix4mgfk65gmkcmvh6ly3d3hu
 #   * NetworkAttachmentDefinition prod-vlan57 registered (apps/kubevirt-vms/
 #     prod-vlan57-nad.yaml). VM still uses pod-network masquerade until Phase 1.5
 #     host bridge work lands (Puppet br-prod + enp86s0.57); switching is a
 #     one-line YAML edit + git push.
 #
 # See docs/infrastructure/windows-server-build-runner-plan.md "Phase 1 readiness gate".
 #
 # Network choice in this draft: **pod-network fallback** (Calico default).
 # Outbound-only is fine for the Updater Sandbox E2E runner workload (the runner
@@ -42,21 +57,49 @@ metadata:
    pod-security.kubernetes.io/enforce: privileged
 ---
-# ISO PVC — operator must populate this before applying the VM manifest.
+# ISO PVC — populated via CDI virtctl image-upload (CDI is now installed).
-# Population paths (see plan doc "Phase 1 readiness gate", section 2):
+#
-#   Path A — manual upload via helper pod + kubectl cp
+# **Volume mode (2026-05-08 status):** Filesystem-mode PVC. A migration to
-#   Path B — install CDI, then DataVolume HTTP import
+# `volumeMode: Block` via DataVolume was attempted to address an OVMF SATA
 # CDROM read timeout, but CDI v1.65.0's upload-target pod runs as uid 107
 # with `capabilities.drop: [ALL]` and cannot open the underlying block
 # device (`blockdev: cannot open /dev/cdi-block-volume: Permission denied`).
 # Reverted to Filesystem PVC pending one of:
 #   - CDI deployment override granting CAP_SYS_RAWIO to upload pod
 #   - Pre-populated PVC via privileged init pod that dd's the ISO directly
 #   - Migration to a different storage class that exposes block devices
 #     differently (e.g. iSCSI, where Longhorn's CSI mount path may behave
 #     differently)
 #
 # Population workflow (this PVC, Filesystem mode):
 #   1. virtctl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml image-upload pvc \
 #        windows-server-2025-iso -n kubevirt-vms \
 #        --image-path "$env:USERPROFILE\Downloads\en-us_windows_server_2025_updated_march_2026_x64_dvd_8e06425a.iso" \
 #        --size 10Gi --storage-class longhorn --access-mode ReadWriteOnce \
 #        --uploadproxy-url https://localhost:8443 --insecure
 #   (--uploadproxy-url uses port-forward in practice: `kubectl port-forward
 #   -n cdi service/cdi-uploadproxy 8443:443 &` first.)
 #
 # **Open boot issue:** even with the ISO at bootOrder:1, OVMF console showed:
 #   BdsDxe: starting Boot0001 "UEFI QEMU DVD-ROM QM00001 " from ... Sata(...)
 #   BdsDxe: failed to start Boot0001 ... Time out
 # Diagnosis confirmed PVC content IS a valid bootable ISO9660 image — the
 # timeout is in OVMF reading from the SATA-CDROM-backed-by-filesystem-PVC.
 # Block mode would likely fix it; see CDI permission issue above.
 apiVersion: v1
 kind: PersistentVolumeClaim
 metadata:
  name: windows-server-2025-iso
  namespace: kubevirt-vms
  labels:
    app: ci-runner
    flowercore.io/managed-by: bluejay-infra
 spec:
  accessModes:
    - ReadWriteOnce          # Bump to ReadOnlyMany after population for multi-VM use
  resources:
    requests:
-      storage: 6Gi
+      storage: 10Gi          # Server 2025 ISO is 7.7GB; 10Gi for headroom
  storageClassName: longhorn
 ---
@@ -220,10 +263,16 @@ data:
          </OOBE>
          <UserAccounts>
            <AdministratorPassword>
-              <!-- IMPORTANT: replace the Value below with a real password BEFORE applying.
+              <!-- Real password is in 1Password — vault qaphopopkryhbg353ukzhhuqoq,
-                   Generate via: $pw = "YourPasswordHere" + "AdministratorPassword";
+                   item id h3ix4mgfk65gmkcmvh6ly3d3hu, title:
-                                 [Convert]::ToBase64String([Text.Encoding]::Unicode.GetBytes($pw)) -->
+                   "ci1 Administrator (Windows Server 2025 KubeVirt VM)".
-              <Value>UABMAEEAQwBFAEgATwBMAEQARQBSAEEAZABtAGkAbgBpAHMAdAByAGEAdABvAHIAUABhAHMAcwB3AG8AcgBkAA==</Value>
+                   Field "autounattend AdministratorPassword Value (UTF-16-LE base64)"
                   matches the Value below.
                   To rotate: regenerate, recompute base64
                     $combined = $pw + "AdministratorPassword"
                     [Convert]::ToBase64String([Text.Encoding]::Unicode.GetBytes($combined))
                   then update both 1P item AND this Value field, recreate VM. -->
              <Value>bAA3AGsANABOAHcAcgBMAG4AeQBTAHUAYgBBAHQAaQBzAFUAcAB6AEMAWQAhADkAYQBCAEEAZABtAGkAbgBpAHMAdAByAGEAdABvAHIAUABhAHMAcwB3AG8AcgBkAA==</Value>
              <PlainText>false</PlainText>
            </AdministratorPassword>
          </UserAccounts>
@@ -260,7 +309,33 @@ metadata:
    role: github-actions-runner
    flowercore.io/managed-by: bluejay-infra
 spec:
-  running: false   # Set to true after operator approves + ISO loaded
+  # `running: true` is deprecated in favor of `runStrategy`. They are mutually
  # exclusive — KubeVirt's validating webhook rejects any VM that sets both:
  #   admission webhook "virtualmachine-validator.kubevirt.io" denied the request:
  #   Running and RunStrategy are mutually exclusive.
  # `Always` keeps a VMI running and restarts it if it crashes/exits — same
  # semantics as the old `running: true`.
  #
  # **2026-05-08 status: VM cannot start due to a stale QEMU flock on the
  # rootdisk PVC** (qemu reports `Failed to get "write" lock` on
  # `/var/run/kubevirt-private/vmi-disks/rootdisk/disk.img`). The flock was
  # left by a previous QEMU process during a force-deleted launcher pod
  # cycle. Recovery requires either (a) a Longhorn engine restart on
  # rke2-agent2, (b) a Longhorn volume detach via the longhorn-manager API
  # (kubectl patch on `volume.longhorn.io/<pvc-name>` does not work — the
  # spec.nodeID is reconciled back), or (c) a node reboot of rke2-agent2.
  #
  # **Confirmed working:** the bootOrder swap (windows-iso=1, rootdisk=2)
  # and the runStrategy migration (above). The ISO PVC was successfully
  # repopulated via virtctl image-upload pvc on the Filesystem-mode PVC.
  #
  # **Open: SATA CDROM read timeout** — even with bootOrder=1, OVMF reported
  # `BdsDxe: failed to start Boot0001 ... Time out` reading the SATA CDROM
  # backed by the Filesystem-mode PVC. A switch to Block-mode DataVolume
  # was attempted but blocked by a CDI v1.65.0 upload-pod permission issue
  # (capability drop prevents writing to the underlying block device).
  # See header docstring on the ISO PVC.
  runStrategy: Always   # LIVE — ISO uploaded 2026-05-08, password in 1P
  template:
    metadata:
      labels:
@@ -302,18 +377,60 @@ spec:
        firmware:
          bootloader:
            efi:
-              secureBoot: true
+              # 2026-05-08: SecureBoot=false during initial install. With SecureBoot
              # enabled, OVMF's BdsDxe times out reading Boot0001 from the SCSI
              # CDROM ("BdsDxe: failed to start Boot0001 ... Time out") before the
              # EFI bootloader signature can verify against the OVMF VARS trust DB.
              # KubeVirt's `/usr/share/OVMF/OVMF_VARS.secboot.fd` template doesn't
              # appear to include the Microsoft KEK/DB by default, so signed
              # Windows EFI bootloaders fail validation. Disabling SecureBoot lets
              # OVMF skip the chain check and boot directly. This is acceptable for
              # a CI runner — TPM 2.0 is still emulated (`tpm: {}` below) so
              # BitLocker / Hyper-V / WSL still work.
              # When the operator wants SecureBoot back, the path is:
              #   1. Custom-build OVMF_VARS.fd with Microsoft KEK/DB enrolled
              #   2. Mount it into the VM via firmware.bootloader.efi.persistent
              #   3. Set secureBoot: true again
              # Tracked separately from the install unblock.
              secureBoot: false
        devices:
          tpm: {}             # Non-persistent vTPM — sufficient for runner; no BitLocker
          disks:
-            - name: rootdisk
+            # bootOrder: ISO must be 1 for first-boot install (the rootdisk has no
            # EFI bootloader yet). After Windows installs, it writes its own UEFI
            # Boot#### entries pointing at the rootdisk's EFI partition; UEFI then
            # boots from rootdisk going forward and the ISO at bootOrder:2 acts as
            # a fallback for re-install scenarios.
            #
            # Original (broken) order had rootdisk=1, windows-iso=2 — UEFI tried
            # the empty virtio disk first, got nothing, fell back to the SATA
            # CDROM at Boot0001 with a short timeout, and timed out before the
            # CDROM enumerated. Console showed:
            #   BdsDxe: failed to start Boot0001 ... Time out
            #   BdsDxe: No bootable option or device was found.
            # Confirmed via debug pod: PVC content IS a real bootable ISO9660
            # (file: "ISO 9660 CD-ROM filesystem data ... (bootable)"), so the
            # only bug was boot priority.
            # 2026-05-08 PM: cdrom bus SCSI + containerDisk delivery. This
            # combination boots qemu cleanly and reaches OVMF, but OVMF
            # BdsDxe still hits "starting Boot0001 ... Time out" on the
            # cdrom — see HANDOFF.md / CODEX-STATUS.md "OPEN — ci1" for the
            # full diagnostic chain. virtio-blk disk swap was attempted as a
            # workaround but introduced a separate QEMU rootdisk flock issue
            # without fixing the underlying OVMF cdrom problem; reverted.
            # Operator decision needed for next architectural step (OVMF
            # custom build with extended timeout, KubeVirt version bump,
            # Hyper-V/VirtualBox-and-export, or BIOS legacy boot). The
            # containerDisk distribution pipeline (build/save/scp/ctr import)
            # is proven and ready to reuse for any of those.
            - name: windows-iso
              bootOrder: 1
              cdrom:
                bus: scsi
            - name: rootdisk
              bootOrder: 2
              disk:
                bus: virtio
            - name: windows-iso
              bootOrder: 2
              cdrom:
                bus: sata
            - name: virtio-drivers
              cdrom:
                bus: sata
@@ -340,11 +457,50 @@ spec:
          persistentVolumeClaim:
            claimName: ci1-rootdisk
        - name: windows-iso
-          persistentVolumeClaim:
+          # 2026-05-08 PM (Path C, CONTAINERDISK): the ISO is now packaged as
-            claimName: windows-server-2025-iso
+          # a KubeVirt containerDisk OCI image baked from
          # `FROM scratch ; ADD --chown=107:107 disk.img /disk/disk.img`.
          # The qemu user (uid 107) reads the ISO directly from a tmpfs view
          # of the OCI layer, bypassing both:
          #   - Synology NFS export ACL (Path B failed: uid 107 denied at
          #     directory level even with mode 0777, see memory
          #     feedback_synology_iso_export_root_only_uid_107_denied)
          #   - OVMF cdrom read-window timeout (Path A and Path B's SCSI
          #     retry both hit `BdsDxe: failed to start Boot0001 ... Time out`
          #     when the cdrom was backed by a PVC the storage controller
          #     couldn't satisfy reads from fast enough).
          #
          # Image build (one-time, per ISO version):
          #   1. Copy ISO to disk.img, write Dockerfile
          #   2. podman build --tag localhost/win-server-2025:1.0 .  (on noc1)
          #   3. podman save -o win-server-2025-1.0.tar localhost/win-server-2025:1.0
          #   4. SCP tar to all 3 RKE2 nodes (rke2-server, rke2-agent1, rke2-agent2)
          #   5. sudo /var/lib/rancher/rke2/bin/ctr -a /run/k3s/containerd/containerd.sock \
          #        -n k8s.io images import /tmp/win-server-2025-1.0.tar
          # Standard FC pattern per `feedback_rke2_localhost_imagepullpolicy`.
          #
          # When a new Windows ISO version ships, bump the tag (1.1, 1.2, ...),
          # rebuild + redistribute, and update the image: line below in a new
          # commit. KubeVirt picks up the new image via a VM restart.
          #
          # The legacy NFS PVC + PV (apps/kubevirt-vms/win2025-iso-nfs-pv.yaml)
          # and CDI Longhorn PVC (`windows-server-2025-iso`) are RETAINED for
          # this commit so the prior states are recoverable. Once the
          # containerDisk path proves on a successful Windows install, both
          # legacy artifacts can be pruned in a follow-up commit.
          containerDisk:
            image: localhost/win-server-2025:1.0
            imagePullPolicy: Never
        - name: virtio-drivers
          containerDisk:
-            image: quay.io/kubevirt/virtio-container-disk
+            # Pinned to v1.8.2 (latest stable as of 2026-05-08).
            # The :latest tag uses Docker manifest v1 schema which containerd
            # 2.1 (RKE2 v1.34.5) refuses to pull with:
            #   "media type application/vnd.docker.distribution.manifest.v1+prettyjws
            #    is no longer supported since containerd v2.1"
            # v1.8.2 is rebuilt with manifest v2/OCI and works on containerd 2.1.
            # Bump available: https://quay.io/repository/kubevirt/virtio-container-disk?tab=tags
            image: quay.io/kubevirt/virtio-container-disk:v1.8.2
        - name: sysprep
          sysprep:
            configMap:
--- a/apps/kubevirt-vms/win2025-iso-nfs-pv.yaml
+++ b/apps/kubevirt-vms/win2025-iso-nfs-pv.yaml
@@ -0,0 +1,99 @@
 # =============================================================================
 # Windows Server 2025 ISO — Static NFS PV (Path B for SATA-CDROM timeout)
 # =============================================================================
 # Purpose: Mount the ISO from Synology NAS via NFS instead of from a Longhorn-
 # backed Filesystem PVC.
 #
 # Why: SATA-CDROM emulation reading from a Longhorn-backed Filesystem PVC is
 # too slow for OVMF's boot read window — the DVD-ROM enumeration times out
 # before the bootloader can be read. Symptom on the serial console:
 #   BdsDxe: failed to start Boot0001 "UEFI QEMU DVD-ROM QM00001 " from ...
 #   BdsDxe: failed to start Boot0001 ... Time out
 #   BdsDxe: No bootable option or device was found
 # Diagnosis confirmed the ISO content is a perfectly valid bootable ISO9660
 # image — the bug is in the timing path between OVMF and Longhorn-backed
 # storage, not in the ISO itself.
 #
 # Block-mode PVC was tried (`volumeMode: Block` via DataVolume) and would
 # likely fix the timing, but CDI v1.65.0's upload-target pod cannot open the
 # block device due to runAsUser:107 + capabilities.drop:[ALL] and we got:
 #   blockdev: cannot open /dev/cdi-block-volume: Permission denied
 #
 # NFS-mounted ISO bypasses both issues: no Longhorn slowness, no CDI upload
 # pod permission concerns. The ISO is read directly from the NAS over a
 # native NFSv4.1 mount that QEMU's SATA emulator can read at full LAN speed.
 #
 # Layout on Synology:
 #   /volume1/ISOs/                                              (existing export, RKE2 ACL)
 #     en-us_windows_server_2025_updated_march_2026_x64_dvd_8e06425a.iso
 #     win2025-iso-disk/                                         (new subdir, 2026-05-08)
 #       disk.img -> hardlink to ../en-us_windows_server_2025_..._8e06425a.iso
 #
 # KubeVirt's launcher pod expects a PVC mounted at
 # /var/run/kubevirt-private/vmi-disks/<diskName>/disk.img — by mounting the
 # `win2025-iso-disk/` subdir as the NFS PV root, `disk.img` lives at the PV's
 # root and KubeVirt's CDROM emulator finds it without any path manipulation.
 #
 # A symlink would NOT work for sub-path NFS mounts (the relative target
 # `../...iso` falls outside the sub-mount root). A hardlink works because it
 # references the same inode regardless of mount point.
 #
 # Memory references:
 #   - feedback_synology_nfs_volume1_kubernetes_export_scoped (Synology export
 #     scoping pattern — but /volume1/ISOs export, unlike /volume1/kubernetes,
 #     does support sub-path mounts because Synology NFS is configured with
 #     pseudo-fs in NFSv4.1)
 #   - feedback_kubevirt_iso_first_install_bootorder_and_runstrategy (boot
 #     order / runStrategy gotchas, separate from the storage timing issue)
 #
 # Validation (2026-05-08, from rke2-server / rke2-agent1 / rke2-agent2):
 #   mount -t nfs -o nfsvers=4.1,ro 10.0.58.3:/volume1/ISOs/win2025-iso-disk /tmp/m
 #   file /tmp/m/disk.img
 #     -> ISO 9660 CD-ROM filesystem data 'SSS_X64FRE_EN-US_DV9' (bootable)
 # All 3 RKE2 nodes can mount and read.
 # =============================================================================
 apiVersion: v1
 kind: PersistentVolume
 metadata:
  name: windows-server-2025-iso-nfs
  labels:
    flowercore.io/iso: windows-server-2025
    flowercore.io/managed-by: bluejay-infra
 spec:
  capacity:
    storage: 8Gi
  accessModes:
    - ReadOnlyMany
  volumeMode: Filesystem
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""              # static, no provisioner
  mountOptions:
    - nfsvers=4.1
    - ro
    - hard
    - timeo=600
    - retrans=3
  nfs:
    server: 10.0.58.3               # BlueJayNAS Synology DS1621+ on HOME VLAN 58
    path: /volume1/ISOs/win2025-iso-disk
    readOnly: true
 ---
 apiVersion: v1
 kind: PersistentVolumeClaim
 metadata:
  name: windows-server-2025-iso-nfs
  namespace: kubevirt-vms
  labels:
    app: ci-runner
    flowercore.io/managed-by: bluejay-infra
 spec:
  accessModes:
    - ReadOnlyMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 8Gi
  storageClassName: ""
  volumeName: windows-server-2025-iso-nfs
--- a/apps/monitoring/noc-monitoring.yaml
+++ b/apps/monitoring/noc-monitoring.yaml
@@ -974,6 +974,39 @@ data:
              summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replica mismatch"
              description: "Spec wants {{ $labels.spec_replicas }} but only {{ $value }} available. Likely a rollout stuck on probe failure, scheduling, or PVC."
          # Q-MR-3 (2026-05-11): multus memory pressure — catches the next OOM
          # cascade BEFORE multus is OOM-killed cluster-wide. The 2026-05-10
          # outage (21h) hit because no alert fired on the rising multus working
          # set — only downstream blackbox / Traefik / service alerts. With
          # 1Gi limit (bluejay-infra@eb8693e), 80% = ~800MiB; steady-state
          # runs ~150-250MiB so this only fires when an avalanche starts.
          - alert: MultusMemoryPressure
            expr: |
              container_memory_working_set_bytes{container="kube-multus"}
                / container_spec_memory_limit_bytes{container="kube-multus"} > 0.8
            for: 5m
            labels:
              severity: critical
              alert_channel: thermal_print
            annotations:
              summary: "kube-multus memory >80% of limit on {{ $labels.node }} for 5m"
              description: "kube-multus working set is {{ $value | humanizePercentage }} of its memory limit on node {{ $labels.node }}. If this keeps climbing, multus will OOM and all new pod networking will halt cluster-wide (precedent: 2026-05-10 outage)."
          # Q-MR-3 (2026-05-11): namespace pending-pod backlog — catches the
          # operator-leak avalanche pattern BEFORE it cascades into a multus
          # CNI OOM. Any FC operator (RemoteDesktop / Distribution / WorldBuilder)
          # emitting pods without ownerReferences will accumulate them when
          # the operator crashes. >25 pending pods in any namespace for 30m
          # is the signal to investigate the reconciler.
          - alert: NamespacePendingPodBacklog
            expr: sum by (namespace) (kube_pod_status_phase{phase="Pending"}) > 25
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: "Namespace {{ $labels.namespace }} has {{ $value }} Pending pods for 30m"
              description: "Pending pod count in {{ $labels.namespace }} exceeds 25 sustained for 30m. Likely operator-leak avalanche pattern — children emitted without ownerReferences. Risk of multus CNI OOM cascade."
      # Longhorn storage health alerts. Required: longhorn scrape job
      # (added 2026-04-26 — see scrape_configs above). The K8s events
      # for "snapshot becomes not ready to use" are transient lifecycle
--- a/apps/multus/multus.yaml
+++ b/apps/multus/multus.yaml
@@ -188,13 +188,24 @@ spec:
        - name: kube-multus
          image: ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick
          command: [ "/usr/src/multus-cni/bin/multus-daemon" ]
          # 2026-05-11: upstream default of 50Mi memory limit OOM-cascades when
          # an operator-owned namespace accumulates >100 pending pods retrying
          # CNI ADD. RemoteDesktop emitted 219 orphan rd-browser-only pods
          # (missing OwnerReferences), kubelet's CNI ADD avalanche pushed multus
          # over 50Mi, OOMKilled, restarted with even bigger backlog → loop.
          # 21h cluster outage. See FlowerCore.Notes:
          #   feedback_multus_50mi_limit_oom_orphan_pod_avalanche.md
          # 1Gi limit / 512Mi request comfortably handles a 200+ pod CNI
          # catchup burst on 64GB nodes (nodes are <25% used in steady-state).
          # Drop back toward 256Mi only after MultusMemoryPressure alert
          # proves steady-state working set sits well below 200Mi.
          resources:
            requests:
              cpu: "100m"
-              memory: "50Mi"
+              memory: "512Mi"
            limits:
              cpu: "100m"
-              memory: "50Mi"
+              memory: "1Gi"
          securityContext:
            privileged: true
          terminationMessagePolicy: FallbackToLogsOnError
--- a/apps/telephony/telephony.yaml
+++ b/apps/telephony/telephony.yaml
@@ -127,10 +127,13 @@ spec:
      initContainers:
        - name: fix-data-perms
          image: busybox:latest
-          # Also chown /shared-tts (hostPath /tmp/tts-audio) so the non-root
+          # Must run as root to chown the hostPath /tmp/tts-audio that may be
-          # app user (uid 1654) can write Piper .sln16 files that Asterisk
+          # root-owned after node reboot. Pod-level runAsNonRoot:true would
-          # reads at /var/lib/asterisk/sounds/tts. World-readable (755) is
+          # otherwise inherit and chown would fail with EPERM (see Notes memory
-          # fine — Asterisk runs as a different uid in the other pod.
+          # feedback_hostpath_initcontainer_chown_perms).
          securityContext:
            runAsUser: 0
            runAsNonRoot: false
          command: ["sh", "-c", "chown -R 1654:1654 /data && chown 1654:1654 /shared-tts && chmod 0755 /shared-tts"]
          volumeMounts:
            - name: telephony-data