bluejay-infra/apps at eb8693e1ceee509940b92b12a4c75a9241ee1ce1

bluejay/bluejay-infra

Codex eb8693e1ce fix(multus): bump kube-multus-ds memory 50Mi/50Mi -> 1Gi/512Mi (prevent OOM cascade)

Cluster outage 2026-05-10T17:43 through 2026-05-11 ~10:30 (~21h). Root cause:
FlowerCore.RemoteDesktop emitted 219 orphan rd-browser-only-* pods in fc-desktop
(missing OwnerReferences — see companion fix in FlowerCore.RemoteDesktop).
Kubelet's continuous CNI ADD retries for those pending pods built up a request
queue whose memory footprint exceeded the upstream default 50Mi limit on
kube-multus-ds. Multus was OOMKilled (exit 137), restarted into an even bigger
backlog, and was OOMKilled again: a positive feedback loop. Restart counts
climbed to 276 / 412 / 261 across the 3 RKE2 nodes.
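
For illustration only, a minimal sketch of how orphans like these could be
spotted from a pod listing. It assumes the input is shaped like the "items"
array of `kubectl get pods -n fc-desktop -o json`; the function name and the
sample data are hypothetical and not part of this commit.

```python
from typing import Any


def find_orphans(pods: list[dict[str, Any]],
                 prefix: str = "rd-browser-only-") -> list[str]:
    """Return names of pods that match the prefix and have no ownerReferences."""
    orphans = []
    for pod in pods:
        meta = pod.get("metadata", {})
        name = meta.get("name", "")
        # A missing or empty ownerReferences list means nothing will
        # garbage-collect the pod when its creator goes away.
        if name.startswith(prefix) and not meta.get("ownerReferences"):
            orphans.append(name)
    return orphans


# Hypothetical sample data, not taken from the incident.
sample = [
    {"metadata": {"name": "rd-browser-only-a1"}},
    {"metadata": {"name": "rd-browser-only-b2",
                  "ownerReferences": [{"kind": "ReplicaSet",
                                       "name": "rd", "uid": "u-1"}]}},
    {"metadata": {"name": "traefik-0"}},
]
print(find_orphans(sample))
```

Filtering on both the name prefix and the absence of ownerReferences keeps
unrelated bare pods in the namespace out of the result.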

Downstream blast radius: both Traefik pods stuck ContainerCreating (101m +
4h35m), all Longhorn CSI attacher/provisioner/instance-manager stuck, every
Prometheus blackbox probe for *.iamworkin.lan failing, UpdateCenterPublicEdgeDown
critical on update.flowercore.io, every ArgoCD app showing sync=Unknown
because repo-server lost git connectivity. 45 firing Prometheus alerts.

Recovery sequence (Q-MR-1 from FlowerCore.Notes morning routine):
1. kubectl patch kube-multus-ds memory live (this commit locks it in git so
   ArgoCD doesn't revert on next sync)
2. Force-delete the 219 orphan pods (kubectl --grace-period=0 --force) to
   break the avalanche
3. Rollout restart kube-multus-ds — STABLE after restart with new limit
4. Restart Traefik + Longhorn CSI to clear stuck ContainerCreating
5. Verify update.flowercore.io returns 200 + ArgoCD apps reconcile
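
A sketch of the patch body that step 1 implies, under stated assumptions: the
container name "kube-multus" is a guess at the upstream manifest's name, and
"1Gi/512Mi" is read here as limit/request. Neither is confirmed by this commit
message.

```python
import json


def memory_patch(container: str, limit: str, request: str) -> dict:
    """Build a strategic-merge patch that sets one container's memory."""
    return {
        "spec": {"template": {"spec": {"containers": [
            {
                # Strategic merge matches list entries by container name,
                # so other containers and fields are left untouched.
                "name": container,
                "resources": {
                    "limits": {"memory": limit},
                    "requests": {"memory": request},
                },
            }
        ]}}}
    }


# Assumed container name and limit/request reading; adjust to the real manifest.
patch = memory_patch("kube-multus", "1Gi", "512Mi")
print(json.dumps(patch))
```

The printed JSON could be passed to
`kubectl -n <namespace> patch daemonset kube-multus-ds -p '<json>'`, with the
namespace filled in for the cluster at hand.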

Tested incrementally: 256Mi limit was insufficient (still OOMed on catchup
burst), 512Mi was insufficient on rke2-agent1 (most pods concentrated there),
1Gi/512Mi handled the full 200+ pending pod CNI catchup cleanly with 0 multus
restarts after rollout. Nodes are 64GB with <25% used in steady-state, so the
~256Mi typical working-set is well within the new limit.
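
The steady-state headroom arithmetic can be sketched as follows. The parser is
a simplification that handles only the binary suffixes Ki/Mi/Gi, not the full
Kubernetes quantity syntax, and both function names are hypothetical.

```python
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '512Mi' to bytes (binary suffixes only)."""
    for suffix, factor in _BINARY.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count, no suffix


def headroom(limit: str, working_set: str) -> float:
    """Ratio of the memory limit to the typical working set."""
    return to_bytes(limit) / to_bytes(working_set)


# Limits tried during the incident, compared with the ~256Mi working set.
for limit in ("50Mi", "256Mi", "512Mi", "1Gi"):
    print(limit, headroom(limit, "256Mi"))
```

This covers the steady state only. The catch-up burst needed more than the
working set, which is why 512Mi still failed on rke2-agent1.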

Companion change: FlowerCore.RemoteDesktop must set OwnerReferences on every
worker pod so future operator crashes don't leak orphans (Q-MR-2). Preventive
alerts (Q-MR-3), MultusMemoryPressure and NamespacePendingPodBacklog, are
coming in a follow-up commit to apps/monitoring/.
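
As a rough illustration of the companion change, the sketch below attaches an
ownerReferences entry to a pod manifest before creation. The owner kind
"RemoteDesktopSession", its apiVersion, the uid, and the helper name are all
hypothetical; the actual fix lives in FlowerCore.RemoteDesktop and may differ.

```python
import copy


def with_owner(pod: dict, owner: dict) -> dict:
    """Return a copy of the pod manifest that references its owner object."""
    ref = {
        "apiVersion": owner["apiVersion"],
        "kind": owner["kind"],
        "name": owner["metadata"]["name"],
        "uid": owner["metadata"]["uid"],  # must be the live object's UID
        "controller": True,
        "blockOwnerDeletion": True,
    }
    owned = copy.deepcopy(pod)  # leave the caller's manifest unmodified
    owned.setdefault("metadata", {}).setdefault("ownerReferences", []).append(ref)
    return owned


# Hypothetical owner and pod; names are placeholders, not real resources.
owner = {
    "apiVersion": "example.invalid/v1",
    "kind": "RemoteDesktopSession",
    "metadata": {"name": "session-1", "uid": "hypothetical-uid"},
}
pod = {"metadata": {"name": "rd-browser-only-a1", "namespace": "fc-desktop"}}
owned = with_owner(pod, owner)
print(owned["metadata"]["ownerReferences"][0]["kind"])
```

With a reference like this in place, the Kubernetes garbage collector removes
the pod once the owning object is deleted, provided the owner lives in the same
namespace.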

Memory: feedback_multus_50mi_limit_oom_orphan_pod_avalanche
Decisions card: docs/dashboards/decisions-waiting.html Q-MR-1..3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-11 10:30:05 -05:00

fix(agent-zero): export knowledge mcp gate to python builder

2026-04-29 23:32:55 -05:00

Fix Traefik dashboard cert issuer: step-ca-acme

2026-03-10 01:12:08 -05:00

K8s manifest hardening + new bluejay-infra-lint test project

2026-05-04 03:18:04 -05:00

feat(infra): add Multus CNI + CDI + PROD VLAN 57 NAD as GitOps prereqs for ci1

2026-05-08 13:05:58 -05:00

Fix Traefik dashboard cert issuer: step-ca-acme

2026-03-10 01:12:08 -05:00

edge2-services: print.iamworkin.lan Traefik HTTPS for Print.Web (XL Track C)

2026-04-26 14:37:33 -05:00

Fix Traefik dashboard cert issuer: step-ca-acme

2026-03-10 01:12:08 -05:00

apps: fc-chat refactor + fc-menuboard app split

2026-04-16 19:25:25 -05:00

fix(fc-desktop): land 4 NetworkPolicies into bluejay-infra (was deploy-script-only)

2026-05-07 10:27:20 -05:00

fc-distribution

fix(certs): kill cert-manager renewal loop on 3 broken Certificate specs

2026-05-07 15:32:00 -05:00

Add signage service ingress manifests

2026-04-09 15:09:08 -05:00

Add step-ca TLS certs for mysql, php, desktop, signage, fc-landing

2026-04-08 18:20:23 -05:00

K8s manifest hardening + new bluejay-infra-lint test project

2026-05-04 03:18:04 -05:00

apps: fc-chat refactor + fc-menuboard app split

2026-04-16 19:25:25 -05:00

fc-messageboard

K8s manifest hardening + new bluejay-infra-lint test project

2026-05-04 03:18:04 -05:00

Add step-ca TLS certs for mysql, php, desktop, signage, fc-landing

2026-04-08 18:20:23 -05:00

Add step-ca TLS certs for mysql, php, desktop, signage, fc-landing

2026-04-08 18:20:23 -05:00

fc-presentations

Add signage service ingress manifests

2026-04-09 15:09:08 -05:00

Add signage service ingress manifests

2026-04-09 15:09:08 -05:00

fc-segmentdisplay

feat(fc-segmentdisplay): switch tls certificate to dns01

2026-04-23 18:39:17 -05:00

Add step-ca TLS certs for mysql, php, desktop, signage, fc-landing

2026-04-08 18:20:23 -05:00

fc-signalcontrol

K8s manifest hardening + new bluejay-infra-lint test project

2026-05-04 03:18:04 -05:00

deploy(ttsreader): persist voice reference clips on pvc

2026-05-06 20:48:58 -05:00

[uc] Phase 1 auth gate deploy v20260509-4162dca-authgate

2026-05-08 21:16:54 -05:00

Fix Traefik dashboard cert issuer: step-ca-acme

2026-03-10 01:12:08 -05:00

Adopt fc-updater into ArgoCD

2026-05-06 17:33:32 -05:00

Update telephony-web image to v20260324d, resolve merge conflicts

2026-03-24 15:55:52 -05:00

feat(guacamole): add macmini-vnc-creds OnePasswordItem + fix Mac mini connection IPs

2026-04-28 20:09:45 -05:00

deploy(intranet): promote brochure wave 1 image

2026-05-08 11:12:56 -05:00

fix(irc): use short name for unrealircd in anope + thelounge configs

2026-04-22 21:23:38 -05:00

fix(certs): kill cert-manager renewal loop on 3 broken Certificate specs

2026-05-07 15:32:00 -05:00

revert(ci1): back to cdrom:scsi (virtio-blk disk hit QEMU flock)

2026-05-08 21:35:00 -05:00

Update telephony-web image to v20260324d, resolve merge conflicts

2026-03-24 15:55:52 -05:00

statefulsets: align guacamole and matrix drift defaults

2026-04-22 23:11:47 -05:00

fix(monitoring): rename bare Grafana dashboard JSONs out of *.json extension

2026-05-07 10:13:37 -05:00

fix(multus): bump kube-multus-ds memory 50Mi/50Mi -> 1Gi/512Mi (prevent OOM cascade)

2026-05-11 10:30:05 -05:00

feat(noc-services): wire puppetdb.iamworkin.lan through Traefik step-ca cert

2026-04-28 15:13:20 -05:00

Add infrastructure manifests for 9 services

2026-03-09 16:35:04 -05:00

fix(selenium): GitOps-capture selenium-netpol (was unmanaged anywhere)

2026-05-07 10:30:59 -05:00

Update telephony-web image to v20260324d, resolve merge conflicts

2026-03-24 15:55:52 -05:00

fc-telephony: bump web to v202604252156 (T7 step trail)

2026-04-25 21:56:14 -05:00

traefik-dashboard

Fix Traefik dashboard cert issuer: step-ca-acme

2026-03-10 01:12:08 -05:00

Update telephony-web image to v20260324d, resolve merge conflicts

2026-03-24 15:55:52 -05:00

fix(certs): kill cert-manager renewal loop on 3 broken Certificate specs

2026-05-07 15:32:00 -05:00

feat(zabbix): add RemoteDesktop monitoring template

2026-04-23 23:30:32 -05:00