The thermal printer drained overnight (2026-05-18/19) because the old
notify.py POSTed one print job per Grafana webhook fire. With 9
concurrently-firing alerts (zabbix-postgres + fc-devicemgmt + brochure
+ PrintPaperRollLow), every evaluation cycle stamped fresh CUPS jobs
onto the queue until the operator physically powered the printer off.
This refactor:
- Adds env-var config: THERMAL_PRINT_ENABLED (master kill switch),
BATCH_INTERVAL_MIN (default 60), BATCH_MAX_PENDING (default 50).
- IRC delivery stays per-event (operator wants the live stream).
- Thermal routing now:
* critical/disaster/page severity OR alert_channel=thermal_print_immediate
-> print immediately
* alert_channel=thermal_print -> enqueue into hourly digest
* RESOLVED -> remove from digest buffer (no resolution-spam prints)
* else -> IRC only, no thermal
- Background digest_loop thread flushes the buffer hourly (or sooner
if buffer hits BATCH_MAX_PENDING). Digest payload is a single
Print.Web /api/print/alert POST listing distinct alertnames + per-rule
target counts.
- New POST /flush endpoint (manual operator force-flush; useful for
testing without waiting an hour).
- GET / returns config + buffer depth + per-stat counters for observability.
Net effect: max 1 thermal print per BATCH_INTERVAL_MIN for batched
warnings, plus immediate prints for criticals. Closes the 2026-05-18/19
alert-storm incident.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the live `puppet` alert group from
FlowerCore.Notes/scripts/monitoring/alerts.yml into the K8s ConfigMap so a
future in-cluster Prometheus inherits the ruleset automatically.
Source of truth remains the Notes file (live Podman Prometheus on noc1).
See feedback_monitoring_k8s_target_vs_live_podman.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new preventive alert rules added to the kubernetes-state group of the
K8s migration target ConfigMap. The live Podman Prometheus on noc1 has
already been updated via FlowerCore.Notes/scripts/monitoring/alerts.yml +
sudo cp + podman pod restart monitoring (this commit only locks it in
the bluejay-infra K8s mirror so a future migration carries it forward).
MultusMemoryPressure (critical, thermal_print): fires when kube-multus
working set exceeds 80% of its memory limit for 5m. Catches the next
multus OOM cascade BEFORE it kills the daemon cluster-wide. The 2026-05-10
21h outage hit because no alert fired on the rising multus working set;
only downstream blackbox / Traefik / service alerts triggered, after the
fact.
NamespacePendingPodBacklog (warning): fires when any single namespace has
>25 Pending pods sustained for 30m. Catches the operator-leak avalanche
pattern (orphan pods from a crashed reconciler emitting children without
ownerReferences) before it cascades into a CNI OOM.
See FlowerCore.Notes:
- feedback_multus_50mi_limit_oom_orphan_pod_avalanche
- feedback_monitoring_k8s_target_vs_live_podman (workflow)
Companion commits:
- bluejay-infra@eb8693e (multus memory limit)
- FlowerCore.RemoteDesktop@b02c59b (OwnerReferences fix)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror of FlowerCore.Notes/scripts/monitoring/alerts.yml fc-signage-marquee
group into the K8s migration target apps/monitoring/noc-monitoring.yaml so
that future migration of the noc1 Podman monitoring stack into RKE2
inherits the marquee alert ruleset automatically.
Three rules added:
- MarqueeDroppedFramesHigh (5% / 5min / warning)
- MarqueeRenderLatencyP99High (16ms / 10min / warning)
- MarqueeAnimationDurationDrift (10% / 15min / info)
All three gated with `unless on() absent_over_time(metric[7d])` so they
don't fire during the metric-not-yet-published window before Track 3
IR-21 source IMPL ships the MarqueeMeter into Common + Web + WPF.
Live source-of-truth (the noc1 Podman Prometheus reads from
/opt/monitoring/prometheus/alerts.yml) was updated and reloaded
in the same session — Notes commit 300daa0 carries the matching
alerts.yml + Grafana fc-signage-dashboard.json change.
Per feedback_monitoring_k8s_target_vs_live_podman: this file is the
forward-looking K8s migration target, NOT what the live Podman
Prometheus reads. ArgoCD-syncing this file does NOT push alerts to
the live monitoring stack.
Companion to:
- FlowerCore.Notes 300daa0 (live alerts.yml + Grafana panels deployed)
- docs/signage/marquee-performance-telemetry-design.md (Track 3 IR-21 spec)
- docs/signage/marquee-animation-phases.md (Track 6 13-phase coverage matrix)
Memory: project_marquee_vr_promotion_landed_2026_05_06
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three new Prometheus alert rules for the print-services group, all routed
to thermal_print via alert_channel label (Grafana contact point ->
irc-notify -> Print.Web /api/print/alert):
- PrintPaperRollLow (warning, 5-10% remaining, 5m for)
- PrintPaperRollCritical (critical, <=5% remaining, 2m for)
- PrintJobDeadLetter (warning, any new dead-letter in 15m)
Source-of-truth gauge is print_paper_remaining_percent (Print.Web OTEL),
which is hydrated from the active PaperRoll row at process startup
(Print.Web@<TBD> HydrateMetricsAsync) so the gauge isn't blind for an
arbitrary window after every deploy.
Self-referential humor: low-roll alerts route to the printer that's
running out of paper, so it announces its own paper-out warning on its
remaining paper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
agent-zero ollama-proxy had 172 historic restarts (now stable).
Root cause: liveness/readiness probes hit /api/tags which proxies
through to BLUEJAY-WS Ollama (10.0.56.20:11434). When the workstation
Ollama is slow or offline, nginx fails over to the edge1 backup —
but the failover takes >1s and the kube-probe default timeoutSeconds=1
gives up first. Three failed probes → kubelet kills the container.
Fix:
- Add nginx local healthz endpoint (200, no upstream).
- Liveness probe → /healthz (proves nginx itself is alive).
- Readiness probe stays on /api/tags but with timeoutSeconds=5 so
failover to backup completes before the probe times out.
This decouples liveness from upstream availability — kubelet only
restarts the proxy when nginx is genuinely dead, not when Ollama is
slow.
Longhorn coverage gap: K8s emits "snapshot becomes not ready to use"
events constantly during the hourly snapshot lifecycle (1047
snapshots, all readyToUse=true on inspect). Those events were the
only signal we had — purely transient lifecycle noise, not actionable.
Add:
- longhorn scrape job (longhorn-backend.longhorn-system.svc:9500)
- NetworkPolicy egress rule for longhorn-system port 9500
- 4 new alerts in 'longhorn-storage' group:
- LonghornVolumeDegraded (>15m) — replica unhealthy, auto-rebuild
- LonghornVolumeFaulted (>5m, critical, thermal print) — data loss
- LonghornBackupStale (no completed backup in >36h) — recurring job
silently failing
- LonghornNodeUnhealthy (>5m) — node ready=false
zabbix-web 7 restarts and Print.Web 12:55 stop investigated — both
are stable now, no actionable cause found in journal/events. Adding
KubeContainerRestartingFrequently in the previous commit will catch
recurrence of either.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Followup to 05a273d. After deploy, six PoolDepleted/Deficit alerts
went pending again because the publisher emits per-status gauge
series (fc_desktop_pool_depleted{template,status,alert_level}) and
the historical Warming/BelowDesiredSize series stay at value=1 even
after the template transitions to status=Ready. Filtering by
alert_level=Critical/Warning was not enough — those labels are baked
into the stale series too.
Replace with a join-based query: alert only when the canonical
"Ready" status gauge does NOT report ready=1 for the enabled
template. fc_desktop_pool_ready{status="Ready"}==1 is the publisher's
own current-state canary and never goes stale.
Verified against the live cluster — query returns 0 results when all
pools report healthy in their reconcile logs (no stale-label false
positives).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Followup to ab6ade4. Three issues uncovered after the rollout:
1. NodePort hairpin breaks scrape from same-node pod. Prometheus on
rke2-agent1 could reach traefik-metrics on .11/.13 NodePort 30900
but timed out on its OWN node's NodePort. Same problem would hit
kube-state-metrics + cert-manager whenever prometheus reschedules.
Fix: scrape via ClusterIP svc DNS instead of NodePort. NodePorts
stay in place for external/Podman scrapers.
2. probe-traefik-services failed for grafana, prometheus, guac with
non-200/3xx codes. grafana + prometheus are behind Traefik basic-
auth (every endpoint returns 401), so drop from probe surface —
health is covered by the in-cluster monitoring-* scrape jobs.
guac.iamworkin.lan was deprecated when Guacamole moved under
desktop.iamworkin.lan/guacamole/ — drop it.
3. acme path was wrong (root 404). Use /health.
Coverage adds (probe-traefik-services):
chat, dist, dms, menuboard, messageboard, presentations, retail,
ttsreader. All of these have IngressRoutes serving root at 200/3xx.
NetworkPolicy egress rules added so the new ClusterIP svc scrapes work:
- traefik-system: port 9100 (metrics) — separate from data-path 8080/8443
- kube-system: port 8080 (kube-state-metrics)
- cert-manager: port 9402 (controller metrics)
Out-of-band fix during this audit:
- Print.Web on edge2 was inactive (clean exit at 12:55 CDT, root cause
unclear — systemd Stopping signal). Restarted. Service back on 5200.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Live audit on 2026-04-26 found 14 firing alerts caused by stale probe
targets, blackbox TLS verify failures, and stale state-as-label series.
Plus three K8s scrape sources (kube-state-metrics, cert-manager,
traefik) that exposed NodePorts but were not in any scrape config.
Fixes
- probe-remotedesktop: switch http_2xx -> https_internal. Blackbox does
not trust step-ca root, so /health was failing with x509 unknown
authority while the app served 200s.
- probe-agentzero-nuc: short svc form (agent-zero.agent-zero.svc:80)
instead of *.cluster.local. The FQDN form was being rewritten to the
Traefik VIP by the CoreDNS iamworkin.lan template + ndots:5 search
expansion, then 5s timeout.
- probe-agentzero-local + probe-ollama-local: removed. 10.0.58.100 is on
HOME VLAN and not reachable from cluster pods. Workstation/AI-laptop
Ollama monitoring belongs to host-side Puppet, not cluster blackbox.
- snmp-cloudkey: commented out. The Cloud Key Gen2+ runs unifi-core
(controller), not an SNMP agent. Was generating "connection refused"
every 30s.
- RemoteDesktopPoolDepleted / RemoteDesktopPoolDeficitSustained:
filter on alert_level=Critical / Warning|Critical + enabled=true.
The publisher emits one series per template per status without
resetting old series to 0, so the historical Warming/BelowDesiredSize
series stayed at 1 and the alert kept firing on stale labels.
- RemoteDesktopTlsExpiry: match by job, not hostname-only instance.
The probe sets instance=https://desktop.iamworkin.lan/health so a
hostname-only label match never fired.
- EpsonPrinterDown for: 5m -> 30m. EcoTank sleeps after ~5 min idle and
SNMP times out, so 5m guaranteed nightly noise.
Coverage adds
- kube-state-metrics scrape (NodePort 30901). Required for the new
pod-state alerts and a long list of standard K8s SLO queries.
- cert-manager scrape (NodePort 30902). Required for the
CertManagerCertificateNotReady / RenewalFailed alert pair documented
in project_cert_manager_prometheus_scrape.
- traefik scrape (NodePort 30900) on all three nodes.
- probe-traefik-services: HTTPS probe (https_internal) over the 17 main
iamworkin.lan hosts so any Traefik-fronted service returning non-200
shows up as a single named probe failure.
- blackbox-config: add the https_internal module that the new probes
reference (was only in the FlowerCore.Notes scripts/monitoring copy,
not in the live ConfigMap).
New alerts (kubernetes-state group)
- KubeContainerRestartingFrequently (>5 restarts/h)
- KubeContainerCrashLooping (>3 restarts/15m, thermal print)
- KubePodNotReady (Pending/Failed/Unknown >15m)
- KubePodImagePullBackOff (>10m)
- KubeDeploymentReplicasMismatch (>15m)
Without these, the agent-zero ollama-proxy 172x restart loop was
invisible for ~3 days. Same gap would have hidden the fc-php
php84-app-probe ImagePullBackOff orphan (cleaned up out of band).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to the Prometheus alert rules landed in e44e9a0. The
Prometheus rules were loading but never delivered — the monitoring
stack has no Alertmanager configured; **Grafana** owns alert
routing via its built-in engine + webhook contact point to
irc-notify.monitoring.svc:9119. Without a matching Grafana alert,
the Prometheus rules just show up in the Prometheus UI and page
no one.
Adds 6 Grafana alert rules in a new `RemoteDesktop` group under
the AI Stack Alerts folder:
- remotedesktop-web-down (3m) — probe_success{job="probe-remotedesktop"} < 1
- remotedesktop-metrics-stale (10m) — fc_desktop_session_events_total series absent
- remotedesktop-pool-depleted (5m) — fc_desktop_pool_depleted > 0
- remotedesktop-pool-deficit-sustained (10m info) — fc_desktop_pool_deficit > 0
- remotedesktop-session-churn-spike (5m info) — launch rate > 20/min
- remotedesktop-tls-expiry (6h critical) — cert < 2 days to expiry
Each uses the standard Grafana 3-stage pipeline (query → reduce →
threshold) matching the existing AI Stack + Infrastructure alert
patterns. Labels: service=remotedesktop + severity (warning/info/critical).
Default route is `IRC #alerts` via the existing webhook contact point.
Parity with the Prometheus rules (which already fire internally
for the Prometheus UI + any future Alertmanager integration).
Grafana restart picks up the new provisioning on next reload.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two egress allows to monitoring-netpol so Prometheus can scrape
FlowerCore.RemoteDesktop:
1. fc-desktop namespace on port 8080 — direct ClusterIP service
target (remotedesktop-web.fc-desktop:8080).
2. traefik-system namespace pods on ports 8080 + 8443 — covers the
Traefik VIP hairpin path for the `https://desktop.iamworkin.lan`
scrape target (CoreDNS wildcard resolves iamworkin.lan hostnames
to the LB VIP; after kube-proxy DNAT, egress needs the backend
pod port allowed per feedback_netpol_dnat_backend_port).
Without these, the fc-remotedesktop scrape times out with "context
deadline exceeded" even though the monitoring-netpol already allows
the 10.0.56.0/24 CIDR — post-DNAT the destination is a 10.42.x.x
pod IP, not the VIP.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three additions to the monitoring ConfigMap, each targeting
FlowerCore.RemoteDesktop:
- **Scrape jobs** (2 new):
- probe-remotedesktop: blackbox http_2xx against
https://desktop.iamworkin.lan/health every 30s. Feeds the
RemoteDesktopWebDown alert.
- fc-remotedesktop: direct /metrics scrape against
desktop.iamworkin.lan for the fc_desktop_session_events_total
and fc_desktop_pool_* series.
- **Alert group `remote-desktop`** (7 rules in alerts.yml):
- RemoteDesktopWebDown (3m) — /health probe failing
- RemoteDesktopMetricsStale (10m) — absent metrics series
- RemoteDesktopPoolDepleted (5m) — pool deficit + depleted flag
- RemoteDesktopPoolDeficitSustained (10m, info) — persistent
below-desired pool size
- RemoteDesktopSessionChurnSpike (5m, info) — launch rate
>20/min
- RemoteDesktopRecordingEventsDropped (15m, info) — 30m without
recording events while launches active
- RemoteDesktopTlsExpiry (6h, critical) — <2d cert renewal
window; aligns with feedback_acme_expiry_alert_threshold
- **Grafana dashboard mount**: new volumeMounts + volumes entry for
`dashboards-remotedesktop` backed by the grafana-dashboard-remotedesktop
ConfigMap (previously added as a standalone file in d4210c8).
Folder path /var/lib/grafana/dashboards/remotedesktop — picked up
by the file-provider with foldersFromFilesStructure:true so the
dashboard shows up in a "Remotedesktop" folder in Grafana.
No CRLF churn; pure 100-line insertion into LF-normalized file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Synology NAS is configured with community bluejay_monitor
(→ snmp.yml auth 'bluejay_v2'), not public. public_v2 was returning
HTTP 500 from snmp-exporter for this target. Verified bluejay_v2
returns metrics.
Keeps printer (10.0.58.107) on public_v2 — Epson ET-3750 uses
community "public" as documented in its SNMP settings.
CoreDNS iamworkin.lan template + ndots:5 was hijacking
unrealircd.irc.svc.cluster.local lookups → Traefik VIP → timeout.
Every alert since ~2026-04-09 silently failed with "IRC send failed: timed out",
which also killed the thermal-printer path (routed through irc-notify).
Same fix pattern as guacamole@28b7600.