bluejay-infra

Author	SHA1	Message	Date
Andrew Stoltz	0a20f05525	monitoring: tag pirelay planned outage	2026-05-19 17:40:34 -05:00
Andrew Stoltz	e641ceab48	monitoring(irc-notify): criticals also batch hourly — fix per-fire spam The first batching pass (`bacac06`) left critical-severity alerts on the immediate-print path. That's still per-event spam for any persistent critical (e.g. PrintPaperRollCritical fires every 30s Grafana evaluation cycle when paper is <5%). Caught immediately after deploy: CUPS queue grew 0 → 8 jobs in 8 minutes from a single firing PrintPaperRollCritical. This commit aligns with the operator's verbatim ask ("one alert an hour"): - Critical-severity alerts now go into the digest buffer, NOT the immediate-print path. The digest payload already shows severity tags per alertname, so the operator still sees "[critical] X" in the printout. - The explicit `alert_channel=thermal_print_immediate` label still bypasses batching, but only on NEW fingerprint arrival — it triggers a flush of the CURRENT digest (with the new alert included), then clears. Repeat webhooks for the same fingerprint dedupe in the buffer until the next hourly tick OR until the alert resolves. No fingerprint can spam. - `add_to_digest` now returns bool (True = buffer grew, False = dedup / resolution / disabled) so the immediate-label path can flush only on state transitions. Net effect: max 1 thermal print per BATCH_INTERVAL_MIN per alert fingerprint, regardless of severity. Rules that genuinely need same-second paper opt in via `alert_channel=thermal_print_immediate` (currently zero rules use this). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:22:25 -05:00
Andrew Stoltz	bacac067cf	monitoring(irc-notify): hourly digest batching for thermal printer The thermal printer drained overnight (2026-05-18/19) because the old notify.py POSTed one print job per Grafana webhook fire. With 9 concurrently-firing alerts (zabbix-postgres + fc-devicemgmt + brochure + PrintPaperRollLow), every evaluation cycle stamped fresh CUPS jobs onto the queue until the operator physically powered the printer off. This refactor: - Adds env-var config: THERMAL_PRINT_ENABLED (master kill switch), BATCH_INTERVAL_MIN (default 60), BATCH_MAX_PENDING (default 50). - IRC delivery stays per-event (operator wants the live stream). - Thermal routing now: * critical/disaster/page severity OR alert_channel=thermal_print_immediate -> print immediately * alert_channel=thermal_print -> enqueue into hourly digest * RESOLVED -> remove from digest buffer (no resolution-spam prints) * else -> IRC only, no thermal - Background digest_loop thread flushes the buffer hourly (or sooner if buffer hits BATCH_MAX_PENDING). Digest payload is a single Print.Web /api/print/alert POST listing distinct alertnames + per-rule target counts. - New POST /flush endpoint (manual operator force-flush; useful for testing without waiting an hour). - GET / returns config + buffer depth + per-stat counters for observability. Net effect: max 1 thermal print per BATCH_INTERVAL_MIN for batched warnings, plus immediate prints for criticals. Closes the 2026-05-18/19 alert-storm incident. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 09:56:14 -05:00
Andrew Stoltz	46c392605e	monitoring: mirror PuppetServiceFailed alert from Notes (Sprint 33 Cx-7 Phase B) Mirrors the live `puppet` alert group from FlowerCore.Notes/scripts/monitoring/alerts.yml into the K8s ConfigMap so a future in-cluster Prometheus inherits the ruleset automatically. Source of truth remains the Notes file (live Podman Prometheus on noc1). See feedback_monitoring_k8s_target_vs_live_podman. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:11:07 -05:00
bluejay	634b9c4169	feat(github-runner): harden Linux runner fleet (#5)	2026-05-18 02:51:02 +00:00
bluejay	9fd32c4415	feat(monitoring): MacMiniRunnerOffline alert (Sprint 28 reconcile)	2026-05-17 19:50:29 +00:00
Codex	653d4472f5	fix(monitoring): mirror Q-MR-3 MultusMemoryPressure + NamespacePendingPodBacklog alerts Two new preventive alert rules added to the kubernetes-state group of the K8s migration target ConfigMap. The live Podman Prometheus on noc1 has already been updated via FlowerCore.Notes/scripts/monitoring/alerts.yml + sudo cp + podman pod restart monitoring (this commit only locks it in the bluejay-infra K8s mirror so a future migration carries it forward). MultusMemoryPressure (critical, thermal_print): fires when kube-multus working set exceeds 80% of its memory limit for 5m. Catches the next multus OOM cascade BEFORE it kills the daemon cluster-wide. The 2026-05-10 21h outage hit because no alert fired on the rising multus working set; only downstream blackbox / Traefik / service alerts triggered, after the fact. NamespacePendingPodBacklog (warning): fires when any single namespace has >25 Pending pods sustained for 30m. Catches the operator-leak avalanche pattern (orphan pods from a crashed reconciler emitting children without ownerReferences) before it cascades into a CNI OOM. See FlowerCore.Notes: - feedback_multus_50mi_limit_oom_orphan_pod_avalanche - feedback_monitoring_k8s_target_vs_live_podman (workflow) Companion commits: - bluejay-infra@eb8693e (multus memory limit) - FlowerCore.RemoteDesktop@b02c59b (OwnerReferences fix) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 10:42:27 -05:00
Codex	4b777b16ac	monitoring: mirror fc-signage-marquee alert group into noc-monitoring K8s target Mirror of FlowerCore.Notes/scripts/monitoring/alerts.yml fc-signage-marquee group into the K8s migration target apps/monitoring/noc-monitoring.yaml so that future migration of the noc1 Podman monitoring stack into RKE2 inherits the marquee alert ruleset automatically. Three rules added: - MarqueeDroppedFramesHigh (5% / 5min / warning) - MarqueeRenderLatencyP99High (16ms / 10min / warning) - MarqueeAnimationDurationDrift (10% / 15min / info) All three gated with `unless on() absent_over_time(metric[7d])` so they don't fire during the metric-not-yet-published window before Track 3 IR-21 source IMPL ships the MarqueeMeter into Common + Web + WPF. Live source-of-truth (the noc1 Podman Prometheus reads from /opt/monitoring/prometheus/alerts.yml) was updated and reloaded in the same session — Notes commit 300daa0 carries the matching alerts.yml + Grafana fc-signage-dashboard.json change. Per feedback_monitoring_k8s_target_vs_live_podman: this file is the forward-looking K8s migration target, NOT what the live Podman Prometheus reads. ArgoCD-syncing this file does NOT push alerts to the live monitoring stack. Companion to: - FlowerCore.Notes 300daa0 (live alerts.yml + Grafana panels deployed) - docs/signage/marquee-performance-telemetry-design.md (Track 3 IR-21 spec) - docs/signage/marquee-animation-phases.md (Track 6 13-phase coverage matrix) Memory: project_marquee_vr_promotion_landed_2026_05_06 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 16:01:44 -05:00
Andrew Stoltz	3e0b9055b0	monitoring: paper-roll lifecycle alerts (XL Track I) Three new Prometheus alert rules for the print-services group, all routed to thermal_print via alert_channel label (Grafana contact point -> irc-notify -> Print.Web /api/print/alert): - PrintPaperRollLow (warning, 5-10% remaining, 5m for) - PrintPaperRollCritical (critical, <=5% remaining, 2m for) - PrintJobDeadLetter (warning, any new dead-letter in 15m) Source-of-truth gauge is print_paper_remaining_percent (Print.Web OTEL), which is hydrated from the active PaperRoll row at process startup (Print.Web@<TBD> HydrateMetricsAsync) so the gauge isn't blind for an arbitrary window after every deploy. Self-referential humor: low-roll alerts route to the printer that's running out of paper, so it announces its own paper-out warning on its remaining paper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 16:00:40 -05:00
Andrew Stoltz	e2c71c2b8a	fix agent-zero ollama-proxy crashloop + add Longhorn monitoring agent-zero ollama-proxy had 172 historic restarts (now stable). Root cause: liveness/readiness probes hit /api/tags which proxies through to BLUEJAY-WS Ollama (10.0.56.20:11434). When the workstation Ollama is slow or offline, nginx fails over to the edge1 backup — but the failover takes >1s and the kube-probe default timeoutSeconds=1 gives up first. Three failed probes → kubelet kills the container. Fix: - Add nginx local healthz endpoint (200, no upstream). - Liveness probe → /healthz (proves nginx itself is alive). - Readiness probe stays on /api/tags but with timeoutSeconds=5 so failover to backup completes before the probe times out. This decouples liveness from upstream availability — kubelet only restarts the proxy when nginx is genuinely dead, not when Ollama is slow. Longhorn coverage gap: K8s emits "snapshot becomes not ready to use" events constantly during the hourly snapshot lifecycle (1047 snapshots, all readyToUse=true on inspect). Those events were the only signal we had — purely transient lifecycle noise, not actionable. Add: - longhorn scrape job (longhorn-backend.longhorn-system.svc:9500) - NetworkPolicy egress rule for longhorn-system port 9500 - 4 new alerts in 'longhorn-storage' group: - LonghornVolumeDegraded (>15m) — replica unhealthy, auto-rebuild - LonghornVolumeFaulted (>5m, critical, thermal print) — data loss - LonghornBackupStale (no completed backup in >36h) — recurring job silently failing - LonghornNodeUnhealthy (>5m) — node ready=false zabbix-web 7 restarts and Print.Web 12:55 stop investigated — both are stable now, no actionable cause found in journal/events. Adding KubeContainerRestartingFrequently in the previous commit will catch recurrence of either. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:31:14 -05:00
Andrew Stoltz	b3028f5119	monitoring: fix RemoteDesktop pool alerts for stale per-status series Followup to `05a273d`. After deploy, six PoolDepleted/Deficit alerts went pending again because the publisher emits per-status gauge series (fc_desktop_pool_depleted{template,status,alert_level}) and the historical Warming/BelowDesiredSize series stay at value=1 even after the template transitions to status=Ready. Filtering by alert_level=Critical/Warning was not enough — those labels are baked into the stale series too. Replace with a join-based query: alert only when the canonical "Ready" status gauge does NOT report ready=1 for the enabled template. fc_desktop_pool_ready{status="Ready"}==1 is the publisher's own current-state canary and never goes stale. Verified against the live cluster — query returns 0 results when all pools report healthy in their reconcile logs (no stale-label false positives). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:12:10 -05:00
Andrew Stoltz	05a273d3a6	monitoring: switch K8s scrapes to ClusterIP svc + fix probe paths Followup to `ab6ade4`. Three issues uncovered after the rollout: 1. NodePort hairpin breaks scrape from same-node pod. Prometheus on rke2-agent1 could reach traefik-metrics on .11/.13 NodePort 30900 but timed out on its OWN node's NodePort. Same problem would hit kube-state-metrics + cert-manager whenever prometheus reschedules. Fix: scrape via ClusterIP svc DNS instead of NodePort. NodePorts stay in place for external/Podman scrapers. 2. probe-traefik-services failed for grafana, prometheus, guac with non-200/3xx codes. grafana + prometheus are behind Traefik basic- auth (every endpoint returns 401), so drop from probe surface — health is covered by the in-cluster monitoring-* scrape jobs. guac.iamworkin.lan was deprecated when Guacamole moved under desktop.iamworkin.lan/guacamole/ — drop it. 3. acme path was wrong (root 404). Use /health. Coverage adds (probe-traefik-services): chat, dist, dms, menuboard, messageboard, presentations, retail, ttsreader. All of these have IngressRoutes serving root at 200/3xx. NetworkPolicy egress rules added so the new ClusterIP svc scrapes work: - traefik-system: port 9100 (metrics) — separate from data-path 8080/8443 - kube-system: port 8080 (kube-state-metrics) - cert-manager: port 9402 (controller metrics) Out-of-band fix during this audit: - Print.Web on edge2 was inactive (clean exit at 12:55 CDT, root cause unclear — systemd Stopping signal). Restarted. Service back on 5200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:05:32 -05:00
Andrew Stoltz	ab6ade4e46	monitoring: stabilize firing alerts + add cluster-state coverage Live audit on 2026-04-26 found 14 firing alerts caused by stale probe targets, blackbox TLS verify failures, and stale state-as-label series. Plus three K8s scrape sources (kube-state-metrics, cert-manager, traefik) that exposed NodePorts but were not in any scrape config. Fixes - probe-remotedesktop: switch http_2xx -> https_internal. Blackbox does not trust step-ca root, so /health was failing with x509 unknown authority while the app served 200s. - probe-agentzero-nuc: short svc form (agent-zero.agent-zero.svc:80) instead of *.cluster.local. The FQDN form was being rewritten to the Traefik VIP by the CoreDNS iamworkin.lan template + ndots:5 search expansion, then 5s timeout. - probe-agentzero-local + probe-ollama-local: removed. 10.0.58.100 is on HOME VLAN and not reachable from cluster pods. Workstation/AI-laptop Ollama monitoring belongs to host-side Puppet, not cluster blackbox. - snmp-cloudkey: commented out. The Cloud Key Gen2+ runs unifi-core (controller), not an SNMP agent. Was generating "connection refused" every 30s. - RemoteDesktopPoolDepleted / RemoteDesktopPoolDeficitSustained: filter on alert_level=Critical / Warning\|Critical + enabled=true. The publisher emits one series per template per status without resetting old series to 0, so the historical Warming/BelowDesiredSize series stayed at 1 and the alert kept firing on stale labels. - RemoteDesktopTlsExpiry: match by job, not hostname-only instance. The probe sets instance=https://desktop.iamworkin.lan/health so a hostname-only label match never fired. - EpsonPrinterDown for: 5m -> 30m. EcoTank sleeps after ~5 min idle and SNMP times out, so 5m guaranteed nightly noise. Coverage adds - kube-state-metrics scrape (NodePort 30901). Required for the new pod-state alerts and a long list of standard K8s SLO queries. - cert-manager scrape (NodePort 30902). Required for the CertManagerCertificateNotReady / RenewalFailed alert pair documented in project_cert_manager_prometheus_scrape. - traefik scrape (NodePort 30900) on all three nodes. - probe-traefik-services: HTTPS probe (https_internal) over the 17 main iamworkin.lan hosts so any Traefik-fronted service returning non-200 shows up as a single named probe failure. - blackbox-config: add the https_internal module that the new probes reference (was only in the FlowerCore.Notes scripts/monitoring copy, not in the live ConfigMap). New alerts (kubernetes-state group) - KubeContainerRestartingFrequently (>5 restarts/h) - KubeContainerCrashLooping (>3 restarts/15m, thermal print) - KubePodNotReady (Pending/Failed/Unknown >15m) - KubePodImagePullBackOff (>10m) - KubeDeploymentReplicasMismatch (>15m) Without these, the agent-zero ollama-proxy 172x restart loop was invisible for ~3 days. Same gap would have hidden the fc-php php84-app-probe ImagePullBackOff orphan (cleaned up out of band). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 12:57:18 -05:00
Andrew Stoltz	c23e903ba7	feat(monitoring): Grafana alert rules route RemoteDesktop to IRC Companion to the Prometheus alert rules landed in `e44e9a0`. The Prometheus rules were loading but never delivered — the monitoring stack has no Alertmanager configured; Grafana owns alert routing via its built-in engine + webhook contact point to irc-notify.monitoring.svc:9119. Without a matching Grafana alert, the Prometheus rules just show up in the Prometheus UI and page no one. Adds 6 Grafana alert rules in a new `RemoteDesktop` group under the AI Stack Alerts folder: - remotedesktop-web-down (3m) — probe_success{job="probe-remotedesktop"} < 1 - remotedesktop-metrics-stale (10m) — fc_desktop_session_events_total series absent - remotedesktop-pool-depleted (5m) — fc_desktop_pool_depleted > 0 - remotedesktop-pool-deficit-sustained (10m info) — fc_desktop_pool_deficit > 0 - remotedesktop-session-churn-spike (5m info) — launch rate > 20/min - remotedesktop-tls-expiry (6h critical) — cert < 2 days to expiry Each uses the standard Grafana 3-stage pipeline (query → reduce → threshold) matching the existing AI Stack + Infrastructure alert patterns. Labels: service=remotedesktop + severity (warning/info/critical). Default route is `IRC #alerts` via the existing webhook contact point. Parity with the Prometheus rules (which already fire internally for the Prometheus UI + any future Alertmanager integration). Grafana restart picks up the new provisioning on next reload. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:57:26 -05:00
Andrew Stoltz	3c5c1a07bd	fix(monitoring): netpol egress allows for fc-desktop + Traefik hairpin Adds two egress allows to monitoring-netpol so Prometheus can scrape FlowerCore.RemoteDesktop: 1. fc-desktop namespace on port 8080 — direct ClusterIP service target (remotedesktop-web.fc-desktop:8080). 2. traefik-system namespace pods on ports 8080 + 8443 — covers the Traefik VIP hairpin path for the `https://desktop.iamworkin.lan` scrape target (CoreDNS wildcard resolves iamworkin.lan hostnames to the LB VIP; after kube-proxy DNAT, egress needs the backend pod port allowed per feedback_netpol_dnat_backend_port). Without these, the fc-remotedesktop scrape times out with "context deadline exceeded" even though the monitoring-netpol already allows the 10.0.56.0/24 CIDR — post-DNAT the destination is a 10.42.x.x pod IP, not the VIP. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:47:50 -05:00
Andrew Stoltz	e44e9a0062	feat(monitoring): RemoteDesktop alerts + scrape jobs + dashboard mount Three additions to the monitoring ConfigMap, each targeting FlowerCore.RemoteDesktop: - Scrape jobs (2 new): - probe-remotedesktop: blackbox http_2xx against https://desktop.iamworkin.lan/health every 30s. Feeds the RemoteDesktopWebDown alert. - fc-remotedesktop: direct /metrics scrape against desktop.iamworkin.lan for the fc_desktop_session_events_total and fc_desktop_pool_* series. - Alert group `remote-desktop` (7 rules in alerts.yml): - RemoteDesktopWebDown (3m) — /health probe failing - RemoteDesktopMetricsStale (10m) — absent metrics series - RemoteDesktopPoolDepleted (5m) — pool deficit + depleted flag - RemoteDesktopPoolDeficitSustained (10m, info) — persistent below-desired pool size - RemoteDesktopSessionChurnSpike (5m, info) — launch rate >20/min - RemoteDesktopRecordingEventsDropped (15m, info) — 30m without recording events while launches active - RemoteDesktopTlsExpiry (6h, critical) — <2d cert renewal window; aligns with feedback_acme_expiry_alert_threshold - Grafana dashboard mount: new volumeMounts + volumes entry for `dashboards-remotedesktop` backed by the grafana-dashboard-remotedesktop ConfigMap (previously added as a standalone file in `d4210c8`). Folder path /var/lib/grafana/dashboards/remotedesktop — picked up by the file-provider with foldersFromFilesStructure:true so the dashboard shows up in a "Remotedesktop" folder in Grafana. No CRLF churn; pure 100-line insertion into LF-normalized file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:41:35 -05:00
Andrew Stoltz	93f77c1844	fix(monitoring): use bluejay_v2 auth for snmp-nas (not public_v2) Synology NAS is configured with community bluejay_monitor (→ snmp.yml auth 'bluejay_v2'), not public. public_v2 was returning HTTP 500 from snmp-exporter for this target. Verified bluejay_v2 returns metrics. Keeps printer (10.0.58.107) on public_v2 — Epson ET-3750 uses community "public" as documented in its SNMP settings.	2026-04-22 21:32:14 -05:00
Andrew Stoltz	3bb3801fbd	fix(monitoring): use short service name for irc-notify IRC_HOST CoreDNS iamworkin.lan template + ndots:5 was hijacking unrealircd.irc.svc.cluster.local lookups → Traefik VIP → timeout. Every alert since ~2026-04-09 silently failed with "IRC send failed: timed out", which also killed the thermal-printer path (routed through irc-notify). Same fix pattern as guacamole@28b7600.	2026-04-22 09:55:23 -05:00
Claude Code	67e41febf5	Add agent-zero egress to monitoring NetworkPolicy for blackbox probes	2026-04-08 17:34:16 +00:00

19 Commits
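
Illustrative sketch for e641ceab48 (not taken from the repository): the dedup-and-flush behaviour that commit describes can be approximated as below. Only `add_to_digest`, its bool return contract, the `alert_channel=thermal_print_immediate` label, and the env-var names come from the commit messages. The `DigestBuffer` class, its fields, the `printed` set, `hourly_tick`, `handle`, and the webhook dict shape are assumptions made for the example, and the real notify.py may differ in all of them.

```python
from dataclasses import dataclass, field

IMMEDIATE = "thermal_print_immediate"   # label value named in the commit message
BATCHED = "thermal_print"               # label value named in the commit message


@dataclass
class DigestBuffer:
    """Hypothetical in-memory digest buffer keyed by alert fingerprint."""
    enabled: bool = True                 # stands in for THERMAL_PRINT_ENABLED
    max_pending: int = 50                # stands in for BATCH_MAX_PENDING
    pending: dict = field(default_factory=dict)
    # Assumption: fingerprints already printed in the current interval.
    printed: set = field(default_factory=set)
    # Stand-in for the Print.Web POST; each entry is one thermal print.
    prints: list = field(default_factory=list)

    def add_to_digest(self, alert: dict) -> bool:
        """Return True only when the buffer grew (a state transition)."""
        fp = alert["fingerprint"]
        if not self.enabled:
            return False
        if alert["status"] == "resolved":
            # Resolutions drop the entry and never print.
            self.pending.pop(fp, None)
            self.printed.discard(fp)
            return False
        if fp in self.pending or fp in self.printed:
            return False                 # repeat webhook: dedupe
        self.pending[fp] = alert
        return True

    def flush(self) -> None:
        """Emit one digest listing distinct alertnames, then clear."""
        if not self.pending:
            return
        names = sorted({a["labels"]["alertname"] for a in self.pending.values()})
        self.prints.append(names)
        self.printed.update(self.pending)
        self.pending.clear()

    def hourly_tick(self) -> None:
        """Called once per BATCH_INTERVAL_MIN by a background loop."""
        self.flush()
        self.printed.clear()

    def handle(self, alert: dict) -> None:
        labels = alert["labels"]
        channel = labels.get("alert_channel")
        eligible = channel in (BATCHED, IMMEDIATE) or labels.get("severity") == "critical"
        if not eligible:
            return                       # IRC only, no thermal print
        grew = self.add_to_digest(alert)
        # Flush early only on a new fingerprint, never on a repeat.
        if grew and (channel == IMMEDIATE or len(self.pending) >= self.max_pending):
            self.flush()
```

Under these assumptions, eight repeat webhooks for one firing PrintPaperRollCritical leave a single buffered entry and produce no print until `hourly_tick()` runs, which is the "max 1 thermal print per BATCH_INTERVAL_MIN per alert fingerprint" property the commit message states.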