bluejay-infra

Author	SHA1	Message	Date
Andrew Stoltz	bacac067cf	monitoring(irc-notify): hourly digest batching for thermal printer The thermal printer drained overnight (2026-05-18/19) because the old notify.py POSTed one print job per Grafana webhook fire. With 9 concurrently-firing alerts (zabbix-postgres + fc-devicemgmt + brochure + PrintPaperRollLow), every evaluation cycle stamped fresh CUPS jobs onto the queue until the operator physically powered the printer off. This refactor: - Adds env-var config: THERMAL_PRINT_ENABLED (master kill switch), BATCH_INTERVAL_MIN (default 60), BATCH_MAX_PENDING (default 50). - IRC delivery stays per-event (operator wants the live stream). - Thermal routing now: * critical/disaster/page severity OR alert_channel=thermal_print_immediate -> print immediately * alert_channel=thermal_print -> enqueue into hourly digest * RESOLVED -> remove from digest buffer (no resolution-spam prints) * else -> IRC only, no thermal - Background digest_loop thread flushes the buffer hourly (or sooner if buffer hits BATCH_MAX_PENDING). Digest payload is a single Print.Web /api/print/alert POST listing distinct alertnames + per-rule target counts. - New POST /flush endpoint (manual operator force-flush; useful for testing without waiting an hour). - GET / returns config + buffer depth + per-stat counters for observability. Net effect: max 1 thermal print per BATCH_INTERVAL_MIN for batched warnings, plus immediate prints for criticals. Closes the 2026-05-18/19 alert-storm incident. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 09:56:14 -05:00
Andrew Stoltz	46c392605e	monitoring: mirror PuppetServiceFailed alert from Notes (Sprint 33 Cx-7 Phase B) Mirrors the live `puppet` alert group from FlowerCore.Notes/scripts/monitoring/alerts.yml into the K8s ConfigMap so a future in-cluster Prometheus inherits the ruleset automatically. Source of truth remains the Notes file (live Podman Prometheus on noc1). See feedback_monitoring_k8s_target_vs_live_podman. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:11:07 -05:00
bluejay	634b9c4169	feat(github-runner): harden Linux runner fleet (#5)	2026-05-18 02:51:02 +00:00
bluejay	9fd32c4415	feat(monitoring): MacMiniRunnerOffline alert (Sprint 28 reconcile)	2026-05-17 19:50:29 +00:00
Codex	653d4472f5	fix(monitoring): mirror Q-MR-3 MultusMemoryPressure + NamespacePendingPodBacklog alerts Two new preventive alert rules added to the kubernetes-state group of the K8s migration target ConfigMap. The live Podman Prometheus on noc1 has already been updated via FlowerCore.Notes/scripts/monitoring/alerts.yml + sudo cp + podman pod restart monitoring (this commit only locks it in the bluejay-infra K8s mirror so a future migration carries it forward). MultusMemoryPressure (critical, thermal_print): fires when kube-multus working set exceeds 80% of its memory limit for 5m. Catches the next multus OOM cascade BEFORE it kills the daemon cluster-wide. The 2026-05-10 21h outage hit because no alert fired on the rising multus working set; only downstream blackbox / Traefik / service alerts triggered, after the fact. NamespacePendingPodBacklog (warning): fires when any single namespace has >25 Pending pods sustained for 30m. Catches the operator-leak avalanche pattern (orphan pods from a crashed reconciler emitting children without ownerReferences) before it cascades into a CNI OOM. See FlowerCore.Notes: - feedback_multus_50mi_limit_oom_orphan_pod_avalanche - feedback_monitoring_k8s_target_vs_live_podman (workflow) Companion commits: - bluejay-infra@eb8693e (multus memory limit) - FlowerCore.RemoteDesktop@b02c59b (OwnerReferences fix) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 10:42:27 -05:00
Codex	4b777b16ac	monitoring: mirror fc-signage-marquee alert group into noc-monitoring K8s target Mirror of FlowerCore.Notes/scripts/monitoring/alerts.yml fc-signage-marquee group into the K8s migration target apps/monitoring/noc-monitoring.yaml so that future migration of the noc1 Podman monitoring stack into RKE2 inherits the marquee alert ruleset automatically. Three rules added: - MarqueeDroppedFramesHigh (5% / 5min / warning) - MarqueeRenderLatencyP99High (16ms / 10min / warning) - MarqueeAnimationDurationDrift (10% / 15min / info) All three gated with `unless on() absent_over_time(metric[7d])` so they don't fire during the metric-not-yet-published window before Track 3 IR-21 source IMPL ships the MarqueeMeter into Common + Web + WPF. Live source-of-truth (the noc1 Podman Prometheus reads from /opt/monitoring/prometheus/alerts.yml) was updated and reloaded in the same session — Notes commit 300daa0 carries the matching alerts.yml + Grafana fc-signage-dashboard.json change. Per feedback_monitoring_k8s_target_vs_live_podman: this file is the forward-looking K8s migration target, NOT what the live Podman Prometheus reads. ArgoCD-syncing this file does NOT push alerts to the live monitoring stack. Companion to: - FlowerCore.Notes 300daa0 (live alerts.yml + Grafana panels deployed) - docs/signage/marquee-performance-telemetry-design.md (Track 3 IR-21 spec) - docs/signage/marquee-animation-phases.md (Track 6 13-phase coverage matrix) Memory: project_marquee_vr_promotion_landed_2026_05_06 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 16:01:44 -05:00
Andrew Stoltz	3e0b9055b0	monitoring: paper-roll lifecycle alerts (XL Track I) Three new Prometheus alert rules for the print-services group, all routed to thermal_print via alert_channel label (Grafana contact point -> irc-notify -> Print.Web /api/print/alert): - PrintPaperRollLow (warning, 5-10% remaining, 5m for) - PrintPaperRollCritical (critical, <=5% remaining, 2m for) - PrintJobDeadLetter (warning, any new dead-letter in 15m) Source-of-truth gauge is print_paper_remaining_percent (Print.Web OTEL), which is hydrated from the active PaperRoll row at process startup (Print.Web@<TBD> HydrateMetricsAsync) so the gauge isn't blind for an arbitrary window after every deploy. Self-referential humor: low-roll alerts route to the printer that's running out of paper, so it announces its own paper-out warning on its remaining paper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 16:00:40 -05:00
Andrew Stoltz	e2c71c2b8a	fix agent-zero ollama-proxy crashloop + add Longhorn monitoring agent-zero ollama-proxy had 172 historic restarts (now stable). Root cause: liveness/readiness probes hit /api/tags which proxies through to BLUEJAY-WS Ollama (10.0.56.20:11434). When the workstation Ollama is slow or offline, nginx fails over to the edge1 backup — but the failover takes >1s and the kube-probe default timeoutSeconds=1 gives up first. Three failed probes → kubelet kills the container. Fix: - Add nginx local healthz endpoint (200, no upstream). - Liveness probe → /healthz (proves nginx itself is alive). - Readiness probe stays on /api/tags but with timeoutSeconds=5 so failover to backup completes before the probe times out. This decouples liveness from upstream availability — kubelet only restarts the proxy when nginx is genuinely dead, not when Ollama is slow. Longhorn coverage gap: K8s emits "snapshot becomes not ready to use" events constantly during the hourly snapshot lifecycle (1047 snapshots, all readyToUse=true on inspect). Those events were the only signal we had — purely transient lifecycle noise, not actionable. Add: - longhorn scrape job (longhorn-backend.longhorn-system.svc:9500) - NetworkPolicy egress rule for longhorn-system port 9500 - 4 new alerts in 'longhorn-storage' group: - LonghornVolumeDegraded (>15m) — replica unhealthy, auto-rebuild - LonghornVolumeFaulted (>5m, critical, thermal print) — data loss - LonghornBackupStale (no completed backup in >36h) — recurring job silently failing - LonghornNodeUnhealthy (>5m) — node ready=false zabbix-web 7 restarts and Print.Web 12:55 stop investigated — both are stable now, no actionable cause found in journal/events. Adding KubeContainerRestartingFrequently in the previous commit will catch recurrence of either. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:31:14 -05:00
Andrew Stoltz	b3028f5119	monitoring: fix RemoteDesktop pool alerts for stale per-status series Followup to `05a273d`. After deploy, six PoolDepleted/Deficit alerts went pending again because the publisher emits per-status gauge series (fc_desktop_pool_depleted{template,status,alert_level}) and the historical Warming/BelowDesiredSize series stay at value=1 even after the template transitions to status=Ready. Filtering by alert_level=Critical/Warning was not enough — those labels are baked into the stale series too. Replace with a join-based query: alert only when the canonical "Ready" status gauge does NOT report ready=1 for the enabled template. fc_desktop_pool_ready{status="Ready"}==1 is the publisher's own current-state canary and never goes stale. Verified against the live cluster — query returns 0 results when all pools report healthy in their reconcile logs (no stale-label false positives). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:12:10 -05:00
Andrew Stoltz	05a273d3a6	monitoring: switch K8s scrapes to ClusterIP svc + fix probe paths Followup to `ab6ade4`. Three issues uncovered after the rollout: 1. NodePort hairpin breaks scrape from same-node pod. Prometheus on rke2-agent1 could reach traefik-metrics on .11/.13 NodePort 30900 but timed out on its OWN node's NodePort. Same problem would hit kube-state-metrics + cert-manager whenever prometheus reschedules. Fix: scrape via ClusterIP svc DNS instead of NodePort. NodePorts stay in place for external/Podman scrapers. 2. probe-traefik-services failed for grafana, prometheus, guac with non-200/3xx codes. grafana + prometheus are behind Traefik basic- auth (every endpoint returns 401), so drop from probe surface — health is covered by the in-cluster monitoring-* scrape jobs. guac.iamworkin.lan was deprecated when Guacamole moved under desktop.iamworkin.lan/guacamole/ — drop it. 3. acme path was wrong (root 404). Use /health. Coverage adds (probe-traefik-services): chat, dist, dms, menuboard, messageboard, presentations, retail, ttsreader. All of these have IngressRoutes serving root at 200/3xx. NetworkPolicy egress rules added so the new ClusterIP svc scrapes work: - traefik-system: port 9100 (metrics) — separate from data-path 8080/8443 - kube-system: port 8080 (kube-state-metrics) - cert-manager: port 9402 (controller metrics) Out-of-band fix during this audit: - Print.Web on edge2 was inactive (clean exit at 12:55 CDT, root cause unclear — systemd Stopping signal). Restarted. Service back on 5200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:05:32 -05:00
Andrew Stoltz	ab6ade4e46	monitoring: stabilize firing alerts + add cluster-state coverage Live audit on 2026-04-26 found 14 firing alerts caused by stale probe targets, blackbox TLS verify failures, and stale state-as-label series. Plus three K8s scrape sources (kube-state-metrics, cert-manager, traefik) that exposed NodePorts but were not in any scrape config. Fixes - probe-remotedesktop: switch http_2xx -> https_internal. Blackbox does not trust step-ca root, so /health was failing with x509 unknown authority while the app served 200s. - probe-agentzero-nuc: short svc form (agent-zero.agent-zero.svc:80) instead of *.cluster.local. The FQDN form was being rewritten to the Traefik VIP by the CoreDNS iamworkin.lan template + ndots:5 search expansion, then 5s timeout. - probe-agentzero-local + probe-ollama-local: removed. 10.0.58.100 is on HOME VLAN and not reachable from cluster pods. Workstation/AI-laptop Ollama monitoring belongs to host-side Puppet, not cluster blackbox. - snmp-cloudkey: commented out. The Cloud Key Gen2+ runs unifi-core (controller), not an SNMP agent. Was generating "connection refused" every 30s. - RemoteDesktopPoolDepleted / RemoteDesktopPoolDeficitSustained: filter on alert_level=Critical / Warning\|Critical + enabled=true. The publisher emits one series per template per status without resetting old series to 0, so the historical Warming/BelowDesiredSize series stayed at 1 and the alert kept firing on stale labels. - RemoteDesktopTlsExpiry: match by job, not hostname-only instance. The probe sets instance=https://desktop.iamworkin.lan/health so a hostname-only label match never fired. - EpsonPrinterDown for: 5m -> 30m. EcoTank sleeps after ~5 min idle and SNMP times out, so 5m guaranteed nightly noise. Coverage adds - kube-state-metrics scrape (NodePort 30901). Required for the new pod-state alerts and a long list of standard K8s SLO queries. - cert-manager scrape (NodePort 30902). Required for the CertManagerCertificateNotReady / RenewalFailed alert pair documented in project_cert_manager_prometheus_scrape. - traefik scrape (NodePort 30900) on all three nodes. - probe-traefik-services: HTTPS probe (https_internal) over the 17 main iamworkin.lan hosts so any Traefik-fronted service returning non-200 shows up as a single named probe failure. - blackbox-config: add the https_internal module that the new probes reference (was only in the FlowerCore.Notes scripts/monitoring copy, not in the live ConfigMap). New alerts (kubernetes-state group) - KubeContainerRestartingFrequently (>5 restarts/h) - KubeContainerCrashLooping (>3 restarts/15m, thermal print) - KubePodNotReady (Pending/Failed/Unknown >15m) - KubePodImagePullBackOff (>10m) - KubeDeploymentReplicasMismatch (>15m) Without these, the agent-zero ollama-proxy 172x restart loop was invisible for ~3 days. Same gap would have hidden the fc-php php84-app-probe ImagePullBackOff orphan (cleaned up out of band). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 12:57:18 -05:00
Andrew Stoltz	c23e903ba7	feat(monitoring): Grafana alert rules route RemoteDesktop to IRC Companion to the Prometheus alert rules landed in `e44e9a0`. The Prometheus rules were loading but never delivered — the monitoring stack has no Alertmanager configured; Grafana owns alert routing via its built-in engine + webhook contact point to irc-notify.monitoring.svc:9119. Without a matching Grafana alert, the Prometheus rules just show up in the Prometheus UI and page no one. Adds 6 Grafana alert rules in a new `RemoteDesktop` group under the AI Stack Alerts folder: - remotedesktop-web-down (3m) — probe_success{job="probe-remotedesktop"} < 1 - remotedesktop-metrics-stale (10m) — fc_desktop_session_events_total series absent - remotedesktop-pool-depleted (5m) — fc_desktop_pool_depleted > 0 - remotedesktop-pool-deficit-sustained (10m info) — fc_desktop_pool_deficit > 0 - remotedesktop-session-churn-spike (5m info) — launch rate > 20/min - remotedesktop-tls-expiry (6h critical) — cert < 2 days to expiry Each uses the standard Grafana 3-stage pipeline (query → reduce → threshold) matching the existing AI Stack + Infrastructure alert patterns. Labels: service=remotedesktop + severity (warning/info/critical). Default route is `IRC #alerts` via the existing webhook contact point. Parity with the Prometheus rules (which already fire internally for the Prometheus UI + any future Alertmanager integration). Grafana restart picks up the new provisioning on next reload. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:57:26 -05:00
Andrew Stoltz	3c5c1a07bd	fix(monitoring): netpol egress allows for fc-desktop + Traefik hairpin Adds two egress allows to monitoring-netpol so Prometheus can scrape FlowerCore.RemoteDesktop: 1. fc-desktop namespace on port 8080 — direct ClusterIP service target (remotedesktop-web.fc-desktop:8080). 2. traefik-system namespace pods on ports 8080 + 8443 — covers the Traefik VIP hairpin path for the `https://desktop.iamworkin.lan` scrape target (CoreDNS wildcard resolves iamworkin.lan hostnames to the LB VIP; after kube-proxy DNAT, egress needs the backend pod port allowed per feedback_netpol_dnat_backend_port). Without these, the fc-remotedesktop scrape times out with "context deadline exceeded" even though the monitoring-netpol already allows the 10.0.56.0/24 CIDR — post-DNAT the destination is a 10.42.x.x pod IP, not the VIP. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:47:50 -05:00
Andrew Stoltz	e44e9a0062	feat(monitoring): RemoteDesktop alerts + scrape jobs + dashboard mount Three additions to the monitoring ConfigMap, each targeting FlowerCore.RemoteDesktop: - Scrape jobs (2 new): - probe-remotedesktop: blackbox http_2xx against https://desktop.iamworkin.lan/health every 30s. Feeds the RemoteDesktopWebDown alert. - fc-remotedesktop: direct /metrics scrape against desktop.iamworkin.lan for the fc_desktop_session_events_total and fc_desktop_pool_* series. - Alert group `remote-desktop` (7 rules in alerts.yml): - RemoteDesktopWebDown (3m) — /health probe failing - RemoteDesktopMetricsStale (10m) — absent metrics series - RemoteDesktopPoolDepleted (5m) — pool deficit + depleted flag - RemoteDesktopPoolDeficitSustained (10m, info) — persistent below-desired pool size - RemoteDesktopSessionChurnSpike (5m, info) — launch rate >20/min - RemoteDesktopRecordingEventsDropped (15m, info) — 30m without recording events while launches active - RemoteDesktopTlsExpiry (6h, critical) — <2d cert renewal window; aligns with feedback_acme_expiry_alert_threshold - Grafana dashboard mount: new volumeMounts + volumes entry for `dashboards-remotedesktop` backed by the grafana-dashboard-remotedesktop ConfigMap (previously added as a standalone file in `d4210c8`). Folder path /var/lib/grafana/dashboards/remotedesktop — picked up by the file-provider with foldersFromFilesStructure:true so the dashboard shows up in a "Remotedesktop" folder in Grafana. No CRLF churn; pure 100-line insertion into LF-normalized file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:41:35 -05:00
Andrew Stoltz	93f77c1844	fix(monitoring): use bluejay_v2 auth for snmp-nas (not public_v2) Synology NAS is configured with community bluejay_monitor (→ snmp.yml auth 'bluejay_v2'), not public. public_v2 was returning HTTP 500 from snmp-exporter for this target. Verified bluejay_v2 returns metrics. Keeps printer (10.0.58.107) on public_v2 — Epson ET-3750 uses community "public" as documented in its SNMP settings.	2026-04-22 21:32:14 -05:00
Andrew Stoltz	3bb3801fbd	fix(monitoring): use short service name for irc-notify IRC_HOST CoreDNS iamworkin.lan template + ndots:5 was hijacking unrealircd.irc.svc.cluster.local lookups → Traefik VIP → timeout. Every alert since ~2026-04-09 silently failed with "IRC send failed: timed out", which also killed the thermal-printer path (routed through irc-notify). Same fix pattern as guacamole@28b7600.	2026-04-22 09:55:23 -05:00
Claude Code	67e41febf5	Add agent-zero egress to monitoring NetworkPolicy for blackbox probes	2026-04-08 17:34:16 +00:00

17 Commits
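
The thermal routing rules described in `bacac067cf` can be sketched roughly as below. This is a minimal illustration, not the actual notify.py: the names `Route`, `thermal_route`, and `DigestBuffer` are hypothetical, and the order of the checks (kill switch first, then RESOLVED, then severity) is an assumption, since the commit message lists the rules without stating their precedence. Only the severity names, the `alert_channel` values, and the default of 50 for `BATCH_MAX_PENDING` come from the commit message.

```python
from collections import Counter
from enum import Enum


class Route(Enum):
    PRINT_NOW = "print_now"                # send one thermal print immediately
    DIGEST = "digest"                      # hold for the next batched digest
    DROP_FROM_DIGEST = "drop_from_digest"  # resolved: take it out of the buffer
    IRC_ONLY = "irc_only"                  # no thermal output at all


# Severities the commit message says bypass the digest.
IMMEDIATE_SEVERITIES = {"critical", "disaster", "page"}


def thermal_route(status, severity, alert_channel, thermal_enabled=True):
    """Pick the thermal-printer route for one alert (IRC delivery is separate)."""
    if not thermal_enabled:                # THERMAL_PRINT_ENABLED kill switch
        return Route.IRC_ONLY
    if status.lower() == "resolved":       # assumption: resolved never prints
        return Route.DROP_FROM_DIGEST
    if (severity.lower() in IMMEDIATE_SEVERITIES
            or alert_channel == "thermal_print_immediate"):
        return Route.PRINT_NOW
    if alert_channel == "thermal_print":
        return Route.DIGEST
    return Route.IRC_ONLY


class DigestBuffer:
    """Hypothetical in-memory buffer keyed by (alertname, target)."""

    def __init__(self, max_pending=50):    # BATCH_MAX_PENDING default
        self.max_pending = max_pending
        self._pending = set()

    def add(self, alertname, target):
        """Queue an alert; return True when the buffer should flush early."""
        self._pending.add((alertname, target))
        return len(self._pending) >= self.max_pending

    def remove(self, alertname, target):
        """Drop a resolved alert so it never reaches the printed digest."""
        self._pending.discard((alertname, target))

    def flush(self):
        """Return per-alertname target counts and empty the buffer."""
        counts = Counter(name for name, _ in self._pending)
        self._pending.clear()
        return dict(counts)
```

Under these assumptions, a rule labelled both critical and `thermal_print` (such as MultusMemoryPressure) prints immediately, while a warning such as PrintPaperRollLow waits for the digest. Repeated firings of the same alert and target collapse into one buffer entry, which is what keeps a storm of re-evaluations from producing a storm of print jobs.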