bluejay-infra

Author	SHA1	Message	Date
Andrew Stoltz	c23e903ba7	feat(monitoring): Grafana alert rules route RemoteDesktop to IRC Companion to the Prometheus alert rules that landed in `e44e9a0`. The Prometheus rules were loading but never delivered: the monitoring stack has no Alertmanager configured, and Grafana owns alert routing via its built-in engine plus a webhook contact point to irc-notify.monitoring.svc:9119. Without a matching Grafana alert, the Prometheus rules only show up in the Prometheus UI and page no one. Adds 6 Grafana alert rules in a new `RemoteDesktop` group under the AI Stack Alerts folder: - remotedesktop-web-down (3m): probe_success{job="probe-remotedesktop"} < 1 - remotedesktop-metrics-stale (10m): fc_desktop_session_events_total series absent - remotedesktop-pool-depleted (5m): fc_desktop_pool_depleted > 0 - remotedesktop-pool-deficit-sustained (10m, info): fc_desktop_pool_deficit > 0 - remotedesktop-session-churn-spike (5m, info): launch rate > 20/min - remotedesktop-tls-expiry (6h, critical): cert < 2 days to expiry Each uses the standard Grafana 3-stage pipeline (query → reduce → threshold), matching the existing AI Stack + Infrastructure alert patterns. Labels: service=remotedesktop + severity (warning/info/critical). Default route is `IRC #alerts` via the existing webhook contact point. This gives parity with the Prometheus rules (which already fire internally for the Prometheus UI + any future Alertmanager integration), except RemoteDesktopRecordingEventsDropped, which has no Grafana counterpart in this commit. Grafana picks up the new provisioning on its next restart or reload. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:57:26 -05:00
Andrew Stoltz	3c5c1a07bd	fix(monitoring): netpol egress allows for fc-desktop + Traefik hairpin Adds two egress allows to monitoring-netpol so Prometheus can scrape FlowerCore.RemoteDesktop: 1. fc-desktop namespace on port 8080 — direct ClusterIP service target (remotedesktop-web.fc-desktop:8080). 2. traefik-system namespace pods on ports 8080 + 8443 — covers the Traefik VIP hairpin path for the `https://desktop.iamworkin.lan` scrape target (CoreDNS wildcard resolves iamworkin.lan hostnames to the LB VIP; after kube-proxy DNAT, egress needs the backend pod port allowed per feedback_netpol_dnat_backend_port). Without these, the fc-remotedesktop scrape times out with "context deadline exceeded" even though the monitoring-netpol already allows the 10.0.56.0/24 CIDR — post-DNAT the destination is a 10.42.x.x pod IP, not the VIP. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:47:50 -05:00
Andrew Stoltz	e44e9a0062	feat(monitoring): RemoteDesktop alerts + scrape jobs + dashboard mount Three additions to the monitoring ConfigMap, each targeting FlowerCore.RemoteDesktop: - Scrape jobs (2 new): - probe-remotedesktop: blackbox http_2xx against https://desktop.iamworkin.lan/health every 30s. Feeds the RemoteDesktopWebDown alert. - fc-remotedesktop: direct /metrics scrape against desktop.iamworkin.lan for the fc_desktop_session_events_total and fc_desktop_pool_* series. - Alert group `remote-desktop` (7 rules in alerts.yml): - RemoteDesktopWebDown (3m) — /health probe failing - RemoteDesktopMetricsStale (10m) — absent metrics series - RemoteDesktopPoolDepleted (5m) — pool deficit + depleted flag - RemoteDesktopPoolDeficitSustained (10m, info) — persistent below-desired pool size - RemoteDesktopSessionChurnSpike (5m, info) — launch rate >20/min - RemoteDesktopRecordingEventsDropped (15m, info) — 30m without recording events while launches active - RemoteDesktopTlsExpiry (6h, critical) — <2d cert renewal window; aligns with feedback_acme_expiry_alert_threshold - Grafana dashboard mount: new volumeMounts + volumes entry for `dashboards-remotedesktop` backed by the grafana-dashboard-remotedesktop ConfigMap (previously added as a standalone file in `d4210c8`). Folder path /var/lib/grafana/dashboards/remotedesktop — picked up by the file-provider with foldersFromFilesStructure:true so the dashboard shows up in a "Remotedesktop" folder in Grafana. No CRLF churn; pure 100-line insertion into LF-normalized file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:41:35 -05:00
Andrew Stoltz	93f77c1844	fix(monitoring): use bluejay_v2 auth for snmp-nas (not public_v2) Synology NAS is configured with community bluejay_monitor (→ snmp.yml auth 'bluejay_v2'), not public. public_v2 was returning HTTP 500 from snmp-exporter for this target. Verified bluejay_v2 returns metrics. Keeps printer (10.0.58.107) on public_v2 — Epson ET-3750 uses community "public" as documented in its SNMP settings.	2026-04-22 21:32:14 -05:00
Andrew Stoltz	3bb3801fbd	fix(monitoring): use short service name for irc-notify IRC_HOST CoreDNS iamworkin.lan template + ndots:5 was hijacking unrealircd.irc.svc.cluster.local lookups → Traefik VIP → timeout. Every alert since ~2026-04-09 silently failed with "IRC send failed: timed out", which also killed the thermal-printer path (routed through irc-notify). Same fix pattern as guacamole@28b7600.	2026-04-22 09:55:23 -05:00
Claude Code	67e41febf5	Add agent-zero egress to monitoring NetworkPolicy for blackbox probes	2026-04-08 17:34:16 +00:00
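
Commit 3c5c1a07bd explains that the existing 10.0.56.0/24 egress allow did not help because, after kube-proxy DNAT, the destination is a 10.42.x.x pod IP and not the VIP. The following is a minimal sketch of that reasoning only, not the actual NetworkPolicy evaluation. The VIP and pod addresses are hypothetical values chosen for illustration; only the CIDR comes from the commit message.

```python
import ipaddress

# Egress CIDR that monitoring-netpol already allowed, per the commit message.
ALLOWED_CIDR = ipaddress.ip_network("10.0.56.0/24")

# Hypothetical addresses, for illustration only (not taken from the repo).
TRAEFIK_VIP = ipaddress.ip_address("10.0.56.10")   # LB VIP the hostname resolves to
TRAEFIK_POD = ipaddress.ip_address("10.42.1.17")   # backend pod IP after DNAT


def cidr_rule_allows(dest):
    """Return True if a CIDR-only egress rule would match this destination."""
    return dest in ALLOWED_CIDR


# The address Prometheus dials matches the CIDR rule...
vip_allowed = cidr_rule_allows(TRAEFIK_VIP)
# ...but the post-DNAT destination does not.
pod_allowed = cidr_rule_allows(TRAEFIK_POD)

print("VIP matches CIDR allow:", vip_allowed)
print("post-DNAT pod IP matches CIDR allow:", pod_allowed)
```

Under these assumptions the CIDR rule matches the VIP but not the post-DNAT pod IP. That is consistent with the commit adding namespace-scoped allows on the backend ports (8080 and 8443) instead of widening the CIDR.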

6 Commits
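
Commit 3bb3801fbd attributes the irc-notify timeouts to ndots:5 combined with the CoreDNS iamworkin.lan template. The sketch below models the standard resolver search-list ordering to show one plausible way that interaction plays out. The search list (in particular the presence of `iamworkin.lan` in it), the short name `unrealircd.irc`, and the helper names are assumptions for illustration; none of them is confirmed by the log.

```python
def candidate_names(name, search, ndots=5):
    """Order in which a stub resolver tries names for a given search list."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search expansion.
        return [name[:-1]]
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [name] + expanded
    # Too few dots: walk the search list first, then the name as-is.
    return expanded + [name]


def resolve(name, search, cluster_names, wildcard_suffix, ndots=5):
    """Return which responder answers first: 'cluster', 'wildcard' or 'nxdomain'."""
    for candidate in candidate_names(name, search, ndots):
        if candidate in cluster_names:
            return "cluster"
        if candidate.endswith("." + wildcard_suffix):
            return "wildcard"
    return "nxdomain"


# Assumed pod search list: the usual cluster suffixes plus a node-inherited domain.
SEARCH = ["monitoring.svc.cluster.local", "svc.cluster.local",
          "cluster.local", "iamworkin.lan"]
CLUSTER_NAMES = {"unrealircd.irc.svc.cluster.local"}

# The full service name has only 4 dots, so the search list is tried first
# and the wildcard answers before the real name is ever queried.
fqdn_result = resolve("unrealircd.irc.svc.cluster.local", SEARCH,
                      CLUSTER_NAMES, "iamworkin.lan")
# A short name is completed by the cluster suffixes before the wildcard domain.
short_result = resolve("unrealircd.irc", SEARCH, CLUSTER_NAMES, "iamworkin.lan")

print(fqdn_result, short_result)
```

In this model the four-dot service name falls below ndots:5, so the `iamworkin.lan` candidate is answered by the wildcard before the real name is tried, while the shorter name resolves inside the cluster first.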