bluejay-infra

Author	SHA1	Message	Date
Andrew Stoltz	e2c71c2b8a	fix agent-zero ollama-proxy crashloop + add Longhorn monitoring agent-zero ollama-proxy had 172 historic restarts (now stable). Root cause: liveness/readiness probes hit /api/tags which proxies through to BLUEJAY-WS Ollama (10.0.56.20:11434). When the workstation Ollama is slow or offline, nginx fails over to the edge1 backup — but the failover takes >1s and the kube-probe default timeoutSeconds=1 gives up first. Three failed probes → kubelet kills the container. Fix: - Add nginx local healthz endpoint (200, no upstream). - Liveness probe → /healthz (proves nginx itself is alive). - Readiness probe stays on /api/tags but with timeoutSeconds=5 so failover to backup completes before the probe times out. This decouples liveness from upstream availability — kubelet only restarts the proxy when nginx is genuinely dead, not when Ollama is slow. Longhorn coverage gap: K8s emits "snapshot becomes not ready to use" events constantly during the hourly snapshot lifecycle (1047 snapshots, all readyToUse=true on inspect). Those events were the only signal we had — purely transient lifecycle noise, not actionable. Add: - longhorn scrape job (longhorn-backend.longhorn-system.svc:9500) - NetworkPolicy egress rule for longhorn-system port 9500 - 4 new alerts in 'longhorn-storage' group: - LonghornVolumeDegraded (>15m) — replica unhealthy, auto-rebuild - LonghornVolumeFaulted (>5m, critical, thermal print) — data loss - LonghornBackupStale (no completed backup in >36h) — recurring job silently failing - LonghornNodeUnhealthy (>5m) — node ready=false zabbix-web 7 restarts and Print.Web 12:55 stop investigated — both are stable now, no actionable cause found in journal/events. Adding KubeContainerRestartingFrequently in the previous commit will catch recurrence of either. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:31:14 -05:00
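The probe split described in `e2c71c2b8a` could look roughly like the container fragment below. This is a sketch, not the committed manifest: the container name, image, and port are assumptions. Only the two probe paths and the readiness `timeoutSeconds=5` come from the commit message; the liveness values shown are the Kubernetes defaults.

```yaml
# Hypothetical fragment of the ollama-proxy Deployment (name, image, port assumed).
containers:
  - name: ollama-proxy          # assumed container name
    image: nginx:stable         # assumed image
    ports:
      - containerPort: 11434    # assumed listen port of the proxy
    livenessProbe:
      httpGet:
        path: /healthz          # served by nginx itself, no upstream call
        port: 11434
      timeoutSeconds: 1         # Kubernetes default; enough for a local 200
      failureThreshold: 3       # Kubernetes default; three failures restart the container
    readinessProbe:
      httpGet:
        path: /api/tags         # still proxied through to Ollama
        port: 11434
      timeoutSeconds: 5         # leaves room for failover to the backup upstream
```

With this split, a slow or offline upstream can at most mark the pod unready; it no longer causes a restart.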
Andrew Stoltz	b3028f5119	monitoring: fix RemoteDesktop pool alerts for stale per-status series Followup to `05a273d`. After deploy, six PoolDepleted/Deficit alerts went pending again because the publisher emits per-status gauge series (fc_desktop_pool_depleted{template,status,alert_level}) and the historical Warming/BelowDesiredSize series stay at value=1 even after the template transitions to status=Ready. Filtering by alert_level=Critical/Warning was not enough — those labels are baked into the stale series too. Replace with a join-based query: alert only when the canonical "Ready" status gauge does NOT report ready=1 for the enabled template. fc_desktop_pool_ready{status="Ready"}==1 is the publisher's own current-state canary and never goes stale. Verified against the live cluster — query returns 0 results when all pools report healthy in their reconcile logs (no stale-label false positives). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:12:10 -05:00
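The join-based query from `b3028f5119` is not quoted in the message. A minimal sketch of the idea, assuming the `enabled` label sits on the depleted series and that both metrics share a `template` label, might be:

```yaml
# Sketch only: the exact expression in alerts.yml is not shown in the commit message.
groups:
  - name: remote-desktop
    rules:
      - alert: RemoteDesktopPoolDepleted
        # Collapse the (possibly stale) per-status series to one per template,
        # then drop every template whose canonical Ready gauge currently reports 1.
        expr: |
          max by (template) (fc_desktop_pool_depleted{enabled="true"} == 1)
          unless on (template)
          (fc_desktop_pool_ready{status="Ready"} == 1)
        for: 5m
        labels:
          severity: warning     # assumed severity
```

Because `unless on (template)` removes any template with a live Ready gauge, stale Warming or BelowDesiredSize series cannot keep the alert pending on their own.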
Andrew Stoltz	05a273d3a6	monitoring: switch K8s scrapes to ClusterIP svc + fix probe paths Followup to `ab6ade4`. Three issues uncovered after the rollout: 1. NodePort hairpin breaks scrape from same-node pod. Prometheus on rke2-agent1 could reach traefik-metrics on .11/.13 NodePort 30900 but timed out on its OWN node's NodePort. Same problem would hit kube-state-metrics + cert-manager whenever prometheus reschedules. Fix: scrape via ClusterIP svc DNS instead of NodePort. NodePorts stay in place for external/Podman scrapers. 2. probe-traefik-services failed for grafana, prometheus, guac with non-200/3xx codes. grafana + prometheus are behind Traefik basic- auth (every endpoint returns 401), so drop from probe surface — health is covered by the in-cluster monitoring-* scrape jobs. guac.iamworkin.lan was deprecated when Guacamole moved under desktop.iamworkin.lan/guacamole/ — drop it. 3. acme path was wrong (root 404). Use /health. Coverage adds (probe-traefik-services): chat, dist, dms, menuboard, messageboard, presentations, retail, ttsreader. All of these have IngressRoutes serving root at 200/3xx. NetworkPolicy egress rules added so the new ClusterIP svc scrapes work: - traefik-system: port 9100 (metrics) — separate from data-path 8080/8443 - kube-system: port 8080 (kube-state-metrics) - cert-manager: port 9402 (controller metrics) Out-of-band fix during this audit: - Print.Web on edge2 was inactive (clean exit at 12:55 CDT, root cause unclear — systemd Stopping signal). Restarted. Service back on 5200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:05:32 -05:00
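A scrape configuration using ClusterIP service DNS, as `05a273d3a6` describes, could be sketched as follows. The ports are the ones named in the commit; the job names and the service names (other than the `traefik-metrics` name mentioned in the message) are assumptions.

```yaml
# Sketch: in-cluster scrapes through ClusterIP services instead of NodePorts.
scrape_configs:
  - job_name: traefik                     # assumed job name
    static_configs:
      - targets: ['traefik-metrics.traefik-system.svc:9100']
  - job_name: kube-state-metrics          # assumed job and service name
    static_configs:
      - targets: ['kube-state-metrics.kube-system.svc:8080']
  - job_name: cert-manager                # assumed job and service name
    static_configs:
      - targets: ['cert-manager.cert-manager.svc:9402']
```

The targets use the short `*.svc` form rather than `*.cluster.local`, matching the DNS workaround noted in `ab6ade4`.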
Andrew Stoltz	ab6ade4e46	monitoring: stabilize firing alerts + add cluster-state coverage Live audit on 2026-04-26 found 14 firing alerts caused by stale probe targets, blackbox TLS verify failures, and stale state-as-label series. Plus three K8s scrape sources (kube-state-metrics, cert-manager, traefik) that exposed NodePorts but were not in any scrape config. Fixes - probe-remotedesktop: switch http_2xx -> https_internal. Blackbox does not trust step-ca root, so /health was failing with x509 unknown authority while the app served 200s. - probe-agentzero-nuc: short svc form (agent-zero.agent-zero.svc:80) instead of *.cluster.local. The FQDN form was being rewritten to the Traefik VIP by the CoreDNS iamworkin.lan template + ndots:5 search expansion, then 5s timeout. - probe-agentzero-local + probe-ollama-local: removed. 10.0.58.100 is on HOME VLAN and not reachable from cluster pods. Workstation/AI-laptop Ollama monitoring belongs to host-side Puppet, not cluster blackbox. - snmp-cloudkey: commented out. The Cloud Key Gen2+ runs unifi-core (controller), not an SNMP agent. Was generating "connection refused" every 30s. - RemoteDesktopPoolDepleted / RemoteDesktopPoolDeficitSustained: filter on alert_level=Critical / Warning|Critical + enabled=true. The publisher emits one series per template per status without resetting old series to 0, so the historical Warming/BelowDesiredSize series stayed at 1 and the alert kept firing on stale labels. - RemoteDesktopTlsExpiry: match by job, not hostname-only instance. The probe sets instance=https://desktop.iamworkin.lan/health so a hostname-only label match never fired. - EpsonPrinterDown for: 5m -> 30m. EcoTank sleeps after ~5 min idle and SNMP times out, so 5m guaranteed nightly noise. Coverage adds - kube-state-metrics scrape (NodePort 30901). Required for the new pod-state alerts and a long list of standard K8s SLO queries. - cert-manager scrape (NodePort 30902). Required for the CertManagerCertificateNotReady / RenewalFailed alert pair documented in project_cert_manager_prometheus_scrape. - traefik scrape (NodePort 30900) on all three nodes. - probe-traefik-services: HTTPS probe (https_internal) over the 17 main iamworkin.lan hosts so any Traefik-fronted service returning non-200 shows up as a single named probe failure. - blackbox-config: add the https_internal module that the new probes reference (was only in the FlowerCore.Notes scripts/monitoring copy, not in the live ConfigMap). New alerts (kubernetes-state group) - KubeContainerRestartingFrequently (>5 restarts/h) - KubeContainerCrashLooping (>3 restarts/15m, thermal print) - KubePodNotReady (Pending/Failed/Unknown >15m) - KubePodImagePullBackOff (>10m) - KubeDeploymentReplicasMismatch (>15m) Without these, the agent-zero ollama-proxy 172x restart loop was invisible for ~3 days. Same gap would have hidden the fc-php php84-app-probe ImagePullBackOff orphan (cleaned up out of band). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 12:57:18 -05:00
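The `kubernetes-state` alerts listed in `ab6ade4e46` are named with thresholds but without expressions. A plausible sketch using standard kube-state-metrics series is shown below; the expressions and severities are assumptions, not the committed rules.

```yaml
# Sketch of the kubernetes-state group; thresholds from the commit message,
# expressions and severities assumed.
groups:
  - name: kubernetes-state
    rules:
      - alert: KubeContainerRestartingFrequently
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        labels:
          severity: warning
      - alert: KubeContainerCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        labels:
          severity: critical
      - alert: KubePodImagePullBackOff
        expr: kube_pod_container_status_waiting_reason{reason="ImagePullBackOff"} == 1
        for: 10m
        labels:
          severity: warning
      - alert: KubeDeploymentReplicasMismatch
        expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
        for: 15m
        labels:
          severity: warning
```

A rule like the first one would have flagged the ollama-proxy restart loop within its first hour instead of after roughly three days.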
Andrew Stoltz	c23e903ba7	feat(monitoring): Grafana alert rules route RemoteDesktop to IRC Companion to the Prometheus alert rules landed in `e44e9a0`. The Prometheus rules were loading but never delivered — the monitoring stack has no Alertmanager configured; Grafana owns alert routing via its built-in engine + webhook contact point to irc-notify.monitoring.svc:9119. Without a matching Grafana alert, the Prometheus rules just show up in the Prometheus UI and page no one. Adds 6 Grafana alert rules in a new `RemoteDesktop` group under the AI Stack Alerts folder: - remotedesktop-web-down (3m) — probe_success{job="probe-remotedesktop"} < 1 - remotedesktop-metrics-stale (10m) — fc_desktop_session_events_total series absent - remotedesktop-pool-depleted (5m) — fc_desktop_pool_depleted > 0 - remotedesktop-pool-deficit-sustained (10m info) — fc_desktop_pool_deficit > 0 - remotedesktop-session-churn-spike (5m info) — launch rate > 20/min - remotedesktop-tls-expiry (6h critical) — cert < 2 days to expiry Each uses the standard Grafana 3-stage pipeline (query → reduce → threshold) matching the existing AI Stack + Infrastructure alert patterns. Labels: service=remotedesktop + severity (warning/info/critical). Default route is `IRC #alerts` via the existing webhook contact point. Parity with the Prometheus rules (which already fire internally for the Prometheus UI + any future Alertmanager integration). Grafana restart picks up the new provisioning on next reload. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:57:26 -05:00
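The three-stage pipeline (query, reduce, threshold) mentioned in `c23e903ba7` can be sketched in Grafana's file-provisioning format as below. The data source UID, evaluation interval, and severity are assumptions; the rule name, folder, group, query, and 3m hold time come from the commit message.

```yaml
# Sketch of one provisioned Grafana alert rule (datasource UID and interval assumed).
apiVersion: 1
groups:
  - orgId: 1
    name: RemoteDesktop
    folder: AI Stack Alerts
    interval: 1m                          # assumed evaluation interval
    rules:
      - uid: remotedesktop-web-down
        title: remotedesktop-web-down
        condition: C                      # the threshold stage decides firing
        for: 3m
        data:
          - refId: A                      # stage 1: query Prometheus
            relativeTimeRange: {from: 300, to: 0}
            datasourceUid: prometheus     # assumed data source UID
            model:
              refId: A
              expr: probe_success{job="probe-remotedesktop"}
              instant: true
          - refId: B                      # stage 2: reduce to a single value
            datasourceUid: __expr__
            model:
              refId: B
              type: reduce
              expression: A
              reducer: last
          - refId: C                      # stage 3: fire when the value is below 1
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              expression: B
              conditions:
                - evaluator: {type: lt, params: [1]}
        labels:
          service: remotedesktop
          severity: warning               # assumed severity for this rule
```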
Andrew Stoltz	3c5c1a07bd	fix(monitoring): netpol egress allows for fc-desktop + Traefik hairpin Adds two egress allows to monitoring-netpol so Prometheus can scrape FlowerCore.RemoteDesktop: 1. fc-desktop namespace on port 8080 — direct ClusterIP service target (remotedesktop-web.fc-desktop:8080). 2. traefik-system namespace pods on ports 8080 + 8443 — covers the Traefik VIP hairpin path for the `https://desktop.iamworkin.lan` scrape target (CoreDNS wildcard resolves iamworkin.lan hostnames to the LB VIP; after kube-proxy DNAT, egress needs the backend pod port allowed per feedback_netpol_dnat_backend_port). Without these, the fc-remotedesktop scrape times out with "context deadline exceeded" even though the monitoring-netpol already allows the 10.0.56.0/24 CIDR — post-DNAT the destination is a 10.42.x.x pod IP, not the VIP. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:47:50 -05:00
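The two egress allows from `3c5c1a07bd` might be expressed as in the fragment below. This is a sketch showing only those two rules: the policy's namespace, its pod selector, and the use of the `kubernetes.io/metadata.name` namespace label are assumptions, and the policy's other existing rules are omitted.

```yaml
# Sketch: only the two new egress rules; other rules in monitoring-netpol are omitted.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: monitoring-netpol
  namespace: monitoring            # assumed namespace
spec:
  podSelector: {}                  # assumed: applies to all pods in the namespace
  policyTypes: [Egress]
  egress:
    - to:                          # direct ClusterIP scrape of remotedesktop-web
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: fc-desktop
      ports:
        - {protocol: TCP, port: 8080}
    - to:                          # Traefik backend pods reached after DNAT from the VIP
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: traefik-system
      ports:
        - {protocol: TCP, port: 8080}
        - {protocol: TCP, port: 8443}
```

The second rule matters because NetworkPolicy is evaluated against the post-DNAT pod IP and port, so allowing the VIP's CIDR alone is not enough.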
Andrew Stoltz	e44e9a0062	feat(monitoring): RemoteDesktop alerts + scrape jobs + dashboard mount Three additions to the monitoring ConfigMap, each targeting FlowerCore.RemoteDesktop: - Scrape jobs (2 new): - probe-remotedesktop: blackbox http_2xx against https://desktop.iamworkin.lan/health every 30s. Feeds the RemoteDesktopWebDown alert. - fc-remotedesktop: direct /metrics scrape against desktop.iamworkin.lan for the fc_desktop_session_events_total and fc_desktop_pool_* series. - Alert group `remote-desktop` (7 rules in alerts.yml): - RemoteDesktopWebDown (3m) — /health probe failing - RemoteDesktopMetricsStale (10m) — absent metrics series - RemoteDesktopPoolDepleted (5m) — pool deficit + depleted flag - RemoteDesktopPoolDeficitSustained (10m, info) — persistent below-desired pool size - RemoteDesktopSessionChurnSpike (5m, info) — launch rate >20/min - RemoteDesktopRecordingEventsDropped (15m, info) — 30m without recording events while launches active - RemoteDesktopTlsExpiry (6h, critical) — <2d cert renewal window; aligns with feedback_acme_expiry_alert_threshold - Grafana dashboard mount: new volumeMounts + volumes entry for `dashboards-remotedesktop` backed by the grafana-dashboard-remotedesktop ConfigMap (previously added as a standalone file in `d4210c8`). Folder path /var/lib/grafana/dashboards/remotedesktop — picked up by the file-provider with foldersFromFilesStructure:true so the dashboard shows up in a "Remotedesktop" folder in Grafana. No CRLF churn; pure 100-line insertion into LF-normalized file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:41:35 -05:00
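The `probe-remotedesktop` job from `e44e9a0062` would typically follow the standard blackbox exporter relabeling pattern. A sketch, assuming a hypothetical exporter address, is:

```yaml
# Sketch of the blackbox probe job; the exporter address is an assumption.
scrape_configs:
  - job_name: probe-remotedesktop
    metrics_path: /probe
    scrape_interval: 30s
    params:
      module: [http_2xx]           # later switched to https_internal in ab6ade4
    static_configs:
      - targets: ['https://desktop.iamworkin.lan/health']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target      # pass the URL to the exporter as ?target=
      - source_labels: [__param_target]
        target_label: instance            # instance becomes the full probe URL
      - target_label: __address__
        replacement: blackbox-exporter.monitoring.svc:9115   # assumed exporter service
```

This relabeling is why `instance` ends up as the full URL rather than a bare hostname, which is the mismatch the RemoteDesktopTlsExpiry fix in `ab6ade4` addresses.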
Andrew Stoltz	93f77c1844	fix(monitoring): use bluejay_v2 auth for snmp-nas (not public_v2) Synology NAS is configured with community bluejay_monitor (→ snmp.yml auth 'bluejay_v2'), not public. public_v2 was returning HTTP 500 from snmp-exporter for this target. Verified bluejay_v2 returns metrics. Keeps printer (10.0.58.107) on public_v2 — Epson ET-3750 uses community "public" as documented in its SNMP settings.	2026-04-22 21:32:14 -05:00
Andrew Stoltz	3bb3801fbd	fix(monitoring): use short service name for irc-notify IRC_HOST CoreDNS iamworkin.lan template + ndots:5 was hijacking unrealircd.irc.svc.cluster.local lookups → Traefik VIP → timeout. Every alert since ~2026-04-09 silently failed with "IRC send failed: timed out", which also killed the thermal-printer path (routed through irc-notify). Same fix pattern as guacamole@28b7600.	2026-04-22 09:55:23 -05:00
Claude Code	67e41febf5	Add agent-zero egress to monitoring NetworkPolicy for blackbox probes	2026-04-08 17:34:16 +00:00

10 Commits