Commit Graph

7 Commits

Author SHA1 Message Date
Andrew Stoltz
ab6ade4e46 monitoring: stabilize firing alerts + add cluster-state coverage
Live audit on 2026-04-26 found 14 firing alerts caused by stale probe
targets, blackbox TLS verify failures, and stale state-as-label series.
Plus three K8s scrape sources (kube-state-metrics, cert-manager,
traefik) that exposed NodePorts but were not in any scrape config.

Fixes
- probe-remotedesktop: switch http_2xx -> https_internal. Blackbox does
  not trust step-ca root, so /health was failing with x509 unknown
  authority while the app served 200s.
- probe-agentzero-nuc: short svc form (agent-zero.agent-zero.svc:80)
  instead of *.cluster.local. The FQDN form was being rewritten to the
  Traefik VIP by the CoreDNS iamworkin.lan template + ndots:5 search
  expansion, then 5s timeout.
- probe-agentzero-local + probe-ollama-local: removed. 10.0.58.100 is on
  HOME VLAN and not reachable from cluster pods. Workstation/AI-laptop
  Ollama monitoring belongs to host-side Puppet, not cluster blackbox.
- snmp-cloudkey: commented out. The Cloud Key Gen2+ runs unifi-core
  (controller), not an SNMP agent. Was generating "connection refused"
  every 30s.
- RemoteDesktopPoolDepleted / RemoteDesktopPoolDeficitSustained:
  filter on alert_level=Critical / Warning|Critical + enabled=true.
  The publisher emits one series per template per status without
  resetting old series to 0, so the historical Warming/BelowDesiredSize
  series stayed at 1 and the alert kept firing on stale labels.
- RemoteDesktopTlsExpiry: match by job, not hostname-only instance.
  The probe sets instance=https://desktop.iamworkin.lan/health so a
  hostname-only label match never fired.
- EpsonPrinterDown for: 5m -> 30m. EcoTank sleeps after ~5 min idle and
  SNMP times out, so 5m guaranteed nightly noise.

Coverage adds
- kube-state-metrics scrape (NodePort 30901). Required for the new
  pod-state alerts and a long list of standard K8s SLO queries.
- cert-manager scrape (NodePort 30902). Required for the
  CertManagerCertificateNotReady / RenewalFailed alert pair documented
  in project_cert_manager_prometheus_scrape.
- traefik scrape (NodePort 30900) on all three nodes.
- probe-traefik-services: HTTPS probe (https_internal) over the 17 main
  iamworkin.lan hosts so any Traefik-fronted service returning non-200
  shows up as a single named probe failure.
- blackbox-config: add the https_internal module that the new probes
  reference (was only in the FlowerCore.Notes scripts/monitoring copy,
  not in the live ConfigMap).

New alerts (kubernetes-state group)
- KubeContainerRestartingFrequently (>5 restarts/h)
- KubeContainerCrashLooping (>3 restarts/15m, thermal print)
- KubePodNotReady (Pending/Failed/Unknown >15m)
- KubePodImagePullBackOff (>10m)
- KubeDeploymentReplicasMismatch (>15m)

Without these, the agent-zero ollama-proxy 172x restart loop was
invisible for ~3 days. Same gap would have hidden the fc-php
php84-app-probe ImagePullBackOff orphan (cleaned up out of band).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:57:18 -05:00
Andrew Stoltz
c23e903ba7 feat(monitoring): Grafana alert rules route RemoteDesktop to IRC
Companion to the Prometheus alert rules landed in e44e9a0. The
Prometheus rules were loading but never delivered — the monitoring
stack has no Alertmanager configured; **Grafana** owns alert
routing via its built-in engine + webhook contact point to
irc-notify.monitoring.svc:9119. Without a matching Grafana alert,
the Prometheus rules just show up in the Prometheus UI and page
no one.

Adds 6 Grafana alert rules in a new `RemoteDesktop` group under
the AI Stack Alerts folder:

- remotedesktop-web-down (3m) — probe_success{job="probe-remotedesktop"} < 1
- remotedesktop-metrics-stale (10m) — fc_desktop_session_events_total series absent
- remotedesktop-pool-depleted (5m) — fc_desktop_pool_depleted > 0
- remotedesktop-pool-deficit-sustained (10m info) — fc_desktop_pool_deficit > 0
- remotedesktop-session-churn-spike (5m info) — launch rate > 20/min
- remotedesktop-tls-expiry (6h critical) — cert < 2 days to expiry

Each uses the standard Grafana 3-stage pipeline (query → reduce →
threshold) matching the existing AI Stack + Infrastructure alert
patterns. Labels: service=remotedesktop + severity (warning/info/critical).
Default route is `IRC #alerts` via the existing webhook contact point.

Parity with the Prometheus rules (which already fire internally
for the Prometheus UI + any future Alertmanager integration).
Grafana restart picks up the new provisioning on next reload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:57:26 -05:00
Andrew Stoltz
3c5c1a07bd fix(monitoring): netpol egress allows for fc-desktop + Traefik hairpin
Adds two egress allows to monitoring-netpol so Prometheus can scrape
FlowerCore.RemoteDesktop:

1. fc-desktop namespace on port 8080 — direct ClusterIP service
   target (remotedesktop-web.fc-desktop:8080).
2. traefik-system namespace pods on ports 8080 + 8443 — covers the
   Traefik VIP hairpin path for the `https://desktop.iamworkin.lan`
   scrape target (CoreDNS wildcard resolves iamworkin.lan hostnames
   to the LB VIP; after kube-proxy DNAT, egress needs the backend
   pod port allowed per feedback_netpol_dnat_backend_port).

Without these, the fc-remotedesktop scrape times out with "context
deadline exceeded" even though the monitoring-netpol already allows
the 10.0.56.0/24 CIDR — post-DNAT the destination is a 10.42.x.x
pod IP, not the VIP.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:47:50 -05:00
Andrew Stoltz
e44e9a0062 feat(monitoring): RemoteDesktop alerts + scrape jobs + dashboard mount
Three additions to the monitoring ConfigMap, each targeting
FlowerCore.RemoteDesktop:

- **Scrape jobs** (2 new):
  - probe-remotedesktop: blackbox http_2xx against
    https://desktop.iamworkin.lan/health every 30s. Feeds the
    RemoteDesktopWebDown alert.
  - fc-remotedesktop: direct /metrics scrape against
    desktop.iamworkin.lan for the fc_desktop_session_events_total
    and fc_desktop_pool_* series.

- **Alert group `remote-desktop`** (7 rules in alerts.yml):
  - RemoteDesktopWebDown (3m) — /health probe failing
  - RemoteDesktopMetricsStale (10m) — absent metrics series
  - RemoteDesktopPoolDepleted (5m) — pool deficit + depleted flag
  - RemoteDesktopPoolDeficitSustained (10m, info) — persistent
    below-desired pool size
  - RemoteDesktopSessionChurnSpike (5m, info) — launch rate
    >20/min
  - RemoteDesktopRecordingEventsDropped (15m, info) — 30m without
    recording events while launches active
  - RemoteDesktopTlsExpiry (6h, critical) — <2d cert renewal
    window; aligns with feedback_acme_expiry_alert_threshold

- **Grafana dashboard mount**: new volumeMounts + volumes entry for
  `dashboards-remotedesktop` backed by the grafana-dashboard-remotedesktop
  ConfigMap (previously added as a standalone file in d4210c8).
  Folder path /var/lib/grafana/dashboards/remotedesktop — picked up
  by the file-provider with foldersFromFilesStructure:true so the
  dashboard shows up in a "Remotedesktop" folder in Grafana.

No CRLF churn; pure 100-line insertion into LF-normalized file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:41:35 -05:00
Andrew Stoltz
93f77c1844 fix(monitoring): use bluejay_v2 auth for snmp-nas (not public_v2)
Synology NAS is configured with community bluejay_monitor
(→ snmp.yml auth 'bluejay_v2'), not public. public_v2 was returning
HTTP 500 from snmp-exporter for this target. Verified bluejay_v2
returns metrics.

Keeps printer (10.0.58.107) on public_v2 — Epson ET-3750 uses
community "public" as documented in its SNMP settings.
2026-04-22 21:32:14 -05:00
Andrew Stoltz
3bb3801fbd fix(monitoring): use short service name for irc-notify IRC_HOST
CoreDNS iamworkin.lan template + ndots:5 was hijacking
unrealircd.irc.svc.cluster.local lookups → Traefik VIP → timeout.
Every alert since ~2026-04-09 silently failed with "IRC send failed: timed out",
which also killed the thermal-printer path (routed through irc-notify).

Same fix pattern as guacamole@28b7600.
2026-04-22 09:55:23 -05:00
Claude Code
67e41febf5 Add agent-zero egress to monitoring NetworkPolicy for blackbox probes 2026-04-08 17:34:16 +00:00