7 Commits

Author SHA1 Message Date
Andrew Stoltz
c23e903ba7 feat(monitoring): Grafana alert rules route RemoteDesktop to IRC
Companion to the Prometheus alert rules landed in e44e9a0. Those
rules loaded, but their alerts were never delivered: the monitoring
stack has no Alertmanager configured, and **Grafana** owns alert
routing via its built-in engine and a webhook contact point to
irc-notify.monitoring.svc:9119. Without a matching Grafana alert,
the Prometheus rules only show up in the Prometheus UI and page
no one.

Adds 6 Grafana alert rules in a new `RemoteDesktop` group under
the AI Stack Alerts folder:

- remotedesktop-web-down (3m) — probe_success{job="probe-remotedesktop"} < 1
- remotedesktop-metrics-stale (10m) — fc_desktop_session_events_total series absent
- remotedesktop-pool-depleted (5m) — fc_desktop_pool_depleted > 0
- remotedesktop-pool-deficit-sustained (10m info) — fc_desktop_pool_deficit > 0
- remotedesktop-session-churn-spike (5m info) — launch rate > 20/min
- remotedesktop-tls-expiry (6h critical) — cert < 2 days to expiry

Each uses the standard Grafana 3-stage pipeline (query → reduce →
threshold) matching the existing AI Stack + Infrastructure alert
patterns. Labels: service=remotedesktop + severity (warning/info/critical).
Default route is `IRC #alerts` via the existing webhook contact point.
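
As a rough illustration of the query → reduce → threshold pipeline, one
of these rules could look like the sketch below in Grafana's alert
provisioning format. The datasource UID, evaluation interval, query time
range, and `severity: warning` are assumptions; the group, folder, rule
name, expression, and 3m hold time come from the list above.

```yaml
apiVersion: 1
groups:
  - orgId: 1
    name: RemoteDesktop
    folder: AI Stack Alerts
    interval: 1m                      # assumption: evaluation interval
    rules:
      - uid: remotedesktop-web-down
        title: remotedesktop-web-down
        condition: C                  # the threshold stage decides firing
        for: 3m
        data:
          # Stage A: query Prometheus for the blackbox probe result
          - refId: A
            datasourceUid: prometheus # assumption: UID of the Prometheus datasource
            relativeTimeRange:
              from: 300
              to: 0
            model:
              refId: A
              expr: probe_success{job="probe-remotedesktop"}
              instant: true
          # Stage B: reduce the series to a single value
          - refId: B
            datasourceUid: __expr__
            model:
              refId: B
              type: reduce
              expression: A
              reducer: last
          # Stage C: fire when the reduced value is below 1
          - refId: C
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: lt
                    params: [1]
        labels:
          service: remotedesktop
          severity: warning           # assumption: not stated for this rule
```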

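These mirror six of the seven Prometheus rules from e44e9a0 (which
already fire internally for the Prometheus UI and any future
Alertmanager integration); RemoteDesktopRecordingEventsDropped has no
Grafana counterpart in this commit. Grafana picks up the new
provisioning on its next restart or provisioning reload.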

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:57:26 -05:00
Andrew Stoltz
3c5c1a07bd fix(monitoring): netpol egress allows for fc-desktop + Traefik hairpin
Adds two egress allows to monitoring-netpol so Prometheus can scrape
FlowerCore.RemoteDesktop:

1. fc-desktop namespace on port 8080 — direct ClusterIP service
   target (remotedesktop-web.fc-desktop:8080).
2. traefik-system namespace pods on ports 8080 + 8443 — covers the
   Traefik VIP hairpin path for the `https://desktop.iamworkin.lan`
   scrape target (CoreDNS wildcard resolves iamworkin.lan hostnames
   to the LB VIP; after kube-proxy DNAT, egress needs the backend
   pod port allowed per feedback_netpol_dnat_backend_port).

Without these, the fc-remotedesktop scrape times out with "context
deadline exceeded" even though the monitoring-netpol already allows
the 10.0.56.0/24 CIDR — post-DNAT the destination is a 10.42.x.x
pod IP, not the VIP.
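
A minimal sketch of the two egress entries, assuming namespaces are
selected by the standard `kubernetes.io/metadata.name` label and that
all pods in each namespace are allowed (the selectors actually used in
monitoring-netpol may differ):

```yaml
# Fragment of spec.egress in monitoring-netpol (illustrative only)
egress:
  # 1. Direct ClusterIP target: remotedesktop-web.fc-desktop:8080
  - to:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: fc-desktop
    ports:
      - protocol: TCP
        port: 8080
  # 2. Traefik backend pods reached after kube-proxy DNAT of the LB VIP
  - to:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: traefik-system
        podSelector: {}               # assumption: any pod in the namespace
    ports:
      - protocol: TCP
        port: 8080
      - protocol: TCP
        port: 8443
```

The second entry matters because NetworkPolicy is evaluated against the
post-DNAT destination, so allowing only the VIP CIDR is not enough.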

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:47:50 -05:00
Andrew Stoltz
e44e9a0062 feat(monitoring): RemoteDesktop alerts + scrape jobs + dashboard mount
Three additions to the monitoring ConfigMap, each targeting
FlowerCore.RemoteDesktop:

- **Scrape jobs** (2 new):
  - probe-remotedesktop: blackbox http_2xx against
    https://desktop.iamworkin.lan/health every 30s. Feeds the
    RemoteDesktopWebDown alert.
  - fc-remotedesktop: direct /metrics scrape against
    desktop.iamworkin.lan for the fc_desktop_session_events_total
    and fc_desktop_pool_* series.

- **Alert group `remote-desktop`** (7 rules in alerts.yml):
  - RemoteDesktopWebDown (3m) — /health probe failing
  - RemoteDesktopMetricsStale (10m) — absent metrics series
  - RemoteDesktopPoolDepleted (5m) — pool deficit + depleted flag
  - RemoteDesktopPoolDeficitSustained (10m, info) — persistent
    below-desired pool size
  - RemoteDesktopSessionChurnSpike (5m, info) — launch rate
    >20/min
  - RemoteDesktopRecordingEventsDropped (15m, info) — 30m without
    recording events while launches active
  - RemoteDesktopTlsExpiry (6h, critical) — <2d cert renewal
    window; aligns with feedback_acme_expiry_alert_threshold

- **Grafana dashboard mount**: new volumeMounts + volumes entry for
  `dashboards-remotedesktop` backed by the grafana-dashboard-remotedesktop
  ConfigMap (previously added as a standalone file in d4210c8).
  Folder path /var/lib/grafana/dashboards/remotedesktop — picked up
  by the file-provider with foldersFromFilesStructure:true so the
  dashboard shows up in a "Remotedesktop" folder in Grafana.
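
For orientation, the two scrape jobs could be sketched roughly as
follows. The blackbox exporter address and the `https` scheme on the
direct scrape are assumptions; job names, the probe URL, the module, and
the 30s interval come from the description above.

```yaml
scrape_configs:
  # Blackbox probe: Prometheus asks the exporter to fetch /health
  - job_name: probe-remotedesktop
    metrics_path: /probe
    scrape_interval: 30s
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://desktop.iamworkin.lan/health
    relabel_configs:
      # Pass the target URL to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the scrape itself to the blackbox exporter
      - target_label: __address__
        replacement: blackbox-exporter.monitoring.svc:9115  # assumption: exporter address
  # Direct /metrics scrape for the fc_desktop_* series
  - job_name: fc-remotedesktop
    scheme: https                     # assumption
    metrics_path: /metrics
    static_configs:
      - targets: [desktop.iamworkin.lan]
```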

No CRLF churn; pure 100-line insertion into LF-normalized file.
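
As an example of the rule shape in the `remote-desktop` group, the TLS
expiry rule might look like the sketch below. It assumes the expiry time
comes from the blackbox exporter's `probe_ssl_earliest_cert_expiry`
metric on the probe-remotedesktop job; the actual expression in
alerts.yml and the annotation text are not shown in this commit message.

```yaml
groups:
  - name: remote-desktop
    rules:
      - alert: RemoteDesktopTlsExpiry
        # Fires when the certificate expires in less than 2 days
        expr: probe_ssl_earliest_cert_expiry{job="probe-remotedesktop"} - time() < 2 * 86400
        for: 6h
        labels:
          severity: critical
        annotations:
          # Hypothetical wording
          summary: RemoteDesktop TLS certificate expires in under 2 days
```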

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:41:35 -05:00
Andrew Stoltz
d4210c819f feat(monitoring): RemoteDesktop Grafana dashboard ConfigMap
Wraps apps/monitoring/flowercore-remotedesktop-grafana-dashboard.json
as a ConfigMap manifest so ArgoCD syncs it into the cluster alongside
the existing grafana-dashboard-* ConfigMaps. Standalone file — does
NOT modify noc-monitoring.yaml. That keeps the CRLF churn on
noc-monitoring.yaml (sibling files apps/intranet/intranet.yaml and
apps/agent-zero/configmaps-bluejay.yaml also carry CRLF churn) out
of this commit.
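
The wrapper has roughly this shape. This is a sketch: the `monitoring`
namespace and the data key name are assumptions, and the dashboard JSON
body is omitted rather than reproduced.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-remotedesktop
  namespace: monitoring               # assumption
data:
  # Key name is an assumption; the value is the dashboard JSON file contents
  flowercore-remotedesktop-grafana-dashboard.json: |
    {}
```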

Dashboard will be synced into the cluster but NOT loaded by Grafana
until a matching `volumes:` entry lands in the Grafana Deployment
in noc-monitoring.yaml:

    - name: dashboard-remotedesktop
      configMap:
        name: grafana-dashboard-remotedesktop

Plus a `volumeMounts:` entry in the grafana container:

    - name: dashboard-remotedesktop
      mountPath: /etc/grafana/provisioning/dashboards/remotedesktop
      readOnly: true

Those edits are deferred to the CRLF-normalization pass on
bluejay-infra so the review diff stays reviewable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:20:40 -05:00
Andrew Stoltz
93f77c1844 fix(monitoring): use bluejay_v2 auth for snmp-nas (not public_v2)
Synology NAS is configured with community bluejay_monitor
(→ snmp.yml auth 'bluejay_v2'), not public. public_v2 was returning
HTTP 500 from snmp-exporter for this target. Verified bluejay_v2
returns metrics.

Keeps printer (10.0.58.107) on public_v2 — Epson ET-3750 uses
community "public" as documented in its SNMP settings.
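
For reference, the two auth entries in snmp.yml would look something
like this in snmp_exporter's `auths` format. The community strings and
auth names are from this message; the version field is inferred from the
`_v2` suffix.

```yaml
auths:
  # Used by the Synology NAS target (snmp-nas)
  bluejay_v2:
    community: bluejay_monitor
    version: 2
  # Still used by the Epson ET-3750 printer (10.0.58.107)
  public_v2:
    community: public
    version: 2
```

The scrape job selects one of these per target through the `auth` URL
parameter, which is why a wrong name surfaces as an exporter-side error
rather than a scrape-config error.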
2026-04-22 21:32:14 -05:00
Andrew Stoltz
3bb3801fbd fix(monitoring): use short service name for irc-notify IRC_HOST
CoreDNS iamworkin.lan template + ndots:5 was hijacking
unrealircd.irc.svc.cluster.local lookups → Traefik VIP → timeout.
Every alert since ~2026-04-09 silently failed with "IRC send failed: timed out",
which also killed the thermal-printer path (routed through irc-notify).

Same fix pattern as guacamole@28b7600.
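
A sketch of the change on the irc-notify container, assuming the short
name takes the `service.namespace` form (the exact value used is not
shown here):

```yaml
env:
  - name: IRC_HOST
    # Before: unrealircd.irc.svc.cluster.local
    # Four dots is below ndots:5, so the resolver tried search-list
    # suffixes first, and one of those expansions apparently matched the
    # iamworkin.lan wildcard template.
    value: unrealircd.irc             # assumption: short service.namespace form
```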
2026-04-22 09:55:23 -05:00
Claude Code
67e41febf5 Add agent-zero egress to monitoring NetworkPolicy for blackbox probes
2026-04-08 17:34:16 +00:00