Followup to ab6ade4. Three issues uncovered after the rollout:
1. NodePort hairpin breaks scrape from same-node pod. Prometheus on
rke2-agent1 could reach traefik-metrics on .11/.13 NodePort 30900
but timed out on its OWN node's NodePort. Same problem would hit
kube-state-metrics + cert-manager whenever prometheus reschedules.
Fix: scrape via ClusterIP svc DNS instead of NodePort. NodePorts
stay in place for external/Podman scrapers.
2. probe-traefik-services failed for grafana, prometheus, guac with
non-200/3xx codes. grafana + prometheus are behind Traefik basic-
auth (every endpoint returns 401), so drop from probe surface —
health is covered by the in-cluster monitoring-* scrape jobs.
guac.iamworkin.lan was deprecated when Guacamole moved under
desktop.iamworkin.lan/guacamole/ — drop it.
3. acme path was wrong (root 404). Use /health.
Coverage adds (probe-traefik-services):
chat, dist, dms, menuboard, messageboard, presentations, retail,
ttsreader. All of these have IngressRoutes serving root at 200/3xx.
NetworkPolicy egress rules added so the new ClusterIP svc scrapes work:
- traefik-system: port 9100 (metrics) — separate from data-path 8080/8443
- kube-system: port 8080 (kube-state-metrics)
- cert-manager: port 9402 (controller metrics)
Out-of-band fix during this audit:
- Print.Web on edge2 was inactive (clean exit at 12:55 CDT, root cause
unclear — systemd Stopping signal). Restarted. Service back on 5200.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Live audit on 2026-04-26 found 14 firing alerts caused by stale probe
targets, blackbox TLS verify failures, and stale state-as-label series.
Plus three K8s scrape sources (kube-state-metrics, cert-manager,
traefik) that exposed NodePorts but were not in any scrape config.
Fixes
- probe-remotedesktop: switch http_2xx -> https_internal. Blackbox does
not trust step-ca root, so /health was failing with x509 unknown
authority while the app served 200s.
- probe-agentzero-nuc: short svc form (agent-zero.agent-zero.svc:80)
instead of *.cluster.local. The FQDN form was being rewritten to the
Traefik VIP by the CoreDNS iamworkin.lan template + ndots:5 search
expansion, then 5s timeout.
- probe-agentzero-local + probe-ollama-local: removed. 10.0.58.100 is on
HOME VLAN and not reachable from cluster pods. Workstation/AI-laptop
Ollama monitoring belongs to host-side Puppet, not cluster blackbox.
- snmp-cloudkey: commented out. The Cloud Key Gen2+ runs unifi-core
(controller), not an SNMP agent. Was generating "connection refused"
every 30s.
- RemoteDesktopPoolDepleted / RemoteDesktopPoolDeficitSustained:
filter on alert_level=Critical / Warning|Critical + enabled=true.
The publisher emits one series per template per status without
resetting old series to 0, so the historical Warming/BelowDesiredSize
series stayed at 1 and the alert kept firing on stale labels.
- RemoteDesktopTlsExpiry: match by job, not hostname-only instance.
The probe sets instance=https://desktop.iamworkin.lan/health so a
hostname-only label match never fired.
- EpsonPrinterDown for: 5m -> 30m. EcoTank sleeps after ~5 min idle and
SNMP times out, so 5m guaranteed nightly noise.
Coverage adds
- kube-state-metrics scrape (NodePort 30901). Required for the new
pod-state alerts and a long list of standard K8s SLO queries.
- cert-manager scrape (NodePort 30902). Required for the
CertManagerCertificateNotReady / RenewalFailed alert pair documented
in project_cert_manager_prometheus_scrape.
- traefik scrape (NodePort 30900) on all three nodes.
- probe-traefik-services: HTTPS probe (https_internal) over the 17 main
iamworkin.lan hosts so any Traefik-fronted service returning non-200
shows up as a single named probe failure.
- blackbox-config: add the https_internal module that the new probes
reference (was only in the FlowerCore.Notes scripts/monitoring copy,
not in the live ConfigMap).
New alerts (kubernetes-state group)
- KubeContainerRestartingFrequently (>5 restarts/h)
- KubeContainerCrashLooping (>3 restarts/15m, thermal print)
- KubePodNotReady (Pending/Failed/Unknown >15m)
- KubePodImagePullBackOff (>10m)
- KubeDeploymentReplicasMismatch (>15m)
Without these, the agent-zero ollama-proxy 172x restart loop was
invisible for ~3 days. Same gap would have hidden the fc-php
php84-app-probe ImagePullBackOff orphan (cleaned up out of band).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to the Prometheus alert rules landed in e44e9a0. The
Prometheus rules were loading but never delivered — the monitoring
stack has no Alertmanager configured; **Grafana** owns alert
routing via its built-in engine + webhook contact point to
irc-notify.monitoring.svc:9119. Without a matching Grafana alert,
the Prometheus rules just show up in the Prometheus UI and page
no one.
Adds 6 Grafana alert rules in a new `RemoteDesktop` group under
the AI Stack Alerts folder:
- remotedesktop-web-down (3m) — probe_success{job="probe-remotedesktop"} < 1
- remotedesktop-metrics-stale (10m) — fc_desktop_session_events_total series absent
- remotedesktop-pool-depleted (5m) — fc_desktop_pool_depleted > 0
- remotedesktop-pool-deficit-sustained (10m info) — fc_desktop_pool_deficit > 0
- remotedesktop-session-churn-spike (5m info) — launch rate > 20/min
- remotedesktop-tls-expiry (6h critical) — cert < 2 days to expiry
Each uses the standard Grafana 3-stage pipeline (query → reduce →
threshold) matching the existing AI Stack + Infrastructure alert
patterns. Labels: service=remotedesktop + severity (warning/info/critical).
Default route is `IRC #alerts` via the existing webhook contact point.
Parity with the Prometheus rules (which already fire internally
for the Prometheus UI + any future Alertmanager integration).
Grafana restart picks up the new provisioning on next reload.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two egress allows to monitoring-netpol so Prometheus can scrape
FlowerCore.RemoteDesktop:
1. fc-desktop namespace on port 8080 — direct ClusterIP service
target (remotedesktop-web.fc-desktop:8080).
2. traefik-system namespace pods on ports 8080 + 8443 — covers the
Traefik VIP hairpin path for the `https://desktop.iamworkin.lan`
scrape target (CoreDNS wildcard resolves iamworkin.lan hostnames
to the LB VIP; after kube-proxy DNAT, egress needs the backend
pod port allowed per feedback_netpol_dnat_backend_port).
Without these, the fc-remotedesktop scrape times out with "context
deadline exceeded" even though the monitoring-netpol already allows
the 10.0.56.0/24 CIDR — post-DNAT the destination is a 10.42.x.x
pod IP, not the VIP.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three additions to the monitoring ConfigMap, each targeting
FlowerCore.RemoteDesktop:
- **Scrape jobs** (2 new):
- probe-remotedesktop: blackbox http_2xx against
https://desktop.iamworkin.lan/health every 30s. Feeds the
RemoteDesktopWebDown alert.
- fc-remotedesktop: direct /metrics scrape against
desktop.iamworkin.lan for the fc_desktop_session_events_total
and fc_desktop_pool_* series.
- **Alert group `remote-desktop`** (7 rules in alerts.yml):
- RemoteDesktopWebDown (3m) — /health probe failing
- RemoteDesktopMetricsStale (10m) — absent metrics series
- RemoteDesktopPoolDepleted (5m) — pool deficit + depleted flag
- RemoteDesktopPoolDeficitSustained (10m, info) — persistent
below-desired pool size
- RemoteDesktopSessionChurnSpike (5m, info) — launch rate
>20/min
- RemoteDesktopRecordingEventsDropped (15m, info) — 30m without
recording events while launches active
- RemoteDesktopTlsExpiry (6h, critical) — <2d cert renewal
window; aligns with feedback_acme_expiry_alert_threshold
- **Grafana dashboard mount**: new volumeMounts + volumes entry for
`dashboards-remotedesktop` backed by the grafana-dashboard-remotedesktop
ConfigMap (previously added as a standalone file in d4210c8).
Folder path /var/lib/grafana/dashboards/remotedesktop — picked up
by the file-provider with foldersFromFilesStructure:true so the
dashboard shows up in a "Remotedesktop" folder in Grafana.
No CRLF churn; pure 100-line insertion into LF-normalized file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Synology NAS is configured with community bluejay_monitor
(→ snmp.yml auth 'bluejay_v2'), not public. public_v2 was returning
HTTP 500 from snmp-exporter for this target. Verified bluejay_v2
returns metrics.
Keeps printer (10.0.58.107) on public_v2 — Epson ET-3750 uses
community "public" as documented in its SNMP settings.
CoreDNS iamworkin.lan template + ndots:5 was hijacking
unrealircd.irc.svc.cluster.local lookups → Traefik VIP → timeout.
Every alert since ~2026-04-09 silently failed with "IRC send failed: timed out",
which also killed the thermal-printer path (routed through irc-notify).
Same fix pattern as guacamole@28b7600.