Live audit on 2026-04-26 found 14 firing alerts caused by stale probe
targets, blackbox TLS verify failures, and stale state-as-label series.
Plus three K8s scrape sources (kube-state-metrics, cert-manager,
traefik) that exposed NodePorts but were not in any scrape config.
Fixes
- probe-remotedesktop: switch http_2xx -> https_internal. Blackbox does
not trust step-ca root, so /health was failing with x509 unknown
authority while the app served 200s.
- probe-agentzero-nuc: short svc form (agent-zero.agent-zero.svc:80)
instead of *.cluster.local. The FQDN form was being rewritten to the
Traefik VIP by the CoreDNS iamworkin.lan template + ndots:5 search
expansion, then 5s timeout.
- probe-agentzero-local + probe-ollama-local: removed. 10.0.58.100 is on
HOME VLAN and not reachable from cluster pods. Workstation/AI-laptop
Ollama monitoring belongs to host-side Puppet, not cluster blackbox.
- snmp-cloudkey: commented out. The Cloud Key Gen2+ runs unifi-core
(controller), not an SNMP agent. Was generating "connection refused"
every 30s.
- RemoteDesktopPoolDepleted / RemoteDesktopPoolDeficitSustained:
filter on alert_level=Critical / Warning|Critical + enabled=true.
The publisher emits one series per template per status without
resetting old series to 0, so the historical Warming/BelowDesiredSize
series stayed at 1 and the alert kept firing on stale labels.
- RemoteDesktopTlsExpiry: match by job, not hostname-only instance.
The probe sets instance=https://desktop.iamworkin.lan/health so a
hostname-only label match never fired.
- EpsonPrinterDown for: 5m -> 30m. EcoTank sleeps after ~5 min idle and
SNMP times out, so 5m guaranteed nightly noise.
Coverage adds
- kube-state-metrics scrape (NodePort 30901). Required for the new
pod-state alerts and a long list of standard K8s SLO queries.
- cert-manager scrape (NodePort 30902). Required for the
CertManagerCertificateNotReady / RenewalFailed alert pair documented
in project_cert_manager_prometheus_scrape.
- traefik scrape (NodePort 30900) on all three nodes.
- probe-traefik-services: HTTPS probe (https_internal) over the 17 main
iamworkin.lan hosts so any Traefik-fronted service returning non-200
shows up as a single named probe failure.
- blackbox-config: add the https_internal module that the new probes
reference (was only in the FlowerCore.Notes scripts/monitoring copy,
not in the live ConfigMap).
New alerts (kubernetes-state group)
- KubeContainerRestartingFrequently (>5 restarts/h)
- KubeContainerCrashLooping (>3 restarts/15m, thermal print)
- KubePodNotReady (Pending/Failed/Unknown >15m)
- KubePodImagePullBackOff (>10m)
- KubeDeploymentReplicasMismatch (>15m)
Without these, the agent-zero ollama-proxy 172x restart loop was
invisible for ~3 days. Same gap would have hidden the fc-php
php84-app-probe ImagePullBackOff orphan (cleaned up out of band).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>