bluejay-infra/apps/monitoring/noc-monitoring.yaml at 05a273d3a6d7ade97068e2ac948789011a33041b

Files

Andrew Stoltz 05a273d3a6 monitoring: switch K8s scrapes to ClusterIP svc + fix probe paths

Followup to ab6ade4. Three issues uncovered after the rollout:

1. NodePort hairpin breaks scrape from same-node pod. Prometheus on
   rke2-agent1 could reach traefik-metrics on .11/.13 NodePort 30900
   but timed out on its OWN node's NodePort. Same problem would hit
   kube-state-metrics + cert-manager whenever prometheus reschedules.
   Fix: scrape via ClusterIP svc DNS instead of NodePort. NodePorts
   stay in place for external/Podman scrapers.
2. probe-traefik-services failed for grafana, prometheus, guac with
   non-200/3xx codes. grafana + prometheus are behind Traefik basic-
   auth (every endpoint returns 401), so drop from probe surface —
   health is covered by the in-cluster monitoring-* scrape jobs.
   guac.iamworkin.lan was deprecated when Guacamole moved under
   desktop.iamworkin.lan/guacamole/ — drop it.
3. acme path was wrong (root 404). Use /health.

Coverage adds (probe-traefik-services):
chat, dist, dms, menuboard, messageboard, presentations, retail,
ttsreader. All of these have IngressRoutes serving root at 200/3xx.

NetworkPolicy egress rules added so the new ClusterIP svc scrapes work:
- traefik-system: port 9100 (metrics) — separate from data-path 8080/8443
- kube-system: port 8080 (kube-state-metrics)
- cert-manager: port 9402 (controller metrics)

Out-of-band fix during this audit:
- Print.Web on edge2 was inactive (clean exit at 12:55 CDT, root cause
  unclear — systemd Stopping signal). Restarted. Service back on 5200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-26 13:05:32 -05:00

147 KiB

Raw Blame History

View Raw

147 KiB Raw Blame History

147 KiB

Raw Blame History