bluejay-infra/apps/monitoring
Andrew Stoltz e2c71c2b8a fix agent-zero ollama-proxy crashloop + add Longhorn monitoring
agent-zero ollama-proxy had 172 historic restarts (now stable).
Root cause: liveness/readiness probes hit /api/tags which proxies
through to BLUEJAY-WS Ollama (10.0.56.20:11434). When the workstation
Ollama is slow or offline, nginx fails over to the edge1 backup —
but the failover takes >1s and the kube-probe default timeoutSeconds=1
gives up first. Three failed probes → kubelet kills the container.

Fix:
- Add nginx local healthz endpoint (200, no upstream).
- Liveness probe → /healthz (proves nginx itself is alive).
- Readiness probe stays on /api/tags but with timeoutSeconds=5 so
  failover to backup completes before the probe times out.

This decouples liveness from upstream availability — kubelet only
restarts the proxy when nginx is genuinely dead, not when Ollama is
slow.
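
For reference, a minimal sketch of the probe split described above. This
is not the committed manifest: the port name `http` and the
periodSeconds/failureThreshold values are assumptions (they match the
Kubernetes defaults), and the nginx location in the comment is
illustrative only.

```yaml
# Container spec fragment for the ollama-proxy (hypothetical layout).
# nginx side, assumed: location = /healthz { return 200; }  (no proxy_pass)
livenessProbe:
  httpGet:
    path: /healthz        # answered by nginx itself, never touches Ollama
    port: http            # assumed port name
  periodSeconds: 10       # Kubernetes default
  timeoutSeconds: 1       # Kubernetes default; fine for a local 200
  failureThreshold: 3     # Kubernetes default: three misses restart the container
readinessProbe:
  httpGet:
    path: /api/tags       # still proxied upstream, so it reflects real availability
    port: http            # assumed port name
  timeoutSeconds: 5       # long enough for failover to the edge1 backup
```

With this split, a slow upstream can only fail readiness, which removes
the pod from Service endpoints without restarting it.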

Longhorn coverage gap: K8s emits "snapshot becomes not ready to use"
events constantly during the hourly snapshot lifecycle (1047
snapshots, all readyToUse=true on inspect). Those events are transient
lifecycle noise, not actionable, yet they were the only Longhorn
signal we had.

Add:
- longhorn scrape job (longhorn-backend.longhorn-system.svc:9500)
- NetworkPolicy egress rule for longhorn-system port 9500
- 4 new alerts in 'longhorn-storage' group:
  - LonghornVolumeDegraded (>15m) — replica unhealthy, auto-rebuild
  - LonghornVolumeFaulted (>5m, critical, thermal print) — data loss
  - LonghornBackupStale (no completed backup in >36h) — recurring job
    silently failing
  - LonghornNodeUnhealthy (>5m) — node ready=false
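
A hedged sketch of how two of these rules could look in Prometheus rule
syntax. The expressions are assumptions, not the committed rules: they
rely on Longhorn's longhorn_volume_robustness gauge (2 = degraded,
3 = faulted), and the warning severity and annotation text are
illustrative.

```yaml
groups:
  - name: longhorn-storage
    rules:
      - alert: LonghornVolumeDegraded
        # Assumed expression: robustness 2 means a replica is unhealthy.
        expr: longhorn_volume_robustness == 2
        for: 15m            # leave time for the automatic rebuild
        labels:
          severity: warning # assumed severity
        annotations:
          summary: "Longhorn volume {{ $labels.volume }} degraded for more than 15m"
      - alert: LonghornVolumeFaulted
        # Assumed expression: robustness 3 means the volume is faulted.
        expr: longhorn_volume_robustness == 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Longhorn volume {{ $labels.volume }} faulted, possible data loss"
```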

Investigated zabbix-web (7 restarts) and the Print.Web stop at 12:55:
both are stable now, with no actionable cause found in journal/events.
The KubeContainerRestartingFrequently alert added in the previous
commit will catch a recurrence of either.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:31:14 -05:00