fix agent-zero ollama-proxy crashloop + add Longhorn monitoring

agent-zero ollama-proxy had 172 historic restarts (now stable).
Root cause: liveness/readiness probes hit /api/tags which proxies
through to BLUEJAY-WS Ollama (10.0.56.20:11434). When the workstation
Ollama is slow or offline, nginx fails over to the edge1 backup —
but the failover takes >1s and the kube-probe default timeoutSeconds=1
gives up first. Three failed probes → kubelet kills the container.

Fix:
- Add nginx local healthz endpoint (200, no upstream).
- Liveness probe → /healthz (proves nginx itself is alive).
- Readiness probe stays on /api/tags but with timeoutSeconds=5 so
  failover to backup completes before the probe times out.

This decouples liveness from upstream availability — kubelet only
restarts the proxy when nginx is genuinely dead, not when Ollama is
slow.

Longhorn coverage gap: K8s emits "snapshot becomes not ready to use"
events constantly during the hourly snapshot lifecycle (1047
snapshots, all readyToUse=true on inspect). Those events were the
only signal we had — purely transient lifecycle noise, not actionable.

Add:
- longhorn scrape job (longhorn-backend.longhorn-system.svc:9500)
- NetworkPolicy egress rule for longhorn-system port 9500
- 4 new alerts in 'longhorn-storage' group:
  - LonghornVolumeDegraded (>15m) — replica unhealthy, auto-rebuild
  - LonghornVolumeFaulted (>5m, critical, thermal print) — data loss
  - LonghornBackupStale (no completed backup in >36h) — recurring job
    silently failing
  - LonghornNodeUnhealthy (>5m) — node ready=false

zabbix-web 7 restarts and Print.Web 12:55 stop investigated — both
are stable now, no actionable cause found in journal/events. Adding
KubeContainerRestartingFrequently in the previous commit will catch
recurrence of either.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Andrew Stoltz
2026-04-26 13:31:14 -05:00
parent b3028f5119
commit e2c71c2b8a
2 changed files with 93 additions and 1 deletions

View File

@@ -208,6 +208,15 @@ spec:
}
server {
listen 11434;
# Local healthcheck — proves nginx itself is alive.
# Must NOT depend on upstream so liveness doesn't restart
# the container when BLUEJAY-WS Ollama is slow/offline
# and nginx is mid-failover to the edge1 backup.
location = /healthz {
access_log off;
return 200 'ok\n';
default_type text/plain;
}
location / {
proxy_http_version 1.1;
proxy_set_header Connection "";
@@ -224,18 +233,30 @@ spec:
exec nginx -g 'daemon off;'
ports:
- containerPort: 11434
# Readiness probe DOES check upstream so K8s only routes traffic
# when at least one Ollama backend is reachable. timeoutSeconds=5
# allows nginx to fail over from BLUEJAY-WS primary to edge1
# backup before the probe fails (was timeoutSeconds=1 default →
# 172 historic restarts when workstation Ollama was down).
readinessProbe:
httpGet:
path: /api/tags
port: 11434
initialDelaySeconds: 5
periodSeconds: 15
timeoutSeconds: 5
failureThreshold: 3
# Liveness probe hits ONLY local healthz — restarts the container
# only when nginx itself is dead. Decoupling liveness from upstream
# eliminates restart-loops caused by transient upstream outages.
livenessProbe:
httpGet:
path: /api/tags
path: /healthz
port: 11434
initialDelaySeconds: 10
periodSeconds: 30
timeoutSeconds: 3
failureThreshold: 3
- name: agent-zero
image: agent0ai/agent-zero:latest
command: ["/bin/bash", "-c"]