fix agent-zero ollama-proxy crashloop + add Longhorn monitoring

agent-zero ollama-proxy had 172 historic restarts (now stable). Root cause: the liveness and readiness probes both hit /api/tags, which proxies through to BLUEJAY-WS Ollama (10.0.56.20:11434). When the workstation Ollama is slow or offline, nginx fails over to the edge1 backup, but the failover takes >1s and the kube-probe default timeoutSeconds=1 gives up first. Three consecutive failed liveness probes → kubelet kills the container.

Fix:
- Add nginx local healthz endpoint (200, no upstream).
- Liveness probe → /healthz (proves nginx itself is alive).
- Readiness probe stays on /api/tags but with timeoutSeconds=5 so failover to backup completes before the probe times out.

This decouples liveness from upstream availability: kubelet only restarts the proxy when nginx is genuinely dead, not when Ollama is slow.

Longhorn coverage gap: K8s emits "snapshot becomes not ready to use" events constantly during the hourly snapshot lifecycle (1047 snapshots, all readyToUse=true on inspect). Those events were the only signal we had, and they are purely transient lifecycle noise, not actionable.

Add:
- longhorn scrape job (longhorn-backend.longhorn-system.svc:9500)
- NetworkPolicy egress rule for longhorn-system port 9500
- 4 new alerts in 'longhorn-storage' group:
  - LonghornVolumeDegraded (>15m): replica unhealthy, auto-rebuild
  - LonghornVolumeFaulted (>5m, critical, thermal print): data loss
  - LonghornBackupStale (no completed backup in >36h): recurring job silently failing
  - LonghornNodeUnhealthy (>5m): node ready=false

The 7 zabbix-web restarts and the 12:55 Print.Web stop were investigated: both are stable now, with no actionable cause found in journal/events. The KubeContainerRestartingFrequently alert added in the previous commit will catch a recurrence of either.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
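
For orientation, the Longhorn scrape job and two of the alerts described above might look roughly like the sketch below. This is an illustration under assumptions, not the content of the second changed file: the file layout, the `severity: warning` label, and the annotation text are hypothetical, and the expressions assume Longhorn's `longhorn_volume_robustness` gauge (1 = healthy, 2 = degraded, 3 = faulted) with a `volume` label.

```yaml
# Sketch only; the actual definitions live in the second file changed by this commit.

# Prometheus scrape_configs fragment (target taken from the commit message)
scrape_configs:
  - job_name: longhorn
    metrics_path: /metrics
    static_configs:
      - targets:
          - longhorn-backend.longhorn-system.svc:9500
---
# Rule file fragment for the 'longhorn-storage' group
groups:
  - name: longhorn-storage
    rules:
      - alert: LonghornVolumeDegraded
        # Assumes robustness value 2 means "degraded"
        expr: longhorn_volume_robustness == 2
        for: 15m
        labels:
          severity: warning   # assumed; the commit message gives no severity here
        annotations:
          summary: "Longhorn volume {{ $labels.volume }} is degraded"
      - alert: LonghornVolumeFaulted
        # Assumes robustness value 3 means "faulted"
        expr: longhorn_volume_robustness == 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Longhorn volume {{ $labels.volume }} is faulted"
```

The `for:` durations mirror the thresholds in the commit message (>15m degraded, >5m faulted), so a volume that degrades briefly during a replica rebuild would not fire an alert.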
2026-04-26 13:31:14 -05:00
parent b3028f5119
commit e2c71c2b8a
2 changed files with 93 additions and 1 deletions
--- a/apps/agent-zero/agent-zero.yaml
+++ b/apps/agent-zero/agent-zero.yaml
@@ -208,6 +208,15 @@ spec:
                }
                server {
                  listen 11434;
+                  # Local healthcheck — proves nginx itself is alive.
+                  # Must NOT depend on upstream so liveness doesn't restart
+                  # the container when BLUEJAY-WS Ollama is slow/offline
+                  # and nginx is mid-failover to the edge1 backup.
+                  location = /healthz {
+                    access_log off;
+                    return 200 'ok\n';
+                    default_type text/plain;
+                  }
                  location / {
                    proxy_http_version 1.1;
                    proxy_set_header Connection "";
@@ -224,18 +233,30 @@ spec:
              exec nginx -g 'daemon off;'
          ports:
            - containerPort: 11434
+          # Readiness probe DOES check upstream so K8s only routes traffic
+          # when at least one Ollama backend is reachable. timeoutSeconds=5
+          # allows nginx to fail over from BLUEJAY-WS primary to edge1
+          # backup before the probe fails (was timeoutSeconds=1 default →
+          # 172 historic restarts when workstation Ollama was down).
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 5
            periodSeconds: 15
+            timeoutSeconds: 5
+            failureThreshold: 3
+          # Liveness probe hits ONLY local healthz — restarts the container
+          # only when nginx itself is dead. Decoupling liveness from upstream
+          # eliminates restart-loops caused by transient upstream outages.
          livenessProbe:
            httpGet:
-              path: /api/tags
+              path: /healthz
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 30
+            timeoutSeconds: 3
+            failureThreshold: 3
        - name: agent-zero
          image: agent0ai/agent-zero:latest
          command: ["/bin/bash", "-c"]