fix agent-zero ollama-proxy crashloop + add Longhorn monitoring

agent-zero ollama-proxy had 172 historic restarts (now stable).
Root cause: the liveness/readiness probes hit /api/tags, which proxies
through to BLUEJAY-WS Ollama (10.0.56.20:11434). When the workstation
Ollama is slow or offline, nginx fails over to the edge1 backup, but
the failover takes >1s and the kube-probe default timeoutSeconds=1
gives up first. Three consecutive failed liveness probes → kubelet
kills the container.
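
For illustration, the failure mode above can be sketched as a probe
definition. This is a hypothetical reconstruction, not the pre-fix
manifest: it only assumes that the probe targeted /api/tags and left
timeoutSeconds unset, as the message states. The defaults named in the
comments are the standard Kubernetes probe defaults.

```yaml
# Hypothetical pre-fix liveness probe (illustration only).
livenessProbe:
  httpGet:
    path: /api/tags        # proxied through nginx to the Ollama upstream
    port: 11434
  # timeoutSeconds not set   -> Kubernetes default: 1 second
  # failureThreshold not set -> Kubernetes default: 3 consecutive failures
  # Any upstream failover slower than 1s counts as a failed probe, and
  # three such failures in a row make kubelet restart the container.
```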

Fix:
- Add nginx local /healthz endpoint (200, no upstream).
- Liveness probe → /healthz (proves nginx itself is alive).
- Readiness probe stays on /api/tags but with timeoutSeconds=5 so
  failover to backup completes before the probe times out.

This decouples liveness from upstream availability: kubelet only
restarts the proxy when nginx is genuinely dead, not when Ollama is
slow.

Longhorn coverage gap: K8s emits "snapshot becomes not ready to use"
events constantly during the hourly snapshot lifecycle (1047
snapshots, all readyToUse=true on inspect). Those events were the
only signal we had, and they are purely transient lifecycle noise,
not actionable.

Add:
- longhorn scrape job (longhorn-backend.longhorn-system.svc:9500)
- NetworkPolicy egress rule for longhorn-system port 9500
- 4 new alerts in 'longhorn-storage' group:
  - LonghornVolumeDegraded (>15m): replica unhealthy, auto-rebuild
  - LonghornVolumeFaulted (>5m, critical, thermal print): data loss
  - LonghornBackupStale (no completed backup in >36h): recurring job
    silently failing
  - LonghornNodeUnhealthy (>5m): node ready=false
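
The monitoring changes are not part of the diff shown here, so the
following is only a rough sketch of what the three additions might look
like. The file layout, the job name, the namespace label selector, the
"warning" severity, and the alert expressions are assumptions. The
expressions rely on Longhorn's longhorn_volume_robustness metric
(2 = degraded, 3 = faulted), and the committed rules may differ.

```yaml
# Sketch only: fragment shapes are assumptions, not the committed config.

# 1. Prometheus scrape job (fragment of prometheus.yml)
scrape_configs:
  - job_name: longhorn
    metrics_path: /metrics
    static_configs:
      - targets:
          - longhorn-backend.longhorn-system.svc:9500
---
# 2. NetworkPolicy egress rule (fragment of spec.egress on the
#    policy that covers the Prometheus pod)
egress:
  - to:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: longhorn-system
    ports:
      - protocol: TCP
        port: 9500
---
# 3. Two of the four alerts (fragment of a Prometheus rules file)
groups:
  - name: longhorn-storage
    rules:
      - alert: LonghornVolumeDegraded
        expr: longhorn_volume_robustness == 2   # 2 = degraded
        for: 15m
        labels:
          severity: warning                     # assumed severity
      - alert: LonghornVolumeFaulted
        expr: longhorn_volume_robustness == 3   # 3 = faulted
        for: 5m
        labels:
          severity: critical
```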

zabbix-web 7 restarts and Print.Web 12:55 stop investigated: both
are stable now, no actionable cause found in journal/events.
KubeContainerRestartingFrequently, added in the previous commit, will
catch recurrence of either.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -208,6 +208,15 @@ spec:
 }
 server {
   listen 11434;
+  # Local healthcheck — proves nginx itself is alive.
+  # Must NOT depend on upstream so liveness doesn't restart
+  # the container when BLUEJAY-WS Ollama is slow/offline
+  # and nginx is mid-failover to the edge1 backup.
+  location = /healthz {
+    access_log off;
+    return 200 'ok\n';
+    default_type text/plain;
+  }
   location / {
     proxy_http_version 1.1;
     proxy_set_header Connection "";
@@ -224,18 +233,30 @@ spec:
     exec nginx -g 'daemon off;'
   ports:
   - containerPort: 11434
+  # Readiness probe DOES check upstream so K8s only routes traffic
+  # when at least one Ollama backend is reachable. timeoutSeconds=5
+  # allows nginx to fail over from BLUEJAY-WS primary to edge1
+  # backup before the probe fails (was timeoutSeconds=1 default →
+  # 172 historic restarts when workstation Ollama was down).
   readinessProbe:
     httpGet:
       path: /api/tags
       port: 11434
     initialDelaySeconds: 5
     periodSeconds: 15
+    timeoutSeconds: 5
     failureThreshold: 3
+  # Liveness probe hits ONLY local healthz — restarts the container
+  # only when nginx itself is dead. Decoupling liveness from upstream
+  # eliminates restart-loops caused by transient upstream outages.
   livenessProbe:
     httpGet:
-      path: /api/tags
+      path: /healthz
       port: 11434
     initialDelaySeconds: 10
     periodSeconds: 30
+    timeoutSeconds: 3
     failureThreshold: 3
 - name: agent-zero
   image: agent0ai/agent-zero:latest
   command: ["/bin/bash", "-c"]