monitoring: delay PiManagerDown duplicate pages

Andrew Stoltz
2026-06-10 16:23:49 -05:00
parent 81ac1f3e4f
commit b0a3ef7448
2 changed files with 9 additions and 1 deletions
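
The intent of the PiManagerDown change can be sketched as a comparison of `for:` windows: when a Pi powers off, the alert with the shorter pending window pages first. This is a minimal illustration, not part of the commit. The NodeDown window of `5m` is an assumption (the diff does not show it), the helper names are hypothetical, and scrape and evaluation intervals are ignored.

```python
from datetime import timedelta

# Maps a Prometheus-style duration suffix to a timedelta keyword.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_for(value: str) -> timedelta:
    """Parse a simple single-unit duration such as '3m' or '8m'."""
    return timedelta(**{_UNITS[value[-1]]: int(value[:-1])})


def firing_order(rules: dict[str, str]) -> list[str]:
    """Return alert names ordered by how soon they fire after an outage begins."""
    return sorted(rules, key=lambda name: parse_for(rules[name]))


# Assumption: NodeDown's real `for:` value is not shown in this diff.
NODE_DOWN_FOR = "5m"

before = firing_order({"PiManagerDown": "3m", "NodeDown": NODE_DOWN_FOR})
after = firing_order({"PiManagerDown": "8m", "NodeDown": NODE_DOWN_FOR})

print("before:", before)
print("after: ", after)
```

Under that assumed NodeDown window, the old `3m` value lets PiManagerDown page first, while `8m` places it behind NodeDown. Any NodeDown window shorter than eight minutes gives the same ordering.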

@@ -24,6 +24,12 @@ original Longhorn ReadWriteOnce NuGet PVC. Every other repo-scoped runner uses
two replicas with per-pod `emptyDir` caches. That is the safe backlog-drain
strategy: no two pods share one RWO PVC.
+Ephemeral runner pods are expected to register, run one job, deregister, and
+exit so the Deployment starts a fresh pod for the next registration token. A
+small amount of exit-1/restart churn from token-expiry or no-work windows is
+accepted operational noise as long as jobs are not stuck queued and the
+repo-scoped runner-offline alerts stay quiet.
Sprint 32 final long-tail wave adds 16 two-replica Deployments:
`FlowerCore.Knowledge`, `FlowerCore.LlmBridge`, `FlowerCore.Media`,
`FlowerCore.Presentations`, `FlowerCore.RemoteDesktop`, `FlowerCore.DNS`,

@@ -843,7 +843,9 @@ data:
rules:
- alert: PiManagerDown
expr: up{job="pimanager-app"} == 0
-for: 3m
+# Sprint 67: delayed behind NodeDown's critical page so a powered-off
+# Pi does not create the first duplicate page for the same host.
+for: 8m
labels:
severity: warning
annotations: