monitoring: delay PiManagerDown duplicate pages
@@ -24,6 +24,12 @@ original Longhorn ReadWriteOnce NuGet PVC. Every other repo-scoped runner uses
 two replicas with per-pod `emptyDir` caches. That is the safe backlog-drain
 strategy: no two pods share one RWO PVC.
 
+Ephemeral runner pods are expected to register, run one job, deregister, and
+exit so the Deployment starts a fresh pod for the next registration token. A
+small amount of exit-1/restart churn from token-expiry or no-work windows is
+accepted operational noise as long as jobs are not stuck queued and the
+repo-scoped runner-offline alerts stay quiet.
+
 Sprint 32 final long-tail wave adds 16 two-replica Deployments:
 `FlowerCore.Knowledge`, `FlowerCore.LlmBridge`, `FlowerCore.Media`,
 `FlowerCore.Presentations`, `FlowerCore.RemoteDesktop`, `FlowerCore.DNS`,
@@ -843,7 +843,9 @@ data:
 rules:
 - alert: PiManagerDown
   expr: up{job="pimanager-app"} == 0
-  for: 3m
+  # Sprint 67: delayed behind NodeDown's critical page so a powered-off
+  # Pi does not create the first duplicate page for the same host.
+  for: 8m
   labels:
     severity: warning
   annotations:
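Reviewer note: the effect of moving `for:` from 3m to 8m can be checked with a small model of how a pending alert becomes a firing one. The sketch below is illustrative only and is not part of this commit. It assumes a 1-minute evaluation interval and a hypothetical 5-minute `for:` on NodeDown; neither value appears in this diff. It also ignores scrape timing and staleness.

```python
import math


def first_firing_minute(down_at, for_minutes, eval_interval=1):
    """Minute at which an alert first fires in a simplified model.

    The rule is evaluated at multiples of eval_interval. The alert goes
    pending at the first evaluation at or after down_at and fires at the
    first evaluation where it has been pending for at least for_minutes.
    """
    # First evaluation tick that sees the expression as true.
    active_at = math.ceil(down_at / eval_interval) * eval_interval
    # Number of further ticks needed to cover the `for:` duration.
    ticks_needed = math.ceil(for_minutes / eval_interval)
    return active_at + ticks_needed * eval_interval


# ASSUMPTION: NodeDown's real `for:` value is not shown in this diff.
HYPOTHETICAL_NODEDOWN_FOR = 5

node_down = first_firing_minute(0, HYPOTHETICAL_NODEDOWN_FOR)
old_pimanager = first_firing_minute(0, 3)  # before this commit
new_pimanager = first_firing_minute(0, 8)  # after this commit

print("NodeDown fires at minute", node_down)
print("PiManagerDown fired at minute", old_pimanager, "before the change")
print("PiManagerDown fires at minute", new_pimanager, "after the change")
```

Under those assumptions the old rule notified before NodeDown and the new rule notifies after it, which matches the intent stated in the rule comment. The real ordering depends on NodeDown's actual `for:` value and on the rule group's evaluation interval.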