Mirrors the fc-signage-marquee group from
FlowerCore.Notes/scripts/monitoring/alerts.yml into the K8s migration
target apps/monitoring/noc-monitoring.yaml, so that a future migration
of the noc1 Podman monitoring stack into RKE2 inherits the marquee
alert ruleset automatically.
Three rules added:
- MarqueeDroppedFramesHigh (5% / 5min / warning)
- MarqueeRenderLatencyP99High (16ms / 10min / warning)
- MarqueeAnimationDurationDrift (10% / 15min / info)
All three are gated with `unless on() absent_over_time(metric[7d])` so
they cannot fire while the metric is still unpublished, i.e. until the
Track 3 IR-21 source IMPL ships the MarqueeMeter into Common + Web + WPF.
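A minimal sketch of the gating pattern for one of the three rules. The metric name `marquee_dropped_frames_ratio` is a hypothetical stand-in, since the real MarqueeMeter series names are not given here; the threshold, duration, and severity come from the list above.

```yaml
groups:
  - name: fc-signage-marquee
    rules:
      - alert: MarqueeDroppedFramesHigh
        # marquee_dropped_frames_ratio is a hypothetical metric name.
        # absent_over_time(...[7d]) returns one element when the series has
        # had no samples for 7 days; `unless on()` then drops every
        # left-hand result, so the rule stays silent until the metric exists.
        expr: |
          (marquee_dropped_frames_ratio > 0.05)
          unless on() absent_over_time(marquee_dropped_frames_ratio[7d])
        for: 5m
        labels:
          severity: warning
```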
Live source-of-truth (the noc1 Podman Prometheus reads from
/opt/monitoring/prometheus/alerts.yml) was updated and reloaded
in the same session — Notes commit 300daa0 carries the matching
alerts.yml + Grafana fc-signage-dashboard.json change.
Per feedback_monitoring_k8s_target_vs_live_podman: this file is the
forward-looking K8s migration target, NOT what the live Podman
Prometheus reads. ArgoCD-syncing this file does NOT push alerts to
the live monitoring stack.
Companion to:
- FlowerCore.Notes 300daa0 (live alerts.yml + Grafana panels deployed)
- docs/signage/marquee-performance-telemetry-design.md (Track 3 IR-21 spec)
- docs/signage/marquee-animation-phases.md (Track 6 13-phase coverage matrix)
Memory: project_marquee_vr_promotion_landed_2026_05_06
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds fc-updatecenter-dashboard.json (uid: fc-updatecenter, version: 2)
to apps/monitoring/ — mirrors the dashboard deployed to noc1 at
/opt/monitoring/grafana/dashboards/fc-updatecenter-dashboard.json.
13 panels: 5 existing probe/availability panels + 1 OTEL row header
+ 7 new panels for the 6 OTEL counters added to FlowerCore.Updater.Web:
updatecenter_manifest_requests_total
updatecenter_bundle_download_bytes_total
updatecenter_bundle_downloads_total
updatecenter_checkins_total
updatecenter_release_publishes_total
updatecenter_signature_verify_failures_total
Live on Grafana at https://grafana.iamworkin.lan/d/fc-updatecenter
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a JSON-provisioned dashboard for FlowerCore.RemoteDesktop session
metrics, matching the Apr 23 staging done in the
codex/ttsreader-release-b6ca2d5 worktree. Dropped into apps/monitoring so
that ArgoCD-managed Grafana provisioning picks it up alongside the other
FC service dashboards.
Three new Prometheus alert rules for the print-services group, all routed
to thermal_print via alert_channel label (Grafana contact point ->
irc-notify -> Print.Web /api/print/alert):
- PrintPaperRollLow (warning, 5-10% remaining, for: 5m)
- PrintPaperRollCritical (critical, <=5% remaining, for: 2m)
- PrintJobDeadLetter (warning, any new dead-letter in 15m)
Source-of-truth gauge is print_paper_remaining_percent (Print.Web OTEL),
which is hydrated from the active PaperRoll row at process startup
(Print.Web@<TBD> HydrateMetricsAsync) so the gauge isn't blind for an
arbitrary window after every deploy.
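A sketch of how the two paper-roll rules could be written against the print_paper_remaining_percent gauge. The exact boundary handling (strictly above 5 for the warning band) and the annotation-free layout are assumptions; the dead-letter rule is omitted because its metric name is not stated here.

```yaml
groups:
  - name: print-services
    rules:
      - alert: PrintPaperRollLow
        # Warning band: more than 5% and at most 10% remaining (assumed bounds).
        expr: print_paper_remaining_percent > 5 and print_paper_remaining_percent <= 10
        for: 5m
        labels:
          severity: warning
          alert_channel: thermal_print
      - alert: PrintPaperRollCritical
        expr: print_paper_remaining_percent <= 5
        for: 2m
        labels:
          severity: critical
          alert_channel: thermal_print
```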
Self-referential humor: low-roll alerts route to the printer that's
running out of paper, so it announces its own paper-out warning on its
remaining paper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
agent-zero ollama-proxy had 172 historic restarts (now stable).
Root cause: liveness/readiness probes hit /api/tags which proxies
through to BLUEJAY-WS Ollama (10.0.56.20:11434). When the workstation
Ollama is slow or offline, nginx fails over to the edge1 backup —
but the failover takes >1s and the kube-probe default timeoutSeconds=1
gives up first. Three failed probes → kubelet kills the container.
Fix:
- Add nginx local healthz endpoint (200, no upstream).
- Liveness probe → /healthz (proves nginx itself is alive).
- Readiness probe stays on /api/tags but with timeoutSeconds=5 so
failover to backup completes before the probe times out.
This decouples liveness from upstream availability — kubelet only
restarts the proxy when nginx is genuinely dead, not when Ollama is
slow.
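The probe split can be sketched as the following container-spec fragment. The container name and the named port `http` are assumptions; only the paths and the readiness timeout are taken from the fix described above.

```yaml
containers:
  - name: ollama-proxy          # assumed container name
    livenessProbe:
      httpGet:
        path: /healthz          # answered by nginx locally, no upstream call
        port: http              # assumed named port
      # timeoutSeconds defaults to 1, which is fine for a local 200
    readinessProbe:
      httpGet:
        path: /api/tags         # still proxied to the Ollama upstream
        port: http
      timeoutSeconds: 5         # leaves room for the >1s failover to the backup
```

With this split, a slow upstream can at worst mark the pod unready; it no longer triggers a container restart.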
Longhorn coverage gap: K8s emits "snapshot becomes not ready to use"
events constantly during the hourly snapshot lifecycle (1047
snapshots, all readyToUse=true on inspect). Those events were the
only signal we had — purely transient lifecycle noise, not actionable.
Add:
- longhorn scrape job (longhorn-backend.longhorn-system.svc:9500)
- NetworkPolicy egress rule for longhorn-system port 9500
- 4 new alerts in 'longhorn-storage' group:
- LonghornVolumeDegraded (>15m) — replica unhealthy, auto-rebuild
- LonghornVolumeFaulted (>5m, critical, thermal print) — data loss
- LonghornBackupStale (no completed backup in >36h) — recurring job
silently failing
- LonghornNodeUnhealthy (>5m) — node ready=false
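A rough sketch of the scrape job and the two volume alerts. The expressions are assumptions based on Longhorn's `longhorn_volume_robustness` gauge (1 = healthy, 2 = degraded, 3 = faulted), not the committed rules; the `warning` severity and the `alert_channel` label are likewise assumed.

```yaml
# prometheus.yml fragment: scrape the Longhorn backend service.
scrape_configs:
  - job_name: longhorn
    static_configs:
      - targets: ['longhorn-backend.longhorn-system.svc:9500']
---
# alerts.yml fragment (assumed expressions).
groups:
  - name: longhorn-storage
    rules:
      - alert: LonghornVolumeDegraded
        expr: longhorn_volume_robustness == 2
        for: 15m
        labels:
          severity: warning
      - alert: LonghornVolumeFaulted
        expr: longhorn_volume_robustness == 3
        for: 5m
        labels:
          severity: critical
          alert_channel: thermal_print
```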
The zabbix-web 7 restarts and the Print.Web 12:55 stop were investigated:
both are stable now, with no actionable cause found in journal/events.
The KubeContainerRestartingFrequently alert added in the previous commit
will catch a recurrence of either.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Followup to 05a273d. After deploy, six PoolDepleted/Deficit alerts
went pending again because the publisher emits per-status gauge
series (fc_desktop_pool_depleted{template,status,alert_level}) and
the historical Warming/BelowDesiredSize series stay at value=1 even
after the template transitions to status=Ready. Filtering by
alert_level=Critical/Warning was not enough — those labels are baked
into the stale series too.
Replace with a join-based query: alert only when the canonical
"Ready" status gauge does NOT report ready=1 for the enabled
template. fc_desktop_pool_ready{status="Ready"}==1 is the publisher's
own current-state canary and never goes stale.
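One way to express the join, as a sketch. It assumes both gauges carry a `template` label and that the `enabled="true"` filter from the earlier fix is kept; the committed expression may differ.

```yaml
- alert: RemoteDesktopPoolDepleted
  # Left side: any depleted series, including stale per-status ones.
  # Right side: templates whose canonical Ready gauge currently reports 1.
  # `unless on(template)` drops every left-hand series whose template
  # appears on the right, so stale series for healthy pools cannot fire.
  expr: |
    (fc_desktop_pool_depleted{enabled="true"} > 0)
    unless on(template)
    (fc_desktop_pool_ready{status="Ready"} == 1)
  for: 5m
```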
Verified against the live cluster — query returns 0 results when all
pools report healthy in their reconcile logs (no stale-label false
positives).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Followup to ab6ade4. Three issues uncovered after the rollout:
1. NodePort hairpin breaks scrape from same-node pod. Prometheus on
rke2-agent1 could reach traefik-metrics on .11/.13 NodePort 30900
but timed out on its OWN node's NodePort. Same problem would hit
kube-state-metrics + cert-manager whenever prometheus reschedules.
Fix: scrape via ClusterIP svc DNS instead of NodePort. NodePorts
stay in place for external/Podman scrapers.
2. probe-traefik-services failed for grafana, prometheus, guac with
non-200/3xx codes. grafana + prometheus are behind Traefik basic-
auth (every endpoint returns 401), so drop from probe surface —
health is covered by the in-cluster monitoring-* scrape jobs.
guac.iamworkin.lan was deprecated when Guacamole moved under
desktop.iamworkin.lan/guacamole/ — drop it.
3. acme path was wrong (root 404). Use /health.
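For item 1, the ClusterIP-based scrape jobs might look like the sketch below. The ports and namespaces match the NetworkPolicy rules in this commit; the Service names for kube-state-metrics and cert-manager are assumptions.

```yaml
scrape_configs:
  - job_name: traefik
    static_configs:
      - targets: ['traefik-metrics.traefik-system.svc:9100']
  - job_name: kube-state-metrics
    static_configs:
      # Service name is assumed
      - targets: ['kube-state-metrics.kube-system.svc:8080']
  - job_name: cert-manager
    static_configs:
      # Service name is assumed
      - targets: ['cert-manager.cert-manager.svc:9402']
```

Resolving the Service by cluster DNS keeps the scrape path independent of which node Prometheus is scheduled on.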
Coverage adds (probe-traefik-services):
chat, dist, dms, menuboard, messageboard, presentations, retail,
ttsreader. All of these have IngressRoutes serving root at 200/3xx.
NetworkPolicy egress rules added so the new ClusterIP svc scrapes work:
- traefik-system: port 9100 (metrics) — separate from data-path 8080/8443
- kube-system: port 8080 (kube-state-metrics)
- cert-manager: port 9402 (controller metrics)
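The three egress rules can be sketched as entries in the policy's `egress:` list. Selecting namespaces by the automatic `kubernetes.io/metadata.name` label is an assumption about how the real policy matches them.

```yaml
egress:
  - to:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: traefik-system
    ports:
      - protocol: TCP
        port: 9100      # Traefik metrics
  - to:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: kube-system
    ports:
      - protocol: TCP
        port: 8080      # kube-state-metrics
  - to:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: cert-manager
    ports:
      - protocol: TCP
        port: 9402      # cert-manager controller metrics
```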
Out-of-band fix during this audit:
- Print.Web on edge2 was inactive (clean exit at 12:55 CDT, root cause
unclear — systemd Stopping signal). Restarted. Service back on 5200.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Live audit on 2026-04-26 found 14 firing alerts caused by stale probe
targets, blackbox TLS verify failures, and stale state-as-label series.
It also found three K8s scrape sources (kube-state-metrics, cert-manager,
traefik) that exposed NodePorts but were not in any scrape config.
Fixes
- probe-remotedesktop: switch http_2xx -> https_internal. Blackbox does
not trust step-ca root, so /health was failing with x509 unknown
authority while the app served 200s.
- probe-agentzero-nuc: short svc form (agent-zero.agent-zero.svc:80)
instead of *.cluster.local. The FQDN form was being rewritten to the
Traefik VIP by the CoreDNS iamworkin.lan template + ndots:5 search
expansion, and the probe then hit its 5s timeout.
- probe-agentzero-local + probe-ollama-local: removed. 10.0.58.100 is on
HOME VLAN and not reachable from cluster pods. Workstation/AI-laptop
Ollama monitoring belongs to host-side Puppet, not cluster blackbox.
- snmp-cloudkey: commented out. The Cloud Key Gen2+ runs unifi-core
(controller), not an SNMP agent. Was generating "connection refused"
every 30s.
- RemoteDesktopPoolDepleted / RemoteDesktopPoolDeficitSustained:
filter on alert_level=Critical / Warning|Critical + enabled=true.
The publisher emits one series per template per status without
resetting old series to 0, so the historical Warming/BelowDesiredSize
series stayed at 1 and the alert kept firing on stale labels.
- RemoteDesktopTlsExpiry: match by job, not hostname-only instance.
The probe sets instance=https://desktop.iamworkin.lan/health so a
hostname-only label match never fired.
- EpsonPrinterDown for: 5m -> 30m. EcoTank sleeps after ~5 min idle and
SNMP times out, so 5m guaranteed nightly noise.
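As an illustration of the job-based match for RemoteDesktopTlsExpiry, a sketch using the blackbox exporter's `probe_ssl_earliest_cert_expiry` metric. The 2-day window, 6h duration, and critical severity follow the rule descriptions elsewhere in this log; the exact committed expression is not shown here and may differ.

```yaml
- alert: RemoteDesktopTlsExpiry
  # Matching on job avoids depending on the instance label, which the
  # probe sets to the full URL (https://desktop.iamworkin.lan/health).
  expr: probe_ssl_earliest_cert_expiry{job="probe-remotedesktop"} - time() < 2 * 86400
  for: 6h
  labels:
    severity: critical
```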
Coverage adds
- kube-state-metrics scrape (NodePort 30901). Required for the new
pod-state alerts and a long list of standard K8s SLO queries.
- cert-manager scrape (NodePort 30902). Required for the
CertManagerCertificateNotReady / RenewalFailed alert pair documented
in project_cert_manager_prometheus_scrape.
- traefik scrape (NodePort 30900) on all three nodes.
- probe-traefik-services: HTTPS probe (https_internal) over the 17 main
iamworkin.lan hosts so any Traefik-fronted service returning non-200
shows up as a single named probe failure.
- blackbox-config: add the https_internal module that the new probes
reference (was only in the FlowerCore.Notes scripts/monitoring copy,
not in the live ConfigMap).
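A sketch of what an https_internal module and a probe job using it could look like. Whether the real module pins the step-ca root or skips verification is not stated here, so the `tls_config` below is an assumption, as are the CA file path, the exporter address, and the two example hostnames.

```yaml
# blackbox.yml fragment
modules:
  https_internal:
    prober: http
    timeout: 5s
    http:
      fail_if_not_ssl: true
      tls_config:
        ca_file: /etc/blackbox/step-ca-root.crt   # hypothetical path
---
# prometheus.yml fragment
scrape_configs:
  - job_name: probe-traefik-services
    metrics_path: /probe
    params:
      module: [https_internal]
    static_configs:
      - targets:
          - https://chat.iamworkin.lan      # assumed hostnames
          - https://dist.iamworkin.lan
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115   # assumed exporter address
```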
New alerts (kubernetes-state group)
- KubeContainerRestartingFrequently (>5 restarts/h)
- KubeContainerCrashLooping (>3 restarts/15m, thermal print)
- KubePodNotReady (Pending/Failed/Unknown >15m)
- KubePodImagePullBackOff (>10m)
- KubeDeploymentReplicasMismatch (>15m)
Without these, the agent-zero ollama-proxy 172x restart loop was
invisible for ~3 days. Same gap would have hidden the fc-php
php84-app-probe ImagePullBackOff orphan (cleaned up out of band).
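A sketch of how some of these rules can be expressed with standard kube-state-metrics series. The thresholds follow the list above; the expressions themselves, the `for: 0m` choices, and the `alert_channel` label are assumptions rather than the committed rules.

```yaml
groups:
  - name: kubernetes-state
    rules:
      - alert: KubeContainerRestartingFrequently
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 0m
      - alert: KubeContainerCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 0m
        labels:
          alert_channel: thermal_print
      - alert: KubePodImagePullBackOff
        expr: kube_pod_container_status_waiting_reason{reason="ImagePullBackOff"} == 1
        for: 10m
      - alert: KubeDeploymentReplicasMismatch
        expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
        for: 15m
```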
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to the Prometheus alert rules landed in e44e9a0. The
Prometheus rules were loading but never delivered — the monitoring
stack has no Alertmanager configured; **Grafana** owns alert
routing via its built-in engine + webhook contact point to
irc-notify.monitoring.svc:9119. Without a matching Grafana alert,
the Prometheus rules just show up in the Prometheus UI and page
no one.
Adds 6 Grafana alert rules in a new `RemoteDesktop` group under
the AI Stack Alerts folder:
- remotedesktop-web-down (3m) — probe_success{job="probe-remotedesktop"} < 1
- remotedesktop-metrics-stale (10m) — fc_desktop_session_events_total series absent
- remotedesktop-pool-depleted (5m) — fc_desktop_pool_depleted > 0
- remotedesktop-pool-deficit-sustained (10m info) — fc_desktop_pool_deficit > 0
- remotedesktop-session-churn-spike (5m info) — launch rate > 20/min
- remotedesktop-tls-expiry (6h critical) — cert < 2 days to expiry
Each uses the standard Grafana 3-stage pipeline (query → reduce →
threshold) matching the existing AI Stack + Infrastructure alert
patterns. Labels: service=remotedesktop + severity (warning/info/critical).
Default route is `IRC #alerts` via the existing webhook contact point.
This brings parity with the Prometheus rules, which already fire
internally for the Prometheus UI and any future Alertmanager integration.
Grafana picks up the new provisioning on its next restart.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two egress allows to monitoring-netpol so Prometheus can scrape
FlowerCore.RemoteDesktop:
1. fc-desktop namespace on port 8080 — direct ClusterIP service
target (remotedesktop-web.fc-desktop:8080).
2. traefik-system namespace pods on ports 8080 + 8443 — covers the
Traefik VIP hairpin path for the `https://desktop.iamworkin.lan`
scrape target (CoreDNS wildcard resolves iamworkin.lan hostnames
to the LB VIP; after kube-proxy DNAT, egress needs the backend
pod port allowed per feedback_netpol_dnat_backend_port).
Without these, the fc-remotedesktop scrape times out with "context
deadline exceeded" even though the monitoring-netpol already allows
the 10.0.56.0/24 CIDR — post-DNAT the destination is a 10.42.x.x
pod IP, not the VIP.
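The two allows, sketched as a full manifest for readability. In practice they are extra entries in the existing monitoring-netpol egress list; the `monitoring` namespace, the empty podSelector, and the namespace label selectors are assumptions.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: monitoring-netpol
  namespace: monitoring          # assumed namespace
spec:
  podSelector: {}                # assumed: all pods in the namespace
  policyTypes:
    - Egress
  egress:
    # 1. Direct ClusterIP scrape of remotedesktop-web.fc-desktop:8080
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: fc-desktop
      ports:
        - protocol: TCP
          port: 8080
    # 2. VIP hairpin: after DNAT the destination is a Traefik pod IP,
    #    so the backend pod ports must be allowed, not just the VIP CIDR.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: traefik-system
      ports:
        - protocol: TCP
          port: 8080
        - protocol: TCP
          port: 8443
```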
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three additions to the monitoring ConfigMap, each targeting
FlowerCore.RemoteDesktop:
- **Scrape jobs** (2 new):
- probe-remotedesktop: blackbox http_2xx against
https://desktop.iamworkin.lan/health every 30s. Feeds the
RemoteDesktopWebDown alert.
- fc-remotedesktop: direct /metrics scrape against
desktop.iamworkin.lan for the fc_desktop_session_events_total
and fc_desktop_pool_* series.
- **Alert group `remote-desktop`** (7 rules in alerts.yml):
- RemoteDesktopWebDown (3m) — /health probe failing
- RemoteDesktopMetricsStale (10m) — absent metrics series
- RemoteDesktopPoolDepleted (5m) — pool deficit + depleted flag
- RemoteDesktopPoolDeficitSustained (10m, info) — persistent
below-desired pool size
- RemoteDesktopSessionChurnSpike (5m, info) — launch rate
>20/min
- RemoteDesktopRecordingEventsDropped (15m, info) — 30m without
recording events while launches active
- RemoteDesktopTlsExpiry (6h, critical) — <2d cert renewal
window; aligns with feedback_acme_expiry_alert_threshold
- **Grafana dashboard mount**: new volumeMounts + volumes entry for
`dashboards-remotedesktop` backed by the grafana-dashboard-remotedesktop
ConfigMap (previously added as a standalone file in d4210c8).
Folder path /var/lib/grafana/dashboards/remotedesktop — picked up
by the file-provider with foldersFromFilesStructure:true so the
dashboard shows up in a "Remotedesktop" folder in Grafana.
No CRLF churn; pure 100-line insertion into LF-normalized file.
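The dashboard mount can be sketched as below, together with a file provider that turns subdirectories into Grafana folders. The volume name, ConfigMap name, and mount path come from the description above; the provider name and the provider's root path are assumptions.

```yaml
# Grafana Deployment fragment
volumeMounts:
  - name: dashboards-remotedesktop
    mountPath: /var/lib/grafana/dashboards/remotedesktop
    readOnly: true
volumes:
  - name: dashboards-remotedesktop
    configMap:
      name: grafana-dashboard-remotedesktop
---
# Dashboard provider fragment (provisioning/dashboards/*.yaml)
apiVersion: 1
providers:
  - name: default                # assumed provider name
    type: file
    options:
      path: /var/lib/grafana/dashboards   # assumed root path
      foldersFromFilesStructure: true
```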
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps apps/monitoring/flowercore-remotedesktop-grafana-dashboard.json
as a ConfigMap manifest so ArgoCD syncs it into the cluster alongside
the existing grafana-dashboard-* ConfigMaps. Standalone file — does
NOT modify noc-monitoring.yaml. That keeps the CRLF churn on
noc-monitoring.yaml (sibling files apps/intranet/intranet.yaml and
apps/agent-zero/configmaps-bluejay.yaml also carry CRLF churn) out
of this commit.
Dashboard will be synced into the cluster but NOT loaded by Grafana
until a matching `volumes:` entry lands in the Grafana Deployment
in noc-monitoring.yaml:
    - name: dashboard-remotedesktop
      configMap:
        name: grafana-dashboard-remotedesktop
Plus a `volumeMounts:` entry in the grafana container:
    - name: dashboard-remotedesktop
      mountPath: /etc/grafana/provisioning/dashboards/remotedesktop
      readOnly: true
Those edits are deferred to the CRLF-normalization pass on
bluejay-infra so the review diff stays reviewable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Synology NAS is configured with community bluejay_monitor
(→ snmp.yml auth 'bluejay_v2'), not public. public_v2 was returning
HTTP 500 from snmp-exporter for this target. Verified bluejay_v2
returns metrics.
Keeps printer (10.0.58.107) on public_v2 — Epson ET-3750 uses
community "public" as documented in its SNMP settings.
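A sketch of how the per-target auth selection could look in the scrape config, using snmp_exporter's `auth` URL parameter. The job name, module name, NAS address, and exporter address are all hypothetical placeholders; only the auth names come from this commit.

```yaml
scrape_configs:
  - job_name: snmp-synology            # hypothetical job name
    metrics_path: /snmp
    params:
      module: [synology]               # assumed module name
      auth: [bluejay_v2]               # NAS community, per snmp.yml
    static_configs:
      - targets: ['nas.example.lan']   # placeholder, real address not stated
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116   # assumed exporter address
```

The printer job would keep `auth: [public_v2]` in the same position.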
CoreDNS iamworkin.lan template + ndots:5 was hijacking
unrealircd.irc.svc.cluster.local lookups → Traefik VIP → timeout.
Every alert since ~2026-04-09 silently failed with "IRC send failed: timed out",
which also killed the thermal-printer path (routed through irc-notify).
Same fix pattern as guacamole@28b7600.