bluejay-infra

Author	SHA1	Message	Date
Codex	5bfe41beca	fix(monitoring): rename bare Grafana dashboard JSONs out of .json extension ArgoCD's directory-driven manifest parser scans .yaml AND .json by default. Bare Grafana dashboard JSONs (no apiVersion/kind/metadata) poisoned manifest generation for the entire infra-monitoring Application ("Object 'Kind' is missing in <dashboard JSON>"), leaving sync state Unknown. These files are SOURCE for the file-provisioning path on noc1 (/opt/monitoring/grafana/dashboards/) and also get inlined into ConfigMap wrappers like grafana-dashboard-remotedesktop.yaml. They are NOT K8s manifests and shouldn't be in ArgoCD's manifest tree. .argocdignore is for repo-level GitOps source eligibility, not for filtering manifests within a directory-mode Application — the cleanest fix is the .txt extension that ArgoCD's parser skips. Reverts the .argocdignore from the previous commit (didn't take effect). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 10:13:37 -05:00
Codex	df22774674	fix(infra): unstick fc-updater + monitoring ArgoCD apps fc-updater PVC: bump updatecenter-data storage 10Gi → 25Gi. The cluster PVC was already manually expanded to 20Gi to fit Mike Bundle (~5.1 GiB) plus the LocalFsBundleStore.MaxTotalBytes soft cap of 25 GiB (see project_uc_remaining_4_apps_signed_2026_05_06). PVCs cannot shrink, so ArgoCD couldn't sync the smaller git value (OutOfSync, retried 5x with "field can not be less than status.capacity"). Setting git to 25Gi gives headroom matching the soft cap. monitoring .argocdignore: skip bare dashboard JSON files. Both fc-updatecenter-dashboard.json and flowercore-remotedesktop-grafana-dashboard.json live in apps/monitoring/ as source-of-truth for file-provisioning to noc1's /opt/monitoring/grafana/dashboards/. ArgoCD's manifest parser tries to unmarshal every file and chokes on bare dashboard JSON ("Object 'Kind' is missing"), which then poisoned the whole infra-monitoring Application status (Unknown sync, no comparison possible). The .argocdignore tells ArgoCD to skip *.json — actual K8s deploys happen via ConfigMap wrappers like grafana-dashboard-remotedesktop.yaml. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 10:11:27 -05:00
Codex	4b777b16ac	monitoring: mirror fc-signage-marquee alert group into noc-monitoring K8s target Mirror of FlowerCore.Notes/scripts/monitoring/alerts.yml fc-signage-marquee group into the K8s migration target apps/monitoring/noc-monitoring.yaml so that future migration of the noc1 Podman monitoring stack into RKE2 inherits the marquee alert ruleset automatically. Three rules added: - MarqueeDroppedFramesHigh (5% / 5min / warning) - MarqueeRenderLatencyP99High (16ms / 10min / warning) - MarqueeAnimationDurationDrift (10% / 15min / info) All three gated with `unless on() absent_over_time(metric[7d])` so they don't fire during the metric-not-yet-published window before Track 3 IR-21 source IMPL ships the MarqueeMeter into Common + Web + WPF. Live source-of-truth (the noc1 Podman Prometheus reads from /opt/monitoring/prometheus/alerts.yml) was updated and reloaded in the same session — Notes commit 300daa0 carries the matching alerts.yml + Grafana fc-signage-dashboard.json change. Per feedback_monitoring_k8s_target_vs_live_podman: this file is the forward-looking K8s migration target, NOT what the live Podman Prometheus reads. ArgoCD-syncing this file does NOT push alerts to the live monitoring stack. Companion to: - FlowerCore.Notes 300daa0 (live alerts.yml + Grafana panels deployed) - docs/signage/marquee-performance-telemetry-design.md (Track 3 IR-21 spec) - docs/signage/marquee-animation-phases.md (Track 6 13-phase coverage matrix) Memory: project_marquee_vr_promotion_landed_2026_05_06 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 16:01:44 -05:00
Codex	ca8d062826	feat(monitoring): mirror Update Center Operations dashboard (Track 1D) Adds fc-updatecenter-dashboard.json (uid: fc-updatecenter, version: 2) to apps/monitoring/ — mirrors the dashboard deployed to noc1 at /opt/monitoring/grafana/dashboards/fc-updatecenter-dashboard.json. 13 panels: 5 existing probe/availability panels + 1 OTEL row header + 7 new panels for the 6 OTEL counters added to FlowerCore.Updater.Web: updatecenter_manifest_requests_total updatecenter_bundle_download_bytes_total updatecenter_bundle_downloads_total updatecenter_checkins_total updatecenter_release_publishes_total updatecenter_signature_verify_failures_total Live on Grafana at https://grafana.iamworkin.lan/d/fc-updatecenter Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 10:54:39 -05:00
Andrew Stoltz	57d7ba46a7	feat(monitoring): add fc-remotedesktop grafana dashboard JSON-provisioned dashboard for FlowerCore.RemoteDesktop session metrics, matches the Apr 23 staging done in the codex/ttsreader-release-b6ca2d5 worktree. Drop into apps/monitoring so ArgoCD-managed Grafana provisioning picks it up alongside the other FC service dashboards.	2026-04-30 14:32:54 -05:00
Andrew Stoltz	3e0b9055b0	monitoring: paper-roll lifecycle alerts (XL Track I) Three new Prometheus alert rules for the print-services group, all routed to thermal_print via alert_channel label (Grafana contact point -> irc-notify -> Print.Web /api/print/alert): - PrintPaperRollLow (warning, 5-10% remaining, 5m for) - PrintPaperRollCritical (critical, <=5% remaining, 2m for) - PrintJobDeadLetter (warning, any new dead-letter in 15m) Source-of-truth gauge is print_paper_remaining_percent (Print.Web OTEL), which is hydrated from the active PaperRoll row at process startup (Print.Web@<TBD> HydrateMetricsAsync) so the gauge isn't blind for an arbitrary window after every deploy. Self-referential humor: low-roll alerts route to the printer that's running out of paper, so it announces its own paper-out warning on its remaining paper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 16:00:40 -05:00
Andrew Stoltz	e2c71c2b8a	fix agent-zero ollama-proxy crashloop + add Longhorn monitoring agent-zero ollama-proxy had 172 historic restarts (now stable). Root cause: liveness/readiness probes hit /api/tags which proxies through to BLUEJAY-WS Ollama (10.0.56.20:11434). When the workstation Ollama is slow or offline, nginx fails over to the edge1 backup — but the failover takes >1s and the kube-probe default timeoutSeconds=1 gives up first. Three failed probes → kubelet kills the container. Fix: - Add nginx local healthz endpoint (200, no upstream). - Liveness probe → /healthz (proves nginx itself is alive). - Readiness probe stays on /api/tags but with timeoutSeconds=5 so failover to backup completes before the probe times out. This decouples liveness from upstream availability — kubelet only restarts the proxy when nginx is genuinely dead, not when Ollama is slow. Longhorn coverage gap: K8s emits "snapshot becomes not ready to use" events constantly during the hourly snapshot lifecycle (1047 snapshots, all readyToUse=true on inspect). Those events were the only signal we had — purely transient lifecycle noise, not actionable. Add: - longhorn scrape job (longhorn-backend.longhorn-system.svc:9500) - NetworkPolicy egress rule for longhorn-system port 9500 - 4 new alerts in 'longhorn-storage' group: - LonghornVolumeDegraded (>15m) — replica unhealthy, auto-rebuild - LonghornVolumeFaulted (>5m, critical, thermal print) — data loss - LonghornBackupStale (no completed backup in >36h) — recurring job silently failing - LonghornNodeUnhealthy (>5m) — node ready=false zabbix-web 7 restarts and Print.Web 12:55 stop investigated — both are stable now, no actionable cause found in journal/events. Adding KubeContainerRestartingFrequently in the previous commit will catch recurrence of either. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:31:14 -05:00
Andrew Stoltz	b3028f5119	monitoring: fix RemoteDesktop pool alerts for stale per-status series Followup to `05a273d`. After deploy, six PoolDepleted/Deficit alerts went pending again because the publisher emits per-status gauge series (fc_desktop_pool_depleted{template,status,alert_level}) and the historical Warming/BelowDesiredSize series stay at value=1 even after the template transitions to status=Ready. Filtering by alert_level=Critical/Warning was not enough — those labels are baked into the stale series too. Replace with a join-based query: alert only when the canonical "Ready" status gauge does NOT report ready=1 for the enabled template. fc_desktop_pool_ready{status="Ready"}==1 is the publisher's own current-state canary and never goes stale. Verified against the live cluster — query returns 0 results when all pools report healthy in their reconcile logs (no stale-label false positives). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:12:10 -05:00
Andrew Stoltz	05a273d3a6	monitoring: switch K8s scrapes to ClusterIP svc + fix probe paths Followup to `ab6ade4`. Three issues uncovered after the rollout: 1. NodePort hairpin breaks scrape from same-node pod. Prometheus on rke2-agent1 could reach traefik-metrics on .11/.13 NodePort 30900 but timed out on its OWN node's NodePort. Same problem would hit kube-state-metrics + cert-manager whenever prometheus reschedules. Fix: scrape via ClusterIP svc DNS instead of NodePort. NodePorts stay in place for external/Podman scrapers. 2. probe-traefik-services failed for grafana, prometheus, guac with non-200/3xx codes. grafana + prometheus are behind Traefik basic-auth (every endpoint returns 401), so drop from probe surface — health is covered by the in-cluster monitoring-* scrape jobs. guac.iamworkin.lan was deprecated when Guacamole moved under desktop.iamworkin.lan/guacamole/ — drop it. 3. acme path was wrong (root 404). Use /health. Coverage adds (probe-traefik-services): chat, dist, dms, menuboard, messageboard, presentations, retail, ttsreader. All of these have IngressRoutes serving root at 200/3xx. NetworkPolicy egress rules added so the new ClusterIP svc scrapes work: - traefik-system: port 9100 (metrics) — separate from data-path 8080/8443 - kube-system: port 8080 (kube-state-metrics) - cert-manager: port 9402 (controller metrics) Out-of-band fix during this audit: - Print.Web on edge2 was inactive (clean exit at 12:55 CDT, root cause unclear — systemd Stopping signal). Restarted. Service back on 5200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:05:32 -05:00
Andrew Stoltz	ab6ade4e46	monitoring: stabilize firing alerts + add cluster-state coverage Live audit on 2026-04-26 found 14 firing alerts caused by stale probe targets, blackbox TLS verify failures, and stale state-as-label series. Plus three K8s scrape sources (kube-state-metrics, cert-manager, traefik) that exposed NodePorts but were not in any scrape config. Fixes - probe-remotedesktop: switch http_2xx -> https_internal. Blackbox does not trust step-ca root, so /health was failing with x509 unknown authority while the app served 200s. - probe-agentzero-nuc: short svc form (agent-zero.agent-zero.svc:80) instead of *.cluster.local. The FQDN form was being rewritten to the Traefik VIP by the CoreDNS iamworkin.lan template + ndots:5 search expansion, then 5s timeout. - probe-agentzero-local + probe-ollama-local: removed. 10.0.58.100 is on HOME VLAN and not reachable from cluster pods. Workstation/AI-laptop Ollama monitoring belongs to host-side Puppet, not cluster blackbox. - snmp-cloudkey: commented out. The Cloud Key Gen2+ runs unifi-core (controller), not an SNMP agent. Was generating "connection refused" every 30s. - RemoteDesktopPoolDepleted / RemoteDesktopPoolDeficitSustained: filter on alert_level=Critical / Warning|Critical + enabled=true. The publisher emits one series per template per status without resetting old series to 0, so the historical Warming/BelowDesiredSize series stayed at 1 and the alert kept firing on stale labels. - RemoteDesktopTlsExpiry: match by job, not hostname-only instance. The probe sets instance=https://desktop.iamworkin.lan/health so a hostname-only label match never fired. - EpsonPrinterDown for: 5m -> 30m. EcoTank sleeps after ~5 min idle and SNMP times out, so 5m guaranteed nightly noise. Coverage adds - kube-state-metrics scrape (NodePort 30901). Required for the new pod-state alerts and a long list of standard K8s SLO queries. - cert-manager scrape (NodePort 30902). Required for the CertManagerCertificateNotReady / RenewalFailed alert pair documented in project_cert_manager_prometheus_scrape. - traefik scrape (NodePort 30900) on all three nodes. - probe-traefik-services: HTTPS probe (https_internal) over the 17 main iamworkin.lan hosts so any Traefik-fronted service returning non-200 shows up as a single named probe failure. - blackbox-config: add the https_internal module that the new probes reference (was only in the FlowerCore.Notes scripts/monitoring copy, not in the live ConfigMap). New alerts (kubernetes-state group) - KubeContainerRestartingFrequently (>5 restarts/h) - KubeContainerCrashLooping (>3 restarts/15m, thermal print) - KubePodNotReady (Pending/Failed/Unknown >15m) - KubePodImagePullBackOff (>10m) - KubeDeploymentReplicasMismatch (>15m) Without these, the agent-zero ollama-proxy 172x restart loop was invisible for ~3 days. Same gap would have hidden the fc-php php84-app-probe ImagePullBackOff orphan (cleaned up out of band). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 12:57:18 -05:00
Andrew Stoltz	c23e903ba7	feat(monitoring): Grafana alert rules route RemoteDesktop to IRC Companion to the Prometheus alert rules landed in `e44e9a0`. The Prometheus rules were loading but never delivered — the monitoring stack has no Alertmanager configured; Grafana owns alert routing via its built-in engine + webhook contact point to irc-notify.monitoring.svc:9119. Without a matching Grafana alert, the Prometheus rules just show up in the Prometheus UI and page no one. Adds 6 Grafana alert rules in a new `RemoteDesktop` group under the AI Stack Alerts folder: - remotedesktop-web-down (3m) — probe_success{job="probe-remotedesktop"} < 1 - remotedesktop-metrics-stale (10m) — fc_desktop_session_events_total series absent - remotedesktop-pool-depleted (5m) — fc_desktop_pool_depleted > 0 - remotedesktop-pool-deficit-sustained (10m info) — fc_desktop_pool_deficit > 0 - remotedesktop-session-churn-spike (5m info) — launch rate > 20/min - remotedesktop-tls-expiry (6h critical) — cert < 2 days to expiry Each uses the standard Grafana 3-stage pipeline (query → reduce → threshold) matching the existing AI Stack + Infrastructure alert patterns. Labels: service=remotedesktop + severity (warning/info/critical). Default route is `IRC #alerts` via the existing webhook contact point. Parity with the Prometheus rules (which already fire internally for the Prometheus UI + any future Alertmanager integration). Grafana restart picks up the new provisioning on next reload. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:57:26 -05:00
Andrew Stoltz	3c5c1a07bd	fix(monitoring): netpol egress allows for fc-desktop + Traefik hairpin Adds two egress allows to monitoring-netpol so Prometheus can scrape FlowerCore.RemoteDesktop: 1. fc-desktop namespace on port 8080 — direct ClusterIP service target (remotedesktop-web.fc-desktop:8080). 2. traefik-system namespace pods on ports 8080 + 8443 — covers the Traefik VIP hairpin path for the `https://desktop.iamworkin.lan` scrape target (CoreDNS wildcard resolves iamworkin.lan hostnames to the LB VIP; after kube-proxy DNAT, egress needs the backend pod port allowed per feedback_netpol_dnat_backend_port). Without these, the fc-remotedesktop scrape times out with "context deadline exceeded" even though the monitoring-netpol already allows the 10.0.56.0/24 CIDR — post-DNAT the destination is a 10.42.x.x pod IP, not the VIP. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:47:50 -05:00
Andrew Stoltz	e44e9a0062	feat(monitoring): RemoteDesktop alerts + scrape jobs + dashboard mount Three additions to the monitoring ConfigMap, each targeting FlowerCore.RemoteDesktop: - Scrape jobs (2 new): - probe-remotedesktop: blackbox http_2xx against https://desktop.iamworkin.lan/health every 30s. Feeds the RemoteDesktopWebDown alert. - fc-remotedesktop: direct /metrics scrape against desktop.iamworkin.lan for the fc_desktop_session_events_total and fc_desktop_pool_* series. - Alert group `remote-desktop` (7 rules in alerts.yml): - RemoteDesktopWebDown (3m) — /health probe failing - RemoteDesktopMetricsStale (10m) — absent metrics series - RemoteDesktopPoolDepleted (5m) — pool deficit + depleted flag - RemoteDesktopPoolDeficitSustained (10m, info) — persistent below-desired pool size - RemoteDesktopSessionChurnSpike (5m, info) — launch rate >20/min - RemoteDesktopRecordingEventsDropped (15m, info) — 30m without recording events while launches active - RemoteDesktopTlsExpiry (6h, critical) — <2d cert renewal window; aligns with feedback_acme_expiry_alert_threshold - Grafana dashboard mount: new volumeMounts + volumes entry for `dashboards-remotedesktop` backed by the grafana-dashboard-remotedesktop ConfigMap (previously added as a standalone file in `d4210c8`). Folder path /var/lib/grafana/dashboards/remotedesktop — picked up by the file-provider with foldersFromFilesStructure:true so the dashboard shows up in a "Remotedesktop" folder in Grafana. No CRLF churn; pure 100-line insertion into LF-normalized file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:41:35 -05:00
Andrew Stoltz	d4210c819f	feat(monitoring): RemoteDesktop Grafana dashboard ConfigMap Wraps apps/monitoring/flowercore-remotedesktop-grafana-dashboard.json as a ConfigMap manifest so ArgoCD syncs it into the cluster alongside the existing grafana-dashboard-* ConfigMaps. Standalone file — does NOT modify noc-monitoring.yaml. That keeps the CRLF churn on noc-monitoring.yaml (sibling files apps/intranet/intranet.yaml and apps/agent-zero/configmaps-bluejay.yaml also carry CRLF churn) out of this commit. Dashboard will be synced into the cluster but NOT loaded by Grafana until a matching `volumes:` entry lands in the Grafana Deployment in noc-monitoring.yaml: - name: dashboard-remotedesktop configMap: name: grafana-dashboard-remotedesktop Plus a `volumeMounts:` entry in the grafana container: - name: dashboard-remotedesktop mountPath: /etc/grafana/provisioning/dashboards/remotedesktop readOnly: true Those edits are deferred to the CRLF-normalization pass on bluejay-infra so the review diff stays reviewable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:20:40 -05:00
Andrew Stoltz	93f77c1844	fix(monitoring): use bluejay_v2 auth for snmp-nas (not public_v2) Synology NAS is configured with community bluejay_monitor (→ snmp.yml auth 'bluejay_v2'), not public. public_v2 was returning HTTP 500 from snmp-exporter for this target. Verified bluejay_v2 returns metrics. Keeps printer (10.0.58.107) on public_v2 — Epson ET-3750 uses community "public" as documented in its SNMP settings.	2026-04-22 21:32:14 -05:00
Andrew Stoltz	3bb3801fbd	fix(monitoring): use short service name for irc-notify IRC_HOST CoreDNS iamworkin.lan template + ndots:5 was hijacking unrealircd.irc.svc.cluster.local lookups → Traefik VIP → timeout. Every alert since ~2026-04-09 silently failed with "IRC send failed: timed out", which also killed the thermal-printer path (routed through irc-notify). Same fix pattern as guacamole@28b7600.	2026-04-22 09:55:23 -05:00
Claude Code	67e41febf5	Add agent-zero egress to monitoring NetworkPolicy for blackbox probes	2026-04-08 17:34:16 +00:00
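
Illustration for 4b777b16ac: the `unless on() absent_over_time(metric[7d])` gate can be sketched as a Prometheus rule roughly like the one below. This is not the rule from noc-monitoring.yaml. The metric names `marquee_frames_dropped_total` and `marquee_frames_total` are hypothetical, and reading "5min" as the `for:` duration (rather than the rate window) is an assumption. Only the alert name, group name, 5% threshold, warning severity, and the 7d gate come from the commit message.

```yaml
groups:
  - name: fc-signage-marquee
    rules:
      - alert: MarqueeDroppedFramesHigh
        # Left side: dropped-frame ratio above 5% (metric names are hypothetical).
        # Right side: absent_over_time() returns a single series only when the
        # metric had no samples in the last 7d; "unless on()" then drops every
        # left-side result, so the alert stays silent until the metric exists.
        expr: |
          (
            rate(marquee_frames_dropped_total[5m])
              / rate(marquee_frames_total[5m])
          ) > 0.05
          unless on() absent_over_time(marquee_frames_total[7d])
        for: 5m   # assumption: "5min" in the commit message is the pending duration
        labels:
          severity: warning
```

Comparison operators bind tighter than `unless` in PromQL, so the threshold is applied before the gate without extra parentheses around the whole left side.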

17 Commits
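
Illustration for e2c71c2b8a: the liveness/readiness split for the agent-zero ollama-proxy could look roughly like the container fragment below. It is a minimal sketch, not the manifest from the repo. The container name, image, and port 8080 are assumptions. The probe paths and `timeoutSeconds: 5` come from the commit message.

```yaml
containers:
  - name: ollama-proxy          # assumed name
    image: nginx:stable         # assumed image
    ports:
      - containerPort: 8080     # assumed listen port
    livenessProbe:
      httpGet:
        path: /healthz          # served by nginx locally, no upstream call
        port: 8080
    readinessProbe:
      httpGet:
        path: /api/tags         # still proxies through to Ollama
        port: 8080
      timeoutSeconds: 5         # leaves time for failover to the backup upstream
```

With this split, a slow or offline upstream only marks the pod not-ready. The kubelet restarts the container only when nginx itself stops answering /healthz.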