bluejay-infra

Author	SHA1	Message	Date
Andrew Stoltz	4b0eef0fb0	deploy(fc-llm-bridge): roll alias-fix image v20260430001132	2026-04-30 00:13:48 -05:00
Andrew Stoltz	bb09a3786f	fix(knowledge): pin live manifest to bundled edition path	2026-04-29 23:37:02 -05:00
Andrew Stoltz	006dbcf671	fix(agent-zero): export knowledge mcp gate to python builder	2026-04-29 23:32:55 -05:00
Andrew Stoltz	1be71d6ba7	fix(agent-zero): export mcp servers without python indent errors	2026-04-29 23:19:48 -05:00
Andrew Stoltz	0c8026c912	fix(agent-zero): avoid heredoc break in mcp bootstrap	2026-04-29 23:16:54 -05:00
Andrew Stoltz	621ae47e00	fix(agent-zero): repair fc knowledge mcp manifest	2026-04-29 23:11:57 -05:00
Andrew Stoltz	ae6b8c0142	fix(knowledge): keep mcp key env on new token secret	2026-04-29 23:06:07 -05:00
Andrew Stoltz	da55220218	feat(agent-zero): wire fc_knowledge phase1 rollout	2026-04-29 22:59:19 -05:00
Andrew Stoltz	b1ad253dd6	fix(agent-zero): prefix bridge embedding alias for litellm	2026-04-29 21:14:12 -05:00
Andrew Stoltz	ee935f6e07	fix(agent-zero): keep internal util/embed on bridge v1	2026-04-29 21:09:04 -05:00
Andrew Stoltz	2853ee2024	chore(bridge): bump fc-llm-bridge image tag v202604292028	2026-04-29 20:50:55 -05:00
Andrew Stoltz	b4a34e16ca	refactor(agent-zero): drop ollama-proxy sidecar (Phase 3)	2026-04-29 20:50:55 -05:00
Andrew Stoltz	0d5a1fd530	fix(agent-zero): route util and embed through llm bridge	2026-04-29 19:14:01 -05:00
Andrew Stoltz	1b633f57b2	chore(infra): wire knowledge MCP api key secret	2026-04-29 18:04:43 -05:00
Andrew Stoltz	ee8afd0a08	deploy(intranet): promote auth-gated intranet image	2026-04-29 17:11:17 -05:00
Andrew Stoltz	cf35884eae	deploy(intranet): harden knowledge search rollout	2026-04-29 16:43:09 -05:00
Andrew Stoltz	9881767b11	deploy(intranet): bump intranet web for knowledge search lane	2026-04-29 16:21:27 -05:00
Andrew Stoltz	c9bf23834b	chore(ttsreader): bump image to v202604291817 Per-profile MoodAnnotationModelOverride picker — Profiles page now shows a model dropdown from IModelRegistry instead of a free-text field; model override null-falls-back to global TtsReader:Ollama:DefaultModel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 13:21:40 -05:00
Andrew Stoltz	174002023d	fix(agent-zero): move corpus_search + intranet_search into bluejay-tools-c The prior commit `b71f9e4` created a stray YAML document between the bluejay-tools-c and bluejay-profile sections. kubectl applied the stray block's data to bluejay-profile (wrong ConfigMap, wrong mount target). The setup-bluejay initContainer copies bluejay-tools-{a,b,c} to the tools directory; bluejay-profile is copied to the agent profile directory. Tools must live in one of the three tools ConfigMaps. Fix: insert corpus_search.py and intranet_search.py directly into the bluejay-tools-c YAML document (before kind/metadata, matching the data-first layout the rest of the file uses). Also fix two mojibake characters (→ and ·) that were corrupted in the prior commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 08:49:23 -05:00
Andrew Stoltz	b71f9e4ec9	feat(agent-zero): add corpus_search + intranet_search to cluster configmaps - Add corpus_search.py to bluejay-tools-c: semantic vector search over fleet SQLite-vec DBs (fleet-workstation-full, fleet-pi-edge, fleet-bmo-bot). Returns offline-friendly results for Bible/Greek/Hebrew/Strongs corpora. Cluster pod degrades gracefully (no DB mounted yet — BLUEJAY-WS only for now). - Add intranet_search.py to bluejay-tools-c: live RAG search over the intranet vector store via GET /api/search?q=...&topK=N. Uses in-cluster service URL (http://intranet-web.intranet.svc:5300) to bypass Traefik TLS and the private-range egress denylist. - Fix intranet_search.py param name: was 'limit', now 'topK' matching the SearchController's [FromQuery] parameter name. - NetworkPolicy: add egress rule for intranet namespace port 5300 (without this the pod's TCP connection to the search endpoint was dropped). - agent-zero.yaml: set FLOWERCORE_INTRANET_URL env var to in-cluster service URL so intranet_search uses internal routing, not the public Traefik VIP. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-29 08:34:31 -05:00
Andrew Stoltz	f1431f7324	feat(agent-zero): wire Print.Web API key to pod via 1Password OnePasswordItem Add `print-web-api-keys` OnePasswordItem CRD that syncs from 1Password "Print.Web API Keys" vault item (password field). Mount as PRINT_WEB_API_KEY env var in the agent-zero container. The print_web.py Python tool (already in bluejay-tools ConfigMaps) reads PRINT_WEB_URL and PRINT_WEB_API_KEY env vars for all HTTP calls to the thermal print service on edge2. Previously the key was unset so every API call was rejected with 401. Note: Print.Web uses the legacy REST MCP shape (/api/mcp/tools/*) not the streamable-http protocol. The Python tool bridges this gap — no /mcp endpoint exists on Print.Web today. Network policy already allows 10.0.57.16:5200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 20:36:36 -05:00
Andrew Stoltz	35bd055cb4	feat(guacamole): add macmini-vnc-creds OnePasswordItem + fix Mac mini connection IPs Phase 1 of Mac mini onboarding (2026-04-28): - Add OnePasswordItem CRD 'macmini-vnc-creds' in guacamole namespace bound to vault item 'Mac Mini' — operator mints Secret with username/password/VNC Password fields - Mac mini discovered at 10.0.56.115 (INFRA VLAN) — not 10.0.57.50 stored in 1P IP field - Guacamole connections updated via API (not stored here): VNC conn #10, SSH conns #9/#33 corrected from old IP 10.0.57.50 → 10.0.56.115 - macOS: 26.4.1 (Sequoia), Apple M1, 16 GB, user: bluejay (admin group) - VNC port 5900 confirmed open; SSH works via noc1 jumpbox with password auth Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-28 20:09:45 -05:00
Andrew Stoltz	f604ab419e	feat(ttsreader): bump image to v202604281923 (SignalR ProgressHub) Adds ProgressHub endpoint at /hubs/progress with project-scoped group broadcasting for JobStarted, CueProgress, JobCompleted, and JobFailed events. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 19:30:41 -05:00
Andrew Stoltz	b2786252b0	chore(ttsreader): bump web image to v202604281831 (ops failed-manifest cleanup) Deploys fix for stale Failed manifest accumulation in TTS Reader Ops view and atomic-write guard against empty/corrupt job manifests. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>	2026-04-28 18:31:53 -05:00
Andrew Stoltz	45ee40920d	fix(ttsreader): bump image to v202604281638 (Range support + Ollama timeout 240s)	2026-04-28 16:44:57 -05:00
Andrew Stoltz	8ad7eb714b	fix(ttsreader): bump image to v202604281542 (annotation few-shot prompt + UI hint)	2026-04-28 15:46:28 -05:00
Andrew Stoltz	3cb44c3104	feat(noc-services): wire puppetdb.iamworkin.lan through Traefik step-ca cert	2026-04-28 15:13:20 -05:00
Andrew Stoltz	2400329acd	fix(intranet): bump image to v20260428-1500 (Monitoring crash patch + Lane 11 anatomy refresh)	2026-04-28 14:59:27 -05:00
Andrew Stoltz	c17af882cc	fix(ttsreader): bump image to v202604281444 for UX polish (cross-chapter Bible passage, /profiles dedup, /ops table)	2026-04-28 14:48:13 -05:00
Andrew Stoltz	76b1938afa	fix(ttsreader): bump image to v202604281434 for live playback regression patch (study-player + speech override synth)	2026-04-28 14:43:06 -05:00
Andrew Stoltz	ced04a6148	intranet: bump web image to v20260428-0953 Sprint E XXL Intranet docs depth + read-aloud-root sweep deploy. Image tag v20260427-2353 → v20260428-0953: - Track A (Intranet.Web@c4f3d78): 7 service pages deepened toward PrintService.razor's 8-tab depth standard. Workflows / Verified Surfaces / Recent Verified Changes added. - Read-aloud-root sweep (Intranet.Web@787982c): data-read-aloud-root wrappers added to 6 older /services/* pages so the read-aloud overlay scopes content extraction precisely instead of falling back to <main> with layout chrome included.	2026-04-28 09:54:27 -05:00
Andrew Stoltz	f2258b92a2	fc-ttsreader: bump web image to v202604280946 + add Render__CdnDirectory env Sprint E XXL Phase 4γ MVP deploy — POST /api/v1/render endpoint. Two changes: 1. Image tag v202604272339 → v202604280946 (TtsReader@d9e0a58 master tip includes the new RenderController + RenderService + 9 tests). 2. New TtsReader__Render__CdnDirectory=/data/cdn env var. Default wwwroot/cdn resolves under the read-only app filesystem when runAsNonRoot=true; pin to the existing writable PVC mount alongside other TtsReader runtime data. Manifests + cue audio land at /data/cdn/sha256/<hash>/manifest.json + cues/. Pre-existing PVC mount at /data/ already covers this — no PVC change needed, just the env var override. Pairs with TtsReader@d9e0a58 master tip (ready for image build + import).	2026-04-28 09:47:46 -05:00
Andrew Stoltz	979a7c7b25	feat(intranet): bump fc-intranet-web to v20260427-2353 + persist PageReadingOverrides Bump intranet image to v20260427-2353 (master @ 38b0148): - Sprint E search lane: /search Blazor page + IntranetSearchService + DocsCorpusIndexer + Shared.Indexing wiring - 7 new service pages: LocalAiAgents, AiTopology, Distribution, Dns, Knowledge, LlmBridge, Provisioning - PiManager drift docs New env var: PageReadingOverrides__FilePath=/data/page-reading-overrides.json so the persisted Lane 2α store lives on the writable PVC instead of the default in-memory fallback (which loses state on pod restart). Operator-edited overrides via the existing /api/v1/pages/{encoded}/overrides controller will now survive across restarts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 23:54:17 -05:00
Andrew Stoltz	0df8f7b936	chore(ttsreader): bump fc-ttsreader-web to v202604272339 (Sprint E Phase C — partial-render UX) TtsReader@9333480: distinguishes partial-render (yellow Warning, audio plays, 'Re-render N failed sentences' button) from full-fail (red Danger, 'Try render again'). New TtsFallbackChainFailedException carries both voices when Kokoro + Piper both fail; chapter breadcrumb names the entire chain instead of just the requested voice. +8 tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 23:40:19 -05:00
Andrew Stoltz	38558641c1	fix(ttsreader-kokoro): bump liveness probe timeouts (Sprint E Phase 1a) Kokoro pod has 4 restarts in 2d6h with exit 143 (SIGTERM from kubelet). kubectl describe events all show: Liveness probe failed: Get "http://10.42.229.109:8880/v1/audio/voices": context deadline exceeded The probe path /v1/audio/voices shares the FastAPI worker pool with /v1/audio/speech. A long synth (Bible chapter, 30+ sentences) holds the pool past the prior 5s × 3 = 15s probe window, kubelet kills the pod, in-flight renders fail. Operator hits "fallback chain failed" toasts + partial-render breadcrumbs during these windows. Bump probe timeoutSeconds 5 → 15 and failureThreshold 3 → 5 → 75 s of grace before kubelet gives up. Combined with the kokoro-side circuit breaker landing in TtsReader (Sprint E Phase 1b), the FC backend will also stop slamming kokoro during recovery so it can serve the probe even faster. The companion Prometheus alerts (KokoroPodFlapping, PiperPodFlapping) land in FlowerCore.Notes/scripts/monitoring/alerts.yml. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 23:28:07 -05:00
Andrew Stoltz	63d905b4df	chore(ttsreader): bump fc-ttsreader-web to v202604272236 (Thinking + Feedback ALTERs)	2026-04-27 22:37:08 -05:00
Andrew Stoltz	d95f4e0caf	chore(ttsreader): bump fc-ttsreader-web to v202604272228 (ChatSessions IsFavorite ALTER hotfix)	2026-04-27 22:28:56 -05:00
Andrew Stoltz	7bc565d17e	fix(ttsreader): pin VoicePreview CacheDirectory to /data PVC Day 8 disk-cache warmer crashes on production with 'Read-only file system : /home/app/data' because the relative default 'data/voice-previews' resolves under runAsNonRoot HOME (read-only with readOnlyRootFilesystem=true). Pin to /data/voice-previews so the cache lands on the writable PVC mount alongside ttsreader.db, audio output, and jobs root. Image v202604272216 (already on nodes) is unaffected by this — only the env routing changes. ArgoCD reconciles + rollout restart picks up the new env without rebuild. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 22:24:04 -05:00
Andrew Stoltz	dfe9c3b67e	chore(ttsreader): bump fc-ttsreader-web to v202604272216 (brace-escape fix)	2026-04-27 22:16:19 -05:00
Andrew Stoltz	37f8db89e4	chore(ttsreader): bump fc-ttsreader-web to v202604272208 (Day 10 + VoiceProfiles hotfix) v202604272157 crash-looped on the production PVC because Database.EnsureCreated() is a no-op on existing DBs and the VoiceProfiles table was missing. TtsReader@a9f0b73 adds an idempotent CREATE TABLE IF NOT EXISTS to the infra reconciler before TtsReaderDataSeeder runs. Bumping the manifest to pick up that fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 22:09:08 -05:00
Andrew Stoltz	00c7d8df24	chore(ttsreader): bump fc-ttsreader-web to v202604272157 (Sprint E Day 10 UX polish) Compact project page (Setup chip strip + chapter inspect-toggle drawer) + render feedback (rolling ETA strip + active-chapter pulse) + Bible Dashboard navigates to /projects/{id} on queue. Source TtsReader@79de78b. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 21:58:12 -05:00
Andrew Stoltz	c6811eadd8	intranet: bump image to v20260427-newpages-and-topology Adds 7 new pages (5 service pages, AI topology, opencode operator guide) to https://intranet.iamworkin.lan: /services/dns /services/distribution /services/llm-bridge /services/knowledge /services/provisioning /services/ai-topology /development/local-ai-agents Plus topology corrections in /services/ai (AiStack.razor) and 6 new nav entries. Source commit: FlowerCore.Intranet.Web@1598542 on codex-wip-pre-readaloud-collision-2026-04-24. Image built from artifacts/publish via Dockerfile.deploy on BLUEJAY-WS, imported to all 3 RKE2 nodes (rke2-server + rke2-agent1 + rke2-agent2). Build: 0 warnings, 0 errors, 197/197 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 17:52:34 -05:00
Andrew Stoltz	4d9d537d83	fix(knowledge): repoint Ollama at edge1 + flip README to LIVE (Sprint E B7) Two changes after the Phase 2.4 deploy went live at https://knowledge.iamworkin.lan: 1. Ollama URL flip: from BLUEJAY-WS (10.0.56.20:11434) to edge1 Pi 5 (10.0.57.17:11434). Honors the cluster-clean architecture from bluejay-infra@0f9d56e ("Workstation is private dev hardware and should not be in the cluster path"). Query-time embeddings (~ms per query) are fast enough on edge1; bulk index rebuilds (Phase 2.5+) will need a separate ingestion lane that can opt into the workstation GPU when present. ArgoCD picks up the env-var change and rolls the pod automatically — no image rebuild needed. 2. README LIVE status: flip the staged-not-yet-applied banner to LIVE 2026-04-27. Pod running, certificate issued, PVC bound, /healthz 200, /api/v1/editions [] (initial-deploy state). Phase 2.5+ admin UI handles bulk population. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 16:56:35 -05:00
Andrew Stoltz	0f9d56ee16	agent-zero: drop BLUEJAY-WS upstream, edge1 Pi is sole Ollama backend Workstation (BLUEJAY-WS) is private dev hardware and should not be in the cluster path. Repointing the nginx ollama-proxy sidecar so cluster Agent Zero talks ONLY to edge1 Pi 5 + AI HAT+ (10.0.57.17:11434): - nginx upstream: edge1 sole server, no workstation entry - wait-for-ollama init container: only checks edge1 - NetworkPolicy egress: drop 10.0.56.20/32, keep 10.0.57.17/32 - Comments updated throughout to flag workstation as off-limits to cluster - Annotation rewritten to document the architectural intent Pulled qwen2.5:1.5b on edge1 first so Agent Zero's utility_model survives the cutover (existing models on edge1: qwen3:4b, gemma3:4b, qwen2.5-coder:7b, nomic-embed-text). Model count on edge1: 4 → 5. Lets BLUEJAY-WS lock down its Ollama port to localhost without breaking the cluster Agent Zero.	2026-04-27 16:30:44 -05:00
Andrew Stoltz	3bf6511d5d	feat(knowledge): stage Phase 2.4 K8s deployment manifests (Sprint E B2) NOT YET APPLIED — push to origin/main is gated on the DNS A record knowledge.iamworkin.lan -> 10.0.56.200 being live. Per memory feedback_pfsense_dns_required_for_acme, applying the Certificate without DNS in place puts cert-manager into ~2h HTTP-01 backoff and needs `kubectl -n knowledge delete order <name>` recovery. Manifests authored: - apps/knowledge/knowledge.yaml — Namespace, PVC (knowledge-vector-store Longhorn 20Gi RWO), Deployment (single replica, Recreate, image localhost/fc-knowledge-web:v202604272200 placeholder, runAsNonRoot 1654, readOnlyRootFilesystem, drop ALL caps, /healthz startupProbe + readinessProbe, tcpSocket livenessProbe), Service (ClusterIP port 80 -> 8080), Certificate (step-ca-acme ClusterIssuer, 90d duration), IngressRoute (knowledge.iamworkin.lan, websecure entrypoint). - apps/knowledge/kustomization.yaml — `kubectl kustomize` preview file (matches fc-distribution shape; ApplicationSet uses dir generator). - apps/knowledge/README.md — deployment order checklist with the DNS preflight, image build/import loop for all 3 RKE2 nodes, push procedure, smoke verification, initial-deploy-state notes (zero editions until *.db files are pushed to the PVC), resource sizing, probe + middleware notes. Companion artifacts (separate repos, separate commits): - FlowerCore.Knowledge@eb91eb4 — Dockerfile.deploy at repo root - FlowerCore.Notes@96cd443 — scripts/deploy-knowledge.sh Apply order (from apps/knowledge/README.md): 1. Add DNS A record knowledge.iamworkin.lan -> 10.0.56.200 via FlowerCore.DNS or pfSense web UI. 2. Run `bash scripts/deploy-knowledge.sh` from FlowerCore.Notes — this builds + imports the image to all 3 RKE2 nodes with FLOWERCORE_DEPLOY_SKIP_ROLLOUT=1 (since the Deployment doesn't exist yet on the cluster). 3. Bump the image tag in this manifest to match the freshly-imported tag, then `git push` from this repo to land on main. ArgoCD picks up within ~3 minutes and creates `infra-knowledge`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 16:28:26 -05:00
Andrew Stoltz	3e0b9055b0	monitoring: paper-roll lifecycle alerts (XL Track I) Three new Prometheus alert rules for the print-services group, all routed to thermal_print via alert_channel label (Grafana contact point -> irc-notify -> Print.Web /api/print/alert): - PrintPaperRollLow (warning, 5-10% remaining, 5m for) - PrintPaperRollCritical (critical, <=5% remaining, 2m for) - PrintJobDeadLetter (warning, any new dead-letter in 15m) Source-of-truth gauge is print_paper_remaining_percent (Print.Web OTEL), which is hydrated from the active PaperRoll row at process startup (Print.Web@<TBD> HydrateMetricsAsync) so the gauge isn't blind for an arbitrary window after every deploy. Self-referential humor: low-roll alerts route to the printer that's running out of paper, so it announces its own paper-out warning on its remaining paper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 16:00:40 -05:00
Andrew Stoltz	c828832808	edge2-services: print.iamworkin.lan Traefik HTTPS for Print.Web (XL Track C) Adds an IngressRoute + cert-manager Certificate that terminates HTTPS for print.iamworkin.lan and proxies to edge2's Print.Web at 10.0.57.16:5200. Same headless-Service-with-manual-Endpoints pattern as noc-services (used for grafana/prometheus/cockpit on noc1). pfSense Unbound already resolves print.iamworkin.lan to the Traefik VIP 10.0.56.200, so cert-manager HTTP-01 should validate cleanly. No basicAuth middleware: Print.Web has its own X-Api-Key authentication and exposes anonymous endpoints for the bookmarklet / Python CLI / cups-notifier flow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 14:37:33 -05:00
Andrew Stoltz	e2c71c2b8a	fix agent-zero ollama-proxy crashloop + add Longhorn monitoring agent-zero ollama-proxy had 172 historic restarts (now stable). Root cause: liveness/readiness probes hit /api/tags which proxies through to BLUEJAY-WS Ollama (10.0.56.20:11434). When the workstation Ollama is slow or offline, nginx fails over to the edge1 backup — but the failover takes >1s and the kube-probe default timeoutSeconds=1 gives up first. Three failed probes → kubelet kills the container. Fix: - Add nginx local healthz endpoint (200, no upstream). - Liveness probe → /healthz (proves nginx itself is alive). - Readiness probe stays on /api/tags but with timeoutSeconds=5 so failover to backup completes before the probe times out. This decouples liveness from upstream availability — kubelet only restarts the proxy when nginx is genuinely dead, not when Ollama is slow. Longhorn coverage gap: K8s emits "snapshot becomes not ready to use" events constantly during the hourly snapshot lifecycle (1047 snapshots, all readyToUse=true on inspect). Those events were the only signal we had — purely transient lifecycle noise, not actionable. Add: - longhorn scrape job (longhorn-backend.longhorn-system.svc:9500) - NetworkPolicy egress rule for longhorn-system port 9500 - 4 new alerts in 'longhorn-storage' group: - LonghornVolumeDegraded (>15m) — replica unhealthy, auto-rebuild - LonghornVolumeFaulted (>5m, critical, thermal print) — data loss - LonghornBackupStale (no completed backup in >36h) — recurring job silently failing - LonghornNodeUnhealthy (>5m) — node ready=false zabbix-web 7 restarts and Print.Web 12:55 stop investigated — both are stable now, no actionable cause found in journal/events. Adding KubeContainerRestartingFrequently in the previous commit will catch recurrence of either. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:31:14 -05:00
Andrew Stoltz	b3028f5119	monitoring: fix RemoteDesktop pool alerts for stale per-status series Followup to `05a273d`. After deploy, six PoolDepleted/Deficit alerts went pending again because the publisher emits per-status gauge series (fc_desktop_pool_depleted{template,status,alert_level}) and the historical Warming/BelowDesiredSize series stay at value=1 even after the template transitions to status=Ready. Filtering by alert_level=Critical/Warning was not enough — those labels are baked into the stale series too. Replace with a join-based query: alert only when the canonical "Ready" status gauge does NOT report ready=1 for the enabled template. fc_desktop_pool_ready{status="Ready"}==1 is the publisher's own current-state canary and never goes stale. Verified against the live cluster — query returns 0 results when all pools report healthy in their reconcile logs (no stale-label false positives). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:12:10 -05:00
Andrew Stoltz	05a273d3a6	monitoring: switch K8s scrapes to ClusterIP svc + fix probe paths Followup to `ab6ade4`. Three issues uncovered after the rollout: 1. NodePort hairpin breaks scrape from same-node pod. Prometheus on rke2-agent1 could reach traefik-metrics on .11/.13 NodePort 30900 but timed out on its OWN node's NodePort. Same problem would hit kube-state-metrics + cert-manager whenever prometheus reschedules. Fix: scrape via ClusterIP svc DNS instead of NodePort. NodePorts stay in place for external/Podman scrapers. 2. probe-traefik-services failed for grafana, prometheus, guac with non-200/3xx codes. grafana + prometheus are behind Traefik basic- auth (every endpoint returns 401), so drop from probe surface — health is covered by the in-cluster monitoring-* scrape jobs. guac.iamworkin.lan was deprecated when Guacamole moved under desktop.iamworkin.lan/guacamole/ — drop it. 3. acme path was wrong (root 404). Use /health. Coverage adds (probe-traefik-services): chat, dist, dms, menuboard, messageboard, presentations, retail, ttsreader. All of these have IngressRoutes serving root at 200/3xx. NetworkPolicy egress rules added so the new ClusterIP svc scrapes work: - traefik-system: port 9100 (metrics) — separate from data-path 8080/8443 - kube-system: port 8080 (kube-state-metrics) - cert-manager: port 9402 (controller metrics) Out-of-band fix during this audit: - Print.Web on edge2 was inactive (clean exit at 12:55 CDT, root cause unclear — systemd Stopping signal). Restarted. Service back on 5200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:05:32 -05:00

306 Commits