bluejay-infra

Author	SHA1	Message	Date
Andrew Stoltz	0d5a1fd530	fix(agent-zero): route util and embed through llm bridge	2026-04-29 19:14:01 -05:00
Andrew Stoltz	174002023d	fix(agent-zero): move corpus_search + intranet_search into bluejay-tools-c The prior commit `b71f9e4` created a stray YAML document between the bluejay-tools-c and bluejay-profile sections. kubectl applied the stray block's data to bluejay-profile (wrong ConfigMap, wrong mount target). The setup-bluejay initContainer copies bluejay-tools-{a,b,c} to the tools directory; bluejay-profile is copied to the agent profile directory. Tools must live in one of the three tools ConfigMaps. Fix: insert corpus_search.py and intranet_search.py directly into the bluejay-tools-c YAML document (before kind/metadata, matching the data-first layout the rest of the file uses). Also fix two mojibake characters (→ and ·) that were corrupted in the prior commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 08:49:23 -05:00
Andrew Stoltz	b71f9e4ec9	feat(agent-zero): add corpus_search + intranet_search to cluster configmaps - Add corpus_search.py to bluejay-tools-c: semantic vector search over fleet SQLite-vec DBs (fleet-workstation-full, fleet-pi-edge, fleet-bmo-bot). Returns offline-friendly results for Bible/Greek/Hebrew/Strongs corpora. Cluster pod degrades gracefully (no DB mounted yet — BLUEJAY-WS only for now). - Add intranet_search.py to bluejay-tools-c: live RAG search over the intranet vector store via GET /api/search?q=...&topK=N. Uses in-cluster service URL (http://intranet-web.intranet.svc:5300) to bypass Traefik TLS and the private-range egress denylist. - Fix intranet_search.py param name: was 'limit', now 'topK' matching the SearchController's [FromQuery] parameter name. - NetworkPolicy: add egress rule for intranet namespace port 5300 (without this the pod's TCP connection to the search endpoint was dropped). - agent-zero.yaml: set FLOWERCORE_INTRANET_URL env var to in-cluster service URL so intranet_search uses internal routing, not the public Traefik VIP. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-29 08:34:31 -05:00
Andrew Stoltz	f1431f7324	feat(agent-zero): wire Print.Web API key to pod via 1Password OnePasswordItem Add `print-web-api-keys` OnePasswordItem CRD that syncs from 1Password "Print.Web API Keys" vault item (password field). Mount as PRINT_WEB_API_KEY env var in the agent-zero container. The print_web.py Python tool (already in bluejay-tools ConfigMaps) reads PRINT_WEB_URL and PRINT_WEB_API_KEY env vars for all HTTP calls to the thermal print service on edge2. Previously the key was unset so every API call was rejected with 401. Note: Print.Web uses the legacy REST MCP shape (/api/mcp/tools/*) not the streamable-http protocol. The Python tool bridges this gap — no /mcp endpoint exists on Print.Web today. Network policy already allows 10.0.57.16:5200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 20:36:36 -05:00
Andrew Stoltz	0f9d56ee16	agent-zero: drop BLUEJAY-WS upstream, edge1 Pi is sole Ollama backend Workstation (BLUEJAY-WS) is private dev hardware and should not be in the cluster path. Repointing the nginx ollama-proxy sidecar so cluster Agent Zero talks ONLY to edge1 Pi 5 + AI HAT+ (10.0.57.17:11434): - nginx upstream: edge1 sole server, no workstation entry - wait-for-ollama init container: only checks edge1 - NetworkPolicy egress: drop 10.0.56.20/32, keep 10.0.57.17/32 - Comments updated throughout to flag workstation as off-limits to cluster - Annotation rewritten to document the architectural intent Pulled qwen2.5:1.5b on edge1 first so Agent Zero's utility_model survives the cutover (existing models on edge1: qwen3:4b, gemma3:4b, qwen2.5-coder:7b, nomic-embed-text). Model count on edge1: 4 → 5. Lets BLUEJAY-WS lock down its Ollama port to localhost without breaking the cluster Agent Zero.	2026-04-27 16:30:44 -05:00
Andrew Stoltz	e2c71c2b8a	fix agent-zero ollama-proxy crashloop + add Longhorn monitoring agent-zero ollama-proxy had 172 historic restarts (now stable). Root cause: liveness/readiness probes hit /api/tags which proxies through to BLUEJAY-WS Ollama (10.0.56.20:11434). When the workstation Ollama is slow or offline, nginx fails over to the edge1 backup — but the failover takes >1s and the kube-probe default timeoutSeconds=1 gives up first. Three failed probes → kubelet kills the container. Fix: - Add nginx local healthz endpoint (200, no upstream). - Liveness probe → /healthz (proves nginx itself is alive). - Readiness probe stays on /api/tags but with timeoutSeconds=5 so failover to backup completes before the probe times out. This decouples liveness from upstream availability — kubelet only restarts the proxy when nginx is genuinely dead, not when Ollama is slow. Longhorn coverage gap: K8s emits "snapshot becomes not ready to use" events constantly during the hourly snapshot lifecycle (1047 snapshots, all readyToUse=true on inspect). Those events were the only signal we had — purely transient lifecycle noise, not actionable. Add: - longhorn scrape job (longhorn-backend.longhorn-system.svc:9500) - NetworkPolicy egress rule for longhorn-system port 9500 - 4 new alerts in 'longhorn-storage' group: - LonghornVolumeDegraded (>15m) — replica unhealthy, auto-rebuild - LonghornVolumeFaulted (>5m, critical, thermal print) — data loss - LonghornBackupStale (no completed backup in >36h) — recurring job silently failing - LonghornNodeUnhealthy (>5m) — node ready=false zabbix-web 7 restarts and Print.Web 12:55 stop investigated — both are stable now, no actionable cause found in journal/events. Adding KubeContainerRestartingFrequently in the previous commit will catch recurrence of either. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:31:14 -05:00
Andrew Stoltz	bfc755057e	fix(agent-zero): use streamable http for chat mcp	2026-04-23 13:54:06 -05:00
Andrew Stoltz	d6008ee205	fix(agent-zero): allow chat mcp pod port	2026-04-23 13:29:36 -05:00
Andrew Stoltz	39fe6f1dba	fix(agent-zero): route chat mcp in-cluster	2026-04-23 13:26:10 -05:00
Andrew Stoltz	90fcf0cd5d	fix(agent-zero): expose openai provider key	2026-04-23 13:21:12 -05:00
Andrew Stoltz	86ccca18e3	Add Chat MCP server to Agent Zero	2026-04-23 12:41:58 -05:00
Andrew Stoltz	702a6e4f52	fix(agent-zero): use short DNS name to avoid CoreDNS template hijack The full FQDN fc-llm-bridge.fc-llm-bridge.svc.cluster.local has 4 dots, which is less than the pod's ndots:5 threshold. The resolver then applies every entry in the search list BEFORE falling through to the bare FQDN, and the CoreDNS 'template iamworkin.lan' catch-all matches "...svc.cluster.local.iamworkin.lan" and returns Traefik VIP 10.0.56.200. The egress NetworkPolicy blocks that VIP (0.0.0.0/0 EXCEPT 10.0.0.0/8), so curl hangs for 30-134s and returns HTTP 000. Reference: feedback_coredns_ndots_template_collision memory. Fix: use "fc-llm-bridge.fc-llm-bridge.svc" (2 dots, still <5 so search expansion still fires, but the first suffix "...svc.cluster.local" hits the Kubernetes plugin in CoreDNS and returns the real ClusterIP 10.43.67.125 before the iamworkin.lan template is ever consulted). Verified: pod-exec curl fc:cheap → HTTP 200 with a real chat.completion envelope (Ollama/gemma3:4b via bridge). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 10:02:09 -05:00
Andrew Stoltz	6cbb5d8792	fix(agent-zero): NetworkPolicy egress rule for fc-llm-bridge (ADR-088) The chat_model flip (`62db15c`) pointed Agent Zero at fc-llm-bridge.fc-llm-bridge.svc.cluster.local:8080 but the existing agent-zero-netpol only allowed egress to specific node IPs (10.0.56.20:11434, 10.0.57.17:11434, 10.0.57.16:5200, 10.0.56.11:6443) plus public-internet (with RFC1918 exclusion). ClusterIP traffic to 10.43.0.0/16 was implicitly denied, so pod-exec curl to the bridge timed out after 134s. Adds an egress rule allowing TCP 8080 to the fc-llm-bridge namespace (matched by kubernetes.io/metadata.name which K8s 1.22+ sets automatically). No ingress changes needed — fc-llm-bridge has no NetworkPolicy, so the ingress side is already open. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 09:59:17 -05:00
Andrew Stoltz	62db15c69c	feat(agent-zero): route chat_model through fc-llm-bridge (ADR-088) Flips Agent Zero's chat_model from direct local Ollama (gemma3:12b via the 127.0.0.1:11434 sidecar proxy) to the FlowerCore LLM Bridge (fc:balanced tier, OpenAI-compatible, Anthropic Claude Sonnet under the hood) so chat turns are spend-tracked and can dispatch to any provider via a single tier alias. Scope is intentionally minimal and reversible: - chat_model: ollama/gemma3:12b/127.0.0.1:11434 → openai/fc:balanced/fc-llm-bridge internal service URL - utility_model, embedding_model, browser_model: UNCHANGED (stay on local 127.0.0.1 Ollama sidecar — no spend, low latency, not worth routing through the bridge for small-model traffic). Auth: new A0_SET_chat_model_api_key env var wired to the fc-llm-bridge-api-keys Secret (field: agent-zero-k8s). The Secret is synced by a new OnePasswordItem pointing at "FC LLM Bridge API Keys" in the IAmWorkin vault. Bearer-token auth is now accepted by the bridge (FlowerCore.LlmBridge@3225f1f). Rollback: revert this commit; old image v202604231449 is still present on all RKE2 nodes, and Agent Zero's strategy: Recreate makes the flip atomic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 09:54:27 -05:00
Andrew Stoltz	ab7435a43a	Update Agent Zero, Asterisk, and Telephony K8s manifests - Update agent-zero deployment configuration - Update Asterisk configmap and deployment - Update telephony service manifest Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 19:12:08 -05:00
Claude Code	8f8290e0da	Increase ctx to 8192 (system prompt + 21 tools need >2048)	2026-04-08 20:07:27 +00:00
Claude Code	607192aaec	Reduce ctx to 2048 for Pi 5 CPU speed	2026-04-08 19:40:52 +00:00
Claude Code	072d64a5e9	Fix model config: write config.json not config.yaml (plugin reads JSON)	2026-04-08 19:22:16 +00:00
Claude Code	acb19bee9c	Write Ollama model config before initialize.sh (fix OpenRouter default)	2026-04-08 19:15:43 +00:00
Claude Code	e6fbe2d22b	Mount extensions+theme directly in main container (symlinks lost by initialize.sh)	2026-04-08 18:12:07 +00:00
Claude Code	dbd6769537	Reference split tools ConfigMaps (tools-a/b/c) in init container	2026-04-08 18:09:55 +00:00
Claude Code	0af47f893a	Split bluejay-tools into 3 ConfigMaps (K8s 262K annotation limit)	2026-04-08 18:09:49 +00:00
Claude Code	d16f72f089	Enable Blue Jay profile: init container, ConfigMap volumes, tools, extensions, theme	2026-04-08 18:07:13 +00:00
Claude Code	36e7369609	Add Blue Jay profile ConfigMaps (21 tools, prompts, extensions, theme)	2026-04-08 18:07:06 +00:00
Claude Code	c9f07108bd	Fix edge1 Ollama IP (.15->.17), add monitoring ingress, add init container	2026-04-08 17:30:22 +00:00
bluejay	0811bc078b	Add cert-manager TLS certificate to agent-zero manifest	2026-03-11 02:45:15 +00:00
bluejay	bc1f56ae10	Add Agent Zero NUC deployment manifest	2026-03-11 02:29:24 +00:00

27 Commits
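
Note on 702a6e4f52 (ndots and search-list order). The sketch below models the order in which a glibc-style stub resolver tries candidate names, to illustrate why the full FQDN reached the iamworkin.lan template before it was tried as-is. The function name, the search list, and the "agent-zero" pod namespace are assumptions for illustration; only ndots:5, the two service names, and the iamworkin.lan suffix come from the commit message.

```python
def candidate_names(name, search, ndots=5):
    """Order in which a glibc-style stub resolver would try `name`.

    Fewer than `ndots` dots: every search suffix is tried first and the
    name as-is comes last. Otherwise the name as-is is tried first.
    """
    if name.endswith("."):
        return [name]  # already absolute, no search expansion
    expanded = [f"{name}.{suffix}." for suffix in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


# ASSUMPTION: a typical pod search list, with the pod in a namespace
# called "agent-zero" and the node contributing iamworkin.lan.
ASSUMED_SEARCH = [
    "agent-zero.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "iamworkin.lan",
]

FULL = "fc-llm-bridge.fc-llm-bridge.svc.cluster.local"  # 4 dots
SHORT = "fc-llm-bridge.fc-llm-bridge.svc"               # 2 dots

if __name__ == "__main__":
    for name in (FULL, SHORT):
        print(name)
        for candidate in candidate_names(name, ASSUMED_SEARCH):
            print("   ", candidate)
```

Under these assumptions the full name is queried with the iamworkin.lan suffix before it is queried bare, while the short name expands to the real service name (via a cluster.local suffix) before the iamworkin.lan suffix is reached. Other resolvers, such as musl, order lookups differently.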
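
Note on 6cbb5d8792 and 702a6e4f52 (egress ipBlock with an exception). A minimal sketch of how a single NetworkPolicy ipBlock with an except list classifies a destination, assuming the "0.0.0.0/0 EXCEPT 10.0.0.0/8" rule quoted in 702a6e4f52. The helper name is hypothetical, and it models one ipBlock only, not a whole policy (real egress is allowed if any rule matches).

```python
import ipaddress


def ipblock_allows(dest, cidr, excepts=()):
    """True if `dest` is inside `cidr` and outside every CIDR in `excepts`."""
    addr = ipaddress.ip_address(dest)
    if addr not in ipaddress.ip_network(cidr):
        return False
    return not any(addr in ipaddress.ip_network(e) for e in excepts)


if __name__ == "__main__":
    # Addresses taken from the commit messages above.
    for dest in ("10.0.56.200", "10.43.67.125"):
        allowed = ipblock_allows(dest, "0.0.0.0/0", ["10.0.0.0/8"])
        print(dest, "allowed" if allowed else "denied")
```

Both the Traefik VIP and the bridge ClusterIP fall inside the excepted range, which is consistent with the commits: the public-internet rule alone cannot reach either, so the bridge needed its own namespace-scoped egress rule.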