Commit Graph

27 Commits

Author SHA1 Message Date
Andrew Stoltz
0d5a1fd530 fix(agent-zero): route util and embed through llm bridge 2026-04-29 19:14:01 -05:00
Andrew Stoltz
174002023d fix(agent-zero): move corpus_search + intranet_search into bluejay-tools-c
The prior commit b71f9e4 created a stray YAML document between the
bluejay-tools-c and bluejay-profile sections. kubectl applied the stray
block's data to bluejay-profile (wrong ConfigMap, wrong mount target).

The setup-bluejay initContainer copies bluejay-tools-{a,b,c} to the tools
directory; bluejay-profile is copied to the agent profile directory. Tools
must live in one of the three tools ConfigMaps.

Fix: insert corpus_search.py and intranet_search.py directly into the
bluejay-tools-c YAML document (before kind/metadata, matching the
data-first layout the rest of the file uses). Also fix two mojibake
characters (→ and ·) that were corrupted in the prior commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:49:23 -05:00
Andrew Stoltz
b71f9e4ec9 feat(agent-zero): add corpus_search + intranet_search to cluster configmaps
- Add corpus_search.py to bluejay-tools-c: semantic vector search over
  fleet SQLite-vec DBs (fleet-workstation-full, fleet-pi-edge, fleet-bmo-bot).
  Returns offline-friendly results for Bible/Greek/Hebrew/Strongs corpora.
  Cluster pod degrades gracefully (no DB mounted yet — BLUEJAY-WS only for now).

- Add intranet_search.py to bluejay-tools-c: live RAG search over the
  intranet vector store via GET /api/search?q=...&topK=N. Uses in-cluster
  service URL (http://intranet-web.intranet.svc:5300) to bypass Traefik TLS
  and the private-range egress denylist.

- Fix intranet_search.py param name: was 'limit', now 'topK' matching the
  SearchController's [FromQuery] parameter name.

- NetworkPolicy: add egress rule for intranet namespace port 5300 (without
  this the pod's TCP connection to the search endpoint was dropped).

- agent-zero.yaml: set FLOWERCORE_INTRANET_URL env var to in-cluster service
  URL so intranet_search uses internal routing, not the public Traefik VIP.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 08:34:31 -05:00
Andrew Stoltz
f1431f7324 feat(agent-zero): wire Print.Web API key to pod via 1Password OnePasswordItem
Add `print-web-api-keys` OnePasswordItem CRD that syncs from 1Password
"Print.Web API Keys" vault item (password field). Mount as PRINT_WEB_API_KEY
env var in the agent-zero container.

The print_web.py Python tool (already in bluejay-tools ConfigMaps) reads
PRINT_WEB_URL and PRINT_WEB_API_KEY env vars for all HTTP calls to the
thermal print service on edge2. Previously the key was unset so every API
call was rejected with 401.

Note: Print.Web uses the legacy REST MCP shape (/api/mcp/tools/*) not the
streamable-http protocol. The Python tool bridges this gap — no /mcp endpoint
exists on Print.Web today. Network policy already allows 10.0.57.16:5200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 20:36:36 -05:00
Andrew Stoltz
0f9d56ee16 agent-zero: drop BLUEJAY-WS upstream, edge1 Pi is sole Ollama backend
Workstation (BLUEJAY-WS) is private dev hardware and should not be in the
cluster path. Repointing the nginx ollama-proxy sidecar so cluster Agent Zero
talks ONLY to edge1 Pi 5 + AI HAT+ (10.0.57.17:11434):

- nginx upstream: edge1 sole server, no workstation entry
- wait-for-ollama init container: only checks edge1
- NetworkPolicy egress: drop 10.0.56.20/32, keep 10.0.57.17/32
- Comments updated throughout to flag workstation as off-limits to cluster
- Annotation rewritten to document the architectural intent

Pulled qwen2.5:1.5b on edge1 first so Agent Zero's utility_model survives
the cutover (existing models on edge1: qwen3:4b, gemma3:4b, qwen2.5-coder:7b,
nomic-embed-text). Model count on edge1: 4 → 5.

Lets BLUEJAY-WS lock down its Ollama port to localhost without breaking
the cluster Agent Zero.
2026-04-27 16:30:44 -05:00
Andrew Stoltz
e2c71c2b8a fix agent-zero ollama-proxy crashloop + add Longhorn monitoring
agent-zero ollama-proxy had 172 historic restarts (now stable).
Root cause: liveness/readiness probes hit /api/tags which proxies
through to BLUEJAY-WS Ollama (10.0.56.20:11434). When the workstation
Ollama is slow or offline, nginx fails over to the edge1 backup —
but the failover takes >1s and the kube-probe default timeoutSeconds=1
gives up first. Three failed probes → kubelet kills the container.

Fix:
- Add nginx local healthz endpoint (200, no upstream).
- Liveness probe → /healthz (proves nginx itself is alive).
- Readiness probe stays on /api/tags but with timeoutSeconds=5 so
  failover to backup completes before the probe times out.

This decouples liveness from upstream availability — kubelet only
restarts the proxy when nginx is genuinely dead, not when Ollama is
slow.

Longhorn coverage gap: K8s emits "snapshot becomes not ready to use"
events constantly during the hourly snapshot lifecycle (1047
snapshots, all readyToUse=true on inspect). Those events were the
only signal we had — purely transient lifecycle noise, not actionable.

Add:
- longhorn scrape job (longhorn-backend.longhorn-system.svc:9500)
- NetworkPolicy egress rule for longhorn-system port 9500
- 4 new alerts in 'longhorn-storage' group:
  - LonghornVolumeDegraded (>15m) — replica unhealthy, auto-rebuild
  - LonghornVolumeFaulted (>5m, critical, thermal print) — data loss
  - LonghornBackupStale (no completed backup in >36h) — recurring job
    silently failing
  - LonghornNodeUnhealthy (>5m) — node ready=false

zabbix-web 7 restarts and Print.Web 12:55 stop investigated — both
are stable now, no actionable cause found in journal/events. Adding
KubeContainerRestartingFrequently in the previous commit will catch
recurrence of either.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:31:14 -05:00
Andrew Stoltz
bfc755057e fix(agent-zero): use streamable http for chat mcp 2026-04-23 13:54:06 -05:00
Andrew Stoltz
d6008ee205 fix(agent-zero): allow chat mcp pod port 2026-04-23 13:29:36 -05:00
Andrew Stoltz
39fe6f1dba fix(agent-zero): route chat mcp in-cluster 2026-04-23 13:26:10 -05:00
Andrew Stoltz
90fcf0cd5d fix(agent-zero): expose openai provider key 2026-04-23 13:21:12 -05:00
Andrew Stoltz
86ccca18e3 Add Chat MCP server to Agent Zero 2026-04-23 12:41:58 -05:00
Andrew Stoltz
702a6e4f52 fix(agent-zero): use short DNS name to avoid CoreDNS template hijack
The full FQDN fc-llm-bridge.fc-llm-bridge.svc.cluster.local has 4 dots,
which is less than the pod's ndots:5 threshold. The resolver then
applies every entry in the search list BEFORE falling through to the
bare FQDN, and the CoreDNS 'template iamworkin.lan' catch-all matches
"...svc.cluster.local.iamworkin.lan" and returns Traefik VIP
10.0.56.200. The egress NetworkPolicy blocks that VIP (0.0.0.0/0
EXCEPT 10.0.0.0/8), so curl hangs for 30-134s and returns HTTP 000.

Reference: feedback_coredns_ndots_template_collision memory.

Fix: use "fc-llm-bridge.fc-llm-bridge.svc" (2 dots, still <5 so search
expansion still fires, but the first suffix "...svc.cluster.local"
hits the Kubernetes plugin in CoreDNS and returns the real ClusterIP
10.43.67.125 before the iamworkin.lan template is ever consulted).

Verified: pod-exec curl fc:cheap → HTTP 200 with a real chat.completion
envelope (Ollama/gemma3:4b via bridge).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:02:09 -05:00
Andrew Stoltz
6cbb5d8792 fix(agent-zero): NetworkPolicy egress rule for fc-llm-bridge (ADR-088)
The chat_model flip (62db15c) pointed Agent Zero at
fc-llm-bridge.fc-llm-bridge.svc.cluster.local:8080 but the existing
agent-zero-netpol only allowed egress to specific node IPs
(10.0.56.20:11434, 10.0.57.17:11434, 10.0.57.16:5200, 10.0.56.11:6443)
plus public-internet (with RFC1918 exclusion). ClusterIP traffic to
10.43.0.0/16 was implicitly denied, so pod-exec curl to the bridge
timed out after 134s.

Adds an egress rule allowing TCP 8080 to the fc-llm-bridge namespace
(matched by kubernetes.io/metadata.name which K8s 1.22+ sets
automatically). No ingress changes needed — fc-llm-bridge has no
NetworkPolicy, so the ingress side is already open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:59:17 -05:00
Andrew Stoltz
62db15c69c feat(agent-zero): route chat_model through fc-llm-bridge (ADR-088)
Flips Agent Zero's chat_model from direct local Ollama (gemma3:12b via
the 127.0.0.1:11434 sidecar proxy) to the FlowerCore LLM Bridge
(fc:balanced tier, OpenAI-compatible, Anthropic Claude Sonnet under the
hood) so chat turns are spend-tracked and can dispatch to any provider
via a single tier alias.

Scope is intentionally minimal and reversible:
  - chat_model: ollama/gemma3:12b/127.0.0.1:11434
              → openai/fc:balanced/fc-llm-bridge internal service URL
  - utility_model, embedding_model, browser_model: UNCHANGED
    (stay on local 127.0.0.1 Ollama sidecar — no spend, low latency,
    not worth routing through the bridge for small-model traffic).

Auth: new A0_SET_chat_model_api_key env var wired to the
fc-llm-bridge-api-keys Secret (field: agent-zero-k8s). The Secret is
synced by a new OnePasswordItem pointing at "FC LLM Bridge API Keys"
in the IAmWorkin vault. Bearer-token auth is now accepted by the
bridge (FlowerCore.LlmBridge@3225f1f).

Rollback: revert this commit; old image v202604231449 is still present
on all RKE2 nodes, and Agent Zero's strategy: Recreate makes the flip
atomic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:54:27 -05:00
Andrew Stoltz
ab7435a43a Update Agent Zero, Asterisk, and Telephony K8s manifests
- Update agent-zero deployment configuration
- Update Asterisk configmap and deployment
- Update telephony service manifest

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 19:12:08 -05:00
Claude Code
8f8290e0da Increase ctx to 8192 (system prompt + 21 tools need >2048) 2026-04-08 20:07:27 +00:00
Claude Code
607192aaec Reduce ctx to 2048 for Pi 5 CPU speed 2026-04-08 19:40:52 +00:00
Claude Code
072d64a5e9 Fix model config: write config.json not config.yaml (plugin reads JSON) 2026-04-08 19:22:16 +00:00
Claude Code
acb19bee9c Write Ollama model config before initialize.sh (fix OpenRouter default) 2026-04-08 19:15:43 +00:00
Claude Code
e6fbe2d22b Mount extensions+theme directly in main container (symlinks lost by initialize.sh) 2026-04-08 18:12:07 +00:00
Claude Code
dbd6769537 Reference split tools ConfigMaps (tools-a/b/c) in init container 2026-04-08 18:09:55 +00:00
Claude Code
0af47f893a Split bluejay-tools into 3 ConfigMaps (K8s 262K annotation limit) 2026-04-08 18:09:49 +00:00
Claude Code
d16f72f089 Enable Blue Jay profile: init container, ConfigMap volumes, tools, extensions, theme 2026-04-08 18:07:13 +00:00
Claude Code
36e7369609 Add Blue Jay profile ConfigMaps (21 tools, prompts, extensions, theme) 2026-04-08 18:07:06 +00:00
Claude Code
c9f07108bd Fix edge1 Ollama IP (.15->.17), add monitoring ingress, add init container 2026-04-08 17:30:22 +00:00
0811bc078b Add cert-manager TLS certificate to agent-zero manifest 2026-03-11 02:45:15 +00:00
bc1f56ae10 Add Agent Zero NUC deployment manifest 2026-03-11 02:29:24 +00:00