Commit Graph

580 Commits

Author SHA1 Message Date
Andrew Stoltz
11f32f1a6e deploy(dns): add GX10 fc-dns app 2026-06-17 02:12:40 -05:00
Andrew Stoltz
083e7f41cd fix(fc-php): restore missing IngressRoute + TLS cert (php-web 404 on GX10)
php.iamworkin.lan returned 404 on every path: the GX10 GitOps capture grabbed
fc-php's deployment/service but NOT its IngressRoute (chicken-egg — php wasn't
routed at capture time), so Traefik matched no route. Pod is 1/1 Running 37h —
the 404 was pure missing-route, confirmed by diffing against the healthy sibling
mysql-web (which has its IngressRoute).

Mirrors the mysql-web / fc-network pattern: a cert-manager Certificate (step-ca-acme
ClusterIssuer) to mint php-web-tls + an IngressRoute Host(php.iamworkin.lan)->php-web:5400.
Additive only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 01:57:47 -05:00
Andrew Stoltz
336c4a6ec0 deploy(signage): roll GX10 F2 image 2026-06-17 01:25:04 -05:00
Andrew Stoltz
415fec9e4d gx10-gitops: deploy-loop proof — mark knowledge svc managed-by gx10-argocd 2026-06-16 22:33:40 -05:00
Andrew Stoltz
6c0be8563d gx10-gitops: capture live manifests for 32 product namespaces (ArgoCD adoption source) 2026-06-16 22:24:23 -05:00
Andrew Stoltz
0218b1f8b6 gx10-gitops: pilot — capture live knowledge manifests (adoption source) 2026-06-16 22:18:20 -05:00
Andrew Stoltz
4b58b0ca5f deploy: align gateway key field 2026-06-16 21:08:03 -05:00
Andrew Stoltz
bd8adb2188 deploy: add MCP gateway for Agent Zero 2026-06-16 21:01:52 -05:00
Andrew Stoltz
d32abd62c8 deploy(chat): chat-web v20260616-circuit-mood-5711f2d
Ships the Blazor circuit-resilience + mood + telemetry fixes to live chat:
- FcAiChat reconnect-resync (stuck _generating after a circuit drop)
- fc-blazor-start.js client serverTimeout 60s (fewer spurious 1006 reconnects)
- ChatToolVisibility (tool plumbing hidden in personality chat)
- mood empathy fix (avatar no longer "excited" on bad news)
- fc-circuit-telemetry.js + /api/clientlog sink (kubectl-readable circuit data)

Image built on noc1 + imported to rke2-server + rke2-agent1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 13:29:01 -05:00
Andrew Stoltz
204001a89d deploy(dns): pin current DNS image 2026-06-16 11:45:13 -05:00
Andrew Stoltz
6950010ea4 ops(github-runner): scale tts-reader runner to 0 (crash-looping, memory relief)
The github-runner-tts-reader pod was crash-looping (329 restarts) and consuming
memory on the over-pressured old rke2 cluster (rke2-agent1 ~81%), contributing
to Blazor SignalR circuit drops on ttsreader/chat. It provides no working CI in
this state. Set replicas: 0 so ArgoCD stops re-creating it; restore to 1 once
the runner is fixed or CI moves to a working host.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 11:27:40 -05:00
Andrew Stoltz
b28ab73a19 deploy(ttsreader): pin TT-3 endpoint fix image 2026-06-16 05:33:43 -05:00
Andrew Stoltz
09398d451f deploy(ttsreader): pin TT-3 plane health image 2026-06-16 05:15:48 -05:00
Andrew Stoltz
3a7978ab1f deploy(dns): pin DN-3b drift image 2026-06-15 20:56:30 -05:00
Andrew Stoltz
c0bfcb46fa deploy(dns): pin DN-3 PowerDNS image 2026-06-15 20:33:18 -05:00
Andrew Stoltz
ebbf501038 deploy(dns): pin DN-2 MCP bridge image 2026-06-15 20:15:42 -05:00
Andrew Stoltz
d4f24f6f43 deploy(dns): wire MCP transport key 2026-06-15 19:58:52 -05:00
Andrew Stoltz
9f4805f1d6 deploy(dns): pin DN-2 entity CRUD image 2026-06-15 19:52:42 -05:00
Andrew Stoltz
b9a81fb4c0 deploy(dns): pin DN-1 rate-limit image 2026-06-15 19:07:56 -05:00
Andrew Stoltz
a4ccd30429 deploy(chat): require intranet fallback citation image 2026-06-15 18:25:19 -05:00
Andrew Stoltz
09b22e32c2 deploy(chat): pin activation hardened citation image 2026-06-15 18:21:54 -05:00
Andrew Stoltz
5bb136554d deploy(chat): pin citation card fallback image 2026-06-15 18:16:15 -05:00
Andrew Stoltz
485710230b deploy(chat): pin citation card image 2026-06-15 18:06:08 -05:00
Andrew Stoltz
f016375419 deploy(chat): pin citation route fix image 2026-06-15 17:50:21 -05:00
Andrew Stoltz
fc64638029 deploy(chat): pin citation fallback fix-forward 2026-06-15 17:32:39 -05:00
Andrew Stoltz
6b751b0fbe deploy(chat): pin citation fallback image 2026-06-15 17:26:03 -05:00
Andrew Stoltz
a03dbe166d deploy(library): pin RL5 fine action image 2026-06-15 15:56:20 -05:00
Andrew Stoltz
6febe1fdb3 deploy(dns): enable production auth profile 2026-06-15 15:08:03 -05:00
Andrew Stoltz
40fd35ba44 deploy(chat): pin CH-6 presence image 2026-06-14 19:26:31 -05:00
Andrew Stoltz
17654835e7 gx10/platform: step-ca-acme issuer + Traefik HelmChart (migration platform layer)
Bootstrap manifests for the GX10 cluster platform layer (NUC->GX10 migration).
Direct-applied to GX10 + LIVE: step-ca-acme ClusterIssuer Ready (ACME->noc1 step-ca),
Traefik v3.6.10 via RKE2 HelmChart CRD at MetalLB VIP 10.0.57.202 (prod-pool, temp
parallel-run; no clash with live old .200). Under gx10/ NOT apps/* to avoid the old
ApplicationSet auto-deploying GX10 manifests to the OLD cluster.
2026-06-14 18:06:25 -05:00
Andrew Stoltz
63b8d4b667 Deploy Chat regroup CH-3 image 2026-06-14 18:01:43 -05:00
Andrew Stoltz
2c12f35f75 agent-zero: fix fc_dms netpol egress port (8080 = pod targetPort, not svc 80)
NetworkPolicy matches the destination POD port. dms-web svc:80 -> containerPort
8080, so the egress must allow 8080 (the fc-chat rule already allows 80+8080,
which is why chat worked and dms timed out). Add 8080 to the fc-dms egress.
2026-06-14 16:25:25 -05:00
Andrew Stoltz
e33fe81823 agent-zero: connect fc_dms MCP (product-manager fan-out, first server)
AZ only had fc_chat (chat-session) + fc_knowledge (RAG) — so it had no product
capabilities (the 'mysql manager' gap). Wire fc_dms (dynamic message signs, ~13
tools): OnePasswordItem dms-mcp-keys (1P 'FlowerCore DMS MCP Keys' field credential)
-> DMS_MCP_API_KEY -> X-Api-Key; builder adds fc_dms; netpol egress fc-dms:80.
Proven: dms-web/mcp returns 200 with this key. presentations/messageboard/
segmentdisplay/telephony 1P MCP-key items exist for the same pattern; mysql+signage
need 1P items provisioned first (mysql/mcp 401s with no key). Watch context budget.
2026-06-14 16:19:34 -05:00
Andrew Stoltz
ef6afdd577 fc-llm-bridge: repoint Ollama to GX10 NodePort (fix AZ MTU black-hole)
The PROD-VLAN VIP 10.0.57.201 MTU-black-holes Agent Zero's ~150KB requests
(full prompt + 108 MCP tools) -> connection reset mid-stream -> AZ 'same message
again' loop. Switch FlowerCore__Chat__OllamaBaseUrl to the INFRA-VLAN NodePort
10.0.56.14:30976 (same VLAN as the old cluster, carries 150KB fine). Verified:
150KB POST = 200 via NodePort, times out via VIP. NodePort pinned to 30976 on GX10.
2026-06-14 15:12:05 -05:00
Andrew Stoltz
62ca7dacf6 telephony: deploy ARI abort-fix image v20260614-arifix; drop 3600s band-aids
Image -> v20260614-arifix (Telephony 86ff0d0: ReceiveAsync no longer cancelled).
Remove the WebSocketKeepAliveTimeoutSeconds/WebSocketReceiveTimeoutSeconds=3600
band-aids; the code now disables the pong deadline by default and ignores the
receive timeout (liveness = keepalive ping + WebSocketException/Close).
2026-06-14 14:36:11 -05:00
Andrew Stoltz
d03a92407d gx10/tts: persist Piper /tts source + manifest (telephony TTS port baseline)
Dockerfile (linux/arm64, en_US-amy-medium baked), tts_service.py (16kHz/16-bit/mono
WAV, numpy resample 22050->16000), gx10-tts.yaml (CPU NodePort 30850, no GPU request),
README (build/import/cutover/verify on the GX10 cluster).
2026-06-14 14:14:59 -05:00
Andrew Stoltz
e4d1735d35 telephony: make TTS cutover EFFECTIVE via Tts__PiperUrl env (overrides configmap)
Root cause: the live deploy carried env Tts__PiperUrl=edge1 (drifted, not in git)
which shadows appsettings Tts.PiperUrl. Codify Tts__PiperUrl=GX10 + Ari__ env to
match live so git is source-of-truth; the configmap edit alone was inert.
2026-06-14 14:12:02 -05:00
Andrew Stoltz
15edcb7c71 telephony: cut TTS over to GX10 (10.0.56.14:30850, amy-medium); keep edge1 warm
- Tts.PiperUrl edge1 10.0.57.17:8500 -> GX10 NodePort 10.0.56.14:30850
- add netpol egress to GX10 TTS; keep edge1 egress as rollback target
- DefaultEngine piper / SampleRate 8000 unchanged (sln16 16kHz path)
2026-06-14 14:01:50 -05:00
Andrew Stoltz
284ca84166 agent-zero: GX10 system prompt rewrite (tool-calling + RAG rules, strip dead lanes)
Sync the bluejay-profile ConfigMap's embedded system_prompt.md with the
rewritten scripts/agent-zero/agents/bluejay/system_prompt.md: Ollama section
-> GX10 hub (VIP 10.0.57.201, GB10/121GiB); model table with tool-calling
flags (qwen2.5 = tools, gemma3 = 400-on-tools/vision-only, nomic = embed);
new 'Models & Tool-Calling' + 'Knowledge & RAG' rule blocks; stripped dead
WSL/R9700/.132/host.docker.internal/port-30050 lanes; de-pinned test counts;
'Blu' team is persona vocabulary not a fixed team. Personality preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 13:40:25 -05:00
Andrew Stoltz
7a86c40cf1 fix(telephony): ARI receive timeout 45s->3600s — the real false-abort root cause
Cancelling ClientWebSocket.ReceiveAsync via CancellationToken ABORTS the
socket (a half-read WS frame can't resume). The per-iteration
iterationCts.CancelAfter(WebSocketReceiveTimeoutSeconds) therefore aborted a
healthy idle ARI WebSocket every 45s (state=Aborted), not the keepalive pong
(proven: loop persisted after pong-timeout 15s->3600s). A large receive
timeout lets ReceiveAsync block harmlessly while the PBX is idle; real drops
still surface immediately as WebSocketException -> reconnect. Proper code fix
(stop using CancelAfter on the receive) tracked separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 13:04:13 -05:00
Andrew Stoltz
de5c9f39fd deploy(devicemgmt): pin regroup web image 2026-06-14 12:52:30 -05:00
Andrew Stoltz
d5311de676 fix(telephony): stop ARI WebSocket false-abort loop (pong-timeout 15s->3600s)
Asterisk res_http_websocket does not reliably answer client PING frames
with PONG, so .NET KeepAliveTimeout (default 15s) aborted a healthy idle
ARI WebSocket every ~45s (ping@30s + pong-wait@15s), dropping StasisStart
events so the *100 IVR intermittently answered with no audio. Generous pong
timeout stops the false aborts; genuine drops still caught by the 45s
receive-timeout state re-check and TCP-level WebSocketException.

Surfaced by FlowerCore.Telephony.SipTests Call_Star100_ReceivesAudibleAudioStream
(0 RTP packets while ExtToExt RTP-hook passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 12:50:12 -05:00
Andrew Stoltz
7b4f57bb97 deploy(updater): pin regroup web image 2026-06-14 12:45:39 -05:00
Andrew Stoltz
c569c05ad7 deploy(retail-library): roll regroup web images 2026-06-14 12:38:57 -05:00
Andrew Stoltz
fc8297041a deploy(fc-chat): roll effective-prompt debug reveal v20260614-debugreveal-d389e4b
Influence Audit panel now surfaces the per-turn effective prompt
(RagContextSnapshot) as an operator/debug row. FlowerCore.Chat d389e4b.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 12:33:37 -05:00
Andrew Stoltz
e1554757e8 deploy(fc-chat): roll user-bubble prompt-leak fix v20260614-bubblefix-37f57b0
Stored/displayed user message is now the raw prompt; injected scaffolding
(mood contract + guidance + memory) goes to the model via ragContext as a
system message and is captured in RagContextSnapshot for debug.
FlowerCore.Chat 37f57b0 + FlowerCore.Common 4d741b3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 03:15:26 -05:00
Andrew Stoltz
0c8e6ee8ab agent-zero(models): tool-capable qwen2.5 on GX10 via fc-llm-bridge (Wiring A)
Agent Zero's agentic tool-loop ran on cloud Anthropic Sonnet (the bridge's
Anthropic key is currently 401) + gemma3:4b util (gemma3 returns 400 "does not
support tools" — fatal for the loop). Repoint the bridge ModelRouter tiers:
Balanced -> Ollama qwen2.5:14b (AZ chat) and Cheap -> qwen2.5:7b (AZ util), both
on the GX10 VIP 10.0.57.201 (already the bridge OllamaBaseUrl). Env-only, no
rebuild; Wiring A keeps the budget ledger + cache. Also: AZ chat ctx -> 32768,
browser -> qwen2.5:7b (text/tool-capable, vision off), AGENT_NAME -> "Blue Jay"
(the NUC role is retired). qwen2.5:7b + :14b pulled + warm-pinned on the GX10.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 02:38:17 -05:00
Andrew Stoltz
9d5a1cce97 deploy(fc-chat): roll mood-signal build v20260614-moodsignal-a606892
Workstream A: set_mood structured signal replaces leaky [mood:X] text
(FlowerCore.Chat a606892). Image built + imported to rke2-server and
rke2-agent1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 02:21:47 -05:00
Andrew Stoltz
e0460bd881 infra(ai): consolidate fleet Ollama consumers onto GX10 VIP 10.0.57.201
Repoints fc-chat, fc-ttsreader, knowledge, fc-llm-bridge (off the slow edge1
Pi5 10.0.57.17) and intranet (off the reimaged BLUEJAY-AI test laptop
10.0.56.132) to the GX10 (DGX Spark / GB10) Ollama over the PROD MetalLB VIP
10.0.57.201. GX10 serves gemma3:12b/gemma3:4b/qwen2.5:1.5b/nomic-embed-text/
llama3.2:1b on local NVMe, warm-pinned (keep_alive=-1).

fc-chat default model qwen2.5-coder:7b -> gemma3:12b (the coder model won't
pull reliably on the GX10; gemma3:12b is the warm fleet default + a better
general-chat model). Other consumers keep their exact models. Inline comments
referencing edge1/BLUEJAY-AI are now historical; the values are the GX10 VIP.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 00:54:36 -05:00
Robot
303c450bc9 Cl-5: Admin console infra finding — rides DM.Web (zero new infra)
Audit of apps/fc-devicemgmt/ confirms the admin/helpdesk console needs NO new
infra: the existing host-matched IngressRoute (devices.iamworkin.lan, no path
constraint) + step-ca-acme Certificate already cover admin routes served under
FlowerCore:PathBase (ADR-204 routes-inside-DM.Web). ADMIN-CONSOLE-INFRA.md
records the finding + the open Q-MP question (distinct admin hostname vs PathBase
path) with the exact 3-step add if a separate host is later chosen.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 23:22:16 -05:00