Commit Graph

278 Commits

Author SHA1 Message Date
Andrew Stoltz
39fe6f1dba fix(agent-zero): route chat mcp in-cluster 2026-04-23 13:26:10 -05:00
Andrew Stoltz
90fcf0cd5d fix(agent-zero): expose openai provider key 2026-04-23 13:21:12 -05:00
Andrew Stoltz
ffef5c9126 Deploy TTS Reader annotation timeout fix 2026-04-23 13:06:17 -05:00
Andrew Stoltz
634e90a9ee Deploy TTS Reader quick hardening release 2026-04-23 12:47:45 -05:00
Andrew Stoltz
86ccca18e3 Add Chat MCP server to Agent Zero 2026-04-23 12:41:58 -05:00
Andrew Stoltz
1c5caf3f40 Deploy TTS Reader v20260423114016 2026-04-23 11:57:39 -05:00
Andrew Stoltz
d3db19b0ca guacamole: enable json auth for remotedesktop sso 2026-04-23 11:27:30 -05:00
Andrew Stoltz
702a6e4f52 fix(agent-zero): use short DNS name to avoid CoreDNS template hijack
The full FQDN fc-llm-bridge.fc-llm-bridge.svc.cluster.local has 4 dots,
which is less than the pod's ndots:5 threshold. The resolver then
applies every entry in the search list BEFORE falling through to the
bare FQDN, and the CoreDNS 'template iamworkin.lan' catch-all matches
"...svc.cluster.local.iamworkin.lan" and returns Traefik VIP
10.0.56.200. The egress NetworkPolicy blocks that VIP (0.0.0.0/0
EXCEPT 10.0.0.0/8), so curl hangs for 30-134s and returns HTTP 000.

Reference: feedback_coredns_ndots_template_collision memory.

Fix: use "fc-llm-bridge.fc-llm-bridge.svc" (2 dots, still <5 so search
expansion still fires, but the cluster.local suffix expands it to
"...svc.cluster.local", which hits the Kubernetes plugin in CoreDNS and
returns the real ClusterIP 10.43.67.125 before the iamworkin.lan
template is ever consulted).

Verified: pod-exec curl fc:cheap → HTTP 200 with a real chat.completion
envelope (Ollama/gemma3:4b via bridge).
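
The lookup order described above can be sketched as follows. This is a
simplified model of a glibc-style stub resolver, not the resolver's real
code. The search list is an assumption: a pod in a hypothetical
"agent-zero" namespace, with iamworkin.lan as the last search entry.

```python
def candidate_queries(name, search, ndots):
    """Return the names a glibc-style stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search expansion.
        return [name.rstrip(".")]
    expanded = [f"{name}.{suffix}" for suffix in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [name] + expanded
    # Too few dots: walk the search list first, the name as given last.
    return expanded + [name]


# Assumed search list (the namespace name is hypothetical).
SEARCH = [
    "agent-zero.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "iamworkin.lan",
]

fqdn = "fc-llm-bridge.fc-llm-bridge.svc.cluster.local"
short = "fc-llm-bridge.fc-llm-bridge.svc"

fqdn_order = candidate_queries(fqdn, SEARCH, ndots=5)
short_order = candidate_queries(short, SEARCH, ndots=5)

for label, order in (("fqdn", fqdn_order), ("short", short_order)):
    print(label)
    for query in order:
        print("  ", query)
```

A real resolver stops at the first candidate that returns an answer. Under
these assumptions the FQDN reaches its ".iamworkin.lan" expansion before
the bare name, while the short name expands to the real service name one
step before the iamworkin.lan suffix is tried.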

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:02:09 -05:00
Andrew Stoltz
6cbb5d8792 fix(agent-zero): NetworkPolicy egress rule for fc-llm-bridge (ADR-088)
The chat_model flip (62db15c) pointed Agent Zero at
fc-llm-bridge.fc-llm-bridge.svc.cluster.local:8080 but the existing
agent-zero-netpol only allowed egress to specific node IPs
(10.0.56.20:11434, 10.0.57.17:11434, 10.0.57.16:5200, 10.0.56.11:6443)
plus public-internet (with RFC1918 exclusion). ClusterIP traffic to
10.43.0.0/16 was implicitly denied, so pod-exec curl to the bridge
timed out after 134s.

Adds an egress rule allowing TCP 8080 to the fc-llm-bridge namespace
(matched by kubernetes.io/metadata.name which K8s 1.22+ sets
automatically). No ingress changes needed — fc-llm-bridge has no
NetworkPolicy, so the ingress side is already open.
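
As an illustration only, the shape of such an egress rule can be built as
a plain Python dict and dumped as JSON. The helper name is hypothetical
and the actual manifest in the repository may be laid out differently.

```python
import json


def bridge_egress_rule(namespace="fc-llm-bridge", port=8080):
    """Build one NetworkPolicy egress rule: TCP <port> to a whole namespace."""
    return {
        "to": [
            {
                "namespaceSelector": {
                    # Label that Kubernetes sets on every namespace automatically.
                    "matchLabels": {"kubernetes.io/metadata.name": namespace}
                }
            }
        ],
        "ports": [{"protocol": "TCP", "port": port}],
    }


rule = bridge_egress_rule()
print(json.dumps(rule, indent=2))
```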

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:59:17 -05:00
Andrew Stoltz
62db15c69c feat(agent-zero): route chat_model through fc-llm-bridge (ADR-088)
Flips Agent Zero's chat_model from direct local Ollama (gemma3:12b via
the 127.0.0.1:11434 sidecar proxy) to the FlowerCore LLM Bridge
(fc:balanced tier, OpenAI-compatible, Anthropic Claude Sonnet under the
hood) so chat turns are spend-tracked and can dispatch to any provider
via a single tier alias.

Scope is intentionally minimal and reversible:
  - chat_model: ollama/gemma3:12b/127.0.0.1:11434
              → openai/fc:balanced/fc-llm-bridge internal service URL
  - utility_model, embedding_model, browser_model: UNCHANGED
    (stay on local 127.0.0.1 Ollama sidecar — no spend, low latency,
    not worth routing through the bridge for small-model traffic).

Auth: new A0_SET_chat_model_api_key env var wired to the
fc-llm-bridge-api-keys Secret (field: agent-zero-k8s). The Secret is
synced by a new OnePasswordItem pointing at "FC LLM Bridge API Keys"
in the IAmWorkin vault. Bearer-token auth is now accepted by the
bridge (FlowerCore.LlmBridge@3225f1f).

Rollback: revert this commit; old image v202604231449 is still present
on all RKE2 nodes, and Agent Zero's strategy: Recreate makes the flip
atomic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:54:27 -05:00
Andrew Stoltz
84634f59f0 chore(fc-llm-bridge): bump image to v202604231520
Ships the Bearer-token auth fix (FlowerCore.LlmBridge@3225f1f) so Agent
Zero's OpenAI provider can authenticate with Authorization: Bearer in
addition to the original X-Api-Key header.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:51:57 -05:00
Andrew Stoltz
4cd5806fd0 fix(fc-llm-bridge): set dnsConfig ndots=2 to prevent CoreDNS wildcard hijack
Pods in this cluster inherit ndots=5. External FQDNs with <5 dots (like
api.anthropic.com) are expanded through the search path first, and the 4th
suffix `api.anthropic.com.iamworkin.lan` matches CoreDNS' `template IN A
iamworkin.lan` wildcard — resolves to Traefik VIP 10.0.56.200. TLS connect
lands on Traefik's default cert and the AnthropicClient rejects with
RemoteCertificateNameMismatch/RemoteCertificateChainErrors.

Setting ndots=2 makes the resolver try the bare FQDN first (api.anthropic.com
has 2 dots, which meets the threshold), so the search path never fires.
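
A minimal sketch of the setting and of the threshold check it changes.
The helper names are hypothetical; only the dnsConfig option name and the
ndots rule (a name with at least ndots dots is tried as-is first) are
taken as given.

```python
import json


def ndots_dns_config(ndots):
    """Pod-spec dnsConfig fragment; Kubernetes expects option values as strings."""
    return {"options": [{"name": "ndots", "value": str(ndots)}]}


def tried_as_is_first(name, ndots):
    """True when the resolver queries the name itself before the search list."""
    return name.count(".") >= ndots


pod_spec_fragment = {"dnsConfig": ndots_dns_config(2)}
print(json.dumps(pod_spec_fragment, indent=2))

for ndots in (5, 2):
    print(ndots, tried_as_is_first("api.anthropic.com", ndots))
```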

Reference: memory feedback_coredns_ndots_template_collision. Wider follow-up:
the CoreDNS template plugin should add fallthrough for external public suffixes,
so every FC service calling external HTTPS APIs stops hitting this trap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:42:17 -05:00
Andrew Stoltz
11c48bef30 chore(fc-llm-bridge): bump to v202604231449 (Budget 1.0.1 multi-provider dispatcher)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:36:05 -05:00
Andrew Stoltz
a86e87050b fix(fc-llm-bridge): anthropic secret key is 'password' not 'credential'
The 1Password item "Claude API Key" stores the key in a standard Password
field (labeled `password`), so the OnePasswordItem operator creates the K8s
Secret with key `password`. Deployment was referencing `credential`, which
made the pod fail with CreateContainerConfigError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:29:32 -05:00
Andrew Stoltz
0214f94ac4 chore(fc-llm-bridge): bump image to v202604231424 (first live tag)
Built from FlowerCore.LlmBridge@6d285b5 (initial scaffold). Imported on all
three RKE2 nodes via podman save + ctr import. Replaces v00000000000000
placeholder — ArgoCD sync will roll the pod.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:28:05 -05:00
Andrew Stoltz
a1b8eb379d feat(fc-llm-bridge): stage ADR-088 manifests (not yet applied)
Staged but NOT applied. Do not git push until the three pre-requisites below
are done. See apps/fc-llm-bridge/README.md for the full order-of-ops.

Manifests (apps/fc-llm-bridge/fc-llm-bridge.yaml, 8 docs):
  - Namespace fc-llm-bridge
  - OnePasswordItem anthropic-api-key (existing Claude API Key item)
  - OnePasswordItem fc-llm-bridge-api-keys (NEW item, pending creation)
  - PersistentVolumeClaim fc-llm-bridge-data (2Gi longhorn)
  - Deployment fc-llm-bridge (port 8080, uid 1654, readOnlyRootFilesystem,
    tcpSocket probes to survive future ApiKeyAuthMiddleware reordering)
  - Service fc-llm-bridge ClusterIP
  - Certificate fc-llm-bridge-cert (step-ca-acme)
  - IngressRoute fc-llm-bridge (fc-llm-bridge.iamworkin.lan, websecure)

Pre-requisites BEFORE git push:
  1. pfSense Unbound override fc-llm-bridge.iamworkin.lan -> 10.0.56.200
     (currently NXDOMAIN -- verified via nslookup and check-pfsense-dns.py).
     Skipping this step puts cert-manager HTTP-01 into ~2h backoff.
  2. Create 1Password item `FC LLM Bridge API Keys` in vault IAmWorkin with
     password fields: agent-zero-ws, agent-zero-k8s, spare-1, spare-2.
  3. Build + import localhost/fc-llm-bridge:v<tag> to rke2-server +
     rke2-agent1 + rke2-agent2. Bump image tag from placeholder
     v00000000000000 before committing the apply.

Related: ADR-088 (FlowerCore.Notes/ARCHITECTURE.md), design doc at
FlowerCore.Notes/docs/ai-agents/agent-zero-anthropic-bridge.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 03:10:36 -05:00
Andrew Stoltz
9a1665907c fc-signalcontrol: align live port and selectors 2026-04-22 23:22:14 -05:00
Andrew Stoltz
899804215a statefulsets: align guacamole and matrix drift defaults 2026-04-22 23:11:47 -05:00
Andrew Stoltz
1dc66738e6 zabbix: align postgres tracking label 2026-04-22 22:50:24 -05:00
Andrew Stoltz
5623a272c5 zabbix: include statefulset defaults 2026-04-22 22:39:31 -05:00
Andrew Stoltz
3d3f91160b monitoring: add Print.Web Ollama Zabbix template 2026-04-22 22:07:40 -05:00
Andrew Stoltz
93f77c1844 fix(monitoring): use bluejay_v2 auth for snmp-nas (not public_v2)
Synology NAS is configured with community bluejay_monitor
(→ snmp.yml auth 'bluejay_v2'), not public. public_v2 was returning
HTTP 500 from snmp-exporter for this target. Verified bluejay_v2
returns metrics.

Keeps printer (10.0.58.107) on public_v2 — Epson ET-3750 uses
community "public" as documented in its SNMP settings.
2026-04-22 21:32:14 -05:00
Andrew Stoltz
59efc460fd fix(irc): use short name for unrealircd in anope + thelounge configs
Same CoreDNS iamworkin.lan template + ndots:5 hijack as the irc-notify fix.
Anope services (nickserv/chanserv/memo) have been disconnected from unrealircd
for weeks ("Host is unreachable" every 3s). Thelounge server defaults pointed
at the same broken FQDN.

Short name unrealircd.irc.svc resolves to the ClusterIP directly.
2026-04-22 21:23:38 -05:00
Andrew Stoltz
a3aa84bdae fc-ttsreader: bump image to v20260422201135 (Quick Read highlight no-reflow fix)
Quick Read's active-sentence highlight was changing font-weight from
regular to semibold, which shifted glyph widths and reflowed the whole
paragraph mid-playback. New image drops the weight change and uses a
1px box-shadow ring instead for a stable layout.

Built from FlowerCore.TtsReader@e77d69d.
2026-04-22 20:20:30 -05:00
Andrew Stoltz
01cb9a557f fc-ttsreader: deploy fixed reader image 2026-04-22 16:13:15 -05:00
Andrew Stoltz
0fa46ad53b fc-ttsreader: deploy reader UI split image 2026-04-22 15:57:58 -05:00
Andrew Stoltz
1ded5a61c0 fc-segmentdisplay: add TLS ingress gitops stub 2026-04-22 15:55:54 -05:00
Andrew Stoltz
3c1d212251 fc-messageboard: deploy latest web image via gitops 2026-04-22 15:48:05 -05:00
Andrew Stoltz
c0547a9964 fc-signalcontrol: switch probes to tcpSocket — middleware blocks /health
The app's ApiKeyAuthenticationMiddleware runs BEFORE /health is mapped, so
unauthenticated probe requests get 404. tcpSocket probes verify the listener
is up without auth, which is correct for an internal K8s probe (kubelet
talks pod IP directly, not externally).

Real fix is in the app: move /health before the middleware or mark it
[AllowAnonymous]. Tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 15:21:04 -05:00
Andrew Stoltz
973c1dae72 fc-signalcontrol: fix probe path /metrics/prometheus -> /health
The app exposes /health (Program.cs:91 maps a Healthy text response) but does
NOT expose /metrics/prometheus. K8s liveness/readiness probes against a 404
endpoint kept the pod in CrashLoopBackOff after PVC mount was added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 15:15:07 -05:00
Andrew Stoltz
475737b36f fc-signalcontrol: add PVC + volumeMount for SQLite data dir
Live cluster had a Longhorn PVC `signalcontrol-data` mounted at /app/data
since 2026-04-14, but the bluejay-infra git manifest never declared it. As a
result, when ArgoCD recreated the Deployment from git (after deletion to fix
an unrelated selector-label mismatch caught during cert-manager recovery),
the new pod started without /app/data and crashed with `SQLite Error 14:
unable to open database file 'data/signalcontrol.db'`.

Bring git in line with reality: declare the PVC, mount it, and switch the
Deployment to `strategy.type: Recreate` (RWO PVC blocks rolling updates per
existing memory feedback_k8s_rwo_rollout.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 15:10:10 -05:00
Andrew Stoltz
3bb3801fbd fix(monitoring): use short service name for irc-notify IRC_HOST
CoreDNS iamworkin.lan template + ndots:5 was hijacking
unrealircd.irc.svc.cluster.local lookups → Traefik VIP → timeout.
Every alert since ~2026-04-09 silently failed with "IRC send failed: timed out",
which also killed the thermal-printer path (routed through irc-notify).

Same fix pattern as guacamole@28b7600.
2026-04-22 09:55:23 -05:00
Andrew Stoltz
28b76001a8 fix(guacamole): use short service name for GUAC_URL (CoreDNS template collision)
The guac-k8s-sync CronJob has been crash-looping (exit 7) since the
2026-04-11 run. Root cause: CoreDNS has an `*.iamworkin.lan`
template wildcard, and the Kubernetes pod resolv.conf ships with
`ndots:5` plus a search list that includes `iamworkin.lan`.

Resolving `guacamole.guacamole.svc.cluster.local` (4 dots < 5) goes
through search-suffix expansion BEFORE the bare FQDN. The iamworkin.lan
suffix makes it `guacamole.guacamole.svc.cluster.local.iamworkin.lan`,
which matches the template and answers with Traefik LB VIP
10.0.56.200. That VIP has no pod-network hairpin route, so curl exits
with 'No route to host'.

Using the short name `http://guacamole:8080` keeps the query at 0
dots, search expansion runs on the bare name, and the in-namespace
`guacamole.svc.cluster.local` suffix hits the Kubernetes CoreDNS
plugin directly (ClusterIP 10.43.229.31).

Alt fixes considered but not taken: trim the CoreDNS template regex
to exclude `.svc.cluster.local.` prefixes (cross-cutting, higher
blast radius); trailing-dot FQDN in the URL (curl/Java HTTP clients
handle inconsistently).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:52:53 -05:00
Andrew Stoltz
0c67fa5356 asterisk: add *832 test-entry dialplan for VDAY workflow AATs
Lets live SIP AATs (ext 901–904, from-internal context) dial *832 to
exercise the Victory Day workflow + Fun Menu + AsteriskGameHandler path
without routing through Twilio. Mnemonic: *832 = V-D-A (8-3-2) from the
V-D-A-Y keypad pattern.

Maps to Stasis(flowercore-pbx,inbound-pstn,+15074618329) — same call-
type classification as a real Twilio-inbound call to the VDAY DID, so
InboundPstnHandler routes to the seeded VDAY workflow identically.
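
The mnemonic can be checked with a small keypad lookup. This is an
illustrative snippet using the standard telephone keypad layout; it is not
part of the dialplan.

```python
# Standard telephone keypad letter groups.
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}


def to_digits(word):
    """Translate letters to the keypad digits they sit on."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in word.upper())


print(to_digits("VDA"))   # 832
print(to_digits("VDAY"))  # 8329
print("+15074618329".endswith(to_digits("VDAY")))  # True
```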

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:51:49 -05:00
Andrew Stoltz
62e342cfb2 guacamole: consolidate nodeSelector — use rke2-server for guacd too
Previous commit 90deacd raced with the user's f0733ff (which had
already pinned the guacamole web Deployment to rke2-server for the
NFS ACL). That left two nodeSelector blocks on the web pod and an
inconsistent agent2 pin on guacd. Align both pods to rke2-server.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:36:25 -05:00
Andrew Stoltz
90deacd154 guacamole: pin guacd + web to rke2-agent2 for NFS recordings mount
Synology NFS export at /volume1/kubernetes currently grants mount
permission only to 10.0.56.13 (rke2-agent2). rke2-agent1 gets
"access denied by server". guacd + guacamole web both need the
recordings volume, so co-locating is also efficient. Remove the
nodeSelector once the Synology NFS ACL opens to all cluster nodes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:35:13 -05:00
Andrew Stoltz
f0733ff89d feat(guacamole): wire 1Password vault extension + logback into deployment
Adds the 1Password vault JAR to the Guacamole pod so connection params
like ${OP:ItemTitle/fieldLabel} are resolved from 1Password Connect at
tunnel-open time. Credentials never land in MySQL — only token literals.

Deployment changes:
- env: OP_CONNECT_URL=http://10.0.56.10:8180, OP_VAULT_ID=..., plus
  OP_CONNECT_TOKEN from secret/guacamole-1password-token/credential.
- env: ENABLE_ENVIRONMENT_PROPERTIES=true so OP_* env vars render as
  op-connect-url / op-connect-token / op-vault-id properties the
  extension reads.
- volumeMount for guacamole-vault-jar at
  /etc/guacamole/extensions/guacamole-vault-1password-1.0.0.jar
- volumeMount for guacamole-logback so we see DEBUG token-inject lines.
- nodeSelector kubernetes.io/hostname=rke2-server — the Synology NFS
  export for /volume1/kubernetes currently only allows rke2-server.
  Followup: add rke2-agent1/2 to the export and remove this selector.

New ConfigMaps:
- guacamole-vault-jar (binaryData, ~312KB JAR, Gson shaded, built from
  FlowerCore.Notes/k8s/guacamole/extensions/1password-vault via mvn).
- guacamole-logback with DEBUG on io.flowercore.guacamole.vault — drop
  to INFO once resolution is proven stable.

Existing guacamole-properties: added onepassword-vault to extension-priority.

The guacamole-1password-token Secret is NOT in git — it holds a verbatim
copy of the onepassword-connect-operator bearer token. Followup task:
provision a scoped Connect token for Guacamole and rotate the copy out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:32:51 -05:00
Andrew Stoltz
313bdcb21a guacamole: NFS subPath — Synology exports /volume1/kubernetes root only
First pass used nfs.path=/volume1/kubernetes/guacamole/recordings,
which triggered "mount.nfs: access denied by server" on rke2-agent1.
Synology NFS export is scoped to /volume1/kubernetes; match the
working fc-desktop pattern: mount the export root and select the
subdirectory via volumeMount.subPath.
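
A rough sketch of the pattern, written as Python dicts rather than the
actual manifest. The volume name is hypothetical; the server address and
mount path are taken from the session-recording commit (5f4818b).

```python
import json

EXPORT_ROOT = "/volume1/kubernetes"  # the only path the NFS export allows


def nfs_subpath_mount(name, server, sub_path, mount_path):
    """Mount the export root and pick the subdirectory with subPath."""
    volume = {"name": name, "nfs": {"server": server, "path": EXPORT_ROOT}}
    mount = {"name": name, "mountPath": mount_path, "subPath": sub_path}
    return volume, mount


volume, mount = nfs_subpath_mount(
    name="recordings",  # hypothetical volume name
    server="10.0.58.3",
    sub_path="guacamole/recordings",
    mount_path="/var/lib/guacamole/recordings",
)
print(json.dumps({"volume": volume, "volumeMount": mount}, indent=2))
```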

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:23:49 -05:00
Andrew Stoltz
5f4818bd96 guacamole: wire session recording to Synology NFS
Phase 5 of docs/infrastructure/guacamole-customization-plan.md:

- Mount /volume1/kubernetes/guacamole/recordings (Synology 10.0.58.3)
  into both guacd (writer) and guacamole web (reader) at
  /var/lib/guacamole/recordings
- Set RECORDING_SEARCH_PATH env on guacamole web -- the Guacamole
  Docker entrypoint treats any RECORDING_* var as an enable signal
  for the history-recording-storage extension (symlinks the JAR
  from /opt/guacamole/environment/RECORDING_/extensions/ into
  GUACAMOLE_HOME/extensions/)

Per-connection recording still requires setting recording-path on
each connection in MySQL -- follow-up task. This commit enables
the plumbing; no sessions record yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:15:55 -05:00
Andrew Stoltz
fff998dab5 matrix, zabbix: add volumeMode to postgres PVC templates
Same ArgoCD + SSA self-heal loop pattern as guacamole (20e4130):
K8s defaults volumeMode=Filesystem on volumeClaimTemplates at
creation, git omits it, argocd-controller owns the atomic list so
every reconcile sees drift, and volumeClaimTemplates is immutable
so it can never reconcile. Adding the field closes both loops.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 14:48:43 -05:00
Andrew Stoltz
20e4130c74 guacamole: add volumeMode to guac-mysql PVC template
Closes the infra-guacamole OutOfSync sync loop. K8s API sets
volumeMode=Filesystem as a default on volumeClaimTemplates at creation,
but the git manifest omitted it. ArgoCD uses ServerSideApply with
atomic ownership of volumeClaimTemplates, so every sync saw a
desired/live mismatch on that one field. volumeClaimTemplates is
immutable after creation so ArgoCD could never reconcile it --
autoHealAttemptsCount climbed to 6091. Adding the field to git
matches live and breaks the loop.
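
The loop can be modelled in a few lines. This is a deliberately simplified
sketch, not ArgoCD's diff logic: it models only the volumeMode default and
compares the template list as one atomic value. The template contents are
made up for illustration.

```python
import copy


def with_api_defaults(template):
    """Mimic the one API-server default discussed here: volumeMode=Filesystem."""
    live = copy.deepcopy(template)
    live.setdefault("spec", {}).setdefault("volumeMode", "Filesystem")
    return live


def has_drift(desired_templates, live_templates):
    """Atomic ownership: the lists are compared as a whole."""
    return desired_templates != live_templates


# Hypothetical PVC template as it would look in git without volumeMode.
git_template = {
    "metadata": {"name": "data"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "1Gi"}},
    },
}

drift_before = has_drift([git_template], [with_api_defaults(git_template)])

fixed_template = copy.deepcopy(git_template)
fixed_template["spec"]["volumeMode"] = "Filesystem"
drift_after = has_drift([fixed_template], [with_api_defaults(fixed_template)])

print(drift_before, drift_after)  # True False
```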

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 14:29:40 -05:00
Andrew Stoltz
3cf675b8c3 ttsreader: wire operator secrets through 1password 2026-04-17 10:05:24 -05:00
Andrew Stoltz
2a9f2e4540 Improve TTS Reader workspace card layout 2026-04-17 03:57:23 -05:00
Andrew Stoltz
b15a35a258 Fix TTS Reader character layout 2026-04-17 03:48:03 -05:00
Andrew Stoltz
3f4985ee13 Deploy TTS Reader queue feedback fix 2026-04-17 03:34:28 -05:00
Andrew Stoltz
e535a8d34b Deploy TTS Reader voice preview update 2026-04-17 02:13:09 -05:00
Andrew Stoltz
6ddbd2cae5 Point TTS Reader at Pi Ollama defaults 2026-04-17 00:53:45 -05:00
Andrew Stoltz
e9608651f7 Bump TTS Reader image to v20260417001119 2026-04-17 00:33:29 -05:00
Andrew Stoltz
abdb7a806e Bump TTS Reader image to v20260416234817 2026-04-16 23:53:42 -05:00
Andrew Stoltz
7afb5043c4 Fix ttsreader forwarded header handling 2026-04-16 21:55:46 -05:00