- Serves GET /manifests/{edition}/{version}.cert (leaf+intermediate PEM)
- Adds CertChainPem migration on startup (nullable column)
- ManifestSignService now embeds version-specific certChainUrl
Provisioning Agent's verify step will flip from ChainNotServed (Phase 2A
soft-pass) to Valid once a fresh edition is published with this image.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the pre-merge DNS gate to (optionally) scan live-cluster
Certificates + IngressRoutes via kubectl. Closes the coverage hole
where a service's IngressRoute gets deployed from its own repo (not
from bluejay-infra/apps/) and the manifests-only scan misses it —
fc-retail/retail-web-tls stuck Issuing for 15h on a missing pfSense
Unbound override was exactly this class of bug.
Auto mode: if kubectl is on PATH and usable, live-scan runs silently.
--live forces it (and errors out if kubectl can't reach the cluster).
--no-live skips live entirely (CI path with no cluster access).
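The three modes above can be sketched as a small decision function. This is an illustration only, not the gate's actual code: the names `decide_live` and `kubectl_usable` are hypothetical, and probing usability with `kubectl cluster-info` is an assumption about what "usable" means.

```python
import shutil
import subprocess


def kubectl_usable(timeout=10):
    """Assumed probe: kubectl is on PATH and can reach the cluster."""
    if shutil.which("kubectl") is None:
        return False
    try:
        result = subprocess.run(
            ["kubectl", "cluster-info", "--request-timeout=5s"],
            capture_output=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def decide_live(mode, usable):
    """Return True if the live scan should run.

    mode is "auto" (no flag), "live" (--live) or "no-live" (--no-live).
    """
    if mode == "no-live":
        return False          # CI path: never touch the cluster
    if mode == "live":
        if not usable:
            raise RuntimeError("--live requested but kubectl cannot reach the cluster")
        return True
    return usable             # auto: run silently only when kubectl works


if __name__ == "__main__":
    print(decide_live("auto", kubectl_usable()))
```

Keeping the decision separate from the probe makes the mode logic testable without a cluster.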
Immediate live-scan finding on 2026-04-23: 10 orphan *.iamworkin.lan
IngressRoutes from failed e2e / codex / smoke / deleteproof test runs
in fc-php + fc-tenant-default (2026-04-16/17). None have DNS overrides
so their Certificates have been failing to issue for 7 days — the new
CertManagerCertificateNotReady alert will catch them too. Cleanup
(delete abandoned IngressRoutes + Certificates + CertificateRequests)
is a separate task; this check now surfaces them.
The full FQDN fc-llm-bridge.fc-llm-bridge.svc.cluster.local has 4 dots,
which is less than the pod's ndots:5 threshold. The resolver then
applies every entry in the search list BEFORE falling through to the
bare FQDN, and the CoreDNS 'template iamworkin.lan' catch-all matches
"...svc.cluster.local.iamworkin.lan" and returns Traefik VIP
10.0.56.200. The egress NetworkPolicy blocks that VIP (0.0.0.0/0
EXCEPT 10.0.0.0/8), so curl hangs for 30-134s and returns HTTP 000.
Reference: feedback_coredns_ndots_template_collision memory.
Fix: use "fc-llm-bridge.fc-llm-bridge.svc" (2 dots, still <5 so search
expansion still fires, but the "cluster.local" search suffix completes it
to "...svc.cluster.local", which hits the Kubernetes plugin in CoreDNS and
returns the real ClusterIP 10.43.67.125 before the iamworkin.lan template
is ever consulted).
Verified: pod-exec curl fc:cheap → HTTP 200 with a real chat.completion
envelope (Ollama/gemma3:4b via bridge).
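The lookup order described above can be reproduced with a rough model of a glibc-style stub resolver. The search list is an assumption: the namespace "agent-zero" is hypothetical, the three cluster suffixes are the usual Kubernetes defaults, and only the iamworkin.lan entry comes from this commit. The resolver stops at the first candidate that returns an answer.

```python
def resolver_candidates(name, search, ndots=5):
    """Return the names a glibc-style stub resolver tries, in order."""
    if name.endswith("."):
        return [name]                      # trailing dot: absolute, no expansion
    as_is = name + "."
    expanded = [f"{name}.{suffix}." for suffix in search]
    if name.count(".") >= ndots:
        return [as_is] + expanded          # enough dots: try the bare name first
    return expanded + [as_is]              # too few dots: search list first


# Assumed search list (namespace name is hypothetical).
SEARCH = [
    "agent-zero.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "iamworkin.lan",
]

full = resolver_candidates("fc-llm-bridge.fc-llm-bridge.svc.cluster.local", SEARCH)
short = resolver_candidates("fc-llm-bridge.fc-llm-bridge.svc", SEARCH)

for label, names in (("full FQDN", full), ("short name", short)):
    print(label)
    for candidate in names:
        print("  ", candidate)
```

With the full FQDN, the iamworkin.lan expansion is tried before the bare name, and the template answers it. With the short name, the cluster.local expansion produces the real service name one step before the iamworkin.lan expansion.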
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The chat_model flip (62db15c) pointed Agent Zero at
fc-llm-bridge.fc-llm-bridge.svc.cluster.local:8080 but the existing
agent-zero-netpol only allowed egress to specific node IPs
(10.0.56.20:11434, 10.0.57.17:11434, 10.0.57.16:5200, 10.0.56.11:6443)
plus public-internet (with RFC1918 exclusion). ClusterIP traffic to
10.43.0.0/16 was implicitly denied, so pod-exec curl to the bridge
timed out after 134s.
Adds an egress rule allowing TCP 8080 to the fc-llm-bridge namespace
(matched by kubernetes.io/metadata.name which K8s 1.22+ sets
automatically). No ingress changes needed — fc-llm-bridge has no
NetworkPolicy, so the ingress side is already open.
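Why the ClusterIP was denied before this change can be checked with a small model of the old egress rules. This is a sketch under assumptions: the node endpoints are the ones listed above, the public-internet rule is modelled as 0.0.0.0/0 except 10.0.0.0/8 with no port restriction, and the real policy may exclude further RFC1918 ranges. The new rule matches by namespace label rather than by CIDR, so it is not modelled here.

```python
import ipaddress

# (ip, port) pairs explicitly allowed by the old agent-zero-netpol.
NODE_ENDPOINTS = {
    ("10.0.56.20", 11434),
    ("10.0.57.17", 11434),
    ("10.0.57.16", 5200),
    ("10.0.56.11", 6443),
}
PUBLIC_CIDR = "0.0.0.0/0"
PUBLIC_EXCEPT = ["10.0.0.0/8"]   # assumption: only this range is excluded


def ipblock_allows(dest, cidr, excepts):
    """Mimic a NetworkPolicy ipBlock: inside cidr and outside every except range."""
    ip = ipaddress.ip_address(dest)
    if ip not in ipaddress.ip_network(cidr):
        return False
    return not any(ip in ipaddress.ip_network(e) for e in excepts)


def old_policy_allows(dest, port):
    """True if the pre-change egress rules would let this connection out."""
    if (dest, port) in NODE_ENDPOINTS:
        return True
    return ipblock_allows(dest, PUBLIC_CIDR, PUBLIC_EXCEPT)


print(old_policy_allows("10.43.67.125", 8080))   # bridge ClusterIP
print(old_policy_allows("10.0.56.20", 11434))    # allowed node endpoint
```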
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flips Agent Zero's chat_model from direct local Ollama (gemma3:12b via
the 127.0.0.1:11434 sidecar proxy) to the FlowerCore LLM Bridge
(fc:balanced tier, OpenAI-compatible, Anthropic Claude Sonnet under the
hood) so chat turns are spend-tracked and can dispatch to any provider
via a single tier alias.
Scope is intentionally minimal and reversible:
- chat_model: ollama/gemma3:12b/127.0.0.1:11434
→ openai/fc:balanced/fc-llm-bridge internal service URL
- utility_model, embedding_model, browser_model: UNCHANGED
(stay on local 127.0.0.1 Ollama sidecar — no spend, low latency,
not worth routing through the bridge for small-model traffic).
Auth: new A0_SET_chat_model_api_key env var wired to the
fc-llm-bridge-api-keys Secret (field: agent-zero-k8s). The Secret is
synced by a new OnePasswordItem pointing at "FC LLM Bridge API Keys"
in the IAmWorkin vault. Bearer-token auth is now accepted by the
bridge (FlowerCore.LlmBridge@3225f1f).
Rollback: revert this commit; old image v202604231449 is still present
on all RKE2 nodes, and Agent Zero's strategy: Recreate makes the flip
atomic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships the Bearer-token auth fix (FlowerCore.LlmBridge@3225f1f) so Agent
Zero's OpenAI provider can authenticate with Authorization: Bearer in
addition to the original X-Api-Key header.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pods in this cluster inherit ndots=5. External FQDNs with <5 dots (like
api.anthropic.com) are expanded through the search path first, and the 4th
suffix `api.anthropic.com.iamworkin.lan` matches CoreDNS' `template IN A
iamworkin.lan` wildcard — resolves to Traefik VIP 10.0.56.200. TLS connect
lands on Traefik's default cert and the AnthropicClient rejects with
RemoteCertificateNameMismatch/RemoteCertificateChainErrors.
Setting ndots=2 makes the resolver try the bare FQDN first (api.anthropic.com
has 2 dots, which meets the threshold), so the search path never fires.
Reference: memory feedback_coredns_ndots_template_collision. Wider follow-up:
the CoreDNS template plugin should add fallthrough for external public suffixes,
so every FC service calling external HTTPS APIs stops hitting this trap.
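For reference, the usual way to lower ndots for a single workload is the pod-level dnsConfig field. The snippet below builds such a patch as a plain dict. It is a sketch only: this message does not say how ndots=2 was set, so the Deployment patch shape and the helper name `ndots_dns_config` are assumptions.

```python
import json


def ndots_dns_config(ndots):
    """Pod spec fragment that overrides the resolver's ndots option."""
    # Kubernetes expects option values as strings.
    return {"dnsConfig": {"options": [{"name": "ndots", "value": str(ndots)}]}}


# Assumed shape: a patch applied to a Deployment's pod template.
patch = {"spec": {"template": {"spec": ndots_dns_config(2)}}}
print(json.dumps(patch, indent=2))
```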
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 1Password item "Claude API Key" stores the key in a standard Password
field (labeled `password`), so the OnePasswordItem operator creates the K8s
Secret with key `password`. Deployment was referencing `credential`, which
made the pod fail with CreateContainerConfigError.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Built from FlowerCore.LlmBridge@6d285b5 (initial scaffold). Imported on all
three RKE2 nodes via podman save + ctr import. Replaces v00000000000000
placeholder — ArgoCD sync will roll the pod.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Staged but NOT applied. Do not git push until the three pre-requisites below
are done. See apps/fc-llm-bridge/README.md for the full order-of-ops.
Manifests (apps/fc-llm-bridge/fc-llm-bridge.yaml, 8 docs):
- Namespace fc-llm-bridge
- OnePasswordItem anthropic-api-key (existing Claude API Key item)
- OnePasswordItem fc-llm-bridge-api-keys (NEW item, pending creation)
- PersistentVolumeClaim fc-llm-bridge-data (2Gi longhorn)
- Deployment fc-llm-bridge (port 8080, uid 1654, readOnlyRootFilesystem,
tcpSocket probes to survive future ApiKeyAuthMiddleware reordering)
- Service fc-llm-bridge ClusterIP
- Certificate fc-llm-bridge-cert (step-ca-acme)
- IngressRoute fc-llm-bridge (fc-llm-bridge.iamworkin.lan, websecure)
Pre-requisites BEFORE git push:
1. pfSense Unbound override fc-llm-bridge.iamworkin.lan -> 10.0.56.200
(currently NXDOMAIN -- verified via nslookup and check-pfsense-dns.py).
Skipping this step puts cert-manager HTTP-01 into ~2h backoff.
2. Create 1Password item `FC LLM Bridge API Keys` in vault IAmWorkin with
password fields: agent-zero-ws, agent-zero-k8s, spare-1, spare-2.
3. Build + import localhost/fc-llm-bridge:v<tag> to rke2-server +
rke2-agent1 + rke2-agent2. Bump image tag from placeholder
v00000000000000 before committing the apply.
Related: ADR-088 (FlowerCore.Notes/ARCHITECTURE.md), design doc at
FlowerCore.Notes/docs/ai-agents/agent-zero-anthropic-bridge.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Synology NAS is configured with community bluejay_monitor
(→ snmp.yml auth 'bluejay_v2'), not public. public_v2 was returning
HTTP 500 from snmp-exporter for this target. Verified bluejay_v2
returns metrics.
Keeps printer (10.0.58.107) on public_v2 — Epson ET-3750 uses
community "public" as documented in its SNMP settings.
Same CoreDNS iamworkin.lan template + ndots:5 hijack as the irc-notify fix.
Anope services (nickserv/chanserv/memo) have been disconnected from unrealircd
for weeks ("Host is unreachable" every 3s). Thelounge server defaults pointed
at the same broken FQDN.
Short name unrealircd.irc.svc resolves to the ClusterIP directly.
Adds a real README describing the 4-step deploy flow, with pfSense Unbound
host overrides as step 1 (the prerequisite that, if skipped, silently breaks
cert-manager HTTP-01 for ~2h per cert until manually diagnosed — root cause
of the 2026-04-22 cluster-wide cert outage).
Adds scripts/check-pfsense-dns.py: parses every apps/*/*.yaml, extracts
hostnames from Certificate.spec.dnsNames and Traefik IngressRoute
`Host(...)` match rules, and fails the check if any don't resolve via the
system DNS (pfSense Unbound on this LAN). Ignores IRC server-link labels,
image tags, comments — only checks hostnames cert-manager and Traefik will
actually use.
Run before `git push` or wire into pre-commit / Gitea Actions.
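The extraction and resolution steps can be sketched as follows. This is not the contents of scripts/check-pfsense-dns.py: it is a simplified, line-based illustration that avoids a YAML parser, and the function names and sample manifest are hypothetical.

```python
import re
import socket

HOST_RULE = re.compile(r"Host\(([^)]*)\)")   # Traefik Host(...) matcher
BACKTICKED = re.compile(r"`([^`]+)`")        # names inside the matcher


def extract_hostnames(manifest_text):
    """Collect hostnames from dnsNames lists and Host(...) rules."""
    hosts = set()
    in_dns_names = False
    for raw in manifest_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()      # drop comments
        stripped = line.strip()
        if not stripped:
            continue
        if in_dns_names:
            if stripped.startswith("- "):
                hosts.add(stripped[2:].strip().strip("\"'"))
                continue
            in_dns_names = False                   # list ended
        if stripped == "dnsNames:":
            in_dns_names = True
            continue
        for rule in HOST_RULE.findall(line):
            hosts.update(BACKTICKED.findall(rule))
    return hosts


def unresolved(hostnames):
    """Return the hostnames the system resolver cannot resolve."""
    missing = []
    for host in sorted(hostnames):
        try:
            socket.getaddrinfo(host, None)
        except socket.gaierror:
            missing.append(host)
    return missing


SAMPLE = """
kind: Certificate
spec:
  dnsNames:
    - fc-llm-bridge.iamworkin.lan
  issuerRef:
    name: step-ca-acme
---
kind: IngressRoute
spec:
  routes:
    - match: Host(`fc-llm-bridge.iamworkin.lan`) && PathPrefix(`/v1`)
    # - match: Host(`old.iamworkin.lan`)
"""

print(sorted(extract_hostnames(SAMPLE)))
```

A caller would exit non-zero when `unresolved(...)` returns anything, which is what makes it usable as a pre-push gate.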
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Quick Read's active-sentence highlight was changing font-weight from
regular to semibold, which shifted glyph widths and reflowed the whole
paragraph mid-playback. New image drops the weight change and uses a
1px box-shadow ring instead for a stable layout.
Built from FlowerCore.TtsReader@e77d69d.
The app's ApiKeyAuthenticationMiddleware runs BEFORE /health is mapped, so
unauthenticated probe requests get 404. tcpSocket probes verify the listener
is up without auth, which is correct for an internal K8s probe (kubelet
talks pod IP directly, not externally).
Real fix is in the app: move /health before the middleware or mark it
[AllowAnonymous]. Tracked separately.
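What a tcpSocket probe checks is roughly the following: a TCP connection can be opened to the port, and no HTTP request is sent, so no middleware runs. This is a minimal sketch of that behaviour, not kubelet's implementation; `tcp_probe` is a hypothetical helper.

```python
import socket


def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Port 8080 is an assumption for illustration.
    print(tcp_probe("127.0.0.1", 8080))
```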
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The app exposes /health (Program.cs:91 maps a Healthy text response) but does
NOT expose /metrics/prometheus. K8s liveness/readiness probes against a 404
endpoint kept the pod in CrashLoopBackOff after PVC mount was added.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Live cluster had a Longhorn PVC `signalcontrol-data` mounted at /app/data
since 2026-04-14, but the bluejay-infra git manifest never declared it. As a
result, when ArgoCD recreated the Deployment from git (after deletion to fix
an unrelated selector-label mismatch caught during cert-manager recovery),
the new pod started without /app/data and crashed with `SQLite Error 14:
unable to open database file 'data/signalcontrol.db'`.
Bring git in line with reality: declare the PVC, mount it, and switch the
Deployment to `strategy.type: Recreate` (RWO PVC blocks rolling updates per
existing memory feedback_k8s_rwo_rollout.md).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CoreDNS iamworkin.lan template + ndots:5 was hijacking
unrealircd.irc.svc.cluster.local lookups → Traefik VIP → timeout.
Every alert since ~2026-04-09 silently failed with "IRC send failed: timed out",
which also killed the thermal-printer path (routed through irc-notify).
Same fix pattern as guacamole@28b7600.
The guac-k8s-sync CronJob has been crash-looping (exit 7) since the
2026-04-11 run. Root cause: CoreDNS has an `*.iamworkin.lan`
template wildcard, and the Kubernetes pod resolv.conf ships with
`ndots:5` plus a search list that includes `iamworkin.lan`.
Resolving `guacamole.guacamole.svc.cluster.local` (4 dots < 5) goes
through search-suffix expansion BEFORE the bare FQDN. The iamworkin.lan
suffix makes it `guacamole.guacamole.svc.cluster.local.iamworkin.lan`,
which matches the template and answers with Traefik LB VIP
10.0.56.200. That VIP has no pod-network hairpin route, so curl exits
with 'No route to host'.
Using the short name `http://guacamole:8080` keeps the query at 0
dots, search expansion runs on the bare name, and the in-namespace
`guacamole.svc.cluster.local` suffix hits the Kubernetes CoreDNS
plugin directly (ClusterIP 10.43.229.31).
Alt fixes considered but not taken: trim the CoreDNS template regex
to exclude `.svc.cluster.local.` prefixes (cross-cutting, higher
blast radius); trailing-dot FQDN in the URL (curl/Java HTTP clients
handle inconsistently).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets live SIP AATs (ext 901–904, from-internal context) dial *832 to
exercise the Victory Day workflow + Fun Menu + AsteriskGameHandler path
without routing through Twilio. Mnemonic: *832 = V-D-A (8-3-2) from the
V-D-A-Y keypad pattern.
Maps to Stasis(flowercore-pbx,inbound-pstn,+15074618329) — same call-
type classification as a real Twilio-inbound call to the VDAY DID, so
InboundPstnHandler routes to the seeded VDAY workflow identically.
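The mnemonic can be checked against the standard phone keypad letter layout. The helper name `to_digits` is hypothetical; the DID is the one in the Stasis arguments above.

```python
# Standard telephone keypad letter groups.
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}


def to_digits(word):
    """Translate letters to the keypad digits they sit on."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in word.upper())


print("*" + to_digits("VDA"))                        # feature code
print("+15074618329".endswith(to_digits("VDAY")))    # DID ends in V-D-A-Y
```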
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous commit 90deacd raced with the user's f0733ff (which had
already pinned the guacamole web Deployment to rke2-server for the
NFS ACL). That left two nodeSelector blocks on the web pod and an
inconsistent agent2 pin on guacd. Align both pods to rke2-server.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Synology NFS export at /volume1/kubernetes currently grants mount
permission only to 10.0.56.13 (rke2-agent2). rke2-agent1 gets
"access denied by server". guacd + guacamole web both need the
recordings volume, so co-locating is also efficient. Remove the
nodeSelector once the Synology NFS ACL opens to all cluster nodes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the 1Password vault JAR to the Guacamole pod so connection params
like ${OP:ItemTitle/fieldLabel} are resolved from 1Password Connect at
tunnel-open time. Credentials never land in MySQL — only token literals.
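The token syntax can be illustrated with a short parser. This is a sketch, not the extension's code (which is a Java JAR): the grammar beyond `${OP:ItemTitle/fieldLabel}` is an assumption (item titles without "/" or "}"), and `lookup` is a hypothetical stand-in for the 1Password Connect call. The item and field names in the example are made up.

```python
import re

# Assumed grammar: ${OP:<item title>/<field label>}
TOKEN = re.compile(r"\$\{OP:([^/}]+)/([^}]+)\}")


def find_op_tokens(value):
    """Return (item_title, field_label) pairs found in a parameter value."""
    return [(m.group(1), m.group(2)) for m in TOKEN.finditer(value)]


def resolve(value, lookup):
    """Replace each token with lookup(item_title, field_label)."""
    return TOKEN.sub(lambda m: lookup(m.group(1), m.group(2)), value)


def fake_lookup(item, field):
    # Stand-in for the real secret fetch; returns a visible placeholder.
    return f"<{item}:{field}>"


stored = "${OP:Lab Router/password}"     # what would sit in MySQL
print(find_op_tokens(stored))
print(resolve(stored, fake_lookup))
```

Only the literal token string is stored; the substituted value exists in memory at resolution time.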
Deployment changes:
- env: OP_CONNECT_URL=http://10.0.56.10:8180, OP_VAULT_ID=..., plus
OP_CONNECT_TOKEN from secret/guacamole-1password-token/credential.
- env: ENABLE_ENVIRONMENT_PROPERTIES=true so OP_* env vars render as
op-connect-url / op-connect-token / op-vault-id properties the
extension reads.
- volumeMount for guacamole-vault-jar at
/etc/guacamole/extensions/guacamole-vault-1password-1.0.0.jar
- volumeMount for guacamole-logback so we see DEBUG token-inject lines.
- nodeSelector kubernetes.io/hostname=rke2-server — the Synology NFS
export for /volume1/kubernetes currently only allows rke2-server.
Followup: add rke2-agent1/2 to the export and remove this selector.
New ConfigMaps:
- guacamole-vault-jar (binaryData, ~312KB JAR, Gson shaded, built from
FlowerCore.Notes/k8s/guacamole/extensions/1password-vault via mvn).
- guacamole-logback with DEBUG on io.flowercore.guacamole.vault — drop
to INFO once resolution is proven stable.
Existing guacamole-properties: added onepassword-vault to extension-priority.
The guacamole-1password-token Secret is NOT in git — it holds a verbatim
copy of the onepassword-connect-operator bearer token. Followup task:
provision a scoped Connect token for Guacamole and rotate the copy out.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First pass used nfs.path=/volume1/kubernetes/guacamole/recordings,
which triggered "mount.nfs: access denied by server" on rke2-agent1.
Synology NFS export is scoped to /volume1/kubernetes; match the
working fc-desktop pattern: mount the export root and select the
subdirectory via volumeMount.subPath.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 5 of docs/infrastructure/guacamole-customization-plan.md:
- Mount /volume1/kubernetes/guacamole/recordings (Synology 10.0.58.3)
into both guacd (writer) and guacamole web (reader) at
/var/lib/guacamole/recordings
- Set RECORDING_SEARCH_PATH env on guacamole web -- the Guacamole
Docker entrypoint treats any RECORDING_* var as an enable signal
for the history-recording-storage extension (symlinks the JAR
from /opt/guacamole/environment/RECORDING_/extensions/ into
GUACAMOLE_HOME/extensions/)
Per-connection recording still requires setting recording-path on
each connection in MySQL -- follow-up task. This commit enables
the plumbing; no sessions record yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same ArgoCD + SSA self-heal loop pattern as guacamole (20e4130):
K8s defaults volumeMode=Filesystem on volumeClaimTemplates at
creation, git omits it, argocd-controller owns the atomic list so
every reconcile sees drift, and volumeClaimTemplates is immutable
so it can never reconcile. Adding the field closes both loops.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>