NetworkPolicy matches the destination POD port. dms-web svc:80 -> containerPort
8080, so the egress must allow 8080 (the fc-chat rule already allows 80+8080,
which is why chat worked and dms timed out). Add 8080 to the fc-dms egress.
AZ only had fc_chat (chat-session) + fc_knowledge (RAG) — so it had no product
capabilities (the 'mysql manager' gap). Wire fc_dms (dynamic message signs, ~13
tools): OnePasswordItem dms-mcp-keys (1P 'FlowerCore DMS MCP Keys' field credential)
-> DMS_MCP_API_KEY -> X-Api-Key; builder adds fc_dms; netpol egress fc-dms:80.
Proven: dms-web/mcp returns 200 with this key. presentations/messageboard/
segmentdisplay/telephony 1P MCP-key items exist for the same pattern; mysql+signage
need 1P items provisioned first (mysql/mcp 401s with no key). Watch context budget.
The PROD-VLAN VIP 10.0.57.201 MTU-black-holes Agent Zero's ~150KB requests
(full prompt + 108 MCP tools) -> connection reset mid-stream -> AZ 'same message
again' loop. Switch FlowerCore__Chat__OllamaBaseUrl to the INFRA-VLAN NodePort
10.0.56.14:30976 (same VLAN as the old cluster, carries 150KB fine). Verified:
150KB POST = 200 via NodePort, times out via VIP. NodePort pinned to 30976 on GX10.
Image -> v20260614-arifix (Telephony 86ff0d0: ReceiveAsync no longer cancelled).
Remove the WebSocketKeepAliveTimeoutSeconds/WebSocketReceiveTimeoutSeconds=3600
band-aids; the code now disables the pong deadline by default and ignores the
receive timeout (liveness = keepalive ping + WebSocketException/Close).
Root cause: the live deploy carried env Tts__PiperUrl=edge1 (drifted, not in git)
which shadows appsettings Tts.PiperUrl. Codify Tts__PiperUrl=GX10 + Ari__ env to
match live so git is source-of-truth; the configmap edit alone was inert.
Sync the bluejay-profile ConfigMap's embedded system_prompt.md with the
rewritten scripts/agent-zero/agents/bluejay/system_prompt.md: Ollama section
-> GX10 hub (VIP 10.0.57.201, GB10/121GiB); model table with tool-calling
flags (qwen2.5 = tools, gemma3 = 400-on-tools/vision-only, nomic = embed);
new 'Models & Tool-Calling' + 'Knowledge & RAG' rule blocks; stripped dead
WSL/R9700/.132/host.docker.internal/port-30050 lanes; de-pinned test counts;
'Blu' team is persona vocabulary not a fixed team. Personality preserved.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cancelling ClientWebSocket.ReceiveAsync via CancellationToken ABORTS the
socket (a half-read WS frame can't resume). The per-iteration
iterationCts.CancelAfter(WebSocketReceiveTimeoutSeconds) therefore aborted a
healthy idle ARI WebSocket every 45s (state=Aborted), not the keepalive pong
(proven: loop persisted after pong-timeout 15s->3600s). A large receive
timeout lets ReceiveAsync block harmlessly while the PBX is idle; real drops
still surface immediately as WebSocketException -> reconnect. Proper code fix
(stop using CancelAfter on the receive) tracked separately.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Asterisk res_http_websocket does not reliably answer client PING frames
with PONG, so .NET KeepAliveTimeout (default 15s) aborted a healthy idle
ARI WebSocket every ~45s (ping@30s + pong-wait@15s), dropping StasisStart
events so the *100 IVR intermittently answered with no audio. Generous pong
timeout stops the false aborts; genuine drops still caught by the 45s
receive-timeout state re-check and TCP-level WebSocketException.
Surfaced by FlowerCore.Telephony.SipTests Call_Star100_ReceivesAudibleAudioStream
(0 RTP packets while ExtToExt RTP-hook passed).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Influence Audit panel now surfaces the per-turn effective prompt
(RagContextSnapshot) as an operator/debug row. FlowerCore.Chat d389e4b.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stored/displayed user message is now the raw prompt; injected scaffolding
(mood contract + guidance + memory) goes to the model via ragContext as a
system message and is captured in RagContextSnapshot for debug.
FlowerCore.Chat 37f57b0 + FlowerCore.Common 4d741b3.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Agent Zero's agentic tool-loop ran on cloud Anthropic Sonnet (the bridge's
Anthropic key is currently 401) + gemma3:4b util (gemma3 returns 400 "does not
support tools" — fatal for the loop). Repoint the bridge ModelRouter tiers:
Balanced -> Ollama qwen2.5:14b (AZ chat) and Cheap -> qwen2.5:7b (AZ util), both
on the GX10 VIP 10.0.57.201 (already the bridge OllamaBaseUrl). Env-only, no
rebuild; Wiring A keeps the budget ledger + cache. Also: AZ chat ctx -> 32768,
browser -> qwen2.5:7b (text/tool-capable, vision off), AGENT_NAME -> "Blue Jay"
(the NUC role is retired). qwen2.5:7b + :14b pulled + warm-pinned on the GX10.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Workstream A: set_mood structured signal replaces leaky [mood:X] text
(FlowerCore.Chat a606892). Image built + imported to rke2-server and
rke2-agent1.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Repoints fc-chat, fc-ttsreader, knowledge, fc-llm-bridge (off the slow edge1
Pi5 10.0.57.17) and intranet (off the reimaged BLUEJAY-AI test laptop
10.0.56.132) to the GX10 (DGX Spark / GB10) Ollama over the PROD MetalLB VIP
10.0.57.201. GX10 serves gemma3:12b/gemma3:4b/qwen2.5:1.5b/nomic-embed-text/
llama3.2:1b on local NVMe, warm-pinned (keep_alive=-1).
fc-chat default model qwen2.5-coder:7b -> gemma3:12b (the coder model won't
pull reliably on the GX10; gemma3:12b is the warm fleet default + a better
general-chat model). Other consumers keep their exact models. Inline comments
referencing edge1/BLUEJAY-AI are now historical; the values are the GX10 VIP.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Audit of apps/fc-devicemgmt/ confirms the admin/helpdesk console needs NO new
infra: the existing host-matched IngressRoute (devices.iamworkin.lan, no path
constraint) + step-ca-acme Certificate already cover admin routes served under
FlowerCore:PathBase (ADR-204 routes-inside-DM.Web). ADMIN-CONSOLE-INFRA.md
records the finding + the open Q-MP question (distinct admin hostname vs PathBase
path) with the exact 3-step add if a separate host is later chosen.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Gated substrate (Cl2-4 / Cl-infra-3) — outside apps/ so the ApplicationSet
will not deploy it, and spec.suspend: true. Reconciles the 1Password
tenant-mapping doc into Authentik groups via Connect REST. Activate at Au-3
public-go (un-suspend + materialize the script ConfigMap). Pairs Codex Cx2-7.
Canonical script: FlowerCore.Notes/scripts/authentik/authentik-tenant-mapping-sync.py.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cl-infra-2 (deep-regroup 2026-06-13). LE staging+prod ClusterIssuers (HTTP-01
via Traefik, DNS-01 stub) + a per-tenant default-deny NetworkPolicy template,
under gated/public-tls/ OUTSIDE apps/ so the ApplicationSet does NOT auto-apply
them (an applied ACME ClusterIssuer registers an account immediately). Internal
*.iamworkin.lan TLS stays on step-ca. Inert until the operator opens the
web-hosting public-exposure gate (R-1; 14/14 blockers red). Pairs with Codex
Wh-C1 (hybrid public TLS) + Wh-C2 (isolation).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The hostname edge1.iamworkin.lan resolves to an unroutable IPv6 from cluster
pods and the CoreDNS *.iamworkin.lan template maps it to the Traefik VIP, so
the corpus indexer failed every embed with "No route to host". edge1's IPv4
(10.0.57.17, PROD VLAN) is pod-routable and has nomic-embed-text; an in-pod
embed test returned real vectors. This makes the now-enabled notes-md/notes-html
indexes actually populate.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cl-infra-1 (deep-regroup 2026-06-13). Adds a notes-corpus-clone initContainer
(shallow git clone of bluejay/FlowerCore.Notes into an emptyDir at
/srv/flowercore-notes) + a notes-corpus-sync sidecar (30-min pull) and flips
IntranetSearch__Enabled false->true so the previously doubly-disabled indexer
has a corpus to index (768 md + 108 html under docs/).
- Trailing-dot FQDN gitea-clusterip.gitea.svc.cluster.local. bypasses a CoreDNS
*.iamworkin.lan template that mis-resolves the in-cluster service name to the
Traefik VIP for musl / ndots:5 pods (search-domain appending).
- Cred via gitea-corpus-cred secret (canonical 1P bluejay read cred, created
imperatively in-ns; mirrors the gitea-flowercore-notes argocd repo-cred pattern).
- First-boot bulk embed runs in background via edge1 Ollama; /health stays Ready.
Pairs with Codex In-1 (intranet app-side reindex endpoint + SemaphoreSlim).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Gold PWA primary CTA (mobile-button--primary blue->gold cascade fix) + About
operator jump-links / honest update-status / license (FcAboutPanel contract).
Image built + imported to rke2-server + rke2-agent1; pin so ArgoCD adopts the
new tag instead of reverting the kubectl set image.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Stand up the pfSense automation plane (Phase 0, read-only) on RKE2 as an
ArgoCD-managed workload at network.iamworkin.lan.
- namespace fc-network
- Deployment fc-network-web: localhost/fc-network-web:v20260612-0b5b049,
imagePullPolicy Never, port 5340, /healthz probes, runAsNonRoot 1654 +
readOnlyRootFilesystem, RWO-safe RollingUpdate (maxSurge 0/maxUnavailable 1),
auth gate-OFF, SQLite + snapshot-store + intended-model paths under /data.
- PVC fc-network-web-data (longhorn, 2Gi): SQLite index + on-box snapshot store
(full-fidelity raw config.xml stays on-box; service surfaces redacted only).
- Service (ClusterIP 80 -> 5340), Certificate (ClusterIssuer step-ca-acme),
IngressRoute (network.iamworkin.lan, all methods — POST ingest is local-only).
- kustomization.yaml for local previews / single-app validation.
The ApplicationSet git generator picks this up as infra-fc-network; if it lags,
the Application is applied manually (documented pattern).
Ships the L2 pilot UI sweep to worldbuilder.iamworkin.lan: the dashboard
fc-component fix (missing-styles), ComfyUI local detection, and the rebuilt
About page. Image imported to rke2-server (10.0.56.11) + rke2-agent1
(10.0.56.12). rke2-agent2/10.0.56.13 is retired and was not used.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Image built from current DM master (network/BT command plane + Blue Jay
UI.Components restyle) and imported on rke2-server + rke2-agent1.
Deployment stays parked at replicas: 0 — gap 1 is wider than previously
noted (the fc-mysql Operator deployment itself is absent, so instance
CRDs would not reconcile) and gap 2 (1P runtime item) is still open.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
StatefulSet/authentik-postgres has been eternally OutOfSync since ~Sprint 65
even though 'kubectl diff --server-side --field-manager=argocd-controller'
shows zero real change. The STS was created via ServerSideApply, so the live
object carries apiVersion/kind inside volumeClaimTemplates[]; git omitting
them makes ArgoCD's normalized diff disagree forever. Declare them in git.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sprint 63 Cx-10 live-proof fix after Traefik curls found three stale probe-path annotations. Local lint 100/100; git diff --check clean; no Gitea statuses attached.
Library.Web is already running + serving at library.iamworkin.lan (root=200,
healthz=200), deployed manually 41h ago (image fc-library-web:v20260602-...,
PVC library-web-data holding the live SQLite DB). My from-scratch manifest used
a different PVC name (library-data) which ArgoCD would attach as a fresh empty
volume, orphaning the live DB. Adopting the live deploy into GitOps is a
separate careful task. Not disturbing a working deployment.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
From-scratch .Web deploy at library.iamworkin.lan (operator-authorized 2026-06-03).
Cloned from the worldbuilder pattern: Deployment + Service + Longhorn RWO PVC +
step-ca cert + Traefik IngressRoute. SQLite at /data/library.db, no OIDC, both
/health + /healthz probes. Image localhost/fc-library:v202606031925 imported to
both RKE2 nodes. DNS library.iamworkin.lan -> 10.0.56.200 already in pfSense.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add source-controlled Puppet/Hiera contracts for edge2 Divoom-as-DM-device without replacing the live flowercore-divoom systemd deployment.
Add Divoom TV Pi HDMI systemd/Puppet deployment artifacts, LF shell-script guardrails, and focused lint coverage for the additive non-K8s deploy shape.
Co-Authored-By: Codex <codex@openai.com>
Mirrors the live noc1 podman fix + Notes scripts/monitoring/prometheus.yml.
These services enforce OIDC bearer auth (FlowerCore__Auth__Enabled=true), so an
anonymous probe of / returns 401 -> false TraefikServiceDown. All three expose
anonymous /healthz=200. This noc-monitoring.yaml is the forward K8s-migration
target (not live); brings it in sync with the live config.
step-ca-acme only has an HTTP-01 (Traefik) solver, but mail.iamworkin.lan must resolve
to the dedicated MetalLB IP 10.0.56.202 (SMTP/IMAP), so HTTP-01 cannot validate (order
stuck pending since 2026-05-06; cert expired 2026-05-24). mail-tls is now issued from
step-ca's JWK 'admin' provisioner and auto-renewed by a systemd timer on noc1 that writes
the mail-tls secret directly. The secret + Deployment mount + webmail IngressRoute are
unchanged. Re-add a Certificate only if a DNS-01 solver is deployed for step-ca-acme.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Durable image bump for FlowerCore.Updater main a6c3354 (PRs #63-#66): hosted-service
+ request-path SQLite DateTimeOffset fixes, StopHost restored + per-tick resilience,
Shared.Settings 1.0.1. Image built + imported to rke2-server. Un-degrades the Phase-9
provenance verifier + settings poll (were stopped under the removed global Ignore mask).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The v202605301642-296f350-rework image crash-looped: FlowerCore.Shared.Settings SettingsDbPollHostedService
ran a DateTimeOffset Where/OrderBy on SettingsRecordChanges that SQLite can't
translate, and as a BackgroundService it stopped the host. Shared.Settings 1.0.1
materializes the change-log then filters/orders in memory; Updater Web bumped to 1.0.1.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Deploy the current FlowerCore.Updater main (PRs #52-#61) to prod: MSI-first
packaging, beta gating + per-install tokens, interactive+bearer Authentik OIDC,
native installer apply, and the .fcsetup.exe retirement (DropReleaseInstallers
migration runs on the now-empty DB). Image pre-imported to rke2-server + agent1.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Scale all github-runner deployments to 1 replica and halt the ci1
KubeVirt VM. With agent2 down (failed PSU) the cluster runs on two
passively-cooled NUCs; the ci1 8-vCPU VM drove agent1 to ~100C. Keep
total load trimmed until replacement hardware is in place.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The port additions caused the new VMI to stick at phase=Scheduled with
reason=GuestNotRunning. The guest-console-log sidecar exited 1 and
qemu never started. Reverting to the working 9-day-stable shape until
the port-add path is verified in a non-production VM.
Phase 2 (Windows runner install + registration) needs an operator-
interactive virtctl-vnc session against the rebuilt VM, OR a separate
investigation of why this port-add tipped over the VM.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Phase 1 VM has been Running for 9 days but Phase 2 (Puppet bootstrap +
runner registration) was deferred because the operator-interactive
virtctl-vnc path was the only way in. The masquerade interface listed
no exposed ports, so virtctl ssh and kubectl port-forward both hit
'no route to host' — qemu user-mode NAT does not forward inbound by
default.
Adding 5985 (WinRM HTTP) lets a kubectl port-forward + PowerShell
remoting path drive runner registration entirely from outside the VM.
3389 + 22 are reserved for desktop access via Guacamole or virtctl ssh
once OpenSSH Server is installed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 33 runner Deployments now request 100m CPU instead of 500m,
freeing roughly 50 idle pods × 400m = ~20 cores back to the cluster.
Observed CPU usage on idle runners is ~1m via kubectl top; the 500m
request was a 500× over-provision that was eating allocatable CPU
and blocking new workload scheduling — WorldBuilder runner could not
be scheduled even at the new 100m request because the pre-existing
fleet held the cluster at 99% requested.
Burst headroom preserved by limits.cpu: 2000m unchanged. TtsReader
keeps its 8Gi memory limit from the 2026-05-25 OOMKill fix; only
the CPU request line moves.
Recreate strategy on each deployment means a brief offline window
per runner during rollout; in-flight CI jobs complete on the
existing container before the new spec takes effect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After cutting requests to 100m, 4 of 6 new pods scheduled and 2 stayed
Pending — cluster CPU REQUEST utilization is 49.6 of 48 allocatable cores
because the existing fleet of ~50 idle runners reserves 25.6 cores
(500m × ~50) for ~50m actual use. Single-replica per new repo gets the
service online without competing with in-flight CI from the rest of the
fleet.
When the broader fleet-wide request right-sizing pass lands
(500m → 100m on all idle runners would free ~20 cores), these can be
bumped back to 2 replicas if PR-CI backlog warrants it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 3 fleet nodes were at 99% CPU REQUEST allocation; the 6 new pods
from the previous commit (3 deployments × 2 replicas × 500m) couldn't
schedule. Idle runners actually use ~1m CPU per `kubectl top pods`;
the 500m request was significantly over-provisioned. Burst headroom
preserved by limits.cpu: 2000m unchanged.
Follow-up: similar request right-sizing pass across the rest of the
runner fleet is queued for a future morning-routine sweep — 25 cores
reserved for ~50m actual use is a large slack we can reclaim cluster-
wide.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Morning-routine 2026-05-26 — these three repos had ZERO online Linux PR-CI
capacity, blocking the Sprint 37 Cx-1 Linux-CI-migration PRs (DM #20/#21/
#22, AiStation.Linux #13, WorldBuilder #3/#4). Chicken-and-egg: the
migration PRs need Linux runners that the migration creates.
Each Deployment uses the same canonical emptyDir-only pattern as the
fresh-2026-05-26 updater deployment that lives just above:
- replicas: 2 (room for parallel PR-CI without head-of-line blocking)
- per-pod emptyDir caches (no RWO PVC contention)
- shared github-runner-token secret (existing ACCESS_TOKEN PAT has
org-wide read access)
- LABELS: self-hosted,linux,fc-build-linux
- DOTNET_INSTALL_DIR pinned per ADR-170 family
For AiStation.Linux specifically: Linux job will now pick up; the
Windows job in #13 remains queued indefinitely until the Windows runner
host substrate lands per Sprint 36 v2 Cl-2 / ADR-174 — that's a separate
arc, not this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BLUEJAY-WS must never be a fleet GHA runner (operator directive 2026-05-26). Build-side analog of Sprint 9 safe-account exclusion. Also fixes 3 stale runner-fleet assertions broken by initContainer addition + replica tuning.
The github-runner-tts-reader pod was being OOMKilled (exit 137)
mid-`dotnet test` on the TtsReader 1000+ test suite. PR #21 CI
(the Windows -> Linux runner migration) flapped twice with the
"self-hosted runner lost communication" annotation before the
K8s-side symptoms surfaced via kubectl describe pod.
Requests bumped 1Gi -> 2Gi, limits 4Gi -> 8Gi. Comment added
inline so future fleet runs don't trip the same wall.
Unblocks PR #21 + the 9 other open TtsReader PRs that all rebase
through it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FlowerCore.PiManager build run 26417714843 sat queued 5h with zero
self-hosted runners registered to the repo. PiManager was missed in
the Sprint 32 long-tail sweep — every other FC repo got a dedicated
repo-scoped Deployment with its own ACCESS_TOKEN registration, but
PiManager fell through the cracks.
Adds a 2-replica ephemeral runner Deployment matching the Signage /
DMS / Print.Web pattern (per-pod emptyDir caches, no shared PVC,
labels `self-hosted,linux,fc-build-linux`, shared github-runner-token
PAT). Once ArgoCD syncs, the queued job will pick up automatically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Edge node has been OOMKilled 51 times in 5 days (~1 every 2.4h) on a
1Gi memory limit. Chrome runs maxSessions=2 on the same 1Gi cap and
was idling at 684Mi — first concurrent session pushing the node to
~900Mi+ would be the next OOM. Hub was running at 766Mi against a 1Gi
limit (75%); no recent restarts but no headroom either.
Firefox node has been running at 2Gi memory limit for 9 days with
zero restarts — that is the right size for a Selenium 4.27 browser
node under our session profile (screen recording sidecar + 1080p
rendering + page captures). Match it.
Changes:
- Hub: limit 1Gi -> 1.5Gi, request 512Mi -> 1Gi
- Chrome: limit 1Gi -> 2Gi, request 512Mi -> 1Gi
- Edge: limit 1Gi -> 2Gi, request 512Mi -> 1Gi
CPU left alone on all three — observed utilization is well under the
existing limits (hub 54m / 500m, chrome 185m / 1, edge 11m / 1).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without the IAmWorkin step-ca root CA in the runner image's system
trust store, .NET HttpClient calls from CI tests against
`*.iamworkin.lan` (e.g. `https://selenium.iamworkin.lan/session`) fail
with `The remote certificate is invalid because of errors in the
certificate chain: PartialChain`. FlowerCore.Print.Web's
`WebScreenshotService` unit tests hit this on every build.
Drop the step-ca root PEM into `/usr/local/share/ca-certificates/`,
run `update-ca-certificates` once during apt install, and let OpenSSL +
.NET-on-Linux read the regenerated `/etc/ssl/certs/ca-certificates.crt`
automatically — no `SSL_CERT_FILE` env var, no per-Deployment volume
mount.
Image rebuilt + saved + imported on all 3 schedulable RKE2 nodes
(rke2-server, rke2-agent1, rke2-agent2) before this PR — verified with
`ctr images list -q | grep stepca` on each node.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Unblocks CI jobs running in github-runner pods (e.g. FlowerCore.Print.Web
`help-screenshots`) from reaching selenium-hub. Previously the session
POST was DNAT'd to the hub pod IP then dropped at the Calico ingress
hook, surfacing as a 60s timeout against
http://selenium-hub.selenium.svc.cluster.local:4444 while the Selenium
UI showed 0/4 sessions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously orphan kubectl-applied since the Selenium Grid was first set
up. The `infra-selenium` ArgoCD app existed but only managed
`network-policy.yaml` — the deployments themselves drifted whenever
anyone `kubectl set env`'d or `kubectl scale`'d.
This commit captures the live state (with the 2026-05-25 maxSessions
bump for chrome already baked in) as canonical git source. ArgoCD's
ServerSideApply syncPolicy + selfHeal will now keep the grid in lock
step with this file.
Resources captured:
- Service selenium-hub (ClusterIP, internal traffic on 4444)
- Service selenium-hub-external (LoadBalancer, MetalLB 10.0.56.208)
- Deployment selenium-hub
- Deployment selenium-node-chrome (replicas=1, SE_NODE_MAX_SESSIONS=2)
- Deployment selenium-node-firefox (replicas=1, maxSessions=1)
- Deployment selenium-node-edge (replicas=1, maxSessions=1)
- IngressRoute selenium-hub (Traefik, selenium.iamworkin.lan)
No live behavior change — server-side dry-run confirms unchanged for
hub/firefox/ingressroute, "configured" for hub-external + 3 deploys
(default-field reordering only; SSA + field managers handle the diff).
Refs: Sprint 33 morning-routine 2026-05-25 follow-up Q-MR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The server pod was getting killed by liveness probe at 60s while still
waiting on migration DB lock (worker pod also running migrations against
same DB). Add startupProbe with 10.5 min budget so liveness doesn't fire
until migrations finish.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PermissionError: [Errno 13] Permission denied: '/media/public' in tenant_files
migration because Authentik container runs as uid 1000 but Longhorn PVC mounts
root:root by default. fsGroup on Pod securityContext recursively chgrps the
PVC mount to gid 1000 + chmods g+rwx.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stack:
- PostgreSQL 16 StatefulSet (Longhorn RWO 5Gi)
- Redis 7 Deployment (no persistence)
- Authentik server + worker (ghcr.io/goauthentik/server:2024.12.3)
- Shared media PVC (Longhorn RWO 2Gi) between server+worker
- Certificate via step-ca-acme ClusterIssuer
- Traefik IngressRoute at id.iamworkin.lan
Secrets sourced from 1Password item 'authentik-credentials' (IAmWorkin
vault, id y6i74ch22q5wvm7znquq4nhhcu) via OnePasswordItem CRD. Fields:
AUTHENTIK_SECRET_KEY, POSTGRES_PASSWORD, REDIS_PASSWORD,
BOOTSTRAP_ADMIN_PASSWORD, BOOTSTRAP_ADMIN_TOKEN, BOOTSTRAP_ADMIN_EMAIL.
DNS A record id.iamworkin.lan -> 10.0.56.200 added via
scripts/pfsense-add-id-host.py (FlowerCore.DNS service was 502'ing on
pfSense diag_command.php response parsing).
Closes the immediate gap from PiManager OIDC Cohort 3 wire-up: PiManager
(a87cd6f) configures id.iamworkin.lan as JWT authority but the backend
was never deployed. Pirelay specifically is on Mode:apikey until this
backend is bootstrapped and a pimanager service-account exists.
Post-deploy bootstrap (manual once pods Ready):
1. Login at https://id.iamworkin.lan/if/admin/ as akadmin
using BOOTSTRAP_ADMIN_PASSWORD from 1Password.
2. Create OAuth2/OpenID Provider for pimanager (issuer
https://id.iamworkin.lan/application/o/pimanager/, audience 'pimanager').
3. Create Application binding the provider.
4. Create service account user 'pimanager-service-account', generate
long-lived token, store in 1Password as 'pimanager-service-account'.
5. Re-enable jwt mode on pirelay + un-mask puppet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes the apps/brochure/ directory entirely from the bluejay-infra
ApplicationSet glob. ArgoCD will:
1. See infra-brochure has no git source -> mark for delete
2. Prune the brochure namespace + Deployment + Service + Certificate
+ Secret + IngressRoute (all generated from the now-gone
apps/brochure/brochure.yaml)
3. Remove the infra-brochure Application from argocd ns
Operator decision 2026-05-19 (follow-up to 09387f9 ARCHIVED banner
commit): "Yes, prune argo for brochure. Probably fully deleted there."
The brochure subdomain project was a planning-chain misinterpretation
of "make TtsReader + AI Station production-ready" — see
memory/project_brochure_split_misinterpretation_archived_2026_05_19.md
in FlowerCore.Notes for the full decision record.
Reusable artifacts that were the operator's archive concern stay alive
in their actual homes:
- FlowerCore.Intranet.Web PR #8 content-NuGet carve-out: still in
Intranet's master, may transfer to TtsReader / AI Station prod work
- Sprint 32 Cl-5 substrate (public-twin design ideas): SUPERSEDED banner
in-place in FlowerCore.Notes docs/standards/, history preserved
- magpie-doc-writer + wren-walkthrough skill output: unchanged in
Intranet's flowercore-whats-new/walkthroughs/galleries directories
Companion Notes-side commit updates the "scaled to 0 + ARCHIVED banner"
language in mvp-readiness.html + fleet-roadmap-2026-05-19-sprint36-v2.md
+ memory record to reflect full deletion instead.
Wrong-codebase image localhost/fc-brochure-web:v20260524-sprint32 is
being removed from rke2-server / rke2-agent1 / rke2-agent2 in a
follow-up step (reclaims ~800MB per node).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The brochure split project was a misinterpretation of an operator request
to make TtsReader + AI Station production-ready. Somewhere in the planning
chain it spun up into a separate "showcase brochure product" with its own
host, repo, NuGet, and Codex pack — none of which the operator actually
wanted. The project itself is pointless and a waste of credits.
Archive (not delete) per operator decision 2026-05-19, because some work
shipped under the misinterpretation may still have reusable value:
- FlowerCore.Intranet.Web PR #8 (merged) introduced FlowerCore.Brochure.Content
content-NuGet carve-out — pattern may apply to TtsReader/AiStation production
polish.
- Sprint 32 Cl-5 substrate has design ideas for public-twin vs operator-host
separation that may transfer.
- magpie-doc-writer / wren-walkthrough skills still author useful Intranet
content — those skills stay active.
These manifests stay at replicas: 0 for ArgoCD continuity. Cleanup options
(move out of apps/* glob, or delete entirely) are documented in README.md
for an operator-explicit future call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first batching pass (bacac06) left critical-severity alerts on the
immediate-print path. That's still per-event spam for any persistent
critical (e.g. PrintPaperRollCritical fires every 30s Grafana evaluation
cycle when paper is <5%). Caught immediately after deploy: CUPS queue grew
0 → 8 jobs in 8 minutes from a single firing PrintPaperRollCritical.
This commit aligns with the operator's verbatim ask ("one alert an hour"):
- Critical-severity alerts now go into the digest buffer, NOT the
immediate-print path. The digest payload already shows severity tags
per alertname, so the operator still sees "[critical] X" in the printout.
- The explicit `alert_channel=thermal_print_immediate` label still bypasses
batching, but only on NEW fingerprint arrival — it triggers a flush of
the CURRENT digest (with the new alert included), then clears. Repeat
webhooks for the same fingerprint dedupe in the buffer until the next
hourly tick OR until the alert resolves. No fingerprint can spam.
- `add_to_digest` now returns bool (True = buffer grew, False = dedup /
resolution / disabled) so the immediate-label path can flush only on
state transitions.
Net effect: max 1 thermal print per BATCH_INTERVAL_MIN per alert fingerprint,
regardless of severity. Rules that genuinely need same-second paper opt in
via `alert_channel=thermal_print_immediate` (currently zero rules use this).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OPERATOR (PodCrashLoopBackOff cleared):
- Bumped image to v20260519-sp34cl3-fix (built from astoltz/FlowerCore.DeviceManagement@d9a3685
after Sprint 34 Cl-3 stranded branch was merged via PR #19 squash).
- The v20260512-cx5 image was the broken Sprint 8 scaffold: generic Host
builder, no kubeops, no Kestrel on :8080, no AddController chain. Readiness
probe dial-tcp 8080 failed every restart.
- The new image ships the AddController chain for all 4 reconcilers
(DeviceCrd / DeviceGroupCrd / DevicePolicyCrd / RemoteCommandCrd) plus
Kestrel on :8080 and /healthz.
- Image saved + scp'd + ctr-imported on rke2-server / rke2-agent1 / rke2-agent2
before this commit. SHA256: 2cc79ee0a2313c550268d1244f805ae41b396362148dd5603061cc15b6f7fa7e
WEB (DeploymentReplicasMismatch cleared via scale-to-0):
- Web pod cannot start. Two upstream gaps must close first:
1) MySQL DB instance + user `fc_devicemgmt` / database `flowercore_devicemgmt`
are not provisioned in fc-mysql. Cluster has zero MySqlInstanceCrds and
no `mysql.fc-mysql.svc:3306` Service.
2) 1Password vault item `IAmWorkin/FlowerCore DeviceManagement Runtime` is
missing (5 fields: DB-Password + 4 mTLS PEMs). OnePasswordItem CRD has
been stuck Ready=False since 2026-05-18T02:58.
- Same pattern as the brochure-web scale-to-0 in 914fed0 — make the cluster
clean and quiet, let operator restart deploy on a real schedule.
Re-enable path is fully documented in the deployment-web.yaml header comment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The thermal printer drained overnight (2026-05-18/19) because the old
notify.py POSTed one print job per Grafana webhook fire. With 9
concurrently-firing alerts (zabbix-postgres + fc-devicemgmt + brochure
+ PrintPaperRollLow), every evaluation cycle stamped fresh CUPS jobs
onto the queue until the operator physically powered the printer off.
This refactor:
- Adds env-var config: THERMAL_PRINT_ENABLED (master kill switch),
BATCH_INTERVAL_MIN (default 60), BATCH_MAX_PENDING (default 50).
- IRC delivery stays per-event (operator wants the live stream).
- Thermal routing now:
* critical/disaster/page severity OR alert_channel=thermal_print_immediate
-> print immediately
* alert_channel=thermal_print -> enqueue into hourly digest
* RESOLVED -> remove from digest buffer (no resolution-spam prints)
* else -> IRC only, no thermal
- Background digest_loop thread flushes the buffer hourly (or sooner
if buffer hits BATCH_MAX_PENDING). Digest payload is a single
Print.Web /api/print/alert POST listing distinct alertnames + per-rule
target counts.
- New POST /flush endpoint (manual operator force-flush; useful for
testing without waiting an hour).
- GET / returns config + buffer depth + per-stat counters for observability.
Net effect: max 1 thermal print per BATCH_INTERVAL_MIN for batched
warnings, plus immediate prints for criticals. Closes the 2026-05-18/19
alert-storm incident.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the live `puppet` alert group from
FlowerCore.Notes/scripts/monitoring/alerts.yml into the K8s ConfigMap so a
future in-cluster Prometheus inherits the ruleset automatically.
Source of truth remains the Notes file (live Podman Prometheus on noc1).
See feedback_monitoring_k8s_target_vs_live_podman.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #5 rebase concatenated PR #5 env additions onto PR #7 env additions on
the base + sharedpos Deployments, producing duplicate-key validation
errors in ArgoCD's structured merge. The DOTNET_INSTALL_DIR and
NUGET_PACKAGES values are identical between PR #5 and PR #7; keep the
PR #7 originals and retain only the unique new env vars from PR #5
(DOTNET_CLI_TELEMETRY_OPTOUT, DOTNET_NOLOGO, DOTNET_GENERATE_ASPNET_CERTIFICATE).
No behavioral change — same final env var set, no duplicates.
Shared.Pos build fails on non-root runner (setup-dotnet /usr/share/dotnet denied); churning runner drove HighCPU on rke2-agent2. Re-enable in the Sprint 30+ Cx-1 Linux-runner-fleet lane (DOTNET_INSTALL_DIR on pod env).
Operator provisioned GitHub PAT (Runner Registration) 1P item. OnePasswordItem CRD already materialized the secret. Unblocks CI fleet-wide.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Renames the OnePasswordItem.itemPath from "GitHub Runner Registration
Token" to "GitHub PAT (Runner Registration)" so the runner 1P entry
sits next to its siblings — GitHub PAT (Gitea Mirrors) and GitHub PAT
(NuGet Packages) — under a consistent "GitHub PAT (...)" naming pattern
and API_CREDENTIAL category.
Existing field "credential" remains the consumer (RUNNER_TOKEN env).
Comment block clarified to require Administration:read/write fine-grained
PAT scope on target repos.
Old 1P item renamed to "[DEPRECATED 2026-05-16] GitHub Runner
Registration" — kept as recovery backup; can be hard-deleted after the
first successful runner pod start against the new item path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Renames the OnePasswordItem.itemPath from "GitHub Runner Registration
Token" to "GitHub PAT (Runner Registration)" so the runner 1P entry
sits next to its siblings — GitHub PAT (Gitea Mirrors) and GitHub PAT
(NuGet Packages) — under a consistent "GitHub PAT (...)" naming pattern
and API_CREDENTIAL category.
Existing field "credential" remains the consumer (RUNNER_TOKEN env).
Comment block clarified to require Administration:read/write fine-grained
PAT scope on target repos.
Old 1P item renamed to "[DEPRECATED 2026-05-16] GitHub Runner
Registration" — kept as recovery backup; can be hard-deleted after the
first successful runner pod start against the new item path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-runner-token OnePasswordItem exists but the underlying 1Password
vault item hasn't been created yet, so the operator can't mint the K8s
Secret. Pod stuck in CreateContainerConfigError → DeploymentReplicasMismatch
alert fires.
Scaling to 0 keeps the manifest infrastructure intact but stops trying
to schedule until operator:
1. Creates "GitHub Runner Registration Token" item in IAmWorkin vault
2. Generates a token at github.com/astoltz/<repo>/settings/actions/runners/new
3. Updates the OnePasswordItem itemPath to point at it
4. Bumps replicas back to 1 via PR
zabbix-web nginx+PHP-FPM container serves / at ~3-5s baseline with
occasional 6-7s spikes (probe path renders full dashboard via PHP).
kube-probe was killing the container after 3 consecutive 5s-timeout
499s, producing CrashLoopBackOff alert noise even though the app
was serving real traffic fine.
15s timeout absorbs the natural variance; explicit failureThreshold=3
documents the policy (was implicit default).
Closes the firing PodCrashLoopBackOff (zabbix-web) + pending
HTTPServiceSlow/HTTPServiceDegraded alerts. zabbix.iamworkin.lan
remains slow at the application layer (separate work — PHP-FPM
warm-up + Zabbix server "host not found" agent lookup spam need
their own fixes) but the pod restart loop stops.
Namespace github-runner with myoung34/github-runner:latest Deployment,
5Gi Longhorn RWO NuGet cache PVC, zero-privilege ServiceAccount, and
OnePasswordItem CRD for the registration token. EPHEMERAL=true mode
re-registers after each job; Recreate strategy avoids RWO multi-attach.
Targets fc-build-linux label; single replica pinned to rke2-server node.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Missing document separator caused YAML to merge the OnePasswordItem's
top-level `spec: itemPath:` block into the ConfigMap that follows.
Result: a ConfigMap with a `.spec` field whose K8s schema does not
declare one, triggering ArgoCD's structured-merge diff to fail since
2026-05-11T15:30:54Z:
Failed to compare desired state to live state: failed to calculate
diff: error calculating structured merge diff: error building typed
value from config resource: .spec: field not declared in schema
App stayed Healthy (live K8s tolerated the extra field — ConfigMap
ignored it) but ArgoCD's diff calc was broken, leaving the app stuck at
sync=Unknown for all 21 resources. Adding the missing `---` separator
makes the OnePasswordItem and ConfigMap proper sibling YAML documents,
each with its own kind-correct schema.
Diagnosed during 2026-05-12 morning routine.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Q-SO-1 operator resolution 2026-05-11 PM, Redis SignalR backplane lands
in Phase A (was Phase C deferral). Treats Redis as a managed FC infrastructure
component, not a deferred scaling escalation.
Lands the minimal Phase A surface:
- Namespace fc-redis
- Single Redis 7-alpine pod with 1Gi Longhorn RWO PVC
- ConfigMap with AOF persistence (everysec), 256Mi maxmemory, allkeys-lru
- ClusterIP Service `redis.fc-redis.svc.cluster.local:6379` (in-cluster only)
- No AUTH Phase A (Phase B add via 1Password Connect rotation)
- No IngressRoute (backplane is server-to-server)
Consumers (Phase A IMPL across FC services) add:
services.AddSignalR().AddStackExchangeRedis(
"redis.fc-redis.svc.cluster.local:6379",
opts => opts.Configuration.ChannelPrefix =
StackExchange.Redis.RedisChannel.Literal("fc-opsconsole"));
Phase B/C follow-ons (not in this commit): Sentinel for HA, AUTH password
from 1Password, redis_exporter sidecar for Prometheus, network policies.
See FlowerCore.Notes/docs/signage/operations-console-phase-2-design.md
section 3.5 (rewritten) and decisions-waiting.html Q-SO-1 (RESOLVED).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The fix-data-perms init container chowns /data (PVC) and /shared-tts
(hostPath /tmp/tts-audio on rke2-agent1) to uid 1654 so the non-root
telephony-web app can write Piper TTS .sln16 files.
Without an explicit container-level securityContext override, the init
container inherits pod-level runAsNonRoot:true / runAsUser:1654 and
fails with 'chown: /shared-tts: Operation not permitted' the first
time the hostPath comes up root-owned after a node reboot.
Outage 2026-05-11 23:00 UTC: telephony-web in Init:CrashLoopBackOff for
9 hours (100+ restarts) until init container was bumped to runAsUser:0.
Live cluster patched in the same operation; this commit makes the fix
durable in git so ArgoCD sync preserves it.
See Notes memory: feedback_hostpath_initcontainer_chown_perms
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two new preventive alert rules added to the kubernetes-state group of the
K8s migration target ConfigMap. The live Podman Prometheus on noc1 has
already been updated via FlowerCore.Notes/scripts/monitoring/alerts.yml +
sudo cp + podman pod restart monitoring (this commit only locks it in
the bluejay-infra K8s mirror so a future migration carries it forward).
MultusMemoryPressure (critical, thermal_print): fires when kube-multus
working set exceeds 80% of its memory limit for 5m. Catches the next
multus OOM cascade BEFORE it kills the daemon cluster-wide. The 2026-05-10
21h outage hit because no alert fired on the rising multus working set;
only downstream blackbox / Traefik / service alerts triggered, after the
fact.
NamespacePendingPodBacklog (warning): fires when any single namespace has
>25 Pending pods sustained for 30m. Catches the operator-leak avalanche
pattern (orphan pods from a crashed reconciler emitting children without
ownerReferences) before it cascades into a CNI OOM.
See FlowerCore.Notes:
- feedback_multus_50mi_limit_oom_orphan_pod_avalanche
- feedback_monitoring_k8s_target_vs_live_podman (workflow)
Companion commits:
- bluejay-infra@eb8693e (multus memory limit)
- FlowerCore.RemoteDesktop@b02c59b (OwnerReferences fix)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cluster outage 2026-05-10T17:43 through 2026-05-11 ~10:30 (~21h). Root cause:
FlowerCore.RemoteDesktop emitted 219 orphan rd-browser-only-* pods in fc-desktop
(missing OwnerReferences — see companion fix in FlowerCore.RemoteDesktop).
Kubelet's continuous CNI ADD retries for those pending pods drove a request
queue that exceeded the upstream default 50Mi limit on kube-multus-ds. Multus
OOMKilled (exit 137), restarted with an even bigger backlog, OOMKilled again,
positive feedback loop. Restart counts climbed to 276 / 412 / 261 across the
3 RKE2 nodes.
Downstream blast radius: both Traefik pods stuck ContainerCreating (101m +
4h35m), all Longhorn CSI attacher/provisioner/instance-manager stuck, every
Prometheus blackbox probe for *.iamworkin.lan failing, UpdateCenterPublicEdgeDown
critical on update.flowercore.io, every ArgoCD app showing sync=Unknown
because repo-server lost git connectivity. 45 firing Prometheus alerts.
Recovery sequence (Q-MR-1 from FlowerCore.Notes morning routine):
1. kubectl patch kube-multus-ds memory live (this commit locks it in git so
ArgoCD doesn't revert on next sync)
2. Force-delete the 219 orphan pods (kubectl --grace-period=0 --force) to
break the avalanche
3. Rollout restart kube-multus-ds — STABLE after restart with new limit
4. Restart Traefik + Longhorn CSI to clear stuck ContainerCreating
5. Verify update.flowercore.io returns 200 + ArgoCD apps reconcile
Tested incrementally: 256Mi limit was insufficient (still OOMed on catchup
burst), 512Mi was insufficient on rke2-agent1 (most pods concentrated there),
1Gi/512Mi handled the full 200+ pending pod CNI catchup cleanly with 0 multus
restarts after rollout. Nodes are 64GB with <25% used in steady-state, so the
~256Mi typical working-set is well within the new limit.
Companion change: FlowerCore.RemoteDesktop must set OwnerReferences on every
worker pod so future operator crashes don't leak orphans (Q-MR-2). Preventive
alerts (Q-MR-3) MultusMemoryPressure + NamespacePendingPodBacklog are coming
in a follow-up commit to apps/monitoring/.
Memory: feedback_multus_50mi_limit_oom_orphan_pod_avalanche
Decisions card: docs/dashboards/decisions-waiting.html Q-MR-1..3
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The virtio-blk disk swap (commit 84c9feb) didn't help: qemu fails to
acquire the write lock on the rootdisk PVC because the previous
launcher's qemu process didn't release it cleanly. Same family of
bug as the "stale QEMU flock" already documented in
feedback_kubevirt_iso_first_install_bootorder_and_runstrategy, but
now triggered on rke2-agent1 instead of agent2.
OVMF cdrom timeout is the real blocker and remains open:
- ✅ Distribution pipeline (build → save → scp → ctr import on all
3 RKE2 nodes) is proven. localhost/win-server-2025:1.0 lives in
each node's containerd k8s.io namespace.
- ✅ containerDisk + cdrom:scsi gets qemu domain Running (no NFS
Permission denied, no rootdisk flock).
- ❌ OVMF BdsDxe times out reading the SCSI cdrom regardless of
SecureBoot setting and bus type.
Reverting the disk type to cdrom:scsi so the VM lands back on the
"qemu Running, OVMF stuck at Boot Manager" state — known-stable and
easier to attack than the QEMU-flock state we hit by trying
virtio-blk disk.
Operator decision for next architectural step (one of):
- Custom OVMF firmware build with longer Boot0001 timeout
- KubeVirt version bump (v1.5+ has OVMF fixes)
- Hyper-V/VirtualBox install + export VHD to ci1
- BIOS legacy boot (Win Server 2025 needs UEFI but install media
has a BIOS path)
- DataVolume HTTP datasource (CDI internalizes ISO bytes via
different code path)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OVMF BdsDxe "starting Boot0001 ... Time out" persists across:
- SATA cdrom + Longhorn Filesystem PVC (Path A)
- SATA cdrom + Synology NFS (Path B failed: storage perms)
- SCSI cdrom + Longhorn (Path B variant)
- SCSI cdrom + containerDisk tmpfs (Path C)
- + SecureBoot=false
That rules out: storage IO speed, cdrom bus type, signature
verification. Remaining cause is deeper in qemu's cdrom device
emulation under KubeVirt v1.4.0's OVMF firmware — the cdrom read
window for OVMF's first-sector probe is too short to satisfy from
the cdrom controller path regardless of bus type.
Workaround: present the ISO bytes as a regular virtio-blk DISK
(not a cdrom). UEFI/OVMF still recognizes ISO9660 + El Torito
boot records on any block device, so it can find and boot the
EFI bootloader the same way it would from a USB stick. virtio-blk
has a different read path that doesn't hit the cdrom-specific
timeout.
This also better aligns with the FlowerCore.Distribution USB-key
pattern: ISO bytes on a block device, UEFI boots from the El
Torito boot record, Windows installer takes over. The autounattend
ConfigMap (ci1-autounattend) drives unattended Windows setup once
the installer kicks off.
The containerDisk OCI image (localhost/win-server-2025:1.0)
remains unchanged — only the disk type in the VM spec changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
containerDisk delivery (commit b998f50) successfully gave qemu fast
in-memory access to the ISO bytes (no NFS denial, no Longhorn read
latency), but OVMF's BdsDxe still timed out:
BdsDxe: loading Boot0001 "UEFI QEMU QEMU CD-ROM " from
PciRoot(0x0)/Pci(0x2,0x4)/Pci(0x0,0x0)/Scsi(0x0,0x0)
BdsDxe: starting Boot0001 ... Time out
That rules out storage IO speed and bus type as causes (already
tested both sata and scsi against both Longhorn-PVC and tmpfs-backed
containerDisk). Remaining likely cause: SecureBoot signature
verification on the ISO's EFI bootloader. KubeVirt's stock
`/usr/share/OVMF/OVMF_VARS.secboot.fd` doesn't appear to ship with
the Microsoft KEK/DB enrolled by default, so signed Windows EFI
bootloaders fail the trust-chain check and OVMF reports a generic
"Time out" rather than a verification failure.
Disabling SecureBoot lets OVMF skip the chain check entirely and
boot the El Torito EFI image. SMM stays enabled (KubeVirt only
requires it WITH SecureBoot, not the inverse). TPM 2.0 emulation
also stays on (`tpm: {}`), so BitLocker, Hyper-V, and WSL2 still
work in the guest.
This is acceptable for a CI runner. Long-term path back to
SecureBoot:
1. Custom-build OVMF_VARS.fd with Microsoft KEK/DB pre-enrolled
2. Mount via firmware.bootloader.efi.persistent
3. secureBoot: true
Tracked as a Phase 2 hardening task once the runner is doing real
work and we want signed-boot guarantees.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OCI image: localhost/win-server-2025:1.0 (8.27 GB)
Built FROM scratch + ADD disk.img → /disk/disk.img on noc1, podman
saved as tar (8.27 GB), SCP'd in parallel to all 3 RKE2 nodes,
imported via ctr in k8s.io namespace. Verified present on all 3
schedulable nodes (rke2-server, rke2-agent1, rke2-agent2).
Why containerDisk over the prior PVC paths:
- Path A (Longhorn Filesystem PVC, sata): OVMF BdsDxe SATA-CDROM
read timeout. Cdrom-backed PVC is too slow for OVMF's first-sector
read window.
- Path B (Synology NFS): uid 107 (qemu) denied at directory level by
Synology export ACL despite file mode 0777. Memory:
feedback_synology_iso_export_root_only_uid_107_denied.
- Path B+SCSI: same OVMF timeout, just on SCSI controller. Bus
choice was not load-bearing — the issue was always the slow PVC
backing.
- Path C (this commit): containerDisk delivers the ISO bytes from
a tmpfs view of the OCI layer, no PVC controller in the read path.
qemu reads at native FS speed; OVMF first-sector read completes
well within timeout. This is also the KubeVirt-recommended pattern
for installer ISOs.
Connects to FlowerCore.Distribution / Provisioning USB story: same
"OCI image of the OS installer + autounattend on a sysprep CDROM"
pattern that the USB provisioning agent will use. The Windows
install proceeds hands-off via the existing autounattend.xml in
ci1-autounattend ConfigMap (RDP enabled, WinRM, UAC disabled,
Administrator password from 1Password vault item
h3ix4mgfk65gmkcmvh6ly3d3hu).
Image lifecycle: bump tag (1.1, 1.2, ...) when ISO version changes,
rebuild on noc1, redistribute to RKE2 nodes, update image: line.
Legacy NFS PVC + PV manifest and CDI Longhorn PVC RETAINED for this
commit so prior states are recoverable. Will prune in follow-up
once containerDisk boot proves.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NFS Path B (commit fc2aca0) failed at storage layer: Synology export
`/volume1/ISOs` denies non-root client UIDs at the directory level.
qemu uid 107 cannot `ls /iso/` even though disk.img is mode 0777.
Diagnosed via uid-107 + uid-0 busybox probe pods on rke2-agent2:
- libvirt error: "Cannot access storage file ... Permission denied"
(virStorageSourceReportBrokenChain:1281, virError Code=38 Domain=18)
- uid 107 pod: "ls: can't open '/iso/': Permission denied"
- uid 0 pod (same mount): "drwxrwxrwx 1 root root 16 ... disk.img"
- SELinux Enforcing + virt_use_nfs=on, no AVC denials → not SELinux
- File mode 0777 with owner 107:107 → not POSIX
Same export-only-root pattern as `/volume1/kubernetes`. Memory:
feedback_synology_iso_export_root_only_uid_107_denied.md
Existing CDI-uploaded Longhorn PVC `windows-server-2025-iso` (10Gi
Filesystem mode) verified to contain valid ISO bytes readable by
uid 107 (mode 0660 root:107, 9.85 GB sparse, 8.27 GB blocks ≈
original 7.7 GB ISO). Reverting to it.
The original OVMF SATA-CDROM read timeout that drove yesterday's
NFS pivot is now addressed by `cdrom: bus: scsi` (virtio-scsi has
a longer read window than the IDE/SATA emulator). Per user-prompt
diagnostic chain Step 5.
NFS PVC + PV (apps/kubevirt-vms/win2025-iso-nfs-pv.yaml) RETAINED
so Path B state is recoverable; can be pruned in follow-up once
SCSI boot is proven.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous fix attempts confirmed the Longhorn-backed Filesystem PVC contains
a perfectly valid bootable ISO9660 image. The bug is that SATA-CDROM
emulation reading from a Longhorn Filesystem PVC is too slow for OVMF's
boot read window — DVD-ROM enumeration times out before the bootloader
loads. Symptom on the serial console:
BdsDxe: failed to start Boot0001 "UEFI QEMU DVD-ROM QM00001 " ... Time out
BdsDxe: No bootable option or device was found
Block-mode PVC (Path A) was attempted and would likely fix the timing,
but CDI v1.65.0's upload-target pod cannot open the underlying block
device (runAsUser:107 + capabilities.drop:[ALL]):
blockdev: cannot open /dev/cdi-block-volume: Permission denied
Path B (this change): mount the ISO directly from Synology NAS over
NFSv4.1. Bypasses both the Longhorn slowness and the CDI permission
issue. QEMU's SATA emulator reads at native LAN speed.
Layout:
/volume1/ISOs/ — existing Synology export, RKE2 ACL already granted
/volume1/ISOs/win2025-iso-disk/disk.img — new subdir, hardlink to the
ISO file, named so KubeVirt's launcher finds it at the PV root
A hardlink (not symlink) is required because symlinks with relative
targets pointing to the parent directory are broken when the NFS PV
sub-mounts the subdir as its root.
Validated 2026-05-08 from rke2-server, rke2-agent1, rke2-agent2:
mount -t nfs -o nfsvers=4.1,ro 10.0.58.3:/volume1/ISOs/win2025-iso-disk
file disk.img -> ISO 9660 CD-ROM filesystem data ... (bootable)
The original Longhorn Filesystem ISO PVC is RETAINED unused (so ArgoCD
doesn't prune the populated PVC and so we have a fallback). Can be
removed in a follow-up commit after the NFS path is proven on a
successful Windows install.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documenting the remaining 2 unresolved issues for the next operator
session, with the recovery paths from this session captured inline so
the next agent doesn't repeat the same blind alleys:
1. **rootdisk QEMU flock** — every new launcher pod fails QEMU start with
`Failed to get "write" lock` on the rootdisk Filesystem-mode disk.img.
Stale flock from a previous force-deleted virt-launcher pod. Longhorn
engine on rke2-agent2 needs to release the lock; `kubectl patch
volume.longhorn.io/<pvc-name> spec.nodeID=""` is reverted by the
Longhorn controller. Operator-level recovery only.
2. **SATA CDROM read timeout** — even with bootOrder=1 (windows-iso first),
OVMF UEFI fails Boot0001 with "Time out" reading the SATA CDROM backed
by the Filesystem-mode PVC. Block-mode DataVolume migration was
attempted but blocked by CDI v1.65.0's upload pod running with
`capabilities.drop: [ALL]` and `runAsUser: 107`, preventing direct
block-device writes (`blockdev: cannot open /dev/cdi-block-volume:
Permission denied`). See ISO PVC header docstring for 3 forward paths.
Net commits during this session:
- 1c4145a: bootOrder swap (windows-iso=1, rootdisk=2)
- 87a7d7c: deprecated `running:` -> `runStrategy: Always`
- 0bf47df: ISO migration to Block-mode DataVolume (REVERTED)
- 9f6dc1a: revert to Filesystem PVC (CDI block-upload blocked)
- 1c4145a + 87a7d7c + 9f6dc1a are the live, correct configuration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Block-mode DataVolume migration (commit 0bf47df) hit a CDI v1.65.0 limitation:
the upload-target pod runs as uid 107 with `capabilities.drop: [ALL]`, so it
cannot open the underlying block device:
blockdev: cannot open /dev/cdi-block-volume: Permission denied
Saving stream failed: Unable to transfer source data to target file:
error determining if block device exists: exit status 1
Reverting to a Filesystem-mode PVC + virtctl image-upload pvc, which DID work
(uploaded the 7.7 GiB ISO with valid ISO9660 magic intact). Boot timeout is
unresolved (header docstring captures the open issue + 3 paths to revisit).
The bootOrder swap (1c4145a) and runStrategy migration (87a7d7c) stay landed —
those are correct improvements regardless of the volume-mode question.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bootOrder swap alone didn't fix the install — even with `windows-iso` at
bootOrder:1, OVMF UEFI still timed out reading the SATA CDROM:
BdsDxe: starting Boot0001 "UEFI QEMU DVD-ROM QM00001 " from ... Sata(...)
BdsDxe: failed to start Boot0001 ... : Time out
BdsDxe: No bootable option or device was found.
Diagnosis (debug pod mounting the live PVC):
- /pvc/disk.img IS a valid bootable ISO9660 image — `file` reports
"ISO 9660 CD-ROM filesystem data 'SSS_X64FRE_EN-US_DV9' (bootable)".
- bytes 0..15: zeros (NOT QCOW2 magic 51 46 49 fb).
- bytes 32769..32773: "CD001" — ISO9660 primary volume descriptor at the
correct offset.
So content was fine. The bug is in how KubeVirt + QEMU + Longhorn expose a
Filesystem-mode PVC's `/disk.img` as a SATA CDROM. With Block-mode the
underlying volume IS the raw ISO9660 sectors, OVMF reads them directly,
no QEMU file-emulation layer. This is the recommended pattern for ISO
install media on KubeVirt + Longhorn.
Migration:
- Replace `kind: PersistentVolumeClaim` with `kind: DataVolume` (CDI manages
the underlying PVC + upload-target pod).
- Set `pvc.volumeMode: Block`.
- Annotate `cdi.kubevirt.io/storage.contentType: kubevirt` so CDI keeps raw
bytes (no QCOW2 wrap).
- VM volume reference changes from `persistentVolumeClaim.claimName` to
`dataVolume.name`. KubeVirt's VMI controller blocks VM start until DV
phase is Succeeded (upload completed).
Operator step after this lands:
1. Wait for DV `phase: UploadReady`
kubectl get dv -n kubevirt-vms windows-server-2025-iso -w
2. virtctl image-upload dv windows-server-2025-iso -n kubevirt-vms \
--image-path "...\en-us_windows_server_2025...iso" \
--uploadproxy-url https://localhost:8443 --insecure --no-create
3. Re-flip runStrategy to Always (was set to Halted live-side during
migration; this commit keeps the manifest at Always).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Required to clear OutOfSync state after the bootOrder fix. Live VM had
runStrategy: Halted (set during diagnosis to release the PVC for inspection).
Manifest had running: true. KubeVirt's validating webhook rejects sync:
admission webhook "virtualmachine-validator.kubevirt.io" denied the request:
Running and RunStrategy are mutually exclusive.
Switching to runStrategy: Always preserves the original "auto-start +
auto-restart" semantics with the non-deprecated field, and gives ArgoCD a
clean diff target to flip Halted -> Always.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Original order: rootdisk=1 (empty 200Gi virtio), windows-iso=2 (SATA CDROM).
UEFI tried the empty virtio disk first, got nothing, fell back to Boot0001
(the SATA CDROM) with a short timeout, and aborted with:
BdsDxe: failed to start Boot0001 ... Time out
BdsDxe: No bootable option or device was found.
VM had been running 38+ min with rootdisk actualSize stuck at 4.13 GiB and
no AgentConnected condition — install never started.
Diagnosis via debug pod mounting the windows-server-2025-iso PVC:
/pvc/disk.img: ISO 9660 CD-ROM filesystem data 'SSS_X64FRE_EN-US_DV9' (bootable)
bytes 0..15: zeros (NOT QCOW2 magic 51 46 49 fb)
bytes 32769..32773: "CD001" (ISO9660 primary volume descriptor)
So the PVC content is a real bootable ISO — the only fix needed is to make
the ISO bootOrder=1 for first install. After Windows installs, it writes its
own UEFI Boot#### entries pointing at the rootdisk EFI partition; UEFI then
boots from rootdisk going forward and the ISO at bootOrder:2 is a fallback
for re-install scenarios.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KubeVirt v1.4.0 + RKE2 containerd 2.1.5 cannot pull
quay.io/kubevirt/virtio-container-disk:latest:
rpc error: code = Unimplemented
desc = failed to pull and unpack image: not implemented:
media type "application/vnd.docker.distribution.manifest.v1+prettyjws"
is no longer supported since containerd v2.1, please rebuild the image as
"application/vnd.docker.distribution.manifest.v2+json" or
"application/vnd.oci.image.manifest.v1+json"
The :latest tag was last rebuilt with the v1 manifest schema. Tagged versions
v1.6.5+, v1.7.3, v1.8.2 are rebuilt with v2/OCI manifests.
Pinning to v1.8.2 (newest available, contains current Windows VirtIO drivers).
The image only contains the Windows VirtIO driver ISO mounted as a CDROM —
not the KubeVirt runtime — so it is decoupled from the cluster KubeVirt
version.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 prereqs all satisfied:
- Multus CNI v4.2.2 thick-plugin DS Running on rke2-server/agent1/agent2
- CDI v1.65.0 operator + CR Deployed (cdi-apiserver/deployment/uploadproxy
all Running 1/1)
- Windows Server 2025 ISO (7.7GiB, March 2026 update) uploaded via CDI
virtctl image-upload to PVC windows-server-2025-iso. Verified via PVC
annotations: cdi.kubevirt.io/storage.condition.running.message="Upload
Complete", storage.pod.phase="Succeeded"
- Local Administrator password generated (26 char, FANTASTIC strength).
Stored in 1Password vault IAmWorkin (qaphopopkryhbg353ukzhhuqoq) item
h3ix4mgfk65gmkcmvh6ly3d3hu. UTF-16-LE base64 in autounattend.xml Value
field matches the 1P "autounattend AdministratorPassword Value" field.
Changes:
- ISO PVC bumped 6Gi → 10Gi (ISO is 7.7GiB, need headroom)
- Added labels app=ci-runner, flowercore.io/managed-by=bluejay-infra
- autounattend.xml AdministratorPassword Value: real base64-encoded password
- spec.running: false → true (VM starts on next ArgoCD sync)
- Header comment refreshed to LIVE state with prereq references
Network: still pod-network masquerade. Multus NAD prod-vlan57 is registered
but the VM doesn't use it yet (Phase 1.5 host bridge needed first).
Verify after sync:
kubectl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml -n kubevirt-vms get vm,vmi
virtctl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml vnc ci1 -n kubevirt-vms
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Promotes the fleet to FlowerCore.Updater main @ 2bdf108 which combines:
- PR #6 publish pre-signed releases (1a188f4)
- PR #7 deeper public-host privacy enforcement (8cd8544)
- PublishPreSignedAsync(stream, sig) Integration coverage (2bdf108)
Live image already imported to rke2-server and rolled via deploy-web.ps1.
This commit aligns the bluejay-infra source of truth so ArgoCD doesn't
snap the spec back to the previous tag (per
feedback_argocd_managed_image_overrides_do_not_stick).
Adds three new bluejay-infra apps that auto-pickup via ApplicationSet (apps/*
directory generator on main):
* apps/multus/multus.yaml — Multus CNI v4.2.2 thick-plugin daemonset (verbatim
upstream, project-annotated). Enables KubeVirt VMs to attach additional
network interfaces. Required by ci1 to bridge onto PROD VLAN 57.
* apps/cdi/{cdi-operator.yaml,cdi-cr.yaml,README.md} — Containerized Data
Importer v1.65.0 (verbatim upstream). Operator + CR pattern. Enables
populating PVCs from HTTP/registry/upload sources, used to load the Windows
Server 2025 ISO into the windows-server-2025-iso PVC.
* apps/kubevirt-vms/prod-vlan57-nad.yaml — NetworkAttachmentDefinition for
PROD VLAN 57 bridge. **Deploy gated on Phase 1.5 host work**: requires
br-prod bridge enslaving enp86s0.57 on each RKE2 node (Puppet config-as-code).
ci1.yaml continues to use pod-network masquerade until that lands; switching
to multus.networkName: kubevirt-vms/prod-vlan57 is a one-line YAML edit
followed by a GitOps push.
Cluster verification (2026-05-08):
- KubeVirt LIVE (3 nodes, virt-api/controller/handler/operator all Running)
- Calico CNI on /etc/cni/net.d + /opt/cni/bin (Multus default paths)
- ApplicationSet `bluejay-infra` already watches `apps/*` on main
Reproducibility: upstream YAMLs vendored verbatim with project header diffs
only. Bumping versions = re-curl + git push. No deploy-time internet fetch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stages a draft VirtualMachine + Namespace + ISO PVC + rootdisk PVC + sysprep
ConfigMap for the dedicated GitHub Actions self-hosted runner that replaces
the never-registered bluejay-ws-sandbox-1 placeholder.
Status: STAGED ONLY. spec.running = false. ISO PVC empty. Two operator
decisions still pending before this can boot:
1. Network choice — pod-network fallback (in this draft) vs Multus +
PROD VLAN NAD (preferred, requires Multus install).
2. ISO path — manual upload via helper pod (Path A) vs CDI HTTP import
(Path B, requires CDI install).
Cluster baseline 2026-05-08:
- KubeVirt operator: installed, healthy, 14d
- CDI: NOT installed
- Multus: NOT installed
- Calico-only CNI
See docs/infrastructure/windows-server-build-runner-plan.md "Phase 1 readiness
gate" for the full operator pickup checklist.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three Certificates requested duration: 2160h (90d) with renewBefore: 720h
(30d). step-ca's ACME provisioner caps cert lifetime at 30d, so it silently
issued 720h certs — making renewBefore EQUAL to the actual cert lifetime.
cert-manager treats the cert as needing immediate renewal the moment it's
issued, creates a CertificateRequest, gets a new (still 30d) cert, marks
it for immediate renewal, and loops.
Damage on 2026-05-07 ~20:30 (caught during regroup after 5h gap):
- fc-worldbuilder/worldbuilder-web-tls: 2365 CRs in 18h
- fc-distribution/fc-distribution-tls: 10880 CRs in 18h
- knowledge/knowledge-tls: 10888 CRs in 18h
Total: 24,133 stale CertificateRequest objects in etcd.
Bulk-deleted all CRs + Orders in those 3 namespaces, then this commit
fixes the source so ArgoCD sync stops re-creating the loop.
Fix: match the working 720h/240h pattern used by every other FC service
cert (agent-zero, fc-dns, fc-llm-bridge, fc-php, traefik-system, etc.).
30d cert lifetime + 10d renewal headroom = renewal at day 20, which is
the cert-manager standard 2/3-of-lifetime practice.
Side effect during loop: ALSO contributed to step-ca load and may have
caused intermittent timeouts cluster-wide (the latest stuck challenge
was timing out dialing step-ca:9443 even though step-ca itself was up).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captured during 2026-05-07 regroup audit. selenium-netpol was applied via
raw `kubectl apply` to the cluster on 2026-03-15 with no source-of-truth
file anywhere — neither in bluejay-infra nor in any FC service repo. A
cluster rebuild from bluejay-infra would have lost it entirely (including
the Selenium Grid → Traefik VIP allow rule that gates AAT runs against
*.iamworkin.lan services).
Captured byte-for-byte from `kubectl get netpol -n selenium selenium-netpol
-o yaml`. ServerSideApply via ArgoCD will adopt the existing resource
without recreation.
The Selenium Grid Deployment + Services themselves are still managed
outside ArgoCD (deployed via raw kubectl from the original bring-up).
Migrating those into bluejay-infra is a separate lane — this commit only
restores GitOps repeatability for the NetworkPolicy.
See feedback_networkpolicies_belong_in_bluejay_infra.md for the canonical
pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Repeatability gap caught during 2026-05-07 morning regroup. The four
fc-desktop NetworkPolicies (desktop-isolation, fc-desktop-default-deny,
remotedesktop-web-isolation, cm-acme-http-solver-allow) were applied via
FlowerCore.RemoteDesktop/scripts/deploy-web.sh `kubectl apply` calls.
That meant a fresh cluster rebuild from bluejay-infra alone would miss
all of them — Browser Lab session isolation, control-plane allow-list,
and HTTP-01 cert renewal would silently fail to come up.
Canonical FC GitOps pattern is for NetworkPolicies to live alongside
other resources in bluejay-infra. Verified by audit: 6 of 11 cluster
NetworkPolicies (agent-zero, edge2-services, monitoring, noc-services,
telephony, voice) already follow this pattern. fc-desktop was the
outlier; selenium-netpol is also unmanaged and tracked separately.
Source-of-truth split (now documented in fc-desktop.yaml):
- bluejay-infra OWNS: Certificate + IngressRoute + all NetworkPolicies.
- FlowerCore.RemoteDesktop scripts/deploy-web.sh OWNS: Deployment +
Service ONLY (because `localhost/fc-desktop:linux-xfce` image refs
require manual ctr import on each node — Deployment in bluejay-infra
would race the image-import step).
Follow-up commits in FlowerCore.RemoteDesktop will:
- Remove the now-duplicate k8s/{networkpolicy,namespace-default-deny,
web-networkpolicy,acme-http01-solver-allow}.yaml files.
- Drop the 3 `kubectl_apply_file` lines from scripts/deploy-web.sh.
The 4 NPs in this commit are byte-for-byte identical to what's running in
the cluster today (verified via kubectl get -o yaml diff). ServerSideApply
in the bluejay-infra ApplicationSet will adopt the existing resources
without recreating them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ArgoCD's directory-driven manifest parser scans *.yaml AND *.json by
default. Bare Grafana dashboard JSONs (no apiVersion/kind/metadata)
poisoned manifest generation for the entire infra-monitoring Application
("Object 'Kind' is missing in <dashboard JSON>"), leaving sync state
Unknown.
These files are SOURCE for the file-provisioning path on noc1
(/opt/monitoring/grafana/dashboards/) and also get inlined into ConfigMap
wrappers like grafana-dashboard-remotedesktop.yaml. They are NOT K8s
manifests and shouldn't be in ArgoCD's manifest tree.
.argocdignore is for repo-level GitOps source eligibility, not for
filtering manifests within a directory-mode Application — the cleanest
fix is the *.txt extension that ArgoCD's parser skips.
Reverts the .argocdignore from the previous commit (didn't take effect).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fc-updater PVC: bump updatecenter-data storage 10Gi → 25Gi.
The cluster PVC was already manually expanded to 20Gi to fit Mike Bundle
(~5.1 GiB) plus the LocalFsBundleStore.MaxTotalBytes soft cap of 25 GiB
(see project_uc_remaining_4_apps_signed_2026_05_06). PVCs cannot shrink,
so ArgoCD couldn't sync the smaller git value (OutOfSync, retried 5x with
"field can not be less than status.capacity"). Setting git to 25Gi gives
headroom matching the soft cap.
monitoring .argocdignore: skip bare dashboard JSON files.
Both fc-updatecenter-dashboard.json and flowercore-remotedesktop-grafana-
dashboard.json live in apps/monitoring/ as source-of-truth for file-
provisioning to noc1's /opt/monitoring/grafana/dashboards/. ArgoCD's
manifest parser tries to unmarshal every file and chokes on bare dashboard
JSON ("Object 'Kind' is missing"), which then poisoned the whole
infra-monitoring Application status (Unknown sync, no comparison possible).
The .argocdignore tells ArgoCD to skip *.json — actual K8s deploys happen
via ConfigMap wrappers like grafana-dashboard-remotedesktop.yaml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WorldBuilder live runtime promotion lands in the Intranet at
/services/world-builder + ServiceRegistry homepage tile.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4delta server-side HTML overlay enrichment landed in
FlowerCore.TtsReader@8f23e15 (master @6091618). Adds 9-pass enrichment +
SQLite-backed cache + 4 REST endpoints (/api/v1/enrich/{html,jsonld,both,passes})
+ RenderRequest.sourceJsonLd. Tests 476 -> 522 (+46). Image already imported
to all RKE2 nodes via deploy.sh; this bumps the bluejay-infra-managed tag so
ArgoCD reconciles the live deployment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cluster CPU-request budget at 99% on all 3 RKE2 nodes at deploy time.
0/3 nodes available; "3 Insufficient cpu". Actual CPU usage on the
nodes is 10/52/19%, so the cluster is request-overprovisioned but has
plenty of real headroom. Idle Blazor + SignalR + ComfyUI poller is ~5m.
25m unblocks scheduling and stays generous for expected runtime.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror of FlowerCore.Notes/scripts/monitoring/alerts.yml fc-signage-marquee
group into the K8s migration target apps/monitoring/noc-monitoring.yaml so
that future migration of the noc1 Podman monitoring stack into RKE2
inherits the marquee alert ruleset automatically.
Three rules added:
- MarqueeDroppedFramesHigh (5% / 5min / warning)
- MarqueeRenderLatencyP99High (16ms / 10min / warning)
- MarqueeAnimationDurationDrift (10% / 15min / info)
All three gated with `unless on() absent_over_time(metric[7d])` so they
don't fire during the metric-not-yet-published window before Track 3
IR-21 source IMPL ships the MarqueeMeter into Common + Web + WPF.
Live source-of-truth (the noc1 Podman Prometheus reads from
/opt/monitoring/prometheus/alerts.yml) was updated and reloaded
in the same session — Notes commit 300daa0 carries the matching
alerts.yml + Grafana fc-signage-dashboard.json change.
Per feedback_monitoring_k8s_target_vs_live_podman: this file is the
forward-looking K8s migration target, NOT what the live Podman
Prometheus reads. ArgoCD-syncing this file does NOT push alerts to
the live monitoring stack.
Companion to:
- FlowerCore.Notes 300daa0 (live alerts.yml + Grafana panels deployed)
- docs/signage/marquee-performance-telemetry-design.md (Track 3 IR-21 spec)
- docs/signage/marquee-animation-phases.md (Track 6 13-phase coverage matrix)
Memory: project_marquee_vr_promotion_landed_2026_05_06
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps tag to bring live pod up to FlowerCore.Intranet.Web@a9ede80 (master tip
post-fleet-search-resurrect merge). Image imported to all 3 RKE2 nodes via
scripts/deploy.sh v20260505-1108.
Closes the source-vs-deployed gap that existed since 2026-04-29: the
KnowledgeFleetSearchController + Service + TrustedHeader auth handler were
running on the deployed pod but never landed on master. Surgical extraction
from stale codex/fleet-knowledge-search branch (12-file rebase conflict made
full merge non-trivial) brings the source up to match production.
+7 tests (280/280 vs 273), 0W/0E build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds fc-updatecenter-dashboard.json (uid: fc-updatecenter, version: 2)
to apps/monitoring/ — mirrors the dashboard deployed to noc1 at
/opt/monitoring/grafana/dashboards/fc-updatecenter-dashboard.json.
13 panels: 5 existing probe/availability panels + 1 OTEL row header
+ 7 new panels for the 6 OTEL counters added to FlowerCore.Updater.Web:
updatecenter_manifest_requests_total
updatecenter_bundle_download_bytes_total
updatecenter_bundle_downloads_total
updatecenter_checkins_total
updatecenter_release_publishes_total
updatecenter_signature_verify_failures_total
Live on Grafana at https://grafana.iamworkin.lan/d/fc-updatecenter
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps tag to include the new AddDataProtectionKeys EF migration that closes
the fc_dp_keys table-creation gap from v20260505-1023. Master tip a82d7d4.
Previous tag v20260505-1023 crash-looped on every page load with
'no such table: fc_dp_keys' due to eb9fe6d (DataProtection-in-DB)
registering the DI but missing the table-creation migration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to 0b52093 (K8s manifest hardening) closing two real gaps the
prior sweep didn't catch:
1. Public read-write allowlist regression guard (Track A)
- New PublicReadWriteAllowlistHosts set tracks updatecenter.iamworkin.lan
+ updates.iamworkin.lan. The allowlist on those hosts is
GET||HEAD||POST||OPTIONS — POST is required for the bootstrap-JWT
check-in endpoint. PUT/PATCH/DELETE must still 404 at the route.
- New PublicReadWriteIngressRoutes_MustPinGetHeadPostOptionsAllowlist
test enforces the allowlist invariant (3 required methods present,
3 forbidden methods absent).
- Companion conftest.dev policy 08_public_readwrite_allowlist.rego.
2. Selenium NetworkPolicy DNAT backend port audit
- FlowerCore.Notes/k8s/selenium/06-networkpolicy.yaml allowed Traefik
VIP 10.0.56.200:443 + :80 but its 10.42.0.0/16 + 10.43.0.0/16 egress
rules didn't include the post-DNAT backend ports (8443 for Traefik
TLS, 8080 for HTTP). Per feedback_netpol_dnat_backend_port: kube-proxy
DNATs the destination to a backend pod IP+port BEFORE Calico
evaluates the FORWARD chain, so without those backend ports in the
pod CIDR rule, Selenium-driven browser AAT calls to
https://*.iamworkin.lan time out at connect.
- Lint inventory now includes FlowerCore.Notes/k8s/selenium/ so
regressions in this manifest fail fast.
Lint scope notes:
- FlowerCore.Notes/k8s/guacamole/ + monitoring/ are historical
scaffolds that have diverged from the live state (bluejay-infra/apps/
is canonical). Operator review is required before bringing them in
line OR decommissioning them — kept out of lint scope until that
decision lands (see xxl-regroup-2026-05-03-followup.md "Codex 7 §0").
README hardening:
- New "Public read-write allowlist hosts" entry under "Known gotchas"
documenting the GET||HEAD||POST||OPTIONS pattern + linking the lint.
Tests: 8/8 lint tests pass.
Companion fix in FlowerCore.Updater repo on branch
codex/k8s-gotcha-fleet-sweep-c7 (k8s/web-deployment.yaml: localhost/ image
needs imagePullPolicy: Never). The FlowerCore.Updater fix applies to a
deploy that's currently live but bites only on first scheduled-pod
landing on a fresh node — not a live production-impact regression.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JSON-provisioned dashboard for FlowerCore.RemoteDesktop session metrics,
matches the Apr 23 staging done in the codex/ttsreader-release-b6ca2d5
worktree. Drop into apps/monitoring so ArgoCD-managed Grafana provisioning
picks it up alongside the other FC service dashboards.
Per-profile MoodAnnotationModelOverride picker — Profiles page now shows
a model dropdown from IModelRegistry instead of a free-text field; model
override null-falls-back to global TtsReader:Ollama:DefaultModel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The prior commit b71f9e4 created a stray YAML document between the
bluejay-tools-c and bluejay-profile sections. kubectl applied the stray
block's data to bluejay-profile (wrong ConfigMap, wrong mount target).
The setup-bluejay initContainer copies bluejay-tools-{a,b,c} to the tools
directory; bluejay-profile is copied to the agent profile directory. Tools
must live in one of the three tools ConfigMaps.
Fix: insert corpus_search.py and intranet_search.py directly into the
bluejay-tools-c YAML document (before kind/metadata, matching the
data-first layout the rest of the file uses). Also fix two mojibake
characters (→ and ·) that were corrupted in the prior commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add corpus_search.py to bluejay-tools-c: semantic vector search over
fleet SQLite-vec DBs (fleet-workstation-full, fleet-pi-edge, fleet-bmo-bot).
Returns offline-friendly results for Bible/Greek/Hebrew/Strongs corpora.
Cluster pod degrades gracefully (no DB mounted yet — BLUEJAY-WS only for now).
- Add intranet_search.py to bluejay-tools-c: live RAG search over the
intranet vector store via GET /api/search?q=...&topK=N. Uses in-cluster
service URL (http://intranet-web.intranet.svc:5300) to bypass Traefik TLS
and the private-range egress denylist.
- Fix intranet_search.py param name: was 'limit', now 'topK' matching the
SearchController's [FromQuery] parameter name.
- NetworkPolicy: add egress rule for intranet namespace port 5300 (without
this the pod's TCP connection to the search endpoint was dropped).
- agent-zero.yaml: set FLOWERCORE_INTRANET_URL env var to in-cluster service
URL so intranet_search uses internal routing, not the public Traefik VIP.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add `print-web-api-keys` OnePasswordItem CRD that syncs from 1Password
"Print.Web API Keys" vault item (password field). Mount as PRINT_WEB_API_KEY
env var in the agent-zero container.
The print_web.py Python tool (already in bluejay-tools ConfigMaps) reads
PRINT_WEB_URL and PRINT_WEB_API_KEY env vars for all HTTP calls to the
thermal print service on edge2. Previously the key was unset so every API
call was rejected with 401.
Note: Print.Web uses the legacy REST MCP shape (/api/mcp/tools/*) not the
streamable-http protocol. The Python tool bridges this gap — no /mcp endpoint
exists on Print.Web today. Network policy already allows 10.0.57.16:5200.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 of Mac mini onboarding (2026-04-28):
- Add OnePasswordItem CRD 'macmini-vnc-creds' in guacamole namespace bound to
vault item 'Mac Mini' — operator mints Secret with username/password/VNC Password fields
- Mac mini discovered at 10.0.56.115 (INFRA VLAN) — not 10.0.57.50 stored in 1P IP field
- Guacamole connections updated via API (not stored here): VNC conn #10, SSH conns #9/#33
corrected from old IP 10.0.57.50 → 10.0.56.115
- macOS: 26.4.1 (Sequoia), Apple M1, 16 GB, user: bluejay (admin group)
- VNC port 5900 confirmed open; SSH works via noc1 jumpbox with password auth
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds ProgressHub endpoint at /hubs/progress with project-scoped
group broadcasting for JobStarted, CueProgress, JobCompleted, and
JobFailed events.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deploys fix for stale Failed manifest accumulation in TTS Reader Ops view
and atomic-write guard against empty/corrupt job manifests.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Sprint E XXL Phase 4γ MVP deploy — POST /api/v1/render endpoint.
Two changes:
1. Image tag v202604272339 → v202604280946 (TtsReader@d9e0a58 master tip
includes the new RenderController + RenderService + 9 tests).
2. New TtsReader__Render__CdnDirectory=/data/cdn env var. Default
wwwroot/cdn resolves under the read-only app filesystem when
runAsNonRoot=true; pin to the existing writable PVC mount alongside
other TtsReader runtime data. Manifests + cue audio land at
/data/cdn/sha256/<hash>/manifest.json + cues/.
Pre-existing PVC mount at /data/ already covers this — no PVC change
needed, just the env var override.
Pairs with TtsReader@d9e0a58 master tip (ready for image build + import).
Bump intranet image to v20260427-2353 (master @ 38b0148):
- Sprint E search lane: /search Blazor page + IntranetSearchService
+ DocsCorpusIndexer + Shared.Indexing wiring
- 7 new service pages: LocalAiAgents, AiTopology, Distribution, Dns,
Knowledge, LlmBridge, Provisioning
- PiManager drift docs
New env var: PageReadingOverrides__FilePath=/data/page-reading-overrides.json
so the persisted Lane 2α store lives on the writable PVC instead of
the default in-memory fallback (which loses state on pod restart).
Operator-edited overrides via the existing /api/v1/pages/{encoded}/overrides
controller will now survive across restarts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TtsReader@9333480: distinguishes partial-render (yellow Warning, audio
plays, 'Re-render N failed sentences' button) from full-fail (red
Danger, 'Try render again'). New TtsFallbackChainFailedException carries
both voices when Kokoro + Piper both fail; chapter breadcrumb names
the entire chain instead of just the requested voice. +8 tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Kokoro pod has 4 restarts in 2d6h with exit 143 (SIGTERM from kubelet).
kubectl describe events all show:
Liveness probe failed: Get "http://10.42.229.109:8880/v1/audio/voices":
context deadline exceeded
The probe path /v1/audio/voices shares the FastAPI worker pool with
/v1/audio/speech. A long synth (Bible chapter, 30+ sentences) holds the
pool past the prior 5s × 3 = 15s probe window, kubelet kills the pod,
in-flight renders fail. Operator hits "fallback chain failed" toasts +
partial-render breadcrumbs during these windows.
Bump probe timeoutSeconds 5 → 15 and failureThreshold 3 → 5 → 75 s of
grace before kubelet gives up. Combined with the kokoro-side circuit
breaker landing in TtsReader (Sprint E Phase 1b), the FC backend will
also stop slamming kokoro during recovery so it can serve the probe
even faster.
The companion Prometheus alerts (KokoroPodFlapping, PiperPodFlapping)
land in FlowerCore.Notes/scripts/monitoring/alerts.yml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Day 8 disk-cache warmer crashes on production with
'Read-only file system : /home/app/data' because the relative default
'data/voice-previews' resolves under runAsNonRoot HOME (read-only with
readOnlyRootFilesystem=true). Pin to /data/voice-previews so the cache
lands on the writable PVC mount alongside ttsreader.db, audio output,
and jobs root.
Image v202604272216 (already on nodes) is unaffected by this — only
the env routing changes. ArgoCD reconciles + rollout restart picks up
the new env without rebuild.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v202604272157 crash-looped on the production PVC because Database.EnsureCreated()
is a no-op on existing DBs and the VoiceProfiles table was missing. TtsReader@a9f0b73
adds an idempotent CREATE TABLE IF NOT EXISTS to the infra reconciler before
TtsReaderDataSeeder runs. Bumping the manifest to pick up that fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds 7 new pages (5 service pages, AI topology, opencode operator guide)
to https://intranet.iamworkin.lan:
/services/dns
/services/distribution
/services/llm-bridge
/services/knowledge
/services/provisioning
/services/ai-topology
/development/local-ai-agents
Plus topology corrections in /services/ai (AiStack.razor) and 6 new nav entries.
Source commit: FlowerCore.Intranet.Web@1598542 on
codex-wip-pre-readaloud-collision-2026-04-24.
Image built from artifacts/publish via Dockerfile.deploy on BLUEJAY-WS,
imported to all 3 RKE2 nodes (rke2-server + rke2-agent1 + rke2-agent2).
Build: 0 warnings, 0 errors, 197/197 tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes after the Phase 2.4 deploy went live at
https://knowledge.iamworkin.lan:
1. **Ollama URL flip**: from BLUEJAY-WS (10.0.56.20:11434) to edge1 Pi 5
(10.0.57.17:11434). Honors the cluster-clean architecture from
bluejay-infra@0f9d56e ("Workstation is private dev hardware and should
not be in the cluster path"). Query-time embeddings (~ms per query)
are fast enough on edge1; bulk index rebuilds (Phase 2.5+) will need a
separate ingestion lane that can opt into the workstation GPU when
present. ArgoCD picks up the env-var change and rolls the pod
automatically — no image rebuild needed.
2. **README LIVE status**: flip the staged-not-yet-applied banner to
LIVE 2026-04-27. Pod running, certificate issued, PVC bound,
/healthz 200, /api/v1/editions [] (initial-deploy state). Phase 2.5+
admin UI handles bulk population.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workstation (BLUEJAY-WS) is private dev hardware and should not be in the
cluster path. Repointing the nginx ollama-proxy sidecar so cluster Agent Zero
talks ONLY to edge1 Pi 5 + AI HAT+ (10.0.57.17:11434):
- nginx upstream: edge1 sole server, no workstation entry
- wait-for-ollama init container: only checks edge1
- NetworkPolicy egress: drop 10.0.56.20/32, keep 10.0.57.17/32
- Comments updated throughout to flag workstation as off-limits to cluster
- Annotation rewritten to document the architectural intent
Pulled qwen2.5:1.5b on edge1 first so Agent Zero's utility_model survives
the cutover (existing models on edge1: qwen3:4b, gemma3:4b, qwen2.5-coder:7b,
nomic-embed-text). Model count on edge1: 4 → 5.
Lets BLUEJAY-WS lock down its Ollama port to localhost without breaking
the cluster Agent Zero.
NOT YET APPLIED — push to origin/main is gated on the DNS A record
knowledge.iamworkin.lan -> 10.0.56.200 being live. Per memory
feedback_pfsense_dns_required_for_acme, applying the Certificate
without DNS in place puts cert-manager into ~2h HTTP-01 backoff and
needs `kubectl -n knowledge delete order <name>` recovery.
Manifests authored:
- apps/knowledge/knowledge.yaml — Namespace, PVC (knowledge-vector-store
Longhorn 20Gi RWO), Deployment (single replica, Recreate, image
localhost/fc-knowledge-web:v202604272200 placeholder, runAsNonRoot
1654, readOnlyRootFilesystem, drop ALL caps, /healthz startupProbe +
readinessProbe, tcpSocket livenessProbe), Service (ClusterIP port
80 -> 8080), Certificate (step-ca-acme ClusterIssuer, 90d duration),
IngressRoute (knowledge.iamworkin.lan, websecure entrypoint).
- apps/knowledge/kustomization.yaml — `kubectl kustomize` preview file
(matches fc-distribution shape; ApplicationSet uses dir generator).
- apps/knowledge/README.md — deployment order checklist with the DNS
preflight, image build/import loop for all 3 RKE2 nodes, push
procedure, smoke verification, initial-deploy-state notes
(zero editions until *.db files are pushed to the PVC), resource
sizing, probe + middleware notes.
Companion artifacts (separate repos, separate commits):
- FlowerCore.Knowledge@eb91eb4 — Dockerfile.deploy at repo root
- FlowerCore.Notes@96cd443 — scripts/deploy-knowledge.sh
Apply order (from apps/knowledge/README.md):
1. Add DNS A record knowledge.iamworkin.lan -> 10.0.56.200 via
FlowerCore.DNS or pfSense web UI.
2. Run `bash scripts/deploy-knowledge.sh` from FlowerCore.Notes — this
builds + imports the image to all 3 RKE2 nodes with
FLOWERCORE_DEPLOY_SKIP_ROLLOUT=1 (since the Deployment doesn't
exist yet on the cluster).
3. Bump the image tag in this manifest to match the freshly-imported
tag, then `git push` from this repo to land on main. ArgoCD picks
up within ~3 minutes and creates `infra-knowledge`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three new Prometheus alert rules for the print-services group, all routed
to thermal_print via alert_channel label (Grafana contact point ->
irc-notify -> Print.Web /api/print/alert):
- PrintPaperRollLow (warning, 5-10% remaining, 5m for)
- PrintPaperRollCritical (critical, <=5% remaining, 2m for)
- PrintJobDeadLetter (warning, any new dead-letter in 15m)
Source-of-truth gauge is print_paper_remaining_percent (Print.Web OTEL),
which is hydrated from the active PaperRoll row at process startup
(Print.Web@<TBD> HydrateMetricsAsync) so the gauge isn't blind for an
arbitrary window after every deploy.
Self-referential humor: low-roll alerts route to the printer that's
running out of paper, so it announces its own paper-out warning on its
remaining paper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an IngressRoute + cert-manager Certificate that terminates HTTPS for
print.iamworkin.lan and proxies to edge2's Print.Web at 10.0.57.16:5200.
Same headless-Service-with-manual-Endpoints pattern as noc-services (used
for grafana/prometheus/cockpit on noc1). pfSense Unbound already resolves
print.iamworkin.lan to the Traefik VIP 10.0.56.200, so cert-manager
HTTP-01 should validate cleanly.
No basicAuth middleware: Print.Web has its own X-Api-Key authentication
and exposes anonymous endpoints for the bookmarklet / Python CLI /
cups-notifier flow.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
agent-zero ollama-proxy had 172 historic restarts (now stable).
Root cause: liveness/readiness probes hit /api/tags which proxies
through to BLUEJAY-WS Ollama (10.0.56.20:11434). When the workstation
Ollama is slow or offline, nginx fails over to the edge1 backup —
but the failover takes >1s and the kube-probe default timeoutSeconds=1
gives up first. Three failed probes → kubelet kills the container.
Fix:
- Add nginx local healthz endpoint (200, no upstream).
- Liveness probe → /healthz (proves nginx itself is alive).
- Readiness probe stays on /api/tags but with timeoutSeconds=5 so
failover to backup completes before the probe times out.
This decouples liveness from upstream availability — kubelet only
restarts the proxy when nginx is genuinely dead, not when Ollama is
slow.
Longhorn coverage gap: K8s emits "snapshot becomes not ready to use"
events constantly during the hourly snapshot lifecycle (1047
snapshots, all readyToUse=true on inspect). Those events were the
only signal we had — purely transient lifecycle noise, not actionable.
Add:
- longhorn scrape job (longhorn-backend.longhorn-system.svc:9500)
- NetworkPolicy egress rule for longhorn-system port 9500
- 4 new alerts in 'longhorn-storage' group:
- LonghornVolumeDegraded (>15m) — replica unhealthy, auto-rebuild
- LonghornVolumeFaulted (>5m, critical, thermal print) — data loss
- LonghornBackupStale (no completed backup in >36h) — recurring job
silently failing
- LonghornNodeUnhealthy (>5m) — node ready=false
zabbix-web 7 restarts and Print.Web 12:55 stop investigated — both
are stable now, no actionable cause found in journal/events. Adding
KubeContainerRestartingFrequently in the previous commit will catch
recurrence of either.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Followup to 05a273d. After deploy, six PoolDepleted/Deficit alerts
went pending again because the publisher emits per-status gauge
series (fc_desktop_pool_depleted{template,status,alert_level}) and
the historical Warming/BelowDesiredSize series stay at value=1 even
after the template transitions to status=Ready. Filtering by
alert_level=Critical/Warning was not enough — those labels are baked
into the stale series too.
Replace with a join-based query: alert only when the canonical
"Ready" status gauge does NOT report ready=1 for the enabled
template. fc_desktop_pool_ready{status="Ready"}==1 is the publisher's
own current-state canary and never goes stale.
Verified against the live cluster — query returns 0 results when all
pools report healthy in their reconcile logs (no stale-label false
positives).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Followup to ab6ade4. Three issues uncovered after the rollout:
1. NodePort hairpin breaks scrape from same-node pod. Prometheus on
rke2-agent1 could reach traefik-metrics on .11/.13 NodePort 30900
but timed out on its OWN node's NodePort. Same problem would hit
kube-state-metrics + cert-manager whenever prometheus reschedules.
Fix: scrape via ClusterIP svc DNS instead of NodePort. NodePorts
stay in place for external/Podman scrapers.
2. probe-traefik-services failed for grafana, prometheus, guac with
non-200/3xx codes. grafana + prometheus are behind Traefik basic-
auth (every endpoint returns 401), so drop from probe surface —
health is covered by the in-cluster monitoring-* scrape jobs.
guac.iamworkin.lan was deprecated when Guacamole moved under
desktop.iamworkin.lan/guacamole/ — drop it.
3. acme path was wrong (root 404). Use /health.
Coverage adds (probe-traefik-services):
chat, dist, dms, menuboard, messageboard, presentations, retail,
ttsreader. All of these have IngressRoutes serving root at 200/3xx.
NetworkPolicy egress rules added so the new ClusterIP svc scrapes work:
- traefik-system: port 9100 (metrics) — separate from data-path 8080/8443
- kube-system: port 8080 (kube-state-metrics)
- cert-manager: port 9402 (controller metrics)
Out-of-band fix during this audit:
- Print.Web on edge2 was inactive (clean exit at 12:55 CDT, root cause
unclear — systemd Stopping signal). Restarted. Service back on 5200.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Live audit on 2026-04-26 found 14 firing alerts caused by stale probe
targets, blackbox TLS verify failures, and stale state-as-label series.
Plus three K8s scrape sources (kube-state-metrics, cert-manager,
traefik) that exposed NodePorts but were not in any scrape config.
Fixes
- probe-remotedesktop: switch http_2xx -> https_internal. Blackbox does
not trust step-ca root, so /health was failing with x509 unknown
authority while the app served 200s.
- probe-agentzero-nuc: short svc form (agent-zero.agent-zero.svc:80)
instead of *.cluster.local. The FQDN form was being rewritten to the
Traefik VIP by the CoreDNS iamworkin.lan template + ndots:5 search
expansion, then 5s timeout.
- probe-agentzero-local + probe-ollama-local: removed. 10.0.58.100 is on
HOME VLAN and not reachable from cluster pods. Workstation/AI-laptop
Ollama monitoring belongs to host-side Puppet, not cluster blackbox.
- snmp-cloudkey: commented out. The Cloud Key Gen2+ runs unifi-core
(controller), not an SNMP agent. Was generating "connection refused"
every 30s.
- RemoteDesktopPoolDepleted / RemoteDesktopPoolDeficitSustained:
filter on alert_level=Critical / Warning|Critical + enabled=true.
The publisher emits one series per template per status without
resetting old series to 0, so the historical Warming/BelowDesiredSize
series stayed at 1 and the alert kept firing on stale labels.
- RemoteDesktopTlsExpiry: match by job, not hostname-only instance.
The probe sets instance=https://desktop.iamworkin.lan/health so a
hostname-only label match never fired.
- EpsonPrinterDown for: 5m -> 30m. EcoTank sleeps after ~5 min idle and
SNMP times out, so 5m guaranteed nightly noise.
Coverage adds
- kube-state-metrics scrape (NodePort 30901). Required for the new
pod-state alerts and a long list of standard K8s SLO queries.
- cert-manager scrape (NodePort 30902). Required for the
CertManagerCertificateNotReady / RenewalFailed alert pair documented
in project_cert_manager_prometheus_scrape.
- traefik scrape (NodePort 30900) on all three nodes.
- probe-traefik-services: HTTPS probe (https_internal) over the 17 main
iamworkin.lan hosts so any Traefik-fronted service returning non-200
shows up as a single named probe failure.
- blackbox-config: add the https_internal module that the new probes
reference (was only in the FlowerCore.Notes scripts/monitoring copy,
not in the live ConfigMap).
New alerts (kubernetes-state group)
- KubeContainerRestartingFrequently (>5 restarts/h)
- KubeContainerCrashLooping (>3 restarts/15m, thermal print)
- KubePodNotReady (Pending/Failed/Unknown >15m)
- KubePodImagePullBackOff (>10m)
- KubeDeploymentReplicasMismatch (>15m)
Without these, the agent-zero ollama-proxy 172x restart loop was
invisible for ~3 days. Same gap would have hidden the fc-php
php84-app-probe ImagePullBackOff orphan (cleaned up out of band).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the Deployment + Service for the fc-modern-tts container that
landed in the previous commit. Same shape as ttsreader-biblical:
runAsNonRoot uid 1654, dnsPolicy: None to bypass the iamworkin.lan
hijack on Microsoft endpoint lookups, /health probes, modest CPU/mem
since edge-tts is network-bound.
Service surfaces ttsreader-modern.fc-ttsreader.svc:10403 for the web
pod to call when the operator picks a he-IL-* or el-GR-* voice.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a fourth TTS engine alongside Piper / Kokoro / biblical-tts: a
small FastAPI bridge to Microsoft Edge's Read Aloud TTS via the
edge-tts Python package. Provides studio-quality Modern Hebrew (he-IL)
and Modern Greek (el-GR) narrators for the cluster.
modern-tts/Dockerfile + app.py:
- Python 3.12 base + edge-tts==7.2.8 (older versions hit 403 from MS).
- POST /tts -> MP3 audio (audio/mpeg).
- POST /timings -> word-level timings. Edge sometimes omits WordBoundary
events for non-English voices; fall back to MP3-frame-walking duration
estimate + proportional distribution across whitespace-split words
(same approach biblical-tts uses for eSpeak).
- GET /voices?language=all|default — filtered to he-/el- by default so
the AiStation voice picker isn't overwhelmed by 400+ voices.
- GET /health for probes.
- Body shape mirrors BiblicalTtsRequest so the .NET client lives in the
same FlowerCore.Shared.Speech package.
K8s deployment in fc-ttsreader namespace:
- ttsreader-modern Deployment + Service on port 10403.
- localhost/fc-modern-tts:v1, imagePullPolicy: Never (built on noc1,
imported to all 3 RKE2 nodes via ctr).
- runAsNonRoot uid 1654 + fsGroup 1654.
- dnsPolicy: None to bypass the *.iamworkin.lan template hijack on
Microsoft endpoint lookups.
- Modest resources (100m/128Mi req, 1000m/512Mi limit) — edge-tts is
network-bound, not compute-bound.
- Probes against /health.
Verified live locally: container handles 'Καλημέρα Ελλάδα Πώς είστε'
in 2496ms, returns el-GR-NestorasNeural voice + 4 word timings.
Hebrew: 'בְּרֵאשִׁית בָּרָא אֱלֹהִים' returns he-IL-AvriNeural,
2472ms, 3 words.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third TTS engine alongside Piper (modern English/multi-lang) and
Kokoro (high-quality English): a small FastAPI wrapper around eSpeak-NG
with built-in support for Ancient Greek (grc), Hebrew (he), and Modern
Greek (el). Same shape as fc-speech-align so AiStation talks to all the
TTS/alignment services with one HTTP client pattern.
biblical-tts/Dockerfile + app.py:
- Python 3.12 base + apt-get espeak-ng + libsndfile1 + ffmpeg-free deps.
- POST /tts -> WAV audio bytes (audio/wav).
- POST /timings -> word-level timings derived from espeak's --pho phoneme
duration stream, distributed across whitespace-split words proportional
to character count. Accuracy is good enough for chip-level read-along
highlighting (~30-80ms per-word jitter).
- GET /voices for catalog discovery, GET /health for probes.
- Body shape mirrors AlignmentRequest from FlowerCore.Shared.Speech so
the .NET BiblicalTtsClient round-trips it cleanly.
K8s deployment in fc-ttsreader namespace:
- ttsreader-biblical Deployment + Service on port 10402.
- localhost/fc-biblical-tts:v1, imagePullPolicy: Never (built on noc1,
imported to all 3 RKE2 nodes via ctr).
- runAsNonRoot uid 1654 to match the namespace's standard security ctx.
- Modest resources (100m/128Mi req, 1000m/512Mi limit) — eSpeak is
CPU-cheap.
- Probes hit /health which returns the supported language list.
Verified live: container started, /health returns ok with grc/el/he,
POST /timings on Ἐν ἀρχῇ ἦν ὁ λόγος returned 5 words / 1714ms.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cluster ttsreader-web was reaching across to BLUEJAY-WS:10401 for
Kokoro synthesis, which meant a workstation-down event broke render-
pipeline TTS. Add a cluster-native ttsreader-kokoro Deployment and
Service inside fc-ttsreader so the cluster owns the engine.
- Image: ghcr.io/remsky/kokoro-fastapi-cpu:latest. Model + 67 voices
ship inside the image, so no PVC is required.
- Port 8880 (the kokoro-fastapi default; the entrypoint hardcodes it).
- Resources: 250m/1Gi request, 2000m/3Gi limit. CPU-only inference
matches what AiStation runs locally on BLUEJAY-WS.
- dnsPolicy: None to bypass CoreDNS's *.iamworkin.lan template hijack
on huggingface.co lookups, same shape as ttsreader-align.
- Probes hit /v1/audio/voices since the kokoro server doesn't expose
/health; that endpoint is cheap (lists configured voice files).
ttsreader-web env var TtsReader__Kokoro__BaseUrl flips from the
workstation pointer to the cluster service:
http://ttsreader-kokoro.fc-ttsreader.svc.cluster.local.:8880.
AiStation keeps its local http://localhost:8880 since the workstation
operator still wants the audio to render on the local sound device
without a network hop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /align endpoint was returning Whisper-native word fields
(word/startSeconds/endSeconds/confidence), but FlowerCore.Shared.Speech's
FasterWhisperAlignmentClient on master deserializes
FasterWhisperWord against [JsonPropertyName("text")/("startMs")/("endMs")].
Result: ttsreader-web reported alignment.source="whisper" with words[]
present but every entry had Text="" and StartMs=EndMs=0 — visible in the
2026-04-25 hello-world smoke against ttsreader.iamworkin.lan.
Match the published Common contract instead of the Python model's native
shape: emit text/startMs/endMs (millisecond ints, not float seconds).
Confidence stays on the wire as informational; the deployed C# client
ignores it but a future fc-align operator UI can surface low-confidence
words. Bump tag to v3 and bump the Deployment image accordingly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New ttsreader-align Deployment + Service + 5Gi PVC under
apps/fc-ttsreader/. Wraps SYSTRAN/faster-whisper in a small FastAPI app
exposing POST /align (fc-align contract used by Shared.Speech) AND
POST /transcribe (audio-in feature consumed by ttsreader-web Lane G).
Source: apps/fc-ttsreader/speech-align/ (Dockerfile + app.py +
requirements.txt). Built locally (apt-get RUN steps need BLUEJAY-WS,
not noc1) and ctr-imported to all 3 RKE2 nodes.
- ttsreader-web env: flip Speech__Alignment__Enabled=true and point
BaseUrl at http://ttsreader-align.fc-ttsreader.svc.cluster.local.:9200.
Add new TtsReader__Transcription__* env triplet pointing at the same
service (same /transcribe endpoint).
- Bump ttsreader-web image to v202604251046 (carries the
TranscriptionController + MCP tool + Quick.razor InputFile UI).
The cluster-wide pod cannot reach BLUEJAY-WS speaches on 10.0.56.20:9200
because the rootless+host-net podman setup binds 127.0.0.1 only on the
WSL machine; nothing on the LAN-facing interface. The openai-compatible
Backend value also relied on a Common change still on feat/shared-indexing
rather than master, so the deployed image's Shared.Speech only knows
the FC-native /align shape.
Disable Speech:Alignment for now. EstimatedAlignmentClient kicks in and
keeps /api/v1/voices/preview-with-timings returning word-aligned JSON,
just with uniform-distribution timings instead of real Whisper output.
Re-enable once: (a) Common's openai-compatible Backend lands on master
and a new TtsReader image ships, or (b) we point at a LAN-routable
backend (e.g. an aiohttp /align shim, or speaches running on a node
that's actually reachable from cluster pods).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The fc-speech-align container on BLUEJAY-WS (port 9200) is the speaches
build of faster-whisper-server, which exposes the OpenAI-compatible
/v1/audio/transcriptions contract — not the FlowerCore /align contract.
FasterWhisperAlignmentClient (FlowerCore.Common a1b3bfc) supports both
shapes; tell it explicitly to talk OpenAI-compatible here so requests land
on the right endpoint and verbose_json gets adapted into the FC alignment
response. Also pin the Model id to one speaches recognizes.
Switch back to fc-align once a native /align backend is deployed (or wire
a tiny FastAPI shim in front of speaches if we want a stable contract).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flips Speech__Alignment__Enabled=true and points BaseUrl at the
new BLUEJAY-WS podman quadlet running fc-speech-align (faster-
whisper, /align contract). When Lane 1δ's
/api/v1/voices/preview-with-timings runs after this lands, the
alignment.source field flips from 'estimated' to 'whisper' and
the per-word timings come from real audio analysis instead of
uniform-spacing estimates.
No image rebuild — the Lane 1α DI registration already routes
IWhisperAlignmentClient to FasterWhisperAlignmentClient when
Speech:Alignment:Enabled is true.
Companion firewall rule from FlowerCore.Puppet@bbc02ea +
@05504ed (whisper_align_enabled flag on bluejay-ws-linux Hiera)
opens port 9200 to RKE2 pod CIDR durably.
Word-level highlighting + inline annotation popover in the
read-aloud bar, backed by TtsReader's preview-with-timings
(Lane 1δ) and the existing /api/v1/pages/{encodedUrl}/overrides
REST surface (Lane 2α).
Built from FlowerCore.Intranet.Web@9abde21 against
FlowerCore.Common@d23d4c3, both on master.
Picks up Lane 1δ + 3β:
- Lane 3β: GET /reader-embed iframe-friendly host route + public-
read CORS for /api/v1/voices and /_content/.../embed/
- Lane 1δ: GET /api/v1/voices/preview-with-timings — pairs
synthesized audio with per-word alignment timings (faster-
whisper or estimated fallback) so embed bundle / FcReaderOverlay
/ in-app /voices preview can word-highlight in one round-trip
- Latest FlowerCore.UI.Components from Common master:
FcReaderOverlay annotation popover (Lane 2γ) + <fc-reader>
standalone embed bundle (Phase 3)
Built from FlowerCore.TtsReader@06ef815 (master) against
FlowerCore.Common@d23d4c3 (master).
Image imported on rke2-server / rke2-agent1 / rke2-agent2.
Picks up the merged Lane 1γ + Lane 2α + Lane 1β + Phase 3 work:
top-bar Read aloud button + per-page reading overrides REST +
FcReaderOverlay shared component + <fc-reader> embed bundle.
Built from FlowerCore.Intranet.Web@35a552f against
FlowerCore.Common@a56975a, both on master.
Image imported on rke2-server / rke2-agent1 / rke2-agent2.
Previous v202604241543backlog image accidentally used a stale
publish/ directory at the TtsReader repo root (Dockerfile.deploy
says COPY publish/ but my ad-hoc publish wrote to artifacts/publish/).
Rebuilt with a clean copy from artifacts/publish/ to publish/ first.
Confirmed new image has appsettings.json Preview section + the
quick-swipe-gestures.js asset baked in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hotfix for two live render errors:
- Kokoro chapter render failed with "count ('-1') must be non-negative"
— streaming-WAV chunk-size sentinel (0xFFFFFFFF) read as -1.
- Piper render timed out on book-chapter paragraphs with no sentence
punctuation — one giant segment exceeded the 2-min timeout.
Source fix: FlowerCore.TtsReader@826589b. 153/153 tests green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Guacamole 1.6 renders .button.home/.button.logout/.button.reconnect icons
via an absolutely-positioned ::before pseudo-element with width:1.8em
and background-position:.5em .45em. The Blue Jay branding CSS was
clamping every .notification button::before to display:inline-flex
and width:1rem, so only the top-left sliver of the sprite rendered —
appearing as a green/purple garbage rectangle on the connection-not-found
page Home button. Reconnect escaped because it doesn't carry a
background-image on its ::before. The rule was redundant anyway: the
.notification button flex row + padding already spaces the native icon
cleanly. Only the custom .fc-embed-logout-disabled::before override
remains (intentionally dims the replacement disabled-logout pseudo-element).
v202604240140longchunk still hit 400 Bad Request from nomic-embed-text
on several batches — the chars/4 token estimate was optimistic for
code-heavy/Unicode content. Rebuilt from FlowerCore.Common@e1c28b4
which tightens MarkdownChunker hard cap (ChunkSizeTokens × 2, clamped
at 16000 chars) AND adds a character-length check in IndexBuilder's
safety filter alongside the estimated-tokens check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v202604240135longchunk image shipped with only 1 file in the baked
corpus (NEXT-SPRINT.md) because the corpus tar was accidentally built
from the Intranet.Web working directory instead of the Notes repo
root. Rebuilt from the right cwd; new image has the expected 370
*.md + *.html files at /srv/flowercore-notes/docs/.
Same long-chunk handling code as v202604240135longchunk; just a clean
rebuild.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Image bump v202604240108gpu -> v202604240135longchunk, rebuilt from
FlowerCore.Intranet.Web@feat/shared-indexing-search HEAD which transitively
picks up FlowerCore.Common@feat/shared-indexing@105af75:
- MarkdownChunker hard-caps oversized heading-bounded sections at
ChunkSizeTokens × 4 chars and splits with overlap (same pattern as
JsonArticleChunker). Stops the indexer from producing chunks above
nomic-embed-text's 8192-token input limit at the source.
- IndexBuilder gains IndexingOptions.MaxEmbeddingTokens (default 8000)
safety filter — chunks above the cap are warn-logged and dropped
before any batch is sent. New IndexBuildResult.ChunksDropped tracks
how many got skipped.
Goal: notes-md should index 2541/2541 chunks (vs. 2080/2541 last pass)
with zero "Failed to embed batch" 400s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-host routing via desktop.iamworkin.lan/guacamole has been
live-proven (curl → 200) and the Codex single-host-guacamole-wip
merge flipped RemoteDesktop.Web's GuacamolePublicUrl + defaults to
the new path. Nothing else in FlowerCore actively requires the
legacy guac.iamworkin.lan URL.
Removed from the guacamole app:
- IngressRoute `guacamole` matching Host(guac.iamworkin.lan)
- Middleware `guac-add-prefix` (only the legacy route referenced it)
- Certificate `guacamole-tls` (only covered guac.iamworkin.lan)
ArgoCD prune will delete the live resources on next sync. The
pfSense DNS override for guac.iamworkin.lan should be removed
via FlowerCore.DNS as a follow-up operator step — not managed by
this repo.
The new `guacamole-desktop-path` IngressRoute + `desktop-guacamole-path-tls`
Certificate (added in e65de29) handle all Guacamole traffic going
forward.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-part fix on top of the live Shared.Indexing rollout:
1. Image bump v202604240050corpus -> v202604240108gpu, rebuilt from
FlowerCore.Intranet.Web@feat/shared-indexing-search (HEAD includes
the FilePatterns array-merge fix in IntranetSearchOptions). At
runtime each DocCorpusRoot now sees ONLY the patterns explicitly
set in appsettings.json — notes-md gets ["*.md"], notes-html gets
["*.html"], no accidental cross-bleed.
2. New IntranetSearch__OllamaBaseUrl env var pointing at
http://10.0.56.20:11434 (BLUEJAY-WS GPU, R9700 32GB VRAM). Verified
reachable from the cluster and nomic-embed-text:latest is pulled.
This is the workaround for memory feedback_pi5_nomic_embed_slow:
edge1 Pi 5 takes ~189s per 32-chunk batch, projecting full notes-md
indexing (5665 chunks) at ~9 hours; the GPU should land it in minutes.
Edge1 stays the chat default; this env var only redirects the
indexer's bulk embedding calls.
Image distributed to all three RKE2 nodes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cluster Traefik disallows cross-namespace service refs from
IngressRoutes, so the PathPrefix(/guacamole) rule I added to
fc-desktop IngressRoute in 292528e failed with:
"service guacamole/guacamole not in the parent resource namespace
fc-desktop"
Move the /guacamole path match into the guacamole namespace where
the Service actually lives:
- apps/guacamole/guacamole.yaml adds a new `guacamole-desktop-path`
IngressRoute matching `Host(desktop.iamworkin.lan) &&
PathPrefix(/guacamole)` → guacamole:8080 (no add-prefix middleware;
the browser already sends the /guacamole/* path that Guacamole's
servlet serves at).
- New Certificate `desktop-guacamole-path-tls` for desktop.iamworkin.lan
in the guacamole namespace, issued by step-ca-acme. Separate cert
from fc-desktop's remotedesktop-web-tls because Secret refs are
also scoped per-namespace; duplicating the cert is cheaper than
enabling cross-namespace secret refs cluster-wide.
- Revert the cross-namespace attempt in apps/fc-desktop/fc-desktop.yaml
back to a Host-only route. Traefik's router matching precedence
(longer/more-specific rule wins) handles the /guacamole vs
catch-all priority without explicit priority: fields.
Closes the single-host Guacamole URL regression Codex's branch
introduced — GuacamolePublicUrl=https://desktop.iamworkin.lan/guacamole
now resolves to the Guacamole webapp end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-host Guacamole routing — Traefik matches Host=desktop.iamworkin.lan
+ PathPrefix=/guacamole first (priority 20) and forwards to the
guacamole Service in the guacamole namespace on 8080. The existing
Host-only catch-all rule drops to priority 10 so Guacamole traffic
resolves to the more-specific match.
Mirrors the IngressRoute in FlowerCore.RemoteDesktop@master (merged
as part of codex/single-host-guacamole-wip). The RemoteDesktop repo
copy is deploy-ref only — ArgoCD owns the live IngressRoute via
this manifest. Without this change, GuacamolePublicUrl=
https://desktop.iamworkin.lan/guacamole returns 404 because Traefik
routes the whole Host to remotedesktop-web.
Unblocks the per-template AAT smoke against the new public URL
path + closes the final live piece of Codex's single-host routing
work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds TtsReader__Kokoro__Enabled=true + BaseUrl=http://10.0.56.20:10401
+ TimeoutSeconds=120 so the pod routes kokoro-tagged voices to the
Kokoro-FastAPI backend running on BLUEJAY-WS. Multi-engine router
falls through to Piper for piper-tagged and untagged voices.
Requires nftables on BLUEJAY-WS to permit tcp/10401 from 10.0.56/23
and 10.42.0.0/16. Applied to the live ruleset — Puppet Hiera path is
the durable fix (kokoro_server_enabled under profile::security::firewall).
Tests 107 → 114 (+7 MultiEngineSpeechSynthesizerTests).
Companion to the Prometheus alert rules landed in e44e9a0. The
Prometheus rules were loading but never delivered — the monitoring
stack has no Alertmanager configured; **Grafana** owns alert
routing via its built-in engine + webhook contact point to
irc-notify.monitoring.svc:9119. Without a matching Grafana alert,
the Prometheus rules just show up in the Prometheus UI and page
no one.
Adds 6 Grafana alert rules in a new `RemoteDesktop` group under
the AI Stack Alerts folder:
- remotedesktop-web-down (3m) — probe_success{job="probe-remotedesktop"} < 1
- remotedesktop-metrics-stale (10m) — fc_desktop_session_events_total series absent
- remotedesktop-pool-depleted (5m) — fc_desktop_pool_depleted > 0
- remotedesktop-pool-deficit-sustained (10m info) — fc_desktop_pool_deficit > 0
- remotedesktop-session-churn-spike (5m info) — launch rate > 20/min
- remotedesktop-tls-expiry (6h critical) — cert < 2 days to expiry
Each uses the standard Grafana 3-stage pipeline (query → reduce →
threshold) matching the existing AI Stack + Infrastructure alert
patterns. Labels: service=remotedesktop + severity (warning/info/critical).
Default route is `IRC #alerts` via the existing webhook contact point.
Parity with the Prometheus rules (which already fire internally
for the Prometheus UI + any future Alertmanager integration).
Grafana restart picks up the new provisioning on next reload.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cluster egress to github.com is fronted by a step-ca TLS proxy that
returns 404 page not found for unmatched routes — git clone of the
public FlowerCore.Notes repo failed inside the pod even with
GIT_SSL_NO_VERIFY=true. Rather than chase the egress NetworkPolicy /
proxy config, bake the docs corpus directly into the image at
/srv/flowercore-notes/docs.
The corpus is just *.md + *.html (369 files, 2.7 MB uncompressed) —
small enough that re-baking on every deploy is fine and avoids any
runtime network dependency.
Manifest changes:
- Image bump: v202604240040search -> v202604240050corpus
- Removed initContainers (clone-notes-corpus is now redundant)
- Removed notes-corpus emptyDir + its volumeMounts
- Vector-store PVC mount stays.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two egress allows to monitoring-netpol so Prometheus can scrape
FlowerCore.RemoteDesktop:
1. fc-desktop namespace on port 8080 — direct ClusterIP service
target (remotedesktop-web.fc-desktop:8080).
2. traefik-system namespace pods on ports 8080 + 8443 — covers the
Traefik VIP hairpin path for the `https://desktop.iamworkin.lan`
scrape target (CoreDNS wildcard resolves iamworkin.lan hostnames
to the LB VIP; after kube-proxy DNAT, egress needs the backend
pod port allowed per feedback_netpol_dnat_backend_port).
Without these, the fc-remotedesktop scrape times out with "context
deadline exceeded" even though the monitoring-netpol already allows
the 10.0.56.0/24 CIDR — post-DNAT the destination is a 10.42.x.x
pod IP, not the VIP.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cluster egress is fronted by a step-ca TLS proxy whose cert doesn't
match github.com. The init container's git clone failed with
"SSL: no alternative certificate subject name matches target hostname
'github.com'". The Notes repo is public — there is no secret to
protect on the wire — so GIT_SSL_NO_VERIFY=true is the right tradeoff
here. Tag at v202604240040search.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 lane 1 of FlowerCore.Shared.Indexing rollout — wires the new
search consumer in FlowerCore.Intranet.Web to live infrastructure.
Manifest changes:
- Image bump: localhost/fc-intranet-web:latest -> :v202604240040search.
Built from FlowerCore.Intranet.Web@feat/shared-indexing-search and
imported into all three RKE2 nodes (rke2-server, rke2-agent1, rke2-agent2)
via ctr import. Both :latest and :v202604240040search tags are present.
- New PersistentVolumeClaim intranet-vector-store (1Gi, ReadWriteOnce,
Longhorn) mounted at /data for the SQLite vector store
(intranet-vectors.db).
- New emptyDir volume notes-corpus (1Gi sizeLimit) shared between the
init container and main container, mounted at /srv/flowercore-notes
(read-only in the main container).
- New init container clone-notes-corpus (alpine/git) that shallow-clones
https://github.com/astoltz/FlowerCore.Notes.git
(codex/notes-pimanager-live-drift) into /srv/flowercore-notes on every
pod start. Re-clone is cheap (depth=1) and re-runs of git fetch +
reset --hard are idempotent.
- Strategy switched to Recreate for the deployment, since the new RWO
PVC blocks rolling updates — see CLAUDE.md memory "RWO PVC blocks K8s
rolling updates".
- Resource bumps: memory 128Mi -> 256Mi req, 512Mi -> 1Gi limit; CPU
500m -> 1000m limit. The DocsCorpusIndexer + Ollama HTTP calls add
measurable load during the initial index build.
- initialDelaySeconds bumps on both probes (10s -> 30s liveness, 5s ->
10s readiness) to account for startup-time Ollama probing and the
slightly larger image.
The DocsCorpusIndexer waits 15s after host startup before its first
indexing pass, then loops every RescanInterval (default 1h). Its first
run will:
1. Embed all *.md under /srv/flowercore-notes/docs against
nomic-embed-text on edge1 (10.0.57.17:11434).
2. Embed all *.html under /srv/flowercore-notes/docs/dashboards.
3. Persist chunks + embeddings to /data/intranet-vectors.db.
Verify after rollout:
- kubectl -n intranet logs deploy/intranet-web -c clone-notes-corpus
(init container should show the docs/ listing).
- kubectl -n intranet logs deploy/intranet-web -f
(DocsCorpusIndexer should log "Indexing docs root 'notes-md'..." then
"Docs root 'notes-md' indexed: N files, M chunks, M stored").
- curl -sk https://intranet.iamworkin.lan/api/search/indexes
-> ["notes-html","notes-md"]
- curl -sk 'https://intranet.iamworkin.lan/api/search?q=guacamole+single+host&topK=3'
-> hits from docs/infrastructure/guacamole-customization-plan.md
Companion source on FlowerCore.Intranet.Web@feat/shared-indexing-search.
Depends on FlowerCore.Common@feat/shared-indexing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three additions to the monitoring ConfigMap, each targeting
FlowerCore.RemoteDesktop:
- **Scrape jobs** (2 new):
- probe-remotedesktop: blackbox http_2xx against
https://desktop.iamworkin.lan/health every 30s. Feeds the
RemoteDesktopWebDown alert.
- fc-remotedesktop: direct /metrics scrape against
desktop.iamworkin.lan for the fc_desktop_session_events_total
and fc_desktop_pool_* series.
- **Alert group `remote-desktop`** (7 rules in alerts.yml):
- RemoteDesktopWebDown (3m) — /health probe failing
- RemoteDesktopMetricsStale (10m) — absent metrics series
- RemoteDesktopPoolDepleted (5m) — pool deficit + depleted flag
- RemoteDesktopPoolDeficitSustained (10m, info) — persistent
below-desired pool size
- RemoteDesktopSessionChurnSpike (5m, info) — launch rate
>20/min
- RemoteDesktopRecordingEventsDropped (15m, info) — 30m without
recording events while launches active
- RemoteDesktopTlsExpiry (6h, critical) — <2d cert renewal
window; aligns with feedback_acme_expiry_alert_threshold
- **Grafana dashboard mount**: new volumeMounts + volumes entry for
`dashboards-remotedesktop` backed by the grafana-dashboard-remotedesktop
ConfigMap (previously added as a standalone file in d4210c8).
Folder path /var/lib/grafana/dashboards/remotedesktop — picked up
by the file-provider with foldersFromFilesStructure:true so the
dashboard shows up in a "Remotedesktop" folder in Grafana.
No CRLF churn; pure 100-line insertion into LF-normalized file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New surfaces: POST /api/v1/bible/projects (one-click whole-book render),
GET /api/v1/bible/books, GET /api/v1/bible/books/{book}/preview, MCP
tools render_tts_reader_bible_book + list_tts_reader_bible_books,
Dashboard "Render a Bible book" card. 107/107 tests, +7 from previous.
Wraps apps/monitoring/flowercore-remotedesktop-grafana-dashboard.json
as a ConfigMap manifest so ArgoCD syncs it into the cluster alongside
the existing grafana-dashboard-* ConfigMaps. Standalone file — does
NOT modify noc-monitoring.yaml. That keeps the CRLF churn on
noc-monitoring.yaml (sibling files apps/intranet/intranet.yaml and
apps/agent-zero/configmaps-bluejay.yaml also carry CRLF churn) out
of this commit.
Dashboard will be synced into the cluster but NOT loaded by Grafana
until a matching `volumes:` entry lands in the Grafana Deployment
in noc-monitoring.yaml:
- name: dashboard-remotedesktop
configMap:
name: grafana-dashboard-remotedesktop
Plus a `volumeMounts:` entry in the grafana container:
- name: dashboard-remotedesktop
mountPath: /etc/grafana/provisioning/dashboards/remotedesktop
readOnly: true
Those edits are deferred to the CRLF-normalization pass on
bluejay-infra so the review diff stays reviewable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulls in FlowerCore.TtsReader@9e2497f: P2.3 iTunes-namespace podcast
feed (author, summary, category, cover art, episode numbering,
duration, atom:self link, serial channel type for Bible projects) and
P2.4 ID3v2 tags on MP3 export + Vorbis comments on OGG (title, artist
with Piper voice humanized, album, track N/M, genre defaulting to
Religion & Spirituality for Bible or Audiobook for text sources,
date). Phones and podcast apps now show proper track info instead of
"Unknown - Unknown".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New Zabbix 7.2 template under `Templates/FlowerCore` that scrapes
the `/metrics` exposition from FlowerCore.RemoteDesktop and extracts:
- `fc_desktop_session_events_total` split by event (launch/connect/
disconnect/recording), with a dedicated datapoint for the
`browser_datasource="json"` slice to track delegated-auth launches.
- `fc_desktop_pool_ready` gauge sum for warm pools.
Trigger: `nodata(flowercore.remotedesktop.metrics,10m)=1` warns when
the public desktop host stops exposing metrics.
Follows the existing `flowercore-print-ollama.yaml` pattern — import
manually into Zabbix and link to the Print/Desktop host. Not a K8s
manifest; ArgoCD ignores.
Grafana dashboard JSON is drafted at
`apps/monitoring/flowercore-remotedesktop-grafana-dashboard.json`
but still needs a ConfigMap wrap + Grafana Deployment volume mount
in noc-monitoring.yaml before it ships (follow-up).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulls in FlowerCore.TtsReader@63e6b62: P1.1 Media Session API wiring in
fc-media-session.js + quick-player.js + rendered-chapter-player.js, and
P1.2 biblical-name pronunciation lexicon auto-seed on Bible-source
project creation plus apply-bible-defaults endpoint + MCP tool for
existing projects. Tests 81 -> 97 all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Live verification 2026-04-24 caught POST /blobs on dist.flowercore.io
returning 201 Created with the blob persisted — admin write operations
reachable on the public surface. Controller-level strict entitlement
was on, but that gates reads; writes weren't blocked at all.
Fix: add Method(GET) || Method(HEAD) to the Host match on the public
IngressRoute. POST/PUT/PATCH/DELETE now miss every route for
dist.flowercore.io and Traefik returns 404 before the pod sees the
request. Edge-level defense-in-depth on top of the controller's
strict-mode entitlement check.
The internal IngressRoute for dist.iamworkin.lan stays unrestricted —
admin POST /blobs + POST /manifests flows keep working from the lab.
Lights up dist.flowercore.io end-to-end:
- cf-origin-flowercore-io Secret (literal *.flowercore.io Origin Cert,
copied from the telephony/gitea-public/matrix/mail/flowercore/fc-landing
pattern — not via OnePasswordItem yet).
- Traefik Middleware dist-public-profile-header: strips any caller-supplied
X-FC-Distribution-Profile, injects 'public' so the controller's
NamedEntitlementResolverRouter routes to the strict resolver.
- IngressRoute fc-distribution-public: Host(`dist.flowercore.io`) ->
same backing Service as the internal dist.iamworkin.lan route.
Middleware attached; cert secret cf-origin-flowercore-io.
Cloudflare DNS A record dist.flowercore.io -> 74.40.140.24 (proxied)
already created 2026-04-24 via Cloudflare API (record id
e9b957511556f37ff6763f4441acbc45).
Controller entitlement config is still DefaultAllow=false + empty
PublicEditions on the 'public' profile, so every public request
returns 403 by default. Populate FlowerCore__Distribution__EntitlementPublic__PublicEditions__0
via env var when ready to expose specific editions.
- Serves GET /manifests/{edition}/{version}.cert (leaf+intermediate PEM)
- Adds CertChainPem migration on startup (nullable column)
- ManifestSignService now embeds version-specific certChainUrl
Provisioning Agent's verify step will flip from ChainNotServed (Phase 2A
soft-pass) to Valid once a fresh edition is published with this image.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the pre-merge DNS gate to (optionally) scan live-cluster
Certificates + IngressRoutes via kubectl. Closes the coverage hole
where a service's IngressRoute gets deployed from its own repo (not
from bluejay-infra/apps/) and the manifests-only scan misses it —
fc-retail/retail-web-tls stuck Issuing for 15h on a missing pfSense
Unbound override was exactly this class of bug.
Auto mode: if kubectl is on PATH and usable, live-scan runs silently.
--live forces it (and errors out if kubectl can't reach the cluster).
--no-live skips live entirely (CI path with no cluster access).
Immediate live-scan finding on 2026-04-23: 10 orphan *.iamworkin.lan
IngressRoutes from failed e2e / codex / smoke / deleteproof test runs
in fc-php + fc-tenant-default (2026-04-16/17). None have DNS overrides
so their Certificates have been failing to issue for 7 days — the new
CertManagerCertificateNotReady alert will catch them too. Cleanup
(delete abandoned IngressRoutes + Certificates + CertificateRequests)
is a separate task; this check now surfaces them.
The full FQDN fc-llm-bridge.fc-llm-bridge.svc.cluster.local has 4 dots,
which is less than the pod's ndots:5 threshold. The resolver then
applies every entry in the search list BEFORE falling through to the
bare FQDN, and the CoreDNS 'template iamworkin.lan' catch-all matches
"...svc.cluster.local.iamworkin.lan" and returns Traefik VIP
10.0.56.200. The egress NetworkPolicy blocks that VIP (0.0.0.0/0
EXCEPT 10.0.0.0/8), so curl hangs for 30-134s and returns HTTP 000.
Reference: feedback_coredns_ndots_template_collision memory.
Fix: use "fc-llm-bridge.fc-llm-bridge.svc" (2 dots, still <5 so search
expansion still fires, but the first suffix "...svc.cluster.local"
hits the Kubernetes plugin in CoreDNS and returns the real ClusterIP
10.43.67.125 before the iamworkin.lan template is ever consulted).
Verified: pod-exec curl fc:cheap → HTTP 200 with a real chat.completion
envelope (Ollama/gemma3:4b via bridge).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The chat_model flip (62db15c) pointed Agent Zero at
fc-llm-bridge.fc-llm-bridge.svc.cluster.local:8080 but the existing
agent-zero-netpol only allowed egress to specific node IPs
(10.0.56.20:11434, 10.0.57.17:11434, 10.0.57.16:5200, 10.0.56.11:6443)
plus public-internet (with RFC1918 exclusion). ClusterIP traffic to
10.43.0.0/16 was implicitly denied, so pod-exec curl to the bridge
timed out after 134s.
Adds an egress rule allowing TCP 8080 to the fc-llm-bridge namespace
(matched by kubernetes.io/metadata.name which K8s 1.22+ sets
automatically). No ingress changes needed — fc-llm-bridge has no
NetworkPolicy, so the ingress side is already open.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flips Agent Zero's chat_model from direct local Ollama (gemma3:12b via
the 127.0.0.1:11434 sidecar proxy) to the FlowerCore LLM Bridge
(fc:balanced tier, OpenAI-compatible, Anthropic Claude Sonnet under the
hood) so chat turns are spend-tracked and can dispatch to any provider
via a single tier alias.
Scope is intentionally minimal and reversible:
- chat_model: ollama/gemma3:12b/127.0.0.1:11434
→ openai/fc:balanced/fc-llm-bridge internal service URL
- utility_model, embedding_model, browser_model: UNCHANGED
(stay on local 127.0.0.1 Ollama sidecar — no spend, low latency,
not worth routing through the bridge for small-model traffic).
Auth: new A0_SET_chat_model_api_key env var wired to the
fc-llm-bridge-api-keys Secret (field: agent-zero-k8s). The Secret is
synced by a new OnePasswordItem pointing at "FC LLM Bridge API Keys"
in the IAmWorkin vault. Bearer-token auth is now accepted by the
bridge (FlowerCore.LlmBridge@3225f1f).
Rollback: revert this commit; old image v202604231449 is still present
on all RKE2 nodes, and Agent Zero's strategy: Recreate makes the flip
atomic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships the Bearer-token auth fix (FlowerCore.LlmBridge@3225f1f) so Agent
Zero's OpenAI provider can authenticate with Authorization: Bearer in
addition to the original X-Api-Key header.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pods in this cluster inherit ndots=5. External FQDNs with <5 dots (like
api.anthropic.com) are expanded through the search path first, and the 4th
suffix `api.anthropic.com.iamworkin.lan` matches CoreDNS' `template IN A
iamworkin.lan` wildcard — resolves to Traefik VIP 10.0.56.200. TLS connect
lands on Traefik's default cert and the AnthropicClient rejects with
RemoteCertificateNameMismatch/RemoteCertificateChainErrors.
Setting ndots=2 makes the resolver try the bare FQDN first (3 dots in
api.anthropic.com), so the search path never fires.
Reference: memory feedback_coredns_ndots_template_collision. Wider follow-up:
the CoreDNS template plugin should add fallthrough for external public suffixes,
so every FC service calling external HTTPS APIs stops hitting this trap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 1Password item "Claude API Key" stores the key in a standard Password
field (labeled `password`), so the OnePasswordItem operator creates the K8s
Secret with key `password`. Deployment was referencing `credential`, which
made the pod fail with CreateContainerConfigError.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Built from FlowerCore.LlmBridge@6d285b5 (initial scaffold). Imported on all
three RKE2 nodes via podman save + ctr import. Replaces v00000000000000
placeholder — ArgoCD sync will roll the pod.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Staged but NOT applied. Do not git push until the two pre-requisites below
are done. See apps/fc-llm-bridge/README.md for the full order-of-ops.
Manifests (apps/fc-llm-bridge/fc-llm-bridge.yaml, 8 docs):
- Namespace fc-llm-bridge
- OnePasswordItem anthropic-api-key (existing Claude API Key item)
- OnePasswordItem fc-llm-bridge-api-keys (NEW item, pending creation)
- PersistentVolumeClaim fc-llm-bridge-data (2Gi longhorn)
- Deployment fc-llm-bridge (port 8080, uid 1654, readOnlyRootFilesystem,
tcpSocket probes to survive future ApiKeyAuthMiddleware reordering)
- Service fc-llm-bridge ClusterIP
- Certificate fc-llm-bridge-cert (step-ca-acme)
- IngressRoute fc-llm-bridge (fc-llm-bridge.iamworkin.lan, websecure)
Pre-requisites BEFORE git push:
1. pfSense Unbound override fc-llm-bridge.iamworkin.lan -> 10.0.56.200
(currently NXDOMAIN -- verified via nslookup and check-pfsense-dns.py).
Skipping this step puts cert-manager HTTP-01 into ~2h backoff.
2. Create 1Password item `FC LLM Bridge API Keys` in vault IAmWorkin with
password fields: agent-zero-ws, agent-zero-k8s, spare-1, spare-2.
3. Build + import localhost/fc-llm-bridge:v<tag> to rke2-server +
rke2-agent1 + rke2-agent2. Bump image tag from placeholder
v00000000000000 before committing the apply.
Related: ADR-088 (FlowerCore.Notes/ARCHITECTURE.md), design doc at
FlowerCore.Notes/docs/ai-agents/agent-zero-anthropic-bridge.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Synology NAS is configured with community bluejay_monitor
(→ snmp.yml auth 'bluejay_v2'), not public. public_v2 was returning
HTTP 500 from snmp-exporter for this target. Verified bluejay_v2
returns metrics.
Keeps printer (10.0.58.107) on public_v2 — Epson ET-3750 uses
community "public" as documented in its SNMP settings.
Same CoreDNS iamworkin.lan template + ndots:5 hijack as the irc-notify fix.
Anope services (nickserv/chanserv/memo) have been disconnected from unrealircd
for weeks ("Host is unreachable" every 3s). Thelounge server defaults pointed
at the same broken FQDN.
Short name unrealircd.irc.svc resolves to the ClusterIP directly.
Adds a real README describing the 4-step deploy flow, with pfSense Unbound
host overrides as step 1 (the prerequisite that, if skipped, silently breaks
cert-manager HTTP-01 for ~2h per cert until manually diagnosed — root cause
of the 2026-04-22 cluster-wide cert outage).
Adds scripts/check-pfsense-dns.py: parses every apps/*/*.yaml, extracts
hostnames from Certificate.spec.dnsNames and Traefik IngressRoute
`Host(...)` match rules, and fails the check if any don't resolve via the
system DNS (pfSense Unbound on this LAN). Ignores IRC server-link labels,
image tags, comments — only checks hostnames cert-manager and Traefik will
actually use.
Run before `git push` or wire into pre-commit / Gitea Actions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Quick Read's active-sentence highlight was changing font-weight from
regular to semibold, which shifted glyph widths and reflowed the whole
paragraph mid-playback. New image drops the weight change and uses a
1px box-shadow ring instead for a stable layout.
Built from FlowerCore.TtsReader@e77d69d.
The app's ApiKeyAuthenticationMiddleware runs BEFORE /health is mapped, so
unauthenticated probe requests get 404. tcpSocket probes verify the listener
is up without auth, which is correct for an internal K8s probe (kubelet
talks pod IP directly, not externally).
Real fix is in the app: move /health before the middleware or mark it
[AllowAnonymous]. Tracked separately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The app exposes /health (Program.cs:91 maps a Healthy text response) but does
NOT expose /metrics/prometheus. K8s liveness/readiness probes against a 404
endpoint kept the pod in CrashLoopBackOff after PVC mount was added.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Live cluster had a Longhorn PVC `signalcontrol-data` mounted at /app/data
since 2026-04-14, but the bluejay-infra git manifest never declared it. As a
result, when ArgoCD recreated the Deployment from git (after deletion to fix
an unrelated selector-label mismatch caught during cert-manager recovery),
the new pod started without /app/data and crashed with `SQLite Error 14:
unable to open database file 'data/signalcontrol.db'`.
Bring git in line with reality: declare the PVC, mount it, and switch the
Deployment to `strategy.type: Recreate` (RWO PVC blocks rolling updates per
existing memory feedback_k8s_rwo_rollout.md).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CoreDNS iamworkin.lan template + ndots:5 was hijacking
unrealircd.irc.svc.cluster.local lookups → Traefik VIP → timeout.
Every alert since ~2026-04-09 silently failed with "IRC send failed: timed out",
which also killed the thermal-printer path (routed through irc-notify).
Same fix pattern as guacamole@28b7600.
The guac-k8s-sync CronJob has been crash-looping (exit 7) since the
2026-04-11 run. Root cause: CoreDNS has an `*.iamworkin.lan`
template wildcard, and the Kubernetes pod resolv.conf ships with
`ndots:5` plus a search list that includes `iamworkin.lan`.
Resolving `guacamole.guacamole.svc.cluster.local` (4 dots < 5) goes
through search-suffix expansion BEFORE the bare FQDN. The iamworkin.lan
suffix makes it `guacamole.guacamole.svc.cluster.local.iamworkin.lan`,
which matches the template and answers with Traefik LB VIP
10.0.56.200. That VIP has no pod-network hairpin route, so curl exits
with 'No route to host'.
Using the short name `http://guacamole:8080` keeps the query at 0
dots, search expansion runs on the bare name, and the in-namespace
`guacamole.svc.cluster.local` suffix hits the Kubernetes CoreDNS
plugin directly (ClusterIP 10.43.229.31).
Alt fixes considered but not taken: trim the CoreDNS template regex
to exclude `.svc.cluster.local.` prefixes (cross-cutting, higher
blast radius); trailing-dot FQDN in the URL (curl/Java HTTP clients
handle inconsistently).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets live SIP AATs (ext 901–904, from-internal context) dial *832 to
exercise the Victory Day workflow + Fun Menu + AsteriskGameHandler path
without routing through Twilio. Mnemonic: *832 = V-D-A (8-3-2) from the
V-D-A-Y keypad pattern.
Maps to Stasis(flowercore-pbx,inbound-pstn,+15074618329) — same call-
type classification as a real Twilio-inbound call to the VDAY DID, so
InboundPstnHandler routes to the seeded VDAY workflow identically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous commit 90deacd raced with the user's f0733ff (which had
already pinned the guacamole web Deployment to rke2-server for the
NFS ACL). That left two nodeSelector blocks on the web pod and an
inconsistent agent2 pin on guacd. Align both pods to rke2-server.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Synology NFS export at /volume1/kubernetes currently grants mount
permission only to 10.0.56.13 (rke2-agent2). rke2-agent1 gets
"access denied by server". guacd + guacamole web both need the
recordings volume, so co-locating is also efficient. Remove the
nodeSelector once the Synology NFS ACL opens to all cluster nodes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the 1Password vault JAR to the Guacamole pod so connection params
like ${OP:ItemTitle/fieldLabel} are resolved from 1Password Connect at
tunnel-open time. Credentials never land in MySQL — only token literals.
Deployment changes:
- env: OP_CONNECT_URL=http://10.0.56.10:8180, OP_VAULT_ID=..., plus
OP_CONNECT_TOKEN from secret/guacamole-1password-token/credential.
- env: ENABLE_ENVIRONMENT_PROPERTIES=true so OP_* env vars render as
op-connect-url / op-connect-token / op-vault-id properties the
extension reads.
- volumeMount for guacamole-vault-jar at
/etc/guacamole/extensions/guacamole-vault-1password-1.0.0.jar
- volumeMount for guacamole-logback so we see DEBUG token-inject lines.
- nodeSelector kubernetes.io/hostname=rke2-server — the Synology NFS
export for /volume1/kubernetes currently only allows rke2-server.
Followup: add rke2-agent1/2 to the export and remove this selector.
New ConfigMaps:
- guacamole-vault-jar (binaryData, ~312KB JAR, Gson shaded, built from
FlowerCore.Notes/k8s/guacamole/extensions/1password-vault via mvn).
- guacamole-logback with DEBUG on io.flowercore.guacamole.vault — drop
to INFO once resolution is proven stable.
Existing guacamole-properties: added onepassword-vault to extension-priority.
The guacamole-1password-token Secret is NOT in git — it holds a verbatim
copy of the onepassword-connect-operator bearer token. Followup task:
provision a scoped Connect token for Guacamole and rotate the copy out.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First pass used nfs.path=/volume1/kubernetes/guacamole/recordings,
which triggered "mount.nfs: access denied by server" on rke2-agent1.
Synology NFS export is scoped to /volume1/kubernetes; match the
working fc-desktop pattern: mount the export root and select the
subdirectory via volumeMount.subPath.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 5 of docs/infrastructure/guacamole-customization-plan.md:
- Mount /volume1/kubernetes/guacamole/recordings (Synology 10.0.58.3)
into both guacd (writer) and guacamole web (reader) at
/var/lib/guacamole/recordings
- Set RECORDING_SEARCH_PATH env on guacamole web -- the Guacamole
Docker entrypoint treats any RECORDING_* var as an enable signal
for the history-recording-storage extension (symlinks the JAR
from /opt/guacamole/environment/RECORDING_/extensions/ into
GUACAMOLE_HOME/extensions/)
Per-connection recording still requires setting recording-path on
each connection in MySQL -- follow-up task. This commit enables
the plumbing; no sessions record yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same ArgoCD + SSA self-heal loop pattern as guacamole (20e4130):
K8s defaults volumeMode=Filesystem on volumeClaimTemplates at
creation, git omits it, argocd-controller owns the atomic list so
every reconcile sees drift, and volumeClaimTemplates is immutable
so it can never reconcile. Adding the field closes both loops.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the infra-guacamole OutOfSync sync loop. K8s API sets
volumeMode=Filesystem as a default on volumeClaimTemplates at creation,
but the git manifest omitted it. ArgoCD uses ServerSideApply with
atomic ownership of volumeClaimTemplates, so every sync saw a
desired/live mismatch on that one field. volumeClaimTemplates is
immutable after creation so ArgoCD could never reconcile it --
autoHealAttemptsCount climbed to 6091. Adding the field to git
matches live and breaks the loop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- fc-chat.yaml: TLS/IngressRoute only (Deployment managed by deploy script, matches fc-signage/fc-mysql/fc-kiosk pattern)
- fc-menuboard: new app bundle
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups to the Piper TTS wire-up landed in d3ffad9:
1. Telephony-web runs as uid 1654 (non-root), but the hostPath at
/tmp/tts-audio is owned by root:root 0755. Pod couldn't write .sln16
files — every Piper call would succeed at the HTTP layer and then
fall back to the sound map when File.WriteAllBytesAsync threw
"Permission denied." Extend the existing fix-data-perms initContainer
to chown the shared-tts mount too (0755 world-readable, so the
Asterisk pod — running as a different uid — can still read).
2. Pod security context now explicitly sets runAsNonRoot: true + runAsUser
1654 + runAsGroup 1654 (cluster policy), matching the pattern used
by every other FlowerCore service.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Piper was never reachable on 10.0.57.15 — edge1's actual address is
10.0.57.17 (SSH config, project_edge1_sdcard memory). Every telephony
prompt hit the 8s HttpClient timeout and fell back to the built-in sound
map (vm-advopts, vm-goodbye, beep) instead of speaking the real workflow
text. Verified from noc1: `curl http://10.0.57.17:8500/health` returns
HTTP 200 in 6ms, `POST /tts` returns a 16kHz mono WAV in 606ms.
Changes:
- apps/telephony/telephony.yaml
- `Tts.PiperUrl` → `http://10.0.57.17:8500`
- NetworkPolicy egress allow → `10.0.57.17/32:8500`
- Header comment now documents the POST /tts {"text":"..."} contract
- telephony-web pod mounts `/shared-tts` from hostPath `/tmp/tts-audio`
(rke2-agent1). This is where `AsteriskProvider.SpeakTextAsync` writes
the synthesized .sln16 before calling ARI `Play sound:tts/<name>`.
- apps/asterisk/deployment.yaml
- Asterisk pod mounts the same hostPath at
`/var/lib/asterisk/sounds/tts` so it can read and play what
telephony-web wrote. Both deployments have
`nodeSelector: kubernetes.io/hostname: rke2-agent1` so the hostPath
is guaranteed to be the same directory.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CoreDNS wildcard for iamworkin.lan catches unresolved names and returns
the Traefik VIP (10.0.56.200), so downloads.asterisk.org from inside a
pod returns 404 from Traefik rather than the real Sangoma mirror. Pin
the real IP (165.22.184.19 = oss-downloads.sangoma.com) via hostAliases
so curl reaches the actual server.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cluster egress goes through a step-ca-fronted TLS proxy that install-sounds
doesn't trust ("SSL certificate problem: self-signed certificate"). The
Asterisk core sounds tarball is a public artifact; integrity is enforced
downstream when Asterisk plays the file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The install-sounds init container was a stub that left /var/lib/asterisk/sounds/en
empty. Result: every SpeakText fallback path (vm-advopts, vm-goodbye, characters:*,
digits/*, beep, pbx-invalid) resolved to a missing file, Asterisk silently failed
each Playback, zero RTP was produced, and callers heard dead air. This is why
dialing *0 (Settings Menu) or *100 (Debug IVR) "picks up quietly" — there is
literally nothing to stream.
Replaced the stub with alpine:3.20 + curl + tar that downloads the pinned
asterisk-core-sounds-en-ulaw-1.6.1.tar.gz (~10 MB) from downloads.asterisk.org
and unpacks it into the sounds emptyDir. Idempotent — skips download if
vm-goodbye is already present.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Full deployment manifests (Namespace, Deployment, Service, Certificate,
IngressRoute) for 4 new FlowerCore services with port 8080, ClusterIP
on port 80, cert-manager step-ca-acme TLS, and /metrics/prometheus
health probes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bitnami/kubectl image doesn't have python3. Replaced all python3
JSON parsing with grep/cut for auth token and connection data.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Updated bluejay-branding-1.0.0.jar with gold accents, hover fix,
icon fix, pinstripe patterns, Blue Jay SVG logo
- Added guac-k8s-sync CronJob: runs every 2min, auto-updates pod
names in Kubernetes exec connections when pods restart
- Fixed secret reference (guacamole-credentials, not guacamole-db-credentials)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RKE2 Traefik has no ACME certResolver configured, so IngressRoutes
using certResolver: step-ca silently fall back to the Traefik default
self-signed cert. Fix by using cert-manager Certificate resources with
the step-ca-acme ClusterIssuer and tls.secretName in IngressRoutes.
- fc-landing: Add Certificate, change tls: {} to tls.secretName
- fc-mysql: New app (Certificate + IngressRoute only)
- fc-php: New app (Certificate + IngressRoute only)
- fc-desktop: New app (Certificate + IngressRoute only)
- fc-signage: New app (Certificate + IngressRoute, plus HTTP route for players)
Deployments/Services for mysql/php/desktop/signage are managed by
deploy scripts, not ArgoCD. These apps only manage TLS + ingress.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the 188KB ConfigMap-embedded HTML with a proper Blazor Server
deployment (fc-intranet-web:latest on port 5300). The old nginx deployment,
ConfigMaps (intranet-html, intranet-nginx-conf), and all embedded HTML are
removed. The intranet is now a .NET 10 Blazor app with live health monitoring,
REST API, 49 pages, and the unified Blue Jay theme.
Source: github.com/astoltz/FlowerCore.Intranet.Web
- guacamole-branding ConfigMap with Blue Jay dark theme CSS
- guacamole-properties ConfigMap with ban/TOTP/session config
- kubectl-proxy sidecar on guacd for K8s pod exec connections
- guacd-exec ServiceAccount + ClusterRole/Binding for pod exec RBAC
- Volume mounts for branding JAR and properties on guacamole webapp
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Infrastructure manifests for ArgoCD. An `ApplicationSet` in `argocd` namespace watches the `apps/*` directories in this repo and creates one `Application` per subdir (prefixed `infra-<name>`).
## Adding a new service to the cluster
Follow these steps in order. **Step 1 must run before step 3** — if you skip it, cert-manager HTTP-01 will silently fail for ~2h per cert (exponential backoff) until someone diagnoses the DNS.
### 1. Create or verify the FlowerCore.DNS A record (REQUIRED for current HTTP-01 manifests)
step-ca (the ACME CA on noc1) runs in a Podman container with host networking. Its container resolver uses pfSense Unbound (10.0.56.1), **not** cluster CoreDNS. So even though CoreDNS has a wildcard `*.iamworkin.lan → 10.0.56.200` for in-cluster lookups, step-ca cannot see it. Every new public hostname needs an explicit pfSense host override.
The management path is now `FlowerCore.DNS`, not `FlowerCore.Notes/scripts/pfsense-add-dns-overrides.py`. Add or verify the public A record there before you apply the manifest:
```bash
curl -sk https://dns.iamworkin.lan/api/v1/servers
# Find the pfSense serverId, then create the record using the host label only.
# Example: for foo.iamworkin.lan, use "name":"foo".
curl -sk -X POST https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iamworkin.lan/records \
# for every Certificate dnsName and Traefik Host(...) rule it finds.
python scripts/check-pfsense-dns.py --live
# Optional stronger pass when kubectl access is available; also checks
# live-cluster Certificates and IngressRoutes for drift outside manifests.
```
**Symptom if you skip this:** the Certificate resource stays `Ready: False` with `status.reason: unexpected non-ACME API error: context deadline exceeded`. Recovery requires `kubectl -n <ns> delete order <order-name>` after adding the DNS to bypass cert-manager's backoff.
### 2. Create the app manifest
Create `apps/<name>/<name>.yaml` containing the Namespace, Deployment, Service, Certificate, and IngressRoute. Reference an existing directory (e.g. `apps/fc-messageboard/`) for the canonical shape.
Conventions:
-`Namespace` has label `app.kubernetes.io/part-of: bluejay-infra`
-`Deployment.spec.selector.matchLabels` and `Service.spec.selector` MUST use the same label key. The historical convention here is `app: <name>` (not `app.kubernetes.io/name`) — don't mix.
- Image: `localhost/<name>:v<YYYYMMDD><HHMM>`, `imagePullPolicy: Never`. Import the image to every RKE2 node (server + both agents) via `ctr images import` before applying — pods schedule anywhere.
- If the app persists local state (SQLite, uploads), declare the `PersistentVolumeClaim` here with `storageClassName: longhorn` and `accessModes: [ReadWriteOnce]`. Add `strategy.type: Recreate` to the Deployment — RWO PVC blocks rolling updates.
- Probes: use `tcpSocket` if the app has middleware that intercepts unauth requests (returns 404/401 for `/health`). Otherwise prefer `httpGet` against whatever the app exposes (verify the path isn't gated by auth).
- Certificate: `issuerRef.name: step-ca-acme`, `issuerRef.kind: ClusterIssuer`. `dnsNames` must match the hostname you created in FlowerCore.DNS in step 1.
### 3. Commit & push
```bash
git add apps/<name>/
git commit -m "<name>: initial deployment"
git push
```
ArgoCD's `ApplicationSet` picks up the new directory within ~3 minutes and creates `infra-<name>` with auto-sync + self-heal enabled.
Certificate should be `Ready: True` within ~60s. If it stalls `False` for >2m, the pfSense DNS step got skipped — go back to step 1, then `kubectl -n <ns> delete order <order-name>` to bust the backoff.
### Pre-merge gate
Before `git push`, always run:
```bash
python scripts/check-pfsense-dns.py
```
It's a quick service-backed check that would have caught the entire 2026-04-22 cert-manager outage. Consider wiring it into a pre-commit hook or a Gitea Actions workflow.
## Retiring a service
1.`kubectl -n argocd delete application infra-<name>` (cascade deletes the K8s resources via ArgoCD finalizers)
2.`git rm -r apps/<name>/` and push
3. Remove the FlowerCore.DNS record through the UI or API, for example:
- **CoreDNS template + ndots:5 collision**: inside pods, `<svc>.<ns>.svc.cluster.local` with <5 dots gets search-expanded through `iamworkin.lan` FIRST and hits the wildcard template → resolves to Traefik VIP, not the real ClusterIP. Use short service names (`<svc>`) in K8s manifests. See memory `feedback_coredns_ndots_template_collision.md`.
- **Image not on node**: pods stuck `ErrImageNeverPull` means the image wasn't imported to the node Kubernetes scheduled the pod onto. `ctr images import` on all of rke2-server, rke2-agent1, rke2-agent2.
- **StatefulSet PVC drift**: `volumeClaimTemplates` needs explicit `volumeMode: Filesystem` or ArgoCD SSA self-heals forever. See memory `feedback_argocd_statefulset_pvc_drift.md`.
- **IngressRoute namespace split**: this RKE2 Traefik install does not allow cross-namespace service refs. Keep the `IngressRoute`, backend `Service`, and TLS secret in the same namespace; if one host is shared across namespaces, duplicate the `Certificate` and move the route next to the destination service.
- **Public read-only hosts**: if a public host fronts a service that also exposes admin writes internally, add a Traefik route match like `Host(...) && (Method(GET) || Method(HEAD))` on the public edge instead of trusting the app to reject unsafe methods.
- **Public read-write allowlist hosts**: if a public host accepts a tightly bounded write surface (e.g. bootstrap-JWT POST), pin the allowlist as `(Method(GET) || Method(HEAD) || Method(POST) || Method(OPTIONS))`. PUT/PATCH/DELETE must still 404 at the route. Track A's `updatecenter.iamworkin.lan` / `updates.iamworkin.lan` are the canonical example. The lint test enforces this invariant.
- **Traefik VIP netpols**: when a `NetworkPolicy` allows `10.0.56.200`, also allow the post-DNAT backend ports (`8443` for TLS plus `8080` or `8000` for HTTP) or Calico will drop the rewritten flow.
- **Auth-safe probes**: services behind API-key or global auth middleware should prefer `tcpSocket` probes unless `/health` is explicitly exempted before the middleware runs.
- **ArgoCD must use internal Gitea URL**: `http://gitea-clusterip.gitea.svc.cluster.local:3000/bluejay/bluejay-infra.git`, not the external HTTPS URL (step-ca cert isn't trusted by ArgoCD). The `ApplicationSet` and any hand-created `Application` must both use the internal URL.
## Local manifest lint
The repo now carries a local-first lint pass for the recurring K8s gotchas that have burned the fleet:
```bash
dotnet test tests/bluejay-infra-lint/BluejayInfraLint.Tests.csproj -c Release
```
That test project sweeps `bluejay-infra/apps/**` plus the canonical sibling `FlowerCore.*\\k8s` manifests that share the same workspace. Matching `conftest.dev` policy files live under `tests/bluejay-infra-lint/conftest.dev/` for environments that also have `conftest` or `opa`.
## Non-K8s Pi Artifacts
Some `apps/*` directories are deployment artifact bundles consumed by Puppet
instead of Kubernetes workloads. `apps/fc-signage-pi-player/` carries the
Chromium signage Pi player, `apps/fc-divoom-dm-pi-device/` carries the additive
edge2 Divoom-as-DeviceManagement-device profile/Hiera contract, and
`apps/fc-divoom-tv-pi/` carries the Divoom TV Pi HDMI systemd/Puppet shape.
These bundles intentionally avoid Deployment, IngressRoute, Certificate, and
| `service-web.yaml` | `Service` `fc-devicemgmt-web` (ClusterIP, 80→8080) | Owned by the DM.Web deploy. |
| `deployment-web.yaml` | `Deployment` `fc-devicemgmt-web` | Currently `replicas: 0` (gated on fc-mysql operator + `flowercore_devicemgmt` DB + 1Password runtime item — see header comment). Not a Cl-5 concern. |
| also present | operator RBAC, namespace, network-policy, 1password-item | Full app dir, ArgoCD-managed. |
## Why the admin console needs nothing new
The existing IngressRoute matches **`Host(\`devices.iamworkin.lan\`)` with no `PathPrefix`
constraint**. Traefik therefore forwards *all* paths on that host to the
`fc-devicemgmt-web` service — including any admin/helpdesk routes the DM.Web app exposes
under its `FlowerCore:PathBase` (e.g. `/admin`, `/helpdesk`). The same TLS secret
(`fc-devicemgmt-web-tls`) and the same step-ca ACME `Certificate` already protect them.
This matches the established TLS-only-app pattern (e.g. `apps/fc-library/fc-library.yaml`,
Source-controlled Puppet/Hiera deployment contract for registering the edge2
Divoom MiniToo panel as a FlowerCore DeviceManagement-managed Pi device.
This is not a Kubernetes application. The live panel remains the existing
edge2 `flowercore-divoom.service` managed by `FlowerCore.Puppet`
`profile::pi::service::divoom`, with the .NET payload deployed out of band
and `/opt/flowercore/divoom/data` plus the Bluetooth shell wrappers preserved.
Because edge2 is already Hiera-driven through `profile::pi::service::apps`,
the deploy home is additive `profile::pi::service` data/profile source, not
`profile::edge::service::apps` and not an ArgoCD/K8s app.
## Scope
- Stage DeviceManagement registration metadata for the edge2 Divoom MiniToo.
- Stage a separate, disabled-by-default DM Agent executor unit for privileged
Bluetooth operations once the DM-RPC lane lands.
- Keep `flowercore-divoom.service` and `flowercore-divoom-bt.service`
untouched: no service replacement, no restart subscription, no K8s surface.
- Preserve the current wrapper contract:
`/opt/flowercore/divoom/bt-link.sh`,
`/opt/flowercore/divoom/bt-reset.sh`, and
`/opt/flowercore/divoom/audio-link.sh`.
- Keep FM radio disabled and require visible render proof; device-info echo is
not render proof.
## Artifact Map
| Path | Use |
| --- | --- |
| `hiera/edge2-divoom-dm-device.overlay.yaml` | Additive Hiera overlay for edge2. Merge into the existing node YAML without removing `fc-pimanager` or `fc-divoom`. |
| `puppet/profile/pi/service/divoom_dm_device.pp` | Puppet profile shape to vendor into `FlowerCore.Puppet` after the DM-RPC executor binary exists. |
| `puppet/templates/flowercore-divoom-dm-agent.service.epp` | Separate DM Agent systemd unit. Defaults are stopped and disabled until a later cutover. |
## Rollout Notes
1. Land these artifacts in bluejay-infra as the deploy contract.
2. Vendor the Puppet profile and EPP templates into `FlowerCore.Puppet`.
3. Merge the Hiera overlay into `data/nodes/edge2.iamworkin.lan.yaml`.
4. Run Puppet in noop first, preferably with a node-local validation directory
under `~/.fcv` rather than `/tmp`.
5. Only enable the DM Agent service after the DeviceManagement BT executor has
### 2. Create the `FC LLM Bridge API Keys` 1Password item
The `Claude API Key` item in vault `IAmWorkin` already exists (id
`e5tth3y5mp3lhdavg35pxadzca`, see `docs/ai-agents/anthropic-integration.md`).
The new item for per-consumer bridge API keys does NOT yet exist. Create it
before the first apply of this manifest — the Deployment marks the individual
key env vars `optional: true` so missing keys will not crash the pod, but the
bridge will reject every request with 401 until at least one key is populated.
| Field | Item position | Type | Purpose |
|-------|---------------|------|---------|
| `credential` | Top section | Password (random, 48 char) | Unused placeholder required by the 1Password schema for single-field items. Can be anything — this file is never read by K8s. |
| `agent-zero-ws` | "API Keys" section | Password (random, 48 char) | API key for the BLUEJAY-WS Agent Zero instance. |
| `agent-zero-k8s` | "API Keys" section | Password (random, 48 char) | API key for the K8s-hosted `agent-zero` Deployment. |
Apple TV signage is a sealed appliance running the `FlowerCore.Signage.Agent.AppleTv` tvOS app per ADR-134.
This ApplicationSet entry is documentation and inventory metadata only. It intentionally creates no `Deployment`, `Service`, or `Pod`.
The Apple TV app connects outbound to existing FC.Signage.Web surfaces:
-`https://signage.iamworkin.lan/hub/signage` for SignalR live status.
-`GET /api/v1/nodes/{nodeId}/state` for the 30 second polling fallback.
-`POST /api/v1/nodes/register` and `POST /api/v1/nodes/{nodeId}/enroll` for pairing and mTLS enrollment.
-`POST /api/v1/nodes/{nodeId}/heartbeat` for metrics, current content identity, and local audit excerpts.
Distribution is via Apple Developer Enterprise Program or TestFlight plus FC.Distribution / UpdateCenter publishing once Apple credentials are available.
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.