Compare commits

...

351 Commits

Author SHA1 Message Date
Andrew Stoltz
59543016c0 runners: add github-runner-updater Deployment
FlowerCore.Updater had only the offline bluejay-ws-sandbox-1 Windows
runner registered (the specialized fcsetup E2E target) and no Linux
self-hosted runner, leaving the repo with no Linux PR-CI capacity for
any future workflow. Modeled on github-runner-pimanager (Sprint 32
long-tail final entry, 2026-05-25); two replicas with per-pod emptyDir
caches to keep ReadWriteOnce PVC contention out of the picture.

Also registers github-runner-updater in the LinuxRunnerRepos +
ScaledLinuxRunnerDeployments fleet-lint sets so future suite repairs
treat the entry as canonically required (the 6 pre-existing lint
failures on this file family are orthogonal: initContainer single-
container count assertion + fc-updater IngressRoute POST allowlist
+ DM ApplicationSet convention drift).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 22:22:41 -05:00
Andrew Stoltz
bc3548e715 runners: add github-runner-pimanager Deployment
FlowerCore.PiManager build run 26417714843 sat queued 5h with zero
self-hosted runners registered to the repo. PiManager was missed in
the Sprint 32 long-tail sweep — every other FC repo got a dedicated
repo-scoped Deployment with its own ACCESS_TOKEN registration, but
PiManager fell through the cracks.

Adds a 2-replica ephemeral runner Deployment matching the Signage /
DMS / Print.Web pattern (per-pod emptyDir caches, no shared PVC,
labels `self-hosted,linux,fc-build-linux`, shared github-runner-token
PAT). Once ArgoCD syncs, the queued job will pick up automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 20:33:44 -05:00
74333cc26b selenium: right-size hub + chrome + edge memory limits (#28) 2026-05-26 01:12:15 +00:00
Andrew Stoltz
7310fb88c2 selenium: right-size hub + chrome + edge memory limits
Edge node has been OOMKilled 51 times in 5 days (~1 every 2.4h) on a
1Gi memory limit. Chrome runs maxSessions=2 on the same 1Gi cap and
was idling at 684Mi — first concurrent session pushing the node to
~900Mi+ would be the next OOM. Hub was running at 766Mi against a 1Gi
limit (75%); no recent restarts but no headroom either.

Firefox node has been running at 2Gi memory limit for 9 days with
zero restarts — that is the right size for a Selenium 4.27 browser
node under our session profile (screen recording sidecar + 1080p
rendering + page captures). Match it.

Changes:
- Hub:    limit 1Gi -> 1.5Gi, request 512Mi -> 1Gi
- Chrome: limit 1Gi -> 2Gi,   request 512Mi -> 1Gi
- Edge:   limit 1Gi -> 2Gi,   request 512Mi -> 1Gi

CPU left alone on all three — observed utilization is well under the
existing limits (hub 54m / 500m, chrome 185m / 1, edge 11m / 1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 20:11:41 -05:00
148bc87b9a runners: bake step-ca root CA into image (v20260525-stepca) (#27) 2026-05-26 01:04:14 +00:00
Andrew Stoltz
2a1e842100 runners: bake step-ca root CA into image (v20260525-stepca)
Without the IAmWorkin step-ca root CA in the runner image's system
trust store, .NET HttpClient calls from CI tests against
`*.iamworkin.lan` (e.g. `https://selenium.iamworkin.lan/session`) fail
with `The remote certificate is invalid because of errors in the
certificate chain: PartialChain`. FlowerCore.Print.Web's
`WebScreenshotService` unit tests hit this on every build.

Drop the step-ca root PEM into `/usr/local/share/ca-certificates/`,
run `update-ca-certificates` once during apt install, and let OpenSSL +
.NET-on-Linux read the regenerated `/etc/ssl/certs/ca-certificates.crt`
automatically — no `SSL_CERT_FILE` env var, no per-Deployment volume
mount.

Image rebuilt + saved + imported on all 3 schedulable RKE2 nodes
(rke2-server, rke2-agent1, rke2-agent2) before this PR — verified with
`ctr images list -q | grep stepca` on each node.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 19:55:38 -05:00
bc28430d24 selenium: allow github-runner namespace ingress on 4444 (#26) 2026-05-26 00:44:23 +00:00
Andrew Stoltz
cc92272217 selenium: allow github-runner namespace ingress on 4444
Unblocks CI jobs running in github-runner pods (e.g. FlowerCore.Print.Web
`help-screenshots`) from reaching selenium-hub. Previously the session
POST was DNAT'd to the hub pod IP then dropped at the Calico ingress
hook, surfacing as a 60s timeout against
http://selenium-hub.selenium.svc.cluster.local:4444 while the Selenium
UI showed 0/4 sessions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 19:43:12 -05:00
d6f4468a9c selenium: migrate hub + 3 nodes into ArgoCD-managed manifests (#25) 2026-05-26 00:09:35 +00:00
Andrew Stoltz
2f796a2ebd selenium: migrate hub + 3 nodes + service + ingressroute into ArgoCD
Previously orphan kubectl-applied since the Selenium Grid was first set
up. The `infra-selenium` ArgoCD app existed but only managed
`network-policy.yaml` — the deployments themselves drifted whenever
anyone `kubectl set env`'d or `kubectl scale`'d.

This commit captures the live state (with the 2026-05-25 maxSessions
bump for chrome already baked in) as canonical git source. ArgoCD's
ServerSideApply syncPolicy + selfHeal will now keep the grid in lock
step with this file.

Resources captured:
  - Service selenium-hub (ClusterIP, internal traffic on 4444)
  - Service selenium-hub-external (LoadBalancer, MetalLB 10.0.56.208)
  - Deployment selenium-hub
  - Deployment selenium-node-chrome (replicas=1, SE_NODE_MAX_SESSIONS=2)
  - Deployment selenium-node-firefox (replicas=1, maxSessions=1)
  - Deployment selenium-node-edge (replicas=1, maxSessions=1)
  - IngressRoute selenium-hub (Traefik, selenium.iamworkin.lan)

No live behavior change — server-side dry-run confirms unchanged for
hub/firefox/ingressroute, "configured" for hub-external + 3 deploys
(default-field reordering only; SSA + field managers handle the diff).

Refs: Sprint 33 morning-routine 2026-05-25 follow-up Q-MR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 19:08:55 -05:00
1f1f6823db runners: right-size replica counts per 14d CI activity (#24) 2026-05-26 00:01:47 +00:00
Andrew Stoltz
b92f74b63a runners: right-size replica counts per 14d CI activity data
Drop 2 → 1 for 10 deploys based on trailing-14d run counts:
  - LlmBridge, Media, Knowledge, Intranet.Web, DNS  (0 runs each)
  - Presentations (6), Redis (3), Provisioning (3),
    MessageBoard (3), MenuBoard (3)

Bump 2 → 3 for Print.Web: 12 runs in trailing 5d, and the
help-screenshots AAT job holds a runner 30+ min, creating
head-of-line blocking for parallel PRs.

Net change: -9 replicas (≈ -9 GiB committed memory).
Aligns with Sprint 33 morning-routine capacity audit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 18:55:47 -05:00
Andrew Stoltz
cb7f7dbc4d authentik: generous startup/liveness probes for first-boot migration
The server pod was getting killed by liveness probe at 60s while still
waiting on migration DB lock (worker pod also running migrations against
same DB). Add startupProbe with 10.5 min budget so liveness doesn't fire
until migrations finish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 16:03:03 -05:00
Andrew Stoltz
03126d5584 authentik: add fsGroup:1000 to server + worker so non-root uid can write /media
PermissionError: [Errno 13] Permission denied: '/media/public' in tenant_files
migration because Authentik container runs as uid 1000 but Longhorn PVC mounts
root:root by default. fsGroup on Pod securityContext recursively chgrps the
PVC mount to gid 1000 + chmods g+rwx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 15:58:35 -05:00
Andrew Stoltz
495e884c41 authentik: initial deployment at id.iamworkin.lan
Stack:
  - PostgreSQL 16 StatefulSet (Longhorn RWO 5Gi)
  - Redis 7 Deployment (no persistence)
  - Authentik server + worker (ghcr.io/goauthentik/server:2024.12.3)
  - Shared media PVC (Longhorn RWO 2Gi) between server+worker
  - Certificate via step-ca-acme ClusterIssuer
  - Traefik IngressRoute at id.iamworkin.lan

Secrets sourced from 1Password item 'authentik-credentials' (IAmWorkin
vault, id y6i74ch22q5wvm7znquq4nhhcu) via OnePasswordItem CRD. Fields:
AUTHENTIK_SECRET_KEY, POSTGRES_PASSWORD, REDIS_PASSWORD,
BOOTSTRAP_ADMIN_PASSWORD, BOOTSTRAP_ADMIN_TOKEN, BOOTSTRAP_ADMIN_EMAIL.

DNS A record id.iamworkin.lan -> 10.0.56.200 added via
scripts/pfsense-add-id-host.py (FlowerCore.DNS service was 502'ing on
pfSense diag_command.php response parsing).

Closes the immediate gap from PiManager OIDC Cohort 3 wire-up: PiManager
(a87cd6f) configures id.iamworkin.lan as JWT authority but the backend
was never deployed. Pirelay specifically is on Mode:apikey until this
backend is bootstrapped and a pimanager service-account exists.

Post-deploy bootstrap (manual once pods Ready):
  1. Login at https://id.iamworkin.lan/if/admin/ as akadmin
     using BOOTSTRAP_ADMIN_PASSWORD from 1Password.
  2. Create OAuth2/OpenID Provider for pimanager (issuer
     https://id.iamworkin.lan/application/o/pimanager/, audience 'pimanager').
  3. Create Application binding the provider.
  4. Create service account user 'pimanager-service-account', generate
     long-lived token, store in 1Password as 'pimanager-service-account'.
  5. Re-enable jwt mode on pirelay + un-mask puppet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 15:50:10 -05:00
Andrew Stoltz
65aa1e6104 fix(monitoring): point probe-printweb at /health (Q-MR-90)
Root path requires API key auth — `/` returned 401 to the blackbox
probe, firing PrintWebDown despite `/health` reporting Healthy.
Pattern: feedback_k8s_probes_behind_auth_middleware.

Mirrors FlowerCore.Notes scripts/monitoring/prometheus.yml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:52:02 -05:00
Andrew Stoltz
7f2a3b76b4 feat(github-runner): bake Ruby 3.3 into Linux self-hosted runner image (Q-MR-81) 2026-05-20 11:45:43 -05:00
ea73f00461 fix(fc-devicemgmt): remove self-referential Application resource (Q-MR-79)
ApplicationSet already creates infra-fc-devicemgmt; removing the in-repo Application child clears the self-reference drift.
2026-05-20 16:20:01 +00:00
Andrew Stoltz
25ace30a03 fix(fc-devicemgmt): remove self-referential Application resource (Q-MR-79) 2026-05-20 11:18:25 -05:00
Andrew Stoltz
ca574c2280 brochure: delete apps/brochure/ — full prune per operator decision 2026-05-19
Removes the apps/brochure/ directory entirely from the bluejay-infra
ApplicationSet glob. ArgoCD will:

  1. See infra-brochure has no git source -> mark for delete
  2. Prune the brochure namespace + Deployment + Service + Certificate
     + Secret + IngressRoute (all generated from the now-gone
     apps/brochure/brochure.yaml)
  3. Remove the infra-brochure Application from argocd ns

Operator decision 2026-05-19 (follow-up to 09387f9 ARCHIVED banner
commit): "Yes, prune argo for brochure. Probably fully deleted there."

The brochure subdomain project was a planning-chain misinterpretation
of "make TtsReader + AI Station production-ready" — see
memory/project_brochure_split_misinterpretation_archived_2026_05_19.md
in FlowerCore.Notes for the full decision record.

Reusable artifacts that were the operator's archive concern stay alive
in their actual homes:

- FlowerCore.Intranet.Web PR #8 content-NuGet carve-out: still in
  Intranet's master, may transfer to TtsReader / AI Station prod work
- Sprint 32 Cl-5 substrate (public-twin design ideas): SUPERSEDED banner
  in-place in FlowerCore.Notes docs/standards/, history preserved
- magpie-doc-writer + wren-walkthrough skill output: unchanged in
  Intranet's flowercore-whats-new/walkthroughs/galleries directories

Companion Notes-side commit updates the "scaled to 0 + ARCHIVED banner"
language in mvp-readiness.html + fleet-roadmap-2026-05-19-sprint36-v2.md
+ memory record to reflect full deletion instead.

Wrong-codebase image localhost/fc-brochure-web:v20260524-sprint32 is
being removed from rke2-server / rke2-agent1 / rke2-agent2 in a
follow-up step (reclaims ~800MB per node).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 10:42:30 -05:00
Andrew Stoltz
09387f90e1 brochure: ARCHIVED 2026-05-19 — was a misinterpretation, do not re-enable
The brochure split project was a misinterpretation of an operator request
to make TtsReader + AI Station production-ready. Somewhere in the planning
chain it spun up into a separate "showcase brochure product" with its own
host, repo, NuGet, and Codex pack — none of which the operator actually
wanted. The project itself is pointless and a waste of credits.

Archive (not delete) per operator decision 2026-05-19, because some work
shipped under the misinterpretation may still have reusable value:

- FlowerCore.Intranet.Web PR #8 (merged) introduced FlowerCore.Brochure.Content
  content-NuGet carve-out — pattern may apply to TtsReader/AiStation production
  polish.
- Sprint 32 Cl-5 substrate has design ideas for public-twin vs operator-host
  separation that may transfer.
- magpie-doc-writer / wren-walkthrough skills still author useful Intranet
  content — those skills stay active.

These manifests stay at replicas: 0 for ArgoCD continuity. Cleanup options
(move out of apps/* glob, or delete entirely) are documented in README.md
for an operator-explicit future call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 10:34:28 -05:00
Andrew Stoltz
e641ceab48 monitoring(irc-notify): criticals also batch hourly — fix per-fire spam
The first batching pass (bacac06) left critical-severity alerts on the
immediate-print path. That's still per-event spam for any persistent
critical (e.g. PrintPaperRollCritical fires every 30s Grafana evaluation
cycle when paper is <5%). Caught immediately after deploy: CUPS queue grew
0 → 8 jobs in 8 minutes from a single firing PrintPaperRollCritical.

This commit aligns with the operator's verbatim ask ("one alert an hour"):

- Critical-severity alerts now go into the digest buffer, NOT the
  immediate-print path. The digest payload already shows severity tags
  per alertname, so the operator still sees "[critical] X" in the printout.
- The explicit `alert_channel=thermal_print_immediate` label still bypasses
  batching, but only on NEW fingerprint arrival — it triggers a flush of
  the CURRENT digest (with the new alert included), then clears. Repeat
  webhooks for the same fingerprint dedupe in the buffer until the next
  hourly tick OR until the alert resolves. No fingerprint can spam.
- `add_to_digest` now returns bool (True = buffer grew, False = dedup /
  resolution / disabled) so the immediate-label path can flush only on
  state transitions.

Net effect: max 1 thermal print per BATCH_INTERVAL_MIN per alert fingerprint,
regardless of severity. Rules that genuinely need same-second paper opt in
via `alert_channel=thermal_print_immediate` (currently zero rules use this).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 10:22:25 -05:00
Andrew Stoltz
c263426ea5 fc-devicemgmt: operator image fix + Web scaled to 0
OPERATOR (PodCrashLoopBackOff cleared):
- Bumped image to v20260519-sp34cl3-fix (built from astoltz/FlowerCore.DeviceManagement@d9a3685
  after Sprint 34 Cl-3 stranded branch was merged via PR #19 squash).
- The v20260512-cx5 image was the broken Sprint 8 scaffold: generic Host
  builder, no kubeops, no Kestrel on :8080, no AddController chain. Readiness
  probe dial-tcp 8080 failed every restart.
- The new image ships the AddController chain for all 4 reconcilers
  (DeviceCrd / DeviceGroupCrd / DevicePolicyCrd / RemoteCommandCrd) plus
  Kestrel on :8080 and /healthz.
- Image saved + scp'd + ctr-imported on rke2-server / rke2-agent1 / rke2-agent2
  before this commit. SHA256: 2cc79ee0a2313c550268d1244f805ae41b396362148dd5603061cc15b6f7fa7e

WEB (DeploymentReplicasMismatch cleared via scale-to-0):
- Web pod cannot start. Two upstream gaps must close first:
  1) MySQL DB instance + user `fc_devicemgmt` / database `flowercore_devicemgmt`
     are not provisioned in fc-mysql. Cluster has zero MySqlInstanceCrds and
     no `mysql.fc-mysql.svc:3306` Service.
  2) 1Password vault item `IAmWorkin/FlowerCore DeviceManagement Runtime` is
     missing (5 fields: DB-Password + 4 mTLS PEMs). OnePasswordItem CRD has
     been stuck Ready=False since 2026-05-18T02:58.
- Same pattern as the brochure-web scale-to-0 in 914fed0 — make the cluster
  clean and quiet, let operator restart deploy on a real schedule.

Re-enable path is fully documented in the deployment-web.yaml header comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 10:11:09 -05:00
Andrew Stoltz
bacac067cf monitoring(irc-notify): hourly digest batching for thermal printer
The thermal printer drained overnight (2026-05-18/19) because the old
notify.py POSTed one print job per Grafana webhook fire. With 9
concurrently-firing alerts (zabbix-postgres + fc-devicemgmt + brochure
+ PrintPaperRollLow), every evaluation cycle stamped fresh CUPS jobs
onto the queue until the operator physically powered the printer off.

This refactor:

- Adds env-var config: THERMAL_PRINT_ENABLED (master kill switch),
  BATCH_INTERVAL_MIN (default 60), BATCH_MAX_PENDING (default 50).
- IRC delivery stays per-event (operator wants the live stream).
- Thermal routing now:
  * critical/disaster/page severity OR alert_channel=thermal_print_immediate
    -> print immediately
  * alert_channel=thermal_print -> enqueue into hourly digest
  * RESOLVED -> remove from digest buffer (no resolution-spam prints)
  * else -> IRC only, no thermal
- Background digest_loop thread flushes the buffer hourly (or sooner
  if buffer hits BATCH_MAX_PENDING). Digest payload is a single
  Print.Web /api/print/alert POST listing distinct alertnames + per-rule
  target counts.
- New POST /flush endpoint (manual operator force-flush; useful for
  testing without waiting an hour).
- GET / returns config + buffer depth + per-stat counters for observability.

Net effect: max 1 thermal print per BATCH_INTERVAL_MIN for batched
warnings, plus immediate prints for criticals. Closes the 2026-05-18/19
alert-storm incident.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 09:56:14 -05:00
914fed08d8 fix(brochure): scale brochure-web to 0 — wrong codebase shipped (Intranet.Web binary in fc-brochure-web image, CrashLoopBackOff 296 restarts on /data read-only). Re-enable after Sprint 34 Cx-3 rebuild per docs/ai-agents/codex-prompts/2026-05-18-fc-brochure-web-rebuild-pack.md 2026-05-19 14:45:01 +00:00
Andrew Stoltz
200aeab032 ttsreader: deploy study mode repair image 2026-05-18 16:33:08 -05:00
Andrew Stoltz
8182616d4c ttsreader: point render piper to edge1 demo endpoint 2026-05-18 16:06:37 -05:00
Andrew Stoltz
f0862ac03c ttsreader: deploy sprint36 demo audio image 2026-05-18 16:04:59 -05:00
Andrew Stoltz
46c392605e monitoring: mirror PuppetServiceFailed alert from Notes (Sprint 33 Cx-7 Phase B)
Mirrors the live `puppet` alert group from
FlowerCore.Notes/scripts/monitoring/alerts.yml into the K8s ConfigMap so a
future in-cluster Prometheus inherits the ruleset automatically.

Source of truth remains the Notes file (live Podman Prometheus on noc1).
See feedback_monitoring_k8s_target_vs_live_podman.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:11:07 -05:00
89b147bbdd docs(openvox): document quadlet durability smoke (#12) 2026-05-18 04:53:02 +00:00
d7238a5e3b feat(brochure): add public brochure GitOps app (#13) 2026-05-18 04:52:37 +00:00
fc444a02a1 feat(chat): add public twin ingress (#11) 2026-05-18 04:52:20 +00:00
83d4883d55 feat(worldbuilder): pin k8s demo to fake backend (#10) 2026-05-18 04:52:11 +00:00
f8fe3b2688 feat(github-runner): add final long-tail runners (#9) 2026-05-18 04:52:01 +00:00
f2ab892ebc feat(github-runner): add Marquee + TtsReader per-repo runners (#8) 2026-05-18 03:27:14 +00:00
fef68a9560 feat(fc-devicemgmt): add Kubernetes deployment manifests (#1)
Sprint 8 IMPL lane Cx-5: fc-devicemgmt K8s manifests (rebased onto main 2026-05-18; 13 files, +944).

Namespace + Web Deployment (replicas:2, MySQL backend) + Operator Deployment (replicas:1, KubeOps leader-elect) + Service + Certificate (step-ca-acme ClusterIssuer) + Traefik IngressRoute (devices.iamworkin.lan internal) + ServiceAccount + ClusterRole + ClusterRoleBinding + NetworkPolicy (CNI DNAT-aware backend ports) + OnePasswordItem (5-field consolidated) + ArgoCD Application bootstrap shape + lint coverage.

Follow-ups (not merge blockers):
- localhost/fc-devicemgmt-{web,operator}:v20260512-cx5 must be imported to all 3 RKE2 nodes; pods will ErrImageNeverPull until imported.
- 1Password vault item 'FlowerCore DeviceManagement Runtime' must be created with 5 fields before pods can start.
- DNS devices.iamworkin.lan -> 10.0.56.200 already present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 02:56:23 +00:00
Andrew Stoltz
6fe77225ae fix(github-runner): dedupe DOTNET_INSTALL_DIR+NUGET_PACKAGES on base+sharedpos
PR #5 rebase concatenated PR #5 env additions onto PR #7 env additions on
the base + sharedpos Deployments, producing duplicate-key validation
errors in ArgoCD's structured merge. The DOTNET_INSTALL_DIR and
NUGET_PACKAGES values are identical between PR #5 and PR #7; keep the
PR #7 originals and retain only the unique new env vars from PR #5
(DOTNET_CLI_TELEMETRY_OPTOUT, DOTNET_NOLOGO, DOTNET_GENERATE_ASPNET_CERTIFICATE).

No behavioral change — same final env var set, no duplicates.
2026-05-17 21:53:05 -05:00
634b9c4169 feat(github-runner): harden Linux runner fleet (#5) 2026-05-18 02:51:02 +00:00
b8c7e59005 Tighten Pi signage HDMI settle coverage (#3) 2026-05-18 02:35:17 +00:00
65ac8d6f01 feat(github-runner): pod-env DOTNET_INSTALL_DIR + initContainer for non-root runner (#7) 2026-05-18 02:25:18 +00:00
35844e0dbd chore(github-runner): un-park github-runner-sharedpos (Shared.Pos non-root build fixed) (#6) 2026-05-18 02:20:00 +00:00
b1e307151e chore(github-runner): un-park github-runner-sharedpos (replicas 1) after Shared.Pos CI fix merged 2026-05-17 21:54:16 +00:00
12b07219c7 chore(github-runner): park github-runner-sharedpos (replicas 0) until Cx-1 non-root fix
Shared.Pos build fails on non-root runner (setup-dotnet /usr/share/dotnet denied); churning runner drove HighCPU on rke2-agent2. Re-enable in the Sprint 30+ Cx-1 Linux-runner-fleet lane (DOTNET_INSTALL_DIR on pod env).
2026-05-17 21:50:35 +00:00
9fd32c4415 feat(monitoring): MacMiniRunnerOffline alert (Sprint 28 reconcile) 2026-05-17 19:50:29 +00:00
ad670fb344 feat(github-runner): add Shared.Pos repo-scoped Linux runner (unstick stuck publish) 2026-05-17 19:50:23 +00:00
Codex
6f6ca50987 fix(github-runner): switch RUNNER_TOKEN -> ACCESS_TOKEN; set RUN_AS_ROOT=false
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:08:56 +00:00
Codex
c7be58c1f7 chore(github-runner): bump replicas 0 -> 1 (PAT provisioned)
Operator provisioned GitHub PAT (Runner Registration) 1P item. OnePasswordItem CRD already materialized the secret. Unblocks CI fleet-wide.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:04:03 +00:00
Codex
a1f5a393cd chore(github-runner): rename 1P item to GitHub PAT (Runner Registration)
Renames the OnePasswordItem.itemPath from "GitHub Runner Registration
Token" to "GitHub PAT (Runner Registration)" so the runner 1P entry
sits next to its siblings — GitHub PAT (Gitea Mirrors) and GitHub PAT
(NuGet Packages) — under a consistent "GitHub PAT (...)" naming pattern
and API_CREDENTIAL category.

Existing field "credential" remains the consumer (RUNNER_TOKEN env).
Comment block clarified to require Administration:read/write fine-grained
PAT scope on target repos.

Old 1P item renamed to "[DEPRECATED 2026-05-16] GitHub Runner
Registration" — kept as recovery backup; can be hard-deleted after the
first successful runner pod start against the new item path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:01:41 +00:00
Codex
710340d8be chore(github-runner): rename 1P item to GitHub PAT (Runner Registration)
Renames the OnePasswordItem.itemPath from "GitHub Runner Registration
Token" to "GitHub PAT (Runner Registration)" so the runner 1P entry
sits next to its siblings — GitHub PAT (Gitea Mirrors) and GitHub PAT
(NuGet Packages) — under a consistent "GitHub PAT (...)" naming pattern
and API_CREDENTIAL category.

Existing field "credential" remains the consumer (RUNNER_TOKEN env).
Comment block clarified to require Administration:read/write fine-grained
PAT scope on target repos.

Old 1P item renamed to "[DEPRECATED 2026-05-16] GitHub Runner
Registration" — kept as recovery backup; can be hard-deleted after the
first successful runner pod start against the new item path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 10:27:58 -05:00
Andrew Stoltz
7d2daaa4f8 chore(github-runner): replicas 1 → 0 until 1Password token provisioned
github-runner-token OnePasswordItem exists but the underlying 1Password
vault item hasn't been created yet, so the operator can't mint the K8s
Secret. Pod stuck in CreateContainerConfigError → DeploymentReplicasMismatch
alert fires.

Scaling to 0 keeps the manifest infrastructure intact but stops trying
to schedule until operator:
1. Creates "GitHub Runner Registration Token" item in IAmWorkin vault
2. Generates a token at github.com/astoltz/<repo>/settings/actions/runners/new
3. Updates the OnePasswordItem itemPath to point at it
4. Bumps replicas back to 1 via PR
2026-05-15 16:18:19 -05:00
Andrew Stoltz
e50e103ba0 fix(zabbix): bump web probe timeouts 5s→15s + add failureThreshold
zabbix-web nginx+PHP-FPM container serves / at ~3-5s baseline with
occasional 6-7s spikes (probe path renders full dashboard via PHP).
kube-probe was killing the container after 3 consecutive 5s-timeout
499s, producing CrashLoopBackOff alert noise even though the app
was serving real traffic fine.

15s timeout absorbs the natural variance; explicit failureThreshold=3
documents the policy (was implicit default).

Closes the firing PodCrashLoopBackOff (zabbix-web) + pending
HTTPServiceSlow/HTTPServiceDegraded alerts. zabbix.iamworkin.lan
remains slow at the application layer (separate work — PHP-FPM
warm-up + Zabbix server "host not found" agent lookup spam need
their own fixes) but the pod restart loop stops.
2026-05-15 15:59:04 -05:00
Codex
e8094eb0bd ci(github-runner): add Phase 2 ephemeral Linux runner K8s manifest
Namespace github-runner with myoung34/github-runner:latest Deployment,
5Gi Longhorn RWO NuGet cache PVC, zero-privilege ServiceAccount, and
OnePasswordItem CRD for the registration token. EPHEMERAL=true mode
re-registers after each job; Recreate strategy avoids RWO multi-attach.
Targets fc-build-linux label; single replica pinned to rke2-server node.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 12:46:25 -05:00
8d87d9172c Add Pi signage Phase 1 player artifacts
Squash merge Sprint 14 Pi signage player artifacts.
2026-05-14 01:46:09 +00:00
Codex
cfd9743afa Add Apple TV signage docs manifest 2026-05-13 20:32:48 -05:00
Andrew Stoltz
5029e209cd kubevirt-vms: boot ci1 from server template 2026-05-12 16:58:18 -05:00
Codex
f298339152 fix(guacamole): add --- separator between macmini-vnc-creds OnePasswordItem and guacamole-branding ConfigMap
Missing document separator caused YAML to merge the OnePasswordItem's
top-level `spec: itemPath:` block into the ConfigMap that follows.
Result: a ConfigMap with a `.spec` field whose K8s schema does not
declare one, triggering ArgoCD's structured-merge diff to fail since
2026-05-11T15:30:54Z:

  Failed to compare desired state to live state: failed to calculate
  diff: error calculating structured merge diff: error building typed
  value from config resource: .spec: field not declared in schema

App stayed Healthy (live K8s tolerated the extra field — ConfigMap
ignored it) but ArgoCD's diff calc was broken, leaving the app stuck at
sync=Unknown for all 21 resources. Adding the missing `---` separator
makes the OnePasswordItem and ConfigMap proper sibling YAML documents,
each with its own kind-correct schema.

Diagnosed during 2026-05-12 morning routine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 09:26:03 -05:00
Codex
6e7d88db49 feat(fc-redis): add SignalR backplane for cross-product event bus (Q-SO-1 Phase A)
Per Q-SO-1 operator resolution 2026-05-11 PM, Redis SignalR backplane lands
in Phase A (was Phase C deferral). Treats Redis as a managed FC infrastructure
component, not a deferred scaling escalation.

Lands the minimal Phase A surface:
- Namespace fc-redis
- Single Redis 7-alpine pod with 1Gi Longhorn RWO PVC
- ConfigMap with AOF persistence (everysec), 256Mi maxmemory, allkeys-lru
- ClusterIP Service `redis.fc-redis.svc.cluster.local:6379` (in-cluster only)
- No AUTH Phase A (Phase B add via 1Password Connect rotation)
- No IngressRoute (backplane is server-to-server)

Consumers (Phase A IMPL across FC services) add:
  services.AddSignalR().AddStackExchangeRedis(
      "redis.fc-redis.svc.cluster.local:6379",
      opts => opts.Configuration.ChannelPrefix =
          StackExchange.Redis.RedisChannel.Literal("fc-opsconsole"));

Phase B/C follow-ons (not in this commit): Sentinel for HA, AUTH password
from 1Password, redis_exporter sidecar for Prometheus, network policies.

See FlowerCore.Notes/docs/signage/operations-console-phase-2-design.md
section 3.5 (rewritten) and decisions-waiting.html Q-SO-1 (RESOLVED).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 19:02:58 -05:00
Codex
5ae50bd491 fix(telephony): init container runs as root to chown hostPath /tmp/tts-audio
The fix-data-perms init container chowns /data (PVC) and /shared-tts
(hostPath /tmp/tts-audio on rke2-agent1) to uid 1654 so the non-root
telephony-web app can write Piper TTS .sln16 files.

Without an explicit container-level securityContext override, the init
container inherits pod-level runAsNonRoot:true / runAsUser:1654 and
fails with 'chown: /shared-tts: Operation not permitted' the first
time the hostPath comes up root-owned after a node reboot.

Outage 2026-05-11 23:00 UTC: telephony-web in Init:CrashLoopBackOff for
9 hours (100+ restarts) until init container was bumped to runAsUser:0.
Live cluster patched in the same operation; this commit makes the fix
durable in git so ArgoCD sync preserves it.

See Notes memory: feedback_hostpath_initcontainer_chown_perms

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 18:37:15 -05:00
Codex
653d4472f5 fix(monitoring): mirror Q-MR-3 MultusMemoryPressure + NamespacePendingPodBacklog alerts
Two new preventive alert rules added to the kubernetes-state group of the
K8s migration target ConfigMap. The live Podman Prometheus on noc1 has
already been updated via FlowerCore.Notes/scripts/monitoring/alerts.yml +
sudo cp + podman pod restart monitoring (this commit only locks it in
the bluejay-infra K8s mirror so a future migration carries it forward).

MultusMemoryPressure (critical, thermal_print): fires when kube-multus
working set exceeds 80% of its memory limit for 5m. Catches the next
multus OOM cascade BEFORE it kills the daemon cluster-wide. The 2026-05-10
21h outage hit because no alert fired on the rising multus working set;
only downstream blackbox / Traefik / service alerts triggered, after the
fact.

NamespacePendingPodBacklog (warning): fires when any single namespace has
>25 Pending pods sustained for 30m. Catches the operator-leak avalanche
pattern (orphan pods from a crashed reconciler emitting children without
ownerReferences) before it cascades into a CNI OOM.

See FlowerCore.Notes:
  - feedback_multus_50mi_limit_oom_orphan_pod_avalanche
  - feedback_monitoring_k8s_target_vs_live_podman (workflow)

Companion commits:
  - bluejay-infra@eb8693e (multus memory limit)
  - FlowerCore.RemoteDesktop@b02c59b (OwnerReferences fix)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 10:42:27 -05:00
Codex
eb8693e1ce fix(multus): bump kube-multus-ds memory 50Mi/50Mi -> 1Gi/512Mi (prevent OOM cascade)
Cluster outage 2026-05-10T17:43 through 2026-05-11 ~10:30 (~21h). Root cause:
FlowerCore.RemoteDesktop emitted 219 orphan rd-browser-only-* pods in fc-desktop
(missing OwnerReferences — see companion fix in FlowerCore.RemoteDesktop).
Kubelet's continuous CNI ADD retries for those pending pods drove a request
queue that exceeded the upstream default 50Mi limit on kube-multus-ds. Multus
OOMKilled (exit 137), restarted with an even bigger backlog, OOMKilled again,
positive feedback loop. Restart counts climbed to 276 / 412 / 261 across the
3 RKE2 nodes.

Downstream blast radius: both Traefik pods stuck ContainerCreating (101m +
4h35m), all Longhorn CSI attacher/provisioner/instance-manager stuck, every
Prometheus blackbox probe for *.iamworkin.lan failing, UpdateCenterPublicEdgeDown
critical on update.flowercore.io, every ArgoCD app showing sync=Unknown
because repo-server lost git connectivity. 45 firing Prometheus alerts.

Recovery sequence (Q-MR-1 from FlowerCore.Notes morning routine):
1. kubectl patch kube-multus-ds memory live (this commit locks it in git so
   ArgoCD doesn't revert on next sync)
2. Force-delete the 219 orphan pods (kubectl --grace-period=0 --force) to
   break the avalanche
3. Rollout restart kube-multus-ds — STABLE after restart with new limit
4. Restart Traefik + Longhorn CSI to clear stuck ContainerCreating
5. Verify update.flowercore.io returns 200 + ArgoCD apps reconcile

Tested incrementally: 256Mi limit was insufficient (still OOMed on catchup
burst), 512Mi was insufficient on rke2-agent1 (most pods concentrated there),
1Gi/512Mi handled the full 200+ pending pod CNI catchup cleanly with 0 multus
restarts after rollout. Nodes are 64GB with <25% used in steady-state, so the
~256Mi typical working-set is well within the new limit.

Companion change: FlowerCore.RemoteDesktop must set OwnerReferences on every
worker pod so future operator crashes don't leak orphans (Q-MR-2). Preventive
alerts (Q-MR-3) MultusMemoryPressure + NamespacePendingPodBacklog are coming
in a follow-up commit to apps/monitoring/.

Memory: feedback_multus_50mi_limit_oom_orphan_pod_avalanche
Decisions card: docs/dashboards/decisions-waiting.html Q-MR-1..3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 10:30:05 -05:00
Codex
667777a653 revert(ci1): back to cdrom:scsi (virtio-blk disk hit QEMU flock)
The virtio-blk disk swap (commit 84c9feb) didn't help: qemu fails to
acquire the write lock on the rootdisk PVC because the previous
launcher's qemu process didn't release it cleanly. Same family of
bug as the "stale QEMU flock" already documented in
feedback_kubevirt_iso_first_install_bootorder_and_runstrategy, but
now triggered on rke2-agent1 instead of agent2.

OVMF cdrom timeout is the real blocker and remains open:
  -  Distribution pipeline (build → save → scp → ctr import on all
    3 RKE2 nodes) is proven. localhost/win-server-2025:1.0 lives in
    each node's containerd k8s.io namespace.
  -  containerDisk + cdrom:scsi gets qemu domain Running (no NFS
    Permission denied, no rootdisk flock).
  -  OVMF BdsDxe times out reading the SCSI cdrom regardless of
    SecureBoot setting and bus type.

Reverting the disk type to cdrom:scsi so the VM lands back on the
"qemu Running, OVMF stuck at Boot Manager" state — known-stable and
easier to attack than the QEMU-flock state we hit by trying
virtio-blk disk.

Operator decision for next architectural step (one of):
  - Custom OVMF firmware build with longer Boot0001 timeout
  - KubeVirt version bump (v1.5+ has OVMF fixes)
  - Hyper-V/VirtualBox install + export VHD to ci1
  - BIOS legacy boot (Win Server 2025 needs UEFI but install media
    has a BIOS path)
  - DataVolume HTTP datasource (CDI internalizes ISO bytes via
    different code path)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 21:35:00 -05:00
Codex
84c9feb893 fix(ci1): present ISO as virtio-blk disk instead of cdrom
OVMF BdsDxe "starting Boot0001 ... Time out" persists across:
  - SATA cdrom + Longhorn Filesystem PVC (Path A)
  - SATA cdrom + Synology NFS (Path B failed: storage perms)
  - SCSI cdrom + Longhorn (Path B variant)
  - SCSI cdrom + containerDisk tmpfs (Path C)
  - + SecureBoot=false

That rules out: storage IO speed, cdrom bus type, signature
verification. Remaining cause is deeper in qemu's cdrom device
emulation under KubeVirt v1.4.0's OVMF firmware — the cdrom read
window for OVMF's first-sector probe is too short to satisfy from
the cdrom controller path regardless of bus type.

Workaround: present the ISO bytes as a regular virtio-blk DISK
(not a cdrom). UEFI/OVMF still recognizes ISO9660 + El Torito
boot records on any block device, so it can find and boot the
EFI bootloader the same way it would from a USB stick. virtio-blk
has a different read path that doesn't hit the cdrom-specific
timeout.

This also better aligns with the FlowerCore.Distribution USB-key
pattern: ISO bytes on a block device, UEFI boots from the El
Torito boot record, Windows installer takes over. The autounattend
ConfigMap (ci1-autounattend) drives unattended Windows setup once
the installer kicks off.

The containerDisk OCI image (localhost/win-server-2025:1.0)
remains unchanged — only the disk type in the VM spec changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 21:29:59 -05:00
Codex
427dbfcef2 [uc] Phase 1 auth gate deploy v20260509-4162dca-authgate 2026-05-08 21:16:54 -05:00
Codex
b651a4e2d0 fix(ci1): disable SecureBoot to allow OVMF to boot Windows ISO
containerDisk delivery (commit b998f50) successfully gave qemu fast
in-memory access to the ISO bytes (no NFS denial, no Longhorn read
latency), but OVMF's BdsDxe still timed out:

  BdsDxe: loading Boot0001 "UEFI QEMU QEMU CD-ROM " from
    PciRoot(0x0)/Pci(0x2,0x4)/Pci(0x0,0x0)/Scsi(0x0,0x0)
  BdsDxe: starting Boot0001 ... Time out

That rules out storage IO speed and bus type as causes (already
tested both sata and scsi against both Longhorn-PVC and tmpfs-backed
containerDisk). Remaining likely cause: SecureBoot signature
verification on the ISO's EFI bootloader. KubeVirt's stock
`/usr/share/OVMF/OVMF_VARS.secboot.fd` doesn't appear to ship with
the Microsoft KEK/DB enrolled by default, so signed Windows EFI
bootloaders fail the trust-chain check and OVMF reports a generic
"Time out" rather than a verification failure.

Disabling SecureBoot lets OVMF skip the chain check entirely and
boot the El Torito EFI image. SMM stays enabled (KubeVirt only
requires it WITH SecureBoot, not the inverse). TPM 2.0 emulation
also stays on (`tpm: {}`), so BitLocker, Hyper-V, and WSL2 still
work in the guest.

This is acceptable for a CI runner. Long-term path back to
SecureBoot:
  1. Custom-build OVMF_VARS.fd with Microsoft KEK/DB pre-enrolled
  2. Mount via firmware.bootloader.efi.persistent
  3. secureBoot: true

Tracked as a Phase 2 hardening task once the runner is doing real
work and we want signed-boot guarantees.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 21:06:18 -05:00
Codex
b998f50f48 fix(ci1): switch ISO delivery to containerDisk OCI image (Path C)
OCI image: localhost/win-server-2025:1.0 (8.27 GB)
Built FROM scratch + ADD disk.img → /disk/disk.img on noc1, podman
saved as tar (8.27 GB), SCP'd in parallel to all 3 RKE2 nodes,
imported via ctr in k8s.io namespace. Verified present on all 3
schedulable nodes (rke2-server, rke2-agent1, rke2-agent2).

Why containerDisk over the prior PVC paths:
  - Path A (Longhorn Filesystem PVC, sata): OVMF BdsDxe SATA-CDROM
    read timeout. Cdrom-backed PVC is too slow for OVMF's first-sector
    read window.
  - Path B (Synology NFS): uid 107 (qemu) denied at directory level by
    Synology export ACL despite file mode 0777. Memory:
    feedback_synology_iso_export_root_only_uid_107_denied.
  - Path B+SCSI: same OVMF timeout, just on SCSI controller. Bus
    choice was not load-bearing — the issue was always the slow PVC
    backing.
  - Path C (this commit): containerDisk delivers the ISO bytes from
    a tmpfs view of the OCI layer, no PVC controller in the read path.
    qemu reads at native FS speed; OVMF first-sector read completes
    well within timeout. This is also the KubeVirt-recommended pattern
    for installer ISOs.

Connects to FlowerCore.Distribution / Provisioning USB story: same
"OCI image of the OS installer + autounattend on a sysprep CDROM"
pattern that the USB provisioning agent will use. The Windows
install proceeds hands-off via the existing autounattend.xml in
ci1-autounattend ConfigMap (RDP enabled, WinRM, UAC disabled,
Administrator password from 1Password vault item
h3ix4mgfk65gmkcmvh6ly3d3hu).

Image lifecycle: bump tag (1.1, 1.2, ...) when ISO version changes,
rebuild on noc1, redistribute to RKE2 nodes, update image: line.

Legacy NFS PVC + PV manifest and CDI Longhorn PVC RETAINED for this
commit so prior states are recoverable. Will prune in follow-up
once containerDisk boot proves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 20:45:38 -05:00
Codex
8fd9ae1cd3 fix(ci1): revert NFS Path B + flip ISO cdrom bus sata→scsi
NFS Path B (commit fc2aca0) failed at storage layer: Synology export
`/volume1/ISOs` denies non-root client UIDs at the directory level.
qemu uid 107 cannot `ls /iso/` even though disk.img is mode 0777.

Diagnosed via uid-107 + uid-0 busybox probe pods on rke2-agent2:
- libvirt error: "Cannot access storage file ... Permission denied"
  (virStorageSourceReportBrokenChain:1281, virError Code=38 Domain=18)
- uid 107 pod: "ls: can't open '/iso/': Permission denied"
- uid 0 pod (same mount): "drwxrwxrwx 1 root root 16 ... disk.img"
- SELinux Enforcing + virt_use_nfs=on, no AVC denials → not SELinux
- File mode 0777 with owner 107:107 → not POSIX

Same export-only-root pattern as `/volume1/kubernetes`. Memory:
feedback_synology_iso_export_root_only_uid_107_denied.md

Existing CDI-uploaded Longhorn PVC `windows-server-2025-iso` (10Gi
Filesystem mode) verified to contain valid ISO bytes readable by
uid 107 (mode 0660 root:107, 9.85 GB sparse, 8.27 GB blocks ≈
original 7.7 GB ISO). Reverting to it.

The original OVMF SATA-CDROM read timeout that drove yesterday's
NFS pivot is now addressed by `cdrom: bus: scsi` (virtio-scsi has
a longer read window than the IDE/SATA emulator). Per user-prompt
diagnostic chain Step 5.

NFS PVC + PV (apps/kubevirt-vms/win2025-iso-nfs-pv.yaml) RETAINED
so Path B state is recoverable; can be pruned in follow-up once
SCSI boot is proven.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 18:54:36 -05:00
Codex
fc2aca0e9e fix(ci1): mount Windows ISO via Synology NFS (Path B for SATA-CDROM timeout)
Previous fix attempts confirmed the Longhorn-backed Filesystem PVC contains
a perfectly valid bootable ISO9660 image. The bug is that SATA-CDROM
emulation reading from a Longhorn Filesystem PVC is too slow for OVMF's
boot read window — DVD-ROM enumeration times out before the bootloader
loads. Symptom on the serial console:
  BdsDxe: failed to start Boot0001 "UEFI QEMU DVD-ROM QM00001 " ... Time out
  BdsDxe: No bootable option or device was found

Block-mode PVC (Path A) was attempted and would likely fix the timing,
but CDI v1.65.0's upload-target pod cannot open the underlying block
device (runAsUser:107 + capabilities.drop:[ALL]):
  blockdev: cannot open /dev/cdi-block-volume: Permission denied

Path B (this change): mount the ISO directly from Synology NAS over
NFSv4.1. Bypasses both the Longhorn slowness and the CDI permission
issue. QEMU's SATA emulator reads at native LAN speed.

Layout:
  /volume1/ISOs/ — existing Synology export, RKE2 ACL already granted
  /volume1/ISOs/win2025-iso-disk/disk.img — new subdir, hardlink to the
    ISO file, named so KubeVirt's launcher finds it at the PV root

A hardlink (not symlink) is required because symlinks with relative
targets pointing to the parent directory are broken when the NFS PV
sub-mounts the subdir as its root.

Validated 2026-05-08 from rke2-server, rke2-agent1, rke2-agent2:
  mount -t nfs -o nfsvers=4.1,ro 10.0.58.3:/volume1/ISOs/win2025-iso-disk
  file disk.img -> ISO 9660 CD-ROM filesystem data ... (bootable)

The original Longhorn Filesystem ISO PVC is RETAINED unused (so ArgoCD
doesn't prune the populated PVC and so we have a fallback). Can be
removed in a follow-up commit after the NFS path is proven on a
successful Windows install.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 17:03:42 -05:00
Codex
ba18c52130 docs(ci1): record open rootdisk-flock and SATA-CDROM-timeout issues
Documenting the remaining 2 unresolved issues for the next operator
session, with the recovery paths from this session captured inline so
the next agent doesn't repeat the same blind alleys:

1. **rootdisk QEMU flock** — every new launcher pod fails QEMU start with
   `Failed to get "write" lock` on the rootdisk Filesystem-mode disk.img.
   Stale flock from a previous force-deleted virt-launcher pod. Longhorn
   engine on rke2-agent2 needs to release the lock; `kubectl patch
   volume.longhorn.io/<pvc-name> spec.nodeID=""` is reverted by the
   Longhorn controller. Operator-level recovery only.

2. **SATA CDROM read timeout** — even with bootOrder=1 (windows-iso first),
   OVMF UEFI fails Boot0001 with "Time out" reading the SATA CDROM backed
   by the Filesystem-mode PVC. Block-mode DataVolume migration was
   attempted but blocked by CDI v1.65.0's upload pod running with
   `capabilities.drop: [ALL]` and `runAsUser: 107`, preventing direct
   block-device writes (`blockdev: cannot open /dev/cdi-block-volume:
   Permission denied`). See ISO PVC header docstring for 3 forward paths.

Net commits during this session:
- 1c4145a: bootOrder swap (windows-iso=1, rootdisk=2)
- 87a7d7c: deprecated `running:` -> `runStrategy: Always`
- 0bf47df: ISO migration to Block-mode DataVolume (REVERTED)
- 9f6dc1a: revert to Filesystem PVC (CDI block-upload blocked)
- 1c4145a + 87a7d7c + 9f6dc1a are the live, correct configuration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:18:38 -05:00
Codex
9f6dc1a9d5 fix(ci1): revert ISO to Filesystem PVC; CDI v1.65.0 block-upload pod blocked by capability drop
The Block-mode DataVolume migration (commit 0bf47df) hit a CDI v1.65.0 limitation:
the upload-target pod runs as uid 107 with `capabilities.drop: [ALL]`, so it
cannot open the underlying block device:

  blockdev: cannot open /dev/cdi-block-volume: Permission denied
  Saving stream failed: Unable to transfer source data to target file:
  error determining if block device exists: exit status 1

Reverting to a Filesystem-mode PVC + virtctl image-upload pvc, which DID work
(uploaded the 7.7 GiB ISO with valid ISO9660 magic intact). Boot timeout is
unresolved (header docstring captures the open issue + 3 paths to revisit).

The bootOrder swap (1c4145a) and runStrategy migration (87a7d7c) stay landed —
those are correct improvements regardless of the volume-mode question.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:32:52 -05:00
Codex
0bf47dfa33 fix(ci1): switch ISO from filesystem PVC to Block-mode DataVolume
The bootOrder swap alone didn't fix the install — even with `windows-iso` at
bootOrder:1, OVMF UEFI still timed out reading the SATA CDROM:

  BdsDxe: starting Boot0001 "UEFI QEMU DVD-ROM QM00001 " from ... Sata(...)
  BdsDxe: failed to start Boot0001 ... : Time out
  BdsDxe: No bootable option or device was found.

Diagnosis (debug pod mounting the live PVC):
- /pvc/disk.img IS a valid bootable ISO9660 image — `file` reports
  "ISO 9660 CD-ROM filesystem data 'SSS_X64FRE_EN-US_DV9' (bootable)".
- bytes 0..15: zeros (NOT QCOW2 magic 51 46 49 fb).
- bytes 32769..32773: "CD001" — ISO9660 primary volume descriptor at the
  correct offset.

So content was fine. The bug is in how KubeVirt + QEMU + Longhorn expose a
Filesystem-mode PVC's `/disk.img` as a SATA CDROM. With Block-mode the
underlying volume IS the raw ISO9660 sectors, OVMF reads them directly,
no QEMU file-emulation layer. This is the recommended pattern for ISO
install media on KubeVirt + Longhorn.

Migration:
- Replace `kind: PersistentVolumeClaim` with `kind: DataVolume` (CDI manages
  the underlying PVC + upload-target pod).
- Set `pvc.volumeMode: Block`.
- Annotate `cdi.kubevirt.io/storage.contentType: kubevirt` so CDI keeps raw
  bytes (no QCOW2 wrap).
- VM volume reference changes from `persistentVolumeClaim.claimName` to
  `dataVolume.name`. KubeVirt's VMI controller blocks VM start until DV
  phase is Succeeded (upload completed).

Operator step after this lands:
1. Wait for DV `phase: UploadReady`
   kubectl get dv -n kubevirt-vms windows-server-2025-iso -w
2. virtctl image-upload dv windows-server-2025-iso -n kubevirt-vms \
     --image-path "...\en-us_windows_server_2025...iso" \
     --uploadproxy-url https://localhost:8443 --insecure --no-create
3. Re-flip runStrategy to Always (was set to Halted live-side during
   migration; this commit keeps the manifest at Always).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:23:31 -05:00
Codex
87a7d7c70a fix(ci1): switch deprecated running: true -> runStrategy: Always
Required to clear OutOfSync state after the bootOrder fix. Live VM had
runStrategy: Halted (set during diagnosis to release the PVC for inspection).
Manifest had running: true. KubeVirt's validating webhook rejects sync:
  admission webhook "virtualmachine-validator.kubevirt.io" denied the request:
  Running and RunStrategy are mutually exclusive.

Switching to runStrategy: Always preserves the original "auto-start +
auto-restart" semantics with the non-deprecated field, and gives ArgoCD a
clean diff target to flip Halted -> Always.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:12:07 -05:00
Codex
1c4145a581 fix(ci1): swap bootOrder so Windows install ISO boots first
Original order: rootdisk=1 (empty 200Gi virtio), windows-iso=2 (SATA CDROM).
UEFI tried the empty virtio disk first, got nothing, fell back to Boot0001
(the SATA CDROM) with a short timeout, and aborted with:
  BdsDxe: failed to start Boot0001 ... Time out
  BdsDxe: No bootable option or device was found.

VM had been running 38+ min with rootdisk actualSize stuck at 4.13 GiB and
no AgentConnected condition — install never started.

Diagnosis via debug pod mounting the windows-server-2025-iso PVC:
  /pvc/disk.img: ISO 9660 CD-ROM filesystem data 'SSS_X64FRE_EN-US_DV9' (bootable)
  bytes 0..15: zeros (NOT QCOW2 magic 51 46 49 fb)
  bytes 32769..32773: "CD001" (ISO9660 primary volume descriptor)

So the PVC content is a real bootable ISO — the only fix needed is to make
the ISO bootOrder=1 for first install. After Windows installs, it writes its
own UEFI Boot#### entries pointing at the rootdisk EFI partition; UEFI then
boots from rootdisk going forward and the ISO at bootOrder:2 is a fallback
for re-install scenarios.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:10:17 -05:00
Codex
c50a403f74 fix(infra): pin virtio-container-disk to v1.8.2 (containerd 2.1 manifest fix)
KubeVirt v1.4.0 + RKE2 containerd 2.1.5 cannot pull
quay.io/kubevirt/virtio-container-disk:latest:
  rpc error: code = Unimplemented
  desc = failed to pull and unpack image: not implemented:
  media type "application/vnd.docker.distribution.manifest.v1+prettyjws"
  is no longer supported since containerd v2.1, please rebuild the image as
  "application/vnd.docker.distribution.manifest.v2+json" or
  "application/vnd.oci.image.manifest.v1+json"

The :latest tag was last rebuilt with the v1 manifest schema. Tagged versions
v1.6.5+, v1.7.3, v1.8.2 are rebuilt with v2/OCI manifests.

Pinning to v1.8.2 (newest available, contains current Windows VirtIO drivers).
The image only contains the Windows VirtIO driver ISO mounted as a CDROM —
not the KubeVirt runtime — so it is decoupled from the cluster KubeVirt
version.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:28:22 -05:00
Codex
fb7bd10528 feat(infra): activate ci1 VM — running:true + 10Gi ISO PVC + 1P password
Phase 1 prereqs all satisfied:
- Multus CNI v4.2.2 thick-plugin DS Running on rke2-server/agent1/agent2
- CDI v1.65.0 operator + CR Deployed (cdi-apiserver/deployment/uploadproxy
  all Running 1/1)
- Windows Server 2025 ISO (7.7GiB, March 2026 update) uploaded via CDI
  virtctl image-upload to PVC windows-server-2025-iso. Verified via PVC
  annotations: cdi.kubevirt.io/storage.condition.running.message="Upload
  Complete", storage.pod.phase="Succeeded"
- Local Administrator password generated (26 char, FANTASTIC strength).
  Stored in 1Password vault IAmWorkin (qaphopopkryhbg353ukzhhuqoq) item
  h3ix4mgfk65gmkcmvh6ly3d3hu. UTF-16-LE base64 in autounattend.xml Value
  field matches the 1P "autounattend AdministratorPassword Value" field.

Changes:
- ISO PVC bumped 6Gi → 10Gi (ISO is 7.7GiB, need headroom)
- Added labels app=ci-runner, flowercore.io/managed-by=bluejay-infra
- autounattend.xml AdministratorPassword Value: real base64-encoded password
- spec.running: false → true (VM starts on next ArgoCD sync)
- Header comment refreshed to LIVE state with prereq references

Network: still pod-network masquerade. Multus NAD prod-vlan57 is registered
but the VM doesn't use it yet (Phase 1.5 host bridge needed first).

Verify after sync:
  kubectl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml -n kubevirt-vms get vm,vmi
  virtctl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml vnc ci1 -n kubevirt-vms

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:24:46 -05:00
Codex
6c21d14a98 deploy(fc-updater): bump image to v20260508-pub3-deepening-2bdf108
Promotes the fleet to FlowerCore.Updater main @ 2bdf108 which combines:
- PR #6 publish pre-signed releases (1a188f4)
- PR #7 deeper public-host privacy enforcement (8cd8544)
- PublishPreSignedAsync(stream, sig) Integration coverage (2bdf108)

Live image already imported to rke2-server and rolled via deploy-web.ps1.
This commit aligns the bluejay-infra source of truth so ArgoCD doesn't
snap the spec back to the previous tag (per
feedback_argocd_managed_image_overrides_do_not_stick).
2026-05-08 13:07:24 -05:00
Codex
b3529f8e96 feat(infra): add Multus CNI + CDI + PROD VLAN 57 NAD as GitOps prereqs for ci1
Adds three new bluejay-infra apps that auto-pickup via ApplicationSet (apps/*
directory generator on main):

* apps/multus/multus.yaml — Multus CNI v4.2.2 thick-plugin daemonset (verbatim
  upstream, project-annotated). Enables KubeVirt VMs to attach additional
  network interfaces. Required by ci1 to bridge onto PROD VLAN 57.

* apps/cdi/{cdi-operator.yaml,cdi-cr.yaml,README.md} — Containerized Data
  Importer v1.65.0 (verbatim upstream). Operator + CR pattern. Enables
  populating PVCs from HTTP/registry/upload sources, used to load the Windows
  Server 2025 ISO into the windows-server-2025-iso PVC.

* apps/kubevirt-vms/prod-vlan57-nad.yaml — NetworkAttachmentDefinition for
  PROD VLAN 57 bridge. **Deploy gated on Phase 1.5 host work**: requires
  br-prod bridge enslaving enp86s0.57 on each RKE2 node (Puppet config-as-code).
  ci1.yaml continues to use pod-network masquerade until that lands; switching
  to multus.networkName: kubevirt-vms/prod-vlan57 is a one-line YAML edit
  followed by a GitOps push.

Cluster verification (2026-05-08):
- KubeVirt LIVE (3 nodes, virt-api/controller/handler/operator all Running)
- Calico CNI on /etc/cni/net.d + /opt/cni/bin (Multus default paths)
- ApplicationSet `bluejay-infra` already watches `apps/*` on main

Reproducibility: upstream YAMLs vendored verbatim with project header diffs
only. Bumping versions = re-curl + git push. No deploy-time internet fetch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:05:58 -05:00
Codex
00c11b4eaa feat(infra): stage ci1 Windows Server 2025 KubeVirt VM (Phase 1, NOT YET APPLIED)
Stages a draft VirtualMachine + Namespace + ISO PVC + rootdisk PVC + sysprep
ConfigMap for the dedicated GitHub Actions self-hosted runner that replaces
the never-registered bluejay-ws-sandbox-1 placeholder.

Status: STAGED ONLY. spec.running = false. ISO PVC empty. Two operator
decisions still pending before this can boot:
  1. Network choice — pod-network fallback (in this draft) vs Multus +
     PROD VLAN NAD (preferred, requires Multus install).
  2. ISO path — manual upload via helper pod (Path A) vs CDI HTTP import
     (Path B, requires CDI install).

Cluster baseline 2026-05-08:
  - KubeVirt operator: installed, healthy, 14d
  - CDI: NOT installed
  - Multus: NOT installed
  - Calico-only CNI

See docs/infrastructure/windows-server-build-runner-plan.md "Phase 1 readiness
gate" for the full operator pickup checklist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 12:32:47 -05:00
Codex
04881f46f0 deploy(intranet): promote brochure wave 1 image 2026-05-08 11:12:56 -05:00
Codex
c0038e4859 deploy(intranet): bump image to v20260508-7bad3a5 (Theme picker + FcThemedRoot)
FlowerCore.Intranet.Web@7bad3a5 'feat(theme): add /admin/theme picker page + wrap routes in FcThemedRoot'.
Image built, distributed to all 3 RKE2 nodes (10.0.56.11/12/13), 366/366 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 10:20:11 -05:00
Codex
dee48831c6 Deploy updater public privacy hardening 2026-05-07 17:12:33 -05:00
Codex
0f1dc5f871 fix(certs): kill cert-manager renewal loop on 3 broken Certificate specs
Three Certificates requested duration: 2160h (90d) with renewBefore: 720h
(30d). step-ca's ACME provisioner caps cert lifetime at 30d, so it silently
issued 720h certs — making renewBefore EQUAL to the actual cert lifetime.
cert-manager treats the cert as needing immediate renewal the moment it's
issued, creates a CertificateRequest, gets a new (still 30d) cert, marks
it for immediate renewal, and loops.

Damage on 2026-05-07 ~20:30 (caught during regroup after 5h gap):
  - fc-worldbuilder/worldbuilder-web-tls:  2365 CRs in 18h
  - fc-distribution/fc-distribution-tls:  10880 CRs in 18h
  - knowledge/knowledge-tls:              10888 CRs in 18h
  Total: 24,133 stale CertificateRequest objects in etcd.

Bulk-deleted all CRs + Orders in those 3 namespaces, then this commit
fixes the source so ArgoCD sync stops re-creating the loop.

Fix: match the working 720h/240h pattern used by every other FC service
cert (agent-zero, fc-dns, fc-llm-bridge, fc-php, traefik-system, etc.).
30d cert lifetime + 10d renewal headroom = renewal at day 20, which is
the cert-manager standard 2/3-of-lifetime practice.

Side effect during loop: ALSO contributed to step-ca load and may have
caused intermittent timeouts cluster-wide (the latest stuck challenge
was timing out dialing step-ca:9443 even though step-ca itself was up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:32:00 -05:00
Codex
11c5f6e6cc fix(selenium): GitOps-capture selenium-netpol (was unmanaged anywhere)
Captured during 2026-05-07 regroup audit. selenium-netpol was applied via
raw `kubectl apply` to the cluster on 2026-03-15 with no source-of-truth
file anywhere — neither in bluejay-infra nor in any FC service repo. A
cluster rebuild from bluejay-infra would have lost it entirely (including
the Selenium Grid → Traefik VIP allow rule that gates AAT runs against
*.iamworkin.lan services).

Captured byte-for-byte from `kubectl get netpol -n selenium selenium-netpol
-o yaml`. ServerSideApply via ArgoCD will adopt the existing resource
without recreation.

The Selenium Grid Deployment + Services themselves are still managed
outside ArgoCD (deployed via raw kubectl from the original bring-up).
Migrating those into bluejay-infra is a separate lane — this commit only
restores GitOps repeatability for the NetworkPolicy.

See feedback_networkpolicies_belong_in_bluejay_infra.md for the canonical
pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 10:30:59 -05:00
Codex
d637fe9b30 fix(fc-desktop): land 4 NetworkPolicies into bluejay-infra (was deploy-script-only)
Repeatability gap caught during 2026-05-07 morning regroup. The four
fc-desktop NetworkPolicies (desktop-isolation, fc-desktop-default-deny,
remotedesktop-web-isolation, cm-acme-http-solver-allow) were applied via
FlowerCore.RemoteDesktop/scripts/deploy-web.sh `kubectl apply` calls.
That meant a fresh cluster rebuild from bluejay-infra alone would miss
all of them — Browser Lab session isolation, control-plane allow-list,
and HTTP-01 cert renewal would silently fail to come up.

Canonical FC GitOps pattern is for NetworkPolicies to live alongside
other resources in bluejay-infra. Verified by audit: 6 of 11 cluster
NetworkPolicies (agent-zero, edge2-services, monitoring, noc-services,
telephony, voice) already follow this pattern. fc-desktop was the
outlier; selenium-netpol is also unmanaged and tracked separately.

Source-of-truth split (now documented in fc-desktop.yaml):
  - bluejay-infra OWNS: Certificate + IngressRoute + all NetworkPolicies.
  - FlowerCore.RemoteDesktop scripts/deploy-web.sh OWNS: Deployment +
    Service ONLY (because `localhost/fc-desktop:linux-xfce` image refs
    require manual ctr import on each node — Deployment in bluejay-infra
    would race the image-import step).

Follow-up commits in FlowerCore.RemoteDesktop will:
  - Remove the now-duplicate k8s/{networkpolicy,namespace-default-deny,
    web-networkpolicy,acme-http01-solver-allow}.yaml files.
  - Drop the 3 `kubectl_apply_file` lines from scripts/deploy-web.sh.

The 4 NPs in this commit are byte-for-byte identical to what's running in
the cluster today (verified via kubectl get -o yaml diff). ServerSideApply
in the bluejay-infra ApplicationSet will adopt the existing resources
without recreating them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 10:27:20 -05:00
Codex
5bfe41beca fix(monitoring): rename bare Grafana dashboard JSONs out of *.json extension
ArgoCD's directory-driven manifest parser scans *.yaml AND *.json by
default. Bare Grafana dashboard JSONs (no apiVersion/kind/metadata)
poisoned manifest generation for the entire infra-monitoring Application
("Object 'Kind' is missing in <dashboard JSON>"), leaving sync state
Unknown.

These files are SOURCE for the file-provisioning path on noc1
(/opt/monitoring/grafana/dashboards/) and also get inlined into ConfigMap
wrappers like grafana-dashboard-remotedesktop.yaml. They are NOT K8s
manifests and shouldn't be in ArgoCD's manifest tree.

.argocdignore is for repo-level GitOps source eligibility, not for
filtering manifests within a directory-mode Application — the cleanest
fix is the *.txt extension that ArgoCD's parser skips.

Reverts the .argocdignore from the previous commit (didn't take effect).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 10:13:37 -05:00
Codex
df22774674 fix(infra): unstick fc-updater + monitoring ArgoCD apps
fc-updater PVC: bump updatecenter-data storage 10Gi → 25Gi.
The cluster PVC was already manually expanded to 20Gi to fit Mike Bundle
(~5.1 GiB) plus the LocalFsBundleStore.MaxTotalBytes soft cap of 25 GiB
(see project_uc_remaining_4_apps_signed_2026_05_06). PVCs cannot shrink,
so ArgoCD couldn't sync the smaller git value (OutOfSync, retried 5x with
"field can not be less than status.capacity"). Setting git to 25Gi gives
headroom matching the soft cap.

monitoring .argocdignore: skip bare dashboard JSON files.
Both fc-updatecenter-dashboard.json and flowercore-remotedesktop-grafana-
dashboard.json live in apps/monitoring/ as source-of-truth for file-
provisioning to noc1's /opt/monitoring/grafana/dashboards/. ArgoCD's
manifest parser tries to unmarshal every file and chokes on bare dashboard
JSON ("Object 'Kind' is missing"), which then poisoned the whole
infra-monitoring Application status (Unknown sync, no comparison possible).
The .argocdignore tells ArgoCD to skip *.json — actual K8s deploys happen
via ConfigMap wrappers like grafana-dashboard-remotedesktop.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 10:11:27 -05:00
Codex
c4065b15a3 deploy(ttsreader): persist voice reference clips on pvc 2026-05-06 20:48:58 -05:00
Codex
a4aa612373 deploy(fc-distribution): roll startup backfill fix 2026-05-06 19:51:11 -05:00
Codex
c2eb37dee9 deploy(ttsreader): enable phase6 biblical routing 2026-05-06 19:46:25 -05:00
Codex
bf6f542569 deploy(fc-distribution): roll latest endpoint fix 2026-05-06 19:38:26 -05:00
Codex
e150b2102f deploy(fc-distribution): roll phase1 api image 2026-05-06 19:31:22 -05:00
Codex
33a765b0bc deploy(fc-intranet-web): roll v20260506-1737 with Wave 2 specialist galleries
6 Wave 2 product galleries landed on intranet master c083016:
- /specialists/telephony  (7 sections + Overview, +11 tests)
- /specialists/library    (8 sections + Overview, +17 tests)
- /specialists/retail     (6 sections + Overview, +16 tests)
- /specialists/mysql      (6 sections + Overview, +22 tests)
- /specialists/php        (6 sections + Overview, +9 tests)
- /specialists/pimanager  (7 sections + Overview, +11 tests)

NavMenu.razor wired with new Specialists section.

Test ledger: 280 -> 366 (+86) full project, 0W/0E build.

Sources: 6 sibling-depth worktrees claude/intranet-w2-{name} dispatched
2026-05-06 per intranet-xxxl-sprint-2026-05-05.md §4 Phase 2.
Inherits Q-IK-1..15 + Q-IS-1..12 + Q-IX-1..7 verbatim per Q-IW-5.
6 Q-IW-1..6 cards on Notes decisions-waiting.html.
2026-05-06 17:38:22 -05:00
Codex
5484ed7db6 Adopt fc-updater into ArgoCD 2026-05-06 17:33:32 -05:00
Codex
2aa84349ea merge claude/bluejay-infra-worldbuilder: roll fc-intranet-web v20260506-2120 with WorldBuilder LIVE flip 2026-05-06 16:22:51 -05:00
Codex
851f8e673b deploy(fc-intranet-web): roll v20260506-2120 with WorldBuilder LIVE flip
WorldBuilder live runtime promotion lands in the Intranet at
/services/world-builder + ServiceRegistry homepage tile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 16:22:43 -05:00
Codex
f78f8c8192 Merge claude/bluejay-infra-ttsreader-4delta: bump fc-ttsreader image for Phase 4delta enrichment landing 2026-05-06 16:04:57 -05:00
Codex
9b255fefc1 merge claude/bluejay-infra-worldbuilder: cpu request fix 2026-05-06 16:04:32 -05:00
Codex
6a89a76e39 fc-ttsreader: bump image to v202605061500 (Phase 4delta enrichment pipeline)
Phase 4delta server-side HTML overlay enrichment landed in
FlowerCore.TtsReader@8f23e15 (master @6091618). Adds 9-pass enrichment +
SQLite-backed cache + 4 REST endpoints (/api/v1/enrich/{html,jsonld,both,passes})
+ RenderRequest.sourceJsonLd. Tests 476 -> 522 (+46). Image already imported
to all RKE2 nodes via deploy.sh; this bumps the bluejay-infra-managed tag so
ArgoCD reconciles the live deployment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 16:04:31 -05:00
Codex
2489464d4f fix(worldbuilder): cpu request 100m -> 25m to clear scheduler
Cluster CPU-request budget at 99% on all 3 RKE2 nodes at deploy time.
0/3 nodes available; "3 Insufficient cpu". Actual CPU usage on the
nodes is 10/52/19%, so the cluster is request-overprovisioned but has
plenty of real headroom. Idle Blazor + SignalR + ComfyUI poller is ~5m.
25m unblocks scheduling and stays generous for expected runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 16:04:25 -05:00
Codex
4b777b16ac monitoring: mirror fc-signage-marquee alert group into noc-monitoring K8s target
Mirror of FlowerCore.Notes/scripts/monitoring/alerts.yml fc-signage-marquee
group into the K8s migration target apps/monitoring/noc-monitoring.yaml so
that future migration of the noc1 Podman monitoring stack into RKE2
inherits the marquee alert ruleset automatically.

Three rules added:
  - MarqueeDroppedFramesHigh        (5% / 5min / warning)
  - MarqueeRenderLatencyP99High     (16ms / 10min / warning)
  - MarqueeAnimationDurationDrift   (10% / 15min / info)

All three gated with `unless on() absent_over_time(metric[7d])` so they
don't fire during the metric-not-yet-published window before Track 3
IR-21 source IMPL ships the MarqueeMeter into Common + Web + WPF.

Live source-of-truth (the noc1 Podman Prometheus reads from
/opt/monitoring/prometheus/alerts.yml) was updated and reloaded
in the same session — Notes commit 300daa0 carries the matching
alerts.yml + Grafana fc-signage-dashboard.json change.

Per feedback_monitoring_k8s_target_vs_live_podman: this file is the
forward-looking K8s migration target, NOT what the live Podman
Prometheus reads. ArgoCD-syncing this file does NOT push alerts to
the live monitoring stack.

Companion to:
- FlowerCore.Notes 300daa0 (live alerts.yml + Grafana panels deployed)
- docs/signage/marquee-performance-telemetry-design.md (Track 3 IR-21 spec)
- docs/signage/marquee-animation-phases.md (Track 6 13-phase coverage matrix)

Memory: project_marquee_vr_promotion_landed_2026_05_06

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 16:01:44 -05:00
Codex
8c60e3a4d3 merge claude/bluejay-infra-worldbuilder: fc-worldbuilder ArgoCD app 2026-05-06 15:57:34 -05:00
Codex
df02b4c3c3 feat(worldbuilder): add fc-worldbuilder ArgoCD app
FlowerCore.WorldBuilder runtime deploy: Namespace + Longhorn PVC + Deployment
+ Service + step-ca Certificate + Traefik IngressRoute. ArgoCD ApplicationSet
picks up apps/worldbuilder/ within ~3 minutes.

Source: D:\git\FlowerCore\FlowerCore.WorldBuilder@6ed6d26

Initial image: localhost/fc-worldbuilder:v202605062048 (already imported on
all 3 RKE2 nodes via ctr images import).

DNS preflight done: worldbuilder.iamworkin.lan -> 10.0.56.200 (Traefik VIP)
in pfSense Unbound (FlowerCore.DNS provider was 502 at deploy time, fell back
to direct pfSense PHP exec via diag_command.php).

ImageGen backend: BLUEJAY-WS http://10.0.56.20:8188 (R9700 / gfx1201 / ROCm
7.2.1). One real branding render confirmed working 2026-05-06T20:36:47Z.

Memory references in README:
- feedback_pfsense_dns_required_for_acme
- feedback_rke2_image_import_per_node_scp
- feedback_k8s_probes_must_not_hit_openapi
- feedback_k8s_probes_behind_auth_middleware
- feedback_dataprotection_keys_persist_to_app_dbcontext

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 15:56:59 -05:00
Codex
c0dceafffd deploy(ttsreader): roll web v20260506-47a88ae 2026-05-06 14:40:57 -05:00
Codex
490db8f9e6 deploy(fc-intranet-web): roll v20260505-1108 with fleet-search seam landed
Bumps tag to bring live pod up to FlowerCore.Intranet.Web@a9ede80 (master tip
post-fleet-search-resurrect merge). Image imported to all 3 RKE2 nodes via
scripts/deploy.sh v20260505-1108.

Closes the source-vs-deployed gap that existed since 2026-04-29: the
KnowledgeFleetSearchController + Service + TrustedHeader auth handler were
running on the deployed pod but never landed on master. Surgical extraction
from stale codex/fleet-knowledge-search branch (12-file rebase conflict made
full merge non-trivial) brings the source up to match production.

+7 tests (280/280 vs 273), 0W/0E build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 10:18:36 -05:00
Codex
1926bdaf3b merge claude/bluejay-infra-update-center-monitoring-2026-05-05: Update Center Operations dashboard mirror (Phase 1D) 2026-05-05 11:01:06 -05:00
Codex
ca8d062826 feat(monitoring): mirror Update Center Operations dashboard (Track 1D)
Adds fc-updatecenter-dashboard.json (uid: fc-updatecenter, version: 2)
to apps/monitoring/ — mirrors the dashboard deployed to noc1 at
/opt/monitoring/grafana/dashboards/fc-updatecenter-dashboard.json.

13 panels: 5 existing probe/availability panels + 1 OTEL row header
+ 7 new panels for the 6 OTEL counters added to FlowerCore.Updater.Web:

  updatecenter_manifest_requests_total
  updatecenter_bundle_download_bytes_total
  updatecenter_bundle_downloads_total
  updatecenter_checkins_total
  updatecenter_release_publishes_total
  updatecenter_signature_verify_failures_total

Live on Grafana at https://grafana.iamworkin.lan/d/fc-updatecenter

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 10:54:39 -05:00
Codex
1889462fc4 deploy(fc-intranet-web): roll v20260505-1041 with fc_dp_keys migration
Bumps tag to include the new AddDataProtectionKeys EF migration that closes
the fc_dp_keys table-creation gap from v20260505-1023. Master tip a82d7d4.

Previous tag v20260505-1023 crash-looped on every page load with
'no such table: fc_dp_keys' due to eb9fe6d (DataProtection-in-DB)
registering the DI but missing the table-creation migration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 10:35:42 -05:00
Codex
523ba61232 deploy(fc-intranet-web): roll Phase 0 closeout image v20260505-1023
Bumps intranet image tag to bring live pod up to FlowerCore.Intranet.Web@ea80c25
(post-XXXL Phase 0 closeout merge). Image imported to all 3 RKE2 nodes via
scripts/deploy.sh v20260505-1023.

Carries the 8 commits from claude/intranet-fleet-fixes:
- Range processing for read-aloud audio
- Blazor SignalR receive limit raise (8 MB)
- ASP.NET footgun sweep (PR #3)
- Self-contained linux-x64 publish (transitive deps)
- Blazor error-ui banner proof + AAT
- DataProtection-in-DB + FcReconnectModal adoption
- Custom .bj-reconnect CSS removal
- Library PNG privacy withdrawal + WorldBuilder design page + Overview enrichment

Tests: 273/273 passed, 32 AAT skipped, 0W/0E build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 10:27:14 -05:00
Codex
53f67c8713 merge claude/k8s-manifest-hardening: K8s gotcha sweep (C7) + lint extensions 2026-05-04 23:00:34 -05:00
Codex
6b9cf3d12c K8s gotcha sweep C7 — extend lint + cover Track A allowlist + scope Notes/k8s
Follow-up to 0b52093 (K8s manifest hardening) closing two real gaps the
prior sweep didn't catch:

1. Public read-write allowlist regression guard (Track A)
   - New PublicReadWriteAllowlistHosts set tracks updatecenter.iamworkin.lan
     + updates.iamworkin.lan. The allowlist on those hosts is
     GET||HEAD||POST||OPTIONS — POST is required for the bootstrap-JWT
     check-in endpoint. PUT/PATCH/DELETE must still 404 at the route.
   - New PublicReadWriteIngressRoutes_MustPinGetHeadPostOptionsAllowlist
     test enforces the allowlist invariant (3 required methods present,
     3 forbidden methods absent).
   - Companion conftest.dev policy 08_public_readwrite_allowlist.rego.

2. Selenium NetworkPolicy DNAT backend port audit
   - FlowerCore.Notes/k8s/selenium/06-networkpolicy.yaml allowed Traefik
     VIP 10.0.56.200:443 + :80 but its 10.42.0.0/16 + 10.43.0.0/16 egress
     rules didn't include the post-DNAT backend ports (8443 for Traefik
     TLS, 8080 for HTTP). Per feedback_netpol_dnat_backend_port: kube-proxy
     DNATs the destination to a backend pod IP+port BEFORE Calico
     evaluates the FORWARD chain, so without those backend ports in the
     pod CIDR rule, Selenium-driven browser AAT calls to
     https://*.iamworkin.lan time out at connect.
   - Lint inventory now includes FlowerCore.Notes/k8s/selenium/ so
     regressions in this manifest fail fast.

Lint scope notes:
   - FlowerCore.Notes/k8s/guacamole/ + monitoring/ are historical
     scaffolds that have diverged from the live state (bluejay-infra/apps/
     is canonical). Operator review is required before bringing them in
     line OR decommissioning them — kept out of lint scope until that
     decision lands (see xxl-regroup-2026-05-03-followup.md "Codex 7 §0").

README hardening:
   - New "Public read-write allowlist hosts" entry under "Known gotchas"
     documenting the GET||HEAD||POST||OPTIONS pattern + linking the lint.

Tests: 8/8 lint tests pass.

Companion fix in FlowerCore.Updater repo on branch
codex/k8s-gotcha-fleet-sweep-c7 (k8s/web-deployment.yaml: localhost/ image
needs imagePullPolicy: Never). The FlowerCore.Updater fix applies to a
deploy that's currently live but bites only on first scheduled-pod
landing on a fresh node — not a live production-impact regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:57:59 -05:00
Codex
0b52093b36 K8s manifest hardening + new bluejay-infra-lint test project
Manifest hardening (per documented memories):
- apps/asterisk/deployment.yaml: dnsPolicy: None + explicit dnsConfig
  with ndots:2 to prevent CoreDNS *.iamworkin.lan template from
  hijacking external egress (downloads.asterisk.org).
- apps/fc-llm-bridge/fc-llm-bridge.yaml: same dnsConfig pattern for
  api.anthropic.com egress.
- apps/fc-ttsreader/fc-ttsreader.yaml: same dnsConfig pattern for
  huggingface.co model seeding.
- apps/fc-messageboard/fc-messageboard.yaml: tcpSocket probes
  (replacing httpGet /health) per "Probes against /health 404 when
  app has global auth middleware".
- apps/fc-signalcontrol/fc-signalcontrol.yaml: same tcpSocket probe
  fix.

New lint project:
- tests/bluejay-infra-lint/BluejayInfraLint.Tests.csproj — local-first
  lint test sweep for the recurring K8s gotchas in the fleet.
- tests/bluejay-infra-lint/FleetManifestLintTests.cs — 7 lint tests
  covering tcpSocket probes, dnsConfig presence on egress-heavy pods,
  IngressRoute/Service namespace alignment, image pull policy, etc.
- tests/bluejay-infra-lint/conftest.dev/ — matching conftest policies
  for environments with conftest/opa.
- .gitignore — adds bin/ + obj/ + DS_Store/swp.

README.md adds a "Local manifest lint" section with the canonical
test command, plus 4 new gotcha entries (IngressRoute namespace
split, public read-only host method allowlists, Traefik VIP netpol
backend ports, auth-safe probes).

Tests: 7 / 7 lint tests passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 03:18:04 -05:00
Codex
7a9098d3bd fix(fc-ttsreader): lower web cpu request 2026-05-04 02:28:11 -05:00
Andrew Stoltz
57d7ba46a7 feat(monitoring): add fc-remotedesktop grafana dashboard
JSON-provisioned dashboard for FlowerCore.RemoteDesktop session metrics,
matches the Apr 23 staging done in the codex/ttsreader-release-b6ca2d5
worktree. Drop into apps/monitoring so ArgoCD-managed Grafana provisioning
picks it up alongside the other FC service dashboards.
2026-04-30 14:32:54 -05:00
Andrew Stoltz
9ec2e2d52e deploy(ttsreader): bump web image to b6ca2d5 2026-04-30 12:43:48 -05:00
Andrew Stoltz
b4d62a8a50 deploy(fc-ttsreader): roll chapter-context image 2026-04-30 02:31:55 -05:00
Andrew Stoltz
fbbc07023b deploy(fc-llm-bridge): roll fc:vision image v202604300022
Source: FlowerCore.LlmBridge@8dd181c (feat: fc:vision route + image
content forwarding). Adds:

- fc:vision tier alias parsing (TryParseTier handles fc:vision,
  FC:VISION, openai/fc:vision, vision)
- Image content forwarding: OpenAi image_url shape (https URL +
  data:[mediaType];base64,... URI) and Anthropic image/source
  passthrough are now promoted to LlmContentBlocks. Text-only
  content-parts arrays still flatten to the legacy joined string.
- DefaultRoutes seeder + appsettings.json gain Vision -> Anthropic +
  claude-sonnet-4-6.

Image built on BLUEJAY-WS, podman save + ctr import to all 3 RKE2
nodes (rke2-server, rke2-agent1, rke2-agent2). Bridge tests: 62/62
green (was 51/51, +11). Backwards-compatible with current chat /
util / embed callers; existing routes unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:26:45 -05:00
Andrew Stoltz
4b0eef0fb0 deploy(fc-llm-bridge): roll alias-fix image v20260430001132 2026-04-30 00:13:48 -05:00
Andrew Stoltz
bb09a3786f fix(knowledge): pin live manifest to bundled edition path 2026-04-29 23:37:02 -05:00
Andrew Stoltz
006dbcf671 fix(agent-zero): export knowledge mcp gate to python builder 2026-04-29 23:32:55 -05:00
Andrew Stoltz
1be71d6ba7 fix(agent-zero): export mcp servers without python indent errors 2026-04-29 23:19:48 -05:00
Andrew Stoltz
0c8026c912 fix(agent-zero): avoid heredoc break in mcp bootstrap 2026-04-29 23:16:54 -05:00
Andrew Stoltz
621ae47e00 fix(agent-zero): repair fc knowledge mcp manifest 2026-04-29 23:11:57 -05:00
Andrew Stoltz
ae6b8c0142 fix(knowledge): keep mcp key env on new token secret 2026-04-29 23:06:07 -05:00
Andrew Stoltz
da55220218 feat(agent-zero): wire fc_knowledge phase1 rollout 2026-04-29 22:59:19 -05:00
Andrew Stoltz
b1ad253dd6 fix(agent-zero): prefix bridge embedding alias for litellm 2026-04-29 21:14:12 -05:00
Andrew Stoltz
ee935f6e07 fix(agent-zero): keep internal util/embed on bridge v1 2026-04-29 21:09:04 -05:00
Andrew Stoltz
2853ee2024 chore(bridge): bump fc-llm-bridge image tag v202604292028 2026-04-29 20:50:55 -05:00
Andrew Stoltz
b4a34e16ca refactor(agent-zero): drop ollama-proxy sidecar (Phase 3) 2026-04-29 20:50:55 -05:00
Andrew Stoltz
0d5a1fd530 fix(agent-zero): route util and embed through llm bridge 2026-04-29 19:14:01 -05:00
Andrew Stoltz
1b633f57b2 chore(infra): wire knowledge MCP api key secret 2026-04-29 18:04:43 -05:00
Andrew Stoltz
ee8afd0a08 deploy(intranet): promote auth-gated intranet image 2026-04-29 17:11:17 -05:00
Andrew Stoltz
cf35884eae deploy(intranet): harden knowledge search rollout 2026-04-29 16:43:09 -05:00
Andrew Stoltz
9881767b11 deploy(intranet): bump intranet web for knowledge search lane 2026-04-29 16:21:27 -05:00
Andrew Stoltz
c9bf23834b chore(ttsreader): bump image to v202604291817
Per-profile MoodAnnotationModelOverride picker — Profiles page now shows
a model dropdown from IModelRegistry instead of a free-text field; model
override null-falls-back to global TtsReader:Ollama:DefaultModel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 13:21:40 -05:00
Andrew Stoltz
174002023d fix(agent-zero): move corpus_search + intranet_search into bluejay-tools-c
The prior commit b71f9e4 created a stray YAML document between the
bluejay-tools-c and bluejay-profile sections. kubectl applied the stray
block's data to bluejay-profile (wrong ConfigMap, wrong mount target).

The setup-bluejay initContainer copies bluejay-tools-{a,b,c} to the tools
directory; bluejay-profile is copied to the agent profile directory. Tools
must live in one of the three tools ConfigMaps.

Fix: insert corpus_search.py and intranet_search.py directly into the
bluejay-tools-c YAML document (before kind/metadata, matching the
data-first layout the rest of the file uses). Also fix two mojibake
characters (→ and ·) that were corrupted in the prior commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:49:23 -05:00
Andrew Stoltz
b71f9e4ec9 feat(agent-zero): add corpus_search + intranet_search to cluster configmaps
- Add corpus_search.py to bluejay-tools-c: semantic vector search over
  fleet SQLite-vec DBs (fleet-workstation-full, fleet-pi-edge, fleet-bmo-bot).
  Returns offline-friendly results for Bible/Greek/Hebrew/Strongs corpora.
  Cluster pod degrades gracefully (no DB mounted yet — BLUEJAY-WS only for now).

- Add intranet_search.py to bluejay-tools-c: live RAG search over the
  intranet vector store via GET /api/search?q=...&topK=N. Uses in-cluster
  service URL (http://intranet-web.intranet.svc:5300) to bypass Traefik TLS
  and the private-range egress denylist.

- Fix intranet_search.py param name: was 'limit', now 'topK' matching the
  SearchController's [FromQuery] parameter name.

- NetworkPolicy: add egress rule for intranet namespace port 5300 (without
  this the pod's TCP connection to the search endpoint was dropped).

- agent-zero.yaml: set FLOWERCORE_INTRANET_URL env var to in-cluster service
  URL so intranet_search uses internal routing, not the public Traefik VIP.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 08:34:31 -05:00
Andrew Stoltz
f1431f7324 feat(agent-zero): wire Print.Web API key to pod via 1Password OnePasswordItem
Add `print-web-api-keys` OnePasswordItem CRD that syncs from 1Password
"Print.Web API Keys" vault item (password field). Mount as PRINT_WEB_API_KEY
env var in the agent-zero container.

The print_web.py Python tool (already in bluejay-tools ConfigMaps) reads
PRINT_WEB_URL and PRINT_WEB_API_KEY env vars for all HTTP calls to the
thermal print service on edge2. Previously the key was unset so every API
call was rejected with 401.

Note: Print.Web uses the legacy REST MCP shape (/api/mcp/tools/*) not the
streamable-http protocol. The Python tool bridges this gap — no /mcp endpoint
exists on Print.Web today. Network policy already allows 10.0.57.16:5200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 20:36:36 -05:00
Andrew Stoltz
35bd055cb4 feat(guacamole): add macmini-vnc-creds OnePasswordItem + fix Mac mini connection IPs
Phase 1 of Mac mini onboarding (2026-04-28):
- Add OnePasswordItem CRD 'macmini-vnc-creds' in guacamole namespace bound to
  vault item 'Mac Mini' — operator mints Secret with username/password/VNC Password fields
- Mac mini discovered at 10.0.56.115 (INFRA VLAN) — not 10.0.57.50 stored in 1P IP field
- Guacamole connections updated via API (not stored here): VNC conn #10, SSH conns #9/#33
  corrected from old IP 10.0.57.50 → 10.0.56.115
- macOS: 26.4.1 (Sequoia), Apple M1, 16 GB, user: bluejay (admin group)
- VNC port 5900 confirmed open; SSH works via noc1 jumpbox with password auth

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 20:09:45 -05:00
Andrew Stoltz
f604ab419e feat(ttsreader): bump image to v202604281923 (SignalR ProgressHub)
Adds ProgressHub endpoint at /hubs/progress with project-scoped
group broadcasting for JobStarted, CueProgress, JobCompleted, and
JobFailed events.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 19:30:41 -05:00
Andrew Stoltz
b2786252b0 chore(ttsreader): bump web image to v202604281831 (ops failed-manifest cleanup)
Deploys fix for stale Failed manifest accumulation in TTS Reader Ops view
and atomic-write guard against empty/corrupt job manifests.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 18:31:53 -05:00
Andrew Stoltz
45ee40920d fix(ttsreader): bump image to v202604281638 (Range support + Ollama timeout 240s) 2026-04-28 16:44:57 -05:00
Andrew Stoltz
8ad7eb714b fix(ttsreader): bump image to v202604281542 (annotation few-shot prompt + UI hint) 2026-04-28 15:46:28 -05:00
Andrew Stoltz
3cb44c3104 feat(noc-services): wire puppetdb.iamworkin.lan through Traefik step-ca cert 2026-04-28 15:13:20 -05:00
Andrew Stoltz
2400329acd fix(intranet): bump image to v20260428-1500 (Monitoring crash patch + Lane 11 anatomy refresh) 2026-04-28 14:59:27 -05:00
Andrew Stoltz
c17af882cc fix(ttsreader): bump image to v202604281444 for UX polish (cross-chapter Bible passage, /profiles dedup, /ops table) 2026-04-28 14:48:13 -05:00
Andrew Stoltz
76b1938afa fix(ttsreader): bump image to v202604281434 for live playback regression patch (study-player + speech override synth) 2026-04-28 14:43:06 -05:00
Andrew Stoltz
ced04a6148 intranet: bump web image to v20260428-0953
Sprint E XXL Intranet docs depth + read-aloud-root sweep deploy.

Image tag v20260427-2353 → v20260428-0953:
- Track A (Intranet.Web@c4f3d78): 7 service pages deepened toward
  PrintService.razor's 8-tab depth standard. Workflows / Verified
  Surfaces / Recent Verified Changes added.
- Read-aloud-root sweep (Intranet.Web@787982c): data-read-aloud-root
  wrappers added to 6 older /services/* pages so the read-aloud
  overlay scopes content extraction precisely instead of falling back
  to <main> with layout chrome included.
2026-04-28 09:54:27 -05:00
Andrew Stoltz
f2258b92a2 fc-ttsreader: bump web image to v202604280946 + add Render__CdnDirectory env
Sprint E XXL Phase 4γ MVP deploy — POST /api/v1/render endpoint.

Two changes:
1. Image tag v202604272339 → v202604280946 (TtsReader@d9e0a58 master tip
   includes the new RenderController + RenderService + 9 tests).
2. New TtsReader__Render__CdnDirectory=/data/cdn env var. Default
   wwwroot/cdn resolves under the read-only app filesystem when
   runAsNonRoot=true; pin to the existing writable PVC mount alongside
   other TtsReader runtime data. Manifests + cue audio land at
   /data/cdn/sha256/<hash>/manifest.json + cues/.

Pre-existing PVC mount at /data/ already covers this — no PVC change
needed, just the env var override.

Pairs with TtsReader@d9e0a58 master tip (ready for image build + import).
2026-04-28 09:47:46 -05:00
Andrew Stoltz
979a7c7b25 feat(intranet): bump fc-intranet-web to v20260427-2353 + persist PageReadingOverrides
Bump intranet image to v20260427-2353 (master @ 38b0148):
- Sprint E search lane: /search Blazor page + IntranetSearchService
  + DocsCorpusIndexer + Shared.Indexing wiring
- 7 new service pages: LocalAiAgents, AiTopology, Distribution, Dns,
  Knowledge, LlmBridge, Provisioning
- PiManager drift docs

New env var: PageReadingOverrides__FilePath=/data/page-reading-overrides.json
so the persisted Lane 2α store lives on the writable PVC instead of
the default in-memory fallback (which loses state on pod restart).
Operator-edited overrides via the existing /api/v1/pages/{encoded}/overrides
controller will now survive across restarts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 23:54:17 -05:00
Andrew Stoltz
0df8f7b936 chore(ttsreader): bump fc-ttsreader-web to v202604272339 (Sprint E Phase C — partial-render UX)
TtsReader@9333480: distinguishes partial-render (yellow Warning, audio
plays, 'Re-render N failed sentences' button) from full-fail (red
Danger, 'Try render again'). New TtsFallbackChainFailedException carries
both voices when Kokoro + Piper both fail; chapter breadcrumb names
the entire chain instead of just the requested voice. +8 tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 23:40:19 -05:00
Andrew Stoltz
38558641c1 fix(ttsreader-kokoro): bump liveness probe timeouts (Sprint E Phase 1a)
Kokoro pod has 4 restarts in 2d6h with exit 143 (SIGTERM from kubelet).
kubectl describe events all show:

  Liveness probe failed: Get "http://10.42.229.109:8880/v1/audio/voices":
    context deadline exceeded

The probe path /v1/audio/voices shares the FastAPI worker pool with
/v1/audio/speech. A long synth (Bible chapter, 30+ sentences) holds the
pool past the prior 5s × 3 = 15s probe window, kubelet kills the pod,
in-flight renders fail. Operator hits "fallback chain failed" toasts +
partial-render breadcrumbs during these windows.

Bump probe timeoutSeconds 5 → 15 and failureThreshold 3 → 5 → 75 s of
grace before kubelet gives up. Combined with the kokoro-side circuit
breaker landing in TtsReader (Sprint E Phase 1b), the FC backend will
also stop slamming kokoro during recovery so it can serve the probe
even faster.

The companion Prometheus alerts (KokoroPodFlapping, PiperPodFlapping)
land in FlowerCore.Notes/scripts/monitoring/alerts.yml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 23:28:07 -05:00
Andrew Stoltz
63d905b4df chore(ttsreader): bump fc-ttsreader-web to v202604272236 (Thinking + Feedback ALTERs) 2026-04-27 22:37:08 -05:00
Andrew Stoltz
d95f4e0caf chore(ttsreader): bump fc-ttsreader-web to v202604272228 (ChatSessions IsFavorite ALTER hotfix) 2026-04-27 22:28:56 -05:00
Andrew Stoltz
7bc565d17e fix(ttsreader): pin VoicePreview CacheDirectory to /data PVC
Day 8 disk-cache warmer crashes on production with
'Read-only file system : /home/app/data' because the relative default
'data/voice-previews' resolves under runAsNonRoot HOME (read-only with
readOnlyRootFilesystem=true). Pin to /data/voice-previews so the cache
lands on the writable PVC mount alongside ttsreader.db, audio output,
and jobs root.

Image v202604272216 (already on nodes) is unaffected by this — only
the env routing changes. ArgoCD reconciles + rollout restart picks up
the new env without rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 22:24:04 -05:00
Andrew Stoltz
dfe9c3b67e chore(ttsreader): bump fc-ttsreader-web to v202604272216 (brace-escape fix) 2026-04-27 22:16:19 -05:00
Andrew Stoltz
37f8db89e4 chore(ttsreader): bump fc-ttsreader-web to v202604272208 (Day 10 + VoiceProfiles hotfix)
v202604272157 crash-looped on the production PVC because Database.EnsureCreated()
is a no-op on existing DBs and the VoiceProfiles table was missing. TtsReader@a9f0b73
adds an idempotent CREATE TABLE IF NOT EXISTS to the infra reconciler before
TtsReaderDataSeeder runs. Bumping the manifest to pick up that fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 22:09:08 -05:00
Andrew Stoltz
00c7d8df24 chore(ttsreader): bump fc-ttsreader-web to v202604272157 (Sprint E Day 10 UX polish)
Compact project page (Setup chip strip + chapter inspect-toggle drawer)
+ render feedback (rolling ETA strip + active-chapter pulse) + Bible
Dashboard navigates to /projects/{id} on queue. Source TtsReader@79de78b.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:58:12 -05:00
Andrew Stoltz
c6811eadd8 intranet: bump image to v20260427-newpages-and-topology
Adds 7 new pages (5 service pages, AI topology, opencode operator guide)
to https://intranet.iamworkin.lan:
  /services/dns
  /services/distribution
  /services/llm-bridge
  /services/knowledge
  /services/provisioning
  /services/ai-topology
  /development/local-ai-agents

Plus topology corrections in /services/ai (AiStack.razor) and 6 new nav entries.

Source commit: FlowerCore.Intranet.Web@1598542 on
codex-wip-pre-readaloud-collision-2026-04-24.

Image built from artifacts/publish via Dockerfile.deploy on BLUEJAY-WS,
imported to all 3 RKE2 nodes (rke2-server + rke2-agent1 + rke2-agent2).

Build: 0 warnings, 0 errors, 197/197 tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:52:34 -05:00
Andrew Stoltz
4d9d537d83 fix(knowledge): repoint Ollama at edge1 + flip README to LIVE (Sprint E B7)
Two changes after the Phase 2.4 deploy went live at
https://knowledge.iamworkin.lan:

1. **Ollama URL flip**: from BLUEJAY-WS (10.0.56.20:11434) to edge1 Pi 5
   (10.0.57.17:11434). Honors the cluster-clean architecture from
   bluejay-infra@0f9d56e ("Workstation is private dev hardware and should
   not be in the cluster path"). Query-time embeddings (~ms per query)
   are fast enough on edge1; bulk index rebuilds (Phase 2.5+) will need a
   separate ingestion lane that can opt into the workstation GPU when
   present. ArgoCD picks up the env-var change and rolls the pod
   automatically — no image rebuild needed.

2. **README LIVE status**: flip the staged-not-yet-applied banner to
   LIVE 2026-04-27. Pod running, certificate issued, PVC bound,
   /healthz 200, /api/v1/editions [] (initial-deploy state). Phase 2.5+
   admin UI handles bulk population.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:56:35 -05:00
Andrew Stoltz
0f9d56ee16 agent-zero: drop BLUEJAY-WS upstream, edge1 Pi is sole Ollama backend
Workstation (BLUEJAY-WS) is private dev hardware and should not be in the
cluster path. Repointing the nginx ollama-proxy sidecar so cluster Agent Zero
talks ONLY to edge1 Pi 5 + AI HAT+ (10.0.57.17:11434):

- nginx upstream: edge1 sole server, no workstation entry
- wait-for-ollama init container: only checks edge1
- NetworkPolicy egress: drop 10.0.56.20/32, keep 10.0.57.17/32
- Comments updated throughout to flag workstation as off-limits to cluster
- Annotation rewritten to document the architectural intent

Pulled qwen2.5:1.5b on edge1 first so Agent Zero's utility_model survives
the cutover (existing models on edge1: qwen3:4b, gemma3:4b, qwen2.5-coder:7b,
nomic-embed-text). Model count on edge1: 4 → 5.

Lets BLUEJAY-WS lock down its Ollama port to localhost without breaking
the cluster Agent Zero.
2026-04-27 16:30:44 -05:00
Andrew Stoltz
3bf6511d5d feat(knowledge): stage Phase 2.4 K8s deployment manifests (Sprint E B2)
NOT YET APPLIED — push to origin/main is gated on the DNS A record
knowledge.iamworkin.lan -> 10.0.56.200 being live. Per memory
feedback_pfsense_dns_required_for_acme, applying the Certificate
without DNS in place puts cert-manager into ~2h HTTP-01 backoff and
needs `kubectl -n knowledge delete order <name>` recovery.

Manifests authored:
- apps/knowledge/knowledge.yaml — Namespace, PVC (knowledge-vector-store
  Longhorn 20Gi RWO), Deployment (single replica, Recreate, image
  localhost/fc-knowledge-web:v202604272200 placeholder, runAsNonRoot
  1654, readOnlyRootFilesystem, drop ALL caps, /healthz startupProbe +
  readinessProbe, tcpSocket livenessProbe), Service (ClusterIP port
  80 -> 8080), Certificate (step-ca-acme ClusterIssuer, 90d duration),
  IngressRoute (knowledge.iamworkin.lan, websecure entrypoint).
- apps/knowledge/kustomization.yaml — `kubectl kustomize` preview file
  (matches fc-distribution shape; ApplicationSet uses dir generator).
- apps/knowledge/README.md — deployment order checklist with the DNS
  preflight, image build/import loop for all 3 RKE2 nodes, push
  procedure, smoke verification, initial-deploy-state notes
  (zero editions until *.db files are pushed to the PVC), resource
  sizing, probe + middleware notes.

Companion artifacts (separate repos, separate commits):
- FlowerCore.Knowledge@eb91eb4 — Dockerfile.deploy at repo root
- FlowerCore.Notes@96cd443 — scripts/deploy-knowledge.sh

Apply order (from apps/knowledge/README.md):
1. Add DNS A record knowledge.iamworkin.lan -> 10.0.56.200 via
   FlowerCore.DNS or pfSense web UI.
2. Run `bash scripts/deploy-knowledge.sh` from FlowerCore.Notes — this
   builds + imports the image to all 3 RKE2 nodes with
   FLOWERCORE_DEPLOY_SKIP_ROLLOUT=1 (since the Deployment doesn't
   exist yet on the cluster).
3. Bump the image tag in this manifest to match the freshly-imported
   tag, then `git push` from this repo to land on main. ArgoCD picks
   up within ~3 minutes and creates `infra-knowledge`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:28:26 -05:00
Andrew Stoltz
3e0b9055b0 monitoring: paper-roll lifecycle alerts (XL Track I)
Three new Prometheus alert rules for the print-services group, all routed
to thermal_print via alert_channel label (Grafana contact point ->
irc-notify -> Print.Web /api/print/alert):

- PrintPaperRollLow      (warning, 5-10% remaining, 5m for)
- PrintPaperRollCritical (critical, <=5% remaining, 2m for)
- PrintJobDeadLetter     (warning, any new dead-letter in 15m)

Source-of-truth gauge is print_paper_remaining_percent (Print.Web OTEL),
which is hydrated from the active PaperRoll row at process startup
(Print.Web@<TBD> HydrateMetricsAsync) so the gauge isn't blind for an
arbitrary window after every deploy.

Self-referential humor: low-roll alerts route to the printer that's
running out of paper, so it announces its own paper-out warning on its
remaining paper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 16:00:40 -05:00
Andrew Stoltz
c828832808 edge2-services: print.iamworkin.lan Traefik HTTPS for Print.Web (XL Track C)
Adds an IngressRoute + cert-manager Certificate that terminates HTTPS for
print.iamworkin.lan and proxies to edge2's Print.Web at 10.0.57.16:5200.

Same headless-Service-with-manual-Endpoints pattern as noc-services (used
for grafana/prometheus/cockpit on noc1). pfSense Unbound already resolves
print.iamworkin.lan to the Traefik VIP 10.0.56.200, so cert-manager
HTTP-01 should validate cleanly.

No basicAuth middleware: Print.Web has its own X-Api-Key authentication
and exposes anonymous endpoints for the bookmarklet / Python CLI /
cups-notifier flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 14:37:33 -05:00
Andrew Stoltz
e2c71c2b8a fix agent-zero ollama-proxy crashloop + add Longhorn monitoring
agent-zero ollama-proxy had 172 historic restarts (now stable).
Root cause: liveness/readiness probes hit /api/tags which proxies
through to BLUEJAY-WS Ollama (10.0.56.20:11434). When the workstation
Ollama is slow or offline, nginx fails over to the edge1 backup —
but the failover takes >1s and the kube-probe default timeoutSeconds=1
gives up first. Three failed probes → kubelet kills the container.

Fix:
- Add nginx local healthz endpoint (200, no upstream).
- Liveness probe → /healthz (proves nginx itself is alive).
- Readiness probe stays on /api/tags but with timeoutSeconds=5 so
  failover to backup completes before the probe times out.

This decouples liveness from upstream availability — kubelet only
restarts the proxy when nginx is genuinely dead, not when Ollama is
slow.

Longhorn coverage gap: K8s emits "snapshot becomes not ready to use"
events constantly during the hourly snapshot lifecycle (1047
snapshots, all readyToUse=true on inspect). Those events were the
only signal we had — purely transient lifecycle noise, not actionable.

Add:
- longhorn scrape job (longhorn-backend.longhorn-system.svc:9500)
- NetworkPolicy egress rule for longhorn-system port 9500
- 4 new alerts in 'longhorn-storage' group:
  - LonghornVolumeDegraded (>15m) — replica unhealthy, auto-rebuild
  - LonghornVolumeFaulted (>5m, critical, thermal print) — data loss
  - LonghornBackupStale (no completed backup in >36h) — recurring job
    silently failing
  - LonghornNodeUnhealthy (>5m) — node ready=false

zabbix-web 7 restarts and Print.Web 12:55 stop investigated — both
are stable now, no actionable cause found in journal/events. Adding
KubeContainerRestartingFrequently in the previous commit will catch
recurrence of either.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:31:14 -05:00
Andrew Stoltz
b3028f5119 monitoring: fix RemoteDesktop pool alerts for stale per-status series
Followup to 05a273d. After deploy, six PoolDepleted/Deficit alerts
went pending again because the publisher emits per-status gauge
series (fc_desktop_pool_depleted{template,status,alert_level}) and
the historical Warming/BelowDesiredSize series stay at value=1 even
after the template transitions to status=Ready. Filtering by
alert_level=Critical/Warning was not enough — those labels are baked
into the stale series too.

Replace with a join-based query: alert only when the canonical
"Ready" status gauge does NOT report ready=1 for the enabled
template. fc_desktop_pool_ready{status="Ready"}==1 is the publisher's
own current-state canary and never goes stale.

Verified against the live cluster — query returns 0 results when all
pools report healthy in their reconcile logs (no stale-label false
positives).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:12:10 -05:00
Andrew Stoltz
05a273d3a6 monitoring: switch K8s scrapes to ClusterIP svc + fix probe paths
Followup to ab6ade4. Three issues uncovered after the rollout:

1. NodePort hairpin breaks scrape from same-node pod. Prometheus on
   rke2-agent1 could reach traefik-metrics on .11/.13 NodePort 30900
   but timed out on its OWN node's NodePort. Same problem would hit
   kube-state-metrics + cert-manager whenever prometheus reschedules.
   Fix: scrape via ClusterIP svc DNS instead of NodePort. NodePorts
   stay in place for external/Podman scrapers.
2. probe-traefik-services failed for grafana, prometheus, guac with
   non-200/3xx codes. grafana + prometheus are behind Traefik basic-
   auth (every endpoint returns 401), so drop from probe surface —
   health is covered by the in-cluster monitoring-* scrape jobs.
   guac.iamworkin.lan was deprecated when Guacamole moved under
   desktop.iamworkin.lan/guacamole/ — drop it.
3. acme path was wrong (root 404). Use /health.

Coverage adds (probe-traefik-services):
chat, dist, dms, menuboard, messageboard, presentations, retail,
ttsreader. All of these have IngressRoutes serving root at 200/3xx.

NetworkPolicy egress rules added so the new ClusterIP svc scrapes work:
- traefik-system: port 9100 (metrics) — separate from data-path 8080/8443
- kube-system: port 8080 (kube-state-metrics)
- cert-manager: port 9402 (controller metrics)

Out-of-band fix during this audit:
- Print.Web on edge2 was inactive (clean exit at 12:55 CDT, root cause
  unclear — systemd Stopping signal). Restarted. Service back on 5200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:05:32 -05:00
Andrew Stoltz
ab6ade4e46 monitoring: stabilize firing alerts + add cluster-state coverage
Live audit on 2026-04-26 found 14 firing alerts caused by stale probe
targets, blackbox TLS verify failures, and stale state-as-label series.
Plus three K8s scrape sources (kube-state-metrics, cert-manager,
traefik) that exposed NodePorts but were not in any scrape config.

Fixes
- probe-remotedesktop: switch http_2xx -> https_internal. Blackbox does
  not trust step-ca root, so /health was failing with x509 unknown
  authority while the app served 200s.
- probe-agentzero-nuc: short svc form (agent-zero.agent-zero.svc:80)
  instead of *.cluster.local. The FQDN form was being rewritten to the
  Traefik VIP by the CoreDNS iamworkin.lan template + ndots:5 search
  expansion, then 5s timeout.
- probe-agentzero-local + probe-ollama-local: removed. 10.0.58.100 is on
  HOME VLAN and not reachable from cluster pods. Workstation/AI-laptop
  Ollama monitoring belongs to host-side Puppet, not cluster blackbox.
- snmp-cloudkey: commented out. The Cloud Key Gen2+ runs unifi-core
  (controller), not an SNMP agent. Was generating "connection refused"
  every 30s.
- RemoteDesktopPoolDepleted / RemoteDesktopPoolDeficitSustained:
  filter on alert_level=Critical / Warning|Critical + enabled=true.
  The publisher emits one series per template per status without
  resetting old series to 0, so the historical Warming/BelowDesiredSize
  series stayed at 1 and the alert kept firing on stale labels.
- RemoteDesktopTlsExpiry: match by job, not hostname-only instance.
  The probe sets instance=https://desktop.iamworkin.lan/health so a
  hostname-only label match never fired.
- EpsonPrinterDown for: 5m -> 30m. EcoTank sleeps after ~5 min idle and
  SNMP times out, so 5m guaranteed nightly noise.

Coverage adds
- kube-state-metrics scrape (NodePort 30901). Required for the new
  pod-state alerts and a long list of standard K8s SLO queries.
- cert-manager scrape (NodePort 30902). Required for the
  CertManagerCertificateNotReady / RenewalFailed alert pair documented
  in project_cert_manager_prometheus_scrape.
- traefik scrape (NodePort 30900) on all three nodes.
- probe-traefik-services: HTTPS probe (https_internal) over the 17 main
  iamworkin.lan hosts so any Traefik-fronted service returning non-200
  shows up as a single named probe failure.
- blackbox-config: add the https_internal module that the new probes
  reference (was only in the FlowerCore.Notes scripts/monitoring copy,
  not in the live ConfigMap).

New alerts (kubernetes-state group)
- KubeContainerRestartingFrequently (>5 restarts/h)
- KubeContainerCrashLooping (>3 restarts/15m, thermal print)
- KubePodNotReady (Pending/Failed/Unknown >15m)
- KubePodImagePullBackOff (>10m)
- KubeDeploymentReplicasMismatch (>15m)

Without these, the agent-zero ollama-proxy 172x restart loop was
invisible for ~3 days. Same gap would have hidden the fc-php
php84-app-probe ImagePullBackOff orphan (cleaned up out of band).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:57:18 -05:00
Andrew Stoltz
4848f72eec fc-telephony: bump web to v202604252156 (T7 step trail) 2026-04-25 21:56:14 -05:00
Andrew Stoltz
f5eafc5def fc-telephony: bump web to v202604252144
Live workflow position tracking + canvas overlay sprint.
- Schema: CallSession.CurrentStep* + CallLog.Step* (migration
  AddCallSessionWorkflowPosition)
- Real-time CallStepExecuted events on every step entry, both
  Asterisk and Twilio paths
- New /calls/{id}/workflow live workflow viewer with visited
  path overlay and pulsing current-step badge
- GET /api/sessions/{id}/path + MCP get_call_session_path
- ActiveCalls 30s -> 3s poll + Live indicator + per-row View
  Workflow link
- Asterisk regroup also rolls in: playback verification,
  fallback chain, MainLayout refresh

Tests: 11525 -> 11549 pass / 1 skip / 0 fail. Build 0E.
Source: master @ 05b3d1c.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 21:44:14 -05:00
Andrew Stoltz
2d3fd74bab fc-ttsreader: bump web to v202604252002 (alignment Status guard relaxed) 2026-04-25 20:06:26 -05:00
Andrew Stoltz
df4e1f78b0 fc-ttsreader: bump web to v202604251956 (XL: per-chapter annotate + word-level alignment + Study TOC + Resume row) 2026-04-25 19:59:56 -05:00
Andrew Stoltz
2a10b775a8 fc-ttsreader: bump web to v202604251935 (Slice 5: select-to-annotate pronunciation + mood) 2026-04-25 19:39:27 -05:00
Andrew Stoltz
447ddd339d fc-ttsreader: bump web to v202604251917 (Bible passage shorthand: 'Esther 1' etc.) 2026-04-25 19:21:00 -05:00
Andrew Stoltz
7833143c1c fc-ttsreader: bump web to v202604251903 (Slices 2/3/4 + Lane I MCP + Lane J pills) 2026-04-25 19:08:08 -05:00
Andrew Stoltz
8ed77c4627 fc-ttsreader: bump web to v202604251836 (seek-race + auto-scroll + active-cue contrast) 2026-04-25 18:41:17 -05:00
Andrew Stoltz
437f346aee fc-ttsreader: register ttsreader-modern Deployment + Service
Adds the Deployment + Service for the fc-modern-tts container that
landed in the previous commit. Same shape as ttsreader-biblical:
runAsNonRoot uid 1654, dnsPolicy: None to bypass the iamworkin.lan
hijack on Microsoft endpoint lookups, /health probes, modest CPU/mem
since edge-tts is network-bound.

Service surfaces ttsreader-modern.fc-ttsreader.svc:10403 for the web
pod to call when the operator picks a he-IL-* or el-GR-* voice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 18:39:58 -05:00
Andrew Stoltz
bc32b5ef04 fc-ttsreader: deploy fc-modern-tts (Edge Read Aloud Hebrew/Greek)
Adds a fourth TTS engine alongside Piper / Kokoro / biblical-tts: a
small FastAPI bridge to Microsoft Edge's Read Aloud TTS via the
edge-tts Python package. Provides studio-quality Modern Hebrew (he-IL)
and Modern Greek (el-GR) narrators for the cluster.

modern-tts/Dockerfile + app.py:
- Python 3.12 base + edge-tts==7.2.8 (older versions hit 403 from MS).
- POST /tts -> MP3 audio (audio/mpeg).
- POST /timings -> word-level timings. Edge sometimes omits WordBoundary
  events for non-English voices; fall back to MP3-frame-walking duration
  estimate + proportional distribution across whitespace-split words
  (same approach biblical-tts uses for eSpeak).
- GET /voices?language=all|default — filtered to he-/el- by default so
  the AiStation voice picker isn't overwhelmed by 400+ voices.
- GET /health for probes.
- Body shape mirrors BiblicalTtsRequest so the .NET client lives in the
  same FlowerCore.Shared.Speech package.

K8s deployment in fc-ttsreader namespace:
- ttsreader-modern Deployment + Service on port 10403.
- localhost/fc-modern-tts:v1, imagePullPolicy: Never (built on noc1,
  imported to all 3 RKE2 nodes via ctr).
- runAsNonRoot uid 1654 + fsGroup 1654.
- dnsPolicy: None to bypass the *.iamworkin.lan template hijack on
  Microsoft endpoint lookups.
- Modest resources (100m/128Mi req, 1000m/512Mi limit) — edge-tts is
  network-bound, not compute-bound.
- Probes against /health.

Verified live locally: container handles 'Καλημέρα Ελλάδα Πώς είστε'
in 2496ms, returns el-GR-NestorasNeural voice + 4 word timings.
Hebrew: 'בְּרֵאשִׁית בָּרָא אֱלֹהִים' returns he-IL-AvriNeural,
2472ms, 3 words.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 18:39:21 -05:00
Andrew Stoltz
263d06acb9 fc-ttsreader: bump web to v202604251750 (Lane H Slice 1: inline Study view + chapter notes) 2026-04-25 17:54:09 -05:00
Andrew Stoltz
25dbb2967f fc-ttsreader: bump web to v202604251714 (BuildRenderPlan splits each mood block via SpeechSentenceSegmenter) 2026-04-25 17:18:09 -05:00
Andrew Stoltz
a89a774eaf fc-ttsreader: deploy eSpeak-NG biblical-tts (Ancient Greek + Hebrew)
Adds a third TTS engine alongside Piper (modern English/multi-lang) and
Kokoro (high-quality English): a small FastAPI wrapper around eSpeak-NG
with built-in support for Ancient Greek (grc), Hebrew (he), and Modern
Greek (el). Same shape as fc-speech-align so AiStation talks to all the
TTS/alignment services with one HTTP client pattern.

biblical-tts/Dockerfile + app.py:
- Python 3.12 base + apt-get espeak-ng + libsndfile1 + ffmpeg-free deps.
- POST /tts -> WAV audio bytes (audio/wav).
- POST /timings -> word-level timings derived from espeak's --pho phoneme
  duration stream, distributed across whitespace-split words proportional
  to character count. Accuracy is good enough for chip-level read-along
  highlighting (~30-80ms per-word jitter).
- GET /voices for catalog discovery, GET /health for probes.
- Body shape mirrors AlignmentRequest from FlowerCore.Shared.Speech so
  the .NET BiblicalTtsClient round-trips it cleanly.

K8s deployment in fc-ttsreader namespace:
- ttsreader-biblical Deployment + Service on port 10402.
- localhost/fc-biblical-tts:v1, imagePullPolicy: Never (built on noc1,
  imported to all 3 RKE2 nodes via ctr).
- runAsNonRoot uid 1654 to match the namespace's standard security ctx.
- Modest resources (100m/128Mi req, 1000m/512Mi limit) — eSpeak is
  CPU-cheap.
- Probes hit /health which returns the supported language list.

Verified live: container started, /health returns ok with grc/el/he,
POST /timings on Ἐν ἀρχῇ ἦν ὁ λόγος returned 5 words / 1714ms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:17:38 -05:00
Andrew Stoltz
dc39747f3f fc-ttsreader: piper memory 1Gi -> 3Gi to stop OOMKill mid-render 2026-04-25 17:10:20 -05:00
Andrew Stoltz
87050e72a9 fc-ttsreader: deploy Kokoro to the cluster (replaces BLUEJAY-WS host pointer)
The cluster ttsreader-web was reaching across to BLUEJAY-WS:10401 for
Kokoro synthesis, which meant a workstation-down event broke render-
pipeline TTS. Add a cluster-native ttsreader-kokoro Deployment and
Service inside fc-ttsreader so the cluster owns the engine.

- Image: ghcr.io/remsky/kokoro-fastapi-cpu:latest. Model + 67 voices
  ship inside the image, so no PVC is required.
- Port 8880 (the kokoro-fastapi default; the entrypoint hardcodes it).
- Resources: 250m/1Gi request, 2000m/3Gi limit. CPU-only inference
  matches what AiStation runs locally on BLUEJAY-WS.
- dnsPolicy: None to bypass CoreDNS's *.iamworkin.lan template hijack
  on huggingface.co lookups, same shape as ttsreader-align.
- Probes hit /v1/audio/voices since the kokoro server doesn't expose
  /health; that endpoint is cheap (lists configured voice files).

ttsreader-web env var TtsReader__Kokoro__BaseUrl flips from the
workstation pointer to the cluster service:
http://ttsreader-kokoro.fc-ttsreader.svc.cluster.local.:8880.

AiStation keeps its local http://localhost:8880 since the workstation
operator still wants the audio to render on the local sound device
without a network hop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 16:56:39 -05:00
Andrew Stoltz
e8c5d2afd2 fc-ttsreader: bump web to v202604251544 (Unicode sanitize + continue-on-segment-fail) 2026-04-25 15:50:16 -05:00
Andrew Stoltz
eef492125f fc-ttsreader: bump web to v202604251534 (SignalR 8MB + DOM-peek + poll loop refresh fix) 2026-04-25 15:39:18 -05:00
Andrew Stoltz
b51ee35bfa fc-speech-align: v3 — emit FlowerCore.Shared.Speech word contract
The /align endpoint was returning Whisper-native word fields
(word/startSeconds/endSeconds/confidence), but FlowerCore.Shared.Speech's
FasterWhisperAlignmentClient on master deserializes
FasterWhisperWord against [JsonPropertyName("text")/("startMs")/("endMs")].
Result: ttsreader-web reported alignment.source="whisper" with words[]
present but every entry had Text="" and StartMs=EndMs=0 — visible in the
2026-04-25 hello-world smoke against ttsreader.iamworkin.lan.

Match the published Common contract instead of the Python model's native
shape: emit text/startMs/endMs (millisecond ints, not float seconds).
Confidence stays on the wire as informational; the deployed C# client
ignores it but a future fc-align operator UI can surface low-confidence
words. Bump tag to v3 and bump the Deployment image accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 11:52:14 -05:00
Andrew Stoltz
4abc2fa95d fc-speech-align: add dnsPolicy: None to bypass CoreDNS *.iamworkin.lan template hijack on huggingface.co 2026-04-25 11:12:21 -05:00
Andrew Stoltz
d7628a6945 fc-speech-align: bump to v2 with explicit requests dep (faster-whisper 1.0.3 missing transitive) 2026-04-25 10:55:51 -05:00
Andrew Stoltz
df115e4d1e fc-ttsreader: ship cluster-native fc-speech-align (faster-whisper) + bump web
- New ttsreader-align Deployment + Service + 5Gi PVC under
  apps/fc-ttsreader/. Wraps SYSTRAN/faster-whisper in a small FastAPI app
  exposing POST /align (fc-align contract used by Shared.Speech) AND
  POST /transcribe (audio-in feature consumed by ttsreader-web Lane G).
  Source: apps/fc-ttsreader/speech-align/ (Dockerfile + app.py +
  requirements.txt). Built locally (apt-get RUN steps need BLUEJAY-WS,
  not noc1) and ctr-imported to all 3 RKE2 nodes.
- ttsreader-web env: flip Speech__Alignment__Enabled=true and point
  BaseUrl at http://ttsreader-align.fc-ttsreader.svc.cluster.local.:9200.
  Add new TtsReader__Transcription__* env triplet pointing at the same
  service (same /transcribe endpoint).
- Bump ttsreader-web image to v202604251046 (carries the
  TranscriptionController + MCP tool + Quick.razor InputFile UI).
2026-04-25 10:50:45 -05:00
Andrew Stoltz
9df26620b8 fc-ttsreader: disable Whisper, fall back to estimator until backend is reachable
The cluster-wide pod cannot reach BLUEJAY-WS speaches on 10.0.56.20:9200
because the rootless+host-net podman setup binds 127.0.0.1 only on the
WSL machine; nothing on the LAN-facing interface. The openai-compatible
Backend value also relied on a Common change still on feat/shared-indexing
rather than master, so the deployed image's Shared.Speech only knows
the FC-native /align shape.

Disable Speech:Alignment for now. EstimatedAlignmentClient kicks in and
keeps /api/v1/voices/preview-with-timings returning word-aligned JSON,
just with uniform-distribution timings instead of real Whisper output.

Re-enable once: (a) Common's openai-compatible Backend lands on master
and a new TtsReader image ships, or (b) we point at a LAN-routable
backend (e.g. an aiohttp /align shim, or speaches running on a node
that's actually reachable from cluster pods).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 10:28:21 -05:00
Andrew Stoltz
08aa7a5bff fc-ttsreader: route Whisper alignment at openai-compatible backend
The fc-speech-align container on BLUEJAY-WS (port 9200) is the speaches
build of faster-whisper-server, which exposes the OpenAI-compatible
/v1/audio/transcriptions contract — not the FlowerCore /align contract.

FasterWhisperAlignmentClient (FlowerCore.Common a1b3bfc) supports both
shapes; tell it explicitly to talk OpenAI-compatible here so requests land
on the right endpoint and verbose_json gets adapted into the FC alignment
response. Also pin the Model id to one speaches recognizes.

Switch back to fc-align once a native /align backend is deployed (or wire
a tiny FastAPI shim in front of speaches if we want a stable contract).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 10:25:24 -05:00
Andrew Stoltz
38e20a8b64 fc-ttsreader: bump to v202604251018 (resume banner + scope-ID rebuild fix) 2026-04-25 10:24:39 -05:00
Andrew Stoltz
c945d44b9e fc-ttsreader: bump to v202604250729 (playback bookmark + breaker fix) 2026-04-25 07:33:38 -05:00
Andrew Stoltz
1f1354f634 fc-intranet-web: bump to v202604242354overridefix 2026-04-24 23:57:18 -05:00
Andrew Stoltz
76ece92cfd fc-ttsreader: enable real Whisper alignment via fc-speech-align
Flips Speech__Alignment__Enabled=true and points BaseUrl at the
new BLUEJAY-WS podman quadlet running fc-speech-align (faster-
whisper, /align contract). When Lane 1δ's
/api/v1/voices/preview-with-timings runs after this lands, the
alignment.source field flips from 'estimated' to 'whisper' and
the per-word timings come from real audio analysis instead of
uniform-spacing estimates.

No image rebuild — the Lane 1α DI registration already routes
IWhisperAlignmentClient to FasterWhisperAlignmentClient when
Speech:Alignment:Enabled is true.

Companion firewall rule from FlowerCore.Puppet@bbc02ea +
@05504ed (whisper_align_enabled flag on bluejay-ws-linux Hiera)
opens port 9200 to RKE2 pod CIDR durably.
2026-04-24 23:37:30 -05:00
Andrew Stoltz
a760a58846 fc-intranet-web: bump to v202604242315wordhighlight (Lane 1γ.1)
Word-level highlighting + inline annotation popover in the
read-aloud bar, backed by TtsReader's preview-with-timings
(Lane 1δ) and the existing /api/v1/pages/{encodedUrl}/overrides
REST surface (Lane 2α).

Built from FlowerCore.Intranet.Web@9abde21 against
FlowerCore.Common@d23d4c3, both on master.
2026-04-24 23:21:59 -05:00
Andrew Stoltz
9fb526c7c5 fc-ttsreader: bump to v202604242301readwithtimings
Picks up Lane 1δ + 3β:
- Lane 3β: GET /reader-embed iframe-friendly host route + public-
  read CORS for /api/v1/voices and /_content/.../embed/
- Lane 1δ: GET /api/v1/voices/preview-with-timings — pairs
  synthesized audio with per-word alignment timings (faster-
  whisper or estimated fallback) so embed bundle / FcReaderOverlay
  / in-app /voices preview can word-highlight in one round-trip
- Latest FlowerCore.UI.Components from Common master:
  FcReaderOverlay annotation popover (Lane 2γ) + <fc-reader>
  standalone embed bundle (Phase 3)

Built from FlowerCore.TtsReader@06ef815 (master) against
FlowerCore.Common@d23d4c3 (master).
Image imported on rke2-server / rke2-agent1 / rke2-agent2.
2026-04-24 23:05:41 -05:00
Andrew Stoltz
dd7980642e fc-intranet-web: bump to v202604242222readaloud
Picks up the merged Lane 1γ + Lane 2α + Lane 1β + Phase 3 work:
top-bar Read aloud button + per-page reading overrides REST +
FcReaderOverlay shared component + <fc-reader> embed bundle.

Built from FlowerCore.Intranet.Web@35a552f against
FlowerCore.Common@a56975a, both on master.
Image imported on rke2-server / rke2-agent1 / rke2-agent2.
2026-04-24 22:26:12 -05:00
Andrew Stoltz
1d4ad64226 chore(ttsreader): bump to v202604241555backlog (rebuild with correct publish/)
Previous v202604241543backlog image accidentally used a stale
publish/ directory at the TtsReader repo root (Dockerfile.deploy
says COPY publish/ but my ad-hoc publish wrote to artifacts/publish/).
Rebuilt with a clean copy from artifacts/publish/ to publish/ first.

Confirmed new image has appsettings.json Preview section + the
quick-swipe-gestures.js asset baked in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 15:53:55 -05:00
Andrew Stoltz
774f82c431 chore(ttsreader): bump image to v202604241543backlog
Merges three parallel backlog lanes onto longsegfix:
- B (hotfix/preview-timeout-2026-04-24): 25s preview-path timeout, 504 on
  expiry, config-tunable via TtsReader:Preview:TimeoutSeconds.
- C (feature/swipe-gestures-2026-04-24): PointerEvent swipe gestures on
  /quick (prev/next sentence) + CSS fade-hint.
- D (feature/bible-defaults-button-2026-04-24): Apply Bible speech
  defaults button on project detail page with confirm + toast.

TtsReader master octopus merge: 730e7fa. Tests 155/155 .NET + 13/13 JS.
28 MCP tools. Built + imported on all 3 RKE2 nodes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 15:47:13 -05:00
Andrew Stoltz
d2cc36ea0e chore(ttsreader): bump image to v202604241345longsegfix
Hotfix for two live render errors:
- Kokoro chapter render failed with "count ('-1') must be non-negative"
  — streaming-WAV chunk-size sentinel (0xFFFFFFFF) read as -1.
- Piper render timed out on book-chapter paragraphs with no sentence
  punctuation — one giant segment exceeded the 2-min timeout.

Source fix: FlowerCore.TtsReader@826589b. 153/153 tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 13:58:45 -05:00
Andrew Stoltz
299070e4bf fc-ttsreader: bump to v202604241332 (XL sprint — sidebar shell + CHAP + offline ZIP + breaker)
Merges four XL-sprint lanes to TtsReader master:
- FcSidebarLayout swap (MainLayout + NavMenu, deletes 88-line shell CSS)
- ID3 CHAP chapter markers in rendered MP3s (Apple Podcasts chapter list)
- POST /api/v1/projects/{id}/export/zip offline bundle + MCP tool
- Per-engine TtsCircuitBreaker with Kokoro→Piper fallback + /voices
  engine-health panel + GET /api/v1/voices/engines/status

153/153 tests green on Linux dotnet build. Image imported to
rke2-server / rke2-agent1 / rke2-agent2. No app env changes beyond
the image tag.
2026-04-24 13:37:33 -05:00
Andrew Stoltz
a9debd8668 fix(guacamole): remove .notification button::before override that clipped native action-icon sprite
Guacamole 1.6 renders .button.home/.button.logout/.button.reconnect icons
via an absolutely-positioned ::before pseudo-element with width:1.8em
and background-position:.5em .45em. The Blue Jay branding CSS was
clamping every .notification button::before to display:inline-flex
and width:1rem, so only the top-left sliver of the sprite rendered —
appearing as a green/purple garbage rectangle on the connection-not-found
page Home button. Reconnect escaped because it doesn't carry a
background-image on its ::before. The rule was redundant anyway: the
.notification button flex row + padding already spaces the native icon
cleanly. Only the custom .fc-embed-logout-disabled::before override
remains (intentionally dims the replacement disabled-logout pseudo-element).
2026-04-24 11:23:46 -05:00
Andrew Stoltz
675b9da4f9 intranet: bump to v202604240144longchunk2 (tightened chunk cap)
v202604240140longchunk still hit 400 Bad Request from nomic-embed-text
on several batches — the chars/4 token estimate was optimistic for
code-heavy/Unicode content. Rebuilt from FlowerCore.Common@e1c28b4
which tightens MarkdownChunker hard cap (ChunkSizeTokens × 2, clamped
at 16000 chars) AND adds a character-length check in IndexBuilder's
safety filter alongside the estimated-tokens check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 01:44:58 -05:00
Andrew Stoltz
2b471a55b0 intranet: bump to v202604240140longchunk (rebuild with correct corpus)
v202604240135longchunk image shipped with only 1 file in the baked
corpus (NEXT-SPRINT.md) because the corpus tar was accidentally built
from the Intranet.Web working directory instead of the Notes repo
root. Rebuilt from the right cwd; new image has the expected 370
*.md + *.html files at /srv/flowercore-notes/docs/.

Same long-chunk handling code as v202604240135longchunk; just a clean
rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 01:40:49 -05:00
Andrew Stoltz
37ce0aed85 intranet: v202604240135longchunk — long-chunk handling fix
Image bump v202604240108gpu -> v202604240135longchunk, rebuilt from
FlowerCore.Intranet.Web@feat/shared-indexing-search HEAD which transitively
picks up FlowerCore.Common@feat/shared-indexing@105af75:

- MarkdownChunker hard-caps oversized heading-bounded sections at
  ChunkSizeTokens × 4 chars and splits with overlap (same pattern as
  JsonArticleChunker). Stops the indexer from producing chunks above
  nomic-embed-text's 8192-token input limit at the source.

- IndexBuilder gains IndexingOptions.MaxEmbeddingTokens (default 8000)
  safety filter — chunks above the cap are warn-logged and dropped
  before any batch is sent. New IndexBuildResult.ChunksDropped tracks
  how many got skipped.

Goal: notes-md should index 2541/2541 chunks (vs. 2080/2541 last pass)
with zero "Failed to embed batch" 400s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 01:28:00 -05:00
Andrew Stoltz
a37fc83584 ttsreader: bump to v202604240117 (fix blazor-error-ui scope drift) 2026-04-24 01:21:38 -05:00
Andrew Stoltz
3a8aae9e2d chore(guacamole): retire legacy guac.iamworkin.lan IngressRoute+cert
Single-host routing via desktop.iamworkin.lan/guacamole has been
live-proven (curl → 200) and the Codex single-host-guacamole-wip
merge flipped RemoteDesktop.Web's GuacamolePublicUrl + defaults to
the new path. Nothing else in FlowerCore actively requires the
legacy guac.iamworkin.lan URL.

Removed from the guacamole app:
- IngressRoute `guacamole` matching Host(guac.iamworkin.lan)
- Middleware `guac-add-prefix` (only the legacy route referenced it)
- Certificate `guacamole-tls` (only covered guac.iamworkin.lan)

ArgoCD prune will delete the live resources on next sync. The
pfSense DNS override for guac.iamworkin.lan should be removed
via FlowerCore.DNS as a follow-up operator step — not managed by
this repo.

The new `guacamole-desktop-path` IngressRoute + `desktop-guacamole-path-tls`
Certificate (added in e65de29) handle all Guacamole traffic going
forward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 01:14:25 -05:00
Andrew Stoltz
020a806d08 intranet: v202604240108gpu — point indexer at BLUEJAY-WS GPU + FilePatterns fix
Two-part fix on top of the live Shared.Indexing rollout:

1. Image bump v202604240050corpus -> v202604240108gpu, rebuilt from
   FlowerCore.Intranet.Web@feat/shared-indexing-search (HEAD includes
   the FilePatterns array-merge fix in IntranetSearchOptions). At
   runtime each DocCorpusRoot now sees ONLY the patterns explicitly
   set in appsettings.json — notes-md gets ["*.md"], notes-html gets
   ["*.html"], no accidental cross-bleed.

2. New IntranetSearch__OllamaBaseUrl env var pointing at
   http://10.0.56.20:11434 (BLUEJAY-WS GPU, R9700 32GB VRAM). Verified
   reachable from the cluster and nomic-embed-text:latest is pulled.
   This is the workaround for memory feedback_pi5_nomic_embed_slow:
   edge1 Pi 5 takes ~189s per 32-chunk batch, projecting full notes-md
   indexing (5665 chunks) at ~9 hours; the GPU should land it in minutes.
   Edge1 stays the chat default; this env var only redirects the
   indexer's bulk embedding calls.

Image distributed to all three RKE2 nodes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 01:08:55 -05:00
Andrew Stoltz
e65de2938b feat(ingress): single-host Guacamole via guacamole-namespace IngressRoute
Cluster Traefik disallows cross-namespace service refs from
IngressRoutes, so the PathPrefix(/guacamole) rule I added to
fc-desktop IngressRoute in 292528e failed with:

  "service guacamole/guacamole not in the parent resource namespace
  fc-desktop"

Move the /guacamole path match into the guacamole namespace where
the Service actually lives:

- apps/guacamole/guacamole.yaml adds a new `guacamole-desktop-path`
  IngressRoute matching `Host(desktop.iamworkin.lan) &&
  PathPrefix(/guacamole)` → guacamole:8080 (no add-prefix middleware;
  the browser already sends the /guacamole/* path that Guacamole's
  servlet serves at).
- New Certificate `desktop-guacamole-path-tls` for desktop.iamworkin.lan
  in the guacamole namespace, issued by step-ca-acme. Separate cert
  from fc-desktop's remotedesktop-web-tls because Secret refs are
  also scoped per-namespace; duplicating the cert is cheaper than
  enabling cross-namespace secret refs cluster-wide.
- Revert the cross-namespace attempt in apps/fc-desktop/fc-desktop.yaml
  back to a Host-only route. Traefik's router matching precedence
  (longer/more-specific rule wins) handles the /guacamole vs
  catch-all priority without explicit priority: fields.

Closes the single-host Guacamole URL regression Codex's branch
introduced — GuacamolePublicUrl=https://desktop.iamworkin.lan/guacamole
now resolves to the Guacamole webapp end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 01:07:12 -05:00
Andrew Stoltz
5c0c21790e ttsreader: bump image to v202604240101 (Kokoro voices merge fix) 2026-04-24 01:05:55 -05:00
Andrew Stoltz
292528ec15 feat(fc-desktop): add /guacamole PathPrefix route to IngressRoute
Single-host Guacamole routing — Traefik matches Host=desktop.iamworkin.lan
+ PathPrefix=/guacamole first (priority 20) and forwards to the
guacamole Service in the guacamole namespace on 8080. The existing
Host-only catch-all rule drops to priority 10 so Guacamole traffic
resolves to the more-specific match.

Mirrors the IngressRoute in FlowerCore.RemoteDesktop@master (merged
as part of codex/single-host-guacamole-wip). The RemoteDesktop repo
copy is deploy-ref only — ArgoCD owns the live IngressRoute via
this manifest. Without this change, GuacamolePublicUrl=
https://desktop.iamworkin.lan/guacamole returns 404 because Traefik
routes the whole Host to remotedesktop-web.

Unblocks the per-template AAT smoke against the new public URL
path + closes the final live piece of Codex's single-host routing
work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 01:03:34 -05:00
Andrew Stoltz
bb39a0c1fd ttsreader: bump image to v202604240053 + Kokoro env vars
Adds TtsReader__Kokoro__Enabled=true + BaseUrl=http://10.0.56.20:10401
+ TimeoutSeconds=120 so the pod routes kokoro-tagged voices to the
Kokoro-FastAPI backend running on BLUEJAY-WS. Multi-engine router
falls through to Piper for piper-tagged and untagged voices.

Requires nftables on BLUEJAY-WS to permit tcp/10401 from 10.0.56/23
and 10.42.0.0/16. Applied to the live ruleset — Puppet Hiera path is
the durable fix (kokoro_server_enabled under profile::security::firewall).

Tests 107 → 114 (+7 MultiEngineSpeechSynthesizerTests).
2026-04-24 00:57:41 -05:00
Andrew Stoltz
c23e903ba7 feat(monitoring): Grafana alert rules route RemoteDesktop to IRC
Companion to the Prometheus alert rules landed in e44e9a0. The
Prometheus rules were loading but never delivered — the monitoring
stack has no Alertmanager configured; **Grafana** owns alert
routing via its built-in engine + webhook contact point to
irc-notify.monitoring.svc:9119. Without a matching Grafana alert,
the Prometheus rules just show up in the Prometheus UI and page
no one.

Adds 6 Grafana alert rules in a new `RemoteDesktop` group under
the AI Stack Alerts folder:

- remotedesktop-web-down (3m) — probe_success{job="probe-remotedesktop"} < 1
- remotedesktop-metrics-stale (10m) — fc_desktop_session_events_total series absent
- remotedesktop-pool-depleted (5m) — fc_desktop_pool_depleted > 0
- remotedesktop-pool-deficit-sustained (10m info) — fc_desktop_pool_deficit > 0
- remotedesktop-session-churn-spike (5m info) — launch rate > 20/min
- remotedesktop-tls-expiry (6h critical) — cert < 2 days to expiry

Each uses the standard Grafana 3-stage pipeline (query → reduce →
threshold) matching the existing AI Stack + Infrastructure alert
patterns. Labels: service=remotedesktop + severity (warning/info/critical).
Default route is `IRC #alerts` via the existing webhook contact point.

Parity with the Prometheus rules (which already fire internally
for the Prometheus UI + any future Alertmanager integration).
Grafana restart picks up the new provisioning on next reload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:57:26 -05:00
Andrew Stoltz
cae03296f5 intranet: bake Notes corpus into image, drop init container
Cluster egress to github.com is fronted by a step-ca TLS proxy that
returns 404 page not found for unmatched routes — git clone of the
public FlowerCore.Notes repo failed inside the pod even with
GIT_SSL_NO_VERIFY=true. Rather than chase the egress NetworkPolicy /
proxy config, bake the docs corpus directly into the image at
/srv/flowercore-notes/docs.

The corpus is just *.md + *.html (369 files, 2.7 MB uncompressed) —
small enough that re-baking on every deploy is fine and avoids any
runtime network dependency.

Manifest changes:
- Image bump: v202604240040search -> v202604240050corpus
- Removed initContainers (clone-notes-corpus is now redundant)
- Removed notes-corpus emptyDir + its volumeMounts
- Vector-store PVC mount stays.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:50:00 -05:00
Andrew Stoltz
3c5c1a07bd fix(monitoring): netpol egress allows for fc-desktop + Traefik hairpin
Adds two egress allows to monitoring-netpol so Prometheus can scrape
FlowerCore.RemoteDesktop:

1. fc-desktop namespace on port 8080 — direct ClusterIP service
   target (remotedesktop-web.fc-desktop:8080).
2. traefik-system namespace pods on ports 8080 + 8443 — covers the
   Traefik VIP hairpin path for the `https://desktop.iamworkin.lan`
   scrape target (CoreDNS wildcard resolves iamworkin.lan hostnames
   to the LB VIP; after kube-proxy DNAT, egress needs the backend
   pod port allowed per feedback_netpol_dnat_backend_port).

Without these, the fc-remotedesktop scrape times out with "context
deadline exceeded" even though the monitoring-netpol already allows
the 10.0.56.0/24 CIDR — post-DNAT the destination is a 10.42.x.x
pod IP, not the VIP.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:47:50 -05:00
Andrew Stoltz
057595de3d intranet: GIT_SSL_NO_VERIFY=true in clone-notes-corpus init container
Cluster egress is fronted by a step-ca TLS proxy whose cert doesn't
match github.com. The init container's git clone failed with
"SSL: no alternative certificate subject name matches target hostname
'github.com'". The Notes repo is public — there is no secret to
protect on the wire — so GIT_SSL_NO_VERIFY=true is the right tradeoff
here. Tag at v202604240040search.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:46:20 -05:00
Andrew Stoltz
b02bb4be38 intranet: deploy v202604240040search with Notes corpus + vector store
Phase 3 lane 1 of FlowerCore.Shared.Indexing rollout — wires the new
search consumer in FlowerCore.Intranet.Web to live infrastructure.

Manifest changes:
- Image bump: localhost/fc-intranet-web:latest -> :v202604240040search.
  Built from FlowerCore.Intranet.Web@feat/shared-indexing-search and
  imported into all three RKE2 nodes (rke2-server, rke2-agent1, rke2-agent2)
  via ctr import. Both :latest and :v202604240040search tags are present.
- New PersistentVolumeClaim intranet-vector-store (1Gi, ReadWriteOnce,
  Longhorn) mounted at /data for the SQLite vector store
  (intranet-vectors.db).
- New emptyDir volume notes-corpus (1Gi sizeLimit) shared between the
  init container and main container, mounted at /srv/flowercore-notes
  (read-only in the main container).
- New init container clone-notes-corpus (alpine/git) that shallow-clones
  https://github.com/astoltz/FlowerCore.Notes.git
  (codex/notes-pimanager-live-drift) into /srv/flowercore-notes on every
  pod start. Re-clone is cheap (depth=1) and re-runs of git fetch +
  reset --hard are idempotent.
- Strategy switched to Recreate for the deployment, since the new RWO
  PVC blocks rolling updates — see CLAUDE.md memory "RWO PVC blocks K8s
  rolling updates".
- Resource bumps: memory 128Mi -> 256Mi req, 512Mi -> 1Gi limit; CPU
  500m -> 1000m limit. The DocsCorpusIndexer + Ollama HTTP calls add
  measurable load during the initial index build.
- initialDelaySeconds bumps on both probes (10s -> 30s liveness, 5s ->
  10s readiness) to account for startup-time Ollama probing and the
  slightly larger image.

The DocsCorpusIndexer waits 15s after host startup before its first
indexing pass, then loops every RescanInterval (default 1h). Its first
run will:
1. Embed all *.md under /srv/flowercore-notes/docs against
   nomic-embed-text on edge1 (10.0.57.17:11434).
2. Embed all *.html under /srv/flowercore-notes/docs/dashboards.
3. Persist chunks + embeddings to /data/intranet-vectors.db.

Verify after rollout:
- kubectl -n intranet logs deploy/intranet-web -c clone-notes-corpus
  (init container should show the docs/ listing).
- kubectl -n intranet logs deploy/intranet-web -f
  (DocsCorpusIndexer should log "Indexing docs root 'notes-md'..." then
  "Docs root 'notes-md' indexed: N files, M chunks, M stored").
- curl -sk https://intranet.iamworkin.lan/api/search/indexes
  -> ["notes-html","notes-md"]
- curl -sk 'https://intranet.iamworkin.lan/api/search?q=guacamole+single+host&topK=3'
  -> hits from docs/infrastructure/guacamole-customization-plan.md

Companion source on FlowerCore.Intranet.Web@feat/shared-indexing-search.
Depends on FlowerCore.Common@feat/shared-indexing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:42:03 -05:00
Andrew Stoltz
e44e9a0062 feat(monitoring): RemoteDesktop alerts + scrape jobs + dashboard mount
Three additions to the monitoring ConfigMap, each targeting
FlowerCore.RemoteDesktop:

- **Scrape jobs** (2 new):
  - probe-remotedesktop: blackbox http_2xx against
    https://desktop.iamworkin.lan/health every 30s. Feeds the
    RemoteDesktopWebDown alert.
  - fc-remotedesktop: direct /metrics scrape against
    desktop.iamworkin.lan for the fc_desktop_session_events_total
    and fc_desktop_pool_* series.

- **Alert group `remote-desktop`** (7 rules in alerts.yml):
  - RemoteDesktopWebDown (3m) — /health probe failing
  - RemoteDesktopMetricsStale (10m) — absent metrics series
  - RemoteDesktopPoolDepleted (5m) — pool deficit + depleted flag
  - RemoteDesktopPoolDeficitSustained (10m, info) — persistent
    below-desired pool size
  - RemoteDesktopSessionChurnSpike (5m, info) — launch rate
    >20/min
  - RemoteDesktopRecordingEventsDropped (15m, info) — 30m without
    recording events while launches active
  - RemoteDesktopTlsExpiry (6h, critical) — <2d cert renewal
    window; aligns with feedback_acme_expiry_alert_threshold

- **Grafana dashboard mount**: new volumeMounts + volumes entry for
  `dashboards-remotedesktop` backed by the grafana-dashboard-remotedesktop
  ConfigMap (previously added as a standalone file in d4210c8).
  Folder path /var/lib/grafana/dashboards/remotedesktop — picked up
  by the file-provider with foldersFromFilesStructure:true so the
  dashboard shows up in a "Remotedesktop" folder in Grafana.

No CRLF churn; pure 100-line insertion into LF-normalized file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:41:35 -05:00
Andrew Stoltz
297a2a9bbc ttsreader: bump image to v202604240023 for P3 one-click book render
New surfaces: POST /api/v1/bible/projects (one-click whole-book render),
GET /api/v1/bible/books, GET /api/v1/bible/books/{book}/preview, MCP
tools render_tts_reader_bible_book + list_tts_reader_bible_books,
Dashboard "Render a Bible book" card. 107/107 tests, +7 from previous.
2026-04-24 00:25:48 -05:00
Andrew Stoltz
d4210c819f feat(monitoring): RemoteDesktop Grafana dashboard ConfigMap
Wraps apps/monitoring/flowercore-remotedesktop-grafana-dashboard.json
as a ConfigMap manifest so ArgoCD syncs it into the cluster alongside
the existing grafana-dashboard-* ConfigMaps. Standalone file — does
NOT modify noc-monitoring.yaml. That keeps the CRLF churn on
noc-monitoring.yaml (sibling files apps/intranet/intranet.yaml and
apps/agent-zero/configmaps-bluejay.yaml also carry CRLF churn) out
of this commit.

Dashboard will be synced into the cluster but NOT loaded by Grafana
until a matching `volumes:` entry lands in the Grafana Deployment
in noc-monitoring.yaml:

    - name: dashboard-remotedesktop
      configMap:
        name: grafana-dashboard-remotedesktop

Plus a `volumeMounts:` entry in the grafana container:

    - name: dashboard-remotedesktop
      mountPath: /etc/grafana/provisioning/dashboards/remotedesktop
      readOnly: true

Those edits are deferred to the CRLF-normalization pass on
bluejay-infra so the review diff stays reviewable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:20:40 -05:00
Andrew Stoltz
fc0b67f670 ttsreader: bump image to v202604232334 for iTunes RSS + ID3 tags
Pulls in FlowerCore.TtsReader@9e2497f: P2.3 iTunes-namespace podcast
feed (author, summary, category, cover art, episode numbering,
duration, atom:self link, serial channel type for Bible projects) and
P2.4 ID3v2 tags on MP3 export + Vorbis comments on OGG (title, artist
with Piper voice humanized, album, track N/M, genre defaulting to
Religion & Spirituality for Bible or Audiobook for text sources,
date). Phones and podcast apps now show proper track info instead of
"Unknown - Unknown".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 23:37:53 -05:00
Andrew Stoltz
223e9a9232 feat(zabbix): add RemoteDesktop monitoring template
New Zabbix 7.2 template under `Templates/FlowerCore` that scrapes
the `/metrics` exposition from FlowerCore.RemoteDesktop and extracts:

- `fc_desktop_session_events_total` split by event (launch/connect/
  disconnect/recording), with a dedicated datapoint for the
  `browser_datasource="json"` slice to track delegated-auth launches.
- `fc_desktop_pool_ready` gauge sum for warm pools.

Trigger: `nodata(flowercore.remotedesktop.metrics,10m)=1` warns when
the public desktop host stops exposing metrics.

Follows the existing `flowercore-print-ollama.yaml` pattern — import
manually into Zabbix and link to the Print/Desktop host. Not a K8s
manifest; ArgoCD ignores.

Grafana dashboard JSON is drafted at
`apps/monitoring/flowercore-remotedesktop-grafana-dashboard.json`
but still needs a ConfigMap wrap + Grafana Deployment volume mount
in noc-monitoring.yaml before it ships (follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 23:30:32 -05:00
Andrew Stoltz
6c1375b21a ttsreader: bump image to v202604232310 for Media Session API + Bible lexicon
Pulls in FlowerCore.TtsReader@63e6b62: P1.1 Media Session API wiring in
fc-media-session.js + quick-player.js + rendered-chapter-player.js, and
P1.2 biblical-name pronunciation lexicon auto-seed on Bible-source
project creation plus apply-bible-defaults endpoint + MCP tool for
existing projects. Tests 81 -> 97 all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 23:15:53 -05:00
Andrew Stoltz
82529ed9b5 fix(guacamole): detect managed embeds from client URL 2026-04-23 21:02:05 -05:00
Andrew Stoltz
3ea8a56dab fix(guacamole): disable logout for managed embeds 2026-04-23 20:51:15 -05:00
Andrew Stoltz
9272abc225 Use absolute cluster DNS for ttsreader piper 2026-04-23 20:49:40 -05:00
Andrew Stoltz
436185818d fc-distribution: restrict public IngressRoute to GET+HEAD only
Live verification 2026-04-24 caught POST /blobs on dist.flowercore.io
returning 201 Created with the blob persisted — admin write operations
reachable on the public surface. Controller-level strict entitlement
was on, but that gates reads; writes weren't blocked at all.

Fix: add Method(GET) || Method(HEAD) to the Host match on the public
IngressRoute. POST/PUT/PATCH/DELETE now miss every route for
dist.flowercore.io and Traefik returns 404 before the pod sees the
request. Edge-level defense-in-depth on top of the controller's
strict-mode entitlement check.

The internal IngressRoute for dist.iamworkin.lan stays unrestricted —
admin POST /blobs + POST /manifests flows keep working from the lab.
2026-04-23 20:12:25 -05:00
Andrew Stoltz
c3cc404beb fc-distribution: add dist.flowercore.io public surface (Cloudflare A record + Origin Cert + profile-header middleware)
Lights up dist.flowercore.io end-to-end:
- cf-origin-flowercore-io Secret (literal *.flowercore.io Origin Cert,
  copied from the telephony/gitea-public/matrix/mail/flowercore/fc-landing
  pattern — not via OnePasswordItem yet).
- Traefik Middleware dist-public-profile-header: strips any caller-supplied
  X-FC-Distribution-Profile, injects 'public' so the controller's
  NamedEntitlementResolverRouter routes to the strict resolver.
- IngressRoute fc-distribution-public: Host(`dist.flowercore.io`) ->
  same backing Service as the internal dist.iamworkin.lan route.
  Middleware attached; cert secret cf-origin-flowercore-io.

Cloudflare DNS A record dist.flowercore.io -> 74.40.140.24 (proxied)
already created 2026-04-24 via Cloudflare API (record id
e9b957511556f37ff6763f4441acbc45).

Controller entitlement config is still DefaultAllow=false + empty
PublicEditions on the 'public' profile, so every public request
returns 403 by default. Populate FlowerCore__Distribution__EntitlementPublic__PublicEditions__0
via env var when ready to expose specific editions.
2026-04-23 20:10:29 -05:00
Andrew Stoltz
90627819cc fc-distribution: bump to v202604240010 (Phase 4 header-routing controller) 2026-04-23 19:23:35 -05:00
Andrew Stoltz
c97d486a3d feat(fc-segmentdisplay): switch tls certificate to dns01 2026-04-23 18:39:17 -05:00
Andrew Stoltz
209bdc16cd fc-distribution: bump to v202604232310 (Front D entitlement wired into ManifestsController) 2026-04-23 18:11:21 -05:00
Andrew Stoltz
3999634b06 Seed ttsreader piper voices before startup 2026-04-23 17:18:57 -05:00
Andrew Stoltz
61538d3712 fc-distribution: bump to v202604232212 (disable 30MB body size limit on POST /blobs) 2026-04-23 17:11:56 -05:00
Andrew Stoltz
ccaac367af fc-distribution: bump to v202604232206 (adds POST /blobs endpoint) 2026-04-23 17:07:13 -05:00
Andrew Stoltz
407d473b71 feat(infra): route dns preflight through flowercore dns 2026-04-23 17:03:22 -05:00
Andrew Stoltz
f9593e494a Allow ttsreader piper voice downloads 2026-04-23 16:50:21 -05:00
Andrew Stoltz
5b6c7b97fc feat(fc-distribution): bump image to v202604232145 — cert chain endpoint
- Serves GET /manifests/{edition}/{version}.cert (leaf+intermediate PEM)
- Adds CertChainPem migration on startup (nullable column)
- ManifestSignService now embeds version-specific certChainUrl

Provisioning Agent's verify step will flip from ChainNotServed (Phase 2A
soft-pass) to Valid once a fresh edition is published with this image.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 16:43:56 -05:00
Andrew Stoltz
a76eeb5c39 Add dedicated selectable piper for ttsreader 2026-04-23 16:37:03 -05:00
Andrew Stoltz
8a960ffc73 feat(fc-distribution): K8s manifest for Phase 1 edition publisher
Adds apps/fc-distribution/{fc-distribution.yaml,kustomization.yaml,README.md}.
Ships the FlowerCore.Distribution service (Blazor + REST + MCP) backed by
Synology NFS for SQLite catalog + content-addressed blob root.

Contents:
- Namespace fc-distribution
- 3x OnePasswordItem (FlowerCore Code Signing CA informational + per-edition
  signing keys for kiosk-standard and aistation-field)
- Deployment: localhost/fc-distribution:v202604232000 (already imported to
  rke2-server via ctr), pinned to rke2-server nodeSelector because Synology
  NFS ACL restricts writes to that node, emptyDir for /tmp + /app/logs,
  inline NFS for /data (subPath distribution/data) and /blobs (subPath
  distribution/blobs), Secret volume mounts for /signing/<edition>.
  readOnlyRootFilesystem + runAsUser 1654 + drop ALL capabilities.
  Probes: startup + readiness on /healthz, liveness on tcpSocket (defense
  against future auth middleware accidentally gating /healthz).
- Service (ClusterIP :80 -> container :8080)
- Certificate (cert-manager ClusterIssuer step-ca-acme, dist.iamworkin.lan,
  90d / 30d renew). pfSense Unbound override dist.iamworkin.lan ->
  10.0.56.200 already in place (req'd for HTTP-01).
- IngressRoute (Traefik websecure, Host rule on dist.iamworkin.lan)

Env var keys align with the scaffold:
  FlowerCore__Database__ConnectionStrings__Sqlite
  FlowerCore__Distribution__Blobs__Root
  FlowerCore__Distribution__Signing__EditionCerts__<slug>__{CertPath,KeyPath}

Consumer: ProvisioningAgent (USB-side, Phase 2) — see
FlowerCore.Notes/docs/infrastructure/usb-provisioning-architecture.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 15:59:50 -05:00
Andrew Stoltz
686dbacc66 Bump TTS Reader image for render follow-along 2026-04-23 15:54:07 -05:00
Andrew Stoltz
5ccf055465 check-pfsense-dns: add live-cluster scan
Extends the pre-merge DNS gate to (optionally) scan live-cluster
Certificates + IngressRoutes via kubectl. Closes the coverage hole
where a service's IngressRoute gets deployed from its own repo (not
from bluejay-infra/apps/) and the manifests-only scan misses it —
fc-retail/retail-web-tls stuck Issuing for 15h on a missing pfSense
Unbound override was exactly this class of bug.

Auto mode: if kubectl is on PATH and usable, live-scan runs silently.
--live  forces it (and errors out if kubectl can't reach the cluster).
--no-live skips live entirely (CI path with no cluster access).

Immediate live-scan finding on 2026-04-23: 10 orphan *.iamworkin.lan
IngressRoutes from failed e2e / codex / smoke / deleteproof test runs
in fc-php + fc-tenant-default (2026-04-16/17). None have DNS overrides
so their Certificates have been failing to issue for 7 days — the new
CertManagerCertificateNotReady alert will catch them too. Cleanup
(delete abandoned IngressRoutes + Certificates + CertificateRequests)
is a separate task; this check now surfaces them.
2026-04-23 15:51:19 -05:00
Andrew Stoltz
4da60820c6 Deploy TTS Reader queue presentation fix 2026-04-23 15:13:21 -05:00
Andrew Stoltz
1cc4324cfb Deploy TTS Reader import and preview fixes 2026-04-23 14:28:08 -05:00
Andrew Stoltz
bfc755057e fix(agent-zero): use streamable http for chat mcp 2026-04-23 13:54:06 -05:00
Andrew Stoltz
d6008ee205 fix(agent-zero): allow chat mcp pod port 2026-04-23 13:29:36 -05:00
Andrew Stoltz
39fe6f1dba fix(agent-zero): route chat mcp in-cluster 2026-04-23 13:26:10 -05:00
Andrew Stoltz
90fcf0cd5d fix(agent-zero): expose openai provider key 2026-04-23 13:21:12 -05:00
Andrew Stoltz
ffef5c9126 Deploy TTS Reader annotation timeout fix 2026-04-23 13:06:17 -05:00
Andrew Stoltz
634e90a9ee Deploy TTS Reader quick hardening release 2026-04-23 12:47:45 -05:00
Andrew Stoltz
86ccca18e3 Add Chat MCP server to Agent Zero 2026-04-23 12:41:58 -05:00
Andrew Stoltz
1c5caf3f40 Deploy TTS Reader v20260423114016 2026-04-23 11:57:39 -05:00
Andrew Stoltz
d3db19b0ca guacamole: enable json auth for remotedesktop sso 2026-04-23 11:27:30 -05:00
Andrew Stoltz
702a6e4f52 fix(agent-zero): use short DNS name to avoid CoreDNS template hijack
The full FQDN fc-llm-bridge.fc-llm-bridge.svc.cluster.local has 4 dots,
which is less than the pod's ndots:5 threshold. The resolver then
applies every entry in the search list BEFORE falling through to the
bare FQDN, and the CoreDNS 'template iamworkin.lan' catch-all matches
"...svc.cluster.local.iamworkin.lan" and returns Traefik VIP
10.0.56.200. The egress NetworkPolicy blocks that VIP (0.0.0.0/0
EXCEPT 10.0.0.0/8), so curl hangs for 30-134s and returns HTTP 000.

Reference: feedback_coredns_ndots_template_collision memory.

Fix: use "fc-llm-bridge.fc-llm-bridge.svc" (2 dots, still <5 so search
expansion still fires, but the first suffix "...svc.cluster.local"
hits the Kubernetes plugin in CoreDNS and returns the real ClusterIP
10.43.67.125 before the iamworkin.lan template is ever consulted).

Verified: pod-exec curl fc:cheap → HTTP 200 with a real chat.completion
envelope (Ollama/gemma3:4b via bridge).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:02:09 -05:00
Andrew Stoltz
6cbb5d8792 fix(agent-zero): NetworkPolicy egress rule for fc-llm-bridge (ADR-088)
The chat_model flip (62db15c) pointed Agent Zero at
fc-llm-bridge.fc-llm-bridge.svc.cluster.local:8080 but the existing
agent-zero-netpol only allowed egress to specific node IPs
(10.0.56.20:11434, 10.0.57.17:11434, 10.0.57.16:5200, 10.0.56.11:6443)
plus public-internet (with RFC1918 exclusion). ClusterIP traffic to
10.43.0.0/16 was implicitly denied, so pod-exec curl to the bridge
timed out after 134s.

Adds an egress rule allowing TCP 8080 to the fc-llm-bridge namespace
(matched by kubernetes.io/metadata.name which K8s 1.22+ sets
automatically). No ingress changes needed — fc-llm-bridge has no
NetworkPolicy, so the ingress side is already open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:59:17 -05:00
Andrew Stoltz
62db15c69c feat(agent-zero): route chat_model through fc-llm-bridge (ADR-088)
Flips Agent Zero's chat_model from direct local Ollama (gemma3:12b via
the 127.0.0.1:11434 sidecar proxy) to the FlowerCore LLM Bridge
(fc:balanced tier, OpenAI-compatible, Anthropic Claude Sonnet under the
hood) so chat turns are spend-tracked and can dispatch to any provider
via a single tier alias.

Scope is intentionally minimal and reversible:
  - chat_model: ollama/gemma3:12b/127.0.0.1:11434
              → openai/fc:balanced/fc-llm-bridge internal service URL
  - utility_model, embedding_model, browser_model: UNCHANGED
    (stay on local 127.0.0.1 Ollama sidecar — no spend, low latency,
    not worth routing through the bridge for small-model traffic).

Auth: new A0_SET_chat_model_api_key env var wired to the
fc-llm-bridge-api-keys Secret (field: agent-zero-k8s). The Secret is
synced by a new OnePasswordItem pointing at "FC LLM Bridge API Keys"
in the IAmWorkin vault. Bearer-token auth is now accepted by the
bridge (FlowerCore.LlmBridge@3225f1f).

Rollback: revert this commit; old image v202604231449 is still present
on all RKE2 nodes, and Agent Zero's strategy: Recreate makes the flip
atomic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:54:27 -05:00
Andrew Stoltz
84634f59f0 chore(fc-llm-bridge): bump image to v202604231520
Ships the Bearer-token auth fix (FlowerCore.LlmBridge@3225f1f) so Agent
Zero's OpenAI provider can authenticate with Authorization: Bearer in
addition to the original X-Api-Key header.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:51:57 -05:00
Andrew Stoltz
4cd5806fd0 fix(fc-llm-bridge): set dnsConfig ndots=2 to prevent CoreDNS wildcard hijack
Pods in this cluster inherit ndots=5. External FQDNs with <5 dots (like
api.anthropic.com) are expanded through the search path first, and the 4th
suffix `api.anthropic.com.iamworkin.lan` matches CoreDNS' `template IN A
iamworkin.lan` wildcard — resolves to Traefik VIP 10.0.56.200. TLS connect
lands on Traefik's default cert and the AnthropicClient rejects with
RemoteCertificateNameMismatch/RemoteCertificateChainErrors.

Setting ndots=2 makes the resolver try the bare FQDN first (3 dots in
api.anthropic.com), so the search path never fires.

Reference: memory feedback_coredns_ndots_template_collision. Wider follow-up:
the CoreDNS template plugin should add fallthrough for external public suffixes,
so every FC service calling external HTTPS APIs stops hitting this trap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:42:17 -05:00
Andrew Stoltz
11c48bef30 chore(fc-llm-bridge): bump to v202604231449 (Budget 1.0.1 multi-provider dispatcher)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:36:05 -05:00
Andrew Stoltz
a86e87050b fix(fc-llm-bridge): anthropic secret key is 'password' not 'credential'
The 1Password item "Claude API Key" stores the key in a standard Password
field (labeled `password`), so the OnePasswordItem operator creates the K8s
Secret with key `password`. Deployment was referencing `credential`, which
made the pod fail with CreateContainerConfigError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:29:32 -05:00
Andrew Stoltz
0214f94ac4 chore(fc-llm-bridge): bump image to v202604231424 (first live tag)
Built from FlowerCore.LlmBridge@6d285b5 (initial scaffold). Imported on all
three RKE2 nodes via podman save + ctr import. Replaces v00000000000000
placeholder — ArgoCD sync will roll the pod.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:28:05 -05:00
Andrew Stoltz
a1b8eb379d feat(fc-llm-bridge): stage ADR-088 manifests (not yet applied)
Staged but NOT applied. Do not git push until the two pre-requisites below
are done. See apps/fc-llm-bridge/README.md for the full order-of-ops.

Manifests (apps/fc-llm-bridge/fc-llm-bridge.yaml, 8 docs):
  - Namespace fc-llm-bridge
  - OnePasswordItem anthropic-api-key (existing Claude API Key item)
  - OnePasswordItem fc-llm-bridge-api-keys (NEW item, pending creation)
  - PersistentVolumeClaim fc-llm-bridge-data (2Gi longhorn)
  - Deployment fc-llm-bridge (port 8080, uid 1654, readOnlyRootFilesystem,
    tcpSocket probes to survive future ApiKeyAuthMiddleware reordering)
  - Service fc-llm-bridge ClusterIP
  - Certificate fc-llm-bridge-cert (step-ca-acme)
  - IngressRoute fc-llm-bridge (fc-llm-bridge.iamworkin.lan, websecure)

Pre-requisites BEFORE git push:
  1. pfSense Unbound override fc-llm-bridge.iamworkin.lan -> 10.0.56.200
     (currently NXDOMAIN -- verified via nslookup and check-pfsense-dns.py).
     Skipping this step puts cert-manager HTTP-01 into ~2h backoff.
  2. Create 1Password item `FC LLM Bridge API Keys` in vault IAmWorkin with
     password fields: agent-zero-ws, agent-zero-k8s, spare-1, spare-2.
  3. Build + import localhost/fc-llm-bridge:v<tag> to rke2-server +
     rke2-agent1 + rke2-agent2. Bump image tag from placeholder
     v00000000000000 before committing the apply.

Related: ADR-088 (FlowerCore.Notes/ARCHITECTURE.md), design doc at
FlowerCore.Notes/docs/ai-agents/agent-zero-anthropic-bridge.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 03:10:36 -05:00
Andrew Stoltz
9a1665907c fc-signalcontrol: align live port and selectors 2026-04-22 23:22:14 -05:00
Andrew Stoltz
899804215a statefulsets: align guacamole and matrix drift defaults 2026-04-22 23:11:47 -05:00
Andrew Stoltz
1dc66738e6 zabbix: align postgres tracking label 2026-04-22 22:50:24 -05:00
Andrew Stoltz
5623a272c5 zabbix: include statefulset defaults 2026-04-22 22:39:31 -05:00
Andrew Stoltz
3d3f91160b monitoring: add Print.Web Ollama Zabbix template 2026-04-22 22:07:40 -05:00
Andrew Stoltz
93f77c1844 fix(monitoring): use bluejay_v2 auth for snmp-nas (not public_v2)
Synology NAS is configured with community bluejay_monitor
(→ snmp.yml auth 'bluejay_v2'), not public. public_v2 was returning
HTTP 500 from snmp-exporter for this target. Verified bluejay_v2
returns metrics.

Keeps printer (10.0.58.107) on public_v2 — Epson ET-3750 uses
community "public" as documented in its SNMP settings.
2026-04-22 21:32:14 -05:00
Andrew Stoltz
59efc460fd fix(irc): use short name for unrealircd in anope + thelounge configs
Same CoreDNS iamworkin.lan template + ndots:5 hijack as the irc-notify fix.
Anope services (nickserv/chanserv/memo) have been disconnected from unrealircd
for weeks ("Host is unreachable" every 3s). Thelounge server defaults pointed
at the same broken FQDN.

Short name unrealircd.irc.svc resolves to the ClusterIP directly.
2026-04-22 21:23:38 -05:00
Andrew Stoltz
02959f1ac6 docs: deployment runbook + pfSense DNS pre-merge check
Adds a real README describing the 4-step deploy flow, with pfSense Unbound
host overrides as step 1 (the prerequisite that, if skipped, silently breaks
cert-manager HTTP-01 for ~2h per cert until manually diagnosed — root cause
of the 2026-04-22 cluster-wide cert outage).

Adds scripts/check-pfsense-dns.py: parses every apps/*/*.yaml, extracts
hostnames from Certificate.spec.dnsNames and Traefik IngressRoute
`Host(...)` match rules, and fails the check if any don't resolve via the
system DNS (pfSense Unbound on this LAN). Ignores IRC server-link labels,
image tags, comments — only checks hostnames cert-manager and Traefik will
actually use.

Run before `git push` or wire into pre-commit / Gitea Actions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 21:11:24 -05:00
Andrew Stoltz
a3aa84bdae fc-ttsreader: bump image to v20260422201135 (Quick Read highlight no-reflow fix)
Quick Read's active-sentence highlight was changing font-weight from
regular to semibold, which shifted glyph widths and reflowed the whole
paragraph mid-playback. New image drops the weight change and uses a
1px box-shadow ring instead for a stable layout.

Built from FlowerCore.TtsReader@e77d69d.
2026-04-22 20:20:30 -05:00
Andrew Stoltz
01cb9a557f fc-ttsreader: deploy fixed reader image 2026-04-22 16:13:15 -05:00
Andrew Stoltz
0fa46ad53b fc-ttsreader: deploy reader UI split image 2026-04-22 15:57:58 -05:00
Andrew Stoltz
1ded5a61c0 fc-segmentdisplay: add TLS ingress gitops stub 2026-04-22 15:55:54 -05:00
Andrew Stoltz
3c1d212251 fc-messageboard: deploy latest web image via gitops 2026-04-22 15:48:05 -05:00
Andrew Stoltz
c0547a9964 fc-signalcontrol: switch probes to tcpSocket — middleware blocks /health
The app's ApiKeyAuthenticationMiddleware runs BEFORE /health is mapped, so
unauthenticated probe requests get 404. tcpSocket probes verify the listener
is up without auth, which is correct for an internal K8s probe (kubelet
talks pod IP directly, not externally).

Real fix is in the app: move /health before the middleware or mark it
[AllowAnonymous]. Tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 15:21:04 -05:00
Andrew Stoltz
973c1dae72 fc-signalcontrol: fix probe path /metrics/prometheus -> /health
The app exposes /health (Program.cs:91 maps a Healthy text response) but does
NOT expose /metrics/prometheus. K8s liveness/readiness probes against a 404
endpoint kept the pod in CrashLoopBackOff after PVC mount was added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 15:15:07 -05:00
Andrew Stoltz
475737b36f fc-signalcontrol: add PVC + volumeMount for SQLite data dir
Live cluster had a Longhorn PVC `signalcontrol-data` mounted at /app/data
since 2026-04-14, but the bluejay-infra git manifest never declared it. As a
result, when ArgoCD recreated the Deployment from git (after deletion to fix
an unrelated selector-label mismatch caught during cert-manager recovery),
the new pod started without /app/data and crashed with `SQLite Error 14:
unable to open database file 'data/signalcontrol.db'`.

Bring git in line with reality: declare the PVC, mount it, and switch the
Deployment to `strategy.type: Recreate` (RWO PVC blocks rolling updates per
existing memory feedback_k8s_rwo_rollout.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 15:10:10 -05:00
Andrew Stoltz
3bb3801fbd fix(monitoring): use short service name for irc-notify IRC_HOST
CoreDNS iamworkin.lan template + ndots:5 was hijacking
unrealircd.irc.svc.cluster.local lookups → Traefik VIP → timeout.
Every alert since ~2026-04-09 silently failed with "IRC send failed: timed out",
which also killed the thermal-printer path (routed through irc-notify).

Same fix pattern as guacamole@28b7600.
2026-04-22 09:55:23 -05:00
Andrew Stoltz
28b76001a8 fix(guacamole): use short service name for GUAC_URL (CoreDNS template collision)
The guac-k8s-sync CronJob has been crash-looping (exit 7) since the
2026-04-11 run. Root cause: CoreDNS has an `*.iamworkin.lan`
template wildcard, and the Kubernetes pod resolv.conf ships with
`ndots:5` plus a search list that includes `iamworkin.lan`.

Resolving `guacamole.guacamole.svc.cluster.local` (4 dots < 5) goes
through search-suffix expansion BEFORE the bare FQDN. The iamworkin.lan
suffix makes it `guacamole.guacamole.svc.cluster.local.iamworkin.lan`,
which matches the template and answers with Traefik LB VIP
10.0.56.200. That VIP has no pod-network hairpin route, so curl exits
with 'No route to host'.

Using the short name `http://guacamole:8080` keeps the query at 0
dots, search expansion runs on the bare name, and the in-namespace
`guacamole.svc.cluster.local` suffix hits the Kubernetes CoreDNS
plugin directly (ClusterIP 10.43.229.31).

Alt fixes considered but not taken: trim the CoreDNS template regex
to exclude `.svc.cluster.local.` prefixes (cross-cutting, higher
blast radius); trailing-dot FQDN in the URL (curl/Java HTTP clients
handle inconsistently).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:52:53 -05:00
Andrew Stoltz
0c67fa5356 asterisk: add *832 test-entry dialplan for VDAY workflow AATs
Lets live SIP AATs (ext 901–904, from-internal context) dial *832 to
exercise the Victory Day workflow + Fun Menu + AsteriskGameHandler path
without routing through Twilio. Mnemonic: *832 = V-D-A (8-3-2) from the
V-D-A-Y keypad pattern.

Maps to Stasis(flowercore-pbx,inbound-pstn,+15074618329) — same call-
type classification as a real Twilio-inbound call to the VDAY DID, so
InboundPstnHandler routes to the seeded VDAY workflow identically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:51:49 -05:00
Andrew Stoltz
62e342cfb2 guacamole: consolidate nodeSelector — use rke2-server for guacd too
Previous commit 90deacd raced with the user's f0733ff (which had
already pinned the guacamole web Deployment to rke2-server for the
NFS ACL). That left two nodeSelector blocks on the web pod and an
inconsistent agent2 pin on guacd. Align both pods to rke2-server.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:36:25 -05:00
Andrew Stoltz
90deacd154 guacamole: pin guacd + web to rke2-agent2 for NFS recordings mount
Synology NFS export at /volume1/kubernetes currently grants mount
permission only to 10.0.56.13 (rke2-agent2). rke2-agent1 gets
"access denied by server". guacd + guacamole web both need the
recordings volume, so co-locating is also efficient. Remove the
nodeSelector once the Synology NFS ACL opens to all cluster nodes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:35:13 -05:00
Andrew Stoltz
f0733ff89d feat(guacamole): wire 1Password vault extension + logback into deployment
Adds the 1Password vault JAR to the Guacamole pod so connection params
like ${OP:ItemTitle/fieldLabel} are resolved from 1Password Connect at
tunnel-open time. Credentials never land in MySQL — only token literals.

Deployment changes:
- env: OP_CONNECT_URL=http://10.0.56.10:8180, OP_VAULT_ID=..., plus
  OP_CONNECT_TOKEN from secret/guacamole-1password-token/credential.
- env: ENABLE_ENVIRONMENT_PROPERTIES=true so OP_* env vars render as
  op-connect-url / op-connect-token / op-vault-id properties the
  extension reads.
- volumeMount for guacamole-vault-jar at
  /etc/guacamole/extensions/guacamole-vault-1password-1.0.0.jar
- volumeMount for guacamole-logback so we see DEBUG token-inject lines.
- nodeSelector kubernetes.io/hostname=rke2-server — the Synology NFS
  export for /volume1/kubernetes currently only allows rke2-server.
  Followup: add rke2-agent1/2 to the export and remove this selector.

New ConfigMaps:
- guacamole-vault-jar (binaryData, ~312KB JAR, Gson shaded, built from
  FlowerCore.Notes/k8s/guacamole/extensions/1password-vault via mvn).
- guacamole-logback with DEBUG on io.flowercore.guacamole.vault — drop
  to INFO once resolution is proven stable.

Existing guacamole-properties: added onepassword-vault to extension-priority.

The guacamole-1password-token Secret is NOT in git — it holds a verbatim
copy of the onepassword-connect-operator bearer token. Followup task:
provision a scoped Connect token for Guacamole and rotate the copy out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:32:51 -05:00
Andrew Stoltz
313bdcb21a guacamole: NFS subPath — Synology exports /volume1/kubernetes root only
First pass used nfs.path=/volume1/kubernetes/guacamole/recordings,
which triggered "mount.nfs: access denied by server" on rke2-agent1.
Synology NFS export is scoped to /volume1/kubernetes; match the
working fc-desktop pattern: mount the export root and select the
subdirectory via volumeMount.subPath.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:23:49 -05:00
Andrew Stoltz
5f4818bd96 guacamole: wire session recording to Synology NFS
Phase 5 of docs/infrastructure/guacamole-customization-plan.md:

- Mount /volume1/kubernetes/guacamole/recordings (Synology 10.0.58.3)
  into both guacd (writer) and guacamole web (reader) at
  /var/lib/guacamole/recordings
- Set RECORDING_SEARCH_PATH env on guacamole web -- the Guacamole
  Docker entrypoint treats any RECORDING_* var as an enable signal
  for the history-recording-storage extension (symlinks the JAR
  from /opt/guacamole/environment/RECORDING_/extensions/ into
  GUACAMOLE_HOME/extensions/)

Per-connection recording still requires setting recording-path on
each connection in MySQL -- follow-up task. This commit enables
the plumbing; no sessions record yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:15:55 -05:00
Andrew Stoltz
fff998dab5 matrix, zabbix: add volumeMode to postgres PVC templates
Same ArgoCD + SSA self-heal loop pattern as guacamole (20e4130):
K8s defaults volumeMode=Filesystem on volumeClaimTemplates at
creation, git omits it, argocd-controller owns the atomic list so
every reconcile sees drift, and volumeClaimTemplates is immutable
so it can never reconcile. Adding the field closes both loops.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 14:48:43 -05:00
Andrew Stoltz
20e4130c74 guacamole: add volumeMode to guac-mysql PVC template
Closes the infra-guacamole OutOfSync sync loop. K8s API sets
volumeMode=Filesystem as a default on volumeClaimTemplates at creation,
but the git manifest omitted it. ArgoCD uses ServerSideApply with
atomic ownership of volumeClaimTemplates, so every sync saw a
desired/live mismatch on that one field. volumeClaimTemplates is
immutable after creation so ArgoCD could never reconcile it --
autoHealAttemptsCount climbed to 6091. Adding the field to git
matches live and breaks the loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 14:29:40 -05:00
Andrew Stoltz
3cf675b8c3 ttsreader: wire operator secrets through 1password 2026-04-17 10:05:24 -05:00
Andrew Stoltz
2a9f2e4540 Improve TTS Reader workspace card layout 2026-04-17 03:57:23 -05:00
Andrew Stoltz
b15a35a258 Fix TTS Reader character layout 2026-04-17 03:48:03 -05:00
Andrew Stoltz
3f4985ee13 Deploy TTS Reader queue feedback fix 2026-04-17 03:34:28 -05:00
Andrew Stoltz
e535a8d34b Deploy TTS Reader voice preview update 2026-04-17 02:13:09 -05:00
Andrew Stoltz
6ddbd2cae5 Point TTS Reader at Pi Ollama defaults 2026-04-17 00:53:45 -05:00
Andrew Stoltz
e9608651f7 Bump TTS Reader image to v20260417001119 2026-04-17 00:33:29 -05:00
Andrew Stoltz
abdb7a806e Bump TTS Reader image to v20260416234817 2026-04-16 23:53:42 -05:00
Andrew Stoltz
7afb5043c4 Fix ttsreader forwarded header handling 2026-04-16 21:55:46 -05:00
Andrew Stoltz
1883953cb8 Enable live Piper and ffmpeg for fc-ttsreader 2026-04-16 21:18:43 -05:00
Andrew Stoltz
9c555db083 telephony: bump web image to v202604170153 2026-04-16 20:56:30 -05:00
Andrew Stoltz
cb349c6764 Configure TtsReader Bible corpus path 2026-04-16 20:44:23 -05:00
Andrew Stoltz
3888c4c3e0 Align fc-ttsreader with hardened runtime 2026-04-16 20:06:53 -05:00
Andrew Stoltz
7aec403e96 Pin telephony-web v202604170059 2026-04-16 20:03:01 -05:00
Andrew Stoltz
5685ab0550 Improve The Lounge MOTD contrast 2026-04-16 19:49:53 -05:00
Andrew Stoltz
d4d3455ef2 Style The Lounge as FlowerCore IRC 2026-04-16 19:45:52 -05:00
Andrew Stoltz
29d557003f fix: deploy responsive telephony debug menu 2026-04-16 19:45:49 -05:00
Andrew Stoltz
719aa8c1c6 fix: align desk phone dtmf mode with yealink provisioning 2026-04-16 19:36:37 -05:00
Andrew Stoltz
63cf5193ef Use recreate strategy for UnrealIRCd 2026-04-16 19:33:28 -05:00
Andrew Stoltz
ef0e1f2505 fix: update telephony web image tag 2026-04-16 19:30:36 -05:00
Andrew Stoltz
f8eb946704 Add IRC MOTD file 2026-04-16 19:29:43 -05:00
Andrew Stoltz
929449c55c apps: fc-chat refactor + fc-menuboard app split
- fc-chat.yaml: TLS/IngressRoute only (Deployment managed by deploy script, matches fc-signage/fc-mysql/fc-kiosk pattern)
- fc-menuboard: new app bundle

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 19:25:25 -05:00
Andrew Stoltz
9d0da584af Add The Lounge web IRC client 2026-04-16 19:10:10 -05:00
Andrew Stoltz
4f33d7a053 fix(telephony): chown /shared-tts in initContainer + harden security context
Two follow-ups to the Piper TTS wire-up landed in d3ffad9:

1. Telephony-web runs as uid 1654 (non-root), but the hostPath at
   /tmp/tts-audio is owned by root:root 0755. Pod couldn't write .sln16
   files — every Piper call would succeed at the HTTP layer and then
   fall back to the sound map when File.WriteAllBytesAsync threw
   "Permission denied." Extend the existing fix-data-perms initContainer
   to chown the shared-tts mount too (0755 world-readable, so the
   Asterisk pod — running as a different uid — can still read).

2. Pod security context now explicitly sets runAsNonRoot: true + runAsUser
   1654 + runAsGroup 1654 (cluster policy), matching the pattern used
   by every other FlowerCore service.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 16:29:21 -05:00
Andrew Stoltz
d3ffad9190 fix(telephony): PiperUrl 10.0.57.15 → .17 + shared-tts hostPath for TTS playback
Piper was never reachable on 10.0.57.15 — edge1's actual address is
10.0.57.17 (SSH config, project_edge1_sdcard memory). Every telephony
prompt hit the 8s HttpClient timeout and fell back to the built-in sound
map (vm-advopts, vm-goodbye, beep) instead of speaking the real workflow
text. Verified from noc1: `curl http://10.0.57.17:8500/health` returns
HTTP 200 in 6ms, `POST /tts` returns a 16kHz mono WAV in 606ms.

Changes:

- apps/telephony/telephony.yaml
  - `Tts.PiperUrl` → `http://10.0.57.17:8500`
  - NetworkPolicy egress allow → `10.0.57.17/32:8500`
  - Header comment now documents the POST /tts {"text":"..."} contract
  - telephony-web pod mounts `/shared-tts` from hostPath `/tmp/tts-audio`
    (rke2-agent1). This is where `AsteriskProvider.SpeakTextAsync` writes
    the synthesized .sln16 before calling ARI `Play sound:tts/<name>`.

- apps/asterisk/deployment.yaml
  - Asterisk pod mounts the same hostPath at
    `/var/lib/asterisk/sounds/tts` so it can read and play what
    telephony-web wrote. Both deployments have
    `nodeSelector: kubernetes.io/hostname: rke2-agent1` so the hostPath
    is guaranteed to be the same directory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 16:19:48 -05:00
Andrew Stoltz
403d061664 fix(asterisk): hostAlias downloads.asterisk.org so sounds actually download
CoreDNS wildcard for iamworkin.lan catches unresolved names and returns
the Traefik VIP (10.0.56.200), so downloads.asterisk.org from inside a
pod returns 404 from Traefik rather than the real Sangoma mirror. Pin
the real IP (165.22.184.19 = oss-downloads.sangoma.com) via hostAliases
so curl reaches the actual server.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 15:54:39 -05:00
Andrew Stoltz
45a2cb3f93 fix(asterisk): curl -k for sounds download — cluster TLS MITM
Cluster egress goes through a step-ca-fronted TLS proxy that install-sounds
doesn't trust ("SSL certificate problem: self-signed certificate"). The
Asterisk core sounds tarball is a public artifact; integrity is enforced
downstream when Asterisk plays the file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 15:48:34 -05:00
Andrew Stoltz
e1922564ae fix(asterisk): actually install core sounds (en, ulaw 1.6.1)
The install-sounds init container was a stub that left /var/lib/asterisk/sounds/en
empty. Result: every SpeakText fallback path (vm-advopts, vm-goodbye, characters:*,
digits/*, beep, pbx-invalid) resolved to a missing file, Asterisk silently failed
each Playback, zero RTP was produced, and callers heard dead air. This is why
dialing *0 (Settings Menu) or *100 (Debug IVR) "picks up quietly" — there is
literally nothing to stream.

Replaced the stub with alpine:3.20 + curl + tar that downloads the pinned
asterisk-core-sounds-en-ulaw-1.6.1.tar.gz (~10 MB) from downloads.asterisk.org
and unpacks it into the sounds emptyDir. Idempotent — skips download if
vm-goodbye is already present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 15:42:58 -05:00
Andrew Stoltz
7762a0079a Add K8s deployment manifests for SignalControl, MessageBoard, Chat, TTS Reader
Full deployment manifests (Namespace, Deployment, Service, Certificate,
IngressRoute) for 4 new FlowerCore services with port 8080, ClusterIP
on port 80, cert-manager step-ca-acme TLS, and /metrics/prometheus
health probes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 20:23:55 -05:00
Andrew Stoltz
ab7435a43a Update Agent Zero, Asterisk, and Telephony K8s manifests
- Update agent-zero deployment configuration
- Update Asterisk configmap and deployment
- Update telephony service manifest

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 19:12:08 -05:00
Andrew Stoltz
53234bfcc8 Fix K8s sync script: use grep instead of python3
bitnami/kubectl image doesn't have python3. Replaced all python3
JSON parsing with grep/cut for auth token and connection data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 23:02:02 -05:00
Andrew Stoltz
cf572c167f Update Guacamole: branding JAR, K8s sync CronJob
- Updated bluejay-branding-1.0.0.jar with gold accents, hover fix,
  icon fix, pinstripe patterns, Blue Jay SVG logo
- Added guac-k8s-sync CronJob: runs every 2min, auto-updates pod
  names in Kubernetes exec connections when pods restart
- Fixed secret reference (guacamole-credentials, not guacamole-db-credentials)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 22:49:48 -05:00
Andrew Stoltz
7d5d0f86e7 Add signage service ingress manifests 2026-04-09 15:09:08 -05:00
Andrew Stoltz
8f59322329 Add step-ca TLS certs for mysql, php, desktop, signage, fc-landing
RKE2 Traefik has no ACME certResolver configured, so IngressRoutes
using certResolver: step-ca silently fall back to the Traefik default
self-signed cert. Fix by using cert-manager Certificate resources with
the step-ca-acme ClusterIssuer and tls.secretName in IngressRoutes.

- fc-landing: Add Certificate, change tls: {} to tls.secretName
- fc-mysql: New app (Certificate + IngressRoute only)
- fc-php: New app (Certificate + IngressRoute only)
- fc-desktop: New app (Certificate + IngressRoute only)
- fc-signage: New app (Certificate + IngressRoute, plus HTTP route for players)

Deployments/Services for mysql/php/desktop/signage are managed by
deploy scripts, not ArgoCD. These apps only manage TLS + ingress.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 18:20:23 -05:00
Claude Code
8f8290e0da Increase ctx to 8192 (system prompt + 21 tools need >2048) 2026-04-08 20:07:27 +00:00
Claude Code
607192aaec Reduce ctx to 2048 for Pi 5 CPU speed 2026-04-08 19:40:52 +00:00
Claude Code
072d64a5e9 Fix model config: write config.json not config.yaml (plugin reads JSON) 2026-04-08 19:22:16 +00:00
Claude Code
acb19bee9c Write Ollama model config before initialize.sh (fix OpenRouter default) 2026-04-08 19:15:43 +00:00
Claude Code
e6fbe2d22b Mount extensions+theme directly in main container (symlinks lost by initialize.sh) 2026-04-08 18:12:07 +00:00
Claude Code
dbd6769537 Reference split tools ConfigMaps (tools-a/b/c) in init container 2026-04-08 18:09:55 +00:00
Claude Code
0af47f893a Split bluejay-tools into 3 ConfigMaps (K8s 262K annotation limit) 2026-04-08 18:09:49 +00:00
Claude Code
d16f72f089 Enable Blue Jay profile: init container, ConfigMap volumes, tools, extensions, theme 2026-04-08 18:07:13 +00:00
Claude Code
36e7369609 Add Blue Jay profile ConfigMaps (21 tools, prompts, extensions, theme) 2026-04-08 18:07:06 +00:00
Claude Code
3e5c017c4e Add agent-zero egress to monitoring NetworkPolicy, fix probe target to use K8s svc DNS 2026-04-08 17:36:23 +00:00
Claude Code
67e41febf5 Add agent-zero egress to monitoring NetworkPolicy for blackbox probes 2026-04-08 17:34:16 +00:00
Claude Code
c9f07108bd Fix edge1 Ollama IP (.15->.17), add monitoring ingress, add init container 2026-04-08 17:30:22 +00:00
Claude Code
f3919cf728 Add cert-manager Certificate for intranet ACME TLS auto-renewal 2026-04-05 08:47:42 -05:00
Claude Code
56442ecfbc Replace nginx+ConfigMap intranet with Blazor Server app
Replaces the 188KB ConfigMap-embedded HTML with a proper Blazor Server
deployment (fc-intranet-web:latest on port 5300). The old nginx deployment,
ConfigMaps (intranet-html, intranet-nginx-conf), and all embedded HTML are
removed. The intranet is now a .NET 10 Blazor app with live health monitoring,
REST API, 49 pages, and the unified Blue Jay theme.

Source: github.com/astoltz/FlowerCore.Intranet.Web
2026-04-04 19:29:28 -05:00
Andrew Stoltz
a07b6311b9 Add Blue Jay branding, kubectl-proxy, RBAC, and properties to Guacamole
- guacamole-branding ConfigMap with Blue Jay dark theme CSS
- guacamole-properties ConfigMap with ban/TOTP/session config
- kubectl-proxy sidecar on guacd for K8s pod exec connections
- guacd-exec ServiceAccount + ClusterRole/Binding for pod exec RBAC
- Volume mounts for branding JAR and properties on guacamole webapp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 14:22:51 -05:00
Claude Code
331ae14d3f Update intranet: fcadmin links, Guacamole connections, 1Password deep-links 2026-04-03 13:40:09 -05:00
Claude Code
b291d0360b Update intranet HTML — deep cleanup 2026-03-28
- OpenVPN 9 servers, WiFi portal, Signage+RemoteDesktop on K8s
- Print.Web HTTPS via noc-proxy, 530 tests, 21 pages, 15 MCP
- Monitoring: 36 scrape jobs, 25 alert rules, 12 Grafana dashboards
- Remove BlueJay-Employee SSID (factory reset), fix WiFi to 4 SSIDs
- Fix Guacamole URL (guac -> guacamole), noc1 SSH typo, pfSense WAN igc3
- Add Signage, RemoteDesktop, WiFi Portal to DNS/service tables
- Update ArgoCD 22 apps, 41 namespaces, 49 IngressRoutes, Traefik v3.6.10
- IRC Anope marked CrashLoopBackOff, monitoring moved to K8s
- Total: 21,437+ tests across 13 services
2026-03-28 14:34:25 -05:00
Andrew M. Stoltz
090b29933f telephony-web v20260325d: global search, error pages, quick-create wizard 2026-03-25 17:58:56 -05:00
Andrew M. Stoltz
987b73c537 telephony-web v20260325c: workflow config validation, enhanced health checks, response compression, Serilog request logging 2026-03-25 17:47:27 -05:00
Andrew M. Stoltz
bf12474de9 telephony-web v20260325b: add SMS UnreadCount/LastMessagePreview columns to schema drift 2026-03-25 08:19:58 -05:00
Andrew M. Stoltz
f366dd5c90 telephony-web v20260325a: fix billing/RBAC 500s — replace IDbContextFactory with direct TelephonyDbContext injection 2026-03-25 08:11:59 -05:00
Andrew M. Stoltz
50146f8355 telephony-web v20260324n: rebuild-schema admin endpoint for production DB migration 2026-03-24 19:45:06 -05:00
Andrew M. Stoltz
ace06c5fb9 telephony-web v20260324m: model-driven schema drift — auto-creates ALL missing tables 2026-03-24 19:28:08 -05:00
Andrew M. Stoltz
7ed834f056 telephony-web v20260324l: schema drift fix for CustomRoles table 2026-03-24 19:03:26 -05:00
Andrew M. Stoltz
2b04c9e292 telephony-web v20260324k: RBAC policy editor, billing dashboard, 11081 tests ALL PASS 2026-03-24 18:55:03 -05:00
Andrew M. Stoltz
fafc2e510b telephony-web v20260324j: recording playback, SMS enhancements, notifications polish, dashboard shortcuts, all 11049 tests pass 2026-03-24 18:22:46 -05:00
Andrew M. Stoltz
fb1c622e62 telephony-web v20260324i: break-glass UI, 5 MCP tools, survey editor config, step palette 2026-03-24 17:37:19 -05:00
Andrew M. Stoltz
40cb7faef5 telephony-web v20260324h: setup wizard, REST smoke tests, survey route fix 2026-03-24 17:16:09 -05:00
Andrew M. Stoltz
bd79279b28 telephony-web v20260324g: schema drift fix (BridgeEvents, SurveyResponses tables), survey route fix 2026-03-24 16:53:21 -05:00
Andrew M. Stoltz
35b6b4f8e5 telephony-web v20260324f: remove Scalar/OpenApi packages (Swashbuckle conflict) 2026-03-24 16:06:11 -05:00
Andrew M. Stoltz
8d8b76c82b Fix telephony-web: revert Scalar (Swashbuckle conflict), use v20260324e 2026-03-24 16:02:32 -05:00
124 changed files with 46381 additions and 1572 deletions

7
.gitignore vendored Normal file
View File

@@ -0,0 +1,7 @@
# .NET build outputs (lint test project)
**/bin/
**/obj/
# Editor / temp
.DS_Store
*.swp

129
README.md
View File

@@ -1,3 +1,126 @@
# bluejay-infra
Infrastructure manifests for ArgoCD
# bluejay-infra
Infrastructure manifests for ArgoCD. An `ApplicationSet` in `argocd` namespace watches the `apps/*` directories in this repo and creates one `Application` per subdir (prefixed `infra-<name>`).
## Adding a new service to the cluster
Follow these steps in order. **Step 1 must run before step 3** — if you skip it, cert-manager HTTP-01 will silently fail for ~2h per cert (exponential backoff) until someone diagnoses the DNS.
### 1. Create or verify the FlowerCore.DNS A record (REQUIRED for current HTTP-01 manifests)
step-ca (the ACME CA on noc1) runs in a Podman container with host networking. Its container resolver uses pfSense Unbound (10.0.56.1), **not** cluster CoreDNS. So even though CoreDNS has a wildcard `*.iamworkin.lan → 10.0.56.200` for in-cluster lookups, step-ca cannot see it. Every new public hostname needs an explicit pfSense host override.
The management path is now `FlowerCore.DNS`, not `FlowerCore.Notes/scripts/pfsense-add-dns-overrides.py`. Add or verify the public A record there before you apply the manifest:
```bash
curl -sk https://dns.iamworkin.lan/api/v1/servers
# Find the pfSense serverId, then create the record using the host label only.
# Example: for foo.iamworkin.lan, use "name":"foo".
curl -sk -X POST https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iamworkin.lan/records \
-H "Content-Type: application/json" \
-d '{"name":"<yourservice>","type":"A","data":"10.0.56.200","ttl":300}'
```
Verify all referenced iamworkin.lan hosts resolve (run from anywhere on LAN):
```bash
python scripts/check-pfsense-dns.py
# Historical filename retained. The script now calls
# https://dns.iamworkin.lan/api/v1/zones/iamworkin.lan/resolve-preflight
# for every Certificate dnsName and Traefik Host(...) rule it finds.
python scripts/check-pfsense-dns.py --live
# Optional stronger pass when kubectl access is available; also checks
# live-cluster Certificates and IngressRoutes for drift outside manifests.
```
**Symptom if you skip this:** the Certificate resource stays `Ready: False` with `status.reason: unexpected non-ACME API error: context deadline exceeded`. Recovery requires `kubectl -n <ns> delete order <order-name>` after adding the DNS to bypass cert-manager's backoff.
### 2. Create the app manifest
Create `apps/<name>/<name>.yaml` containing the Namespace, Deployment, Service, Certificate, and IngressRoute. Reference an existing directory (e.g. `apps/fc-messageboard/`) for the canonical shape.
Conventions:
- `Namespace` has label `app.kubernetes.io/part-of: bluejay-infra`
- `Deployment.spec.selector.matchLabels` and `Service.spec.selector` MUST use the same label key. The historical convention here is `app: <name>` (not `app.kubernetes.io/name`) — don't mix.
- Image: `localhost/<name>:v<YYYYMMDD><HHMM>`, `imagePullPolicy: Never`. Import the image to every RKE2 node (server + both agents) via `ctr images import` before applying — pods schedule anywhere.
- If the app persists local state (SQLite, uploads), declare the `PersistentVolumeClaim` here with `storageClassName: longhorn` and `accessModes: [ReadWriteOnce]`. Add `strategy.type: Recreate` to the Deployment — RWO PVC blocks rolling updates.
- Probes: use `tcpSocket` if the app has middleware that intercepts unauth requests (returns 404/401 for `/health`). Otherwise prefer `httpGet` against whatever the app exposes (verify the path isn't gated by auth).
- Certificate: `issuerRef.name: step-ca-acme`, `issuerRef.kind: ClusterIssuer`. `dnsNames` must match the hostname you created in FlowerCore.DNS in step 1.
### 3. Commit & push
```bash
git add apps/<name>/
git commit -m "<name>: initial deployment"
git push
```
ArgoCD's `ApplicationSet` picks up the new directory within ~3 minutes and creates `infra-<name>` with auto-sync + self-heal enabled.
### 4. Verify
```bash
# From noc1
fcadmin_ssh noc1 '
kubectl -n argocd get application infra-<name>
kubectl -n <ns> get certificate,pod
curl -sk -m 8 -o /dev/null -w "HTTP %{http_code}\n" https://<name>.iamworkin.lan/
'
```
Certificate should be `Ready: True` within ~60s. If it stalls `False` for >2m, the pfSense DNS step got skipped — go back to step 1, then `kubectl -n <ns> delete order <order-name>` to bust the backoff.
### Pre-merge gate
Before `git push`, always run:
```bash
python scripts/check-pfsense-dns.py
```
It's a quick service-backed check that would have caught the entire 2026-04-22 cert-manager outage. Consider wiring it into a pre-commit hook or a Gitea Actions workflow.
## Retiring a service
1. `kubectl -n argocd delete application infra-<name>` (cascade deletes the K8s resources via ArgoCD finalizers)
2. `git rm -r apps/<name>/` and push
3. Remove the FlowerCore.DNS record through the UI or API, for example:
```bash
curl -sk https://dns.iamworkin.lan/api/v1/servers
curl -sk -X DELETE https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iamworkin.lan/records/<yourservice>
```
## Known gotchas
- **CoreDNS template + ndots:5 collision**: inside pods, `<svc>.<ns>.svc.cluster.local` with <5 dots gets search-expanded through `iamworkin.lan` FIRST and hits the wildcard template → resolves to Traefik VIP, not the real ClusterIP. Use short service names (`<svc>`) in K8s manifests. See memory `feedback_coredns_ndots_template_collision.md`.
- **Image not on node**: pods stuck `ErrImageNeverPull` means the image wasn't imported to the node Kubernetes scheduled the pod onto. `ctr images import` on all of rke2-server, rke2-agent1, rke2-agent2.
- **StatefulSet PVC drift**: `volumeClaimTemplates` needs explicit `volumeMode: Filesystem` or ArgoCD SSA self-heals forever. See memory `feedback_argocd_statefulset_pvc_drift.md`.
- **IngressRoute namespace split**: this RKE2 Traefik install does not allow cross-namespace service refs. Keep the `IngressRoute`, backend `Service`, and TLS secret in the same namespace; if one host is shared across namespaces, duplicate the `Certificate` and move the route next to the destination service.
- **Public read-only hosts**: if a public host fronts a service that also exposes admin writes internally, add a Traefik route match like `Host(...) && (Method(GET) || Method(HEAD))` on the public edge instead of trusting the app to reject unsafe methods.
- **Public read-write allowlist hosts**: if a public host accepts a tightly bounded write surface (e.g. bootstrap-JWT POST), pin the allowlist as `(Method(GET) || Method(HEAD) || Method(POST) || Method(OPTIONS))`. PUT/PATCH/DELETE must still 404 at the route. Track A's `updatecenter.iamworkin.lan` / `updates.iamworkin.lan` are the canonical example. The lint test enforces this invariant.
- **Traefik VIP netpols**: when a `NetworkPolicy` allows `10.0.56.200`, also allow the post-DNAT backend ports (`8443` for TLS plus `8080` or `8000` for HTTP) or Calico will drop the rewritten flow.
- **Auth-safe probes**: services behind API-key or global auth middleware should prefer `tcpSocket` probes unless `/health` is explicitly exempted before the middleware runs.
- **ArgoCD must use internal Gitea URL**: `http://gitea-clusterip.gitea.svc.cluster.local:3000/bluejay/bluejay-infra.git`, not the external HTTPS URL (step-ca cert isn't trusted by ArgoCD). The `ApplicationSet` and any hand-created `Application` must both use the internal URL.
## Local manifest lint
The repo now carries a local-first lint pass for the recurring K8s gotchas that have burned the fleet:
```bash
dotnet test tests/bluejay-infra-lint/BluejayInfraLint.Tests.csproj -c Release
```
That test project sweeps `bluejay-infra/apps/**` plus the canonical sibling `FlowerCore.*\\k8s` manifests that share the same workspace. Matching `conftest.dev` policy files live under `tests/bluejay-infra-lint/conftest.dev/` for environments that also have `conftest` or `opa`.
## References
- OpenVox noc1 durability runbook: `docs/runbooks/openvoxserver-quadlet-durability.md`
- Cert-manager recovery playbook: `FlowerCore.Notes/memory/project_cert_manager_recovery_2026_04_22.md`
- Why pfSense DNS is required: `FlowerCore.Notes/memory/feedback_pfsense_dns_required_for_acme.md`
- Public DNS operator host: `https://dns.iamworkin.lan`
- Canonical credential helper: `FlowerCore.Notes/scripts/credential-helper.sh`
- pfSense admin automation: `FlowerCore.Notes/memory/feedback_pfsense_automation.md`

View File

@@ -1,325 +1,647 @@
# =============================================================================
# Agent Zero AI Stack — NUC Deployment (RKE2 Bare-Metal)
# =============================================================================
# Deploys: AgentZero (agent UI) on RKE2 cluster
# Ollama: edge1 Pi 5 at 10.0.57.15:11434 (qwen2.5-coder:7b, CPU)
# Target: RKE2 bare-metal cluster, namespace: agent-zero
#
# Differences from LOCAL (WSL K3s):
# - Uses Longhorn StorageClass (not local-path)
# - Connects to edge1 Pi 5 Ollama (not workstation R9700)
# - NO Anthropic API key (free/local models only)
# - NO Piper TTS or Kiwix (edge1 handles TTS, no Wikipedia needed)
# - NO hostPath volumes (no access to Windows filesystem)
# - Traefik IngressRoute for LAN access at agent-zero.iamworkin.lan
# - Knowledge base loaded via ConfigMap (not hostPath)
#
# Available Ollama models on edge1:
# - qwen2.5-coder:7b ~4.7 GB Code generation (CPU, Q4_K_M)
#
# Apply: KUBECONFIG=~/.kube/rke2.yaml kubectl apply -f agent-zero-nuc.yaml
# =============================================================================
---
apiVersion: v1
kind: Namespace
metadata:
name: agent-zero
labels:
app.kubernetes.io/part-of: agent-zero-stack
# =============================================================================
# Persistent Volume Claims (Longhorn)
# =============================================================================
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: agent-zero-data
namespace: agent-zero
spec:
accessModes: [ReadWriteOnce]
storageClassName: longhorn
resources:
requests:
storage: 5Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: agent-zero-knowledge
namespace: agent-zero
spec:
accessModes: [ReadWriteOnce]
storageClassName: longhorn
resources:
requests:
storage: 1Gi
# =============================================================================
# RBAC — Give Agent Zero kubectl access to the cluster
# =============================================================================
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: agent-zero
namespace: agent-zero
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: agent-zero-cluster-admin
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: cluster-admin
subjects:
- kind: ServiceAccount
name: agent-zero
namespace: agent-zero
# =============================================================================
# Agent Zero — AI Agent Web UI (NUC Edition)
# =============================================================================
# Connects to edge1 Pi 5 Ollama (free, local models only)
# No paid API keys — uses qwen2.5-coder:7b for everything
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: agent-zero
namespace: agent-zero
labels:
app: agent-zero
annotations:
agent-zero/deployment: "nuc"
agent-zero/ollama: "edge1 Pi 5 (10.0.57.15:11434)"
spec:
replicas: 1
selector:
matchLabels:
app: agent-zero
strategy:
type: Recreate
template:
metadata:
labels:
app: agent-zero
spec:
serviceAccountName: agent-zero
containers:
- name: agent-zero
image: agent0ai/agent-zero:latest
command: ["/bin/bash", "-c"]
args:
- |
# Install kubectl if not cached
if [ -f /a0/work/kubectl ]; then
cp /a0/work/kubectl /usr/local/bin/kubectl
else
curl -sLO "https://dl.k8s.io/release/v1.32.0/bin/linux/amd64/kubectl" && \
chmod +x kubectl && mv kubectl /usr/local/bin/kubectl && \
cp /usr/local/bin/kubectl /a0/work/kubectl
fi
# Run the original entrypoint
exec /exe/initialize.sh $BRANCH
ports:
- containerPort: 80
env:
# Agent identity
- name: AGENT_NAME
value: "Blue Jay (NUC)"
# Chat model — qwen2.5-coder:7b on edge1 Pi 5
- name: A0_SET_chat_model_provider
value: "ollama"
- name: A0_SET_chat_model_name
value: "qwen2.5-coder:7b"
- name: A0_SET_chat_model_api_base
value: "http://10.0.57.15:11434"
- name: A0_SET_chat_model_ctx_length
value: "32768"
- name: A0_SET_chat_model_kwargs
value: '{"temperature": 0, "num_ctx": 32768}'
# Utility model — same as chat (only one model available)
- name: A0_SET_util_model_provider
value: "ollama"
- name: A0_SET_util_model_name
value: "qwen2.5-coder:7b"
- name: A0_SET_util_model_api_base
value: "http://10.0.57.15:11434"
- name: A0_SET_util_model_kwargs
value: '{"num_ctx": 8192}'
# Embedding model — nomic on edge1 (if installed, fallback to none)
- name: A0_SET_embed_model_provider
value: "ollama"
- name: A0_SET_embed_model_name
value: "nomic-embed-text"
- name: A0_SET_embed_model_api_base
value: "http://10.0.57.15:11434"
# Browser model — disabled (no vision model on Pi)
- name: A0_SET_browser_model_provider
value: "ollama"
- name: A0_SET_browser_model_name
value: "qwen2.5-coder:7b"
- name: A0_SET_browser_model_api_base
value: "http://10.0.57.15:11434"
- name: A0_SET_browser_model_vision
value: "false"
# Agent profile
- name: A0_SET_agent_profile
value: "default"
# Memory settings
- name: A0_SET_memory_memorize_enabled
value: "true"
- name: A0_SET_memory_memorize_consolidation
value: "true"
- name: A0_SET_memory_memorize_replace_threshold
value: "0.85"
- name: A0_SET_memory_recall_enabled
value: "true"
# Speech-to-text disabled (no GPU for Whisper)
- name: A0_SET_stt_model_size
value: "tiny"
# Kubernetes
- name: KUBERNETES_SERVICE_HOST
value: "kubernetes.default.svc"
- name: KUBERNETES_SERVICE_PORT
value: "443"
volumeMounts:
- name: workspace
mountPath: /a0/work
- name: knowledge
mountPath: /a0/knowledge/custom/main
resources:
requests:
memory: "2Gi"
cpu: "1000m"
limits:
memory: "3Gi"
cpu: "2000m"
volumes:
- name: workspace
persistentVolumeClaim:
claimName: agent-zero-data
- name: knowledge
persistentVolumeClaim:
claimName: agent-zero-knowledge
---
apiVersion: v1
kind: Service
metadata:
name: agent-zero
namespace: agent-zero
spec:
type: ClusterIP
selector:
app: agent-zero
ports:
- port: 80
targetPort: 80
# =============================================================================
# Traefik IngressRoute — LAN access at agent-zero.iamworkin.lan
# =============================================================================
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: agent-zero
namespace: agent-zero
spec:
entryPoints:
- websecure
routes:
- match: Host(`agent-zero.iamworkin.lan`)
kind: Rule
services:
- name: agent-zero
port: 80
tls:
secretName: agent-zero-tls
# =============================================================================
# TLS Certificate via cert-manager (step-ca ACME)
# =============================================================================
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: agent-zero-tls
namespace: agent-zero
spec:
secretName: agent-zero-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- agent-zero.iamworkin.lan
duration: 720h
renewBefore: 240h
# =============================================================================
# NetworkPolicy — Restrict traffic
# =============================================================================
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: agent-zero-netpol
namespace: agent-zero
spec:
podSelector:
matchLabels:
app: agent-zero
policyTypes:
- Ingress
- Egress
ingress:
# Allow from Traefik
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: traefik-system
ports:
- port: 80
egress:
# DNS
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
ports:
- port: 53
protocol: UDP
- port: 53
protocol: TCP
# Ollama on edge1
- to:
- ipBlock:
cidr: 10.0.57.15/32
ports:
- port: 11434
# K8s API
- to:
- ipBlock:
cidr: 10.0.56.11/32
ports:
- port: 6443
# Allow internet (for kubectl image pull, etc)
- to:
- ipBlock:
cidr: 0.0.0.0/0
except:
- 10.0.0.0/8
- 172.16.0.0/12
- 192.168.0.0/16
# =============================================================================
# Agent Zero AI Stack — NUC Deployment (RKE2 Bare-Metal)
# =============================================================================
# Deploys: AgentZero (agent UI) on RKE2 cluster with Blue Jay profile
# Ollama: edge1 Pi 5 + AI HAT+ ONLY (10.0.57.17:11434).
# Workstation Ollama (BLUEJAY-WS) is intentionally NOT in the upstream —
# the workstation is private dev hardware, not a cluster dependency.
# Target: RKE2 bare-metal cluster, namespace: agent-zero
# Profile: Blue Jay (21 tools, 3 prompts, 4 extensions, theme)
#
# Differences from LOCAL (WSL K3s):
# - Uses Longhorn StorageClass (not local-path)
# - Cluster-only Ollama path (edge1) — keeps workstation private
# - NO Anthropic API key (free/local models only)
# - NO Piper TTS or Kiwix (edge1 handles TTS, no Wikipedia needed)
# - NO hostPath volumes — profile/tools/extensions loaded via ConfigMaps
# - Traefik IngressRoute for LAN access at agent-zero.iamworkin.lan
#
# ConfigMaps (defined in configmaps-bluejay.yaml):
# bluejay-tools 21 Python tool modules (~520K)
# bluejay-profile agent.json, agent.yaml, system_prompt.md (~20K)
# bluejay-prompts 3 prompt templates (~11K)
# flowercore-extensions 5 Python extension modules (~76K)
# bluejay-theme CSS theme (~7K)
#
# Apply: KUBECONFIG=~/.kube/rke2.yaml kubectl apply -f agent-zero-nuc.yaml
# =============================================================================
---
apiVersion: v1
kind: Namespace
metadata:
name: agent-zero
labels:
app.kubernetes.io/part-of: agent-zero-stack
# =============================================================================
# Persistent Volume Claims (Longhorn)
# =============================================================================
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: agent-zero-data
namespace: agent-zero
spec:
accessModes: [ReadWriteOnce]
storageClassName: longhorn
resources:
requests:
storage: 5Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: agent-zero-knowledge
namespace: agent-zero
spec:
accessModes: [ReadWriteOnce]
storageClassName: longhorn
resources:
requests:
storage: 1Gi
# =============================================================================
# RBAC — Give Agent Zero kubectl access to the cluster
# =============================================================================
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: agent-zero
namespace: agent-zero
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: agent-zero-cluster-admin
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: cluster-admin
subjects:
- kind: ServiceAccount
name: agent-zero
namespace: agent-zero
# =============================================================================
# Agent Zero — AI Agent Web UI (NUC Edition, Blue Jay Profile)
# =============================================================================
# Connects directly to fc-llm-bridge for chat + internal util/embed + browser.
# Agent Zero's internal util/embed slots stay on the bridge's OpenAI-compatible
# /v1 surface, while browser + corpus-search use the Ollama-compatible /api/*
# surface through OLLAMA_HOST.
# Blue Jay profile with 21 tools, 3 prompts, 4 extensions.
---
# FC LLM Bridge API key for Agent Zero (ADR-088 chat/util/embed/browser routing).
# Syncs from 1Password item "FC LLM Bridge API Keys" (field: agent-zero-k8s).
# Consumed by chat, internal util/embed, browser, and corpus-search requests
# that traverse fc-llm-bridge.
apiVersion: onepassword.com/v1
kind: OnePasswordItem
metadata:
name: fc-llm-bridge-api-keys
namespace: agent-zero
spec:
itemPath: "vaults/IAmWorkin/items/FC LLM Bridge API Keys"
---
# Print.Web API key for Agent Zero's print_web.py Python tool.
# Syncs from 1Password item "Print.Web API Keys" (password field = API key).
# The print_web.py tool reads PRINT_WEB_API_KEY env var for all HTTP requests
# to the thermal print service (GET /api/mcp/tools, POST /api/print/*, etc.).
# Note: Print.Web uses the legacy REST MCP shape (/api/mcp/tools/*), not the
# streamable-http MCP protocol. The print_web Python tool bridges this gap
# and is already present in bluejay-tools ConfigMaps.
apiVersion: onepassword.com/v1
kind: OnePasswordItem
metadata:
name: print-web-api-keys
namespace: agent-zero
spec:
itemPath: "vaults/IAmWorkin/items/Print.Web API Keys"
---
# Knowledge MCP bearer token for the direct Agent Zero -> Knowledge.Web path.
# The 1Password item currently stores the raw token in its concealed PASSWORD
# field, which the operator syncs to Secret key `password`.
apiVersion: onepassword.com/v1
kind: OnePasswordItem
metadata:
name: knowledge-mcp-tokens
namespace: agent-zero
spec:
itemPath: "vaults/IAmWorkin/items/FlowerCore Knowledge MCP Tokens"
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: agent-zero
namespace: agent-zero
labels:
app: agent-zero
annotations:
agent-zero/deployment: "nuc"
agent-zero/profile: "bluejay"
agent-zero/ollama: "fc-llm-bridge fronts edge1 Pi 5 + AI HAT+ Ollama for cluster browser/corpus-search traffic; internal chat/util/embed route through the bridge's authenticated OpenAI surface"
spec:
replicas: 1
selector:
matchLabels:
app: agent-zero
strategy:
type: Recreate
template:
metadata:
labels:
app: agent-zero
spec:
serviceAccountName: agent-zero
initContainers:
# Wait for fc-llm-bridge to be reachable before starting Agent Zero.
- name: wait-for-llm-bridge
image: busybox:1.37
command: ["sh", "-c"]
args:
- |
echo "Waiting for fc-llm-bridge..."
until wget -qO- --timeout=2 http://fc-llm-bridge.fc-llm-bridge.svc:8080/healthz >/dev/null 2>&1; do
echo "fc-llm-bridge not ready yet, retrying in 5s..."
sleep 5
done
echo "fc-llm-bridge is reachable."
# Assemble the Blue Jay profile directory structure from ConfigMaps.
# ConfigMaps can't create nested dirs, so we copy into the workspace PVC.
- name: setup-bluejay
image: busybox:1.37
command: ["sh", "-c"]
args:
- |
echo "Setting up Blue Jay profile..."
# Profile root files
mkdir -p /a0/work/.bluejay/agents/bluejay/tools
mkdir -p /a0/work/.bluejay/agents/bluejay/prompts
cp /tmp/bluejay-profile/* /a0/work/.bluejay/agents/bluejay/
# Tools (split across 3 ConfigMaps to stay under K8s 262K annotation limit)
cp /tmp/bluejay-tools-a/* /a0/work/.bluejay/agents/bluejay/tools/
cp /tmp/bluejay-tools-b/* /a0/work/.bluejay/agents/bluejay/tools/
cp /tmp/bluejay-tools-c/* /a0/work/.bluejay/agents/bluejay/tools/
# Prompts
cp /tmp/bluejay-prompts/* /a0/work/.bluejay/agents/bluejay/prompts/
# Extensions
mkdir -p /a0/work/.bluejay/extensions/flowercore
cp /tmp/flowercore-extensions/* /a0/work/.bluejay/extensions/flowercore/
# Theme
mkdir -p /a0/work/.bluejay/theme
cp /tmp/bluejay-theme/* /a0/work/.bluejay/theme/
echo "Blue Jay profile ready:"
echo " Tools: $(ls /a0/work/.bluejay/agents/bluejay/tools/*.py | wc -l)"
echo " Prompts: $(ls /a0/work/.bluejay/agents/bluejay/prompts/*.md | wc -l)"
echo " Extensions: $(ls /a0/work/.bluejay/extensions/flowercore/*.py | wc -l)"
volumeMounts:
- name: workspace
mountPath: /a0/work
- name: bluejay-tools-a
mountPath: /tmp/bluejay-tools-a
- name: bluejay-tools-b
mountPath: /tmp/bluejay-tools-b
- name: bluejay-tools-c
mountPath: /tmp/bluejay-tools-c
- name: bluejay-profile
mountPath: /tmp/bluejay-profile
- name: bluejay-prompts
mountPath: /tmp/bluejay-prompts
- name: flowercore-extensions
mountPath: /tmp/flowercore-extensions
- name: bluejay-theme
mountPath: /tmp/bluejay-theme
containers:
- name: agent-zero
image: agent0ai/agent-zero:latest
command: ["/bin/bash", "-c"]
args:
- |
# Install kubectl if not cached
if [ -f /a0/work/kubectl ]; then
cp /a0/work/kubectl /usr/local/bin/kubectl
else
curl -sLO "https://dl.k8s.io/release/v1.32.0/bin/linux/amd64/kubectl" && \
chmod +x kubectl && mv kubectl /usr/local/bin/kubectl && \
cp /usr/local/bin/kubectl /a0/work/kubectl
fi
# Link Blue Jay profile from workspace into Agent Zero's expected path
ln -sfn /a0/work/.bluejay/agents/bluejay /a0/agents/bluejay
# Write model config BEFORE initialize.sh loads it
# The _model_config plugin reads config.json (NOT config.yaml).
# chat_model: FlowerCore LLM Bridge (ADR-088) — OpenAI-compat,
# spend-tracked, tier-aliased (fc:balanced → Claude Sonnet).
# api_key comes from A0_SET_chat_model_api_key env var (overrides
# config.json). Utility + embedding stay on the authenticated
# OpenAI-compatible /v1 surface; browser and direct tool traffic
# use the bridge's Ollama-compatible root via OLLAMA_HOST.
mkdir -p /a0/usr/plugins/_model_config
cat > /a0/usr/plugins/_model_config/config.json << 'MODELCFG'
{"allow_chat_override":true,"chat_model":{"provider":"openai","name":"fc:balanced","api_base":"http://fc-llm-bridge.fc-llm-bridge.svc:8080/v1","ctx_length":8192,"ctx_history":0.7,"vision":false,"kwargs":{"temperature":0,"num_ctx":8192}},"utility_model":{"provider":"openai","name":"fc:cheap","api_base":"http://fc-llm-bridge.fc-llm-bridge.svc:8080/v1","ctx_length":8192,"ctx_input":0.7,"kwargs":{"num_ctx":8192}},"embedding_model":{"provider":"openai","name":"openai/fc:embedding","api_base":"http://fc-llm-bridge.fc-llm-bridge.svc:8080/v1","kwargs":{}}}
MODELCFG
# Strip heredoc indentation
sed -i 's/^ //' /a0/usr/plugins/_model_config/config.json
# Phase 0 Chat MCP pilot: Agent Zero does not interpolate env vars
# inside A0_SET_mcp_servers JSON, so build the final JSON here from
# the secret-backed env vars before initialize.sh. Keep the local
# corpus_search.py tool mounted either way so outage fallback
# remains available even when fc_knowledge is not advertised.
export KNOWLEDGE_MCP_ENABLED=false
if [ -n "${KNOWLEDGE_MCP_BEARER_TOKEN:-}" ]; then
if curl -sf --connect-timeout 3 "${KNOWLEDGE_MCP_HEALTH_URL}" > /dev/null && \
curl -sf --connect-timeout 5 \
-H "Authorization: Bearer ${KNOWLEDGE_MCP_BEARER_TOKEN}" \
-H "Accept: application/json, text/event-stream" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":"fc-knowledge-bootstrap","method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"agent-zero-bootstrap","version":"1.0"}}}' \
"${KNOWLEDGE_MCP_URL}" > /dev/null; then
export KNOWLEDGE_MCP_ENABLED=true
echo "fc_knowledge enabled from ${KNOWLEDGE_MCP_URL}."
else
echo "fc_knowledge unavailable or unauthorized; keeping local corpus_search.py as the fallback path."
fi
else
echo "fc_knowledge token missing; keeping local corpus_search.py as the fallback path."
fi
export A0_SET_mcp_servers="$(
python3 -c 'import json, os; servers = {}; chat_key = os.getenv("CHAT_MCP_API_KEY"); knowledge_enabled = os.getenv("KNOWLEDGE_MCP_ENABLED", "false").lower() == "true"; token = os.getenv("KNOWLEDGE_MCP_BEARER_TOKEN", "") if knowledge_enabled else ""; chat_key and servers.setdefault("fc_chat", {"type": "streamable-http", "url": "http://chat-web.fc-chat.svc/mcp", "headers": {"X-Api-Key": chat_key}}); token and servers.setdefault("fc_knowledge", {"type": "streamable-http", "url": os.getenv("KNOWLEDGE_MCP_URL", "http://knowledge-web.knowledge.svc/mcp"), "headers": {"Authorization": f"Bearer {token}"}}); print(json.dumps({"mcpServers": servers}, separators=(",", ":")))'
)"
# Run the original entrypoint
exec /exe/initialize.sh $BRANCH
ports:
- containerPort: 80
env:
# Agent identity
- name: AGENT_NAME
value: "Blue Jay (NUC)"
# Chat model — routed through FlowerCore LLM Bridge (ADR-088)
# so spend is tracked and tier aliases (fc:cheap/fc:balanced/fc:deep)
# dispatch to Ollama or Anthropic via a single OpenAI-compat endpoint.
# Internal utility + embedding use the authenticated OpenAI surface,
# while browser/corpus-search use the bridge's Ollama-compatible
# endpoints so Agent Zero no longer needs a local proxy sidecar.
- name: A0_SET_chat_model_provider
value: "openai"
- name: A0_SET_chat_model_name
value: "fc:balanced"
- name: A0_SET_chat_model_api_base
value: "http://fc-llm-bridge.fc-llm-bridge.svc:8080/v1"
- name: A0_SET_chat_model_api_key
valueFrom:
secretKeyRef:
name: fc-llm-bridge-api-keys
key: agent-zero-k8s
# Agent Zero's runtime still resolves provider keys from the
# provider-level env names (models.get_api_key -> OPENAI_API_KEY /
# API_KEY_OPENAI), not the slot-scoped A0_SET_* value alone.
# Mirror the same secret here so real public chat runs can reach
# the fc-llm-bridge chat_model path instead of failing before MCP.
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: fc-llm-bridge-api-keys
key: agent-zero-k8s
- name: FC_LLM_BRIDGE_API_KEY
valueFrom:
secretKeyRef:
name: fc-llm-bridge-api-keys
key: agent-zero-k8s
- name: A0_SET_chat_model_ctx_length
value: "8192"
- name: A0_SET_chat_model_kwargs
value: '{"temperature": 0, "num_ctx": 8192}'
# Utility model — fast small helper tier through the OpenAI surface
- name: A0_SET_util_model_provider
value: "openai"
- name: A0_SET_util_model_name
value: "fc:cheap"
- name: A0_SET_util_model_api_base
value: "http://fc-llm-bridge.fc-llm-bridge.svc:8080/v1"
- name: A0_SET_util_model_kwargs
value: '{"num_ctx": 2048}'
# Embedding model — authenticated bridge alias to nomic-embed-text.
# LiteLLM's embedding() path needs an explicit provider prefix here
# even though the chat slot can use bare fc:* aliases.
- name: A0_SET_embed_model_provider
value: "openai"
- name: A0_SET_embed_model_name
value: "openai/fc:embedding"
- name: A0_SET_embed_model_api_base
value: "http://fc-llm-bridge.fc-llm-bridge.svc:8080/v1"
# Browser model — small Gemma candidate through the same proxy
- name: A0_SET_browser_model_provider
value: "ollama"
- name: A0_SET_browser_model_name
value: "gemma3:4b"
- name: A0_SET_browser_model_api_base
value: "http://fc-llm-bridge.fc-llm-bridge.svc:8080"
- name: A0_SET_browser_model_api_key
valueFrom:
secretKeyRef:
name: fc-llm-bridge-api-keys
key: agent-zero-k8s
- name: A0_SET_browser_model_vision
value: "true"
- name: OLLAMA_HOST
value: "http://fc-llm-bridge.fc-llm-bridge.svc:8080"
- name: FLOWERCORE_AGENTZERO_OLLAMA_URL
value: "http://fc-llm-bridge.fc-llm-bridge.svc:8080"
# Agent profile — Blue Jay personality, tools, and system prompt
- name: A0_SET_agent_profile
value: "bluejay"
# Memory settings
- name: A0_SET_memory_memorize_enabled
value: "true"
- name: A0_SET_memory_memorize_consolidation
value: "true"
- name: A0_SET_memory_memorize_replace_threshold
value: "0.85"
- name: A0_SET_memory_recall_enabled
value: "true"
# Speech-to-text disabled (no GPU for Whisper)
- name: A0_SET_stt_model_size
value: "tiny"
# FlowerCore.Chat MCP pilot (Phase 0)
- name: CHAT_MCP_API_KEY
valueFrom:
secretKeyRef:
name: chat-mcp-api-key
key: api-key
optional: true
# FlowerCore.Knowledge MCP Phase 1 — direct Agent Zero client path.
# Probe /healthz first, then try an authenticated initialize call.
# If either fails, Agent Zero boots without fc_knowledge and keeps
# the local corpus_search.py tool as the outage-safe path.
- name: KNOWLEDGE_MCP_URL
value: "http://knowledge-web.knowledge.svc/mcp"
- name: KNOWLEDGE_MCP_HEALTH_URL
value: "http://knowledge-web.knowledge.svc/healthz"
- name: KNOWLEDGE_MCP_BEARER_TOKEN
valueFrom:
secretKeyRef:
name: knowledge-mcp-tokens
key: password
# Print.Web — Thermal printer service on edge2.
# PRINT_WEB_URL: internal HTTP (bypasses Traefik TLS — print_web.py
# runs in-cluster and can reach edge2 directly on the PROD VLAN).
# PRINT_WEB_API_KEY: from 1Password "Print.Web API Keys" password field,
# synced by the print-web-api-keys OnePasswordItem CRD above.
# The print_web.py Python tool reads both env vars for all HTTP calls.
- name: PRINT_WEB_URL
value: "http://10.0.57.16:5200"
- name: PRINT_WEB_API_KEY
valueFrom:
secretKeyRef:
name: print-web-api-keys
key: password
# Intranet search — use in-cluster HTTP (no step-ca TLS needed)
# corpus_search.py reads FLOWERCORE_FLEET_VECTOR_DIR but that mount is not
# on the cluster yet (BLUEJAY-WS only). The tool gracefully returns a
# "no DB found" message with rebuild instructions rather than crashing.
- name: FLOWERCORE_INTRANET_URL
value: "http://intranet-web.intranet.svc:5300"
# Kubernetes
- name: KUBERNETES_SERVICE_HOST
value: "kubernetes.default.svc"
- name: KUBERNETES_SERVICE_PORT
value: "443"
volumeMounts:
- name: workspace
mountPath: /a0/work
- name: knowledge
mountPath: /a0/knowledge/custom/main
- name: flowercore-extensions
mountPath: /a0/extensions/flowercore
readOnly: true
- name: bluejay-theme
mountPath: /a0/webui/static/css/custom
readOnly: true
startupProbe:
httpGet:
path: /
port: 80
initialDelaySeconds: 15
periodSeconds: 10
failureThreshold: 18
livenessProbe:
httpGet:
path: /
port: 80
periodSeconds: 30
failureThreshold: 3
readinessProbe:
exec:
command:
- /bin/bash
- -c
- "curl -sf http://localhost:80/ > /dev/null && curl -sf --connect-timeout 3 http://fc-llm-bridge.fc-llm-bridge.svc:8080/healthz > /dev/null"
periodSeconds: 30
failureThreshold: 2
resources:
requests:
memory: "2Gi"
cpu: "1000m"
limits:
memory: "3Gi"
cpu: "2000m"
volumes:
- name: workspace
persistentVolumeClaim:
claimName: agent-zero-data
- name: knowledge
persistentVolumeClaim:
claimName: agent-zero-knowledge
- name: bluejay-tools-a
configMap:
name: bluejay-tools-a
- name: bluejay-tools-b
configMap:
name: bluejay-tools-b
- name: bluejay-tools-c
configMap:
name: bluejay-tools-c
- name: bluejay-profile
configMap:
name: bluejay-profile
- name: bluejay-prompts
configMap:
name: bluejay-prompts
- name: flowercore-extensions
configMap:
name: flowercore-extensions
- name: bluejay-theme
configMap:
name: bluejay-theme
---
apiVersion: v1
kind: Service
metadata:
name: agent-zero
namespace: agent-zero
spec:
type: ClusterIP
selector:
app: agent-zero
ports:
- port: 80
targetPort: 80
# =============================================================================
# Traefik IngressRoute — LAN access at agent-zero.iamworkin.lan
# =============================================================================
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: agent-zero
namespace: agent-zero
spec:
entryPoints:
- websecure
routes:
- match: Host(`agent-zero.iamworkin.lan`)
kind: Rule
services:
- name: agent-zero
port: 80
tls:
secretName: agent-zero-tls
# =============================================================================
# TLS Certificate via cert-manager (step-ca ACME)
# =============================================================================
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: agent-zero-tls
namespace: agent-zero
spec:
secretName: agent-zero-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- agent-zero.iamworkin.lan
duration: 720h
renewBefore: 240h
# =============================================================================
# NetworkPolicy — Restrict traffic
# =============================================================================
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: agent-zero-netpol
namespace: agent-zero
spec:
podSelector:
matchLabels:
app: agent-zero
policyTypes:
- Ingress
- Egress
ingress:
# Allow from Traefik
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: traefik-system
ports:
- port: 80
# Allow from monitoring (blackbox probe)
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: monitoring
ports:
- port: 80
egress:
# DNS
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
ports:
- port: 53
protocol: UDP
- port: 53
protocol: TCP
# Print.Web on edge2
- to:
- ipBlock:
cidr: 10.0.57.16/32
ports:
- port: 5200
# K8s API
- to:
- ipBlock:
cidr: 10.0.56.11/32
ports:
- port: 6443
# FlowerCore LLM Bridge (ADR-088 chat_model routing) — ClusterIP service
# in the fc-llm-bridge namespace on port 8080.
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: fc-llm-bridge
ports:
- port: 8080
protocol: TCP
# FlowerCore.Chat MCP (Phase 0 pilot) — use the in-cluster chat-web
# service instead of the public Traefik VIP so MCP traffic stays inside
# the cluster and survives the private-range egress denylist.
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: fc-chat
ports:
- port: 80
protocol: TCP
- port: 8080
protocol: TCP
# FlowerCore.Knowledge MCP (Phase 1) — in-cluster direct route with
# anonymous /healthz probe plus authenticated /mcp initialize/tool calls.
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: knowledge
ports:
- port: 80
protocol: TCP
- port: 8080
protocol: TCP
# Intranet search API — use in-cluster svc so traffic stays inside
# the cluster and is not blocked by the private-range egress denylist.
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: intranet
ports:
- port: 5300
protocol: TCP
# Allow internet (for kubectl image pull, etc)
- to:
- ipBlock:
cidr: 0.0.0.0/0
except:
- 10.0.0.0/8
- 172.16.0.0/12
- 192.168.0.0/16

File diff suppressed because it is too large Load Diff

View File

@@ -96,7 +96,9 @@ data:
allow=ulaw
allow=alaw
direct_media=no
dtmf_mode=inband
; Yealink provisioning sends RFC2833/RFC4733 DTMF (payload 101).
; Keep the PBX template aligned so physical desk phones emit ARI DTMF events.
dtmf_mode=rfc4733
rtp_symmetric=yes
force_rport=yes
rewrite_contact=yes
@@ -173,6 +175,83 @@ data:
remove_existing=yes
qualify_frequency=60
; Test endpoints 901-904 for softphone proof
[test-endpoint](!)
type=endpoint
context=from-internal
transport=transport-udp
disallow=all
allow=ulaw
allow=alaw
direct_media=no
rtp_symmetric=yes
force_rport=yes
rewrite_contact=yes
[901](test-endpoint)
auth=auth901
aors=901
callerid="Proof Caller" <901>
[auth901]
type=auth
auth_type=userpass
username=901
password=test-sip-secret-901
[901]
type=aor
max_contacts=1
remove_existing=yes
[902](test-endpoint)
auth=auth902
aors=902
callerid="Proof Callee" <902>
[auth902]
type=auth
auth_type=userpass
username=902
password=test-sip-secret-901
[902]
type=aor
max_contacts=1
remove_existing=yes
[903](test-endpoint)
auth=auth903
aors=903
callerid="Proof Endpoint 3" <903>
[auth903]
type=auth
auth_type=userpass
username=903
password=test-sip-secret-901
[903]
type=aor
max_contacts=1
remove_existing=yes
[904](test-endpoint)
auth=auth904
aors=904
callerid="Proof Endpoint 4" <904>
[auth904]
type=auth
auth_type=userpass
username=904
password=test-sip-secret-901
[904]
type=aor
max_contacts=1
remove_existing=yes
extensions.conf: |
[general]
static=yes
@@ -195,6 +274,32 @@ data:
exten => _1XX,1,Dial(PJSIP/${EXTEN},30)
same => n,Hangup()
; Softphone proof endpoints and utility extensions
exten => _9XX,1,NoOp(Proof call to ${EXTEN})
same => n,Dial(PJSIP/${EXTEN},30)
same => n,Hangup()
exten => 999,1,Answer()
same => n,Playback(demo-echotest)
same => n,Echo()
same => n,Hangup()
exten => 998,1,Answer()
same => n,Milliwatt()
same => n,Hangup()
exten => 997,1,Answer()
same => n,Wait(0.5)
same => n,Playback(hello-world)
same => n,Wait(1)
same => n,Hangup()
exten => 996,1,Answer()
same => n,Wait(0.5)
same => n,Read(DIGITS,,4,,,5)
same => n,SayDigits(${DIGITS})
same => n,Hangup()
; Outbound via Twilio SIP trunk (11-digit US)
exten => _1NXXNXXXXXX,1,Set(CALLERID(num)=+13202332529)
same => n,Dial(PJSIP/+${EXTEN}@twilio-trunk,60)
@@ -209,6 +314,13 @@ data:
exten => *100,1,Stasis(flowercore-pbx,internal,ivr)
same => n,Hangup()
; Test-only entry into the Victory Day workflow (DID +15074618329).
; Used by live SIP AATs to exercise the VDAY Fun Menu + AsteriskGameHandler
; path without dialing in over Twilio. Mnemonic: *832 = "V-D-A" (8-3-2).
exten => *832,1,NoOp(Test entry: Victory Day workflow via AAT)
same => n,Stasis(flowercore-pbx,inbound-pstn,+15074618329)
same => n,Hangup()
; Star codes routed to FlowerCore Stasis app for handling
exten => *0,1,Stasis(flowercore-pbx,starcode,*0)
same => n,Hangup()

View File

@@ -16,22 +16,63 @@ spec:
metadata:
labels:
app: asterisk
spec:
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
securityContext:
fsGroup: 0
spec:
nodeSelector:
kubernetes.io/hostname: rke2-agent1
hostNetwork: true
# Keep the search list free of iamworkin.lan so CoreDNS's wildcard
# template cannot hijack public egress like downloads.asterisk.org.
dnsPolicy: None
dnsConfig:
nameservers:
- 10.43.0.10
searches:
- telephony.svc.cluster.local
- svc.cluster.local
- cluster.local
options:
- name: ndots
value: "2"
securityContext:
fsGroup: 0
# CoreDNS in this cluster has an iamworkin.lan wildcard that catches
# any unresolved name and returns 10.0.56.200 (Traefik VIP), which
# means downloads.asterisk.org inside the pod resolves to Traefik and
# returns 404. Pin the real address so the init container can fetch
# the sounds tarball.
hostAliases:
- ip: 165.22.184.19
hostnames:
- downloads.asterisk.org
initContainers:
- name: install-sounds
image: busybox:latest
# Downloads Asterisk core sounds (en, ulaw) into the sounds emptyDir
# volume so the base Asterisk image (which ships no sounds) can play
# vm-advopts, vm-goodbye, digits/*, characters/*, beep, etc. Skips
# the download if the directory already contains sound files —
# re-running the pod after a hot image reload reuses the unpack.
image: alpine:3.20
command:
- sh
- -c
- |
mkdir -p /sounds/en &&
wget -qO- http://downloads.asterisk.org/pub/telephony/sounds/asterisk-core-sounds-en-ulaw-current.tar.gz | tar xz -C /sounds/en/ &&
wget -qO- http://downloads.asterisk.org/pub/telephony/sounds/asterisk-extra-sounds-en-ulaw-current.tar.gz | tar xz -C /sounds/en/ &&
echo "Sound files installed: $(find /sounds/en -type f | wc -l) files"
set -eu
if [ -f /sounds/en/vm-goodbye.ulaw ] || [ -f /sounds/en/vm-goodbye.gsm ]; then
echo "Sounds already present — skipping download."
exit 0
fi
echo "Installing curl + tar..."
apk add --no-cache curl tar gzip >/dev/null
cd /tmp
echo "Downloading Asterisk core sounds (en, ulaw) 1.6.1..."
# -k: cluster egress goes through a step-ca MITM for outbound TLS
# that this pod does not trust. The tarball is a public artifact —
# integrity is checked downstream by Asterisk at playback time.
curl -fksSLO https://downloads.asterisk.org/pub/telephony/sounds/releases/asterisk-core-sounds-en-ulaw-1.6.1.tar.gz
echo "Extracting to /sounds/en ..."
mkdir -p /sounds/en
tar -xzf asterisk-core-sounds-en-ulaw-1.6.1.tar.gz -C /sounds/en
echo "Done — $(ls /sounds/en | wc -l) files installed."
volumeMounts:
- name: sounds
mountPath: /sounds/en
@@ -77,6 +118,11 @@ spec:
mountPath: /var/log/asterisk
- name: sounds
mountPath: /var/lib/asterisk/sounds/en
# Shared TTS audio — telephony-web writes .sln16 files here (as
# /shared-tts), Asterisk plays them via `sound:tts/<name>` which
# resolves to this mount. Both pods are pinned to rke2-agent1.
- name: shared-tts
mountPath: /var/lib/asterisk/sounds/tts
resources:
requests:
cpu: 100m
@@ -148,3 +194,7 @@ spec:
emptyDir: {}
- name: sounds
emptyDir: {}
- name: shared-tts
hostPath:
path: /tmp/tts-audio
type: DirectoryOrCreate

View File

@@ -0,0 +1,448 @@
# Authentik OIDC backend
# ArgoCD-managed. BlueJay Lab.
#
# Stack:
# - PostgreSQL 16 StatefulSet (single replica, Longhorn RWO 5Gi)
# - Redis 7 Deployment (no persistence — session/cache only)
# - Authentik server + worker Deployments (image ghcr.io/goauthentik/server:2024.12.3)
# - Media PVC shared between server + worker (Longhorn RWO 2Gi)
# - Certificate via step-ca-acme ClusterIssuer
# - Traefik IngressRoute at id.iamworkin.lan
#
# Secrets come from 1Password item "authentik-credentials" (IAmWorkin vault, id y6i74ch22q5wvm7znquq4nhhcu)
# via the OnePasswordItem CRD, materialized into k8s Secret authentik/authentik-credentials.
#
# Why the discovery URL is /application/o/pimanager/ : Authentik issues per-application OIDC providers.
# The pimanager OIDC application/provider is created after the cluster pods are healthy (manual or
# via API once the bootstrap token is available — see Notes substrate).
---
apiVersion: v1
kind: Namespace
metadata:
name: authentik
labels:
app.kubernetes.io/part-of: bluejay-infra
---
# 1Password operator pulls the authentik-credentials item into a k8s Secret of the same name.
# Field labels in 1P become Secret keys: AUTHENTIK_SECRET_KEY, POSTGRES_PASSWORD, REDIS_PASSWORD,
# BOOTSTRAP_ADMIN_PASSWORD, BOOTSTRAP_ADMIN_TOKEN, BOOTSTRAP_ADMIN_EMAIL.
apiVersion: onepassword.com/v1
kind: OnePasswordItem
metadata:
name: authentik-credentials
namespace: authentik
spec:
itemPath: "vaults/IAmWorkin/items/authentik-credentials"
---
# Shared media volume for server + worker pods.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: authentik-media
namespace: authentik
spec:
storageClassName: longhorn
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 2Gi
---
# PostgreSQL 16 StatefulSet — Authentik's primary store.
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: authentik-postgres
namespace: authentik
labels:
app: authentik-postgres
argocd.argoproj.io/instance: infra-authentik
spec:
persistentVolumeClaimRetentionPolicy:
whenDeleted: Retain
whenScaled: Retain
podManagementPolicy: OrderedReady
serviceName: authentik-postgres
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: authentik-postgres
template:
metadata:
labels:
app: authentik-postgres
spec:
containers:
- name: postgres
image: postgres:16-alpine
ports:
- containerPort: 5432
name: postgres
env:
- name: POSTGRES_USER
value: authentik
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: authentik-credentials
key: POSTGRES_PASSWORD
- name: POSTGRES_DB
value: authentik
- name: POSTGRES_INITDB_ARGS
value: "--encoding=UTF-8 --lc-collate=C --lc-ctype=C"
- name: PGDATA
value: /var/lib/postgresql/data/pgdata
readinessProbe:
exec:
command: ["pg_isready", "-U", "authentik"]
initialDelaySeconds: 5
periodSeconds: 5
livenessProbe:
exec:
command: ["pg_isready", "-U", "authentik"]
initialDelaySeconds: 30
periodSeconds: 30
resources:
requests: { cpu: 100m, memory: 256Mi }
limits: { cpu: 1000m, memory: 1Gi }
volumeMounts:
- name: pgdata
mountPath: /var/lib/postgresql/data
volumeClaimTemplates:
- metadata:
name: pgdata
spec:
storageClassName: longhorn
accessModes: [ReadWriteOnce]
volumeMode: Filesystem
resources:
requests:
storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
name: authentik-postgres
namespace: authentik
spec:
clusterIP: None
selector:
app: authentik-postgres
ports:
- name: postgres
port: 5432
targetPort: 5432
---
# Redis 7 — session storage + Celery broker. No persistence needed (cache).
apiVersion: apps/v1
kind: Deployment
metadata:
name: authentik-redis
namespace: authentik
labels:
app: authentik-redis
argocd.argoproj.io/instance: infra-authentik
spec:
replicas: 1
strategy:
type: Recreate
selector:
matchLabels:
app: authentik-redis
template:
metadata:
labels:
app: authentik-redis
spec:
containers:
- name: redis
image: redis:7-alpine
args:
- "--save"
- ""
- "--appendonly"
- "no"
- "--requirepass"
- "$(REDIS_PASSWORD)"
env:
- name: REDIS_PASSWORD
valueFrom:
secretKeyRef:
name: authentik-credentials
key: REDIS_PASSWORD
ports:
- containerPort: 6379
name: redis
readinessProbe:
tcpSocket: { port: 6379 }
initialDelaySeconds: 5
periodSeconds: 5
livenessProbe:
tcpSocket: { port: 6379 }
initialDelaySeconds: 30
periodSeconds: 30
resources:
requests: { cpu: 50m, memory: 64Mi }
limits: { cpu: 500m, memory: 256Mi }
---
apiVersion: v1
kind: Service
metadata:
name: authentik-redis
namespace: authentik
spec:
selector:
app: authentik-redis
ports:
- name: redis
port: 6379
targetPort: 6379
---
# Authentik server Deployment — HTTP frontend on :9000.
apiVersion: apps/v1
kind: Deployment
metadata:
name: authentik-server
namespace: authentik
labels:
app: authentik-server
argocd.argoproj.io/instance: infra-authentik
spec:
replicas: 1
strategy:
type: Recreate # shares /media RWO PVC with worker
selector:
matchLabels:
app: authentik-server
template:
metadata:
labels:
app: authentik-server
spec:
securityContext:
# Authentik image runs as uid 1000 "authentik" but the Longhorn PVC mounts
# root:root by default. fsGroup recursively chgrp + chmod g+rwx so the
# non-root container can mkdir /media/public during the tenant_files migration.
fsGroup: 1000
containers:
- name: server
image: ghcr.io/goauthentik/server:2024.12.3
args: ["server"]
ports:
- containerPort: 9000
name: http
- containerPort: 9443
name: https
env:
- name: AUTHENTIK_SECRET_KEY
valueFrom:
secretKeyRef:
name: authentik-credentials
key: AUTHENTIK_SECRET_KEY
- name: AUTHENTIK_REDIS__HOST
value: authentik-redis
- name: AUTHENTIK_REDIS__PASSWORD
valueFrom:
secretKeyRef:
name: authentik-credentials
key: REDIS_PASSWORD
- name: AUTHENTIK_POSTGRESQL__HOST
value: authentik-postgres
- name: AUTHENTIK_POSTGRESQL__NAME
value: authentik
- name: AUTHENTIK_POSTGRESQL__USER
value: authentik
- name: AUTHENTIK_POSTGRESQL__PASSWORD
valueFrom:
secretKeyRef:
name: authentik-credentials
key: POSTGRES_PASSWORD
- name: AUTHENTIK_BOOTSTRAP_PASSWORD
valueFrom:
secretKeyRef:
name: authentik-credentials
key: BOOTSTRAP_ADMIN_PASSWORD
- name: AUTHENTIK_BOOTSTRAP_TOKEN
valueFrom:
secretKeyRef:
name: authentik-credentials
key: BOOTSTRAP_ADMIN_TOKEN
- name: AUTHENTIK_BOOTSTRAP_EMAIL
valueFrom:
secretKeyRef:
name: authentik-credentials
key: BOOTSTRAP_ADMIN_EMAIL
- name: AUTHENTIK_DISABLE_UPDATE_CHECK
value: "true"
- name: AUTHENTIK_ERROR_REPORTING__ENABLED
value: "false"
- name: AUTHENTIK_LOG_LEVEL
value: info
# First-boot Authentik can take 3+ min on the migration phase
# (waiting on DB lock while worker also runs migrations). Initial
# delays are generous so kubelet doesn't kill the pod mid-migration;
# periodSeconds keeps post-startup probing responsive.
readinessProbe:
httpGet:
path: /-/health/ready/
port: 9000
initialDelaySeconds: 60
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 12
livenessProbe:
httpGet:
path: /-/health/live/
port: 9000
initialDelaySeconds: 300
periodSeconds: 30
timeoutSeconds: 10
failureThreshold: 3
startupProbe:
httpGet:
path: /-/health/live/
port: 9000
initialDelaySeconds: 30
periodSeconds: 15
timeoutSeconds: 10
failureThreshold: 40 # 30s + 40*15s = 10.5 min budget
resources:
requests: { cpu: 150m, memory: 512Mi }
limits: { cpu: 1500m, memory: 1Gi }
volumeMounts:
- name: media
mountPath: /media
volumes:
- name: media
persistentVolumeClaim:
claimName: authentik-media
---
# Authentik worker Deployment — runs Celery background tasks.
apiVersion: apps/v1
kind: Deployment
metadata:
name: authentik-worker
namespace: authentik
labels:
app: authentik-worker
argocd.argoproj.io/instance: infra-authentik
spec:
replicas: 1
strategy:
type: Recreate # shares /media RWO PVC with server
selector:
matchLabels:
app: authentik-worker
template:
metadata:
labels:
app: authentik-worker
spec:
securityContext:
# Same as server pod — non-root uid 1000 needs PVC group write.
fsGroup: 1000
containers:
- name: worker
image: ghcr.io/goauthentik/server:2024.12.3
args: ["worker"]
env:
- name: AUTHENTIK_SECRET_KEY
valueFrom:
secretKeyRef:
name: authentik-credentials
key: AUTHENTIK_SECRET_KEY
- name: AUTHENTIK_REDIS__HOST
value: authentik-redis
- name: AUTHENTIK_REDIS__PASSWORD
valueFrom:
secretKeyRef:
name: authentik-credentials
key: REDIS_PASSWORD
- name: AUTHENTIK_POSTGRESQL__HOST
value: authentik-postgres
- name: AUTHENTIK_POSTGRESQL__NAME
value: authentik
- name: AUTHENTIK_POSTGRESQL__USER
value: authentik
- name: AUTHENTIK_POSTGRESQL__PASSWORD
valueFrom:
secretKeyRef:
name: authentik-credentials
key: POSTGRES_PASSWORD
- name: AUTHENTIK_DISABLE_UPDATE_CHECK
value: "true"
- name: AUTHENTIK_ERROR_REPORTING__ENABLED
value: "false"
- name: AUTHENTIK_LOG_LEVEL
value: info
resources:
requests: { cpu: 100m, memory: 256Mi }
limits: { cpu: 1000m, memory: 768Mi }
volumeMounts:
- name: media
mountPath: /media
volumes:
- name: media
persistentVolumeClaim:
claimName: authentik-media
---
apiVersion: v1
kind: Service
metadata:
name: authentik-server
namespace: authentik
spec:
selector:
app: authentik-server
ports:
- name: http
port: 9000
targetPort: 9000
- name: https
port: 9443
targetPort: 9443
---
# step-ca leaf certificate for id.iamworkin.lan.
# step-ca container resolver uses pfSense Unbound, so the public A record for id.iamworkin.lan
# MUST exist before this Certificate is applied (cert-manager HTTP-01 will silently 2h-backoff
# otherwise). Added 2026-05-25 via scripts/pfsense-add-id-host.py.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: authentik-tls
namespace: authentik
spec:
secretName: authentik-tls
dnsNames:
- id.iamworkin.lan
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: authentik
namespace: authentik
spec:
entryPoints: [websecure]
routes:
- match: Host(`id.iamworkin.lan`)
kind: Rule
services:
- name: authentik-server
port: 9000
tls:
secretName: authentik-tls

69
apps/cdi/README.md Normal file
View File

@@ -0,0 +1,69 @@
# CDI — Containerized Data Importer
KubeVirt's `containerized-data-importer` for populating PVCs from external
sources (HTTP, HTTPS, container registry, S3, virtctl upload). Required to
import the Windows Server 2025 ISO into the `windows-server-2025-iso` PVC
that `apps/kubevirt-vms/ci1.yaml` mounts as a CDROM.
## Files
| File | Source | Purpose |
| ----------------- | ----------------------------------------------------------------------------------------------------------------- | -------------------------------------------------- |
| `cdi-operator.yaml` | [`v1.65.0`](https://github.com/kubevirt/containerized-data-importer/releases/tag/v1.65.0) — verbatim copy | Installs operator + CRDs (5779 lines, large) |
| `cdi-cr.yaml` | [`v1.65.0`](https://github.com/kubevirt/containerized-data-importer/releases/tag/v1.65.0) — annotated + commented | Tells operator to deploy CDI components |
`cdi-operator.yaml` is **vendored verbatim** from the upstream release for
air-gap reproducibility (no internet fetch at deploy time, ArgoCD prune
contracts hold). To bump versions:
```bash
CDI_VER=v1.66.0 # for example
curl -sL "https://github.com/kubevirt/containerized-data-importer/releases/download/${CDI_VER}/cdi-operator.yaml" \
-o apps/cdi/cdi-operator.yaml
curl -sL "https://github.com/kubevirt/containerized-data-importer/releases/download/${CDI_VER}/cdi-cr.yaml" \
-o /tmp/cdi-cr-new.yaml # then re-apply project header diff
git diff apps/cdi/ # review
git commit + push
```
## Verify after deploy
```bash
kubectl -n cdi get pods # operator + apiserver + deployment + uploadproxy
kubectl get cdis cdi -o jsonpath='{.status.phase}' # "Deployed"
kubectl get crd | grep cdi.kubevirt.io
# Expected CRDs: datavolumes.cdi.kubevirt.io, cdiconfigs.cdi.kubevirt.io,
# storageprofiles.cdi.kubevirt.io, dataimportcrons.cdi.kubevirt.io,
# datasources.cdi.kubevirt.io, objecttransfers.cdi.kubevirt.io
```
## Use after install
```yaml
# Example DataVolume that imports from HTTP
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
name: my-iso
spec:
source:
http:
url: "https://server/path/to.iso"
pvc:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 10Gi
storageClassName: longhorn
```
```bash
# Or upload from local disk via virtctl
virtctl image-upload pvc my-iso \
--image-path ./my.iso \
--size 10Gi \
--storage-class longhorn \
--access-mode ReadWriteOnce \
--uploadproxy-url https://cdi-uploadproxy.cdi.svc:443 \
--insecure
```

36
apps/cdi/cdi-cr.yaml Normal file
View File

@@ -0,0 +1,36 @@
# =============================================================================
# CDI CR — Tells the CDI operator to install CDI components into the cluster.
# =============================================================================
# After cdi-operator.yaml is applied, the operator watches for THIS resource
# (CDI named "cdi"). When found, it deploys cdi-apiserver, cdi-deployment,
# cdi-uploadproxy, cdi-cronjob, and the importer/uploadserver/cloner pods.
#
# Configuration:
# - HonorWaitForFirstConsumer: PVCs created by DataVolumes wait for first
# pod to schedule before binding (lets storage class pick best node).
# - WebhookPvcRendering: validates PVC creation against CDI policies.
# - imagePullPolicy IfNotPresent: re-pull only on tag rotation.
# - nodeSelector linux: pin to Linux nodes (no Windows worker support).
#
# Andrew may want to add a `uploadProxyURLOverride` later to expose the
# uploadproxy via Traefik IngressRoute for `virtctl image-upload` from
# BLUEJAY-WS without `kubectl port-forward`. Phase 2 enhancement.
# =============================================================================
apiVersion: cdi.kubevirt.io/v1beta1
kind: CDI
metadata:
name: cdi
annotations:
bluejay.iamworkin.lan/source: "kubevirt/containerized-data-importer v1.65.0"
spec:
config:
featureGates:
- HonorWaitForFirstConsumer
- WebhookPvcRendering
imagePullPolicy: IfNotPresent
infra:
nodeSelector:
kubernetes.io/os: linux
workload:
nodeSelector:
kubernetes.io/os: linux

5779
apps/cdi/cdi-operator.yaml Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,106 @@
# edge2 Services — Traefik IngressRoutes for FlowerCore Print.Web on edge2
# Proxies print.iamworkin.lan to edge2 (10.0.57.16:5200) via headless Service
# + manual Endpoints (same K8s external-proxy pattern as noc-services).
#
# Print.Web has its own X-Api-Key authentication and exposes anonymous
# endpoints for the bookmarklet / Python CLI / cups-notifier flow, so no
# Traefik basicAuth middleware is wired here.
#
# ArgoCD managed - BlueJay Lab
---
apiVersion: v1
kind: Namespace
metadata:
name: edge2-proxy
labels:
app.kubernetes.io/part-of: bluejay-infra
---
# ============================================================
# Print.Web - edge2:5200 (FlowerCore.Print.Web on Pi 4)
# ============================================================
apiVersion: v1
kind: Service
metadata:
name: print-web-external
namespace: edge2-proxy
spec:
ports:
- port: 5200
targetPort: 5200
name: http
clusterIP: None
---
apiVersion: v1
kind: Endpoints
metadata:
name: print-web-external
namespace: edge2-proxy
subsets:
- addresses:
- ip: 10.0.57.16
ports:
- port: 5200
name: http
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: print-web-tls
namespace: edge2-proxy
spec:
secretName: print-web-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- print.iamworkin.lan
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: print-web
namespace: edge2-proxy
spec:
entryPoints:
- websecure
routes:
- kind: Rule
match: Host(`print.iamworkin.lan`)
services:
- name: print-web-external
port: 5200
tls:
secretName: print-web-tls
---
# NetworkPolicy: allow Traefik ingress, allow egress to edge2 + DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: edge2-proxy-netpol
namespace: edge2-proxy
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: traefik-system
egress:
- to:
- ipBlock:
cidr: 10.0.57.16/32
ports:
- port: 5200
protocol: TCP
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
ports:
- port: 53
protocol: UDP
- port: 53
protocol: TCP

70
apps/fc-chat/fc-chat.yaml Normal file
View File

@@ -0,0 +1,70 @@
# FlowerCore Chat — TLS + Ingress
# Deployment and Service managed by deploy script (not ArgoCD)
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: chat-web-tls
namespace: fc-chat
spec:
secretName: chat-web-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- chat.iamworkin.lan
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: chat-web
namespace: fc-chat
spec:
entryPoints:
- websecure
routes:
- match: Host(`chat.iamworkin.lan`)
kind: Rule
services:
- name: chat-web
port: 80
tls:
secretName: chat-web-tls
---
# Public host profile marker. The app treats this header as authoritative for
# the public twin, while the internal chat.iamworkin.lan route does not attach
# it and keeps the operator-oriented UI.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
name: chat-public-profile-header
namespace: fc-chat
spec:
headers:
customRequestHeaders:
X-FC-Chat-Host-Profile: "public"
---
# Public Cloudflare-fronted twin for the anonymous chat surface. Operator
# paths are intentionally absent from the allowlist below, so /admin,
# /operator, /console, /ops, /api/operator, and /operatorhub miss this route
# and return Traefik 404 before reaching the pod. Operator action still needed:
# create/verify Cloudflare DNS chat.flowercore.io -> public Traefik endpoint
# and mirror the cf-origin-flowercore-io TLS secret into namespace fc-chat.
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: chat-web-public
namespace: fc-chat
spec:
entryPoints:
- websecure
routes:
- match: Host(`chat.flowercore.io`) && (Path(`/`) || Path(`/chat`) || PathPrefix(`/_blazor`) || PathPrefix(`/_framework`) || PathPrefix(`/_content`) || PathPrefix(`/avatars`) || PathPrefix(`/css`) || PathPrefix(`/js`) || PathPrefix(`/favicon`) || PathPrefix(`/chathub`)) && (Method(`GET`) || Method(`HEAD`) || Method(`POST`) || Method(`OPTIONS`))
kind: Rule
middlewares:
- name: chat-public-profile-header
services:
- name: chat-web
port: 80
tls:
secretName: cf-origin-flowercore-io

View File

@@ -0,0 +1,53 @@
# FlowerCore Remote Desktop — TLS + Ingress
#
# Source-of-truth split:
# - bluejay-infra OWNS: Certificate, IngressRoute, all NetworkPolicies
# (see network-policies.yaml in this directory).
# - FlowerCore.RemoteDesktop scripts/deploy-web.sh OWNS: Deployment +
# Service. Reason: image refs like `localhost/fc-desktop:linux-xfce`
# only exist on each node's containerd after a manual import, so a
# Deployment manifest in bluejay-infra would race the image-import
# step and crash-loop.
#
# NetworkPolicies moved into bluejay-infra 2026-05-07 — previously they
# were applied via the deploy script's kubectl apply calls, which broke
# cluster-rebuild repeatability. See
# feedback_networkpolicies_belong_in_bluejay_infra.md.
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: remotedesktop-web-tls
namespace: fc-desktop
spec:
secretName: remotedesktop-web-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- desktop.iamworkin.lan
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: remotedesktop-web
namespace: fc-desktop
spec:
entryPoints:
- websecure
routes:
# Host-level catch-all for desktop.iamworkin.lan. The /guacamole
# path-prefix match lives in apps/guacamole/guacamole.yaml as a
# separate IngressRoute in the guacamole namespace — the cluster
# Traefik disallows cross-namespace service refs, so the PathPrefix
# rule can't sit here. Traefik's router matching precedence gives
# longer/more-specific rules priority automatically, so as long as
# the guacamole IngressRoute exists it takes /guacamole traffic
# before this catch-all sees it.
- match: Host(`desktop.iamworkin.lan`)
kind: Rule
services:
- name: remotedesktop-web
port: 8080
tls:
secretName: remotedesktop-web-tls

View File

@@ -0,0 +1,332 @@
# FlowerCore Remote Desktop — NetworkPolicies (GitOps-managed)
#
# Moved into bluejay-infra 2026-05-07 as part of the regroup audit. These
# four policies were previously applied via FlowerCore.RemoteDesktop's
# scripts/deploy-web.sh `kubectl apply` calls, which meant a fresh cluster
# rebuild from bluejay-infra alone would miss them — Browser Lab session
# isolation, control-plane allow-list, and HTTP-01 cert renewal would all
# silently fail to come up.
#
# Source-of-truth contract:
# - bluejay-infra OWNS all NetworkPolicy + Certificate + IngressRoute
# resources for fc-desktop.
# - FlowerCore.RemoteDesktop's scripts/deploy-web.sh continues to own
# the Deployment + Service apply (because the image ref
# `localhost/fc-desktop:linux-xfce` only exists on each node's
# containerd after a manual import — it can't be pulled from a
# registry, so a Deployment manifest in bluejay-infra would race the
# image-import step and crash-loop).
---
# 1) desktop-isolation — Browser Lab session pods.
#
# Locks down pods labeled `app.kubernetes.io/name=remote-desktop` (every
# session pod regardless of template). Allows guacd ingress for the VNC/RDP
# display lane and remotedesktop-web's pre-handoff probing. Egress: NFS to
# Synology, DNS, Traefik (cluster + LB VIP), Intranet (Browser Lab home).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: desktop-isolation
namespace: fc-desktop
labels:
app.kubernetes.io/part-of: remotedesktop
app.kubernetes.io/component: isolation
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: remote-desktop
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: guacamole
ports:
- port: 3000
protocol: TCP
- port: 3001
protocol: TCP
- port: 5901
protocol: TCP
- port: 3389
protocol: TCP
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: fc-desktop
podSelector:
matchLabels:
app.kubernetes.io/name: remotedesktop-web
ports:
- port: 3000
protocol: TCP
- port: 5901
protocol: TCP
egress:
# NFS to Synology
- to:
- ipBlock:
cidr: 10.0.58.3/32
ports:
- port: 2049
protocol: TCP
- port: 2049
protocol: UDP
- port: 111
protocol: TCP
- port: 111
protocol: UDP
- to:
- ipBlock:
cidr: 10.0.58.3/32
ports:
- port: 445
protocol: TCP
- to: []
ports:
- port: 53
protocol: UDP
- port: 53
protocol: TCP
- to:
- ipBlock:
cidr: 10.0.56.200/32
- ipBlock:
cidr: 10.43.33.87/32
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: traefik-system
podSelector:
matchLabels:
app.kubernetes.io/name: traefik
ports:
- port: 80
protocol: TCP
- port: 443
protocol: TCP
- port: 8000
protocol: TCP
- port: 8443
protocol: TCP
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: intranet
podSelector:
matchLabels:
app: intranet-web
ports:
- port: 5300
protocol: TCP
---
# 2) fc-desktop-default-deny — namespace-wide catch-all.
#
# Selects every pod EXCEPT remotedesktop-web (the public-surface control
# plane) and applies default-deny semantics for both Ingress and Egress.
# Closes the gap where session pods land WITHOUT the desktop-isolation
# policy's `app.kubernetes.io/name=remote-desktop` label, plus prevents
# arbitrary debug sidecars / kubectl debug images from getting cluster
# access.
#
# CRITICAL: also catches transient cm-acme-http-solver pods (that's the
# bug this whole regroup chased). The cm-acme-http-solver-allow policy
# below is the explicit carve-out.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: fc-desktop-default-deny
namespace: fc-desktop
labels:
app.kubernetes.io/part-of: remotedesktop
app.kubernetes.io/component: isolation
spec:
podSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: NotIn
values:
- remotedesktop-web
policyTypes:
- Ingress
- Egress
---
# 3) remotedesktop-web-isolation — control plane explicit allow-list.
#
# remotedesktop-web is the only pod label the default-deny excludes, so
# without this policy the control plane would have wide-open Ingress AND
# Egress. This re-introduces a tight allow-list:
# - Ingress: Traefik only on TCP/8080
# - Egress: CoreDNS, K8s API, Guacamole admin, NFS, Intranet,
# Traefik (cluster + LB), and the fc-desktop namespace itself
# (for session pod readiness probing).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: remotedesktop-web-isolation
namespace: fc-desktop
labels:
app.kubernetes.io/part-of: remotedesktop
app.kubernetes.io/component: isolation
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: remotedesktop-web
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: traefik-system
podSelector:
matchLabels:
app.kubernetes.io/name: traefik
ports:
- port: 8080
protocol: TCP
egress:
# CoreDNS
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- port: 53
protocol: UDP
- port: 53
protocol: TCP
# K8s API server
- to: []
ports:
- port: 443
protocol: TCP
- port: 6443
protocol: TCP
# Guacamole admin
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: guacamole
ports:
- port: 8080
protocol: TCP
# NFS to Synology
- to:
- ipBlock:
cidr: 10.0.58.3/32
ports:
- port: 2049
protocol: TCP
- port: 2049
protocol: UDP
- port: 111
protocol: TCP
- port: 111
protocol: UDP
# Intranet web
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: intranet
podSelector:
matchLabels:
app: intranet-web
ports:
- port: 5300
protocol: TCP
# Cluster Traefik pods (in-cluster service resolution + Guacamole
# routing handoff where web app builds URLs against the public host
# but resolves internally).
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: traefik-system
podSelector:
matchLabels:
app.kubernetes.io/name: traefik
ports:
- port: 80
protocol: TCP
- port: 443
protocol: TCP
- port: 8080
protocol: TCP
- port: 8443
protocol: TCP
# fc-desktop namespace — session pod probing during browser-access
# readiness checks.
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: fc-desktop
ports:
- port: 3000
protocol: TCP
- port: 3001
protocol: TCP
- port: 5901
protocol: TCP
- port: 3389
protocol: TCP
---
# 4) cm-acme-http-solver-allow — cert-manager HTTP-01 carve-out.
#
# Without this, fc-desktop-default-deny catches the transient solver pods
# cert-manager creates for each renewal (they don't carry the
# remotedesktop-web label). Caused 8-day silent renewal failure on
# desktop.iamworkin.lan in 2026-04-28..2026-05-07 (see
# feedback_certmanager_renewal_stuck_when_solver_blocked_by_namespace_default_deny.md).
#
# Authorizes:
# - Ingress on TCP/8089 from cluster Traefik (which proxies the external
# HTTP-01 GET on port 80 through to the solver).
# - Egress for cluster DNS (defensive — newer cert-manager probes from
# inside the solver too).
#
# The `acme.cert-manager.io/http01-solver=true` label is set by
# cert-manager itself on every solver pod automatically.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: cm-acme-http-solver-allow
namespace: fc-desktop
labels:
app.kubernetes.io/part-of: remotedesktop
app.kubernetes.io/component: cert-renewal
spec:
podSelector:
matchLabels:
acme.cert-manager.io/http01-solver: "true"
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: traefik-system
podSelector:
matchLabels:
app.kubernetes.io/name: traefik
ports:
- port: 8089
protocol: TCP
egress:
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- port: 53
protocol: UDP
- port: 53
protocol: TCP

View File

@@ -0,0 +1,26 @@
# Runtime secrets for FlowerCore.DeviceManagement.
#
# OnePasswordItem operator syncs this item into a Kubernetes Secret with the
# same name. Expected fields:
# DB-Password
# mtls-ca.pem
# mtls-client.crt
# mtls-client.key
# mtls-chain.pem
#
# Do not add literal secret values to this repo. Runtime pods consume the
# synced Secret through env vars and read-only mounts.
apiVersion: onepassword.com/v1
kind: OnePasswordItem
metadata:
name: fc-devicemgmt-runtime
namespace: fc-devicemgmt
labels:
app.kubernetes.io/name: fc-devicemgmt
app.kubernetes.io/component: secrets
app.kubernetes.io/part-of: flowercore
app.kubernetes.io/managed-by: argocd
flowercore.io/tenant-id: system
flowercore.io/created-by: bluejay-infra
spec:
itemPath: "vaults/IAmWorkin/items/FlowerCore DeviceManagement Runtime"

View File

@@ -0,0 +1,30 @@
# Certificate for devices.iamworkin.lan.
#
# Preflight gate: FlowerCore.DNS / pfSense must contain an explicit A record:
# devices.iamworkin.lan -> 10.0.56.200
# before this Certificate is synced. step-ca ACME cannot see the CoreDNS
# wildcard, so missing pfSense DNS produces cert-manager HTTP-01 backoff
# (feedback_pfsense_dns_required_for_acme).
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: fc-devicemgmt-web-tls
namespace: fc-devicemgmt
labels:
app.kubernetes.io/name: fc-devicemgmt-web
app.kubernetes.io/component: web
app.kubernetes.io/part-of: flowercore
app.kubernetes.io/managed-by: argocd
flowercore.io/tenant-id: system
flowercore.io/created-by: bluejay-infra
annotations:
flowercore.io/dns-preflight: "devices.iamworkin.lan must resolve to 10.0.56.200 before ACME sync"
spec:
secretName: fc-devicemgmt-web-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- devices.iamworkin.lan
duration: 720h
renewBefore: 240h

View File

@@ -0,0 +1,81 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: fc-devicemgmt-operator
labels:
app.kubernetes.io/name: fc-devicemgmt-operator
app.kubernetes.io/component: operator
app.kubernetes.io/part-of: flowercore
app.kubernetes.io/managed-by: argocd
flowercore.io/tenant-id: system
flowercore.io/created-by: bluejay-infra
rules:
- apiGroups:
- devices.flowercore.io
resources:
- '*'
verbs:
- get
- list
- watch
- create
- update
- patch
- delete
- apiGroups:
- devices.flowercore.io
resources:
- devices/status
- devices/finalizers
- devicegroups/status
- devicegroups/finalizers
- devicepolicies/status
- devicepolicies/finalizers
- remotecommands/status
- remotecommands/finalizers
verbs:
- get
- update
- patch
- apiGroups:
- apps
resources:
- deployments
verbs:
- get
- apiGroups:
- ""
resources:
- pods
- services
- configmaps
- secrets
- events
verbs:
- get
- list
- watch
- create
- update
- patch
- delete
- apiGroups:
- batch
resources:
- jobs
verbs:
- get
- list
- watch
- create
- update
- patch
- delete
- apiGroups:
- networking.k8s.io
resources:
- networkpolicies
verbs:
- get
- list
- watch

View File

@@ -0,0 +1,19 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: fc-devicemgmt-operator
labels:
app.kubernetes.io/name: fc-devicemgmt-operator
app.kubernetes.io/component: operator
app.kubernetes.io/part-of: flowercore
app.kubernetes.io/managed-by: argocd
flowercore.io/tenant-id: system
flowercore.io/created-by: bluejay-infra
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: fc-devicemgmt-operator
subjects:
- kind: ServiceAccount
name: fc-devicemgmt-operator
namespace: fc-devicemgmt

View File

@@ -0,0 +1,109 @@
# FlowerCore.DeviceManagement Operator.
#
# KubeOps controller for devices.flowercore.io resources. Operator-created
# children must set OwnerReferences + traceability labels/annotations per
# k8s-pod-ownership-and-traceability-standard.md. RBAC below grants
# apps/deployments/get so the process can resolve its own Deployment UID.
apiVersion: apps/v1
kind: Deployment
metadata:
name: fc-devicemgmt-operator
namespace: fc-devicemgmt
labels:
app: fc-devicemgmt-operator
app.kubernetes.io/name: fc-devicemgmt-operator
app.kubernetes.io/component: operator
app.kubernetes.io/part-of: flowercore
app.kubernetes.io/managed-by: argocd
flowercore.io/tenant-id: system
flowercore.io/created-by: bluejay-infra
annotations:
flowercore.io/traceability-standard: k8s-pod-ownership-and-traceability-standard
spec:
replicas: 1
revisionHistoryLimit: 3
selector:
matchLabels:
app: fc-devicemgmt-operator
template:
metadata:
labels:
app: fc-devicemgmt-operator
app.kubernetes.io/name: fc-devicemgmt-operator
app.kubernetes.io/component: operator
app.kubernetes.io/part-of: flowercore
app.kubernetes.io/managed-by: argocd
flowercore.io/tenant-id: system
flowercore.io/created-by: bluejay-infra
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
flowercore.io/audit-trace-id: "runtime-activity-trace"
spec:
serviceAccountName: fc-devicemgmt-operator
securityContext:
fsGroup: 1654
fsGroupChangePolicy: OnRootMismatch
containers:
- name: operator
image: localhost/fc-devicemgmt-operator:v20260519-sp34cl3-fix
imagePullPolicy: Never
ports:
- name: metrics
containerPort: 8080
env:
- name: ASPNETCORE_ENVIRONMENT
value: "Production"
- name: DOTNET_SYSTEM_GLOBALIZATION_INVARIANT
value: "false"
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: FLOWERCORE_KUBERNETES_OWNER_DEPLOYMENT
value: "fc-devicemgmt-operator"
- name: FlowerCore__Service__Name
value: "FlowerCore.DeviceManagement.Operator"
- name: FlowerCore__DeviceManagement__DefaultTenantId
value: "system"
resources:
requests:
cpu: 50m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
readinessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 20
periodSeconds: 30
securityContext:
runAsNonRoot: true
runAsUser: 1654
runAsGroup: 1654
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
volumeMounts:
- name: tmp
mountPath: /tmp
- name: logs
mountPath: /app/logs
volumes:
- name: tmp
emptyDir: {}
- name: logs
emptyDir: {}

View File

@@ -0,0 +1,151 @@
# FlowerCore.DeviceManagement Web.
#
# Source repo is expected to ship FlowerCore.DeviceManagement.Web in a later
# Sprint 9+ lane. This manifest is static-valid without requiring the image to
# exist yet; import localhost/fc-devicemgmt-web:<tag> to all schedulable RKE2
# nodes before letting ArgoCD sync a live rollout.
#
# SCALED TO 0 — 2026-05-19 morning-routine cleanup.
# The Web pod cannot start until TWO upstream gaps close:
# 1. MySQL DB instance `flowercore_devicemgmt` (user `fc_devicemgmt`) is
# provisioned via fc-mysql Manager. The cluster currently has ZERO
# MySqlInstanceCrds and no `mysql.fc-mysql.svc:3306` Service, so the
# deployment-web container env `FlowerCore__Database__Host=mysql.fc-mysql.svc`
# points at nothing. Provision via the fc-mysql Manager UI/REST/MCP.
# 2. 1Password vault item `IAmWorkin/FlowerCore DeviceManagement Runtime`
# with 5 fields (DB-Password, mtls-ca.pem, mtls-client.crt, mtls-client.key,
# mtls-chain.pem) — see apps/fc-devicemgmt/1password-item.yaml. Mint mTLS
# from step-ca-agent ClusterIssuer per ADR-126; DB-Password must match the
# password configured for the MySQL user.
# Re-enable: change replicas back to 2 after both gaps close. The image tag
# in this file (v20260512-cx5) MAY also need a refresh — it predates the
# Sprint 34 Cl-3 operator fix; Web may have an analogous bug.
apiVersion: apps/v1
kind: Deployment
metadata:
name: fc-devicemgmt-web
namespace: fc-devicemgmt
labels:
app: fc-devicemgmt-web
app.kubernetes.io/name: fc-devicemgmt-web
app.kubernetes.io/component: web
app.kubernetes.io/part-of: flowercore
app.kubernetes.io/managed-by: argocd
flowercore.io/tenant-id: system
flowercore.io/created-by: bluejay-infra
annotations:
flowercore.io/traceability-standard: k8s-pod-ownership-and-traceability-standard
spec:
replicas: 0
revisionHistoryLimit: 3
selector:
matchLabels:
app: fc-devicemgmt-web
template:
metadata:
labels:
app: fc-devicemgmt-web
app.kubernetes.io/name: fc-devicemgmt-web
app.kubernetes.io/component: web
app.kubernetes.io/part-of: flowercore
app.kubernetes.io/managed-by: argocd
flowercore.io/tenant-id: system
flowercore.io/created-by: bluejay-infra
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
flowercore.io/audit-trace-id: "runtime-activity-trace"
spec:
securityContext:
fsGroup: 1654
fsGroupChangePolicy: OnRootMismatch
containers:
- name: web
image: localhost/fc-devicemgmt-web:v20260512-cx5
imagePullPolicy: Never
ports:
- name: http
containerPort: 8080
env:
- name: ASPNETCORE_URLS
value: "http://+:8080"
- name: ASPNETCORE_ENVIRONMENT
value: "Production"
- name: DOTNET_SYSTEM_GLOBALIZATION_INVARIANT
value: "false"
- name: FlowerCore__Service__Name
value: "FlowerCore.DeviceManagement.Web"
- name: FlowerCore__DeviceManagement__DefaultTenantId
value: "system"
- name: FlowerCore__Database__Provider
value: "MySql"
- name: FlowerCore__Database__Host
value: "mysql.fc-mysql.svc"
- name: FlowerCore__Database__Database
value: "flowercore_devicemgmt"
- name: FlowerCore__Database__User
value: "fc_devicemgmt"
- name: FlowerCore__Database__Password
valueFrom:
secretKeyRef:
name: fc-devicemgmt-runtime
key: DB-Password
- name: FlowerCore__DeviceManagement__AgentMtls__CaPath
value: "/secrets/devicemgmt-mtls/mtls-ca.pem"
- name: FlowerCore__DeviceManagement__AgentMtls__ClientCertificatePath
value: "/secrets/devicemgmt-mtls/mtls-client.crt"
- name: FlowerCore__DeviceManagement__AgentMtls__ClientKeyPath
value: "/secrets/devicemgmt-mtls/mtls-client.key"
- name: FlowerCore__EventBus__Redis__Configuration
value: "redis.fc-redis.svc:6379"
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 1000m
memory: 768Mi
startupProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 30
readinessProbe:
tcpSocket:
port: 8080
periodSeconds: 10
failureThreshold: 3
livenessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 30
periodSeconds: 30
failureThreshold: 3
securityContext:
runAsNonRoot: true
runAsUser: 1654
runAsGroup: 1654
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
volumeMounts:
- name: tmp
mountPath: /tmp
- name: logs
mountPath: /app/logs
- name: devicemgmt-mtls
mountPath: /secrets/devicemgmt-mtls
readOnly: true
volumes:
- name: tmp
emptyDir: {}
- name: logs
emptyDir: {}
- name: devicemgmt-mtls
secret:
secretName: fc-devicemgmt-runtime
defaultMode: 0400

View File

@@ -0,0 +1,55 @@
# LAN ingress for FlowerCore.DeviceManagement Web.
#
# RKE2 Traefik has no built-in ACME resolver configured. Keep TLS certificate
# ownership in cert-manager Certificate/fc-devicemgmt-web-tls.
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: fc-devicemgmt-web
namespace: fc-devicemgmt
labels:
app.kubernetes.io/name: fc-devicemgmt-web
app.kubernetes.io/component: web
app.kubernetes.io/part-of: flowercore
app.kubernetes.io/managed-by: argocd
flowercore.io/tenant-id: system
flowercore.io/created-by: bluejay-infra
spec:
entryPoints:
- websecure
routes:
- match: Host(`devices.iamworkin.lan`)
kind: Rule
services:
- name: fc-devicemgmt-web
port: 80
tls:
secretName: fc-devicemgmt-web-tls
# Future public agent/update host gate (OFF by default):
#
# Do not enable `update.flowercore.io` here until Authentik OIDC Q-OIDC-1
# resolves the public-device-management auth model and route ownership with
# UpdateCenter. When enabled, use a separate public IngressRoute with an
# explicit Method allowlist, public-host auth middleware, and public TLS
# certificate strategy. Leaving this as comments keeps ArgoCD from stealing
# live UpdateCenter traffic.
#
# apiVersion: traefik.io/v1alpha1
# kind: IngressRoute
# metadata:
# name: fc-devicemgmt-web-public
# namespace: fc-devicemgmt
# annotations:
# flowercore.io/public-host-gate: "disabled-until-Q-OIDC-1"
# spec:
# entryPoints:
# - websecure
# routes:
# - match: Host(`update.flowercore.io`) && (Method(`GET`) || Method(`HEAD`) || Method(`POST`) || Method(`OPTIONS`))
# kind: Rule
# services:
# - name: fc-devicemgmt-web
# port: 80
# tls:
# secretName: fc-devicemgmt-public-tls

View File

@@ -0,0 +1,13 @@
# FlowerCore.DeviceManagement namespace.
#
# ArgoCD discovers this directory as Application `infra-fc-devicemgmt`.
apiVersion: v1
kind: Namespace
metadata:
name: fc-devicemgmt
labels:
app.kubernetes.io/name: fc-devicemgmt
app.kubernetes.io/part-of: flowercore
app.kubernetes.io/managed-by: argocd
flowercore.io/tenant-id: system
flowercore.io/created-by: bluejay-infra

View File

@@ -0,0 +1,224 @@
# FlowerCore.DeviceManagement NetworkPolicies.
#
# NetworkPolicies belong in bluejay-infra so ArgoCD owns rebuild state.
# Rules include Traefik post-DNAT backend ports per
# feedback_netpol_dnat_backend_port and Synology NFS egress for the requested
# cold-tier / future artifact path.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: fc-devicemgmt-web-isolation
namespace: fc-devicemgmt
labels:
app.kubernetes.io/name: fc-devicemgmt-web
app.kubernetes.io/component: web
app.kubernetes.io/part-of: flowercore
app.kubernetes.io/managed-by: argocd
flowercore.io/tenant-id: system
flowercore.io/created-by: bluejay-infra
spec:
podSelector:
matchLabels:
app: fc-devicemgmt-web
policyTypes:
- Ingress
- Egress
ingress:
# LAN edge: only cluster Traefik should reach the Web pod for
# devices.iamworkin.lan.
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: traefik-system
podSelector:
matchLabels:
app.kubernetes.io/name: traefik
ports:
- port: 8080
protocol: TCP
# Direct LAN diagnostics are allowed only from FlowerCore LAN/VPN ranges.
- from:
- ipBlock:
cidr: 10.0.56.0/24
- ipBlock:
cidr: 10.0.57.0/24
- ipBlock:
cidr: 10.0.58.0/24
- ipBlock:
cidr: 10.0.68.0/27
ports:
- port: 8080
protocol: TCP
egress:
# CoreDNS.
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- port: 53
protocol: UDP
- port: 53
protocol: TCP
# Database namespace.
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: fc-mysql
ports:
- port: 3306
protocol: TCP
# Redis backplane for multi-replica SignalR / live-status fan-out.
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: fc-redis
ports:
- port: 6379
protocol: TCP
# Traefik VIP / in-cluster Traefik for self-callbacks and public URL
# generation tests. Include post-DNAT backend ports 8443 + 8080.
- to:
- ipBlock:
cidr: 10.0.56.200/32
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: traefik-system
podSelector:
matchLabels:
app.kubernetes.io/name: traefik
ports:
- port: 80
protocol: TCP
- port: 443
protocol: TCP
- port: 8080
protocol: TCP
- port: 8443
protocol: TCP
# Agent egress: LAN/VPN devices may run DM Agent in Generic, Kiosk, Pi,
# ThinClient, or Server mode. Keep this private-range only.
- to:
- ipBlock:
cidr: 10.0.56.0/24
- ipBlock:
cidr: 10.0.57.0/24
- ipBlock:
cidr: 10.0.58.0/24
- ipBlock:
cidr: 10.0.68.0/27
ports:
- port: 80
protocol: TCP
- port: 443
protocol: TCP
- port: 8080
protocol: TCP
- port: 8443
protocol: TCP
- port: 5000
protocol: TCP
- port: 5001
protocol: TCP
# Synology NFS cold-tier / artifact mount allowance.
- to:
- ipBlock:
cidr: 10.0.58.3/32
ports:
- port: 2049
protocol: TCP
- port: 2049
protocol: UDP
- port: 111
protocol: TCP
- port: 111
protocol: UDP
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: fc-devicemgmt-operator-isolation
namespace: fc-devicemgmt
labels:
app.kubernetes.io/name: fc-devicemgmt-operator
app.kubernetes.io/component: operator
app.kubernetes.io/part-of: flowercore
app.kubernetes.io/managed-by: argocd
flowercore.io/tenant-id: system
flowercore.io/created-by: bluejay-infra
spec:
podSelector:
matchLabels:
app: fc-devicemgmt-operator
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: monitoring
ports:
- port: 8080
protocol: TCP
egress:
# CoreDNS.
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- port: 53
protocol: UDP
- port: 53
protocol: TCP
# Kubernetes API for KubeOps reconciliation and Deployment UID lookup.
- to: []
ports:
- port: 443
protocol: TCP
- port: 6443
protocol: TCP
# Agent egress for operator-initiated probes / fallback command dispatch.
- to:
- ipBlock:
cidr: 10.0.56.0/24
- ipBlock:
cidr: 10.0.57.0/24
- ipBlock:
cidr: 10.0.58.0/24
- ipBlock:
cidr: 10.0.68.0/27
ports:
- port: 80
protocol: TCP
- port: 443
protocol: TCP
- port: 8080
protocol: TCP
- port: 8443
protocol: TCP
- port: 5000
protocol: TCP
- port: 5001
protocol: TCP
# Synology NFS allowance for future cold-tier/audit archival jobs.
- to:
- ipBlock:
cidr: 10.0.58.3/32
ports:
- port: 2049
protocol: TCP
- port: 2049
protocol: UDP
- port: 111
protocol: TCP
- port: 111
protocol: UDP

View File

@@ -0,0 +1,22 @@
apiVersion: v1
kind: Service
metadata:
name: fc-devicemgmt-web
namespace: fc-devicemgmt
labels:
app: fc-devicemgmt-web
app.kubernetes.io/name: fc-devicemgmt-web
app.kubernetes.io/component: web
app.kubernetes.io/part-of: flowercore
app.kubernetes.io/managed-by: argocd
flowercore.io/tenant-id: system
flowercore.io/created-by: bluejay-infra
spec:
type: ClusterIP
selector:
app: fc-devicemgmt-web
ports:
- name: http
port: 80
targetPort: 8080
protocol: TCP

View File

@@ -0,0 +1,12 @@
apiVersion: v1
kind: ServiceAccount
metadata:
name: fc-devicemgmt-operator
namespace: fc-devicemgmt
labels:
app.kubernetes.io/name: fc-devicemgmt-operator
app.kubernetes.io/component: operator
app.kubernetes.io/part-of: flowercore
app.kubernetes.io/managed-by: argocd
flowercore.io/tenant-id: system
flowercore.io/created-by: bluejay-infra

View File

@@ -0,0 +1,105 @@
# fc-distribution — staged deployment (Phase 1, USB provisioning)
**Status:** manifests staged, **NOT YET APPLIED**. Image must be built +
imported and signing 1Password items confirmed before `git push`.
- Architecture: [`../../../FlowerCore.Notes/docs/infrastructure/usb-provisioning-architecture.md`](../../../FlowerCore.Notes/docs/infrastructure/usb-provisioning-architecture.md)
- Repo: `D:\git\FlowerCore\FlowerCore.Distribution\` (`README.md`, `CLAUDE.md`)
- Shared lib: `FlowerCore.Common` -> `FlowerCore.Shared.Distribution`
`FlowerCore.Distribution` publishes signed edition manifests (ECDSA P-256
over canonical JSON) and serves the SHA-256 content-addressed blob store
that USB builders pull from. The verifier embeds the `IAmWorkin ACME CA
Root CA` as the trust anchor; per-edition leaf signing material lives in
1Password and is mounted into the pod read-only.
## Deployment order (do NOT skip / reorder)
### 1. FlowerCore.DNS preflight — VERIFIED 2026-04-23
`dist.iamworkin.lan` already resolves to `10.0.56.200`, but keep the
FlowerCore.DNS preflight green before push:
```bash
curl -sk "https://dns.iamworkin.lan/api/v1/zones/iamworkin.lan/resolve-preflight?hostname=dist.iamworkin.lan"
# Expect: "resolvable": true
python bluejay-infra/scripts/check-pfsense-dns.py
# Historical filename retained; implementation now calls FlowerCore.DNS
# resolve-preflight instead of raw resolver lookups.
```
If the record ever disappears, recreate it through FlowerCore.DNS before
push/apply:
```bash
curl -sk https://dns.iamworkin.lan/api/v1/servers
curl -sk -X POST https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iamworkin.lan/records \
-H "Content-Type: application/json" \
-d '{"name":"dist","type":"A","data":"10.0.56.200","ttl":300}'
```
If this is missing, cert-manager HTTP-01 will silently back off ~2h. See
memory `feedback_pfsense_dns_required_for_acme.md`.
### 2. 1Password items required in vault `IAmWorkin`
| Item title | Item id | Used as |
|---|---|---|
| `FlowerCore Code Signing CA` | (existing) | Informational handle only — root CA is baked into the image at build time, not mounted |
| `FlowerCore Edition Signing Key - edition:kiosk-standard` | `3hf33egdvnni6jyuws3r737mqe` | Mounted at `/signing/kiosk-standard/` |
| `FlowerCore Edition Signing Key - edition:aistation-field` | `ccxrtsan5samfq4pfuczymacrq` | Mounted at `/signing/aistation-field/` |
Each edition item must publish three field labels (the operator turns
field labels into Secret keys verbatim):
- `certificate.pem` — leaf certificate
- `private-key.pem` — ECDSA P-256 private key
- `chain.pem` — leaf + intermediate (referenced by the env var as the
cert-path; the verifier uses this for signature path validation)
### 3. Build + import the image to rke2-server
The Pod is pinned to `rke2-server` because the Synology NFS export
`/volume1/kubernetes` only allows that node. Importing to the agents is
optional until the ACL is widened.
```bash
# From BLUEJAY-WS, in D:\git\FlowerCore\FlowerCore.Distribution
TAG="v$(date +%Y%m%d%H%M)"
dotnet.exe publish -c Release -o deploy/app \
src/FlowerCore.Distribution.Web/FlowerCore.Distribution.Web.csproj
podman build -t localhost/fc-distribution:$TAG -f deploy/Dockerfile.deploy deploy
podman save localhost/fc-distribution:$TAG -o /tmp/fc-distribution.tar
scp /tmp/fc-distribution.tar rke2-server:/tmp/
ssh rke2-server "sudo /var/lib/rancher/rke2/bin/ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images import /tmp/fc-distribution.tar"
```
### 4. Bump the image tag + push
Edit `fc-distribution.yaml`, replace `localhost/fc-distribution:v202604231530`
with the tag from step 3, then:
```bash
cd D:/git/FlowerCore/bluejay-infra
python scripts/check-pfsense-dns.py
git add apps/fc-distribution/
git commit -m "feat(fc-distribution): deploy Phase 1 manifest publisher"
git push
```
ArgoCD picks up within ~3 minutes and creates `infra-fc-distribution`.
### 5. Verify
```bash
fcadmin_ssh noc1 '
kubectl -n argocd get application infra-fc-distribution
kubectl -n fc-distribution get certificate,pod,secret
curl -sk -m 8 -o /dev/null -w "HTTP %{http_code}\n" https://dist.iamworkin.lan/healthz
'
```
Expect: Certificate `Ready: True` within ~60s, `/healthz` HTTP 200, both
`edition-kiosk-standard` and `edition-aistation-field` Secrets present
with `certificate.pem`, `private-key.pem`, `chain.pem` keys.

View File

@@ -0,0 +1,355 @@
# FlowerCore.Distribution — edition manifest publisher + content-addressed blob store.
# Phase 1 of the USB provisioning architecture: signed edition manifests
# (ECDSA P-256 over canonical JSON) published per edition, plus a SHA-256
# content-addressed blob store that USB builders pull from.
#
# Architecture: FlowerCore.Notes/docs/infrastructure/usb-provisioning-architecture.md
# Repo: FlowerCore.Distribution/{README.md,CLAUDE.md}
# Shared lib: FlowerCore.Common -> FlowerCore.Shared.Distribution
# (manifest schema, canonical JSON, ECDSA P-256 sign/verify)
#
# Deployment order (see bluejay-infra/README.md and apps/fc-distribution/README.md):
# 1. pfSense Unbound DNS override for dist.iamworkin.lan -> 10.0.56.200
# (DONE 2026-04-23 — verify with `python bluejay-infra/scripts/check-pfsense-dns.py`).
# 2. 1Password items must exist in vault `IAmWorkin`:
# - `FlowerCore Code Signing CA` (informational)
# - `FlowerCore Edition Signing Key - edition:kiosk-standard` (3hf33egdvnni6jyuws3r737mqe)
# - `FlowerCore Edition Signing Key - edition:aistation-field` (ccxrtsan5samfq4pfuczymacrq)
# Each edition item is expected to publish three field labels:
# certificate.pem, private-key.pem, chain.pem
# 3. Synology NFS export `/volume1/kubernetes` is currently restricted to
# rke2-server (10.0.56.11). Pod is pinned via nodeSelector below. The
# app writes to subPaths `distribution/data` and `distribution/blobs`.
# 4. Build + import image: localhost/fc-distribution:v<YYYYMMDD><HHMM>
# Import to rke2-server via `ctr images import` (NFS-pinned, no need
# for the agents until ACL is widened — see guacamole pattern).
# 5. Bump the image tag below and git push; ArgoCD ApplicationSet picks up
# within ~3 minutes and creates `infra-fc-distribution`.
#
# NOTE on the root trust anchor:
# The verifier needs an embedded root CA (`IAmWorkin ACME CA Root CA`).
# That root is shipped INSIDE the published image (Phase 2 build step
# bakes it into the bundle), NOT mounted from a Secret here. The
# `codesigning-root-cert` OnePasswordItem below is informational only —
# it gives operators a quick handle to the CA item from the cluster.
---
apiVersion: v1
kind: Namespace
metadata:
name: fc-distribution
labels:
app.kubernetes.io/part-of: flowercore
---
# Informational handle to the FlowerCore Code Signing CA item in 1Password.
# Not consumed by the pod at runtime — the root trust anchor is baked into
# the published image. Operators can `kubectl -n fc-distribution get secret
# codesigning-root-cert` to discover the CA item URL/admin handle.
apiVersion: onepassword.com/v1
kind: OnePasswordItem
metadata:
name: codesigning-root-cert
namespace: fc-distribution
spec:
itemPath: "vaults/IAmWorkin/items/FlowerCore Code Signing CA"
---
# Edition signing key + leaf cert + chain for edition:kiosk-standard.
# 1Password item id: 3hf33egdvnni6jyuws3r737mqe
# Operator syncs each field to a Secret key of the same name. Mounted
# read-only at /signing/kiosk-standard inside the pod.
apiVersion: onepassword.com/v1
kind: OnePasswordItem
metadata:
name: edition-kiosk-standard
namespace: fc-distribution
spec:
itemPath: "vaults/IAmWorkin/items/FlowerCore Edition Signing Key - edition:kiosk-standard"
---
# Edition signing key + leaf cert + chain for edition:aistation-field.
# 1Password item id: ccxrtsan5samfq4pfuczymacrq
apiVersion: onepassword.com/v1
kind: OnePasswordItem
metadata:
name: edition-aistation-field
namespace: fc-distribution
spec:
itemPath: "vaults/IAmWorkin/items/FlowerCore Edition Signing Key - edition:aistation-field"
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: fc-distribution
namespace: fc-distribution
labels:
app.kubernetes.io/name: fc-distribution
app.kubernetes.io/part-of: flowercore
spec:
replicas: 1
revisionHistoryLimit: 3
strategy:
# NFS-backed SQLite + blob store on a single node. Recreate avoids any
# multi-attach overlap on the same NFS subPath during rollout.
type: Recreate
selector:
matchLabels:
app.kubernetes.io/name: fc-distribution
template:
metadata:
labels:
app.kubernetes.io/name: fc-distribution
app.kubernetes.io/part-of: flowercore
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
# Synology NFS export `/volume1/kubernetes` ACL only allows rke2-server
# (10.0.56.11) right now. Until the ACL is widened in DSM (admin only),
# this Pod must run on rke2-server or NFS mounts will be access-denied.
nodeSelector:
kubernetes.io/hostname: rke2-server
securityContext:
runAsNonRoot: true
fsGroup: 1654
fsGroupChangePolicy: OnRootMismatch
containers:
- name: web
# Placeholder tag — bump to the image you built + imported to
# rke2-server before applying. Build with:
# dotnet.exe publish -c Release -o deploy/app \
# src/FlowerCore.Distribution.Web/FlowerCore.Distribution.Web.csproj
# podman build -t localhost/fc-distribution:v<tag> -f deploy/Dockerfile.deploy deploy
image: localhost/fc-distribution:v202605061948
imagePullPolicy: Never
ports:
- containerPort: 8080
name: http
env:
- name: ASPNETCORE_URLS
value: "http://+:8080"
- name: ASPNETCORE_ENVIRONMENT
value: "Production"
- name: DOTNET_SYSTEM_GLOBALIZATION_INVARIANT
value: "false"
# SQLite connection (catalog + data-protection keys via FlowerCoreDbContext).
# Read by Data/DatabaseProviderExtensions.cs in precedence order; Sqlite key wins.
- name: FlowerCore__Database__Provider
value: "Sqlite"
- name: FlowerCore__Database__ConnectionStrings__Sqlite
value: "Data Source=/data/distribution.db"
# Content-addressed blob root (SHA-256 sharded on disk).
# Bound by Services/NfsPvcBlobProvider.cs under FlowerCore:Distribution:Blobs.
- name: FlowerCore__Distribution__Blobs__Root
value: "/blobs"
# Per-edition signing material — paths into the read-only
# secret mounts below. Field labels in 1Password (and therefore
# Secret key names) are: certificate.pem, private-key.pem, chain.pem
- name: FlowerCore__Distribution__Signing__EditionCerts__kiosk-standard__CertPath
value: "/signing/kiosk-standard/chain.pem"
- name: FlowerCore__Distribution__Signing__EditionCerts__kiosk-standard__KeyPath
value: "/signing/kiosk-standard/private-key.pem"
- name: FlowerCore__Distribution__Signing__EditionCerts__aistation-field__CertPath
value: "/signing/aistation-field/chain.pem"
- name: FlowerCore__Distribution__Signing__EditionCerts__aistation-field__KeyPath
value: "/signing/aistation-field/private-key.pem"
# Public distribution host is GET/HEAD-only at Traefik; this
# entitlement list controls which editions are readable there.
- name: FlowerCore__Distribution__EntitlementPublic__PublicEditions__0
value: "*"
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
# /healthz is exposed by the scaffold (StartupGateMiddleware-aware).
# Liveness uses tcpSocket as a cheap fallback in case a future
# middleware change accidentally gates /healthz behind auth
# (memory: feedback_k8s_probes_behind_auth_middleware).
startupProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 30
readinessProbe:
httpGet:
path: /healthz
port: 8080
periodSeconds: 10
failureThreshold: 3
livenessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 30
periodSeconds: 30
failureThreshold: 3
securityContext:
runAsNonRoot: true
runAsUser: 1654
runAsGroup: 1654
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
volumeMounts:
- name: sqlite
mountPath: /data
subPath: distribution/data
- name: blobs
mountPath: /blobs
subPath: distribution/blobs
- name: tmp
mountPath: /tmp
- name: logs
mountPath: /app/logs
- name: kiosk-standard
mountPath: /signing/kiosk-standard
readOnly: true
- name: aistation-field
mountPath: /signing/aistation-field
readOnly: true
volumes:
# Synology NFS at /volume1/kubernetes — same export pattern as
# apps/guacamole/guacamole.yaml (recordings volume). Pinned by
# ACL to rke2-server. Never mount the subpath as nfs.path —
# always mount the export root and use volumeMount.subPath.
- name: sqlite
nfs:
server: 10.0.58.3
path: /volume1/kubernetes
- name: blobs
nfs:
server: 10.0.58.3
path: /volume1/kubernetes
- name: tmp
emptyDir: {}
- name: logs
emptyDir: {}
- name: kiosk-standard
secret:
secretName: edition-kiosk-standard
defaultMode: 0400
- name: aistation-field
secret:
secretName: edition-aistation-field
defaultMode: 0400
---
apiVersion: v1
kind: Service
metadata:
name: fc-distribution
namespace: fc-distribution
labels:
app.kubernetes.io/name: fc-distribution
app.kubernetes.io/part-of: flowercore
spec:
type: ClusterIP
selector:
app.kubernetes.io/name: fc-distribution
ports:
- name: http
port: 80
targetPort: 8080
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: fc-distribution-tls
namespace: fc-distribution
spec:
secretName: fc-distribution-tls-secret
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- dist.iamworkin.lan
# step-ca ACME caps lifetime at 30d; requesting 90d silently capped
# made renewBefore=cert-lifetime → perpetual renewal loop (10880+ CRs
# in 18h on 2026-05-07). Match working 720h/240h pattern from other
# FC services.
duration: 720h # 30d (step-ca cap)
renewBefore: 240h # 10d
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: fc-distribution
namespace: fc-distribution
spec:
entryPoints:
- websecure
routes:
- match: Host(`dist.iamworkin.lan`)
kind: Rule
services:
- name: fc-distribution
port: 80
tls:
secretName: fc-distribution-tls-secret
---
# === dist.flowercore.io public surface (2026-04-24) =========================
#
# Shares the Deployment + Service + PVC with the internal IngressRoute above.
# The controller's NamedEntitlementResolverRouter picks between the internal
# (permissive) and public (strict) StaticTokenEntitlementResolver based on
# the X-FC-Distribution-Profile header — which the middleware below injects
# on every public-host request after stripping any caller-supplied value.
#
# Cert is the shared Cloudflare Origin Certificate for *.flowercore.io, literal
# bytes copied (matches gitea-public, matrix, telephony, mail, flowercore-landing
# pattern — not yet via OnePasswordItem operator).
---
apiVersion: v1
kind: Secret
metadata:
name: cf-origin-flowercore-io
namespace: fc-distribution
type: kubernetes.io/tls
data:
tls.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUVvRENDQTRpZ0F3SUJBZ0lVSXN4c1NKV1VRL0tqZ09ldk81YnNuVi9rZVE4d0RRWUpLb1pJaHZjTkFRRUwKQlFBd2dZc3hDekFKQmdOVkJBWVRBbFZUTVJrd0Z3WURWUVFLRXhCRGJHOTFaRVpzWVhKbExDQkpibU11TVRRdwpNZ1lEVlFRTEV5dERiRzkxWkVac1lYSmxJRTl5YVdkcGJpQlRVMHdnUTJWeWRHbG1hV05oZEdVZ1FYVjBhRzl5CmFYUjVNUll3RkFZRFZRUUhFdzFUWVc0Z1JuSmhibU5wYzJOdk1STXdFUVlEVlFRSUV3cERZV3hwWm05eWJtbGgKTUI0WERUSTJNRE14TURFMk16TXdNRm9YRFRReE1ETXdOakUyTXpNd01Gb3dZakVaTUJjR0ExVUVDaE1RUTJ4dgpkV1JHYkdGeVpTd2dTVzVqTGpFZE1Cc0dBMVVFQ3hNVVEyeHZkV1JHYkdGeVpTQlBjbWxuYVc0Z1EwRXhKakFrCkJnTlZCQU1USFVOc2IzVmtSbXhoY21VZ1QzSnBaMmx1SUVObGNuUnBabWxqWVhSbE1JSUJJakFOQmdrcWhraUcKOXcwQkFRRUZBQU9DQVE0QU1JSUJDZ0tDQVFFQXV0QmpkQ0xEdHdMQlZCU0Y1ZU1OMkt3ckIxTmZmRVhRMjlRRAo1aVR0dzJFcEZXNVJJSllkMjNrYUpCMU5jZXpHWlg4a0Q0cGEyWHpFZW1MVEtJNWw0MU11b3FoWjczNVE3U3RWCkVjRFFTT2ZYTkZQdFMwb0hqb0pRdGF2QjM0ZmJNR3l4Mmx0MU9HUzRNMGtLUWpBNWR6OTJQYjNyZ1RKR0JhOW4KeTZtVThncjRuUHRSdklxZ3NxdjRtMFA3dVU1YjE3NzU1Y2JLSDVoMzIxWHVjMDU4Tzl4M2JHQ0NuRUJXWDdqeApjRGhkUEs1Ri9XRjVBQnl5cFhIQ0ZxUUd4M1NVbmtCQ0ZQSmRabnMra3BHVUZWZGhud3B6NjBtNnlJSzQ0eVR4CjZqR3JOTFEyM1dOK2gwU1lCZU5vb2JBWThydkpiVlZEaGJqSVhBTWtFNGQzVll1TlhRSURBUUFCbzRJQklqQ0MKQVI0d0RnWURWUjBQQVFIL0JBUURBZ1dnTUIwR0ExVWRKUVFXTUJRR0NDc0dBUVVGQndNQ0JnZ3JCZ0VGQlFjRApBVEFNQmdOVkhSTUJBZjhFQWpBQU1CMEdBMVVkRGdRV0JCUkt1NkJVUDZ0N2dpbFRPay9FdEdKQ3R6N3dTREFmCkJnTlZIU01FR0RBV2dCUWs2Rk5YWFh3MFFJZXA2NVRidXVFV2VQd3BwREJBQmdnckJnRUZCUWNCQVFRME1ESXcKTUFZSUt3WUJCUVVITUFHR0pHaDBkSEE2THk5dlkzTndMbU5zYjNWa1pteGhjbVV1WTI5dEwyOXlhV2RwYmw5agpZVEFqQmdOVkhSRUVIREFhZ2d3cUxtbGhiWGR2Y21zdWFXNkNDbWxoYlhkdmNtc3VhVzR3T0FZRFZSMGZCREV3Ckx6QXRvQ3VnS1lZbmFIUjBjRG92TDJOeWJDNWpiRzkxWkdac1lYSmxMbU52YlM5dmNtbG5hVzVmWTJFdVkzSnMKTUEwR0NTcUdTSWIzRFFFQkN3VUFBNElCQVFDSjMvTGNleE5pb0lWdUxoemhmbTZCeDV2SWk3T25CaHF1WUlDdwplNnArZ0prdE16ZFJQcDV0bk03dllBWmxMajVJOTByWDRuczhJc3dEbzJBN2wwYTRGZVJFclFmRklsZXQzbjIyCjUxVTZYVElCSks5c1FZT0FkU3pJUzV1OUNKSFpBUTF5WmxSd3BBR3RVWnhxL1dpcGFWUTRwNXhrcEJNMVlZSlAKNW1jQ09HcFErSnpORlpQc2daYUJncDBYL1BBZkNJRkkyZld5QWE2elBqRm0rdDVXUXIrZlBaT2VUS2VIbWVzVgo3UlZxUUdEb3Q0eTY1NklEdmdmU2ZLRnFIRW9XNDJVbDBxQ05hMS9keEJld3NIS1VWWE1ETkdiQlNVQjM4TG9YCm1OQ3hJQlVOUjR0TG1CQUxZT3hVMnZhSWRCd0xBc2YrcndnVnVjUGpCUTc2VWMwUQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
tls.key: LS0tLS1CRUdJTiBQUklWQVRFIEtFWS0tLS0tCk1JSUV2UUlCQURBTkJna3Foa2lHOXcwQkFRRUZBQVNDQktjd2dnU2pBZ0VBQW9JQkFRQzYwR04wSXNPM0FzRlUKRklYbDR3M1lyQ3NIYTE5OFJkRGIxQVBtSk8zRFlTa1ZibEVnbGgzYmVSb2tIVTF4N01abGZ5UVBpbHJaZk1SNgpZdE1vam1YalV5NmlxRm52ZmxEdEsxVVJ3TkJJNTljMFUrMUxTZ2VPZ2xDMXE4SGZoOXN3YkxIYVczVTRaTGd6ClNRcENNRGwzUDNZOXZldUJNa1lGcjJmTHFaVHlDdmljKzFHOGlxQ3lxL2liUS91NVRsdlh2dm5seHNvZm1IZmIKVmU1elRudzczSGRzWUlLY1FGWmZ1UEZ3T0YwOHJrWDlZWGtBSExLbGNjSVdwQWJIZEpTZVFFSVU4bDFtZXo2UwprWlFWVjJHZkNuUHJTYnJJZ3JqakpQSHFNYXMwdERiZFkzNkhSSmdGNDJpaHNCanl1OGx0VlVPRnVNaGNBeVFUCmgzZFZpNDFkQWdNQkFBRUNnZ0VBTGlseXZkNmVTcEYvZUxtV2lhTVV4NUxwa2dhWHpITkxCQnNNZUpqcytLL0EKVVdlZ1crTkVUdmlLalZ5QlI5SzRocG1IYldDa2lPUDBBQUwrQnlKQ3lvekNOQmJTSEdRejlwc1R5dzZBV1ZlUwpuYjlVWGx1VmFQRktKTTRqbXNydERuYjVic25WT2lGblErTDdTalkwNlFMUlFybjBvUWp0ZFJldUdBMFlQVU90CkhSYzNsMFg2ZHJqdkJYY2prWTQwWm9ZYkRrelJnU1JWbWVOUGFIbjZPR0NtYUVUMXVyK01qYVZ2ME9lbEdIWncKVzljSEIxaHNxRzUvMWU3V0RQN0l0cjkwTmg4ay81NVhiK3lQUnhsRFd5bWtZMzIvdFBtZzdESTRKV2tRRWt3cgpIZUtwODVTcE5ta1liRnVpVFppeU8zZDZ0aXZHNHhFZW8rSzFVVFU4c1FLQmdRRFRNSEU1RDFYVC9HbGR5VHNsCllrODRVL1N0NXUrK2RIUEt1Wmw2dVB0UGgxV1lrdnFRcmdrL05YanVud2xGN0Y3b2tWOGdPeWxreTYwYTZkcXIKeXZwN1ZJdXYzekVlc2h2NjNWMlpaVkMzcXZYSzFheit3Zmx3NitCZmVuRlY5S2NENHN0dTdwOFRPWmFGN01CUgo3YXZzaXVXbWtqdmM1TlVLRmVDRTY0SnZFUUtCZ1FEaWMrbWlNLzBodDN1ajhuOXgyMDFQZFNqbEpVaUc1NjNNCnRYZlBCdDJRT0NhaVluUFNFdTdXdm5pQWRFL2xrMm91cFRWam9LYmZPbDFyQjd6UzVhc2kxdVdDZDhlUy9UWGIKdU5iRmlNMDB4L3JxalMydCtQbTd4MVhrYTB4TFNSRDNmZ0tSQldSN3pscStkYWZ1WE1qelUxRnh5dTIycGphRgpIMEl3NEpCUmpRS0JnUUNOaWhMb0Rob1V5RCtKNXJzb00vb3FJMEtDWnB0WlJzendHbkg5cVFwdFk2Ti9iVXBYCk92emhpeUh3czAvUXVEbG5uejVrNktHMmR6Y2VLWXN2eGdzWUt6S3ZmV043VWgya2hVWWM3NlVvWTREMkh6MGgKUkxtNzc2cGg4enNRUTdiSHlQRlUrTUpPYlRNdnNOdTRUUlVEcEplRGl0QnFIRWVYeWMrKzVlUjJNUUtCZ0h2UgptVHVoWlpVYitEVEtrVGkyQ20yWnlBU1RBRGNUVW9xTjVyYUNNSDk4MUZNUnRmWjFkN1pmYXhBQmlQWWtSbmkrCnlKUnk4UXM1cEg2ek9tR3VSb2JFTGJYS3ZJcjRmSXhwWXJXYmVXaVV0L09yd2dCUUZHekNMNHEzeUgyWnMvYy8KSlRRYVdMa0JPY2pPR0VaUzRXVjZkeHZiTTJNZE9zNUxLeXdDZmFhNUFvR0FIQUE1eEN0dndOZE4xeExndkZ3RApPK2lyMDl1bXMxOFBzSVpmK1ZrWGtpcHF4MWNUT0hEanpPR01yWXV0M2FFeE00Zjd2ckFHRFMyY2pwZjM0T1JxCit4Y2gwWlNaQ2FDZmlnZG9OelNkcDFLcmo0cnFKdG5ZdS9CNDlDQlVoSDBNaCtSRWswQ0hHOVE4b3FOWFk0V0wKbVVOVTZMYUkwQWtvSzNVb2tWQVJEYXM9Ci0tLS0tRU5EIFBSSVZBVEUgS0VZLS0tLS0K
---
# Traefik middleware: strips any caller-supplied X-FC-Distribution-Profile,
# then sets an authoritative 'public' value so the controller routes to the
# strict entitlement resolver. The trust boundary is this middleware — the
# internal IngressRoute (dist.iamworkin.lan) does NOT attach it.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
name: dist-public-profile-header
namespace: fc-distribution
spec:
headers:
customRequestHeaders:
X-FC-Distribution-Profile: "public"
---
# Public IngressRoute: binds dist.flowercore.io (Cloudflare-proxied A record
# -> pfSense NAT -> Traefik VIP 10.0.56.200) to the same backend Service that
# serves dist.iamworkin.lan. Header-injection middleware ensures the
# controller uses the public (strict) entitlement resolver.
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: fc-distribution-public
namespace: fc-distribution
spec:
entryPoints:
- websecure
routes:
# Method allowlist: Host + (GET || HEAD). Anything else misses every
# route and Traefik returns 404 before reaching the pod — edge-level
# defense-in-depth over the controller's strict-mode entitlement check.
# Together these block admin ops (POST /blobs, POST /manifests*) from
# ever being processed on the public surface.
- match: Host(`dist.flowercore.io`) && (Method(`GET`) || Method(`HEAD`))
kind: Rule
middlewares:
- name: dist-public-profile-header
services:
- name: fc-distribution
port: 80
tls:
secretName: cf-origin-flowercore-io

View File

@@ -0,0 +1,9 @@
# ArgoCD's bluejay-infra ApplicationSet uses a directory generator and does
# not require kustomization.yaml (existing apps like fc-llm-bridge and
# guacamole have none). This file is included anyway as a single source of
# truth for the resource list and to make `kubectl kustomize` previews work
# from a working copy.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- fc-distribution.yaml

32
apps/fc-dms/fc-dms.yaml Normal file
View File

@@ -0,0 +1,32 @@
# FlowerCore DMS — TLS + Ingress
# Deployment and Service managed by deploy script (not ArgoCD)
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: dms-web-tls
namespace: fc-dms
spec:
secretName: dms-web-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- dms.iamworkin.lan
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: dms-web
namespace: fc-dms
spec:
entryPoints:
- websecure
routes:
- match: Host(`dms.iamworkin.lan`)
kind: Rule
services:
- name: dms-web
port: 80
tls:
secretName: dms-web-tls

View File

@@ -256,6 +256,20 @@ spec:
targetPort: 80
name: http
---
# TLS Certificate for internal LAN access
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: fc-landing-tls
namespace: fc-system
spec:
secretName: fc-landing-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- flowercore.iamworkin.lan
---
# Internal IngressRoute (LAN access)
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
@@ -271,7 +285,8 @@ spec:
services:
- name: fc-landing
port: 80
tls: {}
tls:
secretName: fc-landing-tls
---
# Public IngressRoute (flowercore.io with Cloudflare origin cert)
apiVersion: traefik.io/v1alpha1

View File

@@ -0,0 +1,174 @@
# fc-llm-bridge — staged deployment (ADR-088)
**Status:** manifests staged, **NOT YET APPLIED**. Do not `git push` or sync
ArgoCD until the two pre-requisites below are done, in order.
Design: [`../../../FlowerCore.Notes/docs/ai-agents/agent-zero-anthropic-bridge.md`](../../../FlowerCore.Notes/docs/ai-agents/agent-zero-anthropic-bridge.md)
ADR: ADR-088 in [`../../../FlowerCore.Notes/ARCHITECTURE.md`](../../../FlowerCore.Notes/ARCHITECTURE.md)
## Deployment order (do NOT skip / reorder)
### 1. FlowerCore.DNS preflight — REQUIRED FIRST
`fc-llm-bridge.iamworkin.lan` must keep resolving to `10.0.56.200` through
FlowerCore.DNS before this manifest is applied.
step-ca (the ACME CA on noc1) uses pfSense Unbound (10.0.56.1), **not**
cluster CoreDNS. If you apply this manifest before adding the DNS override,
cert-manager's HTTP-01 challenge silently fails for ~2h (exponential backoff)
until someone manually runs `kubectl -n fc-llm-bridge delete order <order>`
to bust the cache. See memory `feedback_pfsense_dns_required_for_acme.md`.
Verify the record through the public preflight API:
```bash
curl -sk "https://dns.iamworkin.lan/api/v1/zones/iamworkin.lan/resolve-preflight?hostname=fc-llm-bridge.iamworkin.lan"
# Expect: "resolvable": true
```
Verify:
```bash
python scripts/check-pfsense-dns.py
# Historical filename retained; implementation now calls FlowerCore.DNS
# resolve-preflight instead of raw resolver lookups.
```
If the record is missing, recreate it through FlowerCore.DNS before pushing:
```bash
curl -sk https://dns.iamworkin.lan/api/v1/servers
curl -sk -X POST https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iamworkin.lan/records \
-H "Content-Type: application/json" \
-d '{"name":"fc-llm-bridge","type":"A","data":"10.0.56.200","ttl":300}'
```
### 2. Create the `FC LLM Bridge API Keys` 1Password item
The `Claude API Key` item in vault `IAmWorkin` already exists (id
`e5tth3y5mp3lhdavg35pxadzca`, see `docs/ai-agents/anthropic-integration.md`).
The new item for per-consumer bridge API keys does NOT yet exist. Create it
before the first apply of this manifest — the Deployment marks the individual
key env vars `optional: true` so missing keys will not crash the pod, but the
bridge will reject every request with 401 until at least one key is populated.
| Field | Item position | Type | Purpose |
|-------|---------------|------|---------|
| `credential` | Top section | Password (random, 48 char) | Unused placeholder required by the 1Password schema for single-field items. Can be anything — this file is never read by K8s. |
| `agent-zero-ws` | "API Keys" section | Password (random, 48 char) | API key for the BLUEJAY-WS Agent Zero instance. |
| `agent-zero-k8s` | "API Keys" section | Password (random, 48 char) | API key for the K8s-hosted `agent-zero` Deployment. |
| `spare-1` | "API Keys" section | Password (random, 48 char) | Reserve for future Agent Zero forks / smoke-test scripts. |
| `spare-2` | "API Keys" section | Password (random, 48 char) | Reserve. |
Steps via the CLI (run from a machine with `op` signed in):
```bash
op item create \
--category="API Credential" \
--title="FC LLM Bridge API Keys" \
--vault="IAmWorkin" \
"API Keys.agent-zero-ws[password]=$(openssl rand -hex 24)" \
"API Keys.agent-zero-k8s[password]=$(openssl rand -hex 24)" \
"API Keys.spare-1[password]=$(openssl rand -hex 24)" \
"API Keys.spare-2[password]=$(openssl rand -hex 24)"
```
OR via the 1Password GUI — create a new item titled exactly `FC LLM Bridge API
Keys` in the `IAmWorkin` vault, add an `API Keys` section, add four password
fields named `agent-zero-ws`, `agent-zero-k8s`, `spare-1`, `spare-2` with
`openssl rand -hex 24` values.
**Mapping to K8s:** The 1Password Connect operator syncs each field to a
Secret key of the same name. The Deployment's env vars
(`FlowerCore__LlmBridge__ApiKeys__agent-zero-ws` etc) reference those Secret
keys. In `FlowerCore.Shared.Api.Authentication.ApiKeyAuthMiddleware`, the key
name (e.g. `agent-zero-k8s`) becomes the `fc.app` claim on the
`ClaimsPrincipal`, which is what `IBudgetLedger` uses to scope spend per
consumer.
### 3. Build + import the image to every RKE2 node
```bash
# From BLUEJAY-WS, in D:\git\FlowerCore\FlowerCore.LlmBridge
TAG="v$(date +%Y%m%d%H%M%S)"
dotnet.exe publish -c Release -o deploy/app \
src/FlowerCore.LlmBridge.Web/FlowerCore.LlmBridge.Web.csproj
podman build -t localhost/fc-llm-bridge:$TAG -f deploy/Dockerfile.deploy deploy
podman save localhost/fc-llm-bridge:$TAG -o /tmp/fc-llm-bridge.tar
# SCP to each node and ctr import
for NODE in rke2-server rke2-agent1 rke2-agent2; do
scp /tmp/fc-llm-bridge.tar $NODE:/tmp/
ssh $NODE "sudo /var/lib/rancher/rke2/bin/ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images import /tmp/fc-llm-bridge.tar"
done
```
### 4. Bump the image tag in the manifest
Edit `fc-llm-bridge.yaml`, replace `localhost/fc-llm-bridge:v00000000000000`
with the tag from step 3.
### 5. Commit + push
```bash
cd D:/git/FlowerCore/bluejay-infra
# re-run the DNS gate
python scripts/check-pfsense-dns.py
git add apps/fc-llm-bridge/
git commit -m "feat(fc-llm-bridge): deploy ADR-088 Agent Zero bridge"
git push
```
ArgoCD picks up within ~3 minutes and creates `infra-fc-llm-bridge`.
### 6. Verify
```bash
# From noc1
fcadmin_ssh noc1 '
kubectl -n argocd get application infra-fc-llm-bridge
kubectl -n fc-llm-bridge get certificate,pod
curl -sk -m 8 -o /dev/null -w "HTTP %{http_code}\n" https://fc-llm-bridge.iamworkin.lan/healthz
'
```
Expect: Certificate `Ready: True` within ~60s, `/healthz` HTTP 200.
### 7. Flip Agent Zero to the bridge
After the bridge passes a real chat smoke test, update the Agent Zero
ConfigMap (`apps/agent-zero/agent-zero.yaml`) to route through the bridge:
- `A0_SET_chat_model_api_base` / `config.json > chat_model.api_base`
-> `https://fc-llm-bridge.iamworkin.lan/v1`
- Add an `A0_SET_chat_model_api_key` env var wired to a K8s Secret sourced
from `FC LLM Bridge API Keys` field `agent-zero-k8s`.
- Set `chat_model.name` to `fc:balanced` (or a concrete model) — the bridge
accepts both tier aliases and concrete model names.
Do the same for BLUEJAY-WS Agent Zero (`agent-zero-ws` key), or keep the
workstation on direct Ollama and only route Anthropic calls through the
bridge (the design doc describes this split as the preferred approach).
## Current state at staging time (2026-04-23)
- `fc-llm-bridge.iamworkin.lan` — public FlowerCore.DNS preflight is now
green and resolves to `10.0.56.200`; keep `python scripts/check-pfsense-dns.py`
green before push.
- `FC LLM Bridge API Keys` — NOT created in 1Password (user action).
- `Claude API Key` — already exists in `IAmWorkin` vault
(`e5tth3y5mp3lhdavg35pxadzca`), also consumed by AiStation and Chat.Web.
- `localhost/fc-llm-bridge:v*` image — not yet built; `FlowerCore.LlmBridge`
repo has local commit `6d285b5` only, no remote.
- ArgoCD `infra-fc-llm-bridge` Application — will be auto-created by the
`bluejay-infra` ApplicationSet once the directory is on `main`.
## Why tcpSocket probes (not `/healthz`)
The bridge runs `ApiKeyAuthMiddleware`. `/healthz` and `/health` are exempt
via `FlowerCore:LlmBridge:AuthExemptPaths`, so an HTTP probe would work
today. But a future change to the middleware registration order could
silently turn kubelet probes into 401/404, which crashes pods on every
deploy. `tcpSocket` keeps probes robust against that regression. Memory:
`feedback_k8s_probes_behind_auth_middleware.md`.

View File

@@ -0,0 +1,283 @@
# FlowerCore.LlmBridge — OpenAI-compatible bridge for Agent Zero.
# Routes through FlowerCore.Shared.Chat (ILlmProviderClient) with budget
# enforcement, response caching, and tier-based model routing. Lets Agent
# Zero (Python) reach Anthropic and Ollama providers without re-implementing
# the C# budget/cache/router primitives.
#
# Design: FlowerCore.Notes/docs/ai-agents/agent-zero-anthropic-bridge.md
# ADR: FlowerCore.Notes/ARCHITECTURE.md (ADR-088)
#
# Deployment order (see bluejay-infra/README.md):
# 1. pfSense DNS override for fc-llm-bridge.iamworkin.lan -> 10.0.56.200
# (REQUIRED before this is applied — cert-manager HTTP-01 will silently
# fail for ~2h backoff otherwise). Run scripts/pfsense-add-dns-overrides.py.
# 2. 1Password items `Claude API Key` (already exists) and
# `FC LLM Bridge API Keys` (create when first non-dev environment comes up).
# 3. Build + import image: localhost/fc-llm-bridge:v<YYYYMMDD><HHMM>
# Import to rke2-server, rke2-agent1, rke2-agent2 via ctr images import.
# 4. Bump the image tag below and git push; ArgoCD ApplicationSet picks up.
# 5. Flip Agent Zero chat.openai.base_url to https://fc-llm-bridge.iamworkin.lan/v1
# and api_key to the op://IAmWorkin/FC LLM Bridge API Keys/agent-zero-k8s value.
---
apiVersion: v1
kind: Namespace
metadata:
name: fc-llm-bridge
labels:
app.kubernetes.io/part-of: flowercore
---
# Claude (Anthropic) API key — shared across FC services.
# Existing 1Password item. `credential` field -> Secret `anthropic-api-key`.
apiVersion: onepassword.com/v1
kind: OnePasswordItem
metadata:
name: anthropic-api-key
namespace: fc-llm-bridge
spec:
itemPath: "vaults/IAmWorkin/items/Claude API Key"
---
# Per-consumer API keys for the bridge itself.
# NEW 1Password item — see apps/fc-llm-bridge/README.md for the field layout
# to create before first apply. Fields become Secret keys of the same name:
# agent-zero-ws, agent-zero-k8s, spare-1, spare-2
apiVersion: onepassword.com/v1
kind: OnePasswordItem
metadata:
name: fc-llm-bridge-api-keys
namespace: fc-llm-bridge
spec:
itemPath: "vaults/IAmWorkin/items/FC LLM Bridge API Keys"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: fc-llm-bridge-data
namespace: fc-llm-bridge
spec:
accessModes:
- ReadWriteOnce
storageClassName: longhorn
resources:
requests:
storage: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: fc-llm-bridge
namespace: fc-llm-bridge
labels:
app.kubernetes.io/name: fc-llm-bridge
app.kubernetes.io/part-of: flowercore
spec:
replicas: 1
revisionHistoryLimit: 3
strategy:
type: Recreate
selector:
matchLabels:
app.kubernetes.io/name: fc-llm-bridge
template:
metadata:
labels:
app.kubernetes.io/name: fc-llm-bridge
app.kubernetes.io/part-of: flowercore
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
# Use an explicit DNS policy so external FQDNs like api.anthropic.com are
# resolved directly instead of being expanded through the cluster search
# path that includes iamworkin.lan.
dnsPolicy: None
dnsConfig:
nameservers:
- 10.43.0.10
searches:
- fc-llm-bridge.svc.cluster.local
- svc.cluster.local
- cluster.local
options:
- name: ndots
value: "2"
securityContext:
fsGroup: 1654
fsGroupChangePolicy: OnRootMismatch
containers:
- name: web
# Placeholder tag — bump to the image you built + imported to every
# RKE2 node before applying. Build with:
# dotnet.exe publish -c Release -o deploy/app \
# src/FlowerCore.LlmBridge.Web/FlowerCore.LlmBridge.Web.csproj
# podman build -t localhost/fc-llm-bridge:v<tag> -f deploy/Dockerfile.deploy deploy
image: localhost/fc-llm-bridge:v202604300022
imagePullPolicy: Never
ports:
- containerPort: 8080
name: http
env:
- name: ASPNETCORE_URLS
value: "http://+:8080"
- name: ASPNETCORE_ENVIRONMENT
value: "Production"
- name: DOTNET_SYSTEM_GLOBALIZATION_INVARIANT
value: "false"
# SQLite (budget ledger + response cache + data-protection keys)
- name: FlowerCore__LlmBridge__SqliteConnectionString
value: "Data Source=/data/llm-bridge.db"
- name: FlowerCore__LlmBridge__DefaultTenantId
value: "default"
- name: FlowerCore__LlmBridge__DefaultAppName
value: "agent-zero"
- name: FlowerCore__LlmBridge__UtilModel
value: "qwen2.5:1.5b"
- name: FlowerCore__LlmBridge__EmbedModel
value: "nomic-embed-text"
# Per-consumer API keys — from OnePasswordItem fc-llm-bridge-api-keys.
# Each field becomes a Secret key of the same name. The key-name
# lands in the auth principal's `fc.app` claim for ledger scoping.
- name: FlowerCore__LlmBridge__ApiKeys__agent-zero-ws
valueFrom:
secretKeyRef:
name: fc-llm-bridge-api-keys
key: agent-zero-ws
optional: true
- name: FlowerCore__LlmBridge__ApiKeys__agent-zero-k8s
valueFrom:
secretKeyRef:
name: fc-llm-bridge-api-keys
key: agent-zero-k8s
optional: true
- name: FlowerCore__LlmBridge__ApiKeys__spare-1
valueFrom:
secretKeyRef:
name: fc-llm-bridge-api-keys
key: spare-1
optional: true
- name: FlowerCore__LlmBridge__ApiKeys__spare-2
valueFrom:
secretKeyRef:
name: fc-llm-bridge-api-keys
key: spare-2
optional: true
# Shared.Chat — Ollama (edge1 Pi 5 + AI HAT+, matches bridge default)
- name: FlowerCore__Chat__OllamaBaseUrl
value: "http://10.0.57.17:11434"
- name: FlowerCore__Chat__HttpTimeout
value: "00:05:00"
# Shared.Chat — Anthropic
- name: FlowerCore__Chat__Anthropic__Enabled
value: "true"
- name: FlowerCore__Chat__Anthropic__ApiKey
valueFrom:
secretKeyRef:
name: anthropic-api-key
key: password
- name: FlowerCore__Chat__Anthropic__OrganizationId
valueFrom:
secretKeyRef:
name: anthropic-api-key
key: organization_id
optional: true
- name: FlowerCore__Chat__Anthropic__BaseUrl
value: "https://api.anthropic.com"
- name: FlowerCore__Chat__Anthropic__DefaultModel
value: "claude-sonnet-4-6"
- name: FlowerCore__Chat__Anthropic__AnthropicVersion
value: "2023-06-01"
- name: FlowerCore__Chat__Anthropic__Timeout
value: "00:05:00"
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 1000m
memory: 768Mi
volumeMounts:
- name: data
mountPath: /data
- name: tmp
mountPath: /tmp
- name: app-data
mountPath: /app/data
securityContext:
runAsNonRoot: true
runAsUser: 1654
runAsGroup: 1654
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
# tcpSocket probes: the app runs ApiKeyAuthMiddleware. /healthz is
# registered as anonymous via AuthExemptPaths but tcpSocket avoids any
# future accidental middleware ordering regression
# (memory: feedback_k8s_probes_behind_auth_middleware).
readinessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 15
periodSeconds: 30
volumes:
- name: data
persistentVolumeClaim:
claimName: fc-llm-bridge-data
- name: tmp
emptyDir: {}
# The Dockerfile `WORKDIR /app` pairs with the default
# SqliteConnectionString "Data Source=data/llm-bridge.db" (relative).
# The env var above overrides to /data, so /app/data can be emptyDir.
- name: app-data
emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
name: fc-llm-bridge
namespace: fc-llm-bridge
spec:
selector:
app.kubernetes.io/name: fc-llm-bridge
ports:
- port: 8080
targetPort: 8080
name: http
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: fc-llm-bridge-cert
namespace: fc-llm-bridge
spec:
secretName: fc-llm-bridge-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- fc-llm-bridge.iamworkin.lan
duration: 720h
renewBefore: 240h
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: fc-llm-bridge
namespace: fc-llm-bridge
spec:
entryPoints:
- websecure
routes:
- match: Host(`fc-llm-bridge.iamworkin.lan`)
kind: Rule
services:
- name: fc-llm-bridge
port: 8080
tls:
secretName: fc-llm-bridge-tls

View File

@@ -0,0 +1,32 @@
# FlowerCore MenuBoard — TLS + Ingress
# Deployment and Service managed by deploy script (not ArgoCD)
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: menuboard-web-tls
namespace: fc-menuboard
spec:
secretName: menuboard-web-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- menuboard.iamworkin.lan
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: menuboard-web
namespace: fc-menuboard
spec:
entryPoints:
- websecure
routes:
- match: Host(`menuboard.iamworkin.lan`)
kind: Rule
services:
- name: menuboard-web
port: 80
tls:
secretName: menuboard-web-tls

View File

@@ -0,0 +1,143 @@
# FlowerCore MessageBoard — Message board service
---
apiVersion: v1
kind: Namespace
metadata:
name: fc-messageboard
labels:
app.kubernetes.io/part-of: bluejay-infra
---
apiVersion: v1
kind: ConfigMap
metadata:
name: messageboard-web-config
namespace: fc-messageboard
data:
ASPNETCORE_ENVIRONMENT: Production
ASPNETCORE_URLS: http://+:8080
ASPNETCORE_FORWARDEDHEADERS_ENABLED: "true"
Security__AllowedOrigins__0: https://messageboard.iamworkin.lan
FlowerCore__Database__ConnectionStrings__Sqlite: Data Source=/data/messageboard.db
OTEL_SERVICE_NAME: FlowerCore.MessageBoard
OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector.monitoring.svc.cluster.local:4317
OTEL_EXPORTER_OTLP_PROTOCOL: grpc
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: messageboard-web
namespace: fc-messageboard
labels:
app: messageboard-web
spec:
replicas: 1
strategy:
type: Recreate
selector:
matchLabels:
app: messageboard-web
template:
metadata:
labels:
app: messageboard-web
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics/prometheus"
spec:
containers:
- name: messageboard-web
image: localhost/fc-messageboard-web:latest
imagePullPolicy: Never
ports:
- containerPort: 8080
name: http
envFrom:
- configMapRef:
name: messageboard-web-config
- secretRef:
name: messageboard-web-secrets
optional: true
volumeMounts:
- name: data
mountPath: /data
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 10
periodSeconds: 30
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 6
volumes:
- name: data
persistentVolumeClaim:
claimName: messageboard-web-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: messageboard-web-data
namespace: fc-messageboard
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
name: messageboard-web
namespace: fc-messageboard
spec:
selector:
app: messageboard-web
ports:
- port: 80
targetPort: 8080
name: http
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: messageboard-web-tls
namespace: fc-messageboard
spec:
secretName: messageboard-web-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- messageboard.iamworkin.lan
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: messageboard-web
namespace: fc-messageboard
spec:
entryPoints:
- websecure
routes:
- match: Host(`messageboard.iamworkin.lan`)
kind: Rule
services:
- name: messageboard-web
port: 80
tls:
secretName: messageboard-web-tls

View File

@@ -0,0 +1,32 @@
# FlowerCore MySQL Manager — TLS + Ingress
# Deployment and Service managed by deploy script (not ArgoCD)
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: mysql-web-tls
namespace: fc-mysql
spec:
secretName: mysql-web-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- mysql.iamworkin.lan
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: mysql-web
namespace: fc-mysql
spec:
entryPoints:
- websecure
routes:
- match: Host(`mysql.iamworkin.lan`)
kind: Rule
services:
- name: mysql-web
port: 5300
tls:
secretName: mysql-web-tls

32
apps/fc-php/fc-php.yaml Normal file
View File

@@ -0,0 +1,32 @@
# FlowerCore PHP Manager — TLS + Ingress
# Deployment and Service managed by deploy script (not ArgoCD)
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: php-web-tls
namespace: fc-php
spec:
secretName: php-web-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- php.iamworkin.lan
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: php-web
namespace: fc-php
spec:
entryPoints:
- websecure
routes:
- match: Host(`php.iamworkin.lan`)
kind: Rule
services:
- name: php-web
port: 5400
tls:
secretName: php-web-tls

View File

@@ -0,0 +1,32 @@
# FlowerCore Presentations — TLS + Ingress
# Deployment and Service managed by deploy script (not ArgoCD)
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: presentations-web-tls
namespace: fc-presentations
spec:
secretName: presentations-web-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- presentations.iamworkin.lan
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: presentations-web
namespace: fc-presentations
spec:
entryPoints:
- websecure
routes:
- match: Host(`presentations.iamworkin.lan`)
kind: Rule
services:
- name: presentations-web
port: 80
tls:
secretName: presentations-web-tls

171
apps/fc-redis/fc-redis.yaml Normal file
View File

@@ -0,0 +1,171 @@
# fc-redis — SignalR backplane for cross-product event bus
#
# Lands per Q-SO-1 resolution (2026-05-11 PM): SignalR backplane in Phase A,
# not Phase C as originally drafted. Operator directive: "Redis can be
# deployed just fine as it's another FlowerCore technology we'll want to
# manage."
#
# Phase A scope (this file):
# - Single Redis 7.x Alpine pod
# - 1Gi Longhorn RWO PVC for AOF persistence
# - ClusterIP Service at `redis.fc-redis.svc.cluster.local:6379`
# - No AUTH (in-cluster only; not exposed externally)
# - No IngressRoute (backplane is server-to-server only)
#
# Consumers (Phase A IMPL across FC services):
# - FlowerCore.Signage.Web (OpsConsoleHub)
# - FlowerCore.Scoreboard.Web (ScoreboardHub)
# - FlowerCore.SignalControl.Web
# - FlowerCore.DMS.Web
# - Any other product joining the cross-product event bus
#
# Each consumer adds:
# services.AddSignalR()
# .AddStackExchangeRedis(
# "redis.fc-redis.svc.cluster.local:6379",
# opts => opts.Configuration.ChannelPrefix =
# StackExchange.Redis.RedisChannel.Literal("fc-opsconsole"));
#
# Phase B / C follow-ons (out of scope here):
# - Redis Sentinel for HA (3-node)
# - AUTH password from 1Password Connect (rotate via /rotate-password)
# - redis_exporter sidecar for Prometheus scrape
# - Network policies restricting which namespaces can dial 6379
#
# Design: docs/signage/operations-console-phase-2-design.md §3.5
# Decision: Q-SO-1 (RESOLVED 2026-05-11 PM)
# Memory: feedback_blooming_ui_pattern_no_iframes
---
apiVersion: v1
kind: Namespace
metadata:
name: fc-redis
labels:
app.kubernetes.io/part-of: flowercore
app.kubernetes.io/managed-by: argocd
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: fc-redis-data
namespace: fc-redis
spec:
accessModes:
- ReadWriteOnce
storageClassName: longhorn
resources:
requests:
storage: 1Gi
---
apiVersion: v1
kind: ConfigMap
metadata:
name: fc-redis-config
namespace: fc-redis
data:
redis.conf: |
# Phase A — minimal config; no AUTH, no replication.
bind 0.0.0.0
protected-mode no
port 6379
tcp-backlog 511
timeout 0
tcp-keepalive 300
# Persistence: AOF (fsync every second is the standard SignalR-backplane
# durability sweet spot — the backplane only needs to survive Redis
# restarts, not absolute zero loss).
appendonly yes
appendfsync everysec
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
# Reasonable defaults — let Redis pick most things.
maxmemory-policy allkeys-lru
maxmemory 256mb
# Logging
loglevel notice
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: fc-redis
namespace: fc-redis
labels:
app: fc-redis
spec:
replicas: 1
strategy:
type: Recreate # RWO PVC; do not do rolling update
selector:
matchLabels:
app: fc-redis
template:
metadata:
labels:
app: fc-redis
spec:
securityContext:
runAsNonRoot: true
runAsUser: 999 # redis:7-alpine default uid
runAsGroup: 999
fsGroup: 999
containers:
- name: redis
image: redis:7-alpine
imagePullPolicy: IfNotPresent
command: ["redis-server", "/etc/redis/redis.conf"]
ports:
- name: redis
containerPort: 6379
resources:
requests:
cpu: "50m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "384Mi"
volumeMounts:
- name: data
mountPath: /data
- name: config
mountPath: /etc/redis
readOnly: true
livenessProbe:
tcpSocket:
port: 6379
initialDelaySeconds: 5
periodSeconds: 10
readinessProbe:
exec:
command: ["redis-cli", "ping"]
initialDelaySeconds: 2
periodSeconds: 5
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: [ALL]
volumes:
- name: data
persistentVolumeClaim:
claimName: fc-redis-data
- name: config
configMap:
name: fc-redis-config
---
apiVersion: v1
kind: Service
metadata:
name: redis
namespace: fc-redis
spec:
type: ClusterIP
selector:
app: fc-redis
ports:
- name: redis
port: 6379
targetPort: 6379
protocol: TCP

View File

@@ -0,0 +1,32 @@
# FlowerCore Scoreboard — TLS + Ingress
# Deployment and Service managed by deploy script (not ArgoCD)
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: scoreboard-web-tls
namespace: fc-scoreboard
spec:
secretName: scoreboard-web-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- scoreboard.iamworkin.lan
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: scoreboard-web
namespace: fc-scoreboard
spec:
entryPoints:
- websecure
routes:
- match: Host(`scoreboard.iamworkin.lan`)
kind: Rule
services:
- name: scoreboard-web
port: 80
tls:
secretName: scoreboard-web-tls

View File

@@ -0,0 +1,39 @@
# FlowerCore SegmentDisplay — TLS + Ingress
# Deployment and Service managed by deploy script (not ArgoCD)
---
apiVersion: v1
kind: Namespace
metadata:
name: fc-segmentdisplay
labels:
app.kubernetes.io/part-of: flowercore
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: segmentdisplay-web-tls
namespace: fc-segmentdisplay
spec:
secretName: segmentdisplay-web-tls
issuerRef:
name: step-ca-dns01
kind: ClusterIssuer
dnsNames:
- segmentdisplay.iamworkin.lan
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: segmentdisplay-web
namespace: fc-segmentdisplay
spec:
entryPoints:
- websecure
routes:
- match: Host(`segmentdisplay.iamworkin.lan`)
kind: Rule
services:
- name: segmentdisplay-web
port: 80
tls:
secretName: segmentdisplay-web-tls

View File

@@ -0,0 +1,14 @@
# fc-signage-appletv
Apple TV signage is a sealed appliance running the `FlowerCore.Signage.Agent.AppleTv` tvOS app per ADR-134.
This ApplicationSet entry is documentation and inventory metadata only. It intentionally creates no `Deployment`, `Service`, or `Pod`.
The Apple TV app connects outbound to existing FC.Signage.Web surfaces:
- `https://signage.iamworkin.lan/hub/signage` for SignalR live status.
- `GET /api/v1/nodes/{nodeId}/state` for the 30 second polling fallback.
- `POST /api/v1/nodes/register` and `POST /api/v1/nodes/{nodeId}/enroll` for pairing and mTLS enrollment.
- `POST /api/v1/nodes/{nodeId}/heartbeat` for metrics, current content identity, and local audit excerpts.
Distribution is via Apple Developer Enterprise Program or TestFlight plus FC.Distribution / UpdateCenter publishing once Apple credentials are available.

View File

@@ -0,0 +1,5 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- manifest.yaml

View File

@@ -0,0 +1,26 @@
# Apple TV signage is a sealed tvOS appliance. This ArgoCD app intentionally
# carries documentation metadata only; no Deployment, Service, or Pod resources
# are created for the player.
---
apiVersion: v1
kind: ConfigMap
metadata:
name: fc-signage-appletv-docs
namespace: fc-signage
labels:
app.kubernetes.io/name: fc-signage-appletv
app.kubernetes.io/part-of: flowercore-signage
flowercore.io/manifest-kind: docs-only
data:
README: |
FlowerCore.Signage.Agent.AppleTv is distributed through Apple Developer
Enterprise Program or TestFlight, not Kubernetes.
The app connects outbound to FC.Signage.Web:
- SignalR: https://signage.iamworkin.lan/hub/signage
- Polling fallback: GET /api/v1/nodes/{nodeId}/state
- Enrollment: POST /api/v1/nodes/{nodeId}/enroll
- Heartbeat: POST /api/v1/nodes/{nodeId}/heartbeat
This placeholder gives ArgoCD and inventory dashboards a first-class
Apple TV signage app entry without creating runtime pods.

View File

@@ -0,0 +1,17 @@
# FlowerCore Signage Pi Player
Phase 1 Raspberry Pi signage player packaging for Chromium kiosk deployments.
This bundle is intentionally air-gap friendly: systemd units, shell scripts,
udev rules, and Chromium managed policy are all checked into the repo and are
installed by `FlowerCore.Puppet`.
## Scope
- Bootstrap a stable node identity and mTLS client certificate.
- Launch Chromium in kiosk mode against `FC.Signage.Web` player routes.
- Restart the kiosk on HDMI hotplug.
- Renew mTLS certificates daily when fewer than 30 days remain.
- Detect display capabilities at boot, daily, and on HDMI hotplug.
Phase 2 native Avalonia rendering is documented separately in Notes and remains
deferred.

View File

@@ -0,0 +1,15 @@
{
"AutofillAddressEnabled": false,
"AutofillCreditCardEnabled": false,
"PasswordManagerEnabled": false,
"BrowserSignin": 0,
"MetricsReportingEnabled": false,
"SafeBrowsingProtectionLevel": 0,
"DefaultNotificationsSetting": 2,
"DefaultPopupsSetting": 2,
"BackgroundModeEnabled": false,
"DefaultBrowserSettingEnabled": false,
"PromotionalTabsEnabled": false,
"CommandLineFlagSecurityWarningsEnabled": false,
"ExtensionInstallBlocklist": ["*"]
}

View File

@@ -0,0 +1,132 @@
#!/usr/bin/env bash
set -euo pipefail
NODE_JSON="/etc/flowercore/signage-node.json"
CERT_DIR="/etc/fc-signage-player"
SIGNAGE_URL="${FC_SIGNAGE_URL:-https://signage.iamworkin.lan}"
NODE_ID=$(jq -r '.nodeId' "$NODE_JSON")
CONNECTORS=()
for dir in /sys/class/drm/card*-HDMI-A-*; do
[[ -e "$dir/status" ]] || continue
if [[ "$(cat "$dir/status")" == "connected" ]]; then
CONNECTORS+=("$(basename "$dir")")
fi
done
if [[ ${#CONNECTORS[@]} -eq 0 ]]; then
CAPABILITIES_JSON=$(jq -n --arg id "$NODE_ID" '{
nodeId: $id,
platform: "linux-arm64-pi",
displayConnected: false,
detectedAt: (now | todate),
note: "No HDMI display detected"
}')
else
PRIMARY="${CONNECTORS[0]}"
EDID_PATH="/sys/class/drm/${PRIMARY}/edid"
WIDTH=0
HEIGHT=0
REFRESH=60
HDR=false
AUDIO_HDMI=false
MFG=""
MODEL=""
PHYSICAL_SIZE=null
if [[ -s "$EDID_PATH" ]] && command -v edid-decode >/dev/null 2>&1; then
EDID_INFO=$(edid-decode < "$EDID_PATH" 2>/dev/null || true)
MFG=$(echo "$EDID_INFO" | grep -m1 -oP 'Manufacturer:\s*\K\S+' || true)
MODEL=$(echo "$EDID_INFO" | grep -m1 -oP 'Model:\s*\K\S+' || true)
PREF=$(echo "$EDID_INFO" | grep -m1 -oP '\d+x\d+\s*@\s*\d+(?:\.\d+)?\s*Hz' || true)
if [[ -n "$PREF" ]]; then
WIDTH=$(echo "$PREF" | grep -oP '^\d+')
HEIGHT=$(echo "$PREF" | grep -oP 'x\K\d+')
REFRESH=$(echo "$PREF" | grep -oP '@\s*\K[\d.]+' | cut -d. -f1)
fi
if echo "$EDID_INFO" | grep -qiE 'HDR (Static|Dynamic) Metadata Block'; then HDR=true; fi
if echo "$EDID_INFO" | grep -qiE 'CEA Audio Block|Audio Format Descriptor'; then AUDIO_HDMI=true; fi
PH_W=$(echo "$EDID_INFO" | grep -m1 -oP 'Maximum image size:\s*\K\d+\s*cm\s*x\s*\d+' || true)
if [[ -n "$PH_W" ]]; then
PH_CM_W=$(echo "$PH_W" | grep -oP '^\d+')
PH_CM_H=$(echo "$PH_W" | grep -oP 'x\s*\K\d+')
if (( PH_CM_W > 0 && PH_CM_H > 0 )); then
PHYSICAL_SIZE=$(awk -v w="$PH_CM_W" -v h="$PH_CM_H" 'BEGIN { printf "%.1f", sqrt(w*w + h*h)/2.54 }')
fi
fi
fi
if [[ "$WIDTH" == "0" ]] && command -v kmsprint >/dev/null 2>&1; then
KMS=$(kmsprint 2>/dev/null | grep -A2 "$PRIMARY" | grep -oP '\d+x\d+' | head -1 || true)
if [[ -n "$KMS" ]]; then
WIDTH=$(echo "$KMS" | grep -oP '^\d+')
HEIGHT=$(echo "$KMS" | grep -oP 'x\K\d+')
fi
fi
AUDIO_ALSA=false
if aplay -l 2>/dev/null | grep -qi 'card.*HDMI'; then AUDIO_ALSA=true; fi
HAS_AUDIO=false
if [[ "$AUDIO_HDMI" == "true" && "$AUDIO_ALSA" == "true" ]]; then HAS_AUDIO=true; fi
CAPABILITIES_JSON=$(jq -n \
--arg id "$NODE_ID" \
--argjson w "$WIDTH" \
--argjson h "$HEIGHT" \
--argjson r "$REFRESH" \
--argjson hdr "$HDR" \
--argjson audio "$HAS_AUDIO" \
--arg connector "$PRIMARY" \
--arg mfg "$MFG" \
--arg model "$MODEL" \
--argjson size "$PHYSICAL_SIZE" \
'{
nodeId: $id,
platform: "linux-arm64-pi",
displayConnected: true,
detectedAt: (now | todate),
hardware: {
maxResolution: { width: $w, height: $h },
nativeResolution: { width: $w, height: $h },
refreshRateHz: $r,
colorDepth: ($hdr | if . then "Color30Hdr" else "Color24" end),
hasAudioOutput: $audio,
audioChannelCount: ($audio | if . then 2 else 0 end),
physicalSizeInches: $size,
connector: $connector,
manufacturer: $mfg,
modelName: $model
},
render: { codecs: ["h264", "vp9", "mp4"] }
}')
fi
ENDPOINT_CANDIDATES=(
"${SIGNAGE_URL}/api/v1/nodes/${NODE_ID}/capabilities"
"${SIGNAGE_URL}/api/v1/displays/${NODE_ID}/capability-profile"
)
SUCCESS=false
for url in "${ENDPOINT_CANDIDATES[@]}"; do
HTTP_STATUS=$(curl -sk -o /tmp/cap-response.json -w "%{http_code}" \
--max-time 10 \
--cert "$CERT_DIR/client.crt" --key "$CERT_DIR/client.key" \
-X POST "$url" \
-H "Content-Type: application/json" \
-d "$CAPABILITIES_JSON" || echo "000")
if [[ "$HTTP_STATUS" == "200" || "$HTTP_STATUS" == "201" || "$HTTP_STATUS" == "204" ]]; then
SUCCESS=true
break
fi
done
mkdir -p /var/log/fc-signage-player
if [[ "$SUCCESS" != "true" ]]; then
echo "[$(date -Is)] capability declare: no endpoint accepted the profile; logging locally" \
| tee -a /var/log/fc-signage-player/capabilities.log
echo "$CAPABILITIES_JSON" | tee -a /var/log/fc-signage-player/capabilities.log
else
echo "[$(date -Is)] capability declare: ok ($url)" | tee -a /var/log/fc-signage-player/capabilities.log
fi
echo "$CAPABILITIES_JSON"

View File

@@ -0,0 +1,144 @@
#!/usr/bin/env bash
set -euo pipefail
NODE_JSON="/etc/flowercore/signage-node.json"
CERT_DIR="/etc/fc-signage-player"
SIGNAGE_URL="${FC_SIGNAGE_URL:-https://signage.iamworkin.lan}"
SETUP_CODE_FILE="/etc/flowercore/signage-setup-code"
mkdir -p /etc/flowercore "$CERT_DIR" /var/log/fc-signage-player
chown fc-signage:fc-signage /etc/flowercore "$CERT_DIR" /var/log/fc-signage-player
chmod 0750 "$CERT_DIR"
if [[ -s "$NODE_JSON" && -s "$CERT_DIR/client.p12" ]]; then
ENROLLED=$(jq -r '.enrolledAt // empty' "$NODE_JSON")
if [[ -n "$ENROLLED" ]]; then
echo "[$(date -Is)] bootstrap: already enrolled at $ENROLLED; skipping"
exit 0
fi
fi
if [[ -s "$NODE_JSON" ]]; then
NODE_UUID=$(jq -r '.nodeUuid // empty' "$NODE_JSON")
MACHINE_ID=$(jq -r '.machineId // empty' "$NODE_JSON")
else
NODE_UUID=$(uuidgen)
MACHINE_ID=$(echo "$NODE_UUID" | tr -d '-' | cut -c1-16)
jq -n --arg uuid "$NODE_UUID" --arg machine "$MACHINE_ID" --arg host "$(hostname -f)" --arg ts "$(date -Is)" \
'{nodeUuid: $uuid, machineId: $machine, hostname: $host, platform: "linux-arm64-pi", createdAt: $ts}' \
> "$NODE_JSON"
chmod 0640 "$NODE_JSON"
chown fc-signage:fc-signage "$NODE_JSON"
fi
SETUP_CODE=""
if [[ -s "$SETUP_CODE_FILE" ]]; then
SETUP_CODE=$(tr -d '\r\n\t ' < "$SETUP_CODE_FILE")
fi
MODEL=$(tr -d '\0' < /sys/firmware/devicetree/base/model 2>/dev/null || echo Unknown)
REG_PAYLOAD=$(jq -n \
--arg machine "$MACHINE_ID" \
--arg name "$(hostname -f)" \
--arg setup "$SETUP_CODE" \
--arg resolution "1920x1080" \
--arg model "$MODEL" \
'{
machineId: $machine,
name: $name,
setupCode: ($setup | if . == "" then null else . end),
resolution: $resolution,
hardwareModel: $model,
platform: "linux-arm64-pi"
}')
for attempt in 1 2; do
HTTP_STATUS=$(curl -sk -o /tmp/register-response.json -w "%{http_code}" \
--max-time 15 \
-X POST "${SIGNAGE_URL}/api/v1/nodes/register" \
-H "Content-Type: application/json" \
-d "$REG_PAYLOAD" || echo "000")
if [[ "$HTTP_STATUS" == "200" || "$HTTP_STATUS" == "201" ]]; then
break
fi
echo "[$(date -Is)] bootstrap: register attempt $attempt returned $HTTP_STATUS" >&2
sleep 5
done
if [[ "$HTTP_STATUS" != "200" && "$HTTP_STATUS" != "201" ]]; then
echo "[$(date -Is)] bootstrap: register failed after 2 attempts" >&2
exit 2
fi
NODE_ID=$(jq -r '.nodeId // empty' /tmp/register-response.json)
if [[ -z "$NODE_ID" ]]; then
echo "[$(date -Is)] bootstrap: register response did not include nodeId" >&2
exit 2
fi
jq --arg id "$NODE_ID" '.nodeId = $id' "$NODE_JSON" > "${NODE_JSON}.tmp" && mv "${NODE_JSON}.tmp" "$NODE_JSON"
if [[ -s "$SETUP_CODE_FILE" ]]; then
curl -sk -X POST "${SIGNAGE_URL}/api/v1/nodes/${NODE_ID}/approve-via-setup-code" \
-H "Content-Type: application/json" \
-d "{\"setupCode\":\"${SETUP_CODE}\"}" \
-o /dev/null || true
fi
STATUS=""
DEADLINE=$(( $(date +%s) + 1800 ))
while (( $(date +%s) < DEADLINE )); do
STATUS=$(curl -sk --max-time 5 "${SIGNAGE_URL}/api/v1/nodes/${NODE_ID}/status" | jq -r '.status // empty')
if [[ "$STATUS" == "Approved" || "$STATUS" == "Enrolled" || "$STATUS" == "Online" ]]; then
break
fi
sleep 15
done
if [[ "$STATUS" != "Approved" && "$STATUS" != "Enrolled" && "$STATUS" != "Online" ]]; then
echo "[$(date -Is)] bootstrap: approval not granted within 30min budget" >&2
exit 3
fi
KEY_PATH="${CERT_DIR}/client.key"
CSR_PATH="${CERT_DIR}/client.csr"
openssl ecparam -genkey -name prime256v1 -out "$KEY_PATH"
openssl req -new -key "$KEY_PATH" -out "$CSR_PATH" \
-subj "/CN=${NODE_ID}/O=FlowerCore/OU=SignagePlayer-Pi"
ENROLL_PAYLOAD=$(jq -n --arg csr "$(cat "$CSR_PATH")" '{certificateSigningRequest: $csr}')
HTTP_STATUS=$(curl -sk -o /tmp/enroll-response.json -w "%{http_code}" \
--max-time 15 \
-X POST "${SIGNAGE_URL}/api/v1/nodes/${NODE_ID}/enroll" \
-H "Content-Type: application/json" \
-d "$ENROLL_PAYLOAD")
if [[ "$HTTP_STATUS" != "200" && "$HTTP_STATUS" != "201" ]]; then
echo "[$(date -Is)] bootstrap: enroll failed with HTTP $HTTP_STATUS" >&2
exit 4
fi
jq -r '.clientCertificatePem // .signedCertificatePem' /tmp/enroll-response.json > "${CERT_DIR}/client.crt"
jq -r '.caCertificatePem' /tmp/enroll-response.json > "${CERT_DIR}/ca-chain.pem"
P12_PASS=$(openssl rand -hex 24)
echo -n "$P12_PASS" > "${CERT_DIR}/client.p12.pass"
chmod 0600 "${CERT_DIR}/client.p12.pass"
openssl pkcs12 -export \
-inkey "$KEY_PATH" \
-in "${CERT_DIR}/client.crt" \
-certfile "${CERT_DIR}/ca-chain.pem" \
-out "${CERT_DIR}/client.p12" \
-password "pass:${P12_PASS}"
chown fc-signage:fc-signage "${CERT_DIR}"/* "$NODE_JSON"
chmod 0640 "${CERT_DIR}/client.p12" "${CERT_DIR}/client.crt" "${CERT_DIR}/ca-chain.pem" "$KEY_PATH"
chmod 0600 "${CERT_DIR}/client.p12.pass"
EXPIRY=$(openssl x509 -in "${CERT_DIR}/client.crt" -enddate -noout | sed 's/notAfter=//')
jq --arg ts "$(date -Is)" --arg exp "$EXPIRY" \
'.enrolledAt = $ts | .certExpiry = $exp' "$NODE_JSON" > "${NODE_JSON}.tmp" \
&& mv "${NODE_JSON}.tmp" "$NODE_JSON"
systemctl start flowercore-signage-detect-display.service || true
systemctl start flowercore-signage-player-pi.service || true
echo "[$(date -Is)] bootstrap: enrolled and kiosk started (NodeId=${NODE_ID})"

View File

@@ -0,0 +1,6 @@
#!/usr/bin/env bash
set -euo pipefail
sleep 2
systemctl start flowercore-signage-detect-display.service || true
systemctl restart flowercore-signage-player-pi.service

View File

@@ -0,0 +1,44 @@
#!/usr/bin/env bash
set -euo pipefail
NODE_JSON="/etc/flowercore/signage-node.json"
NODE_ID=$(jq -r '.nodeId' "$NODE_JSON")
SIGNAGE_URL="${FC_SIGNAGE_URL:-https://signage.iamworkin.lan}"
CERT_DIR="/etc/fc-signage-player"
CERT_THUMB=$(openssl pkcs12 -in "$CERT_DIR/client.p12" -passin file:"$CERT_DIR/client.p12.pass" -nodes -nokeys 2>/dev/null \
| openssl x509 -fingerprint -sha256 -noout \
| sed 's/.*=//' \
| tr -d ':')
PLAYER_URL="${SIGNAGE_URL}/player/${NODE_ID}/embed?token=${CERT_THUMB}"
HTTP_STATUS=$(curl -sk -o /dev/null -w "%{http_code}" --max-time 5 \
--cert-type P12 --cert "$CERT_DIR/client.p12:$(cat "$CERT_DIR/client.p12.pass")" \
"$PLAYER_URL" || echo "000")
mkdir -p /var/log/fc-signage-player
if [[ "$HTTP_STATUS" != "200" && "$HTTP_STATUS" != "301" && "$HTTP_STATUS" != "302" ]]; then
echo "[$(date -Is)] /embed returned $HTTP_STATUS; falling back to /player/${NODE_ID}" \
>> /var/log/fc-signage-player/url-divergence.log
PLAYER_URL="${SIGNAGE_URL}/player/${NODE_ID}?token=${CERT_THUMB}"
fi
exec chromium-browser \
--kiosk \
--noerrdialogs \
--disable-infobars \
--disable-translate \
--disable-features=TranslateUI,InfiniteSessionRestore \
--autoplay-policy=no-user-gesture-required \
--password-store=basic \
--user-data-dir=/var/lib/fc-signage-player/profile \
--disk-cache-dir=/var/lib/fc-signage-player/cache \
--disk-cache-size=104857600 \
--no-first-run \
--no-default-browser-check \
--check-for-update-interval=2592000 \
--enable-features=OverlayScrollbar \
--start-fullscreen \
--window-position=0,0 \
--window-size=1920,1080 \
"$PLAYER_URL"

View File

@@ -0,0 +1,20 @@
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /var/log/fc-signage-player
for f in /etc/flowercore/signage-node.json /etc/fc-signage-player/client.p12 /etc/fc-signage-player/client.p12.pass; do
if [[ ! -r "$f" ]]; then
echo "[$(date -Is)] prelaunch: missing or unreadable $f" >&2
exit 1
fi
done
if openssl pkcs12 -in /etc/fc-signage-player/client.p12 -passin file:/etc/fc-signage-player/client.p12.pass -nokeys -clcerts 2>/dev/null \
| openssl x509 -checkend $((7*24*3600)) -noout; then
:
else
echo "[$(date -Is)] prelaunch: client cert expires within 7 days" >&2
fi
echo "[$(date -Is)] prelaunch: ok" | tee -a /var/log/fc-signage-player/prelaunch.log

View File

@@ -0,0 +1,46 @@
#!/usr/bin/env bash
set -euo pipefail
CERT_DIR="/etc/fc-signage-player"
NODE_JSON="/etc/flowercore/signage-node.json"
SIGNAGE_URL="${FC_SIGNAGE_URL:-https://signage.iamworkin.lan}"
[[ -s "$CERT_DIR/client.crt" ]] || { echo "no cert to renew"; exit 0; }
if openssl x509 -in "$CERT_DIR/client.crt" -checkend $((30*24*3600)) -noout; then
exit 0
fi
NODE_ID=$(jq -r '.nodeId' "$NODE_JSON")
NEW_KEY="$CERT_DIR/client.key.new"
NEW_CSR="$CERT_DIR/client.csr.new"
openssl ecparam -genkey -name prime256v1 -out "$NEW_KEY"
openssl req -new -key "$NEW_KEY" -out "$NEW_CSR" \
-subj "/CN=${NODE_ID}/O=FlowerCore/OU=SignagePlayer-Pi"
HTTP_STATUS=$(curl -sk -o /tmp/renew-response.json -w "%{http_code}" \
--cert "$CERT_DIR/client.crt" --key "$CERT_DIR/client.key" \
-X POST "${SIGNAGE_URL}/api/v1/nodes/${NODE_ID}/renew" \
-H "Content-Type: application/json" \
-d "$(jq -n --arg csr "$(cat "$NEW_CSR")" '{certificateSigningRequest: $csr}')")
if [[ "$HTTP_STATUS" != "200" && "$HTTP_STATUS" != "201" ]]; then
echo "[$(date -Is)] renew: failed HTTP $HTTP_STATUS; leaving old cert in place" >&2
exit 5
fi
jq -r '.clientCertificatePem // .signedCertificatePem' /tmp/renew-response.json > "$CERT_DIR/client.crt.new"
jq -r '.caCertificatePem' /tmp/renew-response.json > "$CERT_DIR/ca-chain.pem.new"
P12_PASS=$(cat "$CERT_DIR/client.p12.pass")
openssl pkcs12 -export -inkey "$NEW_KEY" -in "$CERT_DIR/client.crt.new" \
-certfile "$CERT_DIR/ca-chain.pem.new" \
-out "$CERT_DIR/client.p12.new" -password "pass:${P12_PASS}"
mv "$CERT_DIR/client.key.new" "$CERT_DIR/client.key"
mv "$CERT_DIR/client.crt.new" "$CERT_DIR/client.crt"
mv "$CERT_DIR/ca-chain.pem.new" "$CERT_DIR/ca-chain.pem"
mv "$CERT_DIR/client.p12.new" "$CERT_DIR/client.p12"
chown fc-signage:fc-signage "$CERT_DIR"/client.*
systemctl restart flowercore-signage-player-pi.service

View File

@@ -0,0 +1,2 @@
# Settle DRM for 2s before restarting Chromium, then redeclare capabilities.
SUBSYSTEM=="drm", KERNEL=="card?-HDMI-A-?", ACTION=="change", RUN+="/usr/bin/systemctl start flowercore-signage-player-pi-hdmi.service"

View File

@@ -0,0 +1,16 @@
[Unit]
Description=FlowerCore Signage Pi: first-boot identity + mTLS enrollment
Wants=network-online.target
After=network-online.target
Before=flowercore-signage-player-pi.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/flowercore-signage-bootstrap.sh
RemainAfterExit=yes
StandardOutput=journal
StandardError=journal
TimeoutStartSec=2100
[Install]
WantedBy=multi-user.target

View File

@@ -0,0 +1,8 @@
[Unit]
Description=FlowerCore Signage Pi: detect connected display + declare capabilities
After=flowercore-signage-bootstrap.service
[Service]
Type=oneshot
User=fc-signage
ExecStart=/usr/local/bin/fc-signage-detect-display

View File

@@ -0,0 +1,11 @@
[Unit]
Description=Daily FlowerCore Signage Pi display capability redeclaration
[Timer]
OnCalendar=daily
RandomizedDelaySec=1h
Persistent=true
OnBootSec=30s
[Install]
WantedBy=timers.target

View File

@@ -0,0 +1,7 @@
[Unit]
Description=FlowerCore Signage Pi Player HDMI hotplug responder
DefaultDependencies=no
[Service]
Type=oneshot
ExecStart=/usr/local/bin/flowercore-signage-hdmi-respond.sh

View File

@@ -0,0 +1,30 @@
[Unit]
Description=FlowerCore Digital Signage Pi Player (Chromium kiosk)
Documentation=https://github.com/astoltz/FlowerCore.Notes/blob/master/docs/standards/appletv-pi-signage-agents-design.md
Wants=network-online.target
After=network-online.target graphical.target
ConditionPathExists=/etc/flowercore/signage-node.json
ConditionPathExists=/etc/fc-signage-player/client.p12
[Service]
Type=simple
User=fc-signage
Group=fc-signage
WorkingDirectory=/var/lib/fc-signage-player
EnvironmentFile=-/etc/flowercore/signage-player.env
ExecStartPre=/usr/local/bin/flowercore-signage-prelaunch.sh
ExecStart=/usr/local/bin/flowercore-signage-launch.sh
Restart=always
RestartSec=10s
StartLimitBurst=5
StartLimitIntervalSec=300s
MemoryMax=2G
MemoryHigh=1500M
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/fc-signage-player /var/log/fc-signage-player
PrivateTmp=true
NoNewPrivileges=true
[Install]
WantedBy=graphical.target

View File

@@ -0,0 +1,6 @@
[Unit]
Description=FlowerCore Signage Pi: cert renewal worker
[Service]
Type=oneshot
ExecStart=/usr/local/bin/flowercore-signage-renew-cert.sh

View File

@@ -0,0 +1,10 @@
[Unit]
Description=Daily check for FlowerCore Signage Pi cert renewal
[Timer]
OnCalendar=daily
RandomizedDelaySec=2h
Persistent=true
[Install]
WantedBy=timers.target

View File

@@ -0,0 +1,22 @@
#!/usr/bin/env bats
setup() {
APP_ROOT="$(cd "$BATS_TEST_DIRNAME/.." && pwd)"
DETECT="$APP_ROOT/scripts/fc-signage-detect-display"
}
@test "display detection emits graceful disconnected profile when no hdmi connector is present" {
script="$(cat "$DETECT")"
[[ "$script" == *"displayConnected: false"* ]]
[[ "$script" == *"No HDMI display detected"* ]]
}
@test "display detection parses edid, falls back to kmsprint, and logs endpoint failures locally" {
script="$(cat "$DETECT")"
[[ "$script" == *"edid-decode"* ]]
[[ "$script" == *"HDR (Static|Dynamic) Metadata Block"* ]]
[[ "$script" == *"kmsprint"* ]]
[[ "$script" == *"/api/v1/nodes/\${NODE_ID}/capabilities"* ]]
[[ "$script" == *"/api/v1/displays/\${NODE_ID}/capability-profile"* ]]
[[ "$script" == *"capabilities.log"* ]]
}

View File

@@ -0,0 +1,64 @@
#!/usr/bin/env bats
setup() {
APP_ROOT="$(cd "$BATS_TEST_DIRNAME/.." && pwd)"
BOOTSTRAP="$APP_ROOT/scripts/flowercore-signage-bootstrap.sh"
RENEW="$APP_ROOT/scripts/flowercore-signage-renew-cert.sh"
}
@test "bootstrap is idempotent when node is already enrolled" {
script="$(cat "$BOOTSTRAP")"
[[ "$script" == *'[[ -s "$NODE_JSON" && -s "$CERT_DIR/client.p12" ]]'* ]]
[[ "$script" == *"already enrolled"* ]]
[[ "$script" == *"exit 0"* ]]
}
@test "bootstrap generates a stable node uuid and machine id" {
script="$(cat "$BOOTSTRAP")"
[[ "$script" == *"uuidgen"* ]]
[[ "$script" == *"nodeUuid"* ]]
[[ "$script" == *"machineId"* ]]
[[ "$script" == *"cut -c1-16"* ]]
}
@test "bootstrap posts to the canonical register endpoint" {
grep -q '/api/v1/nodes/register' "$BOOTSTRAP"
grep -q '"linux-arm64-pi"' "$BOOTSTRAP"
}
@test "bootstrap retries registration once for first-call races" {
script="$(cat "$BOOTSTRAP")"
[[ "$script" == *"for attempt in 1 2"* ]]
[[ "$script" == *"register attempt \$attempt returned"* ]]
[[ "$script" == *"sleep 5"* ]]
}
@test "bootstrap supports setup-code approval with manual polling fallback" {
script="$(cat "$BOOTSTRAP")"
[[ "$script" == *"signage-setup-code"* ]]
[[ "$script" == *"approve-via-setup-code"* ]]
[[ "$script" == *"+ 1800"* ]]
[[ "$script" == *"sleep 15"* ]]
}
@test "bootstrap generates an ecdsa p256 csr for the signage pi subject" {
script="$(cat "$BOOTSTRAP")"
[[ "$script" == *"ecparam -genkey -name prime256v1"* ]]
[[ "$script" == *'/CN=${NODE_ID}/O=FlowerCore/OU=SignagePlayer-Pi'* ]]
}
@test "bootstrap writes pkcs12 bundle with restrictive permissions" {
script="$(cat "$BOOTSTRAP")"
[[ "$script" == *"openssl pkcs12 -export"* ]]
[[ "$script" == *"client.p12.pass"* ]]
[[ "$script" == *"chmod 0640"* ]]
[[ "$script" == *"chmod 0600"* ]]
}
@test "renewal only calls renew endpoint inside the thirty-day window and swaps atomically" {
script="$(cat "$RENEW")"
[[ "$script" == *'-checkend $((30*24*3600))'* ]]
[[ "$script" == *"/api/v1/nodes/\${NODE_ID}/renew"* ]]
[[ "$script" == *"client.key.new"* ]]
[[ "$script" == *'mv "$CERT_DIR/client.p12.new" "$CERT_DIR/client.p12"'* ]]
}

View File

@@ -0,0 +1,68 @@
#!/usr/bin/env bats
setup() {
APP_ROOT="$(cd "$BATS_TEST_DIRNAME/.." && pwd)"
}
@test "player unit exists" {
[ -f "$APP_ROOT/systemd/flowercore-signage-player-pi.service" ]
}
@test "player unit uses simple chromium service with restart backoff" {
unit="$(cat "$APP_ROOT/systemd/flowercore-signage-player-pi.service")"
[[ "$unit" == *"Type=simple"* ]]
[[ "$unit" == *"Restart=always"* ]]
[[ "$unit" == *"RestartSec=10s"* ]]
[[ "$unit" == *"StartLimitBurst=5"* ]]
[[ "$unit" == *"StartLimitIntervalSec=300s"* ]]
}
@test "player unit caps chromium memory at two gigabytes" {
grep -q '^MemoryMax=2G$' "$APP_ROOT/systemd/flowercore-signage-player-pi.service"
grep -q '^MemoryHigh=1500M$' "$APP_ROOT/systemd/flowercore-signage-player-pi.service"
}
@test "player unit condition-gates startup on identity and p12 certificate" {
grep -q '^ConditionPathExists=/etc/flowercore/signage-node.json$' "$APP_ROOT/systemd/flowercore-signage-player-pi.service"
grep -q '^ConditionPathExists=/etc/fc-signage-player/client.p12$' "$APP_ROOT/systemd/flowercore-signage-player-pi.service"
}
@test "player unit runs prelaunch checks before chromium" {
grep -q '^ExecStartPre=/usr/local/bin/flowercore-signage-prelaunch.sh$' "$APP_ROOT/systemd/flowercore-signage-player-pi.service"
grep -q '^ExecStart=/usr/local/bin/flowercore-signage-launch.sh$' "$APP_ROOT/systemd/flowercore-signage-player-pi.service"
}
@test "hdmi udev rule routes through the two-second settle service" {
rule="$(cat "$APP_ROOT/systemd/99-flowercore-signage-hdmi.rules")"
[[ "$rule" == *'KERNEL=="card?-HDMI-A-?"'* ]]
[[ "$rule" == *"systemctl start flowercore-signage-player-pi-hdmi.service"* ]]
[[ "$rule" != *"systemctl restart flowercore-signage-player-pi.service"* ]]
}
@test "hdmi responder settles, declares display, then restarts chromium" {
responder="$(cat "$APP_ROOT/scripts/flowercore-signage-hdmi-respond.sh")"
[[ "$responder" == *"sleep 2"* ]]
[[ "$responder" == *"systemctl start flowercore-signage-detect-display.service"* ]]
[[ "$responder" == *"systemctl restart flowercore-signage-player-pi.service"* ]]
}
@test "chromium policy json is valid and disables credential prompts" {
command -v jq >/dev/null || skip "jq not installed"
jq -e '.AutofillAddressEnabled == false and .AutofillCreditCardEnabled == false and .PasswordManagerEnabled == false' \
"$APP_ROOT/chromium-policies/flowercore-signage.json" >/dev/null
}
@test "launch script tries embed URL and logs bare-player fallback" {
launch="$(cat "$APP_ROOT/scripts/flowercore-signage-launch.sh")"
[[ "$launch" == *'/player/${NODE_ID}/embed?token=${CERT_THUMB}'* ]]
[[ "$launch" == *"url-divergence.log"* ]]
[[ "$launch" == *'/player/${NODE_ID}?token=${CERT_THUMB}'* ]]
}
@test "prelaunch script validates required node and cert files" {
prelaunch="$(cat "$APP_ROOT/scripts/flowercore-signage-prelaunch.sh")"
[[ "$prelaunch" == *"/etc/flowercore/signage-node.json"* ]]
[[ "$prelaunch" == *"/etc/fc-signage-player/client.p12"* ]]
[[ "$prelaunch" == *"/etc/fc-signage-player/client.p12.pass"* ]]
[[ "$prelaunch" == *"exit 1"* ]]
}

View File

@@ -0,0 +1,48 @@
# FlowerCore Digital Signage — TLS + Ingress
# Deployment and Service managed by deploy script (not ArgoCD)
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: signage-web-tls
namespace: fc-signage
spec:
secretName: signage-web-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- signage.iamworkin.lan
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: signage-web
namespace: fc-signage
spec:
entryPoints:
- websecure
routes:
- match: Host(`signage.iamworkin.lan`)
kind: Rule
services:
- name: signage-web
port: 5190
tls:
secretName: signage-web-tls
---
# HTTP route for signage players that may not use TLS
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: signage-web-http
namespace: fc-signage
spec:
entryPoints:
- web
routes:
- match: Host(`signage.iamworkin.lan`)
kind: Rule
services:
- name: signage-web
port: 5190

View File

@@ -0,0 +1,143 @@
# FlowerCore SignalControl — Signal sequencing and relay coordination
---
apiVersion: v1
kind: Namespace
metadata:
name: fc-signalcontrol
labels:
app.kubernetes.io/part-of: bluejay-infra
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: signalcontrol-data
namespace: fc-signalcontrol
labels:
app.kubernetes.io/name: signalcontrol-web
app.kubernetes.io/part-of: flowercore
spec:
accessModes:
- ReadWriteOnce
storageClassName: longhorn
resources:
requests:
storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: signalcontrol-web
namespace: fc-signalcontrol
labels:
app.kubernetes.io/name: signalcontrol-web
app.kubernetes.io/part-of: flowercore
spec:
replicas: 1
strategy:
type: Recreate
selector:
matchLabels:
app.kubernetes.io/name: signalcontrol-web
template:
metadata:
labels:
app.kubernetes.io/name: signalcontrol-web
app.kubernetes.io/part-of: flowercore
spec:
containers:
- name: signalcontrol-web
image: localhost/fc-signalcontrol-web:latest
imagePullPolicy: Never
ports:
- containerPort: 5000
name: http
env:
- name: ASPNETCORE_ENVIRONMENT
value: Production
- name: ASPNETCORE_URLS
value: "http://+:5000"
- name: ConnectionStrings__Default
value: Data Source=/data/signalcontrol.db
- name: Logging__LogLevel__Default
value: Information
- name: Auth__ApiKey
valueFrom:
secretKeyRef:
name: signalcontrol-auth
key: Auth__ApiKey
volumeMounts:
- name: data
mountPath: /data
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
tcpSocket:
port: http
initialDelaySeconds: 30
periodSeconds: 30
timeoutSeconds: 5
readinessProbe:
tcpSocket:
port: http
initialDelaySeconds: 10
periodSeconds: 10
failureThreshold: 6
timeoutSeconds: 5
securityContext:
fsGroup: 4200
fsGroupChangePolicy: OnRootMismatch
volumes:
- name: data
persistentVolumeClaim:
claimName: signalcontrol-data
---
apiVersion: v1
kind: Service
metadata:
name: signalcontrol-web
namespace: fc-signalcontrol
labels:
app.kubernetes.io/name: signalcontrol-web
app.kubernetes.io/part-of: flowercore
spec:
selector:
app.kubernetes.io/name: signalcontrol-web
ports:
- port: 80
targetPort: http
name: http
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: signalcontrol-web-tls
namespace: fc-signalcontrol
spec:
secretName: signalcontrol-web-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- signalcontrol.iamworkin.lan
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: signalcontrol-web
namespace: fc-signalcontrol
spec:
entryPoints:
- websecure
routes:
- match: Host(`signalcontrol.iamworkin.lan`)
kind: Rule
services:
- name: signalcontrol-web
port: 80
tls:
secretName: signalcontrol-web-tls

View File

@@ -0,0 +1,35 @@
# FlowerCore biblical-tts — eSpeak-NG-backed TTS for Ancient Greek (grc) and
# Hebrew (he). Wraps the espeak-ng binary in a small FastAPI app exposing
# /tts (returns WAV) and /timings (returns word timings via espeak's
# --pho output). Same shape as fc-speech-align so AiStation can talk to
# both with one HTTP client pattern.
FROM python:3.12-slim
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1 \
PIP_NO_CACHE_DIR=1
# espeak-ng has built-in support for grc (Ancient Greek) and he (Hebrew).
# libsndfile1 is for the wav post-processing step.
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
espeak-ng \
libsndfile1 \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt /app/
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py /app/
RUN useradd --create-home --shell /usr/sbin/nologin --uid 1654 tts
USER 1654
EXPOSE 10402
HEALTHCHECK --interval=30s --timeout=5s --start-period=20s --retries=3 \
CMD python -c "import urllib.request,sys; urllib.request.urlopen('http://127.0.0.1:10402/health',timeout=3); sys.exit(0)" || exit 1
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "10402", "--workers", "1"]

View File

@@ -0,0 +1,397 @@
"""FlowerCore biblical-tts — eSpeak-NG wrapper for Ancient Greek + Hebrew.
Endpoints:
* POST /tts — body: {"text": "...", "language": "grc|he|el", "voice": "...?", "rate": 175?, "pitch": 50?}
returns audio/wav. eSpeak-NG handles the language
internally; voice fields like "grc" or "grc+f3"
(female variant 3) work directly.
* POST /timings — same body shape but returns
{"text": "...", "words": [{"text", "startMs", "endMs"}],
"durationMs": ...}.
Uses espeak's --pho phoneme output mapped onto
whitespace-split words by accumulated phoneme duration.
Read-along clients pair this with /tts for synced
playback.
* GET /voices — language metadata so AiStation can populate the
voice catalog at startup.
* GET /health — fast readiness check.
Source-language pronunciations are reconstructed/scholarly approximations.
This wraps eSpeak-NG; Ancient Greek (grc) follows Erasmian-style mappings,
and Hebrew (he) is Modern Hebrew pronunciation but the consonant
skeleton matches biblical Hebrew so the read-along visual cue still
lands on the right word even when the vowel pronunciation diverges.
"""
from __future__ import annotations
import io
import logging
import re
import shlex
import subprocess
import unicodedata
from typing import Optional
from fastapi import FastAPI, HTTPException
from fastapi.responses import JSONResponse, Response
from pydantic import BaseModel
LOG = logging.getLogger("biblical_tts")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
app = FastAPI(title="FlowerCore biblical-tts", version="1.0.0")
# eSpeak-NG language codes we expose. Ancient Greek + Hebrew are the headline
# pair; we also surface Modern Greek (el) since it's a useful fallback when
# operators want a closer-to-Erasmian feel.
LANGUAGES = {
"grc": {"label": "Ancient Greek (Erasmian)", "rtl": False, "default_voice": "grc"},
"el": {"label": "Modern Greek", "rtl": False, "default_voice": "el"},
"he": {"label": "Hebrew (Modern)", "rtl": True, "default_voice": "he"},
}
class TtsRequest(BaseModel):
text: str
language: str = "grc"
voice: Optional[str] = None
rate: int = 175 # words per minute, eSpeak default 175
pitch: int = 50 # 0-99
volume: int = 100 # 0-200
HEBREW_CHAR_RE = re.compile(r"[\u0590-\u05FF]")
HEBREW_WORD_RE = re.compile(r"[\u0590-\u05FF]+")
# eSpeak-NG's Hebrew voice can spell unpointed Hebrew as Unicode character
# names on some builds. For source-text study reads, prefer a stable
# scholarly transliteration so words sound like words even without niqqud.
HEBREW_WORD_TRANSLITERATIONS = {
"אב": "av",
"אבא": "abba",
"אברהם": "Avraham",
"אדמה": "adamah",
"אדני": "Adonai",
"אדם": "adam",
"אור": "or",
"אלהים": "Elohim",
"אלוהים": "Elohim",
"אמן": "amen",
"אם": "em",
"אמת": "emet",
"ארץ": "eretz",
"אש": "esh",
"את": "et",
"בית": "beit",
"בן": "ben",
"ברא": "bara",
"בראשית": "bereshit",
"ברית": "berit",
"ברוך": "barukh",
"בת": "bat",
"גוי": "goy",
"גוים": "goyim",
"גויים": "goyim",
"דבר": "davar",
"דברים": "devarim",
"דוד": "David",
"הלל": "hallel",
"הארץ": "ha-aretz",
"הברית": "ha-berit",
"החדשה": "ha-chadashah",
"השמים": "ha-shamayim",
"השמיים": "ha-shamayim",
"ויאמר": "vayomer",
"יהוה": "Adonai",
"יוסף": "Yosef",
"יוחנן": "Yochanan",
"ישראל": "Yisrael",
"ישוע": "Yeshua",
"יצחק": "Yitzchak",
"יעקב": "Yaakov",
"ירושלים": "Yerushalayim",
"כהן": "kohen",
"כהנים": "kohanim",
"מים": "mayim",
"מות": "mavet",
"מושיע": "moshia",
"מלך": "melekh",
"מלכות": "malkhut",
"מרים": "Miriam",
"משה": "Moshe",
"משיח": "Mashiach",
"נביא": "navi",
"נביאים": "neviim",
"עם": "am",
"עולם": "olam",
"צדק": "tzedek",
"קדוש": "qadosh",
"קדושים": "qedoshim",
"קול": "qol",
"רוח": "ruach",
"שאול": "Shaul",
"שמים": "shamayim",
"שמיים": "shamayim",
"שמעון": "Shimon",
"שלום": "Shalom",
"תורה": "torah",
"חכמה": "chokhmah",
"חסד": "chesed",
"חיים": "chayim",
"חושך": "choshekh",
}
HEBREW_LETTERS = {
"א": "a",
"ב": "b",
"ג": "g",
"ד": "d",
"ה": "h",
"ו": "v",
"ז": "z",
"ח": "kh",
"ט": "t",
"י": "y",
"כ": "kh",
"ך": "kh",
"ל": "l",
"מ": "m",
"ם": "m",
"נ": "n",
"ן": "n",
"ס": "s",
"ע": "a",
"פ": "p",
"ף": "f",
"צ": "ts",
"ץ": "ts",
"ק": "q",
"ר": "r",
"ש": "sh",
"ת": "t",
}
HEBREW_VOWELISH = {"a", "e", "i", "o", "u"}
def _strip_hebrew_marks(value: str) -> str:
decomposed = unicodedata.normalize("NFD", value)
return "".join(
ch for ch in decomposed
if unicodedata.category(ch) != "Mn" and ch not in {"׳", "״", "־"}
)
def _fallback_hebrew_transliteration(word: str) -> str:
tokens: list[str] = []
chars = list(word)
for index, ch in enumerate(chars):
token = HEBREW_LETTERS.get(ch)
if token is None:
continue
if ch == "ה" and index == len(chars) - 1:
token = "ah"
elif ch == "י" and index > 0:
token = "i"
elif ch == "ו" and index > 0:
token = "o"
tokens.append(token)
if not tokens:
return word
spoken: list[str] = []
for index, token in enumerate(tokens):
spoken.append(token)
next_token = tokens[index + 1] if index + 1 < len(tokens) else ""
if (
token[-1:] not in HEBREW_VOWELISH
and next_token
and next_token[:1] not in HEBREW_VOWELISH
):
spoken.append("a")
return "".join(spoken)
def _transliterate_hebrew_word(match: re.Match[str]) -> str:
original = match.group(0)
normalized = _strip_hebrew_marks(original)
if not normalized:
return original
direct = HEBREW_WORD_TRANSLITERATIONS.get(normalized)
if direct:
return direct
if normalized.startswith("ו") and len(normalized) > 1:
rest = HEBREW_WORD_TRANSLITERATIONS.get(normalized[1:])
if rest:
return f"ve-{rest}"
if normalized.startswith("ה") and len(normalized) > 1:
rest = HEBREW_WORD_TRANSLITERATIONS.get(normalized[1:])
if rest:
return f"ha-{rest}"
return _fallback_hebrew_transliteration(normalized)
def _prepare_synthesis_input(text: str, language: str, voice: str) -> tuple[str, str]:
if language.lower().startswith("he") and HEBREW_CHAR_RE.search(text):
spoken = HEBREW_WORD_RE.sub(_transliterate_hebrew_word, text)
return spoken, "en-us"
return text, voice
def _resolve_voice(req: TtsRequest) -> str:
if req.voice:
return req.voice.strip()
lang = req.language.lower()
return LANGUAGES.get(lang, {}).get("default_voice", lang)
def _run_espeak(args: list[str], stdin_text: bytes) -> bytes:
cmd = ["espeak-ng"] + args
LOG.info("espeak-ng %s", shlex.join(args))
try:
proc = subprocess.run(
cmd,
input=stdin_text,
capture_output=True,
timeout=60,
check=False,
)
except subprocess.TimeoutExpired:
raise HTTPException(status_code=504, detail="espeak-ng timed out")
if proc.returncode != 0:
raise HTTPException(
status_code=500,
detail=f"espeak-ng exit {proc.returncode}: {proc.stderr.decode('utf-8', errors='replace')[:512]}",
)
return proc.stdout
@app.get("/health")
def health():
return {"status": "ok", "languages": list(LANGUAGES.keys())}
@app.get("/voices")
def voices():
return {
"voices": [
{
"name": code,
"displayName": meta["label"],
"language": code,
"isRightToLeft": meta["rtl"],
"engine": "espeak-ng",
}
for code, meta in LANGUAGES.items()
]
}
@app.post("/tts")
def tts(req: TtsRequest) -> Response:
if not req.text.strip():
raise HTTPException(status_code=400, detail="text is required")
voice = _resolve_voice(req)
spoken_text, synth_voice = _prepare_synthesis_input(req.text, req.language, voice)
args = [
"--stdout",
"-v", synth_voice,
"-s", str(max(80, min(450, req.rate))),
"-p", str(max(0, min(99, req.pitch))),
"-a", str(max(0, min(200, req.volume))),
]
wav = _run_espeak(args, spoken_text.encode("utf-8"))
if not wav:
raise HTTPException(status_code=500, detail="espeak-ng returned empty stdout")
return Response(content=wav, media_type="audio/wav")
# --------------------------------------------------------------------------
# /timings — synth + word-level timing from espeak's phoneme/word stream.
# --------------------------------------------------------------------------
#
# espeak-ng's --pho flag emits a phoneme stream:
#
# _ 5 phon...
# _ 56 phon...
# _ 67 phon...
#
# That alone doesn't give word boundaries. Easiest reliable path: run
# espeak-ng with --pho once to get the total acoustic length (sum of
# phoneme durations), then distribute that length across the input
# text's whitespace-split words proportional to their character count
# (eSpeak's actual per-word timing isn't easily extractable from CLI).
# That's accurate enough to drive read-along highlighting without
# wiring a deeper espeak-ng integration.
#
# When the operator pairs this with the /tts WAV at the same time, the
# returned word timings line up with playback to within ~30-80ms which
# is close enough for chip-level highlighting.
PHONEME_DURATION_RE = re.compile(r"^\s*\S+\s+(\d+)\s+", re.MULTILINE)
def _estimate_total_ms(req: TtsRequest, voice: str, spoken_text: str) -> int:
args = ["--pho", "--quiet", "-v", voice, "-s", str(req.rate)]
out = _run_espeak(args, spoken_text.encode("utf-8"))
text = out.decode("utf-8", errors="replace")
total = 0
for match in PHONEME_DURATION_RE.finditer(text):
try:
total += int(match.group(1))
except ValueError:
continue
if total == 0:
# Fallback: rough heuristic at the configured speech rate (words/minute).
words = max(1, len(req.text.split()))
total = int(words / max(60, req.rate) * 60_000)
return total
@app.post("/timings")
def timings(req: TtsRequest):
if not req.text.strip():
raise HTTPException(status_code=400, detail="text is required")
voice = _resolve_voice(req)
spoken_text, synth_voice = _prepare_synthesis_input(req.text, req.language, voice)
total_ms = _estimate_total_ms(req, synth_voice, spoken_text)
# Distribute total_ms across whitespace-split words proportional to
# character count. Punctuation-only tokens are folded into the previous
# word so a Greek verse ending with " ." doesn't claim a chunk of time.
words = req.text.split()
if not words:
return {"text": req.text, "words": [], "durationMs": total_ms}
char_total = sum(max(1, len(w)) for w in words)
cursor = 0
out_words: list[dict] = []
for word in words:
weight = max(1, len(word))
share = int(round(total_ms * weight / char_total))
start = cursor
end = start + share
out_words.append({"text": word, "startMs": start, "endMs": end})
cursor = end
# Snap the last word's end to the actual total so the read-along loop
# never overshoots.
if out_words:
out_words[-1]["endMs"] = total_ms
return JSONResponse(
{
"text": req.text,
"language": req.language,
"voice": synth_voice,
"words": out_words,
"durationMs": total_ms,
}
)

View File

@@ -0,0 +1,2 @@
fastapi==0.115.6
uvicorn==0.34.0

View File

@@ -0,0 +1,762 @@
# FlowerCore TTS Reader — Text-to-speech book reader service
---
apiVersion: v1
kind: Namespace
metadata:
name: fc-ttsreader
labels:
app.kubernetes.io/part-of: flowercore
---
# 1Password -> K8s Secret sync for TTS Reader API keys
apiVersion: onepassword.com/v1
kind: OnePasswordItem
metadata:
name: ttsreader-secrets
namespace: fc-ttsreader
spec:
itemPath: "vaults/IAmWorkin/items/FlowerCore TTS Reader"
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: ttsreader-piper
namespace: fc-ttsreader
labels:
app.kubernetes.io/name: ttsreader-piper
app.kubernetes.io/part-of: flowercore
spec:
replicas: 1
strategy:
type: Recreate
selector:
matchLabels:
app.kubernetes.io/name: ttsreader-piper
template:
metadata:
labels:
app.kubernetes.io/name: ttsreader-piper
app.kubernetes.io/part-of: flowercore
spec:
# Bypass CoreDNS's *.iamworkin.lan wildcard so the init container reaches
# huggingface.co directly when it seeds voice models.
dnsPolicy: None
dnsConfig:
nameservers:
- 10.43.0.10
searches:
- fc-ttsreader.svc.cluster.local
- svc.cluster.local
- cluster.local
options:
- name: ndots
value: "2"
initContainers:
- name: seed-voices
image: rhasspy/wyoming-piper:latest
command:
- python3
- -c
args:
- |
import shutil
import ssl
from pathlib import Path
from urllib.request import urlopen
ssl._create_default_https_context = ssl._create_unverified_context
files = {
"en_US-lessac-high.onnx": "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/high/en_US-lessac-high.onnx",
"en_US-lessac-high.onnx.json": "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/high/en_US-lessac-high.onnx.json",
"en_US-lessac-medium.onnx": "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx",
"en_US-lessac-medium.onnx.json": "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json",
"en_US-amy-medium.onnx": "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium/en_US-amy-medium.onnx",
"en_US-amy-medium.onnx.json": "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/amy/medium/en_US-amy-medium.onnx.json",
"en_US-john-medium.onnx": "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/john/medium/en_US-john-medium.onnx",
"en_US-john-medium.onnx.json": "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/john/medium/en_US-john-medium.onnx.json",
"en_GB-cori-high.onnx": "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_GB/cori/high/en_GB-cori-high.onnx",
"en_GB-cori-high.onnx.json": "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_GB/cori/high/en_GB-cori-high.onnx.json",
}
target = Path("/data")
target.mkdir(parents=True, exist_ok=True)
for name, url in files.items():
path = target / name
if path.exists() and path.stat().st_size > 0:
print(f"cached {name}", flush=True)
continue
print(f"downloading {name}", flush=True)
with urlopen(url, timeout=180) as response, open(path, "wb") as download_file:
shutil.copyfileobj(response, download_file)
print(f"ready {name}", flush=True)
volumeMounts:
- name: data
mountPath: /data
containers:
- name: piper
image: rhasspy/wyoming-piper:latest
env:
- name: PYTHONHTTPSVERIFY
value: "0"
args:
- "--voice"
- "en_US-lessac-high"
- "--data-dir"
- "/data"
- "--download-dir"
- "/data"
ports:
- containerPort: 10200
name: wyoming
# Memory bumped after observed OOMKills during real chapter
# renders 2026-04-25. Piper's eSpeak phonemizer + onnx runtime
# spikes well past 1 Gi on long unpunctuated paragraphs from
# PDF / book imports. 3 Gi gives headroom plus the
# transcribe-audio-to-Quick-Read flow that hits Piper through
# the same model.
resources:
requests:
cpu: 250m
memory: 512Mi
limits:
cpu: 2000m
memory: 3Gi
volumeMounts:
- name: data
mountPath: /data
volumes:
- name: data
persistentVolumeClaim:
claimName: ttsreader-piper-data
---
# fc-speech-align — cluster-native faster-whisper wrapper.
# Exposes POST /align (fc-align contract used by FlowerCore.Shared.Speech) AND
# POST /transcribe (audio-file-in feature). CPU model = base.en, int8 compute.
# Source: bluejay-infra/apps/fc-ttsreader/speech-align/ (Dockerfile + app.py).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: ttsreader-align-models
namespace: fc-ttsreader
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: ttsreader-align
namespace: fc-ttsreader
labels:
app.kubernetes.io/name: ttsreader-align
app.kubernetes.io/part-of: flowercore
spec:
replicas: 1
strategy:
type: Recreate
selector:
matchLabels:
app.kubernetes.io/name: ttsreader-align
template:
metadata:
labels:
app.kubernetes.io/name: ttsreader-align
app.kubernetes.io/part-of: flowercore
spec:
# Bypass CoreDNS's *.iamworkin.lan template hijack on public hosts
# (huggingface.co model download at first boot would otherwise resolve
# to Traefik VIP via search expansion). Drops the iamworkin.lan suffix.
dnsPolicy: None
dnsConfig:
nameservers:
- 10.43.0.10
searches:
- fc-ttsreader.svc.cluster.local
- svc.cluster.local
- cluster.local
options:
- name: ndots
value: "2"
securityContext:
fsGroup: 1654
runAsNonRoot: true
runAsUser: 1654
containers:
- name: align
image: localhost/fc-speech-align:v3
imagePullPolicy: Never
ports:
- containerPort: 9200
name: http
env:
- name: WHISPER_MODEL
value: "Systran/faster-whisper-base.en"
- name: WHISPER_DEVICE
value: "cpu"
- name: WHISPER_COMPUTE_TYPE
value: "int8"
- name: WHISPER_CACHE_DIR
value: "/models"
- name: DEFAULT_LANGUAGE
value: "en"
resources:
requests:
cpu: 250m
memory: 512Mi
limits:
cpu: 2000m
memory: 2Gi
volumeMounts:
- name: models
mountPath: /models
readinessProbe:
httpGet:
path: /health
port: 9200
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 18
livenessProbe:
httpGet:
path: /health
port: 9200
initialDelaySeconds: 180
periodSeconds: 30
timeoutSeconds: 5
failureThreshold: 3
volumes:
- name: models
persistentVolumeClaim:
claimName: ttsreader-align-models
---
apiVersion: v1
kind: Service
metadata:
name: ttsreader-align
namespace: fc-ttsreader
spec:
selector:
app.kubernetes.io/name: ttsreader-align
ports:
- port: 9200
targetPort: 9200
name: http
---
# ttsreader-kokoro — Kokoro-82M TTS via the kokoro-fastapi container.
# Provides high-quality English voices alongside Piper for the TtsReader
# render pipeline AND for AiStation when it talks to the cluster TTS plane
# (instead of pointing back at BLUEJAY-WS:10401). Model + voices ship
# inside the container image, so no PVC is needed.
apiVersion: apps/v1
kind: Deployment
metadata:
name: ttsreader-kokoro
namespace: fc-ttsreader
labels:
app.kubernetes.io/name: ttsreader-kokoro
app.kubernetes.io/part-of: flowercore
spec:
replicas: 1
strategy:
type: Recreate
selector:
matchLabels:
app.kubernetes.io/name: ttsreader-kokoro
template:
metadata:
labels:
app.kubernetes.io/name: ttsreader-kokoro
app.kubernetes.io/part-of: flowercore
spec:
# Same DNS bypass as ttsreader-align — without it, the *.iamworkin.lan
# CoreDNS template would hijack hexgrad/Kokoro-82M's HuggingFace-style
# repo lookups during model warmup.
dnsPolicy: None
dnsConfig:
nameservers:
- 10.43.0.10
searches:
- fc-ttsreader.svc.cluster.local
- svc.cluster.local
- cluster.local
options:
- name: ndots
value: "2"
containers:
- name: kokoro
image: ghcr.io/remsky/kokoro-fastapi-cpu:latest
ports:
- containerPort: 8880
name: http
resources:
requests:
cpu: 250m
memory: 1Gi
limits:
cpu: 2000m
memory: 3Gi
readinessProbe:
httpGet:
path: /v1/audio/voices
port: 8880
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 18
# Sprint E Phase 1a (kokoro stability) — 4 restarts in 2d6h with
# exit 143 traced to liveness probe `context deadline exceeded` while
# kokoro was busy synthesizing. /v1/audio/voices shares the FastAPI
# worker pool with /v1/audio/speech, so a long synth can starve the
# probe out within the prior 5s × 3 = 15s window. Bump timeoutSeconds
# 5 → 15 and failureThreshold 3 → 5 → 75s grace before kubelet kills
# the pod. The TtsCircuitBreaker on the synthesizer side (Phase 1b)
# backs this up so the FC backend stops slamming kokoro during
# recovery.
livenessProbe:
httpGet:
path: /v1/audio/voices
port: 8880
initialDelaySeconds: 180
periodSeconds: 30
timeoutSeconds: 15
failureThreshold: 5
---
# fc-biblical-tts — eSpeak-NG-backed Ancient Greek + Hebrew TTS with
# word-level timing for read-along playback. Companion to ttsreader-kokoro
# (modern English) and ttsreader-piper (English narrator); operators pick
# whichever engine matches the source text. Source:
# bluejay-infra/apps/fc-ttsreader/biblical-tts/
apiVersion: apps/v1
kind: Deployment
metadata:
name: ttsreader-biblical
namespace: fc-ttsreader
labels:
app.kubernetes.io/name: ttsreader-biblical
app.kubernetes.io/part-of: flowercore
spec:
replicas: 1
strategy:
type: Recreate
selector:
matchLabels:
app.kubernetes.io/name: ttsreader-biblical
template:
metadata:
labels:
app.kubernetes.io/name: ttsreader-biblical
app.kubernetes.io/part-of: flowercore
spec:
securityContext:
fsGroup: 1654
runAsNonRoot: true
runAsUser: 1654
containers:
- name: biblical-tts
image: localhost/fc-biblical-tts:v20260506-hebrew-translit
imagePullPolicy: Never
ports:
- containerPort: 10402
name: http
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 1000m
memory: 512Mi
readinessProbe:
httpGet:
path: /health
port: 10402
initialDelaySeconds: 5
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 6
livenessProbe:
httpGet:
path: /health
port: 10402
initialDelaySeconds: 30
periodSeconds: 30
timeoutSeconds: 5
failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
name: ttsreader-biblical
namespace: fc-ttsreader
spec:
selector:
app.kubernetes.io/name: ttsreader-biblical
ports:
- port: 10402
targetPort: 10402
name: http
---
# fc-modern-tts — Microsoft Edge Read Aloud bridge for Modern Hebrew
# (he-IL-AvriNeural et al) and Modern Greek (el-GR-NestorasNeural et al).
# Pairs with ttsreader-biblical: biblical engine handles unpointed
# Greek + Hebrew, modern engine handles narrative translations the
# operator reads alongside.
apiVersion: apps/v1
kind: Deployment
metadata:
name: ttsreader-modern
namespace: fc-ttsreader
labels:
app.kubernetes.io/name: ttsreader-modern
app.kubernetes.io/part-of: flowercore
spec:
replicas: 1
strategy:
type: Recreate
selector:
matchLabels:
app.kubernetes.io/name: ttsreader-modern
template:
metadata:
labels:
app.kubernetes.io/name: ttsreader-modern
app.kubernetes.io/part-of: flowercore
spec:
# edge-tts needs egress to *.tts.speech.microsoft.com — bypass the
# iamworkin.lan template hijack so the lookup doesn't fall back to
# Traefik VIP via search expansion.
dnsPolicy: None
dnsConfig:
nameservers:
- 10.43.0.10
searches:
- fc-ttsreader.svc.cluster.local
- svc.cluster.local
- cluster.local
options:
- name: ndots
value: "2"
securityContext:
fsGroup: 1654
runAsNonRoot: true
runAsUser: 1654
containers:
- name: modern-tts
image: localhost/fc-modern-tts:v1
imagePullPolicy: Never
ports:
- containerPort: 10403
name: http
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 1000m
memory: 512Mi
readinessProbe:
httpGet:
path: /health
port: 10403
initialDelaySeconds: 5
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 6
livenessProbe:
httpGet:
path: /health
port: 10403
initialDelaySeconds: 30
periodSeconds: 30
timeoutSeconds: 5
failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
name: ttsreader-modern
namespace: fc-ttsreader
spec:
selector:
app.kubernetes.io/name: ttsreader-modern
ports:
- port: 10403
targetPort: 10403
name: http
---
apiVersion: v1
kind: Service
metadata:
name: ttsreader-kokoro
namespace: fc-ttsreader
spec:
selector:
app.kubernetes.io/name: ttsreader-kokoro
ports:
- port: 8880
targetPort: 8880
name: http
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: ttsreader-web
namespace: fc-ttsreader
labels:
app.kubernetes.io/name: ttsreader-web
app.kubernetes.io/part-of: flowercore
spec:
replicas: 1
strategy:
type: Recreate
selector:
matchLabels:
app.kubernetes.io/name: ttsreader-web
template:
metadata:
labels:
app.kubernetes.io/name: ttsreader-web
app.kubernetes.io/part-of: flowercore
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "5217"
prometheus.io/path: "/metrics"
spec:
securityContext:
fsGroup: 1654
fsGroupChangePolicy: OnRootMismatch
containers:
- name: web
image: localhost/fc-ttsreader-web:v20260518-sprint36-demo-finish-b132cbf
imagePullPolicy: Never
ports:
- containerPort: 5217
name: http
env:
- name: ASPNETCORE_URLS
value: "http://+:5217"
- name: ASPNETCORE_ENVIRONMENT
value: "Production"
- name: FlowerCore__Database__ConnectionStrings__Sqlite
value: "Data Source=/data/ttsreader.db"
- name: TtsReader__Audio__OutputRoot
value: "/data/audio"
- name: TtsReader__Audio__FfmpegPath
value: "/usr/bin/ffmpeg"
- name: TtsReader__Bible__CorpusRoot
value: "/data/corpus-cache/world-english-bible/eng/usx"
- name: TtsReader__ChapterContext__DatabasePath
value: "/data/chapter-context.db"
- name: TtsReader__Jobs__Root
value: "/data/jobs"
- name: TtsReader__Piper__Host
value: "10.0.57.17"
- name: TtsReader__Piper__Port
value: "8500"
- name: TtsReader__Piper__Transport
value: "http"
- name: TtsReader__Piper__HttpPath
value: "/tts"
- name: TtsReader__Kokoro__Enabled
value: "true"
- name: TtsReader__Kokoro__BaseUrl
# Cluster-native ttsreader-kokoro Service — replaces the prior
# BLUEJAY-WS host pointer so the render pipeline doesn't need
# the workstation up. AiStation can still hit its local
# http://localhost:8880 instance.
value: "http://ttsreader-kokoro.fc-ttsreader.svc.cluster.local.:8880"
- name: TtsReader__Kokoro__TimeoutSeconds
value: "120"
- name: FlowerCore__Tts__BiblicalTts__Enabled
value: "true"
- name: FlowerCore__Tts__BiblicalTts__BaseUrl
value: "http://ttsreader-biblical.fc-ttsreader.svc.cluster.local.:10402"
- name: FlowerCore__Tts__BiblicalTts__TimeoutSeconds
value: "60"
- name: FlowerCore__Tts__BiblicalTts__DefaultLanguage
value: "grc"
- name: Speech__Alignment__Enabled
# Cluster-native faster-whisper (Lane F, 2026-04-25). The
# ttsreader-align deployment in this manifest wraps
# SYSTRAN/faster-whisper with a /align endpoint matching the
# FlowerCore.Shared.Speech master contract.
value: "true"
- name: Speech__Alignment__BaseUrl
value: "http://ttsreader-align.fc-ttsreader.svc.cluster.local.:9200"
- name: Speech__Alignment__TimeoutSeconds
value: "120"
# Cluster-native transcription endpoint shares the same pod
# (POST /transcribe). Lane G consumes this from the
# FlowerCore.TtsReader.Web AudioImport feature.
- name: TtsReader__Transcription__Enabled
value: "true"
- name: TtsReader__Transcription__BaseUrl
value: "http://ttsreader-align.fc-ttsreader.svc.cluster.local.:9200"
- name: TtsReader__Transcription__TimeoutSeconds
value: "300"
- name: TtsReader__Ollama__BaseUrl
value: "http://10.0.57.17:11434"
- name: TtsReader__Ollama__DefaultModel
value: "gemma3:4b"
- name: TtsReader__Ollama__TimeoutSeconds
value: "45"
- name: TtsReader__Runtime__LogsRoot
value: "/data/logs"
- name: TtsReader__Runtime__SmokeStatePath
value: "/data/ops/smoke-status.json"
# Sprint E Day 8 voice-preview disk cache — writes WAVs under
# this directory. Default "data/voice-previews" resolves to
# the read-only $HOME path under runAsNonRoot=true. Pin to
# the writable PVC mount.
- name: TtsReader__Preview__CacheDirectory
value: "/data/voice-previews"
- name: TtsReader__VoiceLibrary__ReferenceClip__Directory
value: "/data/voice-reference-clips"
# Sprint E XXL Phase 4γ — content-addressed CDN bundle dir for
# POST /api/v1/render. Default "wwwroot/cdn" resolves under the
# read-only app filesystem, so pin to the writable PVC mount
# alongside other TtsReader runtime data. Manifests + cue audio
# land at /data/cdn/sha256/<hash>/manifest.json + cues/.
- name: TtsReader__Render__CdnDirectory
value: "/data/cdn"
- name: Auth__ApiKey
valueFrom:
secretKeyRef:
name: ttsreader-secrets
key: Auth__ApiKey
optional: true
- name: Auth__AdminApiKey
valueFrom:
secretKeyRef:
name: ttsreader-secrets
key: Auth__AdminApiKey
optional: true
resources:
requests:
# The cluster is currently saturated on requested CPU by
# remotedesktop workloads even when real usage is low.
# Keep the web frontend schedulable under that pressure.
cpu: 10m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
volumeMounts:
- name: data
mountPath: /data
- name: tmp
mountPath: /tmp
securityContext:
runAsNonRoot: true
runAsUser: 1654
runAsGroup: 1654
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
readinessProbe:
httpGet:
path: /health
port: 5217
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe:
httpGet:
path: /health
port: 5217
initialDelaySeconds: 15
periodSeconds: 30
volumes:
- name: data
persistentVolumeClaim:
claimName: ttsreader-data
- name: tmp
emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
name: ttsreader-piper
namespace: fc-ttsreader
spec:
selector:
app.kubernetes.io/name: ttsreader-piper
ports:
- port: 10200
targetPort: 10200
name: wyoming
---
apiVersion: v1
kind: Service
metadata:
name: ttsreader-web
namespace: fc-ttsreader
spec:
selector:
app.kubernetes.io/name: ttsreader-web
ports:
- port: 5217
targetPort: 5217
name: http
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: ttsreader-piper-data
namespace: fc-ttsreader
spec:
accessModes:
- ReadWriteOnce
storageClassName: longhorn
resources:
requests:
storage: 2Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: ttsreader-data
namespace: fc-ttsreader
spec:
accessModes:
- ReadWriteOnce
storageClassName: longhorn
resources:
requests:
storage: 5Gi
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: ttsreader-cert
namespace: fc-ttsreader
spec:
secretName: ttsreader-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- ttsreader.iamworkin.lan
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: ttsreader-web
namespace: fc-ttsreader
spec:
entryPoints:
- websecure
routes:
- match: Host(`ttsreader.iamworkin.lan`)
kind: Rule
services:
- name: ttsreader-web
port: 5217
tls:
secretName: ttsreader-tls

View File

@@ -0,0 +1,36 @@
# FlowerCore modern-tts — wraps Microsoft Edge's Read Aloud TTS service
# (via the edge-tts Python package) to give the cluster studio-quality
# Modern Hebrew (he-IL-*) and Modern Greek (el-GR-*) voices alongside the
# eSpeak biblical engine. Same shape as fc-biblical-tts so the .NET client
# lives in the same Shared.Speech package.
#
# Note: edge-tts depends on Microsoft's public Edge endpoint; the cluster
# pod needs egress to *.tts.speech.microsoft.com. dnsPolicy: None on the
# Deployment makes sure the iamworkin.lan template hijack doesn't rewrite
# the lookup back to Traefik VIP.
FROM python:3.12-slim
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1 \
PIP_NO_CACHE_DIR=1
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt /app/
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py /app/
RUN useradd --create-home --shell /usr/sbin/nologin --uid 1654 tts
USER 1654
EXPOSE 10403
HEALTHCHECK --interval=30s --timeout=5s --start-period=20s --retries=3 \
CMD python -c "import urllib.request,sys; urllib.request.urlopen('http://127.0.0.1:10403/health',timeout=3); sys.exit(0)" || exit 1
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "10403", "--workers", "1"]

View File

@@ -0,0 +1,238 @@
"""FlowerCore modern-tts — Microsoft Edge Read Aloud bridge for Modern
Hebrew and Modern Greek (and other Edge-supported languages).
Endpoints:
* POST /tts — body: {"text", "voice", "rate"?, "volume"?, "pitch"?}
returns audio/mpeg (Edge returns MP3) which the
upstream FasterWhisperAlignmentClient + the WPF
MediaPlayer both handle natively.
* POST /timings — same body shape but returns
{"text", "voice", "words": [{"text","startMs","endMs"}],
"durationMs": ...} sourced from Edge's WordBoundary
events — much more accurate than eSpeak's
proportional-distribution approach because Edge
emits real per-word offsets during synthesis.
* GET /voices — voice catalog Edge knows about. Filtered to
Hebrew + Greek by default; ?language=all returns
everything Edge supports.
* GET /health — fast readiness check.
Pairs with fc-biblical-tts (eSpeak Ancient Greek + Hebrew). The biblical
engine handles unpointed Hebrew + Erasmian Greek; this engine handles
narrative Modern Hebrew + Modern Greek for translations the operator
might be reading alongside the original.
"""
from __future__ import annotations
import io
import logging
from typing import Optional
import edge_tts
from fastapi import FastAPI, HTTPException
from fastapi.responses import JSONResponse, Response
from pydantic import BaseModel
LOG = logging.getLogger("modern_tts")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
app = FastAPI(title="FlowerCore modern-tts", version="1.0.0")
# Default voices by short code so AiStation can pick a sensible default
# when the operator hasn't explicitly asked for one. Edge has multiple
# voices per locale — these are the calmest male+female narrators.
DEFAULT_VOICES = {
"he": "he-IL-AvriNeural",
"he-IL": "he-IL-AvriNeural",
"el": "el-GR-NestorasNeural",
"el-GR": "el-GR-NestorasNeural",
"en": "en-US-AriaNeural",
}
class TtsRequest(BaseModel):
text: str
voice: Optional[str] = None
language: Optional[str] = None
rate: str = "+0%" # Edge accepts +20%, -10%, etc.
volume: str = "+0%"
pitch: str = "+0Hz"
def _resolve_voice(req: TtsRequest) -> str:
if req.voice:
return req.voice.strip()
if req.language and req.language in DEFAULT_VOICES:
return DEFAULT_VOICES[req.language]
return DEFAULT_VOICES["he"]
@app.get("/health")
def health():
return {"status": "ok"}
@app.get("/voices")
async def voices(language: str = "default"):
catalog = await edge_tts.list_voices()
if language == "all":
return {"voices": catalog}
# Default response: filter to languages relevant to the FlowerCore
# biblical workflow (Hebrew + Greek) so the AiStation voice picker
# isn't overwhelmed by 400+ Edge voices.
keep = ("he-", "el-")
filtered = [v for v in catalog if any(v.get("ShortName", "").startswith(k) for k in keep)]
return {"voices": filtered}
async def _synth_with_subtitles(req: TtsRequest):
voice = _resolve_voice(req)
LOG.info("edge-tts synth voice=%s len=%d", voice, len(req.text))
communicate = edge_tts.Communicate(
req.text,
voice=voice,
rate=req.rate,
volume=req.volume,
pitch=req.pitch,
)
audio_buf = io.BytesIO()
word_events: list[dict] = []
async for chunk in communicate.stream():
if chunk["type"] == "audio":
audio_buf.write(chunk["data"])
elif chunk["type"] == "WordBoundary":
word_events.append({
"text": chunk.get("text") or "",
"offset": chunk.get("offset", 0), # 100-ns ticks
"duration": chunk.get("duration", 0), # 100-ns ticks
})
return voice, audio_buf.getvalue(), word_events
def _to_ms(ticks_100ns: int) -> int:
# Edge emits offsets in 100-nanosecond ticks (.NET TimeSpan style).
return int(round(ticks_100ns / 10_000))
@app.post("/tts")
async def tts(req: TtsRequest):
if not req.text.strip():
raise HTTPException(status_code=400, detail="text is required")
try:
voice, audio_bytes, _ = await _synth_with_subtitles(req)
except edge_tts.exceptions.NoAudioReceived:
raise HTTPException(status_code=502, detail="edge-tts returned no audio for the supplied voice/text.")
except Exception as ex:
raise HTTPException(status_code=502, detail=f"edge-tts failure: {ex}")
if not audio_bytes:
raise HTTPException(status_code=502, detail="edge-tts returned an empty audio stream.")
return Response(content=audio_bytes, media_type="audio/mpeg",
headers={"X-FlowerCore-Voice": voice})
def _estimate_duration_ms_from_mp3(audio_bytes: bytes) -> int:
"""Best-effort duration estimate from raw MP3 bytes by walking frame
headers. Edge always returns CBR ~24kbps mono so we can infer total ms
from frame count. If parsing fails, return 0 and let the caller fall
through to a per-character heuristic."""
if not audio_bytes:
return 0
# MP3 sample rates by version+layer (MPEG1 layer3 / MPEG2 layer3 / MPEG2.5 layer3).
# We just walk frame headers and count frames; each frame is 1152 samples.
sample_rates_v1 = [44100, 48000, 32000, 0]
sample_rates_v2 = [22050, 24000, 16000, 0]
sample_rates_v25 = [11025, 12000, 8000, 0]
bitrates_v1_l3 = [0,32000,40000,48000,56000,64000,80000,96000,112000,128000,160000,192000,224000,256000,320000,0]
bitrates_v2_l3 = [0,8000,16000,24000,32000,40000,48000,56000,64000,80000,96000,112000,128000,144000,160000,0]
pos = 0
total_samples = 0
sample_rate = 0
while pos + 4 <= len(audio_bytes):
b0, b1, b2, b3 = audio_bytes[pos], audio_bytes[pos+1], audio_bytes[pos+2], audio_bytes[pos+3]
if b0 != 0xFF or (b1 & 0xE0) != 0xE0:
pos += 1
continue
version_bits = (b1 >> 3) & 0x03
layer_bits = (b1 >> 1) & 0x03
if layer_bits != 0x01: # layer 3 only
pos += 1
continue
bitrate_index = (b2 >> 4) & 0x0F
sample_rate_index = (b2 >> 2) & 0x03
padding = (b2 >> 1) & 0x01
if version_bits == 0x03: # MPEG1
sample_rate = sample_rates_v1[sample_rate_index]
bitrate = bitrates_v1_l3[bitrate_index]
samples_per_frame = 1152
elif version_bits == 0x02: # MPEG2
sample_rate = sample_rates_v2[sample_rate_index]
bitrate = bitrates_v2_l3[bitrate_index]
samples_per_frame = 576
elif version_bits == 0x00: # MPEG2.5
sample_rate = sample_rates_v25[sample_rate_index]
bitrate = bitrates_v2_l3[bitrate_index]
samples_per_frame = 576
else:
pos += 1
continue
if not (sample_rate and bitrate):
pos += 1
continue
frame_length = int((samples_per_frame * bitrate / 8) / sample_rate) + padding
if frame_length <= 0:
pos += 1
continue
total_samples += samples_per_frame
pos += frame_length
if sample_rate <= 0:
return 0
return int(round(total_samples * 1000 / sample_rate))
@app.post("/timings")
async def timings(req: TtsRequest):
if not req.text.strip():
raise HTTPException(status_code=400, detail="text is required")
try:
voice, audio_bytes, events = await _synth_with_subtitles(req)
except Exception as ex:
raise HTTPException(status_code=502, detail=f"edge-tts failure: {ex}")
words: list[dict] = []
for event in events:
start = _to_ms(event["offset"])
end = start + _to_ms(event["duration"])
words.append({"text": event.get("text", ""), "startMs": start, "endMs": end})
# Edge sometimes omits WordBoundary events for non-English voices
# (notably he-IL-* and el-GR-*). Fall back to proportional distribution
# over the input text — same approach the eSpeak biblical-tts uses.
if not words and req.text.strip():
total_ms = _estimate_duration_ms_from_mp3(audio_bytes)
if total_ms <= 0:
# Last-resort fallback: ~600ms per word at average speaking rate.
total_ms = max(1, len(req.text.split())) * 600
tokens = req.text.split()
if tokens:
char_total = sum(max(1, len(w)) for w in tokens)
cursor = 0
for token in tokens:
share = int(round(total_ms * max(1, len(token)) / char_total))
start = cursor
end = start + share
words.append({"text": token, "startMs": start, "endMs": end})
cursor = end
words[-1]["endMs"] = total_ms
duration_ms = words[-1]["endMs"] if words else 0
return JSONResponse({
"text": req.text,
"voice": voice,
"words": words,
"durationMs": duration_ms,
"audioBytes": len(audio_bytes),
})

View File

@@ -0,0 +1,3 @@
fastapi==0.115.6
uvicorn==0.34.0
edge-tts==7.2.8

View File

@@ -0,0 +1,47 @@
# FlowerCore speech-align — wraps SYSTRAN/faster-whisper with /align +
# /transcribe endpoints used by FlowerCore.TtsReader. CPU-only image; the
# default int8 compute type runs base.en at ~real-time on a single core.
#
# Build: podman build -t localhost/fc-speech-align:<ver> .
# Run: podman run --rm -p 9200:9200 -v fc-speech-align-models:/models localhost/fc-speech-align:<ver>
FROM python:3.12-slim AS base
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1 \
PIP_NO_CACHE_DIR=1 \
WHISPER_MODEL=Systran/faster-whisper-base.en \
WHISPER_CACHE_DIR=/models \
WHISPER_DEVICE=cpu \
WHISPER_COMPUTE_TYPE=int8 \
DEFAULT_LANGUAGE=en \
MAX_AUDIO_BYTES=52428800
# faster-whisper depends on libsndfile1 + libgomp1 (OpenMP runtime). ffmpeg is
# pulled in for non-WAV inputs (transcribe accepts any container).
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
libsndfile1 \
libgomp1 \
ffmpeg \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt /app/
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py /app/
# Run as a non-root user to satisfy K8s securityContext.runAsNonRoot.
RUN useradd --create-home --shell /usr/sbin/nologin --uid 1654 align \
&& mkdir -p /models \
&& chown -R 1654:1654 /models
USER 1654
EXPOSE 9200
HEALTHCHECK --interval=30s --timeout=5s --start-period=120s --retries=3 \
CMD python -c "import urllib.request,sys; urllib.request.urlopen('http://127.0.0.1:9200/health',timeout=3); sys.exit(0)" || exit 1
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "9200", "--workers", "1"]

View File

@@ -0,0 +1,181 @@
"""FlowerCore speech-align service.
Wraps SYSTRAN/faster-whisper (https://github.com/SYSTRAN/faster-whisper) in a
small FastAPI app exposing two endpoints:
* POST /align — fc-align contract used by FlowerCore.Shared.Speech's
FasterWhisperAlignmentClient on master. Multipart form
(`audio`, `language`) returns
`{text, words: [{word, startSeconds, endSeconds, confidence}],
durationMs, language}`.
* POST /transcribe — audio-file-in transcription used by the new TtsReader
audio-import feature. Multipart form (`audio`, optional
`language`) returns `{text, language, durationMs,
segments: [{startSeconds, endSeconds, text}]}` so the
UI can preview the transcript before piping it into
Quick Read or saving as a project.
Both endpoints share the same WhisperModel instance (loaded once at startup).
Model is pinned by the WHISPER_MODEL env var (defaults to base.en) and cached
under WHISPER_CACHE_DIR (defaults to /models, backed by a PVC in K8s).
Health: GET /health → {status: ok, model, device, computeType}.
"""
from __future__ import annotations
import io
import logging
import os
import time
from contextlib import asynccontextmanager
from typing import Optional
from fastapi import FastAPI, File, Form, HTTPException, UploadFile
from fastapi.responses import JSONResponse
from faster_whisper import WhisperModel
LOG = logging.getLogger("speech_align")
logging.basicConfig(
level=os.environ.get("LOG_LEVEL", "INFO"),
format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
MODEL_NAME = os.environ.get("WHISPER_MODEL", "Systran/faster-whisper-base.en")
DEVICE = os.environ.get("WHISPER_DEVICE", "cpu")
COMPUTE_TYPE = os.environ.get("WHISPER_COMPUTE_TYPE", "int8")
CACHE_DIR = os.environ.get("WHISPER_CACHE_DIR", "/models")
MAX_BYTES = int(os.environ.get("MAX_AUDIO_BYTES", str(50 * 1024 * 1024))) # 50 MB
DEFAULT_LANGUAGE = os.environ.get("DEFAULT_LANGUAGE", "en")
_state: dict[str, object] = {}
@asynccontextmanager
async def lifespan(_app: FastAPI):
LOG.info("Loading faster-whisper model %s (device=%s compute=%s cache=%s)", MODEL_NAME, DEVICE, COMPUTE_TYPE, CACHE_DIR)
started = time.time()
model = WhisperModel(MODEL_NAME, device=DEVICE, compute_type=COMPUTE_TYPE, download_root=CACHE_DIR)
_state["model"] = model
LOG.info("Model loaded in %.2fs", time.time() - started)
yield
_state.clear()
app = FastAPI(title="FlowerCore speech-align", version="1.0.0", lifespan=lifespan)
def _get_model() -> WhisperModel:
model = _state.get("model")
if model is None:
raise HTTPException(status_code=503, detail="Model not loaded yet")
return model # type: ignore[return-value]
async def _read_upload(upload: UploadFile) -> bytes:
payload = await upload.read()
if not payload:
raise HTTPException(status_code=400, detail="audio is empty")
if len(payload) > MAX_BYTES:
raise HTTPException(
status_code=413,
detail=f"audio exceeds {MAX_BYTES} byte limit ({len(payload)} bytes received)",
)
return payload
def _normalize_language(value: Optional[str]) -> Optional[str]:
if not value or not value.strip():
return DEFAULT_LANGUAGE
return value.strip().lower()
def _transcribe_bytes(audio_bytes: bytes, language: Optional[str], word_timestamps: bool):
model = _get_model()
started = time.time()
segments_iter, info = model.transcribe(
io.BytesIO(audio_bytes),
language=language,
word_timestamps=word_timestamps,
beam_size=1,
vad_filter=True,
)
segments = list(segments_iter)
elapsed_ms = int((time.time() - started) * 1000)
return segments, info, elapsed_ms
@app.get("/health")
def health():
return {
"status": "ok" if _state.get("model") is not None else "loading",
"model": MODEL_NAME,
"device": DEVICE,
"computeType": COMPUTE_TYPE,
"defaultLanguage": DEFAULT_LANGUAGE,
"maxBytes": MAX_BYTES,
}
@app.post("/align")
async def align(audio: UploadFile = File(...), language: str = Form(DEFAULT_LANGUAGE)):
"""fc-align contract — used by FlowerCore.Shared.Speech.FasterWhisperAlignmentClient."""
payload = await _read_upload(audio)
lang = _normalize_language(language)
segments, info, elapsed_ms = _transcribe_bytes(payload, lang, word_timestamps=True)
text_parts: list[str] = []
words: list[dict] = []
for segment in segments:
text_parts.append(segment.text.strip())
for word in (segment.words or []):
# Field names MUST match the FlowerCore.Shared.Speech contract:
# `text` / `startMs` / `endMs`. The deployed FasterWhisperAlignmentClient
# ignores any other names — see Common's
# FasterWhisperAlignmentResponse / FasterWhisperWord.
words.append({
"text": word.word.strip(),
"startMs": int((word.start or 0.0) * 1000),
"endMs": int((word.end or 0.0) * 1000),
# Confidence is informational and ignored by the C# client today,
# but kept on the wire for future scoring + fc-align operators
# that want to surface low-confidence words.
"confidence": float(getattr(word, "probability", 0.0) or 0.0),
})
duration_ms = int((info.duration or 0.0) * 1000)
return JSONResponse({
"text": " ".join(p for p in text_parts if p).strip(),
"words": words,
"durationMs": duration_ms,
"language": info.language or lang,
"elapsedMs": elapsed_ms,
})
@app.post("/transcribe")
async def transcribe(audio: UploadFile = File(...), language: Optional[str] = Form(None)):
"""Audio-in transcription contract — used by the new TtsReader audio-import feature.
Returns full segments (no per-word timestamps) so the UI can preview the
transcript before piping it into Quick Read or saving as a project.
"""
payload = await _read_upload(audio)
lang = _normalize_language(language)
segments, info, elapsed_ms = _transcribe_bytes(payload, lang, word_timestamps=False)
out_segments = [
{
"startSeconds": float(segment.start or 0.0),
"endSeconds": float(segment.end or 0.0),
"text": segment.text.strip(),
}
for segment in segments
]
return JSONResponse({
"text": " ".join(s["text"] for s in out_segments if s["text"]).strip(),
"segments": out_segments,
"language": info.language or lang,
"durationMs": int((info.duration or 0.0) * 1000),
"elapsedMs": elapsed_ms,
})

View File

@@ -0,0 +1,8 @@
faster-whisper==1.0.3
fastapi==0.115.0
uvicorn[standard]==0.30.6
python-multipart==0.0.10
# faster-whisper 1.0.3's utils module imports requests but doesn't pin it as a
# transitive dep — pin explicitly so the image isn't relying on whatever
# happens to be in the base image.
requests==2.32.3

47
apps/fc-updater/README.md Normal file
View File

@@ -0,0 +1,47 @@
# fc-updater — Update Center GitOps adoption
**Status:** adopted into `bluejay-infra` on 2026-05-06. The live ArgoCD
Application is `infra-fc-updater`, generated by the `bluejay-infra`
ApplicationSet with automated sync, `prune: true`, and `selfHeal: true`.
## Managed manifest set
`apps/fc-updater/fc-updater.yaml` manages:
- `Namespace/fc-updater`
- `PersistentVolumeClaim/updatecenter-data`
- `Deployment/updatecenter-web`
- `Service/updatecenter-web`
- `Certificate/updatecenter-web-tls`
- `Certificate/updatecenter-web-internal-tls`
- `IngressRoute/updatecenter-web`
- `IngressRoute/updatecenter-web-internal`
- `IngressRoute/updatecenter-web-public`
The Deployment intentionally sets `revisionHistoryLimit: 3` and
`strategy.type: Recreate`. The service is singleton + SQLite/local bundle
storage on `PersistentVolumeClaim/updatecenter-data`, pinned to
`rke2-server`.
## Runtime dependencies intentionally not stored here
These live Secrets are pre-existing runtime material and are not committed to
Git:
- `updater-bootstrap-auth`
- `updater-signing`
- `updater-webhooks`
- `cf-origin-flowercore-io`
Rotate the Cloudflare Origin Certificate through
`FlowerCore.Notes/docs/standards/code-signing-rotation-runbook.md`; the
shared origin cert must exist in every namespace that serves a
`*.flowercore.io` public IngressRoute.
## Verification
```powershell
kubectl.exe --kubeconfig C:\Users\AndrewStoltz\.kube\rke2.yaml -n argocd get application infra-fc-updater
kubectl.exe --kubeconfig C:\Users\AndrewStoltz\.kube\rke2.yaml -n fc-updater get deploy,svc,ingressroute,certificate,pvc
curl.exe -sk https://update.flowercore.io/api/v1/manifests/_schema
```

View File

@@ -0,0 +1,269 @@
# FlowerCore Update Center
# GitOps adoption of the live fc-updater namespace after PUB-1/PUB-3.
# Runtime credentials remain in existing K8s Secrets; do not store them here.
---
apiVersion: v1
kind: Namespace
metadata:
name: fc-updater
labels:
app.kubernetes.io/part-of: flowercore
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: updatecenter-data
namespace: fc-updater
labels:
app.kubernetes.io/name: updatecenter-web
app.kubernetes.io/part-of: flowercore
spec:
accessModes:
- ReadWriteOnce
storageClassName: longhorn
volumeMode: Filesystem
resources:
requests:
# Sized for fleet bundle storage (LocalFsBundleStore.MaxTotalBytes
# soft cap at 25 GiB per project_uc_remaining_4_apps_signed_2026_05_06).
# Mike Bundle alone is ~5.1 GiB; cluster live capacity is already
# 20 GiB after a manual expand. PVCs cannot shrink, so git must track
# at least the live size to avoid the OutOfSync loop.
storage: 25Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: updatecenter-web
namespace: fc-updater
labels:
app: updatecenter-web
app.kubernetes.io/name: updatecenter-web
app.kubernetes.io/part-of: flowercore
spec:
replicas: 1
revisionHistoryLimit: 3
strategy:
# SQLite + local bundle storage live on a single RWO PVC. Recreate avoids
# two pods overlapping the same write path during future image bumps.
type: Recreate
selector:
matchLabels:
app: updatecenter-web
template:
metadata:
labels:
app: updatecenter-web
spec:
nodeName: rke2-server
containers:
- name: web
image: localhost/fc-updater-web:v20260509-4162dca-authgate
imagePullPolicy: Never
ports:
- containerPort: 8080
name: http
env:
- name: ASPNETCORE_URLS
value: http://+:8080
- name: FlowerCore__Updater__Database__Provider
value: sqlite
- name: FlowerCore__Updater__Database__ConnectionString
value: Data Source=/data/updatecenter.db
- name: FlowerCore__Updater__BundleStorage__LocalFs__RootDirectory
value: /data/bundles
- name: FlowerCore__Updater__PublicShares__RequirePublicVisibilityOnPublicHosts
value: "true"
- name: FlowerCore__Updater__PublicShares__Links__0__Code
value: 8f3c2a9e7d41
- name: FlowerCore__Updater__PublicShares__Links__0__AppId
value: flowercore.faith-ai-mike
- name: FlowerCore__Updater__PublicShares__Links__0__Channel
value: stable
- name: FlowerCore__Updater__PublicShares__Links__0__RuntimeId
value: win-x64
- name: FlowerCore__Updater__PublicShares__Links__0__DisplayName
value: Faith AI Mike Edition
- name: FlowerCore__Updater__PublicShares__Links__0__Headline
value: Faith AI Mike Edition
- name: FlowerCore__Updater__PublicShares__Links__0__Description
value: Private release link for Mike's Faith AI bundle.
- name: FlowerCore__Updater__Auth__Bootstrap__Enabled
value: "true"
- name: FlowerCore__Updater__Auth__Bootstrap__Username
valueFrom:
secretKeyRef:
name: updater-bootstrap-auth
key: username
- name: FlowerCore__Updater__Auth__Bootstrap__Password
valueFrom:
secretKeyRef:
name: updater-bootstrap-auth
key: password
- name: FlowerCore__Updater__Auth__Bootstrap__SigningKey
valueFrom:
secretKeyRef:
name: updater-bootstrap-auth
key: signing-key
- name: FlowerCore__Updater__Signing__AutoSignOnPublish
value: "true"
- name: FlowerCore__Updater__Signing__RequireSignatureOnPublish
value: "true"
- name: FlowerCore__Updater__Signing__PfxBase64
valueFrom:
secretKeyRef:
name: updater-signing
key: pfx-base64
- name: FlowerCore__Updater__Signing__PfxPassword
valueFrom:
secretKeyRef:
name: updater-signing
key: pfx-password
- name: FlowerCore__Updater__Signing__OpItemReference
value: op://FlowerCore/step-ca-codesign
- name: FlowerCore__Updater__Signing__TrustAnchorPath
value: /etc/flowercore-updater/signing/root-ca.pem
- name: FlowerCore__Updater__GitHub__Token
valueFrom:
secretKeyRef:
name: updater-webhooks
key: github-token
- name: FlowerCore__Updater__GitHub__WebhookSecret
valueFrom:
secretKeyRef:
name: updater-webhooks
key: github-webhook-secret
- name: FlowerCore__Updater__Gitea__Token
valueFrom:
secretKeyRef:
name: updater-webhooks
key: gitea-token
- name: FlowerCore__Updater__Gitea__WebhookSecret
valueFrom:
secretKeyRef:
name: updater-webhooks
key: gitea-webhook-secret
readinessProbe:
tcpSocket:
port: http
initialDelaySeconds: 10
periodSeconds: 15
livenessProbe:
tcpSocket:
port: http
initialDelaySeconds: 30
periodSeconds: 30
volumeMounts:
- name: data
mountPath: /data
- name: signing
mountPath: /etc/flowercore-updater/signing
readOnly: true
volumes:
- name: data
persistentVolumeClaim:
claimName: updatecenter-data
- name: signing
secret:
secretName: updater-signing
items:
- key: root-ca.pem
path: root-ca.pem
---
apiVersion: v1
kind: Service
metadata:
name: updatecenter-web
namespace: fc-updater
labels:
app: updatecenter-web
app.kubernetes.io/name: updatecenter-web
app.kubernetes.io/part-of: flowercore
spec:
type: ClusterIP
selector:
app: updatecenter-web
ports:
- name: http
port: 8080
targetPort: http
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: updatecenter-web-tls
namespace: fc-updater
spec:
secretName: updatecenter-web-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- updatecenter.iamworkin.lan
- updates.iamworkin.lan
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: updatecenter-web-internal-tls
namespace: fc-updater
spec:
secretName: updatecenter-web-internal-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- updatecenter-internal.iamworkin.lan
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: updatecenter-web
namespace: fc-updater
spec:
entryPoints:
- web
- websecure
routes:
- match: (Host(`updatecenter.iamworkin.lan`) || Host(`updates.iamworkin.lan`)) && (Method(`GET`) || Method(`HEAD`) || Method(`POST`) || Method(`OPTIONS`))
kind: Rule
services:
- name: updatecenter-web
port: 8080
tls:
secretName: updatecenter-web-tls
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: updatecenter-web-internal
namespace: fc-updater
spec:
entryPoints:
- web
- websecure
routes:
- match: Host(`updatecenter-internal.iamworkin.lan`)
kind: Rule
services:
- name: updatecenter-web
port: 8080
tls:
secretName: updatecenter-web-internal-tls
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: updatecenter-web-public
namespace: fc-updater
spec:
entryPoints:
- websecure
routes:
- match: (Host(`update.flowercore.io`) || Host(`updates.flowercore.io`)) && (Method(`GET`) || Method(`HEAD`) || Method(`POST`) || Method(`OPTIONS`))
kind: Rule
services:
- name: updatecenter-web
port: 8080
tls:
secretName: cf-origin-flowercore-io

View File

@@ -0,0 +1,7 @@
# ArgoCD's bluejay-infra ApplicationSet uses a directory generator and does
# not require kustomization.yaml. Keep this anyway as the manifest inventory
# and for local `kubectl kustomize apps/fc-updater` previews.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- fc-updater.yaml

View File

@@ -1,6 +1,11 @@
# FlowerCore Tenant — flowercore.io (main brand)
# Public-facing placeholder landing page served by nginx
# ArgoCD managed - BlueJay Lab
# FlowerCore Tenant — retired flowercore.io placeholder.
#
# Public flowercore.io/www.flowercore.io routing is now owned by
# apps/fc-landing/fc-landing.yaml. This tenant placeholder remains available
# only as an in-cluster service; do not create a duplicate public
# IngressRoute here because it competes with fc-landing and requires a
# namespace-local cf-origin-flowercore-io Secret.
# ArgoCD managed - BlueJay Lab
---
apiVersion: v1
kind: Namespace
@@ -10,15 +15,9 @@ metadata:
app.kubernetes.io/part-of: bluejay-infra
flowercore.io/tenant: flowercore
---
# NOTE: The existing cf-origin-flowercore-io secret (covering *.flowercore.io)
# must be copied into this namespace. It already exists in other namespaces.
# Copy with: kubectl get secret cf-origin-flowercore-io -n fc-system -o yaml \
# | sed 's/namespace: .*/namespace: tenant-flowercore/' \
# | kubectl apply -f -
---
# Landing page HTML
apiVersion: v1
kind: ConfigMap
# Landing page HTML
apiVersion: v1
kind: ConfigMap
metadata:
name: flowercore-web-html
namespace: tenant-flowercore
@@ -308,25 +307,6 @@ spec:
selector:
app: flowercore-web
ports:
- port: 80
targetPort: 80
name: http
---
# Traefik IngressRoute — public via Cloudflare
# Uses existing cf-origin-flowercore-io cert (must be copied to this namespace)
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: flowercore-web
namespace: tenant-flowercore
spec:
entryPoints:
- websecure
routes:
- match: Host(`flowercore.io`) || Host(`www.flowercore.io`)
kind: Rule
services:
- name: flowercore-web
port: 80
tls:
secretName: cf-origin-flowercore-io
- port: 80
targetPort: 80
name: http

2
apps/github-runner/.gitattributes vendored Normal file
View File

@@ -0,0 +1,2 @@
*.sh text eol=lf
Dockerfile text eol=lf

View File

@@ -0,0 +1,54 @@
FROM myoung34/github-runner:latest
ARG RUBY_VERSION=3.3.11
ARG RUBY_MINOR=3.3
ARG RUBY_BUILD_VERSION=v20260326
ARG RUNNER_UID=1001
ARG RUNNER_GID=1001
ENV RUNNER_TOOL_CACHE=/home/runner/_tool
ENV RUNNER_RUBY_TOOLCACHE=/opt/runner-toolcache
ENV PATH="/home/runner/_tool/Ruby/${RUBY_MINOR}/x64/bin:/opt/runner-toolcache/Ruby/${RUBY_MINOR}/x64/bin:${PATH}"
USER root
# Bake the IAmWorkin step-ca root CA into the system trust store. Without
# this, .NET HttpClient calls from CI tests against *.iamworkin.lan
# (e.g. https://selenium.iamworkin.lan/session) fail with `PartialChain`
# because the runner image's default Ubuntu trust bundle doesn't include
# our internal Root CA. update-ca-certificates regenerates
# /etc/ssl/certs/ca-certificates.crt, which OpenSSL + .NET on Linux read
# automatically — no SSL_CERT_FILE env var needed.
COPY step-ca-root.crt /usr/local/share/ca-certificates/iamworkin-step-ca-root.crt
RUN apt-get update \
&& DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
autoconf \
bison \
build-essential \
ca-certificates \
curl \
libdb-dev \
libffi-dev \
libgdbm-dev \
libgmp-dev \
libncurses-dev \
libreadline-dev \
libssl-dev \
libyaml-dev \
patch \
pkg-config \
uuid-dev \
zlib1g-dev \
&& update-ca-certificates \
&& curl -fsSL "https://github.com/rbenv/ruby-build/archive/refs/tags/${RUBY_BUILD_VERSION}.tar.gz" -o /tmp/ruby-build.tar.gz \
&& mkdir -p /tmp/ruby-build \
&& tar -xzf /tmp/ruby-build.tar.gz --strip-components=1 -C /tmp/ruby-build \
&& /tmp/ruby-build/install.sh \
&& rm -rf /tmp/ruby-build /tmp/ruby-build.tar.gz /var/lib/apt/lists/*
COPY install-ruby-toolcache.sh /usr/local/bin/install-ruby-toolcache.sh
RUN chmod +x /usr/local/bin/install-ruby-toolcache.sh \
&& RUBY_VERSION="${RUBY_VERSION}" RUBY_MINOR="${RUBY_MINOR}" TOOLCACHE_ROOT="${RUNNER_RUBY_TOOLCACHE}" RUNNER_UID="${RUNNER_UID}" RUNNER_GID="${RUNNER_GID}" /usr/local/bin/install-ruby-toolcache.sh \
&& ruby -v

View File

@@ -0,0 +1,133 @@
# GitHub Runner Fleet
ArgoCD owns `apps/github-runner/github-runner.yaml`. Do not patch live runner
Deployments with `kubectl`; update this manifest and let ArgoCD reconcile.
## Runner Shape
All repo-scoped Linux runners use:
- `localhost/fc-github-runner:v20260525-ruby3.3.11-stepca`, derived from
`myoung34/github-runner:latest`
- `ACCESS_TOKEN` from the `github-runner-token` Secret
- `RUN_AS_ROOT=false`
- `EPHEMERAL=true`
- `LABELS=self-hosted,linux,fc-build-linux`
- writable non-root paths under `/home/runner` for .NET, NuGet, XDG cache, and
Actions tool cache
- Ruby 3.3.11 seeded into `/home/runner/_tool/Ruby/3.3/x64` from the baked
`/opt/runner-toolcache` copy so `ruby/setup-ruby@v1` can discover it on
self-hosted `ubuntu-20.04-x64` runners
`github-runner` for `FlowerCore.Common` is single-replica because it retains the
original Longhorn ReadWriteOnce NuGet PVC. Every other repo-scoped runner uses
two replicas with per-pod `emptyDir` caches. That is the safe backlog-drain
strategy: no two pods share one RWO PVC.
Sprint 32 final long-tail wave adds 16 two-replica Deployments:
`FlowerCore.Knowledge`, `FlowerCore.LlmBridge`, `FlowerCore.Media`,
`FlowerCore.Presentations`, `FlowerCore.RemoteDesktop`, `FlowerCore.DNS`,
`FlowerCore.Distribution`, `FlowerCore.Scoreboard`,
`FlowerCore.SegmentDisplay`, `FlowerCore.Signage.Contracts`,
`FlowerCore.SignalControl`, `FlowerCore.Intranet.Web`,
`FlowerCore.Provisioning`, `FlowerCore.Redis`, `FlowerCore.MessageBoard`, and
`FlowerCore.MenuBoard`.
## Image Build
Ruby is baked with a pinned `ruby-build` release and Ruby patch version. The pod
still mounts an `emptyDir` over `/home/runner`, so the `setup-runner-home` init
container copies the baked toolcache from `/opt/runner-toolcache/Ruby` into
`/home/runner/_tool/Ruby` before the runner container starts.
The IAmWorkin step-ca root CA is also baked into the system trust store
(`/usr/local/share/ca-certificates/iamworkin-step-ca-root.crt`, registered by
`update-ca-certificates`). Without it, .NET HttpClient calls from CI tests
against `*.iamworkin.lan` (e.g. `https://selenium.iamworkin.lan/session`)
fail with `PartialChain`. To refresh the bundled cert when the root rotates,
re-extract from the cluster and overwrite `step-ca-root.crt`:
```bash
kubectl get secret -n cert-manager step-ca-root \
-o jsonpath='{.data.ca\.crt}' | base64 -d > step-ca-root.crt
```
```bash
cd apps/github-runner
podman build -t localhost/fc-github-runner:v20260525-ruby3.3.11-stepca .
podman run --rm localhost/fc-github-runner:v20260525-ruby3.3.11-stepca ruby -v
podman run --rm localhost/fc-github-runner:v20260525-ruby3.3.11-stepca \
test -f /opt/runner-toolcache/Ruby/3.3/x64.complete
podman save localhost/fc-github-runner:v20260525-ruby3.3.11-stepca \
-o fc-github-runner-v20260525-ruby3.3.11-stepca.tar
```
Import the saved image on every schedulable RKE2 node before ArgoCD rolls the
Deployments:
```bash
for node in rke2-server rke2-agent1 rke2-agent2; do
scp fc-github-runner-v20260525-ruby3.3.11-stepca.tar "$node:/tmp/"
ssh "$node" 'sudo ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images rm localhost/fc-github-runner:v20260525-ruby3.3.11-stepca || true'
ssh "$node" 'sudo ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images import /tmp/fc-github-runner-v20260525-ruby3.3.11-stepca.tar'
done
```
## Post-Merge Proof
After the PR is merged and ArgoCD syncs, verify the runner fleet:
```bash
kubectl -n github-runner get deploy,pods,pvc
```
Verify the Ruby toolcache in a fresh pod:
```bash
kubectl -n github-runner exec deploy/github-runner-puppet -c runner -- ruby -v
kubectl -n github-runner exec deploy/github-runner-puppet -c runner -- sh -c \
'echo "$RUNNER_TOOL_CACHE" && test -f "$RUNNER_TOOL_CACHE/Ruby/3.3/x64.complete"'
```
Verify GitHub registration for the repo-scoped runners:
```bash
for repo in FlowerCore.Common FlowerCore.Shared.Pos FlowerCore.Puppet FlowerCore.Signage \
FlowerCore.DMS FlowerCore.Telephony FlowerCore.Print.Web FlowerCore.Chat \
FlowerCore.MySQL FlowerCore.Kiosk.Linux FlowerCore.Marquee FlowerCore.TtsReader \
FlowerCore.Knowledge FlowerCore.LlmBridge FlowerCore.Media \
FlowerCore.Presentations FlowerCore.RemoteDesktop FlowerCore.DNS \
FlowerCore.Distribution FlowerCore.Scoreboard FlowerCore.SegmentDisplay \
FlowerCore.Signage.Contracts FlowerCore.SignalControl FlowerCore.Intranet.Web \
FlowerCore.Provisioning FlowerCore.Redis FlowerCore.MessageBoard \
FlowerCore.MenuBoard; do
echo "=== $repo ==="
gh api "/repos/astoltz/$repo/actions/runners" \
--jq '.runners[] | select(.labels[].name == "fc-build-linux") | {name,status,busy,labels:[.labels[].name]}'
done
```
Shared.Pos publish proof after the runner pod is online:
```bash
gh run list --repo astoltz/FlowerCore.Shared.Pos \
--workflow "Build, Test & Publish" --branch main --limit 5
```
If the latest run is still queued after runner registration, rerun the workflow
from GitHub Actions and verify it lands on an `rke2-linux-*` runner.
## Failure Notes
- `actions/setup-dotnet` permission error at `/usr/share/dotnet`: check that
`DOTNET_INSTALL_DIR=/home/runner/.dotnet` and related cache env vars are
present on the runner pod.
- `ruby/setup-ruby@v1` says self-hosted runners must install Ruby in
`$RUNNER_TOOL_CACHE`: check that the init container copied
`/opt/runner-toolcache/Ruby` into `/home/runner/_tool/Ruby` and that
`/home/runner/_tool/Ruby/3.3/x64.complete` exists.
- `404` during runner registration: the fine-grained PAT is valid but missing
repository access for that repo. Add the repo to the PAT access list; the PAT
value does not change.
- `Multi-Attach` volume error: only the Common runner uses a RWO PVC and it must
stay single-replica. New multi-replica runners use `emptyDir`.

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,19 @@
#!/usr/bin/env bash
set -euo pipefail
RUBY_VERSION="${RUBY_VERSION:-3.3.11}"
RUBY_MINOR="${RUBY_MINOR:-3.3}"
TOOLCACHE_ROOT="${TOOLCACHE_ROOT:-/opt/runner-toolcache}"
RUNNER_UID="${RUNNER_UID:-1001}"
RUNNER_GID="${RUNNER_GID:-1001}"
RUBY_PREFIX="${TOOLCACHE_ROOT}/Ruby/${RUBY_VERSION}/x64"
mkdir -p "${TOOLCACHE_ROOT}/Ruby"
RUBY_CONFIGURE_OPTS="${RUBY_CONFIGURE_OPTS:---disable-install-doc --disable-yjit}" ruby-build "${RUBY_VERSION}" "${RUBY_PREFIX}"
touch "${TOOLCACHE_ROOT}/Ruby/${RUBY_VERSION}/x64.complete"
ln -sfn "${RUBY_VERSION}" "${TOOLCACHE_ROOT}/Ruby/${RUBY_MINOR}"
"${RUBY_PREFIX}/bin/ruby" -v
chown -R "${RUNNER_UID}:${RUNNER_GID}" "${TOOLCACHE_ROOT}"
chmod -R a+rX "${TOOLCACHE_ROOT}"

View File

@@ -0,0 +1,12 @@
-----BEGIN CERTIFICATE-----
MIIBxDCCAWqgAwIBAgIRAPY357G6ow6zMAL5+4bS2kkwCgYIKoZIzj0EAwIwQDEa
MBgGA1UEChMRSUFtV29ya2luIEFDTUUgQ0ExIjAgBgNVBAMTGUlBbVdvcmtpbiBB
Q01FIENBIFJvb3QgQ0EwHhcNMjYwMzA4MTgwNzExWhcNMzYwMzA1MTgwNzExWjBA
MRowGAYDVQQKExFJQW1Xb3JraW4gQUNNRSBDQTEiMCAGA1UEAxMZSUFtV29ya2lu
IEFDTUUgQ0EgUm9vdCBDQTBZMBMGByqGSM49AgEGCCqGSM49AwEHA0IABJ2n04X1
JZo5Zdq/i1Idv8+fqwZyAzBh7whbqj0SWsJL8UWRabCMqYCs7+dXO0xRSzqkwFDL
x+vooOai8RgRNhajRTBDMA4GA1UdDwEB/wQEAwIBBjASBgNVHRMBAf8ECDAGAQH/
AgEBMB0GA1UdDgQWBBRnuPPQR6iM/H6vOluiU3Sygayz8jAKBggqhkjOPQQDAgNI
ADBFAiEArQK9dYPGmAZsdYnjziuFVVE5NKZUcceYvGfGC+tLXUsCIAudF2zJrCRq
3mK50ZZET/fwTkJwiEF4824mjP8p1CKM
-----END CERTIFICATE-----

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large Load Diff

165
apps/knowledge/README.md Normal file
View File

@@ -0,0 +1,165 @@
# knowledge — FlowerCore.Knowledge.Web (Phase 2.4 K8s deploy)
**Status:** **LIVE 2026-04-27** at `https://knowledge.iamworkin.lan`
Phase 2.4 closed. Pod running, certificate issued (step-ca-acme), PVC
bound (Longhorn 20Gi RWO), ArgoCD `infra-knowledge` synced. `/healthz`
returns 200, `/api/v1/editions` returns `[]` (initial-deploy state — no
*.db files in the PVC yet; Phase 2.5+ admin UI handles bulk
population). Phase 1 of the Agent Zero MCP rollout keeps `/healthz`
anonymous and gates `/mcp` behind `Authorization: Bearer <token>` built
from the 1Password item `FlowerCore Knowledge MCP Tokens`.
- Plan: [`../../../FlowerCore.Notes/docs/ai-agents/flowercore-knowledge-service-plan.md`](../../../FlowerCore.Notes/docs/ai-agents/flowercore-knowledge-service-plan.md)
- Sprint: [`../../../FlowerCore.Notes/docs/ai-station/sprint-e-xxl-plan.md`](../../../FlowerCore.Notes/docs/ai-station/sprint-e-xxl-plan.md) (Track B)
- Repo: `D:\git\FlowerCore\FlowerCore.Knowledge\` (private GitHub repo,
bootstrapped Sprint D batch 35)
`FlowerCore.Knowledge.Web` is the fleet-wide vector-indexing & RAG hub —
a REST + MCP service that scans `*.db` files under
`/data/vector-stores` and exposes per-edition reachability + corpus
search to the rest of the FC ecosystem (Agent Zero, Chat.Web persona
memory, AiStation embeddings explorer, TtsReader chapter context, BMO
bot, Pi nodes via `fc-index sync`).
Phase 1 MCP routing is explicit:
- in-cluster Agent Zero → `http://knowledge-web.knowledge.svc/mcp`
- workstation Agent Zero → `https://knowledge.iamworkin.lan/mcp`
- probe URL for both lanes → `/healthz`
## Deployment order (do NOT skip / reorder)
### 1. FlowerCore.DNS public A record — knowledge.iamworkin.lan -> 10.0.56.200
Required BEFORE the Certificate resource is created, or cert-manager
HTTP-01 silently backs off ~2h. Memory: `feedback_pfsense_dns_required_for_acme`.
The canonical path is FlowerCore.DNS:
```bash
curl -sk https://dns.iamworkin.lan/api/v1/servers
# Find the pfSense serverId, then create the record using the host label only.
curl -sk -X POST https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iamworkin.lan/records \
-H "Content-Type: application/json" \
-d '{"name":"knowledge","type":"A","data":"10.0.56.200","ttl":300}'
```
If FlowerCore.DNS provider writes are failing 502 with "pfSense
diag_command.php response did not contain a `<pre>` block" (status as of
Sprint E Track B authoring 2026-04-27), add the override manually via
the pfSense web UI:
1. Log in to `https://10.0.56.1` as admin
2. Services → DNS Resolver → General Settings → Host Overrides
3. Add: Host=`knowledge`, Domain=`iamworkin.lan`, IP Address=`10.0.56.200`
4. Save + Apply Changes
Verify resolution from anywhere on LAN:
```bash
nslookup knowledge.iamworkin.lan 10.0.56.1
# Expect: 10.0.56.200
```
Or against FlowerCore.DNS once the provider is fixed:
```bash
curl -sk "https://dns.iamworkin.lan/api/v1/zones/iamworkin.lan/resolve-preflight?hostname=knowledge.iamworkin.lan"
# Expect: "resolvable": true
```
### 2. Build + import the image to ALL RKE2 nodes
Pods may schedule on any RKE2 worker (server, agent1, agent2). The
Longhorn PVC accepts mounts from any node, so the image must be
imported to all three. Memory:
`feedback_rke2_image_import_targets_all_nodes` +
`feedback_rke2_localhost_imagepullpolicy`.
```bash
# From BLUEJAY-WS, in D:\git\FlowerCore\FlowerCore.Knowledge
TAG="v$(date +%Y%m%d%H%M)"
dotnet.exe publish -c Release -o deploy/app \
src/FlowerCore.Knowledge.Web/FlowerCore.Knowledge.Web.csproj
podman build -t localhost/fc-knowledge-web:$TAG -f deploy/Dockerfile.deploy deploy
podman save localhost/fc-knowledge-web:$TAG -o /tmp/fc-knowledge-web.tar
# Import to all three RKE2 nodes
for node in rke2-server rke2-agent1 rke2-agent2; do
scp /tmp/fc-knowledge-web.tar $node:/tmp/
ssh $node "sudo /var/lib/rancher/rke2/bin/ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images import /tmp/fc-knowledge-web.tar"
done
```
The repo's `scripts/deploy-knowledge.sh` automates this loop.
### 3. Bump the image tag + push
Edit `knowledge.yaml`, replace `localhost/fc-knowledge-web:v202604272200`
with the tag from step 2, then:
```bash
cd D:/git/FlowerCore/bluejay-infra
python scripts/check-pfsense-dns.py # confirms the DNS preflight
git add apps/knowledge/
git commit -m "feat(knowledge): deploy Phase 2.4 K8s manifest"
git push
```
ArgoCD picks up within ~3 minutes and creates `infra-knowledge`.
### 4. Verify
```bash
fcadmin_ssh noc1 '
kubectl -n argocd get application infra-knowledge
kubectl -n knowledge get certificate,pod,pvc
curl -sk -m 8 -o /dev/null -w "HTTP %{http_code}\n" https://knowledge.iamworkin.lan/healthz
curl -sk -m 8 https://knowledge.iamworkin.lan/api/v1/editions | jq
'
```
Expect: Certificate `Ready: True` within ~60s, `/healthz` HTTP 200,
`/api/v1/editions` returns an empty array (no DBs in the PVC yet) on
first deploy.
## Initial-deploy state and Phase 2.5 follow-up
The Longhorn PVC is empty on first deploy. Knowledge.Web's filesystem
catalog will report zero editions until vector-store `*.db` files are
pushed into `/data/vector-stores`. Initial population is a follow-up
step (Phase 2.5+, Blazor admin UI's "Rebuild" button); for the first
deploy the goal is just to prove the pod boots, `/healthz` returns 200,
and the Traefik IngressRoute serves the Scalar UI.
To copy an existing local DB into the PVC (one-time, manual until
Phase 2.5 admin UI lands):
```bash
fcadmin_ssh noc1 '
POD=$(kubectl -n knowledge get pod -l app=knowledge-web -o jsonpath="{.items[0].metadata.name}")
kubectl -n knowledge cp /var/lib/flowercore/vector-stores/bluejay-ai.db $POD:/data/vector-stores/bluejay-ai.db
'
```
## Probes + middleware notes
- `/healthz` is mapped by `Controllers/HealthController.cs` (controller-based
attribute route). Cheap — no DB, no dependencies.
- Liveness uses `tcpSocket` as a defensive fallback in case future
middleware accidentally gates `/healthz` behind auth (memory:
`feedback_k8s_probes_behind_auth_middleware`).
- `/openapi/v1.json` and `/scalar/v1` are wired by `UseFlowerCoreApi`.
Per memory `feedback_k8s_probes_must_not_hit_openapi`, probes must NOT
point at OpenAPI documents — the `MapOpenApi` call can be slow during
cold startup.
## Resource sizing
- 256Mi memory request / 1Gi limit.
- 100m CPU request / 1000m limit.
- 20Gi Longhorn PVC initial — sufficient for the bluejay-ai 1.94Gi DB +
fleet-pi-edge 352Mi + fleet-bmo-bot 141Mi + headroom. Resize via
`kubectl -n knowledge edit pvc knowledge-vector-store` if growing
past 15Gi.

View File

@@ -0,0 +1,266 @@
# FlowerCore.Knowledge.Web — fleet vector indexing & RAG hub.
#
# Phase 2.4 of the Knowledge service plan. REST + MCP service that scans
# *.db files under /data/vector-stores and exposes:
# - REST: /api/v1/editions, /api/v1/corpus/search, /healthz
# - MCP: list_editions, describe_edition, corpus_search
# - Static OpenAPI/Scalar via UseFlowerCoreApi
#
# Architecture:
# Plan: FlowerCore.Notes/docs/ai-agents/flowercore-knowledge-service-plan.md
# Sprint: FlowerCore.Notes/docs/ai-station/sprint-e-xxl-plan.md (Track B)
# Repo: D:\git\FlowerCore\FlowerCore.Knowledge\
# Shared: FlowerCore.Common -> FlowerCore.Shared.Indexing (chunkers, vector
# stores, edition profiles, ICorpusSearchService facade)
#
# Deployment order (see apps/knowledge/README.md and the bluejay-infra/README.md
# top-level checklist):
# 1. FlowerCore.DNS public A record knowledge.iamworkin.lan -> 10.0.56.200
# MUST exist BEFORE the Certificate is created, or cert-manager HTTP-01
# backs off ~2h. Memory: feedback_pfsense_dns_required_for_acme.
# 2. Build + import the image to ALL RKE2 nodes (server + both agents) since
# the Pod uses a Longhorn PVC and may schedule anywhere.
# Memory: feedback_rke2_localhost_imagepullpolicy.
# 3. Bump the image tag in this file, git push.
# 4. ArgoCD ApplicationSet picks up within ~3 minutes and creates
# infra-knowledge.
#
# Initial-deploy state:
# The Longhorn PVC is empty on first deploy. Knowledge.Web's filesystem
# catalog will report zero editions until vector-store *.db files are
# pushed into /data/vector-stores. Initial population is a follow-up step
# (Phase 2.5+, Blazor admin UI's "Rebuild" button); for the first deploy
# the goal is just to prove the pod boots, /healthz returns 200, and the
# Traefik IngressRoute serves the Scalar UI.
---
apiVersion: v1
kind: Namespace
metadata:
name: knowledge
labels:
app.kubernetes.io/part-of: bluejay-infra
---
# MCP bearer token for the read-only Agent Zero Phase 1 lane. The 1Password
# item currently stores the raw token in its concealed PASSWORD field, which
# the operator syncs into the namespaced Secret key `password`.
apiVersion: onepassword.com/v1
kind: OnePasswordItem
metadata:
name: knowledge-mcp-tokens
namespace: knowledge
spec:
itemPath: "vaults/IAmWorkin/items/FlowerCore Knowledge MCP Tokens"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: knowledge-vector-store
namespace: knowledge
spec:
accessModes:
- ReadWriteOnce
storageClassName: longhorn
resources:
requests:
storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: knowledge-web
namespace: knowledge
labels:
app: knowledge-web
app.kubernetes.io/name: knowledge-web
app.kubernetes.io/part-of: bluejay-infra
spec:
replicas: 1
revisionHistoryLimit: 3
# RWO Longhorn PVC blocks rolling updates (multi-attach error). Recreate
# is the canonical pattern (memory: feedback_rwo_pvc_blocks_rolling).
strategy:
type: Recreate
selector:
matchLabels:
app: knowledge-web
template:
metadata:
labels:
app: knowledge-web
app.kubernetes.io/name: knowledge-web
app.kubernetes.io/part-of: bluejay-infra
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
securityContext:
runAsNonRoot: true
fsGroup: 1654
fsGroupChangePolicy: OnRootMismatch
containers:
- name: web
# Placeholder tag — bump to the image you built + imported to ALL
# RKE2 nodes via scripts/deploy-knowledge.sh before applying.
image: localhost/fc-knowledge-web:v20260429232635
imagePullPolicy: Never
command:
- /bin/sh
- -c
args:
- |
if [ -n "${KNOWLEDGE_MCP_BEARER_TOKEN:-}" ]; then
export FlowerCore__Mcp__ApiKey__Key="Bearer ${KNOWLEDGE_MCP_BEARER_TOKEN}"
fi
exec dotnet FlowerCore.Knowledge.Web.dll
ports:
- containerPort: 8080
name: http
env:
- name: ASPNETCORE_URLS
value: "http://+:8080"
- name: ASPNETCORE_ENVIRONMENT
value: "Production"
- name: DOTNET_SYSTEM_GLOBALIZATION_INVARIANT
value: "false"
# Vector-store directory + embedding model + edition profile dir.
# Profile JSON is baked into the image at /home/app/editions via the
# csproj Content-link from FlowerCore.Common/editions/.
- name: Knowledge__VectorStoresDirectory
value: "/data/vector-stores"
- name: Knowledge__EmbeddingModel
value: "nomic-embed-text"
- name: Knowledge__DefaultLimit
value: "5"
- name: Knowledge__MaxLimit
value: "50"
- name: FlowerCore__Editions__ProfileDirectory
value: "/home/app/editions"
# Embed via edge1 Pi 5 + AI HAT+ (10.0.57.17:11434). Cluster
# services do not depend on BLUEJAY-WS (private dev hardware) per
# bluejay-infra@0f9d56e. Query-time embedding is fast enough on
# edge1 (~ms per query); bulk index rebuilds (Phase 2.5+) will
# need a separate ingestion lane that can opt into the
# workstation GPU when present.
- name: FlowerCore__Ollama__BaseUrl
value: "http://10.0.57.17:11434"
- name: FlowerCore__Mcp__ApiKey__Key
valueFrom:
secretKeyRef:
name: knowledge-mcp-tokens
key: password
- name: FlowerCore__Mcp__ApiKey__HeaderName
value: "Authorization"
- name: KNOWLEDGE_MCP_BEARER_TOKEN
valueFrom:
secretKeyRef:
name: knowledge-mcp-tokens
key: password
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 1000m
memory: 1Gi
# /healthz is mapped by HealthController (controller-based route).
# tcpSocket liveness is the defensive fallback in case middleware
# later gates /healthz behind auth (memory:
# feedback_k8s_probes_behind_auth_middleware).
startupProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 30
readinessProbe:
httpGet:
path: /healthz
port: 8080
periodSeconds: 10
failureThreshold: 3
livenessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 30
periodSeconds: 30
failureThreshold: 3
securityContext:
runAsNonRoot: true
runAsUser: 1654
runAsGroup: 1654
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
volumeMounts:
- name: vector-store
mountPath: /data/vector-stores
- name: tmp
mountPath: /tmp
- name: logs
mountPath: /home/app/logs
volumes:
- name: vector-store
persistentVolumeClaim:
claimName: knowledge-vector-store
- name: tmp
emptyDir: {}
- name: logs
emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
name: knowledge-web
namespace: knowledge
labels:
app: knowledge-web
app.kubernetes.io/name: knowledge-web
app.kubernetes.io/part-of: bluejay-infra
spec:
type: ClusterIP
selector:
app: knowledge-web
ports:
- name: http
port: 80
targetPort: 8080
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: knowledge-tls
namespace: knowledge
spec:
secretName: knowledge-tls
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
dnsNames:
- knowledge.iamworkin.lan
# step-ca ACME caps lifetime at 30d; requesting 90d silently capped
# made renewBefore=cert-lifetime → perpetual renewal loop (10888+ CRs
# in 18h on 2026-05-07). Match working 720h/240h pattern from other
# FC services.
duration: 720h # 30d (step-ca cap)
renewBefore: 240h # 10d
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: knowledge
namespace: knowledge
spec:
entryPoints:
- websecure
routes:
- match: Host(`knowledge.iamworkin.lan`)
kind: Rule
services:
- name: knowledge-web
port: 80
tls:
secretName: knowledge-tls

View File

@@ -0,0 +1,7 @@
# ArgoCD's bluejay-infra ApplicationSet uses a directory generator and does
# not require kustomization.yaml. Mirrors the fc-distribution shape so
# `kubectl kustomize` previews work from a working copy.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- knowledge.yaml

View File

@@ -0,0 +1,93 @@
# =============================================================================
# ci1 - Windows Server 2025 KubeVirt VM (GitHub Actions Self-Hosted Runner)
# =============================================================================
# Boots from the sysprepped containerDisk template built by the Windows VM
# sysprep pipeline. See docs/infrastructure/windows-vm-sysprep-pipeline.md.
# Path A/B/C install history is preserved in git log only.
# =============================================================================
apiVersion: v1
kind: Namespace
metadata:
name: kubevirt-vms
labels:
app.kubernetes.io/part-of: kubevirt-stack
pod-security.kubernetes.io/enforce: privileged
---
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
name: ci1
namespace: kubevirt-vms
labels:
app: ci-runner
role: github-actions-runner
flowercore.io/managed-by: bluejay-infra
spec:
runStrategy: Always
template:
metadata:
labels:
app: ci-runner
role: github-actions-runner
kubevirt.io/vm: ci1
spec:
domain:
cpu:
cores: 8
sockets: 1
threads: 1
memory:
guest: 16Gi
resources:
requests:
memory: 16Gi
limits:
memory: 16Gi
clock:
utc: {}
timer:
hpet:
present: false
pit:
tickPolicy: delay
rtc:
tickPolicy: catchup
hyperv: {}
features:
acpi: {}
apic: {}
hyperv:
relaxed: {}
vapic: {}
spinlocks:
spinlocks: 8191
smm: {}
firmware:
bootloader:
efi:
secureBoot: false
devices:
tpm: {}
disks:
- name: rootdisk
disk:
bus: virtio
interfaces:
# Pod-network fallback for CI runner outbound traffic. Switch to
# prod-vlan57 once the bridge/NAD lane is ready for L2 access.
- name: default
masquerade: {}
model: virtio
machine:
type: q35
networks:
- name: default
pod: {}
volumes:
- name: rootdisk
containerDisk:
image: localhost/fc-win-server-2025:v1
imagePullPolicy: Never
terminationGracePeriodSeconds: 3600

View File

@@ -0,0 +1,3 @@
resources:
- ci1.yaml
- prod-vlan57-nad.yaml

View File

@@ -0,0 +1,69 @@
# =============================================================================
# NetworkAttachmentDefinition — PROD VLAN 57 bridge
# =============================================================================
# Purpose: makes KubeVirt VMs reachable on the PROD VLAN (10.0.57.0/24)
# alongside the existing pod network. Required for ci1 to bridge onto PROD
# (e.g. to provision/scrape edge1, edge2, kiosks, Pis on the same L2 segment).
#
# **DEPLOY GATE — Phase 1.5 host work required first**:
# On every RKE2 node (rke2-server, rke2-agent1, rke2-agent2):
# 1. Switch port (UniFi USL16LP) trunks VLAN 57 to the node — usually
# already true since BLUEJAY-WS reaches 10.0.57.x services. Verify
# with `ip link show enp86s0.57` after configuring sub-interface, OR
# `tcpdump -ni enp86s0 vlan 57` and ping a known PROD host.
# 2. Linux bridge `br-prod` enslaving `enp86s0.57` (VLAN sub-interface).
# NetworkManager profile examples in the runbook below.
# 3. Verify Multus DaemonSet `kube-multus-ds` is Ready on all nodes.
#
# Without those, applying this NAD has no effect except to register the CRD.
# A VM that requests this NAD with no bridge present will fail with:
# `error adding pod kubevirt-vms_ci1 to CNI network "prod-vlan57": failed to
# plumb VLAN: open /sys/class/net/br-prod/master: no such file or directory`
#
# Configuration notes:
# - cniVersion 0.3.1 to match Multus daemon-config.json
# - mtu 1500 (matches enp86s0 default; bump if jumbo frames configured)
# - bridge name `br-prod` is convention; if Puppet picks a different name
# (e.g. `br57`, `br-vlan57`), edit BOTH this NAD and the ci1.yaml
# interface block. Keep them in sync.
# - vlan: 0 because the host bridge already strips VLAN tag (br-prod sits
# on top of `enp86s0.57`). If we instead used a VLAN-aware bridge with
# trunk port, set vlan: 57 here. Current convention is VLAN-stripped at
# the sub-interface, so the bridge passes untagged frames.
#
# Apply:
# kubectl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml apply -f apps/kubevirt-vms/prod-vlan57-nad.yaml
#
# Then update ci1.yaml networks: stanza to:
# - name: prod-net
# multus:
# networkName: kubevirt-vms/prod-vlan57
# and the interface block from `masquerade` to `bridge`.
# =============================================================================
---
# Namespace must exist already (created by ci1.yaml's first document).
# This file imports a NAD into that same namespace.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
name: prod-vlan57
namespace: kubevirt-vms
annotations:
bluejay.iamworkin.lan/host-bridge: "br-prod (enslaves enp86s0.57)"
bluejay.iamworkin.lan/cidr: "10.0.57.0/24"
bluejay.iamworkin.lan/gateway: "10.0.57.1"
bluejay.iamworkin.lan/dns: "10.0.56.1 (pfSense Unbound)"
spec:
config: |
{
"cniVersion": "0.3.1",
"name": "prod-vlan57",
"type": "bridge",
"bridge": "br-prod",
"ipam": {},
"mtu": 1500,
"vlan": 0,
"promiscMode": true,
"preserveDefaultVlan": false
}

View File

@@ -0,0 +1,99 @@
# =============================================================================
# Windows Server 2025 ISO — Static NFS PV (Path B for SATA-CDROM timeout)
# =============================================================================
# Purpose: Mount the ISO from Synology NAS via NFS instead of from a Longhorn-
# backed Filesystem PVC.
#
# Why: SATA-CDROM emulation reading from a Longhorn-backed Filesystem PVC is
# too slow for OVMF's boot read window — the DVD-ROM enumeration times out
# before the bootloader can be read. Symptom on the serial console:
# BdsDxe: failed to start Boot0001 "UEFI QEMU DVD-ROM QM00001 " from ...
# BdsDxe: failed to start Boot0001 ... Time out
# BdsDxe: No bootable option or device was found
# Diagnosis confirmed the ISO content is a perfectly valid bootable ISO9660
# image — the bug is in the timing path between OVMF and Longhorn-backed
# storage, not in the ISO itself.
#
# Block-mode PVC was tried (`volumeMode: Block` via DataVolume) and would
# likely fix the timing, but CDI v1.65.0's upload-target pod cannot open the
# block device due to runAsUser:107 + capabilities.drop:[ALL] and we got:
# blockdev: cannot open /dev/cdi-block-volume: Permission denied
#
# NFS-mounted ISO bypasses both issues: no Longhorn slowness, no CDI upload
# pod permission concerns. The ISO is read directly from the NAS over a
# native NFSv4.1 mount that QEMU's SATA emulator can read at full LAN speed.
#
# Layout on Synology:
# /volume1/ISOs/ (existing export, RKE2 ACL)
# en-us_windows_server_2025_updated_march_2026_x64_dvd_8e06425a.iso
# win2025-iso-disk/ (new subdir, 2026-05-08)
# disk.img -> hardlink to ../en-us_windows_server_2025_..._8e06425a.iso
#
# KubeVirt's launcher pod expects a PVC mounted at
# /var/run/kubevirt-private/vmi-disks/<diskName>/disk.img — by mounting the
# `win2025-iso-disk/` subdir as the NFS PV root, `disk.img` lives at the PV's
# root and KubeVirt's CDROM emulator finds it without any path manipulation.
#
# A symlink would NOT work for sub-path NFS mounts (the relative target
# `../...iso` falls outside the sub-mount root). A hardlink works because it
# references the same inode regardless of mount point.
#
# Memory references:
# - feedback_synology_nfs_volume1_kubernetes_export_scoped (Synology export
# scoping pattern — but /volume1/ISOs export, unlike /volume1/kubernetes,
# does support sub-path mounts because Synology NFS is configured with
# pseudo-fs in NFSv4.1)
# - feedback_kubevirt_iso_first_install_bootorder_and_runstrategy (boot
# order / runStrategy gotchas, separate from the storage timing issue)
#
# Validation (2026-05-08, from rke2-server / rke2-agent1 / rke2-agent2):
# mount -t nfs -o nfsvers=4.1,ro 10.0.58.3:/volume1/ISOs/win2025-iso-disk /tmp/m
# file /tmp/m/disk.img
# -> ISO 9660 CD-ROM filesystem data 'SSS_X64FRE_EN-US_DV9' (bootable)
# All 3 RKE2 nodes can mount and read.
# =============================================================================
apiVersion: v1
kind: PersistentVolume
metadata:
name: windows-server-2025-iso-nfs
labels:
flowercore.io/iso: windows-server-2025
flowercore.io/managed-by: bluejay-infra
spec:
capacity:
storage: 8Gi
accessModes:
- ReadOnlyMany
volumeMode: Filesystem
persistentVolumeReclaimPolicy: Retain
storageClassName: "" # static, no provisioner
mountOptions:
- nfsvers=4.1
- ro
- hard
- timeo=600
- retrans=3
nfs:
server: 10.0.58.3 # BlueJayNAS Synology DS1621+ on HOME VLAN 58
path: /volume1/ISOs/win2025-iso-disk
readOnly: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: windows-server-2025-iso-nfs
namespace: kubevirt-vms
labels:
app: ci-runner
flowercore.io/managed-by: bluejay-infra
spec:
accessModes:
- ReadOnlyMany
volumeMode: Filesystem
resources:
requests:
storage: 8Gi
storageClassName: ""
volumeName: windows-server-2025-iso-nfs

View File

@@ -76,15 +76,21 @@ apiVersion: apps/v1
kind: StatefulSet
metadata:
name: matrix-postgres
namespace: matrix
labels:
app: matrix-postgres
spec:
serviceName: matrix-postgres
replicas: 1
selector:
matchLabels:
app: matrix-postgres
namespace: matrix
labels:
app: matrix-postgres
argocd.argoproj.io/instance: infra-matrix
spec:
persistentVolumeClaimRetentionPolicy:
whenDeleted: Retain
whenScaled: Retain
podManagementPolicy: OrderedReady
serviceName: matrix-postgres
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: matrix-postgres
template:
metadata:
labels:
@@ -137,12 +143,17 @@ spec:
name: matrix-postgres-data
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 5Gi
---
apiVersion: v1
kind: Service
volumeMode: Filesystem
resources:
requests:
storage: 5Gi
updateStrategy:
rollingUpdate:
partition: 0
type: RollingUpdate
---
apiVersion: v1
kind: Service
metadata:
name: matrix-postgres
namespace: matrix

View File

@@ -0,0 +1,762 @@
{
"annotations": {
"list": []
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [
{
"icon": "external link",
"includeVars": false,
"keepTime": false,
"targetBlank": true,
"title": "Open Service",
"type": "link",
"url": "https://updatecenter.iamworkin.lan/"
}
],
"panels": [
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [
{
"options": {
"0": {
"color": "#f87171",
"index": 1,
"text": "DOWN"
},
"1": {
"color": "#4ade80",
"index": 0,
"text": "UP"
}
},
"type": "value"
}
],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "#f87171",
"value": null
},
{
"color": "#4ade80",
"value": 1
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 4,
"w": 8,
"x": 0,
"y": 0
},
"id": 1,
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value_and_name"
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"expr": "probe_success{job=\"probe-traefik-services\",instance=\"updatecenter.iamworkin.lan\"}",
"refId": "A",
"legendFormat": "Availability"
}
],
"title": "Service Availability",
"transparent": true,
"type": "stat"
},
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"decimals": 2,
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "#f87171",
"value": null
},
{
"color": "#fbbf24",
"value": 95
},
{
"color": "#FFB300",
"value": 99
},
{
"color": "#4ade80",
"value": 99.9
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 4,
"w": 8,
"x": 8,
"y": 0
},
"id": 2,
"options": {
"colorMode": "background_solid",
"graphMode": "area",
"justifyMode": "center",
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "value_and_name"
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"expr": "avg_over_time(probe_success{job=\"probe-traefik-services\",instance=\"updatecenter.iamworkin.lan\"}[24h]) * 100",
"refId": "A",
"legendFormat": "24h Uptime"
}
],
"title": "24-Hour Uptime",
"transparent": true,
"type": "stat"
},
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"max": 30,
"min": 0,
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "#f87171",
"value": null
},
{
"color": "#fbbf24",
"value": 2
},
{
"color": "#4ade80",
"value": 7
}
]
},
"unit": "d"
},
"overrides": []
},
"gridPos": {
"h": 4,
"w": 8,
"x": 16,
"y": 0
},
"id": 3,
"options": {
"minVizHeight": 75,
"minVizWidth": 75,
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"expr": "(probe_ssl_earliest_cert_expiry{job=\"probe-traefik-services\",instance=\"updatecenter.iamworkin.lan\"} - time()) / 86400",
"refId": "A",
"legendFormat": "Days Remaining"
}
],
"title": "Cert Expiry (Days)",
"transparent": true,
"type": "gauge"
},
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "Response Time (seconds)",
"drawStyle": "line",
"fillOpacity": 12,
"gradientMode": "scheme",
"lineInterpolation": "smooth",
"lineWidth": 2,
"pointSize": 4,
"showPoints": "never",
"spanNulls": true,
"thresholdsStyle": {
"mode": "dashed"
}
},
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "#4ade80",
"value": null
},
{
"color": "#fbbf24",
"value": 2
},
{
"color": "#f87171",
"value": 5
}
]
},
"unit": "s"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 14,
"x": 0,
"y": 4
},
"id": 4,
"options": {
"legend": {
"calcs": [
"lastNotNull",
"mean",
"max"
],
"displayMode": "table",
"placement": "right"
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"expr": "probe_duration_seconds{job=\"probe-traefik-services\",instance=\"updatecenter.iamworkin.lan\"}",
"refId": "A",
"legendFormat": "Probe Duration"
}
],
"timeFrom": "1h",
"title": "Response Time (1h Trend)",
"transparent": true,
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"gridPos": {
"h": 8,
"w": 10,
"x": 14,
"y": 4
},
"id": 5,
"options": {
"alertInstanceLabelFilter": "{instance=\"updatecenter.iamworkin.lan\"}",
"alertName": "",
"dashboardAlerts": false,
"groupBy": [],
"groupMode": "default",
"maxItems": 10,
"sortOrder": 1,
"stateFilter": {
"error": true,
"firing": true,
"noData": true,
"normal": false,
"pending": true
},
"viewMode": "list"
},
"title": "Active Alerts",
"type": "alertlist"
},
{
"collapsed": false,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 12
},
"id": 20,
"title": "OTEL Counters — Track 1D",
"type": "row"
},
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"lineWidth": 1,
"fillOpacity": 10
},
"unit": "reqps"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 13
},
"id": 21,
"options": {
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": ["mean", "lastNotNull"]
},
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"expr": "sum by (status) (rate(updatecenter_manifest_requests_total[5m]))",
"refId": "A",
"legendFormat": "status={{status}}"
}
],
"title": "Manifest Requests rate by status (5m)",
"transparent": true,
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"lineWidth": 1,
"fillOpacity": 10
},
"unit": "Bps"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 13
},
"id": 22,
"options": {
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": ["mean", "lastNotNull"]
},
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"expr": "sum by (slug) (rate(updatecenter_bundle_download_bytes_total[5m]))",
"refId": "A",
"legendFormat": "{{slug}}"
}
],
"title": "Bundle Download Throughput by slug (5m)",
"transparent": true,
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"lineWidth": 1,
"fillOpacity": 10
},
"unit": "reqps"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 21
},
"id": 23,
"options": {
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": ["mean", "lastNotNull"]
},
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"expr": "sum by (status) (rate(updatecenter_checkins_total[5m]))",
"refId": "A",
"legendFormat": "status={{status}}"
}
],
"title": "Agent Check-in Rate by status (5m)",
"transparent": true,
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "#4ade80", "value": null },
{ "color": "#f87171", "value": 1 }
]
},
"unit": "none",
"decimals": 2
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 6,
"x": 12,
"y": 21
},
"id": 24,
"options": {
"colorMode": "background",
"graphMode": "area",
"justifyMode": "center",
"orientation": "auto",
"reduceOptions": {
"calcs": ["sum"],
"fields": "",
"values": false
},
"textMode": "value_and_name"
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"expr": "increase(updatecenter_signature_verify_failures_total[1h])",
"refId": "A",
"legendFormat": "Sig Verify Failures (1h)"
}
],
"title": "Signature Verify Failures (1h)",
"transparent": true,
"type": "stat"
},
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"lineWidth": 1,
"fillOpacity": 10
},
"unit": "reqps"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 6,
"x": 18,
"y": 21
},
"id": 25,
"options": {
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": ["mean", "lastNotNull"]
},
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"expr": "sum by (slug, channel) (rate(updatecenter_release_publishes_total[5m]))",
"refId": "A",
"legendFormat": "{{slug}}/{{channel}}"
}
],
"title": "Release Publishes rate by slug/channel (5m)",
"transparent": true,
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"lineWidth": 1,
"fillOpacity": 10
},
"unit": "reqps"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 29
},
"id": 26,
"options": {
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": ["mean", "lastNotNull"]
},
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"expr": "sum by (kind, status) (rate(updatecenter_bundle_downloads_total[5m]))",
"refId": "A",
"legendFormat": "{{kind}} / {{status}}"
}
],
"title": "Bundle Download Requests by kind/status (5m)",
"transparent": true,
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"lineWidth": 2,
"fillOpacity": 20
},
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "#4ade80", "value": null },
{ "color": "#f87171", "value": 0.01 }
]
},
"unit": "reqps"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 29
},
"id": 27,
"options": {
"legend": {
"displayMode": "table",
"placement": "right",
"calcs": ["mean", "lastNotNull"]
},
"tooltip": {
"mode": "multi",
"sort": "desc"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "fffjikve8llhce"
},
"expr": "rate(updatecenter_signature_verify_failures_total[5m])",
"refId": "A",
"legendFormat": "Sig verify failures/s"
}
],
"title": "Signature Verify Failure Rate (5m) — Critical if >0",
"transparent": true,
"type": "timeseries"
}
],
"refresh": "30s",
"schemaVersion": 39,
"style": "dark",
"tags": [
"blue-jay",
"flowercore",
"synthetic",
"updatecenter",
"otel"
],
"templating": {
"list": []
},
"time": {
"from": "now-24h",
"to": "now"
},
"timezone": "browser",
"title": "FlowerCore.UpdateCenter Dashboard",
"uid": "fc-updatecenter",
"version": 2
}

View File

@@ -0,0 +1,226 @@
{
"annotations": {
"list": []
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": null,
"links": [],
"panels": [
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"id": 1,
"options": {
"legend": {
"displayMode": "table",
"placement": "bottom"
},
"tooltip": {
"mode": "single"
}
},
"targets": [
{
"editorMode": "code",
"expr": "sum by (event) (increase(fc_desktop_session_events_total[$__rate_interval]))",
"legendFormat": "{{event}}",
"range": true,
"refId": "A"
}
],
"title": "RemoteDesktop Session Events",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"id": 2,
"options": {
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"showUnfilled": true
},
"targets": [
{
"editorMode": "code",
"expr": "sum by (template, event) (increase(fc_desktop_session_events_total[24h]))",
"legendFormat": "{{template}} {{event}}",
"range": true,
"refId": "A"
}
],
"title": "24h Session Events By Template",
"type": "bargauge"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"id": 3,
"options": {
"legend": {
"displayMode": "table",
"placement": "bottom"
},
"tooltip": {
"mode": "single"
}
},
"targets": [
{
"editorMode": "code",
"expr": "fc_desktop_pool_ready",
"legendFormat": "{{template}} ready",
"range": true,
"refId": "A"
},
{
"editorMode": "code",
"expr": "fc_desktop_pool_desired",
"legendFormat": "{{template}} desired",
"range": true,
"refId": "B"
}
],
"title": "Warm Pool Ready vs Desired",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "orange",
"value": 1
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"id": 4,
"options": {
"colorMode": "value",
"graphMode": "none",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "auto"
},
"targets": [
{
"editorMode": "code",
"expr": "sum(increase(fc_desktop_session_events_total{event=\"connect\",browser_datasource=\"json\"}[24h])) - sum(increase(fc_desktop_session_events_total{event=\"disconnect\"}[24h]))",
"range": true,
"refId": "A"
}
],
"title": "24h Connect Minus Disconnect",
"type": "stat"
}
],
"refresh": "30s",
"schemaVersion": 39,
"style": "dark",
"tags": [
"flowercore",
"remotedesktop",
"guacamole"
],
"templating": {
"list": []
},
"time": {
"from": "now-24h",
"to": "now"
},
"timezone": "browser",
"title": "FlowerCore RemoteDesktop",
"uid": "flowercore-remotedesktop",
"version": 1
}

View File

@@ -0,0 +1,249 @@
# Grafana dashboard ConfigMap for FlowerCore.RemoteDesktop.
#
# Inlines the JSON from flowercore-remotedesktop-grafana-dashboard.json.
# Kept as a standalone file (not inlined in noc-monitoring.yaml) so the
# CRLF-dirty state of noc-monitoring.yaml doesn't have to be normalized
# in the same pass. To actually load the dashboard, the Grafana Deployment
# in noc-monitoring.yaml needs a matching 'volumes:' entry:
#
# - name: dashboard-remotedesktop
# configMap:
# name: grafana-dashboard-remotedesktop
#
# ArgoCD will sync this ConfigMap automatically through the bluejay-infra
# ApplicationSet (infra-monitoring App). The dashboard just won't load
# until the Grafana Deployment mount is wired.
---
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboard-remotedesktop
namespace: monitoring
data:
remotedesktop.json: |
{
"annotations": {
"list": []
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": null,
"links": [],
"panels": [
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"id": 1,
"options": {
"legend": {
"displayMode": "table",
"placement": "bottom"
},
"tooltip": {
"mode": "single"
}
},
"targets": [
{
"editorMode": "code",
"expr": "sum by (event) (increase(fc_desktop_session_events_total[$__rate_interval]))",
"legendFormat": "{{event}}",
"range": true,
"refId": "A"
}
],
"title": "RemoteDesktop Session Events",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"id": 2,
"options": {
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"showUnfilled": true
},
"targets": [
{
"editorMode": "code",
"expr": "sum by (template, event) (increase(fc_desktop_session_events_total[24h]))",
"legendFormat": "{{template}} {{event}}",
"range": true,
"refId": "A"
}
],
"title": "24h Session Events By Template",
"type": "bargauge"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"id": 3,
"options": {
"legend": {
"displayMode": "table",
"placement": "bottom"
},
"tooltip": {
"mode": "single"
}
},
"targets": [
{
"editorMode": "code",
"expr": "fc_desktop_pool_ready",
"legendFormat": "{{template}} ready",
"range": true,
"refId": "A"
},
{
"editorMode": "code",
"expr": "fc_desktop_pool_desired",
"legendFormat": "{{template}} desired",
"range": true,
"refId": "B"
}
],
"title": "Warm Pool Ready vs Desired",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "orange",
"value": 1
}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"id": 4,
"options": {
"colorMode": "value",
"graphMode": "none",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"textMode": "auto"
},
"targets": [
{
"editorMode": "code",
"expr": "sum(increase(fc_desktop_session_events_total{event=\"connect\",browser_datasource=\"json\"}[24h])) - sum(increase(fc_desktop_session_events_total{event=\"disconnect\"}[24h]))",
"range": true,
"refId": "A"
}
],
"title": "24h Connect Minus Disconnect",
"type": "stat"
}
],
"refresh": "30s",
"schemaVersion": 39,
"style": "dark",
"tags": [
"flowercore",
"remotedesktop",
"guacamole"
],
"templating": {
"list": []
},
"time": {
"from": "now-24h",
"to": "now"
},
"timezone": "browser",
"title": "FlowerCore RemoteDesktop",
"uid": "flowercore-remotedesktop",
"version": 1
}

File diff suppressed because it is too large Load Diff

297
apps/multus/multus.yaml Normal file
View File

@@ -0,0 +1,297 @@
# =============================================================================
# Multus CNI — Meta-CNI for multi-network attachment to pods/VMs
# =============================================================================
# Purpose: enable KubeVirt VMs (and any future workload) to attach additional
# network interfaces beyond the default Calico-managed pod network. Required
# for ci1 (Windows Server 2025 KubeVirt VM) to bridge onto PROD VLAN 57.
#
# Source: upstream k8snetworkplumbingwg/multus-cni v4.2.2
# https://github.com/k8snetworkplumbingwg/multus-cni/blob/v4.2.2/deployments/multus-daemonset-thick.yml
#
# Inlined verbatim (with project header + version pin annotation) for
# reproducibility and air-gap safety. Bumping versions = edit this file +
# git push. ArgoCD picks up via the bluejay-infra ApplicationSet
# (apps/* directory generator on main).
#
# Why thick plugin (not thin):
# - Thick = daemon + thin shim binary; daemon handles NAD watch + CRD reads
# centrally so each pod's CNI ADD doesn't hit the K8s API server. Better
# for clusters with many NAD-using pods.
# - Thin = each CNI ADD process directly contacts K8s API. Simpler but
# scales worse and has more failure modes.
# - KubeVirt + multi-VM workload pattern fits thick perfectly.
#
# Cluster context (verified 2026-05-08):
# - RKE2 v1.34.5 on 3 nodes (rke2-server, rke2-agent1, rke2-agent2)
# - Calico CNI (Tigera-managed) at /etc/cni/net.d + /opt/cni/bin (default)
# - openSUSE Leap 16, kernel 6.12, containerd 2.1.5
# - host bridge for PROD VLAN 57 = `br-prod` (PUPPET HOST WORK — see Phase 1.5
# in docs/infrastructure/windows-server-build-runner-plan.md)
#
# Version pin: snapshot-thick → pinning to v4.2.2 release tag at deploy time
# would require a private mirror of the image. Upstream `snapshot-thick` tag
# is updated on every release, so for now we trust upstream + Calico's
# established pattern. Pin to a specific SHA256 once we mirror to Gitea OCI.
#
# Apply (once committed to bluejay-infra main, ApplicationSet auto-syncs):
# git add apps/multus/multus.yaml && git commit && git push origin main
# # ArgoCD `infra-multus` Application appears within 3 min via ApplicationSet
#
# Verify:
# kubectl -n kube-system get ds kube-multus-ds
# kubectl -n kube-system rollout status ds kube-multus-ds
# kubectl get crd network-attachment-definitions.k8s.cni.cncf.io
# =============================================================================
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: network-attachment-definitions.k8s.cni.cncf.io
annotations:
bluejay.iamworkin.lan/source: "k8snetworkplumbingwg/multus-cni v4.2.2"
spec:
group: k8s.cni.cncf.io
scope: Namespaced
names:
plural: network-attachment-definitions
singular: network-attachment-definition
kind: NetworkAttachmentDefinition
shortNames:
- net-attach-def
versions:
- name: v1
served: true
storage: true
schema:
openAPIV3Schema:
description: 'NetworkAttachmentDefinition is a CRD schema specified by the Network Plumbing
Working Group to express the intent for attaching pods to one or more logical or physical
networks. More information available at: https://github.com/k8snetworkplumbingwg/multi-net-spec'
type: object
properties:
apiVersion:
type: string
kind:
type: string
metadata:
type: object
spec:
description: 'NetworkAttachmentDefinition spec defines the desired state of a network attachment'
type: object
properties:
config:
description: 'NetworkAttachmentDefinition config is a JSON-formatted CNI configuration'
type: string
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: multus
rules:
- apiGroups: ["k8s.cni.cncf.io"]
resources:
- '*'
verbs:
- '*'
- apiGroups:
- ""
resources:
- pods
- pods/status
verbs:
- get
- list
- update
- watch
- apiGroups:
- ""
- events.k8s.io
resources:
- events
verbs:
- create
- patch
- update
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: multus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: multus
subjects:
- kind: ServiceAccount
name: multus
namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: multus
namespace: kube-system
---
kind: ConfigMap
apiVersion: v1
metadata:
name: multus-daemon-config
namespace: kube-system
labels:
tier: node
app: multus
data:
daemon-config.json: |
{
"chrootDir": "/hostroot",
"cniVersion": "0.3.1",
"logLevel": "verbose",
"logToStderr": true,
"cniConfigDir": "/host/etc/cni/net.d",
"multusAutoconfigDir": "/host/etc/cni/net.d",
"multusConfigFile": "auto",
"socketDir": "/host/run/multus/"
}
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: kube-multus-ds
namespace: kube-system
labels:
tier: node
app: multus
name: multus
spec:
selector:
matchLabels:
name: multus
updateStrategy:
type: RollingUpdate
template:
metadata:
labels:
tier: node
app: multus
name: multus
spec:
hostNetwork: true
hostPID: true
tolerations:
- operator: Exists
effect: NoSchedule
- operator: Exists
effect: NoExecute
serviceAccountName: multus
containers:
- name: kube-multus
image: ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick
command: [ "/usr/src/multus-cni/bin/multus-daemon" ]
# 2026-05-11: upstream default of 50Mi memory limit OOM-cascades when
# an operator-owned namespace accumulates >100 pending pods retrying
# CNI ADD. RemoteDesktop emitted 219 orphan rd-browser-only pods
# (missing OwnerReferences), kubelet's CNI ADD avalanche pushed multus
# over 50Mi, OOMKilled, restarted with even bigger backlog → loop.
# 21h cluster outage. See FlowerCore.Notes:
# feedback_multus_50mi_limit_oom_orphan_pod_avalanche.md
# 1Gi limit / 512Mi request comfortably handles a 200+ pod CNI
# catchup burst on 64GB nodes (nodes are <25% used in steady-state).
# Drop back toward 256Mi only after MultusMemoryPressure alert
# proves steady-state working set sits well below 200Mi.
resources:
requests:
cpu: "100m"
memory: "512Mi"
limits:
cpu: "100m"
memory: "1Gi"
securityContext:
privileged: true
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- name: cni
mountPath: /host/etc/cni/net.d
# multus-daemon expects that cnibin path must be identical between pod and container host.
# e.g. if the cni bin is in '/opt/cni/bin' on the container host side, then it should be mount to '/opt/cni/bin' in multus-daemon,
# not to any other directory, like '/opt/bin' or '/usr/bin'.
- name: cnibin
mountPath: /opt/cni/bin
- name: host-run
mountPath: /host/run
- name: host-var-lib-cni-multus
mountPath: /var/lib/cni/multus
- name: host-var-lib-kubelet
mountPath: /var/lib/kubelet
mountPropagation: HostToContainer
- name: host-run-k8s-cni-cncf-io
mountPath: /run/k8s.cni.cncf.io
- name: host-run-netns
mountPath: /run/netns
mountPropagation: HostToContainer
- name: multus-daemon-config
mountPath: /etc/cni/net.d/multus.d
readOnly: true
- name: hostroot
mountPath: /hostroot
mountPropagation: HostToContainer
- mountPath: /etc/cni/multus/net.d
name: multus-conf-dir
env:
- name: MULTUS_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
initContainers:
- name: install-multus-binary
image: ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick
command:
- "sh"
- "-c"
- "cp /usr/src/multus-cni/bin/multus-shim /host/opt/cni/bin/multus-shim && cp /usr/src/multus-cni/bin/passthru /host/opt/cni/bin/passthru"
resources:
requests:
cpu: "10m"
memory: "15Mi"
securityContext:
privileged: true
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- name: cnibin
mountPath: /host/opt/cni/bin
mountPropagation: Bidirectional
terminationGracePeriodSeconds: 10
volumes:
- name: cni
hostPath:
path: /etc/cni/net.d
- name: cnibin
hostPath:
path: /opt/cni/bin
- name: hostroot
hostPath:
path: /
- name: multus-daemon-config
configMap:
name: multus-daemon-config
items:
- key: daemon-config.json
path: daemon-config.json
- name: host-run
hostPath:
path: /run
- name: host-var-lib-cni-multus
hostPath:
path: /var/lib/cni/multus
- name: host-var-lib-kubelet
hostPath:
path: /var/lib/kubelet
- name: host-run-k8s-cni-cncf-io
hostPath:
path: /run/k8s.cni.cncf.io
- name: host-run-netns
hostPath:
path: /run/netns/
- name: multus-conf-dir
hostPath:
path: /etc/cni/multus/net.d

Some files were not shown because too many files have changed in this diff Show More