Commit Graph

350 Commits

Author SHA1 Message Date
Andrew Stoltz
436185818d fc-distribution: restrict public IngressRoute to GET+HEAD only
Live verification 2026-04-24 caught POST /blobs on dist.flowercore.io
returning 201 Created with the blob persisted — admin write operations
reachable on the public surface. Controller-level strict entitlement
was on, but that gates reads; writes weren't blocked at all.

Fix: add Method(GET) || Method(HEAD) to the Host match on the public
IngressRoute. POST/PUT/PATCH/DELETE now miss every route for
dist.flowercore.io and Traefik returns 404 before the pod sees the
request. Edge-level defense-in-depth on top of the controller's
strict-mode entitlement check.

The internal IngressRoute for dist.iamworkin.lan stays unrestricted —
admin POST /blobs + POST /manifests flows keep working from the lab.
2026-04-23 20:12:25 -05:00
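
The method restriction described in the commit above can be modeled in a few lines. This is an illustrative sketch only, not the deployed manifest: `build_public_rule` and `route_matches` are hypothetical helper names, and the rule string assumes Traefik's `Host` and `Method` matchers combined with `&&` and `||`.

```python
# Illustrative model of the method-restricted public route (not the real manifest).
ALLOWED_METHODS = ("GET", "HEAD")  # read-only verbs kept on the public surface


def build_public_rule(host: str, methods=ALLOWED_METHODS) -> str:
    """Compose a Traefik-style match rule: Host AND (Method OR Method ...)."""
    method_clause = " || ".join(f"Method(`{m}`)" for m in methods)
    return f"Host(`{host}`) && ({method_clause})"


def route_matches(host: str, method: str, rule_host: str,
                  methods=ALLOWED_METHODS) -> bool:
    """Return True when a request would match the restricted route.

    A request that matches no route is answered by the edge proxy itself,
    so a False here corresponds to the 404 described in the commit.
    """
    return host == rule_host and method.upper() in methods


rule = build_public_rule("dist.flowercore.io")
```

Under this model a POST to dist.flowercore.io fails `route_matches` and never reaches the pod, while GET and HEAD still match.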
Andrew Stoltz
c3cc404beb fc-distribution: add dist.flowercore.io public surface (Cloudflare A record + Origin Cert + profile-header middleware)
Lights up dist.flowercore.io end-to-end:
- cf-origin-flowercore-io Secret (literal *.flowercore.io Origin Cert,
  copied from the telephony/gitea-public/matrix/mail/flowercore/fc-landing
  pattern — not via OnePasswordItem yet).
- Traefik Middleware dist-public-profile-header: strips any caller-supplied
  X-FC-Distribution-Profile, injects 'public' so the controller's
  NamedEntitlementResolverRouter routes to the strict resolver.
- IngressRoute fc-distribution-public: Host(`dist.flowercore.io`) ->
  same backing Service as the internal dist.iamworkin.lan route.
  Middleware attached; cert secret cf-origin-flowercore-io.

Cloudflare DNS A record dist.flowercore.io -> 74.40.140.24 (proxied)
already created 2026-04-24 via Cloudflare API (record id
e9b957511556f37ff6763f4441acbc45).

Controller entitlement config is still DefaultAllow=false + empty
PublicEditions on the 'public' profile, so every public request
returns 403 by default. Populate FlowerCore__Distribution__EntitlementPublic__PublicEditions__0
via env var when ready to expose specific editions.
2026-04-23 20:10:29 -05:00
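
The header handling attributed to dist-public-profile-header in the commit above amounts to "drop whatever the caller sent, then set the value yourself". A minimal sketch of that behavior, assuming a plain dict of request headers; `apply_public_profile` is a hypothetical name and this is not the Traefik implementation:

```python
PROFILE_HEADER = "X-FC-Distribution-Profile"


def apply_public_profile(headers: dict[str, str]) -> dict[str, str]:
    """Strip any caller-supplied profile header, then inject 'public'.

    Header names are compared case-insensitively, as HTTP requires.
    """
    cleaned = {
        name: value
        for name, value in headers.items()
        if name.lower() != PROFILE_HEADER.lower()
    }
    cleaned[PROFILE_HEADER] = "public"
    return cleaned
```

The point of stripping first is that a caller on the public surface cannot select a different profile by sending the header themselves.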
Andrew Stoltz
90627819cc fc-distribution: bump to v202604240010 (Phase 4 header-routing controller) 2026-04-23 19:23:35 -05:00
Andrew Stoltz
c97d486a3d feat(fc-segmentdisplay): switch tls certificate to dns01 2026-04-23 18:39:17 -05:00
Andrew Stoltz
209bdc16cd fc-distribution: bump to v202604232310 (Front D entitlement wired into ManifestsController) 2026-04-23 18:11:21 -05:00
Andrew Stoltz
3999634b06 Seed ttsreader piper voices before startup 2026-04-23 17:18:57 -05:00
Andrew Stoltz
61538d3712 fc-distribution: bump to v202604232212 (disable 30MB body size limit on POST /blobs) 2026-04-23 17:11:56 -05:00
Andrew Stoltz
ccaac367af fc-distribution: bump to v202604232206 (adds POST /blobs endpoint) 2026-04-23 17:07:13 -05:00
Andrew Stoltz
407d473b71 feat(infra): route dns preflight through flowercore dns 2026-04-23 17:03:22 -05:00
Andrew Stoltz
f9593e494a Allow ttsreader piper voice downloads 2026-04-23 16:50:21 -05:00
Andrew Stoltz
5b6c7b97fc feat(fc-distribution): bump image to v202604232145 — cert chain endpoint
- Serves GET /manifests/{edition}/{version}.cert (leaf+intermediate PEM)
- Adds CertChainPem migration on startup (nullable column)
- ManifestSignService now embeds version-specific certChainUrl

Provisioning Agent's verify step will flip from ChainNotServed (Phase 2A
soft-pass) to Valid once a fresh edition is published with this image.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 16:43:56 -05:00
Andrew Stoltz
a76eeb5c39 Add dedicated selectable piper for ttsreader 2026-04-23 16:37:03 -05:00
Andrew Stoltz
8a960ffc73 feat(fc-distribution): K8s manifest for Phase 1 edition publisher
Adds apps/fc-distribution/{fc-distribution.yaml,kustomization.yaml,README.md}.
Ships the FlowerCore.Distribution service (Blazor + REST + MCP) backed by
Synology NFS for SQLite catalog + content-addressed blob root.

Contents:
- Namespace fc-distribution
- 3x OnePasswordItem (FlowerCore Code Signing CA informational + per-edition
  signing keys for kiosk-standard and aistation-field)
- Deployment: localhost/fc-distribution:v202604232000 (already imported to
  rke2-server via ctr), pinned to rke2-server nodeSelector because Synology
  NFS ACL restricts writes to that node, emptyDir for /tmp + /app/logs,
  inline NFS for /data (subPath distribution/data) and /blobs (subPath
  distribution/blobs), Secret volume mounts for /signing/<edition>.
  readOnlyRootFilesystem + runAsUser 1654 + drop ALL capabilities.
  Probes: startup + readiness on /healthz, liveness on tcpSocket (defense
  against future auth middleware accidentally gating /healthz).
- Service (ClusterIP :80 -> container :8080)
- Certificate (cert-manager ClusterIssuer step-ca-acme, dist.iamworkin.lan,
  90d / 30d renew). pfSense Unbound override dist.iamworkin.lan ->
  10.0.56.200 already in place (req'd for HTTP-01).
- IngressRoute (Traefik websecure, Host rule on dist.iamworkin.lan)

Env var keys align with the scaffold:
  FlowerCore__Database__ConnectionStrings__Sqlite
  FlowerCore__Distribution__Blobs__Root
  FlowerCore__Distribution__Signing__EditionCerts__<slug>__{CertPath,KeyPath}

Consumer: ProvisioningAgent (USB-side, Phase 2) — see
FlowerCore.Notes/docs/infrastructure/usb-provisioning-architecture.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 15:59:50 -05:00
Andrew Stoltz
686dbacc66 Bump TTS Reader image for render follow-along 2026-04-23 15:54:07 -05:00
Andrew Stoltz
5ccf055465 check-pfsense-dns: add live-cluster scan
Extends the pre-merge DNS gate to (optionally) scan live-cluster
Certificates + IngressRoutes via kubectl. Closes the coverage hole
where a service's IngressRoute gets deployed from its own repo (not
from bluejay-infra/apps/) and the manifests-only scan misses it —
fc-retail/retail-web-tls stuck Issuing for 15h on a missing pfSense
Unbound override was exactly this class of bug.

Auto mode: if kubectl is on PATH and usable, live-scan runs silently.
--live  forces it (and errors out if kubectl can't reach the cluster).
--no-live skips live entirely (CI path with no cluster access).

Immediate live-scan finding on 2026-04-23: 10 orphan *.iamworkin.lan
IngressRoutes from failed e2e / codex / smoke / deleteproof test runs
in fc-php + fc-tenant-default (2026-04-16/17). None have DNS overrides
so their Certificates have been failing to issue for 7 days — the new
CertManagerCertificateNotReady alert will catch them too. Cleanup
(delete abandoned IngressRoutes + Certificates + CertificateRequests)
is a separate task; this check now surfaces them.
2026-04-23 15:51:19 -05:00
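
The three live-scan modes described in the commit above (auto, --live, --no-live) can be sketched with argparse. This is an assumption-laden sketch, not the script's actual code: `parse_args` and `decide_live_scan` are hypothetical names, and whether kubectl is usable is passed in as a boolean rather than probed.

```python
import argparse


def parse_args(argv):
    """Parse the two mutually exclusive live-scan flags; neither means auto."""
    parser = argparse.ArgumentParser(description="DNS gate (sketch)")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--live", dest="live", action="store_const",
                       const=True, default=None)
    group.add_argument("--no-live", dest="live", action="store_const",
                       const=False)
    return parser.parse_args(argv)


def decide_live_scan(live, kubectl_usable: bool) -> bool:
    """Return True if the live-cluster scan should run.

    live is True (--live), False (--no-live) or None (auto mode).
    """
    if live is False:
        return False  # CI path: never touch the cluster
    if live is True:
        if not kubectl_usable:
            raise SystemExit("--live requested but kubectl cannot reach the cluster")
        return True
    return kubectl_usable  # auto: scan only when kubectl works
```

A caller would combine the two, for example `decide_live_scan(parse_args(sys.argv[1:]).live, kubectl_usable)`, with `kubectl_usable` determined however the real script does it.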
Andrew Stoltz
4da60820c6 Deploy TTS Reader queue presentation fix 2026-04-23 15:13:21 -05:00
Andrew Stoltz
1cc4324cfb Deploy TTS Reader import and preview fixes 2026-04-23 14:28:08 -05:00
Andrew Stoltz
bfc755057e fix(agent-zero): use streamable http for chat mcp 2026-04-23 13:54:06 -05:00
Andrew Stoltz
d6008ee205 fix(agent-zero): allow chat mcp pod port 2026-04-23 13:29:36 -05:00
Andrew Stoltz
39fe6f1dba fix(agent-zero): route chat mcp in-cluster 2026-04-23 13:26:10 -05:00
Andrew Stoltz
90fcf0cd5d fix(agent-zero): expose openai provider key 2026-04-23 13:21:12 -05:00
Andrew Stoltz
ffef5c9126 Deploy TTS Reader annotation timeout fix 2026-04-23 13:06:17 -05:00
Andrew Stoltz
634e90a9ee Deploy TTS Reader quick hardening release 2026-04-23 12:47:45 -05:00
Andrew Stoltz
86ccca18e3 Add Chat MCP server to Agent Zero 2026-04-23 12:41:58 -05:00
Andrew Stoltz
1c5caf3f40 Deploy TTS Reader v20260423114016 2026-04-23 11:57:39 -05:00
Andrew Stoltz
d3db19b0ca guacamole: enable json auth for remotedesktop sso 2026-04-23 11:27:30 -05:00
Andrew Stoltz
702a6e4f52 fix(agent-zero): use short DNS name to avoid CoreDNS template hijack
The full FQDN fc-llm-bridge.fc-llm-bridge.svc.cluster.local has 4 dots,
which is less than the pod's ndots:5 threshold. The resolver then
applies every entry in the search list BEFORE falling through to the
bare FQDN, and the CoreDNS 'template iamworkin.lan' catch-all matches
"...svc.cluster.local.iamworkin.lan" and returns Traefik VIP
10.0.56.200. The egress NetworkPolicy blocks that VIP (0.0.0.0/0
EXCEPT 10.0.0.0/8), so curl hangs for 30-134s and returns HTTP 000.

Reference: feedback_coredns_ndots_template_collision memory.

Fix: use "fc-llm-bridge.fc-llm-bridge.svc" (2 dots, still <5 so search
expansion still fires, but the "cluster.local" suffix expands it to
"...svc.cluster.local", which hits the Kubernetes plugin in CoreDNS and
returns the real ClusterIP 10.43.67.125 before the iamworkin.lan template
is ever consulted).

Verified: pod-exec curl fc:cheap → HTTP 200 with a real chat.completion
envelope (Ollama/gemma3:4b via bridge).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:02:09 -05:00
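
The lookup order behind the commit above can be reproduced with a short model of a glibc-style stub resolver. This is a sketch under stated assumptions: `resolution_order` is a hypothetical helper, and the search list (including the `agent-zero` namespace entry) is assumed from the usual Kubernetes pod layout plus the iamworkin.lan suffix the commit mentions.

```python
# Assumed pod search list; the namespace entry is a guess for illustration.
SEARCH = [
    "agent-zero.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "iamworkin.lan",
]


def resolution_order(name: str, search: list[str], ndots: int) -> list[str]:
    """Return candidate names in the order a stub resolver would query them."""
    if name.endswith("."):
        return [name]  # trailing dot: already absolute, search list skipped
    expanded = [f"{name}.{suffix}." for suffix in search]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try the name as-is first
    return expanded + [absolute]  # too few dots: search list goes first
```

The resolver stops at the first candidate that returns an answer. With ndots=5, the full FQDN is only tried as-is after the iamworkin.lan candidate, which the template answers; the short name produces the real service name as an earlier candidate, so the template is never reached.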
Andrew Stoltz
6cbb5d8792 fix(agent-zero): NetworkPolicy egress rule for fc-llm-bridge (ADR-088)
The chat_model flip (62db15c) pointed Agent Zero at
fc-llm-bridge.fc-llm-bridge.svc.cluster.local:8080 but the existing
agent-zero-netpol only allowed egress to specific node IPs
(10.0.56.20:11434, 10.0.57.17:11434, 10.0.57.16:5200, 10.0.56.11:6443)
plus public-internet (with RFC1918 exclusion). ClusterIP traffic to
10.43.0.0/16 was implicitly denied, so pod-exec curl to the bridge
timed out after 134s.

Adds an egress rule allowing TCP 8080 to the fc-llm-bridge namespace
(matched by kubernetes.io/metadata.name which K8s 1.22+ sets
automatically). No ingress changes needed — fc-llm-bridge has no
NetworkPolicy, so the ingress side is already open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:59:17 -05:00
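
Why ClusterIP traffic was dropped before the commit above can be checked with the standard ipaddress module. This is a simplified model, not the real policy: `egress_allowed` is a hypothetical name, the except list is assumed to be the three RFC 1918 ranges, and the public-internet rule is assumed to allow any port.

```python
import ipaddress

# Pre-fix policy as described in the commit: specific node endpoints plus
# public internet with private ranges carved out.
NODE_ENDPOINTS = {
    ("10.0.56.20", 11434),
    ("10.0.57.17", 11434),
    ("10.0.57.16", 5200),
    ("10.0.56.11", 6443),
}
ALLOW = ipaddress.ip_network("0.0.0.0/0")
EXCEPT = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def egress_allowed(ip: str, port: int) -> bool:
    """Return True if any egress rule in the modeled policy permits the flow."""
    if (ip, port) in NODE_ENDPOINTS:
        return True  # explicit node IP and port rule
    addr = ipaddress.ip_address(ip)
    return addr in ALLOW and not any(addr in net for net in EXCEPT)
```

Because NetworkPolicy rules are additive, the fix did not need to touch these rules; it added one more rule that selects the fc-llm-bridge namespace.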
Andrew Stoltz
62db15c69c feat(agent-zero): route chat_model through fc-llm-bridge (ADR-088)
Flips Agent Zero's chat_model from direct local Ollama (gemma3:12b via
the 127.0.0.1:11434 sidecar proxy) to the FlowerCore LLM Bridge
(fc:balanced tier, OpenAI-compatible, Anthropic Claude Sonnet under the
hood) so chat turns are spend-tracked and can dispatch to any provider
via a single tier alias.

Scope is intentionally minimal and reversible:
  - chat_model: ollama/gemma3:12b/127.0.0.1:11434
              → openai/fc:balanced/fc-llm-bridge internal service URL
  - utility_model, embedding_model, browser_model: UNCHANGED
    (stay on local 127.0.0.1 Ollama sidecar — no spend, low latency,
    not worth routing through the bridge for small-model traffic).

Auth: new A0_SET_chat_model_api_key env var wired to the
fc-llm-bridge-api-keys Secret (field: agent-zero-k8s). The Secret is
synced by a new OnePasswordItem pointing at "FC LLM Bridge API Keys"
in the IAmWorkin vault. Bearer-token auth is now accepted by the
bridge (FlowerCore.LlmBridge@3225f1f).

Rollback: revert this commit; old image v202604231449 is still present
on all RKE2 nodes, and Agent Zero's strategy: Recreate makes the flip
atomic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:54:27 -05:00
Andrew Stoltz
84634f59f0 chore(fc-llm-bridge): bump image to v202604231520
Ships the Bearer-token auth fix (FlowerCore.LlmBridge@3225f1f) so Agent
Zero's OpenAI provider can authenticate with Authorization: Bearer in
addition to the original X-Api-Key header.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:51:57 -05:00
Andrew Stoltz
4cd5806fd0 fix(fc-llm-bridge): set dnsConfig ndots=2 to prevent CoreDNS wildcard hijack
Pods in this cluster inherit ndots=5. External FQDNs with <5 dots (like
api.anthropic.com) are expanded through the search path first, and the 4th
suffix `api.anthropic.com.iamworkin.lan` matches CoreDNS' `template IN A
iamworkin.lan` wildcard — resolves to Traefik VIP 10.0.56.200. TLS connect
lands on Traefik's default cert and the AnthropicClient rejects with
RemoteCertificateNameMismatch/RemoteCertificateChainErrors.

Setting ndots=2 makes the resolver try the bare FQDN first (api.anthropic.com
has 2 dots, which meets the threshold), so the search path never fires.

Reference: memory feedback_coredns_ndots_template_collision. Wider follow-up:
the CoreDNS template plugin should add fallthrough for external public suffixes,
so every FC service calling external HTTPS APIs stops hitting this trap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:42:17 -05:00
Andrew Stoltz
11c48bef30 chore(fc-llm-bridge): bump to v202604231449 (Budget 1.0.1 multi-provider dispatcher)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:36:05 -05:00
Andrew Stoltz
a86e87050b fix(fc-llm-bridge): anthropic secret key is 'password' not 'credential'
The 1Password item "Claude API Key" stores the key in a standard Password
field (labeled `password`), so the OnePasswordItem operator creates the K8s
Secret with key `password`. Deployment was referencing `credential`, which
made the pod fail with CreateContainerConfigError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:29:32 -05:00
Andrew Stoltz
0214f94ac4 chore(fc-llm-bridge): bump image to v202604231424 (first live tag)
Built from FlowerCore.LlmBridge@6d285b5 (initial scaffold). Imported on all
three RKE2 nodes via podman save + ctr import. Replaces v00000000000000
placeholder — ArgoCD sync will roll the pod.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:28:05 -05:00
Andrew Stoltz
a1b8eb379d feat(fc-llm-bridge): stage ADR-088 manifests (not yet applied)
Staged but NOT applied. Do not git push until the three pre-requisites below
are done. See apps/fc-llm-bridge/README.md for the full order-of-ops.

Manifests (apps/fc-llm-bridge/fc-llm-bridge.yaml, 8 docs):
  - Namespace fc-llm-bridge
  - OnePasswordItem anthropic-api-key (existing Claude API Key item)
  - OnePasswordItem fc-llm-bridge-api-keys (NEW item, pending creation)
  - PersistentVolumeClaim fc-llm-bridge-data (2Gi longhorn)
  - Deployment fc-llm-bridge (port 8080, uid 1654, readOnlyRootFilesystem,
    tcpSocket probes to survive future ApiKeyAuthMiddleware reordering)
  - Service fc-llm-bridge ClusterIP
  - Certificate fc-llm-bridge-cert (step-ca-acme)
  - IngressRoute fc-llm-bridge (fc-llm-bridge.iamworkin.lan, websecure)

Pre-requisites BEFORE git push:
  1. pfSense Unbound override fc-llm-bridge.iamworkin.lan -> 10.0.56.200
     (currently NXDOMAIN -- verified via nslookup and check-pfsense-dns.py).
     Skipping this step puts cert-manager HTTP-01 into ~2h backoff.
  2. Create 1Password item `FC LLM Bridge API Keys` in vault IAmWorkin with
     password fields: agent-zero-ws, agent-zero-k8s, spare-1, spare-2.
  3. Build + import localhost/fc-llm-bridge:v<tag> to rke2-server +
     rke2-agent1 + rke2-agent2. Bump image tag from placeholder
     v00000000000000 before committing the apply.

Related: ADR-088 (FlowerCore.Notes/ARCHITECTURE.md), design doc at
FlowerCore.Notes/docs/ai-agents/agent-zero-anthropic-bridge.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 03:10:36 -05:00
Andrew Stoltz
9a1665907c fc-signalcontrol: align live port and selectors 2026-04-22 23:22:14 -05:00
Andrew Stoltz
899804215a statefulsets: align guacamole and matrix drift defaults 2026-04-22 23:11:47 -05:00
Andrew Stoltz
1dc66738e6 zabbix: align postgres tracking label 2026-04-22 22:50:24 -05:00
Andrew Stoltz
5623a272c5 zabbix: include statefulset defaults 2026-04-22 22:39:31 -05:00
Andrew Stoltz
3d3f91160b monitoring: add Print.Web Ollama Zabbix template 2026-04-22 22:07:40 -05:00
Andrew Stoltz
93f77c1844 fix(monitoring): use bluejay_v2 auth for snmp-nas (not public_v2)
Synology NAS is configured with community bluejay_monitor
(→ snmp.yml auth 'bluejay_v2'), not public. public_v2 was returning
HTTP 500 from snmp-exporter for this target. Verified bluejay_v2
returns metrics.

Keeps printer (10.0.58.107) on public_v2 — Epson ET-3750 uses
community "public" as documented in its SNMP settings.
2026-04-22 21:32:14 -05:00
Andrew Stoltz
59efc460fd fix(irc): use short name for unrealircd in anope + thelounge configs
Same CoreDNS iamworkin.lan template + ndots:5 hijack as the irc-notify fix.
Anope services (nickserv/chanserv/memo) have been disconnected from unrealircd
for weeks ("Host is unreachable" every 3s). Thelounge server defaults pointed
at the same broken FQDN.

Short name unrealircd.irc.svc resolves to the ClusterIP directly.
2026-04-22 21:23:38 -05:00
Andrew Stoltz
02959f1ac6 docs: deployment runbook + pfSense DNS pre-merge check
Adds a real README describing the 4-step deploy flow, with pfSense Unbound
host overrides as step 1 (the prerequisite that, if skipped, silently breaks
cert-manager HTTP-01 for ~2h per cert until manually diagnosed — root cause
of the 2026-04-22 cluster-wide cert outage).

Adds scripts/check-pfsense-dns.py: parses every apps/*/*.yaml, extracts
hostnames from Certificate.spec.dnsNames and Traefik IngressRoute
`Host(...)` match rules, and fails the check if any don't resolve via the
system DNS (pfSense Unbound on this LAN). Ignores IRC server-link labels,
image tags, comments — only checks hostnames cert-manager and Traefik will
actually use.

Run before `git push` or wire into pre-commit / Gitea Actions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 21:11:24 -05:00
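
The hostname extraction and resolution check described in the commit above can be sketched with the standard library. This is not the contents of scripts/check-pfsense-dns.py: `hosts_from_match` and `unresolved` are hypothetical names, YAML parsing is left out, and only backtick-quoted `Host(...)` arguments are handled.

```python
import re
import socket

HOST_CALL = re.compile(r"\bHost\(([^)]*)\)")  # each Host(...) call in a rule
BACKTICKED = re.compile(r"`([^`]+)`")         # each backtick-quoted argument


def hosts_from_match(rule: str) -> list[str]:
    """Extract hostnames from a Traefik IngressRoute match rule string."""
    hosts = []
    for args in HOST_CALL.findall(rule):
        hosts.extend(BACKTICKED.findall(args))
    return hosts


def unresolved(hosts, resolve=socket.gethostbyname) -> list[str]:
    """Return the hostnames that fail to resolve via the system resolver."""
    missing = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(host)
    return missing
```

A gate built on this would exit non-zero whenever `unresolved(...)` returns a non-empty list; the `resolve` parameter exists so the check can be exercised without real DNS.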
Andrew Stoltz
a3aa84bdae fc-ttsreader: bump image to v20260422201135 (Quick Read highlight no-reflow fix)
Quick Read's active-sentence highlight was changing font-weight from
regular to semibold, which shifted glyph widths and reflowed the whole
paragraph mid-playback. New image drops the weight change and uses a
1px box-shadow ring instead for a stable layout.

Built from FlowerCore.TtsReader@e77d69d.
2026-04-22 20:20:30 -05:00
Andrew Stoltz
01cb9a557f fc-ttsreader: deploy fixed reader image 2026-04-22 16:13:15 -05:00
Andrew Stoltz
0fa46ad53b fc-ttsreader: deploy reader UI split image 2026-04-22 15:57:58 -05:00
Andrew Stoltz
1ded5a61c0 fc-segmentdisplay: add TLS ingress gitops stub 2026-04-22 15:55:54 -05:00
Andrew Stoltz
3c1d212251 fc-messageboard: deploy latest web image via gitops 2026-04-22 15:48:05 -05:00
Andrew Stoltz
c0547a9964 fc-signalcontrol: switch probes to tcpSocket — middleware blocks /health
The app's ApiKeyAuthenticationMiddleware runs BEFORE /health is mapped, so
unauthenticated probe requests get 404. tcpSocket probes verify the listener
is up without auth, which is correct for an internal K8s probe (kubelet
talks pod IP directly, not externally).

Real fix is in the app: move /health before the middleware or mark it
[AllowAnonymous]. Tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 15:21:04 -05:00
Andrew Stoltz
973c1dae72 fc-signalcontrol: fix probe path /metrics/prometheus -> /health
The app exposes /health (Program.cs:91 maps a Healthy text response) but does
NOT expose /metrics/prometheus. K8s liveness/readiness probes against a 404
endpoint kept the pod in CrashLoopBackOff after PVC mount was added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 15:15:07 -05:00