Repoints fc-chat, fc-ttsreader, knowledge, fc-llm-bridge (off the slow edge1
Pi5 10.0.57.17) and intranet (off the reimaged BLUEJAY-AI test laptop
10.0.56.132) to the GX10 (DGX Spark / GB10) Ollama over the PROD MetalLB VIP
10.0.57.201. GX10 serves gemma3:12b/gemma3:4b/qwen2.5:1.5b/nomic-embed-text/
llama3.2:1b on local NVMe, warm-pinned (keep_alive=-1).
fc-chat default model qwen2.5-coder:7b -> gemma3:12b (the coder model won't
pull reliably on the GX10; gemma3:12b is the warm fleet default + a better
general-chat model). Other consumers keep their exact models. Inline comments
referencing edge1/BLUEJAY-AI are now historical; the values are the GX10 VIP.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three Certificates requested duration: 2160h (90d) with renewBefore: 720h
(30d). step-ca's ACME provisioner caps cert lifetime at 30d, so it silently
issued 720h certs — making renewBefore EQUAL to the actual cert lifetime.
cert-manager treats the cert as needing immediate renewal the moment it's
issued, creates a CertificateRequest, gets a new (still 30d) cert, marks
it for immediate renewal, and loops.
Damage on 2026-05-07 ~20:30 (caught during regroup after 5h gap):
- fc-worldbuilder/worldbuilder-web-tls: 2365 CRs in 18h
- fc-distribution/fc-distribution-tls: 10880 CRs in 18h
- knowledge/knowledge-tls: 10888 CRs in 18h
Total: 24,133 stale CertificateRequest objects in etcd.
Bulk-deleted all CRs + Orders in those 3 namespaces, then this commit
fixes the source so ArgoCD sync stops re-creating the loop.
Fix: match the working 720h/240h pattern used by every other FC service
cert (agent-zero, fc-dns, fc-llm-bridge, fc-php, traefik-system, etc.).
30d cert lifetime + 10d renewal headroom = renewal at day 20, which is
the cert-manager standard 2/3-of-lifetime practice.
Side effect during loop: ALSO contributed to step-ca load and may have
caused intermittent timeouts cluster-wide (the latest stuck challenge
was timing out dialing step-ca:9443 even though step-ca itself was up).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes after the Phase 2.4 deploy went live at
https://knowledge.iamworkin.lan:
1. **Ollama URL flip**: from BLUEJAY-WS (10.0.56.20:11434) to edge1 Pi 5
(10.0.57.17:11434). Honors the cluster-clean architecture from
bluejay-infra@0f9d56e ("Workstation is private dev hardware and should
not be in the cluster path"). Query-time embeddings (~ms per query)
are fast enough on edge1; bulk index rebuilds (Phase 2.5+) will need a
separate ingestion lane that can opt into the workstation GPU when
present. ArgoCD picks up the env-var change and rolls the pod
automatically — no image rebuild needed.
2. **README LIVE status**: flip the staged-not-yet-applied banner to
LIVE 2026-04-27. Pod running, certificate issued, PVC bound,
/healthz 200, /api/v1/editions [] (initial-deploy state). Phase 2.5+
admin UI handles bulk population.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NOT YET APPLIED — push to origin/main is gated on the DNS A record
knowledge.iamworkin.lan -> 10.0.56.200 being live. Per memory
feedback_pfsense_dns_required_for_acme, applying the Certificate
without DNS in place puts cert-manager into ~2h HTTP-01 backoff and
needs `kubectl -n knowledge delete order <name>` recovery.
Manifests authored:
- apps/knowledge/knowledge.yaml — Namespace, PVC (knowledge-vector-store
Longhorn 20Gi RWO), Deployment (single replica, Recreate, image
localhost/fc-knowledge-web:v202604272200 placeholder, runAsNonRoot
1654, readOnlyRootFilesystem, drop ALL caps, /healthz startupProbe +
readinessProbe, tcpSocket livenessProbe), Service (ClusterIP port
80 -> 8080), Certificate (step-ca-acme ClusterIssuer, 90d duration),
IngressRoute (knowledge.iamworkin.lan, websecure entrypoint).
- apps/knowledge/kustomization.yaml — `kubectl kustomize` preview file
(matches fc-distribution shape; ApplicationSet uses dir generator).
- apps/knowledge/README.md — deployment order checklist with the DNS
preflight, image build/import loop for all 3 RKE2 nodes, push
procedure, smoke verification, initial-deploy-state notes
(zero editions until *.db files are pushed to the PVC), resource
sizing, probe + middleware notes.
Companion artifacts (separate repos, separate commits):
- FlowerCore.Knowledge@eb91eb4 — Dockerfile.deploy at repo root
- FlowerCore.Notes@96cd443 — scripts/deploy-knowledge.sh
Apply order (from apps/knowledge/README.md):
1. Add DNS A record knowledge.iamworkin.lan -> 10.0.56.200 via
FlowerCore.DNS or pfSense web UI.
2. Run `bash scripts/deploy-knowledge.sh` from FlowerCore.Notes — this
builds + imports the image to all 3 RKE2 nodes with
FLOWERCORE_DEPLOY_SKIP_ROLLOUT=1 (since the Deployment doesn't
exist yet on the cluster).
3. Bump the image tag in this manifest to match the freshly-imported
tag, then `git push` from this repo to land on main. ArgoCD picks
up within ~3 minutes and creates `infra-knowledge`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>