bluejay-infra

Author	SHA1	Message	Date
Andrew Stoltz	e0460bd881	infra(ai): consolidate fleet Ollama consumers onto GX10 VIP 10.0.57.201 Repoints fc-chat, fc-ttsreader, knowledge, fc-llm-bridge (off the slow edge1 Pi5 10.0.57.17) and intranet (off the reimaged BLUEJAY-AI test laptop 10.0.56.132) to the GX10 (DGX Spark / GB10) Ollama over the PROD MetalLB VIP 10.0.57.201. GX10 serves gemma3:12b/gemma3:4b/qwen2.5:1.5b/nomic-embed-text/ llama3.2:1b on local NVMe, warm-pinned (keep_alive=-1). fc-chat default model qwen2.5-coder:7b -> gemma3:12b (the coder model won't pull reliably on the GX10; gemma3:12b is the warm fleet default + a better general-chat model). Other consumers keep their exact models. Inline comments referencing edge1/BLUEJAY-AI are now historical; the values are the GX10 VIP. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 00:54:36 -05:00
Andrew Stoltz	c4b08f41ab	feat(infra): prestage broader app exposure hardening	2026-06-04 18:14:22 -05:00
bluejay	300f8ad546	fix(monitoring): probe OIDC-safe health routes Sprint 58 Cx-12. Rebased over OIDC GitOps main; YAML parse and focused bluejay-infra lint tests passed.	2026-06-04 06:45:34 +00:00
Andrew Stoltz	9a58fd2af6	oidc: flip enforcement ON for knowledge + distribution (no-live-proof, fix-forward) Operator 2026-06-04: nothing is production yet, flip OIDC + fix-forward (no browser-proof gate). knowledge: Auth__Enabled false->true (OIDC env already wired). distribution: add OIDC env block (Authority/Audience/ClientId=distribution, ClientSecret from distribution-oidc-client) + Enabled=true; public read/entitlement + Method() allowlist stay open (OIDC gates admin only). Clients already provisioned (secrets present). ArgoCD deploys both. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 23:38:48 -05:00
Robot	ba5f5dd0fb	deploy(knowledge): roll audit backfill fix	2026-06-03 18:24:22 -05:00
Robot	dc699da7b3	fix(knowledge): persist federation database on PVC	2026-06-03 18:17:31 -05:00
Robot	1e8bf54c6e	deploy: roll Chat and Knowledge OIDC images	2026-06-03 18:13:09 -05:00
Codex	0f1dc5f871	fix(certs): kill cert-manager renewal loop on 3 broken Certificate specs Three Certificates requested duration: 2160h (90d) with renewBefore: 720h (30d). step-ca's ACME provisioner caps cert lifetime at 30d, so it silently issued 720h certs — making renewBefore EQUAL to the actual cert lifetime. cert-manager treats the cert as needing immediate renewal the moment it's issued, creates a CertificateRequest, gets a new (still 30d) cert, marks it for immediate renewal, and loops. Damage on 2026-05-07 ~20:30 (caught during regroup after 5h gap): - fc-worldbuilder/worldbuilder-web-tls: 2365 CRs in 18h - fc-distribution/fc-distribution-tls: 10880 CRs in 18h - knowledge/knowledge-tls: 10888 CRs in 18h Total: 24,133 stale CertificateRequest objects in etcd. Bulk-deleted all CRs + Orders in those 3 namespaces, then this commit fixes the source so ArgoCD sync stops re-creating the loop. Fix: match the working 720h/240h pattern used by every other FC service cert (agent-zero, fc-dns, fc-llm-bridge, fc-php, traefik-system, etc.). 30d cert lifetime + 10d renewal headroom = renewal at day 20, which is the cert-manager standard 2/3-of-lifetime practice. Side effect during loop: ALSO contributed to step-ca load and may have caused intermittent timeouts cluster-wide (the latest stuck challenge was timing out dialing step-ca:9443 even though step-ca itself was up). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:32:00 -05:00
Andrew Stoltz	bb09a3786f	fix(knowledge): pin live manifest to bundled edition path	2026-04-29 23:37:02 -05:00
Andrew Stoltz	ae6b8c0142	fix(knowledge): keep mcp key env on new token secret	2026-04-29 23:06:07 -05:00
Andrew Stoltz	da55220218	feat(agent-zero): wire fc_knowledge phase1 rollout	2026-04-29 22:59:19 -05:00
Andrew Stoltz	1b633f57b2	chore(infra): wire knowledge MCP api key secret	2026-04-29 18:04:43 -05:00
Andrew Stoltz	4d9d537d83	fix(knowledge): repoint Ollama at edge1 + flip README to LIVE (Sprint E B7) Two changes after the Phase 2.4 deploy went live at https://knowledge.iamworkin.lan: 1. Ollama URL flip: from BLUEJAY-WS (10.0.56.20:11434) to edge1 Pi 5 (10.0.57.17:11434). Honors the cluster-clean architecture from bluejay-infra@0f9d56e ("Workstation is private dev hardware and should not be in the cluster path"). Query-time embeddings (~ms per query) are fast enough on edge1; bulk index rebuilds (Phase 2.5+) will need a separate ingestion lane that can opt into the workstation GPU when present. ArgoCD picks up the env-var change and rolls the pod automatically — no image rebuild needed. 2. README LIVE status: flip the staged-not-yet-applied banner to LIVE 2026-04-27. Pod running, certificate issued, PVC bound, /healthz 200, /api/v1/editions [] (initial-deploy state). Phase 2.5+ admin UI handles bulk population. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 16:56:35 -05:00
Andrew Stoltz	3bf6511d5d	feat(knowledge): stage Phase 2.4 K8s deployment manifests (Sprint E B2) NOT YET APPLIED — push to origin/main is gated on the DNS A record knowledge.iamworkin.lan -> 10.0.56.200 being live. Per memory feedback_pfsense_dns_required_for_acme, applying the Certificate without DNS in place puts cert-manager into ~2h HTTP-01 backoff and needs `kubectl -n knowledge delete order <name>` recovery. Manifests authored: - apps/knowledge/knowledge.yaml — Namespace, PVC (knowledge-vector-store Longhorn 20Gi RWO), Deployment (single replica, Recreate, image localhost/fc-knowledge-web:v202604272200 placeholder, runAsNonRoot 1654, readOnlyRootFilesystem, drop ALL caps, /healthz startupProbe + readinessProbe, tcpSocket livenessProbe), Service (ClusterIP port 80 -> 8080), Certificate (step-ca-acme ClusterIssuer, 90d duration), IngressRoute (knowledge.iamworkin.lan, websecure entrypoint). - apps/knowledge/kustomization.yaml — `kubectl kustomize` preview file (matches fc-distribution shape; ApplicationSet uses dir generator). - apps/knowledge/README.md — deployment order checklist with the DNS preflight, image build/import loop for all 3 RKE2 nodes, push procedure, smoke verification, initial-deploy-state notes (zero editions until *.db files are pushed to the PVC), resource sizing, probe + middleware notes. Companion artifacts (separate repos, separate commits): - FlowerCore.Knowledge@eb91eb4 — Dockerfile.deploy at repo root - FlowerCore.Notes@96cd443 — scripts/deploy-knowledge.sh Apply order (from apps/knowledge/README.md): 1. Add DNS A record knowledge.iamworkin.lan -> 10.0.56.200 via FlowerCore.DNS or pfSense web UI. 2. Run `bash scripts/deploy-knowledge.sh` from FlowerCore.Notes — this builds + imports the image to all 3 RKE2 nodes with FLOWERCORE_DEPLOY_SKIP_ROLLOUT=1 (since the Deployment doesn't exist yet on the cluster). 3. Bump the image tag in this manifest to match the freshly-imported tag, then `git push` from this repo to land on main. ArgoCD picks up within ~3 minutes and creates `infra-knowledge`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 16:28:26 -05:00
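
Commit 3bf6511d5d gates the push on the DNS A record knowledge.iamworkin.lan -> 10.0.56.200 being live, because applying the Certificate without it reportedly leaves cert-manager in HTTP-01 backoff. A minimal sketch of that preflight might look like the following. The function name `dns_preflight` and the injectable `resolve` parameter are hypothetical and are not part of the repository or its README; the sketch also assumes a single IPv4 A record.

```python
import socket
from typing import Callable


def dns_preflight(
    hostname: str,
    expected_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True only if `hostname` resolves to `expected_ip`.

    Hypothetical helper: `resolve` defaults to socket.gethostbyname, which
    returns one IPv4 address as a string, and can be swapped out in tests.
    """
    try:
        return resolve(hostname) == expected_ip
    except OSError:
        # socket.gaierror (no such record) is a subclass of OSError.
        return False
```

Under these assumptions, `dns_preflight("knowledge.iamworkin.lan", "10.0.56.200")` would need to return True before step 3 of the apply order (the `git push`) is attempted.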

14 Commits
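
The arithmetic behind commit 0f1dc5f871 (requested 2160h, issuer cap 720h, renewBefore 720h versus the fixed 720h/240h pattern) can be checked with a short sketch. This is a simplified model that follows the commit message's description, where renewal falls due at expiry minus renewBefore; it is not cert-manager's actual implementation, and the helper names `issued_lifetime` and `renewal_offset` are illustrative only.

```python
from datetime import timedelta

# Lifetime cap of the step-ca ACME provisioner, as stated in the commit message.
ISSUER_CAP = timedelta(hours=720)


def issued_lifetime(requested: timedelta, cap: timedelta = ISSUER_CAP) -> timedelta:
    """Lifetime the issuer actually grants: the request, capped by the issuer."""
    return min(requested, cap)


def renewal_offset(lifetime: timedelta, renew_before: timedelta) -> timedelta:
    """Time after issuance at which renewal falls due (simplified model:
    expiry minus renewBefore, never earlier than issuance)."""
    return max(lifetime - renew_before, timedelta(0))


# Broken spec: duration 2160h, renewBefore 720h, but only 720h is issued.
broken = renewal_offset(issued_lifetime(timedelta(hours=2160)), timedelta(hours=720))

# Fixed spec: duration 720h, renewBefore 240h.
fixed = renewal_offset(issued_lifetime(timedelta(hours=720)), timedelta(hours=240))

print(broken.days, fixed.days)  # prints "0 20"
```

In this model the broken spec is due for renewal zero days after issuance, which matches the loop described in the commit, while the fixed spec renews at day 20 of a 30-day certificate, that is, at two thirds of its lifetime.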