bluejay-infra/apps/knowledge
Codex 0f1dc5f871 fix(certs): kill cert-manager renewal loop on 3 broken Certificate specs
Three Certificates requested duration: 2160h (90d) with renewBefore: 720h
(30d). step-ca's ACME provisioner caps cert lifetime at 30d, so it silently
issued 720h certs — making renewBefore EQUAL to the actual cert lifetime.
cert-manager treats the cert as needing immediate renewal the moment it's
issued, creates a CertificateRequest, gets a new (still 30d) cert, marks
it for immediate renewal, and loops.

Damage on 2026-05-07 ~20:30 (caught during regroup after 5h gap):
  - fc-worldbuilder/worldbuilder-web-tls:  2365 CRs in 18h
  - fc-distribution/fc-distribution-tls:  10880 CRs in 18h
  - knowledge/knowledge-tls:              10888 CRs in 18h
  Total: 24,133 stale CertificateRequest objects in etcd.

Bulk-deleted all CRs + Orders in those 3 namespaces, then this commit
fixes the source so ArgoCD sync stops re-creating the loop.

Fix: match the working 720h/240h pattern used by every other FC service
cert (agent-zero, fc-dns, fc-llm-bridge, fc-php, traefik-system, etc.).
30d cert lifetime + 10d renewal headroom = renewal at day 20, which is
the cert-manager standard 2/3-of-lifetime practice.

Side effect during loop: ALSO contributed to step-ca load and may have
caused intermittent timeouts cluster-wide (the latest stuck challenge
was timing out dialing step-ca:9443 even though step-ca itself was up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:32:00 -05:00
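
The renewal arithmetic in the commit message above can be checked with a few lines of shell. This is a minimal sketch under a simplified model (renewal is scheduled renewBefore hours before expiry); cert-manager's real calculation works from the notBefore/notAfter of the issued cert. The helper names `renewal_offset_hours` and `describe_spec` are hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash
# Simplified model: hours from issuance until renewal is due.
# Hypothetical helper, not a cert-manager command.
renewal_offset_hours() {
  local lifetime_h=$1 renew_before_h=$2
  echo $(( lifetime_h - renew_before_h ))
}

# Describe what a given (actual lifetime, renewBefore) pair leads to.
describe_spec() {
  local lifetime_h=$1 renew_before_h=$2 offset
  offset=$(renewal_offset_hours "$lifetime_h" "$renew_before_h")
  if (( offset <= 0 )); then
    echo "renewBefore=${renew_before_h}h on a ${lifetime_h}h cert: renews immediately (loop risk)"
  else
    echo "renewBefore=${renew_before_h}h on a ${lifetime_h}h cert: renews at day $(( offset / 24 ))"
  fi
}

describe_spec 720 720   # broken specs: the 2160h request was capped at 720h
describe_spec 720 240   # fixed pattern: renews at day 20
```

The point of the check is that the lifetime passed in must be the one the issuer actually grants (720h here), not the duration requested in the Certificate spec.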

knowledge — FlowerCore.Knowledge.Web (Phase 2.4 K8s deploy)

Status: LIVE 2026-04-27 at https://knowledge.iamworkin.lan — Phase 2.4 closed. Pod running, certificate issued (step-ca-acme), PVC bound (Longhorn 20Gi RWO), ArgoCD infra-knowledge synced. /healthz returns 200, /api/v1/editions returns [] (initial-deploy state — no *.db files in the PVC yet; Phase 2.5+ admin UI handles bulk population). Phase 1 of the Agent Zero MCP rollout keeps /healthz anonymous and gates /mcp behind Authorization: Bearer <token> built from the 1Password item FlowerCore Knowledge MCP Tokens.

FlowerCore.Knowledge.Web is the fleet-wide vector-indexing & RAG hub — a REST + MCP service that scans *.db files under /data/vector-stores and exposes per-edition reachability + corpus search to the rest of the FC ecosystem (Agent Zero, Chat.Web persona memory, AiStation embeddings explorer, TtsReader chapter context, BMO bot, Pi nodes via fc-index sync).

Phase 1 MCP routing is explicit:

  • in-cluster Agent Zero → http://knowledge-web.knowledge.svc/mcp
  • workstation Agent Zero → https://knowledge.iamworkin.lan/mcp
  • probe URL for both lanes → /healthz
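
The two lanes above can be captured in a small helper. This is a sketch only: the lane labels, the function names, and the `KNOWLEDGE_MCP_TOKEN` environment variable are assumptions made for illustration, and how the token is read from 1Password is not shown. The functions print the HTTP status code; what a bare request to /mcp returns with a valid token depends on the MCP transport and is not specified here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a lane label to the Knowledge.Web base URL.
knowledge_base_url() {
  case "$1" in
    in-cluster)  echo "http://knowledge-web.knowledge.svc" ;;
    workstation) echo "https://knowledge.iamworkin.lan" ;;
    *) echo "unknown lane: $1" >&2; return 1 ;;
  esac
}

# Probe /healthz anonymously; prints the HTTP status code.
probe_healthz() {
  local base
  base=$(knowledge_base_url "$1") || return 1
  curl -sk -m 8 -o /dev/null -w '%{http_code}\n' "$base/healthz"
}

# Call /mcp with the bearer token; prints the HTTP status code.
# KNOWLEDGE_MCP_TOKEN is an assumed variable name.
mcp_status() {
  local base
  base=$(knowledge_base_url "$1") || return 1
  curl -sk -m 8 -o /dev/null -w '%{http_code}\n' \
    -H "Authorization: Bearer ${KNOWLEDGE_MCP_TOKEN:?set KNOWLEDGE_MCP_TOKEN}" \
    "$base/mcp"
}

# Usage: probe_healthz workstation
```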

Deployment order (do NOT skip / reorder)

1. FlowerCore.DNS public A record — knowledge.iamworkin.lan -> 10.0.56.200

The record must exist BEFORE the Certificate resource is created; otherwise the cert-manager HTTP-01 challenge fails and silently backs off for about 2h. Memory: feedback_pfsense_dns_required_for_acme.

The canonical path is FlowerCore.DNS:

curl -sk https://dns.iamworkin.lan/api/v1/servers
# Find the pfSense serverId, then create the record using the host label only.

curl -sk -X POST https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iamworkin.lan/records \
  -H "Content-Type: application/json" \
  -d '{"name":"knowledge","type":"A","data":"10.0.56.200","ttl":300}'

If FlowerCore.DNS provider writes fail with HTTP 502 and the message "pfSense diag_command.php response did not contain a <pre> block" (the status when Sprint E Track B was authored, 2026-04-27), add the override manually via the pfSense web UI:

  1. Log in to https://10.0.56.1 as admin
  2. Services → DNS Resolver → General Settings → Host Overrides
  3. Add: Host=knowledge, Domain=iamworkin.lan, IP Address=10.0.56.200
  4. Save + Apply Changes

Verify resolution from anywhere on LAN:

nslookup knowledge.iamworkin.lan 10.0.56.1
# Expect: 10.0.56.200

Or against FlowerCore.DNS once the provider is fixed:

curl -sk "https://dns.iamworkin.lan/api/v1/zones/iamworkin.lan/resolve-preflight?hostname=knowledge.iamworkin.lan"
# Expect: "resolvable": true
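
The preflight check above can also gate the rest of the deployment, so that no Certificate is created while DNS is not ready. A minimal sketch, assuming the response body contains the literal field `"resolvable": true` as shown above; the function names `preflight_ok` and `dns_gate` are hypothetical.

```shell
#!/usr/bin/env bash
# Succeeds only if the JSON on stdin contains "resolvable": true.
preflight_ok() {
  grep -Eq '"resolvable"[[:space:]]*:[[:space:]]*true'
}

# Query the resolve-preflight endpoint for a hostname and test the answer.
dns_gate() {
  local host=$1
  curl -sk "https://dns.iamworkin.lan/api/v1/zones/iamworkin.lan/resolve-preflight?hostname=${host}" \
    | preflight_ok
}

# Usage: dns_gate knowledge.iamworkin.lan || { echo "DNS not ready" >&2; exit 1; }
```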

2. Build + import the image to ALL RKE2 nodes

Pods may schedule on any of the three RKE2 nodes (server, agent1, agent2): the Longhorn PVC can be mounted from any node, so nothing pins the pod to one of them. The image must therefore be imported on all three. Memory: feedback_rke2_image_import_targets_all_nodes + feedback_rke2_localhost_imagepullpolicy.

# From BLUEJAY-WS, in D:\git\FlowerCore\FlowerCore.Knowledge
TAG="v$(date +%Y%m%d%H%M)"
dotnet.exe publish -c Release -o deploy/app \
  src/FlowerCore.Knowledge.Web/FlowerCore.Knowledge.Web.csproj
podman build -t "localhost/fc-knowledge-web:$TAG" -f deploy/Dockerfile.deploy deploy
podman save "localhost/fc-knowledge-web:$TAG" -o /tmp/fc-knowledge-web.tar

# Import to all three RKE2 nodes.
# The && skips the import if the copy fails, so a stale tar is never imported.
for node in rke2-server rke2-agent1 rke2-agent2; do
  scp /tmp/fc-knowledge-web.tar "$node:/tmp/" &&
    ssh "$node" "sudo /var/lib/rancher/rke2/bin/ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images import /tmp/fc-knowledge-web.tar"
done

The repo's scripts/deploy-knowledge.sh automates this loop.
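
After the import it can be worth confirming that every node actually holds the tag before bumping the manifest. The following is a sketch, not what scripts/deploy-knowledge.sh does: the function names are hypothetical, and it assumes the same ctr path and socket as the import loop above.

```shell
#!/usr/bin/env bash
CTR="sudo /var/lib/rancher/rke2/bin/ctr -a /run/k3s/containerd/containerd.sock -n k8s.io"

# Build the remote command that looks for an exact image reference.
image_check_cmd() {
  local ref=$1
  printf "%s images ls -q | grep -Fx -- '%s'\n" "$CTR" "$ref"
}

# Report per node whether the image is present; non-zero exit if any is missing.
verify_image_on_nodes() {
  local ref=$1 node missing=0
  shift
  for node in "$@"; do
    if ssh "$node" "$(image_check_cmd "$ref")" >/dev/null; then
      echo "ok $node"
    else
      echo "MISSING $node"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   verify_image_on_nodes "localhost/fc-knowledge-web:$TAG" rke2-server rke2-agent1 rke2-agent2
```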

3. Bump the image tag + push

Edit knowledge.yaml, replace localhost/fc-knowledge-web:v202604272200 with the tag from step 2, then:

cd D:/git/FlowerCore/bluejay-infra
python scripts/check-pfsense-dns.py     # confirms the DNS preflight
git add apps/knowledge/
git commit -m "feat(knowledge): deploy Phase 2.4 K8s manifest"
git push

ArgoCD picks up the change within ~3 minutes and creates the infra-knowledge Application.
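
The manual tag edit in this step could be scripted instead. A minimal sketch, assuming GNU sed (`-i` without a suffix argument) and that the manifest pins the image as `localhost/fc-knowledge-web:<tag>`; the function name `bump_image_tag` is hypothetical.

```shell
#!/usr/bin/env bash
# Replace whatever fc-knowledge-web tag the manifest currently pins.
bump_image_tag() {
  local manifest=$1 tag=$2
  sed -i -E "s#(localhost/fc-knowledge-web:)[A-Za-z0-9._-]+#\1${tag}#g" "$manifest"
}

# Usage (run before git add):
#   bump_image_tag apps/knowledge/knowledge.yaml "$TAG"
```

Matching on the image name rather than on the old tag means the command does not need to know which tag is currently deployed.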

4. Verify

fcadmin_ssh noc1 '
  kubectl -n argocd get application infra-knowledge
  kubectl -n knowledge get certificate,pod,pvc
  curl -sk -m 8 -o /dev/null -w "HTTP %{http_code}\n" https://knowledge.iamworkin.lan/healthz
  curl -sk -m 8 https://knowledge.iamworkin.lan/api/v1/editions | jq
'

Expect: Certificate Ready: True within ~60s, /healthz HTTP 200, /api/v1/editions returns an empty array (no DBs in the PVC yet) on first deploy.
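
Because the certificate and the pod take a moment to become ready, the checks above may need to be repeated. A sketch of a bounded wait, with hypothetical helper names; the Certificate name knowledge-tls is taken from the commit message above, and the attempt count and delay are arbitrary choices.

```shell
#!/usr/bin/env bash
# Retry a command until it succeeds or the attempts run out.
retry_until() {
  local attempts=$1 delay=$2 i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    "$@" && return 0
    (( i < attempts )) && sleep "$delay"
  done
  return 1
}

# True when /healthz answers with HTTP 200.
healthz_is_200() {
  [ "$(curl -sk -m 8 -o /dev/null -w '%{http_code}' https://knowledge.iamworkin.lan/healthz)" = "200" ]
}

# Usage (on noc1):
#   kubectl -n knowledge wait --for=condition=Ready certificate/knowledge-tls --timeout=120s
#   retry_until 12 5 healthz_is_200
```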

Initial-deploy state and Phase 2.5 follow-up

The Longhorn PVC is empty on first deploy. Knowledge.Web's filesystem catalog will report zero editions until vector-store *.db files are pushed into /data/vector-stores. Initial population is a follow-up step (Phase 2.5+, Blazor admin UI's "Rebuild" button); for the first deploy the goal is just to prove the pod boots, /healthz returns 200, and the Traefik IngressRoute serves the Scalar UI.

To copy an existing local DB into the PVC (one-time, manual until Phase 2.5 admin UI lands):

fcadmin_ssh noc1 '
  POD=$(kubectl -n knowledge get pod -l app=knowledge-web -o jsonpath="{.items[0].metadata.name}")
  kubectl -n knowledge cp /var/lib/flowercore/vector-stores/bluejay-ai.db $POD:/data/vector-stores/bluejay-ai.db
'

Probes + middleware notes

  • /healthz is mapped by Controllers/HealthController.cs (controller-based attribute route). Cheap — no DB, no dependencies.
  • Liveness uses tcpSocket as a defensive fallback in case future middleware accidentally gates /healthz behind auth (memory: feedback_k8s_probes_behind_auth_middleware).
  • /openapi/v1.json and /scalar/v1 are wired by UseFlowerCoreApi. Per memory feedback_k8s_probes_must_not_hit_openapi, probes must NOT point at OpenAPI documents — the MapOpenApi call can be slow during cold startup.

Resource sizing

  • 256Mi memory request / 1Gi limit.
  • 100m CPU request / 1000m limit.
  • 20Gi Longhorn PVC initially: sufficient for the bluejay-ai DB (1.94Gi), fleet-pi-edge (352Mi), and fleet-bmo-bot (141Mi), about 2.4Gi in total, plus headroom. Resize via kubectl -n knowledge edit pvc knowledge-vector-store if usage grows past 15Gi.
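
The PVC sizing above is simple arithmetic and can be re-run when new stores are added. A minimal sketch with a hypothetical function name; store sizes are passed in Mi, and 1.94Gi is rounded to 1987Mi.

```shell
#!/usr/bin/env bash
# Sum store sizes (in Mi) and report usage against a PVC size (in Gi).
pvc_headroom() {
  local pvc_gi=$1
  shift
  printf '%s\n' "$@" | awk -v pvc="$pvc_gi" '
    { used += $1 }
    END { printf "used=%.2fGi free=%.2fGi\n", used / 1024, pvc - used / 1024 }'
}

# bluejay-ai (about 1987Mi) + fleet-pi-edge (352Mi) + fleet-bmo-bot (141Mi)
pvc_headroom 20 1987 352 141   # prints: used=2.42Gi free=17.58Gi
```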