Three Certificates requested duration: 2160h (90d) with renewBefore: 720h
(30d). step-ca's ACME provisioner caps cert lifetime at 30d, so it silently
issued 720h certs, making renewBefore EQUAL to the actual cert lifetime.
cert-manager treats the cert as needing immediate renewal the moment it's
issued, creates a CertificateRequest, gets a new (still 30d) cert, marks it
for immediate renewal, and loops.

Damage on 2026-05-07 ~20:30 (caught during regroup after 5h gap):
- fc-worldbuilder/worldbuilder-web-tls: 2365 CRs in 18h
- fc-distribution/fc-distribution-tls: 10880 CRs in 18h
- knowledge/knowledge-tls: 10888 CRs in 18h

Total: 24,133 stale CertificateRequest objects in etcd. Bulk-deleted all CRs +
Orders in those 3 namespaces, then this commit fixes the source so ArgoCD sync
stops re-creating the loop.

Fix: match the working 720h/240h pattern used by every other FC service cert
(agent-zero, fc-dns, fc-llm-bridge, fc-php, traefik-system, etc.). 30d cert
lifetime + 10d renewal headroom = renewal at day 20, which is the cert-manager
standard 2/3-of-lifetime practice.

Side effect during loop: ALSO contributed to step-ca load and may have caused
intermittent timeouts cluster-wide (the latest stuck challenge was timing out
dialing step-ca:9443 even though step-ca itself was up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
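The renewal arithmetic in the commit message above can be checked with a few lines of shell. This is a minimal sketch, not part of the repo: renewal_hour and renew_window_ok are hypothetical names, and the sketch assumes renewal starts at the issued lifetime minus renewBefore, as the commit message describes.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not shipped in this repo.

# Print the hour (after issuance) at which renewal would begin, given the
# lifetime the CA actually issued and the configured renewBefore, in hours.
renewal_hour() {
  local issued_hours="$1" renew_before_hours="$2"
  echo $(( issued_hours - renew_before_hours ))
}

# Succeed only when renewBefore leaves a positive window before renewal.
renew_window_ok() {
  [ "$(renewal_hour "$1" "$2")" -gt 0 ]
}

# Broken pair from the incident: 720h issued, 720h renewBefore.
renew_window_ok 720 720 || echo "720h/720h: renews immediately (loop)"

# Fixed pair: 720h issued, 240h renewBefore.
renew_window_ok 720 240 && echo "720h/240h: renews at day $(( $(renewal_hour 720 240) / 24 ))"
```

The important input is the lifetime the CA actually issues (720h here), not the duration requested in the Certificate spec.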
knowledge — FlowerCore.Knowledge.Web (Phase 2.4 K8s deploy)
Status: LIVE 2026-04-27 at https://knowledge.iamworkin.lan —
Phase 2.4 closed. Pod running, certificate issued (step-ca-acme), PVC
bound (Longhorn 20Gi RWO), ArgoCD infra-knowledge synced. /healthz
returns 200, /api/v1/editions returns [] (initial-deploy state — no
*.db files in the PVC yet; Phase 2.5+ admin UI handles bulk
population). Phase 1 of the Agent Zero MCP rollout keeps /healthz
anonymous and gates /mcp behind Authorization: Bearer <token> built
from the 1Password item FlowerCore Knowledge MCP Tokens.
- Plan: ../../../FlowerCore.Notes/docs/ai-agents/flowercore-knowledge-service-plan.md
- Sprint: ../../../FlowerCore.Notes/docs/ai-station/sprint-e-xxl-plan.md (Track B)
- Repo: D:\git\FlowerCore\FlowerCore.Knowledge\ (private GitHub repo, bootstrapped Sprint D batch 35)
FlowerCore.Knowledge.Web is the fleet-wide vector-indexing & RAG hub —
a REST + MCP service that scans *.db files under
/data/vector-stores and exposes per-edition reachability + corpus
search to the rest of the FC ecosystem (Agent Zero, Chat.Web persona
memory, AiStation embeddings explorer, TtsReader chapter context, BMO
bot, Pi nodes via fc-index sync).
Phase 1 MCP routing is explicit:
- in-cluster Agent Zero → http://knowledge-web.knowledge.svc/mcp
- workstation Agent Zero → https://knowledge.iamworkin.lan/mcp
- probe URL for both lanes → /healthz
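The lane-to-URL mapping above can be captured in a small lookup so callers never hand-type the endpoints. This is only a sketch: the function names mcp_url and healthz_url and the lane labels in-cluster and workstation are assumptions for illustration; only the URLs come from the routing list.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; only the URLs are taken from the routing list.

# Map a lane name to its MCP endpoint.
mcp_url() {
  case "$1" in
    in-cluster)  echo "http://knowledge-web.knowledge.svc/mcp" ;;
    workstation) echo "https://knowledge.iamworkin.lan/mcp" ;;
    *)           echo "unknown lane: $1" >&2; return 1 ;;
  esac
}

# Derive the anonymous probe URL for a lane by swapping /mcp for /healthz.
healthz_url() {
  local url
  url=$(mcp_url "$1") || return 1
  echo "${url%/mcp}/healthz"
}
```

For example, curl -sk -o /dev/null -w "%{http_code}\n" "$(healthz_url workstation)" would probe the workstation lane without a token, while requests to /mcp still need the Authorization: Bearer header.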
Deployment order (do NOT skip / reorder)
1. FlowerCore.DNS public A record — knowledge.iamworkin.lan -> 10.0.56.200
Required BEFORE the Certificate resource is created, or cert-manager
HTTP-01 silently backs off ~2h. Memory: feedback_pfsense_dns_required_for_acme.
The canonical path is FlowerCore.DNS:
curl -sk https://dns.iamworkin.lan/api/v1/servers
# Find the pfSense serverId, then create the record using the host label only.
curl -sk -X POST https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iamworkin.lan/records \
-H "Content-Type: application/json" \
-d '{"name":"knowledge","type":"A","data":"10.0.56.200","ttl":300}'
If FlowerCore.DNS provider writes are failing 502 with "pfSense
diag_command.php response did not contain a <pre> block" (status as of
Sprint E Track B authoring 2026-04-27), add the override manually via
the pfSense web UI:
- Log in to https://10.0.56.1 as admin
- Services → DNS Resolver → General Settings → Host Overrides
- Add: Host=knowledge, Domain=iamworkin.lan, IP Address=10.0.56.200
- Save + Apply Changes
Verify resolution from anywhere on LAN:
nslookup knowledge.iamworkin.lan 10.0.56.1
# Expect: 10.0.56.200
Or against FlowerCore.DNS once the provider is fixed:
curl -sk "https://dns.iamworkin.lan/api/v1/zones/iamworkin.lan/resolve-preflight?hostname=knowledge.iamworkin.lan"
# Expect: "resolvable": true
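Because a missing record makes HTTP-01 back off for hours, it may be worth gating the later steps on the preflight result. The sketch below assumes the preflight response is JSON containing a "resolvable" boolean, as the expected output suggests; preflight_resolvable is a hypothetical name, and the grep match is deliberately crude (a jq filter would be stricter).

```shell
#!/usr/bin/env bash
# Hypothetical gate; assumes the response carries a "resolvable" boolean.

# Read a preflight JSON response on stdin; succeed only if resolvable is true.
preflight_resolvable() {
  grep -Eq '"resolvable"[[:space:]]*:[[:space:]]*true'
}

# Usage (not executed here):
#   curl -sk "https://dns.iamworkin.lan/api/v1/zones/iamworkin.lan/resolve-preflight?hostname=knowledge.iamworkin.lan" \
#     | preflight_resolvable \
#     || { echo "DNS not ready; do not create the Certificate yet" >&2; exit 1; }
```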
2. Build + import the image to ALL RKE2 nodes
Pods may schedule on any RKE2 worker (server, agent1, agent2). The
Longhorn PVC accepts mounts from any node, so the image must be
imported to all three. Memory:
feedback_rke2_image_import_targets_all_nodes +
feedback_rke2_localhost_imagepullpolicy.
# From BLUEJAY-WS, in D:\git\FlowerCore\FlowerCore.Knowledge
TAG="v$(date +%Y%m%d%H%M)"
dotnet.exe publish -c Release -o deploy/app \
src/FlowerCore.Knowledge.Web/FlowerCore.Knowledge.Web.csproj
podman build -t localhost/fc-knowledge-web:$TAG -f deploy/Dockerfile.deploy deploy
podman save localhost/fc-knowledge-web:$TAG -o /tmp/fc-knowledge-web.tar
# Import to all three RKE2 nodes
for node in rke2-server rke2-agent1 rke2-agent2; do
scp /tmp/fc-knowledge-web.tar $node:/tmp/
ssh $node "sudo /var/lib/rancher/rke2/bin/ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images import /tmp/fc-knowledge-web.tar"
done
The repo's scripts/deploy-knowledge.sh automates this loop.
3. Bump the image tag + push
Edit knowledge.yaml, replace localhost/fc-knowledge-web:v202604272200
with the tag from step 2, then:
cd D:/git/FlowerCore/bluejay-infra
python scripts/check-pfsense-dns.py # confirms the DNS preflight
git add apps/knowledge/
git commit -m "feat(knowledge): deploy Phase 2.4 K8s manifest"
git push
ArgoCD picks up within ~3 minutes and creates infra-knowledge.
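The manual edit of knowledge.yaml could also be scripted. The following is a sketch under assumptions: bump_image_tag is a hypothetical helper, and it assumes the manifest references the image as localhost/fc-knowledge-web:v followed by the 12-digit timestamp produced by the TAG= line in step 2.

```shell
#!/usr/bin/env bash
# Hypothetical helper; image name and tag shape come from steps 2 and 3.

# Rewrite the fc-knowledge-web image tag in a manifest. Tags that do not
# match v<YYYYmmddHHMM> are rejected so a typo never reaches ArgoCD.
bump_image_tag() {
  local manifest="$1" new_tag="$2" tmp
  printf '%s' "$new_tag" | grep -Eq '^v[0-9]{12}$' \
    || { echo "unexpected tag format: $new_tag" >&2; return 1; }
  tmp=$(mktemp) || return 1
  # Write to a temp file first, then copy back, to avoid a half-written manifest.
  sed -E "s#(localhost/fc-knowledge-web:)v[0-9]{12}#\\1${new_tag}#" "$manifest" > "$tmp" \
    && cat "$tmp" > "$manifest"
  local status=$?
  rm -f "$tmp"
  return "$status"
}
```

Running bump_image_tag knowledge.yaml "$TAG" before git add would replace the manual edit; review the result with git diff before committing.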
4. Verify
fcadmin_ssh noc1 '
kubectl -n argocd get application infra-knowledge
kubectl -n knowledge get certificate,pod,pvc
curl -sk -m 8 -o /dev/null -w "HTTP %{http_code}\n" https://knowledge.iamworkin.lan/healthz
curl -sk -m 8 https://knowledge.iamworkin.lan/api/v1/editions | jq
'
Expect: Certificate Ready: True within ~60s, /healthz HTTP 200,
/api/v1/editions returns an empty array (no DBs in the PVC yet) on
first deploy.
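The first-deploy expectations can be turned into a pass/fail check. This is a sketch only: check_first_deploy is a hypothetical name, and it takes the HTTP status and the editions body as arguments so it can be fed from the curl calls above.

```shell
#!/usr/bin/env bash
# Hypothetical checker for the first-deploy expectations.

# $1 = HTTP status from /healthz, $2 = response body of /api/v1/editions.
check_first_deploy() {
  local http_code="$1" editions
  # Drop all whitespace so "[ ]" and "[]" followed by a newline both equal "[]".
  editions=$(printf '%s' "$2" | tr -d '[:space:]')
  if [ "$http_code" != "200" ]; then
    echo "healthz returned HTTP $http_code, expected 200" >&2
    return 1
  fi
  if [ "$editions" != "[]" ]; then
    echo "editions is not an empty array: $editions" >&2
    return 1
  fi
  echo "first deploy looks healthy"
}

# Usage (not executed here):
#   code=$(curl -sk -m 8 -o /dev/null -w "%{http_code}" https://knowledge.iamworkin.lan/healthz)
#   body=$(curl -sk -m 8 https://knowledge.iamworkin.lan/api/v1/editions)
#   check_first_deploy "$code" "$body"
```

The empty-array check applies to the first deploy only; once DB files are in the PVC, a non-empty editions list is the expected result.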
Initial-deploy state and Phase 2.5 follow-up
The Longhorn PVC is empty on first deploy. Knowledge.Web's filesystem
catalog will report zero editions until vector-store *.db files are
pushed into /data/vector-stores. Initial population is a follow-up
step (Phase 2.5+, Blazor admin UI's "Rebuild" button); for the first
deploy the goal is just to prove the pod boots, /healthz returns 200,
and the Traefik IngressRoute serves the Scalar UI.
To copy an existing local DB into the PVC (one-time, manual until Phase 2.5 admin UI lands):
fcadmin_ssh noc1 '
POD=$(kubectl -n knowledge get pod -l app=knowledge-web -o jsonpath="{.items[0].metadata.name}")
kubectl -n knowledge cp /var/lib/flowercore/vector-stores/bluejay-ai.db $POD:/data/vector-stores/bluejay-ai.db
'
Probes + middleware notes
- /healthz is mapped by Controllers/HealthController.cs (controller-based attribute route). Cheap: no DB, no dependencies.
- Liveness uses tcpSocket as a defensive fallback in case future middleware accidentally gates /healthz behind auth (memory: feedback_k8s_probes_behind_auth_middleware).
- /openapi/v1.json and /scalar/v1 are wired by UseFlowerCoreApi. Per memory feedback_k8s_probes_must_not_hit_openapi, probes must NOT point at OpenAPI documents, because the MapOpenApi call can be slow during cold startup.
Resource sizing
- 256Mi memory request / 1Gi limit.
- 100m CPU request / 1000m limit.
- 20Gi Longhorn PVC initial — sufficient for the bluejay-ai 1.94Gi DB +
fleet-pi-edge 352Mi + fleet-bmo-bot 141Mi + headroom. Resize via
kubectl -n knowledge edit pvc knowledge-vector-store if growing past 15Gi.
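The headroom claim can be sanity-checked with integer arithmetic. In this sketch the helper names total_mib and needs_resize are hypothetical, sizes are whole MiB, and 1.94Gi is rounded up to 1987Mi.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; sizes are whole MiB (1.94Gi rounded up to 1987Mi).

# Sum a list of sizes given in MiB.
total_mib() {
  local sum=0 size
  for size in "$@"; do
    sum=$(( sum + size ))
  done
  echo "$sum"
}

# Succeed when usage in MiB has reached the 15Gi resize threshold.
needs_resize() {
  [ "$1" -ge $(( 15 * 1024 )) ]
}

# bluejay-ai + fleet-pi-edge + fleet-bmo-bot
used=$(total_mib 1987 352 141)
echo "current DBs: ${used}Mi of $(( 20 * 1024 ))Mi"
if needs_resize "$used"; then
  echo "resize the PVC"
else
  echo "no resize needed yet"
fi
```

With the three listed DBs the total is 2480Mi, well under the 15360Mi threshold, which matches the note that 20Gi is sufficient for now.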