bluejay-infra/apps/worldbuilder
Codex 0f1dc5f871 fix(certs): kill cert-manager renewal loop on 3 broken Certificate specs
Three Certificates requested duration: 2160h (90d) with renewBefore: 720h
(30d). step-ca's ACME provisioner caps cert lifetime at 30d, so it silently
issued 720h certs — making renewBefore EQUAL to the actual cert lifetime.
cert-manager treats the cert as needing immediate renewal the moment it's
issued, creates a CertificateRequest, gets a new (still 30d) cert, marks
it for immediate renewal, and loops.

Damage on 2026-05-07 ~20:30 (caught during regroup after 5h gap):
  - fc-worldbuilder/worldbuilder-web-tls:  2365 CRs in 18h
  - fc-distribution/fc-distribution-tls:  10880 CRs in 18h
  - knowledge/knowledge-tls:              10888 CRs in 18h
  Total: 24,133 stale CertificateRequest objects in etcd.

Bulk-deleted all CRs + Orders in those 3 namespaces, then this commit
fixes the source so ArgoCD sync stops re-creating the loop.

Fix: match the working 720h/240h pattern used by every other FC service
cert (agent-zero, fc-dns, fc-llm-bridge, fc-php, traefik-system, etc.).
30d cert lifetime + 10d renewal headroom = renewal at day 20, which is
the cert-manager standard 2/3-of-lifetime practice.

Side effect during loop: ALSO contributed to step-ca load and may have
caused intermittent timeouts cluster-wide (the latest stuck challenge
was timing out dialing step-ca:9443 even though step-ca itself was up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:32:00 -05:00
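
The renewal arithmetic in the commit message above can be checked with a small shell sketch. It assumes, as the message describes, that the issuer hands out the smaller of the requested duration and its own cap, and that cert-manager schedules renewal at the certificate's expiry minus renewBefore. The function names are hypothetical and exist only for illustration.

```shell
# Sketch only: models the timing described in the commit message.
# Function names are illustrative, not part of cert-manager or step-ca.

# issued_lifetime_hours REQUESTED CAP
# The issuer grants the smaller of the requested duration and its cap.
issued_lifetime_hours() {
  if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# renew_after_hours ISSUED RENEW_BEFORE
# Hours after issuance at which renewal becomes due (never negative).
renew_after_hours() {
  due=$(( $1 - $2 ))
  if [ "$due" -lt 0 ]; then due=0; fi
  echo "$due"
}

# Broken spec: 2160h requested, capped to 720h, renewBefore 720h.
broken=$(renew_after_hours "$(issued_lifetime_hours 2160 720)" 720)
# Fixed spec: 720h requested, renewBefore 240h.
fixed=$(renew_after_hours "$(issued_lifetime_hours 720 720)" 240)

echo "broken spec: renewal due ${broken}h after issuance"
echo "fixed spec:  renewal due ${fixed}h after issuance (day $(( fixed / 24 )))"
```

With the broken values the renewal is due zero hours after issuance, which is the loop condition. With 720h/240h it is due at 480h, i.e. day 20.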

FlowerCore.WorldBuilder

ArgoCD-managed manifest for FlowerCore.WorldBuilder.Web — comic / storyboard authoring service that drives ComfyUI for panel image generation and QuestPDF for letter / A4 export.

Source: D:\git\FlowerCore\FlowerCore.WorldBuilder (master)

Deployment order

  1. DNS preflight: worldbuilder.iamworkin.lan -> 10.0.56.200 MUST exist in pfSense Unbound before this manifest is applied, or cert-manager HTTP-01 silently goes into exponential backoff (~2h). Memory: feedback_pfsense_dns_required_for_acme.
  2. Image import to ALL RKE2 nodes — pod can schedule to any of rke2-server (10.0.56.11), rke2-agent1 (10.0.56.12), rke2-agent2 (10.0.56.13). Build with:
    bash deploy/build.sh   # in FlowerCore.WorldBuilder repo
    podman save localhost/fc-worldbuilder:v<TAG> -o /tmp/fc-worldbuilder-v<TAG>.tar
    for h in 10.0.56.11 10.0.56.12 10.0.56.13; do
      scp /tmp/fc-worldbuilder-v<TAG>.tar fcadmin@$h:/tmp/
      ssh fcadmin@$h \
        "sudo /var/lib/rancher/rke2/bin/ctr -a /run/k3s/containerd/containerd.sock \
          -n k8s.io images import /tmp/fc-worldbuilder-v<TAG>.tar"
    done
    
    Memory: feedback_rke2_image_import_per_node_scp.
  3. Bump the image tag in worldbuilder.yaml and git push. The ArgoCD ApplicationSet picks up the change within ~3 minutes.
  4. First production render — open https://worldbuilder.iamworkin.lan, create World → Character → Storyboard → ExportJob, confirm artifact downloads. ComfyUI lives on BLUEJAY-WS at http://10.0.56.20:8188.
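
The DNS preflight in step 1 could be scripted along the following lines. This is a sketch, not part of the repo: dns_preflight, resolve_v4 and the RESOLVE_CMD override are hypothetical names, and the check only reflects what the machine running it resolves, so it should run on a host that uses pfSense Unbound as its resolver.

```shell
# Sketch only: dns_preflight, resolve_v4 and RESOLVE_CMD are hypothetical names.

# resolve_v4 HOST
# Print the first IPv4 address returned by the system resolver (glibc getent).
resolve_v4() {
  getent ahostsv4 "$1" | awk 'NR == 1 { print $1 }'
}

# dns_preflight HOST EXPECTED_IP
# Returns 0 only if HOST resolves to EXPECTED_IP. The resolver command can be
# swapped via RESOLVE_CMD, which keeps the check testable offline.
dns_preflight() {
  host=$1
  expected=$2
  actual=$("${RESOLVE_CMD:-resolve_v4}" "$host")
  if [ "$actual" = "$expected" ]; then
    echo "ok: $host -> $actual"
  else
    echo "FAIL: $host -> ${actual:-no answer} (expected $expected)" >&2
    return 1
  fi
}

# Example, before pushing the manifest:
#   dns_preflight worldbuilder.iamworkin.lan 10.0.56.200 || exit 1
```

Failing fast here is cheaper than waiting out the HTTP-01 backoff described in step 1.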

Health probes

  • startupProbe + readinessProbe: httpGet /healthz (registered explicitly in Program.cs — anonymous, no DB or OpenAPI dependency).
  • livenessProbe: tcpSocket as a cheap fallback. Memory: feedback_k8s_probes_must_not_hit_openapi, feedback_k8s_probes_behind_auth_middleware.
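
To reason about the httpGet probes by hand, the sketch below applies the rule Kubernetes uses for HTTP probes: a status of 200 or above and below 400 counts as success. The helper names are hypothetical, and check_healthz assumes the machine running it trusts the cluster's TLS certificate chain.

```shell
# Sketch only: helper names are hypothetical.

# http_probe_passes STATUS
# Kubernetes treats an httpGet probe as successful for 200 <= STATUS < 400.
http_probe_passes() {
  [ "$1" -ge 200 ] && [ "$1" -lt 400 ]
}

# check_healthz BASE_URL
# Fetch BASE_URL/healthz and apply the same pass/fail rule.
# curl prints 000 as the status when the request itself fails.
check_healthz() {
  status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1/healthz")
  http_probe_passes "${status:-0}"
}

# Example:
#   check_healthz https://worldbuilder.iamworkin.lan && echo "healthz ok"
```

A probe path that sits behind auth middleware would typically answer 401, which fails this rule. That is one reason /healthz is registered as an anonymous endpoint.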

Storage

  • Longhorn RWO PVC worldbuilder-data (5Gi) mounted at /data. SQLite DB lives at /data/worldbuilder.db, generated images under /data/gallery/, PDF/PNG exports under /data/exports/.
  • DataProtection keys persist to the same SQLite via AddFlowerCoreDataProtection<WorldBuilderDbContext> — explicit migration 20260429133417_Initial already creates fc_dp_keys. Memory: feedback_dataprotection_keys_persist_to_app_dbcontext, feedback_intranet_dataprotection_table_must_have_explicit_migration.
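
The layout under /data described above can be verified with a short check, for example from a debug shell that has the volume mounted. This is a sketch: check_data_layout is a hypothetical helper, and it assumes gallery/ and exports/ already exist, which may not hold before the first render or export.

```shell
# Sketch only: check_data_layout is a hypothetical helper.

# check_data_layout ROOT
# Returns 0 if ROOT contains worldbuilder.db, gallery/ and exports/.
check_data_layout() {
  root=$1
  missing=0
  if [ ! -f "$root/worldbuilder.db" ]; then
    echo "missing file: $root/worldbuilder.db" >&2
    missing=1
  fi
  for d in gallery exports; do
    if [ ! -d "$root/$d" ]; then
      echo "missing dir: $root/$d" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   check_data_layout /data && echo "layout ok"
```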

Image generation backend

FlowerCore:WorldBuilder:ImageGeneration:BaseUrl=http://10.0.56.20:8188. ComfyUI runs on the BLUEJAY-WS Windows workstation (R9700 / gfx1201 / ROCm 7.2.1). The pod reaches the workstation directly across the 10.0.56.0/24 VLAN. There are no Podman-style host-filter issues, because K8s pods route via Calico, which is L3-routed across the VLAN.
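
If the setting above is supplied to the pod through an environment variable (an assumption, since this README does not say how the value is injected), ASP.NET Core's environment-variable configuration provider expects each `:` in the key to be written as `__`. The hypothetical helper below sketches that mapping.

```shell
# Sketch only: config_key_to_env is a hypothetical helper.

# config_key_to_env KEY
# Map an ASP.NET Core configuration key to its environment-variable form by
# replacing every ':' with a double underscore.
config_key_to_env() {
  printf '%s\n' "$1" | sed 's/:/__/g'
}

name=$(config_key_to_env 'FlowerCore:WorldBuilder:ImageGeneration:BaseUrl')
echo "$name=http://10.0.56.20:8188"
```

The printed line is the name/value pair that an env entry in the manifest would carry under this assumption.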