fix(certs): kill cert-manager renewal loop on 3 broken Certificate specs

Three Certificates requested duration: 2160h (90d) with renewBefore: 720h
(30d). step-ca's ACME provisioner caps cert lifetime at 30d, so it silently
issued 720h certs, making renewBefore EQUAL to the actual cert lifetime.
cert-manager then treats each cert as due for renewal the moment it is
issued: it creates a CertificateRequest, gets a new (still 30d) cert,
marks that one for immediate renewal, and loops.
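
A rough model of the loop condition, for illustration only. It assumes
renewal is scheduled at notAfter minus renewBefore; cert-manager's real
scheduling has more cases, and renewal_time is a hypothetical helper, not
a cert-manager API:

```python
from datetime import datetime, timedelta, timezone

STEP_CA_CAP = timedelta(hours=720)  # 30d lifetime cap described above


def renewal_time(not_after, renew_before):
    """Simplified assumption: renew at not_after - renew_before."""
    return not_after - renew_before


issued = datetime(2026, 5, 7, tzinfo=timezone.utc)  # illustrative issue time
requested = timedelta(hours=2160)                   # 90d, as in the broken specs
actual = min(requested, STEP_CA_CAP)                # CA silently caps to 30d
not_after = issued + actual

renews_at = renewal_time(not_after, timedelta(hours=720))
# Renewal is due at the instant of issuance, so every new cert is
# immediately "due" again.
print(renews_at <= issued)
```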

Damage on 2026-05-07 ~20:30 (caught during regroup after 5h gap):
  - fc-worldbuilder/worldbuilder-web-tls:  2365 CRs in 18h
  - fc-distribution/fc-distribution-tls:  10880 CRs in 18h
  - knowledge/knowledge-tls:              10888 CRs in 18h
  Total: 24,133 stale CertificateRequest objects in etcd.

All CRs and Orders in those 3 namespaces were bulk-deleted; this commit
fixes the source manifests so ArgoCD sync stops re-creating the loop.
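
The exact cleanup commands are not recorded in this message. As a sketch
only, assuming the standard cert-manager resource names, the equivalent
kubectl invocations could be generated like this (cleanup_commands is a
hypothetical helper; it only prints the commands and runs nothing):

```python
import shlex

# Namespaces taken from the damage list above.
NAMESPACES = ["fc-worldbuilder", "fc-distribution", "knowledge"]
# Assumed fully-qualified resource names for CertificateRequests and Orders.
RESOURCES = [
    "certificaterequests.cert-manager.io",
    "orders.acme.cert-manager.io",
]


def cleanup_commands(namespaces=NAMESPACES, resources=RESOURCES):
    """Build one `kubectl delete ... --all` argv per namespace/resource pair."""
    return [
        ["kubectl", "delete", resource, "--all", "-n", namespace]
        for namespace in namespaces
        for resource in resources
    ]


for argv in cleanup_commands():
    print(shlex.join(argv))
```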

Fix: match the working 720h/240h pattern used by every other FC service
cert (agent-zero, fc-dns, fc-llm-bridge, fc-php, traefik-system, etc.).
30d cert lifetime minus 10d renewal headroom = renewal at day 20, which
matches cert-manager's usual practice of renewing at 2/3 of the lifetime.
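
The arithmetic behind the new values can be checked with a small sketch
(renewal_point is a hypothetical helper, and the model simply subtracts
renewBefore from duration):

```python
from fractions import Fraction

HOURS_PER_DAY = 24


def renewal_point(duration_h, renew_before_h):
    """Return (day on which renewal starts, fraction of lifetime elapsed)."""
    renew_at_h = duration_h - renew_before_h
    return Fraction(renew_at_h, HOURS_PER_DAY), Fraction(renew_at_h, duration_h)


day, fraction = renewal_point(720, 240)  # fixed spec: 30d cert, 10d headroom
print(day, fraction)  # 20 2/3
```

With the old effective values (720h lifetime, 720h renewBefore) the same
function returns day 0, which is the loop this commit removes.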

Side effect: while it ran, the loop also added load on step-ca and may
have caused intermittent timeouts cluster-wide (the latest stuck challenge
was timing out dialing step-ca:9443 even though step-ca itself was up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex
2026-05-07 15:32:00 -05:00
parent 11c5f6e6cc
commit 0f1dc5f871
3 changed files with 19 additions and 6 deletions

@@ -266,8 +266,12 @@ spec:
     kind: ClusterIssuer
   dnsNames:
     - dist.iamworkin.lan
-  duration: 2160h # 90d
-  renewBefore: 720h # 30d
+  # step-ca ACME caps lifetime at 30d; requesting 90d silently capped
+  # made renewBefore=cert-lifetime → perpetual renewal loop (10880+ CRs
+  # in 18h on 2026-05-07). Match working 720h/240h pattern from other
+  # FC services.
+  duration: 720h # 30d (step-ca cap)
+  renewBefore: 240h # 10d
 ---
 apiVersion: traefik.io/v1alpha1
 kind: IngressRoute

@@ -241,8 +241,12 @@ spec:
     kind: ClusterIssuer
   dnsNames:
     - knowledge.iamworkin.lan
-  duration: 2160h # 90d
-  renewBefore: 720h # 30d
+  # step-ca ACME caps lifetime at 30d; requesting 90d silently capped
+  # made renewBefore=cert-lifetime → perpetual renewal loop (10888+ CRs
+  # in 18h on 2026-05-07). Match working 720h/240h pattern from other
+  # FC services.
+  duration: 720h # 30d (step-ca cap)
+  renewBefore: 240h # 10d
 ---
 apiVersion: traefik.io/v1alpha1
 kind: IngressRoute

@@ -187,8 +187,13 @@ spec:
     kind: ClusterIssuer
   dnsNames:
     - worldbuilder.iamworkin.lan
-  duration: 2160h # 90d
-  renewBefore: 720h # 30d
+  # step-ca ACME provisioner caps lifetime at 30d. Requesting 90d
+  # silently capped to 30d, making renewBefore 720h (30d) equal to the
+  # actual cert lifetime — triggered a perpetual renewal loop that
+  # generated 2365+ CertificateRequest objects in 18h. Match the working
+  # 720h/240h pattern used by every other FC service cert.
+  duration: 720h # 30d (step-ca cap)
+  renewBefore: 240h # 10d
 ---
 apiVersion: traefik.io/v1alpha1
 kind: IngressRoute