From 0f1dc5f871c3984ef45ceb7870c98a05c07809f2 Mon Sep 17 00:00:00 2001
From: Codex
Date: Thu, 7 May 2026 15:32:00 -0500
Subject: [PATCH] fix(certs): kill cert-manager renewal loop on 3 broken Certificate specs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three Certificates requested duration: 2160h (90d) with renewBefore:
720h (30d). step-ca's ACME provisioner caps cert lifetime at 30d, so it
silently issued 720h certs, making renewBefore equal to the actual cert
lifetime. cert-manager treats the cert as needing immediate renewal the
moment it's issued, creates a CertificateRequest, gets a new (still 30d)
cert, marks it for immediate renewal, and loops.

Damage on 2026-05-07 ~20:30 (caught during regroup after 5h gap):
- fc-worldbuilder/worldbuilder-web-tls: 2365 CRs in 18h
- fc-distribution/fc-distribution-tls: 10880 CRs in 18h
- knowledge/knowledge-tls: 10888 CRs in 18h

Total: 24,133 stale CertificateRequest objects in etcd. Bulk-deleted
all CRs + Orders in those 3 namespaces, then this commit fixes the
source so ArgoCD sync stops re-creating the loop.

Fix: match the working 720h/240h pattern used by every other FC service
cert (agent-zero, fc-dns, fc-llm-bridge, fc-php, traefik-system, etc.).
30d cert lifetime + 10d renewal headroom = renewal at day 20, which
matches cert-manager's default of renewing at 2/3 of the lifetime.

Side effect during the loop: it also added to step-ca load and may have
caused intermittent timeouts cluster-wide (the latest stuck challenge
was timing out dialing step-ca:9443 even though step-ca itself was up).

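The renewal arithmetic above can be sketched as follows. This is a
simplified model that assumes renewal is scheduled at
notAfter - renewBefore. The helper name renewal_time and the sample
issuance timestamp are hypothetical, and the controller's real logic
has guards that are not modelled here.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before, lifetime_h, renew_before_h):
    """Simplified model (assumption): renewal is due at notAfter - renewBefore."""
    not_after = not_before + timedelta(hours=lifetime_h)
    return not_after - timedelta(hours=renew_before_h)


# Hypothetical issuance time, for illustration only.
issued = datetime(2026, 5, 7, 0, 0, tzinfo=timezone.utc)

# Broken specs: 2160h requested, step-ca issues 720h, renewBefore is 720h.
broken = renewal_time(issued, lifetime_h=720, renew_before_h=720)
# Fixed specs: 720h lifetime with renewBefore 240h.
fixed = renewal_time(issued, lifetime_h=720, renew_before_h=240)

print(broken - issued)  # 0:00:00 -> due for renewal the moment it is issued
print(fixed - issued)   # 20 days, 0:00:00 -> renewal at day 20 of 30
```

With the old values the renewal point coincides with issuance, which is
the loop condition described above. With 720h/240h it lands at day 20.
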
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 apps/fc-distribution/fc-distribution.yaml | 8 ++++++--
 apps/knowledge/knowledge.yaml             | 8 ++++++--
 apps/worldbuilder/worldbuilder.yaml       | 9 +++++++--
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/apps/fc-distribution/fc-distribution.yaml b/apps/fc-distribution/fc-distribution.yaml
index baa40de..d331bd8 100644
--- a/apps/fc-distribution/fc-distribution.yaml
+++ b/apps/fc-distribution/fc-distribution.yaml
@@ -266,8 +266,12 @@ spec:
     kind: ClusterIssuer
   dnsNames:
     - dist.iamworkin.lan
-  duration: 2160h # 90d
-  renewBefore: 720h # 30d
+  # step-ca ACME caps lifetime at 30d; requesting 90d silently capped
+  # made renewBefore=cert-lifetime → perpetual renewal loop (10880+ CRs
+  # in 18h on 2026-05-07). Match working 720h/240h pattern from other
+  # FC services.
+  duration: 720h # 30d (step-ca cap)
+  renewBefore: 240h # 10d
 ---
 apiVersion: traefik.io/v1alpha1
 kind: IngressRoute
diff --git a/apps/knowledge/knowledge.yaml b/apps/knowledge/knowledge.yaml
index 35497d9..797997d 100644
--- a/apps/knowledge/knowledge.yaml
+++ b/apps/knowledge/knowledge.yaml
@@ -241,8 +241,12 @@ spec:
     kind: ClusterIssuer
   dnsNames:
     - knowledge.iamworkin.lan
-  duration: 2160h # 90d
-  renewBefore: 720h # 30d
+  # step-ca ACME caps lifetime at 30d; requesting 90d silently capped
+  # made renewBefore=cert-lifetime → perpetual renewal loop (10888+ CRs
+  # in 18h on 2026-05-07). Match working 720h/240h pattern from other
+  # FC services.
+  duration: 720h # 30d (step-ca cap)
+  renewBefore: 240h # 10d
 ---
 apiVersion: traefik.io/v1alpha1
 kind: IngressRoute
diff --git a/apps/worldbuilder/worldbuilder.yaml b/apps/worldbuilder/worldbuilder.yaml
index 5c35f39..17edc48 100644
--- a/apps/worldbuilder/worldbuilder.yaml
+++ b/apps/worldbuilder/worldbuilder.yaml
@@ -187,8 +187,13 @@ spec:
     kind: ClusterIssuer
   dnsNames:
     - worldbuilder.iamworkin.lan
-  duration: 2160h # 90d
-  renewBefore: 720h # 30d
+  # step-ca ACME provisioner caps lifetime at 30d. Requesting 90d
+  # silently capped to 30d, making renewBefore 720h (30d) equal to the
+  # actual cert lifetime — triggered a perpetual renewal loop that
+  # generated 2365+ CertificateRequest objects in 18h. Match the working
+  # 720h/240h pattern used by every other FC service cert.
+  duration: 720h # 30d (step-ca cap)
+  renewBefore: 240h # 10d
 ---
 apiVersion: traefik.io/v1alpha1
 kind: IngressRoute