feat(infra): route dns preflight through flowercore dns

Andrew Stoltz
2026-04-23 17:03:22 -05:00
parent f9593e494a
commit 407d473b71
4 changed files with 256 additions and 66 deletions

@@ -6,28 +6,33 @@ Infrastructure manifests for ArgoCD. An `ApplicationSet` in `argocd` namespace w
Follow these steps in order. **Step 1 must run before step 3.** If you skip it, cert-manager HTTP-01 will silently fail for ~2h per cert (exponential backoff) until someone diagnoses the missing DNS record.
### 1. Add the pfSense Unbound DNS override (REQUIRED)
### 1. Create or verify the FlowerCore.DNS A record (REQUIRED for current HTTP-01 manifests)
step-ca (the ACME CA on noc1) runs in a Podman container with host networking. Its container resolver uses pfSense Unbound (10.0.56.1), **not** cluster CoreDNS. So even though CoreDNS has a wildcard `*.iamworkin.lan → 10.0.56.200` for in-cluster lookups, step-ca cannot see it. Every new public hostname needs an explicit pfSense host override.
From `FlowerCore.Notes`:
The management path is now `FlowerCore.DNS`, not `FlowerCore.Notes/scripts/pfsense-add-dns-overrides.py`. Add or verify the public A record there before you apply the manifest:
```bash
# 1. Edit the HOSTS list in scripts/pfsense-add-dns-overrides.py
# Add: ("<yourservice>", "10.0.56.200", "cert-manager HTTP-01 target (Traefik VIP)")
# 2. Run:
source scripts/credential-helper.sh
export PFSENSE_PASS=$(get_cred "pfSense Admin")
python scripts/pfsense-add-dns-overrides.py
curl -sk https://dns.iamworkin.lan/api/v1/servers
# Find the pfSense serverId, then create the record using the host label only.
# Example: for foo.iamworkin.lan, use "name":"foo".
curl -sk -X POST https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iamworkin.lan/records \
-H "Content-Type: application/json" \
-d '{"name":"<yourservice>","type":"A","data":"10.0.56.200","ttl":300}'
```
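The host-label rule is the easiest part of that call to get wrong when scripting it. A minimal Python sketch, assuming the record body has the same shape as the curl example above; `record_payload` is a hypothetical helper for illustration and is not part of FlowerCore.DNS:

```python
import json

ZONE = "iamworkin.lan"
TRAEFIK_VIP = "10.0.56.200"  # value taken from the curl example


def record_payload(fqdn, zone=ZONE, address=TRAEFIK_VIP, ttl=300):
    """Build the JSON body for an A record, using the host label only.

    Hypothetical helper: the field names mirror the curl example, nothing more.
    """
    fqdn = fqdn.rstrip(".").lower()
    suffix = "." + zone
    if not fqdn.endswith(suffix):
        raise ValueError(f"{fqdn!r} is not inside zone {zone!r}")
    label = fqdn[: -len(suffix)]  # "foo.iamworkin.lan" -> "foo"
    return {"name": label, "type": "A", "data": address, "ttl": ttl}


if __name__ == "__main__":
    # Print the body so it can be pasted after `curl ... -d`.
    print(json.dumps(record_payload("foo.iamworkin.lan")))
```

Rejecting names outside the zone up front avoids posting a full hostname as the `name` field, which is the mistake the comment in the curl example warns about.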
Verify that all referenced iamworkin.lan hosts resolve (run from anywhere on the LAN):
```bash
python scripts/check-pfsense-dns.py
# Parses every apps/*/*.yaml, extracts hostnames from Certificate dnsNames
# and Traefik IngressRoute Host(...) rules, and fails if any don't resolve.
# Safe to run as a pre-merge / pre-sync check.
# Historical filename retained. The script now calls
# https://dns.iamworkin.lan/api/v1/zones/iamworkin.lan/resolve-preflight
# for every Certificate dnsName and Traefik Host(...) rule it finds.
python scripts/check-pfsense-dns.py --live
# Optional stronger pass when kubectl access is available; also checks
# live-cluster Certificates and IngressRoutes for drift outside manifests.
```
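For readers who want to see what such a preflight involves, here is a rough Python sketch of the extraction step. It is not the code in `scripts/check-pfsense-dns.py`: `extract_hostnames` and `unresolved` are hypothetical names, the parsing is line-based instead of a real YAML parse, and the lookup uses the local resolver because the request format of the `resolve-preflight` endpoint is not shown here. A local lookup can differ from what step-ca sees, so treat it as an approximation.

```python
import re
import socket

# Traefik rule syntax: Host(`a.example`) or Host(`a.example`, `b.example`)
HOST_RULE = re.compile(r"Host\(([^)]*)\)")
BACKTICKED = re.compile(r"`([^`]+)`")


def extract_hostnames(manifest_text):
    """Collect names from Host(...) rules and block-style dnsNames lists."""
    hosts = set()
    in_dns_names = False
    for line in manifest_text.splitlines():
        stripped = line.strip()
        for rule in HOST_RULE.finditer(line):
            hosts.update(BACKTICKED.findall(rule.group(1)))
        if stripped.startswith("dnsNames:"):
            in_dns_names = True  # inline lists ([a, b]) are not handled
            continue
        if in_dns_names:
            if stripped.startswith("- "):
                hosts.add(stripped[2:].strip().strip("'\""))
            elif stripped:
                in_dns_names = False  # first non-item line ends the list
    return {h.lower() for h in hosts}


def unresolved(hosts, resolve=socket.gethostbyname):
    """Return the sorted hostnames that the given resolver cannot resolve."""
    missing = []
    for host in sorted(hosts):
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(host)
    return missing
```

Passing the resolver in as a parameter keeps the sketch testable offline and leaves room to swap in a call to the FlowerCore.DNS API once its request shape is known.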
**Symptom if you skip this:** the Certificate resource stays `Ready: False` with `status.reason: unexpected non-ACME API error: context deadline exceeded`. To recover, add the DNS record first, then run `kubectl -n <ns> delete order <order-name>` to bypass cert-manager's backoff.
@@ -43,7 +48,7 @@ Conventions:
- Image: `localhost/<name>:v<YYYYMMDD><HHMM>`, `imagePullPolicy: Never`. Import the image to every RKE2 node (server + both agents) via `ctr images import` before applying — pods schedule anywhere.
- If the app persists local state (SQLite, uploads), declare the `PersistentVolumeClaim` here with `storageClassName: longhorn` and `accessModes: [ReadWriteOnce]`. Add `strategy.type: Recreate` to the Deployment, because a ReadWriteOnce PVC blocks rolling updates.
- Probes: use `tcpSocket` if the app has middleware that intercepts unauthenticated requests (for example, returning 404/401 for `/health`). Otherwise prefer `httpGet` against a path the app actually exposes, after verifying that the path isn't gated by auth.
- Certificate: `issuerRef.name: step-ca-acme`, `issuerRef.kind: ClusterIssuer`. `dnsNames` must match the hostname you added to pfSense in step 1.
- Certificate: `issuerRef.name: step-ca-acme`, `issuerRef.kind: ClusterIssuer`. `dnsNames` must match the hostname you created in FlowerCore.DNS in step 1.
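The image tag convention in the list above can be generated instead of typed by hand. A small Python sketch; `image_ref` is a hypothetical helper name, and whether the timestamp should be local or UTC is not stated in the conventions, so the caller decides:

```python
from datetime import datetime


def image_ref(name, built_at):
    """Format localhost/<name>:v<YYYYMMDD><HHMM> for a given build time."""
    return f"localhost/{name}:v{built_at:%Y%m%d%H%M}"


if __name__ == "__main__":
    # "myservice" is a placeholder name for illustration.
    print(image_ref("myservice", datetime.now()))
```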
### 3. Commit & push
@@ -76,13 +81,18 @@ Before `git push`, always run:
python scripts/check-pfsense-dns.py
```
It's a ~3-second check that would have caught the entire 2026-04-22 cert-manager outage. Consider wiring it into a pre-commit hook or a Gitea Actions workflow.
It's a quick service-backed check that would have caught the entire 2026-04-22 cert-manager outage. Consider wiring it into a pre-commit hook or a Gitea Actions workflow.
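One way to wire the check into a Git hook is sketched below. It assumes the hook runs from the repository root and that `scripts/check-pfsense-dns.py` exits non-zero on failure; neither is confirmed here, and `run_check` is a hypothetical name. Saved as an executable `.git/hooks/pre-commit` (or `pre-push`), it blocks the operation when the preflight fails:

```python
#!/usr/bin/env python3
import subprocess
import sys

# Assumed path, relative to the repository root.
CHECK_CMD = [sys.executable, "scripts/check-pfsense-dns.py"]


def run_check(cmd=CHECK_CMD):
    """Run the preflight command and return its exit code."""
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    code = run_check()
    if code != 0:
        print(
            "DNS preflight failed; fix the FlowerCore.DNS records first.",
            file=sys.stderr,
        )
    sys.exit(code)
```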
## Retiring a service
1. `kubectl -n argocd delete application infra-<name>` (cascade deletes the K8s resources via ArgoCD finalizers)
2. `git rm -r apps/<name>/` and push
3. Remove the pfSense Unbound override — edit `scripts/pfsense-add-dns-overrides.py` to remove from HOSTS, or delete manually via the pfSense UI (Services → DNS Resolver → Host Overrides)
3. Remove the FlowerCore.DNS record through the UI or API, for example:
```bash
curl -sk https://dns.iamworkin.lan/api/v1/servers
curl -sk -X DELETE https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iamworkin.lan/records/<yourservice>
```
## Known gotchas
@@ -95,5 +105,6 @@ It's a ~3-second check that would have caught the entire 2026-04-22 cert-manager
- Cert-manager recovery playbook: `FlowerCore.Notes/memory/project_cert_manager_recovery_2026_04_22.md`
- Why pfSense DNS is required: `FlowerCore.Notes/memory/feedback_pfsense_dns_required_for_acme.md`
- Public DNS operator host: `https://dns.iamworkin.lan`
- Canonical credential helper: `FlowerCore.Notes/scripts/credential-helper.sh`
- pfSense admin automation: `FlowerCore.Notes/memory/feedback_pfsense_automation.md`