Add source-controlled Puppet/Hiera contracts for edge2 Divoom-as-DM-device without replacing the live flowercore-divoom systemd deployment. Add Divoom TV Pi HDMI systemd/Puppet deployment artifacts, LF shell-script guardrails, and focused lint coverage for the additive non-K8s deploy shape. Co-Authored-By: Codex <codex@openai.com>
8.6 KiB
bluejay-infra
Infrastructure manifests for ArgoCD. An ApplicationSet in argocd namespace watches the apps/* directories in this repo and creates one Application per subdir (prefixed infra-<name>).
Adding a new service to the cluster
Follow these steps in order. Step 1 must run before step 3 — if you skip it, cert-manager HTTP-01 will silently fail for ~2h per cert (exponential backoff) until someone diagnoses the DNS.
1. Create or verify the FlowerCore.DNS A record (REQUIRED for current HTTP-01 manifests)
step-ca (the ACME CA on noc1) runs in a Podman container with host networking. Its container resolver uses pfSense Unbound (10.0.56.1), not cluster CoreDNS. So even though CoreDNS has a wildcard *.iamworkin.lan → 10.0.56.200 for in-cluster lookups, step-ca cannot see it. Every new public hostname needs an explicit pfSense host override.
The management path is now FlowerCore.DNS, not FlowerCore.Notes/scripts/pfsense-add-dns-overrides.py. Add or verify the public A record there before you apply the manifest:
curl -sk https://dns.iamworkin.lan/api/v1/servers
# Find the pfSense serverId, then create the record using the host label only.
# Example: for foo.iamworkin.lan, use "name":"foo".
curl -sk -X POST https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iamworkin.lan/records \
-H "Content-Type: application/json" \
-d '{"name":"<yourservice>","type":"A","data":"10.0.56.200","ttl":300}'
Verify all referenced iamworkin.lan hosts resolve (run from anywhere on LAN):
python scripts/check-pfsense-dns.py
# Historical filename retained. The script now calls
# https://dns.iamworkin.lan/api/v1/zones/iamworkin.lan/resolve-preflight
# for every Certificate dnsName and Traefik Host(...) rule it finds.
python scripts/check-pfsense-dns.py --live
# Optional stronger pass when kubectl access is available; also checks
# live-cluster Certificates and IngressRoutes for drift outside manifests.
Symptom if you skip this: the Certificate resource stays Ready: False with status.reason: unexpected non-ACME API error: context deadline exceeded. Recovery requires kubectl -n <ns> delete order <order-name> after adding the DNS to bypass cert-manager's backoff.
2. Create the app manifest
Create apps/<name>/<name>.yaml containing the Namespace, Deployment, Service, Certificate, and IngressRoute. Reference an existing directory (e.g. apps/fc-messageboard/) for the canonical shape.
Conventions:
Namespacehas labelapp.kubernetes.io/part-of: bluejay-infraDeployment.spec.selector.matchLabelsandService.spec.selectorMUST use the same label key. The historical convention here isapp: <name>(notapp.kubernetes.io/name) — don't mix.- Image:
localhost/<name>:v<YYYYMMDD><HHMM>,imagePullPolicy: Never. Import the image to every RKE2 node (server + both agents) viactr images importbefore applying — pods schedule anywhere. - If the app persists local state (SQLite, uploads), declare the
PersistentVolumeClaimhere withstorageClassName: longhornandaccessModes: [ReadWriteOnce]. Addstrategy.type: Recreateto the Deployment — RWO PVC blocks rolling updates. - Probes: use
tcpSocketif the app has middleware that intercepts unauth requests (returns 404/401 for/health). Otherwise preferhttpGetagainst whatever the app exposes (verify the path isn't gated by auth). - Certificate:
issuerRef.name: step-ca-acme,issuerRef.kind: ClusterIssuer.dnsNamesmust match the hostname you created in FlowerCore.DNS in step 1.
3. Commit & push
git add apps/<name>/
git commit -m "<name>: initial deployment"
git push
ArgoCD's ApplicationSet picks up the new directory within ~3 minutes and creates infra-<name> with auto-sync + self-heal enabled.
4. Verify
# From noc1
fcadmin_ssh noc1 '
kubectl -n argocd get application infra-<name>
kubectl -n <ns> get certificate,pod
curl -sk -m 8 -o /dev/null -w "HTTP %{http_code}\n" https://<name>.iamworkin.lan/
'
Certificate should be Ready: True within ~60s. If it stalls False for >2m, the pfSense DNS step got skipped — go back to step 1, then kubectl -n <ns> delete order <order-name> to bust the backoff.
Pre-merge gate
Before git push, always run:
python scripts/check-pfsense-dns.py
It's a quick service-backed check that would have caught the entire 2026-04-22 cert-manager outage. Consider wiring it into a pre-commit hook or a Gitea Actions workflow.
Retiring a service
kubectl -n argocd delete application infra-<name>(cascade deletes the K8s resources via ArgoCD finalizers)git rm -r apps/<name>/and push- Remove the FlowerCore.DNS record through the UI or API, for example:
curl -sk https://dns.iamworkin.lan/api/v1/servers
curl -sk -X DELETE https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iamworkin.lan/records/<yourservice>
Known gotchas
- CoreDNS template + ndots:5 collision: inside pods,
<svc>.<ns>.svc.cluster.localwith <5 dots gets search-expanded throughiamworkin.lanFIRST and hits the wildcard template → resolves to Traefik VIP, not the real ClusterIP. Use short service names (<svc>) in K8s manifests. See memoryfeedback_coredns_ndots_template_collision.md. - Image not on node: pods stuck
ErrImageNeverPullmeans the image wasn't imported to the node Kubernetes scheduled the pod onto.ctr images importon all of rke2-server, rke2-agent1, rke2-agent2. - StatefulSet PVC drift:
volumeClaimTemplatesneeds explicitvolumeMode: Filesystemor ArgoCD SSA self-heals forever. See memoryfeedback_argocd_statefulset_pvc_drift.md. - IngressRoute namespace split: this RKE2 Traefik install does not allow cross-namespace service refs. Keep the
IngressRoute, backendService, and TLS secret in the same namespace; if one host is shared across namespaces, duplicate theCertificateand move the route next to the destination service. - Public read-only hosts: if a public host fronts a service that also exposes admin writes internally, add a Traefik route match like
Host(...) && (Method(GET) || Method(HEAD))on the public edge instead of trusting the app to reject unsafe methods. - Public read-write allowlist hosts: if a public host accepts a tightly bounded write surface (e.g. bootstrap-JWT POST), pin the allowlist as
(Method(GET) || Method(HEAD) || Method(POST) || Method(OPTIONS)). PUT/PATCH/DELETE must still 404 at the route. Track A'supdatecenter.iamworkin.lan/updates.iamworkin.lanare the canonical example. The lint test enforces this invariant. - Traefik VIP netpols: when a
NetworkPolicyallows10.0.56.200, also allow the post-DNAT backend ports (8443for TLS plus8080or8000for HTTP) or Calico will drop the rewritten flow. - Auth-safe probes: services behind API-key or global auth middleware should prefer
tcpSocketprobes unless/healthis explicitly exempted before the middleware runs. - ArgoCD must use internal Gitea URL:
http://gitea-clusterip.gitea.svc.cluster.local:3000/bluejay/bluejay-infra.git, not the external HTTPS URL (step-ca cert isn't trusted by ArgoCD). TheApplicationSetand any hand-createdApplicationmust both use the internal URL.
Local manifest lint
The repo now carries a local-first lint pass for the recurring K8s gotchas that have burned the fleet:
dotnet test tests/bluejay-infra-lint/BluejayInfraLint.Tests.csproj -c Release
That test project sweeps bluejay-infra/apps/** plus the canonical sibling FlowerCore.*\\k8s manifests that share the same workspace. Matching conftest.dev policy files live under tests/bluejay-infra-lint/conftest.dev/ for environments that also have conftest or opa.
Non-K8s Pi Artifacts
Some apps/* directories are deployment artifact bundles consumed by Puppet
instead of Kubernetes workloads. apps/fc-signage-pi-player/ carries the
Chromium signage Pi player, apps/fc-divoom-dm-pi-device/ carries the additive
edge2 Divoom-as-DeviceManagement-device profile/Hiera contract, and
apps/fc-divoom-tv-pi/ carries the Divoom TV Pi HDMI systemd/Puppet shape.
These bundles intentionally avoid Deployment, IngressRoute, Certificate, and
OnePasswordItem resources.
References
- OpenVox noc1 durability runbook:
docs/runbooks/openvoxserver-quadlet-durability.md - Cert-manager recovery playbook:
FlowerCore.Notes/memory/project_cert_manager_recovery_2026_04_22.md - Why pfSense DNS is required:
FlowerCore.Notes/memory/feedback_pfsense_dns_required_for_acme.md - Public DNS operator host:
https://dns.iamworkin.lan - Canonical credential helper:
FlowerCore.Notes/scripts/credential-helper.sh - pfSense admin automation:
FlowerCore.Notes/memory/feedback_pfsense_automation.md