The first batching pass (bacac06) left critical-severity alerts on the
immediate-print path. That's still per-event spam for any persistent
critical (e.g. PrintPaperRollCritical fires every 30s Grafana evaluation
cycle when paper is <5%). Caught immediately after deploy: CUPS queue grew
0 → 8 jobs in 8 minutes from a single firing PrintPaperRollCritical.
This commit aligns with the operator's verbatim ask ("one alert an hour"):
- Critical-severity alerts now go into the digest buffer, NOT the
immediate-print path. The digest payload already shows severity tags
per alertname, so the operator still sees "[critical] X" in the printout.
- The explicit `alert_channel=thermal_print_immediate` label still bypasses
batching, but only on NEW fingerprint arrival — it triggers a flush of
the CURRENT digest (with the new alert included), then clears. Repeat
webhooks for the same fingerprint dedupe in the buffer until the next
hourly tick OR until the alert resolves. No fingerprint can spam.
- `add_to_digest` now returns bool (True = buffer grew, False = dedup /
resolution / disabled) so the immediate-label path can flush only on
state transitions.
Net effect: max 1 thermal print per BATCH_INTERVAL_MIN per alert fingerprint,
regardless of severity. Rules that genuinely need same-second paper opt in
via `alert_channel=thermal_print_immediate` (currently zero rules use this).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bluejay-infra
Infrastructure manifests for ArgoCD. An ApplicationSet in argocd namespace watches the apps/* directories in this repo and creates one Application per subdir (prefixed infra-<name>).
Adding a new service to the cluster
Follow these steps in order. Step 1 must run before step 3 — if you skip it, cert-manager HTTP-01 will silently fail for ~2h per cert (exponential backoff) until someone diagnoses the DNS.
1. Create or verify the FlowerCore.DNS A record (REQUIRED for current HTTP-01 manifests)
step-ca (the ACME CA on noc1) runs in a Podman container with host networking. Its container resolver uses pfSense Unbound (10.0.56.1), not cluster CoreDNS. So even though CoreDNS has a wildcard *.iamworkin.lan → 10.0.56.200 for in-cluster lookups, step-ca cannot see it. Every new public hostname needs an explicit pfSense host override.
The management path is now FlowerCore.DNS, not FlowerCore.Notes/scripts/pfsense-add-dns-overrides.py. Add or verify the public A record there before you apply the manifest:
curl -sk https://dns.iamworkin.lan/api/v1/servers
# Find the pfSense serverId, then create the record using the host label only.
# Example: for foo.iamworkin.lan, use "name":"foo".
curl -sk -X POST https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iamworkin.lan/records \
-H "Content-Type: application/json" \
-d '{"name":"<yourservice>","type":"A","data":"10.0.56.200","ttl":300}'
Verify all referenced iamworkin.lan hosts resolve (run from anywhere on LAN):
python scripts/check-pfsense-dns.py
# Historical filename retained. The script now calls
# https://dns.iamworkin.lan/api/v1/zones/iamworkin.lan/resolve-preflight
# for every Certificate dnsName and Traefik Host(...) rule it finds.
python scripts/check-pfsense-dns.py --live
# Optional stronger pass when kubectl access is available; also checks
# live-cluster Certificates and IngressRoutes for drift outside manifests.
Symptom if you skip this: the Certificate resource stays Ready: False with status.reason: unexpected non-ACME API error: context deadline exceeded. Recovery requires kubectl -n <ns> delete order <order-name> after adding the DNS to bypass cert-manager's backoff.
2. Create the app manifest
Create apps/<name>/<name>.yaml containing the Namespace, Deployment, Service, Certificate, and IngressRoute. Reference an existing directory (e.g. apps/fc-messageboard/) for the canonical shape.
Conventions:
Namespacehas labelapp.kubernetes.io/part-of: bluejay-infraDeployment.spec.selector.matchLabelsandService.spec.selectorMUST use the same label key. The historical convention here isapp: <name>(notapp.kubernetes.io/name) — don't mix.- Image:
localhost/<name>:v<YYYYMMDD><HHMM>,imagePullPolicy: Never. Import the image to every RKE2 node (server + both agents) viactr images importbefore applying — pods schedule anywhere. - If the app persists local state (SQLite, uploads), declare the
PersistentVolumeClaimhere withstorageClassName: longhornandaccessModes: [ReadWriteOnce]. Addstrategy.type: Recreateto the Deployment — RWO PVC blocks rolling updates. - Probes: use
tcpSocketif the app has middleware that intercepts unauth requests (returns 404/401 for/health). Otherwise preferhttpGetagainst whatever the app exposes (verify the path isn't gated by auth). - Certificate:
issuerRef.name: step-ca-acme,issuerRef.kind: ClusterIssuer.dnsNamesmust match the hostname you created in FlowerCore.DNS in step 1.
3. Commit & push
git add apps/<name>/
git commit -m "<name>: initial deployment"
git push
ArgoCD's ApplicationSet picks up the new directory within ~3 minutes and creates infra-<name> with auto-sync + self-heal enabled.
4. Verify
# From noc1
fcadmin_ssh noc1 '
kubectl -n argocd get application infra-<name>
kubectl -n <ns> get certificate,pod
curl -sk -m 8 -o /dev/null -w "HTTP %{http_code}\n" https://<name>.iamworkin.lan/
'
Certificate should be Ready: True within ~60s. If it stalls False for >2m, the pfSense DNS step got skipped — go back to step 1, then kubectl -n <ns> delete order <order-name> to bust the backoff.
Pre-merge gate
Before git push, always run:
python scripts/check-pfsense-dns.py
It's a quick service-backed check that would have caught the entire 2026-04-22 cert-manager outage. Consider wiring it into a pre-commit hook or a Gitea Actions workflow.
Retiring a service
kubectl -n argocd delete application infra-<name>(cascade deletes the K8s resources via ArgoCD finalizers)git rm -r apps/<name>/and push- Remove the FlowerCore.DNS record through the UI or API, for example:
curl -sk https://dns.iamworkin.lan/api/v1/servers
curl -sk -X DELETE https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iamworkin.lan/records/<yourservice>
Known gotchas
- CoreDNS template + ndots:5 collision: inside pods,
<svc>.<ns>.svc.cluster.localwith <5 dots gets search-expanded throughiamworkin.lanFIRST and hits the wildcard template → resolves to Traefik VIP, not the real ClusterIP. Use short service names (<svc>) in K8s manifests. See memoryfeedback_coredns_ndots_template_collision.md. - Image not on node: pods stuck
ErrImageNeverPullmeans the image wasn't imported to the node Kubernetes scheduled the pod onto.ctr images importon all of rke2-server, rke2-agent1, rke2-agent2. - StatefulSet PVC drift:
volumeClaimTemplatesneeds explicitvolumeMode: Filesystemor ArgoCD SSA self-heals forever. See memoryfeedback_argocd_statefulset_pvc_drift.md. - IngressRoute namespace split: this RKE2 Traefik install does not allow cross-namespace service refs. Keep the
IngressRoute, backendService, and TLS secret in the same namespace; if one host is shared across namespaces, duplicate theCertificateand move the route next to the destination service. - Public read-only hosts: if a public host fronts a service that also exposes admin writes internally, add a Traefik route match like
Host(...) && (Method(GET) || Method(HEAD))on the public edge instead of trusting the app to reject unsafe methods. - Public read-write allowlist hosts: if a public host accepts a tightly bounded write surface (e.g. bootstrap-JWT POST), pin the allowlist as
(Method(GET) || Method(HEAD) || Method(POST) || Method(OPTIONS)). PUT/PATCH/DELETE must still 404 at the route. Track A'supdatecenter.iamworkin.lan/updates.iamworkin.lanare the canonical example. The lint test enforces this invariant. - Traefik VIP netpols: when a
NetworkPolicyallows10.0.56.200, also allow the post-DNAT backend ports (8443for TLS plus8080or8000for HTTP) or Calico will drop the rewritten flow. - Auth-safe probes: services behind API-key or global auth middleware should prefer
tcpSocketprobes unless/healthis explicitly exempted before the middleware runs. - ArgoCD must use internal Gitea URL:
http://gitea-clusterip.gitea.svc.cluster.local:3000/bluejay/bluejay-infra.git, not the external HTTPS URL (step-ca cert isn't trusted by ArgoCD). TheApplicationSetand any hand-createdApplicationmust both use the internal URL.
Local manifest lint
The repo now carries a local-first lint pass for the recurring K8s gotchas that have burned the fleet:
dotnet test tests/bluejay-infra-lint/BluejayInfraLint.Tests.csproj -c Release
That test project sweeps bluejay-infra/apps/** plus the canonical sibling FlowerCore.*\\k8s manifests that share the same workspace. Matching conftest.dev policy files live under tests/bluejay-infra-lint/conftest.dev/ for environments that also have conftest or opa.
References
- OpenVox noc1 durability runbook:
docs/runbooks/openvoxserver-quadlet-durability.md - Cert-manager recovery playbook:
FlowerCore.Notes/memory/project_cert_manager_recovery_2026_04_22.md - Why pfSense DNS is required:
FlowerCore.Notes/memory/feedback_pfsense_dns_required_for_acme.md - Public DNS operator host:
https://dns.iamworkin.lan - Canonical credential helper:
FlowerCore.Notes/scripts/credential-helper.sh - pfSense admin automation:
FlowerCore.Notes/memory/feedback_pfsense_automation.md