
Andrew Stoltz e641ceab48 monitoring(irc-notify): criticals also batch hourly — fix per-fire spam

The first batching pass (bacac06) left critical-severity alerts on the
immediate-print path. That is still per-event spam for any persistent
critical (e.g. PrintPaperRollCritical fires on every 30s Grafana evaluation
cycle while paper is <5%). Caught immediately after deploy: the CUPS queue
grew 0 → 8 jobs in 8 minutes from a single firing PrintPaperRollCritical.

This commit aligns with the operator's verbatim ask ("one alert an hour"):

- Critical-severity alerts now go into the digest buffer, NOT the
  immediate-print path. The digest payload already shows severity tags
  per alertname, so the operator still sees "[critical] X" in the printout.
- The explicit `alert_channel=thermal_print_immediate` label still bypasses
  batching, but only on NEW fingerprint arrival — it triggers a flush of
  the CURRENT digest (with the new alert included), then clears. Repeat
  webhooks for the same fingerprint dedupe in the buffer until the next
  hourly tick OR until the alert resolves. No fingerprint can spam.
- `add_to_digest` now returns bool (True = buffer grew, False = dedup /
  resolution / disabled) so the immediate-label path can flush only on
  state transitions.

Net effect: max 1 thermal print per BATCH_INTERVAL_MIN per alert fingerprint,
regardless of severity. Rules that genuinely need same-second paper opt in
via `alert_channel=thermal_print_immediate` (currently zero rules use this).
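
The batching rules above can be sketched roughly as follows. This is a minimal illustration, not the irc-notify code: the `DigestBuffer` class, the `handle_webhook` helper, the `seen` set, and every method other than the `add_to_digest` name are hypothetical. It assumes that fingerprints already printed stay remembered until the hourly tick or until the alert resolves, which is one way to get the "no fingerprint can spam" behaviour described in the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class DigestBuffer:
    """Hypothetical in-memory digest buffer keyed by alert fingerprint."""
    enabled: bool = True
    pending: dict = field(default_factory=dict)  # fingerprint -> (severity, alertname)
    seen: set = field(default_factory=set)       # fingerprints known this interval

    def add_to_digest(self, fingerprint, alertname, severity, status="firing"):
        """Return True only when the buffer grew (a new fingerprint arrived)."""
        if not self.enabled:
            return False
        if status == "resolved":
            # Resolution forgets the fingerprint; it never grows the buffer.
            self.seen.discard(fingerprint)
            self.pending.pop(fingerprint, None)
            return False
        if fingerprint in self.seen:
            return False  # repeat webhook for a known fingerprint: dedup
        self.seen.add(fingerprint)
        self.pending[fingerprint] = (severity, alertname)
        return True

    def flush(self):
        """Render the pending digest lines, then clear them (seen is kept)."""
        lines = [f"[{sev}] {name}" for sev, name in self.pending.values()]
        self.pending.clear()
        return lines

    def hourly_tick(self):
        """Flush the digest and start a fresh dedup interval."""
        lines = self.flush()
        self.seen.clear()
        return lines


def handle_webhook(buf, fingerprint, alertname, severity, labels, status="firing"):
    """Return the lines printed right now ([] when the alert only batches)."""
    grew = buf.add_to_digest(fingerprint, alertname, severity, status)
    if grew and labels.get("alert_channel") == "thermal_print_immediate":
        return buf.flush()  # flush only on a state transition
    return []
```

In this sketch a critical alert without the label only lands in `pending`, and a labelled alert prints once per new fingerprint; repeat webhooks return an empty list until `hourly_tick()` runs or the alert resolves.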

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-19 10:22:25 -05:00

apps

monitoring(irc-notify): criticals also batch hourly — fix per-fire spam

2026-05-19 10:22:25 -05:00

docs/runbooks

docs(openvox): document quadlet durability smoke (#12)

2026-05-18 04:53:02 +00:00

scripts

docs(openvox): document quadlet durability smoke (#12)

2026-05-18 04:53:02 +00:00

tests/bluejay-infra-lint

docs(openvox): document quadlet durability smoke (#12)

2026-05-18 04:53:02 +00:00

.gitignore

K8s manifest hardening + new bluejay-infra-lint test project

2026-05-04 03:18:04 -05:00

README.md

docs(openvox): document quadlet durability smoke (#12)

2026-05-18 04:53:02 +00:00

README.md

bluejay-infra

Infrastructure manifests for ArgoCD. An ApplicationSet in argocd namespace watches the apps/* directories in this repo and creates one Application per subdir (prefixed infra-<name>).

Adding a new service to the cluster

Follow these steps in order. Step 1 must run before step 3 — if you skip it, cert-manager HTTP-01 will silently fail for ~2h per cert (exponential backoff) until someone diagnoses the DNS.

1. Create or verify the FlowerCore.DNS A record (REQUIRED for current HTTP-01 manifests)

step-ca (the ACME CA on noc1) runs in a Podman container with host networking. Its container resolver uses pfSense Unbound (10.0.56.1), not cluster CoreDNS. So even though CoreDNS has a wildcard *.iamworkin.lan → 10.0.56.200 for in-cluster lookups, step-ca cannot see it. Every new public hostname needs an explicit pfSense host override.

The management path is now FlowerCore.DNS, not FlowerCore.Notes/scripts/pfsense-add-dns-overrides.py. Add or verify the public A record there before you apply the manifest:

curl -sk https://dns.iamworkin.lan/api/v1/servers
# Find the pfSense serverId, then create the record using the host label only.
# Example: for foo.iamworkin.lan, use "name":"foo".

curl -sk -X POST https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iamworkin.lan/records \
  -H "Content-Type: application/json" \
  -d '{"name":"<yourservice>","type":"A","data":"10.0.56.200","ttl":300}'
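
The "host label only" rule in the curl example is easy to get wrong when scripting. A small sketch of how the request body could be built from a full hostname follows; the `record_payload` helper is hypothetical and only mirrors the JSON fields shown in the curl example above, it does not call the FlowerCore.DNS API.

```python
import json


def record_payload(fqdn, zone="iamworkin.lan", ip="10.0.56.200", ttl=300):
    """Build the JSON body for an A record, sending the host label only."""
    suffix = "." + zone
    name = fqdn.rstrip(".")
    if not name.endswith(suffix):
        raise ValueError(f"{fqdn!r} is not inside zone {zone!r}")
    label = name[: -len(suffix)]  # "foo.iamworkin.lan" -> "foo"
    if not label:
        raise ValueError("empty host label")
    return json.dumps({"name": label, "type": "A", "data": ip, "ttl": ttl})
```

The returned string can be passed as the `-d` argument of the POST shown above, which avoids accidentally submitting the full hostname as the record name.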

Verify that all referenced iamworkin.lan hosts resolve (run from anywhere on the LAN):

python scripts/check-pfsense-dns.py
# Historical filename retained. The script now calls
# https://dns.iamworkin.lan/api/v1/zones/iamworkin.lan/resolve-preflight
# for every Certificate dnsName and Traefik Host(...) rule it finds.

python scripts/check-pfsense-dns.py --live
# Optional stronger pass when kubectl access is available; also checks
# live-cluster Certificates and IngressRoutes for drift outside manifests.
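
To illustrate the kind of hostname collection the comments describe, here is a rough sketch of pulling hostnames out of Traefik Host(...) match rules. It is not the code in scripts/check-pfsense-dns.py; the function names and regular expressions are assumptions made for this example, and the sketch does not perform any DNS or API lookups.

```python
import re

# Matches the argument list of Host(...), but not HostRegexp(...).
HOST_RULE = re.compile(r"\bHost\(([^)]*)\)")
# Matches one quoted hostname inside that argument list.
QUOTED = re.compile(r"[`'\"]([^`'\"]+)[`'\"]")


def hosts_in_rule(match_rule):
    """Return every hostname named in Host(...) clauses of one match rule."""
    hosts = []
    for args in HOST_RULE.findall(match_rule):
        hosts.extend(QUOTED.findall(args))
    return hosts


def hosts_in_zone(rules, zone="iamworkin.lan"):
    """Collect the unique hostnames that belong to the given zone, sorted."""
    found = set()
    for rule in rules:
        for host in hosts_in_rule(rule):
            if host == zone or host.endswith("." + zone):
                found.add(host)
    return sorted(found)
```

Each hostname returned by `hosts_in_zone` would then be a candidate for the resolve-preflight check.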

Symptom if you skip this: the Certificate resource stays Ready: False with status.reason: unexpected non-ACME API error: context deadline exceeded. To recover, add the DNS record first, then run kubectl -n <ns> delete order <order-name> to bypass cert-manager's backoff.

2. Create the app manifest

Create apps/<name>/<name>.yaml containing the Namespace, Deployment, Service, Certificate, and IngressRoute. Reference an existing directory (e.g. apps/fc-messageboard/) for the canonical shape.

Conventions:

- Namespace has label app.kubernetes.io/part-of: bluejay-infra
- Deployment.spec.selector.matchLabels and Service.spec.selector MUST use the same label key. The historical convention here is app: <name> (not app.kubernetes.io/name); don't mix the two.
- Image: localhost/<name>:v<YYYYMMDD><HHMM>, imagePullPolicy: Never. Import the image to every RKE2 node (server + both agents) via ctr images import before applying, because pods can schedule anywhere.
- If the app persists local state (SQLite, uploads), declare the PersistentVolumeClaim here with storageClassName: longhorn and accessModes: [ReadWriteOnce]. Add strategy.type: Recreate to the Deployment, because an RWO PVC blocks rolling updates.
- Probes: use tcpSocket if the app has middleware that intercepts unauthenticated requests (returns 404/401 for /health). Otherwise prefer httpGet against whatever the app exposes (verify the path isn't gated by auth).
- Certificate: issuerRef.name: step-ca-acme, issuerRef.kind: ClusterIssuer. dnsNames must match the hostname you created in FlowerCore.DNS in step 1.
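
The selector convention above is the easiest one to break silently, since a mismatched Service simply has no endpoints. A minimal sketch of a check over already-parsed manifests follows; `selector_problems` is a hypothetical helper written for this README, not part of the lint test project, and it assumes the manifests have been loaded into plain dictionaries.

```python
def selector_problems(deployment, service, expected_key="app"):
    """Return human-readable problems with the Deployment/Service selectors."""
    dep_sel = deployment.get("spec", {}).get("selector", {}).get("matchLabels") or {}
    svc_sel = service.get("spec", {}).get("selector") or {}
    problems = []
    if set(dep_sel) != set(svc_sel):
        problems.append(
            f"label keys differ: Deployment {sorted(dep_sel)} vs Service {sorted(svc_sel)}"
        )
    elif dep_sel != svc_sel:
        problems.append("label keys match but values differ")
    if expected_key not in dep_sel:
        problems.append(f"Deployment selector does not use the {expected_key!r} key")
    return problems
```

An empty list means the two selectors agree and use the app key; anything else is worth fixing before the push in step 3.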

3. Commit & push

git add apps/<name>/
git commit -m "<name>: initial deployment"
git push

ArgoCD's ApplicationSet picks up the new directory within ~3 minutes and creates infra-<name> with auto-sync + self-heal enabled.

4. Verify

# From noc1
fcadmin_ssh noc1 '
  kubectl -n argocd get application infra-<name>
  kubectl -n <ns> get certificate,pod
  curl -sk -m 8 -o /dev/null -w "HTTP %{http_code}\n" https://<name>.iamworkin.lan/
'

Certificate should be Ready: True within ~60s. If it stays False for more than 2 minutes, the DNS record from step 1 is most likely missing: go back to step 1, then run kubectl -n <ns> delete order <order-name> to clear cert-manager's backoff.

Pre-merge gate

Before git push, always run:

python scripts/check-pfsense-dns.py

It's a quick service-backed check that would have caught the entire 2026-04-22 cert-manager outage. Consider wiring it into a pre-commit hook or a Gitea Actions workflow.

Retiring a service

1. kubectl -n argocd delete application infra-<name> (cascade deletes the K8s resources via ArgoCD finalizers)
2. git rm -r apps/<name>/ and push
3. Remove the FlowerCore.DNS record through the UI or API, for example:

curl -sk https://dns.iamworkin.lan/api/v1/servers
curl -sk -X DELETE https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iamworkin.lan/records/<yourservice>

Known gotchas

- CoreDNS template + ndots:5 collision: inside pods, <svc>.<ns>.svc.cluster.local has fewer than 5 dots, so it gets search-expanded through iamworkin.lan FIRST and hits the wildcard template → resolves to the Traefik VIP, not the real ClusterIP. Use short service names (<svc>) in K8s manifests. See memory feedback_coredns_ndots_template_collision.md.
- Image not on node: pods stuck in ErrImageNeverPull mean the image wasn't imported to the node Kubernetes scheduled the pod onto. Run ctr images import on all of rke2-server, rke2-agent1, rke2-agent2.
- StatefulSet PVC drift: volumeClaimTemplates needs explicit volumeMode: Filesystem or ArgoCD SSA self-heals forever. See memory feedback_argocd_statefulset_pvc_drift.md.
- IngressRoute namespace split: this RKE2 Traefik install does not allow cross-namespace service refs. Keep the IngressRoute, backend Service, and TLS secret in the same namespace; if one host is shared across namespaces, duplicate the Certificate and move the route next to the destination service.
- Public read-only hosts: if a public host fronts a service that also exposes admin writes internally, add a Traefik route match like Host(...) && (Method(GET) || Method(HEAD)) on the public edge instead of trusting the app to reject unsafe methods.
- Public read-write allowlist hosts: if a public host accepts a tightly bounded write surface (e.g. bootstrap-JWT POST), pin the allowlist as (Method(GET) || Method(HEAD) || Method(POST) || Method(OPTIONS)). PUT/PATCH/DELETE must still 404 at the route. Track A's updatecenter.iamworkin.lan / updates.iamworkin.lan are the canonical example. The lint test enforces this invariant.
- Traefik VIP netpols: when a NetworkPolicy allows 10.0.56.200, also allow the post-DNAT backend ports (8443 for TLS plus 8080 or 8000 for HTTP) or Calico will drop the rewritten flow.
- Auth-safe probes: services behind API-key or global auth middleware should prefer tcpSocket probes unless /health is explicitly exempted before the middleware runs.
- ArgoCD must use the internal Gitea URL: http://gitea-clusterip.gitea.svc.cluster.local:3000/bluejay/bluejay-infra.git, not the external HTTPS URL (the step-ca cert isn't trusted by ArgoCD). The ApplicationSet and any hand-created Application must both use the internal URL.
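
The ndots:5 collision in the first gotcha is easier to see with the resolver's query order written out. The sketch below models the standard stub-resolver rule (names with fewer dots than ndots try the search list before the absolute name). The search list used in the usage note is an assumption about what a pod's resolv.conf might contain here, and `candidate_queries` is a hypothetical helper, not something in this repo.

```python
def candidate_queries(name, search, ndots=5):
    """Return the fully qualified names a stub resolver would try, in order."""
    if name.endswith("."):
        return [name]  # a trailing dot marks the name absolute: no search expansion
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: absolute name is tried first
    return expanded + [absolute]      # too few dots: search list is tried first


# Assumed search list: the usual cluster suffixes plus the node's LAN domain.
SEARCH = ["myns.svc.cluster.local", "svc.cluster.local", "cluster.local", "iamworkin.lan"]
```

With that list, `candidate_queries("db.myns.svc.cluster.local", SEARCH)` yields `db.myns.svc.cluster.local.iamworkin.lan.` before the absolute name, and that is the query a wildcard template would answer. The short name `db` expands to `db.myns.svc.cluster.local.` on the first try.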

Local manifest lint

The repo now carries a local-first lint pass for the recurring K8s gotchas that have burned the fleet:

dotnet test tests/bluejay-infra-lint/BluejayInfraLint.Tests.csproj -c Release

That test project sweeps bluejay-infra/apps/** plus the canonical sibling FlowerCore.*\\k8s manifests that share the same workspace. Matching conftest.dev policy files live under tests/bluejay-infra-lint/conftest.dev/ for environments that also have conftest or opa.

References

OpenVox noc1 durability runbook: docs/runbooks/openvoxserver-quadlet-durability.md
Cert-manager recovery playbook: FlowerCore.Notes/memory/project_cert_manager_recovery_2026_04_22.md
Why pfSense DNS is required: FlowerCore.Notes/memory/feedback_pfsense_dns_required_for_acme.md
Public DNS operator host: https://dns.iamworkin.lan
Canonical credential helper: FlowerCore.Notes/scripts/credential-helper.sh
pfSense admin automation: FlowerCore.Notes/memory/feedback_pfsense_automation.md
