bluejay-infra

Infrastructure manifests for ArgoCD. An ApplicationSet in the argocd namespace watches the apps/* directories in this repo and creates one Application per subdirectory, named infra-<name>.

Adding a new service to the cluster

Follow these steps in order. Step 1 must run before step 3 — if you skip it, cert-manager HTTP-01 will silently fail for ~2h per cert (exponential backoff) until someone diagnoses the DNS.

1. Create or verify the FlowerCore.DNS A record (REQUIRED for current HTTP-01 manifests)

step-ca (the ACME CA on noc1) runs in a Podman container with host networking. Its container resolver uses pfSense Unbound (10.0.56.1), not cluster CoreDNS. So even though CoreDNS has a wildcard *.iamworkin.lan → 10.0.56.200 for in-cluster lookups, step-ca cannot see it. Every new public hostname needs an explicit pfSense host override.

The management path is now FlowerCore.DNS, not FlowerCore.Notes/scripts/pfsense-add-dns-overrides.py. Add or verify the public A record there before you apply the manifest:

curl -sk https://dns.iamworkin.lan/api/v1/servers
# Find the pfSense serverId, then create the record using the host label only.
# Example: for foo.iamworkin.lan, use "name":"foo".

curl -sk -X POST https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iamworkin.lan/records \
  -H "Content-Type: application/json" \
  -d '{"name":"<yourservice>","type":"A","data":"10.0.56.200","ttl":300}'
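For scripted use, the same request can be built from Python. This is a minimal sketch, assuming the endpoint and JSON body shown in the curl example above; the function name build_a_record_request and the "srv-1" server ID are hypothetical, and nothing is sent until the request is passed to urllib.request.urlopen.

```python
import json
import urllib.request

# Base URL taken from the curl example above.
DNS_API = "https://dns.iamworkin.lan/api/v1"


def build_a_record_request(server_id, host_label, ip="10.0.56.200",
                           ttl=300, zone="iamworkin.lan"):
    """Build (but do not send) the POST that creates an A record.

    host_label is the host label only: "foo" for foo.iamworkin.lan.
    """
    if "." in host_label:
        raise ValueError("use the host label only, e.g. 'foo' for foo.iamworkin.lan")
    body = json.dumps(
        {"name": host_label, "type": "A", "data": ip, "ttl": ttl}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{DNS_API}/servers/{server_id}/zones/{zone}/records",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # "srv-1" is a placeholder; look up the real serverId via /servers first.
    req = build_a_record_request("srv-1", "foo")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Rejecting dotted labels up front guards against passing the full hostname where the API expects only the label. To actually send the request, pass it to urllib.request.urlopen with an SSL context that trusts the step-ca root (curl's -k skips verification instead).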

Verify that all referenced iamworkin.lan hosts resolve (run from anywhere on the LAN):

python scripts/check-pfsense-dns.py
# Historical filename retained. The script now calls
# https://dns.iamworkin.lan/api/v1/zones/iamworkin.lan/resolve-preflight
# for every Certificate dnsName and Traefik Host(...) rule it finds.

python scripts/check-pfsense-dns.py --live
# Optional stronger pass when kubectl access is available; also checks
# live-cluster Certificates and IngressRoutes for drift outside manifests.
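To illustrate what "every Certificate dnsName and Traefik Host(...) rule" means in practice, here is a rough sketch of collecting those hostnames from manifest text. It is not the code in scripts/check-pfsense-dns.py; the function name and the line-based parsing are assumptions chosen to keep the example dependency-free.

```python
import re

# Host(`a.example`) or Host(`a.example`, `b.example`) inside a Traefik match rule.
HOST_CALL = re.compile(r"Host\(([^)]*)\)")
QUOTED = re.compile(r"[`'\"]([^`'\"]+)[`'\"]")
# A YAML list item holding a bare or quoted hostname.
DNS_ITEM = re.compile(r"^\s*-\s*[\"']?([A-Za-z0-9*.-]+)[\"']?\s*$")


def extract_hostnames(manifest_text):
    """Return hostnames found in Host(...) rules and dnsNames lists."""
    hosts = set()
    in_dns_names = False
    for line in manifest_text.splitlines():
        for call in HOST_CALL.findall(line):
            hosts.update(QUOTED.findall(call))
        stripped = line.strip()
        if stripped.startswith("dnsNames:"):
            in_dns_names = True
            continue
        if in_dns_names:
            match = DNS_ITEM.match(line)
            if match:
                hosts.add(match.group(1))
            elif stripped:
                # Any other non-blank line ends the dnsNames list.
                in_dns_names = False
    return hosts
```

A real implementation would more likely load the YAML properly; the line scan here only handles block-style dnsNames lists.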

Symptom if you skip this: the Certificate resource stays Ready: False with status.reason: unexpected non-ACME API error: context deadline exceeded. Recovery requires kubectl -n <ns> delete order <order-name> after adding the DNS to bypass cert-manager's backoff.

2. Create the app manifest

Create apps/<name>/<name>.yaml containing the Namespace, Deployment, Service, Certificate, and IngressRoute. Reference an existing directory (e.g. apps/fc-messageboard/) for the canonical shape.

Conventions:

  • Namespace has label app.kubernetes.io/part-of: bluejay-infra
  • Deployment.spec.selector.matchLabels and Service.spec.selector MUST use the same label key. The historical convention here is app: <name> (not app.kubernetes.io/name) — don't mix.
  • Image: localhost/<name>:v<YYYYMMDD><HHMM>, imagePullPolicy: Never. Import the image to every RKE2 node (server + both agents) via ctr images import before applying — pods schedule anywhere.
  • If the app persists local state (SQLite, uploads), declare the PersistentVolumeClaim here with storageClassName: longhorn and accessModes: [ReadWriteOnce]. Add strategy.type: Recreate to the Deployment, because an RWO PVC blocks rolling updates.
  • Probes: use tcpSocket if the app has middleware that intercepts unauthenticated requests (returns 404/401 for /health). Otherwise prefer httpGet against whatever the app exposes (verify the path isn't gated by auth).
  • Certificate: issuerRef.name: step-ca-acme, issuerRef.kind: ClusterIssuer. dnsNames must match the hostname you created in FlowerCore.DNS in step 1.
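Two of the conventions above lend themselves to a mechanical check: matching selector label keys, and Recreate strategy when an RWO PVC is present. The following is a minimal sketch, assuming the manifests have already been loaded into plain dicts; check_conventions is a hypothetical helper, not part of this repo.

```python
def check_conventions(deployment, service, pvcs):
    """Return a list of convention violations (empty when all checks pass)."""
    problems = []

    dep_spec = deployment.get("spec", {})
    dep_keys = set(dep_spec.get("selector", {}).get("matchLabels", {}))
    svc_keys = set(service.get("spec", {}).get("selector", {}))
    if dep_keys != svc_keys:
        problems.append(
            f"selector label keys differ: Deployment {sorted(dep_keys)} "
            f"vs Service {sorted(svc_keys)}"
        )

    uses_rwo = any(
        "ReadWriteOnce" in pvc.get("spec", {}).get("accessModes", [])
        for pvc in pvcs
    )
    # Kubernetes defaults to RollingUpdate when strategy.type is omitted.
    strategy = dep_spec.get("strategy", {}).get("type", "RollingUpdate")
    if uses_rwo and strategy != "Recreate":
        problems.append("RWO PVC present but strategy.type is not Recreate")

    return problems
```

The check compares label keys only, mirroring the wording of the convention; it does not verify that the label values match the pod template.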

3. Commit & push

git add apps/<name>/
git commit -m "<name>: initial deployment"
git push

ArgoCD's ApplicationSet picks up the new directory within ~3 minutes and creates infra-<name> with auto-sync + self-heal enabled.
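The directory-to-Application naming can be previewed locally before pushing. This sketch assumes only the rule stated in this README (one Application per apps/ subdirectory, named infra-<name>); the helper name expected_applications is hypothetical.

```python
from pathlib import Path


def expected_applications(repo_root):
    """List the Application names the ApplicationSet should create."""
    apps_dir = Path(repo_root) / "apps"
    return sorted(f"infra-{p.name}" for p in apps_dir.iterdir() if p.is_dir())


if __name__ == "__main__":
    # Run from the repo root; compare against
    # `kubectl -n argocd get applications` output.
    for name in expected_applications("."):
        print(name)
```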

4. Verify

# From noc1
fcadmin_ssh noc1 '
  kubectl -n argocd get application infra-<name>
  kubectl -n <ns> get certificate,pod
  curl -sk -m 8 -o /dev/null -w "HTTP %{http_code}\n" https://<name>.iamworkin.lan/
'

Certificate should be Ready: True within ~60s. If it stays False for more than 2 minutes, the DNS record from step 1 is most likely missing. Go back to step 1, then run kubectl -n <ns> delete order <order-name> to reset cert-manager's backoff.

Pre-merge gate

Before git push, always run:

python scripts/check-pfsense-dns.py

It's a quick service-backed check that would have caught the entire 2026-04-22 cert-manager outage. Consider wiring it into a pre-commit hook or a Gitea Actions workflow.

Retiring a service

  1. kubectl -n argocd delete application infra-<name> (cascade deletes the K8s resources via ArgoCD finalizers)
  2. git rm -r apps/<name>/ and push
  3. Remove the FlowerCore.DNS record through the UI or API, for example:

     curl -sk https://dns.iamworkin.lan/api/v1/servers
     curl -sk -X DELETE https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iamworkin.lan/records/<yourservice>

Known gotchas

  • CoreDNS template + ndots:5 collision: inside pods, <svc>.<ns>.svc.cluster.local with <5 dots gets search-expanded through iamworkin.lan FIRST and hits the wildcard template → resolves to Traefik VIP, not the real ClusterIP. Use short service names (<svc>) in K8s manifests. See memory feedback_coredns_ndots_template_collision.md.
  • Image not on node: pods stuck in ErrImageNeverPull mean the image wasn't imported to the node Kubernetes scheduled the pod onto. Run ctr images import on all of rke2-server, rke2-agent1, and rke2-agent2.
  • StatefulSet PVC drift: volumeClaimTemplates needs explicit volumeMode: Filesystem or ArgoCD SSA self-heals forever. See memory feedback_argocd_statefulset_pvc_drift.md.
  • IngressRoute namespace split: this RKE2 Traefik install does not allow cross-namespace service refs. Keep the IngressRoute, backend Service, and TLS secret in the same namespace; if one host is shared across namespaces, duplicate the Certificate and move the route next to the destination service.
  • Public read-only hosts: if a public host fronts a service that also exposes admin writes internally, add a Traefik route match like Host(...) && (Method(GET) || Method(HEAD)) on the public edge instead of trusting the app to reject unsafe methods.
  • Public read-write allowlist hosts: if a public host accepts a tightly bounded write surface (e.g. bootstrap-JWT POST), pin the allowlist as (Method(GET) || Method(HEAD) || Method(POST) || Method(OPTIONS)). PUT/PATCH/DELETE must still 404 at the route. Track A's updatecenter.iamworkin.lan / updates.iamworkin.lan are the canonical example. The lint test enforces this invariant.
  • Traefik VIP netpols: when a NetworkPolicy allows 10.0.56.200, also allow the post-DNAT backend ports (8443 for TLS plus 8080 or 8000 for HTTP) or Calico will drop the rewritten flow.
  • Auth-safe probes: services behind API-key or global auth middleware should prefer tcpSocket probes unless /health is explicitly exempted before the middleware runs.
  • ArgoCD must use internal Gitea URL: http://gitea-clusterip.gitea.svc.cluster.local:3000/bluejay/bluejay-infra.git, not the external HTTPS URL (step-ca cert isn't trusted by ArgoCD). The ApplicationSet and any hand-created Application must both use the internal URL.
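The ndots:5 collision in the first gotcha can be reasoned through with a small model of the stub resolver's search-list behaviour. The sketch below follows the usual resolv.conf rule (names with fewer dots than ndots are tried against the search list before being tried as-is). The search list and the service name api.demo are assumptions for illustration; check /etc/resolv.conf inside a real pod for the actual values.

```python
def candidate_queries(name, search, ndots=5):
    """Return the FQDNs a stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search expansion.
        return [name]
    expanded = [f"{name}.{suffix}." for suffix in search]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


# Assumed search list for a pod in namespace "demo" on a node whose
# own search domain is iamworkin.lan.
SEARCH = ["demo.svc.cluster.local", "svc.cluster.local",
          "cluster.local", "iamworkin.lan"]

if __name__ == "__main__":
    for query in candidate_queries("api.demo.svc.cluster.local", SEARCH):
        print(query)
```

Under these assumptions, api.demo.svc.cluster.local has only 4 dots, so api.demo.svc.cluster.local.iamworkin.lan. is queried before the absolute name and matches the wildcard. The short name api expands to api.demo.svc.cluster.local. on the first try, which is why short names avoid the problem.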

Local manifest lint

The repo now carries a local-first lint pass for the recurring K8s gotchas that have burned the fleet:

dotnet test tests/bluejay-infra-lint/BluejayInfraLint.Tests.csproj -c Release

That test project sweeps bluejay-infra/apps/** plus the canonical sibling FlowerCore.*/k8s manifests that share the same workspace. Matching conftest.dev policy files live under tests/bluejay-infra-lint/conftest.dev/ for environments that also have conftest or opa.
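As an illustration of the kind of invariant such a lint can enforce, here is a sketch of the public-host method allowlist check described under Known gotchas. It is written in Python for brevity and is not the code in the C# test project; the names methods_in_rule and rule_respects_allowlist are hypothetical, and negated matchers such as !Method(...) are not handled.

```python
import re

# Matches Method(GET) as well as Method(`GET`).
METHOD_CALL = re.compile(r"Method\(\s*[`'\"]?([A-Za-z]+)[`'\"]?\s*\)")

READ_ONLY = {"GET", "HEAD"}
READ_WRITE = {"GET", "HEAD", "POST", "OPTIONS"}


def methods_in_rule(rule):
    """Return the HTTP methods named in a Traefik match rule."""
    return {m.upper() for m in METHOD_CALL.findall(rule)}


def rule_respects_allowlist(rule, allowed):
    """True when the rule names at least one method and none outside `allowed`.

    A rule with no Method(...) matcher accepts every method, so it fails.
    """
    methods = methods_in_rule(rule)
    return bool(methods) and methods <= allowed
```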

References

  • Cert-manager recovery playbook: FlowerCore.Notes/memory/project_cert_manager_recovery_2026_04_22.md
  • Why pfSense DNS is required: FlowerCore.Notes/memory/feedback_pfsense_dns_required_for_acme.md
  • Public DNS operator host: https://dns.iamworkin.lan
  • Canonical credential helper: FlowerCore.Notes/scripts/credential-helper.sh
  • pfSense admin automation: FlowerCore.Notes/memory/feedback_pfsense_automation.md