Andrew Stoltz 6cbb5d8792 fix(agent-zero): NetworkPolicy egress rule for fc-llm-bridge (ADR-088)
The chat_model flip (62db15c) pointed Agent Zero at
fc-llm-bridge.fc-llm-bridge.svc.cluster.local:8080 but the existing
agent-zero-netpol only allowed egress to specific node IPs
(10.0.56.20:11434, 10.0.57.17:11434, 10.0.57.16:5200, 10.0.56.11:6443)
plus public-internet (with RFC1918 exclusion). ClusterIP traffic to
10.43.0.0/16 was implicitly denied, so pod-exec curl to the bridge
timed out after 134s.

Adds an egress rule allowing TCP 8080 to the fc-llm-bridge namespace
(matched by kubernetes.io/metadata.name which K8s 1.22+ sets
automatically). No ingress changes needed — fc-llm-bridge has no
NetworkPolicy, so the ingress side is already open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:59:17 -05:00

bluejay-infra

Infrastructure manifests for ArgoCD. An ApplicationSet in the argocd namespace watches the apps/* directories in this repo and creates one Application per subdirectory, named infra-<name>.

Adding a new service to the cluster

Follow these steps in order. Step 1 must run before step 3. If you skip it, cert-manager HTTP-01 validation fails silently for ~2h per certificate (exponential backoff) until someone diagnoses the DNS.

1. Add the pfSense Unbound DNS override (REQUIRED)

step-ca (the ACME CA on noc1) runs in a Podman container with host networking. Its container resolver uses pfSense Unbound (10.0.56.1), not cluster CoreDNS. So even though CoreDNS has a wildcard *.iamworkin.lan → 10.0.56.200 for in-cluster lookups, step-ca cannot see it. Every new public hostname needs an explicit pfSense host override.

From FlowerCore.Notes:

# 1. Edit the HOSTS list in scripts/pfsense-add-dns-overrides.py
#    Add: ("<yourservice>", "10.0.56.200", "cert-manager HTTP-01 target (Traefik VIP)")
# 2. Run:
source scripts/credential-helper.sh
export PFSENSE_PASS=$(get_cred "pfSense Admin")
python scripts/pfsense-add-dns-overrides.py

Verify all referenced iamworkin.lan hosts resolve (run from anywhere on LAN):

python scripts/check-pfsense-dns.py
# Parses every apps/*/*.yaml, extracts hostnames from Certificate dnsNames
# and Traefik IngressRoute Host(...) rules, and fails if any don't resolve.
# Safe to run as a pre-merge / pre-sync check.
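
The check described above can be pictured roughly as follows. This is a minimal sketch, not the actual scripts/check-pfsense-dns.py: the function names, the line-based dnsNames parsing, and the use of the local resolver are all assumptions made for illustration.

```python
import re
import socket

# Matches the first hostname in a Traefik rule such as Host(`app.iamworkin.lan`).
# Assumption: rules use backtick-quoted hostnames; HostSNI/HostRegexp are ignored.
HOST_RULE = re.compile(r"Host\(\s*`([^`]+)`")


def extract_hostnames(manifest_text):
    """Collect hostnames from dnsNames lists and Host(...) rules in one manifest."""
    hosts = set(HOST_RULE.findall(manifest_text))
    in_dns_names = False
    for line in manifest_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("dnsNames:"):
            in_dns_names = True
            continue
        if in_dns_names:
            if stripped.startswith("- "):
                # List item under dnsNames: drop the dash and any quotes.
                hosts.add(stripped[2:].strip().strip("\"'"))
            elif stripped:
                # Any other non-blank line ends the dnsNames list.
                in_dns_names = False
    return sorted(hosts)


def unresolved(hostnames):
    """Return the hostnames that the local resolver cannot resolve."""
    missing = []
    for name in hostnames:
        try:
            socket.getaddrinfo(name, None)
        except socket.gaierror:
            missing.append(name)
    return missing
```

Note that a check like this only reflects what step-ca sees if it runs on a machine that resolves through pfSense Unbound, which is why the real script is run from the LAN rather than from inside a pod.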

Symptom if you skip this: the Certificate resource stays Ready: False with status.reason: unexpected non-ACME API error: context deadline exceeded. To recover, add the DNS override first, then run kubectl -n <ns> delete order <order-name>. Deleting the Order forces a fresh ACME attempt instead of waiting out cert-manager's backoff.

2. Create the app manifest

Create apps/<name>/<name>.yaml containing the Namespace, Deployment, Service, Certificate, and IngressRoute. Reference an existing directory (e.g. apps/fc-messageboard/) for the canonical shape.

Conventions:

  • Namespace has label app.kubernetes.io/part-of: bluejay-infra
  • Deployment.spec.selector.matchLabels and Service.spec.selector MUST use the same label key and value. The historical convention here is app: <name> (not app.kubernetes.io/name). Don't mix the two.
  • Image: localhost/<name>:v<YYYYMMDD><HHMM>, imagePullPolicy: Never. Import the image to every RKE2 node (server + both agents) via ctr images import before applying, because pods can be scheduled on any node.
  • If the app persists local state (SQLite, uploads), declare the PersistentVolumeClaim here with storageClassName: longhorn and accessModes: [ReadWriteOnce]. Add strategy.type: Recreate to the Deployment. A ReadWriteOnce volume can be mounted from only one node at a time, so a rolling update can hang when the replacement pod lands on a different node while the old pod still holds the volume.
  • Probes: use tcpSocket if the app has middleware that intercepts unauthenticated requests (returns 404/401 for /health). Otherwise prefer httpGet against whatever health endpoint the app exposes, after verifying that the path isn't gated by auth.
  • Certificate: issuerRef.name: step-ca-acme, issuerRef.kind: ClusterIssuer. dnsNames must match the hostname you added to pfSense in step 1.
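
Two of the conventions above are easy to check mechanically. The helpers below are a hedged sketch: image_ref and selectors_agree are hypothetical names, and the manifests are assumed to be already loaded into plain Python dicts.

```python
from datetime import datetime


def image_ref(name, built_at):
    """Build an image reference in the localhost/<name>:v<YYYYMMDD><HHMM> form."""
    return f"localhost/{name}:v{built_at:%Y%m%d%H%M}"


def selectors_agree(deployment, service):
    """True if the Deployment and Service select pods with identical labels
    and the pod template actually carries those labels."""
    match_labels = deployment["spec"]["selector"]["matchLabels"]
    pod_labels = deployment["spec"]["template"]["metadata"]["labels"]
    svc_selector = service["spec"]["selector"]
    if match_labels != svc_selector:
        return False
    # Every selector pair must also be present on the pod template.
    return all(pod_labels.get(key) == value for key, value in match_labels.items())
```

For example, image_ref("demo", datetime(2026, 4, 23, 9, 59)) yields localhost/demo:v202604230959, and selectors_agree returns False when the Deployment uses app: demo while the Service selects on app.kubernetes.io/name: demo.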

3. Commit & push

git add apps/<name>/
git commit -m "<name>: initial deployment"
git push

ArgoCD's ApplicationSet picks up the new directory within ~3 minutes and creates infra-<name> with auto-sync + self-heal enabled.

4. Verify

# From noc1
fcadmin_ssh noc1 '
  kubectl -n argocd get application infra-<name>
  kubectl -n <ns> get certificate,pod
  curl -sk -m 8 -o /dev/null -w "HTTP %{http_code}\n" https://<name>.iamworkin.lan/
'

Certificate should be Ready: True within ~60s. If it stays False for more than 2 minutes, the pfSense DNS step was most likely skipped. Go back to step 1, then run kubectl -n <ns> delete order <order-name> to reset the backoff.
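
If you want to script the readiness check instead of eyeballing it, the Ready condition can be read from the Certificate's JSON. The sketch below assumes the object comes from kubectl -n <ns> get certificate <name> -o json and has already been parsed; ready_condition is a hypothetical helper name.

```python
def ready_condition(certificate):
    """Return (is_ready, message) for a cert-manager Certificate object (as a dict)."""
    conditions = certificate.get("status", {}).get("conditions", [])
    for condition in conditions:
        if condition.get("type") == "Ready":
            return condition.get("status") == "True", condition.get("message", "")
    # A brand-new Certificate may not have reported any conditions yet.
    return False, "no Ready condition reported yet"


# Usage (assumption: the kubectl output was saved to cert.json):
#   import json
#   with open("cert.json") as handle:
#       ok, message = ready_condition(json.load(handle))
```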

Pre-merge gate

Before git push, always run:

python scripts/check-pfsense-dns.py

It's a ~3-second check that would have caught the entire 2026-04-22 cert-manager outage. Consider wiring it into a pre-commit hook or a Gitea Actions workflow.
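
One possible shape for such a hook is sketched below. It assumes the check script exits non-zero on failure (as described in step 1) and that the hook runs from the repository root. run_gate and main are illustrative names, not existing tooling.

```python
import subprocess
import sys


def run_gate(command):
    """Run a check command and return its exit code (0 means the gate passed)."""
    return subprocess.run(command).returncode


def main(command=None):
    """Run the DNS check and report failure on stderr."""
    # Assumption: invoked from the repo root, so the relative path resolves.
    command = command or [sys.executable, "scripts/check-pfsense-dns.py"]
    code = run_gate(command)
    if code != 0:
        print("DNS check failed; add the pfSense override before pushing.",
              file=sys.stderr)
    return code


# In an actual .git/hooks/pre-push file, finish with:
#   sys.exit(main())
```

A non-zero exit from a pre-push hook makes git abort the push, so the failing hostname never reaches ArgoCD.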

Retiring a service

  1. kubectl -n argocd delete application infra-<name> (cascade deletes the K8s resources via ArgoCD finalizers)
  2. git rm -r apps/<name>/ and push
  3. Remove the pfSense Unbound override — edit scripts/pfsense-add-dns-overrides.py to remove from HOSTS, or delete manually via the pfSense UI (Services → DNS Resolver → Host Overrides)

Known gotchas

  • CoreDNS template + ndots:5 collision: inside pods, <svc>.<ns>.svc.cluster.local has fewer than 5 dots, so the resolver expands it through the search list (which includes iamworkin.lan) before trying it as an absolute name. The iamworkin.lan expansion hits the wildcard template and resolves to the Traefik VIP, not the real ClusterIP. Use short service names (<svc>) in K8s manifests. See memory feedback_coredns_ndots_template_collision.md.
  • Image not on node: pods stuck in ErrImageNeverPull mean the image wasn't imported to the node Kubernetes scheduled the pod onto. Run ctr images import on all of rke2-server, rke2-agent1, rke2-agent2.
  • StatefulSet PVC drift: volumeClaimTemplates needs explicit volumeMode: Filesystem or ArgoCD SSA self-heals forever. See memory feedback_argocd_statefulset_pvc_drift.md.
  • ArgoCD must use internal Gitea URL: http://gitea-clusterip.gitea.svc.cluster.local:3000/bluejay/bluejay-infra.git, not the external HTTPS URL (step-ca cert isn't trusted by ArgoCD). The ApplicationSet and any hand-created Application must both use the internal URL.
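
To make the ndots:5 gotcha above concrete, the sketch below models the order in which a glibc-style stub resolver tries names. The search list is an assumption (a pod in a hypothetical namespace myns, with iamworkin.lan inherited from the node); check /etc/resolv.conf inside a real pod for the actual values.

```python
def candidate_queries(name, search, ndots=5):
    """Return the fully qualified names a glibc-style resolver tries, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute; the search list is skipped.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        return [absolute] + expanded
    # Fewer dots than ndots: search-list expansions come first.
    return expanded + [absolute]


# Assumed search list for a pod in the hypothetical namespace "myns".
SEARCH = ["myns.svc.cluster.local", "svc.cluster.local", "cluster.local", "iamworkin.lan"]

for query in candidate_queries("fc-llm-bridge.fc-llm-bridge.svc.cluster.local", SEARCH):
    print(query)
```

With this search list, the fourth query is fc-llm-bridge.fc-llm-bridge.svc.cluster.local.iamworkin.lan., which the wildcard template answers before the absolute name is ever tried. A short name such as fc-llm-bridge is expanded to fc-llm-bridge.myns.svc.cluster.local. first, so it never reaches the iamworkin.lan suffix when the service exists in the pod's own namespace.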

References

  • Cert-manager recovery playbook: FlowerCore.Notes/memory/project_cert_manager_recovery_2026_04_22.md
  • Why pfSense DNS is required: FlowerCore.Notes/memory/feedback_pfsense_dns_required_for_acme.md
  • Canonical credential helper: FlowerCore.Notes/scripts/credential-helper.sh
  • pfSense admin automation: FlowerCore.Notes/memory/feedback_pfsense_automation.md