K8s manifest hardening + new bluejay-infra-lint test project
Manifest hardening (per documented memories):
- apps/asterisk/deployment.yaml: dnsPolicy: None + explicit dnsConfig with ndots:2 to prevent the CoreDNS *.iamworkin.lan template from hijacking external egress (downloads.asterisk.org).
- apps/fc-llm-bridge/fc-llm-bridge.yaml: same dnsConfig pattern for api.anthropic.com egress.
- apps/fc-ttsreader/fc-ttsreader.yaml: same dnsConfig pattern for huggingface.co model seeding.
- apps/fc-messageboard/fc-messageboard.yaml: tcpSocket probes (replacing httpGet /health) per "Probes against /health 404 when app has global auth middleware".
- apps/fc-signalcontrol/fc-signalcontrol.yaml: same tcpSocket probe fix.

New lint project:
- tests/bluejay-infra-lint/BluejayInfraLint.Tests.csproj: local-first lint test sweep for the recurring K8s gotchas in the fleet.
- tests/bluejay-infra-lint/FleetManifestLintTests.cs: 7 lint tests covering tcpSocket probes, dnsConfig presence on egress-heavy pods, IngressRoute/Service namespace alignment, image pull policy, etc.
- tests/bluejay-infra-lint/conftest.dev/: matching conftest policies for environments with conftest/opa.
- .gitignore: adds bin/ + obj/ + DS_Store/swp.

README.md adds a "Local manifest lint" section with the canonical test command, plus 4 new gotcha entries (IngressRoute namespace split, public read-only host method allowlists, Traefik VIP netpol backend ports, auth-safe probes).

Tests: 7 / 7 lint tests passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
14
README.md
@@ -99,8 +99,22 @@ curl -sk -X DELETE https://dns.iamworkin.lan/api/v1/servers/<serverId>/zones/iam
- **CoreDNS template + ndots:5 collision**: inside pods, `<svc>.<ns>.svc.cluster.local` with <5 dots gets search-expanded through `iamworkin.lan` FIRST and hits the wildcard template → resolves to Traefik VIP, not the real ClusterIP. Use short service names (`<svc>`) in K8s manifests. See memory `feedback_coredns_ndots_template_collision.md`.
- **Image not on node**: pods stuck in `ErrImageNeverPull` mean the image wasn't imported to the node the pod was scheduled onto. Run `ctr images import` on all of rke2-server, rke2-agent1, and rke2-agent2.
- **StatefulSet PVC drift**: `volumeClaimTemplates` needs explicit `volumeMode: Filesystem` or ArgoCD SSA self-heals forever. See memory `feedback_argocd_statefulset_pvc_drift.md`.
- **IngressRoute namespace split**: this RKE2 Traefik install does not allow cross-namespace service refs. Keep the `IngressRoute`, backend `Service`, and TLS secret in the same namespace; if one host is shared across namespaces, duplicate the `Certificate` and move the route next to the destination service.
- **Public read-only hosts**: if a public host fronts a service that also exposes admin writes internally, add a Traefik route match like `Host(...) && (Method(GET) || Method(HEAD))` on the public edge instead of trusting the app to reject unsafe methods.
- **Traefik VIP netpols**: when a `NetworkPolicy` allows `10.0.56.200`, also allow the post-DNAT backend ports (`8443` for TLS plus `8080` or `8000` for HTTP) or Calico will drop the rewritten flow.
- **Auth-safe probes**: services behind API-key or global auth middleware should prefer `tcpSocket` probes unless `/health` is explicitly exempted before the middleware runs.
- **ArgoCD must use internal Gitea URL**: `http://gitea-clusterip.gitea.svc.cluster.local:3000/bluejay/bluejay-infra.git`, not the external HTTPS URL (step-ca cert isn't trusted by ArgoCD). The `ApplicationSet` and any hand-created `Application` must both use the internal URL.
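
The probe, DNS, IngressRoute, and NetworkPolicy gotchas above can be sketched as manifest fragments. This is an illustrative sketch, not a copy of any manifest in the repo: the `example-api` name, the `example` namespace, `public.example.com`, the image, the `websecure` entry point, the `10.43.0.10` nameserver, the search list, and the application port are all assumptions to replace with real values. Only the `10.0.56.200` VIP, the `8443`/`8080`/`8000` backend ports, and `ndots:2` come from this repo's notes.

```yaml
# Hypothetical workload: names, image, ports, and DNS values are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
  namespace: example
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      # dnsPolicy None ignores the cluster defaults, so dnsConfig has to
      # supply at least one nameserver itself.
      dnsPolicy: None
      dnsConfig:
        nameservers:
          - 10.43.0.10                  # assumed cluster DNS ClusterIP
        searches:                       # assumed search list for short service names
          - example.svc.cluster.local
          - svc.cluster.local
          - cluster.local
        options:
          - name: ndots
            value: "2"
      containers:
        - name: example-api
          image: registry.example/example-api:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
          # tcpSocket only checks that the port accepts connections, so
          # auth middleware in front of /health cannot fail the probe.
          readinessProbe:
            tcpSocket:
              port: 8080
            periodSeconds: 10
          livenessProbe:
            tcpSocket:
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 20
---
# IngressRoute kept in the same namespace as its Service and TLS secret,
# with the public host limited to read-only methods at the edge.
apiVersion: traefik.io/v1alpha1         # older installs use traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: example-api-public
  namespace: example
spec:
  entryPoints:
    - websecure                         # assumed entry point name
  routes:
    - kind: Rule
      match: Host(`public.example.com`) && (Method(`GET`) || Method(`HEAD`))
      services:
        - name: example-api
          port: 8080
  tls:
    secretName: example-api-tls         # placeholder secret in the same namespace
---
# Egress to the Traefik VIP, including the post-DNAT backend ports.
# A real policy would also need a rule allowing DNS egress.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-api-egress-traefik
  namespace: example
spec:
  podSelector:
    matchLabels:
      app: example-api
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.56.200/32
      ports:
        - protocol: TCP
          port: 443                     # assumed port the client dials on the VIP
        - protocol: TCP
          port: 8443                    # post-DNAT TLS backend port
        - protocol: TCP
          port: 8080                    # post-DNAT HTTP backend port (or 8000)
```

If `/health` is explicitly exempted before the auth middleware runs, an `httpGet` probe is still a reasonable choice; the `tcpSocket` form is the safer default when that exemption is not guaranteed.
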
## Local manifest lint

The repo now carries a local-first lint pass for the recurring K8s gotchas that have burned the fleet:

```bash
dotnet test tests/bluejay-infra-lint/BluejayInfraLint.Tests.csproj -c Release
```

That test project sweeps `bluejay-infra/apps/**` plus the canonical sibling `FlowerCore.*\\k8s` manifests that share the same workspace. Matching `conftest.dev` policy files live under `tests/bluejay-infra-lint/conftest.dev/` for environments that also have `conftest` or `opa`.
## References

- Cert-manager recovery playbook: `FlowerCore.Notes/memory/project_cert_manager_recovery_2026_04_22.md`