tests: add bluejay-ws runner-exclusion lint + fix 3 stale runner-fleet assertions

Adds Runners_MustNotPinToOperatorWorkstationHosts lint test enforcing operator
directive 2026-05-26: BLUEJAY-WS / iamworkin-ws must never be a fleet GitHub
Actions runner. Build-side analog of the Sprint 9 NEW safe-account exclusion
gate (Puppet GPO/AppLocker/WDAC/audit-forwarder modules refuse to apply on
BLUEJAY-WS). Scans every github-runner Deployment for forbidden nodeName,
nodeSelector, nodeAffinity match expressions, and toleration key/value pinning.

See CLAUDE.md "Common Mistakes" entry and
feedback_bluejay_ws_never_public_runner.md.

Also fixes 3 pre-existing GitHubRunnerFleet_* lint failures that broke when the
runner image bumped to v20260525-ruby3.3.11-stepca (added a setup-runner-home
initContainer):

* Add MainContainerMappings() helper (containers only, excludes initContainers)
  and switch GitHubRunnerFleet_MustRegisterRequiredReposAsRepoScopedDeployments
  + GitHubRunnerFleet_MustSetWritableNonRootDotnetAndCachePaths over to it.
  Without this, ContainerMappings().Should().ContainSingle() found the
  initContainer + runner = 2 containers and failed.
* Loosen GitHubRunnerFleet_MustAvoidRwoMultiAttachForScaledDeployments
  ReplicaCount assertion from Be(2) to BeGreaterOrEqualTo(2). The semantic
  invariant is "at least 2 replicas so no single-pod bottleneck"; deployments
  tuned upward per 14d CI activity (e.g. github-runner-print-web at
  replicas: 3, see commit 1f1f682 PR #24) are valid.

Lint baseline: 6 failed -> 3 failed (the 3 remaining are unrelated:
PublicReadWriteIngressRoutes_* lives in
FlowerCore.Updater/k8s/ingressroute.yaml, which is a separate PR;
FcDeviceManagement_* needs operator domain decision on the missing
apps/fc-devicemgmt/argocd-application.yaml).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
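
The runner-exclusion check described in the commit message above can be roughly approximated from a shell for a quick local sanity pass. This is a minimal sketch, not the repository's lint: the function name `runner_manifest_is_clean`, the plain text matching, and the throwaway manifests are all assumptions made for illustration. Only the two hostnames come from the commit message. The real test inspects nodeName, nodeSelector, nodeAffinity, and tolerations structurally, whereas this version just looks for the hostnames outside full-line YAML comments.

```shell
#!/usr/bin/env bash
# Sketch only: a grep-based approximation of the runner-exclusion check.
# The hostnames come from the commit message; the function name and the
# text-matching approach are assumptions, not the repository's lint.

FORBIDDEN_HOSTS='bluejay-ws|iamworkin-ws'

# Prints offending lines to stderr and returns 1 if the manifest mentions a
# forbidden host outside full-line YAML comments; returns 0 otherwise.
runner_manifest_is_clean() {
  local manifest="$1" hits
  hits="$(grep -vE '^[[:space:]]*#' "$manifest" | grep -iE "$FORBIDDEN_HOSTS" || true)"
  if [ -n "$hits" ]; then
    printf 'forbidden workstation pin in %s:\n%s\n' "$manifest" "$hits" >&2
    return 1
  fi
  return 0
}

# Demo against two throwaway manifests (hypothetical content).
workdir="$(mktemp -d)"
cat > "$workdir/good.yaml" <<'EOF'
# BLUEJAY-WS must never appear below.
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/os: linux
EOF
cat > "$workdir/bad.yaml" <<'EOF'
spec:
  template:
    spec:
      nodeName: BLUEJAY-WS
EOF

runner_manifest_is_clean "$workdir/good.yaml" && echo "good.yaml: clean"
runner_manifest_is_clean "$workdir/bad.yaml" 2>/dev/null || echo "bad.yaml: rejected"
```

Matching is case-insensitive because the commit message writes the hosts both as `BLUEJAY-WS` and `iamworkin-ws`. A text match like this can produce false positives (for example, a hostname inside an unrelated annotation), which is one reason a structural check is the better long-term gate.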
runners: bump tts-reader memory limit 4Gi -> 8Gi
2026-05-25 22:41:00 -05:00 · 2026-05-25 22:31:48 -05:00 · 2026-05-26 03:24:13 +00:00 · 2026-05-25 20:33:44 -05:00 · 2026-05-26 01:12:15 +00:00 · 2026-05-25 20:11:41 -05:00
12 changed files with 1749 additions and 221 deletions
--- a/apps/authentik/authentik.yaml
+++ b/apps/authentik/authentik.yaml
@@ -0,0 +1,448 @@
 # Authentik OIDC backend
 # ArgoCD-managed. BlueJay Lab.
 #
 # Stack:
 #   - PostgreSQL 16 StatefulSet (single replica, Longhorn RWO 5Gi)
 #   - Redis 7 Deployment (no persistence — session/cache only)
 #   - Authentik server + worker Deployments (image ghcr.io/goauthentik/server:2024.12.3)
 #   - Media PVC shared between server + worker (Longhorn RWO 2Gi)
 #   - Certificate via step-ca-acme ClusterIssuer
 #   - Traefik IngressRoute at id.iamworkin.lan
 #
 # Secrets come from 1Password item "authentik-credentials" (IAmWorkin vault, id y6i74ch22q5wvm7znquq4nhhcu)
 # via the OnePasswordItem CRD, materialized into k8s Secret authentik/authentik-credentials.
 #
 # Why the discovery URL is /application/o/pimanager/ : Authentik issues per-application OIDC providers.
 # The pimanager OIDC application/provider is created after the cluster pods are healthy (manual or
 # via API once the bootstrap token is available — see Notes substrate).
 ---
 apiVersion: v1
 kind: Namespace
 metadata:
  name: authentik
  labels:
    app.kubernetes.io/part-of: bluejay-infra
 ---
 # 1Password operator pulls the authentik-credentials item into a k8s Secret of the same name.
 # Field labels in 1P become Secret keys: AUTHENTIK_SECRET_KEY, POSTGRES_PASSWORD, REDIS_PASSWORD,
 # BOOTSTRAP_ADMIN_PASSWORD, BOOTSTRAP_ADMIN_TOKEN, BOOTSTRAP_ADMIN_EMAIL.
 apiVersion: onepassword.com/v1
 kind: OnePasswordItem
 metadata:
  name: authentik-credentials
  namespace: authentik
 spec:
  itemPath: "vaults/IAmWorkin/items/authentik-credentials"
 ---
 # Shared media volume for server + worker pods.
 apiVersion: v1
 kind: PersistentVolumeClaim
 metadata:
  name: authentik-media
  namespace: authentik
 spec:
  storageClassName: longhorn
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 2Gi
 ---
 # PostgreSQL 16 StatefulSet — Authentik's primary store.
 apiVersion: apps/v1
 kind: StatefulSet
 metadata:
  name: authentik-postgres
  namespace: authentik
  labels:
    app: authentik-postgres
    argocd.argoproj.io/instance: infra-authentik
 spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  podManagementPolicy: OrderedReady
  serviceName: authentik-postgres
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: authentik-postgres
  template:
    metadata:
      labels:
        app: authentik-postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16-alpine
          ports:
            - containerPort: 5432
              name: postgres
          env:
            - name: POSTGRES_USER
              value: authentik
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: authentik-credentials
                  key: POSTGRES_PASSWORD
            - name: POSTGRES_DB
              value: authentik
            - name: POSTGRES_INITDB_ARGS
              value: "--encoding=UTF-8 --lc-collate=C --lc-ctype=C"
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "authentik"]
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            exec:
              command: ["pg_isready", "-U", "authentik"]
            initialDelaySeconds: 30
            periodSeconds: 30
          resources:
            requests: { cpu: 100m, memory: 256Mi }
            limits: { cpu: 1000m, memory: 1Gi }
          volumeMounts:
            - name: pgdata
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: pgdata
      spec:
        storageClassName: longhorn
        accessModes: [ReadWriteOnce]
        volumeMode: Filesystem
        resources:
          requests:
            storage: 5Gi
 ---
 apiVersion: v1
 kind: Service
 metadata:
  name: authentik-postgres
  namespace: authentik
 spec:
  clusterIP: None
  selector:
    app: authentik-postgres
  ports:
    - name: postgres
      port: 5432
      targetPort: 5432
 ---
 # Redis 7 — session storage + Celery broker. No persistence needed (cache).
 apiVersion: apps/v1
 kind: Deployment
 metadata:
  name: authentik-redis
  namespace: authentik
  labels:
    app: authentik-redis
    argocd.argoproj.io/instance: infra-authentik
 spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: authentik-redis
  template:
    metadata:
      labels:
        app: authentik-redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          args:
            - "--save"
            - ""
            - "--appendonly"
            - "no"
            - "--requirepass"
            - "$(REDIS_PASSWORD)"
          env:
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: authentik-credentials
                  key: REDIS_PASSWORD
          ports:
            - containerPort: 6379
              name: redis
          readinessProbe:
            tcpSocket: { port: 6379 }
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            tcpSocket: { port: 6379 }
            initialDelaySeconds: 30
            periodSeconds: 30
          resources:
            requests: { cpu: 50m, memory: 64Mi }
            limits: { cpu: 500m, memory: 256Mi }
 ---
 apiVersion: v1
 kind: Service
 metadata:
  name: authentik-redis
  namespace: authentik
 spec:
  selector:
    app: authentik-redis
  ports:
    - name: redis
      port: 6379
      targetPort: 6379
 ---
 # Authentik server Deployment — HTTP frontend on :9000.
 apiVersion: apps/v1
 kind: Deployment
 metadata:
  name: authentik-server
  namespace: authentik
  labels:
    app: authentik-server
    argocd.argoproj.io/instance: infra-authentik
 spec:
  replicas: 1
  strategy:
    type: Recreate  # shares /media RWO PVC with worker
  selector:
    matchLabels:
      app: authentik-server
  template:
    metadata:
      labels:
        app: authentik-server
    spec:
      securityContext:
        # Authentik image runs as uid 1000 "authentik" but the Longhorn PVC mounts
        # root:root by default. fsGroup recursively chgrp + chmod g+rwx so the
        # non-root container can mkdir /media/public during the tenant_files migration.
        fsGroup: 1000
      containers:
        - name: server
          image: ghcr.io/goauthentik/server:2024.12.3
          args: ["server"]
          ports:
            - containerPort: 9000
              name: http
            - containerPort: 9443
              name: https
          env:
            - name: AUTHENTIK_SECRET_KEY
              valueFrom:
                secretKeyRef:
                  name: authentik-credentials
                  key: AUTHENTIK_SECRET_KEY
            - name: AUTHENTIK_REDIS__HOST
              value: authentik-redis
            - name: AUTHENTIK_REDIS__PASSWORD
              valueFrom:
                secretKeyRef:
                  name: authentik-credentials
                  key: REDIS_PASSWORD
            - name: AUTHENTIK_POSTGRESQL__HOST
              value: authentik-postgres
            - name: AUTHENTIK_POSTGRESQL__NAME
              value: authentik
            - name: AUTHENTIK_POSTGRESQL__USER
              value: authentik
            - name: AUTHENTIK_POSTGRESQL__PASSWORD
              valueFrom:
                secretKeyRef:
                  name: authentik-credentials
                  key: POSTGRES_PASSWORD
            - name: AUTHENTIK_BOOTSTRAP_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: authentik-credentials
                  key: BOOTSTRAP_ADMIN_PASSWORD
            - name: AUTHENTIK_BOOTSTRAP_TOKEN
              valueFrom:
                secretKeyRef:
                  name: authentik-credentials
                  key: BOOTSTRAP_ADMIN_TOKEN
            - name: AUTHENTIK_BOOTSTRAP_EMAIL
              valueFrom:
                secretKeyRef:
                  name: authentik-credentials
                  key: BOOTSTRAP_ADMIN_EMAIL
            - name: AUTHENTIK_DISABLE_UPDATE_CHECK
              value: "true"
            - name: AUTHENTIK_ERROR_REPORTING__ENABLED
              value: "false"
            - name: AUTHENTIK_LOG_LEVEL
              value: info
          # First-boot Authentik can take 3+ min on the migration phase
          # (waiting on DB lock while worker also runs migrations). Initial
          # delays are generous so kubelet doesn't kill the pod mid-migration;
          # periodSeconds keeps post-startup probing responsive.
          readinessProbe:
            httpGet:
              path: /-/health/ready/
              port: 9000
            initialDelaySeconds: 60
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 12
          livenessProbe:
            httpGet:
              path: /-/health/live/
              port: 9000
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          startupProbe:
            httpGet:
              path: /-/health/live/
              port: 9000
            initialDelaySeconds: 30
            periodSeconds: 15
            timeoutSeconds: 10
            failureThreshold: 40  # 30s + 40*15s = 10.5 min budget
          resources:
            requests: { cpu: 150m, memory: 512Mi }
            limits: { cpu: 1500m, memory: 1Gi }
          volumeMounts:
            - name: media
              mountPath: /media
      volumes:
        - name: media
          persistentVolumeClaim:
            claimName: authentik-media
 ---
 # Authentik worker Deployment — runs Celery background tasks.
 apiVersion: apps/v1
 kind: Deployment
 metadata:
  name: authentik-worker
  namespace: authentik
  labels:
    app: authentik-worker
    argocd.argoproj.io/instance: infra-authentik
 spec:
  replicas: 1
  strategy:
    type: Recreate  # shares /media RWO PVC with server
  selector:
    matchLabels:
      app: authentik-worker
  template:
    metadata:
      labels:
        app: authentik-worker
    spec:
      securityContext:
        # Same as server pod — non-root uid 1000 needs PVC group write.
        fsGroup: 1000
      containers:
        - name: worker
          image: ghcr.io/goauthentik/server:2024.12.3
          args: ["worker"]
          env:
            - name: AUTHENTIK_SECRET_KEY
              valueFrom:
                secretKeyRef:
                  name: authentik-credentials
                  key: AUTHENTIK_SECRET_KEY
            - name: AUTHENTIK_REDIS__HOST
              value: authentik-redis
            - name: AUTHENTIK_REDIS__PASSWORD
              valueFrom:
                secretKeyRef:
                  name: authentik-credentials
                  key: REDIS_PASSWORD
            - name: AUTHENTIK_POSTGRESQL__HOST
              value: authentik-postgres
            - name: AUTHENTIK_POSTGRESQL__NAME
              value: authentik
            - name: AUTHENTIK_POSTGRESQL__USER
              value: authentik
            - name: AUTHENTIK_POSTGRESQL__PASSWORD
              valueFrom:
                secretKeyRef:
                  name: authentik-credentials
                  key: POSTGRES_PASSWORD
            - name: AUTHENTIK_DISABLE_UPDATE_CHECK
              value: "true"
            - name: AUTHENTIK_ERROR_REPORTING__ENABLED
              value: "false"
            - name: AUTHENTIK_LOG_LEVEL
              value: info
          resources:
            requests: { cpu: 100m, memory: 256Mi }
            limits: { cpu: 1000m, memory: 768Mi }
          volumeMounts:
            - name: media
              mountPath: /media
      volumes:
        - name: media
          persistentVolumeClaim:
            claimName: authentik-media
 ---
 apiVersion: v1
 kind: Service
 metadata:
  name: authentik-server
  namespace: authentik
 spec:
  selector:
    app: authentik-server
  ports:
    - name: http
      port: 9000
      targetPort: 9000
    - name: https
      port: 9443
      targetPort: 9443
 ---
 # step-ca leaf certificate for id.iamworkin.lan.
 # step-ca container resolver uses pfSense Unbound, so the public A record for id.iamworkin.lan
 # MUST exist before this Certificate is applied (cert-manager HTTP-01 will silently 2h-backoff
 # otherwise). Added 2026-05-25 via scripts/pfsense-add-id-host.py.
 apiVersion: cert-manager.io/v1
 kind: Certificate
 metadata:
  name: authentik-tls
  namespace: authentik
 spec:
  secretName: authentik-tls
  dnsNames:
    - id.iamworkin.lan
  issuerRef:
    name: step-ca-acme
    kind: ClusterIssuer
 ---
 apiVersion: traefik.io/v1alpha1
 kind: IngressRoute
 metadata:
  name: authentik
  namespace: authentik
 spec:
  entryPoints: [websecure]
  routes:
    - match: Host(`id.iamworkin.lan`)
      kind: Rule
      services:
        - name: authentik-server
          port: 9000
  tls:
    secretName: authentik-tls
--- a/apps/fc-devicemgmt/argocd-application.yaml
+++ b/apps/fc-devicemgmt/argocd-application.yaml
@@ -1,33 +0,0 @@
 # Explicit ArgoCD Application shape for bootstrap/review.
 #
 # The live bluejay-infra ApplicationSet already discovers apps/* directories
 # and creates this same Application name (`infra-fc-devicemgmt`) automatically.
 # Keep repoURL on the internal Gitea ClusterIP URL; ArgoCD does not trust the
 # external step-ca HTTPS endpoint.
 apiVersion: argoproj.io/v1alpha1
 kind: Application
 metadata:
  name: infra-fc-devicemgmt
  namespace: argocd
  labels:
    app.kubernetes.io/name: fc-devicemgmt
    app.kubernetes.io/part-of: flowercore
    app.kubernetes.io/managed-by: argocd
    flowercore.io/tenant-id: system
    flowercore.io/created-by: bluejay-infra
 spec:
  project: default
  source:
    repoURL: http://gitea-clusterip.gitea.svc.cluster.local:3000/bluejay/bluejay-infra.git
    targetRevision: main
    path: apps/fc-devicemgmt
  destination:
    server: https://kubernetes.default.svc
    namespace: fc-devicemgmt
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
--- a/apps/github-runner/.gitattributes
+++ b/apps/github-runner/.gitattributes
@@ -0,0 +1,2 @@
 *.sh text eol=lf
 Dockerfile text eol=lf
--- a/apps/github-runner/Dockerfile
+++ b/apps/github-runner/Dockerfile
@@ -0,0 +1,54 @@
 FROM myoung34/github-runner:latest
 ARG RUBY_VERSION=3.3.11
 ARG RUBY_MINOR=3.3
 ARG RUBY_BUILD_VERSION=v20260326
 ARG RUNNER_UID=1001
 ARG RUNNER_GID=1001
 ENV RUNNER_TOOL_CACHE=/home/runner/_tool
 ENV RUNNER_RUBY_TOOLCACHE=/opt/runner-toolcache
 ENV PATH="/home/runner/_tool/Ruby/${RUBY_MINOR}/x64/bin:/opt/runner-toolcache/Ruby/${RUBY_MINOR}/x64/bin:${PATH}"
 USER root
 # Bake the IAmWorkin step-ca root CA into the system trust store. Without
 # this, .NET HttpClient calls from CI tests against *.iamworkin.lan
 # (e.g. https://selenium.iamworkin.lan/session) fail with `PartialChain`
 # because the runner image's default Ubuntu trust bundle doesn't include
 # our internal Root CA. update-ca-certificates regenerates
 # /etc/ssl/certs/ca-certificates.crt, which OpenSSL + .NET on Linux read
 # automatically — no SSL_CERT_FILE env var needed.
 COPY step-ca-root.crt /usr/local/share/ca-certificates/iamworkin-step-ca-root.crt
 RUN apt-get update \
    && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        autoconf \
        bison \
        build-essential \
        ca-certificates \
        curl \
        libdb-dev \
        libffi-dev \
        libgdbm-dev \
        libgmp-dev \
        libncurses-dev \
        libreadline-dev \
        libssl-dev \
        libyaml-dev \
        patch \
        pkg-config \
        uuid-dev \
        zlib1g-dev \
    && update-ca-certificates \
    && curl -fsSL "https://github.com/rbenv/ruby-build/archive/refs/tags/${RUBY_BUILD_VERSION}.tar.gz" -o /tmp/ruby-build.tar.gz \
    && mkdir -p /tmp/ruby-build \
    && tar -xzf /tmp/ruby-build.tar.gz --strip-components=1 -C /tmp/ruby-build \
    && /tmp/ruby-build/install.sh \
    && rm -rf /tmp/ruby-build /tmp/ruby-build.tar.gz /var/lib/apt/lists/*
 COPY install-ruby-toolcache.sh /usr/local/bin/install-ruby-toolcache.sh
 RUN chmod +x /usr/local/bin/install-ruby-toolcache.sh \
    && RUBY_VERSION="${RUBY_VERSION}" RUBY_MINOR="${RUBY_MINOR}" TOOLCACHE_ROOT="${RUNNER_RUBY_TOOLCACHE}" RUNNER_UID="${RUNNER_UID}" RUNNER_GID="${RUNNER_GID}" /usr/local/bin/install-ruby-toolcache.sh \
    && ruby -v
--- a/apps/github-runner/README.md
+++ b/apps/github-runner/README.md
@@ -7,12 +7,17 @@ Deployments with `kubectl`; update this manifest and let ArgoCD reconcile.
 All repo-scoped Linux runners use:
 - `localhost/fc-github-runner:v20260525-ruby3.3.11-stepca`, derived from
  `myoung34/github-runner:latest`
 - `ACCESS_TOKEN` from the `github-runner-token` Secret
 - `RUN_AS_ROOT=false`
 - `EPHEMERAL=true`
 - `LABELS=self-hosted,linux,fc-build-linux`
 - writable non-root paths under `/home/runner` for .NET, NuGet, XDG cache, and
  Actions tool cache
 - Ruby 3.3.11 seeded into `/home/runner/_tool/Ruby/3.3/x64` from the baked
  `/opt/runner-toolcache` copy so `ruby/setup-ruby@v1` can discover it on
  self-hosted `ubuntu-20.04-x64` runners
 `github-runner` for `FlowerCore.Common` is single-replica because it retains the
 original Longhorn ReadWriteOnce NuGet PVC. Every other repo-scoped runner uses
@@ -28,6 +33,46 @@ Sprint 32 final long-tail wave adds 16 two-replica Deployments:
 `FlowerCore.Provisioning`, `FlowerCore.Redis`, `FlowerCore.MessageBoard`, and
 `FlowerCore.MenuBoard`.
 ## Image Build
 Ruby is baked with a pinned `ruby-build` release and Ruby patch version. The pod
 still mounts an `emptyDir` over `/home/runner`, so the `setup-runner-home` init
 container copies the baked toolcache from `/opt/runner-toolcache/Ruby` into
 `/home/runner/_tool/Ruby` before the runner container starts.
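
The seeding step described above can be rehearsed outside the cluster. The following is a minimal sketch that rebuilds the baked layout (versioned directory, `x64.complete` marker, and `3.3 -> 3.3.11` symlink, as created by `install-ruby-toolcache.sh`) in throwaway directories and copies it into a stand-in runner home. The use of `cp -a` and the temporary paths are assumptions; the actual `setup-runner-home` command lives in `github-runner.yaml` and may differ.

```shell
#!/usr/bin/env bash
# Sketch only: simulates the baked toolcache layout and the copy into the
# runner home using throwaway directories. The real init container command
# lives in github-runner.yaml and may differ.

root="$(mktemp -d)"
baked="$root/opt/runner-toolcache"   # stands in for /opt/runner-toolcache
home="$root/home/runner"             # stands in for the emptyDir at /home/runner

# Layout produced at image build: versioned directory, completion marker,
# and a minor-version symlink (3.3 -> 3.3.11).
mkdir -p "$baked/Ruby/3.3.11/x64/bin"
touch "$baked/Ruby/3.3.11/x64.complete"
ln -sfn 3.3.11 "$baked/Ruby/3.3"

# Seed the empty runner home; -a preserves the symlink and file modes.
mkdir -p "$home/_tool"
cp -a "$baked/Ruby" "$home/_tool/"

# The marker is now reachable through the minor-version path.
test -f "$home/_tool/Ruby/3.3/x64.complete" && echo "toolcache seeded"
```

Preserving the relative symlink matters here: the README checks `Ruby/3.3/x64.complete`, while the files themselves sit under `Ruby/3.3.11`.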
 The IAmWorkin step-ca root CA is also baked into the system trust store
 (`/usr/local/share/ca-certificates/iamworkin-step-ca-root.crt`, registered by
 `update-ca-certificates`). Without it, .NET HttpClient calls from CI tests
 against `*.iamworkin.lan` (e.g. `https://selenium.iamworkin.lan/session`)
 fail with `PartialChain`. To refresh the bundled cert when the root rotates,
 re-extract from the cluster and overwrite `step-ca-root.crt`:
 ```bash
 kubectl get secret -n cert-manager step-ca-root \
  -o jsonpath='{.data.ca\.crt}' | base64 -d > step-ca-root.crt
 ```
 ```bash
 cd apps/github-runner
 podman build -t localhost/fc-github-runner:v20260525-ruby3.3.11-stepca .
 podman run --rm localhost/fc-github-runner:v20260525-ruby3.3.11-stepca ruby -v
 podman run --rm localhost/fc-github-runner:v20260525-ruby3.3.11-stepca \
  test -f /opt/runner-toolcache/Ruby/3.3/x64.complete
 podman save localhost/fc-github-runner:v20260525-ruby3.3.11-stepca \
  -o fc-github-runner-v20260525-ruby3.3.11-stepca.tar
 ```
 Import the saved image on every schedulable RKE2 node before ArgoCD rolls the
 Deployments:
 ```bash
 for node in rke2-server rke2-agent1 rke2-agent2; do
  scp fc-github-runner-v20260525-ruby3.3.11-stepca.tar "$node:/tmp/"
  ssh "$node" 'sudo ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images rm localhost/fc-github-runner:v20260525-ruby3.3.11-stepca || true'
  ssh "$node" 'sudo ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images import /tmp/fc-github-runner-v20260525-ruby3.3.11-stepca.tar'
 done
 ```
 ## Post-Merge Proof
 After the PR is merged and ArgoCD syncs, verify the runner fleet:
@@ -36,6 +81,14 @@ After the PR is merged and ArgoCD syncs, verify the runner fleet:
 kubectl -n github-runner get deploy,pods,pvc
 ```
 Verify the Ruby toolcache in a fresh pod:
 ```bash
 kubectl -n github-runner exec deploy/github-runner-puppet -c runner -- ruby -v
 kubectl -n github-runner exec deploy/github-runner-puppet -c runner -- sh -c \
  'echo "$RUNNER_TOOL_CACHE" && test -f "$RUNNER_TOOL_CACHE/Ruby/3.3/x64.complete"'
 ```
 Verify GitHub registration for the repo-scoped runners:
 ```bash
@@ -69,6 +122,10 @@ from GitHub Actions and verify it lands on an `rke2-linux-*` runner.
 - `actions/setup-dotnet` permission error at `/usr/share/dotnet`: check that
  `DOTNET_INSTALL_DIR=/home/runner/.dotnet` and related cache env vars are
  present on the runner pod.
 - `ruby/setup-ruby@v1` says self-hosted runners must install Ruby in
  `$RUNNER_TOOL_CACHE`: check that the init container copied
  `/opt/runner-toolcache/Ruby` into `/home/runner/_tool/Ruby` and that
  `/home/runner/_tool/Ruby/3.3/x64.complete` exists.
 - `404` during runner registration: the fine-grained PAT is valid but missing
  repository access for that repo. Add the repo to the PAT access list; the PAT
  value does not change.
--- a/apps/github-runner/github-runner.yaml
+++ b/apps/github-runner/github-runner.yaml
--- a/apps/github-runner/install-ruby-toolcache.sh
+++ b/apps/github-runner/install-ruby-toolcache.sh
@@ -0,0 +1,19 @@
 #!/usr/bin/env bash
 set -euo pipefail
 RUBY_VERSION="${RUBY_VERSION:-3.3.11}"
 RUBY_MINOR="${RUBY_MINOR:-3.3}"
 TOOLCACHE_ROOT="${TOOLCACHE_ROOT:-/opt/runner-toolcache}"
 RUNNER_UID="${RUNNER_UID:-1001}"
 RUNNER_GID="${RUNNER_GID:-1001}"
 RUBY_PREFIX="${TOOLCACHE_ROOT}/Ruby/${RUBY_VERSION}/x64"
 mkdir -p "${TOOLCACHE_ROOT}/Ruby"
 RUBY_CONFIGURE_OPTS="${RUBY_CONFIGURE_OPTS:---disable-install-doc --disable-yjit}" ruby-build "${RUBY_VERSION}" "${RUBY_PREFIX}"
 touch "${TOOLCACHE_ROOT}/Ruby/${RUBY_VERSION}/x64.complete"
 ln -sfn "${RUBY_VERSION}" "${TOOLCACHE_ROOT}/Ruby/${RUBY_MINOR}"
 "${RUBY_PREFIX}/bin/ruby" -v
 chown -R "${RUNNER_UID}:${RUNNER_GID}" "${TOOLCACHE_ROOT}"
 chmod -R a+rX "${TOOLCACHE_ROOT}"
--- a/apps/github-runner/step-ca-root.crt
+++ b/apps/github-runner/step-ca-root.crt
@@ -0,0 +1,12 @@
 -----BEGIN CERTIFICATE-----
 MIIBxDCCAWqgAwIBAgIRAPY357G6ow6zMAL5+4bS2kkwCgYIKoZIzj0EAwIwQDEa
 MBgGA1UEChMRSUFtV29ya2luIEFDTUUgQ0ExIjAgBgNVBAMTGUlBbVdvcmtpbiBB
 Q01FIENBIFJvb3QgQ0EwHhcNMjYwMzA4MTgwNzExWhcNMzYwMzA1MTgwNzExWjBA
 MRowGAYDVQQKExFJQW1Xb3JraW4gQUNNRSBDQTEiMCAGA1UEAxMZSUFtV29ya2lu
 IEFDTUUgQ0EgUm9vdCBDQTBZMBMGByqGSM49AgEGCCqGSM49AwEHA0IABJ2n04X1
 JZo5Zdq/i1Idv8+fqwZyAzBh7whbqj0SWsJL8UWRabCMqYCs7+dXO0xRSzqkwFDL
 x+vooOai8RgRNhajRTBDMA4GA1UdDwEB/wQEAwIBBjASBgNVHRMBAf8ECDAGAQH/
 AgEBMB0GA1UdDgQWBBRnuPPQR6iM/H6vOluiU3Sygayz8jAKBggqhkjOPQQDAgNI
 ADBFAiEArQK9dYPGmAZsdYnjziuFVVE5NKZUcceYvGfGC+tLXUsCIAudF2zJrCRq
 3mK50ZZET/fwTkJwiEF4824mjP8p1CKM
 -----END CERTIFICATE-----
--- a/apps/monitoring/noc-monitoring.yaml
+++ b/apps/monitoring/noc-monitoring.yaml
@@ -280,13 +280,14 @@ data:
              printer_model: "NuPrint 210"
      # Print.Web health (Blazor app on edge2:5200)
      # Target `/health` (anonymous) — root path requires API key auth and returns 401.
      - job_name: "probe-printweb"
        metrics_path: /probe
        params:
          module: [http_2xx]
        scrape_interval: 30s
        static_configs:
-          - targets: ["http://10.0.57.16:5200/"]
+          - targets: ["http://10.0.57.16:5200/health"]
            labels:
              instance: "print-web"
              service: "print-web"
--- a/apps/selenium/network-policy.yaml
+++ b/apps/selenium/network-policy.yaml
@@ -24,7 +24,16 @@
 #     (10.0.57.16:5200), public internet 80/443 (excluding RFC1918), and
 #     fc-signage:5190 for the signage AAT lane.
 #   - Ingress: Traefik (4444 + 8089 ACME-solver-style), intra-pod,
-#     telephony / gitea / fc-system / fc-signage namespaces on 4444.
+#     telephony / gitea / fc-system / fc-signage / github-runner namespaces
 #     on 4444.
 #
 # 2026-05-25: added github-runner ingress on 4444 so CI jobs running in
 # self-hosted runner pods (e.g. FlowerCore.Print.Web `help-screenshots`)
 # can reach the grid. Without this allow, the session POST to
 # `selenium-hub.selenium.svc.cluster.local:4444` was DNAT'd to the hub
 # pod IP and then dropped at the Calico ingress hook — Selenium UI showed
 # 0/4 sessions while the .NET HTTP client timed out at 60s. Same family
 # as `feedback_netpol_dnat_backend_port`, wrong-source-namespace flavor.
 apiVersion: networking.k8s.io/v1
 kind: NetworkPolicy
 metadata:
@@ -203,6 +212,13 @@ spec:
    ports:
    - port: 4444
      protocol: TCP
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: github-runner
    ports:
    - port: 4444
      protocol: TCP
  podSelector: {}
  policyTypes:
  - Ingress
--- a/apps/selenium/selenium-grid.yaml
+++ b/apps/selenium/selenium-grid.yaml
@@ -0,0 +1,427 @@
 # Selenium Grid 4 — RKE2 deployment
 #
 # Hub + chrome + firefox + edge browser nodes serving fleet-wide AAT runs from
 # the GitHub Actions self-hosted runners. ArgoCD owns this namespace from
 # 2026-05-25 (`infra-selenium` Application; previously these resources were
 # orphan kubectl-applied since 2026-03-15).
 #
 # Endpoints:
 #   - Internal cluster: http://selenium-hub.selenium.svc.cluster.local:4444
 #   - LAN LoadBalancer (MetalLB): http://10.0.56.208:4444
 #   - Traefik public: https://selenium.iamworkin.lan
 #
 # Browser maxSessions:
 #   - chrome 2  (bumped from 1 on 2026-05-25 morning-routine — AAT-heavy
 #                Print.Web help-screenshots was the global bottleneck;
 #                see commit history for ops/runner-replica-rightsize)
 #   - firefox 1
 #   - edge 1
 #
 # Screenshots + video recording write to NFS via the chrome video sidecar.
 # See: CLAUDE.md "Selenium Grid & Visual AAT Testing" + bluejay-infra ADR notes.
 ---
 apiVersion: v1
 kind: Service
 metadata:
  labels:
    app: selenium-hub
    app.kubernetes.io/name: selenium-hub
    app.kubernetes.io/part-of: selenium-grid
  name: selenium-hub
  namespace: selenium
 spec:
  ports:
  - name: web
    port: 4444
    targetPort: 4444
  - name: publish
    port: 4442
    targetPort: 4442
  - name: subscribe
    port: 4443
    targetPort: 4443
  selector:
    app: selenium-hub
  type: ClusterIP
 ---
 apiVersion: v1
 kind: Service
 metadata:
  annotations:
    metallb.io/ip-allocated-from-pool: bluejay-pool
    metallb.universe.tf/loadBalancerIPs: 10.0.56.208
  labels:
    app: selenium-hub
    component: external-access
  name: selenium-hub-external
  namespace: selenium
 spec:
  clusterIP: 10.43.90.147
  clusterIPs:
  - 10.43.90.147
  externalTrafficPolicy: Local
  healthCheckNodePort: 32213
  ports:
  - name: web
    nodePort: 32411
    port: 4444
    targetPort: 4444
  - name: publish
    nodePort: 32068
    port: 4442
    targetPort: 4442
  - name: subscribe
    nodePort: 31000
    port: 4443
    targetPort: 4443
  selector:
    app: selenium-hub
  type: LoadBalancer
 ---
 apiVersion: apps/v1
 kind: Deployment
 metadata:
  labels:
    app: selenium-hub
    app.kubernetes.io/name: selenium-hub
    app.kubernetes.io/part-of: selenium-grid
  name: selenium-hub
  namespace: selenium
 spec:
  replicas: 1
  selector:
    matchLabels:
      app: selenium-hub
  template:
    metadata:
      labels:
        app: selenium-hub
        app.kubernetes.io/name: selenium-hub
        app.kubernetes.io/part-of: selenium-grid
    spec:
      containers:
      - env:
        - name: SE_NODE_SESSION_TIMEOUT
          value: '300'
        - name: SE_SESSION_REQUEST_TIMEOUT
          value: '300'
        - name: SE_SESSION_RETRY_INTERVAL
          value: '5'
        - name: JAVA_OPTS
          value: -Xmx512m
        image: selenium/hub:4.27.0
        livenessProbe:
          httpGet:
            path: /wd/hub/status
            port: 4444
          initialDelaySeconds: 30
          periodSeconds: 15
          timeoutSeconds: 5
        name: selenium-hub
        ports:
        - containerPort: 4444
          name: web
        - containerPort: 4442
          name: publish
        - containerPort: 4443
          name: subscribe
        readinessProbe:
          httpGet:
            path: /wd/hub/status
            port: 4444
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 5
        # Hub baseline working set ~766Mi on 2026-05-25 (75% of prior 1Gi
        # limit). Bump to 1.5Gi / 1Gi to keep ~50% headroom; matches the
        # stampede-buffer pattern documented for multus
        # (feedback_k8s_cni_multus_sizing). CPU left alone — observed 54m
        # against a 500m limit, no contention.
        resources:
          limits:
            cpu: 500m
            memory: 1536Mi
          requests:
            cpu: 250m
            memory: 1Gi
 ---
 apiVersion: apps/v1
 kind: Deployment
 metadata:
  labels:
    app: selenium-node-chrome
    app.kubernetes.io/name: selenium-node-chrome
    app.kubernetes.io/part-of: selenium-grid
  name: selenium-node-chrome
  namespace: selenium
 spec:
  replicas: 1
  selector:
    matchLabels:
      app: selenium-node-chrome
  template:
    metadata:
      labels:
        app: selenium-node-chrome
        app.kubernetes.io/name: selenium-node-chrome
        app.kubernetes.io/part-of: selenium-grid
    spec:
      containers:
      - env:
        - name: SE_EVENT_BUS_HOST
          value: selenium-hub
        - name: SE_EVENT_BUS_PUBLISH_PORT
          value: '4442'
        - name: SE_EVENT_BUS_SUBSCRIBE_PORT
          value: '4443'
        - name: SE_NODE_MAX_SESSIONS
          value: '2'
        - name: SE_NODE_OVERRIDE_MAX_SESSIONS
          value: 'false'
        - name: SE_VNC_NO_PASSWORD
          value: '1'
        - name: SE_SCREEN_WIDTH
          value: '1920'
        - name: SE_SCREEN_HEIGHT
          value: '1080'
        - name: SE_NODE_SESSION_TIMEOUT
          value: '300'
        image: selenium/node-chrome:4.27.0
        livenessProbe:
          httpGet:
            path: /status
            port: 5555
          initialDelaySeconds: 30
          periodSeconds: 15
        name: selenium-chrome
        ports:
        - containerPort: 5555
          name: node
        readinessProbe:
          httpGet:
            path: /status
            port: 5555
          initialDelaySeconds: 15
          periodSeconds: 5
        # Chromium-based browser node. Bumped from 1Gi -> 2Gi (req 512Mi
        # -> 1Gi) on 2026-05-25 — Edge had 51 OOMKills in 5d on the
        # original 1Gi cap (~1 OOM every 2.4h), and Chrome at maxSessions=2
        # was running 684Mi idle on the same cap. Matches the Firefox node's
        # tested-stable 2Gi limit. CPU unchanged.
        resources:
          limits:
            cpu: '1'
            memory: 2Gi
          requests:
            cpu: 500m
            memory: 1Gi
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      - env:
        - name: DISPLAY_CONTAINER_NAME
          value: localhost
        - name: SE_SCREEN_WIDTH
          value: '1920'
        - name: SE_SCREEN_HEIGHT
          value: '1080'
        - name: SE_VIDEO_FILE_NAME
          value: auto
        - name: SE_VIDEO_UPLOAD_ENABLED
          value: 'false'
        image: selenium/video:ffmpeg-7.1-20250101
        name: video
        resources:
          limits:
            cpu: 500m
            memory: 768Mi
          requests:
            cpu: 250m
            memory: 384Mi
        volumeMounts:
        - mountPath: /videos
          name: selenium-videos
      volumes:
      - emptyDir:
          medium: Memory
          sizeLimit: 2Gi
        name: dshm
      - emptyDir:
          sizeLimit: 5Gi
        name: selenium-videos
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: selenium-node-firefox
    app.kubernetes.io/name: selenium-node-firefox
    app.kubernetes.io/part-of: selenium-grid
  name: selenium-node-firefox
  namespace: selenium
spec:
  replicas: 1
  selector:
    matchLabels:
      app: selenium-node-firefox
  template:
    metadata:
      labels:
        app: selenium-node-firefox
        app.kubernetes.io/name: selenium-node-firefox
        app.kubernetes.io/part-of: selenium-grid
    spec:
      containers:
      - env:
        - name: SE_EVENT_BUS_HOST
          value: selenium-hub
        - name: SE_EVENT_BUS_PUBLISH_PORT
          value: '4442'
        - name: SE_EVENT_BUS_SUBSCRIBE_PORT
          value: '4443'
        - name: SE_NODE_MAX_SESSIONS
          value: '1'
        - name: SE_NODE_OVERRIDE_MAX_SESSIONS
          value: 'true'
        - name: SE_VNC_NO_PASSWORD
          value: '1'
        - name: SE_START_VNC
          value: 'false'
        - name: SE_SCREEN_WIDTH
          value: '1920'
        - name: SE_SCREEN_HEIGHT
          value: '1080'
        - name: SE_NODE_SESSION_TIMEOUT
          value: '300'
        image: selenium/node-firefox:4.27.0
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /status
            port: 5555
          initialDelaySeconds: 30
          periodSeconds: 15
          timeoutSeconds: 5
        name: selenium-firefox
        ports:
        - containerPort: 5555
          name: node
        readinessProbe:
          failureThreshold: 5
          httpGet:
            path: /status
            port: 5555
          initialDelaySeconds: 15
          periodSeconds: 5
          timeoutSeconds: 5
        resources:
          limits:
            cpu: '1'
            memory: 2Gi
          requests:
            cpu: 500m
            memory: 1Gi
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - emptyDir:
          medium: Memory
          sizeLimit: 2Gi
        name: dshm
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: selenium-node-edge
    app.kubernetes.io/name: selenium-node-edge
    app.kubernetes.io/part-of: selenium-grid
  name: selenium-node-edge
  namespace: selenium
spec:
  replicas: 1
  selector:
    matchLabels:
      app: selenium-node-edge
  template:
    metadata:
      labels:
        app: selenium-node-edge
        app.kubernetes.io/name: selenium-node-edge
        app.kubernetes.io/part-of: selenium-grid
    spec:
      containers:
      - env:
        - name: SE_EVENT_BUS_HOST
          value: selenium-hub
        - name: SE_EVENT_BUS_PUBLISH_PORT
          value: '4442'
        - name: SE_EVENT_BUS_SUBSCRIBE_PORT
          value: '4443'
        - name: SE_NODE_MAX_SESSIONS
          value: '1'
        - name: SE_NODE_OVERRIDE_MAX_SESSIONS
          value: 'true'
        - name: SE_VNC_NO_PASSWORD
          value: '1'
        - name: SE_SCREEN_WIDTH
          value: '1920'
        - name: SE_SCREEN_HEIGHT
          value: '1080'
        - name: SE_NODE_SESSION_TIMEOUT
          value: '300'
        image: selenium/node-edge:4.27.0
        livenessProbe:
          httpGet:
            path: /status
            port: 5555
          initialDelaySeconds: 30
          periodSeconds: 15
        name: selenium-edge
        ports:
        - containerPort: 5555
          name: node
        readinessProbe:
          httpGet:
            path: /status
            port: 5555
          initialDelaySeconds: 15
          periodSeconds: 5
        # Chromium-based browser node. Bumped from 1Gi -> 2Gi (req 512Mi
        # -> 1Gi) on 2026-05-25 — Edge had 51 OOMKills in 5d on the
        # original 1Gi cap (~1 OOM every 2.4h), and Chrome at maxSessions=2
        # was running 684Mi idle on the same cap. Matches the Firefox node's
        # tested-stable 2Gi limit. CPU unchanged.
        resources:
          limits:
            cpu: '1'
            memory: 2Gi
          requests:
            cpu: 500m
            memory: 1Gi
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - emptyDir:
          medium: Memory
          sizeLimit: 2Gi
        name: dshm
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: selenium-hub
  namespace: selenium
spec:
  entryPoints:
  - websecure
  routes:
  - kind: Rule
    match: Host(`selenium.iamworkin.lan`)
    services:
    - name: selenium-hub
      port: 4444
  tls:
    secretName: selenium-tls
--- a/tests/bluejay-infra-lint/FleetManifestLintTests.cs
+++ b/tests/bluejay-infra-lint/FleetManifestLintTests.cs
@@ -67,6 +67,7 @@ public sealed class FleetManifestLintTests
        ["github-runner-chat"] = "https://github.com/astoltz/FlowerCore.Chat",
        ["github-runner-mysql"] = "https://github.com/astoltz/FlowerCore.MySQL",
        ["github-runner-kiosk-linux"] = "https://github.com/astoltz/FlowerCore.Kiosk.Linux",
+        ["github-runner-updater"] = "https://github.com/astoltz/FlowerCore.Updater",
    };
    private static readonly HashSet<string> ScaledLinuxRunnerDeployments = new(StringComparer.Ordinal)
@@ -80,6 +81,7 @@ public sealed class FleetManifestLintTests
        "github-runner-chat",
        "github-runner-mysql",
        "github-runner-kiosk-linux",
+        "github-runner-updater",
    };
    private static readonly IReadOnlyDictionary<string, string> WritableRunnerEnv = new Dictionary<string, string>(StringComparer.Ordinal)
@@ -234,7 +236,7 @@ public sealed class FleetManifestLintTests
        {
            deployments.Should().ContainKey(expectedRunner.Key);
-            var container = deployments[expectedRunner.Key].ContainerMappings().Should().ContainSingle().Subject;
+            var container = deployments[expectedRunner.Key].MainContainerMappings().Should().ContainSingle().Subject;
            EnvValue(container, "REPO_URL").Should().Be(expectedRunner.Value);
            EnvValue(container, "EPHEMERAL").Should().Be("true");
            EnvValue(container, "LABELS").Should().Be("self-hosted,linux,fc-build-linux");
@@ -250,7 +252,7 @@ public sealed class FleetManifestLintTests
    {
        foreach (var deployment in GitHubRunnerDeployments().Values)
        {
-            var container = deployment.ContainerMappings().Should().ContainSingle().Subject;
+            var container = deployment.MainContainerMappings().Should().ContainSingle().Subject;
            foreach (var expectedEnv in WritableRunnerEnv)
            {
@@ -277,7 +279,10 @@ public sealed class FleetManifestLintTests
        foreach (var deploymentName in ScaledLinuxRunnerDeployments)
        {
            var deployment = deployments[deploymentName];
-            ReplicaCount(deployment).Should().Be(2);
+            // Scaled runners must have >= 2 replicas (avoid single-pod bottleneck).
+            // Individual deployments may be tuned upward per CI activity; see
+            // "runners: right-size replica counts per 14d CI activity (#24)".
+            ReplicaCount(deployment).Should().BeGreaterOrEqualTo(2, $"{deploymentName} is in the scaled set and must run with at least 2 replicas");
            var volumes = deployment.MappingSequence("spec", "template", "spec", "volumes");
            var claimNames = volumes
@@ -303,6 +308,108 @@ public sealed class FleetManifestLintTests
            .Be("github-runner-nuget-cache");
    }
+    [Fact]
+    public void Runners_MustNotPinToOperatorWorkstationHosts()
+    {
+        // CRITICAL SAFETY (operator directive 2026-05-26): BLUEJAY-WS is the
+        // operator's primary workstation, host of the 1Password Connect
+        // bearer token, fcadmin SSH keys to noc1, signing CA private keys,
+        // and source for every FC repo. A self-hosted GitHub Actions runner
+        // there would execute arbitrary PR code with that local access.
+        // Build-side analog of the Sprint 9 NEW safe-account exclusion gate
+        // (Puppet GPO/AppLocker/WDAC/audit-forwarder modules refuse to apply
+        // on BLUEJAY-WS). This lint asserts no GitHub-runner Deployment in
+        // apps/github-runner/ pins to a forbidden operator-workstation host
+        // via nodeName, nodeSelector, nodeAffinity, or tolerations.
+        // Existing legacy `bluejay-ws-sandbox-1` GitHub-registered runner is
+        // out of scope here (it's a runtime registration, not a K8s
+        // Deployment); see CLAUDE.md "Common Mistakes" entry and
+        // feedback_bluejay_ws_never_public_runner.md.
+        var forbiddenHostPatterns = new[]
+        {
+            "bluejay-ws",
+            "BLUEJAY-WS",
+            "bluejay-ws.iamworkin.lan",
+            "iamworkin-ws",
+        };
+        bool ContainsForbidden(string? value) =>
+            !string.IsNullOrWhiteSpace(value)
+            && forbiddenHostPatterns.Any(pattern => value!.Contains(pattern, StringComparison.OrdinalIgnoreCase));
+        var violations = GitHubRunnerDeployments().Values.SelectMany(deployment =>
+        {
+            var local = new List<string>();
+            var podSpec = ManifestNodeExtensions.Mapping(deployment.Root, "spec", "template", "spec");
+            if (podSpec is null)
+            {
+                return local;
+            }
+            // nodeName: pins the pod to a specific node by name.
+            var nodeName = ManifestNodeExtensions.Scalar(podSpec, "nodeName");
+            if (ContainsForbidden(nodeName))
+            {
+                local.Add($"{deployment.Name} sets nodeName='{nodeName}' which targets a forbidden operator-workstation host.");
+            }
+            // nodeSelector: dict of label → value pinning the pod to nodes
+            // carrying matching labels. Examples that would trip this:
+            //   kubernetes.io/hostname: bluejay-ws
+            //   flowercore.io/host: bluejay-ws.iamworkin.lan
+            var nodeSelector = ManifestNodeExtensions.Mapping(podSpec, "nodeSelector");
+            if (nodeSelector is not null)
+            {
+                foreach (var entry in nodeSelector.Children)
+                {
+                    var key = entry.Key is YamlScalarNode keyScalar ? keyScalar.Value : null;
+                    var value = entry.Value is YamlScalarNode valueScalar ? valueScalar.Value : null;
+                    if (ContainsForbidden(value))
+                    {
+                        local.Add($"{deployment.Name} has nodeSelector entry '{key}: {value}' which targets a forbidden operator-workstation host.");
+                    }
+                }
+            }
+            // nodeAffinity: matchExpressions over node labels.
+            foreach (var term in ManifestNodeExtensions.MappingSequence(podSpec, "affinity", "nodeAffinity", "requiredDuringSchedulingIgnoredDuringExecution", "nodeSelectorTerms"))
+            {
+                foreach (var expr in ManifestNodeExtensions.MappingSequence(term, "matchExpressions"))
+                {
+                    var key = ManifestNodeExtensions.Scalar(expr, "key");
+                    foreach (var valueNode in ManifestNodeExtensions.ScalarSequence(expr, "values"))
+                    {
+                        if (ContainsForbidden(valueNode))
+                        {
+                            local.Add($"{deployment.Name} has nodeAffinity matchExpression '{key}' value '{valueNode}' which targets a forbidden operator-workstation host.");
+                        }
+                    }
+                }
+            }
+            // tolerations: scheduling onto a tainted operator-workstation
+            // node would let the runner run there. Forbid any toleration
+            // value that names the workstation.
+            foreach (var toleration in ManifestNodeExtensions.MappingSequence(podSpec, "tolerations"))
+            {
+                var key = ManifestNodeExtensions.Scalar(toleration, "key");
+                var value = ManifestNodeExtensions.Scalar(toleration, "value");
+                if (ContainsForbidden(key))
+                {
+                    local.Add($"{deployment.Name} has toleration key '{key}' which targets a forbidden operator-workstation host.");
+                }
+                if (ContainsForbidden(value))
+                {
+                    local.Add($"{deployment.Name} has toleration value '{value}' which targets a forbidden operator-workstation host.");
+                }
+            }
+            return local;
+        }).ToList();
+        violations.Should().BeEmpty("BLUEJAY-WS / iamworkin-ws must never host a fleet GitHub Actions runner; see CLAUDE.md 'Registering BLUEJAY-WS as a fleet GitHub Actions runner' and feedback_bluejay_ws_never_public_runner.md");
+    }
    [Fact]
    public void Monitoring_MustAlertWhenLinuxRunnerDeploymentIsUnavailable()
    {
@@ -890,6 +997,22 @@ internal sealed record ManifestDocument(
            .ToList();
    }
+    // MainContainerMappings excludes initContainers. Use this when asserting
+    // properties of the primary container (env, image, volumeMounts) where an
+    // initContainer would be a false-positive match, e.g. the GitHub runner
+    // image's `setup-runner-home` initContainer should not count toward the
+    // single-container assertions on the runner deployments.
+    public IReadOnlyList<YamlMappingNode> MainContainerMappings()
+    {
+        var podSpec = PodSpec();
+        if (podSpec is null)
+        {
+            return Array.Empty<YamlMappingNode>();
+        }
+        return ManifestNodeExtensions.MappingSequence(podSpec, "containers").ToList();
+    }
    public IReadOnlyList<ContainerSpec> ContainerSpecs()
    {
        return ContainerMappings()
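
As a rough illustration of what `Runners_MustNotPinToOperatorWorkstationHosts` is meant to reject, the fragment below sketches a runner Deployment that would trip each of the four checks. The Deployment name, namespace, label keys, and taint key are hypothetical and are not taken from this repo; only the field paths and the forbidden host strings come from the test above. It also assumes the Deployment is one that `GitHubRunnerDeployments()` picks up, which the diff does not show.

```yaml
# Hypothetical manifest fragment, for illustration only (selector and
# containers omitted). Every commented field below would be reported.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: github-runner-example        # assumed name
  namespace: github-runner           # assumed namespace
spec:
  template:
    spec:
      nodeName: bluejay-ws                        # nodeName check
      nodeSelector:
        kubernetes.io/hostname: BLUEJAY-WS        # nodeSelector check (match is case-insensitive)
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - bluejay-ws.iamworkin.lan        # nodeAffinity value check
      tolerations:
      - key: dedicated                            # assumed taint key
        operator: Equal
        value: iamworkin-ws                       # toleration value check
        effect: NoSchedule
```

Judging from the test as shown, only `requiredDuringSchedulingIgnoredDuringExecution` terms and their `matchExpressions` are walked, so a `preferredDuringSchedulingIgnoredDuringExecution` term or a `matchFields` entry naming the workstation would presumably go unreported.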
Author	SHA1	Message	Date
Andrew Stoltz	ec78175526	tests: add bluejay-ws runner-exclusion lint + fix 3 stale runner-fleet assertions Adds Runners_MustNotPinToOperatorWorkstationHosts lint test enforcing operator directive 2026-05-26: BLUEJAY-WS / iamworkin-ws must never be a fleet GitHub Actions runner. Build-side analog of the Sprint 9 NEW safe-account exclusion gate (Puppet GPO/AppLocker/WDAC/audit-forwarder modules refuse to apply on BLUEJAY-WS). Scans every github-runner Deployment for forbidden nodeName, nodeSelector, nodeAffinity match expressions, and toleration key/value pinning. See CLAUDE.md "Common Mistakes" entry and feedback_bluejay_ws_never_public_runner.md. Also fixes 3 pre-existing GitHubRunnerFleet_* lint failures that broke when the runner image bumped to v20260525-ruby3.3.11-stepca (added a setup-runner-home initContainer): * Add MainContainerMappings() helper (containers only, excludes initContainers) and switch GitHubRunnerFleet_MustRegisterRequiredReposAsRepoScopedDeployments + GitHubRunnerFleet_MustSetWritableNonRootDotnetAndCachePaths over to it. Without this, ContainerMappings().Should().ContainSingle() found the initContainer + runner = 2 containers and failed. * Loosen GitHubRunnerFleet_MustAvoidRwoMultiAttachForScaledDeployments ReplicaCount assertion from Be(2) to BeGreaterOrEqualTo(2). The semantic invariant is "at least 2 replicas so no single-pod bottleneck"; deployments tuned upward per 14d CI activity (e.g. github-runner-print-web at replicas: 3, see commit `1f1f682` PR #24) are valid. Lint baseline: 6 failed -> 3 failed (the 3 remaining are unrelated: PublicReadWriteIngressRoutes_* lives in FlowerCore.Updater/k8s/ ingressroute.yaml — separate PR; FcDeviceManagement_* needs operator domain decision on the missing apps/fc-devicemgmt/argocd-application.yaml). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 22:41:00 -05:00
Andrew Stoltz	2cc91b6df0	runners: bump tts-reader memory limit 4Gi -> 8Gi The github-runner-tts-reader pod was being OOMKilled (exit 137) mid-`dotnet test` on the TtsReader 1000+ test suite. PR #21 CI (the Windows -> Linux runner migration) flapped twice with the "self-hosted runner lost communication" annotation before the K8s-side symptoms surfaced via kubectl describe pod. Requests bumped 1Gi -> 2Gi, limits 4Gi -> 8Gi. Comment added inline so future fleet runs don't trip the same wall. Unblocks PR #21 + the 9 other open TtsReader PRs that all rebase through it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 22:31:48 -05:00
bluejay	0d2090fe81	runners: add github-runner-updater Deployment (#29) Close runner-fleet gap for FlowerCore.Updater. Matches Sprint 32 long-tail pattern; registers entry in fleet-lint required-set.	2026-05-26 03:24:13 +00:00
Andrew Stoltz	bc3548e715	runners: add github-runner-pimanager Deployment FlowerCore.PiManager build run 26417714843 sat queued 5h with zero self-hosted runners registered to the repo. PiManager was missed in the Sprint 32 long-tail sweep — every other FC repo got a dedicated repo-scoped Deployment with its own ACCESS_TOKEN registration, but PiManager fell through the cracks. Adds a 2-replica ephemeral runner Deployment matching the Signage / DMS / Print.Web pattern (per-pod emptyDir caches, no shared PVC, labels `self-hosted,linux,fc-build-linux`, shared github-runner-token PAT). Once ArgoCD syncs, the queued job will pick up automatically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 20:33:44 -05:00
bluejay	74333cc26b	selenium: right-size hub + chrome + edge memory limits (#28)	2026-05-26 01:12:15 +00:00
Andrew Stoltz	7310fb88c2	selenium: right-size hub + chrome + edge memory limits Edge node has been OOMKilled 51 times in 5 days (~1 every 2.4h) on a 1Gi memory limit. Chrome runs maxSessions=2 on the same 1Gi cap and was idling at 684Mi — first concurrent session pushing the node to ~900Mi+ would be the next OOM. Hub was running at 766Mi against a 1Gi limit (75%); no recent restarts but no headroom either. Firefox node has been running at 2Gi memory limit for 9 days with zero restarts — that is the right size for a Selenium 4.27 browser node under our session profile (screen recording sidecar + 1080p rendering + page captures). Match it. Changes: - Hub: limit 1Gi -> 1.5Gi, request 512Mi -> 1Gi - Chrome: limit 1Gi -> 2Gi, request 512Mi -> 1Gi - Edge: limit 1Gi -> 2Gi, request 512Mi -> 1Gi CPU left alone on all three — observed utilization is well under the existing limits (hub 54m / 500m, chrome 185m / 1, edge 11m / 1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 20:11:41 -05:00
bluejay	148bc87b9a	runners: bake step-ca root CA into image (v20260525-stepca) (#27)	2026-05-26 01:04:14 +00:00
Andrew Stoltz	2a1e842100	runners: bake step-ca root CA into image (v20260525-stepca) Without the IAmWorkin step-ca root CA in the runner image's system trust store, .NET HttpClient calls from CI tests against `*.iamworkin.lan` (e.g. `https://selenium.iamworkin.lan/session`) fail with `The remote certificate is invalid because of errors in the certificate chain: PartialChain`. FlowerCore.Print.Web's `WebScreenshotService` unit tests hit this on every build. Drop the step-ca root PEM into `/usr/local/share/ca-certificates/`, run `update-ca-certificates` once during apt install, and let OpenSSL + .NET-on-Linux read the regenerated `/etc/ssl/certs/ca-certificates.crt` automatically, with no `SSL_CERT_FILE` env var and no per-Deployment volume mount. Image rebuilt + saved + imported on all 3 schedulable RKE2 nodes (rke2-server, rke2-agent1, rke2-agent2) before this PR, verified with `ctr images list -q | grep stepca` on each node. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 19:55:38 -05:00
bluejay	bc28430d24	selenium: allow github-runner namespace ingress on 4444 (#26)	2026-05-26 00:44:23 +00:00
Andrew Stoltz	cc92272217	selenium: allow github-runner namespace ingress on 4444 Unblocks CI jobs running in github-runner pods (e.g. FlowerCore.Print.Web `help-screenshots`) from reaching selenium-hub. Previously the session POST was DNAT'd to the hub pod IP then dropped at the Calico ingress hook, surfacing as a 60s timeout against http://selenium-hub.selenium.svc.cluster.local:4444 while the Selenium UI showed 0/4 sessions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 19:43:12 -05:00
bluejay	d6f4468a9c	selenium: migrate hub + 3 nodes into ArgoCD-managed manifests (#25)	2026-05-26 00:09:35 +00:00
Andrew Stoltz	2f796a2ebd	selenium: migrate hub + 3 nodes + service + ingressroute into ArgoCD Previously orphan kubectl-applied since the Selenium Grid was first set up. The `infra-selenium` ArgoCD app existed but only managed `network-policy.yaml` — the deployments themselves drifted whenever anyone `kubectl set env`'d or `kubectl scale`'d. This commit captures the live state (with the 2026-05-25 maxSessions bump for chrome already baked in) as canonical git source. ArgoCD's ServerSideApply syncPolicy + selfHeal will now keep the grid in lock step with this file. Resources captured: - Service selenium-hub (ClusterIP, internal traffic on 4444) - Service selenium-hub-external (LoadBalancer, MetalLB 10.0.56.208) - Deployment selenium-hub - Deployment selenium-node-chrome (replicas=1, SE_NODE_MAX_SESSIONS=2) - Deployment selenium-node-firefox (replicas=1, maxSessions=1) - Deployment selenium-node-edge (replicas=1, maxSessions=1) - IngressRoute selenium-hub (Traefik, selenium.iamworkin.lan) No live behavior change — server-side dry-run confirms unchanged for hub/firefox/ingressroute, "configured" for hub-external + 3 deploys (default-field reordering only; SSA + field managers handle the diff). Refs: Sprint 33 morning-routine 2026-05-25 follow-up Q-MR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 19:08:55 -05:00
bluejay	1f1f6823db	runners: right-size replica counts per 14d CI activity (#24)	2026-05-26 00:01:47 +00:00
Andrew Stoltz	b92f74b63a	runners: right-size replica counts per 14d CI activity data Drop 2 → 1 for 10 deploys based on trailing-14d run counts: - LlmBridge, Media, Knowledge, Intranet.Web, DNS (0 runs each) - Presentations (6), Redis (3), Provisioning (3), MessageBoard (3), MenuBoard (3) Bump 2 → 3 for Print.Web: 12 runs in trailing 5d, and the help-screenshots AAT job holds a runner 30+ min, creating head-of-line blocking for parallel PRs. Net change: -9 replicas (≈ -9 GiB committed memory). Aligns with Sprint 33 morning-routine capacity audit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 18:55:47 -05:00
Andrew Stoltz	cb7f7dbc4d	authentik: generous startup/liveness probes for first-boot migration The server pod was getting killed by liveness probe at 60s while still waiting on migration DB lock (worker pod also running migrations against same DB). Add startupProbe with 10.5 min budget so liveness doesn't fire until migrations finish. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 16:03:03 -05:00
Andrew Stoltz	03126d5584	authentik: add fsGroup:1000 to server + worker so non-root uid can write /media PermissionError: [Errno 13] Permission denied: '/media/public' in tenant_files migration because Authentik container runs as uid 1000 but Longhorn PVC mounts root:root by default. fsGroup on Pod securityContext recursively chgrps the PVC mount to gid 1000 + chmods g+rwx. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 15:58:35 -05:00
Andrew Stoltz	495e884c41	authentik: initial deployment at id.iamworkin.lan Stack: - PostgreSQL 16 StatefulSet (Longhorn RWO 5Gi) - Redis 7 Deployment (no persistence) - Authentik server + worker (ghcr.io/goauthentik/server:2024.12.3) - Shared media PVC (Longhorn RWO 2Gi) between server+worker - Certificate via step-ca-acme ClusterIssuer - Traefik IngressRoute at id.iamworkin.lan Secrets sourced from 1Password item 'authentik-credentials' (IAmWorkin vault, id y6i74ch22q5wvm7znquq4nhhcu) via OnePasswordItem CRD. Fields: AUTHENTIK_SECRET_KEY, POSTGRES_PASSWORD, REDIS_PASSWORD, BOOTSTRAP_ADMIN_PASSWORD, BOOTSTRAP_ADMIN_TOKEN, BOOTSTRAP_ADMIN_EMAIL. DNS A record id.iamworkin.lan -> 10.0.56.200 added via scripts/pfsense-add-id-host.py (FlowerCore.DNS service was 502'ing on pfSense diag_command.php response parsing). Closes the immediate gap from PiManager OIDC Cohort 3 wire-up: PiManager (a87cd6f) configures id.iamworkin.lan as JWT authority but the backend was never deployed. Pirelay specifically is on Mode:apikey until this backend is bootstrapped and a pimanager service-account exists. Post-deploy bootstrap (manual once pods Ready): 1. Login at https://id.iamworkin.lan/if/admin/ as akadmin using BOOTSTRAP_ADMIN_PASSWORD from 1Password. 2. Create OAuth2/OpenID Provider for pimanager (issuer https://id.iamworkin.lan/application/o/pimanager/, audience 'pimanager'). 3. Create Application binding the provider. 4. Create service account user 'pimanager-service-account', generate long-lived token, store in 1Password as 'pimanager-service-account'. 5. Re-enable jwt mode on pirelay + un-mask puppet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 15:50:10 -05:00
Andrew Stoltz	65aa1e6104	fix(monitoring): point probe-printweb at /health (Q-MR-90) Root path requires API key auth — `/` returned 401 to the blackbox probe, firing PrintWebDown despite `/health` reporting Healthy. Pattern: feedback_k8s_probes_behind_auth_middleware. Mirrors FlowerCore.Notes scripts/monitoring/prometheus.yml. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 14:52:02 -05:00
Andrew Stoltz	7f2a3b76b4	feat(github-runner): bake Ruby 3.3 into Linux self-hosted runner image (Q-MR-81)	2026-05-20 11:45:43 -05:00
bluejay	ea73f00461	fix(fc-devicemgmt): remove self-referential Application resource (Q-MR-79) ApplicationSet already creates infra-fc-devicemgmt; removing the in-repo Application child clears the self-reference drift.	2026-05-20 16:20:01 +00:00
Andrew Stoltz	25ace30a03	fix(fc-devicemgmt): remove self-referential Application resource (Q-MR-79)	2026-05-20 11:18:25 -05:00
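
The 10.5 min startup budget mentioned in commit cb7f7dbc4d is not spelled out in this log. One way such a budget could be expressed is sketched below, assuming a 10 s probe period. The period, threshold, path, and port are assumptions for illustration; the values in the committed authentik manifest are not shown here and may differ.

```yaml
# Hypothetical probe settings: a sketch, not the committed manifest.
startupProbe:
  httpGet:
    path: /-/health/live/    # assumed health endpoint
    port: 9000               # assumed server HTTP port
  periodSeconds: 10
  failureThreshold: 63       # 63 x 10 s = 630 s = 10.5 min
livenessProbe:
  httpGet:
    path: /-/health/live/    # same assumed endpoint
    port: 9000
  periodSeconds: 10
```

Kubernetes holds off liveness and readiness checks until the startup probe has succeeded, which is what would keep the liveness probe from killing the server pod while it waits on the migration lock.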