selenium: allow github-runner namespace ingress on 4444 (#26)

Unblocks CI jobs running in github-runner pods (e.g. FlowerCore.Print.Web `help-screenshots`) so they can reach selenium-hub. Previously the session POST was DNAT'd to the hub pod IP and then dropped at the Calico ingress hook, surfacing as a 60s timeout against http://selenium-hub.selenium.svc.cluster.local:4444 while the Selenium UI showed 0/4 sessions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 00:44:23 +00:00 · 2026-05-25 19:43:12 -05:00 · 2026-05-26 00:09:35 +00:00 · 2026-05-25 19:08:55 -05:00 · 2026-05-26 00:01:47 +00:00 · 2026-05-25 18:55:47 -05:00
15 changed files with 1528 additions and 731 deletions
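The failure described in the commit message (traffic DNAT'd, then silently dropped) typically looks like a connect timeout, not a refusal. The following is a minimal sketch for telling the two apart from inside a runner pod, assuming Python 3 is available there. `probe_tcp` is a hypothetical helper and is not part of this change.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify one TCP connect attempt.

    Returns "open", "refused", "timeout", "dns-failure", or "error".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror:
        return "dns-failure"  # the name did not resolve
    except ConnectionRefusedError:
        return "refused"      # the peer answered with a reset: no listener
    except (socket.timeout, TimeoutError):
        return "timeout"      # no answer at all, e.g. packets dropped by policy
    except OSError:
        return "error"        # anything else (unreachable network, etc.)
```

Calling `probe_tcp("selenium-hub.selenium.svc.cluster.local", 4444)` from a github-runner pod would be expected to move from `"timeout"` to `"open"` once the ingress rule is in place, assuming nothing else sits in the path.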
--- a/apps/authentik/authentik.yaml
+++ b/apps/authentik/authentik.yaml
@@ -0,0 +1,448 @@
+# Authentik OIDC backend
+# ArgoCD-managed. BlueJay Lab.
+#
+# Stack:
+#   - PostgreSQL 16 StatefulSet (single replica, Longhorn RWO 5Gi)
+#   - Redis 7 Deployment (no persistence — session/cache only)
+#   - Authentik server + worker Deployments (image ghcr.io/goauthentik/server:2024.12.3)
+#   - Media PVC shared between server + worker (Longhorn RWO 2Gi)
+#   - Certificate via step-ca-acme ClusterIssuer
+#   - Traefik IngressRoute at id.iamworkin.lan
+#
+# Secrets come from 1Password item "authentik-credentials" (IAmWorkin vault, id y6i74ch22q5wvm7znquq4nhhcu)
+# via the OnePasswordItem CRD, materialized into k8s Secret authentik/authentik-credentials.
+#
+# Why the discovery URL is /application/o/pimanager/ : Authentik issues per-application OIDC providers.
+# The pimanager OIDC application/provider is created after the cluster pods are healthy (manual or
+# via API once the bootstrap token is available — see Notes substrate).
+
+---
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: authentik
+  labels:
+    app.kubernetes.io/part-of: bluejay-infra
+
+---
+# 1Password operator pulls the authentik-credentials item into a k8s Secret of the same name.
+# Field labels in 1P become Secret keys: AUTHENTIK_SECRET_KEY, POSTGRES_PASSWORD, REDIS_PASSWORD,
+# BOOTSTRAP_ADMIN_PASSWORD, BOOTSTRAP_ADMIN_TOKEN, BOOTSTRAP_ADMIN_EMAIL.
+apiVersion: onepassword.com/v1
+kind: OnePasswordItem
+metadata:
+  name: authentik-credentials
+  namespace: authentik
+spec:
+  itemPath: "vaults/IAmWorkin/items/authentik-credentials"
+
+---
+# Shared media volume for server + worker pods.
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: authentik-media
+  namespace: authentik
+spec:
+  storageClassName: longhorn
+  accessModes: [ReadWriteOnce]
+  resources:
+    requests:
+      storage: 2Gi
+
+---
+# PostgreSQL 16 StatefulSet — Authentik's primary store.
+apiVersion: apps/v1
+kind: StatefulSet
+metadata:
+  name: authentik-postgres
+  namespace: authentik
+  labels:
+    app: authentik-postgres
+    argocd.argoproj.io/instance: infra-authentik
+spec:
+  persistentVolumeClaimRetentionPolicy:
+    whenDeleted: Retain
+    whenScaled: Retain
+  podManagementPolicy: OrderedReady
+  serviceName: authentik-postgres
+  replicas: 1
+  revisionHistoryLimit: 10
+  selector:
+    matchLabels:
+      app: authentik-postgres
+  template:
+    metadata:
+      labels:
+        app: authentik-postgres
+    spec:
+      containers:
+        - name: postgres
+          image: postgres:16-alpine
+          ports:
+            - containerPort: 5432
+              name: postgres
+          env:
+            - name: POSTGRES_USER
+              value: authentik
+            - name: POSTGRES_PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  name: authentik-credentials
+                  key: POSTGRES_PASSWORD
+            - name: POSTGRES_DB
+              value: authentik
+            - name: POSTGRES_INITDB_ARGS
+              value: "--encoding=UTF-8 --lc-collate=C --lc-ctype=C"
+            - name: PGDATA
+              value: /var/lib/postgresql/data/pgdata
+          readinessProbe:
+            exec:
+              command: ["pg_isready", "-U", "authentik"]
+            initialDelaySeconds: 5
+            periodSeconds: 5
+          livenessProbe:
+            exec:
+              command: ["pg_isready", "-U", "authentik"]
+            initialDelaySeconds: 30
+            periodSeconds: 30
+          resources:
+            requests: { cpu: 100m, memory: 256Mi }
+            limits: { cpu: 1000m, memory: 1Gi }
+          volumeMounts:
+            - name: pgdata
+              mountPath: /var/lib/postgresql/data
+  volumeClaimTemplates:
+    - metadata:
+        name: pgdata
+      spec:
+        storageClassName: longhorn
+        accessModes: [ReadWriteOnce]
+        volumeMode: Filesystem
+        resources:
+          requests:
+            storage: 5Gi
+
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: authentik-postgres
+  namespace: authentik
+spec:
+  clusterIP: None
+  selector:
+    app: authentik-postgres
+  ports:
+    - name: postgres
+      port: 5432
+      targetPort: 5432
+
+---
+# Redis 7 — session storage + Celery broker. No persistence needed (cache).
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: authentik-redis
+  namespace: authentik
+  labels:
+    app: authentik-redis
+    argocd.argoproj.io/instance: infra-authentik
+spec:
+  replicas: 1
+  strategy:
+    type: Recreate
+  selector:
+    matchLabels:
+      app: authentik-redis
+  template:
+    metadata:
+      labels:
+        app: authentik-redis
+    spec:
+      containers:
+        - name: redis
+          image: redis:7-alpine
+          args:
+            - "--save"
+            - ""
+            - "--appendonly"
+            - "no"
+            - "--requirepass"
+            - "$(REDIS_PASSWORD)"
+          env:
+            - name: REDIS_PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  name: authentik-credentials
+                  key: REDIS_PASSWORD
+          ports:
+            - containerPort: 6379
+              name: redis
+          readinessProbe:
+            tcpSocket: { port: 6379 }
+            initialDelaySeconds: 5
+            periodSeconds: 5
+          livenessProbe:
+            tcpSocket: { port: 6379 }
+            initialDelaySeconds: 30
+            periodSeconds: 30
+          resources:
+            requests: { cpu: 50m, memory: 64Mi }
+            limits: { cpu: 500m, memory: 256Mi }
+
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: authentik-redis
+  namespace: authentik
+spec:
+  selector:
+    app: authentik-redis
+  ports:
+    - name: redis
+      port: 6379
+      targetPort: 6379
+
+---
+# Authentik server Deployment — HTTP frontend on :9000.
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: authentik-server
+  namespace: authentik
+  labels:
+    app: authentik-server
+    argocd.argoproj.io/instance: infra-authentik
+spec:
+  replicas: 1
+  strategy:
+    type: Recreate  # shares /media RWO PVC with worker
+  selector:
+    matchLabels:
+      app: authentik-server
+  template:
+    metadata:
+      labels:
+        app: authentik-server
+    spec:
+      securityContext:
+        # Authentik image runs as uid 1000 "authentik" but the Longhorn PVC mounts
+        # root:root by default. fsGroup recursively chgrp + chmod g+rwx so the
+        # non-root container can mkdir /media/public during the tenant_files migration.
+        fsGroup: 1000
+      containers:
+        - name: server
+          image: ghcr.io/goauthentik/server:2024.12.3
+          args: ["server"]
+          ports:
+            - containerPort: 9000
+              name: http
+            - containerPort: 9443
+              name: https
+          env:
+            - name: AUTHENTIK_SECRET_KEY
+              valueFrom:
+                secretKeyRef:
+                  name: authentik-credentials
+                  key: AUTHENTIK_SECRET_KEY
+            - name: AUTHENTIK_REDIS__HOST
+              value: authentik-redis
+            - name: AUTHENTIK_REDIS__PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  name: authentik-credentials
+                  key: REDIS_PASSWORD
+            - name: AUTHENTIK_POSTGRESQL__HOST
+              value: authentik-postgres
+            - name: AUTHENTIK_POSTGRESQL__NAME
+              value: authentik
+            - name: AUTHENTIK_POSTGRESQL__USER
+              value: authentik
+            - name: AUTHENTIK_POSTGRESQL__PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  name: authentik-credentials
+                  key: POSTGRES_PASSWORD
+            - name: AUTHENTIK_BOOTSTRAP_PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  name: authentik-credentials
+                  key: BOOTSTRAP_ADMIN_PASSWORD
+            - name: AUTHENTIK_BOOTSTRAP_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: authentik-credentials
+                  key: BOOTSTRAP_ADMIN_TOKEN
+            - name: AUTHENTIK_BOOTSTRAP_EMAIL
+              valueFrom:
+                secretKeyRef:
+                  name: authentik-credentials
+                  key: BOOTSTRAP_ADMIN_EMAIL
+            - name: AUTHENTIK_DISABLE_UPDATE_CHECK
+              value: "true"
+            - name: AUTHENTIK_ERROR_REPORTING__ENABLED
+              value: "false"
+            - name: AUTHENTIK_LOG_LEVEL
+              value: info
+          # First-boot Authentik can take 3+ min on the migration phase
+          # (waiting on DB lock while worker also runs migrations). Initial
+          # delays are generous so kubelet doesn't kill the pod mid-migration;
+          # periodSeconds keeps post-startup probing responsive.
+          readinessProbe:
+            httpGet:
+              path: /-/health/ready/
+              port: 9000
+            initialDelaySeconds: 60
+            periodSeconds: 10
+            timeoutSeconds: 5
+            failureThreshold: 12
+          livenessProbe:
+            httpGet:
+              path: /-/health/live/
+              port: 9000
+            initialDelaySeconds: 300
+            periodSeconds: 30
+            timeoutSeconds: 10
+            failureThreshold: 3
+          startupProbe:
+            httpGet:
+              path: /-/health/live/
+              port: 9000
+            initialDelaySeconds: 30
+            periodSeconds: 15
+            timeoutSeconds: 10
+            failureThreshold: 40  # 30s + 40*15s = 10.5 min budget
+          resources:
+            requests: { cpu: 150m, memory: 512Mi }
+            limits: { cpu: 1500m, memory: 1Gi }
+          volumeMounts:
+            - name: media
+              mountPath: /media
+      volumes:
+        - name: media
+          persistentVolumeClaim:
+            claimName: authentik-media
+
+---
+# Authentik worker Deployment — runs Celery background tasks.
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: authentik-worker
+  namespace: authentik
+  labels:
+    app: authentik-worker
+    argocd.argoproj.io/instance: infra-authentik
+spec:
+  replicas: 1
+  strategy:
+    type: Recreate  # shares /media RWO PVC with server
+  selector:
+    matchLabels:
+      app: authentik-worker
+  template:
+    metadata:
+      labels:
+        app: authentik-worker
+    spec:
+      securityContext:
+        # Same as server pod — non-root uid 1000 needs PVC group write.
+        fsGroup: 1000
+      containers:
+        - name: worker
+          image: ghcr.io/goauthentik/server:2024.12.3
+          args: ["worker"]
+          env:
+            - name: AUTHENTIK_SECRET_KEY
+              valueFrom:
+                secretKeyRef:
+                  name: authentik-credentials
+                  key: AUTHENTIK_SECRET_KEY
+            - name: AUTHENTIK_REDIS__HOST
+              value: authentik-redis
+            - name: AUTHENTIK_REDIS__PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  name: authentik-credentials
+                  key: REDIS_PASSWORD
+            - name: AUTHENTIK_POSTGRESQL__HOST
+              value: authentik-postgres
+            - name: AUTHENTIK_POSTGRESQL__NAME
+              value: authentik
+            - name: AUTHENTIK_POSTGRESQL__USER
+              value: authentik
+            - name: AUTHENTIK_POSTGRESQL__PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  name: authentik-credentials
+                  key: POSTGRES_PASSWORD
+            - name: AUTHENTIK_DISABLE_UPDATE_CHECK
+              value: "true"
+            - name: AUTHENTIK_ERROR_REPORTING__ENABLED
+              value: "false"
+            - name: AUTHENTIK_LOG_LEVEL
+              value: info
+          resources:
+            requests: { cpu: 100m, memory: 256Mi }
+            limits: { cpu: 1000m, memory: 768Mi }
+          volumeMounts:
+            - name: media
+              mountPath: /media
+      volumes:
+        - name: media
+          persistentVolumeClaim:
+            claimName: authentik-media
+
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: authentik-server
+  namespace: authentik
+spec:
+  selector:
+    app: authentik-server
+  ports:
+    - name: http
+      port: 9000
+      targetPort: 9000
+    - name: https
+      port: 9443
+      targetPort: 9443
+
+---
+# step-ca leaf certificate for id.iamworkin.lan.
+# step-ca container resolver uses pfSense Unbound, so the public A record for id.iamworkin.lan
+# MUST exist before this Certificate is applied (cert-manager HTTP-01 will silently 2h-backoff
+# otherwise). Added 2026-05-25 via scripts/pfsense-add-id-host.py.
+apiVersion: cert-manager.io/v1
+kind: Certificate
+metadata:
+  name: authentik-tls
+  namespace: authentik
+spec:
+  secretName: authentik-tls
+  dnsNames:
+    - id.iamworkin.lan
+  issuerRef:
+    name: step-ca-acme
+    kind: ClusterIssuer
+
+---
+apiVersion: traefik.io/v1alpha1
+kind: IngressRoute
+metadata:
+  name: authentik
+  namespace: authentik
+spec:
+  entryPoints: [websecure]
+  routes:
+    - match: Host(`id.iamworkin.lan`)
+      kind: Rule
+      services:
+        - name: authentik-server
+          port: 9000
+  tls:
+    secretName: authentik-tls
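The `startupProbe` comment in the server Deployment gives its budget as `30s + 40*15s`. A small sketch of that arithmetic, using the same simplification as the comment (probe timeouts and kubelet jitter are ignored). The `Probe` class is illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    initial_delay: int      # initialDelaySeconds
    period: int             # periodSeconds
    failure_threshold: int  # failureThreshold

    def budget_seconds(self) -> int:
        """Delay before the first probe plus the failures tolerated after it."""
        return self.initial_delay + self.failure_threshold * self.period


# Values copied from the authentik-server startupProbe in the manifest.
STARTUP = Probe(initial_delay=30, period=15, failure_threshold=40)

if __name__ == "__main__":
    secs = STARTUP.budget_seconds()
    print(f"startup budget: {secs}s ({secs / 60:.1f} min)")  # startup budget: 630s (10.5 min)
```

The 10.5 minute result leaves headroom over the "3+ min" first-boot migration mentioned in the probe comment.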
--- a/apps/brochure/README.md
+++ b/apps/brochure/README.md
@@ -1,27 +0,0 @@
-# FlowerCore Brochure
-
-`apps/brochure` hosts the public brochure split from `FlowerCore.Intranet.Web`.
-ArgoCD's `apps/*` ApplicationSet will create `infra-brochure` after this
-directory lands on `main`.
-
-## Runtime
-
-- Host: `https://brochure.flowercore.io`
-- Namespace: `brochure`
-- Deployment: `brochure-web`
-- Image: `localhost/fc-brochure-web:v20260524-sprint32`
-- Port: `8080`
-- Public route method allowlist: `GET` and `HEAD`
-
-## Operator Actions
-
-1. Publish and import `localhost/fc-brochure-web:v20260524-sprint32` to every
-   RKE2 node before sync, using the same podman save + `ctr images import`
-   flow as the Intranet deployment.
-2. Create the Cloudflare DNS record for `brochure.flowercore.io` pointing at
-   the FlowerCore public edge.
-3. Verify `infra-brochure` appears in ArgoCD, the certificate becomes Ready,
-   and `GET https://brochure.flowercore.io/` returns `200`.
-
-The route intentionally does not expose `/ops/*` or `/admin/*`; the Brochure
-web app returns `404` for those paths and Traefik only forwards read methods.
--- a/apps/brochure/brochure.yaml
+++ b/apps/brochure/brochure.yaml
@@ -1,131 +0,0 @@
-# FlowerCore Brochure public host
-#
-# Thin Blazor host for public What's New, walkthrough, and gallery content
-# carved out of FlowerCore.Intranet.Web. The ApplicationSet creates
-# infra-brochure from this directory after merge.
----
-apiVersion: v1
-kind: Namespace
-metadata:
-  name: brochure
-  labels:
-    app.kubernetes.io/part-of: flowercore
----
-apiVersion: apps/v1
-kind: Deployment
-metadata:
-  name: brochure-web
-  namespace: brochure
-  labels:
-    app: brochure-web
-    app.kubernetes.io/name: brochure-web
-    app.kubernetes.io/part-of: flowercore
-spec:
-  replicas: 1
-  revisionHistoryLimit: 3
-  selector:
-    matchLabels:
-      app: brochure-web
-  template:
-    metadata:
-      labels:
-        app: brochure-web
-        app.kubernetes.io/name: brochure-web
-        app.kubernetes.io/part-of: flowercore
-    spec:
-      containers:
-        - name: brochure-web
-          image: localhost/fc-brochure-web:v20260524-sprint32
-          imagePullPolicy: Never
-          ports:
-            - containerPort: 8080
-              name: http
-          env:
-            - name: ASPNETCORE_ENVIRONMENT
-              value: Production
-            - name: ASPNETCORE_URLS
-              value: "http://+:8080"
-          resources:
-            requests:
-              cpu: "25m"
-              memory: "128Mi"
-            limits:
-              cpu: "500m"
-              memory: "512Mi"
-          readinessProbe:
-            httpGet:
-              path: /health
-              port: http
-            initialDelaySeconds: 10
-            periodSeconds: 10
-          livenessProbe:
-            httpGet:
-              path: /health
-              port: http
-            initialDelaySeconds: 30
-            periodSeconds: 30
-          securityContext:
-            runAsNonRoot: true
-            runAsUser: 1654
-            runAsGroup: 1654
-            allowPrivilegeEscalation: false
-            readOnlyRootFilesystem: true
-            capabilities:
-              drop:
-                - ALL
-          volumeMounts:
-            - name: tmp
-              mountPath: /tmp
-      volumes:
-        - name: tmp
-          emptyDir: {}
----
-apiVersion: v1
-kind: Service
-metadata:
-  name: brochure-web
-  namespace: brochure
-  labels:
-    app: brochure-web
-    app.kubernetes.io/name: brochure-web
-    app.kubernetes.io/part-of: flowercore
-spec:
-  type: ClusterIP
-  selector:
-    app: brochure-web
-  ports:
-    - name: http
-      port: 8080
-      targetPort: http
----
-apiVersion: cert-manager.io/v1
-kind: Certificate
-metadata:
-  name: brochure-web-tls
-  namespace: brochure
-spec:
-  secretName: brochure-web-tls
-  issuerRef:
-    name: step-ca-acme
-    kind: ClusterIssuer
-  dnsNames:
-    - brochure.flowercore.io
-  duration: 720h
-  renewBefore: 240h
----
-apiVersion: traefik.io/v1alpha1
-kind: IngressRoute
-metadata:
-  name: brochure-web-public
-  namespace: brochure
-spec:
-  entryPoints:
-    - websecure
-  routes:
-    - match: Host(`brochure.flowercore.io`) && (Method(`GET`) || Method(`HEAD`))
-      kind: Rule
-      services:
-        - name: brochure-web
-          port: 8080
-  tls:
-    secretName: brochure-web-tls
--- a/apps/fc-devicemgmt/argocd-application.yaml
+++ b/apps/fc-devicemgmt/argocd-application.yaml
@@ -1,33 +0,0 @@
-# Explicit ArgoCD Application shape for bootstrap/review.
-#
-# The live bluejay-infra ApplicationSet already discovers apps/* directories
-# and creates this same Application name (`infra-fc-devicemgmt`) automatically.
-# Keep repoURL on the internal Gitea ClusterIP URL; ArgoCD does not trust the
-# external step-ca HTTPS endpoint.
-apiVersion: argoproj.io/v1alpha1
-kind: Application
-metadata:
-  name: infra-fc-devicemgmt
-  namespace: argocd
-  labels:
-    app.kubernetes.io/name: fc-devicemgmt
-    app.kubernetes.io/part-of: flowercore
-    app.kubernetes.io/managed-by: argocd
-    flowercore.io/tenant-id: system
-    flowercore.io/created-by: bluejay-infra
-spec:
-  project: default
-  source:
-    repoURL: http://gitea-clusterip.gitea.svc.cluster.local:3000/bluejay/bluejay-infra.git
-    targetRevision: main
-    path: apps/fc-devicemgmt
-  destination:
-    server: https://kubernetes.default.svc
-    namespace: fc-devicemgmt
-  syncPolicy:
-    automated:
-      prune: true
-      selfHeal: true
-    syncOptions:
-      - CreateNamespace=true
-      - ServerSideApply=true
--- a/apps/fc-devicemgmt/deployment-operator.yaml
+++ b/apps/fc-devicemgmt/deployment-operator.yaml
@@ -47,7 +47,7 @@ spec:
        fsGroupChangePolicy: OnRootMismatch
      containers:
        - name: operator
-          image: localhost/fc-devicemgmt-operator:v20260512-cx5
+          image: localhost/fc-devicemgmt-operator:v20260519-sp34cl3-fix
          imagePullPolicy: Never
          ports:
            - name: metrics
--- a/apps/fc-devicemgmt/deployment-web.yaml
+++ b/apps/fc-devicemgmt/deployment-web.yaml
@@ -4,6 +4,22 @@
 # Sprint 9+ lane. This manifest is static-valid without requiring the image to
 # exist yet; import localhost/fc-devicemgmt-web:<tag> to all schedulable RKE2
 # nodes before letting ArgoCD sync a live rollout.
+#
+# SCALED TO 0 — 2026-05-19 morning-routine cleanup.
+# The Web pod cannot start until TWO upstream gaps close:
+#   1. MySQL DB instance `flowercore_devicemgmt` (user `fc_devicemgmt`) is
+#      provisioned via fc-mysql Manager. The cluster currently has ZERO
+#      MySqlInstanceCrds and no `mysql.fc-mysql.svc:3306` Service, so the
+#      deployment-web container env `FlowerCore__Database__Host=mysql.fc-mysql.svc`
+#      points at nothing. Provision via the fc-mysql Manager UI/REST/MCP.
+#   2. 1Password vault item `IAmWorkin/FlowerCore DeviceManagement Runtime`
+#      with 5 fields (DB-Password, mtls-ca.pem, mtls-client.crt, mtls-client.key,
+#      mtls-chain.pem) — see apps/fc-devicemgmt/1password-item.yaml. Mint mTLS
+#      from step-ca-agent ClusterIssuer per ADR-126; DB-Password must match the
+#      password configured for the MySQL user.
+# Re-enable: change replicas back to 2 after both gaps close. The image tag
+# in this file (v20260512-cx5) MAY also need a refresh — it predates the
+# Sprint 34 Cl-3 operator fix; Web may have an analogous bug.
 apiVersion: apps/v1
 kind: Deployment
 metadata:
@@ -20,7 +36,7 @@ metadata:
  annotations:
    flowercore.io/traceability-standard: k8s-pod-ownership-and-traceability-standard
 spec:
-  replicas: 2
+  replicas: 0
  revisionHistoryLimit: 3
  selector:
    matchLabels:
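The header comment in `deployment-web.yaml` ties the re-enable step to two conditions: the MySQL Service exists, and the 1Password item carries all five fields. A hedged sketch of that gate as a pure function. The field names come from the comment; `can_scale_up` and its inputs are hypothetical and do not query the cluster or 1Password.

```python
# Field names as listed in the deployment-web.yaml header comment.
REQUIRED_FIELDS = frozenset({
    "DB-Password",
    "mtls-ca.pem",
    "mtls-client.crt",
    "mtls-client.key",
    "mtls-chain.pem",
})


def missing_fields(present_fields) -> list:
    """Return the required 1Password fields that are not present, sorted."""
    return sorted(REQUIRED_FIELDS - set(present_fields))


def can_scale_up(db_service_exists: bool, present_fields) -> bool:
    """True only when both upstream gaps from the comment are closed."""
    return db_service_exists and not missing_fields(present_fields)
```

How `db_service_exists` and `present_fields` are gathered is left open here; an operator would supply them from whatever tooling they already use.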
--- a/apps/github-runner/.gitattributes
+++ b/apps/github-runner/.gitattributes
@@ -0,0 +1,2 @@
+*.sh text eol=lf
+Dockerfile text eol=lf
--- a/apps/github-runner/Dockerfile
+++ b/apps/github-runner/Dockerfile
@@ -0,0 +1,44 @@
+FROM myoung34/github-runner:latest
+
+ARG RUBY_VERSION=3.3.11
+ARG RUBY_MINOR=3.3
+ARG RUBY_BUILD_VERSION=v20260326
+ARG RUNNER_UID=1001
+ARG RUNNER_GID=1001
+
+ENV RUNNER_TOOL_CACHE=/home/runner/_tool
+ENV RUNNER_RUBY_TOOLCACHE=/opt/runner-toolcache
+ENV PATH="/home/runner/_tool/Ruby/${RUBY_MINOR}/x64/bin:/opt/runner-toolcache/Ruby/${RUBY_MINOR}/x64/bin:${PATH}"
+
+USER root
+
+RUN apt-get update \
+    && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
+        autoconf \
+        bison \
+        build-essential \
+        ca-certificates \
+        curl \
+        libdb-dev \
+        libffi-dev \
+        libgdbm-dev \
+        libgmp-dev \
+        libncurses-dev \
+        libreadline-dev \
+        libssl-dev \
+        libyaml-dev \
+        patch \
+        pkg-config \
+        uuid-dev \
+        zlib1g-dev \
+    && curl -fsSL "https://github.com/rbenv/ruby-build/archive/refs/tags/${RUBY_BUILD_VERSION}.tar.gz" -o /tmp/ruby-build.tar.gz \
+    && mkdir -p /tmp/ruby-build \
+    && tar -xzf /tmp/ruby-build.tar.gz --strip-components=1 -C /tmp/ruby-build \
+    && /tmp/ruby-build/install.sh \
+    && rm -rf /tmp/ruby-build /tmp/ruby-build.tar.gz /var/lib/apt/lists/*
+
+COPY install-ruby-toolcache.sh /usr/local/bin/install-ruby-toolcache.sh
+
+RUN chmod +x /usr/local/bin/install-ruby-toolcache.sh \
+    && RUBY_VERSION="${RUBY_VERSION}" RUBY_MINOR="${RUBY_MINOR}" TOOLCACHE_ROOT="${RUNNER_RUBY_TOOLCACHE}" RUNNER_UID="${RUNNER_UID}" RUNNER_GID="${RUNNER_GID}" /usr/local/bin/install-ruby-toolcache.sh \
+    && ruby -v
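The Dockerfile's `PATH` lists the runner-home toolcache ahead of the baked `/opt/runner-toolcache` copy. The sketch below models that lookup order as plain data, to show which `ruby` wins before and after the home directory is seeded. It is an illustration of first-match `PATH` resolution and does not touch the filesystem. `build_path` and `resolve` are hypothetical names.

```python
from __future__ import annotations

import posixpath


def build_path(ruby_minor: str, base_path: str) -> list[str]:
    """Rebuild the search order from the Dockerfile's ENV PATH line."""
    return [
        f"/home/runner/_tool/Ruby/{ruby_minor}/x64/bin",
        f"/opt/runner-toolcache/Ruby/{ruby_minor}/x64/bin",
        *base_path.split(":"),
    ]


def resolve(cmd: str, path_dirs: list[str], existing: set[str]) -> str | None:
    """Return the first PATH entry that holds cmd, as a shell lookup would."""
    for directory in path_dirs:
        candidate = posixpath.join(directory, cmd)
        if candidate in existing:
            return candidate
    return None
```

With only the baked copy present, `resolve("ruby", ...)` lands on the `/opt/runner-toolcache` entry, which is presumably how the final `ruby -v` in the build step finds an interpreter. Once `/home/runner/_tool` is populated, that entry takes precedence.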
--- a/apps/github-runner/README.md
+++ b/apps/github-runner/README.md
@@ -7,12 +7,17 @@ Deployments with `kubectl`; update this manifest and let ArgoCD reconcile.

 All repo-scoped Linux runners use:

+- `localhost/fc-github-runner:v20260520-ruby3.3.11`, derived from
+  `myoung34/github-runner:latest`
 - `ACCESS_TOKEN` from the `github-runner-token` Secret
 - `RUN_AS_ROOT=false`
 - `EPHEMERAL=true`
 - `LABELS=self-hosted,linux,fc-build-linux`
 - writable non-root paths under `/home/runner` for .NET, NuGet, XDG cache, and
  Actions tool cache
+- Ruby 3.3.11 seeded into `/home/runner/_tool/Ruby/3.3/x64` from the baked
+  `/opt/runner-toolcache` copy so `ruby/setup-ruby@v1` can discover it on
+  self-hosted `ubuntu-20.04-x64` runners

 `github-runner` for `FlowerCore.Common` is single-replica because it retains the
 original Longhorn ReadWriteOnce NuGet PVC. Every other repo-scoped runner uses
@@ -28,9 +33,33 @@ Sprint 32 final long-tail wave adds 16 two-replica Deployments:
 `FlowerCore.Provisioning`, `FlowerCore.Redis`, `FlowerCore.MessageBoard`, and
 `FlowerCore.MenuBoard`.

-Sprint 37 Cx-2 closes the audited Linux runner gaps for
-`FlowerCore.DeviceManagement` and `FlowerCore.WorldBuilder` with the same
-two-replica `emptyDir` pattern.
+## Image Build
+
+Ruby is baked with a pinned `ruby-build` release and Ruby patch version. The pod
+still mounts an `emptyDir` over `/home/runner`, so the `setup-runner-home` init
+container copies the baked toolcache from `/opt/runner-toolcache/Ruby` into
+`/home/runner/_tool/Ruby` before the runner container starts.
+
+```bash
+cd apps/github-runner
+podman build -t localhost/fc-github-runner:v20260520-ruby3.3.11 .
+podman run --rm localhost/fc-github-runner:v20260520-ruby3.3.11 ruby -v
+podman run --rm localhost/fc-github-runner:v20260520-ruby3.3.11 \
+  test -f /opt/runner-toolcache/Ruby/3.3/x64.complete
+podman save localhost/fc-github-runner:v20260520-ruby3.3.11 \
+  -o fc-github-runner-v20260520-ruby3.3.11.tar
+```
+
+Import the saved image on every schedulable RKE2 node before ArgoCD rolls the
+Deployments:
+
+```bash
+for node in rke2-server rke2-agent1 rke2-agent2; do
+  scp fc-github-runner-v20260520-ruby3.3.11.tar "$node:/tmp/"
+  ssh "$node" 'sudo ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images rm localhost/fc-github-runner:v20260520-ruby3.3.11 || true'
+  ssh "$node" 'sudo ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images import /tmp/fc-github-runner-v20260520-ruby3.3.11.tar'
+done
+```

 ## Post-Merge Proof

@@ -40,6 +69,14 @@ After the PR is merged and ArgoCD syncs, verify the runner fleet:
 kubectl -n github-runner get deploy,pods,pvc
 ```

+Verify the Ruby toolcache in a fresh pod:
+
+```bash
+kubectl -n github-runner exec deploy/github-runner-puppet -c runner -- ruby -v
+kubectl -n github-runner exec deploy/github-runner-puppet -c runner -- sh -c \
+  'echo "$RUNNER_TOOL_CACHE" && test -f "$RUNNER_TOOL_CACHE/Ruby/3.3/x64.complete"'
+```
+
 Verify GitHub registration for the repo-scoped runners:

 ```bash
@@ -51,7 +88,7 @@ for repo in FlowerCore.Common FlowerCore.Shared.Pos FlowerCore.Puppet FlowerCore
            FlowerCore.Distribution FlowerCore.Scoreboard FlowerCore.SegmentDisplay \
            FlowerCore.Signage.Contracts FlowerCore.SignalControl FlowerCore.Intranet.Web \
            FlowerCore.Provisioning FlowerCore.Redis FlowerCore.MessageBoard \
-            FlowerCore.MenuBoard FlowerCore.DeviceManagement FlowerCore.WorldBuilder; do
+            FlowerCore.MenuBoard; do
  echo "=== $repo ==="
  gh api "/repos/astoltz/$repo/actions/runners" \
    --jq '.runners[] | select(.labels[].name == "fc-build-linux") | {name,status,busy,labels:[.labels[].name]}'
@@ -68,25 +105,15 @@ gh run list --repo astoltz/FlowerCore.Shared.Pos \
 If the latest run is still queued after runner registration, rerun the workflow
 from GitHub Actions and verify it lands on an `rke2-linux-*` runner.

-## Sprint 37 Cx-2 Gap Audit
-
-The 2026-05-18 GitHub workflow scan found these remaining repos with
-`runs-on: [self-hosted, linux, fc-build-linux]` but no K8s runner Deployment:
-`FlowerCore.AiStation.Linux`, `FlowerCore.PHP`, `FlowerCore.PiManager`,
-`FlowerCore.Shared.Barcodes`, `FlowerCore.Shared.Lookup`,
-`FlowerCore.Shared.Nodes`, `FlowerCore.Shared.PrintClient`,
-`FlowerCore.Shared.Relay`, `FlowerCore.Shared.ShowRunner`, and
-`FlowerCore.Shared.Storage`.
-
-Mixed/platform repos also have Linux workflow legs but need owner review before
-adding Linux runner Deployments: `FlowerCore.Library.Mac`,
-`FlowerCore.Signage.Agent.AppleTv`, and `FlowerCore.Signage.Player.Wpf`.
-
 ## Failure Notes

 - `actions/setup-dotnet` permission error at `/usr/share/dotnet`: check that
  `DOTNET_INSTALL_DIR=/home/runner/.dotnet` and related cache env vars are
  present on the runner pod.
+- `ruby/setup-ruby@v1` says self-hosted runners must install Ruby in
+  `$RUNNER_TOOL_CACHE`: check that the init container copied
+  `/opt/runner-toolcache/Ruby` into `/home/runner/_tool/Ruby` and that
+  `/home/runner/_tool/Ruby/3.3/x64.complete` exists.
 - `404` during runner registration: the fine-grained PAT is valid but missing
  repository access for that repo. Add the repo to the PAT access list; the PAT
  value does not change.
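The README describes an init container that copies `/opt/runner-toolcache/Ruby` into `/home/runner/_tool/Ruby`, followed by a check for `x64.complete`. The init container itself is not shown in this diff, so the following is only a sketch of the copy-then-verify idea under that assumption, run against a temporary directory. `seed_toolcache`, `toolcache_ready`, and `demo` are hypothetical names.

```python
import os
import shutil
import tempfile


def seed_toolcache(baked_ruby: str, home_ruby: str) -> None:
    """Copy the baked Ruby tree into a fresh runner home, keeping symlinks."""
    # symlinks=True copies the 3.3 -> 3.3.11 link as a link instead of
    # duplicating the interpreter under a second directory.
    shutil.copytree(baked_ruby, home_ruby, symlinks=True)


def toolcache_ready(tool_cache: str, minor: str = "3.3") -> bool:
    """Mirror the README check: test -f "$RUNNER_TOOL_CACHE/Ruby/3.3/x64.complete"."""
    return os.path.isfile(os.path.join(tool_cache, "Ruby", minor, "x64.complete"))


def demo() -> bool:
    """Build a stand-in baked tree, seed a stand-in home, and run the check."""
    with tempfile.TemporaryDirectory() as tmp:
        baked = os.path.join(tmp, "opt", "runner-toolcache", "Ruby")
        os.makedirs(os.path.join(baked, "3.3.11", "x64", "bin"))
        open(os.path.join(baked, "3.3.11", "x64.complete"), "w").close()
        os.symlink("3.3.11", os.path.join(baked, "3.3"))

        tool_cache = os.path.join(tmp, "home", "runner", "_tool")
        seed_toolcache(baked, os.path.join(tool_cache, "Ruby"))
        return toolcache_ready(tool_cache)
```

`shutil.copytree` expects the destination not to exist, which matches a fresh `emptyDir` mounted over `/home/runner`. A copy that follows symlinks would also pass the check, at the cost of a second full copy of the interpreter.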
--- a/apps/github-runner/github-runner.yaml
+++ b/apps/github-runner/github-runner.yaml
--- a/apps/github-runner/install-ruby-toolcache.sh
+++ b/apps/github-runner/install-ruby-toolcache.sh
@@ -0,0 +1,19 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+RUBY_VERSION="${RUBY_VERSION:-3.3.11}"
+RUBY_MINOR="${RUBY_MINOR:-3.3}"
+TOOLCACHE_ROOT="${TOOLCACHE_ROOT:-/opt/runner-toolcache}"
+RUNNER_UID="${RUNNER_UID:-1001}"
+RUNNER_GID="${RUNNER_GID:-1001}"
+RUBY_PREFIX="${TOOLCACHE_ROOT}/Ruby/${RUBY_VERSION}/x64"
+
+mkdir -p "${TOOLCACHE_ROOT}/Ruby"
+RUBY_CONFIGURE_OPTS="${RUBY_CONFIGURE_OPTS:---disable-install-doc --disable-yjit}" ruby-build "${RUBY_VERSION}" "${RUBY_PREFIX}"
+
+touch "${TOOLCACHE_ROOT}/Ruby/${RUBY_VERSION}/x64.complete"
+ln -sfn "${RUBY_VERSION}" "${TOOLCACHE_ROOT}/Ruby/${RUBY_MINOR}"
+
+"${RUBY_PREFIX}/bin/ruby" -v
+chown -R "${RUNNER_UID}:${RUNNER_GID}" "${TOOLCACHE_ROOT}"
+chmod -R a+rX "${TOOLCACHE_ROOT}"
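`install-ruby-toolcache.sh` installs into `Ruby/<version>/x64`, touches `Ruby/<version>/x64.complete`, and links `Ruby/<minor>` to `<version>`. The sketch below reproduces only that directory layout in a temporary directory (no Ruby is built), to show why paths written with the minor version, such as `Ruby/3.3/x64.complete`, resolve. `lay_out_toolcache` and `demo` are hypothetical names.

```python
import os
import tempfile


def lay_out_toolcache(root: str, version: str = "3.3.11", minor: str = "3.3") -> str:
    """Create the toolcache skeleton and return the install prefix."""
    ruby_dir = os.path.join(root, "Ruby")
    prefix = os.path.join(ruby_dir, version, "x64")
    # Stands in for the ruby-build install into the prefix.
    os.makedirs(os.path.join(prefix, "bin"), exist_ok=True)

    # touch .../Ruby/<version>/x64.complete
    with open(os.path.join(ruby_dir, version, "x64.complete"), "a"):
        pass

    # ln -sfn <version> .../Ruby/<minor>  (relative link, replaced if present)
    link = os.path.join(ruby_dir, minor)
    if os.path.lexists(link):
        os.remove(link)
    os.symlink(version, link)
    return prefix


def demo() -> dict:
    with tempfile.TemporaryDirectory() as root:
        prefix = lay_out_toolcache(root)
        lay_out_toolcache(root)  # a second run must not fail
        minor_dir = os.path.join(root, "Ruby", "3.3")
        return {
            "link_target": os.readlink(minor_dir),
            "marker_via_minor": os.path.isfile(os.path.join(minor_dir, "x64.complete")),
            "same_prefix": os.path.samefile(os.path.join(minor_dir, "x64"), prefix),
        }
```

Because the link target is the relative name `3.3.11`, the layout stays valid when the whole `Ruby` directory is moved or copied elsewhere with links preserved.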
--- a/apps/monitoring/noc-monitoring.yaml
+++ b/apps/monitoring/noc-monitoring.yaml
@@ -280,13 +280,14 @@ data:
              printer_model: "NuPrint 210"

      # Print.Web health (Blazor app on edge2:5200)
+      # Target `/health` (anonymous) — root path requires API key auth and returns 401.
      - job_name: "probe-printweb"
        metrics_path: /probe
        params:
          module: [http_2xx]
        scrape_interval: 30s
        static_configs:
-          - targets: ["http://10.0.57.16:5200/"]
+          - targets: ["http://10.0.57.16:5200/health"]
            labels:
              instance: "print-web"
              service: "print-web"
@@ -729,7 +730,7 @@ data:
            expr: |
              kube_deployment_status_replicas_ready{
                namespace="github-runner",
-                deployment=~"github-runner(|-.+)"
+                deployment=~"github-runner(|-(sharedpos|puppet|signage|dms|telephony|print-web|chat|mysql|kiosk-linux))"
              } == 0
            for: 5m
            labels:
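The hunk above narrows the `deployment=~` matcher from "any `github-runner-*`" to an explicit list. Prometheus anchors regex matchers at both ends, so `re.fullmatch` is a reasonable stand-in for checking which names each pattern covers. This is a sketch: Prometheus uses RE2 syntax, which differs from Python's `re` in general but not for these patterns. `github-runner-menuboard` is a made-up deployment name used only to show an unlisted suffix.

```python
import re

# Patterns copied from the removed and added lines of the alert rule.
BROAD = r"github-runner(|-.+)"
NARROWED = (
    r"github-runner(|-(sharedpos|puppet|signage|dms|telephony"
    r"|print-web|chat|mysql|kiosk-linux))"
)


def matches(pattern: str, deployment: str) -> bool:
    """Fully anchored match, like a Prometheus =~ label matcher."""
    return re.fullmatch(pattern, deployment) is not None


if __name__ == "__main__":
    for name in ("github-runner", "github-runner-puppet", "github-runner-menuboard"):
        print(name, matches(BROAD, name), matches(NARROWED, name))
```

Under the narrowed pattern, a deployment with an unlisted suffix no longer trips the zero-ready-replicas alert, while the bare `github-runner` deployment and the listed suffixes still do.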
@@ -1273,24 +1274,55 @@ metadata:
 data:
  notify.py: |
    #!/usr/bin/env python3
-    """HTTP->IRC alert relay with thermal printer forwarding for Grafana webhooks.
-    Listens on :9119, posts to #alerts on UnrealIRCd via raw IRC protocol.
-    Alerts tagged alert_channel=thermal_print also POST to Print.Web /api/print/alert.
+    """HTTP->IRC alert relay with thermal-printer DIGEST forwarding.
+
+    Listens on :9119, posts to #alerts on UnrealIRCd, forwards to Print.Web
+    /api/print/alert. Thermal printing is BATCHED into hourly digests by
+    default so the printer no longer spam-fires per Grafana webhook.
+
+    Routing (per Grafana webhook alert):
+      - IRC: always per-event (operator likes the stream)
+      - Thermal printer:
+          * severity in {critical,disaster,page} OR
+            label alert_channel=thermal_print_immediate -> print NOW
+          * label alert_channel=thermal_print -> enqueue into hourly digest
+          * everything else -> IRC only
+      - RESOLVED webhooks remove the alert from the digest buffer
+
+    Env vars (defaults preserve old behavior on first deploy):
+      THERMAL_PRINT_ENABLED  default "true"   - master kill switch
+      BATCH_INTERVAL_MIN     default "60"     - minutes between digest prints
+      BATCH_MAX_PENDING      default "50"     - force-flush threshold
+
+    HTTP surface:
+      POST /         - Grafana webhook entry
+      POST /flush    - manual digest flush (idempotent)
+      GET  /         - status + config + buffer depth + stats
    """
-    import json, socket, sys, time
+    import json, os, socket, sys, threading, time
+    from collections import defaultdict
+    from datetime import datetime, timezone
    from http.server import HTTPServer, BaseHTTPRequestHandler
    from urllib.request import Request, urlopen
-    from urllib.error import URLError

-    IRC_HOST = "unrealircd.irc.svc"  # short name: CoreDNS ndots:5 + iamworkin.lan template hijacks full .cluster.local (see memory)
-    IRC_PORT = 6667
-    IRC_NICK = "grafana-bot"
-    IRC_CHANNEL = "#alerts"
-    PRINT_WEB_URL = "http://10.0.57.16:5200/api/print/alert"
-    PRINT_ENABLED = True
+    THERMAL_PRINT_ENABLED = os.environ.get("THERMAL_PRINT_ENABLED", "true").lower() == "true"
+    BATCH_INTERVAL_MIN    = int(os.environ.get("BATCH_INTERVAL_MIN", "60"))
+    BATCH_MAX_PENDING     = int(os.environ.get("BATCH_MAX_PENDING", "50"))
+
+    IRC_HOST      = os.environ.get("IRC_HOST", "unrealircd.irc.svc")
+    IRC_PORT      = int(os.environ.get("IRC_PORT", "6667"))
+    IRC_NICK      = os.environ.get("IRC_NICK", "grafana-bot")
+    IRC_CHANNEL   = os.environ.get("IRC_CHANNEL", "#alerts")
+    PRINT_WEB_URL = os.environ.get("PRINT_WEB_URL", "http://10.0.57.16:5200/api/print/alert")
+
+    _buffer_lock = threading.Lock()
+    _buffer = {}   # fingerprint -> {"alert": dict, "first_seen": float, "last_seen": float}
+    _last_flush_time = time.time()
+    _stats = {"webhooks_received": 0, "irc_sent": 0, "print_immediate": 0,
+              "digest_flushed": 0, "buffer_dedup": 0, "buffer_added": 0,
+              "buffer_resolved": 0, "started_at": time.time()}

    def send_irc(message):
-        """Connect, handle PING, join, send, quit."""
        try:
            sock = socket.create_connection((IRC_HOST, IRC_PORT), timeout=15)
            sock.sendall(f"NICK {IRC_NICK}\r\n".encode())
@@ -1323,52 +1355,137 @@ data:
            time.sleep(0.5)
            sock.sendall(b"QUIT :alert delivered\r\n")
            sock.close()
+            _stats["irc_sent"] += 1
            return True
        except Exception as e:
            print(f"[irc-notify] IRC send failed: {e}", file=sys.stderr)
            return False

-    def send_thermal_print(alert):
-        if not PRINT_ENABLED: return
-        labels = alert.get("labels", {})
-        annotations = alert.get("annotations", {})
-        status = alert.get("status", "firing").upper()
-        summary = annotations.get("summary", "")
-        description = annotations.get("description", "")
-        runbook = annotations.get("runbook", "")
-        # Build a useful message: summary + description + runbook steps
-        parts = []
-        if summary: parts.append(summary)
-        if description and description != summary: parts.append(description)
-        if runbook: parts.append("STEPS: " + runbook)
-        message = " | ".join(parts) if parts else labels.get("alertname", "Unknown alert")
-        payload = {
-            "title": labels.get("alertname", "Unknown"),
-            "severity": labels.get("severity", "warning").capitalize(),
-            "host": labels.get("instance", labels.get("host", "unknown")),
-            "message": message,
-            "eventId": alert.get("fingerprint", ""),
-            "source": "Grafana",
-            "status": "RESOLVED" if status == "RESOLVED" else "PROBLEM",
-            "acknowledged": False
-        }
+    def post_thermal(payload, kind):
+        if not THERMAL_PRINT_ENABLED:
+            print(f"[irc-notify] thermal disabled; skip {kind} ({payload.get('title','?')[:40]})", file=sys.stderr)
+            return False
        try:
            req = Request(PRINT_WEB_URL, data=json.dumps(payload).encode("utf-8"),
                          headers={"Content-Type": "application/json"}, method="POST")
            resp = urlopen(req, timeout=10)
-            print(f"[irc-notify] Thermal print sent: {resp.read().decode()}", file=sys.stderr)
+            if kind == "immediate": _stats["print_immediate"] += 1
+            print(f"[irc-notify] thermal {kind} sent: {payload.get('title','?')[:50]}", file=sys.stderr)
+            return True
        except Exception as e:
-            print(f"[irc-notify] Thermal print failed: {e}", file=sys.stderr)
+            print(f"[irc-notify] thermal {kind} failed: {e}", file=sys.stderr)
+            return False

-    def should_print(alert):
+    def fingerprint_of(alert):
+        fp = alert.get("fingerprint", "")
+        if fp: return fp
        labels = alert.get("labels", {})
-        if labels.get("alert_channel") == "thermal_print": return True
-        if labels.get("severity", "").lower() in ("critical", "disaster"): return True
-        if alert.get("status", "").upper() == "RESOLVED": return False
-        return False
+        target = labels.get("pod") or labels.get("instance") or labels.get("deployment") or labels.get("statefulset") or labels.get("namespace") or ""
+        return f"{labels.get('alertname','?')}/{labels.get('namespace','')}/{target}"
+
+    def is_critical(alert):
+        return alert.get("labels", {}).get("severity", "").lower() in ("critical", "disaster", "page")
+
+    def is_immediate_label(alert):
+        return alert.get("labels", {}).get("alert_channel") == "thermal_print_immediate"
+
+    def is_batched_label(alert):
+        return alert.get("labels", {}).get("alert_channel") == "thermal_print"
+
+    def add_to_digest(alert):
+        """Add an alert to the digest buffer. Returns True if the buffer GREW
+        (new fingerprint), False if it was a dedup, resolution, or no-op.
+        """
+        if not THERMAL_PRINT_ENABLED: return False
+        fp = fingerprint_of(alert)
+        status = alert.get("status", "firing").lower()
+        with _buffer_lock:
+            if status == "resolved":
+                if fp in _buffer:
+                    del _buffer[fp]
+                    _stats["buffer_resolved"] += 1
+                return False
+            if fp in _buffer:
+                _buffer[fp]["last_seen"] = time.time()
+                _buffer[fp]["alert"] = alert
+                _stats["buffer_dedup"] += 1
+                return False
+            _buffer[fp] = {"alert": alert, "first_seen": time.time(), "last_seen": time.time()}
+            _stats["buffer_added"] += 1
+            return True
+
+    def build_digest_payload():
+        with _buffer_lock:
+            items = list(_buffer.values())
+        if not items: return None
+        by_name = defaultdict(list)
+        for item in items:
+            labels = item["alert"].get("labels", {})
+            by_name[labels.get("alertname", "Unknown")].append(item)
+        lines = []
+        for name, group in sorted(by_name.items()):
+            targets = []
+            for it in group[:5]:
+                labels = it["alert"].get("labels", {})
+                t = (labels.get("pod") or labels.get("instance") or labels.get("deployment")
+                     or labels.get("statefulset") or labels.get("namespace") or "?")
+                targets.append(t)
+            more = f" (+{len(group)-5})" if len(group) > 5 else ""
+            sevs = sorted({it["alert"].get("labels", {}).get("severity", "warning") for it in group})
+            lines.append(f"[{'/'.join(sevs)}] {name} x{len(group)}: {', '.join(targets)}{more}")
+        now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
+        title = f"Alert digest: {len(items)} firing"
+        body = "\n".join([
+            f"=== {title} ===",
+            f"as of {now}",
+            "",
+            *lines,
+            "",
+            "Stream: #alerts (IRC)  |  Triage: grafana-noc1.iamworkin.lan",
+            "Force-flush: POST irc-notify.monitoring.svc:9119/flush",
+        ])
+        return {"title": title, "severity": "Warning", "host": "monitoring",
+                "message": body, "eventId": f"digest-{int(time.time())}",
+                "source": "Grafana digest", "status": "PROBLEM", "acknowledged": False}
+
+    def flush_digest():
+        payload = build_digest_payload()
+        if payload is None:
+            print("[irc-notify] flush: buffer empty, no digest sent", file=sys.stderr)
+            return False
+        sent = post_thermal(payload, "digest")
+        with _buffer_lock:
+            _buffer.clear()
+        if sent: _stats["digest_flushed"] += 1
+        return sent
+
+    def digest_loop():
+        global _last_flush_time
+        while True:
+            try:
+                now = time.time()
+                elapsed = now - _last_flush_time
+                if elapsed >= BATCH_INTERVAL_MIN * 60:
+                    print(f"[irc-notify] digest tick: interval reached ({BATCH_INTERVAL_MIN}m); buffer={len(_buffer)}", file=sys.stderr)
+                    flush_digest()
+                    _last_flush_time = now
+                elif len(_buffer) >= BATCH_MAX_PENDING:
+                    print(f"[irc-notify] digest tick: buffer full ({len(_buffer)}); force flush", file=sys.stderr)
+                    flush_digest()
+                    _last_flush_time = now
+                time.sleep(15)
+            except Exception as e:
+                print(f"[irc-notify] digest loop error: {e}", file=sys.stderr)
+                time.sleep(60)

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
+            if self.path == "/flush":
+                ok = flush_digest()
+                self.send_response(200); self.send_header("Content-Type", "application/json"); self.end_headers()
+                self.wfile.write(json.dumps({"flushed": ok, "buffer_after": len(_buffer)}).encode())
+                return
+            _stats["webhooks_received"] += 1
            length = int(self.headers.get("Content-Length", 0))
            body = json.loads(self.rfile.read(length)) if length else {}
            for alert in body.get("alerts", []):
@@ -1383,22 +1500,56 @@ data:
                msg = f"{icon}{sev_tag} {name}: {summary}"
                if desc: msg += f"\n  {desc}"
                send_irc(msg)
-                if should_print(alert): send_thermal_print(alert)
-            self.send_response(200)
-            self.send_header("Content-Type", "application/json")
-            self.end_headers()
+                # Thermal routing — EVERYTHING (including criticals) goes into
+                # the hourly digest. Only the explicit `alert_channel=thermal_print_immediate`
+                # label bypasses, and even that flushes-the-current-digest rather
+                # than printing a standalone job, so the same fingerprint can't
+                # spam the printer per webhook cycle.
+                if status == "RESOLVED":
+                    add_to_digest(alert)  # removes from buffer
+                    continue
+                if is_immediate_label(alert):
+                    # Explicit opt-in for "paper this NOW" — first arrival of a
+                    # new fingerprint triggers an immediate digest flush; repeat
+                    # webhooks for the same fingerprint dedupe in the buffer
+                    # until the next interval or until the alert resolves.
+                    new_in_buffer = add_to_digest(alert)
+                    if new_in_buffer:
+                        global _last_flush_time
+                        flush_digest()
+                        _last_flush_time = time.time()
+                elif is_critical(alert) or is_batched_label(alert):
+                    add_to_digest(alert)
+                # else: IRC-only (warnings without thermal_print label)
+            self.send_response(200); self.send_header("Content-Type", "application/json"); self.end_headers()
            self.wfile.write(b'{"status":"ok"}')
+
        def do_GET(self):
-            self.send_response(200)
-            self.send_header("Content-Type", "application/json")
-            self.end_headers()
-            self.wfile.write(json.dumps({"service":"irc-notify","thermal_print":PRINT_ENABLED}).encode())
+            self.send_response(200); self.send_header("Content-Type", "application/json"); self.end_headers()
+            with _buffer_lock:
+                alertnames = sorted({it["alert"].get("labels", {}).get("alertname", "?") for it in _buffer.values()})
+                depth = len(_buffer)
+            info = {
+                "service": "irc-notify",
+                "config": {"thermal_print_enabled": THERMAL_PRINT_ENABLED,
+                           "batch_interval_min": BATCH_INTERVAL_MIN,
+                           "batch_max_pending": BATCH_MAX_PENDING,
+                           "irc_target": f"{IRC_HOST}:{IRC_PORT} {IRC_CHANNEL}",
+                           "print_web_url": PRINT_WEB_URL},
+                "buffer": {"depth": depth, "alertnames": alertnames,
+                           "seconds_since_last_flush": int(time.time() - _last_flush_time),
+                           "seconds_until_next_flush": max(0, int(BATCH_INTERVAL_MIN*60 - (time.time() - _last_flush_time)))},
+                "stats": _stats,
+            }
+            self.wfile.write(json.dumps(info, indent=2).encode())
+
        def log_message(self, format, *args):
            print(f"[irc-notify] {args[0]}", file=sys.stderr)

    if __name__ == "__main__":
+        threading.Thread(target=digest_loop, daemon=True).start()
        server = HTTPServer(("0.0.0.0", 9119), Handler)
-        print(f"IRC alert relay :9119 -> {IRC_HOST}:{IRC_PORT} {IRC_CHANNEL} (thermal: {PRINT_ENABLED})")
+        print(f"[irc-notify] :9119 -> IRC {IRC_HOST}:{IRC_PORT} {IRC_CHANNEL} | thermal={'ON' if THERMAL_PRINT_ENABLED else 'OFF'} | digest={BATCH_INTERVAL_MIN}m max={BATCH_MAX_PENDING}", file=sys.stderr)
        server.serve_forever()

 # =============================================================================
@@ -3509,7 +3660,7 @@ data:
              - refId: A
                relativeTimeRange: {from: 300, to: 0}
                datasourceUid: prometheus
-                model: {expr: 'kube_deployment_status_replicas_ready{namespace="github-runner",deployment=~"github-runner(|-.+)"} == 0', instant: true, refId: A}
+                model: {expr: 'kube_deployment_status_replicas_ready{namespace="github-runner",deployment=~"github-runner(|-(sharedpos|puppet|signage|dms|telephony|print-web|chat|mysql|kiosk-linux))"} == 0', instant: true, refId: A}
              - refId: B
                relativeTimeRange: {from: 300, to: 0}
                datasourceUid: __expr__
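Reviewer note: the thermal routing that the `do_POST` handler in `notify.py` applies to each webhook alert can be read as a single decision function. The following is a minimal sketch, not the code in the ConfigMap: the function name `thermal_route` and its string return values are hypothetical, and it assumes `status` is compared case-insensitively, as the handler's `status == "RESOLVED"` check suggests.

```python
def thermal_route(alert: dict) -> str:
    """Classify one Grafana webhook alert for thermal-printer handling.

    Hypothetical helper mirroring the branch order in the handler:
      'resolve'   -> drop the fingerprint from the digest buffer
      'flush_now' -> enqueue, then flush the digest if the fingerprint is new
      'digest'    -> enqueue for the next scheduled digest
      'irc_only'  -> no thermal output at all
    """
    labels = alert.get("labels", {})
    status = alert.get("status", "firing").upper()
    channel = labels.get("alert_channel")
    severity = labels.get("severity", "").lower()

    # Resolution is checked first, so a resolved critical never prints.
    if status == "RESOLVED":
        return "resolve"
    # Only the explicit opt-in label bypasses the scheduled digest.
    if channel == "thermal_print_immediate":
        return "flush_now"
    # Criticals and explicitly batched alerts wait for the digest.
    if severity in ("critical", "disaster", "page") or channel == "thermal_print":
        return "digest"
    return "irc_only"


if __name__ == "__main__":
    sample = {"status": "firing", "labels": {"severity": "critical"}}
    print(thermal_route(sample))
```

Keeping the decision in a pure function like this would make the branch order testable without an IRC server or a printer, though whether that refactor is worthwhile is a judgment call for the author.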
--- a/apps/selenium/network-policy.yaml
+++ b/apps/selenium/network-policy.yaml
@@ -24,7 +24,16 @@
 #     (10.0.57.16:5200), public internet 80/443 (excluding RFC1918), and
 #     fc-signage:5190 for the signage AAT lane.
 #   - Ingress: Traefik (4444 + 8089 ACME-solver-style), intra-pod,
-#     telephony / gitea / fc-system / fc-signage namespaces on 4444.
+#     telephony / gitea / fc-system / fc-signage / github-runner namespaces
+#     on 4444.
+#
+# 2026-05-25: added github-runner ingress on 4444 so CI jobs running in
+# self-hosted runner pods (e.g. FlowerCore.Print.Web `help-screenshots`)
+# can reach the grid. Without this allow, the session POST to
+# `selenium-hub.selenium.svc.cluster.local:4444` was DNAT'd to the hub
+# pod IP and then dropped at the Calico ingress hook — Selenium UI showed
+# 0/4 sessions while the .NET HTTP client timed out at 60s. Same family
+# as `feedback_netpol_dnat_backend_port`, wrong-source-namespace flavor.
 apiVersion: networking.k8s.io/v1
 kind: NetworkPolicy
 metadata:
@@ -203,6 +212,13 @@ spec:
    ports:
    - port: 4444
      protocol: TCP
+  - from:
+    - namespaceSelector:
+        matchLabels:
+          kubernetes.io/metadata.name: github-runner
+    ports:
+    - port: 4444
+      protocol: TCP
  podSelector: {}
  policyTypes:
  - Ingress
--- a/apps/selenium/selenium-grid.yaml
+++ b/apps/selenium/selenium-grid.yaml
@@ -0,0 +1,412 @@
+# Selenium Grid 4 — RKE2 deployment
+#
+# Hub + chrome + firefox + edge browser nodes serving fleet-wide AAT runs from
+# the GitHub Actions self-hosted runners. ArgoCD owns this namespace from
+# 2026-05-25 (`infra-selenium` Application; previously these resources were
+# orphan kubectl-applied since 2026-03-15).
+#
+# Endpoints:
+#   - Internal cluster: http://selenium-hub.selenium.svc.cluster.local:4444
+#   - LAN LoadBalancer (MetalLB): http://10.0.56.208:4444
+#   - Traefik public: https://selenium.iamworkin.lan
+#
+# Browser maxSessions:
+#   - chrome 2  (bumped from 1 on 2026-05-25 morning-routine — AAT-heavy
+#                Print.Web help-screenshots was the global bottleneck;
+#                see commit history for ops/runner-replica-rightsize)
+#   - firefox 1
+#   - edge 1
+#
+# Screenshots + video recording write to NFS via the chrome video sidecar.
+# See: CLAUDE.md "Selenium Grid & Visual AAT Testing" + bluejay-infra ADR notes.
+---
+apiVersion: v1
+kind: Service
+metadata:
+  labels:
+    app: selenium-hub
+    app.kubernetes.io/name: selenium-hub
+    app.kubernetes.io/part-of: selenium-grid
+  name: selenium-hub
+  namespace: selenium
+spec:
+  ports:
+  - name: web
+    port: 4444
+    targetPort: 4444
+  - name: publish
+    port: 4442
+    targetPort: 4442
+  - name: subscribe
+    port: 4443
+    targetPort: 4443
+  selector:
+    app: selenium-hub
+  type: ClusterIP
+---
+apiVersion: v1
+kind: Service
+metadata:
+  annotations:
+    metallb.io/ip-allocated-from-pool: bluejay-pool
+    metallb.universe.tf/loadBalancerIPs: 10.0.56.208
+  labels:
+    app: selenium-hub
+    component: external-access
+  name: selenium-hub-external
+  namespace: selenium
+spec:
+  clusterIP: 10.43.90.147
+  clusterIPs:
+  - 10.43.90.147
+  externalTrafficPolicy: Local
+  healthCheckNodePort: 32213
+  ports:
+  - name: web
+    nodePort: 32411
+    port: 4444
+    targetPort: 4444
+  - name: publish
+    nodePort: 32068
+    port: 4442
+    targetPort: 4442
+  - name: subscribe
+    nodePort: 31000
+    port: 4443
+    targetPort: 4443
+  selector:
+    app: selenium-hub
+  type: LoadBalancer
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  labels:
+    app: selenium-hub
+    app.kubernetes.io/name: selenium-hub
+    app.kubernetes.io/part-of: selenium-grid
+  name: selenium-hub
+  namespace: selenium
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: selenium-hub
+  template:
+    metadata:
+      labels:
+        app: selenium-hub
+        app.kubernetes.io/name: selenium-hub
+        app.kubernetes.io/part-of: selenium-grid
+    spec:
+      containers:
+      - env:
+        - name: SE_NODE_SESSION_TIMEOUT
+          value: '300'
+        - name: SE_SESSION_REQUEST_TIMEOUT
+          value: '300'
+        - name: SE_SESSION_RETRY_INTERVAL
+          value: '5'
+        - name: JAVA_OPTS
+          value: -Xmx512m
+        image: selenium/hub:4.27.0
+        livenessProbe:
+          httpGet:
+            path: /wd/hub/status
+            port: 4444
+          initialDelaySeconds: 30
+          periodSeconds: 15
+          timeoutSeconds: 5
+        name: selenium-hub
+        ports:
+        - containerPort: 4444
+          name: web
+        - containerPort: 4442
+          name: publish
+        - containerPort: 4443
+          name: subscribe
+        readinessProbe:
+          httpGet:
+            path: /wd/hub/status
+            port: 4444
+          initialDelaySeconds: 10
+          periodSeconds: 5
+          timeoutSeconds: 5
+        resources:
+          limits:
+            cpu: 500m
+            memory: 1Gi
+          requests:
+            cpu: 250m
+            memory: 512Mi
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  labels:
+    app: selenium-node-chrome
+    app.kubernetes.io/name: selenium-node-chrome
+    app.kubernetes.io/part-of: selenium-grid
+  name: selenium-node-chrome
+  namespace: selenium
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: selenium-node-chrome
+  template:
+    metadata:
+      labels:
+        app: selenium-node-chrome
+        app.kubernetes.io/name: selenium-node-chrome
+        app.kubernetes.io/part-of: selenium-grid
+    spec:
+      containers:
+      - env:
+        - name: SE_EVENT_BUS_HOST
+          value: selenium-hub
+        - name: SE_EVENT_BUS_PUBLISH_PORT
+          value: '4442'
+        - name: SE_EVENT_BUS_SUBSCRIBE_PORT
+          value: '4443'
+        - name: SE_NODE_MAX_SESSIONS
+          value: '2'
+        - name: SE_NODE_OVERRIDE_MAX_SESSIONS
+          value: 'false'
+        - name: SE_VNC_NO_PASSWORD
+          value: '1'
+        - name: SE_SCREEN_WIDTH
+          value: '1920'
+        - name: SE_SCREEN_HEIGHT
+          value: '1080'
+        - name: SE_NODE_SESSION_TIMEOUT
+          value: '300'
+        image: selenium/node-chrome:4.27.0
+        livenessProbe:
+          httpGet:
+            path: /status
+            port: 5555
+          initialDelaySeconds: 30
+          periodSeconds: 15
+        name: selenium-chrome
+        ports:
+        - containerPort: 5555
+          name: node
+        readinessProbe:
+          httpGet:
+            path: /status
+            port: 5555
+          initialDelaySeconds: 15
+          periodSeconds: 5
+        resources:
+          limits:
+            cpu: '1'
+            memory: 1Gi
+          requests:
+            cpu: 500m
+            memory: 512Mi
+        volumeMounts:
+        - mountPath: /dev/shm
+          name: dshm
+      - env:
+        - name: DISPLAY_CONTAINER_NAME
+          value: localhost
+        - name: SE_SCREEN_WIDTH
+          value: '1920'
+        - name: SE_SCREEN_HEIGHT
+          value: '1080'
+        - name: SE_VIDEO_FILE_NAME
+          value: auto
+        - name: SE_VIDEO_UPLOAD_ENABLED
+          value: 'false'
+        image: selenium/video:ffmpeg-7.1-20250101
+        name: video
+        resources:
+          limits:
+            cpu: 500m
+            memory: 768Mi
+          requests:
+            cpu: 250m
+            memory: 384Mi
+        volumeMounts:
+        - mountPath: /videos
+          name: selenium-videos
+      volumes:
+      - emptyDir:
+          medium: Memory
+          sizeLimit: 2Gi
+        name: dshm
+      - emptyDir:
+          sizeLimit: 5Gi
+        name: selenium-videos
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  labels:
+    app: selenium-node-firefox
+    app.kubernetes.io/name: selenium-node-firefox
+    app.kubernetes.io/part-of: selenium-grid
+  name: selenium-node-firefox
+  namespace: selenium
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: selenium-node-firefox
+  template:
+    metadata:
+      labels:
+        app: selenium-node-firefox
+        app.kubernetes.io/name: selenium-node-firefox
+        app.kubernetes.io/part-of: selenium-grid
+    spec:
+      containers:
+      - env:
+        - name: SE_EVENT_BUS_HOST
+          value: selenium-hub
+        - name: SE_EVENT_BUS_PUBLISH_PORT
+          value: '4442'
+        - name: SE_EVENT_BUS_SUBSCRIBE_PORT
+          value: '4443'
+        - name: SE_NODE_MAX_SESSIONS
+          value: '1'
+        - name: SE_NODE_OVERRIDE_MAX_SESSIONS
+          value: 'true'
+        - name: SE_VNC_NO_PASSWORD
+          value: '1'
+        - name: SE_START_VNC
+          value: 'false'
+        - name: SE_SCREEN_WIDTH
+          value: '1920'
+        - name: SE_SCREEN_HEIGHT
+          value: '1080'
+        - name: SE_NODE_SESSION_TIMEOUT
+          value: '300'
+        image: selenium/node-firefox:4.27.0
+        livenessProbe:
+          failureThreshold: 5
+          httpGet:
+            path: /status
+            port: 5555
+          initialDelaySeconds: 30
+          periodSeconds: 15
+          timeoutSeconds: 5
+        name: selenium-firefox
+        ports:
+        - containerPort: 5555
+          name: node
+        readinessProbe:
+          failureThreshold: 5
+          httpGet:
+            path: /status
+            port: 5555
+          initialDelaySeconds: 15
+          periodSeconds: 5
+          timeoutSeconds: 5
+        resources:
+          limits:
+            cpu: '1'
+            memory: 2Gi
+          requests:
+            cpu: 500m
+            memory: 1Gi
+        volumeMounts:
+        - mountPath: /dev/shm
+          name: dshm
+      volumes:
+      - emptyDir:
+          medium: Memory
+          sizeLimit: 2Gi
+        name: dshm
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  labels:
+    app: selenium-node-edge
+    app.kubernetes.io/name: selenium-node-edge
+    app.kubernetes.io/part-of: selenium-grid
+  name: selenium-node-edge
+  namespace: selenium
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: selenium-node-edge
+  template:
+    metadata:
+      labels:
+        app: selenium-node-edge
+        app.kubernetes.io/name: selenium-node-edge
+        app.kubernetes.io/part-of: selenium-grid
+    spec:
+      containers:
+      - env:
+        - name: SE_EVENT_BUS_HOST
+          value: selenium-hub
+        - name: SE_EVENT_BUS_PUBLISH_PORT
+          value: '4442'
+        - name: SE_EVENT_BUS_SUBSCRIBE_PORT
+          value: '4443'
+        - name: SE_NODE_MAX_SESSIONS
+          value: '1'
+        - name: SE_NODE_OVERRIDE_MAX_SESSIONS
+          value: 'true'
+        - name: SE_VNC_NO_PASSWORD
+          value: '1'
+        - name: SE_SCREEN_WIDTH
+          value: '1920'
+        - name: SE_SCREEN_HEIGHT
+          value: '1080'
+        - name: SE_NODE_SESSION_TIMEOUT
+          value: '300'
+        image: selenium/node-edge:4.27.0
+        livenessProbe:
+          httpGet:
+            path: /status
+            port: 5555
+          initialDelaySeconds: 30
+          periodSeconds: 15
+        name: selenium-edge
+        ports:
+        - containerPort: 5555
+          name: node
+        readinessProbe:
+          httpGet:
+            path: /status
+            port: 5555
+          initialDelaySeconds: 15
+          periodSeconds: 5
+        resources:
+          limits:
+            cpu: '1'
+            memory: 1Gi
+          requests:
+            cpu: 500m
+            memory: 512Mi
+        volumeMounts:
+        - mountPath: /dev/shm
+          name: dshm
+      volumes:
+      - emptyDir:
+          medium: Memory
+          sizeLimit: 2Gi
+        name: dshm
+---
+apiVersion: traefik.io/v1alpha1
+kind: IngressRoute
+metadata:
+  name: selenium-hub
+  namespace: selenium
+spec:
+  entryPoints:
+  - websecure
+  routes:
+  - kind: Rule
+    match: Host(`selenium.iamworkin.lan`)
+    services:
+    - name: selenium-hub
+      port: 4444
+  tls:
+    secretName: selenium-tls
--- a/tests/bluejay-infra-lint/FleetManifestLintTests.cs
+++ b/tests/bluejay-infra-lint/FleetManifestLintTests.cs
@@ -67,8 +67,6 @@ public sealed class FleetManifestLintTests
        ["github-runner-chat"] = "https://github.com/astoltz/FlowerCore.Chat",
        ["github-runner-mysql"] = "https://github.com/astoltz/FlowerCore.MySQL",
        ["github-runner-kiosk-linux"] = "https://github.com/astoltz/FlowerCore.Kiosk.Linux",
-        ["github-runner-devicemgmt"] = "https://github.com/astoltz/FlowerCore.DeviceManagement",
-        ["github-runner-worldbuilder"] = "https://github.com/astoltz/FlowerCore.WorldBuilder",
    };

    private static readonly HashSet<string> ScaledLinuxRunnerDeployments = new(StringComparer.Ordinal)
@@ -82,8 +80,6 @@ public sealed class FleetManifestLintTests
        "github-runner-chat",
        "github-runner-mysql",
        "github-runner-kiosk-linux",
-        "github-runner-devicemgmt",
-        "github-runner-worldbuilder",
    };

    private static readonly IReadOnlyDictionary<string, string> WritableRunnerEnv = new Dictionary<string, string>(StringComparer.Ordinal)
@@ -238,7 +234,7 @@ public sealed class FleetManifestLintTests
        {
            deployments.Should().ContainKey(expectedRunner.Key);

-            var container = RunnerContainer(deployments[expectedRunner.Key]);
+            var container = deployments[expectedRunner.Key].ContainerMappings().Should().ContainSingle().Subject;
            EnvValue(container, "REPO_URL").Should().Be(expectedRunner.Value);
            EnvValue(container, "EPHEMERAL").Should().Be("true");
            EnvValue(container, "LABELS").Should().Be("self-hosted,linux,fc-build-linux");
@@ -254,7 +250,7 @@ public sealed class FleetManifestLintTests
    {
        foreach (var deployment in GitHubRunnerDeployments().Values)
        {
-            var container = RunnerContainer(deployment);
+            var container = deployment.ContainerMappings().Should().ContainSingle().Subject;

            foreach (var expectedEnv in WritableRunnerEnv)
            {
@@ -315,7 +311,7 @@ public sealed class FleetManifestLintTests
        monitoring.Should().Contain("MacMiniRunnerOffline");
        monitoring.Should().Contain("LinuxRunnerOffline");
        monitoring.Should().Contain("kube_deployment_status_replicas_ready");
-        monitoring.Should().Contain("github-runner(|-.+)");
+        monitoring.Should().Contain("github-runner(|-(sharedpos|puppet|signage|dms|telephony|print-web|chat|mysql|kiosk-linux))");
        monitoring.Should().Contain("folder: CI Alerts");
        monitoring.Should().Contain("uid: linux-runner-offline");
        monitoring.Should().Contain("alert_channel: irc");
@@ -645,15 +641,6 @@ public sealed class FleetManifestLintTests
        return EnvMapping(container, name) is { } env ? ManifestNodeExtensions.Scalar(env, "value") : null;
    }

-    private static YamlMappingNode RunnerContainer(ManifestDocument deployment)
-    {
-        return deployment.ContainerMappings()
-            .Where(container => string.Equals(ManifestNodeExtensions.Scalar(container, "name"), "runner", StringComparison.Ordinal))
-            .Should()
-            .ContainSingle($"{deployment.Name} must keep exactly one main runner container")
-            .Subject;
-    }
-
    private static string? EnvSecretName(YamlMappingNode container, string name)
    {
        return EnvMapping(container, name) is { } env
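Reviewer note: the tightened `deployment=~` matcher can be sanity-checked outside Prometheus. The sketch below assumes that Python's `re.fullmatch` is a fair stand-in for Prometheus label matching (which anchors the regex at both ends) for a pattern this simple; the helper name `matches_alert` is hypothetical. The pattern strings are copied from the diff above.

```python
import re

# New pattern from the alert rule; it enumerates the runner deployments.
NEW_PATTERN = (
    r"github-runner(|-(sharedpos|puppet|signage|dms|telephony|"
    r"print-web|chat|mysql|kiosk-linux))"
)
# Old pattern, kept here only for comparison.
OLD_PATTERN = r"github-runner(|-.+)"


def matches_alert(deployment: str, pattern: str = NEW_PATTERN) -> bool:
    """Return True if the deployment name fully matches the pattern."""
    return re.fullmatch(pattern, deployment) is not None


if __name__ == "__main__":
    for name in ("github-runner", "github-runner-print-web",
                 "github-runner-devicemgmt", "github-runner-worldbuilder"):
        print(f"{name}: old={matches_alert(name, OLD_PATTERN)} "
              f"new={matches_alert(name)}")
```

Under this assumption, the two deployments removed from the lint test (`github-runner-devicemgmt`, `github-runner-worldbuilder`) matched the old wildcard but no longer match the enumerated pattern, which is consistent with the change to `FleetManifestLintTests.cs`.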
Author	SHA1	Message	Date
bluejay	bc28430d24	selenium: allow github-runner namespace ingress on 4444 (#26)	2026-05-26 00:44:23 +00:00
Andrew Stoltz	cc92272217	selenium: allow github-runner namespace ingress on 4444 Unblocks CI jobs running in github-runner pods (e.g. FlowerCore.Print.Web `help-screenshots`) from reaching selenium-hub. Previously the session POST was DNAT'd to the hub pod IP then dropped at the Calico ingress hook, surfacing as a 60s timeout against http://selenium-hub.selenium.svc.cluster.local:4444 while the Selenium UI showed 0/4 sessions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 19:43:12 -05:00
bluejay	d6f4468a9c	selenium: migrate hub + 3 nodes into ArgoCD-managed manifests (#25)	2026-05-26 00:09:35 +00:00
Andrew Stoltz	2f796a2ebd	selenium: migrate hub + 3 nodes + service + ingressroute into ArgoCD Previously orphan kubectl-applied since the Selenium Grid was first set up. The `infra-selenium` ArgoCD app existed but only managed `network-policy.yaml` — the deployments themselves drifted whenever anyone `kubectl set env`'d or `kubectl scale`'d. This commit captures the live state (with the 2026-05-25 maxSessions bump for chrome already baked in) as canonical git source. ArgoCD's ServerSideApply syncPolicy + selfHeal will now keep the grid in lock step with this file. Resources captured: - Service selenium-hub (ClusterIP, internal traffic on 4444) - Service selenium-hub-external (LoadBalancer, MetalLB 10.0.56.208) - Deployment selenium-hub - Deployment selenium-node-chrome (replicas=1, SE_NODE_MAX_SESSIONS=2) - Deployment selenium-node-firefox (replicas=1, maxSessions=1) - Deployment selenium-node-edge (replicas=1, maxSessions=1) - IngressRoute selenium-hub (Traefik, selenium.iamworkin.lan) No live behavior change — server-side dry-run confirms unchanged for hub/firefox/ingressroute, "configured" for hub-external + 3 deploys (default-field reordering only; SSA + field managers handle the diff). Refs: Sprint 33 morning-routine 2026-05-25 follow-up Q-MR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 19:08:55 -05:00
bluejay	1f1f6823db	runners: right-size replica counts per 14d CI activity (#24)	2026-05-26 00:01:47 +00:00
Andrew Stoltz	b92f74b63a	runners: right-size replica counts per 14d CI activity data Drop 2 → 1 for 10 deploys based on trailing-14d run counts: - LlmBridge, Media, Knowledge, Intranet.Web, DNS (0 runs each) - Presentations (6), Redis (3), Provisioning (3), MessageBoard (3), MenuBoard (3) Bump 2 → 3 for Print.Web: 12 runs in trailing 5d, and the help-screenshots AAT job holds a runner 30+ min, creating head-of-line blocking for parallel PRs. Net change: -9 replicas (≈ -9 GiB committed memory). Aligns with Sprint 33 morning-routine capacity audit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 18:55:47 -05:00
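The net figure in the commit above can be checked with a short sketch. The deployment names and the 2 → 1 / 2 → 3 changes come from the commit message; the 1 GiB-per-runner figure is an assumption inferred from "≈ -9 GiB" and is not stated anywhere in the log.

```python
# Runner deployments dropped from 2 replicas to 1, as listed in the commit message.
DROPPED = [
    "LlmBridge", "Media", "Knowledge", "Intranet.Web", "DNS",
    "Presentations", "Redis", "Provisioning", "MessageBoard", "MenuBoard",
]

# Map each deployment to its (before, after) replica count.
changes = {name: (2, 1) for name in DROPPED}
changes["Print.Web"] = (2, 3)  # bumped because help-screenshots holds a runner 30+ min

# ASSUMPTION: roughly 1 GiB of committed memory per runner replica.
ASSUMED_GIB_PER_RUNNER = 1.0


def net_replica_delta(replica_changes):
    """Sum (after - before) across all deployments."""
    return sum(after - before for before, after in replica_changes.values())


delta = net_replica_delta(changes)
memory_delta_gib = delta * ASSUMED_GIB_PER_RUNNER
print(f"net replicas: {delta:+d}, committed memory: {memory_delta_gib:+.0f} GiB")
```

Ten single-replica reductions and one single-replica increase give the -9 quoted in the message.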
Andrew Stoltz	cb7f7dbc4d	authentik: generous startup/liveness probes for first-boot migration The server pod was getting killed by liveness probe at 60s while still waiting on migration DB lock (worker pod also running migrations against same DB). Add startupProbe with 10.5 min budget so liveness doesn't fire until migrations finish. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 16:03:03 -05:00
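For the startup probe described in the commit above, Kubernetes holds off liveness checks until the startup probe succeeds, so the time budget is roughly the initial delay plus `failureThreshold × periodSeconds`. The sketch below shows one combination that yields 10.5 minutes. The three values are hypothetical: the log does not show the probe settings actually committed.

```python
def startup_budget_seconds(initial_delay_seconds, period_seconds, failure_threshold):
    """Approximate time a container gets before a failing startupProbe kills it."""
    return initial_delay_seconds + period_seconds * failure_threshold


# HYPOTHETICAL values chosen only to illustrate a 10.5 min budget;
# the real manifest may use a different combination.
budget = startup_budget_seconds(
    initial_delay_seconds=30,
    period_seconds=10,
    failure_threshold=60,
)
print(f"startup budget: {budget} s ({budget / 60} min)")
```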
Andrew Stoltz	03126d5584	authentik: add fsGroup:1000 to server + worker so non-root uid can write /media PermissionError: [Errno 13] Permission denied: '/media/public' in tenant_files migration because Authentik container runs as uid 1000 but Longhorn PVC mounts root:root by default. fsGroup on Pod securityContext recursively chgrps the PVC mount to gid 1000 + chmods g+rwx. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 15:58:35 -05:00
Andrew Stoltz	495e884c41	authentik: initial deployment at id.iamworkin.lan Stack: - PostgreSQL 16 StatefulSet (Longhorn RWO 5Gi) - Redis 7 Deployment (no persistence) - Authentik server + worker (ghcr.io/goauthentik/server:2024.12.3) - Shared media PVC (Longhorn RWO 2Gi) between server+worker - Certificate via step-ca-acme ClusterIssuer - Traefik IngressRoute at id.iamworkin.lan Secrets sourced from 1Password item 'authentik-credentials' (IAmWorkin vault, id y6i74ch22q5wvm7znquq4nhhcu) via OnePasswordItem CRD. Fields: AUTHENTIK_SECRET_KEY, POSTGRES_PASSWORD, REDIS_PASSWORD, BOOTSTRAP_ADMIN_PASSWORD, BOOTSTRAP_ADMIN_TOKEN, BOOTSTRAP_ADMIN_EMAIL. DNS A record id.iamworkin.lan -> 10.0.56.200 added via scripts/pfsense-add-id-host.py (FlowerCore.DNS service was 502'ing on pfSense diag_command.php response parsing). Closes the immediate gap from PiManager OIDC Cohort 3 wire-up: PiManager (a87cd6f) configures id.iamworkin.lan as JWT authority but the backend was never deployed. Pirelay specifically is on Mode:apikey until this backend is bootstrapped and a pimanager service-account exists. Post-deploy bootstrap (manual once pods Ready): 1. Login at https://id.iamworkin.lan/if/admin/ as akadmin using BOOTSTRAP_ADMIN_PASSWORD from 1Password. 2. Create OAuth2/OpenID Provider for pimanager (issuer https://id.iamworkin.lan/application/o/pimanager/, audience 'pimanager'). 3. Create Application binding the provider. 4. Create service account user 'pimanager-service-account', generate long-lived token, store in 1Password as 'pimanager-service-account'. 5. Re-enable jwt mode on pirelay + un-mask puppet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 15:50:10 -05:00
Andrew Stoltz	65aa1e6104	fix(monitoring): point probe-printweb at /health (Q-MR-90) Root path requires API key auth — `/` returned 401 to the blackbox probe, firing PrintWebDown despite `/health` reporting Healthy. Pattern: feedback_k8s_probes_behind_auth_middleware. Mirrors FlowerCore.Notes scripts/monitoring/prometheus.yml. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 14:52:02 -05:00
Andrew Stoltz	7f2a3b76b4	feat(github-runner): bake Ruby 3.3 into Linux self-hosted runner image (Q-MR-81)	2026-05-20 11:45:43 -05:00
bluejay	ea73f00461	fix(fc-devicemgmt): remove self-referential Application resource (Q-MR-79) ApplicationSet already creates infra-fc-devicemgmt; removing the in-repo Application child clears the self-reference drift.	2026-05-20 16:20:01 +00:00
Andrew Stoltz	25ace30a03	fix(fc-devicemgmt): remove self-referential Application resource (Q-MR-79)	2026-05-20 11:18:25 -05:00
Andrew Stoltz	ca574c2280	brochure: delete apps/brochure/ — full prune per operator decision 2026-05-19 Removes the apps/brochure/ directory entirely from the bluejay-infra ApplicationSet glob. ArgoCD will: 1. See infra-brochure has no git source -> mark for delete 2. Prune the brochure namespace + Deployment + Service + Certificate + Secret + IngressRoute (all generated from the now-gone apps/brochure/brochure.yaml) 3. Remove the infra-brochure Application from argocd ns Operator decision 2026-05-19 (follow-up to `09387f9` ARCHIVED banner commit): "Yes, prune argo for brochure. Probably fully deleted there." The brochure subdomain project was a planning-chain misinterpretation of "make TtsReader + AI Station production-ready" — see memory/project_brochure_split_misinterpretation_archived_2026_05_19.md in FlowerCore.Notes for the full decision record. Reusable artifacts that were the operator's archive concern stay alive in their actual homes: - FlowerCore.Intranet.Web PR #8 content-NuGet carve-out: still in Intranet's master, may transfer to TtsReader / AI Station prod work - Sprint 32 Cl-5 substrate (public-twin design ideas): SUPERSEDED banner in-place in FlowerCore.Notes docs/standards/, history preserved - magpie-doc-writer + wren-walkthrough skill output: unchanged in Intranet's flowercore-whats-new/walkthroughs/galleries directories Companion Notes-side commit updates the "scaled to 0 + ARCHIVED banner" language in mvp-readiness.html + fleet-roadmap-2026-05-19-sprint36-v2.md + memory record to reflect full deletion instead. Wrong-codebase image localhost/fc-brochure-web:v20260524-sprint32 is being removed from rke2-server / rke2-agent1 / rke2-agent2 in a follow-up step (reclaims ~800MB per node). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:42:30 -05:00
Andrew Stoltz	09387f90e1	brochure: ARCHIVED 2026-05-19 — was a misinterpretation, do not re-enable The brochure split project was a misinterpretation of an operator request to make TtsReader + AI Station production-ready. Somewhere in the planning chain it spun up into a separate "showcase brochure product" with its own host, repo, NuGet, and Codex pack — none of which the operator actually wanted. The project itself is pointless and a waste of credits. Archive (not delete) per operator decision 2026-05-19, because some work shipped under the misinterpretation may still have reusable value: - FlowerCore.Intranet.Web PR #8 (merged) introduced FlowerCore.Brochure.Content content-NuGet carve-out — pattern may apply to TtsReader/AiStation production polish. - Sprint 32 Cl-5 substrate has design ideas for public-twin vs operator-host separation that may transfer. - magpie-doc-writer / wren-walkthrough skills still author useful Intranet content — those skills stay active. These manifests stay at replicas: 0 for ArgoCD continuity. Cleanup options (move out of apps/* glob, or delete entirely) are documented in README.md for an operator-explicit future call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:34:28 -05:00
Andrew Stoltz	e641ceab48	monitoring(irc-notify): criticals also batch hourly — fix per-fire spam The first batching pass (`bacac06`) left critical-severity alerts on the immediate-print path. That's still per-event spam for any persistent critical (e.g. PrintPaperRollCritical fires every 30s Grafana evaluation cycle when paper is <5%). Caught immediately after deploy: CUPS queue grew 0 → 8 jobs in 8 minutes from a single firing PrintPaperRollCritical. This commit aligns with the operator's verbatim ask ("one alert an hour"): - Critical-severity alerts now go into the digest buffer, NOT the immediate-print path. The digest payload already shows severity tags per alertname, so the operator still sees "[critical] X" in the printout. - The explicit `alert_channel=thermal_print_immediate` label still bypasses batching, but only on NEW fingerprint arrival — it triggers a flush of the CURRENT digest (with the new alert included), then clears. Repeat webhooks for the same fingerprint dedupe in the buffer until the next hourly tick OR until the alert resolves. No fingerprint can spam. - `add_to_digest` now returns bool (True = buffer grew, False = dedup / resolution / disabled) so the immediate-label path can flush only on state transitions. Net effect: max 1 thermal print per BATCH_INTERVAL_MIN per alert fingerprint, regardless of severity. Rules that genuinely need same-second paper opt in via `alert_channel=thermal_print_immediate` (currently zero rules use this). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:22:25 -05:00
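The dedupe behaviour described in the commit above might look like the following minimal sketch. Only the name `add_to_digest`, its bool return, and the `alert_channel=thermal_print_immediate` label come from the commit message. The `DigestBuffer` class, `handle_alert`, `hourly_tick`, and the "already printed this interval" set are assumptions made to illustrate how a fingerprint could stay deduped after an early flush; the real notify.py may be structured differently.

```python
import threading


class DigestBuffer:
    """Hypothetical in-memory digest buffer keyed by alert fingerprint."""

    def __init__(self, enabled=True):
        self.enabled = enabled
        self._lock = threading.Lock()
        self._pending = {}     # fingerprint -> (alertname, severity), not yet printed
        self._printed = set()  # fingerprints already printed in this interval

    def add_to_digest(self, fingerprint, alertname, severity, status="firing"):
        """Return True only if the buffer grew (a new fingerprint arrived)."""
        if not self.enabled:
            return False
        with self._lock:
            if status == "resolved":
                # Resolution removes the alert and never triggers a print.
                self._pending.pop(fingerprint, None)
                self._printed.discard(fingerprint)
                return False
            if fingerprint in self._pending or fingerprint in self._printed:
                return False  # repeat webhook for a known fingerprint
            self._pending[fingerprint] = (alertname, severity)
            return True

    def flush(self):
        """Return the digest lines and remember what was printed."""
        with self._lock:
            lines = sorted(f"[{sev}] {name}" for name, sev in self._pending.values())
            self._printed.update(self._pending)
            self._pending.clear()
        return lines

    def hourly_tick(self):
        """Flush at the interval boundary, then let fingerprints print again."""
        lines = self.flush()
        with self._lock:
            self._printed.clear()
        return lines


def handle_alert(buf, alert):
    """Buffer every alert; flush early only on a NEW immediate-labelled fingerprint."""
    grew = buf.add_to_digest(
        alert["fingerprint"],
        alert["alertname"],
        alert["severity"],
        alert.get("status", "firing"),
    )
    if grew and alert.get("alert_channel") == "thermal_print_immediate":
        return buf.flush()
    return None
```

With this shape, a critical alert that re-fires every evaluation cycle adds one buffer entry and produces no print until the next tick, which is the "max 1 thermal print per interval per fingerprint" property the commit claims.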
Andrew Stoltz	c263426ea5	fc-devicemgmt: operator image fix + Web scaled to 0 OPERATOR (PodCrashLoopBackOff cleared): - Bumped image to v20260519-sp34cl3-fix (built from astoltz/FlowerCore.DeviceManagement@d9a3685 after Sprint 34 Cl-3 stranded branch was merged via PR #19 squash). - The v20260512-cx5 image was the broken Sprint 8 scaffold: generic Host builder, no kubeops, no Kestrel on :8080, no AddController chain. Readiness probe dial-tcp 8080 failed every restart. - The new image ships the AddController chain for all 4 reconcilers (DeviceCrd / DeviceGroupCrd / DevicePolicyCrd / RemoteCommandCrd) plus Kestrel on :8080 and /healthz. - Image saved + scp'd + ctr-imported on rke2-server / rke2-agent1 / rke2-agent2 before this commit. SHA256: 2cc79ee0a2313c550268d1244f805ae41b396362148dd5603061cc15b6f7fa7e WEB (DeploymentReplicasMismatch cleared via scale-to-0): - Web pod cannot start. Two upstream gaps must close first: 1) MySQL DB instance + user `fc_devicemgmt` / database `flowercore_devicemgmt` are not provisioned in fc-mysql. Cluster has zero MySqlInstanceCrds and no `mysql.fc-mysql.svc:3306` Service. 2) 1Password vault item `IAmWorkin/FlowerCore DeviceManagement Runtime` is missing (5 fields: DB-Password + 4 mTLS PEMs). OnePasswordItem CRD has been stuck Ready=False since 2026-05-18T02:58. - Same pattern as the brochure-web scale-to-0 in `914fed0` — make the cluster clean and quiet, let operator restart deploy on a real schedule. Re-enable path is fully documented in the deployment-web.yaml header comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:11:09 -05:00
Andrew Stoltz	bacac067cf	monitoring(irc-notify): hourly digest batching for thermal printer The thermal printer drained overnight (2026-05-18/19) because the old notify.py POSTed one print job per Grafana webhook fire. With 9 concurrently-firing alerts (zabbix-postgres + fc-devicemgmt + brochure + PrintPaperRollLow), every evaluation cycle stamped fresh CUPS jobs onto the queue until the operator physically powered the printer off. This refactor: - Adds env-var config: THERMAL_PRINT_ENABLED (master kill switch), BATCH_INTERVAL_MIN (default 60), BATCH_MAX_PENDING (default 50). - IRC delivery stays per-event (operator wants the live stream). - Thermal routing now: * critical/disaster/page severity OR alert_channel=thermal_print_immediate -> print immediately * alert_channel=thermal_print -> enqueue into hourly digest * RESOLVED -> remove from digest buffer (no resolution-spam prints) * else -> IRC only, no thermal - Background digest_loop thread flushes the buffer hourly (or sooner if buffer hits BATCH_MAX_PENDING). Digest payload is a single Print.Web /api/print/alert POST listing distinct alertnames + per-rule target counts. - New POST /flush endpoint (manual operator force-flush; useful for testing without waiting an hour). - GET / returns config + buffer depth + per-stat counters for observability. Net effect: max 1 thermal print per BATCH_INTERVAL_MIN for batched warnings, plus immediate prints for criticals. Closes the 2026-05-18/19 alert-storm incident. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 09:56:14 -05:00
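The routing rules and flush conditions from the commit above can be sketched as follows. The env-var names, their defaults of 60 and 50, and the severity and label values come from the commit message. The function names, the truthy-string parsing of `THERMAL_PRINT_ENABLED`, its default of enabled, and the choice to check RESOLVED before severity are assumptions for illustration.

```python
from dataclasses import dataclass

IMMEDIATE_SEVERITIES = {"critical", "disaster", "page"}


@dataclass
class ThermalConfig:
    enabled: bool
    batch_interval_min: int
    batch_max_pending: int


def load_config(env):
    """Build config from an environment mapping (for example os.environ)."""
    # ASSUMPTION: the kill switch accepts common truthy strings and defaults to on.
    raw = env.get("THERMAL_PRINT_ENABLED", "true").strip().lower()
    return ThermalConfig(
        enabled=raw in {"1", "true", "yes", "on"},
        batch_interval_min=int(env.get("BATCH_INTERVAL_MIN", "60")),
        batch_max_pending=int(env.get("BATCH_MAX_PENDING", "50")),
    )


def route_thermal(status, severity, alert_channel, cfg):
    """Return the thermal action for one webhook alert; IRC delivery is separate."""
    if not cfg.enabled:
        return "irc_only"
    if status == "resolved":
        return "remove_from_digest"  # ASSUMPTION: checked first, so no resolution prints
    if severity in IMMEDIATE_SEVERITIES or alert_channel == "thermal_print_immediate":
        return "print_immediately"
    if alert_channel == "thermal_print":
        return "enqueue_digest"
    return "irc_only"


def should_flush(cfg, pending, minutes_since_flush):
    """Flush on the interval, or sooner when the buffer reaches its cap."""
    if pending == 0:
        return False  # ASSUMPTION: an empty digest is never printed
    return pending >= cfg.batch_max_pending or minutes_since_flush >= cfg.batch_interval_min
```

This reflects the first batching pass only, in which critical-severity alerts still print immediately.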
bluejay	914fed08d8	fix(brochure): scale brochure-web to 0 — wrong codebase shipped (Intranet.Web binary in fc-brochure-web image, CrashLoopBackOff 296 restarts on /data read-only). Re-enable after Sprint 34 Cx-3 rebuild per docs/ai-agents/codex-prompts/2026-05-18-fc-brochure-web-rebuild-pack.md	2026-05-19 14:45:01 +00:00