Add SignalControl platform telemetry manifests

mail: remove cert-manager Certificate (manage mail-tls via step-ca JWK + noc1 renew timer)
step-ca-acme only has an HTTP-01 (Traefik) solver, but mail.iamworkin.lan must resolve to the dedicated MetalLB IP 10.0.56.202 (SMTP/IMAP), so HTTP-01 cannot validate (order stuck pending since 2026-05-06; cert expired 2026-05-24).

mail-tls is now issued from step-ca's JWK 'admin' provisioner and auto-renewed by a systemd timer on noc1 that writes the mail-tls secret directly. The secret + Deployment mount + webmail IngressRoute are unchanged. Re-add a Certificate only if a DNS-01 solver is deployed for step-ca-acme.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 22:29:18 -05:00 · 2026-06-01 15:55:38 -05:00 · 2026-05-31 12:35:45 -05:00 · 2026-05-31 11:34:12 -05:00 · 2026-05-31 00:32:48 -05:00 · 2026-05-30 23:31:48 -05:00
15 changed files with 1359 additions and 140 deletions
--- a/apps/fc-signalcontrol/README.md
+++ b/apps/fc-signalcontrol/README.md
@@ -0,0 +1,33 @@
+# FlowerCore SignalControl platform notes
+
+This app owns the cluster web manager at `signalcontrol.iamworkin.lan` and documents the physical Pi pilot at `signal-a.iamworkin.lan` / `pirelay`.
+
+## mTLS enrollment pattern
+
+Do not install or restart anything from this repo. The intended pirelay pattern is the Pi-signage step-ca-agent shape:
+
+- stable node identity: `pirelay`
+- local private key and CSR generated on the node
+- CSR submitted through the approved DeviceManagement/step-ca enrollment path
+- client certificate and chain stored node-local under `/etc/flowercore/signalcontrol/mtls/`
+- daily renewal timer, renewing only when fewer than 30 days remain
+- certificate used for DM-agent to DM-web traffic and future SignalControl inter-service calls
+
+Secrets, enrollment codes, private keys, p12 passphrases, and OIDC client secrets stay out of Git.
+
+## Telemetry
+
+Monitoring manifests add a dedicated Prometheus job:
+
+- `signalcontrol-pi-app`
+- target `10.0.58.113:5200`
+- path `/metrics/prometheus`
+- labels `instance="pirelay"`, `host="signal-a.iamworkin.lan"`, `service="signalcontrol-pi"`
+
+Host metrics continue through the `edge-nodes` node_exporter target at `10.0.58.113:9100`.
+
+## Physical-control audit
+
+The app ships with `FlowerCore:SignalControl:PhysicalAudit:Enabled=false` and `ForwardingEnabled=false`. Enabling local audit creates a SHA-256 hash chain for physical-control mutations. Forwarding to `https://audit.iamworkin.lan/api/v1/audit/signalcontrol` requires flipping the forwarding gate separately.
+
+Telemetry reads and `/metrics` scrapes are not audited.
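
Reviewer sketch (not part of the patch): the README says local audit builds a SHA-256 hash chain over physical-control mutations but does not show the record format. The Python below is a minimal illustration of the general technique only. The field names (`prev_hash`, `payload`, `hash`), the all-zero genesis value, and the canonical-JSON encoding are assumptions made for this example, not the SignalControl implementation.

```python
import hashlib
import json

# Assumed sentinel used as the "previous hash" of the very first entry.
GENESIS = "0" * 64


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Digest of the previous entry's hash plus a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    """Append a mutation record that commits to everything before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev_hash": prev, "payload": payload,
                  "hash": entry_hash(prev, payload)})


def verify(chain: list) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


chain = []
append(chain, {"action": "relay_write", "channel": 1, "state": "on"})
append(chain, {"action": "pattern_switch", "to_pattern": "flash"})
print(verify(chain))  # True

chain[0]["payload"]["state"] = "off"  # tamper with the first record
print(verify(chain))  # False
```

Because each entry commits to its predecessor's digest, altering one record invalidates it and every later link, which is what makes a chain like this useful for after-the-fact audit of relay and pattern changes.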
--- a/apps/fc-ttsreader/fc-ttsreader.yaml
+++ b/apps/fc-ttsreader/fc-ttsreader.yaml
@@ -532,7 +532,7 @@ spec:
        fsGroupChangePolicy: OnRootMismatch
      containers:
        - name: web
-          image: localhost/fc-ttsreader-web:v20260518-sprint36-demo-finish-b132cbf
+          image: localhost/fc-ttsreader-web:v20260531-tts-corrections-r2
          imagePullPolicy: Never
          ports:
            - containerPort: 5217
@@ -554,6 +554,8 @@ spec:
              value: "/data/chapter-context.db"
            - name: TtsReader__Jobs__Root
              value: "/data/jobs"
+            - name: TtsReader__Export__LocalCasRoot
+              value: "/data/bundles/cas"
            - name: TtsReader__Piper__Host
              value: "10.0.57.17"
            - name: TtsReader__Piper__Port
--- a/apps/fc-updater/fc-updater.yaml
+++ b/apps/fc-updater/fc-updater.yaml
@@ -58,7 +58,7 @@ spec:
      nodeName: rke2-server
      containers:
        - name: web
-          image: localhost/fc-updater-web:v20260509-4162dca-authgate
+          image: localhost/fc-updater-web:v202605310029-7974fc4
          imagePullPolicy: Never
          ports:
            - containerPort: 8080
@@ -88,6 +88,8 @@ spec:
              value: Faith AI Mike Edition
            - name: FlowerCore__Updater__PublicShares__Links__0__Description
              value: Private release link for Mike's Faith AI bundle.
+            - name: FlowerCore__Audit__Sinks__Loki__Enabled
+              value: "false"
            - name: FlowerCore__Updater__Auth__Bootstrap__Enabled
              value: "true"
            - name: FlowerCore__Updater__Auth__Bootstrap__Username
--- a/apps/github-runner/Dockerfile
+++ b/apps/github-runner/Dockerfile
@@ -12,6 +12,15 @@ ENV PATH="/home/runner/_tool/Ruby/${RUBY_MINOR}/x64/bin:/opt/runner-toolcache/Ru

 USER root

+# Bake the IAmWorkin step-ca root CA into the system trust store. Without
+# this, .NET HttpClient calls from CI tests against *.iamworkin.lan
+# (e.g. https://selenium.iamworkin.lan/session) fail with `PartialChain`
+# because the runner image's default Ubuntu trust bundle doesn't include
+# our internal Root CA. update-ca-certificates regenerates
+# /etc/ssl/certs/ca-certificates.crt, which OpenSSL + .NET on Linux read
+# automatically — no SSL_CERT_FILE env var needed.
+COPY step-ca-root.crt /usr/local/share/ca-certificates/iamworkin-step-ca-root.crt
+
 RUN apt-get update \
    && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        autoconf \
@@ -31,6 +40,7 @@ RUN apt-get update \
        pkg-config \
        uuid-dev \
        zlib1g-dev \
+    && update-ca-certificates \
    && curl -fsSL "https://github.com/rbenv/ruby-build/archive/refs/tags/${RUBY_BUILD_VERSION}.tar.gz" -o /tmp/ruby-build.tar.gz \
    && mkdir -p /tmp/ruby-build \
    && tar -xzf /tmp/ruby-build.tar.gz --strip-components=1 -C /tmp/ruby-build \
--- a/apps/github-runner/README.md
+++ b/apps/github-runner/README.md
@@ -7,7 +7,7 @@ Deployments with `kubectl`; update this manifest and let ArgoCD reconcile.

 All repo-scoped Linux runners use:

-- `localhost/fc-github-runner:v20260520-ruby3.3.11`, derived from
+- `localhost/fc-github-runner:v20260525-ruby3.3.11-stepca`, derived from
  `myoung34/github-runner:latest`
 - `ACCESS_TOKEN` from the `github-runner-token` Secret
 - `RUN_AS_ROOT=false`
@@ -40,14 +40,26 @@ still mounts an `emptyDir` over `/home/runner`, so the `setup-runner-home` init
 container copies the baked toolcache from `/opt/runner-toolcache/Ruby` into
 `/home/runner/_tool/Ruby` before the runner container starts.

+The IAmWorkin step-ca root CA is also baked into the system trust store
+(`/usr/local/share/ca-certificates/iamworkin-step-ca-root.crt`, registered by
+`update-ca-certificates`). Without it, .NET HttpClient calls from CI tests
+against `*.iamworkin.lan` (e.g. `https://selenium.iamworkin.lan/session`)
+fail with `PartialChain`. To refresh the bundled cert when the root rotates,
+re-extract from the cluster and overwrite `step-ca-root.crt`:
+
+```bash
+kubectl get secret -n cert-manager step-ca-root \
+  -o jsonpath='{.data.ca\.crt}' | base64 -d > step-ca-root.crt
+```
+
 ```bash
 cd apps/github-runner
-podman build -t localhost/fc-github-runner:v20260520-ruby3.3.11 .
-podman run --rm localhost/fc-github-runner:v20260520-ruby3.3.11 ruby -v
-podman run --rm localhost/fc-github-runner:v20260520-ruby3.3.11 \
+podman build -t localhost/fc-github-runner:v20260525-ruby3.3.11-stepca .
+podman run --rm localhost/fc-github-runner:v20260525-ruby3.3.11-stepca ruby -v
+podman run --rm localhost/fc-github-runner:v20260525-ruby3.3.11-stepca \
  test -f /opt/runner-toolcache/Ruby/3.3/x64.complete
-podman save localhost/fc-github-runner:v20260520-ruby3.3.11 \
-  -o fc-github-runner-v20260520-ruby3.3.11.tar
+podman save localhost/fc-github-runner:v20260525-ruby3.3.11-stepca \
+  -o fc-github-runner-v20260525-ruby3.3.11-stepca.tar
 ```

 Import the saved image on every schedulable RKE2 node before ArgoCD rolls the
@@ -55,9 +67,9 @@ Deployments:

 ```bash
 for node in rke2-server rke2-agent1 rke2-agent2; do
-  scp fc-github-runner-v20260520-ruby3.3.11.tar "$node:/tmp/"
-  ssh "$node" 'sudo ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images rm localhost/fc-github-runner:v20260520-ruby3.3.11 || true'
-  ssh "$node" 'sudo ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images import /tmp/fc-github-runner-v20260520-ruby3.3.11.tar'
+  scp fc-github-runner-v20260525-ruby3.3.11-stepca.tar "$node:/tmp/"
+  ssh "$node" 'sudo ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images rm localhost/fc-github-runner:v20260525-ruby3.3.11-stepca || true'
+  ssh "$node" 'sudo ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images import /tmp/fc-github-runner-v20260525-ruby3.3.11-stepca.tar'
 done
 ```

--- a/apps/github-runner/github-runner.yaml
+++ b/apps/github-runner/github-runner.yaml
--- a/apps/github-runner/step-ca-root.crt
+++ b/apps/github-runner/step-ca-root.crt
@@ -0,0 +1,12 @@
+-----BEGIN CERTIFICATE-----
+MIIBxDCCAWqgAwIBAgIRAPY357G6ow6zMAL5+4bS2kkwCgYIKoZIzj0EAwIwQDEa
+MBgGA1UEChMRSUFtV29ya2luIEFDTUUgQ0ExIjAgBgNVBAMTGUlBbVdvcmtpbiBB
+Q01FIENBIFJvb3QgQ0EwHhcNMjYwMzA4MTgwNzExWhcNMzYwMzA1MTgwNzExWjBA
+MRowGAYDVQQKExFJQW1Xb3JraW4gQUNNRSBDQTEiMCAGA1UEAxMZSUFtV29ya2lu
+IEFDTUUgQ0EgUm9vdCBDQTBZMBMGByqGSM49AgEGCCqGSM49AwEHA0IABJ2n04X1
+JZo5Zdq/i1Idv8+fqwZyAzBh7whbqj0SWsJL8UWRabCMqYCs7+dXO0xRSzqkwFDL
+x+vooOai8RgRNhajRTBDMA4GA1UdDwEB/wQEAwIBBjASBgNVHRMBAf8ECDAGAQH/
+AgEBMB0GA1UdDgQWBBRnuPPQR6iM/H6vOluiU3Sygayz8jAKBggqhkjOPQQDAgNI
+ADBFAiEArQK9dYPGmAZsdYnjziuFVVE5NKZUcceYvGfGC+tLXUsCIAudF2zJrCRq
+3mK50ZZET/fwTkJwiEF4824mjP8p1CKM
+-----END CERTIFICATE-----
--- a/apps/intranet/intranet.yaml
+++ b/apps/intranet/intranet.yaml
@@ -46,7 +46,7 @@ spec:
    spec:
      containers:
        - name: intranet-web
-          image: localhost/fc-intranet-web:v20260508-brochure-w1
+          image: localhost/fc-intranet-web:v20260531-ttsreader-bridge
          imagePullPolicy: Never
          ports:
            - containerPort: 5300
--- a/apps/kubevirt-vms/ci1.yaml
+++ b/apps/kubevirt-vms/ci1.yaml
@@ -25,7 +25,7 @@ metadata:
    role: github-actions-runner
    flowercore.io/managed-by: bluejay-infra
 spec:
-  runStrategy: Always
+  runStrategy: Halted
  template:
    metadata:
      labels:
--- a/apps/mail/mail.yaml
+++ b/apps/mail/mail.yaml
@@ -207,20 +207,13 @@ spec:
    - port: 993
      targetPort: 993
      name: imaps
----
-# TLS Certificate via cert-manager
-apiVersion: cert-manager.io/v1
-kind: Certificate
-metadata:
-  name: mail-tls
-  namespace: mail
-spec:
-  secretName: mail-tls
-  issuerRef:
-    name: step-ca-acme
-    kind: ClusterIssuer
-  dnsNames:
-    - mail.iamworkin.lan
+# --- mail-tls Certificate REMOVED 2026-06-01 ---
+# mail-tls is now managed OUTSIDE cert-manager: issued from step-ca's JWK 'admin'
+# provisioner and auto-renewed by a systemd timer on noc1 (step ca renew), which
+# writes the mail-tls secret directly. step-ca-acme only has an HTTP-01 (Traefik)
+# solver, but mail.iamworkin.lan must resolve to the dedicated MetalLB IP 10.0.56.202
+# (SMTP/IMAP), so HTTP-01 cannot validate. Do NOT re-add a cert-manager Certificate
+# here unless a DNS-01 solver is deployed for step-ca-acme.
 ---
 # Traefik IngressRoute - Webmail placeholder
 apiVersion: traefik.io/v1alpha1
--- a/apps/monitoring/grafana-dashboard-signalcontrol.yaml
+++ b/apps/monitoring/grafana-dashboard-signalcontrol.yaml
@@ -0,0 +1,260 @@
+# Grafana dashboard ConfigMap for FlowerCore.SignalControl on pirelay.
+#
+# The Grafana Deployment in noc-monitoring.yaml mounts this ConfigMap at
+# /var/lib/grafana/dashboards/signalcontrol. The paired Prometheus jobs are:
+# - signalcontrol-pi-app: 10.0.58.113:5200 /metrics/prometheus
+# - edge-nodes: 10.0.58.113:9100 with instance="pirelay"
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: grafana-dashboard-signalcontrol
+  namespace: monitoring
+data:
+  signalcontrol.json: |
+    {
+      "annotations": { "list": [] },
+      "editable": true,
+      "fiscalYearStartMonth": 0,
+      "graphTooltip": 0,
+      "id": null,
+      "links": [],
+      "panels": [
+        {
+          "datasource": { "type": "prometheus", "uid": "prometheus" },
+          "fieldConfig": {
+            "defaults": {
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  { "color": "red", "value": null },
+                  { "color": "green", "value": 1 }
+                ]
+              },
+              "unit": "short"
+            },
+            "overrides": []
+          },
+          "gridPos": { "h": 5, "w": 6, "x": 0, "y": 0 },
+          "id": 1,
+          "options": {
+            "colorMode": "value",
+            "graphMode": "none",
+            "justifyMode": "auto",
+            "orientation": "auto",
+            "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false },
+            "textMode": "auto"
+          },
+          "targets": [
+            { "editorMode": "code", "expr": "up{job=\"signalcontrol-pi-app\",instance=\"pirelay\"}", "range": true, "refId": "A" }
+          ],
+          "title": "SignalControl App Up",
+          "type": "stat"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "prometheus" },
+          "fieldConfig": {
+            "defaults": {
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  { "color": "red", "value": null },
+                  { "color": "green", "value": 1 }
+                ]
+              },
+              "unit": "short"
+            },
+            "overrides": []
+          },
+          "gridPos": { "h": 5, "w": 6, "x": 6, "y": 0 },
+          "id": 2,
+          "options": {
+            "colorMode": "value",
+            "graphMode": "none",
+            "justifyMode": "auto",
+            "orientation": "auto",
+            "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false },
+            "textMode": "auto"
+          },
+          "targets": [
+            { "editorMode": "code", "expr": "up{job=\"edge-nodes\",instance=\"pirelay\"}", "range": true, "refId": "A" }
+          ],
+          "title": "pirelay node_exporter Up",
+          "type": "stat"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "prometheus" },
+          "fieldConfig": { "defaults": { "unit": "short" }, "overrides": [] },
+          "gridPos": { "h": 5, "w": 6, "x": 12, "y": 0 },
+          "id": 3,
+          "options": {
+            "colorMode": "value",
+            "graphMode": "area",
+            "justifyMode": "auto",
+            "orientation": "auto",
+            "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false },
+            "textMode": "name"
+          },
+          "targets": [
+            { "editorMode": "code", "expr": "signalcontrol_active_pattern{job=\"signalcontrol-pi-app\",instance=\"pirelay\"}", "legendFormat": "{{pattern}}", "range": true, "refId": "A" }
+          ],
+          "title": "Active Pattern",
+          "type": "stat"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "prometheus" },
+          "fieldConfig": { "defaults": { "unit": "short" }, "overrides": [] },
+          "gridPos": { "h": 5, "w": 6, "x": 18, "y": 0 },
+          "id": 4,
+          "options": {
+            "colorMode": "value",
+            "graphMode": "area",
+            "justifyMode": "auto",
+            "orientation": "auto",
+            "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false },
+            "textMode": "name"
+          },
+          "targets": [
+            { "editorMode": "code", "expr": "signalcontrol_phase{job=\"signalcontrol-pi-app\",instance=\"pirelay\"}", "legendFormat": "{{phase}}", "range": true, "refId": "A" }
+          ],
+          "title": "Current Phase",
+          "type": "stat"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "prometheus" },
+          "fieldConfig": { "defaults": { "unit": "ops" }, "overrides": [] },
+          "gridPos": { "h": 8, "w": 12, "x": 0, "y": 5 },
+          "id": 5,
+          "options": { "legend": { "displayMode": "table", "placement": "bottom" }, "tooltip": { "mode": "single" } },
+          "targets": [
+            {
+              "editorMode": "code",
+              "expr": "sum by (channel, state) (rate(signal_relay_writes_total{job=\"signalcontrol-pi-app\",instance=\"pirelay\"}[$__rate_interval]))",
+              "legendFormat": "channel {{channel}} {{state}}",
+              "range": true,
+              "refId": "A"
+            }
+          ],
+          "title": "Relay Activations",
+          "type": "timeseries"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "prometheus" },
+          "fieldConfig": { "defaults": { "unit": "ops" }, "overrides": [] },
+          "gridPos": { "h": 8, "w": 12, "x": 12, "y": 5 },
+          "id": 6,
+          "options": { "legend": { "displayMode": "table", "placement": "bottom" }, "tooltip": { "mode": "single" } },
+          "targets": [
+            {
+              "editorMode": "code",
+              "expr": "sum by (source, to_phase) (rate(signal_transitions_total{job=\"signalcontrol-pi-app\",instance=\"pirelay\"}[$__rate_interval]))",
+              "legendFormat": "{{source}} -> {{to_phase}}",
+              "range": true,
+              "refId": "A"
+            }
+          ],
+          "title": "Phase Dwell / Transitions",
+          "type": "timeseries"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "prometheus" },
+          "fieldConfig": { "defaults": { "unit": "short" }, "overrides": [] },
+          "gridPos": { "h": 8, "w": 12, "x": 0, "y": 13 },
+          "id": 7,
+          "options": { "legend": { "displayMode": "table", "placement": "bottom" }, "tooltip": { "mode": "single" } },
+          "targets": [
+            {
+              "editorMode": "code",
+              "expr": "sum by (action) (increase(signal_schedule_fires_total{job=\"signalcontrol-pi-app\",instance=\"pirelay\"}[24h]))",
+              "legendFormat": "{{action}}",
+              "range": true,
+              "refId": "A"
+            },
+            {
+              "editorMode": "code",
+              "expr": "sum by (from_pattern, to_pattern) (increase(flowercore_signalcontrol_pattern_switches_total{job=\"signalcontrol-pi-app\",instance=\"pirelay\"}[24h]))",
+              "legendFormat": "{{from_pattern}} -> {{to_pattern}}",
+              "range": true,
+              "refId": "B"
+            }
+          ],
+          "title": "Schedule Fires and Pattern Switches",
+          "type": "timeseries"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "prometheus" },
+          "fieldConfig": { "defaults": { "unit": "percentunit" }, "overrides": [] },
+          "gridPos": { "h": 8, "w": 12, "x": 12, "y": 13 },
+          "id": 8,
+          "options": { "legend": { "displayMode": "table", "placement": "bottom" }, "tooltip": { "mode": "single" } },
+          "targets": [
+            {
+              "editorMode": "code",
+              "expr": "1 - avg by (instance) (rate(node_cpu_seconds_total{job=\"edge-nodes\",instance=\"pirelay\",mode=\"idle\"}[$__rate_interval]))",
+              "legendFormat": "CPU",
+              "range": true,
+              "refId": "A"
+            },
+            {
+              "editorMode": "code",
+              "expr": "1 - (node_memory_MemAvailable_bytes{job=\"edge-nodes\",instance=\"pirelay\"} / node_memory_MemTotal_bytes{job=\"edge-nodes\",instance=\"pirelay\"})",
+              "legendFormat": "Memory",
+              "range": true,
+              "refId": "B"
+            }
+          ],
+          "title": "pirelay Host Utilization",
+          "type": "timeseries"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "prometheus" },
+          "fieldConfig": { "defaults": { "unit": "short" }, "overrides": [] },
+          "gridPos": { "h": 6, "w": 12, "x": 0, "y": 21 },
+          "id": 9,
+          "options": {
+            "colorMode": "value",
+            "graphMode": "area",
+            "justifyMode": "auto",
+            "orientation": "auto",
+            "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false },
+            "textMode": "auto"
+          },
+          "targets": [
+            { "editorMode": "code", "expr": "signalcontrol_screen_saver_enabled{job=\"signalcontrol-pi-app\",instance=\"pirelay\"}", "range": true, "refId": "A" }
+          ],
+          "title": "Screen-saver Enabled",
+          "type": "stat"
+        },
+        {
+          "datasource": { "type": "prometheus", "uid": "prometheus" },
+          "fieldConfig": { "defaults": { "unit": "short" }, "overrides": [] },
+          "gridPos": { "h": 6, "w": 12, "x": 12, "y": 21 },
+          "id": 10,
+          "options": {
+            "colorMode": "value",
+            "graphMode": "area",
+            "justifyMode": "auto",
+            "orientation": "auto",
+            "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false },
+            "textMode": "name"
+          },
+          "targets": [
+            { "editorMode": "code", "expr": "signalcontrol_animation_active{job=\"signalcontrol-pi-app\",instance=\"pirelay\"}", "legendFormat": "{{planner}}", "range": true, "refId": "A" }
+          ],
+          "title": "Screen-saver / Animation Engaged",
+          "type": "stat"
+        }
+      ],
+      "refresh": "30s",
+      "schemaVersion": 39,
+      "style": "dark",
+      "tags": [ "flowercore", "signalcontrol", "pirelay" ],
+      "templating": { "list": [] },
+      "time": { "from": "now-24h", "to": "now" },
+      "timezone": "browser",
+      "title": "FlowerCore SignalControl",
+      "uid": "flowercore-signalcontrol",
+      "version": 1
+    }
--- a/apps/monitoring/noc-monitoring.yaml
+++ b/apps/monitoring/noc-monitoring.yaml
@@ -230,6 +230,19 @@ data:
              vlan: "home"
              device: "pi3-ks0212"

+      # SignalControl Pi-edition app metrics (pirelay / signal-a)
+      - job_name: "signalcontrol-pi-app"
+        scrape_interval: 15s
+        metrics_path: /metrics/prometheus
+        static_configs:
+          - targets: ["10.0.58.113:5200"]
+            labels:
+              instance: "pirelay"
+              host: "signal-a.iamworkin.lan"
+              service: "signalcontrol-pi"
+              vlan: "home"
+              device: "pi3-ks0212"
+
      # Epson ET-3750 EcoTank Printer SNMP
      - job_name: "snmp-printer"
        scrape_interval: 5m
@@ -4051,6 +4064,9 @@ spec:
            - name: dashboards-remotedesktop
              mountPath: /var/lib/grafana/dashboards/remotedesktop
              readOnly: true
+            - name: dashboards-signalcontrol
+              mountPath: /var/lib/grafana/dashboards/signalcontrol
+              readOnly: true
            - name: datasource-provisioning
              mountPath: /etc/grafana/provisioning/datasources
              readOnly: true
@@ -4104,6 +4120,9 @@ spec:
        - name: dashboards-remotedesktop
          configMap:
            name: grafana-dashboard-remotedesktop
+        - name: dashboards-signalcontrol
+          configMap:
+            name: grafana-dashboard-signalcontrol
        - name: datasource-provisioning
          configMap:
            name: grafana-datasource-provisioning
--- a/apps/selenium/selenium-grid.yaml
+++ b/apps/selenium/selenium-grid.yaml
@@ -132,13 +132,18 @@ spec:
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 5
+        # Hub baseline working set ~766Mi on 2026-05-25 (75% of prior 1Gi
+        # limit). Bump to 1.5Gi / 1Gi to keep ~50% headroom; matches the
+        # stampede-buffer pattern documented for multus
+        # (feedback_k8s_cni_multus_sizing). CPU left alone — observed 54m
+        # against a 500m limit, no contention.
        resources:
          limits:
            cpu: 500m
-            memory: 1Gi
+            memory: 1536Mi
          requests:
            cpu: 250m
-            memory: 512Mi
+            memory: 1Gi
 ---
 apiVersion: apps/v1
 kind: Deployment
@@ -198,13 +203,18 @@ spec:
            port: 5555
          initialDelaySeconds: 15
          periodSeconds: 5
+        # Chromium-based browser node. Bumped from 1Gi -> 2Gi (req 512Mi
+        # -> 1Gi) on 2026-05-25 — Edge had 51 OOMKills in 5d on the
+        # original 1Gi cap (~1 OOM every 2.4h), and Chrome at maxSessions=2
+        # was running 684Mi idle on the same cap. Matches the Firefox node's
+        # tested-stable 2Gi limit. CPU unchanged.
        resources:
          limits:
            cpu: '1'
-            memory: 1Gi
+            memory: 2Gi
          requests:
            cpu: 500m
-            memory: 512Mi
+            memory: 1Gi
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
@@ -378,13 +388,18 @@ spec:
            port: 5555
          initialDelaySeconds: 15
          periodSeconds: 5
+        # Chromium-based browser node. Bumped from 1Gi -> 2Gi (req 512Mi
+        # -> 1Gi) on 2026-05-25 — Edge had 51 OOMKills in 5d on the
+        # original 1Gi cap (~1 OOM every 2.4h), and Chrome at maxSessions=2
+        # was running 684Mi idle on the same cap. Matches the Firefox node's
+        # tested-stable 2Gi limit. CPU unchanged.
        resources:
          limits:
            cpu: '1'
-            memory: 1Gi
+            memory: 2Gi
          requests:
            cpu: 500m
-            memory: 512Mi
+            memory: 1Gi
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
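
Reviewer sketch (not part of the patch): the sizing comments in this file rest on a few ratios, which can be re-derived from the figures quoted in the comments. The helper name `headroom` is made up for this check; the inputs (766Mi hub working set, 51 OOMKills in 5 days, 684Mi idle Chrome) come from the comments above.

```python
def headroom(limit_mib: float, working_set_mib: float) -> float:
    """Fraction of the memory limit left unused at the observed working set."""
    return 1 - working_set_mib / limit_mib


hub_working_set = 766  # Mi, hub baseline quoted in the comment

# Share of the prior 1Gi limit in use (the comment rounds this to 75%).
print(round(hub_working_set / 1024, 3))           # 0.748

# Headroom under the new 1536Mi limit (the comment says ~50%).
print(round(headroom(1536, hub_working_set), 3))  # 0.501

# Mean hours between Edge OOMKills: 51 kills over 5 days.
print(round(5 * 24 / 51, 2))                      # 2.35

# Idle Chrome at 684Mi against the new 2Gi limit.
print(round(headroom(2048, 684), 3))              # 0.666
```

The 2.35h mean interval agrees with the "~1 OOM every 2.4h" figure at one decimal place.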
--- a/tests/bluejay-infra-lint/FleetManifestLintTests.cs
+++ b/tests/bluejay-infra-lint/FleetManifestLintTests.cs
@@ -67,6 +67,7 @@ public sealed class FleetManifestLintTests
        ["github-runner-chat"] = "https://github.com/astoltz/FlowerCore.Chat",
        ["github-runner-mysql"] = "https://github.com/astoltz/FlowerCore.MySQL",
        ["github-runner-kiosk-linux"] = "https://github.com/astoltz/FlowerCore.Kiosk.Linux",
+        ["github-runner-updater"] = "https://github.com/astoltz/FlowerCore.Updater",
    };

    private static readonly HashSet<string> ScaledLinuxRunnerDeployments = new(StringComparer.Ordinal)
@@ -80,6 +81,7 @@ public sealed class FleetManifestLintTests
        "github-runner-chat",
        "github-runner-mysql",
        "github-runner-kiosk-linux",
+        "github-runner-updater",
    };

    private static readonly IReadOnlyDictionary<string, string> WritableRunnerEnv = new Dictionary<string, string>(StringComparer.Ordinal)
@@ -234,7 +236,7 @@ public sealed class FleetManifestLintTests
        {
            deployments.Should().ContainKey(expectedRunner.Key);

-            var container = deployments[expectedRunner.Key].ContainerMappings().Should().ContainSingle().Subject;
+            var container = deployments[expectedRunner.Key].MainContainerMappings().Should().ContainSingle().Subject;
            EnvValue(container, "REPO_URL").Should().Be(expectedRunner.Value);
            EnvValue(container, "EPHEMERAL").Should().Be("true");
            EnvValue(container, "LABELS").Should().Be("self-hosted,linux,fc-build-linux");
@@ -250,7 +252,7 @@ public sealed class FleetManifestLintTests
    {
        foreach (var deployment in GitHubRunnerDeployments().Values)
        {
-            var container = deployment.ContainerMappings().Should().ContainSingle().Subject;
+            var container = deployment.MainContainerMappings().Should().ContainSingle().Subject;

            foreach (var expectedEnv in WritableRunnerEnv)
            {
@@ -277,7 +279,10 @@ public sealed class FleetManifestLintTests
        foreach (var deploymentName in ScaledLinuxRunnerDeployments)
        {
            var deployment = deployments[deploymentName];
-            ReplicaCount(deployment).Should().Be(2);
+            // Scaled runners must have >= 2 replicas (avoid single-pod bottleneck).
+            // Individual deployments may be tuned upward per CI activity — see
+            // "runners: right-size replica counts per 14d CI activity (#24)".
+            ReplicaCount(deployment).Should().BeGreaterOrEqualTo(2, $"{deploymentName} is in the scaled set and must run with at least 2 replicas");

            var volumes = deployment.MappingSequence("spec", "template", "spec", "volumes");
            var claimNames = volumes
@@ -303,6 +308,108 @@ public sealed class FleetManifestLintTests
            .Be("github-runner-nuget-cache");
    }

+    [Fact]
+    public void Runners_MustNotPinToOperatorWorkstationHosts()
+    {
+        // CRITICAL SAFETY (operator directive 2026-05-26): BLUEJAY-WS is the
+        // operator's primary workstation — host of the 1Password Connect
+        // bearer token, fcadmin SSH keys to noc1, signing CA private keys,
+        // and source for every FC repo. A self-hosted GitHub Actions runner
+        // there would execute arbitrary PR code with that local access.
+        // Build-side analog of the Sprint 9 NEW safe-account exclusion gate
+        // (Puppet GPO/AppLocker/WDAC/audit-forwarder modules refuse to apply
+        // on BLUEJAY-WS). This lint asserts no GitHub-runner Deployment in
+        // apps/github-runner/ pins to a forbidden operator-workstation host
+        // via nodeName, nodeSelector, nodeAffinity, or tolerations.
+        // Existing legacy `bluejay-ws-sandbox-1` GitHub-registered runner is
+        // out of scope here (it's a runtime registration, not a K8s
+        // Deployment) — see CLAUDE.md "Common Mistakes" entry and
+        // feedback_bluejay_ws_never_public_runner.md.
+        var forbiddenHostPatterns = new[]
+        {
+            "bluejay-ws",
+            "BLUEJAY-WS",
+            "bluejay-ws.iamworkin.lan",
+            "iamworkin-ws",
+        };
+
+        bool ContainsForbidden(string? value) =>
+            !string.IsNullOrWhiteSpace(value)
+            && forbiddenHostPatterns.Any(pattern => value!.Contains(pattern, StringComparison.OrdinalIgnoreCase));
+
+        var violations = GitHubRunnerDeployments().Values.SelectMany(deployment =>
+        {
+            var local = new List<string>();
+            var podSpec = ManifestNodeExtensions.Mapping(deployment.Root, "spec", "template", "spec");
+            if (podSpec is null)
+            {
+                return local;
+            }
+
+            // nodeName: pins the pod to a specific node by name.
+            var nodeName = ManifestNodeExtensions.Scalar(podSpec, "nodeName");
+            if (ContainsForbidden(nodeName))
+            {
+                local.Add($"{deployment.Name} sets nodeName='{nodeName}' which targets a forbidden operator-workstation host.");
+            }
+
+            // nodeSelector: dict of label → value pinning the pod to nodes
+            // carrying matching labels. Examples that would trip this:
+            //   kubernetes.io/hostname: bluejay-ws
+            //   flowercore.io/host: bluejay-ws.iamworkin.lan
+            var nodeSelector = ManifestNodeExtensions.Mapping(podSpec, "nodeSelector");
+            if (nodeSelector is not null)
+            {
+                foreach (var entry in nodeSelector.Children)
+                {
+                    var key = entry.Key is YamlScalarNode keyScalar ? keyScalar.Value : null;
+                    var value = entry.Value is YamlScalarNode valueScalar ? valueScalar.Value : null;
+                    if (ContainsForbidden(value))
+                    {
+                        local.Add($"{deployment.Name} has nodeSelector entry '{key}: {value}' which targets a forbidden operator-workstation host.");
+                    }
+                }
+            }
+
+            // nodeAffinity: matchExpressions over node labels.
+            foreach (var term in ManifestNodeExtensions.MappingSequence(podSpec, "affinity", "nodeAffinity", "requiredDuringSchedulingIgnoredDuringExecution", "nodeSelectorTerms"))
+            {
+                foreach (var expr in ManifestNodeExtensions.MappingSequence(term, "matchExpressions"))
+                {
+                    var key = ManifestNodeExtensions.Scalar(expr, "key");
+                    foreach (var valueNode in ManifestNodeExtensions.ScalarSequence(expr, "values"))
+                    {
+                        if (ContainsForbidden(valueNode))
+                        {
+                            local.Add($"{deployment.Name} has nodeAffinity matchExpression '{key}' value '{valueNode}' which targets a forbidden operator-workstation host.");
+                        }
+                    }
+                }
+            }
+
+            // tolerations: tolerating a taint on an operator-workstation
+            // node would let the runner schedule there. Forbid any toleration
+            // key or value that names the workstation.
+            foreach (var toleration in ManifestNodeExtensions.MappingSequence(podSpec, "tolerations"))
+            {
+                var key = ManifestNodeExtensions.Scalar(toleration, "key");
+                var value = ManifestNodeExtensions.Scalar(toleration, "value");
+                if (ContainsForbidden(key))
+                {
+                    local.Add($"{deployment.Name} has toleration key '{key}' which targets a forbidden operator-workstation host.");
+                }
+                if (ContainsForbidden(value))
+                {
+                    local.Add($"{deployment.Name} has toleration value '{value}' which targets a forbidden operator-workstation host.");
+                }
+            }
+
+            return local;
+        }).ToList();
+
+        violations.Should().BeEmpty("BLUEJAY-WS / iamworkin-ws must never host a fleet GitHub Actions runner; see CLAUDE.md 'Registering BLUEJAY-WS as a fleet GitHub Actions runner' and feedback_bluejay_ws_never_public_runner.md");
+    }
+
    [Fact]
    public void Monitoring_MustAlertWhenLinuxRunnerDeploymentIsUnavailable()
    {
@@ -890,6 +997,22 @@ internal sealed record ManifestDocument(
            .ToList();
    }

+    // MainContainerMappings excludes initContainers. Use this when asserting
+    // properties of the primary container (env, image, volumeMounts) where an
+    // initContainer would be a false-positive match — e.g. the GitHub runner
+    // image's `setup-runner-home` initContainer should not count toward the
+    // single-container assertions on the runner deployments.
+    public IReadOnlyList<YamlMappingNode> MainContainerMappings()
+    {
+        var podSpec = PodSpec();
+        if (podSpec is null)
+        {
+            return Array.Empty<YamlMappingNode>();
+        }
+
+        return ManifestNodeExtensions.MappingSequence(podSpec, "containers").ToList();
+    }
+
    public IReadOnlyList<ContainerSpec> ContainerSpecs()
    {
        return ContainerMappings()
--- a/tests/bluejay-infra-lint/SignalControlPlatformManifestTests.cs
+++ b/tests/bluejay-infra-lint/SignalControlPlatformManifestTests.cs
@@ -0,0 +1,51 @@
+using FluentAssertions;
+using Xunit;
+
+namespace BluejayInfraLint.Tests;
+
+[Trait("Category", "Unit")]
+public sealed class SignalControlPlatformManifestTests
+{
+    private static readonly string Root = ManifestInventory.Load().BluejayRoot;
+
+    [Fact]
+    public void Monitoring_PrometheusScrapesSignalControlPiAppAndPirelayNodeExporter()
+    {
+        var monitoring = File.ReadAllText(Path.Combine(Root, "apps", "monitoring", "noc-monitoring.yaml"));
+
+        monitoring.Should().Contain("job_name: \"signalcontrol-pi-app\"");
+        monitoring.Should().Contain("metrics_path: /metrics/prometheus");
+        monitoring.Should().Contain("targets: [\"10.0.58.113:5200\"]");
+        monitoring.Should().Contain("host: \"signal-a.iamworkin.lan\"");
+        monitoring.Should().Contain("targets: [\"10.0.58.113:9100\"]");
+        monitoring.Should().Contain("instance: \"pirelay\"");
+    }
+
+    [Fact]
+    public void Monitoring_GrafanaMountsSignalControlDashboard()
+    {
+        var monitoring = File.ReadAllText(Path.Combine(Root, "apps", "monitoring", "noc-monitoring.yaml"));
+        var dashboard = File.ReadAllText(Path.Combine(Root, "apps", "monitoring", "grafana-dashboard-signalcontrol.yaml"));
+
+        monitoring.Should().Contain("name: dashboards-signalcontrol");
+        monitoring.Should().Contain("mountPath: /var/lib/grafana/dashboards/signalcontrol");
+        monitoring.Should().Contain("name: grafana-dashboard-signalcontrol");
+        dashboard.Should().Contain("\"uid\": \"flowercore-signalcontrol\"");
+        dashboard.Should().Contain("signalcontrol_active_pattern");
+        dashboard.Should().Contain("signal_relay_writes_total");
+        dashboard.Should().Contain("node_cpu_seconds_total");
+    }
+
+    [Fact]
+    public void FcSignalControlReadme_DocumentsMtlsTelemetryAndDefaultOffAudit()
+    {
+        var readme = File.ReadAllText(Path.Combine(Root, "apps", "fc-signalcontrol", "README.md"));
+
+        readme.Should().Contain("step-ca-agent");
+        readme.Should().Contain("10.0.58.113:5200");
+        readme.Should().Contain("10.0.58.113:9100");
+        readme.Should().Contain("PhysicalAudit:Enabled=false");
+        readme.Should().Contain("ForwardingEnabled=false");
+        readme.Should().Contain("Secrets, enrollment codes, private keys");
+    }
+}
Author	SHA1	Message	Date
Andrew Stoltz	62f6d8e7d5	Add SignalControl platform telemetry manifests	2026-06-01 22:29:18 -05:00
Andrew Stoltz	6c18f69cf2	mail: remove cert-manager Certificate (manage mail-tls via step-ca JWK + noc1 renew timer) step-ca-acme only has an HTTP-01 (Traefik) solver, but mail.iamworkin.lan must resolve to the dedicated MetalLB IP 10.0.56.202 (SMTP/IMAP), so HTTP-01 cannot validate (order stuck pending since 2026-05-06; cert expired 2026-05-24). mail-tls is now issued from step-ca's JWK 'admin' provisioner and auto-renewed by a systemd timer on noc1 that writes the mail-tls secret directly. The secret + Deployment mount + webmail IngressRoute are unchanged. Re-add a Certificate only if a DNS-01 solver is deployed for step-ca-acme. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-01 15:55:38 -05:00
Andrew Stoltz	47e2256556	Deploy TtsReader correction bridge images	2026-05-31 12:35:45 -05:00
Andrew Stoltz	9d77f8ba0e	fc-updater: disable loki audit sink	2026-05-31 11:34:12 -05:00
Andrew Stoltz	2f4be19c85	fc-updater: bump signing diagnostics image	2026-05-31 00:32:48 -05:00
Andrew Stoltz	2a62c40990	fc-updater: bump image to MSI installer surface	2026-05-30 23:31:48 -05:00
Andrew Stoltz	7be98e5efc	Bump UpdateCenter image to hosted-service fix	2026-05-30 20:22:13 -05:00
Andrew Stoltz	a65b356c9d	deploy(fc-updater): roll UC to v202605301823-a6c3354 (Phase 3 SQLite fixes) Durable image bump for FlowerCore.Updater main a6c3354 (PRs #63-#66): hosted-service + request-path SQLite DateTimeOffset fixes, StopHost restored + per-tick resilience, Shared.Settings 1.0.1. Image built + imported to rke2-server. Un-degrades the Phase-9 provenance verifier + settings poll (were stopped under the removed global Ignore mask). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 18:27:45 -05:00
Andrew Stoltz	08c17ef1b4	fc-updater: bump to v202605301703-296f350-fix2 (BackgroundServiceExceptionBehavior=Ignore so a hosted-service SQLite query crash can't stop the host) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 17:04:54 -05:00
Andrew Stoltz	06f2f002b7	fc-updater: bump image to v202605301657-296f350-fix1 (Shared.Settings SQLite poll fix) The v202605301642-296f350-rework image crash-looped: FlowerCore.Shared.Settings SettingsDbPollHostedService ran a DateTimeOffset Where/OrderBy on SettingsRecordChanges that SQLite can't translate, and as a BackgroundService it stopped the host. Shared.Settings 1.0.1 materializes the change-log then filters/orders in memory; Updater Web bumped to 1.0.1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 16:59:37 -05:00
Andrew Stoltz	7ac4a8b4b7	fc-updater: bump image to v202605301642-296f350-rework (ADR-179 rework live) Deploy the current FlowerCore.Updater main (PRs #52-#61) to prod: MSI-first packaging, beta gating + per-install tokens, interactive+bearer Authentik OIDC, native installer apply, and the .fcsetup.exe retirement (DropReleaseInstallers migration runs on the now-empty DB). Image pre-imported to rke2-server + agent1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 16:47:28 -05:00
Andrew Stoltz	90f2a86819	ops: trim load for degraded 2-node cluster (agent2 PSU dead) Scale all github-runner deployments to 1 replica and halt the ci1 KubeVirt VM. With agent2 down (failed PSU) the cluster runs on two passively-cooled NUCs; the ci1 8-vCPU VM drove agent1 to ~100C. Keep total load trimmed until replacement hardware is in place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 13:47:13 -05:00
Andrew Stoltz	cbdefb2b23	Revert "ci1: expose WinRM/RDP/SSH ports on masquerade interface for Phase 2 bootstrap" The port additions caused the new VMI to stick at phase=Scheduled with reason=GuestNotRunning. The guest-console-log sidecar exited 1 and qemu never started. Reverting to the working 9-day-stable shape until the port-add path is verified in a non-production VM. Phase 2 (Windows runner install + registration) needs an operator-interactive virtctl-vnc session against the rebuilt VM, OR a separate investigation of why this port-add tipped over the VM. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 11:35:10 -05:00
Andrew Stoltz	1c36fe3a0a	ci1: expose WinRM/RDP/SSH ports on masquerade interface for Phase 2 bootstrap The Phase 1 VM has been Running for 9 days but Phase 2 (Puppet bootstrap + runner registration) was deferred because the operator-interactive virtctl-vnc path was the only way in. The masquerade interface listed no exposed ports, so virtctl ssh and kubectl port-forward both hit 'no route to host' — qemu user-mode NAT does not forward inbound by default. Adding 5985 (WinRM HTTP) lets a kubectl port-forward + PowerShell remoting path drive runner registration entirely from outside the VM. 3389 + 22 are reserved for desktop access via Guacamole or virtctl ssh once OpenSSH Server is installed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 11:24:34 -05:00
Andrew Stoltz	2b420ce8a4	runners: fleet-wide right-size CPU requests from 500m to 100m All 33 runner Deployments now request 100m CPU instead of 500m, freeing roughly 50 idle pods × 400m = ~20 cores back to the cluster. Observed CPU usage on idle runners is ~1m via kubectl top; the 500m request was a 500× over-provision that was eating allocatable CPU and blocking new workload scheduling — WorldBuilder runner could not be scheduled even at the new 100m request because the pre-existing fleet held the cluster at 99% requested. Burst headroom preserved by limits.cpu: 2000m unchanged. TtsReader keeps its 8Gi memory limit from the 2026-05-25 OOMKill fix; only the CPU request line moves. Recreate strategy on each deployment means a brief offline window per runner during rollout; in-flight CI jobs complete on the existing container before the new spec takes effect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 10:09:24 -05:00
Andrew Stoltz	5cbc1a06b1	runners: scale DM/AiStation.Linux/WorldBuilder down to 1 replica until cluster relieved After cutting requests to 100m, 4 of 6 new pods scheduled and 2 stayed Pending — cluster CPU REQUEST utilization is 49.6 of 48 allocatable cores because the existing fleet of ~50 idle runners reserves 25.6 cores (500m × ~50) for ~50m actual use. Single-replica per new repo gets the service online without competing with in-flight CI from the rest of the fleet. When the broader fleet-wide request right-sizing pass lands (500m → 100m on all idle runners would free ~20 cores), these can be bumped back to 2 replicas if PR-CI backlog warrants it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 10:03:30 -05:00
Andrew Stoltz	9e7ee39b3a	runners: drop CPU request 500m→100m on DM/AiStation.Linux/WorldBuilder All 3 fleet nodes were at 99% CPU REQUEST allocation; the 6 new pods from the previous commit (3 deployments × 2 replicas × 500m) couldn't schedule. Idle runners actually use ~1m CPU per `kubectl top pods`; the 500m request was significantly over-provisioned. Burst headroom preserved by limits.cpu: 2000m unchanged. Follow-up: similar request right-sizing pass across the rest of the runner fleet is queued for a future morning-routine sweep — 25 cores reserved for ~50m actual use is a large slack we can reclaim cluster-wide. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 10:00:23 -05:00
Andrew Stoltz	ae030a5f33	runners: add github-runner Deployments for DeviceManagement + AiStation.Linux + WorldBuilder Morning-routine 2026-05-26 — these three repos had ZERO online Linux PR-CI capacity, blocking the Sprint 37 Cx-1 Linux-CI-migration PRs (DM #20/#21/#22, AiStation.Linux #13, WorldBuilder #3/#4). Chicken-and-egg: the migration PRs need Linux runners that the migration creates. Each Deployment uses the same canonical emptyDir-only pattern as the fresh-2026-05-26 updater deployment that lives just above: - replicas: 2 (room for parallel PR-CI without head-of-line blocking) - per-pod emptyDir caches (no RWO PVC contention) - shared github-runner-token secret (existing ACCESS_TOKEN PAT has org-wide read access) - LABELS: self-hosted,linux,fc-build-linux - DOTNET_INSTALL_DIR pinned per ADR-170 family For AiStation.Linux specifically: Linux job will now pick up; the Windows job in #13 remains queued indefinitely until the Windows runner host substrate lands per Sprint 36 v2 Cl-2 / ADR-174 — that's a separate arc, not this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 09:55:31 -05:00
bluejay	bc8c35896f	tests: add bluejay-ws runner-exclusion lint + fix 3 stale runner-fleet assertions (#30) BLUEJAY-WS must never be a fleet GHA runner (operator directive 2026-05-26). Build-side analog of Sprint 9 safe-account exclusion. Also fixes 3 stale runner-fleet assertions broken by initContainer addition + replica tuning.	2026-05-26 03:42:01 +00:00
Andrew Stoltz	2cc91b6df0	runners: bump tts-reader memory limit 4Gi -> 8Gi The github-runner-tts-reader pod was being OOMKilled (exit 137) mid-`dotnet test` on the TtsReader 1000+ test suite. PR #21 CI (the Windows -> Linux runner migration) flapped twice with the "self-hosted runner lost communication" annotation before the K8s-side symptoms surfaced via kubectl describe pod. Requests bumped 1Gi -> 2Gi, limits 4Gi -> 8Gi. Comment added inline so future fleet runs don't trip the same wall. Unblocks PR #21 + the 9 other open TtsReader PRs that all rebase through it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 22:31:48 -05:00
bluejay	0d2090fe81	runners: add github-runner-updater Deployment (#29) Close runner-fleet gap for FlowerCore.Updater. Matches Sprint 32 long-tail pattern; registers entry in fleet-lint required-set.	2026-05-26 03:24:13 +00:00
Andrew Stoltz	bc3548e715	runners: add github-runner-pimanager Deployment FlowerCore.PiManager build run 26417714843 sat queued 5h with zero self-hosted runners registered to the repo. PiManager was missed in the Sprint 32 long-tail sweep — every other FC repo got a dedicated repo-scoped Deployment with its own ACCESS_TOKEN registration, but PiManager fell through the cracks. Adds a 2-replica ephemeral runner Deployment matching the Signage / DMS / Print.Web pattern (per-pod emptyDir caches, no shared PVC, labels `self-hosted,linux,fc-build-linux`, shared github-runner-token PAT). Once ArgoCD syncs, the queued job will pick up automatically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 20:33:44 -05:00
bluejay	74333cc26b	selenium: right-size hub + chrome + edge memory limits (#28)	2026-05-26 01:12:15 +00:00
Andrew Stoltz	7310fb88c2	selenium: right-size hub + chrome + edge memory limits Edge node has been OOMKilled 51 times in 5 days (~1 every 2.4h) on a 1Gi memory limit. Chrome runs maxSessions=2 on the same 1Gi cap and was idling at 684Mi — first concurrent session pushing the node to ~900Mi+ would be the next OOM. Hub was running at 766Mi against a 1Gi limit (75%); no recent restarts but no headroom either. Firefox node has been running at 2Gi memory limit for 9 days with zero restarts — that is the right size for a Selenium 4.27 browser node under our session profile (screen recording sidecar + 1080p rendering + page captures). Match it. Changes: - Hub: limit 1Gi -> 1.5Gi, request 512Mi -> 1Gi - Chrome: limit 1Gi -> 2Gi, request 512Mi -> 1Gi - Edge: limit 1Gi -> 2Gi, request 512Mi -> 1Gi CPU left alone on all three — observed utilization is well under the existing limits (hub 54m / 500m, chrome 185m / 1, edge 11m / 1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 20:11:41 -05:00
bluejay	148bc87b9a	runners: bake step-ca root CA into image (v20260525-stepca) (#27)	2026-05-26 01:04:14 +00:00
Andrew Stoltz	2a1e842100	runners: bake step-ca root CA into image (v20260525-stepca) Without the IAmWorkin step-ca root CA in the runner image's system trust store, .NET HttpClient calls from CI tests against `*.iamworkin.lan` (e.g. `https://selenium.iamworkin.lan/session`) fail with `The remote certificate is invalid because of errors in the certificate chain: PartialChain`. FlowerCore.Print.Web's `WebScreenshotService` unit tests hit this on every build. Drop the step-ca root PEM into `/usr/local/share/ca-certificates/`, run `update-ca-certificates` once during apt install, and let OpenSSL + .NET-on-Linux read the regenerated `/etc/ssl/certs/ca-certificates.crt` automatically — no `SSL_CERT_FILE` env var, no per-Deployment volume mount. Image rebuilt + saved + imported on all 3 schedulable RKE2 nodes (rke2-server, rke2-agent1, rke2-agent2) before this PR — verified with `ctr images list -q | grep stepca` on each node. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 19:55:38 -05:00
bluejay	bc28430d24	selenium: allow github-runner namespace ingress on 4444 (#26)	2026-05-26 00:44:23 +00:00