Compare commits

..

7 Commits

Author SHA1 Message Date
1f1f6823db runners: right-size replica counts per 14d CI activity (#24) 2026-05-26 00:01:47 +00:00
Andrew Stoltz
b92f74b63a runners: right-size replica counts per 14d CI activity data
Drop 2 → 1 for 10 deploys based on trailing-14d run counts:
  - LlmBridge, Media, Knowledge, Intranet.Web, DNS  (0 runs each)
  - Presentations (6), Redis (3), Provisioning (3),
    MessageBoard (3), MenuBoard (3)

Bump 2 → 3 for Print.Web: 12 runs in trailing 5d, and the
help-screenshots AAT job holds a runner 30+ min, creating
head-of-line blocking for parallel PRs.

Net change: -9 replicas (≈ -9 GiB committed memory).
Aligns with Sprint 33 morning-routine capacity audit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 18:55:47 -05:00
Andrew Stoltz
cb7f7dbc4d authentik: generous startup/liveness probes for first-boot migration
The server pod was getting killed by liveness probe at 60s while still
waiting on migration DB lock (worker pod also running migrations against
same DB). Add startupProbe with 10.5 min budget so liveness doesn't fire
until migrations finish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 16:03:03 -05:00
Andrew Stoltz
03126d5584 authentik: add fsGroup:1000 to server + worker so non-root uid can write /media
PermissionError: [Errno 13] Permission denied: '/media/public' in tenant_files
migration because Authentik container runs as uid 1000 but Longhorn PVC mounts
root:root by default. fsGroup on Pod securityContext recursively chgrps the
PVC mount to gid 1000 + chmods g+rwx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 15:58:35 -05:00
Andrew Stoltz
495e884c41 authentik: initial deployment at id.iamworkin.lan
Stack:
  - PostgreSQL 16 StatefulSet (Longhorn RWO 5Gi)
  - Redis 7 Deployment (no persistence)
  - Authentik server + worker (ghcr.io/goauthentik/server:2024.12.3)
  - Shared media PVC (Longhorn RWO 2Gi) between server+worker
  - Certificate via step-ca-acme ClusterIssuer
  - Traefik IngressRoute at id.iamworkin.lan

Secrets sourced from 1Password item 'authentik-credentials' (IAmWorkin
vault, id y6i74ch22q5wvm7znquq4nhhcu) via OnePasswordItem CRD. Fields:
AUTHENTIK_SECRET_KEY, POSTGRES_PASSWORD, REDIS_PASSWORD,
BOOTSTRAP_ADMIN_PASSWORD, BOOTSTRAP_ADMIN_TOKEN, BOOTSTRAP_ADMIN_EMAIL.

DNS A record id.iamworkin.lan -> 10.0.56.200 added via
scripts/pfsense-add-id-host.py (FlowerCore.DNS service was 502'ing on
pfSense diag_command.php response parsing).

Closes the immediate gap from PiManager OIDC Cohort 3 wire-up: PiManager
(a87cd6f) configures id.iamworkin.lan as JWT authority but the backend
was never deployed. Pirelay specifically is on Mode:apikey until this
backend is bootstrapped and a pimanager service-account exists.

Post-deploy bootstrap (manual once pods Ready):
  1. Login at https://id.iamworkin.lan/if/admin/ as akadmin
     using BOOTSTRAP_ADMIN_PASSWORD from 1Password.
  2. Create OAuth2/OpenID Provider for pimanager (issuer
     https://id.iamworkin.lan/application/o/pimanager/, audience 'pimanager').
  3. Create Application binding the provider.
  4. Create service account user 'pimanager-service-account', generate
     long-lived token, store in 1Password as 'pimanager-service-account'.
  5. Re-enable jwt mode on pirelay + un-mask puppet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 15:50:10 -05:00
Andrew Stoltz
65aa1e6104 fix(monitoring): point probe-printweb at /health (Q-MR-90)
Root path requires API key auth — `/` returned 401 to the blackbox
probe, firing PrintWebDown despite `/health` reporting Healthy.
Pattern: feedback_k8s_probes_behind_auth_middleware.

Mirrors FlowerCore.Notes scripts/monitoring/prometheus.yml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:52:02 -05:00
Andrew Stoltz
7f2a3b76b4 feat(github-runner): bake Ruby 3.3 into Linux self-hosted runner image (Q-MR-81) 2026-05-20 11:45:43 -05:00
10 changed files with 871 additions and 510 deletions

View File

@@ -0,0 +1,448 @@
# Authentik OIDC backend
# ArgoCD-managed. BlueJay Lab.
#
# Stack:
# - PostgreSQL 16 StatefulSet (single replica, Longhorn RWO 5Gi)
# - Redis 7 Deployment (no persistence — session/cache only)
# - Authentik server + worker Deployments (image ghcr.io/goauthentik/server:2024.12.3)
# - Media PVC shared between server + worker (Longhorn RWO 2Gi)
# - Certificate via step-ca-acme ClusterIssuer
# - Traefik IngressRoute at id.iamworkin.lan
#
# Secrets come from 1Password item "authentik-credentials" (IAmWorkin vault, id y6i74ch22q5wvm7znquq4nhhcu)
# via the OnePasswordItem CRD, materialized into k8s Secret authentik/authentik-credentials.
#
# Why the discovery URL is /application/o/pimanager/ : Authentik issues per-application OIDC providers.
# The pimanager OIDC application/provider is created after the cluster pods are healthy (manual or
# via API once the bootstrap token is available — see Notes substrate).
---
apiVersion: v1
kind: Namespace
metadata:
name: authentik
labels:
app.kubernetes.io/part-of: bluejay-infra
---
# 1Password operator pulls the authentik-credentials item into a k8s Secret of the same name.
# Field labels in 1P become Secret keys: AUTHENTIK_SECRET_KEY, POSTGRES_PASSWORD, REDIS_PASSWORD,
# BOOTSTRAP_ADMIN_PASSWORD, BOOTSTRAP_ADMIN_TOKEN, BOOTSTRAP_ADMIN_EMAIL.
apiVersion: onepassword.com/v1
kind: OnePasswordItem
metadata:
name: authentik-credentials
namespace: authentik
spec:
itemPath: "vaults/IAmWorkin/items/authentik-credentials"
---
# Shared media volume for server + worker pods.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: authentik-media
namespace: authentik
spec:
storageClassName: longhorn
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 2Gi
---
# PostgreSQL 16 StatefulSet — Authentik's primary store.
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: authentik-postgres
namespace: authentik
labels:
app: authentik-postgres
argocd.argoproj.io/instance: infra-authentik
spec:
persistentVolumeClaimRetentionPolicy:
whenDeleted: Retain
whenScaled: Retain
podManagementPolicy: OrderedReady
serviceName: authentik-postgres
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: authentik-postgres
template:
metadata:
labels:
app: authentik-postgres
spec:
containers:
- name: postgres
image: postgres:16-alpine
ports:
- containerPort: 5432
name: postgres
env:
- name: POSTGRES_USER
value: authentik
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: authentik-credentials
key: POSTGRES_PASSWORD
- name: POSTGRES_DB
value: authentik
- name: POSTGRES_INITDB_ARGS
value: "--encoding=UTF-8 --lc-collate=C --lc-ctype=C"
- name: PGDATA
value: /var/lib/postgresql/data/pgdata
readinessProbe:
exec:
command: ["pg_isready", "-U", "authentik"]
initialDelaySeconds: 5
periodSeconds: 5
livenessProbe:
exec:
command: ["pg_isready", "-U", "authentik"]
initialDelaySeconds: 30
periodSeconds: 30
resources:
requests: { cpu: 100m, memory: 256Mi }
limits: { cpu: 1000m, memory: 1Gi }
volumeMounts:
- name: pgdata
mountPath: /var/lib/postgresql/data
volumeClaimTemplates:
- metadata:
name: pgdata
spec:
storageClassName: longhorn
accessModes: [ReadWriteOnce]
volumeMode: Filesystem
resources:
requests:
storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
name: authentik-postgres
namespace: authentik
spec:
clusterIP: None
selector:
app: authentik-postgres
ports:
- name: postgres
port: 5432
targetPort: 5432
---
# Redis 7 — session storage + Celery broker. No persistence needed (cache).
apiVersion: apps/v1
kind: Deployment
metadata:
name: authentik-redis
namespace: authentik
labels:
app: authentik-redis
argocd.argoproj.io/instance: infra-authentik
spec:
replicas: 1
strategy:
type: Recreate
selector:
matchLabels:
app: authentik-redis
template:
metadata:
labels:
app: authentik-redis
spec:
containers:
- name: redis
image: redis:7-alpine
args:
- "--save"
- ""
- "--appendonly"
- "no"
- "--requirepass"
- "$(REDIS_PASSWORD)"
env:
- name: REDIS_PASSWORD
valueFrom:
secretKeyRef:
name: authentik-credentials
key: REDIS_PASSWORD
ports:
- containerPort: 6379
name: redis
readinessProbe:
tcpSocket: { port: 6379 }
initialDelaySeconds: 5
periodSeconds: 5
livenessProbe:
tcpSocket: { port: 6379 }
initialDelaySeconds: 30
periodSeconds: 30
resources:
requests: { cpu: 50m, memory: 64Mi }
limits: { cpu: 500m, memory: 256Mi }
---
apiVersion: v1
kind: Service
metadata:
name: authentik-redis
namespace: authentik
spec:
selector:
app: authentik-redis
ports:
- name: redis
port: 6379
targetPort: 6379
---
# Authentik server Deployment — HTTP frontend on :9000.
apiVersion: apps/v1
kind: Deployment
metadata:
name: authentik-server
namespace: authentik
labels:
app: authentik-server
argocd.argoproj.io/instance: infra-authentik
spec:
replicas: 1
strategy:
type: Recreate # shares /media RWO PVC with worker
selector:
matchLabels:
app: authentik-server
template:
metadata:
labels:
app: authentik-server
spec:
securityContext:
# Authentik image runs as uid 1000 "authentik" but the Longhorn PVC mounts
# root:root by default. fsGroup recursively chgrp + chmod g+rwx so the
# non-root container can mkdir /media/public during the tenant_files migration.
fsGroup: 1000
containers:
- name: server
image: ghcr.io/goauthentik/server:2024.12.3
args: ["server"]
ports:
- containerPort: 9000
name: http
- containerPort: 9443
name: https
env:
- name: AUTHENTIK_SECRET_KEY
valueFrom:
secretKeyRef:
name: authentik-credentials
key: AUTHENTIK_SECRET_KEY
- name: AUTHENTIK_REDIS__HOST
value: authentik-redis
- name: AUTHENTIK_REDIS__PASSWORD
valueFrom:
secretKeyRef:
name: authentik-credentials
key: REDIS_PASSWORD
- name: AUTHENTIK_POSTGRESQL__HOST
value: authentik-postgres
- name: AUTHENTIK_POSTGRESQL__NAME
value: authentik
- name: AUTHENTIK_POSTGRESQL__USER
value: authentik
- name: AUTHENTIK_POSTGRESQL__PASSWORD
valueFrom:
secretKeyRef:
name: authentik-credentials
key: POSTGRES_PASSWORD
- name: AUTHENTIK_BOOTSTRAP_PASSWORD
valueFrom:
secretKeyRef:
name: authentik-credentials
key: BOOTSTRAP_ADMIN_PASSWORD
- name: AUTHENTIK_BOOTSTRAP_TOKEN
valueFrom:
secretKeyRef:
name: authentik-credentials
key: BOOTSTRAP_ADMIN_TOKEN
- name: AUTHENTIK_BOOTSTRAP_EMAIL
valueFrom:
secretKeyRef:
name: authentik-credentials
key: BOOTSTRAP_ADMIN_EMAIL
- name: AUTHENTIK_DISABLE_UPDATE_CHECK
value: "true"
- name: AUTHENTIK_ERROR_REPORTING__ENABLED
value: "false"
- name: AUTHENTIK_LOG_LEVEL
value: info
# First-boot Authentik can take 3+ min on the migration phase
# (waiting on DB lock while worker also runs migrations). Initial
# delays are generous so kubelet doesn't kill the pod mid-migration;
# periodSeconds keeps post-startup probing responsive.
readinessProbe:
httpGet:
path: /-/health/ready/
port: 9000
initialDelaySeconds: 60
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 12
livenessProbe:
httpGet:
path: /-/health/live/
port: 9000
initialDelaySeconds: 300
periodSeconds: 30
timeoutSeconds: 10
failureThreshold: 3
startupProbe:
httpGet:
path: /-/health/live/
port: 9000
initialDelaySeconds: 30
periodSeconds: 15
timeoutSeconds: 10
failureThreshold: 40 # 30s + 40*15s = 10.5 min budget
resources:
requests: { cpu: 150m, memory: 512Mi }
limits: { cpu: 1500m, memory: 1Gi }
volumeMounts:
- name: media
mountPath: /media
volumes:
- name: media
persistentVolumeClaim:
claimName: authentik-media
---
# Authentik worker Deployment — runs Celery background tasks.
apiVersion: apps/v1
kind: Deployment
metadata:
name: authentik-worker
namespace: authentik
labels:
app: authentik-worker
argocd.argoproj.io/instance: infra-authentik
spec:
replicas: 1
strategy:
type: Recreate # shares /media RWO PVC with server
selector:
matchLabels:
app: authentik-worker
template:
metadata:
labels:
app: authentik-worker
spec:
securityContext:
# Same as server pod — non-root uid 1000 needs PVC group write.
fsGroup: 1000
containers:
- name: worker
image: ghcr.io/goauthentik/server:2024.12.3
args: ["worker"]
env:
- name: AUTHENTIK_SECRET_KEY
valueFrom:
secretKeyRef:
name: authentik-credentials
key: AUTHENTIK_SECRET_KEY
- name: AUTHENTIK_REDIS__HOST
value: authentik-redis
- name: AUTHENTIK_REDIS__PASSWORD
valueFrom:
secretKeyRef:
name: authentik-credentials
key: REDIS_PASSWORD
- name: AUTHENTIK_POSTGRESQL__HOST
value: authentik-postgres
- name: AUTHENTIK_POSTGRESQL__NAME
value: authentik
- name: AUTHENTIK_POSTGRESQL__USER
value: authentik
- name: AUTHENTIK_POSTGRESQL__PASSWORD
valueFrom:
secretKeyRef:
name: authentik-credentials
key: POSTGRES_PASSWORD
- name: AUTHENTIK_DISABLE_UPDATE_CHECK
value: "true"
- name: AUTHENTIK_ERROR_REPORTING__ENABLED
value: "false"
- name: AUTHENTIK_LOG_LEVEL
value: info
resources:
requests: { cpu: 100m, memory: 256Mi }
limits: { cpu: 1000m, memory: 768Mi }
volumeMounts:
- name: media
mountPath: /media
volumes:
- name: media
persistentVolumeClaim:
claimName: authentik-media
---
apiVersion: v1
kind: Service
metadata:
name: authentik-server
namespace: authentik
spec:
selector:
app: authentik-server
ports:
- name: http
port: 9000
targetPort: 9000
- name: https
port: 9443
targetPort: 9443
---
# step-ca leaf certificate for id.iamworkin.lan.
# step-ca container resolver uses pfSense Unbound, so the public A record for id.iamworkin.lan
# MUST exist before this Certificate is applied (cert-manager HTTP-01 will silently 2h-backoff
# otherwise). Added 2026-05-25 via scripts/pfsense-add-id-host.py.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: authentik-tls
namespace: authentik
spec:
secretName: authentik-tls
dnsNames:
- id.iamworkin.lan
issuerRef:
name: step-ca-acme
kind: ClusterIssuer
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: authentik
namespace: authentik
spec:
entryPoints: [websecure]
routes:
- match: Host(`id.iamworkin.lan`)
kind: Rule
services:
- name: authentik-server
port: 9000
tls:
secretName: authentik-tls

View File

@@ -1,263 +0,0 @@
# fc-build-windows runner gate
Status: OPEN-WITH-OPERATOR-ACTION as of 2026-05-20.
This directory is intentionally not a live runner deployment. It records the
exact gate for bringing up the Windows self-hosted runner fleet without faking
capacity in GitHub or Kubernetes.
## Lane evidence
- `D:\git\FlowerCore\FlowerCore.Notes\docs\dashboards\decisions-waiting.html`
lines 15078-15085: Q-MR-82 says the Updater Windows Sandbox E2E run is
queued and `bluejay-ws-sandbox-1` is offline.
- `D:\git\FlowerCore\FlowerCore.Notes\memory\project_morning_routine_8_2026_05_20.md`:
Morning Routine #8 carries Q-MR-82 as the fleet-wide Windows runner gap.
- `D:\git\FlowerCore\FlowerCore.Notes\docs\standards\sprint-37-codex-dispatch-log-2026-05-19.md`
lines 76, 84-85, and 97: keep BLUEJAY-WS out of runner plans, merge Linux
runner expansion separately, and keep true Windows-only workflows parked on
the Windows runner host substrate path.
- `D:\git\FlowerCore\FlowerCore.Notes\docs\ai-agents\codex-prompts\2026-05-20-xxxxl-sprint-42-orchestrator-briefs.md`
lane Cx-5: land a deployment only if a Windows runner image/substrate is
ready; otherwise commit an operator-action gate.
- `D:\git\FlowerCore\FlowerCore.Notes\memory\feedback_bluejay_ws_never_a_github_runner.md`:
BLUEJAY-WS is operator-only territory; Windows runners belong on a dedicated
KubeVirt Windows VM such as `ci1` or a sibling VM.
## Live probe summary
Commands run on 2026-05-20 from `D:\git\FlowerCore\bluejay-infra`:
```powershell
$env:KUBECONFIG="$env:USERPROFILE\.kube\rke2.yaml"
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"`t"}{.metadata.labels.kubernetes\.io/os}{"`n"}{end}'
```
Result: `rke2-agent1`, `rke2-agent2`, and `rke2-server` all report
`kubernetes.io/os=linux`. There is no Windows Kubernetes node, so Windows
containers on RKE2 cannot satisfy `fc-build-windows`.
```powershell
kubectl -n kubevirt-vms get vm,vmi,pods -o wide
```
Result: KubeVirt is healthy and `ci1` is `Running` / `Ready=True` on
`rke2-agent1` with VMI IP `10.42.103.35`.
```powershell
virtctl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml port-forward vm/ci1.kubevirt-vms 15985:5985
```
Result during port tests: `dial tcp 10.42.103.35:5985: connect: no route to
host`. The same result was seen for RDP 3389 and SSH 22. The VM exists, but it
is not remotely reachable for runner bootstrap from this lane.
```powershell
gh api /repos/astoltz/FlowerCore.Updater/actions/runners `
--jq '.runners[]? | {name,status,busy,labels:[.labels[].name]}'
gh run list --repo astoltz/FlowerCore.Updater `
--workflow "Updater Windows Sandbox E2E" --limit 5
```
Result: GitHub has one Updater runner, `bluejay-ws-sandbox-1`, with
`status=offline`; run `26150689447` is still `queued`.
## Feasibility classification
### Option A: Windows containers on RKE2
Not feasible without operator-physical infrastructure work. Kubernetes Windows
containers require a Windows node. The current cluster has Linux-only RKE2
nodes.
### Option B: KubeVirt Windows VM
Partially present, not deployable from this lane.
`apps/kubevirt-vms/ci1.yaml` already defines a Windows Server 2025 KubeVirt VM
using `localhost/fc-win-server-2025:v1`, and the live VM is running. However:
- the guest is not reachable over RDP, WinRM, or SSH through `virtctl
port-forward`;
- the current root disk is a `containerDisk`, so runner installation inside the
running guest is not a durable fleet state unless the first-boot automation
re-registers on every boot or the VM is moved to a persistent PVC-backed
disk;
- FC.Updater `Updater Windows Sandbox E2E` uses
`[self-hosted, windows, windows-sandbox]`, while `fc-build-windows` build jobs
use `[self-hosted, windows, fc-build-windows]`. Do not advertise
`windows-sandbox` until Windows Sandbox has been proven in the guest.
### Option C: bluejay-ws-sandbox-1
Operator-only emergency fallback. GitHub shows it registered but offline. The
current memory says BLUEJAY-WS must not be a fleet runner host, so this lane
does not start or re-register it. If the operator deliberately overrides the
policy to drain an emergency queue, start the existing visible runner console
from the BLUEJAY-WS desktop and treat that as temporary break-glass, not the
permanent Q-MR-82 closure.
## Operator action plan
### 1. Pick the Windows host class
Use `ci1` or a sibling Windows Server 2025 VM for WPF build/test jobs that need
`fc-build-windows`.
Use a Windows 11 Pro/Enterprise KubeVirt VM for Updater or WorldBuilder
Windows Sandbox gates, unless Windows Sandbox support is explicitly proven on
the selected guest. The workflow labels must match the real capability:
- WPF build runner: `self-hosted,windows,fc-build-windows,ci1`
- Sandbox runner: `self-hosted,windows,windows-sandbox,ci-sandbox1`
### 2. Make the VM reachable and durable
From BLUEJAY-WS:
```powershell
$env:KUBECONFIG="$env:USERPROFILE\.kube\rke2.yaml"
kubectl -n kubevirt-vms get vm,vmi,pods -o wide
virtctl --kubeconfig $env:KUBECONFIG vnc ci1 -n kubevirt-vms
virtctl --kubeconfig $env:KUBECONFIG port-forward vm/ci1.kubevirt-vms 13389:3389
virtctl --kubeconfig $env:KUBECONFIG port-forward vm/ci1.kubevirt-vms 15985:5985
```
Before runner registration, fix the current port-forward failure. The expected
state is that RDP or WinRM accepts a connection through the control plane.
For durability, either:
- move the runner VM to a persistent PVC-backed root disk; or
- keep `containerDisk` and bake first-boot runner registration into the sysprep
flow using a non-expiring credential lookup path.
Do not install a runner by hand into a transient VM and call Q-MR-82 closed.
### 3. Install runner prerequisites inside the VM
Run in an elevated PowerShell session in the Windows runner guest:
```powershell
winget install Microsoft.DotNet.SDK.10 --silent
winget install Microsoft.DotNet.DesktopRuntime.8 --silent
winget install Microsoft.PowerShell --silent
winget install Git.Git --silent
winget install Microsoft.VisualStudio.2022.BuildTools --silent
winget install Google.Chrome --silent
```
For a Sandbox-capable runner only:
```powershell
Enable-WindowsOptionalFeature -Online -FeatureName Containers-DisposableClientVM -All
Restart-Computer -Force
```
After reboot:
```powershell
Get-CimInstance -ClassName Win32_OptionalFeature -Filter "Name='Containers-DisposableClientVM'"
Test-Path C:\Windows\System32\WindowsSandbox.exe
```
### 4. Register repo-scoped GitHub runners
The `astoltz` account uses repo-scoped runners. Generate a fresh one-hour
registration token per repo immediately before `config.cmd`.
From a trusted operator shell with `gh` authenticated:
```powershell
$repos = @(
"FlowerCore.Updater",
"FlowerCore.WorldBuilder",
"FlowerCore.DeviceManagement"
)
foreach ($repo in $repos) {
$token = gh api -X POST "/repos/astoltz/$repo/actions/runners/registration-token" --jq .token
$repoSlug = $repo.ToLowerInvariant().Replace("flowercore.", "").Replace(".", "-")
$runnerDir = "C:\fc-ghr\$repoSlug-fc-build-windows"
New-Item -ItemType Directory -Force -Path $runnerDir | Out-Null
Set-Location $runnerDir
if (-not (Test-Path ".\config.cmd")) {
Invoke-WebRequest `
-Uri "https://github.com/actions/runner/releases/download/v2.323.0/actions-runner-win-x64-2.323.0.zip" `
-OutFile "actions-runner.zip"
Add-Type -AssemblyName System.IO.Compression.FileSystem
[System.IO.Compression.ZipFile]::ExtractToDirectory((Resolve-Path actions-runner.zip), $runnerDir)
}
.\config.cmd `
--url "https://github.com/astoltz/$repo" `
--token $token `
--name "ci1-$repoSlug-fc-build-windows" `
--labels "self-hosted,windows,fc-build-windows,ci1" `
--work "_work" `
--unattended `
--replace
.\svc.ps1 install
.\svc.ps1 start
}
```
For Updater Sandbox E2E, register only after the guest proves Sandbox support,
and use `windows-sandbox` labels:
```powershell
$token = gh api -X POST "/repos/astoltz/FlowerCore.Updater/actions/runners/registration-token" --jq .token
.\config.cmd `
--url "https://github.com/astoltz/FlowerCore.Updater" `
--token $token `
--name "ci-sandbox1-updater" `
--labels "self-hosted,windows,windows-sandbox,ci-sandbox1" `
--work "_work" `
--unattended `
--replace
```
Keep registration tokens out of Git and logs. The durable credential source for
automation should be the existing 1Password item named `GitHub PAT (Runner
Registration)`, used only to mint short-lived repo registration tokens.
### 5. Verify GitHub and workflow pickup
```powershell
gh api /repos/astoltz/FlowerCore.Updater/actions/runners `
--jq '.runners[] | select(.labels[].name == "windows-sandbox") | {name,status,busy,labels:[.labels[].name]}'
gh api /repos/astoltz/FlowerCore.DeviceManagement/actions/runners `
--jq '.runners[] | select(.labels[].name == "fc-build-windows") | {name,status,busy,labels:[.labels[].name]}'
gh run list --repo astoltz/FlowerCore.Updater `
--workflow "Updater Windows Sandbox E2E" --limit 3
```
Q-MR-82 can be marked resolved only after the Updater run moves from `queued` to
`in_progress` or `completed` on an online runner, or after the affected WPF
build repos show online `fc-build-windows` repo-scoped runners and their queued
jobs start.
## Break-glass BLUEJAY-WS command
Only if the operator explicitly overrides the "BLUEJAY-WS is not a runner"
policy to drain a queue:
```powershell
Set-Location C:\fc-ghr\updater-sandbox
.\run.cmd
```
If a Windows service exists:
```powershell
Get-Service 'actions.runner.*'
Start-Service 'actions.runner.*'
```
This does not close Q-MR-82 permanently. It is a temporary queue drain until a
dedicated VM runner is online.

View File

@@ -1,4 +0,0 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- operator-gate-configmap.yaml

View File

@@ -1,61 +0,0 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: fc-build-windows-operator-gate
namespace: kubevirt-vms
labels:
app.kubernetes.io/name: fc-build-windows
app.kubernetes.io/component: operator-gate
app.kubernetes.io/part-of: github-runner
flowercore.io/q-card: Q-MR-82
annotations:
flowercore.io/outcome: OPEN-WITH-OPERATOR-ACTION
flowercore.io/live-runner: "false"
data:
outcome: OPEN-WITH-OPERATOR-ACTION
gate.md: |
Do not treat this ConfigMap as runner capacity.
Current probe, 2026-05-20:
- RKE2 nodes are linux-only; Windows containers require a Windows node.
- KubeVirt `ci1` is Running/Ready, but RDP 3389, WinRM 5985, and SSH 22
through `virtctl port-forward` return `connect: no route to host`.
- GitHub Updater runner list has only `bluejay-ws-sandbox-1`, status
offline. Updater Windows Sandbox E2E run 26150689447 remains queued.
Required operator action:
1. Make a dedicated Windows VM reachable and durable.
2. Install .NET 10 SDK, .NET 8 Desktop Runtime, Git, VS Build Tools, and
PowerShell 7.
3. Register repo-scoped runners with short-lived GitHub registration tokens.
4. Add `fc-build-windows` labels only to WPF build-capable guests.
5. Add `windows-sandbox` labels only after Sandbox support is proven.
registration-token-pattern.ps1: |
$repo = "FlowerCore.Updater"
$token = gh api -X POST "/repos/astoltz/$repo/actions/runners/registration-token" --jq .token
$runnerDir = "C:\fc-ghr\updater-fc-build-windows"
New-Item -ItemType Directory -Force -Path $runnerDir | Out-Null
Set-Location $runnerDir
# Install the Actions runner package here if config.cmd is absent.
.\config.cmd `
--url "https://github.com/astoltz/$repo" `
--token $token `
--name "ci1-updater-fc-build-windows" `
--labels "self-hosted,windows,fc-build-windows,ci1" `
--work "_work" `
--unattended `
--replace
.\svc.ps1 install
.\svc.ps1 start
verification.ps1: |
gh api /repos/astoltz/FlowerCore.Updater/actions/runners `
--jq '.runners[] | {name,status,busy,labels:[.labels[].name]}'
gh run list --repo astoltz/FlowerCore.Updater `
--workflow "Updater Windows Sandbox E2E" --limit 3
$env:KUBECONFIG="$env:USERPROFILE\.kube\rke2.yaml"
kubectl -n kubevirt-vms get vm,vmi,pods -o wide

2
apps/github-runner/.gitattributes vendored Normal file
View File

@@ -0,0 +1,2 @@
*.sh text eol=lf
Dockerfile text eol=lf

View File

@@ -0,0 +1,44 @@
FROM myoung34/github-runner:latest
ARG RUBY_VERSION=3.3.11
ARG RUBY_MINOR=3.3
ARG RUBY_BUILD_VERSION=v20260326
ARG RUNNER_UID=1001
ARG RUNNER_GID=1001
ENV RUNNER_TOOL_CACHE=/home/runner/_tool
ENV RUNNER_RUBY_TOOLCACHE=/opt/runner-toolcache
ENV PATH="/home/runner/_tool/Ruby/${RUBY_MINOR}/x64/bin:/opt/runner-toolcache/Ruby/${RUBY_MINOR}/x64/bin:${PATH}"
USER root
RUN apt-get update \
&& DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
autoconf \
bison \
build-essential \
ca-certificates \
curl \
libdb-dev \
libffi-dev \
libgdbm-dev \
libgmp-dev \
libncurses-dev \
libreadline-dev \
libssl-dev \
libyaml-dev \
patch \
pkg-config \
uuid-dev \
zlib1g-dev \
&& curl -fsSL "https://github.com/rbenv/ruby-build/archive/refs/tags/${RUBY_BUILD_VERSION}.tar.gz" -o /tmp/ruby-build.tar.gz \
&& mkdir -p /tmp/ruby-build \
&& tar -xzf /tmp/ruby-build.tar.gz --strip-components=1 -C /tmp/ruby-build \
&& /tmp/ruby-build/install.sh \
&& rm -rf /tmp/ruby-build /tmp/ruby-build.tar.gz /var/lib/apt/lists/*
COPY install-ruby-toolcache.sh /usr/local/bin/install-ruby-toolcache.sh
RUN chmod +x /usr/local/bin/install-ruby-toolcache.sh \
&& RUBY_VERSION="${RUBY_VERSION}" RUBY_MINOR="${RUBY_MINOR}" TOOLCACHE_ROOT="${RUNNER_RUBY_TOOLCACHE}" RUNNER_UID="${RUNNER_UID}" RUNNER_GID="${RUNNER_GID}" /usr/local/bin/install-ruby-toolcache.sh \
&& ruby -v

View File

@@ -7,12 +7,17 @@ Deployments with `kubectl`; update this manifest and let ArgoCD reconcile.
All repo-scoped Linux runners use: All repo-scoped Linux runners use:
- `localhost/fc-github-runner:v20260520-ruby3.3.11`, derived from
`myoung34/github-runner:latest`
- `ACCESS_TOKEN` from the `github-runner-token` Secret - `ACCESS_TOKEN` from the `github-runner-token` Secret
- `RUN_AS_ROOT=false` - `RUN_AS_ROOT=false`
- `EPHEMERAL=true` - `EPHEMERAL=true`
- `LABELS=self-hosted,linux,fc-build-linux` - `LABELS=self-hosted,linux,fc-build-linux`
- writable non-root paths under `/home/runner` for .NET, NuGet, XDG cache, and - writable non-root paths under `/home/runner` for .NET, NuGet, XDG cache, and
Actions tool cache Actions tool cache
- Ruby 3.3.11 seeded into `/home/runner/_tool/Ruby/3.3/x64` from the baked
`/opt/runner-toolcache` copy so `ruby/setup-ruby@v1` can discover it on
self-hosted `ubuntu-20.04-x64` runners
`github-runner` for `FlowerCore.Common` is single-replica because it retains the `github-runner` for `FlowerCore.Common` is single-replica because it retains the
original Longhorn ReadWriteOnce NuGet PVC. Every other repo-scoped runner uses original Longhorn ReadWriteOnce NuGet PVC. Every other repo-scoped runner uses
@@ -28,6 +33,34 @@ Sprint 32 final long-tail wave adds 16 two-replica Deployments:
`FlowerCore.Provisioning`, `FlowerCore.Redis`, `FlowerCore.MessageBoard`, and `FlowerCore.Provisioning`, `FlowerCore.Redis`, `FlowerCore.MessageBoard`, and
`FlowerCore.MenuBoard`. `FlowerCore.MenuBoard`.
## Image Build
Ruby is baked with a pinned `ruby-build` release and Ruby patch version. The pod
still mounts an `emptyDir` over `/home/runner`, so the `setup-runner-home` init
container copies the baked toolcache from `/opt/runner-toolcache/Ruby` into
`/home/runner/_tool/Ruby` before the runner container starts.
```bash
cd apps/github-runner
podman build -t localhost/fc-github-runner:v20260520-ruby3.3.11 .
podman run --rm localhost/fc-github-runner:v20260520-ruby3.3.11 ruby -v
podman run --rm localhost/fc-github-runner:v20260520-ruby3.3.11 \
test -f /opt/runner-toolcache/Ruby/3.3/x64.complete
podman save localhost/fc-github-runner:v20260520-ruby3.3.11 \
-o fc-github-runner-v20260520-ruby3.3.11.tar
```
Import the saved image on every schedulable RKE2 node before ArgoCD rolls the
Deployments:
```bash
for node in rke2-server rke2-agent1 rke2-agent2; do
scp fc-github-runner-v20260520-ruby3.3.11.tar "$node:/tmp/"
ssh "$node" 'sudo ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images rm localhost/fc-github-runner:v20260520-ruby3.3.11 || true'
ssh "$node" 'sudo ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images import /tmp/fc-github-runner-v20260520-ruby3.3.11.tar'
done
```
## Post-Merge Proof ## Post-Merge Proof
After the PR is merged and ArgoCD syncs, verify the runner fleet: After the PR is merged and ArgoCD syncs, verify the runner fleet:
@@ -36,6 +69,14 @@ After the PR is merged and ArgoCD syncs, verify the runner fleet:
kubectl -n github-runner get deploy,pods,pvc kubectl -n github-runner get deploy,pods,pvc
``` ```
Verify the Ruby toolcache in a fresh pod:
```bash
kubectl -n github-runner exec deploy/github-runner-puppet -c runner -- ruby -v
kubectl -n github-runner exec deploy/github-runner-puppet -c runner -- sh -c \
'echo "$RUNNER_TOOL_CACHE" && test -f "$RUNNER_TOOL_CACHE/Ruby/3.3/x64.complete"'
```
Verify GitHub registration for the repo-scoped runners: Verify GitHub registration for the repo-scoped runners:
```bash ```bash
@@ -69,6 +110,10 @@ from GitHub Actions and verify it lands on an `rke2-linux-*` runner.
- `actions/setup-dotnet` permission error at `/usr/share/dotnet`: check that - `actions/setup-dotnet` permission error at `/usr/share/dotnet`: check that
`DOTNET_INSTALL_DIR=/home/runner/.dotnet` and related cache env vars are `DOTNET_INSTALL_DIR=/home/runner/.dotnet` and related cache env vars are
present on the runner pod. present on the runner pod.
- `ruby/setup-ruby@v1` says self-hosted runners must install Ruby in
`$RUNNER_TOOL_CACHE`: check that the init container copied
`/opt/runner-toolcache/Ruby` into `/home/runner/_tool/Ruby` and that
`/home/runner/_tool/Ruby/3.3/x64.complete` exists.
- `404` during runner registration: the fine-grained PAT is valid but missing - `404` during runner registration: the fine-grained PAT is valid but missing
repository access for that repo. Add the repo to the PAT access list; the PAT repository access for that repo. Add the repo to the PAT access list; the PAT
value does not change. value does not change.

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,19 @@
#!/usr/bin/env bash
set -euo pipefail
RUBY_VERSION="${RUBY_VERSION:-3.3.11}"
RUBY_MINOR="${RUBY_MINOR:-3.3}"
TOOLCACHE_ROOT="${TOOLCACHE_ROOT:-/opt/runner-toolcache}"
RUNNER_UID="${RUNNER_UID:-1001}"
RUNNER_GID="${RUNNER_GID:-1001}"
RUBY_PREFIX="${TOOLCACHE_ROOT}/Ruby/${RUBY_VERSION}/x64"
mkdir -p "${TOOLCACHE_ROOT}/Ruby"
RUBY_CONFIGURE_OPTS="${RUBY_CONFIGURE_OPTS:---disable-install-doc --disable-yjit}" ruby-build "${RUBY_VERSION}" "${RUBY_PREFIX}"
touch "${TOOLCACHE_ROOT}/Ruby/${RUBY_VERSION}/x64.complete"
ln -sfn "${RUBY_VERSION}" "${TOOLCACHE_ROOT}/Ruby/${RUBY_MINOR}"
"${RUBY_PREFIX}/bin/ruby" -v
chown -R "${RUNNER_UID}:${RUNNER_GID}" "${TOOLCACHE_ROOT}"
chmod -R a+rX "${TOOLCACHE_ROOT}"

View File

@@ -280,13 +280,14 @@ data:
printer_model: "NuPrint 210" printer_model: "NuPrint 210"
# Print.Web health (Blazor app on edge2:5200) # Print.Web health (Blazor app on edge2:5200)
# Target `/health` (anonymous) — root path requires API key auth and returns 401.
- job_name: "probe-printweb" - job_name: "probe-printweb"
metrics_path: /probe metrics_path: /probe
params: params:
module: [http_2xx] module: [http_2xx]
scrape_interval: 30s scrape_interval: 30s
static_configs: static_configs:
- targets: ["http://10.0.57.16:5200/"] - targets: ["http://10.0.57.16:5200/health"]
labels: labels:
instance: "print-web" instance: "print-web"
service: "print-web" service: "print-web"