bluejay-infra

Author	SHA1	Message	Date
Andrew Stoltz	90f2a86819	ops: trim load for degraded 2-node cluster (agent2 PSU dead) Scale all github-runner deployments to 1 replica and halt the ci1 KubeVirt VM. With agent2 down (failed PSU) the cluster runs on two passively-cooled NUCs; the ci1 8-vCPU VM drove agent1 to ~100C. Keep total load trimmed until replacement hardware is in place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 13:47:13 -05:00
Andrew Stoltz	2b420ce8a4	runners: fleet-wide right-size CPU requests from 500m to 100m All 33 runner Deployments now request 100m CPU instead of 500m, freeing roughly 50 idle pods × 400m = ~20 cores back to the cluster. Observed CPU usage on idle runners is ~1m via kubectl top; the 500m request was a 500× over-provision that was eating allocatable CPU and blocking new workload scheduling — WorldBuilder runner could not be scheduled even at the new 100m request because the pre-existing fleet held the cluster at 99% requested. Burst headroom preserved by limits.cpu: 2000m unchanged. TtsReader keeps its 8Gi memory limit from the 2026-05-25 OOMKill fix; only the CPU request line moves. Recreate strategy on each deployment means a brief offline window per runner during rollout; in-flight CI jobs complete on the existing container before the new spec takes effect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 10:09:24 -05:00
Andrew Stoltz	5cbc1a06b1	runners: scale DM/AiStation.Linux/WorldBuilder down to 1 replica until cluster relieved After cutting requests to 100m, 4 of 6 new pods scheduled and 2 stayed Pending — cluster CPU REQUEST utilization is 49.6 of 48 allocatable cores because the existing fleet of ~50 idle runners reserves 25.6 cores (500m × ~50) for ~50m actual use. Single-replica per new repo gets the service online without competing with in-flight CI from the rest of the fleet. When the broader fleet-wide request right-sizing pass lands (500m → 100m on all idle runners would free ~20 cores), these can be bumped back to 2 replicas if PR-CI backlog warrants it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 10:03:30 -05:00
Andrew Stoltz	9e7ee39b3a	runners: drop CPU request 500m→100m on DM/AiStation.Linux/WorldBuilder All 3 fleet nodes were at 99% CPU REQUEST allocation; the 6 new pods from the previous commit (3 deployments × 2 replicas × 500m) couldn't schedule. Idle runners actually use ~1m CPU per `kubectl top pods`; the 500m request was significantly over-provisioned. Burst headroom preserved by limits.cpu: 2000m unchanged. Follow-up: similar request right-sizing pass across the rest of the runner fleet is queued for a future morning-routine sweep; 25 cores reserved for ~50m actual use is a large slack we can reclaim cluster-wide. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 10:00:23 -05:00
Andrew Stoltz	ae030a5f33	runners: add github-runner Deployments for DeviceManagement + AiStation.Linux + WorldBuilder Morning-routine 2026-05-26: these three repos had ZERO online Linux PR-CI capacity, blocking the Sprint 37 Cx-1 Linux-CI-migration PRs (DM #20/#21/#22, AiStation.Linux #13, WorldBuilder #3/#4). Chicken-and-egg: the migration PRs need Linux runners that the migration creates. Each Deployment uses the same canonical emptyDir-only pattern as the fresh-2026-05-26 updater deployment that lives just above: - replicas: 2 (room for parallel PR-CI without head-of-line blocking) - per-pod emptyDir caches (no RWO PVC contention) - shared github-runner-token secret (existing ACCESS_TOKEN PAT has org-wide read access) - LABELS: self-hosted,linux,fc-build-linux - DOTNET_INSTALL_DIR pinned per ADR-170 family For AiStation.Linux specifically: Linux job will now pick up; the Windows job in #13 remains queued indefinitely until the Windows runner host substrate lands per Sprint 36 v2 Cl-2 / ADR-174; that's a separate arc, not this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 09:55:31 -05:00
Andrew Stoltz	2cc91b6df0	runners: bump tts-reader memory limit 4Gi -> 8Gi The github-runner-tts-reader pod was being OOMKilled (exit 137) mid-`dotnet test` on the TtsReader 1000+ test suite. PR #21 CI (the Windows -> Linux runner migration) flapped twice with the "self-hosted runner lost communication" annotation before the K8s-side symptoms surfaced via kubectl describe pod. Requests bumped 1Gi -> 2Gi, limits 4Gi -> 8Gi. Comment added inline so future fleet runs don't trip the same wall. Unblocks PR #21 + the 9 other open TtsReader PRs that all rebase through it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 22:31:48 -05:00
bluejay	0d2090fe81	runners: add github-runner-updater Deployment (#29) Close runner-fleet gap for FlowerCore.Updater. Matches Sprint 32 long-tail pattern; registers entry in fleet-lint required-set.	2026-05-26 03:24:13 +00:00
Andrew Stoltz	bc3548e715	runners: add github-runner-pimanager Deployment FlowerCore.PiManager build run 26417714843 sat queued 5h with zero self-hosted runners registered to the repo. PiManager was missed in the Sprint 32 long-tail sweep — every other FC repo got a dedicated repo-scoped Deployment with its own ACCESS_TOKEN registration, but PiManager fell through the cracks. Adds a 2-replica ephemeral runner Deployment matching the Signage / DMS / Print.Web pattern (per-pod emptyDir caches, no shared PVC, labels `self-hosted,linux,fc-build-linux`, shared github-runner-token PAT). Once ArgoCD syncs, the queued job will pick up automatically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 20:33:44 -05:00
Andrew Stoltz	2a1e842100	runners: bake step-ca root CA into image (v20260525-stepca) Without the IAmWorkin step-ca root CA in the runner image's system trust store, .NET HttpClient calls from CI tests against `*.iamworkin.lan` (e.g. `https://selenium.iamworkin.lan/session`) fail with `The remote certificate is invalid because of errors in the certificate chain: PartialChain`. FlowerCore.Print.Web's `WebScreenshotService` unit tests hit this on every build. Drop the step-ca root PEM into `/usr/local/share/ca-certificates/`, run `update-ca-certificates` once during apt install, and let OpenSSL + .NET-on-Linux read the regenerated `/etc/ssl/certs/ca-certificates.crt` automatically: no `SSL_CERT_FILE` env var, no per-Deployment volume mount. Image rebuilt + saved + imported on all 3 schedulable RKE2 nodes (rke2-server, rke2-agent1, rke2-agent2) before this PR; verified with `ctr images list -q | grep stepca` on each node. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 19:55:38 -05:00
Andrew Stoltz	b92f74b63a	runners: right-size replica counts per 14d CI activity data Drop 2 → 1 for 10 deploys based on trailing-14d run counts: - LlmBridge, Media, Knowledge, Intranet.Web, DNS (0 runs each) - Presentations (6), Redis (3), Provisioning (3), MessageBoard (3), MenuBoard (3) Bump 2 → 3 for Print.Web: 12 runs in trailing 5d, and the help-screenshots AAT job holds a runner 30+ min, creating head-of-line blocking for parallel PRs. Net change: -9 replicas (≈ -9 GiB committed memory). Aligns with Sprint 33 morning-routine capacity audit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 18:55:47 -05:00
Andrew Stoltz	7f2a3b76b4	feat(github-runner): bake Ruby 3.3 into Linux self-hosted runner image (Q-MR-81)	2026-05-20 11:45:43 -05:00
bluejay	f8fe3b2688	feat(github-runner): add final long-tail runners (#9)	2026-05-18 04:52:01 +00:00
bluejay	f2ab892ebc	feat(github-runner): add Marquee + TtsReader per-repo runners (#8)	2026-05-18 03:27:14 +00:00
Andrew Stoltz	6fe77225ae	fix(github-runner): dedupe DOTNET_INSTALL_DIR+NUGET_PACKAGES on base+sharedpos PR #5 rebase concatenated PR #5 env additions onto PR #7 env additions on the base + sharedpos Deployments, producing duplicate-key validation errors in ArgoCD's structured merge. The DOTNET_INSTALL_DIR and NUGET_PACKAGES values are identical between PR #5 and PR #7; keep the PR #7 originals and retain only the unique new env vars from PR #5 (DOTNET_CLI_TELEMETRY_OPTOUT, DOTNET_NOLOGO, DOTNET_GENERATE_ASPNET_CERTIFICATE). No behavioral change — same final env var set, no duplicates.	2026-05-17 21:53:05 -05:00
bluejay	634b9c4169	feat(github-runner): harden Linux runner fleet (#5)	2026-05-18 02:51:02 +00:00
bluejay	65ac8d6f01	feat(github-runner): pod-env DOTNET_INSTALL_DIR + initContainer for non-root runner (#7)	2026-05-18 02:25:18 +00:00
bluejay	b1e307151e	chore(github-runner): un-park github-runner-sharedpos (replicas 1) after Shared.Pos CI fix merged	2026-05-17 21:54:16 +00:00
bluejay	12b07219c7	chore(github-runner): park github-runner-sharedpos (replicas 0) until Cx-1 non-root fix Shared.Pos build fails on non-root runner (setup-dotnet /usr/share/dotnet denied); churning runner drove HighCPU on rke2-agent2. Re-enable in the Sprint 30+ Cx-1 Linux-runner-fleet lane (DOTNET_INSTALL_DIR on pod env).	2026-05-17 21:50:35 +00:00
bluejay	ad670fb344	feat(github-runner): add Shared.Pos repo-scoped Linux runner (unstick stuck publish)	2026-05-17 19:50:23 +00:00
Codex	6f6ca50987	fix(github-runner): switch RUNNER_TOKEN -> ACCESS_TOKEN; set RUN_AS_ROOT=false Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 22:08:56 +00:00
Codex	c7be58c1f7	chore(github-runner): bump replicas 0 -> 1 (PAT provisioned) Operator provisioned GitHub PAT (Runner Registration) 1P item. OnePasswordItem CRD already materialized the secret. Unblocks CI fleet-wide. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 22:04:03 +00:00
Codex	710340d8be	chore(github-runner): rename 1P item to GitHub PAT (Runner Registration) Renames the OnePasswordItem.itemPath from "GitHub Runner Registration Token" to "GitHub PAT (Runner Registration)" so the runner 1P entry sits next to its siblings — GitHub PAT (Gitea Mirrors) and GitHub PAT (NuGet Packages) — under a consistent "GitHub PAT (...)" naming pattern and API_CREDENTIAL category. Existing field "credential" remains the consumer (RUNNER_TOKEN env). Comment block clarified to require Administration:read/write fine-grained PAT scope on target repos. Old 1P item renamed to "[DEPRECATED 2026-05-16] GitHub Runner Registration" — kept as recovery backup; can be hard-deleted after the first successful runner pod start against the new item path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 10:27:58 -05:00
Andrew Stoltz	7d2daaa4f8	chore(github-runner): replicas 1 → 0 until 1Password token provisioned github-runner-token OnePasswordItem exists but the underlying 1Password vault item hasn't been created yet, so the operator can't mint the K8s Secret. Pod stuck in CreateContainerConfigError → DeploymentReplicasMismatch alert fires. Scaling to 0 keeps the manifest infrastructure intact but stops trying to schedule until operator: 1. Creates "GitHub Runner Registration Token" item in IAmWorkin vault 2. Generates a token at github.com/astoltz/<repo>/settings/actions/runners/new 3. Updates the OnePasswordItem itemPath to point at it 4. Bumps replicas back to 1 via PR	2026-05-15 16:18:19 -05:00
Codex	e8094eb0bd	ci(github-runner): add Phase 2 ephemeral Linux runner K8s manifest Namespace github-runner with myoung34/github-runner:latest Deployment, 5Gi Longhorn RWO NuGet cache PVC, zero-privilege ServiceAccount, and OnePasswordItem CRD for the registration token. EPHEMERAL=true mode re-registers after each job; Recreate strategy avoids RWO multi-attach. Targets fc-build-linux label; single replica pinned to rke2-server node. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-14 12:46:25 -05:00

24 Commits
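
The CPU-request arithmetic quoted in the right-sizing commits (2b420ce8a4, 5cbc1a06b1, 9e7ee39b3a) can be checked with a small sketch. This is only an illustration, not code from the repository: the function names are hypothetical, and the pod count of 50 is the approximate figure the commit messages give ("~50 idle runners"), not a measured value.

```python
# Illustrative check of the request right-sizing numbers from the commit log.
# All function names are hypothetical; inputs are the approximate figures
# quoted in the commit messages (about 50 idle pods, 500m -> 100m, ~1m observed).

def reserved_cores(pod_count: int, request_millicores: int) -> float:
    """CPU cores reserved by the scheduler for pod_count pods at a given request."""
    return pod_count * request_millicores / 1000


def freed_cores(pod_count: int, old_request_m: int, new_request_m: int) -> float:
    """CPU cores returned to the allocatable pool when each pod's request shrinks."""
    return pod_count * (old_request_m - new_request_m) / 1000


def overprovision_factor(request_millicores: int, observed_millicores: int) -> float:
    """How many times larger the request is than the observed usage."""
    return request_millicores / observed_millicores


if __name__ == "__main__":
    pods = 50  # assumption: the "~50 idle runners" from the commit messages
    print(reserved_cores(pods, 500))      # cores held at the old 500m request
    print(freed_cores(pods, 500, 100))    # cores reclaimed by moving to 100m
    print(overprovision_factor(500, 1))   # request vs ~1m observed idle usage
```

With those inputs the sketch gives 25 cores reserved at 500m, 20 cores freed at 100m, and a 500× ratio of request to observed idle usage, which agrees with the figures in the commit messages. Limits (`limits.cpu: 2000m`) are unaffected by this calculation, since only requests count toward scheduling.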