bluejay-infra/apps/github-runner
Andrew Stoltz 2b420ce8a4 runners: fleet-wide right-size CPU requests from 500m to 100m
All 33 runner Deployments now request 100m CPU instead of 500m,
freeing roughly 50 idle pods × 400m = ~20 cores back to the cluster.
Observed CPU usage on idle runners is ~1m via kubectl top; the 500m
request was a 500× over-provision that was eating allocatable CPU
and blocking new workload scheduling — WorldBuilder runner could not
be scheduled even at the new 100m request because the pre-existing
fleet held the cluster at 99% requested.

Burst headroom preserved by limits.cpu: 2000m unchanged. TtsReader
keeps its 8Gi memory limit from the 2026-05-25 OOMKill fix; only
the CPU request line moves.

Recreate strategy on each deployment means a brief offline window
per runner during rollout; in-flight CI jobs complete on the
existing container before the new spec takes effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 10:09:24 -05:00

GitHub Runner Fleet

ArgoCD owns apps/github-runner/github-runner.yaml. Do not patch live runner Deployments with kubectl; update this manifest and let ArgoCD reconcile.
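
One read-only way to see whether the live Deployments have drifted from the manifest is kubectl diff, which does not change anything on the cluster. The following is a minimal sketch, assuming it is run from the repository root with cluster access; the function names drift_message and check_runner_drift are hypothetical, and ArgoCD may add its own tracking labels or annotations, so some diff output can be expected even when nothing was patched by hand.

```shell
# Translate a kubectl diff exit status into a short message.
# kubectl diff exits 0 when live objects match the file, 1 when they
# differ, and a value greater than 1 when the command itself failed.
drift_message() {
  case "$1" in
    0) echo "in sync" ;;
    1) echo "drift detected" ;;
    *) echo "diff failed" ;;
  esac
}

# Hypothetical wrapper: print the diff, then a one-line summary.
check_runner_drift() {
  kubectl diff -f apps/github-runner/github-runner.yaml
  drift_message "$?"
}
```

If check_runner_drift reports drift, the fix still belongs in the manifest rather than in a live kubectl patch.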

Runner Shape

All repo-scoped Linux runners use:

  • localhost/fc-github-runner:v20260525-ruby3.3.11-stepca, derived from myoung34/github-runner:latest
  • ACCESS_TOKEN from the github-runner-token Secret
  • RUN_AS_ROOT=false
  • EPHEMERAL=true
  • LABELS=self-hosted,linux,fc-build-linux
  • writable non-root paths under /home/runner for .NET, NuGet, XDG cache, and Actions tool cache
  • Ruby 3.3.11 seeded into /home/runner/_tool/Ruby/3.3/x64 from the baked /opt/runner-toolcache copy so ruby/setup-ruby@v1 can discover it on self-hosted ubuntu-20.04-x64 runners

The github-runner Deployment for FlowerCore.Common runs a single replica because it retains the original Longhorn ReadWriteOnce (RWO) NuGet PVC. Every other repo-scoped runner runs two replicas with per-pod emptyDir caches. This keeps backlog draining safe: no two pods ever share one RWO PVC.
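
The replica rule above can be checked mechanically. This is a sketch only: it assumes kubectl prints an absent field as <none> in custom-columns output, and the function names find_shared_pvc_risk and check_pvc_replicas are hypothetical.

```shell
# Print Deployments that run more than one replica and mount a PVC.
# Input: lines of "NAME REPLICAS CLAIMS", where CLAIMS is "<none>" when
# the pod template has no persistentVolumeClaim volume.
find_shared_pvc_risk() {
  awk '$2 + 0 > 1 && $3 != "<none>" { print $1 }'
}

# Hypothetical wrapper: feed the live Deployments through the filter.
check_pvc_replicas() {
  kubectl -n github-runner get deploy --no-headers \
    -o custom-columns='NAME:.metadata.name,REPLICAS:.spec.replicas,CLAIMS:.spec.template.spec.volumes[*].persistentVolumeClaim.claimName' \
    | find_shared_pvc_risk
}
```

An empty result from check_pvc_replicas would mean no multi-replica Deployment mounts a PVC; any name it prints is a candidate for a Multi-Attach error.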

The final long-tail wave in Sprint 32 adds 16 two-replica Deployments: FlowerCore.Knowledge, FlowerCore.LlmBridge, FlowerCore.Media, FlowerCore.Presentations, FlowerCore.RemoteDesktop, FlowerCore.DNS, FlowerCore.Distribution, FlowerCore.Scoreboard, FlowerCore.SegmentDisplay, FlowerCore.Signage.Contracts, FlowerCore.SignalControl, FlowerCore.Intranet.Web, FlowerCore.Provisioning, FlowerCore.Redis, FlowerCore.MessageBoard, and FlowerCore.MenuBoard.

Image Build

Ruby is baked into the image with a pinned ruby-build release and a pinned Ruby patch version. Because the pod mounts an emptyDir over /home/runner, anything baked under that path is hidden at runtime, so the setup-runner-home init container copies the baked toolcache from /opt/runner-toolcache/Ruby into /home/runner/_tool/Ruby before the runner container starts.
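
The copy step could look roughly like the following. This is a sketch under stated assumptions, not the actual setup-runner-home command, which this README does not show; the function name seed_toolcache is hypothetical.

```shell
# Copy a baked toolcache directory into a freshly mounted runner home.
# $1: baked source directory, $2: toolcache root inside the emptyDir.
seed_toolcache() {
  src=$1
  dest_root=$2
  mkdir -p "$dest_root"
  # -a preserves modes, symlinks, and timestamps, and copies the
  # x64.complete marker file along with the Ruby tree.
  cp -a "$src" "$dest_root/"
}

# In the init container this would be invoked as (assumed paths from above):
#   seed_toolcache /opt/runner-toolcache/Ruby /home/runner/_tool
```

Copying the whole Ruby directory rather than only 3.3/x64 keeps the marker file and the version tree together, which is what ruby/setup-ruby@v1 looks for.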

The IAmWorkin step-ca root CA is also baked into the system trust store (/usr/local/share/ca-certificates/iamworkin-step-ca-root.crt, registered by update-ca-certificates). Without it, .NET HttpClient calls from CI tests against *.iamworkin.lan (e.g. https://selenium.iamworkin.lan/session) fail certificate validation with PartialChain. To refresh the bundled cert when the root rotates, re-extract it from the cluster, overwrite step-ca-root.crt, then rebuild, verify, and save the image:

# Work from the build context so the refreshed cert is written into it.
cd apps/github-runner
# Re-extract the current root CA from the cluster.
kubectl get secret -n cert-manager step-ca-root \
  -o jsonpath='{.data.ca\.crt}' | base64 -d > step-ca-root.crt
# Rebuild, then confirm Ruby runs and the toolcache marker exists.
podman build -t localhost/fc-github-runner:v20260525-ruby3.3.11-stepca .
podman run --rm localhost/fc-github-runner:v20260525-ruby3.3.11-stepca ruby -v
podman run --rm localhost/fc-github-runner:v20260525-ruby3.3.11-stepca \
  test -f /opt/runner-toolcache/Ruby/3.3/x64.complete
# Save the image for import on the cluster nodes.
podman save localhost/fc-github-runner:v20260525-ruby3.3.11-stepca \
  -o fc-github-runner-v20260525-ruby3.3.11-stepca.tar
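
The image tag is repeated in every command above, so a tag bump means editing each one. One way to avoid that is to derive the archive name from the image reference. This is a minimal sketch; the function name image_tar_name and the IMAGE and TAR variables are hypothetical.

```shell
# Derive the archive name from an image reference by dropping the
# registry prefix and replacing ":" with "-".
image_tar_name() {
  name=${1##*/}
  printf '%s.tar\n' "$(printf '%s' "$name" | tr ':' '-')"
}

IMAGE=localhost/fc-github-runner:v20260525-ruby3.3.11-stepca
TAR=$(image_tar_name "$IMAGE")
```

With these set, the build and save steps become podman build -t "$IMAGE" . and podman save "$IMAGE" -o "$TAR", and the import loop can reuse the same two variables.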

Import the saved image on every schedulable RKE2 node before ArgoCD rolls the Deployments:

for node in rke2-server rke2-agent1 rke2-agent2; do
  scp fc-github-runner-v20260525-ruby3.3.11-stepca.tar "$node:/tmp/"
  ssh "$node" 'sudo ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images rm localhost/fc-github-runner:v20260525-ruby3.3.11-stepca || true'
  ssh "$node" 'sudo ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images import /tmp/fc-github-runner-v20260525-ruby3.3.11-stepca.tar'
done
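
Before ArgoCD rolls the Deployments, it may be worth confirming that every node actually holds the image. The sketch below assumes the same ctr socket and namespace as the import loop; the function names image_present and verify_nodes are hypothetical.

```shell
# Succeed if the exact image reference appears in a list of refs on stdin.
image_present() {
  grep -Fxq -- "$1"
}

# Hypothetical wrapper: list image refs on each node and report the result.
# $1: image reference, remaining arguments: node names.
verify_nodes() {
  image=$1
  shift
  for node in "$@"; do
    if ssh "$node" 'sudo ctr -a /run/k3s/containerd/containerd.sock -n k8s.io images ls -q' \
        | image_present "$image"; then
      echo "$node: present"
    else
      echo "$node: MISSING"
    fi
  done
}
```

For example, verify_nodes localhost/fc-github-runner:v20260525-ruby3.3.11-stepca rke2-server rke2-agent1 rke2-agent2 prints one line per node. The match is exact, so a stale tag on a node is reported as missing.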

Post-Merge Proof

After the PR is merged and ArgoCD syncs, verify the runner fleet:

kubectl -n github-runner get deploy,pods,pvc

Verify the Ruby toolcache in a fresh pod:

kubectl -n github-runner exec deploy/github-runner-puppet -c runner -- ruby -v
kubectl -n github-runner exec deploy/github-runner-puppet -c runner -- sh -c \
  'echo "$RUNNER_TOOL_CACHE" && test -f "$RUNNER_TOOL_CACHE/Ruby/3.3/x64.complete"'

Verify GitHub registration for the repo-scoped runners:

for repo in FlowerCore.Common FlowerCore.Shared.Pos FlowerCore.Puppet FlowerCore.Signage \
            FlowerCore.DMS FlowerCore.Telephony FlowerCore.Print.Web FlowerCore.Chat \
            FlowerCore.MySQL FlowerCore.Kiosk.Linux FlowerCore.Marquee FlowerCore.TtsReader \
            FlowerCore.Knowledge FlowerCore.LlmBridge FlowerCore.Media \
            FlowerCore.Presentations FlowerCore.RemoteDesktop FlowerCore.DNS \
            FlowerCore.Distribution FlowerCore.Scoreboard FlowerCore.SegmentDisplay \
            FlowerCore.Signage.Contracts FlowerCore.SignalControl FlowerCore.Intranet.Web \
            FlowerCore.Provisioning FlowerCore.Redis FlowerCore.MessageBoard \
            FlowerCore.MenuBoard; do
  echo "=== $repo ==="
  gh api "/repos/astoltz/$repo/actions/runners" \
    --jq '.runners[] | select(any(.labels[]; .name == "fc-build-linux")) | {name,status,busy,labels:[.labels[].name]}'
done
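
The loop above prints every runner; a compact pass/fail summary per repo can be easier to scan. The sketch below assumes the replica counts described under Runner Shape (one for FlowerCore.Common, two elsewhere) and counts all online runners without filtering by label. Because the runners are ephemeral, a count can dip briefly while a pod re-registers, so a warning is a prompt to look again rather than proof of a fault. All function names are hypothetical.

```shell
# Expected runner count per repo: Common is single-replica, the rest run two.
expected_runners() {
  case "$1" in
    FlowerCore.Common) echo 1 ;;
    *) echo 2 ;;
  esac
}

# Print one status line comparing online runners with the expected count.
# $1: repo name, $2: number of runners currently online.
report_online() {
  repo=$1
  online=$2
  want=$(expected_runners "$repo")
  if [ "$online" -lt "$want" ]; then
    echo "WARN $repo: $online/$want online"
  else
    echo "ok $repo: $online/$want online"
  fi
}

# Hypothetical wrapper around the same gh endpoint used above.
count_online() {
  gh api "/repos/astoltz/$1/actions/runners" \
    --jq '[.runners[] | select(.status == "online")] | length'
}
```

Inside the repo loop, report_online "$repo" "$(count_online "$repo")" would replace the per-runner listing with one line per repo.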

Verify the Shared.Pos publish workflow after the runner pod is online:

gh run list --repo astoltz/FlowerCore.Shared.Pos \
  --workflow "Build, Test & Publish" --branch main --limit 5

If the latest run is still queued after runner registration, rerun the workflow from GitHub Actions and verify it lands on an rke2-linux-* runner.
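
To confirm where a run landed without opening the web UI, the jobs endpoint of the GitHub REST API reports a runner_name for each job. This is a sketch; the run ID argument and the function names is_fleet_runner and run_runner_names are hypothetical.

```shell
# Succeed when a runner name matches the fleet's rke2-linux-* pattern.
is_fleet_runner() {
  case "$1" in
    rke2-linux-*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print "job name<TAB>runner name" for one run ID.
run_runner_names() {
  gh api "/repos/astoltz/FlowerCore.Shared.Pos/actions/runs/$1/jobs" \
    --jq '.jobs[] | "\(.name)\t\(.runner_name)"'
}
```

The run ID comes from gh run list with --json databaseId; each runner name printed by run_runner_names can then be passed to is_fleet_runner.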

Failure Notes

  • actions/setup-dotnet permission error at /usr/share/dotnet: check that DOTNET_INSTALL_DIR=/home/runner/.dotnet and related cache env vars are present on the runner pod.
  • ruby/setup-ruby@v1 says self-hosted runners must install Ruby in $RUNNER_TOOL_CACHE: check that the init container copied /opt/runner-toolcache/Ruby into /home/runner/_tool/Ruby and that /home/runner/_tool/Ruby/3.3/x64.complete exists.
  • 404 during runner registration: the fine-grained PAT is valid but missing repository access for that repo. Add the repo to the PAT access list; the PAT value does not change.
  • Multi-Attach volume error: only the Common runner uses an RWO PVC, and it must stay single-replica. New multi-replica runners use emptyDir.
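
The first two notes can be triaged from a shell in one pass. This is a minimal sketch, assuming the container is named runner as in the Post-Merge Proof commands; the function names has_env and triage_runner are hypothetical.

```shell
# Succeed if "NAME=VALUE" appears as an exact line in env output on stdin.
# $1: variable name, $2: expected value.
has_env() {
  grep -Fxq -- "$1=$2"
}

# Hypothetical triage for the setup-dotnet and setup-ruby failure notes.
# $1: Deployment name, for example github-runner-puppet.
triage_runner() {
  deploy=$1
  kubectl -n github-runner exec "deploy/$deploy" -c runner -- env \
    | has_env DOTNET_INSTALL_DIR /home/runner/.dotnet \
    || echo "DOTNET_INSTALL_DIR is not /home/runner/.dotnet"
  kubectl -n github-runner exec "deploy/$deploy" -c runner -- \
    test -f /home/runner/_tool/Ruby/3.3/x64.complete \
    || echo "Ruby toolcache marker is missing"
}
```

No output from triage_runner means both checks passed for that Deployment.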