bluejay-infra

Author	SHA1	Message	Date
bluejay	b842738a0e	Merge pull request 'Sprint 63 Cx-10: align hardening probe paths with live routes' (#44) from codex/s63-cx10 into main Sprint 63 Cx-10 live-proof fix after Traefik curls found three stale probe-path annotations. Local lint 100/100; git diff --check clean; no Gitea statuses attached.	2026-06-05 03:02:14 +00:00
Andrew Stoltz	f0cb7a5e81	fix(hardening): align probe-path annotations with live health routes	2026-06-04 22:01:04 -05:00
bluejay	ac0f665323	Merge pull request 'Draft: Sprint 62 Cx-10 broader exposure hardening' (#43) from codex/s62-cx10 into main Sprint 63 Cx-10 reconcile-first merge after local lint proof: 100/100 passed, no Gitea statuses attached, CRLF diff check clean.	2026-06-05 02:51:37 +00:00
Andrew Stoltz	c4b08f41ab	feat(infra): prestage broader app exposure hardening	2026-06-04 18:14:22 -05:00
Andrew Stoltz	417d3830ae	test(lint): reconcile baseline infra assertions	2026-06-04 18:02:32 -05:00
bluejay	cb4ea13e7a	monitoring: mirror Sprint 60 probe coverage Merged on local lint plus live noc1 Prometheus /api/v1/rules proof.	2026-06-04 18:19:47 +00:00
Andrew Stoltz	a3cd67d6bb	monitoring: mirror Sprint 60 probe coverage	2026-06-04 13:15:18 -05:00
Andrew Stoltz	81a3ddac4c	fix(auth): mark OIDC healthz probes anonymous	2026-06-04 11:03:20 -05:00
bluejay	300f8ad546	fix(monitoring): probe OIDC-safe health routes Sprint 58 Cx-12. Rebased over OIDC GitOps main; YAML parse and focused bluejay-infra lint tests passed.	2026-06-04 06:45:34 +00:00
bluejay	fe38c2641f	Merge pull request 'fix(auth): deploy distribution root anonymous image' (#38) from codex/s58-distribution-root-anon-gitops into main	2026-06-04 06:20:09 +00:00
Andrew Stoltz	3b40dfb185	fix(auth): deploy distribution root anonymous image	2026-06-04 01:19:16 -05:00
bluejay	103878671c	Merge pull request 'fix(auth): deploy Distribution OIDC image tag' (#37) from codex/s58-oidc-proper into main	2026-06-04 06:05:15 +00:00
Andrew Stoltz	36039c1335	fix(auth): deploy distribution oidc image tag	2026-06-04 01:04:44 -05:00
bluejay	2a66109f13	Merge pull request 'feat(auth): adopt OIDC GitOps for DNS Distribution Media' (#36) from codex/s58-oidc-proper into main	2026-06-04 05:52:56 +00:00
Andrew Stoltz	933fea89d1	feat(auth): adopt oidc apps in gitops	2026-06-04 00:49:36 -05:00
Andrew Stoltz	13f9bb7710	fix(distribution): revert OIDC enforcement — enabling it gated /healthz probe (service down) Flipping Auth__Enabled=true gated the /healthz readiness probe (302->NotReady-> no endpoints->distribution.iamworkin.lan down, healthz=000). Classic feedback_k8s_probes_behind_auth_middleware. Revert to false (OIDC env block kept, gate off) to restore service. Proper fix (AllowAnonymous /healthz + CA-trust + idempotent Editions seed + OIDC-challenge wiring + browser-proof) -> falcon OIDC lane. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 23:47:29 -05:00
Andrew Stoltz	9a58fd2af6	oidc: flip enforcement ON for knowledge + distribution (no-live-proof, fix-forward) Operator 2026-06-04: nothing is production yet, flip OIDC + fix-forward (no browser-proof gate). knowledge: Auth__Enabled false->true (OIDC env already wired). distribution: add OIDC env block (Authority/Audience/ClientId=distribution, ClientSecret from distribution-oidc-client) + Enabled=true; public read/entitlement + Method() allowlist stay open (OIDC gates admin only). Clients already provisioned (secrets present). ArgoCD deploys both. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 23:38:48 -05:00
Andrew Stoltz	404d884863	Adopt live Library Retail AiStation web apps	2026-06-03 20:24:32 -05:00
bluejay	f4bd90f805	Merge pull request #33 from codex/s56-monitoring-coverage fix(monitoring): repoint pirelay scrape to signalcontrol	2026-06-04 01:22:49 +00:00
Andrew Stoltz	67d67ab73d	fix(monitoring): repoint pirelay scrape to signalcontrol	2026-06-03 20:20:36 -05:00
Andrew Stoltz	f7d41cdc60	revert: drop fc-library manifest — Library.Web already deployed live (41h) Library.Web is already running + serving at library.iamworkin.lan (root=200, healthz=200), deployed manually 41h ago (image fc-library-web:v20260602-..., PVC library-web-data holding the live SQLite DB). My from-scratch manifest used a different PVC name (library-data) which ArgoCD would attach as a fresh empty volume, orphaning the live DB. Adopting the live deploy into GitOps is a separate careful task. Not disturbing a working deployment. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 19:30:23 -05:00
Andrew Stoltz	2c0afc28e4	deploy(fc-library): add Library.Web internal-host deployment From-scratch .Web deploy at library.iamworkin.lan (operator-authorized 2026-06-03). Cloned from the worldbuilder pattern: Deployment + Service + Longhorn RWO PVC + step-ca cert + Traefik IngressRoute. SQLite at /data/library.db, no OIDC, both /health + /healthz probes. Image localhost/fc-library:v202606031925 imported to both RKE2 nodes. DNS library.iamworkin.lan -> 10.0.56.200 already in pfSense. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 19:28:22 -05:00
Robot	ba5f5dd0fb	deploy(knowledge): roll audit backfill fix	2026-06-03 18:24:22 -05:00
Robot	dc699da7b3	fix(knowledge): persist federation database on PVC	2026-06-03 18:17:31 -05:00
Robot	1e8bf54c6e	deploy: roll Chat and Knowledge OIDC images	2026-06-03 18:13:09 -05:00
Andrew Stoltz	e2e93d482c	Deploy TtsReader schema repair image Co-Authored-By: Codex <codex@openai.com>	2026-06-02 22:00:15 -05:00
bluejay	4319cc2b51	Merge PR #32: divoom pi deploy artifact manifests Lands Divoom-as-DM-device and Divoom-TV Pi HDMI deploy artifacts for Cx-6.	2026-06-03 02:47:36 +00:00
Andrew Stoltz	2bf339ce51	Deploy TtsReader PR29 live proof image Co-Authored-By: Codex <codex@openai.com>	2026-06-02 21:47:04 -05:00
Andrew Stoltz	5bdedfc5ae	divoom: add pi deploy artifact manifests Add source-controlled Puppet/Hiera contracts for edge2 Divoom-as-DM-device without replacing the live flowercore-divoom systemd deployment. Add Divoom TV Pi HDMI systemd/Puppet deployment artifacts, LF shell-script guardrails, and focused lint coverage for the additive non-K8s deploy shape. Co-Authored-By: Codex <codex@openai.com>	2026-06-02 21:45:27 -05:00
Andrew Stoltz	0307ae16ae	monitoring(probe): signage/mysql/php blackbox probe / -> /healthz (K8s-target mirror) Mirrors the live noc1 podman fix + Notes scripts/monitoring/prometheus.yml. These services enforce OIDC bearer auth (FlowerCore__Auth__Enabled=true), so an anonymous probe of / returns 401 -> false TraefikServiceDown. All three expose anonymous /healthz=200. This noc-monitoring.yaml is the forward K8s-migration target (not live); brings it in sync with the live config.	2026-06-02 01:09:57 -05:00
Andrew Stoltz	6c18f69cf2	mail: remove cert-manager Certificate (manage mail-tls via step-ca JWK + noc1 renew timer) step-ca-acme only has an HTTP-01 (Traefik) solver, but mail.iamworkin.lan must resolve to the dedicated MetalLB IP 10.0.56.202 (SMTP/IMAP), so HTTP-01 cannot validate (order stuck pending since 2026-05-06; cert expired 2026-05-24). mail-tls is now issued from step-ca's JWK 'admin' provisioner and auto-renewed by a systemd timer on noc1 that writes the mail-tls secret directly. The secret + Deployment mount + webmail IngressRoute are unchanged. Re-add a Certificate only if a DNS-01 solver is deployed for step-ca-acme. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-01 15:55:38 -05:00
Andrew Stoltz	47e2256556	Deploy TtsReader correction bridge images	2026-05-31 12:35:45 -05:00
Andrew Stoltz	9d77f8ba0e	fc-updater: disable loki audit sink	2026-05-31 11:34:12 -05:00
Andrew Stoltz	2f4be19c85	fc-updater: bump signing diagnostics image	2026-05-31 00:32:48 -05:00
Andrew Stoltz	2a62c40990	fc-updater: bump image to MSI installer surface	2026-05-30 23:31:48 -05:00
Andrew Stoltz	7be98e5efc	Bump UpdateCenter image to hosted-service fix	2026-05-30 20:22:13 -05:00
Andrew Stoltz	a65b356c9d	deploy(fc-updater): roll UC to v202605301823-a6c3354 (Phase 3 SQLite fixes) Durable image bump for FlowerCore.Updater main a6c3354 (PRs #63-#66): hosted-service + request-path SQLite DateTimeOffset fixes, StopHost restored + per-tick resilience, Shared.Settings 1.0.1. Image built + imported to rke2-server. Un-degrades the Phase-9 provenance verifier + settings poll (were stopped under the removed global Ignore mask). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 18:27:45 -05:00
Andrew Stoltz	08c17ef1b4	fc-updater: bump to v202605301703-296f350-fix2 (BackgroundServiceExceptionBehavior=Ignore so a hosted-service SQLite query crash can't stop the host) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 17:04:54 -05:00
Andrew Stoltz	06f2f002b7	fc-updater: bump image to v202605301657-296f350-fix1 (Shared.Settings SQLite poll fix) The v202605301642-296f350-rework image crash-looped: FlowerCore.Shared.Settings SettingsDbPollHostedService ran a DateTimeOffset Where/OrderBy on SettingsRecordChanges that SQLite can't translate, and as a BackgroundService it stopped the host. Shared.Settings 1.0.1 materializes the change-log then filters/orders in memory; Updater Web bumped to 1.0.1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 16:59:37 -05:00
Andrew Stoltz	7ac4a8b4b7	fc-updater: bump image to v202605301642-296f350-rework (ADR-179 rework live) Deploy the current FlowerCore.Updater main (PRs #52-#61) to prod: MSI-first packaging, beta gating + per-install tokens, interactive+bearer Authentik OIDC, native installer apply, and the .fcsetup.exe retirement (DropReleaseInstallers migration runs on the now-empty DB). Image pre-imported to rke2-server + agent1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 16:47:28 -05:00
Andrew Stoltz	90f2a86819	ops: trim load for degraded 2-node cluster (agent2 PSU dead) Scale all github-runner deployments to 1 replica and halt the ci1 KubeVirt VM. With agent2 down (failed PSU) the cluster runs on two passively-cooled NUCs; the ci1 8-vCPU VM drove agent1 to ~100C. Keep total load trimmed until replacement hardware is in place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 13:47:13 -05:00
Andrew Stoltz	cbdefb2b23	Revert "ci1: expose WinRM/RDP/SSH ports on masquerade interface for Phase 2 bootstrap" The port additions caused the new VMI to stick at phase=Scheduled with reason=GuestNotRunning. The guest-console-log sidecar exited 1 and qemu never started. Reverting to the working 9-day-stable shape until the port-add path is verified in a non-production VM. Phase 2 (Windows runner install + registration) needs an operator-interactive virtctl-vnc session against the rebuilt VM, OR a separate investigation of why this port-add tipped over the VM. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 11:35:10 -05:00
Andrew Stoltz	1c36fe3a0a	ci1: expose WinRM/RDP/SSH ports on masquerade interface for Phase 2 bootstrap The Phase 1 VM has been Running for 9 days but Phase 2 (Puppet bootstrap + runner registration) was deferred because the operator-interactive virtctl-vnc path was the only way in. The masquerade interface listed no exposed ports, so virtctl ssh and kubectl port-forward both hit 'no route to host' — qemu user-mode NAT does not forward inbound by default. Adding 5985 (WinRM HTTP) lets a kubectl port-forward + PowerShell remoting path drive runner registration entirely from outside the VM. 3389 + 22 are reserved for desktop access via Guacamole or virtctl ssh once OpenSSH Server is installed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 11:24:34 -05:00
Andrew Stoltz	2b420ce8a4	runners: fleet-wide right-size CPU requests from 500m to 100m All 33 runner Deployments now request 100m CPU instead of 500m, freeing roughly 50 idle pods × 400m = ~20 cores back to the cluster. Observed CPU usage on idle runners is ~1m via kubectl top; the 500m request was a 500× over-provision that was eating allocatable CPU and blocking new workload scheduling — WorldBuilder runner could not be scheduled even at the new 100m request because the pre-existing fleet held the cluster at 99% requested. Burst headroom preserved by limits.cpu: 2000m unchanged. TtsReader keeps its 8Gi memory limit from the 2026-05-25 OOMKill fix; only the CPU request line moves. Recreate strategy on each deployment means a brief offline window per runner during rollout; in-flight CI jobs complete on the existing container before the new spec takes effect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 10:09:24 -05:00
Andrew Stoltz	5cbc1a06b1	runners: scale DM/AiStation.Linux/WorldBuilder down to 1 replica until cluster relieved After cutting requests to 100m, 4 of 6 new pods scheduled and 2 stayed Pending — cluster CPU REQUEST utilization is 49.6 of 48 allocatable cores because the existing fleet of ~50 idle runners reserves 25.6 cores (500m × ~50) for ~50m actual use. Single-replica per new repo gets the service online without competing with in-flight CI from the rest of the fleet. When the broader fleet-wide request right-sizing pass lands (500m → 100m on all idle runners would free ~20 cores), these can be bumped back to 2 replicas if PR-CI backlog warrants it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 10:03:30 -05:00
Andrew Stoltz	9e7ee39b3a	runners: drop CPU request 500m→100m on DM/AiStation.Linux/WorldBuilder All 3 fleet nodes were at 99% CPU REQUEST allocation; the 6 new pods from the previous commit (3 deployments × 2 replicas × 500m) couldn't schedule. Idle runners actually use ~1m CPU per `kubectl top pods`; the 500m request was significantly over-provisioned. Burst headroom preserved by limits.cpu: 2000m unchanged. Follow-up: similar request right-sizing pass across the rest of the runner fleet is queued for a future morning-routine sweep — 25 cores reserved for ~50m actual use is a large slack we can reclaim cluster-wide. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 10:00:23 -05:00
Andrew Stoltz	ae030a5f33	runners: add github-runner Deployments for DeviceManagement + AiStation.Linux + WorldBuilder Morning-routine 2026-05-26 — these three repos had ZERO online Linux PR-CI capacity, blocking the Sprint 37 Cx-1 Linux-CI-migration PRs (DM #20/#21/#22, AiStation.Linux #13, WorldBuilder #3/#4). Chicken-and-egg: the migration PRs need Linux runners that the migration creates. Each Deployment uses the same canonical emptyDir-only pattern as the fresh-2026-05-26 updater deployment that lives just above: - replicas: 2 (room for parallel PR-CI without head-of-line blocking) - per-pod emptyDir caches (no RWO PVC contention) - shared github-runner-token secret (existing ACCESS_TOKEN PAT has org-wide read access) - LABELS: self-hosted,linux,fc-build-linux - DOTNET_INSTALL_DIR pinned per ADR-170 family For AiStation.Linux specifically: Linux job will now pick up; the Windows job in #13 remains queued indefinitely until the Windows runner host substrate lands per Sprint 36 v2 Cl-2 / ADR-174 — that's a separate arc, not this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 09:55:31 -05:00
bluejay	bc8c35896f	tests: add bluejay-ws runner-exclusion lint + fix 3 stale runner-fleet assertions (#30) BLUEJAY-WS must never be a fleet GHA runner (operator directive 2026-05-26). Build-side analog of Sprint 9 safe-account exclusion. Also fixes 3 stale runner-fleet assertions broken by initContainer addition + replica tuning.	2026-05-26 03:42:01 +00:00
Andrew Stoltz	2cc91b6df0	runners: bump tts-reader memory limit 4Gi -> 8Gi The github-runner-tts-reader pod was being OOMKilled (exit 137) mid-`dotnet test` on the TtsReader 1000+ test suite. PR #21 CI (the Windows -> Linux runner migration) flapped twice with the "self-hosted runner lost communication" annotation before the K8s-side symptoms surfaced via kubectl describe pod. Requests bumped 1Gi -> 2Gi, limits 4Gi -> 8Gi. Comment added inline so future fleet runs don't trip the same wall. Unblocks PR #21 + the 9 other open TtsReader PRs that all rebase through it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 22:31:48 -05:00
bluejay	0d2090fe81	runners: add github-runner-updater Deployment (#29) Close runner-fleet gap for FlowerCore.Updater. Matches Sprint 32 long-tail pattern; registers entry in fleet-lint required-set.	2026-05-26 03:24:13 +00:00

474 Commits