474 Commits

Author SHA1 Message Date
b842738a0e Merge pull request 'Sprint 63 Cx-10: align hardening probe paths with live routes' (#44) from codex/s63-cx10 into main
Sprint 63 Cx-10 live-proof fix after Traefik curls found three stale probe-path annotations. Local lint 100/100; git diff --check clean; no Gitea statuses attached.
2026-06-05 03:02:14 +00:00
Andrew Stoltz
f0cb7a5e81 fix(hardening): align probe-path annotations with live health routes 2026-06-04 22:01:04 -05:00
ac0f665323 Merge pull request 'Draft: Sprint 62 Cx-10 broader exposure hardening' (#43) from codex/s62-cx10 into main
Sprint 63 Cx-10 reconcile-first merge after local lint proof: 100/100 passed, no Gitea statuses attached, CRLF diff check clean.
2026-06-05 02:51:37 +00:00
Andrew Stoltz
c4b08f41ab feat(infra): prestage broader app exposure hardening 2026-06-04 18:14:22 -05:00
Andrew Stoltz
417d3830ae test(lint): reconcile baseline infra assertions 2026-06-04 18:02:32 -05:00
cb4ea13e7a monitoring: mirror Sprint 60 probe coverage
Merged on local lint plus live noc1 Prometheus /api/v1/rules proof.
2026-06-04 18:19:47 +00:00
Andrew Stoltz
a3cd67d6bb monitoring: mirror Sprint 60 probe coverage 2026-06-04 13:15:18 -05:00
Andrew Stoltz
81a3ddac4c fix(auth): mark OIDC healthz probes anonymous 2026-06-04 11:03:20 -05:00
300f8ad546 fix(monitoring): probe OIDC-safe health routes
Sprint 58 Cx-12. Rebased over OIDC GitOps main; YAML parse and focused bluejay-infra lint tests passed.
2026-06-04 06:45:34 +00:00
fe38c2641f Merge pull request 'fix(auth): deploy distribution root anonymous image' (#38) from codex/s58-distribution-root-anon-gitops into main 2026-06-04 06:20:09 +00:00
Andrew Stoltz
3b40dfb185 fix(auth): deploy distribution root anonymous image 2026-06-04 01:19:16 -05:00
103878671c Merge pull request 'fix(auth): deploy Distribution OIDC image tag' (#37) from codex/s58-oidc-proper into main 2026-06-04 06:05:15 +00:00
Andrew Stoltz
36039c1335 fix(auth): deploy distribution oidc image tag 2026-06-04 01:04:44 -05:00
2a66109f13 Merge pull request 'feat(auth): adopt OIDC GitOps for DNS Distribution Media' (#36) from codex/s58-oidc-proper into main 2026-06-04 05:52:56 +00:00
Andrew Stoltz
933fea89d1 feat(auth): adopt oidc apps in gitops 2026-06-04 00:49:36 -05:00
Andrew Stoltz
13f9bb7710 fix(distribution): revert OIDC enforcement — enabling it gated /healthz probe (service down)
Flipping Auth__Enabled=true gated the /healthz readiness probe (302->NotReady->
no endpoints->distribution.iamworkin.lan down, healthz=000). Classic
feedback_k8s_probes_behind_auth_middleware. Revert to false (OIDC env block kept,
gate off) to restore service. Proper fix (AllowAnonymous /healthz + CA-trust +
idempotent Editions seed + OIDC-challenge wiring + browser-proof) -> falcon OIDC lane.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 23:47:29 -05:00
Andrew Stoltz
9a58fd2af6 oidc: flip enforcement ON for knowledge + distribution (no-live-proof, fix-forward)
Operator 2026-06-04: nothing is production yet, flip OIDC + fix-forward (no
browser-proof gate). knowledge: Auth__Enabled false->true (OIDC env already
wired). distribution: add OIDC env block (Authority/Audience/ClientId=distribution,
ClientSecret from distribution-oidc-client) + Enabled=true; public read/entitlement
+ Method() allowlist stay open (OIDC gates admin only). Clients already provisioned
(secrets present). ArgoCD deploys both.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 23:38:48 -05:00
Andrew Stoltz
404d884863 Adopt live Library Retail AiStation web apps 2026-06-03 20:24:32 -05:00
f4bd90f805 Merge pull request #33 from codex/s56-monitoring-coverage
fix(monitoring): repoint pirelay scrape to signalcontrol
2026-06-04 01:22:49 +00:00
Andrew Stoltz
67d67ab73d fix(monitoring): repoint pirelay scrape to signalcontrol 2026-06-03 20:20:36 -05:00
Andrew Stoltz
f7d41cdc60 revert: drop fc-library manifest — Library.Web already deployed live (41h)
Library.Web is already running + serving at library.iamworkin.lan (root=200,
healthz=200), deployed manually 41h ago (image fc-library-web:v20260602-...,
PVC library-web-data holding the live SQLite DB). My from-scratch manifest used
a different PVC name (library-data) which ArgoCD would attach as a fresh empty
volume, orphaning the live DB. Adopting the live deploy into GitOps is a
separate careful task. Not disturbing a working deployment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 19:30:23 -05:00
Andrew Stoltz
2c0afc28e4 deploy(fc-library): add Library.Web internal-host deployment
From-scratch .Web deploy at library.iamworkin.lan (operator-authorized 2026-06-03).
Cloned from the worldbuilder pattern: Deployment + Service + Longhorn RWO PVC +
step-ca cert + Traefik IngressRoute. SQLite at /data/library.db, no OIDC, both
/health + /healthz probes. Image localhost/fc-library:v202606031925 imported to
both RKE2 nodes. DNS library.iamworkin.lan -> 10.0.56.200 already in pfSense.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 19:28:22 -05:00
Robot
ba5f5dd0fb deploy(knowledge): roll audit backfill fix 2026-06-03 18:24:22 -05:00
Robot
dc699da7b3 fix(knowledge): persist federation database on PVC 2026-06-03 18:17:31 -05:00
Robot
1e8bf54c6e deploy: roll Chat and Knowledge OIDC images 2026-06-03 18:13:09 -05:00
Andrew Stoltz
e2e93d482c Deploy TtsReader schema repair image
Co-Authored-By: Codex <codex@openai.com>
2026-06-02 22:00:15 -05:00
4319cc2b51 Merge PR #32: divoom pi deploy artifact manifests
Lands Divoom-as-DM-device and Divoom-TV Pi HDMI deploy artifacts for Cx-6.
2026-06-03 02:47:36 +00:00
Andrew Stoltz
2bf339ce51 Deploy TtsReader PR29 live proof image
Co-Authored-By: Codex <codex@openai.com>
2026-06-02 21:47:04 -05:00
Andrew Stoltz
5bdedfc5ae divoom: add pi deploy artifact manifests
Add source-controlled Puppet/Hiera contracts for edge2 Divoom-as-DM-device without replacing the live flowercore-divoom systemd deployment.

Add Divoom TV Pi HDMI systemd/Puppet deployment artifacts, LF shell-script guardrails, and focused lint coverage for the additive non-K8s deploy shape.

Co-Authored-By: Codex <codex@openai.com>
2026-06-02 21:45:27 -05:00
Andrew Stoltz
0307ae16ae monitoring(probe): signage/mysql/php blackbox probe / -> /healthz (K8s-target mirror)
Mirrors the live noc1 podman fix + Notes scripts/monitoring/prometheus.yml.
These services enforce OIDC bearer auth (FlowerCore__Auth__Enabled=true), so an
anonymous probe of / returns 401 -> false TraefikServiceDown. All three expose
anonymous /healthz=200. This noc-monitoring.yaml is the forward K8s-migration
target (not live); brings it in sync with the live config.
2026-06-02 01:09:57 -05:00
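The false alert described in the commit above can be illustrated with a minimal sketch. It assumes the blackbox module keeps the exporter's default of treating any 2xx response as success; the commit does not show the module config, and the function name `probe_succeeds` is hypothetical.

```python
from typing import Optional, Sequence


def probe_succeeds(status_code: int,
                   valid_status_codes: Optional[Sequence[int]] = None) -> bool:
    """Mimic an HTTP probe's status check.

    Assumption: with no explicit list, any 2xx status counts as success.
    """
    if valid_status_codes:
        return status_code in valid_status_codes
    return 200 <= status_code <= 299


# Status codes quoted in the commit message.
print(probe_succeeds(401))  # anonymous probe of "/" on an OIDC-protected service
print(probe_succeeds(200))  # anonymous probe of "/healthz"
```

Under that assumption the 401 from `/` reads as a failed probe even though the service is healthy, which is why the probe target moves to the anonymous `/healthz` route.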
Andrew Stoltz
6c18f69cf2 mail: remove cert-manager Certificate (manage mail-tls via step-ca JWK + noc1 renew timer)
step-ca-acme only has an HTTP-01 (Traefik) solver, but mail.iamworkin.lan must resolve
to the dedicated MetalLB IP 10.0.56.202 (SMTP/IMAP), so HTTP-01 cannot validate (order
stuck pending since 2026-05-06; cert expired 2026-05-24). mail-tls is now issued from
step-ca's JWK 'admin' provisioner and auto-renewed by a systemd timer on noc1 that writes
the mail-tls secret directly. The secret + Deployment mount + webmail IngressRoute are
unchanged. Re-add a Certificate only if a DNS-01 solver is deployed for step-ca-acme.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 15:55:38 -05:00
Andrew Stoltz
47e2256556 Deploy TtsReader correction bridge images 2026-05-31 12:35:45 -05:00
Andrew Stoltz
9d77f8ba0e fc-updater: disable loki audit sink 2026-05-31 11:34:12 -05:00
Andrew Stoltz
2f4be19c85 fc-updater: bump signing diagnostics image 2026-05-31 00:32:48 -05:00
Andrew Stoltz
2a62c40990 fc-updater: bump image to MSI installer surface 2026-05-30 23:31:48 -05:00
Andrew Stoltz
7be98e5efc Bump UpdateCenter image to hosted-service fix 2026-05-30 20:22:13 -05:00
Andrew Stoltz
a65b356c9d deploy(fc-updater): roll UC to v202605301823-a6c3354 (Phase 3 SQLite fixes)
Durable image bump for FlowerCore.Updater main a6c3354 (PRs #63-#66): hosted-service
+ request-path SQLite DateTimeOffset fixes, StopHost restored + per-tick resilience,
Shared.Settings 1.0.1. Image built + imported to rke2-server. Un-degrades the Phase-9
provenance verifier + settings poll (were stopped under the removed global Ignore mask).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 18:27:45 -05:00
Andrew Stoltz
08c17ef1b4 fc-updater: bump to v202605301703-296f350-fix2 (BackgroundServiceExceptionBehavior=Ignore so a hosted-service SQLite query crash can't stop the host)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 17:04:54 -05:00
Andrew Stoltz
06f2f002b7 fc-updater: bump image to v202605301657-296f350-fix1 (Shared.Settings SQLite poll fix)
The v202605301642-296f350-rework image crash-looped: FlowerCore.Shared.Settings SettingsDbPollHostedService
ran a DateTimeOffset Where/OrderBy on SettingsRecordChanges that SQLite can't
translate, and as a BackgroundService it stopped the host. Shared.Settings 1.0.1
materializes the change-log then filters/orders in memory; Updater Web bumped to 1.0.1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 16:59:37 -05:00
Andrew Stoltz
7ac4a8b4b7 fc-updater: bump image to v202605301642-296f350-rework (ADR-179 rework live)
Deploy the current FlowerCore.Updater main (PRs #52-#61) to prod: MSI-first
packaging, beta gating + per-install tokens, interactive+bearer Authentik OIDC,
native installer apply, and the .fcsetup.exe retirement (DropReleaseInstallers
migration runs on the now-empty DB). Image pre-imported to rke2-server + agent1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 16:47:28 -05:00
Andrew Stoltz
90f2a86819 ops: trim load for degraded 2-node cluster (agent2 PSU dead)
Scale all github-runner deployments to 1 replica and halt the ci1
KubeVirt VM. With agent2 down (failed PSU) the cluster runs on two
passively-cooled NUCs; the ci1 8-vCPU VM drove agent1 to ~100C. Keep
total load trimmed until replacement hardware is in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 13:47:13 -05:00
Andrew Stoltz
cbdefb2b23 Revert "ci1: expose WinRM/RDP/SSH ports on masquerade interface for Phase 2 bootstrap"
The port additions caused the new VMI to stick at phase=Scheduled with
reason=GuestNotRunning. The guest-console-log sidecar exited 1 and
qemu never started. Reverting to the working 9-day-stable shape until
the port-add path is verified in a non-production VM.

Phase 2 (Windows runner install + registration) needs an operator-
interactive virtctl-vnc session against the rebuilt VM, OR a separate
investigation of why this port-add tipped over the VM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 11:35:10 -05:00
Andrew Stoltz
1c36fe3a0a ci1: expose WinRM/RDP/SSH ports on masquerade interface for Phase 2 bootstrap
The Phase 1 VM has been Running for 9 days but Phase 2 (Puppet bootstrap +
runner registration) was deferred because the operator-interactive
virtctl-vnc path was the only way in. The masquerade interface listed
no exposed ports, so virtctl ssh and kubectl port-forward both hit
'no route to host' — qemu user-mode NAT does not forward inbound by
default.

Adding 5985 (WinRM HTTP) lets a kubectl port-forward + PowerShell
remoting path drive runner registration entirely from outside the VM.
3389 + 22 are reserved for desktop access via Guacamole or virtctl ssh
once OpenSSH Server is installed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 11:24:34 -05:00
Andrew Stoltz
2b420ce8a4 runners: fleet-wide right-size CPU requests from 500m to 100m
All 33 runner Deployments now request 100m CPU instead of 500m,
freeing roughly 50 idle pods × 400m = ~20 cores back to the cluster.
Observed CPU usage on idle runners is ~1m via kubectl top; the 500m
request was a 500× over-provision that was eating allocatable CPU
and blocking new workload scheduling — WorldBuilder runner could not
be scheduled even at the new 100m request because the pre-existing
fleet held the cluster at 99% requested.

Burst headroom preserved by limits.cpu: 2000m unchanged. TtsReader
keeps its 8Gi memory limit from the 2026-05-25 OOMKill fix; only
the CPU request line moves.

Recreate strategy on each deployment means a brief offline window
per runner during rollout; in-flight CI jobs complete on the
existing container before the new spec takes effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 10:09:24 -05:00
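The arithmetic in the commit above can be checked with a short sketch. The function name is illustrative; the inputs (roughly 50 idle pods, requests cut from 500m to 100m) are the figures quoted in the commit message.

```python
def freed_cores(pod_count: int, old_request_m: int, new_request_m: int) -> float:
    """Cores released when every pod's CPU request drops.

    Requests are given in millicores; the result is in whole cores.
    """
    if new_request_m > old_request_m:
        raise ValueError("new request must not exceed the old request")
    return pod_count * (old_request_m - new_request_m) / 1000


# ~50 idle pods, 500m -> 100m, as stated in the commit message.
print(freed_cores(50, 500, 100))  # 20.0
```

This only models the sum of requests. It ignores per-node bin packing, so it is an upper bound on what the scheduler can actually place.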
Andrew Stoltz
5cbc1a06b1 runners: scale DM/AiStation.Linux/WorldBuilder down to 1 replica until cluster relieved
After cutting requests to 100m, 4 of 6 new pods scheduled and 2 stayed
Pending — cluster CPU REQUEST utilization is 49.6 of 48 allocatable cores
because the existing fleet of ~50 idle runners reserves 25.6 cores
(500m × ~50) for ~50m actual use. Single-replica per new repo gets the
service online without competing with in-flight CI from the rest of the
fleet.

When the broader fleet-wide request right-sizing pass lands
(500m → 100m on all idle runners would free ~20 cores), these can be
bumped back to 2 replicas if PR-CI backlog warrants it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 10:03:30 -05:00
Andrew Stoltz
9e7ee39b3a runners: drop CPU request 500m→100m on DM/AiStation.Linux/WorldBuilder
All 3 fleet nodes were at 99% CPU REQUEST allocation; the 6 new pods
from the previous commit (3 deployments × 2 replicas × 500m) couldn't
schedule. Idle runners actually use ~1m CPU per `kubectl top pods`;
the 500m request was significantly over-provisioned. Burst headroom
preserved by limits.cpu: 2000m unchanged.

Follow-up: similar request right-sizing pass across the rest of the
runner fleet is queued for a future morning-routine sweep — 25 cores
reserved for ~50m actual use is a large slack we can reclaim cluster-
wide.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 10:00:23 -05:00
Andrew Stoltz
ae030a5f33 runners: add github-runner Deployments for DeviceManagement + AiStation.Linux + WorldBuilder
Morning-routine 2026-05-26 — these three repos had ZERO online Linux PR-CI
capacity, blocking the Sprint 37 Cx-1 Linux-CI-migration PRs (DM #20/#21/
#22, AiStation.Linux #13, WorldBuilder #3/#4). Chicken-and-egg: the
migration PRs need Linux runners that the migration creates.

Each Deployment uses the same canonical emptyDir-only pattern as the
fresh-2026-05-26 updater deployment that lives just above:
  - replicas: 2 (room for parallel PR-CI without head-of-line blocking)
  - per-pod emptyDir caches (no RWO PVC contention)
  - shared github-runner-token secret (existing ACCESS_TOKEN PAT has
    org-wide read access)
  - LABELS: self-hosted,linux,fc-build-linux
  - DOTNET_INSTALL_DIR pinned per ADR-170 family

For AiStation.Linux specifically: Linux job will now pick up; the
Windows job in #13 remains queued indefinitely until the Windows runner
host substrate lands per Sprint 36 v2 Cl-2 / ADR-174 — that's a separate
arc, not this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 09:55:31 -05:00
bc8c35896f tests: add bluejay-ws runner-exclusion lint + fix 3 stale runner-fleet assertions (#30)
BLUEJAY-WS must never be a fleet GHA runner (operator directive 2026-05-26). Build-side analog of Sprint 9 safe-account exclusion. Also fixes 3 stale runner-fleet assertions broken by initContainer addition + replica tuning.
2026-05-26 03:42:01 +00:00
Andrew Stoltz
2cc91b6df0 runners: bump tts-reader memory limit 4Gi -> 8Gi
The github-runner-tts-reader pod was being OOMKilled (exit 137)
mid-`dotnet test` on the TtsReader 1000+ test suite. PR #21 CI
(the Windows -> Linux runner migration) flapped twice with the
"self-hosted runner lost communication" annotation before the
K8s-side symptoms surfaced via kubectl describe pod.

Requests bumped 1Gi -> 2Gi, limits 4Gi -> 8Gi. Comment added
inline so future fleet runs don't trip the same wall.

Unblocks PR #21 + the 9 other open TtsReader PRs that all rebase
through it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 22:31:48 -05:00
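The exit code 137 mentioned in the commit above follows the common convention of 128 plus the signal number. A minimal sketch of that decoding, assuming a POSIX host (the helper name is hypothetical):

```python
import signal


def signal_from_exit_code(exit_code: int) -> str:
    """Name the signal behind a shell-style exit code above 128."""
    if exit_code <= 128:
        raise ValueError("exit codes of 128 or below do not encode a signal")
    return signal.Signals(exit_code - 128).name


print(signal_from_exit_code(137))  # SIGKILL
```

A SIGKILL on its own does not prove an out-of-memory kill; the commit relied on `kubectl describe pod` to confirm the OOMKilled reason.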
0d2090fe81 runners: add github-runner-updater Deployment (#29)
Close runner-fleet gap for FlowerCore.Updater. Matches Sprint 32 long-tail pattern; registers entry in fleet-lint required-set.
2026-05-26 03:24:13 +00:00