bluejay-infra

Author	SHA1	Message	Date
Andrew Stoltz	2f796a2ebd	selenium: migrate hub + 3 nodes + service + ingressroute into ArgoCD Previously orphan kubectl-applied since the Selenium Grid was first set up. The `infra-selenium` ArgoCD app existed but only managed `network-policy.yaml` — the deployments themselves drifted whenever anyone `kubectl set env`'d or `kubectl scale`'d. This commit captures the live state (with the 2026-05-25 maxSessions bump for chrome already baked in) as canonical git source. ArgoCD's ServerSideApply syncPolicy + selfHeal will now keep the grid in lock step with this file. Resources captured: - Service selenium-hub (ClusterIP, internal traffic on 4444) - Service selenium-hub-external (LoadBalancer, MetalLB 10.0.56.208) - Deployment selenium-hub - Deployment selenium-node-chrome (replicas=1, SE_NODE_MAX_SESSIONS=2) - Deployment selenium-node-firefox (replicas=1, maxSessions=1) - Deployment selenium-node-edge (replicas=1, maxSessions=1) - IngressRoute selenium-hub (Traefik, selenium.iamworkin.lan) No live behavior change — server-side dry-run confirms unchanged for hub/firefox/ingressroute, "configured" for hub-external + 3 deploys (default-field reordering only; SSA + field managers handle the diff). Refs: Sprint 33 morning-routine 2026-05-25 follow-up Q-MR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 19:08:55 -05:00
Andrew Stoltz	b92f74b63a	runners: right-size replica counts per 14d CI activity data Drop 2 → 1 for 10 deploys based on trailing-14d run counts: - LlmBridge, Media, Knowledge, Intranet.Web, DNS (0 runs each) - Presentations (6), Redis (3), Provisioning (3), MessageBoard (3), MenuBoard (3) Bump 2 → 3 for Print.Web: 12 runs in trailing 5d, and the help-screenshots AAT job holds a runner 30+ min, creating head-of-line blocking for parallel PRs. Net change: -9 replicas (≈ -9 GiB committed memory). Aligns with Sprint 33 morning-routine capacity audit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 18:55:47 -05:00
Andrew Stoltz	cb7f7dbc4d	authentik: generous startup/liveness probes for first-boot migration The server pod was getting killed by liveness probe at 60s while still waiting on migration DB lock (worker pod also running migrations against same DB). Add startupProbe with 10.5 min budget so liveness doesn't fire until migrations finish. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 16:03:03 -05:00
Andrew Stoltz	03126d5584	authentik: add fsGroup:1000 to server + worker so non-root uid can write /media PermissionError: [Errno 13] Permission denied: '/media/public' in tenant_files migration because Authentik container runs as uid 1000 but Longhorn PVC mounts root:root by default. fsGroup on Pod securityContext recursively chgrps the PVC mount to gid 1000 + chmods g+rwx. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 15:58:35 -05:00
Andrew Stoltz	495e884c41	authentik: initial deployment at id.iamworkin.lan Stack: - PostgreSQL 16 StatefulSet (Longhorn RWO 5Gi) - Redis 7 Deployment (no persistence) - Authentik server + worker (ghcr.io/goauthentik/server:2024.12.3) - Shared media PVC (Longhorn RWO 2Gi) between server+worker - Certificate via step-ca-acme ClusterIssuer - Traefik IngressRoute at id.iamworkin.lan Secrets sourced from 1Password item 'authentik-credentials' (IAmWorkin vault, id y6i74ch22q5wvm7znquq4nhhcu) via OnePasswordItem CRD. Fields: AUTHENTIK_SECRET_KEY, POSTGRES_PASSWORD, REDIS_PASSWORD, BOOTSTRAP_ADMIN_PASSWORD, BOOTSTRAP_ADMIN_TOKEN, BOOTSTRAP_ADMIN_EMAIL. DNS A record id.iamworkin.lan -> 10.0.56.200 added via scripts/pfsense-add-id-host.py (FlowerCore.DNS service was 502'ing on pfSense diag_command.php response parsing). Closes the immediate gap from PiManager OIDC Cohort 3 wire-up: PiManager (a87cd6f) configures id.iamworkin.lan as JWT authority but the backend was never deployed. Pirelay specifically is on Mode:apikey until this backend is bootstrapped and a pimanager service-account exists. Post-deploy bootstrap (manual once pods Ready): 1. Login at https://id.iamworkin.lan/if/admin/ as akadmin using BOOTSTRAP_ADMIN_PASSWORD from 1Password. 2. Create OAuth2/OpenID Provider for pimanager (issuer https://id.iamworkin.lan/application/o/pimanager/, audience 'pimanager'). 3. Create Application binding the provider. 4. Create service account user 'pimanager-service-account', generate long-lived token, store in 1Password as 'pimanager-service-account'. 5. Re-enable jwt mode on pirelay + un-mask puppet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 15:50:10 -05:00
Andrew Stoltz	65aa1e6104	fix(monitoring): point probe-printweb at /health (Q-MR-90) Root path requires API key auth — `/` returned 401 to the blackbox probe, firing PrintWebDown despite `/health` reporting Healthy. Pattern: feedback_k8s_probes_behind_auth_middleware. Mirrors FlowerCore.Notes scripts/monitoring/prometheus.yml. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 14:52:02 -05:00
Andrew Stoltz	7f2a3b76b4	feat(github-runner): bake Ruby 3.3 into Linux self-hosted runner image (Q-MR-81)	2026-05-20 11:45:43 -05:00
Andrew Stoltz	25ace30a03	fix(fc-devicemgmt): remove self-referential Application resource (Q-MR-79)	2026-05-20 11:18:25 -05:00
Andrew Stoltz	ca574c2280	brochure: delete apps/brochure/ — full prune per operator decision 2026-05-19 Removes the apps/brochure/ directory entirely from the bluejay-infra ApplicationSet glob. ArgoCD will: 1. See infra-brochure has no git source -> mark for delete 2. Prune the brochure namespace + Deployment + Service + Certificate + Secret + IngressRoute (all generated from the now-gone apps/brochure/brochure.yaml) 3. Remove the infra-brochure Application from argocd ns Operator decision 2026-05-19 (follow-up to `09387f9` ARCHIVED banner commit): "Yes, prune argo for brochure. Probably fully deleted there." The brochure subdomain project was a planning-chain misinterpretation of "make TtsReader + AI Station production-ready" — see memory/project_brochure_split_misinterpretation_archived_2026_05_19.md in FlowerCore.Notes for the full decision record. Reusable artifacts that were the operator's archive concern stay alive in their actual homes: - FlowerCore.Intranet.Web PR #8 content-NuGet carve-out: still in Intranet's master, may transfer to TtsReader / AI Station prod work - Sprint 32 Cl-5 substrate (public-twin design ideas): SUPERSEDED banner in-place in FlowerCore.Notes docs/standards/, history preserved - magpie-doc-writer + wren-walkthrough skill output: unchanged in Intranet's flowercore-whats-new/walkthroughs/galleries directories Companion Notes-side commit updates the "scaled to 0 + ARCHIVED banner" language in mvp-readiness.html + fleet-roadmap-2026-05-19-sprint36-v2.md + memory record to reflect full deletion instead. Wrong-codebase image localhost/fc-brochure-web:v20260524-sprint32 is being removed from rke2-server / rke2-agent1 / rke2-agent2 in a follow-up step (reclaims ~800MB per node). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:42:30 -05:00
Andrew Stoltz	09387f90e1	brochure: ARCHIVED 2026-05-19 — was a misinterpretation, do not re-enable The brochure split project was a misinterpretation of an operator request to make TtsReader + AI Station production-ready. Somewhere in the planning chain it spun up into a separate "showcase brochure product" with its own host, repo, NuGet, and Codex pack — none of which the operator actually wanted. The project itself is pointless and a waste of credits. Archive (not delete) per operator decision 2026-05-19, because some work shipped under the misinterpretation may still have reusable value: - FlowerCore.Intranet.Web PR #8 (merged) introduced FlowerCore.Brochure.Content content-NuGet carve-out — pattern may apply to TtsReader/AiStation production polish. - Sprint 32 Cl-5 substrate has design ideas for public-twin vs operator-host separation that may transfer. - magpie-doc-writer / wren-walkthrough skills still author useful Intranet content — those skills stay active. These manifests stay at replicas: 0 for ArgoCD continuity. Cleanup options (move out of apps/* glob, or delete entirely) are documented in README.md for an operator-explicit future call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:34:28 -05:00
Andrew Stoltz	e641ceab48	monitoring(irc-notify): criticals also batch hourly — fix per-fire spam The first batching pass (`bacac06`) left critical-severity alerts on the immediate-print path. That's still per-event spam for any persistent critical (e.g. PrintPaperRollCritical fires every 30s Grafana evaluation cycle when paper is <5%). Caught immediately after deploy: CUPS queue grew 0 → 8 jobs in 8 minutes from a single firing PrintPaperRollCritical. This commit aligns with the operator's verbatim ask ("one alert an hour"): - Critical-severity alerts now go into the digest buffer, NOT the immediate-print path. The digest payload already shows severity tags per alertname, so the operator still sees "[critical] X" in the printout. - The explicit `alert_channel=thermal_print_immediate` label still bypasses batching, but only on NEW fingerprint arrival — it triggers a flush of the CURRENT digest (with the new alert included), then clears. Repeat webhooks for the same fingerprint dedupe in the buffer until the next hourly tick OR until the alert resolves. No fingerprint can spam. - `add_to_digest` now returns bool (True = buffer grew, False = dedup / resolution / disabled) so the immediate-label path can flush only on state transitions. Net effect: max 1 thermal print per BATCH_INTERVAL_MIN per alert fingerprint, regardless of severity. Rules that genuinely need same-second paper opt in via `alert_channel=thermal_print_immediate` (currently zero rules use this). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:22:25 -05:00
Andrew Stoltz	c263426ea5	fc-devicemgmt: operator image fix + Web scaled to 0 OPERATOR (PodCrashLoopBackOff cleared): - Bumped image to v20260519-sp34cl3-fix (built from astoltz/FlowerCore.DeviceManagement@d9a3685 after Sprint 34 Cl-3 stranded branch was merged via PR #19 squash). - The v20260512-cx5 image was the broken Sprint 8 scaffold: generic Host builder, no kubeops, no Kestrel on :8080, no AddController chain. Readiness probe dial-tcp 8080 failed every restart. - The new image ships the AddController chain for all 4 reconcilers (DeviceCrd / DeviceGroupCrd / DevicePolicyCrd / RemoteCommandCrd) plus Kestrel on :8080 and /healthz. - Image saved + scp'd + ctr-imported on rke2-server / rke2-agent1 / rke2-agent2 before this commit. SHA256: 2cc79ee0a2313c550268d1244f805ae41b396362148dd5603061cc15b6f7fa7e WEB (DeploymentReplicasMismatch cleared via scale-to-0): - Web pod cannot start. Two upstream gaps must close first: 1) MySQL DB instance + user `fc_devicemgmt` / database `flowercore_devicemgmt` are not provisioned in fc-mysql. Cluster has zero MySqlInstanceCrds and no `mysql.fc-mysql.svc:3306` Service. 2) 1Password vault item `IAmWorkin/FlowerCore DeviceManagement Runtime` is missing (5 fields: DB-Password + 4 mTLS PEMs). OnePasswordItem CRD has been stuck Ready=False since 2026-05-18T02:58. - Same pattern as the brochure-web scale-to-0 in `914fed0` — make the cluster clean and quiet, let operator restart deploy on a real schedule. Re-enable path is fully documented in the deployment-web.yaml header comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 10:11:09 -05:00
Andrew Stoltz	bacac067cf	monitoring(irc-notify): hourly digest batching for thermal printer The thermal printer drained overnight (2026-05-18/19) because the old notify.py POSTed one print job per Grafana webhook fire. With 9 concurrently-firing alerts (zabbix-postgres + fc-devicemgmt + brochure + PrintPaperRollLow), every evaluation cycle stamped fresh CUPS jobs onto the queue until the operator physically powered the printer off. This refactor: - Adds env-var config: THERMAL_PRINT_ENABLED (master kill switch), BATCH_INTERVAL_MIN (default 60), BATCH_MAX_PENDING (default 50). - IRC delivery stays per-event (operator wants the live stream). - Thermal routing now: * critical/disaster/page severity OR alert_channel=thermal_print_immediate -> print immediately * alert_channel=thermal_print -> enqueue into hourly digest * RESOLVED -> remove from digest buffer (no resolution-spam prints) * else -> IRC only, no thermal - Background digest_loop thread flushes the buffer hourly (or sooner if buffer hits BATCH_MAX_PENDING). Digest payload is a single Print.Web /api/print/alert POST listing distinct alertnames + per-rule target counts. - New POST /flush endpoint (manual operator force-flush; useful for testing without waiting an hour). - GET / returns config + buffer depth + per-stat counters for observability. Net effect: max 1 thermal print per BATCH_INTERVAL_MIN for batched warnings, plus immediate prints for criticals. Closes the 2026-05-18/19 alert-storm incident. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 09:56:14 -05:00
bluejay	914fed08d8	fix(brochure): scale brochure-web to 0 — wrong codebase shipped (Intranet.Web binary in fc-brochure-web image, CrashLoopBackOff 296 restarts on /data read-only). Re-enable after Sprint 34 Cx-3 rebuild per docs/ai-agents/codex-prompts/2026-05-18-fc-brochure-web-rebuild-pack.md	2026-05-19 14:45:01 +00:00
Andrew Stoltz	200aeab032	ttsreader: deploy study mode repair image	2026-05-18 16:33:08 -05:00
Andrew Stoltz	8182616d4c	ttsreader: point render piper to edge1 demo endpoint	2026-05-18 16:06:37 -05:00
Andrew Stoltz	f0862ac03c	ttsreader: deploy sprint36 demo audio image	2026-05-18 16:04:59 -05:00
Andrew Stoltz	46c392605e	monitoring: mirror PuppetServiceFailed alert from Notes (Sprint 33 Cx-7 Phase B) Mirrors the live `puppet` alert group from FlowerCore.Notes/scripts/monitoring/alerts.yml into the K8s ConfigMap so a future in-cluster Prometheus inherits the ruleset automatically. Source of truth remains the Notes file (live Podman Prometheus on noc1). See feedback_monitoring_k8s_target_vs_live_podman. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:11:07 -05:00
bluejay	d7238a5e3b	feat(brochure): add public brochure GitOps app (#13)	2026-05-18 04:52:37 +00:00
bluejay	fc444a02a1	feat(chat): add public twin ingress (#11)	2026-05-18 04:52:20 +00:00
bluejay	83d4883d55	feat(worldbuilder): pin k8s demo to fake backend (#10)	2026-05-18 04:52:11 +00:00
bluejay	f8fe3b2688	feat(github-runner): add final long-tail runners (#9)	2026-05-18 04:52:01 +00:00
bluejay	f2ab892ebc	feat(github-runner): add Marquee + TtsReader per-repo runners (#8)	2026-05-18 03:27:14 +00:00
bluejay	fef68a9560	feat(fc-devicemgmt): add Kubernetes deployment manifests (#1) Sprint 8 IMPL lane Cx-5: fc-devicemgmt K8s manifests (rebased onto main 2026-05-18; 13 files, +944). Namespace + Web Deployment (replicas:2, MySQL backend) + Operator Deployment (replicas:1, KubeOps leader-elect) + Service + Certificate (step-ca-acme ClusterIssuer) + Traefik IngressRoute (devices.iamworkin.lan internal) + ServiceAccount + ClusterRole + ClusterRoleBinding + NetworkPolicy (CNI DNAT-aware backend ports) + OnePasswordItem (5-field consolidated) + ArgoCD Application bootstrap shape + lint coverage. Follow-ups (not merge blockers): - localhost/fc-devicemgmt-{web,operator}:v20260512-cx5 must be imported to all 3 RKE2 nodes; pods will ErrImageNeverPull until imported. - 1Password vault item 'FlowerCore DeviceManagement Runtime' must be created with 5 fields before pods can start. - DNS devices.iamworkin.lan -> 10.0.56.200 already present. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 02:56:23 +00:00
Andrew Stoltz	6fe77225ae	fix(github-runner): dedupe DOTNET_INSTALL_DIR+NUGET_PACKAGES on base+sharedpos PR #5 rebase concatenated PR #5 env additions onto PR #7 env additions on the base + sharedpos Deployments, producing duplicate-key validation errors in ArgoCD's structured merge. The DOTNET_INSTALL_DIR and NUGET_PACKAGES values are identical between PR #5 and PR #7; keep the PR #7 originals and retain only the unique new env vars from PR #5 (DOTNET_CLI_TELEMETRY_OPTOUT, DOTNET_NOLOGO, DOTNET_GENERATE_ASPNET_CERTIFICATE). No behavioral change — same final env var set, no duplicates.	2026-05-17 21:53:05 -05:00
bluejay	634b9c4169	feat(github-runner): harden Linux runner fleet (#5)	2026-05-18 02:51:02 +00:00
bluejay	b8c7e59005	Tighten Pi signage HDMI settle coverage (#3)	2026-05-18 02:35:17 +00:00
bluejay	65ac8d6f01	feat(github-runner): pod-env DOTNET_INSTALL_DIR + initContainer for non-root runner (#7)	2026-05-18 02:25:18 +00:00
bluejay	b1e307151e	chore(github-runner): un-park github-runner-sharedpos (replicas 1) after Shared.Pos CI fix merged	2026-05-17 21:54:16 +00:00
bluejay	12b07219c7	chore(github-runner): park github-runner-sharedpos (replicas 0) until Cx-1 non-root fix Shared.Pos build fails on non-root runner (setup-dotnet /usr/share/dotnet denied); churning runner drove HighCPU on rke2-agent2. Re-enable in the Sprint 30+ Cx-1 Linux-runner-fleet lane (DOTNET_INSTALL_DIR on pod env).	2026-05-17 21:50:35 +00:00
bluejay	9fd32c4415	feat(monitoring): MacMiniRunnerOffline alert (Sprint 28 reconcile)	2026-05-17 19:50:29 +00:00
bluejay	ad670fb344	feat(github-runner): add Shared.Pos repo-scoped Linux runner (unstick stuck publish)	2026-05-17 19:50:23 +00:00
Codex	6f6ca50987	fix(github-runner): switch RUNNER_TOKEN -> ACCESS_TOKEN; set RUN_AS_ROOT=false Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 22:08:56 +00:00
Codex	c7be58c1f7	chore(github-runner): bump replicas 0 -> 1 (PAT provisioned) Operator provisioned GitHub PAT (Runner Registration) 1P item. OnePasswordItem CRD already materialized the secret. Unblocks CI fleet-wide. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 22:04:03 +00:00
Codex	710340d8be	chore(github-runner): rename 1P item to GitHub PAT (Runner Registration) Renames the OnePasswordItem.itemPath from "GitHub Runner Registration Token" to "GitHub PAT (Runner Registration)" so the runner 1P entry sits next to its siblings — GitHub PAT (Gitea Mirrors) and GitHub PAT (NuGet Packages) — under a consistent "GitHub PAT (...)" naming pattern and API_CREDENTIAL category. Existing field "credential" remains the consumer (RUNNER_TOKEN env). Comment block clarified to require Administration:read/write fine-grained PAT scope on target repos. Old 1P item renamed to "[DEPRECATED 2026-05-16] GitHub Runner Registration" — kept as recovery backup; can be hard-deleted after the first successful runner pod start against the new item path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 10:27:58 -05:00
Andrew Stoltz	7d2daaa4f8	chore(github-runner): replicas 1 → 0 until 1Password token provisioned github-runner-token OnePasswordItem exists but the underlying 1Password vault item hasn't been created yet, so the operator can't mint the K8s Secret. Pod stuck in CreateContainerConfigError → DeploymentReplicasMismatch alert fires. Scaling to 0 keeps the manifest infrastructure intact but stops trying to schedule until operator: 1. Creates "GitHub Runner Registration Token" item in IAmWorkin vault 2. Generates a token at github.com/astoltz/<repo>/settings/actions/runners/new 3. Updates the OnePasswordItem itemPath to point at it 4. Bumps replicas back to 1 via PR	2026-05-15 16:18:19 -05:00
Andrew Stoltz	e50e103ba0	fix(zabbix): bump web probe timeouts 5s→15s + add failureThreshold zabbix-web nginx+PHP-FPM container serves / at ~3-5s baseline with occasional 6-7s spikes (probe path renders full dashboard via PHP). kube-probe was killing the container after 3 consecutive 5s-timeout 499s, producing CrashLoopBackOff alert noise even though the app was serving real traffic fine. 15s timeout absorbs the natural variance; explicit failureThreshold=3 documents the policy (was implicit default). Closes the firing PodCrashLoopBackOff (zabbix-web) + pending HTTPServiceSlow/HTTPServiceDegraded alerts. zabbix.iamworkin.lan remains slow at the application layer (separate work — PHP-FPM warm-up + Zabbix server "host not found" agent lookup spam need their own fixes) but the pod restart loop stops.	2026-05-15 15:59:04 -05:00
Codex	e8094eb0bd	ci(github-runner): add Phase 2 ephemeral Linux runner K8s manifest Namespace github-runner with myoung34/github-runner:latest Deployment, 5Gi Longhorn RWO NuGet cache PVC, zero-privilege ServiceAccount, and OnePasswordItem CRD for the registration token. EPHEMERAL=true mode re-registers after each job; Recreate strategy avoids RWO multi-attach. Targets fc-build-linux label; single replica pinned to rke2-server node. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-14 12:46:25 -05:00
bluejay	8d87d9172c	Add Pi signage Phase 1 player artifacts Squash merge Sprint 14 Pi signage player artifacts.	2026-05-14 01:46:09 +00:00
Codex	cfd9743afa	Add Apple TV signage docs manifest	2026-05-13 20:32:48 -05:00
Andrew Stoltz	5029e209cd	kubevirt-vms: boot ci1 from server template	2026-05-12 16:58:18 -05:00
Codex	f298339152	fix(guacamole): add --- separator between macmini-vnc-creds OnePasswordItem and guacamole-branding ConfigMap Missing document separator caused YAML to merge the OnePasswordItem's top-level `spec: itemPath:` block into the ConfigMap that follows. Result: a ConfigMap with a `.spec` field whose K8s schema does not declare one, triggering ArgoCD's structured-merge diff to fail since 2026-05-11T15:30:54Z: Failed to compare desired state to live state: failed to calculate diff: error calculating structured merge diff: error building typed value from config resource: .spec: field not declared in schema App stayed Healthy (live K8s tolerated the extra field — ConfigMap ignored it) but ArgoCD's diff calc was broken, leaving the app stuck at sync=Unknown for all 21 resources. Adding the missing `---` separator makes the OnePasswordItem and ConfigMap proper sibling YAML documents, each with its own kind-correct schema. Diagnosed during 2026-05-12 morning routine. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 09:26:03 -05:00
Codex	6e7d88db49	feat(fc-redis): add SignalR backplane for cross-product event bus (Q-SO-1 Phase A) Per Q-SO-1 operator resolution 2026-05-11 PM, Redis SignalR backplane lands in Phase A (was Phase C deferral). Treats Redis as a managed FC infrastructure component, not a deferred scaling escalation. Lands the minimal Phase A surface: - Namespace fc-redis - Single Redis 7-alpine pod with 1Gi Longhorn RWO PVC - ConfigMap with AOF persistence (everysec), 256Mi maxmemory, allkeys-lru - ClusterIP Service `redis.fc-redis.svc.cluster.local:6379` (in-cluster only) - No AUTH Phase A (Phase B add via 1Password Connect rotation) - No IngressRoute (backplane is server-to-server) Consumers (Phase A IMPL across FC services) add: services.AddSignalR().AddStackExchangeRedis( "redis.fc-redis.svc.cluster.local:6379", opts => opts.Configuration.ChannelPrefix = StackExchange.Redis.RedisChannel.Literal("fc-opsconsole")); Phase B/C follow-ons (not in this commit): Sentinel for HA, AUTH password from 1Password, redis_exporter sidecar for Prometheus, network policies. See FlowerCore.Notes/docs/signage/operations-console-phase-2-design.md section 3.5 (rewritten) and decisions-waiting.html Q-SO-1 (RESOLVED). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 19:02:58 -05:00
Codex	5ae50bd491	fix(telephony): init container runs as root to chown hostPath /tmp/tts-audio The fix-data-perms init container chowns /data (PVC) and /shared-tts (hostPath /tmp/tts-audio on rke2-agent1) to uid 1654 so the non-root telephony-web app can write Piper TTS .sln16 files. Without an explicit container-level securityContext override, the init container inherits pod-level runAsNonRoot:true / runAsUser:1654 and fails with 'chown: /shared-tts: Operation not permitted' the first time the hostPath comes up root-owned after a node reboot. Outage 2026-05-11 23:00 UTC: telephony-web in Init:CrashLoopBackOff for 9 hours (100+ restarts) until init container was bumped to runAsUser:0. Live cluster patched in the same operation; this commit makes the fix durable in git so ArgoCD sync preserves it. See Notes memory: feedback_hostpath_initcontainer_chown_perms Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 18:37:15 -05:00
Codex	653d4472f5	fix(monitoring): mirror Q-MR-3 MultusMemoryPressure + NamespacePendingPodBacklog alerts Two new preventive alert rules added to the kubernetes-state group of the K8s migration target ConfigMap. The live Podman Prometheus on noc1 has already been updated via FlowerCore.Notes/scripts/monitoring/alerts.yml + sudo cp + podman pod restart monitoring (this commit only locks it in the bluejay-infra K8s mirror so a future migration carries it forward). MultusMemoryPressure (critical, thermal_print): fires when kube-multus working set exceeds 80% of its memory limit for 5m. Catches the next multus OOM cascade BEFORE it kills the daemon cluster-wide. The 2026-05-10 21h outage hit because no alert fired on the rising multus working set; only downstream blackbox / Traefik / service alerts triggered, after the fact. NamespacePendingPodBacklog (warning): fires when any single namespace has >25 Pending pods sustained for 30m. Catches the operator-leak avalanche pattern (orphan pods from a crashed reconciler emitting children without ownerReferences) before it cascades into a CNI OOM. See FlowerCore.Notes: - feedback_multus_50mi_limit_oom_orphan_pod_avalanche - feedback_monitoring_k8s_target_vs_live_podman (workflow) Companion commits: - bluejay-infra@eb8693e (multus memory limit) - FlowerCore.RemoteDesktop@b02c59b (OwnerReferences fix) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 10:42:27 -05:00
Codex	eb8693e1ce	fix(multus): bump kube-multus-ds memory 50Mi/50Mi -> 1Gi/512Mi (prevent OOM cascade) Cluster outage 2026-05-10T17:43 through 2026-05-11 ~10:30 (~21h). Root cause: FlowerCore.RemoteDesktop emitted 219 orphan rd-browser-only-* pods in fc-desktop (missing OwnerReferences — see companion fix in FlowerCore.RemoteDesktop). Kubelet's continuous CNI ADD retries for those pending pods drove a request queue that exceeded the upstream default 50Mi limit on kube-multus-ds. Multus OOMKilled (exit 137), restarted with an even bigger backlog, OOMKilled again, positive feedback loop. Restart counts climbed to 276 / 412 / 261 across the 3 RKE2 nodes. Downstream blast radius: both Traefik pods stuck ContainerCreating (101m + 4h35m), all Longhorn CSI attacher/provisioner/instance-manager stuck, every Prometheus blackbox probe for *.iamworkin.lan failing, UpdateCenterPublicEdgeDown critical on update.flowercore.io, every ArgoCD app showing sync=Unknown because repo-server lost git connectivity. 45 firing Prometheus alerts. Recovery sequence (Q-MR-1 from FlowerCore.Notes morning routine): 1. kubectl patch kube-multus-ds memory live (this commit locks it in git so ArgoCD doesn't revert on next sync) 2. Force-delete the 219 orphan pods (kubectl --grace-period=0 --force) to break the avalanche 3. Rollout restart kube-multus-ds — STABLE after restart with new limit 4. Restart Traefik + Longhorn CSI to clear stuck ContainerCreating 5. Verify update.flowercore.io returns 200 + ArgoCD apps reconcile Tested incrementally: 256Mi limit was insufficient (still OOMed on catchup burst), 512Mi was insufficient on rke2-agent1 (most pods concentrated there), 1Gi/512Mi handled the full 200+ pending pod CNI catchup cleanly with 0 multus restarts after rollout. Nodes are 64GB with <25% used in steady-state, so the ~256Mi typical working-set is well within the new limit. Companion change: FlowerCore.RemoteDesktop must set OwnerReferences on every worker pod so future operator crashes don't leak orphans (Q-MR-2). Preventive alerts (Q-MR-3) MultusMemoryPressure + NamespacePendingPodBacklog are coming in a follow-up commit to apps/monitoring/. Memory: feedback_multus_50mi_limit_oom_orphan_pod_avalanche Decisions card: docs/dashboards/decisions-waiting.html Q-MR-1..3 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 10:30:05 -05:00
Codex	667777a653	revert(ci1): back to cdrom:scsi (virtio-blk disk hit QEMU flock) The virtio-blk disk swap (commit `84c9feb`) didn't help: qemu fails to acquire the write lock on the rootdisk PVC because the previous launcher's qemu process didn't release it cleanly. Same family of bug as the "stale QEMU flock" already documented in feedback_kubevirt_iso_first_install_bootorder_and_runstrategy, but now triggered on rke2-agent1 instead of agent2. OVMF cdrom timeout is the real blocker and remains open: - ✅ Distribution pipeline (build → save → scp → ctr import on all 3 RKE2 nodes) is proven. localhost/win-server-2025:1.0 lives in each node's containerd k8s.io namespace. - ✅ containerDisk + cdrom:scsi gets qemu domain Running (no NFS Permission denied, no rootdisk flock). - ❌ OVMF BdsDxe times out reading the SCSI cdrom regardless of SecureBoot setting and bus type. Reverting the disk type to cdrom:scsi so the VM lands back on the "qemu Running, OVMF stuck at Boot Manager" state — known-stable and easier to attack than the QEMU-flock state we hit by trying virtio-blk disk. Operator decision for next architectural step (one of): - Custom OVMF firmware build with longer Boot0001 timeout - KubeVirt version bump (v1.5+ has OVMF fixes) - Hyper-V/VirtualBox install + export VHD to ci1 - BIOS legacy boot (Win Server 2025 needs UEFI but install media has a BIOS path) - DataVolume HTTP datasource (CDI internalizes ISO bytes via different code path) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 21:35:00 -05:00
Codex	84c9feb893	fix(ci1): present ISO as virtio-blk disk instead of cdrom OVMF BdsDxe "starting Boot0001 ... Time out" persists across: - SATA cdrom + Longhorn Filesystem PVC (Path A) - SATA cdrom + Synology NFS (Path B failed: storage perms) - SCSI cdrom + Longhorn (Path B variant) - SCSI cdrom + containerDisk tmpfs (Path C) - + SecureBoot=false That rules out: storage IO speed, cdrom bus type, signature verification. Remaining cause is deeper in qemu's cdrom device emulation under KubeVirt v1.4.0's OVMF firmware — the cdrom read window for OVMF's first-sector probe is too short to satisfy from the cdrom controller path regardless of bus type. Workaround: present the ISO bytes as a regular virtio-blk DISK (not a cdrom). UEFI/OVMF still recognizes ISO9660 + El Torito boot records on any block device, so it can find and boot the EFI bootloader the same way it would from a USB stick. virtio-blk has a different read path that doesn't hit the cdrom-specific timeout. This also better aligns with the FlowerCore.Distribution USB-key pattern: ISO bytes on a block device, UEFI boots from the El Torito boot record, Windows installer takes over. The autounattend ConfigMap (ci1-autounattend) drives unattended Windows setup once the installer kicks off. The containerDisk OCI image (localhost/win-server-2025:1.0) remains unchanged — only the disk type in the VM spec changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 21:29:59 -05:00
Codex	427dbfcef2	[uc] Phase 1 auth gate deploy v20260509-4162dca-authgate	2026-05-08 21:16:54 -05:00
Codex	b651a4e2d0	fix(ci1): disable SecureBoot to allow OVMF to boot Windows ISO containerDisk delivery (commit `b998f50`) successfully gave qemu fast in-memory access to the ISO bytes (no NFS denial, no Longhorn read latency), but OVMF's BdsDxe still timed out: BdsDxe: loading Boot0001 "UEFI QEMU QEMU CD-ROM " from PciRoot(0x0)/Pci(0x2,0x4)/Pci(0x0,0x0)/Scsi(0x0,0x0) BdsDxe: starting Boot0001 ... Time out That rules out storage IO speed and bus type as causes (already tested both sata and scsi against both Longhorn-PVC and tmpfs-backed containerDisk). Remaining likely cause: SecureBoot signature verification on the ISO's EFI bootloader. KubeVirt's stock `/usr/share/OVMF/OVMF_VARS.secboot.fd` doesn't appear to ship with the Microsoft KEK/DB enrolled by default, so signed Windows EFI bootloaders fail the trust-chain check and OVMF reports a generic "Time out" rather than a verification failure. Disabling SecureBoot lets OVMF skip the chain check entirely and boot the El Torito EFI image. SMM stays enabled (KubeVirt only requires it WITH SecureBoot, not the inverse). TPM 2.0 emulation also stays on (`tpm: {}`), so BitLocker, Hyper-V, and WSL2 still work in the guest. This is acceptable for a CI runner. Long-term path back to SecureBoot: 1. Custom-build OVMF_VARS.fd with Microsoft KEK/DB pre-enrolled 2. Mount via firmware.bootloader.efi.persistent 3. secureBoot: true Tracked as a Phase 2 hardening task once the runner is doing real work and we want signed-boot guarantees. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 21:06:18 -05:00
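
Illustration for commits bacac067cf and e641ceab48 (irc-notify digest batching). The sketch below is not the notify.py from the repository. It is a minimal, hypothetical reconstruction of the behavior those two commit messages describe: a fingerprint-keyed buffer, an `add_to_digest` that returns True only when the buffer grew, and an early flush when a new fingerprint carries `alert_channel=thermal_print_immediate`. The class name `DigestBuffer`, the `handle_webhook` helper, the `_printed` set used to keep deduplicating after an early flush, and the flat alert dictionary shape are all assumptions made for this example.

```python
import threading
from dataclasses import dataclass, field

# Severities and channels that route to the thermal digest (per the commit text).
CRITICAL = {"critical", "disaster", "page"}
THERMAL_CHANNELS = {"thermal_print", "thermal_print_immediate"}


@dataclass
class DigestBuffer:
    """Hypothetical fingerprint-keyed buffer of alerts awaiting a digest print."""
    enabled: bool = True      # stands in for THERMAL_PRINT_ENABLED
    max_pending: int = 50     # stands in for BATCH_MAX_PENDING
    _pending: dict = field(default_factory=dict, init=False, repr=False)
    _printed: set = field(default_factory=set, init=False, repr=False)
    _lock: threading.Lock = field(default_factory=threading.Lock, init=False,
                                  repr=False, compare=False)

    def add_to_digest(self, fingerprint, alertname, severity, status="firing"):
        """Return True only if the buffer grew (a state transition)."""
        if not self.enabled:
            return False
        with self._lock:
            if status == "resolved":
                # Resolutions remove the alert and never print.
                self._pending.pop(fingerprint, None)
                self._printed.discard(fingerprint)
                return False
            if fingerprint in self._pending or fingerprint in self._printed:
                return False  # dedup: already buffered or printed this interval
            self._pending[fingerprint] = (alertname, severity)
            return True

    def is_full(self):
        with self._lock:
            return len(self._pending) >= self.max_pending

    def flush(self, early=False):
        """Drain the buffer into one digest: alertname -> (severity, targets)."""
        with self._lock:
            drained = dict(self._pending)
            self._pending.clear()
            if early:
                self._printed.update(drained)  # keep deduping until the tick
            else:
                self._printed.clear()          # interval tick starts a new window
        digest = {}
        for alertname, severity in drained.values():
            count = digest.get(alertname, (severity, 0))[1]
            digest[alertname] = (severity, count + 1)
        return digest


def handle_webhook(buf, alert, print_digest):
    """Route one alert; return True if a thermal print was triggered."""
    channel = alert.get("alert_channel")
    severity = alert.get("severity", "warning")
    if severity not in CRITICAL and channel not in THERMAL_CHANNELS:
        return False  # IRC only, no thermal output
    grew = buf.add_to_digest(alert["fingerprint"], alert["alertname"],
                             severity, alert.get("status", "firing"))
    if grew and (channel == "thermal_print_immediate" or buf.is_full()):
        print_digest(buf.flush(early=True))
        return True
    return False
```

In a real service a background thread would call `buf.flush()` once per BATCH_INTERVAL_MIN and post the result to the printer; that loop is left out here so the routing logic stays easy to exercise. Returning a bool from `add_to_digest` is what lets the immediate path print only on a new fingerprint rather than on every repeated webhook.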

402 Commits
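
Illustration for commit cb7f7dbc4d (authentik startup probe). The commit message states a 10.5 minute startup budget but does not list the probe fields, so the values below are hypothetical. The sketch only shows the usual arithmetic for a Kubernetes startupProbe: the kubelet keeps probing for roughly `failureThreshold * periodSeconds` (after any `initialDelaySeconds`) before it restarts the container, and liveness checks do not begin until the startup probe has succeeded.

```python
def startup_budget_seconds(period_seconds, failure_threshold, initial_delay_seconds=0):
    """Approximate worst-case seconds a container may spend starting up."""
    return initial_delay_seconds + period_seconds * failure_threshold


# Hypothetical values chosen only because they add up to the stated budget;
# the actual manifest may use a different combination.
budget = startup_budget_seconds(period_seconds=10, failure_threshold=63)
print(budget, "seconds =", budget / 60, "minutes")
```

Any pair of values with the same product gives the same budget; a shorter period with a higher threshold simply notices a finished migration sooner.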