bluejay-infra

Author	SHA1	Message	Date
Codex	a0f8fd1790	chore(github-runner): gate token provisioning	2026-05-15 17:32:14 -05:00
Andrew Stoltz	7d2daaa4f8	chore(github-runner): replicas 1 → 0 until 1Password token provisioned. The github-runner-token OnePasswordItem exists, but the underlying 1Password vault item hasn't been created yet, so the 1Password operator can't mint the K8s Secret. The pod is stuck in CreateContainerConfigError, so the DeploymentReplicasMismatch alert fires. Scaling to 0 keeps the manifest infrastructure intact but stops scheduling attempts until a human operator: 1. Creates the "GitHub Runner Registration Token" item in the IAmWorkin vault; 2. Generates a token at github.com/astoltz/<repo>/settings/actions/runners/new; 3. Updates the OnePasswordItem itemPath to point at it; 4. Bumps replicas back to 1 via PR.	2026-05-15 16:18:19 -05:00
Andrew Stoltz	e50e103ba0	fix(zabbix): bump web probe timeouts 5s→15s + add failureThreshold zabbix-web nginx+PHP-FPM container serves / at ~3-5s baseline with occasional 6-7s spikes (probe path renders full dashboard via PHP). kube-probe was killing the container after 3 consecutive 5s-timeout 499s, producing CrashLoopBackOff alert noise even though the app was serving real traffic fine. 15s timeout absorbs the natural variance; explicit failureThreshold=3 documents the policy (was implicit default). Closes the firing PodCrashLoopBackOff (zabbix-web) + pending HTTPServiceSlow/HTTPServiceDegraded alerts. zabbix.iamworkin.lan remains slow at the application layer (separate work — PHP-FPM warm-up + Zabbix server "host not found" agent lookup spam need their own fixes) but the pod restart loop stops.	2026-05-15 15:59:04 -05:00
Codex	e8094eb0bd	ci(github-runner): add Phase 2 ephemeral Linux runner K8s manifest Namespace github-runner with myoung34/github-runner:latest Deployment, 5Gi Longhorn RWO NuGet cache PVC, zero-privilege ServiceAccount, and OnePasswordItem CRD for the registration token. EPHEMERAL=true mode re-registers after each job; Recreate strategy avoids RWO multi-attach. Targets fc-build-linux label; single replica pinned to rke2-server node. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-14 12:46:25 -05:00
bluejay	8d87d9172c	Add Pi signage Phase 1 player artifacts. Squash merge of the Sprint 14 Pi signage player artifacts.	2026-05-14 01:46:09 +00:00
Codex	cfd9743afa	Add Apple TV signage docs manifest	2026-05-13 20:32:48 -05:00
Andrew Stoltz	5029e209cd	kubevirt-vms: boot ci1 from server template	2026-05-12 16:58:18 -05:00
Codex	f298339152	fix(guacamole): add --- separator between macmini-vnc-creds OnePasswordItem and guacamole-branding ConfigMap Missing document separator caused YAML to merge the OnePasswordItem's top-level `spec: itemPath:` block into the ConfigMap that follows. Result: a ConfigMap with a `.spec` field whose K8s schema does not declare one, triggering ArgoCD's structured-merge diff to fail since 2026-05-11T15:30:54Z: Failed to compare desired state to live state: failed to calculate diff: error calculating structured merge diff: error building typed value from config resource: .spec: field not declared in schema App stayed Healthy (live K8s tolerated the extra field — ConfigMap ignored it) but ArgoCD's diff calc was broken, leaving the app stuck at sync=Unknown for all 21 resources. Adding the missing `---` separator makes the OnePasswordItem and ConfigMap proper sibling YAML documents, each with its own kind-correct schema. Diagnosed during 2026-05-12 morning routine. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 09:26:03 -05:00
Codex	6e7d88db49	feat(fc-redis): add SignalR backplane for cross-product event bus (Q-SO-1 Phase A) Per Q-SO-1 operator resolution 2026-05-11 PM, Redis SignalR backplane lands in Phase A (was Phase C deferral). Treats Redis as a managed FC infrastructure component, not a deferred scaling escalation. Lands the minimal Phase A surface: - Namespace fc-redis - Single Redis 7-alpine pod with 1Gi Longhorn RWO PVC - ConfigMap with AOF persistence (everysec), 256Mi maxmemory, allkeys-lru - ClusterIP Service `redis.fc-redis.svc.cluster.local:6379` (in-cluster only) - No AUTH Phase A (Phase B add via 1Password Connect rotation) - No IngressRoute (backplane is server-to-server) Consumers (Phase A IMPL across FC services) add: services.AddSignalR().AddStackExchangeRedis( "redis.fc-redis.svc.cluster.local:6379", opts => opts.Configuration.ChannelPrefix = StackExchange.Redis.RedisChannel.Literal("fc-opsconsole")); Phase B/C follow-ons (not in this commit): Sentinel for HA, AUTH password from 1Password, redis_exporter sidecar for Prometheus, network policies. See FlowerCore.Notes/docs/signage/operations-console-phase-2-design.md section 3.5 (rewritten) and decisions-waiting.html Q-SO-1 (RESOLVED). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 19:02:58 -05:00
Codex	5ae50bd491	fix(telephony): init container runs as root to chown hostPath /tmp/tts-audio The fix-data-perms init container chowns /data (PVC) and /shared-tts (hostPath /tmp/tts-audio on rke2-agent1) to uid 1654 so the non-root telephony-web app can write Piper TTS .sln16 files. Without an explicit container-level securityContext override, the init container inherits pod-level runAsNonRoot:true / runAsUser:1654 and fails with 'chown: /shared-tts: Operation not permitted' the first time the hostPath comes up root-owned after a node reboot. Outage 2026-05-11 23:00 UTC: telephony-web in Init:CrashLoopBackOff for 9 hours (100+ restarts) until init container was bumped to runAsUser:0. Live cluster patched in the same operation; this commit makes the fix durable in git so ArgoCD sync preserves it. See Notes memory: feedback_hostpath_initcontainer_chown_perms Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 18:37:15 -05:00
Codex	653d4472f5	fix(monitoring): mirror Q-MR-3 MultusMemoryPressure + NamespacePendingPodBacklog alerts Two new preventive alert rules added to the kubernetes-state group of the K8s migration target ConfigMap. The live Podman Prometheus on noc1 has already been updated via FlowerCore.Notes/scripts/monitoring/alerts.yml + sudo cp + podman pod restart monitoring (this commit only locks it in the bluejay-infra K8s mirror so a future migration carries it forward). MultusMemoryPressure (critical, thermal_print): fires when kube-multus working set exceeds 80% of its memory limit for 5m. Catches the next multus OOM cascade BEFORE it kills the daemon cluster-wide. The 2026-05-10 21h outage hit because no alert fired on the rising multus working set; only downstream blackbox / Traefik / service alerts triggered, after the fact. NamespacePendingPodBacklog (warning): fires when any single namespace has >25 Pending pods sustained for 30m. Catches the operator-leak avalanche pattern (orphan pods from a crashed reconciler emitting children without ownerReferences) before it cascades into a CNI OOM. See FlowerCore.Notes: - feedback_multus_50mi_limit_oom_orphan_pod_avalanche - feedback_monitoring_k8s_target_vs_live_podman (workflow) Companion commits: - bluejay-infra@eb8693e (multus memory limit) - FlowerCore.RemoteDesktop@b02c59b (OwnerReferences fix) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 10:42:27 -05:00
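The two alert conditions in 653d4472f5 (multus working set above 80% of its limit for 5m; more than 25 Pending pods in a namespace for 30m) share one shape: a threshold that must hold continuously for a window. The sketch below models that shape only. It is not the Prometheus rule from the commit, the `sustained` helper and the sample data are hypothetical, and real `for:` evaluation depends on the rule evaluation interval.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (unix timestamp in seconds, observed value)

def sustained(samples: List[Sample], threshold: float, for_seconds: int) -> bool:
    """True if the most recent run of samples above `threshold` spans at
    least `for_seconds`. Hypothetical helper; samples must be ascending."""
    if not samples:
        return False
    end = samples[-1][0]
    start = None
    # Walk backwards until the first sample that is not above the threshold.
    for ts, value in reversed(samples):
        if value > threshold:
            start = ts
        else:
            break
    return start is not None and end - start >= for_seconds

# MultusMemoryPressure shape: working_set / limit > 0.80 for 5 minutes.
multus_ratio = [(t, 0.85) for t in range(0, 301, 60)]
print(sustained(multus_ratio, 0.80, 5 * 60))

# NamespacePendingPodBacklog shape: > 25 Pending pods for 30 minutes.
# A single dip to 10 pods at t=900 resets the window.
pending = [(t, 10.0 if t == 900 else 219.0) for t in range(0, 1801, 300)]
print(sustained(pending, 25, 30 * 60))
```

The first call reports a sustained breach; the second does not, because the dip interrupts the run before 30 minutes have accumulated.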
Codex	eb8693e1ce	fix(multus): bump kube-multus-ds memory 50Mi/50Mi -> 1Gi/512Mi (prevent OOM cascade) Cluster outage 2026-05-10T17:43 through 2026-05-11 ~10:30 (~21h). Root cause: FlowerCore.RemoteDesktop emitted 219 orphan rd-browser-only-* pods in fc-desktop (missing OwnerReferences — see companion fix in FlowerCore.RemoteDesktop). Kubelet's continuous CNI ADD retries for those pending pods drove a request queue that exceeded the upstream default 50Mi limit on kube-multus-ds. Multus OOMKilled (exit 137), restarted with an even bigger backlog, OOMKilled again, positive feedback loop. Restart counts climbed to 276 / 412 / 261 across the 3 RKE2 nodes. Downstream blast radius: both Traefik pods stuck ContainerCreating (101m + 4h35m), all Longhorn CSI attacher/provisioner/instance-manager stuck, every Prometheus blackbox probe for *.iamworkin.lan failing, UpdateCenterPublicEdgeDown critical on update.flowercore.io, every ArgoCD app showing sync=Unknown because repo-server lost git connectivity. 45 firing Prometheus alerts. Recovery sequence (Q-MR-1 from FlowerCore.Notes morning routine): 1. kubectl patch kube-multus-ds memory live (this commit locks it in git so ArgoCD doesn't revert on next sync) 2. Force-delete the 219 orphan pods (kubectl --grace-period=0 --force) to break the avalanche 3. Rollout restart kube-multus-ds — STABLE after restart with new limit 4. Restart Traefik + Longhorn CSI to clear stuck ContainerCreating 5. Verify update.flowercore.io returns 200 + ArgoCD apps reconcile Tested incrementally: 256Mi limit was insufficient (still OOMed on catchup burst), 512Mi was insufficient on rke2-agent1 (most pods concentrated there), 1Gi/512Mi handled the full 200+ pending pod CNI catchup cleanly with 0 multus restarts after rollout. Nodes are 64GB with <25% used in steady-state, so the ~256Mi typical working-set is well within the new limit. 
Companion change: FlowerCore.RemoteDesktop must set OwnerReferences on every worker pod so future operator crashes don't leak orphans (Q-MR-2). Preventive alerts (Q-MR-3) MultusMemoryPressure + NamespacePendingPodBacklog are coming in a follow-up commit to apps/monitoring/. Memory: feedback_multus_50mi_limit_oom_orphan_pod_avalanche Decisions card: docs/dashboards/decisions-waiting.html Q-MR-1..3 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 10:30:05 -05:00
Codex	667777a653	revert(ci1): back to cdrom:scsi (virtio-blk disk hit QEMU flock) The virtio-blk disk swap (commit `84c9feb`) didn't help: qemu fails to acquire the write lock on the rootdisk PVC because the previous launcher's qemu process didn't release it cleanly. Same family of bug as the "stale QEMU flock" already documented in feedback_kubevirt_iso_first_install_bootorder_and_runstrategy, but now triggered on rke2-agent1 instead of agent2. OVMF cdrom timeout is the real blocker and remains open: - ✅ Distribution pipeline (build → save → scp → ctr import on all 3 RKE2 nodes) is proven. localhost/win-server-2025:1.0 lives in each node's containerd k8s.io namespace. - ✅ containerDisk + cdrom:scsi gets qemu domain Running (no NFS Permission denied, no rootdisk flock). - ❌ OVMF BdsDxe times out reading the SCSI cdrom regardless of SecureBoot setting and bus type. Reverting the disk type to cdrom:scsi so the VM lands back on the "qemu Running, OVMF stuck at Boot Manager" state — known-stable and easier to attack than the QEMU-flock state we hit by trying virtio-blk disk. Operator decision for next architectural step (one of): - Custom OVMF firmware build with longer Boot0001 timeout - KubeVirt version bump (v1.5+ has OVMF fixes) - Hyper-V/VirtualBox install + export VHD to ci1 - BIOS legacy boot (Win Server 2025 needs UEFI but install media has a BIOS path) - DataVolume HTTP datasource (CDI internalizes ISO bytes via different code path) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 21:35:00 -05:00
Codex	84c9feb893	fix(ci1): present ISO as virtio-blk disk instead of cdrom OVMF BdsDxe "starting Boot0001 ... Time out" persists across: - SATA cdrom + Longhorn Filesystem PVC (Path A) - SATA cdrom + Synology NFS (Path B failed: storage perms) - SCSI cdrom + Longhorn (Path B variant) - SCSI cdrom + containerDisk tmpfs (Path C) - + SecureBoot=false That rules out: storage IO speed, cdrom bus type, signature verification. Remaining cause is deeper in qemu's cdrom device emulation under KubeVirt v1.4.0's OVMF firmware — the cdrom read window for OVMF's first-sector probe is too short to satisfy from the cdrom controller path regardless of bus type. Workaround: present the ISO bytes as a regular virtio-blk DISK (not a cdrom). UEFI/OVMF still recognizes ISO9660 + El Torito boot records on any block device, so it can find and boot the EFI bootloader the same way it would from a USB stick. virtio-blk has a different read path that doesn't hit the cdrom-specific timeout. This also better aligns with the FlowerCore.Distribution USB-key pattern: ISO bytes on a block device, UEFI boots from the El Torito boot record, Windows installer takes over. The autounattend ConfigMap (ci1-autounattend) drives unattended Windows setup once the installer kicks off. The containerDisk OCI image (localhost/win-server-2025:1.0) remains unchanged — only the disk type in the VM spec changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 21:29:59 -05:00
Codex	427dbfcef2	[uc] Phase 1 auth gate deploy v20260509-4162dca-authgate	2026-05-08 21:16:54 -05:00
Codex	b651a4e2d0	fix(ci1): disable SecureBoot to allow OVMF to boot Windows ISO containerDisk delivery (commit `b998f50`) successfully gave qemu fast in-memory access to the ISO bytes (no NFS denial, no Longhorn read latency), but OVMF's BdsDxe still timed out: BdsDxe: loading Boot0001 "UEFI QEMU QEMU CD-ROM " from PciRoot(0x0)/Pci(0x2,0x4)/Pci(0x0,0x0)/Scsi(0x0,0x0) BdsDxe: starting Boot0001 ... Time out That rules out storage IO speed and bus type as causes (already tested both sata and scsi against both Longhorn-PVC and tmpfs-backed containerDisk). Remaining likely cause: SecureBoot signature verification on the ISO's EFI bootloader. KubeVirt's stock `/usr/share/OVMF/OVMF_VARS.secboot.fd` doesn't appear to ship with the Microsoft KEK/DB enrolled by default, so signed Windows EFI bootloaders fail the trust-chain check and OVMF reports a generic "Time out" rather than a verification failure. Disabling SecureBoot lets OVMF skip the chain check entirely and boot the El Torito EFI image. SMM stays enabled (KubeVirt only requires it WITH SecureBoot, not the inverse). TPM 2.0 emulation also stays on (`tpm: {}`), so BitLocker, Hyper-V, and WSL2 still work in the guest. This is acceptable for a CI runner. Long-term path back to SecureBoot: 1. Custom-build OVMF_VARS.fd with Microsoft KEK/DB pre-enrolled 2. Mount via firmware.bootloader.efi.persistent 3. secureBoot: true Tracked as a Phase 2 hardening task once the runner is doing real work and we want signed-boot guarantees. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 21:06:18 -05:00
Codex	b998f50f48	fix(ci1): switch ISO delivery to containerDisk OCI image (Path C) OCI image: localhost/win-server-2025:1.0 (8.27 GB) Built FROM scratch + ADD disk.img → /disk/disk.img on noc1, podman saved as tar (8.27 GB), SCP'd in parallel to all 3 RKE2 nodes, imported via ctr in k8s.io namespace. Verified present on all 3 schedulable nodes (rke2-server, rke2-agent1, rke2-agent2). Why containerDisk over the prior PVC paths: - Path A (Longhorn Filesystem PVC, sata): OVMF BdsDxe SATA-CDROM read timeout. Cdrom-backed PVC is too slow for OVMF's first-sector read window. - Path B (Synology NFS): uid 107 (qemu) denied at directory level by Synology export ACL despite file mode 0777. Memory: feedback_synology_iso_export_root_only_uid_107_denied. - Path B+SCSI: same OVMF timeout, just on SCSI controller. Bus choice was not load-bearing — the issue was always the slow PVC backing. - Path C (this commit): containerDisk delivers the ISO bytes from a tmpfs view of the OCI layer, no PVC controller in the read path. qemu reads at native FS speed; OVMF first-sector read completes well within timeout. This is also the KubeVirt-recommended pattern for installer ISOs. Connects to FlowerCore.Distribution / Provisioning USB story: same "OCI image of the OS installer + autounattend on a sysprep CDROM" pattern that the USB provisioning agent will use. The Windows install proceeds hands-off via the existing autounattend.xml in ci1-autounattend ConfigMap (RDP enabled, WinRM, UAC disabled, Administrator password from 1Password vault item h3ix4mgfk65gmkcmvh6ly3d3hu). Image lifecycle: bump tag (1.1, 1.2, ...) when ISO version changes, rebuild on noc1, redistribute to RKE2 nodes, update image: line. Legacy NFS PVC + PV manifest and CDI Longhorn PVC RETAINED for this commit so prior states are recoverable. Will prune in follow-up once containerDisk boot proves. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 20:45:38 -05:00
Codex	8fd9ae1cd3	fix(ci1): revert NFS Path B + flip ISO cdrom bus sata→scsi NFS Path B (commit `fc2aca0`) failed at storage layer: Synology export `/volume1/ISOs` denies non-root client UIDs at the directory level. qemu uid 107 cannot `ls /iso/` even though disk.img is mode 0777. Diagnosed via uid-107 + uid-0 busybox probe pods on rke2-agent2: - libvirt error: "Cannot access storage file ... Permission denied" (virStorageSourceReportBrokenChain:1281, virError Code=38 Domain=18) - uid 107 pod: "ls: can't open '/iso/': Permission denied" - uid 0 pod (same mount): "drwxrwxrwx 1 root root 16 ... disk.img" - SELinux Enforcing + virt_use_nfs=on, no AVC denials → not SELinux - File mode 0777 with owner 107:107 → not POSIX Same export-only-root pattern as `/volume1/kubernetes`. Memory: feedback_synology_iso_export_root_only_uid_107_denied.md Existing CDI-uploaded Longhorn PVC `windows-server-2025-iso` (10Gi Filesystem mode) verified to contain valid ISO bytes readable by uid 107 (mode 0660 root:107, 9.85 GB sparse, 8.27 GB blocks ≈ original 7.7 GiB ISO). Reverting to it. The original OVMF SATA-CDROM read timeout that drove yesterday's NFS pivot is now addressed by `cdrom: bus: scsi` (virtio-scsi has a longer read window than the IDE/SATA emulator). Per user-prompt diagnostic chain Step 5. NFS PVC + PV (apps/kubevirt-vms/win2025-iso-nfs-pv.yaml) RETAINED so Path B state is recoverable; can be pruned in follow-up once SCSI boot is proven. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 18:54:36 -05:00
Codex	fc2aca0e9e	fix(ci1): mount Windows ISO via Synology NFS (Path B for SATA-CDROM timeout) Previous fix attempts confirmed the Longhorn-backed Filesystem PVC contains a perfectly valid bootable ISO9660 image. The bug is that SATA-CDROM emulation reading from a Longhorn Filesystem PVC is too slow for OVMF's boot read window — DVD-ROM enumeration times out before the bootloader loads. Symptom on the serial console: BdsDxe: failed to start Boot0001 "UEFI QEMU DVD-ROM QM00001 " ... Time out BdsDxe: No bootable option or device was found Block-mode PVC (Path A) was attempted and would likely fix the timing, but CDI v1.65.0's upload-target pod cannot open the underlying block device (runAsUser:107 + capabilities.drop:[ALL]): blockdev: cannot open /dev/cdi-block-volume: Permission denied Path B (this change): mount the ISO directly from Synology NAS over NFSv4.1. Bypasses both the Longhorn slowness and the CDI permission issue. QEMU's SATA emulator reads at native LAN speed. Layout: /volume1/ISOs/ — existing Synology export, RKE2 ACL already granted /volume1/ISOs/win2025-iso-disk/disk.img — new subdir, hardlink to the ISO file, named so KubeVirt's launcher finds it at the PV root A hardlink (not symlink) is required because symlinks with relative targets pointing to the parent directory are broken when the NFS PV sub-mounts the subdir as its root. Validated 2026-05-08 from rke2-server, rke2-agent1, rke2-agent2: mount -t nfs -o nfsvers=4.1,ro 10.0.58.3:/volume1/ISOs/win2025-iso-disk file disk.img -> ISO 9660 CD-ROM filesystem data ... (bootable) The original Longhorn Filesystem ISO PVC is RETAINED unused (so ArgoCD doesn't prune the populated PVC and so we have a fallback). Can be removed in a follow-up commit after the NFS path is proven on a successful Windows install. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:03:42 -05:00
Codex	ba18c52130	docs(ci1): record open rootdisk-flock and SATA-CDROM-timeout issues Documenting the remaining 2 unresolved issues for the next operator session, with the recovery paths from this session captured inline so the next agent doesn't repeat the same blind alleys: 1. rootdisk QEMU flock — every new launcher pod fails QEMU start with `Failed to get "write" lock` on the rootdisk Filesystem-mode disk.img. Stale flock from a previous force-deleted virt-launcher pod. Longhorn engine on rke2-agent2 needs to release the lock; `kubectl patch volume.longhorn.io/<pvc-name> spec.nodeID=""` is reverted by the Longhorn controller. Operator-level recovery only. 2. SATA CDROM read timeout — even with bootOrder=1 (windows-iso first), OVMF UEFI fails Boot0001 with "Time out" reading the SATA CDROM backed by the Filesystem-mode PVC. Block-mode DataVolume migration was attempted but blocked by CDI v1.65.0's upload pod running with `capabilities.drop: [ALL]` and `runAsUser: 107`, preventing direct block-device writes (`blockdev: cannot open /dev/cdi-block-volume: Permission denied`). See ISO PVC header docstring for 3 forward paths. Net commits during this session: - `1c4145a`: bootOrder swap (windows-iso=1, rootdisk=2) - `87a7d7c`: deprecated `running:` -> `runStrategy: Always` - `0bf47df`: ISO migration to Block-mode DataVolume (REVERTED) - `9f6dc1a`: revert to Filesystem PVC (CDI block-upload blocked) - `1c4145a` + `87a7d7c` + `9f6dc1a` are the live, correct configuration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 15:18:38 -05:00
Codex	9f6dc1a9d5	fix(ci1): revert ISO to Filesystem PVC; CDI v1.65.0 block-upload pod blocked by capability drop The Block-mode DataVolume migration (commit `0bf47df`) hit a CDI v1.65.0 limitation: the upload-target pod runs as uid 107 with `capabilities.drop: [ALL]`, so it cannot open the underlying block device: blockdev: cannot open /dev/cdi-block-volume: Permission denied Saving stream failed: Unable to transfer source data to target file: error determining if block device exists: exit status 1 Reverting to a Filesystem-mode PVC + virtctl image-upload pvc, which DID work (uploaded the 7.7 GiB ISO with valid ISO9660 magic intact). Boot timeout is unresolved (header docstring captures the open issue + 3 paths to revisit). The bootOrder swap (`1c4145a`) and runStrategy migration (`87a7d7c`) stay landed — those are correct improvements regardless of the volume-mode question. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 14:32:52 -05:00
Codex	0bf47dfa33	fix(ci1): switch ISO from filesystem PVC to Block-mode DataVolume The bootOrder swap alone didn't fix the install — even with `windows-iso` at bootOrder:1, OVMF UEFI still timed out reading the SATA CDROM: BdsDxe: starting Boot0001 "UEFI QEMU DVD-ROM QM00001 " from ... Sata(...) BdsDxe: failed to start Boot0001 ... : Time out BdsDxe: No bootable option or device was found. Diagnosis (debug pod mounting the live PVC): - /pvc/disk.img IS a valid bootable ISO9660 image — `file` reports "ISO 9660 CD-ROM filesystem data 'SSS_X64FRE_EN-US_DV9' (bootable)". - bytes 0..15: zeros (NOT QCOW2 magic 51 46 49 fb). - bytes 32769..32773: "CD001" — ISO9660 primary volume descriptor at the correct offset. So content was fine. The bug is in how KubeVirt + QEMU + Longhorn expose a Filesystem-mode PVC's `/disk.img` as a SATA CDROM. With Block-mode the underlying volume IS the raw ISO9660 sectors, OVMF reads them directly, no QEMU file-emulation layer. This is the recommended pattern for ISO install media on KubeVirt + Longhorn. Migration: - Replace `kind: PersistentVolumeClaim` with `kind: DataVolume` (CDI manages the underlying PVC + upload-target pod). - Set `pvc.volumeMode: Block`. - Annotate `cdi.kubevirt.io/storage.contentType: kubevirt` so CDI keeps raw bytes (no QCOW2 wrap). - VM volume reference changes from `persistentVolumeClaim.claimName` to `dataVolume.name`. KubeVirt's VMI controller blocks VM start until DV phase is Succeeded (upload completed). Operator step after this lands: 1. Wait for DV `phase: UploadReady` kubectl get dv -n kubevirt-vms windows-server-2025-iso -w 2. virtctl image-upload dv windows-server-2025-iso -n kubevirt-vms \ --image-path "...\en-us_windows_server_2025...iso" \ --uploadproxy-url https://localhost:8443 --insecure --no-create 3. Re-flip runStrategy to Always (was set to Halted live-side during migration; this commit keeps the manifest at Always). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 14:23:31 -05:00
Codex	87a7d7c70a	fix(ci1): switch deprecated `running: true` -> `runStrategy: Always` Required to clear OutOfSync state after the bootOrder fix. Live VM had runStrategy: Halted (set during diagnosis to release the PVC for inspection). Manifest had running: true. KubeVirt's validating webhook rejects sync: admission webhook "virtualmachine-validator.kubevirt.io" denied the request: Running and RunStrategy are mutually exclusive. Switching to runStrategy: Always preserves the original "auto-start + auto-restart" semantics with the non-deprecated field, and gives ArgoCD a clean diff target to flip Halted -> Always. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 14:12:07 -05:00
Codex	1c4145a581	fix(ci1): swap bootOrder so Windows install ISO boots first Original order: rootdisk=1 (empty 200Gi virtio), windows-iso=2 (SATA CDROM). UEFI tried the empty virtio disk first, got nothing, fell back to Boot0001 (the SATA CDROM) with a short timeout, and aborted with: BdsDxe: failed to start Boot0001 ... Time out BdsDxe: No bootable option or device was found. VM had been running 38+ min with rootdisk actualSize stuck at 4.13 GiB and no AgentConnected condition — install never started. Diagnosis via debug pod mounting the windows-server-2025-iso PVC: /pvc/disk.img: ISO 9660 CD-ROM filesystem data 'SSS_X64FRE_EN-US_DV9' (bootable) bytes 0..15: zeros (NOT QCOW2 magic 51 46 49 fb) bytes 32769..32773: "CD001" (ISO9660 primary volume descriptor) So the PVC content is a real bootable ISO — the only fix needed is to make the ISO bootOrder=1 for first install. After Windows installs, it writes its own UEFI Boot#### entries pointing at the rootdisk EFI partition; UEFI then boots from rootdisk going forward and the ISO at bootOrder:2 is a fallback for re-install scenarios. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 14:10:17 -05:00
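The byte-level diagnosis in 1c4145a581 (no QCOW2 magic at offset 0, "CD001" at bytes 32769..32773) can be reproduced without the `file` utility. This is a minimal sketch, not the debug-pod procedure the commit used; `classify_image` is a hypothetical helper, and it only checks the two signatures named in the commit, not El Torito bootability.

```python
import io
from typing import BinaryIO

QCOW2_MAGIC = b"QFI\xfb"    # 51 46 49 fb at offset 0
ISO9660_ID = b"CD001"       # standard identifier of a volume descriptor
ISO9660_ID_OFFSET = 32769   # sector 16 (byte 32768) plus 1 type byte

def classify_image(f: BinaryIO) -> str:
    """Return 'qcow2', 'iso9660', or 'unknown' for a seekable binary file."""
    f.seek(0)
    if f.read(4) == QCOW2_MAGIC:
        return "qcow2"
    f.seek(ISO9660_ID_OFFSET)
    if f.read(5) == ISO9660_ID:
        return "iso9660"
    return "unknown"

# Synthetic stand-in for a raw ISO: 16 zeroed sectors, then a descriptor
# header (type byte, "CD001", version byte) and padding.
fake_iso = io.BytesIO(bytes(32768) + b"\x01CD001\x01" + bytes(2041))
print(classify_image(fake_iso))

# Against a mounted PVC, assuming the path quoted in the commit:
#   with open("/pvc/disk.img", "rb") as f:
#       print(classify_image(f))
```

A result of `iso9660` matches the commit's finding that the PVC holds raw ISO bytes rather than a QCOW2-wrapped image.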
Codex	c50a403f74	fix(infra): pin virtio-container-disk to v1.8.2 (containerd 2.1 manifest fix) KubeVirt v1.4.0 + RKE2 containerd 2.1.5 cannot pull quay.io/kubevirt/virtio-container-disk:latest: rpc error: code = Unimplemented desc = failed to pull and unpack image: not implemented: media type "application/vnd.docker.distribution.manifest.v1+prettyjws" is no longer supported since containerd v2.1, please rebuild the image as "application/vnd.docker.distribution.manifest.v2+json" or "application/vnd.oci.image.manifest.v1+json" The :latest tag was last rebuilt with the v1 manifest schema. Tagged versions v1.6.5+, v1.7.3, v1.8.2 are rebuilt with v2/OCI manifests. Pinning to v1.8.2 (newest available, contains current Windows VirtIO drivers). The image only contains the Windows VirtIO driver ISO mounted as a CDROM — not the KubeVirt runtime — so it is decoupled from the cluster KubeVirt version. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 13:28:22 -05:00
Codex	fb7bd10528	feat(infra): activate ci1 VM — running:true + 10Gi ISO PVC + 1P password Phase 1 prereqs all satisfied: - Multus CNI v4.2.2 thick-plugin DS Running on rke2-server/agent1/agent2 - CDI v1.65.0 operator + CR Deployed (cdi-apiserver/deployment/uploadproxy all Running 1/1) - Windows Server 2025 ISO (7.7GiB, March 2026 update) uploaded via CDI virtctl image-upload to PVC windows-server-2025-iso. Verified via PVC annotations: cdi.kubevirt.io/storage.condition.running.message="Upload Complete", storage.pod.phase="Succeeded" - Local Administrator password generated (26 char, FANTASTIC strength). Stored in 1Password vault IAmWorkin (qaphopopkryhbg353ukzhhuqoq) item h3ix4mgfk65gmkcmvh6ly3d3hu. UTF-16-LE base64 in autounattend.xml Value field matches the 1P "autounattend AdministratorPassword Value" field. Changes: - ISO PVC bumped 6Gi → 10Gi (ISO is 7.7GiB, need headroom) - Added labels app=ci-runner, flowercore.io/managed-by=bluejay-infra - autounattend.xml AdministratorPassword Value: real base64-encoded password - spec.running: false → true (VM starts on next ArgoCD sync) - Header comment refreshed to LIVE state with prereq references Network: still pod-network masquerade. Multus NAD prod-vlan57 is registered but the VM doesn't use it yet (Phase 1.5 host bridge needed first). Verify after sync: kubectl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml -n kubevirt-vms get vm,vmi virtctl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml vnc ci1 -n kubevirt-vms Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 13:24:46 -05:00
Codex	6c21d14a98	deploy(fc-updater): bump image to v20260508-pub3-deepening-2bdf108 Promotes the fleet to FlowerCore.Updater main @ 2bdf108 which combines: - PR #6 publish pre-signed releases (1a188f4) - PR #7 deeper public-host privacy enforcement (8cd8544) - PublishPreSignedAsync(stream, sig) Integration coverage (2bdf108) Live image already imported to rke2-server and rolled via deploy-web.ps1. This commit aligns the bluejay-infra source of truth so ArgoCD doesn't snap the spec back to the previous tag (per feedback_argocd_managed_image_overrides_do_not_stick).	2026-05-08 13:07:24 -05:00
Codex	b3529f8e96	feat(infra): add Multus CNI + CDI + PROD VLAN 57 NAD as GitOps prereqs for ci1 Adds three new bluejay-infra apps that auto-pickup via ApplicationSet (apps/* directory generator on main): * apps/multus/multus.yaml — Multus CNI v4.2.2 thick-plugin daemonset (verbatim upstream, project-annotated). Enables KubeVirt VMs to attach additional network interfaces. Required by ci1 to bridge onto PROD VLAN 57. * apps/cdi/{cdi-operator.yaml,cdi-cr.yaml,README.md} — Containerized Data Importer v1.65.0 (verbatim upstream). Operator + CR pattern. Enables populating PVCs from HTTP/registry/upload sources, used to load the Windows Server 2025 ISO into the windows-server-2025-iso PVC. * apps/kubevirt-vms/prod-vlan57-nad.yaml — NetworkAttachmentDefinition for PROD VLAN 57 bridge. Deploy gated on Phase 1.5 host work: requires br-prod bridge enslaving enp86s0.57 on each RKE2 node (Puppet config-as-code). ci1.yaml continues to use pod-network masquerade until that lands; switching to multus.networkName: kubevirt-vms/prod-vlan57 is a one-line YAML edit followed by a GitOps push. Cluster verification (2026-05-08): - KubeVirt LIVE (3 nodes, virt-api/controller/handler/operator all Running) - Calico CNI on /etc/cni/net.d + /opt/cni/bin (Multus default paths) - ApplicationSet `bluejay-infra` already watches `apps/*` on main Reproducibility: upstream YAMLs vendored verbatim with project header diffs only. Bumping versions = re-curl + git push. No deploy-time internet fetch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 13:05:58 -05:00
Codex	00c11b4eaa	feat(infra): stage ci1 Windows Server 2025 KubeVirt VM (Phase 1, NOT YET APPLIED) Stages a draft VirtualMachine + Namespace + ISO PVC + rootdisk PVC + sysprep ConfigMap for the dedicated GitHub Actions self-hosted runner that replaces the never-registered bluejay-ws-sandbox-1 placeholder. Status: STAGED ONLY. spec.running = false. ISO PVC empty. Two operator decisions still pending before this can boot: 1. Network choice — pod-network fallback (in this draft) vs Multus + PROD VLAN NAD (preferred, requires Multus install). 2. ISO path — manual upload via helper pod (Path A) vs CDI HTTP import (Path B, requires CDI install). Cluster baseline 2026-05-08: - KubeVirt operator: installed, healthy, 14d - CDI: NOT installed - Multus: NOT installed - Calico-only CNI See docs/infrastructure/windows-server-build-runner-plan.md "Phase 1 readiness gate" for the full operator pickup checklist. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 12:32:47 -05:00
Codex	04881f46f0	deploy(intranet): promote brochure wave 1 image	2026-05-08 11:12:56 -05:00
Codex	c0038e4859	deploy(intranet): bump image to v20260508-7bad3a5 (Theme picker + FcThemedRoot) FlowerCore.Intranet.Web@7bad3a5 'feat(theme): add /admin/theme picker page + wrap routes in FcThemedRoot'. Image built, distributed to all 3 RKE2 nodes (10.0.56.11/12/13), 366/366 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 10:20:11 -05:00
Codex	dee48831c6	Deploy updater public privacy hardening	2026-05-07 17:12:33 -05:00
Codex	0f1dc5f871	fix(certs): kill cert-manager renewal loop on 3 broken Certificate specs Three Certificates requested duration: 2160h (90d) with renewBefore: 720h (30d). step-ca's ACME provisioner caps cert lifetime at 30d, so it silently issued 720h certs — making renewBefore EQUAL to the actual cert lifetime. cert-manager treats the cert as needing immediate renewal the moment it's issued, creates a CertificateRequest, gets a new (still 30d) cert, marks it for immediate renewal, and loops. Damage on 2026-05-07 ~20:30 (caught during regroup after 5h gap): - fc-worldbuilder/worldbuilder-web-tls: 2365 CRs in 18h - fc-distribution/fc-distribution-tls: 10880 CRs in 18h - knowledge/knowledge-tls: 10888 CRs in 18h Total: 24,133 stale CertificateRequest objects in etcd. Bulk-deleted all CRs + Orders in those 3 namespaces, then this commit fixes the source so ArgoCD sync stops re-creating the loop. Fix: match the working 720h/240h pattern used by every other FC service cert (agent-zero, fc-dns, fc-llm-bridge, fc-php, traefik-system, etc.). 30d cert lifetime + 10d renewal headroom = renewal at day 20, which is the cert-manager standard 2/3-of-lifetime practice. Side effect during loop: ALSO contributed to step-ca load and may have caused intermittent timeouts cluster-wide (the latest stuck challenge was timing out dialing step-ca:9443 even though step-ca itself was up). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:32:00 -05:00
Codex	11c5f6e6cc	fix(selenium): GitOps-capture selenium-netpol (was unmanaged anywhere) Captured during 2026-05-07 regroup audit. selenium-netpol was applied via raw `kubectl apply` to the cluster on 2026-03-15 with no source-of-truth file anywhere — neither in bluejay-infra nor in any FC service repo. A cluster rebuild from bluejay-infra would have lost it entirely (including the Selenium Grid → Traefik VIP allow rule that gates AAT runs against *.iamworkin.lan services). Captured byte-for-byte from `kubectl get netpol -n selenium selenium-netpol -o yaml`. ServerSideApply via ArgoCD will adopt the existing resource without recreation. The Selenium Grid Deployment + Services themselves are still managed outside ArgoCD (deployed via raw kubectl from the original bring-up). Migrating those into bluejay-infra is a separate lane — this commit only restores GitOps repeatability for the NetworkPolicy. See feedback_networkpolicies_belong_in_bluejay_infra.md for the canonical pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 10:30:59 -05:00
Codex	d637fe9b30	fix(fc-desktop): land 4 NetworkPolicies into bluejay-infra (was deploy-script-only) Repeatability gap caught during 2026-05-07 morning regroup. The four fc-desktop NetworkPolicies (desktop-isolation, fc-desktop-default-deny, remotedesktop-web-isolation, cm-acme-http-solver-allow) were applied via FlowerCore.RemoteDesktop/scripts/deploy-web.sh `kubectl apply` calls. That meant a fresh cluster rebuild from bluejay-infra alone would miss all of them — Browser Lab session isolation, control-plane allow-list, and HTTP-01 cert renewal would silently fail to come up. Canonical FC GitOps pattern is for NetworkPolicies to live alongside other resources in bluejay-infra. Verified by audit: 6 of 11 cluster NetworkPolicies (agent-zero, edge2-services, monitoring, noc-services, telephony, voice) already follow this pattern. fc-desktop was the outlier; selenium-netpol is also unmanaged and tracked separately. Source-of-truth split (now documented in fc-desktop.yaml): - bluejay-infra OWNS: Certificate + IngressRoute + all NetworkPolicies. - FlowerCore.RemoteDesktop scripts/deploy-web.sh OWNS: Deployment + Service ONLY (because `localhost/fc-desktop:linux-xfce` image refs require manual ctr import on each node — Deployment in bluejay-infra would race the image-import step). Follow-up commits in FlowerCore.RemoteDesktop will: - Remove the now-duplicate k8s/{networkpolicy,namespace-default-deny, web-networkpolicy,acme-http01-solver-allow}.yaml files. - Drop the 3 `kubectl_apply_file` lines from scripts/deploy-web.sh. The 4 NPs in this commit are byte-for-byte identical to what's running in the cluster today (verified via kubectl get -o yaml diff). ServerSideApply in the bluejay-infra ApplicationSet will adopt the existing resources without recreating them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 10:27:20 -05:00
Codex	5bfe41beca	fix(monitoring): rename bare Grafana dashboard JSONs out of .json extension ArgoCD's directory-driven manifest parser scans .yaml AND .json by default. Bare Grafana dashboard JSONs (no apiVersion/kind/metadata) poisoned manifest generation for the entire infra-monitoring Application ("Object 'Kind' is missing in <dashboard JSON>"), leaving sync state Unknown. These files are SOURCE for the file-provisioning path on noc1 (/opt/monitoring/grafana/dashboards/) and also get inlined into ConfigMap wrappers like grafana-dashboard-remotedesktop.yaml. They are NOT K8s manifests and shouldn't be in ArgoCD's manifest tree. .argocdignore is for repo-level GitOps source eligibility, not for filtering manifests within a directory-mode Application — the cleanest fix is the .txt extension that ArgoCD's parser skips. Reverts the .argocdignore from the previous commit (didn't take effect). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 10:13:37 -05:00
Codex	df22774674	fix(infra): unstick fc-updater + monitoring ArgoCD apps fc-updater PVC: bump updatecenter-data storage 10Gi → 25Gi. The cluster PVC was already manually expanded to 20Gi to fit Mike Bundle (~5.1 GiB) plus the LocalFsBundleStore.MaxTotalBytes soft cap of 25 GiB (see project_uc_remaining_4_apps_signed_2026_05_06). PVCs cannot shrink, so ArgoCD couldn't sync the smaller git value (OutOfSync, retried 5x with "field can not be less than status.capacity"). Setting git to 25Gi gives headroom matching the soft cap. monitoring .argocdignore: skip bare dashboard JSON files. Both fc-updatecenter-dashboard.json and flowercore-remotedesktop-grafana- dashboard.json live in apps/monitoring/ as source-of-truth for file- provisioning to noc1's /opt/monitoring/grafana/dashboards/. ArgoCD's manifest parser tries to unmarshal every file and chokes on bare dashboard JSON ("Object 'Kind' is missing"), which then poisoned the whole infra-monitoring Application status (Unknown sync, no comparison possible). The .argocdignore tells ArgoCD to skip *.json — actual K8s deploys happen via ConfigMap wrappers like grafana-dashboard-remotedesktop.yaml. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 10:11:27 -05:00
Codex	c4065b15a3	deploy(ttsreader): persist voice reference clips on pvc	2026-05-06 20:48:58 -05:00
Codex	a4aa612373	deploy(fc-distribution): roll startup backfill fix	2026-05-06 19:51:11 -05:00
Codex	c2eb37dee9	deploy(ttsreader): enable phase6 biblical routing	2026-05-06 19:46:25 -05:00
Codex	bf6f542569	deploy(fc-distribution): roll latest endpoint fix	2026-05-06 19:38:26 -05:00
Codex	e150b2102f	deploy(fc-distribution): roll phase1 api image	2026-05-06 19:31:22 -05:00
Codex	33a765b0bc	deploy(fc-intranet-web): roll v20260506-1737 with Wave 2 specialist galleries 6 Wave 2 product galleries landed on intranet master c083016: - /specialists/telephony (7 sections + Overview, +11 tests) - /specialists/library (8 sections + Overview, +17 tests) - /specialists/retail (6 sections + Overview, +16 tests) - /specialists/mysql (6 sections + Overview, +22 tests) - /specialists/php (6 sections + Overview, +9 tests) - /specialists/pimanager (7 sections + Overview, +11 tests) NavMenu.razor wired with new Specialists section. Test ledger: 280 -> 366 (+86) full project, 0W/0E build. Sources: 6 sibling-depth worktrees claude/intranet-w2-{name} dispatched 2026-05-06 per intranet-xxxl-sprint-2026-05-05.md §4 Phase 2. Inherits Q-IK-1..15 + Q-IS-1..12 + Q-IX-1..7 verbatim per Q-IW-5. 6 Q-IW-1..6 cards on Notes decisions-waiting.html.	2026-05-06 17:38:22 -05:00
Codex	5484ed7db6	Adopt fc-updater into ArgoCD	2026-05-06 17:33:32 -05:00
Codex	2aa84349ea	merge claude/bluejay-infra-worldbuilder: roll fc-intranet-web v20260506-2120 with WorldBuilder LIVE flip	2026-05-06 16:22:51 -05:00
Codex	851f8e673b	deploy(fc-intranet-web): roll v20260506-2120 with WorldBuilder LIVE flip WorldBuilder live runtime promotion lands in the Intranet at /services/world-builder + ServiceRegistry homepage tile. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 16:22:43 -05:00
Codex	f78f8c8192	Merge claude/bluejay-infra-ttsreader-4delta: bump fc-ttsreader image for Phase 4delta enrichment landing	2026-05-06 16:04:57 -05:00
Codex	6a89a76e39	fc-ttsreader: bump image to v202605061500 (Phase 4delta enrichment pipeline) Phase 4delta server-side HTML overlay enrichment landed in FlowerCore.TtsReader@8f23e15 (master @6091618). Adds 9-pass enrichment + SQLite-backed cache + 4 REST endpoints (/api/v1/enrich/{html,jsonld,both,passes}) + RenderRequest.sourceJsonLd. Tests 476 -> 522 (+46). Image already imported to all RKE2 nodes via deploy.sh; this bumps the bluejay-infra-managed tag so ArgoCD reconciles the live deployment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 16:04:31 -05:00
Codex	2489464d4f	fix(worldbuilder): cpu request 100m -> 25m to clear scheduler Cluster CPU-request budget at 99% on all 3 RKE2 nodes at deploy time. 0/3 nodes available; "3 Insufficient cpu". Actual CPU usage on the nodes is 10/52/19%, so the cluster is request-overprovisioned but has plenty of real headroom. Idle Blazor + SignalR + ComfyUI poller is ~5m. 25m unblocks scheduling and stays generous for expected runtime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 16:04:25 -05:00
Codex	4b777b16ac	monitoring: mirror fc-signage-marquee alert group into noc-monitoring K8s target Mirror of FlowerCore.Notes/scripts/monitoring/alerts.yml fc-signage-marquee group into the K8s migration target apps/monitoring/noc-monitoring.yaml so that future migration of the noc1 Podman monitoring stack into RKE2 inherits the marquee alert ruleset automatically. Three rules added: - MarqueeDroppedFramesHigh (5% / 5min / warning) - MarqueeRenderLatencyP99High (16ms / 10min / warning) - MarqueeAnimationDurationDrift (10% / 15min / info) All three gated with `unless on() absent_over_time(metric[7d])` so they don't fire during the metric-not-yet-published window before Track 3 IR-21 source IMPL ships the MarqueeMeter into Common + Web + WPF. Live source-of-truth (the noc1 Podman Prometheus reads from /opt/monitoring/prometheus/alerts.yml) was updated and reloaded in the same session — Notes commit 300daa0 carries the matching alerts.yml + Grafana fc-signage-dashboard.json change. Per feedback_monitoring_k8s_target_vs_live_podman: this file is the forward-looking K8s migration target, NOT what the live Podman Prometheus reads. ArgoCD-syncing this file does NOT push alerts to the live monitoring stack. Companion to: - FlowerCore.Notes 300daa0 (live alerts.yml + Grafana panels deployed) - docs/signage/marquee-performance-telemetry-design.md (Track 3 IR-21 spec) - docs/signage/marquee-animation-phases.md (Track 6 13-phase coverage matrix) Memory: project_marquee_vr_promotion_landed_2026_05_06 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 16:01:44 -05:00
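
The renewal-loop arithmetic described in commit 0f1dc5f871 can be checked with a short sketch. It assumes a simplified model in which renewal becomes due at (issued lifetime minus renewBefore); the function name `renewal_offset` is hypothetical and is not part of cert-manager. Only the durations and CertificateRequest counts come from the commit message.

```python
from datetime import timedelta


def renewal_offset(issued_lifetime: timedelta, renew_before: timedelta) -> timedelta:
    """Time after issuance at which renewal becomes due (simplified model).

    Assumption: renewal is due at lifetime - renewBefore, floored at zero.
    """
    return max(issued_lifetime - renew_before, timedelta(0))


# Broken specs: 2160h was requested, but the commit reports step-ca issuing 720h
# certs, so renewBefore (720h) equals the actual lifetime.
broken = renewal_offset(timedelta(hours=720), timedelta(hours=720))

# Fixed specs: the 720h / 240h pattern named in the commit.
fixed = renewal_offset(timedelta(hours=720), timedelta(hours=240))

# Stale CertificateRequest counts per namespace, as listed in the commit.
stale_requests = {
    "fc-worldbuilder/worldbuilder-web-tls": 2365,
    "fc-distribution/fc-distribution-tls": 10880,
    "knowledge/knowledge-tls": 10888,
}
total_stale = sum(stale_requests.values())

print("broken renews after:", broken)            # zero offset: due immediately
print("fixed renews after (days):", fixed.days)  # day 20 of a 30-day cert
print("total stale CRs:", total_stale)
```

Under this model the broken specs are due for renewal at the moment of issuance, which matches the loop the commit describes, while the fixed specs renew at day 20 of a 30-day certificate.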


368 Commits