fix(ci1): switch ISO delivery to containerDisk OCI image (Path C)

OCI image: localhost/win-server-2025:1.0 (8.27 GB)

Built FROM scratch + ADD disk.img → /disk/disk.img on noc1, podman saved as
tar (8.27 GB), SCP'd in parallel to all 3 RKE2 nodes, imported via ctr in the
k8s.io namespace. Verified present on all 3 schedulable nodes (rke2-server,
rke2-agent1, rke2-agent2).

Why containerDisk over the prior PVC paths:
- Path A (Longhorn Filesystem PVC, sata): OVMF BdsDxe SATA-CDROM read
  timeout. The PVC-backed cdrom is too slow for OVMF's first-sector read
  window.
- Path B (Synology NFS): uid 107 (qemu) denied at directory level by
  Synology export ACL despite file mode 0777. Memory:
  feedback_synology_iso_export_root_only_uid_107_denied.
- Path B+SCSI: same OVMF timeout, just on SCSI controller. Bus choice was
  not load-bearing; the issue was always the slow PVC backing.
- Path C (this commit): containerDisk delivers the ISO bytes from a tmpfs
  view of the OCI layer, no PVC controller in the read path. qemu reads at
  native FS speed; OVMF first-sector read completes well within timeout.
  This is also the KubeVirt-recommended pattern for installer ISOs.

Connects to FlowerCore.Distribution / Provisioning USB story: same "OCI image
of the OS installer + autounattend on a sysprep CDROM" pattern that the USB
provisioning agent will use. The Windows install proceeds hands-off via the
existing autounattend.xml in ci1-autounattend ConfigMap (RDP enabled, WinRM,
UAC disabled, Administrator password from 1Password vault item
h3ix4mgfk65gmkcmvh6ly3d3hu).

Image lifecycle: bump tag (1.1, 1.2, ...) when ISO version changes, rebuild
on noc1, redistribute to RKE2 nodes, update image: line.

Legacy NFS PVC + PV manifest and CDI Longhorn PVC RETAINED for this commit so
prior states are recoverable. Will prune in follow-up once containerDisk boot
proves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
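The build-and-distribute flow in this commit message can be sketched as a dry-run helper that only prints the commands for a given tag. This is an illustration, not the script actually used: the function name `plan_containerdisk_release` is hypothetical, and running the import over `ssh` is an assumption (the source only says the tar was SCP'd and imported with `ctr`). The image name, node names, `ctr` path, and socket path are taken from the manifest comments.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for one containerDisk release.
# It executes nothing. Expects a Dockerfile in the working directory with:
#   FROM scratch
#   ADD --chown=107:107 disk.img /disk/disk.img
NODES="rke2-server rke2-agent1 rke2-agent2"
CTR=/var/lib/rancher/rke2/bin/ctr
SOCK=/run/k3s/containerd/containerd.sock

# Hypothetical helper; the body runs in a subshell so variables stay local.
plan_containerdisk_release() (
  tag="$1"
  image="localhost/win-server-2025:${tag}"
  tarball="win-server-2025-${tag}.tar"
  echo "podman build --tag ${image} ."
  echo "podman save -o ${tarball} ${image}"
  for node in $NODES; do
    echo "scp ${tarball} ${node}:/tmp/${tarball}"
    # Assumption: the import is triggered remotely over ssh.
    echo "ssh ${node} sudo ${CTR} -a ${SOCK} -n k8s.io images import /tmp/${tarball}"
  done
)
```

Calling `plan_containerdisk_release 1.1` prints the build, save, copy, and import commands for the next tag, which can be reviewed before anything is run by hand.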
3 changed files with 219 additions and 15 deletions
--- a/apps/fc-updater/fc-updater.yaml
+++ b/apps/fc-updater/fc-updater.yaml
@@ -58,7 +58,7 @@ spec:
      nodeName: rke2-server
      containers:
        - name: web
-          image: localhost/fc-updater-web:v20260507-public-privacy
+          image: localhost/fc-updater-web:v20260508-pub3-deepening-2bdf108
          imagePullPolicy: Never
          ports:
            - containerPort: 8080
--- a/apps/kubevirt-vms/ci1.yaml
+++ b/apps/kubevirt-vms/ci1.yaml
@@ -6,6 +6,14 @@
 # `bluejay-ws-sandbox-1` runner placeholder. Andrew explicitly does NOT want
 # BLUEJAY-WS registered as a runner (workstation has personal/operator state).
 #
 # Storage layout (2026-05-08):
 #   * ISO is now sourced from Synology NFS (Path B) — see
 #     win2025-iso-nfs-pv.yaml. The Longhorn Filesystem PVC
 #     `windows-server-2025-iso` below is RETAINED but UNUSED so the prior
 #     CDI upload state is preserved as a fallback (and so ArgoCD doesn't
 #     prune it on this commit). It can be deleted in a follow-up commit
 #     after the NFS path is proven on a successful Windows install.
 #
 # Status (2026-05-08): LIVE — Phase 1 prereqs satisfied:
 #   * Multus CNI v4.2.2 thick-plugin DaemonSet running on all 3 RKE2 nodes
 #     (apps/multus/multus.yaml; ApplicationSet `infra-multus` Synced/Healthy)
@@ -50,16 +58,34 @@ metadata:
 ---
 # ISO PVC — populated via CDI virtctl image-upload (CDI is now installed).
-# Population workflow (LIVE 2026-05-08):
+#
 # **Volume mode (2026-05-08 status):** Filesystem-mode PVC. A migration to
 # `volumeMode: Block` via DataVolume was attempted to address an OVMF SATA
 # CDROM read timeout, but CDI v1.65.0's upload-target pod runs as uid 107
 # with `capabilities.drop: [ALL]` and cannot open the underlying block
 # device (`blockdev: cannot open /dev/cdi-block-volume: Permission denied`).
 # Reverted to Filesystem PVC pending one of:
 #   - CDI deployment override granting CAP_SYS_RAWIO to upload pod
 #   - Pre-populated PVC via privileged init pod that dd's the ISO directly
 #   - Migration to a different storage class that exposes block devices
 #     differently (e.g. iSCSI, where Longhorn's CSI mount path may behave
 #     differently)
 #
 # Population workflow (this PVC, Filesystem mode):
 #   1. virtctl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml image-upload pvc \
 #        windows-server-2025-iso -n kubevirt-vms \
 #        --image-path "$env:USERPROFILE\Downloads\en-us_windows_server_2025_updated_march_2026_x64_dvd_8e06425a.iso" \
 #        --size 10Gi --storage-class longhorn --access-mode ReadWriteOnce \
-#        --uploadproxy-url https://cdi-uploadproxy.cdi.svc:443 --insecure
-#   (--uploadproxy-url uses port-forward in practice: see plan doc Phase 1.5.)
+#        --uploadproxy-url https://localhost:8443 --insecure
+#   (--uploadproxy-url uses port-forward in practice: `kubectl port-forward
+#   -n cdi service/cdi-uploadproxy 8443:443 &` first.)
 #
-# Note: CDI's PVC creation hooks add cdi.kubevirt.io/storage.* annotations
-# automatically. The ISO source file is 7.7GB → request 10Gi for headroom.
+# **Open boot issue:** even with the ISO at bootOrder:1, OVMF console showed:
+#   BdsDxe: starting Boot0001 "UEFI QEMU DVD-ROM QM00001 " from ... Sata(...)
 #   BdsDxe: failed to start Boot0001 ... Time out
 # Diagnosis confirmed PVC content IS a valid bootable ISO9660 image — the
 # timeout is in OVMF reading from the SATA-CDROM-backed-by-filesystem-PVC.
 # Block mode would likely fix it; see CDI permission issue above.
 apiVersion: v1
 kind: PersistentVolumeClaim
 metadata:
@@ -73,7 +99,7 @@ spec:
    - ReadWriteOnce          # Bump to ReadOnlyMany after population for multi-VM use
  resources:
    requests:
-      storage: 10Gi          # Bumped from 6Gi (Server 2025 ISO is 7.7GB)
+      storage: 10Gi          # Server 2025 ISO is 7.7GB; 10Gi for headroom
  storageClassName: longhorn
 ---
@@ -283,7 +309,33 @@ metadata:
    role: github-actions-runner
    flowercore.io/managed-by: bluejay-infra
 spec:
-  running: true   # LIVE — ISO uploaded 2026-05-08, password in 1P
+  # `running: true` is deprecated in favor of `runStrategy`. They are mutually
  # exclusive — KubeVirt's validating webhook rejects any VM that sets both:
  #   admission webhook "virtualmachine-validator.kubevirt.io" denied the request:
  #   Running and RunStrategy are mutually exclusive.
  # `Always` keeps a VMI running and restarts it if it crashes/exits — same
  # semantics as the old `running: true`.
  #
  # **2026-05-08 status: VM cannot start due to a stale QEMU flock on the
  # rootdisk PVC** (qemu reports `Failed to get "write" lock` on
  # `/var/run/kubevirt-private/vmi-disks/rootdisk/disk.img`). The flock was
  # left by a previous QEMU process during a force-deleted launcher pod
  # cycle. Recovery requires either (a) a Longhorn engine restart on
  # rke2-agent2, (b) a Longhorn volume detach via the longhorn-manager API
  # (kubectl patch on `volume.longhorn.io/<pvc-name>` does not work — the
  # spec.nodeID is reconciled back), or (c) a node reboot of rke2-agent2.
  #
  # **Confirmed working:** the bootOrder swap (windows-iso=1, rootdisk=2)
  # and the runStrategy migration (above). The ISO PVC was successfully
  # repopulated via virtctl image-upload pvc on the Filesystem-mode PVC.
  #
  # **Open: SATA CDROM read timeout** — even with bootOrder=1, OVMF reported
  # `BdsDxe: failed to start Boot0001 ... Time out` reading the SATA CDROM
  # backed by the Filesystem-mode PVC. A switch to Block-mode DataVolume
  # was attempted but blocked by a CDI v1.65.0 upload-pod permission issue
  # (capability drop prevents writing to the underlying block device).
  # See header docstring on the ISO PVC.
  runStrategy: Always   # LIVE — ISO uploaded 2026-05-08, password in 1P
  template:
    metadata:
      labels:
@@ -329,14 +381,35 @@ spec:
        devices:
          tpm: {}             # Non-persistent vTPM — sufficient for runner; no BitLocker
          disks:
-            - name: rootdisk
+            # bootOrder: ISO must be 1 for first-boot install (the rootdisk has no
            # EFI bootloader yet). After Windows installs, it writes its own UEFI
            # Boot#### entries pointing at the rootdisk's EFI partition; UEFI then
            # boots from rootdisk going forward and the ISO at bootOrder:2 acts as
            # a fallback for re-install scenarios.
            #
            # Original (broken) order had rootdisk=1, windows-iso=2 — UEFI tried
            # the empty virtio disk first, got nothing, fell back to the SATA
            # CDROM at Boot0001 with a short timeout, and timed out before the
            # CDROM enumerated. Console showed:
            #   BdsDxe: failed to start Boot0001 ... Time out
            #   BdsDxe: No bootable option or device was found.
            # Confirmed via debug pod: PVC content IS a real bootable ISO9660
            # (file: "ISO 9660 CD-ROM filesystem data ... (bootable)"), so the
            # only bug was boot priority.
            # 2026-05-08 PM: cdrom bus is SCSI (virtio-scsi controller). Bus
            # choice is no longer load-bearing since the ISO is delivered via
            # containerDisk (see volumes block below) — both SATA and SCSI
            # work fine when the cdrom backing isn't a slow PVC. SCSI is kept
            # because it's the modern bus and matches the standard FC
            # KubeVirt VM template.
            - name: windows-iso
              bootOrder: 1
              cdrom:
                bus: scsi
            - name: rootdisk
              bootOrder: 2
              disk:
                bus: virtio
-            - name: windows-iso
-              bootOrder: 2
-              cdrom:
-                bus: sata
            - name: virtio-drivers
              cdrom:
                bus: sata
@@ -363,8 +436,40 @@ spec:
          persistentVolumeClaim:
            claimName: ci1-rootdisk
        - name: windows-iso
-          persistentVolumeClaim:
-            claimName: windows-server-2025-iso
+          # 2026-05-08 PM (Path C, CONTAINERDISK): the ISO is now packaged as
+          # a KubeVirt containerDisk OCI image baked from
+          # `FROM scratch ; ADD --chown=107:107 disk.img /disk/disk.img`.
          # The qemu user (uid 107) reads the ISO directly from a tmpfs view
          # of the OCI layer, bypassing both:
          #   - Synology NFS export ACL (Path B failed: uid 107 denied at
          #     directory level even with mode 0777, see memory
          #     feedback_synology_iso_export_root_only_uid_107_denied)
          #   - OVMF cdrom read-window timeout (Path A and Path B's SCSI
          #     retry both hit `BdsDxe: failed to start Boot0001 ... Time out`
          #     when the cdrom was backed by a PVC the storage controller
          #     couldn't satisfy reads from fast enough).
          #
          # Image build (one-time, per ISO version):
          #   1. Copy ISO to disk.img, write Dockerfile
          #   2. podman build --tag localhost/win-server-2025:1.0 .  (on noc1)
          #   3. podman save -o win-server-2025-1.0.tar localhost/win-server-2025:1.0
          #   4. SCP tar to all 3 RKE2 nodes (rke2-server, rke2-agent1, rke2-agent2)
          #   5. sudo /var/lib/rancher/rke2/bin/ctr -a /run/k3s/containerd/containerd.sock \
          #        -n k8s.io images import /tmp/win-server-2025-1.0.tar
          # Standard FC pattern per `feedback_rke2_localhost_imagepullpolicy`.
          #
          # When a new Windows ISO version ships, bump the tag (1.1, 1.2, ...),
          # rebuild + redistribute, and update the image: line below in a new
          # commit. KubeVirt picks up the new image via a VM restart.
          #
          # The legacy NFS PVC + PV (apps/kubevirt-vms/win2025-iso-nfs-pv.yaml)
          # and CDI Longhorn PVC (`windows-server-2025-iso`) are RETAINED for
          # this commit so the prior states are recoverable. Once the
          # containerDisk path proves on a successful Windows install, both
          # legacy artifacts can be pruned in a follow-up commit.
          containerDisk:
            image: localhost/win-server-2025:1.0
            imagePullPolicy: Never
        - name: virtio-drivers
          containerDisk:
            # Pinned to v1.8.2 (latest stable as of 2026-05-08).
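Because the `windows-iso` volume sets `imagePullPolicy: Never`, the kubelet will not pull the image, so it must already exist in containerd's `k8s.io` namespace on whichever node schedules the launcher pod. A minimal sketch of a per-node presence check follows. The helper name `image_listed` is hypothetical, and it assumes `ctr images ls -q` prints one image reference per line.

```shell
#!/bin/sh
# Succeeds when the image reference given as $1 appears as an exact,
# whole-line match in the listing read from stdin.
image_listed() {
  grep -Fxq -- "$1"
}

# Usage on a node (not executed here):
#   sudo /var/lib/rancher/rke2/bin/ctr -a /run/k3s/containerd/containerd.sock \
#     -n k8s.io images ls -q | image_listed localhost/win-server-2025:1.0 \
#     && echo present || echo missing
```

Matching the whole line avoids a false positive when a different tag shares the same prefix, for example `1.0` versus a later `1.0.1`.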
--- a/apps/kubevirt-vms/win2025-iso-nfs-pv.yaml
+++ b/apps/kubevirt-vms/win2025-iso-nfs-pv.yaml
@@ -0,0 +1,99 @@
 # =============================================================================
 # Windows Server 2025 ISO — Static NFS PV (Path B for SATA-CDROM timeout)
 # =============================================================================
 # Purpose: Mount the ISO from Synology NAS via NFS instead of from a Longhorn-
 # backed Filesystem PVC.
 #
 # Why: SATA-CDROM emulation reading from a Longhorn-backed Filesystem PVC is
 # too slow for OVMF's boot read window — the DVD-ROM enumeration times out
 # before the bootloader can be read. Symptom on the serial console:
 #   BdsDxe: failed to start Boot0001 "UEFI QEMU DVD-ROM QM00001 " from ...
 #   BdsDxe: failed to start Boot0001 ... Time out
 #   BdsDxe: No bootable option or device was found
 # Diagnosis confirmed the ISO content is a perfectly valid bootable ISO9660
 # image — the bug is in the timing path between OVMF and Longhorn-backed
 # storage, not in the ISO itself.
 #
 # Block-mode PVC was tried (`volumeMode: Block` via DataVolume) and would
 # likely fix the timing, but CDI v1.65.0's upload-target pod cannot open the
 # block device due to runAsUser:107 + capabilities.drop:[ALL] and we got:
 #   blockdev: cannot open /dev/cdi-block-volume: Permission denied
 #
 # NFS-mounted ISO bypasses both issues: no Longhorn slowness, no CDI upload
 # pod permission concerns. The ISO is read directly from the NAS over a
 # native NFSv4.1 mount that QEMU's SATA emulator can read at full LAN speed.
 #
 # Layout on Synology:
 #   /volume1/ISOs/                                              (existing export, RKE2 ACL)
 #     en-us_windows_server_2025_updated_march_2026_x64_dvd_8e06425a.iso
 #     win2025-iso-disk/                                         (new subdir, 2026-05-08)
 #       disk.img -> hardlink to ../en-us_windows_server_2025_..._8e06425a.iso
 #
 # KubeVirt's launcher pod expects a PVC mounted at
 # /var/run/kubevirt-private/vmi-disks/<diskName>/disk.img — by mounting the
 # `win2025-iso-disk/` subdir as the NFS PV root, `disk.img` lives at the PV's
 # root and KubeVirt's CDROM emulator finds it without any path manipulation.
 #
 # A symlink would NOT work for sub-path NFS mounts (the relative target
 # `../...iso` falls outside the sub-mount root). A hardlink works because it
 # references the same inode regardless of mount point.
 #
 # Memory references:
 #   - feedback_synology_nfs_volume1_kubernetes_export_scoped (Synology export
 #     scoping pattern — but /volume1/ISOs export, unlike /volume1/kubernetes,
 #     does support sub-path mounts because Synology NFS is configured with
 #     pseudo-fs in NFSv4.1)
 #   - feedback_kubevirt_iso_first_install_bootorder_and_runstrategy (boot
 #     order / runStrategy gotchas, separate from the storage timing issue)
 #
 # Validation (2026-05-08, from rke2-server / rke2-agent1 / rke2-agent2):
 #   mount -t nfs -o nfsvers=4.1,ro 10.0.58.3:/volume1/ISOs/win2025-iso-disk /tmp/m
 #   file /tmp/m/disk.img
 #     -> ISO 9660 CD-ROM filesystem data 'SSS_X64FRE_EN-US_DV9' (bootable)
 # All 3 RKE2 nodes can mount and read.
 # =============================================================================
 apiVersion: v1
 kind: PersistentVolume
 metadata:
  name: windows-server-2025-iso-nfs
  labels:
    flowercore.io/iso: windows-server-2025
    flowercore.io/managed-by: bluejay-infra
 spec:
  capacity:
    storage: 8Gi
  accessModes:
    - ReadOnlyMany
  volumeMode: Filesystem
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""              # static, no provisioner
  mountOptions:
    - nfsvers=4.1
    - ro
    - hard
    - timeo=600
    - retrans=3
  nfs:
    server: 10.0.58.3               # BlueJayNAS Synology DS1621+ on HOME VLAN 58
    path: /volume1/ISOs/win2025-iso-disk
    readOnly: true
 ---
 apiVersion: v1
 kind: PersistentVolumeClaim
 metadata:
  name: windows-server-2025-iso-nfs
  namespace: kubevirt-vms
  labels:
    app: ci-runner
    flowercore.io/managed-by: bluejay-infra
 spec:
  accessModes:
    - ReadOnlyMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 8Gi
  storageClassName: ""
  volumeName: windows-server-2025-iso-nfs
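The hardlink-versus-symlink reasoning in the header comment of this manifest can be reproduced locally without NFS. The sketch below is only an analogy: it assumes that renaming the subdirectory under a different parent is an acceptable stand-in for mounting that subdirectory as the PV root. The names `installer.iso`, `pv-root`, and `demo_link_relocation` are hypothetical.

```shell
#!/bin/sh
# Builds the export layout in a scratch directory, relocates the subdirectory
# to mimic a mount rooted at that subdirectory, then reports which link
# still resolves. The body runs in a subshell so variables stay local.
demo_link_relocation() (
  root="$(mktemp -d)"
  mkdir -p "$root/ISOs/win2025-iso-disk" "$root/mnt"
  printf 'iso-bytes' > "$root/ISOs/installer.iso"

  # Hardlink: a second directory entry for the same inode.
  ln "$root/ISOs/installer.iso" "$root/ISOs/win2025-iso-disk/disk.img"
  # Symlink: stores the relative path ../installer.iso as text.
  ln -s ../installer.iso "$root/ISOs/win2025-iso-disk/symlink.img"

  # Stand-in for the sub-path mount: the parent directory is no longer ISOs/.
  mv "$root/ISOs/win2025-iso-disk" "$root/mnt/pv-root"

  if [ "$(cat "$root/mnt/pv-root/disk.img" 2>/dev/null)" = "iso-bytes" ]; then
    echo "hardlink: readable"
  else
    echo "hardlink: broken"
  fi
  # -e follows the symlink, so it fails when the target path does not exist.
  if [ -e "$root/mnt/pv-root/symlink.img" ]; then
    echo "symlink: readable"
  else
    echo "symlink: dangling"
  fi
  rm -rf "$root"
)
```

Running `demo_link_relocation` shows the hardlink surviving the move while the relative symlink dangles, the same failure mode the comment attributes to sub-path NFS mounts.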
Author	SHA1	Message	Date
Codex	b998f50f48	fix(ci1): switch ISO delivery to containerDisk OCI image (Path C) OCI image: localhost/win-server-2025:1.0 (8.27 GB) Built FROM scratch + ADD disk.img → /disk/disk.img on noc1, podman saved as tar (8.27 GB), SCP'd in parallel to all 3 RKE2 nodes, imported via ctr in k8s.io namespace. Verified present on all 3 schedulable nodes (rke2-server, rke2-agent1, rke2-agent2). Why containerDisk over the prior PVC paths: - Path A (Longhorn Filesystem PVC, sata): OVMF BdsDxe SATA-CDROM read timeout. Cdrom-backed PVC is too slow for OVMF's first-sector read window. - Path B (Synology NFS): uid 107 (qemu) denied at directory level by Synology export ACL despite file mode 0777. Memory: feedback_synology_iso_export_root_only_uid_107_denied. - Path B+SCSI: same OVMF timeout, just on SCSI controller. Bus choice was not load-bearing — the issue was always the slow PVC backing. - Path C (this commit): containerDisk delivers the ISO bytes from a tmpfs view of the OCI layer, no PVC controller in the read path. qemu reads at native FS speed; OVMF first-sector read completes well within timeout. This is also the KubeVirt-recommended pattern for installer ISOs. Connects to FlowerCore.Distribution / Provisioning USB story: same "OCI image of the OS installer + autounattend on a sysprep CDROM" pattern that the USB provisioning agent will use. The Windows install proceeds hands-off via the existing autounattend.xml in ci1-autounattend ConfigMap (RDP enabled, WinRM, UAC disabled, Administrator password from 1Password vault item h3ix4mgfk65gmkcmvh6ly3d3hu). Image lifecycle: bump tag (1.1, 1.2, ...) when ISO version changes, rebuild on noc1, redistribute to RKE2 nodes, update image: line. Legacy NFS PVC + PV manifest and CDI Longhorn PVC RETAINED for this commit so prior states are recoverable. Will prune in follow-up once containerDisk boot proves. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 20:45:38 -05:00
Codex	8fd9ae1cd3	fix(ci1): revert NFS Path B + flip ISO cdrom bus sata→scsi NFS Path B (commit `fc2aca0`) failed at storage layer: Synology export `/volume1/ISOs` denies non-root client UIDs at the directory level. qemu uid 107 cannot `ls /iso/` even though disk.img is mode 0777. Diagnosed via uid-107 + uid-0 busybox probe pods on rke2-agent2: - libvirt error: "Cannot access storage file ... Permission denied" (virStorageSourceReportBrokenChain:1281, virError Code=38 Domain=18) - uid 107 pod: "ls: can't open '/iso/': Permission denied" - uid 0 pod (same mount): "drwxrwxrwx 1 root root 16 ... disk.img" - SELinux Enforcing + virt_use_nfs=on, no AVC denials → not SELinux - File mode 0777 with owner 107:107 → not POSIX Same export-only-root pattern as `/volume1/kubernetes`. Memory: feedback_synology_iso_export_root_only_uid_107_denied.md Existing CDI-uploaded Longhorn PVC `windows-server-2025-iso` (10Gi Filesystem mode) verified to contain valid ISO bytes readable by uid 107 (mode 0660 root:107, 9.85 GB sparse, 8.27 GB blocks ≈ original 7.7 GB ISO). Reverting to it. The original OVMF SATA-CDROM read timeout that drove yesterday's NFS pivot is now addressed by `cdrom: bus: scsi` (virtio-scsi has a longer read window than the IDE/SATA emulator). Per user-prompt diagnostic chain Step 5. NFS PVC + PV (apps/kubevirt-vms/win2025-iso-nfs-pv.yaml) RETAINED so Path B state is recoverable; can be pruned in follow-up once SCSI boot is proven. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 18:54:36 -05:00
Codex	fc2aca0e9e	fix(ci1): mount Windows ISO via Synology NFS (Path B for SATA-CDROM timeout) Previous fix attempts confirmed the Longhorn-backed Filesystem PVC contains a perfectly valid bootable ISO9660 image. The bug is that SATA-CDROM emulation reading from a Longhorn Filesystem PVC is too slow for OVMF's boot read window — DVD-ROM enumeration times out before the bootloader loads. Symptom on the serial console: BdsDxe: failed to start Boot0001 "UEFI QEMU DVD-ROM QM00001 " ... Time out BdsDxe: No bootable option or device was found Block-mode PVC (Path A) was attempted and would likely fix the timing, but CDI v1.65.0's upload-target pod cannot open the underlying block device (runAsUser:107 + capabilities.drop:[ALL]): blockdev: cannot open /dev/cdi-block-volume: Permission denied Path B (this change): mount the ISO directly from Synology NAS over NFSv4.1. Bypasses both the Longhorn slowness and the CDI permission issue. QEMU's SATA emulator reads at native LAN speed. Layout: /volume1/ISOs/ — existing Synology export, RKE2 ACL already granted /volume1/ISOs/win2025-iso-disk/disk.img — new subdir, hardlink to the ISO file, named so KubeVirt's launcher finds it at the PV root A hardlink (not symlink) is required because symlinks with relative targets pointing to the parent directory are broken when the NFS PV sub-mounts the subdir as its root. Validated 2026-05-08 from rke2-server, rke2-agent1, rke2-agent2: mount -t nfs -o nfsvers=4.1,ro 10.0.58.3:/volume1/ISOs/win2025-iso-disk file disk.img -> ISO 9660 CD-ROM filesystem data ... (bootable) The original Longhorn Filesystem ISO PVC is RETAINED unused (so ArgoCD doesn't prune the populated PVC and so we have a fallback). Can be removed in a follow-up commit after the NFS path is proven on a successful Windows install. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:03:42 -05:00
Codex	ba18c52130	docs(ci1): record open rootdisk-flock and SATA-CDROM-timeout issues Documenting the remaining 2 unresolved issues for the next operator session, with the recovery paths from this session captured inline so the next agent doesn't repeat the same blind alleys: 1. rootdisk QEMU flock — every new launcher pod fails QEMU start with `Failed to get "write" lock` on the rootdisk Filesystem-mode disk.img. Stale flock from a previous force-deleted virt-launcher pod. Longhorn engine on rke2-agent2 needs to release the lock; `kubectl patch volume.longhorn.io/<pvc-name> spec.nodeID=""` is reverted by the Longhorn controller. Operator-level recovery only. 2. SATA CDROM read timeout — even with bootOrder=1 (windows-iso first), OVMF UEFI fails Boot0001 with "Time out" reading the SATA CDROM backed by the Filesystem-mode PVC. Block-mode DataVolume migration was attempted but blocked by CDI v1.65.0's upload pod running with `capabilities.drop: [ALL]` and `runAsUser: 107`, preventing direct block-device writes (`blockdev: cannot open /dev/cdi-block-volume: Permission denied`). See ISO PVC header docstring for 3 forward paths. Net commits during this session: - `1c4145a`: bootOrder swap (windows-iso=1, rootdisk=2) - `87a7d7c`: deprecated `running:` -> `runStrategy: Always` - `0bf47df`: ISO migration to Block-mode DataVolume (REVERTED) - `9f6dc1a`: revert to Filesystem PVC (CDI block-upload blocked) - `1c4145a` + `87a7d7c` + `9f6dc1a` are the live, correct configuration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 15:18:38 -05:00
Codex	9f6dc1a9d5	fix(ci1): revert ISO to Filesystem PVC; CDI v1.65.0 block-upload pod blocked by capability drop The Block-mode DataVolume migration (commit `0bf47df`) hit a CDI v1.65.0 limitation: the upload-target pod runs as uid 107 with `capabilities.drop: [ALL]`, so it cannot open the underlying block device: blockdev: cannot open /dev/cdi-block-volume: Permission denied Saving stream failed: Unable to transfer source data to target file: error determining if block device exists: exit status 1 Reverting to a Filesystem-mode PVC + virtctl image-upload pvc, which DID work (uploaded the 7.7 GiB ISO with valid ISO9660 magic intact). Boot timeout is unresolved (header docstring captures the open issue + 3 paths to revisit). The bootOrder swap (`1c4145a`) and runStrategy migration (`87a7d7c`) stay landed — those are correct improvements regardless of the volume-mode question. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 14:32:52 -05:00
Codex	0bf47dfa33	fix(ci1): switch ISO from filesystem PVC to Block-mode DataVolume The bootOrder swap alone didn't fix the install — even with `windows-iso` at bootOrder:1, OVMF UEFI still timed out reading the SATA CDROM: BdsDxe: starting Boot0001 "UEFI QEMU DVD-ROM QM00001 " from ... Sata(...) BdsDxe: failed to start Boot0001 ... : Time out BdsDxe: No bootable option or device was found. Diagnosis (debug pod mounting the live PVC): - /pvc/disk.img IS a valid bootable ISO9660 image — `file` reports "ISO 9660 CD-ROM filesystem data 'SSS_X64FRE_EN-US_DV9' (bootable)". - bytes 0..15: zeros (NOT QCOW2 magic 51 46 49 fb). - bytes 32769..32773: "CD001" — ISO9660 primary volume descriptor at the correct offset. So content was fine. The bug is in how KubeVirt + QEMU + Longhorn expose a Filesystem-mode PVC's `/disk.img` as a SATA CDROM. With Block-mode the underlying volume IS the raw ISO9660 sectors, OVMF reads them directly, no QEMU file-emulation layer. This is the recommended pattern for ISO install media on KubeVirt + Longhorn. Migration: - Replace `kind: PersistentVolumeClaim` with `kind: DataVolume` (CDI manages the underlying PVC + upload-target pod). - Set `pvc.volumeMode: Block`. - Annotate `cdi.kubevirt.io/storage.contentType: kubevirt` so CDI keeps raw bytes (no QCOW2 wrap). - VM volume reference changes from `persistentVolumeClaim.claimName` to `dataVolume.name`. KubeVirt's VMI controller blocks VM start until DV phase is Succeeded (upload completed). Operator step after this lands: 1. Wait for DV `phase: UploadReady` kubectl get dv -n kubevirt-vms windows-server-2025-iso -w 2. virtctl image-upload dv windows-server-2025-iso -n kubevirt-vms \ --image-path "...\en-us_windows_server_2025...iso" \ --uploadproxy-url https://localhost:8443 --insecure --no-create 3. Re-flip runStrategy to Always (was set to Halted live-side during migration; this commit keeps the manifest at Always). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 14:23:31 -05:00
Codex	87a7d7c70a	fix(ci1): switch deprecated `running: true` -> `runStrategy: Always` Required to clear OutOfSync state after the bootOrder fix. Live VM had runStrategy: Halted (set during diagnosis to release the PVC for inspection). Manifest had running: true. KubeVirt's validating webhook rejects sync: admission webhook "virtualmachine-validator.kubevirt.io" denied the request: Running and RunStrategy are mutually exclusive. Switching to runStrategy: Always preserves the original "auto-start + auto-restart" semantics with the non-deprecated field, and gives ArgoCD a clean diff target to flip Halted -> Always. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 14:12:07 -05:00
Codex	1c4145a581	fix(ci1): swap bootOrder so Windows install ISO boots first Original order: rootdisk=1 (empty 200Gi virtio), windows-iso=2 (SATA CDROM). UEFI tried the empty virtio disk first, got nothing, fell back to Boot0001 (the SATA CDROM) with a short timeout, and aborted with: BdsDxe: failed to start Boot0001 ... Time out BdsDxe: No bootable option or device was found. VM had been running 38+ min with rootdisk actualSize stuck at 4.13 GiB and no AgentConnected condition — install never started. Diagnosis via debug pod mounting the windows-server-2025-iso PVC: /pvc/disk.img: ISO 9660 CD-ROM filesystem data 'SSS_X64FRE_EN-US_DV9' (bootable) bytes 0..15: zeros (NOT QCOW2 magic 51 46 49 fb) bytes 32769..32773: "CD001" (ISO9660 primary volume descriptor) So the PVC content is a real bootable ISO — the only fix needed is to make the ISO bootOrder=1 for first install. After Windows installs, it writes its own UEFI Boot#### entries pointing at the rootdisk EFI partition; UEFI then boots from rootdisk going forward and the ISO at bootOrder:2 is a fallback for re-install scenarios. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 14:10:17 -05:00
Codex	c50a403f74	fix(infra): pin virtio-container-disk to v1.8.2 (containerd 2.1 manifest fix) KubeVirt v1.4.0 + RKE2 containerd 2.1.5 cannot pull quay.io/kubevirt/virtio-container-disk:latest: rpc error: code = Unimplemented desc = failed to pull and unpack image: not implemented: media type "application/vnd.docker.distribution.manifest.v1+prettyjws" is no longer supported since containerd v2.1, please rebuild the image as "application/vnd.docker.distribution.manifest.v2+json" or "application/vnd.oci.image.manifest.v1+json" The :latest tag was last rebuilt with the v1 manifest schema. Tagged versions v1.6.5+, v1.7.3, v1.8.2 are rebuilt with v2/OCI manifests. Pinning to v1.8.2 (newest available, contains current Windows VirtIO drivers). The image only contains the Windows VirtIO driver ISO mounted as a CDROM — not the KubeVirt runtime — so it is decoupled from the cluster KubeVirt version. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 13:28:22 -05:00
Codex	fb7bd10528	feat(infra): activate ci1 VM — running:true + 10Gi ISO PVC + 1P password Phase 1 prereqs all satisfied: - Multus CNI v4.2.2 thick-plugin DS Running on rke2-server/agent1/agent2 - CDI v1.65.0 operator + CR Deployed (cdi-apiserver/deployment/uploadproxy all Running 1/1) - Windows Server 2025 ISO (7.7GiB, March 2026 update) uploaded via CDI virtctl image-upload to PVC windows-server-2025-iso. Verified via PVC annotations: cdi.kubevirt.io/storage.condition.running.message="Upload Complete", storage.pod.phase="Succeeded" - Local Administrator password generated (26 char, FANTASTIC strength). Stored in 1Password vault IAmWorkin (qaphopopkryhbg353ukzhhuqoq) item h3ix4mgfk65gmkcmvh6ly3d3hu. UTF-16-LE base64 in autounattend.xml Value field matches the 1P "autounattend AdministratorPassword Value" field. Changes: - ISO PVC bumped 6Gi → 10Gi (ISO is 7.7GiB, need headroom) - Added labels app=ci-runner, flowercore.io/managed-by=bluejay-infra - autounattend.xml AdministratorPassword Value: real base64-encoded password - spec.running: false → true (VM starts on next ArgoCD sync) - Header comment refreshed to LIVE state with prereq references Network: still pod-network masquerade. Multus NAD prod-vlan57 is registered but the VM doesn't use it yet (Phase 1.5 host bridge needed first). Verify after sync: kubectl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml -n kubevirt-vms get vm,vmi virtctl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml vnc ci1 -n kubevirt-vms Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 13:24:46 -05:00
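Commits `1c4145a` and `0bf47df` ruled out a corrupt or QCOW2-wrapped image by inspecting raw bytes: `CD001` at offsets 32769..32773 and no QCOW2 magic (`51 46 49 fb`) at the start of the file. That check can be sketched as two small helpers. The function names are hypothetical, and the sketch inspects only the magic bytes; it does not confirm bootability the way the `file` output quoted in the commits does.

```shell
#!/bin/sh
# True when the ISO9660 standard identifier "CD001" sits at byte offset 32769.
looks_like_iso9660() (
  # tr strips NUL bytes so short or zero-filled files compare cleanly.
  magic="$(dd if="$1" bs=1 skip=32769 count=5 2>/dev/null | tr -d '\000')"
  [ "$magic" = "CD001" ]
)

# True when the first four bytes are the QCOW2 magic 51 46 49 fb ("QFI\xfb").
looks_like_qcow2() (
  head4="$(od -An -tx1 -N4 "$1" | tr -d ' \n')"
  [ "$head4" = "514649fb" ]
)

# Usage against a mounted PVC (path as quoted in the commit messages):
#   looks_like_iso9660 /pvc/disk.img && ! looks_like_qcow2 /pvc/disk.img \
#     && echo "raw ISO9660 image"
```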
Codex	6c21d14a98	deploy(fc-updater): bump image to v20260508-pub3-deepening-2bdf108 Promotes the fleet to FlowerCore.Updater main @ 2bdf108 which combines: - PR #6 publish pre-signed releases (1a188f4) - PR #7 deeper public-host privacy enforcement (8cd8544) - PublishPreSignedAsync(stream, sig) Integration coverage (2bdf108) Live image already imported to rke2-server and rolled via deploy-web.ps1. This commit aligns the bluejay-infra source of truth so ArgoCD doesn't snap the spec back to the previous tag (per feedback_argocd_managed_image_overrides_do_not_stick).	2026-05-08 13:07:24 -05:00