OCI image: localhost/win-server-2025:1.0 (8.27 GB)
Built FROM scratch + ADD disk.img → /disk/disk.img on noc1, podman
saved as tar (8.27 GB), SCP'd in parallel to all 3 RKE2 nodes,
imported via ctr in k8s.io namespace. Verified present on all 3
schedulable nodes (rke2-server, rke2-agent1, rke2-agent2).
Why containerDisk over the prior PVC paths:
- Path A (Longhorn Filesystem PVC, sata): OVMF BdsDxe SATA-CDROM
read timeout. Cdrom-backed PVC is too slow for OVMF's first-sector
read window.
- Path B (Synology NFS): uid 107 (qemu) denied at directory level by
Synology export ACL despite file mode 0777. Memory:
feedback_synology_iso_export_root_only_uid_107_denied.
- Path B+SCSI: same OVMF timeout, just on SCSI controller. Bus
choice was not load-bearing — the issue was always the slow PVC
backing.
- Path C (this commit): containerDisk delivers the ISO bytes from
a tmpfs view of the OCI layer, no PVC controller in the read path.
qemu reads at native FS speed; OVMF first-sector read completes
well within timeout. This is also the KubeVirt-recommended pattern
for installer ISOs.
Connects to FlowerCore.Distribution / Provisioning USB story: same
"OCI image of the OS installer + autounattend on a sysprep CDROM"
pattern that the USB provisioning agent will use. The Windows
install proceeds hands-off via the existing autounattend.xml in
ci1-autounattend ConfigMap (RDP enabled, WinRM, UAC disabled,
Administrator password from 1Password vault item
h3ix4mgfk65gmkcmvh6ly3d3hu).
Image lifecycle: bump tag (1.1, 1.2, ...) when ISO version changes,
rebuild on noc1, redistribute to RKE2 nodes, update image: line.
Legacy NFS PVC + PV manifest and CDI Longhorn PVC RETAINED for this
commit so prior states are recoverable. Will prune in follow-up
once containerDisk boot proves.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NFS Path B (commit fc2aca0) failed at storage layer: Synology export
`/volume1/ISOs` denies non-root client UIDs at the directory level.
qemu uid 107 cannot `ls /iso/` even though disk.img is mode 0777.
Diagnosed via uid-107 + uid-0 busybox probe pods on rke2-agent2:
- libvirt error: "Cannot access storage file ... Permission denied"
(virStorageSourceReportBrokenChain:1281, virError Code=38 Domain=18)
- uid 107 pod: "ls: can't open '/iso/': Permission denied"
- uid 0 pod (same mount): "drwxrwxrwx 1 root root 16 ... disk.img"
- SELinux Enforcing + virt_use_nfs=on, no AVC denials → not SELinux
- File mode 0777 with owner 107:107 → not POSIX
Same export-only-root pattern as `/volume1/kubernetes`. Memory:
feedback_synology_iso_export_root_only_uid_107_denied.md
Existing CDI-uploaded Longhorn PVC `windows-server-2025-iso` (10Gi
Filesystem mode) verified to contain valid ISO bytes readable by
uid 107 (mode 0660 root:107, 9.85 GB sparse, 8.27 GB blocks ≈
original 7.7 GB ISO). Reverting to it.
The original OVMF SATA-CDROM read timeout that drove yesterday's
NFS pivot is now addressed by `cdrom: bus: scsi` (virtio-scsi has
a longer read window than the IDE/SATA emulator). Per user-prompt
diagnostic chain Step 5.
NFS PVC + PV (apps/kubevirt-vms/win2025-iso-nfs-pv.yaml) RETAINED
so Path B state is recoverable; can be pruned in follow-up once
SCSI boot is proven.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous fix attempts confirmed the Longhorn-backed Filesystem PVC contains
a perfectly valid bootable ISO9660 image. The bug is that SATA-CDROM
emulation reading from a Longhorn Filesystem PVC is too slow for OVMF's
boot read window — DVD-ROM enumeration times out before the bootloader
loads. Symptom on the serial console:
BdsDxe: failed to start Boot0001 "UEFI QEMU DVD-ROM QM00001 " ... Time out
BdsDxe: No bootable option or device was found
Block-mode PVC (Path A) was attempted and would likely fix the timing,
but CDI v1.65.0's upload-target pod cannot open the underlying block
device (runAsUser:107 + capabilities.drop:[ALL]):
blockdev: cannot open /dev/cdi-block-volume: Permission denied
Path B (this change): mount the ISO directly from Synology NAS over
NFSv4.1. Bypasses both the Longhorn slowness and the CDI permission
issue. QEMU's SATA emulator reads at native LAN speed.
Layout:
/volume1/ISOs/ — existing Synology export, RKE2 ACL already granted
/volume1/ISOs/win2025-iso-disk/disk.img — new subdir, hardlink to the
ISO file, named so KubeVirt's launcher finds it at the PV root
A hardlink (not symlink) is required because symlinks with relative
targets pointing to the parent directory are broken when the NFS PV
sub-mounts the subdir as its root.
Validated 2026-05-08 from rke2-server, rke2-agent1, rke2-agent2:
mount -t nfs -o nfsvers=4.1,ro 10.0.58.3:/volume1/ISOs/win2025-iso-disk
file disk.img -> ISO 9660 CD-ROM filesystem data ... (bootable)
The original Longhorn Filesystem ISO PVC is RETAINED unused (so ArgoCD
doesn't prune the populated PVC and so we have a fallback). Can be
removed in a follow-up commit after the NFS path is proven on a
successful Windows install.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documenting the remaining 2 unresolved issues for the next operator
session, with the recovery paths from this session captured inline so
the next agent doesn't repeat the same blind alleys:
1. **rootdisk QEMU flock** — every new launcher pod fails QEMU start with
`Failed to get "write" lock` on the rootdisk Filesystem-mode disk.img.
Stale flock from a previous force-deleted virt-launcher pod. Longhorn
engine on rke2-agent2 needs to release the lock; `kubectl patch
volume.longhorn.io/<pvc-name> spec.nodeID=""` is reverted by the
Longhorn controller. Operator-level recovery only.
2. **SATA CDROM read timeout** — even with bootOrder=1 (windows-iso first),
OVMF UEFI fails Boot0001 with "Time out" reading the SATA CDROM backed
by the Filesystem-mode PVC. Block-mode DataVolume migration was
attempted but blocked by CDI v1.65.0's upload pod running with
`capabilities.drop: [ALL]` and `runAsUser: 107`, preventing direct
block-device writes (`blockdev: cannot open /dev/cdi-block-volume:
Permission denied`). See ISO PVC header docstring for 3 forward paths.
Net commits during this session:
- 1c4145a: bootOrder swap (windows-iso=1, rootdisk=2)
- 87a7d7c: deprecated `running:` -> `runStrategy: Always`
- 0bf47df: ISO migration to Block-mode DataVolume (REVERTED)
- 9f6dc1a: revert to Filesystem PVC (CDI block-upload blocked)
- 1c4145a + 87a7d7c + 9f6dc1a are the live, correct configuration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Block-mode DataVolume migration (commit 0bf47df) hit a CDI v1.65.0 limitation:
the upload-target pod runs as uid 107 with `capabilities.drop: [ALL]`, so it
cannot open the underlying block device:
blockdev: cannot open /dev/cdi-block-volume: Permission denied
Saving stream failed: Unable to transfer source data to target file:
error determining if block device exists: exit status 1
Reverting to a Filesystem-mode PVC + virtctl image-upload pvc, which DID work
(uploaded the 7.7 GiB ISO with valid ISO9660 magic intact). Boot timeout is
unresolved (header docstring captures the open issue + 3 paths to revisit).
The bootOrder swap (1c4145a) and runStrategy migration (87a7d7c) stay landed —
those are correct improvements regardless of the volume-mode question.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bootOrder swap alone didn't fix the install — even with `windows-iso` at
bootOrder:1, OVMF UEFI still timed out reading the SATA CDROM:
BdsDxe: starting Boot0001 "UEFI QEMU DVD-ROM QM00001 " from ... Sata(...)
BdsDxe: failed to start Boot0001 ... : Time out
BdsDxe: No bootable option or device was found.
Diagnosis (debug pod mounting the live PVC):
- /pvc/disk.img IS a valid bootable ISO9660 image — `file` reports
"ISO 9660 CD-ROM filesystem data 'SSS_X64FRE_EN-US_DV9' (bootable)".
- bytes 0..15: zeros (NOT QCOW2 magic 51 46 49 fb).
- bytes 32769..32773: "CD001" — ISO9660 primary volume descriptor at the
correct offset.
So content was fine. The bug is in how KubeVirt + QEMU + Longhorn expose a
Filesystem-mode PVC's `/disk.img` as a SATA CDROM. With Block-mode the
underlying volume IS the raw ISO9660 sectors, OVMF reads them directly,
no QEMU file-emulation layer. This is the recommended pattern for ISO
install media on KubeVirt + Longhorn.
Migration:
- Replace `kind: PersistentVolumeClaim` with `kind: DataVolume` (CDI manages
the underlying PVC + upload-target pod).
- Set `pvc.volumeMode: Block`.
- Annotate `cdi.kubevirt.io/storage.contentType: kubevirt` so CDI keeps raw
bytes (no QCOW2 wrap).
- VM volume reference changes from `persistentVolumeClaim.claimName` to
`dataVolume.name`. KubeVirt's VMI controller blocks VM start until DV
phase is Succeeded (upload completed).
Operator step after this lands:
1. Wait for DV `phase: UploadReady`
kubectl get dv -n kubevirt-vms windows-server-2025-iso -w
2. virtctl image-upload dv windows-server-2025-iso -n kubevirt-vms \
--image-path "...\en-us_windows_server_2025...iso" \
--uploadproxy-url https://localhost:8443 --insecure --no-create
3. Re-flip runStrategy to Always (was set to Halted live-side during
migration; this commit keeps the manifest at Always).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Required to clear OutOfSync state after the bootOrder fix. Live VM had
runStrategy: Halted (set during diagnosis to release the PVC for inspection).
Manifest had running: true. KubeVirt's validating webhook rejects sync:
admission webhook "virtualmachine-validator.kubevirt.io" denied the request:
Running and RunStrategy are mutually exclusive.
Switching to runStrategy: Always preserves the original "auto-start +
auto-restart" semantics with the non-deprecated field, and gives ArgoCD a
clean diff target to flip Halted -> Always.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Original order: rootdisk=1 (empty 200Gi virtio), windows-iso=2 (SATA CDROM).
UEFI tried the empty virtio disk first, got nothing, fell back to Boot0001
(the SATA CDROM) with a short timeout, and aborted with:
BdsDxe: failed to start Boot0001 ... Time out
BdsDxe: No bootable option or device was found.
VM had been running 38+ min with rootdisk actualSize stuck at 4.13 GiB and
no AgentConnected condition — install never started.
Diagnosis via debug pod mounting the windows-server-2025-iso PVC:
/pvc/disk.img: ISO 9660 CD-ROM filesystem data 'SSS_X64FRE_EN-US_DV9' (bootable)
bytes 0..15: zeros (NOT QCOW2 magic 51 46 49 fb)
bytes 32769..32773: "CD001" (ISO9660 primary volume descriptor)
So the PVC content is a real bootable ISO — the only fix needed is to make
the ISO bootOrder=1 for first install. After Windows installs, it writes its
own UEFI Boot#### entries pointing at the rootdisk EFI partition; UEFI then
boots from rootdisk going forward and the ISO at bootOrder:2 is a fallback
for re-install scenarios.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KubeVirt v1.4.0 + RKE2 containerd 2.1.5 cannot pull
quay.io/kubevirt/virtio-container-disk:latest:
rpc error: code = Unimplemented
desc = failed to pull and unpack image: not implemented:
media type "application/vnd.docker.distribution.manifest.v1+prettyjws"
is no longer supported since containerd v2.1, please rebuild the image as
"application/vnd.docker.distribution.manifest.v2+json" or
"application/vnd.oci.image.manifest.v1+json"
The :latest tag was last rebuilt with the v1 manifest schema. Tagged versions
v1.6.5+, v1.7.3, v1.8.2 are rebuilt with v2/OCI manifests.
Pinning to v1.8.2 (newest available, contains current Windows VirtIO drivers).
The image only contains the Windows VirtIO driver ISO mounted as a CDROM —
not the KubeVirt runtime — so it is decoupled from the cluster KubeVirt
version.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 prereqs all satisfied:
- Multus CNI v4.2.2 thick-plugin DS Running on rke2-server/agent1/agent2
- CDI v1.65.0 operator + CR Deployed (cdi-apiserver/deployment/uploadproxy
all Running 1/1)
- Windows Server 2025 ISO (7.7GiB, March 2026 update) uploaded via CDI
virtctl image-upload to PVC windows-server-2025-iso. Verified via PVC
annotations: cdi.kubevirt.io/storage.condition.running.message="Upload
Complete", storage.pod.phase="Succeeded"
- Local Administrator password generated (26 char, FANTASTIC strength).
Stored in 1Password vault IAmWorkin (qaphopopkryhbg353ukzhhuqoq) item
h3ix4mgfk65gmkcmvh6ly3d3hu. UTF-16-LE base64 in autounattend.xml Value
field matches the 1P "autounattend AdministratorPassword Value" field.
Changes:
- ISO PVC bumped 6Gi → 10Gi (ISO is 7.7GiB, need headroom)
- Added labels app=ci-runner, flowercore.io/managed-by=bluejay-infra
- autounattend.xml AdministratorPassword Value: real base64-encoded password
- spec.running: false → true (VM starts on next ArgoCD sync)
- Header comment refreshed to LIVE state with prereq references
Network: still pod-network masquerade. Multus NAD prod-vlan57 is registered
but the VM doesn't use it yet (Phase 1.5 host bridge needed first).
Verify after sync:
kubectl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml -n kubevirt-vms get vm,vmi
virtctl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml vnc ci1 -n kubevirt-vms
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stages a draft VirtualMachine + Namespace + ISO PVC + rootdisk PVC + sysprep
ConfigMap for the dedicated GitHub Actions self-hosted runner that replaces
the never-registered bluejay-ws-sandbox-1 placeholder.
Status: STAGED ONLY. spec.running = false. ISO PVC empty. Two operator
decisions still pending before this can boot:
1. Network choice — pod-network fallback (in this draft) vs Multus +
PROD VLAN NAD (preferred, requires Multus install).
2. ISO path — manual upload via helper pod (Path A) vs CDI HTTP import
(Path B, requires CDI install).
Cluster baseline 2026-05-08:
- KubeVirt operator: installed, healthy, 14d
- CDI: NOT installed
- Multus: NOT installed
- Calico-only CNI
See docs/infrastructure/windows-server-build-runner-plan.md "Phase 1 readiness
gate" for the full operator pickup checklist.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>