Commit Graph

11 Commits

Author SHA1 Message Date
Codex
b998f50f48 fix(ci1): switch ISO delivery to containerDisk OCI image (Path C)
OCI image: localhost/win-server-2025:1.0 (8.27 GB)
Built FROM scratch + ADD disk.img → /disk/disk.img on noc1, podman
saved as tar (8.27 GB), SCP'd in parallel to all 3 RKE2 nodes,
imported via ctr in k8s.io namespace. Verified present on all 3
schedulable nodes (rke2-server, rke2-agent1, rke2-agent2).

Why containerDisk over the prior PVC paths:
  - Path A (Longhorn Filesystem PVC, sata): OVMF BdsDxe SATA-CDROM
    read timeout. Cdrom-backed PVC is too slow for OVMF's first-sector
    read window.
  - Path B (Synology NFS): uid 107 (qemu) denied at directory level by
    Synology export ACL despite file mode 0777. Memory:
    feedback_synology_iso_export_root_only_uid_107_denied.
  - Path B+SCSI: same OVMF timeout, just on SCSI controller. Bus
    choice was not load-bearing — the issue was always the slow PVC
    backing.
  - Path C (this commit): containerDisk delivers the ISO bytes from
    a tmpfs view of the OCI layer, no PVC controller in the read path.
    qemu reads at native FS speed; OVMF first-sector read completes
    well within timeout. This is also the KubeVirt-recommended pattern
    for installer ISOs.

Connects to FlowerCore.Distribution / Provisioning USB story: same
"OCI image of the OS installer + autounattend on a sysprep CDROM"
pattern that the USB provisioning agent will use. The Windows
install proceeds hands-off via the existing autounattend.xml in
ci1-autounattend ConfigMap (RDP enabled, WinRM, UAC disabled,
Administrator password from 1Password vault item
h3ix4mgfk65gmkcmvh6ly3d3hu).

Image lifecycle: bump tag (1.1, 1.2, ...) when ISO version changes,
rebuild on noc1, redistribute to RKE2 nodes, update image: line.

Legacy NFS PVC + PV manifest and CDI Longhorn PVC RETAINED for this
commit so prior states are recoverable. Will prune in follow-up
once containerDisk boot proves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 20:45:38 -05:00
Codex
8fd9ae1cd3 fix(ci1): revert NFS Path B + flip ISO cdrom bus sata→scsi
NFS Path B (commit fc2aca0) failed at storage layer: Synology export
`/volume1/ISOs` denies non-root client UIDs at the directory level.
qemu uid 107 cannot `ls /iso/` even though disk.img is mode 0777.

Diagnosed via uid-107 + uid-0 busybox probe pods on rke2-agent2:
- libvirt error: "Cannot access storage file ... Permission denied"
  (virStorageSourceReportBrokenChain:1281, virError Code=38 Domain=18)
- uid 107 pod: "ls: can't open '/iso/': Permission denied"
- uid 0 pod (same mount): "drwxrwxrwx 1 root root 16 ... disk.img"
- SELinux Enforcing + virt_use_nfs=on, no AVC denials → not SELinux
- File mode 0777 with owner 107:107 → not POSIX

Same export-only-root pattern as `/volume1/kubernetes`. Memory:
feedback_synology_iso_export_root_only_uid_107_denied.md

Existing CDI-uploaded Longhorn PVC `windows-server-2025-iso` (10Gi
Filesystem mode) verified to contain valid ISO bytes readable by
uid 107 (mode 0660 root:107, 9.85 GB sparse, 8.27 GB blocks ≈
original 7.7 GB ISO). Reverting to it.

The original OVMF SATA-CDROM read timeout that drove yesterday's
NFS pivot is now addressed by `cdrom: bus: scsi` (virtio-scsi has
a longer read window than the IDE/SATA emulator). Per user-prompt
diagnostic chain Step 5.

NFS PVC + PV (apps/kubevirt-vms/win2025-iso-nfs-pv.yaml) RETAINED
so Path B state is recoverable; can be pruned in follow-up once
SCSI boot is proven.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 18:54:36 -05:00
Codex
fc2aca0e9e fix(ci1): mount Windows ISO via Synology NFS (Path B for SATA-CDROM timeout)
Previous fix attempts confirmed the Longhorn-backed Filesystem PVC contains
a perfectly valid bootable ISO9660 image. The bug is that SATA-CDROM
emulation reading from a Longhorn Filesystem PVC is too slow for OVMF's
boot read window — DVD-ROM enumeration times out before the bootloader
loads. Symptom on the serial console:
  BdsDxe: failed to start Boot0001 "UEFI QEMU DVD-ROM QM00001 " ... Time out
  BdsDxe: No bootable option or device was found

Block-mode PVC (Path A) was attempted and would likely fix the timing,
but CDI v1.65.0's upload-target pod cannot open the underlying block
device (runAsUser:107 + capabilities.drop:[ALL]):
  blockdev: cannot open /dev/cdi-block-volume: Permission denied

Path B (this change): mount the ISO directly from Synology NAS over
NFSv4.1. Bypasses both the Longhorn slowness and the CDI permission
issue. QEMU's SATA emulator reads at native LAN speed.

Layout:
  /volume1/ISOs/ — existing Synology export, RKE2 ACL already granted
  /volume1/ISOs/win2025-iso-disk/disk.img — new subdir, hardlink to the
    ISO file, named so KubeVirt's launcher finds it at the PV root

A hardlink (not symlink) is required because symlinks with relative
targets pointing to the parent directory are broken when the NFS PV
sub-mounts the subdir as its root.

Validated 2026-05-08 from rke2-server, rke2-agent1, rke2-agent2:
  mount -t nfs -o nfsvers=4.1,ro 10.0.58.3:/volume1/ISOs/win2025-iso-disk
  file disk.img -> ISO 9660 CD-ROM filesystem data ... (bootable)

The original Longhorn Filesystem ISO PVC is RETAINED unused (so ArgoCD
doesn't prune the populated PVC and so we have a fallback). Can be
removed in a follow-up commit after the NFS path is proven on a
successful Windows install.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 17:03:42 -05:00
Codex
ba18c52130 docs(ci1): record open rootdisk-flock and SATA-CDROM-timeout issues
Documenting the remaining 2 unresolved issues for the next operator
session, with the recovery paths from this session captured inline so
the next agent doesn't repeat the same blind alleys:

1. **rootdisk QEMU flock** — every new launcher pod fails QEMU start with
   `Failed to get "write" lock` on the rootdisk Filesystem-mode disk.img.
   Stale flock from a previous force-deleted virt-launcher pod. Longhorn
   engine on rke2-agent2 needs to release the lock; `kubectl patch
   volume.longhorn.io/<pvc-name> spec.nodeID=""` is reverted by the
   Longhorn controller. Operator-level recovery only.

2. **SATA CDROM read timeout** — even with bootOrder=1 (windows-iso first),
   OVMF UEFI fails Boot0001 with "Time out" reading the SATA CDROM backed
   by the Filesystem-mode PVC. Block-mode DataVolume migration was
   attempted but blocked by CDI v1.65.0's upload pod running with
   `capabilities.drop: [ALL]` and `runAsUser: 107`, preventing direct
   block-device writes (`blockdev: cannot open /dev/cdi-block-volume:
   Permission denied`). See ISO PVC header docstring for 3 forward paths.

Net commits during this session:
- 1c4145a: bootOrder swap (windows-iso=1, rootdisk=2)
- 87a7d7c: deprecated `running:` -> `runStrategy: Always`
- 0bf47df: ISO migration to Block-mode DataVolume (REVERTED)
- 9f6dc1a: revert to Filesystem PVC (CDI block-upload blocked)
- 1c4145a + 87a7d7c + 9f6dc1a are the live, correct configuration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:18:38 -05:00
Codex
9f6dc1a9d5 fix(ci1): revert ISO to Filesystem PVC; CDI v1.65.0 block-upload pod blocked by capability drop
The Block-mode DataVolume migration (commit 0bf47df) hit a CDI v1.65.0 limitation:
the upload-target pod runs as uid 107 with `capabilities.drop: [ALL]`, so it
cannot open the underlying block device:

  blockdev: cannot open /dev/cdi-block-volume: Permission denied
  Saving stream failed: Unable to transfer source data to target file:
  error determining if block device exists: exit status 1

Reverting to a Filesystem-mode PVC + virtctl image-upload pvc, which DID work
(uploaded the 7.7 GiB ISO with valid ISO9660 magic intact). Boot timeout is
unresolved (header docstring captures the open issue + 3 paths to revisit).

The bootOrder swap (1c4145a) and runStrategy migration (87a7d7c) stay landed —
those are correct improvements regardless of the volume-mode question.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:32:52 -05:00
Codex
0bf47dfa33 fix(ci1): switch ISO from filesystem PVC to Block-mode DataVolume
The bootOrder swap alone didn't fix the install — even with `windows-iso` at
bootOrder:1, OVMF UEFI still timed out reading the SATA CDROM:

  BdsDxe: starting Boot0001 "UEFI QEMU DVD-ROM QM00001 " from ... Sata(...)
  BdsDxe: failed to start Boot0001 ... : Time out
  BdsDxe: No bootable option or device was found.

Diagnosis (debug pod mounting the live PVC):
- /pvc/disk.img IS a valid bootable ISO9660 image — `file` reports
  "ISO 9660 CD-ROM filesystem data 'SSS_X64FRE_EN-US_DV9' (bootable)".
- bytes 0..15: zeros (NOT QCOW2 magic 51 46 49 fb).
- bytes 32769..32773: "CD001" — ISO9660 primary volume descriptor at the
  correct offset.

So content was fine. The bug is in how KubeVirt + QEMU + Longhorn expose a
Filesystem-mode PVC's `/disk.img` as a SATA CDROM. With Block-mode the
underlying volume IS the raw ISO9660 sectors, OVMF reads them directly,
no QEMU file-emulation layer. This is the recommended pattern for ISO
install media on KubeVirt + Longhorn.

Migration:
- Replace `kind: PersistentVolumeClaim` with `kind: DataVolume` (CDI manages
  the underlying PVC + upload-target pod).
- Set `pvc.volumeMode: Block`.
- Annotate `cdi.kubevirt.io/storage.contentType: kubevirt` so CDI keeps raw
  bytes (no QCOW2 wrap).
- VM volume reference changes from `persistentVolumeClaim.claimName` to
  `dataVolume.name`. KubeVirt's VMI controller blocks VM start until DV
  phase is Succeeded (upload completed).

Operator step after this lands:
1. Wait for DV `phase: UploadReady`
   kubectl get dv -n kubevirt-vms windows-server-2025-iso -w
2. virtctl image-upload dv windows-server-2025-iso -n kubevirt-vms \
     --image-path "...\en-us_windows_server_2025...iso" \
     --uploadproxy-url https://localhost:8443 --insecure --no-create
3. Re-flip runStrategy to Always (was set to Halted live-side during
   migration; this commit keeps the manifest at Always).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:23:31 -05:00
Codex
87a7d7c70a fix(ci1): switch deprecated running: true -> runStrategy: Always
Required to clear OutOfSync state after the bootOrder fix. Live VM had
runStrategy: Halted (set during diagnosis to release the PVC for inspection).
Manifest had running: true. KubeVirt's validating webhook rejects sync:
  admission webhook "virtualmachine-validator.kubevirt.io" denied the request:
  Running and RunStrategy are mutually exclusive.

Switching to runStrategy: Always preserves the original "auto-start +
auto-restart" semantics with the non-deprecated field, and gives ArgoCD a
clean diff target to flip Halted -> Always.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:12:07 -05:00
Codex
1c4145a581 fix(ci1): swap bootOrder so Windows install ISO boots first
Original order: rootdisk=1 (empty 200Gi virtio), windows-iso=2 (SATA CDROM).
UEFI tried the empty virtio disk first, got nothing, fell back to Boot0001
(the SATA CDROM) with a short timeout, and aborted with:
  BdsDxe: failed to start Boot0001 ... Time out
  BdsDxe: No bootable option or device was found.

VM had been running 38+ min with rootdisk actualSize stuck at 4.13 GiB and
no AgentConnected condition — install never started.

Diagnosis via debug pod mounting the windows-server-2025-iso PVC:
  /pvc/disk.img: ISO 9660 CD-ROM filesystem data 'SSS_X64FRE_EN-US_DV9' (bootable)
  bytes 0..15: zeros (NOT QCOW2 magic 51 46 49 fb)
  bytes 32769..32773: "CD001" (ISO9660 primary volume descriptor)

So the PVC content is a real bootable ISO — the only fix needed is to make
the ISO bootOrder=1 for first install. After Windows installs, it writes its
own UEFI Boot#### entries pointing at the rootdisk EFI partition; UEFI then
boots from rootdisk going forward and the ISO at bootOrder:2 is a fallback
for re-install scenarios.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:10:17 -05:00
Codex
c50a403f74 fix(infra): pin virtio-container-disk to v1.8.2 (containerd 2.1 manifest fix)
KubeVirt v1.4.0 + RKE2 containerd 2.1.5 cannot pull
quay.io/kubevirt/virtio-container-disk:latest:
  rpc error: code = Unimplemented
  desc = failed to pull and unpack image: not implemented:
  media type "application/vnd.docker.distribution.manifest.v1+prettyjws"
  is no longer supported since containerd v2.1, please rebuild the image as
  "application/vnd.docker.distribution.manifest.v2+json" or
  "application/vnd.oci.image.manifest.v1+json"

The :latest tag was last rebuilt with the v1 manifest schema. Tagged versions
v1.6.5+, v1.7.3, v1.8.2 are rebuilt with v2/OCI manifests.

Pinning to v1.8.2 (newest available, contains current Windows VirtIO drivers).
The image only contains the Windows VirtIO driver ISO mounted as a CDROM —
not the KubeVirt runtime — so it is decoupled from the cluster KubeVirt
version.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:28:22 -05:00
Codex
fb7bd10528 feat(infra): activate ci1 VM — running:true + 10Gi ISO PVC + 1P password
Phase 1 prereqs all satisfied:
- Multus CNI v4.2.2 thick-plugin DS Running on rke2-server/agent1/agent2
- CDI v1.65.0 operator + CR Deployed (cdi-apiserver/deployment/uploadproxy
  all Running 1/1)
- Windows Server 2025 ISO (7.7GiB, March 2026 update) uploaded via CDI
  virtctl image-upload to PVC windows-server-2025-iso. Verified via PVC
  annotations: cdi.kubevirt.io/storage.condition.running.message="Upload
  Complete", storage.pod.phase="Succeeded"
- Local Administrator password generated (26 char, FANTASTIC strength).
  Stored in 1Password vault IAmWorkin (qaphopopkryhbg353ukzhhuqoq) item
  h3ix4mgfk65gmkcmvh6ly3d3hu. UTF-16-LE base64 in autounattend.xml Value
  field matches the 1P "autounattend AdministratorPassword Value" field.

Changes:
- ISO PVC bumped 6Gi → 10Gi (ISO is 7.7GiB, need headroom)
- Added labels app=ci-runner, flowercore.io/managed-by=bluejay-infra
- autounattend.xml AdministratorPassword Value: real base64-encoded password
- spec.running: false → true (VM starts on next ArgoCD sync)
- Header comment refreshed to LIVE state with prereq references

Network: still pod-network masquerade. Multus NAD prod-vlan57 is registered
but the VM doesn't use it yet (Phase 1.5 host bridge needed first).

Verify after sync:
  kubectl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml -n kubevirt-vms get vm,vmi
  virtctl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml vnc ci1 -n kubevirt-vms

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:24:46 -05:00
Codex
00c11b4eaa feat(infra): stage ci1 Windows Server 2025 KubeVirt VM (Phase 1, NOT YET APPLIED)
Stages a draft VirtualMachine + Namespace + ISO PVC + rootdisk PVC + sysprep
ConfigMap for the dedicated GitHub Actions self-hosted runner that replaces
the never-registered bluejay-ws-sandbox-1 placeholder.

Status: STAGED ONLY. spec.running = false. ISO PVC empty. Two operator
decisions still pending before this can boot:
  1. Network choice — pod-network fallback (in this draft) vs Multus +
     PROD VLAN NAD (preferred, requires Multus install).
  2. ISO path — manual upload via helper pod (Path A) vs CDI HTTP import
     (Path B, requires CDI install).

Cluster baseline 2026-05-08:
  - KubeVirt operator: installed, healthy, 14d
  - CDI: NOT installed
  - Multus: NOT installed
  - Calico-only CNI

See docs/infrastructure/windows-server-build-runner-plan.md "Phase 1 readiness
gate" for the full operator pickup checklist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 12:32:47 -05:00