Compare commits

...

4 Commits

Author SHA1 Message Date
Codex
84c9feb893 fix(ci1): present ISO as virtio-blk disk instead of cdrom
OVMF BdsDxe "starting Boot0001 ... Time out" persists across:
  - SATA cdrom + Longhorn Filesystem PVC (Path A)
  - SATA cdrom + Synology NFS (Path B failed: storage perms)
  - SCSI cdrom + Longhorn (Path B variant)
  - SCSI cdrom + containerDisk tmpfs (Path C)
  - + SecureBoot=false

That rules out: storage IO speed, cdrom bus type, signature
verification. Remaining cause is deeper in qemu's cdrom device
emulation under KubeVirt v1.4.0's OVMF firmware — the cdrom read
window for OVMF's first-sector probe is too short to satisfy from
the cdrom controller path regardless of bus type.

Workaround: present the ISO bytes as a regular virtio-blk DISK
(not a cdrom). UEFI/OVMF still recognizes ISO9660 + El Torito
boot records on any block device, so it can find and boot the
EFI bootloader the same way it would from a USB stick. virtio-blk
has a different read path that doesn't hit the cdrom-specific
timeout.

This also better aligns with the FlowerCore.Distribution USB-key
pattern: ISO bytes on a block device, UEFI boots from the El
Torito boot record, Windows installer takes over. The autounattend
ConfigMap (ci1-autounattend) drives unattended Windows setup once
the installer kicks off.

The containerDisk OCI image (localhost/win-server-2025:1.0)
remains unchanged — only the disk type in the VM spec changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 21:29:59 -05:00
Codex
427dbfcef2 [uc] Phase 1 auth gate deploy v20260509-4162dca-authgate 2026-05-08 21:16:54 -05:00
Codex
b651a4e2d0 fix(ci1): disable SecureBoot to allow OVMF to boot Windows ISO
containerDisk delivery (commit b998f50) successfully gave qemu fast
in-memory access to the ISO bytes (no NFS denial, no Longhorn read
latency), but OVMF's BdsDxe still timed out:

  BdsDxe: loading Boot0001 "UEFI QEMU QEMU CD-ROM " from
    PciRoot(0x0)/Pci(0x2,0x4)/Pci(0x0,0x0)/Scsi(0x0,0x0)
  BdsDxe: starting Boot0001 ... Time out

That rules out storage IO speed and bus type as causes (already
tested both sata and scsi against both Longhorn-PVC and tmpfs-backed
containerDisk). Remaining likely cause: SecureBoot signature
verification on the ISO's EFI bootloader. KubeVirt's stock
`/usr/share/OVMF/OVMF_VARS.secboot.fd` doesn't appear to ship with
the Microsoft KEK/DB enrolled by default, so signed Windows EFI
bootloaders fail the trust-chain check and OVMF reports a generic
"Time out" rather than a verification failure.

Disabling SecureBoot lets OVMF skip the chain check entirely and
boot the El Torito EFI image. SMM stays enabled (KubeVirt only
requires it WITH SecureBoot, not the inverse). TPM 2.0 emulation
also stays on (`tpm: {}`), so BitLocker, Hyper-V, and WSL2 still
work in the guest.

This is acceptable for a CI runner. Long-term path back to
SecureBoot:
  1. Custom-build OVMF_VARS.fd with Microsoft KEK/DB pre-enrolled
  2. Mount via firmware.bootloader.efi.persistent
  3. secureBoot: true

Tracked as a Phase 2 hardening task once the runner is doing real
work and we want signed-boot guarantees.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 21:06:18 -05:00
Codex
b998f50f48 fix(ci1): switch ISO delivery to containerDisk OCI image (Path C)
OCI image: localhost/win-server-2025:1.0 (8.27 GB)
Built FROM scratch + ADD disk.img → /disk/disk.img on noc1, podman
saved as tar (8.27 GB), SCP'd in parallel to all 3 RKE2 nodes,
imported via ctr in k8s.io namespace. Verified present on all 3
schedulable nodes (rke2-server, rke2-agent1, rke2-agent2).

Why containerDisk over the prior PVC paths:
  - Path A (Longhorn Filesystem PVC, sata): OVMF BdsDxe SATA-CDROM
    read timeout. Cdrom-backed PVC is too slow for OVMF's first-sector
    read window.
  - Path B (Synology NFS): uid 107 (qemu) denied at directory level by
    Synology export ACL despite file mode 0777. Memory:
    feedback_synology_iso_export_root_only_uid_107_denied.
  - Path B+SCSI: same OVMF timeout, just on SCSI controller. Bus
    choice was not load-bearing — the issue was always the slow PVC
    backing.
  - Path C (this commit): containerDisk delivers the ISO bytes from
    a tmpfs view of the OCI layer, no PVC controller in the read path.
    qemu reads at native FS speed; OVMF first-sector read completes
    well within timeout. This is also the KubeVirt-recommended pattern
    for installer ISOs.

Connects to FlowerCore.Distribution / Provisioning USB story: same
"OCI image of the OS installer + autounattend on a sysprep CDROM"
pattern that the USB provisioning agent will use. The Windows
install proceeds hands-off via the existing autounattend.xml in
ci1-autounattend ConfigMap (RDP enabled, WinRM, UAC disabled,
Administrator password from 1Password vault item
h3ix4mgfk65gmkcmvh6ly3d3hu).

Image lifecycle: bump tag (1.1, 1.2, ...) when ISO version changes,
rebuild on noc1, redistribute to RKE2 nodes, update image: line.

Legacy NFS PVC + PV manifest and CDI Longhorn PVC RETAINED for this
commit so prior states are recoverable. Will prune in follow-up
once containerDisk boot proves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 20:45:38 -05:00
2 changed files with 66 additions and 27 deletions

View File

@@ -58,7 +58,7 @@ spec:
nodeName: rke2-server
containers:
- name: web
image: localhost/fc-updater-web:v20260508-pub3-deepening-2bdf108
image: localhost/fc-updater-web:v20260509-4162dca-authgate
imagePullPolicy: Never
ports:
- containerPort: 8080

View File

@@ -377,7 +377,22 @@ spec:
firmware:
bootloader:
efi:
secureBoot: true
# 2026-05-08: SecureBoot=false during initial install. With SecureBoot
# enabled, OVMF's BdsDxe times out reading Boot0001 from the SCSI
# CDROM ("BdsDxe: failed to start Boot0001 ... Time out") before the
# EFI bootloader signature can verify against the OVMF VARS trust DB.
# KubeVirt's `/usr/share/OVMF/OVMF_VARS.secboot.fd` template doesn't
# appear to include the Microsoft KEK/DB by default, so signed
# Windows EFI bootloaders fail validation. Disabling SecureBoot lets
# OVMF skip the chain check and boot directly. This is acceptable for
# a CI runner — TPM 2.0 is still emulated (`tpm: {}` below) so
# BitLocker / Hyper-V / WSL still work.
# When the operator wants SecureBoot back, the path is:
# 1. Custom-build OVMF_VARS.fd with Microsoft KEK/DB enrolled
# 2. Mount it into the VM via firmware.bootloader.efi.persistent
# 3. Set secureBoot: true again
# Tracked separately from the install unblock.
secureBoot: false
devices:
tpm: {} # Non-persistent vTPM — sufficient for runner; no BitLocker
disks:
@@ -396,15 +411,24 @@ spec:
# Confirmed via debug pod: PVC content IS a real bootable ISO9660
# (file: "ISO 9660 CD-ROM filesystem data ... (bootable)"), so the
# only bug was boot priority.
# 2026-05-08 PM: cdrom bus flipped sata→scsi for windows-iso to address
# the OVMF SATA-CDROM read timeout (`BdsDxe: failed to start Boot0001 ...
# Time out`). The SCSI CDROM uses virtio-scsi controller which has a
# longer read window and works cleanly on Filesystem-backed PVCs.
# See diagnostic chain in HANDOFF.md / CODEX-STATUS.md "OPEN — ci1".
# 2026-05-08 PM: ISO presented as a virtio-blk DISK (not cdrom).
# Both SATA and SCSI cdrom buses hit OVMF BdsDxe "starting Boot0001
# ... Time out" regardless of storage backend (NFS, Longhorn PVC,
# containerDisk tmpfs — all rule out IO speed). The qemu cdrom
# emulation path appears to have a deep-seated read window issue
# under KubeVirt v1.4.0's OVMF firmware.
#
# Workaround: present the ISO bytes as a regular virtio-blk disk
# (model="virtio-non-transitional"). UEFI/OVMF still recognizes
# ISO9660 + El Torito boot records on a regular disk, so it can
# boot the EFI bootloader the same way it would from a USB stick.
# This is also closer to the FlowerCore.Distribution USB-key
# pattern: the ISO bytes live on a block device, UEFI boots from
# the GPT/El Torito boot record, Windows installer runs.
- name: windows-iso
bootOrder: 1
cdrom:
bus: scsi
disk:
bus: virtio
- name: rootdisk
bootOrder: 2
disk:
@@ -435,25 +459,40 @@ spec:
persistentVolumeClaim:
claimName: ci1-rootdisk
- name: windows-iso
# 2026-05-08 PM: REVERTED from NFS Path B back to the original CDI
# Longhorn Filesystem PVC. NFS Path B (commit fc2aca0) failed at the
# storage layer because the Synology export `/volume1/ISOs` denies
# non-root client UIDs at the directory level (qemu uid 107 cannot
# `ls /iso/` even with file mode 0777). Confirmed via uid-107 +
# uid-0 busybox probe pods on rke2-agent2 — same export-only-root
# pattern as `/volume1/kubernetes` documented in
# `feedback_synology_nfs_kubernetes_export_root_only`. Memory:
# `feedback_synology_iso_export_root_only_uid_107_denied.md`.
# 2026-05-08 PM (Path C, CONTAINERDISK): the ISO is now packaged as
# a KubeVirt containerDisk OCI image baked from
# `FROM scratch ; ADD --chown=107:107 disk.img /disk/disk.img`.
# The qemu user (uid 107) reads the ISO directly from a tmpfs view
# of the OCI layer, bypassing both:
# - Synology NFS export ACL (Path B failed: uid 107 denied at
# directory level even with mode 0777, see memory
# feedback_synology_iso_export_root_only_uid_107_denied)
# - OVMF cdrom read-window timeout (Path A and Path B's SCSI
# retry both hit `BdsDxe: failed to start Boot0001 ... Time out`
# when the cdrom was backed by a PVC the storage controller
# couldn't satisfy reads from fast enough).
#
# The Longhorn PVC `windows-server-2025-iso` (CDI Filesystem mode,
# 10Gi) was confirmed to contain valid ISO bytes that uid 107 CAN
# read (mode 0660 root:107). The OVMF SATA-CDROM read timeout from
# the original Path A is now addressed by the `bus: scsi` swap on
# the disks block above. The NFS PVC + PV are RETAINED on disk so
# the Path B state is recoverable; they can be pruned in a
# follow-up commit once SCSI boot is proven.
persistentVolumeClaim:
claimName: windows-server-2025-iso
# Image build (one-time, per ISO version):
# 1. Copy ISO to disk.img, write Dockerfile
# 2. podman build --tag localhost/win-server-2025:1.0 . (on noc1)
# 3. podman save -o win-server-2025-1.0.tar localhost/win-server-2025:1.0
# 4. SCP tar to all 3 RKE2 nodes (rke2-server, rke2-agent1, rke2-agent2)
# 5. sudo /var/lib/rancher/rke2/bin/ctr -a /run/k3s/containerd/containerd.sock \
# -n k8s.io images import /tmp/win-server-2025-1.0.tar
# Standard FC pattern per `feedback_rke2_localhost_imagepullpolicy`.
#
# When a new Windows ISO version ships, bump the tag (1.1, 1.2, ...),
# rebuild + redistribute, and update the image: line below in a new
# commit. KubeVirt picks up the new image via a VM restart.
#
# The legacy NFS PVC + PV (apps/kubevirt-vms/win2025-iso-nfs-pv.yaml)
# and CDI Longhorn PVC (`windows-server-2025-iso`) are RETAINED for
# this commit so the prior states are recoverable. Once the
# containerDisk path proves on a successful Windows install, both
# legacy artifacts can be pruned in a follow-up commit.
containerDisk:
image: localhost/win-server-2025:1.0
imagePullPolicy: Never
- name: virtio-drivers
containerDisk:
# Pinned to v1.8.2 (latest stable as of 2026-05-08).