Files
bluejay-infra/apps/kubevirt-vms/ci1.yaml
Codex fc2aca0e9e fix(ci1): mount Windows ISO via Synology NFS (Path B for SATA-CDROM timeout)
Previous fix attempts confirmed the Longhorn-backed Filesystem PVC contains
a perfectly valid bootable ISO9660 image. The bug is that SATA-CDROM
emulation reading from a Longhorn Filesystem PVC is too slow for OVMF's
boot read window — DVD-ROM enumeration times out before the bootloader
loads. Symptom on the serial console:
  BdsDxe: failed to start Boot0001 "UEFI QEMU DVD-ROM QM00001 " ... Time out
  BdsDxe: No bootable option or device was found

Block-mode PVC (Path A) was attempted and would likely fix the timing,
but CDI v1.65.0's upload-target pod cannot open the underlying block
device (runAsUser:107 + capabilities.drop:[ALL]):
  blockdev: cannot open /dev/cdi-block-volume: Permission denied

Path B (this change): mount the ISO directly from Synology NAS over
NFSv4.1. Bypasses both the Longhorn slowness and the CDI permission
issue. QEMU's SATA emulator reads at native LAN speed.

Layout:
  /volume1/ISOs/ — existing Synology export, RKE2 ACL already granted
  /volume1/ISOs/win2025-iso-disk/disk.img — new subdir, hardlink to the
    ISO file, named so KubeVirt's launcher finds it at the PV root

A hardlink (not symlink) is required because symlinks with relative
targets pointing to the parent directory are broken when the NFS PV
sub-mounts the subdir as its root.

Validated 2026-05-08 from rke2-server, rke2-agent1, rke2-agent2:
  mount -t nfs -o nfsvers=4.1,ro 10.0.58.3:/volume1/ISOs/win2025-iso-disk
  file disk.img -> ISO 9660 CD-ROM filesystem data ... (bootable)

The original Longhorn Filesystem ISO PVC is RETAINED unused (so ArgoCD
doesn't prune the populated PVC and so we have a fallback). Can be
removed in a follow-up commit after the NFS path is proven on a
successful Windows install.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 17:03:42 -05:00

459 lines
20 KiB
YAML

# =============================================================================
# ci1 — Windows Server 2025 KubeVirt VM (GitHub Actions Self-Hosted Runner)
# =============================================================================
# Purpose: dedicated CI runner for FlowerCore.Updater Sandbox E2E nightly +
# future fleet WPF AAT lanes. Replaces the never-registered
# `bluejay-ws-sandbox-1` runner placeholder. Andrew explicitly does NOT want
# BLUEJAY-WS registered as a runner (workstation has personal/operator state).
#
# Storage layout (2026-05-08):
# * ISO is now sourced from Synology NFS (Path B) — see
# win2025-iso-nfs-pv.yaml. The Longhorn Filesystem PVC
# `windows-server-2025-iso` below is RETAINED but UNUSED so the prior
# CDI upload state is preserved as a fallback (and so ArgoCD doesn't
# prune it on this commit). It can be deleted in a follow-up commit
# after the NFS path is proven on a successful Windows install.
#
# Status (2026-05-08): LIVE — Phase 1 prereqs satisfied:
# * Multus CNI v4.2.2 thick-plugin DaemonSet running on all 3 RKE2 nodes
# (apps/multus/multus.yaml; ApplicationSet `infra-multus` Synced/Healthy)
# * CDI v1.65.0 operator + CR Deployed (apps/cdi/; ApplicationSet
# `infra-cdi` Synced/Healthy; uploadproxy reachable via kubectl port-forward)
# * Windows Server 2025 ISO uploaded via CDI virtctl image-upload to
# PVC windows-server-2025-iso (7.7 GiB → 10Gi PVC, Bound, Upload Complete)
# * Local Administrator password generated, stored in 1Password vault
# IAmWorkin (qaphopopkryhbg353ukzhhuqoq) item id h3ix4mgfk65gmkcmvh6ly3d3hu
# * NetworkAttachmentDefinition prod-vlan57 registered (apps/kubevirt-vms/
# prod-vlan57-nad.yaml). VM still uses pod-network masquerade until Phase 1.5
# host bridge work lands (Puppet br-prod + enp86s0.57); switching is a
# one-line YAML edit + git push.
#
# See docs/infrastructure/windows-server-build-runner-plan.md "Phase 1 readiness gate".
#
# Network choice in this draft: **pod-network fallback** (Calico default).
# Outbound-only is fine for the Updater Sandbox E2E runner workload (the runner
# polls GitHub Actions over HTTPS; no inbound listener needed). Switch to a
# Multus PROD VLAN NetworkAttachmentDefinition once Multus is installed and the
# operator wants L2 access from `ci1` to other PROD VLAN services.
#
# Sizing: 8 vCPU / 16 GB RAM / 200 GB disk on Longhorn (default storageClass).
# Capacity check 2026-05-08: each RKE2 node has 16 vCPU / ~64Gi allocatable;
# 8 vCPU is ~17% of one node's allocatable, fits comfortably.
#
# Apply (after operator approval + ISO loaded):
# kubectl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml apply -f apps/kubevirt-vms/ci1.yaml
#
# Connect to console for Windows install:
# virtctl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml vnc ci1 -n kubevirt-vms
# (Or via Guacamole once a connection profile is added.)
# =============================================================================
apiVersion: v1
kind: Namespace
metadata:
name: kubevirt-vms
labels:
app.kubernetes.io/part-of: kubevirt-stack
pod-security.kubernetes.io/enforce: privileged
---
# ISO PVC — populated via CDI virtctl image-upload (CDI is now installed).
#
# **Volume mode (2026-05-08 status):** Filesystem-mode PVC. A migration to
# `volumeMode: Block` via DataVolume was attempted to address an OVMF SATA
# CDROM read timeout, but CDI v1.65.0's upload-target pod runs as uid 107
# with `capabilities.drop: [ALL]` and cannot open the underlying block
# device (`blockdev: cannot open /dev/cdi-block-volume: Permission denied`).
# Reverted to Filesystem PVC pending one of:
# - CDI deployment override granting CAP_SYS_RAWIO to upload pod
# - Pre-populated PVC via privileged init pod that dd's the ISO directly
# - Migration to a different storage class that exposes block devices
# differently (e.g. iSCSI, where Longhorn's CSI mount path may behave
# differently)
#
# Population workflow (this PVC, Filesystem mode):
# 1. virtctl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml image-upload pvc \
# windows-server-2025-iso -n kubevirt-vms \
# --image-path "$env:USERPROFILE\Downloads\en-us_windows_server_2025_updated_march_2026_x64_dvd_8e06425a.iso" \
# --size 10Gi --storage-class longhorn --access-mode ReadWriteOnce \
# --uploadproxy-url https://localhost:8443 --insecure
# (--uploadproxy-url uses port-forward in practice: `kubectl port-forward
# -n cdi service/cdi-uploadproxy 8443:443 &` first.)
#
# **Open boot issue:** even with the ISO at bootOrder:1, OVMF console showed:
# BdsDxe: starting Boot0001 "UEFI QEMU DVD-ROM QM00001 " from ... Sata(...)
# BdsDxe: failed to start Boot0001 ... Time out
# Diagnosis confirmed PVC content IS a valid bootable ISO9660 image — the
# timeout is in OVMF reading from the SATA-CDROM-backed-by-filesystem-PVC.
# Block mode would likely fix it; see CDI permission issue above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: windows-server-2025-iso
namespace: kubevirt-vms
labels:
app: ci-runner
flowercore.io/managed-by: bluejay-infra
spec:
accessModes:
- ReadWriteOnce # Bump to ReadOnlyMany after population for multi-VM use
resources:
requests:
storage: 10Gi # Server 2025 ISO is 7.7GB; 10Gi for headroom
storageClassName: longhorn
---
# Root disk PVC — empty 200Gi volume that Windows installs into.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: ci1-rootdisk
namespace: kubevirt-vms
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 200Gi
storageClassName: longhorn
---
# Sysprep ConfigMap — autounattend.xml for hands-off Windows install.
# Sets local Administrator password (REPLACE the placeholder), enables RDP,
# enables WinRM, sets hostname, and configures static-ish networking via DHCP.
# The ISO + VirtIO drivers handle the rest.
apiVersion: v1
kind: ConfigMap
metadata:
name: ci1-autounattend
namespace: kubevirt-vms
data:
autounattend.xml: |
<?xml version="1.0" encoding="utf-8"?>
<unattend xmlns="urn:schemas-microsoft-com:unattend">
<!-- Pass 1: WindowsPE — Disk setup and VirtIO driver injection -->
<settings pass="windowsPE">
<component name="Microsoft-Windows-International-Core-WinPE"
processorArchitecture="amd64"
publicKeyToken="31bf3856ad364e35"
language="neutral" versionScope="nonSxS">
<SetupUILanguage>
<UILanguage>en-US</UILanguage>
</SetupUILanguage>
<InputLocale>en-US</InputLocale>
<SystemLocale>en-US</SystemLocale>
<UILanguage>en-US</UILanguage>
<UserLocale>en-US</UserLocale>
</component>
<component name="Microsoft-Windows-PnpCustomizationsWinPE"
processorArchitecture="amd64"
publicKeyToken="31bf3856ad364e35"
language="neutral" versionScope="nonSxS">
<DriverPaths>
<PathAndCredentials wcm:action="add" wcm:keyValue="1">
<Path>E:\amd64\2k25</Path>
</PathAndCredentials>
</DriverPaths>
</component>
<component name="Microsoft-Windows-Setup"
processorArchitecture="amd64"
publicKeyToken="31bf3856ad364e35"
language="neutral" versionScope="nonSxS">
<DiskConfiguration>
<Disk wcm:action="add">
<DiskID>0</DiskID>
<WillWipeDisk>true</WillWipeDisk>
<CreatePartitions>
<CreatePartition wcm:action="add">
<Order>1</Order>
<Size>260</Size>
<Type>EFI</Type>
</CreatePartition>
<CreatePartition wcm:action="add">
<Order>2</Order>
<Size>128</Size>
<Type>MSR</Type>
</CreatePartition>
<CreatePartition wcm:action="add">
<Order>3</Order>
<Extend>true</Extend>
<Type>Primary</Type>
</CreatePartition>
</CreatePartitions>
<ModifyPartitions>
<ModifyPartition wcm:action="add">
<Order>1</Order>
<PartitionID>1</PartitionID>
<Format>FAT32</Format>
<Label>EFI</Label>
</ModifyPartition>
<ModifyPartition wcm:action="add">
<Order>2</Order>
<PartitionID>2</PartitionID>
</ModifyPartition>
<ModifyPartition wcm:action="add">
<Order>3</Order>
<PartitionID>3</PartitionID>
<Format>NTFS</Format>
<Label>Windows</Label>
</ModifyPartition>
</ModifyPartitions>
</Disk>
</DiskConfiguration>
<ImageInstall>
<OSImage>
<InstallTo>
<DiskID>0</DiskID>
<PartitionID>3</PartitionID>
</InstallTo>
<!-- Index 2 = Standard Desktop Experience. Use 4 for Datacenter Desktop. -->
<InstallFrom>
<MetaData wcm:action="add">
<Key>/IMAGE/INDEX</Key>
<Value>2</Value>
</MetaData>
</InstallFrom>
</OSImage>
</ImageInstall>
<UserData>
<AcceptEula>true</AcceptEula>
<FullName>FlowerCore CI Runner</FullName>
<Organization>FlowerCore</Organization>
<!-- Eval install — no product key needed for 180-day evaluation -->
</UserData>
</component>
</settings>
<!-- Pass 4: Specialize — Hostname, RDP, WinRM -->
<settings pass="specialize">
<component name="Microsoft-Windows-Shell-Setup"
processorArchitecture="amd64"
publicKeyToken="31bf3856ad364e35"
language="neutral" versionScope="nonSxS">
<ComputerName>CI1</ComputerName>
<TimeZone>Central Standard Time</TimeZone>
</component>
<component name="Microsoft-Windows-TerminalServices-LocalSessionManager"
processorArchitecture="amd64"
publicKeyToken="31bf3856ad364e35"
language="neutral" versionScope="nonSxS">
<fDenyTSConnections>false</fDenyTSConnections>
</component>
</settings>
<!-- Pass 7: OOBE — Admin account, RDP firewall, WinRM -->
<settings pass="oobeSystem">
<component name="Microsoft-Windows-Shell-Setup"
processorArchitecture="amd64"
publicKeyToken="31bf3856ad364e35"
language="neutral" versionScope="nonSxS">
<OOBE>
<HideEULAPage>true</HideEULAPage>
<HideLocalAccountScreen>true</HideLocalAccountScreen>
<HideOEMRegistrationScreen>true</HideOEMRegistrationScreen>
<HideOnlineAccountScreens>true</HideOnlineAccountScreens>
<HideWirelessSetupInOOBE>true</HideWirelessSetupInOOBE>
<ProtectYourPC>3</ProtectYourPC>
</OOBE>
<UserAccounts>
<AdministratorPassword>
<!-- Real password is in 1Password — vault qaphopopkryhbg353ukzhhuqoq,
item id h3ix4mgfk65gmkcmvh6ly3d3hu, title:
"ci1 Administrator (Windows Server 2025 KubeVirt VM)".
Field "autounattend AdministratorPassword Value (UTF-16-LE base64)"
matches the Value below.
To rotate: regenerate, recompute base64
$combined = $pw + "AdministratorPassword"
[Convert]::ToBase64String([Text.Encoding]::Unicode.GetBytes($combined))
then update both 1P item AND this Value field, recreate VM. -->
<Value>bAA3AGsANABOAHcAcgBMAG4AeQBTAHUAYgBBAHQAaQBzAFUAcAB6AEMAWQAhADkAYQBCAEEAZABtAGkAbgBpAHMAdAByAGEAdABvAHIAUABhAHMAcwB3AG8AcgBkAA==</Value>
<PlainText>false</PlainText>
</AdministratorPassword>
</UserAccounts>
<FirstLogonCommands>
<SynchronousCommand wcm:action="add">
<Order>1</Order>
<CommandLine>powershell.exe -ExecutionPolicy Bypass -Command "Set-NetFirewallRule -DisplayGroup 'Remote Desktop' -Enabled True"</CommandLine>
<Description>Enable RDP firewall rule</Description>
</SynchronousCommand>
<SynchronousCommand wcm:action="add">
<Order>2</Order>
<CommandLine>powershell.exe -ExecutionPolicy Bypass -Command "Enable-PSRemoting -Force; Set-Item WSMan:\localhost\Service\Auth\Basic $true; Set-Item WSMan:\localhost\Service\AllowUnencrypted $true"</CommandLine>
<Description>Enable WinRM (Phase 2 will pivot to HTTPS via step-ca cert)</Description>
</SynchronousCommand>
<SynchronousCommand wcm:action="add">
<Order>3</Order>
<CommandLine>cmd.exe /c reg add "HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System" /v EnableLUA /t REG_DWORD /d 0 /f</CommandLine>
<Description>Disable UAC (Phase 2 Puppet will re-evaluate)</Description>
</SynchronousCommand>
</FirstLogonCommands>
</component>
</settings>
</unattend>
---
# VirtualMachine — Windows Server 2025 CI runner.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
name: ci1
namespace: kubevirt-vms
labels:
app: ci-runner
role: github-actions-runner
flowercore.io/managed-by: bluejay-infra
spec:
# `running: true` is deprecated in favor of `runStrategy`. They are mutually
# exclusive — KubeVirt's validating webhook rejects any VM that sets both:
# admission webhook "virtualmachine-validator.kubevirt.io" denied the request:
# Running and RunStrategy are mutually exclusive.
# `Always` keeps a VMI running and restarts it if it crashes/exits — same
# semantics as the old `running: true`.
#
# **2026-05-08 status: VM cannot start due to a stale QEMU flock on the
# rootdisk PVC** (qemu reports `Failed to get "write" lock` on
# `/var/run/kubevirt-private/vmi-disks/rootdisk/disk.img`). The flock was
# left by a previous QEMU process during a force-deleted launcher pod
# cycle. Recovery requires either (a) a Longhorn engine restart on
# rke2-agent2, (b) a Longhorn volume detach via the longhorn-manager API
# (kubectl patch on `volume.longhorn.io/<pvc-name>` does not work — the
# spec.nodeID is reconciled back), or (c) a node reboot of rke2-agent2.
#
# **Confirmed working:** the bootOrder swap (windows-iso=1, rootdisk=2)
# and the runStrategy migration (above). The ISO PVC was successfully
# repopulated via virtctl image-upload pvc on the Filesystem-mode PVC.
#
# **Open: SATA CDROM read timeout** — even with bootOrder=1, OVMF reported
# `BdsDxe: failed to start Boot0001 ... Time out` reading the SATA CDROM
# backed by the Filesystem-mode PVC. A switch to Block-mode DataVolume
# was attempted but blocked by a CDI v1.65.0 upload-pod permission issue
# (capability drop prevents writing to the underlying block device).
# See header docstring on the ISO PVC.
runStrategy: Always # LIVE — ISO uploaded 2026-05-08, password in 1P
template:
metadata:
labels:
app: ci-runner
role: github-actions-runner
kubevirt.io/vm: ci1
spec:
domain:
cpu:
cores: 8
sockets: 1
threads: 1
memory:
guest: 16Gi
resources:
requests:
memory: 16Gi
limits:
memory: 16Gi
clock:
utc: {}
timer:
hpet:
present: false
pit:
tickPolicy: delay
rtc:
tickPolicy: catchup
hyperv: {}
features:
acpi: {}
apic: {}
hyperv:
relaxed: {}
vapic: {}
spinlocks:
spinlocks: 8191
smm: {}
firmware:
bootloader:
efi:
secureBoot: true
devices:
tpm: {} # Non-persistent vTPM — sufficient for runner; no BitLocker
disks:
# bootOrder: ISO must be 1 for first-boot install (the rootdisk has no
# EFI bootloader yet). After Windows installs, it writes its own UEFI
# Boot#### entries pointing at the rootdisk's EFI partition; UEFI then
# boots from rootdisk going forward and the ISO at bootOrder:2 acts as
# a fallback for re-install scenarios.
#
# Original (broken) order had rootdisk=1, windows-iso=2 — UEFI tried
# the empty virtio disk first, got nothing, fell back to the SATA
# CDROM at Boot0001 with a short timeout, and timed out before the
# CDROM enumerated. Console showed:
# BdsDxe: failed to start Boot0001 ... Time out
# BdsDxe: No bootable option or device was found.
# Confirmed via debug pod: PVC content IS a real bootable ISO9660
# (file: "ISO 9660 CD-ROM filesystem data ... (bootable)"), so the
# only bug was boot priority.
- name: windows-iso
bootOrder: 1
cdrom:
bus: sata
- name: rootdisk
bootOrder: 2
disk:
bus: virtio
- name: virtio-drivers
cdrom:
bus: sata
- name: sysprep
cdrom:
bus: sata
interfaces:
# Pod-network fallback for Phase 1. To switch to PROD VLAN once Multus
# + the prod-vlan57 NAD exist, replace this block with:
# - name: prod-net
# bridge: {}
# model: virtio
# and update the networks: stanza to use multus.networkName: kubevirt-vms/prod-vlan57
- name: default
masquerade: {}
model: virtio
machine:
type: q35
networks:
- name: default
pod: {}
volumes:
- name: rootdisk
persistentVolumeClaim:
claimName: ci1-rootdisk
- name: windows-iso
# Path B (2026-05-08): mount ISO from Synology NFS instead of
# Longhorn Filesystem PVC. The Filesystem-PVC path was confirmed to
# contain a valid bootable ISO9660 image but caused OVMF's
# SATA-CDROM read window to time out:
# BdsDxe: failed to start Boot0001 ... Time out
# Block-mode DataVolume was attempted as Path A but blocked by CDI
# v1.65.0's upload pod capability drop. NFS-mounted ISO bypasses
# both issues. See win2025-iso-nfs-pv.yaml header for full rationale
# and Synology layout.
persistentVolumeClaim:
claimName: windows-server-2025-iso-nfs
- name: virtio-drivers
containerDisk:
# Pinned to v1.8.2 (latest stable as of 2026-05-08).
# The :latest tag uses Docker manifest v1 schema which containerd
# 2.1 (RKE2 v1.34.5) refuses to pull with:
# "media type application/vnd.docker.distribution.manifest.v1+prettyjws
# is no longer supported since containerd v2.1"
# v1.8.2 is rebuilt with manifest v2/OCI and works on containerd 2.1.
# Bump available: https://quay.io/repository/kubevirt/virtio-container-disk?tab=tags
image: quay.io/kubevirt/virtio-container-disk:v1.8.2
- name: sysprep
sysprep:
configMap:
name: ci1-autounattend
terminationGracePeriodSeconds: 3600