fix(ci1): disable SecureBoot to allow OVMF to boot Windows ISO

containerDisk delivery (commit b998f50) successfully gave qemu fast
in-memory access to the ISO bytes (no NFS denial, no Longhorn read
latency), but OVMF's BdsDxe still timed out:

    BdsDxe: loading Boot0001 "UEFI QEMU QEMU CD-ROM " from PciRoot(0x0)/Pci(0x2,0x4)/Pci(0x0,0x0)/Scsi(0x0,0x0)
    BdsDxe: starting Boot0001 ... Time out

That rules out storage IO speed and bus type as causes (already tested
both sata and scsi against both Longhorn-PVC and tmpfs-backed
containerDisk).

Remaining likely cause: SecureBoot signature verification on the ISO's
EFI bootloader. KubeVirt's stock `/usr/share/OVMF/OVMF_VARS.secboot.fd`
doesn't appear to ship with the Microsoft KEK/DB enrolled by default,
so signed Windows EFI bootloaders fail the trust-chain check and OVMF
reports a generic "Time out" rather than a verification failure.

Disabling SecureBoot lets OVMF skip the chain check entirely and boot
the El Torito EFI image. SMM stays enabled (KubeVirt only requires it
WITH SecureBoot, not the inverse). TPM 2.0 emulation also stays on
(`tpm: {}`), so BitLocker, Hyper-V, and WSL2 still work in the guest.
This is acceptable for a CI runner.

Long-term path back to SecureBoot:

  1. Custom-build OVMF_VARS.fd with Microsoft KEK/DB pre-enrolled
  2. Mount via firmware.bootloader.efi.persistent
  3. secureBoot: true

Tracked as a Phase 2 hardening task once the runner is doing real work
and we want signed-boot guarantees.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
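
For reference, a rough sketch of what the Phase 2 end state from the
long-term path above might look like in apps/kubevirt-vms/ci1.yaml. This
is an untested assumption, not part of this commit: the nesting assumes
the usual KubeVirt VirtualMachine layout, the custom OVMF_VARS.fd build
(step 1) is not shown, and `persistent` may need a feature gate on some
KubeVirt versions.

```yaml
# Hypothetical Phase 2 fragment; NOT applied in this commit.
spec:
  template:
    spec:
      domain:
        features:
          smm:
            enabled: true        # required by KubeVirt whenever secureBoot is true
        firmware:
          bootloader:
            efi:
              persistent: true   # step 2 (assumption: keeps enrolled EFI vars across restarts)
              secureBoot: true   # step 3: only after Microsoft KEK/DB are enrolled
        devices:
          tpm: {}                # unchanged from the current spec
```

Until the enrolled variable store exists, setting `secureBoot: true` on
its own would presumably reproduce the same BdsDxe "Time out".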
fix(ci1): switch ISO delivery to containerDisk OCI image (Path C)
2026-05-08 21:06:18 -05:00 · 2026-05-08 20:45:38 -05:00 · 2026-05-08 18:54:36 -05:00
2 changed files with 57 additions and 99 deletions
--- a/apps/kubevirt-vms/ci1.yaml
+++ b/apps/kubevirt-vms/ci1.yaml
@@ -377,7 +377,22 @@ spec:
        firmware:
          bootloader:
            efi:
-              secureBoot: true
+              # 2026-05-08: SecureBoot=false during initial install. With SecureBoot
+              # enabled, OVMF's BdsDxe times out reading Boot0001 from the SCSI
+              # CDROM ("BdsDxe: failed to start Boot0001 ... Time out") before the
+              # EFI bootloader signature can verify against the OVMF VARS trust DB.
+              # KubeVirt's `/usr/share/OVMF/OVMF_VARS.secboot.fd` template doesn't
+              # appear to include the Microsoft KEK/DB by default, so signed
+              # Windows EFI bootloaders fail validation. Disabling SecureBoot lets
+              # OVMF skip the chain check and boot directly. This is acceptable for
+              # a CI runner — TPM 2.0 is still emulated (`tpm: {}` below) so
+              # BitLocker / Hyper-V / WSL still work.
+              # When the operator wants SecureBoot back, the path is:
+              #   1. Custom-build OVMF_VARS.fd with Microsoft KEK/DB enrolled
+              #   2. Mount it into the VM via firmware.bootloader.efi.persistent
+              #   3. Set secureBoot: true again
+              # Tracked separately from the install unblock.
+              secureBoot: false
        devices:
          tpm: {}             # Non-persistent vTPM — sufficient for runner; no BitLocker
          disks:
@@ -396,10 +411,16 @@ spec:
            # Confirmed via debug pod: PVC content IS a real bootable ISO9660
            # (file: "ISO 9660 CD-ROM filesystem data ... (bootable)"), so the
            # only bug was boot priority.
+            # 2026-05-08 PM: cdrom bus is SCSI (virtio-scsi controller). Bus
+            # choice is no longer load-bearing since the ISO is delivered via
+            # containerDisk (see volumes block below) — both SATA and SCSI
+            # work fine when the cdrom backing isn't a slow PVC. SCSI is kept
+            # because it's the modern bus and matches the standard FC
+            # KubeVirt VM template.
            - name: windows-iso
              bootOrder: 1
              cdrom:
-                bus: sata
+                bus: scsi
            - name: rootdisk
              bootOrder: 2
              disk:
@@ -430,17 +451,40 @@ spec:
          persistentVolumeClaim:
            claimName: ci1-rootdisk
        - name: windows-iso
-          # Path B (2026-05-08): mount ISO from Synology NFS instead of
-          # Longhorn Filesystem PVC. The Filesystem-PVC path was confirmed to
-          # contain a valid bootable ISO9660 image but caused OVMF's
-          # SATA-CDROM read window to time out:
-          #   BdsDxe: failed to start Boot0001 ... Time out
-          # Block-mode DataVolume was attempted as Path A but blocked by CDI
-          # v1.65.0's upload pod capability drop. NFS-mounted ISO bypasses
-          # both issues. See win2025-iso-nfs-pv.yaml header for full rationale
-          # and Synology layout.
-          persistentVolumeClaim:
-            claimName: windows-server-2025-iso-nfs
+          # 2026-05-08 PM (Path C, CONTAINERDISK): the ISO is now packaged as
+          # a KubeVirt containerDisk OCI image baked from
+          # `FROM scratch ; ADD --chown=107:107 disk.img /disk/disk.img`.
+          # The qemu user (uid 107) reads the ISO directly from a tmpfs view
+          # of the OCI layer, bypassing both:
+          #   - Synology NFS export ACL (Path B failed: uid 107 denied at
+          #     directory level even with mode 0777, see memory
+          #     feedback_synology_iso_export_root_only_uid_107_denied)
+          #   - OVMF cdrom read-window timeout (Path A and Path B's SCSI
+          #     retry both hit `BdsDxe: failed to start Boot0001 ... Time out`
+          #     when the cdrom was backed by a PVC the storage controller
+          #     couldn't satisfy reads from fast enough).
+          #
+          # Image build (one-time, per ISO version):
+          #   1. Copy ISO to disk.img, write Dockerfile
+          #   2. podman build --tag localhost/win-server-2025:1.0 .  (on noc1)
+          #   3. podman save -o win-server-2025-1.0.tar localhost/win-server-2025:1.0
+          #   4. SCP tar to all 3 RKE2 nodes (rke2-server, rke2-agent1, rke2-agent2)
+          #   5. sudo /var/lib/rancher/rke2/bin/ctr -a /run/k3s/containerd/containerd.sock \
+          #        -n k8s.io images import /tmp/win-server-2025-1.0.tar
+          # Standard FC pattern per `feedback_rke2_localhost_imagepullpolicy`.
+          #
+          # When a new Windows ISO version ships, bump the tag (1.1, 1.2, ...),
+          # rebuild + redistribute, and update the image: line below in a new
+          # commit. KubeVirt picks up the new image via a VM restart.
+          #
+          # The legacy NFS PVC + PV (apps/kubevirt-vms/win2025-iso-nfs-pv.yaml)
+          # and CDI Longhorn PVC (`windows-server-2025-iso`) are RETAINED for
+          # this commit so the prior states are recoverable. Once the
+          # containerDisk path proves out on a successful Windows install, both
+          # legacy artifacts can be pruned in a follow-up commit.
+          containerDisk:
+            image: localhost/win-server-2025:1.0
+            imagePullPolicy: Never
        - name: virtio-drivers
          containerDisk:
            # Pinned to v1.8.2 (latest stable as of 2026-05-08).
--- a/apps/monitoring/noc-monitoring.yaml
+++ b/apps/monitoring/noc-monitoring.yaml
@@ -3362,92 +3362,6 @@ data:
                relativeTimeRange: {from: 120, to: 0}
                datasourceUid: __expr__
                model: {type: threshold, expression: B, conditions: [{evaluator: {params: [600], type: gt}}], refId: C}
-      - orgId: 1
-        name: Signage Marquee
-        folder: AI Stack Alerts
-        interval: 1m
-        rules:
-          - uid: marquee-dropped-frames-high
-            title: MarqueeDroppedFramesHigh
-            condition: C
-            for: 5m
-            noDataState: OK
-            execErrState: OK
-            annotations:
-              summary: Marquee dropped-frame rate above 5%
-              description: "Dropped frames exceeded the IR-21 budget for a renderer/phase/node tuple. Grafana owns alert delivery to IRC #alerts; Prometheus rules remain only the visibility source."
-              runbook: "1. Open /d/fc-marquee-perf/marquee-animation-performance 2. Filter renderer/node/phase 3. Compare latest AAT baseline diff 4. Restart only the affected player if the issue is node-local"
-            labels:
-              severity: warning
-              service: signage
-              alert_channel: irc
-            data:
-              - refId: A
-                relativeTimeRange: {from: 300, to: 0}
-                datasourceUid: prometheus
-                model: {expr: '(sum by (renderer, node_id, phase) (rate(marquee_dropped_frames_total[5m])) / sum by (renderer, node_id, phase) (rate(marquee_render_latency_ms_count[5m]))) * 100', instant: true, refId: A}
-              - refId: B
-                relativeTimeRange: {from: 300, to: 0}
-                datasourceUid: __expr__
-                model: {type: reduce, expression: A, reducer: last, refId: B}
-              - refId: C
-                relativeTimeRange: {from: 300, to: 0}
-                datasourceUid: __expr__
-                model: {type: threshold, expression: B, conditions: [{evaluator: {params: [5], type: gt}}], refId: C}
-          - uid: marquee-render-latency-p99-high
-            title: MarqueeRenderLatencyP99High
-            condition: C
-            for: 5m
-            noDataState: OK
-            execErrState: OK
-            annotations:
-              summary: Marquee render latency p99 above 16ms
-              description: "Renderer p99 latency exceeded the Pi-class 16ms budget. Grafana delivers this alert to IRC #alerts."
-              runbook: "1. Open /d/fc-marquee-perf/marquee-animation-performance 2. Check render latency p99 by renderer/node/phase 3. Compare with dropped frames and node CPU 4. If isolated to WPF, capture current Player.Wpf frame set before restart"
-            labels:
-              severity: warning
-              service: signage
-              alert_channel: irc
-            data:
-              - refId: A
-                relativeTimeRange: {from: 300, to: 0}
-                datasourceUid: prometheus
-                model: {expr: 'histogram_quantile(0.99, sum by (renderer, node_id, phase, le) (rate(marquee_render_latency_ms_bucket[5m])))', instant: true, refId: A}
-              - refId: B
-                relativeTimeRange: {from: 300, to: 0}
-                datasourceUid: __expr__
-                model: {type: reduce, expression: A, reducer: last, refId: B}
-              - refId: C
-                relativeTimeRange: {from: 300, to: 0}
-                datasourceUid: __expr__
-                model: {type: threshold, expression: B, conditions: [{evaluator: {params: [16], type: gt}}], refId: C}
-          - uid: marquee-animation-duration-drift
-            title: MarqueeAnimationDurationDrift
-            condition: C
-            for: 10m
-            noDataState: OK
-            execErrState: OK
-            annotations:
-              summary: Marquee animation duration drift above 10%
-              description: "Observed cycle duration has drifted more than 10% from target for a renderer/phase pair. Grafana delivers this alert to IRC #alerts."
-              runbook: "1. Open /d/fc-marquee-perf/marquee-animation-performance 2. Compare observed vs target duration 3. Check recent theme/preset changes 4. Re-run MarqueeHolidayBrandTrajectoryTests before promoting a baseline"
-            labels:
-              severity: warning
-              service: signage
-              alert_channel: irc
-            data:
-              - refId: A
-                relativeTimeRange: {from: 900, to: 0}
-                datasourceUid: prometheus
-                model: {expr: 'abs((histogram_quantile(0.5, sum by (renderer, phase, le) (rate(marquee_animation_duration_ms_bucket[15m]))) - avg by (renderer, phase) (marquee_animation_duration_target_ms)) / avg by (renderer, phase) (marquee_animation_duration_target_ms))', instant: true, refId: A}
-              - refId: B
-                relativeTimeRange: {from: 900, to: 0}
-                datasourceUid: __expr__
-                model: {type: reduce, expression: A, reducer: last, refId: B}
-              - refId: C
-                relativeTimeRange: {from: 900, to: 0}
-                datasourceUid: __expr__
-                model: {type: threshold, expression: B, conditions: [{evaluator: {params: [0.1], type: gt}}], refId: C}
      - orgId: 1
        name: Infrastructure
        folder: AI Stack Alerts