fix(ci1): switch ISO delivery to containerDisk OCI image (Path C)

OCI image: localhost/win-server-2025:1.0 (8.27 GB)

Built FROM scratch + ADD disk.img → /disk/disk.img on noc1, saved with
podman save as a tar (8.27 GB), SCP'd in parallel to all 3 RKE2 nodes,
and imported via ctr into the k8s.io namespace. Verified present on all 3
schedulable nodes (rke2-server, rke2-agent1, rke2-agent2).

Why containerDisk over the prior PVC paths:
- Path A (Longhorn Filesystem PVC, sata): OVMF BdsDxe SATA-CDROM read
  timeout. A PVC-backed cdrom is too slow for OVMF's first-sector read
  window.
- Path B (Synology NFS): uid 107 (qemu) denied at directory level by the
  Synology export ACL despite file mode 0777. Memory:
  feedback_synology_iso_export_root_only_uid_107_denied.
- Path B+SCSI: same OVMF timeout, just on the SCSI controller. Bus choice
  was not load-bearing; the issue was always the slow PVC backing.
- Path C (this commit): containerDisk delivers the ISO bytes from a tmpfs
  view of the OCI layer, with no PVC controller in the read path. qemu
  reads at native FS speed; OVMF's first-sector read completes well
  within the timeout. This is also the KubeVirt-recommended pattern for
  installer ISOs.

Connects to the FlowerCore.Distribution / Provisioning USB story: same
"OCI image of the OS installer + autounattend on a sysprep CDROM" pattern
that the USB provisioning agent will use. The Windows install proceeds
hands-off via the existing autounattend.xml in the ci1-autounattend
ConfigMap (RDP enabled, WinRM, UAC disabled, Administrator password from
1Password vault item h3ix4mgfk65gmkcmvh6ly3d3hu).

Image lifecycle: bump the tag (1.1, 1.2, ...) when the ISO version
changes, rebuild on noc1, redistribute to the RKE2 nodes, and update the
image: line.

Legacy NFS PVC + PV manifest and CDI Longhorn PVC are RETAINED for this
commit so prior states are recoverable. Will prune in a follow-up once
the containerDisk boot is proven.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
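The build and distribution flow in the commit message can be sketched as a dry-run script. It only prints commands and executes nothing. The image name, node names, ctr binary, and socket path come from this commit; the `/tmp` destination for scp, SSH access to each node by hostname, and the final `images ls` check are assumptions for illustration, not the exact commands that were run.

```shell
#!/bin/sh
# Dry-run sketch: prints the containerDisk build + distribution commands.
# Nothing here runs podman, scp, or ctr; review the output before using it.
set -eu

IMAGE="localhost/win-server-2025"
NODES="rke2-server rke2-agent1 rke2-agent2"
# RKE2's bundled ctr and containerd socket, as written in the ci1.yaml comment.
CTR="/var/lib/rancher/rke2/bin/ctr -a /run/k3s/containerd/containerd.sock"

# Containerfile for the containerDisk layout named in ci1.yaml.
containerfile() {
  printf 'FROM scratch\nADD --chown=107:107 disk.img /disk/disk.img\n'
}

# Print every step for one image tag (for example 1.0).
print_steps() {
  tag="$1"
  tar="win-server-2025-${tag}.tar"
  echo "podman build --tag ${IMAGE}:${tag} ."
  echo "podman save -o ${tar} ${IMAGE}:${tag}"
  for node in $NODES; do
    # Assumption: the tar is staged under /tmp on each node.
    echo "scp ${tar} ${node}:/tmp/${tar}"
    echo "ssh ${node} sudo ${CTR} -n k8s.io images import /tmp/${tar}"
    # Assumption: presence is confirmed by listing image names on the node.
    echo "ssh ${node} sudo ${CTR} -n k8s.io images ls -q | grep -F ${IMAGE}:${tag}"
  done
}

print_steps "${TAG:-1.0}"
```

Running it with `TAG=1.1` prints the same sequence for the next image version. The import has to happen on every schedulable node because the VM uses `imagePullPolicy: Never` and can land on any of them.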
fix(ci1): revert NFS Path B + flip ISO cdrom bus sata→scsi
2026-05-08 20:45:38 -05:00 · 2026-05-08 18:54:36 -05:00
2 changed files with 41 additions and 98 deletions
--- a/apps/kubevirt-vms/ci1.yaml
+++ b/apps/kubevirt-vms/ci1.yaml
@@ -396,10 +396,16 @@ spec:
            # Confirmed via debug pod: PVC content IS a real bootable ISO9660
            # (file: "ISO 9660 CD-ROM filesystem data ... (bootable)"), so the
            # only bug was boot priority.
+            # 2026-05-08 PM: cdrom bus is SCSI (virtio-scsi controller). Bus
+            # choice is no longer load-bearing since the ISO is delivered via
+            # containerDisk (see volumes block below); both SATA and SCSI
+            # work fine when the cdrom backing isn't a slow PVC. SCSI is kept
+            # because it's the modern bus and matches the standard FC
+            # KubeVirt VM template.
            - name: windows-iso
              bootOrder: 1
              cdrom:
-                bus: sata
+                bus: scsi
            - name: rootdisk
              bootOrder: 2
              disk:
@@ -430,17 +436,40 @@ spec:
          persistentVolumeClaim:
            claimName: ci1-rootdisk
        - name: windows-iso
-          # Path B (2026-05-08): mount ISO from Synology NFS instead of
-          # Longhorn Filesystem PVC. The Filesystem-PVC path was confirmed to
-          # contain a valid bootable ISO9660 image but caused OVMF's
-          # SATA-CDROM read window to time out:
-          #   BdsDxe: failed to start Boot0001 ... Time out
-          # Block-mode DataVolume was attempted as Path A but blocked by CDI
-          # v1.65.0's upload pod capability drop. NFS-mounted ISO bypasses
-          # both issues. See win2025-iso-nfs-pv.yaml header for full rationale
-          # and Synology layout.
-          persistentVolumeClaim:
-            claimName: windows-server-2025-iso-nfs
+          # 2026-05-08 PM (Path C, CONTAINERDISK): the ISO is now packaged as
+          # a KubeVirt containerDisk OCI image baked from
+          # `FROM scratch ; ADD --chown=107:107 disk.img /disk/disk.img`.
+          # The qemu user (uid 107) reads the ISO directly from a tmpfs view
+          # of the OCI layer, bypassing both:
+          #   - Synology NFS export ACL (Path B failed: uid 107 denied at
+          #     directory level even with mode 0777, see memory
+          #     feedback_synology_iso_export_root_only_uid_107_denied)
+          #   - OVMF cdrom read-window timeout (Path A and Path B's SCSI
+          #     retry both hit `BdsDxe: failed to start Boot0001 ... Time out`
+          #     when the cdrom was backed by a PVC the storage controller
+          #     couldn't satisfy reads from fast enough).
+          #
+          # Image build (one-time, per ISO version):
+          #   1. Copy ISO to disk.img, write Dockerfile
+          #   2. podman build --tag localhost/win-server-2025:1.0 .  (on noc1)
+          #   3. podman save -o win-server-2025-1.0.tar localhost/win-server-2025:1.0
+          #   4. SCP tar to all 3 RKE2 nodes (rke2-server, rke2-agent1, rke2-agent2)
+          #   5. sudo /var/lib/rancher/rke2/bin/ctr -a /run/k3s/containerd/containerd.sock \
+          #        -n k8s.io images import /tmp/win-server-2025-1.0.tar
+          # Standard FC pattern per `feedback_rke2_localhost_imagepullpolicy`.
+          #
+          # When a new Windows ISO version ships, bump the tag (1.1, 1.2, ...),
+          # rebuild + redistribute, and update the image: line below in a new
+          # commit. KubeVirt picks up the new image via a VM restart.
+          #
+          # The legacy NFS PVC + PV (apps/kubevirt-vms/win2025-iso-nfs-pv.yaml)
+          # and CDI Longhorn PVC (`windows-server-2025-iso`) are RETAINED for
+          # this commit so the prior states are recoverable. Once the
+          # containerDisk path proves on a successful Windows install, both
+          # legacy artifacts can be pruned in a follow-up commit.
+          containerDisk:
+            image: localhost/win-server-2025:1.0
+            imagePullPolicy: Never
        - name: virtio-drivers
          containerDisk:
            # Pinned to v1.8.2 (latest stable as of 2026-05-08).
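The tag-bump step described in the ci1.yaml comment (1.0, then 1.1, 1.2, ...) can be sketched as two small shell helpers. Both function names are hypothetical, and the sketch assumes plain MAJOR.MINOR tags and a single `image:` line for this image in the manifest; it is an illustration, not tooling that exists in this repository.

```shell
#!/bin/sh
# Sketch: bump the minor part of a MAJOR.MINOR tag and rewrite the image: line.
set -eu

IMAGE="localhost/win-server-2025"

# next_tag prints the tag that follows the given MAJOR.MINOR tag.
next_tag() {
  major="${1%%.*}"   # text before the first dot
  minor="${1#*.}"    # text after the first dot
  echo "${major}.$(($minor + 1))"
}

# bump_image_line reads manifest text on stdin and writes it to stdout with
# the image tag changed from $1 to $2. Lines for other images are untouched.
bump_image_line() {
  sed "s|image: ${IMAGE}:$1\$|image: ${IMAGE}:$2|"
}
```

For example, `next_tag 1.0` prints `1.1`, and `bump_image_line 1.0 1.1 < apps/kubevirt-vms/ci1.yaml` writes the manifest with the new tag to stdout so the change can be reviewed before it is committed. The new image still has to be rebuilt and imported on every node before the VM is restarted.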
--- a/apps/monitoring/noc-monitoring.yaml
+++ b/apps/monitoring/noc-monitoring.yaml
@@ -3362,92 +3362,6 @@ data:
                relativeTimeRange: {from: 120, to: 0}
                datasourceUid: __expr__
                model: {type: threshold, expression: B, conditions: [{evaluator: {params: [600], type: gt}}], refId: C}
-      - orgId: 1
-        name: Signage Marquee
-        folder: AI Stack Alerts
-        interval: 1m
-        rules:
-          - uid: marquee-dropped-frames-high
-            title: MarqueeDroppedFramesHigh
-            condition: C
-            for: 5m
-            noDataState: OK
-            execErrState: OK
-            annotations:
-              summary: Marquee dropped-frame rate above 5%
-              description: "Dropped frames exceeded the IR-21 budget for a renderer/phase/node tuple. Grafana owns alert delivery to IRC #alerts; Prometheus rules remain only the visibility source."
-              runbook: "1. Open /d/fc-marquee-perf/marquee-animation-performance 2. Filter renderer/node/phase 3. Compare latest AAT baseline diff 4. Restart only the affected player if the issue is node-local"
-            labels:
-              severity: warning
-              service: signage
-              alert_channel: irc
-            data:
-              - refId: A
-                relativeTimeRange: {from: 300, to: 0}
-                datasourceUid: prometheus
-                model: {expr: '(sum by (renderer, node_id, phase) (rate(marquee_dropped_frames_total[5m])) / sum by (renderer, node_id, phase) (rate(marquee_render_latency_ms_count[5m]))) * 100', instant: true, refId: A}
-              - refId: B
-                relativeTimeRange: {from: 300, to: 0}
-                datasourceUid: __expr__
-                model: {type: reduce, expression: A, reducer: last, refId: B}
-              - refId: C
-                relativeTimeRange: {from: 300, to: 0}
-                datasourceUid: __expr__
-                model: {type: threshold, expression: B, conditions: [{evaluator: {params: [5], type: gt}}], refId: C}
-          - uid: marquee-render-latency-p99-high
-            title: MarqueeRenderLatencyP99High
-            condition: C
-            for: 5m
-            noDataState: OK
-            execErrState: OK
-            annotations:
-              summary: Marquee render latency p99 above 16ms
-              description: "Renderer p99 latency exceeded the Pi-class 16ms budget. Grafana delivers this alert to IRC #alerts."
-              runbook: "1. Open /d/fc-marquee-perf/marquee-animation-performance 2. Check render latency p99 by renderer/node/phase 3. Compare with dropped frames and node CPU 4. If isolated to WPF, capture current Player.Wpf frame set before restart"
-            labels:
-              severity: warning
-              service: signage
-              alert_channel: irc
-            data:
-              - refId: A
-                relativeTimeRange: {from: 300, to: 0}
-                datasourceUid: prometheus
-                model: {expr: 'histogram_quantile(0.99, sum by (renderer, node_id, phase, le) (rate(marquee_render_latency_ms_bucket[5m])))', instant: true, refId: A}
-              - refId: B
-                relativeTimeRange: {from: 300, to: 0}
-                datasourceUid: __expr__
-                model: {type: reduce, expression: A, reducer: last, refId: B}
-              - refId: C
-                relativeTimeRange: {from: 300, to: 0}
-                datasourceUid: __expr__
-                model: {type: threshold, expression: B, conditions: [{evaluator: {params: [16], type: gt}}], refId: C}
-          - uid: marquee-animation-duration-drift
-            title: MarqueeAnimationDurationDrift
-            condition: C
-            for: 10m
-            noDataState: OK
-            execErrState: OK
-            annotations:
-              summary: Marquee animation duration drift above 10%
-              description: "Observed cycle duration has drifted more than 10% from target for a renderer/phase pair. Grafana delivers this alert to IRC #alerts."
-              runbook: "1. Open /d/fc-marquee-perf/marquee-animation-performance 2. Compare observed vs target duration 3. Check recent theme/preset changes 4. Re-run MarqueeHolidayBrandTrajectoryTests before promoting a baseline"
-            labels:
-              severity: warning
-              service: signage
-              alert_channel: irc
-            data:
-              - refId: A
-                relativeTimeRange: {from: 900, to: 0}
-                datasourceUid: prometheus
-                model: {expr: 'abs((histogram_quantile(0.5, sum by (renderer, phase, le) (rate(marquee_animation_duration_ms_bucket[15m]))) - avg by (renderer, phase) (marquee_animation_duration_target_ms)) / avg by (renderer, phase) (marquee_animation_duration_target_ms))', instant: true, refId: A}
-              - refId: B
-                relativeTimeRange: {from: 900, to: 0}
-                datasourceUid: __expr__
-                model: {type: reduce, expression: A, reducer: last, refId: B}
-              - refId: C
-                relativeTimeRange: {from: 900, to: 0}
-                datasourceUid: __expr__
-                model: {type: threshold, expression: B, conditions: [{evaluator: {params: [0.1], type: gt}}], refId: C}
      - orgId: 1
        name: Infrastructure
        folder: AI Stack Alerts