feat(infra): retire fc-redis bootstrap after Redis Manager adoption

feat(fc-redis): add SignalR backplane for cross-product event bus (Q-SO-1 Phase A)
Per Q-SO-1 operator resolution 2026-05-11 PM, Redis SignalR backplane lands in Phase A (was Phase C deferral). Treats Redis as a managed FC infrastructure component, not a deferred scaling escalation. Lands the minimal Phase A surface: - Namespace fc-redis - Single Redis 7-alpine pod with 1Gi Longhorn RWO PVC - ConfigMap with AOF persistence (everysec), 256Mi maxmemory, allkeys-lru - ClusterIP Service `redis.fc-redis.svc.cluster.local:6379` (in-cluster only) - No AUTH Phase A (Phase B add via 1Password Connect rotation) - No IngressRoute (backplane is server-to-server) Consumers (Phase A IMPL across FC services) add: services.AddSignalR().AddStackExchangeRedis( "redis.fc-redis.svc.cluster.local:6379", opts => opts.Configuration.ChannelPrefix = StackExchange.Redis.RedisChannel.Literal("fc-opsconsole")); Phase B/C follow-ons (not in this commit): Sentinel for HA, AUTH password from 1Password, redis_exporter sidecar for Prometheus, network policies. See FlowerCore.Notes/docs/signage/operations-console-phase-2-design.md section 3.5 (rewritten) and decisions-waiting.html Q-SO-1 (RESOLVED). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 02:22:47 -05:00 · 2026-05-11 19:02:58 -05:00 · 2026-05-11 18:37:15 -05:00 · 2026-05-11 10:42:27 -05:00 · 2026-05-11 10:30:05 -05:00 · 2026-05-08 21:35:00 -05:00
5 changed files with 117 additions and 106 deletions
--- a/apps/fc-updater/fc-updater.yaml
+++ b/apps/fc-updater/fc-updater.yaml
@@ -58,7 +58,7 @@ spec:
      nodeName: rke2-server
      containers:
        - name: web
-          image: localhost/fc-updater-web:v20260508-pub3-deepening-2bdf108
+          image: localhost/fc-updater-web:v20260509-4162dca-authgate
          imagePullPolicy: Never
          ports:
            - containerPort: 8080
--- a/apps/kubevirt-vms/ci1.yaml
+++ b/apps/kubevirt-vms/ci1.yaml
@@ -377,7 +377,22 @@ spec:
        firmware:
          bootloader:
            efi:
-              secureBoot: true
+              # 2026-05-08: SecureBoot=false during initial install. With SecureBoot
              # enabled, OVMF's BdsDxe times out reading Boot0001 from the SCSI
              # CDROM ("BdsDxe: failed to start Boot0001 ... Time out") before the
              # EFI bootloader signature can verify against the OVMF VARS trust DB.
              # KubeVirt's `/usr/share/OVMF/OVMF_VARS.secboot.fd` template doesn't
              # appear to include the Microsoft KEK/DB by default, so signed
              # Windows EFI bootloaders fail validation. Disabling SecureBoot lets
              # OVMF skip the chain check and boot directly. This is acceptable for
              # a CI runner — TPM 2.0 is still emulated (`tpm: {}` below) so
              # BitLocker / Hyper-V / WSL still work.
              # When the operator wants SecureBoot back, the path is:
              #   1. Custom-build OVMF_VARS.fd with Microsoft KEK/DB enrolled
              #   2. Mount it into the VM via firmware.bootloader.efi.persistent
              #   3. Set secureBoot: true again
              # Tracked separately from the install unblock.
              secureBoot: false
        devices:
          tpm: {}             # Non-persistent vTPM — sufficient for runner; no BitLocker
          disks:
@@ -396,10 +411,22 @@ spec:
            # Confirmed via debug pod: PVC content IS a real bootable ISO9660
            # (file: "ISO 9660 CD-ROM filesystem data ... (bootable)"), so the
            # only bug was boot priority.
            # 2026-05-08 PM: cdrom bus SCSI + containerDisk delivery. This
            # combination boots qemu cleanly and reaches OVMF, but OVMF
            # BdsDxe still hits "starting Boot0001 ... Time out" on the
            # cdrom — see HANDOFF.md / CODEX-STATUS.md "OPEN — ci1" for the
            # full diagnostic chain. virtio-blk disk swap was attempted as a
            # workaround but introduced a separate QEMU rootdisk flock issue
            # without fixing the underlying OVMF cdrom problem; reverted.
            # Operator decision needed for next architectural step (OVMF
            # custom build with extended timeout, KubeVirt version bump,
            # Hyper-V/VirtualBox-and-export, or BIOS legacy boot). The
            # containerDisk distribution pipeline (build/save/scp/ctr import)
            # is proven and ready to reuse for any of those.
            - name: windows-iso
              bootOrder: 1
              cdrom:
-                bus: sata
+                bus: scsi
            - name: rootdisk
              bootOrder: 2
              disk:
@@ -430,17 +457,40 @@ spec:
          persistentVolumeClaim:
            claimName: ci1-rootdisk
        - name: windows-iso
-          # Path B (2026-05-08): mount ISO from Synology NFS instead of
+          # 2026-05-08 PM (Path C, CONTAINERDISK): the ISO is now packaged as
-          # Longhorn Filesystem PVC. The Filesystem-PVC path was confirmed to
+          # a KubeVirt containerDisk OCI image baked from
-          # contain a valid bootable ISO9660 image but caused OVMF's
+          # `FROM scratch ; ADD --chown=107:107 disk.img /disk/disk.img`.
-          # SATA-CDROM read window to time out:
+          # The qemu user (uid 107) reads the ISO directly from a tmpfs view
-          #   BdsDxe: failed to start Boot0001 ... Time out
+          # of the OCI layer, bypassing both:
-          # Block-mode DataVolume was attempted as Path A but blocked by CDI
+          #   - Synology NFS export ACL (Path B failed: uid 107 denied at
-          # v1.65.0's upload pod capability drop. NFS-mounted ISO bypasses
+          #     directory level even with mode 0777, see memory
-          # both issues. See win2025-iso-nfs-pv.yaml header for full rationale
+          #     feedback_synology_iso_export_root_only_uid_107_denied)
-          # and Synology layout.
+          #   - OVMF cdrom read-window timeout (Path A and Path B's SCSI
-          persistentVolumeClaim:
+          #     retry both hit `BdsDxe: failed to start Boot0001 ... Time out`
-            claimName: windows-server-2025-iso-nfs
+          #     when the cdrom was backed by a PVC the storage controller
          #     couldn't satisfy reads from fast enough).
          #
          # Image build (one-time, per ISO version):
          #   1. Copy ISO to disk.img, write Dockerfile
          #   2. podman build --tag localhost/win-server-2025:1.0 .  (on noc1)
          #   3. podman save -o win-server-2025-1.0.tar localhost/win-server-2025:1.0
          #   4. SCP tar to all 3 RKE2 nodes (rke2-server, rke2-agent1, rke2-agent2)
          #   5. sudo /var/lib/rancher/rke2/bin/ctr -a /run/k3s/containerd/containerd.sock \
          #        -n k8s.io images import /tmp/win-server-2025-1.0.tar
          # Standard FC pattern per `feedback_rke2_localhost_imagepullpolicy`.
          #
          # When a new Windows ISO version ships, bump the tag (1.1, 1.2, ...),
          # rebuild + redistribute, and update the image: line below in a new
          # commit. KubeVirt picks up the new image via a VM restart.
          #
          # The legacy NFS PVC + PV (apps/kubevirt-vms/win2025-iso-nfs-pv.yaml)
          # and CDI Longhorn PVC (`windows-server-2025-iso`) are RETAINED for
          # this commit so the prior states are recoverable. Once the
          # containerDisk path proves on a successful Windows install, both
          # legacy artifacts can be pruned in a follow-up commit.
          containerDisk:
            image: localhost/win-server-2025:1.0
            imagePullPolicy: Never
        - name: virtio-drivers
          containerDisk:
            # Pinned to v1.8.2 (latest stable as of 2026-05-08).
--- a/apps/monitoring/noc-monitoring.yaml
+++ b/apps/monitoring/noc-monitoring.yaml
@@ -974,6 +974,39 @@ data:
              summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replica mismatch"
              description: "Spec wants {{ $labels.spec_replicas }} but only {{ $value }} available. Likely a rollout stuck on probe failure, scheduling, or PVC."
          # Q-MR-3 (2026-05-11): multus memory pressure — catches the next OOM
          # cascade BEFORE multus is OOM-killed cluster-wide. The 2026-05-10
          # outage (21h) hit because no alert fired on the rising multus working
          # set — only downstream blackbox / Traefik / service alerts. With
          # 1Gi limit (bluejay-infra@eb8693e), 80% = ~800MiB; steady-state
          # runs ~150-250MiB so this only fires when an avalanche starts.
          - alert: MultusMemoryPressure
            expr: |
              container_memory_working_set_bytes{container="kube-multus"}
                / container_spec_memory_limit_bytes{container="kube-multus"} > 0.8
            for: 5m
            labels:
              severity: critical
              alert_channel: thermal_print
            annotations:
              summary: "kube-multus memory >80% of limit on {{ $labels.node }} for 5m"
              description: "kube-multus working set is {{ $value | humanizePercentage }} of its memory limit on node {{ $labels.node }}. If this keeps climbing, multus will OOM and all new pod networking will halt cluster-wide (precedent: 2026-05-10 outage)."
          # Q-MR-3 (2026-05-11): namespace pending-pod backlog — catches the
          # operator-leak avalanche pattern BEFORE it cascades into a multus
          # CNI OOM. Any FC operator (RemoteDesktop / Distribution / WorldBuilder)
          # emitting pods without ownerReferences will accumulate them when
          # the operator crashes. >25 pending pods in any namespace for 30m
          # is the signal to investigate the reconciler.
          - alert: NamespacePendingPodBacklog
            expr: sum by (namespace) (kube_pod_status_phase{phase="Pending"}) > 25
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: "Namespace {{ $labels.namespace }} has {{ $value }} Pending pods for 30m"
              description: "Pending pod count in {{ $labels.namespace }} exceeds 25 sustained for 30m. Likely operator-leak avalanche pattern — children emitted without ownerReferences. Risk of multus CNI OOM cascade."
      # Longhorn storage health alerts. Required: longhorn scrape job
      # (added 2026-04-26 — see scrape_configs above). The K8s events
      # for "snapshot becomes not ready to use" are transient lifecycle
@@ -3362,92 +3395,6 @@ data:
                relativeTimeRange: {from: 120, to: 0}
                datasourceUid: __expr__
                model: {type: threshold, expression: B, conditions: [{evaluator: {params: [600], type: gt}}], refId: C}
      - orgId: 1
        name: Signage Marquee
        folder: AI Stack Alerts
        interval: 1m
        rules:
          - uid: marquee-dropped-frames-high
            title: MarqueeDroppedFramesHigh
            condition: C
            for: 5m
            noDataState: OK
            execErrState: OK
            annotations:
              summary: Marquee dropped-frame rate above 5%
              description: "Dropped frames exceeded the IR-21 budget for a renderer/phase/node tuple. Grafana owns alert delivery to IRC #alerts; Prometheus rules remain only the visibility source."
              runbook: "1. Open /d/fc-marquee-perf/marquee-animation-performance 2. Filter renderer/node/phase 3. Compare latest AAT baseline diff 4. Restart only the affected player if the issue is node-local"
            labels:
              severity: warning
              service: signage
              alert_channel: irc
            data:
              - refId: A
                relativeTimeRange: {from: 300, to: 0}
                datasourceUid: prometheus
                model: {expr: '(sum by (renderer, node_id, phase) (rate(marquee_dropped_frames_total[5m])) / sum by (renderer, node_id, phase) (rate(marquee_render_latency_ms_count[5m]))) * 100', instant: true, refId: A}
              - refId: B
                relativeTimeRange: {from: 300, to: 0}
                datasourceUid: __expr__
                model: {type: reduce, expression: A, reducer: last, refId: B}
              - refId: C
                relativeTimeRange: {from: 300, to: 0}
                datasourceUid: __expr__
                model: {type: threshold, expression: B, conditions: [{evaluator: {params: [5], type: gt}}], refId: C}
          - uid: marquee-render-latency-p99-high
            title: MarqueeRenderLatencyP99High
            condition: C
            for: 5m
            noDataState: OK
            execErrState: OK
            annotations:
              summary: Marquee render latency p99 above 16ms
              description: "Renderer p99 latency exceeded the Pi-class 16ms budget. Grafana delivers this alert to IRC #alerts."
              runbook: "1. Open /d/fc-marquee-perf/marquee-animation-performance 2. Check render latency p99 by renderer/node/phase 3. Compare with dropped frames and node CPU 4. If isolated to WPF, capture current Player.Wpf frame set before restart"
            labels:
              severity: warning
              service: signage
              alert_channel: irc
            data:
              - refId: A
                relativeTimeRange: {from: 300, to: 0}
                datasourceUid: prometheus
                model: {expr: 'histogram_quantile(0.99, sum by (renderer, node_id, phase, le) (rate(marquee_render_latency_ms_bucket[5m])))', instant: true, refId: A}
              - refId: B
                relativeTimeRange: {from: 300, to: 0}
                datasourceUid: __expr__
                model: {type: reduce, expression: A, reducer: last, refId: B}
              - refId: C
                relativeTimeRange: {from: 300, to: 0}
                datasourceUid: __expr__
                model: {type: threshold, expression: B, conditions: [{evaluator: {params: [16], type: gt}}], refId: C}
          - uid: marquee-animation-duration-drift
            title: MarqueeAnimationDurationDrift
            condition: C
            for: 10m
            noDataState: OK
            execErrState: OK
            annotations:
              summary: Marquee animation duration drift above 10%
              description: "Observed cycle duration has drifted more than 10% from target for a renderer/phase pair. Grafana delivers this alert to IRC #alerts."
              runbook: "1. Open /d/fc-marquee-perf/marquee-animation-performance 2. Compare observed vs target duration 3. Check recent theme/preset changes 4. Re-run MarqueeHolidayBrandTrajectoryTests before promoting a baseline"
            labels:
              severity: warning
              service: signage
              alert_channel: irc
            data:
              - refId: A
                relativeTimeRange: {from: 900, to: 0}
                datasourceUid: prometheus
                model: {expr: 'abs((histogram_quantile(0.5, sum by (renderer, phase, le) (rate(marquee_animation_duration_ms_bucket[15m]))) - avg by (renderer, phase) (marquee_animation_duration_target_ms)) / avg by (renderer, phase) (marquee_animation_duration_target_ms))', instant: true, refId: A}
              - refId: B
                relativeTimeRange: {from: 900, to: 0}
                datasourceUid: __expr__
                model: {type: reduce, expression: A, reducer: last, refId: B}
              - refId: C
                relativeTimeRange: {from: 900, to: 0}
                datasourceUid: __expr__
                model: {type: threshold, expression: B, conditions: [{evaluator: {params: [0.1], type: gt}}], refId: C}
      - orgId: 1
        name: Infrastructure
        folder: AI Stack Alerts
--- a/apps/multus/multus.yaml
+++ b/apps/multus/multus.yaml
@@ -188,13 +188,24 @@ spec:
        - name: kube-multus
          image: ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick
          command: [ "/usr/src/multus-cni/bin/multus-daemon" ]
          # 2026-05-11: upstream default of 50Mi memory limit OOM-cascades when
          # an operator-owned namespace accumulates >100 pending pods retrying
          # CNI ADD. RemoteDesktop emitted 219 orphan rd-browser-only pods
          # (missing OwnerReferences), kubelet's CNI ADD avalanche pushed multus
          # over 50Mi, OOMKilled, restarted with even bigger backlog → loop.
          # 21h cluster outage. See FlowerCore.Notes:
          #   feedback_multus_50mi_limit_oom_orphan_pod_avalanche.md
          # 1Gi limit / 512Mi request comfortably handles a 200+ pod CNI
          # catchup burst on 64GB nodes (nodes are <25% used in steady-state).
          # Drop back toward 256Mi only after MultusMemoryPressure alert
          # proves steady-state working set sits well below 200Mi.
          resources:
            requests:
              cpu: "100m"
-              memory: "50Mi"
+              memory: "512Mi"
            limits:
              cpu: "100m"
-              memory: "50Mi"
+              memory: "1Gi"
          securityContext:
            privileged: true
          terminationMessagePolicy: FallbackToLogsOnError
--- a/apps/telephony/telephony.yaml
+++ b/apps/telephony/telephony.yaml
@@ -127,10 +127,13 @@ spec:
      initContainers:
        - name: fix-data-perms
          image: busybox:latest
-          # Also chown /shared-tts (hostPath /tmp/tts-audio) so the non-root
+          # Must run as root to chown the hostPath /tmp/tts-audio that may be
-          # app user (uid 1654) can write Piper .sln16 files that Asterisk
+          # root-owned after node reboot. Pod-level runAsNonRoot:true would
-          # reads at /var/lib/asterisk/sounds/tts. World-readable (755) is
+          # otherwise inherit and chown would fail with EPERM (see Notes memory
-          # fine — Asterisk runs as a different uid in the other pod.
+          # feedback_hostpath_initcontainer_chown_perms).
          securityContext:
            runAsUser: 0
            runAsNonRoot: false
          command: ["sh", "-c", "chown -R 1654:1654 /data && chown 1654:1654 /shared-tts && chmod 0755 /shared-tts"]
          volumeMounts:
            - name: telephony-data
Author	SHA1	Message	Date
Codex	c1416c6968	feat(infra): retire fc-redis bootstrap after Redis Manager adoption	2026-05-12 02:22:47 -05:00
Codex	6e7d88db49	feat(fc-redis): add SignalR backplane for cross-product event bus (Q-SO-1 Phase A) Per Q-SO-1 operator resolution 2026-05-11 PM, Redis SignalR backplane lands in Phase A (was Phase C deferral). Treats Redis as a managed FC infrastructure component, not a deferred scaling escalation. Lands the minimal Phase A surface: - Namespace fc-redis - Single Redis 7-alpine pod with 1Gi Longhorn RWO PVC - ConfigMap with AOF persistence (everysec), 256Mi maxmemory, allkeys-lru - ClusterIP Service `redis.fc-redis.svc.cluster.local:6379` (in-cluster only) - No AUTH Phase A (Phase B add via 1Password Connect rotation) - No IngressRoute (backplane is server-to-server) Consumers (Phase A IMPL across FC services) add: services.AddSignalR().AddStackExchangeRedis( "redis.fc-redis.svc.cluster.local:6379", opts => opts.Configuration.ChannelPrefix = StackExchange.Redis.RedisChannel.Literal("fc-opsconsole")); Phase B/C follow-ons (not in this commit): Sentinel for HA, AUTH password from 1Password, redis_exporter sidecar for Prometheus, network policies. See FlowerCore.Notes/docs/signage/operations-console-phase-2-design.md section 3.5 (rewritten) and decisions-waiting.html Q-SO-1 (RESOLVED). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 19:02:58 -05:00
Codex	5ae50bd491	fix(telephony): init container runs as root to chown hostPath /tmp/tts-audio The fix-data-perms init container chowns /data (PVC) and /shared-tts (hostPath /tmp/tts-audio on rke2-agent1) to uid 1654 so the non-root telephony-web app can write Piper TTS .sln16 files. Without an explicit container-level securityContext override, the init container inherits pod-level runAsNonRoot:true / runAsUser:1654 and fails with 'chown: /shared-tts: Operation not permitted' the first time the hostPath comes up root-owned after a node reboot. Outage 2026-05-11 23:00 UTC: telephony-web in Init:CrashLoopBackOff for 9 hours (100+ restarts) until init container was bumped to runAsUser:0. Live cluster patched in the same operation; this commit makes the fix durable in git so ArgoCD sync preserves it. See Notes memory: feedback_hostpath_initcontainer_chown_perms Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 18:37:15 -05:00
Codex	653d4472f5	fix(monitoring): mirror Q-MR-3 MultusMemoryPressure + NamespacePendingPodBacklog alerts Two new preventive alert rules added to the kubernetes-state group of the K8s migration target ConfigMap. The live Podman Prometheus on noc1 has already been updated via FlowerCore.Notes/scripts/monitoring/alerts.yml + sudo cp + podman pod restart monitoring (this commit only locks it in the bluejay-infra K8s mirror so a future migration carries it forward). MultusMemoryPressure (critical, thermal_print): fires when kube-multus working set exceeds 80% of its memory limit for 5m. Catches the next multus OOM cascade BEFORE it kills the daemon cluster-wide. The 2026-05-10 21h outage hit because no alert fired on the rising multus working set; only downstream blackbox / Traefik / service alerts triggered, after the fact. NamespacePendingPodBacklog (warning): fires when any single namespace has >25 Pending pods sustained for 30m. Catches the operator-leak avalanche pattern (orphan pods from a crashed reconciler emitting children without ownerReferences) before it cascades into a CNI OOM. See FlowerCore.Notes: - feedback_multus_50mi_limit_oom_orphan_pod_avalanche - feedback_monitoring_k8s_target_vs_live_podman (workflow) Companion commits: - bluejay-infra@eb8693e (multus memory limit) - FlowerCore.RemoteDesktop@b02c59b (OwnerReferences fix) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 10:42:27 -05:00
Codex	eb8693e1ce	fix(multus): bump kube-multus-ds memory 50Mi/50Mi -> 1Gi/512Mi (prevent OOM cascade) Cluster outage 2026-05-10T17:43 through 2026-05-11 ~10:30 (~21h). Root cause: FlowerCore.RemoteDesktop emitted 219 orphan rd-browser-only-* pods in fc-desktop (missing OwnerReferences — see companion fix in FlowerCore.RemoteDesktop). Kubelet's continuous CNI ADD retries for those pending pods drove a request queue that exceeded the upstream default 50Mi limit on kube-multus-ds. Multus OOMKilled (exit 137), restarted with an even bigger backlog, OOMKilled again, positive feedback loop. Restart counts climbed to 276 / 412 / 261 across the 3 RKE2 nodes. Downstream blast radius: both Traefik pods stuck ContainerCreating (101m + 4h35m), all Longhorn CSI attacher/provisioner/instance-manager stuck, every Prometheus blackbox probe for *.iamworkin.lan failing, UpdateCenterPublicEdgeDown critical on update.flowercore.io, every ArgoCD app showing sync=Unknown because repo-server lost git connectivity. 45 firing Prometheus alerts. Recovery sequence (Q-MR-1 from FlowerCore.Notes morning routine): 1. kubectl patch kube-multus-ds memory live (this commit locks it in git so ArgoCD doesn't revert on next sync) 2. Force-delete the 219 orphan pods (kubectl --grace-period=0 --force) to break the avalanche 3. Rollout restart kube-multus-ds — STABLE after restart with new limit 4. Restart Traefik + Longhorn CSI to clear stuck ContainerCreating 5. Verify update.flowercore.io returns 200 + ArgoCD apps reconcile Tested incrementally: 256Mi limit was insufficient (still OOMed on catchup burst), 512Mi was insufficient on rke2-agent1 (most pods concentrated there), 1Gi/512Mi handled the full 200+ pending pod CNI catchup cleanly with 0 multus restarts after rollout. Nodes are 64GB with <25% used in steady-state, so the ~256Mi typical working-set is well within the new limit. Companion change: FlowerCore.RemoteDesktop must set OwnerReferences on every worker pod so future operator crashes don't leak orphans (Q-MR-2). Preventive alerts (Q-MR-3) MultusMemoryPressure + NamespacePendingPodBacklog are coming in a follow-up commit to apps/monitoring/. Memory: feedback_multus_50mi_limit_oom_orphan_pod_avalanche Decisions card: docs/dashboards/decisions-waiting.html Q-MR-1..3 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 10:30:05 -05:00
Codex	667777a653	revert(ci1): back to cdrom:scsi (virtio-blk disk hit QEMU flock) The virtio-blk disk swap (commit `84c9feb`) didn't help: qemu fails to acquire the write lock on the rootdisk PVC because the previous launcher's qemu process didn't release it cleanly. Same family of bug as the "stale QEMU flock" already documented in feedback_kubevirt_iso_first_install_bootorder_and_runstrategy, but now triggered on rke2-agent1 instead of agent2. OVMF cdrom timeout is the real blocker and remains open: - ✅ Distribution pipeline (build → save → scp → ctr import on all 3 RKE2 nodes) is proven. localhost/win-server-2025:1.0 lives in each node's containerd k8s.io namespace. - ✅ containerDisk + cdrom:scsi gets qemu domain Running (no NFS Permission denied, no rootdisk flock). - ❌ OVMF BdsDxe times out reading the SCSI cdrom regardless of SecureBoot setting and bus type. Reverting the disk type to cdrom:scsi so the VM lands back on the "qemu Running, OVMF stuck at Boot Manager" state — known-stable and easier to attack than the QEMU-flock state we hit by trying virtio-blk disk. Operator decision for next architectural step (one of): - Custom OVMF firmware build with longer Boot0001 timeout - KubeVirt version bump (v1.5+ has OVMF fixes) - Hyper-V/VirtualBox install + export VHD to ci1 - BIOS legacy boot (Win Server 2025 needs UEFI but install media has a BIOS path) - DataVolume HTTP datasource (CDI internalizes ISO bytes via different code path) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 21:35:00 -05:00
Codex	84c9feb893	fix(ci1): present ISO as virtio-blk disk instead of cdrom OVMF BdsDxe "starting Boot0001 ... Time out" persists across: - SATA cdrom + Longhorn Filesystem PVC (Path A) - SATA cdrom + Synology NFS (Path B failed: storage perms) - SCSI cdrom + Longhorn (Path B variant) - SCSI cdrom + containerDisk tmpfs (Path C) - + SecureBoot=false That rules out: storage IO speed, cdrom bus type, signature verification. Remaining cause is deeper in qemu's cdrom device emulation under KubeVirt v1.4.0's OVMF firmware — the cdrom read window for OVMF's first-sector probe is too short to satisfy from the cdrom controller path regardless of bus type. Workaround: present the ISO bytes as a regular virtio-blk DISK (not a cdrom). UEFI/OVMF still recognizes ISO9660 + El Torito boot records on any block device, so it can find and boot the EFI bootloader the same way it would from a USB stick. virtio-blk has a different read path that doesn't hit the cdrom-specific timeout. This also better aligns with the FlowerCore.Distribution USB-key pattern: ISO bytes on a block device, UEFI boots from the El Torito boot record, Windows installer takes over. The autounattend ConfigMap (ci1-autounattend) drives unattended Windows setup once the installer kicks off. The containerDisk OCI image (localhost/win-server-2025:1.0) remains unchanged — only the disk type in the VM spec changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 21:29:59 -05:00
Codex	427dbfcef2	[uc] Phase 1 auth gate deploy v20260509-4162dca-authgate	2026-05-08 21:16:54 -05:00
Codex	b651a4e2d0	fix(ci1): disable SecureBoot to allow OVMF to boot Windows ISO containerDisk delivery (commit `b998f50`) successfully gave qemu fast in-memory access to the ISO bytes (no NFS denial, no Longhorn read latency), but OVMF's BdsDxe still timed out: BdsDxe: loading Boot0001 "UEFI QEMU QEMU CD-ROM " from PciRoot(0x0)/Pci(0x2,0x4)/Pci(0x0,0x0)/Scsi(0x0,0x0) BdsDxe: starting Boot0001 ... Time out That rules out storage IO speed and bus type as causes (already tested both sata and scsi against both Longhorn-PVC and tmpfs-backed containerDisk). Remaining likely cause: SecureBoot signature verification on the ISO's EFI bootloader. KubeVirt's stock `/usr/share/OVMF/OVMF_VARS.secboot.fd` doesn't appear to ship with the Microsoft KEK/DB enrolled by default, so signed Windows EFI bootloaders fail the trust-chain check and OVMF reports a generic "Time out" rather than a verification failure. Disabling SecureBoot lets OVMF skip the chain check entirely and boot the El Torito EFI image. SMM stays enabled (KubeVirt only requires it WITH SecureBoot, not the inverse). TPM 2.0 emulation also stays on (`tpm: {}`), so BitLocker, Hyper-V, and WSL2 still work in the guest. This is acceptable for a CI runner. Long-term path back to SecureBoot: 1. Custom-build OVMF_VARS.fd with Microsoft KEK/DB pre-enrolled 2. Mount via firmware.bootloader.efi.persistent 3. secureBoot: true Tracked as a Phase 2 hardening task once the runner is doing real work and we want signed-boot guarantees. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 21:06:18 -05:00
Codex	b998f50f48	fix(ci1): switch ISO delivery to containerDisk OCI image (Path C) OCI image: localhost/win-server-2025:1.0 (8.27 GB) Built FROM scratch + ADD disk.img → /disk/disk.img on noc1, podman saved as tar (8.27 GB), SCP'd in parallel to all 3 RKE2 nodes, imported via ctr in k8s.io namespace. Verified present on all 3 schedulable nodes (rke2-server, rke2-agent1, rke2-agent2). Why containerDisk over the prior PVC paths: - Path A (Longhorn Filesystem PVC, sata): OVMF BdsDxe SATA-CDROM read timeout. Cdrom-backed PVC is too slow for OVMF's first-sector read window. - Path B (Synology NFS): uid 107 (qemu) denied at directory level by Synology export ACL despite file mode 0777. Memory: feedback_synology_iso_export_root_only_uid_107_denied. - Path B+SCSI: same OVMF timeout, just on SCSI controller. Bus choice was not load-bearing — the issue was always the slow PVC backing. - Path C (this commit): containerDisk delivers the ISO bytes from a tmpfs view of the OCI layer, no PVC controller in the read path. qemu reads at native FS speed; OVMF first-sector read completes well within timeout. This is also the KubeVirt-recommended pattern for installer ISOs. Connects to FlowerCore.Distribution / Provisioning USB story: same "OCI image of the OS installer + autounattend on a sysprep CDROM" pattern that the USB provisioning agent will use. The Windows install proceeds hands-off via the existing autounattend.xml in ci1-autounattend ConfigMap (RDP enabled, WinRM, UAC disabled, Administrator password from 1Password vault item h3ix4mgfk65gmkcmvh6ly3d3hu). Image lifecycle: bump tag (1.1, 1.2, ...) when ISO version changes, rebuild on noc1, redistribute to RKE2 nodes, update image: line. Legacy NFS PVC + PV manifest and CDI Longhorn PVC RETAINED for this commit so prior states are recoverable. Will prune in follow-up once containerDisk boot proves. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 20:45:38 -05:00
Codex	8fd9ae1cd3	fix(ci1): revert NFS Path B + flip ISO cdrom bus sata→scsi NFS Path B (commit `fc2aca0`) failed at storage layer: Synology export `/volume1/ISOs` denies non-root client UIDs at the directory level. qemu uid 107 cannot `ls /iso/` even though disk.img is mode 0777. Diagnosed via uid-107 + uid-0 busybox probe pods on rke2-agent2: - libvirt error: "Cannot access storage file ... Permission denied" (virStorageSourceReportBrokenChain:1281, virError Code=38 Domain=18) - uid 107 pod: "ls: can't open '/iso/': Permission denied" - uid 0 pod (same mount): "drwxrwxrwx 1 root root 16 ... disk.img" - SELinux Enforcing + virt_use_nfs=on, no AVC denials → not SELinux - File mode 0777 with owner 107:107 → not POSIX Same export-only-root pattern as `/volume1/kubernetes`. Memory: feedback_synology_iso_export_root_only_uid_107_denied.md Existing CDI-uploaded Longhorn PVC `windows-server-2025-iso` (10Gi Filesystem mode) verified to contain valid ISO bytes readable by uid 107 (mode 0660 root:107, 9.85 GB sparse, 8.27 GB blocks ≈ original 7.7 GB ISO). Reverting to it. The original OVMF SATA-CDROM read timeout that drove yesterday's NFS pivot is now addressed by `cdrom: bus: scsi` (virtio-scsi has a longer read window than the IDE/SATA emulator). Per user-prompt diagnostic chain Step 5. NFS PVC + PV (apps/kubevirt-vms/win2025-iso-nfs-pv.yaml) RETAINED so Path B state is recoverable; can be pruned in follow-up once SCSI boot is proven. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 18:54:36 -05:00