Compare commits
3 Commits
sprint40/c...sprint42/c
| Author | SHA1 | Date |
|---|---|---|
| | 6e581d2879 | |
| | ea73f00461 | |
| | 25ace30a03 | |
263
apps/fc-build-windows/README.md
Normal file
@@ -0,0 +1,263 @@
# fc-build-windows runner gate

Status: OPEN-WITH-OPERATOR-ACTION as of 2026-05-20.

This directory is intentionally not a live runner deployment. It records the
exact gate for bringing up the Windows self-hosted runner fleet without faking
capacity in GitHub or Kubernetes.

## Lane evidence

- `D:\git\FlowerCore\FlowerCore.Notes\docs\dashboards\decisions-waiting.html`
  lines 15078-15085: Q-MR-82 says the Updater Windows Sandbox E2E run is
  queued and `bluejay-ws-sandbox-1` is offline.
- `D:\git\FlowerCore\FlowerCore.Notes\memory\project_morning_routine_8_2026_05_20.md`:
  Morning Routine #8 carries Q-MR-82 as the fleet-wide Windows runner gap.
- `D:\git\FlowerCore\FlowerCore.Notes\docs\standards\sprint-37-codex-dispatch-log-2026-05-19.md`
  lines 76, 84-85, and 97: keep BLUEJAY-WS out of runner plans, merge Linux
  runner expansion separately, and keep true Windows-only workflows parked on
  the Windows runner host substrate path.
- `D:\git\FlowerCore\FlowerCore.Notes\docs\ai-agents\codex-prompts\2026-05-20-xxxxl-sprint-42-orchestrator-briefs.md`
  lane Cx-5: land a deployment only if a Windows runner image/substrate is
  ready; otherwise commit an operator-action gate.
- `D:\git\FlowerCore\FlowerCore.Notes\memory\feedback_bluejay_ws_never_a_github_runner.md`:
  BLUEJAY-WS is operator-only territory; Windows runners belong on a dedicated
  KubeVirt Windows VM such as `ci1` or a sibling VM.

## Live probe summary

Commands run on 2026-05-20 from `D:\git\FlowerCore\bluejay-infra`:

```powershell
$env:KUBECONFIG="$env:USERPROFILE\.kube\rke2.yaml"
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.kubernetes\.io/os}{"\n"}{end}'
```

Result: `rke2-agent1`, `rke2-agent2`, and `rke2-server` all report
`kubernetes.io/os=linux`. There is no Windows Kubernetes node, so Windows
containers on RKE2 cannot satisfy `fc-build-windows`.
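
The node check above can also be scripted. The following is a minimal Python sketch, assuming the JSON shape returned by `kubectl get nodes -o json`; the helper name `windows_nodes` and the sample payload are illustrative and not part of this repository.

```python
import json
from typing import List


def windows_nodes(nodes_json: str) -> List[str]:
    """Return the names of nodes labelled kubernetes.io/os=windows."""
    items = json.loads(nodes_json).get("items", [])
    return [
        item["metadata"]["name"]
        for item in items
        if item["metadata"].get("labels", {}).get("kubernetes.io/os") == "windows"
    ]


# Sample payload shaped like `kubectl get nodes -o json`, reduced to the
# fields used here. Node names and OS labels mirror the probe result above.
SAMPLE_NODES = json.dumps({
    "items": [
        {"metadata": {"name": name, "labels": {"kubernetes.io/os": "linux"}}}
        for name in ("rke2-agent1", "rke2-agent2", "rke2-server")
    ]
})

if __name__ == "__main__":
    found = windows_nodes(SAMPLE_NODES)
    # An empty list means there is no node that could schedule Windows containers.
    print("Windows nodes:", ", ".join(found) if found else "none")
```

In practice the input would come from piping the real `kubectl` output into the function rather than from the sample constant.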

```powershell
kubectl -n kubevirt-vms get vm,vmi,pods -o wide
```

Result: KubeVirt is healthy and `ci1` is `Running` / `Ready=True` on
`rke2-agent1` with VMI IP `10.42.103.35`.

```powershell
virtctl --kubeconfig $env:USERPROFILE\.kube\rke2.yaml port-forward vm/ci1.kubevirt-vms 15985:5985
```

Result during port tests: `dial tcp 10.42.103.35:5985: connect: no route to
host`. The same result was seen for RDP 3389 and SSH 22. The VM exists, but it
is not remotely reachable for runner bootstrap from this lane.
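
A reachability probe for the forwarded WinRM port might look like the sketch below. It assumes that the local end of a port-forward can accept a TCP connection before the guest side is dialed, so it waits for an actual HTTP answer instead of trusting the socket connect. The function name `winrm_http_answers` is hypothetical; port `15985` is the local port used in the command above, and `/wsman` is the standard WinRM endpoint path.

```python
import http.client
from typing import Optional


def winrm_http_answers(host: str = "127.0.0.1", port: int = 15985,
                       timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status if something answers HTTP on host:port, else None.

    Any status (including 401 or 405) proves that bytes travelled to a live
    HTTP listener and back. None means the connection was refused, reset,
    timed out, or returned something that is not HTTP.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        # An empty POST is enough to make an HTTP listener respond.
        conn.request("POST", "/wsman", body=b"")
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()


if __name__ == "__main__":
    status = winrm_http_answers()
    print("WinRM reachable, HTTP status:", status if status is not None else "no answer")
```

The same idea does not transfer directly to RDP, which is not HTTP; for port 3389 an RDP client connection through the forward is the more honest test.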

```powershell
gh api /repos/astoltz/FlowerCore.Updater/actions/runners `
  --jq '.runners[]? | {name,status,busy,labels:[.labels[].name]}'
gh run list --repo astoltz/FlowerCore.Updater `
  --workflow "Updater Windows Sandbox E2E" --limit 5
```

Result: GitHub has one Updater runner, `bluejay-ws-sandbox-1`, with
`status=offline`; run `26150689447` is still `queued`.

## Feasibility classification

### Option A: Windows containers on RKE2

Not feasible without physical infrastructure work by the operator. Kubernetes
Windows containers require a Windows node, and the current cluster has only
Linux RKE2 nodes.

### Option B: KubeVirt Windows VM

Partially present, not deployable from this lane.

`apps/kubevirt-vms/ci1.yaml` already defines a Windows Server 2025 KubeVirt VM
using `localhost/fc-win-server-2025:v1`, and the live VM is running. However:

- the guest is not reachable over RDP, WinRM, or SSH through `virtctl
  port-forward`;
- the current root disk is a `containerDisk`, which is ephemeral, so a runner
  installed inside the running guest does not survive a VM restart unless the
  first-boot automation re-registers on every boot or the VM is moved to a
  persistent PVC-backed disk;
- FC.Updater `Updater Windows Sandbox E2E` uses
  `[self-hosted, windows, windows-sandbox]`, while `fc-build-windows` build jobs
  use `[self-hosted, windows, fc-build-windows]`. Do not advertise
  `windows-sandbox` until Windows Sandbox has been proven in the guest.

### Option C: bluejay-ws-sandbox-1

Operator-only emergency fallback. GitHub shows it registered but offline. The
current memory note says BLUEJAY-WS must not be a fleet runner host, so this
lane does not start or re-register it. If the operator deliberately overrides
the policy to drain an emergency queue, start the existing visible runner
console from the BLUEJAY-WS desktop and treat that as temporary break-glass,
not the permanent Q-MR-82 closure.

## Operator action plan

### 1. Pick the Windows host class

Use `ci1` or a sibling Windows Server 2025 VM for WPF build/test jobs that need
`fc-build-windows`.

Use a Windows 11 Pro/Enterprise KubeVirt VM for Updater or WorldBuilder
Windows Sandbox gates, unless Windows Sandbox support is explicitly proven on
the selected guest. The workflow labels must match the real capability:

- WPF build runner: `self-hosted,windows,fc-build-windows,ci1`
- Sandbox runner: `self-hosted,windows,windows-sandbox,ci-sandbox1`
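
Why the label sets matter can be shown with a small sketch. GitHub only routes a job to a self-hosted runner that carries every label in the job's `runs-on` list, and labels are compared case-insensitively. The function name `runner_matches` is hypothetical; the label lists are the two sets named above.

```python
from typing import Iterable


def runner_matches(runs_on: Iterable[str], runner_labels: Iterable[str]) -> bool:
    """True when every label requested by the job is present on the runner."""
    wanted = {label.lower() for label in runs_on}
    have = {label.lower() for label in runner_labels}
    return wanted <= have


# Label sets proposed for the two runner classes.
WPF_RUNNER = ["self-hosted", "windows", "fc-build-windows", "ci1"]
SANDBOX_RUNNER = ["self-hosted", "windows", "windows-sandbox", "ci-sandbox1"]

# runs-on lists used by the two kinds of workflow.
BUILD_JOB = ["self-hosted", "windows", "fc-build-windows"]
SANDBOX_JOB = ["self-hosted", "windows", "windows-sandbox"]

if __name__ == "__main__":
    for job_name, job in (("build", BUILD_JOB), ("sandbox", SANDBOX_JOB)):
        for runner_name, runner in (("ci1", WPF_RUNNER), ("ci-sandbox1", SANDBOX_RUNNER)):
            print(f"{job_name} job on {runner_name}: {runner_matches(job, runner)}")
```

A build-only guest that advertised `windows-sandbox` would start receiving Sandbox E2E jobs it cannot run, which is the failure the labelling rule is meant to prevent.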

### 2. Make the VM reachable and durable

From BLUEJAY-WS:

```powershell
$env:KUBECONFIG="$env:USERPROFILE\.kube\rke2.yaml"
kubectl -n kubevirt-vms get vm,vmi,pods -o wide
virtctl --kubeconfig $env:KUBECONFIG vnc ci1 -n kubevirt-vms
virtctl --kubeconfig $env:KUBECONFIG port-forward vm/ci1.kubevirt-vms 13389:3389
virtctl --kubeconfig $env:KUBECONFIG port-forward vm/ci1.kubevirt-vms 15985:5985
```

Before runner registration, fix the current port-forward failure. The expected
state is that RDP or WinRM accepts a connection through the control plane.

For durability, either:

- move the runner VM to a persistent PVC-backed root disk; or
- keep `containerDisk` and bake first-boot runner registration into the sysprep
  flow using a non-expiring credential lookup path.

Do not install a runner by hand into a transient VM and call Q-MR-82 closed.

### 3. Install runner prerequisites inside the VM

Run in an elevated PowerShell session in the Windows runner guest:

```powershell
winget install Microsoft.DotNet.SDK.10 --silent
winget install Microsoft.DotNet.DesktopRuntime.8 --silent
winget install Microsoft.PowerShell --silent
winget install Git.Git --silent
winget install Microsoft.VisualStudio.2022.BuildTools --silent
winget install Google.Chrome --silent
```

For a Sandbox-capable runner only:

```powershell
Enable-WindowsOptionalFeature -Online -FeatureName Containers-DisposableClientVM -All
Restart-Computer -Force
```

After reboot:

```powershell
Get-CimInstance -ClassName Win32_OptionalFeature -Filter "Name='Containers-DisposableClientVM'"
Test-Path C:\Windows\System32\WindowsSandbox.exe
```

### 4. Register repo-scoped GitHub runners

The `astoltz` account uses repo-scoped runners. Generate a fresh one-hour
registration token per repo immediately before `config.cmd`.

From a trusted operator shell with `gh` authenticated:

```powershell
$repos = @(
  "FlowerCore.Updater",
  "FlowerCore.WorldBuilder",
  "FlowerCore.DeviceManagement"
)

foreach ($repo in $repos) {
  $token = gh api -X POST "/repos/astoltz/$repo/actions/runners/registration-token" --jq .token
  $repoSlug = $repo.ToLowerInvariant().Replace("flowercore.", "").Replace(".", "-")
  $runnerDir = "C:\fc-ghr\$repoSlug-fc-build-windows"

  New-Item -ItemType Directory -Force -Path $runnerDir | Out-Null
  Set-Location $runnerDir

  if (-not (Test-Path ".\config.cmd")) {
    Invoke-WebRequest `
      -Uri "https://github.com/actions/runner/releases/download/v2.323.0/actions-runner-win-x64-2.323.0.zip" `
      -OutFile "actions-runner.zip"
    Add-Type -AssemblyName System.IO.Compression.FileSystem
    [System.IO.Compression.ZipFile]::ExtractToDirectory((Resolve-Path actions-runner.zip), $runnerDir)
  }

  .\config.cmd `
    --url "https://github.com/astoltz/$repo" `
    --token $token `
    --name "ci1-$repoSlug-fc-build-windows" `
    --labels "self-hosted,windows,fc-build-windows,ci1" `
    --work "_work" `
    --unattended `
    --replace `
    --runasservice

  Get-Service "actions.runner.*"  # the Windows package has no svc.ps1; --runasservice installs and starts the service
}
```
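
To preview the names the loop above would produce, the slug rule can be mirrored in a few lines. This is an illustrative Python sketch of the same string transformation, not part of the registration flow; `repo_slug` and `runner_identity` are hypothetical helper names.

```python
from typing import Dict

REPOS = ["FlowerCore.Updater", "FlowerCore.WorldBuilder", "FlowerCore.DeviceManagement"]


def repo_slug(repo: str) -> str:
    """Lowercase the repo name, drop the 'flowercore.' prefix, turn dots into dashes."""
    return repo.lower().replace("flowercore.", "").replace(".", "-")


def runner_identity(repo: str, host: str = "ci1", role: str = "fc-build-windows") -> Dict[str, str]:
    """Return the runner name and install directory derived from a repo name."""
    slug = repo_slug(repo)
    return {
        "name": f"{host}-{slug}-{role}",
        "dir": "C:\\fc-ghr\\" + f"{slug}-{role}",
    }


if __name__ == "__main__":
    for repo in REPOS:
        ident = runner_identity(repo)
        print(f"{repo}: name={ident['name']} dir={ident['dir']}")
```

Listing the derived names before registering makes collisions visible early, since `--replace` would silently take over an existing runner with the same name.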

For Updater Sandbox E2E, register only after the guest proves Sandbox support,
and use `windows-sandbox` labels:

```powershell
$token = gh api -X POST "/repos/astoltz/FlowerCore.Updater/actions/runners/registration-token" --jq .token
.\config.cmd `
  --url "https://github.com/astoltz/FlowerCore.Updater" `
  --token $token `
  --name "ci-sandbox1-updater" `
  --labels "self-hosted,windows,windows-sandbox,ci-sandbox1" `
  --work "_work" `
  --unattended `
  --replace
```

Keep registration tokens out of Git and logs. The durable credential source for
automation should be the existing 1Password item named `GitHub PAT (Runner
Registration)`, used only to mint short-lived repo registration tokens.
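
Because registration tokens are short-lived, automation may want to confirm a token still has time left before calling `config.cmd`. The sketch below assumes the registration-token response includes an `expires_at` ISO 8601 timestamp alongside `token`; the helper name `token_is_fresh` and the five-minute safety margin are illustrative choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def token_is_fresh(expires_at: str,
                   now: Optional[datetime] = None,
                   min_remaining: timedelta = timedelta(minutes=5)) -> bool:
    """True when the token expires at least `min_remaining` after `now`."""
    # datetime.fromisoformat does not accept a trailing 'Z' on older Pythons.
    expiry = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
    current = now if now is not None else datetime.now(timezone.utc)
    return expiry - current >= min_remaining


if __name__ == "__main__":
    # Hypothetical timestamps used only to demonstrate the check.
    minted = datetime(2026, 5, 20, 12, 0, tzinfo=timezone.utc)
    print(token_is_fresh("2026-05-20T13:00:00Z", now=minted))
    print(token_is_fresh("2026-05-20T13:00:00Z", now=minted + timedelta(minutes=58)))
```

The function only inspects the timestamp, so the token value itself never needs to be printed or logged.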

### 5. Verify GitHub and workflow pickup

```powershell
gh api /repos/astoltz/FlowerCore.Updater/actions/runners `
  --jq '.runners[] | select(.labels[].name == "windows-sandbox") | {name,status,busy,labels:[.labels[].name]}'

gh api /repos/astoltz/FlowerCore.DeviceManagement/actions/runners `
  --jq '.runners[] | select(.labels[].name == "fc-build-windows") | {name,status,busy,labels:[.labels[].name]}'

gh run list --repo astoltz/FlowerCore.Updater `
  --workflow "Updater Windows Sandbox E2E" --limit 3
```

Q-MR-82 can be marked resolved only after the Updater run moves from `queued` to
`in_progress` or `completed` on an online runner, or after the affected WPF
build repos show online `fc-build-windows` repo-scoped runners and their queued
jobs start.
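
The Updater half of that closure rule can be expressed as a small check. This is a sketch under stated assumptions: the runner dictionaries follow the `{name,status,busy,labels}` shape projected by the `--jq` filters above, the function name `updater_gate_satisfied` is hypothetical, and the labels on the sample runner are assumed because the probe did not record them.

```python
from typing import Dict, List


def updater_gate_satisfied(runners: List[Dict], run_status: str,
                           label: str = "windows-sandbox") -> bool:
    """True when an online runner carries `label` and the run has left the queue."""
    online_with_label = [
        runner for runner in runners
        if runner.get("status") == "online"
        and label in (name.lower() for name in runner.get("labels", []))
    ]
    return bool(online_with_label) and run_status in ("in_progress", "completed")


# State recorded on 2026-05-20: one offline runner, run still queued.
# The label list is an assumption made for this example.
RECORDED_RUNNERS = [{
    "name": "bluejay-ws-sandbox-1",
    "status": "offline",
    "busy": False,
    "labels": ["self-hosted", "windows", "windows-sandbox"],
}]

if __name__ == "__main__":
    print("Gate satisfied:", updater_gate_satisfied(RECORDED_RUNNERS, "queued"))
```

The WPF build repos would need an equivalent check with `label="fc-build-windows"` against each repo's runner list and queued jobs.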

## Break-glass BLUEJAY-WS command

Only if the operator explicitly overrides the "BLUEJAY-WS is not a runner"
policy to drain a queue:

```powershell
Set-Location C:\fc-ghr\updater-sandbox
.\run.cmd
```

If the runner is installed as a Windows service instead:

```powershell
Get-Service 'actions.runner.*'
Start-Service 'actions.runner.*'
```

This does not close Q-MR-82 permanently. It is a temporary queue drain until a
dedicated VM runner is online.
4
apps/fc-build-windows/kustomization.yaml
Normal file
@@ -0,0 +1,4 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - operator-gate-configmap.yaml
61
apps/fc-build-windows/operator-gate-configmap.yaml
Normal file
@@ -0,0 +1,61 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: fc-build-windows-operator-gate
  namespace: kubevirt-vms
  labels:
    app.kubernetes.io/name: fc-build-windows
    app.kubernetes.io/component: operator-gate
    app.kubernetes.io/part-of: github-runner
    flowercore.io/q-card: Q-MR-82
  annotations:
    flowercore.io/outcome: OPEN-WITH-OPERATOR-ACTION
    flowercore.io/live-runner: "false"
data:
  outcome: OPEN-WITH-OPERATOR-ACTION
  gate.md: |
    Do not treat this ConfigMap as runner capacity.

    Current probe, 2026-05-20:
    - RKE2 nodes are linux-only; Windows containers require a Windows node.
    - KubeVirt `ci1` is Running/Ready, but RDP 3389, WinRM 5985, and SSH 22
      through `virtctl port-forward` return `connect: no route to host`.
    - GitHub Updater runner list has only `bluejay-ws-sandbox-1`, status
      offline. Updater Windows Sandbox E2E run 26150689447 remains queued.

    Required operator action:
    1. Make a dedicated Windows VM reachable and durable.
    2. Install .NET 10 SDK, .NET 8 Desktop Runtime, Git, VS Build Tools, and
       PowerShell 7.
    3. Register repo-scoped runners with short-lived GitHub registration tokens.
    4. Add `fc-build-windows` labels only to WPF build-capable guests.
    5. Add `windows-sandbox` labels only after Sandbox support is proven.
  registration-token-pattern.ps1: |
    $repo = "FlowerCore.Updater"
    $token = gh api -X POST "/repos/astoltz/$repo/actions/runners/registration-token" --jq .token
    $runnerDir = "C:\fc-ghr\updater-fc-build-windows"

    New-Item -ItemType Directory -Force -Path $runnerDir | Out-Null
    Set-Location $runnerDir

    # Install the Actions runner package here if config.cmd is absent.
    .\config.cmd `
      --url "https://github.com/astoltz/$repo" `
      --token $token `
      --name "ci1-updater-fc-build-windows" `
      --labels "self-hosted,windows,fc-build-windows,ci1" `
      --work "_work" `
      --unattended `
      --replace `
      --runasservice

    Get-Service "actions.runner.*"  # the Windows package has no svc.ps1; --runasservice installs and starts the service
  verification.ps1: |
    gh api /repos/astoltz/FlowerCore.Updater/actions/runners `
      --jq '.runners[] | {name,status,busy,labels:[.labels[].name]}'

    gh run list --repo astoltz/FlowerCore.Updater `
      --workflow "Updater Windows Sandbox E2E" --limit 3

    $env:KUBECONFIG="$env:USERPROFILE\.kube\rke2.yaml"
    kubectl -n kubevirt-vms get vm,vmi,pods -o wide
@@ -1,33 +0,0 @@
# Explicit ArgoCD Application shape for bootstrap/review.
#
# The live bluejay-infra ApplicationSet already discovers apps/* directories
# and creates this same Application name (`infra-fc-devicemgmt`) automatically.
# Keep repoURL on the internal Gitea ClusterIP URL; ArgoCD does not trust the
# external step-ca HTTPS endpoint.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: infra-fc-devicemgmt
  namespace: argocd
  labels:
    app.kubernetes.io/name: fc-devicemgmt
    app.kubernetes.io/part-of: flowercore
    app.kubernetes.io/managed-by: argocd
    flowercore.io/tenant-id: system
    flowercore.io/created-by: bluejay-infra
spec:
  project: default
  source:
    repoURL: http://gitea-clusterip.gitea.svc.cluster.local:3000/bluejay/bluejay-infra.git
    targetRevision: main
    path: apps/fc-devicemgmt
  destination:
    server: https://kubernetes.default.svc
    namespace: fc-devicemgmt
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
@@ -656,15 +656,14 @@ data:
     summary: "Print queue backlog on edge2 ({{ $value }} active jobs)"
     description: "CUPS has {{ $value }} active jobs queued. Possible printer jam, USB disconnect, or paper out."
 
-# Printer hardware and paper-roll lifecycle alerts.
-# print_printer_online: 1 when the transport is reachable/selected.
-# print_printer_state enum: 0 unknown, 1 online, 2 offline,
-# 3 paper_depleted, 4 jam, 5 head_error, 6 cover_open.
-# Offline/jam/cover alerts stay IRC-only. Paper depleted and head
-# error may route to the thermal digest only when the printer is
-# online enough to make that useful.
+# Paper roll lifecycle alerts (XL Track I, 2026-04-26).
+# Source-of-truth gauge: print_paper_remaining_percent (Print.Web OTEL,
+# hydrated on startup from the active PaperRoll row).
+# alert_channel=thermal_print routes through irc-notify -> Print.Web
+# /api/print/alert so the printer announces its own paper-out warning
+# on its remaining paper. Self-referential humor + operator nudge.
 - alert: PrintPaperRollLow
-  expr: (print_paper_remaining_percent{job="printweb-otel"} < 10 and print_paper_remaining_percent{job="printweb-otel"} > 5) and print_printer_online{job="printweb-otel"} == 1
+  expr: print_paper_remaining_percent{job="printweb-otel"} < 10 and print_paper_remaining_percent{job="printweb-otel"} > 5
   for: 5m
   labels:
     severity: warning
@@ -673,59 +672,15 @@ data:
     summary: "Print roll low on edge2 ({{ $value | printf \"%.1f\" }}% remaining)"
     description: "NuPrint 210 paper roll has {{ $value | printf \"%.1f\" }}% remaining. Operator should load a fresh roll soon. Run /api/paper/status for the precise mm + estimated jobs left."
 
-- alert: PrinterOfflineWarning
-  expr: print_printer_state{job="printweb-otel"} == 2
-  for: 2m
-  labels:
-    severity: warning
-    service: print-web
-    alert_channel: irc
-  annotations:
-    summary: "Print.Web printer offline on edge2"
-    description: "Print.Web reports the NuPrint 210 transport is offline or unreachable. IRC-only by design: do not thermal-print an alert when the thermal printer itself is offline."
-
 - alert: PrintPaperRollCritical
-  expr: print_printer_state{job="printweb-otel"} == 3 and print_printer_online{job="printweb-otel"} == 1
+  expr: print_paper_remaining_percent{job="printweb-otel"} <= 5
   for: 2m
   labels:
     severity: critical
     alert_channel: thermal_print
   annotations:
-    summary: "Print paper depleted on edge2"
-    description: "NuPrint 210 reports paper depleted while the printer is still online. Load a new roll, drain the hardware buffer if needed, then replay DeadLetter jobs from /print-log."
+    summary: "Print roll critical on edge2 ({{ $value | printf \"%.1f\" }}% remaining)"
+    description: "NuPrint 210 paper roll at {{ $value | printf \"%.1f\" }}% — load a new roll NOW. The 50ft roll has a ~12% red-stripe zone; once paper passes that, the printer can run dry mid-job."
 
-- alert: PrinterJamWarning
-  expr: print_printer_state{job="printweb-otel"} == 4
-  for: 2m
-  labels:
-    severity: warning
-    service: print-web
-    alert_channel: irc
-  annotations:
-    summary: "Print.Web printer jam on edge2"
-    description: "Print.Web reports a paper/cutter jam state. IRC-only: clear the jam, drain the hardware buffer if bytes were queued, then retry affected jobs."
-
-- alert: PrinterHeadErrorCritical
-  expr: print_printer_state{job="printweb-otel"} == 5
-  for: 2m
-  labels:
-    severity: critical
-    service: print-web
-    alert_channel: thermal_print
-  annotations:
-    summary: "Print.Web printer head error on edge2"
-    description: "Print.Web reports a thermal head or unrecoverable printer error. Critical routing may enter the thermal digest per existing policy; IRC remains the primary triage stream."
-
-- alert: PrinterCoverOpenWarning
-  expr: print_printer_state{job="printweb-otel"} == 6
-  for: 2m
-  labels:
-    severity: warning
-    service: print-web
-    alert_channel: irc
-  annotations:
-    summary: "Print.Web printer cover open on edge2"
-    description: "Print.Web reports the printer cover/lid is open. IRC-only: close the cover and verify /api/print/status before retrying jobs."
-
 - alert: PrintJobDeadLetter
   expr: increase(print_jobs_dead_letter_total[15m]) > 0
@@ -3680,146 +3635,6 @@ data:
           relativeTimeRange: {from: 120, to: 0}
           datasourceUid: __expr__
           model: {type: threshold, expression: B, conditions: [{evaluator: {params: [600], type: gt}}], refId: C}
-- orgId: 1
-  name: Print Services
-  folder: Print Alerts
-  interval: 1m
-  rules:
-    - uid: printer-offline-warning
-      title: PrinterOfflineWarning
-      condition: C
-      for: 2m
-      noDataState: OK
-      execErrState: OK
-      annotations:
-        summary: "Print.Web printer offline on edge2"
-        description: "Print.Web reports the NuPrint 210 transport is offline or unreachable. IRC-only by design: do not thermal-print an alert when the thermal printer itself is offline."
-        runbook: "1. Check edge2 power/network 2. Check USB/CUPS queue 3. Open https://print.iamworkin.lan/admin 4. Do not force thermal routing for offline alerts."
-      labels:
-        severity: warning
-        service: print-web
-        alert_channel: irc
-      data:
-        - refId: A
-          relativeTimeRange: {from: 120, to: 0}
-          datasourceUid: prometheus
-          model: {expr: 'print_printer_state{job="printweb-otel"} == 2', instant: true, refId: A}
-        - refId: B
-          relativeTimeRange: {from: 120, to: 0}
-          datasourceUid: __expr__
-          model: {type: reduce, expression: A, reducer: last, refId: B}
-        - refId: C
-          relativeTimeRange: {from: 120, to: 0}
-          datasourceUid: __expr__
-          model: {type: threshold, expression: B, conditions: [{evaluator: {params: [0], type: gt}}], refId: C}
-    - uid: print-paper-roll-critical
-      title: PrintPaperRollCritical
-      condition: C
-      for: 2m
-      noDataState: OK
-      execErrState: OK
-      annotations:
-        summary: "Print paper depleted on edge2"
-        description: "NuPrint 210 reports paper depleted while the printer is still online. Load a new roll, drain the hardware buffer if needed, then replay DeadLetter jobs from /print-log."
-        runbook: "1. Load a fresh roll 2. Drain the hardware buffer if paper-out happened mid-job 3. Open https://print.iamworkin.lan/print-log 4. Retry DeadLetter jobs after the state clears."
-      labels:
-        severity: critical
-        service: print-web
-        alert_channel: thermal_print
-      data:
-        - refId: A
-          relativeTimeRange: {from: 120, to: 0}
-          datasourceUid: prometheus
-          model: {expr: 'print_printer_state{job="printweb-otel"} == 3 and print_printer_online{job="printweb-otel"} == 1', instant: true, refId: A}
-        - refId: B
-          relativeTimeRange: {from: 120, to: 0}
-          datasourceUid: __expr__
-          model: {type: reduce, expression: A, reducer: last, refId: B}
-        - refId: C
-          relativeTimeRange: {from: 120, to: 0}
-          datasourceUid: __expr__
-          model: {type: threshold, expression: B, conditions: [{evaluator: {params: [0], type: gt}}], refId: C}
-    - uid: printer-jam-warning
-      title: PrinterJamWarning
-      condition: C
-      for: 2m
-      noDataState: OK
-      execErrState: OK
-      annotations:
-        summary: "Print.Web printer jam on edge2"
-        description: "Print.Web reports a paper/cutter jam state. IRC-only: clear the jam, drain the hardware buffer if bytes were queued, then retry affected jobs."
-        runbook: "1. Clear paper/cutter path 2. Drain hardware buffer if CUPS queued bytes 3. Verify /api/print/status 4. Retry affected jobs."
-      labels:
-        severity: warning
-        service: print-web
-        alert_channel: irc
-      data:
-        - refId: A
-          relativeTimeRange: {from: 120, to: 0}
-          datasourceUid: prometheus
-          model: {expr: 'print_printer_state{job="printweb-otel"} == 4', instant: true, refId: A}
-        - refId: B
-          relativeTimeRange: {from: 120, to: 0}
-          datasourceUid: __expr__
-          model: {type: reduce, expression: A, reducer: last, refId: B}
-        - refId: C
-          relativeTimeRange: {from: 120, to: 0}
-          datasourceUid: __expr__
-          model: {type: threshold, expression: B, conditions: [{evaluator: {params: [0], type: gt}}], refId: C}
-    - uid: printer-head-error-critical
-      title: PrinterHeadErrorCritical
-      condition: C
-      for: 2m
-      noDataState: OK
-      execErrState: OK
-      annotations:
-        summary: "Print.Web printer head error on edge2"
-        description: "Print.Web reports a thermal head or unrecoverable printer error. Critical routing may enter the thermal digest per existing policy; IRC remains the primary triage stream."
-        runbook: "1. Let the printer cool if overheated 2. Power-cycle only after checking queued jobs 3. Verify /api/print/status 4. Retry jobs after the state clears."
-      labels:
-        severity: critical
-        service: print-web
-        alert_channel: thermal_print
-      data:
-        - refId: A
-          relativeTimeRange: {from: 120, to: 0}
-          datasourceUid: prometheus
-          model: {expr: 'print_printer_state{job="printweb-otel"} == 5', instant: true, refId: A}
-        - refId: B
-          relativeTimeRange: {from: 120, to: 0}
-          datasourceUid: __expr__
-          model: {type: reduce, expression: A, reducer: last, refId: B}
-        - refId: C
-          relativeTimeRange: {from: 120, to: 0}
-          datasourceUid: __expr__
-          model: {type: threshold, expression: B, conditions: [{evaluator: {params: [0], type: gt}}], refId: C}
-    - uid: printer-cover-open-warning
-      title: PrinterCoverOpenWarning
-      condition: C
-      for: 2m
-      noDataState: OK
-      execErrState: OK
-      annotations:
-        summary: "Print.Web printer cover open on edge2"
-        description: "Print.Web reports the printer cover/lid is open. IRC-only: close the cover and verify /api/print/status before retrying jobs."
-        runbook: "1. Close the printer cover 2. Verify /api/print/status returns online 3. Retry affected jobs only after the state clears."
-      labels:
-        severity: warning
-        service: print-web
-        alert_channel: irc
-      data:
-        - refId: A
-          relativeTimeRange: {from: 120, to: 0}
-          datasourceUid: prometheus
-          model: {expr: 'print_printer_state{job="printweb-otel"} == 6', instant: true, refId: A}
-        - refId: B
-          relativeTimeRange: {from: 120, to: 0}
-          datasourceUid: __expr__
-          model: {type: reduce, expression: A, reducer: last, refId: B}
-        - refId: C
-          relativeTimeRange: {from: 120, to: 0}
-          datasourceUid: __expr__
-          model: {type: threshold, expression: B, conditions: [{evaluator: {params: [0], type: gt}}], refId: C}
 - orgId: 1
   name: CI Runners
   folder: CI Alerts
@@ -304,7 +304,7 @@ public sealed class FleetManifestLintTests
     }
 
     [Fact]
-    public void Monitoring_MustIncludeRequiredAlertRoutingGuards()
+    public void Monitoring_MustAlertWhenLinuxRunnerDeploymentIsUnavailable()
     {
         var monitoring = File.ReadAllText(Path.Combine(Inventory.BluejayRoot, "apps", "monitoring", "noc-monitoring.yaml"));
 
@@ -315,15 +315,6 @@ public sealed class FleetManifestLintTests
         monitoring.Should().Contain("folder: CI Alerts");
         monitoring.Should().Contain("uid: linux-runner-offline");
         monitoring.Should().Contain("alert_channel: irc");
-
-        monitoring.Should().Contain("PrinterOfflineWarning");
-        monitoring.Should().Contain("expr: print_printer_state{job=\"printweb-otel\"} == 2");
-        monitoring.Should().Contain("IRC-only by design: do not thermal-print an alert when the thermal printer itself is offline.");
-        monitoring.Should().Contain("PrintPaperRollCritical");
-        monitoring.Should().Contain("expr: print_printer_state{job=\"printweb-otel\"} == 3 and print_printer_online{job=\"printweb-otel\"} == 1");
-        monitoring.Should().Contain("PrinterJamWarning");
-        monitoring.Should().Contain("PrinterHeadErrorCritical");
-        monitoring.Should().Contain("PrinterCoverOpenWarning");
     }
 
     [Fact]