Stages a draft VirtualMachine + Namespace + ISO PVC + rootdisk PVC + sysprep
ConfigMap for the dedicated GitHub Actions self-hosted runner that replaces
the never-registered bluejay-ws-sandbox-1 placeholder.
Status: STAGED ONLY. spec.running = false. ISO PVC empty. Two operator
decisions still pending before this can boot:
1. Network choice — pod-network fallback (in this draft) vs Multus +
PROD VLAN NAD (preferred, requires Multus install).
2. ISO path — manual upload via helper pod (Path A) vs CDI HTTP import
(Path B, requires CDI install).
Cluster baseline 2026-05-08:
- KubeVirt operator: installed, healthy, 14d
- CDI: NOT installed
- Multus: NOT installed
- Calico-only CNI
See docs/infrastructure/windows-server-build-runner-plan.md "Phase 1 readiness
gate" for the full operator pickup checklist.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three Certificates requested duration: 2160h (90d) with renewBefore: 720h
(30d). step-ca's ACME provisioner caps cert lifetime at 30d, so it silently
issued 720h certs — making renewBefore EQUAL to the actual cert lifetime.
cert-manager treats the cert as needing immediate renewal the moment it's
issued, creates a CertificateRequest, gets a new (still 30d) cert, marks
it for immediate renewal, and loops.
Damage on 2026-05-07 ~20:30 (caught during regroup after 5h gap):
- fc-worldbuilder/worldbuilder-web-tls: 2365 CRs in 18h
- fc-distribution/fc-distribution-tls: 10880 CRs in 18h
- knowledge/knowledge-tls: 10888 CRs in 18h
Total: 24,133 stale CertificateRequest objects in etcd.
Bulk-deleted all CRs + Orders in those 3 namespaces, then this commit
fixes the source so ArgoCD sync stops re-creating the loop.
Fix: match the working 720h/240h pattern used by every other FC service
cert (agent-zero, fc-dns, fc-llm-bridge, fc-php, traefik-system, etc.).
30d cert lifetime + 10d renewal headroom = renewal at day 20, which is
the cert-manager standard 2/3-of-lifetime practice.
Side effect during loop: ALSO contributed to step-ca load and may have
caused intermittent timeouts cluster-wide (the latest stuck challenge
was timing out dialing step-ca:9443 even though step-ca itself was up).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captured during 2026-05-07 regroup audit. selenium-netpol was applied via
raw `kubectl apply` to the cluster on 2026-03-15 with no source-of-truth
file anywhere — neither in bluejay-infra nor in any FC service repo. A
cluster rebuild from bluejay-infra would have lost it entirely (including
the Selenium Grid → Traefik VIP allow rule that gates AAT runs against
*.iamworkin.lan services).
Captured byte-for-byte from `kubectl get netpol -n selenium selenium-netpol
-o yaml`. ServerSideApply via ArgoCD will adopt the existing resource
without recreation.
The Selenium Grid Deployment + Services themselves are still managed
outside ArgoCD (deployed via raw kubectl from the original bring-up).
Migrating those into bluejay-infra is a separate lane — this commit only
restores GitOps repeatability for the NetworkPolicy.
See feedback_networkpolicies_belong_in_bluejay_infra.md for the canonical
pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Repeatability gap caught during 2026-05-07 morning regroup. The four
fc-desktop NetworkPolicies (desktop-isolation, fc-desktop-default-deny,
remotedesktop-web-isolation, cm-acme-http-solver-allow) were applied via
FlowerCore.RemoteDesktop/scripts/deploy-web.sh `kubectl apply` calls.
That meant a fresh cluster rebuild from bluejay-infra alone would miss
all of them — Browser Lab session isolation, control-plane allow-list,
and HTTP-01 cert renewal would silently fail to come up.
Canonical FC GitOps pattern is for NetworkPolicies to live alongside
other resources in bluejay-infra. Verified by audit: 6 of 11 cluster
NetworkPolicies (agent-zero, edge2-services, monitoring, noc-services,
telephony, voice) already follow this pattern. fc-desktop was the
outlier; selenium-netpol is also unmanaged and tracked separately.
Source-of-truth split (now documented in fc-desktop.yaml):
- bluejay-infra OWNS: Certificate + IngressRoute + all NetworkPolicies.
- FlowerCore.RemoteDesktop scripts/deploy-web.sh OWNS: Deployment +
Service ONLY (because `localhost/fc-desktop:linux-xfce` image refs
require manual ctr import on each node — Deployment in bluejay-infra
would race the image-import step).
Follow-up commits in FlowerCore.RemoteDesktop will:
- Remove the now-duplicate k8s/{networkpolicy,namespace-default-deny,
web-networkpolicy,acme-http01-solver-allow}.yaml files.
- Drop the 3 `kubectl_apply_file` lines from scripts/deploy-web.sh.
The 4 NPs in this commit are byte-for-byte identical to what's running in
the cluster today (verified via kubectl get -o yaml diff). ServerSideApply
in the bluejay-infra ApplicationSet will adopt the existing resources
without recreating them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ArgoCD's directory-driven manifest parser scans *.yaml AND *.json by
default. Bare Grafana dashboard JSONs (no apiVersion/kind/metadata)
poisoned manifest generation for the entire infra-monitoring Application
("Object 'Kind' is missing in <dashboard JSON>"), leaving sync state
Unknown.
These files are SOURCE for the file-provisioning path on noc1
(/opt/monitoring/grafana/dashboards/) and also get inlined into ConfigMap
wrappers like grafana-dashboard-remotedesktop.yaml. They are NOT K8s
manifests and shouldn't be in ArgoCD's manifest tree.
.argocdignore is for repo-level GitOps source eligibility, not for
filtering manifests within a directory-mode Application — the cleanest
fix is the *.txt extension that ArgoCD's parser skips.
Reverts the .argocdignore from the previous commit (didn't take effect).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fc-updater PVC: bump updatecenter-data storage 10Gi → 25Gi.
The cluster PVC was already manually expanded to 20Gi to fit Mike Bundle
(~5.1 GiB) plus the LocalFsBundleStore.MaxTotalBytes soft cap of 25 GiB
(see project_uc_remaining_4_apps_signed_2026_05_06). PVCs cannot shrink,
so ArgoCD couldn't sync the smaller git value (OutOfSync, retried 5x with
"field can not be less than status.capacity"). Setting git to 25Gi gives
headroom matching the soft cap.
monitoring .argocdignore: skip bare dashboard JSON files.
Both fc-updatecenter-dashboard.json and flowercore-remotedesktop-grafana-
dashboard.json live in apps/monitoring/ as source-of-truth for file-
provisioning to noc1's /opt/monitoring/grafana/dashboards/. ArgoCD's
manifest parser tries to unmarshal every file and chokes on bare dashboard
JSON ("Object 'Kind' is missing"), which then poisoned the whole
infra-monitoring Application status (Unknown sync, no comparison possible).
The .argocdignore tells ArgoCD to skip *.json — actual K8s deploys happen
via ConfigMap wrappers like grafana-dashboard-remotedesktop.yaml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WorldBuilder live runtime promotion lands in the Intranet at
/services/world-builder + ServiceRegistry homepage tile.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4delta server-side HTML overlay enrichment landed in
FlowerCore.TtsReader@8f23e15 (master @6091618). Adds 9-pass enrichment +
SQLite-backed cache + 4 REST endpoints (/api/v1/enrich/{html,jsonld,both,passes})
+ RenderRequest.sourceJsonLd. Tests 476 -> 522 (+46). Image already imported
to all RKE2 nodes via deploy.sh; this bumps the bluejay-infra-managed tag so
ArgoCD reconciles the live deployment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cluster CPU-request budget at 99% on all 3 RKE2 nodes at deploy time.
0/3 nodes available; "3 Insufficient cpu". Actual CPU usage on the
nodes is 10/52/19%, so the cluster is request-overprovisioned but has
plenty of real headroom. Idle Blazor + SignalR + ComfyUI poller is ~5m.
25m unblocks scheduling and stays generous for expected runtime.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror of FlowerCore.Notes/scripts/monitoring/alerts.yml fc-signage-marquee
group into the K8s migration target apps/monitoring/noc-monitoring.yaml so
that future migration of the noc1 Podman monitoring stack into RKE2
inherits the marquee alert ruleset automatically.
Three rules added:
- MarqueeDroppedFramesHigh (5% / 5min / warning)
- MarqueeRenderLatencyP99High (16ms / 10min / warning)
- MarqueeAnimationDurationDrift (10% / 15min / info)
All three gated with `unless on() absent_over_time(metric[7d])` so they
don't fire during the metric-not-yet-published window before Track 3
IR-21 source IMPL ships the MarqueeMeter into Common + Web + WPF.
Live source-of-truth (the noc1 Podman Prometheus reads from
/opt/monitoring/prometheus/alerts.yml) was updated and reloaded
in the same session — Notes commit 300daa0 carries the matching
alerts.yml + Grafana fc-signage-dashboard.json change.
Per feedback_monitoring_k8s_target_vs_live_podman: this file is the
forward-looking K8s migration target, NOT what the live Podman
Prometheus reads. ArgoCD-syncing this file does NOT push alerts to
the live monitoring stack.
Companion to:
- FlowerCore.Notes 300daa0 (live alerts.yml + Grafana panels deployed)
- docs/signage/marquee-performance-telemetry-design.md (Track 3 IR-21 spec)
- docs/signage/marquee-animation-phases.md (Track 6 13-phase coverage matrix)
Memory: project_marquee_vr_promotion_landed_2026_05_06
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps tag to bring live pod up to FlowerCore.Intranet.Web@a9ede80 (master tip
post-fleet-search-resurrect merge). Image imported to all 3 RKE2 nodes via
scripts/deploy.sh v20260505-1108.
Closes the source-vs-deployed gap that existed since 2026-04-29: the
KnowledgeFleetSearchController + Service + TrustedHeader auth handler were
running on the deployed pod but never landed on master. Surgical extraction
from stale codex/fleet-knowledge-search branch (12-file rebase conflict made
full merge non-trivial) brings the source up to match production.
+7 tests (280/280 vs 273), 0W/0E build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds fc-updatecenter-dashboard.json (uid: fc-updatecenter, version: 2)
to apps/monitoring/ — mirrors the dashboard deployed to noc1 at
/opt/monitoring/grafana/dashboards/fc-updatecenter-dashboard.json.
13 panels: 5 existing probe/availability panels + 1 OTEL row header
+ 7 new panels for the 6 OTEL counters added to FlowerCore.Updater.Web:
updatecenter_manifest_requests_total
updatecenter_bundle_download_bytes_total
updatecenter_bundle_downloads_total
updatecenter_checkins_total
updatecenter_release_publishes_total
updatecenter_signature_verify_failures_total
Live on Grafana at https://grafana.iamworkin.lan/d/fc-updatecenter
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps tag to include the new AddDataProtectionKeys EF migration that closes
the fc_dp_keys table-creation gap from v20260505-1023. Master tip a82d7d4.
Previous tag v20260505-1023 crash-looped on every page load with
'no such table: fc_dp_keys' due to eb9fe6d (DataProtection-in-DB)
registering the DI but missing the table-creation migration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JSON-provisioned dashboard for FlowerCore.RemoteDesktop session metrics,
matches the Apr 23 staging done in the codex/ttsreader-release-b6ca2d5
worktree. Drop into apps/monitoring so ArgoCD-managed Grafana provisioning
picks it up alongside the other FC service dashboards.