Commit Graph

452 Commits

Author SHA1 Message Date
Robot
ba5f5dd0fb deploy(knowledge): roll audit backfill fix 2026-06-03 18:24:22 -05:00
Robot
dc699da7b3 fix(knowledge): persist federation database on PVC 2026-06-03 18:17:31 -05:00
Robot
1e8bf54c6e deploy: roll Chat and Knowledge OIDC images 2026-06-03 18:13:09 -05:00
Andrew Stoltz
e2e93d482c Deploy TtsReader schema repair image
Co-Authored-By: Codex <codex@openai.com>
2026-06-02 22:00:15 -05:00
4319cc2b51 Merge PR #32: divoom pi deploy artifact manifests
Lands Divoom-as-DM-device and Divoom-TV Pi HDMI deploy artifacts for Cx-6.
2026-06-03 02:47:36 +00:00
Andrew Stoltz
2bf339ce51 Deploy TtsReader PR29 live proof image
Co-Authored-By: Codex <codex@openai.com>
2026-06-02 21:47:04 -05:00
Andrew Stoltz
5bdedfc5ae divoom: add pi deploy artifact manifests
Add source-controlled Puppet/Hiera contracts for edge2 Divoom-as-DM-device without replacing the live flowercore-divoom systemd deployment.

Add Divoom TV Pi HDMI systemd/Puppet deployment artifacts, LF shell-script guardrails, and focused lint coverage for the additive non-K8s deploy shape.

Co-Authored-By: Codex <codex@openai.com>
2026-06-02 21:45:27 -05:00
Andrew Stoltz
0307ae16ae monitoring(probe): signage/mysql/php blackbox probe / -> /healthz (K8s-target mirror)
Mirrors the live noc1 podman fix + Notes scripts/monitoring/prometheus.yml.
These services enforce OIDC bearer auth (FlowerCore__Auth__Enabled=true), so an
anonymous probe of / returns 401 -> false TraefikServiceDown. All three expose
anonymous /healthz=200. This noc-monitoring.yaml is the forward K8s-migration
target (not live); brings it in sync with the live config.
2026-06-02 01:09:57 -05:00
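The false alert described in this commit can be illustrated with a minimal sketch of how an HTTP probe decides success. It assumes the blackbox-style default of accepting only 2xx responses when no explicit status list is configured; the function name `probe_success` and its parameter are hypothetical and are not taken from the repository.

```python
from typing import Tuple


def probe_success(status_code: int, valid_status_codes: Tuple[int, ...] = ()) -> bool:
    """Return True when an HTTP probe response counts as 'up'.

    Assumption: with no explicit list, any 2xx status is accepted.
    """
    if valid_status_codes:
        return status_code in valid_status_codes
    return 200 <= status_code < 300


# Anonymous probe of "/" on an OIDC-protected service: 401, so the probe fails.
print("/        ->", probe_success(401))
# Anonymous probe of "/healthz": 200, so the probe succeeds.
print("/healthz ->", probe_success(200))
```

Pointing the probe at `/healthz` keeps the auth requirement on `/` intact, whereas adding 401 to the accepted status list would also hide a genuinely broken auth layer.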
Andrew Stoltz
6c18f69cf2 mail: remove cert-manager Certificate (manage mail-tls via step-ca JWK + noc1 renew timer)
step-ca-acme only has an HTTP-01 (Traefik) solver, but mail.iamworkin.lan must resolve
to the dedicated MetalLB IP 10.0.56.202 (SMTP/IMAP), so HTTP-01 cannot validate (order
stuck pending since 2026-05-06; cert expired 2026-05-24). mail-tls is now issued from
step-ca's JWK 'admin' provisioner and auto-renewed by a systemd timer on noc1 that writes
the mail-tls secret directly. The secret + Deployment mount + webmail IngressRoute are
unchanged. Re-add a Certificate only if a DNS-01 solver is deployed for step-ca-acme.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 15:55:38 -05:00
Andrew Stoltz
47e2256556 Deploy TtsReader correction bridge images 2026-05-31 12:35:45 -05:00
Andrew Stoltz
9d77f8ba0e fc-updater: disable loki audit sink 2026-05-31 11:34:12 -05:00
Andrew Stoltz
2f4be19c85 fc-updater: bump signing diagnostics image 2026-05-31 00:32:48 -05:00
Andrew Stoltz
2a62c40990 fc-updater: bump image to MSI installer surface 2026-05-30 23:31:48 -05:00
Andrew Stoltz
7be98e5efc Bump UpdateCenter image to hosted-service fix 2026-05-30 20:22:13 -05:00
Andrew Stoltz
a65b356c9d deploy(fc-updater): roll UC to v202605301823-a6c3354 (Phase 3 SQLite fixes)
Durable image bump for FlowerCore.Updater main a6c3354 (PRs #63-#66): hosted-service
+ request-path SQLite DateTimeOffset fixes, StopHost restored + per-tick resilience,
Shared.Settings 1.0.1. Image built + imported to rke2-server. Un-degrades the Phase-9
provenance verifier + settings poll (were stopped under the removed global Ignore mask).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 18:27:45 -05:00
Andrew Stoltz
08c17ef1b4 fc-updater: bump to v202605301703-296f350-fix2 (BackgroundServiceExceptionBehavior=Ignore so a hosted-service SQLite query crash can't stop the host)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 17:04:54 -05:00
Andrew Stoltz
06f2f002b7 fc-updater: bump image to v202605301657-296f350-fix1 (Shared.Settings SQLite poll fix)
The v202605301642-296f350-rework image crash-looped: FlowerCore.Shared.Settings SettingsDbPollHostedService
ran a DateTimeOffset Where/OrderBy on SettingsRecordChanges that SQLite can't
translate, and as a BackgroundService it stopped the host. Shared.Settings 1.0.1
materializes the change-log then filters/orders in memory; Updater Web bumped to 1.0.1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 16:59:37 -05:00
Andrew Stoltz
7ac4a8b4b7 fc-updater: bump image to v202605301642-296f350-rework (ADR-179 rework live)
Deploy the current FlowerCore.Updater main (PRs #52-#61) to prod: MSI-first
packaging, beta gating + per-install tokens, interactive+bearer Authentik OIDC,
native installer apply, and the .fcsetup.exe retirement (DropReleaseInstallers
migration runs on the now-empty DB). Image pre-imported to rke2-server + agent1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 16:47:28 -05:00
Andrew Stoltz
90f2a86819 ops: trim load for degraded 2-node cluster (agent2 PSU dead)
Scale all github-runner deployments to 1 replica and halt the ci1
KubeVirt VM. With agent2 down (failed PSU) the cluster runs on two
passively-cooled NUCs; the ci1 8-vCPU VM drove agent1 to ~100C. Keep
total load trimmed until replacement hardware is in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 13:47:13 -05:00
Andrew Stoltz
cbdefb2b23 Revert "ci1: expose WinRM/RDP/SSH ports on masquerade interface for Phase 2 bootstrap"
The port additions caused the new VMI to stick at phase=Scheduled with
reason=GuestNotRunning. The guest-console-log sidecar exited 1 and
qemu never started. Reverting to the working 9-day-stable shape until
the port-add path is verified in a non-production VM.

Phase 2 (Windows runner install + registration) needs an operator-
interactive virtctl-vnc session against the rebuilt VM, OR a separate
investigation of why this port-add tipped over the VM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 11:35:10 -05:00
Andrew Stoltz
1c36fe3a0a ci1: expose WinRM/RDP/SSH ports on masquerade interface for Phase 2 bootstrap
The Phase 1 VM has been Running for 9 days but Phase 2 (Puppet bootstrap +
runner registration) was deferred because the operator-interactive
virtctl-vnc path was the only way in. The masquerade interface listed
no exposed ports, so virtctl ssh and kubectl port-forward both hit
'no route to host' — qemu user-mode NAT does not forward inbound by
default.

Adding 5985 (WinRM HTTP) lets a kubectl port-forward + PowerShell
remoting path drive runner registration entirely from outside the VM.
3389 + 22 are reserved for desktop access via Guacamole or virtctl ssh
once OpenSSH Server is installed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 11:24:34 -05:00
Andrew Stoltz
2b420ce8a4 runners: fleet-wide right-size CPU requests from 500m to 100m
All 33 runner Deployments now request 100m CPU instead of 500m,
freeing roughly 20 cores (~50 idle pods × 400m each) for the cluster.
Observed CPU usage on idle runners is ~1m via kubectl top; the 500m
request was a 500× over-provision that was eating allocatable CPU
and blocking new workload scheduling — WorldBuilder runner could not
be scheduled even at the new 100m request because the pre-existing
fleet held the cluster at 99% requested.

Burst headroom preserved by limits.cpu: 2000m unchanged. TtsReader
keeps its 8Gi memory limit from the 2026-05-25 OOMKill fix; only
the CPU request line moves.

Recreate strategy on each deployment means a brief offline window
per runner during rollout; in-flight CI jobs complete on the
existing container before the new spec takes effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 10:09:24 -05:00
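The arithmetic in this commit message can be checked with a short sketch. The pod count (about 50), the old and new requests (500m, 100m), and the observed idle usage (about 1m) come from the message; the helper name `reclaimed_cores` is hypothetical.

```python
def reclaimed_cores(pods: int, old_request_m: int, new_request_m: int) -> float:
    """Cores freed when every pod's CPU request drops from old to new (millicores)."""
    return pods * (old_request_m - new_request_m) / 1000


# Figures quoted in the commit message.
freed = reclaimed_cores(pods=50, old_request_m=500, new_request_m=100)
over_provision = 500 / 1  # 500m requested vs ~1m observed on idle runners

print(f"freed ~{freed:.0f} cores; old request was {over_provision:.0f}x observed usage")
```

Because only `requests.cpu` changes and `limits.cpu` stays at 2000m, the scheduler reserves less per pod while each runner can still burst during a CI job.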
Andrew Stoltz
5cbc1a06b1 runners: scale DM/AiStation.Linux/WorldBuilder down to 1 replica until cluster relieved
After cutting requests to 100m, 4 of 6 new pods scheduled and 2 stayed
Pending — cluster CPU REQUEST utilization is 49.6 of 48 allocatable cores
because the existing fleet of ~50 idle runners reserves 25.6 cores
(500m × ~50) for ~50m actual use. Single-replica per new repo gets the
service online without competing with in-flight CI from the rest of the
fleet.

When the broader fleet-wide request right-sizing pass lands
(500m → 100m on all idle runners would free ~20 cores), these can be
bumped back to 2 replicas if PR-CI backlog warrants it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 10:03:30 -05:00
Andrew Stoltz
9e7ee39b3a runners: drop CPU request 500m→100m on DM/AiStation.Linux/WorldBuilder
All 3 fleet nodes were at 99% CPU REQUEST allocation; the 6 new pods
from the previous commit (3 deployments × 2 replicas × 500m) couldn't
schedule. Idle runners actually use ~1m CPU per `kubectl top pods`;
the 500m request was significantly over-provisioned. Burst headroom
preserved by limits.cpu: 2000m unchanged.

Follow-up: similar request right-sizing pass across the rest of the
runner fleet is queued for a future morning-routine sweep — 25 cores
reserved for ~50m actual use is a large slack we can reclaim cluster-
wide.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 10:00:23 -05:00
Andrew Stoltz
ae030a5f33 runners: add github-runner Deployments for DeviceManagement + AiStation.Linux + WorldBuilder
Morning-routine 2026-05-26 — these three repos had ZERO online Linux PR-CI
capacity, blocking the Sprint 37 Cx-1 Linux-CI-migration PRs (DM #20/#21/
#22, AiStation.Linux #13, WorldBuilder #3/#4). Chicken-and-egg: the
migration PRs need Linux runners that the migration creates.

Each Deployment uses the same canonical emptyDir-only pattern as the
fresh-2026-05-26 updater deployment that lives just above:
  - replicas: 2 (room for parallel PR-CI without head-of-line blocking)
  - per-pod emptyDir caches (no RWO PVC contention)
  - shared github-runner-token secret (existing ACCESS_TOKEN PAT has
    org-wide read access)
  - LABELS: self-hosted,linux,fc-build-linux
  - DOTNET_INSTALL_DIR pinned per ADR-170 family

For AiStation.Linux specifically: the Linux job will now be picked up; the
Windows job in #13 remains queued indefinitely until the Windows runner
host substrate lands per Sprint 36 v2 Cl-2 / ADR-174 — that's a separate
arc, not this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 09:55:31 -05:00
bc8c35896f tests: add bluejay-ws runner-exclusion lint + fix 3 stale runner-fleet assertions (#30)
BLUEJAY-WS must never be a fleet GHA runner (operator directive 2026-05-26). Build-side analog of Sprint 9 safe-account exclusion. Also fixes 3 stale runner-fleet assertions broken by initContainer addition + replica tuning.
2026-05-26 03:42:01 +00:00
Andrew Stoltz
2cc91b6df0 runners: bump tts-reader memory limit 4Gi -> 8Gi
The github-runner-tts-reader pod was being OOMKilled (exit 137)
mid-`dotnet test` on the TtsReader 1000+ test suite. PR #21 CI
(the Windows -> Linux runner migration) flapped twice with the
"self-hosted runner lost communication" annotation before the
K8s-side symptoms surfaced via kubectl describe pod.

Requests bumped 1Gi -> 2Gi, limits 4Gi -> 8Gi. Comment added
inline so future fleet runs don't trip the same wall.

Unblocks PR #21 + the 9 other open TtsReader PRs that all rebase
through it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 22:31:48 -05:00
0d2090fe81 runners: add github-runner-updater Deployment (#29)
Close runner-fleet gap for FlowerCore.Updater. Matches Sprint 32 long-tail pattern; registers entry in fleet-lint required-set.
2026-05-26 03:24:13 +00:00
Andrew Stoltz
bc3548e715 runners: add github-runner-pimanager Deployment
FlowerCore.PiManager build run 26417714843 sat queued 5h with zero
self-hosted runners registered to the repo. PiManager was missed in
the Sprint 32 long-tail sweep — every other FC repo got a dedicated
repo-scoped Deployment with its own ACCESS_TOKEN registration, but
PiManager fell through the cracks.

Adds a 2-replica ephemeral runner Deployment matching the Signage /
DMS / Print.Web pattern (per-pod emptyDir caches, no shared PVC,
labels `self-hosted,linux,fc-build-linux`, shared github-runner-token
PAT). Once ArgoCD syncs, the queued job will pick up automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 20:33:44 -05:00
74333cc26b selenium: right-size hub + chrome + edge memory limits (#28) 2026-05-26 01:12:15 +00:00
Andrew Stoltz
7310fb88c2 selenium: right-size hub + chrome + edge memory limits
Edge node has been OOMKilled 51 times in 5 days (~1 every 2.4h) on a
1Gi memory limit. Chrome runs maxSessions=2 on the same 1Gi cap and
was idling at 684Mi — first concurrent session pushing the node to
~900Mi+ would be the next OOM. Hub was running at 766Mi against a 1Gi
limit (75%); no recent restarts but no headroom either.

Firefox node has been running at 2Gi memory limit for 9 days with
zero restarts — that is the right size for a Selenium 4.27 browser
node under our session profile (screen recording sidecar + 1080p
rendering + page captures). Match it.

Changes:
- Hub:    limit 1Gi -> 1.5Gi, request 512Mi -> 1Gi
- Chrome: limit 1Gi -> 2Gi,   request 512Mi -> 1Gi
- Edge:   limit 1Gi -> 2Gi,   request 512Mi -> 1Gi

CPU left alone on all three — observed utilization is well under the
existing limits (hub 54m / 500m, chrome 185m / 1, edge 11m / 1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 20:11:41 -05:00
148bc87b9a runners: bake step-ca root CA into image (v20260525-stepca) (#27) 2026-05-26 01:04:14 +00:00
Andrew Stoltz
2a1e842100 runners: bake step-ca root CA into image (v20260525-stepca)
Without the IAmWorkin step-ca root CA in the runner image's system
trust store, .NET HttpClient calls from CI tests against
`*.iamworkin.lan` (e.g. `https://selenium.iamworkin.lan/session`) fail
with `The remote certificate is invalid because of errors in the
certificate chain: PartialChain`. FlowerCore.Print.Web's
`WebScreenshotService` unit tests hit this on every build.

Drop the step-ca root PEM into `/usr/local/share/ca-certificates/`,
run `update-ca-certificates` once during apt install, and let OpenSSL +
.NET-on-Linux read the regenerated `/etc/ssl/certs/ca-certificates.crt`
automatically — no `SSL_CERT_FILE` env var, no per-Deployment volume
mount.

Image rebuilt + saved + imported on all 3 schedulable RKE2 nodes
(rke2-server, rke2-agent1, rke2-agent2) before this PR — verified with
`ctr images list -q | grep stepca` on each node.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 19:55:38 -05:00
bc28430d24 selenium: allow github-runner namespace ingress on 4444 (#26) 2026-05-26 00:44:23 +00:00
Andrew Stoltz
cc92272217 selenium: allow github-runner namespace ingress on 4444
Unblocks CI jobs running in github-runner pods (e.g. FlowerCore.Print.Web
`help-screenshots`) from reaching selenium-hub. Previously the session
POST was DNAT'd to the hub pod IP then dropped at the Calico ingress
hook, surfacing as a 60s timeout against
http://selenium-hub.selenium.svc.cluster.local:4444 while the Selenium
UI showed 0/4 sessions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 19:43:12 -05:00
d6f4468a9c selenium: migrate hub + 3 nodes into ArgoCD-managed manifests (#25) 2026-05-26 00:09:35 +00:00
Andrew Stoltz
2f796a2ebd selenium: migrate hub + 3 nodes + service + ingressroute into ArgoCD
These resources had been orphaned, kubectl-applied objects since the Selenium Grid was first set
up. The `infra-selenium` ArgoCD app existed but only managed
`network-policy.yaml` — the deployments themselves drifted whenever
anyone `kubectl set env`'d or `kubectl scale`'d.

This commit captures the live state (with the 2026-05-25 maxSessions
bump for chrome already baked in) as canonical git source. ArgoCD's
ServerSideApply syncPolicy + selfHeal will now keep the grid in lockstep
with this file.

Resources captured:
  - Service selenium-hub (ClusterIP, internal traffic on 4444)
  - Service selenium-hub-external (LoadBalancer, MetalLB 10.0.56.208)
  - Deployment selenium-hub
  - Deployment selenium-node-chrome (replicas=1, SE_NODE_MAX_SESSIONS=2)
  - Deployment selenium-node-firefox (replicas=1, maxSessions=1)
  - Deployment selenium-node-edge (replicas=1, maxSessions=1)
  - IngressRoute selenium-hub (Traefik, selenium.iamworkin.lan)

No live behavior change — server-side dry-run confirms unchanged for
hub/firefox/ingressroute, "configured" for hub-external + 3 deploys
(default-field reordering only; SSA + field managers handle the diff).

Refs: Sprint 33 morning-routine 2026-05-25 follow-up Q-MR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 19:08:55 -05:00
1f1f6823db runners: right-size replica counts per 14d CI activity (#24) 2026-05-26 00:01:47 +00:00
Andrew Stoltz
b92f74b63a runners: right-size replica counts per 14d CI activity data
Drop 2 → 1 for 10 deploys based on trailing-14d run counts:
  - LlmBridge, Media, Knowledge, Intranet.Web, DNS  (0 runs each)
  - Presentations (6), Redis (3), Provisioning (3),
    MessageBoard (3), MenuBoard (3)

Bump 2 → 3 for Print.Web: 12 runs in trailing 5d, and the
help-screenshots AAT job holds a runner 30+ min, creating
head-of-line blocking for parallel PRs.

Net change: -9 replicas (≈ -9 GiB committed memory).
Aligns with Sprint 33 morning-routine capacity audit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 18:55:47 -05:00
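The net replica change stated above can be verified with a small tally. The Deployment names and before/after counts are taken from the commit message; the roughly 1 GiB of committed memory per replica is an assumption inferred from the "≈ -9 GiB" figure, not a value read from the manifests.

```python
# (before, after) replica counts per Deployment, as listed in the commit message.
changes = {
    name: (2, 1)
    for name in [
        "LlmBridge", "Media", "Knowledge", "Intranet.Web", "DNS",
        "Presentations", "Redis", "Provisioning", "MessageBoard", "MenuBoard",
    ]
}
changes["Print.Web"] = (2, 3)

net_replicas = sum(after - before for before, after in changes.values())

# Assumption: about 1 GiB of committed memory per runner replica.
GIB_PER_REPLICA = 1
print(f"net replicas: {net_replicas}, committed memory: ~{net_replicas * GIB_PER_REPLICA} GiB")
```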
Andrew Stoltz
cb7f7dbc4d authentik: generous startup/liveness probes for first-boot migration
The server pod was getting killed by liveness probe at 60s while still
waiting on migration DB lock (worker pod also running migrations against
same DB). Add startupProbe with 10.5 min budget so liveness doesn't fire
until migrations finish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 16:03:03 -05:00
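A startup probe's time budget is roughly `periodSeconds × failureThreshold` (plus any initial delay), and liveness checks do not begin until the startup probe has succeeded. The sketch below shows one combination that yields the 10.5-minute budget mentioned above; the specific values (10 s period, threshold of 63) are hypothetical, since the commit message does not state them.

```python
def startup_budget_seconds(
    period_seconds: int,
    failure_threshold: int,
    initial_delay_seconds: int = 0,
) -> int:
    """Approximate time a container gets to pass its startup probe."""
    return initial_delay_seconds + period_seconds * failure_threshold


# Hypothetical values that add up to a 10.5-minute budget.
budget = startup_budget_seconds(period_seconds=10, failure_threshold=63)
print(f"{budget} s = {budget / 60} min")
```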
Andrew Stoltz
03126d5584 authentik: add fsGroup:1000 to server + worker so non-root uid can write /media
PermissionError: [Errno 13] Permission denied: '/media/public' in tenant_files
migration because Authentik container runs as uid 1000 but Longhorn PVC mounts
root:root by default. fsGroup on Pod securityContext recursively chgrps the
PVC mount to gid 1000 + chmods g+rwx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 15:58:35 -05:00
Andrew Stoltz
495e884c41 authentik: initial deployment at id.iamworkin.lan
Stack:
  - PostgreSQL 16 StatefulSet (Longhorn RWO 5Gi)
  - Redis 7 Deployment (no persistence)
  - Authentik server + worker (ghcr.io/goauthentik/server:2024.12.3)
  - Shared media PVC (Longhorn RWO 2Gi) between server+worker
  - Certificate via step-ca-acme ClusterIssuer
  - Traefik IngressRoute at id.iamworkin.lan

Secrets sourced from 1Password item 'authentik-credentials' (IAmWorkin
vault, id y6i74ch22q5wvm7znquq4nhhcu) via OnePasswordItem CRD. Fields:
AUTHENTIK_SECRET_KEY, POSTGRES_PASSWORD, REDIS_PASSWORD,
BOOTSTRAP_ADMIN_PASSWORD, BOOTSTRAP_ADMIN_TOKEN, BOOTSTRAP_ADMIN_EMAIL.

DNS A record id.iamworkin.lan -> 10.0.56.200 added via
scripts/pfsense-add-id-host.py (FlowerCore.DNS service was 502'ing on
pfSense diag_command.php response parsing).

Closes the immediate gap from PiManager OIDC Cohort 3 wire-up: PiManager
(a87cd6f) configures id.iamworkin.lan as JWT authority but the backend
was never deployed. Pirelay specifically is on Mode:apikey until this
backend is bootstrapped and a pimanager service-account exists.

Post-deploy bootstrap (manual once pods Ready):
  1. Login at https://id.iamworkin.lan/if/admin/ as akadmin
     using BOOTSTRAP_ADMIN_PASSWORD from 1Password.
  2. Create OAuth2/OpenID Provider for pimanager (issuer
     https://id.iamworkin.lan/application/o/pimanager/, audience 'pimanager').
  3. Create Application binding the provider.
  4. Create service account user 'pimanager-service-account', generate
     long-lived token, store in 1Password as 'pimanager-service-account'.
  5. Re-enable jwt mode on pirelay + un-mask puppet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 15:50:10 -05:00
Andrew Stoltz
65aa1e6104 fix(monitoring): point probe-printweb at /health (Q-MR-90)
Root path requires API key auth — `/` returned 401 to the blackbox
probe, firing PrintWebDown despite `/health` reporting Healthy.
Pattern: feedback_k8s_probes_behind_auth_middleware.

Mirrors FlowerCore.Notes scripts/monitoring/prometheus.yml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 14:52:02 -05:00
Andrew Stoltz
7f2a3b76b4 feat(github-runner): bake Ruby 3.3 into Linux self-hosted runner image (Q-MR-81) 2026-05-20 11:45:43 -05:00
ea73f00461 fix(fc-devicemgmt): remove self-referential Application resource (Q-MR-79)
ApplicationSet already creates infra-fc-devicemgmt; removing the in-repo Application child clears the self-reference drift.
2026-05-20 16:20:01 +00:00
Andrew Stoltz
25ace30a03 fix(fc-devicemgmt): remove self-referential Application resource (Q-MR-79) 2026-05-20 11:18:25 -05:00
Andrew Stoltz
ca574c2280 brochure: delete apps/brochure/ — full prune per operator decision 2026-05-19
Removes the apps/brochure/ directory entirely from the bluejay-infra
ApplicationSet glob. ArgoCD will:

  1. See infra-brochure has no git source -> mark for delete
  2. Prune the brochure namespace + Deployment + Service + Certificate
     + Secret + IngressRoute (all generated from the now-gone
     apps/brochure/brochure.yaml)
  3. Remove the infra-brochure Application from argocd ns

Operator decision 2026-05-19 (follow-up to 09387f9 ARCHIVED banner
commit): "Yes, prune argo for brochure. Probably fully deleted there."

The brochure subdomain project was a planning-chain misinterpretation
of "make TtsReader + AI Station production-ready" — see
memory/project_brochure_split_misinterpretation_archived_2026_05_19.md
in FlowerCore.Notes for the full decision record.

Reusable artifacts that were the operator's archive concern stay alive
in their actual homes:

- FlowerCore.Intranet.Web PR #8 content-NuGet carve-out: still in
  Intranet's master, may transfer to TtsReader / AI Station prod work
- Sprint 32 Cl-5 substrate (public-twin design ideas): SUPERSEDED banner
  in-place in FlowerCore.Notes docs/standards/, history preserved
- magpie-doc-writer + wren-walkthrough skill output: unchanged in
  Intranet's flowercore-whats-new/walkthroughs/galleries directories

Companion Notes-side commit updates the "scaled to 0 + ARCHIVED banner"
language in mvp-readiness.html + fleet-roadmap-2026-05-19-sprint36-v2.md
+ memory record to reflect full deletion instead.

Wrong-codebase image localhost/fc-brochure-web:v20260524-sprint32 is
being removed from rke2-server / rke2-agent1 / rke2-agent2 in a
follow-up step (reclaims ~800MB per node).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 10:42:30 -05:00
Andrew Stoltz
09387f90e1 brochure: ARCHIVED 2026-05-19 — was a misinterpretation, do not re-enable
The brochure split project was a misinterpretation of an operator request
to make TtsReader + AI Station production-ready. Somewhere in the planning
chain it spun up into a separate "showcase brochure product" with its own
host, repo, NuGet, and Codex pack — none of which the operator actually
wanted. The project itself is pointless and a waste of credits.

Archive (not delete) per operator decision 2026-05-19, because some work
shipped under the misinterpretation may still have reusable value:

- FlowerCore.Intranet.Web PR #8 (merged) introduced FlowerCore.Brochure.Content
  content-NuGet carve-out — pattern may apply to TtsReader/AiStation production
  polish.
- Sprint 32 Cl-5 substrate has design ideas for public-twin vs operator-host
  separation that may transfer.
- magpie-doc-writer / wren-walkthrough skills still author useful Intranet
  content — those skills stay active.

These manifests stay at replicas: 0 for ArgoCD continuity. Cleanup options
(move out of apps/* glob, or delete entirely) are documented in README.md
for an operator-explicit future call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 10:34:28 -05:00
Andrew Stoltz
e641ceab48 monitoring(irc-notify): criticals also batch hourly — fix per-fire spam
The first batching pass (bacac06) left critical-severity alerts on the
immediate-print path. That's still per-event spam for any persistent
critical (e.g. PrintPaperRollCritical fires every 30s Grafana evaluation
cycle when paper is <5%). Caught immediately after deploy: CUPS queue grew
0 → 8 jobs in 8 minutes from a single firing PrintPaperRollCritical.

This commit aligns with the operator's verbatim ask ("one alert an hour"):

- Critical-severity alerts now go into the digest buffer, NOT the
  immediate-print path. The digest payload already shows severity tags
  per alertname, so the operator still sees "[critical] X" in the printout.
- The explicit `alert_channel=thermal_print_immediate` label still bypasses
  batching, but only on NEW fingerprint arrival — it triggers a flush of
  the CURRENT digest (with the new alert included), then clears. Repeat
  webhooks for the same fingerprint dedupe in the buffer until the next
  hourly tick OR until the alert resolves. No fingerprint can spam.
- `add_to_digest` now returns bool (True = buffer grew, False = dedup /
  resolution / disabled) so the immediate-label path can flush only on
  state transitions.

Net effect: max 1 thermal print per BATCH_INTERVAL_MIN per alert fingerprint,
regardless of severity. Rules that genuinely need same-second paper opt in
via `alert_channel=thermal_print_immediate` (currently zero rules use this).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 10:22:25 -05:00
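The dedup behaviour described in this commit can be sketched as follows. This is a minimal illustration, not the repository's implementation: the `DigestBuffer` class, `handle_webhook`, `hourly_tick`, and the alert dictionary shape (`fingerprint`, `status`, `labels`) are assumptions. Only `add_to_digest` returning a bool and the `alert_channel=thermal_print_immediate` label come from the commit message.

```python
from typing import Dict, List, Set

IMMEDIATE_LABEL = "thermal_print_immediate"


class DigestBuffer:
    """Hypothetical in-memory digest buffer keyed by alert fingerprint."""

    def __init__(self, enabled: bool = True) -> None:
        self.enabled = enabled
        self.pending: Dict[str, dict] = {}  # alerts waiting for the next print
        self.seen: Set[str] = set()         # fingerprints handled this interval

    def add_to_digest(self, alert: dict) -> bool:
        """Return True only when the buffer grew (a new firing fingerprint)."""
        if not self.enabled:
            return False
        fingerprint = alert["fingerprint"]
        if alert.get("status") == "resolved":
            # Resolution clears state so a later re-fire counts as new.
            self.pending.pop(fingerprint, None)
            self.seen.discard(fingerprint)
            return False
        if fingerprint in self.seen:
            return False  # repeat webhook for the same fingerprint: dedup
        self.seen.add(fingerprint)
        self.pending[fingerprint] = alert
        return True

    def flush(self) -> List[dict]:
        """Hand back the current digest and empty it; dedup state is kept."""
        alerts = list(self.pending.values())
        self.pending.clear()
        return alerts

    def hourly_tick(self) -> List[dict]:
        """Scheduled flush: print the digest and start a fresh interval."""
        alerts = self.flush()
        self.seen.clear()
        return alerts


def handle_webhook(buffer: DigestBuffer, alert: dict) -> List[dict]:
    """Return alerts to print right now (empty list means 'wait for the tick')."""
    grew = buffer.add_to_digest(alert)
    immediate = alert.get("labels", {}).get("alert_channel") == IMMEDIATE_LABEL
    if grew and immediate:
        return buffer.flush()
    return []
```

In this sketch a critical alert without the immediate label only ever prints on `hourly_tick`, and an immediate-labelled alert prints once per state transition, which matches the "no fingerprint can spam" goal stated above.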
Andrew Stoltz
c263426ea5 fc-devicemgmt: operator image fix + Web scaled to 0
OPERATOR (PodCrashLoopBackOff cleared):
- Bumped image to v20260519-sp34cl3-fix (built from astoltz/FlowerCore.DeviceManagement@d9a3685
  after Sprint 34 Cl-3 stranded branch was merged via PR #19 squash).
- The v20260512-cx5 image was the broken Sprint 8 scaffold: generic Host
  builder, no kubeops, no Kestrel on :8080, no AddController chain. Readiness
  probe dial-tcp 8080 failed every restart.
- The new image ships the AddController chain for all 4 reconcilers
  (DeviceCrd / DeviceGroupCrd / DevicePolicyCrd / RemoteCommandCrd) plus
  Kestrel on :8080 and /healthz.
- Image saved + scp'd + ctr-imported on rke2-server / rke2-agent1 / rke2-agent2
  before this commit. SHA256: 2cc79ee0a2313c550268d1244f805ae41b396362148dd5603061cc15b6f7fa7e

WEB (DeploymentReplicasMismatch cleared via scale-to-0):
- Web pod cannot start. Two upstream gaps must close first:
  1) MySQL DB instance + user `fc_devicemgmt` / database `flowercore_devicemgmt`
     are not provisioned in fc-mysql. Cluster has zero MySqlInstanceCrds and
     no `mysql.fc-mysql.svc:3306` Service.
  2) 1Password vault item `IAmWorkin/FlowerCore DeviceManagement Runtime` is
     missing (5 fields: DB-Password + 4 mTLS PEMs). OnePasswordItem CRD has
     been stuck Ready=False since 2026-05-18T02:58.
- Same pattern as the brochure-web scale-to-0 in 914fed0 — make the cluster
  clean and quiet, let operator restart deploy on a real schedule.

Re-enable path is fully documented in the deployment-web.yaml header comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 10:11:09 -05:00