Unblocks CI jobs running in github-runner pods (e.g. FlowerCore.Print.Web
`help-screenshots`) from reaching selenium-hub. Previously the session
POST was DNAT'd to the hub pod IP then dropped at the Calico ingress
hook, surfacing as a 60s timeout against
http://selenium-hub.selenium.svc.cluster.local:4444 while the Selenium
UI showed 0/4 sessions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously orphan kubectl-applied since the Selenium Grid was first set
up. The `infra-selenium` ArgoCD app existed but only managed
`network-policy.yaml` — the deployments themselves drifted whenever
anyone `kubectl set env`'d or `kubectl scale`'d.
This commit captures the live state (with the 2026-05-25 maxSessions
bump for chrome already baked in) as canonical git source. ArgoCD's
ServerSideApply syncPolicy + selfHeal will now keep the grid in lock
step with this file.
Resources captured:
- Service selenium-hub (ClusterIP, internal traffic on 4444)
- Service selenium-hub-external (LoadBalancer, MetalLB 10.0.56.208)
- Deployment selenium-hub
- Deployment selenium-node-chrome (replicas=1, SE_NODE_MAX_SESSIONS=2)
- Deployment selenium-node-firefox (replicas=1, maxSessions=1)
- Deployment selenium-node-edge (replicas=1, maxSessions=1)
- IngressRoute selenium-hub (Traefik, selenium.iamworkin.lan)
No live behavior change — server-side dry-run confirms unchanged for
hub/firefox/ingressroute, "configured" for hub-external + 3 deploys
(default-field reordering only; SSA + field managers handle the diff).
Refs: Sprint 33 morning-routine 2026-05-25 follow-up Q-MR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The server pod was getting killed by liveness probe at 60s while still
waiting on migration DB lock (worker pod also running migrations against
same DB). Add startupProbe with 10.5 min budget so liveness doesn't fire
until migrations finish.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PermissionError: [Errno 13] Permission denied: '/media/public' in tenant_files
migration because Authentik container runs as uid 1000 but Longhorn PVC mounts
root:root by default. fsGroup on Pod securityContext recursively chgrps the
PVC mount to gid 1000 + chmods g+rwx.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stack:
- PostgreSQL 16 StatefulSet (Longhorn RWO 5Gi)
- Redis 7 Deployment (no persistence)
- Authentik server + worker (ghcr.io/goauthentik/server:2024.12.3)
- Shared media PVC (Longhorn RWO 2Gi) between server+worker
- Certificate via step-ca-acme ClusterIssuer
- Traefik IngressRoute at id.iamworkin.lan
Secrets sourced from 1Password item 'authentik-credentials' (IAmWorkin
vault, id y6i74ch22q5wvm7znquq4nhhcu) via OnePasswordItem CRD. Fields:
AUTHENTIK_SECRET_KEY, POSTGRES_PASSWORD, REDIS_PASSWORD,
BOOTSTRAP_ADMIN_PASSWORD, BOOTSTRAP_ADMIN_TOKEN, BOOTSTRAP_ADMIN_EMAIL.
DNS A record id.iamworkin.lan -> 10.0.56.200 added via
scripts/pfsense-add-id-host.py (FlowerCore.DNS service was 502'ing on
pfSense diag_command.php response parsing).
Closes the immediate gap from PiManager OIDC Cohort 3 wire-up: PiManager
(a87cd6f) configures id.iamworkin.lan as JWT authority but the backend
was never deployed. Pirelay specifically is on Mode:apikey until this
backend is bootstrapped and a pimanager service-account exists.
Post-deploy bootstrap (manual once pods Ready):
1. Login at https://id.iamworkin.lan/if/admin/ as akadmin
using BOOTSTRAP_ADMIN_PASSWORD from 1Password.
2. Create OAuth2/OpenID Provider for pimanager (issuer
https://id.iamworkin.lan/application/o/pimanager/, audience 'pimanager').
3. Create Application binding the provider.
4. Create service account user 'pimanager-service-account', generate
long-lived token, store in 1Password as 'pimanager-service-account'.
5. Re-enable jwt mode on pirelay + un-mask puppet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes the apps/brochure/ directory entirely from the bluejay-infra
ApplicationSet glob. ArgoCD will:
1. See infra-brochure has no git source -> mark for delete
2. Prune the brochure namespace + Deployment + Service + Certificate
+ Secret + IngressRoute (all generated from the now-gone
apps/brochure/brochure.yaml)
3. Remove the infra-brochure Application from argocd ns
Operator decision 2026-05-19 (follow-up to 09387f9 ARCHIVED banner
commit): "Yes, prune argo for brochure. Probably fully deleted there."
The brochure subdomain project was a planning-chain misinterpretation
of "make TtsReader + AI Station production-ready" — see
memory/project_brochure_split_misinterpretation_archived_2026_05_19.md
in FlowerCore.Notes for the full decision record.
Reusable artifacts that were the operator's archive concern stay alive
in their actual homes:
- FlowerCore.Intranet.Web PR #8 content-NuGet carve-out: still in
Intranet's master, may transfer to TtsReader / AI Station prod work
- Sprint 32 Cl-5 substrate (public-twin design ideas): SUPERSEDED banner
in-place in FlowerCore.Notes docs/standards/, history preserved
- magpie-doc-writer + wren-walkthrough skill output: unchanged in
Intranet's flowercore-whats-new/walkthroughs/galleries directories
Companion Notes-side commit updates the "scaled to 0 + ARCHIVED banner"
language in mvp-readiness.html + fleet-roadmap-2026-05-19-sprint36-v2.md
+ memory record to reflect full deletion instead.
Wrong-codebase image localhost/fc-brochure-web:v20260524-sprint32 is
being removed from rke2-server / rke2-agent1 / rke2-agent2 in a
follow-up step (reclaims ~800MB per node).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The brochure split project was a misinterpretation of an operator request
to make TtsReader + AI Station production-ready. Somewhere in the planning
chain it spun up into a separate "showcase brochure product" with its own
host, repo, NuGet, and Codex pack — none of which the operator actually
wanted. The project itself is pointless and a waste of credits.
Archive (not delete) per operator decision 2026-05-19, because some work
shipped under the misinterpretation may still have reusable value:
- FlowerCore.Intranet.Web PR #8 (merged) introduced FlowerCore.Brochure.Content
content-NuGet carve-out — pattern may apply to TtsReader/AiStation production
polish.
- Sprint 32 Cl-5 substrate has design ideas for public-twin vs operator-host
separation that may transfer.
- magpie-doc-writer / wren-walkthrough skills still author useful Intranet
content — those skills stay active.
These manifests stay at replicas: 0 for ArgoCD continuity. Cleanup options
(move out of apps/* glob, or delete entirely) are documented in README.md
for an operator-explicit future call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first batching pass (bacac06) left critical-severity alerts on the
immediate-print path. That's still per-event spam for any persistent
critical (e.g. PrintPaperRollCritical fires every 30s Grafana evaluation
cycle when paper is <5%). Caught immediately after deploy: CUPS queue grew
0 → 8 jobs in 8 minutes from a single firing PrintPaperRollCritical.
This commit aligns with the operator's verbatim ask ("one alert an hour"):
- Critical-severity alerts now go into the digest buffer, NOT the
immediate-print path. The digest payload already shows severity tags
per alertname, so the operator still sees "[critical] X" in the printout.
- The explicit `alert_channel=thermal_print_immediate` label still bypasses
batching, but only on NEW fingerprint arrival — it triggers a flush of
the CURRENT digest (with the new alert included), then clears. Repeat
webhooks for the same fingerprint dedupe in the buffer until the next
hourly tick OR until the alert resolves. No fingerprint can spam.
- `add_to_digest` now returns bool (True = buffer grew, False = dedup /
resolution / disabled) so the immediate-label path can flush only on
state transitions.
Net effect: max 1 thermal print per BATCH_INTERVAL_MIN per alert fingerprint,
regardless of severity. Rules that genuinely need same-second paper opt in
via `alert_channel=thermal_print_immediate` (currently zero rules use this).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OPERATOR (PodCrashLoopBackOff cleared):
- Bumped image to v20260519-sp34cl3-fix (built from astoltz/FlowerCore.DeviceManagement@d9a3685
after Sprint 34 Cl-3 stranded branch was merged via PR #19 squash).
- The v20260512-cx5 image was the broken Sprint 8 scaffold: generic Host
builder, no kubeops, no Kestrel on :8080, no AddController chain. Readiness
probe dial-tcp 8080 failed every restart.
- The new image ships the AddController chain for all 4 reconcilers
(DeviceCrd / DeviceGroupCrd / DevicePolicyCrd / RemoteCommandCrd) plus
Kestrel on :8080 and /healthz.
- Image saved + scp'd + ctr-imported on rke2-server / rke2-agent1 / rke2-agent2
before this commit. SHA256: 2cc79ee0a2313c550268d1244f805ae41b396362148dd5603061cc15b6f7fa7e
WEB (DeploymentReplicasMismatch cleared via scale-to-0):
- Web pod cannot start. Two upstream gaps must close first:
1) MySQL DB instance + user `fc_devicemgmt` / database `flowercore_devicemgmt`
are not provisioned in fc-mysql. Cluster has zero MySqlInstanceCrds and
no `mysql.fc-mysql.svc:3306` Service.
2) 1Password vault item `IAmWorkin/FlowerCore DeviceManagement Runtime` is
missing (5 fields: DB-Password + 4 mTLS PEMs). OnePasswordItem CRD has
been stuck Ready=False since 2026-05-18T02:58.
- Same pattern as the brochure-web scale-to-0 in 914fed0 — make the cluster
clean and quiet, let operator restart deploy on a real schedule.
Re-enable path is fully documented in the deployment-web.yaml header comment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The thermal printer drained overnight (2026-05-18/19) because the old
notify.py POSTed one print job per Grafana webhook fire. With 9
concurrently-firing alerts (zabbix-postgres + fc-devicemgmt + brochure
+ PrintPaperRollLow), every evaluation cycle stamped fresh CUPS jobs
onto the queue until the operator physically powered the printer off.
This refactor:
- Adds env-var config: THERMAL_PRINT_ENABLED (master kill switch),
BATCH_INTERVAL_MIN (default 60), BATCH_MAX_PENDING (default 50).
- IRC delivery stays per-event (operator wants the live stream).
- Thermal routing now:
* critical/disaster/page severity OR alert_channel=thermal_print_immediate
-> print immediately
* alert_channel=thermal_print -> enqueue into hourly digest
* RESOLVED -> remove from digest buffer (no resolution-spam prints)
* else -> IRC only, no thermal
- Background digest_loop thread flushes the buffer hourly (or sooner
if buffer hits BATCH_MAX_PENDING). Digest payload is a single
Print.Web /api/print/alert POST listing distinct alertnames + per-rule
target counts.
- New POST /flush endpoint (manual operator force-flush; useful for
testing without waiting an hour).
- GET / returns config + buffer depth + per-stat counters for observability.
Net effect: max 1 thermal print per BATCH_INTERVAL_MIN for batched
warnings, plus immediate prints for criticals. Closes the 2026-05-18/19
alert-storm incident.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the live `puppet` alert group from
FlowerCore.Notes/scripts/monitoring/alerts.yml into the K8s ConfigMap so a
future in-cluster Prometheus inherits the ruleset automatically.
Source of truth remains the Notes file (live Podman Prometheus on noc1).
See feedback_monitoring_k8s_target_vs_live_podman.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #5 rebase concatenated PR #5 env additions onto PR #7 env additions on
the base + sharedpos Deployments, producing duplicate-key validation
errors in ArgoCD's structured merge. The DOTNET_INSTALL_DIR and
NUGET_PACKAGES values are identical between PR #5 and PR #7; keep the
PR #7 originals and retain only the unique new env vars from PR #5
(DOTNET_CLI_TELEMETRY_OPTOUT, DOTNET_NOLOGO, DOTNET_GENERATE_ASPNET_CERTIFICATE).
No behavioral change — same final env var set, no duplicates.
Shared.Pos build fails on non-root runner (setup-dotnet /usr/share/dotnet denied); churning runner drove HighCPU on rke2-agent2. Re-enable in the Sprint 30+ Cx-1 Linux-runner-fleet lane (DOTNET_INSTALL_DIR on pod env).
Operator provisioned GitHub PAT (Runner Registration) 1P item. OnePasswordItem CRD already materialized the secret. Unblocks CI fleet-wide.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Renames the OnePasswordItem.itemPath from "GitHub Runner Registration
Token" to "GitHub PAT (Runner Registration)" so the runner 1P entry
sits next to its siblings — GitHub PAT (Gitea Mirrors) and GitHub PAT
(NuGet Packages) — under a consistent "GitHub PAT (...)" naming pattern
and API_CREDENTIAL category.
Existing field "credential" remains the consumer (RUNNER_TOKEN env).
Comment block clarified to require Administration:read/write fine-grained
PAT scope on target repos.
Old 1P item renamed to "[DEPRECATED 2026-05-16] GitHub Runner
Registration" — kept as recovery backup; can be hard-deleted after the
first successful runner pod start against the new item path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Renames the OnePasswordItem.itemPath from "GitHub Runner Registration
Token" to "GitHub PAT (Runner Registration)" so the runner 1P entry
sits next to its siblings — GitHub PAT (Gitea Mirrors) and GitHub PAT
(NuGet Packages) — under a consistent "GitHub PAT (...)" naming pattern
and API_CREDENTIAL category.
Existing field "credential" remains the consumer (RUNNER_TOKEN env).
Comment block clarified to require Administration:read/write fine-grained
PAT scope on target repos.
Old 1P item renamed to "[DEPRECATED 2026-05-16] GitHub Runner
Registration" — kept as recovery backup; can be hard-deleted after the
first successful runner pod start against the new item path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-runner-token OnePasswordItem exists but the underlying 1Password
vault item hasn't been created yet, so the operator can't mint the K8s
Secret. Pod stuck in CreateContainerConfigError → DeploymentReplicasMismatch
alert fires.
Scaling to 0 keeps the manifest infrastructure intact but stops trying
to schedule until operator:
1. Creates "GitHub Runner Registration Token" item in IAmWorkin vault
2. Generates a token at github.com/astoltz/<repo>/settings/actions/runners/new
3. Updates the OnePasswordItem itemPath to point at it
4. Bumps replicas back to 1 via PR
zabbix-web nginx+PHP-FPM container serves / at ~3-5s baseline with
occasional 6-7s spikes (probe path renders full dashboard via PHP).
kube-probe was killing the container after 3 consecutive 5s-timeout
499s, producing CrashLoopBackOff alert noise even though the app
was serving real traffic fine.
15s timeout absorbs the natural variance; explicit failureThreshold=3
documents the policy (was implicit default).
Closes the firing PodCrashLoopBackOff (zabbix-web) + pending
HTTPServiceSlow/HTTPServiceDegraded alerts. zabbix.iamworkin.lan
remains slow at the application layer (separate work — PHP-FPM
warm-up + Zabbix server "host not found" agent lookup spam need
their own fixes) but the pod restart loop stops.
Namespace github-runner with myoung34/github-runner:latest Deployment,
5Gi Longhorn RWO NuGet cache PVC, zero-privilege ServiceAccount, and
OnePasswordItem CRD for the registration token. EPHEMERAL=true mode
re-registers after each job; Recreate strategy avoids RWO multi-attach.
Targets fc-build-linux label; single replica pinned to rke2-server node.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Missing document separator caused YAML to merge the OnePasswordItem's
top-level `spec: itemPath:` block into the ConfigMap that follows.
Result: a ConfigMap with a `.spec` field whose K8s schema does not
declare one, triggering ArgoCD's structured-merge diff to fail since
2026-05-11T15:30:54Z:
Failed to compare desired state to live state: failed to calculate
diff: error calculating structured merge diff: error building typed
value from config resource: .spec: field not declared in schema
App stayed Healthy (live K8s tolerated the extra field — ConfigMap
ignored it) but ArgoCD's diff calc was broken, leaving the app stuck at
sync=Unknown for all 21 resources. Adding the missing `---` separator
makes the OnePasswordItem and ConfigMap proper sibling YAML documents,
each with its own kind-correct schema.
Diagnosed during 2026-05-12 morning routine.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Q-SO-1 operator resolution 2026-05-11 PM, Redis SignalR backplane lands
in Phase A (was Phase C deferral). Treats Redis as a managed FC infrastructure
component, not a deferred scaling escalation.
Lands the minimal Phase A surface:
- Namespace fc-redis
- Single Redis 7-alpine pod with 1Gi Longhorn RWO PVC
- ConfigMap with AOF persistence (everysec), 256Mi maxmemory, allkeys-lru
- ClusterIP Service `redis.fc-redis.svc.cluster.local:6379` (in-cluster only)
- No AUTH Phase A (Phase B add via 1Password Connect rotation)
- No IngressRoute (backplane is server-to-server)
Consumers (Phase A IMPL across FC services) add:
services.AddSignalR().AddStackExchangeRedis(
"redis.fc-redis.svc.cluster.local:6379",
opts => opts.Configuration.ChannelPrefix =
StackExchange.Redis.RedisChannel.Literal("fc-opsconsole"));
Phase B/C follow-ons (not in this commit): Sentinel for HA, AUTH password
from 1Password, redis_exporter sidecar for Prometheus, network policies.
See FlowerCore.Notes/docs/signage/operations-console-phase-2-design.md
section 3.5 (rewritten) and decisions-waiting.html Q-SO-1 (RESOLVED).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>