Commit Graph

418 Commits

Author SHA1 Message Date
Codex
53f67c8713 merge claude/k8s-manifest-hardening: K8s gotcha sweep (C7) + lint extensions 2026-05-04 23:00:34 -05:00
Codex
6b9cf3d12c K8s gotcha sweep C7 — extend lint + cover Track A allowlist + scope Notes/k8s
Follow-up to 0b52093 (K8s manifest hardening) closing two real gaps the
prior sweep didn't catch:

1. Public read-write allowlist regression guard (Track A)
   - New PublicReadWriteAllowlistHosts set tracks updatecenter.iamworkin.lan
     + updates.iamworkin.lan. The allowlist on those hosts is
     GET||HEAD||POST||OPTIONS — POST is required for the bootstrap-JWT
     check-in endpoint. PUT/PATCH/DELETE must still 404 at the route.
   - New PublicReadWriteIngressRoutes_MustPinGetHeadPostOptionsAllowlist
     test enforces the allowlist invariant (3 required methods present,
     3 forbidden methods absent).
   - Companion conftest.dev policy 08_public_readwrite_allowlist.rego.

2. Selenium NetworkPolicy DNAT backend port audit
   - FlowerCore.Notes/k8s/selenium/06-networkpolicy.yaml allowed Traefik
     VIP 10.0.56.200:443 + :80 but its 10.42.0.0/16 + 10.43.0.0/16 egress
     rules didn't include the post-DNAT backend ports (8443 for Traefik
     TLS, 8080 for HTTP). Per feedback_netpol_dnat_backend_port: kube-proxy
     DNATs the destination to a backend pod IP+port BEFORE Calico
     evaluates the FORWARD chain, so without those backend ports in the
     pod CIDR rule, Selenium-driven browser AAT calls to
     https://*.iamworkin.lan time out at connect.
   - Lint inventory now includes FlowerCore.Notes/k8s/selenium/ so
     regressions in this manifest fail fast.
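
The DNAT gap in item 2 can be checked mechanically. The sketch below is a minimal Python illustration, not the lint that shipped in this commit. It assumes the egress rules have already been parsed into plain dictionaries; the names `CLUSTER_CIDRS`, `REQUIRED_BACKEND_PORTS`, and `missing_backend_ports` are hypothetical, and port 5300 in the sample is placeholder data.

```python
# Hypothetical sketch: confirm that egress rules targeting the cluster CIDRs
# also allow the post-DNAT backend ports (8443 for Traefik TLS, 8080 for HTTP).
CLUSTER_CIDRS = {"10.42.0.0/16", "10.43.0.0/16"}
REQUIRED_BACKEND_PORTS = {8443, 8080}


def missing_backend_ports(egress_rules):
    """Return backend ports not allowed by any rule that targets a cluster CIDR."""
    allowed = set()
    for rule in egress_rules:
        cidrs = {
            peer["ipBlock"]["cidr"]
            for peer in rule.get("to", [])
            if "ipBlock" in peer
        }
        if cidrs & CLUSTER_CIDRS:
            allowed.update(p["port"] for p in rule.get("ports", []))
    return REQUIRED_BACKEND_PORTS - allowed


# Sample data: the VIP rule alone does not help once kube-proxy has rewritten
# the destination to a backend pod IP and port.
before = [
    {"to": [{"ipBlock": {"cidr": "10.0.56.200/32"}}],
     "ports": [{"port": 443}, {"port": 80}]},
    {"to": [{"ipBlock": {"cidr": "10.42.0.0/16"}}],
     "ports": [{"port": 5300}]},
]
after = before + [
    {"to": [{"ipBlock": {"cidr": "10.42.0.0/16"}},
            {"ipBlock": {"cidr": "10.43.0.0/16"}}],
     "ports": [{"port": 8443}, {"port": 8080}]},
]

print(sorted(missing_backend_ports(before)))  # [8080, 8443]
print(missing_backend_ports(after))           # set()
```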

Lint scope notes:
   - FlowerCore.Notes/k8s/guacamole/ + monitoring/ are historical
     scaffolds that have diverged from the live state (bluejay-infra/apps/
     is canonical). Operator review is required before bringing them in
     line OR decommissioning them — kept out of lint scope until that
     decision lands (see xxl-regroup-2026-05-03-followup.md "Codex 7 §0").

README hardening:
   - New "Public read-write allowlist hosts" entry under "Known gotchas"
     documenting the GET||HEAD||POST||OPTIONS pattern + linking the lint.

Tests: 8/8 lint tests pass.

Companion fix in FlowerCore.Updater repo on branch
codex/k8s-gotcha-fleet-sweep-c7 (k8s/web-deployment.yaml: localhost/ image
needs imagePullPolicy: Never). The FlowerCore.Updater fix applies to a
deployment that is currently live, but the bug bites only when a pod is
first scheduled onto a fresh node, so it is not a live production-impact
regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:57:59 -05:00
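
The allowlist invariant from commit 6b9cf3d12c (allowlisted methods present, PUT/PATCH/DELETE absent) might be sketched as below. This is an illustration in Python, not the actual PublicReadWriteIngressRoutes_MustPinGetHeadPostOptionsAllowlist test. It assumes the route match is a Traefik-style rule string using `Method(...)` matchers, and it checks all four allowlisted methods because the commit does not say which three the test treats as required. The helper names are hypothetical, and negated matchers are not handled.

```python
import re

ALLOWED_METHODS = {"GET", "HEAD", "POST", "OPTIONS"}
FORBIDDEN_METHODS = {"PUT", "PATCH", "DELETE"}


def methods_in_rule(rule):
    """Collect every HTTP method named inside a Method(...) matcher."""
    methods = set()
    for args in re.findall(r"Method\(([^)]*)\)", rule):
        methods.update(m.upper() for m in re.findall(r"[A-Za-z]+", args))
    return methods


def allowlist_violations(rule):
    """Return human-readable problems; an empty list means the rule passes."""
    found = methods_in_rule(rule)
    problems = [f"missing {m}" for m in sorted(ALLOWED_METHODS - found)]
    problems += [f"forbidden {m}" for m in sorted(FORBIDDEN_METHODS & found)]
    return problems


good = ("Host(`updates.iamworkin.lan`) && (Method(`GET`) || Method(`HEAD`) "
        "|| Method(`POST`) || Method(`OPTIONS`))")
bad = ("Host(`updates.iamworkin.lan`) && (Method(`GET`) || Method(`HEAD`) "
       "|| Method(`PUT`) || Method(`OPTIONS`))")

print(allowlist_violations(good))  # []
print(allowlist_violations(bad))   # ['missing POST', 'forbidden PUT']
```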
Codex
0b52093b36 K8s manifest hardening + new bluejay-infra-lint test project
Manifest hardening (per documented memories):
- apps/asterisk/deployment.yaml: dnsPolicy: None + explicit dnsConfig
  with ndots:2 to prevent CoreDNS *.iamworkin.lan template from
  hijacking external egress (downloads.asterisk.org).
- apps/fc-llm-bridge/fc-llm-bridge.yaml: same dnsConfig pattern for
  api.anthropic.com egress.
- apps/fc-ttsreader/fc-ttsreader.yaml: same dnsConfig pattern for
  huggingface.co model seeding.
- apps/fc-messageboard/fc-messageboard.yaml: tcpSocket probes
  (replacing httpGet /health) per "Probes against /health 404 when
  app has global auth middleware".
- apps/fc-signalcontrol/fc-signalcontrol.yaml: same tcpSocket probe
  fix.

New lint project:
- tests/bluejay-infra-lint/BluejayInfraLint.Tests.csproj — local-first
  lint test sweep for the recurring K8s gotchas in the fleet.
- tests/bluejay-infra-lint/FleetManifestLintTests.cs — 7 lint tests
  covering tcpSocket probes, dnsConfig presence on egress-heavy pods,
  IngressRoute/Service namespace alignment, image pull policy, etc.
- tests/bluejay-infra-lint/conftest.dev/ — matching conftest policies
  for environments with conftest/opa.
- .gitignore — adds bin/ + obj/ + DS_Store/swp.

README.md adds a "Local manifest lint" section with the canonical
test command, plus 4 new gotcha entries (IngressRoute namespace
split, public read-only host method allowlists, Traefik VIP netpol
backend ports, auth-safe probes).

Tests: 7 / 7 lint tests passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 03:18:04 -05:00
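
The auth-safe probe rule from commit 0b52093b36 (tcpSocket instead of httpGet /health) lends itself to a small structural check. The sketch below is Python for illustration only; the real lint lives in FleetManifestLintTests.cs and is not reproduced here. It assumes the manifest has already been loaded into a dictionary, and `http_probes`, the container name, and port 8080 are hypothetical.

```python
PROBE_KEYS = ("livenessProbe", "readinessProbe", "startupProbe")


def http_probes(workload):
    """List (container, probe) pairs that still use an httpGet probe."""
    containers = workload["spec"]["template"]["spec"]["containers"]
    return [
        (container["name"], key)
        for container in containers
        for key in PROBE_KEYS
        if "httpGet" in container.get(key, {})
    ]


# Sample manifest fragment: one probe already converted, one not.
messageboard = {"spec": {"template": {"spec": {"containers": [{
    "name": "web",
    "livenessProbe": {"httpGet": {"path": "/health", "port": 8080}},
    "readinessProbe": {"tcpSocket": {"port": 8080}},
}]}}}}

print(http_probes(messageboard))  # [('web', 'livenessProbe')]
```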
Codex
7a9098d3bd fix(fc-ttsreader): lower web cpu request 2026-05-04 02:28:11 -05:00
Andrew Stoltz
57d7ba46a7 feat(monitoring): add fc-remotedesktop grafana dashboard
JSON-provisioned dashboard for FlowerCore.RemoteDesktop session metrics,
matches the Apr 23 staging done in the codex/ttsreader-release-b6ca2d5
worktree. Drop into apps/monitoring so ArgoCD-managed Grafana provisioning
picks it up alongside the other FC service dashboards.
2026-04-30 14:32:54 -05:00
Andrew Stoltz
9ec2e2d52e deploy(ttsreader): bump web image to b6ca2d5 2026-04-30 12:43:48 -05:00
Andrew Stoltz
b4d62a8a50 deploy(fc-ttsreader): roll chapter-context image 2026-04-30 02:31:55 -05:00
Andrew Stoltz
fbbc07023b deploy(fc-llm-bridge): roll fc:vision image v202604300022
Source: FlowerCore.LlmBridge@8dd181c (feat: fc:vision route + image
content forwarding). Adds:

- fc:vision tier alias parsing (TryParseTier handles fc:vision,
  FC:VISION, openai/fc:vision, vision)
- Image content forwarding: OpenAI image_url shape (https URL +
  data:[mediaType];base64,... URI) and Anthropic image/source
  passthrough are now promoted to LlmContentBlocks. Text-only
  content-parts arrays still flatten to the legacy joined string.
- DefaultRoutes seeder + appsettings.json gain Vision -> Anthropic +
  claude-sonnet-4-6.

Image built on BLUEJAY-WS, podman save + ctr import to all 3 RKE2
nodes (rke2-server, rke2-agent1, rke2-agent2). Bridge tests: 62/62
green (was 51/51, +11). Backwards-compatible with current chat /
util / embed callers; existing routes unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:26:45 -05:00
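
The alias handling listed for commit fbbc07023b (fc:vision, FC:VISION, openai/fc:vision, vision) could work roughly as follows. This is a Python sketch of the behaviour only, not the bridge's TryParseTier. The tier set is an assumption drawn from the chat / util / embed callers mentioned in the message, and `parse_tier` is a hypothetical name.

```python
# Assumed tier names; only "vision" is confirmed as new in this commit.
KNOWN_TIERS = {"chat", "util", "embed", "vision"}


def parse_tier(model):
    """Normalize a model alias to a tier name, or return None if unrecognized."""
    name = model.strip().lower()
    if "/" in name:                 # drop a provider prefix such as "openai/"
        name = name.rsplit("/", 1)[1]
    if name.startswith("fc:"):      # drop the fc: namespace
        name = name[len("fc:"):]
    return name if name in KNOWN_TIERS else None


for alias in ("fc:vision", "FC:VISION", "openai/fc:vision", "vision", "gpt-4"):
    print(alias, "->", parse_tier(alias))
```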
Andrew Stoltz
4b0eef0fb0 deploy(fc-llm-bridge): roll alias-fix image v20260430001132 2026-04-30 00:13:48 -05:00
Andrew Stoltz
bb09a3786f fix(knowledge): pin live manifest to bundled edition path 2026-04-29 23:37:02 -05:00
Andrew Stoltz
006dbcf671 fix(agent-zero): export knowledge mcp gate to python builder 2026-04-29 23:32:55 -05:00
Andrew Stoltz
1be71d6ba7 fix(agent-zero): export mcp servers without python indent errors 2026-04-29 23:19:48 -05:00
Andrew Stoltz
0c8026c912 fix(agent-zero): avoid heredoc break in mcp bootstrap 2026-04-29 23:16:54 -05:00
Andrew Stoltz
621ae47e00 fix(agent-zero): repair fc knowledge mcp manifest 2026-04-29 23:11:57 -05:00
Andrew Stoltz
ae6b8c0142 fix(knowledge): keep mcp key env on new token secret 2026-04-29 23:06:07 -05:00
Andrew Stoltz
da55220218 feat(agent-zero): wire fc_knowledge phase1 rollout 2026-04-29 22:59:19 -05:00
Andrew Stoltz
b1ad253dd6 fix(agent-zero): prefix bridge embedding alias for litellm 2026-04-29 21:14:12 -05:00
Andrew Stoltz
ee935f6e07 fix(agent-zero): keep internal util/embed on bridge v1 2026-04-29 21:09:04 -05:00
Andrew Stoltz
2853ee2024 chore(bridge): bump fc-llm-bridge image tag v202604292028 2026-04-29 20:50:55 -05:00
Andrew Stoltz
b4a34e16ca refactor(agent-zero): drop ollama-proxy sidecar (Phase 3) 2026-04-29 20:50:55 -05:00
Andrew Stoltz
0d5a1fd530 fix(agent-zero): route util and embed through llm bridge 2026-04-29 19:14:01 -05:00
Andrew Stoltz
1b633f57b2 chore(infra): wire knowledge MCP api key secret 2026-04-29 18:04:43 -05:00
Andrew Stoltz
ee8afd0a08 deploy(intranet): promote auth-gated intranet image 2026-04-29 17:11:17 -05:00
Andrew Stoltz
cf35884eae deploy(intranet): harden knowledge search rollout 2026-04-29 16:43:09 -05:00
Andrew Stoltz
9881767b11 deploy(intranet): bump intranet web for knowledge search lane 2026-04-29 16:21:27 -05:00
Andrew Stoltz
c9bf23834b chore(ttsreader): bump image to v202604291817
Per-profile MoodAnnotationModelOverride picker — Profiles page now shows
a model dropdown from IModelRegistry instead of a free-text field; model
override null-falls-back to global TtsReader:Ollama:DefaultModel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 13:21:40 -05:00
Andrew Stoltz
174002023d fix(agent-zero): move corpus_search + intranet_search into bluejay-tools-c
The prior commit b71f9e4 created a stray YAML document between the
bluejay-tools-c and bluejay-profile sections. kubectl applied the stray
block's data to bluejay-profile (wrong ConfigMap, wrong mount target).

The setup-bluejay initContainer copies bluejay-tools-{a,b,c} to the tools
directory; bluejay-profile is copied to the agent profile directory. Tools
must live in one of the three tools ConfigMaps.

Fix: insert corpus_search.py and intranet_search.py directly into the
bluejay-tools-c YAML document (before kind/metadata, matching the
data-first layout the rest of the file uses). Also fix two mojibake
characters (→ and ·) that were corrupted in the prior commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:49:23 -05:00
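
One way the stray YAML document fixed in commit 174002023d might be caught before apply is to flag any document in the multi-document file that lacks top-level `kind` and `metadata` keys. The sketch below uses only the Python standard library and a naive line-based split, so it is an assumption-laden illustration rather than a real YAML parser; the function names and sample content are hypothetical.

```python
def split_documents(text):
    """Split a multi-document YAML string on bare '---' separator lines."""
    docs, current = [], []
    for line in text.splitlines():
        if line.rstrip() == "---":
            docs.append("\n".join(current))
            current = []
        else:
            current.append(line)
    docs.append("\n".join(current))
    return [d for d in docs if d.strip()]


def stray_documents(text):
    """Return indexes of documents missing a top-level kind or metadata key."""
    stray = []
    for index, doc in enumerate(split_documents(text)):
        top_level = {
            line.split(":", 1)[0]
            for line in doc.splitlines()
            if line and not line[0].isspace() and ":" in line
        }
        if not {"kind", "metadata"} <= top_level:
            stray.append(index)
    return stray


# Sample file in the data-first layout, with a stray data-only block in the middle.
sample = "\n".join([
    "data:",
    "  tool_a.py: |",
    "    print('a')",
    "kind: ConfigMap",
    "metadata:",
    "  name: bluejay-tools-c",
    "---",
    "data:",
    "  corpus_search.py: |",
    "    print('stray')",
    "---",
    "data:",
    "  profile.md: |",
    "    hello",
    "kind: ConfigMap",
    "metadata:",
    "  name: bluejay-profile",
])

print(stray_documents(sample))  # [1]
```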
Andrew Stoltz
b71f9e4ec9 feat(agent-zero): add corpus_search + intranet_search to cluster configmaps
- Add corpus_search.py to bluejay-tools-c: semantic vector search over
  fleet SQLite-vec DBs (fleet-workstation-full, fleet-pi-edge, fleet-bmo-bot).
  Returns offline-friendly results for Bible/Greek/Hebrew/Strongs corpora.
  Cluster pod degrades gracefully (no DB mounted yet — BLUEJAY-WS only for now).

- Add intranet_search.py to bluejay-tools-c: live RAG search over the
  intranet vector store via GET /api/search?q=...&topK=N. Uses in-cluster
  service URL (http://intranet-web.intranet.svc:5300) to bypass Traefik TLS
  and the private-range egress denylist.

- Fix intranet_search.py param name: was 'limit', now 'topK' matching the
  SearchController's [FromQuery] parameter name.

- NetworkPolicy: add egress rule for intranet namespace port 5300 (without
  this the pod's TCP connection to the search endpoint was dropped).

- agent-zero.yaml: set FLOWERCORE_INTRANET_URL env var to in-cluster service
  URL so intranet_search uses internal routing, not the public Traefik VIP.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 08:34:31 -05:00
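
The request shape described in commit b71f9e4ec9 (GET /api/search?q=...&topK=N against the in-cluster service URL) can be sketched as a URL builder. This is not the shipped intranet_search.py. The function name and the default of 5 results are assumptions; the env var name and the parameter names `q` and `topK` come from the commit message.

```python
import os
from urllib.parse import urlencode

# In-cluster default taken from the commit message; overridable via the env var.
DEFAULT_BASE = "http://intranet-web.intranet.svc:5300"


def build_search_url(query, top_k=5, base_url=None):
    """Build the search URL, using 'topK' (not 'limit') as the result-count param."""
    base = base_url or os.environ.get("FLOWERCORE_INTRANET_URL", DEFAULT_BASE)
    params = urlencode({"q": query, "topK": top_k})
    return f"{base.rstrip('/')}/api/search?{params}"


print(build_search_url("print service", 3, base_url=DEFAULT_BASE))
# http://intranet-web.intranet.svc:5300/api/search?q=print+service&topK=3
```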
Andrew Stoltz
f1431f7324 feat(agent-zero): wire Print.Web API key to pod via 1Password OnePasswordItem
Add `print-web-api-keys` OnePasswordItem CRD that syncs from 1Password
"Print.Web API Keys" vault item (password field). Mount as PRINT_WEB_API_KEY
env var in the agent-zero container.

The print_web.py Python tool (already in bluejay-tools ConfigMaps) reads
PRINT_WEB_URL and PRINT_WEB_API_KEY env vars for all HTTP calls to the
thermal print service on edge2. Previously the key was unset so every API
call was rejected with 401.

Note: Print.Web uses the legacy REST MCP shape (/api/mcp/tools/*) not the
streamable-http protocol. The Python tool bridges this gap — no /mcp endpoint
exists on Print.Web today. Network policy already allows 10.0.57.16:5200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 20:36:36 -05:00
Andrew Stoltz
35bd055cb4 feat(guacamole): add macmini-vnc-creds OnePasswordItem + fix Mac mini connection IPs
Phase 1 of Mac mini onboarding (2026-04-28):
- Add OnePasswordItem CRD 'macmini-vnc-creds' in guacamole namespace bound to
  vault item 'Mac Mini' — operator mints Secret with username/password/VNC Password fields
- Mac mini discovered at 10.0.56.115 (INFRA VLAN) — not 10.0.57.50 stored in 1P IP field
- Guacamole connections updated via API (not stored here): VNC conn #10, SSH conns #9/#33
  corrected from old IP 10.0.57.50 → 10.0.56.115
- macOS: 26.4.1 (Tahoe), Apple M1, 16 GB, user: bluejay (admin group)
- VNC port 5900 confirmed open; SSH works via noc1 jumpbox with password auth

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 20:09:45 -05:00
Andrew Stoltz
f604ab419e feat(ttsreader): bump image to v202604281923 (SignalR ProgressHub)
Adds ProgressHub endpoint at /hubs/progress with project-scoped
group broadcasting for JobStarted, CueProgress, JobCompleted, and
JobFailed events.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 19:30:41 -05:00
Andrew Stoltz
b2786252b0 chore(ttsreader): bump web image to v202604281831 (ops failed-manifest cleanup)
Deploys fix for stale Failed manifest accumulation in TTS Reader Ops view
and atomic-write guard against empty/corrupt job manifests.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 18:31:53 -05:00
Andrew Stoltz
45ee40920d fix(ttsreader): bump image to v202604281638 (Range support + Ollama timeout 240s) 2026-04-28 16:44:57 -05:00
Andrew Stoltz
8ad7eb714b fix(ttsreader): bump image to v202604281542 (annotation few-shot prompt + UI hint) 2026-04-28 15:46:28 -05:00
Andrew Stoltz
3cb44c3104 feat(noc-services): wire puppetdb.iamworkin.lan through Traefik step-ca cert 2026-04-28 15:13:20 -05:00
Andrew Stoltz
2400329acd fix(intranet): bump image to v20260428-1500 (Monitoring crash patch + Lane 11 anatomy refresh) 2026-04-28 14:59:27 -05:00
Andrew Stoltz
c17af882cc fix(ttsreader): bump image to v202604281444 for UX polish (cross-chapter Bible passage, /profiles dedup, /ops table) 2026-04-28 14:48:13 -05:00
Andrew Stoltz
76b1938afa fix(ttsreader): bump image to v202604281434 for live playback regression patch (study-player + speech override synth) 2026-04-28 14:43:06 -05:00
Andrew Stoltz
ced04a6148 intranet: bump web image to v20260428-0953
Sprint E XXL Intranet docs depth + read-aloud-root sweep deploy.

Image tag v20260427-2353 → v20260428-0953:
- Track A (Intranet.Web@c4f3d78): 7 service pages deepened toward
  PrintService.razor's 8-tab depth standard. Workflows / Verified
  Surfaces / Recent Verified Changes added.
- Read-aloud-root sweep (Intranet.Web@787982c): data-read-aloud-root
  wrappers added to 6 older /services/* pages so the read-aloud
  overlay scopes content extraction precisely instead of falling back
  to <main> with layout chrome included.
2026-04-28 09:54:27 -05:00
Andrew Stoltz
f2258b92a2 fc-ttsreader: bump web image to v202604280946 + add Render__CdnDirectory env
Sprint E XXL Phase 4γ MVP deploy — POST /api/v1/render endpoint.

Two changes:
1. Image tag v202604272339 → v202604280946 (TtsReader@d9e0a58 master tip
   includes the new RenderController + RenderService + 9 tests).
2. New TtsReader__Render__CdnDirectory=/data/cdn env var. Default
   wwwroot/cdn resolves under the read-only app filesystem when
   runAsNonRoot=true; pin to the existing writable PVC mount alongside
   other TtsReader runtime data. Manifests + cue audio land at
   /data/cdn/sha256/<hash>/manifest.json + cues/.

Pre-existing PVC mount at /data/ already covers this — no PVC change
needed, just the env var override.

Pairs with TtsReader@d9e0a58 master tip (ready for image build + import).
2026-04-28 09:47:46 -05:00
Andrew Stoltz
979a7c7b25 feat(intranet): bump fc-intranet-web to v20260427-2353 + persist PageReadingOverrides
Bump intranet image to v20260427-2353 (master @ 38b0148):
- Sprint E search lane: /search Blazor page + IntranetSearchService
  + DocsCorpusIndexer + Shared.Indexing wiring
- 7 new service pages: LocalAiAgents, AiTopology, Distribution, Dns,
  Knowledge, LlmBridge, Provisioning
- PiManager drift docs

New env var: PageReadingOverrides__FilePath=/data/page-reading-overrides.json
so the persisted Lane 2α store lives on the writable PVC instead of
the default in-memory fallback (which loses state on pod restart).
Operator-edited overrides via the existing /api/v1/pages/{encoded}/overrides
controller will now survive across restarts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 23:54:17 -05:00
Andrew Stoltz
0df8f7b936 chore(ttsreader): bump fc-ttsreader-web to v202604272339 (Sprint E Phase C — partial-render UX)
TtsReader@9333480: distinguishes partial-render (yellow Warning, audio
plays, 'Re-render N failed sentences' button) from full-fail (red
Danger, 'Try render again'). New TtsFallbackChainFailedException carries
both voices when Kokoro + Piper both fail; chapter breadcrumb names
the entire chain instead of just the requested voice. +8 tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 23:40:19 -05:00
Andrew Stoltz
38558641c1 fix(ttsreader-kokoro): bump liveness probe timeouts (Sprint E Phase 1a)
Kokoro pod has 4 restarts in 2d6h with exit 143 (SIGTERM from kubelet).
kubectl describe events all show:

  Liveness probe failed: Get "http://10.42.229.109:8880/v1/audio/voices":
    context deadline exceeded

The probe path /v1/audio/voices shares the FastAPI worker pool with
/v1/audio/speech. A long synth (Bible chapter, 30+ sentences) holds the
pool past the prior 5s × 3 = 15s probe window, kubelet kills the pod,
in-flight renders fail. Operator hits "fallback chain failed" toasts +
partial-render breadcrumbs during these windows.

Bump probe timeoutSeconds 5 → 15 and failureThreshold 3 → 5, giving
15 s × 5 = 75 s of grace before kubelet gives up. Combined with the kokoro-side circuit
breaker landing in TtsReader (Sprint E Phase 1b), the FC backend will
also stop slamming kokoro during recovery so it can serve the probe
even faster.

The companion Prometheus alerts (KokoroPodFlapping, PiperPodFlapping)
land in FlowerCore.Notes/scripts/monitoring/alerts.yml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 23:28:07 -05:00
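
The probe-window arithmetic in commit 38558641c1 follows the message's own model of timeoutSeconds multiplied by failureThreshold. The Python sketch below reproduces only that calculation; it ignores periodSeconds, which the commit does not state, so the real time to restart may differ. `probe_grace_seconds` is a hypothetical helper.

```python
def probe_grace_seconds(timeout_seconds, failure_threshold):
    """Grace window under the commit's model: every probe runs to its timeout."""
    return timeout_seconds * failure_threshold


print(probe_grace_seconds(5, 3))   # 15  (prior settings)
print(probe_grace_seconds(15, 5))  # 75  (settings after this commit)
```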
Andrew Stoltz
63d905b4df chore(ttsreader): bump fc-ttsreader-web to v202604272236 (Thinking + Feedback ALTERs) 2026-04-27 22:37:08 -05:00
Andrew Stoltz
d95f4e0caf chore(ttsreader): bump fc-ttsreader-web to v202604272228 (ChatSessions IsFavorite ALTER hotfix) 2026-04-27 22:28:56 -05:00
Andrew Stoltz
7bc565d17e fix(ttsreader): pin VoicePreview CacheDirectory to /data PVC
Day 8 disk-cache warmer crashes on production with
'Read-only file system : /home/app/data' because the relative default
'data/voice-previews' resolves under runAsNonRoot HOME (read-only with
readOnlyRootFilesystem=true). Pin to /data/voice-previews so the cache
lands on the writable PVC mount alongside ttsreader.db, audio output,
and jobs root.

Image v202604272216 (already on nodes) is unaffected by this — only
the env routing changes. ArgoCD reconciles + rollout restart picks up
the new env without rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 22:24:04 -05:00
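
The path behaviour behind commit 7bc565d17e is ordinary relative-versus-absolute resolution. The Python sketch below shows the general rule, assuming the process working directory is /home/app as the error message suggests; it does not reproduce how TtsReader itself resolves the setting, and `resolve_cache_dir` is a hypothetical name.

```python
import posixpath


def resolve_cache_dir(working_dir, configured):
    """Join a configured path onto the working directory; absolute paths win."""
    return posixpath.normpath(posixpath.join(working_dir, configured))


# Relative default lands under the read-only home directory.
print(resolve_cache_dir("/home/app", "data/voice-previews"))
# /home/app/data/voice-previews

# Absolute override lands on the writable PVC mount.
print(resolve_cache_dir("/home/app", "/data/voice-previews"))
# /data/voice-previews
```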
Andrew Stoltz
dfe9c3b67e chore(ttsreader): bump fc-ttsreader-web to v202604272216 (brace-escape fix) 2026-04-27 22:16:19 -05:00
Andrew Stoltz
37f8db89e4 chore(ttsreader): bump fc-ttsreader-web to v202604272208 (Day 10 + VoiceProfiles hotfix)
v202604272157 crash-looped on the production PVC because Database.EnsureCreated()
is a no-op on existing DBs and the VoiceProfiles table was missing. TtsReader@a9f0b73
adds an idempotent CREATE TABLE IF NOT EXISTS to the infra reconciler before
TtsReaderDataSeeder runs. Bumping the manifest to pick up that fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 22:09:08 -05:00
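
The idempotent-create pattern named in commit 37f8db89e4 can be demonstrated with an in-memory SQLite database. This is a Python sketch, assuming the store is SQLite; the column names and the `Projects` table are placeholders, since the real schema is not in the commit message.

```python
import sqlite3


def ensure_voice_profiles(conn):
    """Create the table only if it is missing, so reruns are harmless."""
    # Placeholder columns; the actual VoiceProfiles schema is not shown in the commit.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS VoiceProfiles ("
        "Id INTEGER PRIMARY KEY, Name TEXT NOT NULL)"
    )


conn = sqlite3.connect(":memory:")
# Stands in for a pre-existing database that predates the new table.
conn.execute("CREATE TABLE Projects (Id INTEGER PRIMARY KEY)")

ensure_voice_profiles(conn)
ensure_voice_profiles(conn)  # second call is a no-op rather than an error

tables = sorted(
    row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'"
    )
)
print(tables)  # ['Projects', 'VoiceProfiles']
```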
Andrew Stoltz
00c7d8df24 chore(ttsreader): bump fc-ttsreader-web to v202604272157 (Sprint E Day 10 UX polish)
Compact project page (Setup chip strip + chapter inspect-toggle drawer)
+ render feedback (rolling ETA strip + active-chapter pulse) + Bible
Dashboard navigates to /projects/{id} on queue. Source TtsReader@79de78b.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 21:58:12 -05:00
Andrew Stoltz
c6811eadd8 intranet: bump image to v20260427-newpages-and-topology
Adds 7 new pages (5 service pages, AI topology, opencode operator guide)
to https://intranet.iamworkin.lan:
  /services/dns
  /services/distribution
  /services/llm-bridge
  /services/knowledge
  /services/provisioning
  /services/ai-topology
  /development/local-ai-agents

Plus topology corrections in /services/ai (AiStack.razor) and 6 new nav entries.

Source commit: FlowerCore.Intranet.Web@1598542 on
codex-wip-pre-readaloud-collision-2026-04-24.

Image built from artifacts/publish via Dockerfile.deploy on BLUEJAY-WS,
imported to all 3 RKE2 nodes (rke2-server + rke2-agent1 + rke2-agent2).

Build: 0 warnings, 0 errors, 197/197 tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:52:34 -05:00