fix(guacamole): add --- separator between macmini-vnc-creds OnePasswordItem and guacamole-branding ConfigMap

A missing document separator caused YAML to merge the OnePasswordItem's top-level `spec: itemPath:` block into the ConfigMap that follows. Result: a ConfigMap with a `.spec` field that the K8s ConfigMap schema does not declare, which has made ArgoCD's structured-merge diff fail since 2026-05-11T15:30:54Z:

    Failed to compare desired state to live state: failed to calculate diff: error calculating structured merge diff: error building typed value from config resource: .spec: field not declared in schema

The app stayed Healthy (live K8s tolerated the extra field; the ConfigMap ignored it), but ArgoCD's diff calculation was broken, leaving the app stuck at sync=Unknown for all 21 resources.

Adding the missing `---` separator makes the OnePasswordItem and the ConfigMap proper sibling YAML documents, each with its own kind-correct schema.

Diagnosed during the 2026-05-12 morning routine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
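
As an illustration of the failure mode described above, here is a minimal lint sketch in Python (standard library only). It is not part of this commit: the function name `find_missing_separators` and the sample manifest are hypothetical, and the `itemPath` value is a placeholder. It assumes manifests keep `kind:` at column 0, and it flags any second top-level `kind:` seen before the next `---` line, which is the shape the merged OnePasswordItem/ConfigMap took.

```python
import re

# Hypothetical helper, not from the repository: flags a second top-level
# `kind:` key that appears before the next `---` document separator.
SEPARATOR = re.compile(r"---(\s|$)")
TOP_LEVEL_KIND = re.compile(r"kind:(\s|$)")


def find_missing_separators(text: str) -> list[int]:
    """Return 1-based line numbers of suspicious repeated `kind:` keys."""
    suspects = []
    seen_kind = False
    for lineno, line in enumerate(text.splitlines(), start=1):
        if SEPARATOR.match(line):
            seen_kind = False          # a new YAML document starts here
        elif TOP_LEVEL_KIND.match(line):
            if seen_kind:              # second `kind:` in the same document
                suspects.append(lineno)
            seen_kind = True
    return suspects


# Illustrative manifest; the itemPath value is a placeholder.
BROKEN = """\
apiVersion: onepassword.com/v1
kind: OnePasswordItem
metadata:
  name: macmini-vnc-creds
spec:
  itemPath: vaults/example/items/example
apiVersion: v1
kind: ConfigMap
metadata:
  name: guacamole-branding
"""

# Same manifest with the separator restored before the ConfigMap.
FIXED = BROKEN.replace("apiVersion: v1\n", "---\napiVersion: v1\n")

print(find_missing_separators(BROKEN))
print(find_missing_separators(FIXED))
```

On the sample input the first call reports line 8 (the ConfigMap's `kind:`), and the second returns an empty list. A check like this could run before commit, but it is only a heuristic: it does not parse YAML and would miss a merge between documents that happen to lack a `kind:` key.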
5 changed files with 227 additions and 8 deletions
--- a/apps/fc-redis/fc-redis.yaml
+++ b/apps/fc-redis/fc-redis.yaml
@@ -0,0 +1,171 @@
 # fc-redis — SignalR backplane for cross-product event bus
 #
 # Lands per Q-SO-1 resolution (2026-05-11 PM): SignalR backplane in Phase A,
 # not Phase C as originally drafted. Operator directive: "Redis can be
 # deployed just fine as it's another FlowerCore technology we'll want to
 # manage."
 #
 # Phase A scope (this file):
 #   - Single Redis 7.x Alpine pod
 #   - 1Gi Longhorn RWO PVC for AOF persistence
 #   - ClusterIP Service at `redis.fc-redis.svc.cluster.local:6379`
 #   - No AUTH (in-cluster only; not exposed externally)
 #   - No IngressRoute (backplane is server-to-server only)
 #
 # Consumers (Phase A IMPL across FC services):
 #   - FlowerCore.Signage.Web (OpsConsoleHub)
 #   - FlowerCore.Scoreboard.Web (ScoreboardHub)
 #   - FlowerCore.SignalControl.Web
 #   - FlowerCore.DMS.Web
 #   - Any other product joining the cross-product event bus
 #
 # Each consumer adds:
 #   services.AddSignalR()
 #           .AddStackExchangeRedis(
 #               "redis.fc-redis.svc.cluster.local:6379",
 #               opts => opts.Configuration.ChannelPrefix =
 #                   StackExchange.Redis.RedisChannel.Literal("fc-opsconsole"));
 #
 # Phase B / C follow-ons (out of scope here):
 #   - Redis Sentinel for HA (3-node)
 #   - AUTH password from 1Password Connect (rotate via /rotate-password)
 #   - redis_exporter sidecar for Prometheus scrape
 #   - Network policies restricting which namespaces can dial 6379
 #
 # Design: docs/signage/operations-console-phase-2-design.md §3.5
 # Decision: Q-SO-1 (RESOLVED 2026-05-11 PM)
 # Memory: feedback_blooming_ui_pattern_no_iframes
 ---
 apiVersion: v1
 kind: Namespace
 metadata:
  name: fc-redis
  labels:
    app.kubernetes.io/part-of: flowercore
    app.kubernetes.io/managed-by: argocd
 ---
 apiVersion: v1
 kind: PersistentVolumeClaim
 metadata:
  name: fc-redis-data
  namespace: fc-redis
 spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
 ---
 apiVersion: v1
 kind: ConfigMap
 metadata:
  name: fc-redis-config
  namespace: fc-redis
 data:
  redis.conf: |
    # Phase A — minimal config; no AUTH, no replication.
    bind 0.0.0.0
    protected-mode no
    port 6379
    tcp-backlog 511
    timeout 0
    tcp-keepalive 300
    # Persistence: AOF (fsync every second is the standard SignalR-backplane
    # durability sweet spot — the backplane only needs to survive Redis
    # restarts, not absolute zero loss).
    appendonly yes
    appendfsync everysec
    auto-aof-rewrite-percentage 100
    auto-aof-rewrite-min-size 64mb
    # Reasonable defaults — let Redis pick most things.
    maxmemory-policy allkeys-lru
    maxmemory 256mb
    # Logging
    loglevel notice
 ---
 apiVersion: apps/v1
 kind: Deployment
 metadata:
  name: fc-redis
  namespace: fc-redis
  labels:
    app: fc-redis
 spec:
  replicas: 1
  strategy:
    type: Recreate           # RWO PVC; do not do rolling update
  selector:
    matchLabels:
      app: fc-redis
  template:
    metadata:
      labels:
        app: fc-redis
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 999       # redis:7-alpine default uid
        runAsGroup: 999
        fsGroup: 999
      containers:
        - name: redis
          image: redis:7-alpine
          imagePullPolicy: IfNotPresent
          command: ["redis-server", "/etc/redis/redis.conf"]
          ports:
            - name: redis
              containerPort: 6379
          resources:
            requests:
              cpu: "50m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "384Mi"
          volumeMounts:
            - name: data
              mountPath: /data
            - name: config
              mountPath: /etc/redis
              readOnly: true
          livenessProbe:
            tcpSocket:
              port: 6379
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            exec:
              command: ["redis-cli", "ping"]
            initialDelaySeconds: 2
            periodSeconds: 5
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: [ALL]
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: fc-redis-data
        - name: config
          configMap:
            name: fc-redis-config
 ---
 apiVersion: v1
 kind: Service
 metadata:
  name: redis
  namespace: fc-redis
 spec:
  type: ClusterIP
  selector:
    app: fc-redis
  ports:
    - name: redis
      port: 6379
      targetPort: 6379
      protocol: TCP
--- a/apps/guacamole/guacamole.yaml
+++ b/apps/guacamole/guacamole.yaml
@@ -466,11 +466,11 @@ spec:
  itemPath: vaults/IAmWorkin/items/Guacamole JSON Auth
 ---
 ---
-# 1Password-backed credentials for Mac mini VNC access (Phase 1 — 2026-04-28)
+# 1Password-backed credentials for Mac mini VNC access (Phase 1 <EFBFBD> 2026-04-28)
 # The operator mints Secret 'macmini-vnc-creds' with keys: username, password, VNC Password
 # Note: '1Password' field label 'VNC Password' -> K8s Secret key 'VNC Password' (space retained)
 # Guacamole VNC connection password is sourced from the 'VNC Password' field.
-# Actual IP is 10.0.56.115 (INFRA VLAN) — the 1P item 'IP' field is kept as backup reference.
+# Actual IP is 10.0.56.115 (INFRA VLAN) <EFBFBD> the 1P item 'IP' field is kept as backup reference.
 apiVersion: onepassword.com/v1
 kind: OnePasswordItem
 metadata:
@@ -481,6 +481,7 @@ metadata:
    app.kubernetes.io/part-of: flowercore
 spec:
  itemPath: vaults/IAmWorkin/items/Mac Mini
+---
 # Blue Jay Branding Extension (CSS + translations)
 apiVersion: v1
 kind: ConfigMap
--- a/apps/monitoring/noc-monitoring.yaml
+++ b/apps/monitoring/noc-monitoring.yaml
@@ -974,6 +974,39 @@ data:
              summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replica mismatch"
              description: "Spec wants {{ $labels.spec_replicas }} but only {{ $value }} available. Likely a rollout stuck on probe failure, scheduling, or PVC."
          # Q-MR-3 (2026-05-11): multus memory pressure — catches the next OOM
          # cascade BEFORE multus is OOM-killed cluster-wide. The 2026-05-10
          # outage (21h) hit because no alert fired on the rising multus working
          # set — only downstream blackbox / Traefik / service alerts. With
          # 1Gi limit (bluejay-infra@eb8693e), 80% = ~800MiB; steady-state
          # runs ~150-250MiB so this only fires when an avalanche starts.
          - alert: MultusMemoryPressure
            expr: |
              container_memory_working_set_bytes{container="kube-multus"}
                / container_spec_memory_limit_bytes{container="kube-multus"} > 0.8
            for: 5m
            labels:
              severity: critical
              alert_channel: thermal_print
            annotations:
              summary: "kube-multus memory >80% of limit on {{ $labels.node }} for 5m"
              description: "kube-multus working set is {{ $value | humanizePercentage }} of its memory limit on node {{ $labels.node }}. If this keeps climbing, multus will OOM and all new pod networking will halt cluster-wide (precedent: 2026-05-10 outage)."
          # Q-MR-3 (2026-05-11): namespace pending-pod backlog — catches the
          # operator-leak avalanche pattern BEFORE it cascades into a multus
          # CNI OOM. Any FC operator (RemoteDesktop / Distribution / WorldBuilder)
          # emitting pods without ownerReferences will accumulate them when
          # the operator crashes. >25 pending pods in any namespace for 30m
          # is the signal to investigate the reconciler.
          - alert: NamespacePendingPodBacklog
            expr: sum by (namespace) (kube_pod_status_phase{phase="Pending"}) > 25
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: "Namespace {{ $labels.namespace }} has {{ $value }} Pending pods for 30m"
              description: "Pending pod count in {{ $labels.namespace }} exceeds 25 sustained for 30m. Likely operator-leak avalanche pattern — children emitted without ownerReferences. Risk of multus CNI OOM cascade."
      # Longhorn storage health alerts. Required: longhorn scrape job
      # (added 2026-04-26 — see scrape_configs above). The K8s events
      # for "snapshot becomes not ready to use" are transient lifecycle
--- a/apps/multus/multus.yaml
+++ b/apps/multus/multus.yaml
@@ -188,13 +188,24 @@ spec:
        - name: kube-multus
          image: ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick
          command: [ "/usr/src/multus-cni/bin/multus-daemon" ]
+          # 2026-05-11: upstream default of 50Mi memory limit OOM-cascades when
+          # an operator-owned namespace accumulates >100 pending pods retrying
+          # CNI ADD. RemoteDesktop emitted 219 orphan rd-browser-only pods
+          # (missing OwnerReferences), kubelet's CNI ADD avalanche pushed multus
+          # over 50Mi, OOMKilled, restarted with even bigger backlog → loop.
+          # 21h cluster outage. See FlowerCore.Notes:
+          #   feedback_multus_50mi_limit_oom_orphan_pod_avalanche.md
+          # 1Gi limit / 512Mi request comfortably handles a 200+ pod CNI
+          # catchup burst on 64GB nodes (nodes are <25% used in steady-state).
+          # Drop back toward 256Mi only after MultusMemoryPressure alert
+          # proves steady-state working set sits well below 200Mi.
          resources:
            requests:
              cpu: "100m"
-              memory: "50Mi"
+              memory: "512Mi"
            limits:
              cpu: "100m"
-              memory: "50Mi"
+              memory: "1Gi"
          securityContext:
            privileged: true
          terminationMessagePolicy: FallbackToLogsOnError
--- a/apps/telephony/telephony.yaml
+++ b/apps/telephony/telephony.yaml
@@ -127,10 +127,13 @@ spec:
      initContainers:
        - name: fix-data-perms
          image: busybox:latest
-          # Also chown /shared-tts (hostPath /tmp/tts-audio) so the non-root
-          # app user (uid 1654) can write Piper .sln16 files that Asterisk
-          # reads at /var/lib/asterisk/sounds/tts. World-readable (755) is
-          # fine — Asterisk runs as a different uid in the other pod.
+          # Must run as root to chown the hostPath /tmp/tts-audio that may be
+          # root-owned after node reboot. Pod-level runAsNonRoot:true would
+          # otherwise inherit and chown would fail with EPERM (see Notes memory
+          # feedback_hostpath_initcontainer_chown_perms).
+          securityContext:
+            runAsUser: 0
+            runAsNonRoot: false
          command: ["sh", "-c", "chown -R 1654:1654 /data && chown 1654:1654 /shared-tts && chmod 0755 /shared-tts"]
          volumeMounts:
            - name: telephony-data
Author	SHA1	Message	Date
Codex	f298339152	fix(guacamole): add --- separator between macmini-vnc-creds OnePasswordItem and guacamole-branding ConfigMap	2026-05-12 09:26:03 -05:00
Codex	6e7d88db49	feat(fc-redis): add SignalR backplane for cross-product event bus (Q-SO-1 Phase A)	2026-05-11 19:02:58 -05:00

    Per Q-SO-1 operator resolution 2026-05-11 PM, Redis SignalR backplane lands in Phase A (was Phase C deferral). Treats Redis as a managed FC infrastructure component, not a deferred scaling escalation.

    Lands the minimal Phase A surface:
    - Namespace fc-redis
    - Single Redis 7-alpine pod with 1Gi Longhorn RWO PVC
    - ConfigMap with AOF persistence (everysec), 256Mi maxmemory, allkeys-lru
    - ClusterIP Service `redis.fc-redis.svc.cluster.local:6379` (in-cluster only)
    - No AUTH Phase A (Phase B add via 1Password Connect rotation)
    - No IngressRoute (backplane is server-to-server)

    Consumers (Phase A IMPL across FC services) add:

        services.AddSignalR().AddStackExchangeRedis(
            "redis.fc-redis.svc.cluster.local:6379",
            opts => opts.Configuration.ChannelPrefix =
                StackExchange.Redis.RedisChannel.Literal("fc-opsconsole"));

    Phase B/C follow-ons (not in this commit): Sentinel for HA, AUTH password from 1Password, redis_exporter sidecar for Prometheus, network policies.

    See FlowerCore.Notes/docs/signage/operations-console-phase-2-design.md section 3.5 (rewritten) and decisions-waiting.html Q-SO-1 (RESOLVED).

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex	5ae50bd491	fix(telephony): init container runs as root to chown hostPath /tmp/tts-audio	2026-05-11 18:37:15 -05:00

    The fix-data-perms init container chowns /data (PVC) and /shared-tts (hostPath /tmp/tts-audio on rke2-agent1) to uid 1654 so the non-root telephony-web app can write Piper TTS .sln16 files. Without an explicit container-level securityContext override, the init container inherits pod-level runAsNonRoot:true / runAsUser:1654 and fails with 'chown: /shared-tts: Operation not permitted' the first time the hostPath comes up root-owned after a node reboot.

    Outage 2026-05-11 23:00 UTC: telephony-web in Init:CrashLoopBackOff for 9 hours (100+ restarts) until init container was bumped to runAsUser:0. Live cluster patched in the same operation; this commit makes the fix durable in git so ArgoCD sync preserves it.

    See Notes memory: feedback_hostpath_initcontainer_chown_perms

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex	653d4472f5	fix(monitoring): mirror Q-MR-3 MultusMemoryPressure + NamespacePendingPodBacklog alerts	2026-05-11 10:42:27 -05:00

    Two new preventive alert rules added to the kubernetes-state group of the K8s migration target ConfigMap. The live Podman Prometheus on noc1 has already been updated via FlowerCore.Notes/scripts/monitoring/alerts.yml + sudo cp + podman pod restart monitoring (this commit only locks it in the bluejay-infra K8s mirror so a future migration carries it forward).

    MultusMemoryPressure (critical, thermal_print): fires when kube-multus working set exceeds 80% of its memory limit for 5m. Catches the next multus OOM cascade BEFORE it kills the daemon cluster-wide. The 2026-05-10 21h outage hit because no alert fired on the rising multus working set; only downstream blackbox / Traefik / service alerts triggered, after the fact.

    NamespacePendingPodBacklog (warning): fires when any single namespace has >25 Pending pods sustained for 30m. Catches the operator-leak avalanche pattern (orphan pods from a crashed reconciler emitting children without ownerReferences) before it cascades into a CNI OOM.

    See FlowerCore.Notes:
    - feedback_multus_50mi_limit_oom_orphan_pod_avalanche
    - feedback_monitoring_k8s_target_vs_live_podman (workflow)

    Companion commits:
    - bluejay-infra@eb8693e (multus memory limit)
    - FlowerCore.RemoteDesktop@b02c59b (OwnerReferences fix)

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex	eb8693e1ce	fix(multus): bump kube-multus-ds memory 50Mi/50Mi -> 1Gi/512Mi (prevent OOM cascade)	2026-05-11 10:30:05 -05:00

    Cluster outage 2026-05-10T17:43 through 2026-05-11 ~10:30 (~21h).

    Root cause: FlowerCore.RemoteDesktop emitted 219 orphan rd-browser-only-* pods in fc-desktop (missing OwnerReferences; see companion fix in FlowerCore.RemoteDesktop). Kubelet's continuous CNI ADD retries for those pending pods drove a request queue that exceeded the upstream default 50Mi limit on kube-multus-ds. Multus OOMKilled (exit 137), restarted with an even bigger backlog, OOMKilled again, positive feedback loop. Restart counts climbed to 276 / 412 / 261 across the 3 RKE2 nodes.

    Downstream blast radius: both Traefik pods stuck ContainerCreating (101m + 4h35m), all Longhorn CSI attacher/provisioner/instance-manager stuck, every Prometheus blackbox probe for *.iamworkin.lan failing, UpdateCenterPublicEdgeDown critical on update.flowercore.io, every ArgoCD app showing sync=Unknown because repo-server lost git connectivity. 45 firing Prometheus alerts.

    Recovery sequence (Q-MR-1 from FlowerCore.Notes morning routine):
    1. kubectl patch kube-multus-ds memory live (this commit locks it in git so ArgoCD doesn't revert on next sync)
    2. Force-delete the 219 orphan pods (kubectl --grace-period=0 --force) to break the avalanche
    3. Rollout restart kube-multus-ds: STABLE after restart with new limit
    4. Restart Traefik + Longhorn CSI to clear stuck ContainerCreating
    5. Verify update.flowercore.io returns 200 + ArgoCD apps reconcile

    Tested incrementally: 256Mi limit was insufficient (still OOMed on catchup burst), 512Mi was insufficient on rke2-agent1 (most pods concentrated there), 1Gi/512Mi handled the full 200+ pending pod CNI catchup cleanly with 0 multus restarts after rollout. Nodes are 64GB with <25% used in steady-state, so the ~256Mi typical working-set is well within the new limit.

    Companion change: FlowerCore.RemoteDesktop must set OwnerReferences on every worker pod so future operator crashes don't leak orphans (Q-MR-2).

    Preventive alerts (Q-MR-3) MultusMemoryPressure + NamespacePendingPodBacklog are coming in a follow-up commit to apps/monitoring/.

    Memory: feedback_multus_50mi_limit_oom_orphan_pod_avalanche
    Decisions card: docs/dashboards/decisions-waiting.html Q-MR-1..3

    Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
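
The memory sizing and the `MultusMemoryPressure` threshold can be sanity-checked with a little arithmetic. The sketch below is illustrative only and not part of any commit: the helper names are hypothetical, it mirrors just the `> 0.8` comparison from the alert expression (the `for: 5m` hold is not modelled), and the sample working-set values are picked from the ranges quoted in the comments and commit messages.

```python
MIB = 1024 ** 2  # bytes per MiB


def pressure_ratio(working_set_bytes: float, limit_bytes: float) -> float:
    """Working set as a fraction of the container memory limit."""
    return working_set_bytes / limit_bytes


def fires(working_set_bytes: float, limit_bytes: float,
          threshold: float = 0.8) -> bool:
    """Mirror of the rule's `> 0.8` comparison only (hypothetical helper)."""
    return pressure_ratio(working_set_bytes, limit_bytes) > threshold


NEW_LIMIT = 1024 * MIB   # 1Gi limit set by the multus change
OLD_LIMIT = 50 * MIB     # previous upstream default limit

# Working set (in MiB) at which the 80% threshold is crossed.
new_threshold_mib = 0.8 * NEW_LIMIT / MIB
old_threshold_mib = 0.8 * OLD_LIMIT / MIB

print(round(new_threshold_mib, 1), round(old_threshold_mib, 1))
print(fires(250 * MIB, NEW_LIMIT), fires(900 * MIB, NEW_LIMIT))
```

With a 1Gi limit the threshold works out to 819.2 MiB (the "~800MiB" in the rule comment), so a 250 MiB steady-state working set does not trip the comparison while a 900 MiB one does. Under the old 50Mi limit the same 80% line would have sat at 40 MiB.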