Commit Graph

11 Commits

Author SHA1 Message Date
Andrew Stoltz
e50e103ba0 fix(zabbix): bump web probe timeouts 5s→15s + add failureThreshold
zabbix-web nginx+PHP-FPM container serves / at ~3-5s baseline with
occasional 6-7s spikes (probe path renders full dashboard via PHP).
kube-probe was killing the container after 3 consecutive 5s-timeout
499s, producing CrashLoopBackOff alert noise even though the app
was serving real traffic fine.

15s timeout absorbs the natural variance; explicit failureThreshold=3
documents the policy (was implicit default).

Closes the firing PodCrashLoopBackOff (zabbix-web) + pending
HTTPServiceSlow/HTTPServiceDegraded alerts. zabbix.iamworkin.lan
remains slow at the application layer (separate work — PHP-FPM
warm-up + Zabbix server "host not found" agent lookup spam need
their own fixes) but the pod restart loop stops.
2026-05-15 15:59:04 -05:00
Andrew Stoltz
223e9a9232 feat(zabbix): add RemoteDesktop monitoring template
New Zabbix 7.2 template under `Templates/FlowerCore` that scrapes
the `/metrics` exposition from FlowerCore.RemoteDesktop and extracts:

- `fc_desktop_session_events_total` split by event (launch/connect/
  disconnect/recording), with a dedicated datapoint for the
  `browser_datasource="json"` slice to track delegated-auth launches.
- `fc_desktop_pool_ready` gauge sum for warm pools.

Trigger: `nodata(flowercore.remotedesktop.metrics,10m)=1` warns when
the public desktop host stops exposing metrics.

Follows the existing `flowercore-print-ollama.yaml` pattern — import
manually into Zabbix and link to the Print/Desktop host. Not a K8s
manifest; ArgoCD ignores.

Grafana dashboard JSON is drafted at
`apps/monitoring/flowercore-remotedesktop-grafana-dashboard.json`
but still needs a ConfigMap wrap + Grafana Deployment volume mount
in noc-monitoring.yaml before it ships (follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 23:30:32 -05:00
Andrew Stoltz
1dc66738e6 zabbix: align postgres tracking label 2026-04-22 22:50:24 -05:00
Andrew Stoltz
5623a272c5 zabbix: include statefulset defaults 2026-04-22 22:39:31 -05:00
Andrew Stoltz
3d3f91160b monitoring: add Print.Web Ollama Zabbix template 2026-04-22 22:07:40 -05:00
Andrew Stoltz
fff998dab5 matrix, zabbix: add volumeMode to postgres PVC templates
Same ArgoCD + SSA self-heal loop pattern as guacamole (20e4130):
K8s defaults volumeMode=Filesystem on volumeClaimTemplates at
creation, git omits it, argocd-controller owns the atomic list so
every reconcile sees drift, and volumeClaimTemplates is immutable
so it can never reconcile. Adding the field closes both loops.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 14:48:43 -05:00
Andrew M. Stoltz
f3fde15002 Update telephony-web image to v20260324d, resolve merge conflicts 2026-03-24 15:55:52 -05:00
Claude Code
efc3dc5b4e Increase Zabbix web probe timeouts to 5s (prevents 502 during heavy dashboard queries) 2026-03-12 20:40:09 -05:00
Claude Code
518340b373 Tune Zabbix stack: PostgreSQL, web PHP-FPM, server caches
PostgreSQL 16:
- shared_buffers 128MB→256MB, work_mem 4MB→16MB
- random_page_cost 4→1.1 (SSD/Longhorn), effective_io_concurrency→200
- maintenance_work_mem→128MB, wal_buffers→8MB
- max_connections 100→50, memory limit 512Mi→1Gi

Zabbix Web:
- PHP_FPM_PM_MAX_CHILDREN 50→10 (fixes 68x OOMKill)
- ZBX_MEMORYLIMIT 128M→256M, PM_MAX_REQUESTS→500
- Memory limit 512Mi→768Mi, request 128Mi→256Mi

Zabbix Server:
- ZBX_CACHESIZE→64M, ZBX_VALUECACHESIZE→64M
- ZBX_HISTORYCACHESIZE→32M, ZBX_TRENDCACHESIZE→8M
- ZBX_STARTPOLLERS→10, ZBX_STARTPOLLERSUNREACHABLE→3
2026-03-12 19:21:15 -05:00
Andrew Stoltz
3199c509c0 Wire Zabbix/Matrix credentials to 1Password-synced secrets, add OnePasswordItem CRDs
- Zabbix: Remove hardcoded zabbix-db-secret and zabbix-admin-secret, reference
  zabbix-credentials (1Password) for DB-User, DB-Password, and admin password
- Matrix: Remove hardcoded matrix-db-secret, reference matrix-credentials for
  Postgres user/password. Convert ConfigMap homeserver.yaml to template with
  __DB_PASSWORD__/__DB_USER__ placeholders, inject via busybox init container
- Guacamole: Add OnePasswordItem CRD for future use. MySQL DB creds remain in
  guac-db-secret (1Password item lacks DB-specific fields — gap documented)
- All three services now include OnePasswordItem CRD manifests for ArgoCD mgmt
2026-03-09 18:28:38 -05:00
ef442e29eb Add infrastructure manifests for 9 services
Zabbix, IRC, Mail, Guacamole, Matrix, TeamSpeak, Intranet, PKI Web, FC Landing.
All with cert-manager TLS, Traefik IngressRoutes, Longhorn PVCs.
2026-03-09 16:35:04 -05:00