Performance Optimization for Diagnostics Server¶
Note
- All configuration variables are optional and have default values.
- Uncomment and set values only when you need to override the defaults.
- Apply changes incrementally and monitor server behavior after each adjustment.
Prerequisites¶
Before tuning, ensure you have:
- A running Privacera Diagnostics Server deployment.
- Access to the `vars.privacera-diagnostics.yml` configuration file.
Choose a Database Backend¶
The server supports two backends. Choose before any other tuning — the active backend determines which sections below apply.
| Variable | Description | Default | When to Change |
|---|---|---|---|
| `DIAG_SERVER_DB_TYPE` | Database backend: `sqlite` or `mariadb` | `sqlite` | Use `mariadb` for multi-replica deployments or when the SQLite file sits on NFS/EFS |
SQLite is the default and requires zero external infrastructure. It is the right choice for single-server deployments on local or EBS-backed storage.
MariaDB is required when:
- You run more than one Diagnostics Server replica.
- The storage volume is NFS or EFS (SQLite WAL does not work correctly over NFS).
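To switch backends, set the variable in `vars.privacera-diagnostics.yml`. A hedged sketch (the value shown is illustrative):

```yaml
# Default is "sqlite"; switch to "mariadb" for multi-replica or NFS/EFS-backed deployments.
DIAG_SERVER_DB_TYPE: "mariadb"
```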
SQLite Performance Tuning¶
Skip this section if you are using MariaDB.
SQLite performance is dominated by three factors: lock-wait tolerance, WAL checkpoint frequency, and how many requests can touch the database at the same time.
Lock Timeout¶
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| `DIAG_SERVER_SQLITE_BUSY_TIMEOUT_MS` | Milliseconds SQLite retries a locked file before raising an error | `30000` (30 s) | Increase to `60000` on EFS/NFS; keep the default on EBS or local disk |
Critical for EFS/NFS
On EFS or NFS mounts, file-lock latency is much higher than on local disk. If you see `database is locked` errors in the server logs, raise this value before making any other change.
WAL Checkpoint Frequency¶
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| `DIAG_SERVER_SQLITE_WAL_AUTOCHECKPOINT` | Number of WAL pages that trigger an automatic checkpoint | `1000` | Lower to `500` for write-heavy workloads; raise to `2000` to reduce lock spikes on EFS |
Tip
Each auto-checkpoint briefly takes an exclusive lock. Raising this value reduces how often that lock is acquired at the cost of a slightly larger WAL file on disk.
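For example, a write-heavy deployment on EFS might combine a longer lock timeout with a higher checkpoint threshold (illustrative values):

```yaml
# Tolerate slower file locks on EFS/NFS and checkpoint less often.
DIAG_SERVER_SQLITE_BUSY_TIMEOUT_MS: "60000"
DIAG_SERVER_SQLITE_WAL_AUTOCHECKPOINT: "2000"
```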
Concurrent Session Limit¶
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| `DIAG_SERVER_SQLITE_MAX_CONCURRENT_SESSIONS` | Maximum concurrent in-flight DB sessions | Auto-calculated from CPU count | Set explicitly only when the auto value causes semaphore-wait warnings |
The auto-calculation is `min(32, cpu_count + 4) - 2`, which reserves 2 slots for readiness probes.
Warning
Never set this to 0 — it deadlocks all database requests. Leave blank to use the auto-calculated value.
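If the auto-calculated value proves too low, an explicit override might look like this (the value is illustrative):

```yaml
# Override only if semaphore-wait warnings appear; leave unset for auto-calculation.
DIAG_SERVER_SQLITE_MAX_CONCURRENT_SESSIONS: "16"
```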
Readiness Probe Sensitivity¶
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| `DIAG_SERVER_SQLITE_READINESS_WINDOW_SIZE` | Number of recent DB sessions tracked by the readiness probe | `3` | Raise to `5` to tolerate more transient failures before the pod is marked Not Ready |
The pod reports Not Ready only when all of the last N sessions failed. Raising this value makes the readiness probe more tolerant of transient SQLite lock errors.
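A sketch of a more tolerant readiness window (illustrative value):

```yaml
# Mark the pod Not Ready only after 5 consecutive failed DB sessions.
DIAG_SERVER_SQLITE_READINESS_WINDOW_SIZE: "5"
```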
In-Memory Heartbeat Queue¶
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| `DIAG_SERVER_SQLITE_HEARTBEAT_QUEUE_MAX_SIZE` | Maximum heartbeats held in memory when the DB is temporarily unavailable | `200` | Increase to `500` in large deployments (many client pods) to buffer heartbeats during a brief DB hiccup |
Heartbeats bypass the concurrency semaphore and queue in memory. The pod returns 503 only when this queue fills up (sustained DB outage), not on transient slowness.
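For a large deployment, the queue might be raised like so (illustrative value):

```yaml
# Buffer more heartbeats while the DB recovers from a brief hiccup.
DIAG_SERVER_SQLITE_HEARTBEAT_QUEUE_MAX_SIZE: "500"
```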
MariaDB Configuration¶
Skip this section if you are using SQLite.
Connection Details¶
| Variable | Description | Default |
|---|---|---|
| `DIAG_SERVER_DB_HOST` | MariaDB hostname or Kubernetes service name | `mariadb` |
| `DIAG_SERVER_DB_PORT` | MariaDB port | `3306` |
| `DIAG_SERVER_DB_USER` | Database username | — |
| `DIAG_SERVER_DB_PASSWORD` | Database password | — |
| `DIAG_SERVER_DB_NAME` | Database schema name | `diag_server` |
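A hedged example of the connection settings (the username and password placeholder are illustrative; substitute your own values):

```yaml
DIAG_SERVER_DB_TYPE: "mariadb"
DIAG_SERVER_DB_HOST: "mariadb"
DIAG_SERVER_DB_PORT: "3306"
DIAG_SERVER_DB_USER: "diag_user"       # illustrative username
DIAG_SERVER_DB_PASSWORD: "<password>"  # replace with a secret value
DIAG_SERVER_DB_NAME: "diag_server"
```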
Connection Pool¶
These settings control how many simultaneous database connections the server maintains. They apply only when `DIAG_SERVER_DB_TYPE` is `mariadb` — SQLite uses `NullPool` (no persistent connections).
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| `DIAG_SERVER_DB_POOL_CAPACITY` | Maximum persistent connections in the pool | `10` | Increase to 20–30 for high-concurrency deployments (many simultaneous users) |
| `DIAG_SERVER_DB_POOL_OVERFLOW` | Extra connections allowed beyond pool capacity | `30` | Set to 1.5× pool capacity as a starting point |
| `DIAG_SERVER_DB_POOL_TIMEOUT_SECS` | Seconds to wait for a free connection before failing | `30` | Reduce to `10` to fail fast under saturation |
MariaDB server-side limit
Ensure MariaDB's `max_connections` is at least `DIAG_SERVER_DB_POOL_CAPACITY + DIAG_SERVER_DB_POOL_OVERFLOW` plus headroom for admin connections. For pool = 20 and overflow = 30, set `max_connections` ≥ 60.
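A starting point for a high-concurrency deployment (illustrative values following the recommendations above):

```yaml
DIAG_SERVER_DB_POOL_CAPACITY: "20"
DIAG_SERVER_DB_POOL_OVERFLOW: "30"      # ~1.5x pool capacity
DIAG_SERVER_DB_POOL_TIMEOUT_SECS: "10"  # fail fast under saturation
```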
SocketIO Performance¶
SocketIO settings affect real-time communication between the server and browser clients. Tune these when clients experience dropped connections or high-latency updates.
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| `DIAG_SERVER_SOCKETIO_PING_TIMEOUT` | Seconds to wait for a pong before closing the connection | `60` | Increase to `120` for slow or high-latency networks |
| `DIAG_SERVER_SOCKETIO_PING_INTERVAL` | Seconds between keepalive pings | `25` | Decrease to `15` for faster detection of dropped browser tabs |
| `DIAG_SERVER_SOCKETIO_ASYNC_MODE` | Async execution mode (`asgi` or `threading`) | `asgi` | Keep `asgi` for best throughput; use `threading` only for legacy compatibility |
| `DIAG_SERVER_SOCKETIO_LOG_LEVEL` | Log verbosity for SocketIO events | `WARNING` | Set to `ERROR` in production to reduce log I/O overhead |
| `DIAG_SERVER_SOCKETIO_ENGINEIO_LOG_LEVEL` | Log verbosity for EngineIO transport events | `WARNING` | Set to `ERROR` in production to reduce log I/O overhead |
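A sketch for high-latency networks with reduced log overhead (illustrative values):

```yaml
DIAG_SERVER_SOCKETIO_PING_TIMEOUT: "120"
DIAG_SERVER_SOCKETIO_PING_INTERVAL: "15"
DIAG_SERVER_SOCKETIO_LOG_LEVEL: "ERROR"
DIAG_SERVER_SOCKETIO_ENGINEIO_LOG_LEVEL: "ERROR"
```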
Data Purge and Retention¶
Background purge tasks remove stale records and prevent unbounded database growth. Schedule purges during off-peak hours to avoid I/O contention with live traffic.
Server-Side Purge Schedule¶
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| `DIAG_SERVER_DATA_PURGE_ENABLED` | Enable the nightly purge background task | `true` | Keep `true` to prevent unbounded growth |
| `DIAG_SERVER_DATA_PURGE_RUN_TIME` | Time of day to run the purge (24-hour HH:MM) | `00:00` | Schedule during the lowest-traffic window |
| `DIAG_SERVER_DATA_PURGE_ERROR_LOGS_RETENTION_DAYS` | Days to retain error log records | `7` | Reduce for space-constrained environments; increase for compliance |
| `DIAG_SERVER_DATA_PURGE_POD_TEST_RESULT_RETENTION_DAYS` | Days to retain pod test result records | `30` | Reduce if storage is limited |
| `DIAG_SERVER_DATA_PURGE_RETRY_INTERVAL_SECONDS` | Seconds to wait before retrying a failed purge | `3600` | Lower to `600` for faster recovery after a purge failure |
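For example, a space-constrained environment might purge at 02:00 with shorter retention (illustrative values):

```yaml
DIAG_SERVER_DATA_PURGE_ENABLED: "true"
DIAG_SERVER_DATA_PURGE_RUN_TIME: "02:00"
DIAG_SERVER_DATA_PURGE_ERROR_LOGS_RETENTION_DAYS: "5"
DIAG_SERVER_DATA_PURGE_POD_TEST_RESULT_RETENTION_DAYS: "14"
DIAG_SERVER_DATA_PURGE_RETRY_INTERVAL_SECONDS: "600"
```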
Client-Side Test Result File Retention¶
These variables control how many historical test-result files the diagnostics client retains on disk before rotating.
| Variable | Description | Default |
|---|---|---|
| `DIAG_TEST_RESULTS_FILE_RETENTION_COUNT` | Global retention count for all services | `7` (7 days for all services) |
| `DIAG_TEST_RESULTS_RETENTION_<SERVICE>` | Per-service override (uppercase, hyphens → underscores) | `""` (uses the global value) |
Naming pattern: `DIAG_TEST_RESULTS_RETENTION_{CONTAINER_NAME_IN_UPPERCASE}`.
| Example Container | Variable |
|---|---|
| `auditserver` | `DIAG_TEST_RESULTS_RETENTION_AUDITSERVER` |
| `ranger` | `DIAG_TEST_RESULTS_RETENTION_RANGER` |
| `portal` | `DIAG_TEST_RESULTS_RETENTION_PORTAL` |
| `databricks-unity-catalog` | `DIAG_TEST_RESULTS_RETENTION_DATABRICKS_UNITY_CATALOG` |
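A sketch combining the global count with one per-service override (the values are illustrative):

```yaml
DIAG_TEST_RESULTS_FILE_RETENTION_COUNT: "7"
DIAG_TEST_RESULTS_RETENTION_RANGER: "14"  # keep more history for ranger only
```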
Diagnostics Test Scheduling¶
These variables control when and how often the client sidecar runs its diagnostic test suite against each service.
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| `DIAG_TESTS_INITIAL_WAIT_SECS` | Seconds to wait after pod startup before running the first test | `120` | Increase to `300` for services that take longer to become ready |
| `DIAG_TESTS_INTERVAL_SECS` | Seconds between successful test runs | `86400` (24 h) | Reduce to `43200` (12 h) to detect configuration drift earlier |
| `DIAG_TESTS_RETRY_ENABLED` | Enable automatic re-test after a failure | `false` | Set to `true` in production to catch transient failures |
| `DIAG_TESTS_RETRY_INTERVAL_SECS` | Seconds between retry attempts on failure | `43200` (12 h) | Reduce to `3600` (1 h) for faster failure confirmation |
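A production-leaning sketch (illustrative values following the recommendations above):

```yaml
DIAG_TESTS_INITIAL_WAIT_SECS: "300"
DIAG_TESTS_INTERVAL_SECS: "43200"
DIAG_TESTS_RETRY_ENABLED: "true"
DIAG_TESTS_RETRY_INTERVAL_SECS: "3600"
```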
Apply the Changes¶
Save and close the file, then regenerate and apply the Helm charts:
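The exact commands depend on your Privacera Manager installation; a typical sequence (the path and script name here assume a standard Privacera Manager layout and are shown as an illustration) is:

```shell
# Illustrative only -- adjust the path to your Privacera Manager install.
cd ~/privacera/privacera-manager

# Regenerate the Helm charts from the updated vars file and apply them.
./privacera-manager.sh update
```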
Quick Reference — Recommended Configurations¶
| Scenario | Key Variables to Set |
|---|---|
| Getting started | All defaults; enable DIAG_SERVER_METRICS_ENABLE: "true" to observe behavior |
| Multi-replica server deployment | Switch to DIAG_SERVER_DB_TYPE: "mariadb", configure DIAG_SERVER_DB_HOST, DIAG_SERVER_DB_POOL_CAPACITY, DIAG_SERVER_DB_POOL_OVERFLOW |
| SQLite on EFS/NFS (unavoidable) | DIAG_SERVER_SQLITE_BUSY_TIMEOUT_MS: "60000", DIAG_SERVER_SQLITE_WAL_AUTOCHECKPOINT: "2000", DIAG_SERVER_SQLITE_READINESS_WINDOW_SIZE: "5" — or migrate to MariaDB |
| Many client sidecars sending heartbeats | Raise DIAG_SERVER_SQLITE_HEARTBEAT_QUEUE_MAX_SIZE to buffer spikes; raise DIAG_SERVER_SQLITE_MAX_CONCURRENT_SESSIONS if semaphore-wait warnings appear |
| High log I/O overhead | DIAG_SERVER_SOCKETIO_LOG_LEVEL: "ERROR", DIAG_SERVER_SOCKETIO_ENGINEIO_LOG_LEVEL: "ERROR" |
| Browser clients frequently disconnecting | Increase DIAG_SERVER_SOCKETIO_PING_TIMEOUT; decrease DIAG_SERVER_SOCKETIO_PING_INTERVAL |
| Database growing too large | Reduce DIAG_SERVER_DATA_PURGE_ERROR_LOGS_RETENTION_DAYS and DIAG_SERVER_DATA_PURGE_POD_TEST_RESULT_RETENTION_DAYS; set DIAG_TEST_RESULTS_FILE_RETENTION_COUNT |
| Slow failure / offline detection | Decrease DIAG_SERVER_HEARTBEAT_ACTIVE_THRESHOLD_SECONDS and DIAG_SERVER_HEARTBEAT_DEGRADED_THRESHOLD_SECONDS |
| Services take long to become ready on startup | Increase DIAG_TESTS_INITIAL_WAIT_SECS so the first test run does not start before the service is up |
| Need faster re-test after failures | Set DIAG_TESTS_RETRY_ENABLED: "true" and reduce DIAG_TESTS_RETRY_INTERVAL_SECS |