Performance Optimization for Diagnostics Server¶
Note
- All configuration variables are optional and have default values.
- Uncomment and set values only when you need to override the defaults.
- Apply changes incrementally and monitor server behavior after each adjustment.
Prerequisites¶
Before tuning, ensure you have:
- A running Privacera Diagnostics Server deployment.
- Access to the `vars.privacera-diagnostics.yml` configuration file.
Choose a Database Backend¶
The server supports two backends. Choose before any other tuning — the active backend determines which sections below apply.
| Variable | Description | Default | When to Change |
|---|---|---|---|
| `DIAG_SERVER_DB_TYPE` | Database backend: `sqlite` or `mariadb` | `sqlite` | Use `mariadb` for multi-replica deployments or when the SQLite file sits on NFS/EFS |
SQLite is the default and requires zero external infrastructure. It is the right choice for single-server deployments on local or EBS-backed storage.
MariaDB is required when:
- You run more than one Diagnostics Server replica.
- The storage volume is NFS or EFS (SQLite WAL does not work correctly over NFS).
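To switch backends, set the variable in `vars.privacera-diagnostics.yml`. A hedged sketch (the value shown is illustrative):

```yaml
# Default is "sqlite"; switch to "mariadb" for multi-replica or NFS/EFS-backed deployments.
DIAG_SERVER_DB_TYPE: "mariadb"
```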
SQLite Performance Tuning¶
Skip this section if you are using MariaDB.
SQLite performance is dominated by three factors: lock-wait tolerance, WAL checkpoint frequency, and how many requests can touch the database at the same time.
Lock Timeout¶
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| `DIAG_SERVER_SQLITE_BUSY_TIMEOUT_MS` | Milliseconds SQLite retries a locked file before raising an error | `30000` (30 s) | Increase to `60000` on EFS/NFS; keep the default on EBS or local disk |
Critical for EFS/NFS
On EFS or NFS mounts, file-lock latency is much higher than on local disk. If you see `database is locked` errors in the server logs, raise this value before making any other change.
WAL Checkpoint Frequency¶
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| `DIAG_SERVER_SQLITE_WAL_AUTOCHECKPOINT` | Number of WAL pages that trigger an automatic checkpoint | `1000` | Lower to `500` for write-heavy workloads; raise to `2000` to reduce lock spikes on EFS |
Tip
Each auto-checkpoint briefly takes an exclusive lock. Raising this value reduces how often that lock is acquired at the cost of a slightly larger WAL file on disk.
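For example, a write-heavy deployment on EFS might combine a longer lock timeout with a higher checkpoint threshold (illustrative values):

```yaml
# Tolerate slower file locks on EFS/NFS and checkpoint less often.
DIAG_SERVER_SQLITE_BUSY_TIMEOUT_MS: "60000"
DIAG_SERVER_SQLITE_WAL_AUTOCHECKPOINT: "2000"
```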
Concurrent Session Limit¶
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| `DIAG_SERVER_SQLITE_MAX_CONCURRENT_SESSIONS` | Maximum concurrent in-flight DB sessions | Auto-calculated from CPU count | Set explicitly only when the auto value causes semaphore-wait warnings |
The auto-calculation is `min(32, cpu_count + 4) - 2`, which reserves 2 slots for readiness probes.
Warning
Never set this to 0 — it deadlocks all database requests. Leave blank to use the auto-calculated value.
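If the auto-calculated value proves too low, an explicit override might look like this (the value is illustrative):

```yaml
# Override only if semaphore-wait warnings appear; leave unset for auto-calculation.
DIAG_SERVER_SQLITE_MAX_CONCURRENT_SESSIONS: "16"
```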
Readiness Probe Sensitivity¶
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| `DIAG_SERVER_SQLITE_READINESS_WINDOW_SIZE` | Number of recent DB sessions tracked by the readiness probe | `3` | Raise to `5` to tolerate more transient failures before the pod is marked Not Ready |
The pod reports Not Ready only when all of the last N sessions failed. Raising this value makes the readiness probe more tolerant of transient SQLite lock errors.
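A sketch of a more tolerant readiness window (illustrative value):

```yaml
# Mark the pod Not Ready only after 5 consecutive failed DB sessions.
DIAG_SERVER_SQLITE_READINESS_WINDOW_SIZE: "5"
```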
In-Memory Heartbeat Queue¶
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| `DIAG_SERVER_SQLITE_HEARTBEAT_QUEUE_MAX_SIZE` | Maximum heartbeats held in memory when the DB is temporarily unavailable | `200` | Increase to `500` in large deployments (many client pods) to buffer heartbeats during a brief DB hiccup |
Heartbeats bypass the concurrency semaphore and queue in memory. The pod returns 503 only when this queue fills up (sustained DB outage), not on transient slowness.
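For a large deployment, the queue might be raised like so (illustrative value):

```yaml
# Buffer more heartbeats while the DB recovers from a brief hiccup.
DIAG_SERVER_SQLITE_HEARTBEAT_QUEUE_MAX_SIZE: "500"
```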
MariaDB Configuration¶
Skip this section if you are using SQLite.
Connection Details¶
| Variable | Description | Default |
|---|---|---|
| `DIAG_SERVER_DB_HOST` | MariaDB hostname or Kubernetes service name | `mariadb` |
| `DIAG_SERVER_DB_PORT` | MariaDB port | `3306` |
| `DIAG_SERVER_DB_USER` | Database username | — |
| `DIAG_SERVER_DB_PASSWORD` | Database password | — |
| `DIAG_SERVER_DB_NAME` | Database schema name | `diag_server` |
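A hedged example of the connection settings (the username and password placeholder are illustrative; substitute your own values):

```yaml
DIAG_SERVER_DB_TYPE: "mariadb"
DIAG_SERVER_DB_HOST: "mariadb"
DIAG_SERVER_DB_PORT: "3306"
DIAG_SERVER_DB_USER: "diag_user"       # illustrative username
DIAG_SERVER_DB_PASSWORD: "<password>"  # replace with a secret value
DIAG_SERVER_DB_NAME: "diag_server"
```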
Connection Pool¶
These settings control how many simultaneous database connections the server maintains. They apply only when `DIAG_SERVER_DB_TYPE` is `mariadb` — SQLite uses `NullPool` (no persistent connections).
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| `DIAG_SERVER_DB_POOL_CAPACITY` | Maximum persistent connections in the pool | `10` | Increase to 20–30 for high-concurrency deployments (many simultaneous users) |
| `DIAG_SERVER_DB_POOL_OVERFLOW` | Extra connections allowed beyond pool capacity | `30` | Set to 1.5× pool capacity as a starting point |
| `DIAG_SERVER_DB_POOL_TIMEOUT_SECS` | Seconds to wait for a free connection before failing | `30` | Reduce to `10` to fail fast under saturation |
MariaDB server-side limit
Ensure MariaDB's `max_connections` is at least `DIAG_SERVER_DB_POOL_CAPACITY + DIAG_SERVER_DB_POOL_OVERFLOW` plus headroom for admin connections. For pool = 20 and overflow = 30, set `max_connections` ≥ 60.
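A starting point for a high-concurrency deployment (illustrative values following the recommendations above):

```yaml
DIAG_SERVER_DB_POOL_CAPACITY: "20"
DIAG_SERVER_DB_POOL_OVERFLOW: "30"      # ~1.5x pool capacity
DIAG_SERVER_DB_POOL_TIMEOUT_SECS: "10"  # fail fast under saturation
```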
SocketIO Performance¶
SocketIO settings affect real-time communication between the server and browser clients. Tune these when clients experience dropped connections or high-latency updates.
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| `DIAG_SERVER_SOCKETIO_PING_TIMEOUT` | Seconds to wait for a pong before closing the connection | `60` | Increase to `120` for slow or high-latency networks |
| `DIAG_SERVER_SOCKETIO_PING_INTERVAL` | Seconds between keepalive pings | `25` | Decrease to `15` for faster detection of dropped browser tabs |
| `DIAG_SERVER_SOCKETIO_ASYNC_MODE` | Async execution mode (`asgi` or `threading`) | `asgi` | Keep `asgi` for best throughput; use `threading` only for legacy compatibility |
| `DIAG_SERVER_SOCKETIO_LOG_LEVEL` | Log verbosity for SocketIO events | `WARNING` | Set to `ERROR` in production to reduce log I/O overhead |
| `DIAG_SERVER_SOCKETIO_ENGINEIO_LOG_LEVEL` | Log verbosity for EngineIO transport events | `WARNING` | Set to `ERROR` in production to reduce log I/O overhead |
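A sketch for high-latency networks with reduced log overhead (illustrative values):

```yaml
DIAG_SERVER_SOCKETIO_PING_TIMEOUT: "120"
DIAG_SERVER_SOCKETIO_PING_INTERVAL: "15"
DIAG_SERVER_SOCKETIO_LOG_LEVEL: "ERROR"
DIAG_SERVER_SOCKETIO_ENGINEIO_LOG_LEVEL: "ERROR"
```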
Data Purge and Retention¶
Background purge tasks remove stale records and prevent unbounded database growth. Schedule purges during off-peak hours to avoid I/O contention with live traffic.
Server-Side Purge Schedule¶
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| `DIAG_SERVER_DATA_PURGE_ENABLED` | Enable the nightly purge background task | `true` | Keep `true` to prevent unbounded growth |
| `DIAG_SERVER_DATA_PURGE_RUN_TIME` | Time of day to run the purge (24-hour HH:MM) | `00:00` | Schedule during the lowest-traffic window |
| `DIAG_SERVER_DATA_PURGE_ERROR_LOGS_RETENTION_DAYS` | Days to retain error log records | `7` | Reduce for space-constrained environments; increase for compliance |
| `DIAG_SERVER_DATA_PURGE_POD_TEST_RESULT_RETENTION_DAYS` | Days to retain pod test result records | `30` | Reduce if storage is limited |
| `DIAG_SERVER_DATA_PURGE_RETRY_INTERVAL_SECONDS` | Seconds to wait before retrying a failed purge | `3600` | Lower to `600` for faster recovery after a purge failure |
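For example, a space-constrained environment might purge at 02:00 with shorter retention (illustrative values):

```yaml
DIAG_SERVER_DATA_PURGE_ENABLED: "true"
DIAG_SERVER_DATA_PURGE_RUN_TIME: "02:00"
DIAG_SERVER_DATA_PURGE_ERROR_LOGS_RETENTION_DAYS: "5"
DIAG_SERVER_DATA_PURGE_POD_TEST_RESULT_RETENTION_DAYS: "14"
DIAG_SERVER_DATA_PURGE_RETRY_INTERVAL_SECONDS: "600"
```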
Client-Side Test Result File Retention¶
These variables control how many historical test-result files the diagnostics client retains on disk before rotating.
| Variable | Description | Default |
|---|---|---|
| `DIAG_TEST_RESULTS_FILE_RETENTION_COUNT` | Global retention count for all services | `7` (7 days for all services) |
| `DIAG_TEST_RESULTS_RETENTION_<SERVICE>` | Per-service override (uppercase, hyphens → underscores) | `""` (uses the global value) |
Naming pattern: `DIAG_TEST_RESULTS_RETENTION_{CONTAINER_NAME_IN_UPPERCASE}`.
| Example Container | Variable |
|---|---|
| `auditserver` | `DIAG_TEST_RESULTS_RETENTION_AUDITSERVER` |
| `ranger` | `DIAG_TEST_RESULTS_RETENTION_RANGER` |
| `portal` | `DIAG_TEST_RESULTS_RETENTION_PORTAL` |
| `databricks-unity-catalog` | `DIAG_TEST_RESULTS_RETENTION_DATABRICKS_UNITY_CATALOG` |
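A sketch combining the global count with one per-service override (the values are illustrative):

```yaml
DIAG_TEST_RESULTS_FILE_RETENTION_COUNT: "7"
DIAG_TEST_RESULTS_RETENTION_RANGER: "14"  # keep more history for ranger only
```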
Diagnostics Test Scheduling¶
These variables control when and how often the client sidecar runs its diagnostic test suite against each service.
| Variable | Description | Default | Recommendation |
|---|---|---|---|
| `DIAG_TESTS_INITIAL_WAIT_SECS` | Seconds to wait after pod startup before running the first test | `120` | Increase to `300` for services that take longer to become ready |
| `DIAG_TESTS_INTERVAL_SECS` | Seconds between successful test runs | `86400` (24 h) | Reduce to `43200` (12 h) to detect configuration drift earlier |
| `DIAG_TESTS_RETRY_ENABLED` | Enable automatic re-test after a failure | `false` | Set to `true` in production to catch transient failures |
| `DIAG_TESTS_RETRY_INTERVAL_SECS` | Seconds between retry attempts on failure | `43200` (12 h) | Reduce to `3600` (1 h) for faster failure confirmation |
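A production-leaning sketch (illustrative values following the recommendations above):

```yaml
DIAG_TESTS_INITIAL_WAIT_SECS: "300"
DIAG_TESTS_INTERVAL_SECS: "43200"
DIAG_TESTS_RETRY_ENABLED: "true"
DIAG_TESTS_RETRY_INTERVAL_SECS: "3600"
```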
Apply the Changes¶
Save and close the file, then regenerate and apply the Helm charts:
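The exact commands depend on your Privacera Manager installation; a typical sequence (the path and script name here assume a standard Privacera Manager layout and are shown as an illustration) is:

```shell
# Illustrative only -- adjust the path to your Privacera Manager install.
cd ~/privacera/privacera-manager

# Regenerate the Helm charts from the updated vars file and apply them.
./privacera-manager.sh update
```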
Quick Reference — Recommended Configurations¶
| Scenario | Key Variables to Set |
|---|---|
| Getting started | All defaults; enable DIAG_SERVER_METRICS_ENABLE: "true" to observe behavior |
| Multi-replica server deployment | Switch to DIAG_SERVER_DB_TYPE: "mariadb", configure DIAG_SERVER_DB_HOST, DIAG_SERVER_DB_POOL_CAPACITY, DIAG_SERVER_DB_POOL_OVERFLOW |
| SQLite on EFS/NFS (unavoidable) | DIAG_SERVER_SQLITE_BUSY_TIMEOUT_MS: "60000", DIAG_SERVER_SQLITE_WAL_AUTOCHECKPOINT: "2000", DIAG_SERVER_SQLITE_READINESS_WINDOW_SIZE: "5" — or migrate to MariaDB |
| Many client sidecars sending heartbeats | Raise DIAG_SERVER_SQLITE_HEARTBEAT_QUEUE_MAX_SIZE to buffer spikes; raise DIAG_SERVER_SQLITE_MAX_CONCURRENT_SESSIONS if semaphore-wait warnings appear |
| High log I/O overhead | DIAG_SERVER_SOCKETIO_LOG_LEVEL: "ERROR", DIAG_SERVER_SOCKETIO_ENGINEIO_LOG_LEVEL: "ERROR" |
| Browser clients frequently disconnecting | Increase DIAG_SERVER_SOCKETIO_PING_TIMEOUT; decrease DIAG_SERVER_SOCKETIO_PING_INTERVAL |
| Database growing too large | Reduce DIAG_SERVER_DATA_PURGE_ERROR_LOGS_RETENTION_DAYS and DIAG_SERVER_DATA_PURGE_POD_TEST_RESULT_RETENTION_DAYS; set DIAG_TEST_RESULTS_FILE_RETENTION_COUNT |
| Slow failure / offline detection | Decrease DIAG_SERVER_HEARTBEAT_ACTIVE_THRESHOLD_SECONDS and DIAG_SERVER_HEARTBEAT_DEGRADED_THRESHOLD_SECONDS |
| Services take long to become ready on startup | Increase DIAG_TESTS_INITIAL_WAIT_SECS so the first test run does not start before the service is up |
| Need faster re-test after failures | Set DIAG_TESTS_RETRY_ENABLED: "true" and reduce DIAG_TESTS_RETRY_INTERVAL_SECS |