Performance Optimization for Diagnostics Server

Note

  • All configuration variables are optional and have default values.
  • Uncomment and set values only when you need to override the defaults.
  • Apply changes incrementally and monitor server behavior after each adjustment.

Prerequisites

Before tuning, ensure you have:

  • A running Privacera Diagnostics Server deployment.
  • Access to the vars.privacera-diagnostics.yml configuration file.

Choose a Database Backend

The server supports two backends. Choose before any other tuning — the active backend determines which sections below apply.

| Variable | Description | Default | When to Change |
|---|---|---|---|
| DIAG_SERVER_DB_TYPE | Database backend: sqlite or mariadb | sqlite | Use mariadb for multi-replica deployments or when the SQLite file sits on NFS/EFS |

SQLite is the default and requires zero external infrastructure. It is the right choice for single-server deployments on local or EBS-backed storage.

MariaDB is required when:

  • You run more than one Diagnostics Server replica.
  • The storage volume is NFS or EFS (SQLite WAL does not work correctly over NFS).
YAML
# Use SQLite (default — no change needed)
# DIAG_SERVER_DB_TYPE: "sqlite"

# Switch to MariaDB
DIAG_SERVER_DB_TYPE: "mariadb"

SQLite Performance Tuning

Skip this section if you are using MariaDB.

SQLite performance is dominated by three factors: lock-wait tolerance, WAL checkpoint frequency, and how many requests can touch the database at the same time.

Lock Timeout

| Variable | Description | Default | Recommendation |
|---|---|---|---|
| DIAG_SERVER_SQLITE_BUSY_TIMEOUT_MS | Milliseconds SQLite retries a locked file before raising an error | 30000 (30 s) | Increase to 60000 on EFS/NFS; keep default on EBS or local disk |

Critical for EFS/NFS

On EFS or NFS mounts, file-lock latency is much higher than on local disk. If you see "database is locked" errors in the server logs, raise this value before making any other change.

YAML
# Raise only when the SQLite file is on EFS/NFS
DIAG_SERVER_SQLITE_BUSY_TIMEOUT_MS: "60000"
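This variable maps onto SQLite's standard busy_timeout PRAGMA. The standalone sketch below (stdlib sqlite3 only, independent of the Diagnostics Server) shows the setting being applied and read back:

```python
import sqlite3

# Illustrates the PRAGMA behind DIAG_SERVER_SQLITE_BUSY_TIMEOUT_MS
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA busy_timeout = 60000")  # wait up to 60 s on a locked database
timeout_ms = conn.execute("PRAGMA busy_timeout").fetchone()[0]
print(timeout_ms)  # 60000
conn.close()
```

With the timeout set, a writer that hits a lock retries internally for up to 60 s instead of failing immediately.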

WAL Checkpoint Frequency

| Variable | Description | Default | Recommendation |
|---|---|---|---|
| DIAG_SERVER_SQLITE_WAL_AUTOCHECKPOINT | Number of WAL pages that trigger an automatic checkpoint | 1000 | Lower to 500 for write-heavy workloads; raise to 2000 to reduce lock spikes on EFS |

Tip

Each auto-checkpoint briefly takes an exclusive lock. Raising this value reduces how often that lock is acquired at the cost of a slightly larger WAL file on disk.

YAML
# Reduce checkpoint frequency on EFS to avoid lock spikes
DIAG_SERVER_SQLITE_WAL_AUTOCHECKPOINT: "2000"
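This corresponds to SQLite's wal_autocheckpoint PRAGMA. A standalone sketch with the stdlib sqlite3 module (WAL mode requires an on-disk database, so a temporary file is used):

```python
import os
import sqlite3
import tempfile

# WAL mode is only available for on-disk databases
path = os.path.join(tempfile.mkdtemp(), "diag.db")
conn = sqlite3.connect(path)
journal_mode = conn.execute("PRAGMA journal_mode = WAL").fetchone()[0]  # 'wal'
conn.execute("PRAGMA wal_autocheckpoint = 2000")  # checkpoint every 2000 WAL pages
pages = conn.execute("PRAGMA wal_autocheckpoint").fetchone()[0]
print(journal_mode, pages)
conn.close()
```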

Concurrent Session Limit

| Variable | Description | Default | Recommendation |
|---|---|---|---|
| DIAG_SERVER_SQLITE_MAX_CONCURRENT_SESSIONS | Maximum concurrent in-flight DB sessions | Auto-calculated from CPU count | Set explicitly only when the auto value causes semaphore-wait warnings |

The auto-calculation is: min(32, cpu_count + 4) - 2. This reserves 2 slots for readiness probes.

Warning

Never set this to 0 — it deadlocks all database requests. Leave blank to use the auto-calculated value.

YAML
# Override only when semaphore-wait warnings appear in the logs
# DIAG_SERVER_SQLITE_MAX_CONCURRENT_SESSIONS: "12"

Readiness Probe Sensitivity

| Variable | Description | Default | Recommendation |
|---|---|---|---|
| DIAG_SERVER_SQLITE_READINESS_WINDOW_SIZE | Number of recent DB sessions tracked by the readiness probe | 3 | Raise to 5 to tolerate more transient failures before the pod is marked Not Ready |

The pod reports Not Ready only when all of the last N sessions failed. Raising this value makes the readiness probe more tolerant of transient SQLite lock errors.

YAML
DIAG_SERVER_SQLITE_READINESS_WINDOW_SIZE: "5"
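The all-of-the-last-N rule can be sketched as follows. This is a minimal illustration, not the server's actual code; the class name and the choice to stay Ready while the window is not yet full are assumptions:

```python
from collections import deque

class ReadinessWindow:
    """Track the last N DB session outcomes; Not Ready only if all N failed."""

    def __init__(self, size: int = 3):
        self.outcomes = deque(maxlen=size)  # sliding window of True/False results

    def record(self, ok: bool) -> None:
        self.outcomes.append(ok)

    def ready(self) -> bool:
        # Not Ready only when the window is full and contains no success
        return not (len(self.outcomes) == self.outcomes.maxlen
                    and not any(self.outcomes))
```

A single successful session anywhere in the window keeps the pod Ready, which is why a larger window tolerates longer bursts of transient lock errors.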

In-Memory Heartbeat Queue

| Variable | Description | Default | Recommendation |
|---|---|---|---|
| DIAG_SERVER_SQLITE_HEARTBEAT_QUEUE_MAX_SIZE | Maximum heartbeats held in memory when the DB is temporarily unavailable | 200 | Increase to 500 in large deployments (many client pods) to buffer heartbeats during a brief DB hiccup |

Heartbeats bypass the concurrency semaphore and queue in memory. The pod returns 503 only when this queue fills up (sustained DB outage), not on transient slowness.

YAML
DIAG_SERVER_SQLITE_HEARTBEAT_QUEUE_MAX_SIZE: "500"
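The accept-until-full behavior can be sketched with a bounded queue. This is an illustration of the described semantics only; the class name and status handling are assumptions:

```python
import queue

class HeartbeatBuffer:
    """In-memory buffer for heartbeats while the DB is unavailable (sketch)."""

    def __init__(self, max_size: int = 200):
        self.q = queue.Queue(maxsize=max_size)

    def accept(self, heartbeat: dict) -> int:
        """Return an HTTP-style status: accepted while space remains, 503 when full."""
        try:
            self.q.put_nowait(heartbeat)
            return 202
        except queue.Full:
            # Queue full means a sustained DB outage, not transient slowness
            return 503
```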

MariaDB Configuration

Skip this section if you are using SQLite.

Connection Details

| Variable | Description | Default |
|---|---|---|
| DIAG_SERVER_DB_HOST | MariaDB hostname or Kubernetes service name | mariadb |
| DIAG_SERVER_DB_PORT | MariaDB port | 3306 |
| DIAG_SERVER_DB_USER | Database username | (no default) |
| DIAG_SERVER_DB_PASSWORD | Database password | (no default) |
| DIAG_SERVER_DB_NAME | Database schema name | diag_server |
YAML
DIAG_SERVER_DB_TYPE: "mariadb"
DIAG_SERVER_DB_HOST: "mariadb"
DIAG_SERVER_DB_PORT: "3306"
DIAG_SERVER_DB_USER: "<DB_USERNAME>"
DIAG_SERVER_DB_PASSWORD: "<DB_PASSWORD>"
DIAG_SERVER_DB_NAME: "diag_server"

Connection Pool

These settings control how many simultaneous database connections the server maintains. They apply only when DIAG_SERVER_DB_TYPE is mariadb — SQLite uses NullPool (no persistent connections).

| Variable | Description | Default | Recommendation |
|---|---|---|---|
| DIAG_SERVER_DB_POOL_CAPACITY | Maximum persistent connections in the pool | 10 | Increase to 20–30 for high-concurrency deployments (many simultaneous users) |
| DIAG_SERVER_DB_POOL_OVERFLOW | Extra connections allowed beyond pool capacity | 30 | Set to 1.5× pool capacity as a starting point |
| DIAG_SERVER_DB_POOL_TIMEOUT_SECS | Seconds to wait for a free connection before failing | 30 | Reduce to 10 to fail fast under saturation |

MariaDB server-side limit

Ensure MariaDB's max_connections is at least DIAG_SERVER_DB_POOL_CAPACITY + DIAG_SERVER_DB_POOL_OVERFLOW + headroom (for admin connections). For pool=20, overflow=30, set max_connections ≥ 60.

YAML
DIAG_SERVER_DB_POOL_CAPACITY: "20"
DIAG_SERVER_DB_POOL_OVERFLOW: "30"
DIAG_SERVER_DB_POOL_TIMEOUT_SECS: "10"
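The server-side sizing rule from the note above is a simple sum. In this sketch the 10-connection admin headroom is a chosen value that matches the pool=20/overflow=30 example, not a documented constant:

```python
def required_max_connections(pool_capacity: int, pool_overflow: int,
                             admin_headroom: int = 10) -> int:
    # MariaDB's max_connections must cover the full client pool plus
    # headroom for admin and monitoring sessions
    return pool_capacity + pool_overflow + admin_headroom

print(required_max_connections(20, 30))  # 60, matching the guidance above
```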

SocketIO Performance

SocketIO settings affect real-time communication between the server and browser clients. Tune these when clients experience dropped connections or high-latency updates.

| Variable | Description | Default | Recommendation |
|---|---|---|---|
| DIAG_SERVER_SOCKETIO_PING_TIMEOUT | Seconds to wait for a pong before closing the connection | 60 | Increase to 120 for slow or high-latency networks |
| DIAG_SERVER_SOCKETIO_PING_INTERVAL | Seconds between keepalive pings | 25 | Decrease to 15 for faster detection of dropped browser tabs |
| DIAG_SERVER_SOCKETIO_ASYNC_MODE | Async execution mode (asgi or threading) | asgi | Keep asgi for best throughput; use threading only for legacy compatibility |
| DIAG_SERVER_SOCKETIO_LOG_LEVEL | Log verbosity for SocketIO events | WARNING | Set to ERROR in production to reduce log I/O overhead |
| DIAG_SERVER_SOCKETIO_ENGINEIO_LOG_LEVEL | Log verbosity for EngineIO transport events | WARNING | Set to ERROR in production to reduce log I/O overhead |
YAML
# Production-tuned SocketIO settings
DIAG_SERVER_SOCKETIO_PING_TIMEOUT: "120"
DIAG_SERVER_SOCKETIO_PING_INTERVAL: "15"
DIAG_SERVER_SOCKETIO_LOG_LEVEL: "ERROR"
DIAG_SERVER_SOCKETIO_ENGINEIO_LOG_LEVEL: "ERROR"

Data Purge and Retention

Background purge tasks remove stale records and prevent unbounded database growth. Schedule purges during off-peak hours to avoid I/O contention with live traffic.

Server-Side Purge Schedule

| Variable | Description | Default | Recommendation |
|---|---|---|---|
| DIAG_SERVER_DATA_PURGE_ENABLED | Enable the nightly purge background task | true | Keep true to prevent unbounded growth |
| DIAG_SERVER_DATA_PURGE_RUN_TIME | Time of day to run purge (24-hour HH:MM) | 00:00 | Schedule during the lowest-traffic window |
| DIAG_SERVER_DATA_PURGE_ERROR_LOGS_RETENTION_DAYS | Days to retain error log records | 7 | Reduce in space-constrained environments; increase for compliance requirements |
| DIAG_SERVER_DATA_PURGE_POD_TEST_RESULT_RETENTION_DAYS | Days to retain pod test result records | 30 | Reduce if storage is limited |
| DIAG_SERVER_DATA_PURGE_RETRY_INTERVAL_SECONDS | Seconds to wait before retrying a failed purge | 3600 | Lower to 600 for faster recovery after a purge failure |
YAML
DIAG_SERVER_DATA_PURGE_ENABLED: "true"
DIAG_SERVER_DATA_PURGE_RUN_TIME: "00:00"
DIAG_SERVER_DATA_PURGE_ERROR_LOGS_RETENTION_DAYS: "7"
DIAG_SERVER_DATA_PURGE_POD_TEST_RESULT_RETENTION_DAYS: "30"
DIAG_SERVER_DATA_PURGE_RETRY_INTERVAL_SECONDS: "600"

Client-Side Test Result File Retention

These variables control how many historical test-result files the diagnostics client retains on disk before rotating.

| Variable | Description | Default |
|---|---|---|
| DIAG_TEST_RESULTS_FILE_RETENTION_COUNT | Global retention count for all services | 7 (result files per service) |
| DIAG_TEST_RESULTS_RETENTION_<SERVICE> | Per-service override (uppercase, hyphens → underscores) | "" (uses the global count) |

Naming pattern: DIAG_TEST_RESULTS_RETENTION_{CONTAINER_NAME_IN_UPPERCASE}.

| Example Container | Variable |
|---|---|
| auditserver | DIAG_TEST_RESULTS_RETENTION_AUDITSERVER |
| ranger | DIAG_TEST_RESULTS_RETENTION_RANGER |
| portal | DIAG_TEST_RESULTS_RETENTION_PORTAL |
| databricks-unity-catalog | DIAG_TEST_RESULTS_RETENTION_DATABRICKS_UNITY_CATALOG |
YAML
# Retain the last 10 result files for all services
DIAG_TEST_RESULTS_FILE_RETENTION_COUNT: "10"

# Override for a noisy connector that generates many results
DIAG_TEST_RESULTS_RETENTION_DATABRICKS_UNITY_CATALOG: "5"
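The naming pattern is mechanical enough to express in one line (the function name is illustrative):

```python
def retention_var_name(container_name: str) -> str:
    # Uppercase the container name and replace hyphens with underscores
    return "DIAG_TEST_RESULTS_RETENTION_" + container_name.upper().replace("-", "_")

print(retention_var_name("databricks-unity-catalog"))
# DIAG_TEST_RESULTS_RETENTION_DATABRICKS_UNITY_CATALOG
```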

Diagnostics Test Scheduling

These variables control when and how often the client sidecar runs its diagnostic test suite against each service.

| Variable | Description | Default | Recommendation |
|---|---|---|---|
| DIAG_TESTS_INITIAL_WAIT_SECS | Seconds to wait after pod startup before running the first test | 120 | Increase to 300 for services that take longer to become ready |
| DIAG_TESTS_INTERVAL_SECS | Seconds between successful test runs | 86400 (24 h) | Reduce to 43200 (12 h) to detect configuration drift earlier |
| DIAG_TESTS_RETRY_ENABLED | Enable automatic re-test after a failure | false | Set to true in production to catch transient failures |
| DIAG_TESTS_RETRY_INTERVAL_SECS | Seconds between retry attempts on failure | 43200 (12 h) | Reduce to 3600 (1 h) for faster failure confirmation |
YAML
DIAG_TESTS_INITIAL_WAIT_SECS: "120"
DIAG_TESTS_INTERVAL_SECS: "86400"
DIAG_TESTS_RETRY_ENABLED: "true"
DIAG_TESTS_RETRY_INTERVAL_SECS: "3600"
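The schedule these four variables define can be sketched as a loop (function and parameter names are illustrative; the real sidecar logic may differ in detail):

```python
import time

def run_test_loop(run_tests, initial_wait=120, interval=86400,
                  retry_enabled=True, retry_interval=3600, clock=time.sleep):
    """Sketch of the sidecar test schedule described above."""
    clock(initial_wait)              # let the service finish starting up
    while True:
        ok = run_tests()
        if not ok and retry_enabled:
            clock(retry_interval)    # faster re-test after a failure
        else:
            clock(interval)          # normal 24 h cadence after success
```

Passing `clock` as a parameter keeps the sketch testable: a fake clock can record the waits instead of sleeping.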

Apply the Changes

Save and close the file, then regenerate and apply the Helm charts:

Bash
cd ~/privacera/privacera-manager
./privacera-manager.sh setup
./pm_with_helm.sh upgrade

Quick Reference

| Scenario | Key Variables to Set |
|---|---|
| Getting started | All defaults; enable DIAG_SERVER_METRICS_ENABLE: "true" to observe behavior |
| Multi-replica server deployment | Switch to DIAG_SERVER_DB_TYPE: "mariadb"; configure DIAG_SERVER_DB_HOST, DIAG_SERVER_DB_POOL_CAPACITY, DIAG_SERVER_DB_POOL_OVERFLOW |
| SQLite on EFS/NFS (unavoidable) | DIAG_SERVER_SQLITE_BUSY_TIMEOUT_MS: "60000", DIAG_SERVER_SQLITE_WAL_AUTOCHECKPOINT: "2000", DIAG_SERVER_SQLITE_READINESS_WINDOW_SIZE: "5" — or migrate to MariaDB |
| Many client sidecars sending heartbeats | Raise DIAG_SERVER_SQLITE_HEARTBEAT_QUEUE_MAX_SIZE to buffer spikes; raise DIAG_SERVER_SQLITE_MAX_CONCURRENT_SESSIONS if semaphore-wait warnings appear |
| High log I/O overhead | DIAG_SERVER_SOCKETIO_LOG_LEVEL: "ERROR", DIAG_SERVER_SOCKETIO_ENGINEIO_LOG_LEVEL: "ERROR" |
| Browser clients frequently disconnecting | Increase DIAG_SERVER_SOCKETIO_PING_TIMEOUT; decrease DIAG_SERVER_SOCKETIO_PING_INTERVAL |
| Database growing too large | Reduce DIAG_SERVER_DATA_PURGE_ERROR_LOGS_RETENTION_DAYS and DIAG_SERVER_DATA_PURGE_POD_TEST_RESULT_RETENTION_DAYS; set DIAG_TEST_RESULTS_FILE_RETENTION_COUNT |
| Slow failure / offline detection | Decrease DIAG_SERVER_HEARTBEAT_ACTIVE_THRESHOLD_SECONDS and DIAG_SERVER_HEARTBEAT_DEGRADED_THRESHOLD_SECONDS |
| Services take long to become ready on startup | Increase DIAG_TESTS_INITIAL_WAIT_SECS so the first test run does not start before the service is up |
| Need faster re-test after failures | Set DIAG_TESTS_RETRY_ENABLED: "true" and reduce DIAG_TESTS_RETRY_INTERVAL_SECS |