Connector Monitoring & Alerting

This document provides a consolidated overview of the monitoring and alerting capabilities available for Privacera connectors. It covers Grafana-based dashboards, key performance metrics (including P95 and P99 latency), alert categories, and guidance on how to use these tools for troubleshooting and capacity planning.

Overview

This document covers:

  • Connector dashboards — Grafana dashboards for JDBC and common connector metrics
  • Key metrics — Metrics exposed by Privacera and guidance on how to interpret them
  • Alert categories — Performance, availability, and resource saturation alerts
  • Latency metrics — Understanding P95 and P99 latency in production environments
  • Troubleshooting workflow — Steps to diagnose and resolve connector issues

For detailed troubleshooting steps, see the Connector troubleshooting guide.

Connector Dashboards Overview

Privacera provides Grafana-based dashboards to monitor connector health, performance, and reliability. These dashboards are organized into two categories: JDBC metrics (query performance and latency) and common connector metrics (system and runtime health).

Connector JDBC Metrics

Dashboards in this category provide a query-centric view of connector behavior. They help with:

  • Query performance — Analyze query duration (average, P95, and P99 latency) and observe throughput trends over time.
  • Latency distribution — Identify tail latency and detect slow or outlier queries.
  • Execution behavior — Monitor success and failure rates, error categories, and request volume to correlate load with performance and reliability.

Use these dashboards for capacity planning, SLA monitoring, and diagnosing query slowness or intermittent failures.

Connector Common Metrics

Dashboards in this category provide a system and runtime view shared across connector types. They surface:

  • JVM and resource usage — Heap and non-heap memory, garbage collection, thread count, and CPU utilization.
  • Thread and connection pools — Active threads, queue depth, rejected tasks, connection pool usage, and wait times.
  • Runtime health — Indicators that help you spot resource saturation, memory pressure, and scaling needs before they impact query performance.

Use these dashboards to monitor connector stability, plan scaling, and troubleshoot resource-related alerts.

Key Metrics Explained

JDBC / Query Performance Metrics

Query Latency

Measures the time required to process queries. Common aggregations include:

  • Average — Mean query latency
  • P95 — 95th percentile query latency
  • P99 — 99th percentile query latency

Understanding P95 and P99

  • P95 — 95% of queries complete within this time.
  • P99 — 99% of queries complete within this time.

These metrics help you:

  • Capture tail latency
  • Spot intermittent performance degradation
  • Detect resource saturation early

If average latency is normal but P95/P99 is high, typical causes include:

  • Backend slowness
  • Resource contention
  • Thread pool exhaustion
  • Intermittent network issues
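
A simple way to catch this pattern is to compare the tail percentile against the mean over a window. The sketch below uses a nearest-rank percentile and a hypothetical 4× ratio threshold (not a Privacera default) as a starting point:

```python
import math

def nearest_rank_percentile(samples, pct):
    """Smallest sample value with at least pct% of samples at or below it."""
    s = sorted(samples)
    return s[max(0, math.ceil(pct / 100 * len(s)) - 1)]

def tail_diverges(latencies_ms, ratio=4.0):
    """True when P99 is far above the mean: the average looks healthy while a
    small slice of queries is very slow. The 4x ratio is a hypothetical
    starting point; tune it per environment."""
    avg = sum(latencies_ms) / len(latencies_ms)
    return nearest_rank_percentile(latencies_ms, 99) > ratio * avg
```

When this check fires, correlate with the backend, thread pool, and network metrics listed above to narrow down the cause.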

Query Throughput

Measures:

  • Queries per second (QPS) — Rate of queries processed per second
  • Total requests processed — Cumulative number of processed requests

Use throughput for:

  • Capacity planning
  • Detecting traffic spikes
  • Correlating load with latency
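
Since total requests is a cumulative counter, QPS is derived from the delta between two readings. A minimal sketch (the counter-reset handling is a common convention, not a documented Privacera behavior):

```python
def qps(prev_total, curr_total, interval_s):
    """Derive queries-per-second from two readings of a cumulative request
    counter. A drop in the counter usually means the connector restarted,
    so the delta is treated as unknown for that interval."""
    delta = curr_total - prev_total
    if delta < 0 or interval_s <= 0:
        return None
    return delta / interval_s

# Hypothetical readings taken 60 s apart.
print(qps(10_000, 13_000, 60))  # 50.0 queries/second
```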

Query Failures

Tracks:

  • Error count — Total number of failed queries
  • Error rate (%) — Percentage of queries that failed
  • Exception categories — Classification of failure types

Spikes in failures may indicate:

  • Backend system issues
  • Authentication problems
  • Expired tokens
  • Configuration errors
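
Error rate is more useful than a raw count because it stays comparable as traffic changes. A minimal sketch of the computation:

```python
def error_rate_pct(failed, total):
    """Percentage of queries that failed in a window; 0 when there was
    no traffic, to avoid division by zero."""
    return 0.0 if total == 0 else 100.0 * failed / total

# Hypothetical window: 12 failures out of 400 queries.
print(error_rate_pct(12, 400))  # 3.0
```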

Connector Common Metrics

JVM Metrics

Includes:

  • Heap memory usage
  • Non-heap memory
  • Garbage collection (GC) pause time
  • Thread count

Watch for:

  • Heap memory usage consistently above 80%
  • Frequent or prolonged GC pauses
  • A continuously increasing thread count
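
The watch-for conditions above can be expressed as simple checks. The 80% heap threshold comes from this guide; the 500 ms GC-pause threshold below is a hypothetical example:

```python
def jvm_warnings(heap_used, heap_max, gc_pause_ms, thread_counts):
    """Evaluate the JVM watch-for conditions over one observation window."""
    warnings = []
    if heap_used / heap_max > 0.80:
        warnings.append("heap above 80%")
    if gc_pause_ms > 500:  # hypothetical threshold for a "prolonged" pause
        warnings.append("prolonged GC pause")
    # A thread count that only ever grows suggests a leak.
    if all(b > a for a, b in zip(thread_counts, thread_counts[1:])):
        warnings.append("thread count continuously increasing")
    return warnings
```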

CPU Usage

Sustained high CPU may indicate:

  • Heavy query load
  • Inefficient query patterns
  • Insufficient capacity or need for scaling

Memory Usage

Monitor:

  • Total memory usage
  • Heap memory utilization
  • Upward memory usage trends

Sudden growth may indicate:

  • Memory leak
  • Large result sets
  • Unbounded cache growth

Thread Pool Metrics

Tracks:

  • Active threads
  • Queue depth
  • Rejected tasks

Symptoms of thread exhaustion

  • Rising P95/P99 latency
  • Increased timeouts
  • Higher failure rates

Connection Pool Metrics

Includes:

  • Active connections
  • Idle connections
  • Connection wait time
  • Connection acquisition failures

| Observation | Likely cause |
| --- | --- |
| High connection wait time | Connection pool size may be too small |
| Frequent connection timeouts | Backend slowness or connection pool exhaustion |
| 100% utilization | Scaling or connection pool tuning required |
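
The table above can be read as a first-pass check against pool metrics. In this sketch, the 2× wait-time multiplier is a hypothetical baseline comparison, not a documented default:

```python
def pool_diagnosis(active, max_size, wait_ms, wait_ms_baseline):
    """Translate the observation table into checks against current readings."""
    findings = []
    if active >= max_size:
        findings.append("pool at 100% utilization: scale or tune the pool")
    if wait_ms > 2 * wait_ms_baseline:  # hypothetical "high wait time" rule
        findings.append("high connection wait time: pool may be too small")
    return findings
```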

Alert Categories

Privacera provides alerts across three main categories.

Performance Alerts

Triggered when:

  • P95 or P99 exceeds baseline
  • Average latency increases significantly
  • Throughput drops unexpectedly

Example threshold guidelines:

| Condition | Severity |
| --- | --- |
| P95 latency > 1.5× baseline | Warning |
| P95 latency > 2× baseline | Critical |
| Error rate > 2% | Warning |
| Error rate > 5% | Critical |
| CPU usage > 75% | Warning |
| CPU usage > 90% | Critical |
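
The example thresholds above can be sketched as a classifier that returns the highest severity triggered; the threshold values are the guideline examples from the table, not fixed Privacera defaults:

```python
def classify(p95_ms, p95_baseline_ms, error_rate_pct, cpu_pct):
    """Apply the example threshold guidelines and return the highest
    severity triggered, checking critical conditions first."""
    checks = [
        ("critical", p95_ms > 2.0 * p95_baseline_ms),
        ("critical", error_rate_pct > 5.0),
        ("critical", cpu_pct > 90.0),
        ("warning", p95_ms > 1.5 * p95_baseline_ms),
        ("warning", error_rate_pct > 2.0),
        ("warning", cpu_pct > 75.0),
    ]
    for severity, triggered in checks:
        if triggered:
            return severity
    return "ok"
```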

Availability Alerts

Triggered when:

  • Health checks fail
  • The connector becomes unreachable
  • Pod or container restarts increase unexpectedly

Resource Saturation Alerts

Triggered when:

  • Heap memory usage exceeds 85%
  • Thread pool queue depth continues to increase
  • Connection pool exhaustion is detected
  • CPU utilization exceeds 90%

These act as early indicators of instability or the need to scale.

Understanding P95 vs P99 in Production

Why Average Is Not Enough

Average latency hides outliers.

Example: if 99 requests complete in 100 ms and 1 request takes 5000 ms, the average is about 149 ms and can still look acceptable, while P99 will expose the slow request.
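
The effect is easy to reproduce. This sketch uses a hypothetical sample of 1000 requests with 1.2% outliers and a nearest-rank P99:

```python
import math

# Hypothetical sample: 988 requests at 100 ms and 12 outliers at 5000 ms.
latencies_ms = [100] * 988 + [5000] * 12

avg = sum(latencies_ms) / len(latencies_ms)

# Nearest-rank P99: the value that 99% of requests complete within.
s = sorted(latencies_ms)
p99 = s[max(0, math.ceil(0.99 * len(s)) - 1)]

print(avg)  # 158.8 -- still looks close to the typical 100 ms
print(p99)  # 5000  -- exposes the outliers
```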

When to Focus on P95

  • SLA monitoring
  • Overall user experience assessment
  • General performance tracking

When to Focus on P99

  • Rare but severe latency spikes
  • Backend contention analysis
  • Deep SRE investigations

Troubleshooting Workflow

Step 1 — Identify Alert Type

Determine whether the issue is:

  • Latency
  • Error spike
  • Resource saturation
  • Availability

Step 2 — Correlate Metrics

| Observation | Likely cause |
| --- | --- |
| High P99 latency and high CPU utilization | Load-driven performance issue |
| High P99 latency, normal CPU, and high backend latency | Backend issue |
| High error rate with authentication-related errors | Credential misconfiguration or expired token |
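
The correlation table above can be encoded as a first-pass triage function. Inputs are booleans derived from dashboard observations; this is an illustrative sketch, not an automated Privacera feature:

```python
def likely_cause(p99_high, cpu_high, backend_latency_high, auth_errors):
    """Map combinations of dashboard observations to a likely cause."""
    if auth_errors:
        return "credential misconfiguration or expired token"
    if p99_high and cpu_high:
        return "load-driven performance issue"
    if p99_high and not cpu_high and backend_latency_high:
        return "backend issue"
    return "needs further investigation"
```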

Step 3 — Check Connector Logs

Use the Connector troubleshooting guide and look for:

  • Timeout exceptions
  • Authentication failures
  • Backend connection failures
  • OutOfMemory errors
  • Thread rejection errors

Step 4 — Take Remedial Action

Possible actions:

  • Scale the connector horizontally
  • Increase CPU or memory allocation
  • Increase connection pool size
  • Tune JVM heap settings
  • Optimize backend warehouse sizing
  • Correct credential configuration issues

Capacity Planning Guidance

Use the dashboards to monitor trends over time. Plan scaling when you see:

  • Sustained CPU above 70%
  • Gradual increase in P95 latency
  • Connection pool consistently near capacity
  • Increasing thread queue depth
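
The scaling signals above can be checked mechanically over a metrics window. In this sketch, "sustained" and "gradual increase" are simplified to every-sample checks, which is a rough stand-in for a real trend analysis:

```python
def scaling_signal(cpu_samples_pct, p95_samples_ms):
    """Flag the capacity-planning conditions: sustained CPU over 70% and a
    gradually rising P95, both simplified to whole-window checks."""
    signals = []
    if cpu_samples_pct and min(cpu_samples_pct) > 70.0:
        signals.append("sustained CPU above 70%")
    rising = all(b >= a for a, b in zip(p95_samples_ms, p95_samples_ms[1:]))
    if p95_samples_ms and rising and p95_samples_ms[-1] > p95_samples_ms[0]:
        signals.append("P95 latency trending up")
    return signals
```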

Proactive scaling

If these trends continue, plan scaling before user impact occurs.

Best Practices

  • Monitor P95 and P99, not just averages.
  • Set alerts based on established baselines for your environment.
  • Correlate latency metrics with CPU utilization and connection pool metrics.
  • Track error rate (%), not just raw error counts.
  • Review performance trends regularly.
  • Scale proactively using capacity planning signals.