Connector Monitoring & Alerting¶
This document provides a consolidated overview of the monitoring and alerting capabilities available for Privacera connectors. It covers Grafana-based dashboards, key performance metrics (including P95 and P99 latency), alert categories, and guidance on how to use these tools for troubleshooting and capacity planning.
Overview¶
This document covers:
- Connector dashboards — Grafana dashboards for JDBC and common connector metrics
- Key metrics — Metrics exposed by Privacera and guidance on how to interpret them
- Alert categories — Performance, availability, and resource saturation alerts
- Latency metrics — Understanding P95 and P99 latency in production environments
- Troubleshooting workflow — Steps to diagnose and resolve connector issues
For detailed troubleshooting steps, see the Connector troubleshooting guide.
Connector Dashboards Overview¶
Privacera provides Grafana-based dashboards to monitor connector health, performance, and reliability. These dashboards are organized into two categories: JDBC metrics (query performance and latency) and common connector metrics (system and runtime health).
Connector JDBC Metrics¶
Dashboards in this category provide a query-centric view of connector behavior. They help with:
- Query performance — Analyze query duration (average, P95, and P99 latency) and observe throughput trends over time.
- Latency distribution — Identify tail latency and detect slow or outlier queries.
- Execution behavior — Monitor success and failure rates, error categories, and request volume to correlate load with performance and reliability.
Use these dashboards for capacity planning, SLA monitoring, and diagnosing query slowness or intermittent failures.
Connector Common Metrics¶
Dashboards in this category provide a system and runtime view shared across connector types. They surface:
- JVM and resource usage — Heap and non-heap memory, garbage collection, thread count, and CPU utilization.
- Thread and connection pools — Active threads, queue depth, rejected tasks, connection pool usage, and wait times.
- Runtime health — Indicators that help you spot resource saturation, memory pressure, and scaling needs before they impact query performance.
Use these dashboards to monitor connector stability, plan scaling, and troubleshoot resource-related alerts.
Key Metrics Explained¶
JDBC / Query Performance Metrics¶
Query Latency¶
Measures the time required to process queries. Common aggregations include:
- Average — Mean query latency
- P95 — 95th percentile query latency
- P99 — 99th percentile query latency
Understanding P95 and P99
- P95 — 95% of queries complete within this time.
- P99 — 99% of queries complete within this time.
These metrics help you:
- Capture tail latency
- Spot intermittent performance degradation
- Detect resource saturation early
If average latency is normal but P95/P99 is high, typical causes include:
- Backend slowness
- Resource contention
- Thread pool exhaustion
- Intermittent network issues
Query Throughput¶
Measures:
- Queries per second (QPS) — Rate of queries processed per second
- Total requests processed — Cumulative number of processed requests
Use throughput for:
- Capacity planning
- Detecting traffic spikes
- Correlating load with latency
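Throughput metrics are typically exposed as a cumulative request counter, so QPS is derived from the delta between two scrapes. The following is a minimal Python sketch of that calculation; the function name, counter values, and 60-second scrape interval are illustrative, not part of the Privacera metrics API.

```python
def qps(prev_total, curr_total, interval_s):
    """Queries per second from two samples of a cumulative request counter.

    A counter reset (e.g. after a connector restart) shows up as a negative
    delta; in that case treat the current total as the delta, which is the
    usual convention for monotonic counters.
    """
    delta = curr_total - prev_total
    if delta < 0:  # counter reset between scrapes
        delta = curr_total
    return delta / interval_s

# Two scrapes of a hypothetical total-requests counter, 60 s apart:
print(qps(120_000, 129_000, 60))  # 150.0 queries/second
```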
Query Failures¶
Tracks:
- Error count — Total number of failed queries
- Error rate (%) — Percentage of queries that failed
- Exception categories — Classification of failure types
Spikes in failures may indicate:
- Backend system issues
- Authentication problems
- Expired tokens
- Configuration errors
Connector Common Metrics¶
JVM Metrics¶
Includes:
- Heap memory usage
- Non-heap memory
- Garbage collection (GC) pause time
- Thread count
Watch for:
- Heap memory usage consistently above 80%
- Frequent or prolonged GC pauses
- A continuously increasing thread count
CPU Usage¶
Sustained high CPU may indicate:
- Heavy query load
- Inefficient query patterns
- Insufficient capacity or need for scaling
Memory Usage¶
Monitor:
- Total memory usage
- Heap memory utilization
- Upward memory usage trends
Sudden growth may indicate:
- Memory leak
- Large result sets
- Unbounded cache growth
Thread Pool Metrics¶
Tracks:
- Active threads
- Queue depth
- Rejected tasks
Symptoms of thread exhaustion:
- Rising P95/P99 latency
- Increased timeouts
- Higher failure rates
Connection Pool Metrics¶
Includes:
- Active connections
- Idle connections
- Connection wait time
- Connection acquisition failures
| Observation | Likely cause |
|---|---|
| High connection wait time | Connection pool size may be too small |
| Frequent connection timeouts | Backend slowness or connection pool exhaustion |
| 100% utilization | Scaling or connection pool tuning required |
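The table above can be read as a simple classification rule. Below is a minimal Python sketch of that logic; the function name and the 50 ms wait threshold are assumptions for illustration, not Privacera defaults, so tune them against your own baselines.

```python
def pool_status(active, max_size, avg_wait_ms, wait_threshold_ms=50):
    """Classify connection pool health from basic pool metrics.

    Thresholds are illustrative: a pool at full utilization needs scaling
    or tuning, and sustained wait time suggests the pool is undersized.
    """
    if active >= max_size:          # 100% utilization
        return "exhausted: scale or tune the pool"
    if avg_wait_ms > wait_threshold_ms:
        return "waiting: pool may be too small"
    return "healthy"

print(pool_status(active=20, max_size=20, avg_wait_ms=120))  # exhausted
```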
Alert Categories¶
Privacera provides alerts across three main categories.
Performance Alerts¶
Triggered when:
- P95 or P99 exceeds baseline
- Average latency increases significantly
- Throughput drops unexpectedly
Example threshold guidelines:
| Condition | Severity |
|---|---|
| P95 latency > 1.5× baseline | Warning |
| P95 latency > 2× baseline | Critical |
| Error rate > 2% | Warning |
| Error rate > 5% | Critical |
| CPU usage > 75% | Warning |
| CPU usage > 90% | Critical |
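The threshold table above can be sketched as a severity check. This Python fragment is illustrative only: the function name and baseline value are assumptions, and the thresholds mirror the example guidelines rather than fixed Privacera alert rules.

```python
def severity(p95_ms, baseline_ms, error_rate_pct, cpu_pct):
    """Return the worst severity across the example threshold guidelines."""
    # Critical conditions from the table above
    if p95_ms > 2 * baseline_ms or error_rate_pct > 5 or cpu_pct > 90:
        return "critical"
    # Warning conditions from the table above
    if p95_ms > 1.5 * baseline_ms or error_rate_pct > 2 or cpu_pct > 75:
        return "warning"
    return "ok"

# Hypothetical reading against a 200 ms P95 baseline:
print(severity(p95_ms=450, baseline_ms=200, error_rate_pct=1.0, cpu_pct=60))
# critical (P95 is more than 2x baseline)
```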
Availability Alerts¶
Triggered when:
- Health checks fail
- The connector becomes unreachable
- Pod or container restarts increase unexpectedly
Resource Saturation Alerts¶
Triggered when:
- Heap memory usage exceeds 85%
- Thread pool queue depth continues to increase
- Connection pool exhaustion is detected
- CPU utilization exceeds 90%
These act as early indicators of instability or the need to scale.
Understanding P95 vs P99 in Production¶
Why Average Is Not Enough¶
Average latency hides outliers.
Example: if 99 requests complete in 100 ms and 1 request takes 5000 ms, the average can still look acceptable, while P99 will expose the slow request.
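The example above can be worked through numerically. The sketch below uses a simple index-based percentile (value at index floor(p × n)); real monitoring systems differ slightly in interpolation, so the exact P99 convention is an assumption here.

```python
def percentile(samples, p):
    """Value at index floor(p * n) of the sorted samples (one common
    convention; monitoring libraries vary in how they interpolate)."""
    s = sorted(samples)
    idx = min(int(p * len(s)), len(s) - 1)
    return s[idx]

latencies = [100] * 99 + [5000]        # 99 fast requests, 1 slow outlier (ms)
avg = sum(latencies) / len(latencies)
print(avg)                             # 149.0 ms -- looks acceptable
print(percentile(latencies, 0.95))     # 100 ms  -- outlier still hidden
print(percentile(latencies, 0.99))     # 5000 ms -- outlier exposed
```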
When to Focus on P95¶
- SLA monitoring
- Overall user experience assessment
- General performance tracking
When to Focus on P99¶
- Rare but severe latency spikes
- Backend contention analysis
- Deep SRE investigations
Troubleshooting Workflow¶
Step 1 — Identify Alert Type¶
Determine whether the issue is:
- Latency
- Error spike
- Resource saturation
- Availability
Step 2 — Correlate Metrics¶
| Observation | Likely cause |
|---|---|
| High P99 latency and high CPU utilization | Load-driven performance issue |
| High P99 latency, normal CPU, and high backend latency | Backend issue |
| High error rate with authentication-related errors | Credential misconfiguration or expired token |
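The correlation table above amounts to a triage decision tree. The Python sketch below restates it as code; the function name, boolean inputs, and fallback message are illustrative assumptions, not part of any Privacera tooling.

```python
def likely_cause(p99_high, cpu_high, backend_latency_high, auth_errors):
    """Rough triage mirroring the correlation table (illustrative only)."""
    if auth_errors:
        return "credential misconfiguration or expired token"
    if p99_high and cpu_high:
        return "load-driven performance issue"
    if p99_high and not cpu_high and backend_latency_high:
        return "backend issue"
    return "inconclusive: check connector logs"

print(likely_cause(p99_high=True, cpu_high=False,
                   backend_latency_high=True, auth_errors=False))
# backend issue
```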
Step 3 — Check Connector Logs¶
Use the Connector troubleshooting guide and look for:
- Timeout exceptions
- Authentication failures
- Backend connection failures
- OutOfMemory errors
- Thread rejection errors
Step 4 — Take Remedial Action¶
Possible actions:
- Scale the connector horizontally
- Increase CPU or memory allocation
- Increase connection pool size
- Tune JVM heap settings
- Optimize backend warehouse sizing
- Correct credential configuration issues
Capacity Planning Guidance¶
Use the dashboards to monitor trends over time. Plan scaling when you see:
- Sustained CPU above 70%
- Gradual increase in P95 latency
- Connection pool consistently near capacity
- Increasing thread queue depth
Proactive scaling
If these trends continue, plan scaling before user impact occurs.
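A "gradual increase" can be quantified with a least-squares slope over a metric series exported from the dashboards. This Python sketch assumes evenly spaced samples (e.g. one P95 reading per day); the sample values are hypothetical.

```python
def trend_slope(values):
    """Least-squares slope of an evenly spaced metric series.

    A sustained positive slope on P95 latency is a capacity planning
    signal; the units are metric-units per sample interval.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

daily_p95 = [210, 214, 219, 226, 233, 241, 250]  # ms, one sample per day
print(trend_slope(daily_p95))  # ~6.7 ms/day -- plan scaling ahead of impact
```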
Best Practices¶
- Monitor P95 and P99, not just averages.
- Set alerts based on established baselines for your environment.
- Correlate latency metrics with CPU utilization and connection pool metrics.
- Track error rate (%), not just raw error counts.
- Review performance trends regularly.
- Scale proactively using capacity planning signals.