Connector Monitoring & Alerting¶
This document provides a consolidated overview of the monitoring and alerting capabilities available for Privacera connectors. It covers Grafana-based dashboards, key performance metrics (including P95 and P99 latency), alert categories, and guidance on how to use these tools for troubleshooting and capacity planning.
Overview¶
This document covers:
- Connector dashboards — Grafana dashboards for JDBC and common connector metrics
- Key metrics — Metrics exposed by Privacera and guidance on how to interpret them
- Alert categories — Performance, availability, and resource saturation alerts
- Latency metrics — Understanding P95 and P99 latency in production environments
- Troubleshooting workflow — Steps to diagnose and resolve connector issues
For detailed troubleshooting steps, see the Connector troubleshooting guide.
Connector Dashboards Overview¶
Privacera provides Grafana-based dashboards to monitor connector health, performance, and reliability. These dashboards are organized into two categories: JDBC metrics (query performance and latency) and common connector metrics (system and runtime health).
Connector JDBC Metrics¶
Dashboards in this category provide a query-centric view of connector behavior. They help with:
- Query performance — Analyze query duration (average, P95, and P99 latency) and observe throughput trends over time.
- Latency distribution — Identify tail latency and detect slow or outlier queries.
- Execution behavior — Monitor success and failure rates, error categories, and request volume to correlate load with performance and reliability.
Use these dashboards for capacity planning, SLA monitoring, and diagnosing query slowness or intermittent failures.
Connector Common Metrics¶
Dashboards in this category provide a system and runtime view shared across connector types. They surface:
- JVM and resource usage — Heap and non-heap memory, garbage collection, thread count, and CPU utilization.
- Thread and connection pools — Active threads, queue depth, rejected tasks, connection pool usage, and wait times.
- Runtime health — Indicators that help you spot resource saturation, memory pressure, and scaling needs before they impact query performance.
Use these dashboards to monitor connector stability, plan scaling, and troubleshoot resource-related alerts.
Key Metrics Explained¶
JDBC / Query Performance Metrics¶
Query Latency¶
Measures the time required to process queries. Common aggregations include:
- Average — Mean query latency
- P95 — 95th percentile query latency
- P99 — 99th percentile query latency
Understanding P95 and P99
- P95 — 95% of queries complete within this time.
- P99 — 99% of queries complete within this time.
These metrics help you:
- Capture tail latency
- Spot intermittent performance degradation
- Detect resource saturation early
If average latency is normal but P95/P99 is high, typical causes include:
- Backend slowness
- Resource contention
- Thread pool exhaustion
- Intermittent network issues
Query Throughput¶
Measures:
- Queries per second (QPS) — Rate of queries processed per second
- Total requests processed — Cumulative number of processed requests
Use throughput for:
- Capacity planning
- Detecting traffic spikes
- Correlating load with latency
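Throughput metrics are typically exposed as a cumulative request counter, so QPS is derived from the delta between two scrapes. The following is a minimal Python sketch of that calculation; the function name, counter values, and 60-second scrape interval are illustrative, not part of the Privacera metrics API.

```python
def qps(prev_total, curr_total, interval_s):
    """Queries per second from two samples of a cumulative request counter.

    A counter reset (e.g. after a connector restart) shows up as a negative
    delta; in that case treat the current total as the delta, which is the
    usual convention for monotonic counters.
    """
    delta = curr_total - prev_total
    if delta < 0:  # counter reset between scrapes
        delta = curr_total
    return delta / interval_s

# Two scrapes of a hypothetical total-requests counter, 60 s apart:
print(qps(120_000, 129_000, 60))  # 150.0 queries/second
```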
Query Failures¶
Tracks:
- Error count — Total number of failed queries
- Error rate (%) — Percentage of queries that failed
- Exception categories — Classification of failure types
Spikes in failures may indicate:
- Backend system issues
- Authentication problems
- Expired tokens
- Configuration errors
Connector Common Metrics¶
JVM Metrics¶
Includes:
- Heap memory usage
- Non-heap memory
- Garbage collection (GC) pause time
- Thread count
Watch for:
- Heap memory usage consistently above 80%
- Frequent or prolonged GC pauses
- A continuously increasing thread count
CPU Usage¶
Sustained high CPU may indicate:
- Heavy query load
- Inefficient query patterns
- Insufficient capacity or need for scaling
Memory Usage¶
Monitor:
- Total memory usage
- Heap memory utilization
- Upward memory usage trends
Sudden growth may indicate:
- Memory leak
- Large result sets
- Unbounded cache growth
Thread Pool Metrics¶
Tracks:
- Active threads
- Queue depth
- Rejected tasks
Symptoms of thread exhaustion:
- Rising P95/P99 latency
- Increased timeouts
- Higher failure rates
Connection Pool Metrics¶
Includes:
- Active connections
- Idle connections
- Connection wait time
- Connection acquisition failures
| Observation | Likely cause |
|---|---|
| High connection wait time | Connection pool size may be too small |
| Frequent connection timeouts | Backend slowness or connection pool exhaustion |
| 100% utilization | Scaling or connection pool tuning required |
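The table above can be read as a simple classification rule. Below is a minimal Python sketch of that logic; the function name and the 50 ms wait threshold are assumptions for illustration, not Privacera defaults, so tune them against your own baselines.

```python
def pool_status(active, max_size, avg_wait_ms, wait_threshold_ms=50):
    """Classify connection pool health from basic pool metrics.

    Thresholds are illustrative: a pool at full utilization needs scaling
    or tuning, and sustained wait time suggests the pool is undersized.
    """
    if active >= max_size:          # 100% utilization
        return "exhausted: scale or tune the pool"
    if avg_wait_ms > wait_threshold_ms:
        return "waiting: pool may be too small"
    return "healthy"

print(pool_status(active=20, max_size=20, avg_wait_ms=120))  # exhausted
```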
Alert Categories¶
Privacera provides alerts across three main categories.
Performance Alerts¶
Triggered when:
- P95 or P99 exceeds baseline
- Average latency increases significantly
- Throughput drops unexpectedly
Example threshold guidelines:
| Condition | Severity |
|---|---|
| P95 latency > 1.5× baseline | Warning |
| P95 latency > 2× baseline | Critical |
| Error rate > 2% | Warning |
| Error rate > 5% | Critical |
| CPU usage > 75% | Warning |
| CPU usage > 90% | Critical |
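The threshold table above can be sketched as a severity check. This Python fragment is illustrative only: the function name and baseline value are assumptions, and the thresholds mirror the example guidelines rather than fixed Privacera alert rules.

```python
def severity(p95_ms, baseline_ms, error_rate_pct, cpu_pct):
    """Return the worst severity across the example threshold guidelines."""
    # Critical conditions from the table above
    if p95_ms > 2 * baseline_ms or error_rate_pct > 5 or cpu_pct > 90:
        return "critical"
    # Warning conditions from the table above
    if p95_ms > 1.5 * baseline_ms or error_rate_pct > 2 or cpu_pct > 75:
        return "warning"
    return "ok"

# Hypothetical reading against a 200 ms P95 baseline:
print(severity(p95_ms=450, baseline_ms=200, error_rate_pct=1.0, cpu_pct=60))
# critical (P95 is more than 2x baseline)
```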
Availability Alerts¶
Triggered when:
- Health checks fail
- The connector becomes unreachable
- Pod or container restarts increase unexpectedly
Resource Saturation Alerts¶
Triggered when:
- Heap memory usage exceeds 85%
- Thread pool queue depth continues to increase
- Connection pool exhaustion is detected
- CPU utilization exceeds 90%
These act as early indicators of instability or the need to scale.
Understanding P95 vs P99 in Production¶
Why Average Is Not Enough¶
Average latency hides outliers.
Example: if 99 requests complete in 100 ms and 1 request takes 5000 ms, the average can still look acceptable, while P99 will expose the slow request.
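The example above can be worked through numerically. The sketch below uses a simple index-based percentile (value at index floor(p × n)); real monitoring systems differ slightly in interpolation, so the exact P99 convention is an assumption here.

```python
def percentile(samples, p):
    """Value at index floor(p * n) of the sorted samples (one common
    convention; monitoring libraries vary in how they interpolate)."""
    s = sorted(samples)
    idx = min(int(p * len(s)), len(s) - 1)
    return s[idx]

latencies = [100] * 99 + [5000]        # 99 fast requests, 1 slow outlier (ms)
avg = sum(latencies) / len(latencies)
print(avg)                             # 149.0 ms -- looks acceptable
print(percentile(latencies, 0.95))     # 100 ms  -- outlier still hidden
print(percentile(latencies, 0.99))     # 5000 ms -- outlier exposed
```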
When to Focus on P95¶
- SLA monitoring
- Overall user experience assessment
- General performance tracking
When to Focus on P99¶
- Rare but severe latency spikes
- Backend contention analysis
- Deep SRE investigations
Troubleshooting Workflow¶
Step 1 — Identify Alert Type¶
Determine whether the issue is:
- Latency
- Error spike
- Resource saturation
- Availability
Step 2 — Correlate Metrics¶
| Observation | Likely cause |
|---|---|
| High P99 latency and high CPU utilization | Load-driven performance issue |
| High P99 latency, normal CPU, and high backend latency | Backend issue |
| High error rate with authentication-related errors | Credential misconfiguration or expired token |
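The correlation table above amounts to a triage decision tree. The Python sketch below restates it as code; the function name, boolean inputs, and fallback message are illustrative assumptions, not part of any Privacera tooling.

```python
def likely_cause(p99_high, cpu_high, backend_latency_high, auth_errors):
    """Rough triage mirroring the correlation table (illustrative only)."""
    if auth_errors:
        return "credential misconfiguration or expired token"
    if p99_high and cpu_high:
        return "load-driven performance issue"
    if p99_high and not cpu_high and backend_latency_high:
        return "backend issue"
    return "inconclusive: check connector logs"

print(likely_cause(p99_high=True, cpu_high=False,
                   backend_latency_high=True, auth_errors=False))
# backend issue
```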
Step 3 — Check Connector Logs¶
Use the Connector troubleshooting guide and look for:
- Timeout exceptions
- Authentication failures
- Backend connection failures
- OutOfMemory errors
- Thread rejection errors
Step 4 — Take Remedial Action¶
Possible actions:
- Scale the connector horizontally
- Increase CPU or memory allocation
- Increase connection pool size
- Tune JVM heap settings
- Optimize backend warehouse sizing
- Correct credential configuration issues
Capacity Planning Guidance¶
Use the dashboards to monitor trends over time. Plan scaling when you see:
- Sustained CPU above 70%
- Gradual increase in P95 latency
- Connection pool consistently near capacity
- Increasing thread queue depth
Proactive scaling
If these trends continue, plan scaling before user impact occurs.
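A "gradual increase" can be quantified with a least-squares slope over a metric series exported from the dashboards. This Python sketch assumes evenly spaced samples (e.g. one P95 reading per day); the sample values are hypothetical.

```python
def trend_slope(values):
    """Least-squares slope of an evenly spaced metric series.

    A sustained positive slope on P95 latency is a capacity planning
    signal; the units are metric-units per sample interval.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

daily_p95 = [210, 214, 219, 226, 233, 241, 250]  # ms, one sample per day
print(trend_slope(daily_p95))  # ~6.7 ms/day -- plan scaling ahead of impact
```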
Best Practices¶
- Monitor P95 and P99, not just averages.
- Set alerts based on established baselines for your environment.
- Correlate latency metrics with CPU utilization and connection pool metrics.
- Track error rate (%), not just raw error counts.
- Review performance trends regularly.
- Scale proactively using capacity planning signals.