
Event Processing Alerts

Event Processing alerts help identify performance and throughput bottlenecks in Privacera. Acting on them promptly keeps policy synchronization timely and maintains stable event flow across Privacera components. This guide covers troubleshooting for two related event processing alerts:

  1. High Event Processing Time: Individual events taking too long to process.
  2. High Event Processing Lag: Events accumulating in the processing queue.

High Event Processing Time

  • This alert is triggered when the average event processing time exceeds 30 seconds for more than 5 minutes. It indicates that the connector is taking longer than expected to process policy change events, which can lead to:
    • Delayed policy synchronization: Changes in the policy server may not be reflected promptly in the target system.
    • Resource contention: The connector may become overwhelmed with processing tasks, leading to degraded performance.
    • Memory pressure: Long-running event processing can consume excessive memory and cause out-of-memory conditions.
    • Thread pool exhaustion: Slow event processing can saturate the thread pool.

Root Causes

  • Large batch sizes: Processing too many resources or permissions in a single event.
  • Complex policy evaluations: Resource-intensive policy calculations or complex permission mappings.
  • Database performance issues: Slow database queries or connection timeouts.
  • Network latency: Slow communication with external services or APIs.
  • Resource constraints: Insufficient CPU, memory, or I/O capacity.
  • Concurrent processing conflicts: Multiple events competing for the same resources.

High Event Processing Lag

  • This alert is triggered when more than 50 events accumulate in the processing queue over 5 minutes. It indicates that events are being created faster than they can be processed, leading to:
    • Event backlog accumulation: Events queue up faster than processing capacity can handle.
    • Delayed policy synchronization: New policy changes take longer to propagate to target systems.
    • Throughput mismatch: Event creation rate exceeds processing rate.
    • Resource saturation: Processing capacity is overwhelmed by incoming event volume.

Root Causes

  • High event creation rate: Sudden spikes in policy changes, bulk operations, or scheduled jobs.
  • Insufficient processing capacity: Not enough resources allocated for event processing.
  • Processing bottlenecks: Slow database operations, API calls, or external service delays.
  • Thread pool limitations: Insufficient concurrent processing threads.
  • Resource constraints: CPU, memory, or I/O limitations affecting processing speed.
  • Configuration issues: Suboptimal batch sizes, timeout settings, or concurrency limits.
  • System overload: Multiple connectors or services competing for resources.

Troubleshooting Steps

Follow these steps to identify and resolve the issue.
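
Before working through the steps, it can help to confirm which of the two alerts is actually firing and when it started. The sketch below shows one way to do that from the command line; it assumes Grafana unified alerting is in use, and the Grafana URL, the service-account token, and the alert-name pattern are placeholders to adjust for your environment.

Bash
# List currently firing Grafana alerts whose name mentions "Event Processing".
# GRAFANA_URL and GRAFANA_TOKEN are placeholders for your Grafana endpoint and a
# service-account token with at least Viewer access.
GRAFANA_URL="https://<your-grafana-host>"
GRAFANA_TOKEN="<service-account-token>"

curl -s -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
  "${GRAFANA_URL}/api/alertmanager/grafana/api/v2/alerts" \
  | jq '.[] | select(.labels.alertname | test("Event Processing")) | {alertname: .labels.alertname, startsAt, status}'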

Step 1: Run the Diagnostics Tool: The Diagnostics Tool provides automated testing of connector functionality and performance metrics. If the Diagnostic Portal is not enabled, a manual kubectl fallback is sketched after the checks below.

  1. Open the Diagnostic Portal and navigate to Dashboard → Pods.
  2. Select the ops-server pod from the available pods list.
  3. Under the CURRENT TEST RESULTS tab, review the PyTest Report for the following checks:
    1. test_diag_client_pod_cpu_utilization: Check CPU usage patterns.
    2. test_jvm_process_cpu_utilization: Monitor JVM CPU consumption.
    3. test_diag_client_disk_space: Verify available disk space.
    4. test_system_process_cpu_utilization: Monitor overall system CPU utilization and identify resource bottlenecks.
    5. test_shared_secret_value: Verify that shared secret configuration is properly set and accessible.
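
If the Diagnostic Portal is not enabled, the same signals can be approximated manually. A minimal kubectl sketch, assuming the pod name and namespace placeholders are filled in and that the metrics server is installed for kubectl top:

Bash
# CPU and memory usage of the connector and ops-server pods (requires metrics-server)
kubectl top pods -n <NAMESPACE>

# Available disk space inside the ops-server pod
kubectl exec -it <OPS_SERVER_POD> -n <NAMESPACE> -- df -h /workdir

# Recent events and restart counts that may indicate resource pressure
kubectl describe pod <OPS_SERVER_POD> -n <NAMESPACE> | tail -n 30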

Step 2: Monitor Connector Metrics: Review the Connector-Common dashboard to identify performance bottlenecks:

  1. Navigate to Grafana → Dashboards → Application-Dashboards → connectors → common → Connector-Common.
  2. Check the following panels:
    1. Event Count - Time Series: Check for sudden spikes that correlate with high processing times and identify if volume increases happen at specific times (e.g., business hours, batch jobs).
    2. Average Event Processing Time [Including schedular delay]: Verify if the average exceeds the 30-second threshold.
    3. Resource Loading Threads Shutdown Counter: Check for thread shutdowns during resource loading and monitor if multiple events are competing for resources.
    4. Permission Loading Threads Shutdown Counter: Monitor permission loading thread health and check for thread pool exhaustion issues.

Step 3: Monitor Ops-Server Metrics: Review the Ops-Server dashboard to identify performance bottlenecks:

  1. Navigate to Grafana → Dashboards → Application-Dashboards → ops-server → Ops Server.
  2. Check the following panels:
    1. Total Events in Processing Timeline: Monitor the event queue size over time and identify patterns of backlog accumulation.
    2. Frequency of Event Creation by application and event type: Check for sudden spikes in event creation that correlate with performance issues.
    3. Event Creation Count Over Time: Verify if event creation rate is consistently high or has sudden bursts.
    4. Total Events by Application Name and Task Status: Check if specific applications or event types are causing issues.
    5. Success Rate of Event Creation: Ensure events are being created successfully and not failing.
    6. Average Event Completion Time: Monitor individual event processing times.
    7. Timeline for Average Time Taken by Event Creation: Track processing time trends over time.
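
Alongside these panels, a quick scan of the ops-server logs for errors and timeouts can help correlate dashboard spikes with concrete failures. This is only a sketch; the pod name, namespace, time window, and grep pattern are placeholders to adapt:

Bash
# Scan the last hour of ops-server logs for common failure indicators
kubectl logs <OPS_SERVER_POD> -n <NAMESPACE> --since=1h \
  | grep -Ei "error|timeout|exception" \
  | tail -n 100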

Step 4: Check Resource Utilization: Ensure the connector has adequate resources:

  1. Review the Pod Monitoring dashboard under Dashboards → Infra-Dashboards in Grafana.
  2. Check pod memory and CPU usage patterns.
  3. If resource constraints are detected, adjust resources using Compute Sizing.
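
To compare current usage against the pod's configured requests and limits from the command line, something like the following can be used (pod name and namespace are placeholders; kubectl top requires the metrics server):

Bash
# Current CPU and memory usage of the connector pod (requires metrics-server)
kubectl top pod <CONNECTOR_POD> -n <NAMESPACE>

# Configured resource requests and limits for the same pod
kubectl describe pod <CONNECTOR_POD> -n <NAMESPACE> | grep -A 3 -E "Limits|Requests"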

Step 5: Review Configuration Settings: Optimize connector configuration for better performance:

  1. Configuration File Location: privacera/privacera-manager/config/custom-vars/connectors/<connector-type>/<instance>/
  2. Key Properties to Review:

    1. Batch Size Settings: Reduce batch sizes if processing large volumes of data.
      1. Check for properties like CONNECTOR_<TYPE>_BATCH_SIZE or similar.
    2. Thread Pool Configuration: Adjust thread pool sizes for better concurrency.
      1. Check properties like CONNECTOR_<TYPE>_THREAD_POOL_SIZE.
    3. Timeout Settings: Ensure timeout values are optimized for your environment’s latency and workload.
      1. Review connection and query timeout properties.

    For the exact property names and examples, refer to your connector-specific documentation.
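
As a starting point, the properties already set for an instance can be listed and filtered for likely tuning knobs. This is only a sketch: the base path is relative to your Privacera installation, and the BATCH/THREAD/TIMEOUT patterns are illustrative, not actual property names.

Bash
# List the configuration files for one connector instance
ls privacera/privacera-manager/config/custom-vars/connectors/<connector-type>/<instance>/

# Filter for batch-size, thread-pool, and timeout related properties
grep -RinE "BATCH|THREAD|TIMEOUT" \
  privacera/privacera-manager/config/custom-vars/connectors/<connector-type>/<instance>/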

Escalation Checklist

If the issue cannot be resolved through the specific troubleshooting guides, escalate it to Privacera Support with the following details. For additional assistance, refer to How to Contact Support for detailed guidance on reaching the support team.

  • Timestamp of the error: Include the exact time the alert was triggered.
  • Alert type: Specify whether it is High Event Processing Time or High Event Processing Lag.
  • Grafana dashboard and alert screenshots:
    • High Event Processing Time:
      • Grafana → Dashboards → Application-Dashboards → connectors → common → Connector-Common
      • Grafana → Alerting → Alert rules → High Event Processing Time Alert
    • High Event Processing Lag:
      • Grafana → Dashboards → Application-Dashboards → ops-server → Ops Server
      • Grafana → Alerting → Alert rules → High Event Processing Lag Alert
  • Ops-Server Service Logs: Include any logs showing event processing delays, queue buildup, or performance issues.

    Option 1: Download Logs from the Diagnostic Portal (Recommended)

    1. Open the Diagnostic Portal and navigate to Dashboard → Pods.
    2. Select the ops-server pod from the available pods list.
    3. Click the Logs tab and download the logs by clicking the DOWNLOAD LOGS button.

    Option 2: Manual Log Collection (If Diagnostic Service is Not Enabled)

    Bash
    # Create log archive
    kubectl exec -it <OPS_SERVER_POD> -n <NAMESPACE> -- bash -c "cd /workdir/logs/ && tar -czf ops-server-logs.tar.gz *.log"
    
    # Copy the archive from the pod to the local machine
    kubectl cp <OPS_SERVER_POD>:/workdir/logs/ops-server-logs.tar.gz ./ops-server-logs.tar.gz -n <NAMESPACE>
    
    # Extract logs
    tar -xzf ops-server-logs.tar.gz
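    
    # Optionally, list the archive contents to confirm the expected log files were captured
    tar -tzf ops-server-logs.tar.gz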
    
  • Configuration Files: Attach relevant configuration files (e.g., properties files) with sensitive information masked.

  • Performance Metrics: Include screenshots of the relevant panels from the Ops-Server dashboard:
    • Processing Time alerts: Average Event Completion Time, Timeline for Average Time Taken
    • Lag alerts: Total Events in Processing Timeline, Frequency of Event Creation
  • Resource Utilization: Provide CPU and memory usage graphs from the Pod Monitoring dashboard.
  • Event Volume Analysis: Include data showing event creation rates and processing rates over time.

Ensure all sensitive information in configuration files and logs is masked before sharing.