Changelog Task Queue Alert

Changelog Task Queue Alert is triggered when the changelog processor queue count stays above the threshold of 30 tasks for 5 consecutive minutes. It indicates that changelog tasks (policy change events) are accumulating in the processing queue faster than they can be processed, which can lead to the following (a query sketch for checking the live queue size follows this list):

  • Delayed policy synchronization: Policy changes from the policy server may not be reflected promptly in the target system.
  • Queue backlog accumulation: Changelog tasks queue up faster than processing capacity can handle.
  • Resource contention: The connector may become overwhelmed with queued tasks, leading to degraded performance.
  • Processing bottlenecks: Slow processing of changelog tasks can indicate underlying performance issues.
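
The alert condition can also be checked ad hoc against the metrics backend that feeds the Grafana panels. The following is a minimal sketch only: it assumes a Prometheus-compatible endpoint at http://prometheus:9090 and a hypothetical queue-size metric named connector_changelog_queue_size; copy the real expression from the Changelog Task Queue Alert rule in Grafana before relying on it.

    Bash
    # Current queue size (endpoint and metric name are assumptions; take the
    # actual expression from the Grafana alert rule)
    curl -sG 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=connector_changelog_queue_size'

    # Returns a result only if the queue stayed above 30 for the whole last
    # 5 minutes, mirroring the alert condition
    curl -sG 'http://prometheus:9090/api/v1/query' \
      --data-urlencode 'query=min_over_time(connector_changelog_queue_size[5m]) > 30'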

Root Cause

  • High event creation rate: Sudden spikes in policy changes, bulk operations, or scheduled jobs creating more changelog tasks than can be processed.
  • Slow changelog processing: Individual changelog tasks taking longer than expected to process due to:
    • Database performance issues (slow queries, connection timeouts)
    • Complex policy evaluations or resource-intensive operations
    • Resource constraints (insufficient CPU, memory, or I/O capacity)

Troubleshooting Steps

Follow these steps to identify and resolve the issue.

Step 1: Run the Diagnostics Tool: The Diagnostics Tool provides automated testing of connector functionality and performance metrics; a command-line fallback for these checks is sketched after the list below.

  1. Open the Diagnostic Portal and navigate to Dashboard → Pods.
  2. Select the connector pod from the available pods list.
  3. Under the CURRENT TEST RESULTS tab, review the PyTest Report for the following checks:
    1. test_diag_client_pod_cpu_utilization: Check CPU usage patterns.
    2. test_jvm_process_cpu_utilization: Monitor JVM CPU consumption.
    3. test_diag_client_disk_space: Verify available disk space.
    4. test_system_process_cpu_utilization: Monitor overall system CPU utilization and identify resource bottlenecks.
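
If the Diagnostic Portal is unavailable, the checks above can be approximated from the command line. This is a minimal sketch, assuming kubectl access to the connector namespace and a working Kubernetes metrics-server; pod and namespace names are placeholders.

    Bash
    # CPU and memory usage of the connector pod (requires metrics-server)
    kubectl top pod <CONNECTOR_POD> -n <NAMESPACE>

    # Node-level utilization, to spot system-wide CPU bottlenecks
    kubectl top nodes

    # Disk space available inside the connector container
    kubectl exec -it <CONNECTOR_POD> -n <NAMESPACE> -- df -h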

Step 2: Monitor Changelog Queue Metrics: Review the Connector-Common dashboard to identify queue buildup:

  1. Navigate to Grafana → Dashboards → Application-Dashboards → connectors → common → Connector-Common.
  2. Check the following panels:
    1. Count of Changelog Processor Queue Task: Verify if the queue count exceeds the threshold of 30.
    2. Changelog Processor Queue Task - Time Series: Monitor the queue size over time and identify patterns of backlog accumulation.
    3. Event Count - Time Series: Check for sudden spikes in event creation that correlate with queue buildup.
    4. Average Event Processing Time [Including scheduler delay]: Verify whether processing times are elevated, which could indicate slow task processing.

Step 3: Monitor Ops-Server Metrics: Review the Ops-Server dashboard to identify event creation patterns:

  1. Navigate to Grafana → Dashboards → Application-Dashboards → ops-server → Ops Server.
  2. Check the following panels:
    1. Frequency of Event Creation by application and event type: Check for sudden spikes in event creation that correlate with queue issues.
    2. Event Creation Count Over Time: Verify if event creation rate is consistently high or has sudden bursts.
    3. Total Events by Application Name and Task Status: Check if specific applications or event types are causing issues.
    4. Success Rate of Event Creation: Ensure events are being created successfully and not failing.

Step 4: Check Resource Utilization: Ensure the connector has adequate resources:

  1. Review the Pod Monitoring dashboard under Dashboards → Infra-Dashboards in Grafana.
  2. Check pod memory and CPU usage patterns.
  3. If resource constraints are detected, adjust resources using Compute Sizing.
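
Before resizing, it can help to compare the observed usage with the pod's configured requests and limits, and to check for memory-related restarts. A minimal sketch, assuming kubectl access; pod and namespace names are placeholders.

    Bash
    # Configured CPU/memory requests and limits for the connector pod
    kubectl get pod <CONNECTOR_POD> -n <NAMESPACE> \
      -o jsonpath='{.spec.containers[*].resources}{"\n"}'

    # Look for OOMKilled restarts that indicate memory pressure
    kubectl describe pod <CONNECTOR_POD> -n <NAMESPACE> | grep -A 5 'Last State'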

Step 5: Review Configuration Settings: Optimize connector configuration for better changelog processing:

  1. Configuration File Location: ~/privacera/privacera-manager/config/custom-vars/connectors/<connector-type>/<instance>/
  2. Key Properties to Review:

    1. Changelog Queue Threshold: Check if the threshold is appropriate for your workload.
      • Property: CONNECTOR_ON_DEMAND_TASK_CHANGELOG_QUEUE_THRESHOLD
      • This threshold controls backpressure - when exceeded, new task fetching may be paused.
    2. Thread Pool Configuration: Adjust thread pool sizes for better concurrency in changelog processing.
      • Check properties like CONNECTOR_<TYPE>_THREAD_POOL_SIZE.
    3. Timeout Settings: Ensure timeout values are optimized for your environment's latency and workload.
      • Review connection and query timeout properties.
    4. Batch Size Settings: Adjust batch sizes if processing large volumes of changelog tasks.
      • Check for the property CONNECTOR_ON_DEMAND_TASK_BATCH_SIZE.

    For the exact property names and examples, refer to your connector-specific documentation.
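
To get a quick view of the current values, the relevant properties can be searched within the instance's custom-vars directory. A minimal sketch; the name patterns are taken from the list above and the exact property names vary by connector type.

    Bash
    # Review changelog-related settings for a connector instance
    cd ~/privacera/privacera-manager/config/custom-vars/connectors/<connector-type>/<instance>/

    # Queue threshold, thread pool, timeout, and batch size properties
    grep -riE 'CHANGELOG_QUEUE_THRESHOLD|THREAD_POOL_SIZE|TIMEOUT|BATCH_SIZE' .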

Step 6: Investigate Processing Delays: If the queue is building up, investigate why tasks are processing slowly:

  1. Database Performance: Check for slow database queries or connection issues.
    • Review JDBC error rates in the Connector JDBC Metrics dashboard.
    • Verify database connection pool settings and timeout values.
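
As a quick first pass, the connector logs can be scanned for database-related errors before digging into the JDBC dashboards. A minimal sketch, using the log path shown in the Escalation Checklist below; the search terms are generic assumptions and may differ in your deployment.

    Bash
    # Scan connector logs for common database error signatures (search terms are assumptions)
    kubectl exec -it <CONNECTOR_POD> -n <NAMESPACE> -- bash -c \
      "grep -riE 'SQLException|connection.*(timeout|refused)|slow query' /workdir/policysync/logs/*.log | tail -n 50"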

Escalation Checklist

If the issue cannot be resolved through the specific troubleshooting guides, escalate it to Privacera Support with the following details. For additional assistance, refer to How to Contact Support for detailed guidance on reaching out to the support team.

  • Timestamp of the error: Include the exact time the alert was triggered.
  • Grafana dashboard and alert screenshots:
    1. Grafana → Dashboards → Application-Dashboards → connectors → common → Connector-Common
    2. Grafana → Alerting → Alert rules → Changelog Task Queue Alert
  • Connector Service Logs: Include any logs showing changelog processing delays, queue buildup, or performance issues.

    Option 1: Download Log from Diagnostic Portal (Recommended)

    1. Open the Diagnostic Portal and navigate to Dashboard → Pods.
    2. Select the connector pod from the available pods list.
    3. Click the Logs tab and download the logs by clicking the DOWNLOAD LOGS button.

    Option 2: Manual Log Collection (If Diagnostic Service is Not Enabled)

    Bash
    # Create log archive
    kubectl exec -it <CONNECTOR_POD> -n <NAMESPACE> -- bash -c "cd /workdir/policysync/logs/ && tar -czf connector-logs.tar.gz *.log"
    
    # Copy the archive from the pod to the local machine
    kubectl cp <CONNECTOR_POD>:/workdir/policysync/logs/connector-logs.tar.gz ./connector-logs.tar.gz -n <NAMESPACE>
    
    # Extract logs
    tar -xzf connector-logs.tar.gz
    
  • Configuration Files: Attach relevant configuration files (e.g., properties files) with sensitive information masked.

  • Performance Metrics: Include screenshots of the relevant panels from the Connector-Common dashboard:
    1. Count of Changelog Processor Queue Task: Current queue size
    2. Changelog Processor Queue Task - Time Series: Queue size trends over time
    3. Average Event Processing Time: Processing time metrics
  • Resource Utilization: Provide CPU and memory usage graphs from the Pod Monitoring dashboard.

Ensure all sensitive information in configuration files and logs is masked before sharing.