RocksDB Total Memory Usage Alert¶
The RocksDB Total Memory Usage alert is triggered when the total estimated memory usage of RocksDB exceeds the 1GB threshold for more than 5 minutes. This alert indicates that RocksDB, the embedded key-value storage engine used by connectors, is consuming excessive memory, which can lead to:
- Out of Memory (OOM) errors: The connector pod may be killed by Kubernetes when memory limits are exceeded.
- Performance degradation: High memory usage can increase garbage collection pressure and slow down connector operations.
- System instability: Memory pressure can negatively impact other components running on the same node.
- Data loss risk: If the pod is terminated due to an OOM event, in-flight operations may be lost.
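As context for the troubleshooting steps below, the alert condition amounts to a simple threshold rule over the estimated-memory metric. The following Prometheus-style rule is only a minimal sketch of that condition, not the rule shipped with the product; the metric name rocksdb_total_estimated_memory_bytes is an assumption, and the actual definition can be viewed under Grafana → Alerting → Alert rules → RocksDB Total Memory Usage.

```yaml
# Minimal sketch of the alert condition described above (assumed metric name).
groups:
  - name: rocksdb-memory
    rules:
      - alert: RocksDBTotalMemoryUsage
        # Fire when total estimated RocksDB memory stays above ~1 GB for 5 minutes.
        expr: rocksdb_total_estimated_memory_bytes > 1e9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RocksDB total estimated memory has exceeded 1GB for more than 5 minutes"
```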
Root Cause¶
RocksDB's memory usage can exceed the threshold due to several reasons:
- Large dataset size: The connector is managing a large number of resources, permissions, or metadata entries, causing RocksDB to allocate more memory for its internal data structures, such as memtables, block cache, and table readers.
- High write throughput: Frequent writes to RocksDB (e.g., during resource discovery, permission synchronization, or event processing) can cause memtables to grow and consume more memory before being flushed to disk.
- Inefficient cache configuration: Block cache or table reader cache sizes may be too large, or cache hit rates may be low, leading to unnecessary memory consumption.
- Memory leaks: Long-running processes may accumulate memory over time due to improper cleanup of RocksDB resources.
- Compaction backlog: If RocksDB compaction falls behind, more SST files may remain in memory, increasing table reader memory usage.
- Resource constraints: Insufficient memory allocation for the connector pod, causing RocksDB to compete with other components for available memory.
- Configuration issues: Suboptimal RocksDB configuration parameters (e.g., write buffer size, block cache size) that do not match the workload characteristics.
Troubleshooting Steps¶
Follow these steps to identify and resolve the issue.
Step 1: Run the Diagnostics Tool: The Diagnostics Tool provides automated testing of connector functionality and performance metrics.
- Open the Diagnostic Portal and navigate to Dashboard → Pods.
- Select the connector pod from the available pods list.
- Under the CURRENT TEST RESULTS tab, review the PyTest Report for the following checks:
- test_diag_client_pod_cpu_utilization: Check CPU usage patterns that may correlate with memory pressure.
- test_jvm_process_cpu_utilization: Monitor JVM CPU consumption and garbage collection activity.
- test_diag_client_disk_space: Verify available disk space, as low disk space can prevent RocksDB from flushing memtables.
- test_system_process_cpu_utilization: Monitor overall system CPU utilization and identify resource bottlenecks.
Step 2: Monitor RocksDB Memory Metrics: Review the Connector-Common dashboard to identify memory consumption patterns:
- Navigate to Grafana → Dashboards → Application-Dashboards → connectors → common → Connector-Common.
- Check the RocksDB Memory Metrics section and review the following panels:
- Total Estimated Memory: Verify whether memory usage exceeds the 1GB threshold, and monitor the trend over time.
- Memory Components Breakdown (Stacked Timeseries): Identify which component (Memtables, Block Cache, Table Readers) is consuming the most memory:
- High Memtables memory: Indicates a write-heavy workload or slow memtable flushes to disk.
- High Block Cache memory: May indicate large cache size configuration or low cache hit rates.
- High Table Readers memory: Suggests that many SST files are being kept open in memory, possibly due to a compaction backlog.
- Block Cache Hit Ratio: Check whether the hit ratio is below 80%, which may indicate inefficient cache usage or an undersized cache.
- Memtable Hit Ratio: Verify whether the memtable hit ratio is healthy, as low ratios may indicate memory pressure.
- Combined Hit/Miss: Track both hit and miss rates to understand access patterns and evaluate cache effectiveness.
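The hit-ratio panels are derived from hit and miss counters: hit ratio = hits / (hits + misses). If you need the same number outside Grafana, a recording rule along these lines can reproduce it; the counter names below are assumptions and may not match the series exposed by your connector.

```yaml
# Sketch of a recording rule for a block cache hit ratio (assumed counter names).
groups:
  - name: rocksdb-cache
    rules:
      - record: rocksdb:block_cache_hit_ratio
        # hit ratio = hits / (hits + misses), over a 5-minute window
        expr: |
          rate(rocksdb_block_cache_hit_total[5m])
            /
          (rate(rocksdb_block_cache_hit_total[5m]) + rate(rocksdb_block_cache_miss_total[5m]))
```

A sustained value below 0.8 corresponds to the sub-80% condition called out above and usually points at an undersized or thrashing block cache.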
Step 3: Monitor RocksDB Disk Size Metrics: Review disk usage to understand the relationship between memory and disk:
- In the same Connector-Common dashboard, review the RocksDB Disk Size Metrics section:
- Total SST Files Size: Verify whether disk usage is increasing rapidly, as this may correlate with memory growth.
- Disk Space Efficiency Ratio: Check whether the ratio is below 60%, which may indicate high disk waste and contribute to memory pressure.
- Disk Size by Column Family: Identify the column families consuming the most disk space to help pinpoint potential sources of memory growth.
Step 4: Check Resource Utilization: Ensure the connector has adequate resources:
- Review the Pod Monitoring dashboard under Dashboards → Infra-Dashboards in Grafana.
- Check pod memory and CPU usage patterns:
- Verify whether the pod is approaching its memory limits.
- Look for memory spikes that correlate with RocksDB memory growth.
- Monitor garbage collection frequency and duration, as elevated GC activity may indicate memory pressure.
- If resource constraints are detected, adjust resources using Compute Sizing:
- Increase pod memory limits to accommodate RocksDB growth.
- Ensure sufficient headroom above the 1GB threshold to prevent out-of-memory (OOM) errors.
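Memory adjustments are made through Compute Sizing rather than by editing pod specs directly, but the setting being tuned corresponds to a standard Kubernetes resources block. The values below are illustrative only; size them to your dataset while keeping comfortable headroom above the 1GB RocksDB threshold.

```yaml
# Illustrative connector pod resource settings (example values only;
# apply actual changes through Compute Sizing).
resources:
  requests:
    memory: "2Gi"   # baseline request, already above the 1GB RocksDB threshold
    cpu: "500m"
  limits:
    memory: "4Gi"   # hard limit; exceeding it results in an OOM kill
    cpu: "1"
```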
Step 5: Review Configuration Settings: Optimize the connector workload and RocksDB configuration to reduce memory pressure.
- Configuration File Location: privacera/privacera-manager/config/custom-vars/connectors/<connector-type>/<instance>/
- Key Properties to Review:
    - RocksDB Configuration: Review RocksDB tuning properties that influence memory usage.
        - ROCKSDB_DB_WRITE_BUFFER_SIZE: Controls the size of write buffers. If memtables are consuming excessive memory, consider reducing this value. The default value (0) enables automatic sizing.
        - ROCKSDB_MAX_BACKGROUND_JOBS: Controls the number of background jobs for compaction and flushing. If compaction is falling behind, consider increasing this value. The default is 2.
        - ROCKSDB_ALLOW_CONCURRENT_MEMTABLE_WRITE: Enables concurrent writes to memtables (default: true). Disabling this setting may reduce memory pressure but can negatively impact write throughput.
    - Connector Workload Settings: Optimize sync intervals and batch sizes to reduce RocksDB write frequency:
        - SYNC_RESOURCE_INTERVAL_SEC: Resource sync interval in seconds (default: 60). Increasing this interval reduces the frequency of writes to RocksDB.
        - SYNC_SERVICEUSER_INTERVAL_SEC: Service user sync interval in seconds (default: 900). Increase this value if user synchronization is causing high write rates.
        - SYNC_SERVICEPOLICY_INTERVAL_SEC: Service policy sync interval in seconds (default: 1800). Increase this value if policy synchronization is causing high write rates.
        - CONNECTOR_ON_DEMAND_TASK_BATCH_SIZE: Batch size for on-demand task processing. Reducing this value may help if large batches cause memory spikes.
        - CONNECTOR_ON_DEMAND_TASK_CHANGELOG_QUEUE_THRESHOLD: Changelog queue threshold. If this queue is frequently full, it may indicate high write pressure.
Property names and examples vary by connector. Refer to the connector-specific documentation for exact details. Any changes to these settings require a connector pod restart to take effect.
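As a starting point, the sketch below shows how these properties might look in a connector custom-vars file. The property names come from the list above; the values and the YAML key/value form are illustrative assumptions, since the exact file name and format vary by connector.

```yaml
# Illustrative tuning values (examples only; verify names and format
# against the connector-specific documentation).

# RocksDB tuning
ROCKSDB_DB_WRITE_BUFFER_SIZE: "0"                 # 0 = automatic sizing (default)
ROCKSDB_MAX_BACKGROUND_JOBS: "4"                  # raised from the default of 2 if compaction lags
ROCKSDB_ALLOW_CONCURRENT_MEMTABLE_WRITE: "true"   # default; disabling may trade throughput for memory

# Connector workload tuning
SYNC_RESOURCE_INTERVAL_SEC: "120"                 # default 60; longer interval = fewer RocksDB writes
SYNC_SERVICEUSER_INTERVAL_SEC: "900"              # default
SYNC_SERVICEPOLICY_INTERVAL_SEC: "1800"           # default
```

Restart the connector pod after any change so the new values take effect.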
Step 6: Investigate Memory Growth Patterns: Analyze memory usage trends to identify the root cause:
- Monitor Memory Trends Over Time:
- In the Connector-Common dashboard, review the Total Estimated Memory panel over extended periods (hours/days).
- Check whether memory usage is:
- Continuously growing: This may indicate a memory leak or unbounded dataset growth. If observed, escalate to support with memory trend graphs.
- Spiking periodically: This may correlate with scheduled sync operations or batch jobs. Review sync intervals and batch sizes.
- Stable but high: This may indicate normal operation with a large dataset. Consider increasing pod memory limits if memory usage is approaching the OOM threshold.
- Correlate with Event Processing:
- Review the Event Count - Time Series panel in the Connector-Common dashboard.
- Check whether memory growth correlates with periods of high event processing rates.
- If event processing is causing memory spikes, consider adjusting batch sizes or processing intervals to reduce write pressure.
- Analyze Compaction Backlog:
- Review the Disk Size by Column Family panel to determine whether disk usage is growing rapidly.
- If the Disk Space Efficiency Ratio is below 60%, this indicates high disk waste, which may correlate with a compaction backlog.
- High Table Readers memory combined with growing disk usage suggests compaction may be falling behind.
- If compaction is suspected, review ROCKSDB_MAX_BACKGROUND_JOBS configuration and consider increasing it.
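If you want the "continuously growing" case flagged automatically instead of read off the dashboard, a trend rule over the same metric is one option. This is only a sketch, reusing the assumed metric name from the earlier example; deriv() estimates the slope of the gauge, so a positive slope sustained for hours suggests steady growth rather than a periodic spike.

```yaml
# Optional sketch: flag sustained RocksDB memory growth (assumed metric name).
groups:
  - name: rocksdb-memory-trend
    rules:
      - alert: RocksDBMemoryContinuousGrowth
        # A positive slope (bytes/second over the last hour) sustained for 6 hours
        # points at a leak or unbounded dataset rather than a sync-driven spike.
        expr: deriv(rocksdb_total_estimated_memory_bytes[1h]) > 0
        for: 6h
        labels:
          severity: info
        annotations:
          summary: "RocksDB estimated memory has been growing steadily for 6 hours"
```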
Escalation Checklist¶
If the issue cannot be resolved using the troubleshooting steps above, escalate it to Privacera Support with the following details. For additional assistance, refer to How to Contact Support for detailed guidance on reaching out to the support team.
- Timestamp of the error: Include the exact time the alert was triggered.
- Grafana dashboard and alert screenshots:
- Grafana → Dashboards → Application-Dashboards → connectors → common → Connector-Common
- RocksDB Memory Metrics section:
- Total Estimated Memory
- Memory Components Breakdown
- Block Cache Hit Ratio
- Memtable Hit Ratio
- RocksDB Disk Size Metrics section:
- Total SST Files Size
- Disk Space Efficiency Ratio
- Disk Size by Column Family
- Grafana → Alerting → Alert rules → RocksDB Total Memory Usage
- Connector Service Logs: Include any logs showing memory-related errors, OOM errors, or performance issues.
Option 1: Download Logs from the Diagnostic Portal (Recommended)
- Open the Diagnostic Portal and navigate to Dashboard → Pods.
- Select the connector pod from the available pods list.
- Click the Logs tab and download the logs by clicking the DOWNLOAD LOGS button.
Option 2: Manual Log Collection (If Diagnostic Service is Not Enabled)
- Configuration Files: Attach relevant configuration files (for example, properties files and JVM settings), ensuring all sensitive information is masked.
- Performance Metrics: Include screenshots of the relevant panels from the Connector-Common dashboard:
- Total Estimated Memory: Current and historical memory usage trends
- Memory Components Breakdown: Breakdown of memory consumption by component
- Block Cache Hit Ratio and Memtable Hit Ratio: Cache performance metrics
- Disk Size Metrics: Disk usage and efficiency ratios
- Resource Utilization: Provide CPU and memory usage graphs from the Pod Monitoring dashboard, including:
- Pod memory usage over time
- JVM heap usage
- Memory Growth Analysis: Include data showing memory usage trends over time (hours/days) to help determine whether memory is continuously growing or experiencing temporary spikes.
Ensure all sensitive information in configuration files and logs is masked before sharing.