Handling of Part Files from MapReduce Jobs¶
Overview¶
Part File Tag Propagation is a feature in Privacera Discovery that propagates classification tags from scanned part files to their parent folders through a two-phase process. This feature is particularly useful for big data environments where data is stored in partitioned formats produced by Spark, Hive, or other distributed processing frameworks.
How It Works at a High Level¶
- Phase 1 - Scanning: Part files are scanned and classified, with tags applied directly to individual part files
- Phase 2 - Background Processing: A separate Folder Tagger service runs in the background to aggregate and propagate tags from part files to their parent folders
When part files within a folder are classified with sensitive data tags (such as EMAIL, SSN, or CREDIT_CARD), these tags are first applied to the individual files. Later, a background process aggregates these tags and applies them to the parent folder. This provides an efficient way to understand the classification of entire datasets without scanning every individual file.
Two-Phase Process with Time Lag
Part files are tagged immediately after scanning, but folder-level tags appear only after the background Folder Tagger service finishes processing, so expect a delay between the two.
What are Part Files?¶
Part files are output files generated by distributed data processing frameworks when writing data to filesystems. These files typically follow naming patterns such as:
- `part-00000`, `part-00001`, `part-r-00000`, `part-m-00001`
- `0_0`, `1_0`, `100_5`
When working with large datasets, MapReduce automatically splits the output into multiple part files, all stored within a single parent folder that represents the final output. Scanning each part file individually can be resource-intensive.
To improve efficiency, folder-level tag propagation works together with the Quick Scan Limit feature: classification tags from a sampled subset of part files (sized by the Quick Scan Limit) are applied at the folder level. Access Control policies can then rely on the folder-level tags, so each part file within the parent folder does not need to be tagged.
How It Works¶
The tag propagation feature operates in two phases:
Phase 1: Part File Scanning and Classification¶
During resource scanning, part files are scanned and classified individually:
- Part File Detection: The system identifies part files using regex patterns:
  - Default patterns: `^part-.*$` and `^[0-9]+_[0-9]+$`
  - These patterns cover common part file naming conventions from Spark, Hive, and other distributed frameworks
- Part File Scanning: Discovery applies classification rules to each part file
  - If Quick Scan is enabled (`DISCOVERY_QUICK_SCAN_ENABLE` and `DISCOVERY_QUICK_SCAN_LIMIT`, default: 10), only a sample of part files are scanned
  - Each scanned part file is classified based on its content and metadata
  - Classification tags (such as EMAIL, SSN, CREDIT_CARD) are assigned to individual part files
- Individual File Tagging: Tags are applied directly to the scanned part files in the Data Inventory
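To make the detection step concrete, here is a minimal Python sketch that tests file names against the two default patterns listed above. It assumes the patterns are matched against the file name only (the last path component), which this page does not state explicitly; `is_part_file` and the sample paths are hypothetical names used for illustration, not part of Discovery.

```python
import re
from pathlib import PurePosixPath

# Default part file patterns as listed in this section.
PART_FILE_PATTERNS = [
    re.compile(r"^part-.*$"),
    re.compile(r"^[0-9]+_[0-9]+$"),
]


def is_part_file(path: str) -> bool:
    """Return True if the file name matches one of the default patterns.

    Assumption: only the last path component is matched, not the full path.
    """
    name = PurePosixPath(path).name
    return any(pattern.match(name) for pattern in PART_FILE_PATTERNS)


# Hypothetical paths for illustration.
for candidate in [
    "/data/out/part-00000",
    "/data/out/part-r-00000",
    "/data/out/100_5",
    "/data/out/_SUCCESS",
]:
    print(candidate, is_part_file(candidate))
```

Files such as `_SUCCESS` match neither pattern, so under this sketch they would not be treated as part files.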
Quick Scan Feature
When Quick Scan is enabled (e.g., DISCOVERY_QUICK_SCAN_ENABLE: "true" and DISCOVERY_QUICK_SCAN_LIMIT: "10"), only a randomly sampled subset of part files (e.g., 10 files per parent folder) is scanned and tagged. This reduces scan time; how well the resulting tags reflect the full dataset depends on the sample size.
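The per-folder sampling described above can be pictured with a short Python sketch. This is an illustration of the idea only, not Discovery's implementation: `sample_part_files`, its `seed` parameter, and the example paths are assumptions introduced here.

```python
import random
from collections import defaultdict
from pathlib import PurePosixPath


def sample_part_files(paths, limit=10, seed=None):
    """Pick at most `limit` part files per parent folder, chosen at random."""
    rng = random.Random(seed)
    by_folder = defaultdict(list)
    for path in paths:
        by_folder[str(PurePosixPath(path).parent)].append(path)
    # Folders with fewer files than the limit are scanned in full.
    return {
        folder: sorted(rng.sample(files, min(limit, len(files))))
        for folder, files in by_folder.items()
    }


# Hypothetical folders: 25 part files in one, a single part file in another.
paths = [f"/data/orders/part-{i:05d}" for i in range(25)]
paths.append("/data/users/part-00000")

sampled = sample_part_files(paths, limit=10, seed=0)
print({folder: len(files) for folder, files in sampled.items()})
```

Because the limit applies per parent folder, a folder with 25 part files contributes 10 scanned files here, while a folder with a single part file contributes that one file.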
Phase 2: Background Folder Tagger Processing¶
After part files are tagged, a separate background process propagates tags to parent folders:
- Folder Tagger Service: A background service (`FolderLevelTagger`) runs periodically to process folders containing part files
- Tag Propagation: The service identifies parent folders with tagged part files and propagates tags upward
  - Tags from all child part files are aggregated at the folder level
  - Duplicate tags are automatically eliminated
  - The parent folder's metadata is updated with aggregated tags
- Time Lag: There is a time delay between when part files are tagged and when folder-level tags appear
  - Part files are tagged immediately after scanning
  - Folder-level tags appear after the background Folder Tagger service completes processing
  - The delay depends on service configuration and system load
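The aggregation step can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the `FolderLevelTagger` implementation: `propagate_tags` and the input mapping of file paths to tags are hypothetical, and tags are propagated only to the immediate parent folder.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def propagate_tags(file_tags):
    """Aggregate part file tags into one de-duplicated tag list per parent folder."""
    folder_tags = defaultdict(set)
    for path, tags in file_tags.items():
        # A set drops duplicate tags contributed by different part files.
        folder_tags[str(PurePosixPath(path).parent)].update(tags)
    return {folder: sorted(tags) for folder, tags in folder_tags.items()}


# Hypothetical scan results for illustration.
file_tags = {
    "/data/orders/part-00000": ["EMAIL", "SSN"],
    "/data/orders/part-00001": ["EMAIL", "CREDIT_CARD"],
    "/data/users/part-00000": ["EMAIL"],
}
print(propagate_tags(file_tags))
```

In this example the `/data/orders` folder ends up with EMAIL once, even though two part files carry that tag.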
This Feature Must Be Enabled
The FolderLevelTagger service is disabled by default. Set DISCOVERY_FOLDER_TAGGER_ENABLE: "true" in Discovery configuration to enable it.
Tag Aggregation and Updates¶
When multiple part files exist in a folder:
- Tags from all scanned part files (including sampled files from Quick Scan) are aggregated at the folder level
- Duplicate tags are eliminated
- During re-scans, folder tags are updated based on new scan results
- Manual tags added by users to folders are preserved
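One way to picture the re-scan behavior is to keep propagated and manual tags in separate sets, as in the Python sketch below. This is an assumption made for illustration: the page does not say how Discovery stores tags, or whether a re-scan replaces or merges previously propagated tags (the sketch replaces them). `FolderTags` and the `FINANCE` tag are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FolderTags:
    propagated: set = field(default_factory=set)  # aggregated from part files
    manual: set = field(default_factory=set)      # added by users

    def apply_rescan(self, part_file_tags):
        """Recompute propagated tags from the latest scan results.

        Assumption: new results replace earlier propagated tags.
        Manual tags are never touched.
        """
        self.propagated = set().union(*part_file_tags)

    def effective(self):
        """Return the de-duplicated tags visible on the folder."""
        return sorted(self.propagated | self.manual)


folder = FolderTags(manual={"FINANCE"})
folder.apply_rescan([["EMAIL", "SSN"], ["EMAIL"]])
print(folder.effective())

folder.apply_rescan([["EMAIL"]])
print(folder.effective())
```

Keeping the two sets apart means a re-scan can change the propagated tags freely while the manual `FINANCE` tag survives both scans.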
Supported Data Sources¶
| Data Source | Supported |
|---|---|
| AWS S3 | ✅ |
| Google Cloud Storage | ✅ |
| Azure Data Lake Storage | ✅ |
Viewing Propagated Tags¶
To view tags at different stages:
Viewing Part File Tags (Immediate)¶
After scanning completes, you can view tags on individual part files:
- Go to Discovery > Data Inventory > Classifications
- Search for the part file path (e.g., `/path/to/folder/part-00000`)
- View classification tags assigned to the individual part file
Viewing Folder-Level Tags (After Background Processing)¶
After the Folder Tagger service completes processing:
- Go to Discovery > Data Inventory > Classifications
- Search for the parent folder path (e.g., `/path/to/folder/`)
- View aggregated tags at the folder level (propagated from child part files)
Time Lag
Folder-level tags appear after the background Folder Tagger service processes the folder. This may take some time depending on system load and service configuration. Part file tags are available immediately after scanning.
Best Practices¶
- Enable for Big Data Workloads: Enable tag propagation for environments with Spark, Hive, or other partitioned data outputs
- Configure Quick Scan Limits: Use `quick.scan.limit` to optimize scanning of part files
  - Set to `5-10` for faster scans with reasonable accuracy
  - Set to `20-50` for better classification accuracy at the cost of longer scan times
  - Higher values provide more accuracy but scan more files
  - The limit applies per parent folder, randomly sampling files from each folder
- Understand the Two-Phase Process:
  - Part files are tagged immediately after scanning
  - Folder tags appear later after background processing completes
  - Plan your workflows accordingly and allow time for folder tag propagation
- Use with Quick Scan: Combine tag propagation with quick scan for optimal performance
  - Quick scan tags a randomly sampled subset of part files first
  - Background Folder Tagger propagates tags from sampled files to folders
  - This provides folder-level classification without scanning all part files
- Plan for Tag Availability:
  - Use part file tags for immediate classification results
  - Use folder tags for access control policies (available after background processing)
  - Consider the time lag when building automated workflows
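For automated workflows that depend on folder-level tags, one option is to poll until the tags appear or a timeout expires. The Python sketch below shows only the waiting logic. `fetch_folder_tags` is a hypothetical callable that you would supply yourself; this page does not describe a Discovery API for reading tags, and the timeout and polling interval are arbitrary placeholders.

```python
import time


def wait_for_folder_tags(fetch_folder_tags, folder, timeout_s=600.0, poll_s=30.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll until the folder has tags, or raise TimeoutError.

    `fetch_folder_tags` is a hypothetical callable: folder path -> list of tags.
    """
    deadline = clock() + timeout_s
    while True:
        tags = fetch_folder_tags(folder)
        if tags:
            return tags
        if clock() >= deadline:
            raise TimeoutError(f"no folder-level tags on {folder} after {timeout_s}s")
        sleep(poll_s)


# Stand-in fetcher: returns no tags twice, then the propagated tags.
responses = iter([[], [], ["EMAIL", "SSN"]])
tags = wait_for_folder_tags(lambda folder: next(responses), "/path/to/folder/",
                            sleep=lambda seconds: None)
print(tags)
```

Note that an empty result can also mean no sensitive data was found in the sampled part files, so a timeout here does not by itself indicate that propagation failed.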
Related Documentation¶
- Previous topic: Priority-Based Offline Scan
- Next topic: Customizing Scans