Handling of Part Files from MapReduce Jobs

Overview

Part File Tag Propagation is a feature in Privacera Discovery that propagates classification tags from scanned part files to their parent folders through a two-phase process. This feature is particularly useful for big data environments where data is stored in partitioned formats produced by Spark, Hive, or other distributed processing frameworks.

How It Works at a High Level

  1. Phase 1 - Scanning: Part files are scanned and classified, with tags applied directly to individual part files
  2. Phase 2 - Background Processing: A separate Folder Tagger service runs in the background to aggregate and propagate tags from part files to their parent folders

When part files within a folder are classified with sensitive data tags (such as EMAIL, SSN, or CREDIT_CARD), these tags are first applied to the individual files. Later, a background process aggregates these tags and applies them to the parent folder. This provides an efficient way to understand the classification of entire datasets without scanning every individual file.

Two-Phase Process with Time Lag

Part files are tagged immediately after scanning, but folder-level tags appear only after the background Folder Tagger service has processed the folder. Expect a delay between the two.

What are Part Files?

Part files are output files generated by distributed data processing frameworks when writing data to filesystems. These files typically follow naming patterns such as:

  • part-00000, part-00001, part-r-00000, part-m-00001
  • 0_0, 1_0, 100_5

When working with large datasets, MapReduce automatically splits the output into multiple part files, all stored within a single parent folder that represents the final output. Scanning each part file individually can be resource-intensive.

To improve efficiency, folder-level tag propagation can be combined with the Quick Scan Limit feature: Discovery scans only a sampled subset of part files (as configured by the Quick Scan Limit) and uses their classification tags to tag the parent folder. Access Control can then rely on the folder-level tags, eliminating the need to tag each part file within the parent folder.

How It Works

The tag propagation feature operates in two phases:

Phase 1: Part File Scanning and Classification

During resource scanning, part files are scanned and classified individually:

  1. Part File Detection: The system identifies part files using regex patterns:
    • Default patterns: ^part-.*$ and ^[0-9]+_[0-9]+$
    • These patterns cover common part file naming conventions from Spark, Hive, and other distributed frameworks
  2. Part File Scanning: Discovery applies classification rules to each part file
    • If Quick Scan is enabled (DISCOVERY_QUICK_SCAN_ENABLE and DISCOVERY_QUICK_SCAN_LIMIT, default: 10), only a sample of part files is scanned
    • Each scanned part file is classified based on its content and metadata
    • Classification tags (such as EMAIL, SSN, CREDIT_CARD) are assigned to individual part files
  3. Individual File Tagging: Tags are applied directly to the scanned part files in the Data Inventory
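
The default detection patterns listed above can be tried out with a short Python sketch. Only the two regular expressions come from this page. The function name is hypothetical, and matching against the file name (the last path component) rather than the full path is an assumption about how Discovery applies the patterns.

```python
import re

# The two default patterns quoted in this page.
DEFAULT_PART_FILE_PATTERNS = [r"^part-.*$", r"^[0-9]+_[0-9]+$"]
_COMPILED = [re.compile(p) for p in DEFAULT_PART_FILE_PATTERNS]


def is_part_file(path: str) -> bool:
    """Return True if the last path component matches a default pattern.

    Assumption: patterns are applied to the file name only, not the full path.
    """
    name = path.rstrip("/").rsplit("/", 1)[-1]
    return any(pattern.match(name) for pattern in _COMPILED)


if __name__ == "__main__":
    for candidate in ["/data/out/part-00000", "/data/out/100_5", "/data/out/_SUCCESS"]:
        print(candidate, is_part_file(candidate))
```

Under these assumptions, names such as part-r-00000 and 0_0 match, while marker files such as _SUCCESS do not.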

Quick Scan Feature

When Quick Scan is enabled (e.g., DISCOVERY_QUICK_SCAN_ENABLE: "true" and DISCOVERY_QUICK_SCAN_LIMIT: "10"), only a randomly sampled subset of part files (e.g., 10 files per parent folder) is scanned and tagged. This improves performance while keeping classification reasonably accurate; a higher limit scans more files and improves accuracy.
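
The per-folder sampling described above can be illustrated with a minimal sketch. It is not Discovery's implementation: the function name, the seed parameter, and the grouping by POSIX-style parent path are assumptions made for illustration.

```python
import random
from collections import defaultdict
from posixpath import dirname


def sample_part_files(paths, limit=10, seed=None):
    """Pick at most `limit` part files per parent folder, at random.

    Hypothetical helper: `limit` plays the role of DISCOVERY_QUICK_SCAN_LIMIT.
    """
    rng = random.Random(seed)

    # Group part files by their parent folder.
    by_folder = defaultdict(list)
    for path in paths:
        by_folder[dirname(path)].append(path)

    # Sample each folder independently; small folders are taken whole.
    sampled = {}
    for folder, files in by_folder.items():
        count = min(limit, len(files))
        sampled[folder] = sorted(rng.sample(files, count))
    return sampled


if __name__ == "__main__":
    files = [f"/data/orders/part-{i:05d}" for i in range(25)]
    files += [f"/data/users/{i}_0" for i in range(3)]
    for folder, chosen in sample_part_files(files, limit=10, seed=1).items():
        print(folder, len(chosen))
```

With a limit of 10, a folder holding 25 part files contributes 10 files to the scan, while a folder holding 3 contributes all 3.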

Phase 2: Background Folder Tagger Processing

After part files are tagged, a separate background process propagates tags to parent folders:

  1. Folder Tagger Service: A background service (FolderLevelTagger) runs periodically to process folders containing part files
  2. Tag Propagation: The service identifies parent folders with tagged part files and propagates tags upward
    • Tags from all child part files are aggregated at the folder level
    • Duplicate tags are automatically eliminated
    • The parent folder's metadata is updated with aggregated tags
  3. Time Lag: There is a time delay between when part files are tagged and when folder-level tags appear
    • Part files are tagged immediately after scanning
    • Folder-level tags appear after the background Folder Tagger service completes processing
    • The delay depends on service configuration and system load

This Feature Must Be Enabled

The FolderLevelTagger service is disabled by default. Set DISCOVERY_FOLDER_TAGGER_ENABLE: "true" in Discovery configuration to enable it.

Tag Aggregation and Updates

When multiple part files exist in a folder:

  • Tags from all scanned part files (including sampled files from Quick Scan) are aggregated at the folder level
  • Duplicate tags are eliminated
  • During re-scans, folder tags are updated based on new scan results
  • Manual tags added by users to folders are preserved
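
The aggregation rules above amount to a set union per parent folder. The sketch below shows that idea only. The function name and the dictionary-based inputs are hypothetical and do not reflect how the FolderLevelTagger service stores or reads tags.

```python
from posixpath import dirname


def aggregate_folder_tags(part_file_tags, manual_folder_tags=None):
    """Union the tags of scanned part files per parent folder.

    part_file_tags: {part file path: iterable of tags} from the latest scan.
    manual_folder_tags: {folder path: iterable of tags} added by users.
    Both structures are assumptions made for illustration.
    """
    manual_folder_tags = manual_folder_tags or {}
    folder_tags = {}

    # Sets eliminate duplicate tags automatically.
    for path, tags in part_file_tags.items():
        folder_tags.setdefault(dirname(path), set()).update(tags)

    # Manual tags are merged in last so they are always preserved.
    for folder, tags in manual_folder_tags.items():
        folder_tags.setdefault(folder, set()).update(tags)
    return folder_tags


if __name__ == "__main__":
    scanned = {
        "/data/out/part-00000": ["EMAIL", "SSN"],
        "/data/out/part-00001": ["EMAIL"],
    }
    print(aggregate_folder_tags(scanned, {"/data/out": ["REVIEWED"]}))
```

Calling the function again with the results of a re-scan recomputes the propagated tags from the new results while keeping the manual ones, which mirrors the update behavior listed above.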

Supported Data Sources

Part file tag propagation is supported on the following data sources:

  • AWS S3
  • Google Cloud Storage
  • Azure Data Lake Storage

Viewing Propagated Tags

To view tags at different stages:

Viewing Part File Tags (Immediate)

After scanning completes, you can view tags on individual part files:

  1. Go to Discovery > Data Inventory > Classifications
  2. Search for the part file path (e.g., /path/to/folder/part-00000)
  3. View classification tags assigned to the individual part file

Viewing Folder-Level Tags (After Background Processing)

After the Folder Tagger service completes processing:

  1. Go to Discovery > Data Inventory > Classifications
  2. Search for the parent folder path (e.g., /path/to/folder/)
  3. View aggregated tags at the folder level (propagated from child part files)

Time Lag

Folder-level tags appear after the background Folder Tagger service processes the folder. This may take some time depending on system load and service configuration. Part file tags are available immediately after scanning.

Best Practices

  1. Enable for Big Data Workloads: Enable tag propagation for environments with Spark, Hive, or other partitioned data outputs
  2. Configure Quick Scan Limits: Use quick.scan.limit to optimize scanning of part files
    • Set to 5-10 for faster scans with reasonable accuracy
    • Set to 20-50 for better classification accuracy at the cost of longer scan times, since more files are scanned
    • The limit applies per parent folder, randomly sampling files from each folder
  3. Understand the Two-Phase Process:
    • Part files are tagged immediately after scanning
    • Folder tags appear later after background processing completes
    • Plan your workflows accordingly and allow time for folder tag propagation
  4. Use with Quick Scan: Combine tag propagation with quick scan for optimal performance
    • Quick scan tags a randomly sampled subset of part files first
    • Background Folder Tagger propagates tags from sampled files to folders
    • This provides folder-level classification without scanning all part files
  5. Plan for Tag Availability:
    • Use part file tags for immediate classification results
    • Use folder tags for access control policies (available after background processing)
    • Consider the time lag when building automated workflows
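
For automated workflows that depend on folder tags, one way to account for the time lag is to poll until the tags appear or a timeout expires. The sketch below is generic: fetch_tags is a hypothetical caller-supplied function that returns the current tags for a folder, not a Privacera API, and the timeout and polling interval are arbitrary placeholder values.

```python
import time


def wait_for_folder_tags(fetch_tags, folder, timeout_s=600.0, poll_s=30.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll until `fetch_tags(folder)` returns tags or the timeout expires.

    `fetch_tags` is a hypothetical callable supplied by the caller; this
    page does not describe an API for reading folder tags.
    Returns the set of tags, or an empty set on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        tags = set(fetch_tags(folder))
        if tags:
            return tags
        if clock() >= deadline:
            return set()
        sleep(poll_s)


if __name__ == "__main__":
    # Simulated lookup: tags show up on the third poll.
    responses = iter([[], [], ["EMAIL", "SSN"]])
    found = wait_for_folder_tags(lambda _folder: next(responses),
                                 "/path/to/folder/", poll_s=0.0)
    print(sorted(found))
```

Injecting sleep and clock keeps the helper easy to test without real waiting. A workflow would then apply or verify access control policies only after a non-empty result is returned.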