Using Metadata Dictionaries in Unstructured Content Scanning
Overview
This specialized feature in Privacera Discovery extends content scanning for unstructured data by making metadata dictionaries available for classification. It applies to:
- Unstructured files: Text files, documents, PDFs, and other non-tabular data
- Database columns with unstructured content: Text-heavy columns (like descriptions, comments, or notes) that contain more than 5 tokens
When content scanning is performed on unstructured data, this feature allows metadata dictionaries (originally designed for column name matching) to also be applied to the content for enhanced classification.
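The "more than 5 tokens" threshold for database columns can be illustrated with a minimal sketch. It assumes tokens are whitespace-separated, which may not match how Discovery actually tokenizes values; the function name `looks_unstructured` is hypothetical and not part of any Privacera API.

```python
def looks_unstructured(value: str, min_tokens: int = 5) -> bool:
    """Return True when the value has more than `min_tokens` tokens.

    Assumption: tokens are whitespace-delimited. Discovery's real
    tokenizer may split values differently.
    """
    return len(value.split()) > min_tokens


if __name__ == "__main__":
    # A single email address is one token, so it is not text-heavy.
    print(looks_unstructured("jane.doe@example.com"))
    # A free-text note has many tokens, so it qualifies.
    print(looks_unstructured("Customer asked us to update the email on file"))
```

Under this assumption, a value with exactly 5 tokens does not qualify, because the threshold is strictly "more than 5".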
How It Works at a High Level
For structured data (tables, CSV files, etc.), Discovery performs two types of scanning:
- Content Scanning: Analyzes actual data values within columns using content-based rules (prefixed with c_)
- Meta Scanning: Analyzes metadata like column names using metadata-based rules (prefixed with m_)
For unstructured data (text files, documents, database columns with unstructured content, etc.), Discovery performs content scanning, and metadata dictionaries can be applied in two ways:
By Default (Without DISCOVERY_APPLY_METANAME_DICT_TO_UNSTRUCT):
- Content Scanning: Performed using content-based patterns and models (e.g., c_EMAIL, c_SSN_ML_MODEL)
- Metadata Dictionaries with m_ prefix: Applied to file name and file path matching
- Metadata Dictionaries with c_ prefix: NOT available for content scanning
With DISCOVERY_APPLY_METANAME_DICT_TO_UNSTRUCT Enabled:
- Content Scanning: Performed with all content-based patterns and models
- Metadata Dictionaries with m_ prefix: Applied to file name and file path matching (same as default)
- Metadata Dictionaries with c_ prefix: Made available to scan unstructured content (NEW capability)
- When creating unstructured rules, you can use:
  - m_<meta_dict> (e.g., m_EMAIL_KEYWORD) for file path/name matching
  - c_<meta_dict> (e.g., c_EMAIL_KEYWORD) for unstructured content scanning
Key Difference
By default, metadata dictionaries with m_ prefix can only scan file names and paths. When you enable DISCOVERY_APPLY_METANAME_DICT_TO_UNSTRUCT: "true", metadata dictionaries become available with c_ prefix to ALSO scan unstructured content (file content and database columns with unstructured data). You must explicitly create rules with c_ prefix (e.g., c_EMAIL_KEYWORD) to scan content, or use m_ prefix (e.g., m_EMAIL_KEYWORD) to match file paths/names.
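As an illustration of the difference, the following minimal sketch lists which rule names a single metadata dictionary exposes for unstructured data, depending on the flag. The function `available_dictionary_rules` and its return shape are hypothetical and only model the behavior described above; they are not a Privacera API.

```python
def available_dictionary_rules(dictionary_name: str, flag_enabled: bool) -> dict:
    """Map each usable rule name to what it scans for unstructured data.

    `flag_enabled` stands in for DISCOVERY_APPLY_METANAME_DICT_TO_UNSTRUCT
    being set to "true". This is an illustrative model, not product code.
    """
    # The m_ form is always available and targets file names and paths.
    rules = {f"m_{dictionary_name}": "file name and file path"}
    if flag_enabled:
        # The c_ form only becomes available once the flag is enabled.
        rules[f"c_{dictionary_name}"] = "unstructured content"
    return rules


if __name__ == "__main__":
    print(available_dictionary_rules("EMAIL_KEYWORD", flag_enabled=False))
    print(available_dictionary_rules("EMAIL_KEYWORD", flag_enabled=True))
```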
Enabling the Feature
For detailed configuration steps, refer to the Configuration to Use Metadata Dictionaries in Unstructured Content Scanning documentation.
Rule Configuration for Unstructured Data
Understanding Rule Prefixes
Discovery uses prefixes to distinguish between different types of classification rules:
| Prefix | Meaning | Used For | Example |
|---|---|---|---|
| m_ | Metadata | Column name matching (structured data); file path/name matching (unstructured data) | m_EMAIL_KEYWORD |
| c_ | Content | Data value matching (structured data); unstructured content matching (unstructured files and database columns) | c_EMAIL |
Creating Rules for Unstructured Data Classification
Critical: Prefix Usage for Unstructured Rules
When creating classification rules for unstructured data, you must use the appropriate prefix based on what you want to scan:
For File Path/Name Matching:
- Use m_ prefix to match against file paths and file names
- Example: m_EMAIL_KEYWORD will match if the file path or name contains keywords from the EMAIL_KEYWORD dictionary
For Unstructured Content Matching:
- Use c_ prefix to scan unstructured content (file content and database columns with unstructured data)
- Example: c_EMAIL_KEYWORD will scan the unstructured content for keywords from the EMAIL_KEYWORD dictionary
- This requires DISCOVERY_APPLY_METANAME_DICT_TO_UNSTRUCT: "true" to be enabled
Key Points:
- The system does NOT automatically convert m_ to c_ or vice versa
- Choose the appropriate prefix based on your scanning needs
- You can combine both in a single rule if you want to match both file path AND content
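The prefix behavior can be sketched with a small simulation. Everything here is an assumption made for illustration: the dictionary entries, the case-insensitive substring matching, and the `evaluate_rule` helper are hypothetical and do not reflect Discovery's actual rule syntax or matching engine.

```python
# Hypothetical dictionary entries; real dictionaries are managed in Discovery.
DICTIONARIES = {"EMAIL_KEYWORD": {"email", "e-mail"}}


def contains_keyword(text: str, keywords: set) -> bool:
    """Case-insensitive substring match (an assumed matching strategy)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def evaluate_rule(rule_names, file_path, content, flag_enabled):
    """Return True only if every listed rule name matches its own target.

    m_<dict> is checked against the file path/name, c_<dict> against the
    content. No automatic conversion between the two prefixes takes place.
    """
    for name in rule_names:
        prefix, _, dict_name = name.partition("_")
        keywords = DICTIONARIES[dict_name]
        if prefix == "m":
            matched = contains_keyword(file_path, keywords)
        elif prefix == "c":
            # c_<meta_dict> is only usable when the flag is enabled.
            matched = flag_enabled and contains_keyword(content, keywords)
        else:
            raise ValueError(f"Unknown rule prefix in {name!r}")
        if not matched:
            return False
    return True


if __name__ == "__main__":
    path = "/exports/email_list.txt"
    text = "Please send the invoice to my e-mail address."
    # Combined rule: both the path AND the content must match.
    print(evaluate_rule(["m_EMAIL_KEYWORD", "c_EMAIL_KEYWORD"], path, text, True))
```

In this model, a rule that lists only m_EMAIL_KEYWORD never inspects the content, and a rule that lists c_EMAIL_KEYWORD cannot match while the flag is disabled.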
Related Documentation
- Configuration to Use Metadata Dictionaries in Unstructured Content Scanning
- Discovery Advanced Configuration
- Discovery Scanning Overview
- Classification Rules
- Dictionary Management