Using Metadata Dictionaries in Unstructured Content Scanning

Overview

Privacera Discovery can apply metadata dictionaries during content scanning of unstructured data, which makes those dictionaries available for classification. This applies to:

  1. Unstructured files: Text files, documents, PDFs, and other non-tabular data
  2. Database columns with unstructured content: Text-heavy columns (like descriptions, comments, or notes) that contain more than 5 tokens
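The "more than 5 tokens" condition can be pictured with a minimal sketch. It assumes tokens are whitespace-separated words; this document does not describe the tokenizer Discovery actually uses, and `looks_unstructured` is a hypothetical helper, not a Discovery API.

```python
# Hypothetical sketch: the real tokenizer used by Discovery is not described here.
UNSTRUCTURED_TOKEN_THRESHOLD = 5  # "more than 5 tokens", per the list above


def looks_unstructured(value: str, threshold: int = UNSTRUCTURED_TOKEN_THRESHOLD) -> bool:
    """Return True when a value has more than `threshold` whitespace-separated tokens."""
    return len(value.split()) > threshold
```

Under this assumption, a free-text note such as "Customer asked to update the email on file" (8 tokens) would count as unstructured content, while a single value such as an email address would not.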

When content scanning is performed on unstructured data, this feature allows metadata dictionaries (originally designed for column name matching) to also be applied to the content for enhanced classification.

How It Works at a High Level

For structured data (tables, CSV files, etc.), Discovery performs two types of scanning:

  1. Content Scanning: Analyzes actual data values within columns using content-based rules (prefixed with c_)
  2. Meta Scanning: Analyzes metadata like column names using metadata-based rules (prefixed with m_)

For unstructured data (text files, documents, database columns with unstructured content, etc.), Discovery performs content scanning, and metadata dictionaries can be applied in two ways:

By Default (Without DISCOVERY_APPLY_METANAME_DICT_TO_UNSTRUCT):

  • Content Scanning: Performed using content-based patterns and models (e.g., c_EMAIL, c_SSN_ML_MODEL)
  • Metadata Dictionaries with m_ prefix: Applied to file name and file path matching
  • Metadata Dictionaries with c_ prefix: NOT available for content scanning

With DISCOVERY_APPLY_METANAME_DICT_TO_UNSTRUCT Enabled:

  • Content Scanning: Performed with all content-based patterns and models
  • Metadata Dictionaries with m_ prefix: Applied to file name and file path matching (same as default)
  • Metadata Dictionaries with c_ prefix: Made available to scan unstructured content (NEW capability)
  • When creating unstructured rules, you can use:
      • m_<meta_dict> (e.g., m_EMAIL_KEYWORD) for file path/name matching
      • c_<meta_dict> (e.g., c_EMAIL_KEYWORD) for unstructured content scanning

Key Difference

By default, metadata dictionaries can be used only with the m_ prefix, which matches file names and paths. When you set DISCOVERY_APPLY_METANAME_DICT_TO_UNSTRUCT: "true", the same dictionaries also become available with the c_ prefix, which scans unstructured content (file content and database columns with unstructured data). The prefix you write in the rule decides what is scanned: use the c_ prefix (e.g., c_EMAIL_KEYWORD) to scan content, and the m_ prefix (e.g., m_EMAIL_KEYWORD) to match file paths and names.
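As a rough illustration of this difference, the sketch below lists which rule names a metadata dictionary exposes for unstructured data with the setting off and on. The function `usable_rule_names` and its parameter names are hypothetical, introduced only for this example; they are not part of Discovery.

```python
from typing import List


def usable_rule_names(dictionary: str, apply_metaname_dict_to_unstruct: bool) -> List[str]:
    """List the rule names a metadata dictionary exposes for unstructured data.

    Hypothetical helper that mirrors the behavior described above.
    """
    # The m_ form (file path/name matching) is available in both modes.
    names = [f"m_{dictionary}"]
    # The c_ form (content scanning) exists only when the setting is "true".
    if apply_metaname_dict_to_unstruct:
        names.append(f"c_{dictionary}")
    return names
```

For the EMAIL_KEYWORD dictionary, this returns only m_EMAIL_KEYWORD when the setting is off, and both m_EMAIL_KEYWORD and c_EMAIL_KEYWORD when it is on.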

Enabling the Feature

For detailed configuration steps, refer to the Configuration to Use Metadata Dictionaries in Unstructured Content Scanning documentation.

Rule Configuration for Unstructured Data

Understanding Rule Prefixes

Discovery uses prefixes to distinguish between different types of classification rules:

Prefix | Meaning  | Used For                                                                                                        | Example
m_     | Metadata | Column name matching (structured data); file path/name matching (unstructured data)                             | m_EMAIL_KEYWORD
c_     | Content  | Data value matching (structured data); unstructured content matching (unstructured files and database columns) | c_EMAIL
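The prefix table can also be read as a lookup from prefix and data type to the thing being scanned. The sketch below encodes it that way; `SCAN_TARGETS` and `scan_target` are hypothetical names used only for illustration.

```python
# Hypothetical lookup that restates the prefix table; not a Discovery API.
SCAN_TARGETS = {
    ("m_", "structured"): "column names",
    ("m_", "unstructured"): "file path and file name",
    ("c_", "structured"): "data values",
    ("c_", "unstructured"): "unstructured content",
}


def scan_target(rule_name: str, data_kind: str) -> str:
    """Return what a rule scans, based on its two-character prefix and the data kind."""
    key = (rule_name[:2], data_kind)
    if key not in SCAN_TARGETS:
        raise ValueError(f"unsupported prefix or data kind: {rule_name!r}, {data_kind!r}")
    return SCAN_TARGETS[key]
```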

Creating Rules for Unstructured Data Classification

Critical: Prefix Usage for Unstructured Rules

When creating classification rules for unstructured data, you must use the appropriate prefix based on what you want to scan:

For File Path/Name Matching:

  • Use m_ prefix to match against file paths and file names
  • Example: m_EMAIL_KEYWORD will match if the file path or name contains keywords from the EMAIL_KEYWORD dictionary

For Unstructured Content Matching:

  • Use c_ prefix to scan unstructured content (file content and database columns with unstructured data)
  • Example: c_EMAIL_KEYWORD will scan the unstructured content for keywords from the EMAIL_KEYWORD dictionary
  • This requires DISCOVERY_APPLY_METANAME_DICT_TO_UNSTRUCT: "true"

Key Points:

  • The system does NOT automatically convert m_ to c_ or vice versa
  • Choose the appropriate prefix based on your scanning needs
  • You can combine both in a single rule if you want to match both file path AND content
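The prefix behavior above can be sketched as a small matcher. This is an illustration under stated assumptions, not Discovery's implementation: it assumes DISCOVERY_APPLY_METANAME_DICT_TO_UNSTRUCT is already "true", treats a match as a case-insensitive substring hit, and uses made-up dictionary entries. The names `DICTIONARIES`, `contains_keyword`, and `rule_matches` are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical dictionary entries; real dictionaries are defined in Discovery.
DICTIONARIES: Dict[str, Set[str]] = {"EMAIL_KEYWORD": {"email", "e-mail"}}


def contains_keyword(text: str, keywords: Iterable[str]) -> bool:
    """Assumed matching semantics: case-insensitive substring search."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def rule_matches(terms: List[str], file_path: str, content: str,
                 dictionaries: Dict[str, Set[str]]) -> bool:
    """Return True only if every term in the rule matches its own target."""
    for term in terms:
        prefix, name = term[:2], term[2:]
        keywords = dictionaries[name]
        if prefix == "m_":
            # m_ terms look only at the file path and name.
            if not contains_keyword(file_path, keywords):
                return False
        elif prefix == "c_":
            # c_ terms look only at the content; no m_/c_ conversion happens.
            if not contains_keyword(content, keywords):
                return False
        else:
            raise ValueError(f"unknown prefix in rule term: {term!r}")
    return True
```

With these assumptions, a rule containing both m_EMAIL_KEYWORD and c_EMAIL_KEYWORD matches a file only when the path and the content each contain a dictionary keyword, which reflects the "file path AND content" combination described above.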