Data Classification in Discovery
Privacera Discovery provides various tools and features to classify data efficiently. These tools help in detecting, categorizing, and tagging sensitive information based on predefined classification methods. The classification techniques used in Discovery include Dictionaries, Heuristic Models, Rules, and Tags.
Dictionary
A Dictionary in Discovery is a collection of text entries used to match data within a source. It operates in three distinct modes:
- Fuzzy Match: The dictionary text is approximately matched with source data.
- Exact Match: The dictionary text is precisely matched with source data.
- Pattern Match: The dictionary text is treated as a set of regex patterns, and the source data is evaluated using these patterns.
Dictionaries help in recognizing sensitive terms, common phrases, or industry-specific keywords to ensure proper classification.
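The three modes can be illustrated with a minimal Python sketch. This is not Discovery's implementation: the function names, the case-insensitive comparison, and the 0.8 similarity threshold are assumptions chosen for illustration, and the fuzzy mode uses the similarity ratio from Python's standard `difflib` module as a stand-in for whatever approximate matching Discovery applies.

```python
import re
from difflib import SequenceMatcher


def exact_match(value, entries):
    """Return True if the value equals any dictionary entry (case-insensitive)."""
    normalized = value.strip().lower()
    return any(normalized == entry.lower() for entry in entries)


def fuzzy_match(value, entries, threshold=0.8):
    """Return True if the value is approximately equal to any dictionary entry.

    The threshold is a hypothetical similarity cutoff between 0 and 1.
    """
    normalized = value.strip().lower()
    return any(
        SequenceMatcher(None, normalized, entry.lower()).ratio() >= threshold
        for entry in entries
    )


def pattern_match(value, patterns):
    """Return True if the whole value matches any regex in the dictionary."""
    stripped = value.strip()
    return any(re.fullmatch(pattern, stripped) for pattern in patterns)
```

For example, `fuzzy_match("diabetis", ["diabetes"])` tolerates the misspelling, while `exact_match` on the same input does not.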
Models
Models use heuristics and validation logic to classify data based on structured syntax, validation checks, and probabilistic analysis. These models are effective for identifying:
- Social Security Numbers (SSNs) by applying syntax validation.
- Credit Card Numbers using:
  - Bank Identification Number (BIN) range validation
  - Luhn algorithm check
Heuristic models enhance detection accuracy for structured sensitive data by combining multiple validation techniques.
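The checks listed above can be sketched in Python as follows. This is an illustrative sketch, not Discovery's model code: the function names are hypothetical, the SSN check assumes the hyphenated `AAA-GG-SSSS` form, and the BIN table is a deliberately small subset of well-known issuer prefixes rather than a complete range list.

```python
import re

SSN_RE = re.compile(r"^([0-9]{3})-([0-9]{2})-([0-9]{4})$")

# Illustrative subset only: (issuer, prefix width, lowest prefix, highest prefix)
BIN_RANGES = [
    ("visa", 1, 4, 4),
    ("mastercard", 2, 51, 55),
    ("mastercard", 4, 2221, 2720),
    ("amex", 2, 34, 34),
    ("amex", 2, 37, 37),
]


def looks_like_ssn(value):
    """Syntax validation: reject area, group, and serial values that are never issued."""
    match = SSN_RE.match(value)
    if not match:
        return False
    area, group, serial = match.groups()
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"


def luhn_valid(digits):
    """Luhn check: double every second digit from the right and sum the results."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def issuer_for(digits):
    """Return the issuer whose prefix range contains the leading digits, if any."""
    for issuer, width, low, high in BIN_RANGES:
        if low <= int(digits[:width]) <= high:
            return issuer
    return None


def looks_like_credit_card(value):
    """Combine the length, BIN range, and Luhn checks."""
    digits = re.sub(r"[ -]", "", value)
    if not re.fullmatch(r"[0-9]{13,19}", digits):
        return False
    return issuer_for(digits) is not None and luhn_valid(digits)
```

Requiring both the BIN range and the Luhn check to pass reduces false positives: a random 16-digit number passes Luhn about one time in ten, but far fewer such numbers also begin with a valid issuer prefix.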
Rules
Rules use features derived from the analyzed source data—including dictionary matches, metadata, frequency analysis, and heuristic model results—to determine classification outcomes. Rules assign Tags to fields or columns based on the evaluation.
There are two types of rule-based classification:
- Structured Rules: Apply to structured data sources like relational databases and data warehouses.
- Unstructured Rules: Apply to unstructured data sources, such as text files, documents, and logs.
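To make the idea of a rule concrete, the sketch below shows one way a structured rule might combine several features into a tag decision. Everything here is hypothetical: the `ColumnFeatures` fields, the thresholds, and the `ssn_rule` function are assumptions for illustration and do not reflect Discovery's rule syntax or defaults.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnFeatures:
    column_name: str
    name_in_dictionary: bool   # metadata: the column name matched a dictionary
    model_match_ratio: float   # share of sampled values passing a heuristic model
    distinct_ratio: float      # frequency analysis: distinct values / sampled values


def ssn_rule(features: ColumnFeatures) -> Optional[str]:
    """Return the SSN tag when the combined evidence is strong enough, else None."""
    strong_content = features.model_match_ratio >= 0.9
    supported_by_name = (
        features.model_match_ratio >= 0.5 and features.name_in_dictionary
    )
    mostly_unique = features.distinct_ratio >= 0.8
    if (strong_content or supported_by_name) and mostly_unique:
        return "SSN"
    return None
```

In this sketch, weaker content evidence is accepted only when the column name also matches, which mirrors the idea of weighing dictionary matches, metadata, and model results together rather than relying on any single signal.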
Two-Pass Rule Execution in Discovery
The Discovery engine applies rules in two passes:
- First Pass: Applies structured or unstructured rules at the column level, generating initial classification tags.
- Second Pass: Applies post-processing rules at the resource level (file or table) based on previously generated tags.
This two-pass process improves classification accuracy because resource-level rules can build on the tags already assigned at the column level.
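The two passes can be pictured with a short Python sketch. The function names, the rule signatures, and the `SENSITIVE_RESOURCE` tag are hypothetical and chosen for illustration; the rules here look only at column names to keep the example short.

```python
def first_pass(columns, column_rules):
    """Apply every column-level rule to every column; return {column: set of tags}."""
    column_tags = {}
    for name, values in columns.items():
        found = set()
        for rule in column_rules:
            tag = rule(name, values)
            if tag is not None:
                found.add(tag)
        column_tags[name] = found
    return column_tags


def second_pass(column_tags, resource_rules):
    """Apply post-processing rules to the union of first-pass tags for one resource."""
    all_tags = set().union(*column_tags.values())
    resource_tags = set()
    for rule in resource_rules:
        tag = rule(all_tags)
        if tag is not None:
            resource_tags.add(tag)
    return resource_tags


# Hypothetical column-level rules based on the column name only
def ssn_name_rule(name, values):
    return "SSN" if "ssn" in name.lower() else None


def phone_name_rule(name, values):
    return "PHONE_NUMBER" if "phone" in name.lower() else None


# Hypothetical resource-level rule: flag a table that holds both kinds of data
def sensitive_resource_rule(all_tags):
    return "SENSITIVE_RESOURCE" if {"SSN", "PHONE_NUMBER"} <= all_tags else None


table = {"ssn": ["123-45-6789"], "phone": ["555-0100"], "city": ["Austin"]}
column_tags = first_pass(table, [ssn_name_rule, phone_name_rule])
resource_tags = second_pass(column_tags, [sensitive_resource_rule])
```

The second pass never reads the raw values; it only sees the tags produced by the first pass, which is what allows it to reason about the file or table as a whole.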
Tags for Data Classification
Tags assign data to predefined sensitivity categories. They help organizations enforce data governance policies by identifying key data elements such as:
- SSN (Social Security Number)
- CC (Credit Card)
- PHONE_NUMBER
- ADDRESS
Tags serve as metadata markers that can be leveraged for access control, encryption, or masking strategies.
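As an illustration of how a downstream system might act on tags, the sketch below maps tags to masking functions and applies them to a row. It is an assumption-laden example, not a Privacera API: the `MASKERS` table, the masking formats, and `mask_row` are all hypothetical.

```python
# Hypothetical mapping from tag to masking function
MASKERS = {
    "SSN": lambda value: "***-**-" + value[-4:],
    "CC": lambda value: "*" * (len(value) - 4) + value[-4:],
    "PHONE_NUMBER": lambda value: "*" * len(value),
}


def mask_row(row, column_tags):
    """Return a copy of the row with tagged columns masked.

    If a column carries several tags, the first one (in sorted order)
    that has a masker is used, so the result is deterministic.
    """
    masked = dict(row)
    for column, tags in column_tags.items():
        if column not in masked:
            continue
        for tag in sorted(tags):
            if tag in MASKERS:
                masked[column] = MASKERS[tag](masked[column])
                break
    return masked
```

For example, a row with an `ssn` column tagged `SSN` keeps only the last four digits of that value, while untagged columns pass through unchanged.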
Conclusion
Privacera Discovery ensures accurate and scalable data classification using Dictionaries, Heuristic Models, Rules, and Tags. These techniques help organizations automate compliance enforcement, improve data security, and enhance data governance strategies.