Using Dictionaries

In Privacera Discovery, a Dictionary is a classification mechanism that detects and tags sensitive data by matching it against predefined terms or patterns. Dictionaries are particularly useful for identifying industry-specific terminology, business-sensitive information, or sensitive terms that recur across datasets.

Discovery supports three dictionary matching modes:

1. Fuzzy Match

Description: Matches source data entries that are approximately similar to the dictionary terms. This is useful when the data may contain variations, misspellings, or alternate forms of the term.
Use Case Example: Matching names of medications or organizations with slight spelling differences.

Sample Dictionary Entries:

diabetes
hypertension
cardiovascular

Example Matches:
"diabtes"
"cardio vascular"
"hypertensn"

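The source does not say which similarity algorithm Discovery uses for fuzzy matching. As a minimal sketch of the general idea only, the Python below scores each value against the sample entries with the standard library's `difflib.SequenceMatcher`. The function name `fuzzy_match`, the whitespace-removing normalization, and the `0.8` threshold are illustrative assumptions, not Privacera settings.

```python
from difflib import SequenceMatcher

# Sample dictionary entries from this section.
DICTIONARY = ["diabetes", "hypertension", "cardiovascular"]


def normalize(text):
    """Lower-case the text and remove all whitespace (an assumed normalization)."""
    return "".join(text.lower().split())


def fuzzy_match(value, terms=DICTIONARY, threshold=0.8):
    """Return the most similar dictionary term, or None if no term reaches
    the threshold. The threshold is hypothetical, chosen for illustration."""
    candidate = normalize(value)
    best_term, best_score = None, 0.0
    for term in terms:
        score = SequenceMatcher(None, candidate, normalize(term)).ratio()
        if score > best_score:
            best_term, best_score = term, score
    return best_term if best_score >= threshold else None


for sample in ["diabtes", "cardio vascular", "hypertensn", "invoice"]:
    print(sample, "->", fuzzy_match(sample))
```

Under these assumptions, the three misspelled samples resolve to their dictionary terms, while an unrelated word such as "invoice" falls below the threshold and is not matched.
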
2. Exact Match

Description: Requires the data to match the dictionary entries exactly as specified. This is ideal for scenarios where data values are consistent and strictly formatted.
Use Case Example: Matching a set of sanctioned entity names or internal project codes.

Sample Dictionary Entries:

ProjectX123
ConfidentialClient
RestrictedGroupA

Example Matches:
"ProjectX123" ➝ Match
"ProjectX" ➝ No Match

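Exact matching can be pictured as a lookup against the full entry, with no partial or prefix matches. The following is a hedged Python sketch rather than Discovery's implementation: the names `build_lookup` and `exact_match` are hypothetical, and `str.casefold` is used here only to mirror the case-insensitive behavior this page describes.

```python
# Sample dictionary entries from this section.
DICTIONARY = ["ProjectX123", "ConfidentialClient", "RestrictedGroupA"]


def build_lookup(terms):
    """Map the case-folded form of each entry back to its original spelling."""
    return {term.casefold(): term for term in terms}


LOOKUP = build_lookup(DICTIONARY)


def exact_match(value, lookup=LOOKUP):
    """Return the dictionary entry equal to value (ignoring case), else None."""
    return lookup.get(value.casefold())


for sample in ["ProjectX123", "projectx123", "ProjectX"]:
    print(sample, "->", exact_match(sample))
```

In this sketch "ProjectX" returns no match because only the complete entry "ProjectX123" is a key in the lookup.
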
Note

All dictionary entries are matched case-insensitively, in every match mode. For example, the system treats "Email", "email", and "EMAIL" as the same term, so variations in capitalization do not affect detection.

3. Pattern Match (Regex)

Description: Allows the use of regular expressions to define complex patterns. This is particularly useful for matching structured patterns like IDs, codes, or specific text formats. All regex matches are case-insensitive.
Use Case Example: Detecting customer IDs, invoice numbers, or formatted codes.

Sample Dictionary Entries (Regex):

^INV[0-9]{5}$
^CUS-[A-Z]{3}-[0-9]{4}$

Example Matches:
"INV12345" ➝ Match
"inv12345" ➝ Match
"CUS-ABC-2023" ➝ Match
"cus-abc-2023" ➝ Match
"CUS-abc-2023" ➝ Match (case insensitive)

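To try the sample patterns outside Discovery, a small Python sketch such as the one below may help. It assumes the entries behave like Python `re` patterns compiled with `re.IGNORECASE`; the regex engine Discovery actually uses is not stated in this section, and the function name `pattern_match` is hypothetical.

```python
import re

# Sample regex entries from this section.
PATTERNS = [r"^INV[0-9]{5}$", r"^CUS-[A-Z]{3}-[0-9]{4}$"]

# re.IGNORECASE mirrors the case-insensitive behavior described above.
COMPILED = [re.compile(pattern, re.IGNORECASE) for pattern in PATTERNS]


def pattern_match(value):
    """Return the first pattern string that matches value, or None."""
    for regex in COMPILED:
        if regex.search(value):
            return regex.pattern
    return None


for sample in ["INV12345", "inv12345", "CUS-abc-2023", "INV1234", "CUS-ABCD-2023"]:
    print(sample, "->", pattern_match(sample))
```

Because both patterns are anchored with `^` and `$`, values with too few digits ("INV1234") or too many letters ("CUS-ABCD-2023") are rejected in this sketch.
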
Best Practices

Use Fuzzy Match for unstructured or user-generated content where spelling or word variations may occur.
Use Exact Match for clean, controlled, and consistent data such as predefined lists or official terms.
Use Pattern Match for structured formats and identifiers such as IDs, email addresses, or phone numbers.

Conclusion

Dictionaries in Privacera Discovery provide flexibility and control in identifying sensitive information. By choosing the appropriate match type (fuzzy, exact, or regex) for each dataset, you can improve detection accuracy and strengthen your data governance policies.
