Using Dictionaries¶
In Privacera Discovery, a Dictionary is a classification mechanism that detects and tags sensitive data by matching it against predefined terms or patterns. Dictionaries are particularly useful for identifying industry-specific terminology, business-sensitive information, or common sensitive terms across datasets.
Discovery supports three dictionary matching modes:
1. Fuzzy Match¶
- Description: Matches source data entries that are approximately similar to the dictionary terms. This is useful when the data may contain variations, misspellings, or alternate forms of the term.
- Use Case Example: Matching names of medications or organizations with slight spelling differences.
- Sample Dictionary Entries:
- Example Matches:
  - "diabtes"
  - "cardio vascular"
  - "hypertensn"
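To make the idea of approximate matching concrete, here is a minimal Python sketch. It is not Privacera's implementation: the page does not state which similarity algorithm or threshold Discovery uses, so the use of `difflib.SequenceMatcher`, the `0.8` threshold, the whitespace normalization, and the three dictionary entries are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical dictionary entries; the real sample entries are not shown on this page.
DICTIONARY = ["diabetes", "cardiovascular", "hypertension"]


def normalize(text: str) -> str:
    """Lowercase and remove whitespace so 'cardio vascular' compares to 'cardiovascular'."""
    return "".join(text.lower().split())


def fuzzy_match(value: str, terms=DICTIONARY, threshold: float = 0.8):
    """Return the most similar term if its similarity ratio meets the threshold, else None."""
    candidate = normalize(value)
    best_term, best_score = None, 0.0
    for term in terms:
        score = SequenceMatcher(None, candidate, normalize(term)).ratio()
        if score > best_score:
            best_term, best_score = term, score
    return best_term if best_score >= threshold else None


for sample in ["diabtes", "cardio vascular", "hypertensn", "invoice"]:
    print(sample, "->", fuzzy_match(sample))
```

Under these assumptions the three misspelled samples resolve to their dictionary terms, while an unrelated word such as "invoice" stays below the threshold. Raising the threshold trades recall for fewer false positives.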
2. Exact Match¶
- Description: Requires the data to match the dictionary entries exactly as specified. This is ideal for scenarios where data values are consistent and strictly formatted.
- Use Case Example: Matching a set of sanctioned entity names or internal project codes.
- Sample Dictionary Entries:
- Example Matches:
  - "ProjectX123" ➝ Match
  - "ProjectX" ➝ No Match
Note
All patterns in dictionaries use case-insensitive matching. This means the system treats “Email,” “email,” and “EMAIL” as the same word. This ensures that variations in capitalization do not affect detection accuracy across different match modes.
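Exact matching combined with the case-insensitivity rule in the note can be sketched as follows. The entry `ProjectX123` is a hypothetical dictionary term inferred from the example matches above, and using `str.casefold()` for the comparison is an assumption about how case is ignored, not a statement about Discovery's internals.

```python
# Hypothetical dictionary entry, used only for illustration.
ENTRIES = {"ProjectX123"}

# Pre-compute a case-folded copy so every lookup ignores capitalization.
NORMALIZED = {entry.casefold() for entry in ENTRIES}


def exact_match(value: str) -> bool:
    """True only when the whole value equals a dictionary entry, ignoring case."""
    return value.casefold() in NORMALIZED


for sample in ["ProjectX123", "projectx123", "ProjectX"]:
    print(sample, "->", "Match" if exact_match(sample) else "No Match")
```

The whole value must equal an entry, so a prefix such as "ProjectX" is rejected even though it is contained in "ProjectX123".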
3. Pattern Match (Regex)¶
- Description: Allows the use of regular expressions to define complex patterns. This is particularly useful for matching structured patterns like IDs, codes, or specific text formats. All regex matches are case-insensitive.
- Use Case Example: Detecting customer IDs, invoice numbers, or formatted codes.
- Sample Dictionary Entries (Regex):
- Example Matches:
  - "INV12345" ➝ Match
  - "inv12345" ➝ Match
  - "CUS-ABC-2023" ➝ Match
  - "cus-abc-2023" ➝ Match
  - "CUS-abc-2023" ➝ Match (case insensitive)
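The behavior in the example matches can be reproduced with Python's `re` module. The two patterns below are hypothetical: they were written to fit the sample values, since the actual regex entries are not shown on this page. Matching the whole value with `fullmatch` is also an assumption; Discovery may apply patterns differently.

```python
import re

# Hypothetical regex entries written to fit the sample values above.
PATTERNS = [
    r"INV\d{5}",            # invoice number: INV followed by five digits
    r"CUS-[A-Z]{3}-\d{4}",  # customer ID: CUS-, three letters, -, four digits
]

# re.IGNORECASE mirrors the documented case-insensitive matching.
COMPILED = [re.compile(pattern, re.IGNORECASE) for pattern in PATTERNS]


def pattern_match(value: str) -> bool:
    """True when the whole value matches at least one pattern."""
    return any(rx.fullmatch(value) for rx in COMPILED)


for sample in ["INV12345", "inv12345", "CUS-ABC-2023", "CUS-abc-2023", "INV1234"]:
    print(sample, "->", "Match" if pattern_match(sample) else "No Match")
```

Testing candidate patterns against known good and bad values in this way, before adding them to a dictionary, helps catch patterns that are too loose or too strict.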
Canada Address, Person Name, City, and Province¶
The product includes Canadian dictionaries and tags for address context, person names, cities, and provinces. Enable these under Discovery → Dictionaries and Discovery → Tags when Canada-specific classification is required.
Tags and purpose¶
| Tag | What it detects |
|---|---|
| CANADA_ADDRESS | Canadian address context via CANADA_ADDRESS_KEYWORD; stricter rules combine keywords with the Canada Postal Code model. |
| CANADA_PERSON_NAME | Names matched against CANADA_PERSON_NAME_LOOKUP. |
| CANADA_CITY | Cities matched against CANADA_CITY_LOOKUP (exact list). |
| CANADA_PROVINCE | Provinces and territories via CANADA_PROVINCE_LOOKUP (names, abbreviations, common short forms). |
Dictionary keys and files¶
| Dictionary key | Match type | Typical file | Drives tag |
|---|---|---|---|
| CANADA_ADDRESS_KEYWORD | Fuzzy | canada_address_keyword.txt | CANADA_ADDRESS |
| CANADA_PERSON_NAME_LOOKUP | Exact | canada_person_name_lookup.txt | CANADA_PERSON_NAME |
| CANADA_CITY_LOOKUP | Exact | canada_city_lookup.txt | CANADA_CITY |
| CANADA_PROVINCE_LOOKUP | Exact | canada_province_lookup.txt | CANADA_PROVINCE |
In each dictionary, set Tags (input tags) so matches associate with the right CANADA_* tag.
Rules and postal dependency¶
- Structured rules use column-level features such as c_CANADA_ADDRESS_KEYWORD, c_CANADA_PERSON_NAME_LOOKUP, c_CANADA_CITY_LOOKUP, c_CANADA_PROVINCE_LOOKUP, and, for strict address patterns, c_CANADA_POSTAL_CODE_ML_MODEL. Enable the Canada structured rules you need under Discovery → Rules.
- Unstructured rules combine these features within a word proximity window (for example, address keyword + postal model). Use Discovery → Rules Mapping so outputs map to CANADA_ADDRESS, CANADA_CITY, CANADA_PROVINCE, and CANADA_PERSON_NAME.
- If a rule requires it, enable CANADA_POSTAL_CODE (tag, model CANADA_POSTAL_CODE_ML_MODEL, and CANADA_POSTAL_CODE_KEYWORD when your rules reference it). Without the postal model, strict Canada address rules may not apply.
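The word proximity idea behind the unstructured rules can be illustrated with a small Python sketch. Everything here is a stand-in: the keyword list is hypothetical (the contents of canada_address_keyword.txt are not shown), a simple regex replaces CANADA_POSTAL_CODE_ML_MODEL, and the window of 5 words is an arbitrary choice. It shows only how a keyword hit and a postal code hit might be combined by distance, not how Discovery evaluates rules.

```python
import re

# Hypothetical address keywords; the real dictionary file contents are not shown here.
ADDRESS_KEYWORDS = {"street", "avenue", "road", "boulevard"}

# A plain regex standing in for the postal code ML model (letter-digit-letter digit-letter-digit).
POSTAL_CODE = re.compile(r"\b[A-Z]\d[A-Z] ?\d[A-Z]\d\b", re.IGNORECASE)
WORD = re.compile(r"\w+")


def word_index(text: str, char_pos: int) -> int:
    """Count the words that start before char_pos, giving a word-level position."""
    return sum(1 for m in WORD.finditer(text) if m.start() < char_pos)


def has_canada_address(text: str, window: int = 5) -> bool:
    """True when an address keyword and a postal code occur within `window` words of each other."""
    keyword_positions = [
        i for i, m in enumerate(WORD.finditer(text))
        if m.group().lower() in ADDRESS_KEYWORDS
    ]
    postal_positions = [word_index(text, m.start()) for m in POSTAL_CODE.finditer(text)]
    return any(abs(k - p) <= window for k in keyword_positions for p in postal_positions)


print(has_canada_address("Ship to 123 Main Street, Ottawa ON K1A 0B1"))
print(has_canada_address("Postal code K1A 0B1 on file"))
```

In this sketch a postal code with no nearby keyword does not count as an address, which mirrors why strict address rules depend on both signals being present.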
What to disable for Canada-first classification¶
US and generic detectors respond to the same signals (names, cities, states, street-style text). Leaving them enabled can produce PERSON_NAME, US_CITY, US_STATE, or US_ADDRESS alongside or instead of CANADA_* tags.
For Canada-first scans on the same data:
- Tags: Disable PERSON_NAME, US_CITY, US_STATE, and US_ADDRESS when you want only CANADA_PERSON_NAME, CANADA_CITY, CANADA_PROVINCE, and CANADA_ADDRESS, respectively.
- Dictionaries: Disable US counterparts that feed those tags (for example, PERSON_NAME_LOOKUP, US_CITY_LOOKUP, US_STATE_LOOKUP, and dictionaries tied to US_ADDRESS / street-address patterns).
- Rules: Disable structured and unstructured rules that output the US tags above; keep Canada dictionary and model rules enabled.
- Models: Keep CANADA_POSTAL_CODE_ML_MODEL enabled if your Canada address rules depend on it.
Portal naming
Exact labels can vary by release. Search Tags, Dictionaries, and Rules for the names above.
Quick checklist¶
- Enable CANADA_ADDRESS, CANADA_PERSON_NAME, CANADA_CITY, CANADA_PROVINCE (as needed).
- Enable the four Canada dictionaries and link tags on each.
- Enable Canada postal tag/model/keyword when address rules need the postal model.
- Enable Canada rules and Rules Mapping for unstructured output.
- Disable PERSON_NAME, US_CITY, US_STATE, US_ADDRESS (and related rules/dictionaries) when you want Canadian identifiers without US overlap.
Best Practices¶
- Use Fuzzy Match for unstructured or user-generated content where spelling or word variations may occur.
- Use Exact Match for clean, controlled, and consistent data such as predefined lists or official terms.
- Use Pattern Match for structured formats and identifiers such as IDs, email addresses, or phone numbers.
Conclusion¶
Dictionaries in Privacera Discovery provide flexibility and control in identifying sensitive information. By leveraging the appropriate match type—fuzzy, exact, or regex—you can improve detection accuracy and enhance your data governance policies.