Using Dictionaries

In Privacera Discovery, a Dictionary is a classification mechanism that detects and tags sensitive data by matching it against predefined terms or patterns. Dictionaries are particularly useful for identifying industry-specific terminology, business-sensitive information, or common sensitive terms across datasets.

Discovery supports three dictionary matching modes:

1. Fuzzy Match

  • Description: Matches source data entries that are approximately similar to the dictionary terms. This is useful when the data may contain variations, misspellings, or alternate forms of the term.
  • Use Case Example: Matching names of medications or organizations with slight spelling differences.

  • Sample Dictionary Entries:

    diabetes
    hypertension
    cardiovascular
    

  • Example Matches:
    • "diabtes" ➝ Match (diabetes)
    • "cardio vascular" ➝ Match (cardiovascular)
    • "hypertensn" ➝ Match (hypertension)
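
The idea of approximate matching can be illustrated with a minimal Python sketch. It uses the standard library's difflib as a stand-in similarity measure; the algorithm and threshold Discovery actually applies are not described here, so the function name fuzzy_match and the cutoff value of 0.8 are assumptions made purely for illustration.

```python
from difflib import get_close_matches

# Sample dictionary entries from the section above.
DICTIONARY_TERMS = ["diabetes", "hypertension", "cardiovascular"]


def fuzzy_match(value, terms=DICTIONARY_TERMS, cutoff=0.8):
    """Return the closest dictionary term, or None if nothing is similar enough.

    cutoff is a hypothetical similarity threshold between 0 and 1,
    not a documented Discovery setting.
    """
    # Lowercase both sides so the comparison ignores capitalization.
    lowered = {term.lower(): term for term in terms}
    hits = get_close_matches(value.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


for sample in ["diabtes", "cardio vascular", "hypertensn", "invoice"]:
    print(sample, "->", fuzzy_match(sample))
```

Raising the cutoff makes the sketch stricter (fewer false positives, more missed misspellings); lowering it does the opposite.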

2. Exact Match

  • Description: Requires the data to match the dictionary entries exactly as specified. This is ideal for scenarios where data values are consistent and strictly formatted.
  • Use Case Example: Matching a set of sanctioned entity names or internal project codes.

  • Sample Dictionary Entries:

    ProjectX123
    ConfidentialClient
    RestrictedGroupA
    

  • Example Matches:
    • "ProjectX123" ➝ Match
    • "ProjectX" ➝ No Match
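
As a rough illustration of exact matching, the sketch below tests whole-value membership in a set of entries, ignoring capitalization in line with this page's note on case-insensitive matching. The function name exact_match is hypothetical, and the code shows only the concept, not how Discovery implements it.

```python
# Sample dictionary entries from the section above.
ENTRIES = ["ProjectX123", "ConfidentialClient", "RestrictedGroupA"]

# Normalize once so lookups ignore capitalization.
NORMALIZED = {entry.casefold() for entry in ENTRIES}


def exact_match(value):
    """Return True only when the whole value equals a dictionary entry."""
    return value.casefold() in NORMALIZED


for sample in ["ProjectX123", "projectx123", "ProjectX"]:
    print(sample, "->", "Match" if exact_match(sample) else "No Match")
```

A partial value such as "ProjectX" is rejected because the comparison covers the entire string rather than a prefix or substring.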

Note

All dictionary patterns are matched case-insensitively: the system treats “Email,” “email,” and “EMAIL” as the same word, so variations in capitalization do not affect detection in any match mode.

3. Pattern Match (Regex)

  • Description: Allows the use of regular expressions to define complex patterns. This is particularly useful for matching structured patterns like IDs, codes, or specific text formats. All regex matches are case-insensitive.

  • Use Case Example: Detecting customer IDs, invoice numbers, or formatted codes.

  • Sample Dictionary Entries (Regex):

    ^INV[0-9]{5}$
    ^CUS-[A-Z]{3}-[0-9]{4}$
    

  • Example Matches:
    • "INV12345" ➝ Match
    • "inv12345" ➝ Match
    • "CUS-ABC-2023" ➝ Match
    • "cus-abc-2023" ➝ Match
    • "CUS-abc-2023" ➝ Match (case-insensitive)
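
The sample patterns can be tried out locally with Python's re module before adding them to a dictionary. This is a sketch under the assumption that Python's regex semantics with re.IGNORECASE approximate Discovery's case-insensitive matching; the regex engine Discovery uses is not stated here, and pattern_match is a hypothetical helper name.

```python
import re

# Sample regex dictionary entries from the section above.
PATTERNS = [r"^INV[0-9]{5}$", r"^CUS-[A-Z]{3}-[0-9]{4}$"]

# Compile with IGNORECASE to mirror the documented case-insensitive behavior.
COMPILED = [re.compile(pattern, re.IGNORECASE) for pattern in PATTERNS]


def pattern_match(value):
    """Return True if the value matches any of the dictionary patterns."""
    return any(rx.search(value) for rx in COMPILED)


for sample in ["INV12345", "inv12345", "CUS-abc-2023", "INV1234"]:
    print(sample, "->", "Match" if pattern_match(sample) else "No Match")
```

The ^ and $ anchors keep the patterns from matching inside longer strings, so "INV1234" (four digits) and values with extra leading characters are rejected.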

Canada Address, Person Name, City, and Province

The product includes Canadian dictionaries and tags for address context, person names, cities, and provinces. Enable these under Discovery → Dictionaries and Discovery → Tags when Canada-specific classification is required.

Tags and purpose

| Tag | What it detects |
| --- | --- |
| CANADA_ADDRESS | Canadian address context via CANADA_ADDRESS_KEYWORD; stricter rules combine keywords with the Canada Postal Code model. |
| CANADA_PERSON_NAME | Names matched against CANADA_PERSON_NAME_LOOKUP. |
| CANADA_CITY | Cities matched against CANADA_CITY_LOOKUP (exact list). |
| CANADA_PROVINCE | Provinces and territories via CANADA_PROVINCE_LOOKUP (names, abbreviations, common short forms). |

Dictionary keys and files

| Dictionary key | Match type | Typical file | Drives tag |
| --- | --- | --- | --- |
| CANADA_ADDRESS_KEYWORD | Fuzzy | canada_address_keyword.txt | CANADA_ADDRESS |
| CANADA_PERSON_NAME_LOOKUP | Exact | canada_person_name_lookup.txt | CANADA_PERSON_NAME |
| CANADA_CITY_LOOKUP | Exact | canada_city_lookup.txt | CANADA_CITY |
| CANADA_PROVINCE_LOOKUP | Exact | canada_province_lookup.txt | CANADA_PROVINCE |

In each dictionary, set Tags (input tags) so matches associate with the right CANADA_* tag.
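
When planning a rollout, it can help to keep the dictionary-to-tag relationships above in a small lookup structure. The sketch below is only a planning aid written in Python: the DictionaryConfig type and missing_dictionaries helper are hypothetical, and nothing here calls a Privacera API (enabling dictionaries and setting tags still happens in the portal).

```python
from typing import NamedTuple


class DictionaryConfig(NamedTuple):
    key: str         # dictionary key as listed in the table above
    match_type: str  # "Fuzzy" or "Exact"
    file_name: str   # typical file name
    tag: str         # tag the dictionary drives


CANADA_DICTIONARIES = [
    DictionaryConfig("CANADA_ADDRESS_KEYWORD", "Fuzzy",
                     "canada_address_keyword.txt", "CANADA_ADDRESS"),
    DictionaryConfig("CANADA_PERSON_NAME_LOOKUP", "Exact",
                     "canada_person_name_lookup.txt", "CANADA_PERSON_NAME"),
    DictionaryConfig("CANADA_CITY_LOOKUP", "Exact",
                     "canada_city_lookup.txt", "CANADA_CITY"),
    DictionaryConfig("CANADA_PROVINCE_LOOKUP", "Exact",
                     "canada_province_lookup.txt", "CANADA_PROVINCE"),
]


def missing_dictionaries(wanted_tags, enabled_keys):
    """List dictionary keys that still need enabling for the wanted tags."""
    enabled = set(enabled_keys)
    return [d.key for d in CANADA_DICTIONARIES
            if d.tag in wanted_tags and d.key not in enabled]


print(missing_dictionaries({"CANADA_CITY", "CANADA_PROVINCE"},
                           {"CANADA_CITY_LOOKUP"}))
```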

Rules and postal dependency

  • Structured rules use column-level features such as c_CANADA_ADDRESS_KEYWORD, c_CANADA_PERSON_NAME_LOOKUP, c_CANADA_CITY_LOOKUP, c_CANADA_PROVINCE_LOOKUP, and for strict address patterns c_CANADA_POSTAL_CODE_ML_MODEL. Enable the Canada structured rules you need under Discovery → Rules.
  • Unstructured rules combine these features within a word proximity window (for example address keyword + postal model). Use Discovery → Rules Mapping so outputs map to CANADA_ADDRESS, CANADA_CITY, CANADA_PROVINCE, and CANADA_PERSON_NAME.
  • If a rule requires it, enable the CANADA_POSTAL_CODE tag, the CANADA_POSTAL_CODE_ML_MODEL model, and, when your rules reference it, CANADA_POSTAL_CODE_KEYWORD. Without the postal model, strict Canada address rules may not apply.
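
To make the word proximity window concrete, here is a minimal Python sketch of an unstructured rule that fires only when an address keyword appears within a few words of a postal code. Everything in it is an assumption for illustration: the keyword list is invented, a simple regex for the "A1A 1A1" postal format stands in for CANADA_POSTAL_CODE_ML_MODEL, and the window size of 5 is arbitrary rather than a Discovery default.

```python
import re

# Hypothetical keywords; the real list lives in canada_address_keyword.txt.
ADDRESS_KEYWORDS = {"street", "avenue", "road", "suite"}

# Regex stand-in for the postal code model: "A1A" followed by "1A1".
FIRST_HALF = re.compile(r"^[A-Z][0-9][A-Z]$", re.IGNORECASE)
SECOND_HALF = re.compile(r"^[0-9][A-Z][0-9]$", re.IGNORECASE)


def tokenize(text):
    """Split on whitespace and strip common punctuation from each word."""
    return [token.strip(".,;:()") for token in text.split()]


def address_near_postal(text, window=5):
    """True if an address keyword is within `window` words of a postal code."""
    tokens = tokenize(text)
    keyword_positions = [i for i, t in enumerate(tokens)
                         if t.lower() in ADDRESS_KEYWORDS]
    postal_positions = [i for i in range(len(tokens) - 1)
                        if FIRST_HALF.match(tokens[i])
                        and SECOND_HALF.match(tokens[i + 1])]
    return any(abs(k - p) <= window
               for k in keyword_positions for p in postal_positions)


print(address_near_postal("Ship to 111 Wellington Street, Ottawa ON K1A 0B1"))
```

In this sample sentence the keyword and the postal code are three words apart, so the sketch reports a hit with a window of 5 but not with a window of 2.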

What to disable for Canada-first classification

US and generic detectors overlap the same signals (names, cities, states, street-style text). Leaving them enabled can produce PERSON_NAME, US_CITY, US_STATE, or US_ADDRESS alongside or instead of CANADA_* tags.

For Canada-first scans on the same data:

  1. Tags — Disable PERSON_NAME, US_CITY, US_STATE, and US_ADDRESS when you want only CANADA_PERSON_NAME, CANADA_CITY, CANADA_PROVINCE, and CANADA_ADDRESS respectively.
  2. Dictionaries — Disable US counterparts that feed those tags (for example PERSON_NAME_LOOKUP, US_CITY_LOOKUP, US_STATE_LOOKUP, and dictionaries tied to US_ADDRESS / street-address patterns).
  3. Rules — Disable structured and unstructured rules that output the US tags above; keep Canada dictionary and model rules enabled.
  4. Models — Keep CANADA_POSTAL_CODE_ML_MODEL enabled if your Canada address rules depend on it.
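
The tag pairing in step 1 can be turned into a quick overlap check. The following Python sketch assumes you have a list of enabled tag names from your environment (how you obtain it is up to you; no Privacera API is implied) and reports which US or generic tags are enabled alongside their Canadian counterparts. The function name tags_to_disable is hypothetical.

```python
# US/generic tag -> Canadian counterpart, following the pairing in step 1.
US_TO_CANADA = {
    "PERSON_NAME": "CANADA_PERSON_NAME",
    "US_CITY": "CANADA_CITY",
    "US_STATE": "CANADA_PROVINCE",
    "US_ADDRESS": "CANADA_ADDRESS",
}


def tags_to_disable(enabled_tags):
    """Return US/generic tags enabled together with their CANADA_* counterpart."""
    enabled = set(enabled_tags)
    return sorted(us_tag for us_tag, canada_tag in US_TO_CANADA.items()
                  if us_tag in enabled and canada_tag in enabled)


enabled = ["PERSON_NAME", "CANADA_PERSON_NAME", "US_CITY", "CANADA_PROVINCE"]
print(tags_to_disable(enabled))
```

Tags without an enabled Canadian counterpart (US_CITY in the sample list) are left alone, since disabling them would remove coverage rather than resolve an overlap.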

Portal naming

Exact labels can vary by release. Search Tags, Dictionaries, and Rules for the names above.

Quick checklist

  • Enable CANADA_ADDRESS, CANADA_PERSON_NAME, CANADA_CITY, CANADA_PROVINCE (as needed).
  • Enable the four Canada dictionaries and link tags on each.
  • Enable Canada postal tag/model/keyword when address rules need the postal model.
  • Enable Canada rules and Rules Mapping for unstructured output.
  • Disable PERSON_NAME, US_CITY, US_STATE, US_ADDRESS (and related rules/dictionaries) when you want Canadian identifiers without US overlap.

Best Practices

  • Use Fuzzy Match for unstructured or user-generated content where spelling or word variations may occur.
  • Use Exact Match for clean, controlled, and consistent data such as predefined lists or official terms.
  • Use Pattern Match for structured formats and identifiers such as IDs, email addresses, or phone numbers.

Conclusion

Dictionaries in Privacera Discovery provide flexibility and control in identifying sensitive information. By choosing the appropriate match type (fuzzy, exact, or regex) for each dataset, you can improve detection accuracy and strengthen your data governance policies.