Using Dictionaries¶
In Privacera Discovery, a Dictionary is a powerful classification mechanism used to detect and tag sensitive data by matching it against predefined terms or patterns. Dictionaries are particularly useful for identifying industry-specific terminology, business-sensitive information, or common sensitive terms across datasets.
Discovery supports three types of dictionary matching modes:
1. Fuzzy Match¶
- Description: Matches source data entries that are approximately similar to the dictionary terms. This is useful when the data may contain variations, misspellings, or alternate forms of the term.
- Use Case Example: Matching names of medications or organizations with slight spelling differences.
- Sample Dictionary Entries:
- Example Matches:
  - "diabtes"
  - "cardio vascular"
  - "hypertensn"
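As a rough illustration of approximate matching, the sketch below scores a value against each dictionary term with Python's standard `difflib.SequenceMatcher`. The dictionary entries, the `fuzzy_match` helper, and the 0.8 threshold are assumptions made for this example; Discovery's actual fuzzy algorithm and thresholds are not described here.

```python
from difflib import SequenceMatcher

# Hypothetical dictionary entries chosen to fit the example matches above.
DICTIONARY_TERMS = ["diabetes", "cardiovascular", "hypertension"]


def normalize(text: str) -> str:
    """Lowercase and drop whitespace so 'cardio vascular' compares as one word."""
    return "".join(text.lower().split())


def fuzzy_match(value: str, terms=DICTIONARY_TERMS, threshold: float = 0.8):
    """Return the closest term if its similarity ratio meets the threshold, else None."""
    candidate = normalize(value)
    best_term, best_score = None, 0.0
    for term in terms:
        score = SequenceMatcher(None, candidate, normalize(term)).ratio()
        if score > best_score:
            best_term, best_score = term, score
    return best_term if best_score >= threshold else None
```

A lower threshold catches more misspellings but also raises the chance of false positives, so it is worth tuning against real data.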
2. Exact Match¶
- Description: Requires the data to match the dictionary entries exactly as specified. This is ideal for scenarios where data values are consistent and strictly formatted.
- Use Case Example: Matching a set of sanctioned entity names or internal project codes.
- Sample Dictionary Entries:
- Example Matches:
  - "ProjectX123" ➝ Match
  - "ProjectX" ➝ No Match
Note
All patterns in dictionaries use case-insensitive matching. This means the system treats “Email,” “email,” and “EMAIL” as the same word. This ensures that variations in capitalization do not affect detection accuracy across different match modes.
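One way to picture exact matching combined with case-insensitivity is a lookup on case-folded values. This is a minimal sketch; the single entry and the `exact_match` helper are hypothetical and only mirror the examples above.

```python
# Hypothetical dictionary entry, inferred from the match examples above.
EXACT_TERMS = {"ProjectX123"}

# casefold() produces a case-insensitive key for each entry.
_LOOKUP = {term.casefold() for term in EXACT_TERMS}


def exact_match(value: str) -> bool:
    """True only when the whole value equals a dictionary entry, ignoring case."""
    return value.casefold() in _LOOKUP
```

Here "projectx123" matches because only capitalization differs, while "ProjectX" does not because it is a partial value.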
3. Pattern Match (Regex)¶
- Description: Allows the use of regular expressions to define complex patterns. This is particularly useful for matching structured patterns like IDs, codes, or specific text formats. All regex matches are case-insensitive.
- Use Case Example: Detecting customer IDs, invoice numbers, or formatted codes.
- Sample Dictionary Entries (Regex):
- Example Matches:
  - "INV12345" ➝ Match
  - "inv12345" ➝ Match
  - "CUS-ABC-2023" ➝ Match
  - "cus-abc-2023" ➝ Match
  - "CUS-abc-2023" ➝ Match (case insensitive)
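The following sketch shows how case-insensitive pattern matching could behave for the examples above. The two patterns are guesses inferred from the example values, not the product's sample entries, and matching the whole value (rather than a substring) is also an assumption.

```python
import re

# Hypothetical patterns inferred from the example matches above.
PATTERNS = [r"INV\d{5}", r"CUS-[A-Z]{3}-\d{4}"]

# re.IGNORECASE mirrors the case-insensitive behaviour described above.
_COMPILED = [re.compile(pattern, re.IGNORECASE) for pattern in PATTERNS]


def pattern_match(value: str) -> bool:
    """True when the entire value matches any pattern, ignoring case."""
    return any(rx.fullmatch(value) for rx in _COMPILED)
```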
Canada Address, Person Name, City, and Province¶
The product includes Canadian dictionaries and tags for address context, person names, cities, and provinces. Enable these under Discovery → Dictionaries and Discovery → Tags when Canada-specific classification is required.
Tags and purpose¶
| Tag | What it detects |
|---|---|
| CANADA_ADDRESS | Canadian address context via CANADA_ADDRESS_KEYWORD; stricter rules combine keywords with the Canada Postal Code model. |
| CANADA_PERSON_NAME | Names matched against CANADA_PERSON_NAME_LOOKUP. |
| CANADA_CITY | Cities matched against CANADA_CITY_LOOKUP (exact list). |
| CANADA_PROVINCE | Provinces and territories via CANADA_PROVINCE_LOOKUP (names, abbreviations, common short forms). |
Dictionary keys and files¶
| Dictionary key | Match type | Typical file | Drives tag |
|---|---|---|---|
| CANADA_ADDRESS_KEYWORD | Fuzzy | canada_address_keyword.txt | CANADA_ADDRESS |
| CANADA_PERSON_NAME_LOOKUP | Exact | canada_person_name_lookup.txt | CANADA_PERSON_NAME |
| CANADA_CITY_LOOKUP | Exact | canada_city_lookup.txt | CANADA_CITY |
| CANADA_PROVINCE_LOOKUP | Exact | canada_province_lookup.txt | CANADA_PROVINCE |
In each dictionary, set Tags (input tags) so matches associate with the right CANADA_* tag.
Rules and postal dependency¶
- Structured rules use column-level features such as c_CANADA_ADDRESS_KEYWORD, c_CANADA_PERSON_NAME_LOOKUP, c_CANADA_CITY_LOOKUP, c_CANADA_PROVINCE_LOOKUP, and, for strict address patterns, c_CANADA_POSTAL_CODE_ML_MODEL. Enable the Canada structured rules you need under Discovery → Rules.
- Unstructured rules combine these features within a word proximity window (for example address keyword + postal model). Use Discovery → Rules Mapping so outputs map to CANADA_ADDRESS, CANADA_CITY, CANADA_PROVINCE, and CANADA_PERSON_NAME.
- If a rule requires it, enable CANADA_POSTAL_CODE (tag, model CANADA_POSTAL_CODE_ML_MODEL, and CANADA_POSTAL_CODE_KEYWORD when your rules reference it). Without the postal model, strict Canada address rules may not apply.
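To make the "keyword plus postal model" idea concrete, here is a minimal sketch of a strict address check. It is not the product's logic: a regular expression for the A1A 1A1 postal-code layout stands in for CANADA_POSTAL_CODE_ML_MODEL, and the keyword tuple is a hypothetical stand-in for the contents of canada_address_keyword.txt.

```python
import re

# Stand-in for the postal-code model: a plain format check (A1A 1A1).
# D, F, I, O, Q, U are never used; W and Z never appear in the first position.
POSTAL_CODE = re.compile(
    r"\b[ABCEGHJ-NPRSTVXY]\d[ABCEGHJ-NPRSTV-Z][ -]?\d[ABCEGHJ-NPRSTV-Z]\d\b",
    re.IGNORECASE,
)

# Hypothetical keyword entries for illustration only.
ADDRESS_KEYWORDS = ("street", "avenue", "boulevard", "postal code")


def strict_canada_address(text: str) -> bool:
    """True when the text has both an address keyword and a postal-code hit."""
    lowered = text.lower()
    has_keyword = any(keyword in lowered for keyword in ADDRESS_KEYWORDS)
    return has_keyword and POSTAL_CODE.search(text) is not None
```

Requiring both signals is what makes the rule strict: a keyword alone or a postal code alone is not enough in this sketch.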
What to disable for Canada-first classification¶
US and generic detectors overlap the same signals (names, cities, states, street-style text). Leaving them enabled can produce PERSON_NAME, US_CITY, US_STATE, or US_ADDRESS alongside or instead of CANADA_* tags.
For Canada-first scans on the same data:
- Tags: Disable PERSON_NAME, US_CITY, US_STATE, and US_ADDRESS when you want only CANADA_PERSON_NAME, CANADA_CITY, CANADA_PROVINCE, and CANADA_ADDRESS respectively.
- Dictionaries: Disable US counterparts that feed those tags (for example PERSON_NAME_LOOKUP, US_CITY_LOOKUP, US_STATE_LOOKUP, and dictionaries tied to US_ADDRESS / street-address patterns).
- Rules: Disable structured and unstructured rules that output the US tags above; keep Canada dictionary and model rules enabled.
- Models: Keep CANADA_POSTAL_CODE_ML_MODEL enabled if your Canada address rules depend on it.
Portal naming
Exact labels can vary by release. Search Tags, Dictionaries, and Rules for the names above.
Quick checklist¶
- Enable CANADA_ADDRESS, CANADA_PERSON_NAME, CANADA_CITY, CANADA_PROVINCE (as needed).
- Enable the four Canada dictionaries and link tags on each.
- Enable Canada postal tag/model/keyword when address rules need the postal model.
- Enable Canada rules and Rules Mapping for unstructured output.
- Disable PERSON_NAME, US_CITY, US_STATE, US_ADDRESS (and related rules/dictionaries) when you want Canadian identifiers without US overlap.
Australia Business Number (ABN)¶
The product includes a fuzzy-match keyword dictionary for Australian Business Numbers (ABNs). Enable the dictionary under Discovery → Dictionaries when you need to classify Australia-specific business identifiers.
Tag and purpose¶
| Tag | What it detects |
|---|---|
| AU_ABN | Detects 11-digit Australian Business Numbers issued by the Australian Taxation Office (ATO). Values are validated using the official ABN check algorithm. |
Dictionary key and file¶
| Dictionary key | Match type | Typical file | Drives tag |
|---|---|---|---|
| AU_ABN_KEYWORD | Fuzzy | au_abn_keyword.txt | AU_ABN |
In the dictionary, set Tags (input tags) so matches associate with AU_ABN. Typical entries include terms such as ABN, Australian Business Number, and business number.
Rules and model dependency¶
- Structured rules use column-level features c_AU_ABN_ML_MODEL and, for the strict rule, m_AU_ABN_KEYWORD. Enable the Australia ABN structured rules you need under Discovery → Rules (for example AU ABN Strict and AU ABN).
- The strict rule requires both the model and keyword dictionary; the review rule uses the model alone when no keyword is present in the column context.
To enable ABN detection, configure the following assets:
- Tag: AU_ABN
- Model: AU_ABN_ML_MODEL
- Dictionary: AU_ABN_KEYWORD (required for the strict rule)
Without the model, ABN rules are not evaluated.
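For reference, the publicly documented ABN check works as a weighted sum with modulus 89. The sketch below implements that published algorithm; the function name is hypothetical, and whether AU_ABN_ML_MODEL applies exactly these steps internally is an assumption.

```python
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def is_valid_abn(value: str) -> bool:
    """Check an ABN with the published weighted-sum, modulus-89 algorithm."""
    compact = value.replace(" ", "")
    if len(compact) != 11 or not (compact.isascii() and compact.isdigit()):
        return False
    digits = [int(ch) for ch in compact]
    digits[0] -= 1  # the algorithm subtracts 1 from the leading digit
    total = sum(d * w for d, w in zip(digits, ABN_WEIGHTS))
    return total % 89 == 0
```

For example, `is_valid_abn("51 824 753 556")` passes the check, while changing the last digit to 7 fails it.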
Quick checklist¶
- Enable AU_ABN tag.
- Enable AU_ABN_KEYWORD dictionary and link the tag.
- Enable AU_ABN_ML_MODEL model.
- Enable ABN structured rules and Rules Mapping as needed.
Australia Company Number (ACN)¶
The product includes a fuzzy-match keyword dictionary for Australian Company Numbers (ACNs). Enable it under Discovery → Dictionaries when Australia-specific company-identifier classification is required.
Tag and purpose¶
| Tag | What it detects |
|---|---|
| AU_ACN | Detects 9-digit Australian Company Numbers issued by ASIC, including values with leading zeros. Values are validated using the official ACN check-digit algorithm. |
Dictionary key and file¶
| Dictionary key | Match type | Typical file | Drives tag |
|---|---|---|---|
| AU_ACN_KEYWORD | Fuzzy | au_acn_keyword.txt | AU_ACN |
In the dictionary, set Tags (input tags) so matches associate with AU_ACN. Typical entries include terms such as ACN, Australian Company Number, and company number.
Rules and model dependency¶
- Structured rules use c_AU_ACN_ML_MODEL and, for the strict rule, m_AU_ACN_KEYWORD. Enable the Australia ACN structured rules you need under Discovery → Rules (for example AU ACN Strict and AU ACN).
- The strict rule requires both the model and keyword dictionary; the review rule uses the model alone when no keyword is present.
To enable ACN detection, configure the following assets:
- Tag: AU_ACN
- Model: AU_ACN_ML_MODEL
- Dictionary: AU_ACN_KEYWORD (required for the strict rule)
Without the model, ACN rules are not evaluated.
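The published ACN check digit is a weighted sum over the first eight digits, complemented modulo 10. The sketch below follows that public algorithm; the helper name is hypothetical, and it is an assumption that the model performs the same steps. It also shows why leading zeros matter: a value stored as an integer loses them and no longer has nine digits.

```python
def is_valid_acn(value: str) -> bool:
    """Check a 9-digit ACN using the published check-digit algorithm."""
    compact = value.replace(" ", "")
    if len(compact) != 9 or not (compact.isascii() and compact.isdigit()):
        return False
    digits = [int(ch) for ch in compact]
    # Weights 8 down to 1 apply to the first eight digits.
    total = sum(d * w for d, w in zip(digits[:8], range(8, 0, -1)))
    check = (10 - total % 10) % 10
    return check == digits[8]
```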
Quick checklist¶
- Enable AU_ACN tag.
- Enable AU_ACN_KEYWORD dictionary and link the tag.
- Enable AU_ACN_ML_MODEL model.
- Enable ACN structured rules and Rules Mapping as needed.
New Zealand IRD Number¶
The product includes a fuzzy-match keyword dictionary for New Zealand Inland Revenue Department (IRD) numbers. Enable the dictionary under Discovery → Dictionaries when you need to classify New Zealand tax identifiers.
Tag and purpose¶
| Tag | What it detects |
|---|---|
| NZ_IRD | Detects 8- or 9-digit IRD numbers used for tax identification in New Zealand. Values are validated using the IRD check-digit algorithm. |
Dictionary key and file¶
| Dictionary key | Match type | Typical file | Drives tag |
|---|---|---|---|
| NZ_IRD_KEYWORD | Fuzzy | nz_ird_keyword.txt | NZ_IRD |
In the dictionary, set Tags (input tags) so matches associate with NZ_IRD. Typical entries include terms such as IRD, IRD number, Inland Revenue, and tax number.
Rules and model dependency¶
- Structured rules use c_NZ_IRD_ML_MODEL and, for the strict rule, m_NZ_IRD_KEYWORD. Enable the New Zealand IRD structured rules under Discovery → Rules (for example New Zealand IRD Number Strict and New Zealand IRD Number).
- Unstructured rules combine c_NZ_IRD_ML_MODEL with c_NZ_IRD_KEYWORD within a word-proximity window (for example rule_nz_ird). Use Discovery → Rules Mapping so outputs map to NZ_IRD.
To enable IRD detection, configure the following assets:
- Tag: NZ_IRD
- Model: NZ_IRD_ML_MODEL
- Dictionary: NZ_IRD_KEYWORD (required for strict and unstructured rules)
Without the model, IRD rules are not evaluated.
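The commonly published IRD check is a modulus-11 calculation with a secondary set of weights as a fallback. The sketch below implements that public scheme and deliberately omits the numeric range check that some specifications add; the function names are hypothetical, and equivalence with NZ_IRD_ML_MODEL is an assumption.

```python
PRIMARY_WEIGHTS = (3, 2, 7, 6, 5, 4, 3, 2)
SECONDARY_WEIGHTS = (7, 4, 3, 2, 5, 2, 7, 6)


def _check_digit(base, weights):
    """Return the modulus-11 check value (0 to 10) for eight base digits."""
    remainder = sum(d * w for d, w in zip(base, weights)) % 11
    return 0 if remainder == 0 else 11 - remainder


def is_valid_ird(value: str) -> bool:
    """Validate an 8- or 9-digit IRD number; hyphens and spaces are ignored."""
    compact = value.replace("-", "").replace(" ", "")
    if len(compact) not in (8, 9) or not (compact.isascii() and compact.isdigit()):
        return False
    digits = [int(ch) for ch in compact.zfill(9)]  # left-pad 8-digit numbers
    base, expected = digits[:8], digits[8]
    check = _check_digit(base, PRIMARY_WEIGHTS)
    if check == 10:  # primary weights are inconclusive, try the secondary set
        check = _check_digit(base, SECONDARY_WEIGHTS)
    return check != 10 and check == expected
```

The value 136410132 exercises the fallback path: the primary weights yield 10, and the secondary weights produce the expected check digit 2.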
Quick checklist¶
- Enable NZ_IRD tag.
- Enable NZ_IRD_KEYWORD dictionary and link the tag.
- Enable NZ_IRD_ML_MODEL model.
- Enable IRD structured and unstructured rules and Rules Mapping as needed.
Australia and New Zealand Bank Account Numbers¶
The product includes fuzzy-match keyword dictionaries for Australian and New Zealand bank account numbers. Enable them under Discovery → Dictionaries when ANZ bank-account classification is required.
Tags and purpose¶
| Tag | What it detects |
|---|---|
| AU_BANK_ACCOUNT | Detects Australian bank account numbers consisting of a 6-digit BSB (Bank-State-Branch) code plus an account number. |
| NZ_BANK_ACCOUNT | Detects New Zealand bank account numbers in the standard format (bank–branch–account–suffix). |
Dictionary keys and files¶
| Dictionary key | Match type | Typical file | Drives tag |
|---|---|---|---|
| AU_BANK_ACCOUNT_KEYWORD | Fuzzy | au_bank_account_keyword.txt | AU_BANK_ACCOUNT |
| NZ_BANK_ACCOUNT_KEYWORD | Fuzzy | nz_bank_account_keyword.txt | NZ_BANK_ACCOUNT |
In each dictionary, set Tags (input tags) so matches associate with the corresponding tag. Typical entries include terms such as BSB, bank account, account number, and bank account number.
Rules and model dependency¶
- Structured rules use c_AU_BANK_ACCOUNT_ML_MODEL / c_NZ_BANK_ACCOUNT_ML_MODEL and, for strict rules, m_AU_BANK_ACCOUNT_KEYWORD / m_NZ_BANK_ACCOUNT_KEYWORD. Enable the bank-account structured rules under Discovery → Rules (for example Australian Bank Account Number Strict, Australian Bank Account Number, New Zealand Bank Account Number Strict, and New Zealand Bank Account Number).
- Unstructured rules combine each model with its keyword dictionary within a word-proximity window (for example rule_au_bank_account and rule_nz_bank_account). Use Discovery → Rules Mapping so outputs map to AU_BANK_ACCOUNT and NZ_BANK_ACCOUNT.
- Enable each tag, model (AU_BANK_ACCOUNT_ML_MODEL, NZ_BANK_ACCOUNT_ML_MODEL), and keyword dictionary when using the corresponding strict or unstructured rules.
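The two account layouts can be sketched as regular expressions. This is only an illustration of the formats named in the tag table: the digit counts for the Australian account number (6 to 10) and the New Zealand segments (2-4-7 and a 2- or 3-digit suffix) are assumptions for the example, and the shipped models may accept other variants.

```python
import re

# Assumed layouts: BSB as 6 digits (optionally NNN-NNN) plus a 6 to 10 digit
# account number; NZ as bank(2)-branch(4)-account(7)-suffix(2 or 3).
AU_ACCOUNT = re.compile(r"(?P<bsb>\d{3}-?\d{3})[ -]?(?P<account>\d{6,10})")
NZ_ACCOUNT = re.compile(
    r"(?P<bank>\d{2})-(?P<branch>\d{4})-(?P<account>\d{7})-(?P<suffix>\d{2,3})"
)


def parse_bank_account(value: str):
    """Return (country, parts) for a recognised layout, else None."""
    text = value.strip()
    match = NZ_ACCOUNT.fullmatch(text)
    if match:
        return "NZ", match.groupdict()
    match = AU_ACCOUNT.fullmatch(text)
    if match:
        parts = match.groupdict()
        parts["bsb"] = parts["bsb"].replace("-", "")  # normalise NNN-NNN
        return "AU", parts
    return None
```

A bare digit run of the right length also fits the Australian layout in this sketch, which is one reason the strict rules pair the model with keyword context.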
Quick checklist¶
- Enable AU_BANK_ACCOUNT and/or NZ_BANK_ACCOUNT tags (as needed).
- Enable the matching keyword dictionaries and link tags on each.
- Enable AU_BANK_ACCOUNT_ML_MODEL and/or NZ_BANK_ACCOUNT_ML_MODEL models.
- Enable bank-account structured and unstructured rules and Rules Mapping as needed.
Australia Medicare Number¶
The product includes an Australian Medicare dictionary and tag that pair with the Australia Medicare Number model. Enable these under Discovery → Dictionaries and Discovery → Tags when Australian Medicare number classification is required.
Tags and purpose¶
| Tag | What it detects |
|---|---|
| AU_MEDICARE | Australian Medicare card numbers detected by AU_MEDICARE_ML_MODEL; strict rules combine the model with AU_MEDICARE_KEYWORD. |
Dictionary keys and files¶
| Dictionary key | Match type | Typical file | Drives tag |
|---|---|---|---|
| AU_MEDICARE_KEYWORD | Fuzzy | au_medicare_keyword.txt | AU_MEDICARE |
In the dictionary, set Tags (input tags) so matches associate with the AU_MEDICARE tag.
Rules¶
- Structured rules use column-level features:
  - c_AU_MEDICARE_ML_MODEL + m_AU_MEDICARE_KEYWORD → Australia Medicare Number Strict (auto-tag).
  - c_AU_MEDICARE_ML_MODEL alone → Australia Medicare Number (review).
- Unstructured rules combine the model feature with the keyword feature within a 5-word, order-strict proximity window:
  - rule_au_medicare → c_AU_MEDICARE_ML_MODEL + c_AU_MEDICARE_KEYWORD → AU_MEDICARE.
- Use Discovery → Rules Mapping to confirm AU_MEDICARE is mapped to its model feature key for unstructured output.
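A proximity window can be pictured as a forward scan over tokens. In the sketch below, "order-strict" is interpreted as "the keyword must come before the value", which is an assumption about the product's behaviour. The two predicate functions are simplistic stand-ins for AU_MEDICARE_KEYWORD and AU_MEDICARE_ML_MODEL (a 10-digit check only), not their real logic.

```python
def within_window(tokens, is_keyword, is_value, window=5):
    """True when a keyword token is followed by a value token within `window` words."""
    for i, token in enumerate(tokens):
        if not is_keyword(token):
            continue
        # Order-strict (assumed): only look forward from the keyword.
        for candidate in tokens[i + 1 : i + 1 + window]:
            if is_value(candidate):
                return True
    return False


# Hypothetical stand-in for the keyword dictionary.
def is_medicare_keyword(token):
    return token.lower().strip(":,.") == "medicare"


# Hypothetical stand-in for the model: any 10-digit token.
def is_medicare_like(token):
    digits = token.strip(":,.")
    return digits.isdigit() and len(digits) == 10
```

With this interpretation, "Medicare number is 2123456701" matches, while the same number appearing before the keyword, or more than five words after it, does not.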
Portal naming
Exact labels can vary by release. Search Tags, Dictionaries, and Rules for the names above.
Quick checklist¶
- Enable the AU_MEDICARE tag.
- Enable the AU_MEDICARE_ML_MODEL model.
- Enable the AU_MEDICARE_KEYWORD dictionary and link the AU_MEDICARE tag on it.
- Enable the Australia Medicare structured rules (Strict + non-strict variants) under Discovery → Rules.
- Enable the Australia Medicare unstructured rule and confirm Rules Mapping under Discovery → Rules Mapping.
Australia and New Zealand Passport¶
The product includes ANZ passport dictionaries and tags that pair with the Australian Passport and New Zealand Passport models. Enable these under Discovery → Dictionaries and Discovery → Tags when ANZ passport classification is required.
Tags and purpose¶
| Tag | What it detects |
|---|---|
| AU_PASSPORT | Australian passport numbers detected by AUSTRALIA_PASSPORT_ML_MODEL; strict rules combine the model with AU_PASSPORT_KEYWORD. |
| NZ_PASSPORT | New Zealand passport numbers detected by NEW_ZEALAND_PASSPORT_ML_MODEL; strict rules combine the model with NZ_PASSPORT_KEYWORD. |
Dictionary keys and files¶
| Dictionary key | Match type | Typical file | Drives tag |
|---|---|---|---|
| AU_PASSPORT_KEYWORD | Fuzzy | australia_passport_keyword.txt | AU_PASSPORT |
| NZ_PASSPORT_KEYWORD | Fuzzy | nz_passport_keyword.txt | NZ_PASSPORT |
In each dictionary, set Tags (input tags) so matches associate with the right AU_PASSPORT / NZ_PASSPORT tag.
Rules¶
- Structured rules use column-level features:
  - c_AUSTRALIA_PASSPORT_ML_MODEL + m_AU_PASSPORT_KEYWORD → Australian Passport Strict (auto-tag).
  - c_AUSTRALIA_PASSPORT_ML_MODEL alone → Australian Passport (review).
  - c_NEW_ZEALAND_PASSPORT_ML_MODEL + m_NZ_PASSPORT_KEYWORD → New Zealand Passport Strict (auto-tag).
  - c_NEW_ZEALAND_PASSPORT_ML_MODEL alone → New Zealand Passport (review).
- Unstructured rules combine the model feature with the keyword feature within a 5-word, order-strict proximity window:
  - rule_au_passport → c_AUSTRALIA_PASSPORT_ML_MODEL + c_AU_PASSPORT_KEYWORD → AU_PASSPORT.
  - rule_nz_passport → c_NEW_ZEALAND_PASSPORT_ML_MODEL + c_NZ_PASSPORT_KEYWORD → NZ_PASSPORT.
- Use Discovery → Rules Mapping to confirm AU_PASSPORT and NZ_PASSPORT are mapped to their model feature keys for unstructured output.
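The structured rules above reduce to a small decision over two column-level signals. The sketch below restates that decision in code; the `classify_column` function and its boolean inputs are hypothetical simplifications, since the real features carry scores rather than plain true/false values.

```python
from typing import Optional


def classify_column(model_hit: bool, keyword_hit: bool) -> Optional[str]:
    """Map the model and keyword signals for one column to a rule outcome."""
    if not model_hit:
        return None  # both listed rules require the model feature
    if keyword_hit:
        return "auto-tag"  # strict rule: model plus keyword
    return "review"  # non-strict rule: model alone
```

The same shape applies to the Australian and New Zealand rule pairs; only the feature and rule names change.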
Portal naming
Exact labels can vary by release. Search Tags, Dictionaries, and Rules for the names above.
Quick checklist¶
- Enable the AU_PASSPORT and/or NZ_PASSPORT tags (as needed).
- Enable the AUSTRALIA_PASSPORT_ML_MODEL and/or NEW_ZEALAND_PASSPORT_ML_MODEL models.
- Enable the AU_PASSPORT_KEYWORD / NZ_PASSPORT_KEYWORD dictionaries and link tags on each.
- Enable the AU / NZ structured rules (Strict + non-strict variants) under Discovery → Rules.
- Enable the AU / NZ unstructured rules and confirm Rules Mapping under Discovery → Rules Mapping.
Australia and New Zealand Driver Licence¶
The product includes ANZ driver licence dictionaries and tags that pair with the Australian Driver Licence and New Zealand Driver Licence models. Enable these under Discovery → Dictionaries and Discovery → Tags when ANZ driver licence classification is required.
Tags and purpose¶
| Tag | What it detects |
|---|---|
| AU_DRIVER_LICENSE | Australian driver licence numbers detected by AUSTRALIA_DRIVER_LICENSE_ML_MODEL; strict rules combine the model with AU_DRIVER_LICENSE_KEYWORD. |
| NZ_DRIVER_LICENSE | New Zealand driver licence numbers detected by NEW_ZEALAND_DRIVER_LICENSE_ML_MODEL; strict rules combine the model with NZ_DRIVER_LICENSE_KEYWORD. |
Dictionary keys and files¶
| Dictionary key | Match type | Typical file | Drives tag |
|---|---|---|---|
| AU_DRIVER_LICENSE_KEYWORD | Fuzzy | australia_driver_license_keyword.txt | AU_DRIVER_LICENSE |
| NZ_DRIVER_LICENSE_KEYWORD | Fuzzy | nz_driver_license_keyword.txt | NZ_DRIVER_LICENSE |
In each dictionary, set Tags (input tags) so matches associate with the right AU_DRIVER_LICENSE / NZ_DRIVER_LICENSE tag.
Rules¶
- Structured rules use column-level features:
  - c_AUSTRALIA_DRIVER_LICENSE_ML_MODEL + m_AU_DRIVER_LICENSE_KEYWORD → Australian Driver Licence Strict (ACTUAL_SCORE).
  - c_AUSTRALIA_DRIVER_LICENSE_ML_MODEL alone → Australian Driver Licence (review).
  - c_NEW_ZEALAND_DRIVER_LICENSE_ML_MODEL + m_NZ_DRIVER_LICENSE_KEYWORD → New Zealand Driver Licence Strict (ACTUAL_SCORE).
  - c_NEW_ZEALAND_DRIVER_LICENSE_ML_MODEL alone → New Zealand Driver Licence (review).
- Unstructured rules combine the model feature with the keyword feature within a 5-word, order-strict proximity window:
  - rule_au_driver_license → c_AUSTRALIA_DRIVER_LICENSE_ML_MODEL + c_AU_DRIVER_LICENSE_KEYWORD → AU_DRIVER_LICENSE.
  - rule_nz_driver_license → c_NEW_ZEALAND_DRIVER_LICENSE_ML_MODEL + c_NZ_DRIVER_LICENSE_KEYWORD → NZ_DRIVER_LICENSE.
- Use Discovery → Rules Mapping to confirm AU_DRIVER_LICENSE and NZ_DRIVER_LICENSE are mapped to their model feature keys for unstructured output.
NZ format overlap
The NZ driver licence format (2 letters + 6 digits, e.g. AB123456) is identical to the NZ passport format. Column-name keyword context (m_NZ_DRIVER_LICENSE_KEYWORD) is the primary differentiator in structured scans. In unstructured scans both the NZ_DRIVER_LICENSE and NZ_PASSPORT detectors may fire on the same value — this is expected behaviour.
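The overlap can be illustrated with a small sketch in which one format check serves both identifiers and only the column name narrows the result. The keyword tuples and the `candidate_tags` helper are hypothetical; the shipped keyword dictionaries and scoring are not reproduced here.

```python
import re
from typing import List

# Format described above: 2 letters followed by 6 digits.
NZ_ID_FORMAT = re.compile(r"[A-Z]{2}\d{6}", re.IGNORECASE)

# Hypothetical column-name keywords; the shipped keyword files may differ.
LICENCE_KEYWORDS = ("licence", "license", "driver")
PASSPORT_KEYWORDS = ("passport",)


def candidate_tags(column_name: str, value: str) -> List[str]:
    """Return the NZ tags a value could receive, narrowed by column-name context."""
    if not NZ_ID_FORMAT.fullmatch(value):
        return []
    name = column_name.lower()
    tags = []
    if any(keyword in name for keyword in LICENCE_KEYWORDS):
        tags.append("NZ_DRIVER_LICENSE")
    if any(keyword in name for keyword in PASSPORT_KEYWORDS):
        tags.append("NZ_PASSPORT")
    # With no keyword context, both detectors remain candidates.
    return tags or ["NZ_DRIVER_LICENSE", "NZ_PASSPORT"]
```

A column named `notes` gives no context, so the value AB123456 stays a candidate for both tags, which mirrors the unstructured behaviour described above.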
Portal naming
Exact labels can vary by release. Search Tags, Dictionaries, and Rules for the names above.
Quick checklist¶
- Enable the AU_DRIVER_LICENSE and/or NZ_DRIVER_LICENSE tags (as needed).
- Enable the AUSTRALIA_DRIVER_LICENSE_ML_MODEL and/or NEW_ZEALAND_DRIVER_LICENSE_ML_MODEL models.
- Enable the AU_DRIVER_LICENSE_KEYWORD / NZ_DRIVER_LICENSE_KEYWORD dictionaries and link tags on each.
- Enable the AU / NZ structured rules (Strict + non-strict variants) under Discovery → Rules.
- Enable the AU / NZ unstructured rules and confirm Rules Mapping under Discovery → Rules Mapping.
Best Practices¶
- Use Fuzzy Match for unstructured or user-generated content where spelling or word variations may occur.
- Use Exact Match for clean, controlled and consistent data like predefined lists or official terms.
- Use Pattern Match for structured formats and identifiers such as IDs, email addresses, or phone numbers.
Conclusion¶
Dictionaries in Privacera Discovery provide flexibility and control in identifying sensitive information. By leveraging the appropriate match type—fuzzy, exact, or regex—you can improve detection accuracy and enhance your data governance policies.