Heuristic Models in Privacera Discovery
Privacera supports a broad range of heuristic models designed to identify and classify sensitive data using logic, pattern recognition, and validation mechanisms. These models help automate sensitive data tagging and improve classification accuracy.
1. Generic Models
These are baseline models with configurable parameters to guide pattern matching and logic-based detection.
| Parameter | Type | Default | Description |
|---|---|---|---|
| INCLUDE_PATTERN_<#> | String | None | Patterns to be matched. |
| EXCLUDE_PATTERN_<#> | String | None | Patterns to be excluded from matching. |
| ONLY_DIGITS | Boolean | FALSE | Removes all non-numeric characters in the string before matching. |
| CHECK_DIGIT_CODE_VALIDATE | String | None | Indicates whether to evaluate a checksum digit based on the last digit. |
| DO_LOOKUP | Boolean | FALSE | Enables pattern lookup using LOOKUP_PATTERN. |
| LOOKUP_DICT | String | None | A dictionary name or key. |
| LOOKUP_PATTERN | String | None | Pattern used for matching. |
| ISO3166_CC_VALIDATE_FLAG | Boolean | FALSE | Enables ISO country code validation using ISO3166_CC_PATTERN. |
| ISO3166_CC_PATTERN | String | None | Pattern for matching ISO country codes. |
| ISO3166_CC_LOOKUP_KEY | String | None | Dictionary name for country code lookup. |
2. Credit Card Model
| Parameter | Type | Default | Description |
|---|---|---|---|
| CC_PATTERN | String | — | Override pattern for credit card numbers. |
| DEFAULT_TYPES | Boolean | TRUE | Validate against known issuing network prefixes. |
| LUHN_CHECK | Boolean | TRUE | Validate the Luhn checksum on the credit card number. |
| DEFAULT_VALIDATORS_TO_EXCLUDE | String | — | Comma-separated list of default validators to exclude from validation. Example: RUPAY,MIR,TROY etc. |
Supported Default Credit Card Types: AMEX, MASTERCARD, VISA, DINERS, DISCOVER, JCBCARD, MAESTRO, CHINA_UNIONPAY, INSTAPAYMENT, MIR, RUPAY, TROY, VERVE, HIPERCARD, AURA, CARNET, BCGLOBAL
Note
You can exclude the default credit card types from validation by setting the DEFAULT_VALIDATORS_TO_EXCLUDE parameter.
Supported Default Credit Card Types
| Card Type | Description |
|---|---|
| AMEX | Starts with 34 or 37, 15 digits |
| MASTERCARD | Starts with 2221–2720 or 51–55, 16 digits |
| VISA | Starts with 4, 16-19 digits |
| DINERS | Starts with 300–305, 3095, 36, 38, 39; 14 digits |
| DISCOVER | Starts with 6011, 622–628, 644–649, or 65; 16–17 digits |
| JCBCARD | Starts with 2131 or 1800 (15 digits), or 35 (16 digits) |
| MAESTRO | Starts with 5018, 5020, 5038, 6304, 6759, 6761, 6763; 12–19 digits |
| CHINA_UNIONPAY | Starts with 62, 16–19 digits |
| INSTAPAYMENT | Starts with 637–639, 16 digits |
| MIR | Starts with 2200–2204, 16–19 digits |
| RUPAY | Starts with 60, 65, 81, 82; 16 digits |
| TROY | Starts with 9792, 16 digits |
| VERVE | BIN ranges: 506099–506199, 507865–507896, 650002–650027; 16–19 digits |
| HIPERCARD | BINs like 384100,384140,384160,637568,637599,637609,637612; 16–19 digits |
| AURA | Starts with 507860; 16–19 digits |
| CARNET | BINs like 286900, 506203,506222,506237,506262,506276,506281,506301 etc.; 16–19 digits |
| BCGLOBAL | Specific 6541/6556/700013 ranges; 16 digits |
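The prefix rules in the table and the LUHN_CHECK option can be illustrated with a short Python sketch. It is not Privacera's implementation: the two regexes are derived from the AMEX and VISA rows above, and CARD_RULES and classify_card are hypothetical names used only for illustration.

```python
import re

# Two prefix/length rules derived from the card-type table above. The real
# model covers every listed type; its internal patterns are not shown here.
CARD_RULES = {
    "AMEX": re.compile(r"3[47][0-9]{13}"),   # starts with 34 or 37, 15 digits
    "VISA": re.compile(r"4[0-9]{15,18}"),    # starts with 4, 16-19 digits
}


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:       # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def classify_card(raw: str, luhn_check: bool = True):
    """Hypothetical helper: return the matching card type, or None."""
    digits = re.sub(r"[^0-9]", "", raw)      # drop spaces, hyphens, etc.
    for card_type, pattern in CARD_RULES.items():
        if pattern.fullmatch(digits) and (not luhn_check or luhn_valid(digits)):
            return card_type
    return None
```

With the widely used test number 4111 1111 1111 1111 the helper returns "VISA"; changing the last digit breaks the checksum, so the same call returns None unless luhn_check is set to False.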
To support additional card types, use:
Regex property names
ADDITIONAL_REGEX_<CARD_TYPE>: Custom regex for matching a specific credit card type.
3. Date of Birth (DOB) Model
| Parameter | Type | Default | Description |
|---|---|---|---|
| MIN_AGE_YEARS | Integer | 5 | Age lower threshold. |
| MAX_AGE_YEARS | Integer | 100 | Age upper threshold. |
| DATE_REGEX_var1 | String | — | Custom regex for matching date format. |
| DATE_FORMAT_var1 | String | — | Date format corresponding to the regex. |
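The age window defined by MIN_AGE_YEARS and MAX_AGE_YEARS can be sketched as follows. This is an assumed interpretation rather than the product's logic: plausible_dob is a hypothetical helper, the bounds are treated as inclusive, and the format string uses Python strptime directives, which may differ from the syntax DATE_FORMAT_var1 expects.

```python
from datetime import date, datetime


def plausible_dob(text, date_format="%Y-%m-%d", min_age_years=5,
                  max_age_years=100, today=None):
    """Hypothetical helper: True if text parses as a date whose implied
    age lies within [min_age_years, max_age_years] (inclusive, assumed)."""
    today = today or date.today()
    try:
        born = datetime.strptime(text, date_format).date()
    except ValueError:
        return False                      # not a date in the expected format
    # Subtract one year if this year's birthday has not happened yet.
    before_birthday = (today.month, today.day) < (born.month, born.day)
    age = today.year - born.year - before_birthday
    return min_age_years <= age <= max_age_years
```

Passing a fixed today value keeps the check reproducible; with the defaults, a date that implies an age of 2 or 124 years is rejected as an unlikely date of birth.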
4. EIN Model
| Parameter | Type | Default | Description |
|---|---|---|---|
| EIN_PATTERN | String | — | Override default EIN pattern. |
| VALIDATIONS | Boolean | TRUE | Enable format validations. |
| STRICT_PATTERN | Boolean | TRUE | Match only if EIN has exact format. |
5. Geo Latitude/Longitude Model
| Parameter | Type | Default | Description |
|---|---|---|---|
| MIN_LAT | Double | — | Lower bound on latitude. |
| MAX_LAT | Double | — | Upper bound on latitude. |
| MIN_LONG | Double | — | Lower bound on longitude. |
| MAX_LONG | Double | — | Upper bound on longitude. |
| MIN_FRACTIONAL_DIGITS | Integer | 3 | Minimum decimal precision. |
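A minimal sketch of how a range bound and MIN_FRACTIONAL_DIGITS might combine is shown below. plausible_coordinate is a hypothetical helper, and the ranges used in the usage note are the usual geographic limits, not documented defaults.

```python
def plausible_coordinate(text, min_value, max_value, min_fractional_digits=3):
    """Hypothetical helper: range check plus a minimum number of decimals."""
    text = text.strip()
    try:
        value = float(text)
    except ValueError:
        return False
    # Count the digits after the decimal point in the original text.
    _, _, fraction = text.partition(".")
    in_range = min_value <= value <= max_value
    return in_range and len(fraction) >= min_fractional_digits
```

Under these assumptions, plausible_coordinate("37.7749", -90.0, 90.0) passes as a latitude, while "37.7" fails the precision check and "123.4567" fails the range check.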
6. ITIN Model
| Parameter | Type | Default | Description |
|---|---|---|---|
| ITIN_PATTERN | String | — | Override default ITIN pattern. |
| STRICT_PATTERN | Boolean | TRUE | Match only if ITIN has exact format. |
7. MIME Model
| Parameter | Type | Default | Description |
|---|---|---|---|
| LOOKUP_DICT | String | — | Dictionary name of MIME types. |
Examples: EXEC_MIME_KEYWORD, IMAGE_MIME_KEYWORD
8. Phone Number Model
| Parameter | Type | Default | Description |
|---|---|---|---|
| COUNTRY_CODE | String | US | 2-letter ISO country code. |
9. SSN Model
This model identifies possible SSNs (Social Security Numbers) in data by applying pattern matching, format checks, and optionally statistical/randomness-based validations.
There are two main phases:
A. Pattern Matching Phase
- Uses regular expressions (regex) to identify common SSN formats (e.g., 123-45-6789, 123456789, or 6789).
- Configurable using pattern-related flags like STRICT_PATTERN, USE_9_DIGIT_PATTERN, etc.
B. Statistical Validation Phase
- Applies advanced validations like entropy, digit distribution (variance), and sequentiality.
- These checks assess how "random" or "natural" the SSN-like string appears.
- Controlled by flags like USE_SEQUENCE_CHECK, USE_VARIANCE_CHECK, etc.
- All these are collectively controlled by USE_ADDITIONAL_VALIDATIONS.
Configuration Groups
A. Pattern Configuration
- Customize what qualifies as an SSN match:
| Parameter | Type | Default | Description |
|---|---|---|---|
| SSN_PATTERN | String | — | Override default SSN pattern. |
| STRICT_PATTERN | Boolean | FALSE | Match only if SSN has exact format. |
| USE_9_DIGIT_PATTERN | Boolean | FALSE | Allow matching of any 9-digit string. |
| USE_4_DIGIT_PATTERN | Boolean | FALSE | Allow matching of 4-digit string. |
| STRICT_EXT_PATTERN | Boolean | TRUE | Requires hyphen/dot/space-separated format. |
B. Basic Validation Settings
- Check for basic data quality and known issues:
- Examples of rejected values: "111111111", "987654321", "000000000", "1234567890", "12345678901"
| Parameter | Type | Default | Description |
|---|---|---|---|
| VALIDATIONS | Boolean | TRUE | Enable blacklist checks. |
| UNIQUE_DIGIT_THRESHOLD | Integer | 3 | Minimum number of unique digits required. |
| MAX_REPETITION_ALLOWED | Integer | 5 | Maximum allowed repeated digits. |
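The basic validation settings can be sketched in Python as shown below. Several points are assumptions: passes_basic_validation is a hypothetical helper, the example values above are treated as a blacklist, and "repeated digits" is read as the number of times the most frequent digit occurs.

```python
from collections import Counter

# Values taken from the examples listed above; treating them as the
# blacklist is an assumption made for this sketch.
BLACKLIST = {"111111111", "987654321", "000000000", "1234567890", "12345678901"}


def passes_basic_validation(digits, unique_digit_threshold=3,
                            max_repetition_allowed=5):
    """Hypothetical helper mirroring the three settings in the table."""
    if digits in BLACKLIST:
        return False
    counts = Counter(digits)
    if len(counts) < unique_digit_threshold:
        return False                      # too few distinct digits
    # Assumption: repetition = occurrences of the most common digit.
    return max(counts.values()) <= max_repetition_allowed
```

With the default thresholds, "121212121" is rejected for having only two distinct digits and "111111234" for repeating one digit six times.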
C. Entropy Checks
- Entropy measures how random/unpredictable a number string is.
| Parameter | Type | Default | Description |
|---|---|---|---|
| USE_ENTROPY_CHECK | Boolean | TRUE | Enable entropy-based checks to validate randomness of the match. |
| MAX_ENTROPY_SCORE_DEDUCTION | Integer | 40 | Maximum penalty based on low entropy in matched SSN patterns. |
| MIN_ENTROPY | Double | 1.0 | Minimum acceptable entropy value for matched results. |
| MAX_ENTROPY | Double | 3.0 | Maximum acceptable entropy value for matched results. |
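To make the entropy measure concrete, the sketch below computes Shannon entropy over the characters of a digit string. Whether the model uses exactly this formula, and how it converts the result into a score deduction, is not stated here, so this is only an illustration.

```python
import math
from collections import Counter


def shannon_entropy(digits: str) -> float:
    """Shannon entropy, in bits per character, of the given string."""
    counts = Counter(digits)
    total = len(digits)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A string of identical digits such as "111111111" has entropy 0, an alternating string such as "1212" has entropy 1.0, and nine distinct digits give log2(9), roughly 3.17. A value could then be compared against MIN_ENTROPY and MAX_ENTROPY.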
D. Variance Checks
- Variance measures how digits are distributed, which helps filter out non-random digit repetition.
| Parameter | Type | Default | Description |
|---|---|---|---|
| USE_VARIANCE_CHECK | Boolean | TRUE | Enable variance-based checks on digit distribution. |
| MAX_VARIANCE_SCORE_DEDUCTION | Integer | 40 | Maximum penalty based on variance of digit distribution. |
| MIN_VARIANCE | Double | 1e12 | Minimum acceptable variance for randomness checks. |
| MAX_VARIANCE | Double | 3e15 | Maximum acceptable variance for randomness checks. |
E. Sequence Checks
- Looks for sequential digit patterns (for example, 123456789 or 987654321).
| Parameter | Type | Default | Description |
|---|---|---|---|
| USE_SEQUENCE_CHECK | Boolean | TRUE | Enable sequence pattern checks. |
| MAX_SEQUENCE_SCORE_DEDUCTION | Integer | 40 | Maximum penalty applied if sequential patterns are detected. |
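One way to detect sequential patterns is to measure the longest run of digits that step up or down by one. The sketch below takes that approach; it is an assumed technique, not the documented algorithm, and longest_sequential_run is a hypothetical name.

```python
def longest_sequential_run(digits: str) -> int:
    """Length of the longest run in which consecutive digits keep
    increasing by 1 or keep decreasing by 1."""
    best = run = 1 if digits else 0
    direction = 0                          # +1 ascending, -1 descending, 0 none
    for prev, cur in zip(digits, digits[1:]):
        step = int(cur) - int(prev)
        if step in (1, -1) and step == direction:
            run += 1                       # the current run continues
        elif step in (1, -1):
            direction, run = step, 2       # a new run starts with this pair
        else:
            direction, run = 0, 1          # the run is broken
        best = max(best, run)
    return best
```

A fully sequential value such as "123456789" yields 9, whereas a value with no real sequence yields 2 or less; a penalty could scale with that length, up to MAX_SEQUENCE_SCORE_DEDUCTION.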
F. Sampling Control
- These determine when randomness-based logic should apply.
| Parameter | Type | Default | Description |
|---|---|---|---|
| MINIMUM_DATA_SIZE | Integer | 20 | Minimum number of samples required for randomness validation to apply. |
Examples of invalid SSNs:
- Starts with 9, 666, 000, or 98765432
- Middle digits = 00
- Last digits = 0000
- Dummy values: 123456789, 111111111, etc.
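The invalid-SSN rules listed above translate directly into a few string checks. The sketch below is illustrative only: is_structurally_valid_ssn is a hypothetical helper and DUMMY_VALUES contains just the two dummy values named above.

```python
import re

# Only the two dummy values named above; the real list is likely longer.
DUMMY_VALUES = {"123456789", "111111111"}


def is_structurally_valid_ssn(raw: str) -> bool:
    """Hypothetical helper applying the invalid-SSN rules listed above."""
    digits = re.sub(r"[^0-9]", "", raw)    # drop hyphens, dots, spaces
    if len(digits) != 9 or digits in DUMMY_VALUES:
        return False
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    # A leading 9 also covers the 98765432 prefix.
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"
```

For instance, "666-12-3456" and "123-00-4567" are rejected, while "123-45-6780" passes these structural checks.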
10. VIN Model
- Detects Vehicle Identification Numbers (VINs).
- Validates using length and VIN-specific checksum.
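The length and checksum validation can be illustrated with the standard North American VIN check-digit scheme, in which position 9 holds a check character computed from the other 16. That Privacera uses exactly this scheme is an assumption, and vin_checksum_valid is a hypothetical name.

```python
# Letter values used by the standard VIN check-digit scheme
# (I, O and Q are not allowed in a VIN).
TRANSLITERATION = dict(zip(
    "ABCDEFGHJKLMNPRSTUVWXYZ",
    [1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 7, 9, 2, 3, 4, 5, 6, 7, 8, 9],
))
# Positional weights; the check digit itself (position 9) has weight 0.
WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]


def vin_checksum_valid(vin: str) -> bool:
    """Return True if the VIN has 17 characters and a matching check digit."""
    vin = vin.upper()
    if len(vin) != 17:
        return False
    total = 0
    for char, weight in zip(vin, WEIGHTS):
        if char in "0123456789":
            value = int(char)
        elif char in TRANSLITERATION:
            value = TRANSLITERATION[char]
        else:
            return False                   # character not allowed in a VIN
        total += value * weight
    remainder = total % 11
    expected = "X" if remainder == 10 else str(remainder)
    return vin[8] == expected
```

The commonly cited sample VIN 1M8GDM9AXKP042788 passes this check; replacing its ninth character with any other value makes it fail.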
11. ZIP Code Model
| Parameter | Type | Default | Description |
|---|---|---|---|
| ZIP_DICT_KEY | String | US_ZIP_LOOKUP | Dictionary name of ZIP codes. |
| ZIP_PATTERN | String | — | Pattern for matching ZIP codes. |
| STRICT_PATTERN | Boolean | FALSE | If TRUE, enforces strict 5+4 ZIP format. |
12. Driving License Model
- Detects potential U.S. driver license numbers across 50 states.
- Utilizes state-specific regex patterns for structural validation.
- Applies robust heuristics to minimize false positives:
    - Length must be between 4 and 17 characters.
    - Rejects low-entropy or repetitive sequences.
    - Enforces character diversity to ensure validity.
13. Brazil CPF Model
The Brazil CPF (Cadastro de Pessoas Físicas) model identifies Brazilian individual taxpayer identification numbers using pattern matching and validation algorithms.
| Parameter | Type | Default | Description |
|---|---|---|---|
| DEFAULT_TYPES | Boolean | TRUE | Validates against the standard CPF format and checksum algorithm. |
| ADDITIONAL_REGEX_<TYPE> | String | — | Matches specific CPF formats using a custom regex. |
Tip
This model is disabled by default. To detect CPF numbers in your data, you must enable it.
The model identifies CPF numbers based on the following criteria:
- Format: An 11-digit number in XXX.XXX.XXX-XX, XXXXXXXXXXX, or XXX XXX XXX XX format.
- Validation: Uses the CPF checksum algorithm (mod 11).
- Exclusions: Automatically excludes known invalid patterns (e.g., 000.000.000-00, 111.111.111-11).
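The mod 11 validation follows the publicly known CPF check-digit rule, sketched below. The function names are hypothetical, and the sketch strips every non-digit character, so it is more permissive about separators than the three formats listed above.

```python
import re


def cpf_check_digit(digits: str) -> int:
    """Compute one CPF check digit from the digits that precede it."""
    # Weights count down to 2: 10..2 for the first check digit,
    # 11..2 for the second.
    weights = range(len(digits) + 1, 1, -1)
    remainder = sum(int(d) * w for d, w in zip(digits, weights)) * 10 % 11
    return 0 if remainder == 10 else remainder


def is_valid_cpf(raw: str) -> bool:
    """Hypothetical helper: length, repeated-digit and checksum checks."""
    digits = re.sub(r"[^0-9]", "", raw)
    # Repeated-digit values such as 111.111.111-11 satisfy the checksum,
    # so they have to be excluded explicitly.
    if len(digits) != 11 or len(set(digits)) == 1:
        return False
    first = cpf_check_digit(digits[:9])
    second = cpf_check_digit(digits[:10])
    return digits[9:] == f"{first}{second}"
```

The explicit exclusion matters because values made of a single repeated digit pass the arithmetic check even though they are not valid CPF numbers.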
Custom Formats
You can define custom CPF formats using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your custom format.
14. Brazil CNPJ Model
The Brazil CNPJ (Cadastro Nacional da Pessoa Jurídica) model identifies Brazilian company identification numbers using pattern matching and validation algorithms.
| Parameter | Type | Default | Description |
|---|---|---|---|
| DEFAULT_TYPES | Boolean | TRUE | Validates against the standard CNPJ format and checksum algorithm. |
| ADDITIONAL_REGEX_<TYPE> | String | — | Matches specific CNPJ formats using a custom regex. |
Tip
This model is disabled by default. To detect CNPJ numbers in your data, you must enable it.
The model identifies CNPJ numbers based on the following criteria:
- Format: A 14-digit number in XX.XXX.XXX/XXXX-XX, XXXXXXXXXXXXXX, or XX XXX XXX XXXX XX format.
- Validation: Uses the CNPJ checksum algorithm (mod 11).
- Exclusions: Automatically excludes known invalid patterns (e.g., 00.000.000/0000-00, 11.111.111/1111-11).
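The CNPJ check digits use a mod 11 rule with weights that cycle from 2 to 9, reading right to left. The sketch below follows that publicly known rule for numeric CNPJs; the function names are hypothetical, and separators are stripped more liberally than the three listed formats.

```python
import re


def cnpj_check_digit(digits: str) -> int:
    """Compute one CNPJ check digit from the digits that precede it."""
    # Weights cycle 2..9 from right to left, e.g. 5,4,3,2,9,8,7,6,5,4,3,2
    # for the first check digit.
    weights = [(i % 8) + 2 for i in range(len(digits))][::-1]
    remainder = sum(int(d) * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder


def is_valid_cnpj(raw: str) -> bool:
    """Hypothetical helper: length, repeated-digit and checksum checks."""
    digits = re.sub(r"[^0-9]", "", raw)
    if len(digits) != 14 or len(set(digits)) == 1:
        return False                       # wrong length or repeated digit
    first = cnpj_check_digit(digits[:12])
    second = cnpj_check_digit(digits[:13])
    return digits[12:] == f"{first}{second}"
```

The commonly used sample value 11.222.333/0001-81 passes, while changing its last digit or using an all-zero value fails.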
Custom Formats
You can define custom CNPJ formats using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your custom format.
15. Canada SIN Model
The Canada SIN (Social Insurance Number) model detects Canadian Social Insurance Numbers issued to individuals for tax and government benefit purposes. The model leverages pattern recognition and checksum validation logic to accurately identify valid SIN values while reducing false positives.
Tip
The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Canada SIN numbers in your data, enable the following in the portal:
- Tag: Enable the CANADA_SIN tag (Tags).
- Model: Enable the Canada SIN model, type CANADA_SIN_ML_MODEL (Models).
- Dictionary: Enable the CANADA_SIN_KEYWORD dictionary if you use the strict rule (Dictionaries).
- Rules: Ensure the Canada SIN structured rules (e.g. "Canada SIN Number", "Canada SIN Number Strict") and the unstructured rule (e.g. rule_canada_sin_number) are present and enabled (Rules).
The model identifies Canada SIN numbers based on the following criteria:
- Format: Exactly nine digits in one of the following formats:

    | Format | Example |
    |---|---|
    | Hyphen-separated | 123-456-789 |
    | Space-separated | 123 456 789 |
    | Raw (no separator) | 123456789 |

    Sequences of fewer than nine digits are neither matched nor padded to nine. Word boundaries prevent substrings of longer digit sequences (e.g. 10 or more digits) from being matched.
- Validation: The full nine-digit value is validated using the Luhn algorithm. The first digit must not be 0 or 8.
- Exclusions: Automatically excludes known invalid patterns, such as all-zero sequences and repetitive digits (e.g. 111-111-111).
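The format, validation, and exclusion criteria above can be combined in a short Python sketch. The regex and the find_sins helper are illustrative assumptions, not the model's actual pattern; in particular, requiring the same separator in both positions is a choice made for this sketch.

```python
import re

# Word boundaries keep the pattern from matching inside longer digit runs.
# The backreference \2 requires the same separator (or none) in both places.
SIN_PATTERN = re.compile(r"\b([0-9]{3})([- ]?)([0-9]{3})\2([0-9]{3})\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:                 # double every second digit
            digit = digit * 2 - 9 if digit > 4 else digit * 2
        total += digit
    return total % 10 == 0


def find_sins(text: str) -> list:
    """Hypothetical helper: return SIN-like matches that pass all checks."""
    results = []
    for match in SIN_PATTERN.finditer(text):
        digits = match.group(1) + match.group(3) + match.group(4)
        if digits[0] in "08":              # first digit must not be 0 or 8
            continue
        if len(set(digits)) == 1:          # repetitive digits, e.g. 111-111-111
            continue
        if luhn_valid(digits):
            results.append(match.group(0))
    return results
```

Under these assumptions, "130-692-544" is returned because it passes the Luhn check, "046 454 286" is dropped because it starts with 0, and a ten-digit run such as "1306925440" is never matched.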
Custom Formats
You can extend SIN detection to additional formats by defining custom regex patterns using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your format.
16. Canada Postal Code Model
The Canada Postal Code model identifies Canadian postal codes using pattern matching and character-set validation. It uses the tag CANADA_POSTAL_CODE.
Tip
The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Canadian postal codes in your data, enable the following in the portal:
- Tag: Enable the CANADA_POSTAL_CODE tag (Tags).
- Model: Enable the Canada Postal Code model, type CANADA_POSTAL_CODE_ML_MODEL (Models).
- Dictionary: Enable the CANADA_POSTAL_CODE_KEYWORD dictionary if you use the strict rule (Dictionaries).
- Rules: Ensure the Canada Postal Code structured rules (e.g. "Canada Postal Code", "Canada Postal Code Strict") and the unstructured rule (e.g. rule_canada_postal_code) are present and enabled (Rules).
The model identifies Canada Postal Code values based on the following criteria:
- Format: Postal codes follow the pattern A1A 1A1: a Forward Sortation Area (FSA) of letter-digit-letter, an optional space, then a Local Delivery Unit (LDU) of digit-letter-digit. Both A1A 1A1 (with space) and A1A1A1 (without space) are accepted.
- Keyword dictionary: The strict classification rule uses the CANADA_POSTAL_CODE_KEYWORD dictionary (which includes terms such as postal code, postcode, and code postal) alongside the detector to improve match confidence.
- Validation: Only permitted letters are accepted in each position:

    | Position | Excluded Letters | Allowed Set |
    |---|---|---|
    | 1st (FSA, letter) | D, F, I, O, Q, U, W, Z | A, B, C, E, G, H, J–N, P, R, S, T, V, X, Y |
    | 3rd (FSA, letter) | D, F, I, O, Q, U | A, B, C, E, G, H, J–N, P, R, S, T, V–Z |
    | 6th (LDU, letter) | D, F, I, O, Q, U | A, B, C, E, G, H, J–N, P, R, S, T, V–Z |
Examples
- Column or field names: Postal Code, Postcode, Code postal, Zip, Mailing Code
- Sample values: K1A 0B1, M5H 2N2, T2E 7W2 (with space); K1A0B1, M5H2N2 (no space)
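The format and letter restrictions described in this section can be expressed as a single regular expression. The sketch below is an illustration built from the allowed sets above, not the model's actual pattern; is_canada_postal_code is a hypothetical helper, and uppercasing the input before matching is an assumption.

```python
import re

FSA_FIRST = "ABCEGHJ-NPRSTVXY"        # excludes D, F, I, O, Q, U, W, Z
OTHER_LETTERS = "ABCEGHJ-NPRSTV-Z"    # excludes D, F, I, O, Q, U

# Letter-digit-letter, optional space, digit-letter-digit.
POSTAL_CODE = re.compile(
    rf"[{FSA_FIRST}][0-9][{OTHER_LETTERS}] ?[0-9][{OTHER_LETTERS}][0-9]"
)


def is_canada_postal_code(value: str) -> bool:
    """Hypothetical helper: True if the whole value is a valid postal code."""
    return POSTAL_CODE.fullmatch(value.strip().upper()) is not None
```

The sample values above, such as K1A 0B1 and M5H2N2, match, whereas W1A 0B1 fails on its first letter and K1D 0B1 fails on the excluded letter D.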
17. Canada Driver License Model
- Detects potential Canadian driver license numbers across all 13 provinces and territories.
- Utilizes province-specific regex patterns for structural validation.
- Applies robust heuristics to minimize false positives:
    - Length must be between 5 and 17 characters.
    - Rejects sequential or repetitive patterns.
Tip
The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Canadian driver license numbers in your data, enable the following in the portal:
- Tag: Enable the CANADA_DRIVER_LICENSE tag (Tags).
- Model: Enable the Canada Driver License model, type CA_DL_ML_MODEL (Models).
- Dictionary: Enable the CA_DL_KEYWORD dictionary if you use the strict rule (Dictionaries).
- Rules: Ensure the Canada Driver License structured rules (e.g. "Canada Driver License", "Canada Driver License Strict") and the unstructured rule (e.g. rule_canada_driver_license) are present and enabled (Rules).
The model supports the following configuration parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| EXCLUDED_PROVINCES | String | — | Comma-separated list of provinces/territories to exclude from detection. By default all are included; use this to exclude only the ones you do not need. Values are case-insensitive (normalized to uppercase for matching). |
| MINIMUM_VALID_LENGTH | Integer | 5 | Minimum length for valid license numbers. |
| MAXIMUM_VALID_LENGTH | Integer | 17 | Maximum length for valid license numbers. |
Valid values for EXCLUDED_PROVINCES (use exact names; case-insensitive):
ALBERTA, BRITISH_COLUMBIA, MANITOBA, MANITOBA_NO_HYPHEN, NEW_BRUNSWICK, NEWFOUNDLAND_AND_LABRADOR, NOVA_SCOTIA, ONTARIO, PRINCE_EDWARD_ISLAND, QUEBEC_HYPHEN, QUEBEC_SPACE, QUEBEC_NO_SEP, SASKATCHEWAN, NORTHWEST_TERRITORIES, NUNAVUT, YUKON
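Because EXCLUDED_PROVINCES is a comma-separated, case-insensitive list, it can be worth normalizing and checking a value before applying it. The sketch below shows one way to do that; parse_excluded_provinces is a hypothetical helper, and raising an error for unknown names is a choice made here, since the product's behavior for unrecognized values is not described.

```python
# The valid values listed above.
VALID_PROVINCE_KEYS = {
    "ALBERTA", "BRITISH_COLUMBIA", "MANITOBA", "MANITOBA_NO_HYPHEN",
    "NEW_BRUNSWICK", "NEWFOUNDLAND_AND_LABRADOR", "NOVA_SCOTIA", "ONTARIO",
    "PRINCE_EDWARD_ISLAND", "QUEBEC_HYPHEN", "QUEBEC_SPACE", "QUEBEC_NO_SEP",
    "SASKATCHEWAN", "NORTHWEST_TERRITORIES", "NUNAVUT", "YUKON",
}


def parse_excluded_provinces(value: str) -> set:
    """Hypothetical helper: split, trim, uppercase and validate the list."""
    keys = {part.strip().upper() for part in value.split(",") if part.strip()}
    unknown = keys - VALID_PROVINCE_KEYS
    if unknown:
        raise ValueError(f"Unknown province keys: {sorted(unknown)}")
    return keys
```

For example, "ontario, Quebec_Hyphen" normalizes to ONTARIO and QUEBEC_HYPHEN, while a bare "QUEBEC" is flagged because only the three QUEBEC_* variants appear in the list.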
Conclusion
Privacera Discovery's model-based classification supports numerous sensitive data formats through specialized validation rules and pattern matching. These models help achieve consistent and automated tagging of data across domains such as finance, healthcare, HR, and more.