Heuristic Models in Privacera Discovery

Privacera supports a broad range of heuristic models designed to identify and classify sensitive data using logic, pattern recognition, and validation mechanisms. These models help automate sensitive data tagging and improve classification accuracy.

1. Generic Models

These are baseline models with configurable parameters to guide pattern matching and logic-based detection.

Parameter	Type	Default	Description
INCLUDE_PATTERN_<#>	String	None	Patterns to be matched.
EXCLUDE_PATTERN_<#>	String	None	Patterns to be excluded from matching.
ONLY_DIGITS	Boolean	FALSE	Removes all non-numeric characters in the string before matching.
CHECK_DIGIT_CODE_VALIDATE	String	None	Indicates whether to evaluate a checksum digit based on the last digit.
DO_LOOKUP	Boolean	FALSE	Enables pattern lookup using `LOOKUP_PATTERN`.
LOOKUP_DICT	String	None	A dictionary name or key.
LOOKUP_PATTERN	String	None	Pattern used for matching.
ISO3166_CC_VALIDATE_FLAG	Boolean	FALSE	Enables ISO country code validation using `ISO3166_CC_PATTERN`.
ISO3166_CC_PATTERN	String	None	Pattern for matching ISO country codes.
ISO3166_CC_LOOKUP_KEY	String	None	Dictionary name for country code lookup.
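
As a rough illustration of how the pattern parameters could interact, the following Python sketch strips non-numeric characters when `ONLY_DIGITS` is set and then applies the exclude and include patterns. The function name `generic_match`, the use of full-string matching, and the order of evaluation are assumptions made for this example; they are not taken from the Privacera implementation.

```python
import re

def generic_match(value, include_patterns, exclude_patterns=(), only_digits=False):
    """Return True if value matches an include pattern and no exclude pattern.

    Hypothetical helper: the evaluation order and full-string matching
    are assumptions, not documented Privacera behavior.
    """
    if only_digits:
        # Mirrors ONLY_DIGITS: drop every non-numeric character before matching.
        value = re.sub(r"\D", "", value)
    # Any exclude pattern that matches rejects the value outright.
    if any(re.fullmatch(p, value) for p in exclude_patterns):
        return False
    return any(re.fullmatch(p, value) for p in include_patterns)
```

For example, `generic_match("12-345", [r"\d{5}"], only_digits=True)` matches because the hyphen is removed before the pattern is applied.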

2. Credit Card Model

Parameter	Type	Default	Description
CC_PATTERN	String	—	Override pattern for credit card numbers.
DEFAULT_TYPES	Boolean	TRUE	Validate against known issuing network prefixes.
LUHN_CHECK	Boolean	TRUE	Validate the Luhn checksum on the credit card number.
DEFAULT_VALIDATORS_TO_EXCLUDE	String	—	Comma-separated list of default validators to exclude from validation, for example `RUPAY,MIR,TROY`.
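
The Luhn checksum referenced by `LUHN_CHECK` is a standard algorithm. A minimal Python sketch of it is shown here for orientation; the function name `luhn_valid` is hypothetical, and ignoring separators such as spaces or hyphens is an assumption of this example.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Assumption: separators (spaces, hyphens) are ignored.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0
```

The widely used test number `4111111111111111` passes this check, while changing its last digit to `2` makes it fail.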

Supported Default Credit Card Types: AMEX, MASTERCARD, VISA, DINERS, DISCOVER, JCBCARD, MAESTRO, CHINA_UNIONPAY, INSTAPAYMENT, MIR, RUPAY, TROY, VERVE, HIPERCARD, AURA, CARNET, BCGLOBAL

Note

You can exclude the default credit card types from validation by setting the DEFAULT_VALIDATORS_TO_EXCLUDE parameter.

Supported Default Credit Card Types

Card Type	Description
AMEX	Starts with 34 or 37, 15 digits
MASTERCARD	Starts with 2221–2720 or 51–55, 16 digits
VISA	Starts with 4, 16-19 digits
DINERS	Starts with 300–305, 3095, 36, 38, 39; 14 digits
DISCOVER	Starts with 6011, 622–628, 644–649, or 65; 16–17 digits
JCBCARD	Starts with 2131 or 1800 (15 digits), or 35 (16 digits)
MAESTRO	Starts with 5018, 5020, 5038, 6304, 6759, 6761, 6763; 12–19 digits
CHINA_UNIONPAY	Starts with 62, 16–19 digits
INSTAPAYMENT	Starts with 637–639, 16 digits
MIR	Starts with 2200–2204, 16–19 digits
RUPAY	Starts with 60, 65, 81, 82; 16 digits
TROY	Starts with 9792, 16 digits
VERVE	BIN ranges: 506099–506199, 507865–507896, 650002–650027; 16–19 digits
HIPERCARD	BINs like 384100,384140,384160,637568,637599,637609,637612; 16–19 digits
AURA	Starts with 507860; 16–19 digits
CARNET	BINs like 286900, 506203,506222,506237,506262,506276,506281,506301 etc.; 16–19 digits
BCGLOBAL	Specific 6541/6556/700013 ranges; 16 digits

To support additional card types, use:

Regex property names

ADDITIONAL_REGEX_<CARD_TYPE>: Custom regex for matching a specific credit card type.

Examples of Additional Regexes:

YAML
ADDITIONAL_REGEX_JCB: ^((?:2131|1800|35\d{3})\d{11})$
ADDITIONAL_REGEX_MAESTRO: ^((?:5018|5020|5038|6304|6759|6761|6763)\d{8,15})$
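
Before adding a custom pattern, it can help to try it against sample numbers. The sketch below does this with Python's `re` module, using the two example patterns above. It assumes the patterns behave the same way in Python as in the product's regex engine, which holds for the simple constructs used here but is not confirmed. The helper name `matches` is hypothetical.

```python
import re

# Patterns copied from the examples above.
ADDITIONAL_REGEX_JCB = r"^((?:2131|1800|35\d{3})\d{11})$"
ADDITIONAL_REGEX_MAESTRO = r"^((?:5018|5020|5038|6304|6759|6761|6763)\d{8,15})$"

def matches(pattern: str, candidate: str) -> bool:
    """Return True if the whole candidate string matches the pattern."""
    return re.match(pattern, candidate) is not None

# JCB: a 2131/1800 prefix gives 15 digits, a 35 prefix gives 16 digits.
print(matches(ADDITIONAL_REGEX_JCB, "3530111333300000"))
# Maestro: a 4-digit prefix plus 8 to 15 more digits gives 12 to 19 digits.
print(matches(ADDITIONAL_REGEX_MAESTRO, "501812345678"))
```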

3. Date of Birth (DOB) Model

Parameter	Type	Default	Description
MIN_AGE_YEARS	Integer	5	Age lower threshold.
MAX_AGE_YEARS	Integer	100	Age upper threshold.
DATE_REGEX_var1	String	—	Custom regex for matching date format.
DATE_FORMAT_var1	String	—	Date format corresponding to the regex.
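
To illustrate how the age thresholds could be applied, the following Python sketch parses a date string and checks that the resulting age falls between `MIN_AGE_YEARS` and `MAX_AGE_YEARS`. The function name `plausible_dob`, the inclusive bounds, and the use of Python `strptime` format codes in place of the product's date format syntax are all assumptions of this example.

```python
from datetime import date, datetime

def plausible_dob(text, fmt="%Y-%m-%d", min_age=5, max_age=100, today=None):
    """Return True if `text` parses as a date whose age lies in [min_age, max_age]."""
    today = today or date.today()
    try:
        dob = datetime.strptime(text, fmt).date()
    except ValueError:
        return False  # not a date in the expected format
    # Whole years elapsed; subtract one if the birthday has not occurred yet this year.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return min_age <= age <= max_age
```

Passing `today` explicitly keeps the check reproducible, for example `plausible_dob("1990-05-20", today=date(2024, 6, 1))`.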

4. EIN Model

Parameter	Type	Default	Description
EIN_PATTERN	String	—	Override default EIN pattern.
VALIDATIONS	Boolean	TRUE	Enable format validations.
STRICT_PATTERN	Boolean	TRUE	Match only if EIN has exact format.

5. Geo Latitude/Longitude Model

Parameter	Type	Default	Description
MIN_LAT	Double	—	Lower bound on latitude.
MAX_LAT	Double	—	Upper bound on latitude.
MIN_LONG	Double	—	Lower bound on longitude.
MAX_LONG	Double	—	Upper bound on longitude.
MIN_FRACTIONAL_DIGITS	Integer	3	Minimum decimal precision.
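
The bounds and precision parameters can be pictured with a short Python sketch. The table above does not list default bounds, so the example assumes the standard geographic ranges (latitude -90 to 90, longitude -180 to 180). The function names are hypothetical, and counting fractional digits from the original text is an assumption about how `MIN_FRACTIONAL_DIGITS` is evaluated.

```python
def _in_bounds(text: str, low: float, high: float, min_fractional_digits: int) -> bool:
    """Check one coordinate string for range and decimal precision."""
    text = text.strip()
    try:
        value = float(text)
    except ValueError:
        return False
    # Count the digits after the decimal point in the original text.
    _, _, fraction = text.partition(".")
    if not fraction.isdigit() or len(fraction) < min_fractional_digits:
        return False
    return low <= value <= high

def plausible_lat_long(lat, lon, min_lat=-90.0, max_lat=90.0,
                       min_long=-180.0, max_long=180.0, min_fractional_digits=3):
    """Return True if both strings look like coordinates within the bounds."""
    return (_in_bounds(lat, min_lat, max_lat, min_fractional_digits)
            and _in_bounds(lon, min_long, max_long, min_fractional_digits))
```

Requiring a minimum number of fractional digits filters out short numbers such as `37.7` that are unlikely to be coordinates.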

6. ITIN Model

Parameter	Type	Default	Description
ITIN_PATTERN	String	—	Override default ITIN pattern.
STRICT_PATTERN	Boolean	TRUE	Match only if ITIN has exact format.

7. MIME Model

Parameter	Type	Default	Description
LOOKUP_DICT	String	—	Dictionary name of MIME types.

Examples: EXEC_MIME_KEYWORD, IMAGE_MIME_KEYWORD

8. Phone Number Model

Parameter	Type	Default	Description
COUNTRY_CODE	String	US	2-letter ISO country code.

9. SSN Model

This model identifies possible SSNs (Social Security Numbers) in data by applying pattern matching, format checks, and optionally statistical/randomness-based validations.

There are two main phases:

A. Pattern Matching Phase

Uses regular expressions (regex) to identify common SSN formats (e.g., 123-45-6789, 123456789, or 6789).
Configurable using pattern-related flags like STRICT_PATTERN, USE_9_DIGIT_PATTERN, etc.

B. Statistical Validation Phase

Applies advanced validations like entropy, digit distribution (variance), and sequentiality.
These checks assess how "random" or "natural" the SSN-like string appears.
Controlled by flags like USE_SEQUENCE_CHECK, USE_VARIANCE_CHECK, etc.
All of these checks are collectively controlled by USE_ADDITIONAL_VALIDATIONS.

Configuration Groups

A. Pattern Configuration

Customize what qualifies as an SSN match:

Parameter	Type	Default	Description
SSN_PATTERN	String	—	Override default SSN pattern.
STRICT_PATTERN	Boolean	FALSE	Match only if SSN has exact format.
USE_9_DIGIT_PATTERN	Boolean	FALSE	Allow matching of any 9-digit string.
USE_4_DIGIT_PATTERN	Boolean	FALSE	Allow matching of 4-digit string.
STRICT_EXT_PATTERN	Boolean	TRUE	Requires hyphen/dot/space-separated format.

B. Basic Validation Settings

Check for basic data quality and known invalid values.
Examples: "111111111", "987654321", "000000000", "1234567890", "12345678901"

Parameter	Type	Default	Description
VALIDATIONS	Boolean	TRUE	Enable blacklist checks.
UNIQUE_DIGIT_THRESHOLD	Integer	3	Minimum number of unique digits required.
MAX_REPETITION_ALLOWED	Integer	5	Maximum allowed repeated digits.

C. Entropy Checks

Entropy measures how random or unpredictable a digit string is.

Parameter	Type	Default	Description
USE_ENTROPY_CHECK	Boolean	TRUE	Enable entropy-based checks to validate randomness of the match.
MAX_ENTROPY_SCORE_DEDUCTION	Integer	40	Maximum penalty based on low entropy in matched SSN patterns.
MIN_ENTROPY	Double	1.0	Minimum acceptable entropy value for matched results.
MAX_ENTROPY	Double	3.0	Maximum acceptable entropy value for matched results.
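
For intuition about the entropy thresholds, the sketch below computes the Shannon entropy (in bits) of the digits in a single string and compares it with the default `MIN_ENTROPY` and `MAX_ENTROPY` values. Whether the product computes entropy per value or across a sample of values is not stated here, so the per-string calculation and the function names are assumptions.

```python
import math
from collections import Counter

def shannon_entropy(digits: str) -> float:
    """Shannon entropy, in bits, of the character distribution in `digits`."""
    n = len(digits)
    counts = Counter(digits)
    # Sum p * log2(1/p) over each distinct character.
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def entropy_in_range(digits: str, min_entropy=1.0, max_entropy=3.0) -> bool:
    """Return True if the entropy lies within the configured bounds."""
    return min_entropy <= shannon_entropy(digits) <= max_entropy
```

Under this calculation a fully repeated string such as `111111111` has zero entropy, and a string of nine distinct digits such as `123456789` has entropy log2(9), slightly above the default maximum of 3.0, so both fall outside the default range.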

D. Variance Checks

Variance measures how digits are distributed, which helps filter out non-random digit repetition.

Parameter	Type	Default	Description
USE_VARIANCE_CHECK	Boolean	TRUE	Enable variance-based checks on digit distribution.
MAX_VARIANCE_SCORE_DEDUCTION	Integer	40	Maximum penalty based on variance of digit distribution.
MIN_VARIANCE	Double	1e12	Minimum acceptable variance for randomness checks.
MAX_VARIANCE	Double	3e15	Maximum acceptable variance for randomness checks.

E. Sequence Checks

Looks for sequential digit patterns, such as 123456789 or 987654321.

Parameter	Type	Default	Description
USE_SEQUENCE_CHECK	Boolean	TRUE	Enable sequence pattern checks.
MAX_SEQUENCE_SCORE_DEDUCTION	Integer	40	Maximum penalty applied if sequential patterns are detected.
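
One simple way to detect sequential patterns is to measure the longest run of digits that rise or fall by one at each step. The following Python sketch does that. It is only an illustration: the function name `longest_sequential_run` is hypothetical, and how the product maps such a run to a score deduction is not described here.

```python
def longest_sequential_run(digits: str) -> int:
    """Length of the longest run in which each digit is exactly one more
    (or exactly one less) than the digit before it."""
    if not digits:
        return 0
    best = run = 1
    step = 0  # direction of the current run: +1, -1, or 0 for none
    for prev, cur in zip(digits, digits[1:]):
        diff = int(cur) - int(prev)
        if diff in (1, -1) and diff == step:
            run += 1                 # run continues in the same direction
        elif diff in (1, -1):
            run, step = 2, diff      # a new run starts with this pair
        else:
            run, step = 1, 0         # run broken
        best = max(best, run)
    return best
```

A fully sequential string such as `123456789` yields a run of 9, whereas `135792468` yields 1.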

F. Sampling Control

This setting determines when randomness-based logic applies.

Parameter	Type	Default	Description
MINIMUM_DATA_SIZE	Integer	20	Minimum number of samples required for randomness validation to apply.

Examples of Invalid SSNs:
- Starts with 9, 666, 000, or 98765432
- Middle digits = 00
- Last digits = 0000
- Dummy values: 123456789, 111111111, etc.
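
The invalid-SSN rules listed above can be expressed as a short Python check. This is a sketch only: the function name `is_invalid_ssn` is hypothetical, the dummy set contains just the two values named above, and treating any input that does not contain exactly nine digits as invalid is an assumption of the example.

```python
import re

# Only the dummy values named in the list above; the real list is likely longer.
DUMMY_SSNS = {"123456789", "111111111"}

def is_invalid_ssn(value: str) -> bool:
    """Return True if `value` violates one of the listed SSN rules."""
    digits = re.sub(r"\D", "", value)
    if len(digits) != 9:
        return True  # assumption: only full 9-digit SSNs are evaluated
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    return (
        digits[0] == "9"                   # starts with 9
        or area in ("666", "000")          # starts with 666 or 000
        or digits.startswith("98765432")   # already covered by the leading 9
        or group == "00"                   # middle digits = 00
        or serial == "0000"                # last digits = 0000
        or digits in DUMMY_SSNS            # known dummy values
    )
```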

10. VIN Model

Detects Vehicle Identification Numbers (VINs).
Validates using length and VIN-specific checksum.
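
For reference, the commonly used North American VIN check-digit scheme can be sketched in Python as follows. Each character is mapped to a number, multiplied by a position weight, and the sum modulo 11 must equal the character in position 9 (with 10 written as `X`). That the model uses exactly this scheme is an assumption; the function name `vin_check_digit_valid` is hypothetical.

```python
# Letter-to-number transliteration; I, O, and Q are not allowed in a VIN.
TRANSLITERATION = {str(d): d for d in range(10)}
TRANSLITERATION.update({
    "A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "H": 8,
    "J": 1, "K": 2, "L": 3, "M": 4, "N": 5, "P": 7, "R": 9,
    "S": 2, "T": 3, "U": 4, "V": 5, "W": 6, "X": 7, "Y": 8, "Z": 9,
})
# Position weights; position 9 (the check digit itself) has weight 0.
WEIGHTS = (8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2)

def vin_check_digit_valid(vin: str) -> bool:
    """Return True if `vin` has 17 valid characters and a correct check digit."""
    vin = vin.upper()
    if len(vin) != 17 or any(ch not in TRANSLITERATION for ch in vin):
        return False
    total = sum(TRANSLITERATION[ch] * w for ch, w in zip(vin, WEIGHTS))
    remainder = total % 11
    expected = "X" if remainder == 10 else str(remainder)
    return vin[8] == expected
```

The frequently cited sample VIN `1M8GDM9AXKP042788` passes this check because its weighted sum leaves a remainder of 10, which corresponds to `X` in position 9.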

11. ZIP Code Model

Parameter	Type	Default	Description
ZIP_DICT_KEY	String	US_ZIP_LOOKUP	Dictionary name of ZIP codes.
ZIP_PATTERN	String	—	Pattern for matching ZIP codes.
STRICT_PATTERN	Boolean	FALSE	If TRUE, enforces strict 5+4 ZIP format.

12. Driving License Model

Detects potential U.S. driver's license numbers across the 50 states.
Utilizes state-specific regex patterns for structural validation.
Applies robust heuristics to minimize false positives:
- Length must be between 4 and 17 characters.
- Rejects low-entropy or repetitive sequences.
- Enforces character diversity to ensure validity.

13. Brazil CPF Model

The Brazil CPF (Cadastro de Pessoas Físicas) model identifies Brazilian individual taxpayer identification numbers using pattern matching and validation algorithms.

Parameter	Type	Default	Description
DEFAULT_TYPES	Boolean	TRUE	Validates against the standard CPF format and checksum algorithm.
ADDITIONAL_REGEX_<TYPE>	String	—	Matches specific CPF formats using a custom regex.

Tip

This model is disabled by default. To detect CPF numbers in your data, you must enable it.

The model identifies CPF numbers based on the following criteria:

Format: An 11-digit number in XXX.XXX.XXX-XX, XXXXXXXXXXX, or XXX XXX XXX XX format.
Validation: Uses the CPF checksum algorithm (mod 11).
Exclusions: Automatically excludes known invalid patterns (e.g., 000.000.000-00, 111.111.111-11).
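
The standard CPF mod 11 checksum can be sketched in Python as follows. The first check digit is computed from the first nine digits with weights 10 down to 2, and the second from the first ten digits with weights 11 down to 2. The function name `cpf_valid` is hypothetical, and rejecting every all-same-digit number is an assumption about what the model's exclusions cover.

```python
import re

def cpf_valid(value: str) -> bool:
    """Return True if `value` contains an 11-digit CPF with valid check digits."""
    digits = [int(ch) for ch in re.sub(r"\D", "", value)]
    # Reject wrong lengths and repeated-digit patterns such as 111.111.111-11.
    if len(digits) != 11 or len(set(digits)) == 1:
        return False
    for n in (9, 10):
        # Weights run from n + 1 down to 2 across the first n digits.
        total = sum(d * w for d, w in zip(digits[:n], range(n + 1, 1, -1)))
        check = (total * 10) % 11 % 10  # a result of 10 maps to 0
        if check != digits[n]:
            return False
    return True
```

For example, `cpf_valid("529.982.247-25")` returns `True`, while changing the final digit makes the check fail.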

Custom Formats

You can define custom CPF formats using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your custom format.

Examples

YAML
ADDITIONAL_REGEX_FORMATTED: ^((?:\d{3}\.\d{3}\.\d{3}-\d{2}))$
ADDITIONAL_REGEX_NUMERIC: ^((?:\d{11}))$

14. Brazil CNPJ Model

The Brazil CNPJ (Cadastro Nacional da Pessoa Jurídica) model identifies Brazilian company identification numbers using pattern matching and validation algorithms.

Parameter	Type	Default	Description
DEFAULT_TYPES	Boolean	TRUE	Validates against the standard CNPJ format and checksum algorithm.
ADDITIONAL_REGEX	String	—	Matches specific CNPJ formats using a custom regex.

Tip

This model is disabled by default. To detect CNPJ numbers in your data, you must enable it.

The model identifies CNPJ numbers based on the following criteria:

Format: A 14-digit number in XX.XXX.XXX/XXXX-XX, XXXXXXXXXXXXXX, or XX XXX XXX XXXX XX format.
Validation: Uses the CNPJ checksum algorithm (mod 11).
Exclusions: Automatically excludes known invalid patterns (e.g., 00.000.000/0000-00, 11.111.111/1111-11).
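
The standard CNPJ mod 11 checksum for the 14-digit numeric format can be sketched in Python as follows. Each check digit uses weights that cycle from 2 to 9, starting at the rightmost digit of the prefix; a remainder below 2 gives a check digit of 0, otherwise the check digit is 11 minus the remainder. The function name `cnpj_valid` is hypothetical, and rejecting every all-same-digit number is an assumption about what the model's exclusions cover.

```python
import re

def cnpj_valid(value: str) -> bool:
    """Return True if `value` contains a 14-digit CNPJ with valid check digits."""
    digits = [int(ch) for ch in re.sub(r"\D", "", value)]
    # Reject wrong lengths and repeated-digit patterns such as 11.111.111/1111-11.
    if len(digits) != 14 or len(set(digits)) == 1:
        return False
    for n in (12, 13):
        # Weights cycle 2..9 from right to left, e.g. 5,4,3,2,9,8,7,6,5,4,3,2 for n = 12.
        weights = [(i % 8) + 2 for i in range(n)][::-1]
        remainder = sum(d * w for d, w in zip(digits[:n], weights)) % 11
        check = 0 if remainder < 2 else 11 - remainder
        if check != digits[n]:
            return False
    return True
```

For example, `cnpj_valid("11.222.333/0001-81")` returns `True`, while changing the final digit makes the check fail.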