Heuristic Models in Privacera Discovery
Privacera supports a broad range of heuristic models designed to identify and classify sensitive data using logic, pattern recognition, and validation mechanisms. These models help automate sensitive data tagging and improve classification accuracy.
1. Generic Models
These are baseline models with configurable parameters to guide pattern matching and logic-based detection.
Parameter | Type | Default | Description |
---|---|---|---|
INCLUDE_PATTERN_<#> | String | None | Patterns to be matched. |
EXCLUDE_PATTERN_<#> | String | None | Patterns to be excluded from matching. |
ONLY_DIGITS | Boolean | FALSE | Removes all non-numeric characters in the string before matching. |
CHECK_DIGIT_CODE_VALIDATE | String | None | Indicates whether to evaluate a checksum digit based on the last digit. |
DO_LOOKUP | Boolean | FALSE | Enables pattern lookup using LOOKUP_PATTERN. |
LOOKUP_DICT | String | None | A dictionary name or key. |
LOOKUP_PATTERN | String | None | Pattern used for matching. |
ISO3166_CC_VALIDATE_FLAG | Boolean | FALSE | Enables ISO country code validation using ISO3166_CC_PATTERN. |
ISO3166_CC_PATTERN | String | None | Pattern for matching ISO country codes. |
ISO3166_CC_LOOKUP_KEY | String | None | Dictionary name for country code lookup. |
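As a rough illustration of how the include, exclude, and ONLY_DIGITS parameters could interact, the sketch below normalizes a value and then applies the patterns. The function name, the full-match semantics, and the rule that an exclude match always wins are assumptions made for this example; the product's actual evaluation order is not documented here.

```python
import re

def generic_match(value, include_patterns, exclude_patterns=(), only_digits=False):
    """Hypothetical helper: True if value matches an include pattern and no exclude pattern."""
    # ONLY_DIGITS: remove every non-numeric character before matching.
    candidate = re.sub(r"[^0-9]", "", value) if only_digits else value
    # Assumption: an exclude pattern that matches rejects the value outright.
    if any(re.fullmatch(p, candidate) for p in exclude_patterns):
        return False
    return any(re.fullmatch(p, candidate) for p in include_patterns)
```

With `only_digits=True`, a value such as `123-45-6789` is reduced to `123456789` before the include pattern is tried.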
2. Credit Card Model
Parameter | Type | Default | Description |
---|---|---|---|
CC_PATTERN | String | — | Override pattern for credit card numbers. |
DEFAULT_TYPES | Boolean | TRUE | Validate against known issuing network prefixes. |
LUHN_CHECK | Boolean | TRUE | Validate the Luhn checksum on the credit card number. |
DEFAULT_VALIDATORS_TO_EXCLUDE | String | — | Comma-separated list of default validators to exclude from validation. Example: RUPAY,MIR,TROY etc. |
Supported Default Credit Card Types: AMEX, MASTERCARD, VISA, DINERS, DISCOVER, JCBCARD, MAESTRO, CHINA_UNIONPAY, INSTAPAYMENT, MIR, RUPAY, TROY, VERVE, HIPERCARD, AURA, CARNET, BCGLOBAL
Note
You can exclude specific default credit card types from validation by setting the DEFAULT_VALIDATORS_TO_EXCLUDE parameter.
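For readers unfamiliar with the checksum that LUHN_CHECK refers to, here is a minimal sketch of the standard Luhn algorithm. The function name is chosen for this example, and ignoring separators such as spaces or hyphens is an assumption; it is not taken from Privacera's implementation.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Keep ASCII digits only, so "4111 1111 1111 1111" is accepted as input.
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

A number passes when the weighted digit sum is a multiple of 10, which catches most single-digit typos and adjacent transpositions.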
Supported Default Credit Card Types
Card Type | Description |
---|---|
AMEX | Starts with 34 or 37, 15 digits |
MASTERCARD | Starts with 2221–2720 or 51–55, 16 digits |
VISA | Starts with 4, 16-19 digits |
DINERS | Starts with 300–305, 3095, 36, 38, 39; 14 digits |
DISCOVER | Starts with 6011, 622–628, 644–649, or 65; 16–17 digits |
JCBCARD | Starts with 2131, 1800 (15 digits) or 35 (15 digits) |
MAESTRO | Starts with 5018, 5020, 5038, 6304, 6759, 6761, 6763; 12–19 digits |
CHINA_UNIONPAY | Starts with 62, 16–19 digits |
INSTAPAYMENT | Starts with 637–639, 16 digits |
MIR | Starts with 2200–2204, 16–19 digits |
RUPAY | Starts with 60, 65, 81, 82; 16 digits |
TROY | Starts with 9792, 16 digits |
VERVE | BIN ranges: 506099–506199, 507865–507896, 650002–650027; 16–19 digits |
HIPERCARD | BINs like 384100,384140,384160,637568,637599,637609,637612; 16–19 digits |
AURA | Starts with 507860; 16–19 digits |
CARNET | BINs like 286900, 506203,506222,506237,506262,506276,506281,506301 etc.; 16–19 digits |
BCGLOBAL | Specific 6541/6556/700013 ranges; 16 digits |
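The prefix and length rules in the table can be expressed as simple lookups. The sketch below encodes only three rows (AMEX, VISA, TROY) and adds an `exclude` argument that mimics the idea behind DEFAULT_VALIDATORS_TO_EXCLUDE. The rule structure and function name are assumptions for illustration, not the product's internal representation.

```python
# (card type, accepted prefixes, accepted lengths), copied from three table rows.
RULES = [
    ("AMEX", ("34", "37"), {15}),
    ("VISA", ("4",), {16, 17, 18, 19}),
    ("TROY", ("9792",), {16}),
]

def card_type(number: str, exclude=()):
    """Return the first card type whose prefix and length match, else None."""
    for name, prefixes, lengths in RULES:
        if name in exclude:
            continue  # validator switched off for this card type
        if len(number) in lengths and number.startswith(prefixes):
            return name
    return None
```

For example, `card_type("9792000000000000")` yields `"TROY"`, while passing `exclude={"TROY"}` makes the same number unclassified.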
To support additional card types, use the following regex property:
- ADDITIONAL_REGEX_<CARD_TYPE>: Custom regex for matching a specific credit card type.
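Before configuring an ADDITIONAL_REGEX_<CARD_TYPE> property, it can help to try the pattern against sample values. The card type name EXAMPLECARD and its pattern below are invented for this sketch and are not Privacera defaults; the dictionary simply stands in for the configured properties.

```python
import re

# Hypothetical property name and pattern, invented for illustration only.
additional_regexes = {
    "ADDITIONAL_REGEX_EXAMPLECARD": r"^9999[0-9]{12}$",
}

def matching_properties(number, regexes=additional_regexes):
    """Return the names of the properties whose regex matches the number."""
    return [name for name, pattern in regexes.items() if re.search(pattern, number)]
```

Anchoring the pattern with `^` and `$` keeps it from matching a card-like substring inside a longer digit string.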
3. Date of Birth (DOB) Model
Parameter | Type | Default | Description |
---|---|---|---|
MIN_AGE_YEARS | Integer | 5 | Lower age threshold, in years, for a matched date. |
MAX_AGE_YEARS | Integer | 100 | Upper age threshold, in years, for a matched date. |
DATE_REGEX_var1 | String | — | Custom regex for matching date format. |
DATE_FORMAT_var1 | String | — | Date format corresponding to the regex. |
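The sketch below shows one way a date regex, a date format, and the age thresholds could combine. It uses Python `strptime` format codes, which may differ from the format syntax the product expects, and it assumes both thresholds are inclusive. The function name and the explicit `today` argument are choices made for this example.

```python
import re
from datetime import date, datetime

def is_plausible_dob(text, date_regex, date_format, today,
                     min_age_years=5, max_age_years=100):
    """Hypothetical check: text matches the regex, parses, and implies an age in range."""
    if not re.fullmatch(date_regex, text):
        return False
    try:
        born = datetime.strptime(text, date_format).date()
    except ValueError:
        return False  # matched the regex but is not a real calendar date
    # Whole years of age: subtract one if this year's birthday has not happened yet.
    age = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    return min_age_years <= age <= max_age_years
```

Passing `today` explicitly keeps the check deterministic, which is convenient when testing threshold behavior.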
4. EIN Model
Parameter | Type | Default | Description |
---|---|---|---|
EIN_PATTERN | String | — | Override default EIN pattern. |
VALIDATIONS | Boolean | TRUE | Enable format validations. |
STRICT_PATTERN | Boolean | TRUE | Match only if EIN has exact format. |
5. Geo Latitude/Longitude Model
Parameter | Type | Default | Description |
---|---|---|---|
MIN_LAT | Double | — | Lower bound on latitude. |
MAX_LAT | Double | — | Upper bound on latitude. |
MIN_LONG | Double | — | Lower bound on longitude. |
MAX_LONG | Double | — | Upper bound on longitude. |
MIN_FRACTIONAL_DIGITS | Integer | 3 | Minimum decimal precision. |
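A minimal sketch of how bounds and MIN_FRACTIONAL_DIGITS might be applied to a latitude/longitude pair follows. The table gives no default bounds, so the -90 to 90 and -180 to 180 values used here are the ordinary geographic limits, assumed for illustration. The function names and the decimal-only input format are also assumptions.

```python
import re

# Optional sign, one to three integer digits, and a captured fractional part.
COORD = re.compile(r"[+-]?[0-9]{1,3}\.([0-9]+)")

def in_range_with_precision(text, low, high, min_fractional_digits=3):
    """True if text is a decimal number with enough fractional digits inside [low, high]."""
    match = COORD.fullmatch(text)
    if match is None or len(match.group(1)) < min_fractional_digits:
        return False
    return low <= float(text) <= high

def looks_like_lat_long(lat, lon, min_lat=-90.0, max_lat=90.0,
                        min_long=-180.0, max_long=180.0, min_fractional_digits=3):
    """Hypothetical pair check using assumed geographic bounds as defaults."""
    return (in_range_with_precision(lat, min_lat, max_lat, min_fractional_digits)
            and in_range_with_precision(lon, min_long, max_long, min_fractional_digits))
```

Requiring a minimum number of fractional digits helps separate coordinates from ordinary small decimals such as prices or percentages.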
6. ITIN Model
Parameter | Type | Default | Description |
---|---|---|---|
ITIN_PATTERN | String | — | Override default ITIN pattern. |
STRICT_PATTERN | Boolean | TRUE | Match only if ITIN has exact format. |
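For context, an ITIN is a nine-digit number that begins with 9 and whose fourth and fifth digits fall in the ranges 50-65, 70-88, 90-92, or 94-99. The sketch below checks that structure. Treating STRICT_PATTERN as "hyphenated 3-2-4 layout only" is an assumption for this example, as are the function names and the loose pattern; the product's default ITIN pattern is not shown in this document.

```python
import re

ITIN_STRICT = re.compile(r"9[0-9]{2}-[0-9]{2}-[0-9]{4}")
# Assumed loose form: separators are optional and may be hyphens or spaces.
ITIN_LOOSE = re.compile(r"9[0-9]{2}[- ]?[0-9]{2}[- ]?[0-9]{4}")

def _group_in_itin_range(group: int) -> bool:
    return 50 <= group <= 65 or 70 <= group <= 88 or 90 <= group <= 92 or 94 <= group <= 99

def looks_like_itin(text, strict_pattern=True):
    """Hypothetical check of ITIN layout plus the fourth/fifth digit ranges."""
    pattern = ITIN_STRICT if strict_pattern else ITIN_LOOSE
    if not pattern.fullmatch(text):
        return False
    digits = re.sub(r"[^0-9]", "", text)
    return _group_in_itin_range(int(digits[3:5]))
```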
7. MIME Model
Parameter | Type | Default | Description |
---|---|---|---|
LOOKUP_DICT | String | — | Dictionary name of MIME types. Examples: EXEC_MIME_KEYWORD, IMAGE_MIME_KEYWORD |
8. Phone Number Model
Parameter | Type | Default | Description |
---|---|---|---|
COUNTRY_CODE | String | US | 2-letter ISO country code. |
9. SSN Model
This model identifies possible SSNs (Social Security Numbers) in data by applying pattern matching, format checks, and optionally statistical/randomness-based validations.
There are two main phases:
A. Pattern Matching Phase
- Uses regular expressions (regex) to identify common SSN formats (e.g., 123-45-6789, 123456789, or 6789).
- Configurable using pattern-related flags like STRICT_PATTERN, USE_9_DIGIT_PATTERN, etc.
B. Statistical Validation Phase
- Applies advanced validations like entropy, digit distribution (variance), and sequentiality.
- These checks assess how "random" or "natural" the SSN-like string appears.
- Controlled by flags like USE_SEQUENCE_CHECK, USE_VARIANCE_CHECK, etc.
- All these are collectively controlled by USE_ADDITIONAL_VALIDATIONS.
Configuration Groups
A. Pattern Configuration
- Customize what qualifies as an SSN match:
Parameter | Type | Default | Description |
---|---|---|---|
SSN_PATTERN | String | — | Override default SSN pattern. |
STRICT_PATTERN | Boolean | FALSE | Match only if SSN has exact format. |
USE_9_DIGIT_PATTERN | Boolean | FALSE | Allow matching of any 9-digit string. |
USE_4_DIGIT_PATTERN | Boolean | FALSE | Allow matching of 4-digit string. |
STRICT_EXT_PATTERN | Boolean | TRUE | Requires hyphen/dot/space-separated format. |
B. Basic Validation Settings
- Check for basic data quality and known problem values.
- Examples of values these checks are meant to catch: "111111111", "987654321", "000000000", "1234567890", "12345678901"
Parameter | Type | Default | Description |
---|---|---|---|
VALIDATIONS | Boolean | TRUE | Enable blacklist checks. |
UNIQUE_DIGIT_THRESHOLD | Integer | 3 | Minimum number of unique digits required. |
MAX_REPETITION_ALLOWED | Integer | 5 | Maximum allowed repeated digits. |
C. Entropy Checks
- Entropy measures how random/unpredictable a number string is.
Parameter | Type | Default | Description |
---|---|---|---|
USE_ENTROPY_CHECK | Boolean | TRUE | Enable entropy-based checks to validate randomness of the match. |
MAX_ENTROPY_SCORE_DEDUCTION | Integer | 40 | Maximum penalty based on low entropy in matched SSN patterns. |
MIN_ENTROPY | Double | 1.0 | Minimum acceptable entropy value for matched results. |
MAX_ENTROPY | Double | 3.0 | Maximum acceptable entropy value for matched results. |
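To make the entropy idea concrete, the sketch below computes Shannon entropy in bits over the digits of a single value and compares it with the MIN_ENTROPY and MAX_ENTROPY defaults from the table. Whether the product measures entropy per value or across a sample, and how the result feeds into the score deduction, is not stated here. The function names and the per-value calculation are assumptions.

```python
import math
from collections import Counter

def shannon_entropy(digits: str) -> float:
    """Shannon entropy, in bits, of the character distribution in `digits`."""
    counts = Counter(digits)
    total = len(digits)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def entropy_in_range(value: str, min_entropy=1.0, max_entropy=3.0) -> bool:
    """Hypothetical range check using the table's default thresholds."""
    digits = "".join(ch for ch in value if ch in "0123456789")
    return min_entropy <= shannon_entropy(digits) <= max_entropy
```

A string of one repeated digit has zero entropy, so it falls below the minimum, while a string in which every digit is distinct sits at the high end of the scale.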
D. Variance Checks
- Variance looks at how digits are distributed — helps filter non-random digit repetition.
Parameter | Type | Default | Description |
---|---|---|---|
USE_VARIANCE_CHECK | Boolean | TRUE | Enable variance-based checks on digit distribution. |
MAX_VARIANCE_SCORE_DEDUCTION | Integer | 40 | Maximum penalty based on variance of digit distribution. |
MIN_VARIANCE | Double | 1e12 | Minimum acceptable variance for randomness checks. |
MAX_VARIANCE | Double | 3e15 | Maximum acceptable variance for randomness checks. |
E. Sequence Checks
- Looks for sequential digit patterns, such as 123456789 or 987654321.
Parameter | Type | Default | Description |
---|---|---|---|
USE_SEQUENCE_CHECK | Boolean | TRUE | Enable sequence pattern checks. |
MAX_SEQUENCE_SCORE_DEDUCTION | Integer | 40 | Maximum penalty applied if sequential patterns are detected. |
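One simple way to detect sequential patterns is to measure the longest run of digits that rise or fall by exactly one. This is a sketch of that idea only. The function name is chosen for this example, and how the product turns such a measurement into a score deduction is not described in this document.

```python
def longest_sequential_run(digits: str) -> int:
    """Length of the longest run in which each digit is the previous one +1 (or each is -1)."""
    if not digits:
        return 0
    best = run = 1
    step = 0  # direction of the current run: +1, -1, or 0 when no run is active
    for prev, cur in zip(digits, digits[1:]):
        diff = int(cur) - int(prev)
        if diff in (1, -1) and diff == step:
            run += 1                  # run continues in the same direction
        elif diff in (1, -1):
            run, step = 2, diff       # a new run starts with this pair
        else:
            run, step = 1, 0          # run broken
        best = max(best, run)
    return best
```

A value such as `123456789` yields a run of 9, whereas an alternating string such as `121212` never exceeds 2 because the direction keeps flipping.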
F. Sampling Control
- These determine when randomness-based logic should apply.
Parameter | Type | Default | Description |
---|---|---|---|
MINIMUM_DATA_SIZE | Integer | 20 | Minimum number of samples required for randomness validation to apply. |
Examples of Invalid SSNs:
- Starts with 9, 666, 000, or 98765432
- Middle digits = 00
- Last digits = 0000
- Dummy values: 123456789, 111111111, etc.
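The invalid-SSN examples above translate directly into a few string checks. The sketch below encodes only the rules listed in this section; the function name, the nine-digit length requirement, and the short dummy-value set are assumptions, and the product's full blacklist may be larger.

```python
# Only the two dummy values named in this section; the real list may be longer.
DUMMY_VALUES = {"123456789", "111111111"}
INVALID_PREFIXES = ("9", "666", "000", "98765432")

def passes_basic_ssn_rules(ssn: str) -> bool:
    """Hypothetical check of the structural rules listed for invalid SSNs."""
    digits = "".join(ch for ch in ssn if ch in "0123456789")
    if len(digits) != 9 or digits in DUMMY_VALUES:
        return False
    if digits.startswith(INVALID_PREFIXES):
        return False
    middle, last = digits[3:5], digits[5:]
    return middle != "00" and last != "0000"
```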
10. VIN Model
- Detects Vehicle Identification Numbers (VINs).
- Validates using length and VIN-specific checksum.
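As background on the VIN-specific checksum, the sketch below implements the widely used check-digit scheme for 17-character VINs: each character is mapped to a number, multiplied by a position weight, and the sum modulo 11 must equal the ninth character (with 10 written as X). This is the standard North American calculation, shown under the assumption that the model uses the same one; the function name is chosen for this example.

```python
# Character values: digits map to themselves; I, O and Q are not valid VIN characters.
VALUES = {
    **{str(d): d for d in range(10)},
    **dict(zip("ABCDEFGH", range(1, 9))),
    **dict(zip("JKLMN", range(1, 6))),
    "P": 7, "R": 9,
    **dict(zip("STUVWXYZ", range(2, 10))),
}
# Position weights; position 9 (the check digit itself) has weight 0.
WEIGHTS = (8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2)

def vin_valid(vin: str) -> bool:
    """Return True if the VIN has 17 valid characters and a matching check digit."""
    vin = vin.upper()
    if len(vin) != 17 or any(ch not in VALUES for ch in vin):
        return False
    remainder = sum(VALUES[ch] * w for ch, w in zip(vin, WEIGHTS)) % 11
    expected = "X" if remainder == 10 else str(remainder)
    return vin[8] == expected
```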
11. ZIP Code Model
Parameter | Type | Default | Description |
---|---|---|---|
ZIP_DICT_KEY | String | US_ZIP_LOOKUP | Dictionary name of ZIP codes. |
ZIP_PATTERN | String | — | Pattern for matching ZIP codes. |
STRICT_PATTERN | Boolean | FALSE | If TRUE, enforces strict 5+4 ZIP format. |
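The following sketch shows how a pattern check and a dictionary lookup could combine for ZIP codes. The two regexes are illustrative, not the product's default ZIP_PATTERN, and the `known_zips` set is a stand-in for the dictionary that ZIP_DICT_KEY names. Checking only the first five digits against the dictionary is an assumption.

```python
import re

ZIP_STRICT = re.compile(r"[0-9]{5}-[0-9]{4}")      # 5+4 format only
ZIP_LOOSE = re.compile(r"[0-9]{5}(-[0-9]{4})?")    # 5 digits, optional +4 suffix

def looks_like_zip(text, strict_pattern=False, known_zips=None):
    """Hypothetical check: pattern match, then optional lookup of the 5-digit prefix."""
    pattern = ZIP_STRICT if strict_pattern else ZIP_LOOSE
    if not pattern.fullmatch(text):
        return False
    # known_zips stands in for the ZIP dictionary; None skips the lookup.
    return known_zips is None or text[:5] in known_zips
```

A dictionary lookup is what separates real ZIP codes from arbitrary five-digit numbers, which a pattern alone cannot do.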
Conclusion
Privacera Discovery's model-based classification supports numerous sensitive data formats through specialized validation rules and pattern matching. These models help achieve consistent and automated tagging of data across domains such as finance, healthcare, HR, and more.