Heuristic Models in Privacera Discovery

Privacera supports a broad range of heuristic models designed to identify and classify sensitive data using logic, pattern recognition, and validation mechanisms. These models help automate sensitive data tagging and improve classification accuracy.


1. Generic Models

These are baseline models with configurable parameters to guide pattern matching and logic-based detection.

| Parameter | Type | Default | Description |
|---|---|---|---|
| INCLUDE_PATTERN_<#> | String | None | Patterns to be matched. |
| EXCLUDE_PATTERN_<#> | String | None | Patterns to be excluded from matching. |
| ONLY_DIGITS | Boolean | FALSE | Removes all non-numeric characters in the string before matching. |
| CHECK_DIGIT_CODE_VALIDATE | String | None | Indicates whether to evaluate a checksum digit based on the last digit. |
| DO_LOOKUP | Boolean | FALSE | Enables pattern lookup using LOOKUP_PATTERN. |
| LOOKUP_DICT | String | None | A dictionary name or key. |
| LOOKUP_PATTERN | String | None | Pattern used for matching. |
| ISO3166_CC_VALIDATE_FLAG | Boolean | FALSE | Enables ISO country code validation using ISO3166_CC_PATTERN. |
| ISO3166_CC_PATTERN | String | None | Pattern for matching ISO country codes. |
| ISO3166_CC_LOOKUP_KEY | String | None | Dictionary name for country code lookup. |

2. Credit Card Model

| Parameter | Type | Default | Description |
|---|---|---|---|
| CC_PATTERN | String |  | Override pattern for credit card numbers. |
| DEFAULT_TYPES | Boolean | TRUE | Validate against known issuing network prefixes. |
| LUHN_CHECK | Boolean | TRUE | Validate the Luhn checksum on the credit card number. |
| DEFAULT_VALIDATORS_TO_EXCLUDE | String |  | Comma-separated list of default validators to exclude from validation. Example: RUPAY,MIR,TROY. |
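
The Luhn checksum that LUHN_CHECK refers to is a standard algorithm, so it can be illustrated independently of Privacera. The following is a minimal Python sketch of that algorithm, not Privacera's implementation; the function name `luhn_valid` is hypothetical.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Ignore separators such as spaces or hyphens.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

A well-known test number such as `4111 1111 1111 1111` passes this check, while changing its last digit makes it fail.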

Supported Default Credit Card Types: AMEX, MASTERCARD, VISA, DINERS, DISCOVER, JCBCARD, MAESTRO, CHINA_UNIONPAY, INSTAPAYMENT, MIR, RUPAY, TROY, VERVE, HIPERCARD, AURA, CARNET, BCGLOBAL

Note

You can exclude the default credit card types from validation by setting the DEFAULT_VALIDATORS_TO_EXCLUDE parameter.

Supported Default Credit Card Types

| Card Type | Description |
|---|---|
| AMEX | Starts with 34 or 37; 15 digits |
| MASTERCARD | Starts with 2221–2720 or 51–55; 16 digits |
| VISA | Starts with 4; 16–19 digits |
| DINERS | Starts with 300–305, 3095, 36, 38, 39; 14 digits |
| DISCOVER | Starts with 6011, 622–628, 644–649, or 65; 16–17 digits |
| JCBCARD | Starts with 2131 or 1800 (15 digits), or 35 (16 digits) |
| MAESTRO | Starts with 5018, 5020, 5038, 6304, 6759, 6761, 6763; 12–19 digits |
| CHINA_UNIONPAY | Starts with 62; 16–19 digits |
| INSTAPAYMENT | Starts with 637–639; 16 digits |
| MIR | Starts with 2200–2204; 16–19 digits |
| RUPAY | Starts with 60, 65, 81, 82; 16 digits |
| TROY | Starts with 9792; 16 digits |
| VERVE | BIN ranges: 506099–506199, 507865–507896, 650002–650027; 16–19 digits |
| HIPERCARD | BINs like 384100, 384140, 384160, 637568, 637599, 637609, 637612; 16–19 digits |
| AURA | Starts with 507860; 16–19 digits |
| CARNET | BINs like 286900, 506203, 506222, 506237, 506262, 506276, 506281, 506301 etc.; 16–19 digits |
| BCGLOBAL | Specific 6541/6556/700013 ranges; 16 digits |

To support additional card types, use:

Regex property names

ADDITIONAL_REGEX_<CARD_TYPE>: Custom regex for matching a specific credit card type.

Examples of Additional Regexes:

YAML
ADDITIONAL_REGEX_JCB: ^((?:2131|1800|35\d{3})\d{11})$
ADDITIONAL_REGEX_MAESTRO: ^((?:5018|5020|5038|6304|6759|6761|6763)\d{8,15})$
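
To see what these two patterns accept, they can be tried against sample numbers. The Python sketch below copies the regexes verbatim from the example above; the `ADDITIONAL_REGEX` dictionary and the `match_card_type` helper are hypothetical names used only for illustration, and this is not how Privacera loads the properties.

```python
import re

# Patterns copied from the ADDITIONAL_REGEX_<CARD_TYPE> examples.
ADDITIONAL_REGEX = {
    "JCB": r"^((?:2131|1800|35\d{3})\d{11})$",
    "MAESTRO": r"^((?:5018|5020|5038|6304|6759|6761|6763)\d{8,15})$",
}


def match_card_type(number: str):
    """Return the first card type whose regex matches, or None."""
    for card_type, pattern in ADDITIONAL_REGEX.items():
        if re.match(pattern, number):
            return card_type
    return None
```

With these patterns, a JCB number starting with 2131 or 1800 must have 15 digits, one starting with 35 must have 16, and a Maestro number may have 12 to 19 digits.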


3. Date of Birth (DOB) Model

| Parameter | Type | Default | Description |
|---|---|---|---|
| MIN_AGE_YEARS | Integer | 5 | Age lower threshold. |
| MAX_AGE_YEARS | Integer | 100 | Age upper threshold. |
| DATE_REGEX_var1 | String |  | Custom regex for matching date format. |
| DATE_FORMAT_var1 | String |  | Date format corresponding to the regex. |
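
The interplay between a date regex, its format, and the age thresholds can be sketched as follows. This is an illustrative Python example under assumptions: the ISO-style regex and the `strptime` format string are hypothetical stand-ins for a DATE_REGEX/DATE_FORMAT pair (Privacera's own format syntax may differ), and `is_plausible_dob` is not a Privacera function.

```python
import re
from datetime import date, datetime

MIN_AGE_YEARS = 5      # default from the table above
MAX_AGE_YEARS = 100    # default from the table above
DATE_REGEX = r"^\d{4}-\d{2}-\d{2}$"   # hypothetical custom regex
DATE_FORMAT = "%Y-%m-%d"              # hypothetical matching format


def is_plausible_dob(value: str, today: date = None) -> bool:
    """Return True if `value` parses as a date and the age is in range."""
    today = today or date.today()
    if not re.match(DATE_REGEX, value):
        return False
    try:
        born = datetime.strptime(value, DATE_FORMAT).date()
    except ValueError:
        return False  # matches the regex but is not a real calendar date
    # Subtract one year if this year's birthday has not happened yet.
    age = today.year - born.year - (
        (today.month, today.day) < (born.month, born.day)
    )
    return MIN_AGE_YEARS <= age <= MAX_AGE_YEARS
```

Under these assumptions, a date that implies an age below 5 or above 100 is rejected even though it matches the pattern.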

4. EIN Model

| Parameter | Type | Default | Description |
|---|---|---|---|
| EIN_PATTERN | String |  | Override default EIN pattern. |
| VALIDATIONS | Boolean | TRUE | Enable format validations. |
| STRICT_PATTERN | Boolean | TRUE | Match only if EIN has exact format. |

5. Geo Latitude/Longitude Model

| Parameter | Type | Default | Description |
|---|---|---|---|
| MIN_LAT | Double |  | Lower bound on latitude. |
| MAX_LAT | Double |  | Upper bound on latitude. |
| MIN_LONG | Double |  | Lower bound on longitude. |
| MAX_LONG | Double |  | Upper bound on longitude. |
| MIN_FRACTIONAL_DIGITS | Integer | 3 | Minimum decimal precision. |
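
A bounds-plus-precision check of this kind can be sketched in Python as below. The bounds used here (-90 to 90 for latitude, -180 to 180 for longitude) are the geographic limits and are assumed only for illustration, since the table does not list the defaults; the function names are hypothetical.

```python
def is_plausible_coordinate(value: str, lower: float, upper: float,
                            min_fractional_digits: int = 3) -> bool:
    """Check that `value` is a decimal number within [lower, upper]
    and has at least `min_fractional_digits` digits after the point."""
    text = value.strip()
    try:
        number = float(text)
    except ValueError:
        return False
    # Count the digits after the decimal point in the original text.
    _, _, fraction = text.partition(".")
    if not fraction.isdigit() or len(fraction) < min_fractional_digits:
        return False
    return lower <= number <= upper


def is_latitude(value: str) -> bool:
    return is_plausible_coordinate(value, -90.0, 90.0)


def is_longitude(value: str) -> bool:
    return is_plausible_coordinate(value, -180.0, 180.0)
```

Requiring a minimum number of fractional digits helps separate coordinates from ordinary small numbers such as `37` or `37.5`.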

6. ITIN Model

| Parameter | Type | Default | Description |
|---|---|---|---|
| ITIN_PATTERN | String |  | Override default ITIN pattern. |
| STRICT_PATTERN | Boolean | TRUE | Match only if ITIN has exact format. |

7. MIME Model

| Parameter | Type | Default | Description |
|---|---|---|---|
| LOOKUP_DICT | String |  | Dictionary name of MIME types. |

Examples: EXEC_MIME_KEYWORD, IMAGE_MIME_KEYWORD


8. Phone Number Model

| Parameter | Type | Default | Description |
|---|---|---|---|
| COUNTRY_CODE | String | US | 2-letter ISO country code. |

9. SSN Model

This model identifies possible SSNs (Social Security Numbers) in data by applying pattern matching, format checks, and optionally statistical/randomness-based validations.

There are two main phases:

A. Pattern Matching Phase

  • Uses regular expressions (regex) to identify common SSN formats (e.g., 123-45-6789, 123456789, or 6789).
  • Configurable using pattern-related flags like STRICT_PATTERN, USE_9_DIGIT_PATTERN, etc.

B. Statistical Validation Phase

  • Applies advanced validations like entropy, digit distribution (variance), and sequentiality.
  • These checks assess how "random" or "natural" the SSN-like string appears.
  • Controlled by flags like USE_SEQUENCE_CHECK, USE_VARIANCE_CHECK, etc.
  • All these are collectively controlled by USE_ADDITIONAL_VALIDATIONS.

Configuration Groups

A. Pattern Configuration

  • Customize what qualifies as an SSN match:

| Parameter | Type | Default | Description |
|---|---|---|---|
| SSN_PATTERN | String |  | Override default SSN pattern. |
| STRICT_PATTERN | Boolean | FALSE | Match only if SSN has exact format. |
| USE_9_DIGIT_PATTERN | Boolean | FALSE | Allow matching of any 9-digit string. |
| USE_4_DIGIT_PATTERN | Boolean | FALSE | Allow matching of 4-digit string. |
| STRICT_EXT_PATTERN | Boolean | TRUE | Requires hyphen/dot/space-separated format. |

B. Basic Validation Settings

  • Check for basic data quality and known issues.
  • Examples: "111111111", "987654321", "000000000", "1234567890", "12345678901"

| Parameter | Type | Default | Description |
|---|---|---|---|
| VALIDATIONS | Boolean | TRUE | Enable blacklist checks. |
| UNIQUE_DIGIT_THRESHOLD | Integer | 3 | Minimum number of unique digits required. |
| MAX_REPETITION_ALLOWED | Integer | 5 | Maximum allowed repeated digits. |
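
The two thresholds can be illustrated with a short Python sketch. It assumes that "repeated digits" means the number of times the most frequent digit occurs in the value; the documentation does not spell this out, so treat that interpretation and the function name `passes_basic_checks` as assumptions.

```python
from collections import Counter


def passes_basic_checks(value: str,
                        unique_digit_threshold: int = 3,
                        max_repetition_allowed: int = 5) -> bool:
    """Apply the unique-digit and repetition thresholds to `value`."""
    digits = [ch for ch in value if ch.isdigit()]
    if not digits:
        return False
    counts = Counter(digits)
    # Too few distinct digits, e.g. "111111111" or "121212121".
    if len(counts) < unique_digit_threshold:
        return False
    # Assumed meaning: no single digit may occur more than the limit.
    if max(counts.values()) > max_repetition_allowed:
        return False
    return True
```

Values such as "987654321" pass these two thresholds, which suggests they are caught by the blacklist that VALIDATIONS enables rather than by the digit counts.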

C. Entropy Checks

  • Entropy measures how random/unpredictable a number string is.

| Parameter | Type | Default | Description |
|---|---|---|---|
| USE_ENTROPY_CHECK | Boolean | TRUE | Enable entropy-based checks to validate randomness of the match. |
| MAX_ENTROPY_SCORE_DEDUCTION | Integer | 40 | Maximum penalty based on low entropy in matched SSN patterns. |
| MIN_ENTROPY | Double | 1.0 | Minimum acceptable entropy value for matched results. |
| MAX_ENTROPY | Double | 3.0 | Maximum acceptable entropy value for matched results. |
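
One common way to quantify this is Shannon entropy over the digit frequencies of a string, measured in bits. The Python sketch below assumes that definition and a simple in-range test against MIN_ENTROPY and MAX_ENTROPY; how Privacera actually computes entropy, and how it converts the result into a score deduction, is not documented here. The function names are hypothetical.

```python
import math
from collections import Counter

MIN_ENTROPY = 1.0  # default from the table above
MAX_ENTROPY = 3.0  # default from the table above


def digit_entropy(value: str) -> float:
    """Shannon entropy (bits) of the digit distribution in `value`."""
    digits = [ch for ch in value if ch.isdigit()]
    if not digits:
        return 0.0
    total = len(digits)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(digits).values()
    )


def entropy_in_range(value: str) -> bool:
    """Assumed check: entropy must fall within [MIN_ENTROPY, MAX_ENTROPY]."""
    return MIN_ENTROPY <= digit_entropy(value) <= MAX_ENTROPY
```

Under this definition a string of one repeated digit has entropy 0, and a 9-digit string with nine distinct digits reaches log2(9), about 3.17 bits, which is just above the default upper bound.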

D. Variance Checks

  • Variance measures how digits are distributed, which helps filter out non-random digit repetition.

| Parameter | Type | Default | Description |
|---|---|---|---|
| USE_VARIANCE_CHECK | Boolean | TRUE | Enable variance-based checks on digit distribution. |
| MAX_VARIANCE_SCORE_DEDUCTION | Integer | 40 | Maximum penalty based on variance of digit distribution. |
| MIN_VARIANCE | Double | 1e12 | Minimum acceptable variance for randomness checks. |
| MAX_VARIANCE | Double | 3e15 | Maximum acceptable variance for randomness checks. |

E. Sequence Checks

  • Looks for sequential patterns.

| Parameter | Type | Default | Description |
|---|---|---|---|
| USE_SEQUENCE_CHECK | Boolean | TRUE | Enable sequence pattern checks. |
| MAX_SEQUENCE_SCORE_DEDUCTION | Integer | 40 | Maximum penalty applied if sequential patterns are detected. |
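
A sequential pattern can be detected by measuring the longest run of digits that step up or down by exactly one. The Python sketch below shows that idea only; the documentation does not say which run length triggers a deduction or how the penalty is scaled, so the function `longest_sequential_run` is a hypothetical illustration rather than Privacera's check.

```python
def longest_sequential_run(value: str) -> int:
    """Length of the longest run of digits that rise or fall by 1."""
    digits = [int(ch) for ch in value if ch.isdigit()]
    if not digits:
        return 0
    best = run = 1
    direction = 0  # +1 ascending, -1 descending, 0 no active run
    for prev, cur in zip(digits, digits[1:]):
        step = cur - prev
        if step in (1, -1):
            # Extend the run if the direction is unchanged, else restart.
            run = run + 1 if step == direction else 2
            direction = step
        else:
            run, direction = 1, 0
        best = max(best, run)
    return best
```

For example, both "123456789" and "987654321" produce a run of 9, while a value with no adjacent consecutive digits produces a run of 1.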

F. Sampling Control

  • This setting determines when randomness-based logic applies.

| Parameter | Type | Default | Description |
|---|---|---|---|
| MINIMUM_DATA_SIZE | Integer | 20 | Minimum number of samples required for randomness validation to apply. |

Examples of Invalid SSNs:

  • Starts with 9, 666, 000, or 98765432
  • Middle digits = 00
  • Last digits = 0000
  • Dummy values: 123456789, 111111111, etc.
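
The invalid-SSN rules listed above translate directly into a small check. The following Python sketch encodes only those listed rules; the `DUMMY_VALUES` set contains just the two dummy values named in the text, and `is_invalid_ssn` is a hypothetical helper, not part of Privacera.

```python
# Only the dummy values named in the documentation; the real list may be longer.
DUMMY_VALUES = {"123456789", "111111111"}


def is_invalid_ssn(value: str) -> bool:
    """Return True if `value` breaks one of the listed SSN rules."""
    digits = "".join(ch for ch in value if ch.isdigit())
    if len(digits) != 9:
        return True
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    return (
        digits in DUMMY_VALUES
        or area in {"000", "666"}
        or area.startswith("9")   # also covers the 98765432 prefix
        or group == "00"
        or serial == "0000"
    )
```

Separators are stripped first, so "666-12-3456" and "666123456" are treated the same way.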


10. VIN Model

  • Detects Vehicle Identification Numbers (VINs).
  • Validates using length and VIN-specific checksum.
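
The length and checksum validation can be sketched with the standard North American VIN check-digit scheme, in which the ninth character is computed from the other sixteen. This Python example assumes that scheme is the "VIN-specific checksum" meant here; the function name `vin_valid` is hypothetical and this is not Privacera's code.

```python
# Letter-to-number transliteration; I, O and Q never appear in a VIN.
TRANSLITERATION = {
    **{str(d): d for d in range(10)},
    **dict(zip("ABCDEFGH", range(1, 9))),
    **dict(zip("JKLMN", range(1, 6))),
    "P": 7,
    "R": 9,
    **dict(zip("STUVWXYZ", range(2, 10))),
}
# Positional weights; position 9 (the check digit itself) has weight 0.
WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]


def vin_valid(vin: str) -> bool:
    """Check the length, character set, and check digit of a VIN."""
    vin = vin.strip().upper()
    if len(vin) != 17 or any(ch not in TRANSLITERATION for ch in vin):
        return False
    remainder = sum(
        TRANSLITERATION[ch] * weight for ch, weight in zip(vin, WEIGHTS)
    ) % 11
    expected = "X" if remainder == 10 else str(remainder)
    return vin[8] == expected
```

Note that VINs issued outside North America are not required to carry a valid check digit, so a strict check like this one can reject genuine VINs from other regions.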

11. ZIP Code Model

| Parameter | Type | Default | Description |
|---|---|---|---|
| ZIP_DICT_KEY | String | US_ZIP_LOOKUP | Dictionary name of ZIP codes. |
| ZIP_PATTERN | String |  | Pattern for matching ZIP codes. |
| STRICT_PATTERN | Boolean | FALSE | If TRUE, enforces strict 5+4 ZIP format. |
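
The combination of a pattern, a strict 5+4 mode, and a dictionary lookup can be sketched as follows. The regexes and the `known_zips` set are hypothetical stand-ins for ZIP_PATTERN and the US_ZIP_LOOKUP dictionary, whose actual contents are not shown in this document.

```python
import re

# Hypothetical patterns: 5 digits with optional +4, and strict 5+4.
ZIP_PATTERN = re.compile(r"\d{5}(?:-\d{4})?")
STRICT_ZIP_PATTERN = re.compile(r"\d{5}-\d{4}")


def looks_like_zip(value: str, strict: bool = False,
                   known_zips: set = None) -> bool:
    """Match a ZIP code and optionally confirm it against a lookup set."""
    text = value.strip()
    pattern = STRICT_ZIP_PATTERN if strict else ZIP_PATTERN
    if not pattern.fullmatch(text):
        return False
    # Stand-in for the dictionary lookup: compare the 5-digit prefix.
    if known_zips is not None and text[:5] not in known_zips:
        return False
    return True
```

The lookup step matters because any 5-digit number matches the pattern; checking against a list of real ZIP codes reduces false positives.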

Conclusion

Privacera Discovery's model-based classification supports numerous sensitive data formats through specialized validation rules and pattern matching. These models help achieve consistent and automated tagging of data across domains such as finance, healthcare, HR, and more.
