Heuristic Models in Privacera Discovery

Privacera supports a broad range of heuristic models designed to identify and classify sensitive data using logic, pattern recognition, and validation mechanisms. These models help automate sensitive data tagging and improve classification accuracy.


1. Generic Models

These are baseline models with configurable parameters to guide pattern matching and logic-based detection.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| INCLUDE_PATTERN_<#> | String | None | Patterns to be matched. |
| EXCLUDE_PATTERN_<#> | String | None | Patterns to be excluded from matching. |
| ONLY_DIGITS | Boolean | FALSE | Removes all non-numeric characters in the string before matching. |
| CHECK_DIGIT_CODE_VALIDATE | String | None | Indicates whether to evaluate a checksum digit based on the last digit. |
| DO_LOOKUP | Boolean | FALSE | Enables pattern lookup using LOOKUP_PATTERN. |
| LOOKUP_DICT | String | None | A dictionary name or key. |
| LOOKUP_PATTERN | String | None | Pattern used for matching. |
| ISO3166_CC_VALIDATE_FLAG | Boolean | FALSE | Enables ISO country code validation using ISO3166_CC_PATTERN. |
| ISO3166_CC_PATTERN | String | None | Pattern for matching ISO country codes. |
| ISO3166_CC_LOOKUP_KEY | String | None | Dictionary name for country code lookup. |

2. Credit Card Model

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| CC_PATTERN | String | | Override pattern for credit card numbers. |
| DEFAULT_TYPES | Boolean | TRUE | Validate against known issuing network prefixes. |
| LUHN_CHECK | Boolean | TRUE | Validate the Luhn checksum on the credit card number. |
| DEFAULT_VALIDATORS_TO_EXCLUDE | String | | Comma-separated list of default validators to exclude from validation. Example: RUPAY,MIR,TROY |

Supported Default Credit Card Types: AMEX, MASTERCARD, VISA, DINERS, DISCOVER, JCBCARD, MAESTRO, CHINA_UNIONPAY, INSTAPAYMENT, MIR, RUPAY, TROY, VERVE, HIPERCARD, AURA, CARNET, BCGLOBAL

Note

You can exclude the default credit card types from validation by setting the DEFAULT_VALIDATORS_TO_EXCLUDE parameter.

Supported Default Credit Card Types

| Card Type | Description |
| --- | --- |
| AMEX | Starts with 34 or 37; 15 digits |
| MASTERCARD | Starts with 2221–2720 or 51–55; 16 digits |
| VISA | Starts with 4; 16–19 digits |
| DINERS | Starts with 300–305, 3095, 36, 38, 39; 14 digits |
| DISCOVER | Starts with 6011, 622–628, 644–649, or 65; 16–17 digits |
| JCBCARD | Starts with 2131 or 1800 (15 digits) or 35 (16 digits) |
| MAESTRO | Starts with 5018, 5020, 5038, 6304, 6759, 6761, 6763; 12–19 digits |
| CHINA_UNIONPAY | Starts with 62; 16–19 digits |
| INSTAPAYMENT | Starts with 637–639; 16 digits |
| MIR | Starts with 2200–2204; 16–19 digits |
| RUPAY | Starts with 60, 65, 81, 82; 16 digits |
| TROY | Starts with 9792; 16 digits |
| VERVE | BIN ranges: 506099–506199, 507865–507896, 650002–650027; 16–19 digits |
| HIPERCARD | BINs like 384100, 384140, 384160, 637568, 637599, 637609, 637612; 16–19 digits |
| AURA | Starts with 507860; 16–19 digits |
| CARNET | BINs like 286900, 506203, 506222, 506237, 506262, 506276, 506281, 506301, etc.; 16–19 digits |
| BCGLOBAL | Specific 6541/6556/700013 ranges; 16 digits |
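
As a rough illustration of how prefix matching and the Luhn check can combine, here is a minimal Python sketch. It is not Privacera's implementation: the names `CARD_RULES`, `luhn_valid`, and `detect_card_type` are hypothetical, and only three of the card types listed above are covered. The `luhn_check` argument stands in for the LUHN_CHECK parameter.

```python
import re

# Hypothetical subset of the prefix/length rules from the table above.
CARD_RULES = {
    "AMEX": re.compile(r"^3[47]\d{13}$"),   # 34 or 37, 15 digits
    "VISA": re.compile(r"^4\d{15,18}$"),    # 4, 16-19 digits
    "TROY": re.compile(r"^9792\d{12}$"),    # 9792, 16 digits
}


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:      # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def detect_card_type(raw: str, luhn_check: bool = True):
    """Return the first matching card type name, or None."""
    digits = re.sub(r"\D", "", raw)  # drop spaces and hyphens
    for name, pattern in CARD_RULES.items():
        if pattern.match(digits) and (not luhn_check or luhn_valid(digits)):
            return name
    return None
```

With the checksum enabled, a number that has a valid prefix and length but a wrong final digit is rejected; disabling it falls back to prefix and length only.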

To support additional card types, use:

Regex property names

ADDITIONAL_REGEX_<CARD_TYPE>: Custom regex for matching a specific credit card type.

Examples of Additional Regexes:

YAML
ADDITIONAL_REGEX_JCB: ^((?:2131|1800|35\d{3})\d{11})$
ADDITIONAL_REGEX_MAESTRO: ^((?:5018|5020|5038|6304|6759|6761|6763)\d{8,15})$


3. Date of Birth (DOB) Model

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| MIN_AGE_YEARS | Integer | 5 | Age lower threshold. |
| MAX_AGE_YEARS | Integer | 100 | Age upper threshold. |
| DATE_REGEX_var1 | String | | Custom regex for matching date format. |
| DATE_FORMAT_var1 | String | | Date format corresponding to the regex. |
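
The age thresholds can be pictured with a short Python sketch. This is an assumption-laden illustration, not the product's logic: `plausible_dob` is a hypothetical helper, the regex and format stand in for one DATE_REGEX_var1 / DATE_FORMAT_var1 pair, the format string uses Python's `strptime` syntax rather than whatever syntax the product expects, and the bounds are assumed to be inclusive.

```python
import re
from datetime import date, datetime

MIN_AGE_YEARS = 5     # default from the table above
MAX_AGE_YEARS = 100   # default from the table above

# Hypothetical values for one regex/format pair.
DATE_REGEX = re.compile(r"^\d{4}-\d{2}-\d{2}$")
DATE_FORMAT = "%Y-%m-%d"


def plausible_dob(value: str, today: date) -> bool:
    """Return True if value parses as a date whose implied age is within bounds."""
    if not DATE_REGEX.match(value):
        return False
    try:
        born = datetime.strptime(value, DATE_FORMAT).date()
    except ValueError:  # matches the regex but is not a real date
        return False
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    age = today.year - born.year - (0 if had_birthday else 1)
    return MIN_AGE_YEARS <= age <= MAX_AGE_YEARS
```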

4. EIN Model

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| EIN_PATTERN | String | | Override default EIN pattern. |
| VALIDATIONS | Boolean | TRUE | Enable format validations. |
| STRICT_PATTERN | Boolean | TRUE | Match only if EIN has exact format. |

5. Geo Latitude/Longitude Model

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| MIN_LAT | Double | | Lower bound on latitude. |
| MAX_LAT | Double | | Upper bound on latitude. |
| MIN_LONG | Double | | Lower bound on longitude. |
| MAX_LONG | Double | | Upper bound on longitude. |
| MIN_FRACTIONAL_DIGITS | Integer | 3 | Minimum decimal precision. |
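
A minimal Python sketch of how bounds and precision checks of this kind might fit together follows. The table lists no default bounds, so the full coordinate ranges are assumed here; `plausible_coordinate` is a hypothetical name, and applying the precision requirement to both values is also an assumption.

```python
from decimal import Decimal, InvalidOperation

# Assumed bounds for illustration only.
MIN_LAT, MAX_LAT = Decimal("-90"), Decimal("90")
MIN_LONG, MAX_LONG = Decimal("-180"), Decimal("180")
MIN_FRACTIONAL_DIGITS = 3   # default from the table above


def fractional_digits(text: str) -> int:
    """Count the digits after the decimal point in the original text."""
    _, _, fraction = text.strip().partition(".")
    return len(fraction)


def plausible_coordinate(lat_text: str, long_text: str) -> bool:
    """Return True if both values are in range and precise enough."""
    try:
        lat, lon = Decimal(lat_text), Decimal(long_text)
        in_range = MIN_LAT <= lat <= MAX_LAT and MIN_LONG <= lon <= MAX_LONG
    except InvalidOperation:  # not a number, or NaN in a comparison
        return False
    precise = min(fractional_digits(lat_text),
                  fractional_digits(long_text)) >= MIN_FRACTIONAL_DIGITS
    return in_range and precise
```

Requiring a minimum number of fractional digits helps keep short decimals such as prices or percentages from being read as coordinates.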

6. ITIN Model

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| ITIN_PATTERN | String | | Override default ITIN pattern. |
| STRICT_PATTERN | Boolean | TRUE | Match only if ITIN has exact format. |

7. MIME Model

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| LOOKUP_DICT | String | | Dictionary name of MIME types. |

Examples: EXEC_MIME_KEYWORD, IMAGE_MIME_KEYWORD


8. Phone Number Model

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| COUNTRY_CODE | String | US | 2-letter ISO country code. |

9. SSN Model

This model identifies possible SSNs (Social Security Numbers) in data by applying pattern matching, format checks, and optionally statistical/randomness-based validations.

There are two main phases:

A. Pattern Matching Phase

  • Uses regular expressions (regex) to identify common SSN formats (e.g., 123-45-6789, 123456789, or 6789).
  • Configurable using pattern-related flags like STRICT_PATTERN, USE_9_DIGIT_PATTERN, etc.

B. Statistical Validation Phase

  • Applies advanced validations like entropy, digit distribution (variance), and sequentiality.
  • These checks assess how "random" or "natural" the SSN-like string appears.
  • Controlled by flags like USE_SEQUENCE_CHECK, USE_VARIANCE_CHECK, etc.
  • All these are collectively controlled by USE_ADDITIONAL_VALIDATIONS.

Configuration Groups

A. Pattern Configuration

  • Customize what qualifies as an SSN match:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| SSN_PATTERN | String | | Override default SSN pattern. |
| STRICT_PATTERN | Boolean | FALSE | Match only if SSN has exact format. |
| USE_9_DIGIT_PATTERN | Boolean | FALSE | Allow matching of any 9-digit string. |
| USE_4_DIGIT_PATTERN | Boolean | FALSE | Allow matching of 4-digit string. |
| STRICT_EXT_PATTERN | Boolean | TRUE | Requires hyphen/dot/space-separated format. |

B. Basic Validation Settings

  • Check for basic data quality and known issues.
  • Examples of values these checks target: "111111111", "987654321", "000000000", "1234567890", "12345678901"

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| VALIDATIONS | Boolean | TRUE | Enable blacklist checks. |
| UNIQUE_DIGIT_THRESHOLD | Integer | 3 | Minimum number of unique digits required. |
| MAX_REPETITION_ALLOWED | Integer | 5 | Maximum allowed repeated digits. |

C. Entropy Checks

  • Entropy measures how random/unpredictable a number string is.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| USE_ENTROPY_CHECK | Boolean | TRUE | Enable entropy-based checks to validate randomness of the match. |
| MAX_ENTROPY_SCORE_DEDUCTION | Integer | 40 | Maximum penalty based on low entropy in matched SSN patterns. |
| MIN_ENTROPY | Double | 1.0 | Minimum acceptable entropy value for matched results. |
| MAX_ENTROPY | Double | 3.0 | Maximum acceptable entropy value for matched results. |
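
The exact entropy formula the model uses is not stated here. A common choice is Shannon entropy over the characters of the candidate, measured in bits, and the sketch below assumes that definition. The helper names are hypothetical, and how a value outside the range translates into a score deduction is not modelled.

```python
import math
from collections import Counter


def shannon_entropy(digits: str) -> float:
    """Shannon entropy, in bits per character, of the characters in digits."""
    if not digits:
        return 0.0
    total = len(digits)
    counts = Counter(digits)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_in_range(digits: str,
                     min_entropy: float = 1.0,
                     max_entropy: float = 3.0) -> bool:
    """Check the entropy against the MIN_ENTROPY / MAX_ENTROPY defaults."""
    return min_entropy <= shannon_entropy(digits) <= max_entropy
```

Under this definition a string of one repeated digit has zero entropy, and a string that alternates between two digits has an entropy of exactly one bit.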

D. Variance Checks

  • Variance looks at how digits are distributed, which helps filter out non-random digit repetition.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| USE_VARIANCE_CHECK | Boolean | TRUE | Enable variance-based checks on digit distribution. |
| MAX_VARIANCE_SCORE_DEDUCTION | Integer | 40 | Maximum penalty based on variance of digit distribution. |
| MIN_VARIANCE | Double | 1e12 | Minimum acceptable variance for randomness checks. |
| MAX_VARIANCE | Double | 3e15 | Maximum acceptable variance for randomness checks. |

E. Sequence Checks

  • Looks for sequential patterns in the matched digits.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| USE_SEQUENCE_CHECK | Boolean | TRUE | Enable sequence pattern checks. |
| MAX_SEQUENCE_SCORE_DEDUCTION | Integer | 40 | Maximum penalty applied if sequential patterns are detected. |

F. Sampling Control

  • These determine when randomness-based logic should apply.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| MINIMUM_DATA_SIZE | Integer | 20 | Minimum number of samples required for randomness validation to apply. |

Examples of invalid SSNs:

  • Starts with 9, 666, 000, or 98765432
  • Middle digits = 00
  • Last digits = 0000
  • Dummy values: 123456789, 111111111, etc.
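
The invalid-SSN rules above can be sketched as a small Python filter. This is only an illustration under stated assumptions: `passes_basic_ssn_rules` is a hypothetical name, the dummy-value set contains just the two examples named above, and the statistical checks (entropy, variance, sequence) are not included.

```python
import re

# Only the dummy values named in the text; a real list would be longer.
DUMMY_VALUES = {"123456789", "111111111"}


def passes_basic_ssn_rules(value: str) -> bool:
    """Apply the structural rejection rules to a 9-digit candidate."""
    digits = re.sub(r"\D", "", value)
    if len(digits) != 9:
        return False
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in {"000", "666"} or area.startswith("9"):
        return False
    if group == "00" or serial == "0000":
        return False
    return digits not in DUMMY_VALUES
```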


10. VIN Model

  • Detects Vehicle Identification Numbers (VINs).
  • Validates using length and VIN-specific checksum.
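
The text does not say which checksum is applied. The sketch below assumes the widely used check-digit scheme in which position 9 of a 17-character VIN is computed from transliterated character values and fixed weights, modulo 11. `vin_valid` and the constant names are hypothetical.

```python
# Position weights; position 9 (the check digit itself) has weight 0.
VIN_WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]

# Letter values used for transliteration; I, O and Q never appear in a VIN.
VIN_VALUES = dict(zip(
    "ABCDEFGHJKLMNPRSTUVWXYZ",
    [1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 7, 9, 2, 3, 4, 5, 6, 7, 8, 9],
))
VIN_VALUES.update({str(d): d for d in range(10)})


def vin_valid(vin: str) -> bool:
    """Check length, character set, and the position-9 check digit."""
    vin = vin.strip().upper()
    if len(vin) != 17 or any(ch not in VIN_VALUES for ch in vin):
        return False
    total = sum(VIN_VALUES[ch] * w for ch, w in zip(vin, VIN_WEIGHTS))
    remainder = total % 11
    expected = "X" if remainder == 10 else str(remainder)
    return vin[8] == expected
```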

11. ZIP Code Model

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| ZIP_DICT_KEY | String | US_ZIP_LOOKUP | Dictionary name of ZIP codes. |
| ZIP_PATTERN | String | | Pattern for matching ZIP codes. |
| STRICT_PATTERN | Boolean | FALSE | If TRUE, enforces strict 5+4 ZIP format. |

12. Driving License Model

  • Detects potential U.S. driver license numbers across 50 states.
  • Utilizes state-specific regex patterns for structural validation.
  • Applies robust heuristics to minimize false positives:
    • Length must be between 4–17 characters.
    • Rejects low-entropy or repetitive sequences.
    • Enforces character diversity to ensure validity.

13. Brazil CPF Model

The Brazil CPF (Cadastro de Pessoas Físicas) model identifies Brazilian individual taxpayer identification numbers using pattern matching and validation algorithms.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| DEFAULT_TYPES | Boolean | TRUE | Validates against the standard CPF format and checksum algorithm. |
| ADDITIONAL_REGEX_<TYPE> | String | | Matches specific CPF formats using a custom regex. |

Tip

This model is disabled by default. To detect CPF numbers in your data, you must enable it.

The model identifies CPF numbers based on the following criteria:

  • Format: An 11-digit number in XXX.XXX.XXX-XX, XXXXXXXXXXX, or XXX XXX XXX XX format.
  • Validation: Uses the CPF checksum algorithm (mod 11).
  • Exclusions: Automatically excludes known invalid patterns (e.g., 000.000.000-00, 111.111.111-11).
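
For readers unfamiliar with the CPF checksum, here is a minimal Python sketch of the standard mod 11 calculation together with the repeated-digit exclusion. The function names are hypothetical, and stripping every non-digit character is a simplification that is more permissive than the three formats listed above.

```python
import re


def cpf_check_digit(digits: str) -> int:
    """Mod 11 check digit for the given leading digits (9 or 10 of them)."""
    weights = range(len(digits) + 1, 1, -1)   # 10..2 or 11..2
    remainder = sum(int(d) * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder


def cpf_valid(value: str) -> bool:
    """Validate both check digits and reject repeated-digit values."""
    digits = re.sub(r"\D", "", value)
    # Values such as 111.111.111-11 satisfy the checksum, so they are
    # rejected explicitly.
    if len(digits) != 11 or len(set(digits)) == 1:
        return False
    first = cpf_check_digit(digits[:9])
    second = cpf_check_digit(digits[:9] + str(first))
    return digits[9:] == f"{first}{second}"
```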

Custom Formats

You can define custom CPF formats using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your custom format.

Examples

YAML
ADDITIONAL_REGEX_FORMATTED: ^((?:\d{3}\.\d{3}\.\d{3}-\d{2}))$
ADDITIONAL_REGEX_NUMERIC: ^((?:\d{11}))$

14. Brazil CNPJ Model

The Brazil CNPJ (Cadastro Nacional da Pessoa Jurídica) model identifies Brazilian company identification numbers using pattern matching and validation algorithms.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| DEFAULT_TYPES | Boolean | TRUE | Validates against the standard CNPJ format and checksum algorithm. |
| ADDITIONAL_REGEX_<TYPE> | String | | Matches specific CNPJ formats using a custom regex. |

Tip

This model is disabled by default. To detect CNPJ numbers in your data, you must enable it.

The model identifies CNPJ numbers based on the following criteria:

  • Format: A 14-digit number in XX.XXX.XXX/XXXX-XX, XXXXXXXXXXXXXX, or XX XXX XXX XXXX XX format.
  • Validation: Uses the CNPJ checksum algorithm (mod 11).
  • Exclusions: Automatically excludes known invalid patterns (e.g., 00.000.000/0000-00, 11.111.111/1111-11).
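
The CNPJ checksum uses two weighted mod 11 passes over the numeric form. The Python sketch below shows the standard calculation; `cnpj_valid` and the constant names are hypothetical, and, as a simplification, every non-digit character is stripped before validation.

```python
import re

FIRST_WEIGHTS = [5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]
SECOND_WEIGHTS = [6] + FIRST_WEIGHTS


def cnpj_check_digit(digits: str, weights) -> int:
    """Weighted mod 11 check digit."""
    remainder = sum(int(d) * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder


def cnpj_valid(value: str) -> bool:
    """Validate both check digits and reject repeated-digit values."""
    digits = re.sub(r"\D", "", value)
    # An all-zero value satisfies the checksum, so repeated-digit
    # values are rejected explicitly.
    if len(digits) != 14 or len(set(digits)) == 1:
        return False
    first = cnpj_check_digit(digits[:12], FIRST_WEIGHTS)
    second = cnpj_check_digit(digits[:12] + str(first), SECOND_WEIGHTS)
    return digits[12:] == f"{first}{second}"
```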

Custom Formats

You can define custom CNPJ formats using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your custom format.

Examples

YAML
ADDITIONAL_REGEX_FORMATTED: ^((?:\d{2}\.\d{3}\.\d{3}\/\d{4}-\d{2}))$
ADDITIONAL_REGEX_NUMERIC: ^((?:\d{14}))$

15. Canada SIN Model

The Canada SIN (Social Insurance Number) model detects Canadian Social Insurance Numbers issued to individuals for tax and government benefit purposes. The model leverages pattern recognition and checksum validation logic to accurately identify valid SIN values while reducing false positives.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Canada SIN numbers in your data, enable the following in the portal:

  • Tag: Enable the CANADA_SIN tag (Tags).
  • Model: Enable the Canada SIN model, type CANADA_SIN_ML_MODEL (Models).
  • Dictionary: Enable the CANADA_SIN_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the Canada SIN structured rules (e.g. "Canada SIN Number", "Canada SIN Number Strict") and the unstructured rule (e.g. rule_canada_sin_number) are present and enabled (Rules).

The model identifies Canada SIN numbers based on the following criteria:

  • Format: Exactly nine digits in one of the following formats:

    | Format | Example |
    | --- | --- |
    | Hyphen-separated | 123-456-789 |
    | Space-separated | 123 456 789 |
    | Raw (no separator) | 123456789 |

    Sequences of fewer than nine digits are neither matched nor padded to nine. Word boundaries prevent substrings of longer digit sequences (e.g. 10 or more digits) from being matched.

  • Validation: The full nine-digit value is validated using the Luhn algorithm. The first digit must not be 0 or 8.

  • Exclusions: Automatically excludes known invalid patterns, such as all-zero sequences and repetitive digits (e.g. 111-111-111).
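
The criteria above can be sketched in Python as follows. This is an illustration, not the detector's source: `SIN_PATTERN` and `find_sins` are hypothetical, digit lookarounds stand in for the word boundaries mentioned above, and requiring the same separator in both positions is an assumption.

```python
import re

# Three groups of three digits with a consistent optional separator.
# The lookarounds stop matches inside longer digit runs.
SIN_PATTERN = re.compile(r"(?<!\d)(\d{3})([- ]?)(\d{3})\2(\d{3})(?!\d)")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char) * (2 if index % 2 else 1)
        total += value - 9 if value > 9 else value
    return total % 10 == 0


def find_sins(text: str) -> list:
    """Return the normalized nine-digit values that pass all checks."""
    results = []
    for match in SIN_PATTERN.finditer(text):
        digits = match.group(1) + match.group(3) + match.group(4)
        if digits[0] in "08":          # first digit must not be 0 or 8
            continue
        if len(set(digits)) == 1:      # repetitive digits, e.g. 111-111-111
            continue
        if luhn_valid(digits):
            results.append(digits)
    return results
```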

Custom Formats

You can extend SIN detection to additional formats by defining custom regex patterns using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your format.

Examples

YAML
ADDITIONAL_REGEX_FORMATTED: ^((?:\d{3}-\d{3}-\d{3}))$
ADDITIONAL_REGEX_SPACED: ^((?:\d{3}\s\d{3}\s\d{3}))$
ADDITIONAL_REGEX_NUMERIC: ^((?:\d{9}))$

16. Canada Postal Code Model

The Canada Postal Code model identifies Canadian postal codes using pattern matching and character-set validation. It uses the tag CANADA_POSTAL_CODE.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Canada Postal Code in your data, enable the following in the portal:

  • Tag: Enable the CANADA_POSTAL_CODE tag (Tags).
  • Model: Enable the Canada Postal Code model, type CANADA_POSTAL_CODE_ML_MODEL (Models).
  • Dictionary: Enable the CANADA_POSTAL_CODE_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the Canada Postal Code structured rules (e.g. "Canada Postal Code", "Canada Postal Code Strict") and the unstructured rule (e.g. rule_canada_postal_code) are present and enabled (Rules).

The model identifies Canada Postal Code values based on the following criteria:

  • Format: Postal codes follow the pattern A1A 1A1 — a Forward Sortation Area (FSA) of letter-digit-letter, an optional space, then a Local Delivery Unit (LDU) of digit-letter-digit. Both A1A 1A1 (with space) and A1A1A1 (without space) are accepted.
  • Keyword dictionary: The strict classification rule uses the CANADA_POSTAL_CODE_KEYWORD dictionary — which includes terms such as postal code, postcode, and code postal — alongside the detector to improve match confidence.
  • Validation: Only permitted letters are accepted in each position:

    | Position | Excluded Letters | Allowed Set |
    | --- | --- | --- |
    | 1st (FSA, letter) | D, F, I, O, Q, U, W, Z | A, B, C, E, G, H, J–N, P, R, S, T, V, X, Y |
    | 3rd (FSA, letter) | D, F, I, O, Q, U | A, B, C, E, G, H, J–N, P, R, S, T, V–Z |
    | 5th (LDU, letter; 6th if the space is counted) | D, F, I, O, Q, U | A, B, C, E, G, H, J–N, P, R, S, T, V–Z |
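
The letter restrictions above translate directly into character classes. The following Python sketch shows one way to express them; `POSTAL_CODE` and `is_postal_code` are hypothetical names, and accepting only uppercase input is an assumption.

```python
import re

FIRST = "ABCEGHJKLMNPRSTVXY"      # excludes D, F, I, O, Q, U, W, Z
OTHER = "ABCEGHJKLMNPRSTVWXYZ"    # excludes D, F, I, O, Q, U

# FSA (letter-digit-letter), optional space, LDU (digit-letter-digit).
POSTAL_CODE = re.compile(rf"[{FIRST}]\d[{OTHER}] ?\d[{OTHER}]\d")


def is_postal_code(value: str) -> bool:
    """Return True if the whole value is a well-formed postal code."""
    return POSTAL_CODE.fullmatch(value.strip()) is not None
```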

Examples

  • Column or field names: Postal Code, Postcode, Code postal, Zip, Mailing Code
  • Sample values: K1A 0B1, M5H 2N2, T2E 7W2 (with space); K1A0B1, M5H2N2 (no space)

17. Canada Driver License Model

  • Detects potential Canadian driver license numbers across all 13 provinces and territories.
  • Utilizes province-specific regex patterns for structural validation.
  • Applies robust heuristics to minimize false positives:
    • Length must be between 5–17 characters.
    • Rejects sequential or repetitive patterns.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Canada Driver License in your data, enable the following in the portal:

  • Tag: Enable the CANADA_DRIVER_LICENSE tag (Tags).
  • Model: Enable the Canada Driver License model, type CA_DL_ML_MODEL (Models).
  • Dictionary: Enable the CA_DL_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the Canada Driver License structured rules (e.g. "Canada Driver License", "Canada Driver License Strict") and the unstructured rule (e.g. rule_canada_driver_license) are present and enabled (Rules).

The model identifies Canada Driver License values based on the following criteria:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| EXCLUDED_PROVINCES | String | | Comma-separated list of provinces/territories to exclude from detection. By default all are included; use this to exclude only the ones you do not need. Values are case-insensitive (normalized to uppercase for matching). |
| MINIMUM_VALID_LENGTH | Integer | 5 | Minimum length for valid license numbers. |
| MAXIMUM_VALID_LENGTH | Integer | 17 | Maximum length for valid license numbers. |

Valid values for EXCLUDED_PROVINCES (use exact names; case-insensitive):

ALBERTA, BRITISH_COLUMBIA, MANITOBA, MANITOBA_NO_HYPHEN, NEW_BRUNSWICK, NEWFOUNDLAND_AND_LABRADOR, NOVA_SCOTIA, ONTARIO, PRINCE_EDWARD_ISLAND, QUEBEC_HYPHEN, QUEBEC_SPACE, QUEBEC_NO_SEP, SASKATCHEWAN, NORTHWEST_TERRITORIES, NUNAVUT, YUKON


18. Canada Health Card Model

The Canada Health Card model detects provincial health card numbers for supported provinces and territories. Formats vary by province; the detector uses province-specific patterns and validation. Tag: CANADA_HEALTH_CARD.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Canada health card values in your data, enable the following in the portal:

  • Tag: Enable the CANADA_HEALTH_CARD tag (Tags).
  • Model: Enable the Canada Health Card model, type CANADA_HEALTH_CARD_ML_MODEL (Models).
  • Dictionary: Enable the CANADA_HEALTH_CARD_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the structured rules (e.g. "Canada Health Card", "Canada Health Card Strict") and the unstructured rule rule_canada_health_card are present and enabled (Rules).

The model identifies values based on the following supported formats (after normalization):

| Province / territory | Pattern summary | Example shape |
| --- | --- | --- |
| Ontario | Four letters + ten digits (optional space) | e.g. ABCD1234567890 |
| Quebec | Twelve alphanumeric characters | 12-char alphanumeric |
| British Columbia | Ten digits | 1234567890 |
| Alberta | Nine digits | 123456789 |
  • Exclusions: Use model property EXCLUDED_PROVINCES (comma-separated, case-insensitive) to skip provinces you do not need. Valid values match the internal enum names, e.g. ONTARIO, QUEBEC, BRITISH_COLUMBIA, ALBERTA.
  • Strict rule: Combines the detector with the CANADA_HEALTH_CARD_KEYWORD dictionary (terms such as health card, OHIP, assurance maladie) to reduce false positives.
  • Custom formats: You can add patterns via ADDITIONAL_REGEX_<TYPE> properties on the model, consistent with other Canada models.

19. Canada Business Number (BN) Model

The Canada Business Number (BN) model detects the nine-digit Business Number issued by the Canada Revenue Agency. The nine digits are validated with the Luhn algorithm. This model is for BN only (not GST/HST program accounts). Tag: CANADA_BN.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Canada BN in your data, enable the following in the portal:

  • Tag: Enable the CANADA_BN tag (Tags).
  • Model: Enable the Canada BN Number model, type CANADA_BN_ML_MODEL (Models).
  • Dictionary: Enable the CANADA_BN_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the structured rules (e.g. "Canada BN Number", "Canada BN Number Strict") and the unstructured rule rule_canada_bn_number are present and enabled (Rules).

The model identifies Canada BN values based on the following criteria:

  • Format: Nine digits in one of these shapes (word boundaries avoid matching inside longer digit strings):

    | Format | Example |
    | --- | --- |
    | Hyphen-separated | 123-456-789 |
    | Space-separated | 123 456 789 |
    | Raw | 123456789 |
  • Validation: Luhn check on the nine-digit value.

  • Keyword dictionary: The strict rule uses CANADA_BN_KEYWORD (e.g. BN, Business Number, numéro d'entreprise) alongside the detector.
  • Custom formats: Extend detection with ADDITIONAL_REGEX_<TYPE> model properties.

20. Canada GST HST Model

The Canada GST HST model detects GST/HST program account numbers: a valid nine-digit BN (Luhn-valid) immediately followed by RT and four digits (e.g. 123456789RT0001). Plain nine-digit BN strings are not tagged by this model—use the Canada Business Number (BN) model for those. Tag: CANADA_GST_HST.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Canada GST/HST program identifiers in your data, enable the following in the portal:

  • Tag: Enable the CANADA_GST_HST tag (Tags).
  • Model: Enable the Canada GST HST model (display name may show as "Canada GST HST"), type CANADA_GST_HST_ML_MODEL (Models).
  • Dictionary: Enable the CANADA_GST_HST_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the structured rules (e.g. "Canada GST HST", "Canada GST HST Strict") and the unstructured rule rule_canada_gst_hst are present and enabled (Rules).

The model identifies values based on the following criteria:

  • Format: Nine-digit BN (with optional hyphens or spaces between digit groups) + RT (case-insensitive) + four digits, e.g. 123-456-789RT0001, 123456789RT0001.
  • Validation: The nine-digit BN portion must pass Luhn validation; the suffix must be RT plus exactly four digits.
  • Keyword dictionary: The strict rule uses CANADA_GST_HST_KEYWORD (e.g. GST, HST, RT0001, tax number, TPS, TVH) alongside the detector.

21. Australia Medicare Number Model

The Australia Medicare Number model detects Australian Medicare card numbers using pattern matching and checksum validation. Tag: AU_MEDICARE.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Australian Medicare numbers in your data, enable the following in the portal:

  • Tag: Enable the AU_MEDICARE tag (Tags).
  • Model: Enable the Australia Medicare Number model, type AU_MEDICARE_ML_MODEL (Models).
  • Dictionary: Enable the AU_MEDICARE_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the Australia Medicare structured rules (e.g. "Australia Medicare Number", "Australia Medicare Number Strict") and the unstructured rule rule_au_medicare are present and enabled (Rules).

The model identifies Australian Medicare card numbers based on the following criteria:

  • Format: A 10-digit base number (with an optional 11th Individual Reference Number digit), accepted in the following display formats:

    | Format | Example |
    | --- | --- |
    | Official 4-5-1 spaced | 2123 45670 1 |
    | 4-5-1 hyphenated | 2123-45670-1 |
    | 4-4-1-1 spaced | 2123 4567 0 1 |
    | 4-4-1-1 hyphenated | 2123-4567-0-1 |
    | With IRN via slash | 2123 45670 1/2 |
    | With IRN via space | 2123 45670 1 2 |
    | With IRN via hyphen | 2123 45670 1-2 |
    | 10 consecutive digits | 2123456701 |
    | 11 consecutive digits | 21234567012 |
    | 10 digits + hyphen + IRN | 2123456701-2 |
  • Validation: The model applies the following checks:

    • First digit must be 2–6.
    • Issue number (digit 10) must be 1–9; 0 is never issued.
    • Optional IRN (11th digit) must be 1–9 if present.
    • Check digit (digit 9) is validated using a weighted sum algorithm with weights [1, 3, 7, 9, 1, 3, 7, 9] applied to digits 1–8; the result mod 10 must equal digit 9.
  • Exclusions: Automatically rejects known false positive patterns:

    • All-same digit strings (e.g. 2222222222).
    • Step-1 ascending or descending sequential strings (e.g. 2345678901).
  • Keyword dictionary: The strict rule combines the detector with the AU_MEDICARE_KEYWORD dictionary (terms such as medicare, medicare_number, medicare_card, health_card_number, health_insurance_number, health_id, pbs_number) to reduce false positives in structured data.
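
The validation checks listed above can be sketched in Python as shown here. `medicare_valid` is a hypothetical name, separators are removed by stripping every non-digit character, and the exclusion filters for all-same and sequential strings are left out for brevity.

```python
import re

WEIGHTS = [1, 3, 7, 9, 1, 3, 7, 9]   # applied to digits 1-8


def medicare_valid(value: str) -> bool:
    """Check digit ranges and the weighted mod 10 check digit."""
    digits = re.sub(r"\D", "", value)             # drop spaces, hyphens, slash
    if len(digits) not in (10, 11):
        return False
    if digits[0] not in "23456":                  # first digit must be 2-6
        return False
    if digits[9] == "0":                          # issue number must be 1-9
        return False
    if len(digits) == 11 and digits[10] == "0":   # optional IRN must be 1-9
        return False
    checksum = sum(int(d) * w for d, w in zip(digits[:8], WEIGHTS)) % 10
    return checksum == int(digits[8])             # digit 9 is the check digit
```

For the sample value 2123 45670 1, the weighted sum of the first eight digits is 170, so the expected check digit is 0, which matches digit 9.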

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| DEFAULT_TYPES | Boolean | TRUE | Enables the built-in Medicare pattern and checksum validation. |
| ADDITIONAL_REGEX_<TYPE> | String | | Custom regex to match additional Medicare number formats. |

Custom Formats

You can extend Medicare number detection to additional formats by defining custom regex patterns using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your format.

Examples

YAML
ADDITIONAL_REGEX_SPACED:  ^\d{4}\s\d{5}\s\d$
ADDITIONAL_REGEX_NUMERIC: ^\d{10,11}$

22. Australian Passport Model

The Australian Passport model detects Australian passport document numbers using strict format validation. The printed Australian passport number is eight characters: one uppercase letter followed by seven digits (for example, N1234567). The document number does not carry a check digit, so the detector relies on format checks and false-positive filters rather than a checksum. Tag: AU_PASSPORT.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Australian passport numbers in your data, enable the following in the portal:

  • Tag: Enable the AU_PASSPORT tag (Tags).
  • Model: Enable the Australian Passport model, type AUSTRALIA_PASSPORT_ML_MODEL (Models).
  • Dictionary: Enable the AU_PASSPORT_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the Australian Passport structured rules (e.g. "Australian Passport", "Australian Passport Strict") and the unstructured rule rule_au_passport are present and enabled (Rules).

The model identifies Australian passport values based on the following criteria:

  • Format: Exactly eight characters — one uppercase letter followed by seven digits. The detector requires a boundary (whitespace, start/end of value, or non-./- punctuation) on either side of the match.

    | Format | Example |
    | --- | --- |
    | Letter + 7 digits | N1234567 |
  • Validation: The printed AU passport number does not include a check digit, so no checksum is applied. The detector instead rejects obvious filler and test patterns:

    • all-same digit sequences (e.g. A1111111)
    • fully ascending or descending digit sequences (e.g. A1234567, A7654321)
    • alternating two-digit patterns (e.g. A1010101)
    • step-2 ascending or descending digit sequences (e.g. A2468024)
  • Keyword dictionary: The strict rule combines the detector with the AU_PASSPORT_KEYWORD dictionary (terms such as Australian Passport, AU Passport, au_passport_no) to reduce false positives in structured data.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| DEFAULT_TYPES | Boolean | TRUE | Validate against the standard AU passport format (letter + 7 digits). |
| ADDITIONAL_REGEX_<TYPE> | String | | Custom regex for matching alternate AU passport formats; see below. |
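
The filler-pattern filters described in the validation criteria above might look like the following Python sketch. All names are hypothetical. Treating sequences as wrapping around modulo 10 is an assumption inferred from the A2468024 example.

```python
import re

AU_PASSPORT = re.compile(r"[A-Z]\d{7}")


def is_stepped_sequence(digits: str, step: int) -> bool:
    """True if each digit equals the previous one plus step, modulo 10."""
    values = [int(d) for d in digits]
    return all((b - a) % 10 == step % 10 for a, b in zip(values, values[1:]))


def looks_like_filler(digits: str) -> bool:
    """Detect all-same, alternating, and step-1 or step-2 sequences."""
    # Covers both all-same (1111111) and alternating (1010101) patterns.
    if all(d == digits[i % 2] for i, d in enumerate(digits)):
        return True
    return any(is_stepped_sequence(digits, s) for s in (1, -1, 2, -2))


def plausible_au_passport(value: str) -> bool:
    """Format check followed by the filler-pattern filters."""
    value = value.strip()
    if AU_PASSPORT.fullmatch(value) is None:
        return False
    return not looks_like_filler(value[1:])
```

Note that a placeholder such as N1234567 matches the format but is rejected by the ascending-sequence filter.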

Custom Formats

You can extend AU passport detection to additional formats by defining custom regex patterns using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your format.

Examples

YAML
ADDITIONAL_REGEX_PREFIXED: ^((?:[A-Z]{2}\d{7}))$
ADDITIONAL_REGEX_SPACED:   ^((?:[A-Z]\s\d{7}))$

23. New Zealand Passport Model

The New Zealand Passport model detects New Zealand passport document numbers using strict format validation. The printed NZ passport number is eight characters: two uppercase letters followed by six digits (for example, AB123456). The ICAO check digit exists only in the machine-readable zone (MRZ) and is not part of the printed document number, so the detector relies on format checks and false-positive filters rather than a checksum. Tag: NZ_PASSPORT.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect New Zealand passport numbers in your data, enable the following in the portal:

  • Tag: Enable the NZ_PASSPORT tag (Tags).
  • Model: Enable the New Zealand Passport model, type NEW_ZEALAND_PASSPORT_ML_MODEL (Models).
  • Dictionary: Enable the NZ_PASSPORT_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the New Zealand Passport structured rules (e.g. "New Zealand Passport", "New Zealand Passport Strict") and the unstructured rule rule_nz_passport are present and enabled (Rules).

The model identifies New Zealand passport values based on the following criteria:

  • Format: Exactly eight characters — two uppercase letters followed by six digits. The detector requires a boundary (whitespace, start/end of value, or non-./- punctuation) on either side of the match.

    | Format | Example |
    | --- | --- |
    | 2 letters + 6 digits | AB123456 |
  • Validation: The printed NZ passport number does not include a check digit (the ICAO check digit applies only to the MRZ). No checksum is applied. The detector instead rejects obvious filler and test patterns:

    • all-same digit sequences (e.g. AB111111)
    • fully ascending or descending digit sequences (e.g. AB123456, AB654321)
    • alternating two-digit patterns (e.g. AB101010)
    • repeating half-length blocks (e.g. AB123123)
    • step-2 ascending or descending digit sequences (e.g. AB246802)
  • Keyword dictionary: The strict rule combines the detector with the NZ_PASSPORT_KEYWORD dictionary (terms such as New Zealand Passport, NZ Passport, nz_passport_no) to reduce false positives in structured data.
  • Additional-regex acceptance: If a candidate value fails the default 2-letter + 6-digit check but matches at least one configured ADDITIONAL_REGEX_<TYPE> pattern, the annotation is still accepted. This lets you support alternate NZ formats without overriding the default validator.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| DEFAULT_TYPES | Boolean | TRUE | Validate against the standard NZ passport format (2 letters + 6 digits). |
| ADDITIONAL_REGEX_<TYPE> | String | | Custom regex for alternate NZ passport formats; accepted via fallback path. |
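
The fallback path can be pictured with a short Python sketch. It models only the format decision: the filler-pattern filters are omitted, and `ADDITIONAL_REGEXES` and `accept_nz_passport` are hypothetical stand-ins for the configured ADDITIONAL_REGEX_<TYPE> properties and the detector's acceptance step.

```python
import re

DEFAULT_NZ = re.compile(r"[A-Z]{2}\d{6}")

# Hypothetical configuration with one additional pattern.
ADDITIONAL_REGEXES = {
    "LEGACY": re.compile(r"^((?:[A-Z]\d{7}))$"),
}


def accept_nz_passport(candidate: str, default_types: bool = True) -> bool:
    """Accept on the default format first, then on any additional regex."""
    if default_types and DEFAULT_NZ.fullmatch(candidate):
        return True
    # Fallback: any configured additional pattern is enough.
    return any(p.match(candidate) for p in ADDITIONAL_REGEXES.values())
```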

Custom Formats

You can extend NZ passport detection to additional formats by defining custom regex patterns using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your format.

Examples

YAML
ADDITIONAL_REGEX_LEGACY:   ^((?:[A-Z]\d{7}))$
ADDITIONAL_REGEX_EXTENDED: ^((?:[A-Z]{2}\d{7}))$

24. Australian Driver Licence Model

The Australian Driver Licence model detects Australian state and territory driver's licence numbers across all eight issuing jurisdictions. Because each state uses a different format — ranging from purely numeric strings to alphanumeric combinations — the detector applies per-state regex patterns followed by false-positive filters rather than a single checksum. Tag: AU_DRIVER_LICENSE.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Australian driver licence numbers in your data, enable the following in the portal:

  • Tag: Enable the AU_DRIVER_LICENSE tag (Tags).
  • Model: Enable the Australian Driver Licence model, type AUSTRALIA_DRIVER_LICENSE_ML_MODEL (Models).
  • Dictionary: Enable the AU_DRIVER_LICENSE_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the Australian Driver Licence structured rules (e.g. "Australian Driver Licence", "Australian Driver Licence Strict") and the unstructured rule rule_au_driver_license are present and enabled (Rules).

The model identifies Australian driver licence values based on the following criteria:

  • Format: Varies by state or territory. After stripping non-alphanumeric boundary characters, the cleaned value must be between 6 and 10 characters long. The table below summarises each jurisdiction's expected format:

    | State / Territory | Format summary | Example shape |
    | --- | --- | --- |
    | New South Wales (NSW) | 8 digits | 50291837 |
    | Victoria (VIC) | 8–10 digits (typically 9) | 502918374 |
    | Queensland (QLD) | 8 digits, leading zero | 05029183 |
    | South Australia (SA) | 1 uppercase letter + 6 digits, or 7 digits | A503917 / 5039172 |
    | Western Australia (WA) | 7 digits | 5039172 |
    | Tasmania (TAS) | 6–8 digits (most commonly 7) | 503917 |
    | Australian Capital Territory (ACT) | 8–9 digits | 502918374 |
    | Northern Territory (NT) | 6–7 digits | 503917 |
  • Validation: No check digit is defined for Australian driver licences. The detector instead rejects obvious filler and test patterns:

    • all-same digit sequences (e.g. 11111111)
    • fully ascending or descending digit sequences (e.g. 12345678, 87654321)
    • repeating block patterns (e.g. 12341234)
    • alternating two-digit patterns (e.g. 12121212)
    • inputs of 8 or more digits with fewer than 4 distinct digit characters
  • Keyword dictionary: The strict classification rule combines the detector with the AU_DRIVER_LICENSE_KEYWORD dictionary — which includes terms such as Australian Driver Licence, AU DL, au_driver_licence — to reduce false positives in structured data.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| STATES | String | | Comma-separated whitelist of states/territories to detect. When set, only the listed states are active and all others are disabled. Values are case-insensitive (normalised to uppercase). Intended as an alternative to EXCLUDED_STATES; if both are provided, STATES is applied first (retain), then EXCLUDED_STATES (remove). |
| EXCLUDED_STATES | String | | Comma-separated blacklist of states/territories to exclude from detection. By default all are included. Values are case-insensitive (normalised to uppercase). |
| MINIMUM_VALID_LENGTH | Integer | 6 | Minimum length (alphanumeric characters only) for a valid licence number. |
| MAXIMUM_VALID_LENGTH | Integer | 10 | Maximum length (alphanumeric characters only) for a valid licence number. |

Valid values for STATES / EXCLUDED_STATES (use these abbreviations; matching is case-insensitive):

NSW, VIC, QLD, SA, WA, TAS, ACT, NT
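
The interaction between STATES and EXCLUDED_STATES described in the parameter table can be illustrated with a small sketch. The function resolve_active_states and its helper are hypothetical names, and silently ignoring unrecognised values is an assumption; the product may handle them differently.

```python
ALL_STATES = ("NSW", "VIC", "QLD", "SA", "WA", "TAS", "ACT", "NT")


def _parse(csv_value):
    """Split a comma-separated value and normalise each entry to uppercase."""
    return {item.strip().upper() for item in (csv_value or "").split(",") if item.strip()}


def resolve_active_states(states=None, excluded_states=None):
    """Apply STATES first (retain), then EXCLUDED_STATES (remove)."""
    active = list(ALL_STATES)
    include = _parse(states)
    if include:
        # Whitelist: keep only the listed states, preserving the default order.
        active = [s for s in active if s in include]
    exclude = _parse(excluded_states)
    # Blacklist: remove anything listed in EXCLUDED_STATES.
    return [s for s in active if s not in exclude]


print(resolve_active_states("nsw, vic, qld", "VIC"))
```

In this sketch, setting STATES to "nsw, vic, qld" and EXCLUDED_STATES to "VIC" leaves NSW and QLD active.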


25. New Zealand Driver Licence Model

The New Zealand Driver Licence model detects New Zealand driver's licence numbers using strict format validation. The printed NZ driver's licence number is eight characters: two uppercase letters followed by six digits (for example, AB123456). No check digit is defined for this format, so the detector relies on format checks and false-positive filters. Tag: NZ_DRIVER_LICENSE.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect New Zealand driver licence numbers in your data, enable the following in the portal:

  • Tag: Enable the NZ_DRIVER_LICENSE tag (Tags).
  • Model: Enable the New Zealand Driver Licence model, type NEW_ZEALAND_DRIVER_LICENSE_ML_MODEL (Models).
  • Dictionary: Enable the NZ_DRIVER_LICENSE_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the New Zealand Driver Licence structured rules (e.g. "New Zealand Driver Licence", "New Zealand Driver Licence Strict") and the unstructured rule rule_nz_driver_license are present and enabled (Rules).

The model identifies New Zealand driver licence values based on the following criteria:

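  • Format: Exactly eight characters: two uppercase letters followed by six digits. The detector requires a boundary on either side of the match: whitespace, the start or end of the value, or punctuation other than ., /, or -. The example below shows the shape only; because its digits form a fully ascending sequence, that particular value is rejected by the validation filters.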

    Format | Example
    2 letters + 6 digits | AB123456
  • Validation: No check digit is defined for the NZ driver's licence. The detector instead rejects obvious filler and test patterns:

    • all-same digit sequences (e.g. AB111111)
    • fully ascending or descending digit sequences (e.g. AB123456, AB654321)
    • alternating two-digit patterns (e.g. AB101010)
    • repeating 3-digit blocks (e.g. AB123123)
    • step-2 ascending or descending digit sequences (e.g. AB246802)
  • Keyword dictionary: The strict classification rule combines the detector with the NZ_DRIVER_LICENSE_KEYWORD dictionary — which includes terms such as New Zealand Driver Licence, NZ DL, nz_driver_licence — to reduce false positives in structured data.
  • Additional-regex acceptance: If a candidate value fails the default 2-letter + 6-digit check but matches at least one configured ADDITIONAL_REGEX_<TYPE> pattern, the annotation is still accepted. This lets you support alternate NZ formats without overriding the default validator.
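
As a rough illustration of the format and validation criteria above, the following sketch combines a boundary-aware pattern with the filler filters. It is not the product's implementation. The names NZ_DEFAULT, is_filler, and find_nz_licences are hypothetical, the lookarounds reflect one reading of the boundary rule, and treating step-2 sequences as wrapping past 9 is an assumption inferred from the AB246802 example.

```python
import re

# Two uppercase letters + six digits, not adjacent to an alphanumeric
# character or to '.', '/', '-' (one interpretation of the boundary rule).
NZ_DEFAULT = re.compile(r"(?<![A-Za-z0-9./-])([A-Z]{2}\d{6})(?![A-Za-z0-9./-])")


def is_filler(digits: str) -> bool:
    """Return True when the six-digit part looks like filler or test data."""
    # All-same digits, e.g. 111111
    if len(set(digits)) == 1:
        return True
    # Fully ascending or descending, e.g. 123456 / 654321
    plain = {int(b) - int(a) for a, b in zip(digits, digits[1:])}
    if plain in ({1}, {-1}):
        return True
    # Alternating two-digit pattern, e.g. 101010
    if digits == digits[:2] * 3:
        return True
    # Repeating 3-digit block, e.g. 123123
    if digits == digits[:3] * 2:
        return True
    # Step-2 ascending or descending, wrapping modulo 10, e.g. 246802
    wrapped = {(int(b) - int(a)) % 10 for a, b in zip(digits, digits[1:])}
    return wrapped in ({2}, {8})


def find_nz_licences(text: str) -> list:
    """Return candidate licence numbers that pass the format and filler checks."""
    return [m.group(1) for m in NZ_DEFAULT.finditer(text)
            if not is_filler(m.group(1)[2:])]
```

Under these assumptions, a value such as DK503917 surrounded by whitespace is returned, while every example value in the rejection list is filtered out.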
Parameter | Type | Default | Description
DEFAULT_TYPES | Boolean | TRUE | Validate against the standard NZ driver licence format (2 letters + 6 digits).
ADDITIONAL_REGEX_<TYPE> | String | | Custom regex for an alternate NZ driver licence format. A value that matches is accepted through the fallback path even if it fails the default format check.

Custom Formats

You can extend NZ driver licence detection to additional formats by defining custom regex patterns using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your format.

Examples

YAML
ADDITIONAL_REGEX_LEGACY:   ^((?:[A-Z]\d{6}))$
ADDITIONAL_REGEX_EXTENDED: ^((?:[A-Z]{3}\d{6}))$
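
The fallback behaviour of these patterns can be pictured with a short sketch. The accept function and the dictionary-based configuration are hypothetical stand-ins for however the model reads its parameters, and skipping the default check when DEFAULT_TYPES is false is an assumption. The sketch covers only the format decision, not the filler filters.

```python
import re

# Default NZ format: two uppercase letters followed by six digits.
DEFAULT_NZ = re.compile(r"^[A-Z]{2}\d{6}$")


def accept(value: str, config: dict) -> bool:
    """Accept value if it passes the default check or any additional regex."""
    if config.get("DEFAULT_TYPES", True) and DEFAULT_NZ.search(value):
        return True
    # Fallback path: any configured ADDITIONAL_REGEX_<TYPE> pattern may match.
    for key, pattern in config.items():
        if key.startswith("ADDITIONAL_REGEX_") and re.search(pattern, value):
            return True
    return False


config = {
    "ADDITIONAL_REGEX_LEGACY": r"^((?:[A-Z]\d{6}))$",
    "ADDITIONAL_REGEX_EXTENDED": r"^((?:[A-Z]{3}\d{6}))$",
}
print(accept("K503917", config))    # matches the LEGACY pattern
print(accept("DKX503917", config))  # matches the EXTENDED pattern
```

Without the two additional patterns, both of these values would fail because neither has the default 2-letter + 6-digit shape.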

Conclusion

Privacera Discovery's model-based classification supports numerous sensitive data formats through specialized validation rules and pattern matching. These models help achieve consistent and automated tagging of data across domains such as finance, healthcare, HR, and more.