Skip to content

Heuristic Models in Privacera Discovery

Privacera supports a broad range of heuristic models designed to identify and classify sensitive data using logic, pattern recognition, and validation mechanisms. These models help automate sensitive data tagging and improve classification accuracy.


1. Generic Models

These are baseline models with configurable parameters to guide pattern matching and logic-based detection.

Parameter Type Default Description
INCLUDE_PATTERN_<#> String None Patterns to be matched.
EXCLUDE_PATTERN_<#> String None Patterns to be excluded from matching.
ONLY_DIGITS Boolean FALSE Removes all non-numeric characters in the string before matching.
CHECK_DIGIT_CODE_VALIDATE String None Indicates whether to evaluate a checksum digit based on the last digit.
DO_LOOKUP Boolean FALSE Enables pattern lookup using LOOKUP_PATTERN.
LOOKUP_DICT String None A dictionary name or key.
LOOKUP_PATTERN String None Pattern used for matching.
ISO3166_CC_VALIDATE_FLAG Boolean FALSE Enables ISO country code validation using ISO3166_CC_PATTERN.
ISO3166_CC_PATTERN String None Pattern for matching ISO country codes.
ISO3166_CC_LOOKUP_KEY String None Dictionary name for country code lookup.

2. Credit Card Model

Parameter Type Default Description
CC_PATTERN String Override pattern for credit card numbers.
DEFAULT_TYPES Boolean TRUE Validate against known issuing network prefixes.
LUHN_CHECK Boolean TRUE Validate the Luhn checksum on the credit card number.
DEFAULT_VALIDATORS_TO_EXCLUDE String Comma-separated list of default validators to exclude from validation. Example: RUPAY,MIR,TROY etc.

Supported Default Credit Card Types: AMEX, MASTERCARD, VISA, DINERS, DISCOVER, JCBCARD, MAESTRO, CHINA_UNIONPAY, INSTAPAYMENT, MIR, RUPAY, TROY, VERVE, HIPERCARD, AURA, CARNET, BCGLOBAL

Note

You can exclude the default credit card types from validation by setting the DEFAULT_VALIDATORS_TO_EXCLUDE parameter.

Supported Default Credit Card Types

Supported Default Credit Card Types

Card Type Description
AMEX Starts with 34 or 37, 15 digits
MASTERCARD Starts with 2221–2720 or 51–55, 16 digits
VISA Starts with 4, 16-19 digits
DINERS Starts with 300–305, 3095, 36, 38, 39; 14 digits
DISCOVER Starts with 6011, 622–628, 644–649, or 65; 16–17 digits
JCBCARD Starts with 2131, 1800 (15 digits) or 35 (15 digits)
MAESTRO Starts with 5018, 5020, 5038, 6304, 6759, 6761, 6763; 12–19 digits
CHINA_UNIONPAY Starts with 62, 16–19 digits
INSTAPAYMENT Starts with 637–639, 16 digits
MIR Starts with 2200–2204, 16–19 digits
RUPAY Starts with 60, 65, 81, 82; 16 digits
TROY Starts with 9792, 16 digits
VERVE BIN ranges: 506099–506199, 507865–507896, 650002–650027; 16–19 digits
HIPERCARD BINs like 384100,384140,384160,637568,637599,637609,637612; 16–19 digits
AURA Starts with 507860; 16–19 digits
CARNET BINs like 286900, 506203,506222,506237,506262,506276,506281,506301 etc.; 16–19 digits
BCGLOBAL Specific 6541/6556/700013 ranges; 16 digits

To support additional card types, use:

Regex property names

ADDITIONAL_REGEX_<CARD_TYPE>: Custom regex for matching a specific credit card type.

Examples of Additional Regexes:

YAML
ADDITIONAL_REGEX_JCB: ^((?:2131|1800|35\d{3})\d{11})$
ADDITIONAL_REGEX_MAESTRO: ^((?:5018|5020|5038|6304|6759|6761|6763)\d{8,15})$


3. Date of Birth (DOB) Model

Parameter Type Default Description
MIN_AGE_YEARS Integer 5 Age lower threshold.
MAX_AGE_YEARS Integer 100 Age upper threshold.
DATE_REGEX_var1 String Custom regex for matching date format.
DATE_FORMAT_var1 String Date format corresponding to the regex.

4. EIN Model

Parameter Type Default Description
EIN_PATTERN String Override default EIN pattern.
VALIDATIONS Boolean TRUE Enable format validations.
STRICT_PATTERN Boolean TRUE Match only if EIN has exact format.

5. Geo Latitude/Longitude Model

Parameter Type Default Description
MIN_LAT Double Lower bound on latitude.
MAX_LAT Double Upper bound on latitude.
MIN_LONG Double Lower bound on longitude.
MAX_LONG Double Upper bound on longitude.
MIN_FRACTIONAL_DIGITS Integer 3 Minimum decimal precision.

6. ITIN Model

Parameter Type Default Description
ITIN_PATTERN String Override default ITIN pattern.
STRICT_PATTERN Boolean TRUE Match only if ITIN has exact format.

7. MIME Model

Parameter Type Default Description
LOOKUP_DICT String Dictionary name of MIME types.

Examples: EXEC_MIME_KEYWORD, IMAGE_MIME_KEYWORD


8. Phone Number Model

Parameter Type Default Description
COUNTRY_CODE String US 2-letter ISO country code.

9. SSN Model

This model identifies possible SSNs (Social Security Numbers) in data by applying pattern matching, format checks, and optionally statistical/randomness-based validations.

There are two main phases:

A. Pattern Matching Phase

  • Uses regular expressions (regex) to identify common SSN formats (e.g., 123-45-6789, 123456789, or 6789).
  • Configurable using pattern-related flags like STRICT_PATTERN, USE_9_DIGIT_PATTERN, etc.

B. Statistical Validation Phase

  • Applies advanced validations like entropy, digit distribution (variance), and sequentiality.
  • These checks assess how "random" or "natural" the SSN-like string appears.
  • Controlled by flags like USE_SEQUENCE_CHECK, USE_VARIANCE_CHECK, etc.
  • All these are collectively controlled by USE_ADDITIONAL_VALIDATIONS.

Configuration Groups

A. Pattern Configuration

  • Customize what qualifies as an SSN match:
Parameter Type Default Description
SSN_PATTERN String Override default SSN pattern.
STRICT_PATTERN Boolean FALSE Match only if SSN has exact format.
USE_9_DIGIT_PATTERN Boolean FALSE Allow matching of any 9-digit string.
USE_4_DIGIT_PATTERN Boolean FALSE Allow matching of 4-digit string.
STRICT_EXT_PATTERN Boolean TRUE Requires hyphen/dot/space-separated format.

B. Basic Validation Settings

  • Check for basic data quality and known issues:
  • Examples : "111111111", "987654321", "000000000", "1234567890", "12345678901"
Parameter Type Default Description
VALIDATIONS Boolean TRUE Enable blacklist checks.
UNIQUE_DIGIT_THRESHOLD Integer 3 Minimum number of unique digits required.
MAX_REPETITION_ALLOWED Integer 5 Maximum allowed repeated digits.

C. Entropy Checks

  • Entropy measures how random/unpredictable a number string is.
Parameter Type Default Description
USE_ENTROPY_CHECK Boolean TRUE Enable entropy-based checks to validate randomness of the match.
MAX_ENTROPY_SCORE_DEDUCTION Boolean 40 Maximum penalty based on low entropy in matched SSN patterns.
MIN_ENTROPY Double 1.0 Minimum acceptable entropy value for matched results.
MAX_ENTROPY Double 3.0 Maximum acceptable entropy value for matched results.

D. Variance Checks

  • Variance looks at how digits are distributed — helps filter non-random digit repetition.
Parameter Type Default Description
USE_VARIANCE_CHECK Boolean TRUE Enable variance-based checks on digit distribution.
MAX_VARIANCE_SCORE_DEDUCTION Integer 40 Maximum penalty based on variance of digit distribution.
MIN_VARIANCE Double 1e12 Minimum acceptable variance for randomness checks.
MAX_VARIANCE Double 3e15 Maximum acceptable variance for randomness checks.

E. Sequence Checks

  • Looks for sequential patterns
Parameter Type Default Description
USE_SEQUENCE_CHECK Boolean TRUE Enable sequence pattern checks.
MAX_SEQUENCE_SCORE_DEDUCTION Integer 40 Maximum penalty applied if sequential patterns are detected.

F. Sampling Control

  • These determine when randomness-based logic should apply.
Parameter Type Default Description
MINIMUM_DATA_SIZE Integer 20 Minimum number of samples required for randomness validation to apply.

Examples of Invalid SSNs: - Starts with 9, 666, 000, or 98765432 - Middle digits = 00 - Last digits = 0000 - Dummy values: 123456789, 111111111, etc.


10. VIN (Vehicle Identification Number) Model

The VIN model detects ISO 3779-compliant Vehicle Identification Numbers — the unique 17-character identifier stamped on every road vehicle since 1981. Tag: VIN.

Tip

The tag and model are enabled by default. The VIN_KEYWORD keyword dictionary used by the strict structured rule is enabled by default as well; verify in the portal under Tags, Models, and Dictionaries if you have customized your installation.

The model identifies VIN values based on the following criteria:

  • Format: Exactly 17 characters in three segments. Letters I, O, and Q are excluded throughout to avoid confusion with the digits 1 and 0.

    Position(s) Field Allowed characters
    1–8 World Manufacturer Identifier (WMI) + vehicle attributes [A-HJ-NPR-Z0-9]
    9 Check digit 0–9 or X
    10–11 Model year (10) + plant code (11) [A-HJ-NPR-Z0-9]
    12–17 Production / serial number 0–9 only

    Example: 1HGCM82633A123456.

  • Validation: The detector requires a boundary (whitespace, start/end of value, or non-./- punctuation) on either side of the match. Positions 12–17 are constrained to digits per ISO 3779 §5.3.4 — this filters out alphanumeric junk that would otherwise occasionally clear the mod-11 checksum.

  • Check digit: VIN includes a mod-11 check digit at position 9. Each position is assigned a weight (8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2) and each letter has a transliteration value. The weighted sum modulo 11 must equal the check character (where X represents 10).

  • Keyword dictionary: The strict rule combines the detector with the VIN_KEYWORD dictionary (terms such as VIN, vehicle_id, chassis_no, vehicle_identification_number) to reduce false positives in structured data.

Parameter Type Default Description
VIN_REGEX String Override the default VIN matching pattern.

Custom Formats

The VIN format is defined by ISO 3779 and does not vary by country, so custom patterns are rarely needed. If your data contains non-standard VINs (for example, pre-1981 records with shorter identifiers), supply a replacement pattern via VIN_REGEX.

YAML
VIN_REGEX: (?i)(?:[\p{Punct}&&[^\.-]]|\s|^)(?<VIN>[A-HJ-NPR-Z0-9]{8}[0-9X][A-HJ-NPR-Z0-9]{2}\d{6})(?:[\p{Punct}&&[^\.-]]|\s|$)

11. ZIP Code Model

Parameter Type Default Description
ZIP_DICT_KEY String US_ZIP_LOOKUP Dictionary name of ZIP codes.
ZIP_PATTERN String Pattern for matching ZIP codes.
STRICT_PATTERN Boolean FALSE If TRUE, enforces strict 5+4 ZIP format.

12. Driving License Model

  • Detects potential U.S. driver license numbers across 50 states.
  • Utilizes state-specific regex patterns for structural validation.
  • Applies robust heuristics to minimize false positives:
    • Length must be between 4–17 characters.
    • Rejects low-entropy or repetitive sequences.
    • Enforces character diversity to ensure validity.

13. Brazil CPF Model

The Brazil CPF (Cadastro de Pessoas Físicas) model identifies Brazilian individual taxpayer identification numbers using pattern matching and validation algorithms.

Parameter Type Default Description
DEFAULT_TYPES Boolean TRUE Validates against the standard CPF format and checksum algorithm.
ADDITIONAL_REGEX String Matches specific CPF formats using a custom regex.

Tip

This model is disabled by default. To detect CPF numbers in your data, you must enable it.

The model identifies CPF numbers based on the following criteria:

  • Format: An 11-digit number in XXX.XXX.XXX-XX,XXXXXXXXXXX or XXX XXX XXX XX format.
  • Validation: Uses the CPF checksum algorithm (mod 11).
  • Exclusions: Automatically excludes known invalid patterns (e.g.,000.000.000-00, 111.111.111-11).

Custom Formats

You can define custom CPF formats using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your custom format.

Examples

YAML
ADDITIONAL_REGEX_FORMATTED: ^((?:\d{3}\.\d{3}\.\d{3}-\d{2}))$
ADDITIONAL_REGEX_NUMERIC: ^((?:\d{11}))$

14. Brazil CNPJ Model

The Brazil CNPJ (Cadastro Nacional da Pessoa Jurídica) model identifies Brazilian company identification numbers using pattern matching and validation algorithms.

Parameter Type Default Description
DEFAULT_TYPES Boolean TRUE Validates against the standard CNPJ format and checksum algorithm.
ADDITIONAL_REGEX String Matches specific CNPJ formats using a custom regex.

Tip

This model is disabled by default. To detect CNPJ numbers in your data, you must enable it.

The model identifies CNPJ numbers based on the following criteria:

  • Format: A 14-digit number in XX.XXX.XXX/XXXX-XX,XXXXXXXXXXXXXX or XX XXX XXX XXXX XX format.
  • Validation: Uses the CNPJ checksum algorithm (mod 11).
  • Exclusions: Automatically excludes known invalid patterns (e.g., 00.000.000/0000-00, 11.111.111/1111-11).

Custom Formats

You can define custom CNPJ formats using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your custom format.

Examples

YAML
ADDITIONAL_REGEX_FORMATTED: ^((?:\d{2}\.\d{3}\.\d{3}\/\d{4}-\d{2}))$
ADDITIONAL_REGEX_NUMERIC: ^((?:\d{14}))$

15. Canada SIN Model

The Canada SIN (Social Insurance Number) model detects Canadian Social Insurance Numbers issued to individuals for tax and government benefit purposes. The model leverages pattern recognition and checksum validation logic to accurately identify valid SIN values while reducing false positives.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Canada SIN numbers in your data, enable the following in the portal:

  • Tag: Enable the CANADA_SIN tag (Tags).
  • Model: Enable the Canada SIN model, type CANADA_SIN_ML_MODEL (Models).
  • Dictionary: Enable the CANADA_SIN_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the Canada SIN structured rules (e.g. "Canada SIN Number", "Canada SIN Number Strict") and the unstructured rule (e.g. rule_canada_sin_number) are present and enabled (Rules).

The model identifies Canada SIN numbers based on the following criteria:

  • Format: Exactly nine digits in one of the following formats:

    Format Example
    Hyphen-separated 123-456-789
    Space-separated 123 456 789
    Raw (no separator) 123456789

    Sequences of fewer than nine digits are neither matched nor padded to nine. Word boundaries prevent substrings of longer digit sequences (e.g. 10 or more digits) from being matched.

  • Validation: The full nine-digit value is validated using the Luhn algorithm. The first digit must not be 0 or 8.

  • Exclusions: Automatically excludes known invalid patterns, such as all-zero sequences and repetitive digits (e.g. 111-111-111).

Custom Formats

You can extend SIN detection to additional formats by defining custom regex patterns using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your format.

Examples

YAML
1
2
3
ADDITIONAL_REGEX_FORMATTED: ^((?:\d{3}-\d{3}-\d{3}))$
ADDITIONAL_REGEX_SPACED: ^((?:\d{3}\s\d{3}\s\d{3}))$
ADDITIONAL_REGEX_NUMERIC: ^((?:\d{9}))$

16. Canada Postal Code Model

The Canada Postal Code model identifies Canadian postal codes using pattern matching and character-set validation. It uses the tag CANADA_POSTAL_CODE.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Canada Postal Code in your data, enable the following in the portal:

  • Tag: Enable the CANADA_POSTAL_CODE tag (Tags).
  • Model: Enable the Canada Postal Code model, type CANADA_POSTAL_CODE_ML_MODEL (Models).
  • Dictionary: Enable the CANADA_POSTAL_CODE_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the Canada Postal Code structured rules (e.g. "Canada Postal Code", "Canada Postal Code Strict") and the unstructured rule (e.g. rule_canada_postal_code) are present and enabled (Rules).

The model identifies Canada Postal Code values based on the following criteria:

  • Format: Postal codes follow the pattern A1A 1A1 — a Forward Sortation Area (FSA) of letter-digit-letter, an optional space, then a Local Delivery Unit (LDU) of digit-letter-digit. Both A1A 1A1 (with space) and A1A1A1 (without space) are accepted.
  • Keyword dictionary: The strict classification rule uses the CANADA_POSTAL_CODE_KEYWORD dictionary — which includes terms such as postal code, postcode, and code postal — alongside the detector to improve match confidence.
  • Validation: Only permitted letters are accepted in each position:

    Position Excluded Letters Allowed Set
    1st (FSA, letter) D, F, I, O, Q, U, W, Z A, B, C, E, G, H, J–N, P, R, S, T, V, X, Y
    3rd (FSA, letter) D, F, I, O, Q, U A, B, C, E, G, H, J–N, P, R, S, T, V–Z
    6th (LDU, letter) D, F, I, O, Q, U A, B, C, E, G, H, J–N, P, R, S, T, V–Z

Examples

  • Column or field names: Postal Code, Postcode, Code postal, Zip, Mailing Code
  • Sample values: K1A 0B1, M5H 2N2, T2E 7W2 (with space); K1A0B1, M5H2N2 (no space)

17. Canada Driver License Model

  • Detects potential Canadian driver license numbers across all 13 provinces and territories.
  • Utilizes province-specific regex patterns for structural validation.
  • Applies robust heuristics to minimize false positives:
    • Length must be between 5–17 characters.
    • Rejects sequential or repetitive patterns.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Canada Driver License in your data, enable the following in the portal:

  • Tag: Enable the CANADA_DRIVER_LICENSE tag (Tags).
  • Model: Enable the Canada Driver License model, type CA_DL_ML_MODEL (Models).
  • Dictionary: Enable the CA_DL_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the Canada Driver License structured rules (e.g. "Canada Driver License", "Canada Driver License Strict") and the unstructured rule (e.g. rule_canada_driver_license) are present and enabled (Rules).

The model identifies Canada Driver License values based on the following criteria:

Parameter Type Default Description
EXCLUDED_PROVINCES String Comma-separated list of provinces/territories to exclude from detection. By default all are included; use this to exclude only the ones you do not need. Values are case-insensitive (normalized to uppercase for matching).
MINIMUM_VALID_LENGTH Integer 5 Minimum length for valid license numbers.
MAXIMUM_VALID_LENGTH Integer 17 Maximum length for valid license numbers.

Valid values for EXCLUDED_PROVINCES (use exact names; case-insensitive):

ALBERTA, BRITISH_COLUMBIA, MANITOBA, MANITOBA_NO_HYPHEN, NEW_BRUNSWICK, NEWFOUNDLAND_AND_LABRADOR, NOVA_SCOTIA, ONTARIO, PRINCE_EDWARD_ISLAND, QUEBEC_HYPHEN, QUEBEC_SPACE, QUEBEC_NO_SEP, SASKATCHEWAN, NORTHWEST_TERRITORIES, NUNAVUT, YUKON


18. Canada Health Card Model

The Canada Health Card model detects provincial health card numbers for supported provinces and territories. Formats vary by province; the detector uses province-specific patterns and validation. Tag: CANADA_HEALTH_CARD.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Canada health card values in your data, enable the following in the portal:

  • Tag: Enable the CANADA_HEALTH_CARD tag (Tags).
  • Model: Enable the Canada Health Card model, type CANADA_HEALTH_CARD_ML_MODEL (Models).
  • Dictionary: Enable the CANADA_HEALTH_CARD_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the structured rules (e.g. "Canada Health Card", "Canada Health Card Strict") and the unstructured rule rule_canada_health_card are present and enabled (Rules).

The model identifies values based on the following supported formats (after normalization):

Province / territory Pattern summary Example shape
Ontario Four letters + ten digits (optional space) e.g. ABCD1234567890
Quebec Twelve alphanumeric characters 12-char alphanumeric
British Columbia Ten digits 1234567890
Alberta Nine digits 123456789
  • Exclusions: Use model property EXCLUDED_PROVINCES (comma-separated, case-insensitive) to skip provinces you do not need. Valid values match the internal enum names, e.g. ONTARIO, QUEBEC, BRITISH_COLUMBIA, ALBERTA.
  • Strict rule: Combines the detector with the CANADA_HEALTH_CARD_KEYWORD dictionary (terms such as health card, OHIP, assurance maladie) to reduce false positives.
  • Custom formats: You can add patterns via ADDITIONAL_REGEX_<TYPE> properties on the model, consistent with other Canada models.

19. Canada Business Number (BN) Model

The Canada Business Number (BN) model detects the nine-digit Business Number issued by the Canada Revenue Agency. The nine digits are validated with the Luhn algorithm. This model is for BN only (not GST/HST program accounts). Tag: CANADA_BN.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Canada BN in your data, enable the following in the portal:

  • Tag: Enable the CANADA_BN tag (Tags).
  • Model: Enable the Canada BN Number model, type CANADA_BN_ML_MODEL (Models).
  • Dictionary: Enable the CANADA_BN_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the structured rules (e.g. "Canada BN Number", "Canada BN Number Strict") and the unstructured rule rule_canada_bn_number are present and enabled (Rules).

The model identifies Canada BN values based on the following criteria:

  • Format: Nine digits in one of these shapes (word boundaries avoid matching inside longer digit strings):

    Format Example
    Hyphen-separated 123-456-789
    Space-separated 123 456 789
    Raw 123456789
  • Validation: Luhn check on the nine-digit value.

  • Keyword dictionary: The strict rule uses CANADA_BN_KEYWORD (e.g. BN, Business Number, numéro d'entreprise) alongside the detector.
  • Custom formats: Extend detection with ADDITIONAL_REGEX_<TYPE> model properties.

20. Canada GST HST Model

The Canada GST HST model detects GST/HST program account numbers: a valid nine-digit BN (Luhn-valid) immediately followed by RT and four digits (e.g. 123456789RT0001). Plain nine-digit BN strings are not tagged by this model—use the Canada Business Number (BN) model for those. Tag: CANADA_GST_HST.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Canada GST/HST program identifiers in your data, enable the following in the portal:

  • Tag: Enable the CANADA_GST_HST tag (Tags).
  • Model: Enable the Canada GST HST model (display name may show as "Canada GST HST"), type CANADA_GST_HST_ML_MODEL (Models).
  • Dictionary: Enable the CANADA_GST_HST_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the structured rules (e.g. "Canada GST HST", "Canada GST HST Strict") and the unstructured rule rule_canada_gst_hst are present and enabled (Rules).

The model identifies values based on the following criteria:

  • Format: Nine-digit BN (with optional hyphens or spaces between digit groups) + RT (case-insensitive) + four digits, e.g. 123-456-789RT0001, 123456789RT0001.
  • Validation: The nine-digit BN portion must pass Luhn validation; the suffix must be RT plus exactly four digits.
  • Keyword dictionary: The strict rule uses CANADA_GST_HST_KEYWORD (e.g. GST, HST, RT0001, tax number, TPS, TVH) alongside the detector.

21. Australia Medicare Number Model

The Australia Medicare Number model detects Australian Medicare card numbers using pattern matching and checksum validation. Tag: AU_MEDICARE.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Australian Medicare numbers in your data, enable the following in the portal:

  • Tag: Enable the AU_MEDICARE tag (Tags).
  • Model: Enable the Australia Medicare Number model, type AU_MEDICARE_ML_MODEL (Models).
  • Dictionary: Enable the AU_MEDICARE_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the Australia Medicare structured rules (e.g. "Australia Medicare Number", "Australia Medicare Number Strict") and the unstructured rule rule_au_medicare are present and enabled (Rules).

The model identifies Australian Medicare card numbers based on the following criteria:

  • Format: A 10-digit base number (with an optional 11th Individual Reference Number digit), accepted in the following display formats:

    Format Example
    Official 4-5-1 spaced 2123 45670 1
    4-5-1 hyphenated 2123-45670-1
    4-4-1-1 spaced 2123 4567 0 1
    4-4-1-1 hyphenated 2123-4567-0-1
    With IRN via slash 2123 45670 1/2
    With IRN via space 2123 45670 1 2
    With IRN via hyphen 2123 45670 1-2
    10 consecutive digits 2123456701
    11 consecutive digits 21234567012
    10 digits + hyphen + IRN 2123456701-2
  • Validation: The model applies the following checks:

    • First digit must be 2–6.
    • Issue number (digit 10) must be 1–9; 0 is never issued.
    • Optional IRN (11th digit) must be 1–9 if present.
    • Check digit (digit 9) is validated using a weighted sum algorithm with weights [1, 3, 7, 9, 1, 3, 7, 9] applied to digits 1–8; the result mod 10 must equal digit 9.
  • Exclusions: Automatically rejects known false positive patterns:

    • All-same digit strings (e.g. 2222222222).
    • Step-1 ascending or descending sequential strings (e.g. 2345678901).
  • Keyword dictionary: The strict rule combines the detector with the AU_MEDICARE_KEYWORD dictionary (terms such as medicare, medicare_number, medicare_card, health_card_number, health_insurance_number, health_id, pbs_number) to reduce false positives in structured data.

Parameter Type Default Description
DEFAULT_TYPES Boolean TRUE Enables the built-in Medicare pattern and checksum validation.
ADDITIONAL_REGEX_\<TYPE> String Custom regex to match additional Medicare number formats.

Custom Formats

You can extend Medicare number detection to additional formats by defining custom regex patterns using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your format.

Examples

YAML
ADDITIONAL_REGEX_SPACED:  ^\d{4}\s\d{5}\s\d$
ADDITIONAL_REGEX_NUMERIC: ^\d{10,11}$

22. Australian Passport Model

The Australian Passport model detects Australian passport document numbers using strict format validation. The printed Australian passport number is eight characters: one uppercase letter followed by seven digits (for example, N1234567). The document number does not carry a check digit, so the detector relies on format checks and false-positive filters rather than a checksum. Tag: AU_PASSPORT.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Australian passport numbers in your data, enable the following in the portal:

  • Tag: Enable the AU_PASSPORT tag (Tags).
  • Model: Enable the Australian Passport model, type AUSTRALIA_PASSPORT_ML_MODEL (Models).
  • Dictionary: Enable the AU_PASSPORT_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the Australian Passport structured rules (e.g. "Australian Passport", "Australian Passport Strict") and the unstructured rule rule_au_passport are present and enabled (Rules).

The model identifies Australian passport values based on the following criteria:

  • Format: Exactly eight characters — one uppercase letter followed by seven digits. The detector requires a boundary (whitespace, start/end of value, or non-./- punctuation) on either side of the match.

    Format Example
    Letter + 7 digits N1234567
  • Validation: The printed AU passport number does not include a check digit, so no checksum is applied. The detector instead rejects obvious filler and test patterns:

    • all-same digit sequences (e.g. A1111111)
    • fully ascending or descending digit sequences (e.g. A1234567, A7654321)
    • alternating two-digit patterns (e.g. A1010101)
    • step-2 ascending or descending digit sequences (e.g. A2468024)
  • Keyword dictionary: The strict rule combines the detector with the AU_PASSPORT_KEYWORD dictionary (terms such as Australian Passport, AU Passport, au_passport_no) to reduce false positives in structured data.
Parameter Type Default Description
DEFAULT_TYPES Boolean TRUE Validate against the standard AU passport format (letter + 7 digits).
ADDITIONAL_REGEX String Custom regex for matching alternate AU passport formats; see below.

Custom Formats

You can extend AU passport detection to additional formats by defining custom regex patterns using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your format.

Examples

YAML
ADDITIONAL_REGEX_PREFIXED: ^((?:[A-Z]{2}\d{7}))$
ADDITIONAL_REGEX_SPACED:   ^((?:[A-Z]\s\d{7}))$

23. New Zealand Passport Model

The New Zealand Passport model detects New Zealand passport document numbers using strict format validation. The printed NZ passport number is eight characters: two uppercase letters followed by six digits (for example, AB123456). The ICAO check digit exists only in the machine-readable zone (MRZ) and is not part of the printed document number, so the detector relies on format checks and false-positive filters rather than a checksum. Tag: NZ_PASSPORT.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect New Zealand passport numbers in your data, enable the following in the portal:

  • Tag: Enable the NZ_PASSPORT tag (Tags).
  • Model: Enable the New Zealand Passport model, type NEW_ZEALAND_PASSPORT_ML_MODEL (Models).
  • Dictionary: Enable the NZ_PASSPORT_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the New Zealand Passport structured rules (e.g. "New Zealand Passport", "New Zealand Passport Strict") and the unstructured rule rule_nz_passport are present and enabled (Rules).

The model identifies New Zealand passport values based on the following criteria:

  • Format: Exactly eight characters — two uppercase letters followed by six digits. The detector requires a boundary (whitespace, start/end of value, or non-./- punctuation) on either side of the match.

    Format Example
    2 letters + 6 digits AB123456
  • Validation: The printed NZ passport number does not include a check digit (the ICAO check digit applies only to the MRZ). No checksum is applied. The detector instead rejects obvious filler and test patterns:

    • all-same digit sequences (e.g. AB111111)
    • fully ascending or descending digit sequences (e.g. AB123456, AB654321)
    • alternating two-digit patterns (e.g. AB101010)
    • repeating half-length blocks (e.g. AB123123)
    • step-2 ascending or descending digit sequences (e.g. AB246802)
  • Keyword dictionary: The strict rule combines the detector with the NZ_PASSPORT_KEYWORD dictionary (terms such as New Zealand Passport, NZ Passport, nz_passport_no) to reduce false positives in structured data.
  • Additional-regex acceptance: If a candidate value fails the default 2-letter + 6-digit check but matches at least one configured ADDITIONAL_REGEX_<TYPE> pattern, the annotation is still accepted. This lets you support alternate NZ formats without overriding the default validator.
Parameter Type Default Description
DEFAULT_TYPES Boolean TRUE Validate against the standard NZ passport format (2 letters + 6 digits).
ADDITIONAL_REGEX String Custom regex for alternate NZ passport formats; accepted via fallback path.

Custom Formats

You can extend NZ passport detection to additional formats by defining custom regex patterns using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your format.

Examples

YAML
ADDITIONAL_REGEX_LEGACY:   ^((?:[A-Z]\d{7}))$
ADDITIONAL_REGEX_EXTENDED: ^((?:[A-Z]{2}\d{7}))$

24. Australian Driver Licence Model

The Australian Driver Licence model detects Australian state and territory driver's licence numbers across all eight issuing jurisdictions. Because each state uses a different format — ranging from purely numeric strings to alphanumeric combinations — the detector applies per-state regex patterns followed by false-positive filters rather than a single checksum. Tag: AU_DRIVER_LICENSE.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Australian driver licence numbers in your data, enable the following in the portal:

  • Tag: Enable the AU_DRIVER_LICENSE tag (Tags).
  • Model: Enable the Australian Driver Licence model, type AUSTRALIA_DRIVER_LICENSE_ML_MODEL (Models).
  • Dictionary: Enable the AU_DRIVER_LICENSE_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the Australian Driver Licence structured rules (e.g. "Australian Driver Licence", "Australian Driver Licence Strict") and the unstructured rule rule_au_driver_license are present and enabled (Rules).

The model identifies Australian driver licence values based on the following criteria:

  • Format: Varies by state or territory. After stripping non-alphanumeric boundary characters, the cleaned value must be between 6 and 10 characters long. The table below summarises each jurisdiction's expected format:

    State / Territory Format summary Example shape
    New South Wales (NSW) 8 digits 50291837
    Victoria (VIC) 8–10 digits (typically 9) 502918374
    Queensland (QLD) 8 digits, leading zero 05029183
    South Australia (SA) 1 uppercase letter + 6 digits, or 7 digits A503917 / 5039172
    Western Australia (WA) 7 digits 5039172
    Tasmania (TAS) 6–8 digits (most commonly 7) 503917
    Australian Capital Territory (ACT) 8–9 digits 502918374
    Northern Territory (NT) 6–7 digits 503917
  • Validation: No check digit is defined for Australian driver licences. The detector instead rejects obvious filler and test patterns:

    • all-same digit sequences (e.g. 11111111)
    • fully ascending or descending digit sequences (e.g. 12345678, 87654321)
    • repeating block patterns (e.g. 12341234)
    • alternating two-digit patterns (e.g. 12121212)
    • inputs of 8 or more digits with fewer than 4 distinct digit characters
  • Keyword dictionary: The strict classification rule combines the detector with the AU_DRIVER_LICENSE_KEYWORD dictionary — which includes terms such as Australian Driver Licence, AU DL, au_driver_licence — to reduce false positives in structured data.
Parameter Type Default Description
STATES String Comma-separated whitelist of states/territories to detect. When set, only the listed states are active and all others are disabled. Values are case-insensitive (normalised to uppercase). Mutually exclusive with EXCLUDED_STATES — if both are provided, STATES is applied first (retain), then EXCLUDED_STATES (remove).
EXCLUDED_STATES String Comma-separated blacklist of states/territories to exclude from detection. By default all are included. Values are case-insensitive (normalised to uppercase).
MINIMUM_VALID_LENGTH Integer 6 Minimum length (alphanumeric characters only) for a valid licence number.
MAXIMUM_VALID_LENGTH Integer 10 Maximum length (alphanumeric characters only) for a valid licence number.

Valid values for STATES / EXCLUDED_STATES (use exact names; case-insensitive):

NSW, VIC, QLD, SA, WA, TAS, ACT, NT


25. New Zealand Driver Licence Model

The New Zealand Driver Licence model detects New Zealand driver's licence numbers using strict format validation. The printed NZ driver's licence number is eight characters: two uppercase letters followed by six digits (for example, AB123456). No check digit is defined for this format, so the detector relies on format checks and false-positive filters. Tag: NZ_DRIVER_LICENSE.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect New Zealand driver licence numbers in your data, enable the following in the portal:

  • Tag: Enable the NZ_DRIVER_LICENSE tag (Tags).
  • Model: Enable the New Zealand Driver Licence model, type NEW_ZEALAND_DRIVER_LICENSE_ML_MODEL (Models).
  • Dictionary: Enable the NZ_DRIVER_LICENSE_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the New Zealand Driver Licence structured rules (e.g. "New Zealand Driver Licence", "New Zealand Driver Licence Strict") and the unstructured rule rule_nz_driver_license are present and enabled (Rules).

The model identifies New Zealand driver licence values based on the following criteria:

  • Format: Exactly eight characters — two uppercase letters followed by six digits. The detector requires a boundary (whitespace, start/end of value, or non-./- punctuation) on either side of the match.

    Format Example
    2 letters + 6 digits AB123456
  • Validation: No check digit is defined for the NZ driver's licence. The detector instead rejects obvious filler and test patterns:

    • all-same digit sequences (e.g. AB111111)
    • fully ascending or descending digit sequences (e.g. AB123456, AB654321)
    • alternating two-digit patterns (e.g. AB101010)
    • repeating 3-digit blocks (e.g. AB123123)
    • step-2 ascending or descending digit sequences (e.g. AB246802)
  • Keyword dictionary: The strict classification rule combines the detector with the NZ_DRIVER_LICENSE_KEYWORD dictionary — which includes terms such as New Zealand Driver Licence, NZ DL, nz_driver_licence — to reduce false positives in structured data.
  • Additional-regex acceptance: If a candidate value fails the default 2-letter + 6-digit check but matches at least one configured ADDITIONAL_REGEX_<TYPE> pattern, the annotation is still accepted. This lets you support alternate NZ formats without overriding the default validator.
Parameter Type Default Description
DEFAULT_TYPES Boolean TRUE Validate against the standard NZ driver licence format (2 letters + 6 digits).
ADDITIONAL_REGEX String Custom regex for alternate NZ driver licence formats; accepted via fallback path.

Custom Formats

You can extend NZ driver licence detection to additional formats by defining custom regex patterns using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your format.

Examples

YAML
ADDITIONAL_REGEX_LEGACY:   ^((?:[A-Z]\d{6}))$
ADDITIONAL_REGEX_EXTENDED: ^((?:[A-Z]{3}\d{6}))$

26. Australia Business Number (ABN) Model

The Australia Business Number (ABN) model detects the eleven-digit identifier issued by the Australian Taxation Office (ATO) to businesses and other entities. The eleven digits are validated with the official modulo 89 check algorithm. Tag: AU_ABN.

Tip

By default, tags, models, and dictionaries are disabled, while rules are enabled. To detect Australian Business Numbers in your data, enable the following in the portal:

  • Tag: Enable the AU_ABN tag (Tags).
  • Model: Enable the AU ABN Number model, type AU_ABN_ML_MODEL (Models).
  • Dictionary: Enable the AU_ABN_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the structured rules (e.g. "AU ABN Strict", "AU ABN") are present and enabled (Rules).

The model identifies ABN values based on the following criteria:

  • Format: Eleven digits in one of these shapes (word boundaries avoid matching inside longer digit strings):

    Format Example
    Space-separated 51 824 753 556
    Raw 51824753556
  • Validation: The official ABN check — subtract one from the first digit, apply weights 10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19 to the resulting eleven digits, and require the weighted sum to be divisible by 89.

  • Keyword dictionary: The strict rule uses AU_ABN_KEYWORD (e.g. ABN, Australian Business Number, business number) alongside the detector.
  • Custom formats: Extend detection with ADDITIONAL_REGEX_<type> model properties.

27. Australia Company Number (ACN) Model

The Australia Company Number (ACN) model detects the nine-digit identifier issued by ASIC (Australian Securities and Investments Commission) to registered companies. Values are zero-padded to nine digits and validated with the official modulo 10 check digit. Tag: AU_ACN.

Tip

By default, tags, models, and dictionaries are disabled, while rules are enabled. To detect ACNs, enable:

  • Tag: Enable the AU_ACN tag (Tags).
  • Model: Enable the AU ACN Number model, type AU_ACN_ML_MODEL (Models).
  • Dictionary: Enable the AU_ACN_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the structured rules (e.g. "AU ACN Strict", "AU ACN") are present and enabled (Rules).

The model identifies ACN values based on the following criteria:

  • Format: Nine digits (with leading zeros preserved) in one of these shapes:

    Format Example
    Space-separated 004 085 616
    Hyphen-separated 004-085-616
    Raw 004085616
  • Validation: Apply weights 8, 7, 6, 5, 4, 3, 2, 1 to the first eight digits; the ninth digit is the check digit such that the weighted sum is divisible by 10.

  • Keyword dictionary: The strict rule uses AU_ACN_KEYWORD (e.g. ACN, Australian Company Number, company number) alongside the detector.
  • Custom formats: Extend detection with ADDITIONAL_REGEX_<type> model properties.

28. New Zealand IRD Number Model

The New Zealand IRD Number model detects Inland Revenue Department (IRD) tax identifiers used for individuals and entities in New Zealand. Tag: NZ_IRD.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect New Zealand IRD numbers in your data, enable the following in the portal:

  • Tag: Enable the NZ_IRD tag (Tags).
  • Model: Enable the NZ IRD Number model, type NZ_IRD_ML_MODEL (Models).
  • Dictionary: Enable the NZ_IRD_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the structured rules (e.g. "New Zealand IRD Number Strict", "New Zealand IRD Number") and the unstructured rule rule_nz_ird are present and enabled (Rules).

The model identifies IRD values based on the following criteria:

  • Format: Eight or nine digits (leading zeros preserved) in one of these shapes:

    Format Example
    Hyphen-separated 136-410-132
    Space-separated 136 410 132
    Raw 136410132
  • Validation: IRD check-digit validation per Inland Revenue Department rules (modulo 11 algorithm on the digit sequence).

  • Keyword dictionary: The strict rule uses NZ_IRD_KEYWORD (e.g. IRD, IRD number, Inland Revenue, tax number) alongside the detector.
  • Custom formats: Extend detection with ADDITIONAL_REGEX_<type> model properties.

29. Australia Bank Account Number Model

The Australia Bank Account Number model detects Australian bank account numbers composed of a BSB (Bank-State-Branch) code and an account number. Tag: AU_BANK_ACCOUNT.

Tip

By default, tags, models, and dictionaries are disabled, while rules are enabled. To detect Australian bank accounts, enable:

  • Tag: Enable the AU_BANK_ACCOUNT tag (Tags).
  • Model: Enable the AU Bank Account Number model, type AU_BANK_ACCOUNT_ML_MODEL (Models).
  • Dictionary: Enable the AU_BANK_ACCOUNT_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the structured rules (e.g. "Australian Bank Account Number Strict", "Australian Bank Account Number") and the unstructured rule rule_au_bank_account are present and enabled (Rules).

The model identifies values based on the following criteria:

  • Format: A six-digit BSB followed by an account number (typically six to nine digits). Common shapes include:

    Format Example
    Hyphen-separated 062-001-12345678
    Raw 06200112345678

    The BSB portion is six digits, often written as XXX-XXX when hyphenated.

  • Validation: Structural validation of BSB and account-number length and format; the detector applies heuristics to reduce false positives on arbitrary digit strings.

  • Keyword dictionary: The strict rule uses AU_BANK_ACCOUNT_KEYWORD (e.g. BSB, bank account, account number) alongside the detector.
  • Custom formats: Extend detection with ADDITIONAL_REGEX_<type> model properties.

30. New Zealand Bank Account Number Model

The New Zealand Bank Account Number model detects NZ bank account numbers in the standard domestic format. Tag: NZ_BANK_ACCOUNT.

Tip

By default, tags, models, and dictionaries are disabled, while rules are enabled. To detect NZ bank accounts, enable:

  • Tag: Enable the NZ_BANK_ACCOUNT tag (Tags).
  • Model: Enable the NZ Bank Account Number model, type NZ_BANK_ACCOUNT_ML_MODEL (Models).
  • Dictionary: Enable the NZ_BANK_ACCOUNT_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the structured rules (e.g. "New Zealand Bank Account Number Strict", "New Zealand Bank Account Number") and the unstructured rule rule_nz_bank_account are present and enabled (Rules).

The model identifies values based on the following criteria:

  • Format: Standard New Zealand bank account layout BB-bbbb-AAAAAAA-SSS — two-digit bank code, four-digit branch, seven-digit account body, and three-digit suffix:

    Format Example
    Hyphen-separated 01-0102-0123456-000
    Raw 0101020123456000
  • Validation: Structural validation of each segment length and overall format; heuristics reduce false positives on arbitrary digit strings.

  • Keyword dictionary: The strict rule uses NZ_BANK_ACCOUNT_KEYWORD (e.g. bank account, account number, NZ bank account) alongside the detector.
  • Custom formats: Extend detection with ADDITIONAL_REGEX_<type> model properties.

31. Australia Phone Number Model

The Australia Phone Number model detects Australian personal phone numbers — both mobile (04XX XXX XXX) and landline ((0X) XXXX XXXX, where X ∈ {2, 3, 7, 8}) — in national and international (+61) formats. Tag: AU_PHONE_NUMBER.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Australian phone numbers in your data, enable the following in the portal:

  • Tag: Enable the AU_PHONE_NUMBER tag (Tags).
  • Model: Enable the Australian Phone Number model, type AUSTRALIA_PHONE_NUMBER_ML_MODEL (Models).
  • Dictionary: Enable the ANZ_PHONE_NUMBER_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the Australian Phone Number structured rules (e.g. "Australia Phone Number", "Australia Phone Number Strict") and the unstructured rule rule_au_phone_keyword_unstruct are present and enabled (Rules).

The model identifies Australian phone-number values based on the following criteria:

  • Format: Exactly 10 national digits (or the equivalent international form with the trunk 0 dropped). The detector requires a boundary (whitespace, start/end of value, or non-./- punctuation) on either side of the match.

    Form Example
    Mobile (national) 0412 345 678
    Landline (national) (02) 9876 5432
    International (+61) +61 412 345 678
    Bare-country-code 61 412 345 678

    The bare-61 form covers values whose leading + was stripped by spreadsheet auto-format or ETL. Separators between groups may be space, hyphen, dot, or none.

  • Validation:

    • Mobile prefix must be 04; landline area code must be one of 02, 03, 07, 08, and the digit immediately after the area code must be in 2–9 (this guard keeps NZ mobiles, which use 02[01], from matching the AU landline shape).
    • Business shortcodes (1800, 1300, 13) are excluded by design — they start with 1, which does not match any AU personal-number prefix.
    • The detector rejects all-same-digit, sequential, alternating, step-2, and repeating-block digit strings (filler / test data).
  • Keyword dictionary: The strict rule combines the detector with the ANZ_PHONE_NUMBER_KEYWORD dictionary (terms such as phone, mobile, contact_number, au_phone) to reduce false positives in structured data.

  • Cross-family disambiguation: The Australian and New Zealand phone detectors share the m_ANZ_PHONE_NUMBER_KEYWORD column-name keyword. When both detectors match content on the same column, the rule engine emits an ambiguous REVIEW-grade tag pair rather than auto-classifying either country — see Rules for the cross-family guard pattern.

Parameter Type Default Description
DEFAULT_TYPES Boolean TRUE Validate against the standard AU mobile / landline formats described above.
ADDITIONAL_REGEX String Custom regex for matching alternate AU phone formats; see below.

Custom Formats

Extend AU phone detection to additional formats by defining custom regex patterns using the ADDITIONAL_REGEX_<TYPE> property. Replace <TYPE> with a descriptive name for your format.

Examples

YAML
ADDITIONAL_REGEX_INTL_NO_PLUS:   ^(0061\s?4\d{2}\s?\d{3}\s?\d{3})$
ADDITIONAL_REGEX_GROUPED_MOBILE: ^(04\d{2}-\d{3}-\d{3})$

32. New Zealand Phone Number Model

The New Zealand Phone Number model detects NZ personal phone numbers — mobile (02X XXX XXXX, with X ∈ {0, 1}; 10 national digits) and landline ((0X) XXX XXXX, where X ∈ {3, 4, 6, 7, 9}; 9 national digits) — in national and international (+64) formats. Tag: NZ_PHONE_NUMBER.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect New Zealand phone numbers in your data, enable the following in the portal:

  • Tag: Enable the NZ_PHONE_NUMBER tag (Tags).
  • Model: Enable the New Zealand Phone Number model, type NEW_ZEALAND_PHONE_NUMBER_ML_MODEL (Models).
  • Dictionary: Enable the ANZ_PHONE_NUMBER_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the New Zealand Phone Number structured rules (e.g. "New Zealand Phone Number", "New Zealand Phone Number Strict") and the unstructured rule rule_nz_phone_keyword_unstruct are present and enabled (Rules).

The model identifies New Zealand phone-number values based on the following criteria:

  • Format: 10 national digits for mobile, 9 national digits for landline (or the equivalent international form with the trunk 0 dropped). The detector requires a boundary (whitespace, start/end of value, or non-./- punctuation) on either side of the match.

    Form Example
    Mobile (national) 021 234 5678
    Landline (national) (03) 123 4567
    International (+64) +64 21 234 5678
    Bare-country-code 64 3 123 4567

    Separators between groups may be space, hyphen, dot, or none.

  • Validation:

    • Mobile prefix is restricted to 020 (Skinny) and 021 (Spark / Vodafone NZ) by default — this keeps NZ mobile and AU landline mutually exclusive (AU landlines occupy 02[2–9]).
    • Landline area codes are 03, 04, 06, 07, 09.
    • Business shortcodes 0800 (toll-free) and 0900 (premium) are explicitly excluded.
    • The detector rejects all-same-digit, sequential, alternating, step-2, and repeating-block digit strings.
  • Keyword dictionary: The strict rule combines the detector with the ANZ_PHONE_NUMBER_KEYWORD dictionary alongside the detector (shared with Australia phone — column-name keyword is country-agnostic for ANZ).

  • Cross-family disambiguation: NZ and AU phone detectors share keyword context. When both detect content on the same column, the rule engine emits an ambiguous REVIEW-grade tag pair rather than auto-classifying either country; see Rules. Operators using carriers in the 022/023/024/026/027/028/029 mobile prefix range can extend coverage by configuring an ADDITIONAL_REGEX_* parameter, or by relying on the column-name keyword to disambiguate when the column is clearly NZ-only.

Parameter Type Default Description
DEFAULT_TYPES Boolean TRUE Validate against the standard NZ mobile (02[01]) and landline formats above.
ADDITIONAL_REGEX String Custom regex for matching alternate NZ phone formats; see below.

Custom Formats

Extend NZ phone detection to additional formats by defining custom regex patterns using the ADDITIONAL_REGEX_<TYPE> property.

Examples

YAML
ADDITIONAL_REGEX_022_MOBILE: ^(02[2-9]\d{3}\s?\d{4})$
ADDITIONAL_REGEX_INTL_NO_PLUS: ^(0064\s?2[01]\d{3}\s?\d{4})$

33. Australia Tax File Number (TFN) Model

The Australia TFN model detects Australian Tax File Numbers — the personal identifier issued by the Australian Taxation Office. Tag: AU_TFN.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect Australian TFNs in your data, enable the following in the portal:

  • Tag: Enable the AU_TFN tag (Tags).
  • Model: Enable the Australian TFN model, type AUSTRALIA_TFN_ML_MODEL (Models).
  • Dictionary: Enable the AU_TFN_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the AU TFN structured rules (e.g. "Australia TFN", "Australia TFN Strict") and the unstructured rule rule_au_tfn are present and enabled (Rules).

The model identifies AU TFN values based on the following criteria:

  • Format: 8 or 9 numeric digits. Common groupings are accepted.

    Format Example
    9-digit, 3-3-3 spaced 123 456 782
    9-digit consecutive 123456782
    8-digit, 4-4 spaced 1234 5678
    8-digit consecutive 12345678

    The detector requires a boundary on either side of the match.

  • Validation: TFNs are validated via a mod-11 weighted checksum using weights [1, 4, 3, 7, 5, 8, 6, 9, 10]. The weighted sum of all digits must be divisible by 11. The same weight sequence applies to both 8-digit and 9-digit TFNs (the 8-digit case uses the first eight weights).

    Warning

    Some third-party libraries cite [10, 7, 8, 4, 6, 3, 5, 1] for the 8-digit TFN. That weight set belongs to the ABN check digit, not the TFN. Privacera's detector uses the canonical ATO algorithm.

  • Keyword dictionary: The strict rule combines the detector with the AU_TFN_KEYWORD dictionary (terms such as TFN, tax_file_number, tax_file_no, au_tfn) to reduce false positives in structured data.

  • Disambiguation with SECURITY_PIN: 8-digit TFN values also match the SECURITY_PIN_PATTERN regex (3–8 digits). The Security PIN Strict rule includes c_AUSTRALIA_TFN_ML_MODEL in its hasNotKeys guard, so a PIN-named column whose values happen to be valid TFNs is rerouted to the AU_TFN review path rather than auto-tagging as SECURITY_PIN.

Parameter Type Default Description
DEFAULT_TYPES Boolean TRUE Validate against the standard 8-digit / 9-digit TFN formats described above.
ADDITIONAL_REGEX String Custom regex for matching alternate TFN groupings; see below.

Custom Formats

YAML
ADDITIONAL_REGEX_DOT_SEPARATED: ^(\d{3}\.\d{3}\.\d{3})$
ADDITIONAL_REGEX_HYPHENATED:    ^(\d{3}-\d{3}-\d{3})$

34. ANZ Vehicle Number Plate Model

The ANZ Vehicle Number Plate model detects Australian and New Zealand vehicle license plate / registration numbers across the most common state and territory formats. Tag: ANZ_VEHICLE_NUMBER_PLATE.

Tip

The tag, model, and dictionary are disabled by default; rules are enabled by default. To detect ANZ vehicle plates in your data, enable the following in the portal:

  • Tag: Enable the ANZ_VEHICLE_NUMBER_PLATE tag (Tags).
  • Model: Enable the ANZ Vehicle Number Plate model, type ANZ_VEHICLE_NUMBER_PLATE_ML_MODEL (Models).
  • Dictionary: Enable the ANZ_VEHICLE_PLATE_KEYWORD dictionary if you use the strict rule (Dictionaries).
  • Rules: Ensure the ANZ Vehicle Plate structured rules (e.g. "ANZ Vehicle Number Plate", "ANZ Vehicle Number Plate Strict") and the unstructured rule rule_anz_vehicle_plate_keyword_unstruct are present and enabled (Rules).

The model identifies vehicle-plate values based on the following criteria:

  • Format: Letter-and-digit combinations covering common AU state and NZ formats. Separators between letter and digit groups may be space, hyphen, or none.

    Pattern Example Typical use
    3 letters + 3 digits ABC123 AU NSW / VIC / QLD / SA / WA / TAS / NT / ACT, NZ modern
    3 letters + 4 digits ABC1234 AU custom / vanity
    2 letters + 3–4 digits AB1234 Older NZ, some AU motorcycle
    3 letters + 3 digits + 1–2 letters ABC123AB AU custom / club plates
    2–4 digits + 2–3 letters 1234ABC QLD / NSW legacy, AU custom
  • Validation:

    • The cleaned candidate (separators removed) must be 5–8 characters and contain at least one letter and one digit — short tokens like AB12 are rejected.
    • All-same-digit sequences (e.g. ABC000) are rejected.
    • Sequential, alternating, and step-2 digit sequences are rejected only when the digit section is 5 or more digits, to avoid incorrectly rejecting legitimate 3-digit plates that happen to look sequential.
  • Out of scope by design:

    • UK post-2001 plates (AB12CDE — 2 letters + 2 digits + 3 letters) are not matched.
    • Some US plate shapes happen to look identical to AU / NZ plates and may still match at the regex level. Plate formats themselves carry no country signal, so pair the detector with the m_ANZ_VEHICLE_PLATE_KEYWORD column-name keyword to disambiguate.
  • Keyword dictionary: The strict rule combines the detector with the ANZ_VEHICLE_PLATE_KEYWORD dictionary (terms such as plate, registration, rego, vehicle_plate, number_plate) to reduce false positives in structured data.

  • Interaction with SECURITY_PIN: A purely numeric plate variant (e.g. 1234ABC where the digit section overlaps with the 3–8 digit SECURITY_PIN_PATTERN window) could compete with the Security PIN Strict rule. The PIN strict rule includes c_ANZ_VEHICLE_NUMBER_PLATE_ML_MODEL in its hasNotKeys guard so a column matching both content models is routed to the plate review path.

Parameter Type Default Description
DEFAULT_TYPES Boolean TRUE Validate against the standard AU / NZ plate shapes described above.
ADDITIONAL_REGEX String Custom regex for matching alternate plate formats; see below.

Custom Formats

YAML
ADDITIONAL_REGEX_UK_POST_2001: ^([A-Z]{2}\d{2}\s?[A-Z]{3})$
ADDITIONAL_REGEX_EURO:         ^([A-Z]{1,3}-\d{1,4}-[A-Z]{1,3})$

Conclusion

Privacera Discovery's model-based classification supports numerous sensitive data formats through specialized validation rules and pattern matching. These models help achieve consistent and automated tagging of data across domains such as finance, healthcare, HR, and more.