Embedded JSON and XML Tagging

Overview

Embedded JSON and XML tagging lets Privacera Discovery find and tag sensitive data stored inside JSON or XML documents that live within a single structured column—for example, a relational VARCHAR column, a CSV field, or a Parquet string column.

Without this feature, Discovery treats the entire column value as one string. With it enabled, Discovery flattens nested structures and classifies each leaf path independently, so tags can apply at paths such as customer_data.email instead of only at customer_data.

How it works at a high level

During a structured content scan (tables, CSV, Parquet, and similar):

  1. Discovery reads each column value from sample records.
  2. If embedded tagging is enabled and the value looks like JSON (starts with {), Discovery parses the JSON and extracts leaf fields.
  3. If JSON parsing produces no fields and the value looks like XML (starts with <), Discovery parses the XML instead.
  4. Each extracted path is registered as a nested field (for example, payload.customer.ssn).
  5. Standard Discovery classification (detectors, rules, scoring) runs on each extracted value.
  6. Tags are applied at the nested path in the Data Inventory.
Column: customer_data
Value:  {"name":"Jane","ssn":"123-45-6789"}
                ▼  (embedded tagging enabled)
Paths:  customer_data.name
        customer_data.ssn  →  separate classification & tags per path
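
The flattening step can be illustrated with a short Python sketch. This is not Discovery's implementation; the function name `flatten_json` and the choice to return a dictionary are assumptions made for illustration. It only shows how a column name and a JSON value could yield dot-separated leaf paths, with array positions as numeric segments.

```python
import json


def flatten_json(column, value):
    """Return {nested_path: leaf_value} for an embedded JSON value (illustrative only)."""
    leaves = {}

    def walk(node, path):
        if isinstance(node, dict):
            for key, child in node.items():
                walk(child, f"{path}.{key}")
        elif isinstance(node, list):
            # Array elements become numeric path segments, e.g. items.0
            for index, child in enumerate(node):
                walk(child, f"{path}.{index}")
        else:
            # Scalar leaf: string, number, boolean, or null
            leaves[path] = node

    walk(json.loads(value), column)
    return leaves


paths = flatten_json("customer_data", '{"name":"Jane","ssn":"123-45-6789"}')
print(paths)
# {'customer_data.name': 'Jane', 'customer_data.ssn': '123-45-6789'}
```

Each key of the returned dictionary corresponds to one nested field that would then be classified on its own.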

JSON before XML

Discovery always attempts JSON parsing first when the value begins with {. XML parsing runs only when JSON parsing does not yield extractable fields and the value begins with <.
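
As a rough sketch of this ordering (an assumption-laden illustration, not Discovery's code), the hypothetical helper `pick_embedded_format` below tries JSON for values that begin with `{` and considers XML only for values that begin with `<`. Treating "a non-empty JSON object" as equivalent to "yields extractable fields" is a simplification.

```python
import json
import xml.etree.ElementTree as ET


def pick_embedded_format(value):
    """Return "json", "xml", or None for a column value (illustrative only)."""
    if value.startswith("{"):
        try:
            doc = json.loads(value)
        except json.JSONDecodeError:
            doc = None
        # Simplification: a non-empty object stands in for "fields were extracted".
        if isinstance(doc, dict) and doc:
            return "json"
    if value.startswith("<"):
        try:
            ET.fromstring(value)
        except ET.ParseError:
            return None
        return "xml"
    return None


print(pick_embedded_format('{"ssn":"123-45-6789"}'))   # json
print(pick_embedded_format("<a><b>1</b></a>"))         # xml
print(pick_embedded_format("plain text"))              # None
```

Values that match neither prefix, or that fail to parse, fall through and would be treated as an ordinary string.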

Supported data sources

Embedded tagging applies wherever Discovery performs structured content scanning, including:

  • JDBC-connected databases (Snowflake, PostgreSQL, SQL Server, and others)
  • Delimited files (CSV) and columnar formats (Parquet) scanned as structured resources
  • Other structured scan modes that surface column name and value pairs

It does not replace scanning of standalone JSON or XML files; those are handled as unstructured or semi-structured resources according to your scan configuration.

Nested field naming

Source         Path format                                   Example
Embedded JSON  <column-name>.<dot-separated-json-path>       orders.items.0.sku
Embedded XML   <column-name>.<dot-separated-element-path>    Invoice.Customer.Email

Array elements in JSON may appear as numeric path segments (for example, items.0, items.1).
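
For the XML case, a comparable sketch is shown below. The helper `flatten_xml` is hypothetical. Whether the root element appears in the path, and how attributes, namespaces, or repeated sibling elements are named, is not stated here; this sketch includes the root tag, ignores attributes, and returns a list of pairs so that repeated siblings are not silently merged. All of these choices are assumptions.

```python
import xml.etree.ElementTree as ET


def flatten_xml(column, value):
    """Return (nested_path, text) pairs for leaf elements (illustrative only)."""
    leaves = []

    def walk(element, path):
        children = list(element)
        if not children:
            # Leaf element: keep its text content, trimmed of whitespace
            leaves.append((path, (element.text or "").strip()))
            return
        for child in children:
            walk(child, f"{path}.{child.tag}")

    root = ET.fromstring(value)
    # Assumption: the root tag is part of the path
    walk(root, f"{column}.{root.tag}")
    return leaves


xml_value = "<Customer><Email>jane@example.com</Email><Phone>555-0100</Phone></Customer>"
for path, text in flatten_xml("Invoice", xml_value):
    print(path, "=", text)
# Invoice.Customer.Email = jane@example.com
# Invoice.Customer.Phone = 555-0100
```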

Unstructured text inside structured tables

A related optional setting—DISCOVERY_UNSTRUCTURED_VALUE_CHECKING_ENABLED—addresses a different case: plain-language text in structured columns (for example, a notes column with sentences, not JSON).

When enabled on a structured resource:

  • Discovery counts non-punctuation tokens in each field value.
  • Values with more than DISCOVERY_NUM_TOKENS_FOR_UNSTRUCTURED_DATA_DETECTION tokens (default 5) are classified using unstructured rules and NER, in addition to standard structured detectors.
  • This also applies to leaf values extracted from embedded JSON/XML when embedded tagging is on.
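
The token threshold can be sketched as follows. The exact tokenizer Discovery uses is not described here, so this example assumes whitespace splitting and treats a token as punctuation when it consists only of ASCII punctuation characters; `count_tokens` and `looks_like_prose` are hypothetical names.

```python
import string

# Mirrors the documented default of DISCOVERY_NUM_TOKENS_FOR_UNSTRUCTURED_DATA_DETECTION
THRESHOLD = 5


def count_tokens(value):
    """Count whitespace-separated tokens that are not made up solely of punctuation."""
    return sum(1 for token in value.split() if token.strip(string.punctuation))


def looks_like_prose(value, threshold=THRESHOLD):
    """True when the value has more than `threshold` non-punctuation tokens."""
    return count_tokens(value) > threshold


print(looks_like_prose("Customer called about a refund - SSN 123-45-6789"))  # True (7 tokens)
print(looks_like_prose("one two three four five"))                           # False (exactly 5)
print(looks_like_prose("123-45-6789"))                                       # False (1 token)
```

The comparison is strict, so a value with exactly five tokens does not qualify under the default.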

This is separate from Using Metadata Dictionaries in Unstructured Content Scanning, which controls how metadata dictionaries apply to unstructured files and related content.

Configuration

For step-by-step setup (Ansible variables, restart, and verification), see Configuring Embedded JSON and XML Tagging.

Quick reference

Ansible variable                                      Default  Purpose
DISCOVERY_TAG_EMBEDDED_XML_JSON                       "false"  Enable embedded JSON/XML parsing and tagging
DISCOVERY_MAX_FIELD_EMBEDDED_JSON                     50       Max leaf fields per embedded JSON value
DISCOVERY_MAX_FIELD_EMBEDDED_XML                      50       Max leaf fields per embedded XML value
DISCOVERY_UNSTRUCTURED_VALUE_CHECKING_ENABLED         "false"  Unstructured rules on prose-like structured values
DISCOVERY_NUM_TOKENS_FOR_UNSTRUCTURED_DATA_DETECTION  5        Exclusive token-count threshold; values with more than this many non-punctuation tokens qualify as prose
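
To give a feel for what a per-value field cap means, the sketch below limits extraction to the first 50 leaf fields of a wide JSON value. It assumes extraction simply stops after the cap in document order, which is not confirmed here; `iter_leaves` and `capped_leaves` are hypothetical helpers.

```python
import json
from itertools import islice


def iter_leaves(node, path):
    """Yield (path, value) pairs for every scalar leaf, in document order."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_leaves(child, f"{path}.{key}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_leaves(child, f"{path}.{index}")
    else:
        yield path, node


def capped_leaves(column, value, max_fields=50):
    """Return at most `max_fields` leaves (assumed behavior of the cap)."""
    return list(islice(iter_leaves(json.loads(value), column), max_fields))


wide = json.dumps({f"field_{i}": i for i in range(200)})
print(len(capped_leaves("payload", wide)))  # 50
```

Under this assumption, fields beyond the cap would go unclassified, which is why the limit trades coverage against scan duration.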

Best practices

  • Enable explicitly: Set DISCOVERY_TAG_EMBEDDED_XML_JSON: "true" in inventory when you rely on nested-path tags; do not assume it is on by default.
  • Cap large documents: Use DISCOVERY_MAX_FIELD_EMBEDDED_JSON and DISCOVERY_MAX_FIELD_EMBEDDED_XML to balance coverage and scan duration on wide JSON/XML payloads.
  • Combine with column rules: Nested paths still benefit from content detectors and structured rules; ensure models and dictionaries cover the sensitive types inside your JSON/XML schemas.
  • Review nested tags in the portal: After scanning, verify tags on expected paths (for example, customer_data.ssn) before syncing to external systems.

See also