Configuring Embedded JSON and XML Tagging¶

Structured data sources—JDBC tables, CSV, Parquet, and similar formats—sometimes store JSON or XML inside a single column. For example, a column named customer_data might contain:

JSON
{"name":"Jane","ssn":"123-45-6789"}

By default, Discovery classifies that value as one opaque string. When embedded JSON/XML tagging is enabled, Discovery parses the payload, classifies each nested path separately, and applies tags at paths such as customer_data.ssn and customer_data.name.

For an overview of how classification behaves with nested paths, see Embedded JSON and XML Tagging.

Prerequisites¶

Discovery is installed and running. Refer to Discovery installation steps.
At least one structured data source is configured and scannable (for example, JDBC, CSV, or Parquet).

When to enable¶

Enable embedded JSON/XML tagging when:

Sensitive attributes are stored as JSON or XML inside VARCHAR, CLOB, or similar columns.
You need field-level tags (for example, on payload.email) rather than a single tag on the whole column.
Column names alone do not reveal the sensitive paths inside the document.

Optionally enable unstructured value checking (see below) when structured tables also contain long free-text values (for example, a notes or comments column) that should be classified with unstructured rules.

Configuration¶

Embedded tagging is controlled through Discovery Ansible variables. These are written to privacera_discovery_custom.properties during deployment.

Step 1 — Edit Discovery variables¶

SSH into the instance where Privacera Manager is installed.
Navigate to the Privacera Manager directory:
Bash
1
cd ~/privacera/privacera-manager

Open the Discovery variables file for your cloud provider:

AWSAzureGCP

Bash
1	`vi config/custom-vars/vars.discovery.aws.yml`

Bash
1	`vi config/custom-vars/vars.discovery.azure.yml`

Bash
1	`vi config/custom-vars/vars.discovery.gcp.yml`

Step 2 — Set embedded JSON/XML properties¶

Add or update the following variables:

YAML
# Enable parsing and tagging of JSON/XML inside structured column values
DISCOVERY_TAG_EMBEDDED_XML_JSON: "true"

# Maximum leaf fields extracted per embedded JSON value (default: 50)
DISCOVERY_MAX_FIELD_EMBEDDED_JSON: 50

# Maximum leaf fields extracted per embedded XML value (default: 50)
DISCOVERY_MAX_FIELD_EMBEDDED_XML: 50

Ansible variable	Discovery property	Default	Description
`DISCOVERY_TAG_EMBEDDED_XML_JSON`	`privacera.discovery.tag.field.embedded`	`"false"`	Master switch for embedded JSON/XML parsing and tagging
`DISCOVERY_MAX_FIELD_EMBEDDED_JSON`	`privacera.discovery.max.field.embedded.json`	`50`	Cap on leaf fields extracted from one JSON value
`DISCOVERY_MAX_FIELD_EMBEDDED_XML`	`privacera.discovery.max.field.embedded.xml`	`50`	Cap on leaf fields extracted from one XML value

Set the master switch explicitly

The shipped Ansible default for DISCOVERY_TAG_EMBEDDED_XML_JSON is "false". Set it to "true" in your vars.discovery.*.yml file when you want embedded tagging enabled.

Tuning field limits

Increase DISCOVERY_MAX_FIELD_EMBEDDED_JSON or DISCOVERY_MAX_FIELD_EMBEDDED_XML for wide or deeply nested documents.
Decrease them to limit scan time on columns that contain very large JSON or XML payloads.
For JSON only, setting DISCOVERY_MAX_FIELD_EMBEDDED_JSON to 0 removes the field cap. For XML, use a positive value (for example, 50).

Step 3 (Optional) — Unstructured text in structured columns¶

If structured tables contain long prose in some columns (for example, clinical notes or ticket descriptions), you can enable unstructured-style classification on those values:

YAML
# Apply unstructured rules to prose-like values in structured columns
DISCOVERY_UNSTRUCTURED_VALUE_CHECKING_ENABLED: "true"

# Exclusive token threshold: values with more than this many non-punctuation tokens qualify as unstructured prose (default: 5)
DISCOVERY_NUM_TOKENS_FOR_UNSTRUCTURED_DATA_DETECTION: 5

Ansible variable	Discovery property	Default	Description
`DISCOVERY_UNSTRUCTURED_VALUE_CHECKING_ENABLED`	`privacera.discovery.unstructured.value.checking.enabled`	`"false"`	Enable unstructured rules and NER on qualifying structured field values
`DISCOVERY_NUM_TOKENS_FOR_UNSTRUCTURED_DATA_DETECTION`	`privacera.discovery.num.tokens.unstructured.data.detection`	`5`	Exclusive token-count threshold (excluding punctuation-only tokens); values must exceed this count to qualify as prose

When embedded tagging is enabled, extracted JSON/XML sub-fields are also evaluated under this setting if it is enabled.

Step 4 — Save and close the variables file¶

After adding the embedded JSON/XML variables—and, if you use it, the optional unstructured settings in Step 3—save the file and exit the editor.

Step 5 — Restart Discovery¶

After changing variables, restart Discovery services:

Bash
cd ~/privacera/privacera-manager
./privacera-manager.sh setup
./pm_with_helm.sh upgrade 

Verify configuration¶

After restart, run a scan on a table or file that contains embedded JSON or XML. In scan logs (INFO level), Discovery reports the effective values for embedded tagging and unstructured checking for each resource.

In the portal, confirm that tags appear on nested paths (for example, customer_data.ssn) rather than only on the parent column name.

Property reference (direct edit)¶

If you edit privacera_discovery_custom.properties directly instead of Ansible, use:

Properties
privacera.discovery.tag.field.embedded=true
privacera.discovery.max.field.embedded.json=50
privacera.discovery.max.field.embedded.xml=50
privacera.discovery.unstructured.value.checking.enabled=true
privacera.discovery.num.tokens.unstructured.data.detection=5

Setting	Ansible variable	Role
Maximum characters classified per field value	`DISCOVERY_CONTENT_MAX_CHARACTER`	Truncates long extracted values before classification (default: 10000)

Operational considerations¶

Performance: Embedded parsing adds JSON/XML processing and additional classifier passes per column. Tune DISCOVERY_MAX_FIELD_EMBEDDED_* on large payloads.
Field naming: Nested JSON paths appear as parentColumn.nested.path. XML paths use dot-separated element names from the document root.
Detection order: JSON is attempted first when a value starts with {. XML is attempted when JSON parsing yields no fields and the value starts with <.

Prev Advanced Configuration