Configuring Embedded JSON and XML Tagging¶
Structured data sources (JDBC tables, CSV, Parquet, and similar formats) sometimes store JSON or XML inside a single column. For example, a column named customer_data might contain a JSON object with nested fields such as ssn and name.
By default, Discovery classifies that value as one opaque string. When embedded JSON/XML tagging is enabled, Discovery parses the payload, classifies each nested path separately, and applies tags at paths such as customer_data.ssn and customer_data.name.
For an overview of how classification behaves with nested paths, see Embedded JSON and XML Tagging.
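The nested-path behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Discovery's implementation: the function name `flatten_json`, the sample payload, and the handling of arrays (list elements share their parent's path) are all assumptions made for the example.

```python
import json


def flatten_json(column, value, sep="."):
    """Map each leaf of a JSON string to a dotted path rooted at the column name."""

    def walk(node, prefix):
        if isinstance(node, dict):
            for key, child in node.items():
                yield from walk(child, f"{prefix}{sep}{key}")
        elif isinstance(node, list):
            # Assumption for this sketch: array elements reuse the parent path.
            for child in node:
                yield from walk(child, prefix)
        else:
            yield prefix, node

    paths = {}
    for path, leaf in walk(json.loads(value), column):
        paths.setdefault(path, []).append(leaf)
    return paths


# Hypothetical payload stored in a column named customer_data.
sample = '{"name": "Jane Doe", "ssn": "000-00-0000", "address": {"zip": "00000"}}'
flattened = flatten_json("customer_data", sample)

for path in flattened:
    print(path)
```

Each printed path (for example, customer_data.ssn) corresponds to a location where a field-level tag could be applied instead of a single tag on the whole column.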
Prerequisites¶
- Discovery is installed and running. Refer to Discovery installation steps.
- At least one structured data source is configured and scannable (for example, JDBC, CSV, or Parquet).
When to enable¶
Enable embedded JSON/XML tagging when:
- Sensitive attributes are stored as JSON or XML inside VARCHAR, CLOB, or similar columns.
- You need field-level tags (for example, on payload.email) rather than a single tag on the whole column.
- Column names alone do not reveal the sensitive paths inside the document.
Optionally enable unstructured value checking (see below) when structured tables also contain long free-text values (for example, a notes or comments column) that should be classified with unstructured rules.
Configuration¶
Embedded tagging is controlled through Discovery Ansible variables. These are written to privacera_discovery_custom.properties during deployment.
Step 1 — Edit Discovery variables¶
- SSH into the instance where Privacera Manager is installed.
- Navigate to the Privacera Manager directory.
- Open the Discovery variables file for your cloud provider (the vars.discovery.*.yml file).
Step 2 — Set embedded JSON/XML properties¶
Add or update the following variables:
| Ansible variable | Discovery property | Default | Description |
|---|---|---|---|
| DISCOVERY_TAG_EMBEDDED_XML_JSON | privacera.discovery.tag.field.embedded | "false" | Master switch for embedded JSON/XML parsing and tagging |
| DISCOVERY_MAX_FIELD_EMBEDDED_JSON | privacera.discovery.max.field.embedded.json | 50 | Cap on leaf fields extracted from one JSON value |
| DISCOVERY_MAX_FIELD_EMBEDDED_XML | privacera.discovery.max.field.embedded.xml | 50 | Cap on leaf fields extracted from one XML value |
Set the master switch explicitly
The shipped Ansible default for DISCOVERY_TAG_EMBEDDED_XML_JSON is "false". Set it to "true" in your vars.discovery.*.yml file when you want embedded tagging enabled.
Tuning field limits
- Increase DISCOVERY_MAX_FIELD_EMBEDDED_JSON or DISCOVERY_MAX_FIELD_EMBEDDED_XML for wide or deeply nested documents.
- Decrease them to limit scan time on columns that contain very large JSON or XML payloads.
- For JSON only, setting DISCOVERY_MAX_FIELD_EMBEDDED_JSON to 0 removes the field cap. For XML, use a positive value (for example, 50).
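As a rough illustration of how the JSON field cap behaves, the sketch below truncates a list of extracted leaf paths. The function name `cap_json_leaf_fields` is hypothetical, and keeping the first fields in document order is an assumption; this page does not state which fields are retained once the cap is reached.

```python
def cap_json_leaf_fields(leaf_paths, max_fields=50):
    """Limit the number of leaf fields kept from one JSON value.

    max_fields mirrors DISCOVERY_MAX_FIELD_EMBEDDED_JSON: 0 removes the cap.
    Assumption: when the cap applies, the first fields encountered are kept.
    """
    paths = list(leaf_paths)
    if max_fields == 0:
        return paths
    return paths[:max_fields]


# Hypothetical wide document with 120 leaf fields.
wide_document = [f"payload.field_{i}" for i in range(120)]

capped = cap_json_leaf_fields(wide_document)                 # default cap of 50
uncapped = cap_json_leaf_fields(wide_document, max_fields=0)  # cap removed

print(len(capped), len(uncapped))
```

The sketch covers JSON only, because 0 is documented as "no cap" for JSON while XML requires a positive value.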
Step 3 (Optional) — Unstructured text in structured columns¶
If structured tables contain long prose in some columns (for example, clinical notes or ticket descriptions), you can enable unstructured-style classification on those values:
| Ansible variable | Discovery property | Default | Description |
|---|---|---|---|
| DISCOVERY_UNSTRUCTURED_VALUE_CHECKING_ENABLED | privacera.discovery.unstructured.value.checking.enabled | "false" | Enable unstructured rules and NER on qualifying structured field values |
| DISCOVERY_NUM_TOKENS_FOR_UNSTRUCTURED_DATA_DETECTION | privacera.discovery.num.tokens.unstructured.data.detection | 5 | Exclusive token-count threshold (excluding punctuation-only tokens); values must exceed this count to qualify as prose |
When embedded tagging is enabled, extracted JSON/XML sub-fields are also evaluated under this setting if it is enabled.
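The exclusive threshold can be easy to misread, so here is a minimal sketch of the check. It assumes whitespace tokenization and ASCII punctuation; Discovery's actual tokenizer is not described on this page, and the function name `qualifies_as_prose` is hypothetical.

```python
import string


def qualifies_as_prose(value, threshold=5):
    """Return True when the value has more than `threshold` countable tokens.

    Assumptions: tokens are split on whitespace, and a token made up only of
    ASCII punctuation characters is not counted.
    """
    tokens = value.split()
    counted = [
        token for token in tokens
        if not all(ch in string.punctuation for ch in token)
    ]
    # The threshold is exclusive: exactly `threshold` tokens does not qualify.
    return len(counted) > threshold


print(qualifies_as_prose("one two three four five"))        # 5 tokens
print(qualifies_as_prose("one two three four five six"))    # 6 tokens
print(qualifies_as_prose("one two three four five -- !!"))  # 5 countable tokens
```

With the default of 5, a value needs at least six countable tokens before unstructured rules would apply to it.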
Step 4 — Save and close the variables file¶
After adding the embedded JSON/XML variables—and, if you use it, the optional unstructured settings in Step 3—save the file and exit the editor.
Step 5 — Restart Discovery¶
After changing the variables, restart the Discovery services so that the new values take effect.
Verify configuration¶
After restart, run a scan on a table or file that contains embedded JSON or XML. In scan logs (INFO level), Discovery reports the effective values for embedded tagging and unstructured checking for each resource.
In the portal, confirm that tags appear on nested paths (for example, customer_data.ssn) rather than only on the parent column name.
Property reference (direct edit)¶
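If you edit privacera_discovery_custom.properties directly instead of using Ansible, set the Discovery properties listed in the tables above:
- privacera.discovery.tag.field.embedded
- privacera.discovery.max.field.embedded.json
- privacera.discovery.max.field.embedded.xml
- privacera.discovery.unstructured.value.checking.enabled
- privacera.discovery.num.tokens.unstructured.data.detection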
Related settings¶
| Setting | Ansible variable | Role |
|---|---|---|
| Maximum characters classified per field value | DISCOVERY_CONTENT_MAX_CHARACTER | Truncates long extracted values before classification (default: 10000) |
Operational considerations¶
- Performance: Embedded parsing adds JSON/XML processing and additional classifier passes per column. Tune DISCOVERY_MAX_FIELD_EMBEDDED_* on large payloads.
- Field naming: Nested JSON paths appear as parentColumn.nested.path. XML paths use dot-separated element names from the document root.
- Detection order: JSON is attempted first when a value starts with {. XML is attempted when JSON parsing yields no fields and the value starts with <.
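The detection order and field naming can be sketched as follows. This is a simplified model built on Python's standard json and xml.etree.ElementTree modules, not Discovery's parser. The function names are hypothetical, the prefix check is applied to the raw value without trimming whitespace (an assumption), and the XML parsing here is not hardened against malicious input.

```python
import json
import xml.etree.ElementTree as ET


def json_leaf_paths(column, value):
    """Return dotted leaf paths (parentColumn.nested.path), or [] if parsing fails."""
    try:
        doc = json.loads(value)
    except ValueError:
        return []
    paths = []

    def walk(node, prefix):
        if isinstance(node, dict):
            for key, child in node.items():
                walk(child, f"{prefix}.{key}")
        elif isinstance(node, list):
            for child in node:  # assumption: arrays reuse the parent path
                walk(child, prefix)
        elif prefix not in paths:
            paths.append(prefix)

    walk(doc, column)
    return paths


def xml_leaf_paths(value):
    """Return dot-separated element names from the document root, or [] on failure."""
    try:
        root = ET.fromstring(value)
    except ET.ParseError:
        return []
    paths = []

    def walk(elem, prefix):
        children = list(elem)
        if not children and prefix not in paths:
            paths.append(prefix)
        for child in children:
            walk(child, f"{prefix}.{child.tag}")

    walk(root, root.tag)
    return paths


def detect_embedded(column, value):
    """Try JSON first for values starting with '{', then XML for values starting with '<'."""
    fields = []
    if value.startswith("{"):
        fields = json_leaf_paths(column, value)
        if fields:
            return "json", fields
    if not fields and value.startswith("<"):
        fields = xml_leaf_paths(value)
        if fields:
            return "xml", fields
    return "plain", []


print(detect_embedded("customer_data", '{"ssn": "000-00-0000"}'))
print(detect_embedded("customer_data", "<customer><ssn>000-00-0000</ssn></customer>"))
```

Values that match neither prefix, or that fail to parse, fall through to the "plain" branch, which corresponds to classifying the column value as a single string.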