
Configuring Embedded JSON and XML Tagging

Structured data sources—JDBC tables, CSV, Parquet, and similar formats—sometimes store JSON or XML inside a single column. For example, a column named customer_data might contain:

JSON
{"name":"Jane","ssn":"123-45-6789"}

By default, Discovery classifies that value as one opaque string. When embedded JSON/XML tagging is enabled, Discovery parses the payload, classifies each nested path separately, and applies tags at paths such as customer_data.ssn and customer_data.name.

For an overview of how classification behaves with nested paths, see Embedded JSON and XML Tagging.
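To make the idea of per-path classification concrete, here is a minimal Python sketch of flattening an embedded JSON value into dotted paths prefixed with the column name. The function name `flatten_leaf_paths` is hypothetical, and the handling of arrays (items share the parent path) is an assumption; this is not Discovery's actual implementation.

```python
import json


def flatten_leaf_paths(column, raw_value):
    """Return (dotted_path, leaf_value) pairs for a JSON string stored in `column`."""
    leaves = []

    def walk(prefix, node):
        if isinstance(node, dict):
            for key, child in node.items():
                walk(f"{prefix}.{key}", child)
        elif isinstance(node, list):
            # Assumption: array items share the parent's path (no index suffix).
            for child in node:
                walk(prefix, child)
        else:
            leaves.append((prefix, node))

    walk(column, json.loads(raw_value))
    return leaves


for path, value in flatten_leaf_paths(
    "customer_data", '{"name":"Jane","ssn":"123-45-6789"}'
):
    print(path, "=", value)
# customer_data.name = Jane
# customer_data.ssn = 123-45-6789
```

Each resulting path can then be classified on its own, which is why a tag can land on customer_data.ssn without also landing on customer_data.name.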

Prerequisites

  • Discovery is installed and running. Refer to Discovery installation steps.
  • At least one structured data source is configured and scannable (for example, JDBC, CSV, or Parquet).

When to enable

Enable embedded JSON/XML tagging when:

  • Sensitive attributes are stored as JSON or XML inside VARCHAR, CLOB, or similar columns.
  • You need field-level tags (for example, on payload.email) rather than a single tag on the whole column.
  • Column names alone do not reveal the sensitive paths inside the document.

Optionally enable unstructured value checking (Step 3) when structured tables also contain long free-text values (for example, a notes or comments column) that should be classified with unstructured rules.

Configuration

Embedded tagging is controlled through Discovery Ansible variables. These are written to privacera_discovery_custom.properties during deployment.

Step 1 — Edit Discovery variables

  1. SSH into the instance where Privacera Manager is installed.
  2. Navigate to the Privacera Manager directory:

    Bash
    cd ~/privacera/privacera-manager
    
  3. Open the Discovery variables file for your cloud provider:

    AWS:

    Bash
    vi config/custom-vars/vars.discovery.aws.yml

    Azure:

    Bash
    vi config/custom-vars/vars.discovery.azure.yml

    GCP:

    Bash
    vi config/custom-vars/vars.discovery.gcp.yml
    

Step 2 — Set embedded JSON/XML properties

Add or update the following variables:

YAML
# Enable parsing and tagging of JSON/XML inside structured column values
DISCOVERY_TAG_EMBEDDED_XML_JSON: "true"

# Maximum leaf fields extracted per embedded JSON value (default: 50)
DISCOVERY_MAX_FIELD_EMBEDDED_JSON: 50

# Maximum leaf fields extracted per embedded XML value (default: 50)
DISCOVERY_MAX_FIELD_EMBEDDED_XML: 50

| Ansible variable | Discovery property | Default | Description |
|---|---|---|---|
| DISCOVERY_TAG_EMBEDDED_XML_JSON | privacera.discovery.tag.field.embedded | "false" | Master switch for embedded JSON/XML parsing and tagging |
| DISCOVERY_MAX_FIELD_EMBEDDED_JSON | privacera.discovery.max.field.embedded.json | 50 | Cap on leaf fields extracted from one JSON value |
| DISCOVERY_MAX_FIELD_EMBEDDED_XML | privacera.discovery.max.field.embedded.xml | 50 | Cap on leaf fields extracted from one XML value |

Set the master switch explicitly

The shipped Ansible default for DISCOVERY_TAG_EMBEDDED_XML_JSON is "false". Set it to "true" in your vars.discovery.*.yml file when you want embedded tagging enabled.

Tuning field limits

  • Increase DISCOVERY_MAX_FIELD_EMBEDDED_JSON or DISCOVERY_MAX_FIELD_EMBEDDED_XML for wide or deeply nested documents.
  • Decrease them to limit scan time on columns that contain very large JSON or XML payloads.
  • For JSON only, setting DISCOVERY_MAX_FIELD_EMBEDDED_JSON to 0 removes the field cap. For XML, use a positive value (for example, 50).
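
The effect of the field cap can be sketched in Python as follows. The names `iter_leaves` and `extract_capped` are hypothetical, and the sketch assumes that leaves beyond the cap are simply dropped; the documentation above states only that the value is a cap and that 0 removes it for JSON.

```python
import json
from itertools import islice


def iter_leaves(prefix, node):
    """Yield (dotted_path, value) for every leaf below `node` (arrays count as leaves here)."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_leaves(f"{prefix}.{key}", child)
    else:
        yield prefix, node


def extract_capped(column, raw_json, max_fields=50):
    """Return at most `max_fields` leaves; 0 means no cap (JSON only)."""
    leaves = iter_leaves(column, json.loads(raw_json))
    if max_fields == 0:
        return list(leaves)
    # Assumption: leaves past the cap are dropped rather than failing the scan.
    return list(islice(leaves, max_fields))


wide = json.dumps({f"f{i}": i for i in range(120)})
print(len(extract_capped("payload", wide)))      # 50
print(len(extract_capped("payload", wide, 10)))  # 10
print(len(extract_capped("payload", wide, 0)))   # 120
```

Under this assumption, a sensitive field that sits past the cap in a wide document would not be classified, which is the main reason to raise the limit for wide payloads.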

Step 3 (Optional) — Unstructured text in structured columns

If structured tables contain long prose in some columns (for example, clinical notes or ticket descriptions), you can enable unstructured-style classification on those values:

YAML
# Apply unstructured rules to prose-like values in structured columns
DISCOVERY_UNSTRUCTURED_VALUE_CHECKING_ENABLED: "true"

# Exclusive token threshold: values with more than this many non-punctuation tokens qualify as unstructured prose (default: 5)
DISCOVERY_NUM_TOKENS_FOR_UNSTRUCTURED_DATA_DETECTION: 5

| Ansible variable | Discovery property | Default | Description |
|---|---|---|---|
| DISCOVERY_UNSTRUCTURED_VALUE_CHECKING_ENABLED | privacera.discovery.unstructured.value.checking.enabled | "false" | Enable unstructured rules and NER on qualifying structured field values |
| DISCOVERY_NUM_TOKENS_FOR_UNSTRUCTURED_DATA_DETECTION | privacera.discovery.num.tokens.unstructured.data.detection | 5 | Exclusive token-count threshold (excluding punctuation-only tokens); values must exceed this count to qualify as prose |

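When both embedded tagging and unstructured value checking are enabled, sub-fields extracted from embedded JSON/XML are evaluated against the same token threshold and, if they qualify, classified with unstructured rules.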
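
The exclusive threshold is easy to misread, so here is a small Python sketch of the check. The function name `qualifies_as_prose` is hypothetical, and whitespace tokenization is an assumption; Discovery's actual tokenizer is not described here.

```python
import string


def qualifies_as_prose(value, threshold=5):
    """Return True when `value` has more than `threshold` non-punctuation tokens."""
    # Assumption: tokens are separated by whitespace.
    tokens = value.split()
    # Ignore tokens that consist only of punctuation characters.
    words = [t for t in tokens if not all(ch in string.punctuation for ch in t)]
    # Exclusive threshold: exactly `threshold` words does not qualify.
    return len(words) > threshold


print(qualifies_as_prose("one two three four five"))      # False (5 is not > 5)
print(qualifies_as_prose("one two three four five six"))  # True
print(qualifies_as_prose("a - b - c - d - e"))            # False (dashes ignored)
```

With the default of 5, a value needs at least six word tokens before unstructured rules are applied to it.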

Step 4 — Save and close the variables file

After adding the embedded JSON/XML variables—and, if you use it, the optional unstructured settings in Step 3—save the file and exit the editor.

Step 5 — Restart Discovery

After changing variables, restart Discovery services:

Bash
cd ~/privacera/privacera-manager
./privacera-manager.sh setup
./pm_with_helm.sh upgrade 

Verify configuration

After restart, run a scan on a table or file that contains embedded JSON or XML. In scan logs (INFO level), Discovery reports the effective values for embedded tagging and unstructured checking for each resource.

In the portal, confirm that tags appear on nested paths (for example, customer_data.ssn) rather than only on the parent column name.

Property reference (direct edit)

If you edit privacera_discovery_custom.properties directly instead of setting the Ansible variables, use:

Properties
privacera.discovery.tag.field.embedded=true
privacera.discovery.max.field.embedded.json=50
privacera.discovery.max.field.embedded.xml=50
privacera.discovery.unstructured.value.checking.enabled=true
privacera.discovery.num.tokens.unstructured.data.detection=5

| Setting | Ansible variable | Role |
|---|---|---|
| Maximum characters classified per field value | DISCOVERY_CONTENT_MAX_CHARACTER | Truncates long extracted values before classification (default: 10000) |
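
When editing the properties file by hand, a quick sanity check can catch a disabled master switch or a non-positive XML cap before you restart. The following Python sketch uses hypothetical helper names (`parse_properties`, `check_embedded_settings`) and a simplified key=value parser; it is not a Privacera tool, and the fallback defaults are the ones listed in the tables on this page.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_embedded_settings(props):
    """Return a list of human-readable problems with the embedded tagging settings."""
    problems = []
    enabled = props.get("privacera.discovery.tag.field.embedded", "false")
    if enabled.lower() != "true":
        problems.append("embedded tagging is disabled")
    xml_cap = int(props.get("privacera.discovery.max.field.embedded.xml", "50"))
    if xml_cap <= 0:
        # Only the JSON cap accepts 0 (no cap); XML needs a positive value.
        problems.append("XML field cap must be a positive value")
    return problems


sample = """
privacera.discovery.tag.field.embedded=true
privacera.discovery.max.field.embedded.json=0
privacera.discovery.max.field.embedded.xml=0
"""
print(check_embedded_settings(parse_properties(sample)))
# ['XML field cap must be a positive value']
```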

Operational considerations

  • Performance: Embedded parsing adds JSON/XML processing and additional classifier passes per column. Tune DISCOVERY_MAX_FIELD_EMBEDDED_* on large payloads.
  • Field naming: Nested JSON paths appear as parentColumn.nested.path. XML paths use dot-separated element names from the document root.
  • Detection order: JSON is attempted first when a value starts with {. XML is attempted when JSON parsing yields no fields and the value starts with <.
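
The detection order and path naming described above can be sketched in Python as follows. The function names are hypothetical, the sketch uses the standard-library `json` and `xml.etree.ElementTree` parsers rather than whatever Discovery uses internally, and prefixing XML paths with the column name is an assumption (the text above states only that XML paths start at the document root).

```python
import json
import xml.etree.ElementTree as ET


def json_leaves(prefix, node):
    """Yield (dotted_path, value) for each leaf of a parsed JSON object."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from json_leaves(f"{prefix}.{key}", child)
    else:
        yield prefix, node


def xml_leaves(prefix, elem):
    """Yield (dotted_path, text) for each XML element that has no children."""
    children = list(elem)
    if not children:
        yield prefix, (elem.text or "").strip()
    for child in children:
        yield from xml_leaves(f"{prefix}.{child.tag}", child)


def extract_embedded(column, value):
    """Try JSON first for values starting with '{', then XML for values starting with '<'."""
    fields = []
    if value.startswith("{"):
        try:
            fields = list(json_leaves(column, json.loads(value)))
        except json.JSONDecodeError:
            fields = []
    if not fields and value.startswith("<"):
        try:
            root = ET.fromstring(value)
        except ET.ParseError:
            return []
        # Assumption: the column name is prepended to the root element name.
        fields = list(xml_leaves(f"{column}.{root.tag}", root))
    return fields


print(extract_embedded("customer_data", '{"ssn":"123-45-6789"}'))
# [('customer_data.ssn', '123-45-6789')]
print(extract_embedded("customer_data", "<customer><ssn>123-45-6789</ssn></customer>"))
# [('customer_data.customer.ssn', '123-45-6789')]
```

Values that parse as neither format yield no sub-fields in this sketch, which corresponds to the column being classified as a plain string.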