Privacera Platform

Data Source Scanning


Register Data Sources

Before you use Privacera Discovery, register your data sources, including any JDBC-based systems.

Behavior of trailing '/' in data source URL/URIs

If a data source URL/URI ends with a trailing /, Privacera Discovery scans that folder individually. If it does not end with a trailing /, Privacera Discovery scans together all folders in the bucket whose paths begin with the URL/URI.

For example, say the following three folders are in an S3 bucket:

  • A

  • A_1

  • A_1_2

If these three folders need to be scanned individually, then the URL/URI in the data source should be listed as:

  • s3://bucket/A/

  • s3://bucket/A_1/

  • s3://bucket/A_1_2/

If the three folders need to be scanned together, then the URL/URI in the data source should be listed as:

  • s3://bucket/A

Or, if you want to scan A_1 and A_1_2, then the URL/URI should be listed as:

  • s3://bucket/A_1

This scans both s3://bucket/A_1 and s3://bucket/A_1_2, because both paths begin with s3://bucket/A_1.
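The examples above are consistent with treating the data source URL/URI as a plain string prefix. The following Python sketch illustrates that reading. It is not Privacera Discovery code: folders_scanned is a hypothetical helper, and simple prefix matching is an assumption inferred from the examples rather than a documented implementation detail.

```python
def folders_scanned(data_source_uri, folder_uris):
    """Return the folders covered by a data source URL/URI.

    Assumption (not confirmed by the product): a folder is covered when
    its URI, written with a trailing '/', starts with the data source URL/URI.
    """
    covered = []
    for folder in folder_uris:
        # Normalize each folder so it always ends with exactly one '/'.
        normalized = folder.rstrip("/") + "/"
        if normalized.startswith(data_source_uri):
            covered.append(folder)
    return covered


# The three folders from the example above.
FOLDERS = ["s3://bucket/A", "s3://bucket/A_1", "s3://bucket/A_1_2"]

print(folders_scanned("s3://bucket/A/", FOLDERS))   # trailing '/': only A
print(folders_scanned("s3://bucket/A", FOLDERS))    # no trailing '/': all three
print(folders_scanned("s3://bucket/A_1", FOLDERS))  # A_1 and A_1_2
```

Under this assumption, the trailing / works as a delimiter that stops the prefix from matching sibling folders with longer names.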

Adjusting default scan depth

Privacera Discovery operations are computationally intensive. Therefore, Discovery defaults to scanning only a sample of targeted data in order to determine whether sensitive information is present.

Individual customers are responsible for determining what level of scanning is necessary to meet their regulatory requirements. You can adjust the sampling size by setting the DISCOVERY*MAX* variables detailed in Discovery Custom Properties.

Scan setup

Using Privacera Discovery, you can configure scans and set threshold scores to determine if a resource should be reviewed for non-compliance. This is done from the Scan Setup page.

To view the Scan Setup page, select Discovery > Scan Setup from the navigation menu.

The Scan Setup page displays the following information:

  • Application Status: The total number of enabled and disabled applications.

  • System Classification: Sets the global percentage match that causes a scanned resource to be classified. To classify the associated tags automatically, turn on auto-classification with the enable/disable toggle.

  • Minimum Review: Sets the global minimum score at which tagged resources are sent to the Pending Review status under Classification for manual verification. Tag scores below the review score are ignored.

  • Reduce Score: If a column has empty data but is meta-tagged with a 100% score, this setting reduces the score to the value set here. For example, if it is set to 50, the final score for that column tag is 50, and the tag is re-evaluated against the auto-classification and review score thresholds.

    If you enable the Reduce Score toggle, the score of the associated meta tag is reduced. If you disable the Reduce Score feature, the meta tags will not be auto-classified.

  • Rescan Type: For file system and database applications, scanning options include:

    • Incremental: Only scans resources that have been modified since the previous scan.

    • Scan: Rescans the resource completely regardless of previous scans.
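As a rough illustration of how the System Classification, Minimum Review, and Reduce Score settings described above might interact, here is a minimal Python sketch. It is not Privacera code. The class ScanSetup, the function evaluate_tag, the status strings, and the numeric defaults are all hypothetical, and the use of "greater than or equal to" comparisons is an assumption. The sketch models only the case where Reduce Score is enabled.

```python
from dataclasses import dataclass


@dataclass
class ScanSetup:
    # Illustrative values only; real values are set on the Scan Setup page.
    classification_threshold: int = 80  # System Classification
    auto_classify: bool = True          # auto-classification toggle
    minimum_review: int = 40            # Minimum Review
    reduce_score: int = 50              # Reduce Score (assumed enabled)


def evaluate_tag(score, column_is_empty, is_meta_tag, setup):
    """Return (final_score, status) for one tag on one column."""
    # Reduce Score: an empty column meta-tagged at 100% gets the configured score.
    if is_meta_tag and column_is_empty and score == 100:
        score = setup.reduce_score

    # Assumption: thresholds are inclusive lower bounds.
    if setup.auto_classify and score >= setup.classification_threshold:
        return score, "classified"
    if score >= setup.minimum_review:
        return score, "pending review"
    return score, "ignored"


setup = ScanSetup()
print(evaluate_tag(100, True, True, setup))   # empty column, meta-tagged at 100%
print(evaluate_tag(90, False, False, setup))  # strong content match
print(evaluate_tag(30, False, False, setup))  # below the review score
```

With these illustrative values, the empty meta-tagged column ends with a score of 50, which is below the classification threshold but above the review score, so it is routed to manual review.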


To set up tag-based policies for S3, follow these steps:

  1. Create privacera_tags in the Ranger Tag Based Policy.

  2. Associate privacera_tags with the S3 service.

  3. Create a JSON file where you can add tags.

    vi s3_tag.json

    Sample JSON:

  4. Push the tag to Ranger.

    curl -i -L -k -u admin:welcome1 -H "Content-type: application/json" -d @s3_tag.json -X PUT http://${RANGER_HOST}

    HTTP/1.1 204 No Content
    Set-Cookie: RANGERADMINSESSIONID=517FD2032481415D188C6925FA96E7E3; Path=/; HttpOnly
    X-Frame-Options: DENY
    X-XSS-Protection: 1; mode=block
    Strict-Transport-Security: max-age=31536000; includeSubDomains
    Content-Security-Policy: default-src 'none'; script-src 'self' 'unsafe-inline' 'unsafe-eval'; connect-src 'self'; img-src 'self'; style-src 'self' 'unsafe-inline';font-src 'self'
    Cache-Control: no-cache, no-store, max-age=0, must-revalidate
    Pragma: no-cache
    Expires: 0
    X-Content-Type-Options: nosniff
    Content-Type: application/json
    Date: Sun, 08 Mar 2020 18:55:44 GMT
    Server: Apache Ranger

    To get the list of tagged resources, run:

    curl -i -L -k -u admin:welcome1 -H "Content-type: application/json" -X GET http://${RANGER_HOST}



Test the tag-based policies for S3 using the sample above:

  1. Create the user <kate> in EC2 and grant the read, metaread, write, and metawrite permissions on the S3 bucket ${Bucket_Name} in the privacera_s3 service.

  2. Create a deny tag-based policy for user <kate> with tag = SSN, component = S3, and permissions = read, write.

  3. Try to access ${Bucket_Name} as user <kate>.

  4. A denied audit entry with the ${SSN} tag appears in the audits.

Start offline and realtime scans

There are two ways to scan resources in Privacera Discovery: offline scanning and realtime scanning.

Start offline scanning

You can manually scan resources (offline scanning) from the Data Source page.

To start offline scanning, follow these steps:

  1. From the navigation menu, select Discovery > Data Sources.

  2. Select a resource from the Applications list.

    Ensure that the application is enabled.

  3. On the Include Resource tab, select the Rescan checkbox for the resource to be scanned.

    The Info and Success dialog is displayed.

Start realtime scanning

By default, Privacera Discovery scans resources that you add to an application (realtime scanning). When a new file is added to the Include Resource tab of the Data Source page, realtime scanning occurs.

To scan a resource in realtime, the application must be enabled and the resource must be added to the Include Resource tab of the application. For example, to copy a file from the cluster to HDFS, use the following command:

hdfs dfs -put -f <local-src> … <HDFS_dest_path>

For AWS S3, you can fetch S3 tags. For more information, see Configure S3 for real-time scanning.

View classification results

You can view scan results on the Classification page. For more information, see Classification.