Classifications via random sampling
By default, PrivaceraCloud scans a database at a shallow depth: for performance, the system examines only the topmost records of the database to derive classifications.
This default assumes that the database is uniform, with normalized data, so that these records accurately represent the whole.
However, in some unnormalized databases, this uniformity might be lacking.
If you suspect that your database values are not uniform, you can configure PrivaceraCloud to take a random sample from the entire database for analysis during classification.
One purpose of random sampling is to help isolate these data variations so that you can eliminate them.
Supported JDBC applications for random sampling
Random sampling is supported for the following applications:
- MySQL
- Oracle
- Trino
If you configure random sampling for any other database, it is ignored.
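The prerequisites below call for JDBC connection details. For reference, connection URLs for these three sources typically follow the formats below; the hosts, ports, catalogs, and database names are placeholders:

```
jdbc:mysql://db.example.com:3306/sales
jdbc:oracle:thin:@//db.example.com:1521/ORCLPDB1
jdbc:trino://trino.example.com:8080/hive/default
```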
Prerequisites for random sampling
- Know the names of the applications you want to randomly sample.
- Have the JDBC connection details for those applications on hand.
- To minimize performance impact, determine whether your database should be considered "large". By default, PrivaceraCloud considers any database with 10,000 or more records to be large; for a large database, the random sample is drawn from a subset of the data. A sketch for checking the row count yourself appears after this list.
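If you want to verify the row count yourself before enabling random sampling, the following minimal sketch shows one way to do it. It assumes a MySQL source and the mysql-connector-python driver; the host, credentials, database, and table names are placeholders, not Privacera defaults.

```python
# Minimal sketch: check whether a table crosses PrivaceraCloud's default
# "large" threshold of 10,000 records. All connection details are placeholders.
import mysql.connector  # assumes mysql-connector-python is installed

LARGE_DATASET_THRESHOLD = 10_000  # PrivaceraCloud's default cutoff

def is_large_table(table: str) -> bool:
    conn = mysql.connector.connect(
        host="db.example.com",   # placeholder host
        user="scanner",          # placeholder user
        password="secret",       # supply your own credentials
        database="sales",        # placeholder database
    )
    try:
        cur = conn.cursor()
        cur.execute(f"SELECT COUNT(*) FROM `{table}`")
        (row_count,) = cur.fetchone()
        return row_count >= LARGE_DATASET_THRESHOLD
    finally:
        conn.close()

print(is_large_table("customers"))  # True when the table has 10,000+ records
```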
Define datasource (application) and configure random sampling
Random sampling is part of configuring a datasource. For details on setup, see Applications.
To enable random sampling for a database:
1. Go to Settings > Applications.
2. Under Connected Applications, click the name of the application.
3. On the BASIC tab, click the jdbc.random.record.fetching toggle.
4. If your database has more than 10,000 records, specify the approximate number of records in the rows.as.small.dataset field.
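The settings in steps 3 and 4 correspond to the property names shown in the UI. As an illustration only, assuming a properties-style representation, they might look like the sketch below; the record-count value is hypothetical and should reflect your own database.

```properties
# Toggle from step 3: enable random sampling for this datasource
jdbc.random.record.fetching=true
# Field from step 4: approximate record count (illustrative value)
rows.as.small.dataset=250000
```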
Effects of random sampling
Random sampling has some visible effects.
Performance impact
You might notice a delay when random samples run. Scan times increase with the size of the sample.
Variations in classifications
Do not expect identical classification results for the same database from one random sample to the next.
Each random sample is unique: it operates on a randomly selected subset of the data, and depending on which values happen to be drawn, the classification results can vary from sample to sample.
For example, suppose an EMAIL column does not have consistent values:
- Some records contain a first and a last name separated by a delimiter, with an @-sign indicating an Internet domain. A random sample dominated by such records can yield a consistent classification of PERSON NAME.
- Other records contain a bare username, with no delimited last name and no @-sign at all. This inconsistent variation in the data makes a concrete classification difficult to derive.

Because each run draws its own unique random sample, this inherent inconsistency in the column values can produce different classifications from run to run.
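To make the effect concrete, here is a toy simulation, not Privacera's actual classifier, of how random samples drawn from such a mixed column can classify differently on each run. The data mix, matching rule, confidence threshold, and sample size are all invented for illustration.

```python
# Toy simulation of classification variance under random sampling.
# This is NOT Privacera's classifier; the rule and threshold are invented.
import random

# Mixed-quality EMAIL column: 60% of values have a delimited first/last
# name plus an @-sign; 40% are bare usernames with neither.
column = ["jane.doe@example.com"] * 600 + ["jdoe"] * 400

def classify(sample):
    """Return PERSON NAME when the delimiter + @-sign pattern dominates."""
    hits = sum(1 for v in sample if "@" in v and "." in v.split("@")[0])
    if hits / len(sample) >= 0.6:  # invented confidence threshold
        return "PERSON NAME"
    return "UNDETERMINED"

# Each run draws its own unique random subset, so results can differ.
for run in range(5):
    sample = random.sample(column, 30)
    print(f"run {run}: {classify(sample)}")
```

With the column split 60/40, a sample of 30 records hovers right around the threshold, so successive runs can flip between PERSON NAME and UNDETERMINED, which is exactly the run-to-run variation described above.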