Spark on EKS
Privacera plugin in Spark on EKS
This section describes how to use Privacera Manager to generate the setup script and the Spark custom configuration for SSL, and how to install the Privacera plugin in Spark on an EKS cluster.
Prerequisites
Ensure the following prerequisites are met:
Spark is running on an EKS cluster (a quick connectivity check is shown after this list).
Privacera services must be up and running.
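As a quick sanity check, assuming kubectl on your workstation is already configured against the target EKS cluster, you can confirm the cluster and its nodes are reachable before proceeding:

```bash
# Confirm kubectl points at the intended EKS cluster and the nodes are Ready
kubectl config current-context
kubectl get nodes
```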
Configuration
SSH to the instance as USER.
Run the following commands.
```bash
cd ~/privacera/privacera-manager
cp config/sample-vars/vars.spark-standalone.yml config/custom-vars/
vi config/custom-vars/vars.spark-standalone.yml
```
Edit the following properties. For property details and descriptions, refer to the Configuration properties section below.
```yaml
SPARK_STANDALONE_ENABLE: "true"
SPARK_ENV_TYPE: "<PLEASE_CHANGE>"
SPARK_HOME: "<PLEASE_CHANGE>"
SPARK_USER_HOME: "<PLEASE_CHANGE>"
```
Run the following commands:
```bash
cd ~/privacera/privacera-manager
./privacera-manager.sh update
```
After the update is complete, the Spark custom configuration for SSL (`spark_custom_conf.zip`) is generated at `~/privacera/privacera-manager/output/spark-standalone`.
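Optionally, confirm that the archive was generated at the path above:

```bash
# The file is required when building the Spark Docker image in the next section
ls ~/privacera/privacera-manager/output/spark-standalone/spark_custom_conf.zip
```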
Create the Spark Docker Image
Run the following command to export `PRIVACERA_BASE_DOWNLOAD_URL`:

```bash
export PRIVACERA_BASE_DOWNLOAD_URL=<PRIVACERA_BASE_DOWNLOAD_URL>
```
Create a folder.
```bash
mkdir -p ~/privacera-spark-plugin
cd ~/privacera-spark-plugin
```
Download and extract the package using wget.
```bash
wget ${PRIVACERA_BASE_DOWNLOAD_URL}/spark-plugin/k8s-spark-pkg.tar.gz -O k8s-spark-pkg.tar.gz
tar xzf k8s-spark-pkg.tar.gz
rm -r k8s-spark-pkg.tar.gz
```
Copy the `spark_custom_conf.zip` file from the Privacera Manager output folder into the `files` folder.

```bash
cp ~/privacera/privacera-manager/output/spark-standalone/spark_custom_conf.zip files/spark_custom_conf.zip
```
You can build either the OLAC Docker image or the FGAC Docker image.
OLAC
To build the OLAC Docker image, use the following command:
./build_image.sh ${PRIVACERA_BASE_DOWNLOAD_URL} OLAC
FGAC
To build the FGAC Docker image, use the following command:
./build_image.sh ${PRIVACERA_BASE_DOWNLOAD_URL} FGAC
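Either way, you can verify that the image was built locally before testing it:

```bash
# The build tags the local image as privacera-spark-plugin:latest, which is used in the test step below
docker images privacera-spark-plugin
```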
Test the Spark Docker image.
Create an S3 bucket ${S3_BUCKET} for sample testing.
Download the sample data using the following command and upload it to the bucket at s3://${S3_BUCKET}/customer_data.
wget https://privacera-demo.s3.amazonaws.com/data/uploads/customer_data_clear/customer_data_without_header.csv
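One way to upload the file, assuming the AWS CLI is installed and has write access to the bucket:

```bash
# Upload the sample data to the location expected by the test commands below
aws s3 cp customer_data_without_header.csv s3://${S3_BUCKET}/customer_data/customer_data_without_header.csv
```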
Start the Docker container in interactive mode.
```bash
IMAGE=privacera-spark-plugin:latest
docker run --rm -i -t ${IMAGE} bash
```
Start spark-shell inside the Docker container.
```bash
JWT_TOKEN="<PLEASE_CHANGE>"
cd /opt/privacera/spark/bin
./spark-shell \
  --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}" \
  --conf "spark.hadoop.privacera.jwt.oauth.enable=true"
```
Run the following command to read the S3 file:
val df= spark.read.csv("s3a://${S3_BUCKET}/customer_data/customer_data_without_header.csv")
Exit the Docker shell.
exit
Publish the Spark Docker image to your Docker registry.
For `HUB`, `HUB_USERNAME`, and `HUB_PASSWORD`, use your Docker registry URL and login credentials. The value of `ENV_TAG` is user-defined and typically identifies your deployment environment, such as development, production, or test. For example, `ENV_TAG=dev` can be used for a development environment.

```bash
HUB=<PLEASE_CHANGE>
HUB_USERNAME=<PLEASE_CHANGE>
HUB_PASSWORD=<PLEASE_CHANGE>
ENV_TAG=<PLEASE_CHANGE>
DEST_IMAGE=${HUB}/privacera-spark-plugin:${ENV_TAG}
SOURCE_IMAGE=privacera-spark-plugin:latest
docker login -u ${HUB_USERNAME} -p ${HUB_PASSWORD} ${HUB}
docker tag ${SOURCE_IMAGE} ${DEST_IMAGE}
docker push ${DEST_IMAGE}
```
Deploy the Spark Plugin on the EKS Cluster
SSH to the EKS cluster where you want to deploy the Spark plugin.
Run the following command to export `PRIVACERA_BASE_DOWNLOAD_URL`:

```bash
export PRIVACERA_BASE_DOWNLOAD_URL=<PRIVACERA_BASE_DOWNLOAD_URL>
```
Create a folder.
```bash
mkdir ~/privacera-spark-plugin
cd ~/privacera-spark-plugin
```
Download and extract the package using wget.
```bash
wget ${PRIVACERA_BASE_DOWNLOAD_URL}/plugin/spark/k8s-spark-deploy.tar.gz -O k8s-spark-deploy.tar.gz
tar xzf k8s-spark-deploy.tar.gz
rm -r k8s-spark-deploy.tar.gz
cd k8s-spark-deploy/
```
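The extracted folder contains the deployment artifacts used in the following steps (for example penv.sh, replace.sh, and the Kubernetes .yml files); a quick listing confirms the extraction:

```bash
ls -1
```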
Open the `penv.sh` file and substitute the values of the following properties; refer to the table below. An illustrative example follows the table.

Property | Description | Example |
---|---|---|
SPARK_NAME_SPACE | Kubernetes namespace | privacera-spark-plugin-test |
SPARK_PLUGIN_ROLE_BINDING | Spark role binding | privacera-sa-spark-plugin-role-binding |
SPARK_PLUGIN_SERVICE_ACCOUNT | Spark service account | privacera-sa-spark-plugin |
SPARK_PLUGN_ROLE | Spark service account role | privacera-sa-spark-plugin-role |
SPARK_PLUGIN_APP_NAME | Spark application name | privacera-sa-spark-plugin-role |
SPARK_PLUGIN_IMAGE | Docker image with hub | myhub.docker.com/privacera-spark-plugin:prod-olac |
SPARK_DOCKER_PULL_SECRET | Secret for docker-registry | spark-plugin-docker-hub |
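For illustration, a `penv.sh` filled in with the example values from the table might look like the sketch below; the actual file shipped in k8s-spark-deploy may contain additional properties, and the application name shown here is hypothetical.

```bash
# Example values only -- replace with the names used in your environment
SPARK_NAME_SPACE="privacera-spark-plugin-test"
SPARK_PLUGIN_ROLE_BINDING="privacera-sa-spark-plugin-role-binding"
SPARK_PLUGIN_SERVICE_ACCOUNT="privacera-sa-spark-plugin"
SPARK_PLUGN_ROLE="privacera-sa-spark-plugin-role"
SPARK_PLUGIN_APP_NAME="privacera-spark-plugin-app"   # hypothetical application name
SPARK_PLUGIN_IMAGE="myhub.docker.com/privacera-spark-plugin:prod-olac"
SPARK_DOCKER_PULL_SECRET="spark-plugin-docker-hub"
```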
Run the following commands to substitute the property values into the Kubernetes deployment `.yml` files:

```bash
mkdir -p backup
cp *.yml backup/
./replace.sh
```
Run the following commands to create the Kubernetes resources:
```bash
kubectl apply -f namespace.yml
kubectl apply -f service-account.yml
kubectl apply -f role.yml
kubectl apply -f role-binding.yml
```
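To confirm the resources were created, you can list them in the namespace defined in penv.sh:

```bash
# Replace <SPARK_NAME_SPACE> with the namespace from penv.sh
kubectl get serviceaccount,role,rolebinding -n <SPARK_NAME_SPACE>
```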
Run the following command to create the secret for `docker-registry`:

```bash
kubectl create secret docker-registry spark-plugin-docker-hub \
  --docker-server=<PLEASE_CHANGE> \
  --docker-username=<PLEASE_CHANGE> \
  --docker-password='<PLEASE_CHANGE>' \
  --namespace=<PLEASE_CHANGE>
```
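For example, with illustrative values matching the registry and namespace examples used earlier (the credentials are hypothetical):

```bash
kubectl create secret docker-registry spark-plugin-docker-hub \
  --docker-server=myhub.docker.com \
  --docker-username=my-registry-user \
  --docker-password='<my-registry-password>' \
  --namespace=privacera-spark-plugin-test
```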
Run the following command to deploy a sample Spark application:
Note
This is a sample deployment file. Depending on your use case, you can create your own Spark deployment file and deploy your Docker image.
kubectl apply -f privacera-spark-examples.yml -n ${SPARK_NAME_SPACE}
This deploys a Spark application with the Privacera plugin in a Kubernetes pod and keeps the pod running so that you can use it in interactive mode.
Configuration properties
Property | Description | Example |
---|---|---|
SPARK_STANDALONE_ENABLE | Property to enable generating the setup script and configs for the Spark standalone plugin installation. | true |
SPARK_ENV_TYPE | Set the environment type. It can be any user-defined type. For example, if you're working in an environment that runs locally, you can set the type as local; for a production environment, set it as prod. | local |
SPARK_HOME | Home path of your Spark installation. | ~/privacera/spark/spark-3.1.1-bin-hadoop3.2 |
SPARK_USER_HOME | User home directory of your Spark installation. | /home/ec2-user |
 | Use the property to enable/disable the fallback behavior to the privacera_files and privacera_hive services. It determines whether the user should be allowed or denied access to the resource files. To enable the fallback, set to true; to disable, set to false. | true |
Validation
Get all the resources.
kubectl get all -n ${SPARK_NAME_SPACE}
Copy the pod ID; you will need it to connect to the Spark master.
Get the cluster info.
kubectl cluster-info
Copy the Kubernetes control plane URL from the output above; you will need it for the spark-shell command, for example, https://xxxxxxxxxxxxxxxxxxxxxxx.yl4.us-east-1.eks.amazonaws.com.
When using the URL for the `EKS_SERVER` property in step 4, prefix the value with `k8s://`. The following is an example of the property:

```bash
EKS_SERVER="k8s://https://xxxxxxxxxxxxxxxxxxxxxxx.yl4.us-east-1.eks.amazonaws.com"
```
Connect to Kubernetes master node.
kubectl -n ${SPARK_NAME_SPACE} exec -it <POD_ID> -- bash
Set the following properties:
```bash
SPARK_NAME_SPACE="<PLEASE_CHANGE>"
SPARK_PLUGIN_SERVICE_ACCOUNT="<PLEASE_CHANGE>"
SPARK_PLUGIN_IMAGE="<PLEASE_CHANGE>"
SPARK_DOCKER_PULL_SECRET="spark-plugin-docker-hub"
EKS_SERVER="<PLEASE_CHANGE>"
JWT_TOKEN="<PLEASE_CHANGE>"
```
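For illustration, the placeholders might be filled in as follows, reusing the example values from the penv.sh table and the EKS_SERVER format shown above; the JWT token is specific to your environment.

```bash
# Illustrative values only -- substitute the ones for your deployment
SPARK_NAME_SPACE="privacera-spark-plugin-test"
SPARK_PLUGIN_SERVICE_ACCOUNT="privacera-sa-spark-plugin"
SPARK_PLUGIN_IMAGE="myhub.docker.com/privacera-spark-plugin:prod-olac"
SPARK_DOCKER_PULL_SECRET="spark-plugin-docker-hub"
EKS_SERVER="k8s://https://xxxxxxxxxxxxxxxxxxxxxxx.yl4.us-east-1.eks.amazonaws.com"
JWT_TOKEN="<your-jwt-token>"
```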
Run the following command to open spark-shell. It contains all the configuration required to start spark-shell.
```bash
cd /opt/privacera/spark/bin
./spark-shell --master ${EKS_SERVER} \
  --deploy-mode client \
  --conf spark.kubernetes.authenticate.serviceAccountName=${SPARK_PLUGIN_SERVICE_ACCOUNT} \
  --conf spark.kubernetes.namespace=${SPARK_NAME_SPACE} \
  --conf spark.kubernetes.authenticate.submission.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  --conf spark.kubernetes.authenticate.submission.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${SPARK_PLUGIN_SERVICE_ACCOUNT} \
  --conf spark.kubernetes.container.image=${SPARK_PLUGIN_IMAGE} \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.kubernetes.container.image.pullSecrets=${SPARK_DOCKER_PULL_SECRET} \
  --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}" \
  --conf "spark.hadoop.privacera.jwt.oauth.enable=true" \
  --conf spark.driver.bindAddress='0.0.0.0' \
  --conf spark.driver.host=$SPARK_PLUGIN_POD_IP \
  --conf spark.port.maxRetries=4 \
  --conf spark.kubernetes.driver.pod.name=$SPARK_PLUGIN_POD_NAME
```
Run the following command using `spark-submit` with JWT authentication:

```bash
./spark-submit \
  --master ${EKS_SERVER} \
  --name spark-cloud-new \
  --deploy-mode cluster \
  --conf spark.kubernetes.authenticate.serviceAccountName=${SPARK_PLUGIN_SERVICE_ACCOUNT} \
  --conf spark.kubernetes.namespace=${SPARK_NAME_SPACE} \
  --conf spark.kubernetes.authenticate.submission.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  --conf spark.kubernetes.authenticate.submission.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${SPARK_PLUGIN_SERVICE_ACCOUNT} \
  --conf spark.kubernetes.container.image=${SPARK_PLUGIN_IMAGE} \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.kubernetes.container.image.pullSecrets=${SPARK_DOCKER_PULL_SECRET} \
  --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}" \
  --conf spark.driver.bindAddress='0.0.0.0' \
  --conf spark.driver.host=$SPARK_PLUGIN_POD_IP \
  --conf spark.port.maxRetries=4 \
  --conf spark.kubernetes.driver.pod.name=$SPARK_PLUGIN_POD_NAME \
  --class com.privacera.spark.poc.SparkSample \
  <your-code-jar/file>
```
To check the read access on the S3 file, run the following command in the open spark-shell:
```scala
val df = spark.read.csv("s3a://${S3_BUCKET}/customer_data/customer_data_without_header.csv")
df.show()
```
To check the write access on the S3 file, run the following command in the open spark-shell:
df.write.format("csv").mode("overwrite").save("s3a://${S3_BUCKET}/output/k8s/sample/csv")
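Outside the shell, you can optionally confirm that the output files were written, assuming the AWS CLI has read access to the bucket:

```bash
# List the CSV output written by the spark-shell write test
aws s3 ls s3://${S3_BUCKET}/output/k8s/sample/csv/
```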
Check the Audit logs on the Privacera Portal.
To verify the spark-shell setup, open another SSH connection to the Kubernetes cluster and run the following command to check the running pods:
kubectl get pods -n ${SPARK_NAME_SPACE}
You will see the Spark executor pods suffixed with `-exec-x`, for example, `spark-shell-xxxxxxxxxxxxxxxx-exec-1` and `spark-shell-xxxxxxxxxxxxxxxx-exec-2`.