Databricks
Databricks Spark Plug-in (Python/SQL)
These instructions guide the installation of the Privacera Spark plug-in in GCP Databricks.
Prerequisite
Ensure the following prerequisite is met:
- All the Privacera core (default) services should be installed and running.
Configuration
-
Run the following commands.
cd ~/privacera/privacera-manager
cp config/sample-vars/vars.databricks.plugin.yml config/custom-vars/
vi config/custom-vars/vars.databricks.plugin.yml
-
Set DATABRICKS_MANAGE_INIT_SCRIPT to "false", because you will manually upload the init script to Google Cloud Storage in a later step.
DATABRICKS_MANAGE_INIT_SCRIPT: "false"
-
Run the following commands.
cd ~/privacera/privacera-manager
./privacera-manager.sh update
After the update completes, the init script (ranger_enable.sh) and the Privacera custom configuration for SSL (privacera_custom_conf.zip) are generated in ~/privacera/privacera-manager/output/databricks.
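The vars-file edit and a check of the generated files can also be scripted. The following is a minimal sketch, not part of Privacera Manager: the function names set_manage_init_script_false and check_databricks_outputs are hypothetical, and the sketch assumes the copied vars file contains DATABRICKS_MANAGE_INIT_SCRIPT as an uncommented key at the start of a line.

```shell
#!/bin/sh
# Hypothetical helpers; not shipped with Privacera Manager.

# Set DATABRICKS_MANAGE_INIT_SCRIPT to "false" in the given vars file.
# Assumption: the key is present as an uncommented line starting at column 0.
# Returns non-zero if the expected line is not found after the edit.
# sed -i.bak leaves a backup copy (<file>.bak) next to the edited file.
set_manage_init_script_false() (
  vars_file="$1"
  sed -i.bak 's/^DATABRICKS_MANAGE_INIT_SCRIPT:.*/DATABRICKS_MANAGE_INIT_SCRIPT: "false"/' "$vars_file" &&
    grep -qxF 'DATABRICKS_MANAGE_INIT_SCRIPT: "false"' "$vars_file"
)

# Verify that the update produced both files named in the text above.
# Prints each missing file and returns non-zero if any is absent.
check_databricks_outputs() (
  out_dir="$1"
  status=0
  for name in ranger_enable.sh privacera_custom_conf.zip; do
    if [ ! -f "$out_dir/$name" ]; then
      echo "missing: $out_dir/$name"
      status=1
    fi
  done
  exit "$status"
)
```

For example, call set_manage_init_script_false ~/privacera/privacera-manager/config/custom-vars/vars.databricks.plugin.yml before running the update, and check_databricks_outputs ~/privacera/privacera-manager/output/databricks after it.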
Custom Configuration File
(Recommended) Perform the following step only if HTTPS is enabled for Ranger:
Upload privacera_custom_conf.zip to a storage bucket in GCP and copy its public URL. For example, https://storage.googleapis.com/${PUBLIC_GCS_BUCKET}/privacera_custom_conf.zip, where ${PUBLIC_GCS_BUCKET} is the GCP bucket name. The init script uses this URL to download privacera_custom_conf.zip to the Databricks cluster.
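As an illustration of the URL format, the sketch below builds the public URL from a bucket name and an object name. The function name public_gcs_url and the bucket name my-public-bucket are placeholders invented for this example, and the sketch assumes the uploaded object keeps the name privacera_custom_conf.zip and needs no URL-encoding.

```shell
#!/bin/sh
# Hypothetical helper: print the public HTTPS URL of an object in a
# Google Cloud Storage bucket (https://storage.googleapis.com/BUCKET/OBJECT).
# Assumption: the object name contains no spaces or characters that
# would need URL-encoding.
public_gcs_url() (
  bucket="$1"
  object="$2"
  printf 'https://storage.googleapis.com/%s/%s\n' "$bucket" "$object"
)
```

Running public_gcs_url my-public-bucket privacera_custom_conf.zip prints https://storage.googleapis.com/my-public-bucket/privacera_custom_conf.zip. If curl is available, curl -fsI with that URL is one way to confirm the object is publicly readable before you reference it from the init script.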
Managing init Script and Spark Configurations
-
Run the following command.
cd ~/privacera/privacera-manager/output/databricks
vi ranger_enable.sh
-
In the CUST_CONF_URL property, add the public URL of the object in the GCP storage bucket where you placed privacera_custom_conf.zip.
export CUST_CONF_URL=https://storage.googleapis.com/${PUBLIC_GCS_BUCKET}/privacera_custom_conf.zip
-
Upload the init script, ranger_enable.sh, to your Google Cloud Storage account and copy the file path of the script. For example, gs://privacera/dev/init/ranger_enable.sh
-
Log on to the Databricks console with your account and open the target cluster or create a new cluster.
-
Open the Cluster dialog and go to Edit mode.
-
Open Advanced Options and select the Init Scripts tab. Paste the file path from step 3 as the init script location, then save (Confirm) this configuration.
-
Open Advanced Options and select the Spark tab. Add the following content to the Spark Config edit box:
spark.driver.extraJavaOptions -javaagent:/databricks/jars/privacera-agent.jar
spark.databricks.isv.product privacera
spark.databricks.pyspark.enableProcessIsolation false
privacera.spark.view.levelmaskingrowfilter.extension.enable true
-
Save (Confirm) this configuration.
-
Start (or Restart) the selected Databricks Cluster.
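The CUST_CONF_URL edit in the steps above can be made without opening an editor. This is a minimal sketch under stated assumptions: the function name set_cust_conf_url is hypothetical, the generated ranger_enable.sh is assumed to contain a line that starts with export CUST_CONF_URL=, and the URL is assumed to contain no |, & or backslash characters, which are special to the sed command used here.

```shell
#!/bin/sh
# Hypothetical helper: point CUST_CONF_URL in the init script at the
# uploaded privacera_custom_conf.zip.
# Assumptions: the script has a line starting with "export CUST_CONF_URL=",
# and the URL contains no "|", "&" or "\" characters.
# Returns non-zero if the expected line is not present after the edit.
# sed -i.bak leaves a backup copy (<file>.bak) next to the edited file.
set_cust_conf_url() (
  script="$1"
  url="$2"
  sed -i.bak "s|^export CUST_CONF_URL=.*|export CUST_CONF_URL=${url}|" "$script" &&
    grep -qxF "export CUST_CONF_URL=${url}" "$script"
)
```

After the edit, and assuming the Google Cloud SDK's gsutil is installed and authenticated, the upload in step 3 could be done with gsutil cp ranger_enable.sh gs://privacera/dev/init/ranger_enable.sh, using the example path from that step.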
Validation
To help you evaluate Privacera with Databricks, Privacera provides a set of Privacera Manager demo notebooks. You can download them from the Privacera S3 repository with a browser or with wget on the command line. Use the notebook (.sql file) that matches your cluster.
-
Download using your browser by clicking the correct file for your cluster below:
https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPlugin.sql
If AWS S3 is configured from your Databricks cluster: https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginS3.sql
If ADLS Gen2 is configured from your Databricks cluster: https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginADLS.sql
Or, if you are working from a Linux command line, use the wget command to download:
wget https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPlugin.sql -O PrivaceraSparkPlugin.sql
wget https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginS3.sql -O PrivaceraSparkPluginS3.sql
wget https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginADLS.sql -O PrivaceraSparkPluginADLS.sql
-
Import the Databricks notebook:
- Log in to the Databricks console.
- Select Workspace -> Users -> Your User.
- Click the drop-down menu.
- Click Import and choose the downloaded file.
-
Follow the suggested steps in the text of the notebook to exercise and validate Privacera with Databricks.
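For scripted setups, the choice among the three demo notebooks listed in the download step can be captured in a small helper. This is a sketch only: the function name notebook_url and the storage-type labels default, s3 and adls are names chosen for this example, while the URLs are the ones listed above.

```shell
#!/bin/sh
# Hypothetical helper: print the demo notebook URL for a storage type.
# The labels "default", "s3" and "adls" are names chosen for this sketch.
# Prints an error to stderr and returns non-zero for any other label.
notebook_url() (
  base="https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks"
  case "$1" in
    default) file="PrivaceraSparkPlugin.sql" ;;
    s3)      file="PrivaceraSparkPluginS3.sql" ;;
    adls)    file="PrivaceraSparkPluginADLS.sql" ;;
    *)       echo "unknown storage type: $1" >&2; exit 1 ;;
  esac
  printf '%s/%s\n' "$base" "$file"
)
```

A download could then look like url=$(notebook_url s3) && wget "$url" -O "${url##*/}", which saves the file under its original name.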