Skip to content

Databricks

Databricks Spark Plug-in (Python/SQL)#

These instructions guide the installation of the Privacera Spark plug-in in GCP Databricks.

Prerequisite

Ensure the following prerequisite is met:

  • All the Privacera core (default) services should be installed and running.

Configuration

  1. Run the following commands.

    cd ~/privacera/privacera-manager
    cp config/sample-vars/vars.databricks.plugin.yml config/custom-vars/
    vi config/custom-vars/vars.databricks.plugin.yml
    
  2. Update DATABRICKS_MANAGE_INIT_SCRIPT as we will manually upload the init script to GCP Cloud Storage in the step below.

    DATABRICKS_MANAGE_INIT_SCRIPT: "false"
    
  3. Run the following commands.

    cd ~/privacera/privacera-manager
    ./privacera-manager.sh update
    

    After the update is completed, the init script (ranger_enable.sh) and Privacera custom configuration (privacera_custom_conf.zip) for SSL will be generated at the location,~/privacera/privacera-manager/output/databricks.

Custom Configuration File

(Recommended) Perform the following steps only if you have https enabled for Ranger:

Upload the privacera_custom_conf.zip to a storage bucket in GCP and copy the public URL. For example, https://storage.googleapis.com/${PUBLIC_GCS_BUCKET}/ranger_enable.sh, where ${PUBLIC_GCS_BUCKET} is the GCP bucket name. We will use this URL in the init script to download privacera_custom_conf.zip to the Databricks cluster.

Managing init Script and Spark Configurations

  1. Run the following command.

    cd ~/privacera/privacera-manager/output/databricks
    vi ranger_enable.sh
    
  2. In the CUST_CONF_URL property, add the public URL of the GCP storage bucket where you placed the privacera_custom_conf.zip.

    export CUST_CONF_URL=https://storage.googleapis.com/${PUBLIC_GCS_BUCKET}/ranger_enable.sh
    
  3. Upload the init script, ranger_enable.sh, to your Google Cloud Storage account and copy the file path of the script. For example, gs://privacera/dev/init/ranger_enable.sh

  4. Log on to the Databricks console with your account and open the target cluster or create a new cluster.

  5. Open the Cluster dialog and go to Edit mode.

  6. Open Advanced Options, open the tab Init Scripts. Enter (paste) the file path from step 3 for the init script location. Save (Confirm) this configuration.

  7. Open Advanced Options, open the tab Spark. Add the following content to the Spark Config edit box:

    spark.driver.extraJavaOptions -javaagent:/databricks/jars/privacera-agent.jar
    spark.databricks.isv.product privacera
    spark.databricks.pyspark.enableProcessIsolation false
    privacera.spark.view.levelmaskingrowfilter.extension.enable true
    
  8. Save (Confirm) this configuration.

  9. Start (or Restart) the selected Databricks Cluster.

Validation

In order to help evaluate the use of Privacera with Databricks, Privacera provides a set of Privacera Manager 'demo' notebooks. These can be downloaded from Privacera S3 repository using either your favorite browser, or a command line 'wget'. Use the notebook/sql sequence that matches your cluster.

  1. Download using your browser (just click on the correct file for your cluster, below:

    https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPlugin.sql

    If AWS S3 is configured from your Databricks cluster: https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginS3.sql

    If ADLS Gen2 is configured from your Databricks cluster: https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginADLS.sql

    or, if you are working from a Linux command line, use the 'wget' command to download.

    wget https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPlugin.sql -O PrivaceraSparkPlugin.sql

    wget https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginS3.sql -O PrivaceraSparkPluginS3.sql

    wget https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginADLS.sql -O PrivaceraSparkPluginADLS.sql

  2. Import the Databricks notebook:

    • Login to Databricks Console ->
    • Select Workspace -> Users -> Your User ->
    • Click on drop down ->
    • Click on Import and Choose the file downloaded
  3. Follow the suggested steps in the text of the notebook to exercise and validate Privacera with Databricks.