Privacera Plugin in Databricks

Privacera provides two types of plugin solutions for access control in Databricks clusters. Both plugins are mutually exclusive and cannot be enabled on the same cluster.

Databricks Spark Fine-Grained Access Control (FGAC) Plugin

  • Recommended for SQL, Python, R language notebooks.

  • Provides FGAC on databases with row filtering and column masking features.

  • Uses privacera_hive, privacera_s3, privacera_adls, privacera_files services for resource-based access control, and privacera_tag service for tag-based access control.

  • Uses the plugin implementation from Privacera.

Databricks Spark Object Level Access Control (OLAC) Plugin

The OLAC plugin was introduced as an alternative solution for Scala language clusters, since running the Scala language on Databricks Spark raises certain security concerns.

  • Recommended for Scala language notebooks.

  • Provides OLAC on the S3 locations that you access via Spark.

  • Uses privacera_s3 service for resource-based access control and privacera_tag service for tag-based access control.

  • Uses the signed-authorization implementation from Privacera.

Databricks cluster deployment matrix with Privacera plugin:

Job/Workflow use-case for automated cluster:

Run Now creates a new cluster based on the definition in the job configuration; a sketch of this flow via the Jobs API follows the table.

Table 55. 

| Job Type                   | Languages         | FGAC/DBX version           | OLAC/DBX version           |
| -------------------------- | ----------------- | -------------------------- | -------------------------- |
| Notebook                   | Python/R/SQL      | Supported [7.3, 9.1, 10.4] |                            |
| JAR                        | Java/Scala        | Not supported              | Supported [7.3, 9.1, 10.4] |
| spark-submit               | Java/Scala/Python | Not supported              | Supported [7.3, 9.1, 10.4] |
| Python                     | Python            | Supported [7.3, 9.1, 10.4] |                            |
| Python wheel               | Python            | Supported [9.1, 10.4]      |                            |
| Delta Live Tables pipeline |                   | Not supported              | Not supported              |


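For illustration, the following is a minimal, hedged sketch of what such a job submission can look like through the Databricks Jobs API 2.1 (runs/submit). The host, token, notebook path, instance type, and Spark version are placeholders; the init script path follows the convention described later on this page.

    # Submit a one-time run; Databricks creates a new cluster from this
    # definition, attaching the Privacera init script so the plugin loads
    # at cluster startup. All values below are placeholders.
    curl -s -X POST "${DATABRICKS_HOST_URL}/api/2.1/jobs/runs/submit" \
      -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
      -H "Content-Type: application/json" \
      -d '{
        "run_name": "privacera-fgac-notebook-run",
        "tasks": [{
          "task_key": "notebook_task",
          "notebook_task": { "notebook_path": "/Users/email-id@example.com/MyNotebook" },
          "new_cluster": {
            "spark_version": "10.4.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 1,
            "spark_conf": {
              "spark.databricks.cluster.profile": "serverless",
              "spark.databricks.isv.product": "privacera",
              "spark.driver.extraJavaOptions": "-javaagent:/databricks/jars/privacera-agent.jar",
              "spark.databricks.repl.allowedLanguages": "sql,python,r"
            },
            "init_scripts": [{ "dbfs": { "destination": "dbfs:/privacera/<DEPLOYMENT_ENV_NAME>/ranger_enable.sh" } }]
          }
        }]
      }'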

Job on existing cluster:

Run Now uses the existing cluster referenced in the job configuration.

Table 56. 

| Job Type                   | Languages         | FGAC/DBX version           | OLAC          |
| -------------------------- | ----------------- | -------------------------- | ------------- |
| Notebook                   | Python/R/SQL      | Supported [7.3, 9.1, 10.4] | Not supported |
| JAR                        | Java/Scala        | Not supported              | Not supported |
| spark-submit               | Java/Scala/Python | Not supported              | Not supported |
| Python                     | Python            | Not supported              | Not supported |
| Python wheel               | Python            | Supported [9.1, 10.4]      | Not supported |
| Delta Live Tables pipeline |                   | Not supported              | Not supported |



Interactive use-case

The interactive use case is running a SQL/Python notebook on an interactive cluster.

Table 57. 

| Cluster Type              | Languages          | FGAC                       | OLAC                       |
| ------------------------- | ------------------ | -------------------------- | -------------------------- |
| Standard clusters         | Scala/Python/R/SQL | Not supported              | Supported [7.3, 9.1, 10.4] |
| High Concurrency clusters | Python/R/SQL       | Supported [7.3, 9.1, 10.4] | Supported [7.3, 9.1, 10.4] |
| Single Node               | Scala/Python/R/SQL | Not supported              | Supported [7.3, 9.1, 10.4] |



Databricks Spark Fine-Grained Access Control Plugin [FGAC] [Python, SQL]
Configuration
  1. Run the following commands:

    cd ~/privacera/privacera-manager
    cp config/sample-vars/vars.databricks.plugin.yml config/custom-vars/
    vi config/custom-vars/vars.databricks.plugin.yml
    
  2. Edit the following properties to allow Privacera Platform to connect to your Databricks host. For property details and description, refer to the Configuration Properties below.

    DATABRICKS_HOST_URL: "<PLEASE_UPDATE>"
    DATABRICKS_TOKEN: "<PLEASE_UPDATE>"
    DATABRICKS_WORKSPACES_LIST:
      - alias: DEFAULT
        databricks_host_url: "{{DATABRICKS_HOST_URL}}"
        token: "{{DATABRICKS_TOKEN}}"
    DATABRICKS_MANAGE_INIT_SCRIPT: "true"
    DATABRICKS_ENABLE: "true"
    

    Note

    You can also add custom properties that are not included by default. See Databricks.

  3. Run the following commands:

    cd ~/privacera/privacera-manager
    ./privacera-manager.sh update
    
  4. (Optional) By default, policies under the default service name, privacera_hive, are enforced. You can customize a different service name and enforce policies defined in the new name. See Configure Service Name for Databricks Spark Plugin.

Configuration properties

Each property is listed below with its description and example values.

DATABRICKS_HOST_URL

Enter the URL where the Databricks environment is hosted.

Example for Azure Databricks:

DATABRICKS_HOST_URL: "https://xdx-66506xxxxxxxx.2.azuredatabricks.net/?o=665066931xxxxxxx"

Example for AWS Databricks:

DATABRICKS_HOST_URL: "https://xxx-7xxxfaxx-xxxx.cloud.databricks.com"

DATABRICKS_TOKEN

Enter the token.

To generate a token:

1. Log in to your Databricks account.

2. Click the user profile icon in the upper right corner of your Databricks workspace.

3. Click User Settings.

4. Click the Generate New Token button.

5. Optionally enter a description (comment) and expiration period.

6. Click the Generate button.

7. Copy the generated token.

DATABRICKS_TOKEN: "xapid40xxxf65xxxxxxe1470eayyyyycdc06"

DATABRICKS_WORKSPACES_LIST

Add multiple Databricks workspaces to connect to Ranger.

  1. To add a single workspace, add the following default JSON in the text area to define the host URL and token of the Databricks workspace. The text area should not be left empty and should at least contain the default JSON.

    [{"alias":"DEFAULT",
    "databricks_host_url":"{{DATABRICKS_HOST_URL}}",
    "token":"{{DATABRICKS_TOKEN}}"}]
    

    Note

    Do not edit any of the values in the default JSON.

  2. To add two workspaces, use the following JSON.

    [{"alias":"DEFAULT",
    "databricks_host_url":"{{DATABRICKS_HOST_URL}}",
    "token":"{{DATABRICKS_TOKEN}}"},
    {"alias":"<workspace-2-alias>","databricks_host_url":"<workspace-2-url>",
    "token":"<dbx-token-for-workspace-2>"}]
    

Note: {{var}} is an Ansible variable that re-uses the value of a predefined variable. Do not edit the databricks_host_url and token properties of the alias DEFAULT; they are set by DATABRICKS_HOST_URL and DATABRICKS_TOKEN respectively.

DATABRICKS_ENABLE

If set to 'true', Privacera Manager will create the Databricks cluster init script ranger_enable.sh at ~/privacera/privacera-manager/output/databricks/ranger_enable.sh.

"true"

"false"

DATABRICKS_MANAGE_INIT_SCRIPT

If set to 'true', Privacera Manager will upload the init script ('ranger_enable.sh') to the identified Databricks host.

If set to 'false', upload the following two files to the DBFS location yourself. The files are located at ~/privacera/privacera-manager/output/databricks:

  • privacera_spark_plugin_job.conf

  • privacera_spark_plugin.conf

"true"

"false"

DATABRICKS_SPARK_PLUGIN_AGENT_JAR

Use this property to pass the Privacera Java agent to the Spark driver as extra JVM options.

-javaagent:/databricks/jars/privacera-agent.jar

DATABRICKS_SPARK_PRIVACERA_CUSTOM_CURRENT_USER_UDF_NAME

Property to map logged-in user to Ranger user for row-filter policy.

It maps to the Databricks cluster-level property spark.hadoop.privacera.custom.current_user.udf.names (see Spark Properties). If that property is also set in your Databricks cluster, set it to the same value as this PM property; differing values can cause unexpected behavior.

current_user()
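If you also set the matching cluster-level property by hand, the Spark Config entry would look like the following sketch (value taken from the example above; keep it identical to the PM property):

    spark.hadoop.privacera.custom.current_user.udf.names current_user()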

DATABRICKS_SPARK_PRIVACERA_VIEW_LEVEL_MASKING_ROWFILTER_EXTENSION_ENABLE

Property to enable masking, row-filter, and data_admin access on views. This is a Privacera Manager (PM) property.

It maps to the Databricks cluster-level property spark.hadoop.privacera.spark.view.levelmaskingrowfilter.extension.enable (see Spark Properties). If that property is also set in your Databricks cluster, set it to the same value as this PM property; differing values can cause unexpected behavior.

false
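The corresponding cluster-level Spark Config entry would look like this sketch (value from the example above):

    spark.hadoop.privacera.spark.view.levelmaskingrowfilter.extension.enable false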

DATABRICKS_SQL_CLUSTER_POLICY_SPARK_CONF

Configure Databricks Cluster policy.

Add the following JSON in the text area:

[{"Note":"First spark conf","key":"spark.hadoop.first.spark.test","value":"test1"},{"Note":"Second spark conf","key":"spark.hadoop.first.spark.test","value":"test2"}]

DATABRICKS_POST_PLUGIN_COMMAND_LIST

Note

This property is not part of the default YAML file, but can be added if required.

Use this property, if you want to run a specific set of commands in the Databricks init script.

The following example will be added to the cluster init script to allow Athena JDBC via data access server.

    DATABRICKS_POST_PLUGIN_COMMAND_LIST:
      - sudo iptables -I OUTPUT 1 -p tcp -m tcp --dport 8181 -j ACCEPT
      - sudo curl -k -u user:password {{PORTAL_URL}}/api/dataserver/cert?type=dataserver_jks -o /etc/ssl/certs/dataserver.jks
      - sudo chmod 755 /etc/ssl/certs/dataserver.jks

DATABRICKS_SPARK_PYSPARK_ENABLE_PY4J_SECURITY

This property allows you to blacklist APIs to enable security. This is a Privacera Manager (PM) property.

It maps to the Databricks cluster-level property spark.databricks.pyspark.enablePy4JSecurity (see Spark Properties). If that property is also set in your Databricks cluster, set it to the same value as this PM property; differing values can cause unexpected behavior.


Managing init script

Automatic upload

If DATABRICKS_ENABLE is 'true' and DATABRICKS_MANAGE_INIT_SCRIPT is 'true', then the Init script will be uploaded automatically to your Databricks host. The init script will be uploaded to dbfs:/privacera/<DEPLOYMENT_ENV_NAME>/ranger_enable.sh where <DEPLOYMENT_ENV_NAME> is the value of DEPLOYMENT_ENV_NAME mentioned in vars.privacera.yml.
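To confirm the automatic upload, you can list the target DBFS path. A sketch, assuming the Databricks CLI is configured with a 'privacera' profile as in the manual steps below:

    # The init script should appear in the listing after a successful update.
    dbfs ls dbfs:/privacera/<DEPLOYMENT_ENV_NAME>/ --profile privacera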

Manual upload

If DATABRICKS_ENABLE is 'true' and DATABRICKS_MANAGE_INIT_SCRIPT is 'false', then the Init script must be uploaded to your Databricks host.

Note

To avoid the manual steps below, you should set DATABRICKS_MANAGE_INIT_SCRIPT=true and follow the instructions outlined in Automatic Upload.

  1. Open a terminal and connect to Databricks account using your Databricks login credentials/token.

    Connect using login credentials:

    1. If you're using login credentials, then run the following command:

      databricks configure --profile privacera
    2. Enter the Databricks URL:

      Databricks Host (should begin with https://): https://dbc-xxxxxxxx-xxxx.cloud.databricks.com/
    3. Enter the username and password:
      Username: email-id@example.com
      Password:

    Connect using Databricks token:

    1. If you don't have a Databricks token, you can generate one. For more information, refer to Generate a personal access token.

    2. If you're using a token, then run the following command:

      databricks configure --token --profile privacera
    3. Enter the Databricks URL:

      Databricks Host (should begin with https://): https://dbc-xxxxxxxx-xxxx.cloud.databricks.com/
    4. Enter the token:

      Token:
  2. To check if the connection to your Databricks account is established, run the following command:

    dbfs ls dbfs:/ --profile privacera

    You should see a list of files in the output if you are connected to your account.

  3. Upload files manually to Databricks:

    1. Copy the following files to DBFS, which are available in the PM host at the location, ~/privacera/privacera-manager/output/databricks:

      • ranger_enable.sh

      • privacera_spark_plugin.conf

      • privacera_spark_plugin_job.conf

      • privacera_custom_conf.zip

    2. Run the following commands. You can get the value of <DEPLOYMENT_ENV_NAME> from the file ~/privacera/privacera-manager/config/vars.privacera.yml.

      export DEPLOYMENT_ENV_NAME=<DEPLOYMENT_ENV_NAME>
      dbfs mkdirs dbfs:/privacera/${DEPLOYMENT_ENV_NAME} --profile privacera
      dbfs cp ranger_enable.sh dbfs:/privacera/${DEPLOYMENT_ENV_NAME}/ --profile privacera
      dbfs cp privacera_spark_plugin.conf dbfs:/privacera/${DEPLOYMENT_ENV_NAME}/ --profile privacera
      dbfs cp privacera_spark_plugin_job.conf dbfs:/privacera/${DEPLOYMENT_ENV_NAME}/ --profile privacera
      dbfs cp privacera_custom_conf.zip dbfs:/privacera/${DEPLOYMENT_ENV_NAME}/ --profile privacera
    3. Verify the files have been uploaded.

      dbfs ls dbfs:/privacera/${DEPLOYMENT_ENV_NAME}/ --profile privacera

      The Init Script will be uploaded to dbfs:/privacera/<DEPLOYMENT_ENV_NAME>/ranger_enable.sh, where <DEPLOYMENT_ENV_NAME> is the value of DEPLOYMENT_ENV_NAME mentioned in vars.privacera.yml.

Configure Databricks Cluster
  1. Once the update completes successfully, log on to the Databricks console with your account and open the target cluster, or create a new target cluster.

  2. Open the Cluster dialog and enter Edit mode.

  3. In the Configuration tab, select Advanced Options > Spark.

  4. Add the following content to the Spark Config edit box. For more information on the Spark config properties, see Spark Properties.

    New Properties

    spark.databricks.cluster.profile serverless
    spark.databricks.isv.product privacera
    spark.driver.extraJavaOptions -javaagent:/databricks/jars/privacera-agent.jar
    spark.databricks.repl.allowedLanguages sql,python,r
    

    Old Properties

    spark.databricks.cluster.profile serverless
    spark.databricks.repl.allowedLanguages sql,python,r
    spark.driver.extraJavaOptions -javaagent:/databricks/jars/ranger-spark-plugin-faccess-2.0.0-SNAPSHOT.jar
    spark.databricks.isv.product privacera
    spark.databricks.pyspark.enableProcessIsolation true

    Note

    • From Privacera 5.0.6.1 Release onwards, it is recommended to replace the Old Properties with the New Properties. However, the Old Properties will also continue to work.

    • For Databricks versions < 7.3, use only the Old Properties, since those versions are in extended support.

  5. In the Configuration tab, in Edit mode, open Advanced Options (at the bottom of the dialog) and set the init script path. For <DEPLOYMENT_ENV_NAME>, enter the deployment name as defined by the DEPLOYMENT_ENV_NAME variable in vars.privacera.yml.

    dbfs:/privacera/<DEPLOYMENT_ENV_NAME>/ranger_enable.sh
    
  6. In the Table Access Control section, clear the Enable table access control and only allow Python and SQL commands and Enable credential passthrough for user-level data access and only allow Python and SQL commands checkboxes.

  7. Save (Confirm) this configuration.

  8. Start (or Restart) the selected Databricks Cluster.
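The same configuration can also be applied programmatically. The following is a hedged sketch using the Databricks Clusters API 2.0; the cluster ID, name, instance type, and versions are placeholders, and clusters/edit expects the full desired cluster specification rather than a partial patch:

    # Apply the FGAC Spark config and init script to an existing cluster.
    curl -s -X POST "${DATABRICKS_HOST_URL}/api/2.0/clusters/edit" \
      -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
      -H "Content-Type: application/json" \
      -d '{
        "cluster_id": "<CLUSTER_ID>",
        "cluster_name": "privacera-fgac-cluster",
        "spark_version": "10.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 1,
        "spark_conf": {
          "spark.databricks.cluster.profile": "serverless",
          "spark.databricks.isv.product": "privacera",
          "spark.driver.extraJavaOptions": "-javaagent:/databricks/jars/privacera-agent.jar",
          "spark.databricks.repl.allowedLanguages": "sql,python,r"
        },
        "init_scripts": [{ "dbfs": { "destination": "dbfs:/privacera/<DEPLOYMENT_ENV_NAME>/ranger_enable.sh" } }]
      }'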


Validation

To help evaluate the use of Privacera with Databricks, Privacera provides a set of Privacera Manager demo notebooks. These can be downloaded from the Privacera S3 repository using either your browser or a command-line wget. Use the notebook/SQL sequence that matches your cluster.

  1. Download using your browser (click the correct file for your cluster below):

    https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPlugin.sql

    If AWS S3 is configured from your Databricks cluster: https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginS3.sql

    If ADLS Gen2 is configured from your Databricks cluster: https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginADLS.sql

    Or, if you are working from a Linux command line, use wget to download:

    wget https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPlugin.sql -O PrivaceraSparkPlugin.sql

    wget https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginS3.sql -O PrivaceraSparkPluginS3.sql

    wget https://privacera.s3.amazonaws.com/public/pm-demo-data/databricks/PrivaceraSparkPluginADLS.sql -O PrivaceraSparkPluginADLS.sql

  2. Import the Databricks notebook:

    1. Log in to the Databricks Console

    2. Select Workspace > Users > Your User.

    3. From the drop-down menu, select Import and choose the downloaded file.

  3. Follow the suggested steps in the text of the notebook to exercise and validate Privacera with Databricks.
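If you prefer the command line to the UI import in step 2, the legacy Databricks CLI can import the downloaded notebook. A sketch, assuming the 'privacera' profile from earlier and a placeholder target path:

    # Import the SQL notebook into your user folder in the workspace.
    databricks workspace import --language SQL --format SOURCE \
      PrivaceraSparkPlugin.sql /Users/email-id@example.com/PrivaceraSparkPlugin \
      --profile privacera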

Databricks Spark Object-level Access Control Plugin [OLAC] [Scala]
Prerequisites

Ensure the following prerequisites are met:

Configuration
  1. Run the following commands.

    cd ~/privacera/privacera-manager/
    cp config/sample-vars/vars.databricks.scala.yml config/custom-vars/
    vi config/custom-vars/vars.databricks.scala.yml
    
  2. Edit the following properties. For property details and description, refer to the Configuration Properties below.

    DATASERVER_DATABRICKS_ALLOWED_URLS: "<PLEASE_UPDATE>"
    DATASERVER_AWS_STS_ROLE: "<PLEASE_CHANGE>"
    
  3. Run the following commands.

    cd ~/privacera/privacera-manager
    ./privacera-manager.sh update
    
Configuration properties

Each property is listed below with its description and example values.

DATABRICKS_SCALA_ENABLE

Set this property to enable or disable Databricks Scala support. It is found under the Databricks Signed URL Configuration For Scala Clusters section.

DATASERVER_DATABRICKS_ALLOWED_URLS

Add a URL or a comma-separated list of URLs.

Privacera Dataserver serves only the URLs listed in this property.

https://xxx-7xxxfaxx-xxxx.cloud.databricks.com

DATASERVER_AWS_STS_ROLE

Add the instance profile ARN of the AWS role that can access Delta files in Databricks.

arn:aws:iam::111111111111:role/assume-role
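Before wiring the role into Privacera, you can verify that it is assumable. A sketch using the AWS CLI, with the placeholder ARN from the example above:

    # Succeeds and prints temporary credentials if the role trust policy allows it.
    aws sts assume-role \
      --role-arn arn:aws:iam::111111111111:role/assume-role \
      --role-session-name privacera-dataserver-test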

DATABRICKS_MANAGE_INIT_SCRIPT

Controls how the init script is managed.

If enabled, Privacera Manager will upload the init script ('ranger_enable_scala.sh') to the identified Databricks host.

If disabled, Privacera Manager will take no action regarding the init script for the Databricks File System.

DATABRICKS_SCALA_CLUSTER_POLICY_SPARK_CONF

Configure Databricks Cluster policy.

Add the following JSON in the text area:

[{"Note":"First spark conf",
"key":"spark.hadoop.first.spark.test",
"value":"test1"},
{"Note":"Second spark conf",
"key":"spark.hadoop.first.spark.test",
"value":"test2"}]
Managing init script
Automatic Upload

If DATABRICKS_ENABLE is 'true' and DATABRICKS_MANAGE_INIT_SCRIPT is "true", the Init script will be uploaded automatically to your Databricks host. The Init Script will be uploaded to dbfs:/privacera/<DEPLOYMENT_ENV_NAME>/ranger_enable_scala.sh, where <DEPLOYMENT_ENV_NAME> is the value of DEPLOYMENT_ENV_NAME mentioned in vars.privacera.yml.

Manual Upload

If DATABRICKS_ENABLE is 'true' and DATABRICKS_MANAGE_INIT_SCRIPT is "false", the init script must be uploaded to your Databricks host.

  1. Open a terminal and connect to Databricks account using your Databricks login credentials/token.

    • Connect using login credentials:

      1. If you're using login credentials, then run the following command.

        databricks configure --profile privacera
        
      2. Enter the Databricks URL.

        Databricks Host (should begin with https://): https://dbc-xxxxxxxx-xxxx.cloud.databricks.com/
        
      3. Enter the username and password.

        Username: email-id@yourdomain.com
        Password:
        
    • Connect using Databricks token:

      1. If you don't have a Databricks token, you can generate one. For more information, refer to Generate a personal access token.

      2. If you're using a token, then run the following command.

        databricks configure --token --profile privacera
        
      3. Enter the Databricks URL.

        Databricks Host (should begin with https://): https://dbc-xxxxxxxx-xxxx.cloud.databricks.com/
        
      4. Enter the token.

        Token:
        
  2. To check if the connection to your Databricks account is established, run the following command.

    dbfs ls dbfs:/ --profile privacera
    

    You should see a list of files in the output if you are connected to your account.

  3. Upload files manually to Databricks.

    1. Copy the following files to DBFS, which are available in the PM host at the location, ~/privacera/privacera-manager/output/databricks:

      • ranger_enable_scala.sh

      • privacera_spark_scala_plugin.conf

      • privacera_spark_scala_plugin_job.conf

    2. Run the following commands. You can get the value of <DEPLOYMENT_ENV_NAME> from the file ~/privacera/privacera-manager/config/vars.privacera.yml.

      export DEPLOYMENT_ENV_NAME=<DEPLOYMENT_ENV_NAME>
      dbfs mkdirs dbfs:/privacera/${DEPLOYMENT_ENV_NAME} --profile privacera
      dbfs cp ranger_enable_scala.sh dbfs:/privacera/${DEPLOYMENT_ENV_NAME}/ --profile privacera
      dbfs cp privacera_spark_scala_plugin.conf dbfs:/privacera/${DEPLOYMENT_ENV_NAME}/ --profile privacera
      dbfs cp privacera_spark_scala_plugin_job.conf dbfs:/privacera/${DEPLOYMENT_ENV_NAME}/ --profile privacera
      
    3. Verify the files have been uploaded.

      dbfs ls dbfs:/privacera/${DEPLOYMENT_ENV_NAME}/ --profile privacera
      

      The Init Script is uploaded to dbfs:/privacera/<DEPLOYMENT_ENV_NAME>/ranger_enable_scala.sh, where <DEPLOYMENT_ENV_NAME> is the value of DEPLOYMENT_ENV_NAME mentioned in vars.privacera.yml.

Configure Databricks cluster
  1. Once the update completes successfully, log on to the Databricks console with your account and open the target cluster, or create a new target cluster.

  2. Open the Cluster dialog and enter Edit mode.

  3. In the Configuration tab, in Edit mode, open Advanced Options (at the bottom of the dialog) and then the Spark tab.

  4. Add the following content to the Spark Config edit box. For more information on the Spark config properties, see Spark Properties.

    New Properties

    spark.databricks.isv.product privacera
    spark.driver.extraJavaOptions -javaagent:/databricks/jars/privacera-agent.jar
    spark.executor.extraJavaOptions -javaagent:/databricks/jars/privacera-agent.jar
    spark.databricks.repl.allowedLanguages sql,python,r,scala
    spark.databricks.delta.formatCheck.enabled false
    

    Old Properties

    spark.databricks.cluster.profile serverless
    spark.databricks.delta.formatCheck.enabled false
    spark.driver.extraJavaOptions -javaagent:/databricks/jars/ranger-spark-plugin-faccess-2.0.0-SNAPSHOT.jar
    spark.executor.extraJavaOptions -javaagent:/databricks/jars/ranger-spark-plugin-faccess-2.0.0-SNAPSHOT.jar
    spark.databricks.isv.product privacera
    spark.databricks.repl.allowedLanguages sql,python,r,scala
    

    Note

    • From Privacera 5.0.6.1 Release onwards, it is recommended to replace the Old Properties with the New Properties. However, the Old Properties will also continue to work.

    • For Databricks versions < 7.3, use only the Old Properties, since those versions are in extended support.

  5. (Optional) To use regional endpoint for S3 access, add the following content to the Spark Config edit box.

    spark.hadoop.fs.s3a.endpoint https://s3.<region>.amazonaws.com
    spark.hadoop.fs.s3.endpoint https://s3.<region>.amazonaws.com
    spark.hadoop.fs.s3n.endpoint https://s3.<region>.amazonaws.com
    
  6. In the Configuration tab, in Edit mode, open Advanced Options (at the bottom of the dialog) and set the init script path. For <DEPLOYMENT_ENV_NAME>, enter the deployment name as defined by the DEPLOYMENT_ENV_NAME variable in vars.privacera.yml.

    dbfs:/privacera/<DEPLOYMENT_ENV_NAME>/ranger_enable_scala.sh
    
  7. Save (Confirm) this configuration.

  8. Start (or Restart) the selected Databricks Cluster.
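Starting or restarting can also be scripted. A sketch with the legacy Databricks CLI; the cluster ID is a placeholder:

    # Restart the cluster so the init script and Spark config take effect.
    databricks clusters restart --cluster-id <CLUSTER_ID> --profile privacera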
