
Spark Fine-grained Access Control (FGAC)

Enable View-level access control

  1. Edit the Spark Config of your existing Privacera-enabled Databricks cluster.

  2. Add the following property:

    spark.hadoop.privacera.spark.view.levelmaskingrowfilter.extension.enable true
  3. Save and restart the Databricks cluster.

Apply View-level access control

To run CREATE VIEW with the Spark plug-in, you need the DATA_ADMIN permission.

The source table on which you are going to create a view requires DATA_ADMIN access in the Ranger policy.

Use Case

  • Let’s take a use case where we have an 'employee_db' database with two tables containing the data shown below:

    #Requires the Create privilege on the database (enabled by default);
    create database if not exists employee_db;
    
    data_admin.jpg
  • Create two tables.

    #Requires privilege for table creation;
    create table if not exists employee_db.employee_data(id int,userid string,country string);
    create table if not exists employee_db.country_region(country string,region string);
    
    data_admin1.jpg
  • Insert test data.

    #Requires update privilege for tables;
    
    insert into employee_db.country_region values ('US','NA'), ('CA','NA'), ('UK','UK'), ('DE','EU'), ('FR','EU'); 
    insert into employee_db.employee_data values (1,'james','US'),(2,'john','US'), (3,'mark','UK'), (4,'sally-sales','UK'),(5,'sally','DE'), (6,'emily','DE');
    
    data_admin2.jpg
    #Requires select privilege for columns;
    select * from employee_db.country_region; 
    select * from employee_db.employee_data; 
    
  • Now try to create a view on top of the two tables created above; it fails with the error shown below:

    create view employee_db.employee_region(userid, region) as select e.userid, cr.region from employee_db.employee_data e, employee_db.country_region cr where e.country = cr.country;
    
    Error: Error while compiling statement: 
    FAILED: HiveAccessControlException 
    Permission denied: user [emily] does not have [DATA_ADMIN] privilege on [employee_db/employee_data] (state=42000,code=40000)
    
    data_admin4.jpg
  • Create a policy for the view employee_db.employee_region as shown in the image below.

    data_admin3.jpg

    With the policy created as shown above, execute the same query again; it will pass through. A re-run sketch follows the note below.

    Note

    Granting the DATA_ADMIN privilege on a resource implicitly grants the Select privilege on the same resource.
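
    With the DATA_ADMIN policy in place, the statement that failed earlier succeeds. A minimal re-run sketch, assuming a Databricks notebook where the spark session is predefined (the SQL is the same statement used above):

    # Re-run the CREATE VIEW that previously failed with HiveAccessControlException.
    spark.sql("""
    create view employee_db.employee_region(userid, region) as
    select e.userid, cr.region
    from employee_db.employee_data e, employee_db.country_region cr
    where e.country = cr.country
    """)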

Alter View

#Requires alter permission on the view;
ALTER VIEW employee_db.employee_region AS  select e.userid, cr.region 
from employee_db.employee_data e, employee_db.country_region cr where 
e.country = cr.country;

Rename View

#Requires alter permission on the view;
ALTER VIEW employee_db.employee_region RENAME TO employee_db.employee_region_renamed;

Drop View

#Requires Drop permission on the view;
DROP VIEW employee_db.employee_region_renamed;

Row-Level Filter

Create the view (if it does not already exist) and query it. With a row-level filter policy applied, the query returns only the rows permitted for the querying user:

create view if not exists employee_db.employee_region(userid, region) as select 
e.userid, cr.region from employee_db.employee_data e, 
employee_db.country_region cr where e.country = cr.country;

select * from employee_db.employee_region;
      
hive_rowfilter.jpg

Column Masking

With a column masking policy applied, querying the view returns masked values for the protected columns:

select * from employee_db.employee_region;
      
hive_masking.jpg

Whitelisting for Py4J security manager

Certain Python methods are blacklisted on Databricks clusters to enhance cluster security. When you try to access such a method, you might receive the following error:

py4j.security.Py4JSecurityException: … is not whitelisted

If you still want to access those Python classes or methods, you can add them to a whitelist file. To whitelist classes or methods, do the following:

  1. Create a file containing a list of all the packages, class constructors or methods that should be whitelisted.

    1. For whitelisting a complete Java package (including all of its classes), add the package name ending with .*

      org.apache.spark.api.python.*
    2. For whitelisting constructors of the given class, add the fully qualified class name.

      org.apache.spark.api.python.PythonRDD
    3. For whitelisting specific methods of a given class, add the fully qualified class name followed by the method name.

      org.apache.spark.api.python.PythonRDD.runJobToPythonFile
      org.apache.spark.api.python.SerDeUtil.pythonToJava
  2. Once you have added all the required packages, classes, and methods, the file will contain a list of entries as shown below.

    org.apache.spark.sql.SparkSession.createRDDFromTrustedPath
    org.apache.spark.api.java.JavaRDD.rdd
    org.apache.spark.rdd.RDD.isBarrier
    org.apache.spark.api.python.*
    
  3. Upload the file to a DBFS location that can be referenced from the cluster's Spark Config.

    For example, if the file whitelist.txt contains the classes/methods to be whitelisted, run the following command to upload it to DBFS:

    dbfs cp whitelist.txt dbfs:/privacera/whitelist.txt
  4. Add the following property to the Spark Config, referencing the DBFS file location:

    spark.hadoop.privacera.whitelist dbfs:/privacera/whitelist.txt
  5. Restart your cluster.
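
After the cluster restarts, you can check the effect from a notebook. A minimal sketch, assuming the sample whitelist entries shown above (org.apache.spark.api.python.*, org.apache.spark.api.java.JavaRDD.rdd, and org.apache.spark.rdd.RDD.isBarrier) have been uploaded and referenced; without them, the JVM-side calls below fail with py4j.security.Py4JSecurityException:

# Runs in a Databricks notebook where `spark` is predefined.
rdd = spark.sparkContext.parallelize(range(10))
# `_jrdd` is the Py4J handle to the underlying JavaRDD; `.rdd()` invokes
# org.apache.spark.api.java.JavaRDD.rdd and `.isBarrier()` invokes
# org.apache.spark.rdd.RDD.isBarrier on the JVM side.
print(rdd._jrdd.rdd().isBarrier())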

Access AWS S3 using Boto3 from Databricks

This section describes how to use the AWS SDK for Python (Boto3) with Privacera Platform to access AWS S3 file data through the Privacera DataServer proxy.

Prerequisites

Ensure that the following prerequisites are met:

  1. Add the iptables rule to the Databricks init script.

    To enable boto3 access control in your Databricks environment, add the following command to open port 8282 for outgoing connections:

    sudo iptables -I OUTPUT 1 -p tcp -m tcp --dport 8282 -j ACCEPT
  2. Restart the Databricks cluster.

The iptables command is passed to the init script through the Privacera Manager properties in the vars.databricks.plugin.yml file, as shown below; after adding it, run the Privacera Manager update command.

DATABRICKS_POST_PLUGIN_COMMAND_LIST:
  - echo "Completed Installation"
  - sudo iptables -I OUTPUT 1 -p tcp -m tcp --dport 8282 -j ACCEPT

Accessing AWS S3 files

Run the following commands in a Databricks notebook:

  1. Install the AWS Boto3 libraries

    pip install boto3
  2. Import the required libraries

    import boto3
  3. Fetch the DataServer certificate

    Note

    If SSL is enabled on the dataserver, the port is 8282.

    %sh
    sudo iptables -I OUTPUT 1 -p tcp -m tcp --dport 8282 -j ACCEPT
    dirname="/tmp/lib3"
    mkdir -p -- "$dirname"
    DS_URL="https://{DATASERVER_EC2_OR_K8S_LB_URL}:{DAS_SSL_PORT}"
    #Sample url as shown below
    #DS_URL="https://10.999.99.999:8282"
    DS_CERT_FILE="$dirname/ds.pem"
    
    curl -k -H "connection:close" -o "${DS_CERT_FILE}" "${DS_URL}/services/certificate"
  4. Access the AWS S3 files

    def check_s3_file_exists(bucket, key, access_key, secret_key, endpoint_url, dataserver_cert, region_name):
      # Reads the object through the Privacera DataServer proxy using the S3
      # resource API; returns True if the read succeeds, False otherwise.
      exec_status = False
      try:
        s3 = boto3.resource(service_name='s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key, endpoint_url=endpoint_url, region_name=region_name, verify=dataserver_cert)
        print(s3.Object(bucket_name=bucket, key=key).get()['Body'].read().decode('utf-8'))
        exec_status = True
      except Exception as e:
        print("Got error: {}".format(e))
      finally:
        return exec_status

    def read_s3_file(bucket, key, access_key, secret_key, endpoint_url, dataserver_cert, region_name):
      # Performs the same access check using the lower-level S3 client API.
      exec_status = False
      try:
        s3 = boto3.client(service_name='s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key, endpoint_url=endpoint_url, region_name=region_name, verify=dataserver_cert)
        obj = s3.get_object(Bucket=bucket, Key=key)
        print(obj['Body'].read().decode('utf-8'))
        exec_status = True
      except Exception as e:
        print("Got error: {}".format(e))
      finally:
        return exec_status
      
    readFilePath = "file data/data/format=txt/sample/sample_small.txt"
    bucket = "infraqa-test"
    #platform
    access_key = "${privacera_access_key}"
    secret_key = "${privacera_secret_key}"
    endpoint_url = "https://${DATASERVER_EC2_OR_K8S_LB_URL}:${DAS_SSL_PORT}"
    #sample value as shown below
    #endpoint_url = "https://10.999.99.999:8282"
    priv_dataserver_cert = "/tmp/lib3/ds.pem"
    region_name = "us-east-1"
    print(f"got file===== {readFilePath} ============= bucket= {bucket}")
    status = check_s3_file_exists(bucket, readFilePath, access_key, secret_key, endpoint_url, priv_dataserver_cert, region_name)
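
    The read_s3_file helper defined above can be exercised the same way; a minimal sketch reusing the variables set in this step:

    # Read the same object through the S3 client API via the DataServer proxy.
    status = read_s3_file(bucket, readFilePath, access_key, secret_key, endpoint_url, priv_dataserver_cert, region_name)
    print("read_s3_file returned: {}".format(status))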

Access Azure file using Azure SDK from Databricks

This section describes how to use the Azure SDK with Privacera Platform to access Azure Data Lake Storage file data through the Privacera DataServer proxy.

Prerequisites

Ensure that the following prerequisites are met:

  1. Add the iptables rule to the Databricks init script.

    To enable Azure SDK access control in your Databricks environment, add the following command to open port 8282 for outgoing connections:

    sudo iptables -I OUTPUT 1 -p tcp -m tcp --dport 8282 -j ACCEPT
  2. Restart the Databricks cluster.

The iptables command is passed to the init script through the Privacera Manager properties in the vars.databricks.plugin.yml file, as shown below; after adding it, run the Privacera Manager update command.

DATABRICKS_POST_PLUGIN_COMMAND_LIST:
  - echo "Completed Installation"
  - sudo iptables -I OUTPUT 1 -p tcp -m tcp --dport 8282 -j ACCEPT

Accessing Azure files

Run the following commands in a Databricks notebook:

  1. Install the Azure SDK libraries

    pip install azure-storage-file-datalake
  2. Import the required libraries

    import os, uuid, sys
    from azure.storage.filedatalake import DataLakeServiceClient
    from azure.core._match_conditions import MatchConditions
    from azure.storage.filedatalake._models import ContentSettings
  3. Fetch the DataServer certificate

    Note

    If SSL is enabled on the dataserver, the port is 8282.

    %sh
    sudo iptables -I OUTPUT 1 -p tcp -m tcp --dport 8282 -j ACCEPT
    dirname="/tmp/lib3"
    mkdir -p -- "$dirname"
    DS_URL="https://{DATASERVER_EC2_OR_K8S_LB_URL}:{DAS_SSL_PORT}"
    #Sample url as shown below
    #DS_URL="https://10.999.99.999:8282"
    DS_CERT_FILE="$dirname/ds.pem"
    
    curl -k -H "connection:close" -o "${DS_CERT_FILE}" "${DS_URL}/services/certificate"
  4. Initialize the storage account using the connection string method

    def initialize_storage_account_connect_str(my_connection_string):
        
        try:  
            global service_client
            print(my_connection_string)
            os.environ['REQUESTS_CA_BUNDLE'] = '/tmp/lib3/ds.pem'
            service_client = DataLakeServiceClient.from_connection_string(conn_str=my_connection_string, headers={'x-ms-version': '2020-02-10'})
        
        except Exception as e:
            print(e)
  5. Prepare the connection string

    def prepare_connect_str():
        try:
            
            connect_str = "DefaultEndpointsProtocol=https;AccountName=${privacera_access_key}-{storage_account_name};AccountKey=${base64_encoded_value_of(privacera_access_key|privacera_secret_key)};BlobEndpoint=https://${DATASERVER_EC2_OR_K8S_LB_URL}:${DAS_SSL_PORT};"
            
           # sample value is shown below
           #connect_str = "DefaultEndpointsProtocol=https;AccountName=MMTTU5Njg4Njk0MDAwA6amFpLnBhdGVsOjE6MTY1MTU5Njg4Njk0MDAw==-pqadatastorage;AccountKey=TVRVNUTU5Njg4Njk0MDAwTURBd01UQTZhbUZwTG5CaGRHVnNPakU2TVRZMU1URTJOVGcyTnpVMTU5Njg4Njk0MDAwVZwLzNFbXBCVEZOQWpkRUNxNmpYcjTU5Njg4Njk0MDAwR3Q4N29UNFFmZWpMOTlBN1M4RkIrSjdzSE5IMFZic0phUUcyVHTU5Njg4Njk0MDAwUxnPT0=;BlobEndpoint=https://10.999.99.999:8282;"
    
            return connect_str
        except Exception as e:
          print(e)
  6. Define a sample access method to list Azure files and directories

    def list_directory_contents(connect_str):
        try:
            initialize_storage_account_connect_str(connect_str)
            
            file_system_client = service_client.get_file_system_client(file_system="{storage_container_name}")
            #sample values as shown below
            #file_system_client = service_client.get_file_system_client(file_system="infraqa-test")
    
            paths = file_system_client.get_paths(path="{directory_path}")
            #sample values as shown below
            #paths = file_system_client.get_paths(path="file data/data/format=csv/sample/")
    
            for path in paths:
                print(path.name + '\n')
    
        except Exception as e:
          print(e)
  7. To verify that the proxy is functioning, call the access methods

    connect_str = prepare_connect_str()
    list_directory_contents(connect_str)
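
    To read the contents of a single file through the same proxy, the service client initialized above can be reused. A minimal sketch; the helper name, container, and file path below are illustrative placeholders, not part of the original walkthrough:

    def read_file_contents(connect_str, container, file_path):
        # Hypothetical helper: downloads one file through the DataServer proxy
        # and prints its contents.
        try:
            initialize_storage_account_connect_str(connect_str)
            file_system_client = service_client.get_file_system_client(file_system=container)
            file_client = file_system_client.get_file_client(file_path)
            data = file_client.download_file().readall()
            print(data.decode('utf-8'))
        except Exception as e:
            print(e)

    read_file_contents(connect_str, "{storage_container_name}", "{directory_path}/{file_name}")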