Object Store Access in Databricks Clusters

This document describes how access to object stores such as AWS S3, ADLS on Azure, and GCS on GCP, as well as other file access, works when using Databricks clusters with Privacera.

This only applies to Databricks Clusters with Fine-Grained Access Control (FGAC) enabled.

Overview

When using Databricks clusters with Privacera, Privacera can enforce access control policies on object stores and other file access. In Databricks clusters, files can be accessed in the following ways:

  • Databricks File System (DBFS): A distributed file system abstraction built on top of cloud storage. It allows you to access files stored in cloud storage using a file system interface. For example, you can use the dbutils.fs API to read and write files in DBFS.
  • spark.read() and spark.write(): Spark APIs that allow you to read data from and write data to various sources, including cloud storage. For example, you can use the spark.read API to read data from S3, ADLS, or GCS.
  • DataFrames: A Spark API that allows you to create DataFrames from various sources, including cloud storage. For example, you can use the spark.read.csv API to read CSV files from S3, ADLS, or GCS.
  • Creating External Tables: A SQL statement that allows you to create external tables in Databricks. External tables are tables whose data is stored in cloud storage, and you can use SQL queries to access them. For example, you can use the CREATE EXTERNAL TABLE statement to create an external table in Databricks.

Usage

Configuring Policies

Privacera has a service repository for files, generally called privacera_files. Access to files is managed by specifying the file path. File paths can be in the following formats:

  • s3://bucket-name/path/to/file
  • adls://container-name/path/to/file
  • gs://bucket-name/path/to/file
  • dbfs:/path/to/file
  • /path/to/file

Accessing files using Apache Spark commands

When using a Databricks cluster with FGAC, there is no change in the way you access files. You can use the same commands you would normally use in Databricks. Privacera enforces access control policies on the underlying cloud storage and file mounts.

Here are some examples of how to access files using Apache Spark commands:

Python
# Read a CSV file from S3
df = spark.read.csv("s3://bucket-name/path/to/file.csv")
# Write a DataFrame to S3
df.write.csv("s3://bucket-name/path/to/output.csv")

In the above example, if the user has access to the file path s3://bucket-name/path/to/file.csv, they can read the file. If they do not have access to the file path, they will get an access denied error.
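
If you want to handle a denial programmatically, you can wrap the read in a try/except block. The sketch below is illustrative: the exact exception class raised for a denied path depends on the Spark and Privacera versions in use, so it catches a generic exception and inspects the message.

Python
# Illustrative sketch: handle a denied read. The exact exception type
# surfaced for an access-denied path can vary, so a generic Exception
# is caught and the message is inspected.
try:
    df = spark.read.csv("s3://bucket-name/path/to/file.csv")
    df.show(5)
except Exception as e:
    if "denied" in str(e).lower():
        print("Access to the path was denied by policy")
    else:
        raise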

Databricks File System (DBFS)

When files from DBFS are accessed, Privacera can enforce access control policies on the underlying cloud storage.

Here are some examples of how to access files using DBFS:

Python
# Read a CSV file from DBFS
df = spark.read.csv("dbfs:/path/to/file.csv")
# Write a DataFrame to DBFS
df.write.csv("dbfs:/path/to/output.csv")

In the above example, if the user has access to the file path dbfs:/path/to/file.csv, they can read the file. If they do not have access to the file path, they will get an access denied error.
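
DBFS paths can also be accessed with the dbutils.fs utility available in Databricks notebooks. A minimal sketch, assuming the user has read access to the paths involved:

Python
# dbutils is available in Databricks notebooks without an import.
# List the contents of a DBFS directory
for f in dbutils.fs.ls("dbfs:/path/to/"):
    print(f.path, f.size)

# Preview the first bytes of a file
print(dbutils.fs.head("dbfs:/path/to/file.csv"))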

Creating External Tables

When an external table is created, Privacera verifies that the user has access to the underlying cloud storage location.

Here is an example of how to create an external table:

SQL
CREATE EXTERNAL TABLE my_table
USING CSV
LOCATION 's3://bucket-name/path/to/file.csv'

In the above example, if the user has access to the file path s3://bucket-name/path/to/file.csv, they can create the external table. If they do not have access to the file path, they will get an access denied error.

To create the table, the user must also have permission to create tables in the database.
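
Once created, the external table can be queried like any other table. For example, from Python:

Python
# Query the external table created above
spark.sql("SELECT * FROM my_table LIMIT 10").show()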

Other Considerations

Support for Boto3

When using Databricks clusters with Privacera's FGAC support, direct access to the node's IAM role is not allowed. This means any tool that relies on the IAM role to access the underlying cloud storage will not work. However, you can use the Boto3 library with AWS Access Keys and Secret Keys, or with Privacera's PToken. Refer to Using Boto3 from Databricks Cluster with FGAC for details on using Boto3 with Privacera's PToken.
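
As a minimal sketch, the example below creates a Boto3 S3 client with explicit AWS Access Keys instead of relying on the node's IAM role. The key values are placeholders; using Privacera's PToken instead is covered in the page linked above.

Python
import boto3

# Create an S3 client with explicit credentials instead of relying on
# the node's IAM role (placeholder values shown)
s3 = boto3.client(
    "s3",
    aws_access_key_id="<AWS_ACCESS_KEY_ID>",
    aws_secret_access_key="<AWS_SECRET_ACCESS_KEY>",
)

# List objects under a prefix
response = s3.list_objects_v2(Bucket="bucket-name", Prefix="path/to/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])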
