Object Store Access in Databricks Clusters
This document describes how access to object stores such as AWS S3, ADLS in Azure, and GCS in GCP, as well as other file access, works when using Databricks Clusters with Privacera.
This applies only to Databricks Clusters with Fine-Grained Access Control (FGAC) enabled.
Overview
When using Databricks Clusters with Privacera, Privacera can enforce access control policies on object stores and other file access. In Databricks Clusters, files can be accessed in the following ways:
- Databricks File System (DBFS): A distributed file system built on top of cloud storage. It allows you to access files stored in cloud storage through a file system interface. For example, you can use the `dbutils.fs` API to read and write files in DBFS.
- spark.read() and spark.write(): Spark APIs that allow you to read data from and write data to various sources, including cloud storage. For example, you can use the `spark.read` API to read data from S3, ADLS, or GCS.
- DataFrames: A Spark API that allows you to create DataFrames from various sources, including cloud storage. For example, you can use the `spark.read.csv` API to read CSV files from S3, ADLS, or GCS.
- Creating External Tables: A SQL API that allows you to create external tables in Databricks. External tables are stored in cloud storage, and you can query them with SQL. For example, you can use the `CREATE EXTERNAL TABLE` statement to create an external table in Databricks.
Usage
Configuring Policies
Privacera has a service repository for files, generally called `privacera_files`. Access to files is managed by providing the file path. File paths can be in the following formats:
- `s3://bucket-name/path/to/file`
- `adls://container-name/path/to/file`
- `gs://bucket-name/path/to/file`
- `dbfs:/path/to/file`
- `/path/to/file`
Accessing files using Apache Spark commands
When using a Databricks Cluster with FGAC, there is no change in the way you access files. You can use the same commands you would normally use in Databricks, and Privacera enforces access control policies on the underlying cloud storage and file mounts.
Here are some examples of how to access files using Apache Spark commands:
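The following is a minimal sketch; the bucket and path names are the same placeholders used in the policy examples above:

```python
# Read a CSV file from S3 with the standard Spark API.
# Privacera checks the policy for s3://bucket-name/path/to/file.csv.
df = spark.read.csv("s3://bucket-name/path/to/file.csv", header=True)
df.show()

# Writes are checked the same way, against the target path.
df.write.mode("overwrite").parquet("s3://bucket-name/path/to/output/")
```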
In the above example, if the user has access to the file path `s3://bucket-name/path/to/file.csv`, then the user will be able to read the file. If the user does not have access to the file path, then the user will get an access denied error.
Databricks File System (DBFS)
When files are accessed through DBFS, Privacera enforces access control policies on the underlying cloud storage.
Here are some examples of how to access files using DBFS:
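The following is a minimal sketch, assuming the placeholder DBFS path used below:

```python
# List files under a DBFS directory; Privacera checks the policy
# on the cloud storage path backing the DBFS location.
display(dbutils.fs.ls("dbfs:/path/to/"))

# Read a file through DBFS with Spark; access is checked the same way.
df = spark.read.csv("dbfs:/path/to/file.csv", header=True)
```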
In the above example, if the user has access to the file path `dbfs:/path/to/file.csv`, then the user will be able to read the file. If the user does not have access to the file path, then the user will get an access denied error.
Creating External Tables
When creating external tables, Privacera makes sure that the user has access to the underlying cloud storage.
Here is an example of how to create an external table:
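A minimal sketch follows; the table name and columns are illustrative, and the `LOCATION` points at the placeholder directory containing the file referenced below:

```python
# Create an external table over files in cloud storage.
# Privacera verifies access to the LOCATION path; the user must
# also be allowed to create tables in the database.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS customer_data (
        id INT,
        name STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://bucket-name/path/to/'
""")
```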
In the above example, if the user has access to the file path `s3://bucket-name/path/to/file.csv`, then the user will be able to create the external table. If the user does not have access to the file path, then the user will get an access denied error. To create the table, the user must also have permission to create tables in the database.
Other Considerations
Support for Boto3
When using Databricks Clusters with Privacera's FGAC support, direct access to the node's IAM role is not allowed. This means that any tool that uses the IAM role to access the underlying cloud storage will not work. However, it is possible to use the Boto3 library with AWS access keys and secret keys, or with Privacera's PToken. Refer to Using Boto3 from Databricks Cluster with FGAC for details on using Boto3 with Privacera's PToken.
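A minimal sketch using static AWS credentials; the key values, bucket name, and prefix are placeholders (for the PToken approach, see the guide linked above):

```python
import boto3

# Under FGAC the cluster's IAM role is not reachable, so credentials
# must be supplied explicitly rather than picked up from the instance.
s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY",        # placeholder
    aws_secret_access_key="YOUR_SECRET_KEY",    # placeholder
)

# List objects under a prefix using the supplied credentials.
response = s3.list_objects_v2(Bucket="bucket-name", Prefix="path/to/")
for obj in response.get("Contents", []):
    print(obj["Key"])
```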