About FGAC and OLAC for Apache Spark¶
This document does not apply to Databricks Unity Catalog or to EMR/EMR Serverless secured by AWS Lake Formation.
This document outlines Privacera’s integrations for Apache Spark, specifically, Fine-Grained Access Control (FGAC) and Object Level Access Control (OLAC). The primary goal is to help you understand how each solution addresses different security needs and to guide you in selecting the right approach for your Apache Spark environment.
Overview¶
When securing Apache Spark with Privacera, it is important to differentiate between:
- Access Controls at the File/Object Level: Ensuring users only see specific files they are authorized to view.
- Access Controls at the Row/Column Level: Granting or restricting data access at a more granular level.
Privacera offers two solutions:
- FGAC: Allows fine-grained controls within tables (rows, columns). Its secure mode is supported only on Databricks clusters. FGAC also supports object-store access so that files, objects, and folders can be protected using the same mechanisms.
- OLAC: Grants or denies access to entire objects or folders.
Fine-Grained Access Control (FGAC)¶
Fine-Grained Access Control (FGAC) allows table-, column-, and row-level restrictions, offering more granular security rules. Beyond row- and column-level enforcement, FGAC also supports object store controls within Databricks. Enforcement relies on the Privacera plugin, which intercepts SparkSQL queries and applies the relevant policies.
For more details, see About FGAC and Apache Ranger Plugin.
Secure FGAC
To distinguish FGAC used for compliance from FGAC used for security, this document uses the term Secure FGAC for the mode that ensures users cannot bypass the security policies and access the underlying IAM role.
Core Features¶
- Granularity: Policies can be defined at the table, column, and row levels.
- Data Masking: Dynamically hide or transform certain columns.
- Row-Level Filtering: Exclude or include specific rows based on policies.
- Object-Level Support: Enforce policies on files/folders in object stores when running in a Databricks FGAC-enabled environment.
- Auditing: Logs fine-grained data accesses for review and monitoring.
How FGAC Works in Apache Spark¶
Whenever SparkSQL or Spark read/write methods run on a Spark cluster, Privacera checks user permissions and applies relevant policies for both object and table-level data.
sequenceDiagram
participant User
participant SparkSQL
participant PrivaceraPlugin
participant SparkEngine as SparkEngine<br/>(Privileged IAM Role)
participant ObjectStore
User->>SparkSQL: Submit SparkSQL Query
SparkSQL->>PrivaceraPlugin: Intercept Query
PrivaceraPlugin->>PrivaceraPlugin: Check User Access to Table and Columns
alt User Has Access
PrivaceraPlugin->>PrivaceraPlugin: Modify Query (Add UDF, Apply Row-Level Filter)
PrivaceraPlugin->>SparkSQL: Return Modified Query
SparkSQL->>SparkEngine: Execute Modified Query
SparkEngine->>ObjectStore: Access Data using IAM Role
ObjectStore->>SparkEngine: Return Data
SparkEngine->>User: Return Query Results
else User Does Not Have Access
PrivaceraPlugin->>SparkSQL: Deny Access (Throw Exception)
SparkSQL->>User: Access Denied Error
end
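The flow in the diagram can be approximated in plain Python. This is a minimal sketch, not Privacera's implementation: `TablePolicy`, `rewrite_query`, the `mask_hash` function name, and the sample table are all hypothetical names chosen for illustration. It shows only the two branches of the diagram: deny with an exception, or return a modified query with masking and a row-level filter applied.

```python
from dataclasses import dataclass, field


@dataclass
class TablePolicy:
    """Hypothetical policy for one user and table (not Privacera's policy model)."""
    allowed_columns: set
    masked_columns: dict = field(default_factory=dict)  # column -> masking function name
    row_filter: str = ""  # SQL predicate appended as a WHERE clause


def rewrite_query(user, table, columns, policies):
    """Return a policy-compliant SELECT for `user`, or raise PermissionError."""
    policy = policies.get((user, table))
    if policy is None:
        raise PermissionError(f"{user} has no access to {table}")
    denied = [c for c in columns if c not in policy.allowed_columns]
    if denied:
        raise PermissionError(f"{user} cannot read columns {denied} of {table}")

    # Wrap masked columns in a masking function, keeping the original column name.
    select_list = []
    for col in columns:
        mask_fn = policy.masked_columns.get(col)
        select_list.append(f"{mask_fn}({col}) AS {col}" if mask_fn else col)

    query = f"SELECT {', '.join(select_list)} FROM {table}"
    if policy.row_filter:
        query += f" WHERE {policy.row_filter}"
    return query


# Assumed sample policy: alice may read three columns, email is masked,
# and only EU rows are visible.
policies = {
    ("alice", "sales.customers"): TablePolicy(
        allowed_columns={"id", "email", "region"},
        masked_columns={"email": "mask_hash"},
        row_filter="region = 'EU'",
    )
}

modified = rewrite_query("alice", "sales.customers", ["id", "email"], policies)
print(modified)
# SELECT id, mask_hash(email) AS email FROM sales.customers WHERE region = 'EU'
```

In the real integration the rewrite happens inside the Spark plugin and the modified query runs under the privileged IAM role; the sketch only illustrates why the user never needs direct access to that role.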
Use Cases¶
Use Case | Description |
---|---|
Column-Level Security | Limit access to certain columns for specific user groups. |
Row-Level Filtering | Filter out rows based on user attributes or roles. |
Dynamic Data Masking | Apply different masking rules based on user attributes or roles. |
Dynamic Data Encryption | Encrypt or decrypt data dynamically based on user attributes or roles. |
Object Store Access with FGAC | Restrict or allow access to files in object stores, just as with OLAC. |
Examples of Secure FGAC use cases for Data Analysts and Data Engineers¶
Example 1: Filtering Data Based on Customer Consent
An analytics team uses a BI tool (such as Tableau or Power BI) connected to a Databricks cluster protected by Privacera FGAC. Some customers have withheld consent for marketing use of their data. Whenever the BI tool queries the Spark engine, FGAC policies automatically exclude rows associated with these non-consenting customers. As a result, the marketing dashboards only display data for customers who provided consent, ensuring regulatory compliance (e.g., GDPR, CCPA) while still providing insights into consenting customers’ behavior.
Example 2: Redacting Sensitive Data for Partner Sharing
A company collaborates with external partners who run analytics through the same Spark cluster via a BI tool. Privacera’s FGAC policies intercept queries that request sensitive fields (e.g., personally identifiable information) and dynamically mask or obfuscate those columns before results reach the partner’s BI dashboard. Partners view only redacted columns, maintaining privacy standards and protecting sensitive data, yet still gaining value from the shared dataset for their analytic workflows.
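The effect of the two examples above on query results can be illustrated with a small, self-contained Python sketch. It operates on in-memory rows rather than a Spark DataFrame, and `apply_policies`, `redact_email`, and the sample records are hypothetical; actual enforcement is done by FGAC policies, not by user code.

```python
def apply_policies(rows, row_filter=None, masks=None):
    """Apply an optional row filter, then mask the listed columns."""
    masks = masks or {}
    result = []
    for row in rows:
        if row_filter is not None and not row_filter(row):
            continue  # row excluded by the row-level filter
        result.append({k: masks[k](v) if k in masks else v for k, v in row.items()})
    return result


def redact_email(value):
    """Keep the domain and hide the local part (one possible masking rule)."""
    _, _, domain = value.partition("@")
    return "***@" + domain


# Assumed sample data for illustration only.
customers = [
    {"id": 1, "email": "ana@example.com", "marketing_consent": True},
    {"id": 2, "email": "bo@example.org", "marketing_consent": False},
    {"id": 3, "email": "cy@example.net", "marketing_consent": True},
]

# Example 1: marketing dashboards see only consenting customers.
marketing_view = apply_policies(customers, row_filter=lambda r: r["marketing_consent"])

# Example 2: partners see every row, but with the email column redacted.
partner_view = apply_policies(customers, masks={"email": redact_email})

print([r["id"] for r in marketing_view])  # [1, 3]
print(partner_view[0]["email"])           # ***@example.com
```

The same query therefore yields different results for different audiences, without any change to the BI tool or the underlying data.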
Technical Considerations¶
- Databricks Requirement: Secure FGAC is only operational on Databricks clusters configured with FGAC.
- Security of Cluster: Databricks clusters must be appropriately secured so that users cannot bypass FGAC and access the IAM role on the compute node directly.
- IAM Roles: Even on Databricks, restricting IAM permissions is a best practice to reduce the risk of unauthorized access.
Technical Limitations¶
- Databricks Only: Secure FGAC is only supported on Databricks clusters. It is not available for open-source Apache Spark or other platforms.
- Scala and Java Support: Secure FGAC is not supported for Scala and Java. It is only available for Python and R.
Object Level Access Control (OLAC)¶
Object Level Access Control (OLAC) restricts access to specific files or objects stored in systems like AWS S3, ADLS, GCS, or MinIO. OLAC is designed to work with open-source Apache Spark and does not require Databricks. It is particularly useful for organizations that need to control access to entire objects or folders without relying on privileged IAM roles on Spark clusters. This is achieved by using the Privacera DataServer to issue temporary credentials based on user permissions.
For more details, see About OLAC and Privacera DataServer.
Core Features¶
- Granularity: Controls access to entire objects or folders.
- Security Focus: Minimizes the need for privileged IAM roles on Spark clusters.
- Policy-Based Management: Supports data tag-based and user attribute-based policies.
- Auditing: Provides insights into object-level data access.
How OLAC Works in Apache Spark¶
When a Spark job requests data from an object store, Privacera DataServer issues temporary credentials based on the user’s permissions. This ensures the Spark cluster itself does not maintain privileged credentials.
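The idea of issuing short-lived, scoped credentials per request can be sketched as follows. This is an illustrative model only: `CredentialBroker`, its `issue` method, the path-prefix grants, and the token format are assumptions made for the example and do not represent the Privacera DataServer API.

```python
import secrets
from datetime import datetime, timedelta, timezone


class CredentialBroker:
    """Hypothetical stand-in for a service that issues temporary credentials."""

    def __init__(self, grants, ttl_minutes=15):
        self.grants = grants  # user -> list of allowed object-store path prefixes
        self.ttl = timedelta(minutes=ttl_minutes)

    def issue(self, user, path):
        """Return a short-lived credential scoped to `path`, or raise PermissionError."""
        prefixes = self.grants.get(user, [])
        if not any(path.startswith(prefix) for prefix in prefixes):
            raise PermissionError(f"{user} is not authorized for {path}")
        return {
            "token": secrets.token_hex(16),  # random placeholder, not a real cloud credential
            "scope": path,
            "expires_at": datetime.now(timezone.utc) + self.ttl,
        }


# Assumed grant: etl_user may read anything under the raw sales folder.
broker = CredentialBroker({"etl_user": ["s3://lake/raw/sales/"]})
cred = broker.issue("etl_user", "s3://lake/raw/sales/2024/part-0.parquet")
print(cred["scope"])
```

Because the credential is issued per request and expires, the cluster does not need to hold a long-lived privileged role, which is the property the paragraph above describes.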
Use Cases¶
Use Case | Description |
---|---|
Data Engineering Pipelines | Ensures only authorized files are loaded for transformations and analysis. |
ETL Processes | Restricts Spark-based ETL to specific input data objects or folders. |
Technical Limitations¶
- No Fine-Grained Control: OLAC does not allow filtering at row or column levels.
- SparkSQL Limitations: With SparkSQL, OLAC controls access to the underlying objects only; row- and column-level controls are not enforced. The cluster also needs access to a technical metastore such as Hive Metastore or Glue Catalog.
- External Catalogs: Privacera does not manage permissions for catalogs like Hive Metastore or Glue Catalog.
Comparison: OLAC vs Secure FGAC¶
Below is a quick comparison of OLAC and Secure FGAC for Apache Spark.
Aspect | OLAC | Secure FGAC (Databricks Only) |
---|---|---|
Object-Level Permissions/Enforcement | Yes | Yes (Includes files/folders in object stores) |
Row/Column-Level Filtering & Masking | No | Yes |
Credential Management via DataServer | Yes | Not Applicable |
Cluster IAM Role Requirement | No | Varies |
Recommendations¶
- Use OLAC if your main priority is controlling access to entire objects and reducing reliance on privileged credentials in Spark clusters. OLAC is also the only option supported for Scala and Java.
- Use FGAC if you need advanced row-, column-, and object-level security, and you are operating on Databricks with FGAC enabled.
- Secure Your Cluster: For both OLAC and FGAC, ensure that your clusters (whether open-source or Databricks) are configured with minimal privileges and robust authentication.
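As a rough aid, the recommendations and limitations in this document can be condensed into a small decision helper. The function name `recommend` and its parameters are hypothetical, and the sketch simplifies the guidance; treat it as a summary of the text, not as an official selection tool.

```python
def recommend(needs_row_column_controls, on_databricks, language):
    """Suggest an approach based on the guidance in this document (simplified)."""
    if language.lower() in {"scala", "java"}:
        # FGAC is only supported for Python and R.
        return "OLAC"
    if needs_row_column_controls and on_databricks:
        return "Secure FGAC"
    if needs_row_column_controls:
        # Outside Databricks, FGAC cannot guarantee that users are kept
        # away from the underlying IAM role.
        return "FGAC in compliance mode, optionally with OLAC"
    return "OLAC"


print(recommend(True, True, "Python"))    # Secure FGAC
print(recommend(False, False, "Scala"))   # OLAC
```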
FAQ¶
Here are some frequently asked questions regarding FGAC and OLAC:
- Can I use FGAC without Databricks?
  - FGAC can guarantee security enforcement only on Databricks clusters, where users cannot access the underlying IAM role. However, if your compliance requirements call for row- and column-level security on Spark jobs or for trusted users, you can enable FGAC in compliance mode. This mode enforces row- and column-level security on Spark jobs, but it does not guarantee that users cannot access the underlying IAM role (e.g., by using the Boto3 library). You can read more about this in the FGAC Compliance Mode section.
- Do I need both OLAC and FGAC enabled at the same time?
  - Not necessarily. FGAC provides row and column filtering in addition to object-level enforcement on Databricks. Organizations often choose one or the other depending on the required level of granularity.
- Can OLAC and FGAC policies coexist in the same environment?
  - This is not necessary on a Databricks cluster. However, when using FGAC for compliance reasons on open-source Spark, you have the option to use both OLAC and FGAC for different purposes. You can read more about this in the OLAC along with FGAC section.
- What happens if a user bypasses SparkSQL and tries to directly read from the object store?
  - With OLAC, there are no IAM roles on the compute nodes. Credentials to access the requested objects are provided by the externally running Privacera DataServer, and only if the user is authorized. With FGAC on Databricks, all access to the underlying IAM roles is denied for users running Spark jobs or using Databricks notebooks.
- Can I use Scala or Java with FGAC?
  - No. FGAC is only supported for Python and R. For Scala and Java, you should use OLAC.