FGAC Compliance Mode in Apache Spark
This document does not apply to Databricks clusters with Privacera FGAC, Databricks Unity Catalog, or EMR/EMR Serverless secured by AWS Lake Formation.
This document explains how Fine-Grained Access Control (FGAC) can be applied to meet compliance requirements in open source Apache Spark deployments on Kubernetes and on Amazon EMR/EMR Serverless without Lake Formation.
Row-level filtering and masking policies can be enabled in open source Spark environments, with the limitation that a user can bypass the Privacera plugin and access the data directly from the object store. If compliance is the primary use case and the user or service user is trusted, this approach can still be used.
Overview
Fine-Grained Access Control (FGAC) refers to the ability to enforce column-level masking, row-level filtering, and dynamic policies within Spark. For security-sensitive use cases, relying on open source Spark FGAC alone is risky because Spark’s architecture allows users to run arbitrary code and potentially bypass FGAC. Nevertheless, FGAC can still be valuable for:
- Compliance: Filtering or masking personally identifiable information (PII) in reports.
- Data Governance: Enforcing organizational rules on how data can be used.
- Simplifying Audits: Providing partial auditing of who accesses certain columns or rows.
When deployed in Kubernetes, EMR, or EMR Serverless, Privacera can inject a policy enforcement engine that monitors SparkSQL commands and attempts to filter or mask sensitive data. However, the cluster must be appropriately secured so that end users cannot circumvent these controls.
Why Use FGAC for Compliance?
Organizations often face strict regulations on the handling of sensitive data, such as GDPR, HIPAA, or CCPA. These regulations require organizations to protect personal data and ensure that only authorized individuals access it. Organizations generally create multiple copies of a dataset for different purposes and provide them to different teams or partners, which requires ETL processes that filter or mask sensitive data.
However, if the filtering and masking code is embedded in each ETL process, it is difficult to audit and enforce. In large organizations with hundreds or thousands of ETL jobs, it is hard to ensure that every job filters or masks sensitive data correctly. FGAC provides a centralized way to enforce these policies during processing. When the policies change over time, they can be updated in one place rather than in every ETL job.
Architecture
Below is a simplified view of how FGAC can integrate with open source Spark in Kubernetes or Amazon EMR:
- Spark Driver: Launches tasks in containers (Kubernetes pods or EMR nodes).
- Privacera Plugin: Intercepts SparkSQL queries or Spark read/write operations.
- Policy Evaluation: Privacera checks row- and column-level policies for the given user and dataset.
- Data Masking/Filtering: Unapproved rows are filtered out, or sensitive columns are masked.
- Execution: Spark tasks run on worker nodes using the filtered or masked datasets.
sequenceDiagram
participant User
participant SparkSQL
participant PrivaceraPlugin
participant SparkEngine as SparkEngine<br/>(Privileged IAM Role)
participant ObjectStore
User->>SparkSQL: Submit SparkSQL Query
SparkSQL->>PrivaceraPlugin: Intercept Query
PrivaceraPlugin->>PrivaceraPlugin: Check User Access to Table and Columns
alt User Has Access
PrivaceraPlugin->>PrivaceraPlugin: Modify Query (Add UDF, Apply Row-Level Filter)
PrivaceraPlugin->>SparkSQL: Return Modified Query
SparkSQL->>SparkEngine: Execute Modified Query
SparkEngine->>ObjectStore: Access Data using IAM Role
ObjectStore->>SparkEngine: Return Data
SparkEngine->>User: Return Query Results
else User Does Not Have Access
PrivaceraPlugin->>SparkSQL: Deny Access (Throw Exception)
SparkSQL->>User: Access Denied Error
end
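The flow in the diagram above can be sketched in plain Python. This is only an illustration of the idea of checking access and then rewriting a query. It is not Privacera's implementation: the `TablePolicy` class, the `rewrite_query` function, the `customers` table, and the `etl_marketing` user are all hypothetical names, and `sha2` stands in for whatever masking expression a real policy would use.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TablePolicy:
    """Hypothetical policy record; not Privacera's actual policy model."""
    allowed_users: set
    row_filter: Optional[str] = None  # SQL predicate applied to every row
    masked_columns: dict = field(default_factory=dict)  # column -> SQL expression


class AccessDenied(Exception):
    """Raised when the user has no access to the requested table."""


def rewrite_query(user, table, columns, policies):
    """Return a SELECT with the policy's masks and row filter applied."""
    policy = policies.get(table)
    if policy is None or user not in policy.allowed_users:
        raise AccessDenied(f"{user} may not read {table}")
    # Swap each masked column for its masking expression, keeping the name.
    select_list = [
        f"{policy.masked_columns[c]} AS {c}" if c in policy.masked_columns else c
        for c in columns
    ]
    sql = f"SELECT {', '.join(select_list)} FROM {table}"
    if policy.row_filter:
        sql += f" WHERE {policy.row_filter}"
    return sql


# Assumed example policy: one service user, a consent filter, a hashed column.
POLICIES = {
    "customers": TablePolicy(
        allowed_users={"etl_marketing"},
        row_filter="consent = true",
        masked_columns={"email": "sha2(email, 256)"},
    )
}
```

Calling `rewrite_query("etl_marketing", "customers", ["name", "email"], POLICIES)` yields a query that selects the hashed email and keeps only consenting rows, while any other user gets `AccessDenied`. The sketch only changes the query text; as the warning below explains, it does nothing to stop code that reads the object store directly.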
Warning
Although the Privacera plugin can apply FGAC at query time, users could still bypass the Spark layer through direct object store access, so this approach is not suitable for security-sensitive use cases.
Advantages
- Compliance-Oriented Filtering: Easily exclude non-consenting users or mask PII in compliance with regulations like GDPR, HIPAA, or CCPA.
- Scalable: FGAC leverages Spark’s parallel processing, ensuring that row-level filtering or column masking is applied across large datasets.
- Audit Trails: Privacera can generate audit logs of user queries and data accesses, demonstrating how sensitive data was handled.
Prerequisites
- IAM Roles: Ensure that the Spark cluster has the IAM roles required to access the object store.
- Access Policies: Define Privacera policies for row-level filtering and column masking, including for object store access.
- Metastore: Ensure that the Spark cluster is connected to a Hive or AWS Glue metastore. Note: Privacera does not enforce restrictions on the metastore, so users with direct access to it can bypass FGAC.
Limitations
- Potential Bypass: Spark allows arbitrary code execution (e.g., Python libraries) that can bypass FGAC if users have privileged IAM roles or direct object store access.
- Cluster Security Requirements: Administrators must ensure that only trusted code or jobs can run; otherwise, malicious code can circumvent Privacera policies.
- No Native Column/Row Enforcement: Open source Spark’s architecture does not provide built-in row or column-level enforcement, so FGAC relies on the Privacera plugin intercepting Spark commands.
- Limited Support: Official support for FGAC is primarily on Databricks clusters. Using FGAC on open source Spark is feasible but is considered an exception for compliance use cases where security is not the primary concern.
- Access Policies: Privacera policies must be defined for row-level filtering and column masking, including for object store access. These policies are enforced only when the Spark job runs through the Privacera plugin.
Below is a diagram that shows how, in FGAC integration, users can bypass Privacera and use the SparkEngine's IAM role to access datasets in the ObjectStore.
sequenceDiagram
participant User
participant SparkJob
participant SparkEngine as SparkEngine<br/>(Privileged IAM Role)
participant ObjectStore
User->>SparkJob: Submit Spark Job with Boto3 Code
SparkJob->>SparkEngine: Execute Python Code (Boto3)
SparkEngine->>ObjectStore: Access Data via Boto3 using IAM Role
ObjectStore->>SparkEngine: Return Data
SparkEngine->>User: Return Job Results
Best Practices
- Code Reviews: Inspect Spark jobs for calls to external libraries or code that might bypass FGAC.
- Service Users: For each use case, create a dedicated service user with limited access permissions in Privacera to run Spark jobs. This helps in updating policies without affecting other use cases.
- Data Governance Policies: Define clear policies for which datasets require column masking or row filtering.
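Part of the code review practice above could be automated with a static check. The sketch below, which is an assumption and not a Privacera feature, uses Python's standard `ast` module to flag imports of libraries that can reach the object store directly. The deny-list `SUSPECT_MODULES` and the function name `find_suspect_imports` are hypothetical and would need to be adapted to each environment.

```python
import ast

# Hypothetical deny-list of modules that can reach the object store directly.
SUSPECT_MODULES = {"boto3", "botocore", "s3fs", "smart_open"}


def find_suspect_imports(source):
    """Return sorted (line, module) pairs for imports on the deny-list."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        for name in names:
            # Compare only the top-level package, e.g. "s3fs" for "s3fs.core".
            if name.split(".")[0] in SUSPECT_MODULES:
                findings.append((node.lineno, name))
    return sorted(findings)
```

A check like this only catches plain `import` statements. It does not detect dynamic imports or access through other Spark APIs, so it can support a manual review but not replace it.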
Use Cases
FGAC is particularly useful in scenarios where organizations need to filter or mask sensitive data for compliance purposes. Here are two examples:
Example 1: Filtering Data Based on Customer Consent
An organization runs a data pipeline to create datasets for marketing purposes. To comply with data protection regulations like GDPR or CCPA, they must exclude data belonging to customers who have not given consent for their information to be used in marketing activities. Using Privacera's FGAC, the pipeline enforces policies that filter out data of non-consenting customers during the Spark processing stages. This ensures that only data from customers who have provided consent is included in the marketing datasets, maintaining compliance with consent requirements and protecting customer privacy.
Example 2: Redacting Sensitive Data for Partner Sharing
A company needs to share datasets with external partners for collaborative projects or analytics. However, to comply with privacy laws and protect sensitive information, PII and other confidential data must be redacted or hashed before sharing. Privacera's FGAC enables the organization to implement policies that automatically mask or obfuscate sensitive fields during the Spark ETL process. As a result, the pipeline produces datasets where PII is securely redacted or transformed, ensuring that shared data complies with legal obligations and internal security policies while still being useful for the partners.
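The effect of the two example policies on the data can be illustrated with a small, self-contained Python sketch. It operates on in-memory rows rather than on Spark, and the column names (`marketing_consent`, `email`), the helper functions, and the choice of SHA-256 are assumptions made for illustration. They do not describe how Privacera evaluates or applies policies.

```python
import hashlib


def apply_consent_filter(rows):
    """Keep only rows whose customer has consented to marketing use."""
    return [row for row in rows if row.get("marketing_consent") is True]


def hash_columns(rows, columns):
    """Return copies of the rows with the given columns SHA-256 hashed."""
    masked = []
    for row in rows:
        out = dict(row)  # copy so the input rows stay unchanged
        for col in columns:
            if out.get(col) is not None:
                value = str(out[col]).encode("utf-8")
                out[col] = hashlib.sha256(value).hexdigest()
        masked.append(out)
    return masked


# Made-up sample rows for the two use cases.
customers = [
    {"id": 1, "email": "a@example.com", "marketing_consent": True},
    {"id": 2, "email": "b@example.com", "marketing_consent": False},
]
marketing_view = apply_consent_filter(customers)   # Example 1
partner_view = hash_columns(customers, ["email"])  # Example 2
```

Here `marketing_view` contains only the consenting customer, and `partner_view` contains both rows with `email` replaced by a hex digest. An unsalted hash of a low-entropy value such as an email address can be reversed by dictionary lookup, so a real masking policy may need a stronger transformation.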
FAQ
Here are some frequently asked questions about FGAC in open source Apache Spark:
- Is FGAC officially supported on open source Spark?
  - Official, direct support for FGAC is available on Databricks. For compliance use cases only, open source Apache Spark on Kubernetes or EMR can be configured with FGAC through Privacera.
- Can FGAC prevent all unauthorized data access?
  - No. If end users have elevated permissions (e.g., direct object store access), they can bypass FGAC. Ensuring the cluster is locked down is critical.
- Do I still need OLAC if I use FGAC?
  - Since FGAC can enforce row- and column-level policies on SparkSQL as well as on Spark read/write operations, OLAC is not required. However, OLAC and FGAC can be used together for different reasons. You can read more about this in the OLAC along with FGAC topic.