About Access Control in Apache Spark

Apache Spark is a powerful, open-source engine for large-scale data processing. However, its flexibility, such as the ability to run arbitrary code or connect directly to object stores, creates security and governance challenges. This document highlights common design patterns for securing Apache Spark, focusing on how Privacera integrates with Spark environments to enforce data governance policies.

Deployment Patterns

Organizations can deploy Apache Spark in various environments, each presenting unique security and governance challenges. Here are some common scenarios and how Privacera helps:

  1. Apache Spark on Kubernetes

    • Usage: Spark clusters run as containers on Kubernetes (for example, AWS EKS), leveraging container orchestration and scaling.
    • Challenges: Potential for direct access to object stores, and ensuring that ephemeral containers adhere to security policies.
    • Privacera Support: OLAC ensures the Kubernetes cluster does not need privileged IAM roles; FGAC (where supported) can filter rows/columns. Code reviews and cluster lockdown remain key.
  2. Apache Spark on EMR and EMR Serverless

    • Usage: AWS EMR or EMR Serverless provides a managed Spark environment with auto-scaling.
    • Challenges: Guaranteeing that Spark jobs do not bypass S3 security controls.
    • Privacera Support: OLAC obtains temporary credentials from Privacera DataServer; FGAC can be used for row/column controls if the cluster and execution environment are locked down.
  3. AWS EMR and EMR Serverless Secured with Lake Formation

    • Usage: Lake Formation handles access control to underlying data in S3.
    • Challenges: Balancing Lake Formation’s native controls with more granular Spark-level policies.
    • Privacera Support: PolicySync can apply FGAC rules in conjunction with Lake Formation’s table/column restrictions, creating a unified governance layer.
  4. Apache Spark in Databricks Clusters (Python and R Only)

    • Usage: Databricks runtime focusing on Python and R notebooks.
    • Challenges: Ensuring data scientists only see permitted data.
    • Privacera Support: For Databricks clusters not governed by Unity Catalog, Privacera can use Apache Ranger plugins for FGAC, providing table-, row-, and column-level policies as well as broader object-level control.

  5. Apache Spark in Databricks Clusters (All Languages: Scala, Java, Python, R)

    • Usage: Full Databricks environment supporting all major Spark languages.
    • Challenges: Preventing any language from bypassing Spark security (e.g., Scala jobs or Python libraries calling object store APIs).
    • Privacera Support: Ranger plugin-based FGAC and OLAC can be combined to provide row/column filtering and object-level security.
  6. Apache Spark in Databricks Governed by Unity Catalog

    • Usage: Unity Catalog centralizes the governance of data in Databricks.
    • Challenges: Maintaining consistent policies across notebooks, jobs, and BI integrations.
    • Privacera Support: PolicySync to Unity Catalog translates Apache Ranger FGAC rules into Unity Catalog’s table/column-level constructs, ensuring consistent governance.
  7. Apache Spark in Databricks SQL Warehouse Secured by Native Databricks Permissions

    • Usage: A dedicated SQL warehouse for analytics and BI dashboards.
    • Challenges: Maintaining advanced row- and column-level policies on top of basic SQL-level permissions.
    • Privacera Support: Privacera can augment the native permissions with FGAC using Secure Views or PolicySync to enforce row/column filtering.

Spark Security Challenges

  1. Arbitrary Code Execution: Users can run Python, Scala, or R code on Spark clusters, potentially bypassing security checks.
  2. Distributed Architecture: Multiple worker nodes each need consistent enforcement of data access policies.
  3. Direct Object Store Access: If cluster nodes have privileged IAM roles, users could read data directly from S3, ADLS, or GCS without going through Spark’s security layer.
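
To make the third challenge concrete, the hypothetical sketch below shows how notebook code could read an object directly if the worker nodes carry a broad IAM role; the bucket and key names are placeholders, and no Spark-level policy is involved in the read.

    # Hypothetical illustration of challenge 3: with a privileged instance-profile
    # role on the nodes, user code can bypass Spark entirely and fetch raw objects.
    import boto3

    s3 = boto3.client("s3")  # silently picks up the node's IAM role credentials

    # Placeholder bucket/key; any row- or column-level policy defined for Spark
    # never sees this read, so the full raw file is returned.
    obj = s3.get_object(Bucket="example-data-lake", Key="pii/customers.csv")
    print(obj["Body"].read(200))  # first 200 bytes of unfiltered data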

The following design patterns address these challenges to different degrees.

Fine-Grained Access Control (FGAC)

FGAC enforces row- and column-level policies in Spark. With Privacera, it is fully supported in Databricks using Apache Ranger plugins. Some open-source Spark clusters can also implement FGAC with Privacera, but with additional considerations.

Key Elements

  • Ranger Plugin or PolicySync: Intercepts SparkSQL queries.
  • Row-Level Filters & Column Masking: Limits precisely which data is visible.
  • Audit Logs: Every access attempt is recorded.
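
As a conceptual sketch of these elements (and not the Ranger plugin's actual rewrite logic), the PySpark example below shows the visible effect of a row-filter and column-masking policy on a query; the table, columns, and policy are hypothetical.

    # Conceptual sketch only: the visible effect of a row filter plus column
    # masking on a SparkSQL query. Table, columns, and policy are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fgac-effect-sketch").getOrCreate()

    spark.createDataFrame(
        [("alice", "US", "111-22-3333"), ("bob", "DE", "444-55-6666")],
        ["name", "region", "ssn"],
    ).createOrReplaceTempView("customers")

    # For a user allowed only region = 'US' and barred from raw SSNs, a plain
    # SELECT * on customers effectively behaves like this constrained query:
    spark.sql("""
        SELECT name,
               region,
               concat('***-**-', substr(ssn, -4)) AS ssn   -- column masking
        FROM customers
        WHERE region = 'US'                                -- row-level filter
    """).show()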

Pros & Cons

  • Pros:
    • Protects sensitive columns (e.g., PII).
    • Filters out restricted rows for unauthorized users.
    • Useful for compliance (GDPR, HIPAA, etc.).
  • Cons:
    • Bypass is possible if the Spark cluster is not locked down.
    • Performance overhead from filtering/masking large datasets.

Typical Use Cases

  • Regulatory compliance requiring PII masking.
  • Multi-tenant environments where different rows must be accessible to different groups.

Object Level Access Control (OLAC)

Object Level Access Control manages permissions at the file or folder level in cloud storage systems (e.g., AWS S3, ADLS, GCS, or MinIO). It ensures that only authorized users or groups can retrieve specific objects.

Key Elements

  • Privacera DataServer: Issues short-lived credentials to Spark jobs.
  • No Privileged IAM Role: The Spark cluster does not hold permanent cloud credentials.
  • Access Policies: Defined centrally to allow or deny entire objects or directories.
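
For illustration only (the actual Privacera DataServer integration differs), the sketch below shows the general pattern of handing Spark short-lived object store credentials through standard Hadoop S3A settings rather than relying on a permanent instance role; the credential values and paths are placeholders.

    # Sketch of the short-lived-credential pattern using Hadoop S3A settings.
    # All values are placeholders obtained out of band from a credential service.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("olac-sketch")
        .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
        .config("spark.hadoop.fs.s3a.access.key", "<short-lived-access-key>")
        .config("spark.hadoop.fs.s3a.secret.key", "<short-lived-secret-key>")
        .config("spark.hadoop.fs.s3a.session.token", "<short-lived-session-token>")
        .getOrCreate()
    )

    # The read succeeds only while the token is valid and only for objects the
    # issuing policy covers; the cluster itself holds no permanent IAM role.
    df = spark.read.parquet("s3a://example-bucket/authorized-prefix/")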

Pros & Cons

  • Pros:
    • Strong security boundary at the storage layer.
    • Spark nodes don’t need privileged IAM roles.
    • Simplified model: either you have object access or you don’t.
  • Cons:
    • No row/column filtering.
    • If a user has object access, they see all data within that file.

Typical Use Cases

  • ETL pipelines that require access to entire files.
  • Machine learning workflows where entire datasets must be read.

Combined OLAC + FGAC

You can combine OLAC (to control who can access which files) with FGAC (to filter rows and columns within those files). This layered approach supports both compliance requirements and tighter limits on data exposure.

Key Elements

  • Short-Lived Credentials: Issued by Privacera DataServer for object retrieval.
  • Fine-Grained Enforcement: Row and column filtering via Apache Ranger plugin or PolicySync.
  • Unified Policy: Coordinated in Privacera; separate enforcement points at the file level and the row/column level.
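
Putting the two layers together (again with placeholder names and paths, not actual Privacera configuration), the sketch below shows how they compose: the object layer decides whether a read succeeds at all, and the row/column layer decides what survives the query.

    # Composition sketch: OLAC gates the read itself; FGAC shapes what a user
    # sees from objects that are readable. Paths and policies are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("olac-plus-fgac-sketch").getOrCreate()
    # (Short-lived S3A credentials would be configured as in the OLAC sketch.)

    # Layer 1: this read fails outright if the issued credentials do not cover
    # the prefix.
    spark.read.parquet("s3a://example-bucket/claims/") \
        .createOrReplaceTempView("claims")

    # Layer 2: even for readable objects, the effective query is limited to
    # permitted rows and masked columns.
    spark.sql("""
        SELECT claim_id,
               region,
               concat('***-**-', substr(ssn, -4)) AS ssn
        FROM claims
        WHERE region = 'US'
    """).show()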

Pros & Cons

  • Pros:
    • Strong defense-in-depth: Even if object-level access is granted, row/column filtering can further limit exposure.
    • Ideal for compliance + security scenarios.
  • Cons:
    • More complex to configure and maintain.
    • Potential overlap or conflicts between OLAC and FGAC policies.

Typical Use Cases

  • High-security, regulated industries (finance, healthcare) requiring strict controls at multiple levels.
  • Shared analytics environments where users must see certain columns but be barred from others.

Best Practices

  1. Secure the Cluster: Restrict who can submit Spark jobs and ensure IAM roles are tightly scoped.
  2. Centralized Policy Management: Use Privacera’s interface to manage OLAC and FGAC in one place.
  3. Audit and Monitoring: Regularly review logs for suspicious activities.
  4. Periodic Policy Reviews: Keep row/column policies and object-level rules up to date as data usage evolves.
  5. Performance Tuning: For FGAC, watch out for performance hits on large datasets; consider indexing or partitioning strategies.
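
As a small illustration of the fifth practice (paths and the partition column are hypothetical), laying data out partitioned by the column that row-level policies filter on lets Spark prune whole partitions instead of scanning and filtering every file.

    # Sketch: write data partitioned by the column that row-level policies key
    # on, so a policy filter such as region = 'US' prunes partitions at read time.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fgac-partitioning-sketch").getOrCreate()

    events = spark.read.parquet("s3a://example-bucket/raw/events/")

    (events.write
        .mode("overwrite")
        .partitionBy("region")          # hypothetical policy column
        .parquet("s3a://example-bucket/curated/events/"))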
