
Apache Spark Connect with Privacera Spark Plugin OLAC

This guide describes how to use Apache Spark Connect for secure client–server connectivity to a remote Spark cluster with Privacera Spark Plugin OLAC integration.

Overview

Spark Connect provides a decoupled client–server architecture for remote connectivity to Spark clusters, using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be used from anywhere: it can be embedded in modern data applications, IDEs, notebooks, and programming languages.

For more details, refer to Apache Spark Connect.

With Privacera OLAC, the plugin and privacera_spark.properties must be present where the Spark Connect server runs (the Spark driver side). Clients pass JWT identity using the same spark.hadoop.privacera.jwt.token.str configuration pattern as a normal Spark session. See JWT Token User Identity.

Pre-requisites

  • Apache Spark OLAC Setup — A fully configured, end-to-end Apache Spark OLAC environment is required, as the Spark Connect server operates within this setup and must incorporate all Privacera plugins along with the necessary trust configurations.
  • Python stack for Spark Connect (thin client and image) — PySpark against Spark Connect needs the same gRPC-oriented Python packages whether they are installed on a separate client host or baked into the Spark image. On a client-only machine, install compatible versions of:

    • grpcio
    • grpcio-status
    • pandas
    • pyarrow
    • zstandard

    Note

    Apache Spark–compatible versions of these libraries align with each pyspark-connect release on PyPI (pyspark-connect release history). Installing pyspark-connect directly bundles the required dependencies and usually removes the need to install them individually.

    Make sure the following RUN steps are included in the Spark image Dockerfile. The sample Dockerfile in Setup (Building the Docker Image; expand the Dockerfile note under Create Dockerfile) already contains these Spark Connect–related RUN steps.

    Docker
    # Fix NoSuchFileException in Apache Spark Connect for '.ammonite/rt-17.0.17.jar' and JLine warnings (Spark 4.x shell)
    RUN mkdir -p /nonexistent
    RUN chown -R ${USER_NAME}:${GROUP_NAME} /nonexistent
    
    # Install Python dependencies required for PySpark with Apache Spark Connect
    RUN pip install --no-cache-dir grpcio grpcio-status pandas pyarrow zstandard
    
  • Network: Spark Connect communicates over gRPC. Expose the server's listener port so clients can reach it (for example through a LoadBalancer, a NodePort, or a port-forward on EKS).
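The Python stack prerequisite above can be checked on a client host before attempting a connection. The following is a minimal sketch that uses only the standard library; the helper names installed_versions and missing_packages are illustrative and not part of Spark or Privacera. It reports which packages are installed, not whether their versions are compatible with a given Spark release.

```python
from importlib import metadata

# Distribution names taken from the prerequisite list above.
REQUIRED_PACKAGES = ["grpcio", "grpcio-status", "pandas", "pyarrow", "zstandard"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def missing_packages(packages):
    """Return the names that are not installed in the current environment."""
    found = installed_versions(packages)
    return [name for name, version in found.items() if version is None]
```

Calling missing_packages(REQUIRED_PACKAGES) on the client returns an empty list when all five packages are present.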

Start Spark Connect Server

  1. Export the JWT token

    Bash
    export JWT_TOKEN="<jwt-token>"
    
  2. Set a writable log directory. By default Spark writes daemon logs under $SPARK_HOME/logs, but a non-root user such as spark may not have permission there, which causes startup errors. Using /tmp aligns with a common log4j2.properties layout used in custom images (see Setup and Troubleshooting):

    Bash
    export SPARK_LOG_DIR=/tmp
    
  3. Start the Connect Server:

    Bash
    ${SPARK_HOME}/sbin/start-connect-server.sh \
      --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}"
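Because the server is started with a JWT, it can help to confirm that the token has not already expired before exporting it. The sketch below decodes the payload segment of a standard three-part JWT and reads its exp claim. It does not verify the signature, and it assumes the token carries exp as a Unix timestamp; the function names are illustrative only.

```python
import base64
import json
import time


def jwt_claims(token):
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload = parts[1]
    # base64url segments omit padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def seconds_until_expiry(token, now=None):
    """Return seconds remaining before the 'exp' claim, or None if it is absent."""
    exp = jwt_claims(token).get("exp")
    if exp is None:
        return None
    current = time.time() if now is None else now
    return exp - current
```

A negative result from seconds_until_expiry(os.environ["JWT_TOKEN"]) would indicate that the token has expired and a fresh one should be obtained before starting the server.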
    

Verify Connect listener (server)

  • Confirm the process is listening on the expected gRPC port (the default is 15002). The Spark Connect server (driver) logs typically include a line such as:

    Text Only
    INFO org.apache.spark.sql.connect.service.SparkConnectServer :187 [main] Spark Connect server started at: [::]:15002
    
  • From a client, run a minimal remote session (see Run use cases) and run a simple check (for example reading a file from S3) to confirm end-to-end connectivity.
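Before running a full remote session, a plain TCP check from the client can rule out network problems. The sketch below extracts the port from a startup log line of the form quoted above and tests whether that port accepts connections. It only confirms that something is listening; it does not speak gRPC or validate the JWT. Both function names are illustrative.

```python
import re
import socket

# Matches the address at the end of the startup line, for example "[::]:15002".
_STARTED_RE = re.compile(r"Spark Connect server started at: .*:(\d+)\s*$")


def parse_listen_port(log_line):
    """Extract the port from a 'Spark Connect server started at' line, else None."""
    match = _STARTED_RE.search(log_line)
    return int(match.group(1)) if match else None


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, is_port_open("<ip-or-dns>", 15002) returning False from the client usually points at a Service, security group, or port-forward issue rather than at Spark itself.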

Check Connect Server .out logs

Spark’s sbin scripts write stdout/stderr capture files under the configured log directory, here ${SPARK_LOG_DIR} (for example /tmp). When the Connect server starts, the script prints a line similar to the following. On the host or pod, ls -lt "${SPARK_LOG_DIR}" lists the files newest first, so the capture file from the latest server start can be identified.

Text Only
starting org.apache.spark.sql.connect.service.SparkConnectServer, logging to /tmp/*-spark-org.apache.spark.sql.connect.service.SparkConnectServer-1-*.out*

To follow logs in real time:

Bash
tail -F /tmp/*-spark-org.apache.spark.sql.connect.service.SparkConnectServer-1-*.out*
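Where a script needs to locate the capture file, for example to attach it to a support ticket, the same glob pattern can be applied programmatically. This is a minimal sketch assuming the file naming shown above; the helper names are illustrative.

```python
import glob
import os


def newest_connect_out_log(log_dir):
    """Return the most recently modified Connect server .out file, or None."""
    pattern = os.path.join(
        log_dir,
        "*-spark-org.apache.spark.sql.connect.service.SparkConnectServer-1-*.out*",
    )
    candidates = glob.glob(pattern)
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)


def tail_lines(path, count=20):
    """Return the last `count` lines of a text file (reads the whole file)."""
    with open(path, "r", errors="replace") as handle:
        return handle.readlines()[-count:]
```

With SPARK_LOG_DIR set to /tmp, newest_connect_out_log("/tmp") returns the path of the latest capture file, and tail_lines on that path returns its most recent lines.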

Configure log level

Logging is driven by Spark’s log4j2.properties (see the file packaged into the image in Setup). To increase Privacera plugin detail for troubleshooting, follow Troubleshooting for Access Management for Apache Spark OLAC (for example logger.privacera.level = debug where the template defines it), rebuild or roll out the image, then inspect privacera.log or copied pod logs as described there.

Run use cases

Option 1: Remote client

In this pattern, the Spark Connect server (Spark driver) runs on Amazon EKS (or similar), and the client runs on a different machine or network. The client uses a remote URI: sc://<ip-or-dns>:<port>.

To find an address for sc://<ip>:15002, use a hostname or IP that the client can actually reach. A pod IP from inside the cluster is often not reachable from a laptop; prefer a LoadBalancer hostname, NodePort on a node’s routable address, or another published Service endpoint. For inspection only (for example from a jump host on the same network as pods):

Bash
kubectl get pods -n <spark-k8s-namespace> -o wide

On the client host:

Bash
export JWT_TOKEN="<jwt-token>"
cd "${SPARK_HOME}/bin"

spark-shell

Bash
./spark-shell \
  --remote "sc://<ip>:15002" \
  --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}"

When the client attaches successfully, the console typically shows something like:

Text Only
Spark connect server version 4.x.x
SparkSession available as 'spark'

After the scala> prompt appears, the following example runs a small S3 read through the remote SparkSession so storage access is exercised under Privacera OLAC (replace placeholders with a real s3a:// URI and format such as csv or parquet):

Scala
val readFilePath = "s3a://<path>/sample.txt"
val df = spark.read.format("<format — csv, parquet, delta, …>").option("header", "true").option("inferSchema", "true").load(readFilePath)
df.show(2)

pyspark

Bash
./pyspark \
  --remote "sc://<ip>:15002" \
  --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}"

When the client attaches successfully, the console typically shows something like:

Text Only
Client connected to the Spark Connect server at <ip>:15002
SparkSession available as 'spark'

The same S3 read check in Python (after the >>> prompt):

Python
readFilePath = "s3a://<path>/sample.txt"
df = spark.read.format("<format — csv, parquet, delta, …>").option("header", "true").option("inferSchema", "true").load(readFilePath)
df.show(2)
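The same connection can also be opened from a standalone Python script instead of the interactive pyspark shell. The sketch below builds the sc:// URI and passes the JWT through SparkSession.builder. It assumes that setting spark.hadoop.privacera.jwt.token.str through builder.config() is treated the same way as --conf on the command line, which is not confirmed here; the helper names connect_uri and open_remote_session are illustrative.

```python
def connect_uri(host, port=15002):
    """Build the sc://<host>:<port> remote URI used by the clients above."""
    if not host:
        raise ValueError("host must be a non-empty hostname or IP")
    return f"sc://{host}:{port}"


def open_remote_session(host, jwt_token, port=15002):
    """Open a Spark Connect session and pass the JWT as a session configuration.

    Assumption: builder.config() with this key behaves like --conf on the
    pyspark command line. Verify this in your environment before relying on it.
    """
    # Imported lazily so the URI helper works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .remote(connect_uri(host, port))
        .config("spark.hadoop.privacera.jwt.token.str", jwt_token)
        .getOrCreate()
    )
```

A script would call open_remote_session("<ip>", os.environ["JWT_TOKEN"]) and then run the same read and df.show(2) check on the returned session.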

Option 2: Localhost

In this pattern, the client reaches the Spark Connect server at localhost (typical after port-forwarding the Connect port to a local workstation, or when both processes run on the same VM).

Example with default local port 15002 after port-forward or local bind:

Bash
export JWT_TOKEN="<jwt-token>"
cd "${SPARK_HOME}/bin"

spark-shell

Bash
./spark-shell \
  --remote "sc://localhost:15002" \
  --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}"

When the client attaches successfully, the console typically shows something like:

Text Only
Spark connect server version 4.x.x
SparkSession available as 'spark'

After the scala> prompt appears, the following example runs a small S3 read through the remote SparkSession so storage access is exercised under Privacera OLAC (replace placeholders with a real s3a:// URI and format such as csv or parquet):

Scala
val readFilePath = "s3a://<path>/sample.txt"
val df = spark.read.format("<format — csv, parquet, delta, …>").option("header", "true").option("inferSchema", "true").load(readFilePath)
df.show(2)

pyspark

Bash
./pyspark \
  --remote "sc://localhost:15002" \
  --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}"

When the client attaches successfully, the console typically shows something like:

Text Only
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'

The same S3 read check in Python (after the >>> prompt):

Python
readFilePath = "s3a://<path>/sample.txt"
df = spark.read.format("<format — csv, parquet, delta, …>").option("header", "true").option("inferSchema", "true").load(readFilePath)
df.show(2)

Verify Audit Logs

The following applies after completing Option 1 or Option 2 under Run use cases.

  • Privacera Portal: Navigate to Access Management > Audits > ACCESS tab.

  • Verify audit entry: In the audit entry, verify fields such as Resource and User. The Client IP field shows the IP address of the instance where the Spark Connect server is running.

Stop Spark Connect Server

On the same host or pod where the server was started:

Bash
${SPARK_HOME}/sbin/stop-connect-server.sh

If the server is managed by Kubernetes or a supervisor, stop it through the same mechanism so the JVM shuts down cleanly.