Apache Spark Connect with Privacera Spark Plugin OLAC
This guide describes how to use Apache Spark Connect for secure client–server connectivity to a remote Spark cluster with Privacera Spark Plugin OLAC integration.
Overview
Spark Connect provides a decoupled client–server architecture for remote connectivity to Spark clusters, using the DataFrame API and unresolved logical plans as the protocol. Because client and server are separated, Spark and its open ecosystem can be used from anywhere: embedded in modern data applications, IDEs, notebooks, and programming languages.
For more details, refer to Apache Spark Connect.
With Privacera OLAC, the plugin and privacera_spark.properties must be present where the Spark Connect server runs (the Spark driver side). Clients pass JWT identity using the same spark.hadoop.privacera.jwt.str configuration patterns as a normal Spark session. See JWT Token User Identity.
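As a rough illustration of the JWT pattern described above, the following Python sketch reads a token from an environment variable and turns it into the spark.hadoop.privacera.jwt.str setting, both as a dict and as --conf style arguments. The variable name JWT_TOKEN and the helper names are assumptions made for this example only; the authoritative settings are the ones in JWT Token User Identity.

```python
import os

JWT_CONF_KEY = "spark.hadoop.privacera.jwt.str"  # key named in this guide


def jwt_conf(env=None, var="JWT_TOKEN"):
    """Return the Spark conf entry carrying the JWT, read from an env variable.

    `var` is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    token = env.get(var, "").strip()
    if not token:
        raise ValueError(f"environment variable {var} is empty or not set")
    return {JWT_CONF_KEY: token}


def to_conf_args(conf):
    """Render a conf dict as `--conf key=value` command-line arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args
```

On a PySpark client, each entry of the returned dict could be passed through SparkSession.builder.config(key, value) before getOrCreate(); whether additional JWT-related settings are required depends on the OLAC setup.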
Prerequisites
- Apache Spark OLAC Setup — A fully configured, end-to-end Apache Spark OLAC environment is required, as the Spark Connect server operates within this setup and must incorporate all Privacera plugins along with the necessary trust configurations.
- Python stack for Spark Connect (thin client and image): PySpark against Spark Connect needs the same gRPC-oriented Python packages whether they are installed on a separate client host or baked into the Spark image. On a client-only machine, install compatible versions of:
    - grpcio
    - grpcio-status
    - pandas
    - pyarrow
    - zstandard

    Note
    Apache Spark-compatible versions of these libraries align with each pyspark-connect release on PyPI (pyspark-connect release history). Installing pyspark-connect directly bundles the required dependencies and usually removes the need to install them individually.

    Make sure the Spark Connect-related RUN steps are included in the Spark image Dockerfile. The sample Dockerfile in Setup (Building the Docker Image, expand the Dockerfile note under Create Dockerfile) already contains them.
- Network: Spark Connect uses gRPC. Expose the listener port so clients can reach it (for example through a LoadBalancer, a NodePort, or a port-forward on EKS).
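To check whether a client host or image already has the Python packages listed above, a small standard-library script along these lines may help. It only reports installed versions; it does not decide whether a version is compatible with a given Spark release, and the helper name is hypothetical.

```python
from importlib import metadata

# Distributions listed in the prerequisites above.
REQUIRED = ("grpcio", "grpcio-status", "pandas", "pyarrow", "zstandard")


def installed_versions(names=REQUIRED):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found
```

Printing the result of installed_versions() on the client shows at a glance which packages are missing (value None) before a Spark Connect session is attempted.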
Start Spark Connect Server
- Export the JWT token.
- Set a writable log directory. By default Spark writes daemon logs under $SPARK_HOME/logs, but a non-root user such as spark may not have permission there, which causes startup errors. Using /tmp aligns with a common log4j2.properties layout used in custom images (see Setup and Troubleshooting):
- Start the Connect server:
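The three steps above can be sketched in Python as a helper that prepares the environment and the command line without running anything. It assumes the standard $SPARK_HOME/sbin/start-connect-server.sh script and the SPARK_LOG_DIR variable read by Spark's daemon scripts; passing the token through spark.hadoop.privacera.jwt.str at server start, and the function name itself, are assumptions for illustration. A real deployment may need further --conf options.

```python
import os


def connect_server_command(spark_home, jwt_token, log_dir="/tmp"):
    """Return (env, argv) for launching the Spark Connect server.

    Nothing is executed here; the caller decides how to run argv.
    """
    env = dict(os.environ)
    env["SPARK_LOG_DIR"] = log_dir  # writable location for daemon logs
    argv = [
        os.path.join(spark_home, "sbin", "start-connect-server.sh"),
        # Assumption: the JWT is supplied with the conf key named in this guide.
        "--conf", f"spark.hadoop.privacera.jwt.str={jwt_token}",
    ]
    return env, argv
```

On the host or pod where the Privacera plugin is installed, the result could be handed to subprocess.run(argv, env=env, check=True).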
Verify Connect listener (server)
- Confirm the process is listening on the expected gRPC port (15002 unless configured otherwise). The Spark Connect server (driver) logs typically include a line such as:
- From a client, run a minimal remote session (see Run use cases) and run a simple check (for example reading a file from S3) to confirm end-to-end connectivity.
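Before starting a full remote session, a plain TCP probe can confirm that the listener port is reachable from the client. The sketch below uses only the Python standard library; the function name is hypothetical, and a successful TCP connection shows network reachability only, not that gRPC or the Privacera plugin is working.

```python
import socket


def port_is_open(host, port=15002, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, port_is_open("localhost") is a quick check after a port-forward, and port_is_open("<ip-or-dns>", 15002) checks a published endpoint.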
Check Connect Server .out logs
Spark’s sbin scripts write stdout/stderr capture files (.out) under the configured log directory, here ${SPARK_LOG_DIR} (for example /tmp). On the host or pod, ls -lt "${SPARK_LOG_DIR}" lists the files newest first, so the capture written after server start is easy to identify. After the Connect server starts, a line similar to the following appears:
| Text Only | |
|---|---|
To follow logs in real time:
| Bash | |
|---|---|
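The manual lookup described in this section (find the newest .out capture, then read its tail) can also be sketched in Python. This is an illustrative helper with hypothetical names, assuming the capture files end in .out and live directly under the log directory.

```python
from pathlib import Path


def newest_out_log(log_dir):
    """Return the most recently modified *.out file in log_dir, or None."""
    candidates = [p for p in Path(log_dir).glob("*.out") if p.is_file()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def tail(path, lines=20):
    """Return the last `lines` lines of a text file as a list of strings."""
    text = Path(path).read_text(errors="replace")
    return text.splitlines()[-lines:]
```

A typical use would be log = newest_out_log("/tmp") followed by printing tail(log) when log is not None.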
Configure log level
Logging is driven by Spark’s log4j2.properties (see the file packaged into the image in Setup). To increase Privacera plugin detail for troubleshooting, follow Troubleshooting for Access Management for Apache Spark OLAC (for example logger.privacera.level = debug where the template defines it), rebuild or roll out the image, then inspect privacera.log or copied pod logs as described there.
Run use cases
Option 1: remote client. In this pattern the Spark Connect server (Spark driver) runs on Amazon EKS (or similar), and the client runs on a different machine or network. The client uses a remote URI: sc://<ip-or-dns>:<port>.
To find an address for sc://<ip>:15002, use a hostname or IP that the client can actually reach. A pod IP from inside the cluster is often not reachable from a laptop; prefer a LoadBalancer hostname, NodePort on a node’s routable address, or another published Service endpoint. For inspection only (for example from a jump host on the same network as pods):
| Bash | |
|---|---|
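Once a reachable hostname or IP is known, the remote URI follows the sc://<ip-or-dns>:<port> shape given above. The following Python sketch builds and sanity-checks such a URI; the helper name and the validation rules are assumptions for illustration, not part of Spark or Privacera.

```python
def connect_uri(host, port=15002):
    """Build a Spark Connect remote URI of the form sc://<host>:<port>."""
    host = host.strip()
    if not host or "/" in host:
        raise ValueError("host must be a bare hostname or IP address")
    port = int(port)
    if not 0 < port < 65536:
        raise ValueError("port must be between 1 and 65535")
    return f"sc://{host}:{port}"
```

In PySpark, the resulting string is the value that SparkSession.builder.remote(...) expects.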
On the client host:
spark-shell
| Bash | |
|---|---|
When the client attaches successfully, the console typically shows something like:
After the scala> prompt appears, the following example runs a small S3 read through the remote SparkSession so storage access is exercised under Privacera OLAC (replace placeholders with a real s3a:// URI and format such as csv or parquet):
| Scala | |
|---|---|
pyspark
| Bash | |
|---|---|
When the client attaches successfully, the console typically shows something like:
| Text Only | |
|---|---|
The same S3 read check in Python (after the >>> prompt):
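A minimal sketch of such a check is shown here, assuming a session object named spark is already attached to the Connect server (as it is at the >>> prompt). The helper name read_sample is hypothetical, and the s3a:// URI and format are placeholders to replace with real values.

```python
def read_sample(spark, uri, fmt="csv", rows=5):
    """Read `uri` through an existing SparkSession and return the first rows.

    `spark` is any object exposing the SparkSession read API; the read is
    executed on the Connect server, so access is evaluated on the server side.
    """
    df = spark.read.format(fmt).load(uri)
    return df.limit(rows).collect()
```

At the prompt this could be called as read_sample(spark, "s3a://<bucket>/<path>", fmt="parquet"). Outside the shell, a session can first be created with SparkSession.builder.remote("sc://<ip-or-dns>:15002").getOrCreate() from pyspark.sql.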
Option 2: localhost. Here the client reaches the Spark Connect server at localhost (typical after port-forwarding the Connect port to a local workstation, or when both processes run on the same VM).
Example with default local port 15002 after port-forward or local bind:
spark-shell
| Bash | |
|---|---|
When the client attaches successfully, the console typically shows something like:
After the scala> prompt appears, the following example runs a small S3 read through the remote SparkSession so storage access is exercised under Privacera OLAC (replace placeholders with a real s3a:// URI and format such as csv or parquet):
| Scala | |
|---|---|
pyspark
| Bash | |
|---|---|
When the client attaches successfully, the console typically shows something like:
| Text Only | |
|---|---|
The same S3 read check in Python (after the >>> prompt):
Verify Audit Logs
The following applies after completing Option 1 or Option 2 under Run use cases.
- Privacera Portal: Navigate to Access Management → Audits → ACCESS tab.
- Verify audit entry: In the audit entry, verify fields such as Resource and User. Client IP shows the IP of the instance where the Spark Connect server is running.
Stop Spark Connect Server
On the same host or pod where the server was started:
| Bash | |
|---|---|
If the server is managed by Kubernetes or a supervisor, stop it through the same mechanism so the JVM shuts down cleanly.