Privacera Spark Plugin OLAC + Spark Operator

This guide walks you through deploying Spark OLAC (Object Level Access Control) workloads on Kubernetes using the Spark Operator, with the Privacera plugin integrated into the Spark image.

Prerequisites

Install Spark Operator using Helm

helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --create-namespace

For detailed installation instructions, refer to the Spark Operator documentation.
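
To confirm the operator is running before you continue, you can check its pods in the spark-operator namespace

kubectl get pods -n spark-operator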

Build Docker Image with Spark Plugin

Follow the Privacera documentation to build and push the Docker image with the Spark plugin.

Note

Only follow the steps up to "Push the Docker Image to the Remote Hub". The remaining deployment steps are covered in this guide.

Create Configuration Files for Spark Operator

Navigate to the config folder and update the privacera_spark.properties file as follows. The truststore path is changed to /privacera-secrets because that is where the Kubernetes secret containing global-truststore.p12 is mounted in the driver and executor pods.

cd ~/privacera-oss-plugin/config 
vi privacera_spark.properties

# Update truststore path  
#privacera.signer.truststore=/opt/privacera/global-truststore.p12 
privacera.signer.truststore=/privacera-secrets/global-truststore.p12

Create Spark Operator Directory

cd ~/privacera-oss-plugin
mkdir -p spark-operator

Create JWT Token File

Create the JWT token file used for authentication and add your JWT token to it. The deployment script later packages this file into the privacera-jwt-secret, which is mounted at /privacera-jwt/token.jwt in the driver and executor pods.

cd ~/privacera-oss-plugin/spark-operator
vi token.jwt
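
For example, you can write the token from the shell (the value shown is a placeholder; paste your actual token)

echo "<YOUR_JWT_TOKEN>" > token.jwt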

Create Python Script

Create a test Python script to verify the setup.

cd ~/privacera-oss-plugin/spark-operator
vi plugin-test.py
plugin-test.py
from pyspark.sql import SparkSession
import time

spark = SparkSession\
    .builder\
    .appName("PrivaceraCsvToParquet")\
    .getOrCreate()

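# Enable JWT-based OAuth for the Privacera plugin; the token file is mounted
# into the pod at /privacera-jwt/token.jwt from the privacera-jwt-secret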
spark.conf.set("spark.hadoop.privacera.jwt.oauth.enable", "true")
spark.conf.set("spark.hadoop.privacera.jwt.token", "/privacera-jwt/token.jwt")

print(f"spark.hadoop.privacera.jwt.oauth.enable is {spark.conf.get('spark.hadoop.privacera.jwt.oauth.enable')}")
print(f"spark.hadoop.privacera.jwt.token is {spark.conf.get('spark.hadoop.privacera.jwt.token')}")

with open("/privacera-secrets/privacera_spark.properties") as p:
    pconf = p.read().strip()
    print(f"event log conf is {pconf}")

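# Read a sample CSV from S3 and write it back as Parquet; replace
# MY-BUCKET/SAMPLE.csv with a bucket and object your Privacera policies allow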
df = spark.read.csv("s3a://MY-BUCKET/SAMPLE.csv")
df.show(2)
df.count()

df.write.parquet("s3a://MY-BUCKET/output", mode="overwrite")
df.show(2)
df.count()

spark.stop()

# Keep pod running for 5 minutes (optional)
time.sleep(5*60)

Create SparkApplication YAML

Create the SparkApplication manifest with Privacera-specific configurations.

cd ~/privacera-oss-plugin/spark-operator
vi spark-plugin-app.yml
spark-plugin-app.yml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: <SPARK_APPLICATION_NAME>
  namespace: default
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "<SPARK_PLUGIN_IMAGE>"
  imagePullPolicy: Always
  mainApplicationFile: local:///privacera-conf/plugin-test.py
  sparkVersion: "<SPARK_VERSION>"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  sparkConf:
    spark.driver.extraJavaOptions: >-
      -javaagent:/opt/spark/jars/privacera-agent.jar
      -Dlog4j.configurationFile=file:///privacera-conf/log4j2.properties
    spark.executor.extraJavaOptions: >-
      -javaagent:/opt/spark/jars/privacera-agent.jar
      -Dlog4j.configurationFile=file:///privacera-conf/log4j2.properties
    spark.sql.hive.metastore.sharedPrefixes: com.privacera,com.amazonaws
    spark.driver.extraClassPath: /privacera-secrets
    spark.executor.extraClassPath: /privacera-secrets
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: <SPARK_VERSION>
    serviceAccount: spark-operator-spark
    secrets:
      - name: privacera-spark-secret
        path: /privacera-secrets
        secretType: Generic
      - name: privacera-jwt-secret
        path: /privacera-jwt
        secretType: Generic
    configMaps:
      - name: privacera-spark-configmap
        path: /privacera-conf
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: <SPARK_VERSION>
    secrets:
      - name: privacera-spark-secret
        path: /privacera-secrets
        secretType: Generic
      - name: privacera-jwt-secret
        path: /privacera-jwt
        secretType: Generic
    configMaps:
      - name: privacera-spark-configmap
        path: /privacera-conf

Note

Replace the placeholders in the YAML file before deployment: SPARK_APPLICATION_NAME, SPARK_PLUGIN_IMAGE, and SPARK_VERSION.
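
For example, the placeholders can be substituted with sed (the application name, image, and Spark version below are illustrative only)

cd ~/privacera-oss-plugin/spark-operator
sed -i 's|<SPARK_APPLICATION_NAME>|privacera-plugin-test|g; s|<SPARK_PLUGIN_IMAGE>|my-registry.example.com/privacera-spark-plugin:latest|g; s|<SPARK_VERSION>|3.5.1|g' spark-plugin-app.yml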

Warning

  • Do not mount anything under /opt/spark; volumes mounted there can hide /opt/spark/jars due to timing issues at pod startup
  • The /privacera-secrets path must be on the Spark classpath (handled above by spark.driver.extraClassPath and spark.executor.extraClassPath)

Create Deployment Script

Create a script to automate the creation of the Kubernetes secrets and ConfigMap and the deployment of the SparkApplication.

cd ~/privacera-oss-plugin/spark-operator
vi deploy_spark_operator.sh
deploy_spark_operator.sh
#!/bin/bash
set -x

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
ENV_FILE=${SCRIPT_DIR}/../penv.sh

if [ -f ${ENV_FILE} ]; then
    echo "Loading env file ${ENV_FILE}"
    source ${ENV_FILE}
fi
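
# PRIVACERA_SECRET_NAME and PRIVACERA_CONFIGMAP_NAME are expected to come from penv.sh
# and must match the secret and configMap names referenced in spark-plugin-app.yml
# (privacera-spark-secret and privacera-spark-configmap in this guide)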

export JWT_SECRET_NAME="privacera-jwt-secret"

# Delete and recreate Kubernetes secret
kubectl delete secret ${PRIVACERA_SECRET_NAME}
kubectl create secret generic ${PRIVACERA_SECRET_NAME} \
    --from-file=${SCRIPT_DIR}/../config/privacera_spark.properties \
    --from-file=${SCRIPT_DIR}/../config/global-truststore.p12

# Delete and recreate Kubernetes configmap
kubectl delete configmap ${PRIVACERA_CONFIGMAP_NAME}
kubectl create configmap ${PRIVACERA_CONFIGMAP_NAME} \
    --from-file=${SCRIPT_DIR}/../config/log4j2.properties \
    --from-file=${SCRIPT_DIR}/plugin-test.py

# Delete and recreate Kubernetes secret for JWT
kubectl delete secret ${JWT_SECRET_NAME}
kubectl create secret generic ${JWT_SECRET_NAME} \
    --from-file=${SCRIPT_DIR}/token.jwt

# Delete and apply the SparkApplication
kubectl delete -f ${SCRIPT_DIR}/spark-plugin-app.yml
kubectl apply -f ${SCRIPT_DIR}/spark-plugin-app.yml

Deploy the Application

Make the deployment script executable and run it.

cd ~/privacera-oss-plugin/spark-operator
chmod +x deploy_spark_operator.sh
./deploy_spark_operator.sh

Verify Deployment

Check the default namespace for the pods created by the Spark Operator and for the SparkApplication resource.

kubectl get pods -n default
kubectl get sparkapplications -n default
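
The driver pod is typically named after the application (for example, <SPARK_APPLICATION_NAME>-driver); if in doubt, you can also filter by the spark-role label

kubectl get pods -n default -l spark-role=driver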

Monitor the driver pod logs.

kubectl logs -f <DRIVER_POD_NAME>
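
To see the application's overall state and recent events as reported by the operator, you can also describe the SparkApplication resource

kubectl describe sparkapplication <SPARK_APPLICATION_NAME> -n default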