Privacera Spark Plugin OLAC + Spark Operator

This guide walks you through deploying Spark OLAC (Object Level Access Control) workloads on Kubernetes using the Spark Operator, with the Privacera plugin integrated into the Spark image.

Prerequisites

Install Spark Operator using Helm

helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --create-namespace

For detailed installation instructions, refer to the Spark Operator documentation.
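
To confirm the operator is running before you continue, you can check its pods in the spark-operator namespace

kubectl get pods -n spark-operator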

Build Docker Image with Spark Plugin

Follow the Privacera documentation to build and push the Docker image with the Spark plugin.

Note

Only follow the steps up to "Push the Docker Image to the Remote Hub". The remaining deployment steps are covered in this guide.

Create Configuration Files for Spark Operator

Navigate to the config folder and update the privacera_spark.properties file as follows. The truststore path is changed to /privacera-secrets because that is where the Kubernetes secret containing global-truststore.p12 is mounted in the driver and executor pods.

cd ~/privacera-oss-plugin/config 
vi privacera_spark.properties

# Update truststore path  
#privacera.signer.truststore=/opt/privacera/global-truststore.p12 
privacera.signer.truststore=/privacera-secrets/global-truststore.p12

Create Spark Operator Directory

cd ~/privacera-oss-plugin
mkdir -p spark-operator

Create JWT Token File

Create the JWT token file used for authentication and add your JWT token to it. The deployment script later packages this file into the privacera-jwt-secret, which is mounted at /privacera-jwt/token.jwt in the driver and executor pods.

cd ~/privacera-oss-plugin/spark-operator
vi token.jwt
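
For example, you can write the token from the shell (the value shown is a placeholder; paste your actual token)

echo "<YOUR_JWT_TOKEN>" > token.jwt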

Create Python Script

Create a test Python script to verify the setup.

cd ~/privacera-oss-plugin/spark-operator
vi plugin-test.py
plugin-test.py
from pyspark.sql import SparkSession
import time

spark = SparkSession\
    .builder\
    .appName("PrivaceraCsvToParquet")\
    .getOrCreate()

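# Enable JWT-based OAuth for the Privacera plugin; the token file is mounted
# into the pod at /privacera-jwt/token.jwt from the privacera-jwt-secret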
spark.conf.set("spark.hadoop.privacera.jwt.oauth.enable", "true")
spark.conf.set("spark.hadoop.privacera.jwt.token", "/privacera-jwt/token.jwt")

print(f"spark.hadoop.privacera.jwt.oauth.enable is {spark.conf.get('spark.hadoop.privacera.jwt.oauth.enable')}")
print(f"spark.hadoop.privacera.jwt.token is {spark.conf.get('spark.hadoop.privacera.jwt.token')}")

with open("/privacera-secrets/privacera_spark.properties") as p:
    pconf = p.read().strip()
    print(f"event log conf is {pconf}")

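# Read a sample CSV from S3 and write it back as Parquet; replace
# MY-BUCKET/SAMPLE.csv with a bucket and object your Privacera policies allow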
df = spark.read.csv("s3a://MY-BUCKET/SAMPLE.csv")
df.show(2)
df.count()

df.write.parquet("s3a://MY-BUCKET/output", mode="overwrite")
df.show(2)
df.count()

spark.stop()

# Keep pod running for 5 minutes (optional)
time.sleep(5*60)

Create SparkApplication YAML

Create the SparkApplication manifest with Privacera-specific configurations.

cd ~/privacera-oss-plugin/spark-operator
vi spark-plugin-app.yml
spark-plugin-app.yml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: <SPARK_APPLICATION_NAME>
  namespace: default
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "<SPARK_PLUGIN_IMAGE>"
  imagePullPolicy: Always
  mainApplicationFile: local:///privacera-conf/plugin-test.py
  sparkVersion: "<SPARK_VERSION>"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  sparkConf:
    spark.driver.extraJavaOptions: >-
      -javaagent:/opt/spark/jars/privacera-agent.jar
      -Dlog4j.configurationFile=file:///privacera-conf/log4j2.properties
    spark.executor.extraJavaOptions: >-
      -javaagent:/opt/spark/jars/privacera-agent.jar
      -Dlog4j.configurationFile=file:///privacera-conf/log4j2.properties
    spark.sql.hive.metastore.sharedPrefixes: com.privacera,com.amazonaws
    spark.driver.extraClassPath: /privacera-secrets
    spark.executor.extraClassPath: /privacera-secrets
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: <SPARK_VERSION>
    serviceAccount: spark-operator-spark
    secrets:
      - name: privacera-spark-secret
        path: /privacera-secrets
        secretType: Generic
      - name: privacera-jwt-secret
        path: /privacera-jwt
        secretType: Generic
    configMaps:
      - name: privacera-spark-configmap
        path: /privacera-conf
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: <SPARK_VERSION>
    secrets:
      - name: privacera-spark-secret
        path: /privacera-secrets
        secretType: Generic
      - name: privacera-jwt-secret
        path: /privacera-jwt
        secretType: Generic
    configMaps:
      - name: privacera-spark-configmap
        path: /privacera-conf

Note

Replace the placeholders in the YAML file before deployment: SPARK_APPLICATION_NAME, SPARK_PLUGIN_IMAGE, and SPARK_VERSION.
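
For example, the placeholders can be substituted with sed (the application name, image, and Spark version below are illustrative only)

cd ~/privacera-oss-plugin/spark-operator
sed -i 's|<SPARK_APPLICATION_NAME>|privacera-plugin-test|g; s|<SPARK_PLUGIN_IMAGE>|my-registry.example.com/privacera-spark-plugin:latest|g; s|<SPARK_VERSION>|3.5.1|g' spark-plugin-app.yml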

Warning

  • Do not mount anything under /opt/spark; volumes mounted there can hide /opt/spark/jars due to timing issues at pod startup
  • The /privacera-secrets path must be on the Spark classpath (handled above by spark.driver.extraClassPath and spark.executor.extraClassPath)

Create Deployment Script

Create a script to automate the creation of the Kubernetes secrets and ConfigMap and the deployment of the SparkApplication.

cd ~/privacera-oss-plugin/spark-operator
vi deploy_spark_operator.sh
deploy_spark_operator.sh
#!/bin/bash
set -x

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
ENV_FILE=${SCRIPT_DIR}/../penv.sh

if [ -f ${ENV_FILE} ]; then
    echo "Loading env file ${ENV_FILE}"
    source ${ENV_FILE}
fi
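
# PRIVACERA_SECRET_NAME and PRIVACERA_CONFIGMAP_NAME are expected to come from penv.sh
# and must match the secret and configMap names referenced in spark-plugin-app.yml
# (privacera-spark-secret and privacera-spark-configmap in this guide)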

export JWT_SECRET_NAME="privacera-jwt-secret"

# Delete and recreate Kubernetes secret
kubectl delete secret ${PRIVACERA_SECRET_NAME}
kubectl create secret generic ${PRIVACERA_SECRET_NAME} \
    --from-file=${SCRIPT_DIR}/../config/privacera_spark.properties \
    --from-file=${SCRIPT_DIR}/../config/global-truststore.p12

# Delete and recreate Kubernetes configmap
kubectl delete configmap ${PRIVACERA_CONFIGMAP_NAME}
kubectl create configmap ${PRIVACERA_CONFIGMAP_NAME} \
    --from-file=${SCRIPT_DIR}/../config/log4j2.properties \
    --from-file=${SCRIPT_DIR}/plugin-test.py

# Delete and recreate Kubernetes secret for JWT
kubectl delete secret ${JWT_SECRET_NAME}
kubectl create secret generic ${JWT_SECRET_NAME} \
    --from-file=${SCRIPT_DIR}/token.jwt

# Delete and apply the SparkApplication
kubectl delete -f ${SCRIPT_DIR}/spark-plugin-app.yml
kubectl apply -f ${SCRIPT_DIR}/spark-plugin-app.yml

Deploy the Application

Make the deployment script executable and run it.

cd ~/privacera-oss-plugin/spark-operator
chmod +x deploy_spark_operator.sh
./deploy_spark_operator.sh

Verify Deployment

Check the default namespace for the pods created by the Spark Operator and for the SparkApplication resource.

kubectl get pods -n default
kubectl get sparkapplications -n default
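
The driver pod is typically named after the application (for example, <SPARK_APPLICATION_NAME>-driver); if in doubt, you can also filter by the spark-role label

kubectl get pods -n default -l spark-role=driver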

Monitor the driver pod logs.

kubectl logs -f <DRIVER_POD_NAME>
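
To see the application's overall state and recent events as reported by the operator, you can also describe the SparkApplication resource

kubectl describe sparkapplication <SPARK_APPLICATION_NAME> -n default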