Privacera Spark Plugin OLAC + Spark Operator
This guide walks you through deploying Spark OLAC (object-level access control) workloads on Kubernetes using the Spark Operator and the Privacera plugin.
Prerequisites
Install Spark Operator using Helm
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update
helm install spark-operator spark-operator/spark-operator \
--namespace spark-operator \
--create-namespace
For detailed installation instructions, refer to the Spark Operator documentation.
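Before moving on, it is worth confirming that the operator pod is up in the spark-operator namespace:
kubectl get pods -n spark-operator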
Build Docker Image with Spark Plugin
Follow the Privacera documentation to build and push the Docker image with the Spark plugin.
Note
Only follow the steps up to "Push the Docker Image to the Remote Hub". The remaining deployment steps will be covered in this guide.
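For orientation, the build-and-push step generally takes this shape; the registry, image name, tag, and build context below are placeholders, and the authoritative Dockerfile location and build arguments come from the Privacera documentation referenced above:
cd ~/privacera-oss-plugin
docker build -t <REGISTRY>/privacera-spark-plugin:<TAG> .
docker push <REGISTRY>/privacera-spark-plugin:<TAG>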
Create Configuration Files for Spark Operator
Navigate to the config folder and update the privacera_spark.properties file as follows:
cd ~/privacera-oss-plugin/config
vi privacera_spark.properties
# Update truststore path
#privacera.signer.truststore=/opt/privacera/global-truststore.p12
privacera.signer.truststore=/privacera-secrets/global-truststore.p12
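If you want to sanity-check the truststore before it is packaged into a Kubernetes secret, keytool can list its entries (it will prompt for the truststore password):
keytool -list -keystore ~/privacera-oss-plugin/config/global-truststore.p12 -storetype PKCS12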
Create Spark Operator Directory
cd ~/privacera-oss-plugin
mkdir -p spark-operator
Create JWT Token File
Create the JWT token file used for authentication and paste your JWT token into it:
cd ~/privacera-oss-plugin/spark-operator
vi token.jwt
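Optionally, you can decode the token's payload to confirm it contains the claims you expect. This one-liner assumes a standard three-part JWT; it may complain about missing base64 padding, which is harmless here:
cut -d '.' -f2 token.jwt | tr '_-' '/+' | base64 -d 2>/dev/null; echo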
Create Python Script
Create a test Python script to verify the setup:
cd ~/privacera-oss-plugin/spark-operator
vi plugin-test.py
| plugin-test.py |
|---|
| from pyspark.sql import SparkSession
import time

spark = SparkSession\
    .builder\
    .appName("PrivaceraCsvToParquet")\
    .getOrCreate()

# Enable JWT-based authentication for the Privacera plugin
spark.conf.set("spark.hadoop.privacera.jwt.oauth.enable", "true")
spark.conf.set("spark.hadoop.privacera.jwt.token", "/privacera-jwt/token.jwt")
print(f"spark.hadoop.privacera.jwt.oauth.enable is {spark.conf.get('spark.hadoop.privacera.jwt.oauth.enable')}")
print(f"spark.hadoop.privacera.jwt.token is {spark.conf.get('spark.hadoop.privacera.jwt.token')}")

# Confirm the mounted Privacera properties are visible to the driver
with open("/privacera-secrets/privacera_spark.properties") as p:
    pconf = p.read().strip()
print(f"event log conf is {pconf}")

# Read a sample CSV and write it back as Parquet.
# Replace MY-BUCKET and SAMPLE.csv with your own bucket and file.
df = spark.read.csv("s3a://MY-BUCKET/SAMPLE.csv")
df.show(2)
df.count()
df.write.parquet("s3a://MY-BUCKET/output", mode="overwrite")
df.show(2)
df.count()

spark.stop()
# Keep the pod running for 5 minutes (optional, useful for debugging)
time.sleep(5*60)
|
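The S3 paths in the script are placeholders. Before running it, stage a sample CSV in a bucket that your Privacera policies permit, for example with the AWS CLI (bucket and file names here are illustrative):
aws s3 cp SAMPLE.csv s3://MY-BUCKET/SAMPLE.csv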
Create SparkApplication YAML
Create the SparkApplication manifest with the Privacera-specific configuration:
cd ~/privacera-oss-plugin/spark-operator
vi spark-plugin-app.yml
| spark-plugin-app.yml |
|---|
| apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: <SPARK_APPLICATION_NAME>
namespace: default
spec:
type: Python
pythonVersion: "3"
mode: cluster
image: "<SPARK_PLUGIN_IMAGE>"
imagePullPolicy: Always
mainApplicationFile: local:///privacera-conf/plugin-test.py
sparkVersion: "<SPARK_VERSION>"
restartPolicy:
type: OnFailure
onFailureRetries: 3
onFailureRetryInterval: 10
onSubmissionFailureRetries: 5
onSubmissionFailureRetryInterval: 20
sparkConf:
spark.driver.extraJavaOptions: >-
-javaagent:/opt/spark/jars/privacera-agent.jar
-Dlog4j.configurationFile=file:///privacera-conf/log4j2.properties
spark.executor.extraJavaOptions: >-
-javaagent:/opt/spark/jars/privacera-agent.jar
-Dlog4j.configurationFile=file:///privacera-conf/log4j2.properties
spark.sql.hive.metastore.sharedPrefixes: com.privacera,com.amazonaws
spark.driver.extraClassPath: /privacera-secrets
spark.executor.extraClassPath: /privacera-secrets
driver:
cores: 1
coreLimit: "1200m"
memory: "512m"
labels:
version: <SPARK_VERSION>
serviceAccount: spark-operator-spark
secrets:
- name: privacera-spark-secret
path: /privacera-secrets
secretType: Generic
- name: privacera-jwt-secret
path: /privacera-jwt
secretType: Generic
configMaps:
- name: privacera-spark-configmap
path: /privacera-conf
executor:
cores: 1
instances: 1
memory: "512m"
labels:
version: <SPARK_VERSION>
secrets:
- name: privacera-spark-secret
path: /privacera-secrets
secretType: Generic
- name: privacera-jwt-secret
path: /privacera-jwt
secretType: Generic
configMaps:
- name: privacera-spark-configmap
path: /privacera-conf
|
Note
Replace the placeholders in the YAML file before deployment: SPARK_APPLICATION_NAME, SPARK_PLUGIN_IMAGE, and SPARK_VERSION.
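One way to fill in the placeholders is with sed; the application name, image, and Spark version below are examples only (GNU sed shown; on macOS use sed -i ''):
cd ~/privacera-oss-plugin/spark-operator
sed -i \
  -e 's|<SPARK_APPLICATION_NAME>|spark-plugin-test|g' \
  -e 's|<SPARK_PLUGIN_IMAGE>|my-registry/privacera-spark-plugin:latest|g' \
  -e 's|<SPARK_VERSION>|3.5.1|g' \
  spark-plugin-app.yml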
Warning
- Do not mount anything into /opt/spark; mounts there can cause timing issues that hide /opt/spark/jars.
- The /privacera-secrets path must be on the Spark classpath (the extraClassPath entries above take care of this).
Create Deployment Script
Create a script that automates creating the Kubernetes Secrets and ConfigMap and then deploys the SparkApplication:
cd ~/privacera-oss-plugin/spark-operator
vi deploy_spark_operator.sh
| deploy_spark_operator.sh |
|---|
| #!/bin/bash
set -x

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

# Load environment variables (PRIVACERA_SECRET_NAME, PRIVACERA_CONFIGMAP_NAME)
# from penv.sh in the parent directory, if present
ENV_FILE="${SCRIPT_DIR}/../penv.sh"
if [ -f "${ENV_FILE}" ]; then
  echo "Loading env file ${ENV_FILE}"
  source "${ENV_FILE}"
fi

export JWT_SECRET_NAME="privacera-jwt-secret"

# Delete and recreate the secret holding the plugin properties and truststore
kubectl delete secret "${PRIVACERA_SECRET_NAME}" --ignore-not-found
kubectl create secret generic "${PRIVACERA_SECRET_NAME}" \
  --from-file="${SCRIPT_DIR}/../config/privacera_spark.properties" \
  --from-file="${SCRIPT_DIR}/../config/global-truststore.p12"

# Delete and recreate the configmap with the log4j config and the test script
kubectl delete configmap "${PRIVACERA_CONFIGMAP_NAME}" --ignore-not-found
kubectl create configmap "${PRIVACERA_CONFIGMAP_NAME}" \
  --from-file="${SCRIPT_DIR}/../config/log4j2.properties" \
  --from-file="${SCRIPT_DIR}/plugin-test.py"

# Delete and recreate the secret for the JWT token
kubectl delete secret "${JWT_SECRET_NAME}" --ignore-not-found
kubectl create secret generic "${JWT_SECRET_NAME}" \
  --from-file="${SCRIPT_DIR}/token.jwt"

# Delete any previous SparkApplication and apply the new one
kubectl delete -f "${SCRIPT_DIR}/spark-plugin-app.yml" --ignore-not-found
kubectl apply -f "${SCRIPT_DIR}/spark-plugin-app.yml"
|
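The script expects PRIVACERA_SECRET_NAME and PRIVACERA_CONFIGMAP_NAME to be exported by penv.sh in the parent directory. If your checkout does not already define them, a minimal penv.sh matching the names referenced in spark-plugin-app.yml would look like this (keep any entries your existing file already has):
# ~/privacera-oss-plugin/penv.sh
export PRIVACERA_SECRET_NAME="privacera-spark-secret"
export PRIVACERA_CONFIGMAP_NAME="privacera-spark-configmap"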
Deploy the Application
Make the deployment script executable and run it:
cd ~/privacera-oss-plugin/spark-operator
chmod +x deploy_spark_operator.sh
./deploy_spark_operator.sh
Verify Deployment
Check the default namespace for the driver and executor pods and the SparkApplication resource:
kubectl get pods -n default
kubectl get sparkapplications -n default
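A SparkApplication typically moves through states such as SUBMITTED, RUNNING, and COMPLETED; you can watch the transitions as they happen:
kubectl get sparkapplications -n default -w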
Monitor the driver pod logs; with the Spark Operator the driver pod is typically named <SPARK_APPLICATION_NAME>-driver:
kubectl logs -f <DRIVER_POD_NAME>
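To confirm that the Privacera agent actually loaded, inspect the application status and search the driver logs for plugin activity; the grep pattern below is a heuristic, not an exact log line:
kubectl describe sparkapplication <SPARK_APPLICATION_NAME> -n default
kubectl logs <SPARK_APPLICATION_NAME>-driver -n default | grep -i privacera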