Connector Guide - Access - Apache Spark OLAC
This guide describes how to use the Apache Spark OLAC connector with Privacera. Make sure the connector is installed and configured correctly before following the instructions below.

Run a Spark session
- Navigate to the ${SPARK_HOME}/bin folder and export the JWT token:

  ```bash
  cd <SPARK_HOME>/bin
  export JWT_TOKEN="<JWT_TOKEN>"
  ```
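
  As an optional sanity check (not part of the Privacera setup itself), you can confirm that the exported value has the standard three-part shape of a JWT before starting the session:

  ```bash
  # A well-formed JWT has three dot-separated, base64url-encoded parts.
  if [ "$(echo "${JWT_TOKEN}" | awk -F '.' '{print NF}')" -eq 3 ]; then
    echo "JWT_TOKEN looks well-formed"
  else
    echo "JWT_TOKEN does not look like a JWT" >&2
  fi
  ```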

- Start a Spark session (choose one of spark-shell, pyspark, or spark-sql).

  - To pass the JWT token directly as a command-line argument, include the following configuration when connecting to the cluster:

    ```bash
    ./<spark-shell | pyspark | spark-sql> \
      --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}"
    ```
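
    For example, a minimal end-to-end check (the bucket and path below are illustrative) is to launch spark-sql with the token and run a one-off query against a location your Privacera policies allow:

    ```bash
    # Hypothetical bucket/path; replace with a location the token is authorized for.
    ./spark-sql \
      --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}" \
      -e "SELECT * FROM csv.\`s3a://my-bucket/data/sample.csv\` LIMIT 10"
    ```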

  - To read the JWT token from a file instead, use the following configuration:

    ```bash
    ./<spark-shell | pyspark | spark-sql> \
      --conf "spark.hadoop.privacera.jwt.token=<path-to-jwt-token-file>"
    ```
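
    A minimal sketch of preparing such a file (the path is illustrative): write the token somewhere only the session owner can read, then point the property at it:

    ```bash
    # Illustrative path; keep the file readable only by the current user.
    TOKEN_FILE=/tmp/privacera_jwt.token
    echo "${JWT_TOKEN}" > "${TOKEN_FILE}"
    chmod 600 "${TOKEN_FILE}"
    ./spark-shell --conf "spark.hadoop.privacera.jwt.token=${TOKEN_FILE}"
    ```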

Run a Spark session with executors

- SSH into the driver pod in the namespace.

- Export the following variables:

  ```bash
  SPARK_NAME_SPACE=<SPARK_NAME_SPACE>
  SPARK_IMAGE=<SPARK_IMAGE>
  SVC_ACCOUNT=privacera-sa-spark-plugin
  K8S_MASTER=k8s://https://kubernetes.default.svc
  # The driver pod's IP is referenced by spark.driver.host below; set it here
  # if it is not already defined in the pod environment.
  SPARK_PLUGIN_POD_IP=<SPARK_PLUGIN_POD_IP>
  export JWT_TOKEN="<JWT_TOKEN>"
  ```
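
  Optionally, before launching, confirm that the namespace and service account referenced above exist (this assumes kubectl is available inside the driver pod):

  ```bash
  # Both commands should succeed before the session is started.
  kubectl get namespace "${SPARK_NAME_SPACE}"
  kubectl get serviceaccount "${SVC_ACCOUNT}" -n "${SPARK_NAME_SPACE}"
  ```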

- Run the following command to start a Spark session with executors:

  ```bash
  /opt/spark/bin/spark-shell \
    --master ${K8S_MASTER} \
    --deploy-mode client \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=${SVC_ACCOUNT} \
    --conf spark.kubernetes.namespace=${SPARK_NAME_SPACE} \
    --conf spark.kubernetes.driver.request.cores=0.001 \
    --conf spark.driver.memory=1g \
    --conf spark.kubernetes.executor.request.cores=0.001 \
    --conf spark.executor.memory=1g \
    --conf spark.kubernetes.container.image=${SPARK_IMAGE} \
    --conf spark.kubernetes.container.image.pullPolicy=Always \
    --conf spark.driver.host=${SPARK_PLUGIN_POD_IP} \
    --conf spark.driver.port=7077 \
    --conf spark.blockManager.port=7078 \
    --conf spark.kubernetes.executor.secrets.privacera-spark-secret=/privacera-secret \
    --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}"
  ```
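
  Once the shell is up, you can check from another terminal that the two executor pods were scheduled (this assumes kubectl access; spark-role is a label that Spark on Kubernetes applies to the pods it creates):

  ```bash
  # Expect two Running pods, matching spark.executor.instances=2 above.
  kubectl get pods -n "${SPARK_NAME_SPACE}" -l spark-role=executor
  ```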

Run a Spark session with MinIO

- To use custom S3 endpoints for specific MinIO buckets alongside regular S3 buckets, set these properties individually for each MinIO bucket and include them when starting the Spark session:

  ```bash
  ./<spark-shell | pyspark | spark-sql> \
    --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}" \
    --conf "spark.hadoop.fs.s3a.bucket.<bucket-1>.path.style.access=true" \
    --conf "spark.hadoop.fs.s3a.bucket.<bucket-1>.connection.ssl.enabled=true" \
    --conf "spark.hadoop.fs.s3a.bucket.<bucket-1>.endpoint=https://<MINIO_HOST>:<MINIO_PORT>" \
    --conf "spark.hadoop.fs.s3a.bucket.<bucket-2>.path.style.access=true" \
    --conf "spark.hadoop.fs.s3a.bucket.<bucket-2>.connection.ssl.enabled=true" \
    --conf "spark.hadoop.fs.s3a.bucket.<bucket-2>.endpoint=https://<MINIO_HOST>:<MINIO_PORT>"
  ```
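
  For example (the bucket name and endpoint below are illustrative), with per-bucket properties set, paths on the configured bucket resolve to MinIO while all other s3a buckets keep their default endpoint:

  ```bash
  # "minio-bucket" and the endpoint are hypothetical; substitute your own values.
  ./spark-sql \
    --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}" \
    --conf "spark.hadoop.fs.s3a.bucket.minio-bucket.path.style.access=true" \
    --conf "spark.hadoop.fs.s3a.bucket.minio-bucket.connection.ssl.enabled=true" \
    --conf "spark.hadoop.fs.s3a.bucket.minio-bucket.endpoint=https://minio.example.com:9000" \
    -e "SELECT COUNT(*) FROM csv.\`s3a://minio-bucket/events/data.csv\`"
  ```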

- To use a single global endpoint when the session accesses only MinIO buckets, include the following properties when starting the Spark session:

  ```bash
  ./<spark-shell | pyspark | spark-sql> \
    --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}" \
    --conf "spark.hadoop.fs.s3a.path.style.access=true" \
    --conf "spark.hadoop.fs.s3a.connection.ssl.enabled=true" \
    --conf "spark.hadoop.fs.s3a.endpoint=https://<MINIO_HOST>:<MINIO_PORT>"
  ```
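
  The global endpoint routes all s3a traffic to MinIO, so use it only when no AWS S3 buckets are accessed in the same session. As an optional connectivity check before starting the session (the health path below is MinIO's standard liveness probe):

  ```bash
  # Expect HTTP 200 if the MinIO endpoint is reachable over TLS.
  curl -sS -o /dev/null -w "%{http_code}\n" "https://<MINIO_HOST>:<MINIO_PORT>/minio/health/live"
  ```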