Connector Guide - Access - Apache Spark OLAC

This is the connector guide for using Apache Spark OLAC with Privacera. Please make sure that the connector has been installed and configured correctly before proceeding with the instructions in this guide.

Run a Spark session

  1. Navigate to the ${SPARK_HOME}/bin folder and export the JWT token:

    Bash
    cd ${SPARK_HOME}/bin
    export JWT_TOKEN="<JWT_TOKEN>"
    

  2. Start a Spark session using one of spark-shell, pyspark, or spark-sql.

  3. To pass the JWT token directly as a command-line argument, use the following configuration when connecting to the cluster:

    Bash
    ./<spark-shell | pyspark | spark-sql> \
    --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}"
    

  4. To point Spark at a file containing the JWT token, use the following configuration:

    Bash
    ./<spark-shell | pyspark | spark-sql> \
    --conf "spark.hadoop.privacera.jwt.token=<path-to-jwt-token-file>"
    
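As a worked sketch of the steps above, the invocation below combines them into a single session start. The file path in Option B is a hypothetical example for illustration, not a path created by the installer; substitute your own values.

```shell
# Sketch: start a Spark session authenticated with a Privacera JWT token.
cd ${SPARK_HOME}/bin

# Option A: pass the token string directly on the command line.
export JWT_TOKEN="<JWT_TOKEN>"
./pyspark --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}"

# Option B: reference a file containing the token.
# (/etc/privacera/jwt_token.txt is a hypothetical path.)
./pyspark --conf "spark.hadoop.privacera.jwt.token=/etc/privacera/jwt_token.txt"
```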

Run a Spark session with executors

  1. SSH into the driver pod in the Spark namespace.

  2. Export the following variables:

    Bash
    SPARK_NAME_SPACE=<SPARK_NAME_SPACE>
    SPARK_IMAGE=<SPARK_IMAGE>
    
    SVC_ACCOUNT=privacera-sa-spark-plugin
    K8S_MASTER=k8s://https://kubernetes.default.svc
    
    export JWT_TOKEN="<JWT_TOKEN>"
    

  3. Run the following command to start a Spark session with executors:

    Bash
    /opt/spark/bin/spark-shell \
      --master ${K8S_MASTER} \
      --deploy-mode client \
      --conf spark.executor.instances=2 \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=${SVC_ACCOUNT} \
      --conf spark.kubernetes.namespace=${SPARK_NAME_SPACE} \
      --conf spark.kubernetes.driver.request.cores=0.001 \
      --conf spark.driver.memory=1g \
      --conf spark.kubernetes.executor.request.cores=0.001 \
      --conf spark.executor.memory=1g \
      --conf spark.kubernetes.container.image=${SPARK_IMAGE} \
      --conf spark.kubernetes.container.image.pullPolicy=Always \
      --conf spark.driver.host=${SPARK_PLUGIN_POD_IP} \
      --conf spark.driver.port=7077 \
      --conf spark.blockManager.port=7078 \
      --conf spark.kubernetes.executor.secrets.privacera-spark-secret=/privacera-secret \
      --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}"
    
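Once the session is up, you can sanity-check that the requested executors registered. A minimal check from outside the session, assuming kubectl access to the same namespace (Spark on Kubernetes labels executor pods with spark-role=executor, though labels can vary by deployment):

```shell
# Sketch: confirm two executor pods are running in the Spark namespace.
kubectl get pods -n "${SPARK_NAME_SPACE}" -l spark-role=executor
```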

Run a Spark session with MinIO

  1. To enable custom S3 endpoints for accessing specific MinIO buckets alongside S3 buckets, set the following properties for each bucket individually when starting the Spark session:

    Bash
    ./<spark-shell | pyspark | spark-sql> \
    --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}" \
    --conf "spark.hadoop.fs.s3a.bucket.<bucket-1>.path.style.access=true" \
    --conf "spark.hadoop.fs.s3a.bucket.<bucket-1>.connection.ssl.enabled=true" \
    --conf "spark.hadoop.fs.s3a.bucket.<bucket-1>.endpoint=https://<MINIO_HOST>:<MINIO_PORT>" \
    --conf "spark.hadoop.fs.s3a.bucket.<bucket-2>.path.style.access=true" \
    --conf "spark.hadoop.fs.s3a.bucket.<bucket-2>.connection.ssl.enabled=true" \
    --conf "spark.hadoop.fs.s3a.bucket.<bucket-2>.endpoint=https://<MINIO_HOST>:<MINIO_PORT>"
    

  2. To enable a global endpoint that accesses only MinIO buckets, include the following properties when starting the Spark session:

    Bash
    ./<spark-shell | pyspark | spark-sql> \
    --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}" \
    --conf "spark.hadoop.fs.s3a.path.style.access=true" \
    --conf "spark.hadoop.fs.s3a.connection.ssl.enabled=true" \
    --conf "spark.hadoop.fs.s3a.endpoint=https://<MINIO_HOST>:<MINIO_PORT>"
    
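To verify the MinIO configuration, a quick smoke test can read an object from one of the configured buckets. This is a sketch only: the bucket, object path, and the technique of feeding the read statements to pyspark on stdin are illustrative assumptions.

```shell
# Sketch: start pyspark with a global MinIO endpoint and read a test object.
# <bucket>, <MINIO_HOST>, <MINIO_PORT>, and the CSV path are placeholders.
./pyspark \
  --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}" \
  --conf "spark.hadoop.fs.s3a.path.style.access=true" \
  --conf "spark.hadoop.fs.s3a.connection.ssl.enabled=true" \
  --conf "spark.hadoop.fs.s3a.endpoint=https://<MINIO_HOST>:<MINIO_PORT>" <<'EOF'
df = spark.read.csv("s3a://<bucket>/test/sample.csv")
df.show()
EOF
```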
