Connector Guide - AWS EMR - Accessing AWS S3 (OLAC)

When AWS EMR is enabled with Privacera, you can use AWS S3 as the object store. Use the following instructions to connect to the cluster and run Spark jobs. Access is controlled at the AWS S3 object level, which is known as OLAC (Object Level Access Control).

SSH to your EMR master node:
Bash
ssh your_user@<emr-master-node>
If you are using JWT for authentication, you must pass the JWT token to the EMR cluster. You can pass the token either directly as a command-line argument or through the path of a file that contains the token.
- To pass the JWT token directly as a command-line argument, use the following configuration when connecting to the cluster:
  Bash
  --conf "spark.hadoop.privacera.jwt.token.str=<your-jwt-token>"
- To use the file path containing the JWT token, use the following configuration:
  Bash
  --conf "spark.hadoop.privacera.jwt.token=<path-to-jwt-token-file>"
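The two options above are alternatives for supplying the same token. As a minimal sketch, the Python helper below assembles a launcher command with one of the two `--conf` settings. Only the two property names come from this guide; the helper name `spark_command`, the example file path, and the rule that exactly one option must be given are assumptions made for illustration.

```python
import shlex

# Property names taken from this guide.
JWT_TOKEN_STR = "spark.hadoop.privacera.jwt.token.str"
JWT_TOKEN_FILE = "spark.hadoop.privacera.jwt.token"


def spark_command(launcher, token=None, token_file=None):
    """Return the argv list for a Spark launcher with one JWT --conf setting.

    Exactly one of `token` or `token_file` must be given. This rule is a
    choice made for this sketch, not something the guide enforces.
    """
    if (token is None) == (token_file is None):
        raise ValueError("pass exactly one of token or token_file")
    if token is not None:
        conf = f"{JWT_TOKEN_STR}={token}"
    else:
        conf = f"{JWT_TOKEN_FILE}={token_file}"
    return [launcher, "--conf", conf]


if __name__ == "__main__":
    # Hypothetical file path, for illustration only.
    argv = spark_command("pyspark", token_file="/tmp/jwt/token.txt")
    # Print the command as it would be typed on the EMR master node.
    print(shlex.join(argv))
```

The same helper can be called with `"spark-shell"` as the launcher, since the guide uses identical settings for both shells.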
Connecting to Apache Spark Cluster


Connect to pyspark
Bash
pyspark
If JWT authentication is enabled in the cluster, add one of the following configurations to the command.
- To pass the JWT token directly as a command-line argument, use the following configuration:
  Bash
  --conf "spark.hadoop.privacera.jwt.token.str=<your-jwt-token>"
- To use the file path containing the JWT token, use the following configuration:
  Bash
  --conf "spark.hadoop.privacera.jwt.token=<path-to-jwt-token-file>"

Run Spark read/write

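Python
# Replace ${S3_BUCKET} and ${CSV_FILE} with your bucket name and object key.
df = spark.read.csv("s3a://${S3_BUCKET}/${CSV_FILE}")
df.show(5)

# Write the DataFrame back to S3 as CSV, replacing any existing output.
df.write.format("csv").mode("overwrite").save("s3a://${S3_BUCKET}/${CSV_FILE}")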
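The `${S3_BUCKET}` and `${CSV_FILE}` markers in the snippet above are placeholders, not Python syntax, so they have to be replaced with real values before the code runs. The following is a minimal sketch of one way to do that from the pyspark shell, using `string.Template` from the Python standard library. The function names `s3a_path` and `copy_csv`, the bucket name, and the object keys are hypothetical, and writing to a separate output key is a choice made for this sketch.

```python
from string import Template

# Same placeholder style as the snippet in this guide.
S3A_PATH = Template("s3a://${S3_BUCKET}/${CSV_FILE}")


def s3a_path(bucket, key):
    """Fill in the placeholders and return a concrete s3a:// URI."""
    return S3A_PATH.substitute(S3_BUCKET=bucket, CSV_FILE=key)


def copy_csv(spark, bucket, src_key, dst_key):
    """Read a CSV object, show five rows, and write it under another key."""
    df = spark.read.csv(s3a_path(bucket, src_key))
    df.show(5)
    df.write.format("csv").mode("overwrite").save(s3a_path(bucket, dst_key))
    return df


# In the pyspark shell, `spark` is already defined (hypothetical values):
# copy_csv(spark, "example-bucket", "input/data.csv", "output/data_copy")
```

`Template.substitute` raises `KeyError` if a placeholder is left unfilled, which helps catch a missing bucket or key before any request reaches S3.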

Connect to spark-shell
Bash
spark-shell
If JWT authentication is enabled in the cluster, add one of the following configurations to the command.
- To pass the JWT token directly as a command-line argument, use the following configuration:
  Bash
  --conf "spark.hadoop.privacera.jwt.token.str=<your-jwt-token>"
- To use the file path containing the JWT token, use the following configuration:
  Bash
  --conf "spark.hadoop.privacera.jwt.token=<path-to-jwt-token-file>"

Run Spark read/write

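Scala
// Replace ${S3_BUCKET} and ${CSV_FILE} with your bucket name and object key.
val df = spark.read.csv("s3a://${S3_BUCKET}/${CSV_FILE}")
df.show(5)

// Write the DataFrame back to S3 as CSV, replacing any existing output.
df.write.format("csv").mode("overwrite").save("s3a://${S3_BUCKET}/${CSV_FILE}")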

When you use Spark SQL, the query retrieves metadata from the AWS Glue Data Catalog or the Hive Metastore, which provides the location of the data in S3. Access to the files at that location is controlled by Privacera.

To run SQL commands, the cluster needs access to the AWS Glue Data Catalog or the Hive Metastore.

Connect to spark-sql
Bash
spark-sql

Run a Spark SQL query

SQL
DROP DATABASE IF EXISTS priv_emr_hive CASCADE;

CREATE DATABASE IF NOT EXISTS priv_emr_hive LOCATION 's3a://${S3_BUCKET}/${PATH_TO_DB}';
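`${S3_BUCKET}` and `${PATH_TO_DB}` in the statements above are placeholders that must be replaced with real values before the statements are run. As a minimal sketch, the Python snippet below renders the two statements into a finished script that could be pasted into `spark-sql` or saved to a file and run with `spark-sql -f`. The function name `render_sql`, the bucket, and the database path are hypothetical.

```python
from string import Template

# Statements from this guide, with the placeholders kept in Template form.
STATEMENTS = Template(
    "DROP DATABASE IF EXISTS priv_emr_hive CASCADE;\n"
    "\n"
    "CREATE DATABASE IF NOT EXISTS priv_emr_hive "
    "LOCATION 's3a://${S3_BUCKET}/${PATH_TO_DB}';\n"
)


def render_sql(bucket, path_to_db):
    """Return the SQL script with both placeholders filled in."""
    return STATEMENTS.substitute(S3_BUCKET=bucket, PATH_TO_DB=path_to_db)


if __name__ == "__main__":
    # Hypothetical bucket and path, for illustration only.
    print(render_sql("example-bucket", "warehouse/priv_emr_hive.db"))
```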
