Spark OLAC

SSH to the EMR master node
Bash
ssh hadoop@<emr-master-node>
Run the following commands to switch to the required user and obtain a Kerberos ticket
Bash
sudo su - <user>
kinit
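
Before launching a Spark tool, it can be useful to confirm that `kinit` actually produced a usable ticket. The following is a minimal sketch, assuming an MIT Kerberos client whose `klist -s` exits with status 0 when a readable, unexpired credentials cache exists. The helper name `has_valid_kerberos_ticket` is hypothetical and not part of any Privacera or EMR tooling.

```python
import subprocess


def has_valid_kerberos_ticket() -> bool:
    """Return True if `klist -s` reports a readable, unexpired credentials cache."""
    try:
        # -s runs klist silently; the result is carried by the exit status.
        result = subprocess.run(["klist", "-s"], check=False)
    except OSError:
        # klist is not installed, not on PATH, or not executable.
        return False
    return result.returncode == 0


if has_valid_kerberos_ticket():
    print("Kerberos ticket found")
else:
    print("No valid Kerberos ticket; run kinit first")
```
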
Connect to a Spark tool

Choose one of the following: pyspark, spark-shell, or spark-sql.

Connect to pyspark
Bash
pyspark
If JWT authorization is enabled in the cluster, include the following additional configuration.
- To pass the JWT token directly as a command-line argument, use the following configuration:
  Bash
  --conf "spark.hadoop.privacera.jwt.token.str=<your-jwt-token>"
- To pass the path of a file containing the JWT token, use the following configuration:
  Bash
  --conf "spark.hadoop.privacera.jwt.token=<path-to-jwt-token-file>"
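
The `--conf` option above is appended to the tool's launch command. As an illustration, the sketch below assembles the full argument list for any of the three tools. The function name `build_spark_command` and the example file path are hypothetical, and treating the two JWT options as mutually exclusive is an assumption rather than documented behavior.

```python
import shlex
from typing import List, Optional

# Property names copied from the configuration lines above.
JWT_TOKEN_STR_KEY = "spark.hadoop.privacera.jwt.token.str"
JWT_TOKEN_FILE_KEY = "spark.hadoop.privacera.jwt.token"


def build_spark_command(
    tool: str,
    jwt_token: Optional[str] = None,
    jwt_token_file: Optional[str] = None,
) -> List[str]:
    """Build the argument list for pyspark, spark-shell, or spark-sql."""
    if tool not in ("pyspark", "spark-shell", "spark-sql"):
        raise ValueError(f"unsupported tool: {tool}")
    # Assumption: only one of the two JWT options is supplied at a time.
    if jwt_token and jwt_token_file:
        raise ValueError("pass either a token string or a token file, not both")
    command = [tool]
    if jwt_token:
        command += ["--conf", f"{JWT_TOKEN_STR_KEY}={jwt_token}"]
    elif jwt_token_file:
        command += ["--conf", f"{JWT_TOKEN_FILE_KEY}={jwt_token_file}"]
    return command


# Print a shell-ready command line using a hypothetical token file path.
print(shlex.join(build_spark_command("pyspark", jwt_token_file="/tmp/jwt.token")))
```

Passing the token by file path keeps the token value out of the shell history and the process list, which may be preferable on shared nodes.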

Run a Spark read/write

Python
df = spark.read.csv("s3a://${S3_BUCKET}/${CSV_FILE}")
df.show(5)

df.write.format("csv").mode("overwrite").save("s3a://${S3_BUCKET}/${CSV_FILE}")
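
Python does not expand the `${S3_BUCKET}` and `${CSV_FILE}` placeholders in the snippet above, so they need to be replaced with real values before running it. One option is to read them from environment variables. This is a minimal sketch, assuming variables named `S3_BUCKET` and `CSV_FILE` were exported before starting pyspark; the helper name `s3a_uri` is hypothetical.

```python
import os


def s3a_uri(bucket: str, key: str) -> str:
    """Join a bucket name and an object key into an s3a:// URI."""
    return f"s3a://{bucket.strip('/')}/{key.lstrip('/')}"


# Assumed variable names; fall back to visible placeholders when unset.
bucket = os.environ.get("S3_BUCKET", "<s3-bucket>")
csv_file = os.environ.get("CSV_FILE", "<csv-file>")
path = s3a_uri(bucket, csv_file)
print(path)

# Inside a pyspark session the `spark` object is predefined, so the read becomes:
# df = spark.read.csv(path)
```
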

Connect to spark-shell
Bash
spark-shell
If JWT authorization is enabled in the cluster, include the following additional configuration.
- To pass the JWT token directly as a command-line argument, use the following configuration:
  Bash
  --conf "spark.hadoop.privacera.jwt.token.str=<your-jwt-token>"
- To pass the path of a file containing the JWT token, use the following configuration:
  Bash
  --conf "spark.hadoop.privacera.jwt.token=<path-to-jwt-token-file>"

Run a Spark read/write

Scala
val df = spark.read.csv("s3a://${S3_BUCKET}/${CSV_FILE}")
df.show(5)

df.write.format("csv").mode("overwrite").save("s3a://${S3_BUCKET}/${CSV_FILE}")

Connect to spark-sql
Bash
spark-sql
If JWT authorization is enabled in the cluster, include the following additional configuration.
- To pass the JWT token directly as a command-line argument, use the following configuration:
  Bash
  --conf "spark.hadoop.privacera.jwt.token.str=<your-jwt-token>"
- To pass the path of a file containing the JWT token, use the following configuration:
  Bash
  --conf "spark.hadoop.privacera.jwt.token=<path-to-jwt-token-file>"

Run a Spark SQL query

SQL
DROP DATABASE IF EXISTS priv_emr_hive CASCADE;

CREATE DATABASE IF NOT EXISTS priv_emr_hive LOCATION 's3a://${S3_BUCKET}/${PATH_TO_DB}';
