Delta Lake
- SSH to the EMR master node:

```bash
ssh hadoop@<emr-master-node>
```
- Connect to a Spark tool (pyspark, spark-shell, or spark-sql) by running the appropriate command below.
- If you are using OLAC, connect to pyspark as follows:
```bash
pyspark \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
```
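Optionally, you can confirm that the extension settings took effect from inside the session; a quick sanity check, not part of the setup:

```python
# Should print the extension class(es) passed via --conf above
print(spark.conf.get("spark.sql.extensions"))
```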
- If you are using FGAC or OLAC_FGAC, update spark.sql.extensions as follows:
```bash
pyspark \
  --conf "spark.sql.extensions=com.privacera.spark.agent.SparkSQLExtension,io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
```
- If JWT authorization is enabled on the cluster, append the following additional configuration to the command above:
```bash
--conf "spark.hadoop.privacera.jwt.token.str=<your-jwt-token>"
```
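Putting the pieces together, an OLAC pyspark launch with JWT enabled would look like the following sketch (replace <your-jwt-token> with the token issued for your cluster):

```bash
pyspark \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
  --conf "spark.hadoop.privacera.jwt.token.str=<your-jwt-token>"
```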
- Run a Spark read/write:
```python
df = spark.read.format("delta").load("s3a://${S3_BUCKET}/${DELTA_FILE}")
df.show(5)
df.write.format("delta").mode("overwrite").save("s3a://${S3_BUCKET}/${DELTA_FILE}")
```
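Since Delta versions every write, you can optionally verify the overwrite with time travel. A minimal sketch, assuming the path above already holds at least one committed version:

```python
# Read the table as of its first committed version (Delta time travel)
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("s3a://${S3_BUCKET}/${DELTA_FILE}")
df_v0.show(5)
```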
- If you are using OLAC, connect to spark-shell as follows:
```bash
spark-shell \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
```
- If you are using FGAC or OLAC_FGAC, update spark.sql.extensions as follows:
```bash
spark-shell \
  --conf "spark.sql.extensions=com.privacera.spark.agent.SparkSQLExtension,io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
```
- If JWT authorization is enabled on the cluster, append the following additional configuration to the command above:
```bash
--conf "spark.hadoop.privacera.jwt.token.str=<your-jwt-token>"
```
- Run a Spark read/write:
```scala
val df = spark.read.format("delta").load("s3a://${S3_BUCKET}/${DELTA_FILE}")
df.show(5)
df.write.format("delta").mode("overwrite").save("s3a://${S3_BUCKET}/${DELTA_FILE}")
```
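Optionally, the commit history of the table can be inspected from spark-shell with the DeltaTable API; a minimal sketch, assuming the same path as above:

```scala
import io.delta.tables.DeltaTable

// List the five most recent commits on the Delta table
val deltaTable = DeltaTable.forPath(spark, "s3a://${S3_BUCKET}/${DELTA_FILE}")
deltaTable.history(5).show(false)
```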
- If you are using OLAC, connect to spark-sql as follows:
```bash
spark-sql \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
```
- If you are using FGAC or OLAC_FGAC, update spark.sql.extensions as follows:
```bash
spark-sql \
  --conf "spark.sql.extensions=com.privacera.spark.agent.SparkSQLExtension,io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
```
- If JWT authorization is enabled on the cluster, append the following additional configuration to the command above:
```bash
--conf "spark.hadoop.privacera.jwt.token.str=<your-jwt-token>"
```
- Run a Spark SQL query:
```sql
CREATE TABLE IF NOT EXISTS <db_name>.<table_name> (
  id INT,
  name STRING,
  age INT,
  city STRING)
USING DELTA
LOCATION 's3://${S3_BUCKET}/${TABLE_PATH}';

SELECT * FROM <db_name>.<table_name>;
```
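To exercise the write path end to end, you can insert a couple of rows and then inspect the table's commit log; a minimal sketch with hypothetical sample values:

```sql
-- Hypothetical sample rows
INSERT INTO <db_name>.<table_name> VALUES
  (1, 'Alice', 30, 'Austin'),
  (2, 'Bob', 28, 'Boston');

-- Delta-specific: list the table's recent commits
DESCRIBE HISTORY <db_name>.<table_name>;
```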