Iceberg
SSH to the EMR master node:
Bash ssh hadoop@<emr-master-node>
Connect to a Spark tool
Run the command for the tool you want to use (pyspark, spark-shell, or spark-sql).
pyspark
For Hadoop catalog,
Bash pyspark \
--conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
--conf "spark.sql.catalog.hadoop_catalog=org.apache.iceberg.spark.SparkCatalog" \
--conf "spark.sql.catalog.hadoop_catalog.type=hadoop" \
--conf "spark.sql.catalog.hadoop_catalog.warehouse=s3://${S3_BUCKET}/${S3_OBJECT_PATH}"
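The Hadoop catalog flags are identical for all three tools, so they can be generated rather than retyped. The following is a minimal sketch, not part of the product: the function name `hadoop_catalog_command` and the example bucket and path are hypothetical, and the flags themselves are copied from the command above.

```python
import shlex

ICEBERG_EXTENSIONS = "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"


def hadoop_catalog_command(tool, bucket, object_path, catalog="hadoop_catalog"):
    """Return the argument list that launches `tool` with an Iceberg Hadoop catalog."""
    if tool not in ("pyspark", "spark-shell", "spark-sql"):
        raise ValueError(f"unsupported Spark tool: {tool}")
    prefix = f"spark.sql.catalog.{catalog}"
    confs = [
        f"spark.sql.extensions={ICEBERG_EXTENSIONS}",
        f"{prefix}=org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type=hadoop",
        # Leading and trailing slashes are stripped so the URI has no empty segments.
        f"{prefix}.warehouse=s3://{bucket}/{object_path.strip('/')}",
    ]
    args = [tool]
    for conf in confs:
        args += ["--conf", conf]
    return args


if __name__ == "__main__":
    # Hypothetical bucket and path, for illustration only.
    print(shlex.join(hadoop_catalog_command("pyspark", "my-bucket", "warehouse/iceberg")))
```

Returning a list rather than a single string keeps each value a separate argument, which avoids quoting problems if the list is later passed to `subprocess.run`.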
For Glue catalog,
Bash pyspark \
--conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
--conf "spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog" \
--conf "spark.sql.catalog.glue_catalog.warehouse=s3://${S3_BUCKET}/${S3_OBJECT_PATH}" \
--conf "" \
--conf "" \
--conf ""
If you are using FGAC or OLAC_FGAC, update the spark.sql.extensions
property in the above command as follows:
Bash --conf "spark.sql.extensions=com.privacera.spark.agent.SparkSQLExtension,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
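If the launch command is assembled by a script, the extension list can be adjusted programmatically. The sketch below is illustrative only: `with_privacera_extension` is a hypothetical helper, and it places the Privacera extension first simply because that is the order shown in the configuration above.

```python
PRIVACERA_EXTENSION = "com.privacera.spark.agent.SparkSQLExtension"


def with_privacera_extension(extensions):
    """Prepend the Privacera extension to a comma-separated spark.sql.extensions value."""
    names = [name.strip() for name in extensions.split(",") if name.strip()]
    # Skip the insert when the extension is already listed, so repeated calls are harmless.
    if PRIVACERA_EXTENSION not in names:
        names.insert(0, PRIVACERA_EXTENSION)
    return ",".join(names)


if __name__ == "__main__":
    iceberg = "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
    print("spark.sql.extensions=" + with_privacera_extension(iceberg))
```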
If JWT authorization is enabled in the cluster, include the following additional configuration.
To pass the JWT token directly as a command-line argument, use the following configuration:
Bash --conf "spark.hadoop.privacera.jwt.token.str=<your-jwt-token>"
To use the file path containing the JWT token, use the following configuration:
Bash --conf "spark.hadoop.privacera.jwt.token=<path-to-jwt-token-file>"
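The two JWT options can be wrapped in a small helper that emits whichever flag applies. This is a sketch under an assumption: the text above does not say whether both properties may be set together, so the hypothetical `jwt_conf` function accepts exactly one of them.

```python
def jwt_conf(token=None, token_file=None):
    """Return the --conf arguments for passing a JWT inline or by file path."""
    # Assumption: exactly one of the two options is supplied.
    if (token is None) == (token_file is None):
        raise ValueError("provide exactly one of token or token_file")
    if token is not None:
        return ["--conf", f"spark.hadoop.privacera.jwt.token.str={token}"]
    return ["--conf", f"spark.hadoop.privacera.jwt.token={token_file}"]


if __name__ == "__main__":
    # Hypothetical file path, for illustration only.
    print(jwt_conf(token_file="/tmp/jwt/token.txt"))
```

Passing the file path keeps the token itself out of the shell history and the process list, which may matter on a shared master node.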
Create table and read
Python # Create a DataFrame.
data = spark.createDataFrame([
    ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
    ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
    ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
    ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z")
], ["id", "creation_date", "last_update_time"])
# Create the Iceberg table at the Amazon S3 location, then append the DataFrame to it.
spark.sql("""CREATE TABLE IF NOT EXISTS <my_catalog>.<db_name>.<table_name> (id string,
creation_date string,
last_update_time string)
USING iceberg
location 's3://${S3_BUCKET}/${TABLE_PATH}'""")
data.writeTo("<my_catalog>.<db_name>.<table_name>").append()
# Read the table back as a DataFrame.
df = spark.read.format("iceberg").load("<my_catalog>.<db_name>.<table_name>")
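The `<my_catalog>.<db_name>.<table_name>` placeholder appears three times in the example above, so it can help to build the name and the CREATE TABLE statement once and reuse them. The following is a minimal sketch; `qualified_table` and `create_table_sql` are hypothetical helpers, and restricting identifiers to letters, digits, and underscores is an assumption made here for safety, not a Spark or Iceberg requirement.

```python
import re

# Assumption: plain identifiers only; names that need backtick quoting are rejected.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def qualified_table(catalog, database, table):
    """Join catalog, database, and table into a three-part table name."""
    for part in (catalog, database, table):
        if not _IDENTIFIER.match(part):
            raise ValueError(f"invalid identifier: {part!r}")
    return f"{catalog}.{database}.{table}"


def create_table_sql(name, location):
    """Build the CREATE TABLE statement used in the example above."""
    return (
        f"CREATE TABLE IF NOT EXISTS {name} ("
        "id string, creation_date string, last_update_time string) "
        f"USING iceberg LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical catalog, database, table, and bucket names.
    name = qualified_table("glue_catalog", "sales_db", "orders")
    print(create_table_sql(name, "s3://my-bucket/tables/orders"))
```

In a pyspark session, the same `name` value could then be passed to `spark.sql(...)`, `data.writeTo(name)`, and `spark.read.format("iceberg").load(name)`.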
spark-shell
For Hadoop catalog,
Bash spark-shell \
--conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
--conf "spark.sql.catalog.hadoop_catalog=org.apache.iceberg.spark.SparkCatalog" \
--conf "spark.sql.catalog.hadoop_catalog.type=hadoop" \
--conf "spark.sql.catalog.hadoop_catalog.warehouse=s3://${S3_BUCKET}/${S3_OBJECT_PATH}"
For Glue catalog,
Bash spark-shell \
--conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
--conf "spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog" \
--conf "spark.sql.catalog.glue_catalog.warehouse=s3://${S3_BUCKET}/${S3_OBJECT_PATH}" \
--conf "" \
--conf "" \
--conf ""
If you are using FGAC or OLAC_FGAC, update the spark.sql.extensions
property in the above command as follows:
Bash --conf "spark.sql.extensions=com.privacera.spark.agent.SparkSQLExtension,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
If JWT authorization is enabled in the cluster, include the following additional configuration.
To pass the JWT token directly as a command-line argument, use the following configuration:
Bash --conf "spark.hadoop.privacera.jwt.token.str=<your-jwt-token>"
To use the file path containing the JWT token, use the following configuration:
Bash --conf "spark.hadoop.privacera.jwt.token=<path-to-jwt-token-file>"
Create table and read
Scala // Create a DataFrame.
val data = Seq(
  ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
  ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
  ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
  ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z")
).toDF("id", "creation_date", "last_update_time")
// Create the Iceberg table at the Amazon S3 location, then append the DataFrame to it.
spark.sql("""CREATE TABLE IF NOT EXISTS <my_catalog>.<db_name>.<table_name> (id string,
creation_date string,
last_update_time string)
USING iceberg
location 's3://${S3_BUCKET}/${TABLE_PATH}'""")
data.writeTo("<my_catalog>.<db_name>.<table_name>").append()
// Read the table back as a DataFrame.
val df = spark.read.format("iceberg").load("<my_catalog>.<db_name>.<table_name>")
spark-sql
For Hadoop catalog,
Bash spark-sql \
--conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
--conf "spark.sql.catalog.hadoop_catalog=org.apache.iceberg.spark.SparkCatalog" \
--conf "spark.sql.catalog.hadoop_catalog.type=hadoop" \
--conf "spark.sql.catalog.hadoop_catalog.warehouse=s3://${S3_BUCKET}/${S3_OBJECT_PATH}"
For Glue catalog,
Bash spark-sql \
--conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
--conf "spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog" \
--conf "spark.sql.catalog.glue_catalog.warehouse=s3://${S3_BUCKET}/${S3_OBJECT_PATH}" \
--conf "" \
--conf "" \
--conf ""
If you are using FGAC or OLAC_FGAC, update the spark.sql.extensions
property in the above command as follows:
Bash --conf "spark.sql.extensions=com.privacera.spark.agent.SparkSQLExtension,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
If JWT authorization is enabled in the cluster, include the following additional configuration.
To pass the JWT token directly as a command-line argument, use the following configuration:
Bash --conf "spark.hadoop.privacera.jwt.token.str=<your-jwt-token>"
To use the file path containing the JWT token, use the following configuration:
Bash --conf "spark.hadoop.privacera.jwt.token=<path-to-jwt-token-file>"
Run a Spark SQL query
SQL CREATE TABLE IF NOT EXISTS <my_catalog>.<db_name>.<table_name> (
id INT,
name STRING,
age INT,
city STRING)
USING iceberg;
SELECT * FROM <my_catalog>.<db_name>.<table_name>;