AWS EMR Serverless User Guide

Using Privacera with AWS EMR Serverless

To use Privacera with AWS EMR Serverless, you must pass the JWT token to the Spark job or the Jupyter Notebook. The reference steps below show how to configure an Apache Spark job and a Jupyter Notebook.

Tip

Replace <JWT-TOKEN> with the actual JWT token in all of the use cases below.

  1. EMR Studio Workspace and Jupyter Notebook

    • Create a Workspace by providing a unique name and an S3 storage path, and enable the Interactive endpoint.
    • Connect to the Jupyter Notebook and provide the JWT token in the notebook using the following format:
    Python
    spark.conf.set("spark.hadoop.privacera.jwt.oauth.enable", "true")
    spark.conf.set("spark.hadoop.privacera.jwt.token.str", "<JWT-TOKEN>")
    
  2. Spark Job

    • Create a Spark Job with the following Privacera-specific Spark properties in the spark-defaults classification.

      JSON configuration:
      JSON
      "spark.hadoop.privacera.jwt.oauth.enable": "true",
      "spark.hadoop.privacera.jwt.token.str": "<JWT-TOKEN>",
      "spark.driver.extraJavaOptions": "-javaagent:/usr/lib/spark/jars/privacera-agent.jar",
      "spark.executor.extraJavaOptions": "-javaagent:/usr/lib/spark/jars/privacera-agent.jar",
      "spark.sql.hive.metastore.sharedPrefixes": "com.amazonaws.services.dynamodbv2,com.privacera,com.amazonaws"
      
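The two cases above can be sketched in Python as follows. This is a minimal illustration, not Privacera tooling: the helper names and the `PRIVACERA_JWT` environment variable are hypothetical. The `classification`/`properties` wrapper is the shape EMR Serverless uses for application configuration entries.

```python
import json
import os

# Privacera agent and metastore settings copied from the Spark job step above.
PRIVACERA_AGENT = "-javaagent:/usr/lib/spark/jars/privacera-agent.jar"
JOB_PROPERTIES = {
    "spark.driver.extraJavaOptions": PRIVACERA_AGENT,
    "spark.executor.extraJavaOptions": PRIVACERA_AGENT,
    "spark.sql.hive.metastore.sharedPrefixes":
        "com.amazonaws.services.dynamodbv2,com.privacera,com.amazonaws",
}


def privacera_jwt_properties(token):
    """Return the two JWT properties, rejecting an unreplaced placeholder."""
    if not token or token == "<JWT-TOKEN>":
        raise ValueError("replace <JWT-TOKEN> with the actual JWT token")
    return {
        "spark.hadoop.privacera.jwt.oauth.enable": "true",
        "spark.hadoop.privacera.jwt.token.str": token,
    }


def spark_defaults_classification(token):
    """Build a spark-defaults classification entry for a Spark job."""
    properties = {**privacera_jwt_properties(token), **JOB_PROPERTIES}
    return {"classification": "spark-defaults", "properties": properties}


def apply_to_session(spark, token):
    """Notebook case: set the JWT properties on an existing Spark session."""
    for key, value in privacera_jwt_properties(token).items():
        spark.conf.set(key, value)


# PRIVACERA_JWT is a hypothetical variable name; "example-token" is a stand-in.
token = os.environ.get("PRIVACERA_JWT") or "example-token"
print(json.dumps(spark_defaults_classification(token), indent=2))
```

In a notebook, `apply_to_session(spark, token)` performs the same two `spark.conf.set` calls shown in step 1. Reading the token from the environment keeps it out of the notebook or script source.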

Advanced Use Cases

Iceberg

If you are using Iceberg with AWS EMR Serverless, you need to configure the Docker image with the required Iceberg JARs. The Hadoop Catalog requires no additional Privacera configuration. The Glue Catalog, however, requires one additional property.

You can configure Iceberg with either the Hadoop Catalog or the Glue Catalog by adding properties under spark-defaults in the existing Application configuration.

Configure Iceberg with Hadoop Catalog:

For the application, in the spark-defaults section, add the following properties.

These properties are provided for reference only. Modify them to suit your requirements.

Application configuration for OLAC:

Add the following to the spark-defaults classification section in the Application configuration.

JSON
"spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions", 
"spark.sql.catalog.hadoop_catalog": "org.apache.iceberg.spark.SparkCatalog",
"spark.sql.catalog.hadoop_catalog.type": "hadoop",
"spark.sql.catalog.hadoop_catalog.warehouse": "s3://amzn-s3-demo-bucket/example-prefix/",
"spark.jars": "/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar"

Application configuration for OLAC_FGAC:

Add the following to the spark-defaults classification section in the Application configuration.

JSON
"spark.sql.extensions": "com.privacera.spark.agent.SparkSQLExtension,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions", 
"spark.sql.catalog.hadoop_catalog": "org.apache.iceberg.spark.SparkCatalog",
"spark.sql.catalog.hadoop_catalog.type": "hadoop",
"spark.sql.catalog.hadoop_catalog.warehouse": "s3://amzn-s3-demo-bucket/example-prefix/",
"spark.jars": "/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar"

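The two Hadoop Catalog configurations above differ only in `spark.sql.extensions`. A minimal sketch that generates either variant is shown below. The function name and its `fgac` and `catalog` parameters are hypothetical; the property values are copied from the configurations above.

```python
ICEBERG_EXTENSION = "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
PRIVACERA_EXTENSION = "com.privacera.spark.agent.SparkSQLExtension"


def iceberg_hadoop_properties(warehouse, fgac=False, catalog="hadoop_catalog"):
    """Build spark-defaults properties for Iceberg with a Hadoop Catalog."""
    extensions = [ICEBERG_EXTENSION]
    if fgac:
        # OLAC_FGAC lists the Privacera extension first, as in the example above.
        extensions.insert(0, PRIVACERA_EXTENSION)
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": ",".join(extensions),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": warehouse,
        "spark.jars": "/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar",
    }


warehouse = "s3://amzn-s3-demo-bucket/example-prefix/"
olac = iceberg_hadoop_properties(warehouse)
olac_fgac = iceberg_hadoop_properties(warehouse, fgac=True)

# Only the extensions property changes between the two modes.
changed = sorted(key for key in olac if olac[key] != olac_fgac[key])
print(changed)  # ['spark.sql.extensions']
```
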
Configure Iceberg with Glue Catalog:

For the application, add the following properties in the spark-defaults section and update the warehouse location. For Privacera, you also need to set the property spark.sql.catalog.glue_catalog.s3.client-factory-impl.

Application configuration for OLAC:

Add the following to the spark-defaults classification section in the Application configuration.

JSON
"spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
"spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
"spark.sql.catalog.glue_catalog.warehouse": "s3://amzn-s3-demo-bucket/example-prefix/",
"spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
"spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
"spark.sql.catalog.glue_catalog.s3.client-factory-impl" : "com.privacera.iceberg.aws.s3.PrivaceraAwsClientFactory"

Application configuration for OLAC_FGAC:

Add the following to the spark-defaults classification section in the Application configuration.

JSON
"spark.sql.extensions": "com.privacera.spark.agent.SparkSQLExtension,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
"spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
"spark.sql.catalog.glue_catalog.warehouse": "s3://amzn-s3-demo-bucket/example-prefix/",
"spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
"spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
"spark.sql.catalog.glue_catalog.s3.client-factory-impl" : "com.privacera.iceberg.aws.s3.PrivaceraAwsClientFactory"

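Because the Glue Catalog setup depends on the Privacera client factory property, it can help to check a set of spark-defaults properties before applying them. The following is a minimal sketch of such a check. The function name is hypothetical, and it assumes that catalog names contain no dots.

```python
GLUE_IMPL = "org.apache.iceberg.aws.glue.GlueCatalog"
PRIVACERA_FACTORY = "com.privacera.iceberg.aws.s3.PrivaceraAwsClientFactory"
PREFIX = "spark.sql.catalog."
SUFFIX = ".catalog-impl"


def glue_catalogs_missing_privacera_factory(properties):
    """Return names of Glue catalogs that lack the Privacera client factory."""
    missing = []
    for key, value in properties.items():
        is_glue = key.startswith(PREFIX) and key.endswith(SUFFIX) and value == GLUE_IMPL
        if not is_glue:
            continue
        name = key[len(PREFIX):-len(SUFFIX)]  # assumes no dots in the name
        factory_key = f"{PREFIX}{name}.s3.client-factory-impl"
        if properties.get(factory_key) != PRIVACERA_FACTORY:
            missing.append(name)
    return sorted(missing)


properties = {
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog.catalog-impl": GLUE_IMPL,
    "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
}
print(glue_catalogs_missing_privacera_factory(properties))  # ['glue_catalog']

properties["spark.sql.catalog.glue_catalog.s3.client-factory-impl"] = PRIVACERA_FACTORY
print(glue_catalogs_missing_privacera_factory(properties))  # []
```
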
Delta

To use Delta with EMR Serverless, update the existing Application configuration by adding the following properties under the spark-defaults section.

These properties are provided for reference only. Modify them to suit your requirements.

Application configuration for OLAC:

Add the following to the spark-defaults classification section in the Application configuration.

JSON
"spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
"spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
"spark.jars": "/usr/share/aws/delta/lib/delta-spark.jar,/usr/share/aws/delta/lib/delta-storage.jar"

Application configuration for OLAC_FGAC:

Add the following to the spark-defaults classification section in the Application configuration.

JSON
"spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
"spark.sql.extensions": "com.privacera.spark.agent.SparkSQLExtension,io.delta.sql.DeltaSparkSessionExtension",
"spark.jars": "/usr/share/aws/delta/lib/delta-spark.jar,/usr/share/aws/delta/lib/delta-storage.jar"

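As with Iceberg, the OLAC_FGAC configuration for Delta is the OLAC configuration with the Privacera extension added to `spark.sql.extensions`. A minimal sketch of that conversion follows. The `enable_fgac` helper is hypothetical and only manipulates a properties dictionary.

```python
PRIVACERA_EXTENSION = "com.privacera.spark.agent.SparkSQLExtension"


def enable_fgac(properties):
    """Return a copy with the Privacera extension added at the front if absent."""
    current = properties.get("spark.sql.extensions", "")
    extensions = [item.strip() for item in current.split(",") if item.strip()]
    if PRIVACERA_EXTENSION not in extensions:
        extensions.insert(0, PRIVACERA_EXTENSION)
    return {**properties, "spark.sql.extensions": ",".join(extensions)}


delta_olac = {
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.jars": "/usr/share/aws/delta/lib/delta-spark.jar,/usr/share/aws/delta/lib/delta-storage.jar",
}

delta_fgac = enable_fgac(delta_olac)
print(delta_fgac["spark.sql.extensions"])
# com.privacera.spark.agent.SparkSQLExtension,io.delta.sql.DeltaSparkSessionExtension
```

Applying the helper a second time leaves the properties unchanged, so it is safe to run on a configuration that is already in OLAC_FGAC form.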

External Hive Metastore

If you are using an External Hive Metastore with AWS EMR Serverless and want to run jobs that need access to it, you need to configure the Docker image with the required JDBC driver and connection properties. This section provides the additional steps for running such jobs.

Warning

To use an External Hive Metastore, make sure that the Docker image is already configured with the required JDBC driver and connection properties. Refer to Configuring External Hive Metastore with AWS EMR Serverless for more details.

Warning

Privacera doesn't provide access control for the External Hive Metastore. You have to ensure that the users of this integration have the necessary permissions to read from and write to the External Hive Metastore. Also ensure that the connection properties are stored securely and are accessible only to the users who need them.

  1. Update the existing Application configuration by adding the following properties under spark-defaults:

    Application configuration:

    Add the following to the spark-defaults classification section in the Application configuration.

    JSON
    "spark.hadoop.javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
    "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:mysql://<host>:3306/<database_name>",
    "spark.hadoop.javax.jdo.option.ConnectionUserName": "<user_name>",
    "spark.hadoop.javax.jdo.option.ConnectionPassword": "<password>"
    

  2. Disable the Use AWS Glue Data Catalog as metastore checkbox under Additional configurations.
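
The connection properties in step 1 can be assembled in Python rather than typed by hand. The sketch below is one possible approach: the function name and the `HMS_*` environment variable names are hypothetical, and the fallback values are the placeholders from the configuration above.

```python
import json
import os


def hive_metastore_properties(host, database, user, password, port=3306):
    """Build the External Hive Metastore connection properties."""
    return {
        "spark.hadoop.javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
        "spark.hadoop.javax.jdo.option.ConnectionURL": f"jdbc:mysql://{host}:{port}/{database}",
        "spark.hadoop.javax.jdo.option.ConnectionUserName": user,
        "spark.hadoop.javax.jdo.option.ConnectionPassword": password,
    }


# The HMS_* names are hypothetical; unset variables fall back to placeholders.
properties = hive_metastore_properties(
    host=os.environ.get("HMS_HOST", "<host>"),
    database=os.environ.get("HMS_DATABASE", "<database_name>"),
    user=os.environ.get("HMS_USER", "<user_name>"),
    password=os.environ.get("HMS_PASSWORD", "<password>"),
)

# Print only the property names so the password is not echoed to logs.
print(json.dumps(sorted(properties), indent=2))
```

Reading the credentials from the environment keeps them out of source files, but the resulting values still end up in the Application configuration, so the warning above about restricting access continues to apply.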

