AWS EMR Serverless User Guide¶
Using Privacera with AWS EMR Serverless¶
To use Privacera with AWS EMR Serverless, make sure that the JWT token is passed to the Spark job or the Jupyter Notebook. The reference steps below show how to configure an Apache Spark job and a Jupyter Notebook.
Tip
Replace JWT-TOKEN with the actual JWT token in all of the use cases below.
- EMR Studio Workspace and connect to Jupyter Notebook
  - Create a Workspace by providing a unique name and an S3 storage path, and enable the Interactive endpoint.
  - Connect to Jupyter Notebook and provide the JWT token in the notebook using the following format:
- Spark Job
  - Create a Spark Job with the following Privacera-specific Spark properties in the spark-defaults classification JSON configuration:
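As a rough sketch of the shape such a configuration takes: EMR Serverless accepts a list of classification objects, each with a `classification` name and a `properties` map. The Python helper below wraps arbitrary Spark properties in that structure. The function name `spark_defaults` is hypothetical, and the property key shown is a placeholder, not an actual Privacera property name; use the property names supplied for your deployment.

```python
import json


def spark_defaults(properties: dict) -> list:
    """Wrap Spark properties in a spark-defaults classification entry."""
    return [
        {
            "classification": "spark-defaults",
            # Property values are emitted as strings in the JSON configuration.
            "properties": {key: str(value) for key, value in properties.items()},
        }
    ]


# Placeholder key (an assumption for illustration): substitute the real
# Privacera property names and replace JWT-TOKEN with the actual token.
example = spark_defaults({"<privacera-jwt-property>": "JWT-TOKEN"})
print(json.dumps(example, indent=2))
```

The printed JSON can be pasted into the job or application configuration once the placeholder key is replaced.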
Advanced Use Cases¶
Iceberg¶
If you are using Iceberg with AWS EMR Serverless, you need to configure the Docker image with the required Iceberg JARs. For Hadoop Catalog, no additional Privacera configuration is required. For Glue Catalog, however, you need to pass an additional property.
You can configure Iceberg with either Hadoop or Glue Catalog by updating the existing Application configuration and adding properties under spark-defaults.
Configure Iceberg with Hadoop Catalog:¶
For the application, in the spark-defaults section, add the following properties.
This is just for your reference. You can modify the properties as per your requirement.
Application configuration:
Add the following to the spark-defaults classification section in the Application configuration.
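For orientation, a minimal sketch of what a Hadoop Catalog entry might look like, built with Python. It uses the standard Apache Iceberg Spark settings (session extension, `SparkCatalog`, `type=hadoop`, and a warehouse path). The function name, the catalog name `hadoop_catalog`, and the S3 path are assumptions chosen for illustration; adjust them to your environment.

```python
import json


def iceberg_hadoop_catalog(warehouse: str, catalog_name: str = "hadoop_catalog") -> dict:
    """Build a spark-defaults entry for an Iceberg Hadoop catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        "classification": "spark-defaults",
        "properties": {
            # Enables Iceberg SQL extensions such as MERGE INTO and CALL.
            "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            # Registers the catalog under the chosen name.
            prefix: "org.apache.iceberg.spark.SparkCatalog",
            f"{prefix}.type": "hadoop",
            # Root location where the catalog stores table data and metadata.
            f"{prefix}.warehouse": warehouse,
        },
    }


# Hypothetical bucket and path, for illustration only.
hadoop_entry = iceberg_hadoop_catalog("s3://example-bucket/iceberg/warehouse")
print(json.dumps([hadoop_entry], indent=2))
```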
Configure Iceberg with Glue Catalog:¶
For the application, in the spark-defaults section, add the following properties. Update the warehouse location property to match your environment. In addition, for Privacera, you need to set the property spark.sql.catalog.glue_catalog.s3.client-factory-impl.
Application configuration:
Add the following to the spark-defaults classification section in the Application configuration.
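A hedged sketch of a Glue Catalog entry, again in Python. The catalog and file IO classes are the standard Apache Iceberg AWS integration classes, and the catalog name `glue_catalog` comes from the property named above. The value of `s3.client-factory-impl` is not given in this guide, so the sketch takes it as a required argument rather than guessing it; the function name and the sample arguments are assumptions.

```python
import json


def iceberg_glue_catalog(warehouse: str, client_factory_impl: str) -> dict:
    """Build a spark-defaults entry for an Iceberg catalog backed by AWS Glue."""
    prefix = "spark.sql.catalog.glue_catalog"
    return {
        "classification": "spark-defaults",
        "properties": {
            "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            prefix: "org.apache.iceberg.spark.SparkCatalog",
            # Use AWS Glue as the catalog and S3FileIO for data access.
            f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
            f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
            f"{prefix}.warehouse": warehouse,
            # Privacera-specific value: supply the class name from your Privacera setup.
            f"{prefix}.s3.client-factory-impl": client_factory_impl,
        },
    }


# Both arguments are placeholders for illustration.
glue_entry = iceberg_glue_catalog(
    "s3://example-bucket/iceberg/warehouse",
    "<privacera-s3-client-factory-class>",
)
print(json.dumps([glue_entry], indent=2))
```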
Delta¶
To use Delta with EMR Serverless, update the existing Application configuration by adding the following properties under the spark-defaults section.
This is just for your reference. You can modify the properties as per your requirement.
Application configuration:
Add the following to the spark-defaults classification section in the Application configuration.
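Because the Delta properties are added to an existing configuration, a merge step can help avoid overwriting what is already there. The Python sketch below merges the two standard Delta Lake Spark settings into an existing list of classifications. The helper name `add_delta` is hypothetical. It assumes that an existing `spark.sql.extensions` value should be kept and extended, which works because Spark accepts a comma-separated list of extensions.

```python
import copy

# Standard Delta Lake session settings for Spark.
DELTA_PROPERTIES = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def add_delta(configurations: list) -> list:
    """Return a copy of the classification list with Delta properties merged in."""
    result = copy.deepcopy(configurations)
    entry = next(
        (c for c in result if c.get("classification") == "spark-defaults"), None
    )
    if entry is None:
        # No spark-defaults classification yet, so create one.
        entry = {"classification": "spark-defaults", "properties": {}}
        result.append(entry)
    props = entry.setdefault("properties", {})
    for key, value in DELTA_PROPERTIES.items():
        existing = props.get(key)
        if key == "spark.sql.extensions" and existing:
            # Append to the existing extension list instead of replacing it.
            parts = existing.split(",")
            if value not in parts:
                parts.append(value)
            props[key] = ",".join(parts)
        else:
            props[key] = value
    return result


current = [{"classification": "spark-defaults", "properties": {"spark.executor.memory": "4g"}}]
updated = add_delta(current)
```

The input list is left unchanged, and applying `add_delta` twice yields the same result as applying it once.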
External Hive Metastore¶
If you are using an External Hive Metastore with AWS EMR Serverless and you want to run jobs that need access to it, you need to configure the Docker image with the required JDBC driver and connection properties. This section provides the additional steps for running jobs that need access to the External Hive Metastore.
Warning
To use an External Hive Metastore, make sure that the Docker image is already configured with the required JDBC driver and connection properties. Refer to Configuring External Hive Metastore with AWS EMR Serverless for more details.
Warning
Privacera doesn't provide access control to the External Hive Metastore. You have to ensure that the users of this integration have the necessary permissions to read from and write to the External Hive Metastore. You also need to ensure that the connection properties are secure and accessible only to the users who need them.
- Update the existing Application configuration by adding the following properties under spark-defaults:
  Application configuration:
  Add the following to the spark-defaults classification section in the Application configuration.
- Disable the Use AWS Glue Data Catalog as metastore checkbox under Additional configurations.
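The exact properties depend on how the Docker image was configured. Purely as an illustration, the Python sketch below assumes the metastore connection is supplied through the standard Hive `javax.jdo.option.*` keys, passed to Spark with the `spark.hadoop.` prefix. The function name and all sample values are hypothetical.

```python
import json


def hive_metastore_properties(jdbc_url: str, driver: str, user: str, password: str) -> dict:
    """Build a spark-defaults entry with Hive metastore JDBC connection settings."""
    prefix = "spark.hadoop.javax.jdo.option"
    return {
        "classification": "spark-defaults",
        "properties": {
            f"{prefix}.ConnectionURL": jdbc_url,
            # The driver class must match the JDBC driver JAR in the Docker image.
            f"{prefix}.ConnectionDriverName": driver,
            f"{prefix}.ConnectionUserName": user,
            f"{prefix}.ConnectionPassword": password,
        },
    }


# Placeholder values only; do not commit real credentials to source control.
hive_entry = hive_metastore_properties(
    jdbc_url="jdbc:mysql://<db-host>:<db-port>/<db-name>",
    driver="<jdbc-driver-class>",
    user="<db-user>",
    password="<db-password>",
)
print(json.dumps([hive_entry], indent=2))
```

Since these values include credentials, restrict who can view or edit the Application configuration, in line with the warning above.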