Advanced Configuration
Iceberg Catalog in Apache Spark¶
- SSH into the instance where Privacera Manager is installed.
- Run the following command to navigate to the /config directory.
- Run the following command to open the .yml file to be edited.
- Modify the following properties:
Variable | Definition |
---|---|
EMR_SPARK_ICEBERG_ENABLE | Set to true to enable the Iceberg catalog for EMR Spark. Default: false. |
EMR_SPARK_ICEBERG_CATALOG_TYPE | Set the Iceberg catalog type. Default: hadoop. Supported types: hadoop, glue. |
EMR_SPARK_ICEBERG_CATALOG_NAME | Set the Iceberg catalog name. Default: hadoop_catalog. Supported names: hadoop_catalog, glue_catalog. |
EMR_SPARK_ICEBERG_CATALOG_WAREHOUSE_LOCATION | Set the location for the Iceberg catalog warehouse. |
- To use Iceberg with the Hadoop catalog, set the catalog type to hadoop and the catalog name to hadoop_catalog in the vars.emr.yml file.
- To use Iceberg with Glue as the Metastore, set the catalog type to glue and the catalog name to glue_catalog in the vars.emr.yml file.
- Once the properties are configured, update your Privacera Manager platform instance by following the Quickstart guide.
- After the post-install, create a new cluster using the newly generated emr-template.json file from the output directory.
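The pairing rule described above (hadoop with hadoop_catalog, glue with glue_catalog) can be sketched as a small shell helper that prints the four properties for review before they are copied into vars.emr.yml. This is only an illustration under stated assumptions: the function name iceberg_vars, the quoted "NAME: value" layout, and the s3://example-bucket/iceberg-warehouse location are hypothetical and are not Privacera-provided tooling or values.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Privacera Manager): prints the Iceberg
# properties for one of the two supported catalog types.
iceberg_vars() {
  local catalog_type="$1" warehouse="$2"
  case "$catalog_type" in
    hadoop|glue) ;;  # the only catalog types listed as supported
    *) echo "unsupported catalog type: ${catalog_type}" >&2; return 1 ;;
  esac
  # The catalog name follows the type: hadoop_catalog or glue_catalog.
  # The quoted "NAME: value" layout is an assumption about the vars file.
  cat <<EOF
EMR_SPARK_ICEBERG_ENABLE: "true"
EMR_SPARK_ICEBERG_CATALOG_TYPE: "${catalog_type}"
EMR_SPARK_ICEBERG_CATALOG_NAME: "${catalog_type}_catalog"
EMR_SPARK_ICEBERG_CATALOG_WAREHOUSE_LOCATION: "${warehouse}"
EOF
}

# Example: Glue as the Metastore, with a placeholder warehouse location.
iceberg_vars glue "s3://example-bucket/iceberg-warehouse"
```

Rejecting any type other than hadoop or glue keeps a typo in the catalog type from producing a mismatched catalog name.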
Configuring Iceberg with Hadoop¶
To configure Iceberg with Hadoop, update the spark-defaults configuration in the EMR template as shown below. Then, create a new EMR cluster with this template:
privacera-emr-spark-defaults
Configuring Iceberg with Glue as Metastore¶
To configure Iceberg with Glue as Metastore, update the spark-defaults configuration in the EMR template as shown below. Then, create a new EMR cluster with this template:
privacera-emr-spark-defaults-glue
JWT Auth Configuration¶
By default, Privacera uses the Kerberized user for authorization. JWT (JSON Web Token) integration is also supported; it uses the user/group from the JWT payload instead of the Kerberized login user.
Here are the steps to configure JWT integration:
- SSH to the instance where Privacera is installed.
- Run the following command to navigate to the /config directory.
- Run the following command to open the .yml file to be edited.
- Add the following property to enable JWT authentication:
- Once the properties are configured, run the following commands to update your Privacera Manager platform instance:
- After the post-install, create a new cluster with the newly generated emr-template.json file from the output directory.
Note
- JWT Auth Configuration is supported only for Spark OLAC.
Add EMR_JWT_OAUTH_ENABLE in the EMR bootstrap action script to enable JWT authentication.
privacera-emr-bootstrap-action-spark_olac
External Hive Metastore (EHM)¶
- SSH to the instance where Privacera is installed.
- Run the following command to navigate to the /config directory.
- Run the following command to open the .yml file to be edited.
- Modify the following properties:
Variable | Definition |
---|---|
EMR_HIVE_METASTORE | Set to 'hive' to enable External Hive Metastore |
EMR_HIVE_METASTORE_CONNECTION_URL | Set the JDBC Connection URL (ex: jdbc:mysql:// |
EMR_HIVE_METASTORE_CONNECTION_DRIVER | Set JDBC Driver Name (ex: "org.mariadb.jdbc.Driver") |
EMR_HIVE_METASTORE_CONNECTION_USERNAME | Set the JDBC username |
EMR_HIVE_METASTORE_CONNECTION_PASSWORD | Set the JDBC password |
- Once the properties are configured, run the following commands to update your Privacera Manager platform instance:
- After the post-install, create a new cluster with the newly generated emr-template.json file from the output directory.
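Before updating the platform instance, it can help to confirm that all five variables from the table above are present in the edited file. The check below is a minimal sketch, not Privacera tooling: the function name missing_ehm_vars is hypothetical, and it assumes each property starts a line in "NAME: value" form.

```shell
#!/usr/bin/env bash
# Hypothetical check (not part of Privacera Manager): lists the External Hive
# Metastore variables that a vars file does not define yet.
missing_ehm_vars() {
  local vars_file="$1" name
  for name in EMR_HIVE_METASTORE \
              EMR_HIVE_METASTORE_CONNECTION_URL \
              EMR_HIVE_METASTORE_CONNECTION_DRIVER \
              EMR_HIVE_METASTORE_CONNECTION_USERNAME \
              EMR_HIVE_METASTORE_CONNECTION_PASSWORD; do
    # Assumes each property sits at the start of a line as "NAME: value".
    grep -Eq "^${name}[[:space:]]*:" "$vars_file" || echo "$name"
  done
}

# Demo on a scratch file that defines only two of the five variables.
scratch="$(mktemp)"
printf '%s\n' \
  "EMR_HIVE_METASTORE: 'hive'" \
  'EMR_HIVE_METASTORE_CONNECTION_DRIVER: "org.mariadb.jdbc.Driver"' > "$scratch"
missing_ehm_vars "$scratch"   # prints the URL, USERNAME, and PASSWORD names
rm -f "$scratch"
```

To check a real file, pass its path as the only argument; empty output means all five names were found.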
Update the hive-site configuration in the EMR template as shown below, then create a new EMR cluster with this template:
privacera-emr-hive-site
Configure JWT in Hive Metastore (HMS) for OLAC¶
- To configure the JWT token in Hive Metastore (HMS) for OLAC, add the following EMR step to the emr-template.json file. This step will update the JWT token on the HMS server located on the EMR Master Node.
- Create and upload the update_jwt_token_in_hms.sh script to S3:
  - To update the JWT token in the Hive Metastore (HMS), create a script named update_jwt_token_in_hms.sh with the content provided below, and upload it to the specified S3 location.
Multiple Master Nodes (MMN)¶
- SSH to the instance where Privacera is installed.
- Run the following command to navigate to the /config directory.
- Run the following command to open the .yml file to be edited.
- Add the following property and set the count to '3' to enable multiple master nodes:
- Once the properties are configured, run the following commands to update your Privacera Manager platform instance:
- After the post-install, create a new cluster with the newly generated emr-template.json file from the output directory.
Enable Access to Requester Pays Buckets¶
- To enable access to Requester Pays buckets, include the following property in the EMR cluster's Spark configuration:
- For Spark OLAC, here are the steps to configure Requester Pays in the Dataserver:
Ignore S3 Objects from Privacera Access Check¶
By default, the Dataserver is used to perform access control on all objects. However, if you want to exclude certain objects or entire buckets from Privacera access checks and access them directly through the IAM role attached to the EMR node, you can use one of the methods outlined below:
Note
- The property accepts a comma-separated list of paths to be excluded from Privacera Access Check.
- Paths to be ignored support all S3 file protocols, such as s3://, s3a://, and s3n://.
- Ensure that the attached IAM role has the necessary permissions to access the specified paths.
- The property supports the wildcard character * in both the bucket name and object path.
EMR Bootstrap Action¶
- The Bootstrap Action for configuring the ignore paths requires a script file that must be uploaded to an S3 location accessible through the IAM role attached to the EMR nodes.
- Use this approach when you enable event logs in the EMR cluster and want to exclude the event logs bucket, or when you want to exclude certain common buckets and objects from Privacera Access Check.
- For the event log bucket, ensure that you ignore the entire bucket, not just the specific path where the event logs are stored.
Follow the steps below to create the script file and add the Bootstrap Action in the EMR template file:
- Create a script file named privacera_update_spark_ignore_path.sh with the content provided below at an S3 location:
- Update the EMR template file with the Bootstrap Action given below. Replace the <BUCKET>, <PATH_TO_SCRIPT>, and <IGNORE_PATHS_SEPERATED_BY_COMMAS> placeholders with the appropriate values.
The above bootstrap action will update the ignore paths in the privacera_spark_custom.properties file on both the Master and Executor nodes.
Spark Configuration¶
- If you want to ignore objects that are specific to certain use cases and are not included in the privacera_spark_custom.properties file, you can use the spark.hadoop.privacera.olac.extra.ignore.paths property.
- The above property must be configured either in the spark-defaults.conf file of the Master node or using the --conf option.
- If the property is configured in both locations, the value passed via the --conf option will take precedence.
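As a rough sketch of the --conf route, the snippet below joins a few paths into the comma-separated value the property expects and prints the resulting spark-submit command instead of running it. The bucket names, the example_job.py script, and the join_ignore_paths helper are hypothetical; using a trailing /* to cover a whole bucket is also an assumption based on the wildcard note above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: joins its arguments with commas, the list format the
# ignore-paths property accepts.
join_ignore_paths() {
  local IFS=,
  printf '%s\n' "$*"
}

# Placeholder bucket names; replace with paths the attached IAM role can access.
IGNORE_PATHS="$(join_ignore_paths "s3://example-event-logs/*" "s3a://example-shared/tmp/*")"

# Printed rather than executed, so the sketch is safe to run anywhere.
echo spark-submit \
  --conf "spark.hadoop.privacera.olac.extra.ignore.paths=${IGNORE_PATHS}" \
  example_job.py
```

The same value could instead go on a spark.hadoop.privacera.olac.extra.ignore.paths line in spark-defaults.conf; a value passed with --conf would still win when both are set.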
Delta Lake Configuration¶
- SSH into the instance where Privacera Manager is installed.
- Run the following command to navigate to the /config directory.
- Run the following command to open the .yml file for editing:
- Modify the following properties:
Variable | Definition |
---|---|
EMR_SPARK_DELTA_LAKE_ENABLE | Set to true to enable Delta Lake support for EMR Spark. Default: false. |
EMR_SPARK_DELTA_LAKE_CORE_JAR_DOWNLOAD_URL | Set the URL to download the delta-core jar. |
EMR_SPARK_DELTA_LAKE_STORAGE_JAR_DOWNLOAD_URL | Set the URL to download the delta-storage jar. |
- Refer to the Delta Lake compatibility page to check the Delta Lake versions and their compatible Apache Spark versions.
- Once the properties are configured, update your Privacera Manager platform instance by following the Quickstart guide.
- After the post-install, create a new cluster using the newly generated emr-template.json file from the output directory.
- To enable Delta Lake support for EMR Spark, update the BootstrapActions configuration in the EMR template as shown below. Then, create a new EMR cluster with this template:
privacera-emr-bootstrap-actions-delta-lake
Update the <delta_lake_core_jar_download_url> and <delta_lake_storage_jar_download_url> placeholders with the appropriate values.
- Refer to the Delta Lake compatibility page to check the Delta Lake versions and their compatible Apache Spark versions.
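One possible way to fill in the two jar download URLs is to point them at Maven Central. The sketch below builds both URLs from a Scala version and a Delta Lake version. It is an assumption that Maven Central is an acceptable source in your environment, the delta_jar_urls helper is hypothetical, and the artifact names apply to Delta Lake 1.2 through 2.x only (from 3.0 the core artifact is published as delta-spark). The version 2.4.0 is only an example; pick the version from the compatibility page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the two Delta Lake properties with Maven Central
# download URLs. Assumes Delta Lake 1.2 to 2.x artifact names
# (delta-core_<scala version> and delta-storage).
delta_jar_urls() {
  local scala_version="$1" delta_version="$2"
  local base="https://repo1.maven.org/maven2/io/delta"
  echo "EMR_SPARK_DELTA_LAKE_CORE_JAR_DOWNLOAD_URL: \"${base}/delta-core_${scala_version}/${delta_version}/delta-core_${scala_version}-${delta_version}.jar\""
  echo "EMR_SPARK_DELTA_LAKE_STORAGE_JAR_DOWNLOAD_URL: \"${base}/delta-storage/${delta_version}/delta-storage-${delta_version}.jar\""
}

# Example versions only; choose ones that match your cluster's Spark version.
delta_jar_urls 2.12 2.4.0
```

The same two URLs can be substituted for the <delta_lake_core_jar_download_url> and <delta_lake_storage_jar_download_url> placeholders when editing the EMR template directly.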
EBS Root Volume Size and Auto Termination Policy¶
- SSH into the instance where Privacera Manager is installed.
- Run the following command to navigate to the /config directory.
- Run the following command to open the .yml file to be edited.
- To increase the EBS root volume size, modify the following property:
- To set the auto-termination policy, modify the following property:
- Once the properties are configured, update your Privacera Manager platform instance by following the Quickstart guide.
- After the post-install, create a new cluster using the newly generated emr-template.json file from the output directory.
- To increase the EBS root volume size and set the auto-termination policy, update the resources configuration in the EMR template as shown below. Then, create a new EMR cluster with this template:
Encrypt Sensitive Data in Signer Request and Response Payload¶
- SSH into the instance where Privacera Manager is installed.
- Run the following command to navigate to the /config directory.
- Run the following command to open the .yml file for editing.
- Update the following property to redact sensitive data in debug logs at the root level:
- Once the properties are configured, update your Privacera Manager platform instance by following the Quickstart guide.
- After the post-install, create a new cluster using the newly generated emr-template.json file from the output directory.