Advanced Configuration

Iceberg Catalog in Apache Spark

  1. SSH into the instance where Privacera Manager is installed.

  2. Run the following command to navigate to the /config directory.

    Bash
    cd ~/privacera/privacera-manager/config
    

  3. Run the following command to open the .yml file to be edited.

    Bash
    vi custom-vars/vars.emr.yml
    

  4. Modify the following properties:

    Variable Definition
    EMR_SPARK_ICEBERG_ENABLE Set to true to enable the Iceberg catalog for EMR Spark. Default: false.
    EMR_SPARK_ICEBERG_CATALOG_TYPE Set the Iceberg catalog type. Default: hadoop. Supported types: hadoop, glue.
    EMR_SPARK_ICEBERG_CATALOG_NAME Set the Iceberg catalog name. Default: hadoop_catalog. Supported names: hadoop_catalog, glue_catalog.
    EMR_SPARK_ICEBERG_CATALOG_WAREHOUSE_LOCATION Set the location for the Iceberg catalog warehouse.
  5. To use Iceberg with the Hadoop catalog, update the catalog type to hadoop and the catalog name to hadoop_catalog in the vars.emr.yml file.

  6. To use Iceberg with Glue as the Metastore, update the catalog type to glue and the catalog name to glue_catalog in the vars.emr.yml file.

  7. Once the properties are configured, update your Privacera Manager platform instance by following the Quickstart guide.

  8. After the post-install, create a new cluster using the newly generated emr-template.json file from the output directory.
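
As a rough illustration of steps 4 to 6, the sketch below appends a Glue-backed Iceberg configuration to a vars file. The `VARS_FILE` variable and its scratch-file default are assumptions made for this example, and the warehouse location is left as a placeholder. Point `VARS_FILE` at custom-vars/vars.emr.yml only after reviewing the values.

```shell
#!/bin/bash
# Sketch only: writes example Iceberg settings for the Glue catalog.
# VARS_FILE defaults to a scratch file so that nothing real is modified (assumption).
VARS_FILE="${VARS_FILE:-$(mktemp)}"

cat >> "${VARS_FILE}" <<'EOF'
EMR_SPARK_ICEBERG_ENABLE: "true"
EMR_SPARK_ICEBERG_CATALOG_TYPE: "glue"
EMR_SPARK_ICEBERG_CATALOG_NAME: "glue_catalog"
# Placeholder: replace with your own warehouse location.
EMR_SPARK_ICEBERG_CATALOG_WAREHOUSE_LOCATION: "<PLEASE_UPDATE>"
EOF

echo "Wrote Iceberg settings to ${VARS_FILE}:"
cat "${VARS_FILE}"
```

For the Hadoop catalog, the same sketch would use hadoop and hadoop_catalog for the type and name.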

Configuring Iceberg with Hadoop

To configure Iceberg with Hadoop, update the spark-defaults configuration in the EMR template as shown below. Then create a new EMR cluster with this template:

privacera-emr-spark-defaults
JSON
{
  "Classification": "spark-defaults",
  "ConfigurationProperties": {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.hadoop_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.hadoop_catalog.warehouse": "<PLEASE_UPDATE>",
    "spark.sql.catalog.hadoop_catalog.type": "hadoop"
  }
}
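
Once a cluster is running with this configuration, a quick smoke test might look like the sketch below. The catalog name comes from the spark-defaults above. The namespace `demo_db` and table `sample` are hypothetical names chosen for illustration. The script runs the statements only when `spark-sql` is on the PATH, and otherwise prints them.

```shell
#!/bin/bash
# Catalog name matches the spark-defaults above; namespace and table are hypothetical.
catalog="hadoop_catalog"
table="${catalog}.demo_db.sample"

create_sql="CREATE TABLE IF NOT EXISTS ${table} (id BIGINT, data STRING) USING iceberg"
insert_sql="INSERT INTO ${table} VALUES (1, 'a'), (2, 'b')"
select_sql="SELECT * FROM ${table}"

if command -v spark-sql >/dev/null 2>&1; then
  # On a cluster node: run the statements against the Iceberg catalog.
  spark-sql -e "${create_sql}; ${insert_sql}; ${select_sql}"
else
  # Not on a cluster node: just print the statements.
  printf '%s;\n' "${create_sql}" "${insert_sql}" "${select_sql}"
fi
```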

Configuring Iceberg with Glue as Metastore

To configure Iceberg with Glue as the Metastore, update the spark-defaults configuration in the EMR template as shown below. Then create a new EMR cluster with this template:

privacera-emr-spark-defaults-glue
JSON
{
  "Classification": "spark-defaults",
  "ConfigurationProperties": {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog.warehouse": "<PLEASE_UPDATE>",
    "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.glue_catalog.s3.client-factory-impl": "com.privacera.iceberg.aws.s3.PrivaceraAwsClientFactory",
    "spark.jars": "/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar"
  }
}

JWT Auth Configuration

By default, Privacera uses the Kerberized user for authorization. Privacera also supports JWT (JSON Web Token) integration, which uses the user and group from the JWT payload instead of the Kerberized login user.
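
To see which user and group values a token carries, it can help to decode its payload before using it. The sketch below is a generic JWT payload decoder, not a Privacera tool. The sample token is unsigned, and the claim names `sub` and `groups` are hypothetical, because the claims that Privacera reads are not stated here.

```shell
#!/bin/bash
# Decode the payload (second dot-separated segment) of a JWT.
decode_jwt_payload() {
  local segment
  segment=$(printf '%s' "$1" | cut -d '.' -f 2 | tr '_-' '/+')
  # base64url omits padding; restore it so that base64 -d accepts the input.
  while [ $(( ${#segment} % 4 )) -ne 0 ]; do
    segment="${segment}="
  done
  printf '%s' "${segment}" | base64 -d
}

# Encode stdin as unpadded base64url.
b64url() { base64 | tr -d '=\n' | tr '/+' '_-'; }

# Build an unsigned sample token for the demo (claims are hypothetical).
header=$(printf '%s' '{"alg":"none"}' | b64url)
payload=$(printf '%s' '{"sub":"alice","groups":["analysts"]}' | b64url)
sample_token="${header}.${payload}.signature"

decode_jwt_payload "${sample_token}"
echo
```

Decoding does not verify the signature, so treat the output as informational only, and avoid pasting real tokens into shared logs.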

Follow these steps to configure JWT integration:

  1. SSH to the instance where Privacera is installed.

  2. Run the following command to navigate to the /config directory.

    Bash
    cd ~/privacera/privacera-manager/config
    

  3. Run the following command to open the .yml file to be edited.

    Bash
    vi custom-vars/vars.emr.yml
    

  4. Add the following property to enable JWT authentication:

    Bash
    EMR_JWT_OAUTH_ENABLE: true
    

  5. Once the properties are configured, run the following commands to update your Privacera Manager platform instance:

    Bash
    cd ~/privacera/privacera-manager
    
    # Step 1: run setup to generate the Helm charts.
    ./privacera-manager.sh setup
    
    # Step 2: install or upgrade the Privacera Manager Helm charts.
    # Pass "install" for a first-time deployment or "upgrade" for an existing one.
    ./pm_with_helm.sh upgrade
    
    # Step 3: run post-install to generate the plugin configurations.
    ./privacera-manager.sh post-install
    

  6. After the post-install, create a new cluster using the newly generated emr-template.json file from the output directory.

Note

  • JWT Auth Configuration is supported only for Spark OLAC.

Add EMR_JWT_OAUTH_ENABLE to the EMR bootstrap action script to enable JWT authentication.

privacera-emr-bootstrap-action-spark_olac
JSON
"BootstrapActions": [
{
  "Name": "Install Privacera Plugins on Master Node",
  "ScriptBootstrapAction": {
    "Path": "s3://elasticmapreduce/bootstrap-actions/run-if",
    "Args": [
      {
        "Fn::Sub": "instance.isMaster=true"
      },
      {
        "Fn::Sub": "export EMR_JWT_OAUTH_ENABLE=true ; wget ${PrivaceraDownloadUrl}/privacera_emr.sh ; chmod +x ./privacera_emr.sh ; sudo -E ./privacera_emr.sh spark-olac"
      }
    ]
  }
},
{
  "Name": "Install Spark OLAC in Core Node",
  "ScriptBootstrapAction": {
    "Path": "s3://elasticmapreduce/bootstrap-actions/run-if",
    "Args": [
      {
        "Fn::Sub": "instance.isMaster=false"
      },
      {
        "Fn::Sub": "export EMR_JWT_OAUTH_ENABLE=true ; wget ${PrivaceraDownloadUrl}/privacera_emr.sh ; chmod +x ./privacera_emr.sh ; sudo -E ./privacera_emr.sh spark-olac"
      }
    ]
  }
}
]

External Hive Metastore (EHM)

  1. SSH to the instance where Privacera is installed.

  2. Run the following command to navigate to the /config directory.

    Bash
    cd ~/privacera/privacera-manager/config
    

  3. Run the following command to open the .yml file to be edited.

    Bash
    vi custom-vars/vars.emr.yml
    

  4. Modify the following properties:

    Variable Definition
    EMR_HIVE_METASTORE Set to 'hive' to enable the External Hive Metastore.
    EMR_HIVE_METASTORE_CONNECTION_URL Set the JDBC connection URL (for example: jdbc:mysql://:3306/?createDatabaseIfNotExist=true).
    EMR_HIVE_METASTORE_CONNECTION_DRIVER Set the JDBC driver name (for example: "org.mariadb.jdbc.Driver").
    EMR_HIVE_METASTORE_CONNECTION_USERNAME Set the JDBC username.
    EMR_HIVE_METASTORE_CONNECTION_PASSWORD Set the JDBC password.
  5. Once the properties are configured, run the following commands to update your Privacera Manager platform instance:

    Bash
    cd ~/privacera/privacera-manager
    
    # Step 1: run setup to generate the Helm charts.
    ./privacera-manager.sh setup
    
    # Step 2: install or upgrade the Privacera Manager Helm charts.
    # Pass "install" for a first-time deployment or "upgrade" for an existing one.
    ./pm_with_helm.sh upgrade
    
    # Step 3: run post-install to generate the plugin configurations.
    ./privacera-manager.sh post-install
    

  6. After the post-install, create a new cluster using the newly generated emr-template.json file from the output directory.

Update the hive-site configuration in the EMR template as shown below, then create a new EMR cluster with this template.

privacera-emr-hive-site
JSON
{
  "Classification": "hive-site",
  "ConfigurationProperties": {
    "javax.jdo.option.ConnectionURL": "<jdbc-connection-url>",
    "javax.jdo.option.ConnectionDriverName": "<jdbc-driver>",
    "javax.jdo.option.ConnectionUserName": "<jdbc-username>",
    "javax.jdo.option.ConnectionPassword": "<jdbc-password>",
    "hive.server2.enable.doAs": "false",
    "parquet.column.index.access": "true",
    "fs.s3a.impl": "com.amazon.ws.emr.hadoop.fs.EmrFileSystem",
    "hive.metastore.warehouse.dir": {
      "Ref": "<hive-metastore-s3path>"
    }
  }
}
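
Before running setup, it can be worth confirming that all five External Hive Metastore variables from step 4 are present and uncommented. The following is a minimal sketch of such a check. The function name `check_ehm_vars` and the `VARS_FILE` override are hypothetical helpers, not part of Privacera Manager.

```shell
#!/bin/bash
# Report which External Hive Metastore variables are missing from a vars file.
# Returns the number of missing variables (0 means all are set).
check_ehm_vars() {
  local vars_file="$1" missing=0 name
  for name in EMR_HIVE_METASTORE \
              EMR_HIVE_METASTORE_CONNECTION_URL \
              EMR_HIVE_METASTORE_CONNECTION_DRIVER \
              EMR_HIVE_METASTORE_CONNECTION_USERNAME \
              EMR_HIVE_METASTORE_CONNECTION_PASSWORD; do
    # Match only uncommented "NAME:" entries; lines starting with "#" are skipped.
    if ! grep -q "^[[:space:]]*${name}:" "${vars_file}"; then
      echo "missing: ${name}"
      missing=$((missing + 1))
    fi
  done
  return "${missing}"
}

# Run from the config directory; VARS_FILE is a hypothetical override.
vars_file="${VARS_FILE:-custom-vars/vars.emr.yml}"
if [ -f "${vars_file}" ]; then
  if check_ehm_vars "${vars_file}"; then
    echo "All External Hive Metastore variables are set."
  fi
else
  echo "Vars file not found: ${vars_file}"
fi
```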

Configure JWT in Hive Metastore (HMS) for OLAC

  • To configure the JWT token in Hive Metastore (HMS) for OLAC, add the following EMR step to the emr-template.json file. This step updates the JWT token on the HMS server located on the EMR Master Node.
    Bash
    "ConfigureJWTinHMS": {
      "Type": "AWS::EMR::Step",
      "Properties": {
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
          "Args": [
            {
              "Fn::Sub": "s3://<path_to_your_file>/update_jwt_token_in_hms.sh"
            },
            {
              "Fn::Sub":"<UPDATE_JWT_TOKEN>"
            }
          ],
          "Jar": {
            "Fn::Sub": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar"
          }
        },
        "Name": "ConfigureJWTinHMS",
        "JobFlowId": {
          "Ref": "EMRCLUSTER"
        }
      }
    }
    
  • Create and upload update_jwt_token_in_hms.sh script to S3:
    • To update the JWT token in the Hive Metastore (HMS), create a script named update_jwt_token_in_hms.sh with the content provided below, and upload it to the specified S3 location.
      Bash
      #!/bin/bash
      set -x
      
      # Check if an argument is provided
      if [ "$#" -eq 0 ]; then
        echo "No argument provided. Usage: ./update_jwt_token_in_hms.sh <jwt-token>"
        exit 1
      fi
      
      export hcat_hive_site=/etc/hive-hcatalog/hcat-conf/hive-site.xml
      export jwt_token=${1}
      
      echo "Adding jwt token ${jwt_token} in ${hcat_hive_site}"
      
      restart_hive_hcatalog() {
        if systemctl is-active --quiet hive-hcatalog-server; then
          echo "hive-hcatalog-server is running. Restarting the service now..."
          sudo systemctl stop hive-hcatalog-server
          sudo systemctl start hive-hcatalog-server
        else
          echo "hive-hcatalog-server is not running."
        fi
      }
      
      update_jwt_token() {
        sudo sed -i "s/<\/configuration>//g" ${hcat_hive_site}
      sudo -E bash -c 'cat <<EOF >>${hcat_hive_site}
      <property>
        <name>privacera.ds.jwt.auth.token</name>
        <value>${jwt_token}</value>
      </property>
      
      </configuration>
      EOF'
      }
      
      update_jwt_token
      restart_hive_hcatalog
      

Multiple Master Nodes (MMN)

  1. SSH to the instance where Privacera is installed.

  2. Run the following command to navigate to the /config directory.

    Bash
    cd ~/privacera/privacera-manager/config
    

  3. Run the following command to open the .yml file to be edited.

    Bash
    vi custom-vars/vars.emr.yml
    

  4. Add the following property and set the count to 3 to enable multiple master nodes:

    Bash
    EMR_MASTER_NODE_COUNT: 3
    

  5. Once the properties are configured, run the following commands to update your Privacera Manager platform instance:

    Bash
    cd ~/privacera/privacera-manager
    
    # Step 1: run setup to generate the Helm charts.
    ./privacera-manager.sh setup
    
    # Step 2: install or upgrade the Privacera Manager Helm charts.
    # Pass "install" for a first-time deployment or "upgrade" for an existing one.
    ./pm_with_helm.sh upgrade
    
    # Step 3: run post-install to generate the plugin configurations.
    ./privacera-manager.sh post-install
    

  6. After the post-install, create a new cluster using the newly generated emr-template.json file from the output directory.

Update the instance group configuration in the EMR template as shown below, then create a new EMR cluster with this template.

privacera-emr-config
JSON
"CoreInstanceGroup": {
  "Name": "Core Instance Group",
  "InstanceCount": 3,
  "InstanceType": "<instance-type>"
}

Enable Access to Requester Pays Buckets

  • To enable access to Requester Pays buckets, include the following property in the EMR cluster's Spark configuration:
    JSON
    {
      "Classification": "spark-defaults",
      "Properties": {
        "spark.hadoop.fs.s3a.requester-pays.enabled": "true"
      }
    }
    
  • For Spark OLAC, also configure Requester Pays in the DataServer.

Ignore S3 Objects from Privacera Access Check

By default, the Dataserver is used to perform access control on all objects. However, if you want to exclude certain objects or entire buckets from Privacera access checks and access them directly through the IAM role attached to the EMR node, you can use one of the methods outlined below:

Note

  • The property accepts a comma-separated list of paths to be excluded from Privacera Access Check.
  • Paths to be ignored support all S3 URI schemes, such as s3://, s3a://, and s3n://.
  • Ensure that the attached IAM role has the necessary permissions to access the specified paths.
  • The property supports the wildcard character * in both the bucket name and object path.

EMR Bootstrap Action

  • The Bootstrap Action for configuring the ignore paths requires a script file that must be uploaded to an S3 location accessible through the IAM role attached to the EMR nodes.
  • This approach should be used when enabling event logs in the EMR cluster and you want to exclude the event logs bucket, or when you wish to ignore certain common buckets and objects from Privacera Access Check.
  • For the event log bucket, ensure that you ignore the entire bucket, not just the specific path where the event logs are stored.

Follow the steps below to create the Script file and add the Bootstrap Action in the EMR template file:

  1. Create a Script file named privacera_update_spark_ignore_path.sh with the content provided below at an S3 location:

    Bash
    #!/bin/bash
    
    # Check if an argument is provided
    if [ "$#" -eq 0 ]; then
      echo "No argument provided. Usage: ./privacera_update_spark_ignore_path.sh <comma_separated_ignore_s3_uri>"
      exit 1
    fi
    
    
    ignore_path=${1}
    echo "Comma separated ignore path: ${ignore_path}"
    
    
    priv_spark_conf_dir="/opt/privacera/plugin/privacera-spark-plugin/spark-conf"
    priv_spark_conf_file="${priv_spark_conf_dir}/privacera_spark_custom.properties"
    
    
    echo "Creating ${priv_spark_conf_dir}"
    sudo mkdir -p ${priv_spark_conf_dir}
    
    
    echo "Creating ${priv_spark_conf_file}"
    sudo touch ${priv_spark_conf_file}
    sudo chown hadoop:hadoop ${priv_spark_conf_file}
    
    
    echo "Updating ignore path '${ignore_path}' into '${priv_spark_conf_file}'"
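    # Append through tee: with "sudo echo ... >> file" the redirect runs without root privileges.
    echo "privacera.olac.ignore.paths=${ignore_path}" | sudo tee -a "${priv_spark_conf_file}" > /dev/null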
    

  2. Update the EMR template file with the following bootstrap action:

Replace <BUCKET>, <PATH_TO_SCRIPT>, and <IGNORE_PATHS_SEPERATED_BY_COMMAS> placeholders with the appropriate values.

JSON
{
  "Name": "Update Spark Ignore Path",
  "ScriptBootstrapAction": {
    "Path": {
      "Fn::Sub": "s3://<BUCKET>/<PATH_TO_SCRIPT>/privacera_update_spark_ignore_path.sh"
    },
    "Args": [
      "<IGNORE_PATHS_SEPERATED_BY_COMMAS>"
    ]
  }
}

  3. Save the EMR template and create the EMR cluster. The bootstrap action runs during cluster creation and adds the ignore paths to the privacera_spark_custom.properties file.

The bootstrap action above updates the ignore paths on both the Master and Executor nodes.

Spark Configuration

  • If you want to ignore objects that are specific to certain use cases and are not included in the privacera_spark_custom.properties file, you can use the spark.hadoop.privacera.olac.extra.ignore.paths property.

  • The above property must be configured either in the spark-defaults.conf file of the Master node or with the --conf option.

  • If the property is configured in both locations, the value passed via the --conf option will take precedence.
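
As an illustration of the --conf route, the sketch below assembles a `spark-submit` command that passes job-specific ignore paths. The bucket paths and the application file `my_job.py` are hypothetical. The command is printed rather than executed so that the sketch is safe to run anywhere.

```shell
#!/bin/bash
# Hypothetical paths to skip for this job only; wildcards are allowed per the note above.
extra_ignore_paths="s3://<BUCKET>/tmp/*,s3a://<BUCKET>/scratch/*"

# my_job.py is a hypothetical application file.
submit_args=(
  --conf "spark.hadoop.privacera.olac.extra.ignore.paths=${extra_ignore_paths}"
  my_job.py
)

# Print the command instead of running it.
echo spark-submit "${submit_args[@]}"
```

Because a value passed with --conf takes precedence, this suits one-off jobs, while spark-defaults.conf suits cluster-wide defaults.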

Delta Lake Configuration

  1. SSH into the instance where Privacera Manager is installed.

  2. Run the following command to navigate to the /config directory.

    Bash
    cd ~/privacera/privacera-manager/config
    

  3. Run the following command to open the .yml file for editing:

    Bash
    vi custom-vars/vars.emr.yml
    

  4. Modify the following properties:

    Variable Definition
    EMR_SPARK_DELTA_LAKE_ENABLE Set to true to enable Delta Lake support for EMR Spark. Default: false.
    EMR_SPARK_DELTA_LAKE_CORE_JAR_DOWNLOAD_URL Set the URL to download the delta-core jar.
    EMR_SPARK_DELTA_LAKE_STORAGE_JAR_DOWNLOAD_URL Set the URL to download the delta-storage jar.
  5. Refer to the Delta Lake compatibility page to check the Delta Lake versions and their compatible Apache Spark versions.

  6. Once the properties are configured, update your Privacera Manager platform instance by following the Quickstart guide.

  7. After the post-install, create a new cluster using the newly generated emr-template.json file from the output directory.

  • To enable Delta Lake support for EMR Spark, update the BootstrapActions configuration in the EMR template as shown below. Then create a new EMR cluster with this template:
privacera-emr-bootstrap-actions-delta-lake

Update <delta_lake_core_jar_download_url>, <delta_lake_storage_jar_download_url> placeholders with the appropriate values.

JSON
"BootstrapActions":[
  {
    "Name":"Install Spark OLAC in Master Node",
    "ScriptBootstrapAction":{
      "Path":"s3://elasticmapreduce/bootstrap-actions/run-if",
      "Args":[
        {
          "Fn::Sub":"instance.isMaster=true"
        },
        {
          "Fn::Sub":"export SPARK_DELTA_LAKE_ENABLE=enable-spark-deltalake ; export SPARK_DELTA_LAKE_CORE_JAR_DOWNLOAD_URL=<delta_lake_core_jar_download_url>; export SPARK_DELTA_LAKE_STORAGE_JAR_DOWNLOAD_URL=<delta_lake_storage_jar_download_url>; wget ${PrivaceraDownloadUrl}/privacera_emr.sh ; chmod +x ./privacera_emr.sh ; sudo -E ./privacera_emr.sh spark-olac"
        }
      ]
    }
  },
  {
    "Name":"Install Spark OLAC in Core Node",
    "ScriptBootstrapAction":{
      "Path":"s3://elasticmapreduce/bootstrap-actions/run-if",
      "Args":[
        {
          "Fn::Sub":"instance.isMaster=false"
        },
        {
          "Fn::Sub":"export SPARK_DELTA_LAKE_ENABLE=enable-spark-deltalake ; export SPARK_DELTA_LAKE_CORE_JAR_DOWNLOAD_URL=<delta_lake_core_jar_download_url>; export SPARK_DELTA_LAKE_STORAGE_JAR_DOWNLOAD_URL=<delta_lake_storage_jar_download_url>; wget ${PrivaceraDownloadUrl}/privacera_emr.sh ; chmod +x ./privacera_emr.sh ; sudo -E ./privacera_emr.sh spark-olac"
        }
      ]
    }
  }
]
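
The two jar download URLs depend on the Delta Lake version you choose. The sketch below shows one way to derive them, assuming the jars are fetched from Maven Central, which this guide does not state. The versions are examples only, so confirm them against the Delta Lake compatibility page for your Spark release.

```shell
#!/bin/bash
# Example versions only; check the Delta Lake compatibility page for your Spark release.
delta_version="2.4.0"
scala_version="2.12"
# Assumption: jars are downloaded from Maven Central.
maven_base="https://repo1.maven.org/maven2/io/delta"

core_url="${maven_base}/delta-core_${scala_version}/${delta_version}/delta-core_${scala_version}-${delta_version}.jar"
storage_url="${maven_base}/delta-storage/${delta_version}/delta-storage-${delta_version}.jar"

# Print the lines in the form used by vars.emr.yml.
echo "EMR_SPARK_DELTA_LAKE_CORE_JAR_DOWNLOAD_URL: \"${core_url}\""
echo "EMR_SPARK_DELTA_LAKE_STORAGE_JAR_DOWNLOAD_URL: \"${storage_url}\""
```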

EBS Root Volume Size and Auto Termination Policy

  • SSH into the instance where Privacera Manager is installed.
  • Run the following command to navigate to the /config directory.
    Bash
    cd ~/privacera/privacera-manager/config
    
  • Run the following command to open the .yml file to be edited.
    Bash
    vi custom-vars/vars.emr.yml
    
  • To increase the EBS root volume size, modify the following property:
    Bash
    # Uncomment the below property to set EBS root volume size. Default: 20
    # EMR_EBS_ROOT_VOLUME_SIZE: "<PLEASE_CHANGE>"
    
  • To set the auto-termination policy, modify the following property:
    Bash
    # Uncomment the below property to set EMR Auto termination time in seconds.
    # EMR_AUTO_TERMINATION_TIMEOUT: "<PLEASE_CHANGE>"
    
  • Once the properties are configured, update your Privacera Manager platform instance by following the Quickstart guide.
  • After the post-install, create a new cluster using the newly generated emr-template.json file from the output directory.
  • To increase the EBS root volume size and set the auto-termination policy, update the resources configuration in the EMR template as shown below. Then create a new EMR cluster with this template:
privacera-emr-resources
JSON
{
  "Resources":{
    "EMRCLUSTER":{
      "Type":"AWS::EMR::Cluster",
      "Properties":{
        "EbsRootVolumeSize":20,
        "AutoTerminationPolicy": {
          "IdleTimeout": 3600
        }
      }
    }
  }
}
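
Because the auto-termination timeout is expressed in seconds, a small conversion helps avoid mistakes. The sketch below converts an idle window in hours (a hypothetical `idle_hours` variable) and checks the result against the 60 to 604800 second range that Amazon EMR documents for idle timeouts.

```shell
#!/bin/bash
# Convert a desired idle window in hours to the seconds value the property expects.
idle_hours=1
idle_seconds=$(( idle_hours * 3600 ))

# Amazon EMR documents an allowed idle timeout of 60 to 604800 seconds (7 days).
if [ "${idle_seconds}" -lt 60 ] || [ "${idle_seconds}" -gt 604800 ]; then
  echo "Idle timeout ${idle_seconds}s is outside the 60-604800 range" >&2
else
  echo "EMR_AUTO_TERMINATION_TIMEOUT: \"${idle_seconds}\""
fi
```

With `idle_hours=1` the result is 3600, which matches the IdleTimeout value in the template above.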

Redact Sensitive Data in Debug Logs at Root Level

  1. SSH into the instance where Privacera Manager is installed.
  2. Run the following command to navigate to the /config directory.
    Bash
    cd ~/privacera/privacera-manager/config
    
  3. Run the following command to open the .yml file for editing.
    Bash
    vi custom-vars/vars.emr.yml
    
  4. Update the following property to redact sensitive data in debug logs at the root level:
    Bash
    # uncomment to encrypt sensitive data in spark-plugin request/response payload. Default: `false`.
    # EMR_SPARK_ENCRYPT_SENSITIVE_PAYLOAD_DATA_ENABLED: "true"
    
  5. Once the properties are configured, update your Privacera Manager platform instance by following the Quickstart guide.
  6. After the post-install, create a new cluster using the newly generated emr-template.json file from the output directory.
