Privacera Platform

AWS EMR

This topic shows how to configure AWS EMR with Privacera using Privacera Manager.

Configuration

  1. SSH to the instance as USER.

  2. Run the following commands.

    cd ~/privacera/privacera-manager
    cp config/sample-vars/vars.emr.yml config/custom-vars/
    vi config/custom-vars/vars.emr.yml
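The cp command in step 2 overwrites any vars.emr.yml that already exists in config/custom-vars. If you may have customized that file earlier, a small wrapper can keep a backup first. This is a minimal sketch: the function name copy_sample_vars and the .bak suffix are illustrative and not part of Privacera Manager.

```shell
# Hypothetical helper: copy a sample vars file into the custom-vars folder,
# keeping a timestamped backup of any copy that is already there.
copy_sample_vars() {
  local sample="$1" dest_dir="$2" dest
  dest="$dest_dir/$(basename "$sample")"
  if [ -f "$dest" ]; then
    # Preserve the earlier customized file before it is overwritten.
    cp "$dest" "$dest.bak.$(date +%Y%m%d%H%M%S)"
  fi
  cp "$sample" "$dest"
}

# Usage, run from ~/privacera/privacera-manager:
#   copy_sample_vars config/sample-vars/vars.emr.yml config/custom-vars
```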
  3. Edit the following properties.

    Property

    Description

    Example

    EMR_ENABLE

    Enable EMR template creation.

    true

    EMR_CLUSTER_NAME

    Define a unique name for the EMR cluster.

    Privacera-EMR

    EMR_CREATE_SG

    Set to true if you have no existing security groups and want Privacera Manager to add security group creation steps to the EMR CloudFormation (CF) template.

    false

    EMR_MASTER_SG_ID

    If EMR_CREATE_SG is false, set this property. Security Group ID for EMR Master Node Group.

    sg-xxxxxxx

    EMR_SLAVE_SG_ID

    If EMR_CREATE_SG is false, set this property. Security Group ID for EMR Slave Node Group.

    sg-xxxxxxx

    EMR_SERVICE_ACCESS_SG_ID

    If EMR_CREATE_SG is false, set this property. Security Group ID for EMR ServiceAccessSecurity. Fill this property only if you are creating EMR in a Private Network.

    sg-xxxxxxx

    EMR_SG_VPC_ID

    If EMR_CREATE_SG is true, set this property. VPC ID in which you want to create the EMR Cluster.

    vpc-xxxxxxxxxxx

    EMR_MASTER_SG_NAME

    If EMR_CREATE_SG is true, set this property. Security Group Name for EMR Master Node Group. The security group name will be added to the emr-template.json.

    priv-master-sg

    EMR_SLAVE_SG_NAME

    If EMR_CREATE_SG is true, set this property. Security Group Name for EMR Slave Node Group. The security group name will be added to the emr-template.json.

    priv-slave-sg

    EMR_SERVICE_ACCESS_SG_NAME

    If EMR_CREATE_SG is true, set this property. Security Group Name for EMR ServiceAccessSecurity. The security group name will be added to the emr-template.json. Fill this property only if you are creating EMR in a Private Network.

    priv-private-sg

    EMR_SUBNET_ID

    Subnet ID

    EMR_KEYPAIR

    An existing EC2 key pair to SSH into the master node of the cluster.

    privacera-test-pair

    EMR_EC2_MARKET_TYPE

    Set market type as SPOT or ON_DEMAND.

    SPOT

    EMR_EC2_INSTANCE_TYPE

    Set the instance type. Instances can be of different types such as m5.xlarge, r5.xlarge and so on.

    m5.large

    EMR_MASTER_NODE_COUNT

    Node count for Master. The number of nodes can be 1, 2 and so on.

    1

    EMR_CORE_NODE_COUNT

    Node count for Core. The number of nodes can be 1, 2 and so on.

    1

    EMR_VERSION

    Version of EMR.

    emr-x.xx.x

    EMR_EC2_DOMAIN

    Domain used by the nodes. It depends on EMR Region, for example, ".ec2.internal" is for us-east-1.

    .ec2.internal

    EMR_USE_STS_REGIONAL_ENDPOINTS

    Set the property to enable/disable regional endpoints for S3 requests.

    Default value is false.

    true

    EMR_TERMINATION_PROTECT

    Set to enable/disable termination protection.

    true

    EMR_LOGS_PATH

    S3 location for storing EMR logs.

    s3://privacera-logs-bucket/

    EMR_KERBEROS_ENABLE

    Set to true if you want to enable kerberization on EMR.

    false

    EMR_KDC_ADMIN_PASSWORD

    If EMR_KERBEROS_ENABLE is true, set this property. The password used within the cluster for the kadmin service.

    EMR_CROSS_REALM_PASSWORD

    If EMR_KERBEROS_ENABLE is true, set this property. The cross-realm trust principal password, which must be identical across realms.

    EMR_SECURITY_CONFIG

    Name of the Security Configurations created for EMR. This can be a pre-created configuration, or Privacera Manager can generate a template through which you can create this configuration.

    EMR_KERB_TICKET_LIFETIME

    Set this property if EMR_KERBEROS_ENABLE is true and you want Privacera Manager to generate a CF template for the security configuration. The period for which a Kerberos ticket issued by the cluster’s KDC is valid. Cluster applications and services auto-renew tickets after they expire.

    24

    EMR_KERB_REALM

    Set this property if EMR_KERBEROS_ENABLE is true and you want Privacera Manager to generate a CF template for the security configuration. The Kerberos realm name for the other realm in the trust relationship.

    EMR_KERB_DOMAIN

    Set this property if EMR_KERBEROS_ENABLE is true and you want Privacera Manager to generate a CF template for the security configuration. The domain name of the other realm in the trust relationship.

    EMR_KERB_ADMIN_SERVER

    Set this property if EMR_KERBEROS_ENABLE is true and you want Privacera Manager to generate a CF template for the security configuration. The fully qualified domain name (FQDN) and an optional port for the Kerberos admin server in the other realm. If a port is not specified, 749 is used.

    EMR_KERB_KDC_SERVER

    Set this property if EMR_KERBEROS_ENABLE is true and you want Privacera Manager to generate a CF template for the security configuration. The fully qualified domain name (FQDN) and an optional port for the KDC in the other realm. If a port is not specified, 88 is used.

    EMR_AWS_ACCT_ID

    AWS Account ID where EMR Cluster resides

    9999999

    EMR_DEFAULT_ROLE

    Default role attached to EMR Cluster for performing cluster-related activities. This should be a pre-created role.

    EMR_DefaultRole

    EMR_ROLE_FOR_CLUSTER_NODES

    IAM role attached to each node in the EMR cluster.

    This role should have only minimal permissions: downloading privacera_cust_conf.zip and basic EMR capabilities. It can be an existing role. If you do not have one, use the IAM role CF template to generate it after the Privacera Manager update.

    restricted_node_role

    EMR_USE_SINGLE_ROLE_FOR_APPS

    If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. Set to true to create a single IAM role that is used by all EMR applications.

    true

    EMR_ROLE_FOR_APPS

    If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. IAM role name used by all EMR applications.

    app_data_access_role

    EMR_ROLE_FOR_SPARK

    If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. To create multiple IAM roles for specific applications, set EMR_USE_SINGLE_ROLE_FOR_APPS to false. IAM role name used by the Spark application (Dataserver) for data access.

    spark_data_access_role

    EMR_ROLE_FOR_HIVE

    If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. IAM Role name which will be used by Hive Application for data access.

    hive_data_access_role

    EMR_ROLE_FOR_PRESTO

    If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. IAM Role name which will be used by Presto Application for data access.

    presto_data_access_role

    EMR_HIVE_METASTORE

    Metastore type, for example "glue" or "hive" (for an external Hive metastore).

    glue

    EMR_HIVE_METASTORE_PATH

    S3 location for hive metastore

    s3://hive-warehouse

    EMR_HIVE_METASTORE_CONNECTION_URL

    If EMR_HIVE_METASTORE is hive, set this property. JDBC Connection URL for connecting to hive.

    jdbc:mysql://<jdbc-host>:3306/<hive-db-name>?createDatabaseIfNotExist=true

    EMR_HIVE_METASTORE_CONNECTION_DRIVER

    If EMR_HIVE_METASTORE is hive, set this property. JDBC Driver Name

    org.mariadb.jdbc.Driver

    EMR_HIVE_METASTORE_CONNECTION_USERNAME

    If EMR_HIVE_METASTORE is hive, set this property. JDBC UserName

    hive

    EMR_HIVE_METASTORE_CONNECTION_PASSWORD

    If EMR_HIVE_METASTORE is hive, set this property. JDBC Password

    StRong@PassW0rd

    EMR_HIVE_SERVICE_NAME

    Custom hive service name for hive application in EMR

    teamA_policy

    EMR_TRINO_HIVE_SERVICE_NAME

    Custom hive service name for trino application in EMR

    teamB_policy

    EMR_SPARK_HIVE_SERVICE_NAME

    Custom hive access service name for spark applications in EMR

    teamC_policy

    EMR_APP_SPARK_OLAC_ENABLE

    To install the Spark application with the Privacera plugin, set the property to true. OLAC stands for Object Level Access Control.

    Note:

    • Recommended when complete access control on the objects in AWS S3 is required.

    • When the property is set to true, s3 and s3n protocols will not be supported on EMR clusters while running Spark queries.

    true

    EMR_APP_SPARK_FGAC_ENABLE

    To install the Spark application with the Privacera plugin, set the property to true. FGAC stands for Fine Grained Access Control, applied at the table and column level.

    Note: Recommended for compliance purposes, since the whole cluster will still have direct access to AWS S3 data.

    false

    EMR_APP_PRESTO_DB_ENABLE

    To install PrestoDB application with Privacera plugin, set the property to true.

    PrestoDB and Trino are mutually exclusive. Only one should be enabled at a time.

    false

    EMR_APP_PRESTO_SQL_ENABLE

    To install Trino application with Privacera plugin, set the property to true.

    PrestoDB and Trino are mutually exclusive. Only one should be enabled at a time.

    Note: Trino is supported for EMR versions 6.1.0 and higher.

    Note: If the EMR version is 6.4.0, setting this flag installs the Trino plugin.

    false

    EMR_APP_HIVE_ENABLE

    To install Hive application with Privacera plugin, set the property to true.

    true

    EMR_APP_ZEPPELIN_ENABLE

    To install Zeppelin application, set the property to true.

    true

    EMR_APP_LIVY_ENABLE

    To install Livy application, set the property to true.

    true

    EMR_CUST_CONF_ZIP_PATH

    Path where the privacera_cust_conf.zip file will be placed. Privacera Manager generates privacera_cust_conf.zip under the ~/privacera/privacera-manager/output/emr folder. Copy this file to an S3 location or any HTTPS location from which the EMR cluster can download it.

    s3://privacera-artifacts/

    EMR_SPARK_ENABLE_VIEW_LEVEL_ACCESS_CONTROL

    Set the property to true to enable view-level column masking and row filter for SparkSQL. The property can be used only when you set EMR_APP_SPARK_FGAC_ENABLE to true.

    For details, see the Privacera documentation on view-level access control in Spark.

    false

    EMR_RANGER_IS_FALLBACK_SUPPORTED

    Use the property to enable/disable fallback to the privacera_files and privacera_hive services. It determines whether the user is allowed or denied access to the resource files.

    To enable the fallback, set to true; to disable, set to false.

    true

    EMR_SPARK_DELTA_LAKE_ENABLE

    Set this property to true to enable Delta Lake on EMR Spark.

    true

    EMR_SPARK_DELTA_LAKE_CORE_JAR_DOWNLOAD_URL

    Download URL of the Delta Lake core JAR. The JAR version depends on the Spark version.

    You have to find the appropriate version for your EMR. See Delta Lake compatibility with Apache Spark.

    Get the appropriate Delta Lake core JAR download link and update the property. See Delta Core.

    For example, for Spark version 3.1.x, the download URL is https://repo1.maven.org/maven2/io/delta/delta-core_2.12/1.0.1/delta-core_2.12-1.0.1.jar.

    https://repo1.maven.org/maven2/io/delta/delta-core_2.12/1.0.1/delta-core_2.12-1.0.1.jar
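The conditional rules in the table (which properties to set when EMR_CREATE_SG, EMR_KERBEROS_ENABLE, or EMR_HIVE_METASTORE has a given value) can be checked before you run the update. The following is a hedged sketch and not a Privacera tool: it assumes vars.emr.yml holds flat KEY: "value" lines, and the function names get_var and check_emr_vars are hypothetical.

```shell
# Read a top-level KEY from a flat 'KEY: "value"' file (assumed layout).
# Commented-out lines are ignored; surrounding double quotes are stripped.
get_var() {
  sed -n "s/^$2:[[:space:]]*//p" "$1" | head -n 1 |
    sed -e 's/[[:space:]]*$//' -e 's/^"\(.*\)"$/\1/'
}

# Report properties that the table marks as required for the chosen options.
# A missing EMR_CREATE_SG is treated the same as "false".
check_emr_vars() {
  local file="$1" required missing="" key
  if [ "$(get_var "$file" EMR_CREATE_SG)" = "true" ]; then
    required="EMR_SG_VPC_ID EMR_MASTER_SG_NAME EMR_SLAVE_SG_NAME"
  else
    required="EMR_MASTER_SG_ID EMR_SLAVE_SG_ID"
  fi
  if [ "$(get_var "$file" EMR_KERBEROS_ENABLE)" = "true" ]; then
    required="$required EMR_KDC_ADMIN_PASSWORD EMR_CROSS_REALM_PASSWORD"
  fi
  if [ "$(get_var "$file" EMR_HIVE_METASTORE)" = "hive" ]; then
    required="$required EMR_HIVE_METASTORE_CONNECTION_URL"
    required="$required EMR_HIVE_METASTORE_CONNECTION_DRIVER"
    required="$required EMR_HIVE_METASTORE_CONNECTION_USERNAME"
    required="$required EMR_HIVE_METASTORE_CONNECTION_PASSWORD"
  fi
  for key in $required; do
    [ -n "$(get_var "$file" "$key")" ] || missing="$missing $key"
  done
  if [ -n "$missing" ]; then
    echo "Missing:$missing"
    return 1
  fi
  echo "OK"
}

# Usage:
#   check_emr_vars config/custom-vars/vars.emr.yml
```

The service-access security group properties are left out of the check because the table requires them only for a private network.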

    If your cluster was running while the external Hive Metastore was down and you cannot connect to it, restart the following three services.

    sudo systemctl restart hive-hcatalog-server
    sudo systemctl restart hive-server2
    sudo systemctl restart presto-server
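The three restarts can also be run as one loop that stops at the first failure. This is a minimal sketch: the function name restart_metastore_clients is hypothetical, and it assumes all three services are installed (on a cluster without PrestoDB, the presto-server restart fails because the unit does not exist).

```shell
# Restart the three services one at a time and stop at the first one
# that fails to restart or is not active afterwards.
restart_metastore_clients() {
  local svc
  for svc in hive-hcatalog-server hive-server2 presto-server; do
    sudo systemctl restart "$svc" || { echo "restart failed: $svc" >&2; return 1; }
    systemctl is-active --quiet "$svc" || { echo "not active: $svc" >&2; return 1; }
    echo "restarted: $svc"
  done
}

# Usage, on the EMR master node:
#   restart_metastore_clients
```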
  4. Run the following commands.

    cd ~/privacera/privacera-manager
    ./privacera-manager.sh update

    After the update finishes, all the CloudFormation JSON template files and privacera_cust_conf.zip are available in ~/privacera/privacera-manager/output/emr.
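Before continuing, it can help to confirm that the update produced the files the later steps depend on. The sketch below checks only emr-template.json and privacera_cust_conf.zip, since the other two templates are used in optional steps. The function name check_emr_output is hypothetical.

```shell
# Check that the files used by the later steps exist in the output folder.
# $1 is optional and defaults to the output path named in this topic.
check_emr_output() {
  local dir="${1:-$HOME/privacera/privacera-manager/output/emr}"
  local status=0 f
  for f in emr-template.json privacera_cust_conf.zip; do
    if [ -f "$dir/$f" ]; then
      echo "found: $f"
    else
      echo "missing: $f"
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   check_emr_output
```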

  5. Configure and run the following on the AWS instance where Privacera is installed.

    1. (Optional) Create IAM roles using the emr-roles-creation-template.json template. Run the following command.

      aws --region <AWS-REGION> cloudformation create-stack --stack-name privacera-emr-role-creation --template-body file://emr-roles-creation-template.json --capabilities CAPABILITY_NAMED_IAM

      Note

      This creates IAM roles with minimal permissions. Add bucket permissions to the respective IAM roles as your requirements dictate.
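The create-stack command returns as soon as the request is accepted, not when the stack is ready. A hedged sketch for waiting on a stack and printing its final status follows. It uses the AWS CLI wait stack-create-complete and describe-stacks subcommands. The function names are hypothetical, and the same helper could be applied to the other stacks created in this step.

```shell
# Print the current status of a CloudFormation stack, e.g. CREATE_COMPLETE.
stack_status() {
  aws --region "$1" cloudformation describe-stacks --stack-name "$2" \
    --query 'Stacks[0].StackStatus' --output text
}

# Block until the stack has been created, then print its status.
# Returns non-zero if the wait fails.
wait_for_stack() {
  aws --region "$1" cloudformation wait stack-create-complete --stack-name "$2" \
    && stack_status "$1" "$2"
}

# Usage:
#   wait_for_stack <AWS-REGION> privacera-emr-role-creation
```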

    2. (Optional) Create Security Configurations using the emr-security-config-template.json template. Run the following command.

      aws --region <AWS-REGION> cloudformation create-stack --stack-name privacera-emr-security-config-creation  --template-body file://emr-security-config-template.json
    3. Confirm the privacera_cust_conf.zip file has been copied to the location specified in EMR_CUST_CONF_ZIP_PATH.
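If EMR_CUST_CONF_ZIP_PATH points to an S3 location, the copy and the confirmation can be done with the AWS CLI. The following is a minimal sketch under that assumption (an HTTPS location would need a different upload method). The function names zip_uri and upload_cust_conf are hypothetical.

```shell
# Join EMR_CUST_CONF_ZIP_PATH and the zip name without doubling the slash.
zip_uri() {
  printf '%s/privacera_cust_conf.zip\n' "${1%/}"
}

# Copy the generated zip to S3, then list it to confirm it is there.
# $1 = local zip path, $2 = value of EMR_CUST_CONF_ZIP_PATH
upload_cust_conf() {
  local src="$1" uri
  uri="$(zip_uri "$2")"
  aws s3 cp "$src" "$uri" && aws s3 ls "$uri"
}

# Usage:
#   upload_cust_conf ~/privacera/privacera-manager/output/emr/privacera_cust_conf.zip s3://privacera-artifacts/
```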

    4. Create EMR using the emr-template.json template. Run the following command.

      aws --region <AWS-REGION> cloudformation create-stack --stack-name privacera-emr-creation  --template-body file://emr-template.json

      Note

      If you are upgrading from EMR version 6.3 or lower to version 6.4 or higher to use the Trino plug-in, you must re-create the EMR security configuration from the new template generated by Privacera Manager, because the new security configuration adds the trino user.
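Once the privacera-emr-creation stack has been created, you may want the ID of the new cluster. One way, sketched here with the AWS CLI emr list-clusters subcommand, is to look it up by the name set in EMR_CLUSTER_NAME. The function name find_emr_cluster is hypothetical, and the lookup assumes the cluster name is unique among active clusters.

```shell
# Print the IDs of active EMR clusters whose name matches EMR_CLUSTER_NAME.
# $1 = AWS region, $2 = cluster name
find_emr_cluster() {
  local region="$1" name="$2"
  aws --region "$region" emr list-clusters --active \
    --query "Clusters[?Name=='$name'].Id" --output text
}

# Usage:
#   find_emr_cluster <AWS-REGION> Privacera-EMR
```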

Note

  • For PrestoDB, secrets encryption of the Solr authentication password is not supported. However, the properties file that holds the password is readable only by the presto service user, which limits its exposure.
