Setup for Access Management for EMR Serverless

Configure

Perform the following steps to configure the EMR Serverless connector:

  1. SSH into the instance where Privacera Manager is installed.

  2. Navigate to the /config directory by running the following command:

    Bash
    cd ~/privacera/privacera-manager/config
    

  3. Copy the sample variables by running the following command:

    Bash
    cp sample-vars/vars.emr-serverless.yml custom-vars/
    

  4. Open the .yml file for editing by running the following command:

    Bash
    vi custom-vars/vars.emr-serverless.yml
    

  5. Modify the following properties. The supported versions are listed on the AWS EMR Serverless Versions page.

    Bash
    EMR_SERVERLESS_ENABLE: "true"
    
    # EMR serverless version e.g. emr-7.2.0:latest
    EMR_SERVERLESS_VERSION: "<PLEASE_CHANGE>"
    
    # The unique name of the EMR Serverless application within your AWS account 
    # to avoid conflicts with other applications. Example: privacera1-emr-serverless
    EMR_SERVERLESS_APP_NAME: "<PLEASE_CHANGE>"
    

    Note

    EMR_SERVERLESS_VERSION is the EMR Serverless Spark Docker image tag, which you can get from this link.

  6. Once the properties are configured, update your Privacera Manager platform instance by running the following commands:

    Bash
    cd ~/privacera/privacera-manager
    ./privacera-manager.sh post-install
    

  7. Once the post-install process is complete, you will see an emr-serverless folder in the ~/privacera/privacera-manager/output directory, with the following folder structure:

    Bash
    output/
    └── emr-serverless/
        └── olac/
            ├── Dockerfile_Privacera_Spark_OLAC
            ├── setup_emrserverless_spark_olac.sh
            ├── spark_custom_conf
            └── spark_custom_conf.zip
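
The checks implied by steps 5 and 7 can be sketched as two small shell functions: one that catches `<PLEASE_CHANGE>` values left in the vars file before you run post-install, and one that confirms the generated files exist afterwards. This is a minimal sketch, not part of Privacera Manager. The function names `check_vars_file` and `check_olac_output` are hypothetical, and the file names are taken from the steps above.

```shell
# Hypothetical helpers for illustration; they are not shipped with Privacera Manager.

# Succeeds only if the vars file exists and no uncommented line still
# contains the <PLEASE_CHANGE> placeholder.
check_vars_file() {
    local vars_file="$1"
    if [ ! -f "$vars_file" ]; then
        echo "missing: $vars_file" >&2
        return 1
    fi
    # Drop comment lines first so that examples in comments are ignored.
    if grep -v '^[[:space:]]*#' "$vars_file" | grep -q '<PLEASE_CHANGE>'; then
        echo "unset placeholder in $vars_file" >&2
        return 1
    fi
    return 0
}

# Succeeds only if the three files needed for the image build exist
# in the given olac output directory.
check_olac_output() {
    local olac_dir="$1" name missing=0
    for name in Dockerfile_Privacera_Spark_OLAC \
                setup_emrserverless_spark_olac.sh \
                spark_custom_conf.zip; do
        if [ ! -e "$olac_dir/$name" ]; then
            echo "missing: $olac_dir/$name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, after sourcing these functions you could run `check_vars_file ~/privacera/privacera-manager/config/custom-vars/vars.emr-serverless.yml` before step 6 and `check_olac_output ~/privacera/privacera-manager/output/emr-serverless/olac` after step 7.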
    

Build custom Docker image (Multi Architecture - AWS Intel/AMD64 and AWS Graviton)

What is a multi-architecture Docker image?

A multi-architecture (or multi-platform) Docker image is an image that supports multiple CPU architectures, such as Intel/AMD64 and ARM. It looks like a single image with a single tag, but it is actually a set of images, each targeting one architecture, organized by a manifest list.

For an EMR Serverless custom Docker image, this allows you to build a single Docker image that can run on both AWS Intel/AMD64 and AWS Graviton EC2 instances.

You can refer to GKE Site and Docker Site for more information on multi-architecture Docker images.

Multi-architecture Docker images are not mandatory for Privacera. The other option is to build separate Docker images for each architecture and push them to ECR.
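
If you take the separate-image route, each build needs the platform string for its target architecture. The following is a minimal sketch of that mapping. The function name `docker_platform_for` is hypothetical, and the machine names are the ones `uname -m` commonly prints on Intel/AMD64 and Graviton hosts.

```shell
# Hypothetical helper: map a machine name (as printed by `uname -m`)
# to the Docker platform string for that architecture.
docker_platform_for() {
    case "$1" in
        x86_64|amd64)  echo "linux/amd64" ;;
        aarch64|arm64) echo "linux/arm64" ;;
        *)
            echo "unsupported architecture: $1" >&2
            return 1
            ;;
    esac
}
```

On the build host, `docker_platform_for "$(uname -m)"` prints the value you would pass to the `--platform` option of `docker buildx build`. The multi-architecture build command in this guide writes the ARM platform as linux/arm64/v8, which Docker should treat as the same platform as linux/arm64.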

  1. To build and push the Docker image, copy the following Dockerfile and configuration files to the EC2 instance where you will build the image. Alternatively, you can build the image on the same EC2 instance where Privacera Manager is installed.

    Bash
    ## Files to copy
    Dockerfile_Privacera_Spark_OLAC
    setup_emrserverless_spark_olac.sh
    spark_custom_conf.zip
    
  2. Once the required files are on the EC2 instance where you will build the Docker image, set the following variables. You can either set them as environment variables before running the commands in the later steps or replace the values in each command directly.

    Variable Name     Description                                                               Sample Value
    aws_account_id    Your AWS account ID.                                                      "123456789012"
    region            The AWS region where your ECR repository is located.                      "us-east-1"
    ecr_repo_name     The name of your ECR repository where the Docker image will be pushed.    "privacera/emr-serverless-spark-olac"
    tag               The tag for the Docker image.                                             "v1.0"

    Bash
    aws_account_id=<PLEASE_CHANGE>
    region=<PLEASE_CHANGE>
    ecr_repo_name=<PLEASE_CHANGE>
    tag=<PLEASE_CHANGE>
    
  3. Ensure that you have the necessary IAM permissions to manage custom Docker images in your Amazon Elastic Container Registry (ECR). You can use the IAM policies on this AWS documentation link to grant the necessary permissions.

  4. Create the ECR repository by running the following command. This is a one-time setup.

    Bash
    aws ecr create-repository \
        --repository-name ${ecr_repo_name} \
        --region ${region}
    
  5. Run the following command to log in to the Amazon Elastic Container Registry (ECR) registry that contains the repository you just created. Make sure to set the environment variables or replace the variables before running the command.

    Bash
    # Login to ECR repo
    aws ecr get-login-password --region ${region} | \
        docker login --username AWS \
        --password-stdin ${aws_account_id}.dkr.ecr.${region}.amazonaws.com
    
  6. Make sure your Docker CLI has buildx support by running:

    Bash
    docker buildx --help
    
    If the above command fails, enable buildx by following the instructions to install Docker buildx.

  7. Follow these steps to build a multi-architecture Docker image that can run on both AWS Intel/AMD EC2 instances and AWS Graviton EC2 instances for EMR Serverless.

    First, create a builder instance with the following command. This builder uses the docker-container driver, which supports building images for multiple platforms.

    Bash
    # Optionally add --use to make this builder the default builder
    docker buildx create \
        --name multi-arch-builder \
        --driver=docker-container
    

    Then run the following command to build the EMR Serverless Docker image for both ARM and x86 platforms.

    Because this is a multi-architecture image, it cannot be loaded into your local Docker engine and must be pushed directly to ECR, which the --push option in this command does. You can later run docker pull against ECR, which loads the image that matches the architecture of the host.

    Bash
    docker buildx build \
        --file ./Dockerfile_Privacera_Spark_OLAC \
        --tag ${aws_account_id}.dkr.ecr.${region}.amazonaws.com/${ecr_repo_name}:${tag} \
        --platform linux/arm64/v8,linux/amd64 \
        --builder multi-arch-builder \
        --push \
        .
    
  8. To verify that the Docker image was created successfully, run the following commands. Make sure to set the environment variables or replace the variables before running the command.

    Bash
    # this command will show the manifest which should have both the ARM 
    # and x86 platforms
    docker manifest inspect \
        ${aws_account_id}.dkr.ecr.${region}.amazonaws.com/${ecr_repo_name}:${tag}
    
    Bash
    # this will pull the correct platform image based on the architecture of 
    # the host
    docker pull \
        ${aws_account_id}.dkr.ecr.${region}.amazonaws.com/${ecr_repo_name}:${tag}
    
    Bash
    # This command will open the bash shell in the docker container.
    docker run -it --rm --entrypoint /bin/bash \
        ${aws_account_id}.dkr.ecr.${region}.amazonaws.com/${ecr_repo_name}:${tag}
    
    Once inside the container, you can inspect the environment to ensure it is set up correctly. There should be a /opt/privacera folder inside the image. Run exit to leave the shell. This also deletes the container, because the --rm flag was used.
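
The image reference repeated in steps 7 and 8, and the manifest check from step 8, can be sketched as shell functions. This is a minimal sketch under stated assumptions. The function names `ecr_image_uri` and `manifest_has_both_arches` are hypothetical, and the manifest check is a plain text match on the "architecture" fields that `docker manifest inspect` prints for a manifest list, not a full JSON parse.

```shell
# Hypothetical helper: build the ECR image reference used by the build,
# pull, and run commands, refusing empty or placeholder values.
ecr_image_uri() {
    local account="$1" region="$2" repo="$3" tag="$4"
    if [ -z "$account" ] || [ -z "$region" ] || [ -z "$repo" ] || [ -z "$tag" ]; then
        echo "usage: ecr_image_uri <aws_account_id> <region> <ecr_repo_name> <tag>" >&2
        return 1
    fi
    case "$account$region$repo$tag" in
        *PLEASE_CHANGE*)
            echo "replace the <PLEASE_CHANGE> values first" >&2
            return 1
            ;;
    esac
    echo "${account}.dkr.ecr.${region}.amazonaws.com/${repo}:${tag}"
}

# Hypothetical helper: read manifest JSON on stdin and succeed only if
# it lists both an amd64 and an arm64 entry.
manifest_has_both_arches() {
    local manifest
    manifest="$(cat)"
    printf '%s\n' "$manifest" | grep -Eq '"architecture"[[:space:]]*:[[:space:]]*"amd64"' &&
    printf '%s\n' "$manifest" | grep -Eq '"architecture"[[:space:]]*:[[:space:]]*"arm64"'
}
```

With the variables from step 2 set, you could store the reference with `image="$(ecr_image_uri "$aws_account_id" "$region" "$ecr_repo_name" "$tag")"` and then run `docker manifest inspect "$image" | manifest_has_both_arches && echo "both platforms present"`.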

Create Application

With EMR Serverless, you can create one or more applications that use open-source analytics frameworks. To create an application, follow these steps:

Note

Refer to the latest AWS documentation for deploying EMR Serverless applications.

  • Application settings: Provide a unique name for the application (e.g., emr_serverless_spark_app). Select type as Spark, and specify the release version that you have configured in the vars.emr-serverless.yml file.
  • Custom Image Settings: Select the image that you have uploaded in ECR repository.
  • Application Configuration: Add the Privacera-specific Spark configuration properties to the spark-defaults classification, as described under JSON configuration.

JSON configuration:

The Privacera-specific Spark configuration properties must be added to the spark-defaults classification as shown below. The rest of the application configuration is shown for completeness, and you can modify it to suit your requirements. Set the rootLogger.level property to warn, debug, or trace, depending on the log level you want.

JSON
{
  "runtimeConfiguration": [
    {
      "classification": "spark-driver-log4j2",
      "configurations": [],
      "properties": {
        "rootLogger.level": "<warn | debug | trace>"
      }
    },
    {
      "classification": "spark-executor-log4j2",
      "configurations": [],
      "properties": {
        "rootLogger.level": "<warn | debug | trace>"
      }
    },
    {
      "classification": "spark-defaults",
      "configurations": [],
      "properties": {
        "spark.executor.extraJavaOptions": "-javaagent:/usr/lib/spark/jars/privacera-agent.jar",
        "spark.driver.extraJavaOptions": "-javaagent:/usr/lib/spark/jars/privacera-agent.jar",
        "spark.sql.hive.metastore.sharedPrefixes": "com.amazonaws.services.dynamodbv2,com.privacera,com.amazonaws",
        "spark.hadoop.fs.s3a.access.key": "P_ACCESS_KEY",
        "spark.hadoop.fs.s3a.secret.key": "P_SECRET_KEY",
        "spark.hadoop.fs.s3a.session.token": "P_SESSION_TOKEN",
        "spark.hadoop.fs.s3a.s3.signing-algorithm": "PrivaceraAwsSdkV2Signer",
        "spark.hadoop.fs.s3a.custom.signers": "PrivaceraAwsSdkV2Signer:com.privacera.spark.agent.signer.PrivaceraAwsSdkV2Signer"
      }
    }
  ]
}
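
If you keep the configuration above in a local file before pasting it into the application settings, a quick check can confirm that none of the Privacera properties were dropped while editing. This is a minimal sketch. The function name `check_privacera_spark_conf` is hypothetical, and it only looks for the property names from the JSON above; it does not validate their values or the JSON syntax.

```shell
# Hypothetical helper: succeed only if the given file mentions every
# Privacera-related property name from the spark-defaults classification.
check_privacera_spark_conf() {
    local conf_file="$1" key missing=0
    for key in \
        spark.executor.extraJavaOptions \
        spark.driver.extraJavaOptions \
        spark.sql.hive.metastore.sharedPrefixes \
        spark.hadoop.fs.s3a.access.key \
        spark.hadoop.fs.s3a.secret.key \
        spark.hadoop.fs.s3a.session.token \
        spark.hadoop.fs.s3a.s3.signing-algorithm \
        spark.hadoop.fs.s3a.custom.signers
    do
        # -F matches the quoted key literally, so dots are not wildcards.
        if ! grep -Fq "\"$key\"" "$conf_file"; then
            echo "missing property: $key" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_privacera_spark_conf runtime-config.json` (the file name is an assumption) returns a non-zero status and names each missing property on standard error.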

Next Steps

To submit a job to the EMR Serverless application, refer to Privacera's User Guide for AWS EMR Serverless.