Troubleshooting for Access Management for EMR

Accessing S3 Buckets Containing a 'dot' in the Name in EMR 6.x and above

In EMR version 6.x and above, you may encounter an error when reading from or writing to an S3 bucket whose name contains a dot (.) using the s3a protocol in PySpark or the Spark shell. The error occurs because the AWS SDK uses virtual-hosted-style addressing by default, which places the bucket name in the request hostname; a bucket name containing dots then fails TLS hostname verification against the wildcard certificate *.s3.amazonaws.com, as shown below.

Text Only
com.amazonaws.SdkClientException: Unable to execute HTTP request: Certificate for <{bucket-name-with-name}.east.us.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: Unable to execute HTTP request: Certificate for <{bucket-name-with-name}.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]

To work around this, enable path-style access for buckets with dots by passing the following configuration property when starting PySpark or the Spark shell:

Bash
pyspark --conf "spark.hadoop.fs.s3a.path.style.access=true"
Bash
spark-shell --conf "spark.hadoop.fs.s3a.path.style.access=true"
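If the setting should apply to every Spark job on the cluster rather than per session, a minimal sketch is to add it to spark-defaults.conf. The path below is the default EMR Spark configuration directory; adjust it if your cluster uses a different location.

Bash
# Assumed default EMR location of spark-defaults.conf; verify on your cluster
echo "spark.hadoop.fs.s3a.path.style.access true" | sudo tee -a /etc/spark/conf/spark-defaults.conf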

Delta Table Creation Fails with S3 Protocol

When creating a Delta table using the s3 protocol in AWS EMR, the table creation fails as expected while no access policy is applied. However, even after the required permissions are granted, the table creation may still fail with the following exception:

Text Only
(pyspark.errors.exceptions.captured.AnalysisException: Cannot create table ('`db`.`table`'). 
The associated location ('s3://<location>') is not empty and also not a Delta table.)

To successfully create a Delta table without encountering exceptions, follow these steps:

  1. Check for Auto-Generated Folders:

    • After running the Delta table creation query, verify whether any _$folder$ directories exist in the specified S3 location.
  2. Manually Delete Unwanted Folders:

    • If the following folders are present in AWS S3, work with your administrator to delete them directly in AWS S3 or through the Privacera S3 browser (see the CLI sketch after this list).

      Note

      The following folder structure is an example. The actual folders may vary based on the table location used in your query.

      • <hms_database>_$folder$
      • <hms_database>/<delta_tables>_$folder$
      • <hms_database>/<delta_tables>/<table_1>_$folder$
      • <hms_database>/<delta_tables>/<table_1>/_delta_log_$folder$
  3. Retry the Delta Table Creation Query:

    • After removing the unwanted folders, re-run your Delta table creation SQL query.
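The check and cleanup in step 2 can also be done from the command line. The sketch below uses the AWS CLI and assumes you have list and delete permissions on the bucket; the bucket name and prefixes are placeholders to replace with your own table location.

Bash
# List any auto-generated "_$folder$" marker objects under the table location
aws s3 ls "s3://<bucket>/<hms_database>/" --recursive | grep '_\$folder\$'

# After confirming with your administrator, delete a specific marker object
aws s3 rm 's3://<bucket>/<hms_database>/<delta_tables>/<table_1>_$folder$'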

Note

  • The Delta library automatically creates these folders (ending with _$folder$) during the table creation process.
  • If the user lacks the necessary permissions on the S3 bucket, these folders are not cleaned up because the query execution is halted by the permission error.
  • We recommend performing manual cleanup before retrying the query with the required permissions.
  • The same issue can occur even without Privacera if the IAM role has permission to create/delete _$folder$ objects but lacks permission to the actual table location.

Enable DEBUG logs for AWS EMR Spark Jobs

Follow these steps to enable debug logs for the Spark plugin running in EMR. SSH to the master node of your EMR cluster and run the following commands:

Bash
sudo su -
vi /usr/lib/spark/conf/log4j2.properties
Copy the following content into the log4j2.properties file, or modify the existing file to match.

log4j2.properties
Properties
rootLogger.level = info
rootLogger.appenderRefs = stdout, file
rootLogger.appenderRef.stdout.ref = console
rootLogger.appenderRef.file.ref = RollingFileAppender

# console appender
appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} [%t] %p %c :%L %m%n%ex

## rolling file config
appender.rolling.type = RollingFile
appender.rolling.name = RollingFileAppender
appender.rolling.fileName = /tmp/${sys:user.name}/privacera.log
appender.rolling.filePattern = /tmp/${sys:user.name}/privacera-%d{yyyy-MM-dd}-%i.log
appender.rolling.layout.type = PatternLayout
appender.rolling.layout.pattern = %d{yyyy-MM-dd HH:mm:ss} [%t] %p %c :%L %m%n%ex
appender.rolling.policies.type = Policies
appender.rolling.policies.time.type = TimeBasedTriggeringPolicy
appender.rolling.policies.time.interval = 1
appender.rolling.policies.size.type = SizeBasedTriggeringPolicy
appender.rolling.policies.size.size = 100MB

# privacera
logger.privacera.name = com.privacera
logger.privacera.level = debug
logger.privacera.additivity = false
logger.privacera.appenderRefs = rolling
logger.privacera.appenderRef.rolling.ref = RollingFileAppender

# ranger
logger.ranger.name = org.apache.ranger
logger.ranger.level = info
logger.ranger.additivity = false
logger.ranger.appenderRefs = rolling
logger.ranger.appenderRef.rolling.ref = RollingFileAppender

# aws sdk v1
logger.amazon.name = com.amazon
logger.amazon.level = info
logger.amazon.additivity = false
logger.amazon.appenderRefs = rolling
logger.amazon.appenderRef.rolling.ref = RollingFileAppender

logger.amazonaws.name = com.amazonaws
logger.amazonaws.level = info
logger.amazonaws.additivity = false
logger.amazonaws.appenderRefs = rolling
logger.amazonaws.appenderRef.rolling.ref = RollingFileAppender

# aws sdk v2
logger.software-amazon.name = software.amazon.awssdk
logger.software-amazon.level = info
logger.software-amazon.additivity = false
logger.software-amazon.appenderRefs = rolling
logger.software-amazon.appenderRef.rolling.ref = RollingFileAppender

# apache http
logger.apache.name = org.apache.http.wire
logger.apache.level = info
logger.apache.additivity = false
logger.apache.appenderRefs = rolling
logger.apache.appenderRef.rolling.ref = RollingFileAppender

Restart your spark-shell or spark-sql session to pick up the changes.

Run your use-case to generate the debug logs. The logs will be available in the /tmp/<user>/privacera.log file.
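To confirm that DEBUG logging is active, you can check the log file directly. The path below matches the rolling file appender configured above; replace <user> with the user running the job.

Bash
# Show the first few DEBUG entries written by the Privacera plugin
grep -m 5 "DEBUG com.privacera" /tmp/<user>/privacera.log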

After your debugging session is over, be sure to revert the changes made to the log4j2.properties file.

Verify access-control.properties in EMR Trino

This section explains how to verify the access-control.properties configuration file used by EMR Trino.

  1. SSH to the master node.
    Bash
    ssh your_user@<emr-master-node>
    
  2. Check the access-control.properties configuration
    Bash
    cat /usr/lib/trino/etc/access-control.properties
    
    The access-control.properties file should contain the following properties:
    Properties
    access-control.name=privacera-ranger
    security.config-file=etc/rules.json
    security.refresh-period=60s
    ranger.hive.policy.authorization.enabled=true
    ranger.hive.policy.repo.catalog.mapping=privacera_hive:hive,awsdatacatalog
    ranger.policy.authorization.viewowner.default=trino
    ranger.policy.authorization.clustername=trino
    

Note

Verify that the access-control.name property is set to privacera-ranger to confirm that the Privacera Trino plugin is configured.
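A quick one-liner for this check, assuming the default EMR Trino configuration path used above:

Bash
# Print only the access-control.name entry; expect "privacera-ranger"
grep '^access-control.name' /usr/lib/trino/etc/access-control.properties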

Enable Debug Logs for EMR Trino

This section explains how to enable debug logging for EMR Trino to help troubleshoot access management issues.

  1. SSH to the master node.
    Bash
    ssh your_user@<emr-master-node>
    
  2. Check the existing Trino logging configuration file
    Bash
    cat /etc/trino/conf/log.properties
    
    The file should contain only:
    Properties
    io.trino=INFO
    
  3. Edit the Trino logging configuration file
    Bash
    sudo vi /etc/trino/conf/log.properties
    
    Add or update the logging configuration:
    Adjust the log level as needed. You can set it to INFO, DEBUG, TRACE, or ERROR, depending on the level of troubleshooting required.
    Properties
    io.trino=INFO
    org.apache.ranger=INFO
    org.apache.ranger.authorization.trino=DEBUG
    
  4. Restart the Trino service to apply the changes
    Bash
    sudo systemctl restart trino-server
    

To Verify Trino Server Logs

  1. Navigate to the Trino logs directory
    Bash
    cd /var/log/trino
    
  2. Run the following command and confirm that the configuration is updated
    Bash
    cat /usr/lib/trino/etc/log.properties
    
  3. Monitor the server logs in real-time
    Bash
    tail -F server.log
    

After your debugging session is complete, revert the changes made to the log.properties file to prevent excessive log generation.
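While debug logging is still enabled, it can help to filter the server log down to the Ranger authorization entries configured above; a minimal sketch using the log location shown earlier:

Bash
# Follow only Ranger-related entries in the Trino server log
tail -F /var/log/trino/server.log | grep -i "ranger"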

Deploy Custom Code Build for EMR Trino

This section explains how to deploy a custom code build of the Privacera Trino plugin to your EMR cluster.

  1. SSH to the master node.

    Bash
    ssh your_user@<emr-master-node>
    

  2. Navigate to the Privacera plugins directory

    Bash
    cd /opt/privacera/downloads/plugins/
    

  3. Back up the existing Trino plugin package

    Bash
    sudo mkdir -p /opt/privacera/downloads/plugins/trino-old-jar
    sudo mv /opt/privacera/downloads/plugins/privacera-trino-plugin.tar.gz /opt/privacera/downloads/plugins/trino-old-jar/
    

  4. Download the new custom Trino plugin package

    Bash
    CUSTOM_BUILD_URL=<url>
    sudo wget -O privacera-trino-plugin.tar.gz "$CUSTOM_BUILD_URL"
    

  5. Execute the Privacera setup script

    Bash
    cd /opt/privacera/downloads/scripts
    sudo su
    sh setup_privacera_presto_plugin.sh
    

  6. Restart the Trino service to apply the custom build

    Bash
    sudo systemctl restart trino-server
    

Verification

  • To confirm that the custom-built Privacera Trino plugin is installed correctly, check the plugin version file:
    Bash
    sudo cat /opt/privacera/plugin/privacera-trino-plugin/privacera_version.txt
    
  • The privacera_version.txt file should contain the version information that corresponds to your custom build.

Rollback Procedure

If issues occur with the custom build, you can roll back to the previous version:

  1. Stop the Trino service

    Bash
    sudo systemctl stop trino-server
    

  2. Remove the new plugin package and restore the backup

    Bash
    sudo rm /opt/privacera/downloads/plugins/privacera-trino-plugin.tar.gz
    sudo mv /opt/privacera/downloads/plugins/trino-old-jar/privacera-trino-plugin.tar.gz /opt/privacera/downloads/plugins/
    

  3. Execute the setup script and restart the service

    Bash
    cd /opt/privacera/downloads/scripts
    sudo sh setup_privacera_presto_plugin.sh
    sudo systemctl restart trino-server
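
After the rollback completes, you can repeat the earlier verification to confirm the previous version is back in place and the service is healthy:

Bash
# Confirm the restored plugin version and that the Trino service is running
sudo cat /opt/privacera/plugin/privacera-trino-plugin/privacera_version.txt
sudo systemctl status trino-server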