
Troubleshooting for Access Management for EMR

Accessing S3 Buckets Containing a 'dot' in the Name in EMR 6.x and above

In EMR version 6.x and above, you may encounter an error when attempting to read from or write to an S3 bucket whose name contains a dot (.) using the s3a protocol in PySpark or the Spark shell. The error occurs because the default virtual-hosted-style addressing places the bucket name in the hostname, and a bucket name containing dots no longer matches the wildcard TLS certificate (*.s3.amazonaws.com), so the AWS SDK rejects the request.

Text Only
com.amazonaws.SdkClientException: Unable to execute HTTP request: Certificate for <{bucket-name-with-name}.east.us.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: Unable to execute HTTP request: Certificate for <{bucket-name-with-name}.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]

To work around this, enable path-style access for buckets whose names contain dots by passing the following property:

Bash
pyspark --conf "spark.hadoop.fs.s3a.path.style.access=true"
Bash
spark-shell --conf "spark.hadoop.fs.s3a.path.style.access=true"
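If you run jobs non-interactively, the same property can be passed to spark-submit or persisted in spark-defaults.conf; a minimal sketch, where my_job.py is a placeholder for your script and the configuration path assumes the default EMR Spark layout (/usr/lib/spark/conf):

Bash
spark-submit --conf "spark.hadoop.fs.s3a.path.style.access=true" my_job.py
Bash
# Persist the setting for all Spark jobs on the cluster (run on the master node)
echo "spark.hadoop.fs.s3a.path.style.access true" | sudo tee -a /usr/lib/spark/conf/spark-defaults.conf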

Delta Table Creation Fails with S3 Protocol

When creating a Delta table using the s3 protocol in AWS EMR, table creation fails as expected when no policy is applied. However, even after the required permissions are granted, the creation can still fail with the following exception:

Text Only
(pyspark.errors.exceptions.captured.AnalysisException: Cannot create table ('`db`.`table`'). 
The associated location ('s3://<location>') is not empty and also not a Delta table.)

To successfully create a Delta table without encountering exceptions, follow these steps:

  1. Check for Auto-Generated Folders:

    • After running the Delta table creation query, verify whether any _$folder$ directories exist in the specified S3 location.
  2. Manually Delete Unwanted Folders:

    • If the following folders are present in AWS S3, work with your administrator to delete them directly in AWS S3 or through the Privacera S3 browser (a CLI sketch follows these steps).

      Note

      The following folder structure is an example. The actual folders may vary based on the table location used in your query.

      • <hms_database>_$folder$
      • <hms_database>/<delta_tables>_$folder$
      • <hms_database>/<delta_tables>/<table_1>_$folder$
      • <hms_database>/<delta_tables>/<table_1>/_delta_log_$folder$
  3. Retry the Delta Table Creation Query:

    • After removing the unwanted folders, re-run your Delta table creation SQL query.
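
The _$folder$ marker objects can also be located and removed with the AWS CLI, provided your role is allowed to delete them; a minimal sketch, where the bucket name and prefixes are placeholders that should match the location used in your query:

Bash
# List any _$folder$ marker objects under the table location
aws s3 ls s3://<bucket>/<hms_database>/ --recursive | grep '_\$folder\$'

# Remove a specific marker object (repeat for each marker found)
aws s3 rm 's3://<bucket>/<hms_database>/<delta_tables>_$folder$'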

Note

  • The Delta library automatically creates these folders (ending with _$folder$) during the table creation process.
  • If the user lacks the necessary permissions to the S3 bucket, these folders are not cleaned up, as the query execution is halted due to the permission issue.
  • We recommend performing manual cleanup before retrying the query with the required permissions.
  • The same issue can occur even without Privacera if the IAM role has permission to create/delete _$folder$ objects but lacks permission to the actual table location.

Enable DEBUG logs for AWS EMR Spark Jobs

Follow these steps to enable debug logs for the Spark plugin running in EMR. SSH to the master node of your EMR cluster and run the following commands:

Text Only
sudo su -
vi /usr/lib/spark/conf/log4j2.properties
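
Optionally, back up the original file before making changes so it is easy to restore after the debugging session; a small sketch, where the .bak file name is an assumption:

Bash
cp /usr/lib/spark/conf/log4j2.properties /usr/lib/spark/conf/log4j2.properties.bak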
Expand the following section and copy its content into the log4j2.properties file, or modify the existing file accordingly.

log4j2.properties
Properties
rootLogger.level = info
rootLogger.appenderRefs = stdout, file
rootLogger.appenderRef.stdout.ref = console
rootLogger.appenderRef.file.ref = RollingFileAppender

# console appender
appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} [%t] %p %c :%L %m%n%ex

## rolling file config
appender.rolling.type = RollingFile
appender.rolling.name = RollingFileAppender
appender.rolling.fileName = /tmp/${sys:user.name}/privacera.log
appender.rolling.filePattern = /tmp/${sys:user.name}/privacera-%d{yyyy-MM-dd}-%i.log
appender.rolling.layout.type = PatternLayout
appender.rolling.layout.pattern = %d{yyyy-MM-dd HH:mm:ss} [%t] %p %c :%L %m%n%ex
appender.rolling.policies.type = Policies
appender.rolling.policies.time.type = TimeBasedTriggeringPolicy
appender.rolling.policies.time.interval = 1
appender.rolling.policies.size.type = SizeBasedTriggeringPolicy
appender.rolling.policies.size.size = 100MB

# privacera
logger.privacera.name = com.privacera
logger.privacera.level = debug
logger.privacera.additivity = false
logger.privacera.appenderRefs = rolling
logger.privacera.appenderRef.rolling.ref = RollingFileAppender

# ranger
logger.ranger.name = org.apache.ranger
logger.ranger.level = info
logger.ranger.additivity = false
logger.ranger.appenderRefs = rolling
logger.ranger.appenderRef.rolling.ref = RollingFileAppender

# aws sdk v1
logger.amazon.name = com.amazon
logger.amazon.level = info
logger.amazon.additivity = false
logger.amazon.appenderRefs = rolling
logger.amazon.appenderRef.rolling.ref = RollingFileAppender

logger.amazonaws.name = com.amazonaws
logger.amazonaws.level = info
logger.amazonaws.additivity = false
logger.amazonaws.appenderRefs = rolling
logger.amazonaws.appenderRef.rolling.ref = RollingFileAppender

# aws sdk v2
logger.software-amazon.name = software.amazon.awssdk
logger.software-amazon.level = info
logger.software-amazon.additivity = false
logger.software-amazon.appenderRefs = rolling
logger.software-amazon.appenderRef.rolling.ref = RollingFileAppender

# apache http
logger.apache.name = org.apache.http.wire
logger.apache.level = info
logger.apache.additivity = false
logger.apache.appenderRefs = rolling
logger.apache.appenderRef.rolling.ref = RollingFileAppender

Restart your spark-shell or spark-sql session to pick up the changes.

Run your use-case to generate the debug logs. The logs will be available in the /tmp/<user>/privacera.log file.
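
To follow the plugin activity while the job runs, you can tail the log file; a small sketch, where the path assumes the user that runs the Spark job:

Bash
tail -f /tmp/$(whoami)/privacera.log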

After your debugging session is over, be sure to revert the changes made to the log4j2.properties file.
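
If you created a backup earlier, reverting is a single copy; this assumes the .bak file from the sketch above:

Bash
cp /usr/lib/spark/conf/log4j2.properties.bak /usr/lib/spark/conf/log4j2.properties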
