Unset AWS credentials in Spark's environment for Glue Data Catalog calls on EMR
Overview
When Spark accesses AWS Glue Data Catalog through the AWS SDK v2, the default credential chain checks environment variables and the shared credentials file before the EC2 instance profile. If AWS_* environment variables or AWS_SHARED_CREDENTIALS_FILE point to credentials that are not valid for Glue API calls, Spark metadata calls can fail with an invalid security token error.
Why this happens
On EMR 7.x, Spark uses AWS SDK v2 to call AWS Glue. The SDK’s default credential provider chain prefers environment variables and the shared credentials file over the EC2 instance profile when those are set. Interactive sessions, automation, or other components on the node may set AWS_* values or AWS_SHARED_CREDENTIALS_FILE for purposes that do not match Glue Data Catalog authentication. Spark then picks up those sources first, which often surfaces as the invalid security token error above.
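To see whether a session is affected, it can help to check which credential-related variables are exported in the environment Spark would inherit. The following is a minimal sketch, not part of Privacera Manager: the function name `list_aws_credential_vars` is hypothetical, and the variable list is an assumption covering the standard AWS SDK names. It prints variable names only, never their values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the names (never the values) of exported
# variables that the SDK's default credential chain consults before
# it falls back to the EC2 instance profile.
list_aws_credential_vars() {
  local name
  # Assumed list of standard AWS SDK variable names; extend as needed.
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN \
              AWS_PROFILE AWS_SHARED_CREDENTIALS_FILE; do
    # printenv exits 0 only when the variable is present in the environment.
    if printenv "$name" >/dev/null 2>&1; then
      echo "$name"
    fi
  done
}

list_aws_credential_vars
```

If the function prints nothing, none of these variables are exported in the current shell, so they would not shadow the instance profile for processes started from it.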
To handle this case, follow the configuration steps below.
Configure
- SSH to the instance where Privacera Manager is installed.
- Navigate to the Privacera Manager configuration directory.
- Open the EMR variables file for editing.
- Uncomment and set the following property:
  Note
  - Set to "true" to enable Spark-side unsetting of AWS_* and AWS_SHARED_CREDENTIALS_FILE so Glue metadata calls use instance profile credentials.
  - Set to "false" (default) to retain legacy behavior (no additional unset block in spark-env.sh).
- Apply the configuration by running Privacera Manager post-install.
Verification
After the cluster is up, SSH to a node where Spark runs and check spark-env.sh for the unset block.
When enabled, you should see unset lines for the AWS_* variables and AWS_SHARED_CREDENTIALS_FILE.
Then run a simple Spark query (for example, SHOW DATABASES) to confirm Glue access succeeds without invalid-token errors.
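The check for unset lines can also be scripted. The sketch below is illustrative only: the function name `check_unset_block` is hypothetical, the default path `/etc/spark/conf/spark-env.sh` is an assumption about where the file lives on the node, and the pattern assumes the generated lines begin with `unset` and name an `AWS_` variable. Adjust both to match what you actually see on your cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a spark-env.sh file contains
# unset lines for AWS credential variables.
# Exit status: 0 = found, 1 = not found, 2 = file not readable.
check_unset_block() {
  # Assumed default location; pass the real path as the first argument.
  local env_file="${1:-/etc/spark/conf/spark-env.sh}"
  # Assumed line shape, e.g. "unset AWS_ACCESS_KEY_ID" (leading spaces allowed).
  local pattern='^[[:space:]]*unset[[:space:]].*AWS_'

  if [ ! -r "$env_file" ]; then
    echo "cannot read $env_file" >&2
    return 2
  fi

  # grep -n prints each matching line with its line number.
  if grep -En "$pattern" "$env_file"; then
    return 0
  fi

  echo "no AWS unset lines found in $env_file"
  return 1
}
```

For example, `check_unset_block /etc/spark/conf/spark-env.sh` prints any matching lines with their line numbers and returns a non-zero status when none are present, which makes it easy to use in a post-provisioning check.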