
Spark OLAC

  1. SSH to the EMR master node
    Bash
    ssh hadoop@<emr-master-node>
    
  2. Switch to the target user and obtain a Kerberos ticket
    Bash
    sudo su - <user>
    kinit
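
Before launching a Spark tool, it can help to confirm that kinit actually produced a usable ticket. The sketch below assumes the MIT Kerberos client, whose klist -s option prints nothing and reports a readable, unexpired credential cache through its exit status. The function names has_valid_ticket and ticket_status are illustrative and not part of any Privacera or EMR tooling.

```shell
# Succeeds (exit status 0) only when the credential cache holds an unexpired ticket.
# Assumption: MIT Kerberos klist, where -s means "silent, report via exit status".
has_valid_ticket() {
  klist -s
}

# Print a short human-readable status line.
ticket_status() {
  if has_valid_ticket; then
    echo "Kerberos ticket is valid"
  else
    echo "No valid Kerberos ticket; run kinit again"
  fi
}
```

Calling ticket_status right after kinit gives a quick check before moving on to the next step.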
    
  3. Connect to a Spark tool
  • Connect to pyspark
    Bash
    pyspark
    
  • If JWT authorization is enabled in the cluster, add one of the following options to the pyspark command.

    • To pass the JWT token directly as a command-line argument, use the following configuration:

      Bash
      --conf "spark.hadoop.privacera.jwt.token.str=<your-jwt-token>"
      

    • To pass the path of a file that contains the JWT token, use the following configuration:

      Bash
      --conf "spark.hadoop.privacera.jwt.token=<path-to-jwt-token-file>"
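
The two options above differ only in the property name, so a launch script can choose between them at run time. The following is a minimal sketch, not Privacera tooling: the helper name jwt_conf and the token file path in the comment are hypothetical, while the two property names and the --conf flag are taken from the configuration above.

```shell
# Print the value to pass to --conf for JWT authorization.
#   $1 = "str"  -> the token itself is passed inline
#        "file" -> the path to a file containing the token is passed
#   $2 = the token or the file path
# jwt_conf is a hypothetical helper name used only for this sketch.
jwt_conf() {
  case "$1" in
    str)  printf 'spark.hadoop.privacera.jwt.token.str=%s\n' "$2" ;;
    file) printf 'spark.hadoop.privacera.jwt.token=%s\n' "$2" ;;
    *)    echo "usage: jwt_conf str|file <value>" >&2; return 1 ;;
  esac
}

# Example launch on the cluster (the path is a placeholder):
# pyspark --conf "$(jwt_conf file /path/to/jwt-token-file)"
```

Because spark-shell and spark-sql take the same two properties, the helper can be reused for those commands as well.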
      

  • Run a Spark read/write

    Python
    df = spark.read.csv("s3a://${S3_BUCKET}/${CSV_FILE}")
    df.show(5)
    
    df.write.format("csv").mode("overwrite").save("s3a://${S3_BUCKET}/${CSV_FILE}")
    

  • Connect to spark-shell
    Bash
    spark-shell
    
  • If JWT authorization is enabled in the cluster, add one of the following options to the spark-shell command.

    • To pass the JWT token directly as a command-line argument, use the following configuration:

      Bash
      --conf "spark.hadoop.privacera.jwt.token.str=<your-jwt-token>"
      

    • To pass the path of a file that contains the JWT token, use the following configuration:

      Bash
      --conf "spark.hadoop.privacera.jwt.token=<path-to-jwt-token-file>"
      

  • Run a Spark read/write

    Scala
    val df = spark.read.csv("s3a://${S3_BUCKET}/${CSV_FILE}")
    df.show(5)
    
    df.write.format("csv").mode("overwrite").save("s3a://${S3_BUCKET}/${CSV_FILE}")
    

  • Connect to spark-sql
    Bash
    spark-sql
    
  • If JWT authorization is enabled in the cluster, add one of the following options to the spark-sql command.

    • To pass the JWT token directly as a command-line argument, use the following configuration:

      Bash
      --conf "spark.hadoop.privacera.jwt.token.str=<your-jwt-token>"
      

    • To pass the path of a file that contains the JWT token, use the following configuration:

      Bash
      --conf "spark.hadoop.privacera.jwt.token=<path-to-jwt-token-file>"
      

  • Run a Spark SQL query

    SQL
    DROP DATABASE IF EXISTS priv_emr_hive CASCADE;
    
    CREATE DATABASE IF NOT EXISTS priv_emr_hive LOCATION 's3a://${S3_BUCKET}/${PATH_TO_DB}';
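
The ${S3_BUCKET} and ${PATH_TO_DB} placeholders above have to be filled in before the statement runs. One way to do that, sketched below, is to let the shell expand them and pass the finished statement to spark-sql through its -e option, which runs a quoted query string non-interactively. The bucket name and path are made-up example values.

```shell
# Hypothetical example values; replace with your own bucket and prefix.
S3_BUCKET="example-bucket"
PATH_TO_DB="warehouse/priv_emr_hive"

# Inside double quotes the shell substitutes the variables,
# so spark-sql receives a statement with the real S3 location.
QUERY="CREATE DATABASE IF NOT EXISTS priv_emr_hive LOCATION 's3a://${S3_BUCKET}/${PATH_TO_DB}';"

# Show the statement that would be sent.
echo "$QUERY"

# On the cluster, run it without opening the interactive prompt:
# spark-sql -e "$QUERY"
```

Printing the statement first makes it easy to verify the expanded location before anything is created.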
    
