Spark OLAC

  1. SSH to the EMR master node
    Bash
    ssh hadoop@<emr-master-node>
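
    If the cluster requires key-based SSH access, pass the identity file explicitly (the key filename below is an assumption; use the key pair configured for your EMR cluster).
    Bash
    ssh -i <path-to-emr-key.pem> hadoop@<emr-master-node>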
    
  2. Run the following commands
    Bash
    sudo su - <user>
    kinit
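
    If the user has a keytab, the Kerberos ticket can also be obtained non-interactively; the keytab path and realm below are assumptions. klist verifies that the ticket was granted.
    Bash
    kinit -kt /etc/security/keytabs/<user>.keytab <user>@<REALM>
    klist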
    
  3. Connect to a Spark tool
  • Connect to pyspark
    Bash
    pyspark
    
    With JWT
    Bash
    export JWT_TOKEN=<your-jwt-token>
    pyspark --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}"
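
    To avoid pasting the token inline, it can also be read from a local file (the file path below is an assumption).
    Bash
    export JWT_TOKEN=$(cat /home/<user>/jwt.token)
    pyspark --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}"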
    
  • Run a Spark read/write
    Python
    # Read the CSV file from S3 and preview the first rows
    df = spark.read.csv("s3a://${S3_BUCKET}/${CSV_FILE}")
    df.show(5)

    # Write to a separate output prefix (OUTPUT_PATH is a placeholder) rather than
    # overwriting the same path that is being read from
    df.write.format("csv").mode("overwrite").save("s3a://${S3_BUCKET}/${OUTPUT_PATH}")
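
    The same read/write can also be submitted as a batch job; the script name below is a placeholder for a file containing the snippet above, and the JWT conf is only needed if your setup uses token-based authentication.
    Bash
    spark-submit --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}" read_write_example.py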
    
  • Connect to spark-shell
    Bash
    spark-shell
    
    With JWT
    Bash
    export JWT_TOKEN=<your-jwt-token>
    spark-shell --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}"
    
  • Run a Spark read/write
    Scala
    // Read the CSV file from S3 and preview the first rows
    val df = spark.read.csv("s3a://${S3_BUCKET}/${CSV_FILE}")
    df.show(5)

    // Write to a separate output prefix (OUTPUT_PATH is a placeholder) rather than
    // overwriting the same path that is being read from
    df.write.format("csv").mode("overwrite").save("s3a://${S3_BUCKET}/${OUTPUT_PATH}")
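
    The same statements can also be run non-interactively by passing a script file to spark-shell (the script name below is a placeholder for a file containing the snippet above).
    Bash
    spark-shell --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}" -i read_write_example.scala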
    
  • Connect to spark-sql
    Bash
    spark-sql
    
    With JWT
    Bash
    export JWT_TOKEN=<your-jwt-token>
    spark-sql --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}"
    
  • Run a Spark SQL query
    SQL
    -- Drop the database if it already exists, then recreate it at the given S3 location
    DROP DATABASE IF EXISTS priv_emr_hive CASCADE;

    CREATE DATABASE IF NOT EXISTS priv_emr_hive LOCATION 's3a://${S3_BUCKET}/${PATH_TO_DB}';
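
    A query can also be run non-interactively with the -e option (the SHOW DATABASES statement below is only an example; the JWT conf applies only if your setup uses token-based authentication).
    Bash
    spark-sql --conf "spark.hadoop.privacera.jwt.token.str=${JWT_TOKEN}" -e "SHOW DATABASES;"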
    
