ETL Pipeline using Spark

Building an ETL Pipeline using Spark

You can find the full code here

An ETL pipeline using Spark that loads data from S3, processes it into analytics tables, and writes the results back to S3.
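The flow above can be sketched in PySpark. Note that the bucket paths, the column names, and the daily-count aggregation below are hypothetical placeholders, not taken from this repo:

```python
def run_etl(input_path="s3a://my-input-bucket/events/",      # hypothetical bucket
            output_path="s3a://my-output-bucket/tables/"):   # hypothetical bucket
    # pyspark is imported here so the sketch can be read without Spark installed
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("etl-pipeline").getOrCreate()

    # Extract: read raw JSON events from S3
    events = spark.read.json(input_path)

    # Transform: build one analytics table (daily event counts;
    # "ts" is an assumed timestamp column)
    daily_counts = (events
                    .withColumn("date", F.to_date("ts"))
                    .groupBy("date")
                    .count())

    # Load: write the analytics table back to S3 as Parquet
    daily_counts.write.mode("overwrite").parquet(output_path + "daily_counts/")
    spark.stop()
```

The `s3a://` scheme assumes the Hadoop S3A connector, which is what EMR's Spark uses for S3 access.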

AWS Credentials

AWS credentials are provided as environment variables in this repo:

export AWS_ACCESS_KEY_ID=<YOUR_AWS_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<YOUR_AWS_SECRET_ACCESS_KEY>
export AWS_DEFAULT_REGION=<YOUR_REGION>
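Spark's S3A connector and boto3 both pick these variables up automatically; as a sketch, job code that wants the values explicitly can read them like this (the helper name is ours, not from the repo):

```python
import os

def load_aws_creds():
    # Read the credentials exported above from the environment.
    # Returns None for any variable that is not set.
    return {
        "access_key": os.environ.get("AWS_ACCESS_KEY_ID"),
        "secret_key": os.environ.get("AWS_SECRET_ACCESS_KEY"),
        "region": os.environ.get("AWS_DEFAULT_REGION"),
    }
```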

EMR Cluster and PySpark Job

The EMR cluster setup was done through the AWS console; you can follow the tutorial at this link or here
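Once the cluster exists, the PySpark job can also be submitted as an EMR step programmatically instead of through the console. A sketch using boto3 (the step name and S3 script URI are placeholders):

```python
def build_etl_step(script_s3_uri):
    # EMR runs spark-submit on the cluster via command-runner.jar
    return {
        "Name": "pyspark-etl",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
        },
    }

def submit_etl_step(cluster_id, script_s3_uri):
    # boto3 is imported lazily; it is only needed when actually submitting,
    # and it reads the AWS credentials from the environment variables above
    import boto3
    emr = boto3.client("emr")
    return emr.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[build_etl_step(script_s3_uri)],
    )
```

Usage would look like `submit_etl_step("j-XXXXXXXX", "s3://my-bucket/etl.py")`, with the cluster ID taken from the EMR console.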

References

  • https://www.youtube.com/watch?v=gOT7El8rMws
  • https://www.youtube.com/watch?v=r-ig8zpP3EM