GitHub - abhisheksunny/Data-Lake-Using-Apache-Spark

Background Information

The music streaming startup, Sparkify, has grown their user base and song database and currently have their data in S3, in a directory of containing files of JSON logs of user activity on the app, and another directory with JSON metadata about the songs on their app.

Purpose

Converting song and log data files to properly formatted and segregated tables like structure in a Data Lake that would allow easy and instant query over the data for fetching meaningful results and insights. Apache Spark is used for data ingestion and ETL transformation. Amazon S3 is used for storing source JSON data and output parquet data. The python script file containing the Spark code runs on Amazon EMR, a cloud based computing instance modified for big-data use cases.

Technology Stack

Apache Spark
AWS S3
AWS EMR

Running Instruction

Update dl.cfg file with AWS keys having required S3 permission.
Update the variable output_data in etl.py with the destination data location. Execute python etl.py <MODE> for transforming data from JSON to structured parquet files. Currently supported values for MODE are local and S3.

Files

dl.cfg - Configuration file, containing AWS Keys.
etl.py - Python Script for transforming data.

Data Ingestion

Song File - The data within these JSON files is read into a dataframe on-the-go and then is used to populate songs and artists tables.

Log File - Just as song data, Log File is loaded from JSON Files and all the remaining tables(users, songplays, time) are populated.

Additional Design Information

Fact Table

songplays - records in log data associated with song plays

songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables

users - users in the app

user_id, first_name, last_name, gender, level

songs - songs in music database

song_id, title, artist_id, year, duration

artists - artists in music database

artist_id, name, location, lattitude, longitude

time - timestamps of records in songplays broken down into specific units

start_time, hour, day, week, month, year, weekday

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
data		data
README.md		README.md
dl.cfg		dl.cfg
etl.py		etl.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Background Information

Purpose

Technology Stack

Running Instruction

Files

Data Ingestion

Additional Design Information

Fact Table

Dimension Tables

About

Uh oh!

Releases

Packages

Languages

abhisheksunny/Data-Lake-Using-Apache-Spark

Folders and files

Latest commit

History

Repository files navigation

Background Information

Purpose

Technology Stack

Running Instruction

Files

Data Ingestion

Additional Design Information

Fact Table

Dimension Tables

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages