This repository contains data modeling, ingestion, and analysis work for structuring and optimizing schema design for Analytics Take Home Assessment.
This assignment focuses on designing and implementing a structured data model for the given nested JSON data. The data includes information about users, brands, and receipts with complex nested structures.
The data injestion, modeling and querying is done on Snowflake platform. The SQL Queries in the directory in the exact order can be run on Snowflake platform to get the results seen with my work.
- brands.json: Contains brand information in the receipts
- receipts.json: Contains receipt data with nested line items, rewards information, and user references
- users.json: Contains user account information and metadata
The repository demonstrates a multi-layered data modeling approach:
- Raw Data Ingestion: Scripts in the
data-ingestionfolder handle the loading of raw JSON data into Snowflake stage - Data Modeling: SQL scripts in the
data-modelingfolder transform nested JSON into structured relational tables - Data Quality: Scripts in the
data-quality-issuefolder identify and address data quality issues - Analytics: The
questionsfolder contains queries that answer the questions making use of the designed relational data model
- Nested JSON structures into relational tables
- Handling of complex one-to-many relationships
- Optimized storage and compute with proper primary and foreign key relationships
- Data quality validation
- Stakeholder discussion
- Review the Entity Relationship Diagram (
erd.jpg) to understand the data model - Execute the ingestion scripts in the
data-ingestionfolder to load raw data - Run the modeling scripts in the
data-modelingfolder to create structured tables - Review the queries in the
questionsfolder to get answers about business questions
├── data-ingestion/ # Raw data files and ingestion scripts
│ ├── brands.json
│ ├── brands.json.gz
│ ├── load.sql # SQL scripts for data loading
│ ├── receipts.json
│ ├── receipts.json.gz
│ ├── users.json
│ └── users.json.gz
│
├── data-modeling/ # Data modeling artifacts and scripts for gold layer (production-level data)
│ ├── brands-gold-table.sql
│ ├── erd.jpg # Entity Relationship Diagram
│ ├── receipts-gold-table.sql
│ └── user-details-gold-table.sql
│
├── data-quality-issue/ # Scripts and documentation to identify data quality improvements and issues
│ ├── data-quality.sql
│ └── task.md
│
├── questions/ # Answering business questions leveraging data model
│ ├── questions.sql
│ └── task.md
│
└── stakeholder-communication/ # Discussion points with stakeholders
├── message.md
└── task.md
