This data pipeline processes sales data, user data, and weather data to generate actionable insights and visualizations. The pipeline is designed to be executed through the main.py script, which orchestrates the entire flow from data extraction to analysis.
I used standard data engineering techniques to integrate multiple data sources (sales, user, and weather data) into a single pipeline. The pipeline automates data validation, transformation, and analysis, ensuring accurate insights and efficient processing.
This project shows how automating the extraction, validation, and merging of multiple data sources can drive actionable insights and improve decision-making.
Note 1 - Create a table "sales" in MySQL and load "AIQ - Data Engineer Assignment - Sales data.csv" into it using the MySQL Import Wizard.
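If you prefer a scripted alternative to the Import Wizard, a minimal sketch is shown below. It assumes the "sales" table already exists with columns matching the CSV header; the database name and credentials are placeholders.

```python
# Scripted alternative to the MySQL Import Wizard (sketch only).
# Assumes the "sales" table already exists and its columns match the CSV
# header; connection details below are placeholders.
import pandas as pd
import pymysql

df = pd.read_csv("AIQ - Data Engineer Assignment - Sales data.csv")

conn = pymysql.connect(host="localhost", user="root", password="***", database="sales_db")
try:
    with conn.cursor() as cur:
        cols = ", ".join(f"`{c}`" for c in df.columns)
        placeholders = ", ".join(["%s"] * len(df.columns))
        rows = df.astype(object).where(df.notna(), None).values.tolist()  # NaN -> NULL
        cur.executemany(f"INSERT INTO sales ({cols}) VALUES ({placeholders})", rows)
    conn.commit()
finally:
    conn.close()
```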
main.py is the main script that triggers the data pipeline. It performs the following steps:
1. Connects to the MySQL database.
2. Loads sales data from the database.
3. Validates the schema of the sales data.
4. Fetches user data from an API.
5. Validates the schema of the user data.
6. Merges the user data with the sales data.
7. Merges the result with weather data based on location.
8. Outputs the final merged data to a CSV file.

After the data processing, it initializes a DataAnalyzer to perform various analyses on the final data.
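For reference, a simplified sketch of how main.py might orchestrate these steps. Import paths and method names are illustrative assumptions, not the actual MyModel/DataAnalyzer API:

```python
# Illustrative orchestration flow for main.py; import paths and method
# names are assumptions, not the project's exact API.
from mymodel import MyModel
from analysis import DataAnalyzer

def run_pipeline():
    model = MyModel()

    sales_df = model.load_sales_data()          # read the "sales" table from MySQL
    model.validate_sales_schema(sales_df)       # schema check on sales data

    users_df = model.fetch_user_data()          # call the user API
    model.validate_user_schema(users_df)        # schema check on user data

    merged_df = model.merge_sales_and_users(sales_df, users_df)
    final_df = model.merge_with_weather(merged_df)   # join weather data by location

    final_df.to_csv("final_data.csv", index=False)

    analyzer = DataAnalyzer(final_df)
    analyzer.run_all()                          # analyses + visualizations

if __name__ == "__main__":
    run_pipeline()
```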
a. mymodel.py: Contains the MyModel class, responsible for interacting with the data sources, performing data operations, and managing the data pipeline.
b. analysis.py: Includes the DataAnalyzer class, which performs various analyses on the final merged data, such as calculating total sales amount per customer, average order quantity per product, and generating visualizations.
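As an illustration, the two analyses mentioned above roughly correspond to pandas aggregations like these (column names such as customer_id, product_id, quantity, and total_amount are assumed, not taken from the actual dataset):

```python
import pandas as pd

final_df = pd.read_csv("final_data.csv")

# Total sales amount per customer (column names are illustrative).
sales_per_customer = (
    final_df.groupby("customer_id")["total_amount"].sum().reset_index()
)

# Average order quantity per product.
avg_qty_per_product = (
    final_df.groupby("product_id")["quantity"].mean().reset_index()
)
```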
Contains utility functions for loading data. dataloader.py includes functions for making a connection to the MySQL database, loading sales data, and fetching user data from an API.
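A minimal sketch of what such loader functions could look like, assuming pymysql for the database connection and requests for the API call; the connection details and API URL are placeholders (the real values come from configs/config.py and main.py):

```python
# Sketch of dataloader.py-style helpers; credentials and the API URL are
# placeholders, and the requests library is assumed for the HTTP call.
import pandas as pd
import pymysql
import requests

def get_connection():
    return pymysql.connect(
        host="localhost", user="root", password="***", database="sales_db"
    )

def load_sales_data(conn) -> pd.DataFrame:
    # Read the whole "sales" table into a DataFrame.
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM sales")
        rows = cur.fetchall()
        columns = [desc[0] for desc in cur.description]
    return pd.DataFrame(rows, columns=columns)

def fetch_user_data(api_url: str = "https://example.com/api/users") -> list:
    # Fetch user records from the configured API endpoint.
    response = requests.get(api_url, timeout=10)
    response.raise_for_status()
    return response.json()
```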
Stores configuration files for the data pipeline, including config.py for general configurations and logging_config.yaml for logging settings.
Includes SQL queries for creating and loading data into the MySQL database.
Contains exception handling classes. exceptions.py defines custom exceptions for better error handling.
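For illustration, the kind of custom exceptions exceptions.py might define (class names are hypothetical):

```python
# Hypothetical custom exceptions for clearer error handling in the pipeline.
class DataPipelineError(Exception):
    """Base class for all pipeline errors."""

class SchemaValidationError(DataPipelineError):
    """Raised when sales or user data fails schema validation."""

class DataSourceError(DataPipelineError):
    """Raised when the database or the user API cannot be reached."""
```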
Contains utility functions and configurations, such as config.py for configuration settings and logger.py for logging setup.
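For example, a minimal logger.py could load configs/logging_config.yaml and hand it to the standard logging module. The function name, logger name, and the use of PyYAML are assumptions:

```python
# Sketch of a logger.py-style setup; loads the YAML logging config and
# configures the standard logging module from it (PyYAML assumed).
import logging
import logging.config
import yaml

def setup_logging(config_path: str = "configs/logging_config.yaml") -> logging.Logger:
    with open(config_path) as f:
        config = yaml.safe_load(f)
    logging.config.dictConfig(config)
    return logging.getLogger("data_pipeline")
```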
Includes unit tests for the data pipeline components.
Stores the output of the data analysis, including CSV files and visualizations.
Install Dependencies:
Make sure you have Python 3.9 installed.
Install the required dependencies using:
```bash
git clone https://github.com/himalayaashish/Data_Engineer_Project.git
cd Data_Engineer_Project
pip install -r requirements.txt
```
Database Configuration:
Configure the MySQL database connection details in configs/config.py.
API Configuration:
Update the API endpoint in main.py with the desired endpoint for fetching user data.
Run the Data Pipeline:
Execute the following command in the terminal:
```bash
python main.py
```
View Results:
The final merged data will be stored in final_data.csv. Analysis results and visualizations will be saved in the AnalysisOutputResults directory.
Unit Testing:
Run the unit tests using:
```bash
pytest
```
Dependencies:
pandas
pymysql
jsonschema
pytest (for testing)

Notes:
1- Any third-party tool such as Tableau or Power BI is suitable for data visualization.
2- Data analysis can also be done with plain SQL queries. Stored procedures can be created, and analysis.py can be further enhanced to simply call these stored procedures whenever required.
3- Data drift is currently handled mostly through schema checks; further work is needed to validate the value range of each field (see the sketch below).
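One way to extend the existing schema checks toward range validation is to add minimum/maximum constraints to the jsonschema definition. The field names and bounds below are illustrative, not the project's actual schema:

```python
# Sketch: extend schema validation with range checks using jsonschema.
# Field names and bounds are illustrative only.
from jsonschema import validate, ValidationError

sales_record_schema = {
    "type": "object",
    "properties": {
        "quantity": {"type": "integer", "minimum": 1, "maximum": 10000},
        "price": {"type": "number", "minimum": 0},
        "order_date": {"type": "string"},
    },
    "required": ["quantity", "price", "order_date"],
}

def validate_record(record: dict) -> bool:
    # Returns True if the record satisfies both type and range constraints.
    try:
        validate(instance=record, schema=sales_record_schema)
        return True
    except ValidationError:
        return False
```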


