In this project, I build a simple data pipeline following the ELT (extract, load, transform) model on the Brazilian E-Commerce dataset. The pipeline processes and transforms the data to serve reporting, in-depth analysis, and decision support for the Data Analyst team.
- Data Source: The project uses the Brazilian E-Commerce public dataset by Olist, downloaded from kaggle.com in `.csv` format.
- 5 of the csv files are loaded into `PostgreSQL`, which is treated as a data source (a hedged sketch of this seeding step follows this list).
- The remaining 4 csv files are extracted directly.
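The project seeds PostgreSQL via the Makefile targets described later (`make psql_create`, `make psql_import`). As a hypothetical alternative, the sketch below loads one of the Olist CSV files into PostgreSQL with Polars; the file path, connection URI, and table name are assumptions, and the exact `write_database` signature may differ slightly between Polars versions.

```python
# Hypothetical alternative to `make psql_import`: seed one Olist CSV into PostgreSQL
# with Polars. The file path, connection URI, and table name are assumptions.
import polars as pl

PG_URI = "postgresql://admin:admin@localhost:5432/olist"  # assumed credentials

def seed_postgres(csv_path: str, table: str) -> None:
    """Read a raw Olist CSV and write it into a PostgreSQL table."""
    df = pl.read_csv(csv_path)
    # Requires SQLAlchemy and a PostgreSQL driver (e.g. psycopg2) to be installed.
    df.write_database(table, connection=PG_URI, if_table_exists="replace")

if __name__ == "__main__":
    seed_postgres("data/olist_orders_dataset.csv", "olist_orders_dataset")
```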
- Extract Data: Data is extracted with `Polars` as a `DataFrame` from the `PostgreSQL` database and the `CSV` files.
- Load Data: After extracting data from the two sources above, we load it into the `raw` layer in `Snowflake` from the `Polars` `DataFrame` (see the sketch after this list).
- Transform Data: After loading the data, we run transformations with `dbt` on `Snowflake` to create `dimension` and `fact` tables in the `staging` layer and calculate aggregates in the `mart` layer.
- Serving: Data is served for `reporting`, `analysis`, and `decision support` using `Metabase` and `Apache Superset`.
- Packaging and orchestration: The entire project is packaged and orchestrated with `Docker` and `Dagster`.
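The following is a minimal, hedged sketch of the Extract and Load steps, not the repository's actual code: connection URIs, table names, and the raw-layer naming are assumptions, and `pl.read_database_uri` requires an extra driver such as `connectorx`.

```python
# Minimal sketch of the Extract and Load steps described above (assumed names/URIs).
import polars as pl

PG_URI = "postgresql://admin:admin@localhost:5432/olist"        # assumed source DB
SNOWFLAKE_URI = "snowflake://user:password@account/OLIST/RAW"   # assumed warehouse

def extract_from_postgres(table: str) -> pl.DataFrame:
    # Extract: pull a whole source table from PostgreSQL as a Polars DataFrame.
    return pl.read_database_uri(f"SELECT * FROM {table}", PG_URI)

def extract_from_csv(path: str) -> pl.DataFrame:
    # Extract: the remaining files are read straight from disk.
    return pl.read_csv(path)

def load_to_raw(df: pl.DataFrame, table: str) -> None:
    # Load: write the DataFrame into the raw layer in Snowflake.
    df.write_database(table, connection=SNOWFLAKE_URI, if_table_exists="replace")

if __name__ == "__main__":
    load_to_raw(extract_from_postgres("olist_orders_dataset"), "raw_orders")
    load_to_raw(extract_from_csv("data/olist_products_dataset.csv"), "raw_products")
```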
- olist_geolocation_dataset: This dataset has information about Brazilian zip codes and their lat/lng coordinates.
- olist_customers_dataset: This dataset has information about the customers and their locations.
- olist_order_items_dataset: This dataset includes data about the items purchased within each order.
- olist_order_payments_dataset: This dataset includes data about the orders' payment options.
- olist_order_reviews_dataset: This dataset includes data about the reviews made by the customers.
- olist_orders_dataset: This is the core dataset. From each order you can reach all the other information (a small join sketch follows this list).
- olist_products_dataset: This dataset includes data about the products sold by Olist.
- olist_sellers_dataset: This dataset includes data about the sellers that fulfilled orders made at Olist.
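To illustrate how the files relate, the sketch below joins orders with their purchased items through the shared `order_id` key. Column names follow the public Kaggle schema and should be treated as assumptions here.

```python
# Illustrative only: the orders table links to the other files through shared keys.
import polars as pl

orders = pl.read_csv("data/olist_orders_dataset.csv")
items = pl.read_csv("data/olist_order_items_dataset.csv")

# Each order fans out to one row per purchased item via order_id.
order_items = orders.join(items, on="order_id", how="inner")
print(order_items.select(["order_id", "customer_id", "product_id", "price"]).head())
```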
The Graph Lineage (Dagster) in this project includes 4 layers:
- source layer: This layer contains `assets` that collect data from `PostgreSQL` and `CSV` files as `Polars` `DataFrame`s.
- raw layer: This layer contains `assets` that load data from the `Polars` `DataFrame`s into the `Snowflake` warehouse under the `raw` schema.
- staging layer: This layer contains assets that transform data from the `raw` schema and put it into the `staging` schema.
- mart layer: This layer contains `assets` that compute aggregates from data in the `staging` schema and put the results into the `mart` schema. (A hedged Dagster sketch of the first two layers follows this list.)
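Below is a minimal Dagster sketch of the source-to-raw dependency described above. Asset names, group names, and connection URIs are illustrative assumptions, not the project's actual definitions.

```python
# Minimal Dagster sketch of the source -> raw layers (assumed names and URIs).
import polars as pl
from dagster import Definitions, asset

@asset(group_name="source")
def source_orders() -> pl.DataFrame:
    # source layer: collect data from PostgreSQL as a Polars DataFrame.
    return pl.read_database_uri(
        "SELECT * FROM olist_orders_dataset",
        "postgresql://admin:admin@localhost:5432/olist",  # assumed URI
    )

@asset(group_name="raw")
def raw_orders(source_orders: pl.DataFrame) -> None:
    # raw layer: load the upstream DataFrame into the Snowflake raw schema.
    source_orders.write_database(
        "raw_orders",
        connection="snowflake://user:password@account/OLIST/RAW",  # assumed URI
        if_table_exists="replace",
    )

defs = Definitions(assets=[source_orders, raw_orders])
```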
`PostgreSQL` · `Polars` · `dbt` · `Dagster` · `Snowflake` · `Docker` · `Metabase` · `Apache Superset`
Here's what you can do with this project:
- You can completely change the logic or create new `assets` in the `data pipeline` as you wish, and perform `aggregate calculations` on the `assets` in the `pipeline` for your own purposes (a hedged example of such an aggregate follows this list).
- You can also create new `data charts` or change the existing `charts` as you like, with a wide variety of `chart types` available in `Metabase` and `Apache Superset`.
- You can also create new `dashboards` or change my existing ones as you like.
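As one hypothetical example of an extra mart-layer aggregate, the sketch below computes daily revenue per seller with Polars; the column names come from the public dataset schema and are assumptions here, and the `group_by` API may be `groupby` in older Polars versions.

```python
# Hypothetical extra aggregate: daily revenue per seller (column names assumed).
import polars as pl

def daily_revenue_per_seller(order_items: pl.DataFrame, orders: pl.DataFrame) -> pl.DataFrame:
    joined = order_items.join(orders, on="order_id", how="inner")
    return (
        joined.with_columns(
            pl.col("order_purchase_timestamp").str.to_datetime().dt.date().alias("order_date")
        )
        .group_by("seller_id", "order_date")
        .agg(pl.col("price").sum().alias("revenue"))
        .sort("seller_id", "order_date")
    )
```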
- Add more `data sources` to increase data richness.
- Consider other `data warehouses` besides `Snowflake`, such as `Amazon Redshift` or `Google BigQuery`.
- Perform more `cleaning` and `optimization` processing of the data.
- Perform more advanced `statistics`, `analysis`, and `calculations`.
- Check out other popular `data orchestration` tools like `Apache Airflow`.
- Separate `dbt` into its own service (a separate `container`) in `docker` when the project expands.
- Learn about `dbt packages` like `dbt-labs/dbt_utils` to make the `transformation` process faster and more efficient.
To run the project in your local environment, follow these steps:
- Run `git clone https://github.com/longNguyen010203/ECommerce-ELT-Pipeline.git` to clone the repository to your local machine.
- Run `make build` to build the images from the Dockerfile.
- Run `make up` to pull images from Docker Hub and launch the services.
- Run `make psql_create` to create the tables and schema in PostgreSQL.
- Run `make psql_import` to load data from the CSV files into PostgreSQL.
- Open http://localhost:3001 and click the `Materialize all` button to run the pipeline.
- Open https://app.snowflake.com and log in to check and monitor the updated data.
- Open http://localhost:3030 to see the charts and dashboards.


