Tech Challenge ML Engineer - Phase 3

Amazon Delivery Time Prediction

This project was developed as part of the Phase 3 Tech Challenge, focusing on Order Cycle Time (OCT) prediction using Amazon Delivery data.
The objective is to build a robust Machine Learning Operations (MLOps) pipeline that includes MLflow model serving connected with a Streamlit visualization app and a SimPy-based last-mile delivery simulator.

🚀 Application and Experiment Tracking

Access the live application and the full history of model experiments using the links below.

| Tool | Status | Link |
| --- | --- | --- |
| Active Forecasting App | 🟢 Deployed | Open App |
| MLflow Experiment Tracking | 📊 Monitoring | View Dashboard |

🏗️ Project Architecture

The architecture illustrates the data flow from raw ingestion, through feature engineering and model training, to the final deployment and tracking services.

[Figure: architecture diagram]

🎯 Project Objective

  • OTD Prediction = The total time elapsed from the moment in which the order is input into the system until its final delivery at the customer's location in minutes.
    This prediction is crucial for ensuring the deliveries meet the minimum Service Level Agreement (SLA) of 120 minutes, thereby directly impacting customer satisfaction and logistics efficiency.
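
As a minimal sketch of how the 120-minute SLA can be applied to predicted delivery times, the snippet below flags each delivery as on time or late and computes the on-time share. The function names and the sample values are hypothetical, and treating exactly 120 minutes as on time is an assumption, not something this README specifies.

```python
SLA_MINUTES = 120  # SLA threshold stated in this README


def is_on_time(cycle_time_minutes: float, sla_minutes: float = SLA_MINUTES) -> bool:
    """Return True when a delivery's cycle time is within the SLA (assumed inclusive)."""
    return cycle_time_minutes <= sla_minutes


def on_time_rate(cycle_times_minutes: list[float], sla_minutes: float = SLA_MINUTES) -> float:
    """Share of deliveries (0.0 to 1.0) completed within the SLA."""
    if not cycle_times_minutes:
        raise ValueError("at least one delivery is required")
    on_time = sum(is_on_time(t, sla_minutes) for t in cycle_times_minutes)
    return on_time / len(cycle_times_minutes)


# Hypothetical predicted cycle times in minutes (illustrative values only).
example_predictions = [95.0, 130.5, 120.0, 180.0]
example_rate = on_time_rate(example_predictions)
```

The same rate can be computed on actual delivery times to compare predicted and observed SLA compliance.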

📊 Dataset and Simulation

The Amazon Delivery Dataset provides a comprehensive view of last-mile logistics operations, including:

  • 43,632 deliveries across multiple cities
  • Order details and delivery agents information
  • Weather and traffic conditions
  • Delivery performance metrics

We also used SimPy for discrete-event simulation to model and analyze delivery-performance scenarios.
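
To illustrate the kind of question such a simulation answers, here is a simplified, standard-library-only queue model: orders arrive over time and wait for the next free delivery agent. It is a sketch of the idea, not the project's `otd_simulator.py` and not SimPy's API (SimPy would express the same model with an `Environment`, a `Resource` and process generators). The function name, arrival times and service times are hypothetical.

```python
import heapq


def simulate_deliveries(orders, n_agents):
    """Simulate first-come-first-served deliveries with a fixed pool of agents.

    orders: list of (order_time, service_minutes) tuples, sorted by order_time.
    Returns the cycle time (delivery time minus order time) of each order.
    """
    # Min-heap holding the time at which each agent becomes free again.
    agent_free_at = [0.0] * n_agents
    heapq.heapify(agent_free_at)

    cycle_times = []
    for order_time, service_minutes in orders:
        # Take the agent that frees up first; the order waits if that agent is busy.
        free_at = heapq.heappop(agent_free_at)
        start = max(order_time, free_at)
        delivered_at = start + service_minutes
        heapq.heappush(agent_free_at, delivered_at)
        cycle_times.append(delivered_at - order_time)
    return cycle_times


# Hypothetical scenario: three orders, each needing 60 minutes of agent time.
orders = [(0.0, 60.0), (10.0, 60.0), (20.0, 60.0)]
cycle_times = simulate_deliveries(orders, n_agents=2)
```

With two agents the third order has to wait for the first agent to return, so its cycle time grows; adding a third agent removes that wait. Comparing scenarios like these is the purpose of the simulator.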

📁 Repository Architecture

├── data/                       # Raw and processed data
│   ├── raw/                    # Original data from Kaggle
│   ├── processed/              # Cleaned and transformed data
│   └── simulation/             # Simulation results
├── notebooks/                  # Notebooks
│   ├── 02_EDA                  # Exploratory Data Analysis (EDA)
│   ├── 02_MODEL_VALIDATION     # ML Process Analysis
│   └── 02_SIMULATION.ipynb     # Simulation Data Analysis
├── reports/                    # Reports and figures
│   └── figures/models/         # Plots from model validation
├── src/                        # Project source code
│   ├── config/                 # Configuration files
│   ├── data/                   # Processing modules
│   ├── features/               # Feature Engineering modules
│   ├── modeling/               # ML training modules
│   ├── models/                 # Model files
│   ├── utils/                  # Utility modules
│   └── visualization/          # Visualizations
├── tests/                      # Project tests
├── app.py                      # Streamlit app
├── otd_simulator.py            # SimPy script
├── pyproject.toml              # Poetry config file
└── requirements.txt            # Requirements

🚀 Implemented Features

1. Data Pipeline

  • ✅ Data collection from Kaggle
  • ✅ Data processing and cleaning
  • ✅ Storage in organized structure
  • ✅ Categorical variable mappings
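
A categorical variable mapping can be sketched as a lookup table from category labels to integer codes. The column names, labels and codes below are hypothetical and are not taken from the project's `src/` modules; unknown or missing labels fall back to a sentinel code.

```python
# Hypothetical mapping tables; column names and labels are assumptions for illustration.
MAPPINGS = {
    "Traffic": {"Low": 0, "Medium": 1, "High": 2, "Jam": 3},
    "Weather": {"Sunny": 0, "Cloudy": 1, "Fog": 2, "Stormy": 3},
}
UNKNOWN_CODE = -1  # code used for labels missing from a mapping table


def encode_record(record: dict) -> dict:
    """Return a copy of the record with mapped columns replaced by integer codes."""
    encoded = dict(record)
    for column, mapping in MAPPINGS.items():
        # Strip stray whitespace so that "Jam " and "Jam" map to the same code.
        raw_label = str(record.get(column, "")).strip()
        encoded[column] = mapping.get(raw_label, UNKNOWN_CODE)
    return encoded


example = encode_record({"Traffic": "Jam ", "Weather": "Fog", "Agent_Age": 30})
```

Keeping the tables in one place means the same mapping can be reused at training time and at prediction time.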

2. Machine Learning Model

  • ✅ Exploratory Data Analysis (EDA)
  • ✅ Feature Engineering
  • ✅ Training with LightGBM
  • ✅ Experiment tracking with MLflow
  • ✅ Model versioning with MLflow

3. User Interface

  • ✅ Interactive dashboard in Streamlit
  • ✅ Data and results visualizations
  • ✅ Real-time prediction interface
  • ✅ SimPy last-mile simulation

🛠️ Technologies Used

  • Python 3.11.x
  • Pandas & NumPy for data manipulation
  • Statsmodels & SciPy for statistics
  • Scikit-learn for ML pipeline
  • LightGBM for ML modeling
  • SHAP for feature importances
  • MLflow for experiment tracking
  • Streamlit for interactive dashboard
  • Matplotlib, Seaborn & Plotly for visualizations
  • SimPy for simulation

⚙️ Setup and Installation

Prerequisites

  • Python 3.11.x
  • Conda installed globally
  • Poetry installed globally

Step by step:

Environment Variables Setup (Critical)

  1. Create an account at https://dagshub.com/ and generate a token from your profile settings.
  2. Set your environment variables (including the DagsHub credentials and the MLflow tracking URI) before installing dependencies.
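
A quick way to confirm the environment is ready is to check for the variables before starting the app. The sketch below assumes the project relies on MLflow's standard `MLFLOW_TRACKING_URI`, `MLFLOW_TRACKING_USERNAME` and `MLFLOW_TRACKING_PASSWORD` variables; this README does not list the exact names, so adjust the tuple to match your configuration. The helper name is hypothetical.

```python
import os

# Assumed variable names (MLflow's standard ones); the project may use others.
REQUIRED_ENV_VARS = (
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```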

Option 1: Installation via Conda and pip (Recommended)

  1. Clone the repository:
git clone https://github.com/IgorComune/tech_challenge_ml_engineer_phase3.git
cd tech_challenge_ml_engineer_phase3
  2. Create a conda environment:
conda create -n tech_challenge python=3.11.9
conda activate tech_challenge
  3. Install dependencies:
pip install -r requirements.txt
  4. Run the project (Streamlit dashboard):
streamlit run app.py

Option 2: Installation via Poetry

  1. Clone the repository:
git clone https://github.com/IgorComune/tech_challenge_ml_engineer_phase3.git
cd tech_challenge_ml_engineer_phase3
  2. Install dependencies and create the Poetry virtual environment:
# This command reads pyproject.toml, creates the project's virtual environment, and installs all dependencies.
poetry install
  3. Activate the environment:
# This command spawns a shell in the project's virtual environment.
poetry shell
  4. Run the project (Streamlit dashboard):
streamlit run app.py

📈 Results and Insights

Business Impact

  • The On-Time Delivery (OTD) rate against the 120-minute SLA improved from 41% to 70% when the predictive model was combined with corrective real-time actions.
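
The size of that improvement can be expressed in two ways, as the short calculation below shows: an absolute gain in percentage points and a relative gain over the baseline. Only the 41% and 70% figures come from this README; the function name is hypothetical.

```python
baseline_rate = 0.41  # OTD rate before, as reported above
improved_rate = 0.70  # OTD rate after, as reported above


def uplift(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute uplift in percentage points, relative uplift in percent)."""
    absolute_pp = (improved - baseline) * 100
    relative_pct = (improved - baseline) / baseline * 100
    return round(absolute_pp, 1), round(relative_pct, 1)


absolute_pp, relative_pct = uplift(baseline_rate, improved_rate)
```

Here the absolute uplift is 29 percentage points, and the relative uplift is about 70.7% over the baseline rate.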

Exploratory Analysis

  • Identification of key delivery patterns based on categorical features.
  • Development of new predictive features based on categorical patterns and statistical tests.
  • Correlation between weather conditions and delivery time.
  • Traffic impact on logistics performance.

Model Performance

  • LightGBM model for OTD prediction
  • Evaluation metrics available in MLflow
  • Feature importance visualization with SHAP
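
This README does not say which evaluation metrics are logged. As an assumption, the sketch below computes two metrics commonly used for a regression target measured in minutes, mean absolute error (MAE) and root mean squared error (RMSE). The sample values are hypothetical.

```python
import math


def mae(y_true: list[float], y_pred: list[float]) -> float:
    """Mean absolute error, in the same unit as the target (minutes)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: list[float], y_pred: list[float]) -> float:
    """Root mean squared error; penalizes large misses more than MAE does."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical actual and predicted delivery times in minutes.
actual = [100.0, 120.0, 150.0]
predicted = [110.0, 115.0, 140.0]

example_mae = mae(actual, predicted)
example_rmse = rmse(actual, predicted)
```

Both values are in minutes, which makes them easy to read against the 120-minute SLA.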

Interactive Dashboard

  • User-friendly interface for data analysis
  • Real-time predictions
  • Interactive result visualizations

Last-Mile Process Simulation

  • Implementation of real-time logistics policies (e.g., re-routing, agent reassignment), resulting in a measurable uplift in On-Time Delivery (OTD) performance.
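
One possible shape for an agent-reassignment rule is sketched below: keep the current agent while the predicted delivery time is within the SLA, and otherwise switch to the agent with the lowest prediction. This is an illustrative rule under stated assumptions, not the logic in `otd_simulator.py`; the function name, agent identifiers and predicted times are hypothetical.

```python
SLA_MINUTES = 120  # SLA threshold stated in this README


def choose_agent(predicted_minutes_by_agent: dict, current_agent: str,
                 sla_minutes: float = SLA_MINUTES) -> str:
    """Keep the current agent unless its prediction breaches the SLA.

    On a predicted breach, return the agent with the lowest predicted time,
    which may still be the current agent if no one is faster.
    """
    current_eta = predicted_minutes_by_agent[current_agent]
    if current_eta <= sla_minutes:
        return current_agent
    return min(predicted_minutes_by_agent, key=predicted_minutes_by_agent.get)


# Hypothetical predictions (minutes) for one order across three agents.
predictions = {"agent_a": 150.0, "agent_b": 100.0, "agent_c": 110.0}
chosen = choose_agent(predictions, current_agent="agent_a")
```

Reassigning only on a predicted breach avoids churn for orders that are already on track.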

📁 Data Structure

Processed Data:

  • amazon_delivery_processed.csv - Complete processed dataset

Models:

  • models:/LightGBM_Ajustado/Production - Versioned LightGBM
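
The `models:/<name>/<stage>` form above is an MLflow Model Registry URI. A minimal sketch of loading that model with `mlflow.pyfunc.load_model` follows; the helper names are hypothetical, the tracking URI is a parameter you must supply, and the call needs MLflow installed plus valid credentials for the tracking server.

```python
def build_model_uri(name: str, stage: str) -> str:
    """Build a Model Registry URI of the form models:/<name>/<stage>."""
    return f"models:/{name}/{stage}"


MODEL_URI = build_model_uri("LightGBM_Ajustado", "Production")


def load_production_model(tracking_uri: str):
    """Load the registered model as a generic pyfunc model.

    MLflow is imported lazily so this module can be imported without it.
    """
    import mlflow.pyfunc

    mlflow.set_tracking_uri(tracking_uri)
    return mlflow.pyfunc.load_model(MODEL_URI)
```

A loaded pyfunc model exposes `predict`, which accepts a pandas DataFrame whose columns match the features used in training.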

🧪 Testing and Validation

  • Test notebooks available in tests/

🀝 Contribution

This project was developed as part of Phase 3 Tech Challenge. Feedback and suggestions are welcome!

📄 License

This project is under the license specified in the LICENSE file.


Project: Tech Challenge ML Engineer - Phase 3
Institution: Pós-Tech


"Transforming data into insights, insights into value."
