Machine Learning Project Template

A production-ready template for machine learning projects built on modern Python tooling and best practices. It provides a complete ML pipeline covering data loading, preprocessing, modeling, evaluation, and visualization.

🚀 Features

Core ML Capabilities

Data Processing: Robust data loading, validation, and management utilities
Preprocessing Pipeline: Configurable preprocessing with missing value handling, outlier detection, encoding, and scaling
Model Management: Abstract base classes with concrete implementations for scikit-learn and Hugging Face transformer models
Transformer Support: Pre-trained and fine-tunable transformer models (BERT, DistilBERT, RoBERTa) for text classification
Evaluation Framework: Comprehensive metrics calculation and model comparison tools
Visualization Suite: Rich plotting capabilities for data exploration and model analysis
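
The abstract-base-class design listed under Model Management can be pictured roughly as follows. This is a minimal sketch, not the template's actual code: the names `BaseModel` and `MajorityClassModel` are hypothetical, and the real interface in src/model/base.py may differ.

```python
from abc import ABC, abstractmethod
from collections import Counter
from typing import Any, List, Sequence


class BaseModel(ABC):
    """Hypothetical common interface; the template's real base class may differ."""

    @abstractmethod
    def fit(self, X: Sequence[Any], y: Sequence[Any]) -> "BaseModel":
        """Learn from features X and targets y, then return self."""

    @abstractmethod
    def predict(self, X: Sequence[Any]) -> List[Any]:
        """Return one prediction per row of X."""


class MajorityClassModel(BaseModel):
    """Toy concrete model: always predicts the most frequent training label."""

    def __init__(self) -> None:
        self._fitted = False
        self.majority_: Any = None

    def fit(self, X: Sequence[Any], y: Sequence[Any]) -> "MajorityClassModel":
        # Counter.most_common(1) returns [(label, count)] for the top label.
        self.majority_ = Counter(y).most_common(1)[0][0]
        self._fitted = True
        return self

    def predict(self, X: Sequence[Any]) -> List[Any]:
        if not self._fitted:
            raise RuntimeError("Call fit() before predict().")
        return [self.majority_ for _ in X]
```

Because every concrete model shares the same `fit`/`predict` surface, evaluation and comparison code can treat scikit-learn wrappers and transformer wrappers interchangeably.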

Development & Production Tools

Modern Dependency Management: uv for fast, reliable package management
Code Quality: ruff for linting and formatting, mypy for type checking
Testing: pytest with coverage reporting via Codecov
CI/CD: GitHub Actions workflows for automated testing and deployment
Documentation: MkDocs with Material theme
Containerization: Docker support with multi-stage builds
Pre-commit Hooks: Automated code quality checks
Cross-version Testing: tox for testing across Python versions

Project Structure

mlproject-template/
├── src/                          # Source code
│   ├── common/                   # Common utilities and configuration
│   │   ├── config.py            # Configuration management with Pydantic
│   │   ├── logging_config.py    # Logging setup
│   │   └── utils.py             # Utility functions
│   ├── data_process/            # Data loading and processing
│   │   └── loader.py            # DataLoader class
│   ├── preprocess/              # Data preprocessing
│   │   ├── transformers.py      # Custom transformers
│   │   └── pipeline.py          # Preprocessing pipeline
│   ├── model/                   # Model implementations
│   │   ├── base.py              # Abstract base model
│   │   ├── sklearn_models.py    # Scikit-learn model wrappers
│   │   └── transformer_models.py # Hugging Face transformer models
│   ├── evaluate/                # Model evaluation
│   │   └── metrics.py           # Evaluation metrics and utilities
│   ├── visualization/           # Plotting and visualization
│   │   └── plots.py             # Visualization utilities
│   └── main.py                  # Main CLI interface
├── config/                      # Configuration files
│   └── default.yaml            # Default configuration
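├── notebooks/                   # Jupyter notebooks
│   └── 01_data_exploration.ipynb # Data exploration example
├── examples/                    # Example scripts
│   ├── transformer_example.py   # Transformer usage example
│   └── model_comparison.py      # Traditional ML vs. transformers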
├── data/                        # Data directories
│   ├── raw/                    # Raw data
│   └── processed/              # Processed data
├── tests/                       # Test files
├── docs/                        # Documentation
└── reports/                     # Generated reports and outputs

🛠️ Getting Started

Prerequisites

Python 3.10 or higher
uv package manager

Installation

Use this template by clicking the "Use this template" button on GitHub, or clone directly:
```
git clone https://github.com/avineshpvs/mlproject-template.git
cd mlproject-template
```

Install uv (if not already installed):

# macOS/Linux
curl -LsSf https://astral.sh/uv/0.7.8/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/0.7.8/install.ps1 | iex"

Set up the development environment:
```
make install
```
This will:
- Create a virtual environment
- Install all dependencies
- Set up pre-commit hooks
- Generate the lock file

Quick Start

Run the example ML pipeline:

# Run with default settings (creates sample data)
uv run python -m src.main

# List available models
uv run python -m src.main --list-models

# Run with specific model
uv run python -m src.main --model random_forest_classifier --output my_results

Try transformer models (requires additional dependencies):

# Install transformer dependencies
uv add transformers torch datasets

# Run transformer example
uv run python examples/transformer_example.py

# Compare traditional ML vs transformers
uv run python examples/model_comparison.py

Explore the data using the provided notebook:

jupyter lab notebooks/01_data_exploration.ipynb

Run tests and quality checks:

make check  # Code quality checks
make test   # Run tests with coverage

📊 Usage Examples

Data Loading and Processing

from src.data_process import DataLoader
from src.common import get_logger

logger = get_logger(__name__)

# Load data
data_loader = DataLoader()
df = data_loader.load_csv("your_data.csv")

# Get comprehensive data information
data_info = data_loader.get_data_info(df)
logger.info(f"Dataset shape: {data_info['shape']}")

# Split data
X_train, X_test, y_train, y_test = data_loader.train_test_split(df, "target_column")
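
How `DataLoader.train_test_split` works internally is not shown here, so the following is only an illustration of a seeded hold-out split using the standard library. The function name `split_rows` and its default values are assumptions made for this sketch.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_rows(
    rows: Sequence[T], test_size: float = 0.2, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle row indices with a fixed seed and split into (train, test)."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")

    indices = list(range(len(rows)))
    # A private Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(indices)

    n_test = round(len(rows) * test_size)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


train, test = split_rows(list(range(10)), test_size=0.2, seed=0)
print(len(train), len(test))
```

Calling the function twice with the same seed yields the same partition, which is the property a reproducible pipeline relies on.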

Preprocessing Pipeline

from src.preprocess import create_preprocessing_pipeline

# Configure preprocessing
config = {
    "missing_values": {"strategy": "mean", "enabled": True},
    "outliers": {"method": "iqr", "threshold": 1.5, "action": "clip", "enabled": True},
    "categorical_encoding": {"method": "onehot", "drop_first": True, "enabled": True},
    "scaling": {"method": "standard", "enabled": True}
}

# Create and use pipeline
preprocessor = create_preprocessing_pipeline(config)
X_processed = preprocessor.fit_transform(X_train)
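
To make the `"outliers"` settings above concrete, here is a small sketch of IQR-based clipping with a threshold of 1.5. It uses only the standard library and is an assumption about what `"method": "iqr"` with `"action": "clip"` means; the template's own transformer may compute quartiles differently.

```python
import statistics
from typing import List, Sequence


def iqr_clip(values: Sequence[float], threshold: float = 1.5) -> List[float]:
    """Clip values to [Q1 - threshold * IQR, Q3 + threshold * IQR]."""
    # method="inclusive" treats the data as the full population, so the
    # quartiles fall on or between observed values.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - threshold * iqr
    upper = q3 + threshold * iqr
    return [min(max(v, lower), upper) for v in values]


# Q1 = 3, Q3 = 7, IQR = 4, so the bounds are [-3, 13] and 100 is clipped.
print(iqr_clip([1, 2, 3, 4, 5, 6, 7, 8, 100]))
```

Clipping keeps every row, unlike dropping outliers, so the training set size stays unchanged.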

Model Training and Evaluation

from src.model import create_model
from src.evaluate import ModelEvaluator

# Create and train model
model = create_model("random_forest_classifier", n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)

# Evaluate model
evaluator = ModelEvaluator()
metrics = evaluator.evaluate_classification(y_test, y_pred, y_pred_proba)
print(f"Accuracy: {metrics['accuracy']:.4f}")
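
For readers who want to see what sits behind keys such as `metrics['accuracy']`, the sketch below computes the usual binary classification metrics from scratch. The function name and the returned key names other than `accuracy` are assumptions; `ModelEvaluator` may return a different structure.

```python
from typing import Dict, Sequence


def binary_classification_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for a binary problem."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


print(binary_classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]))
```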

Visualization

from src.visualization import MLVisualizer, plot_data_overview

# Create comprehensive data overview
figures = plot_data_overview(df, save_dir="plots")

# Custom visualizations
visualizer = MLVisualizer()
fig = visualizer.plot_correlation_matrix(df, save_path="correlation.png")

🔧 Configuration

The template uses a flexible configuration system with YAML files and Pydantic models:

from src.common import settings

# Access configuration
print(f"Project name: {settings.project_name}")
print(f"Data directory: {settings.data_dir}")
print(f"Random seed: {settings.data.random_seed}")

# Load custom configuration
from src.common.config import MLProjectSettings
custom_settings = MLProjectSettings.from_yaml("config/custom.yaml")
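
The nested layout implied by `settings.data.random_seed` can be sketched as below. To stay dependency-free this uses standard-library dataclasses as a stand-in for Pydantic and a plain dict in place of a parsed YAML file; the class names, the `from_dict` helper, and all default values are assumptions, not the template's `MLProjectSettings`.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict


@dataclass(frozen=True)
class DataSettings:
    random_seed: int = 42  # assumed default


@dataclass(frozen=True)
class ProjectSettings:
    project_name: str = "mlproject-template"  # assumed default
    data_dir: Path = Path("data")  # assumed default
    data: DataSettings = field(default_factory=DataSettings)

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "ProjectSettings":
        """Build settings from a dict, e.g. the result of parsing a YAML file."""
        seed = raw.get("data", {}).get("random_seed", 42)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(seed, bool) or not isinstance(seed, int):
            raise TypeError("data.random_seed must be an integer")
        return cls(
            project_name=str(raw.get("project_name", "mlproject-template")),
            data_dir=Path(raw.get("data_dir", "data")),
            data=DataSettings(random_seed=seed),
        )


custom = ProjectSettings.from_dict({"project_name": "demo", "data": {"random_seed": 7}})
print(custom.project_name, custom.data_dir, custom.data.random_seed)
```

Pydantic performs this kind of validation declaratively from the type annotations, which is why the template uses it instead of hand-written checks.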

🧪 Available Models

The template includes ready-to-use implementations for common ML algorithms:

Traditional ML Models

Linear Models: Linear Regression, Logistic Regression
Tree-based Models: Decision Trees, Random Forest
Support Vector Machines: SVR, SVC

Transformer Models (Text Classification)

BERT: Fine-tunable BERT models for text classification
DistilBERT: Lightweight, fast DistilBERT models
RoBERTa: Robustly optimized BERT models
Pre-trained Pipelines: Ready-to-use sentiment analysis and text classification

Installation for Transformer Models

# Install transformer dependencies
uv add transformers torch datasets

# Or install all optional dependencies
uv sync --all-extras

Usage Examples

from src.model import create_model

# Traditional ML model
model = create_model("random_forest_classifier", n_estimators=100)

# Transformer models
sentiment_model = create_model("sentiment_pipeline")
bert_model = create_model("distilbert_classifier", num_labels=3)

From the command line:

# List all available models
uv run python -m src.main --list-models

# Run transformer example
uv run python examples/transformer_example.py
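
A name-based factory like `create_model` is commonly built on a registry. The sketch below shows one way this could work; it reuses the name `create_model` for readability, but the registry, the `register_model` decorator, `list_models`, and `ConstantClassifier` are all hypothetical and not taken from the template's source.

```python
from typing import Any, Callable, Dict, List

# Maps a model name to a callable that builds the model.
_MODEL_REGISTRY: Dict[str, Callable[..., Any]] = {}


def register_model(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that registers a model class or factory under a name."""

    def decorator(factory: Callable[..., Any]) -> Callable[..., Any]:
        if name in _MODEL_REGISTRY:
            raise ValueError(f"Model '{name}' is already registered")
        _MODEL_REGISTRY[name] = factory
        return factory

    return decorator


def create_model(name: str, **kwargs: Any) -> Any:
    """Look up a registered model by name and build it with kwargs."""
    try:
        factory = _MODEL_REGISTRY[name]
    except KeyError:
        available = ", ".join(sorted(_MODEL_REGISTRY)) or "none"
        raise ValueError(f"Unknown model '{name}'. Available: {available}") from None
    return factory(**kwargs)


def list_models() -> List[str]:
    """Return registered model names in sorted order."""
    return sorted(_MODEL_REGISTRY)


@register_model("constant_classifier")
class ConstantClassifier:
    """Toy model used only to demonstrate registration."""

    def __init__(self, constant: int = 0) -> None:
        self.constant = constant

    def predict(self, X: List[Any]) -> List[int]:
        return [self.constant for _ in X]


print(list_models())
print(create_model("constant_classifier", constant=3).predict([1, 2]))
```

With this pattern, adding a new model means writing one decorated class; the CLI's `--list-models` output can then be generated from the registry.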

📈 Evaluation Metrics

Comprehensive evaluation capabilities:

Classification: Accuracy, Precision, Recall, F1-score, ROC-AUC, Confusion Matrix
Regression: MAE, MSE, RMSE, R², Residual Analysis
Model Comparison: Side-by-side comparison tools
Cross-validation: Built-in cross-validation support
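
As a reference for the regression metrics listed above, the following sketch computes MAE, MSE, RMSE, and R² with the standard library. The function name and dictionary keys are assumptions and need not match what the template's evaluator returns.

```python
import math
from typing import Dict, Sequence


def regression_metrics(
    y_true: Sequence[float], y_pred: Sequence[float]
) -> Dict[str, float]:
    """Compute MAE, MSE, RMSE, and R² for paired targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mse = ss_res / n

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    # R² is undefined when the targets are constant.
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")

    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": r2}


print(regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0]))
```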

🐳 Docker Support

Build and run with Docker:

# Build image
docker build -t ml-project .

# Run container
docker run -v "$(pwd)/data:/app/data" ml-project

📚 Documentation

Generate and serve documentation:

make docs      # Serve documentation locally
make docs-test # Test documentation build

🧹 Code Quality

The template enforces high code quality standards:

make check  # Run all quality checks

This includes:

Linting: Ruff for fast Python linting
Formatting: Automatic code formatting
Type Checking: mypy for static type analysis
Dependency Analysis: deptry for detecting unused or missing dependencies
Security: Pre-commit hooks for security checks

🚀 CI/CD Pipeline

Automated workflows for:

Quality Assurance: Code quality checks on every PR
Testing: Multi-version Python testing (3.9-3.13)
Documentation: Automatic documentation building
Docker: Container image building and testing
Coverage: Code coverage reporting with Codecov

🤝 Contributing

Fork the repository
Create a feature branch: git checkout -b feature-name
Make your changes and add tests
Run quality checks: make check && make test
Commit your changes: git commit -am 'Add feature'
Push to the branch: git push origin feature-name
Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgements

This project builds upon excellent work from:

💡 Why This Template?

Modern Python Tooling

uv: Reported by its maintainers to be 10-100x faster than pip, with built-in project and environment management
Ruff: Extremely fast Python linter and formatter
Pydantic: Runtime type checking and data validation
Type Safety: Full type annotations with mypy checking

Production Ready

Configuration Management: Flexible, environment-aware configuration
Logging: Structured logging with configurable levels
Error Handling: Comprehensive error handling and validation
Testing: High test coverage with pytest
Documentation: Auto-generated API documentation

ML Best Practices

Reproducibility: Seed management and deterministic pipelines
Modularity: Clean separation of concerns
Extensibility: Easy to add new models, transformers, and metrics
Monitoring: Built-in evaluation and comparison tools
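
Seed management, as mentioned under Reproducibility, often comes down to one helper called at the start of every run. The sketch below is an assumption about how such a helper might look; the name `set_global_seed` is hypothetical, and NumPy is seeded only if it happens to be installed.

```python
import random


def set_global_seed(seed: int) -> None:
    """Seed the random number generators used by the pipeline."""
    random.seed(seed)
    try:
        import numpy as np  # optional dependency in this sketch
    except ImportError:
        return
    np.random.seed(seed)


set_global_seed(123)
first = [random.random() for _ in range(3)]
set_global_seed(123)
second = [random.random() for _ in range(3)]
print(first == second)
```

Deep learning frameworks keep their own generators, so a project that trains transformer models would also need to seed those (for example with `torch.manual_seed`).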

Developer Experience

Fast Setup: One command installation and setup
IDE Support: Full type hints and IntelliSense support
Pre-commit Hooks: Automatic code quality enforcement
Rich CLI: Comprehensive command-line interface

Start building your next ML project with confidence! 🎯
