avineshpvs/mlproject-template

Machine Learning Project Template

A comprehensive, production-ready template for machine learning projects with modern Python tooling and best practices. This template provides a complete ML pipeline with data processing, preprocessing, modeling, evaluation, and visualization capabilities.

🚀 Features

Core ML Capabilities

  • Data Processing: Robust data loading, validation, and management utilities
  • Preprocessing Pipeline: Configurable preprocessing with missing value handling, outlier detection, encoding, and scaling
  • Model Management: Abstract base classes with concrete implementations for scikit-learn and Hugging Face transformer models
  • Transformer Support: Pre-trained and fine-tunable transformer models (BERT, DistilBERT, RoBERTa) for text classification
  • Evaluation Framework: Comprehensive metrics calculation and model comparison tools
  • Visualization Suite: Rich plotting capabilities for data exploration and model analysis

Development & Production Tools

  • Modern Dependency Management: uv for fast, reliable package management
  • Code Quality: ruff for linting and formatting, mypy for type checking
  • Testing: pytest with coverage reporting via codecov
  • CI/CD: GitHub Actions workflows for automated testing and deployment
  • Documentation: MkDocs with Material theme
  • Containerization: Docker support with multi-stage builds
  • Pre-commit Hooks: Automated code quality checks
  • Cross-version Testing: tox for testing across Python versions

Project Structure

mlproject-template/
├── src/                          # Source code
│   ├── common/                   # Common utilities and configuration
│   │   ├── config.py            # Configuration management with Pydantic
│   │   ├── logging_config.py    # Logging setup
│   │   └── utils.py             # Utility functions
│   ├── data_process/            # Data loading and processing
│   │   └── loader.py            # DataLoader class
│   ├── preprocess/              # Data preprocessing
│   │   ├── transformers.py      # Custom transformers
│   │   └── pipeline.py          # Preprocessing pipeline
│   ├── model/                   # Model implementations
│   │   ├── base.py              # Abstract base model
│   │   ├── sklearn_models.py    # Scikit-learn model wrappers
│   │   └── transformer_models.py # Hugging Face transformer models
│   ├── evaluate/                # Model evaluation
│   │   └── metrics.py           # Evaluation metrics and utilities
│   ├── visualization/           # Plotting and visualization
│   │   └── plots.py             # Visualization utilities
│   └── main.py                  # Main CLI interface
├── config/                      # Configuration files
│   └── default.yaml            # Default configuration
├── notebooks/                   # Jupyter notebooks
│   └── 01_data_exploration.ipynb # Data exploration example
├── data/                        # Data directories
│   ├── raw/                    # Raw data
│   └── processed/              # Processed data
├── tests/                       # Test files
├── docs/                        # Documentation
└── reports/                     # Generated reports and outputs

🛠️ Getting Started

Prerequisites

  • Python 3.10 or higher
  • uv package manager

Installation

  1. Use this template by clicking the "Use this template" button on GitHub, or clone directly:

    git clone https://github.com/avineshpvs/mlproject-template.git
    cd mlproject-template
  2. Install uv (if not already installed):

    # macOS/Linux
    curl -LsSf https://astral.sh/uv/0.7.8/install.sh | sh
    
    # Windows
    powershell -c "irm https://astral.sh/uv/0.7.8/install.ps1 | iex"
  3. Set up the development environment:

    make install

    This will:

    • Create a virtual environment
    • Install all dependencies
    • Set up pre-commit hooks
    • Generate the lock file

Quick Start

  1. Run the example ML pipeline:

    # Run with default settings (creates sample data)
    uv run python -m src.main
    
    # List available models
    uv run python -m src.main --list-models
    
    # Run with specific model
    uv run python -m src.main --model random_forest_classifier --output my_results
  2. Try transformer models (requires additional dependencies):

    # Install transformer dependencies
    uv add transformers torch datasets
    
    # Run transformer example
    uv run python examples/transformer_example.py
    
    # Compare traditional ML vs transformers
    uv run python examples/model_comparison.py
  3. Explore the data using the provided notebook:

    jupyter lab notebooks/01_data_exploration.ipynb
  4. Run tests and quality checks:

    make check  # Code quality checks
    make test   # Run tests with coverage
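
The commands in step 1 imply a small argparse-style interface. The following is a minimal sketch of such a CLI, assuming only the three flags shown above (`--model`, `--output`, `--list-models`). The model names are taken from this README; `AVAILABLE_MODELS`, `build_parser`, and the defaults are hypothetical and not copied from `src/main.py`.

```python
import argparse
from typing import Optional, Sequence

# Model names mentioned in this README; the real registry may contain more.
AVAILABLE_MODELS = [
    "random_forest_classifier",
    "distilbert_classifier",
    "sentiment_pipeline",
]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the flags used in the Quick Start commands."""
    parser = argparse.ArgumentParser(prog="src.main", description="Run the ML pipeline")
    parser.add_argument(
        "--model",
        default="random_forest_classifier",  # assumed default
        choices=AVAILABLE_MODELS,
        help="name of the model to train",
    )
    parser.add_argument("--output", default="results", help="output directory (assumed default)")
    parser.add_argument(
        "--list-models", action="store_true", help="print available models and exit"
    )
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Parse arguments and either list the models or report the chosen run."""
    args = build_parser().parse_args(argv)
    if args.list_models:
        for name in AVAILABLE_MODELS:
            print(name)
        return 0
    print(f"Training {args.model}; writing results to {args.output}/")
    return 0
```

Calling `main(["--model", "random_forest_classifier", "--output", "my_results"])` mirrors the last command in step 1. Restricting `--model` with `choices` makes a misspelled model name fail at parse time rather than midway through a run.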

📊 Usage Examples

Data Loading and Processing

from src.data_process import DataLoader
from src.common import get_logger

logger = get_logger(__name__)

# Load data
data_loader = DataLoader()
df = data_loader.load_csv("your_data.csv")

# Get comprehensive data information
data_info = data_loader.get_data_info(df)
logger.info(f"Dataset shape: {data_info['shape']}")

# Split data
X_train, X_test, y_train, y_test = data_loader.train_test_split(df, "target_column")
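
`train_test_split(df, "target_column")` presumably separates the target column from the features and holds out part of the rows. The sketch below shows that idea in pure Python, assuming rows are plain dicts. `split_rows`, `test_size`, and `seed` are hypothetical names for illustration, not the `DataLoader` API.

```python
import random


def split_rows(rows, target, test_size=0.2, seed=42):
    """Shuffle rows with a fixed seed, then split into train/test features and targets."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    indices = list(range(len(rows)))
    # A local Random instance keeps the split reproducible without touching global state.
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def features(i):
        # Everything except the target column counts as a feature.
        return {k: v for k, v in rows[i].items() if k != target}

    X_train = [features(i) for i in train_idx]
    X_test = [features(i) for i in test_idx]
    y_train = [rows[i][target] for i in train_idx]
    y_test = [rows[i][target] for i in test_idx]
    return X_train, X_test, y_train, y_test


rows = [{"x": i, "target_column": i % 2} for i in range(10)]
X_train, X_test, y_train, y_test = split_rows(rows, "target_column")
print(len(X_train), len(X_test))  # 8 2
```

Because the shuffle is seeded, repeated calls with the same arguments return the same partition, which is what makes downstream metrics comparable between runs.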

Preprocessing Pipeline

from src.preprocess import create_preprocessing_pipeline

# Configure preprocessing
config = {
    "missing_values": {"strategy": "mean", "enabled": True},
    "outliers": {"method": "iqr", "threshold": 1.5, "action": "clip", "enabled": True},
    "categorical_encoding": {"method": "onehot", "drop_first": True, "enabled": True},
    "scaling": {"method": "standard", "enabled": True}
}

# Create and use pipeline
preprocessor = create_preprocessing_pipeline(config)
X_processed = preprocessor.fit_transform(X_train)
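
The `outliers` entry above (`"method": "iqr"`, `"threshold": 1.5`, `"action": "clip"`) reads like the standard interquartile-range rule. Here is a sketch of that rule for a single numeric column, assuming the threshold multiplies the IQR. `clip_iqr` is a hypothetical helper, and the template's own transformer may compute quartiles differently.

```python
import statistics


def clip_iqr(values, threshold=1.5):
    """Clip values to the range [Q1 - threshold * IQR, Q3 + threshold * IQR]."""
    # "inclusive" treats the data as the full population when interpolating quartiles.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - threshold * iqr, q3 + threshold * iqr
    return [min(max(v, lower), upper) for v in values]


column = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
print(clip_iqr(column))  # [1.0, 2.0, 3.0, 4.0, 5.0, 8.5]
```

Here Q1 is 2.25 and Q3 is 4.75, so the IQR is 2.5 and the upper fence is 4.75 + 1.5 × 2.5 = 8.5. Clipping keeps the row and caps the extreme value, whereas a "remove" action would drop it.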

Model Training and Evaluation

from src.model import create_model
from src.evaluate import ModelEvaluator

# Create and train model
model = create_model("random_forest_classifier", n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)

# Evaluate model
evaluator = ModelEvaluator()
metrics = evaluator.evaluate_classification(y_test, y_pred, y_pred_proba)
print(f"Accuracy: {metrics['accuracy']:.4f}")
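
For reference, the headline classification metrics can be derived from confusion-matrix counts. The sketch below covers the binary case only. `binary_metrics` and its return keys are assumptions for illustration, not the `ModelEvaluator` API.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(f"Accuracy: {metrics['accuracy']:.4f}")  # Accuracy: 0.7500
```

With three true positives, three true negatives, one false positive, and one false negative, all four metrics come out to 0.75 in this example.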

Visualization

from src.visualization import MLVisualizer, plot_data_overview

# Create comprehensive data overview
figures = plot_data_overview(df, save_dir="plots")

# Custom visualizations
visualizer = MLVisualizer()
fig = visualizer.plot_correlation_matrix(df, save_path="correlation.png")
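
A correlation-matrix plot is a heatmap of pairwise Pearson coefficients between numeric columns. The sketch below computes those coefficients in pure Python to show what such a plot encodes. `pearson` and `correlation_matrix` are hypothetical helpers, and `MLVisualizer` may rely on a different implementation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict mapping each pair of column names to its correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


columns = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],  # perfectly increasing with x
    "z": [4.0, 3.0, 2.0, 1.0],  # perfectly decreasing with x
}
matrix = correlation_matrix(columns)
print(round(matrix["x"]["y"], 6), round(matrix["x"]["z"], 6))
```

Values near 1 or -1 flag strongly related features, which is usually the first thing to look for in the heatmap. This sketch does not handle constant columns, whose spread is zero.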

🔧 Configuration

The template uses a flexible configuration system with YAML files and Pydantic models:

from src.common import settings
from src.common.config import MLProjectSettings

# Access configuration
print(f"Project name: {settings.project_name}")
print(f"Data directory: {settings.data_dir}")
print(f"Random seed: {settings.data.random_seed}")

# Load custom configuration
custom_settings = MLProjectSettings.from_yaml("config/custom.yaml")
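
The pattern here is typed settings with defaults that a YAML file can override. It can be sketched with standard-library dataclasses as a stand-in for Pydantic. The field names `project_name`, `data_dir`, and `data.random_seed` come from the example above; `DataSettings`, `ProjectSettings`, `from_dict`, and the default values are hypothetical.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict


@dataclass
class DataSettings:
    random_seed: int = 42  # assumed default


@dataclass
class ProjectSettings:
    project_name: str = "mlproject-template"  # assumed default
    data_dir: Path = Path("data")
    data: DataSettings = field(default_factory=DataSettings)

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "ProjectSettings":
        """Build settings from a parsed YAML mapping; missing keys keep their defaults."""
        raw = dict(raw)  # copy so the caller's mapping is not mutated
        data = DataSettings(**raw.pop("data", {}))
        if "data_dir" in raw:
            raw["data_dir"] = Path(raw["data_dir"])
        return cls(data=data, **raw)


# The mapping a YAML parser would return for:
#   project_name: churn-model
#   data:
#     random_seed: 7
custom = ProjectSettings.from_dict({"project_name": "churn-model", "data": {"random_seed": 7}})
print(custom.project_name, custom.data.random_seed, custom.data_dir)
```

Unlike Pydantic, plain dataclasses do not validate or coerce field types at runtime, so this sketch only illustrates the layering of defaults and overrides.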

🧪 Available Models

The template includes ready-to-use implementations for common ML algorithms:

Traditional ML Models

  • Linear Models: Linear Regression, Logistic Regression
  • Tree-based Models: Decision Trees, Random Forest
  • Support Vector Machines: SVR, SVC

Transformer Models (Text Classification)

  • BERT: Fine-tunable BERT models for text classification
  • DistilBERT: Lightweight, fast DistilBERT models
  • RoBERTa: Robustly optimized BERT models
  • Pre-trained Pipelines: Ready-to-use sentiment analysis and text classification

Installation for Transformer Models

# Install transformer dependencies
uv add transformers torch datasets

# Or install all optional dependencies
uv sync --all-extras

Usage Examples

from src.model import create_model

# Traditional ML model
model = create_model("random_forest_classifier", n_estimators=100)

# Transformer models
sentiment_model = create_model("sentiment_pipeline")
bert_model = create_model("distilbert_classifier", num_labels=3)

From the command line:

# List all available models
uv run python -m src.main --list-models

# Run transformer example
uv run python examples/transformer_example.py
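
`create_model(name, **kwargs)` and `--list-models` suggest a name-to-class registry behind an abstract base class. Below is a minimal sketch of that pattern. `BaseModel`, `register_model`, `MajorityClassifier`, and `list_models` are hypothetical names, and the toy model stands in for the real scikit-learn and transformer wrappers.

```python
from abc import ABC, abstractmethod
from collections import Counter

_REGISTRY = {}  # maps model name -> model class


class BaseModel(ABC):
    """Interface every registered model is expected to implement."""

    @abstractmethod
    def fit(self, X, y): ...

    @abstractmethod
    def predict(self, X): ...


def register_model(name):
    """Class decorator that records a model class under a string name."""
    def decorator(cls):
        _REGISTRY[name] = cls
        return cls
    return decorator


@register_model("majority_classifier")
class MajorityClassifier(BaseModel):
    """Toy model: always predicts the most frequent training label."""

    def __init__(self):
        self.label_ = None

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        if self.label_ is None:
            raise RuntimeError("call fit() before predict()")
        return [self.label_ for _ in X]


def list_models():
    """Return the registered model names in sorted order."""
    return sorted(_REGISTRY)


def create_model(name, **kwargs):
    """Instantiate a registered model, forwarding keyword arguments to its constructor."""
    if name not in _REGISTRY:
        raise ValueError(f"Unknown model {name!r}; available: {list_models()}")
    return _REGISTRY[name](**kwargs)


model = create_model("majority_classifier").fit([[0], [1], [2]], ["a", "b", "b"])
print(model.predict([[5], [6]]))  # ['b', 'b']
```

With a registry like this, adding a model means defining one decorated class, and the CLI listing stays in sync without further changes.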

📈 Evaluation Metrics

Comprehensive evaluation capabilities:

  • Classification: Accuracy, Precision, Recall, F1-score, ROC-AUC, Confusion Matrix
  • Regression: MAE, MSE, RMSE, R², Residual Analysis
  • Model Comparison: Side-by-side comparison tools
  • Cross-validation: Built-in cross-validation support
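
The regression metrics listed above follow directly from the residuals. The sketch below uses their textbook definitions, with R² computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. `regression_metrics` and its return keys are hypothetical, not the template's evaluation API.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R^2 from true and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    ss_res = sum(r * r for r in residuals)  # residual sum of squares
    mse = ss_res / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    # R^2 is undefined when the target is constant.
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
scores = regression_metrics(y_true, y_pred)
print({name: round(value, 4) for name, value in scores.items()})
```

In this example the residuals are 1, 0, -1, and 0, so MAE and MSE are both 0.5, and R² is 1 - 2/20 = 0.9.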

🐳 Docker Support

Build and run with Docker:

# Build image
docker build -t ml-project .

# Run container
docker run -v "$(pwd)/data:/app/data" ml-project

📚 Documentation

Generate and serve documentation:

make docs      # Serve documentation locally
make docs-test # Test documentation build

🧹 Code Quality

The template enforces high code quality standards:

make check  # Run all quality checks

This includes:

  • Linting: Ruff for fast Python linting
  • Formatting: Automatic code formatting
  • Type Checking: MyPy for static type analysis
  • Dependency Analysis: Deptry for unused dependencies
  • Security: Pre-commit hooks for security checks

🚀 CI/CD Pipeline

Automated workflows for:

  • Quality Assurance: Code quality checks on every PR
  • Testing: Multi-version Python testing (3.9-3.13)
  • Documentation: Automatic documentation building
  • Docker: Container image building and testing
  • Coverage: Code coverage reporting with Codecov

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes and add tests
  4. Run quality checks: make check && make test
  5. Commit your changes: git commit -am 'Add feature'
  6. Push to the branch: git push origin feature-name
  7. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgements

This project builds upon excellent work from:

💡 Why This Template?

Modern Python Tooling

  • uv: 10-100x faster than pip, with comprehensive project management
  • Ruff: Extremely fast Python linter and formatter
  • Pydantic: Runtime type checking and data validation
  • Type Safety: Full type annotations with mypy checking

Production Ready

  • Configuration Management: Flexible, environment-aware configuration
  • Logging: Structured logging with configurable levels
  • Error Handling: Comprehensive error handling and validation
  • Testing: High test coverage with pytest
  • Documentation: Auto-generated API documentation

ML Best Practices

  • Reproducibility: Seed management and deterministic pipelines
  • Modularity: Clean separation of concerns
  • Extensibility: Easy to add new models, transformers, and metrics
  • Monitoring: Built-in evaluation and comparison tools
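
Seed management usually means setting every random number generator from one place. The sketch below shows the idea using the standard library, plus NumPy when it is installed. `set_seed` is a hypothetical helper, not a function documented by this template.

```python
import os
import random


def set_seed(seed: int = 42) -> None:
    """Seed Python's RNG and, if available, NumPy's global RNG."""
    # PYTHONHASHSEED only takes effect for interpreters started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
    except ImportError:
        return  # NumPy is optional in this sketch
    np.random.seed(seed)


set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

Deep learning frameworks keep their own generators, so a transformer workflow would also need the framework's seeding call, such as `torch.manual_seed`.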

Developer Experience

  • Fast Setup: One command installation and setup
  • IDE Support: Full type hints and IntelliSense support
  • Pre-commit Hooks: Automatic code quality enforcement
  • Rich CLI: Comprehensive command-line interface

Start building your next ML project with confidence! 🎯
