A comprehensive, production-ready template for machine learning projects with modern Python tooling and best practices. This template provides a complete ML pipeline with data processing, preprocessing, modeling, evaluation, and visualization capabilities.
- Data Processing: Robust data loading, validation, and management utilities
- Preprocessing Pipeline: Configurable preprocessing with missing value handling, outlier detection, encoding, and scaling
- Model Management: Abstract base classes with concrete implementations for scikit-learn and Hugging Face transformer models
- Transformer Support: Pre-trained and fine-tunable transformer models (BERT, DistilBERT, RoBERTa) for text classification
- Evaluation Framework: Comprehensive metrics calculation and model comparison tools
- Visualization Suite: Rich plotting capabilities for data exploration and model analysis
- Modern Dependency Management: uv for fast, reliable package management
- Code Quality: ruff for linting and formatting, mypy for type checking
- Testing: pytest with coverage reporting via codecov
- CI/CD: GitHub Actions workflows for automated testing and deployment
- Documentation: MkDocs with Material theme
- Containerization: Docker support with multi-stage builds
- Pre-commit Hooks: Automated code quality checks
- Cross-version Testing: tox for testing across Python versions
mlproject-template/
├── src/                              # Source code
│   ├── common/                       # Common utilities and configuration
│   │   ├── config.py                 # Configuration management with Pydantic
│   │   ├── logging_config.py         # Logging setup
│   │   └── utils.py                  # Utility functions
│   ├── data_process/                 # Data loading and processing
│   │   └── loader.py                 # DataLoader class
│   ├── preprocess/                   # Data preprocessing
│   │   ├── transformers.py           # Custom transformers
│   │   └── pipeline.py               # Preprocessing pipeline
│   ├── model/                        # Model implementations
│   │   ├── base.py                   # Abstract base model
│   │   ├── sklearn_models.py         # Scikit-learn model wrappers
│   │   └── transformer_models.py     # Hugging Face transformer models
│   ├── evaluate/                     # Model evaluation
│   │   └── metrics.py                # Evaluation metrics and utilities
│   ├── visualization/                # Plotting and visualization
│   │   └── plots.py                  # Visualization utilities
│   └── main.py                       # Main CLI interface
├── config/                           # Configuration files
│   └── default.yaml                  # Default configuration
├── notebooks/                        # Jupyter notebooks
│   └── 01_data_exploration.ipynb     # Data exploration example
├── data/                             # Data directories
│   ├── raw/                          # Raw data
│   └── processed/                    # Processed data
├── tests/                            # Test files
├── docs/                             # Documentation
└── reports/                          # Generated reports and outputs
- Python 3.10 or higher
- uv package manager
- Use this template by clicking the "Use this template" button on GitHub, or clone directly:
  git clone https://github.com/avineshpvs/mlproject-template.git
  cd mlproject-template
- Install uv (if not already installed):
  # macOS/Linux
  curl -LsSf https://astral.sh/uv/0.7.8/install.sh | sh
  # Windows
  powershell -c "irm https://astral.sh/uv/0.7.8/install.ps1 | iex"
- Set up the development environment:
  make install
This will:
- Create a virtual environment
- Install all dependencies
- Set up pre-commit hooks
- Generate the lock file
- Run the example ML pipeline:
  # Run with default settings (creates sample data)
  uv run python -m src.main
  # List available models
  uv run python -m src.main --list-models
  # Run with specific model
  uv run python -m src.main --model random_forest_classifier --output my_results
- Try transformer models (requires additional dependencies):
  # Install transformer dependencies
  uv add transformers torch datasets
  # Run transformer example
  uv run python examples/transformer_example.py
  # Compare traditional ML vs transformers
  uv run python examples/model_comparison.py
- Explore the data using the provided notebook:
  jupyter lab notebooks/01_data_exploration.ipynb
- Run tests and quality checks:
  make check  # Code quality checks
  make test   # Run tests with coverage
from src.data_process import DataLoader
from src.common import get_logger
logger = get_logger(__name__)
# Load data
data_loader = DataLoader()
df = data_loader.load_csv("your_data.csv")
# Get comprehensive data information
data_info = data_loader.get_data_info(df)
logger.info(f"Dataset shape: {data_info['shape']}")
# Split data
X_train, X_test, y_train, y_test = data_loader.train_test_split(df, "target_column")

from src.preprocess import create_preprocessing_pipeline
# Configure preprocessing
config = {
"missing_values": {"strategy": "mean", "enabled": True},
"outliers": {"method": "iqr", "threshold": 1.5, "action": "clip", "enabled": True},
"categorical_encoding": {"method": "onehot", "drop_first": True, "enabled": True},
"scaling": {"method": "standard", "enabled": True}
}
# Create and use pipeline
preprocessor = create_preprocessing_pipeline(config)
X_processed = preprocessor.fit_transform(X_train)

from src.model import create_model
from src.evaluate import ModelEvaluator
# Create and train model
model = create_model("random_forest_classifier", n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)
# Evaluate model
evaluator = ModelEvaluator()
metrics = evaluator.evaluate_classification(y_test, y_pred, y_pred_proba)
print(f"Accuracy: {metrics['accuracy']:.4f}")

from src.visualization import MLVisualizer, plot_data_overview
# Create comprehensive data overview
figures = plot_data_overview(df, save_dir="plots")
# Custom visualizations
visualizer = MLVisualizer()
fig = visualizer.plot_correlation_matrix(df, save_path="correlation.png")

The template uses a flexible configuration system with YAML files and Pydantic models:
from src.common import settings
# Access configuration
print(f"Project name: {settings.project_name}")
print(f"Data directory: {settings.data_dir}")
print(f"Random seed: {settings.data.random_seed}")
# Load custom configuration
from src.common.config import MLProjectSettings
custom_settings = MLProjectSettings.from_yaml("config/custom.yaml")

The template includes ready-to-use implementations for common ML algorithms:
- Linear Models: Linear Regression, Logistic Regression
- Tree-based Models: Decision Trees, Random Forest
- Support Vector Machines: SVR, SVC
- BERT: Fine-tunable BERT models for text classification
- DistilBERT: Lightweight, fast DistilBERT models
- RoBERTa: Robustly optimized BERT models
- Pre-trained Pipelines: Ready-to-use sentiment analysis and text classification
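The string names passed to `create_model` in this README (for example `random_forest_classifier`) suggest a name-based factory. The following is a minimal, dependency-free sketch of that pattern, assuming a plain registry dictionary. `MODEL_REGISTRY`, `register_model`, `build_model`, and the toy `MajorityClassifier` are hypothetical names invented for illustration; the template's actual `create_model` and base class may be organized differently.

```python
from abc import ABC, abstractmethod
from collections import Counter
from typing import Any, Callable


class BaseModel(ABC):
    """Minimal abstract interface that every registered model implements."""

    @abstractmethod
    def fit(self, X: list[Any], y: list[Any]) -> "BaseModel": ...

    @abstractmethod
    def predict(self, X: list[Any]) -> list[Any]: ...


# Hypothetical registry mapping model names to classes (not the template's own code).
MODEL_REGISTRY: dict[str, type[BaseModel]] = {}


def register_model(name: str) -> Callable[[type[BaseModel]], type[BaseModel]]:
    """Class decorator that records a model class under a string name."""

    def decorator(cls: type[BaseModel]) -> type[BaseModel]:
        MODEL_REGISTRY[name] = cls
        return cls

    return decorator


@register_model("majority_classifier")
class MajorityClassifier(BaseModel):
    """Toy model: always predicts the most frequent training label."""

    def __init__(self) -> None:
        self.label_: Any = None

    def fit(self, X: list[Any], y: list[Any]) -> "MajorityClassifier":
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X: list[Any]) -> list[Any]:
        return [self.label_ for _ in X]


def build_model(name: str, **params: Any) -> BaseModel:
    """Look up a registered class by name and instantiate it with keyword parameters."""
    try:
        cls = MODEL_REGISTRY[name]
    except KeyError:
        available = ", ".join(sorted(MODEL_REGISTRY))
        raise ValueError(f"Unknown model '{name}'. Available: {available}") from None
    return cls(**params)
```

With a registry like this, adding a model is a matter of decorating a new subclass; the factory and any model-listing command pick it up without further changes.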
# Install transformer dependencies
uv add transformers torch datasets
# Or install all optional dependencies
uv sync --all-extras

# Traditional ML model
model = create_model("random_forest_classifier", n_estimators=100)
# Transformer models
sentiment_model = create_model("sentiment_pipeline")
bert_model = create_model("distilbert_classifier", num_labels=3)

# List all available models
uv run python -m src.main --list-models
# Run transformer example
uv run python examples/transformer_example.py

Comprehensive evaluation capabilities:
- Classification: Accuracy, Precision, Recall, F1-score, ROC-AUC, Confusion Matrix
- Regression: MAE, MSE, RMSE, R², Residual Analysis
- Model Comparison: Side-by-side comparison tools
- Cross-validation: Built-in cross-validation support
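To make the metrics listed above concrete, here is a dependency-free sketch that computes several of them from their definitions. It is an illustration only: the function names are hypothetical, the classification part assumes binary labels with 1 as the positive class, and the template's `ModelEvaluator` is not shown in this README and may compute these values differently (for instance via scikit-learn).

```python
import math


def classification_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Binary-classification metrics from confusion-matrix counts (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def regression_metrics(y_true: list[float], y_pred: list[float]) -> dict[str, float]:
    """MAE, MSE, RMSE, and R² computed directly from the residuals."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "mae": sum(abs(r) for r in residuals) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        # R² is undefined for constant targets; 0.0 is an arbitrary convention in this sketch.
        "r2": 1.0 - ss_res / ss_tot if ss_tot else 0.0,
    }


clf = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
reg = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print({k: round(v, 4) for k, v in clf.items()})
print({k: round(v, 4) for k, v in reg.items()})
```

ROC-AUC and cross-validation are omitted here because they need predicted probabilities and a resampling loop, respectively.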
Build and run with Docker:
# Build image
docker build -t ml-project .
# Run container
docker run -v $(pwd)/data:/app/data ml-project

Generate and serve documentation:
make docs # Serve documentation locally
make docs-test # Test documentation build

The template enforces high code quality standards:
make check # Run all quality checks

This includes:
- Linting: Ruff for fast Python linting
- Formatting: Automatic code formatting
- Type Checking: MyPy for static type analysis
- Dependency Analysis: Deptry for unused dependencies
- Security: Pre-commit hooks for security checks
Automated workflows for:
- Quality Assurance: Code quality checks on every PR
- Testing: Multi-version Python testing (3.9-3.13)
- Documentation: Automatic documentation building
- Docker: Container image building and testing
- Coverage: Code coverage reporting with Codecov
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make your changes and add tests
- Run quality checks: make check && make test
- Commit your changes: git commit -am 'Add feature'
- Push to the branch: git push origin feature-name
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
This project builds upon excellent work from:
- uv: 10-100x faster than pip, with comprehensive project management
- Ruff: Extremely fast Python linter and formatter
- Pydantic: Runtime type checking and data validation
- Type Safety: Full type annotations with mypy checking
- Configuration Management: Flexible, environment-aware configuration
- Logging: Structured logging with configurable levels
- Error Handling: Comprehensive error handling and validation
- Testing: High test coverage with pytest
- Documentation: Auto-generated API documentation
- Reproducibility: Seed management and deterministic pipelines
- Modularity: Clean separation of concerns
- Extensibility: Easy to add new models, transformers, and metrics
- Monitoring: Built-in evaluation and comparison tools
- Fast Setup: One command installation and setup
- IDE Support: Full type hints and IntelliSense support
- Pre-commit Hooks: Automatic code quality enforcement
- Rich CLI: Comprehensive command-line interface
Start building your next ML project with confidence! 🎯