Armando1514/MLOps-Distributed-Training-Pipeline

MLOps Pipeline for Distributed Training and Cloud Deployment

This repository contains a configuration-driven MLOps pipeline for training, evaluating, tracking, and deploying machine learning models. It is built around Docker, Hydra, PyTorch Lightning, MLflow, and Google Cloud utilities for distributed execution.

The project is designed to support two main workflows:

  1. Local development and testing inside Docker.
  2. Cloud execution for distributed or production-style training.

The repository currently includes an example binary text classification pipeline, but the structure is generic enough to extend to other models and tasks.

What this project does

  • trains models with PyTorch Lightning
  • manages experiments with MLflow
  • generates runtime configs with Hydra
  • runs code in Docker for reproducibility
  • supports local development, CI, and production/cloud profiles
  • includes infrastructure helpers for Google Cloud
  • supports distributed execution through torchrun
  • exports trained models as artifacts

Tech stack

  • Python 3.10
  • Docker / Docker Compose
  • PyTorch + PyTorch Lightning
  • Hydra
  • MLflow
  • Google Cloud SDK
  • FastAPI / Uvicorn
  • PostgreSQL for MLflow backend

Project structure

.
├── .envs/                         # Environment files used by Docker and Make targets
├── dataset/                       # Example local parquet datasets
│   ├── train.parquet
│   ├── dev.parquet
│   └── test.parquet
├── docker/
│   ├── Dockerfile
│   └── scripts/
│       ├── start-prediction-service.sh
│       ├── start-tracking-server.sh
│       └── startup-script.sh
├── src/
│   ├── config_schemas/            # Typed configuration schemas
│   ├── configs/                   # Hydra configs and generated config output
│   ├── data_modules/              # Dataset and DataModule logic
│   ├── evaluation/                # Evaluation modules and tasks
│   ├── infrastructure/            # Google Cloud infrastructure helpers
│   ├── models/                    # Backbones, heads, adapters, transformations
│   ├── training/                  # Training modules, losses, schedulers, tasks
│   ├── utils/                     # Utility helpers
│   ├── entrypoint.py              # Basic config entrypoint
│   ├── generate_final_config.py   # Generates final runtime config
│   └── run_tasks.py               # Main task runner
├── docker-compose.yaml            # CPU/local compose setup
├── docker-compose-gpu.yaml        # GPU/local compose setup
├── Makefile                       # Main way to run the project
├── pyproject.toml                 # Python dependencies
└── README.md

How the pipeline works

At a high level, the flow is:

  1. Start the project environment with Docker.
  2. Generate the final runtime config.
  3. Run the configured tasks.
  4. Log training and evaluation into MLflow.
  5. Save checkpoints and exported model artifacts.

The main execution path is:

  • src/generate_final_config.py builds the final resolved configuration.
  • src/run_tasks.py initializes distributed execution and runs all configured tasks.
  • Each task is instantiated from Hydra config and executed in sequence.
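The execution sequence above can be pictured with a small stand-in. This is only an illustrative sketch: the real project builds each task from Hydra configs via instantiate, while the TaskConfig type and run callables here are hypothetical.

```python
# Minimal sketch of the run_tasks flow: each configured task is built and
# executed strictly in order, mirroring src/run_tasks.py. TaskConfig and
# the run callables are illustrative stand-ins, not the project's classes.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskConfig:
    name: str
    run: Callable[[], str]  # stand-in for the task's entrypoint

def run_all_tasks(tasks: list[TaskConfig]) -> list[str]:
    # Tasks run in sequence; a failure in one stops the pipeline.
    return [task.run() for task in tasks]

tasks = [
    TaskConfig("binary_text_classification_task", lambda: "trained"),
    TaskConfig("binary_text_evaluation_task", lambda: "evaluated"),
]
```

Running the two example tasks in order yields their results in the same order, which is the property the runner relies on: evaluation always sees the checkpoint produced by training.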

Main commands

The Makefile is the source of truth for running the project.

To see all available commands:

make help

Prerequisites

Before running anything, make sure you have:

  • Docker
  • Docker Compose (docker compose)
  • Make
  • Google Cloud authentication available locally if you want cloud features
  • valid environment files under .envs/

The container mounts your local Google Cloud config from:

~/.config/gcloud/

So if you plan to use GCP, authenticate locally first.
For example, run gcloud auth login (and gcloud auth application-default login for client libraries) before starting the containers.

Environment profiles

The project defines three Docker profiles:

  • dev: local development
  • prod: production/cloud-oriented commands
  • ci: CI/testing environment

By default, the Makefile uses:

  • service: app-dev
  • container: build-model-dev-container
  • profile: dev

Local development quick start

1. Build the development image

make build

2. Start the development containers

make up

This starts the development profile in the background.

3. Check running services

make ps

4. View logs

make logs

5. Open a shell inside the dev container

make shell

or:

make exec-in

Easiest way to run the pipeline locally

The clearest local execution path is this:

make local-run-tasks

This target does two things:

  1. runs local-generate-final-config
  2. launches torchrun inside the development container

It uses the local parquet files from dataset/ and overrides the configuration for a small local run.

In practice, it is the best command to verify that the pipeline works end to end on your machine.

What make local-run-tasks does

It launches:

torchrun --standalone --nproc_per_node=1 -m src.run_tasks

with local overrides such as:

  • local dataset paths:
    • /app/dataset/train.parquet
    • /app/dataset/dev.parquet
    • /app/dataset/test.parquet
  • smaller batch sizes
  • reduced worker count
  • limited batches for a short smoke test
  • max_epochs=1

This means the local target is intentionally configured as a lightweight validation run, not a full production training job.
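Conceptually, the target assembles a torchrun invocation plus a list of Hydra-style overrides. The sketch below is illustrative only: the torchrun flags match the command shown above, but the override keys are assumptions, not the project's actual config paths.

```python
# Illustrative builder for the local smoke-test command. The torchrun flags
# come from the README; the Hydra override keys are hypothetical examples.
def build_local_command(nproc: int = 1, max_epochs: int = 1) -> list[str]:
    overrides = [
        f"trainer.max_epochs={max_epochs}",   # short run
        "data_module.batch_size=8",           # hypothetical key: smaller batches
        "data_module.train_path=/app/dataset/train.parquet",  # hypothetical key
    ]
    return [
        "torchrun", "--standalone", f"--nproc_per_node={nproc}",
        "-m", "src.run_tasks", *overrides,
    ]
```

The point of modeling it this way is that the same entrypoint serves both profiles: only the override list changes between a one-epoch local smoke test and a full cloud run.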

Recommended local workflow

For normal local development, use these commands in order:

make build
make up
make local-run-tasks

That is usually enough to:

  • build the image
  • start the services
  • generate the local config
  • run training and evaluation once

Configuration generation

Local config generation

make local-generate-final-config

This runs inside the dev container and generates the final config using a local experiment override:

python src/generate_final_config.py +experiment/bert=local_bert

Production config generation

make generate-final-config

This runs inside the prod container and generates the final config with a Docker image tag intended for cloud execution.

Running cloud / production jobs

The Makefile defines a higher-level target for cloud execution:

make run-tasks

This target performs:

  1. make generate-final-config
  2. make push
  3. runs the cloud launch step from inside the production container

Use this when you want to prepare and launch a remote job rather than run a local smoke test.

Because this path depends on:

  • GCP credentials
  • container registry access
  • infrastructure environment variables
  • production MLflow settings

make sure .envs/.mlflow-prod and .envs/.infrastructure are correctly configured first.

MLflow

MLflow is used for experiment tracking and artifact storage.

In local development, the stack uses:

  • PostgreSQL as backend store
  • a Docker volume for artifact storage

Relevant services and resources:

  • mlflow-db container
  • mlflow-artifact-store volume
  • postgresql-mlflow-data volume

The tracking server is started through:

docker/scripts/start-tracking-server.sh
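The script's job, in essence, is to point an MLflow server at the Postgres backend store. This sketch shows how such a backend-store URI is typically composed; the environment variable names are assumptions modeled on a standard Postgres setup, not keys read from the repository's .envs/ files.

```python
# Sketch: compose the PostgreSQL backend-store URI an MLflow tracking server
# would use. The variable names are assumed, not taken from this repository.
def backend_store_uri(env: dict) -> str:
    return (
        f"postgresql://{env['POSTGRES_USER']}:{env['POSTGRES_PASSWORD']}"
        f"@{env['POSTGRES_HOST']}:{env.get('POSTGRES_PORT', '5432')}"
        f"/{env['POSTGRES_DB']}"
    )

env = {
    "POSTGRES_USER": "mlflow", "POSTGRES_PASSWORD": "secret",
    "POSTGRES_HOST": "mlflow-db", "POSTGRES_DB": "mlflow",
}
```

Note the host is the Compose service name (mlflow-db), not localhost: containers on the same Compose network resolve each other by service name.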

Jupyter notebook support

To start JupyterLab inside the container:

make notebook

The container exposes port 8888.

CPU and GPU Docker setups

CPU

The default Makefile uses:

docker-compose.yaml

This is the standard local path.

GPU

If you want GPU-backed local execution, the repository also includes:

docker-compose-gpu.yaml

This compose file reserves NVIDIA GPUs for the app-dev service. Use it when your machine has a properly configured NVIDIA Docker runtime.

Useful development commands

Formatting

make format
make sort
make format-and-sort

Validation

make lint
make check-type-annotations
make test
make full-check

These run inside the Docker environment, which helps keep development reproducible.

Stopping and cleaning up

Stop containers

make down

Restart the development environment

make restart

Remove MLflow volumes

make clean-mlflow-volumes

Use the volume cleanup command only if you intentionally want to remove stored MLflow metadata and artifacts.

Dataset

The repository already contains example local datasets:

dataset/train.parquet
dataset/dev.parquet
dataset/test.parquet

The local run target is wired to use these files automatically.

For cloud or larger-scale runs, the pipeline can also read datasets from GCS paths, as the generated config demonstrates.
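A tiny helper illustrates the idea: the same dataset config key can hold either a local container path or a GCS URI, and readers branch on the scheme. This is illustrative only, not code from the project.

```python
# Illustrative scheme check: local runs use container paths like
# /app/dataset/train.parquet, cloud runs use gs:// object paths.
def is_gcs_path(path: str) -> bool:
    return path.startswith("gs://")
```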

Code entrypoints

src/run_tasks.py

This is the main runtime entrypoint.

It:

  • loads the generated config
  • initializes PyTorch distributed processing
  • sets the backend to gloo or nccl
  • seeds the run
  • instantiates and executes each task from the config
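The backend choice reduces to: nccl when CUDA GPUs are present, gloo otherwise. A simplified stand-in for that decision (in the real entrypoint the chosen value is passed to torch.distributed when initializing the process group):

```python
# Simplified backend selection: NCCL is the GPU-optimized collective backend,
# Gloo the CPU fallback. The real code feeds this choice into PyTorch's
# distributed process-group initialization.
def pick_backend(cuda_available: bool) -> str:
    return "nccl" if cuda_available else "gloo"
```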

src/generate_final_config.py

This prepares the final resolved Hydra config that run_tasks.py consumes.
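In miniature, the step composes defaults with experiment overrides into one resolved config and persists it for the runner. The real project does this with Hydra/OmegaConf; the dict-based sketch below only illustrates the shape of the operation, and its keys and output path are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Dict-based stand-in for "generate the final resolved config": merge
# experiment overrides over defaults, write the result where the task
# runner can load it. Keys and the output location are illustrative.
def resolve_and_save(defaults: dict, overrides: dict, out_path: Path) -> dict:
    resolved = {**defaults, **overrides}  # overrides win on conflicts
    out_path.write_text(json.dumps(resolved, indent=2))
    return resolved

final = resolve_and_save(
    {"max_epochs": 10, "batch_size": 64},
    {"max_epochs": 1, "batch_size": 8},  # local smoke-test-style overrides
    Path(tempfile.mkdtemp()) / "final_config.json",
)
```

Resolving the config once up front, rather than inside every worker, means all torchrun processes read an identical configuration.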

src/entrypoint.py

This is a simple config entrypoint and is mainly useful for checking that config loading works.

Training and evaluation tasks

The provided example configuration contains two main tasks:

  • binary_text_classification_task
  • binary_text_evaluation_task

So the current repository is not just a generic MLOps scaffold. It already includes a concrete example pipeline for text classification, with:

  • tokenization transformation
  • model backbone / adapter / head composition
  • Lightning trainers
  • checkpoints
  • evaluation
  • model export

Cloud infrastructure helpers

Cloud-related logic is located under:

src/infrastructure/

This includes helpers for:

  • instance template creation
  • instance group creation

The intended production flow is:

  1. build the image
  2. push it to Artifact Registry
  3. generate the final production config
  4. launch the job on GCP

Environment files

The project reads settings from:

.envs/.postgres
.envs/.mlflow-common
.envs/.mlflow-dev
.envs/.mlflow-prod
.envs/.infrastructure

These are imported by Docker Compose and the Makefile.

At minimum, verify these before running production commands:

  • MLflow ports and URIs
  • PostgreSQL settings
  • GCP registry values
  • zone / VM / project settings
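A hypothetical pre-flight check along these lines can fail fast before a cloud launch. The variable names below are assumptions for illustration, not the repository's actual keys; substitute the names from your .envs/ files.

```python
# Fail fast if required production settings are unset. REQUIRED lists
# hypothetical variable names; replace them with the keys your
# .envs/.mlflow-prod and .envs/.infrastructure files actually define.
REQUIRED = ["MLFLOW_TRACKING_URI", "GCP_PROJECT_ID", "GCP_ZONE"]

def missing_vars(environ: dict) -> list[str]:
    return [name for name in REQUIRED if not environ.get(name)]
```

Calling it with os.environ before make run-tasks turns a mid-launch failure into an immediate, readable error.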

Typical usage examples

Local smoke test

make build
make up
make local-run-tasks

Open a shell and inspect the environment

make up
make shell

Run code quality checks

make up
make full-check

Generate production config

make up-prod
make generate-final-config

Notes

  • The Makefile is the best interface for this project.
  • For local development, prefer make local-run-tasks over manually invoking Python scripts.
  • For cloud execution, use the production-oriented targets only after verifying the .envs/ values and GCP authentication.
  • The repository already contains sample local data, so you can test the full flow without downloading anything else first.

Summary

If you only want the shortest path to verify the project works, run:

make build
make up
make local-run-tasks

That is the clearest starting point for this repository.

Credits

Project inspired by the Udemy course "Build, Manage, and Deploy Machine Learning (AI) Projects with Python and MLOps" by Kıvanç Yüksel.

About

Production-ready MLOps pipeline for distributed training and model deployment on the cloud, integrating DVC for data versioning and MLflow for experiment tracking.
