Armando1514/MLOps-Distributed-Training-Pipeline

MLOps Pipeline for Distributed Training and Cloud Deployment

This repository contains a configuration-driven MLOps pipeline for training, evaluating, tracking, and deploying machine learning models. It is built around Docker, Hydra, PyTorch Lightning, MLflow, and Google Cloud utilities for distributed execution.

The project is designed to support two main workflows:

  1. Local development and testing inside Docker.
  2. Cloud execution for distributed or production-style training.

The repository currently includes an example binary text classification pipeline, but the structure is generic enough to extend to other models and tasks.

What this project does

  • trains models with PyTorch Lightning
  • manages experiments with MLflow
  • generates runtime configs with Hydra
  • runs code in Docker for reproducibility
  • supports local development, CI, and production/cloud profiles
  • includes infrastructure helpers for Google Cloud
  • supports distributed execution through torchrun
  • exports trained models as artifacts

Tech stack

  • Python 3.10
  • Docker / Docker Compose
  • PyTorch + PyTorch Lightning
  • Hydra
  • MLflow
  • Google Cloud SDK
  • FastAPI / Uvicorn
  • PostgreSQL for MLflow backend

Project structure

.
├── .envs/                         # Environment files used by Docker and Make targets
├── dataset/                       # Example local parquet datasets
│   ├── train.parquet
│   ├── dev.parquet
│   └── test.parquet
├── docker/
│   ├── Dockerfile
│   └── scripts/
│       ├── start-prediction-service.sh
│       ├── start-tracking-server.sh
│       └── startup-script.sh
├── src/
│   ├── config_schemas/            # Typed configuration schemas
│   ├── configs/                   # Hydra configs and generated config output
│   ├── data_modules/              # Dataset and DataModule logic
│   ├── evaluation/                # Evaluation modules and tasks
│   ├── infrastructure/            # Google Cloud infrastructure helpers
│   ├── models/                    # Backbones, heads, adapters, transformations
│   ├── training/                  # Training modules, losses, schedulers, tasks
│   ├── utils/                     # Utility helpers
│   ├── entrypoint.py              # Basic config entrypoint
│   ├── generate_final_config.py   # Generates final runtime config
│   └── run_tasks.py               # Main task runner
├── docker-compose.yaml            # CPU/local compose setup
├── docker-compose-gpu.yaml        # GPU/local compose setup
├── Makefile                       # Main way to run the project
├── pyproject.toml                 # Python dependencies
└── README.md

How the pipeline works

At a high level, the flow is:

  1. Start the project environment with Docker.
  2. Generate the final runtime config.
  3. Run the configured tasks.
  4. Log training and evaluation into MLflow.
  5. Save checkpoints and exported model artifacts.

The main execution path is:

  • src/generate_final_config.py builds the final resolved configuration.
  • src/run_tasks.py initializes distributed execution and runs all configured tasks.
  • Each task is instantiated from Hydra config and executed in sequence.
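The execution sequence above can be pictured with a small stand-in. This is only an illustrative sketch: the real project builds each task from Hydra configs via instantiate, while the TaskConfig type and run callables here are hypothetical.

```python
# Minimal sketch of the run_tasks flow: each configured task is built and
# executed strictly in order, mirroring src/run_tasks.py. TaskConfig and
# the run callables are illustrative stand-ins, not the project's classes.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskConfig:
    name: str
    run: Callable[[], str]  # stand-in for the task's entrypoint

def run_all_tasks(tasks: list[TaskConfig]) -> list[str]:
    # Tasks run in sequence; a failure in one stops the pipeline.
    return [task.run() for task in tasks]

tasks = [
    TaskConfig("binary_text_classification_task", lambda: "trained"),
    TaskConfig("binary_text_evaluation_task", lambda: "evaluated"),
]
```

Running the two example tasks in order yields their results in the same order, which is the property the runner relies on: evaluation always sees the checkpoint produced by training.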

Main commands

The Makefile is the source of truth for running the project.

To see all available commands:

make help

Prerequisites

Before running anything, make sure you have:

  • Docker
  • Docker Compose (docker compose)
  • Make
  • Google Cloud authentication available locally if you want cloud features
  • valid environment files under .envs/

The container mounts your local Google Cloud config from:

~/.config/gcloud/

So if you plan to use GCP, authenticate locally first.
For example, run gcloud auth login (and gcloud auth application-default login for client libraries) before starting the containers.

Environment profiles

The project defines three Docker profiles:

  • dev: local development
  • prod: production/cloud-oriented commands
  • ci: CI/testing environment

By default, the Makefile uses:

  • service: app-dev
  • container: build-model-dev-container
  • profile: dev

Local development quick start

1. Build the development image

make build

2. Start the development containers

make up

This starts the development profile in the background.

3. Check running services

make ps

4. View logs

make logs

5. Open a shell inside the dev container

make shell

or:

make exec-in

Easiest way to run the pipeline locally

The clearest local execution path is this:

make local-run-tasks

This target does two things:

  1. runs local-generate-final-config
  2. launches torchrun inside the development container

It uses the local parquet files from dataset/ and overrides the configuration for a small local run.

In practice, it is the best command to verify that the pipeline works end to end on your machine.

What make local-run-tasks does

It launches:

torchrun --standalone --nproc_per_node=1 -m src.run_tasks

with local overrides such as:

  • local dataset paths:
    • /app/dataset/train.parquet
    • /app/dataset/dev.parquet
    • /app/dataset/test.parquet
  • smaller batch sizes
  • reduced worker count
  • limited batches for a short smoke test
  • max_epochs=1

This means the local target is intentionally configured as a lightweight validation run, not a full production training job.
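Conceptually, the target assembles a torchrun invocation plus a list of Hydra-style overrides. The sketch below is illustrative only: the torchrun flags match the command shown above, but the override keys are assumptions, not the project's actual config paths.

```python
# Illustrative builder for the local smoke-test command. The torchrun flags
# come from the README; the Hydra override keys are hypothetical examples.
def build_local_command(nproc: int = 1, max_epochs: int = 1) -> list[str]:
    overrides = [
        f"trainer.max_epochs={max_epochs}",   # short run
        "data_module.batch_size=8",           # hypothetical key: smaller batches
        "data_module.train_path=/app/dataset/train.parquet",  # hypothetical key
    ]
    return [
        "torchrun", "--standalone", f"--nproc_per_node={nproc}",
        "-m", "src.run_tasks", *overrides,
    ]
```

The point of modeling it this way is that the same entrypoint serves both profiles: only the override list changes between a one-epoch local smoke test and a full cloud run.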

Recommended local workflow

For normal local development, use these commands in order:

make build
make up
make local-run-tasks

That is usually enough to:

  • build the image
  • start the services
  • generate the local config
  • run training and evaluation once

Configuration generation

Local config generation

make local-generate-final-config

This runs inside the dev container and generates the final config using a local experiment override:

python src/generate_final_config.py +experiment/bert=local_bert

Production config generation

make generate-final-config

This runs inside the prod container and generates the final config with a Docker image tag intended for cloud execution.

Running cloud / production jobs

The Makefile defines a higher-level target for cloud execution:

make run-tasks

This target performs:

  1. make generate-final-config
  2. make push
  3. runs the cloud launch step from inside the production container

Use this when you want to prepare and launch a remote job rather than run a local smoke test.

Because this path depends on:

  • GCP credentials
  • container registry access
  • infrastructure environment variables
  • production MLflow settings

make sure .envs/.mlflow-prod and .envs/.infrastructure are correctly configured first.

MLflow

MLflow is used for experiment tracking and artifact storage.

In local development, the stack uses:

  • PostgreSQL as backend store
  • a Docker volume for artifact storage

Relevant services and resources:

  • mlflow-db container
  • mlflow-artifact-store volume
  • postgresql-mlflow-data volume

The tracking server is started through:

docker/scripts/start-tracking-server.sh
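The script's job, in essence, is to point an MLflow server at the Postgres backend store. This sketch shows how such a backend-store URI is typically composed; the environment variable names are assumptions modeled on a standard Postgres setup, not keys read from the repository's .envs/ files.

```python
# Sketch: compose the PostgreSQL backend-store URI an MLflow tracking server
# would use. The variable names are assumed, not taken from this repository.
def backend_store_uri(env: dict) -> str:
    return (
        f"postgresql://{env['POSTGRES_USER']}:{env['POSTGRES_PASSWORD']}"
        f"@{env['POSTGRES_HOST']}:{env.get('POSTGRES_PORT', '5432')}"
        f"/{env['POSTGRES_DB']}"
    )

env = {
    "POSTGRES_USER": "mlflow", "POSTGRES_PASSWORD": "secret",
    "POSTGRES_HOST": "mlflow-db", "POSTGRES_DB": "mlflow",
}
```

Note the host is the Compose service name (mlflow-db), not localhost: containers on the same Compose network resolve each other by service name.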

Jupyter notebook support

To start JupyterLab inside the container:

make notebook

The container exposes port 8888.

CPU and GPU Docker setups

CPU

The default Makefile uses:

docker-compose.yaml

This is the standard local path.

GPU

If you want GPU-backed local execution, the repository also includes:

docker-compose-gpu.yaml

This compose file reserves NVIDIA GPUs for the app-dev service. Use it when your machine has a properly configured NVIDIA Docker runtime.

Useful development commands

Formatting

make format
make sort
make format-and-sort

Validation

make lint
make check-type-annotations
make test
make full-check

These run inside the Docker environment, which helps keep development reproducible.

Stopping and cleaning up

Stop containers

make down

Restart the development environment

make restart

Remove MLflow volumes

make clean-mlflow-volumes

Use the volume cleanup command only if you intentionally want to remove stored MLflow metadata and artifacts.

Dataset

The repository already contains example local datasets:

dataset/train.parquet
dataset/dev.parquet
dataset/test.parquet

The local run target is wired to use these files automatically.

For cloud or larger-scale runs, the pipeline can also read datasets from GCS paths, as the generated config demonstrates.
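A tiny helper illustrates the idea: the same dataset config key can hold either a local container path or a GCS URI, and readers branch on the scheme. This is illustrative only, not code from the project.

```python
# Illustrative scheme check: local runs use container paths like
# /app/dataset/train.parquet, cloud runs use gs:// object paths.
def is_gcs_path(path: str) -> bool:
    return path.startswith("gs://")
```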

Code entrypoints

src/run_tasks.py

This is the main runtime entrypoint.

It:

  • loads the generated config
  • initializes PyTorch distributed processing
  • sets the backend to gloo or nccl
  • seeds the run
  • instantiates and executes each task from the config
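The backend choice reduces to: nccl when CUDA GPUs are present, gloo otherwise. A simplified stand-in for that decision (in the real entrypoint the chosen value is passed to torch.distributed when initializing the process group):

```python
# Simplified backend selection: NCCL is the GPU-optimized collective backend,
# Gloo the CPU fallback. The real code feeds this choice into PyTorch's
# distributed process-group initialization.
def pick_backend(cuda_available: bool) -> str:
    return "nccl" if cuda_available else "gloo"
```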

src/generate_final_config.py

This prepares the final resolved Hydra config that run_tasks.py consumes.
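In miniature, the step composes defaults with experiment overrides into one resolved config and persists it for the runner. The real project does this with Hydra/OmegaConf; the dict-based sketch below only illustrates the shape of the operation, and its keys and output path are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Dict-based stand-in for "generate the final resolved config": merge
# experiment overrides over defaults, write the result where the task
# runner can load it. Keys and the output location are illustrative.
def resolve_and_save(defaults: dict, overrides: dict, out_path: Path) -> dict:
    resolved = {**defaults, **overrides}  # overrides win on conflicts
    out_path.write_text(json.dumps(resolved, indent=2))
    return resolved

final = resolve_and_save(
    {"max_epochs": 10, "batch_size": 64},
    {"max_epochs": 1, "batch_size": 8},  # local smoke-test-style overrides
    Path(tempfile.mkdtemp()) / "final_config.json",
)
```

Resolving the config once up front, rather than inside every worker, means all torchrun processes read an identical configuration.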

src/entrypoint.py

This is a simple config entrypoint and is mainly useful for checking that config loading works.

Training and evaluation tasks

The provided example configuration contains two main tasks:

  • binary_text_classification_task
  • binary_text_evaluation_task

So the current repository is not just a generic MLOps scaffold. It already includes a concrete example pipeline for text classification, with:

  • tokenization transformation
  • model backbone / adapter / head composition
  • Lightning trainers
  • checkpoints
  • evaluation
  • model export

Cloud infrastructure helpers

Cloud-related logic is located under:

src/infrastructure/

This includes helpers for:

  • instance template creation
  • instance group creation

The intended production flow is:

  1. build the image
  2. push it to Artifact Registry
  3. generate the final production config
  4. launch the job on GCP

Environment files

The project reads settings from:

.envs/.postgres
.envs/.mlflow-common
.envs/.mlflow-dev
.envs/.mlflow-prod
.envs/.infrastructure

These are imported by Docker Compose and the Makefile.

At minimum, verify these before running production commands:

  • MLflow ports and URIs
  • PostgreSQL settings
  • GCP registry values
  • zone / VM / project settings
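A hypothetical pre-flight check along these lines can fail fast before a cloud launch. The variable names below are assumptions for illustration, not the repository's actual keys; substitute the names from your .envs/ files.

```python
# Fail fast if required production settings are unset. REQUIRED lists
# hypothetical variable names; replace them with the keys your
# .envs/.mlflow-prod and .envs/.infrastructure files actually define.
REQUIRED = ["MLFLOW_TRACKING_URI", "GCP_PROJECT_ID", "GCP_ZONE"]

def missing_vars(environ: dict) -> list[str]:
    return [name for name in REQUIRED if not environ.get(name)]
```

Calling it with os.environ before make run-tasks turns a mid-launch failure into an immediate, readable error.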

Typical usage examples

Local smoke test

make build
make up
make local-run-tasks

Open a shell and inspect the environment

make up
make shell

Run code quality checks

make up
make full-check

Generate production config

make up-prod
make generate-final-config

Notes

  • The Makefile is the best interface for this project.
  • For local development, prefer make local-run-tasks over manually invoking Python scripts.
  • For cloud execution, use the production-oriented targets only after verifying the .envs/ values and GCP authentication.
  • The repository already contains sample local data, so you can test the full flow without downloading anything else first.

Summary

If you only want the shortest path to verify the project works, run:

make build
make up
make local-run-tasks

That is the clearest starting point for this repository.

Credits

Project inspired by the Udemy course "Build, Manage, and Deploy Machine Learning (AI) Projects with Python and MLOps" by Kıvanç Yüksel.

About

Production-ready MLOps pipeline for distributed training and model deployment on the cloud, integrating DVC for data versioning and MLflow for experiment tracking.
