This repository contains a configuration-driven MLOps pipeline for training, evaluating, tracking, and deploying machine learning models. It is built around Docker, Hydra, PyTorch Lightning, MLflow, and Google Cloud utilities for distributed execution.
The project is designed to support two main workflows:
- Local development and testing inside Docker.
- Cloud execution for distributed or production-style training.
The repository currently includes an example binary text classification pipeline, but the structure is generic enough to extend to other models and tasks.
In short, the pipeline:
- trains models with PyTorch Lightning
- manages experiments with MLflow
- generates runtime configs with Hydra
- runs code in Docker for reproducibility
- supports local development, CI, and production/cloud profiles
- includes infrastructure helpers for Google Cloud
- supports distributed execution through `torchrun`
- exports trained models as artifacts
The main stack:
- Python 3.10
- Docker / Docker Compose
- PyTorch + PyTorch Lightning
- Hydra
- MLflow
- Google Cloud SDK
- FastAPI / Uvicorn
- PostgreSQL for MLflow backend
```
.
├── .envs/                       # Environment files used by Docker and Make targets
├── dataset/                     # Example local parquet datasets
│   ├── train.parquet
│   ├── dev.parquet
│   └── test.parquet
├── docker/
│   ├── Dockerfile
│   └── scripts/
│       ├── start-prediction-service.sh
│       ├── start-tracking-server.sh
│       └── startup-script.sh
├── src/
│   ├── config_schemas/          # Typed configuration schemas
│   ├── configs/                 # Hydra configs and generated config output
│   ├── data_modules/            # Dataset and DataModule logic
│   ├── evaluation/              # Evaluation modules and tasks
│   ├── infrastructure/          # Google Cloud infrastructure helpers
│   ├── models/                  # Backbones, heads, adapters, transformations
│   ├── training/                # Training modules, losses, schedulers, tasks
│   ├── utils/                   # Utility helpers
│   ├── entrypoint.py            # Basic config entrypoint
│   ├── generate_final_config.py # Generates final runtime config
│   └── run_tasks.py             # Main task runner
├── docker-compose.yaml          # CPU/local compose setup
├── docker-compose-gpu.yaml      # GPU/local compose setup
├── Makefile                     # Main way to run the project
├── pyproject.toml               # Python dependencies
└── README.md
```
At a high level, the flow is:
- Start the project environment with Docker.
- Generate the final runtime config.
- Run the configured tasks.
- Log training and evaluation into MLflow.
- Save checkpoints and exported model artifacts.
The main execution path is:
- `src/generate_final_config.py` builds the final resolved configuration.
- `src/run_tasks.py` initializes distributed execution and runs all configured tasks.
- Each task is instantiated from Hydra config and executed in sequence.
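As a rough illustration of that last step, Hydra's `instantiate` turns a `_target_` entry in the config into a live object. The config keys and target below are hypothetical stand-ins, not the repository's actual schema:

```python
# Minimal sketch of config-driven task execution, assuming hydra-core and
# omegaconf are installed. The `_target_` and task names are hypothetical.
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {
        "tasks": {
            "example_task": {
                "_target_": "collections.Counter",  # stand-in for a real task class
                "_args_": ["aab"],                  # positional args for the target
            }
        }
    }
)

for name, task_cfg in cfg.tasks.items():
    task = instantiate(task_cfg)  # builds the object named by `_target_`
    print(name, task)
```

This is what makes the pipeline configuration-driven: swapping a task means editing config, not code.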
The Makefile is the source of truth for running the project.
To see all available commands:
```
make help
```
Before running anything, make sure you have:
- Docker
- Docker Compose (`docker compose`)
- Make
- Google Cloud authentication available locally if you want cloud features
- valid environment files under `.envs/`
The container mounts your local Google Cloud config from `~/.config/gcloud/`, so if you plan to use GCP, authenticate locally first.
The project defines three Docker profiles:
- `dev`: local development
- `prod`: production/cloud-oriented commands
- `ci`: CI/testing environment
By default, the Makefile uses:
- service: `app-dev`
- container: `build-model-dev-container`
- profile: `dev`
```
make build
make up
```
This starts the development profile in the background.
```
make ps
make logs
make shell
```
or:
```
make exec-in
```
The clearest local execution path is this:
```
make local-run-tasks
```
This target does two things:
- runs `local-generate-final-config`
- launches `torchrun` inside the development container
It uses the local parquet files from `dataset/` and overrides the configuration for a small local run.
In practice, it is the best command to verify that the pipeline works end to end on your machine.
It launches:
```
torchrun --standalone --nproc_per_node=1 -m src.run_tasks
```
with local overrides such as:
- local dataset paths: `/app/dataset/train.parquet`, `/app/dataset/dev.parquet`, `/app/dataset/test.parquet`
- smaller batch sizes
- reduced worker count
- limited batches for a short smoke test
- `max_epochs=1`

This means the local target is intentionally configured as a lightweight validation run, not a full production training job.
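In plain PyTorch Lightning terms, that kind of smoke-test setup corresponds to trainer flags like the ones below. The flags are standard Lightning arguments, but the specific values here are illustrative, not copied from the repo's configs:

```python
# Sketch of a lightweight validation run in PyTorch Lightning, mirroring the
# kind of overrides the local target applies. Values are illustrative.
import pytorch_lightning as pl

trainer = pl.Trainer(
    max_epochs=1,            # single epoch, as in the local override
    limit_train_batches=2,   # only a few batches: a smoke test, not real training
    limit_val_batches=2,
    accelerator="cpu",
    devices=1,
)
# trainer.fit(model, datamodule=datamodule)  # model/datamodule come from the configs
```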
For normal local development, use these commands in order:
```
make build
make up
make local-run-tasks
```
That is usually enough to:
- build the image
- start the services
- generate the local config
- run training and evaluation once
```
make local-generate-final-config
```
This runs inside the dev container and generates the final config using a local experiment override:
```
python src/generate_final_config.py +experiment/bert=local_bert
```
```
make generate-final-config
```
This runs inside the prod container and generates the final config with a Docker image tag intended for cloud execution.
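To inspect what an override such as `+experiment/bert=local_bert` resolves to, Hydra's compose API can be used directly. The config directory and root config name below are assumptions about the repo's layout, not verified values:

```python
# Hypothetical sketch of resolving a Hydra config with an experiment override,
# assuming a config tree under src/configs/. Names are illustrative.
from hydra import compose, initialize_config_dir
from omegaconf import OmegaConf

with initialize_config_dir(config_dir="/app/src/configs", version_base=None):
    cfg = compose(
        config_name="config",                       # hypothetical root config name
        overrides=["+experiment/bert=local_bert"],  # same override the Make target uses
    )
print(OmegaConf.to_yaml(cfg))
```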
The Makefile defines a higher-level target for cloud execution:
```
make run-tasks
```
This target performs:
- `make generate-final-config`
- `make push`
- the cloud launch step from inside the production container
Use this when you want to prepare and launch a remote job rather than run a local smoke test.
Because this path depends on:
- GCP credentials
- container registry access
- infrastructure environment variables
- production MLflow settings
make sure `.envs/.mlflow-prod` and `.envs/.infrastructure` are correctly configured first.
MLflow is used for experiment tracking and artifact storage.
In local development, the stack uses:
- PostgreSQL as backend store
- a Docker volume for artifact storage
Relevant services and resources:
- `mlflow-db` container
- `mlflow-artifact-store` volume
- `postgresql-mlflow-data` volume
The tracking server is started through `docker/scripts/start-tracking-server.sh`.
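Once the tracking server is running, code in the containers logs to it through the standard MLflow client API. The tracking URI below is an assumption based on a typical local setup; check the `.envs/` files for the real host and port:

```python
# Minimal MLflow logging sketch. The tracking URI and experiment name are
# assumptions, not values taken from this repository's configuration.
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # hypothetical local address
mlflow.set_experiment("local-smoke-test")

with mlflow.start_run():
    mlflow.log_param("max_epochs", 1)
    mlflow.log_metric("val_loss", 0.42)
```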
To start JupyterLab inside the container:
```
make notebook
```
The container exposes port 8888.
By default, the Makefile uses `docker-compose.yaml`; this is the standard local path.
If you want GPU-backed local execution, the repository also includes `docker-compose-gpu.yaml`. This compose file reserves NVIDIA GPUs for the `app-dev` service; use it when your machine has a properly configured NVIDIA Docker runtime.
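Before launching a GPU run, it is worth confirming the container actually sees a device. A minimal check:

```python
# Quick sanity check that the NVIDIA runtime exposed a GPU to the container.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```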
```
make format
make sort
make format-and-sort
make lint
make check-type-annotations
make test
make full-check
```
These run inside the Docker environment, which helps keep development reproducible.
```
make down
make restart
make clean-mlflow-volumes
```
Use the volume cleanup command only if you intentionally want to remove stored MLflow metadata and artifacts.
The repository already contains example local datasets:
```
dataset/train.parquet
dataset/dev.parquet
dataset/test.parquet
```
The local run target is wired to use these files automatically.
For cloud or larger-scale runs, the generated config shows that the pipeline can also read from GCS paths.
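For illustration, the same reading code usually covers both cases, since pandas resolves `gs://` paths through `gcsfs` when it is installed. The bucket path below is hypothetical:

```python
# Sketch: reading the sample datasets locally or from GCS. Requires pandas
# (plus pyarrow for parquet, and gcsfs for the gs:// variant).
import pandas as pd

local_df = pd.read_parquet("dataset/train.parquet")

# Hypothetical cloud path; the real bucket comes from the generated config.
# cloud_df = pd.read_parquet("gs://your-bucket/datasets/train.parquet")

print(local_df.shape)
```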
`src/run_tasks.py` is the main runtime entrypoint.
It:
- loads the generated config
- initializes PyTorch distributed processing
- sets the backend to `gloo` or `nccl`
- seeds the run
- instantiates and executes each task from the config
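The backend choice above follows the standard `torch.distributed` convention: `nccl` when GPUs are available, `gloo` otherwise. A generic sketch of that initialization (the rank and world-size environment variables are provided by `torchrun`):

```python
# Sketch of the distributed setup run_tasks performs, in generic PyTorch terms.
# Run under `torchrun --standalone --nproc_per_node=1 -m this_module`.
import torch
import torch.distributed as dist

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)  # reads RANK/WORLD_SIZE from torchrun env
print(f"rank {dist.get_rank()} of {dist.get_world_size()} using {backend}")
dist.destroy_process_group()
```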
`src/generate_final_config.py` prepares the final resolved Hydra config that `run_tasks.py` consumes.
`src/entrypoint.py` is a simple config entrypoint and is mainly useful for checking that config loading works.
The provided example configuration contains two main tasks:
- `binary_text_classification_task`
- `binary_text_evaluation_task`
So the current repository is not just a generic MLOps scaffold. It already includes a concrete example pipeline for text classification, with:
- tokenization transformation
- model backbone / adapter / head composition (see the sketch after this list)
- Lightning trainers
- checkpoints
- evaluation
- model export
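The backbone / adapter / head split is a common composition pattern. The sketch below shows the general wiring in plain PyTorch, with illustrative names and shapes rather than the repository's actual modules:

```python
# Generic backbone/adapter/head composition sketch in plain PyTorch.
# Names and shapes are illustrative; the repo composes these via Hydra configs.
import torch
from torch import nn

class TextClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                          # e.g. a pretrained transformer
        self.adapter = nn.Linear(hidden_dim, hidden_dim)  # adapts backbone features
        self.head = nn.Linear(hidden_dim, num_classes)    # task-specific classifier

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        x = self.backbone(features)
        x = torch.relu(self.adapter(x))
        return self.head(x)

# Toy usage with an identity "backbone" just to show the wiring.
model = TextClassifier(backbone=nn.Identity(), hidden_dim=16, num_classes=2)
logits = model(torch.randn(4, 16))
print(logits.shape)  # torch.Size([4, 2])
```

Composing the three parts through config makes it possible to swap a backbone or head without touching the training loop.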
Cloud-related logic is located under `src/infrastructure/`.
This includes helpers for:
- instance template creation
- instance group creation
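For orientation, the `google-cloud-compute` client exposes this kind of operation programmatically. The sketch below is a generic example of instance template creation, not the repository's actual helper code, and all project and resource values are placeholders:

```python
# Hypothetical sketch of instance template creation with google-cloud-compute.
# The helpers in src/infrastructure/ may differ; values are placeholders, and a
# usable template would also need disks and network interfaces configured.
from google.cloud import compute_v1

template = compute_v1.InstanceTemplate()
template.name = "training-template"  # placeholder name
template.properties = compute_v1.InstanceProperties(
    machine_type="n1-standard-4",    # placeholder machine type
)

client = compute_v1.InstanceTemplatesClient()
operation = client.insert(
    project="your-gcp-project",              # placeholder project ID
    instance_template_resource=template,
)
operation.result()  # wait for the create operation to finish
```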
The intended production flow is:
- build the image
- push it to Artifact Registry
- generate the final production config
- launch the job on GCP
The project reads settings from:
```
.envs/.postgres
.envs/.mlflow-common
.envs/.mlflow-dev
.envs/.mlflow-prod
.envs/.infrastructure
```
These are imported by Docker Compose and the Makefile.
At minimum, verify these before running production commands:
- MLflow ports and URIs
- PostgreSQL settings
- GCP registry values
- zone / VM / project settings
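A small fail-fast check along these lines can catch misconfiguration before a production launch; the variable names are hypothetical examples rather than the repo's actual keys:

```python
# Sketch: fail fast if required environment variables are unset.
# Variable names are hypothetical; check the .envs/ files for the real keys.
import os
import sys

REQUIRED = ["MLFLOW_TRACKING_URI", "POSTGRES_USER", "GCP_PROJECT_ID"]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing environment variables: {', '.join(missing)}")
print("All required environment variables are set.")
```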
Quick reference:

Run the full local flow:
```
make build
make up
make local-run-tasks
```
Open a shell in the dev container:
```
make up
make shell
```
Run all checks:
```
make up
make full-check
```
Generate the production config:
```
make up-prod
make generate-final-config
```
Notes:
- The Makefile is the best interface for this project.
- For local development, prefer `make local-run-tasks` over manually invoking Python scripts.
- For cloud execution, use the production-oriented targets only after verifying the `.envs/` values and GCP authentication.
- The repository already contains sample local data, so you can test the full flow without downloading anything else first.
If you only want the shortest path to verify the project works, run:
```
make build
make up
make local-run-tasks
```
That is the clearest starting point for this repository.
Project inspired by the Udemy course "Build, Manage, and Deploy Machine Learning (AI) Projects with Python and MLOps" by Kıvanç Yüksel.