MLOps project for MLOps Zoomcamp
This is a final project for MLOps Zoomcamp 2025. It uses open data from Data Science Stack Exchange. The data for this and other Stack Exchange sites, including Stack Overflow, can be downloaded from archive.org.
The project is built around a classification model that predicts whether a topic from a Stack Exchange forum has been answered, based on several numerical features: number of votes, title length (in characters), number of comments, whether the post author has a description, etc.
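
As an illustration of the features listed above, here is a minimal sketch of turning one question into a feature row. It assumes the input looks like a `<row>` element from the `Posts.xml` file of a Stack Exchange data dump; the sample values, the `extract_features` helper, the feature names, and the definition of "answered" (at least one answer) are assumptions made for this example, not the project's actual code.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# One <row> in the style of Posts.xml from a Stack Exchange data dump.
# The values are made up for illustration.
SAMPLE_ROW = (
    '<row Id="1" PostTypeId="1" Score="5" Title="What is overfitting?" '
    'CommentCount="2" AnswerCount="3" />'
)


def extract_features(row_xml: str, author_about_me: Optional[str]) -> Dict[str, int]:
    """Turn one question row into numerical features plus a target label."""
    row = ET.fromstring(row_xml)
    return {
        "score": int(row.get("Score", "0")),
        "title_length": len(row.get("Title", "")),
        "comment_count": int(row.get("CommentCount", "0")),
        # 1 if the author's profile description is non-empty, else 0.
        "author_has_description": int(bool(author_about_me)),
        # Assumed target: a question counts as answered if it has any answers.
        "is_answered": int(int(row.get("AnswerCount", "0")) > 0),
    }


features = extract_features(SAMPLE_ROW, author_about_me="Data scientist")
print(features)
```

In the data dump the author description lives in a separate users file, so a real pipeline would need a join on the author id before this step.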
The project uses the following technologies:
- Docker for containerization
- Docker Compose for managing containers
- Prefect for flow orchestration with PostgreSQL as a DBMS for Prefect
- LightGBM as an ML framework
- MLflow for model tracking & model registry
- MLServer as an inference server
- uv & venv for managing the Python environment
- ruff, isort & mypy for statically ensuring code quality and enforcing code style
There are four workflows in the pipeline:
- download_and_unpack_archive - downloads an archive with Stack Exchange forum posts from archive.org and unpacks it
- prepare_dataset - prepares a dataset from the initial archive: builds features and splits the data into train, validation & test
- train_model - trains a LightGBM model on the prepared dataset
- prepare_mlserver_resources - prepares resources from the previously trained model for serving with MLServer
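
The split performed by prepare_dataset can be pictured with a small sketch. It assumes that validation_size and test_size are fractions of the whole dataset and that rows are shuffled before cutting; both are assumptions, and `split_indices` is a hypothetical helper rather than the project's implementation.

```python
import random


def split_indices(n_rows, validation_size, test_size, seed=42):
    """Shuffle row indices and cut them into train, validation and test parts."""
    if not 0 < validation_size + test_size < 1:
        raise ValueError("validation_size + test_size must be between 0 and 1")
    indices = list(range(n_rows))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    n_validation = round(n_rows * validation_size)
    test_idx = indices[:n_test]
    validation_idx = indices[n_test:n_test + n_validation]
    train_idx = indices[n_test + n_validation:]
    return train_idx, validation_idx, test_idx


train_idx, validation_idx, test_idx = split_indices(
    1000, validation_size=0.1, test_size=0.3
)
print(len(train_idx), len(validation_idx), len(test_idx))
```

With 1000 rows, validation_size=0.1 and test_size=0.3, this leaves 600 rows for training, 100 for validation and 300 for testing.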
Requirements:
- Docker + Docker Compose
- uv + Python

Usage:
- (optionally) Statically validate the code. In the src directory:
    uv run ruff check && uv run isort --check --diff . && uv run mypy
- Build the Docker images and set up the Docker containers:
    docker compose build && \
    docker compose up
- Visit the Prefect UI dashboard at http://localhost:18900 and launch the workflows in order. Examples of parameters for the workflows:
  - download_and_unpack_archive: community=linguistics
  - prepare_dataset: validation_size=0.1, test_size=0.3
  - train_model: model_name=linguistics_lgbm, model_alias=prod
  - prepare_mlserver_resources: same as for train_model
- After that, the model will be available for inference via MLServer at http://localhost:18910 (a restart might be required).
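
Once the model is served, it can be queried over HTTP. The sketch below builds a request body in the V2 inference protocol format that MLServer exposes under `/v2/models/<name>/infer`. It assumes the served model name matches the model_name workflow parameter, that the input tensor is called `input-0`, and that the features are passed in the order shown; all three are assumptions to check against the generated MLServer resources.

```python
import json
import urllib.request

MLSERVER_URL = "http://localhost:18910"  # port taken from the text above
MODEL_NAME = "linguistics_lgbm"  # assumed to match the model_name parameter


def build_infer_request(rows):
    """Build a V2 inference protocol body for a batch of feature rows."""
    n_features = len(rows[0])
    return {
        "inputs": [
            {
                "name": "input-0",  # hypothetical tensor name
                "shape": [len(rows), n_features],
                "datatype": "FP64",
                # Tensor data is sent flattened in row-major order.
                "data": [value for row in rows for value in row],
            }
        ]
    }


def predict(rows):
    """POST the batch to MLServer and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{MLSERVER_URL}/v2/models/{MODEL_NAME}/infer",
        data=json.dumps(build_infer_request(rows)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# One row (assumed order): votes, title length, comment count, has description.
payload = build_infer_request([[5.0, 20.0, 2.0, 1.0]])
print(json.dumps(payload, indent=2))
```

Calling `predict([[5.0, 20.0, 2.0, 1.0]])` sends the request; it only succeeds while the containers are running and the model has been prepared for serving.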