This repo is for an end-to-end LLM systems project, not just a one-off fine-tune.
The project has two connected goals:
- build a post-training baseline on a public instruction dataset
- use that baseline as the reference point for later training-efficiency and inference-serving optimization
The current baseline stack is:
- QLoRA SFT
- model: Qwen/Qwen3-8B
- dataset: HuggingFaceH4/ultrachat_200k
- YAML-driven configs
- local artifact outputs under .artifacts/
The project follows four phases:
- Training baseline: build a stable SFT baseline; collect loss, step time, throughput, and memory metrics.
- Training optimization: compare stronger training setups, such as multi-GPU and DeepSpeed, against the baseline.
- Inference baseline: deploy the trained model and establish a serving benchmark, likely starting with vLLM.
- Inference optimization: compare serving configurations and frameworks on TTFT, latency, throughput, and memory usage.
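Phase 1's metric collection (step time and throughput, alongside loss and memory) comes down to simple per-step bookkeeping. A minimal sketch of that idea; the class and field names here are invented and make no assumption about the repo's actual train/metrics.py:

```python
class StepMetrics:
    """Tiny per-step tracker for step time and token throughput.

    Illustrative only -- the repo's train/metrics.py may differ. In real
    use, wrap each optimizer step with time.perf_counter() and record the
    elapsed seconds plus the number of tokens processed in that step.
    """

    def __init__(self):
        self.steps = []  # list of (seconds, tokens) pairs

    def record(self, step_seconds, tokens_in_step):
        self.steps.append((step_seconds, tokens_in_step))

    def summary(self):
        total_time = sum(s for s, _ in self.steps)
        total_tokens = sum(t for _, t in self.steps)
        return {
            "mean_step_time_s": total_time / len(self.steps),
            "tokens_per_second": total_tokens / total_time,
        }

# Three fake steps of 2.0 s, each processing 8192 tokens.
m = StepMetrics()
for _ in range(3):
    m.record(2.0, 8192)
print(m.summary())  # tokens_per_second == 4096.0
```

Keeping the raw (seconds, tokens) pairs, rather than only running totals, makes it easy to also report per-step variance when comparing Phase 2 setups against the baseline.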
As of 2026-03-27, the codebase and environment have already passed Phase 1 smoke testing on the current machine.
Completed so far:
- switched the project from gated Llama 3.1 8B to public Qwen/Qwen3-8B
- verified the current machine target: Ubuntu 24.04, 2 x RTX 5090
- standardized on the Python environment at /venv/main
- confirmed torch, CUDA, bf16, and bitsandbytes work in /venv/main
- verified Qwen/Qwen3-8B access from this machine
- verified dataset formatting on HuggingFaceH4/ultrachat_200k
- updated the training entrypoint to match the installed transformers/trl APIs
- completed a short debug run
- completed a smaller single-GPU baseline run successfully
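For reference, each HuggingFaceH4/ultrachat_200k record carries a messages list of role/content dicts. A minimal stdlib sketch of flattening one record into training text; the real pipeline would normally use the tokenizer's chat template instead, and the <|role|> markers below are illustrative, not Qwen's actual template:

```python
def render_messages(messages):
    """Flatten an ultrachat-style message list into plain training text.

    Illustrative only: a real pipeline would typically call
    tokenizer.apply_chat_template rather than hand-rolling markers.
    """
    return "\n".join(f"<|{m['role']}|>\n{m['content']}" for m in messages)

example = {  # shape of one ultrachat_200k record (content abbreviated)
    "messages": [
        {"role": "user", "content": "What is QLoRA?"},
        {"role": "assistant", "content": "A 4-bit quantized LoRA fine-tuning method."},
    ]
}
print(render_messages(example["messages"]))
```

Checking a few records this way is a quick sanity test that every message carries the expected role/content keys before a long run starts.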
What is not done yet:
- the official baseline run has not been completed
- no training result so far should be treated as final for later comparison work
The current project decision is:
- keep the baseline definition as single-GPU
- use a fixed 50k subset for the next official baseline run
- move that run to a stronger machine instead of continuing on the current one
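The point of a fixed 50k subset is that the official baseline stays reproducible across machines. A stdlib sketch of deterministic subset selection; the repo presumably handles this via subset_size in the YAML config, and the seed value here is an assumption:

```python
import random

def fixed_subset_indices(dataset_size, subset_size, seed=42):
    """Deterministically pick subset_size indices so the 50k baseline is
    the same on any machine. The seed of 42 is an assumption for this
    sketch, not the repo's actual setting."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(dataset_size), subset_size))

a = fixed_subset_indices(200_000, 50_000)
b = fixed_subset_indices(200_000, 50_000)
assert a == b  # same seed -> same subset on any machine
```

An even simpler alternative, if ordering bias is acceptable, is just taking the first 50,000 examples (e.g. dataset.select(range(50_000)) with Hugging Face datasets); either way, the selection rule must be pinned before the baseline run counts as official.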
Why the current run was stopped:
- a larger 50k single-GPU baseline run was started on the current machine
- after runtime review, we decided not to keep spending time and rental cost on the current machine overnight
- the run was intentionally stopped before adoption as the official baseline
- the next baseline run should start fresh on the new machine
When returning on the new machine, the next task is:
- run the official 50k single-GPU baseline
Use this config:
configs/qlora_sft_full.yaml
Current intended baseline values in that config:
- model: Qwen/Qwen3-8B
- dataset: HuggingFaceH4/ultrachat_200k
- split: train_sft
- subset_size: 50000
- max_seq_length: 2048
- output path: .artifacts/outputs/qlora_sft_50k
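Those values would correspond to a YAML fragment along the following lines. The key names are assumptions about the config schema, so read configs/qlora_sft_full.yaml itself rather than copying this sketch:

```yaml
# Hypothetical sketch of configs/qlora_sft_full.yaml.
# Key names are assumptions, not the repo's actual schema.
model_name: Qwen/Qwen3-8B
dataset_name: HuggingFaceH4/ultrachat_200k
dataset_split: train_sft
subset_size: 50000
max_seq_length: 2048
output_dir: .artifacts/outputs/qlora_sft_50k
```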
The currently verified environment on this machine was:
- repo path:
/workspace/llm-posttrain-opt - Python environment:
/venv/main - Python:
3.12.13 - GPU:
NVIDIA GeForce RTX 5090 x 2
Key package versions that were actually validated here:
- torch==2.11.0+cu130
- transformers==4.57.6
- trl==0.29.1
- peft==0.18.1
- datasets==4.8.4
- accelerate==1.13.0
- huggingface_hub==0.36.2
- bitsandbytes==0.49.2
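When re-validating on the new machine, the pinned list above can be compared against the installed environment with importlib.metadata. A minimal sketch; the EXPECTED mapping simply restates the versions above:

```python
from importlib import metadata

# Pinned versions that were validated on the current machine.
EXPECTED = {
    "torch": "2.11.0+cu130",
    "transformers": "4.57.6",
    "trl": "0.29.1",
    "peft": "0.18.1",
    "datasets": "4.8.4",
    "accelerate": "1.13.0",
    "huggingface_hub": "0.36.2",
    "bitsandbytes": "0.49.2",
}

def check_versions(expected):
    """Return (mismatched, missing) relative to the installed environment."""
    mismatched, missing = {}, []
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
            continue
        if have != want:
            mismatched[name] = (want, have)
    return mismatched, missing

if __name__ == "__main__":
    mismatched, missing = check_versions(EXPECTED)
    print("mismatched:", mismatched)
    print("missing:", missing)
```

A version mismatch is not automatically a blocker, but any difference from the validated pins should be noted in docs/experiment_log.md before the official run.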
Important note:
- this environment was validated by successful local runs on the current machine
- requirements.txt is older than the environment that was actually validated
- after moving to a new machine, re-check the environment instead of assuming it matches this one exactly
On the next machine, verify these items first:
- GPU visibility and CUDA availability
- /venv/main or the replacement environment path
- working imports of torch, bitsandbytes, transformers, trl, peft, datasets, accelerate, huggingface_hub
- access to Qwen/Qwen3-8B
- enough disk space for the Hugging Face cache and outputs
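The checklist above can be sketched as a lightweight preflight script. This is not the contents of the repo's train/preflight.py, just a stdlib approximation, and the 50 GB free-disk threshold is an arbitrary assumption:

```python
import importlib.util
import shutil

REQUIRED = ["torch", "bitsandbytes", "transformers", "trl",
            "peft", "datasets", "accelerate", "huggingface_hub"]

def preflight(cache_dir="/", min_free_gb=50):
    """Lightweight environment preflight mirroring the checklist above.

    Not the repo's actual train/preflight.py; the min_free_gb default is
    an illustrative threshold, not a measured requirement.
    """
    report = {}
    # Package presence, without importing anything heavyweight.
    report["missing_packages"] = [
        name for name in REQUIRED if importlib.util.find_spec(name) is None
    ]
    # Disk space where the HF cache and outputs will live.
    free_gb = shutil.disk_usage(cache_dir).free / 1e9
    report["disk_ok"] = free_gb >= min_free_gb
    # GPU visibility, only if torch is importable.
    if "torch" not in report["missing_packages"]:
        import torch
        report["cuda_ok"] = torch.cuda.is_available()
    return report

print(preflight())
```

Model access (Qwen/Qwen3-8B) is deliberately left out of this sketch, since checking it requires a network call; the inspect_dataset.py step in the commands below exercises the Hub path anyway.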
Recommended first commands:
cd /workspace/llm-posttrain-opt
source /venv/main/bin/activate
export LLM_PT_DATA_ROOT=/workspace/llm-posttrain-opt/.artifacts
export CUDA_VISIBLE_DEVICES=0
python scripts/inspect_dataset.py --config configs/qlora_sft_debug.yaml --num-samples 3
python -m train.train_sft --config configs/qlora_sft_debug.yaml

If those pass, launch the official baseline:
cd /workspace/llm-posttrain-opt
source /venv/main/bin/activate
export LLM_PT_DATA_ROOT=/workspace/llm-posttrain-opt/.artifacts
export CUDA_VISIBLE_DEVICES=0
python -m train.train_sft --config configs/qlora_sft_full.yaml

Repository layout:

configs/
qlora_sft.yaml
qlora_sft_debug.yaml
qlora_sft_full.yaml
scripts/
inspect_dataset.py
run_train_baseline.sh
setup_env.sh
train/
__init__.py
config.py
data.py
metrics.py
model.py
preflight.py
train_sft.py
utils.py
docs/
experiment_log.md
requirements.txt
README.md