This repo is for an end-to-end LLM systems project, not just a one-off fine-tune.
The project has two connected goals:
- build a post-training baseline on a public instruction dataset
- use that baseline as the reference point for later training-efficiency and inference-serving optimization
The current baseline stack is:
- QLoRA SFT
- model: Qwen/Qwen3-8B
- dataset: HuggingFaceH4/ultrachat_200k
- YAML-driven configs
- local artifact outputs under .artifacts/
The project follows four phases:
- Training baseline: build a stable SFT baseline; collect loss, step time, throughput, and memory metrics.
- Training optimization: compare stronger training setups, such as multi-GPU and DeepSpeed, against the baseline.
- Inference baseline: deploy the trained model and establish a serving benchmark, likely starting with vLLM.
- Inference optimization: compare serving configurations and frameworks on TTFT, latency, throughput, and memory usage.
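Phase 1's metric collection (step time and throughput, alongside loss and memory) comes down to simple per-step bookkeeping. A minimal sketch of that idea; the class and field names here are invented and make no assumption about the repo's actual train/metrics.py:

```python
class StepMetrics:
    """Tiny per-step tracker for step time and token throughput.

    Illustrative only -- the repo's train/metrics.py may differ. In real
    use, wrap each optimizer step with time.perf_counter() and record the
    elapsed seconds plus the number of tokens processed in that step.
    """

    def __init__(self):
        self.steps = []  # list of (seconds, tokens) pairs

    def record(self, step_seconds, tokens_in_step):
        self.steps.append((step_seconds, tokens_in_step))

    def summary(self):
        total_time = sum(s for s, _ in self.steps)
        total_tokens = sum(t for _, t in self.steps)
        return {
            "mean_step_time_s": total_time / len(self.steps),
            "tokens_per_second": total_tokens / total_time,
        }

# Three fake steps of 2.0 s, each processing 8192 tokens.
m = StepMetrics()
for _ in range(3):
    m.record(2.0, 8192)
print(m.summary())  # tokens_per_second == 4096.0
```

Keeping the raw (seconds, tokens) pairs, rather than only running totals, makes it easy to also report per-step variance when comparing Phase 2 setups against the baseline.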
As of 2026-03-27, the codebase and environment have already passed Phase 1 smoke testing on the current machine.
Completed so far:
- switched the project from gated Llama 3.1 8B to public Qwen/Qwen3-8B
- verified the current machine target: Ubuntu 24.04, 2 x RTX 5090
- standardized on the Python environment at /venv/main
- confirmed torch, CUDA, bf16, and bitsandbytes work in /venv/main
- verified Qwen/Qwen3-8B access from this machine
- verified dataset formatting on HuggingFaceH4/ultrachat_200k
- updated the training entrypoint to match the installed transformers/trl APIs
- completed a short debug run
- completed a smaller single-GPU baseline run successfully
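For reference, each HuggingFaceH4/ultrachat_200k record carries a messages list of role/content dicts. A minimal stdlib sketch of flattening one record into training text; the real pipeline would normally use the tokenizer's chat template instead, and the <|role|> markers below are illustrative, not Qwen's actual template:

```python
def render_messages(messages):
    """Flatten an ultrachat-style message list into plain training text.

    Illustrative only: a real pipeline would typically call
    tokenizer.apply_chat_template rather than hand-rolling markers.
    """
    return "\n".join(f"<|{m['role']}|>\n{m['content']}" for m in messages)

example = {  # shape of one ultrachat_200k record (content abbreviated)
    "messages": [
        {"role": "user", "content": "What is QLoRA?"},
        {"role": "assistant", "content": "A 4-bit quantized LoRA fine-tuning method."},
    ]
}
print(render_messages(example["messages"]))
```

Checking a few records this way is a quick sanity test that every message carries the expected role/content keys before a long run starts.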
What is not done yet:
- the official baseline run has not been completed
- no training result so far should be treated as final for later comparison work
The current project decision is:
- keep the baseline definition as single-GPU
- use a fixed 50k subset for the next official baseline run
- move that run to a stronger machine instead of continuing on the current one
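The point of a fixed 50k subset is that the official baseline stays reproducible across machines. A stdlib sketch of deterministic subset selection; the repo presumably handles this via subset_size in the YAML config, and the seed value here is an assumption:

```python
import random

def fixed_subset_indices(dataset_size, subset_size, seed=42):
    """Deterministically pick subset_size indices so the 50k baseline is
    the same on any machine. The seed of 42 is an assumption for this
    sketch, not the repo's actual setting."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(dataset_size), subset_size))

a = fixed_subset_indices(200_000, 50_000)
b = fixed_subset_indices(200_000, 50_000)
assert a == b  # same seed -> same subset on any machine
```

An even simpler alternative, if ordering bias is acceptable, is just taking the first 50,000 examples (e.g. dataset.select(range(50_000)) with Hugging Face datasets); either way, the selection rule must be pinned before the baseline run counts as official.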
Why the current run was stopped:
- a larger 50k single-GPU baseline run was started on the current machine
- after runtime review, we decided not to keep spending time and rental cost on the current machine overnight
- the run was intentionally stopped before adoption as the official baseline
- the next baseline run should start fresh on the new machine
When returning on the new machine, the next task is:
- run the official 50k single-GPU baseline
Use this config:
configs/qlora_sft_full.yaml
Current intended baseline values in that config:
- model: Qwen/Qwen3-8B
- dataset: HuggingFaceH4/ultrachat_200k
- split: train_sft
- subset_size: 50000
- max_seq_length: 2048
- output path: .artifacts/outputs/qlora_sft_50k
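Those values would correspond to a YAML fragment along the following lines. The key names are assumptions about the config schema, so read configs/qlora_sft_full.yaml itself rather than copying this sketch:

```yaml
# Hypothetical sketch of configs/qlora_sft_full.yaml.
# Key names are assumptions, not the repo's actual schema.
model_name: Qwen/Qwen3-8B
dataset_name: HuggingFaceH4/ultrachat_200k
dataset_split: train_sft
subset_size: 50000
max_seq_length: 2048
output_dir: .artifacts/outputs/qlora_sft_50k
```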
The currently verified environment on this machine was:
- repo path:
/workspace/llm-posttrain-opt - Python environment:
/venv/main - Python:
3.12.13 - GPU:
NVIDIA GeForce RTX 5090 x 2
Key package versions that were actually validated here:
- torch==2.11.0+cu130
- transformers==4.57.6
- trl==0.29.1
- peft==0.18.1
- datasets==4.8.4
- accelerate==1.13.0
- huggingface_hub==0.36.2
- bitsandbytes==0.49.2
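When re-validating on the new machine, the pinned list above can be compared against the installed environment with importlib.metadata. A minimal sketch; the EXPECTED mapping simply restates the versions above:

```python
from importlib import metadata

# Pinned versions that were validated on the current machine.
EXPECTED = {
    "torch": "2.11.0+cu130",
    "transformers": "4.57.6",
    "trl": "0.29.1",
    "peft": "0.18.1",
    "datasets": "4.8.4",
    "accelerate": "1.13.0",
    "huggingface_hub": "0.36.2",
    "bitsandbytes": "0.49.2",
}

def check_versions(expected):
    """Return (mismatched, missing) relative to the installed environment."""
    mismatched, missing = {}, []
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
            continue
        if have != want:
            mismatched[name] = (want, have)
    return mismatched, missing

if __name__ == "__main__":
    mismatched, missing = check_versions(EXPECTED)
    print("mismatched:", mismatched)
    print("missing:", missing)
```

A version mismatch is not automatically a blocker, but any difference from the validated pins should be noted in docs/experiment_log.md before the official run.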
Important note:
- this environment was validated by successful local runs on the current machine
- requirements.txt is older than the environment that was actually validated
- after moving to a new machine, re-check the environment instead of assuming it matches this one exactly
On the next machine, verify these items first:
- GPU visibility and CUDA availability
- /venv/main or the replacement environment path
- working imports of torch, bitsandbytes, transformers, trl, peft, datasets, accelerate, huggingface_hub
- access to Qwen/Qwen3-8B
- enough disk space for the Hugging Face cache and outputs
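The checklist above can be sketched as a lightweight preflight script. This is not the contents of the repo's train/preflight.py, just a stdlib approximation, and the 50 GB free-disk threshold is an arbitrary assumption:

```python
import importlib.util
import shutil

REQUIRED = ["torch", "bitsandbytes", "transformers", "trl",
            "peft", "datasets", "accelerate", "huggingface_hub"]

def preflight(cache_dir="/", min_free_gb=50):
    """Lightweight environment preflight mirroring the checklist above.

    Not the repo's actual train/preflight.py; the min_free_gb default is
    an illustrative threshold, not a measured requirement.
    """
    report = {}
    # Package presence, without importing anything heavyweight.
    report["missing_packages"] = [
        name for name in REQUIRED if importlib.util.find_spec(name) is None
    ]
    # Disk space where the HF cache and outputs will live.
    free_gb = shutil.disk_usage(cache_dir).free / 1e9
    report["disk_ok"] = free_gb >= min_free_gb
    # GPU visibility, only if torch is importable.
    if "torch" not in report["missing_packages"]:
        import torch
        report["cuda_ok"] = torch.cuda.is_available()
    return report

print(preflight())
```

Model access (Qwen/Qwen3-8B) is deliberately left out of this sketch, since checking it requires a network call; the inspect_dataset.py step in the commands below exercises the Hub path anyway.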
Recommended first commands:
cd /workspace/llm-posttrain-opt
source /venv/main/bin/activate
export LLM_PT_DATA_ROOT=/workspace/llm-posttrain-opt/.artifacts
export CUDA_VISIBLE_DEVICES=0
python scripts/inspect_dataset.py --config configs/qlora_sft_debug.yaml --num-samples 3
python -m train.train_sft --config configs/qlora_sft_debug.yaml

If those pass, launch the official baseline:
cd /workspace/llm-posttrain-opt
source /venv/main/bin/activate
export LLM_PT_DATA_ROOT=/workspace/llm-posttrain-opt/.artifacts
export CUDA_VISIBLE_DEVICES=0
python -m train.train_sft --config configs/qlora_sft_full.yaml

Repository layout:

configs/
qlora_sft.yaml
qlora_sft_debug.yaml
qlora_sft_full.yaml
scripts/
inspect_dataset.py
run_train_baseline.sh
setup_env.sh
train/
__init__.py
config.py
data.py
metrics.py
model.py
preflight.py
train_sft.py
utils.py
docs/
experiment_log.md
requirements.txt
README.md