LLM Post-Training And Systems Optimization

This repo is for an end-to-end LLM systems project, not just a one-off fine-tune.

The project has two connected goals:

  • build a post-training baseline on a public instruction dataset
  • use that baseline as the reference point for later training-efficiency and inference-serving optimization

The current baseline stack is:

  • QLoRA SFT
  • Qwen/Qwen3-8B
  • HuggingFaceH4/ultrachat_200k
  • YAML-driven configs
  • local artifact outputs under .artifacts/

Project Plan

The project follows four phases:

  1. Training baseline: build a stable SFT baseline and collect loss, step time, throughput, and memory metrics.
  2. Training optimization: compare stronger training setups, such as multi-GPU and DeepSpeed, against the baseline.
  3. Inference baseline: deploy the trained model and establish a serving benchmark, likely starting with vLLM.
  4. Inference optimization: compare serving configurations and frameworks on TTFT, latency, throughput, and memory usage.
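The Phase 1 metrics above (step time, throughput) reduce to simple arithmetic over token counts and wall-clock time. A minimal sketch, with the `StepMetrics` helper name and the example numbers being illustrative rather than anything from this repo:

```python
# Hypothetical sketch of the per-step training metrics Phase 1 collects.
from dataclasses import dataclass

@dataclass
class StepMetrics:
    step_time_s: float     # wall-clock seconds for one optimizer step
    tokens_per_step: int   # batch_size * seq_len actually processed

    @property
    def throughput_tok_s(self) -> float:
        # throughput = tokens processed per second of wall-clock time
        return self.tokens_per_step / self.step_time_s

m = StepMetrics(step_time_s=2.0, tokens_per_step=8 * 2048)
print(f"{m.throughput_tok_s:.0f} tokens/s")  # 16384 tokens over 2 s -> 8192 tokens/s
```

Memory metrics would come from the framework (e.g. peak allocated GPU memory) rather than from arithmetic like this.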

Current Status

As of 2026-03-27, the codebase and environment have already passed Phase 1 smoke testing on the current machine.

Completed so far:

  • switched the project from gated Llama 3.1 8B to public Qwen/Qwen3-8B
  • verified the current machine target: Ubuntu 24.04, 2 x RTX 5090
  • standardized on the Python environment at /venv/main
  • confirmed torch, CUDA, bf16, and bitsandbytes work in /venv/main
  • verified Qwen/Qwen3-8B access from this machine
  • verified dataset formatting on HuggingFaceH4/ultrachat_200k
  • updated the training entrypoint to match the installed transformers / trl APIs
  • completed a short debug run
  • completed a smaller single-GPU baseline run successfully

What is not done yet:

  • the official baseline run has not been completed
  • no training result so far should be treated as the final reference for later comparison work

Current Decision

The current project decision is:

  • keep the baseline definition as single-GPU
  • use a fixed 50k subset as the next official baseline run
  • move that run to a stronger machine instead of continuing on the current one
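For a "fixed 50k subset" to serve as a reference point across machines, the subset selection must be deterministic. A minimal sketch of one way to pin it, where the function name and seed are illustrative and not taken from this repo's code:

```python
# Hypothetical sketch: make the 50k baseline subset reproducible by
# shuffling indices with a pinned seed and taking the first N.
import random

def fixed_subset_indices(dataset_size: int, subset_size: int, seed: int = 42) -> list[int]:
    indices = list(range(dataset_size))
    random.Random(seed).shuffle(indices)  # seeded RNG -> same order everywhere
    return indices[:subset_size]

# The same seed yields the same subset on any machine.
a = fixed_subset_indices(200_000, 50_000, seed=42)
b = fixed_subset_indices(200_000, 50_000, seed=42)
assert a == b
```

With Hugging Face datasets, the equivalent would typically be a seeded `shuffle` followed by `select` on the loaded split.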

Why the current run was stopped:

  • a larger 50k single-GPU baseline was started on the current machine
  • after runtime review, we decided not to keep spending time and rental cost on the current machine overnight
  • the run was intentionally stopped before adoption as the official baseline
  • the next baseline run should start fresh on the new machine

Next Start Point

When returning on the new machine, the next task is:

  • run the official 50k single-GPU baseline

Use this config:

  • configs/qlora_sft_full.yaml

Current intended baseline values in that config:

  • model: Qwen/Qwen3-8B
  • dataset: HuggingFaceH4/ultrachat_200k
  • split: train_sft
  • subset_size: 50000
  • max_seq_length: 2048
  • output path: .artifacts/outputs/qlora_sft_50k
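Assuming flat top-level keys, the intended baseline values would map onto configs/qlora_sft_full.yaml roughly as below. The key names are hypothetical; the actual schema is defined by train/config.py:

```yaml
# Illustrative sketch only: key names are guesses, values are the
# intended baseline settings listed above.
model_name: Qwen/Qwen3-8B
dataset_name: HuggingFaceH4/ultrachat_200k
dataset_split: train_sft
subset_size: 50000
max_seq_length: 2048
output_dir: .artifacts/outputs/qlora_sft_50k
```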

Environment Handoff

The environment verified on this machine was:

  • repo path: /workspace/llm-posttrain-opt
  • Python environment: /venv/main
  • Python: 3.12.13
  • GPU: NVIDIA GeForce RTX 5090 x 2

Key package versions that were actually validated here:

  • torch==2.11.0+cu130
  • transformers==4.57.6
  • trl==0.29.1
  • peft==0.18.1
  • datasets==4.8.4
  • accelerate==1.13.0
  • huggingface_hub==0.36.2
  • bitsandbytes==0.49.2

Important note:

  • this environment was validated by successful local runs on the current machine
  • requirements.txt is older than the environment that was actually validated
  • after moving to a new machine, re-check the environment instead of assuming the new machine matches this one exactly

What To Check On The New Machine

On the next machine, verify these items first:

  • GPU visibility and CUDA availability
  • /venv/main or the replacement environment path
  • torch, bitsandbytes, transformers, trl, peft, datasets, accelerate, huggingface_hub
  • access to Qwen/Qwen3-8B
  • enough disk space for Hugging Face cache and outputs
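The package checks in this list can be scripted. A minimal preflight sketch, separate from the repo's own train/preflight.py (whose contents are unknown here); the helper names are illustrative, and the torch/CUDA check is guarded so the script still runs on a machine where the GPU stack is missing:

```python
# Hypothetical preflight sketch: report installed versions of the packages
# this README lists, and CUDA visibility if torch is present.
from importlib import metadata, util

REQUIRED = [
    "torch", "bitsandbytes", "transformers", "trl",
    "peft", "datasets", "accelerate", "huggingface_hub",
]

def check_package(name: str) -> str:
    """Return the installed version, 'unknown' if importable but
    unversioned, or 'MISSING' if not importable at all."""
    if util.find_spec(name) is None:
        return "MISSING"
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "unknown"

def preflight(packages=REQUIRED) -> dict:
    report = {name: check_package(name) for name in packages}
    # Only probe CUDA when torch is actually installed.
    if report.get("torch") not in (None, "MISSING"):
        import torch
        report["cuda_available"] = str(torch.cuda.is_available())
    return report

if __name__ == "__main__":
    for name, version in preflight().items():
        print(f"{name:>16}: {version}")
```

Model access (Qwen/Qwen3-8B) and disk space still need manual checks; this only covers the Python environment.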

Recommended first commands:

cd /workspace/llm-posttrain-opt
source /venv/main/bin/activate
export LLM_PT_DATA_ROOT=/workspace/llm-posttrain-opt/.artifacts
export CUDA_VISIBLE_DEVICES=0
python scripts/inspect_dataset.py --config configs/qlora_sft_debug.yaml --num-samples 3
python -m train.train_sft --config configs/qlora_sft_debug.yaml

If those pass, launch the official baseline:

cd /workspace/llm-posttrain-opt
source /venv/main/bin/activate
export LLM_PT_DATA_ROOT=/workspace/llm-posttrain-opt/.artifacts
export CUDA_VISIBLE_DEVICES=0
python -m train.train_sft --config configs/qlora_sft_full.yaml

Current Repo Structure

configs/
  qlora_sft.yaml
  qlora_sft_debug.yaml
  qlora_sft_full.yaml
scripts/
  inspect_dataset.py
  run_train_baseline.sh
  setup_env.sh
train/
  __init__.py
  config.py
  data.py
  metrics.py
  model.py
  preflight.py
  train_sft.py
  utils.py
docs/
  experiment_log.md
requirements.txt
README.md
