This guide provides detailed instructions, best practices, and optimized configurations for benchmarking Mixtral, DeepSeek, and Qwen series models with the Megatron-Core framework, to achieve optimal performance and reliability.
- DeepSeek-V3 best practices in a single command
- Currently includes H100, B200, and Long Context configs. GB200 config coming soon.
- Container Setup
- Design Docs
- Environment Setup
- Performance Benchmarking
- DeepSeek Checkpoint Conversion
- Dockerfile: `dockers/Dockerfile`
Please refer to `design_docs`.
Before entering the container, you need to install `yq` to process `.yaml` configuration files.
Installation steps:

1. Create a local bin directory:
   ```bash
   mkdir -p ~/.local/bin
   ```
2. Download the `yq` executable:
   ```bash
   wget https://github.com/mikefarah/yq/releases/download/v4.27.5/yq_linux_amd64 -O ~/.local/bin/yq
   ```
3. Make it executable:
   ```bash
   chmod +x ~/.local/bin/yq
   ```
4. Add the local bin directory to your `PATH` in `~/.bashrc`:
   ```bash
   export PATH="$HOME/.local/bin:$PATH"
   ```
5. Apply the changes:
   ```bash
   source ~/.bashrc
   ```
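Once installed, a quick sanity check; the `.yaml` path below is a placeholder for any configuration file in your checkout:

```bash
# Confirm yq is on PATH and reports its version
yq --version
# Pretty-print a configuration file (placeholder path)
yq '.' model_configs/your_model.yaml
```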
Before running any scripts, you need to set up the following environment variables:
```bash
export WANDB_API_KEY="your_wandb_api_key_here"
export MEGATRON_PATH="/path/to/your/megatron/directory"
export MCORE_RELEASE_VERSION="0.13"
export CONTAINER_IMAGE="/path/to/container/image.sqsh"
export CLUSTER="your_cluster_name"
```

- `WANDB_API_KEY`: Your Weights & Biases API key for experiment tracking. Get your key from wandb.ai/authorize.
- `MEGATRON_PATH`: Absolute path to your Megatron-MoE installation directory. Example: `path/to/Megatron-LM`
- `MCORE_RELEASE_VERSION`: Version of Megatron-Core to use. Currently recommended: `0.13`
- `CONTAINER_IMAGE`: Path to the container image file (`.sqsh`). Example: `path/to/container/image.sqsh`
- `CLUSTER`: Name of your cluster environment (e.g., `EOS`, `CW`).
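For example, a fully filled-in setup might look like the following; every value is an illustrative placeholder to substitute with your own:

```bash
# Illustrative values only -- replace with your own
export WANDB_API_KEY="0123456789abcdef0123456789abcdef01234567"
export MEGATRON_PATH="$HOME/workspace/Megatron-LM"
export MCORE_RELEASE_VERSION="0.13"
export CONTAINER_IMAGE="$HOME/images/megatron-moe.sqsh"
export CLUSTER="EOS"
```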
For performance benchmarking, you can launch scripts either with `sbatch` via `sbatch_benchmarking.sh` or on an interactive node via `interactive_benchmarking.sh` (see the sketch after the variable list below).
- `MODEL`: A required environment variable that must be set in your script or command. Predefined models include `Mixtral-8x2B`, `Mixtral-8x7B`, `Mixtral-8x22B`, `DeepSeek-V2`, `DeepSeek-V2-Lite`, `DeepSeek-V3`, `DeepSeek-V3-Lite`, and `Qwen2-57B-A14B`.
- `CLUSTER`, `MCORE_RELEASE_VERSION`, and `MEGATRON_PATH`: These required variables must be defined in your script or command for proper execution.
- `CONTAINER_IMAGE`: Also required; the path to your container image file (`.sqsh`), as set during Environment Setup.
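Below is a minimal launch sketch under those requirements. It assumes `interactive_benchmarking.sh` accepts the same variables as the sbatch script; the model name comes from the predefined list, and all paths and the cluster name are placeholders:

```bash
# Submit a benchmarking job through Slurm
MODEL=Mixtral-8x7B CLUSTER=EOS MCORE_RELEASE_VERSION=0.13 \
  MEGATRON_PATH=/path/to/Megatron-LM CONTAINER_IMAGE=/path/to/image.sqsh \
  bash ./sbatch_benchmarking.sh

# Or run the same configuration on an interactive node
MODEL=Mixtral-8x7B bash ./interactive_benchmarking.sh
```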
Using WandB for Experiment Tracking
- To use WandB for experiment tracking, set `WANDB_API_KEY` with your key from wandb.ai/authorize. It is highly recommended to add `export WANDB_API_KEY="your_own_wandb_api_key"` to your `~/.bashrc`.
- If you do not wish to use WandB, comment out the following lines in your model's `.yaml` configuration file:
  ```yaml
  # --wandb-project: wandb_project_name
  # --wandb-exp-name: wandb_experiment_name
  ```
All model-specific runner configurations can be adjusted through `runtime_configs/benchmarking/runtime.conf` or via the benchmarking command.
Available model-specific runner configurations:

- Parallel Mappings: `TP`, `PP`, `EP`, `CP`, `VPP`, `PP_FIRST`, `PP_LAST`, and `LAYERS_PER_VP`
- Batch Sizes: `MBS` and `GBS`
- Model Architecture: `NUM_LAYERS`
- MoE Configurations: `MOE_TOKEN_DISPATCHER`, `MOE_GROUPED_GEMM`, and `--moe-extended-ep`
- Training Configurations: `NNODES`, `RUN_TIME`, and `PRETRAIN`. Note that specifying a shorter run time may improve your job's priority in the Slurm queue.
- Data Configurations: `SEQ_LEN` and `DATASET`
All available optimal configurations are listed in `runtime_configs/benchmarking/runtime.conf`.
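As an illustration, a handful of these knobs can also be overridden directly on the launch command; the values below are illustrative placeholders, not tuned settings:

```bash
# Override runner configurations at launch time
MODEL=Mixtral-8x7B TP=1 PP=4 EP=8 MBS=1 GBS=256 SEQ_LEN=4096 NNODES=8 \
  bash ./sbatch_benchmarking.sh
```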
All cluster configurations can be customized either through `cluster_configs/benchmarking/your_own_cluster.conf` or via the benchmarking command. For guidance on creating your own cluster configurations, refer to the template provided in `cluster_configs/benchmarking/template.conf`; a sketch follows the list below.
- Required Cluster-Specific Slurm Settings: `ACCOUNT`, `PARTITION`, `RUN_NAME`, and `CONTAINER_MOUNTS`
- Required Cluster-Specific Paths: `OUTPUT_PATH`, `DATA_PATH`, `TOKENIZER_MODEL`, and `LOAD_PATH`
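As a sketch only: assuming the `.conf` files use shell-style assignments (check `template.conf` for the authoritative format), a custom cluster file might look like this, with every value a placeholder:

```bash
# Hypothetical cluster_configs/benchmarking/my_cluster.conf
# Slurm settings
ACCOUNT="my_slurm_account"
PARTITION="batch"
RUN_NAME="moe-benchmark"
CONTAINER_MOUNTS="/lustre:/lustre"
# Cluster-specific paths
OUTPUT_PATH="/lustre/${USER}/megatron-moe/output"
DATA_PATH="/lustre/datasets/my_corpus_text_document"
TOKENIZER_MODEL="/lustre/tokenizers/tokenizer.model"
LOAD_PATH="/lustre/checkpoints/initial"
```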
- To benchmark a model from scratch with preconfigured parameters:
  ```bash
  # Example for DeepSeek-V3
  MODEL=DeepSeek-V3 bash ./sbatch_benchmarking.sh
  ```
- To train a model with custom parameters:
  ```bash
  # Example for DeepSeek-V3
  MODEL=DeepSeek-V3 TP=2 PP=8 EP=64 VPP=1 PP_FIRST=8 PP_LAST=5 RUN_TIME=00:60:00 NNODES=64 \
    bash sbatch_benchmarking.sh --recompute-granularity selective --recompute-modules mla_up_proj layernorm
  ```
- To monitor your jobs, use `squeue -u $USER` for a one-time status check or `watch -n 1 squeue -u $USER` for continuous monitoring (see the snippet below). For detailed logging, refer to the WandB dashboard.
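For copy-paste convenience, the monitoring commands from above:

```bash
# One-time look at your queued and running jobs
squeue -u $USER

# Refresh the view every second
watch -n 1 squeue -u $USER
```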
> Please try MBridge and Megatron-Bridge for better HF<->MCore conversion support.
Download the DeepSeek-V3 checkpoint from HuggingFace:
```bash
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-V3
```

The downloaded checkpoint is in FP8 format. Run the following command to convert it to BF16 format, using this script:

```bash
python inference/fp8_cast_bf16.py --input-fp8-hf-path /your/input/fp8/hf/path --output-bf16-hf-path /your/output/bf16/hf/path
```

To convert the BF16 HuggingFace checkpoint to a Megatron legacy checkpoint, execute the following command:

```bash
# Example for DeepSeek-V3
MODEL=DeepSeek-V3 bash ./ckpt_convert_scripts/DeepSeek-V3/convert_deepseek_v3.sh
```

Finally, run this command to convert the legacy checkpoint into a distributed checkpoint:

```bash
MODEL=DeepSeek-V3 TP=1 PP=4 EP=64 VPP=1 PP_FIRST=16 PP_LAST=13 NNODES=32 LOAD_PATH=/path/to/legacy/checkpoint \
  bash ./sbatch_benchmarking.sh --ckpt-convert-save /path/to/save/distributed/checkpoint --ckpt-convert-format torch_dist --no-save-optim
```

For reference, after conversion, the legacy checkpoint is approximately 3.4 TB and the distributed checkpoint is about 1.4 TB.
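As a quick sanity check on those sizes, `du` can report the on-disk footprint of each checkpoint; the paths are the placeholders from the commands above:

```bash
# Total on-disk size of the legacy and distributed checkpoints
du -sh /path/to/legacy/checkpoint
du -sh /path/to/save/distributed/checkpoint
```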