# Megatron MoE Testing Guide

Built on the powerful Megatron-Core framework, this guide delivers detailed instructions and best practices for testing cutting-edge Mixtral, DeepSeek, and Qwen series models. By following this guide, users can ensure optimal performance and reliability, unlocking the full potential of these models.

## Table of Contents

- 0. Container Setup
- 1. Environment Setup
- 2. Performance Benchmarking
- 3. DeepSeek Checkpoint Conversion
Dockerfile: `dockers/Dockerfile`.
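The container can be built with standard Docker tooling; a minimal sketch is shown below (the image tag is an arbitrary placeholder):

```bash
# Build the benchmarking container from the provided Dockerfile.
# "megatron-moe-benchmark" is only a placeholder tag.
docker build -f dockers/Dockerfile -t megatron-moe-benchmark .
```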
Before entering the container, install the `yq` package so that the model configuration `.yaml` files can be processed. Follow these steps (a quick verification command is shown after the list):
- Create a local `bin` directory (`~/.local/bin`) if it doesn't already exist:

  ```bash
  mkdir -p ~/.local/bin
  ```

- Download the `yq` executable to the newly created directory:

  ```bash
  wget https://github.com/mikefarah/yq/releases/download/v4.27.5/yq_linux_amd64 -O ~/.local/bin/yq
  ```

- Grant execution permissions to the `yq` executable:

  ```bash
  chmod +x ~/.local/bin/yq
  ```

- Edit your `~/.bashrc` file and append the following line to include `~/.local/bin` in your system's `PATH`:

  ```bash
  export PATH="$HOME/.local/bin:$PATH"
  ```

- Apply the changes by sourcing your shell configuration file:

  ```bash
  source ~/.bashrc
  ```
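To confirm that `yq` is installed correctly, point it at one of the preconfigured model configs; the file name below is only an assumption about the `model_configs/benchmarking` layout described later in this guide:

```bash
# Verify that yq is on PATH and can parse a preconfigured model config.
yq --version
yq '.' megatron-moe-scripts/model_configs/benchmarking/DeepSeek-V2-Lite.yaml
```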
For performance benchmarking, you can launch scripts either through Slurm batch submission via `sbatch_benchmarking.sh` or on an interactive node via `interactive_benchmarking.sh`.
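For example, assuming the predefined Mixtral-8x7B configuration (and that both scripts accept the same `MODEL=` prefix):

```bash
# Submit the benchmark as a Slurm batch job
MODEL=Mixtral-8x7B bash ./sbatch_benchmarking.sh

# Or run it directly on an interactive node
MODEL=Mixtral-8x7B bash ./interactive_benchmarking.sh
```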
- Environment Variable `MODEL`:
  The `MODEL` environment variable is required and must be explicitly defined in either the benchmarking script or the benchmarking command. The predefined models include `Mixtral-8x2B`, `Mixtral-8x7B`, `Mixtral-8x22B`, `DeepSeek-V2`, `DeepSeek-V2-Lite`, `DeepSeek-V3`, and `Qwen2-57B-A14B`.

- Environment Variables `CLUSTER`, `MCORE_RELEASE_VERSION`, and `MEGATRON_PATH`:
  These variables are required and must be explicitly defined in either the benchmarking script or the benchmarking command to ensure proper execution (an example launch command defining these variables inline is sketched after this list).

- Environment Variable `CONTAINER_IMAGE`:
  - The `CONTAINER_IMAGE` environment variable must be updated to either the path of your local container image or a Docker URL.
  - When using GitLab Docker images, ensure that the port number is removed from the URL.
  - To import a container image into a local `.sqsh` file, use the following command:

    ```bash
    enroot import -o ./IMAGE.sqsh docker://[USER@][REGISTRY#]IMAGE[:TAG]
    ```

  - For more details, please refer to the enroot documentation.

- Using WandB for Experiment Tracking:
  To use WandB for experiment tracking, replace `WANDB_API_KEY` with your own key from https://wandb.ai/authorize. It is highly recommended to add `export WANDB_API_KEY="your_own_wandb_api_key"` to your `~/.bashrc`.
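As a minimal sketch, the required variables can be supplied inline when launching; the cluster name, paths, and version below are placeholders:

```bash
# Define the required environment variables inline at launch time.
MODEL=DeepSeek-V2-Lite \
CLUSTER=your_cluster \
MCORE_RELEASE_VERSION=0.11.0 \
MEGATRON_PATH=/path/to/Megatron-LM \
CONTAINER_IMAGE=/path/to/IMAGE.sqsh \
bash ./sbatch_benchmarking.sh
```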
All common configurations can be adjusted either through `runtime_configs/benchmarking/common.conf` or via the benchmarking command.
- Environment Variable `TRAINING_PARAMS_PATH`:
  To streamline the performance benchmarking process, we provide preconfigured `.yaml` files for several commonly used MoE models, including Mixtral-8x2B, Mixtral-8x7B, Mixtral-8x22B, DeepSeek-V2, DeepSeek-V2-Lite, DeepSeek-V3, and Qwen2-57B-A14B. These files, located in the `megatron-moe-scripts/model_configs/benchmarking` directory, contain all the necessary configurations for the models.

- Environment Variable `COMMENT`:
  To append a comment to your `wandb-exp-name` to distinguish it from other WandB experiments, set `COMMENT` accordingly.

- Environment Variable `PROFILE`:
  To profile the training process, set `PROFILE=1` when executing the benchmarking scripts.

- Environment Variable `PR`:
  For benchmarking with the `fp8` data type, set `PR=fp8` when launching the benchmarking scripts. Ensure that the container image is installed with TE version 1.7.0 or higher; for TE versions lower than 1.7.0, only the attention layer will be computed in `fp8` (a combined example of these options follows this list).
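For example, `PR`, `PROFILE`, and `COMMENT` can be combined in a single launch (the comment string is arbitrary):

```bash
# fp8 benchmarking with profiling enabled and a tag appended to the WandB experiment name
MODEL=Mixtral-8x7B PR=fp8 PROFILE=1 COMMENT=fp8-profiling bash ./sbatch_benchmarking.sh
```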
All model-specific configurations can be adjusted either through `runtime_configs/benchmarking/runtime.conf` or via the benchmarking command.
- Available Model-Specific Configurations:
  - Parallel Mappings: `TP`, `PP`, `EP`, `CP`, `VPP`, `PP_FIRST`, `PP_LAST`, and `LAYERS_PER_VP`.
  - Batch Sizes: `MBS` and `GBS`.
  - Model architecture: `NUM_LAYERS`.
  - MoE configurations: `MOE_TOKEN_DISPATCHER`, `MOE_GROUPED_GEMM`, and `--moe-extended-ep`.
  - Training configurations: `NNODES`, `RUN_TIME`, and `PRETRAIN`. Note that specifying a shorter running time may improve your job's priority in the Slurm queue.
  - Data configurations: `SEQ_LEN` and `DATASET`.
- Preconfigured Benchmarking Models:
  | Model | TP | PP | EP | CP | VPP | MBS | GBS | LAYERS | DISPATCHER | GROUPED_GEMM | NODES | RUN_TIME | PRETRAIN | SLEN | DATASET | PP_FIRST | PP_LAST |
  |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
  | Mixtral-8x2B | 1 | 1 | 8 | 1 | 1 | 2 | 256 | 24 | alltoall | false | 8 | 00:20:00 | 1 | 4096 | Slimpajama | | |
  | Mixtral-8x7B | 1 | 4 | 8 | 1 | 8 | 1 | 256 | 32 | alltoall | true | 8 | 00:20:00 | 0 | 4096 | Slimpajama | | |
  | Mixtral-8x22B | 2 | 8 | 8 | 1 | 1 | 1 | 256 | 56 | alltoall | true | 16 | 00:20:00 | 0 | 4096 | Slimpajama | | |
  | DeepSeek-V2 | 1 | 16 | 8 | 1 | 2 | 1 | 1024 | 60 | alltoall | true | 32 | 00:20:00 | 0 | 4096 | Slimpajama | 2 | 2 |
  | DeepSeek-V2-Lite | 1 | 1 | 8 | 1 | 1 | 1 | 512 | 27 | alltoall | true | 1 | 00:20:00 | 0 | 4096 | Slimpajama | | |
  | DeepSeek-V3 | 1 | 16 | 64 | 1 | 1 | 1 | 8192 | 61 | flex | true | 128 | 00:20:00 | 0 | 4096 | Slimpajama | 4 | 1 |
  | Qwen2-57B-A14B | 2 | 4 | 4 | 1 | 7 | 1 | 256 | 28 | alltoall | true | 8 | 00:20:00 | 0 | 4096 | Slimpajama | | |
All cluster configurations can be customized either through `cluster_configs/benchmarking/your_own_cluster.conf` or via the benchmarking command. For guidance on creating your own cluster configurations, please refer to the template provided in `cluster_configs/benchmarking/template.conf`.
- Required cluster-specific Slurm settings: `ACCOUNT`, `PARTITION`, `RUN_NAME`, and `CONTAINER_MOUNTS`.
- Required cluster-specific paths: `OUTPUT_PATH`, `DATA_PATH`, `TOKENIZER_MODEL`, and `LOAD_PATH` (a configuration sketch follows below).
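A minimal sketch of a cluster configuration, assuming the same shell-variable format as `template.conf` (all values are placeholders):

```bash
# cluster_configs/benchmarking/your_own_cluster.conf (illustrative values only)
ACCOUNT=your_slurm_account
PARTITION=your_partition
RUN_NAME=moe-benchmarking
CONTAINER_MOUNTS="/lustre:/lustre"

OUTPUT_PATH=/path/to/benchmark/outputs
DATA_PATH=/path/to/Slimpajama/dataset
TOKENIZER_MODEL=/path/to/tokenizer.model
LOAD_PATH=/path/to/pretrained/checkpoints
```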
- To benchmark a model trained from scratch using the preconfigured parameters, execute the following command:

  ```bash
  # Example for DeepSeek-V2-Lite
  MODEL=DeepSeek-V2-Lite bash ./sbatch_benchmarking.sh
  ```

- To train a model using custom parameters, refer to the following command:

  ```bash
  # Example for DeepSeek-V2-Lite
  MODEL=DeepSeek-V2-Lite MCORE_RELEASE_VERSION=0.11.0 PR=bf16 PROFILE=1 TP=1 PP=1 EP=8 VPP=1 MBS=1 GBS=512 SEQ_LEN=4096 MOE_TOKEN_DISPATCHER=alltoall MOE_GROUPED_GEMM=true bash ./sbatch_benchmarking.sh --moe-extended-ep
  ```
- To monitor your jobs, use `squeue -u $USER` for a one-time status check, or `watch -n 1 squeue -u $USER` for continuous monitoring. For detailed logging information, refer to the WandB dashboard.
The `moe-permute-fusion` feature is currently compatible only with TE version 2.1.0 or higher. For TE versions lower than 2.1.0, please comment out the corresponding line in the preconfigured `.yaml` files:

```yaml
--moe-permute-fusion: true
```
For MLM-main, `tp-comm-overlap` can be enabled with a specially installed TE. Note that this feature currently only works for dense layer blocks (e.g., self-attention layers) and is not yet compatible with MoE layers.

To install TE with UserBuffer support, execute the following command:

```bash
NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install git+https://github.com/NVIDIA/TransformerEngine.git
```
- The DeepSeek checkpoint conversion scripts are designed to work with `TP=1`.

- Conversion to Distributed Checkpoints:
  By default, the scripts generate legacy checkpoints. To convert these to distributed checkpoints, follow these steps (a combined sketch is shown after this list):
  - First, convert to legacy checkpoints.
  - Modify the following line in your `.yaml` configuration file, located in the `megatron-moe-scripts/model_configs/benchmarking` directory: `--load: /path/to/legacy/checkpoint`
  - Add the following lines to your `.yaml` configuration file:

    ```yaml
    --ckpt-convert-save: /path/to/save/distributed/checkpoint
    --ckpt-convert-format: torch_dist
    ```

  - Run the benchmarking script once to complete the conversion.
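A minimal sketch of the second and third steps using `yq`, assuming the config is a flat YAML file of command-line flags (the file name and all paths are placeholders):

```bash
# Point the config at the legacy checkpoint and add the conversion flags.
CONFIG=megatron-moe-scripts/model_configs/benchmarking/DeepSeek-V2.yaml
yq -i '."--load" = "/path/to/legacy/checkpoint"' "$CONFIG"
yq -i '."--ckpt-convert-save" = "/path/to/save/distributed/checkpoint"' "$CONFIG"
yq -i '."--ckpt-convert-format" = "torch_dist"' "$CONFIG"
```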
- Download Checkpoint:
  Download the DeepSeek-V2 or DeepSeek-V2-Lite checkpoint from HuggingFace.

- Update Environment Variables:
  Update the following environment variables in `convert_deepseek_v2.sh`: `MODEL`, `MEGATRON_PATH`, `SOURCE_CKPT_PATH`, and `TARGET_CKPT_PATH` (illustrative values are sketched after this list).

- Run Conversion Script:
  Execute the conversion script using the following command:

  ```bash
  # Example for DeepSeek-V2
  MODEL=DeepSeek-V2 bash ./ckpt_convert_scripts/DeepSeek-V2/convert_deepseek_v2.sh

  # Example for DeepSeek-V2-Lite
  MODEL=DeepSeek-V2-Lite bash ./ckpt_convert_scripts/DeepSeek-V2/convert_deepseek_v2.sh
  ```
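A minimal sketch of the variables to edit inside `convert_deepseek_v2.sh` (all paths are placeholders):

```bash
# Values to set in convert_deepseek_v2.sh; every path is a placeholder.
MODEL=DeepSeek-V2
MEGATRON_PATH=/path/to/Megatron-LM
SOURCE_CKPT_PATH=/path/to/huggingface/DeepSeek-V2
TARGET_CKPT_PATH=/path/to/converted/mcore/DeepSeek-V2
```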
- Download Checkpoint:
  Download the DeepSeek-V3 checkpoint from HuggingFace.

- Update Environment Variables:
  Update the following environment variables in `convert_deepseek_v3.sh`: `MODEL`, `MEGATRON_PATH`, `SOURCE_CKPT_PATH`, and `TARGET_CKPT_PATH`.

- Run Conversion Script:
  Execute the conversion script using the following command:

  ```bash
  # Example for DeepSeek-V3
  MODEL=DeepSeek-V3 bash ./ckpt_convert_scripts/DeepSeek-V3/convert_deepseek_v3.sh
  ```