
Commit 2033904

Merge branch 'main' into siddharth/mamba-chunked-prefill-bugfix
2 parents bc66c18 + 3b83c3f

File tree

20 files changed: +273 -57 lines changed


README.md

Lines changed: 5 additions & 5 deletions
@@ -99,7 +99,7 @@ pip install --no-build-isolation .[mlm,dev]
 
 ```
 Megatron-LM/
-├── megatron/
+├── megatron/
 │   ├── core/                # Megatron Core (kernels, parallelism, building blocks)
 │   │   ├── models/          # Transformer models
 │   │   ├── transformer/     # Transformer building blocks
@@ -128,7 +128,7 @@ Megatron-LM/
 
 - **Training state-of-the-art foundation models** at scale with cutting-edge performance on latest NVIDIA hardware
 - **Research teams** exploring new architectures and training techniques
-- **Learning distributed training** concepts and best practices
+- **Learning distributed training** concepts and best practices
 - **Quick experimentation** with proven model configurations
 
 **What you get:**
@@ -137,7 +137,7 @@ Megatron-LM/
 - End-to-end examples from data prep to evaluation
 - Research-focused tools and utilities
 
-### Megatron Core: Composable Library
+### Megatron Core: Composable Library
 
 **Composable library** with GPU-optimized building blocks for custom training frameworks.
 
@@ -170,7 +170,7 @@ Megatron-LM/
 - **[Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge)** - Training library with bidirectional Hugging Face ↔ Megatron checkpoint conversion, flexible training loops, and production-ready recipes
 - **[NeMo RL](https://github.com/NVIDIA-NeMo/RL)** - Scalable toolkit for efficient reinforcement learning with RLHF, DPO, and other post-training methods
 - **[NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)** - Enterprise framework with cloud-native support and end-to-end examples
-- **[TensorRT Model Optimizer (ModelOpt)](https://github.com/NVIDIA/TensorRT-Model-Optimizer)** - Model optimization toolkit for quantization, pruning, and distillation
+- **[TensorRT Model Optimizer (ModelOpt)](https://github.com/NVIDIA/TensorRT-Model-Optimizer)** - Model optimization toolkit for quantization, pruning, distillation, speculative decoding, and more. Check out end-to-end examples in [examples/post_training/modelopt](./examples/post_training/modelopt/).
 
 **Compatible with:** [Hugging Face Accelerate](https://github.com/huggingface/accelerate), [Colossal-AI](https://github.com/hpcaitech/ColossalAI), [DeepSpeed](https://github.com/microsoft/DeepSpeed)
 
@@ -257,7 +257,7 @@ Our codebase efficiently trains models from 2B to 462B parameters across thousan
 **Benchmark Configuration:**
 
 - **Vocabulary size**: 131,072 tokens
-- **Sequence length**: 4096 tokens
+- **Sequence length**: 4096 tokens
 - **Model scaling**: Varied hidden size, attention heads, and layers to achieve target parameter counts
 - **Communication optimizations**: Fine-grained overlapping with DP (`--overlap-grad-reduce`, `--overlap-param-gather`), TP (`--tp-comm-overlap`), and PP (enabled by default)
 
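(Editorial aside, not part of this commit: the overlap flags above are ordinary Megatron-LM command-line arguments. Below is a heavily elided sketch of enabling them on a training launch; `torchrun` and `pretrain_gpt.py` are assumptions here, and the data/model arguments a real run requires are omitted.)

```sh
# Sketch only: enabling the communication overlaps named above.
# torchrun/pretrain_gpt.py are assumptions; the required data and model
# arguments are omitted and must be supplied for a real run.
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --tensor-model-parallel-size 8 \
    --overlap-grad-reduce \
    --overlap-param-gather \
    --tp-comm-overlap \
    "$@"
```
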
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+!slurm*
examples/post_training/modelopt/ADVANCED.md

Lines changed: 87 additions & 6 deletions
@@ -1,12 +1,93 @@
 <div align="center">
 
-# TensorRT Model Optimizer Integration Advanced Topics
+# Advanced Usage
 
-[Local Examples](#getting-started-in-a-local-environment) |
-[Configuration](#learn-more-about-configuration) |
-[Slurm Examples](ADVANCED.md#slurm-examples) |
-[Advanced Topics](ADVANCED.md) |
-[Megatron-LM Integration](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt)
+[Advanced Configuration](#advanced-configuration) |
+[Slurm Examples](#slurm-examples) |
+[Checkpoint Resume](#checkpoint-resume)
 
 </div>
 
+## Advanced Configuration
+
+### Understanding Configuration Variables
+
+For simplicity, we use `shell` scripts and variables as arguments. Each script takes at least one positional
+argument, `[model_conf]`; some scripts require more, such as `[qformat]` for quantization.
+
+```sh
+\
+HF_MODEL_CKPT=<pretrained_model_name_or_path> \
+bash quantize.sh [model_conf] [qformat]
+```
+
+> **❗ IMPORTANT:** `model_conf` is used to select the corresponding Megatron-LM `${MODEL_ARGS}`. For example,
+> `meta-llama/Llama-3.1-8B-Instruct` and `deepseek-ai/DeepSeek-R1` are both supported.
+>
+> Provide the pretrained checkpoint through the variable `${HF_MODEL_CKPT}` on the command line or
+> in a configuration shell script. More variables (e.g. `${TP}`, `${EP}`, ...) can also be passed on the
+> command line, but we recommend collecting all variables in a separate `shell` script.
+
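As a concrete instance of the pattern above (a sketch: the checkpoint path is a placeholder, and `nvfp4` stands in for whichever `[qformat]` you need):

```sh
# Hypothetical invocation; the path and qformat are placeholders.
HF_MODEL_CKPT=/models/meta-llama/Llama-3.1-8B-Instruct \
bash quantize.sh meta-llama/Llama-3.1-8B-Instruct nvfp4
```
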
+### Using Configuration Scripts
+
+When `${HF_MODEL_CKPT}` is not set on the command line, `./env_setup_template.sh` can be used
+to pass all variables instead. If you have your own script, point `${SANDBOX_ENV_SETUP}` to it.
+
+```sh
+\
+SANDBOX_ENV_SETUP=<path_to_your_script> \
+bash quantize.sh [model_conf] [qformat]
+```
+
+**For Slurm execution**, you **MUST** use `${SANDBOX_ENV_SETUP}` (default: `./env_setup_template.sh`).
+Other variables are not passed through `sbatch` and `srun` automatically.
+
+### Common Configuration Variables
+
+- `HF_MODEL_CKPT`: Path to the pretrained Hugging Face model checkpoint
+- `TP`: Tensor parallelism degree
+- `PP`: Pipeline parallelism degree
+- `EP`: Expert parallelism degree (for MoE models)
+- `ETP`: Expert tensor parallelism degree (for MoE models)
+- `MLM_MODEL_SAVE`: Path to save the Megatron-LM checkpoint
+- `MLM_MODEL_LOAD`: Path to load the Megatron-LM checkpoint
+- `MLM_EXTRA_ARGS`: Additional Megatron-LM arguments (e.g., for uneven PP); see the sketch after this list
+
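Tying these together: a configuration script (the file `${SANDBOX_ENV_SETUP}` points at) is just a shell fragment that assigns these variables. A minimal sketch with placeholder paths, modeled on the `conf/` scripts added by this commit:

```sh
#!/bin/bash
# Hypothetical ${SANDBOX_ENV_SETUP} script; all paths are placeholders.

HF_MODEL_CKPT=/workspace/scratch/meta-llama/Llama-3.1-8B-Instruct

TP=8   # tensor parallelism
PP=1   # pipeline parallelism

MLM_MODEL_SAVE=/workspace/scratch/Llama-3.1-8B-Instruct_quant

# Uncomment for uneven pipeline stages (values are illustrative):
# MLM_EXTRA_ARGS="--decoder-first-pipeline-num-layers 3"
```
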
+## Slurm Examples
+
+For models that require multiple nodes, our Megatron-LM example scripts also support Slurm through an
+`sbatch` wrapper. Start from the example `slurm/sbatch.sh` with minor modifications, or use your
+existing `sbatch` script.
+
+Unlike the local environment, variables can only be passed through a shell script
+(default: `env_setup_template.sh`); command-line variable passthrough is not supported.
+
+<br>
+
+### ⭐ BF16 Kimi-K2-Instruct EAGLE3 Training
+
+`conf/moonshotai/kimi_k2_instruct.sh` is a config that has been tested on 8 nodes of
+DGX H100 (TP=8, ETP=1, EP=64; 64 H100 GPUs in total). Update `HF_MODEL_CKPT` to the exact
+checkpoint path in the container to start:
+
+```sh
+export USER_FSW=<path_to_scratch_space>
+export CONTAINER_IMAGE=<path_to_container_image>
+export SANDBOX_ENV_SETUP=./conf/moonshotai/kimi_k2_instruct.sh
+sbatch --nodes=8 slurm/sbatch.sh "eagle3.sh moonshotai/Kimi-K2-Instruct"
+```
+
+To export the trained EAGLE3 model, switch to `kimi_k2_instruct_export.sh`.
+**We only support pipeline-parallel (PP) export.** In this case, 2 nodes are used
+(2 nodes × 8 GPUs = 16 pipeline stages, i.e. PP=16).
+
+```sh
+export USER_FSW=<path_to_scratch_space>
+export CONTAINER_IMAGE=<path_to_container_image>
+export SANDBOX_ENV_SETUP=./conf/moonshotai/kimi_k2_instruct_export.sh
+sbatch --nodes=2 slurm/sbatch.sh "export.sh moonshotai/Kimi-K2-Instruct"
+```
+
+## Checkpoint Resume
+
+WIP

examples/post_training/modelopt/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ ARG PIP_CONSTRAINT=
 
 WORKDIR /workspace/nmm-sandbox
 
-RUN pip install jsonlines omegaconf pulp torchprofile
+RUN pip install jsonlines omegaconf
 RUN pip install flask flask_restful fire nltk
 RUN pip install tiktoken blobfile
 
examples/post_training/modelopt/README.md

Lines changed: 36 additions & 35 deletions
@@ -5,22 +5,21 @@
 
 [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) |
 [Local Examples](#getting-started-in-a-local-environment) |
-[Configuration](ADVANCED.md#learn-more-about-configuration) |
-[Slurm Examples](ADVANCED.md#slurm-examples) |
-[Speculative Decoding](speculative.md) |
-[Advanced Topics](ADVANCED.md)
+[Configuration](./ADVANCED.md#advanced-configuration) |
+[Slurm Examples](./ADVANCED.md#slurm-examples) |
+[Speculative Decoding](./speculative.md) |
+[Advanced Topics](./ADVANCED.md)
 
 </div>
 
 [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (**ModelOpt**, `nvidia-modelopt`)
-provides end-to-end model optimization for
-NVIDIA hardware including quantization (real or simulated), sparsity, knowledge distillation, pruning,
-neural architecture search, and speulative decoding.
+provides end-to-end model optimization for NVIDIA hardware including quantization (real or simulated),
+knowledge distillation, pruning, speculative decoding, and more.
 
 
 ## Major Features
 
-- Start from Hugging Face pretrained model checkpoint with on-the-fly conversion.
+- Start from a Hugging Face pretrained model checkpoint with on-the-fly conversion to the Megatron-LM checkpoint format.
 - Support all kinds of model parallelism (TP, EP, ETP, PP).
 - Export to a TensorRT-LLM-, vLLM-, and SGLang-ready unified checkpoint.
 
@@ -46,6 +45,10 @@ pip install -U nvidia-modelopt
 Alternatively, you can install from [source](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
 to try our latest features.
 
+> **❗ IMPORTANT:** The first positional argument (e.g. `meta-llama/Llama-3.2-1B-Instruct`) of each script
+> is the config name used to match the supported model config in `conf/`. The pretrained HF checkpoint should
+> be downloaded and provided through `${HF_MODEL_CKPT}`.
+
 
 ### ⭐ NVFP4 Quantization, Quantization-Aware Training, and Model Export
 
@@ -58,7 +61,7 @@ provide `${EXPORT_DIR}` to `export.sh`.
 > low-precision numerical behavior (fake-quant) which can be run on GPUs with compute capability > 80.
 > Real low-precision parameters (e.g. `E4M3` or `E2M1`)
 > and low-precision compute (e.g. `FP8Linear`) are also supported depending on GPU compute capability.
-> **See [Adanvanced Topics](advanced.md) for details**.
+> **See [Advanced Topics](./ADVANCED.md) for details**.
 
 ```sh
 \
@@ -75,31 +78,6 @@ provide `${EXPORT_DIR}` to `export.sh`.
 ./export.sh meta-llama/Llama-3.2-1B-Instruct
 ```
 
-> **❗ IMPORTANT:** The first positional arugment (e.g. `meta-llama/Llama-3.2-1B-Instruct`) of each script
-> is the config name used to match the supported model config in `conf/`. The pretrained checkpoint should
-> be downloaded and provided through `${HF_MODEL_CKPT}`.
-
-Loading the saved distributed checkpoint, the quantized Megatron model can be resumed for inference
-(generate or evaluate) or training (SFT or PEFT). To read more about these features, see
-[Adanvanced Topics](advanced.md). To learn more about the design, see our [Design]() document [WIP].
-
-```sh
-\
-TP=1 \
-MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
-./generate.sh meta-llama/Llama-3.2-1B-Instruct
-
-\
-TP=1 \
-MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
-./mmlu.sh meta-llama/Llama-3.2-1B-Instruct
-
-\
-TP=1 \
-MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
-./finetune.sh meta-llama/Llama-3.2-1B-Instruct
-```
-
 ### ⭐ Online BF16 EAGLE3 Training
 
 Online EAGLE3 training has both the target (frozen) and draft models in the memory where the `hidden_states`
@@ -122,7 +100,7 @@ deployment.
 ./export.sh meta-llama/Llama-3.2-1B-Instruct
 ```
 
-See [Adanvanced Topics](ADVANCED.md) for a `moonshotai/Kimi-K2-Instruct` EAGLE3 training example using `slurm`.
+See [Advanced Topics](./ADVANCED.md) for a `moonshotai/Kimi-K2-Instruct` EAGLE3 training example using `slurm`.
 
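(Editorial note: the hunk above elides the local EAGLE3 command itself. Judging from the calling convention of the other scripts in this directory and the `eagle3.sh` usage in the Slurm example, a local run presumably looks like the sketch below; the TP value and checkpoint path are placeholders.)

```sh
# Hypothetical local EAGLE3 training run; TP and the path are placeholders.
TP=1 \
HF_MODEL_CKPT=/models/meta-llama/Llama-3.2-1B-Instruct \
bash eagle3.sh meta-llama/Llama-3.2-1B-Instruct
```
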
 ### ⭐ Pruning
 
@@ -165,5 +143,28 @@ MLM_MODEL_SAVE=Qwen3-8B-Pruned \
 > default `conf/` by setting `MLM_EXTRA_ARGS`. E.g., for loading the above pruned Qwen3-8B checkpoint for MMLU, set:
 > `MLM_EXTRA_ARGS="--num-layers 24"`
 
+### ⭐ Inference and Training
+
+The saved Megatron-LM distributed checkpoint (the output of the above scripts) can be resumed for inference
+(generate or evaluate) or training (SFT or PEFT). To read more about these features, see
+[Advanced Topics](./ADVANCED.md).
+
+```sh
+\
+TP=1 \
+MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
+./generate.sh meta-llama/Llama-3.2-1B-Instruct
+
+\
+TP=1 \
+MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
+./mmlu.sh meta-llama/Llama-3.2-1B-Instruct
+
+\
+TP=1 \
+MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
+./finetune.sh meta-llama/Llama-3.2-1B-Instruct
+```
+
 ## Advanced Usage
 TBD
examples/post_training/modelopt/conf/moonshotai/kimi_k2_instruct.sh

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+#!/bin/bash
+
+HF_MODEL_CKPT=/workspace/scratch/moonshotai/Kimi-K2-Instruct
+TP=8
+ETP=1
+EP=64
+
examples/post_training/modelopt/conf/moonshotai/kimi_k2_instruct_export.sh

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+#!/bin/bash
+
+HF_MODEL_CKPT=/workspace/scratch/moonshotai/Kimi-K2-Instruct
+
+MLM_EXTRA_ARGS=" \
+    --decoder-first-pipeline-num-layers 3 \
+    --decoder-last-pipeline-num-layers 2 \
+    --init-model-with-meta-device \
+    --use-cpu-initialization \
+
+"
+
+# Layer distribution over the PP=16 stages: 3, [4] * 14, 2 (3 + 14*4 + 2 = 61 layers).
+PP=16
+
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+#!/bin/bash
+
+HF_MODEL_CKPT=/workspace/scratch/meta-llama/Llama-3.2-1B-Instruct
+TP=1
+ETP=1
+EP=1
+PP=1
examples/post_training/modelopt/slurm/sbatch.sh

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+#!/bin/bash
+
+#SBATCH -A <account>
+#SBATCH -p <partition>
+#SBATCH --job-name=<job-name>
+#SBATCH --nodes=1 --ntasks-per-node=8 --gpus-per-node=8
+#SBATCH -t 04:00:00
+#SBATCH --exclusive --mem=0 --overcommit
+
+# Bash coloring
+RED='\033[0;31m'
+YELLOW='\033[0;33m'
+GREEN='\033[0;32m'
+BLUE='\033[0;34m'
+PURPLE='\033[0;35m'
+WHITE='\033[0;37m'
+
+# Predefined logging
+MLM_ERROR="${RED}ERROR: ${WHITE}"
+MLM_WARNING="${YELLOW}WARNING:${WHITE}"
+
+# CHANGE THE FOLLOWING TO YOUR DATA, MEGATRON, and CHECKPOINT DIR
+if [[ -z ${USER_FSW} ]]; then
+    printf "${MLM_ERROR} Variable USER_FSW (read/write scratch space) must be set!\n"
+    exit 1
+fi
+
+if [ -z ${SANDBOX_DIR} ]; then
+    SANDBOX_DIR="$(pwd)"
+    printf "${MLM_WARNING} Variable SANDBOX_DIR not set! (default: ${SANDBOX_DIR})\n"
+fi
+
+if [ -z ${SANDBOX_ENV_SETUP} ]; then
+    SANDBOX_ENV_SETUP=./env_setup_template.sh
+    printf "${MLM_WARNING} Variable SANDBOX_ENV_SETUP not set! (default: ${SANDBOX_ENV_SETUP})\n"
+fi
+
+if [ -z ${CONTAINER_IMAGE} ]; then
+    CONTAINER_IMAGE="nvidia-modelopt-megatron:latest"
+    printf "${MLM_WARNING} Variable CONTAINER_IMAGE not set! (default: ${CONTAINER_IMAGE})\n"
+fi
+
+if [ -z ${LAUNCH_SCRIPT} ]; then
+    LAUNCH_SCRIPT="python"
+    printf "${MLM_WARNING} Variable LAUNCH_SCRIPT not set! (default: ${LAUNCH_SCRIPT})\n"
+fi
+
+# DO NOT MODIFY THE VALUES BELOW UNLESS YOU KNOW WHAT YOU ARE DOING!!!
+DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'`
+
+CONTAINER_MOUNT="${SANDBOX_DIR}:/workspace/nmm-sandbox,${USER_FSW}:/workspace/scratch"
+
+srun -l \
+    --mpi=pmix \
+    --output=%x_%j_$DATETIME.log \
+    --container-image ${CONTAINER_IMAGE} \
+    --container-workdir "/workspace/nmm-sandbox" \
+    --container-mounts ${CONTAINER_MOUNT} \
+    --export "HF_MODEL_CKPT=${HF_MODEL_CKPT},SANDBOX_ENV_SETUP=${SANDBOX_ENV_SETUP},LAUNCH_SCRIPT=${LAUNCH_SCRIPT}" \
+    bash ${1}
+
+set +x
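(Usage note, editorial: the wrapper mounts `${SANDBOX_DIR}` at `/workspace/nmm-sandbox` and `${USER_FSW}` at `/workspace/scratch`, which is why the `conf/` scripts above reference checkpoints under `/workspace/scratch/...`. A submission sketch with placeholder host paths:)

```sh
# Hypothetical submission; host paths are placeholders.
# ${USER_FSW}/moonshotai/Kimi-K2-Instruct on the host appears as
# /workspace/scratch/moonshotai/Kimi-K2-Instruct inside the container.
export USER_FSW=/lustre/$USER/scratch
export CONTAINER_IMAGE=nvidia-modelopt-megatron:latest
export SANDBOX_ENV_SETUP=./conf/moonshotai/kimi_k2_instruct.sh
sbatch --nodes=8 slurm/sbatch.sh "eagle3.sh moonshotai/Kimi-K2-Instruct"
```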
