
Commit 2033904

Merge branch 'main' into siddharth/mamba-chunked-prefill-bugfix
2 parents bc66c18 + 3b83c3f

File tree

20 files changed: +273 -57 lines changed


README.md

Lines changed: 5 additions & 5 deletions
@@ -99,7 +99,7 @@ pip install --no-build-isolation .[mlm,dev]
 
 ```
 Megatron-LM/
-├── megatron/
+├── megatron/
 │   ├── core/                # Megatron Core (kernels, parallelism, building blocks)
 │   │   ├── models/          # Transformer models
 │   │   ├── transformer/     # Transformer building blocks
@@ -128,7 +128,7 @@ Megatron-LM/
 
 - **Training state-of-the-art foundation models** at scale with cutting-edge performance on latest NVIDIA hardware
 - **Research teams** exploring new architectures and training techniques
-- **Learning distributed training** concepts and best practices
+- **Learning distributed training** concepts and best practices
 - **Quick experimentation** with proven model configurations
 
 **What you get:**
@@ -137,7 +137,7 @@ Megatron-LM/
 - End-to-end examples from data prep to evaluation
 - Research-focused tools and utilities
 
-### Megatron Core: Composable Library
+### Megatron Core: Composable Library
 
 **Composable library** with GPU-optimized building blocks for custom training frameworks.
 
@@ -170,7 +170,7 @@ Megatron-LM/
 - **[Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge)** - Training library with bidirectional Hugging Face ↔ Megatron checkpoint conversion, flexible training loops, and production-ready recipes
 - **[NeMo RL](https://github.com/NVIDIA-NeMo/RL)** - Scalable toolkit for efficient reinforcement learning with RLHF, DPO, and other post-training methods
 - **[NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)** - Enterprise framework with cloud-native support and end-to-end examples
-- **[TensorRT Model Optimizer (ModelOpt)](https://github.com/NVIDIA/TensorRT-Model-Optimizer)** - Model optimization toolkit for quantization, pruning, and distillation
+- **[TensorRT Model Optimizer (ModelOpt)](https://github.com/NVIDIA/TensorRT-Model-Optimizer)** - Model optimization toolkit for quantization, pruning, distillation, speculative decoding, and more. Check out end-to-end examples in [examples/post_training/modelopt](./examples/post_training/modelopt/).
 
 **Compatible with:** [Hugging Face Accelerate](https://github.com/huggingface/accelerate), [Colossal-AI](https://github.com/hpcaitech/ColossalAI), [DeepSpeed](https://github.com/microsoft/DeepSpeed)
 
@@ -257,7 +257,7 @@ Our codebase efficiently trains models from 2B to 462B parameters across thousan
 **Benchmark Configuration:**
 
 - **Vocabulary size**: 131,072 tokens
-- **Sequence length**: 4096 tokens
+- **Sequence length**: 4096 tokens
 - **Model scaling**: Varied hidden size, attention heads, and layers to achieve target parameter counts
 - **Communication optimizations**: Fine-grained overlapping with DP (`--overlap-grad-reduce`, `--overlap-param-gather`), TP (`--tp-comm-overlap`), and PP (enabled by default)
 
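(Editorial aside, not part of this commit: the overlap flags above are ordinary Megatron-LM command-line arguments. Below is a heavily elided sketch of enabling them on a training launch; `torchrun` and `pretrain_gpt.py` are assumptions here, and the data/model arguments a real run requires are omitted.)

```sh
# Sketch only: enabling the communication overlaps named above.
# torchrun/pretrain_gpt.py are assumptions; the required data and model
# arguments are omitted and must be supplied for a real run.
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --tensor-model-parallel-size 8 \
    --overlap-grad-reduce \
    --overlap-param-gather \
    --tp-comm-overlap \
    "$@"
```
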
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+!slurm*
examples/post_training/modelopt/ADVANCED.md

Lines changed: 87 additions & 6 deletions
@@ -1,12 +1,93 @@
 <div align="center">
 
-# TensorRT Model Optimizer Integration Advanced Topics
+# Advanced Usage
 
-[Local Examples](#getting-started-in-a-local-environment) |
-[Configuration](#learn-more-about-configuration) |
-[Slurm Examples](ADVANCED.md#slurm-examples) |
-[Advanced Topics](ADVANCED.md) |
-[Megatron-LM Integration](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt)
+[Advanced Configuration](#advanced-configuration) |
+[Slurm Examples](#slurm-examples) |
+[Checkpoint Resume](#checkpoint-resume)
 
 </div>
 
+## Advanced Configuration
+
+### Understanding Configuration Variables
+
+For simplicity, we use `shell` scripts and variables as arguments. Each script takes at least one positional
+argument, `[model_conf]`; some scripts require more, such as `[qformat]` for quantization.
+
+```sh
+\
+HF_MODEL_CKPT=<pretrained_model_name_or_path> \
+bash quantize.sh [model_conf] [qformat]
+```
+
+> **❗ IMPORTANT:** `model_conf` is used to select the corresponding Megatron-LM `${MODEL_ARGS}`. For example,
+> `meta-llama/Llama-3.1-8B-Instruct` and `deepseek-ai/DeepSeek-R1` are both supported.
+>
+> Provide the pretrained checkpoint through the variable `${HF_MODEL_CKPT}` on the command line or
+> in a configuration shell script. More variables (e.g. `${TP}`, `${EP}`, ...) can also be passed on the
+> command line, but we recommend collecting all variables in a separate `shell` script.
+
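As a concrete instance of the pattern above (a sketch: the checkpoint path is a placeholder, and `nvfp4` stands in for whichever `[qformat]` you need):

```sh
# Hypothetical invocation; the path and qformat are placeholders.
HF_MODEL_CKPT=/models/meta-llama/Llama-3.1-8B-Instruct \
bash quantize.sh meta-llama/Llama-3.1-8B-Instruct nvfp4
```
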
+### Using Configuration Scripts
+
+When `${HF_MODEL_CKPT}` is not set on the command line, `./env_setup_template.sh` can be used
+to pass all variables instead. If you have your own script, point `${SANDBOX_ENV_SETUP}` to it.
+
+```sh
+\
+SANDBOX_ENV_SETUP=<path_to_your_script> \
+bash quantize.sh [model_conf] [qformat]
+```
+
+**For Slurm execution**, you **MUST** use `${SANDBOX_ENV_SETUP}` (default: `./env_setup_template.sh`).
+Other variables are not passed through `sbatch` and `srun` automatically.
+
+### Common Configuration Variables
+
+- `HF_MODEL_CKPT`: Path to the pretrained Hugging Face model checkpoint
+- `TP`: Tensor parallelism degree
+- `PP`: Pipeline parallelism degree
+- `EP`: Expert parallelism degree (for MoE models)
+- `ETP`: Expert tensor parallelism degree (for MoE models)
+- `MLM_MODEL_SAVE`: Path to save the Megatron-LM checkpoint
+- `MLM_MODEL_LOAD`: Path to load the Megatron-LM checkpoint
+- `MLM_EXTRA_ARGS`: Additional Megatron-LM arguments (e.g., for uneven PP); see the sketch after this list
+
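Tying these together: a configuration script (the file `${SANDBOX_ENV_SETUP}` points at) is just a shell fragment that assigns these variables. A minimal sketch with placeholder paths, modeled on the `conf/` scripts added by this commit:

```sh
#!/bin/bash
# Hypothetical ${SANDBOX_ENV_SETUP} script; all paths are placeholders.

HF_MODEL_CKPT=/workspace/scratch/meta-llama/Llama-3.1-8B-Instruct

TP=8   # tensor parallelism
PP=1   # pipeline parallelism

MLM_MODEL_SAVE=/workspace/scratch/Llama-3.1-8B-Instruct_quant

# Uncomment for uneven pipeline stages (values are illustrative):
# MLM_EXTRA_ARGS="--decoder-first-pipeline-num-layers 3"
```
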
+## Slurm Examples
+
+For models that require multiple nodes, our Megatron-LM example scripts also support Slurm through an
+`sbatch` wrapper. Start from the example `slurm/sbatch.sh` with minor modifications, or use your
+existing `sbatch` script.
+
+Unlike the local environment, variables can only be passed through a shell script
+(default: `env_setup_template.sh`); command-line variable passthrough is not supported.
+
+<br>
+
+### ⭐ BF16 Kimi-K2-Instruct EAGLE3 Training
+
+`conf/moonshotai/kimi_k2_instruct.sh` is a config that has been tested on 8 nodes of
+DGX H100 (TP=8, ETP=1, EP=64; 64 H100 GPUs in total). Update `HF_MODEL_CKPT` to the exact
+checkpoint path in the container to start:
+
+```sh
+export USER_FSW=<path_to_scratch_space>
+export CONTAINER_IMAGE=<path_to_container_image>
+export SANDBOX_ENV_SETUP=./conf/moonshotai/kimi_k2_instruct.sh
+sbatch --nodes=8 slurm/sbatch.sh "eagle3.sh moonshotai/Kimi-K2-Instruct"
+```
+
+To export the trained EAGLE3 model, switch to `kimi_k2_instruct_export.sh`.
+**We only support pipeline-parallel (PP) export.** In this case, 2 nodes are used
+(2 nodes × 8 GPUs = 16 pipeline stages, i.e. PP=16).
+
+```sh
+export USER_FSW=<path_to_scratch_space>
+export CONTAINER_IMAGE=<path_to_container_image>
+export SANDBOX_ENV_SETUP=./conf/moonshotai/kimi_k2_instruct_export.sh
+sbatch --nodes=2 slurm/sbatch.sh "export.sh moonshotai/Kimi-K2-Instruct"
+```
+
+## Checkpoint Resume
+
+WIP

examples/post_training/modelopt/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ ARG PIP_CONSTRAINT=
 
 WORKDIR /workspace/nmm-sandbox
 
-RUN pip install jsonlines omegaconf pulp torchprofile
+RUN pip install jsonlines omegaconf
 RUN pip install flask flask_restful fire nltk
 RUN pip install tiktoken blobfile
 
examples/post_training/modelopt/README.md

Lines changed: 36 additions & 35 deletions
@@ -5,22 +5,21 @@
 
 [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) |
 [Local Examples](#getting-started-in-a-local-environment) |
-[Configuration](ADVANCED.md#learn-more-about-configuration) |
-[Slurm Examples](ADVANCED.md#slurm-examples) |
-[Speculative Decoding](speculative.md) |
-[Advanced Topics](ADVANCED.md)
+[Configuration](./ADVANCED.md#advanced-configuration) |
+[Slurm Examples](./ADVANCED.md#slurm-examples) |
+[Speculative Decoding](./speculative.md) |
+[Advanced Topics](./ADVANCED.md)
 
 </div>
 
 [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (**ModelOpt**, `nvidia-modelopt`)
-provides end-to-end model optimization for
-NVIDIA hardware including quantization (real or simulated), sparsity, knowledge distillation, pruning,
-neural architecture search, and speulative decoding.
+provides end-to-end model optimization for NVIDIA hardware including quantization (real or simulated),
+knowledge distillation, pruning, speculative decoding, and more.
 
 
 ## Major Features
 
-- Start from Hugging Face pretrained model checkpoint with on-the-fly conversion.
+- Start from a Hugging Face pretrained model checkpoint with on-the-fly conversion to the Megatron-LM checkpoint format.
 - Support all kinds of model parallelism (TP, EP, ETP, PP).
 - Export to a TensorRT-LLM-, vLLM-, and SGLang-ready unified checkpoint.
 
@@ -46,6 +45,10 @@ pip install -U nvidia-modelopt
 Alternatively, you can install from [source](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
 to try our latest features.
 
+> **❗ IMPORTANT:** The first positional argument (e.g. `meta-llama/Llama-3.2-1B-Instruct`) of each script
+> is the config name used to match the supported model config in `conf/`. The pretrained HF checkpoint should
+> be downloaded and provided through `${HF_MODEL_CKPT}`.
+
 
 ### ⭐ NVFP4 Quantization, Quantization-Aware Training, and Model Export
 
@@ -58,7 +61,7 @@ provide `${EXPORT_DIR}` to `export.sh`.
 > low-precision numerical behavior (fake-quant) which can be run on GPUs with compute capability > 80.
 > Real low-precision parameters (e.g. `E4M3` or `E2M1`)
 > and low-precision compute (e.g. `FP8Linear`) are also supported depending on GPU compute capability.
-> **See [Adanvanced Topics](advanced.md) for details**.
+> **See [Advanced Topics](./ADVANCED.md) for details**.
 
 ```sh
 \
@@ -75,31 +78,6 @@ provide `${EXPORT_DIR}` to `export.sh`.
 ./export.sh meta-llama/Llama-3.2-1B-Instruct
 ```
 
-> **❗ IMPORTANT:** The first positional arugment (e.g. `meta-llama/Llama-3.2-1B-Instruct`) of each script
-> is the config name used to match the supported model config in `conf/`. The pretrained checkpoint should
-> be downloaded and provided through `${HF_MODEL_CKPT}`.
-
-Loading the saved distributed checkpoint, the quantized Megatron model can be resumed for inference
-(generate or evaluate) or training (SFT or PEFT). To read more about these features, see
-[Adanvanced Topics](advanced.md). To learn more about the design, see our [Design]() document [WIP].
-
-```sh
-\
-TP=1 \
-MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
-./generate.sh meta-llama/Llama-3.2-1B-Instruct
-
-\
-TP=1 \
-MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
-./mmlu.sh meta-llama/Llama-3.2-1B-Instruct
-
-\
-TP=1 \
-MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
-./finetune.sh meta-llama/Llama-3.2-1B-Instruct
-```
-
 ### ⭐ Online BF16 EAGLE3 Training
 
 Online EAGLE3 training has both the target (frozen) and draft models in the memory where the `hidden_states`
@@ -122,7 +100,7 @@ deployment.
 ./export.sh meta-llama/Llama-3.2-1B-Instruct
 ```
 
-See [Adanvanced Topics](ADVANCED.md) for a `moonshotai/Kimi-K2-Instruct` EAGLE3 training example using `slurm`.
+See [Advanced Topics](./ADVANCED.md) for a `moonshotai/Kimi-K2-Instruct` EAGLE3 training example using `slurm`.
 
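(Editorial note: the hunk above elides the local EAGLE3 command itself. Judging from the calling convention of the other scripts in this directory and the `eagle3.sh` usage in the Slurm example, a local run presumably looks like the sketch below; the TP value and checkpoint path are placeholders.)

```sh
# Hypothetical local EAGLE3 training run; TP and the path are placeholders.
TP=1 \
HF_MODEL_CKPT=/models/meta-llama/Llama-3.2-1B-Instruct \
bash eagle3.sh meta-llama/Llama-3.2-1B-Instruct
```
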
 ### ⭐ Pruning
 
@@ -165,5 +143,28 @@ MLM_MODEL_SAVE=Qwen3-8B-Pruned \
 > default `conf/` by setting `MLM_EXTRA_ARGS`. E.g., for loading the above pruned Qwen3-8B checkpoint for MMLU, set:
 > `MLM_EXTRA_ARGS="--num-layers 24"`
 
+### ⭐ Inference and Training
+
+The saved Megatron-LM distributed checkpoint (the output of the above scripts) can be resumed for inference
+(generate or evaluate) or training (SFT or PEFT). To read more about these features, see
+[Advanced Topics](./ADVANCED.md).
+
+```sh
+\
+TP=1 \
+MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
+./generate.sh meta-llama/Llama-3.2-1B-Instruct
+
+\
+TP=1 \
+MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
+./mmlu.sh meta-llama/Llama-3.2-1B-Instruct
+
+\
+TP=1 \
+MLM_MODEL_CKPT=/tmp/Llama-3.2-1B-Instruct_quant \
+./finetune.sh meta-llama/Llama-3.2-1B-Instruct
+```
+
 ## Advanced Usage
 TBD
examples/post_training/modelopt/conf/moonshotai/kimi_k2_instruct.sh

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+#!/bin/bash
+
+HF_MODEL_CKPT=/workspace/scratch/moonshotai/Kimi-K2-Instruct
+TP=8
+ETP=1
+EP=64
+
examples/post_training/modelopt/conf/moonshotai/kimi_k2_instruct_export.sh

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+#!/bin/bash
+
+HF_MODEL_CKPT=/workspace/scratch/moonshotai/Kimi-K2-Instruct
+
+MLM_EXTRA_ARGS=" \
+    --decoder-first-pipeline-num-layers 3 \
+    --decoder-last-pipeline-num-layers 2 \
+    --init-model-with-meta-device \
+    --use-cpu-initialization \
+
+"
+
+# Layer distribution over the PP=16 stages: 3, [4] * 14, 2 (3 + 14*4 + 2 = 61 layers).
+PP=16
+
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+#!/bin/bash
+
+HF_MODEL_CKPT=/workspace/scratch/meta-llama/Llama-3.2-1B-Instruct
+TP=1
+ETP=1
+EP=1
+PP=1
examples/post_training/modelopt/slurm/sbatch.sh

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+#!/bin/bash
+
+#SBATCH -A <account>
+#SBATCH -p <partition>
+#SBATCH --job-name=<job-name>
+#SBATCH --nodes=1 --ntasks-per-node=8 --gpus-per-node=8
+#SBATCH -t 04:00:00
+#SBATCH --exclusive --mem=0 --overcommit
+
+# Bash coloring
+RED='\033[0;31m'
+YELLOW='\033[0;33m'
+GREEN='\033[0;32m'
+BLUE='\033[0;34m'
+PURPLE='\033[0;35m'
+WHITE='\033[0;37m'
+
+# Predefined logging
+MLM_ERROR="${RED}ERROR: ${WHITE}"
+MLM_WARNING="${YELLOW}WARNING:${WHITE}"
+
+# CHANGE THE FOLLOWING TO YOUR DATA, MEGATRON, and CHECKPOINT DIR
+if [[ -z ${USER_FSW} ]]; then
+    printf "${MLM_ERROR} Variable USER_FSW (read/write scratch space) must be set!\n"
+    exit 1
+fi
+
+if [ -z ${SANDBOX_DIR} ]; then
+    SANDBOX_DIR="$(pwd)"
+    printf "${MLM_WARNING} Variable SANDBOX_DIR not set! (default: ${SANDBOX_DIR})\n"
+fi
+
+if [ -z ${SANDBOX_ENV_SETUP} ]; then
+    SANDBOX_ENV_SETUP=./env_setup_template.sh
+    printf "${MLM_WARNING} Variable SANDBOX_ENV_SETUP not set! (default: ${SANDBOX_ENV_SETUP})\n"
+fi
+
+if [ -z ${CONTAINER_IMAGE} ]; then
+    CONTAINER_IMAGE="nvidia-modelopt-megatron:latest"
+    printf "${MLM_WARNING} Variable CONTAINER_IMAGE not set! (default: ${CONTAINER_IMAGE})\n"
+fi
+
+if [ -z ${LAUNCH_SCRIPT} ]; then
+    LAUNCH_SCRIPT="python"
+    printf "${MLM_WARNING} Variable LAUNCH_SCRIPT not set! (default: ${LAUNCH_SCRIPT})\n"
+fi
+
+# DO NOT MODIFY THE VALUES BELOW UNLESS YOU KNOW WHAT YOU ARE DOING!!!
+DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'`
+
+CONTAINER_MOUNT="${SANDBOX_DIR}:/workspace/nmm-sandbox,${USER_FSW}:/workspace/scratch"
+
+srun -l \
+    --mpi=pmix \
+    --output=%x_%j_$DATETIME.log \
+    --container-image ${CONTAINER_IMAGE} \
+    --container-workdir "/workspace/nmm-sandbox" \
+    --container-mounts ${CONTAINER_MOUNT} \
+    --export "HF_MODEL_CKPT=${HF_MODEL_CKPT},SANDBOX_ENV_SETUP=${SANDBOX_ENV_SETUP},LAUNCH_SCRIPT=${LAUNCH_SCRIPT}" \
+    bash ${1}
+
+set +x
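(Usage note, editorial: the wrapper mounts `${SANDBOX_DIR}` at `/workspace/nmm-sandbox` and `${USER_FSW}` at `/workspace/scratch`, which is why the `conf/` scripts above reference checkpoints under `/workspace/scratch/...`. A submission sketch with placeholder host paths:)

```sh
# Hypothetical submission; host paths are placeholders.
# ${USER_FSW}/moonshotai/Kimi-K2-Instruct on the host appears as
# /workspace/scratch/moonshotai/Kimi-K2-Instruct inside the container.
export USER_FSW=/lustre/$USER/scratch
export CONTAINER_IMAGE=nvidia-modelopt-megatron:latest
export SANDBOX_ENV_SETUP=./conf/moonshotai/kimi_k2_instruct.sh
sbatch --nodes=8 slurm/sbatch.sh "eagle3.sh moonshotai/Kimi-K2-Instruct"
```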
