
Commit a389f33

yaox12 authored and HaochenYuan committed
GPG sign off
1 parent effebd8 commit a389f33

30 files changed, +1111 −227 lines

README.md

Lines changed: 2 additions & 1 deletion
@@ -61,7 +61,8 @@ pip install -e .[mlm,dev]

## Performance & Benchmarking

-🚧 **Coming Soon** - We will update this section with performance benchmarks of experimental features as they become available.
+- 🚀 [2025/11] [Optimizing DeepSeek-V3 Training Performance on NVIDIA GB200 NVL72](docs/discussions/deepseek-v3-gb200-optimization/deepseek-v3-gb200-optimization.md).
+- [2025/11] [A Guide to Reproduce DeepSeek-V3 Pre-training Performance on GB200](docs/discussions/deepseek-v3-gb200-optimization/deepseek-v3-gb200-reproduce-guide.md).

## Community & Support

docs/discussions/README.md

Lines changed: 4 additions & 0 deletions
@@ -10,6 +10,10 @@ This directory contains in-depth guides, tutorials, and discussions about optimi

  A comprehensive guide on optimizing DeepSeek-V3 model training on NVIDIA GB200 NVL72 systems, covering profiling techniques, performance bottlenecks, and optimization strategies.

+- **[A Guide to Reproduce DeepSeek-V3 Pre-training Performance on GB200](deepseek-v3-gb200-optimization/deepseek-v3-gb200-reproduce-guide.md)**
+
+  A detailed guide on how to reproduce DeepSeek-V3 pre-training performance on GB200, including the Dockerfile, package requirements, and training scripts.
+
## Contributing

If you'd like to contribute a guide or tutorial, please follow this structure:

docs/discussions/deepseek-v3-gb200-optimization/deepseek-v3-gb200-optimization.md

Lines changed: 4 additions & 3 deletions
@@ -4,7 +4,7 @@

---

-This guide describes how we used Megatron Core (MCore) and Transformer Engine (TE) to pre-train the DeepSeek-V3 model with MXFP8 precision on 256 GB200 GPUs. We will detail the step-by-step process of optimizing performance to **970 TFLOPS/GPU**, which is a **2.55x** speedup compared to the estimated 380 TFLOPS on H100/H800 (refer to the estimation in this article \[[1](https://zhuanlan.zhihu.com/p/16480858047)\] in Chinese). The related features have been or will be open-sourced to the [Megatron Core](https://github.com/NVIDIA/Megatron-LM) and [Transformer Engine](https://github.com/NVIDIA/TransformerEngine) repositories.
+This guide describes how we used Megatron Core (MCore) and Transformer Engine (TE) to pre-train the DeepSeek-V3 model with MXFP8 precision on 256 GB200 GPUs. We will detail the step-by-step process of optimizing performance to **970 TFLOPS/GPU**, which is a **2.55x** speedup compared to the estimated 380 TFLOPS on H100/H800 (refer to the estimation in this article \[[1](https://zhuanlan.zhihu.com/p/16480858047)\] in Chinese). The related features have been or will be open-sourced to the [Megatron Core](https://github.com/NVIDIA/Megatron-LM) and [Transformer Engine](https://github.com/NVIDIA/TransformerEngine) repositories. Refer to the [reproduce guide](./deepseek-v3-gb200-reproduce-guide.md) to reproduce this performance.
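
For reference, the quoted **2.55x** is simply the ratio of the two throughput figures stated above:

$$
\frac{970\ \text{TFLOPS/GPU (GB200, MXFP8)}}{380\ \text{TFLOPS/GPU (H100/H800, estimated)}} \approx 2.55
$$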

## **0. Methodology**

@@ -20,7 +20,7 @@ DeepSeek-V3 innovatively uses FP8 mixed precision for pre-training, which saves

On the Blackwell platform, thanks to the native support of the fifth-generation Tensor Core for the MXFP8 format, we adopted the MXFP8 recipe, a more fine-grained quantization scheme for training. Both activations and weights are quantized at a 1x32 granularity, and E8M0 is used as the format for the scaling factor.

-Here, we will briefly introduce the difference in implementation between MXFP8 GEMM on the Blackwell platform and Blockwise FP8 GEMM on the Hopper platform. On the Hopper platform, since the Tensor Core itself does not support multiplication with a scale, after the matrix multiplication of each tile, it is necessary to multiply by the scale and accumulate the result with the CUDA Core. This also determines that on the Hopper platform, 1x128 is almost the finest quantization granularity available. If a finer granularity was used for quantization, the GEMM performance would suffer a great loss. On the other hand, since the Blackwell platform natively supports MXFP8, the dequantization process in GEMM (i.e., multiplying by the scale) is completed inside the Tensor Core, so the CUDA Core is not involved throughout the process, which can achieve better performance and support finer-grained quantization (1x32).
+Here, we will briefly introduce the difference in implementation between MXFP8 GEMM on the Blackwell platform and Blockwise FP8 GEMM on the Hopper platform. On the Hopper platform, since the Tensor Core does not support multiplication with vectors of scales, the quantization granularity must be greater than or equal to the GEMM tile size. This also means that on the Hopper platform, 1x128 is almost the finest quantization granularity available; if a finer granularity were used, the GEMM performance would suffer a great loss due to small GEMM tiles. On the other hand, since the Blackwell platform natively supports MXFP8, the dequantization in GEMM (i.e., multiplying by the scale) is completed inside the Tensor Core, so the CUDA Core is not involved throughout the process, which achieves better performance and supports finer-grained quantization (1x32).
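
To make the 1x32/E8M0 recipe above concrete, the following is a minimal PyTorch sketch of MXFP8-style block quantization, for illustration only. It is not the Transformer Engine kernel; the power-of-two scale selection follows the OCP MX convention, and the helper names (`mxfp8_quantize`, `mxfp8_dequantize`) are hypothetical.

```python
import torch

BLOCK = 32          # MXFP8 quantizes along 1x32 blocks
E4M3_MAX_EXP = 8    # exponent of the largest E4M3 normal value (448 = 1.75 * 2**8)

def mxfp8_quantize(x: torch.Tensor):
    """Quantize the last dim of `x` in blocks of 32, each sharing one E8M0 (power-of-two) scale."""
    assert x.shape[-1] % BLOCK == 0
    blocks = x.float().reshape(*x.shape[:-1], -1, BLOCK)            # [..., n_blocks, 32]
    amax = blocks.abs().amax(dim=-1, keepdim=True)                  # per-block max magnitude
    # E8M0 scale: a pure power of two that maps the block's amax near the top of the E4M3 range.
    shared_exp = torch.floor(torch.log2(amax.clamp(min=2.0**-127))) - E4M3_MAX_EXP
    scale = torch.exp2(shared_exp)                                   # decoded E8M0 scale per block
    q = (blocks / scale).to(torch.float8_e4m3fn)                     # FP8 E4M3 payload
    return q, scale

def mxfp8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reference dequantization; on Blackwell this multiply happens inside the Tensor Core."""
    return (q.float() * scale).reshape(*q.shape[:-2], -1)

x = torch.randn(4, 128)
q, s = mxfp8_quantize(x)
print((mxfp8_dequantize(q, s) - x).abs().max())   # quantization error is small but nonzero
```

The explicit scale multiply in `mxfp8_dequantize` is the step that Blackwell's Tensor Core performs internally during the GEMM, which is why no separate CUDA Core pass is needed for dequantization.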

When we started optimizing DeepSeek-V3 on the GB200 NVL72 platform with MCore, our baseline already included the following features:

@@ -242,7 +242,8 @@ We started from a baseline of 494 TFLOPS, and through multiple rounds of perform

**Complete Training Examples**

-* [DeepSeek-V3 Training Scripts](https://github.com/yanring/Megatron-MoE-ModelZoo) \- End-to-end training configurations and launch scripts
+* [Reproduce Guide](./deepseek-v3-gb200-reproduce-guide.md) \- Includes the Dockerfile, dependencies, cluster configuration, and launch scripts.
+* [Megatron-MoE-ModelZoo](https://github.com/yanring/Megatron-MoE-ModelZoo) \- End-to-end training configurations and launch scripts for popular MoE models, including DeepSeek-V3, Qwen3, etc.

**Papers and Technical Reports**
