**README.md** (2 additions, 1 deletion)
@@ -61,7 +61,8 @@ pip install -e .[mlm,dev]
## Performance & Benchmarking
-🚧 **Coming Soon** - We will update this section with performance benchmarks of experimental features as they become available.
+- 🚀 [2025/11] [Optimizing DeepSeek-V3 Training Performance on NVIDIA GB200 NVL72](docs/discussions/deepseek-v3-gb200-optimization/deepseek-v3-gb200-optimization.md).
+- ⚡ [2025/11] [A Guide to Reproduce DeepSeek-V3 Pre-training Performance on GB200](docs/discussions/deepseek-v3-gb200-optimization/deepseek-v3-gb200-reproduce-guide.md).
**docs/discussions/README.md** (4 additions)
@@ -10,6 +10,10 @@ This directory contains in-depth guides, tutorials, and discussions about optimi
A comprehensive guide on optimizing DeepSeek-V3 model training on NVIDIA GB200 NVL72 systems, covering profiling techniques, performance bottlenecks, and optimization strategies.
+- **[A Guide to Reproduce DeepSeek-V3 Pre-training Performance on GB200](deepseek-v3-gb200-optimization/deepseek-v3-gb200-reproduce-guide.md)**
+
+A detailed guide on how to reproduce the DeepSeek-V3 pre-training performance on GB200, including the Dockerfile, package requirements, and training scripts.
+
## Contributing
If you'd like to contribute a guide or tutorial, please follow this structure:
**docs/discussions/deepseek-v3-gb200-optimization/deepseek-v3-gb200-optimization.md** (4 additions, 3 deletions)
@@ -4,7 +4,7 @@
---
-This guide describes how we used Megatron Core (MCore) and Transformer Engine (TE) to pre-train the DeepSeek-V3 model with MXFP8 precision on 256 GB200 GPUs. We will detail the step-by-step process of optimizing performance to **970 TFLOPS/GPU**, which is a **2.55x** speedup compared to the estimated 380 TFLOPS on H100/H800 (refer to the estimation in this article \[[1](https://zhuanlan.zhihu.com/p/16480858047)\] in Chinese). The related features have been or will be open-sourced to the [Megatron Core](https://github.com/NVIDIA/Megatron-LM) and [Transformer Engine](https://github.com/NVIDIA/TransformerEngine) repositories.
+This guide describes how we used Megatron Core (MCore) and Transformer Engine (TE) to pre-train the DeepSeek-V3 model with MXFP8 precision on 256 GB200 GPUs. We will detail the step-by-step process of optimizing performance to **970 TFLOPS/GPU**, which is a **2.55x** speedup compared to the estimated 380 TFLOPS on H100/H800 (refer to the estimation in this article \[[1](https://zhuanlan.zhihu.com/p/16480858047)\] in Chinese). The related features have been or will be open-sourced to the [Megatron Core](https://github.com/NVIDIA/Megatron-LM) and [Transformer Engine](https://github.com/NVIDIA/TransformerEngine) repositories. Refer to [the reproduction guide](./deepseek-v3-gb200-reproduce-guide.md) for the steps to reproduce this performance.
## **0. Methodology**
@@ -20,7 +20,7 @@ DeepSeek-V3 innovatively uses FP8 mixed precision for pre-training, which saves
On the Blackwell platform, thanks to the native support of the fifth-generation Tensor Core for the MXFP8 format, we adopted the MXFP8 recipe, a more fine-grained quantization scheme for training. Both activations and weights are quantized at a 1x32 granularity, and E8M0 is used as the format for the scaling factor.
-Here, we will briefly introduce the difference in implementation between MXFP8 GEMM on the Blackwell platform and Blockwise FP8 GEMM on the Hopper platform. On the Hopper platform, since the Tensor Core itself does not support multiplication with a scale, after the matrix multiplication of each tile, it is necessary to multiply by the scale and accumulate the result with the CUDA Core. This also determines that on the Hopper platform, 1x128 is almost the finest quantization granularity available. If a finer granularity was used for quantization, the GEMM performance would suffer a great loss. On the other hand, since the Blackwell platform natively supports MXFP8, the dequantization process in GEMM (i.e., multiplying by the scale) is completed inside the Tensor Core, so the CUDA Core is not involved throughout the process, which can achieve better performance and support finer-grained quantization (1x32).
+Here, we will briefly introduce the difference in implementation between MXFP8 GEMM on the Blackwell platform and Blockwise FP8 GEMM on the Hopper platform. On the Hopper platform, since the Tensor Core does not support multiplication with vectors of scales, the quantization granularity must be greater than or equal to the GEMM tile size. This also means that on the Hopper platform, 1x128 is almost the finest quantization granularity available; using a finer granularity would force smaller GEMM tiles and greatly degrade GEMM performance. On the other hand, since the Blackwell platform natively supports MXFP8, the dequantization in GEMM (i.e., multiplying by the scale) is completed inside the Tensor Core, so the CUDA Core is not involved throughout the process, which achieves better performance and supports finer-grained quantization (1x32).
When we started optimizing DeepSeek-V3 on the GB200 NVL72 platform with MCore, our baseline already included the following features:
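
To make the block-scaled recipes described in the hunk above concrete, here is a minimal NumPy sketch (an illustration, not Transformer Engine or Megatron Core code) of what a block-scaled FP8 GEMM computes: every 1x32 slice shares one power-of-two (E8M0-style) scale, and the scales are applied while the tile products are accumulated in FP32. The function names, the clipping-only "quantization" (a real kernel also rounds values onto the E4M3 grid), and the per-K-block weight scales are simplifying assumptions; the Hopper blockwise recipe actually uses 1x128 activation and 128x128 weight blocks.

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite value representable in FP8 E4M3
BLOCK = 32         # MXFP8 quantizes along 1x32 blocks (Hopper blockwise FP8 uses 1x128)

def quantize_blockwise(x, block=BLOCK):
    """Share one power-of-two (E8M0-style) scale per 1xblock slice.

    Numerical sketch only: values stay in float32 and are merely scaled into the
    E4M3 range; a real kernel would also round them onto the E4M3 mantissa grid.
    """
    m, k = x.shape
    xb = x.reshape(m, k // block, block)
    amax = np.abs(xb).max(axis=-1, keepdims=True)
    scale = np.exp2(np.ceil(np.log2(np.maximum(amax, 2.0**-126) / E4M3_MAX)))
    q = np.clip(xb / scale, -E4M3_MAX, E4M3_MAX)
    return q.reshape(m, k), scale[..., 0]             # scales have shape (m, k // block)

def block_scaled_gemm_ref(a_q, a_s, b_q, b_s, block=BLOCK):
    """C = dequant(A) @ dequant(B), applying the scales per K block while accumulating.

    On Hopper, this scale multiply-and-add after each tile GEMM runs on CUDA Cores;
    on Blackwell, the Tensor Core applies the MXFP8 scales itself, so the loop body
    below is effectively fused into the MMA instruction.
    """
    m, k = a_q.shape
    n = b_q.shape[1]
    c = np.zeros((m, n), dtype=np.float32)
    for t in range(k // block):
        ks = slice(t * block, (t + 1) * block)
        partial = a_q[:, ks] @ b_q[ks, :]                 # low-precision tile product
        c += partial * a_s[:, t:t + 1] * b_s[t:t + 1, :]  # dequantize + FP32 accumulate
    return c

# Tiny end-to-end check against a plain FP32 GEMM.
a = np.random.randn(8, 128).astype(np.float32)
b = np.random.randn(128, 16).astype(np.float32)
a_q, a_s = quantize_blockwise(a)
b_q, b_s = quantize_blockwise(b.T)                        # quantize B along its K dimension
c = block_scaled_gemm_ref(a_q, a_s, b_q.T, b_s.T)
print(abs(c - a @ b).max())                               # small: only scaling, not E4M3 rounding, is modeled
```

Under these assumptions, the Hopper and Blackwell paths compute the same result; the difference is where the scale multiply in the inner loop executes (CUDA Cores vs. inside the Tensor Core), which is why Blackwell can afford the finer 1x32 granularity without hurting GEMM throughput.
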
@@ -242,7 +242,8 @@ We started from a baseline of 494 TFLOPS, and through multiple rounds of perform
**Complete Training Examples**
-* [DeepSeek-V3 Training Scripts](https://github.com/yanring/Megatron-MoE-ModelZoo) - End-to-end training configurations and launch scripts
+* [Reproduce Guide](./deepseek-v3-gb200-reproduce-guide.md), including the Dockerfile, dependencies, cluster configuration, and launch scripts.
+* [Megatron-MoE-ModelZoo](https://github.com/yanring/Megatron-MoE-ModelZoo) - End-to-end training configurations and launch scripts for popular MoE models, including DeepSeek-V3, Qwen3, etc.