Merged

53 commits
495c427
update
Jintao-Huang Dec 9, 2025
615b518
Merge branch 'main' into update_megatron_shells
Jintao-Huang Dec 9, 2025
f4145ca
update
Jintao-Huang Dec 9, 2025
74c0466
update
Jintao-Huang Dec 10, 2025
3783cb6
update
Jintao-Huang Dec 10, 2025
9d986ca
Merge branch 'main' into update_megatron_shells
Jintao-Huang Dec 10, 2025
981394c
Merge branch 'main' into update_megatron_shells
Jintao-Huang Dec 11, 2025
eb93cb6
update
Jintao-Huang Dec 11, 2025
9ec0a3a
update
Jintao-Huang Dec 12, 2025
331b911
update
Jintao-Huang Dec 12, 2025
6cda03c
Merge branch 'main' into update_megatron_shells
Jintao-Huang Dec 15, 2025
108c37f
fix
Jintao-Huang Dec 15, 2025
a0aec59
update
Jintao-Huang Dec 15, 2025
418d6a1
fix
Jintao-Huang Dec 15, 2025
25ffe2a
update
Jintao-Huang Dec 15, 2025
f52cb46
fix null_ref_context
Jintao-Huang Dec 15, 2025
128e209
fix
Jintao-Huang Dec 15, 2025
af4dac2
Merge remote-tracking branch 'origin/fix_null_ref_context' into updat…
Jintao-Huang Dec 15, 2025
504e9d7
fix
Jintao-Huang Dec 15, 2025
80ab8cb
Merge branch 'main' into update_megatron_shells
Jintao-Huang Dec 15, 2025
826e563
Merge branch 'main' into update_megatron_shells
Jintao-Huang Dec 15, 2025
b79c051
Merge branch 'main' into update_megatron_shells
Jintao-Huang Dec 15, 2025
3e901d6
update
Jintao-Huang Dec 15, 2025
0ac3fb0
fix
Jintao-Huang Dec 15, 2025
d4c4bc4
fix
Jintao-Huang Dec 15, 2025
0ac3c1b
fix
Jintao-Huang Dec 15, 2025
8bf48dc
update
Jintao-Huang Dec 15, 2025
409e743
Merge branch 'main' into update_megatron_shells
Jintao-Huang Dec 15, 2025
cae166c
update
Jintao-Huang Dec 15, 2025
7505657
update
Jintao-Huang Dec 15, 2025
94c8756
fix
Jintao-Huang Dec 15, 2025
9066c72
fix megatron seq_cls bridge
Jintao-Huang Dec 15, 2025
ef9fa52
fix
Jintao-Huang Dec 16, 2025
d24bc27
fix
Jintao-Huang Dec 16, 2025
0b781bb
fix
Jintao-Huang Dec 16, 2025
8f9793f
update
Jintao-Huang Dec 16, 2025
319ba73
Merge branch 'fix_megatron_seq_cls_bridge' into update_megatron_shells
Jintao-Huang Dec 16, 2025
a925f63
update
Jintao-Huang Dec 16, 2025
d852c37
update
Jintao-Huang Dec 16, 2025
58f3af0
fix
Jintao-Huang Dec 16, 2025
1ad352b
Merge branch 'main' into update_megatron_shells
Jintao-Huang Dec 16, 2025
4f57472
update
Jintao-Huang Dec 16, 2025
337d1cd
fix
Jintao-Huang Dec 16, 2025
64f1c85
update
Jintao-Huang Dec 16, 2025
25a1484
update
Jintao-Huang Dec 16, 2025
2fc6653
fix
Jintao-Huang Dec 16, 2025
5b98c54
fix
Jintao-Huang Dec 16, 2025
6a06731
fix swift main
Jintao-Huang Dec 16, 2025
7e23b50
fix
Jintao-Huang Dec 16, 2025
f1f51e4
update
Jintao-Huang Dec 16, 2025
f9c1804
Merge remote-tracking branch 'origin/fix_swift_main' into update_mega…
Jintao-Huang Dec 16, 2025
210b554
fix
Jintao-Huang Dec 16, 2025
9562a58
fix
Jintao-Huang Dec 16, 2025
6 changes: 4 additions & 2 deletions docs/source/BestPractices/Qwen3-Best-Practice.md
@@ -330,7 +330,7 @@ Best practice reference for Qwen3-235B-A22B-Instruct-250718 LoRA training on a single node with 8x H20 GPUs:

ms-swift introduces Megatron parallelism techniques to accelerate CPT/SFT/DPO/GRPO for large models. The supported models can be found in the [supported models documentation](../Instruction/Supported-models-and-datasets.md).

For environment setup and conversion between HF and MCore model weights, refer to the [Megatron-SWIFT training documentation](../Megatron-SWIFT/Quick-start.md).
For environment setup, refer to the [Megatron-SWIFT training documentation](../Megatron-SWIFT/Quick-start.md).

We will use Alibaba Cloud DLC to launch training. The training environment consists of 2 nodes, each equipped with 8x 80GiB A800 GPUs. For more information on multi-node launch methods, refer to [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node).

@@ -340,7 +340,9 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
megatron sft \
--load Qwen3-30B-A3B-Base-mcore \
--model Qwen/Qwen3-30B-A3B-Base \
--load_safetensors true \
--save_safetensors true \
--dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \
--load_from_cache_file true \
--split_dataset_ratio 0.01 \
1 change: 1 addition & 0 deletions docs/source/Instruction/Supported-models-and-datasets.md
@@ -7,6 +7,7 @@
- Model Type: the model type
- Default Template: the default chat template
- Requires: additional dependencies required to use this model
- Support Megatron: whether Megatron-SWIFT training is supported
- Tags: the model's tags


2 changes: 2 additions & 0 deletions docs/source/Megatron-SWIFT/Ascend.md
@@ -1,5 +1,7 @@
# Ascend NPU

For environment setup of Megatron-SWIFT on Ascend NPU, please refer to the [NPU best practices](../BestPractices/NPU-support.md).

## NPU Performance Data Collection

NPU performance data is collected via the `torch_npu.profiler.profile` interface: create a torch_npu.profiler.profile object and control collection through its start and stop interfaces. Collection requires modifying the Megatron source code that ms-swift depends on, specifically the train function in the Megatron-LM/megatron/training/training.py file. A collection example is shown below:
10 changes: 6 additions & 4 deletions docs/source/Megatron-SWIFT/Command-line-parameters.md
@@ -85,9 +85,10 @@
- Tip: you can set this to a very large value so that only the last checkpoint is kept.
- 🔥no_save_optim: do not save the optimizer; defaults to False. For full-parameter training, this can significantly reduce saving time.
- 🔥no_save_rng: do not save the RNG state; defaults to False.
- 🔥load: the checkpoint directory to load; defaults to None.
- 🔥load: the checkpoint directory to load; defaults to None. For an introduction to resuming training from a checkpoint, see the description of the `--finetune` parameter.
- Note: if the weights were not converted with the `swift export` command provided by ms-swift, you additionally need to set `--model <hf-repo>` so that the `config.json` configuration file can be loaded.
- For an introduction to resuming training from a checkpoint, see the description of the `--finetune` parameter.
- Note: in "ms-swift>3.10", directly loading and saving safetensors weights is supported; see the [Mcore-Bridge documentation](./Mcore-Bridge.md).
- Difference between `--model` and `--load`: `--model/--adapters/--ref_model/--ref_adapters` take a safetensors weight directory, while `--load/--adapter_load/--ref_load/--ref_adapter_load` take an mcore weight directory. `--model/--adapters` cannot load the state needed to resume training, so in "ms-swift>=3.12", if `--no_save_optim false` is set, an additional copy of the weights in mcore format is stored for resuming, and you need to use `--load/--adapter_load` to load the resumed-training state (a minimal sketch follows the Mcore-Bridge parameters below).
- 🔥no_load_optim: do not load the optimizer; defaults to False.
- Note: when resuming training from a checkpoint, setting `--no_load_optim false` to read the optimizer state usually consumes more GPU memory than `--no_load_optim true`, which skips it.
- 🔥no_load_rng: do not load the RNG state; defaults to False.
@@ -268,8 +269,9 @@ LoRA training:
- use_rslora: defaults to `False`; whether to use `RS-LoRA`.

**Mcore-Bridge Parameters**
- 🔥load_safetensors: defaults to False; whether to load weights directly from safetensors.
- 🔥save_safetensors: defaults to False; whether to save weights directly in safetensors format. Note that if this parameter is set to True, resume-training content such as optimizer weights and RNG states will not be saved.
- 🔥load_safetensors: this parameter no longer takes effect in "ms-swift>=3.12" (in earlier versions it defaulted to False); weights are loaded by priority instead: if `--load` does not exist, the safetensors weights given by `--model` are loaded; the same applies to `--adapters` and `--adapter_load`.
- Note: in "ms-swift>=3.12", this parameter is kept for shell-script compatibility but no longer has any effect.
- 🔥save_safetensors: defaults to True; whether to save weights directly in safetensors format. In "ms-swift>=3.12", this parameter also supports saving resume-training content such as optimizer weights and RNG states (an additional copy of the weights is stored in mcore format), controlled via `--no_save_optim` and `--no_save_rng`. When resuming training, use the `--load/--adapter_load` parameters to load the mcore-format weights; see the sketch below.
- model: the model_id or model_path of the safetensors weights. Defaults to None.
- model_type: the model type. See the [ms-swift command-line parameters documentation](../Instruction/Command-line-parameters.md) for details.
- adapters: the adapter_id or adapter_path of LoRA incremental weights in safetensors format. Defaults to `[]`.
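
The following is a minimal, hypothetical sketch contrasting a fresh start from safetensors weights via `--model` with resuming from the additionally saved mcore-format checkpoint via `--load`. The paths are illustrative and only flags documented on this page are used:

```shell
# Fresh training: load safetensors weights via --model; with --no_save_optim false,
# an extra mcore-format copy of each checkpoint is stored for later resuming.
megatron sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
    --save megatron_output/Qwen2.5-7B-Instruct \
    --save_safetensors true \
    --no_save_optim false \
    --no_save_rng false

# Resuming: point --load at the mcore-format checkpoint directory (hypothetical path)
# and read back the optimizer/RNG states with --no_load_optim/--no_load_rng set to false.
megatron sft \
    --load megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
    --save megatron_output/Qwen2.5-7B-Instruct \
    --no_load_optim false \
    --no_load_rng false
```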
142 changes: 125 additions & 17 deletions docs/source/Megatron-SWIFT/LoRA-Training.md
@@ -4,20 +4,35 @@ Best practice reference for Qwen3-235B-A22B-Instruct-250718 LoRA training on a single node with 8x H20 GPUs:

For environment setup, refer to the Megatron-SWIFT [Quick Start documentation](./Quick-start.md).

## HF to Mcore Conversion
## Traditional Approach

### HF to Mcore Conversion

Below, we describe weight conversion using the `swift export` and `megatron export` commands, respectively. Compared with `swift export`, `megatron export` supports multi-node conversion and conversion of LoRA incremental weights, but it is also more involved: parallelism parameters such as `--tensor_model_parallel_size` and `--export_model_parallel_size` must be specified additionally at export time; see the [Mcore-Bridge documentation](./Mcore-Bridge.md) for details. To use the `swift export` command, refer to the [Quick Start documentation](./Quick-start.md).

The conversion procedure is the same as for full-parameter training; the script is as follows:
```shell
CUDA_VISIBLE_DEVICES=0 \
swift export \
# megatron export
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron export \
--model Qwen/Qwen2.5-7B-Instruct \
--tensor_model_parallel_size 2 \
--to_mcore true \
--torch_dtype bfloat16 \
--output_dir Qwen2.5-7B-Instruct-mcore \
--save Qwen2.5-7B-Instruct-mcore \
--test_convert_precision true

# swift export
# CUDA_VISIBLE_DEVICES=0 \
# swift export \
# --model Qwen/Qwen2.5-7B-Instruct \
# --to_mcore true \
# --torch_dtype bfloat16 \
# --output_dir Qwen2.5-7B-Instruct-mcore \
# --test_convert_precision true
```

## LoRA Training
### LoRA Training

Training script:
```bash
@@ -28,6 +43,7 @@ NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
--load Qwen2.5-7B-Instruct-mcore \
--save_safetensors false \
--dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
'AI-ModelScope/alpaca-gpt4-data-en#500' \
'swift/self-cognition#500' \
@@ -61,28 +77,120 @@ megatron sft \
```
- For LoRA training scripts for MoE models, refer to [here](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/lora).

## MCore to HF Conversion
### MCore to HF Conversion

```bash
CUDA_VISIBLE_DEVICES=0 \
swift export \
--mcore_adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx \
# megatron export
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron export \
--adapter_load megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \
--to_hf true \
--tensor_model_parallel_size 2 \
--merge_lora false \
--torch_dtype bfloat16 \
--output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \
--save megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \
--test_convert_precision true

# swift export
# CUDA_VISIBLE_DEVICES=0 \
# swift export \
# --mcore_adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \
# --to_hf true \
# --torch_dtype bfloat16 \
# --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \
# --test_convert_precision true
```
- Note: the `--adapter_load/--mcore_adapters` folder contains an `args.json` file; during conversion, the `--model/--mcore_model` and LoRA-related parameter information is read from this file. `swift export` does not yet support converting LoRA incremental weights. With `megatron export`, you can use the `--merge_lora` parameter to control whether the weights are merged.

### Inference
```shell
# For full-parameter weights, replace `--adapters` with `--model`
CUDA_VISIBLE_DEVICES=0 \
swift infer \
--adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \
--stream true
```
- Note: the `mcore_adapters` folder contains an `args.json` file; during conversion, the `mcore_model` and LoRA-related parameter information in this file is read, `mcore_model` and `mcore_adapters` are merge-lora'd into complete weights, and the result is finally converted to HF-format weights. (Conversion of LoRA incremental weights is not yet supported.)

## Merge-LoRA
### Merge-LoRA

If you only want to merge-lora without converting to HF-format weights, for example for subsequent DPO training, you can use the following script:
```shell
CUDA_VISIBLE_DEVICES=0 \
swift export \
--mcore_adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx \
# megatron export
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron export \
--adapter_load megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \
--tensor_model_parallel_size 2 \
--to_mcore true \
--merge_lora true \
--torch_dtype bfloat16 \
--output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx-mcore \
--save megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-mcore \
--test_convert_precision true

# swift export
# CUDA_VISIBLE_DEVICES=0 \
# swift export \
# --mcore_adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \
# --to_mcore true \
# --torch_dtype bfloat16 \
# --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-mcore \
# --test_convert_precision true
```

## Mcore-Bridge (Recommended)

### Training

```shell
# full: 2 * 70GiB 0.61s/it
# lora: 2 * 14GiB 0.45s/it
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
--model Qwen/Qwen2.5-7B-Instruct \
--load_safetensors true \
--save_safetensors true \
--merge_lora false \
--dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
'AI-ModelScope/alpaca-gpt4-data-en#500' \
'swift/self-cognition#500' \
--train_type lora \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--tensor_model_parallel_size 2 \
--sequence_parallel true \
--micro_batch_size 16 \
--global_batch_size 16 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-4 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-5 \
--max_epochs 1 \
--save megatron_output/Qwen2.5-7B-Instruct \
--save_interval 100 \
--max_length 2048 \
--system 'You are a helpful assistant.' \
--num_workers 4 \
--no_save_optim true \
--no_save_rng true \
--dataset_num_proc 4 \
--model_author swift \
--model_name swift-robot
```

### Inference

```shell
# For full-parameter weights, replace `--adapters` with `--model`
CUDA_VISIBLE_DEVICES=0 \
swift infer \
--adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \
--stream true
```
14 changes: 7 additions & 7 deletions docs/source/Megatron-SWIFT/Mcore-Bridge.md
@@ -192,7 +192,7 @@ swift infer \

Tip: if you run into GPU OOM issues during vLLM weight updates, you can set `--offload_bridge true` to offload tensors to the CPU and reduce GPU memory usage.

## Export and Conversion Precision Testing
## `megatron export` and Conversion Precision Testing

In addition to converting and saving safetensors weights during training, Mcore-Bridge also provides the `megatron export` command for standalone weight export. `megatron export` can test the conversion precision during weight conversion, which is very helpful for verifying correctness when integrating a new model. Models already integrated into Megatron-SWIFT usually do not exhibit precision misalignment, so you can safely set `--test_convert_precision false`.
- Tip: for multimodal models, pay attention to the `mean_diff (with loss)` field; `mean_diff` shows a large diff because it includes image tokens, for which no loss is computed.
@@ -235,8 +235,8 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
megatron export \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx \
--save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-lora \
--adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx \
--save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-lora \
--merge_lora false \
--to_hf true \
--tensor_model_parallel_size 2 \
@@ -251,8 +251,8 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
megatron export \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--adapters megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-lora \
--save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-mcore \
--adapters megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-lora \
--save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-mcore \
--merge_lora false \
--to_mcore true \
--tensor_model_parallel_size 2 \
@@ -268,8 +268,8 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
megatron export \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx \
--save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-merged \
--adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx \
--save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-merged \
--merge_lora true \
--to_mcore true \
--tensor_model_parallel_size 2 \