diff --git a/docs/source/BestPractices/Qwen3-Best-Practice.md b/docs/source/BestPractices/Qwen3-Best-Practice.md index 586212ff86..784842413d 100644 --- a/docs/source/BestPractices/Qwen3-Best-Practice.md +++ b/docs/source/BestPractices/Qwen3-Best-Practice.md @@ -330,7 +330,7 @@ Qwen3-235B-A22B-Instruct-250718 单机8卡H20 LoRA训练的最佳实践参考: ms-swift 引入了 Megatron 并行技术以加速大模型的CPT/SFT/DPO/GRPO。支持的模型可以在[支持的模型文档](../Instruction/Supported-models-and-datasets.md)中找到。 -关于环境准备以及 HF 和 MCore 模型权重的转换,可以参考[Megatron-SWIFT训练文档](../Megatron-SWIFT/Quick-start.md)。 +关于环境准备,可以参考[Megatron-SWIFT训练文档](../Megatron-SWIFT/Quick-start.md)。 我们将使用阿里云 DLC 启动训练。训练环境由2台配备8卡 80GiB A800 GPU 组成。关于多节点启动方法的更多信息,请参考[这里](https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node)。 @@ -340,7 +340,9 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ NNODES=$WORLD_SIZE \ NODE_RANK=$RANK \ megatron sft \ - --load Qwen3-30B-A3B-Base-mcore \ + --model Qwen/Qwen3-30B-A3B-Base \ + --load_safetensors true \ + --save_safetensors true \ --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \ --load_from_cache_file true \ --split_dataset_ratio 0.01 \ diff --git a/docs/source/Instruction/Supported-models-and-datasets.md b/docs/source/Instruction/Supported-models-and-datasets.md index 6dd4adb6d0..d410106e86 100644 --- a/docs/source/Instruction/Supported-models-and-datasets.md +++ b/docs/source/Instruction/Supported-models-and-datasets.md @@ -7,6 +7,7 @@ - Model Type: 模型类型 - Default Template: 默认对话模板 - Requires: 使用该模型的额外依赖 +- Support Megatron: 是否支持Megatron-SWIFT训练 - Tags: 模型的tags diff --git a/docs/source/Megatron-SWIFT/Ascend.md b/docs/source/Megatron-SWIFT/Ascend.md index 1d822685c1..3e12438092 100644 --- a/docs/source/Megatron-SWIFT/Ascend.md +++ b/docs/source/Megatron-SWIFT/Ascend.md @@ -1,5 +1,7 @@ # Ascend NPU +关于Megatron-SWIFT在Ascend NPU上的环境准备,请参考[NPU最佳实践](../BestPractices/NPU-support.md)。 + ## NPU 性能数据采集 NPU性能采集通过`torch_npu.profiler.profile`接口进行采集,创建torch_npu.profiler.profile对象,通过start和stop接口控制采集性能数据,采集过程需要修改依赖的megatron源码,修改Megatron-LM/megatron/training/training.py文件中的train函数,采集示例如下: diff --git a/docs/source/Megatron-SWIFT/Command-line-parameters.md b/docs/source/Megatron-SWIFT/Command-line-parameters.md index f4d891f631..2ba792a8af 100644 --- a/docs/source/Megatron-SWIFT/Command-line-parameters.md +++ b/docs/source/Megatron-SWIFT/Command-line-parameters.md @@ -85,9 +85,10 @@ - 提示:你可以设置为一个很大的值来只保存最后一个检查点。 - 🔥no_save_optim: 不保存optimizer,默认为False。在全参数训练时,可以显著降低存储时间。 - 🔥no_save_rng: 不保存rng,默认为False。 -- 🔥load: 加载的checkpoint目录,默认None。 +- 🔥load: 加载的checkpoint目录,默认None。对于断点续训的介绍,请查看`--finetune`参数的介绍。 - 注意:若未使用ms-swift提供的`swift export`进行权重转换,你需要额外设置`--model `用于加载`config.json`配置文件。 - - 对于断点续训的介绍,请查看`--finetune`参数的介绍。 + - 注意:在"ms-swift>3.10",支持直接加载和存储safetensors权重,参考[mcore-bridge文档](./Mcore-Bridge.md)。 + - `--model`与`--load`的区别:`--model/--adapters/--ref_model/--ref_adapters`后加safetensors权重目录,`--load/--adapter_load/--ref_load/--ref_adapter_load`后加mcore权重目录。`--model/--adapters`不支持加载断点续训状态,因此在"ms-swift>=3.12",若设置`--no_save_optim false`,将额外存储mcore权重格式用于断点续训,你需要使用`--load/--adapter_load`来加载断点续训的状态。 - 🔥no_load_optim: 不载入optimizer,默认为False。 - 注意:断点续训时,设置`--no_load_optim false`读取优化器状态通常比`--no_load_optim true`不读取优化器状态消耗更大的显存资源。 - 🔥no_load_rng: 不载入rng,默认为False。 @@ -268,8 +269,9 @@ lora训练: - use_rslora: 默认为`False`,是否使用`RS-LoRA`。 **Mcore-Bridge参数** -- 🔥load_safetensors: 默认为False,是否直接从safetensors加载权重。 -- 🔥save_safetensors: 默认为False,是否直接保存成safetensors权重。注意,若该参数设置为True,则不会存储优化器权重、随机数状态等断点续训内容。 +- 🔥load_safetensors: 
该参数在"ms-swift>=3.12"将失效(之前版本默认为False),将根据优先级加载权重:若`--load`不存在,则加载safetensors权重`--model`;`--adapters`和`--adapter_load`等同理。 + - 注意:在"ms-swift>=3.12",为保持shell脚本兼容性,该参数被保留,但不再发挥任何作用。 +- 🔥save_safetensors: 默认为True,是否直接保存成safetensors权重。该参数在"ms-swift>=3.12"支持了对优化器权重、随机数状态等断点续训内容进行保存(额外存储mcore格式权重),使用`--no_save_optim`和`--no_save_rng`控制。断点续训时使用`--load/--adapter_load`参数加载mcore格式权重。 - model: safetensors权重的model_id或者model_path。默认为None。 - model_type: 模型类型。介绍参考[ms-swift命令行参数文档](../Instruction/Command-line-parameters.md)。 - adapters: safetensors格式的LoRA增量权重的adapter_id或者adapter_path。默认为`[]`。 diff --git a/docs/source/Megatron-SWIFT/LoRA-Training.md b/docs/source/Megatron-SWIFT/LoRA-Training.md index 436f9e09a8..16a8a62edd 100644 --- a/docs/source/Megatron-SWIFT/LoRA-Training.md +++ b/docs/source/Megatron-SWIFT/LoRA-Training.md @@ -4,20 +4,35 @@ Qwen3-235B-A22B-Instruct-250718 单机8卡H20 LoRA训练的最佳实践参考: 环境准备请参考Megatron-SWIFT的[快速开始文档](./Quick-start.md)。 -## HF转换Mcore +## 传统方式 + +### HF转换Mcore + +以下,我们分别介绍使用`swift export`和`megatron export`命令进行权重转换。相比于`swift export`,`megatron export`支持多机和LoRA增量权重转换,但也更加复杂,需要在导出时额外指定并行参数,例如`--tensor_model_parallel_size`, `--export_model_parallel_size`,具体参考[Mcore-Bridge文档](./Mcore-Bridge.md)。若要使用`swift export`命令,参考[快速开始文档](./Quick-start.md)。 -转换方式与全参数训练一致,脚本如下: ```shell -CUDA_VISIBLE_DEVICES=0 \ -swift export \ +# megatron export +NPROC_PER_NODE=2 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron export \ --model Qwen/Qwen2.5-7B-Instruct \ + --tensor_model_parallel_size 2 \ --to_mcore true \ --torch_dtype bfloat16 \ - --output_dir Qwen2.5-7B-Instruct-mcore \ + --save Qwen2.5-7B-Instruct-mcore \ --test_convert_precision true + +# swift export +# CUDA_VISIBLE_DEVICES=0 \ +# swift export \ +# --model Qwen/Qwen2.5-7B-Instruct \ +# --to_mcore true \ +# --torch_dtype bfloat16 \ +# --output_dir Qwen2.5-7B-Instruct-mcore \ +# --test_convert_precision true ``` -## LoRA训练 +### LoRA训练 训练脚本: ```bash @@ -28,6 +43,7 @@ NPROC_PER_NODE=2 \ CUDA_VISIBLE_DEVICES=0,1 \ megatron sft \ --load Qwen2.5-7B-Instruct-mcore \ + --save_safetensors false \ --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \ 'AI-ModelScope/alpaca-gpt4-data-en#500' \ 'swift/self-cognition#500' \ @@ -61,28 +77,120 @@ megatron sft \ ``` - MoE模型的LoRA训练脚本参考[这里](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/lora)。 -## MCore转换HF +### MCore转换HF ```bash -CUDA_VISIBLE_DEVICES=0 \ -swift export \ - --mcore_adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx \ +# megatron export +NPROC_PER_NODE=2 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron export \ + --adapter_load megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \ --to_hf true \ + --tensor_model_parallel_size 2 \ + --merge_lora false \ --torch_dtype bfloat16 \ - --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \ + --save megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \ --test_convert_precision true + +# swift export +# CUDA_VISIBLE_DEVICES=0 \ +# swift export \ +# --mcore_adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \ +# --to_hf true \ +# --torch_dtype bfloat16 \ +# --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \ +# --test_convert_precision true +``` +- 注意:`--adapter_load/--mcore_adapters`文件夹中包含`args.json`文件,转换过程会读取文件中`--model/--mcore_model`以及LoRA相关的参数信息。`swift export`暂不支持LoRA增量权重的转换。`megatron export`你可以使用`--merge_lora`参数控制是否进行权重合并。 + +### 推理 +```shell +# 如果是全量权重,请将`--adapters`替换为`--model +CUDA_VISIBLE_DEVICES=0 \ +swift infer \ + --adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf 
\ + --stream true ``` -- 注意:`mcore_adapters`文件夹中包含`args.json`文件,转换过程中会读取文件中`mcore_model`和LoRA相关的参数信息,并将`mcore_model`和`mcore_adapters`merge-lora成完整权重,最终转换成HF格式权重。(暂不支持LoRA增量权重的转换) -## Merge-LoRA +### Merge-LoRA 如果只想merge-lora,而不希望转成HF格式权重,用于后续DPO训练,可以使用以下脚本: ```shell -CUDA_VISIBLE_DEVICES=0 \ -swift export \ - --mcore_adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx \ +# megatron export +NPROC_PER_NODE=2 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron export \ + --adapter_load megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \ + --tensor_model_parallel_size 2 \ --to_mcore true \ + --merge_lora true \ --torch_dtype bfloat16 \ - --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx-mcore \ + --save megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-mcore \ --test_convert_precision true + +# swift export +# CUDA_VISIBLE_DEVICES=0 \ +# swift export \ +# --mcore_adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \ +# --to_mcore true \ +# --torch_dtype bfloat16 \ +# --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-mcore \ +# --test_convert_precision true +``` + +## Mcore-Bridge【推荐】 + +### 训练 + +```shell +# full: 2 * 70GiB 0.61s/it +# lora: 2 * 14GiB 0.45s/it +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +NPROC_PER_NODE=2 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --load_safetensors true \ + --save_safetensors true \ + --merge_lora false \ + --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \ + 'AI-ModelScope/alpaca-gpt4-data-en#500' \ + 'swift/self-cognition#500' \ + --train_type lora \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --tensor_model_parallel_size 2 \ + --sequence_parallel true \ + --micro_batch_size 16 \ + --global_batch_size 16 \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-4 \ + --lr_warmup_fraction 0.05 \ + --min_lr 1e-5 \ + --max_epochs 1 \ + --save megatron_output/Qwen2.5-7B-Instruct \ + --save_interval 100 \ + --max_length 2048 \ + --system 'You are a helpful assistant.' 
\ + --num_workers 4 \ + --no_save_optim true \ + --no_save_rng true \ + --dataset_num_proc 4 \ + --model_author swift \ + --model_name swift-robot +``` + +### 推理 + +```shell +# 如果是全量权重,请将`--adapters`替换为`--model +CUDA_VISIBLE_DEVICES=0 \ +swift infer \ + --adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \ + --stream true ``` diff --git a/docs/source/Megatron-SWIFT/Mcore-Bridge.md b/docs/source/Megatron-SWIFT/Mcore-Bridge.md index 65be015d77..c7c8978cf6 100644 --- a/docs/source/Megatron-SWIFT/Mcore-Bridge.md +++ b/docs/source/Megatron-SWIFT/Mcore-Bridge.md @@ -192,7 +192,7 @@ swift infer \ 提示:如果在vLLM权重更新期间遇到 GPU OOM 问题,您可以设置 `--offload_bridge true` 将张量卸载到 CPU 并减少 GPU 内存使用量。 -## 导出与转换精度测试 +## `megatron export` 与 转换精度测试 Mcore-Bridge除了支持在训练中进行safetensors的转换和保存,也支持了`megatron export`命令用于单独的权重导出。`megatron export`支持在权重转换时,对转换精度进行测试,这在接入新模型时验证接入准确性很有帮助。通常,Megatron-SWIFT已经接入的模型不会出现精度不对齐的情况,你可以放心设置`--test_convert_precision false`。 - 提示:多模态模型请关注`mean_diff (with loss)`字段,`mean_diff`因包含图像tokens且该部分不计算损失,有较大的diff。 @@ -235,8 +235,8 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \ NPROC_PER_NODE=4 \ megatron export \ --model Qwen/Qwen3-30B-A3B-Instruct-2507 \ - --adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx \ - --save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-lora \ + --adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx \ + --save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-lora \ --merge_lora false \ --to_hf true \ --tensor_model_parallel_size 2 \ @@ -251,8 +251,8 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \ NPROC_PER_NODE=4 \ megatron export \ --model Qwen/Qwen3-30B-A3B-Instruct-2507 \ - --adapters megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-lora \ - --save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-mcore \ + --adapters megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-lora \ + --save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-mcore \ --merge_lora false \ --to_mcore true \ --tensor_model_parallel_size 2 \ @@ -268,8 +268,8 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \ NPROC_PER_NODE=4 \ megatron export \ --model Qwen/Qwen3-30B-A3B-Instruct-2507 \ - --adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx \ - --save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-merged \ + --adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx \ + --save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-merged \ --merge_lora true \ --to_mcore true \ --tensor_model_parallel_size 2 \ diff --git a/docs/source/Megatron-SWIFT/Multimodal-Model.md b/docs/source/Megatron-SWIFT/Multimodal-Model.md index 8f51213211..7b269e6741 100644 --- a/docs/source/Megatron-SWIFT/Multimodal-Model.md +++ b/docs/source/Megatron-SWIFT/Multimodal-Model.md @@ -8,17 +8,6 @@ ms-swift引入了Megatron的并行技术来加速多模态大模型的训练。 这里介绍使用2卡80GiB A100对Qwen2.5-VL-7B-Instruct模型进行Latex-OCR的微调,分别使用全参数和LoRA的方式,以下最佳实践可以在10分钟内完成。 -首先,我们需要将HF格式的权重转为Megatron格式: -```shell -CUDA_VISIBLE_DEVICES=0 \ -swift export \ - --model Qwen/Qwen2.5-VL-7B-Instruct \ - --to_mcore true \ - --torch_dtype bfloat16 \ - --output_dir Qwen2.5-VL-7B-Instruct-mcore \ - --test_convert_precision true -``` - ### Full 全参数训练脚本如下: @@ -29,7 +18,9 @@ NPROC_PER_NODE=2 \ MAX_PIXELS=1003520 \ CUDA_VISIBLE_DEVICES=0,1 \ megatron sft \ - --load Qwen2.5-VL-7B-Instruct-mcore \ + --model Qwen/Qwen2.5-VL-7B-Instruct \ + --load_safetensors true \ + --save_safetensors true \ --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ --load_from_cache_file 
true \ --tensor_model_parallel_size 2 \ @@ -60,17 +51,6 @@ megatron sft \ --dataset_num_proc 8 ``` -将全参数保存的Megatron格式权重转为HF格式: -- 注意:`--mcore_model`请指向`iter_xxx`的上级目录。默认会使用`latest_checkpointed_iteration.txt`中对应的checkpoint。 -```shell -CUDA_VISIBLE_DEVICES=0 \ -swift export \ - --mcore_model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx \ - --to_hf true \ - --torch_dtype bfloat16 \ - --output_dir megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \ - --test_convert_precision true -``` ### LoRA @@ -82,7 +62,10 @@ NPROC_PER_NODE=2 \ MAX_PIXELS=1003520 \ CUDA_VISIBLE_DEVICES=0,1 \ megatron sft \ - --load Qwen2.5-VL-7B-Instruct-mcore \ + --model Qwen/Qwen2.5-VL-7B-Instruct \ + --load_safetensors true \ + --save_safetensors true \ + --merge_lora false \ --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ --load_from_cache_file true \ --train_type lora \ @@ -117,24 +100,13 @@ megatron sft \ --dataset_num_proc 8 ``` -将LoRA保存的增量权重进行Merge-LoRA并转为HF格式: -```shell -CUDA_VISIBLE_DEVICES=0 \ -swift export \ - --mcore_adapters megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx \ - --to_hf true \ - --torch_dtype bfloat16 \ - --output_dir megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \ - --test_convert_precision true -``` - 最后,我们使用生成的HF格式权重对验证集进行推理: ```shell MAX_PIXELS=1003520 \ CUDA_VISIBLE_DEVICES=0 \ swift infer \ - --model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \ + --adapters megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx/checkpoint-xxx \ --attn_impl flash_attn \ --stream true \ --load_data_args true \ @@ -160,16 +132,18 @@ swift infer \ ## Moe模型 -Moe模型的模型转换步骤和Dense模型一致(请参考Dense进行修改),这里介绍 OpenGVLab/InternVL3_5-30B-A3B-mcore 模型LoRA微调的训练脚本。 -- 在MoE模型的转换时,`--test_convert_precision true`转换精度测试所需时间较长,可酌情去除。 +训练脚本: ```bash # 2 * 43GiB, 8s/it PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ NPROC_PER_NODE=2 \ CUDA_VISIBLE_DEVICES=0,1 \ megatron sft \ - --load InternVL3_5-30B-A3B-mcore \ + --model OpenGVLab/InternVL3_5-30B-A3B \ + --load_safetensors true \ + --save_safetensors true \ + --merge_lora false \ --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ --load_from_cache_file true \ --train_type lora \ @@ -214,7 +188,7 @@ megatron sft \ ```shell CUDA_VISIBLE_DEVICES=0 \ swift infer \ - --model megatron_output/InternVL3_5-30B-A3B/vx-xxx-hf \ + --adapters megatron_output/InternVL3_5-30B-A3B/vx-xxx/checkpoint-xxx \ --attn_impl flash_attn \ --stream true \ --load_data_args true \ diff --git a/docs/source/Megatron-SWIFT/Quick-start.md b/docs/source/Megatron-SWIFT/Quick-start.md index 4b17252669..3cd943b29f 100644 --- a/docs/source/Megatron-SWIFT/Quick-start.md +++ b/docs/source/Megatron-SWIFT/Quick-start.md @@ -77,10 +77,12 @@ modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu2 这里介绍使用2卡80GiB A100对Qwen2.5-7B-Instruct模型进行自我认知微调的快速入门案例,以下最佳实践可以在10分钟内完成。 +### 传统方式 + 首先,我们需要将HF格式的权重转为Megatron格式: - 多卡权重转换:将`CUDA_VISIBLE_DEVICES=0`删除即可使用多卡权重转换。 - 转换精度测试:`--test_convert_precision true`将测试转换精度。在MoE大型模型的转换时,该参数所需时间较长,且需要更多的内存消耗,可酌情去除。 -- ms-swift支持了Mcore-Bridge来避免权重转换的额外耗时,请参考[Mcore-Bridge文档](./Mcore-Bridge.md)。 + ```shell CUDA_VISIBLE_DEVICES=0 \ swift export \ @@ -99,6 +101,7 @@ NPROC_PER_NODE=2 \ CUDA_VISIBLE_DEVICES=0,1 \ megatron sft \ --load Qwen2.5-7B-Instruct-mcore \ + --save_safetensors false \ --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \ 'AI-ModelScope/alpaca-gpt4-data-en#500' \ 'swift/self-cognition#500' \ @@ -133,10 +136,10 @@ megatron sft \ ```shell CUDA_VISIBLE_DEVICES=0 \ swift export \ - --mcore_model megatron_output/Qwen2.5-7B-Instruct/vx-xxx \ + 
--mcore_model megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \ --to_hf true \ --torch_dtype bfloat16 \ - --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \ + --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \ --test_convert_precision true ``` @@ -144,7 +147,7 @@ swift export \ ```shell CUDA_VISIBLE_DEVICES=0 \ swift infer \ - --model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \ + --model megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \ --stream true \ --temperature 0 \ --max_new_tokens 2048 @@ -156,6 +159,58 @@ swift infer \ I am a language model developed by swift, you can call me swift-robot. How can I assist you? ``` + +### Mcore-Bridge【推荐】 + +在"ms-swift>=3.10",支持了Mcore-Bridge,去除模型转换的繁琐过程。具体参考[Mcore-Bridge文档](./Mcore-Bridge.md)。 + +训练脚本: +```bash +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +NPROC_PER_NODE=2 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --load_safetensors true \ + --save_safetensors true \ + --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \ + 'AI-ModelScope/alpaca-gpt4-data-en#500' \ + 'swift/self-cognition#500' \ + --tensor_model_parallel_size 2 \ + --sequence_parallel true \ + --micro_batch_size 16 \ + --global_batch_size 16 \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-5 \ + --lr_warmup_fraction 0.05 \ + --min_lr 1e-6 \ + --max_epochs 1 \ + --save megatron_output/Qwen2.5-7B-Instruct \ + --save_interval 100 \ + --max_length 2048 \ + --system 'You are a helpful assistant.' \ + --num_workers 4 \ + --no_save_optim true \ + --no_save_rng true \ + --dataset_num_proc 4 \ + --model_author swift \ + --model_name swift-robot +``` + +我们对生成的safetensors格式权重进行推理: +```shell +CUDA_VISIBLE_DEVICES=0 \ +swift infer \ + --model megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \ + --stream true \ + --temperature 0 \ + --max_new_tokens 2048 +``` + - 若要进行预训练,你可以使用`megatron pt`替代`megatron sft`,这将会使用生成式的template进行训练。 - Megatron-SWIFT使用与ms-swift相同的dataset和template处理模块,因此同样支持packing、loss_scale、agent训练等技术。自定义数据集格式参考[自定义数据集文档](../Customization/Custom-dataset.md)。 - **更多案例**:包括packing、多机、32K上下文、DPO、MoE模型、预训练,可以查看[这里](https://github.com/modelscope/ms-swift/tree/main/examples/megatron)。 diff --git a/docs/source/index.rst b/docs/source/index.rst index f70a8a05c9..fea4d1e325 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -43,6 +43,7 @@ Swift DOCUMENTATION Megatron-SWIFT/Multimodal-Model.md Megatron-SWIFT/Mcore-Bridge.md Megatron-SWIFT/GRPO.md + Megatron-SWIFT/Ascend.md .. toctree:: diff --git a/docs/source_en/BestPractices/Qwen3-Best-Practice.md b/docs/source_en/BestPractices/Qwen3-Best-Practice.md index de406298d3..8ce104b275 100644 --- a/docs/source_en/BestPractices/Qwen3-Best-Practice.md +++ b/docs/source_en/BestPractices/Qwen3-Best-Practice.md @@ -334,7 +334,7 @@ Best practice reference for single-node 8xH20 LoRA training with Qwen3-235B-A22B ms-swift introduces Megatron parallelism techniques to accelerate CPT/SFT/DPO/GRPO for large models. Supported models can be found in the [Supported Models and Datasets Document](../Instruction/Supported-models-and-datasets.md). -For environment setup and conversion between HF and MCore model weights, refer to the [Megatron-SWIFT Training Documentation](../Megatron-SWIFT/Quick-start.md). +For environment setup, refer to the [Megatron-SWIFT Training Documentation](../Megatron-SWIFT/Quick-start.md). 
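For reference, when the same two-node job is launched by hand rather than through DLC, the `NNODES`/`NODE_RANK` values that DLC injects via `$WORLD_SIZE`/`$RANK` have to be exported explicitly on each node. A minimal sketch mirroring the `examples/megatron/multi-node` scripts updated further down in this diff (master address, port and per-node GPU count are placeholders):

```shell
# Node 0; repeat on node 1 with NODE_RANK=1.
# Append the same dataset/parallelism arguments as the DLC script below.
MASTER_ADDR=xxx.xxx.xxx.xxx \
MASTER_PORT=29500 \
NNODES=2 \
NODE_RANK=0 \
NPROC_PER_NODE=8 \
megatron sft \
    --model Qwen/Qwen3-30B-A3B-Base \
    --load_safetensors true \
    --save_safetensors true
```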
We will use Alibaba Cloud DLC to launch training. The training environment consists of two nodes equipped with 8x 80GiB A800 GPUs each. For more information on multi-node launching, see [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node). @@ -344,7 +344,9 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ NNODES=$WORLD_SIZE \ NODE_RANK=$RANK \ megatron sft \ - --load Qwen3-30B-A3B-Base-mcore \ + --model Qwen/Qwen3-30B-A3B-Base \ + --load_safetensors true \ + --save_safetensors true \ --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \ --load_from_cache_file true \ --split_dataset_ratio 0.01 \ diff --git a/docs/source_en/Instruction/Supported-models-and-datasets.md b/docs/source_en/Instruction/Supported-models-and-datasets.md index 7e57444296..aa932cadcc 100644 --- a/docs/source_en/Instruction/Supported-models-and-datasets.md +++ b/docs/source_en/Instruction/Supported-models-and-datasets.md @@ -7,6 +7,7 @@ The table below introduces the models integrated with ms-swift: - Model Type: Type of the model - Default Template: Default chat template - Requires: Additional dependencies required to use the model +- Support Megatron: Whether Megatron-SWIFT training is supported - Tags: Tags associated with the model diff --git a/docs/source_en/Megatron-SWIFT/Ascend.md b/docs/source_en/Megatron-SWIFT/Ascend.md index ee27bd057b..e77740afef 100644 --- a/docs/source_en/Megatron-SWIFT/Ascend.md +++ b/docs/source_en/Megatron-SWIFT/Ascend.md @@ -1,4 +1,7 @@ # Ascend NPU + +For environment preparation of Megatron-SWIFT on Ascend NPU, please refer to [NPU Best Practices](../BestPractices/NPU-support.md). + ## NPU Performance Data Collection NPU performance collection is conducted through the `torch_npu.profiler.profile` interface. To begin, create an instance of `torch_npu.profiler.profile`, then use the `start` and `stop` methods to control the performance data collection process. During this process, modifications to the dependent Megatron source code are required, specifically altering the `train` function in the `Megatron-LM/megatron/training/training.py` file. Below is an example of the collection process: diff --git a/docs/source_en/Megatron-SWIFT/Command-line-parameters.md b/docs/source_en/Megatron-SWIFT/Command-line-parameters.md index 27a8749fb9..8a33c068c8 100644 --- a/docs/source_en/Megatron-SWIFT/Command-line-parameters.md +++ b/docs/source_en/Megatron-SWIFT/Command-line-parameters.md @@ -89,9 +89,10 @@ - Tip: You can set it to a very large value to only save the final checkpoint. - 🔥no_save_optim: Do not save optimizer, default is False. When performing full-parameter training, this can significantly reduce storage time. - 🔥no_save_rng: Do not save RNG, default is False. -- 🔥load: Directory of the checkpoint to load, default is None. +- 🔥load: The directory of the checkpoint to load. Default is None. For details on resuming training from a checkpoint, please refer to the description of the `--finetune` argument. - Note: If you did not convert the weights with ms-swift’s `swift export`, you must also specify `--model ` so that the `config.json` configuration file can be loaded. - - For details on resuming training from a checkpoint, please refer to the description of the `--finetune` argument. + - Note: In "ms-swift>3.10", direct loading and saving of safetensors weights is supported, refer to [mcore-bridge documentation](./Mcore-Bridge.md). 
+ - The difference between `--model` and `--load`: `--model/--adapters/--ref_model/--ref_adapters` are followed by safetensors weight directories, while `--load/--adapter_load/--ref_load/--ref_adapter_load` are followed by mcore weight directories. `--model/--adapters` do not support loading checkpoint resume states, so in "ms-swift>=3.12", if you set `--no_save_optim false`, mcore weight format will be additionally saved for checkpoint resumption, and you need to use `--load/--adapter_load` to load the checkpoint resume state. - 🔥no_load_optim: Do not load optimizer, default is False. - Note: When resuming training from a checkpoint, setting `--no_load_optim false` (i.e., loading the optimizer state) typically consumes significantly more GPU memory than setting `--no_load_optim true` (i.e., skipping the optimizer state). - 🔥no_load_rng: Do not load RNG, default is False. @@ -286,9 +287,10 @@ LoRA Training: **Mcore-Bridge Parameters** -- 🔥load_safetensors: Defaults to False. Whether to load weights directly from safetensors. -- 🔥save_safetensors: Defaults to False. Whether to save directly as safetensors weights. Note: if this parameter is set to True, optimizer weights, random number states, and other checkpoint resumption contents will not be stored. -- model: The model_id or model_path of safetensors weights. Defaults to None. +- 🔥load_safetensors: This parameter will become ineffective in "ms-swift>=3.12" (defaults to False in previous versions). Weights will be loaded based on priority: if `--load` does not exist, safetensors weights `--model` will be loaded; the same applies to `--adapters` and `--adapter_load`, etc. + - Note: In "ms-swift>=3.12", this parameter is retained for shell script compatibility but no longer has any effect. +- 🔥save_safetensors: Defaults to True, whether to directly save as safetensors weights. This parameter in "ms-swift>=3.12" supports saving checkpoint resume content such as optimizer weights and random number states (additionally storing mcore format weights), controlled by `--no_save_optim` and `--no_save_rng`. When resuming from checkpoint, use the `--load/--adapter_load` parameter to load mcore format weights. +- model: The model_id or model_path of safetensors weights. Default is None. Supports resume training from checkpoint using `--no_load_optim false --no_load_rng false`. - model_type: Model type. For details, refer to [ms-swift command-line parameters documentation](../Instruction/Command-line-parameters.md). - adapters: adapter_id or adapter_path of LoRA incremental weights in safetensors format. Default is `[]`. - ref_model: model_id or model_path of ref_model safetensors weights. Required when using DPO/GRPO/KTO algorithms with full-parameter training. Default is None, set to `--model`. diff --git a/docs/source_en/Megatron-SWIFT/LoRA-Training.md b/docs/source_en/Megatron-SWIFT/LoRA-Training.md index 4850678d89..d158eb9ae9 100644 --- a/docs/source_en/Megatron-SWIFT/LoRA-Training.md +++ b/docs/source_en/Megatron-SWIFT/LoRA-Training.md @@ -4,21 +4,37 @@ Best practice reference for single-node 8xH20 LoRA training with Qwen3-235B-A22B For environment setup, please refer to the [Quick Start Guide](./Quick-start.md) of Megatron-SWIFT. -## Converting HF to Mcore -The conversion process is the same as for full-parameter training. Use the following script: +## Traditional Method + +### Converting HF to Mcore + +Below, we introduce weight conversion using the `swift export` and `megatron export` commands respectively. 
Compared to `swift export`, `megatron export` supports multi-node and LoRA incremental weight conversion, but is also more complex, requiring additional specification of parallelism parameters during export, such as `--tensor_model_parallel_size` and `--export_model_parallel_size`. For details, refer to the [Mcore-Bridge Documentation](./Mcore-Bridge.md). To use the `swift export` command, refer to the [Quick Start Documentation](./Quick-start.md). + ```shell -CUDA_VISIBLE_DEVICES=0 \ -swift export \ +# megatron export +NPROC_PER_NODE=2 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron export \ --model Qwen/Qwen2.5-7B-Instruct \ + --tensor_model_parallel_size 2 \ --to_mcore true \ --torch_dtype bfloat16 \ - --output_dir Qwen2.5-7B-Instruct-mcore \ + --save Qwen2.5-7B-Instruct-mcore \ --test_convert_precision true + +# swift export +# CUDA_VISIBLE_DEVICES=0 \ +# swift export \ +# --model Qwen/Qwen2.5-7B-Instruct \ +# --to_mcore true \ +# --torch_dtype bfloat16 \ +# --output_dir Qwen2.5-7B-Instruct-mcore \ +# --test_convert_precision true ``` -## LoRA Training +### LoRA Training Training Script: @@ -30,6 +46,7 @@ NPROC_PER_NODE=2 \ CUDA_VISIBLE_DEVICES=0,1 \ megatron sft \ --load Qwen2.5-7B-Instruct-mcore \ + --save_safetensors false \ --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \ 'AI-ModelScope/alpaca-gpt4-data-en#500' \ 'swift/self-cognition#500' \ @@ -63,30 +80,127 @@ megatron sft \ ``` - For LoRA training scripts of MoE models, please refer to [here](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/lora). -## Converting MCore to HF +### Converting MCore to HF ```bash -CUDA_VISIBLE_DEVICES=0 \ -swift export \ - --mcore_adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx \ +# megatron export +NPROC_PER_NODE=2 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron export \ + --adapter_load megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \ --to_hf true \ + --tensor_model_parallel_size 2 \ + --merge_lora false \ --torch_dtype bfloat16 \ - --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \ + --save megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \ --test_convert_precision true + +# swift export +# CUDA_VISIBLE_DEVICES=0 \ +# swift export \ +# --mcore_adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \ +# --to_hf true \ +# --torch_dtype bfloat16 \ +# --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \ +# --test_convert_precision true ``` -- Note: The `mcore_adapters` folder contains an `args.json` file. During the conversion process, parameters related to `mcore_model` and LoRA will be loaded from this file. The system will then perform a merge-lora operation between the `mcore_model` and `mcore_adapters` to obtain the complete model weights, and finally convert them into HuggingFace (HF) format. (Conversion of LoRA incremental weights is not supported for now) +- Note: The `--adapter_load/--mcore_adapters` folder contains an `args.json` file. The conversion process will read the `--model/--mcore_model` and LoRA-related parameter information from this file. `swift export` does not currently support conversion of LoRA incremental weights. With `megatron export`, you can use the `--merge_lora` parameter to control whether to merge weights. 
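If a merged Hugging Face checkpoint is preferred (for example, to serve it directly with `swift infer --model` or an inference engine), the note above suggests the same command can be run with `--merge_lora true`. A sketch, not one of the original examples, with placeholder paths:

```shell
# Export LoRA adapters + base model as a single merged HF checkpoint (sketch)
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron export \
    --adapter_load megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \
    --to_hf true \
    --tensor_model_parallel_size 2 \
    --merge_lora true \
    --torch_dtype bfloat16 \
    --save megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-merged-hf \
    --test_convert_precision true
```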
-## Merge-LoRA -If you only want to merge the LoRA weights without converting them to Hugging Face format, for subsequent DPO training, you can use the following script: +### Inference ```shell +# If using full weights, replace `--adapters` with `--model` CUDA_VISIBLE_DEVICES=0 \ -swift export \ - --mcore_adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx \ +swift infer \ + --adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \ + --stream true +``` + + +### Merge-LoRA + +If you only want to merge the LoRA weights without converting them to Hugging Face format, for subsequent DPO training, you can use the following script: + +```shell +# megatron export +NPROC_PER_NODE=2 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron export \ + --adapter_load megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \ + --tensor_model_parallel_size 2 \ --to_mcore true \ + --merge_lora true \ --torch_dtype bfloat16 \ - --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx-mcore \ + --save megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-mcore \ --test_convert_precision true + +# swift export +# CUDA_VISIBLE_DEVICES=0 \ +# swift export \ +# --mcore_adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \ +# --to_mcore true \ +# --torch_dtype bfloat16 \ +# --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-mcore \ +# --test_convert_precision true +``` + + +## Mcore-Bridge [Recommended] + +### Training + +```shell +# full: 2 * 70GiB 0.61s/it +# lora: 2 * 14GiB 0.45s/it +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +NPROC_PER_NODE=2 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --load_safetensors true \ + --save_safetensors true \ + --merge_lora false \ + --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \ + 'AI-ModelScope/alpaca-gpt4-data-en#500' \ + 'swift/self-cognition#500' \ + --train_type lora \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --tensor_model_parallel_size 2 \ + --sequence_parallel true \ + --micro_batch_size 16 \ + --global_batch_size 16 \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-4 \ + --lr_warmup_fraction 0.05 \ + --min_lr 1e-5 \ + --max_epochs 1 \ + --save megatron_output/Qwen2.5-7B-Instruct \ + --save_interval 100 \ + --max_length 2048 \ + --system 'You are a helpful assistant.' \ + --num_workers 4 \ + --no_save_optim true \ + --no_save_rng true \ + --dataset_num_proc 4 \ + --model_author swift \ + --model_name swift-robot +``` + + +### Inference + +```shell +# If using full weights, replace `--adapters` with `--model` +CUDA_VISIBLE_DEVICES=0 \ +swift infer \ + --adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \ + --stream true ``` diff --git a/docs/source_en/Megatron-SWIFT/Mcore-Bridge.md b/docs/source_en/Megatron-SWIFT/Mcore-Bridge.md index e16885893b..a65a3a8d40 100644 --- a/docs/source_en/Megatron-SWIFT/Mcore-Bridge.md +++ b/docs/source_en/Megatron-SWIFT/Mcore-Bridge.md @@ -202,7 +202,7 @@ swift infer \ Tip: If you encounter GPU OOM issues during weight synchronization with vLLM, you can set `--offload_bridge true` to offload intermediate tensors to the CPU and reduce GPU memory usage. 
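For context, `--offload_bridge` is passed to the training command itself. A minimal sketch of where it would sit in a colocated GRPO run; only the bridge-related flags are shown, the dataset, reward and vLLM arguments of a real GRPO script are omitted, and `--rlhf_type grpo` is assumed per the GRPO documentation:

```shell
# Sketch only: offload bridge tensors to CPU during vLLM weight synchronization
megatron rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen3-30B-A3B-Instruct-2507 \
    --load_safetensors true \
    --save_safetensors true \
    --offload_bridge true
```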
-## Export and Conversion Precision Testing +## `megatron export` and Conversion Accuracy Testing In addition to supporting safetensors conversion and saving during training, Mcore-Bridge also supports the `megatron export` command for standalone weight export. `megatron export` supports conversion precision testing during weight conversion, which is very helpful for verifying accuracy when integrating new models. Typically, models already integrated into Megatron-SWIFT will not have precision misalignment issues, so you can confidently set `--test_convert_precision false`. - Note: For multimodal models, please focus on the `mean_diff (with loss)` field. The `mean_diff` may show a large difference because it includes image tokens, and loss is not calculated for that portion. @@ -247,8 +247,8 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \ NPROC_PER_NODE=4 \ megatron export \ --model Qwen/Qwen3-30B-A3B-Instruct-2507 \ - --adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx \ - --save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-lora \ + --adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx \ + --save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-lora \ --merge_lora false \ --to_hf true \ --tensor_model_parallel_size 2 \ @@ -263,8 +263,8 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \ NPROC_PER_NODE=4 \ megatron export \ --model Qwen/Qwen3-30B-A3B-Instruct-2507 \ - --adapters megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-lora \ - --save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-mcore \ + --adapters megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-lora \ + --save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-mcore \ --merge_lora false \ --to_mcore true \ --tensor_model_parallel_size 2 \ @@ -280,8 +280,8 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \ NPROC_PER_NODE=4 \ megatron export \ --model Qwen/Qwen3-30B-A3B-Instruct-2507 \ - --adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx \ - --save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-merged \ + --adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx \ + --save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-merged \ --merge_lora true \ --to_mcore true \ --tensor_model_parallel_size 2 \ diff --git a/docs/source_en/Megatron-SWIFT/Multimodal-Model.md b/docs/source_en/Megatron-SWIFT/Multimodal-Model.md index d3d96dde1f..778cc4b9a0 100644 --- a/docs/source_en/Megatron-SWIFT/Multimodal-Model.md +++ b/docs/source_en/Megatron-SWIFT/Multimodal-Model.md @@ -8,17 +8,6 @@ For environment setup, please refer to the Megatron-SWIFT [Quick Start guide](./ This section demonstrates fine-tuning the Qwen2.5-VL-7B-Instruct model on the LaTeX-OCR task using two 80GiB A100 GPUs, with both full-parameter fine-tuning and LoRA. The best practices described below can be completed within 10 minutes. 
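Both recipes below save safetensors checkpoints directly, so the full-parameter run can be validated the same way as the LoRA run at the end of this section, just with `--model` in place of `--adapters`. A sketch with a placeholder checkpoint path:

```shell
MAX_PIXELS=1003520 \
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx/checkpoint-xxx \
    --attn_impl flash_attn \
    --stream true \
    --load_data_args true
```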
-First, we need to convert the model weights from Hugging Face format to Megatron format: -```shell -CUDA_VISIBLE_DEVICES=0 \ -swift export \ - --model Qwen/Qwen2.5-VL-7B-Instruct \ - --to_mcore true \ - --torch_dtype bfloat16 \ - --output_dir Qwen2.5-VL-7B-Instruct-mcore \ - --test_convert_precision true -``` - ### Full The full-parameter training script is as follows: @@ -29,7 +18,9 @@ NPROC_PER_NODE=2 \ MAX_PIXELS=1003520 \ CUDA_VISIBLE_DEVICES=0,1 \ megatron sft \ - --load Qwen2.5-VL-7B-Instruct-mcore \ + --model Qwen/Qwen2.5-VL-7B-Instruct \ + --load_safetensors true \ + --save_safetensors true \ --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ --load_from_cache_file true \ --tensor_model_parallel_size 2 \ @@ -60,19 +51,6 @@ megatron sft \ --dataset_num_proc 8 ``` -Convert Megatron-format weights saved with full parameters to Hugging Face format: - -- Note: `--mcore_model` should point to the parent directory of `iter_xxx`. By default, the checkpoint specified in `latest_checkpointed_iteration.txt` will be used. - -```shell -CUDA_VISIBLE_DEVICES=0 \ -swift export \ - --mcore_model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx \ - --to_hf true \ - --torch_dtype bfloat16 \ - --output_dir megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \ - --test_convert_precision true -``` ### LoRA @@ -84,7 +62,10 @@ NPROC_PER_NODE=2 \ MAX_PIXELS=1003520 \ CUDA_VISIBLE_DEVICES=0,1 \ megatron sft \ - --load Qwen2.5-VL-7B-Instruct-mcore \ + --model Qwen/Qwen2.5-VL-7B-Instruct \ + --load_safetensors true \ + --save_safetensors true \ + --merge_lora false \ --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ --load_from_cache_file true \ --train_type lora \ @@ -119,24 +100,12 @@ megatron sft \ --dataset_num_proc 8 ``` -Merge the LoRA-saved incremental weights and convert them to Hugging Face format: -```shell -CUDA_VISIBLE_DEVICES=0 \ -swift export \ - --mcore_adapters megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx \ - --to_hf true \ - --torch_dtype bfloat16 \ - --output_dir megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \ - --test_convert_precision true -``` - - Finally, we use the generated Hugging Face format weights to perform inference on the validation set: ```shell MAX_PIXELS=1003520 \ CUDA_VISIBLE_DEVICES=0 \ swift infer \ - --model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \ + --adapters megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx/checkpoint-xxx \ --attn_impl flash_attn \ --stream true \ --load_data_args true \ @@ -161,17 +130,17 @@ The inference results are as follows: ## MoE Model -The model conversion steps for MoE models are the same as those for Dense models (please refer to the Dense model section for modifications). Below is the training script for LoRA fine-tuning of the OpenGVLab/InternVL3_5-30B-A3B-mcore model. - -- During MoE model conversion, the precision test via `--test_convert_precision true` takes a long time; consider removing it as appropriate. 
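After the LoRA run below finishes, the saved safetensors adapters can be merged into a standalone checkpoint through the regular ms-swift export path; a sketch, assuming the standard `swift export --merge_lora` flow applies to these adapters (paths are placeholders):

```shell
CUDA_VISIBLE_DEVICES=0 \
swift export \
    --adapters megatron_output/InternVL3_5-30B-A3B/vx-xxx/checkpoint-xxx \
    --merge_lora true \
    --output_dir megatron_output/InternVL3_5-30B-A3B/vx-xxx/checkpoint-xxx-merged
```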
- +Training script: ```bash # 2 * 43GiB, 8s/it PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ NPROC_PER_NODE=2 \ CUDA_VISIBLE_DEVICES=0,1 \ megatron sft \ - --load InternVL3_5-30B-A3B-mcore \ + --model OpenGVLab/InternVL3_5-30B-A3B \ + --load_safetensors true \ + --save_safetensors true \ + --merge_lora false \ --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ --load_from_cache_file true \ --train_type lora \ @@ -216,7 +185,7 @@ After training is completed, we use the generated Hugging Face format weights to ```shell CUDA_VISIBLE_DEVICES=0 \ swift infer \ - --model megatron_output/InternVL3_5-30B-A3B/vx-xxx-hf \ + --adapters megatron_output/InternVL3_5-30B-A3B/vx-xxx/checkpoint-xxx \ --attn_impl flash_attn \ --stream true \ --load_data_args true \ diff --git a/docs/source_en/Megatron-SWIFT/Quick-start.md b/docs/source_en/Megatron-SWIFT/Quick-start.md index e35b4ff48f..6256a47062 100644 --- a/docs/source_en/Megatron-SWIFT/Quick-start.md +++ b/docs/source_en/Megatron-SWIFT/Quick-start.md @@ -77,10 +77,12 @@ Recommended Operating Environment: This section introduces a quick start example for fine-tuning the self-awareness of the Qwen2.5-7B-Instruct model using two 80GiB A100 GPUs. The following best practices can be completed within 10 minutes. +### Traditional Method + First, we need to convert the weights from HF (Hugging Face) format to Megatron format: - Multi-GPU weight conversion: Remove `CUDA_VISIBLE_DEVICES=0` to enable multi-GPU weight conversion. - Conversion precision test: `--test_convert_precision true` will test the conversion precision. For large MoE model conversions, this option takes longer and consumes more memory, so you may omit it as needed. -- ms-swift supports Mcore-Bridge to avoid the extra time cost of weight conversion. Please refer to the [Mcore-Bridge documentation](./Mcore-Bridge.md). + ```shell CUDA_VISIBLE_DEVICES=0 \ swift export \ @@ -99,6 +101,7 @@ NPROC_PER_NODE=2 \ CUDA_VISIBLE_DEVICES=0,1 \ megatron sft \ --load Qwen2.5-7B-Instruct-mcore \ + --save_safetensors false \ --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \ 'AI-ModelScope/alpaca-gpt4-data-en#500' \ 'swift/self-cognition#500' \ @@ -134,10 +137,10 @@ Finally, convert the Megatron format weights back to HF format: ```shell CUDA_VISIBLE_DEVICES=0 \ swift export \ - --mcore_model megatron_output/Qwen2.5-7B-Instruct/vx-xxx \ + --mcore_model megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \ --to_hf true \ --torch_dtype bfloat16 \ - --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \ + --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \ --test_convert_precision true ``` @@ -146,7 +149,7 @@ We then perform inference on the generated HF format weights: ```shell CUDA_VISIBLE_DEVICES=0 \ swift infer \ - --model megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \ + --model megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \ --stream true \ --temperature 0 \ --max_new_tokens 2048 @@ -159,6 +162,59 @@ The inference results are as follows: I am a language model developed by swift, you can call me swift-robot. How can I assist you? ``` +### Mcore-Bridge [Recommended] + +In "ms-swift>=3.10", Mcore-Bridge is supported, eliminating the cumbersome process of model conversion. For details, refer to [Mcore-Bridge Documentation](./Mcore-Bridge.md). 
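One note before the script: it saves with `--no_save_optim true --no_save_rng true` to keep checkpoints small. If checkpoint resumption is needed, the command-line parameter notes earlier in this diff indicate that setting those to `false` additionally writes an mcore-format checkpoint, which is then resumed via `--load`; a sketch with a placeholder path:

```shell
# Resume sketch: reuse the dataset/parallelism arguments of the training script below
megatron sft \
    --load megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \
    --finetune false \
    --no_load_optim false \
    --no_load_rng false
```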
+ +Training script: + +```bash +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +NPROC_PER_NODE=2 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --load_safetensors true \ + --save_safetensors true \ + --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \ + 'AI-ModelScope/alpaca-gpt4-data-en#500' \ + 'swift/self-cognition#500' \ + --tensor_model_parallel_size 2 \ + --sequence_parallel true \ + --micro_batch_size 16 \ + --global_batch_size 16 \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-5 \ + --lr_warmup_fraction 0.05 \ + --min_lr 1e-6 \ + --max_epochs 1 \ + --save megatron_output/Qwen2.5-7B-Instruct \ + --save_interval 100 \ + --max_length 2048 \ + --system 'You are a helpful assistant.' \ + --num_workers 4 \ + --no_save_optim true \ + --no_save_rng true \ + --dataset_num_proc 4 \ + --model_author swift \ + --model_name swift-robot +``` + +We perform inference on the generated safetensors format weights: + +```shell +CUDA_VISIBLE_DEVICES=0 \ +swift infer \ + --model megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \ + --stream true \ + --temperature 0 \ + --max_new_tokens 2048 +``` + - For pretraining, you can use `megatron pt` instead of `megatron sft`, which will use a generative template for training. - Megatron-SWIFT uses the same dataset and template processing modules as ms-swift, thus supporting techniques such as packing, loss scale, and agent training. For custom dataset formats, please refer to the [Custom Dataset Documentation](../Customization/Custom-dataset.md). - **More Examples**: Including packing, multi-node training, 32K context length, DPO, MoE models, and pre-training, can be found [here](https://github.com/modelscope/ms-swift/tree/main/examples/megatron). diff --git a/docs/source_en/index.rst b/docs/source_en/index.rst index f70a8a05c9..fea4d1e325 100644 --- a/docs/source_en/index.rst +++ b/docs/source_en/index.rst @@ -43,6 +43,7 @@ Swift DOCUMENTATION Megatron-SWIFT/Multimodal-Model.md Megatron-SWIFT/Mcore-Bridge.md Megatron-SWIFT/GRPO.md + Megatron-SWIFT/Ascend.md .. 
toctree:: diff --git a/examples/megatron/base_to_chat.sh b/examples/megatron/base_to_chat.sh index 2397feea12..d2a354ffeb 100644 --- a/examples/megatron/base_to_chat.sh +++ b/examples/megatron/base_to_chat.sh @@ -3,7 +3,9 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ NPROC_PER_NODE=8 \ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ megatron sft \ - --load Qwen2.5-14B-mcore \ + --model Qwen/Qwen2.5-14B \ + --load_safetensors true \ + --save_safetensors true \ --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \ --load_from_cache_file true \ --split_dataset_ratio 0.01 \ diff --git a/examples/megatron/dense/72b_offload.sh b/examples/megatron/dense/72b_offload.sh index be6cbc5992..d520c3a127 100644 --- a/examples/megatron/dense/72b_offload.sh +++ b/examples/megatron/dense/72b_offload.sh @@ -3,7 +3,9 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ NPROC_PER_NODE=8 \ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ megatron sft \ - --load Qwen2.5-72B-Instruct-mcore \ + --model Qwen/Qwen2.5-72B-Instruct \ + --load_safetensors true \ + --save_safetensors true \ --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \ --load_from_cache_file true \ --split_dataset_ratio 0.01 \ diff --git a/examples/megatron/dense/qwen3_32b.sh b/examples/megatron/dense/qwen3_32b.sh index 4613e52bdf..d85af6fc91 100644 --- a/examples/megatron/dense/qwen3_32b.sh +++ b/examples/megatron/dense/qwen3_32b.sh @@ -3,7 +3,9 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ NPROC_PER_NODE=8 \ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ megatron sft \ - --load Qwen3-32B-mcore \ + --model Qwen/Qwen3-32B \ + --load_safetensors true \ + --save_safetensors true \ --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \ --load_from_cache_file true \ --split_dataset_ratio 0.01 \ diff --git a/examples/megatron/export/lora.sh b/examples/megatron/export/lora.sh index cf73420e7e..a3e69be3f4 100644 --- a/examples/megatron/export/lora.sh +++ b/examples/megatron/export/lora.sh @@ -8,8 +8,8 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \ NPROC_PER_NODE=4 \ megatron export \ --model Qwen/Qwen3-30B-A3B-Instruct-2507 \ - --adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx \ - --save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-lora \ + --adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx \ + --save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-lora \ --merge_lora false \ --to_hf true \ --tensor_model_parallel_size 2 \ @@ -22,8 +22,8 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \ NPROC_PER_NODE=4 \ megatron export \ --model Qwen/Qwen3-30B-A3B-Instruct-2507 \ - --adapters megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-lora \ - --save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-mcore \ + --adapters megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-lora \ + --save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-mcore \ --merge_lora false \ --to_mcore true \ --tensor_model_parallel_size 2 \ @@ -37,8 +37,8 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \ NPROC_PER_NODE=4 \ megatron export \ --model Qwen/Qwen3-30B-A3B-Instruct-2507 \ - --adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx \ - --save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-merged \ + --adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx \ + --save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-merged \ --merge_lora true \ --to_mcore true \ --tensor_model_parallel_size 2 \ diff --git 
a/examples/megatron/grpo/moe_colocate_lora.sh b/examples/megatron/grpo/moe_colocate_lora.sh index c501a90a10..98dce9a373 100644 --- a/examples/megatron/grpo/moe_colocate_lora.sh +++ b/examples/megatron/grpo/moe_colocate_lora.sh @@ -6,6 +6,7 @@ megatron rlhf \ --model Qwen/Qwen3-30B-A3B-Instruct-2507 \ --load_safetensors true \ --save_safetensors true \ + --merge_lora false \ --context_parallel_size 2 \ --tensor_model_parallel_size 2 \ --expert_model_parallel_size 4 \ diff --git a/examples/megatron/long_text.sh b/examples/megatron/long_text.sh index 8d9a90d26f..7a5def7916 100644 --- a/examples/megatron/long_text.sh +++ b/examples/megatron/long_text.sh @@ -5,7 +5,9 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ NPROC_PER_NODE=4 \ CUDA_VISIBLE_DEVICES=0,1,2,3 \ megatron sft \ - --load Qwen2.5-7B-mcore \ + --model Qwen/Qwen2.5-7B \ + --load_safetensors true \ + --save_safetensors true \ --dataset 'ZhipuAI/LongWriter-6k' \ --load_from_cache_file true \ --split_dataset_ratio 0.01 \ diff --git a/examples/megatron/lora/dense.sh b/examples/megatron/lora/dense.sh index db6636fe8a..0c64b72b3f 100644 --- a/examples/megatron/lora/dense.sh +++ b/examples/megatron/lora/dense.sh @@ -4,7 +4,10 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ NPROC_PER_NODE=2 \ CUDA_VISIBLE_DEVICES=0,1 \ megatron sft \ - --load Qwen2.5-7B-Instruct-mcore \ + --model Qwen/Qwen2.5-7B-Instruct \ + --save_safetensors true \ + --load_safetensors true \ + --merge_lora false \ --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \ 'AI-ModelScope/alpaca-gpt4-data-en#500' \ 'swift/self-cognition#500' \ diff --git a/examples/megatron/lora/dpo.sh b/examples/megatron/lora/dpo.sh index 4cf4af3c29..fb1861f030 100644 --- a/examples/megatron/lora/dpo.sh +++ b/examples/megatron/lora/dpo.sh @@ -4,7 +4,10 @@ NPROC_PER_NODE=2 \ CUDA_VISIBLE_DEVICES=0,1 \ megatron rlhf \ --rlhf_type dpo \ - --load Qwen3-30B-A3B-Instruct-2507-mcore \ + --model Qwen/Qwen3-30B-A3B-Instruct-2507 \ + --load_safetensors true \ + --save_safetensors true \ + --merge_lora false \ --dataset AI-ModelScope/orpo-dpo-mix-40k \ --load_from_cache_file true \ --train_type lora \ diff --git a/examples/megatron/lora/loss_scale.sh b/examples/megatron/lora/loss_scale.sh index 0a66820852..63b639c250 100644 --- a/examples/megatron/lora/loss_scale.sh +++ b/examples/megatron/lora/loss_scale.sh @@ -3,7 +3,10 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ NPROC_PER_NODE=2 \ CUDA_VISIBLE_DEVICES=0,1 \ megatron sft \ - --load Qwen3-30B-A3B-Base-mcore \ + --model Qwen/Qwen3-30B-A3B-Base \ + --load_safetensors true \ + --save_safetensors true \ + --merge_lora false \ --train_type lora \ --dataset AI-ModelScope/function-calling-chatml#10000 \ --load_from_cache_file true \ diff --git a/examples/megatron/lora/moe.sh b/examples/megatron/lora/moe.sh index 3446413b16..9dfa66a23e 100644 --- a/examples/megatron/lora/moe.sh +++ b/examples/megatron/lora/moe.sh @@ -3,7 +3,10 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ NPROC_PER_NODE=2 \ CUDA_VISIBLE_DEVICES=0,1 \ megatron sft \ - --load Qwen3-30B-A3B-mcore \ + --model Qwen/Qwen3-30B-A3B \ + --load_safetensors true \ + --save_safetensors true \ + --merge_lora false \ --dataset 'swift/Qwen3-SFT-Mixin#2000' \ 'swift/self-cognition:empty_think#600' \ --loss_scale ignore_empty_think \ diff --git a/examples/megatron/lora/mtp.sh b/examples/megatron/lora/mtp.sh index 798670556a..9ad8a9ad8d 100644 --- a/examples/megatron/lora/mtp.sh +++ b/examples/megatron/lora/mtp.sh @@ -7,6 +7,7 @@ megatron sft \ --model ZhipuAI/GLM-4.5-Air \ 
--load_safetensors true \ --save_safetensors true \ + --merge_lora true \ --mtp_num_layers 1 \ --dataset 'swift/Chinese-Qwen3-235B-2507-Distill-data-110k-SFT' \ --load_from_cache_file true \ diff --git a/examples/megatron/lora/new_special_tokens.sh b/examples/megatron/lora/new_special_tokens.sh index db64f8b0dc..a5ae09cf42 100644 --- a/examples/megatron/lora/new_special_tokens.sh +++ b/examples/megatron/lora/new_special_tokens.sh @@ -6,7 +6,10 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ NPROC_PER_NODE=2 \ CUDA_VISIBLE_DEVICES=0,1 \ megatron sft \ - --load Qwen3-30B-A3B-mcore \ + --model Qwen/Qwen3-30B-A3B \ + --load_safetensors true \ + --save_safetensors true \ + --merge_lora false \ --dataset 'swift/new_special_tokens' \ --new_special_tokens 'examples/train/new_special_tokens/tokens.txt' \ --train_type lora \ diff --git a/examples/megatron/mcore_bridge/full/dense.sh b/examples/megatron/mcore_bridge/full/dense.sh index efa17589c1..e6bc48496f 100644 --- a/examples/megatron/mcore_bridge/full/dense.sh +++ b/examples/megatron/mcore_bridge/full/dense.sh @@ -44,6 +44,6 @@ megatron sft \ # FPS_MAX_FRAMES=16 \ # CUDA_VISIBLE_DEVICES=0 \ # swift infer \ -# --model megatron_output/Qwen3-VL-8B-Instruct/vx-xxx \ +# --model megatron_output/Qwen3-VL-8B-Instruct/vx-xxx/checkpoint-xxx \ # --load_data_args true \ # --stream true diff --git a/examples/megatron/mcore_bridge/lora/seq_cls.sh b/examples/megatron/mcore_bridge/lora/seq_cls.sh index 36d5071238..631fecba37 100644 --- a/examples/megatron/mcore_bridge/lora/seq_cls.sh +++ b/examples/megatron/mcore_bridge/lora/seq_cls.sh @@ -51,6 +51,6 @@ megatron sft \ # FPS_MAX_FRAMES=16 \ # CUDA_VISIBLE_DEVICES=0 \ # swift infer \ -# --adapters megatron_output/Qwen3-VL-8B-Instruct/vx-xxx \ +# --adapters megatron_output/Qwen3-VL-8B-Instruct/vx-xxx/checkpoint-xxx \ # --load_data_args true \ # --stream true diff --git a/examples/megatron/moe/deepseek_v3.sh b/examples/megatron/moe/deepseek_v3.sh index 89305ea24b..6222a0bdc9 100644 --- a/examples/megatron/moe/deepseek_v3.sh +++ b/examples/megatron/moe/deepseek_v3.sh @@ -5,7 +5,9 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ NPROC_PER_NODE=8 \ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ megatron sft \ - --load Moonlight-16B-A3B-Instruct-mcore \ + --model Qwen/Moonlight-16B-A3B-Instruct \ + --load_safetensors true \ + --save_safetensors true \ --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \ --load_from_cache_file true \ --split_dataset_ratio 0.01 \ diff --git a/examples/megatron/moe/moe.sh b/examples/megatron/moe/moe.sh index c344b188e1..1d2028d73f 100644 --- a/examples/megatron/moe/moe.sh +++ b/examples/megatron/moe/moe.sh @@ -3,7 +3,9 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ NPROC_PER_NODE=8 \ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ megatron sft \ - --load Qwen1.5-MoE-A2.7B-mcore \ + --model Qwen/Qwen1.5-MoE-A2.7B \ + --load_safetensors true \ + --save_safetensors true \ --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \ --load_from_cache_file true \ --split_dataset_ratio 0.01 \ diff --git a/examples/megatron/moe/qwen3_moe.sh b/examples/megatron/moe/qwen3_moe.sh index f017146bc2..97931b64d0 100644 --- a/examples/megatron/moe/qwen3_moe.sh +++ b/examples/megatron/moe/qwen3_moe.sh @@ -7,7 +7,9 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ NNODES=$WORLD_SIZE \ NODE_RANK=$RANK \ megatron sft \ - --load Qwen3-30B-A3B-Base-mcore \ + --model Qwen/Qwen3-30B-A3B-Base \ + --load_safetensors true \ + --save_safetensors true \ --dataset 
'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \ --load_from_cache_file true \ --split_dataset_ratio 0.01 \ diff --git a/examples/megatron/multi-node/node1.sh b/examples/megatron/multi-node/node1.sh index fe8899d67b..da6cdbcb6a 100644 --- a/examples/megatron/multi-node/node1.sh +++ b/examples/megatron/multi-node/node1.sh @@ -9,7 +9,9 @@ MASTER_ADDR=127.0.0.1 \ MASTER_PORT=29500 \ NPROC_PER_NODE=4 \ megatron sft \ - --load Qwen2.5-14B-mcore \ + --model Qwen/Qwen2.5-14B \ + --load_safetensors true \ + --save_safetensors true \ --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \ --load_from_cache_file true \ --split_dataset_ratio 0.01 \ diff --git a/examples/megatron/multi-node/node2.sh b/examples/megatron/multi-node/node2.sh index 975bf10fab..ce32ccdd61 100644 --- a/examples/megatron/multi-node/node2.sh +++ b/examples/megatron/multi-node/node2.sh @@ -6,7 +6,9 @@ MASTER_ADDR=xxx.xxx.xxx.xxx \ MASTER_PORT=29500 \ NPROC_PER_NODE=4 \ megatron sft \ - --load Qwen2.5-14B-mcore \ + --model Qwen/Qwen2.5-14B \ + --load_safetensors true \ + --save_safetensors true \ --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \ --load_from_cache_file true \ --split_dataset_ratio 0.01 \ diff --git a/examples/megatron/multimodal/dense/dpo.sh b/examples/megatron/multimodal/dense/dpo.sh index 6a6fb06142..851d42ea09 100644 --- a/examples/megatron/multimodal/dense/dpo.sh +++ b/examples/megatron/multimodal/dense/dpo.sh @@ -5,7 +5,9 @@ MAX_PIXELS=1003520 \ CUDA_VISIBLE_DEVICES=0,1,2,3 \ megatron rlhf \ --rlhf_type dpo \ - --load Qwen2.5-VL-7B-Instruct-mcore \ + --model Qwen/Qwen2.5-VL-7B-Instruct \ + --load_safetensors true \ + --save_safetensors true \ --dataset 'swift/RLAIF-V-Dataset#20000' \ --load_from_cache_file true \ --train_type full \ diff --git a/examples/megatron/multimodal/dense/full.sh b/examples/megatron/multimodal/dense/full.sh index 1e37e7d4b9..8834283f14 100644 --- a/examples/megatron/multimodal/dense/full.sh +++ b/examples/megatron/multimodal/dense/full.sh @@ -4,7 +4,9 @@ NPROC_PER_NODE=2 \ MAX_PIXELS=1003520 \ CUDA_VISIBLE_DEVICES=0,1 \ megatron sft \ - --load Qwen2.5-VL-7B-Instruct-mcore \ + --model Qwen/Qwen2.5-VL-7B-Instruct \ + --load_safetensors true \ + --save_safetensors true \ --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ --load_from_cache_file true \ --tensor_model_parallel_size 2 \ diff --git a/examples/megatron/multimodal/dense/lora.sh b/examples/megatron/multimodal/dense/lora.sh index b19854bf0d..92555218b5 100644 --- a/examples/megatron/multimodal/dense/lora.sh +++ b/examples/megatron/multimodal/dense/lora.sh @@ -4,7 +4,10 @@ NPROC_PER_NODE=2 \ MAX_PIXELS=1003520 \ CUDA_VISIBLE_DEVICES=0,1 \ megatron sft \ - --load Qwen2.5-VL-7B-Instruct-mcore \ + --model Qwen/Qwen2.5-VL-7B-Instruct \ + --load_safetensors true \ + --save_safetensors true \ + --merge_lora false \ --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ --load_from_cache_file true \ --train_type lora \ diff --git a/examples/megatron/multimodal/moe/full_dpo_offload.sh b/examples/megatron/multimodal/moe/full_dpo_offload.sh index f26adbd534..ff7efd0ce8 100644 --- a/examples/megatron/multimodal/moe/full_dpo_offload.sh +++ b/examples/megatron/multimodal/moe/full_dpo_offload.sh @@ -4,7 +4,9 @@ NPROC_PER_NODE=8 \ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ megatron rlhf \ --rlhf_type dpo \ - --load InternVL3_5-30B-A3B-mcore \ + --model OpenGVLab/InternVL3_5-30B-A3B \ + --load_safetensors true \ + --save_safetensors true \ --dataset 'swift/RLAIF-V-Dataset#20000' \ --load_from_cache_file true \ 
     --train_type full \
diff --git a/examples/megatron/multimodal/moe/lora.sh b/examples/megatron/multimodal/moe/lora.sh
index a79df366e4..bd78dd2ed6 100644
--- a/examples/megatron/multimodal/moe/lora.sh
+++ b/examples/megatron/multimodal/moe/lora.sh
@@ -3,7 +3,10 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NPROC_PER_NODE=2 \
 CUDA_VISIBLE_DEVICES=0,1 \
 megatron sft \
-    --load InternVL3_5-30B-A3B-mcore \
+    --model OpenGVLab/InternVL3_5-30B-A3B \
+    --load_safetensors true \
+    --save_safetensors true \
+    --merge_lora false \
     --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \
     --load_from_cache_file true \
     --train_type lora \
diff --git a/examples/megatron/pretrain.sh b/examples/megatron/pretrain.sh
index bd0a668ab5..ff0d897db9 100644
--- a/examples/megatron/pretrain.sh
+++ b/examples/megatron/pretrain.sh
@@ -3,7 +3,9 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NPROC_PER_NODE=4 \
 CUDA_VISIBLE_DEVICES=0,1,2,3 \
 megatron pt \
-    --load Qwen2.5-7B-mcore \
+    --model Qwen/Qwen2.5-7B \
+    --load_safetensors true \
+    --save_safetensors true \
     --dataset swift/chinese-c4 \
     --streaming true \
     --packing true \
diff --git a/examples/megatron/rlhf/dpo/dense.sh b/examples/megatron/rlhf/dpo/dense.sh
index c0db8a94cc..3c264bcb75 100644
--- a/examples/megatron/rlhf/dpo/dense.sh
+++ b/examples/megatron/rlhf/dpo/dense.sh
@@ -4,7 +4,9 @@ NPROC_PER_NODE=4 \
 CUDA_VISIBLE_DEVICES=0,1,2,3 \
 megatron rlhf \
     --rlhf_type dpo \
-    --load Qwen2.5-7B-Instruct-mcore \
+    --model Qwen/Qwen2.5-7B-Instruct \
+    --load_safetensors true \
+    --save_safetensors true \
     --dataset hjh0119/shareAI-Llama3-DPO-zh-en-emoji \
     --load_from_cache_file true \
     --split_dataset_ratio 0.01 \
diff --git a/examples/megatron/rlhf/dpo/moe.sh b/examples/megatron/rlhf/dpo/moe.sh
index d82575da05..173d6a2de5 100644
--- a/examples/megatron/rlhf/dpo/moe.sh
+++ b/examples/megatron/rlhf/dpo/moe.sh
@@ -4,7 +4,9 @@ NPROC_PER_NODE=8 \
 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
 megatron rlhf \
     --rlhf_type dpo \
-    --load Qwen3-30B-A3B-Instruct-2507-mcore \
+    --model Qwen/Qwen3-30B-A3B-Instruct-2507 \
+    --load_safetensors true \
+    --save_safetensors true \
     --dataset AI-ModelScope/orpo-dpo-mix-40k \
     --load_from_cache_file true \
     --split_dataset_ratio 0.01 \
diff --git a/examples/megatron/rlhf/dpo/packing.sh b/examples/megatron/rlhf/dpo/packing.sh
index 159dd4f97c..0bc02de37b 100644
--- a/examples/megatron/rlhf/dpo/packing.sh
+++ b/examples/megatron/rlhf/dpo/packing.sh
@@ -4,7 +4,9 @@ NPROC_PER_NODE=4 \
 CUDA_VISIBLE_DEVICES=0,1,2,3 \
 megatron rlhf \
     --rlhf_type dpo \
-    --load Qwen3-4B-Instruct-2507-mcore \
+    --model Qwen/Qwen3-4B-Instruct-2507 \
+    --load_safetensors true \
+    --save_safetensors true \
     --dataset 'AI-ModelScope/orpo-dpo-mix-40k' \
     --load_from_cache_file true \
     --split_dataset_ratio 0.01 \
diff --git a/examples/megatron/rlhf/kto/dense.sh b/examples/megatron/rlhf/kto/dense.sh
index cbcb1c63c4..3f0a016293 100644
--- a/examples/megatron/rlhf/kto/dense.sh
+++ b/examples/megatron/rlhf/kto/dense.sh
@@ -4,7 +4,9 @@ NPROC_PER_NODE=4 \
 CUDA_VISIBLE_DEVICES=0,1,2,3 \
 megatron rlhf \
     --rlhf_type kto \
-    --load Qwen2.5-7B-Instruct-mcore \
+    --model Qwen/Qwen2.5-7B-Instruct \
+    --load_safetensors true \
+    --save_safetensors true \
     --dataset 'AI-ModelScope/ultrafeedback-binarized-preferences-cleaned-kto#20000' \
     --load_from_cache_file true \
     --split_dataset_ratio 0.01 \
diff --git a/examples/megatron/rlhf/kto/moe.sh b/examples/megatron/rlhf/kto/moe.sh
index c44936ab40..ee63cc8bb7 100644
--- a/examples/megatron/rlhf/kto/moe.sh
+++ b/examples/megatron/rlhf/kto/moe.sh
@@ -4,7 +4,10 @@ NPROC_PER_NODE=2 \
 CUDA_VISIBLE_DEVICES=0,1 \
 megatron rlhf \
     --rlhf_type kto \
-    --load Qwen3-30B-A3B-Instruct-2507-mcore \
+    --model Qwen/Qwen3-30B-A3B-Instruct-2507 \
+    --load_safetensors true \
+    --save_safetensors true \
+    --merge_lora false \
     --dataset 'AI-ModelScope/ultrafeedback-binarized-preferences-cleaned-kto#20000' \
     --load_from_cache_file true \
     --packing true \
diff --git a/examples/megatron/rlhf/rm/dense.sh b/examples/megatron/rlhf/rm/dense.sh
index 85f6c6abd7..9db136b911 100644
--- a/examples/megatron/rlhf/rm/dense.sh
+++ b/examples/megatron/rlhf/rm/dense.sh
@@ -5,7 +5,9 @@ MAX_PIXELS=1003520 \
 CUDA_VISIBLE_DEVICES=0,1 \
 megatron rlhf \
     --rlhf_type rm \
-    --load Qwen2.5-VL-7B-Instruct-mcore \
+    --model Qwen/Qwen2.5-VL-7B-Instruct \
+    --load_safetensors true \
+    --save_safetensors true \
     --dataset 'swift/RLAIF-V-Dataset#20000' \
     --load_from_cache_file true \
     --split_dataset_ratio 0.01 \
diff --git a/examples/megatron/rlhf/rm/moe.sh b/examples/megatron/rlhf/rm/moe.sh
index 25e8dcf40f..c6e970af38 100644
--- a/examples/megatron/rlhf/rm/moe.sh
+++ b/examples/megatron/rlhf/rm/moe.sh
@@ -4,7 +4,10 @@ NPROC_PER_NODE=2 \
 CUDA_VISIBLE_DEVICES=0,1 \
 megatron rlhf \
     --rlhf_type rm \
-    --load Qwen3-30B-A3B-Instruct-2507-mcore \
+    --model Qwen/Qwen3-30B-A3B-Instruct-2507 \
+    --load_safetensors true \
+    --save_safetensors true \
+    --merge_lora false \
     --dataset 'AI-ModelScope/orpo-dpo-mix-40k' \
     --load_from_cache_file true \
     --packing true \
diff --git a/examples/megatron/seq_cls/full.sh b/examples/megatron/seq_cls/full.sh
index d7b7a413df..7576282ad9 100644
--- a/examples/megatron/seq_cls/full.sh
+++ b/examples/megatron/seq_cls/full.sh
@@ -3,7 +3,9 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NPROC_PER_NODE=4 \
 CUDA_VISIBLE_DEVICES=0,1,2,3 \
 megatron sft \
-    --load Qwen2.5-VL-7B-Instruct-mcore \
+    --model Qwen/Qwen2.5-VL-7B-Instruct \
+    --load_safetensors true \
+    --save_safetensors true \
     --dataset 'tany0699/garbage265#20000' \
     --load_from_cache_file true \
     --tensor_model_parallel_size 2 \
diff --git a/examples/megatron/seq_cls/lora/infer.sh b/examples/megatron/seq_cls/lora/infer.sh
index 6a97cece9a..9591c245b2 100644
--- a/examples/megatron/seq_cls/lora/infer.sh
+++ b/examples/megatron/seq_cls/lora/infer.sh
@@ -2,7 +2,7 @@
 # 60GiB
 CUDA_VISIBLE_DEVICES=0 \
 swift infer \
-    --model megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-hf \
+    --model megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx \
     --load_data_args true \
     --max_batch_size 16 \
     --attn_impl flash_attn \
diff --git a/examples/megatron/seq_cls/lora/mcore2hf.sh b/examples/megatron/seq_cls/lora/mcore2hf.sh
deleted file mode 100644
index 4a9ed86d1a..0000000000
--- a/examples/megatron/seq_cls/lora/mcore2hf.sh
+++ /dev/null
@@ -1,7 +0,0 @@
-CUDA_VISIBLE_DEVICES=0,1 \
-swift export \
-    --mcore_adapters megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx \
-    --to_hf true \
-    --torch_dtype bfloat16 \
-    --output_dir megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-hf \
-    --test_convert_precision true
diff --git a/examples/megatron/seq_cls/lora/train.sh b/examples/megatron/seq_cls/lora/train.sh
index fb343b4eab..4076158bb8 100644
--- a/examples/megatron/seq_cls/lora/train.sh
+++ b/examples/megatron/seq_cls/lora/train.sh
@@ -4,7 +4,10 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NPROC_PER_NODE=2 \
 CUDA_VISIBLE_DEVICES=0,1 \
 megatron sft \
-    --load Qwen3-30B-A3B-Instruct-2507-mcore \
+    --model Qwen/Qwen3-30B-A3B-Instruct-2507 \
+    --load_safetensors true \
+    --save_safetensors true \
+    --merge_lora true \
     --dataset 'DAMO_NLP/jd:cls' \
     --load_from_cache_file true \
     --split_dataset_ratio 0.01 \
diff --git a/examples/megatron/sft.sh b/examples/megatron/sft.sh
index 5174f398b6..e5047b8949 100644
--- a/examples/megatron/sft.sh
+++ b/examples/megatron/sft.sh
@@ -3,7 +3,9 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NPROC_PER_NODE=2 \
 CUDA_VISIBLE_DEVICES=0,1 \
 megatron sft \
-    --load Qwen2.5-7B-Instruct-mcore \
+    --model Qwen/Qwen2.5-7B-Instruct \
+    --load_safetensors true \
+    --save_safetensors true \
     --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
               'AI-ModelScope/alpaca-gpt4-data-en#500' \
               'swift/self-cognition#500' \
diff --git a/examples/models/qwen3_vl/mcore.sh b/examples/models/qwen3_vl/mcore.sh
index 8a16d0d5bf..5363c5e2af 100644
--- a/examples/models/qwen3_vl/mcore.sh
+++ b/examples/models/qwen3_vl/mcore.sh
@@ -6,7 +6,9 @@ NPROC_PER_NODE=8 \
 IMAGE_MAX_TOKEN_NUM=1024 \
 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
 megatron sft \
-    --load Qwen3-VL-235B-A22B-Instruct-mcore \
+    --model Qwen/Qwen3-VL-235B-A22B-Instruct \
+    --load_safetensors true \
+    --save_safetensors true \
     --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#20000' \
     --load_from_cache_file true \
     --split_dataset_ratio 0.01 \
diff --git a/examples/models/qwen3_vl/mcore_full.sh b/examples/models/qwen3_vl/mcore_full.sh
index f1877be83a..7130d0f349 100644
--- a/examples/models/qwen3_vl/mcore_full.sh
+++ b/examples/models/qwen3_vl/mcore_full.sh
@@ -7,7 +7,9 @@ IMAGE_MAX_TOKEN_NUM=1024 \
 VIDEO_MAX_TOKEN_NUM=128 \
 FPS_MAX_FRAMES=16 \
 megatron sft \
-    --load Qwen3-VL-30B-A3B-Instruct-mcore \
+    --model Qwen/Qwen3-VL-30B-A3B-Instruct \
+    --load_safetensors true \
+    --save_safetensors true \
     --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#10000' \
               'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \
               'swift/VideoChatGPT:Generic#2000' \
diff --git a/examples/train/flash_attention_3/mcore.sh b/examples/train/flash_attention_3/mcore.sh
index 5b3948d52c..96f4601409 100644
--- a/examples/train/flash_attention_3/mcore.sh
+++ b/examples/train/flash_attention_3/mcore.sh
@@ -8,7 +8,10 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
 NPROC_PER_NODE=4 \
 CUDA_VISIBLE_DEVICES=0,1,2,3 \
 megatron sft \
-    --load Qwen3-30B-A3B-Base-mcore \
+    --model Qwen/Qwen3-30B-A3B-Base \
+    --load_safetensors true \
+    --save_safetensors true \
+    --merge_lora false \
     --dataset 'swift/Chinese-Qwen3-235B-2507-Distill-data-110k-SFT' \
     --load_from_cache_file true \
     --train_type lora \
diff --git a/swift/cli/main.py b/swift/cli/main.py
index c5bee1dfcb..83423bcd5e 100644
--- a/swift/cli/main.py
+++ b/swift/cli/main.py
@@ -100,11 +100,9 @@ def cli_main(route_mapping: Optional[Dict[str, str]] = None, is_megatron: bool =
     torchrun_args = get_torchrun_args()
     prepare_config_args(argv)
     python_cmd = sys.executable
-    if not is_megatron and (torchrun_args is None or method_name not in {'pt', 'sft', 'rlhf', 'infer'}):
+    if torchrun_args is None or (not is_megatron and method_name not in {'pt', 'sft', 'rlhf', 'infer'}):
         args = [python_cmd, file_path, *argv]
     else:
-        if torchrun_args is None:
-            raise ValueError('Please set torchrun args: NPROC_PER_NODE, ...')
         args = [python_cmd, '-m', 'torch.distributed.run', *torchrun_args, file_path, *argv]
     print(f"run sh: `{' '.join(args)}`", flush=True)
     result = subprocess.run(args)
diff --git a/swift/llm/template/base.py b/swift/llm/template/base.py
index bb6136b9b4..1cbc66f8c8 100644
--- a/swift/llm/template/base.py
+++ b/swift/llm/template/base.py
@@ -402,7 +402,7 @@ def _embedding_encode(self, inputs: TemplateInputs) -> Dict[str, Any]:
         _all_negative_keys = set()
         for idx, negative in enumerate(inputs.negative):
-            _tmp_negative_keys = set()
+            _tmp_negative_keys = set()  # used to fill in missing keys
             negative_encoded = self._encode_truncated(negative)
             for key in negative_encoded:
                 negative_key = f'negative_{key}'
@@ -1562,7 +1562,7 @@ def _embedding_data_collator(self,
                 new_batch += [b]
             else:
                 keys = [key for key in b.keys() if 'negative' in key]
-                max_neg = None
+                max_neg = None  # number of negative samples
                 for key in keys:
                     value_list = b[key]
                     suffix = key[len('negative_'):]
diff --git a/swift/megatron/argument/megatron_args.py b/swift/megatron/argument/megatron_args.py
index eb17f37d9f..e4062c35c2 100644
--- a/swift/megatron/argument/megatron_args.py
+++ b/swift/megatron/argument/megatron_args.py
@@ -313,8 +313,8 @@ class ExtraMegatronArguments(RLHFMegatronArgumentsMixin, MegatronTunerMixin):
     # mcore-bridge
     model: Optional[str] = None
     model_type: Optional[str] = None
-    load_safetensors: bool = False
-    save_safetensors: bool = False
+    load_safetensors: Optional[bool] = None
+    save_safetensors: bool = True
     adapters: List[str] = field(default_factory=list)
     ref_model: Optional[str] = None
     ref_adapters: List[str] = field(default_factory=list)
@@ -720,6 +720,8 @@ def __post_init__(self):
         if self.save_strategy == 'epoch':
             self.save_interval = 1
             self.eval_interval = 1
+        if isinstance(self.ref_adapters, str):
+            self.ref_adapters = [self.ref_adapters]
         if self.eval_interval is None:
             self.eval_interval = self.save_interval
         if self.seq_length is None:
diff --git a/swift/megatron/argument/train_args.py b/swift/megatron/argument/train_args.py
index fd72353f33..636b265b6d 100644
--- a/swift/megatron/argument/train_args.py
+++ b/swift/megatron/argument/train_args.py
@@ -47,7 +47,6 @@ def __post_init__(self):
         if self.tensorboard_dir is None and self.save is not None:
             self.tensorboard_dir = f'{self.save}/runs'
         self.tensorboard_dir = to_abspath(self.tensorboard_dir)
-        if self.load is None and self.no_initialization and not self.load_safetensors:
-            raise ValueError('You did not pass `--load` or `--load_safetensors true` to read directly '
-                             'from safetensors weights, so you need to set `--no_initialization false` '
-                             'to allow the model to initialize weights properly.')
+        if self.load is None and self.model is None and self.no_initialization:
+            raise ValueError('You did not pass `--load/--model` to read weights, so you need to set '
+                             '`--no_initialization false` to allow the model to initialize weights properly.')
diff --git a/swift/megatron/train/sft.py b/swift/megatron/train/sft.py
index ded3807938..3aab447c26 100644
--- a/swift/megatron/train/sft.py
+++ b/swift/megatron/train/sft.py
@@ -40,7 +40,7 @@ def __init__(self, args: Optional[Union[List[str], MegatronTrainArguments]] = No
         if args.attention_backend != 'local':
             # MindSpeed requires passing `use_flash_attn` to Megatron
             # to enable flash attention on Ascend NPU.
-            self.args.use_flash_attn = True
+            args.use_flash_attn = True
         megatron_args = asdict(self.args)
         repatch(megatron_args)
         template_cls = TEMPLATE_MAPPING[args.template].template_cls
@@ -49,7 +49,7 @@ def __init__(self, args: Optional[Union[List[str], MegatronTrainArguments]] = No
         else:
             kwargs = {'load_model': False}
         with torch.device('meta'):
-            self.model, self.processor = args.get_model_processor(**kwargs, download_model=args.load_safetensors)
+            self.model, self.processor = args.get_model_processor(**kwargs, download_model=args.load is None)
         self._prepare_template()
         args.init_model_args(self.tokenizer, self.processor.model_info.config)
         args.save_args(args.save)
diff --git a/swift/megatron/trainers/base.py b/swift/megatron/trainers/base.py
index bbadcb07b1..c3c08a1ed8 100644
--- a/swift/megatron/trainers/base.py
+++ b/swift/megatron/trainers/base.py
@@ -16,7 +16,7 @@
 from megatron.core import mpu
 from megatron.core.datasets.utils import Split
 from megatron.core.enums import ModelType
-from megatron.core.num_microbatches_calculator import get_num_microbatches
+from megatron.core.num_microbatches_calculator import get_num_microbatches, update_num_microbatches
 from megatron.core.optimizer import _update_min_and_max_lr_in_param_groups
 from megatron.core.pipeline_parallel import get_forward_backward_func
 from megatron.core.rerun_state_machine import RerunMode, get_rerun_state_machine
@@ -27,7 +27,7 @@
 from megatron.training import (checkpointing, ft_integration, get_args, get_model, get_tensorboard_writer,
                                get_timers, get_wandb_writer, initialize, is_last_rank, one_logger_utils, pretrain,
                                print_rank_0, print_rank_last, training)
-from megatron.training.checkpointing import load_checkpoint
+from megatron.training.checkpointing import check_checkpoint_args, load_checkpoint, set_checkpoint_version
 from megatron.training.dist_signal_handler import DistributedSignalHandler
 from megatron.training.theoretical_memory_usage import report_theoretical_memory
 from megatron.training.training import num_floating_point_operations
@@ -434,25 +434,67 @@ def _patch_get_param_groups(self):
             finally:
                 optimizer._get_param_groups = _get_param_groups

+    def _load_iteration(self):
+        args = self.args
+        ckpt_dir = None
+        if args.train_type == 'full':
+            ckpt_dir = args.model
+        elif args.train_type == 'lora' and args.adapters:
+            ckpt_dir = args.adapters[0]
+        if ckpt_dir is None:
+            return 0, 0
+        logger.info(f'checkpoint_dir: {ckpt_dir}')
+        iteration_path = os.path.join(ckpt_dir, 'latest_checkpointed_iteration.txt')
+        if not os.path.exists(iteration_path):
+            return 0, 0
+        with open(iteration_path, 'r') as f:
+            iteration = int(f.read())
+
+        common_path = os.path.join(ckpt_dir, f'iter_{iteration:07d}', 'common.pt')
+        if not os.path.exists(common_path):
+            return iteration, 0
+
+        state_dict = torch.load(common_path)
+        set_checkpoint_version(state_dict.get('checkpoint_version', 0))
+        num_floating_point_operations_so_far = state_dict.get('num_floating_point_operations_so_far', 0)
+        if 'args' in state_dict and not args.finetune:
+            checkpoint_args = state_dict['args']
+            check_checkpoint_args(checkpoint_args)
+            args.consumed_train_samples = getattr(checkpoint_args, 'consumed_train_samples', 0)
+            args.skipped_train_samples = getattr(checkpoint_args, 'skipped_train_samples', 0)
+            update_num_microbatches(consumed_samples=args.consumed_train_samples, verbose=True)
+            args.consumed_valid_samples = getattr(checkpoint_args, 'consumed_valid_samples', 0)
+        else:
+            print_rank_0('could not find arguments in the checkpoint ...')
+
+        return iteration, num_floating_point_operations_so_far
+
     def setup_model_and_optimizer(self, model_provider_func, model_type, *_args, **kwargs):
         args = get_args()

         def new_model_provider_func(*_args, **kwargs):
             model = model_provider_func(*_args, **kwargs)
-            if args.load_safetensors:
+            if args.load is None:
                 self.bridge.load_weights(model, args.model_dir)
             self.unwrapped_models.append(model)
             peft_model = prepare_mcore_model(model)
-            if args.load_safetensors and args.train_type == 'lora':
-                for adapters, name in [(args.adapters, 'default'), (args.ref_adapters, 'ref_adapter')]:
-                    if adapters:
-                        assert len(adapters) == 1, 'Currently only support one adapter.'
-                        self.bridge.load_weights(model, adapters[0], is_peft_format=True, adapter_name=name)
+            if args.train_type == 'lora':
+                if args.adapters and args.adapter_load is None:
+                    assert len(args.adapters) == 1, 'Currently only support one adapter.'
+                    self.bridge.load_weights(model, args.adapters[0], is_peft_format=True, adapter_name='default')
+                if args.ref_adapters and args.ref_adapter_load is None:
+                    assert len(args.ref_adapters) == 1, 'Currently only support one adapter.'
+                    self.bridge.load_weights(
+                        model, args.ref_adapters[0], is_peft_format=True, adapter_name='ref_adapter')
+
             self.peft_models.append(peft_model)
             return model

         self._init_multimodal_full()
+        # read iteration
+        if not args.finetune:
+            args.iteration, args.num_floating_point_operations_so_far = self._load_iteration()
         with self._patch_load_state_dict(self._load_base_checkpoint), self._patch_get_param_groups():
             model, optimizer, opt_param_scheduler = self._origin_setup_model_and_optimizer(
                 new_model_provider_func, model_type, *_args, **kwargs)
@@ -465,8 +507,7 @@ def new_model_provider_func(*_args, **kwargs):
                     copy_original_module_weight(m)
         if args.ref_adapter_load is not None:
             with self._patch_load_state_dict(self._load_adapter_base_checkpoint):
-                args.iteration, args.num_floating_point_operations_so_far = load_checkpoint(
-                    model, optimizer, opt_param_scheduler, load_arg='ref_adapter_load', strict=False)
+                load_checkpoint(model, optimizer, opt_param_scheduler, load_arg='ref_adapter_load', strict=False)
         if args.adapter_load is not None:
             with adapter_state_dict_context():
                 args.iteration, args.num_floating_point_operations_so_far = load_checkpoint(
@@ -957,23 +998,37 @@ def unmerge_lora_adapters(self):
                 # Unmerge to restore separate LoRA weights for training
                 module.unmerge()

-    def save_checkpoint(self, iteration, *_args, **kwargs):
+    @staticmethod
+    def _copy_args(output_dir):
+        if is_last_rank():
+            args_path = os.path.join(os.path.dirname(output_dir), 'args.json')
+            if os.path.exists(args_path):
+                shutil.copy(args_path, os.path.join(output_dir, 'args.json'))
+
+    def save_checkpoint(self, iteration, model, *_args, **kwargs):
         args = get_args()
-        if args.train_type == 'lora' and args.merge_lora:
-            self.merge_lora_adapters()
+        output_dir = os.path.join(args.save, f'checkpoint-{iteration}')
+        os.makedirs(output_dir, exist_ok=True)
+        origin_save = args.save
+        args.save = output_dir
+        self._copy_args(output_dir)
         save_peft_format = args.train_type == 'lora' and not args.merge_lora
+        if args.save_safetensors and args.no_save_optim:
+            model = []
+        with adapter_state_dict_context(is_peft_format=args.train_type == 'lora'):
+            self._origin_save_checkpoint(iteration, model, *_args, **kwargs)
+        args.save = origin_save
+        # safetensors
         if args.save_safetensors:
-            output_dir = os.path.join(args.save, f'checkpoint-{iteration}')
+            # merge-lora does not store lora weights separately; saving lora may report an error (Qwen3-VL-Moe)
+            if args.train_type == 'lora' and args.merge_lora:
+                self.merge_lora_adapters()
+                output_dir = f'{output_dir}-merged'
+                os.makedirs(output_dir, exist_ok=True)
+                self._copy_args(output_dir)
             self.bridge.save_weights(self.unwrapped_models, output_dir, is_peft_format=save_peft_format)
-            if is_last_rank():
-                args_path = os.path.join(os.path.dirname(output_dir), 'args.json')
-                if os.path.exists(args_path):
-                    shutil.copy(args_path, os.path.join(output_dir, 'args.json'))
-        else:
-            with adapter_state_dict_context(is_peft_format=save_peft_format):
-                self._origin_save_checkpoint(iteration, *_args, **kwargs)
-        if args.train_type == 'lora' and args.merge_lora:
-            self.unmerge_lora_adapters()
+            if args.train_type == 'lora' and args.merge_lora:
+                self.unmerge_lora_adapters()

     def _patch_megatron(self):
         # support max_epochs
diff --git a/swift/megatron/trainers/rlhf_mixin.py b/swift/megatron/trainers/rlhf_mixin.py
index f8393e6c8e..560194f69b 100644
--- a/swift/megatron/trainers/rlhf_mixin.py
+++ b/swift/megatron/trainers/rlhf_mixin.py
@@ -23,16 +23,15 @@ def setup_model_and_optimizer(self, model_provider_func, model_type, *_args, **k
         if args.train_type == 'full' and args.rlhf_type != 'rm':
             ref_models = get_model(model_provider_func, model_type, wrap_with_ddp=False)
             args.ref_model = args.ref_model or args.model
+            if args.ref_load is None:
+                args.ref_load = args.load
             for m in ref_models:
                 m = unwrap_model(m)
-                if args.load_safetensors:
+                if args.ref_load is None:
                     self.bridge.load_weights(m, args.ref_model)
                 m.requires_grad_(False).eval()
-            if args.ref_load is None:
-                args.ref_load = args.load
             if args.ref_load:
-                args.iteration, args.num_floating_point_operations_so_far = load_checkpoint(
-                    ref_models, None, None, load_arg='ref_load')
+                load_checkpoint(ref_models, None, None, load_arg='ref_load')
         self.ref_models = ref_models
         return super().setup_model_and_optimizer(model_provider_func, model_type, *_args, **kwargs)
diff --git a/swift/megatron/utils/utils.py b/swift/megatron/utils/utils.py
index 4221e70662..afd122415a 100644
--- a/swift/megatron/utils/utils.py
+++ b/swift/megatron/utils/utils.py
@@ -208,6 +208,8 @@ def adapter_state_dict_context(is_peft_format: bool = True):

     def generate_state_dict(args, model, *_args, **kwargs):
         state_dict = _origin_generate_state_dict(args, model, *_args, **kwargs)
+        if 'model' not in state_dict:
+            return state_dict
         new_state_dict = {}
         state_dict_model = state_dict['model']
         for n, p in model[0].named_parameters():
diff --git a/tests/megatron/export/test_export.py b/tests/megatron/export/test_export.py
index 0071ff4c77..d1452db5bd 100644
--- a/tests/megatron/export/test_export.py
+++ b/tests/megatron/export/test_export.py
@@ -28,7 +28,7 @@ def test_peft_to_mcore():
     megatron_export_main(
         MegatronExportArguments(
             model='Qwen/Qwen3-30B-A3B',
-            adapters=['megatron_output/Qwen3-30B-A3B/vx-xxx-hf'],
+            adapters=['megatron_output/Qwen3-30B-A3B/vx-xxx/checkpoint-xxx-hf'],
             merge_lora=False,
             to_mcore=True,
             exist_ok=True,
@@ -41,7 +41,7 @@ def test_peft_to_hf():
     megatron_export_main(
         MegatronExportArguments(
             load='Qwen3-30B-A3B-mcore',
-            adapter_load='megatron_output/Qwen3-30B-A3B/vx-xxx',
+            adapter_load='megatron_output/Qwen3-30B-A3B/vx-xxx/checkpoint-xxx',
             merge_lora=False,
             to_hf=True,
             exist_ok=True,
diff --git a/tests/megatron/test_lora.py b/tests/megatron/test_lora.py
index 49214d22ab..c3ca33c523 100644
--- a/tests/megatron/test_lora.py
+++ b/tests/megatron/test_lora.py
@@ -57,11 +57,12 @@ def test_moe():

 def test_convert():
     from swift.llm import export_main, ExportArguments
-    export_main(ExportArguments(
-        mcore_adapters=['megatron_output/vx-xxx'],
-        to_hf=True,
-        test_convert_precision=True,
-    ))
+    export_main(
+        ExportArguments(
+            mcore_adapters=['megatron_output/vx-xxx/checkpoint-xxx'],
+            to_hf=True,
+            test_convert_precision=True,
+        ))


 def test_embedding():