Merged

53 commits
495c427
update
Jintao-Huang Dec 9, 2025
615b518
Merge branch 'main' into update_megatron_shells
Jintao-Huang Dec 9, 2025
f4145ca
update
Jintao-Huang Dec 9, 2025
74c0466
update
Jintao-Huang Dec 10, 2025
3783cb6
update
Jintao-Huang Dec 10, 2025
9d986ca
Merge branch 'main' into update_megatron_shells
Jintao-Huang Dec 10, 2025
981394c
Merge branch 'main' into update_megatron_shells
Jintao-Huang Dec 11, 2025
eb93cb6
update
Jintao-Huang Dec 11, 2025
9ec0a3a
update
Jintao-Huang Dec 12, 2025
331b911
update
Jintao-Huang Dec 12, 2025
6cda03c
Merge branch 'main' into update_megatron_shells
Jintao-Huang Dec 15, 2025
108c37f
fix
Jintao-Huang Dec 15, 2025
a0aec59
update
Jintao-Huang Dec 15, 2025
418d6a1
fix
Jintao-Huang Dec 15, 2025
25ffe2a
update
Jintao-Huang Dec 15, 2025
f52cb46
fix null_ref_context
Jintao-Huang Dec 15, 2025
128e209
fix
Jintao-Huang Dec 15, 2025
af4dac2
Merge remote-tracking branch 'origin/fix_null_ref_context' into updat…
Jintao-Huang Dec 15, 2025
504e9d7
fix
Jintao-Huang Dec 15, 2025
80ab8cb
Merge branch 'main' into update_megatron_shells
Jintao-Huang Dec 15, 2025
826e563
Merge branch 'main' into update_megatron_shells
Jintao-Huang Dec 15, 2025
b79c051
Merge branch 'main' into update_megatron_shells
Jintao-Huang Dec 15, 2025
3e901d6
update
Jintao-Huang Dec 15, 2025
0ac3fb0
fix
Jintao-Huang Dec 15, 2025
d4c4bc4
fix
Jintao-Huang Dec 15, 2025
0ac3c1b
fix
Jintao-Huang Dec 15, 2025
8bf48dc
update
Jintao-Huang Dec 15, 2025
409e743
Merge branch 'main' into update_megatron_shells
Jintao-Huang Dec 15, 2025
cae166c
update
Jintao-Huang Dec 15, 2025
7505657
update
Jintao-Huang Dec 15, 2025
94c8756
fix
Jintao-Huang Dec 15, 2025
9066c72
fix megatron seq_cls bridge
Jintao-Huang Dec 15, 2025
ef9fa52
fix
Jintao-Huang Dec 16, 2025
d24bc27
fix
Jintao-Huang Dec 16, 2025
0b781bb
fix
Jintao-Huang Dec 16, 2025
8f9793f
update
Jintao-Huang Dec 16, 2025
319ba73
Merge branch 'fix_megatron_seq_cls_bridge' into update_megatron_shells
Jintao-Huang Dec 16, 2025
a925f63
update
Jintao-Huang Dec 16, 2025
d852c37
update
Jintao-Huang Dec 16, 2025
58f3af0
fix
Jintao-Huang Dec 16, 2025
1ad352b
Merge branch 'main' into update_megatron_shells
Jintao-Huang Dec 16, 2025
4f57472
update
Jintao-Huang Dec 16, 2025
337d1cd
fix
Jintao-Huang Dec 16, 2025
64f1c85
update
Jintao-Huang Dec 16, 2025
25a1484
update
Jintao-Huang Dec 16, 2025
2fc6653
fix
Jintao-Huang Dec 16, 2025
5b98c54
fix
Jintao-Huang Dec 16, 2025
6a06731
fix swift main
Jintao-Huang Dec 16, 2025
7e23b50
fix
Jintao-Huang Dec 16, 2025
f1f51e4
update
Jintao-Huang Dec 16, 2025
f9c1804
Merge remote-tracking branch 'origin/fix_swift_main' into update_mega…
Jintao-Huang Dec 16, 2025
210b554
fix
Jintao-Huang Dec 16, 2025
9562a58
fix
Jintao-Huang Dec 16, 2025
6 changes: 4 additions & 2 deletions docs/source/BestPractices/Qwen3-Best-Practice.md
@@ -330,7 +330,7 @@ Best practice reference for Qwen3-235B-A22B-Instruct-250718 LoRA training on a single node with 8x H20 GPUs:

ms-swift introduces Megatron parallelism techniques to accelerate CPT/SFT/DPO/GRPO for large models. The supported models can be found in the [supported models documentation](../Instruction/Supported-models-and-datasets.md).

For environment setup and conversion between HF and MCore model weights, refer to the [Megatron-SWIFT training documentation](../Megatron-SWIFT/Quick-start.md).
For environment setup, refer to the [Megatron-SWIFT training documentation](../Megatron-SWIFT/Quick-start.md).

We will use Alibaba Cloud DLC to launch training. The training environment consists of 2 nodes, each equipped with 8x 80GiB A800 GPUs. For more information on multi-node launch methods, refer to [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node).

@@ -340,7 +340,9 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
megatron sft \
--load Qwen3-30B-A3B-Base-mcore \
--model Qwen/Qwen3-30B-A3B-Base \
--load_safetensors true \
--save_safetensors true \
--dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \
--load_from_cache_file true \
--split_dataset_ratio 0.01 \
1 change: 1 addition & 0 deletions docs/source/Instruction/Supported-models-and-datasets.md
@@ -7,6 +7,7 @@
- Model Type: the model type
- Default Template: the default chat template
- Requires: additional dependencies required to use this model
- Support Megatron: whether Megatron-SWIFT training is supported
- Tags: the model's tags


2 changes: 2 additions & 0 deletions docs/source/Megatron-SWIFT/Ascend.md
@@ -1,5 +1,7 @@
# Ascend NPU

For environment setup of Megatron-SWIFT on Ascend NPU, please refer to the [NPU best practices](../BestPractices/NPU-support.md).

## NPU Performance Data Collection

NPU performance data is collected via the `torch_npu.profiler.profile` interface: create a torch_npu.profiler.profile object and control collection through its start and stop interfaces. Collection requires modifying the Megatron source code that ms-swift depends on, specifically the train function in the Megatron-LM/megatron/training/training.py file. A collection example is shown below:
10 changes: 6 additions & 4 deletions docs/source/Megatron-SWIFT/Command-line-parameters.md
@@ -85,9 +85,10 @@
- Tip: you can set this to a very large value so that only the last checkpoint is kept.
- 🔥no_save_optim: do not save the optimizer; defaults to False. For full-parameter training, this can significantly reduce saving time.
- 🔥no_save_rng: do not save the RNG state; defaults to False.
- 🔥load: the checkpoint directory to load; defaults to None.
- 🔥load: the checkpoint directory to load; defaults to None. For an introduction to resuming training from a checkpoint, see the description of the `--finetune` parameter.
- Note: if the weights were not converted with the `swift export` command provided by ms-swift, you additionally need to set `--model <hf-repo>` so that the `config.json` configuration file can be loaded.
- For an introduction to resuming training from a checkpoint, see the description of the `--finetune` parameter.
- Note: in "ms-swift>3.10", directly loading and saving safetensors weights is supported; see the [Mcore-Bridge documentation](./Mcore-Bridge.md).
- Difference between `--model` and `--load`: `--model/--adapters/--ref_model/--ref_adapters` take a safetensors weight directory, while `--load/--adapter_load/--ref_load/--ref_adapter_load` take an mcore weight directory. `--model/--adapters` cannot load the state needed to resume training, so in "ms-swift>=3.12", if `--no_save_optim false` is set, an additional copy of the weights in mcore format is stored for resuming, and you need to use `--load/--adapter_load` to load the resumed-training state (a minimal sketch follows the Mcore-Bridge parameters below).
- 🔥no_load_optim: do not load the optimizer; defaults to False.
- Note: when resuming training from a checkpoint, setting `--no_load_optim false` to read the optimizer state usually consumes more GPU memory than `--no_load_optim true`, which skips it.
- 🔥no_load_rng: do not load the RNG state; defaults to False.
@@ -268,8 +269,9 @@ LoRA training:
- use_rslora: defaults to `False`; whether to use `RS-LoRA`.

**Mcore-Bridge Parameters**
- 🔥load_safetensors: defaults to False; whether to load weights directly from safetensors.
- 🔥save_safetensors: defaults to False; whether to save weights directly in safetensors format. Note that if this parameter is set to True, resume-training content such as optimizer weights and RNG states will not be saved.
- 🔥load_safetensors: this parameter no longer takes effect in "ms-swift>=3.12" (in earlier versions it defaulted to False); weights are loaded by priority instead: if `--load` does not exist, the safetensors weights given by `--model` are loaded; the same applies to `--adapters` and `--adapter_load`.
- Note: in "ms-swift>=3.12", this parameter is kept for shell-script compatibility but no longer has any effect.
- 🔥save_safetensors: defaults to True; whether to save weights directly in safetensors format. In "ms-swift>=3.12", this parameter also supports saving resume-training content such as optimizer weights and RNG states (an additional copy of the weights is stored in mcore format), controlled via `--no_save_optim` and `--no_save_rng`. When resuming training, use the `--load/--adapter_load` parameters to load the mcore-format weights; see the sketch below.
- model: the model_id or model_path of the safetensors weights. Defaults to None.
- model_type: the model type. See the [ms-swift command-line parameters documentation](../Instruction/Command-line-parameters.md) for details.
- adapters: the adapter_id or adapter_path of LoRA incremental weights in safetensors format. Defaults to `[]`.
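
The following is a minimal, hypothetical sketch contrasting a fresh start from safetensors weights via `--model` with resuming from the additionally saved mcore-format checkpoint via `--load`. The paths are illustrative and only flags documented on this page are used:

```shell
# Fresh training: load safetensors weights via --model; with --no_save_optim false,
# an extra mcore-format copy of each checkpoint is stored for later resuming.
megatron sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
    --save megatron_output/Qwen2.5-7B-Instruct \
    --save_safetensors true \
    --no_save_optim false \
    --no_save_rng false

# Resuming: point --load at the mcore-format checkpoint directory (hypothetical path)
# and read back the optimizer/RNG states with --no_load_optim/--no_load_rng set to false.
megatron sft \
    --load megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
    --save megatron_output/Qwen2.5-7B-Instruct \
    --no_load_optim false \
    --no_load_rng false
```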
142 changes: 125 additions & 17 deletions docs/source/Megatron-SWIFT/LoRA-Training.md
@@ -4,20 +4,35 @@ Best practice reference for Qwen3-235B-A22B-Instruct-250718 LoRA training on a single node with 8x H20 GPUs:

For environment setup, refer to the Megatron-SWIFT [Quick Start documentation](./Quick-start.md).

## HF to Mcore Conversion
## Traditional Approach

### HF to Mcore Conversion

Below, we describe weight conversion using the `swift export` and `megatron export` commands, respectively. Compared with `swift export`, `megatron export` supports multi-node conversion and conversion of LoRA incremental weights, but it is also more involved: parallelism parameters such as `--tensor_model_parallel_size` and `--export_model_parallel_size` must be specified additionally at export time; see the [Mcore-Bridge documentation](./Mcore-Bridge.md) for details. To use the `swift export` command, refer to the [Quick Start documentation](./Quick-start.md).

The conversion procedure is the same as for full-parameter training; the script is as follows:
```shell
CUDA_VISIBLE_DEVICES=0 \
swift export \
# megatron export
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron export \
--model Qwen/Qwen2.5-7B-Instruct \
--tensor_model_parallel_size 2 \
--to_mcore true \
--torch_dtype bfloat16 \
--output_dir Qwen2.5-7B-Instruct-mcore \
--save Qwen2.5-7B-Instruct-mcore \
--test_convert_precision true

# swift export
# CUDA_VISIBLE_DEVICES=0 \
# swift export \
# --model Qwen/Qwen2.5-7B-Instruct \
# --to_mcore true \
# --torch_dtype bfloat16 \
# --output_dir Qwen2.5-7B-Instruct-mcore \
# --test_convert_precision true
```

## LoRA Training
### LoRA Training

Training script:
```bash
@@ -28,6 +43,7 @@ NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
--load Qwen2.5-7B-Instruct-mcore \
--save_safetensors false \
--dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
'AI-ModelScope/alpaca-gpt4-data-en#500' \
'swift/self-cognition#500' \
@@ -61,28 +77,120 @@ megatron sft \
```
- For LoRA training scripts for MoE models, refer to [here](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/lora).

## MCore to HF Conversion
### MCore to HF Conversion

```bash
CUDA_VISIBLE_DEVICES=0 \
swift export \
--mcore_adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx \
# megatron export
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron export \
--adapter_load megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \
--to_hf true \
--tensor_model_parallel_size 2 \
--merge_lora false \
--torch_dtype bfloat16 \
--output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx-hf \
--save megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \
--test_convert_precision true

# swift export
# CUDA_VISIBLE_DEVICES=0 \
# swift export \
# --mcore_adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \
# --to_hf true \
# --torch_dtype bfloat16 \
# --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \
# --test_convert_precision true
```
- Note: the `--adapter_load/--mcore_adapters` folder contains an `args.json` file; during conversion, the `--model/--mcore_model` and LoRA-related parameter information is read from this file. `swift export` does not yet support converting LoRA incremental weights. With `megatron export`, you can use the `--merge_lora` parameter to control whether the weights are merged.

### Inference
```shell
# For full-parameter weights, replace `--adapters` with `--model`
CUDA_VISIBLE_DEVICES=0 \
swift infer \
--adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \
--stream true
```
- Note: the `mcore_adapters` folder contains an `args.json` file; during conversion, the `mcore_model` and LoRA-related parameter information in this file is read, `mcore_model` and `mcore_adapters` are merge-lora'd into complete weights, and the result is finally converted to HF-format weights. (Conversion of LoRA incremental weights is not yet supported.)

## Merge-LoRA
### Merge-LoRA

If you only want to merge-lora without converting to HF-format weights, for example for subsequent DPO training, you can use the following script:
```shell
CUDA_VISIBLE_DEVICES=0 \
swift export \
--mcore_adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx \
# megatron export
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron export \
--adapter_load megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \
--tensor_model_parallel_size 2 \
--to_mcore true \
--merge_lora true \
--torch_dtype bfloat16 \
--output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx-mcore \
--save megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-mcore \
--test_convert_precision true

# swift export
# CUDA_VISIBLE_DEVICES=0 \
# swift export \
# --mcore_adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \
# --to_mcore true \
# --torch_dtype bfloat16 \
# --output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-mcore \
# --test_convert_precision true
```

## Mcore-Bridge (Recommended)

### Training

```shell
# full: 2 * 70GiB 0.61s/it
# lora: 2 * 14GiB 0.45s/it
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
--model Qwen/Qwen2.5-7B-Instruct \
--load_safetensors true \
--save_safetensors true \
--merge_lora false \
--dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
'AI-ModelScope/alpaca-gpt4-data-en#500' \
'swift/self-cognition#500' \
--train_type lora \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--tensor_model_parallel_size 2 \
--sequence_parallel true \
--micro_batch_size 16 \
--global_batch_size 16 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-4 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-5 \
--max_epochs 1 \
--save megatron_output/Qwen2.5-7B-Instruct \
--save_interval 100 \
--max_length 2048 \
--system 'You are a helpful assistant.' \
--num_workers 4 \
--no_save_optim true \
--no_save_rng true \
--dataset_num_proc 4 \
--model_author swift \
--model_name swift-robot
```

### Inference

```shell
# For full-parameter weights, replace `--adapters` with `--model`
CUDA_VISIBLE_DEVICES=0 \
swift infer \
--adapters megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \
--stream true
```
14 changes: 7 additions & 7 deletions docs/source/Megatron-SWIFT/Mcore-Bridge.md
@@ -192,7 +192,7 @@ swift infer \

Tip: if you run into GPU OOM issues during vLLM weight updates, you can set `--offload_bridge true` to offload tensors to the CPU and reduce GPU memory usage.

## Export and Conversion Precision Testing
## `megatron export` and Conversion Precision Testing

In addition to converting and saving safetensors weights during training, Mcore-Bridge also provides the `megatron export` command for standalone weight export. `megatron export` can test the conversion precision during weight conversion, which is very helpful for verifying correctness when integrating a new model. Models already integrated into Megatron-SWIFT usually do not exhibit precision misalignment, so you can safely set `--test_convert_precision false`.
- Tip: for multimodal models, pay attention to the `mean_diff (with loss)` field; `mean_diff` shows a large diff because it includes image tokens, for which no loss is computed.
@@ -235,8 +235,8 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
megatron export \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx \
--save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-lora \
--adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx \
--save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-lora \
--merge_lora false \
--to_hf true \
--tensor_model_parallel_size 2 \
@@ -251,8 +251,8 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
megatron export \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--adapters megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-lora \
--save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-mcore \
--adapters megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-lora \
--save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-mcore \
--merge_lora false \
--to_mcore true \
--tensor_model_parallel_size 2 \
@@ -268,8 +268,8 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
megatron export \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx \
--save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-merged \
--adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx \
--save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx-merged \
--merge_lora true \
--to_mcore true \
--tensor_model_parallel_size 2 \