Help: problems training Qwen3-Next with Megatron-SWIFT (OOM / slow training / shape mismatch) #5851

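I am running PT (pre-training) on Qwen3-Next-80B-A3B-Instruct with 32 A100 GPUs and max_length=4096, and I am hitting OOM (none of the parallelism combinations I tried resolves it) as well as a shape mismatch (when context_parallel > 1 is enabled). Is something misconfigured? Training the dense Qwen2.5-72B with Megatron-SWIFT under the same configuration worked without problems, so I would not expect the 80B model to run out of memory either.
1. Training script: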
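
For reference, a rough back-of-the-envelope check of that expectation. This is only a sketch: it assumes bf16 weights with fp32 gradients and fp32 Adam states (about 18 bytes per parameter), and it assumes all model states are split perfectly evenly across the 32 GPUs, which is an idealized lower bound rather than what any particular parallel layout achieves. The byte count and all variable names are assumptions for illustration, not values taken from ms-swift or Megatron-LM.

```bash
# ASSUMPTION: 2 (bf16 weight) + 4 (fp32 grad) + 4 (fp32 master weight)
#             + 4 + 4 (Adam moments) = 18 bytes per parameter.
BYTES_PER_PARAM=18
GPUS=32   # 32 A100s, as stated in the post

# Parameter counts in billions, taken from the model names.
PARAMS_B_NEXT=80
PARAMS_B_DENSE=72

# Idealized per-GPU model-state footprint in MB (1 GB = 1000 MB here),
# ignoring activations, buffers and fragmentation.
PER_GPU_MB_80B=$(( PARAMS_B_NEXT * 1000 * BYTES_PER_PARAM / GPUS ))
PER_GPU_MB_72B=$(( PARAMS_B_DENSE * 1000 * BYTES_PER_PARAM / GPUS ))

echo "80B: ${PER_GPU_MB_80B} MB per GPU, 72B: ${PER_GPU_MB_72B} MB per GPU"
```

Under these assumptions the two models differ by only a few GB of model state per GPU, so any large gap in practice would have to come from how the states are actually sharded or from activation memory. Whether either figure fits also depends on the A100 memory size, which is not stated here.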
```bash
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NCCL_DEBUG=WARN \
MODELSCOPE_CACHE="/mnt/.cache/datasets/train" \
MEGATRON_LM_PATH="/mnt/3rd_parties/megatron-packs/Megatron-LM" \
NPROC_PER_NODE=8 \
NNODES=$MLP_WORKER_NUM \
NODE_RANK=$MLP_ROLE_INDEX \
MASTER_ADDR=$MLP_WORKER_0_HOST \
MASTER_PORT=$MLP_WORKER_0_PORT \
megatron pt \
    --load ${model} \
    --dataset ${dataset} \
    --split_dataset_ratio 0 \
    --custom_dataset_info "./dataset_info_swift3.json" \
    --tensor_model_parallel_size 2 \
    --expert_model_parallel_size 16 \
    --pipeline_model_parallel_size 1 \
    --sequence_parallel true \
    --context_parallel_size 1 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 0.001 \
    --micro_batch_size 1 \
    --global_batch_size 1024 \
    --packing true \
    --recompute_granularity selective \
    --recompute_modules core_attn moe moe_act mlp \
    --max_epochs 2 \
    --train_iters 200 \
    --eval_iters 50 \
    --eval_interval 100 \
    --save_interval 100 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --seed 42 \
    --lr 5e-5 \
    --lr_warmup_iters 0 \
    --min_lr 5e-6 \
    --save ${save} \
    --max_length 4096 \
    --dataset_num_proc 32 \
    --no_save_optim false \
    --no_save_rng false \
    --attention_backend flash \
    --attn_impl flash_attn
```
- version: ms-swift=3.8.1, Megatron-LM@core_r0.13.0
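
As a sanity check, the parallel layout implied by the script can be worked out by hand. The sketch below uses the usual Megatron convention that the data-parallel size is the world size divided by TP x PP x CP. The expert data-parallel line additionally assumes that expert tensor parallelism equals TP, which the script does not state. The last lines take the roughly half-hour step time reported in item 2 of this post as 1800 seconds and assume the same layout and fully packed 4096-token samples. All variable names are illustrative.

```bash
# Values copied from the training script; 32 GPUs = 4 nodes x 8 processes.
WORLD_SIZE=32
TP=2; PP=1; CP=1; EP=16
MICRO_BATCH=1; GLOBAL_BATCH=1024; MAX_LENGTH=4096

# Data-parallel size seen by the non-expert layers.
DP=$(( WORLD_SIZE / (TP * PP * CP) ))
# Micro-batches each rank accumulates per optimizer step.
GRAD_ACCUM=$(( GLOBAL_BATCH / (MICRO_BATCH * DP) ))
# ASSUMPTION: expert tensor parallelism equals TP.
EXPERT_DP=$(( WORLD_SIZE / (TP * EP * PP) ))

# ASSUMPTION: "about half an hour" per step is taken as 1800 s.
STEP_SECONDS=1800
SEC_PER_MICRO=$(( STEP_SECONDS / GRAD_ACCUM ))             # integer division
TOKENS_PER_SEC=$(( GLOBAL_BATCH * MAX_LENGTH / STEP_SECONDS ))

echo "DP=$DP GRAD_ACCUM=$GRAD_ACCUM EXPERT_DP=$EXPERT_DP"
echo "~${SEC_PER_MICRO}s per micro-batch, ~${TOKENS_PER_SEC} tokens/s overall"
```

With a global batch of 1024 and a micro batch of 1, every optimizer step consists of 64 sequential forward/backward passes per rank, so the per-step time is dominated by the cost of a single micro-batch.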
2. Setting recompute_granularity to full (with method=uniform, num_layers=1) lets training run, but it is extremely slow: one step takes about half an hour.
3. Enabling context_parallel > 1 produces a shape-mismatch error:
```text
[2025-09-15 14:07:10] worker-0 >> [rank0]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[2025-09-15 14:07:10] worker-0 >> [rank0]:     return forward_call(*args, **kwargs)
[2025-09-15 14:07:10] worker-0 >> [rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-09-15 14:07:10] worker-0 >> [rank0]:   File "/mnt/3rd_parties/ms-swift-packs/ms-swift/swift/megatron/model/gpt/qwen3_next.py", line 398, in forward
[2025-09-15 14:07:10] worker-0 >> [rank0]:     new_hidden_states[i, :end - start] = hidden_states[start:end, 0]
[2025-09-15 14:07:10] worker-0 >> [rank0]:     ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[2025-09-15 14:07:10] worker-0 >> [rank0]: RuntimeError: The expanded size of the tensor (436) must match the existing size (284) at non-singleton dimension 0.  Target sizes: [436, 2048].  Tensor sizes: [284, 2048]
```
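
One way such a mismatch can arise (an unverified guess, not a confirmed diagnosis) is that `start` and `end` index into the full packed sequence while `hidden_states` on this rank holds fewer rows, for example because it is split along the sequence dimension. A slice that runs past the end of a tensor is silently truncated, so the right-hand side ends up shorter than the `end - start` rows the left-hand side expects. The sketch below reproduces only the arithmetic; `LOCAL_LEN`, `START` and `END` are hypothetical values chosen to match the two sizes in the error message (436 and 284).

```bash
# HYPOTHETICAL values, picked only to reproduce the sizes in the traceback.
LOCAL_LEN=2048   # rows of hidden_states available on this rank
START=1764       # start offset of packed sample i
END=2200         # end offset of packed sample i

# Rows the assignment target new_hidden_states[i, :end - start] expects.
TARGET_ROWS=$(( END - START ))
# Slicing hidden_states[start:end] stops at the end of the tensor.
CLAMPED_END=$(( END < LOCAL_LEN ? END : LOCAL_LEN ))
# Rows the slice actually yields.
SOURCE_ROWS=$(( CLAMPED_END - START ))

echo "target rows: $TARGET_ROWS, source rows: $SOURCE_ROWS"
```

If this is what happens, printing the shape of `hidden_states` next to the sample offsets at that line with context_parallel_size 1 and 2 would show whether the offsets exceed the local sequence length.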
Could someone please take a look? Thanks a lot!
