Help wanted: problems training Qwen3-Next with Megatron-swift (OOM / slow training / shape mismatch) #5851

@LuoYuanzhen

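I am running PT (pre-training) on Qwen3-Next-80B-A3B-Instruct with 32 A100 GPUs and max_length=4096. I hit OOM (no combination of parallelism strategies I tried resolves it) and a shape mismatch (when context_parallel > 1). Is something in my configuration wrong? The same configuration worked fine for megatron-swift training on the dense Qwen2.5-72B, so in principle an 80B model should not run out of memory either?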
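As a rough sanity check on the OOM, static memory per GPU (weights, gradients, optimizer state) can be estimated before any activations are counted. The sketch below is a back-of-the-envelope estimate, not a measurement. Its assumptions: 80e9 total parameters (taken from the model name), an idealized even 32-way sharding of every parameter across the 32 GPUs, and 18 bytes per parameter (bf16 weight, fp32 gradient, fp32 master weight, two fp32 Adam moments). The helper name `static_gib_per_gpu` is hypothetical.

```python
def static_gib_per_gpu(total_params, shard_ways, bytes_per_param):
    """Estimate static memory per GPU in GiB, ignoring activations and buffers."""
    return total_params / shard_ways * bytes_per_param / 2**30

# Assumption: 80e9 parameters, from the "80B" in the model name.
TOTAL_PARAMS = 80e9
# Assumption: every parameter is sharded evenly across all 32 GPUs (optimistic).
SHARD_WAYS = 32
# Assumption: 2 (bf16 weight) + 4 (fp32 grad) + 4 (fp32 master) + 4 + 4 (Adam moments).
BYTES_PER_PARAM = 2 + 4 + 4 + 4 + 4

static_gib = static_gib_per_gpu(TOTAL_PARAMS, SHARD_WAYS, BYTES_PER_PARAM)
print(f"static memory per GPU: {static_gib:.1f} GiB")
```

Under these assumptions the static footprint alone is about 42 GiB per GPU, so the comparison with a 72B dense model may not carry over directly. In an MoE model most parameters sit in the experts, and how far their optimizer state can be sharded depends on the expert-parallel layout rather than on the data-parallel size alone. The issue does not say whether the A100s are 40 GB or 80 GB cards, which changes how much room is left for activations.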

  1. Training script:
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NCCL_DEBUG=WARN \
MODELSCOPE_CACHE="/mnt/.cache/datasets/train" \
MEGATRON_LM_PATH="/mnt/3rd_parties/megatron-packs/Megatron-LM" \
NPROC_PER_NODE=8 \
NNODES=$MLP_WORKER_NUM \
NODE_RANK=$MLP_ROLE_INDEX \
MASTER_ADDR=$MLP_WORKER_0_HOST \
MASTER_PORT=$MLP_WORKER_0_PORT \
megatron pt \
    --load ${model} \
    --dataset ${dataset} \
    --split_dataset_ratio 0 \
    --custom_dataset_info "./dataset_info_swift3.json" \
    --tensor_model_parallel_size 2 \
    --expert_model_parallel_size 16 \
    --pipeline_model_parallel_size 1 \
    --sequence_parallel true \
    --context_parallel_size 1 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 0.001 \
    --micro_batch_size 1 \
    --global_batch_size 1024 \
    --packing true \
    --recompute_granularity selective \
    --recompute_modules core_attn moe moe_act mlp \
    --max_epochs 2 \
    --train_iters 200 \
    --eval_iters 50 \
    --eval_interval 100 \
    --save_interval 100 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --seed 42 \
    --lr 5e-5 \
    --lr_warmup_iters 0 \
    --min_lr 5e-6 \
    --save ${save} \
    --max_length 4096 \
    --dataset_num_proc 32 \
    --no_save_optim false \
    --no_save_rng false \
    --attention_backend flash \
    --attn_impl flash_attn
  • version: ms-swift=3.8.1, Megatron-LM@core_r0.13.0
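For reference, the batch layout implied by the script above can be worked out by hand. The sketch below assumes the usual Megatron convention that the data-parallel size is the world size divided by TP × PP × CP (expert parallelism does not reduce it), and that packing fills every sample up to max_length. The function name `layout` is hypothetical.

```python
def layout(world_size, tp, pp, cp, micro_batch, global_batch):
    """Return (data_parallel_size, gradient_accumulation_steps)."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * pp * cp")
    dp = world_size // model_parallel
    if global_batch % (micro_batch * dp) != 0:
        raise ValueError("global_batch must be divisible by micro_batch * dp")
    accum = global_batch // (micro_batch * dp)
    return dp, accum

# Values from the script: 32 GPUs, TP=2, PP=1, CP=1, micro batch 1, global batch 1024.
dp, accum = layout(world_size=32, tp=2, pp=1, cp=1, micro_batch=1, global_batch=1024)

# Upper bound on tokens per optimizer step, assuming packing fills each sample to 4096.
tokens_per_step = 1024 * 4096

print(dp, accum, tokens_per_step)
```

With these values each rank runs 64 micro-batches per optimizer step, and one step covers up to about 4.2M tokens. If a step takes roughly half an hour, that is on the order of 28 seconds per micro-batch, so the global batch size contributes to the long step time alongside full recomputation.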
  1. Setting recompute_granularity to full with method=uniform and num_layers=1 lets training run, but it is extremely slow: one step takes about half an hour;
  2. Enabling context_parallel > 1 produces a shape mismatch error:
[2025-09-15 14:07:10] worker-0 >> [rank0]:   File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[2025-09-15 14:07:10] worker-0 >> [rank0]:     return forward_call(*args, **kwargs)
[2025-09-15 14:07:10] worker-0 >> [rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-09-15 14:07:10] worker-0 >> [rank0]:   File "/mnt/3rd_parties/ms-swift-packs/ms-swift/swift/megatron/model/gpt/qwen3_next.py", line 398, in forward
[2025-09-15 14:07:10] worker-0 >> [rank0]:     new_hidden_states[i, :end - start] = hidden_states[start:end, 0]
[2025-09-15 14:07:10] worker-0 >> [rank0]:     ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[2025-09-15 14:07:10] worker-0 >> [rank0]: RuntimeError: The expanded size of the tensor (436) must match the existing size (284) at non-singleton dimension 0.  Target sizes: [436, 2048].  Tensor sizes: [284, 2048]
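One pattern that would produce an error of this form, offered as a guess rather than a confirmed diagnosis: the sequence boundaries (`start`, `end`) are computed for the full packed sequence, while `hidden_states` on each rank only holds a shorter local portion once context parallelism splits the sequence. Slicing past the end of a buffer truncates silently, so the source ends up shorter than the target. The sketch below uses plain Python lists and hypothetical numbers chosen only to reproduce the 436 versus 284 sizes from the traceback; `slice_lengths` is an illustrative helper, not part of ms-swift.

```python
def slice_lengths(cu_seqlens, local_len):
    """For each packed sequence, return (expected_len, actual_len) when
    boundaries in cu_seqlens are applied to a buffer with local_len rows."""
    result = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        expected = end - start
        # Slicing clamps both bounds to the buffer length instead of raising.
        actual = max(0, min(end, local_len) - min(start, local_len))
        result.append((expected, actual))
    return result

# Hypothetical values: two packed sequences of 1000 and 436 tokens,
# but the local buffer holds only 1284 rows.
cu_seqlens = [0, 1000, 1436]
local_len = 1284

lengths = slice_lengths(cu_seqlens, local_len)
print(lengths)

# The same clamping with a real slice: the result has 284 items, not 436.
local_buffer = list(range(local_len))
truncated = local_buffer[1000:1436]
print(len(truncated))
```

If this is what happens, the boundaries would need to be converted to per-rank offsets before slicing, or the hidden states gathered across the context-parallel group first. Whether the `qwen3_next.py` code path in this version supports context_parallel > 1 at all would have to be confirmed by the maintainers.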

Could someone please take a look? Thanks a lot!
