Description
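When running PT (pre-training) on Qwen3-Next-80B-A3B-Instruct with 32 A100 GPUs and max_length=4096, I hit two problems: OOM (no combination of parallelism strategies I tried resolves it) and a shape mismatch (when context_parallel > 1). Is something misconfigured? Earlier, megatron-swift training of the dense Qwen2.5-72B with the same configuration worked without problems, so in principle the 80B model should not OOM either?
- Training script: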
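As a rough sanity check on the "80B should not OOM either" expectation, here is a minimal sketch of a lower bound on per-GPU memory for model states alone. It assumes bf16 parameters with fp32 gradients and fp32 Adam states (about 18 bytes per parameter) and an idealized, perfectly even sharding of all states across all GPUs. Neither assumption is confirmed by the script, and activations, MoE routing buffers, and fragmentation are ignored, so real usage will be higher.

```python
def model_state_floor_gb(total_params: float, num_gpus: int,
                         bytes_per_param: float = 18.0) -> float:
    """Idealized lower bound on per-GPU memory for model states, in GB (1e9 bytes).

    Assumes (hypothetically) that parameters, gradients, and optimizer states
    are split perfectly evenly across all GPUs. bytes_per_param=18 corresponds
    to bf16 params (2) + fp32 grads (4) + fp32 master params and two Adam
    moments (12).
    """
    return total_params * bytes_per_param / num_gpus / 1e9


# Nominal parameter counts taken from the model names, not measured.
for name, params in [("Qwen2.5-72B", 72e9), ("Qwen3-Next-80B-A3B", 80e9)]:
    print(f"{name}: >= {model_state_floor_gb(params, 32):.1f} GB per GPU")
```

Under these assumptions the floor is about 40.5 GB per GPU for the 72B model and 45.0 GB for the 80B model, before any activation memory, so the 80B run has less headroom even in the best case.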
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NCCL_DEBUG=WARN \
MODELSCOPE_CACHE="/mnt/.cache/datasets/train" \
MEGATRON_LM_PATH="/mnt/3rd_parties/megatron-packs/Megatron-LM" \
NPROC_PER_NODE=8 \
NNODES=$MLP_WORKER_NUM \
NODE_RANK=$MLP_ROLE_INDEX \
MASTER_ADDR=$MLP_WORKER_0_HOST \
MASTER_PORT=$MLP_WORKER_0_PORT \
megatron pt \
--load ${model} \
--dataset ${dataset} \
--split_dataset_ratio 0 \
--custom_dataset_info "./dataset_info_swift3.json" \
--tensor_model_parallel_size 2 \
--expert_model_parallel_size 16 \
--pipeline_model_parallel_size 1 \
--sequence_parallel true \
--context_parallel_size 1 \
--moe_grouped_gemm true \
--moe_shared_expert_overlap true \
--moe_aux_loss_coeff 0.001 \
--micro_batch_size 1 \
--global_batch_size 1024 \
--packing true \
--recompute_granularity selective \
--recompute_modules core_attn moe moe_act mlp \
--max_epochs 2 \
--train_iters 200 \
--eval_iters 50 \
--eval_interval 100 \
--save_interval 100 \
--finetune true \
--cross_entropy_loss_fusion true \
--seed 42 \
--lr 5e-5 \
--lr_warmup_iters 0 \
--min_lr 5e-6 \
--save ${save} \
--max_length 4096 \
--dataset_num_proc 32 \
--no_save_optim false \
--no_save_rng false \
--attention_backend flash \
--attn_impl flash_attn
- version: ms-swift=3.8.1, Megatron-LM@core_r0.13.0
- Setting recompute_granularity to full (method=uniform, num_layers=1) lets training run, but it is extremely slow: one step takes about half an hour.
- With context_parallel > 1, I get a shape mismatch error:
[2025-09-15 14:07:10] worker-0 >> [rank0]: File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[2025-09-15 14:07:10] worker-0 >> [rank0]: return forward_call(*args, **kwargs)
[2025-09-15 14:07:10] worker-0 >> [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-09-15 14:07:10] worker-0 >> [rank0]: File "/mnt/3rd_parties/ms-swift-packs/ms-swift/swift/megatron/model/gpt/qwen3_next.py", line 398, in forward
[2025-09-15 14:07:10] worker-0 >> [rank0]: new_hidden_states[i, :end - start] = hidden_states[start:end, 0]
[2025-09-15 14:07:10] worker-0 >> [rank0]: ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[2025-09-15 14:07:10] worker-0 >> [rank0]: RuntimeError: The expanded size of the tensor (436) must match the existing size (284) at non-singleton dimension 0. Target sizes: [436, 2048]. Tensor sizes: [284, 2048]
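One possible reading of this error, not verified against the ms-swift source: the `start`/`end` boundaries may refer to positions in the full packed sequence, while with context_parallel > 1 each rank holds only a shard of `hidden_states`. Slicing past the end of the local shard silently clamps, so the source ends up with fewer rows than the `end - start` rows expected by the target. The sketch below only reproduces that arithmetic; the helper name and the values of `start`, `end`, and `local_len` are hypothetical, chosen to match the sizes in the message.

```python
def copy_shapes(start: int, end: int, local_len: int) -> tuple[int, int]:
    """Row counts on both sides of `dst[i, :end - start] = src[start:end]`.

    `local_len` is the number of rows actually present in `src` on this rank.
    Assumes 0 <= start <= end.
    """
    # The target slice always asks for end - start rows.
    target_rows = end - start
    # A slice of the source is clamped to the rows that exist.
    source_rows = max(0, min(end, local_len) - min(start, local_len))
    return target_rows, source_rows


# Hypothetical values: a sample spanning rows 1764..2200 of the packed
# sequence, on a rank that only holds the first 2048 rows.
print(copy_shapes(start=1764, end=2200, local_len=2048))  # (436, 284)
```

If this is what happens, the boundaries would need to be converted to per-rank offsets before the copy, but that is a guess from the traceback alone.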
Could you please take a look? Thanks a lot!