unify_gpt_model #589

Merged
xuxinyi389 merged 2 commits into PaddlePaddle:develop from xuxinyi389:unify_gpt_model
Mar 5, 2026

@xuxinyi389 (Contributor) commented Mar 3, 2026

Unify the implementation of the MTP network module in gpt_model.

python3 -c "from paddle.distributed import fleet; s = fleet.DistributedStrategy(); print('default:', s.pipeline_configs.get('enable_partial_send_recv')); s.pipeline_configs = {'accumulate_steps': 12, 'micro_batch_size': 1}; print('after setting:', s.pipeline_configs.get('enable_partial_send_recv'))"
default: True
after setting: True

The combination enable_partial_send_recv=True + sequence_parallel=True is itself broken: every P2P communication between PP stages corrupts data.

At every PP boundary:
Sender:   TP0 ≠ TP1 (SP-distributed, correct)
Receiver: TP0 = TP1 (mixed by partial send + allgather, wrong)
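
A minimal sketch of a guard that flags the combination described above. The helper name `partial_send_recv_conflicts` is hypothetical and not part of Paddle or this PR; it works on a plain dict shaped like `pipeline_configs` and treats a missing `enable_partial_send_recv` key as True, matching the default printed by the one-liner above.

```python
def partial_send_recv_conflicts(pipeline_configs, sequence_parallel):
    """Return True when partial send/recv and sequence parallel are both active.

    Hypothetical helper: `pipeline_configs` is a plain dict; a missing
    'enable_partial_send_recv' key is assumed to mean True (the default
    reported in this PR).
    """
    partial = pipeline_configs.get("enable_partial_send_recv", True)
    return bool(partial and sequence_parallel)


configs = {"accumulate_steps": 12, "micro_batch_size": 1}
print(partial_send_recv_conflicts(configs, sequence_parallel=True))

explicit_off = dict(configs, enable_partial_send_recv=False)
print(partial_send_recv_conflicts(explicit_off, sequence_parallel=True))
```

The first call reports a conflict because the key is left at its default; the second does not because the flag is switched off explicitly.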

#########################################################
How the numerical errors differ between the develop and unify branches. The core data flow is as follows:

LMHead backward output (identical in both branches):
TP0: [10, 20, 30, 40, | 50, 60, 70, 80]
TP1: [90,100,110,120, |130,140,150,160]
      ^- front half -^  ^- back half -^

develop (NormPipe bw: grad*2):                  unify (EmptyLayer: unchanged):
TP0: [20,40,60,80 | 100,120,140,160]            TP0: [10,20,30,40 | 50,60,70,80]
TP1: [180,200,220,240 | 260,280,300,320]        TP1: [90,100,110,120 | 130,140,150,160]

partial send takes only:                        partial send takes only:
TP0 (rank_id=0) front half: [20,40,60,80]       TP0 (rank_id=0) front half: [10,20,30,40]
TP1 (rank_id=1) back half: [260,280,300,320]    TP1 (rank_id=1) back half: [130,140,150,160]

receiver after allgather:                       receiver after allgather:
[260,280,300,320, 20,40,60,80]                  [130,140,150,160, 10,20,30,40]
lost: [100,120,140,160]                         lost: [50,60,70,80]
      [180,200,220,240]                               [90,100,110,120]

correct values should be:                       correct values should be:
TP0: [180,200,220,240,260,280,300,320]          TP0: [90,100,110,120,130,140,150,160]
TP1: [20,40,60,80,100,120,140,160]              TP1: [10,20,30,40,50,60,70,80]
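
The flow traced above can be replayed with a small pure-Python sketch. It is a toy model, not Paddle's implementation: the helpers `partial_send` and `simulate_boundary` are hypothetical, and the allgather is modelled as a simple rank-order concatenation, which is an assumption. The element order of the receiver buffer may therefore differ from the trace above, but the dropped values are the same.

```python
def partial_send(buf, rank_id, tp_degree):
    """Split `buf` into (slice rank `rank_id` sends, values it never sends)."""
    chunk = len(buf) // tp_degree
    start = rank_id * chunk
    return buf[start:start + chunk], buf[:start] + buf[start + chunk:]


def simulate_boundary(per_rank_grads):
    """Return (receiver buffer, values dropped per rank) for one PP boundary."""
    tp_degree = len(per_rank_grads)
    received, dropped = [], []
    for rank_id, buf in enumerate(per_rank_grads):
        sent, lost = partial_send(buf, rank_id, tp_degree)
        received.extend(sent)  # allgather modelled as rank-order concat (assumption)
        dropped.append(lost)
    return received, dropped


lm_head_grad = [
    [10, 20, 30, 40, 50, 60, 70, 80],          # TP0
    [90, 100, 110, 120, 130, 140, 150, 160],   # TP1
]
develop_grad = [[2 * g for g in buf] for buf in lm_head_grad]  # NormPipe bw: grad*2
unify_grad = [list(buf) for buf in lm_head_grad]               # EmptyLayer: unchanged

develop_recv, develop_lost = simulate_boundary(develop_grad)
unify_recv, unify_lost = simulate_boundary(unify_grad)
print("develop:", develop_recv, "lost:", develop_lost)
print("unify:  ", unify_recv, "lost:", unify_lost)
```

In this model every TP rank on the receiving side ends up with the same buffer, and half of each sender's gradient never reaches the next stage.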
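Both branches are wrong, but they mix into different wrong values. This is why the loss is exactly identical, yet the MD5 of every one of the 23 parameter gradients differs.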
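
A sketch of how gradient buffers could be fingerprinted for this kind of cross-branch comparison. The PR does not say how the MD5 values were produced, so the float32 little-endian packing and the helper name `grad_md5` are assumptions; the two buffers are the receiver-side values from the trace above.

```python
import hashlib
import struct


def grad_md5(values):
    """MD5 hex digest of a gradient buffer packed as little-endian float32."""
    packed = struct.pack(f"<{len(values)}f", *(float(v) for v in values))
    return hashlib.md5(packed).hexdigest()


develop_recv_trace = [260, 280, 300, 320, 20, 40, 60, 80]
unify_recv_trace = [130, 140, 150, 160, 10, 20, 30, 40]

same = grad_md5(develop_recv_trace) == grad_md5(unify_recv_trace)
print("digests match:", same)
```

Because the digest covers the raw bytes, any difference in the corrupted gradient values yields a different fingerprint.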

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@cf02ab5). Learn more about missing BASE report.

@@             Coverage Diff             @@
##             develop      #589   +/-   ##
===========================================
  Coverage           ?   100.00%           
===========================================
  Files              ?         2           
  Lines              ?         2           
  Branches           ?         0           
===========================================
  Hits               ?         2           
  Misses             ?         0           
  Partials           ?         0           
Flag Coverage Δ
coverage_combine 100.00% <100.00%> (?)

Files with missing lines Coverage Δ
src/paddlefleet/models/gpt/gpt_model.py 100.00% <100.00%> (ø)
src/paddlefleet/transformer/paddle_norm.py 100.00% <100.00%> (ø)

@xuxinyi389 xuxinyi389 merged commit a68c0c6 into PaddlePaddle:develop Mar 5, 2026
35 of 39 checks passed
ShigureNyako pushed a commit to ShigureNyako/PaddleFleet that referenced this pull request Mar 5, 2026
@ShigureNyako (Contributor) commented

✅ Cherry-pick successful! Created PR: #594

4 participants