unify_gpt_model by xuxinyi389 · Pull Request #589 · PaddlePaddle/PaddleFleet

xuxinyi389 · 2026-03-03T07:03:48Z

Unify the MTP network module implementation in gpt_model.

python3 -c "from paddle.distributed import fleet; s = fleet.DistributedStrategy(); print('default:', s.pipeline_configs.get('enable_partial_send_recv')); s.pipeline_configs = {'accumulate_steps': 12, 'micro_batch_size': 1}; print('after setting:', s.pipeline_configs.get('enable_partial_send_recv'))"
default: True
after setting: True

The combination enable_partial_send_recv=True + sequence_parallel=True is itself broken: every P2P communication between PP stages corrupts the data.

At every PP boundary:
Sender: TP0 ≠ TP1 (SP-sharded, correct)
Receiver: TP0 = TP1 (mixed by partial send + allgather, wrong)
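
The sender/receiver asymmetry above can be illustrated with a small pure-Python model. This is a sketch under stated assumptions, not Paddle's actual P2P code: `partial_send_allgather` is a hypothetical helper that assumes TP rank `r` sends only the `r`-th chunk of its tensor and that every receiving rank concatenates the chunks in rank order. Under that model the round trip is lossless only when all TP ranks hold the same tensor; with sequence parallelism each rank holds different data, so the receivers end up with an identical, mixed tensor.

```python
def partial_send_allgather(per_rank_tensors):
    """Hypothetical model: rank r sends only its r-th chunk, receivers allgather."""
    tp_degree = len(per_rank_tensors)
    chunk = len(per_rank_tensors[0]) // tp_degree
    gathered = []
    for rank_id, tensor in enumerate(per_rank_tensors):
        # Each sender contributes one slice; the rest of its tensor is never sent.
        gathered.extend(tensor[rank_id * chunk:(rank_id + 1) * chunk])
    # Every receiving TP rank ends up with the same gathered tensor.
    return [list(gathered) for _ in range(tp_degree)]


# Without sequence parallelism: both TP ranks hold the same activation.
replicated = [[1, 2, 3, 4], [1, 2, 3, 4]]
# With sequence parallelism: each TP rank holds a different shard.
sharded = [[1, 2, 3, 4], [5, 6, 7, 8]]

recv_replicated = partial_send_allgather(replicated)
recv_sharded = partial_send_allgather(sharded)

print(recv_replicated == replicated)       # True: lossless when data is replicated
print(recv_sharded[0] == recv_sharded[1])  # True: receivers are identical (TP0 = TP1)
print(recv_sharded == sharded)             # False: the sharded data was not preserved
```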

#########################################################
Difference in the numerical errors between the develop and unify branches. The core data flow is as follows:

LMHead backward output (identical on both branches):
TP0: [10, 20, 30, 40, | 50, 60, 70, 80]
TP1: [90,100,110,120, |130,140,150,160]
      ^- front half -^ ^- back half -^

develop (NormPipe bw: grad*2):                  unify (EmptyLayer: unchanged):
TP0: [20,40,60,80 |100,120,140,160]             TP0: [10,20,30,40 | 50, 60, 70, 80]
TP1: [180,200,220,240|260,280,300,320]          TP1: [90,100,110,120|130,140,150,160]

partial send takes only:                        partial send takes only:
TP0(rank_id=0) front half: [20,40,60,80]        TP0(rank_id=0) front half: [10,20,30,40]
TP1(rank_id=1) back half: [260,280,300,320]     TP1(rank_id=1) back half: [130,140,150,160]

Receiver after allgather:                       Receiver after allgather:
[260,280,300,320, 20,40,60,80]                  [130,140,150,160, 10,20,30,40]
Lost: [100,120,140,160]                         Lost: [50,60,70,80]
      [180,200,220,240]                               [90,100,110,120]

Correct values should be:                       Correct values should be:
TP0: [180,200,220,240,260,280,300,320]          TP0: [90,100,110,120,130,140,150,160]
TP1: [20,40,60,80,100,120,140,160]              TP1: [10,20,30,40,50,60,70,80]
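Both branches are wrong, but the mixing produces different wrong values in each. That is why the loss is exactly identical while the MD5 hashes of all 23 parameter gradients differ.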
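
The numbers in the comparison above can be reproduced with a short pure-Python sketch. It is a model, not the framework's code: `simulate` is a hypothetical helper, the `grad*2` step stands in for the NormPipe backward on develop, and the gathered chunks are concatenated in rank order (an assumption; the diagram lists the TP1 chunk first). The sent and lost values do not depend on that ordering.

```python
def simulate(per_rank_grads):
    """Hypothetical model of partial send + allgather across TP ranks."""
    tp_degree = len(per_rank_grads)
    chunk = len(per_rank_grads[0]) // tp_degree
    received, lost = [], []
    for rank_id, grad in enumerate(per_rank_grads):
        start, end = rank_id * chunk, (rank_id + 1) * chunk
        received.extend(grad[start:end])     # the only slice this rank sends
        lost.append(grad[:start] + grad[end:])  # everything else is dropped
    return received, lost


# LMHead backward output, identical on both branches (values from the PR).
lm_head_grad = [
    [10, 20, 30, 40, 50, 60, 70, 80],        # TP0
    [90, 100, 110, 120, 130, 140, 150, 160], # TP1
]

unify_grad = lm_head_grad                                      # EmptyLayer: unchanged
develop_grad = [[2 * v for v in row] for row in lm_head_grad]  # NormPipe bw: grad*2

recv_unify, lost_unify = simulate(unify_grad)
recv_develop, lost_develop = simulate(develop_grad)

print("unify   received:", recv_unify, "lost:", lost_unify)
print("develop received:", recv_develop, "lost:", lost_develop)
```

Both branches drop half of each rank's gradient, but because the values entering the P2P step differ by the factor of 2, the corrupted tensors received upstream differ as well.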

codecov-commenter · 2026-03-04T07:24:56Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@cf02ab5). Learn more about missing BASE report.

Additional details and impacted files

@@             Coverage Diff             @@
##             develop      #589   +/-   ##
===========================================
  Coverage           ?   100.00%           
===========================================
  Files              ?         2           
  Lines              ?         2           
  Branches           ?         0           
===========================================
  Hits               ?         2           
  Misses             ?         0           
  Partials           ?         0

Flag	Coverage Δ
coverage_combine	`100.00% <100.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
src/paddlefleet/models/gpt/gpt_model.py	`100.00% <100.00%> (ø)`
src/paddlefleet/transformer/paddle_norm.py	`100.00% <100.00%> (ø)`

ShigureNyako · 2026-03-05T07:13:51Z

✅ Cherry-pick successful! Created PR: #594

xuxinyi389 added 2 commits March 3, 2026 23:35

unify_gpt_model

ab6de0e

fix_test

850f00e

xuxinyi389 force-pushed the unify_gpt_model branch from 0b5cdba to 850f00e Compare March 3, 2026 18:24

risemeup1 approved these changes Mar 5, 2026

xuxinyi389 merged commit a68c0c6 into PaddlePaddle:develop Mar 5, 2026
35 of 39 checks passed

xuxinyi389 added the cherry-pick: release/0.2 label Mar 5, 2026

ShigureNyako pushed a commit to ShigureNyako/PaddleFleet that referenced this pull request Mar 5, 2026

unify_gpt_model (PaddlePaddle#589)

c120122

ShigureNyako mentioned this pull request Mar 5, 2026

[release/0.2] unify_gpt_model #594

Merged

github-actions bot removed the cherry-pick: release/0.2 label Mar 5, 2026

xuxinyi389 merged 2 commits into PaddlePaddle:develop from xuxinyi389:unify_gpt_model

4 participants
