[Cherry-Pick][Optimization] merge_allreduce (#7039) by Copilot · Pull Request #7475 · PaddlePaddle/FastDeploy

Copilot · 2026-04-17T15:49:08Z

Motivation

Cherry-pick of #7039 to release/2.5 branch.

Modifications

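Merge the separate allreduce calls that the routed experts and the shared experts each perform after their FFN computation into a single allreduce.

Specific changes:

In glm4_moe.py, add a merge_ffn_tp flag that checks whether the model runs in pure TP mode (tp>1, ep=1).
In pure TP mode, set reduce_results to False for both FusedMoE and shared_experts, deferring the allreduce.
In the forward method, add the partial sums of the two branches and then run one unified tensor_model_parallel_all_reduce, which removes one collective communication call.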
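
The effect of the change can be illustrated with a minimal pure-Python simulation. This is only a sketch, not the code in glm4_moe.py: FakeTPGroup, forward_separate, and forward_merged are hypothetical names, the "tensors" are plain lists holding one vector per rank, and the input values are arbitrary. It relies on the fact that allreduce (sum) is linear, so reducing the sum of two partial results gives the same totals as summing two reduced results.

```python
# Pure-Python simulation; no Paddle or FastDeploy imports are used.
from typing import List

Vector = List[float]


class FakeTPGroup:
    """Hypothetical stand-in for a TP group that counts collective calls."""

    def __init__(self) -> None:
        self.num_all_reduce = 0

    def all_reduce(self, per_rank: List[Vector]) -> List[Vector]:
        # Sum element-wise across ranks; every rank receives the same total.
        self.num_all_reduce += 1
        total = [sum(vals) for vals in zip(*per_rank)]
        return [list(total) for _ in per_rank]


def add(a: List[Vector], b: List[Vector]) -> List[Vector]:
    # Element-wise addition, performed rank by rank.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def forward_separate(group, moe_partial, shared_partial):
    # Previous behaviour: each branch reduces its own output, then they are added.
    return add(group.all_reduce(moe_partial), group.all_reduce(shared_partial))


def forward_merged(group, moe_partial, shared_partial):
    # Merged behaviour: add the partial sums first, then reduce once.
    return group.all_reduce(add(moe_partial, shared_partial))


# Two simulated ranks, hidden size 3; values are illustration data only.
moe_partial = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
shared_partial = [[0.5, 0.5, 0.5], [1.5, 1.5, 1.5]]

g_sep, g_merged = FakeTPGroup(), FakeTPGroup()
out_sep = forward_separate(g_sep, moe_partial, shared_partial)
out_merged = forward_merged(g_merged, moe_partial, shared_partial)

print(g_sep.num_all_reduce, g_merged.num_all_reduce)
```

With real floating-point tensors the two paths add values in a different order, so results may differ in the last bits even though they are mathematically equal.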

Conflict Resolution

The baseline path conflict in tests/e2e/utils/rollout_routing_replay_test_utils.py has been resolved by keeping the R3_BaseLine_25_uint8_0403 path that already exists on the release/2.5 branch.

Checklist

Add at least a tag in the PR title.
Format your code, run pre-commit before commit.
Add unit tests. This is a cherry-pick of an already merged PR, no new tests needed.
Provide accuracy results. Same as original PR [Optimization] merge_allreduce #7039.
The current PR is submitting to the release branch, and the PR has been submitted to the develop branch first ([Optimization] merge_allreduce #7039).

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

paddle-bot · 2026-04-17T15:49:17Z

Thanks for your contribution!

PaddlePaddle-bot

🤖 AI Code Review | 2026-04-17 23:56 CST

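📋 Review Summary

PR overview: Cherry-pick of #7039. Merges the separate allreduce calls of the routed experts and the shared experts in the GLM4-MoE model into one, reducing collective communication overhead.
Scope of change: model_executor/models/glm4_moe.py
Impact tags: Models Optimization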

Issues

No blocking issues found.

Overall Assessment

The change is logically clear and correct. The specific points verified are:

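The merge_ffn_tp condition is correct: self.use_tp and not self.use_ep precisely identifies pure TP mode (tp>1, ep=1), and the allreduce is deferred only in that mode. In EP mode or with no parallelism, each branch keeps its original internal reduce behavior (reduce_results=True).
The reduce_results argument is passed correctly:
- FusedMoE skips its internal tensor_model_parallel_all_reduce when reduce_results=False (confirmed against the conditional logic at line 760 of moe.py).
- The reduce_results parameter of Glm4MoeMLP only affects RowParallelLinear (the non-use_sequence_parallel_moe branch). The use_sequence_parallel_moe branch uses ReplicatedLinear, which involves no TP reduce, so it is unaffected.
The forward method is correct: the partial sums of the two branches are added first (out + self.shared_experts(x)), and then a single unified tensor_model_parallel_all_reduce is executed. When n_shared_experts == 0, the allreduce is still executed correctly, because the FusedMoE output is still a partial sum.
The function signature matches: tensor_model_parallel_all_reduce(out, self.tp_group) is consistent with the (input_, group_) signature defined in communication.py.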
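
The mode selection checked above can be summarized in a small sketch. It is not taken from glm4_moe.py: ReducePlan and plan_reduce are hypothetical names, and deriving use_tp and use_ep as tp_size > 1 and ep_size > 1 is an assumption based on the "(tp>1, ep=1)" description. Only the expression use_tp and not use_ep and the resulting reduce_results values come from the review text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReducePlan:
    merge_ffn_tp: bool           # True only in pure TP mode
    branch_reduce_results: bool  # value passed to FusedMoE and shared_experts


def plan_reduce(tp_size: int, ep_size: int) -> ReducePlan:
    # Assumption: the flags are derived from the parallel sizes like this.
    use_tp = tp_size > 1
    use_ep = ep_size > 1
    # Condition quoted in the review: pure TP means TP on and EP off.
    merge_ffn_tp = use_tp and not use_ep
    # When merging, the branches skip their internal reduce; otherwise they keep it.
    return ReducePlan(merge_ffn_tp, branch_reduce_results=not merge_ffn_tp)


for tp, ep in [(4, 1), (4, 8), (1, 1)]:
    print(tp, ep, plan_reduce(tp, ep))
```

Only the first configuration (tp=4, ep=1) defers the reduce to the merged allreduce in forward; the other two keep reduce_results=True in each branch.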

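This is an effective communication optimization: in pure TP mode it merges two allreduce calls into one, reducing the number of collective communication calls by about 50%, which gives a clear benefit in large-scale inference scenarios.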

[Optimization] merge_allreduce (#7039)

b8d3634

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

Copilot AI assigned Copilot and Jiang-Jia-Jun Apr 17, 2026

Copilot AI requested review from Copilot and removed request for Copilot April 17, 2026 15:49

paddle-bot bot added the contributor External developers label Apr 17, 2026

Copilot created this pull request from a session on behalf of Jiang-Jia-Jun April 17, 2026 15:50

Copilot AI requested a review from Jiang-Jia-Jun April 17, 2026 15:50

Copilot finished work on behalf of Jiang-Jia-Jun April 17, 2026 15:50

PaddlePaddle-bot reviewed Apr 17, 2026

Copilot wants to merge 1 commit into release/2.5 from copilot/release-2-5-backport-pr-7039
