[XPU] get_infer_param use inplace copy, remove block_tables abundant d2h copy#7431
RuohengMa wants to merge 22 commits into PaddlePaddle:develop
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7431 +/- ##
==========================================
Coverage ? 73.84%
==========================================
Files ? 398
Lines ? 55000
Branches ? 8613
==========================================
Hits ? 40614
Misses ? 11671
Partials ? 2715
8c8f7ee to 7ff2c11
6d21f41 to 1dceb07
PaddlePaddle-bot left a comment:
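🤖 AI Code Review | 2026-04-17 17:45 CST
📋 Review Summary
PR overview: Switches the get_infer_param operator to in-place copy mode, eliminating the redundant D2H copy of block_tables; also fixes a branching problem in block_attn_spliced in the rope_3d case.
Scope of changes: custom_ops/xpu_ops/ (C++ operator implementation + tests), fastdeploy/model_executor/ (Python call site + ForwardMeta)
Impact tags: XPU OP Optimization
📝 PR Convention Check
The PR title already includes the [XPU] tag and conforms to the convention. However, the Modifications section is identical to Motivation; consider adding a description of the specific changes. In addition, none of the Checklist items are checked.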
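Suggested description (can be copied directly):
Modifications
- get_infer_param operator switched to in-place mode: the 22 output tensors become inputs with in-place outputs, which avoids reallocating memory on every call
- lod_to_slot_mapping no longer performs a D2H copy of block_tables; it uses index_select to compute the slot mapping directly on the XPU
- block_attn_spliced skips the splice path when rope_3d=true, which avoids triggering PD_THROW
- Python-side xpu_pre_and_post_process supports in-place copy in cudagraph mode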
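The slot-mapping change to lod_to_slot_mapping described above can be pictured with a small host-side NumPy sketch. It assumes the common paged-KV-cache convention in which the slot for token position p of sequence b is `block_tables[b, p // block_size] * block_size + p % block_size`; the helper name `slot_mapping`, its parameters, and that formula are assumptions for illustration and are not taken from the PR. `np.take` on the flattened table stands in for index_select: the lookup is a single gather, so it can run wherever block_tables already lives instead of requiring a host copy of the table.

```python
import numpy as np

def slot_mapping(block_tables, seq_ids, positions, block_size):
    """Hypothetical helper: map (sequence, token position) pairs to KV-cache slots."""
    block_tables = np.asarray(block_tables)
    seq_ids = np.asarray(seq_ids)
    positions = np.asarray(positions)
    max_blocks = block_tables.shape[1]
    # Flat index of the logical block that each token falls into.
    flat_idx = seq_ids * max_blocks + positions // block_size
    # One gather over the flattened table (the index_select analogue).
    physical_block = np.take(block_tables.reshape(-1), flat_idx)
    # Slot = start of the physical block + offset inside the block.
    return physical_block * block_size + positions % block_size

# Two sequences, two blocks each, block_size = 4 (made-up values).
tables = np.array([[7, 2], [5, 9]])
slots = slot_mapping(tables, seq_ids=[0, 0, 1], positions=[1, 5, 4], block_size=4)
```

On a device the same gather would be expressed with the framework's index_select on the device tensor; the NumPy version only shows the indexing arithmetic.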
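Issues
| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | block_attn_spliced.cc:244 | After !rope_3d was added to the outer condition, the inner if (rope_3d) became dead code |
| ❓ Question | get_infer_param.cc:221 | The loop upper bound changed from bsz to bsz + 1; the intent of the change needs confirming |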
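The dead-code observation for block_attn_spliced.cc can be checked by enumerating both flags. The sketch below is a Python stand-in for the branch structure only; the function name `splice_branch` and the returned labels (including "fallback" for the non-splice path) are hypothetical and do not appear in the PR.

```python
from itertools import product

def splice_branch(encoder_splice: bool, rope_3d: bool) -> str:
    """Hypothetical stand-in for the guarded branch structure."""
    if encoder_splice and not rope_3d:
        if rope_3d:
            # Mirrors the inner check; unreachable given the outer guard.
            return "throw"
        return "splice"
    return "fallback"

# Evaluate all four flag combinations.
outcomes = {
    flags: splice_branch(*flags) for flags in product([False, True], repeat=2)
}
```

No combination yields "throw", which is why the inner check can be removed or kept only with an explanatory comment.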
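Overall Assessment
The core optimization is clear: in-place copy avoids repeatedly allocating tensors, and index_select replaces the D2H copy plus CPU computation, eliminating the redundant copy of block_tables, which clearly helps performance. The code logic is correct, and the test files have all been updated accordingly. Two minor issues deserve attention: cleaning up the dead code in block_attn_spliced.cc, and explaining the intent behind the changed loop upper bound for decoder_seq_lod_vec.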
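The difference between allocating outputs and copying in place can be sketched on the host with NumPy. This is an illustration under assumptions, not the operator's code: the function names and the fixed-capacity buffer are hypothetical. The property it shows is that the in-place variant keeps the same storage across calls, which is what graph-capture modes typically rely on.

```python
import numpy as np

def get_param_alloc(values):
    """Allocating style: returns a brand-new array on every call."""
    return np.array(values, dtype=np.int32)

def get_param_inplace(values, out):
    """In-place style: overwrite a caller-owned buffer, keeping its storage."""
    np.copyto(out[: len(values)], np.asarray(values, dtype=np.int32))
    return out

# Caller allocates once with a fixed capacity (hypothetical size).
buf = np.zeros(4, dtype=np.int32)
addr_before = buf.ctypes.data
get_param_inplace([3, 1, 2], buf)
get_param_inplace([9, 8], buf)
addr_after = buf.ctypes.data
```

Entries beyond the newly written prefix keep their old values, so a caller using this pattern also has to track how many entries are valid.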
block_attn_spliced.cc, line 244:
    - if (FLAGS_encoder_splice) {
    + // TODO: spliced split kvcache should support rope3d
    + if (FLAGS_encoder_splice && !rope_3d) {
        if (rope_3d) {
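🟡 Suggestion: With !rope_3d added to the outer condition, the inner if (rope_3d) can never be true, so it is dead code. The same problem exists in both the encoder (here) and the decoder (line 553).
Consider removing the inner rope_3d check and the PD_THROW, or explain in the TODO comment why they are kept:
    // TODO: spliced split kvcache should support rope3d
    if (FLAGS_encoder_splice && !rope_3d) {
      // The if (rope_3d) below is now unreachable; consider removing it

    @@ -157,7 +221,7 @@ std::vector<paddle::Tensor> GetInferParam(
      }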
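❓ Question: The loop upper bound changed from bsz to bsz + 1. decoder_seq_lod_vec has size bsz + 1, so the access stays in bounds, and logically the last lod value also needs to be processed.
To confirm: does this fix an earlier bug in which the last element was skipped and the monotonic fill was left incomplete, or is it a new requirement in cudagraph mode? Consider adding a comment with the reason for the change to ease future maintenance.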
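Why the bound matters can be seen with a toy LoD vector. The loop body in get_infer_param.cc is not shown in this excerpt, so the copy loop below is a hypothetical stand-in; it only demonstrates that a vector of cumulative offsets for bsz sequences has bsz + 1 entries, and that iterating to bsz leaves the last one (the total token count) untouched.

```python
def build_lod(seq_lens):
    """Cumulative offsets: len(seq_lens) + 1 entries, starting at 0."""
    lod = [0]
    for n in seq_lens:
        lod.append(lod[-1] + n)
    return lod

def copy_lod(src, dst, upper):
    """Hypothetical loop: copy the first `upper` entries of src into dst."""
    for i in range(upper):
        dst[i] = src[i]
    return dst

seq_lens = [2, 3, 1]  # made-up per-sequence lengths
bsz = len(seq_lens)
lod = build_lod(seq_lens)
# -1 marks entries the loop never wrote.
short = copy_lod(lod, [-1] * (bsz + 1), bsz)
full = copy_lod(lod, [-1] * (bsz + 1), bsz + 1)
```

With a reused in-place buffer, an unwritten final entry would carry whatever value a previous call left there, which is one plausible reason for extending the bound.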
Motivation
Make get_infer_param use in-place copy and remove the redundant D2H copy of block_tables.
Modifications
Make get_infer_param use in-place copy and remove the redundant D2H copy of block_tables.
Usage or Command
None
Accuracy Tests
None
Checklist
[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
pre-commit before commit.
release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.