
[XPU] get_infer_param use inplace copy, remove block_tables abundant d2h copy#7431

Open
RuohengMa wants to merge 22 commits into PaddlePaddle:develop from RuohengMa:new_decouple_inplace_copy_block_tables


@RuohengMa
Contributor

Motivation

Make get_infer_param use in-place copy, and remove the redundant D2H copy of block_tables.

Modifications

Make get_infer_param use in-place copy, and remove the redundant D2H copy of block_tables.

Usage or Command

None

Accuracy Tests

None

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Apr 16, 2026

Thanks for your contribution!

@paddle-bot paddle-bot bot added the XPU label Apr 16, 2026
PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter

codecov-commenter commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 4.34783% with 22 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@91b8bf2). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/model_executor/forward_meta.py | 4.34% | 22 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7431   +/-   ##
==========================================
  Coverage           ?   73.84%           
==========================================
  Files              ?      398           
  Lines              ?    55000           
  Branches           ?     8613           
==========================================
  Hits               ?    40614           
  Misses             ?    11671           
  Partials           ?     2715           
| Flag | Coverage Δ |
| --- | --- |
| GPU | 73.84% <4.34%> (?) |



@RuohengMa RuohengMa force-pushed the new_decouple_inplace_copy_block_tables branch from 6d21f41 to 1dceb07 on April 17, 2026 09:33

@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 AI Code Review | 2026-04-17 17:45 CST

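📋 Review Summary

PR overview: Converts the get_infer_param operator to in-place copy mode and eliminates the redundant D2H copy of block_tables; also fixes a branching issue in block_attn_spliced for the rope_3d case.
Change scope: custom_ops/xpu_ops/ (C++ operator implementation + tests), fastdeploy/model_executor/ (Python call sites + ForwardMeta)
Impact tags: XPU, OP, Optimization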

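📝 PR Convention Check

The PR title already includes the [XPU] tag, as required. However, the Modifications section is identical to Motivation; consider adding a description of the specific changes. In addition, none of the Checklist items are checked.

Suggested description (can be copied directly):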

Modifications

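  1. get_infer_param operator switched to in-place mode: the 22 output tensors become inputs with in-place outputs, avoiding a fresh memory allocation on every call
  2. Removed the D2H copy of block_tables in lod_to_slot_mapping; the slot mapping is now computed directly on the XPU using index_select
  3. block_attn_spliced: skip the splice path when rope_3d=true, to avoid triggering PD_THROW
  4. Python side: xpu_pre_and_post_process supports in-place copy in cudagraph mode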
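
Item 2 above can be illustrated with a minimal sketch of a slot-mapping computation expressed as a gather over block_tables. This is not the PR's implementation: the function name `slot_mapping`, the example data, and the layout `slot = physical_block * block_size + offset` are assumptions based on the usual paged KV-cache convention. Plain Python lists stand in for device tensors; on the device, the lookup marked "gather" is the step an index_select-style op would perform without copying block_tables to the host.

```python
def slot_mapping(block_tables, seq_lens, block_size):
    """Map every token position of every sequence to a flat KV-cache slot.

    Assumed layout (not taken from the PR):
        slot = physical_block_id * block_size + offset_in_block
    """
    slots = []
    for batch_idx, seq_len in enumerate(seq_lens):
        for pos in range(seq_len):
            logical_block = pos // block_size
            # "gather": look up the physical block for this logical block
            physical_block = block_tables[batch_idx][logical_block]
            slots.append(physical_block * block_size + pos % block_size)
    return slots


# Hypothetical example: two sequences, block size 4.
block_tables = [[3, 1], [0, 2]]
seq_lens = [5, 2]
mapping = slot_mapping(block_tables, seq_lens, block_size=4)
print(mapping)  # [12, 13, 14, 15, 4, 0, 1]
```

Because every step is an index lookup plus integer arithmetic, the whole computation can stay on the device, which is the point the suggested description makes.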

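Issues

| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | block_attn_spliced.cc:244 | After !rope_3d was added to the outer condition, the inner if (rope_3d) becomes dead code |
| ❓ Question | get_infer_param.cc:221 | The loop upper bound changed from bsz to bsz + 1; the intent of the change needs confirmation |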

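Overall Assessment

The core optimization is clear: in-place copy avoids repeatedly allocating tensors, and index_select replaces the D2H copy plus CPU computation, eliminating the redundant copy of block_tables. Both clearly help performance. The code logic is correct and the test files have been updated accordingly. Two minor issues deserve attention: cleaning up the dead code in block_attn_spliced.cc, and explaining the intent behind the change to the loop upper bound over decoder_seq_lod_vec.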
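
As a rough illustration of the in-place idea discussed in this review, the sketch below reuses one preallocated buffer across calls instead of returning a new object each time. The helper name `fill_inplace` and the buffer size are hypothetical, and a Python list stands in for a device tensor; the PR's actual operator signature is not shown here.

```python
def fill_inplace(out_buffer, values):
    """Overwrite the leading entries of a preallocated buffer.

    No new buffer is allocated; the caller keeps using the same object.
    """
    if len(values) > len(out_buffer):
        raise ValueError("preallocated buffer is too small")
    # Equal-length slice assignment mutates the list without resizing it.
    out_buffer[:len(values)] = values
    return out_buffer


buffer = [0] * 8          # allocated once, e.g. sized for the maximum batch
buffer_id = id(buffer)

fill_inplace(buffer, [5, 7, 9])
fill_inplace(buffer, [1, 2])

print(buffer)                   # [1, 2, 9, 0, 0, 0, 0, 0]
print(id(buffer) == buffer_id)  # True
```

Note that entries beyond the newly written prefix keep their old values (the `9` above), so a caller using this pattern has to track the valid length separately. A stable buffer address is also what graph-capture modes generally rely on.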

-  if (FLAGS_encoder_splice) {
+  // TODO: spliced split kvcache should support rope3d
+  if (FLAGS_encoder_splice && !rope_3d) {
     if (rope_3d) {

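🟡 Suggestion: With !rope_3d added to the outer condition, the inner if (rope_3d) can never be true and is dead code. The same issue exists in both the encoder path (here) and the decoder path (line 553).

Consider removing the inner rope_3d check and its PD_THROW, or explaining in the TODO comment why it is kept: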

// TODO: spliced split kvcache should support rope3d
if (FLAGS_encoder_splice && !rope_3d) {
    // the following if (rope_3d) is now unreachable; consider removing it
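
The unreachability claim can be checked exhaustively. The sketch below mirrors only the two conditions quoted in the review; the function name `inner_branch_reached` is hypothetical and the real C++ body is not reproduced.

```python
from itertools import product


def inner_branch_reached(encoder_splice, rope_3d):
    """Return True only if the inner `if (rope_3d)` branch would execute."""
    if encoder_splice and not rope_3d:   # outer guard after the change
        if rope_3d:                      # inner check from the original code
            return True
    return False


# Try all four combinations of the two flags.
reachable = [
    inner_branch_reached(splice, rope)
    for splice, rope in product([False, True], repeat=2)
]
print(any(reachable))  # False
```

No combination of the two flags reaches the inner branch, which is why the check and its PD_THROW can be removed without changing behavior.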

@@ -157,7 +221,7 @@ std::vector<paddle::Tensor> GetInferParam(
}

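❓ Question: The loop upper bound changed from bsz to bsz + 1. decoder_seq_lod_vec has size bsz + 1, so the access does not go out of bounds, and logically the last lod value also needs to be processed.

To confirm: does this fix a bug where the last element was previously skipped, leaving the monotonic fill incomplete, or is it a new requirement in cudagraph mode? Consider adding a comment explaining the reason for the change to help future maintenance.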
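
To make the off-by-one concrete, here is a small sketch assuming decoder_seq_lod_vec follows the usual lod convention of cumulative sequence offsets with a leading zero. That convention, the helper `build_lod`, and the example lengths are assumptions, not taken from the PR.

```python
from itertools import accumulate


def build_lod(seq_lens):
    """Cumulative offsets with a leading 0, so the result has len(seq_lens) + 1 entries."""
    return [0] + list(accumulate(seq_lens))


seq_lens = [3, 0, 2]          # hypothetical per-request decoder lengths
bsz = len(seq_lens)
lod = build_lod(seq_lens)

visited_old = [lod[i] for i in range(bsz)]       # old bound: bsz
visited_new = [lod[i] for i in range(bsz + 1)]   # new bound: bsz + 1

print(lod)          # [0, 3, 3, 5]
print(visited_old)  # [0, 3, 3]
print(visited_new)  # [0, 3, 3, 5]
```

Under this assumption, a loop bounded by bsz never touches the final entry, which holds the total token count, while bsz + 1 covers the whole vector without going out of bounds.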


3 participants