[XPU] get_infer_param use inplace copy, remove block_tables abundant d2h copy by RuohengMa · Pull Request #7431 · PaddlePaddle/FastDeploy

RuohengMa · 2026-04-16T07:06:16Z

Motivation

Make get_infer_param use in-place copy, and remove the redundant D2H copy of block_tables.

Modifications

Make get_infer_param use in-place copy, and remove the redundant D2H copy of block_tables.

Usage or Command

None

Accuracy Tests

None

Checklist

Add at least one tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-04-16T07:06:29Z

Thanks for your contribution!

codecov-commenter · 2026-04-16T08:33:39Z

Codecov Report

❌ Patch coverage is 4.34783% with 22 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@91b8bf2).

Files with missing lines	Patch %	Lines
fastdeploy/model_executor/forward_meta.py	4.34%	22 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7431   +/-   ##
==========================================
  Coverage           ?   73.84%           
==========================================
  Files              ?      398           
  Lines              ?    55000           
  Branches           ?     8613           
==========================================
  Hits               ?    40614           
  Misses             ?    11671           
  Partials           ?     2715

Flag	Coverage Δ
GPU	`73.84% <4.34%> (?)`


PaddlePaddle-bot

🤖 AI Code Review | 2026-04-17 17:45 CST

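📋 Review Summary

PR overview: Converts the get_infer_param operator to in-place copy mode and eliminates the redundant D2H copy of block_tables; also fixes a branching problem in block_attn_spliced when rope_3d is enabled.
Scope of changes: custom_ops/xpu_ops/ (C++ operator implementation + tests), fastdeploy/model_executor/ (Python call sites + ForwardMeta)
Affected tags: XPU, OP, Optimization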

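📝 PR Convention Check

The PR title includes the [XPU] tag and meets the convention. However, the Modifications section is identical to Motivation; please add a description of the specific changes. In addition, none of the Checklist items are checked.

Suggested description (can be copied directly):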

Modifications

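Convert the get_infer_param operator to in-place mode: the 22 output tensors become inputs with in-place outputs, avoiding memory reallocation on every call

Remove the D2H copy of block_tables in lod_to_slot_mapping; use index_select to compute the slot mapping directly on the XPU

Skip the splice path in block_attn_spliced when rope_3d=true, to avoid triggering PD_THROW

Support in-place copy under cudagraph mode in xpu_pre_and_post_process on the Python side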
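The in-place pattern described in the list above can be illustrated with a minimal pure-Python sketch. It is not the actual get_infer_param implementation: the function names, the buffer size, and the use of plain lists in place of device tensors are all assumptions made for illustration. The point is only that the caller owns the output buffer and its identity stays stable across calls, which is the property that graph capture and replay typically rely on.

```python
def compute_params_fresh(seq_lens):
    """Allocating style (hypothetical): returns a brand-new list on every call."""
    return [n for n in seq_lens]


def compute_params_inplace(seq_lens, out):
    """In-place style (hypothetical): overwrite a caller-owned buffer.

    Entries beyond len(seq_lens) are reset to 0 so stale values from a
    previous, larger batch do not leak into the current step.
    """
    for i in range(len(out)):
        out[i] = seq_lens[i] if i < len(seq_lens) else 0
    return out


buf = [0] * 4  # allocated once, reused for every step
first = compute_params_inplace([3, 5], buf)
second = compute_params_inplace([7, 1, 2], buf)

# Both calls hand back the very same object; only its contents changed.
assert first is buf and second is buf
print(buf)  # [7, 1, 2, 0]
```

With the allocating style, each call would produce a different object, so anything that captured the address of an earlier result would no longer see the new values.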

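Issues

Severity	File	Summary
🟡 Suggestion	`block_attn_spliced.cc:244`	With `!rope_3d` added to the outer condition, the inner `if (rope_3d)` becomes dead code
❓ Question	`get_infer_param.cc:221`	The loop upper bound changed from `bsz` to `bsz + 1`; the intent of the change needs confirmation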

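Overall Assessment

The core optimization is clear: in-place copy avoids repeated tensor allocation, and index_select replaces the D2H copy plus CPU computation, eliminating the redundant copy of block_tables. This should help performance noticeably. The code logic is correct, and the test files have been updated accordingly. Two minor points deserve attention: cleaning up the dead code in block_attn_spliced.cc, and explaining the intent behind the changed loop upper bound for decoder_seq_lod_vec.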
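As a rough illustration of how a slot mapping can be computed with a gather instead of a host-side loop over a copied block table, here is a pure-Python sketch. It assumes the common paged KV cache convention that a token at position `pos` in a sequence lands in slot `block_id * block_size + pos % block_size`, where `block_id` is entry `pos // block_size` of that sequence's block table row. The helper names and the `index_select` stand-in are hypothetical; the real lod_to_slot_mapping kernel may differ.

```python
def index_select(table, indices):
    """Stand-in for a device-side gather along axis 0 of a flat table."""
    return [table[i] for i in indices]


def slot_mapping(block_tables, seq_starts, seq_lens, block_size):
    """Map every new token to its KV cache slot (assumed convention)."""
    max_blocks = len(block_tables[0])
    # Flatten [bsz, max_blocks] so one gather can serve all tokens.
    flat_table = [b for row in block_tables for b in row]

    gather_idx, offsets = [], []
    for seq_id, (start, length) in enumerate(zip(seq_starts, seq_lens)):
        for pos in range(start, start + length):
            gather_idx.append(seq_id * max_blocks + pos // block_size)
            offsets.append(pos % block_size)

    # On a device this step would be the index_select; block_tables
    # itself never has to be copied back to the host.
    block_ids = index_select(flat_table, gather_idx)
    return [b * block_size + off for b, off in zip(block_ids, offsets)]


block_tables = [[7, 2], [4, 9]]
slots = slot_mapping(block_tables, seq_starts=[0, 3], seq_lens=[3, 2], block_size=4)
print(slots)  # [28, 29, 30, 19, 36]
```

In this example the second sequence crosses a block boundary: position 3 is the last slot of block 4, and position 4 is the first slot of block 9.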

PaddlePaddle-bot · 2026-04-17T09:45:50Z

-  if (FLAGS_encoder_splice) {
+  // TODO: spliced split kvcache should support rope3d
+  if (FLAGS_encoder_splice && !rope_3d) {
    if (rope_3d) {


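🟡 Suggestion: With !rope_3d added to the outer condition, the inner if (rope_3d) can never be true and is dead code. The same issue exists in both the encoder path (here) and the decoder path (line 553).

Consider removing the inner rope_3d check and its PD_THROW, or explain in the TODO comment why they are kept:

// TODO: spliced split kvcache should support rope3d
if (FLAGS_encoder_splice && !rope_3d) {
  // The following if (rope_3d) is now unreachable; consider removing it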
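The reachability argument can be checked exhaustively with a small Python model of the control flow. This is a simplified sketch, not the C++ source: `select_path` and its return strings are hypothetical, and the `RuntimeError` merely stands in for PD_THROW.

```python
from itertools import product


def select_path(encoder_splice: bool, rope_3d: bool) -> str:
    """Simplified model of the guarded branch in the diff above."""
    if encoder_splice and not rope_3d:
        if rope_3d:  # can never be true inside this branch
            raise RuntimeError("splice path does not support rope_3d")
        return "splice"
    return "non-splice"


# Enumerate all four flag combinations; none of them raises.
results = {
    (splice, rope): select_path(splice, rope)
    for splice, rope in product([False, True], repeat=2)
}
print(results[(True, True)])   # non-splice
print(results[(True, False)])  # splice
```

Since no input reaches the inner check, removing it does not change behavior in this model.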

PaddlePaddle-bot · 2026-04-17T09:45:50Z

@@ -157,7 +221,7 @@ std::vector<paddle::Tensor> GetInferParam(
  }


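❓ Question: The loop upper bound changed from bsz to bsz + 1. decoder_seq_lod_vec has size bsz + 1, so the access stays in bounds, and logically the last lod value also needs to be processed.

To confirm: does this fix a bug where the last element was previously skipped, leaving the monotonic fill incomplete, or is it a new requirement under cudagraph mode? Consider adding a comment explaining the reason for the change to ease future maintenance.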
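To make the off-by-one concern concrete, the sketch below builds a LoD (prefix-sum offset) vector for `bsz` sequences and copies it into a preallocated buffer with each loop bound. The helper names and the `-1` sentinel are hypothetical, and the real loop in get_infer_param.cc may do more than a plain copy; the sketch only shows that a bound of `bsz` leaves the final entry untouched.

```python
def build_lod(seq_lens):
    """Prefix sums with a leading 0: bsz lengths give bsz + 1 offsets."""
    lod = [0]
    for n in seq_lens:
        lod.append(lod[-1] + n)
    return lod


def copy_lod(lod, out, upper):
    """Copy the first `upper` entries of lod into a preallocated buffer."""
    for i in range(upper):
        out[i] = lod[i]
    return out


seq_lens = [2, 3, 1]
bsz = len(seq_lens)
lod = build_lod(seq_lens)

short = copy_lod(lod, [-1] * (bsz + 1), bsz)      # old bound
full = copy_lod(lod, [-1] * (bsz + 1), bsz + 1)   # new bound
print(short)  # [0, 2, 5, -1]
print(full)   # [0, 2, 5, 6]
```

The last offset is the total token count, so whatever consumes the buffer would see a stale value there when the old bound is used with a reused buffer.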

RuohengMa temporarily deployed to Metax_ci April 16, 2026 07:06 — with GitHub Actions Inactive

paddle-bot bot added the XPU label Apr 16, 2026