Inference optimization for graph_def.pb on S3000 #191

Open
Aloyshaaaa wants to merge 2 commits into MooreThreads:main from Aloyshaaaa:test-340fdb-archfix
@Aloyshaaaa
Contributor

Changes

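  • S3000 offload target: adds an explicit --offload-arch flag to the MUSA kernel compilation in CMakeLists.txt. This fixes the invalid device function (err=98) launch failure that occurred because mcc's default architecture does not match the S3000.
  • MatMul+BiasAdd fusion: FusedMatMul and LinearRelu now use RunWithBiasAdd, which fuses MatMul+BiasAdd into a single kernel launch and reduces inference latency.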
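
To illustrate why the MatMul+BiasAdd fusion is safe, here is a minimal pure-Python sketch. It is not the plugin's kernel and says nothing about how RunWithBiasAdd is implemented; the function names `matmul`, `bias_add`, and `matmul_bias_fused` are hypothetical. It only shows that adding the bias while each output element is produced gives the same result as a separate second pass, which is what allows two launches to collapse into one.

```python
def matmul(a, b):
    """Plain matrix product of nested lists: (m x k) @ (k x n)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def bias_add(y, bias):
    """Unfused second pass: add bias[j] to column j of every row."""
    return [[v + bias[j] for j, v in enumerate(row)] for row in y]


def matmul_bias_fused(a, b, bias):
    """Fused single pass: the bias is added as each output element is produced."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b))) + bias[j]
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Illustrative values only, not taken from graph_def.pb.
a = [[1.0, 2.0], [3.0, 4.0]]
w = [[5.0, 6.0], [7.0, 8.0]]
bias = [0.5, -0.5]

unfused = bias_add(matmul(a, w), bias)
fused = matmul_bias_fused(a, w, bias)
assert unfused == fused
print(fused)  # [[19.5, 21.5], [43.5, 49.5]]
```

On a GPU, the fused form also avoids writing the intermediate MatMul output to memory and reading it back, in addition to saving one launch.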

Performance validation

Batch sweep test of graph_def.pb on MTT S3000 + TF 2.6.1 + MUSA Plugin (5 runs per batch size, best result taken):

| Batch size | Best throughput (samples/s) | P50 (s) | P90 (s) |
|---|---|---|---|
| 1 | 97 | 0.0103 | 0.0106 |
| 2 | 180 | 0.0111 | 0.0114 |
| 4 | 321 | 0.0124 | 0.0128 |
| 5 | 382 | 0.0131 | 0.0134 |
| 16 | 1,097 | 0.0146 | 0.0149 |
| 32 | 1,937 | 0.0165 | 0.0168 |
| 64 | 3,619 | 0.0177 | 0.0180 |
| 100 | 5,293 | 0.0188 | 0.0192 |
| 128 | 6,532 | 0.0196 | 0.0199 |
| 256 | 10,449 | 0.0245 | 0.0248 |
| 512 | 10,705 | 0.0478 | 0.0485 |
| 1024 | 12,986 | 0.0788 | 0.0797 |
| 2048 | 12,616 | 0.1623 | 0.1639 |
| 4096 | 12,553 | 0.3264 | 0.3300 |

Peak throughput: 12,986 samples/s @ batch_size=1024
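
The PR does not include the script that produced these numbers. The following is a rough sketch of how a batch sweep of this shape could be measured, under stated assumptions: `run_batch` is a hypothetical callable that runs one inference at a given batch size, best throughput is taken as batch size divided by the fastest of the repeated runs, and P50/P90 are nearest-rank percentiles of the per-run latencies. The actual benchmark may define these differently.

```python
import math
import time


def percentile(sorted_vals, pct):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def summarize(batch_size, latencies):
    """Reduce per-run latencies (seconds) to one row of the sweep table."""
    ordered = sorted(latencies)
    return {
        "batch_size": batch_size,
        # Assumption: "best" means the fastest of the repeated runs.
        "best_throughput": batch_size / ordered[0],
        "p50": percentile(ordered, 50),
        "p90": percentile(ordered, 90),
    }


def sweep(run_batch, batch_sizes, repeats=5):
    """Time run_batch(batch_size) `repeats` times for each batch size."""
    rows = []
    for bs in batch_sizes:
        latencies = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_batch(bs)  # hypothetical: one inference call at this batch size
            latencies.append(time.perf_counter() - start)
        rows.append(summarize(bs, latencies))
    return rows


# Usage with made-up latencies (not measurements from this PR):
row = summarize(4, [0.02, 0.01, 0.04, 0.03, 0.05])
print(row["p50"], row["p90"])
```

On a real device the timed call would need to block until the result is available, and a few warm-up iterations before timing would keep one-time initialization out of the measurements.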

Admin and others added 2 commits April 24, 2026 17:01
The default mcc target selection was producing kernels that could launch on some stacks but hit invalid device function on the S3000 path. This change makes the offload arch explicit and defaults it to mp_21 while keeping an override for other cards.

Constraint: Must keep existing build.sh entrypoint and avoid new dependencies

Rejected: Rely on mcc default arch detection | produced mismatched kernels on this machine

Confidence: high

Scope-risk: narrow

Reversibility: clean

Directive: If you benchmark on a different MUSA card, set MUSA_TARGET_ARCH explicitly before changing this default

Tested: CMake configure emits explicit offload arch flag on S3000 branch

Not-tested: Cross-card behavior on non-S3000 devices
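
The default-plus-override behavior described in the commit message above lives in CMakeLists.txt, which is not shown here. Purely as an illustration, the selection logic can be mirrored in Python. This sketch assumes that MUSA_TARGET_ARCH is read from the environment and that the flag is spelled `--offload-arch=<arch>`; neither detail is confirmed by this page. The helper `offload_arch_flag` is hypothetical, and `mp_22` is only an example override value.

```python
import os

DEFAULT_MUSA_ARCH = "mp_21"  # default named in the commit message


def offload_arch_flag(env=None):
    """Return the explicit offload-arch flag, honoring an override if set.

    Assumption: the override arrives through a MUSA_TARGET_ARCH variable;
    the real build may read a CMake cache variable instead.
    """
    env = os.environ if env is None else env
    arch = env.get("MUSA_TARGET_ARCH", "").strip() or DEFAULT_MUSA_ARCH
    return f"--offload-arch={arch}"


print(offload_arch_flag({}))                             # --offload-arch=mp_21
print(offload_arch_flag({"MUSA_TARGET_ARCH": "mp_22"}))  # --offload-arch=mp_22
```

Treating an empty value as unset keeps a stray `MUSA_TARGET_ARCH=` in the environment from producing a flag with no architecture.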