Add fix for fastmath corner case with failing tests via sbgemm_neon_kernel #28394

Open
JonathanC-ARM wants to merge 3 commits into microsoft:main from JonathanC-ARM:jonclo01/sbgemm_neon_gating

Description

This change fixes sporadic onnxruntime_test_all failures observed when building with KleidiAI enabled and --no_sve.

In this configuration, some MatMul cases use the ARM64 BF16 fastmath path, which is backed by the NEON SBGemm kernel rather than a KleidiAI kernel.
That kernel consumes A in groups of 4 floats. For K-tail cases such as K=13, the final block contains only one valid A value, but the kernel can still read up to three additional floats beyond the logical A row.
B is already packed/padded for this path, but A is not. If the overread A values contain NaN or otherwise invalid data, the result can diverge because NaN * 0 still propagates NaN.
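
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual SBGemm kernel: `DotInGroupsOf4` and `RunK13WithTail` are hypothetical names, the arithmetic is plain float rather than BF16, and the A buffer is deliberately allocated with 16 slots so the "overread" stays inside valid memory. It only illustrates why zero padding in B does not neutralise a NaN in the A tail.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for a kernel that consumes A four floats at a time.
// It always processes whole groups, so both buffers must hold at least
// k rounded up to a multiple of 4 readable floats.
float DotInGroupsOf4(const float* a, const float* b_padded, std::size_t k) {
  assert(k > 0);
  const std::size_t k_rounded = (k + 3) / 4 * 4;
  float acc = 0.0f;
  for (std::size_t i = 0; i < k_rounded; i += 4) {
    for (std::size_t lane = 0; lane < 4; ++lane) {
      // For the last group of K=13, lanes 1..3 lie beyond the logical row.
      acc += a[i + lane] * b_padded[i + lane];
    }
  }
  return acc;
}

// Builds a K=13 row of ones, fills the 3 slots after it with `tail_value`,
// and multiplies against B = 2.0 zero-padded to 16 entries.
float RunK13WithTail(float tail_value) {
  constexpr std::size_t k = 13;
  std::vector<float> a(16, 1.0f);
  std::vector<float> b(16, 0.0f);
  for (std::size_t i = 0; i < k; ++i) b[i] = 2.0f;
  for (std::size_t i = k; i < a.size(); ++i) a[i] = tail_value;
  return DotInGroupsOf4(a.data(), b.data(), k);
}
```

With a zero tail the result is the expected 26; with a NaN tail the padded zeros in B do not help, because `NaN * 0.0f` is NaN under IEEE 754 and the accumulator becomes NaN.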

This change avoids the unsafe NEON SBGemm fastmath path when K is not a multiple of 4. Those K-tail cases fall back to the existing SGEMM path, while aligned-K cases continue to use BF16 fastmath.
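
As a rough illustration of the gating idea, assuming hypothetical names throughout (`kFastMathKAlignment`, `UseBf16FastMath`, `MatMulSketch`; none are taken from matmul.h or matmul.cc), the same predicate would need to be evaluated at pre-packing time and at compute time so that the two stages never disagree about whether B was packed for BF16:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical constant: the kernel consumes A in groups of 4 floats.
constexpr std::size_t kFastMathKAlignment = 4;

enum class GemmPath { kBf16FastMath, kSgemm };

// Single predicate shared by pre-packing and compute.
constexpr bool UseBf16FastMath(bool fastmath_enabled, std::size_t k) {
  return fastmath_enabled && (k % kFastMathKAlignment == 0);
}

// Toy model of an operator with a pre-pack stage and a compute stage.
struct MatMulSketch {
  bool fastmath_enabled = true;
  bool b_is_bf16_packed = false;

  // Pack B for BF16 only when the fastmath path will actually be taken.
  void PrePack(std::size_t k) {
    b_is_bf16_packed = UseBf16FastMath(fastmath_enabled, k);
  }

  // K-tail cases (k % 4 != 0) fall back to the SGEMM path.
  GemmPath Compute(std::size_t k) const {
    const bool use_fastmath = UseBf16FastMath(fastmath_enabled, k);
    assert(use_fastmath == b_is_bf16_packed);  // stages must agree
    return use_fastmath ? GemmPath::kBf16FastMath : GemmPath::kSgemm;
  }
};
```

In this sketch, K=13 selects the SGEMM path and K=16 keeps the BF16 fastmath path, matching the behaviour described above.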

Repro

Build with KleidiAI enabled and --no_sve, then run:

./onnxruntime_test_all \
  --gtest_random_seed=2345 \
  --gtest_brief=1 \
  --gtest_filter="QDQTransformerTests*" 

Example failure:

[  FAILED  ] QDQTransformerTests.DQMatMulPerTensorWithBlockSizeOption
expected -146.433, got -146.436, diff: 0.00294495, tol=0.00147433

Validation

Validated with:

./onnxruntime_test_all \
  --gtest_random_seed=2345 \
  --gtest_filter="*QDQ*" \
  --gtest_repeat=10 \
  --gtest_break_on_failure \
  --gtest_brief=1

Also validated the previously failing QDQ MatMul tests over repeated runs.

Added a regression test to onnxruntime_provider_test that guards against this MatMul K-tail corner case:

./onnxruntime_provider_test --gtest_filter="MathOpTest.MatMulFloatTypeFastMathKTailFallsBackToSgemm" 
[ RUN      ] MathOpTest.MatMulFloatTypeFastMathKTailFallsBackToSgemm
[symbolize_elf.inc : 378] RAW: Unable to get high fd: rc=0, limit=1024
/home/jonclo01/kfi-devenv/repos/onnxruntime/onnxruntime/test/providers/cpu/math/matmul_fastmath_test.cc:230: Failure
Value of: std::isfinite(output_data[i])
  Actual: false
Expected: true
Output 0 should not include padded A tail values.
Stack trace:
  0xaaaaab2f8428: onnxruntime::test::MathOpTest_MatMulFloatTypeFastMathKTailFallsBackToSgemm_Test::TestBody()
  0xaaaaac946c30: testing::internal::HandleExceptionsInMethodIfSupported<>()
  0xaaaaac947048: testing::Test::Run()
  0xaaaaac9474e0: testing::TestInfo::Run()
... Google Test internal frames ...

[  FAILED  ] MathOpTest.MatMulFloatTypeFastMathKTailFallsBackToSgemm (13 ms)
[----------] 1 test from MathOpTest (13 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (14 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] MathOpTest.MatMulFloatTypeFastMathKTailFallsBackToSgemm

…ernel

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
…correctly

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>

Copilot AI left a comment

Pull request overview

This PR addresses sporadic ARM64 Linux test failures when KleidiAI is enabled without SVE by preventing the BF16 fastmath (NEON SBGemm) MatMul path from running when the shared dimension K is not a multiple of 4, avoiding unsafe tail overreads and NaN propagation.

Changes:

  • Gate the ARM64 BF16 fastmath path (SBGemm) on K % 4 == 0 in both pre-packing and compute.
  • Add an ARM64/Linux regression test that pads the A-buffer tail with NaNs to ensure K-tail cases don’t leak invalid values into outputs.
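
The NaN-padding technique in the second bullet can be sketched independently of ONNX Runtime. This is not the code in matmul_fastmath_test.cc; the helper names (`MakePoisonedBuffer`, `ReferenceMatMul`, `AllFinite`) are hypothetical, and a naive reference multiply stands in for the operator under test. The idea is to place NaNs directly after the logical A data so that any read past the row shows up as a non-finite output.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Allocates logical_size + tail_pad floats. The logical part is set to
// `fill`; the tail is poisoned with NaN to expose reads past the end.
std::vector<float> MakePoisonedBuffer(std::size_t logical_size,
                                      std::size_t tail_pad, float fill) {
  assert(tail_pad > 0);
  std::vector<float> buf(logical_size + tail_pad,
                         std::numeric_limits<float>::quiet_NaN());
  std::fill_n(buf.begin(), logical_size, fill);
  return buf;
}

// Naive row-major C[m x n] = A[m x k] * B[k x n]. It touches only the
// logical m*k elements of A, so poisoned tail values must not leak.
std::vector<float> ReferenceMatMul(const float* a, const float* b,
                                   std::size_t m, std::size_t k,
                                   std::size_t n) {
  std::vector<float> c(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t p = 0; p < k; ++p)
      for (std::size_t j = 0; j < n; ++j)
        c[i * n + j] += a[i * k + p] * b[p * n + j];
  return c;
}

// True when every output element is finite (no NaN or infinity).
bool AllFinite(const std::vector<float>& values) {
  return std::all_of(values.begin(), values.end(),
                     [](float v) { return std::isfinite(v); });
}
```

A check in this style would build A with `MakePoisonedBuffer(m * k, 3, ...)` for a K such as 13, run the multiply on the logical view, and assert `AllFinite` on the output.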

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| onnxruntime/test/providers/cpu/math/matmul_fastmath_test.cc | Adds a regression test that detects NaN propagation from K-tail overreads when fastmath is enabled. |
| onnxruntime/core/providers/cpu/math/matmul.h | Introduces a K-alignment constant (4) for the ARM64 fastmath SBGemm gating logic. |
| onnxruntime/core/providers/cpu/math/matmul.cc | Applies K % 4 == 0 checks to avoid using SBGemm (and BF16 B prepacking) for misaligned K-tail cases. |
