softmax: adjust vectorization length according to shape #1781

Open · wants to merge 2 commits into base: main
Conversation

@weishi-deng (Contributor) commented Jun 25, 2025

To optimize the performance of softmax for smaller shapes, this PR adjusts the vectorization size adaptively according to the input shape.
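For context, here is a minimal scalar (non-vectorized) reference of row-wise softmax. This is an illustrative baseline only, not the vectorized SYCL kernel touched by this PR:

```cpp
#include <cmath>
#include <vector>

// Illustrative scalar baseline: numerically stable softmax over one row.
// The actual torch-xpu-ops kernel vectorizes these loads/stores; this PR
// adapts the vector length to the row size.
void softmax_row(const float* in, float* out, int dim_size) {
    // Subtract the row max before exponentiating to avoid overflow.
    float max_val = in[0];
    for (int i = 1; i < dim_size; ++i)
        max_val = std::fmax(max_val, in[i]);
    float sum = 0.f;
    for (int i = 0; i < dim_size; ++i) {
        out[i] = std::exp(in[i] - max_val);
        sum += out[i];
    }
    for (int i = 0; i < dim_size; ++i)
        out[i] /= sum;
}
```

The per-row max/exp/normalize passes are what the vectorized kernel performs with wider loads; the benchmark shapes in the table below vary `dim_size` across this loop.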

Performance data (platform: BMG)

| dtype | shape (dim=0) | main | PR | ratio % |
| --- | --- | --- | --- | --- |
| float | [64, 64] | 5.989 | 4.365 | 137% |
| float | [8192, 8192] | 2.833 | 2.818 | 101% |
| float | [64, 8192] | 21.838 | 21.802 | 100% |
| float | [8192, 64] | 197.458 | 121.979 | 162% |
| float | [1024, 1024] | 33.041 | 21.896 | 151% |
| float | [16384, 16384] | 13.342 | 13.445 | 99% |
| half | [64, 64] | 8.463 | 4.547 | 186% |
| half | [8192, 8192] | 1.346 | 1.35 | 100% |
| half | [64, 8192] | 26.718 | 26.776 | 100% |
| half | [8192, 64] | 312.901 | 131.197 | 238% |
| half | [1024, 1024] | 57.859 | 23.713 | 244% |
| half | [16384, 16384] | 5.982 | 5.974 | 100% |

@pytorchxpubot commented
@sys_pytorchxpubot triage result for run 15872303046. Triage bot UT analysis result, for reference only; note that each unique error message is reported only once:

1. third_party.torch-xpu-ops.test.xpu.quantization.core.test_workflow_ops_xpu.TestFakeQuantizeOpsXPU test_fake_quantize_per_tensor_affine_inf_xpu failed with error message:
   AssertionError: Tensor-likes are not close!

Triage bot response:

```json
{
  "similar_issue_id": 1214,
  "similar_issue_state": "open",
  "issue_owner": "daisyden",
  "issue_description": "Torch-xpu-ops pull request 1781 CI unit test third_party.torch-xpu-ops.test.xpu.quantization.core.test_workflow_ops_xpu.TestFakeQuantizeOpsXPU/test_fake_quantize_per_tensor_affine_inf_xpu failed with error message: AssertionError: Tensor-likes are not close!",
  "root_causes": [
    "Discrepancies in tensor computations between CPU and XPU during quantization operations.",
    "Potential precision issues in quantization functions on XPU.",
    "Implementation differences in quantization logic affecting tensor comparisons."
  ],
  "suggested_solutions": [
    "Investigate the quantization logic to ensure consistency across CPU and XPU.",
    "Compare computations between CPU and XPU for quantization operations to identify discrepancies.",
    "Adjust tolerance levels if necessary, but only after thorough investigation."
  ]
}
```
2. third_party.torch-xpu-ops.test.xpu.test_linalg_xpu.TestLinalgXPU test_det_xpu_complex128 failed with error message:
   AssertionError: Scalars are not close!

Triage bot response:

```json
{
  "similar_issue_id": 1214,
  "similar_issue_state": "open",
  "issue_owner": "daisyden",
  "issue_description": "In the test case `test_det_xpu_complex128`, an `AssertionError: Scalars are not close!` occurred, indicating a failure in scalar comparison during determinant computation on XPU with complex128 dtype.",
  "root_causes": [
    "Potential precision issues in tensor computations on XPU.",
    "Incorrect handling of tensor comparisons leading to assertion failures."
  ],
  "suggested_solutions": [
    "Investigate numerical precision in determinant computation on XPU.",
    "Review tensor comparison logic to ensure accurate scalar comparisons."
  ]
}
```
3. third_party.torch-xpu-ops.test.xpu.test_ops_xpu.TestCommonXPU test_numpy_ref_linalg_tensorsolve_xpu_complex128 failed with error message:
   AssertionError: Tensor-likes are not close! ; Exception: Caused by reference input at index 0: SampleInput(input=Tensor[size=(2, 3, 6), device="xpu:0", dtype=torch.complex128], args=TensorList[Tensor[size=(2, 3), device="xpu:0", dtype=torch.complex128]], kwargs={'dims': 'None'}, broadcasts_input=False, name='')

Triage bot response:

```json
{
  "similar_issue_id": 1214,
  "similar_issue_state": "open",
  "issue_owner": "daisyden",
  "issue_description": "In preci test, there are random cases on log or exp related ops will fail with 'AssertionError: Tensor-likes are not close!', need root cause. New random failure with release/2.7 RC2 pre release wheel test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_div_fastpath_outplace_xpu_complex128",
  "root_causes": [
    "Discrepancies in handling of complex128 tensors during log or exp operations on XPU.",
    "Potential precision issues or kernel behavior differences between XPU and CPU implementations."
  ],
  "suggested_solutions": [
    "Review and align XPU kernel implementations for linalg operations to match CPU behavior, especially for complex128 tensors.",
    "Enhance testing to include more thorough checks for consistency between XPU and CPU outputs in complex number operations.",
    "Investigate and address any precision-related issues in the XPU implementation of linalg.solve."
  ]
}
```

@EikanWang (Contributor) commented

@weishi-deng, please add platform information to the PR description.

Comment on lines 209 to 215:

```cpp
if (is_same_dtype) {
  if (dim_size <= 2048 && dim_size * scalar_size <= 8192)
    return false;
} else {
  if (dim_size <= 1024 && dim_size * scalar_size <= 4096)
    return false;
}
```
Contributor:
@weishi-deng, could you help elaborate on these magic numbers (2048, 1024, 4096, and 8192)? Why are these values critical for the performance of softmax? Meanwhile, please consider whether the optimization applies to other platforms.

Contributor (author):

The heuristics and magic numbers come from the PyTorch CUDA implementation: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/SoftMax.cu#L1096 and https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/SoftMax.cu#L1187. As discussed last week with Liangang and Yutao, we prefer to add adaptive choices for the vectorization in softmax and to align with the upstream implementation first. Furthermore, to decide which magic numbers we should set for all platforms, we need to collect more performance results across different shapes and platforms, as was done in pytorch/pytorch#144645.
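To restate the gating logic in isolation (the function name here is hypothetical; the thresholds are the ones quoted from the diff), the check effectively compares the inner dimension and its per-row byte footprint against fixed budgets before taking the vectorized path:

```cpp
#include <cstddef>

// Hypothetical standalone restatement of the shape-based gate under review.
// scalar_size is sizeof(element); dim_size * scalar_size is the row footprint
// in bytes (e.g. 2048 floats = 8192 bytes). Small rows take the scalar path.
bool should_vectorize(std::size_t dim_size, std::size_t scalar_size,
                      bool is_same_dtype) {
    if (is_same_dtype) {
        if (dim_size <= 2048 && dim_size * scalar_size <= 8192)
            return false;  // small row: vectorization overhead not worth it
    } else {
        if (dim_size <= 1024 && dim_size * scalar_size <= 4096)
            return false;
    }
    return true;
}
```

Under this reading, a float [64, 64] row (64 elements, 256 bytes) falls below both thresholds and stays scalar, which matches the shapes that improved in the table above.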

Contributor:

@weishi-deng, according to the description at pytorch/pytorch#144645, the optimization aimed to reduce memory access by leveraging registers. Do you observe the same behavior on XPU?

Contributor:

If you believe that the optimization of pytorch/pytorch#144645 is applicable to XPU as well, please collect data to prove it. We need to obtain insights regarding the optimization on XPU rather than just the performance results.
