softmax: adjust vectorization length according to shape #1781

Open · wants to merge 2 commits into base: main
Conversation

@weishi-deng (Contributor) commented Jun 25, 2025

To optimize the performance of softmax for smaller shapes, this PR adjusts the vectorization size adaptively according to the input shape.
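For context, here is a minimal scalar (non-vectorized) reference of row-wise softmax. This is an illustrative baseline only, not the vectorized SYCL kernel touched by this PR:

```cpp
#include <cmath>
#include <vector>

// Illustrative scalar baseline: numerically stable softmax over one row.
// The actual torch-xpu-ops kernel vectorizes these loads/stores; this PR
// adapts the vector length to the row size.
void softmax_row(const float* in, float* out, int dim_size) {
    // Subtract the row max before exponentiating to avoid overflow.
    float max_val = in[0];
    for (int i = 1; i < dim_size; ++i)
        max_val = std::fmax(max_val, in[i]);
    float sum = 0.f;
    for (int i = 0; i < dim_size; ++i) {
        out[i] = std::exp(in[i] - max_val);
        sum += out[i];
    }
    for (int i = 0; i < dim_size; ++i)
        out[i] /= sum;
}
```

The per-row max/exp/normalize passes are what the vectorized kernel performs with wider loads; the benchmark shapes in the table below vary `dim_size` across this loop.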

Performance data (platform: BMG)

| dtype | shape (dim=0) | main | PR | ratio % |
| --- | --- | --- | --- | --- |
| float | [64, 64] | 5.989 | 4.365 | 137% |
| float | [8192, 8192] | 2.833 | 2.818 | 101% |
| float | [64, 8192] | 21.838 | 21.802 | 100% |
| float | [8192, 64] | 197.458 | 121.979 | 162% |
| float | [1024, 1024] | 33.041 | 21.896 | 151% |
| float | [16384, 16384] | 13.342 | 13.445 | 99% |
| half | [64, 64] | 8.463 | 4.547 | 186% |
| half | [8192, 8192] | 1.346 | 1.35 | 100% |
| half | [64, 8192] | 26.718 | 26.776 | 100% |
| half | [8192, 64] | 312.901 | 131.197 | 238% |
| half | [1024, 1024] | 57.859 | 23.713 | 244% |
| half | [16384, 16384] | 5.982 | 5.974 | 100% |

@pytorchxpubot commented
@sys_pytorchxpubot triage result for run 15872303046. Triage bot UT analysis result, for reference only; note that each unique error message is reported only once:

1. third_party.torch-xpu-ops.test.xpu.quantization.core.test_workflow_ops_xpu.TestFakeQuantizeOpsXPU test_fake_quantize_per_tensor_affine_inf_xpu failed with error message:
   AssertionError: Tensor-likes are not close!

Triage bot response:

```json
{
  "similar_issue_id": 1214,
  "similar_issue_state": "open",
  "issue_owner": "daisyden",
  "issue_description": "Torch-xpu-ops pull request 1781 CI unit test third_party.torch-xpu-ops.test.xpu.quantization.core.test_workflow_ops_xpu.TestFakeQuantizeOpsXPU/test_fake_quantize_per_tensor_affine_inf_xpu failed with error message: AssertionError: Tensor-likes are not close!",
  "root_causes": [
    "Discrepancies in tensor computations between CPU and XPU during quantization operations.",
    "Potential precision issues in quantization functions on XPU.",
    "Implementation differences in quantization logic affecting tensor comparisons."
  ],
  "suggested_solutions": [
    "Investigate the quantization logic to ensure consistency across CPU and XPU.",
    "Compare computations between CPU and XPU for quantization operations to identify discrepancies.",
    "Adjust tolerance levels if necessary, but only after thorough investigation."
  ]
}
```
2. third_party.torch-xpu-ops.test.xpu.test_linalg_xpu.TestLinalgXPU test_det_xpu_complex128 failed with error message:
   AssertionError: Scalars are not close!

Triage bot response:

```json
{
  "similar_issue_id": 1214,
  "similar_issue_state": "open",
  "issue_owner": "daisyden",
  "issue_description": "In the test case `test_det_xpu_complex128`, an `AssertionError: Scalars are not close!` occurred, indicating a failure in scalar comparison during determinant computation on XPU with complex128 dtype.",
  "root_causes": [
    "Potential precision issues in tensor computations on XPU.",
    "Incorrect handling of tensor comparisons leading to assertion failures."
  ],
  "suggested_solutions": [
    "Investigate numerical precision in determinant computation on XPU.",
    "Review tensor comparison logic to ensure accurate scalar comparisons."
  ]
}
```
3. third_party.torch-xpu-ops.test.xpu.test_ops_xpu.TestCommonXPU test_numpy_ref_linalg_tensorsolve_xpu_complex128 failed with error message:
   AssertionError: Tensor-likes are not close! ; Exception: Caused by reference input at index 0: SampleInput(input=Tensor[size=(2, 3, 6), device="xpu:0", dtype=torch.complex128], args=TensorList[Tensor[size=(2, 3), device="xpu:0", dtype=torch.complex128]], kwargs={'dims': 'None'}, broadcasts_input=False, name='')

Triage bot response:

```json
{
  "similar_issue_id": 1214,
  "similar_issue_state": "open",
  "issue_owner": "daisyden",
  "issue_description": "In preci test, there are random cases on log or exp related ops will fail with 'AssertionError: Tensor-likes are not close!', need root cause. New random failure with release/2.7 RC2 pre release wheel test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_div_fastpath_outplace_xpu_complex128",
  "root_causes": [
    "Discrepancies in handling of complex128 tensors during log or exp operations on XPU.",
    "Potential precision issues or kernel behavior differences between XPU and CPU implementations."
  ],
  "suggested_solutions": [
    "Review and align XPU kernel implementations for linalg operations to match CPU behavior, especially for complex128 tensors.",
    "Enhance testing to include more thorough checks for consistency between XPU and CPU outputs in complex number operations.",
    "Investigate and address any precision-related issues in the XPU implementation of linalg.solve."
  ]
}
```

@EikanWang (Contributor) commented

@weishi-deng, please add platform information to the PR description.

Comment on lines 209 to 215:

```cpp
if (is_same_dtype) {
  if (dim_size <= 2048 && dim_size * scalar_size <= 8192)
    return false;
} else {
  if (dim_size <= 1024 && dim_size * scalar_size <= 4096)
    return false;
}
```
Contributor:
@weishi-deng, could you help elaborate on these magic numbers (2048, 1024, 4096, and 8192)? Why are these values critical for the performance of softmax? Meanwhile, please consider whether the optimization applies to other platforms.

Contributor (author):

The heuristics and magic numbers come from the PyTorch CUDA implementation: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/SoftMax.cu#L1096 and https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/SoftMax.cu#L1187. As discussed last week with Liangang and Yutao, we prefer to add adaptive choices for the vectorization in softmax and to align with the upstream implementation first. Furthermore, to decide which magic numbers we should set for all platforms, we need to collect more performance results across different shapes and platforms, as was done in pytorch/pytorch#144645.
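To restate the gating logic in isolation (the function name here is hypothetical; the thresholds are the ones quoted from the diff), the check effectively compares the inner dimension and its per-row byte footprint against fixed budgets before taking the vectorized path:

```cpp
#include <cstddef>

// Hypothetical standalone restatement of the shape-based gate under review.
// scalar_size is sizeof(element); dim_size * scalar_size is the row footprint
// in bytes (e.g. 2048 floats = 8192 bytes). Small rows take the scalar path.
bool should_vectorize(std::size_t dim_size, std::size_t scalar_size,
                      bool is_same_dtype) {
    if (is_same_dtype) {
        if (dim_size <= 2048 && dim_size * scalar_size <= 8192)
            return false;  // small row: vectorization overhead not worth it
    } else {
        if (dim_size <= 1024 && dim_size * scalar_size <= 4096)
            return false;
    }
    return true;
}
```

Under this reading, a float [64, 64] row (64 elements, 256 bytes) falls below both thresholds and stays scalar, which matches the shapes that improved in the table above.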

Contributor:

@weishi-deng, according to the description at pytorch/pytorch#144645, the optimization aimed to reduce memory access by leveraging registers. Do you observe the same behavior on XPU?

Contributor:

If you believe that the optimization of pytorch/pytorch#144645 is applicable to XPU as well, please collect data to prove it. We need to obtain insights regarding the optimization on XPU rather than just the performance results.
