softmax: adjust vectorization length according to shape #1781
Conversation
@sys_pytorchxpubot triage result for run 15872303046. Triage bot UT analysis result is for reference only; note that each unique error message is reported only once:
Triage bot response: {
"similar_issue_id": 1214,
"similar_issue_state": "open",
"issue_owner": "daisyden",
"issue_description": "Torch-xpu-ops pull request 1781 CI unit test third_party.torch-xpu-ops.test.xpu.quantization.core.test_workflow_ops_xpu.TestFakeQuantizeOpsXPU/test_fake_quantize_per_tensor_affine_inf_xpu failed with error message: AssertionError: Tensor-likes are not close!",
"root_causes": [
"Discrepancies in tensor computations between CPU and XPU during quantization operations.",
"Potential precision issues in quantization functions on XPU.",
"Implementation differences in quantization logic affecting tensor comparisons."
],
"suggested_solutions": [
"Investigate the quantization logic to ensure consistency across CPU and XPU.",
"Compare computations between CPU and XPU for quantization operations to identify discrepancies.",
"Adjust tolerance levels if necessary, but only after thorough investigation."
]
}
Triage bot response: {
"similar_issue_id": 1214,
"similar_issue_state": "open",
"issue_owner": "daisyden",
"issue_description": "In the test case `test_det_xpu_complex128`, an `AssertionError: Scalars are not close!` occurred, indicating a failure in scalar comparison during determinant computation on XPU with complex128 dtype.",
"root_causes": [
"Potential precision issues in tensor computations on XPU.",
"Incorrect handling of tensor comparisons leading to assertion failures."
],
"suggested_solutions": [
"Investigate numerical precision in determinant computation on XPU.",
"Review tensor comparison logic to ensure accurate scalar comparisons."
]
}
Triage bot response: {
"similar_issue_id": 1214,
"similar_issue_state": "open",
"issue_owner": "daisyden",
"issue_description": "In preci test, there are random cases on log or exp related ops will fail with 'AssertionError: Tensor-likes are not close!', need root cause. New random failure with release/2.7 RC2 pre release wheel test_foreach_xpu.py::TestForeachXPU::test_parity__foreach_div_fastpath_outplace_xpu_complex128",
"root_causes": [
"Discrepancies in handling of complex128 tensors during log or exp operations on XPU.",
"Potential precision issues or kernel behavior differences between XPU and CPU implementations."
],
"suggested_solutions": [
"Review and align XPU kernel implementations for linalg operations to match CPU behavior, especially for complex128 tensors.",
"Enhance testing to include more thorough checks for consistency between XPU and CPU outputs in complex number operations.",
"Investigate and address any precision-related issues in the XPU implementation of linalg.solve."
]
}
@weishi-deng, please add platform information to the PR description.
// Skip the vectorized path for small rows; the thresholds follow the CUDA
// SoftMax heuristics referenced in the discussion below.
if (is_same_dtype) {
  if (dim_size <= 2048 && dim_size * scalar_size <= 8192)
    return false;
} else {
  if (dim_size <= 1024 && dim_size * scalar_size <= 4096)
    return false;
}
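Read in isolation, the fragment is easier to follow as a self-contained predicate. A minimal sketch, assuming a hypothetical wrapper use_vectorized_softmax, where dim_size is the softmax dimension length, scalar_size the element size in bytes, and is_same_dtype whether input and output dtypes match; the name and signature are illustrative, not the PR's actual interface:

#include <cstdint>

// Hypothetical wrapper around the gate quoted in the diff above. Returns
// true when the vectorized softmax path is expected to pay off.
bool use_vectorized_softmax(
    int64_t dim_size, int64_t scalar_size, bool is_same_dtype) {
  if (is_same_dtype) {
    if (dim_size <= 2048 && dim_size * scalar_size <= 8192)
      return false;  // small row: scalar path avoids vectorization overhead
  } else {
    // Mixed input/output dtypes roughly double per-element register cost,
    // so the cutoffs are halved.
    if (dim_size <= 1024 && dim_size * scalar_size <= 4096)
      return false;
  }
  return true;
}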
@weishi-deng, could you help elaborate on these magic numbers, including 2048, 1024, 4096, and 8192? Why are these numbers critical for the performance of SoftMax? Meanwhile, please consider whether the optimization applies to other platforms.
The heuristics and magic numbers are from the PyTorch CUDA implementation: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/SoftMax.cu#L1096 and https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/SoftMax.cu#L1187. As discussed last week with Liangang and Yutao, we prefer to add adaptive choices for our vectorization in SoftMax, and to align with the upstream implementation first. Furthermore, to decide which magic numbers we should set for all platforms, we need to collect more perf results across different shapes and platforms, as was done in pytorch/pytorch#144645.
@weishi-deng, according to the description at pytorch/pytorch#144645, the optimization aimed to reduce memory access by leveraging registers. Do you observe the same behavior on XPU?
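For readers unfamiliar with that optimization, here is a hedged illustration of the register-reuse idea as a plain CPU analogue, not the actual XPU or CUDA kernel: the local buffer stands in for per-thread registers, so the row is read from memory once and reused across the max, exp-sum, and normalize passes. The function name and structure are assumptions for exposition (n is assumed to be at least 1).

#include <algorithm>
#include <cmath>
#include <vector>

// CPU analogue of the register-caching idea in pytorch/pytorch#144645:
// read the row once into a local buffer, then run all three softmax
// passes on the cached copy instead of re-reading `in` each pass.
void softmax_row_cached(const float* in, float* out, int n) {
  std::vector<float> cache(in, in + n);  // single read of the row
  float row_max = *std::max_element(cache.begin(), cache.end());
  float sum = 0.0f;
  for (int i = 0; i < n; ++i) {
    cache[i] = std::exp(cache[i] - row_max);  // reuse cached values
    sum += cache[i];
  }
  for (int i = 0; i < n; ++i)
    out[i] = cache[i] / sum;
}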
If you believe the optimization of pytorch/pytorch#144645 is applicable to XPU as well, please collect data to prove it. We need insights into how the optimization behaves on XPU, not just the performance numbers.
To optimize the performance of Softmax with smaller shapes, this PR aims to adjust the vectorization size adaptively according to the input shape.
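As a rough sketch of what "adjusting the vectorization size according to shape" can mean; the helper name pick_vec_size, the candidate widths, and the divisibility rule are assumptions for exposition, not the PR's actual code:

#include <cstdint>
#include <initializer_list>

// Illustrative only: fall back to the scalar path for small rows (using the
// cutoffs from the diff above), otherwise pick the widest candidate vector
// length that evenly tiles the softmax dimension.
int pick_vec_size(int64_t dim_size, int64_t scalar_size, bool is_same_dtype) {
  const int64_t len_cutoff = is_same_dtype ? 2048 : 1024;
  const int64_t byte_cutoff = is_same_dtype ? 8192 : 4096;
  if (dim_size <= len_cutoff && dim_size * scalar_size <= byte_cutoff)
    return 1;  // small row: scalar loads
  for (int vec : {8, 4, 2}) {
    if (dim_size % vec == 0)
      return vec;  // widest vector that divides the row length
  }
  return 1;
}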
Performance data (platform: BMG)