Add an option for NAN check for xccl #1756
base: main
Conversation
Pull Request Overview
This PR adds optional NaN checks to XCCL collective and point-to-point operations on XPU, driven by a new TORCH_XCCL_NAN_CHECK CVar.
Key changes:
- Introduce a nanCheck flag in the collective/P2P APIs and an enableNanCheck_ member
- Implement an XPU-side NaN detection kernel (checkForNan)
- Update the build (CMake) to compile the new SYCL-based checker separately
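For readers unfamiliar with how such a scan is typically written, here is a minimal, self-contained sketch of a SYCL NaN check over a float buffer. The function name, use of shared USM for the flag, and the thrown exception are assumptions for illustration only; they do not reflect the PR's actual checkForNan implementation.

```cpp
// Minimal sketch of a SYCL-based NaN scan (illustrative only).
#include <stdexcept>
#include <sycl/sycl.hpp>

void check_for_nan_sketch(sycl::queue& q, const float* data, size_t count) {
  // `data` must be device-accessible (e.g. USM device or shared memory).
  // One shared flag, set by any work-item that observes a NaN.
  int* found = sycl::malloc_shared<int>(1, q);
  *found = 0;
  q.parallel_for(sycl::range<1>(count), [=](sycl::id<1> i) {
     if (sycl::isnan(data[i])) {
       sycl::atomic_ref<int, sycl::memory_order::relaxed,
                        sycl::memory_scope::device>(*found)
           .store(1);
     }
   }).wait();
  const bool has_nan = (*found != 0);
  sycl::free(found, q);
  if (has_nan) {
    throw std::runtime_error("NaN detected before XCCL communication");
  }
}
```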
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/xccl/ProcessGroupXCCL.hpp | Added CVar vector, nanCheck parameters on collective APIs, and enableNanCheck_ with setter |
| src/xccl/ProcessGroupXCCL.cpp | Initialized enableNanCheck_, passed nanCheck through calls, and inserted pre-communication NaN checks |
| src/xccl/NanCheck_XPU.hpp | Declared checkForNan interface for XPU streams |
| src/xccl/NanCheck_XPU.cpp | Implemented a SYCL kernel to scan tensors for NaNs on XPU |
| src/xccl/CMakeLists.txt | Updated source lists to compile NanCheck_XPU.cpp under the SYCL target |
Comments suppressed due to low confidence (1)
src/xccl/NanCheck_XPU.cpp:177
[nitpick] Function name checkfornan_impl_xpu is inconsistent with the camelCase style elsewhere; consider renaming it to checkForNanImplXPU for clarity.

void checkfornan_impl_xpu(
@@ -27,6 +27,9 @@ static std::vector<std::string> TORCH_XCCL_BLOCKING_WAIT = {
    "XCCL_BLOCKING_WAIT"};

using xcclComm_t = ccl::communicator;

static std::vector<std::string> TORCH_XCCL_NAN_CHECK = {"TORCH_XCCL_NAN_CHECK"};
[nitpick] Using a std::vector<std::string> for a single CVar name adds runtime allocation; consider using a static const char*[] or std::array<std::string_view, 1> to avoid the overhead.

Suggested change:
- static std::vector<std::string> TORCH_XCCL_NAN_CHECK = {"TORCH_XCCL_NAN_CHECK"};
+ static constexpr std::array<std::string_view, 1> TORCH_XCCL_NAN_CHECK = {"TORCH_XCCL_NAN_CHECK"};
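As a side note on how a CVar list like this is typically resolved, below is a standard-library-only sketch. The real code presumably uses a c10d-style getCvarBool helper; the resolveNanCheck name and the accepted truthy spellings here are illustrative assumptions, not the PR's API.

```cpp
// Illustrative sketch of resolving an env-var switch such as TORCH_XCCL_NAN_CHECK.
#include <array>
#include <cstdlib>
#include <string>
#include <string_view>

static constexpr std::array<std::string_view, 1> kNanCheckVars = {
    "TORCH_XCCL_NAN_CHECK"};

bool resolveNanCheck(bool defaultValue = false) {
  for (std::string_view name : kNanCheckVars) {
    if (const char* value = std::getenv(std::string(name).c_str())) {
      std::string_view v(value);
      // Treat common truthy spellings as "enabled".
      return v == "1" || v == "y" || v == "Y" || v == "yes" || v == "true";
    }
  }
  return defaultValue;
}
```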
@@ -620,6 +629,12 @@ c10::intrusive_ptr<Work> ProcessGroupXCCL::collective(

  c10::OptionalDeviceGuard gpuGuard(device);

  if (nanCheck) {
The NaN check currently only scans inputs before communication. To catch NaNs introduced by the collective or point-to-point operations, consider adding a post-operation loop over outputs.
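For concreteness, here is a sketch of the pre-/post-communication pattern this comment describes, using placeholder types. checkForNan and runXcclCollective are stand-ins (with stub bodies so the sketch compiles), not the PR's actual signatures.

```cpp
// Placeholder types; in the real code these would be at::Tensor and the XPU stream.
#include <vector>

struct Tensor {};
struct Stream {};

void checkForNan(const Tensor&, Stream&) {}  // stub: assumed NaN scan entry point
void runXcclCollective(std::vector<Tensor>&, std::vector<Tensor>&, Stream&) {}  // stub: assumed launcher

void collectiveWithNanCheck(std::vector<Tensor>& inputs,
                            std::vector<Tensor>& outputs,
                            Stream& stream,
                            bool nanCheck) {
  if (nanCheck) {
    for (const auto& t : inputs) {
      checkForNan(t, stream);  // pre-op: what the PR already does for inputs
    }
  }
  runXcclCollective(inputs, outputs, stream);
  if (nanCheck) {
    for (const auto& t : outputs) {
      checkForNan(t, stream);  // post-op pass the review suggests for outputs
    }
  }
}
```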
From the test log, a NaN will raise an error.
@sys_pytorchxpubot triage result for run 15868437424. Triage bot UT analysis result for reference only; please note that each unique error message is only reported once:
Triage bot response: {
"similar_issue_id": 1214,
"similar_issue_state": "open",
"issue_owner": "daisyden",
"issue_description": "In the test case `test_det_xpu_complex128`, an `AssertionError: Scalars are not close!` occurred, indicating a failure in scalar comparison during determinant computation on XPU with complex128 dtype.",
"root_causes": [
"Potential numerical precision issues in tensor computations on XPU.",
"Incorrect tensor comparison logic specific to XPU operations."
],
"suggested_solutions": [
"Investigate and adjust scalar comparison tolerance for XPU computations.",
"Review and correct tensor comparison methods to ensure accuracy on XPU."
]
}
Triage bot response: {
"similar_issue_id": 1214,
"similar_issue_state": "open",
"issue_owner": "daisyden",
"issue_description": "In the test test_tensorsolve_xpu_complex128, an AssertionError occurred with the message 'AssertionError: Tensor-likes are not close!'. This indicates a discrepancy in tensor values between CPU and XPU computations, possibly due to precision or kernel behavior differences.",
"root_causes": [
"Discrepancies in tensor computations between CPU and XPU, potentially due to precision differences or kernel implementation variations.",
"Possible synchronization issues or data transfer problems between devices."
],
"suggested_solutions": [
"Investigate the specific operations in test_tensorsolve_xpu_complex128 to identify where the computation diverges between CPU and XPU.",
"Review the kernel implementations for solve operations on XPU to ensure they match CPU behavior for complex128 tensors.",
"Adjust tolerance levels if the discrepancy is within an acceptable range but not zero."
]
}
Triage bot response: {
"similar_issue_id": 233,
"similar_issue_state": "closed",
"issue_owner": "guangyey",
"issue_description": "Failures in test_ops::TestCompositeCompliance due to unsupported operations and tensor management issues, particularly with Copy-on-Write (COW) and missing operator variants.",
"root_causes": [
"Lack of implementation for the backward pass of `linalg.eigvals` on XPU for float64.",
"Missing operator variants leading to unsupported operations during gradient computation."
],
"suggested_solutions": [
"Implement the backward pass for `linalg.eigvals` on XPU for float64 to support gradient computation.",
"Ensure that all necessary operator variants are available to prevent unsupported operation errors."
]
}
Triage bot response: {
"similar_issue_id": 1401,
"similar_issue_state": "open",
"issue_owner": "daisyden",
"issue_description": "The test test_numpy_ref_linalg_tensorsolve_xpu_complex128 failed with an AssertionError: Tensor-likes are not close! The error indicates that the tensors from CPU and XPU are not matching within the allowed tolerance. The failure occurs during the backward pass where gradients are compared. The test involves initializing tensors, performing a linalg tensorsolve operation, and comparing the results across CPU and XPU. The error suggests potential discrepancies in computation between CPU and XPU implementations, possibly due to differences in precision, kernel behavior, or synchronization issues.",
"root_causes": [
"Discrepancies in computation between CPU and XPU implementations, possibly due to differences in precision or kernel behavior.",
"Potential issues in the handling of complex128 dtype operations across different devices."
],
"suggested_solutions": [
"Review and align the XPU implementation of linalg tensorsolve with the CPU implementation to ensure consistent behavior.",
"Investigate and adjust the precision handling in the XPU kernels to match CPU behavior for complex128 dtype operations.",
"Add additional test cases to verify the consistency of linalg tensorsolve operations across different devices and data types."
]
}
Refer to pytorch/pytorch#125726 and pytorch/pytorch#135414.
Add a NaN check for XCCL.
Why do we need to stop communication from spreading NaNs?
"Technically, if we can be sure which rank (or even which host) detected the first NaN, then it's OK to let the NaN spread to some other hosts. But in practice I don't know if we have a good enough way to align our logs on different hosts, so if we let the NaN spread to a few other hosts we may lose track of which one was first."