
Add an option for NAN check for xccl #1756


Open: Chao1Han wants to merge 10 commits into main

Conversation

@Chao1Han Chao1Han commented Jun 19, 2025

Based on pytorch/pytorch#125726 and pytorch/pytorch#135414.
Add a NaN check for XCCL.
Why do we need to stop communication from spreading NaNs?
"Technically, if we can be sure which rank (or even which host) detected the first NaN, then it's OK to let the NaN spread to some other hosts. But in practice I don't know if we have a good enough way to align our logs on different hosts, so if we let the NaN spread to a few other hosts we may lose track of which one was first."

Copilot AI review requested due to automatic review settings, June 19, 2025 07:24
Copilot AI left a comment:
Pull Request Overview

This PR adds optional NaN checks to XCCL collective and point-to-point operations on XPU, driven by a new TORCH_XCCL_NAN_CHECK CVar.
Key changes:

  • Introduce nanCheck flag in collective/P2P APIs and enableNanCheck_ member
  • Implement XPU-side NaN detection kernel (checkForNan); a minimal sketch follows after this list
  • Update build (CMake) to compile the new SYCL-based checker separately
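
As a rough illustration of the kernel-level change, here is a minimal, hedged sketch of a SYCL NaN scan in the spirit of checkForNan. It is not the PR's implementation: the real kernel is templated over dtypes such as c10::BFloat16 (see the test log further down) and runs on the XCCL stream, while this sketch assumes a plain float USM device buffer and a caller-provided sycl::queue; check_for_nan_sketch is a hypothetical name.

// Minimal sketch, NOT the PR's kernel: scan a float device buffer for NaNs and
// abort via a device-side assert, mirroring the failure mode seen in the test
// log below. Assumes `data` is a USM device pointer accessible from `q`.
#include <sycl/sycl.hpp>
#include <cassert>
#include <cstddef>

void check_for_nan_sketch(sycl::queue& q, const float* data, std::size_t numel) {
  if (numel == 0) {
    return;
  }
  q.parallel_for(sycl::range<1>(numel), [=](sycl::id<1> i) {
     // sycl::isnan evaluates on the device; the device-side assert aborts the
     // kernel as soon as any work-item sees a NaN.
     if (sycl::isnan(data[i])) {
       assert(false && "NaN detected in input tensor before communication");
     }
   }).wait();
}

Failing fast on the rank that first sees a NaN is the point of the feature: the collective is never launched, so the NaN cannot spread to other hosts and obscure which rank produced it.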

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

File summary:
  • src/xccl/ProcessGroupXCCL.hpp: Added CVar vector, nanCheck parameters on collective APIs, and enableNanCheck_ with setter
  • src/xccl/ProcessGroupXCCL.cpp: Initialized enableNanCheck_, passed nanCheck through calls, and inserted pre-communication NaN checks
  • src/xccl/NanCheck_XPU.hpp: Declared checkForNan interface for XPU streams
  • src/xccl/NanCheck_XPU.cpp: Implemented a SYCL kernel to scan tensors for NaNs on XPU
  • src/xccl/CMakeLists.txt: Updated source lists to compile NanCheck_XPU.cpp under SYCL target
Comments suppressed due to low confidence (1)

src/xccl/NanCheck_XPU.cpp:177

  • [nitpick] Function name checkfornan_impl_xpu is inconsistent with the CamelCase style elsewhere; consider renaming to checkForNanImplXPU for clarity.
void checkfornan_impl_xpu(

@@ -27,6 +27,9 @@ static std::vector<std::string> TORCH_XCCL_BLOCKING_WAIT = {
"XCCL_BLOCKING_WAIT"};

using xcclComm_t = ccl::communicator;

static std::vector<std::string> TORCH_XCCL_NAN_CHECK = {"TORCH_XCCL_NAN_CHECK"};
Copilot AI commented (Jun 19, 2025):
[nitpick] Using a std::vector<std::string> for a single CVar name adds runtime allocation; consider using a static const char*[] or std::array<std::string_view, 1> to avoid overhead.

Suggested change
static std::vector<std::string> TORCH_XCCL_NAN_CHECK = {"TORCH_XCCL_NAN_CHECK"};
static constexpr std::array<std::string_view, 1> TORCH_XCCL_NAN_CHECK = {"TORCH_XCCL_NAN_CHECK"};
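
For context, here is a hedged sketch of how a boolean CVar like TORCH_XCCL_NAN_CHECK could be consumed when initializing enableNanCheck_. The PR most likely reuses c10d's existing CVar helpers (as ProcessGroupNCCL does for TORCH_NCCL_NAN_CHECK); this standalone version uses std::getenv only to illustrate the behavior, and readBoolEnv is a hypothetical helper name.

// Hedged sketch: read the first environment variable found in the CVar alias
// list and interpret common truthy spellings. readBoolEnv is a hypothetical
// helper; the PR presumably uses c10d's CVar utilities instead.
#include <cstdlib>
#include <string>
#include <vector>

static bool readBoolEnv(const std::vector<std::string>& names, bool def) {
  for (const auto& name : names) {
    if (const char* value = std::getenv(name.c_str())) {
      std::string v(value);
      return v == "1" || v == "y" || v == "Y" || v == "yes" || v == "true";
    }
  }
  return def;
}

// e.g. in the ProcessGroupXCCL constructor:
//   enableNanCheck_ = readBoolEnv(TORCH_XCCL_NAN_CHECK, /*def=*/false);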


@@ -620,6 +629,12 @@ c10::intrusive_ptr<Work> ProcessGroupXCCL::collective(

c10::OptionalDeviceGuard gpuGuard(device);

if (nanCheck) {
Copilot AI commented (Jun 19, 2025):
The NaN check currently only scans inputs before communication. To catch NaNs introduced by the collective or point-to-point operations, consider adding a post-operation loop over outputs.
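
A hedged sketch of what the suggested post-operation check could look like. The checkForNan signature and the c10::xpu::XPUStream type are assumptions based on the file summary above ("Declared checkForNan interface for XPU streams"); the actual PR may expose a different interface, and postOpNanCheck is a hypothetical name.

// Hedged sketch of the reviewer's suggestion, not code from this PR: after the
// collective (or send/recv) completes, rescan the outputs on the same stream.
#include <ATen/core/Tensor.h>
#include <c10/xpu/XPUStream.h>
#include <vector>

namespace c10d {
// Assumed declaration, based on the PR's file summary for NanCheck_XPU.hpp.
void checkForNan(const at::Tensor& tensor, c10::xpu::XPUStream& stream);

inline void postOpNanCheck(
    const std::vector<at::Tensor>& outputs,
    c10::xpu::XPUStream& stream,
    bool nanCheck) {
  if (!nanCheck) {
    return;
  }
  for (const auto& output : outputs) {
    checkForNan(output, stream); // same device-side scan used for inputs
  }
}
} // namespace c10d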


@Chao1Han changed the title from "[wip] Xccl/nan" to "Add an option for NAN check for xccl" on Jun 23, 2025
@Chao1Han (Contributor, Author) commented:

From the test log, a NaN raises a device-side assertion:
/home/sdp/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/xccl/NanCheck_XPU.cpp:162: void c10d::checkForNaN<c10::BFloat16>::operator()(sycl::nd_item<1>) const [T = c10::BFloat16]: global id: [0,0,0], local id: [0,0,0] Assertion `0` failed.

@pytorchxpubot

@sys_pytorchxpubot triage result for run 15868437424. Triage bot UT analysis result is for reference only; note that each unique error message is reported only once:
  1. third_party.torch-xpu-ops.test.xpu.test_linalg_xpu.TestLinalgXPU test_det_xpu_complex128 failed with error message:
 AssertionError: Scalars are not close!

Triage bot response:

{
  "similar_issue_id": 1214,
  "similar_issue_state": "open",
  "issue_owner": "daisyden",
  "issue_description": "In the test case `test_det_xpu_complex128`, an `AssertionError: Scalars are not close!` occurred, indicating a failure in scalar comparison during determinant computation on XPU with complex128 dtype.",
  "root_causes": [
    "Potential numerical precision issues in tensor computations on XPU.",
    "Incorrect tensor comparison logic specific to XPU operations."
  ],
  "suggested_solutions": [
    "Investigate and adjust scalar comparison tolerance for XPU computations.",
    "Review and correct tensor comparison methods to ensure accuracy on XPU."
  ]
}
  2. third_party.torch-xpu-ops.test.xpu.test_linalg_xpu.TestLinalgXPU test_tensorsolve_xpu_complex128 failed with error message:
 AssertionError: Tensor-likes are not close!

Triage bot response:

{
  "similar_issue_id": 1214,
  "similar_issue_state": "open",
  "issue_owner": "daisyden",
  "issue_description": "In the test test_tensorsolve_xpu_complex128, an AssertionError occurred with the message 'AssertionError: Tensor-likes are not close!'. This indicates a discrepancy in tensor values between CPU and XPU computations, possibly due to precision or kernel behavior differences.",
  "root_causes": [
    "Discrepancies in tensor computations between CPU and XPU, potentially due to precision differences or kernel implementation variations.",
    "Possible synchronization issues or data transfer problems between devices."
  ],
  "suggested_solutions": [
    "Investigate the specific operations in test_tensorsolve_xpu_complex128 to identify where the computation diverges between CPU and XPU.",
    "Review the kernel implementations for solve operations on XPU to ensure they match CPU behavior for complex128 tensors.",
    "Adjust tolerance levels if the discrepancy is within an acceptable range but not zero."
  ]
}
  3. third_party.torch-xpu-ops.test.xpu.test_ops_gradients_xpu.TestBwdGradientsXPU test_fn_grad_linalg_eigvals_xpu_float64 failed with error message:
 Exception: Caused by sample input at index 0: SampleInput(input=Tensor[size=(5, 5), device="xpu:0", dtype=torch.float64], args=(), kwargs={}, broadcasts_input=False, name='')

Triage bot response:

{
  "similar_issue_id": 233,
  "similar_issue_state": "closed",
  "issue_owner": "guangyey",
  "issue_description": "Failures in test_ops::TestCompositeCompliance due to unsupported operations and tensor management issues, particularly with Copy-on-Write (COW) and missing operator variants.",
  "root_causes": [
    "Lack of implementation for the backward pass of `linalg.eigvals` on XPU for float64.",
    "Missing operator variants leading to unsupported operations during gradient computation."
  ],
  "suggested_solutions": [
    "Implement the backward pass for `linalg.eigvals` on XPU for float64 to support gradient computation.",
    "Ensure that all necessary operator variants are available to prevent unsupported operation errors."
  ]
}
  4. third_party.torch-xpu-ops.test.xpu.test_ops_xpu.TestCommonXPU test_numpy_ref_linalg_tensorsolve_xpu_complex128 failed with error message:
 AssertionError: Tensor-likes are not close! ; Exception: Caused by reference input at index 0: SampleInput(input=Tensor[size=(2, 3, 6), device="xpu:0", dtype=torch.complex128], args=TensorList[Tensor[size=(2, 3), device="xpu:0", dtype=torch.complex128]], kwargs={'dims': 'None'}, broadcasts_input=False, name='')

Triage bot response:

{
  "similar_issue_id": 1401,
  "similar_issue_state": "open",
  "issue_owner": "daisyden",
  "issue_description": "The test test_numpy_ref_linalg_tensorsolve_xpu_complex128 failed with an AssertionError: Tensor-likes are not close! The error indicates that the tensors from CPU and XPU are not matching within the allowed tolerance. The failure occurs during the backward pass where gradients are compared. The test involves initializing tensors, performing a linalg tensorsolve operation, and comparing the results across CPU and XPU. The error suggests potential discrepancies in computation between CPU and XPU implementations, possibly due to differences in precision, kernel behavior, or synchronization issues.",
  "root_causes": [
    "Discrepancies in computation between CPU and XPU implementations, possibly due to differences in precision or kernel behavior.",
    "Potential issues in the handling of complex128 dtype operations across different devices."
  ],
  "suggested_solutions": [
    "Review and align the XPU implementation of linalg tensorsolve with the CPU implementation to ensure consistent behavior.",
    "Investigate and adjust the precision handling in the XPU kernels to match CPU behavior for complex128 dtype operations.",
    "Add additional test cases to verify the consistency of linalg tensorsolve operations across different devices and data types."
  ]
}
