
Add an option for NAN check for xccl #1756


Open: Chao1Han wants to merge 10 commits into main

Conversation

@Chao1Han Chao1Han commented Jun 19, 2025

Based on pytorch/pytorch#125726 and pytorch/pytorch#135414.
Add a NaN check for XCCL.
Why do we need to stop communication from spreading NaNs?
"Technically, if we can be sure which rank (or even which host) detected the first NaN, then it's OK to let the NaN spread to some other hosts. But in practice I don't know if we have a good enough way to align our logs on different hosts, so if we let the NaN spread to a few other hosts we may lose track of which one was first."

Copilot AI review requested due to automatic review settings, June 19, 2025 07:24
Copilot AI left a comment:
Pull Request Overview

This PR adds optional NaN checks to XCCL collective and point-to-point operations on XPU, driven by a new TORCH_XCCL_NAN_CHECK CVar.
Key changes:

  • Introduce nanCheck flag in collective/P2P APIs and enableNanCheck_ member
  • Implement XPU-side NaN detection kernel (checkForNan); a minimal sketch follows after this list
  • Update build (CMake) to compile the new SYCL-based checker separately
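
As a rough illustration of the kernel-level change, here is a minimal, hedged sketch of a SYCL NaN scan in the spirit of checkForNan. It is not the PR's implementation: the real kernel is templated over dtypes such as c10::BFloat16 (see the test log further down) and runs on the XCCL stream, while this sketch assumes a plain float USM device buffer and a caller-provided sycl::queue; check_for_nan_sketch is a hypothetical name.

// Minimal sketch, NOT the PR's kernel: scan a float device buffer for NaNs and
// abort via a device-side assert, mirroring the failure mode seen in the test
// log below. Assumes `data` is a USM device pointer accessible from `q`.
#include <sycl/sycl.hpp>
#include <cassert>
#include <cstddef>

void check_for_nan_sketch(sycl::queue& q, const float* data, std::size_t numel) {
  if (numel == 0) {
    return;
  }
  q.parallel_for(sycl::range<1>(numel), [=](sycl::id<1> i) {
     // sycl::isnan evaluates on the device; the device-side assert aborts the
     // kernel as soon as any work-item sees a NaN.
     if (sycl::isnan(data[i])) {
       assert(false && "NaN detected in input tensor before communication");
     }
   }).wait();
}

Failing fast on the rank that first sees a NaN is the point of the feature: the collective is never launched, so the NaN cannot spread to other hosts and obscure which rank produced it.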

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

File summary:
  • src/xccl/ProcessGroupXCCL.hpp: Added CVar vector, nanCheck parameters on collective APIs, and enableNanCheck_ with setter
  • src/xccl/ProcessGroupXCCL.cpp: Initialized enableNanCheck_, passed nanCheck through calls, and inserted pre-communication NaN checks
  • src/xccl/NanCheck_XPU.hpp: Declared checkForNan interface for XPU streams
  • src/xccl/NanCheck_XPU.cpp: Implemented a SYCL kernel to scan tensors for NaNs on XPU
  • src/xccl/CMakeLists.txt: Updated source lists to compile NanCheck_XPU.cpp under SYCL target
Comments suppressed due to low confidence (1)

src/xccl/NanCheck_XPU.cpp:177

  • [nitpick] Function name checkfornan_impl_xpu is inconsistent with the CamelCase style elsewhere; consider renaming to checkForNanImplXPU for clarity.
void checkfornan_impl_xpu(

@@ -27,6 +27,9 @@ static std::vector<std::string> TORCH_XCCL_BLOCKING_WAIT = {
"XCCL_BLOCKING_WAIT"};

using xcclComm_t = ccl::communicator;

static std::vector<std::string> TORCH_XCCL_NAN_CHECK = {"TORCH_XCCL_NAN_CHECK"};
Copilot AI commented (Jun 19, 2025):
[nitpick] Using a std::vector<std::string> for a single CVar name adds runtime allocation; consider using a static const char*[] or std::array<std::string_view, 1> to avoid overhead.

Suggested change
static std::vector<std::string> TORCH_XCCL_NAN_CHECK = {"TORCH_XCCL_NAN_CHECK"};
static constexpr std::array<std::string_view, 1> TORCH_XCCL_NAN_CHECK = {"TORCH_XCCL_NAN_CHECK"};
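
For context, here is a hedged sketch of how a boolean CVar like TORCH_XCCL_NAN_CHECK could be consumed when initializing enableNanCheck_. The PR most likely reuses c10d's existing CVar helpers (as ProcessGroupNCCL does for TORCH_NCCL_NAN_CHECK); this standalone version uses std::getenv only to illustrate the behavior, and readBoolEnv is a hypothetical helper name.

// Hedged sketch: read the first environment variable found in the CVar alias
// list and interpret common truthy spellings. readBoolEnv is a hypothetical
// helper; the PR presumably uses c10d's CVar utilities instead.
#include <cstdlib>
#include <string>
#include <vector>

static bool readBoolEnv(const std::vector<std::string>& names, bool def) {
  for (const auto& name : names) {
    if (const char* value = std::getenv(name.c_str())) {
      std::string v(value);
      return v == "1" || v == "y" || v == "Y" || v == "yes" || v == "true";
    }
  }
  return def;
}

// e.g. in the ProcessGroupXCCL constructor:
//   enableNanCheck_ = readBoolEnv(TORCH_XCCL_NAN_CHECK, /*def=*/false);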


@@ -620,6 +629,12 @@ c10::intrusive_ptr<Work> ProcessGroupXCCL::collective(

c10::OptionalDeviceGuard gpuGuard(device);

if (nanCheck) {
Copilot AI commented (Jun 19, 2025):
The NaN check currently only scans inputs before communication. To catch NaNs introduced by the collective or point-to-point operations, consider adding a post-operation loop over outputs.
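
A hedged sketch of what the suggested post-operation check could look like. The checkForNan signature and the c10::xpu::XPUStream type are assumptions based on the file summary above ("Declared checkForNan interface for XPU streams"); the actual PR may expose a different interface, and postOpNanCheck is a hypothetical name.

// Hedged sketch of the reviewer's suggestion, not code from this PR: after the
// collective (or send/recv) completes, rescan the outputs on the same stream.
#include <ATen/core/Tensor.h>
#include <c10/xpu/XPUStream.h>
#include <vector>

namespace c10d {
// Assumed declaration, based on the PR's file summary for NanCheck_XPU.hpp.
void checkForNan(const at::Tensor& tensor, c10::xpu::XPUStream& stream);

inline void postOpNanCheck(
    const std::vector<at::Tensor>& outputs,
    c10::xpu::XPUStream& stream,
    bool nanCheck) {
  if (!nanCheck) {
    return;
  }
  for (const auto& output : outputs) {
    checkForNan(output, stream); // same device-side scan used for inputs
  }
}
} // namespace c10d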


@Chao1Han changed the title from "[wip] Xccl/nan" to "Add an option for NAN check for xccl" on Jun 23, 2025
@Chao1Han (Contributor, Author) commented:

From the test log, a NaN raises a device-side assertion:
/home/sdp/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/xccl/NanCheck_XPU.cpp:162: void c10d::checkForNaN<c10::BFloat16>::operator()(sycl::nd_item<1>) const [T = c10::BFloat16]: global id: [0,0,0], local id: [0,0,0] Assertion `0` failed.

@pytorchxpubot

@sys_pytorchxpubot triage result for run 15868437424. Triage bot UT analysis result is for reference only; note that each unique error message is reported only once:
  1. third_party.torch-xpu-ops.test.xpu.test_linalg_xpu.TestLinalgXPU test_det_xpu_complex128 failed with error message:
 AssertionError: Scalars are not close!

Triage bot response:

{
  "similar_issue_id": 1214,
  "similar_issue_state": "open",
  "issue_owner": "daisyden",
  "issue_description": "In the test case `test_det_xpu_complex128`, an `AssertionError: Scalars are not close!` occurred, indicating a failure in scalar comparison during determinant computation on XPU with complex128 dtype.",
  "root_causes": [
    "Potential numerical precision issues in tensor computations on XPU.",
    "Incorrect tensor comparison logic specific to XPU operations."
  ],
  "suggested_solutions": [
    "Investigate and adjust scalar comparison tolerance for XPU computations.",
    "Review and correct tensor comparison methods to ensure accuracy on XPU."
  ]
}
  2. third_party.torch-xpu-ops.test.xpu.test_linalg_xpu.TestLinalgXPU test_tensorsolve_xpu_complex128 failed with error message:
 AssertionError: Tensor-likes are not close!

Triage bot response:

{
  "similar_issue_id": 1214,
  "similar_issue_state": "open",
  "issue_owner": "daisyden",
  "issue_description": "In the test test_tensorsolve_xpu_complex128, an AssertionError occurred with the message 'AssertionError: Tensor-likes are not close!'. This indicates a discrepancy in tensor values between CPU and XPU computations, possibly due to precision or kernel behavior differences.",
  "root_causes": [
    "Discrepancies in tensor computations between CPU and XPU, potentially due to precision differences or kernel implementation variations.",
    "Possible synchronization issues or data transfer problems between devices."
  ],
  "suggested_solutions": [
    "Investigate the specific operations in test_tensorsolve_xpu_complex128 to identify where the computation diverges between CPU and XPU.",
    "Review the kernel implementations for solve operations on XPU to ensure they match CPU behavior for complex128 tensors.",
    "Adjust tolerance levels if the discrepancy is within an acceptable range but not zero."
  ]
}
  3. third_party.torch-xpu-ops.test.xpu.test_ops_gradients_xpu.TestBwdGradientsXPU test_fn_grad_linalg_eigvals_xpu_float64 failed with error message:
 Exception: Caused by sample input at index 0: SampleInput(input=Tensor[size=(5, 5), device="xpu:0", dtype=torch.float64], args=(), kwargs={}, broadcasts_input=False, name='')

Triage bot response:

{
  "similar_issue_id": 233,
  "similar_issue_state": "closed",
  "issue_owner": "guangyey",
  "issue_description": "Failures in test_ops::TestCompositeCompliance due to unsupported operations and tensor management issues, particularly with Copy-on-Write (COW) and missing operator variants.",
  "root_causes": [
    "Lack of implementation for the backward pass of `linalg.eigvals` on XPU for float64.",
    "Missing operator variants leading to unsupported operations during gradient computation."
  ],
  "suggested_solutions": [
    "Implement the backward pass for `linalg.eigvals` on XPU for float64 to support gradient computation.",
    "Ensure that all necessary operator variants are available to prevent unsupported operation errors."
  ]
}
  4. third_party.torch-xpu-ops.test.xpu.test_ops_xpu.TestCommonXPU test_numpy_ref_linalg_tensorsolve_xpu_complex128 failed with error message:
 AssertionError: Tensor-likes are not close! ; Exception: Caused by reference input at index 0: SampleInput(input=Tensor[size=(2, 3, 6), device="xpu:0", dtype=torch.complex128], args=TensorList[Tensor[size=(2, 3), device="xpu:0", dtype=torch.complex128]], kwargs={'dims': 'None'}, broadcasts_input=False, name='')

Triage bot response:

{
  "similar_issue_id": 1401,
  "similar_issue_state": "open",
  "issue_owner": "daisyden",
  "issue_description": "The test test_numpy_ref_linalg_tensorsolve_xpu_complex128 failed with an AssertionError: Tensor-likes are not close! The error indicates that the tensors from CPU and XPU are not matching within the allowed tolerance. The failure occurs during the backward pass where gradients are compared. The test involves initializing tensors, performing a linalg tensorsolve operation, and comparing the results across CPU and XPU. The error suggests potential discrepancies in computation between CPU and XPU implementations, possibly due to differences in precision, kernel behavior, or synchronization issues.",
  "root_causes": [
    "Discrepancies in computation between CPU and XPU implementations, possibly due to differences in precision or kernel behavior.",
    "Potential issues in the handling of complex128 dtype operations across different devices."
  ],
  "suggested_solutions": [
    "Review and align the XPU implementation of linalg tensorsolve with the CPU implementation to ensure consistent behavior.",
    "Investigate and adjust the precision handling in the XPU kernels to match CPU behavior for complex128 dtype operations.",
    "Add additional test cases to verify the consistency of linalg tensorsolve operations across different devices and data types."
  ]
}
