NCCL-2.28.3 built locally is unusable on H100 #164402

@malfet

🐛 Describe the bug

If one builds PyTorch locally with NCCL-2.28.3, the following simple script fails with cudaErrorNoKernelImageForDevice:

import os
import torch

os.environ["MASTER_PORT"]="12345"
os.environ["MASTER_ADDR"]="localhost"
os.environ["RANK"]="0"
print(torch.__version__, torch.cuda.nccl.version())
torch.distributed.init_process_group(backend='nccl', world_size=1)
torch.distributed.barrier() 
model = torch.nn.Linear(128, 128).cuda()
torch.cuda.synchronize()
x = torch.randn((32, 128), device="cuda")

The errors look as follows:

2.10.0a0+gitb103378 (2, 28, 3)
/home/dev/git/pytorch/pytorch/torch/distributed/distributed_c10d.py:4879: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
  warnings.warn(  # warn only once
[rank0]:[W1001 20:22:34.218303536 ProcessGroupNCCL.cpp:5092] Guessing device ID based on global rank. This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group()
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/dev/foo.py", line 13, in <module>
[rank0]:     x = torch.randn((32, 128), device="cuda")
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
[rank0]: Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
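
When triaging a "no kernel image" error like the one above, it can help to first rule out that the PyTorch build itself lacks kernels for the GPU. The sketch below is only a diagnostic aid and does not inspect NCCL's kernels; the issue does not state which binary is missing the image. The helper names `arch_covers` and `device_is_covered` are hypothetical. The matching rules are a simplification: `sm_XY` entries are treated as usable on devices with the same major version and an equal or higher minor version, `compute_XY` (PTX) entries as usable on any equal or newer capability, and suffixed entries such as `sm_90a` as exact-match only. The only PyTorch calls used are `torch.cuda.is_available()`, `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`.

```python
import re


def arch_covers(capability, arch):
    """Return True if one arch-list entry (e.g. 'sm_90') can serve a device
    with the given (major, minor) compute capability.

    Simplified rules (an assumption of this sketch, not an official check):
      - 'sm_XY'      : same major version, device minor >= Y
      - 'compute_XY' : PTX, usable on any capability >= (X, Y)
      - suffixed entries such as 'sm_90a' : exact match only
    """
    m = re.fullmatch(r"(sm|compute)_(\d{2,})([a-z]?)", arch)
    if m is None:
        return False
    kind, digits, suffix = m.groups()
    # The last digit is the minor version, the rest is the major version.
    major, minor = int(digits[:-1]), int(digits[-1])
    if suffix:
        return tuple(capability) == (major, minor)
    if kind == "sm":
        return capability[0] == major and capability[1] >= minor
    return tuple(capability) >= (major, minor)


def device_is_covered(capability, arch_list):
    """Return True if any entry in arch_list can serve the device."""
    return any(arch_covers(tuple(capability), a) for a in arch_list)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability()  # (9, 0) on an H100
        archs = torch.cuda.get_arch_list()
        print("device capability:", cap)
        print("PyTorch arch list:", archs)
        print("covered by PyTorch build:", device_is_covered(cap, archs))
```

If this reports that the device is covered by the PyTorch build, the missing kernel image more plausibly lives in a bundled third-party library, which would be consistent with the failure appearing only when NCCL-2.28.3 is built locally.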

Versions

nightly

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @seemethere @pytorch/pytorch-dev-infra

Labels

actionable; high priority; module: build (Build system issues); module: ci (Related to continuous integration); module: nccl (Problems related to nccl support); module: regression (It used to work, and now it doesn't); module: third_party; triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)
