Locally built NCCL-2.28.3 is unusable on H100

### 🐛 Describe the bug

If one builds PyTorch locally with NCCL-2.28.3, the following simple script fails with `cudaErrorNoKernelImageForDevice`:
```python
import os
import torch

os.environ["MASTER_PORT"]="12345"
os.environ["MASTER_ADDR"]="localhost"
os.environ["RANK"]="0"
print(torch.__version__, torch.cuda.nccl.version())
torch.distributed.init_process_group(backend='nccl', world_size=1)
torch.distributed.barrier() 
model = torch.nn.Linear(128, 128).cuda()
torch.cuda.synchronize()
x = torch.randn((32, 128), device="cuda")
```

The error output looks as follows:
```
2.10.0a0+gitb103378 (2, 28, 3)
/home/dev/git/pytorch/pytorch/torch/distributed/distributed_c10d.py:4879: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
  warnings.warn(  # warn only once
[rank0]:[W1001 20:22:34.218303536 ProcessGroupNCCL.cpp:5092] Guessing device ID based on global rank. This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group()
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/dev/foo.py", line 13, in <module>
[rank0]:     x = torch.randn((32, 128), device="cuda")
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
[rank0]: Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
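`cudaErrorNoKernelImageForDevice` generally means that none of the compiled kernel images match the GPU's compute capability (9.0 for H100). As a rough diagnostic, and not a confirmed root cause for this issue, the sketch below compares the device capability against the architectures PyTorch itself was compiled for. The helper names `arch_covers_device` and `check_local_build` are hypothetical, the compatibility rules are simplified (any architecture with a letter suffix such as `sm_90a` is conservatively treated as exact-match only), and the check does not inspect the architectures the bundled NCCL was compiled for, which may differ.

```python
import re


def arch_covers_device(arch, capability):
    """Return True if a compiled arch string can run on a device capability.

    Simplified rules (an assumption of this sketch):
      * sm_XY (cubin): same major version, device minor >= Y
      * compute_XY (PTX): any device with capability >= X.Y
      * suffixed variants such as sm_90a: exact capability match only
    """
    m = re.fullmatch(r"(sm|compute)_(\d+)(\d)([a-z]?)", arch)
    if m is None:
        return False  # unknown format: do not claim coverage
    kind, suffix = m.group(1), m.group(4)
    major, minor = int(m.group(2)), int(m.group(3))
    dev_major, dev_minor = capability
    if suffix:
        return (major, minor) == (dev_major, dev_minor)
    if kind == "sm":
        return major == dev_major and minor <= dev_minor
    return (major, minor) <= (dev_major, dev_minor)


def check_local_build():
    """Print whether PyTorch's own arch list covers the current GPU."""
    import torch  # imported lazily so the helper above stays torch-free

    if not torch.cuda.is_available():
        print("CUDA is not available")
        return
    capability = torch.cuda.get_device_capability()
    archs = torch.cuda.get_arch_list()
    covered = any(arch_covers_device(a, capability) for a in archs)
    print(f"device capability: {capability}, compiled archs: {archs}")
    print("PyTorch kernels cover this device" if covered
          else "PyTorch kernels do NOT cover this device")
```

Calling `check_local_build()` on the affected machine shows whether PyTorch's own kernels target the H100. If they do, that would point toward a separately compiled component, such as NCCL, as the one lacking a matching image.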

### Versions

nightly

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @seemethere @pytorch/pytorch-dev-infra
