
Conversation

@johnnynunez (Contributor) commented Oct 9, 2025

What does this PR do?

Fixes #1320, #1308, #1323, and #1335. Includes fixes for flash-attention with CUDA >= 12.9 and adds CUTLASS v4.2.1, which fixes some kernels for Blackwell.
Also adds support for Spark and Thor.
Added Blackwell family support: https://developer.nvidia.com/blog/nvidia-blackwell-and-nvidia-cuda-12-9-introduce-family-specific-architecture-features/

Thanks to #1285 and #1262, which are included here.

Fixes in flash-attention to support CUDA 13:

  1. CUTLASS v4.2.1 upgrade (Dao-AILab/flash-attention#1905)
  2. C++11 warning fixes (Dao-AILab/flash-attention#1904)
  3. [NVIDIA] Enable Blackwell family-specific features (Dao-AILab/flash-attention#1882)
  4. [BUILD] SBSA wheels + CUDA 13 support (Dao-AILab/flash-attention#1865)
  5. [BUG] CUDA 13: make FA3 compatible with CUDA 13 builds (Dao-AILab/flash-attention#1860)
  6. CUTLASS v4.3.0

PyTorch 2.9.0: https://dev-discuss.pytorch.org/t/pytorch-2-9-rc1-produced-for-pytorch-audio-vision/3234
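The family-specific architecture feature mentioned above can be illustrated with a small, hypothetical version gate. This is a hedged sketch, not xformers' actual build logic; the function name and arch set are illustrative:

```python
# Hedged sketch: choose nvcc -gencode flags based on the CUDA toolkit version.
# CUDA 12.9 introduced "family-specific" Blackwell targets such as sm_100f;
# older toolkits reject them, so they must be gated on the toolkit version.
def gencode_flags(cuda_version: int) -> list:
    """cuda_version is MAJOR*100 + MINOR, e.g. 1209 for CUDA 12.9."""
    archs = ["80", "90a"]        # long-supported targets
    if cuda_version >= 1209:
        archs.append("100f")     # Blackwell family-specific target
    return ["-gencode=arch=compute_%s,code=sm_%s" % (a, a) for a in archs]

print(gencode_flags(1300))
```

With CUDA 13 (1300) this emits the family-specific `sm_100f` flag; with 12.8 it would not.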

cc @sgrigory

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 9, 2025
@johnnynunez (Contributor, Author) commented Oct 15, 2025

@jiawenliu64 @bottler @sgrigory could you run and merge?

@snakeeater4526 commented

Just a little message to say that I believe this PR is badly needed by people on CUDA 13: PyTorch 2.9 is now the stable release, but the latest available xformers doesn't support CUDA 13.

So some software (like ComfyUI) using tensor-related features can't work properly.

PS: I'm not a dev at all, but I spent an entire day trying to use ComfyUI with TensorRT acceleration, and it's basically impossible with CUDA 13 drivers (sadly, I did not manage to compile this PR successfully).

@johnnynunez (Contributor, Author) commented Oct 21, 2025

> Just a little message to say that I believe this PR is badly needed by people on CUDA 13: PyTorch 2.9 is now the stable release, but the latest available xformers doesn't support CUDA 13.
>
> So some software (like ComfyUI) using tensor-related features can't work properly.
>
> PS: I'm not a dev at all, but I spent an entire day trying to use ComfyUI with TensorRT acceleration, and it's basically impossible with CUDA 13 drivers (sadly, I did not manage to compile this PR successfully).

You have to point the build at the newer CCCL headers (export the include path). It is working for me.
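One hedged way to do that from a build script: prepend a newer CCCL checkout to the compiler's include search path before invoking the build. The path below is illustrative (wherever you cloned nvidia/cccl), not a path the project prescribes:

```python
import os

# Hedged sketch: make nvcc/g++ pick up a newer CCCL checkout instead of the
# CUDA toolkit's bundled (older) CCCL headers, before running `pip install .`.
cccl_include = "/opt/cccl/include"  # illustrative location of a CCCL clone
existing = os.environ.get("CPLUS_INCLUDE_PATH", "")
os.environ["CPLUS_INCLUDE_PATH"] = (
    cccl_include if not existing else cccl_include + os.pathsep + existing
)
print(os.environ["CPLUS_INCLUDE_PATH"].split(os.pathsep)[0])
```

The same effect can be had with a plain `export CPLUS_INCLUDE_PATH=...` in the shell before building.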

@johnnynunez (Contributor, Author) commented

> Just a little message to say that I believe this PR is badly needed by people on CUDA 13: PyTorch 2.9 is now the stable release, but the latest available xformers doesn't support CUDA 13.
>
> So some software (like ComfyUI) using tensor-related features can't work properly.
>
> PS: I'm not a dev at all, but I spent an entire day trying to use ComfyUI with TensorRT acceleration, and it's basically impossible with CUDA 13 drivers (sadly, I did not manage to compile this PR successfully).

Could you try again? It should be fixed now.

@johnnynunez (Contributor, Author) commented

Pinging again.
cc @jiawenliu64 @bottler @sgrigory

@riomus commented Oct 28, 2025

Note: when this branch is built inside nvcr.io/nvidia/pytorch:25.09-py3 on DGX Spark, it does not work: TORCH_CUDA_ARCH_LIST in that image is set to "8.0 8.6 9.0 10.0 11.0 12.0+PTX", and the "+PTX" suffix apparently breaks compute-capability recognition, so the build is executed only for sm_80 and sm_90.

To install it from source on DGX Spark inside the recommended image (NVIDIA recommends it instead of installing PyTorch manually), you need to either unset TORCH_CUDA_ARCH_LIST or export TORCH_CUDA_ARCH_LIST=12.0.
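The workaround above can be sketched as a small helper that strips the "+PTX" suffix and keeps only the capabilities you want. This is illustrative (the helper is not part of xformers or PyTorch):

```python
# Hedged sketch: sanitize a TORCH_CUDA_ARCH_LIST value whose "+PTX" suffix
# confuses capability detection, keeping only the archs you actually want.
def sanitize_arch_list(arch_list, keep):
    archs = [a[:-4] if a.endswith("+PTX") else a for a in arch_list.split()]
    return " ".join(a for a in archs if a in keep)

# The DGX Spark case from the comment above: keep only 12.0.
print(sanitize_arch_list("8.0 8.6 9.0 10.0 11.0 12.0+PTX", {"12.0"}))  # → 12.0
```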

@danthe3rd (Contributor) commented

Thanks for your PR! Let's check whether the wheels build before merging - tests are running now :)

@danthe3rd (Contributor) commented

Also, we should probably update Python to 3.10 in the CI so that the linter is able to run:

    python-version: '3.9'

@johnnynunez (Contributor, Author) commented

> Also, we should probably update Python to 3.10 in the CI so that the linter is able to run:
>
>     python-version: '3.9'

True, I missed it.

@johnnynunez (Contributor, Author) commented

@danthe3rd I upgraded the minimum version to 3.10, which is the minimum right now with torch 2.9.0.

@danthe3rd (Contributor) commented

Thanks! Can you also look at the Windows build?

@johnnynunez (Contributor, Author) commented

> Thanks! Can you also look at the Windows build?

Seems to be a bug in the Jimver action: Jimver/cuda-toolkit#395

@danthe3rd (Contributor) commented Oct 29, 2025

Thanks! Looks like we're getting these errors in the CI now (related to PyTorch's CUDAExtension?):

ValueError: Unknown CUDA arch (10.0f) or GPU not supported

Are the following the only archs supported in PyTorch at the moment?

https://github.com/pytorch/pytorch/blob/8b188647cfdc1355070ccd5aaa18a8060d4f67bf/torch/utils/cpp_extension.py#L2435-L2438

@johnnynunez (Contributor, Author) commented

> Thanks! Looks like we're getting these errors in the CI now (related to PyTorch's CUDAExtension?):
>
> ValueError: Unknown CUDA arch (10.0f) or GPU not supported
>
> Are the following the only archs supported in PyTorch at the moment?
>
> https://github.com/pytorch/pytorch/blob/8b188647cfdc1355070ccd5aaa18a8060d4f67bf/torch/utils/cpp_extension.py#L2435-L2438

Yes, I've seen that. The Blackwell family archs are not supported there yet. I'm going to change it when I get home (I'm coming back from the PyTorch conference).
https://developer.nvidia.com/blog/nvidia-blackwell-and-nvidia-cuda-12-9-introduce-family-specific-architecture-features/
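One hedged way to handle this while waiting for upstream support: fall back from a family-specific arch such as "10.0f" to the plain capability when the installed PyTorch does not list the family variant. The helper name and arch set are illustrative, not a real PyTorch API:

```python
# Hedged sketch: degrade "10.0f"-style family archs to the plain capability
# when torch.utils.cpp_extension's supported-arch list doesn't know them.
def compat_arch(arch, supported):
    if arch in supported:
        return arch
    if arch.endswith("f") and arch[:-1] in supported:
        return arch[:-1]  # e.g. "10.0f" -> "10.0"
    raise ValueError("Unknown CUDA arch (%s) or GPU not supported" % arch)

print(compat_arch("10.0f", {"8.0", "9.0", "10.0"}))  # → 10.0
```

This trades the family-specific kernels for a build that at least completes on older PyTorch releases.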

@danthe3rd (Contributor) commented

Thanks! Let's see how it goes. I'm a bit worried we might hit a timeout on the CI with that many architectures (especially for the build of FA3)

@johnnynunez (Contributor, Author) commented

> Thanks! Let's see how it goes. I'm a bit worried we might hit a timeout on the CI with that many architectures (especially for the build of FA3)

Maybe we can filter for FA3? FA3 is only compatible with sm_80 and sm_90; FA4 is still only compatible with sm_100/sm_103.
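That per-kernel filtering could look like the following sketch, using the capabilities stated in the comment above (FA3: sm_80/sm_90; FA4: sm_100/sm_103). The constants and helper are illustrative, not xformers' actual build configuration:

```python
# Hedged sketch: intersect the user's requested arch list with what each
# flash-attention generation actually supports, to keep CI build time down.
FA3_ARCHS = {"8.0", "9.0"}    # per the comment: FA3 targets sm_80/sm_90
FA4_ARCHS = {"10.0", "10.3"}  # per the comment: FA4 targets sm_100/sm_103

def filter_archs(requested, supported):
    return sorted(set(requested) & supported)

print(filter_archs(["8.0", "9.0", "10.0", "12.0"], FA3_ARCHS))  # → ['8.0', '9.0']
```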

@danthe3rd (Contributor) commented

Hmm, the FAv3 Windows build for CUDA 13 seems to be broken. Maybe we could disable FAv3 on Windows/CUDA 13 for now?
I see this sort of error:

2025-10-29T16:50:00.8970137Z C:\Users\runneradmin\AppData\Local\Temp\tmpxft_00002264_00000000-7_flash_bwd_hdim128_bf16_sm90.compute_90a.cudafe1.stub.c(417): error C2719: 'unnamed-parameter': formal parameter with requested alignment of 128 won't be aligned

https://github.com/facebookresearch/xformers/actions/runs/18913502491/job/53996546199?pr=1344

@johnnynunez (Contributor, Author) commented

> Hmm, the FAv3 Windows build for CUDA 13 seems to be broken. Maybe we could disable FAv3 on Windows/CUDA 13 for now? I see this sort of error:
>
> 2025-10-29T16:50:00.8970137Z C:\Users\runneradmin\AppData\Local\Temp\tmpxft_00002264_00000000-7_flash_bwd_hdim128_bf16_sm90.compute_90a.cudafe1.stub.c(417): error C2719: 'unnamed-parameter': formal parameter with requested alignment of 128 won't be aligned
>
> https://github.com/facebookresearch/xformers/actions/runs/18913502491/job/53996546199?pr=1344

Yes, maybe we can disable it. I don't have a Windows machine right now to test it.

@danthe3rd (Contributor) commented

I would say we just need to set XFORMERS_DISABLE_FLASH_ATTN in the Windows wheel-build CI for now, or add a condition there to skip the build on Windows + cu13:

xformers/setup.py, lines 282-283 (at 51aa071):

    if cuda_version < 1203:
        return []
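The suggested extra condition could be sketched like this. It mirrors the guard quoted above and adds a hypothetical Windows + CUDA 13 check (not the code that was actually merged), to be kept until the MSVC alignment error (C2719) is resolved:

```python
import platform

# Hedged sketch: extend setup.py's existing FA3 guard (cuda_version < 1203
# -> skip) with a hypothetical condition that also skips on Windows + cu13.
def should_build_fa3(cuda_version, system=None):
    system = system or platform.system()
    if cuda_version < 1203:   # existing guard from setup.py
        return False
    if system == "Windows" and cuda_version >= 1300:
        return False          # broken by MSVC error C2719 for now
    return True

print(should_build_fa3(1300, system="Windows"))  # → False
```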

@johnnynunez (Contributor, Author) commented

> I would say we just need to set XFORMERS_DISABLE_FLASH_ATTN in the Windows wheel-build CI for now, or add a condition there to skip the build on Windows + cu13:
>
>     if cuda_version < 1203:
>         return []

I tried to fix it; run the CI to see if it works. If not, we can skip it.

@johnnynunez (Contributor, Author) commented

@danthe3rd it seems that it fails again. I reported it internally. We can merge, skipping cu130, at this point.

@danthe3rd (Contributor) commented Oct 30, 2025

Sure - let me merge if the CI is green :)
(will probably take ~6 hours tho)

@johnnynunez (Contributor, Author) commented

You can cancel the previous one. Now it should work @danthe3rd

Co-authored-by: dan_the_3rd <[email protected]>
@danthe3rd (Contributor) commented

Everything seems alright, we can fix the linters later on our side.
Thanks a lot for your contribution!

@danthe3rd danthe3rd merged commit a64b139 into facebookresearch:main Oct 30, 2025
17 of 20 checks passed

Linked issue that merging may close: cu129 ERROR

4 participants