Releases · facebookresearch/xformers
[v0.0.29.post1] Fix Flash2 on Windows
This fixes the issue reported in #1163 (comment)
Enabling FAv3 by default, removed deprecated components
Pre-built binary wheels require PyTorch 2.5.1
Improved:
- [fMHA] Creating a `LowerTriangularMask` no longer creates a CUDA tensor
- [fMHA] Updated Flash-Attention to `v2.7.2.post1`
- [fMHA] Flash-Attention v3 will now be used by `memory_efficient_attention` by default when available, unless the operator is enforced with the `op` keyword-argument (see the sketch after this list). Switching from Flash2 to Flash3 can make transformer trainings ~10% faster end-to-end on H100s
- [fMHA] Fixed a performance regression with the `cutlass` backend for the backward pass (#1176) - mostly used on older GPUs (e.g. V100)
- Fixed `swiglu` operator compatibility with torch.compile with PyTorch 2.6
- Fix activation checkpointing of SwiGLU when AMP is enabled (#1152)
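A minimal sketch of the `op` keyword-argument mentioned above. The shapes, dtype and the `[batch, seq_len, heads, head_dim]` layout are illustrative assumptions; the Flash-Attention v2 operator pair is taken from the `op=(flash.FwOp, flash.BwOp)` usage shown further down in these notes:

```python
import torch
import xformers.ops as xops
from xformers.ops import fmha

# Toy inputs in the [batch, seq_len, heads, head_dim] layout.
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

# Default dispatch: xFormers picks the fastest available operator
# (Flash-Attention v3 when available, as of this release).
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())

# Enforce Flash-Attention v2 explicitly via the `op` keyword-argument.
out_fa2 = xops.memory_efficient_attention(
    q, k, v,
    attn_bias=xops.LowerTriangularMask(),
    op=(fmha.flash.FwOp, fmha.flash.BwOp),
)
```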
Removed:
- Following PyTorch, xFormers no longer builds binaries for conda. Pip is now the only recommended way to get xFormers
- Removed unmaintained/deprecated components in `xformers.components.*` (see #848)
`v0.0.28.post3` - build for PyTorch 2.5.1
[0.0.28.post3] - 2024-10-30
Pre-built binary wheels require PyTorch 2.5.1
`v0.0.28.post2` - build for PyTorch 2.5.0
[0.0.28.post2] - 2024-10-18
Pre-built binary wheels require PyTorch 2.5.0
`0.0.28.post1` - fixing upload for cuda 12.4 wheels
[0.0.28.post1] - 2024-09-13
Properly upload wheels for cuda 12.4
FAv3, profiler update & AMD
Pre-built binary wheels require PyTorch 2.4.1
Added
- Added wheels for cuda 12.4
- Added conda builds for python 3.11
- Added wheels for rocm 6.1
Improved
- Profiler: Fix computation of FLOPS for the attention when using xFormers
- Profiler: Fix MFU/HFU calculation when multiple dtypes are used
- Profiler: Trace analysis to compute MFU & HFU is now much faster
- fMHA/splitK: Fixed `nan` in the output when using a `torch.Tensor` bias where a lot of consecutive keys are masked with `-inf` (see the sketch after this list)
- Update Flash-Attention version to `v2.6.3` when building from scratch
- When using the most recent version of Flash-Attention, it is no longer possible to mix it with the cutlass backend. In other words, it is no longer possible to use the cutlass Fw with the flash Bw.
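For reference, a hedged sketch of the kind of additive `torch.Tensor` bias described in the `nan` fix above. The shapes, dtype and the `[batch, heads, q_len, kv_len]` bias layout are illustrative assumptions, not part of the release:

```python
import torch
import xformers.ops as xops

B, M, H, K = 2, 1024, 8, 64
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)

# Additive tensor bias in the [batch, heads, q_len, kv_len] layout where
# the second half of the keys is masked with -inf for every query; long
# runs of masked consecutive keys like this previously could produce nan
# in the split-K output.
bias = torch.zeros(B, H, M, M, device="cuda", dtype=torch.float16)
bias[..., M // 2:] = float("-inf")

out = xops.memory_efficient_attention(q, k, v, attn_bias=bias)
```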
Removed
- fMHA: Removed `decoder` and `small_k` backends
- profiler: Removed `DetectSlowOpsProfiler` profiler
- Removed compatibility with PyTorch < 2.4
- Removed conda builds for python 3.9
- Removed windows pip wheels for cuda 12.1 and 11.8
torch.compile support, bug fixes & more
Pre-built binary wheels require PyTorch 2.4.0
Added
- fMHA: PagedBlockDiagonalGappyKeysMask
- fMHA: heterogeneous queries in triton_splitk
- fMHA: support for paged attention in flash
- fMHA: Added backwards pass for merge_attentions
- fMHA: Added torch.compile support for 3 biases (LowerTriangularMask, LowerTriangularMaskWithTensorBias and BlockDiagonalMask) - some might require PyTorch 2.4
- fMHA: Added torch.compile support in memory_efficient_attention when passing the flash operator explicitly (e.g. memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp))) - see the sketch after this list
- fMHA: memory_efficient_attention now expects its attn_bias argument to be on the same device as the other input tensors. Previously, it would convert the bias to the right device.
- fMHA: AttentionBias subclasses are now constructed by default on the cuda device if available - they used to be created on the CPU device
- 2:4 sparsity: Added xformers.ops.sp24.sparsify24_ste for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values
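A hedged sketch tying the torch.compile items above together, assuming the flash operators live at `xformers.ops.fmha.flash` and that `LowerTriangularMask` is importable from `xformers.ops`; shapes and dtype are illustrative:

```python
import torch
import xformers.ops as xops
from xformers.ops import fmha

@torch.compile
def attention(q, k, v, bias):
    # Passing the flash operator explicitly is what enables
    # torch.compile support in this release.
    return xops.memory_efficient_attention(
        q, k, v, attn_bias=bias, op=(fmha.flash.FwOp, fmha.flash.BwOp)
    )

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# LowerTriangularMask is one of the three compile-supported biases; as of
# this release, AttentionBias subclasses default to the cuda device when
# available, matching the device of q/k/v.
bias = xops.LowerTriangularMask()
out = attention(q, k, v, bias)
```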
Improved
- fMHA: Fixed out-of-bounds reading for Split-K triton implementation
- Profiler: fix bug with modules that take a single tuple as argument
- Profiler: Added manual trigger for a profiling step, by creating a trigger file in the profiling directory
Removed
- Removed support for PyTorch version older than 2.2.0
[v0.0.27] torch.compile support, bug fixes & more
Added
- fMHA: `PagedBlockDiagonalGappyKeysMask`
- fMHA: heterogeneous queries in `triton_splitk`
- fMHA: support for paged attention in flash
- fMHA: Added backwards pass for `merge_attentions`
- fMHA: Added `torch.compile` support for 3 biases (`LowerTriangularMask`, `LowerTriangularMaskWithTensorBias` and `BlockDiagonalMask`) - some might require PyTorch 2.4
- fMHA: Added `torch.compile` support in `memory_efficient_attention` when passing the flash operator explicitly (e.g. `memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp))`)
- fMHA: `memory_efficient_attention` now expects its `attn_bias` argument to be on the same device as the other input tensors. Previously, it would convert the bias to the right device.
- fMHA: `AttentionBias` subclasses are now constructed by default on the `cuda` device if available - they used to be created on the CPU device
- 2:4 sparsity: Added `xformers.ops.sp24.sparsify24_ste` for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values
Improved
- fMHA: Fixed out-of-bounds reading for Split-K triton implementation
- Profiler: fix bug with modules that take a single tuple as argument
- Profiler: Added manual trigger for a profiling step, by creating a `trigger` file in the profiling directory
Removed
- Removed support for PyTorch version older than 2.2.0
2:4 sparsity & `torch.compile`-ing memory_efficient_attention
Pre-built binary wheels require PyTorch 2.3.0
Added
- [2:4 sparsity] Added support for Straight-Through Estimator for `sparsify24` gradient (`GRADIENT_STE`)
- [2:4 sparsity] `sparsify24_like` now supports the cuSparseLt backend, and the STE gradient
- Basic support for `torch.compile` for the `memory_efficient_attention` operator. Currently only supports Flash-Attention, and without any bias provided. We want to expand this coverage progressively (see the sketch after this list).
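A minimal sketch of the initial `torch.compile` coverage described above (Flash-Attention path, no bias); device, shapes and dtype are assumptions for illustration:

```python
import torch
import xformers.ops as xops

@torch.compile
def attention(q, k, v):
    # No attn_bias here: the coverage added in this release only handles
    # the Flash-Attention path without a bias.
    return xops.memory_efficient_attention(q, k, v)

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = attention(q, k, v)
```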
Improved
- `merge_attentions` no longer needs inputs to be stacked.
- fMHA: `triton_splitk` now supports additive bias
- fMHA: benchmark cleanup