@drisspg drisspg commented Nov 5, 2025

Summary

  • Implement block-sparse attention in flash_fwd_sm100.py
  • Update interface.py to handle SM100 block size calculations
    (2x multiplier for m_block_size since 1 CTA handles 2*tile_m rows)
  • Add mask_mod parameter support in mask.py for block-sparse masking
  • Add SM100 test fixtures and tile size handling in test_mask_mod.py
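
The 2x block-size accounting from the summary can be sketched as follows. This is a minimal illustration, not the code in interface.py: the function names, the `tile_m = 128` value, and the `rows_per_cta_factor` parameter are assumptions made for the example.

```python
def sm100_sparse_block_m(tile_m: int, rows_per_cta_factor: int = 2) -> int:
    """Rows of Q covered by one CTA, assuming 1 CTA handles 2 * tile_m rows."""
    return rows_per_cta_factor * tile_m


def num_sparse_m_blocks(seqlen_q: int, tile_m: int) -> int:
    """Number of block-sparse rows needed along Q (ceiling division)."""
    block_m = sm100_sparse_block_m(tile_m)
    return (seqlen_q + block_m - 1) // block_m


# Hypothetical tile size, chosen only for illustration.
tile_m = 128
print(sm100_sparse_block_m(tile_m))       # block size the sparsity metadata must use
print(num_sparse_m_blocks(1000, tile_m))  # m-blocks for a 1000-row query
```

The point is that block-sparse metadata built for a block size of `tile_m` would be off by a factor of two along the query dimension, so the interface has to account for the larger effective block.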

Fast follow

Do the aux_tensor fastdivmod wrapping to avoid OOB reads
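
As a rough sketch of what that follow-up could look like, the idea is to wrap an index into the valid range of an auxiliary tensor using a precomputed multiply-and-shift instead of a hardware division. The `FastDivmod` class and `wrapped_read` helper below are hypothetical names written in plain Python for illustration; they are not the CuTe DSL implementation.

```python
class FastDivmod:
    """Divide by a fixed positive divisor using a precomputed multiply + shift."""

    def __init__(self, divisor: int, bits: int = 32):
        if divisor <= 0:
            raise ValueError("divisor must be positive")
        self.divisor = divisor
        self.bits = bits
        # ceil(log2(divisor)) extra bits keep the result exact for n < 2**bits.
        self.shift = bits + (divisor - 1).bit_length()
        # magic = ceil(2**shift / divisor)
        self.magic = -(-(1 << self.shift) // divisor)

    def divmod(self, n: int):
        if not 0 <= n < (1 << self.bits):
            raise ValueError("n out of range")
        q = (n * self.magic) >> self.shift
        return q, n - q * self.divisor


def wrapped_read(aux, idx: int, fdm: FastDivmod):
    """Read aux[idx mod len(aux)] so an out-of-range idx never reads OOB."""
    _, r = fdm.divmod(idx)
    return aux[r]


aux = [10, 20, 30]
fdm = FastDivmod(len(aux))
print(wrapped_read(aux, 4, fdm))  # index 4 wraps to position 1
```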

We should also land #1984 first and then rebase, so this is easier to review.

Perf

A lot of perf wins (not universal for document mask), but the delta from SOL is much higher than what was found on the Hopper impl.

Not autotuning the flex blocksparse impl gives this:
combined_comparison

And autotuning the triton impl:
combined_comparison_autotune

Possible problems

Looking at the Pm samples we can see a long tail:

This is for causal_mask with the default StaticPersistentSchedule. We need to build a generic version of this, but we already have a better schedule for causal. If we hard-code the LPT schedule, we go from:
Screenshot 2025-11-04 at 5 39 33 PM

to:

Screenshot 2025-11-04 at 5 39 14 PM
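
A toy model of why the schedule matters for causal masks, under stated assumptions: m-block i needs i + 1 KV blocks of work, the static schedule hands tiles to workers in strided index order, and LPT (longest processing time first) greedily gives the heaviest remaining tile to the least-loaded worker. The function names and the cost model are illustrative only and do not reproduce the kernel's tile scheduler.

```python
import heapq


def static_strided_makespan(work, num_workers):
    """Worker w processes tiles w, w + W, w + 2W, ... in index order."""
    loads = [0] * num_workers
    for tile, cost in enumerate(work):
        loads[tile % num_workers] += cost
    return max(loads)


def lpt_makespan(work, num_workers):
    """Greedy LPT: heaviest tile first, always onto the least-loaded worker."""
    heap = [(0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    for cost in sorted(work, reverse=True):
        load, w = heapq.heappop(heap)
        heapq.heappush(heap, (load + cost, w))
    return max(load for load, _ in heap)


# Causal mask: m-block i attends to i + 1 KV blocks (assumed cost model).
causal_work = [i + 1 for i in range(8)]
print(static_strided_makespan(causal_work, 3))
print(lpt_makespan(causal_work, 3))
```

In this model the static schedule leaves one worker with a noticeably heavier load than the others, which is one way a long tail can show up, while LPT evens the loads out.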

Tests

Screenshot 2025-11-10 at 8 26 43 PM Screenshot 2025-11-10 at 8 37 51 PM Screenshot 2025-11-10 at 8 43 15 PM

@drisspg drisspg changed the title from "[Cute] Extract block-sparse utilities from SM80/90" to "[Cute] Block sparse support Sm100" Nov 5, 2025
@drisspg drisspg force-pushed the sm100-block-sparsity branch 6 times, most recently from 06588af to 547cf51 Compare November 11, 2025 04:44
@drisspg drisspg force-pushed the sm100-block-sparsity branch from 547cf51 to 66cbbdb Compare November 11, 2025 05:29


@cute.jit
def handle_block_sparse_empty_tile_correction_sm100(

Had to dupe a lot here, but I still think it's better: very large if/else indents make it harder to rebase and iterate on the whole constexpr tree.

@drisspg drisspg force-pushed the sm100-block-sparsity branch from d1ece5e to 1db7911 Compare November 13, 2025 19:28
This enables block-sparsity on SM 10.0 architecture, including
mask_mod support and proper block size accounting.
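
To illustrate how a mask_mod interacts with block sparsity, here is a minimal pure-Python sketch. It assumes the FlexAttention-style signature `mask_mod(b, h, q_idx, kv_idx) -> bool`; the `classify_block` helper and the tiny block sizes are hypothetical and exist only for this example. Blocks that are entirely masked can be skipped, fully visible blocks need no per-element masking, and only partial blocks have to evaluate mask_mod inside the kernel.

```python
def causal_mask_mod(b, h, q_idx, kv_idx):
    """Keep a (query, key) pair only if the key is not in the future."""
    return kv_idx <= q_idx


def classify_block(mask_mod, m_block, n_block, block_m, block_n, seqlen_q, seqlen_k):
    """Return 'empty', 'full', or 'partial' for one (m_block, n_block) tile."""
    q_range = range(m_block * block_m, min((m_block + 1) * block_m, seqlen_q))
    k_range = range(n_block * block_n, min((n_block + 1) * block_n, seqlen_k))
    vals = [mask_mod(0, 0, q, k) for q in q_range for k in k_range]
    if not any(vals):
        return "empty"    # block can be skipped entirely
    if all(vals):
        return "full"     # no per-element masking needed
    return "partial"      # mask_mod must be applied element-wise


# Tiny sizes chosen for readability, not realistic tile sizes.
for m in range(2):
    print([classify_block(causal_mask_mod, m, n, 4, 4, 8, 8) for n in range(2)])
```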
@drisspg drisspg force-pushed the sm100-block-sparsity branch from 1db7911 to b01ba0c Compare November 14, 2025 02:02

drisspg commented Nov 14, 2025

@tridao Okay, finally rebased. Perf looks good and tests are green.
Screenshot 2025-11-13 at 7 26 18 PM
