
Releases: flashinfer-ai/flashinfer

v0.3.1

05 Sep 06:24
3c1e8d7

What's Changed

  • hotfix: change MAX_JOBS in aot ci by @yzh119 in #1621
  • fix: export MAX_JOBS for AOT build by @yongwww in #1626 (see the build sketch at the end of this section)
  • perf: Fix the tactic sorting in TrtllmGenBatchedGemmRunner::getValidConfigIndices by @jinyangyuan-nvidia in #1615
  • Fix cute dsl gemm API wrong arg name and silent error when passing wrong kwargs by @fzyzcjy in #1619
  • bugfix: fix merge_attention_state in BatchAttention w/ gqa-group-size in Qwen family by @happierpig in #1614
  • bugfix: fix multi-gpu/node unit-test: skip when there aren't enough GPUs in test_trtllm_mnnvl_allreduce by @bkryu in #1627
  • ci: add cuda-13 unittests to CI by @yzh119 in #1603
  • Revert "hotfix: change MAX_JOBS in aot ci (#1621)" by @yzh119 in #1629
  • patch mm segfault & patch cubin availability by @aleozlx in #1628
  • bugfix: fix flashinfer_benchmark.py IMA when running a test list by @bkryu in #1625
  • bugfix: trtllm-gen fmha sm101 and sm100 compatibility by @cyx-6 in #1631
  • bugfix: collect all modules to aot by @yzh119 in #1622
  • fix: pass workspace for trtllm-gen attention by @yyihuang in #1635
  • test: pytest.mark.xfail on deepgemm by @yongwww in #1636
  • release: bump version v0.3.1 by @yongwww in #1637

Full Changelog: v0.3.0...v0.3.1
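
Several items above concern the `MAX_JOBS` environment variable, which caps how many compile jobs the ahead-of-time (AOT) build runs in parallel. A minimal sketch of bounding build parallelism, assuming the `flashinfer.aot` module entry point and treating the value 8 as an arbitrary example:

```python
import os
import subprocess

# Hedged sketch: cap parallel compile jobs so an AOT build does not exhaust
# memory. MAX_JOBS and the flashinfer.aot entry point are assumptions drawn
# from the PR titles above, not a verified invocation.
env = dict(os.environ, MAX_JOBS="8")
subprocess.run(["python", "-m", "flashinfer.aot"], check=True, env=env)
```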

v0.3.0

01 Sep 06:21
f131f3d

What's Changed

  • Backend: downgrade trtllm-gen kernel to cuda-12 by @cyx-6 in #1567
  • feat: Add fp8-qkv, fp16/bf16 output MHA by @weireweire in #1540
  • bump cutlass submodule to v4.2 by @ttyio in #1572
  • typo: fix typo in variable names of fp4 masked gemm by @fzyzcjy in #1570
  • benchmark: Add autotuner to moe benchmark by @nv-yunzheq in #1536
  • bugfix: fix cuda version guard macros by @nvjullin in #1571
  • misc: remove some unused files by @yzh119 in #1574
  • bugfix: update trtllm-gen gemm kernel names by @cyx-6 in #1577
  • feat: Support for inferring out_dtype from out.dtype for TRTLLM attention kernel by @elvischenv in #1578
  • fix: semaphores must be at a fixed range in the workspace buffer on trtllm_gen attention by @yyihuang in #1584
  • bugfix: Fix arg passing to TORCH_CHECK and TORCH_WARN macros by @amitz-nv in #1582
  • refactor: Expose calculate_tile_tokens_dim function by @amitz-nv in #1581
  • fix unignorable narrowing conversion issue by @luccafong in #1586
  • bugfix: Fix test_fp4_quantize test bug by @sricketts in #1585
  • update trtllm-gen fp4 autotuner and routing by @IwakuraRein in #1573
  • fix: limit the number of nvcc threads for each kernel by @yzh119 in #1589
  • fix: Improve TRTLLM attention kernel out_dtype unit test by @elvischenv in #1590
  • refactor: use allocator class for workspace buffer allocation by @yyihuang in #1588
  • misc: Fix footnote and typo in CONTRIBUTING.md by @sricketts in #1583
  • Mnnvl memory with custom communicator by @wenscarl in #1245
  • Add mnnvl_moe_alltoallv_prepare_without_allgather by @trevor-m in #1550
  • bugfix: Adding version checks to tests/test_hopper*.py files by @bkryu in #1594
  • Remove cuda-python from dependency and check at runtime by @VALLIS-NERIA in #1534
  • bugfix: fix fused-temperature softmax IMA issue by @yzh119 in #1596
  • bugfix: Fix RuntimeError("FlashInfer requires sm75+") by @hijkzzz in #1598
  • bugfix: fix the register overflow issue for topk renorm kernels on blackwell by @yzh119 in #1597 (see the sketch at the end of this section)
  • bugfix: fix unittest test_fp8_quantize by @yzh119 in #1599
  • bugfix: fix multi-gpu/node unit-test: skip when there aren't enough GPUs instead of failing by @bkryu in #1600
  • feat: Enable MnnvlMemory (for alltoallv) on B200 by @trevor-m in #1601
  • ci: add ci container of cuda 13 and add cute-dsl as dependency. by @yzh119 in #1595
  • ci: Fix unittests of logits processor by @yzh119 in #1602
  • feat: integrate xqa attention backend by @qsang-nv in #1503
  • [cute dsl] optimize cute dsl make_ptr perf by @limin2021 in #1607
  • bugfix: fix fp4 quantization with 8x4 scale factor layout by @cyx-6 in #1611
  • feat: enable trtllm-gen attn speculative decoding verify by decode by @yyihuang in #1453
  • ci: limit aot parallel build jobs based on available memory by @yongwww in #1612
  • release: bump version v0.3.0 by @yzh119 in #1617

Full Changelog: v0.2.14.post1...v0.3.0
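
The register-overflow fix in #1597 touches the top-k renormalization kernels. A hedged sketch of how that path is typically exercised, assuming the `flashinfer.sampling.top_k_renorm_probs` signature `(probs, top_k)`:

```python
import torch
import flashinfer

# Hedged sketch: top_k_renorm_probs keeps the top-k probabilities per row,
# zeroes the rest, and renormalizes each row to sum to one.
probs = torch.softmax(torch.randn(4, 32000, device="cuda"), dim=-1)
renormed = flashinfer.sampling.top_k_renorm_probs(probs, top_k=50)
assert torch.allclose(renormed.sum(-1), torch.ones(4, device="cuda"))
```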

v0.2.14.post1

25 Aug 03:15
0380322

What's Changed

  • bugfix: Fix Persistent kernel precision for masked output by @Edenzzzz in #1533
  • ci: create docker image for cu126/cu128/cu129 by @yzh119 in #1558
  • Bugfix: some typos in Persistent kernel by @Edenzzzz in #1562
  • fix: separate out fp4 lib into sm90 and sm100 versions, add oob checking in fused moe by @djmmoss in #1565
  • bugfix: fix persistent attention kernel correctness on blackwell by @yzh119 in #1559
  • ci: add unittest for different cuda version by @yzh119 in #1560
  • release: bump version to v0.2.14.post1 by @yzh119 in #1568

Full Changelog: v0.2.14...v0.2.14.post1

v0.2.14

23 Aug 09:26
674f699

What's Changed

  • flashinfer_benchmark QoL Improvements and Attention FP8 Support by @bkryu in #1512
  • add cuda version check for jit by @cyx-6 in #1526
  • bugfix: Fix compile error for undefined swizzle enum. by @weireweire in #1530
  • refactor: Sink attention AoT by @nandor in #1427
  • test: Enable all modules in AOT build test by @yongwww in #1528
  • Add GeGLU support to trtllm-gen NVFP4 Fused MoE Kernel by @stslxg-nv in #1525
  • Add sm check for sm100 only cutlass/trtllm kernel by @ttyio in #1535
  • bugfix: fix autotuner failure with low precision data types by @ttyio in #1539
  • misc: Setting logging level from env var by @cyx-6 in #1538
  • backend: Refactor trtllm-gen fmha metainfo loading by @cyx-6 in #1518
  • feat: pass sm_count as param for fp4_masked_gemm by @yyihuang in #1529
  • Revert "backend: Refactor trtllm-gen fmha metainfo loading (#1518)" by @yzh119 in #1543
  • Fix typo in sampling.cuh: Remove duplicate parameter by @Appenhaimer in #1546
  • perf: replace cudaGetDeviceProperties with cudaDeviceGetAttribute by @yongwww in #1547 (see the sketch at the end of this section)
  • fix trtllm_allreduce_fusion twoshot register problem. by @strgrb in #1545
  • feat: Integrate TRTLLM varlen kernel for deepseek R1 prefill by @elfiegg in #1537
  • Add CONTRIBUTING.md by @sricketts in #1553
  • release: bump version to v0.2.14 by @yongwww in #1554
  • ci: add timeout for SPOT instance allocation by @yongwww in #1555
  • fix: add packaging dependency to resolve pypi workflow by @yongwww in #1557

Full Changelog: v0.2.13...v0.2.14
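
The perf change in #1547 swaps full `cudaGetDeviceProperties` queries for targeted `cudaDeviceGetAttribute` calls, which avoid filling the whole `cudaDeviceProp` struct. A hedged illustration of the same idea from Python via the cuda-python bindings (the binding names are assumptions, not code from the PR):

```python
from cuda import cudart  # cuda-python runtime bindings

# Hedged sketch: querying one attribute is far cheaper than populating the
# full cudaDeviceProp struct with cudaGetDeviceProperties.
err, sm_count = cudart.cudaDeviceGetAttribute(
    cudart.cudaDeviceAttr.cudaDevAttrMultiProcessorCount, 0
)
assert err == cudart.cudaError_t.cudaSuccess
print(f"SMs on device 0: {sm_count}")
```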

v0.2.13

20 Aug 23:23
0973054

What's Changed

  • test: add top_k_sampling_with_variable_k test by @JasonJ2021 in #1505 (see the sketch at the end of this section)
  • benchmark: add moe to benchmark by @nv-yunzheq in #1497
  • update allreduce to match trtllm by @nvjullin in #1507
  • Support cuda<12.8 builds for trtllm_allreduce_fusion by @strgrb in #1508
  • gpt-oss: Add MXFP8 x MXFP4 CUTLASS MOE for SM100 and BF16 x MXFP4 CUTLASS for SM90 + SwigluBias Activation by @djmmoss in #1396
  • tuner: Trtllm-gen Fp4 MoE Autotuner by @IwakuraRein in #1475
  • refactor fp4 masked gemm cute-dsl implementation and add manual cache by @yzh119 in #1521
  • fix: add missing 'requests' when building the package with AOT by @EmilienM in #1517
  • Fix cuda-python v13.0 import compatibility by @yongwww in #1455
  • misc: add license of spdlog for packaging by @yzh119 in #1522
  • Fix linking errors with CUDA 13 by @yongwww in #1523
  • release: bump version to v0.2.13 by @yongwww in #1524

Full Changelog: v0.2.12...v0.2.13
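
The new test in #1505 covers top-k sampling with a per-row (variable) k. A hedged sketch of the interface under test, assuming `flashinfer.sampling.top_k_sampling_from_probs` accepts a tensor of k values:

```python
import torch
import flashinfer

# Hedged sketch: one top-k budget per row of the probability matrix.
probs = torch.softmax(torch.randn(4, 32000, device="cuda"), dim=-1)
top_k = torch.tensor([1, 20, 50, 32000], dtype=torch.int32, device="cuda")
samples = flashinfer.sampling.top_k_sampling_from_probs(probs, top_k)
print(samples.shape)  # one sampled token id per row: (4,)
```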

v0.2.12

18 Aug 16:46
ae1480c

What's Changed

  • Fix TRTLLM NVFP4-out attention kernel scale factor dim issue by @elvischenv in #1460
  • perf: add fast path to TopPRenormProbKernel for top_p >= 1.0, significantly boosting SGLang workloads by @TianyuZhang1214 in #1483 (see the sketch at the end of this section)
  • fix: update cutedsl masked moe gemm by @yyihuang in #1488
  • feat: Support fp8 qkv, fp16/bf16 out MHA for trtllm-gen. by @weireweire in #1490
  • Add errors when dtype is anything other than int32 for ptr metadata by @pavanimajety in #1492
  • refactor: unify autotuner for bmm_fp8 by @ttyio in #1479
  • fix: update masked moe gemm fp4 tensor reshape by @yyihuang in #1495
  • Revert "feat: Support fp8 qkv, fp16/bf16 out MHA for trtllm-gen. (#1490) by @yzh119 in #1496
  • fix(aot): unused compute in has_sm by @fecet in #1501
  • fix: Replace cub Max/Min with cuda::maximum/minimum for cuda 13 compatibility by @yongwww in #1500
  • doc: Update the masked grouped gemm doc by @kaixih in #1499
  • Perf: support scale_a/scale_b instead of combined scale in cutlass bmm_fp8 by @ttyio in #1491
  • feat: scaling at fp4 gemm epilogue by @yyihuang in #1498
  • Add benchmark for cutedsl gemm by @fzyzcjy in #1502
  • Do not import NVSHMEM in the AoT script unless explicitly requested by @nandor in #1506
  • bugfix: Fix stream handling in cutedsl gemm by @fzyzcjy in #1509
  • bump version to v0.2.12 by @yongwww in #1510

Full Changelog: v0.2.11.post3...v0.2.12
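
The fast path added in #1483 observes that `top_p >= 1.0` makes top-p renormalization an identity, so the kernel can return its input unchanged. A hedged sketch, assuming the `flashinfer.sampling.top_p_renorm_probs` signature `(probs, top_p)`:

```python
import torch
import flashinfer

# Hedged sketch: with top_p=1.0 no probability mass is dropped, so the
# renormalized distribution should match the input (the fast path from #1483).
probs = torch.softmax(torch.randn(8, 32000, device="cuda"), dim=-1)
renormed = flashinfer.sampling.top_p_renorm_probs(probs, top_p=1.0)
assert torch.allclose(renormed, probs)
```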

v0.2.11.post3

14 Aug 06:44
7146ebc

What's Changed

Full Changelog: v0.2.11.post2...v0.2.11.post3

v0.2.11.post2

13 Aug 19:17
66144d2

What's Changed

Full Changelog: v0.2.11.post1...v0.2.11.post2

v0.2.11.post1

11 Aug 15:51
df306f6

What's Changed

Full Changelog: v0.2.11...v0.2.11.post1

v0.2.11

09 Aug 04:51
7c45412

What's Changed

Full Changelog: v0.2.10...v0.2.11