Releases: flashinfer-ai/flashinfer
v0.3.1
What's Changed
- hotfix: change MAX_JOBS in aot ci by @yzh119 in #1621
- fix: export MAX_JOBS for AOT build by @yongwww in #1626
- perf: Fix the tactic sorting in TrtllmGenBatchedGemmRunner::getValidConfigIndices by @jinyangyuan-nvidia in #1615
- Fix cute dsl gemm API wrong arg name and silent error when passing wrong kwargs by @fzyzcjy in #1619
- bugfix: fix merge_attention_state in BatchAttention w/ gqa-group-size in Qwen family by @happierpig in #1614
- bugfix: fix multi-gpu/node unit-test: skip when there aren't enough GPUs in test_trtllm_mnnvl_allreduce by @bkryu in #1627
- ci: add cuda-13 unittests to CI by @yzh119 in #1603
- Revert "hotfix: change MAX_JOBS in aot ci (#1621)" by @yzh119 in #1629
- patch mm segfault & patch cubin availability by @aleozlx in #1628
- bugfix: fix flashinfer_benchmark.py IMA when running a test list by @bkryu in #1625
- bugfix: trtllm-gen fmha sm101 and sm100 compatibility by @cyx-6 in #1631
- bugfix: collect all modules to aot by @yzh119 in #1622
- fix: pass workspace for trtllm-gen attention by @yyihuang in #1635
- test: pytest.mark.xfail on deepgemm by @yongwww in #1636
- release: bump version v0.3.1 by @yongwww in #1637
Full Changelog: v0.3.0...v0.3.1
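Several entries above revolve around `MAX_JOBS` for AOT builds (#1621, #1626), and #1612 in v0.3.0 below caps parallel build jobs by available memory. A minimal sketch of that kind of heuristic, assuming a Linux host and an assumed budget of ~4 GB of RAM per nvcc job (the actual logic in those PRs may differ):

```python
import os

def memory_aware_max_jobs(gb_per_job: int = 4) -> int:
    """Cap parallel build jobs by both CPU count and physical RAM (Linux)."""
    cpu_jobs = os.cpu_count() or 1
    mem_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    mem_jobs = max(1, int(mem_bytes / (gb_per_job * 1024**3)))
    return max(1, min(cpu_jobs, mem_jobs))

# Export for torch's C++ extension builder, which honors MAX_JOBS.
os.environ["MAX_JOBS"] = str(memory_aware_max_jobs())
```

Exporting the variable (rather than setting it inside one process) is what #1626 does for the AOT CI, since child build processes must also see it.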
v0.3.0
What's Changed
- Backend: downgrade trtllm-gen kernel to cuda-12 by @cyx-6 in #1567
- feat: Add fp8-qkv, fp16/bf16 output MHA by @weireweire in #1540
- bump cutlass submodule to v4.2 by @ttyio in #1572
- typo: fix typo in variable names of fp4 masked gemm by @fzyzcjy in #1570
- benchmark: Add autotunner to moe benchmark by @nv-yunzheq in #1536
- bugfix: fix cuda version guard macros by @nvjullin in #1571
- misc: remove some unused files by @yzh119 in #1574
- bugfix: update trtllm-gen gemm kernel names by @cyx-6 in #1577
- feat: Support for inferring out_dtype from out.dtype for TRTLLM attention kernel by @elvischenv in #1578
- fix: semaphores must be at a fixed range in the workspace buffer for trtllm_gen attention by @yyihuang in #1584
- bugfix: Fix arg passing to TORCH_CHECK and TORCH_WARN macros by @amitz-nv in #1582
- refactor: Expose calculate_tile_tokens_dim function by @amitz-nv in #1581
- fix unignorable narrowing conversion issue by @luccafong in #1586
- bugfix: Fix test_fp4_quantize test bug by @sricketts in #1585
- update trtllm-gen fp4 autotuner and routing by @IwakuraRein in #1573
- fix: limit the number of nvcc threads for each kernel by @yzh119 in #1589
- fix: Improve TRTLLM attention kernel out_dtype unit test by @elvischenv in #1590
- refactor: use allocator class for workspace buffer allocation by @yyihuang in #1588
- misc: Fix footnote and typo in CONTRIBUTING.md by @sricketts in #1583
- Mnnvl memory with custom communicator by @wenscarl in #1245
- Add mnnvl_moe_alltoallv_prepare_without_allgather by @trevor-m in #1550
- bugfix: Adding version checks to tests/test_hopper*.py files by @bkryu in #1594
- Remove cuda-python from dependency and check at runtime by @VALLIS-NERIA in #1534
- bugfix: fix fused-temperature softmax IMA issue by @yzh119 in #1596
- bugfix: Fix RuntimeError("FlashInfer requires sm75+") by @hijkzzz in #1598
- bugfix: fix the register overflow issue for topk renorm kernels on blackwell by @yzh119 in #1597
- bugfix: fix unittest test_fp8_quantize by @yzh119 in #1599
- bugfix: fix multi-gpu/node unit-test: skip when there aren't enough GPUs instead of failing by @bkryu in #1600
- feat: Enable MnnvlMemory (for alltoallv) on B200 by @trevor-m in #1601
- ci: add ci container of cuda 13 and add cute-dsl as dependency. by @yzh119 in #1595
- ci: Fix unittests of logits processor by @yzh119 in #1602
- feat: integrate xqa attention backend by @qsang-nv in #1503
- [cute dsl] optimize cute dsl make_ptr perf by @limin2021 in #1607
- bugfix: fix fp4 quantization with 8x4 scale factor layout by @cyx-6 in #1611
- feat: enable trtllm-gen attn speculative decoding verify by decode by @yyihuang in #1453
- ci: limit aot parallel build jobs based on available memory by @yongwww in #1612
- release: bump version v0.3.0 by @yzh119 in #1617
New Contributors
- @amitz-nv made their first contribution in #1582
- @luccafong made their first contribution in #1586
- @trevor-m made their first contribution in #1550
- @VALLIS-NERIA made their first contribution in #1534
- @hijkzzz made their first contribution in #1598
- @qsang-nv made their first contribution in #1503
- @limin2021 made their first contribution in #1607
Full Changelog: v0.2.14.post1...v0.3.0
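Several fixes in this release gate tests on toolkit or architecture versions (#1594, and the CUDA-13 CI work in #1595/#1603). The usual pitfall in such checks is comparing dotted version strings lexicographically, where "12.10" sorts before "12.2". A hedged sketch of a numeric comparison (the helper name is made up for illustration):

```python
def meets_min_version(version: str, minimum: tuple) -> bool:
    """Compare a dotted version string against a minimum tuple, numerically."""
    parts = tuple(int(p) for p in version.split("."))
    return parts >= minimum
```

A test can then skip when `meets_min_version(cuda_version, (12, 8))` is false instead of failing outright, mirroring the skip-instead-of-fail policy adopted in #1600.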
v0.2.14.post1
What's Changed
- bugfix: Fix Persistent kernel precision for masked output by @Edenzzzz in #1533
- ci: create docker image for cu126/cu128/cu129 by @yzh119 in #1558
- Bugfix: some typos in Persistent kernel by @Edenzzzz in #1562
- fix: separate out fp4 lib into sm90 and sm100 versions, add oob checking in fused moe by @djmmoss in #1565
- bugfix: fix persistent attention kernel correctness on blackwell by @yzh119 in #1559
- ci: add unittest for different cuda version by @yzh119 in #1560
- release: bump version to v0.2.14.post1 by @yzh119 in #1568
Full Changelog: v0.2.14...v0.2.14.post1
v0.2.14
What's Changed
- flashinfer_benchmark QoL Improvements and Attention FP8 Support by @bkryu in #1512
- add cuda version check for jit by @cyx-6 in #1526
- bugfix: Fix compile error for undefined swizzle enum. by @weireweire in #1530
- refactor: Sink attention AoT by @nandor in #1427
- test: Enable all modules in AOT build test by @yongwww in #1528
- Add GeGLU support to trtllm-gen NVFP4 Fused MoE Kernel by @stslxg-nv in #1525
- Add sm check for sm100 only cutlass/trtllm kernel by @ttyio in #1535
- bugfix: fix autotuner failure with low precision data types by @ttyio in #1539
- misc: Setting logging level from env var by @cyx-6 in #1538
- backend: Refactor trtllm-gen fmha metainfo loading by @cyx-6 in #1518
- feat: pass sm_count as param for fp4_masked_gemm by @yyihuang in #1529
- Revert "backend: Refactor trtllm-gen fmha metainfo loading (#1518)" by @yzh119 in #1543
- Fix typo in sampling.cuh: Remove duplicate parameter by @Appenhaimer in #1546
- perf: replace cudaGetDeviceProperties with cudaDeviceGetAttribute by @yongwww in #1547
- fix trtllm_allreduce_fusion twoshot register problem. by @strgrb in #1545
- feat: Integrate TRTLLM varlen kernel for deepseek R1 prefill by @elfiegg in #1537
- Add CONTRIBUTING.md by @sricketts in #1553
- release: bump version to v0.2.14 by @yongwww in #1554
- ci: add timeout for SPOT instance allocation by @yongwww in #1555
- fix: add packaging dependency to resolve pypi workflow by @yongwww in #1557
New Contributors
- @stslxg-nv made their first contribution in #1525
- @Appenhaimer made their first contribution in #1546
Full Changelog: v0.2.13...v0.2.14
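#1538 above makes the logging level configurable from an environment variable. A minimal sketch of that pattern, with a safe fallback for unset or invalid values (the variable name `FLASHINFER_LOGGING_LEVEL` is an assumption, not necessarily what the PR uses):

```python
import logging
import os

def log_level_from_env(var: str = "FLASHINFER_LOGGING_LEVEL",
                       default: str = "WARNING") -> int:
    """Map an env var like 'debug' or 'INFO' to a logging level constant."""
    name = os.environ.get(var, default).upper()
    return getattr(logging, name, logging.WARNING)

logging.getLogger("flashinfer").setLevel(log_level_from_env())
```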
v0.2.13
What's Changed
- test: add top_k_sampling_with_variable_k test by @JasonJ2021 in #1505
- benchmark: add moe to benchmark by @nv-yunzheq in #1497
- update allreduce to match trtllm by @nvjullin in #1507
- Support cuda<12.8 builds for trtllm_allreduce_fusion by @strgrb in #1508
- gpt-oss: Add MXFP8 x MXFP4 CUTLASS MOE for SM100 and BF16 x MXFP4 CUTLASS for SM90 + SwigluBias Activation by @djmmoss in #1396
- tuner: Trtllm-gen Fp4 MoE Autotunner by @IwakuraRein in #1475
- refactor fp4 masked gemm cute-dsl implementation and add manual cache by @yzh119 in #1521
- fix: add missing 'requests' when building the package with AOT by @EmilienM in #1517
- Fix cuda-python v13.0 import compatibility by @yongwww in #1455
- misc: add license of spdlog for packaging by @yzh119 in #1522
- Fix linking errors with CUDA 13 by @yongwww in #1523
- release: bump version to v0.2.13 by @yongwww in #1524
New Contributors
- @JasonJ2021 made their first contribution in #1505
- @nv-yunzheq made their first contribution in #1497
- @nvjullin made their first contribution in #1507
- @strgrb made their first contribution in #1508
- @djmmoss made their first contribution in #1396
Full Changelog: v0.2.12...v0.2.13
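#1455 above fixes cuda-python import compatibility, and #1534 (v0.3.0, above) later removes cuda-python from the install dependencies in favor of a runtime check. The deferred-import pattern behind that change, sketched minimally (error message and function name are illustrative):

```python
def require_cuda_python():
    """Import cuda-python lazily so it is a runtime requirement, not an install one."""
    try:
        import cuda  # top-level module of the cuda-python package
    except ImportError as e:
        raise RuntimeError(
            "this feature requires the optional 'cuda-python' package"
        ) from e
    return cuda
```

Callers that never touch the cuda-python-backed code paths pay no import cost and need no extra dependency.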
v0.2.12
What's Changed
- Fix TRTLLM NVFP4-out attention kernel scale factor dim issue by @elvischenv in #1460
- perf: add fast path to TopPRenormProbKernel for top_p >= 1.0, significantly boosting SGLang workloads by @TianyuZhang1214 in #1483
- fix: update cutedsl masked moe gemm by @yyihuang in #1488
- feat: Support fp8 qkv, fp16/bf16 out MHA for trtllm-gen. by @weireweire in #1490
- Add errors when dtype is anything other than int32 for ptr metadata by @pavanimajety in #1492
- refactor: unify autotuner for bmm_fp8 by @ttyio in #1479
- fix: update masked moe gemm fp4 tensor reshape by @yyihuang in #1495
- Revert "feat: Support fp8 qkv, fp16/bf16 out MHA for trtllm-gen. (#1490)" by @yzh119 in #1496
- fix(aot): unused compute in has_sm by @fecet in #1501
- fix: Replace cub Max/Min with cuda::maximum/minimum for cuda 13 compatibility by @yongwww in #1500
- doc: Update the masked grouped gemm doc by @kaixih in #1499
- Perf: support scale_a/scale_b instead of combined scale in cutlass bmm_fp8 by @ttyio in #1491
- feat: scaling at fp4 gemm epilogue by @yyihuang in #1498
- Add benchmark for cutedsl gemm by @fzyzcjy in #1502
- Do not import NVSHMEM in the AoT script unless explicitly requested by @nandor in #1506
- bugfix: Fix stream handling in cutedsl gemm by @fzyzcjy in #1509
- bump version to v0.2.12 by @yongwww in #1510
New Contributors
- @elvischenv made their first contribution in #1460
- @TianyuZhang1214 made their first contribution in #1483
- @pavanimajety made their first contribution in #1492
- @fecet made their first contribution in #1501
Full Changelog: v0.2.11.post3...v0.2.12
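#1483 above adds a fast path to TopPRenormProbKernel for top_p >= 1.0: when the whole distribution is kept, renormalization is the identity, so the kernel can return early. A plain-Python sketch of the idea (the CUDA kernel itself is organized very differently):

```python
def top_p_renorm(probs, top_p):
    """Keep the smallest set of highest-probability tokens reaching top_p mass,
    zero the rest, and renormalize."""
    # Fast path from #1483: top_p >= 1.0 keeps every token, so nothing changes.
    if top_p >= 1.0:
        return list(probs)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in keep)
    return [p / total if i in keep else 0.0 for i, p in enumerate(probs)]
```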
v0.2.11.post3
What's Changed
- Remove outdated formatting scripts by @cyx-6 in #1482
- feat: add pdl for trtllm-gen attn by @yyihuang in #1484
- fix missing enable_pdl argument in trtllm-gen fp4 moe by @IwakuraRein in #1480
- Add python API for masked grouped gemm by @kaixih in #1481
- release: bump version to v0.2.11.post3 by @yyihuang in #1486
Full Changelog: v0.2.11.post2...v0.2.11.post3
v0.2.11.post2
What's Changed
- [doc]: Update installation doc and readme by @yongwww in #1465
- Allow BatchPrefillPagedWrapper to call cudnn API by @Anerudhan in #1384
- [RFC] log filename and lineno in flashinfer jit logger by @842974287 in #1461
- Add Mxfp4 trtllm-gen moe unit tests by @IwakuraRein in #1399
- bugfix: Verify num_experts greater or equal to local_experts + offset by @amirkl94 in #1469
- [RFC] add an env to allow specify cubins directory by @842974287 in #1462
- Fix "more than one operator "/" matches these operands" by @842974287 in #1471
- Fix race condition when JitSpec loads the library by @nvpohanh in #1467
- perf: add 1x4x1 cluster shape for fp8 bmm M<16 cases by @ttyio in #1473
- feat: Enable multiple fused-moe backends by @amirkl94 in #1472
- Remove restrict extension to fix compilation error on GB200 by @842974287 in #1470
- feat: masked layout fp4 gemm using cute-dsl by @yzh119 in #1331
- fix: minor fix after #1384 by @yyihuang in #1476
- fix: remove redundant zero_init reverted by #1459 by @yyihuang in #1463
- Remove getEnvEnablePDL in favor of enable_pdl parameter by @yongwww in #1446
- Unify and modularize decode and prefill test. by @weireweire in #1375
- refactor: Improved metainfo for trtllm-gen kernels by @cyx-6 in #1328
- Tone down the amount of logging when downloading cubins by @joker-eph in #1477
- release: bump version to v0.2.11.post2 by @yyihuang in #1478
Full Changelog: v0.2.11.post1...v0.2.11.post2
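#1467 above fixes a race condition when JitSpec loads the compiled library. The standard remedy for that class of bug is double-checked locking around a shared cache; a hedged sketch (names are illustrative, not flashinfer's actual internals):

```python
import threading

_lib_cache: dict = {}
_lib_lock = threading.Lock()

def load_library(name, loader):
    """Load each library exactly once, even under concurrent callers."""
    lib = _lib_cache.get(name)
    if lib is None:
        with _lib_lock:
            # Re-check inside the lock: another thread may have won the race.
            lib = _lib_cache.get(name)
            if lib is None:
                lib = loader()
                _lib_cache[name] = lib
    return lib
```

The second cache lookup inside the lock is the essential part: without it, two threads that both miss the cache would each run the (expensive, possibly non-reentrant) loader.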
v0.2.11.post1
v0.2.11
What's Changed
- Fix flag order by @nandor in #1392
- Add flags to trim down AoT builds by @nandor in #1393
- Force upgrade cuDNN to latest by @paul841029 in #1401
- Adding FP8 benchmark on attention and matmul testing by @bkryu in #1390
- feature: enable cublas for fp4 gemm when cudnn == 9.11.1 or >= 9.13 by @ttyio in #1405
- Relax the clear_cuda_cache by @yongwww in #1406
- Update autotune results for the nvfp4 cutlass moe backends for v0.2.9 by @kaixih in #1361
- fix shared memory alignment conflict in sampling.cuh by @842974287 in #1402
- Fix trtllm moe launcher local_num_experts by @wenscarl in #1398
- [bugfix] Fix compilation failure when compiling csrc/trtllm_moe_allreduce_fusion.cu by @nvpohanh in #1410
- install: remove nvidia-cudnn-12 from package dependency by @yzh119 in #1409
- Add mypy to pre-commit by @cyx-6 in #1179
- feat(aot): add nvshmem module for aot compilation by @EmilienM in #1261
- Add ruff to pre-commit by @cyx-6 in #1201
- install: remove nvidia-nvshmem-cu12 from package dependency by @EmilienM in #1426
- Fix redundant kernels in moe by @fzyzcjy in #1428
- ci: add arm64 to release-ci-docker.yml by @yzh119 in #1429
- Fix crash when pos_encoding_mode is passed as int by @kaixih in #1413
- Fix trtllm_ar failure by @nvpohanh in #1423
- Use self hosted runner for arm image build by @yongwww in #1433
- Remove const qualifier to avoid compilation error by @842974287 in #1421
- Add multi-arch Docker image for x86-64 and arm64 by @yongwww in #1431
- Add NOTICE with copyrights by @sricketts in #1432
- Fix FusedMoeRunner does not exist error by @nvpohanh in #1424
- Putting back cudnn_batch_prefill_with_kv_cache that was deleted by ruff by @bkryu in #1438
- Decouple cutlass config version from flashinfer version by @kaixih in #1441
- feat: Fused rope fp8 quantize kernel for MLA by @yzh119 in #1339
- Add disk cleanup for Docker builds by @yongwww in #1442
- ci: Add ARM AOT test by @yongwww in #1418
- bugfix: fix perf issue by using fp8 graph that can use cublaslt by @ttyio in #1435
- Faster weight processing (moe nvfp4) by @aleozlx in #1412
- Add alignment in MxFP8Quantization by @Qiaolin-Yu in #1445
- misc: remove unused dependency by @yzh119 in #1443
- fix: remove redundant zero_init from trtllm-gen attn by @yyihuang in #1444
- benchmark: trtllm-gen mha with sink, add benchmark args by @yyihuang in #1415
- Fixes for Blackwell Tests by @paul841029 in #1434
- Fix missing v_scale for prefill wrapper. by @weireweire in #1416
- ci: add github actions to upload sdist to pypi by @yzh119 in #1270
- 3rdparty: upgrade cutlass dependency to v4.1.0 by @yzh119 in #1299
- feature: add cutlass as bmm_fp8 backend. by @ttyio in #1397
- release: bump version to v0.2.11 by @yongwww in #1447
- ci: bugfix on sdist pypi workflow by @yzh119 in #1449
New Contributors
- @paul841029 made their first contribution in #1401
- @842974287 made their first contribution in #1402
- @fzyzcjy made their first contribution in #1428
- @sricketts made their first contribution in #1432
- @Qiaolin-Yu made their first contribution in #1445
Full Changelog: v0.2.10...v0.2.11
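Among the fixes above, #1413 handles `pos_encoding_mode` being passed as a bare int instead of an enum member. Python's `IntEnum` makes that coercion a one-liner; a sketch with assumed member names and values (the real enum may differ):

```python
import enum

class PosEncodingMode(enum.IntEnum):
    NONE = 0
    ROPE_LLAMA = 1
    ALIBI = 2

def normalize_mode(mode) -> "PosEncodingMode":
    """Accept either a PosEncodingMode member or a plain int; reject anything else."""
    return PosEncodingMode(mode)  # raises ValueError for unknown ints
```

Because `IntEnum` members compare equal to their integer values, downstream code can keep using the normalized result in either style.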