
Releases: flashinfer-ai/flashinfer

v0.3.1

05 Sep 06:24
3c1e8d7

What's Changed

  • hotfix: change MAX_JOBS in aot ci by @yzh119 in #1621
  • fix: export MAX_JOBS for AOT build by @yongwww in #1626 (see the build sketch at the end of this section)
  • perf: Fix the tactic sorting in TrtllmGenBatchedGemmRunner::getValidConfigIndices by @jinyangyuan-nvidia in #1615
  • Fix cute dsl gemm API wrong arg name and silent error when passing wrong kwargs by @fzyzcjy in #1619
  • bugfix: fix merge_attention_state in BatchAttention w/ gqa-group-size in Qwen family by @happierpig in #1614
  • bugfix: fix multi-gpu/node unit-test: skip when there aren't enough GPUs in test_trtllm_mnnvl_allreduce by @bkryu in #1627
  • ci: add cuda-13 unittests to CI by @yzh119 in #1603
  • Revert "hotfix: change MAX_JOBS in aot ci (#1621)" by @yzh119 in #1629
  • patch mm segfault & patch cubin availability by @aleozlx in #1628
  • bugfix: fix flashinfer_benchmark.py IMA when running a test list by @bkryu in #1625
  • bugfix: trtllm-gen fmha sm101 and sm100 compatibility by @cyx-6 in #1631
  • bugfix: collect all modules to aot by @yzh119 in #1622
  • fix: pass workspace for trtllm-gen attention by @yyihuang in #1635
  • test: pytest.mark.xfail on deepgemm by @yongwww in #1636
  • release: bump version v0.3.1 by @yongwww in #1637

Full Changelog: v0.3.0...v0.3.1
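
Several items above concern the `MAX_JOBS` environment variable, which caps how many compile jobs the ahead-of-time (AOT) build runs in parallel. A minimal sketch of bounding build parallelism, assuming the `flashinfer.aot` module entry point and treating the value 8 as an arbitrary example:

```python
import os
import subprocess

# Hedged sketch: cap parallel compile jobs so an AOT build does not exhaust
# memory. MAX_JOBS and the flashinfer.aot entry point are assumptions drawn
# from the PR titles above, not a verified invocation.
env = dict(os.environ, MAX_JOBS="8")
subprocess.run(["python", "-m", "flashinfer.aot"], check=True, env=env)
```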

v0.3.0

01 Sep 06:21
f131f3d

What's Changed

  • Backend: downgrade trtllm-gen kernel to cuda-12 by @cyx-6 in #1567
  • feat: Add fp8-qkv, fp16/bf16 output MHA by @weireweire in #1540
  • bump cutlass submodule to v4.2 by @ttyio in #1572
  • typo: fix typo in variable names of fp4 masked gemm by @fzyzcjy in #1570
  • benchmark: Add autotuner to moe benchmark by @nv-yunzheq in #1536
  • bugfix: fix cuda version guard macros by @nvjullin in #1571
  • misc: remove some unused files by @yzh119 in #1574
  • bugfix: update trtllm-gen gemm kernel names by @cyx-6 in #1577
  • feat: Support for inferring out_dtype from out.dtype for TRTLLM attention kernel by @elvischenv in #1578
  • fix: semaphores must be at a fixed range in the workspace buffer on trtllm_gen attention by @yyihuang in #1584
  • bugfix: Fix arg passing to TORCH_CHECK and TORCH_WARN macros by @amitz-nv in #1582
  • refactor: Expose calculate_tile_tokens_dim function by @amitz-nv in #1581
  • fix unignorable narrowing conversion issue by @luccafong in #1586
  • bugfix: Fix test_fp4_quantize test bug by @sricketts in #1585
  • update trtllm-gen fp4 autotuner and routing by @IwakuraRein in #1573
  • fix: limit the number of nvcc threads for each kernel by @yzh119 in #1589
  • fix: Improve TRTLLM attention kernel out_dtype unit test by @elvischenv in #1590
  • refactor: use allocator class for workspace buffer allocation by @yyihuang in #1588
  • misc: Fix footnote and typo in CONTRIBUTING.md by @sricketts in #1583
  • Mnnvl memory with custom communicator by @wenscarl in #1245
  • Add mnnvl_moe_alltoallv_prepare_without_allgather by @trevor-m in #1550
  • bugfix: Adding version checks to tests/test_hopper*.py files by @bkryu in #1594
  • Remove cuda-python from dependency and check at runtime by @VALLIS-NERIA in #1534
  • bugfix: fix fused-temperature softmax IMA issue by @yzh119 in #1596
  • bugfix: Fix RuntimeError("FlashInfer requires sm75+") by @hijkzzz in #1598
  • bugfix: fix the register overflow issue for topk renorm kernels on blackwell by @yzh119 in #1597 (see the sketch at the end of this section)
  • bugfix: fix unittest test_fp8_quantize by @yzh119 in #1599
  • bugfix: fix multi-gpu/node unit-test: skip when there aren't enough GPUs instead of failing by @bkryu in #1600
  • feat: Enable MnnvlMemory (for alltoallv) on B200 by @trevor-m in #1601
  • ci: add ci container of cuda 13 and add cute-dsl as dependency. by @yzh119 in #1595
  • ci: Fix unittests of logits processor by @yzh119 in #1602
  • feat: integrate xqa attention backend by @qsang-nv in #1503
  • [cute dsl] optimize cute dsl make_ptr perf by @limin2021 in #1607
  • bugfix: fix fp4 quantization with 8x4 scale factor layout by @cyx-6 in #1611
  • feat: enable trtllm-gen attn speculative decoding verify by decode by @yyihuang in #1453
  • ci: limit aot parallel build jobs based on available memory by @yongwww in #1612
  • release: bump version v0.3.0 by @yzh119 in #1617

Full Changelog: v0.2.14.post1...v0.3.0
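
The register-overflow fix in #1597 touches the top-k renormalization kernels. A hedged sketch of how that path is typically exercised, assuming the `flashinfer.sampling.top_k_renorm_probs` signature `(probs, top_k)`:

```python
import torch
import flashinfer

# Hedged sketch: top_k_renorm_probs keeps the top-k probabilities per row,
# zeroes the rest, and renormalizes each row to sum to one.
probs = torch.softmax(torch.randn(4, 32000, device="cuda"), dim=-1)
renormed = flashinfer.sampling.top_k_renorm_probs(probs, top_k=50)
assert torch.allclose(renormed.sum(-1), torch.ones(4, device="cuda"))
```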

v0.2.14.post1

25 Aug 03:15
0380322

What's Changed

  • bugfix: Fix Persistent kernel precision for masked output by @Edenzzzz in #1533
  • ci: create docker image for cu126/cu128/cu129 by @yzh119 in #1558
  • Bugfix: some typos in Persistent kernel by @Edenzzzz in #1562
  • fix: separate out fp4 lib into sm90 and sm100 versions, add oob checking in fused moe by @djmmoss in #1565
  • bugfix: fix persistent attention kernel correctness on blackwell by @yzh119 in #1559
  • ci: add unittest for different cuda version by @yzh119 in #1560
  • release: bump version to v0.2.14.post1 by @yzh119 in #1568

Full Changelog: v0.2.14...v0.2.14.post1

v0.2.14

23 Aug 09:26
674f699

What's Changed

  • flashinfer_benchmark QoL Improvements and Attention FP8 Support by @bkryu in #1512
  • add cuda version check for jit by @cyx-6 in #1526
  • bugfix: Fix compile error for undefined swizzle enum. by @weireweire in #1530
  • refactor: Sink attention AoT by @nandor in #1427
  • test: Enable all modules in AOT build test by @yongwww in #1528
  • Add GeGLU support to trtllm-gen NVFP4 Fused MoE Kernel by @stslxg-nv in #1525
  • Add sm check for sm100 only cutlass/trtllm kernel by @ttyio in #1535
  • bugfix: fix autotuner failure with low precision data types by @ttyio in #1539
  • misc: Setting logging level from env var by @cyx-6 in #1538
  • backend: Refactor trtllm-gen fmha metainfo loading by @cyx-6 in #1518
  • feat: pass sm_count as param for fp4_masked_gemm by @yyihuang in #1529
  • Revert "backend: Refactor trtllm-gen fmha metainfo loading (#1518)" by @yzh119 in #1543
  • Fix typo in sampling.cuh: Remove duplicate parameter by @Appenhaimer in #1546
  • perf: replace cudaGetDeviceProperties with cudaDeviceGetAttribute by @yongwww in #1547 (see the sketch at the end of this section)
  • fix trtllm_allreduce_fusion twoshot register problem. by @strgrb in #1545
  • feat: Integrate TRTLLM varlen kernel for deepseek R1 prefill by @elfiegg in #1537
  • Add CONTRIBUTING.md by @sricketts in #1553
  • release: bump version to v0.2.14 by @yongwww in #1554
  • ci: add timeout for SPOT instance allocation by @yongwww in #1555
  • fix: add packaging dependency to resolve pypi workflow by @yongwww in #1557

Full Changelog: v0.2.13...v0.2.14
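
The perf change in #1547 swaps full `cudaGetDeviceProperties` queries for targeted `cudaDeviceGetAttribute` calls, which avoid filling the whole `cudaDeviceProp` struct. A hedged illustration of the same idea from Python via the cuda-python bindings (the binding names are assumptions, not code from the PR):

```python
from cuda import cudart  # cuda-python runtime bindings

# Hedged sketch: querying one attribute is far cheaper than populating the
# full cudaDeviceProp struct with cudaGetDeviceProperties.
err, sm_count = cudart.cudaDeviceGetAttribute(
    cudart.cudaDeviceAttr.cudaDevAttrMultiProcessorCount, 0
)
assert err == cudart.cudaError_t.cudaSuccess
print(f"SMs on device 0: {sm_count}")
```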

v0.2.13

20 Aug 23:23
0973054

What's Changed

  • test: add top_k_sampling_with_variable_k test by @JasonJ2021 in #1505 (see the sketch at the end of this section)
  • benchmark: add moe to benchmark by @nv-yunzheq in #1497
  • update allreduce to match trtllm by @nvjullin in #1507
  • Support cuda<12.8 builds for trtllm_allreduce_fusion by @strgrb in #1508
  • gpt-oss: Add MXFP8 x MXFP4 CUTLASS MOE for SM100 and BF16 x MXFP4 CUTLASS for SM90 + SwigluBias Activation by @djmmoss in #1396
  • tuner: Trtllm-gen Fp4 MoE Autotuner by @IwakuraRein in #1475
  • refactor fp4 masked gemm cute-dsl implementation and add manual cache by @yzh119 in #1521
  • fix: add missing 'requests' when building the package with AOT by @EmilienM in #1517
  • Fix cuda-python v13.0 import compatibility by @yongwww in #1455
  • misc: add license of spdlog for packaging by @yzh119 in #1522
  • Fix linking errors with CUDA 13 by @yongwww in #1523
  • release: bump version to v0.2.13 by @yongwww in #1524

Full Changelog: v0.2.12...v0.2.13
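
The new test in #1505 covers top-k sampling with a per-row (variable) k. A hedged sketch of the interface under test, assuming `flashinfer.sampling.top_k_sampling_from_probs` accepts a tensor of k values:

```python
import torch
import flashinfer

# Hedged sketch: one top-k budget per row of the probability matrix.
probs = torch.softmax(torch.randn(4, 32000, device="cuda"), dim=-1)
top_k = torch.tensor([1, 20, 50, 32000], dtype=torch.int32, device="cuda")
samples = flashinfer.sampling.top_k_sampling_from_probs(probs, top_k)
print(samples.shape)  # one sampled token id per row: (4,)
```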

v0.2.12

18 Aug 16:46
ae1480c

What's Changed

  • Fix TRTLLM NVFP4-out attention kernel scale factor dim issue by @elvischenv in #1460
  • perf: add fast path to TopPRenormProbKernel for top_p >= 1.0, significantly boosting SGLang workloads by @TianyuZhang1214 in #1483 (see the sketch at the end of this section)
  • fix: update cutedsl masked moe gemm by @yyihuang in #1488
  • feat: Support fp8 qkv, fp16/bf16 out MHA for trtllm-gen. by @weireweire in #1490
  • Add errors when dtype is anything other than int32 for ptr metadata by @pavanimajety in #1492
  • refactor: unify autotuner for bmm_fp8 by @ttyio in #1479
  • fix: update masked moe gemm fp4 tensor reshape by @yyihuang in #1495
  • Revert "feat: Support fp8 qkv, fp16/bf16 out MHA for trtllm-gen. (#1490) by @yzh119 in #1496
  • fix(aot): unused compute in has_sm by @fecet in #1501
  • fix: Replace cub Max/Min with cuda::maximum/minimum for cuda 13 compatibility by @yongwww in #1500
  • doc: Update the masked grouped gemm doc by @kaixih in #1499
  • Perf: support scale_a/scale_b instead of combined scale in cutlass bmm_fp8 by @ttyio in #1491
  • feat: scaling at fp4 gemm epilogue by @yyihuang in #1498
  • Add benchmark for cutedsl gemm by @fzyzcjy in #1502
  • Do not import NVSHMEM in the AoT script unless explicitly requested by @nandor in #1506
  • bugfix: Fix stream handling in cutedsl gemm by @fzyzcjy in #1509
  • bump version to v0.2.12 by @yongwww in #1510

Full Changelog: v0.2.11.post3...v0.2.12
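
The fast path added in #1483 observes that `top_p >= 1.0` makes top-p renormalization an identity, so the kernel can return its input unchanged. A hedged sketch, assuming the `flashinfer.sampling.top_p_renorm_probs` signature `(probs, top_p)`:

```python
import torch
import flashinfer

# Hedged sketch: with top_p=1.0 no probability mass is dropped, so the
# renormalized distribution should match the input (the fast path from #1483).
probs = torch.softmax(torch.randn(8, 32000, device="cuda"), dim=-1)
renormed = flashinfer.sampling.top_p_renorm_probs(probs, top_p=1.0)
assert torch.allclose(renormed, probs)
```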

v0.2.11.post3

14 Aug 06:44
7146ebc

What's Changed

Full Changelog: v0.2.11.post2...v0.2.11.post3

v0.2.11.post2

13 Aug 19:17
66144d2

What's Changed

Full Changelog: v0.2.11.post1...v0.2.11.post2

v0.2.11.post1

11 Aug 15:51
df306f6

What's Changed

Full Changelog: v0.2.11...v0.2.11.post1

v0.2.11

09 Aug 04:51
7c45412

What's Changed

Full Changelog: v0.2.10...v0.2.11