v0.8.0
What's Changed
🚀 Features
- Torch dp support by @grimoire in #3207
- Add deep gemm with tma pre allocated by @AllentDan in #3287
- Add mixed DP + TP by @lzhangzz in #3229
- Add Qwen3 and Qwen3MoE by @lzhangzz in #3305
- [ascend] support multi nodes on ascend device by @tangzhiyi11 in #3260
- [Feature] support qwen3 and qwen3-moe for pytorch engine by @CUHKSZzxy in #3315
- [ascend]support deepseekv2 by @yao-fengchen in #3206
- add deepep by @zhaochaoxing in #3313
- support ascend w8a8 graph_mode by @yao-fengchen in #3267
- support all2all ep by @zhaochaoxing in #3370
- optimize ep in decoding stage by @zhaochaoxing in #3383
- Warmup deepgemm by @grimoire in #3387
- support Llama4 by @grimoire in #3408
- add twomicrobatch support by @SHshenhao in #3381
- Support phi4 mini by @RunningLeon in #3467
- [Dlinfer][Ascend] support 310P by @JackWeiw in #3484
- support qwen3 fp8 by @CUHKSZzxy in #3505
💥 Improvements
- Add spaces_between_special_tokens to /v1/interactive and make compatible with empty text by @AllentDan in #3283
- add env var to control timeout by @CUHKSZzxy in #3291
- refactor attn param by @irexyc in #3164
- Verbose log by @grimoire in #3329
- optimize mla, remove `load_v` by @grimoire in #3334
- support dp decoding with cudagraph by @grimoire in #3311
- optimize quant-fp8 kernel by @grimoire in #3345
- refactor dlinfer rope by @yao-fengchen in #3326
- enable qwenvl2.5 graph mode on ascend by @jinminxi104 in #3367
- Add AIOHTTP_TIMEOUT env var for proxy server by @AllentDan in #3355
- disable sync batch on dp eager mode by @grimoire in #3382
- fix for deepgemm update by @grimoire in #3380
- Add string before hash tokens in blocktrie by @RunningLeon in #3386
- optimize moe get sorted idx by @grimoire in #3356
- use half/bf16 lm_head output by @irexyc in #3213
- remove ep eager check by @grimoire in #3392
- Optimize ascend moe by @yao-fengchen in #3364
- optimize fp8 moe kernel by @grimoire in #3419
- ray async forward execute by @grimoire in #3443
- map internvl3 chat template to builtin chat template internvl2_5 by @lvhan028 in #3450
- Refactor turbomind (low-level abstractions) by @lzhangzz in #3423
- remove barely used code to improve maintenance by @lvhan028 in #3462
- optimize sm80 long context by @grimoire in #3465
- move partial_json_parser from `serve.txt` to `runtime.txt` by @lvhan028 in #3493
- support qwen3-dense models awq quantization by @lvhan028 in #3503
- Optimize MoE gate for Qwen3 by @lzhangzz in #3500
- Pass num_tokens_per_iter and max_prefill_iters params through in `lmdeploy serve api_server` by @josephrocca in #3504
- [Dlinfer][Ascend] Optimize performance of 310P device by @JackWeiw in #3486
- optimize longcontext decoding by @grimoire in #3510
- Support min_p in openai completions_v1 by @josephrocca in #3506
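The new `min_p` option (#3506) is accepted by the OpenAI-compatible `/v1/completions` endpoint alongside the usual sampling parameters. A minimal sketch of a request body, assuming a server started with `lmdeploy serve api_server`; the model name, host, and port here are illustrative placeholders, not lmdeploy defaults:

```python
import json

# Sketch of a /v1/completions request body using the newly supported
# min_p sampling parameter (#3506). Model name is a placeholder.
payload = {
    "model": "Qwen/Qwen3-8B",
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "temperature": 0.8,
    # min_p filters out tokens whose probability falls below
    # min_p * (probability of the most likely token)
    "min_p": 0.05,
}

body = json.dumps(payload)
print(body)

# Sending it requires a running api_server, e.g. (not executed here):
#   requests.post("http://localhost:23333/v1/completions",
#                 headers={"Content-Type": "application/json"}, data=body)
```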
🐞 Bug fixes
- fix activation grid oversize by @grimoire in #3282
- Set ensure_ascii=False for tool calling by @AllentDan in #3295
- fix sliding window multi chat by @grimoire in #3302
- add `v` check by @grimoire in #3307
- Fix Qwen3MoE config parsing by @lzhangzz in #3336
- Fix finish reasons by @AllentDan in #3338
- remove think_end_token_id in streaming content by @AllentDan in #3327
- Fix the finish_reason by @AllentDan in #3350
- set cmake policy minimum version as 3.5 by @lvhan028 in #3376
- fix dp cudagraph by @grimoire in #3372
- fix flashmla eagermode by @grimoire in #3375
- close engine after each benchmark-generation iter by @grimoire in #3269
- [Fix] fix `image_token_id` error of qwen2-vl and deepseek by @ao-zz in #3358
- fix stopping criteria by @grimoire in #3384
- support List[dict] prompt input without do_preprocess by @irexyc in #3385
- add rayexecutor release timeout by @grimoire in #3403
- fix tensor dispatch in dynamo by @wanfengcxz in #3417
- fix linting error by upgrade to ubuntu-latest by @lvhan028 in #3442
- fix awq tp for pytorch engine by @RunningLeon in #3435
- fix mllm testcase fail by @caikun-pjlab in #3458
- remove paged attention autotune by @grimoire in #3452
- Remove empty prompts in benchmark scripts by @lvhan028 in #3460
- failed to end session properly by @lvhan028 in #3471
- fix qwen2.5-vl chat template by @CUHKSZzxy in #3475
- Align forward arguments of deepgemm blockedf8 by @RunningLeon in #3474
- fix turbomind lib missing to link nccl by exporting nccl path by @lvhan028 in #3479
- fix dsvl2 no attr config error by @CUHKSZzxy in #3477
- fix flash attention crash on triton3.1.0 by @grimoire in #3478
- Fix disorder of ray execution by @RunningLeon in #3481
- update dockerfile by @CUHKSZzxy in #3482
- fix output logprobs by @irexyc in #3488
- Fix Qwen2MoE shared expert gate by @lzhangzz in #3491
- fix replicate kv for qwen3-moe by @grimoire in #3499
- fix sampling if data overflow after temperature penalty by @irexyc in #3508
📚 Documentations
- update qwen2.5-vl-32b docs by @CUHKSZzxy in #3446
🌐 Other
- bump version to v0.7.2.post1 by @lvhan028 in #3298
- [ci] add think function testcase by @zhulinJulia24 in #3299
- merge dev into main by @lvhan028 in #3348
- [ci] add vl models into pipeline interface testcase by @zhulinJulia24 in #3374
- merge dev to main branch by @lvhan028 in #3378
- opt experts memory and permute by @zhaochaoxing in #3390
- Revert "opt experts memory and permute" by @zhaochaoxing in #3406
- merge dev to main by @lvhan028 in #3400
- add Hopper GPU dockerfile by @CUHKSZzxy in #3415
- optimize internvit by @caikun-pjlab in #3433
- fix stop/bad words by @irexyc in #3492
- [ci] testcase bugfix and add more models into testcase by @zhulinJulia24 in #3463
- bump version to v0.8.0 by @lvhan028 in #3432
New Contributors
- @zhaochaoxing made their first contribution in #3313
- @ao-zz made their first contribution in #3358
- @wanfengcxz made their first contribution in #3417
- @SHshenhao made their first contribution in #3381
- @josephrocca made their first contribution in #3504
Full Changelog: v0.7.2...v0.8.0