v0.8.0
What's Changed
🚀 Features
- Torch dp support by @grimoire in #3207
- Add deep gemm with tma pre allocated by @AllentDan in #3287
- Add mixed DP + TP by @lzhangzz in #3229
- Add Qwen3 and Qwen3MoE by @lzhangzz in #3305
- [ascend] support multi nodes on ascend device by @tangzhiyi11 in #3260
- [Feature] support qwen3 and qwen3-moe for pytorch engine by @CUHKSZzxy in #3315
- [ascend]support deepseekv2 by @yao-fengchen in #3206
- add deepep by @zhaochaoxing in #3313
- support ascend w8a8 graph_mode by @yao-fengchen in #3267
- support all2all ep by @zhaochaoxing in #3370
- optimize ep in decoding stage by @zhaochaoxing in #3383
- Warmup deepgemm by @grimoire in #3387
- support Llama4 by @grimoire in #3408
- add twomicrobatch support by @SHshenhao in #3381
- Support phi4 mini by @RunningLeon in #3467
- [Dlinfer][Ascend] support 310P by @JackWeiw in #3484
- support qwen3 fp8 by @CUHKSZzxy in #3505
💥 Improvements
- Add spaces_between_special_tokens to /v1/interactive and make compatible with empty text by @AllentDan in #3283
- add env var to control timeout by @CUHKSZzxy in #3291
- refactor attn param by @irexyc in #3164
- Verbose log by @grimoire in #3329
- optimize mla, remove `load_v` by @grimoire in #3334
- support dp decoding with cudagraph by @grimoire in #3311
- optimize quant-fp8 kernel by @grimoire in #3345
- refactor dlinfer rope by @yao-fengchen in #3326
- enable qwenvl2.5 graph mode on ascend by @jinminxi104 in #3367
- Add AIOHTTP_TIMEOUT env var for proxy server by @AllentDan in #3355
- disable sync batch on dp eager mode by @grimoire in #3382
- fix for deepgemm update by @grimoire in #3380
- Add string before hash tokens in blocktrie by @RunningLeon in #3386
- optimize moe get sorted idx by @grimoire in #3356
- use half/bf16 lm_head output by @irexyc in #3213
- remove ep eager check by @grimoire in #3392
- Optimize ascend moe by @yao-fengchen in #3364
- optimize fp8 moe kernel by @grimoire in #3419
- ray async forward execute by @grimoire in #3443
- map internvl3 chat template to builtin chat template internvl2_5 by @lvhan028 in #3450
- Refactor turbomind (low-level abstractions) by @lzhangzz in #3423
- remove barely used code to improve maintenance by @lvhan028 in #3462
- optimize sm80 long context by @grimoire in #3465
- move partial_json_parser from `serve.txt` to `runtime.txt` by @lvhan028 in #3493
- support qwen3-dense models awq quantization by @lvhan028 in #3503
- Optimize MoE gate for Qwen3 by @lzhangzz in #3500
- Pass num_tokens_per_iter and max_prefill_iters params through in `lmdeploy serve api_server` by @josephrocca in #3504
- [Dlinfer][Ascend] Optimize performance of 310P device by @JackWeiw in #3486
- optimize longcontext decoding by @grimoire in #3510
- Support min_p in openai completions_v1 by @josephrocca in #3506
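The new `min_p` option (#3506) is accepted by the OpenAI-compatible `/v1/completions` endpoint alongside the usual sampling parameters. A minimal sketch of a request body, assuming a server started with `lmdeploy serve api_server`; the model name, host, and port here are illustrative placeholders, not lmdeploy defaults:

```python
import json

# Sketch of a /v1/completions request body using the newly supported
# min_p sampling parameter (#3506). Model name is a placeholder.
payload = {
    "model": "Qwen/Qwen3-8B",
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "temperature": 0.8,
    # min_p filters out tokens whose probability falls below
    # min_p * (probability of the most likely token)
    "min_p": 0.05,
}

body = json.dumps(payload)
print(body)

# Sending it requires a running api_server, e.g. (not executed here):
#   requests.post("http://localhost:23333/v1/completions",
#                 headers={"Content-Type": "application/json"}, data=body)
```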
🐞 Bug fixes
- fix activation grid oversize by @grimoire in #3282
- Set ensure_ascii=False for tool calling by @AllentDan in #3295
- fix sliding window multi chat by @grimoire in #3302
- add `v` check by @grimoire in #3307
- Fix Qwen3MoE config parsing by @lzhangzz in #3336
- Fix finish reasons by @AllentDan in #3338
- remove think_end_token_id in streaming content by @AllentDan in #3327
- Fix the finish_reason by @AllentDan in #3350
- set cmake policy minimum version as 3.5 by @lvhan028 in #3376
- fix dp cudagraph by @grimoire in #3372
- fix flashmla eagermode by @grimoire in #3375
- close engine after each benchmark-generation iter by @grimoire in #3269
- [Fix] fix `image_token_id` error of qwen2-vl and deepseek by @ao-zz in #3358
- fix stopping criteria by @grimoire in #3384
- support List[dict] prompt input without do_preprocess by @irexyc in #3385
- add rayexecutor release timeout by @grimoire in #3403
- fix tensor dispatch in dynamo by @wanfengcxz in #3417
- fix linting error by upgrade to ubuntu-latest by @lvhan028 in #3442
- fix awq tp for pytorch engine by @RunningLeon in #3435
- fix mllm testcase fail by @caikun-pjlab in #3458
- remove paged attention autotune by @grimoire in #3452
- Remove empty prompts in benchmark scripts by @lvhan028 in #3460
- failed to end session properly by @lvhan028 in #3471
- fix qwen2.5-vl chat template by @CUHKSZzxy in #3475
- Align forward arguments of deepgemm blockedf8 by @RunningLeon in #3474
- fix turbomind lib missing to link nccl by exporting nccl path by @lvhan028 in #3479
- fix dsvl2 no attr config error by @CUHKSZzxy in #3477
- fix flash attention crash on triton3.1.0 by @grimoire in #3478
- Fix disorder of ray execution by @RunningLeon in #3481
- update dockerfile by @CUHKSZzxy in #3482
- fix output logprobs by @irexyc in #3488
- Fix Qwen2MoE shared expert gate by @lzhangzz in #3491
- fix replicate kv for qwen3-moe by @grimoire in #3499
- fix sampling if data overflow after temperature penalty by @irexyc in #3508
📚 Documentations
- update qwen2.5-vl-32b docs by @CUHKSZzxy in #3446
🌐 Other
- bump version to v0.7.2.post1 by @lvhan028 in #3298
- [ci] add think function testcase by @zhulinJulia24 in #3299
- merge dev into main by @lvhan028 in #3348
- [ci] add vl models into pipeline interface testcase by @zhulinJulia24 in #3374
- merge dev to main branch by @lvhan028 in #3378
- opt experts memory and permute by @zhaochaoxing in #3390
- Revert "opt experts memory and permute" by @zhaochaoxing in #3406
- merge dev to main by @lvhan028 in #3400
- add Hopper GPU dockerfile by @CUHKSZzxy in #3415
- optimize internvit by @caikun-pjlab in #3433
- fix stop/bad words by @irexyc in #3492
- [ci] testcase bugfix and add more models into testcase by @zhulinJulia24 in #3463
- bump version to v0.8.0 by @lvhan028 in #3432
New Contributors
- @zhaochaoxing made their first contribution in #3313
- @ao-zz made their first contribution in #3358
- @wanfengcxz made their first contribution in #3417
- @SHshenhao made their first contribution in #3381
- @josephrocca made their first contribution in #3504
Full Changelog: v0.7.2...v0.8.0