v0.9.1
What's Changed
🚀 Features
- feature: enable tool_call and reasoning_content parsing for qwen3 by @ywx217 in #3615
- Support Mooncake migration backend for PD disaggregation by @Risc-lt in #3620
- Support load fused moe weights by @RunningLeon in #3672
- Separate api_server and pytorch engine into different processors by @grimoire in #3627
- add reward model api by @CUHKSZzxy in #3665
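The reward model API added above scores a conversation rather than generating text. As a minimal sketch, the payload might look like the following; the model name and field layout (OpenAI-style `messages`) are assumptions for illustration, not the documented interface — see the reward model documents added in #3706 for the actual usage.

```python
import json

# Hypothetical reward-scoring request body. The model name below and the
# OpenAI-style "messages" layout are assumptions, not the documented schema.
payload = {
    "model": "internlm2-reward",  # assumed model name
    "messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 equals 4."},
    ],
}

# Serialize as it would be sent in an HTTP POST to the api_server.
body = json.dumps(payload)
print(body)
```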
💥 Improvements
- [ascend] import patch at initializing time by @JackWeiw in #3662
- [ascend] use custom transdata in python kernel by @JackWeiw in #3671
- move import transformers in patch by @grimoire in #3660
- set ray envs by @grimoire in #3643
- raise ImportError when ep is enabled and dlblas is not installed by @zhaochaoxing in #3636
- Reduce sampling memory usage by @lzhangzz in #3666
🐞 Bug fixes
- fix dockerfile by @lvhan028 in #3657
- Fix top-p only sampling with padded vocab size by @lzhangzz in #3661
- fix pt engine stop & cancel by @irexyc in #3681
- Fix convert bf16 to numpy by @RunningLeon in #3686
- disable torch.compile in cuda graph runner by @grimoire in #3691
- fix reward model api by @CUHKSZzxy in #3703
📚 Documentations
- add reward model documents by @CUHKSZzxy in #3706
🌐 Other
- upgrade torch and triton by @grimoire in #3677
- support do_preprocess=False for chat.completions by @irexyc in #3645
- [ci] change flash_attn installation in pr test by @zhulinJulia24 in #3688
- fix profile_throughput.py by @irexyc in #3692
- fix profile_generation.py by @irexyc in #3707
- update dlblas version in dockerfile by @CUHKSZzxy in #3711
- bump version to v0.9.1 by @lvhan028 in #3685
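The `do_preprocess=False` option from #3645 lets a chat.completions request bypass server-side chat-template preprocessing, so the prompt is passed through as-is. A minimal sketch of such a request body follows; the model name is an assumption, and whether `do_preprocess` sits at the top level of the JSON body is inferred from the PR title rather than documented here.

```python
import json

# Sketch of a /v1/chat/completions request body with server-side
# chat-template preprocessing disabled. "do_preprocess" is an extra,
# non-OpenAI field; its exact placement here is an assumption.
payload = {
    "model": "internlm2_5-7b-chat",  # assumed model name
    "messages": [{"role": "user", "content": "Hello"}],
    "do_preprocess": False,  # send the prompt through without templating
}

# This JSON string would be POSTed to the api_server's chat.completions route.
body = json.dumps(payload)
print(body)
```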
Full Changelog: v0.9.0...v0.9.1