
v0.7.0


@JimHsiung released this 20 Nov 13:03
· 57 commits to main since this release

Highlights

Model Support

  • Support GLM-4.5.
  • Support Qwen3-Embedding.
  • Support Qwen3-VL.
  • Support FluxFill.

Features

  • Support the MLU backend (currently limited to Qwen3-series models).
  • Support dynamic PD disaggregation, with on-the-fly switching between the prefill (P) and decode (D) phases based on strategy.
  • Support multi-stream parallel overlap optimization.
  • Support beam search in generative models.
  • Support contiguous KV cache backed by virtual memory.
  • Support ACL graph executor.
  • Support unified online-offline co-location scheduling in disaggregated PD scenarios.
  • Support PrefillOnly Scheduler.
  • Support v1/rerank model service interface.
  • Support communication between devices via shared memory instead of RPC on a single machine.
  • Support function calling.
  • Support reasoning output in chat interface.
  • Support top-k+add fusion in the router component of MoE models.
  • Support offline inference for LLM, VLM, and Embedding models.
  • Miscellaneous runtime performance optimizations.
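Among the new serving features, the v1/rerank interface follows the request shape common to rerank APIs: a query plus candidate documents in, scored indices out. A minimal sketch of that flow is below; the endpoint path comes from the release notes, but the field names (`model`, `query`, `documents`, `results`, `relevance_score`) are assumptions based on typical rerank API conventions, and the response here is mocked rather than fetched from a server.

```python
import json

# Hypothetical request body for the v1/rerank endpoint. The path is from the
# release notes; the field names are assumptions, not confirmed API details.
payload = {
    "model": "Qwen3-Embedding",  # model name is illustrative
    "query": "What is disaggregated prefill/decode?",
    "documents": [
        "Disaggregated PD splits prefill and decode across instances.",
        "Beam search keeps the top-k partial hypotheses at each step.",
    ],
}
body = json.dumps(payload)  # wire format that would be POSTed to /v1/rerank

# A typical rerank response pairs each document index with a relevance score.
# Mocked here for illustration only.
mock_response = {"results": [
    {"index": 1, "relevance_score": 0.12},
    {"index": 0, "relevance_score": 0.93},
]}

# Sorting by score recovers the documents in order of relevance.
ranked = sorted(mock_response["results"],
                key=lambda r: r["relevance_score"], reverse=True)
top_doc = payload["documents"][ranked[0]["index"]]
print(top_doc)
```

In practice the client would POST `body` to the server and read the real `results` list; the sorting step is the same either way.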

Bugfixes

  • Skip cancelled requests when processing stream output.
  • Resolve a segmentation fault during Qwen3 quantized inference.
  • Fix the monitoring metrics format to align with Prometheus conventions.
  • Clear outdated tensors to save memory when loading model weights.
  • Fix attention mask to support long sequence requests.
  • Fix bugs caused by enabling scheduler overlap.
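For context on the attention-mask fix above: a causal mask restricts each query position to key positions at or before it, and in additive form it is added to the attention scores before softmax. The sketch below is a framework-free illustration of that structure only; it is not xLLM's implementation, and the actual long-sequence bug may have involved different details.

```python
NEG_INF = float("-inf")

def causal_mask(seq_len: int):
    """Boolean causal mask: True where query position q may attend key kv."""
    return [[kv <= q for kv in range(seq_len)] for q in range(seq_len)]

def additive_causal_mask(seq_len: int):
    """Additive causal mask: 0.0 where attention is allowed, -inf where
    masked. Added to attention scores before softmax; using -inf (rather
    than a large finite constant) keeps masked scores at exactly zero
    probability after softmax."""
    return [[0.0 if kv <= q else NEG_INF for kv in range(seq_len)]
            for q in range(seq_len)]

bool_mask = causal_mask(3)       # row q: positions 0..q are True
add_mask = additive_causal_mask(3)
```

A long-sequence request simply means `seq_len` grows large, so mask construction and indexing must stay correct at those lengths.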