Releases
v0.7.0
Highlights
Model Support
Support GLM-4.5.
Support Qwen3-Embedding.
Support Qwen3-VL.
Support FluxFill.
Features
Support the MLU backend, which currently covers the Qwen3 series models.
Support dynamic disaggregated PD, with dynamic switching between the prefill (P) and decode (D) phases based on the scheduling strategy.
Support multi-stream parallel overlap optimization.
Support beam-search capability in generative models.
Support a contiguous KV cache backed by virtual memory.
Support ACL graph executor.
Support unified online-offline co-location scheduling in disaggregated PD scenarios.
Support PrefillOnly Scheduler.
Support v1/rerank model service interface.
Support communication between devices via shared memory instead of RPC on a single machine.
Support function calling.
Support reasoning output in chat interface.
Support top-k+add fusion in the router component of MoE models.
Support offline inference for LLM, VLM, and Embedding models.
Optimize certain aspects of runtime performance.
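For readers unfamiliar with the beam-search item above, the following is a minimal, generic sketch of the technique in Python. It is not this project's implementation: `beam_search`, `toy_model`, and all parameter names are hypothetical, and the toy scoring function stands in for a real model. The sketch keeps the `beam_width` highest-scoring partial sequences at each step, ranks them by summed log-probability, and applies no length normalization.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[int, ...]


def beam_search(
    next_log_probs: Callable[[Sequence], Dict[int, float]],
    beam_width: int,
    max_len: int,
    eos_id: int,
) -> List[Tuple[Sequence, float]]:
    """Return up to `beam_width` sequences ranked by total log-probability."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]
    finished: List[Tuple[Sequence, float]] = []
    for _ in range(max_len):
        # Expand every live beam by every possible next token.
        candidates = [
            (seq + (tok,), score + lp)
            for seq, score in beams
            for tok, lp in next_log_probs(seq).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        # Keep only the best `beam_width` candidates; set aside finished ones.
        for seq, score in candidates[:beam_width]:
            if seq[-1] == eos_id:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    # Beams that hit max_len without emitting EOS still compete on score.
    finished.extend(beams)
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_width]


def toy_model(prefix: Sequence) -> Dict[int, float]:
    """Hypothetical 3-token model (token 0 is EOS), used only for illustration."""
    if not prefix:
        return {1: math.log(0.6), 2: math.log(0.4)}
    if prefix[-1] == 1:
        return {0: math.log(0.3), 2: math.log(0.7)}
    return {0: math.log(0.9), 1: math.log(0.1)}


results = beam_search(toy_model, beam_width=2, max_len=3, eos_id=0)
for seq, score in results:
    print(seq, round(math.exp(score), 3))
```

With this toy model the two surviving hypotheses are `(1, 2, 0)` and `(2, 0)`. Production implementations typically batch the expansion step on the accelerator and add a length penalty.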
Bug Fixes
Skip cancelled requests when processing stream output.
Resolve a segmentation fault during Qwen3 quantized inference.
Fix the monitoring metrics format so that it aligns with what Prometheus expects.
Clear outdated tensors to save memory when loading model weights.
Fix attention mask to support long sequence requests.
Fix bugs caused by enabling scheduler overlap.