
v0.7.0


@JimHsiung released this 20 Nov 13:03
· 57 commits to main since this release

Highlights

Model Support

  • Support GLM-4.5.
  • Support Qwen3-Embedding.
  • Support Qwen3-VL.
  • Support FluxFill.

Features

  • Support the MLU backend (currently limited to Qwen3-series models).
  • Support dynamic PD disaggregation, with on-the-fly switching between the prefill (P) and decode (D) phases based on strategy.
  • Support multi-stream parallel overlap optimization.
  • Support beam search in generative models.
  • Support contiguous KV cache backed by virtual memory.
  • Support ACL graph executor.
  • Support unified online-offline co-location scheduling in disaggregated PD scenarios.
  • Support PrefillOnly Scheduler.
  • Support v1/rerank model service interface.
  • Support communication between devices via shared memory instead of RPC on a single machine.
  • Support function calling.
  • Support reasoning output in chat interface.
  • Support top-k+add fusion in the router component of MoE models.
  • Support offline inference for LLM, VLM, and Embedding models.
  • Miscellaneous runtime performance optimizations.
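Among the new serving features, the v1/rerank interface follows the request shape common to rerank APIs: a query plus candidate documents in, scored indices out. A minimal sketch of that flow is below; the endpoint path comes from the release notes, but the field names (`model`, `query`, `documents`, `results`, `relevance_score`) are assumptions based on typical rerank API conventions, and the response here is mocked rather than fetched from a server.

```python
import json

# Hypothetical request body for the v1/rerank endpoint. The path is from the
# release notes; the field names are assumptions, not confirmed API details.
payload = {
    "model": "Qwen3-Embedding",  # model name is illustrative
    "query": "What is disaggregated prefill/decode?",
    "documents": [
        "Disaggregated PD splits prefill and decode across instances.",
        "Beam search keeps the top-k partial hypotheses at each step.",
    ],
}
body = json.dumps(payload)  # wire format that would be POSTed to /v1/rerank

# A typical rerank response pairs each document index with a relevance score.
# Mocked here for illustration only.
mock_response = {"results": [
    {"index": 1, "relevance_score": 0.12},
    {"index": 0, "relevance_score": 0.93},
]}

# Sorting by score recovers the documents in order of relevance.
ranked = sorted(mock_response["results"],
                key=lambda r: r["relevance_score"], reverse=True)
top_doc = payload["documents"][ranked[0]["index"]]
print(top_doc)
```

In practice the client would POST `body` to the server and read the real `results` list; the sorting step is the same either way.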

Bugfixes

  • Skip cancelled requests when processing stream output.
  • Resolve a segmentation fault during Qwen3 quantized inference.
  • Fix the monitoring metrics format to align with Prometheus conventions.
  • Clear outdated tensors to save memory when loading model weights.
  • Fix attention mask to support long sequence requests.
  • Fix bugs caused by enabling scheduler overlap.
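For context on the attention-mask fix above: a causal mask restricts each query position to key positions at or before it, and in additive form it is added to the attention scores before softmax. The sketch below is a framework-free illustration of that structure only; it is not xLLM's implementation, and the actual long-sequence bug may have involved different details.

```python
NEG_INF = float("-inf")

def causal_mask(seq_len: int):
    """Boolean causal mask: True where query position q may attend key kv."""
    return [[kv <= q for kv in range(seq_len)] for q in range(seq_len)]

def additive_causal_mask(seq_len: int):
    """Additive causal mask: 0.0 where attention is allowed, -inf where
    masked. Added to attention scores before softmax; using -inf (rather
    than a large finite constant) keeps masked scores at exactly zero
    probability after softmax."""
    return [[0.0 if kv <= q else NEG_INF for kv in range(seq_len)]
            for q in range(seq_len)]

bool_mask = causal_mask(3)       # row q: positions 0..q are True
add_mask = additive_causal_mask(3)
```

A long-sequence request simply means `seq_len` grows large, so mask construction and indexing must stay correct at those lengths.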