Releases: jd-opensource/xllm

v0.7.1

20 Nov 14:01

Highlights

Model Support

  • Support GLM-4.5-Air.
  • Support Qwen3-VL-Moe.

Feature

  • Support scheduler overlap when chunked prefill and MTP are enabled.
  • Enable multi-process mode when running VLM models.
  • Support AclGraph for GLM-4.5.

Bugfix

  • Resolve core dump of Qwen embedding 0.6B.
  • Resolve duplicate content in multi-turn tool call conversations.
  • Support sampler parameters for MTP.
  • Enable MTP and scheduler overlap to work simultaneously.
  • Resolve google.protobuf.Struct parsing failures which broke tool_call and think toggle functionality.
  • Fix the precision issue in the Qwen2 model caused by model_type not being assigned.
  • Fix core dump of GLM-4.5 when MTP is enabled.
  • Temporarily use heap allocation for the VLM backend.
  • Resolve core dump of streaming chat completion requests for VLM.

v0.7.0

20 Nov 13:03

Highlights

Model Support

  • Support GLM-4.5.
  • Support Qwen3-Embedding.
  • Support Qwen3-VL.
  • Support FluxFill.

Feature

  • Support the MLU backend, currently covering Qwen3-series models.
  • Support dynamic disaggregated PD, with strategy-driven switching between the P and D phases.
  • Support multi-stream parallel overlap optimization.
  • Support beam-search capability in generative models.
  • Support virtual-memory-based contiguous KV cache.
  • Support ACL graph executor.
  • Support unified online-offline co-location scheduling in disaggregated PD scenarios.
  • Support PrefillOnly Scheduler.
  • Support v1/rerank model service interface.
  • Support communication between devices via shared memory instead of RPC on a single machine.
  • Support function calling.
  • Support reasoning output in chat interface.
  • Support top-k+add fusion in the router component of MoE models.
  • Support offline inference for LLM, VLM, and Embedding models.
  • Optimize various aspects of runtime performance.
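As a concrete illustration of the new v1/rerank model service interface, the sketch below assembles a request body. The field names (`query`, `documents`, `top_n`) are assumptions based on common rerank APIs; xLLM's exact schema is not shown in these notes, so treat this as a hypothetical shape rather than the documented interface.

```python
import json

def build_rerank_request(query, documents, top_n=None):
    """Assemble a JSON body for a POST to a /v1/rerank-style endpoint.

    Field names here are assumptions modeled on common rerank APIs,
    not taken from the xLLM documentation.
    """
    body = {"query": query, "documents": documents}
    if top_n is not None:
        # Limit how many of the reranked documents are returned.
        body["top_n"] = top_n
    return json.dumps(body)

payload = build_rerank_request(
    "what is disaggregated prefill/decode?",
    [
        "Disaggregated PD separates prefill and decode onto different workers.",
        "Beam search keeps the k best hypotheses at each step.",
    ],
    top_n=1,
)
```

The resulting JSON string would be sent as the body of an HTTP POST to the rerank endpoint; the server would respond with relevance-ranked document indices and scores.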

Bugfix

  • Skip cancelled requests when processing stream output.
  • Resolve segmentation fault during qwen3 quantized inference.
  • Fix the alignment of monitoring metrics format for Prometheus.
  • Clear outdated tensors to save memory when loading model weights.
  • Fix attention mask to support long sequence requests.
  • Fix bugs caused by enabling scheduler overlap.

v0.6.1

31 Oct 02:41
a0ca5b4

Highlights

Bugfix

  • Skip cancelled requests when processing stream output.
  • Resolve segmentation fault during qwen3 quantized inference.
  • Fix the alignment of monitoring metrics format for Prometheus.
  • Clear outdated tensors to save memory when loading model weights.

Release Images

x86 image

quay.io/jd_xllm/xllm-ai:xllm-0.6.1-release-hb-rc2-x86

ARM a2 device image

quay.io/jd_xllm/xllm-ai:xllm-0.6.1-release-hb-rc2-arm

ARM a3 device image

quay.io/jd_xllm/xllm-ai:xllm-0.6.1-release-hc-rc2-arm

v0.6.0

15 Sep 14:31

Highlights

Model Support

  • Support DeepSeek-V3/R1.
  • Support DeepSeek-R1-Distill-Qwen.
  • Support Kimi-k2.
  • Support Llama2/3.
  • Support Qwen2/2.5/QwQ.
  • Support Qwen3/Qwen3-MoE.
  • Support MiniCPM-V.
  • Support MiMo-VL.
  • Support Qwen2.5-VL.

Feature

  • Support KV cache store.
  • Support Expert Parallelism Load Balance.
  • Support multi-priority online/offline scheduler.
  • Support latency-aware scheduler.
  • Support early stopping during serving.
  • Optimize ppmatmul kernel.
  • Support image url input for VLM.
  • Support disaggregated prefill and decoding.
  • Support large-scale EP parallelism.
  • Support Hash-based PrefixCache matching.
  • Support Multi-Token Prediction for DeepSeek.
  • Support asynchronous scheduling, allowing the scheduling and computational pipeline to execute in parallel.
  • Support EP, DP, TP model parallel.
  • Support multi-process and multi-node deployment.
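For the image URL input added for VLMs, the sketch below builds a multimodal chat message in the content-parts shape used by OpenAI-compatible chat APIs. xLLM's exact request schema is not shown in these notes, so the structure is an assumption based on that widely used convention.

```python
def build_vlm_message(text, image_url):
    """Build a user chat message mixing text and an image URL.

    Uses the OpenAI-compatible content-parts convention; the exact
    schema xLLM accepts is an assumption, not confirmed by these notes.
    """
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            # The image is referenced by URL; the server fetches and
            # preprocesses it before running the vision encoder.
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

message = build_vlm_message(
    "Describe this image.",
    "https://example.com/cat.png",
)
```

A list of such messages would then be posted to the chat completions endpoint as the `messages` field of the request body.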

Docs

  • Add getting started docs.
  • Add features docs.

Release Images

x86 image

quay.io/jd_xllm/xllm-ai:xllm-0.6.0-release-hb-rc2-py3.11-oe24.03-lts-x86

ARM a2 device image

quay.io/jd_xllm/xllm-ai:xllm-0.6.0-release-hb-rc2-py3.11-oe24.03-lts-arm

ARM a3 device image

quay.io/jd_xllm/xllm-ai:xllm-0.6.0-release-hc-rc2-py3.11-oe24.03-lts-arm