
llm-scaler-vllm PV release 1.0


@liu-shaojun released this on 09 Aug 03:14 · commit 84f3771

Highlights


What’s new

  • vLLM:

    • Performance optimization of TPOT (time per output token) for long input lengths (>4K): up to 1.8x speedup at 40K sequence length on the 32B KPI model, and 4.2x at 40K sequence length on the 70B KPI model.
    • Performance optimizations delivering ~10% higher output throughput for 8B–32B KPI models compared to the last drop.
    • New feature: By-layer online quantization to reduce the required GPU memory
    • New feature: PP (pipeline parallelism) support in vLLM (experimental)
    • New feature: torch.compile (experimental)
    • New feature: speculative decoding (experimental)
    • Support for embedding and rerank models
    • Enhanced multi-modal model support
    • Performance improvements
    • Automatic maximum-length detection
    • Data parallelism support
    • Bug fixes
  • OneCCL:

    • OneCCL benchmark tool enablement
  • XPU Manager:

    • GPU Power
    • GPU Firmware update
    • GPU Diagnostic
    • GPU Memory Bandwidth
  • BKC:

    • Implemented an offline installer to ensure a consistent environment and to eliminate slow downloads from the global Ubuntu PPA repository
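
The by-layer online quantization feature listed above can be illustrated with a minimal sketch of the general technique. This is not llm-scaler-vllm's actual implementation; the function names and the symmetric int8 scheme shown here are assumptions chosen for illustration. The idea is that quantizing each layer's weights on the fly to int8 (plus one floating-point scale per layer) cuts the memory needed to hold fp32 weights by roughly 4x.

```python
# Illustrative sketch of per-layer ("by-layer") online int8 quantization.
# Hypothetical helpers, not the library's real API.

def quantize_layer(weights):
    """Symmetric int8 quantization: store int8 values plus one fp scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    q = [round(w / scale) for w in weights]  # values land in [-127, 127]
    return q, scale

def dequantize_layer(q, scale):
    """Recover approximate fp weights when the layer is actually used."""
    return [v * scale for v in q]

layer = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_layer(layer)
restored = dequantize_layer(q, scale)

# Every quantized value fits in int8, and each restored weight is
# within one quantization step of the original.
assert all(-127 <= v <= 127 for v in q)
assert all(abs(a - b) <= scale for a, b in zip(layer, restored))
```

In a real serving stack the quantization happens at model-load or layer-execution time rather than ahead of time, which is what makes it "online": no pre-quantized checkpoint is needed.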