Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:1.0
- Docker Image: intel/llm-scaler-platform:1.0
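The images above can be pulled and started with standard Docker commands; a minimal sketch (the device mapping and shared-memory size are assumptions and may need adjusting for your deployment):

```shell
# Pull the release image (tag as listed above)
docker pull intel/llm-scaler-vllm:1.0

# Start the vLLM image with Intel GPU access.
# --device /dev/dri exposes the GPU render nodes; --shm-size is an
# assumed value for vLLM's shared-memory use -- tune per deployment.
docker run -it --rm \
  --device /dev/dri \
  --shm-size 8g \
  intel/llm-scaler-vllm:1.0 bash
```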
What’s new
vLLM:
- Performance optimization of TPOT (time per output token) for long input lengths (>4K): up to 1.8x speedup at 40K sequence length on the 32B KPI model, and 4.2x at 40K sequence length on the 70B KPI model.
- Performance optimizations delivering ~10% higher output throughput for 8B-32B KPI models compared to the previous drop.
- New feature: per-layer online quantization to reduce required GPU memory
- New feature: PP (pipeline parallelism) support in vLLM (experimental)
- New feature: torch.compile (experimental)
- New feature: speculative decoding (experimental)
- Support for embedding and rerank models
- Enhanced multi-modal model support
- Performance improvements
- Automatic detection of maximum model length
- Data parallelism support
- Bug fixes
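The experimental parallelism features above correspond to standard vLLM launch flags; a hedged sketch (flag names follow upstream vLLM and `<model-path>` is a placeholder; exact spellings and availability in this drop may differ):

```shell
# Serve a model with pipeline parallelism (experimental in this drop).
# --tensor-parallel-size / --pipeline-parallel-size are upstream vLLM
# engine flags; <model-path> is a placeholder.
vllm serve <model-path> \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2

# Data parallelism (upstream flag name; availability varies by version):
vllm serve <model-path> --data-parallel-size 2
```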
OneCCL:
- OneCCL benchmark tool enablement
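oneCCL ships a benchmark example in its source tree that can be run under MPI; a sketch (the binary name and option spellings follow the upstream oneCCL examples and may differ by version):

```shell
# Run the oneCCL benchmark example across 2 ranks; --coll selects the
# collective to measure (option names follow upstream oneCCL examples).
mpirun -n 2 ./benchmark --coll allreduce
```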
XPU Manager:
- GPU Power
- GPU Firmware update
- GPU Diagnostic
- GPU Memory Bandwidth
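The new XPU Manager capabilities above are exposed through its `xpumcli` command-line tool; a sketch (subcommand names follow the public XPU Manager CLI, and `<fw-image>` is a placeholder):

```shell
# List GPUs and their device IDs
xpumcli discovery

# Run a level-1 diagnostic on device 0
xpumcli diag -d 0 -l 1

# Update GPU (GFX) firmware on device 0; <fw-image> is a placeholder
xpumcli updatefw -d 0 -t GFX -f <fw-image>

# Dump live statistics (including power and memory bandwidth) for device 0
xpumcli stats -d 0
```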
BKC:
- Implemented an offline installer to ensure a consistent environment and avoid the slow download speeds of the global Ubuntu PPA repository