Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:1.0
- Docker Image: intel/llm-scaler-platform:1.0
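The images above can be pulled and started with standard Docker commands; a minimal sketch (the device mapping and shared-memory size are assumptions and may need adjusting for your deployment):

```shell
# Pull the release image (tag as listed above)
docker pull intel/llm-scaler-vllm:1.0

# Start the vLLM image with Intel GPU access.
# --device /dev/dri exposes the GPU render nodes; --shm-size is an
# assumed value for vLLM's shared-memory use -- tune per deployment.
docker run -it --rm \
  --device /dev/dri \
  --shm-size 8g \
  intel/llm-scaler-vllm:1.0 bash
```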
What’s new
vLLM:
- Performance optimization of TPOT (time per output token) for long input lengths (>4K): up to 1.8x speedup at 40K sequence length on the 32B KPI model, and 4.2x at 40K sequence length on the 70B KPI model.
- Performance optimizations delivering ~10% higher output throughput for 8B-32B KPI models compared to the previous drop.
- New feature: per-layer online quantization to reduce required GPU memory
- New feature: PP (pipeline parallelism) support in vLLM (experimental)
- New feature: torch.compile (experimental)
- New feature: speculative decoding (experimental)
- Support for embedding and rerank models
- Enhanced multi-modal model support
- Performance improvements
- Automatic detection of maximum model length
- Data parallelism support
- Bug fixes
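The experimental parallelism features above correspond to standard vLLM launch flags; a hedged sketch (flag names follow upstream vLLM and `<model-path>` is a placeholder; exact spellings and availability in this drop may differ):

```shell
# Serve a model with pipeline parallelism (experimental in this drop).
# --tensor-parallel-size / --pipeline-parallel-size are upstream vLLM
# engine flags; <model-path> is a placeholder.
vllm serve <model-path> \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2

# Data parallelism (upstream flag name; availability varies by version):
vllm serve <model-path> --data-parallel-size 2
```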
OneCCL:
- OneCCL benchmark tool enablement
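oneCCL ships a benchmark example in its source tree that can be run under MPI; a sketch (the binary name and option spellings follow the upstream oneCCL examples and may differ by version):

```shell
# Run the oneCCL benchmark example across 2 ranks; --coll selects the
# collective to measure (option names follow upstream oneCCL examples).
mpirun -n 2 ./benchmark --coll allreduce
```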
XPU Manager:
- GPU Power
- GPU Firmware update
- GPU Diagnostic
- GPU Memory Bandwidth
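The new XPU Manager capabilities above are exposed through its `xpumcli` command-line tool; a sketch (subcommand names follow the public XPU Manager CLI, and `<fw-image>` is a placeholder):

```shell
# List GPUs and their device IDs
xpumcli discovery

# Run a level-1 diagnostic on device 0
xpumcli diag -d 0 -l 1

# Update GPU (GFX) firmware on device 0; <fw-image> is a placeholder
xpumcli updatefw -d 0 -t GFX -f <fw-image>

# Dump live statistics (including power and memory bandwidth) for device 0
xpumcli stats -d 0
```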
BKC:
- Implemented an offline installer to ensure a consistent environment and avoid the slow download speeds of the global Ubuntu PPA repository