llm-scaler-vllm pre-production release 0.2.0
## Highlights
### Resources
- Docker image: `intel/llm-scaler-vllm:0.2.0-pre-release`
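As a quick smoke test after pulling the image (`docker pull intel/llm-scaler-vllm:0.2.0-pre-release`), the sketch below queries a model through vLLM's OpenAI-compatible endpoint. The port, model id, and the assumption that the container is already serving a model are hypothetical; see the project README for the actual launch command.

```python
# Hypothetical smoke test against a vLLM OpenAI-compatible endpoint.
# Assumes a model is being served on localhost:8000; adjust to your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder id; use the model you deployed
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```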
### What's new
- oneCCL now uses a smaller buffer size, and an official oneCCL release has been published on GitHub.
- The new GQA kernel brings up to a 30% performance improvement for models that use grouped-query attention.
- Fixed OOM issues exposed by stress testing (more stress tests are ongoing).
- Added support for 70B models with FP8 quantization and TP4 (4-way tensor parallelism) in offline mode; see the sketch after this list.
- Fixed a DeepSeek-V2-Lite accuracy issue.
- Other bugfixes.
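For the 70B FP8 TP4 offline-mode item above, a minimal sketch using vLLM's offline Python API follows. The Hugging Face model id (mapped here to DeepSeek-R1-Distill-Llama-70B) and the sampling settings are illustrative assumptions, not the validated configuration.

```python
# Minimal offline-mode sketch: 70B model, FP8 quantization, 4-way tensor parallelism.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",  # assumed id for the 70B model
    quantization="fp8",        # FP8 weights/activations
    tensor_parallel_size=4,    # TP4: shard the model across 4 GPUs
)
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)
```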
### Verified Features
- Refreshed KPI functionality and performance on 4x and 8x BMG e211 systems; all KPI models now meet their goals. Added FP8 performance numbers for the DS-Distilled-LLaMA 70B model, measured on 4x BMG with TP4 in offline mode.
- FP8 functionality test at 32K input / 8K output sequence lengths (ISL/OSL) with the DS-Distilled-Qwen32B model on 4x BMG with TP4; see the sketch after this list.
- Verified the model list for FP8 functionality.
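The 32K/8K ISL/OSL functionality test could be reproduced along these lines with the same offline API; the model id, the crude prompt construction, and the context-length setting are assumptions made for illustration.

```python
# Sketch of a 32K-input / 8K-output functionality check in offline mode.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",  # assumed id for DS-Distilled-Qwen32B
    quantization="fp8",
    tensor_parallel_size=4,
    max_model_len=40960,  # must cover ~32K input + 8K output tokens
)
long_prompt = "word " * 32000  # crude stand-in for a ~32K-token input
params = SamplingParams(max_tokens=8192, ignore_eos=True)  # force a full 8K output
out = llm.generate([long_prompt], params)
print(len(out[0].outputs[0].token_ids), "output tokens generated")
```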