llm-scaler-vllm pre-production release 0.2.0
## Highlights
### Resources
- Docker image: `intel/llm-scaler-vllm:0.2.0-pre-release`
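As a quick smoke test after pulling the image (`docker pull intel/llm-scaler-vllm:0.2.0-pre-release`), the sketch below queries a model through vLLM's OpenAI-compatible endpoint. The port, model id, and the assumption that the container is already serving a model are hypothetical; see the project README for the actual launch command.

```python
# Hypothetical smoke test against a vLLM OpenAI-compatible endpoint.
# Assumes a model is being served on localhost:8000; adjust to your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder id; use the model you deployed
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```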
### What's new
- oneCCL now uses a smaller buffer size, and an official oneCCL release has been published on GitHub.
- The new GQA kernel brings up to a 30% performance improvement for models that use grouped-query attention.
- Fixed OOM issues exposed by stress testing (more stress tests are ongoing).
- Added support for 70B models with FP8 quantization and TP4 (4-way tensor parallelism) in offline mode; see the sketch after this list.
- Fixed a DeepSeek-V2-Lite accuracy issue.
- Other bugfixes.
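For the 70B FP8 TP4 offline-mode item above, a minimal sketch using vLLM's offline Python API follows. The Hugging Face model id (mapped here to DeepSeek-R1-Distill-Llama-70B) and the sampling settings are illustrative assumptions, not the validated configuration.

```python
# Minimal offline-mode sketch: 70B model, FP8 quantization, 4-way tensor parallelism.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",  # assumed id for the 70B model
    quantization="fp8",        # FP8 weights/activations
    tensor_parallel_size=4,    # TP4: shard the model across 4 GPUs
)
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)
```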
### Verified Features
- Refreshed KPI functionality and performance on 4x and 8x BMG e211 systems; all KPI models now meet their goals. Added FP8 performance numbers for the DS-Distilled-LLaMA 70B model, measured on 4x BMG with TP4 in offline mode.
- FP8 functionality test at 32K input / 8K output sequence lengths (ISL/OSL) with the DS-Distilled-Qwen32B model on 4x BMG with TP4; see the sketch after this list.
- Verified the model list for FP8 functionality.
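The 32K/8K ISL/OSL functionality test could be reproduced along these lines with the same offline API; the model id, the crude prompt construction, and the context-length setting are assumptions made for illustration.

```python
# Sketch of a 32K-input / 8K-output functionality check in offline mode.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",  # assumed id for DS-Distilled-Qwen32B
    quantization="fp8",
    tensor_parallel_size=4,
    max_model_len=40960,  # must cover ~32K input + 8K output tokens
)
long_prompt = "word " * 32000  # crude stand-in for a ~32K-token input
params = SamplingParams(max_tokens=8192, ignore_eos=True)  # force a full 8K output
out = llm.generate([long_prompt], params)
print(len(out[0].outputs[0].token_ids), "output tokens generated")
```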