Fix hf_quant_config with kv cache type [OMNIML-2918] #557
Codecov Report
✅ All modified and coverable lines are covered by tests.

    @@            Coverage Diff            @@
    ##              main     #557   +/-   ##
    ========================================
      Coverage    74.64%   74.64%
    ========================================
      Files          183      183
      Lines        18547    18547
    ========================================
      Hits         13844    13844
      Misses        4703     4703
    kv_cache_quantization = None
    if get_kv_cache_dtype(self.model) == KV_CACHE_FP8:
        # Only FP8 KV Cache is supported in VLLM for now
        kv_cache_quantization = "FP8"
Could you also add FP4 KV cache support? TRT-LLM supports an FP4 KV cache now.
Just added.
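The change discussed in this thread (reporting FP8 and, after review, NVFP4 KV cache types) can be sketched as a small lookup table. This is a hypothetical, self-contained illustration, not the PR's code: the `KVCacheDtype` enum, its members, and `kv_cache_quant_string` are stand-ins for the real `get_kv_cache_dtype` / `KV_CACHE_FP8` helpers shown in the diff, and the `"NVFP4"` label is an assumption about the string written for an FP4 KV cache.

```python
from enum import Enum
from typing import Optional


class KVCacheDtype(Enum):
    """Hypothetical stand-in for the KV cache dtype constants (e.g. KV_CACHE_FP8)."""

    NONE = "none"
    FP8 = "fp8"
    NVFP4 = "nvfp4"


# Hypothetical mapping from the detected KV cache dtype to the label written
# into the exported quantization config. Dtypes not listed map to None,
# matching the `kv_cache_quantization = None` default in the diff.
_KV_CACHE_QUANT_STRINGS = {
    KVCacheDtype.FP8: "FP8",
    KVCacheDtype.NVFP4: "NVFP4",
}


def kv_cache_quant_string(dtype: KVCacheDtype) -> Optional[str]:
    """Return the KV cache quantization label, or None if the KV cache is unquantized."""
    return _KV_CACHE_QUANT_STRINGS.get(dtype)


print(kv_cache_quant_string(KVCacheDtype.FP8))    # FP8
print(kv_cache_quant_string(KVCacheDtype.NVFP4))  # NVFP4
print(kv_cache_quant_string(KVCacheDtype.NONE))   # None
```

A lookup table keeps the `None` default of the original `if` branch while making support for another KV cache format a one-line change.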
Force-pushed from 478abad to c2b254c.
Force-pushed from 18852fe to 3ecca22.
Signed-off-by: Jennifer Chen <[email protected]>
Force-pushed from 3ecca22 to 3cdb810.
/ok to test 3cdb810
Update hf_quant_config with correct kv cache type for FP8 and NVFP4

Signed-off-by: jenchen13 <[email protected]>
Signed-off-by: Jennifer Chen <[email protected]>
What does this PR do?
Type of change: Bug fix
Fixes hf_quant_config so that it records the correct KV cache quantization type for FP8 and NVFP4.
Testing
Will test export with the FP8 KV cache enabled.
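One way to carry out that check is to read the exported `hf_quant_config.json` and assert on its KV cache field. This is a minimal sketch under stated assumptions: the `quantization` / `kv_cache_quant_algo` key layout and the `read_kv_cache_algo` helper are illustrative and not confirmed by this PR, and the demo writes its own sample file into a temporary directory instead of running a real export.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def read_kv_cache_algo(export_dir: Path) -> Optional[str]:
    """Return the KV cache quantization algorithm recorded in hf_quant_config.json.

    Assumes the (hypothetical) layout {"quantization": {"kv_cache_quant_algo": ...}}
    and returns None if either key is missing.
    """
    config = json.loads((export_dir / "hf_quant_config.json").read_text())
    return config.get("quantization", {}).get("kv_cache_quant_algo")


# Demo: write a hypothetical config like an FP8 export with an FP8 KV cache
# might produce, then read the KV cache field back.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    sample = {"quantization": {"quant_algo": "FP8", "kv_cache_quant_algo": "FP8"}}
    (demo_dir / "hf_quant_config.json").write_text(json.dumps(sample))
    print("KV cache algo:", read_kv_cache_algo(demo_dir))
```

Returning `None` for a missing key lets the same check distinguish "KV cache not quantized" from a wrong label.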