KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache #5492

# Feature Description
KIVI runs inference with the KV cache quantized to 2 bits. The paper reports 2.6× lower peak memory on the Llama, Mistral, and Falcon models its authors evaluated, which allows up to a 4× larger batch size and yields a 2.35× to 3.47× throughput improvement.

# Motivation
Reduce the memory used by the KV cache during long-context, batched inference.
https://arxiv.org/abs/2402.02750
https://github.com/jy-yuan/KIVI

It was also posted on Reddit:
https://www.reddit.com/r/LocalLLaMA/comments/1ap3bkt/kv_cache_is_huge_and_bottlenecks_llm_inference_we/


# Possible Implementation

The authors' reference implementation: https://github.com/jy-yuan/KIVI
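
As I understand the paper, the core idea is asymmetric (min/max) 2-bit quantization, applied per channel for keys and per token for values. Below is a minimal pure-Python sketch of that grouping choice, not the authors' kernels and not existing llama.cpp code. The function names (`quantize_group`, `dequantize_group`, `max_error`) and the toy `keys` matrix are made up for illustration; the matrix assumes one channel with much larger magnitudes than the rest.

```python
from typing import Iterable, List, Tuple

LEVELS = 3  # 2 bits give integer codes 0..3


def quantize_group(xs: List[float]) -> Tuple[List[int], float, float]:
    """Asymmetric 2-bit quantization of one group.

    Returns (codes, scale, zero_point), where zero_point is the group minimum.
    """
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / LEVELS or 1.0  # fall back to 1.0 for a constant group
    codes = [min(LEVELS, max(0, round((x - lo) / scale))) for x in xs]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, zero: float) -> List[float]:
    """Map 2-bit codes back to approximate float values."""
    return [c * scale + zero for c in codes]


def max_error(groups: Iterable[Iterable[float]]) -> float:
    """Worst absolute round-trip error when each group is quantized on its own."""
    worst = 0.0
    for group in groups:
        values = list(group)
        codes, scale, zero = quantize_group(values)
        restored = dequantize_group(codes, scale, zero)
        worst = max(worst, max(abs(a - b) for a, b in zip(values, restored)))
    return worst


# Toy key matrix: rows are tokens, columns are channels.
# The last channel is an assumed "outlier" channel with large values.
keys = [
    [0.1, -0.2, 0.3, 8.0],
    [0.2, 0.1, -0.3, 9.0],
    [-0.1, 0.3, 0.1, 8.5],
    [0.0, -0.1, 0.2, 7.5],
]

per_token_err = max_error(keys)          # each row is one group
per_channel_err = max_error(zip(*keys))  # each column is one group

print(f"per-token max error:   {per_token_err:.3f}")
print(f"per-channel max error: {per_channel_err:.3f}")
```

With per-token groups the outlier channel stretches the range of every row, so the small channels lose most of their resolution; per-channel groups keep the outlier isolated. A real implementation would also pack four codes per byte and store a scale and zero point per group.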




I find it quite interesting; it could help VRAM-poor users a lot, even without large batches or long contexts.
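
To put rough numbers on the potential saving, here is a back-of-the-envelope estimate of KV cache size at 16 bits versus 2 bits per element. The model shape (32 layers, 32 KV heads, head dimension 128) is an assumption matching a Llama-2-7B-style model, `kv_cache_bytes` is a hypothetical helper, and the 2-bit figure ignores the per-group scales and zero points as well as any tokens kept in full precision.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bits: int) -> int:
    """Size of the K and V tensors for all layers, in bytes."""
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch  # 2 = K and V
    return elements * bits // 8


GIB = 1024 ** 3

# Assumed Llama-2-7B-like shape, 4096-token context, batch size 1.
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch=1)

fp16 = kv_cache_bytes(**shape, bits=16)
int2 = kv_cache_bytes(**shape, bits=2)

print(f"fp16 KV cache:  {fp16 / GIB:.2f} GiB")  # 2.00 GiB
print(f"2-bit KV cache: {int2 / GIB:.2f} GiB")  # 0.25 GiB
```

The ideal 8× ratio is an upper bound: quantization metadata and the model weights themselves do not shrink, which is one reason an end-to-end figure like the 2.6× peak-memory number above is smaller.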
