- RetroInfer is a system that rethinks the KV cache as vector storage within a GPU–CPU co-execution setup to accelerate long-context Large Language Model (LLM) inference. It exploits the inherent sparsity of the attention mechanism and introduces an Attention-aWare VEctor index (wave index) that enables efficient and accurate retrieval of critical tokens from the KV cache. Complementing this is the wave buffer, which coordinates KV cache placement and overlaps computation and data transfer across GPU and CPU to sustain high throughput. RetroInfer achieves 4.5x–10.5x higher decoding throughput than FlashAttention, without loss of accuracy.
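To make the retrieval idea above concrete, here is a minimal sketch of attention-aware top-k retrieval over a clustered KV cache: keys are grouped into clusters, each cluster's attention weight is estimated from its centroid, and tokens are fetched from the highest-scoring clusters up to a retrieval budget. This is an illustrative simplification under our own assumptions (round-robin cluster assignment, a single attention head), not RetroInfer's actual wave index implementation.

```python
import numpy as np

def wave_index_retrieve(keys, query, n_clusters=8, budget_ratio=0.25):
    """Sketch of attention-aware retrieval: rank clusters of cached keys by
    an estimated attention score (centroid dot query) and return the indices
    of tokens in the top clusters, up to the retrieval budget."""
    n = len(keys)
    budget = max(1, int(n * budget_ratio))
    # Assign keys to clusters round-robin for simplicity; a real index
    # would use a k-means-style clustering of the key vectors.
    labels = np.arange(n) % n_clusters
    centroids = np.stack(
        [keys[labels == c].mean(axis=0) for c in range(n_clusters)]
    )
    # Estimate each cluster's attention contribution from its centroid,
    # then visit clusters from highest to lowest estimated score.
    scores = centroids @ query
    picked = []
    for c in np.argsort(-scores):
        picked.extend(np.nonzero(labels == c)[0].tolist())
        if len(picked) >= budget:
            break
    return sorted(picked[:budget])
```

The sketch keeps only a fixed fraction of tokens per decoding step, which is what lets the bulk of the KV cache stay in CPU memory while the GPU attends over the retrieved subset.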
- RetroInfer can effectively improve the decoding throughput of LLM generation in long-context scenarios, with minimal impact on model accuracy.
- RetroInfer is intended for LLM deployers and users who need to manage long-context scenarios efficiently.
- We evaluated RetroInfer using state-of-the-art long-context benchmarks, including RULER, LongBench, and Needle in a Haystack, and their respective evaluation metrics.
- Extensive testing was conducted across various scenarios, including multi-needle retrieval, multi-hop tracing, multi-document QA, single-document QA, code completion, and few-shot learning. The results showed almost no change in accuracy.
What are the limitations of RetroInfer? How can users minimize the impact of RetroInfer’s limitations when using the system?
- Potentially harmful, false, or biased responses generated by LLMs are likely unchanged with RetroInfer. As a result, using RetroInfer does not inherently mitigate or exacerbate these responsible AI concerns.
- RetroInfer was developed for research and experimental purposes. Further testing and validation are needed before considering its application in real-world scenarios.
- Users can adjust parameters such as the retrieval budget ratio and the attention estimate ratio when using RetroInfer. Once configured, RetroInfer can effectively accelerate LLM response generation in long-context scenarios.
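To illustrate how such a budget parameter behaves, the hypothetical snippet below shows how a retrieval budget ratio translates into the number of KV cache tokens fetched per decoding step. The parameter names and values here are our own illustrative assumptions, not RetroInfer's actual configuration API.

```python
# Hypothetical tuning knobs; names are illustrative, not RetroInfer's real API.
config = {
    "retrieval_budget_ratio": 0.02,    # fraction of cached tokens retrieved per step
    "attention_estimate_ratio": 0.25,  # fraction of the index scored when estimating attention
}

context_len = 120_000  # tokens in the KV cache for a long-context prompt
tokens_retrieved = round(context_len * config["retrieval_budget_ratio"])
print(tokens_retrieved)  # a smaller ratio moves less data between CPU and GPU
```

Lowering the budget ratio reduces CPU–GPU transfer per step (favoring throughput), while raising it retrieves more candidate tokens (favoring accuracy); users should validate the trade-off on their own workloads.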