
Conversation


@Pento95 Pento95 commented Nov 16, 2025

Closes #1827

The Problem

As described in issue #1827, frequent context switching (e.g., in multi-user scenarios like AI Horde) causes significant latency. This occurs because the KV cache in VRAM must be discarded and re-calculated from scratch for each new, unrelated prompt, wasting processing time.

The Solution

A Multi-Slot RAM KV Cache system that uses system RAM to save and restore KV cache snapshots, inspired by llama.cpp's server_prompt_cache (the decision flow is sketched after the list below):

  1. When receiving a new prompt, calculate similarity to current VRAM context
  2. If similarity ≥ threshold (default 0.8), use ContextFastForward (cache hit)
  3. If similarity < threshold (cache miss):
    • Save current VRAM KV cache to a free RAM slot
    • Search RAM slots for best similarity match
    • Restore best match to VRAM if found (RAM hit)
    • Otherwise, cold prefill and save to new slot
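To make the flow above concrete, here is a minimal, illustrative C++ sketch of the decision logic. The names (similarity, CacheDecision, decide) are hypothetical stand-ins rather than the actual functions in this PR, and the similarity metric shown is a simple token-prefix overlap, not necessarily the real implementation:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct CacheDecision {
        enum Kind { FAST_FORWARD, RAM_RESTORE, COLD_PREFILL } kind;
        int slot = -1;   // index of the RAM slot to restore, if any
    };

    // Hypothetical similarity: fraction of the shorter sequence covered by the
    // common token prefix (the PR may use a different metric, e.g. LCS-based).
    static float similarity(const std::vector<int>& a, const std::vector<int>& b) {
        size_t n = std::min(a.size(), b.size());
        size_t match = 0;
        while (match < n && a[match] == b[match]) ++match;
        return n == 0 ? 0.0f : (float)match / (float)n;
    }

    CacheDecision decide(const std::vector<int>& new_prompt,
                         const std::vector<int>& vram_ctx,
                         const std::vector<std::vector<int>>& ram_slot_tokens,
                         float threshold = 0.8f) {
        // Steps 1-2: high similarity to the current VRAM context -> reuse it.
        if (similarity(new_prompt, vram_ctx) >= threshold)
            return {CacheDecision::FAST_FORWARD, -1};
        // Step 3: cache miss. The caller snapshots the VRAM cache to a free RAM
        // slot, then searches the saved slots for the best match.
        int best = -1;
        float best_sim = 0.0f;
        for (size_t i = 0; i < ram_slot_tokens.size(); ++i) {
            float s = similarity(new_prompt, ram_slot_tokens[i]);
            if (s > best_sim) { best_sim = s; best = (int)i; }
        }
        // Restore the best match if one was found (RAM hit); the exact cutoff
        // used for "found" is an assumption in this sketch.
        if (best >= 0 && best_sim >= threshold)
            return {CacheDecision::RAM_RESTORE, best};
        return {CacheDecision::COLD_PREFILL, -1};   // cold prefill, save new slot
    }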

This approach drastically reduces latency during context switches, improving efficiency and response speed in multi-user scenarios.

Architecture: Two-Level Cache System

  • Level 1 (VRAM): Active KV cache (1 slot, fast GPU memory)
  • Level 2 (RAM): Saved KV cache snapshots (N slots, slower but larger capacity)
  • LRU Eviction: Oldest slots are evicted when the RAM limit is reached (see the sketch after this list)
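As a rough illustration of the Level 2 pool, here is a small, hypothetical C++ sketch of a RAM slot structure with LRU eviction under a byte budget. The names (RamSlot, enforce_ram_limit) and the layout are assumptions for illustration, not the PR's actual data structures:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct RamSlot {
        std::vector<int>     tokens;     // token sequence the snapshot belongs to
        std::vector<uint8_t> kv_blob;    // serialized KV cache bytes
        uint64_t             last_used;  // monotonic counter for LRU ordering
    };

    // Evict the least-recently-used slots until the pool fits the RAM budget.
    void enforce_ram_limit(std::vector<RamSlot>& slots, std::size_t max_bytes) {
        auto total_bytes = [&slots]() {
            std::size_t sum = 0;
            for (const auto& s : slots) sum += s.kv_blob.size();
            return sum;
        };
        while (!slots.empty() && total_bytes() > max_bytes) {
            std::size_t oldest = 0;
            for (std::size_t i = 1; i < slots.size(); ++i)
                if (slots[i].last_used < slots[oldest].last_used) oldest = i;
            slots.erase(slots.begin() + (std::ptrdiff_t)oldest);   // drop the stalest snapshot
        }
    }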

Key Features

  • GUI Options: Adds a checkbox to enable the "Smart Cache" and a slider to configure the number of memory slots, including a warning for the estimated RAM consumption.
  • CLI Flags: Introduces the --smartcache, --smartcacherammaxsize, and --smartcachethreshold flags for command-line configuration.
  • API Endpoint: Creates a new /api/extra/stats/smartcache endpoint to monitor cache performance and statistics (hit rate, misses, etc.).
  • C++ Integration: Adds C++ helper functions to optimize similarity calculations and cache management.

How to use

Smart Cache Two-Level System Commands:
--smartcache
    Enable the smart cache two-level system for intelligent context switching (default: disabled).
--smartcacherammaxsize [GB]
    Maximum RAM size in GB for smart cache slots (default: 10 GB). Smart cache will keep creating slots until this RAM limit is reached. Cannot exceed 90% of total system RAM.
--smartcachethreshold [threshold]
    Similarity threshold (0.0-1.0) for cache reuse. Prompts with similarity >= threshold use ContextFastForward; those below the threshold trigger a context switch with a RAM search (default: 0.8).
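For example (assuming the standard koboldcpp.py launcher and substituting your own model file), enabling the feature with a 16 GB RAM budget and the default threshold might look like:

    python koboldcpp.py --model yourmodel.gguf --smartcache --smartcacherammaxsize 16 --smartcachethreshold 0.8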

@Pento95 Pento95 marked this pull request as draft November 16, 2025 23:33
@Pento95 Pento95 changed the base branch from concedo to concedo_experimental November 16, 2025 23:41

Pento95 commented Nov 17, 2025

"Hi @LostRuins , I've opened this draft PR as a functional proof-of-concept for the Smart Cache feature. As we discussed, I'd really appreciate your feedback and any help you can offer to refine it before I mark it as ready for a formal review. Thank you!"

@Pento95 Pento95 marked this pull request as ready for review November 18, 2025 13:36
@Pento95 Pento95 changed the title from [Draft] Smart cache (RAM context cache) to Smart cache (RAM context cache) on Nov 18, 2025

Pento95 commented Nov 18, 2025

@LostRuins this PR is ready to be reviewed.
Multi-chat, image generation, and AI Horde have been tested, on my Linux + CUDA hardware only.

While implementing the Smart Cache feature, I noticed the gpttype_adapter.cpp header
includes these design considerations:

//No dynamic memory allocation! Setup structs with FIXED (known) shapes and sizes
//Python will ALWAYS provide the memory, we just write to it.

My implementation has a few deviations from these strict rules:

Current Choices (choice 1 is sketched after the list below):

  1. savestates management: Using new std::vector<savestate_data>(limit) for
    dynamic slot allocation (3-256 configurable)
    - Alternative: Static array → wastes ~100MB if using only 3 slots
  2. smart_cache_lcs_precomputed: Uses std::vector with capacity pre-allocation
    - Alternative: Static buffer → wastes 131KB permanently
  3. smart_cache_compute_purge_diff(): Creates temporary vectors for LCS computation
    - Alternative: Python passes work buffers → more complex API
  4. get_current_context_tokens(): Returns pointer to C++ memory (zero-copy)
    - Alternative: Python allocates + memcpy → slower, requires 2 calls
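As a minimal illustration of choice 1, here is roughly what the dynamic slot pool looks like compared to a fixed static array. The names below (slot_data, make_slot_pool) are placeholders for this sketch, not the PR's actual identifiers:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct slot_data {                    // stand-in for the real savestate_data
        std::vector<int>     tokens;
        std::vector<uint8_t> kv_blob;
    };

    // Dynamic allocation: create exactly as many slots as configured (3-256);
    // the vector cleans itself up via RAII when the cache is torn down.
    std::vector<slot_data> make_slot_pool(std::size_t limit) {
        return std::vector<slot_data>(limit);
    }

    // Strict-compliance alternative: a fixed array always reserves space for
    // the maximum slot count, even when only a handful of slots are ever used.
    // static slot_data g_slots[256];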

These choices favor:

  • Performance (zero-copy, minimal allocations after warm-up)
  • Flexibility (configurable 3-256 slots vs fixed 256)
  • Memory efficiency (only allocate what's needed)
  • RAII safety (automatic cleanup, no leaks)

vs strict compliance which would require:

  • ~230MB wasted static allocations
  • 5-10% performance loss (extra copies)
  • More complex Python API

Question: Are these pragmatic deviations acceptable, or would you prefer
strict compliance with the original design constraints?


Pento95 commented Nov 19, 2025

Thanks @wbruna! All three points addressed:

  1. Declarations moved to model_adapter.h - type safety now enforced by compiler
  2. vector* → vector - simpler, RAII-safe, removed ensure_savestates_initialized()
  3. Removed max_savestate_slots - using savestates.size() directly (single source of truth)

Changes pushed. Ready for re-review!

  - Add slot pooling to reuse C++ buffers (prevents memory leak)
  - Skip RAM search for prompts <2048 tokens
  - Remove misleading is_active flag from stats API
  - Invalidate slots on eviction to enable pool reuse
@HumbleDeer

If in the future this ends up being a half-useful feature rather than a fully useful one, due to potentially needing a LOT more sysRAM, I suppose one option is to specify a maximum context size the user may send over before it gets either truncated before generating the Ctx K-V database, or skipped entirely, with the option of reporting this back to the user?

My phrasing isn't exactly the most eloquent, but this is what I can muster for linguistics at this time. The assertion of half/full usefulness isn't a dig at you, but rather about what the user would find as a barrier to entry/usage.

In any case, with all the newer storage methods and RAM speeds that are around lately, the need to actually copy it back to VRAM might be almost redundant, depending on the latency introduced by the process of loading it into VRAM. System RAM is blazing fast, and I can attest that in many situations I have preferred fully manually offloading the KV cache to sysRAM so I could load all the layers on the GPU, because the non-fragmentation is much more effective. In my case it's a bit slow, albeit still faster than fragmentation across a GTX 1070 & i7 7700K. But that just stands to prove that there is most definitely room for "play" and nonstandard ways of holding onto that KV cache.

As to how the faster storage I mentioned comes into play here: some NVMe storage devices come scarily close to the total round-trip time you'd find for RAM access, given the possibly easier ability to access NVMe directly over the bus. That's, in essence, no different than the concept of directly attaching your storage to your PCIe bus for that gaming use case. It makes use of the exact justifications and reasons I'm bringing up now.

That said, I'm not knowledgeable enough to have a firm grasp on what sort of latency figures this all turns out to involve. I'm merely keeping in mind that there are those extra layers to traverse, in the end. That's especially true for Python, even if CPython is really impressive these days.

I'm following this; I'm curious what we end up with for Christmas this year!


Pento95 commented Nov 20, 2025

Consumer gaming PCs have 32 GB of RAM, sometimes 64 GB (like I do). If you don't use mmap to load the model, most of it is free, unless you use MoE experts on CPU or something like that. It's with huge contexts that this feature truly shines.

I'm testing this feature with a 48 GB smart cache, using an AI Horde worker (around 9% cache hit rate over 36h), waidrin (https://github.com/p-e-w/waidrin), and SillyTavern. I can really notice the difference.

About the "speed" of moving data RAM <-> VRAM, well.. even with a DDR3 RAM you would still get better speeds then having to preprocess thousands of tokens in case of cache hit.

About the storage hierarchy and NVMe speeds: you're absolutely right, modern NVMe (especially Gen4/5) has dramatically narrowed the gap to DRAM latency. The challenge with the KV cache specifically is the frequency of access during generation (every token decode touches it), so even 50-100µs of NVMe latency vs ~100ns for RAM adds up fast. That said, for hibernated slots that aren't actively generating, NVMe could be brilliant: a kind of tiered cache (VRAM → RAM → NVMe).

The current implementation keeps everything in RAM because we're using llama.cpp's save_state_kv(), which serializes to a memory buffer. Extending this to NVMe would need a custom serialization path that bypasses the buffer, but it's technically doable.

About direct VRAM offload vs fragmentation: your GTX 1070 experience is a perfect example. Sometimes predictable RAM latency beats fragmented VRAM/split execution. The smart cache sits in a sweet spot for that: it keeps the working set in VRAM while letting you maintain a much larger "recently used" pool in RAM without OOM-ing the GPU.
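Purely as a thought experiment (nothing in this PR implements it), a tiered lookup could cascade through the levels roughly like this; all names here are hypothetical:

    #include <functional>
    #include <string>

    enum class Tier { VRAM, RAM, NVME, NONE };

    // Each probe returns true when that tier holds a good-enough snapshot.
    Tier tiered_lookup(const std::string& key,
                       const std::function<bool(const std::string&)>& probe_vram,
                       const std::function<bool(const std::string&)>& probe_ram,
                       const std::function<bool(const std::string&)>& probe_nvme) {
        if (probe_vram(key)) return Tier::VRAM;   // already resident, cheapest
        if (probe_ram(key))  return Tier::RAM;    // restore from system memory
        if (probe_nvme(key)) return Tier::NVME;   // rehydrate from disk, slowest
        return Tier::NONE;                        // cold prefill required
    }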

@Pento95 Pento95 requested a review from wbruna November 21, 2025 18:58