-* **Description**: Called when a new request arrives. It checks the external cache to see if any part of the prompt has already been computed and cached.
+* **Description**: Called when a new request arrives. It checks to see if any KV cache can be loaded from an external KV store.
 * **Returns**: A tuple `(num_tokens, is_async)`. `num_tokens` is the number of tokens found in the external cache. `is_async` indicates if the loading will happen asynchronously (background) or requires blocking.
 * **Description**: Called when a request completes generation.
-* **Returns**: A boolean indicating if an asynchronous save operation has started. If `True`, the system waits for the operation to complete before deallocating the blocks.
+* **Returns**: A boolean indicating if an asynchronous save operation is underway. If `True`, the system waits for the operation to complete before releasing the KV cache blocks.
 * **Description**: A synchronization point. Ensures that the KV cache for a specific layer is fully loaded before the model attempts to perform the forward pass on that layer.
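The `(num_tokens, is_async)` return contract described in this hunk can be illustrated with a minimal sketch. It is not the connector's actual code: `count_cached_tokens`, `BLOCK_SIZE`, and the idea of representing the external store as a set of block indices are all hypothetical, and the sketch assumes that only a contiguous run of full blocks from the start of the prompt counts as a hit.

```python
from typing import Sequence

BLOCK_SIZE = 16  # hypothetical block size, in tokens


def count_cached_tokens(
    prompt_len: int,
    cached_block_ids: Sequence[int],
    block_size: int = BLOCK_SIZE,
) -> tuple[int, bool]:
    """Return (num_tokens, is_async) for a toy external KV store.

    cached_block_ids lists the prompt block indices present in the store.
    Assumption: only full blocks, contiguous from block 0, count as a hit.
    """
    cached = set(cached_block_ids)
    full_blocks = prompt_len // block_size  # a trailing partial block is ignored
    hit_blocks = 0
    while hit_blocks < full_blocks and hit_blocks in cached:
        hit_blocks += 1
    # This toy store would load synchronously, so is_async is False.
    return hit_blocks * block_size, False
```

With a 40-token prompt and blocks 0, 1, and 2 in the store, only the two full prompt blocks can match, so the sketch reports 32 tokens. If block 0 is missing, it reports 0 regardless of what else is stored.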
@@ -93,9 +93,8 @@ This example implements a file-system based KV cache.
 This example illustrates the API mechanics but has several limitations that make it unsuitable for high-performance production use without modification:

 1. **Blocking I/O**: The example uses `torch.load` and `torch.save` synchronously. In a real implementation, these should be offloaded to a background thread or asynchronous I/O handler to avoid stalling the GPU.
-2. **Simplified Block Matching**: The `get_num_new_matched_tokens` implementation in the example only matches full blocks aligned to the start of the sequence. It doesn't handle complex prefix matching or partial blocks efficiently.
-3. **No Chunked Prefill**: The example notes it "does not work with chunked prefill," meaning it assumes the entire prefill happens in one go.
-4. **FileSystem Latency**: Storing one file per block can create high filesystem overhead. Production systems typically aggregate blocks or use a key-value store database (like Redis) or object store.
+2. **Simplified Block Matching**: The `get_num_new_matched_tokens` implementation in the example only matches full blocks. It does not handle partial cache hits.
+3. **FileSystem Latency**: Storing one file per block can create high filesystem overhead.
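One way to address the blocking I/O limitation listed above is to hand writes to a worker thread and report whether any are still in flight. The sketch below is an illustration under stated assumptions, not the example's implementation: `BackgroundBlockSaver` and its methods are hypothetical names, and `pickle` stands in for `torch.save` so the snippet runs without PyTorch.

```python
import pickle
from concurrent.futures import Future, ThreadPoolExecutor
from pathlib import Path


class BackgroundBlockSaver:
    """Hypothetical helper that writes KV blocks on a worker thread."""

    def __init__(self, root: Path, max_workers: int = 2) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending: dict[str, list[Future]] = {}

    def _write(self, path: Path, payload: object) -> None:
        # Stand-in for torch.save; still one file per block, as in the example.
        with open(path, "wb") as f:
            pickle.dump(payload, f)

    def save_async(self, request_id: str, block_key: str, payload: object) -> None:
        """Queue a write and return immediately, without blocking the caller."""
        path = self.root / f"{block_key}.bin"
        future = self._pool.submit(self._write, path, payload)
        self._pending.setdefault(request_id, []).append(future)

    def has_pending_saves(self, request_id: str) -> bool:
        """True while any queued write for this request has not finished."""
        return any(not f.done() for f in self._pending.get(request_id, []))

    def wait(self, request_id: str) -> None:
        """Block until all writes for the request finish; re-raise write errors."""
        for future in self._pending.pop(request_id, []):
            future.result()
```

A caller could return `has_pending_saves(request_id)` from its request-completion hook so that blocks are released only after the writes finish. This sketch does nothing about per-file overhead; each block still lands in its own file.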