Commit cd9cb7a

little fixes
Signed-off-by: jthomson04 <[email protected]>
1 parent ade1a9e commit cd9cb7a

1 file changed, +6 -7 lines changed

docs/source/examples/kv_cache_connector.md

Lines changed: 6 additions & 7 deletions
@@ -33,15 +33,15 @@ These methods run on the leader process and drive the connector's behavior.
 * **Returns**: An arbitrary metadata object (picklable) that describes the tasks for the workers. This object is broadcasted to all workers.
 
 * **`get_num_new_matched_tokens(self, request: LlmRequest, num_computed_tokens: int) -> tuple[int, bool]`**
-* **Description**: Called when a new request arrives. It checks the external cache to see if any part of the prompt has already been computed and cached.
+* **Description**: Called when a new request arrives. It checks to see if any KV cache can be loaded from an external KV store.
 * **Returns**: A tuple `(num_tokens, is_async)`. `num_tokens` is the number of tokens found in the external cache. `is_async` indicates if the loading will happen asynchronously (background) or requires blocking.
 
 * **`request_finished(self, request: LlmRequest, cache_block_ids: list[int]) -> bool`**
 * **Description**: Called when a request completes generation.
-* **Returns**: A boolean indicating if an asynchronous save operation has started. If `True`, the system waits for the operation to complete before deallocating the blocks.
+* **Returns**: A boolean indicating if an asynchronous save operation is underway. If `True`, the system waits for the operation to complete before releasing the KV cache blocks.
 
 * **`update_state_after_alloc(self, request: LlmRequest, block_ids: list[int])`**
-* **Description**: a callback to update internal state after new blocks have been allocated for a request.
+* **Description**: a callback to update internal state after KV cache blocks have been allocated for the prefill.
 
 #### 2. Worker Interface (`KvCacheConnectorWorker`)
 
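Note on the hunk above: the `(num_tokens, is_async)` return contract of `get_num_new_matched_tokens` can be illustrated with a minimal, framework-free sketch. Everything here is an assumption made for illustration and is not taken from the example's source: the helper names `block_keys` and `num_new_matched_tokens`, the chained SHA-256 block keys, the in-memory `store` set standing in for the external KV store, and the choice to subtract `num_computed_tokens` from the hit count. It matches whole blocks only, from the start of the prompt.

```python
import hashlib


def block_keys(token_ids, tokens_per_block):
    """Yield one chained hash per full block (hypothetical keying scheme).

    The trailing partial block, if any, is ignored.
    """
    prev = b""
    full = len(token_ids) - len(token_ids) % tokens_per_block
    for start in range(0, full, tokens_per_block):
        block = token_ids[start:start + tokens_per_block]
        digest = hashlib.sha256(prev + repr(block).encode()).hexdigest()
        prev = digest.encode()  # chain so a key depends on the whole prefix
        yield digest


def num_new_matched_tokens(token_ids, num_computed_tokens, tokens_per_block, store):
    """Return (num_tokens, is_async) for a prompt, counting full-block hits only.

    `store` is any container supporting `in`; here it stands in for the
    external KV store. Loading is reported as blocking (is_async=False).
    """
    matched_blocks = 0
    for key in block_keys(token_ids, tokens_per_block):
        if key not in store:
            break  # stop at the first miss: only a contiguous prefix counts
        matched_blocks += 1
    # Assumption: tokens already computed locally are not counted again.
    new_tokens = max(0, matched_blocks * tokens_per_block - num_computed_tokens)
    return new_tokens, False
```

With a 10-token prompt and 4 tokens per block, at most 8 tokens can match; the last 2 tokens form a partial block and are never reported.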
@@ -53,7 +53,7 @@ These methods run on all workers (GPU processes) and interact with the actual GP
 
 * **`start_load_kv(self, stream: torch.cuda.Stream)`**
 * **Description**: Initiates the loading of KV blocks from the external source into the GPU memory.
-* **Arguments**: `stream` is the CUDA stream where copy operations should be enqueued.
+* **Arguments**: `stream` is the CUDA stream where the forward pass is executed in.
 
 * **`wait_for_layer_load(self, layer_idx: int, stream: torch.cuda.Stream)`**
 * **Description**: A synchronization point. Ensures that the KV cache for a specific layer is fully loaded before the model attempts to perform the forward pass on that layer.
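Note on the hunk above: the per-layer synchronization that `wait_for_layer_load` describes can be pictured with a host-side analogy. This sketch is an assumption for illustration only: it uses `threading.Event` rather than CUDA streams or events, and the class `LayerLoadGate` and its `mark_loaded` method are hypothetical names that do not appear in the connector API.

```python
import threading


class LayerLoadGate:
    """One flag per layer: a loader sets it, the forward pass waits on it."""

    def __init__(self, num_layers):
        self._ready = [threading.Event() for _ in range(num_layers)]

    def mark_loaded(self, layer_idx):
        # Called by the loader once the KV data for this layer is in place.
        self._ready[layer_idx].set()

    def wait_for_layer_load(self, layer_idx, timeout=None):
        # Blocks until the layer is marked loaded; returns False on timeout.
        return self._ready[layer_idx].wait(timeout)
```

The point of a per-layer gate is that layer 0 can start computing while later layers are still loading, instead of waiting for the whole cache to arrive.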
@@ -93,9 +93,8 @@ This example implements a file-system based KV cache.
 This example illustrates the API mechanics but has several limitations that make it unsuitable for high-performance production use without modification:
 
 1. **Blocking I/O**: The example uses `torch.load` and `torch.save` synchronously. In a real implementation, these should be offloaded to a background thread or asynchronous I/O handler to avoid stalling the GPU.
-2. **Simplified Block Matching**: The `get_num_new_matched_tokens` implementation in the example only matches full blocks aligned to the start of the sequence. It doesn't handle complex prefix matching or partial blocks efficiently.
-3. **No Chunked Prefill**: The example notes it "does not work with chunked prefill," meaning it assumes the entire prefill happens in one go.
-4. **FileSystem Latency**: Storing one file per block can create high filesystem overhead. Production systems typically aggregate blocks or use a key-value store database (like Redis) or object store.
+2. **Simplified Block Matching**: The `get_num_new_matched_tokens` implementation in the example only matches full blocks. It does not handle partial cache hits.
+3. **FileSystem Latency**: Storing one file per block can create high filesystem overhead.
 
 ### Usage
 
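Note on the "Blocking I/O" limitation in the hunk above: one way to offload saves to a background thread is sketched below. This is a sketch under stated assumptions, not the example's implementation. The class `BackgroundBlockSaver` and its methods are hypothetical, `pickle` stands in for `torch.save` so the snippet runs without PyTorch, and the payloads are assumed to be host-side objects that are safe to hand to another thread.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class BackgroundBlockSaver:
    """Write one file per block on a worker thread (hypothetical helper)."""

    def __init__(self, cache_dir, max_workers=2):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending = {}  # request id -> list of futures

    def _write(self, key, payload):
        # Write to a temp file first so readers never see a partial block.
        tmp = self.cache_dir / f"{key}.tmp"
        with open(tmp, "wb") as f:
            pickle.dump(payload, f)  # stand-in for torch.save
        tmp.replace(self.cache_dir / f"{key}.bin")

    def save_async(self, request_id, blocks):
        """Queue all blocks; return True if an asynchronous save is underway."""
        futures = [self._pool.submit(self._write, k, p) for k, p in blocks.items()]
        if futures:
            self._pending[request_id] = futures
        return bool(futures)

    def finished_requests(self):
        """Return ids whose saves completed, re-raising any write error."""
        done = [rid for rid, fs in self._pending.items() if all(f.done() for f in fs)]
        for rid in done:
            for f in self._pending.pop(rid):
                f.result()
        return done

    def close(self):
        self._pool.shutdown(wait=True)
```

The boolean returned by `save_async` mirrors the shape of the `request_finished` contract: `True` means a save is still in flight, so the caller should hold the blocks until the request id later shows up in `finished_requests()`.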