Disk cache and cpu Cache feature #997

Open · wants to merge 64 commits into main

Conversation

hiworldwzj (Collaborator)

No description provided.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a significant new feature for CPU- and disk-based KV caching. The implementation is extensive, touching many parts of the system, from low-level Triton kernels to high-level server logic. While the overall direction is good, several critical issues need to be addressed, including incorrect kernel calls, logical errors in loops, and improper shared memory allocation. There are also numerous typos (e.g., 'chuncked' instead of 'chunked') that should be corrected for code clarity and consistency.


    # create the shared memory segment
    key = args.cpu_kv_cache_shm_id  # key for the shared memory segment
    size = 1024  # shared memory size

critical

The shared memory size is hardcoded to 1024 bytes, which is far too small for the KV cache. The actual required size should be calculated using calcu_cpu_cache_meta().calcu_size() and used here.

Suggested change
- size = 1024  # shared memory size
+ size = calcu_cpu_cache_meta().calcu_size()  # shared memory size
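
For scale, a minimal, self-contained sketch of sizing the segment from the cache geometry rather than a constant. All names and dimensions here are illustrative (in the PR, this computation is what calcu_cpu_cache_meta().calcu_size() should provide), and multiprocessing.shared_memory stands in for whatever shm API the PR actually uses:

    from multiprocessing import shared_memory

    # Illustrative geometry; storing K and V together gives the leading factor of 2.
    def cpu_kv_cache_nbytes(layer_num, page_num, tokens_per_page,
                            head_num, head_dim, elem_size=2):  # fp16 -> 2 bytes
        return 2 * layer_num * page_num * tokens_per_page * head_num * head_dim * elem_size

    size = cpu_kv_cache_nbytes(layer_num=2, page_num=16, tokens_per_page=64,
                               head_num=8, head_dim=128)  # 8 MiB -- far beyond 1024 bytes
    shm = shared_memory.SharedMemory(name="cpu_kv_cache_demo", create=True, size=size)
    shm.close()
    shm.unlink()  # demo cleanup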

            self.multi_level_cache_manager.cpu_cache_handle_queue.append(trans_task)
        else:
            true_finished_reqs.append(req)
        return true_finished_reqs

critical

The return true_finished_reqs statement is inside the for loop, which will cause the function to exit after processing only the first finished request. This is incorrect and will lead to other finished requests not being processed for CPU cache offloading. The return statement should be moved outside the loop.

Suggested change
-         return true_finished_reqs
+     return true_finished_reqs
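
A sketch of the corrected control flow (the enclosing method and helper names are assumed from the excerpt, not taken verbatim from the PR):

    def handle_finished_reqs(self, finished_reqs: List[InferReq]) -> List[InferReq]:
        true_finished_reqs = []
        for req in finished_reqs:
            trans_task = self._build_offload_task(req)  # hypothetical helper
            if trans_task is not None:
                self.multi_level_cache_manager.cpu_cache_handle_queue.append(trans_task)
            else:
                true_finished_reqs.append(req)
        # Return outside the loop, after every finished request has been examined.
        return true_finished_reqs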

Comment on lines 186 to 197
        while cur_link_item is not None:
            if cur_link_item.key == key:
                # remove item
                pre_item = cur_link_item.get_pre_item()
                pre_item.next_index = cur_link_item.next_index
                if cur_link_item.next_index != -1:
                    next_item = cur_link_item.get_next_item()
                    next_item.pre_index = pre_item.self_index

                self.link_items.add_item_to_tail(index=cur_link_item.self_index)
            else:
                cur_link_item = cur_link_item.get_next_item()

critical

The remove method in ShmDict has a bug. After finding and removing an item, the while loop continues without updating cur_link_item, leading to an infinite loop. Assuming keys are unique, the loop should terminate after removal.

Suggested change
          while cur_link_item is not None:
              if cur_link_item.key == key:
                  # remove item
                  pre_item = cur_link_item.get_pre_item()
                  pre_item.next_index = cur_link_item.next_index
                  if cur_link_item.next_index != -1:
                      next_item = cur_link_item.get_next_item()
                      next_item.pre_index = pre_item.self_index
                  self.link_items.add_item_to_tail(index=cur_link_item.self_index)
+                 return
              else:
                  cur_link_item = cur_link_item.get_next_item()
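
As a standalone illustration of the bug class (plain object references instead of ShmDict's shared-memory indices): after the matched node is unlinked, the cursor is never advanced, so `while cur is not None` re-tests the same node forever. Returning immediately after removal terminates the scan and is correct when keys are unique:

    class Node:
        def __init__(self, key):
            self.key = key
            self.pre = None
            self.next = None

    def remove(head, key):
        """Remove the first node with `key`; `head` is a sentinel node."""
        cur = head.next
        while cur is not None:
            if cur.key == key:
                cur.pre.next = cur.next            # unlink the match
                if cur.next is not None:
                    cur.next.pre = cur.pre
                return cur                         # keys are unique: stop here
            cur = cur.next                         # advance only on non-match
        return None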

Comment on lines +206 to +227
    _offload_gpu_kv_to_cpu[grid](
        token_indexes_ptr=mem_indexes,
        gpu_kv_cache_ptr=gpu_kv_cache,
        gpu_stride0=gpu_kv_cache.stride(0),
        gpu_stride1=gpu_kv_cache.stride(1),
        gpu_stride2=gpu_kv_cache.stride(2),
        gpu_stride3=gpu_kv_cache.stride(3),
        cpu_kv_cache_ptr=cpu_kv_cache,
        cpu_stride0=cpu_kv_cache.stride(0),
        cpu_stride1=cpu_kv_cache.stride(1),
        cpu_stride2=cpu_kv_cache.stride(2),
        cpu_stride3=cpu_kv_cache.stride(3),
        cpu_stride4=cpu_kv_cache.stride(4),
        page_indexes_ptr=page_indexes,
        layer_num=gpu_kv_cache.shape[0],
        head_all_dim=gpu_kv_cache.shape[-1] * gpu_kv_cache.shape[-2],
        all_move_token_num=len(mem_indexes),
        BLOCK_HEAD_ALL_DIM=BLOCK_HEAD_ALL_DIM,
        TOKEN_BLOCK=token_block_size,
        num_warps=num_warps,
        num_stages=1,
    )

critical

This function incorrectly calls the _offload_gpu_kv_to_cpu kernel instead of _load_cpu_cache_to_gpu. This is a critical error that will cause data to be moved in the wrong direction. The arguments passed are also incorrect for the intended operation, and the chuncked_size parameter is missing.

    _load_cpu_cache_to_gpu[grid](
        token_indexes_ptr=mem_indexes,
        gpu_kv_cache_ptr=gpu_kv_cache,
        gpu_stride0=gpu_kv_cache.stride(0),
        gpu_stride1=gpu_kv_cache.stride(1),
        gpu_stride2=gpu_kv_cache.stride(2),
        gpu_stride3=gpu_kv_cache.stride(3),
        cpu_kv_cache_ptr=cpu_kv_cache,
        cpu_stride0=cpu_kv_cache.stride(0),
        cpu_stride1=cpu_kv_cache.stride(1),
        cpu_stride2=cpu_kv_cache.stride(2),
        cpu_stride3=cpu_kv_cache.stride(3),
        cpu_stride4=cpu_kv_cache.stride(4),
        page_indexes_ptr=page_indexes,
        layer_num=gpu_kv_cache.shape[0],
        head_all_dim=gpu_kv_cache.shape[-1] * gpu_kv_cache.shape[-2],
        all_move_token_num=len(mem_indexes),
        chuncked_size=token_num,
        BLOCK_HEAD_ALL_DIM=BLOCK_HEAD_ALL_DIM,
        TOKEN_BLOCK=token_block_size,
        num_warps=num_warps,
        num_stages=1,
    )
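
To make the direction distinction concrete, here is a host-side sketch of both movements in plain torch (not the PR's Triton kernels, which do this on-device). The layouts are assumed: gpu_kv as [layer, token_slot, head, dim], cpu_kv as [page, layer, page_token, head, dim], mem_indexes as a 1-D index tensor on the same device as gpu_kv, and page_tokens as the per-page token capacity:

    import torch

    def offload_gpu_to_cpu(gpu_kv, cpu_kv, mem_indexes, page_indexes, page_tokens):
        # Scatter GPU token slots into the paged CPU cache.
        toks = gpu_kv[:, mem_indexes]                       # [layer, n, head, dim]
        n = mem_indexes.numel()
        for i, page in enumerate(page_indexes.tolist()):
            s, e = i * page_tokens, min((i + 1) * page_tokens, n)
            cpu_kv[page, :, : e - s] = toks[:, s:e].cpu()

    def load_cpu_to_gpu(gpu_kv, cpu_kv, mem_indexes, page_indexes, page_tokens):
        # Gather paged CPU cache back into GPU token slots: the inverse movement.
        n = mem_indexes.numel()
        for i, page in enumerate(page_indexes.tolist()):
            s, e = i * page_tokens, min((i + 1) * page_tokens, n)
            gpu_kv[:, mem_indexes[s:e]] = cpu_kv[page, :, : e - s].to(gpu_kv.device)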

        req: Req = req
        finded_page_indexes = []
        for token_chuncked_hash_value in req.token_hash_list.get_all():
            page_index = self.cpu_cache_client.query_one_page(token_chuncked_hash_value)

high

The method query_one_page returns a tuple (page_index, is_ready), but the result is assigned to a single variable page_index. This will cause incorrect behavior as page_index will be a tuple. The tuple should be unpacked. Also, the variable token_chuncked_hash_value has a typo.

Suggested change
- page_index = self.cpu_cache_client.query_one_page(token_chuncked_hash_value)
+ page_index, _ = self.cpu_cache_client.query_one_page(token_chunked_hash_value)
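
The suggestion discards is_ready with `_`; a slightly fuller sketch of the matching loop (the (page_index, is_ready) contract is taken from the comment above, and the -1 miss sentinel is assumed from similar code in this PR) would also check readiness so half-written pages are not counted as prefix hits:

    finded_page_indexes = []
    for token_chunked_hash_value in req.token_hash_list.get_all():
        page_index, is_ready = self.cpu_cache_client.query_one_page(token_chunked_hash_value)
        if page_index == -1 or not is_ready:
            break  # stop at the first miss so the matched prefix stays contiguous
        finded_page_indexes.append(page_index)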

    ) -> Optional["TransTask"]:
        with torch.cuda.stream(cpu_kv_cache_stream):
            all_token_hash_list = req.shm_req.token_hash_list.get_all()
            block_size = req.cur_kv_len // self.args.cpu_cache_token_chuncked_size

medium

There is a typo in an attribute name. cpu_cache_token_chuncked_size should be cpu_cache_token_chunked_size.

Suggested change
- block_size = req.cur_kv_len // self.args.cpu_cache_token_chuncked_size
+ block_size = req.cur_kv_len // self.args.cpu_cache_token_chunked_size


    def fill_cpu_cache_to_reqs(self, reqs: List[InferReq]):
        idle_token_num = g_infer_context.get_can_alloc_token_num()
        token_chuncked_size = self.args.cpu_cache_token_chuncked_size

medium

There is a typo in a variable name. token_chuncked_size should be token_chunked_size.

Suggested change
- token_chuncked_size = self.args.cpu_cache_token_chuncked_size
+ token_chunked_size = self.args.cpu_cache_token_chunked_size

        for req in reqs:
            if req.shm_req.group_req_id == req.shm_req.request_id:
                page_list = req.shm_req.cpu_cache_match_page_indexes.get_all()
                match_tokens = len(page_list) * token_chuncked_size

medium

There is a typo in a variable name. token_chuncked_size should be token_chunked_size.

Suggested change
- match_tokens = len(page_list) * token_chuncked_size
+ match_tokens = len(page_list) * token_chunked_size

    if cpu_page_index == -1:
        return

    first_block_start_index = chuncked_size * tl.num_programs(0) - all_move_token_num

medium

There is a typo in the variable name chuncked_size. It should be chunked_size.

Suggested change
- first_block_start_index = chuncked_size * tl.num_programs(0) - all_move_token_num
+ first_block_start_index = chunked_size * tl.num_programs(0) - all_move_token_num
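
For intuition about the formula above (an assumed reading, since the kernel's full context is not shown): if chunked_size is the per-page token capacity, then tl.num_programs(0) blocks give a total capacity of chunked_size * tl.num_programs(0). With chunked_size = 64, four programs, and all_move_token_num = 200, the expression yields 256 - 200 = 56, i.e. the first block is partial and starts 56 tokens into its page while the remaining blocks are full.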

    layer_num,
    head_all_dim,
    all_move_token_num,
    chuncked_size,

medium

There is a typo in the parameter name chuncked_size. It should be chunked_size for consistency and correctness.

Suggested change
- chuncked_size,
+ chunked_size,
