Commit c37924f

[None][fix] Clear indexer k cache reference before release cuda memory (#9110)
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
1 parent: cde18c1

1 file changed: +5 −0 lines

tensorrt_llm/_torch/attention_backend/sparse/dsa.py

Lines changed: 5 additions & 0 deletions
@@ -1528,6 +1528,11 @@ def get_indexer_k_cache_buffers(self, layer_idx: int):
         return self.indexer_k_cache_pool_per_layer[layer_offset].view(
             self.num_blocks, block_size, 1, per_token_size)
 
+    def shutdown(self):
+        # Clear Python references BEFORE C++ frees the underlying CUDA buffers
+        self.indexer_k_cache_pool_per_layer = []
+        super().shutdown()
+
     @staticmethod
     def get_cache_size_per_token(model_config: ModelConfig, mapping: Mapping,
                                  **kwargs):
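
Why the ordering matters: the per-layer buffers are views into a cache pool whose backing CUDA allocation is released by the base class (on the C++ side) during shutdown, so clearing the Python-side references first ensures nothing in Python still points at that storage when it is freed. Below is a minimal sketch of the same ordering in plain PyTorch; BaseCachePool, IndexerKCachePool, and their constructor arguments are hypothetical stand-ins, not the actual TensorRT-LLM API, and Python refcounting here only approximates the C++-owned pool in the real code.

# Illustrative sketch only: class names and constructor arguments are
# hypothetical stand-ins, not TensorRT-LLM classes.
import torch


class BaseCachePool:
    def __init__(self, num_layers: int, layer_bytes: int):
        # One flat CUDA allocation owned by the pool (stand-in for the
        # C++-managed pool in the real code).
        self._pool = torch.empty(num_layers * layer_bytes,
                                 dtype=torch.uint8, device="cuda")

    def shutdown(self):
        # Drop the pool and ask the allocator to return the memory
        # (stand-in for the C++ side freeing the CUDA buffers).
        self._pool = None
        torch.cuda.empty_cache()


class IndexerKCachePool(BaseCachePool):
    def __init__(self, num_layers: int, layer_bytes: int):
        super().__init__(num_layers, layer_bytes)
        # Per-layer views aliasing the pool's storage, analogous to
        # indexer_k_cache_pool_per_layer in the diff above.
        self.indexer_k_cache_pool_per_layer = [
            self._pool[i * layer_bytes:(i + 1) * layer_bytes]
            for i in range(num_layers)
        ]

    def shutdown(self):
        # Clear Python references BEFORE the base class releases the CUDA
        # buffers, so no view still points at (or keeps alive) freed storage.
        self.indexer_k_cache_pool_per_layer = []
        super().shutdown()


# Usage: views are dropped first, then the backing allocation is released.
# pool = IndexerKCachePool(num_layers=2, layer_bytes=1 << 20)
# pool.shutdown()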
