Environment:
TensorRT-LLM version: v1.0.0rc1
Hardware: H20-96G GPUs with an InfiniBand (IB) network
Description:
I ran stress tests against a PD (prefill/decode disaggregated) deployment serving DeepSeek-R1 and observed that the prefill node occasionally hangs under load.
After preliminary investigation, I found that:
- The prefill node fails to release KV cache space allocated for failed requests
- This leads to request accumulation in the prefill node
- Both failed requests and normal in-transit requests are tracked in PyExecutor.ctx_in_transmission_requests
- The hang occurs probabilistically during the synchronous status check of these requests via self.kv_cache_transceiver.check_context_transfer_status(1) (a minimal sketch of this failure mode follows)
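The following is a minimal, self-contained sketch of the suspected failure mode, not TensorRT-LLM code: FakeTransceiver, executor_loop, and the request list are stand-ins illustrating how a list that holds only failed requests can make a blocking status check wait forever.

```python
import queue
import threading

class FakeTransceiver:
    """Stand-in for the kv_cache_transceiver; completions never arrive for failed requests."""
    def __init__(self):
        self._completed = queue.Queue()

    def check_context_transfer_status(self, at_least):
        # Blocks until at least `at_least` transfers report completion,
        # mirroring the synchronous check described above.
        for _ in range(at_least):
            self._completed.get()  # never returns if no transfer ever completes

# Only failed requests remain in flight; their transfers will never complete.
ctx_in_transmission_requests = ["failed-req-1", "failed-req-2"]
transceiver = FakeTransceiver()

def executor_loop():
    while ctx_in_transmission_requests:
        # KV cache held by the failed requests is never released, and this call
        # blocks forever because no completion event will ever be delivered.
        transceiver.check_context_transfer_status(1)

worker = threading.Thread(target=executor_loop, daemon=True)
worker.start()
worker.join(timeout=2)
print("executor loop still blocked:", worker.is_alive())  # True -> simulated hang
```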
Expected Behavior:
The system should properly release KV cache resources for failed requests and maintain stable operation during context transfer status checks.
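One possible shape of that behavior, as a sketch only: `is_failed` and `free_kv_cache` are hypothetical helpers, not the actual PyExecutor API, and the real fix may belong elsewhere in the executor loop.

```python
def release_failed_and_wait(ctx_in_transmission_requests, kv_cache_transceiver,
                            is_failed, free_kv_cache):
    """Release KV cache for failed requests, then wait only on live transfers."""
    still_in_flight = []
    for req in ctx_in_transmission_requests:
        if is_failed(req):
            free_kv_cache(req)  # reclaim blocks held by the failed request
        else:
            still_in_flight.append(req)
    ctx_in_transmission_requests[:] = still_in_flight

    # Block on transfer status only when there is something left to wait for,
    # so a list containing only failed requests can no longer stall the loop.
    if ctx_in_transmission_requests:
        kv_cache_transceiver.check_context_transfer_status(1)
```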
Reproduction Steps:
- Configure a PD deployment.
- Stress test the service with high traffic (unconstrained concurrency; a rough driver sketch follows this list) until:
  - The system exceeds its maximum processing capacity.
  - Requests start queuing due to overload.
- Abruptly terminate the stress test script (this simulates many failed requests and rapidly exhausts the KV cache on the prefill node).
- Observe the behavior:
  - The prefill node continues processing some remaining requests.
  - Eventually, it hangs and stops accepting new requests.
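A rough driver along these lines can be used for the stress step; the endpoint URL, payload, and request count are placeholders that depend on how the PD service is fronted, not part of the original report.

```python
import asyncio
import aiohttp

# Placeholders: adjust to the actual frontend of the PD deployment.
URL = "http://<prefill-frontend>:8000/v1/completions"
PAYLOAD = {"model": "deepseek-r1", "prompt": "hello " * 2048, "max_tokens": 64}

async def fire(session):
    try:
        async with session.post(URL, json=PAYLOAD) as resp:
            await resp.read()
    except Exception:
        pass  # failed/aborted requests are exactly what this test provokes

async def main():
    async with aiohttp.ClientSession() as session:
        # Unconstrained concurrency: launch far more requests than the node can
        # serve, then kill this script (Ctrl+C / SIGKILL) while they are queued.
        await asyncio.gather(*(fire(session) for _ in range(5000)))

asyncio.run(main())
```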