KV Cache Memory Leakage in Prefill Node During Context Transfer Failures #5820

@Nekofish-L

Environment:
TensorRT-LLM version: v1.0.0rc1
Hardware: H20-96G GPUs with an InfiniBand (IB) network

Description:
I ran a prefill-decode (PD) disaggregated deployment of DeepSeek-R1 and observed that the prefill node occasionally hangs during stress testing.

After preliminary investigation, I found that:

  • The prefill node does not release the KV cache space allocated for requests whose context transfer fails.
  • As a result, requests accumulate on the prefill node.
  • Both failed requests and normal in-transit requests are tracked in PyExecutor.ctx_in_transmission_requests.
  • The hang occurs intermittently during the synchronous status check of these requests via self.kv_cache_transceiver.check_context_transfer_status(1).
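
To illustrate the suspected mechanism, here is a minimal toy model of a block pool in which failed transfers stay tracked and keep their blocks. It is a sketch under my own assumptions, not TensorRT-LLM code: `ToyPrefillNode`, `admit`, `on_transfer_done`, and `release_on_failure` are hypothetical names, and only the idea of an "in transmission" collection comes from the observations above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPrefillNode:
    """Toy model of a prefill node's KV block pool (hypothetical, not TensorRT-LLM)."""
    total_blocks: int
    free_blocks: int = field(init=False)
    # request id -> number of KV blocks held while the context is in transit
    in_transmission: dict = field(default_factory=dict)

    def __post_init__(self):
        self.free_blocks = self.total_blocks

    def admit(self, req_id, blocks):
        """Reserve KV blocks for a request; return False if the pool is exhausted."""
        if blocks > self.free_blocks:
            return False
        self.free_blocks -= blocks
        self.in_transmission[req_id] = blocks
        return True

    def on_transfer_done(self, req_id, ok, release_on_failure):
        """Free blocks on success, and on failure only if release_on_failure is set."""
        if ok or release_on_failure:
            self.free_blocks += self.in_transmission.pop(req_id)
        # Otherwise the failed request stays tracked and keeps its blocks (the leak).


def run(release_on_failure):
    """Send 20 requests whose transfers all fail; report what the node ends up with."""
    node = ToyPrefillNode(total_blocks=8)
    admitted = 0
    for req_id in range(20):
        if not node.admit(req_id, blocks=2):
            break  # pool exhausted: no new request can be accepted
        admitted += 1
        node.on_transfer_done(req_id, ok=False, release_on_failure=release_on_failure)
    return admitted, node.free_blocks, len(node.in_transmission)


print(run(release_on_failure=False))  # leaking variant
print(run(release_on_failure=True))   # variant that frees blocks on failure
```

In this model the leaking variant stops admitting requests once the pool is full of blocks held by failed transfers, which resembles the accumulation I see on the prefill node. Whether the real code path behaves this way is still an assumption on my side.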

Expected Behavior:
The system should properly release KV cache resources for failed requests and maintain stable operation during context transfer status checks.

Reproduction Steps:

  • Configure a PD deployment.
  • Stress test the service with high traffic (unconstrained concurrency) until:
    • The system exceeds its maximum processing capacity.
    • Requests start queuing due to overload.
  • Abruptly terminate the stress test script. This simulates many simultaneous request failures and rapidly exhausts the KV cache on the prefill node.
  • Observe behavior:
    • The prefill node continues processing some remaining requests.
    • Eventually, it hangs and stops accepting new requests.
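
The load pattern in the steps above can be sketched roughly as follows. This is not the script I used; it is a minimal in-process approximation in which `fake_request` is a hypothetical stand-in for a real client call to the serving endpoint, and `stress_then_abort` and its parameters are illustrative names. Cancelling all tasks at once plays the role of killing the stress test script.

```python
import asyncio


async def fake_request(i, latency_s=1.0):
    """Stand-in for one inference request (hypothetical; swap in a real client call)."""
    await asyncio.sleep(latency_s)
    return i


async def stress_then_abort(num_requests=200, abort_after_s=0.05, send=fake_request):
    """Launch requests with no concurrency limit, then cancel everything in flight."""
    # Unconstrained concurrency: every request is started immediately.
    tasks = [asyncio.create_task(send(i)) for i in range(num_requests)]

    # Let the service build up a queue, then abort all clients at the same moment.
    await asyncio.sleep(abort_after_s)
    for task in tasks:
        task.cancel()

    # Collect outcomes; cancelled tasks are returned as CancelledError instances.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    cancelled = sum(isinstance(r, asyncio.CancelledError) for r in results)
    return cancelled, len(results) - cancelled


if __name__ == "__main__":
    cancelled, completed = asyncio.run(stress_then_abort())
    print(f"cancelled={cancelled} completed={completed}")
```

With a real client passed as `send`, the cancellation would drop many connections at once, which is the condition under which I see the prefill node eventually stop accepting requests.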
