Environment:
TensorRT-LLM version: v1.0.0rc1
Hardware: H20-96G GPUs with an InfiniBand (IB) network
Description:
I ran stress tests against a PD (prefill/decode disaggregated) deployment serving DeepSeek-R1 and observed that the prefill node occasionally hangs under load.
After preliminary investigation, I found that:
- The prefill node fails to release KV cache space allocated for failed requests
- This leads to request accumulation in the prefill node
- Both failed requests and normal in-transit requests are tracked in PyExecutor.ctx_in_transmission_requests
- The hang occurs probabilistically during the synchronous status check of these requests via self.kv_cache_transceiver.check_context_transfer_status(1) (a minimal sketch of this failure mode follows)
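The following is a minimal, self-contained sketch of the suspected failure mode, not TensorRT-LLM code: FakeTransceiver, executor_loop, and the request list are stand-ins illustrating how a list that holds only failed requests can make a blocking status check wait forever.

```python
import queue
import threading

class FakeTransceiver:
    """Stand-in for the kv_cache_transceiver; completions never arrive for failed requests."""
    def __init__(self):
        self._completed = queue.Queue()

    def check_context_transfer_status(self, at_least):
        # Blocks until at least `at_least` transfers report completion,
        # mirroring the synchronous check described above.
        for _ in range(at_least):
            self._completed.get()  # never returns if no transfer ever completes

# Only failed requests remain in flight; their transfers will never complete.
ctx_in_transmission_requests = ["failed-req-1", "failed-req-2"]
transceiver = FakeTransceiver()

def executor_loop():
    while ctx_in_transmission_requests:
        # KV cache held by the failed requests is never released, and this call
        # blocks forever because no completion event will ever be delivered.
        transceiver.check_context_transfer_status(1)

worker = threading.Thread(target=executor_loop, daemon=True)
worker.start()
worker.join(timeout=2)
print("executor loop still blocked:", worker.is_alive())  # True -> simulated hang
```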
Expected Behavior:
The system should properly release KV cache resources for failed requests and maintain stable operation during context transfer status checks.
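One possible shape of that behavior, as a sketch only: `is_failed` and `free_kv_cache` are hypothetical helpers, not the actual PyExecutor API, and the real fix may belong elsewhere in the executor loop.

```python
def release_failed_and_wait(ctx_in_transmission_requests, kv_cache_transceiver,
                            is_failed, free_kv_cache):
    """Release KV cache for failed requests, then wait only on live transfers."""
    still_in_flight = []
    for req in ctx_in_transmission_requests:
        if is_failed(req):
            free_kv_cache(req)  # reclaim blocks held by the failed request
        else:
            still_in_flight.append(req)
    ctx_in_transmission_requests[:] = still_in_flight

    # Block on transfer status only when there is something left to wait for,
    # so a list containing only failed requests can no longer stall the loop.
    if ctx_in_transmission_requests:
        kv_cache_transceiver.check_context_transfer_status(1)
```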
Reproduction Steps:
- Configure a PD deployment.
- Stress test the service with high traffic (unconstrained concurrency; a rough driver sketch follows this list) until:
  - The system exceeds its maximum processing capacity.
  - Requests start queuing due to overload.
- Abruptly terminate the stress test script (this simulates many failed requests and rapidly exhausts the KV cache on the prefill node).
- Observe the behavior:
  - The prefill node continues processing some remaining requests.
  - Eventually, it hangs and stops accepting new requests.
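A rough driver along these lines can be used for the stress step; the endpoint URL, payload, and request count are placeholders that depend on how the PD service is fronted, not part of the original report.

```python
import asyncio
import aiohttp

# Placeholders: adjust to the actual frontend of the PD deployment.
URL = "http://<prefill-frontend>:8000/v1/completions"
PAYLOAD = {"model": "deepseek-r1", "prompt": "hello " * 2048, "max_tokens": 64}

async def fire(session):
    try:
        async with session.post(URL, json=PAYLOAD) as resp:
            await resp.read()
    except Exception:
        pass  # failed/aborted requests are exactly what this test provokes

async def main():
    async with aiohttp.ClientSession() as session:
        # Unconstrained concurrency: launch far more requests than the node can
        # serve, then kill this script (Ctrl+C / SIGKILL) while they are queued.
        await asyncio.gather(*(fire(session) for _ in range(5000)))

asyncio.run(main())
```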