Labels
Memory — Memory utilization in TRTLLM: leak/OOM handling, footprint optimization, memory profiling.
Performance — TRTLLM model inference speed, throughput, efficiency. Latency, benchmarks, regressions, opts.
Speculative Decoding&lt;NV&gt; — MTP/Eagle/Medusa/Lookahead/Prompt-Lookup-Decoding/Draft-Target-Model/ReDrafter
Description
Proposal to improve performance
Currently, in Draft-Target speculative decoding, the KVCacheConfig that the target model is configured with is passed along unchanged to the separate draft-model KV cache. This can reserve far more memory for the draft-model KV cache than it needs, since the draft model's cache can be made much smaller than the target model's (based on the ratio of attention layers between the two models and the number of draft tokens generated per step). A rough sizing sketch is shown below.
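For illustration only, here is a minimal sketch of the idea; it is not TensorRT-LLM code, and the function names, model dimensions, and sizing heuristic are all hypothetical. It estimates a draft-model KV cache budget from the target model's budget using the per-token KV footprint ratio (driven by the number of attention layers) and the number of draft tokens generated per step, instead of reusing the target's KVCacheConfig verbatim.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Approximate KV-cache bytes stored per token (K and V for every attention layer)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def suggested_draft_kv_cache_bytes(target_kv_cache_bytes: int,
                                   target_bytes_per_token: int,
                                   draft_bytes_per_token: int,
                                   max_draft_tokens: int,
                                   max_seq_len: int) -> int:
    """Scale the target model's KV-cache memory budget down for the draft model.

    The draft model caches the same sequences as the target (plus a few
    speculative tokens per step), but each cached token costs far less
    because the draft model has fewer attention layers.
    """
    # Token capacity implied by the target's memory budget.
    token_capacity = target_kv_cache_bytes // target_bytes_per_token
    # Rough headroom for the extra speculative tokens drafted per sequence.
    approx_num_sequences = max(1, token_capacity // max_seq_len)
    draft_token_capacity = token_capacity + max_draft_tokens * approx_num_sequences
    return draft_token_capacity * draft_bytes_per_token


if __name__ == "__main__":
    # Hypothetical example: an 80-layer target model with an 8-layer draft model.
    target_per_tok = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
    draft_per_tok = kv_bytes_per_token(num_layers=8, num_kv_heads=8, head_dim=128)
    draft_budget = suggested_draft_kv_cache_bytes(
        target_kv_cache_bytes=40 * 1024**3,   # 40 GiB reserved for the target cache
        target_bytes_per_token=target_per_tok,
        draft_bytes_per_token=draft_per_tok,
        max_draft_tokens=4,
        max_seq_len=8192,
    )
    print(f"Suggested draft-model KV cache budget: {draft_budget / 1024**3:.2f} GiB")
```

With these (made-up) numbers the draft cache would need roughly a tenth of the target's reservation, which is the kind of saving the proposal is after when the draft model gets its own, appropriately sized KV cache configuration instead of inheriting the target's.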
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
System Information:
- OS:
- Python version:
- CUDA version:
- GPU model(s):
- Driver version:
- TensorRT version:
- PyTorch version:
- TensorRT-LLM version:
Detailed output:
Paste the output of the above commands here
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.