[AutoDeploy] Tune KVCacheConfig of drafter for memory optimization #9279

@govind-ramnarayan

Description

Proposal to improve performance

Currently, in DraftTarget speculative decoding, the KVCacheConfig that the target model is configured with is passed along unchanged to the separate draft-model KV cache. This can reserve far more memory for the draft model than it needs: the draft model's KV cache can be made much smaller than the target's, scaled by the ratio of attention-layer counts between the two models and by the number of draft tokens generated. A minimal sketch of this scaling follows below.
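For illustration only, here is a minimal sketch of the layer-ratio heuristic, assuming the LLM-API `KvCacheConfig` and its `free_gpu_memory_fraction` and `max_tokens` fields; the `draft_kv_cache_config` helper and the scaling rule itself are hypothetical, not existing TensorRT-LLM/AutoDeploy code.

```python
from tensorrt_llm.llmapi import KvCacheConfig

def draft_kv_cache_config(target_config: KvCacheConfig,
                          target_attn_layers: int,
                          draft_attn_layers: int) -> KvCacheConfig:
    # Hypothetical heuristic: the draft model's cache footprint scales roughly
    # with its share of attention layers, so shrink the reserved memory
    # fraction proportionally instead of inheriting the target's full budget.
    ratio = draft_attn_layers / target_attn_layers
    return KvCacheConfig(
        free_gpu_memory_fraction=target_config.free_gpu_memory_fraction * ratio,
    )

# Example: a 32-layer target with a 4-layer drafter reserves ~1/8 the fraction.
target_cfg = KvCacheConfig(free_gpu_memory_fraction=0.9)
draft_cfg = draft_kv_cache_config(target_cfg,
                                  target_attn_layers=32,
                                  draft_attn_layers=4)
```

The number of draft tokens generated could factor in similarly, e.g. by bounding `max_tokens` on the draft config; which knobs are the right ones depends on how AutoDeploy plumbs the config through to the draft model.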

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

System Information:

  • OS:
  • Python version:
  • CUDA version:
  • GPU model(s):
  • Driver version:
  • TensorRT version:
  • PyTorch version:
  • TensorRT-LLM version:

Detailed output:

Paste the output of the above commands here

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

  • Memory: Memory utilization in TRTLLM: leak/OOM handling, footprint optimization, memory profiling.
  • Performance: TRTLLM model inference speed, throughput, efficiency. Latency, benchmarks, regressions, opts.
  • Speculative Decoding<NV>: MTP/Eagle/Medusa/Lookahead/Prompt-Lookup-Decoding/Draft-Target-Model/ReDrafter

Projects

Status: Backlog
