Open
Labels: General perf&lt;NV&gt; (broad performance issues not specific to a particular component), Inference runtime&lt;NV&gt; (general operational aspects of TRTLLM execution not in other categories), Performance (TRTLLM model inference speed, throughput, efficiency), bug (something isn't working)
Description
System Info
GPU: NVIDIA A100, NVIDIA H100
TensorRT-LLM version: 1.0.0rc5
TensorRT-LLM commit: b3ca159
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Build: following the Gemma 3 guide
Serve: trtllm-serve with max_attention_window = [512, 512, 512, 512, 512, 3100]
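For reference, a minimal sketch of such a serve invocation. The YAML file name and the exact `kv_cache_config.max_attention_window` / `--extra_llm_api_options` spelling are my assumptions about the LLM-API options mechanism, not copied from the report:

```
# extra-config.yaml (hypothetical file name)
kv_cache_config:
  max_attention_window: [512, 512, 512, 512, 512, 3100]
```

```
# model path is a placeholder
trtllm-serve <model_dir> --extra_llm_api_options extra-config.yaml
```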
Expected behavior
When the sequence length is shorter than the minimum attention window, the concurrent batch size should remain the same as when max_attention_window is not set, since no layer's KV cache is truncated yet. When the sequence is longer, the batch size should increase, because the windows cap each sequence's KV-cache footprint.
This works as expected in vLLM.
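The expectation above can be sketched with a small arithmetic model. The window list matches the repro; the cache capacity, layer count, and helper function are hypothetical illustrations, not TensorRT-LLM internals:

```python
def kv_tokens_per_seq(seq_len, windows):
    # Each layer caches at most min(seq_len, window) tokens.
    return sum(min(seq_len, w) for w in windows)

# One window entry per layer (hypothetical 6-layer model).
windows = [512, 512, 512, 512, 512, 3100]
cache_capacity = 100_000  # hypothetical total KV-cache token budget

for seq_len in (400, 4096):
    windowed = kv_tokens_per_seq(seq_len, windows)
    unwindowed = seq_len * len(windows)  # footprint without windowing
    print(f"seq_len={seq_len}: batch {cache_capacity // unwindowed} "
          f"-> {cache_capacity // windowed} with windows")
```

For a 400-token sequence both footprints are equal, so the batch size should not change; for a 4096-token sequence the windowed footprint is much smaller, so the batch size should grow. The report observes the opposite: a decrease in both cases.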
Actual behavior
However, the batch size decreases in both cases, resulting in a significant drop in throughput.
Additional notes
This behavior has continued since the referenced commit.