[Performance]: [AutoDeploy] Benchmark and analyze AD-vLLM perf gap for Nemotron MoE FP8 tp=1

### Proposal to improve performance

Benchmark AutoDeploy (AD) on Nemotron MoE FP8 with tp=1 and compare against vLLM on H100 and B200:

1. Sweep over max concurrency and plot Pareto curves of output tok/s vs. tok/user/s for both AD and vLLM.
2. Dump traces for both vLLM and AD.
3. Analyze the traces and identify possible performance optimizations for AD.
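
A minimal sketch of how the Pareto curve could be derived from the concurrency sweep, assuming each run records its max concurrency, total output tokens, and wall-clock duration. The `SweepPoint` type, its field names, and `pareto_frontier` are hypothetical helpers rather than part of AD or vLLM, and the per-user rate is approximated as throughput divided by concurrency, which assumes every slot stays busy for the whole run.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SweepPoint:
    """One benchmark run at a fixed max concurrency (hypothetical record layout)."""
    concurrency: int      # max concurrency used for the run
    output_tokens: int    # total output tokens generated during the run
    duration_s: float     # wall-clock duration of the run, in seconds

    @property
    def output_tok_per_s(self) -> float:
        # Aggregate output throughput across all users.
        return self.output_tokens / self.duration_s

    @property
    def tok_per_user_per_s(self) -> float:
        # Assumption: all `concurrency` slots are busy for the entire run.
        return self.output_tok_per_s / self.concurrency


def pareto_frontier(points: List[SweepPoint]) -> List[SweepPoint]:
    """Keep points not dominated in (output tok/s, tok/user/s), both maximized."""
    # Visit points from highest to lowest throughput; a point survives only if
    # its per-user rate beats every point with equal or higher throughput.
    ordered = sorted(
        points, key=lambda p: (-p.output_tok_per_s, -p.tok_per_user_per_s)
    )
    frontier: List[SweepPoint] = []
    best_per_user = float("-inf")
    for p in ordered:
        if p.tok_per_user_per_s > best_per_user:
            frontier.append(p)
            best_per_user = p.tok_per_user_per_s
    return frontier


# Made-up placeholder numbers for illustration only, not measurements.
example_sweep = [
    SweepPoint(concurrency=1, output_tokens=1000, duration_s=10.0),
    SweepPoint(concurrency=4, output_tokens=3200, duration_s=10.0),
    SweepPoint(concurrency=8, output_tokens=3000, duration_s=10.0),
    SweepPoint(concurrency=16, output_tokens=4800, duration_s=10.0),
]

for point in pareto_frontier(example_sweep):
    print(point.concurrency, point.output_tok_per_s, point.tok_per_user_per_s)
```

Running the same function on the AD sweep and the vLLM sweep separately would give one frontier per backend, which can then be overlaid on a single plot per GPU.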



### Report of performance regression

_No response_

### Misc discussion on performance

_No response_

### Your current environment (if you think it is necessary)

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and checked the [documentation](https://nvidia.github.io/TensorRT-LLM/) and [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for answers to frequently asked questions.

[Performance]: [AutoDeploy] Benchmark and analyze AD-vLLM perf gap for Nemotron MoE FP8 tp=1 #9268
