Replies: 2 comments
- @yechank-nvidia Hi Yechan, can you help follow this ask from the community? Thanks
- @juney-nvidia @yechank-nvidia Thanks
I'm currently working on building a TensorRT-LLM engine from an LLM that was quantized with `ModelOpt` using SmoothQuant. However, I've run into some difficulties because the SmoothQuant implementations in `ModelOpt` and `TensorRT-LLM` appear to differ slightly, especially regarding output scaling (`scale_y`). While `ModelOpt` applies SmoothQuant without output scaling during inference, `TensorRT-LLM` expects that scale to be used as the input scale for the `SmoothQuantGemm` plugin. Because of this difference, a model quantized with `ModelOpt` cannot be used directly to build a TensorRT-LLM engine without additional modifications, which leads to compatibility issues.

I would like to understand how this difference should be handled when exporting a `ModelOpt`-quantized checkpoint to TensorRT-LLM. Any technical explanation or best practice for addressing it would be greatly appreciated.
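To make the difference concrete, here is a rough NumPy sketch of the two dequantization conventions as I understand them. The scale names (`scale_x`, `scale_w`, `scale_y`) and the smoothing-factor heuristic are only illustrative; they are not the actual tensor names or algorithms used by `ModelOpt` or the `SmoothQuantGemm` plugin.

```python
import numpy as np

def quantize_int8(t, scale):
    """Symmetric per-tensor int8 quantization."""
    return np.clip(np.round(t / scale), -128, 127).astype(np.int8)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)    # activations
w = rng.standard_normal((8, 16)).astype(np.float32)   # weights

# SmoothQuant: migrate activation outliers into the weights with a
# per-input-channel smoothing factor s (illustrative heuristic only).
s = np.sqrt(np.maximum(np.abs(x).max(axis=0), 1e-5))
x_s, w_s = x / s, w * s[:, None]

scale_x = np.abs(x_s).max() / 127.0
scale_w = np.abs(w_s).max() / 127.0
xq, wq = quantize_int8(x_s, scale_x), quantize_int8(w_s, scale_w)

# int8 GEMM with an int32 accumulator.
acc = xq.astype(np.int32) @ wq.astype(np.int32)

# Convention A (what I believe ModelOpt simulates): dequantize the GEMM
# result straight back to float, with no separate output scale.
y_fp = acc.astype(np.float32) * (scale_x * scale_w)

# Convention B (what the SmoothQuantGemm plugin seems to expect): an extra
# per-tensor output scale (scale_y) requantizes the result to int8, so it
# can serve as the quantized input of the next int8 GEMM.
scale_y = np.abs(y_fp).max() / 127.0
yq = quantize_int8(y_fp, scale_y)

print("max abs error vs. fp32 GEMM:", np.abs(y_fp - x @ w).max())
```

The gap I'm trying to close is whether such an output/requantization scale has to be derived and exported separately when converting the `ModelOpt` checkpoint, or whether there is a supported conversion path that handles it.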