Closed
Labels: Low Precision, question, triaged, waiting for feedback
Description
I have encountered a problem when benchmarking the int8 GEMM kernel below: launched as a standalone kernel it performs as expected.
TensorRT-LLM/cpp/tensorrt_llm/kernels/cutlass_kernels/int8_gemm/int8_gemm_template.h, lines 62 to 63 in a65dba7:
void genericInt8GemmKernelLauncher(int8_t const* A, int8_t const* B, tk::QuantMode quantOption, float const* alphaCol,
    float const* alphaRow, T* C, int m, int n, int k, tkc::CutlassGemmConfig gemmConfig, char* workspace,
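By the separate-kernel benchmark I mean an isolated timing loop along these lines (a sketch with hypothetical names: timeKernelUs and runGemm are not TensorRT-LLM APIs; runGemm stands in for whatever invokes genericInt8GemmKernelLauncher with the profiled CutlassGemmConfig):

#include <cuda_runtime.h>
#include <functional>

// Times an opaque GEMM launch with CUDA events and returns the average
// microseconds per launch. Warm-up iterations exclude one-time costs.
float timeKernelUs(const std::function<void(cudaStream_t)>& runGemm, int iters = 100)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < 10; ++i)   // warm-up
        runGemm(stream);

    cudaEventRecord(start, stream);
    for (int i = 0; i < iters; ++i)
        runGemm(stream);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    return ms * 1000.f / iters;
}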
But when it comes to a real model, the int8 GEMM kernel's performance degrades a lot (I did use GemmProfilerPlugin).
Two pictures from Nsight (above: separate kernel benchmark, below: in the real model):
int8 in benchmark: (screenshot)
int8 in models: (screenshot)
Device: A100 SXM-80GB
For (m, n, k) = (16, 6144, 4096): 14 us -> 24 us, almost doubled. The config is exactly the same, as Nsight shows.
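A rough back-of-the-envelope check (a sketch, assuming the k x n int8 weight matrix dominates memory traffic and the output is fp16) puts the two timings in context:

#include <cstdio>

int main()
{
    const double m = 16, n = 6144, k = 4096;
    const double bytesA = m * k;        // ~64 KB int8 activations
    const double bytesB = k * n;        // ~25.2 MB int8 weights (dominant term)
    const double bytesC = m * n * 2;    // ~192 KB output, assuming T = half
    const double bytesTotal = bytesA + bytesB + bytesC;

    // Implied effective bandwidth in GB/s for the two measured times.
    const double benchGBs = bytesTotal / (14e-6) / 1e9;  // ~1800 GB/s at 14 us
    const double modelGBs = bytesTotal / (24e-6) / 1e9;  // ~1060 GB/s at 24 us

    // A100 SXM 80GB peak HBM bandwidth is roughly 2000 GB/s, so the standalone
    // run is near memory-bound peak while the in-model run is well below it.
    std::printf("benchmark: %.0f GB/s, in-model: %.0f GB/s\n", benchGBs, modelGBs);
    return 0;
}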