@Egor-Krivov
Contributor

Egor-Krivov commented Nov 17, 2025

The user provides the hardware capability on the command line, for example:
python benchmarks/triton_kernels_benchmark/gemm_tensor_desc_benchmark.py --hw_gbps 1229 --hw_tflops 356 --brief
The benchmark then prints the software efficiency (achieved throughput as a percentage of the hardware peak) and saves it to the report as well.

If this functionality turns out to be popular and required, we could save the hardware capability to a file and read it automatically, maybe with a call to scripts/capture-hw-details.sh before the script or during the benchmark. Then the user would not have to provide device properties.

We could also print just one efficiency value: the maximum of the compute and memory efficiencies.
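
As a rough illustration of the two ideas above, the sketch below computes a per-resource efficiency as achieved throughput divided by the hardware peak, then collapses the two into a single number by taking the maximum. The function names are hypothetical and not taken from the PR. The peaks are the values passed on the command line above, and the sample measurements are the Triton columns of row 24 in the example output.

```python
def efficiency(achieved: float, peak: float) -> float:
    """Return achieved throughput as a fraction of the hardware peak."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return achieved / peak


def combined_efficiency(gbps: float, tflops: float,
                        hw_gbps: float, hw_tflops: float) -> float:
    """Single efficiency value: the larger of memory and compute efficiency."""
    return max(efficiency(gbps, hw_gbps), efficiency(tflops, hw_tflops))


# Peaks as given via --hw_gbps and --hw_tflops in the command above.
HW_GBPS = 1229.0
HW_TFLOPS = 356.0

# Triton results for B=1, M=4096, N=4096, K=4096 from the example output.
mem_eff = efficiency(190.780275, HW_GBPS)
compute_eff = efficiency(195.359002, HW_TFLOPS)
single_eff = combined_efficiency(190.780275, 195.359002, HW_GBPS, HW_TFLOPS)

print(f"{mem_eff:.1%} {compute_eff:.1%} {single_eff:.1%}")
# prints "15.5% 54.9% 54.9%"
```

Taking the maximum reports efficiency against whichever resource the kernel is closest to saturating, so memory-bound and compute-bound shapes can share one column.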

Example output:

python benchmarks/triton_kernels_benchmark/gemm_tensor_desc_benchmark.py --hw_gbps 1229 --hw_tflops 356 --brief
matmul-tensor-desc-performance:
         B        M         N        K  Triton-GB/s  OneDNN-GB/s  CUTLASS-GB/s  Triton-TFlops  OneDNN-TFlops  CUTLASS-TFlops Triton-GB/s-eff OneDNN-GB/s-eff CUTLASS-GB/s-eff Triton-TFlops-eff OneDNN-TFlops-eff CUTLASS-TFlops-eff
0      1.0      1.0    1024.0   4096.0   175.750946   425.577337    115.970394       0.175494       0.424955        0.115801           14.3%           34.6%             9.4%              0.0%              0.1%               0.0%
1      1.0      1.0    4096.0   4096.0   524.016979   521.413149    443.931886       0.523633       0.521032        0.443607           42.6%           42.4%            36.1%              0.1%              0.1%               0.1%
2      1.0      1.0    4096.0  14336.0   389.696066   469.416483    455.829798       0.389547       0.469236        0.455655           31.7%           38.2%            37.1%              0.1%              0.1%               0.1%
3      1.0      1.0    6144.0   4096.0   486.236889   489.735672    546.370370       0.485921       0.489417        0.546015           39.6%           39.8%            44.5%              0.1%              0.1%               0.2%
4      1.0      1.0   13824.0   5120.0   441.854767   458.802913    397.907679       0.441650       0.458591        0.397724           36.0%           37.3%            32.4%              0.1%              0.1%               0.1%
5      1.0      1.0   14336.0   4096.0   445.977144   453.376201    408.517751       0.445728       0.453123        0.408290           36.3%           36.9%            33.2%              0.1%              0.1%               0.1%
6      1.0      1.0   28672.0   4096.0   498.016222   529.741423    475.100868       0.497756       0.529464        0.474852           40.5%           43.1%            38.7%              0.1%              0.1%               0.1%
7      1.0      1.0  128256.0   4096.0   578.545692   640.628428    577.928586       0.578259       0.640311        0.577642           47.1%           52.1%            47.0%              0.2%              0.2%               0.2%
8      1.0      4.0   12288.0   4096.0   458.061690   466.060043    403.248084       1.828081       1.860002        1.609325           37.3%           37.9%            32.8%              0.5%              0.5%               0.5%
9      1.0      8.0    1024.0   4096.0   178.596631   420.144175     91.909371       1.412224       3.322221        0.726758           14.5%           34.2%             7.5%              0.4%              0.9%               0.2%
10     1.0      8.0    4096.0   4096.0   527.689799   513.870853    363.931855       4.196927       4.087019        2.894495           42.9%           41.8%            29.6%              1.2%              1.1%               0.8%
11     1.0      8.0    4096.0  14336.0   390.121087   472.716502    387.426715       3.111419       3.770161        3.089930           31.7%           38.5%            31.5%              0.9%              1.1%               0.9%
12     1.0      8.0    6144.0   4096.0   479.017137   495.920279    526.579822       3.812281       3.946806        4.190811           39.0%           40.4%            42.8%              1.1%              1.1%               1.2%
13     1.0      8.0   14336.0   4096.0   445.520030   453.850414    427.656637       3.548320       3.614666        3.406048           36.3%           36.9%            34.8%              1.0%              1.0%               1.0%
14     1.0      8.0   28672.0   4096.0   487.241924   529.412984    494.972012       3.881689       4.217652        3.943272           39.6%           43.1%            40.3%              1.1%              1.2%               1.1%
15     1.0      8.0  128256.0   4096.0   560.061839   638.153974    583.876454       4.462784       5.085051        4.652547           45.6%           51.9%            47.5%              1.3%              1.4%               1.3%
16     1.0    512.0    8192.0   8192.0   296.682089   377.936908    278.408932     127.916825     162.950481      120.038209           24.1%           30.8%            22.7%             35.9%             45.8%              33.7%
17     1.0    512.0    8192.0  32768.0   297.996733   414.006919    296.896851     139.496528     193.802553      138.981657           24.2%           33.7%            24.2%             39.2%             54.4%              39.0%
18     1.0    512.0   32768.0   8192.0   393.451786   399.217928    385.462203     176.611344     179.199631      173.025006           32.0%           32.5%            31.4%             49.6%             50.3%              48.6%
19     1.0   1024.0    1024.0   1024.0   344.359945   312.308600    231.729491      88.156146      79.951002       59.322750           28.0%           25.4%            18.9%             24.8%             22.5%              16.7%
20     1.0   1024.0    8192.0  16384.0   207.744225   252.478797    208.903088     170.184069     206.830630      171.133410           16.9%           20.5%            17.0%             47.8%             58.1%              48.1%
21     1.0   1024.0    8192.0  28672.0   191.088374   227.313312    198.575703     163.548831     194.553053      169.957091           15.5%           18.5%            16.2%             45.9%             54.6%              47.7%
22     1.0   2048.0    2048.0   2048.0   259.508372   341.138993    222.421005     132.868286     174.663164      113.879554           21.1%           27.8%            18.1%             37.3%             49.1%              32.0%
23     1.0   3072.0    3072.0   4096.0   190.148059   219.432945    187.540746     166.895668     192.599430      164.607192           15.5%           17.9%            15.3%             46.9%             54.1%              46.2%
24     1.0   4096.0    4096.0   4096.0   190.780275   222.339944    175.173214     195.359002     227.676103      179.377371           15.5%           18.1%            14.3%             54.9%             64.0%              50.4%
25     1.0   4096.0    8192.0  16384.0    91.940661    94.919261     83.223153     188.294473     194.394647      170.441018            7.5%            7.7%             6.8%             52.9%             54.6%              47.9%
26     1.0   8192.0    1024.0  16384.0   190.708685   186.268641    186.365886     156.228555     152.591271      152.670934           15.5%           15.2%            15.2%             43.9%             42.9%              42.9%
27     1.0   8192.0    4096.0   4096.0   190.876395   212.877970    175.153624     223.379919     249.128047      204.979784           15.5%           17.3%            14.3%             62.7%             70.0%              57.6%
28     1.0   8192.0    4096.0  16384.0    89.539968    86.694727     89.235761     183.377854     177.550800      182.754838            7.3%            7.1%             7.3%             51.5%             49.9%              51.3%
29     1.0   8192.0    8192.0   8192.0   103.311161   111.342884     70.806559     211.581257     228.030226      145.011832            8.4%            9.1%             5.8%             59.4%             64.1%              40.7%
30     1.0  16384.0    1024.0   8192.0   246.024283   275.092183    239.593021     191.945803     214.624301      186.928193           20.0%           22.4%            19.5%             53.9%             60.3%              52.5%
31     1.0  16384.0    4096.0   8192.0   116.857383   124.702125    113.550945     212.732374     227.013291      206.713186            9.5%           10.1%             9.2%             59.8%             63.8%              58.1%
32     1.0  16384.0    8192.0   1024.0   420.457559   506.532221    267.520700     196.822190     237.114969      125.230261           34.2%           41.2%            21.8%             55.3%             66.6%              35.2%
33     1.0  16384.0    8192.0   4096.0   138.795960   152.690504    131.224377     206.730274     227.425565      195.452745           11.3%           12.4%            10.7%             58.1%             63.9%              54.9%
34     4.0  32768.0     128.0   4096.0   557.546790   485.472400    378.755178      66.921953      58.270914       45.461720           45.4%           39.5%            30.8%             18.8%             16.4%              12.8%
35     4.0  32768.0    4096.0    128.0   814.060842  1513.905977    407.025866      51.199896      95.216259       25.599661           66.2%          123.2%            33.1%             14.4%             26.7%               7.2%
36    32.0   4096.0     128.0   4096.0   551.669541   538.095889    521.920722      64.561098      62.972593       61.079636           44.9%           43.8%            42.5%             18.1%             17.7%              17.2%
37  4096.0      8.0     128.0  16384.0   583.029627   561.039587    501.603018       4.385839       4.220419        3.773308           47.4%           45.7%            40.8%              1.2%              1.2%               1.1%
38  4096.0      8.0   16384.0    128.0   636.337183   669.579984    462.410021       4.523101       4.759392        3.286822           51.8%           54.5%            37.6%              1.3%              1.3%               0.9%

@Egor-Krivov
Contributor Author

Egor-Krivov commented Nov 17, 2025

@etiotto @whitneywhtsang What do you think about this function? Would you use it?

@whitneywhtsang This is unrelated to the Grafana HW efficiency calculations: those can probably be computed on the fly, with no change needed in the repo. This PR is for local dev runs.

@whitneywhtsang
Contributor

> @etiotto @whitneywhtsang What do you think about this function? Would you use it?

Personally, I would only use it if the hardware capability is provided automatically and the NV SW efficiency is available as a reference.

@etiotto
Contributor

etiotto left a comment

I like the idea but, as you mentioned in the description, my preference would be to encode the HW capabilities in a file and automatically detect the HW the benchmark is running on.
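
One possible shape for this, sketched under assumptions: a small table keyed by device name holds the peak bandwidth and compute, and the benchmark looks up the device it is running on. The device name "example-gpu", the HardwareCapability type, and the lookup_capability helper are all hypothetical and not taken from the PR. The peak values are the ones from the command in the PR description, and how the device name is actually detected is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HardwareCapability:
    """Peak memory bandwidth (GB/s) and peak compute (TFLOPS) of a device."""
    gbps: float
    tflops: float


# Hypothetical table; in practice this could live in a JSON or YAML file.
KNOWN_HARDWARE = {
    "example-gpu": HardwareCapability(gbps=1229.0, tflops=356.0),
}


def lookup_capability(device_name: str) -> Optional[HardwareCapability]:
    """Return the known peaks for a device, or None if it is not listed."""
    return KNOWN_HARDWARE.get(device_name.strip().lower())


cap = lookup_capability("Example-GPU")
unknown = lookup_capability("some-other-device")
```

Returning None for an unlisted device would let the benchmark skip the efficiency columns instead of printing numbers computed against the wrong peaks.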

@Egor-Krivov
Contributor Author

Current output:

(triton) (312) jovyan@jupyter-ekrivov:~/triton/intel-xpu-backend-for-triton$ python benchmarks/third_party/sglang/scaled_mm_benchmark.py -b -e
scaled_mm_benchmark:
         M       N        K  triton-GB/s  triton-td-GB/s  pytorch-deqmm-GB/s  triton-TFlops  triton-td-TFlops  pytorch-deqmm-TFlops triton-eff triton-td-eff pytorch-deqmm-eff
0      1.0  1024.0   4096.0     3.579846       57.398848           49.803748       0.007149          0.114630              0.099462      0.29%         4.67%             4.05%
1      1.0  4096.0   4096.0    10.566348      172.235373           69.189417       0.021117          0.344219              0.138278      0.86%        14.02%             5.63%
2      1.0  4096.0  16384.0    11.042084      231.032576           53.081664       0.022076          0.461896              0.106124      0.90%        18.80%             4.32%
3      8.0  1024.0   4096.0     3.298091       56.882791           43.080772       0.052158          0.899583              0.681308      0.27%         4.63%             3.51%
4      8.0  4096.0   4096.0    11.002999      169.127279           64.243644       0.175022          2.690273              1.021911      0.90%        13.76%             5.23%
5      8.0  4096.0  16384.0    11.529292      212.695850           54.543409       0.183930          3.393193              0.870145      0.94%        17.31%             4.44%
6    128.0  1024.0   4096.0     2.777600       22.843223           34.874219       0.598792          4.924518              7.518147      0.23%         1.86%             2.84%
7    128.0  4096.0   4096.0     9.070279       75.757902           42.796027       2.122964         17.731678             10.016716      0.74%         6.17%             3.48%
8    128.0  4096.0  16384.0     8.981069       75.767429           44.472249       2.196206         18.527964             10.875124      0.73%         6.17%             3.62%
9   1024.0  1024.0   4096.0     2.389590        5.414800           19.732331       1.957552          4.435804             16.164726      0.55%         1.25%             4.55%
10  1024.0  4096.0   4096.0     2.477038        9.669833           15.724485       2.898842         11.316467             18.402140      0.82%         3.18%             5.18%
11  1024.0  4096.0  16384.0     1.988680        7.884973           12.801350       2.962049         11.744309             19.067028      0.83%         3.30%             5.36%
12  4096.0  1024.0   4096.0     2.478878        9.714627           15.701443       2.900995         11.368889             18.375174      0.82%         3.20%             5.17%
13  4096.0  4096.0   4096.0     2.150473        8.668761            9.731620       4.404169         17.753622             19.930359      1.24%         4.99%             5.61%
14  4096.0  4096.0  16384.0     1.404768        5.278151            6.305655       4.603143         17.295446             20.662369      1.29%         4.86%             5.81%
(triton) (312) jovyan@jupyter-ekrivov:~/triton/intel-xpu-backend-for-triton$ python benchmarks/triton_kernels_benchmark/fused_softmax.py -b -e
softmax-performance:
         N  Triton-GB/s  XeTLA-GB/s  oneDNN-GB/s  Triton-TFlops  XeTLA-TFlops  oneDNN-TFlops Triton-eff XeTLA-eff oneDNN-eff
0    256.0   469.161517  563.750514   243.289109       0.469162      0.563751       0.243289     38.18%    45.88%     19.80%
1   1024.0   674.867894  540.503129   469.950017       0.674868      0.540503       0.469950     54.92%    43.99%     38.24%
2   2048.0   663.655679  768.891613   516.858158       0.663656      0.768892       0.516858     54.01%    62.57%     42.06%
3   4096.0   607.320034  484.330674   512.986306       0.607320      0.484331       0.512986     49.42%    39.41%     41.75%
4   8192.0   667.882818  631.672264   456.119567       0.667883      0.631672       0.456120     54.35%    51.41%     37.12%
5  16384.0   702.563450  684.295574    87.017625       0.702563      0.684296       0.087018     57.17%    55.69%      7.08%
6  32768.0   736.145476  740.348213    82.313131       0.736145      0.740348       0.082313     59.91%    60.25%      6.70%

@Egor-Krivov
Contributor Author

@etiotto @whitneywhtsang The benchmark now knows the hardware capability automatically, so you can just call python benchmark.py -be to get the efficiency printed.

@Egor-Krivov
Contributor Author

Closes #5514
