@MollySophia
Contributor

@MollySophia MollySophia commented Aug 31, 2025

Progress:

  • packB 32 and 32x8 microkernel, tests pass (done)
  • 32x32 / 32x64 / 64x64 microkernels, plus packing and unpacking (not yet implemented)

Unfortunately, I've been too busy with my internship work to finish this optimization: only the 32x8 microkernels with packB 32 are implemented. The performance gain could be much higher once the remaining microkernels are implemented.
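
For readers unfamiliar with the "packB 32" wording, here is a minimal sketch of what packing B into 32-column panels usually means in a GEMM. It is not the code from this PR: the function name `pack_b_panels`, the row-major input layout, and the zero-padding of the last panel are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical panel width, chosen to match the "packB 32" wording above.
static const int PANEL_N = 32;

// Pack a row-major K x N matrix B into column panels of PANEL_N columns.
// Within a panel, the PANEL_N values of one k-row are stored contiguously,
// so a microkernel can load one full panel row per step.
// The last panel is zero-padded when N is not a multiple of PANEL_N.
static std::vector<float> pack_b_panels(const std::vector<float>& B, int K, int N)
{
    assert(K >= 0 && N >= 0);
    assert(B.size() == static_cast<std::size_t>(K) * N);

    const int panels = (N + PANEL_N - 1) / PANEL_N;
    std::vector<float> packed(static_cast<std::size_t>(panels) * K * PANEL_N, 0.f);

    std::size_t out = 0;
    for (int p = 0; p < panels; p++)
    {
        const int j0 = p * PANEL_N; // first column of this panel
        for (int k = 0; k < K; k++)
        {
            for (int jj = 0; jj < PANEL_N; jj++)
            {
                const int j = j0 + jj;
                if (j < N)
                    packed[out] = B[static_cast<std::size_t>(k) * N + j];
                out++; // padded slots keep their initial 0.f
            }
        }
    }
    return packed;
}
```

With this layout, a 32x8 microkernel would read one 32-wide row of packed B per k step. The real kernels may order or pad the data differently.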

Benchmarking

test_gemm.param.zip
benchncnn.cpp:

        benchmark("test_gemm1024", ncnn::Mat(1024, 1024, 3), opt);
        benchmark("test_gemm2048", ncnn::Mat(2048, 2048, 3), opt);
        benchmark("test_gemm4096", ncnn::Mat(4096, 4096, 3), opt);
        benchmark("test_gemm8192", ncnn::Mat(8192, 8192, 3), opt);

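32 layers of [dim, dim] @ [dim, dim] GEMMs on Apple M4 (times in ms). The first run uses the AMX build, the second uses the build without AMX (build-noamx):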

# molly @ mollydeMac-mini in ~/ncnn/benchmark on git:apple-amx-remastered x [10:40:06] 
$ ../build/benchmark/benchncnn.app/Contents/MacOS/benchncnn 100 4 2 -1 1      
loop_count = 100
num_threads = 4
powersave = 2
gpu_device = -1
cooling_down = 1
       test_gemm1024  min =    2.92  max =    5.42  avg =    3.04
       test_gemm2048  min =   10.93  max =   20.71  avg =   11.49
       test_gemm4096  min =   42.90  max =   73.63  avg =   44.38
       test_gemm8192  min =  172.38  max = 1099.07  avg =  211.27

# molly @ mollydeMac-mini in ~/ncnn/benchmark on git:apple-amx-remastered x [10:53:14] 
$ ../build-noamx/benchmark/benchncnn.app/Contents/MacOS/benchncnn 100 4 2 -1 1 
loop_count = 100
num_threads = 4
powersave = 2
gpu_device = -1
cooling_down = 1
       test_gemm1024  min =    5.35  max =    7.99  avg =    6.43
       test_gemm2048  min =   21.85  max =   29.20  avg =   24.55
       test_gemm4096  min =   89.58  max =   97.65  avg =   93.08
       test_gemm8192  min =  511.64  max = 2068.80  avg = 1229.32
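
To make the comparison between the two runs explicit, the following sketch computes speedups from the numbers printed above. The struct and function names are hypothetical. The only inputs are the min and avg times copied from the two logs.

```cpp
#include <cassert>
#include <cstdio>

// One row of the benchncnn output above, times in milliseconds.
struct BenchRow
{
    const char* name;
    double amx_min, amx_avg;   // AMX build
    double base_min, base_avg; // build-noamx build
};

static const BenchRow kRows[] = {
    {"test_gemm1024", 2.92, 3.04, 5.35, 6.43},
    {"test_gemm2048", 10.93, 11.49, 21.85, 24.55},
    {"test_gemm4096", 42.90, 44.38, 89.58, 93.08},
    {"test_gemm8192", 172.38, 211.27, 511.64, 1229.32},
};
static const int kRowCount = sizeof(kRows) / sizeof(kRows[0]);

// Speedup of the AMX build over the baseline for one pair of timings.
static double speedup(double base_ms, double amx_ms)
{
    assert(base_ms > 0.0 && amx_ms > 0.0);
    return base_ms / amx_ms;
}

// Print avg-based and min-based speedups for every benchmark row.
static void print_speedups()
{
    for (int i = 0; i < kRowCount; i++)
    {
        const BenchRow& r = kRows[i];
        std::printf("%s  avg speedup = %.2fx  min speedup = %.2fx\n", r.name,
                    speedup(r.base_avg, r.amx_avg), speedup(r.base_min, r.amx_min));
    }
}
```

On average times, this gives roughly 2.1x for the 1024 to 4096 cases and about 5.8x for 8192. The 8192 run without AMX is very noisy (min 511.64, max 2068.80), so the min-based ratio of about 3.0x may be the more conservative figure for that size.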

Testing

# molly @ mollydeMac-mini in ~/ncnn/build on git:apple-amx-remastered x [10:03:21] 
$ ctest --output-on-failure -j10         
Test project /Users/molly/ncnn/build
        Start  17: test_binaryop_3
        Start 123: test_slice
        Start  24: test_convolution
        Start  26: test_convolution_2
        Start  25: test_convolution_1
        Start 125: test_softmax
        Start  71: test_gemm_3
        Start  38: test_crop_1
        Start  43: test_deconvolution
        Start  16: test_binaryop_2
  1/135 Test  #43: test_deconvolution ...............   Passed    0.67 sec
        Start  69: test_gemm
  2/135 Test  #71: test_gemm_3 ......................   Passed    1.22 sec
        Start  53: test_deformableconv2d_2
  3/135 Test  #38: test_crop_1 ......................   Passed    1.25 sec
        Start  52: test_deformableconv2d_1
  4/135 Test #125: test_softmax .....................   Passed    1.41 sec
        Start  31: test_convolutiondepthwise
  5/135 Test  #16: test_binaryop_2 ..................   Passed    1.48 sec
        Start  54: test_deformableconv2d_3
  6/135 Test  #26: test_convolution_2 ...............   Passed    1.64 sec
        Start  15: test_binaryop_1
  7/135 Test  #69: test_gemm ........................   Passed    1.01 sec
        Start  72: test_gemm_4
  8/135 Test  #24: test_convolution .................   Passed    1.82 sec
        Start  14: test_binaryop
  9/135 Test #123: test_slice .......................   Passed    1.98 sec
        Start  51: test_deformableconv2d
 10/135 Test  #25: test_convolution_1 ...............   Passed    2.11 sec
        Start  42: test_cumulativesum
 11/135 Test  #52: test_deformableconv2d_1 ..........   Passed    0.95 sec
        Start  37: test_crop
 12/135 Test  #17: test_binaryop_3 ..................   Passed    2.24 sec
        Start  30: test_convolution3d
 13/135 Test  #54: test_deformableconv2d_3 ..........   Passed    0.85 sec
        Start  70: test_gemm_1
 14/135 Test  #53: test_deformableconv2d_2 ..........   Passed    1.13 sec
        Start  47: test_deconvolutiondepthwise_1
 15/135 Test  #72: test_gemm_4 ......................   Passed    0.82 sec
        Start  27: test_convolution_3
 16/135 Test  #31: test_convolutiondepthwise ........   Passed    1.10 sec
        Start  46: test_deconvolutiondepthwise
 17/135 Test  #15: test_binaryop_1 ..................   Passed    1.00 sec
        Start 100: test_pooling3d
 18/135 Test  #14: test_binaryop ....................   Passed    0.93 sec
        Start  45: test_deconvolution3d
 19/135 Test  #51: test_deformableconv2d ............   Passed    0.79 sec
        Start  49: test_deconvolutiondepthwise3d
 20/135 Test  #42: test_cumulativesum ...............   Passed    0.74 sec
        Start  36: test_copyto_1
 21/135 Test  #47: test_deconvolutiondepthwise_1 ....   Passed    0.58 sec
        Start 135: test_yolov3detectionoutput
 22/135 Test  #37: test_crop ........................   Passed    0.84 sec
        Start  35: test_copyto
 23/135 Test  #70: test_gemm_1 ......................   Passed    0.77 sec
        Start  89: test_multiheadattention
 24/135 Test  #30: test_convolution3d ...............   Passed    0.89 sec
        Start 124: test_slice_oom
 25/135 Test  #46: test_deconvolutiondepthwise ......   Passed    0.67 sec
        Start  95: test_padding
 26/135 Test #100: test_pooling3d ...................   Passed    0.59 sec
        Start  29: test_convolution1d
 27/135 Test  #45: test_deconvolution3d .............   Passed    0.58 sec
        Start 112: test_reshape_1
 28/135 Test  #49: test_deconvolutiondepthwise3d ....   Passed    0.57 sec
        Start  90: test_multiheadattention_1
 29/135 Test #135: test_yolov3detectionoutput .......   Passed    0.49 sec
        Start  75: test_gru
 30/135 Test  #27: test_convolution_3 ...............   Passed    0.94 sec
        Start  73: test_gridsample
 31/135 Test  #36: test_copyto_1 ....................   Passed    0.60 sec
        Start 108: test_reorg
 32/135 Test  #89: test_multiheadattention ..........   Passed    0.44 sec
        Start  44: test_deconvolution1d
 33/135 Test #124: test_slice_oom ...................   Passed    0.48 sec
        Start  80: test_interp
 34/135 Test  #95: test_padding .....................   Passed    0.51 sec
        Start  98: test_pooling
 35/135 Test #112: test_reshape_1 ...................   Passed    0.39 sec
        Start  96: test_permute
 36/135 Test #108: test_reorg .......................   Passed    0.32 sec
        Start 114: test_rmsnorm
 37/135 Test  #75: test_gru .........................   Passed    0.38 sec
        Start  22: test_concat
 38/135 Test  #35: test_copyto ......................   Passed    0.76 sec
        Start  74: test_groupnorm
 39/135 Test  #90: test_multiheadattention_1 ........   Passed    0.49 sec
        Start 102: test_prelu
 40/135 Test  #29: test_convolution1d ...............   Passed    0.61 sec
        Start 109: test_requantize
 41/135 Test  #73: test_gridsample ..................   Passed    0.46 sec
        Start 126: test_softmax_oom
 42/135 Test  #44: test_deconvolution1d .............   Passed    0.43 sec
        Start 132: test_tile
 43/135 Test  #96: test_permute .....................   Passed    0.33 sec
        Start  85: test_lstm
 44/135 Test  #98: test_pooling .....................   Passed    0.40 sec
        Start  48: test_deconvolutiondepthwise1d
 45/135 Test #114: test_rmsnorm .....................   Passed    0.32 sec
        Start  92: test_noop
 46/135 Test  #74: test_groupnorm ...................   Passed    0.43 sec
        Start 127: test_softplus
 47/135 Test #102: test_prelu .......................   Passed    0.42 sec
        Start 106: test_reduction
 48/135 Test  #22: test_concat ......................   Passed    0.50 sec
        Start  59: test_einsum
 49/135 Test #109: test_requantize ..................   Passed    0.46 sec
        Start  91: test_multiheadattention_oom
 50/135 Test #126: test_softmax_oom .................   Passed    0.41 sec
        Start  86: test_matmul
 51/135 Test #132: test_tile ........................   Passed    0.38 sec
        Start  55: test_deformableconv2d_4
 52/135 Test  #48: test_deconvolutiondepthwise1d ....   Passed    0.42 sec
        Start 118: test_scale
 53/135 Test  #92: test_noop ........................   Passed    0.42 sec
        Start  99: test_pooling1d
 54/135 Test  #85: test_lstm ........................   Passed    0.46 sec
        Start  60: test_eltwise
 55/135 Test #127: test_softplus ....................   Passed    0.34 sec
        Start 111: test_reshape
 56/135 Test #106: test_reduction ...................   Passed    0.40 sec
        Start  50: test_deepcopy
 57/135 Test  #91: test_multiheadattention_oom ......   Passed    0.49 sec
        Start  67: test_gelu
 58/135 Test  #80: test_interp ......................   Passed    1.19 sec
        Start  94: test_packing
 59/135 Test  #86: test_matmul ......................   Passed    0.48 sec
        Start 128: test_spectrogram
 60/135 Test  #59: test_einsum ......................   Passed    0.53 sec
        Start 107: test_relu
 61/135 Test  #55: test_deformableconv2d_4 ..........   Passed    0.50 sec
        Start 119: test_selu
 62/135 Test #118: test_scale .......................   Passed    0.35 sec
        Start  84: test_lrn
 63/135 Test  #50: test_deepcopy ....................   Passed    0.36 sec
        Start  64: test_expanddims
 64/135 Test  #99: test_pooling1d ...................   Passed    0.50 sec
        Start  66: test_fold
 65/135 Test  #60: test_eltwise .....................   Passed    0.51 sec
        Start  61: test_elu
 66/135 Test #111: test_reshape .....................   Passed    0.51 sec
        Start  88: test_mish
 67/135 Test  #67: test_gelu ........................   Passed    0.31 sec
        Start 115: test_rnn
 68/135 Test  #94: test_packing .....................   Passed    0.31 sec
        Start  97: test_pixelshuffle
 69/135 Test #119: test_selu ........................   Passed    0.44 sec
        Start  58: test_dropout
 70/135 Test #128: test_spectrogram .................   Passed    0.47 sec
 71/135 Test  #84: test_lrn .........................   Passed    0.44 sec
        Start  56: test_dequantize
        Start  78: test_innerproduct
 72/135 Test #107: test_relu ........................   Passed    0.47 sec
        Start  65: test_flatten
 73/135 Test  #64: test_expanddims ..................   Passed    0.34 sec
        Start  87: test_memorydata
 74/135 Test  #66: test_fold ........................   Passed    0.34 sec
        Start 120: test_shrink
 75/135 Test  #61: test_elu .........................   Passed    0.51 sec
        Start 116: test_roipooling
 76/135 Test  #97: test_pixelshuffle ................   Passed    0.43 sec
        Start 129: test_squeeze
 77/135 Test  #88: test_mish ........................   Passed    0.46 sec
        Start  57: test_diag
 78/135 Test #115: test_rnn .........................   Passed    0.48 sec
        Start 101: test_power
 79/135 Test  #58: test_dropout .....................   Passed    0.30 sec
        Start 131: test_tanh
 80/135 Test  #78: test_innerproduct ................   Passed    0.31 sec
        Start  93: test_normalize
 81/135 Test  #87: test_memorydata ..................   Passed    0.47 sec
 82/135 Test #120: test_shrink ......................   Passed    0.46 sec
        Start 113: test_reshape_oom
        Start 103: test_priorbox
 83/135 Test  #65: test_flatten .....................   Passed    0.53 sec
        Start 117: test_roialign
 84/135 Test  #56: test_dequantize ..................   Passed    0.54 sec
        Start 110: test_requantize_oom
 85/135 Test #129: test_squeeze .....................   Passed    0.34 sec
        Start 104: test_quantize
 86/135 Test #116: test_roipooling ..................   Passed    0.35 sec
        Start  76: test_hardsigmoid
 87/135 Test  #57: test_diag ........................   Passed    0.51 sec
        Start  82: test_inversespectrogram
 88/135 Test #131: test_tanh ........................   Passed    0.46 sec
        Start  83: test_layernorm
 89/135 Test #101: test_power .......................   Passed    0.48 sec
        Start 105: test_quantize_oom
 90/135 Test #117: test_roialign ....................   Passed    0.32 sec
        Start 133: test_unaryop
 91/135 Test #113: test_reshape_oom .................   Passed    0.32 sec
        Start  79: test_instancenorm
 92/135 Test #110: test_requantize_oom ..............   Passed    0.51 sec
        Start  40: test_crop_3
 93/135 Test #104: test_quantize ....................   Passed    0.46 sec
        Start 122: test_sigmoid
 94/135 Test #103: test_priorbox ....................   Passed    0.53 sec
        Start 130: test_swish
 95/135 Test  #76: test_hardsigmoid .................   Passed    0.49 sec
        Start  77: test_hardswish
 96/135 Test  #82: test_inversespectrogram ..........   Passed    0.36 sec
        Start  63: test_erf
 97/135 Test  #93: test_normalize ...................   Passed    0.85 sec
        Start  68: test_glu
 98/135 Test  #79: test_instancenorm ................   Passed    0.44 sec
        Start  32: test_convolutiondepthwise_1
 99/135 Test #105: test_quantize_oom ................   Passed    0.52 sec
        Start  81: test_interp_1
100/135 Test  #83: test_layernorm ...................   Passed    0.53 sec
        Start 121: test_shufflechannel
101/135 Test #133: test_unaryop .....................   Passed    0.50 sec
        Start  62: test_embed
102/135 Test #122: test_sigmoid .....................   Passed    0.32 sec
        Start 134: test_unfold
103/135 Test  #40: test_crop_3 ......................   Passed    0.37 sec
        Start  41: test_crop_oom
104/135 Test  #63: test_erf .........................   Passed    0.45 sec
        Start  39: test_crop_2
105/135 Test  #77: test_hardswish ...................   Passed    0.48 sec
        Start   6: test_squeezenet
106/135 Test #130: test_swish .......................   Passed    0.51 sec
        Start  34: test_convolutiondepthwise3d
107/135 Test  #68: test_glu .........................   Passed    0.41 sec
        Start  33: test_convolutiondepthwise1d
108/135 Test  #32: test_convolutiondepthwise_1 ......   Passed    0.46 sec
        Start   5: test_mat_pixel
109/135 Test  #81: test_interp_1 ....................   Passed    0.49 sec
        Start   4: test_mat_pixel_resize
110/135 Test #121: test_shufflechannel ..............   Passed    0.52 sec
        Start  28: test_convolution_oom
111/135 Test #134: test_unfold ......................   Passed    0.44 sec
        Start  20: test_celu
112/135 Test  #62: test_embed .......................   Passed    0.48 sec
        Start  13: test_bias
113/135 Test  #41: test_crop_oom ....................   Passed    0.40 sec
        Start   9: test_expression
114/135 Test  #39: test_crop_2 ......................   Passed    0.32 sec
        Start  21: test_clip
115/135 Test   #4: test_mat_pixel_resize ............   Passed    0.32 sec
        Start   2: test_mat_pixel_drawing
116/135 Test   #5: test_mat_pixel ...................   Passed    0.35 sec
        Start  19: test_cast
117/135 Test  #28: test_convolution_oom .............   Passed    0.31 sec
        Start  18: test_bnll
118/135 Test  #33: test_convolutiondepthwise1d ......   Passed    0.57 sec
        Start  12: test_batchnorm
119/135 Test  #34: test_convolutiondepthwise3d ......   Passed    0.63 sec
        Start   3: test_mat_pixel_rotate
120/135 Test   #6: test_squeezenet ..................   Passed    0.67 sec
        Start   7: test_c_api
121/135 Test   #9: test_expression ..................   Passed    0.52 sec
        Start  10: test_paramdict
122/135 Test  #20: test_celu ........................   Passed    0.53 sec
        Start  11: test_absval
123/135 Test  #13: test_bias ........................   Passed    0.52 sec
        Start   1: test_mat_pixel_affine
124/135 Test  #21: test_clip ........................   Passed    0.45 sec
        Start  23: test_concat_oom
125/135 Test   #2: test_mat_pixel_drawing ...........   Passed    0.32 sec
        Start   8: test_cpu
126/135 Test  #19: test_cast ........................   Passed    0.53 sec
127/135 Test   #3: test_mat_pixel_rotate ............   Passed    0.42 sec
128/135 Test  #18: test_bnll ........................   Passed    0.51 sec
129/135 Test   #7: test_c_api .......................   Passed    0.39 sec
130/135 Test  #12: test_batchnorm ...................   Passed    0.49 sec
131/135 Test  #10: test_paramdict ...................   Passed    0.32 sec
132/135 Test   #8: test_cpu .........................   Passed    0.46 sec
133/135 Test  #11: test_absval ......................   Passed    0.54 sec
134/135 Test  #23: test_concat_oom ..................   Passed    0.54 sec
135/135 Test   #1: test_mat_pixel_affine ............   Passed    0.55 sec

100% tests passed, 0 tests failed out of 135

Total Test time (real) =   8.19 sec

@tencent-adm
Member

tencent-adm commented Aug 31, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions

github-actions bot commented Aug 31, 2025

The binary size change of libncnn.so (bytes)

architecture  base size  pr size   difference
x86_64        15124728   15124784  +56 ⚠️
armhf         6155744    6155824   +80 ⚠️
aarch64       9453192    9452928   -264 😘

@github-actions

Please enable GitHub Actions in YOUR FORKED REPO so that the code-format workflow can run.

@codecov-commenter

codecov-commenter commented Aug 31, 2025

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.59%. Comparing base (a514cf5) to head (da499ad).
⚠️ Report is 9 commits behind head on master.

Files with missing lines  Patch %  Lines
src/cpu.cpp               0.00%    3 Missing ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6293      +/-   ##
==========================================
- Coverage   95.89%   95.59%   -0.30%     
==========================================
  Files         837      837              
  Lines      264994   264997       +3     
==========================================
- Hits       254105   253327     -778     
- Misses      10889    11670     +781     


@MollySophia MollySophia changed the title from "WIP: Apple AMX GEMM optimization" to "Apple AMX GEMM optimization" on Sep 10, 2025
Member

@nihui nihui left a comment

A lot of code is duplicated in gemm_arm_asimdhp.cpp. It should be extracted into gemm_fp16sa.h so that the implementation lives in a single file.

Apple AMX requires additional macro definitions, such as __ARM_FEATURE_APPLE_AMX or __ARM_FEATURE_APPLE_AMX2
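
One way to read this suggestion is a compile-time guard combined with a runtime capability flag. The sketch below is only an illustration. The macro names are taken from the comment above, and it is an assumption that the build system defines them (for example with a `-D` flag) rather than the compiler. `SKETCH_APPLE_AMX_LEVEL`, `GemmKernel`, and `select_gemm_kernel` are hypothetical names.

```cpp
#include <cassert>

// Assumption: the build system defines these macros for AMX-enabled builds,
// e.g. -D__ARM_FEATURE_APPLE_AMX2=1. They are not known compiler built-ins.
#if defined(__ARM_FEATURE_APPLE_AMX2)
#define SKETCH_APPLE_AMX_LEVEL 2
#elif defined(__ARM_FEATURE_APPLE_AMX)
#define SKETCH_APPLE_AMX_LEVEL 1
#else
#define SKETCH_APPLE_AMX_LEVEL 0
#endif

// Hypothetical kernel identifiers used only in this sketch.
enum GemmKernel
{
    GEMM_GENERIC = 0,
    GEMM_AMX = 1,
    GEMM_AMX2 = 2
};

// AMX level the binary was compiled for (0 means no AMX code paths).
static int compiled_amx_level()
{
    return SKETCH_APPLE_AMX_LEVEL;
}

// Pick a kernel only if it was both compiled in and reported at runtime.
static GemmKernel select_gemm_kernel(bool runtime_has_amx)
{
    const int level = compiled_amx_level();
    assert(level >= 0 && level <= 2);
    if (!runtime_has_amx || level == 0)
        return GEMM_GENERIC;
    return level == 2 ? GEMM_AMX2 : GEMM_AMX;
}
```

Keeping the compile-time and runtime checks separate means a binary built without the macros never references AMX code, while an AMX-enabled binary can still fall back on hardware that lacks it.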

Comment on lines +2664 to +2676
{
    try_initialize_global_cpu_info();
#if __aarch64__ && __APPLE__
    return g_hw_cpufamily == CPUFAMILY_ARM_FIRESTORM_ICESTORM    // M1
           || g_hw_cpufamily == CPUFAMILY_ARM_AVALANCHE_BLIZZARD // M2
           || g_hw_cpufamily == CPUFAMILY_ARM_IBIZA              // M3
           || g_hw_cpufamily == CPUFAMILY_ARM_LOBOS              // M3 Pro
           || g_hw_cpufamily == CPUFAMILY_ARM_PALMA              // M3 Max
           || g_hw_cpufamily == CPUFAMILY_ARM_DONAN              // M4
           || g_hw_cpufamily == CPUFAMILY_ARM_BRAVA;             // M4 Pro / M4

#else
    return 0;

Discover the CPU ISA info in initialize_global_cpu_info(), for example by checking the sysctl key:

hw.optional.amx_version == 2
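
A minimal sketch of such a runtime probe, assuming the key named above can be read with `sysctlbyname` as an `int`. The helper names `query_amx_version` and `cpu_support_apple_amx2` are hypothetical, and the integer type of the value is an assumption. On non-Apple targets the probe simply reports 0.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

// Read hw.optional.amx_version; returns 0 if the key is missing or the
// platform is not Apple arm64. The int value type is an assumption.
static int query_amx_version()
{
#if defined(__APPLE__) && defined(__aarch64__)
    int value = 0;
    std::size_t len = sizeof(value);
    if (sysctlbyname("hw.optional.amx_version", &value, &len, nullptr, 0) != 0)
        return 0; // key not present on this OS or hardware
    assert(len <= sizeof(value));
    return value;
#else
    return 0;
#endif
}

// Cached check, mirroring the "== 2" condition from the review comment.
static int cpu_support_apple_amx2()
{
    static const int version = query_amx_version();
    return version == 2 ? 1 : 0;
}
```

In ncnn the query would presumably run once inside initialize_global_cpu_info() and store the result in a global. The function-local static here only stands in for that caching.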

Signed-off-by: Molly Sophia <[email protected]>