Apple AMX GEMM optimization #6293

MollySophia · 2025-08-31T13:43:40Z

Progress:

[x] packB 32, 32x8 microkernel, tests pass
[ ] 32x32 / 32x64 / 64x64 microkernels, with packing and unpacking

Unfortunately, I've been too busy with my internship work to finish this optimization: so far only the 32x8 microkernel with packB 32 is implemented. The performance gain could be much higher once the remaining microkernels are in place.
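
For readers unfamiliar with the terms, here is a minimal scalar sketch of what "packB 32" and a "32x8 microkernel" usually mean in a blocked GEMM. It is not the AMX code from this PR. The layout (B packed into panels of 32 columns, each microkernel call producing a tile of 8 rows by 32 columns of C), the row-major storage, and the names pack_b_32, microkernel_8x32, and gemm_packed are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumed tile sizes: NR columns of B per packed panel, MR rows of A per microkernel call.
constexpr std::size_t NR = 32;
constexpr std::size_t MR = 8;

// Pack row-major B (K x N) into panels of NR columns. Inside a panel the NR values
// for one k are contiguous; columns past N are zero-padded.
std::vector<float> pack_b_32(const std::vector<float>& B, std::size_t K, std::size_t N)
{
    const std::size_t panels = (N + NR - 1) / NR;
    std::vector<float> packed(panels * K * NR, 0.0f);
    for (std::size_t p = 0; p < panels; ++p)
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t j = 0; j < NR && p * NR + j < N; ++j)
                packed[(p * K + k) * NR + j] = B[k * N + p * NR + j];
    return packed;
}

// Compute one tile of C (rows x cols, at most MR x NR) from `rows` rows of A
// and one packed panel of B.
void microkernel_8x32(const float* A, std::size_t lda, const float* panel, std::size_t K,
                      float* C, std::size_t ldc, std::size_t rows, std::size_t cols)
{
    float acc[MR][NR] = {};
    for (std::size_t k = 0; k < K; ++k)
        for (std::size_t i = 0; i < rows; ++i)
            for (std::size_t j = 0; j < NR; ++j)
                acc[i][j] += A[i * lda + k] * panel[k * NR + j];
    // Only the valid part of the tile is written back, so padding never reaches C.
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            C[i * ldc + j] = acc[i][j];
}

// C (M x N) = A (M x K) * B (K x N), all row-major.
std::vector<float> gemm_packed(const std::vector<float>& A, const std::vector<float>& B,
                               std::size_t M, std::size_t K, std::size_t N)
{
    const std::vector<float> packed = pack_b_32(B, K, N);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t j0 = 0; j0 < N; j0 += NR)
        for (std::size_t i0 = 0; i0 < M; i0 += MR)
            microkernel_8x32(A.data() + i0 * K, K, packed.data() + (j0 / NR) * K * NR, K,
                             C.data() + i0 * N + j0, N,
                             std::min(MR, M - i0), std::min(NR, N - j0));
    return C;
}
```

Packing B once lets every microkernel call read its panel sequentially, which is the usual reason a wide accelerator tile benefits from a dedicated pack step.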

Benchmarking

test_gemm.param.zip
benchncnn.cpp:

        benchmark("test_gemm1024", ncnn::Mat(1024, 1024, 3), opt);
        benchmark("test_gemm2048", ncnn::Mat(2048, 2048, 3), opt);
        benchmark("test_gemm4096", ncnn::Mat(4096, 4096, 3), opt);
        benchmark("test_gemm8192", ncnn::Mat(8192, 8192, 3), opt);

32 layers of [dim, dim] @ [dim, dim] gemms on Apple M4:

# molly @ mollydeMac-mini in ~/ncnn/benchmark on git:apple-amx-remastered x [10:40:06] 
$ ../build/benchmark/benchncnn.app/Contents/MacOS/benchncnn 100 4 2 -1 1      
loop_count = 100
num_threads = 4
powersave = 2
gpu_device = -1
cooling_down = 1
       test_gemm1024  min =    2.92  max =    5.42  avg =    3.04
       test_gemm2048  min =   10.93  max =   20.71  avg =   11.49
       test_gemm4096  min =   42.90  max =   73.63  avg =   44.38
       test_gemm8192  min =  172.38  max = 1099.07  avg =  211.27

# molly @ mollydeMac-mini in ~/ncnn/benchmark on git:apple-amx-remastered x [10:53:14] 
$ ../build-noamx/benchmark/benchncnn.app/Contents/MacOS/benchncnn 100 4 2 -1 1 
loop_count = 100
num_threads = 4
powersave = 2
gpu_device = -1
cooling_down = 1
       test_gemm1024  min =    5.35  max =    7.99  avg =    6.43
       test_gemm2048  min =   21.85  max =   29.20  avg =   24.55
       test_gemm4096  min =   89.58  max =   97.65  avg =   93.08
       test_gemm8192  min =  511.64  max = 2068.80  avg = 1229.32
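
To make the comparison explicit, the sketch below computes the speedup as the ratio of the avg columns from the two runs above (../build with AMX against ../build-noamx). The struct and function names are hypothetical; the numbers are copied from the logs, which benchncnn reports in milliseconds.

```cpp
#include <cstdio>

struct BenchRow
{
    const char* name;
    double avg_amx;   // avg ms from the ../build run
    double avg_noamx; // avg ms from the ../build-noamx run
};

// avg columns copied from the two benchncnn runs above
static const BenchRow kRows[] = {
    {"test_gemm1024", 3.04, 6.43},
    {"test_gemm2048", 11.49, 24.55},
    {"test_gemm4096", 44.38, 93.08},
    {"test_gemm8192", 211.27, 1229.32},
};

// Speedup of the AMX build over the no-AMX build for one row.
double speedup(const BenchRow& r)
{
    return r.avg_noamx / r.avg_amx;
}

void print_speedups()
{
    for (const BenchRow& r : kRows)
        std::printf("%s  %.2fx\n", r.name, speedup(r));
}
```

On these averages the AMX build is roughly 2.1x faster for dims 1024 through 4096 and roughly 5.8x faster at 8192. The 8192 figure is probably the noisiest, since the no-AMX average (1229.32) sits far above its minimum (511.64).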

Testing

# molly @ mollydeMac-mini in ~/ncnn/build on git:apple-amx-remastered x [10:03:21] 
$ ctest --output-on-failure -j10         
Test project /Users/molly/ncnn/build
        Start  17: test_binaryop_3
        Start 123: test_slice
        Start  24: test_convolution
        Start  26: test_convolution_2
        Start  25: test_convolution_1
        Start 125: test_softmax
        Start  71: test_gemm_3
        Start  38: test_crop_1
        Start  43: test_deconvolution
        Start  16: test_binaryop_2
  1/135 Test  #43: test_deconvolution ...............   Passed    0.67 sec
        Start  69: test_gemm
  2/135 Test  #71: test_gemm_3 ......................   Passed    1.22 sec
        Start  53: test_deformableconv2d_2
  3/135 Test  #38: test_crop_1 ......................   Passed    1.25 sec
        Start  52: test_deformableconv2d_1
  4/135 Test #125: test_softmax .....................   Passed    1.41 sec
        Start  31: test_convolutiondepthwise
  5/135 Test  #16: test_binaryop_2 ..................   Passed    1.48 sec
        Start  54: test_deformableconv2d_3
  6/135 Test  #26: test_convolution_2 ...............   Passed    1.64 sec
        Start  15: test_binaryop_1
  7/135 Test  #69: test_gemm ........................   Passed    1.01 sec
        Start  72: test_gemm_4
  8/135 Test  #24: test_convolution .................   Passed    1.82 sec
        Start  14: test_binaryop
  9/135 Test #123: test_slice .......................   Passed    1.98 sec
        Start  51: test_deformableconv2d
 10/135 Test  #25: test_convolution_1 ...............   Passed    2.11 sec
        Start  42: test_cumulativesum
 11/135 Test  #52: test_deformableconv2d_1 ..........   Passed    0.95 sec
        Start  37: test_crop
 12/135 Test  #17: test_binaryop_3 ..................   Passed    2.24 sec
        Start  30: test_convolution3d
 13/135 Test  #54: test_deformableconv2d_3 ..........   Passed    0.85 sec
        Start  70: test_gemm_1
 14/135 Test  #53: test_deformableconv2d_2 ..........   Passed    1.13 sec
        Start  47: test_deconvolutiondepthwise_1
 15/135 Test  #72: test_gemm_4 ......................   Passed    0.82 sec
        Start  27: test_convolution_3
 16/135 Test  #31: test_convolutiondepthwise ........   Passed    1.10 sec
        Start  46: test_deconvolutiondepthwise
 17/135 Test  #15: test_binaryop_1 ..................   Passed    1.00 sec
        Start 100: test_pooling3d
 18/135 Test  #14: test_binaryop ....................   Passed    0.93 sec
        Start  45: test_deconvolution3d
 19/135 Test  #51: test_deformableconv2d ............   Passed    0.79 sec
        Start  49: test_deconvolutiondepthwise3d
 20/135 Test  #42: test_cumulativesum ...............   Passed    0.74 sec
        Start  36: test_copyto_1
 21/135 Test  #47: test_deconvolutiondepthwise_1 ....   Passed    0.58 sec
        Start 135: test_yolov3detectionoutput
 22/135 Test  #37: test_crop ........................   Passed    0.84 sec
        Start  35: test_copyto
 23/135 Test  #70: test_gemm_1 ......................   Passed    0.77 sec
        Start  89: test_multiheadattention
 24/135 Test  #30: test_convolution3d ...............   Passed    0.89 sec
        Start 124: test_slice_oom
 25/135 Test  #46: test_deconvolutiondepthwise ......   Passed    0.67 sec
        Start  95: test_padding
 26/135 Test #100: test_pooling3d ...................   Passed    0.59 sec
        Start  29: test_convolution1d
 27/135 Test  #45: test_deconvolution3d .............   Passed    0.58 sec
        Start 112: test_reshape_1
 28/135 Test  #49: test_deconvolutiondepthwise3d ....   Passed    0.57 sec
        Start  90: test_multiheadattention_1
 29/135 Test #135: test_yolov3detectionoutput .......   Passed    0.49 sec
        Start  75: test_gru
 30/135 Test  #27: test_convolution_3 ...............   Passed    0.94 sec
        Start  73: test_gridsample
 31/135 Test  #36: test_copyto_1 ....................   Passed    0.60 sec
        Start 108: test_reorg
 32/135 Test  #89: test_multiheadattention ..........   Passed    0.44 sec
        Start  44: test_deconvolution1d
 33/135 Test #124: test_slice_oom ...................   Passed    0.48 sec
        Start  80: test_interp
 34/135 Test  #95: test_padding .....................   Passed    0.51 sec
        Start  98: test_pooling
 35/135 Test #112: test_reshape_1 ...................   Passed    0.39 sec
        Start  96: test_permute
 36/135 Test #108: test_reorg .......................   Passed    0.32 sec
        Start 114: test_rmsnorm
 37/135 Test  #75: test_gru .........................   Passed    0.38 sec
        Start  22: test_concat
 38/135 Test  #35: test_copyto ......................   Passed    0.76 sec
        Start  74: test_groupnorm
 39/135 Test  #90: test_multiheadattention_1 ........   Passed    0.49 sec
        Start 102: test_prelu
 40/135 Test  #29: test_convolution1d ...............   Passed    0.61 sec
        Start 109: test_requantize
 41/135 Test  #73: test_gridsample ..................   Passed    0.46 sec
        Start 126: test_softmax_oom
 42/135 Test  #44: test_deconvolution1d .............   Passed    0.43 sec
        Start 132: test_tile
 43/135 Test  #96: test_permute .....................   Passed    0.33 sec
        Start  85: test_lstm
 44/135 Test  #98: test_pooling .....................   Passed    0.40 sec
        Start  48: test_deconvolutiondepthwise1d
 45/135 Test #114: test_rmsnorm .....................   Passed    0.32 sec
        Start  92: test_noop
 46/135 Test  #74: test_groupnorm ...................   Passed    0.43 sec
        Start 127: test_softplus
 47/135 Test #102: test_prelu .......................   Passed    0.42 sec
        Start 106: test_reduction
 48/135 Test  #22: test_concat ......................   Passed    0.50 sec
        Start  59: test_einsum
 49/135 Test #109: test_requantize ..................   Passed    0.46 sec
        Start  91: test_multiheadattention_oom
 50/135 Test #126: test_softmax_oom .................   Passed    0.41 sec
        Start  86: test_matmul
 51/135 Test #132: test_tile ........................   Passed    0.38 sec
        Start  55: test_deformableconv2d_4
 52/135 Test  #48: test_deconvolutiondepthwise1d ....   Passed    0.42 sec
        Start 118: test_scale
 53/135 Test  #92: test_noop ........................   Passed    0.42 sec
        Start  99: test_pooling1d
 54/135 Test  #85: test_lstm ........................   Passed    0.46 sec
        Start  60: test_eltwise
 55/135 Test #127: test_softplus ....................   Passed    0.34 sec
        Start 111: test_reshape
 56/135 Test #106: test_reduction ...................   Passed    0.40 sec
        Start  50: test_deepcopy
 57/135 Test  #91: test_multiheadattention_oom ......   Passed    0.49 sec
        Start  67: test_gelu
 58/135 Test  #80: test_interp ......................   Passed    1.19 sec
        Start  94: test_packing
 59/135 Test  #86: test_matmul ......................   Passed    0.48 sec
        Start 128: test_spectrogram
 60/135 Test  #59: test_einsum ......................   Passed    0.53 sec
        Start 107: test_relu
 61/135 Test  #55: test_deformableconv2d_4 ..........   Passed    0.50 sec
        Start 119: test_selu
 62/135 Test #118: test_scale .......................   Passed    0.35 sec
        Start  84: test_lrn
 63/135 Test  #50: test_deepcopy ....................   Passed    0.36 sec
        Start  64: test_expanddims
 64/135 Test  #99: test_pooling1d ...................   Passed    0.50 sec
        Start  66: test_fold
 65/135 Test  #60: test_eltwise .....................   Passed    0.51 sec
        Start  61: test_elu
 66/135 Test #111: test_reshape .....................   Passed    0.51 sec
        Start  88: test_mish
 67/135 Test  #67: test_gelu ........................   Passed    0.31 sec
        Start 115: test_rnn
 68/135 Test  #94: test_packing .....................   Passed    0.31 sec
        Start  97: test_pixelshuffle
 69/135 Test #119: test_selu ........................   Passed    0.44 sec
        Start  58: test_dropout
 70/135 Test #128: test_spectrogram .................   Passed    0.47 sec
 71/135 Test  #84: test_lrn .........................   Passed    0.44 sec
        Start  56: test_dequantize
        Start  78: test_innerproduct
 72/135 Test #107: test_relu ........................   Passed    0.47 sec
        Start  65: test_flatten
 73/135 Test  #64: test_expanddims ..................   Passed    0.34 sec
        Start  87: test_memorydata
 74/135 Test  #66: test_fold ........................   Passed    0.34 sec
        Start 120: test_shrink
 75/135 Test  #61: test_elu .........................   Passed    0.51 sec
        Start 116: test_roipooling
 76/135 Test  #97: test_pixelshuffle ................   Passed    0.43 sec
        Start 129: test_squeeze
 77/135 Test  #88: test_mish ........................   Passed    0.46 sec
        Start  57: test_diag
 78/135 Test #115: test_rnn .........................   Passed    0.48 sec
        Start 101: test_power
 79/135 Test  #58: test_dropout .....................   Passed    0.30 sec
        Start 131: test_tanh
 80/135 Test  #78: test_innerproduct ................   Passed    0.31 sec
        Start  93: test_normalize
 81/135 Test  #87: test_memorydata ..................   Passed    0.47 sec
 82/135 Test #120: test_shrink ......................   Passed    0.46 sec
        Start 113: test_reshape_oom
        Start 103: test_priorbox
 83/135 Test  #65: test_flatten .....................   Passed    0.53 sec
        Start 117: test_roialign
 84/135 Test  #56: test_dequantize ..................   Passed    0.54 sec
        Start 110: test_requantize_oom
 85/135 Test #129: test_squeeze .....................   Passed    0.34 sec
        Start 104: test_quantize
 86/135 Test #116: test_roipooling ..................   Passed    0.35 sec
        Start  76: test_hardsigmoid
 87/135 Test  #57: test_diag ........................   Passed    0.51 sec
        Start  82: test_inversespectrogram
 88/135 Test #131: test_tanh ........................   Passed    0.46 sec
        Start  83: test_layernorm
 89/135 Test #101: test_power .......................   Passed    0.48 sec
        Start 105: test_quantize_oom
 90/135 Test #117: test_roialign ....................   Passed    0.32 sec
        Start 133: test_unaryop
 91/135 Test #113: test_reshape_oom .................   Passed    0.32 sec
        Start  79: test_instancenorm
 92/135 Test #110: test_requantize_oom ..............   Passed    0.51 sec
        Start  40: test_crop_3
 93/135 Test #104: test_quantize ....................   Passed    0.46 sec
        Start 122: test_sigmoid
 94/135 Test #103: test_priorbox ....................   Passed    0.53 sec
        Start 130: test_swish
 95/135 Test  #76: test_hardsigmoid .................   Passed    0.49 sec
        Start  77: test_hardswish
 96/135 Test  #82: test_inversespectrogram ..........   Passed    0.36 sec
        Start  63: test_erf
 97/135 Test  #93: test_normalize ...................   Passed    0.85 sec
        Start  68: test_glu
 98/135 Test  #79: test_instancenorm ................   Passed    0.44 sec
        Start  32: test_convolutiondepthwise_1
 99/135 Test #105: test_quantize_oom ................   Passed    0.52 sec
        Start  81: test_interp_1
100/135 Test  #83: test_layernorm ...................   Passed    0.53 sec
        Start 121: test_shufflechannel
101/135 Test #133: test_unaryop .....................   Passed    0.50 sec
        Start  62: test_embed
102/135 Test #122: test_sigmoid .....................   Passed    0.32 sec
        Start 134: test_unfold
103/135 Test  #40: test_crop_3 ......................   Passed    0.37 sec
        Start  41: test_crop_oom
104/135 Test  #63: test_erf .........................   Passed    0.45 sec
        Start  39: test_crop_2
105/135 Test  #77: test_hardswish ...................   Passed    0.48 sec
        Start   6: test_squeezenet
106/135 Test #130: test_swish .......................   Passed    0.51 sec
        Start  34: test_convolutiondepthwise3d
107/135 Test  #68: test_glu .........................   Passed    0.41 sec
        Start  33: test_convolutiondepthwise1d
108/135 Test  #32: test_convolutiondepthwise_1 ......   Passed    0.46 sec
        Start   5: test_mat_pixel
109/135 Test  #81: test_interp_1 ....................   Passed    0.49 sec
        Start   4: test_mat_pixel_resize
110/135 Test #121: test_shufflechannel ..............   Passed    0.52 sec
        Start  28: test_convolution_oom
111/135 Test #134: test_unfold ......................   Passed    0.44 sec
        Start  20: test_celu
112/135 Test  #62: test_embed .......................   Passed    0.48 sec
        Start  13: test_bias
113/135 Test  #41: test_crop_oom ....................   Passed    0.40 sec
        Start   9: test_expression
114/135 Test  #39: test_crop_2 ......................   Passed    0.32 sec
        Start  21: test_clip
115/135 Test   #4: test_mat_pixel_resize ............   Passed    0.32 sec
        Start   2: test_mat_pixel_drawing
116/135 Test   #5: test_mat_pixel ...................   Passed    0.35 sec
        Start  19: test_cast
117/135 Test  #28: test_convolution_oom .............   Passed    0.31 sec
        Start  18: test_bnll
118/135 Test  #33: test_convolutiondepthwise1d ......   Passed    0.57 sec
        Start  12: test_batchnorm
119/135 Test  #34: test_convolutiondepthwise3d ......   Passed    0.63 sec
        Start   3: test_mat_pixel_rotate
120/135 Test   #6: test_squeezenet ..................   Passed    0.67 sec
        Start   7: test_c_api
121/135 Test   #9: test_expression ..................   Passed    0.52 sec
        Start  10: test_paramdict
122/135 Test  #20: test_celu ........................   Passed    0.53 sec
        Start  11: test_absval
123/135 Test  #13: test_bias ........................   Passed    0.52 sec
        Start   1: test_mat_pixel_affine
124/135 Test  #21: test_clip ........................   Passed    0.45 sec
        Start  23: test_concat_oom
125/135 Test   #2: test_mat_pixel_drawing ...........   Passed    0.32 sec
        Start   8: test_cpu
126/135 Test  #19: test_cast ........................   Passed    0.53 sec
127/135 Test   #3: test_mat_pixel_rotate ............   Passed    0.42 sec
128/135 Test  #18: test_bnll ........................   Passed    0.51 sec
129/135 Test   #7: test_c_api .......................   Passed    0.39 sec
130/135 Test  #12: test_batchnorm ...................   Passed    0.49 sec
131/135 Test  #10: test_paramdict ...................   Passed    0.32 sec
132/135 Test   #8: test_cpu .........................   Passed    0.46 sec
133/135 Test  #11: test_absval ......................   Passed    0.54 sec
134/135 Test  #23: test_concat_oom ..................   Passed    0.54 sec
135/135 Test   #1: test_mat_pixel_affine ............   Passed    0.55 sec

100% tests passed, 0 tests failed out of 135

Total Test time (real) =   8.19 sec

tencent-adm · 2025-08-31T13:43:59Z

All committers have signed the CLA.

github-actions · 2025-08-31T14:04:34Z

The binary size change of libncnn.so (bytes)

architecture	base size	pr size	difference
x86_64	15124728	15124784	+56 ⚠️
armhf	6155744	6155824	+80 ⚠️
aarch64	9453192	9452928	-264 😘

github-actions · 2025-08-31T14:25:30Z

Please enable github action in YOUR FORKED REPO to make code-format workflow work

codecov-commenter · 2025-08-31T14:27:03Z

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.59%. Comparing base (a514cf5) to head (da499ad).
⚠️ Report is 9 commits behind head on master.

Files with missing lines	Patch %	Lines
src/cpu.cpp	0.00%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6293      +/-   ##
==========================================
- Coverage   95.89%   95.59%   -0.30%     
==========================================
  Files         837      837              
  Lines      264994   264997       +3     
==========================================
- Hits       254105   253327     -778     
- Misses      10889    11670     +781


nihui

A lot of code is duplicated in gemm_arm_asimdhp.cpp; it should be extracted into gemm_fp16sa.h so that the implementation is unified in a single file.

Apple AMX requires additional macro definitions, such as __ARM_FEATURE_APPLE_AMX or __ARM_FEATURE_APPLE_AMX2
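
A minimal sketch of how such macros could gate the kernels at compile time. The two __ARM_FEATURE_APPLE_AMX* names come from the comment above and are assumed here to be supplied by the build system; SKETCH_AMX_LEVEL and gemm_kernel_name are hypothetical names, not part of ncnn.

```cpp
// Assumption: these macro names come from the review comment and are expected to be
// supplied by the build system (for example with -D) when the AMX kernels are enabled.
#if defined(__ARM_FEATURE_APPLE_AMX2)
#define SKETCH_AMX_LEVEL 2
#elif defined(__ARM_FEATURE_APPLE_AMX)
#define SKETCH_AMX_LEVEL 1
#else
#define SKETCH_AMX_LEVEL 0
#endif

// Hypothetical helper: reports which GEMM path this translation unit was compiled with.
const char* gemm_kernel_name()
{
#if SKETCH_AMX_LEVEL == 2
    return "amx2";
#elif SKETCH_AMX_LEVEL == 1
    return "amx";
#else
    return "generic";
#endif
}
```

With a guard like this, a build without the macros still compiles and falls back to the generic path.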

src/layer/arm/amx_usability.h

nihui · 2025-09-10T06:08:52Z

src/cpu.cpp

+{
+    try_initialize_global_cpu_info();
+#if __aarch64__ && __APPLE__
+    return g_hw_cpufamily == CPUFAMILY_ARM_FIRESTORM_ICESTORM // M1
+           || g_hw_cpufamily == CPUFAMILY_ARM_AVALANCHE_BLIZZARD // M2
+           || g_hw_cpufamily == CPUFAMILY_ARM_IBIZA // M3
+           || g_hw_cpufamily == CPUFAMILY_ARM_LOBOS // M3 Pro
+           || g_hw_cpufamily == CPUFAMILY_ARM_PALMA // M3 Max
+           || g_hw_cpufamily == CPUFAMILY_ARM_DONAN // M4
+           || g_hw_cpufamily == CPUFAMILY_ARM_BRAVA; // M4 Pro / M4 Max
+
+#else
+    return 0;


Discover the CPU ISA info in initialize_global_cpu_info(), for example by checking the sysctl:

hw.optional.amx_version == 2
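
One way this runtime check could look, using sysctlbyname. This is a sketch rather than the ncnn implementation: query_amx_version and cpu_has_amx2 are hypothetical names, the sysctl value is assumed to fit in an int, and on non-Apple platforms the function simply reports "unavailable".

```cpp
#include <cstddef>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

// Returns the value of the hw.optional.amx_version sysctl,
// or -1 if it cannot be read (missing key, or not an Apple platform).
int query_amx_version()
{
#if defined(__APPLE__)
    int value = 0; // assumption: the sysctl is exposed as an int-sized value
    size_t len = sizeof(value);
    if (sysctlbyname("hw.optional.amx_version", &value, &len, NULL, 0) != 0)
        return -1;
    return value;
#else
    return -1;
#endif
}

// Hypothetical predicate matching the check suggested in the review.
bool cpu_has_amx2()
{
    return query_amx_version() == 2;
}
```

Querying the capability directly avoids maintaining a list of CPU family constants that has to grow with every new chip.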

Signed-off-by: Molly Sophia <[email protected]>

github-actions bot added core arm cmake labels Aug 31, 2025

MollySophia changed the title ~~WIP: Apple AMX GEMM optimization~~ Apple AMX GEMM optimization Sep 10, 2025

nihui requested changes Sep 10, 2025

MollySophia added 3 commits September 10, 2025 22:45

Apple AMX gemm optimizations

49dc103

Signed-off-by: Molly Sophia <[email protected]>

packB optimize & WIP packA

4316578

Signed-off-by: Molly Sophia <[email protected]>

remove debug logging

27ce7b1

Signed-off-by: Molly Sophia <[email protected]>

MollySophia force-pushed the apple-amx-remastered branch from 854164c to 27ce7b1 on September 10, 2025 14:45

SPDX style license header for amx_usability.h

da499ad

Signed-off-by: Molly Sophia <[email protected]>
