Apple AMX GEMM optimization #6293
Codecov Report
Coverage diff, master vs. #6293:

            master     #6293     +/-
Coverage    95.89%    95.59%    -0.30%
Files          837       837
Lines       264994    264997        +3
Hits        254105    253327      -778
Misses       10889     11670      +781
nihui
left a comment
A lot of this code duplicates gemm_arm_asimdhp.cpp and should be extracted into gemm_fp16sa.h, so that the shared implementation lives in a single file.
Apple AMX also requires additional macro definitions, such as __ARM_FEATURE_APPLE_AMX or __ARM_FEATURE_APPLE_AMX2.
{
    try_initialize_global_cpu_info();
#if __aarch64__ && __APPLE__
    return g_hw_cpufamily == CPUFAMILY_ARM_FIRESTORM_ICESTORM // M1
           || g_hw_cpufamily == CPUFAMILY_ARM_AVALANCHE_BLIZZARD // M2
           || g_hw_cpufamily == CPUFAMILY_ARM_IBIZA // M3
           || g_hw_cpufamily == CPUFAMILY_ARM_LOBOS // M3 Pro
           || g_hw_cpufamily == CPUFAMILY_ARM_PALMA // M3 Max
           || g_hw_cpufamily == CPUFAMILY_ARM_DONAN // M4
           || g_hw_cpufamily == CPUFAMILY_ARM_BRAVA; // M4 Pro / M4
#else
    return 0;
Discover the CPU ISA info in initialize_global_cpu_info(), for example by checking:
hw.optional.amx_version == 2
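The suggestion above amounts to querying the `hw.optional.amx_version` sysctl key once at startup instead of listing CPU families. A minimal sketch follows, assuming the key is read with `sysctlbyname` as a plain `int`. The function names `query_amx_version` and `cpu_support_apple_amx2` are hypothetical and do not come from the PR or from ncnn. On non-Apple targets, or when the key is missing, the sketch reports 0.

```cpp
#include <cstddef>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

// Hypothetical helper: returns the value of the hw.optional.amx_version
// sysctl key, or 0 when the key is unavailable or the target is not Apple arm64.
static int query_amx_version()
{
#if defined(__APPLE__) && defined(__aarch64__)
    int value = 0;
    size_t size = sizeof(value);
    // sysctlbyname returns 0 on success and -1 on failure (e.g. unknown key).
    if (sysctlbyname("hw.optional.amx_version", &value, &size, NULL, 0) != 0)
        return 0;
    return value;
#else
    return 0;
#endif
}

// Hypothetical predicate mirroring the reviewer's condition (version == 2).
static int cpu_support_apple_amx2()
{
    return query_amx_version() == 2;
}
```

In a real integration the value would presumably be cached in a global during `initialize_global_cpu_info()`, so the sysctl is queried only once.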
Signed-off-by: Molly Sophia <[email protected]>
Signed-off-by: Molly Sophia <[email protected]>
Signed-off-by: Molly Sophia <[email protected]>
854164c to 27ce7b1
Signed-off-by: Molly Sophia <[email protected]>
Progress:
Unfortunately, I have been too busy with my internship work to finish this optimization: only the 32x8 microkernels with packB 32 are implemented. The performance gain could be much higher with a full implementation.
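For readers unfamiliar with the terminology, here is a scalar reference sketch of a packed-B microkernel GEMM. It is not the PR's code and uses no AMX instructions or fp16; it uses plain `float`. It also assumes one possible reading of "32x8 with packB 32": B is repacked into panels 32 columns wide, and each microkernel call produces a tile of 8 rows by 32 columns of C. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumed tile shape for this sketch: 8 rows by 32 columns of C.
static const int TILE_M = 8;
static const int TILE_N = 32;

// Pack row-major B (K x N) into panels of TILE_N columns. Inside a panel,
// the TILE_N values for each k are contiguous; tail columns are zero-padded.
static std::vector<float> pack_b_32(const float* B, int K, int N)
{
    const int panels = (N + TILE_N - 1) / TILE_N;
    std::vector<float> packed((size_t)panels * K * TILE_N, 0.f);
    for (int p = 0; p < panels; p++)
        for (int k = 0; k < K; k++)
            for (int j = 0; j < TILE_N; j++)
            {
                const int col = p * TILE_N + j;
                if (col < N)
                    packed[((size_t)p * K + k) * TILE_N + j] = B[(size_t)k * N + col];
            }
    return packed;
}

// Compute one C tile (rows x cols, at most 8 x 32) as a sum of K rank-1 updates.
static void microkernel_8x32(const float* A, int lda, const float* panel,
                             float* C, int ldc, int K, int rows, int cols)
{
    float acc[TILE_M][TILE_N] = {};
    for (int k = 0; k < K; k++)
        for (int i = 0; i < rows; i++)
        {
            const float a = A[(size_t)i * lda + k];
            for (int j = 0; j < TILE_N; j++)
                acc[i][j] += a * panel[(size_t)k * TILE_N + j];
        }
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            C[(size_t)i * ldc + j] = acc[i][j];
}

// C (M x N) = A (M x K) * B (K x N), all row-major.
static void gemm_packed_b32(const float* A, const float* B, float* C, int M, int N, int K)
{
    const std::vector<float> packed = pack_b_32(B, K, N);
    for (int n0 = 0, p = 0; n0 < N; n0 += TILE_N, p++)
        for (int m0 = 0; m0 < M; m0 += TILE_M)
            microkernel_8x32(A + (size_t)m0 * K, K,
                             packed.data() + (size_t)p * K * TILE_N,
                             C + (size_t)m0 * N + n0, N, K,
                             std::min(TILE_M, M - m0), std::min(TILE_N, N - n0));
}
```

Packing makes the inner loop read B with unit stride, which is what allows a hardware outer-product unit to consume a whole 32-wide row per step. Edge tiles are handled here by zero-padding B and clipping the store.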
Benchmarking
test_gemm.param.zip
benchncnn.cpp:
32 layers of [dim, dim] @ [dim, dim] gemms on Apple M4:
Testing