OpenCL: add tiled mul_mat_f16_f32 #14535
Conversation
@rmatif thank you for the PR. I will play with it and the direct convolution PR in the next few days. For matmul, using image1d_buffer is probably the easiest way to utilize the L1 cache - it wraps around a normal cl buffer and uses read_image for access, so the indexing stays the same as with a cl buffer. The Q4_0 matmul is already doing this. It is also possible to use a normal cl buffer for one matrix input and an image1d_buffer for the other, to use both load paths.
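To make the suggestion concrete, here is a minimal sketch (not code from this PR; the function name, buffer names, and the RGBA/half texel layout are assumptions) of wrapping an existing cl buffer in an image1d_buffer on the host:

#include <string.h>
#include <CL/cl.h>

/* Wrap an existing cl_mem buffer of half-precision data in a 1D image so
 * that kernel loads go through the texture/L1 path. No copy occurs: the
 * image aliases the buffer's storage. */
cl_mem wrap_as_image1d_buffer(cl_context ctx, cl_mem buf,
                              size_t n_half_elems, cl_int *err) {
    cl_image_format fmt = { CL_RGBA, CL_HALF_FLOAT };  /* 4 halves per texel */
    cl_image_desc desc;
    memset(&desc, 0, sizeof(desc));
    desc.image_type  = CL_MEM_OBJECT_IMAGE1D_BUFFER;
    desc.image_width = n_half_elems / 4;               /* width in texels */
    desc.buffer      = buf;                            /* alias, no copy */
    return clCreateImage(ctx, CL_MEM_READ_ONLY, &fmt, &desc, NULL, err);
}

On the kernel side, image1d_buffer_t is read with a plain integer coordinate and no sampler, which is why the indexing can match the buffer version:

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

kernel void example(read_only image1d_buffer_t A, global float *dst) {
    int i = get_global_id(0);
    half4 a = read_imageh(A, i);   /* same linear index as the cl buffer */
    dst[i] = a.x + a.y + a.z + a.w;
}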
@lhez You're right, using
@zhouwg Please reach out to me via email, and I'll send you the build scripts and discuss further, as this seems off-topic here.
@rmatif, Thanks so much for your help. I'm so excited that this is my first time running the ggml-opencl backend on my Snapdragon 8 Elite based phone. llama-bench with qwen1_5-1_8b-chat-q4_0.gguf on master:
llama-cli with qwen1_5-1_8b-chat-q4_0.gguf on master:
llama-bench with Llama-3.2-1B-Instruct-f16.gguf on this PR:
llama-bench with Llama-3.2-1B-Instruct-f16.gguf on master:
BTW, I provide a simple build/shell script to build the ggml-opencl backend on Linux and simplify the workflow: https://github.com/zhouwg/ggml-hexagon/blob/self-build/scripts/build-run-ggmlopencl-android.sh Can I add this script to this excellent PR, or submit a standalone PR, so other developers can use it to verify ggml-opencl related PRs or learn something about OpenCL programming on Android phones? The script is simple - no technical difficulty - but might be very useful to other developers.
#define OPTN 8

#define WG_M (OPWM / OPTM)
#define WG_N (OPWN / OPTN)
WG_M and WG_N seem to be the workgroup size - can they be replaced with get_local_size()?
I tested replacing the macros with get_local_size(), but it resulted in a significant performance regression (~17%). Using compile-time constants is critical here, as it allows the compiler to fully unroll the inner loops and pre-calculate memory address offsets, an optimization that is lost when WG_M becomes a runtime value.
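As a simplified, hypothetical illustration (not the PR's actual loop; a_frag, b_frag, and acc are made-up names), compile-time tile bounds are what let the compiler unroll and fold the addressing:

float acc[OPTM][OPTN] = {{0.0f}};
half  a_frag[OPTM];
half  b_frag[OPTN];
/* ... a_frag / b_frag loaded from the shared tiles ... */

// With OPTM/OPTN known at compile time, both loops fully unroll and the
// acc[][] indices resolve to fixed registers.
#pragma unroll
for (int m = 0; m < OPTM; m++) {
    #pragma unroll
    for (int n = 0; n < OPTN; n++) {
        acc[m][n] = mad(a_frag[m], b_frag[n], acc[m][n]);
    }
}
// If the bound were get_local_size(0) instead, the trip count would be a
// runtime value: the unroll and the pre-computed offsets are lost, and
// acc[][] may spill out of registers.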
const int OPWM = 64;
const int OPWN = 64;
const int TPWM = 16;
const int TPWN = 8;
I think OPWM, OPWN, TPWM, TPWN, OPTM, OPTN are related, e.g., TPWM can be calculated from OPWM and OPTM. I wonder if it is possible to do the calculation, or maybe add a comment about how they are related.
They are all mathematically related, but keeping them as explicit compile-time constants is generally better because it allows the compiler to perform aggressive, hardware-specific optimizations. This enables full loop unrolling to eliminate expensive branching and allows accumulator arrays to be allocated directly in the fastest registers, optimizations that are impossible if these values are calculated as runtime variables. I will add comments about the relationship.
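For reference, such a comment might look like the following (a sketch; OPTM = 4 is derived assuming the relation TPWM = OPWM / OPTM holds as discussed, and OPTN = 8 matches the define above):

// Tile-size relationships:
//   OPWM x OPWN  output tile computed per workgroup   (64 x 64)
//   TPWM x TPWN  threads per workgroup                (16 x 8)
//   OPTM x OPTN  outputs computed per thread:
//                OPTM = OPWM / TPWM = 4
//                OPTN = OPWN / TPWN = 8
//   WG_M = OPWM / OPTM = TPWM,  WG_N = OPWN / OPTN = TPWN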
Would constexpr (as opposed to const) work for this?
constexpr is a C++ feature and wouldn't apply here, as the OpenCL kernel is compiled separately at runtime using the C-like OpenCL language.
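The closest OpenCL analogue is baking the values in as macros when the program is built at runtime, e.g. (a sketch; the -D values assume the constants discussed above):

/* Host side: pass tile sizes as compile-time macros when building the
 * kernel - the nearest OpenCL equivalent of constexpr. */
const char *build_opts = "-DOPWM=64 -DOPWN=64 -DOPTM=4 -DOPTN=8";
cl_int err = clBuildProgram(program, 1, &device, build_opts, NULL, NULL);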
@lhez I'm almost done with a new kernel that uses
Cool, thank you. Merging this now.
This PR introduces a new mul_mat_f16_f32 kernel that leverages tiling and vectorization. I believe this will serve as a strong baseline for future improvements.

In a future PR, I may explore using image2d_t to utilize the L1 cache for mul_mat and conv2d operations. This is a bit tricky, as it requires some data preprocessing on the host side (see the sketch at the end of this description).

Results on Adreno 830:
Master:
This PR:
@lhez @max-krasnyansky
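For illustration, the host-side preprocessing for the image2d_t idea could look roughly like this (a sketch, not part of this PR; the names, dimensions, and RGBA/half layout are assumptions):

/* Copy a linear K x N weight buffer into a 2D image so kernel reads can
 * go through the texture/L1 cache path. */
cl_image_format fmt = { CL_RGBA, CL_HALF_FLOAT };   /* 4 halves per texel */
cl_image_desc desc;
memset(&desc, 0, sizeof(desc));
desc.image_type   = CL_MEM_OBJECT_IMAGE2D;
desc.image_width  = K / 4;                          /* texels along K */
desc.image_height = N;
cl_mem img = clCreateImage(ctx, CL_MEM_READ_ONLY, &fmt, &desc, NULL, &err);

size_t origin[3] = { 0, 0, 0 };
size_t region[3] = { K / 4, N, 1 };
err = clEnqueueCopyBufferToImage(queue, weights_buf, img, 0,
                                 origin, region, 0, NULL, NULL);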