-
Notifications
You must be signed in to change notification settings - Fork 12.3k
remove templates from soft_max_f32_submitter to allow SYCL graph updates #13724
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@lslusarczyk Thank you! |
We will run some benchmarks for this change, thanks. I suspect the templates arguments were introduced to optimize the kernel in some cases. It would probably be safer to use the "runtime version" of softmax (i.e. with the templates Also have you been able to identify how come a different soft_max can be used in each iteration? I can see this depends on the first dimension of |
I can confirm this has too much impact on performance when SYCL-Graph is not used, see the results of
with the PR:
I think this is enough evidence to keep the template arguments when SYCL-Graph is not used. I suggest storing in the backend context whether SYCL-Graph ends up being used so that it can be checked when submitting softmax kernels. This could be a new field in llama.cpp/ggml/src/ggml-sycl/common.hpp Line 318 in 6f180b9
I'm also attaching some full model benchmark results below. Most cases seem slightly worse. The improvement in the TG of phi3 3B Q4_K - Medium is odd. I haven't looked closer at this model. master:
PR:
|
SYCL graph feature has nothing with hardware platform. |
@NeoZhangJianyu SYCL-Graph already has a global variable to try and enable SYCL-Graph but there are also more checks as SYCL-Graph doesn't support all graphs not all devices, see llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp Line 3897 in 4f81b33
Given this check, |
@qnixsynapse , @NeoZhangJianyu , thank you for reviewing my code and for your comments. @Rbiessy , thank you very much for checking the performance. Before deciding to have two code paths, which will make code uglier (instead of simplier as I tried in this PR), I'd like to try to understand why these templates were causing better performance. By code analysis I expected nearly no impact by removing template parameters that I changed to be arguments. Expect my updates here in a few days. |
Thanks @lslusarczyk, that would be useful yes.
Just to clarify in my latest comment I suggest to tweak the condition in |
I implemented it the way you suggested. Although now we use same kernels and graphs do not need to be recreated but benchmarks showed me terrible performance drop. This must be because of "else" kernel being significantly slower than non-"else" ones. |
What are you comparing to measure the performance drop? Do you mean the performance of using SYCL-Graph with your patch to only use the fallback softmax is worse than using SYCL-Graph with current master version? I would expect allowing to re-use the same graph would be a good improvement. I think @sgeor255 should be able to help with the investigation as we discussed. |
When
soft_max_f32_sycl
is templated, then update of SYCL graph will fail because of different node type. Having just kernel parameters allows to update just parameters.