CUDA: add softmax broadcast #14475
base: gg/ggml-batch-soft-max-ops
Conversation
// FIXME: this limit could be raised by ~2-4x on Ampere or newer
if (nbytes_shared < ggml_cuda_info().devices[ggml_cuda_get_device()].smpb) {
Since you were asking me for things to do, consider also tackling this. It's not an immediate problem, but if we ever want to do sampling on the GPU it will be. See llama.cpp/ggml/src/ggml-cuda/mmq.cuh, lines 3019 to 3026 at 343b6e9:
#if !(defined(GGML_USE_HIP) && defined(__HIP_PLATFORM_AMD__)) && !defined(GGML_USE_MUSA)
    static bool shared_memory_limit_raised[GGML_CUDA_MAX_DEVICES] = {false};
    if (!shared_memory_limit_raised[id]) {
        CUDA_CHECK(cudaFuncSetAttribute(mul_mat_q<type, mmq_x, MMQ_NWARPS, false>, cudaFuncAttributeMaxDynamicSharedMemorySize, nbytes_shared));
        CUDA_CHECK(cudaFuncSetAttribute(mul_mat_q<type, mmq_x, MMQ_NWARPS, true>, cudaFuncAttributeMaxDynamicSharedMemorySize, nbytes_shared));
        shared_memory_limit_raised[id] = true;
    }
#endif // !(defined(GGML_USE_HIP) && defined(__HIP_PLATFORM_AMD__)) && !defined(GGML_USE_MUSA)
Also, I checked with NVIDIA engineers: it is not possible to raise the shared memory limit universally; you have to do it manually for each function. Yes, it's stupid.
Let me know if I understand this correctly: prior to calling any kernel which uses shared memory, we should somehow call cudaFuncSetAttribute for that kernel. If so, can we create a macro for this?
This only matters for like 3 kernels that sometimes want to use more than 48 kB of shared memory. The desired behavior is to raise the shared memory limit once per function and device. Raising the limit multiple times does not result in an error, but it makes the code slower (hence the static variable).
A macro could conceivably be used to solve this issue.
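For illustration, a rough sketch of what such a macro could look like, mirroring the mmq.cuh pattern quoted above. The macro name is invented here; CUDA_CHECK, GGML_CUDA_MAX_DEVICES and ggml_cuda_get_device() are assumed to be the existing ggml-cuda helpers, and a real version would presumably also need the same HIP/MUSA guard as in mmq.cuh:

// Hypothetical helper (name made up for this sketch): raise the dynamic shared
// memory limit for `kernel` at most once per device. The static flag array is
// local to each expansion site, so every kernel/call site tracks its own
// per-device state. Note: templated kernels whose names contain commas would
// need a variadic (__VA_ARGS__) variant or an extra set of parentheses.
#define GGML_CUDA_SET_SHARED_MEMORY_LIMIT(kernel, nbytes)                         \
    do {                                                                          \
        static bool shared_memory_limit_raised[GGML_CUDA_MAX_DEVICES] = {false};  \
        const int id = ggml_cuda_get_device();                                    \
        if (!shared_memory_limit_raised[id]) {                                    \
            CUDA_CHECK(cudaFuncSetAttribute(                                      \
                kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, nbytes));    \
            shared_memory_limit_raised[id] = true;                                \
        }                                                                         \
    } while (0)

The macro would be invoked once before the usual <<<grid, block, nbytes_shared, stream>>> launch; because the static array lives inside the expansion, this matches the "once per function and device" behavior described above.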
I'll take a look at this
Targets #14435; note that this is just the softmax. I also refactored the code a little bit to make it cleaner.
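For readers skimming the thread, a deliberately simplified sketch of the broadcasting idea itself. This is not the PR's kernel: ggml broadcasts the mask per dimension, whereas this toy version collapses that into a simple modulo over rows, uses one thread per row, and does no shared-memory reductions; all names are illustrative.

#include <cuda_runtime.h>
#include <math.h>

// Toy kernel, illustrative only: softmax over each row of x after adding a mask
// whose rows are broadcast (reused) when the mask has fewer rows than x.
// Assumes nrows is a multiple of nrows_mask.
__global__ void soft_max_broadcast_toy(const float * x, const float * mask, float * dst,
                                       const int ncols, const int nrows, const int nrows_mask) {
    const int row = blockIdx.x*blockDim.x + threadIdx.x;
    if (row >= nrows) {
        return;
    }
    const int mrow = row % nrows_mask; // broadcast: reuse mask rows across the batch

    const float * xr = x    + (size_t) row  * ncols;
    const float * mr = mask + (size_t) mrow * ncols;
    float       * dr = dst  + (size_t) row  * ncols;

    // subtract the row maximum for numerical stability
    float maxval = -INFINITY;
    for (int i = 0; i < ncols; ++i) {
        maxval = fmaxf(maxval, xr[i] + mr[i]);
    }

    float sum = 0.0f;
    for (int i = 0; i < ncols; ++i) {
        dr[i] = expf(xr[i] + mr[i] - maxval);
        sum += dr[i];
    }
    for (int i = 0; i < ncols; ++i) {
        dr[i] /= sum;
    }
}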