CUDA: add softmax broadcast #14475
base: gg/ggml-batch-soft-max-ops
Conversation
// FIXME: this limit could be raised by ~2-4x on Ampere or newer
if (nbytes_shared < ggml_cuda_info().devices[ggml_cuda_get_device()].smpb) {
Since you were asking me for things to do, consider also tackling this. It's not an immediate problem, but if we ever want to do sampling on the GPU it will be. See llama.cpp/ggml/src/ggml-cuda/mmq.cuh, lines 3019 to 3026 at 343b6e9:
#if !(defined(GGML_USE_HIP) && defined(__HIP_PLATFORM_AMD__)) && !defined(GGML_USE_MUSA)
    static bool shared_memory_limit_raised[GGML_CUDA_MAX_DEVICES] = {false};
    if (!shared_memory_limit_raised[id]) {
        CUDA_CHECK(cudaFuncSetAttribute(mul_mat_q<type, mmq_x, MMQ_NWARPS, false>, cudaFuncAttributeMaxDynamicSharedMemorySize, nbytes_shared));
        CUDA_CHECK(cudaFuncSetAttribute(mul_mat_q<type, mmq_x, MMQ_NWARPS, true>, cudaFuncAttributeMaxDynamicSharedMemorySize, nbytes_shared));
        shared_memory_limit_raised[id] = true;
    }
#endif // !(defined(GGML_USE_HIP) && defined(__HIP_PLATFORM_AMD__)) && !defined(GGML_USE_MUSA)
Also, I checked with NVIDIA engineers: it is not possible to raise the shared memory limit universally; you have to do it manually for each function. Yes, it's stupid.
Let me know if I understand this correctly: prior to calling any kernel which uses shared memory, we should somehow call cudaFuncSetAttribute for that kernel. If so, can we create a macro for this?
This only matters for like 3 kernels that sometimes want to use more than 48 kB of shared memory. The desired behavior is to raise the shared memory limit once per function and device. Raising the limit multiple times does not result in an error, but it makes the code slower (hence the static variable).
A macro could conceivably be used to solve this issue.
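For illustration, a rough sketch of what such a macro could look like, mirroring the mmq.cuh pattern quoted above. The macro name is invented here; CUDA_CHECK, GGML_CUDA_MAX_DEVICES and ggml_cuda_get_device() are assumed to be the existing ggml-cuda helpers, and a real version would presumably also need the same HIP/MUSA guard as in mmq.cuh:

// Hypothetical helper (name made up for this sketch): raise the dynamic shared
// memory limit for `kernel` at most once per device. The static flag array is
// local to each expansion site, so every kernel/call site tracks its own
// per-device state. Note: templated kernels whose names contain commas would
// need a variadic (__VA_ARGS__) variant or an extra set of parentheses.
#define GGML_CUDA_SET_SHARED_MEMORY_LIMIT(kernel, nbytes)                         \
    do {                                                                          \
        static bool shared_memory_limit_raised[GGML_CUDA_MAX_DEVICES] = {false};  \
        const int id = ggml_cuda_get_device();                                    \
        if (!shared_memory_limit_raised[id]) {                                    \
            CUDA_CHECK(cudaFuncSetAttribute(                                      \
                kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, nbytes));    \
            shared_memory_limit_raised[id] = true;                                \
        }                                                                         \
    } while (0)

The macro would be invoked once before the usual <<<grid, block, nbytes_shared, stream>>> launch; because the static array lives inside the expansion, this matches the "once per function and device" behavior described above.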
I'll take a look at this
Targets #14435; note that this is just the softmax. I also refactored the code a little bit to make it cleaner.
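For readers skimming the thread, a deliberately simplified sketch of the broadcasting idea itself. This is not the PR's kernel: ggml broadcasts the mask per dimension, whereas this toy version collapses that into a simple modulo over rows, uses one thread per row, and does no shared-memory reductions; all names are illustrative.

#include <cuda_runtime.h>
#include <math.h>

// Toy kernel, illustrative only: softmax over each row of x after adding a mask
// whose rows are broadcast (reused) when the mask has fewer rows than x.
// Assumes nrows is a multiple of nrows_mask.
__global__ void soft_max_broadcast_toy(const float * x, const float * mask, float * dst,
                                       const int ncols, const int nrows, const int nrows_mask) {
    const int row = blockIdx.x*blockDim.x + threadIdx.x;
    if (row >= nrows) {
        return;
    }
    const int mrow = row % nrows_mask; // broadcast: reuse mask rows across the batch

    const float * xr = x    + (size_t) row  * ncols;
    const float * mr = mask + (size_t) mrow * ncols;
    float       * dr = dst  + (size_t) row  * ncols;

    // subtract the row maximum for numerical stability
    float maxval = -INFINITY;
    for (int i = 0; i < ncols; ++i) {
        maxval = fmaxf(maxval, xr[i] + mr[i]);
    }

    float sum = 0.0f;
    for (int i = 0; i < ncols; ++i) {
        dr[i] = expf(xr[i] + mr[i] - maxval);
        sum += dr[i];
    }
    for (int i = 0; i < ncols; ++i) {
        dr[i] /= sum;
    }
}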