ggml : support broadcast for ggml_soft_max_ext and ggml_flash_attn_ext #14435
Merged: +442 −237

Conversation
I've started working on the Vulkan backend support for this.
Vulkan support is in #14449, targeted to this branch.
* CUDA: add softmax broadcast
* Pass by const ref
* Review: Use blockDims for indexing, remove designated initializers
* Add TODO for noncontiguous input/output
ggerganov added a commit that referenced this pull request on Jul 2, 2025
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request on Jul 2, 2025:
* origin/master:
  * llama : initial Mamba-2 support (ggml-org#9126)
  * sync : ggml
  * ggml : add version function to get lib version (ggml/1286)
  * Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dynamic libraries search for dependencies in their origin directory. (ggml-org#14309)
  * CUDA: add softmax broadcast (ggml-org#14475)
  * CUDA: broadcasting for FlashAttention mask (ggml-org#14500)
  * vulkan: support softmax/FA batch and broadcast (ggml-org#14449)
  * ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (ggml-org#14435)
  * opencl : fix possible buffer overflow in dump_tensor (ggml-org#14490)
  * simple-chat : fix context-exceeded condition (ggml-org#14494)
  * opencl : skip empty nodes on cgraph compute (ggml-org#14491)
  * opencl : update upscale to support align corners (ggml-org#14488)
  * ci : add OpenCL to labeler workflow (ggml-org#14496)
  * github : add OpenCL backend to issue templates (ggml-org#14492)
  * ggml : Callback before abort (ggml-org#14481)
  * ci : disable fast-math for Metal GHA CI (ggml-org#14478)
Labels: Apple Metal, Ascend NPU, ggml, Nvidia GPU, SYCL, testing, Vulkan
Extract broadcast changes from #14363 for ggml_soft_max_ext() and ggml_flash_attn_ext():

llama.cpp/ggml/include/ggml.h (lines 1435 to 1451 in 236682a)
llama.cpp/ggml/include/ggml.h (lines 1876 to 1896 in 236682a)
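For context, here is a rough sketch of the two declarations referenced above, reconstructed from memory of the ggml public API; the exact comments and shape annotations live at the linked lines, and the signatures here should be treated as approximate:

```c
// Sketch only: the authoritative declarations are at the ggml.h lines linked above.

// fused soft_max(a*scale + mask*slope(head)); with this PR the mask may broadcast
// across dims 2 and 3 instead of being restricted to a 2D [n_kv, n_tokens] matrix
GGML_API struct ggml_tensor * ggml_soft_max_ext(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * mask,
        float                 scale,
        float                 max_bias);

// fused attention over q/k/v; the mask is likewise allowed to broadcast across dims 2 and 3
GGML_API struct ggml_tensor * ggml_flash_attn_ext(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,
        struct ggml_tensor  * k,
        struct ggml_tensor  * v,
        struct ggml_tensor  * mask,
        float                 scale,
        float                 max_bias,
        float                 logit_softcap);
```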
Both changes should be quite simple. On master we assume that the mask is a 2D matrix that is always broadcast across dim 2 (i.e. the heads) and dim 3. With this change, we allow separate masks, i.e. generalized broadcast.

Currently, I've added tests and implemented CPU and Metal support for this. The rest of the backends will fall back to CPU until this gets implemented:
The fallback is okay for now since these extensions are not yet used by llama.cpp. This support will be needed later for the #14363 PR, although it is worth supporting either way.
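To illustrate the new semantics, here is a minimal usage sketch. It is not code from this PR; the tensor shapes, the per-head mask layout, and the scale value below are illustrative assumptions:

```c
#include "ggml.h"

// Build a masked softmax node where each attention head gets its own mask.
// Before this change the mask had to be a single 2D [n_kv, n_tokens] matrix
// shared by all heads; with generalized broadcast a head dimension is allowed.
static struct ggml_tensor * build_masked_softmax(struct ggml_context * ctx,
                                                 int n_kv, int n_tokens, int n_head) {
    // attention scores: [n_kv, n_tokens, n_head]
    struct ggml_tensor * scores = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, n_kv, n_tokens, n_head);

    // per-head mask: [n_kv, n_tokens, n_head]
    // (a 2D [n_kv, n_tokens] mask still works and is broadcast as before)
    struct ggml_tensor * mask = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, n_kv, n_tokens, n_head);

    // fused soft_max(scores*scale + mask); scale and max_bias are placeholder values
    return ggml_soft_max_ext(ctx, scores, mask, 0.125f, 0.0f);
}
```

The mask argument of ggml_flash_attn_ext() gets the same generalized-broadcast treatment.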