ggml : support broadcast for ggml_soft_max_ext and ggml_flash_attn_ext #14435
Merged: +442 −237

Conversation
I've started working on the Vulkan backend support for this.
Vulkan support is in #14449, targeted to this branch.
* CUDA: add softmax broadcast
* Pass by const ref
* Review: Use blockDims for indexing, remove designated initializers
* Add TODO for noncontiguous input/output
ggerganov added a commit that referenced this pull request on Jul 2, 2025
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request on Jul 2, 2025:
* origin/master:
  * llama : initial Mamba-2 support (ggml-org#9126)
  * sync : ggml
  * ggml : add version function to get lib version (ggml/1286)
  * Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dynamic libraries search for dependencies in their origin directory. (ggml-org#14309)
  * CUDA: add softmax broadcast (ggml-org#14475)
  * CUDA: broadcasting for FlashAttention mask (ggml-org#14500)
  * vulkan: support softmax/FA batch and broadcast (ggml-org#14449)
  * ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (ggml-org#14435)
  * opencl : fix possible buffer overflow in dump_tensor (ggml-org#14490)
  * simple-chat : fix context-exceeded condition (ggml-org#14494)
  * opencl : skip empty nodes on cgraph compute (ggml-org#14491)
  * opencl : update upscale to support align corners (ggml-org#14488)
  * ci : add OpenCL to labeler workflow (ggml-org#14496)
  * github : add OpenCL backend to issue templates (ggml-org#14492)
  * ggml : Callback before abort (ggml-org#14481)
  * ci : disable fast-math for Metal GHA CI (ggml-org#14478)
Labels: Apple Metal, Ascend NPU, ggml, Nvidia GPU, SYCL, testing, Vulkan
Extract broadcast changes from #14363 for ggml_soft_max_ext() and ggml_flash_attn_ext():

llama.cpp/ggml/include/ggml.h (lines 1435 to 1451 in 236682a)
llama.cpp/ggml/include/ggml.h (lines 1876 to 1896 in 236682a)
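For context, here is a rough sketch of the two declarations referenced above, reconstructed from memory of the ggml public API; the exact comments and shape annotations live at the linked lines, and the signatures here should be treated as approximate:

```c
// Sketch only: the authoritative declarations are at the ggml.h lines linked above.

// fused soft_max(a*scale + mask*slope(head)); with this PR the mask may broadcast
// across dims 2 and 3 instead of being restricted to a 2D [n_kv, n_tokens] matrix
GGML_API struct ggml_tensor * ggml_soft_max_ext(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * mask,
        float                 scale,
        float                 max_bias);

// fused attention over q/k/v; the mask is likewise allowed to broadcast across dims 2 and 3
GGML_API struct ggml_tensor * ggml_flash_attn_ext(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,
        struct ggml_tensor  * k,
        struct ggml_tensor  * v,
        struct ggml_tensor  * mask,
        float                 scale,
        float                 max_bias,
        float                 logit_softcap);
```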
Both changes should be quite simple. On master we assume that the mask is a 2D matrix that is always broadcast across dim 2 (i.e. the heads) and dim 3. With this change, we allow separate masks, i.e. generalized broadcast.

Currently, I've added tests and implemented CPU and Metal support for this. The rest of the backends will fall back to CPU until this gets implemented:
The fallback is okay for now since these extensions are not yet used by llama.cpp. This support will be needed later for the #14363 PR, although it is worth supporting either way.
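To illustrate the new semantics, here is a minimal usage sketch. It is not code from this PR; the tensor shapes, the per-head mask layout, and the scale value below are illustrative assumptions:

```c
#include "ggml.h"

// Build a masked softmax node where each attention head gets its own mask.
// Before this change the mask had to be a single 2D [n_kv, n_tokens] matrix
// shared by all heads; with generalized broadcast a head dimension is allowed.
static struct ggml_tensor * build_masked_softmax(struct ggml_context * ctx,
                                                 int n_kv, int n_tokens, int n_head) {
    // attention scores: [n_kv, n_tokens, n_head]
    struct ggml_tensor * scores = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, n_kv, n_tokens, n_head);

    // per-head mask: [n_kv, n_tokens, n_head]
    // (a 2D [n_kv, n_tokens] mask still works and is broadcast as before)
    struct ggml_tensor * mask = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, n_kv, n_tokens, n_head);

    // fused soft_max(scores*scale + mask); scale and max_bias are placeholder values
    return ggml_soft_max_ext(ctx, scores, mask, 0.125f, 0.0f);
}
```

The mask argument of ggml_flash_attn_ext() gets the same generalized-broadcast treatment.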