OpenCL: add tiled mul_mat_f16_f32 #14535
Conversation
@rmatif thank you for the PR. I will play with it and the direct convolution PR in the next few days. For matmul, using image1d_buffer is probably the easiest way to utilize the L1 cache - it wraps around a normal cl buffer and uses read_image for access, so the indexing stays the same as with a cl buffer. The Q4_0 matmul is already doing this. It is also possible to use a normal cl buffer for one matrix input and an image1d_buffer for the other, to use both load paths.
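To make the suggestion concrete, here is a minimal sketch (not code from this PR; the function name, buffer names, and the RGBA/half texel layout are assumptions) of wrapping an existing cl buffer in an image1d_buffer on the host:

#include <string.h>
#include <CL/cl.h>

/* Wrap an existing cl_mem buffer of half-precision data in a 1D image so
 * that kernel loads go through the texture/L1 path. No copy occurs: the
 * image aliases the buffer's storage. */
cl_mem wrap_as_image1d_buffer(cl_context ctx, cl_mem buf,
                              size_t n_half_elems, cl_int *err) {
    cl_image_format fmt = { CL_RGBA, CL_HALF_FLOAT };  /* 4 halves per texel */
    cl_image_desc desc;
    memset(&desc, 0, sizeof(desc));
    desc.image_type  = CL_MEM_OBJECT_IMAGE1D_BUFFER;
    desc.image_width = n_half_elems / 4;               /* width in texels */
    desc.buffer      = buf;                            /* alias, no copy */
    return clCreateImage(ctx, CL_MEM_READ_ONLY, &fmt, &desc, NULL, err);
}

On the kernel side, image1d_buffer_t is read with a plain integer coordinate and no sampler, which is why the indexing can match the buffer version:

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

kernel void example(read_only image1d_buffer_t A, global float *dst) {
    int i = get_global_id(0);
    half4 a = read_imageh(A, i);   /* same linear index as the cl buffer */
    dst[i] = a.x + a.y + a.z + a.w;
}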
@lhez You're right, using
@zhouwg Please reach out to me via email, and I'll send you the build scripts and discuss further, as this seems off-topic here.
@rmatif, Thanks so much for your help. I'm so excited that this is my first time running the ggml-opencl backend on my Snapdragon 8 Elite based phone. llama-bench with qwen1_5-1_8b-chat-q4_0.gguf on master:
llama-cli with qwen1_5-1_8b-chat-q4_0.gguf on master:
llama-bench with Llama-3.2-1B-Instruct-f16.gguf on this PR:
llama-bench with Llama-3.2-1B-Instruct-f16.gguf on master:
BTW, I provide a simple build/shell script to build the ggml-opencl backend on Linux and simplify the workflow: https://github.com/zhouwg/ggml-hexagon/blob/self-build/scripts/build-run-ggmlopencl-android.sh Can I add this script to this excellent PR, or submit a standalone PR, so other developers can use it to verify ggml-opencl related PRs or learn something about OpenCL programming on Android phones? The script is simple - no technical difficulty - but might be very useful to other developers.
#define OPTN 8

#define WG_M (OPWM / OPTM)
#define WG_N (OPWN / OPTN)
WG_M and WG_N seem to be the workgroup size - can they be replaced with get_local_size()?
I tested replacing the macros with get_local_size(), but it resulted in a significant performance regression (~17%). Using compile-time constants is critical here, as it allows the compiler to fully unroll the inner loops and pre-calculate memory address offsets, an optimization that is lost when WG_M becomes a runtime value.
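As a simplified, hypothetical illustration (not the PR's actual loop; a_frag, b_frag, and acc are made-up names), compile-time tile bounds are what let the compiler unroll and fold the addressing:

float acc[OPTM][OPTN] = {{0.0f}};
half  a_frag[OPTM];
half  b_frag[OPTN];
/* ... a_frag / b_frag loaded from the shared tiles ... */

// With OPTM/OPTN known at compile time, both loops fully unroll and the
// acc[][] indices resolve to fixed registers.
#pragma unroll
for (int m = 0; m < OPTM; m++) {
    #pragma unroll
    for (int n = 0; n < OPTN; n++) {
        acc[m][n] = mad(a_frag[m], b_frag[n], acc[m][n]);
    }
}
// If the bound were get_local_size(0) instead, the trip count would be a
// runtime value: the unroll and the pre-computed offsets are lost, and
// acc[][] may spill out of registers.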
const int OPWM = 64;
const int OPWN = 64;
const int TPWM = 16;
const int TPWN = 8;
I think OPWM, OPWN, TPWM, TPWN, OPTM, OPTN are related, e.g., TPWM can be calculated from OPWM and OPTM. I wonder if it is possible to do the calculation, or maybe add a comment about how they are related.
They are all mathematically related, but keeping them as explicit compile-time constants is generally better because it allows the compiler to perform aggressive, hardware-specific optimizations. This enables full loop unrolling to eliminate expensive branching and allows accumulator arrays to be allocated directly in the fastest registers, optimizations that are impossible if these values are calculated as runtime variables. I will add comments about the relationship.
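For reference, such a comment might look like the following (a sketch; OPTM = 4 is derived assuming the relation TPWM = OPWM / OPTM holds as discussed, and OPTN = 8 matches the define above):

// Tile-size relationships:
//   OPWM x OPWN  output tile computed per workgroup   (64 x 64)
//   TPWM x TPWN  threads per workgroup                (16 x 8)
//   OPTM x OPTN  outputs computed per thread:
//                OPTM = OPWM / TPWM = 4
//                OPTN = OPWN / TPWN = 8
//   WG_M = OPWM / OPTM = TPWM,  WG_N = OPWN / OPTN = TPWN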
Would constexpr (as opposed to const) work for this?
constexpr is a C++ feature and wouldn't apply here, as the OpenCL kernel is compiled separately at runtime using the C-like OpenCL language.
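The closest OpenCL analogue is baking the values in as macros when the program is built at runtime, e.g. (a sketch; the -D values assume the constants discussed above):

/* Host side: pass tile sizes as compile-time macros when building the
 * kernel - the nearest OpenCL equivalent of constexpr. */
const char *build_opts = "-DOPWM=64 -DOPWN=64 -DOPTM=4 -DOPTN=8";
cl_int err = clBuildProgram(program, 1, &device, build_opts, NULL, NULL);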
@lhez I'm almost done with a new kernel that uses
Cool, thank you. Merging this now.
This PR introduces a new mul_mat_f16_f32 kernel that leverages tiling and vectorization. I believe this will serve as a strong baseline for future improvements.

In a future PR, I may explore using image2d_t to utilize the L1 cache for mul_mat and conv2d operations. This is a bit tricky, as it requires some data preprocessing on the host side (see the sketch at the end of this description).

Results on Adreno 830:
Master:
This PR:
@lhez @max-krasnyansky
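For illustration, the host-side preprocessing for the image2d_t idea could look roughly like this (a sketch, not part of this PR; the names, dimensions, and RGBA/half layout are assumptions):

/* Copy a linear K x N weight buffer into a 2D image so kernel reads can
 * go through the texture/L1 cache path. */
cl_image_format fmt = { CL_RGBA, CL_HALF_FLOAT };   /* 4 halves per texel */
cl_image_desc desc;
memset(&desc, 0, sizeof(desc));
desc.image_type   = CL_MEM_OBJECT_IMAGE2D;
desc.image_width  = K / 4;                          /* texels along K */
desc.image_height = N;
cl_mem img = clCreateImage(ctx, CL_MEM_READ_ONLY, &fmt, &desc, NULL, &err);

size_t origin[3] = { 0, 0, 0 };
size_t region[3] = { K / 4, N, 1 };
err = clEnqueueCopyBufferToImage(queue, weights_buf, img, 0,
                                 origin, region, 0, NULL, NULL);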