
ggml : implement GLU for split up/gate #14181


Open
CISC wants to merge 6 commits into cisc/unary-reglu-geglu-swiglu from cisc/split-reglu-geglu-swiglu

Conversation

@CISC
Collaborator

@CISC CISC commented Jun 14, 2025

Implement GLU for split up/gate.

Builds upon #14158

@0cc4m @ggerganov PTAL for adding support to Metal/Vulkan.
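For anyone skimming: as I understand it, the fused GLU ops from #14158 read the gate and up projections as the two halves of a single tensor, while this PR adds variants that take them as two separate tensors. A minimal C++ sketch of what split SwiGLU computes (plain arrays with illustrative names, not the actual ggml kernels):

```cpp
#include <cmath>
#include <cstddef>

// SiLU (swish): x * sigmoid(x)
static float silu(float x) {
    return x / (1.0f + std::exp(-x));
}

// Fused layout (#14158): gate and up are the two halves of one row
// (shown here with the gate first; which half serves as the gate is
// an implementation detail of the op).
void swiglu_fused(const float * src, float * dst, size_t n) {
    const float * gate = src;
    const float * up   = src + n;
    for (size_t i = 0; i < n; ++i) {
        dst[i] = silu(gate[i]) * up[i];
    }
}

// Split layout (this PR): gate and up come from two separate tensors,
// e.g. separate ffn_gate and ffn_up projections.
void swiglu_split(const float * gate, const float * up, float * dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] = silu(gate[i]) * up[i];
    }
}
```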

@CISC CISC requested a review from ggerganov June 14, 2025 11:17
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Jun 14, 2025
@CISC CISC requested a review from JohannesGaessler June 14, 2025 11:17
Collaborator

@JohannesGaessler JohannesGaessler left a comment


The CUDA code looks good to me.

@CISC
Collaborator Author

CISC commented Jun 14, 2025

The CUDA code looks good to me.

Yay, mind generating a plot again for some large-ish model?

@github-actions github-actions bot added the testing Everything test related label Jun 14, 2025
@JohannesGaessler
Collaborator

[plot: t/s comparison, master vs. cisc/split-reglu-geglu-swiglu]

The "?b" LLaMA model is Mistral Small. For differences like ~1% I think differences are difficult to see in plots.

Table with more GPUs
| GPU | Model | Microbatch size | Test | t/s master | t/s cisc/split-reglu-geglu-swiglu | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| RTX 4090 | chatglm 9B Q4_0 | 1 | pp512 | 157.07 | 160.12 | 1.02 |
| RTX 4090 | chatglm 9B Q4_0 | 2 | pp512 | 266.96 | 275.78 | 1.03 |
| RTX 4090 | chatglm 9B Q4_0 | 4 | pp512 | 514.87 | 533.91 | 1.04 |
| RTX 4090 | chatglm 9B Q4_0 | 8 | pp512 | 817.45 | 850.34 | 1.04 |
| RTX 4090 | chatglm 9B Q4_0 | 16 | pp512 | 1398.24 | 1445.84 | 1.03 |
| RTX 4090 | chatglm 9B Q4_0 | 32 | pp512 | 2531.06 | 2651.77 | 1.05 |
| RTX 4090 | chatglm 9B Q4_0 | 64 | pp512 | 4382.53 | 4683.11 | 1.07 |
| RTX 4090 | chatglm 9B Q4_0 | 128 | pp512 | 6421.39 | 6996.64 | 1.09 |
| RTX 4090 | chatglm 9B Q4_0 | 256 | pp512 | 8600.24 | 9394.75 | 1.09 |
| RTX 4090 | chatglm 9B Q4_0 | 512 | pp512 | 9816.14 | 10792.81 | 1.10 |
| RTX 4090 | llama 1B Q4_0 | 1 | pp512 | 871.45 | 876.93 | 1.01 |
| RTX 4090 | llama 1B Q4_0 | 2 | pp512 | 1242.58 | 1249.89 | 1.01 |
| RTX 4090 | llama 1B Q4_0 | 4 | pp512 | 2463.93 | 2481.47 | 1.01 |
| RTX 4090 | llama 1B Q4_0 | 8 | pp512 | 3856.49 | 3886.71 | 1.01 |
| RTX 4090 | llama 1B Q4_0 | 16 | pp512 | 6012.96 | 6047.42 | 1.01 |
| RTX 4090 | llama 1B Q4_0 | 32 | pp512 | 10825.56 | 10856.35 | 1.00 |
| RTX 4090 | llama 1B Q4_0 | 64 | pp512 | 17850.89 | 17920.20 | 1.00 |
| RTX 4090 | llama 1B Q4_0 | 128 | pp512 | 28027.86 | 28218.11 | 1.01 |
| RTX 4090 | llama 1B Q4_0 | 256 | pp512 | 44615.22 | 44907.08 | 1.01 |
| RTX 4090 | llama 1B Q4_0 | 512 | pp512 | 49748.66 | 50023.06 | 1.01 |
| RTX 4090 | llama 8B Q4_0 | 1 | pp512 | 191.14 | 192.25 | 1.01 |
| RTX 4090 | llama 8B Q4_0 | 2 | pp512 | 335.82 | 338.53 | 1.01 |
| RTX 4090 | llama 8B Q4_0 | 4 | pp512 | 656.19 | 659.70 | 1.01 |
| RTX 4090 | llama 8B Q4_0 | 8 | pp512 | 1053.99 | 1060.15 | 1.01 |
| RTX 4090 | llama 8B Q4_0 | 16 | pp512 | 1830.09 | 1841.61 | 1.01 |
| RTX 4090 | llama 8B Q4_0 | 32 | pp512 | 3329.07 | 3343.82 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | 64 | pp512 | 5802.84 | 5825.10 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | 128 | pp512 | 8633.97 | 8701.43 | 1.01 |
| RTX 4090 | llama 8B Q4_0 | 256 | pp512 | 11672.39 | 11664.27 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | 512 | pp512 | 12990.89 | 13093.10 | 1.01 |
| RTX 4090 | llama ?B Q4_0 | 1 | pp512 | 67.76 | 67.91 | 1.00 |
| RTX 4090 | llama ?B Q4_0 | 2 | pp512 | 124.58 | 125.21 | 1.01 |
| RTX 4090 | llama ?B Q4_0 | 4 | pp512 | 246.56 | 247.38 | 1.00 |
| RTX 4090 | llama ?B Q4_0 | 8 | pp512 | 399.60 | 401.85 | 1.01 |
| RTX 4090 | llama ?B Q4_0 | 16 | pp512 | 720.04 | 723.70 | 1.01 |
| RTX 4090 | llama ?B Q4_0 | 32 | pp512 | 1310.61 | 1316.27 | 1.00 |
| RTX 4090 | llama ?B Q4_0 | 64 | pp512 | 2398.35 | 2401.42 | 1.00 |
| RTX 4090 | llama ?B Q4_0 | 128 | pp512 | 3534.90 | 3560.28 | 1.01 |
| RTX 4090 | llama ?B Q4_0 | 256 | pp512 | 4190.87 | 4263.77 | 1.02 |
| RTX 4090 | llama ?B Q4_0 | 512 | pp512 | 4467.73 | 4532.54 | 1.01 |
| RX 6800 | chatglm 9B Q4_0 | 1 | pp512 | 45.74 | 46.71 | 1.02 |
| RX 6800 | chatglm 9B Q4_0 | 2 | pp512 | 80.82 | 83.56 | 1.03 |
| RX 6800 | chatglm 9B Q4_0 | 4 | pp512 | 113.74 | 117.01 | 1.03 |
| RX 6800 | chatglm 9B Q4_0 | 8 | pp512 | 134.81 | 138.13 | 1.02 |
| RX 6800 | chatglm 9B Q4_0 | 16 | pp512 | 188.49 | 194.67 | 1.03 |
| RX 6800 | chatglm 9B Q4_0 | 32 | pp512 | 257.49 | 266.78 | 1.04 |
| RX 6800 | chatglm 9B Q4_0 | 64 | pp512 | 317.98 | 330.96 | 1.04 |
| RX 6800 | chatglm 9B Q4_0 | 128 | pp512 | 384.87 | 402.91 | 1.05 |
| RX 6800 | chatglm 9B Q4_0 | 256 | pp512 | 432.58 | 454.80 | 1.05 |
| RX 6800 | chatglm 9B Q4_0 | 512 | pp512 | 419.65 | 439.94 | 1.05 |
| RX 6800 | llama 1B Q4_0 | 1 | pp512 | 225.18 | 228.39 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 2 | pp512 | 384.81 | 388.87 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 4 | pp512 | 574.05 | 580.28 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 8 | pp512 | 635.15 | 639.92 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 16 | pp512 | 805.52 | 810.66 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 32 | pp512 | 1105.49 | 1111.71 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 64 | pp512 | 1399.62 | 1409.38 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 128 | pp512 | 1699.55 | 1709.73 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 256 | pp512 | 1930.04 | 1945.97 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 512 | pp512 | 1706.13 | 1712.44 | 1.00 |
| RX 6800 | llama 8B Q4_0 | 1 | pp512 | 56.71 | 57.13 | 1.01 |
| RX 6800 | llama 8B Q4_0 | 2 | pp512 | 102.18 | 102.98 | 1.01 |
| RX 6800 | llama 8B Q4_0 | 4 | pp512 | 141.32 | 142.12 | 1.01 |
| RX 6800 | llama 8B Q4_0 | 8 | pp512 | 161.61 | 162.21 | 1.00 |
| RX 6800 | llama 8B Q4_0 | 16 | pp512 | 231.94 | 232.84 | 1.00 |
| RX 6800 | llama 8B Q4_0 | 32 | pp512 | 326.40 | 328.09 | 1.01 |
| RX 6800 | llama 8B Q4_0 | 64 | pp512 | 404.16 | 405.76 | 1.00 |
| RX 6800 | llama 8B Q4_0 | 128 | pp512 | 487.69 | 490.30 | 1.01 |
| RX 6800 | llama 8B Q4_0 | 256 | pp512 | 547.51 | 550.89 | 1.01 |
| RX 6800 | llama 8B Q4_0 | 512 | pp512 | 529.86 | 532.55 | 1.01 |
| P40 | chatglm 9B Q4_0 | 1 | pp512 | 45.36 | 45.80 | 1.01 |
| P40 | chatglm 9B Q4_0 | 2 | pp512 | 89.69 | 92.63 | 1.03 |
| P40 | chatglm 9B Q4_0 | 4 | pp512 | 127.91 | 131.13 | 1.03 |
| P40 | chatglm 9B Q4_0 | 8 | pp512 | 163.27 | 167.49 | 1.03 |
| P40 | chatglm 9B Q4_0 | 16 | pp512 | 368.87 | 387.68 | 1.05 |
| P40 | chatglm 9B Q4_0 | 32 | pp512 | 528.58 | 565.57 | 1.07 |
| P40 | chatglm 9B Q4_0 | 64 | pp512 | 593.71 | 643.45 | 1.08 |
| P40 | chatglm 9B Q4_0 | 128 | pp512 | 699.75 | 765.57 | 1.09 |
| P40 | chatglm 9B Q4_0 | 256 | pp512 | 758.72 | 836.24 | 1.10 |
| P40 | chatglm 9B Q4_0 | 512 | pp512 | 786.29 | 869.28 | 1.11 |
| P40 | llama 1B Q4_0 | 1 | pp512 | 261.26 | 263.14 | 1.01 |
| P40 | llama 1B Q4_0 | 2 | pp512 | 536.94 | 543.57 | 1.01 |
| P40 | llama 1B Q4_0 | 4 | pp512 | 750.15 | 755.12 | 1.01 |
| P40 | llama 1B Q4_0 | 8 | pp512 | 1030.15 | 1034.74 | 1.00 |
| P40 | llama 1B Q4_0 | 16 | pp512 | 2045.97 | 2061.16 | 1.01 |
| P40 | llama 1B Q4_0 | 32 | pp512 | 3095.97 | 3113.62 | 1.01 |
| P40 | llama 1B Q4_0 | 64 | pp512 | 4006.29 | 4036.23 | 1.01 |
| P40 | llama 1B Q4_0 | 128 | pp512 | 4854.02 | 4930.81 | 1.02 |
| P40 | llama 1B Q4_0 | 256 | pp512 | 5559.00 | 5672.41 | 1.02 |
| P40 | llama 1B Q4_0 | 512 | pp512 | 5742.70 | 5846.29 | 1.02 |
| P40 | llama 8B Q4_0 | 1 | pp512 | 54.30 | 54.51 | 1.00 |
| P40 | llama 8B Q4_0 | 2 | pp512 | 109.34 | 110.08 | 1.01 |
| P40 | llama 8B Q4_0 | 4 | pp512 | 156.99 | 157.78 | 1.01 |
| P40 | llama 8B Q4_0 | 8 | pp512 | 198.59 | 199.35 | 1.00 |
| P40 | llama 8B Q4_0 | 16 | pp512 | 464.71 | 467.04 | 1.01 |
| P40 | llama 8B Q4_0 | 32 | pp512 | 661.47 | 664.78 | 1.00 |
| P40 | llama 8B Q4_0 | 64 | pp512 | 778.06 | 783.18 | 1.01 |
| P40 | llama 8B Q4_0 | 128 | pp512 | 895.29 | 906.05 | 1.01 |
| P40 | llama 8B Q4_0 | 256 | pp512 | 975.08 | 988.13 | 1.01 |
| P40 | llama 8B Q4_0 | 512 | pp512 | 1016.08 | 1029.92 | 1.01 |

@CISC
Collaborator Author

CISC commented Jun 14, 2025

The "?b" LLaMA model is Mistral Small. For differences like ~1% I think differences are difficult to see in plots.

Ok, so as expected this is a lot less beneficial for split tensors; it will probably gain another percent or so for MoEs?

@github-actions github-actions bot added the Vulkan Issues specific to the Vulkan backend label Jun 15, 2025
@0cc4m
Collaborator

0cc4m commented Jun 15, 2025

I added Vulkan support for split GLU.

@CISC
Collaborator Author

CISC commented Jun 15, 2025

@0cc4m BTW, I see you check for ggml_is_contiguous rather than ggml_is_contiguous_1; is this correct?

@0cc4m
Collaborator

0cc4m commented Jun 15, 2025

@0cc4m BTW, I see you check for ggml_is_contiguous rather than ggml_is_contiguous_1; is this correct?

Yes, the Vulkan support is only for contiguous tensors. GLSL (the shader language we use) has no support for pointers, so non-contiguous support isn't as simple to implement as it is for CPU, CUDA and Metal. That's why I only did contiguous for now. If necessary, we can add non-contiguous support at a later point.
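To spell out the distinction being discussed: ggml_is_contiguous requires the whole tensor to be densely packed, while ggml_is_contiguous_1 only requires each row to be dense, allowing padding between rows. A simplified 2-D sketch of the idea (illustrative only, not the actual ggml implementation, which handles four dims and block-quantized types):

```cpp
#include <cstddef>
#include <cstdint>

// Simplified 2-D float tensor following ggml's convention:
// ne[i] = extent of dim i, nb[i] = stride of dim i in bytes.
struct tensor_2d {
    int64_t ne[2];
    size_t  nb[2];
};

// Fully contiguous (the ggml_is_contiguous case): no padding anywhere.
bool is_contiguous(const tensor_2d & t) {
    return t.nb[0] == sizeof(float) &&
           t.nb[1] == sizeof(float) * (size_t) t.ne[0];
}

// Row-contiguous (the ggml_is_contiguous_1 case): each row is dense,
// but the stride between rows may exceed the row size (padding).
bool is_contiguous_1(const tensor_2d & t) {
    return t.nb[0] == sizeof(float) &&
           t.nb[1] >= sizeof(float) * (size_t) t.ne[0];
}
```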

@qnixsynapse qnixsynapse force-pushed the cisc/split-reglu-geglu-swiglu branch from 917b5b5 to 42c2870 on June 15, 2025 16:02
@qnixsynapse
Collaborator

qnixsynapse commented Jun 15, 2025

Had to refactor and deduplicate SYCL code before adding support for split up+gate. Let me know if there are any issues.
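(To illustrate the deduplication pattern in general, not this PR's actual diff: a single kernel templated on the activation functor can serve REGLU/GEGLU/SWIGLU for the split case. A rough SYCL-flavoured sketch; all names are hypothetical and the pointers are assumed to be USM device allocations:)

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

// Stateless activation functors; one templated kernel then covers
// REGLU and SWIGLU (GEGLU would be one more functor) instead of
// several near-identical kernels.
struct op_relu { float operator()(float x) const { return x > 0.0f ? x : 0.0f; } };
struct op_silu { float operator()(float x) const { return x / (1.0f + sycl::exp(-x)); } };

// gate/up/dst are assumed to be USM device pointers usable on queue q.
template <typename Op>
void gated_op_split(sycl::queue & q, const float * gate, const float * up,
                    float * dst, size_t n) {
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        dst[i] = Op()(gate[i]) * up[i];
    }).wait();
}

// Usage:
//   gated_op_split<op_silu>(q, gate, up, dst, n); // SWIGLU
//   gated_op_split<op_relu>(q, gate, up, dst, n); // REGLU
```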

Edit: Adding SYCL test results (on A750), with master at 6adc3c3:

| Model | Test | t/s master | t/s cisc/split-reglu-geglu-swiglu | Speedup |
| --- | --- | --- | --- | --- |
| chatglm 9B IQ4_XS - 4.25 bpw | pp512 | 967.86 | 1015.64 | 1.05 |
| chatglm 9B IQ4_XS - 4.25 bpw | tg128 | 15.37 | 16.21 | 1.05 |
| gemma3 4B Q4_0 | pp512 | 2428.35 | 2447.54 | 1.01 |
| gemma3 4B Q4_0 | tg128 | 22.54 | 22.88 | 1.02 |

Edit 2: Added chatglm and GEGLU with tanh approximation.
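(For reference, the tanh-based GELU approximation replaces the exact erf-based GELU with the usual tanh estimate; a minimal sketch of the split GEGLU math in plain C++, illustrative rather than the actual SYCL kernel:)

```cpp
#include <cmath>
#include <cstddef>

// Widely used tanh approximation of GELU:
// gelu(x) ~= 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
static float gelu_tanh(float x) {
    const float sqrt_2_over_pi = 0.7978845608f;
    return 0.5f * x * (1.0f + std::tanh(sqrt_2_over_pi * (x + 0.044715f * x * x * x)));
}

// Split GEGLU (tanh variant): GELU on the gate, scaled by up.
void geglu_tanh_split(const float * gate, const float * up, float * dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] = gelu_tanh(gate[i]) * up[i];
    }
}
```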

@github-actions github-actions bot added the SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language label Jun 15, 2025
@CISC
Collaborator Author

CISC commented Jun 16, 2025

Edit: Adding SYCL test results (on A750)

Nice, can you run a test with chatglm too?
