
ggml : implement GLU for split up/gate #14181


Open
CISC wants to merge 6 commits into cisc/unary-reglu-geglu-swiglu from cisc/split-reglu-geglu-swiglu

Conversation

@CISC
Collaborator

@CISC CISC commented Jun 14, 2025

Implement GLU for split up/gate.

Builds upon #14158

@0cc4m @ggerganov PTAL for adding support to Metal/Vulkan.
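For anyone skimming: as I understand it, the fused GLU ops from #14158 read the gate and up projections as the two halves of a single tensor, while this PR adds variants that take them as two separate tensors. A minimal C++ sketch of what split SwiGLU computes (plain arrays with illustrative names, not the actual ggml kernels):

```cpp
#include <cmath>
#include <cstddef>

// SiLU (swish): x * sigmoid(x)
static float silu(float x) {
    return x / (1.0f + std::exp(-x));
}

// Fused layout (#14158): gate and up are the two halves of one row
// (shown here with the gate first; which half serves as the gate is
// an implementation detail of the op).
void swiglu_fused(const float * src, float * dst, size_t n) {
    const float * gate = src;
    const float * up   = src + n;
    for (size_t i = 0; i < n; ++i) {
        dst[i] = silu(gate[i]) * up[i];
    }
}

// Split layout (this PR): gate and up come from two separate tensors,
// e.g. separate ffn_gate and ffn_up projections.
void swiglu_split(const float * gate, const float * up, float * dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] = silu(gate[i]) * up[i];
    }
}
```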

@CISC CISC requested a review from ggerganov June 14, 2025 11:17
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Jun 14, 2025
@CISC CISC requested a review from JohannesGaessler June 14, 2025 11:17
Collaborator

@JohannesGaessler JohannesGaessler left a comment


The CUDA code looks good to me.

@CISC
Collaborator Author

CISC commented Jun 14, 2025

The CUDA code looks good to me.

Yay, mind generating a plot again for some large-ish model?

@github-actions github-actions bot added the testing Everything test related label Jun 14, 2025
@JohannesGaessler
Collaborator

[plot: t/s comparison, master vs. cisc/split-reglu-geglu-swiglu]

The "?b" LLaMA model is Mistral Small. For differences like ~1% I think differences are difficult to see in plots.

Table with more GPUs
| GPU | Model | Microbatch size | Test | t/s master | t/s cisc/split-reglu-geglu-swiglu | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| RTX 4090 | chatglm 9B Q4_0 | 1 | pp512 | 157.07 | 160.12 | 1.02 |
| RTX 4090 | chatglm 9B Q4_0 | 2 | pp512 | 266.96 | 275.78 | 1.03 |
| RTX 4090 | chatglm 9B Q4_0 | 4 | pp512 | 514.87 | 533.91 | 1.04 |
| RTX 4090 | chatglm 9B Q4_0 | 8 | pp512 | 817.45 | 850.34 | 1.04 |
| RTX 4090 | chatglm 9B Q4_0 | 16 | pp512 | 1398.24 | 1445.84 | 1.03 |
| RTX 4090 | chatglm 9B Q4_0 | 32 | pp512 | 2531.06 | 2651.77 | 1.05 |
| RTX 4090 | chatglm 9B Q4_0 | 64 | pp512 | 4382.53 | 4683.11 | 1.07 |
| RTX 4090 | chatglm 9B Q4_0 | 128 | pp512 | 6421.39 | 6996.64 | 1.09 |
| RTX 4090 | chatglm 9B Q4_0 | 256 | pp512 | 8600.24 | 9394.75 | 1.09 |
| RTX 4090 | chatglm 9B Q4_0 | 512 | pp512 | 9816.14 | 10792.81 | 1.10 |
| RTX 4090 | llama 1B Q4_0 | 1 | pp512 | 871.45 | 876.93 | 1.01 |
| RTX 4090 | llama 1B Q4_0 | 2 | pp512 | 1242.58 | 1249.89 | 1.01 |
| RTX 4090 | llama 1B Q4_0 | 4 | pp512 | 2463.93 | 2481.47 | 1.01 |
| RTX 4090 | llama 1B Q4_0 | 8 | pp512 | 3856.49 | 3886.71 | 1.01 |
| RTX 4090 | llama 1B Q4_0 | 16 | pp512 | 6012.96 | 6047.42 | 1.01 |
| RTX 4090 | llama 1B Q4_0 | 32 | pp512 | 10825.56 | 10856.35 | 1.00 |
| RTX 4090 | llama 1B Q4_0 | 64 | pp512 | 17850.89 | 17920.20 | 1.00 |
| RTX 4090 | llama 1B Q4_0 | 128 | pp512 | 28027.86 | 28218.11 | 1.01 |
| RTX 4090 | llama 1B Q4_0 | 256 | pp512 | 44615.22 | 44907.08 | 1.01 |
| RTX 4090 | llama 1B Q4_0 | 512 | pp512 | 49748.66 | 50023.06 | 1.01 |
| RTX 4090 | llama 8B Q4_0 | 1 | pp512 | 191.14 | 192.25 | 1.01 |
| RTX 4090 | llama 8B Q4_0 | 2 | pp512 | 335.82 | 338.53 | 1.01 |
| RTX 4090 | llama 8B Q4_0 | 4 | pp512 | 656.19 | 659.70 | 1.01 |
| RTX 4090 | llama 8B Q4_0 | 8 | pp512 | 1053.99 | 1060.15 | 1.01 |
| RTX 4090 | llama 8B Q4_0 | 16 | pp512 | 1830.09 | 1841.61 | 1.01 |
| RTX 4090 | llama 8B Q4_0 | 32 | pp512 | 3329.07 | 3343.82 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | 64 | pp512 | 5802.84 | 5825.10 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | 128 | pp512 | 8633.97 | 8701.43 | 1.01 |
| RTX 4090 | llama 8B Q4_0 | 256 | pp512 | 11672.39 | 11664.27 | 1.00 |
| RTX 4090 | llama 8B Q4_0 | 512 | pp512 | 12990.89 | 13093.10 | 1.01 |
| RTX 4090 | llama ?B Q4_0 | 1 | pp512 | 67.76 | 67.91 | 1.00 |
| RTX 4090 | llama ?B Q4_0 | 2 | pp512 | 124.58 | 125.21 | 1.01 |
| RTX 4090 | llama ?B Q4_0 | 4 | pp512 | 246.56 | 247.38 | 1.00 |
| RTX 4090 | llama ?B Q4_0 | 8 | pp512 | 399.60 | 401.85 | 1.01 |
| RTX 4090 | llama ?B Q4_0 | 16 | pp512 | 720.04 | 723.70 | 1.01 |
| RTX 4090 | llama ?B Q4_0 | 32 | pp512 | 1310.61 | 1316.27 | 1.00 |
| RTX 4090 | llama ?B Q4_0 | 64 | pp512 | 2398.35 | 2401.42 | 1.00 |
| RTX 4090 | llama ?B Q4_0 | 128 | pp512 | 3534.90 | 3560.28 | 1.01 |
| RTX 4090 | llama ?B Q4_0 | 256 | pp512 | 4190.87 | 4263.77 | 1.02 |
| RTX 4090 | llama ?B Q4_0 | 512 | pp512 | 4467.73 | 4532.54 | 1.01 |
| RX 6800 | chatglm 9B Q4_0 | 1 | pp512 | 45.74 | 46.71 | 1.02 |
| RX 6800 | chatglm 9B Q4_0 | 2 | pp512 | 80.82 | 83.56 | 1.03 |
| RX 6800 | chatglm 9B Q4_0 | 4 | pp512 | 113.74 | 117.01 | 1.03 |
| RX 6800 | chatglm 9B Q4_0 | 8 | pp512 | 134.81 | 138.13 | 1.02 |
| RX 6800 | chatglm 9B Q4_0 | 16 | pp512 | 188.49 | 194.67 | 1.03 |
| RX 6800 | chatglm 9B Q4_0 | 32 | pp512 | 257.49 | 266.78 | 1.04 |
| RX 6800 | chatglm 9B Q4_0 | 64 | pp512 | 317.98 | 330.96 | 1.04 |
| RX 6800 | chatglm 9B Q4_0 | 128 | pp512 | 384.87 | 402.91 | 1.05 |
| RX 6800 | chatglm 9B Q4_0 | 256 | pp512 | 432.58 | 454.80 | 1.05 |
| RX 6800 | chatglm 9B Q4_0 | 512 | pp512 | 419.65 | 439.94 | 1.05 |
| RX 6800 | llama 1B Q4_0 | 1 | pp512 | 225.18 | 228.39 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 2 | pp512 | 384.81 | 388.87 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 4 | pp512 | 574.05 | 580.28 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 8 | pp512 | 635.15 | 639.92 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 16 | pp512 | 805.52 | 810.66 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 32 | pp512 | 1105.49 | 1111.71 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 64 | pp512 | 1399.62 | 1409.38 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 128 | pp512 | 1699.55 | 1709.73 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 256 | pp512 | 1930.04 | 1945.97 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 512 | pp512 | 1706.13 | 1712.44 | 1.00 |
| RX 6800 | llama 8B Q4_0 | 1 | pp512 | 56.71 | 57.13 | 1.01 |
| RX 6800 | llama 8B Q4_0 | 2 | pp512 | 102.18 | 102.98 | 1.01 |
| RX 6800 | llama 8B Q4_0 | 4 | pp512 | 141.32 | 142.12 | 1.01 |
| RX 6800 | llama 8B Q4_0 | 8 | pp512 | 161.61 | 162.21 | 1.00 |
| RX 6800 | llama 8B Q4_0 | 16 | pp512 | 231.94 | 232.84 | 1.00 |
| RX 6800 | llama 8B Q4_0 | 32 | pp512 | 326.40 | 328.09 | 1.01 |
| RX 6800 | llama 8B Q4_0 | 64 | pp512 | 404.16 | 405.76 | 1.00 |
| RX 6800 | llama 8B Q4_0 | 128 | pp512 | 487.69 | 490.30 | 1.01 |
| RX 6800 | llama 8B Q4_0 | 256 | pp512 | 547.51 | 550.89 | 1.01 |
| RX 6800 | llama 8B Q4_0 | 512 | pp512 | 529.86 | 532.55 | 1.01 |
| P40 | chatglm 9B Q4_0 | 1 | pp512 | 45.36 | 45.80 | 1.01 |
| P40 | chatglm 9B Q4_0 | 2 | pp512 | 89.69 | 92.63 | 1.03 |
| P40 | chatglm 9B Q4_0 | 4 | pp512 | 127.91 | 131.13 | 1.03 |
| P40 | chatglm 9B Q4_0 | 8 | pp512 | 163.27 | 167.49 | 1.03 |
| P40 | chatglm 9B Q4_0 | 16 | pp512 | 368.87 | 387.68 | 1.05 |
| P40 | chatglm 9B Q4_0 | 32 | pp512 | 528.58 | 565.57 | 1.07 |
| P40 | chatglm 9B Q4_0 | 64 | pp512 | 593.71 | 643.45 | 1.08 |
| P40 | chatglm 9B Q4_0 | 128 | pp512 | 699.75 | 765.57 | 1.09 |
| P40 | chatglm 9B Q4_0 | 256 | pp512 | 758.72 | 836.24 | 1.10 |
| P40 | chatglm 9B Q4_0 | 512 | pp512 | 786.29 | 869.28 | 1.11 |
| P40 | llama 1B Q4_0 | 1 | pp512 | 261.26 | 263.14 | 1.01 |
| P40 | llama 1B Q4_0 | 2 | pp512 | 536.94 | 543.57 | 1.01 |
| P40 | llama 1B Q4_0 | 4 | pp512 | 750.15 | 755.12 | 1.01 |
| P40 | llama 1B Q4_0 | 8 | pp512 | 1030.15 | 1034.74 | 1.00 |
| P40 | llama 1B Q4_0 | 16 | pp512 | 2045.97 | 2061.16 | 1.01 |
| P40 | llama 1B Q4_0 | 32 | pp512 | 3095.97 | 3113.62 | 1.01 |
| P40 | llama 1B Q4_0 | 64 | pp512 | 4006.29 | 4036.23 | 1.01 |
| P40 | llama 1B Q4_0 | 128 | pp512 | 4854.02 | 4930.81 | 1.02 |
| P40 | llama 1B Q4_0 | 256 | pp512 | 5559.00 | 5672.41 | 1.02 |
| P40 | llama 1B Q4_0 | 512 | pp512 | 5742.70 | 5846.29 | 1.02 |
| P40 | llama 8B Q4_0 | 1 | pp512 | 54.30 | 54.51 | 1.00 |
| P40 | llama 8B Q4_0 | 2 | pp512 | 109.34 | 110.08 | 1.01 |
| P40 | llama 8B Q4_0 | 4 | pp512 | 156.99 | 157.78 | 1.01 |
| P40 | llama 8B Q4_0 | 8 | pp512 | 198.59 | 199.35 | 1.00 |
| P40 | llama 8B Q4_0 | 16 | pp512 | 464.71 | 467.04 | 1.01 |
| P40 | llama 8B Q4_0 | 32 | pp512 | 661.47 | 664.78 | 1.00 |
| P40 | llama 8B Q4_0 | 64 | pp512 | 778.06 | 783.18 | 1.01 |
| P40 | llama 8B Q4_0 | 128 | pp512 | 895.29 | 906.05 | 1.01 |
| P40 | llama 8B Q4_0 | 256 | pp512 | 975.08 | 988.13 | 1.01 |
| P40 | llama 8B Q4_0 | 512 | pp512 | 1016.08 | 1029.92 | 1.01 |

@CISC
Collaborator Author

CISC commented Jun 14, 2025

The "?b" LLaMA model is Mistral Small. For differences like ~1% I think differences are difficult to see in plots.

Ok, so as expected this is a lot less beneficial for split tensors; it will probably gain another percent or so for MoEs?

@github-actions github-actions bot added the Vulkan Issues specific to the Vulkan backend label Jun 15, 2025
@0cc4m
Collaborator

0cc4m commented Jun 15, 2025

I added Vulkan support for split GLU.

@CISC
Collaborator Author

CISC commented Jun 15, 2025

@0cc4m BTW, I see you check for ggml_is_contiguous rather than ggml_is_contiguous_1; is this correct?

@0cc4m
Collaborator

0cc4m commented Jun 15, 2025

@0cc4m BTW, I see you check for ggml_is_contiguous rather than ggml_is_contiguous_1; is this correct?

Yes, the Vulkan support is only for contiguous tensors. GLSL (the shader language we use) has no support for pointers, so non-contiguous support isn't as simple to implement as it is for CPU, CUDA and Metal. That's why I only did contiguous for now. If necessary, we can add non-contiguous support at a later point.
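To spell out the distinction being discussed: ggml_is_contiguous requires the whole tensor to be densely packed, while ggml_is_contiguous_1 only requires each row to be dense, allowing padding between rows. A simplified 2-D sketch of the idea (illustrative only, not the actual ggml implementation, which handles four dims and block-quantized types):

```cpp
#include <cstddef>
#include <cstdint>

// Simplified 2-D float tensor following ggml's convention:
// ne[i] = extent of dim i, nb[i] = stride of dim i in bytes.
struct tensor_2d {
    int64_t ne[2];
    size_t  nb[2];
};

// Fully contiguous (the ggml_is_contiguous case): no padding anywhere.
bool is_contiguous(const tensor_2d & t) {
    return t.nb[0] == sizeof(float) &&
           t.nb[1] == sizeof(float) * (size_t) t.ne[0];
}

// Row-contiguous (the ggml_is_contiguous_1 case): each row is dense,
// but the stride between rows may exceed the row size (padding).
bool is_contiguous_1(const tensor_2d & t) {
    return t.nb[0] == sizeof(float) &&
           t.nb[1] >= sizeof(float) * (size_t) t.ne[0];
}
```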

@qnixsynapse qnixsynapse force-pushed the cisc/split-reglu-geglu-swiglu branch from 917b5b5 to 42c2870 on June 15, 2025 16:02
@qnixsynapse
Collaborator

qnixsynapse commented Jun 15, 2025

Had to refactor and deduplicate SYCL code before adding support for split up+gate. Let me know if there are any issues.
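(To illustrate the deduplication pattern in general, not this PR's actual diff: a single kernel templated on the activation functor can serve REGLU/GEGLU/SWIGLU for the split case. A rough SYCL-flavoured sketch; all names are hypothetical and the pointers are assumed to be USM device allocations:)

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

// Stateless activation functors; one templated kernel then covers
// REGLU and SWIGLU (GEGLU would be one more functor) instead of
// several near-identical kernels.
struct op_relu { float operator()(float x) const { return x > 0.0f ? x : 0.0f; } };
struct op_silu { float operator()(float x) const { return x / (1.0f + sycl::exp(-x)); } };

// gate/up/dst are assumed to be USM device pointers usable on queue q.
template <typename Op>
void gated_op_split(sycl::queue & q, const float * gate, const float * up,
                    float * dst, size_t n) {
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        dst[i] = Op()(gate[i]) * up[i];
    }).wait();
}

// Usage:
//   gated_op_split<op_silu>(q, gate, up, dst, n); // SWIGLU
//   gated_op_split<op_relu>(q, gate, up, dst, n); // REGLU
```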

Edit: Adding SYCL test results (on A750), with master at 6adc3c3:

| Model | Test | t/s master | t/s cisc/split-reglu-geglu-swiglu | Speedup |
| --- | --- | --- | --- | --- |
| chatglm 9B IQ4_XS - 4.25 bpw | pp512 | 967.86 | 1015.64 | 1.05 |
| chatglm 9B IQ4_XS - 4.25 bpw | tg128 | 15.37 | 16.21 | 1.05 |
| gemma3 4B Q4_0 | pp512 | 2428.35 | 2447.54 | 1.01 |
| gemma3 4B Q4_0 | tg128 | 22.54 | 22.88 | 1.02 |

Edit 2: Added chatglm and GEGLU with tanh approximation.
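(For reference, the tanh-based GELU approximation replaces the exact erf-based GELU with the usual tanh estimate; a minimal sketch of the split GEGLU math in plain C++, illustrative rather than the actual SYCL kernel:)

```cpp
#include <cmath>
#include <cstddef>

// Widely used tanh approximation of GELU:
// gelu(x) ~= 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
static float gelu_tanh(float x) {
    const float sqrt_2_over_pi = 0.7978845608f;
    return 0.5f * x * (1.0f + std::tanh(sqrt_2_over_pi * (x + 0.044715f * x * x * x)));
}

// Split GEGLU (tanh variant): GELU on the gate, scaled by up.
void geglu_tanh_split(const float * gate, const float * up, float * dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] = gelu_tanh(gate[i]) * up[i];
    }
}
```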

@github-actions github-actions bot added the SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language label Jun 15, 2025
@CISC
Collaborator Author

CISC commented Jun 16, 2025

Edit: Adding SYCL test results (on A750)

Nice, can you run a test with chatglm too?
