Description
In ggml_cuda_op() I see spikes of up to 30 ms, easily reproducible when using a very low -ngl count (1, 2 or 3) on a large model (40B, q6_k).
This causes a significant slowdown of the calculations: the spikes are two orders of magnitude above what the operation usually takes.
In those cases the CPU operations are significantly faster than the GPU operations.
The device the tensor is on is a 4090; a second 3090 is also installed.
I used -ngl 1 to reproduce it with almost every token.
I tried -ts 1,0 without any change (all tensors are on device 0)
When everything works fine, the sync on result_wo takes 0.144 ms.
I debugged it down to the call of cudaDeviceSynchronize() at the end of the function.
Will continue debugging this one tomorrow
Maybe @JohannesGaessler already has an idea what is going on ?
It would also be helpful if anyone could confirm this.
Just run a model like 40b q6_k (or similar) with **-ngl 1** and **--debug-timings 3**
In my case it shows mat_mul spikes of 7-30 ms during almost every token generation.
-ts 1,0 had no influence (note: the tensor split is currently not working because it stops at device #1 memory_free; I was just fixing that).
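For reference, a repro invocation along the lines described above might look like the following. The binary name and model path are assumptions (adjust to your build and model); only the `-ngl 1` and `--debug-timings 3` flags come from the report. The sketch just assembles and echoes the command so it can be inspected without a GPU or model present.

```shell
# Hypothetical paths; -ngl / --debug-timings values are from the report above.
CMD="./main -m ./models/40b-q6_k.gguf -ngl 1 --debug-timings 3"
echo "$CMD"
```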