Random spikes of up to 30ms in ggml_cuda_op device synchronization when using a low -ngl count with dual GPU #19

Closed
@cmp-nct

Description

In ggml_cuda_op() I see spikes of up to 30 ms, easily reproducible when using a very low -ngl count like 1, 2, or 3 on a large model like 40B q6_k.
This causes a significant slowdown of the calculations; the spikes are two orders of magnitude higher than what the operation usually takes.
In those cases the CPU operations are significantly faster than the GPU operations.

The device the tensor is on is a 4090; a second 3090 is installed.
I used -ngl 1 to reproduce it with almost every token.
I tried -ts 1,0 without any change (all tensors are on device 0).

When everything works fine, the sync on result_wo takes 0.144 ms.

I debugged it down to the call to cudaDeviceSynchronize() at the end of the function.
I will continue debugging this one tomorrow.
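
For reference, this is roughly how I'm catching the spikes (a minimal sketch, not the actual ggml code; the `timed_sync` helper, its `op_name` parameter, and the 1 ms threshold are just illustration):

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wrap the final cudaDeviceSynchronize() in ggml_cuda_op()
// and log whenever the wait takes unusually long.
static void timed_sync(const char * op_name) {
    const auto t0 = std::chrono::high_resolution_clock::now();
    cudaDeviceSynchronize();
    const auto t1 = std::chrono::high_resolution_clock::now();
    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    // The sync normally takes ~0.144 ms here; anything above 1 ms counts as a spike.
    if (ms > 1.0) {
        fprintf(stderr, "sync spike in %s: %.3f ms\n", op_name, ms);
    }
}
```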

Maybe @JohannesGaessler already has an idea what is going on?
It would also be helpful if anyone could confirm this.

To reproduce: just run a model like 40B q6_k (or similar) with **-ngl 1** and **--debug-timings 3**.
In my case this shows mat_mul spikes of 7-30 ms in almost every token generation.
-ts 1,0 had no influence (note: the tensor split is currently not working because it stops at device #1 memory_free; I was just fixing that).
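
To narrow down whether a spike comes from the kernel itself or only from the host-side wait, one could bracket the op with CUDA events and compare the event time against the wall-clock sync time (a sketch under that assumption, not the actual ggml_cuda_op code; `profile_op` and its stream argument are placeholders):

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical probe: record events around the enqueued kernel, then compare
// the GPU-side elapsed time with the host-side wall-clock wait. If the event
// time is small but the wall-clock wait is large, the stall is in the
// host/driver sync rather than in the mat_mul kernel itself.
static void profile_op(cudaStream_t stream) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    // ... enqueue the mat_mul kernel on `stream` here ...
    cudaEventRecord(stop, stream);

    const auto t0 = std::chrono::high_resolution_clock::now();
    cudaDeviceSynchronize();
    const auto t1 = std::chrono::high_resolution_clock::now();

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);
    const double wall_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    fprintf(stderr, "gpu time: %.3f ms, host wait: %.3f ms\n", gpu_ms, wall_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```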
