Commit 68405ed

docs: add more details on GPU sampling implementation

1 parent 4696766 commit 68405ed

1 file changed: notes/llama.cpp/gpu-sampling.md (+99 −109 lines)

…being on the GPU, then filter down the logits/probabilities before copying them
from device to system memory for the remaining CPU samplers to process?
So perhaps we could start by implementing:
- Temperature
- Top-k
- Top-p
- Min-p
- additional?

### Top-k GPU sampling
So currently we have a sampler chain per sequence, and each sampler is provided
with the logits for its sequence.

So in our sampler we have:
```console
(gdb) p ggml_data->logits->ne
$9 = {32000, 1, 1, 1}
```
These are the logits for the last token in the sequence, so we have 32000 tokens.

So going through this, top_k will look like this:
```console
(gdb) p top_k->ne
$3 = {8, 1, 1, 1}
```
This is just one row with the indices of the top 8 logits.

Then we reshape the logits to become 32000 rows, each with one element, and use
ggml_get_rows to select the values at those indices, which produces:
```console
(gdb) p top_k_rows->ne
$6 = {1, 8, 1, 1}
```
Each of these rows contains a token: the first row is the top selection, followed
by the second, and so on.
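
Putting these steps together, the graph-building part of such a top-k sampler
might look roughly like this. This is a sketch only: the reshape via
ggml_reshape_2d is an assumption about how the shapes above are produced, and the
function/struct names are just illustrative.
```c++
static void llama_sampler_gpu_top_k_apply_ggml(
        struct llama_sampler           * smpl,
        struct ggml_context            * ctx,
        struct ggml_cgraph             * gf,
        struct llama_sampler_ggml_data * ggml_data) {
    auto * ctx_data = (llama_sampler_gpu_top_k_ctx *) smpl->ctx;

    // Indices of the k largest logits: ne = {k, 1, 1, 1}.
    struct ggml_tensor * top_k = ggml_top_k(ctx, ggml_data->logits, ctx_data->k);
    ggml_set_name(top_k, "top_k");

    // View the logits as one row per token (ne = {1, 32000, 1, 1}) so that
    // ggml_get_rows can gather the selected logits: ne = {1, k, 1, 1}.
    struct ggml_tensor * rows = ggml_reshape_2d(ctx, ggml_data->logits,
                                                1, ggml_data->logits->ne[0]);
    struct ggml_tensor * top_k_rows = ggml_get_rows(ctx, rows, top_k);

    ggml_data->logits = top_k_rows;
    ggml_build_forward_expand(gf, ggml_data->logits);
    ggml_data->size = ctx_data->k;
}
```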

_wip_

### Dist GPU sampling
To implement dist sampling on the GPU we need to be able to generate uniform
random numbers on the GPU. I don't think that GGML currently has support for
generating random numbers, nor do the GPU backends have such an operation.
Instead, what we could do is let the GPU sampler's _apply_ggml function create a
tensor in the sampler's context, and add a new function to the sampler interface
named set_input_ggml. This function would be called after the graph has been
built and scheduled, but before it has been executed. That way, samplers like
this one can generate the random numbers on the CPU and upload them to the GPU
before the graph is executed. This involves some data transfer, but only of a
relatively small tensor of random numbers.
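
To see why one uniform number per draw is enough, here is a small, self-contained
CPU illustration of what the final dist step does (plain C++, not llama.cpp code):
with softmaxed probabilities, the sampled token is the first index whose
cumulative probability exceeds the uniform value.
```c++
#include <cstdio>
#include <random>
#include <vector>

// Illustration only: inverse-CDF sampling with one uniform number per draw.
static int sample_dist(const std::vector<float> & probs, float u) {
    float cum = 0.0f;
    for (size_t i = 0; i < probs.size(); ++i) {
        cum += probs[i];
        if (u < cum) {
            return (int) i;
        }
    }
    return (int) probs.size() - 1;  // guard against rounding
}

int main() {
    std::vector<float> probs = {0.1f, 0.6f, 0.3f};  // already softmaxed
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    printf("sampled token: %d\n", sample_dist(probs, dist(rng)));
}
```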

I naively just created the tensor in the apply_ggml function like this:
```c++
// Create the uniform random scalar input tensor. This will be set by
// llama_sampler_gpu_dist_set_input_ggml after this graph is built, but
// before it is executed.
struct ggml_tensor * uniform = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
sctx->uniform = uniform;
ggml_set_name(uniform, "uniform");
ggml_set_input(uniform);
ggml_set_output(uniform);
```
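
The counterpart that fills this tensor could then be fairly small. A sketch of
what a set_input_ggml callback might look like follows; the callback signature,
the sctx members, and the use of std::mt19937 are assumptions, while
ggml_backend_tensor_set is the existing GGML call for writing data into a tensor
wherever it was allocated.
```c++
#include <random>

// Sketch: called after the graph has been scheduled but before it is executed.
static void llama_sampler_gpu_dist_set_input_ggml(struct llama_sampler * smpl) {
    auto * sctx = (llama_sampler_gpu_dist_ctx *) smpl->ctx;  // assumed context struct

    // Generate a single uniform number in [0, 1) on the CPU ...
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    const float value = dist(sctx->rng);                     // assumed RNG member

    // ... and upload it into the "uniform" input tensor, on whatever backend
    // the scheduler placed that tensor.
    ggml_backend_tensor_set(sctx->uniform, &value, 0, sizeof(value));
}
```
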
Now, if we think about how samplers are used, they are added to the end of the
model's graph, after the output tensors. So it makes sense to create the sampler's
tensors on the same backend as that output tensor so that there is no copying
between backends.

So we need to do something like the following when creating tensors in GPU
samplers:
```c++
ggml_backend_sched_set_tensor_backend(sched, uniform, target_backend);
```
Notice:
```console
(gdb) p *model->pimpl->dev_output.dev
```
```c++
GGML_API void ggml_backend_sched_set_tensor_backend(ggml_backend_sched_t sched,
                                                    struct ggml_tensor * node,
                                                    ggml_backend_t backend);
GGML_API ggml_backend_t ggml_backend_sched_get_tensor_backend(ggml_backend_sched_t sched,
                                                              struct ggml_tensor * node);
```

So perhaps we can add a function to the sampler interface that sets this
information, i.e. the scheduler and the target backend to use. I'll try this out
and see if it works and how it "feels". The samplers that need to maintain state
or create tensors would need to implement this function and also add members for
the scheduler and the target backend.
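
As a sketch of that idea (all names here are assumptions, not existing llama.cpp
API): the sampler interface would gain an optional callback that passes the
scheduler and target backend down to the sampler, and samplers that create
tensors would store them alongside their other state.
```c++
// Hypothetical addition to the sampler interface: called once after the
// scheduler has been created, before any graph is built.
void (*set_backend_ggml)(struct llama_sampler * smpl,
                         ggml_backend_sched_t   sched,
                         ggml_backend_t         target_backend);

// Members a sampler such as the dist sampler would then keep around:
struct llama_sampler_gpu_dist_ctx {
    ggml_backend_sched_t   sched;           // scheduler used for the model graph
    ggml_backend_t         target_backend;  // backend of the model's output tensor
    struct ggml_tensor   * uniform;         // uniform random input tensor
    std::mt19937           rng;             // CPU-side RNG (from <random>) used by set_input_ggml
};
```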

### llama-server
Now, GPU sampling for llama-cli was pretty straightforward, as there is basically
just one GPU sampler chain to configure. Recall that the samplers need to be
configured before the context is created, as the GPU samplers add to the model's
computation graph and are not something that is processed after the model graph
like CPU samplers.

It is possible to have a sampler per sequence in llama.cpp, and llama-server
supports multiple slots/sequences. It is also possible for requests to specify a
specific slot to be used. So we could provide a configuration option for
llama-server to have different GPU sampler chains per slot.

```console
--gpu-sampling              enable GPU sampling (default: disabled)
--gpu-top-k N               GPU top-k sampling (default: 40, <= 0 = disabled)
--gpu-top-p-approx-k N      GPU top-p approximation using top-k (default: 0, 0 = disabled)
--gpu-temp N                GPU temperature (default: 0.80, 0.0 = disabled, greedy sampling)
--gpu-softmax               add GPU softmax to sampling chain (default: disabled)
--gpu-dist                  add GPU dist (final sampling) to sampling chain (default: disabled)
--gpu-slot SLOT_ID:CONFIG   configure GPU sampling for a specific slot (server only)
                            format:  SLOT_ID:top_k=N,temp=F,dist=BOOL
                            example: --gpu-slot 0:top_k=20,temp=0.8,dist=true --gpu-slot 1:top_k=40,temp=0.5
```
And this could then be used as follows:
```console
./build-gpu-sampling/bin/llama-server \
    -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
    --gpu-sampling \
    --gpu-top-k 20 \
    --gpu-temp 0.8 \
    --gpu-dist \
    -ngl 99 \
    -v
```
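
A request could then be pinned to a specific slot, and with it to that slot's GPU
sampler chain. A hypothetical example, assuming the server's id_slot request
field is what selects the slot:
```console
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "n_predict": 16, "id_slot": 0}'
```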
