being on the GPU, then filter down the logits/probabilities before copying them
from device to system memory for the remaining CPU samplers to process?
So perhaps we could start by implementing:
- Temperature (see the sketch after this list)
- Top-p
- Min-p
- additional?
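For temperature, the graph operation is basically just a scale of the logits. A
minimal sketch, assuming the same kind of apply_ggml callback as the GPU samplers
discussed below, where ctx/gf are the graph context and graph, and
ggml_data->logits holds the current logits:
```c++
// Hypothetical temperature step: scale the logits by 1/temp before any later
// GPU samplers (softmax/dist) run on them.
struct ggml_tensor * scaled = ggml_scale(ctx, ggml_data->logits, 1.0f / temp);
ggml_set_name(scaled, "temp_scaled_logits");
ggml_data->logits = scaled;
ggml_build_forward_expand(gf, ggml_data->logits);
```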

### Top-k GPU sampling
So currently we have a sampler chain per sequence, and each sampler is provided
with the logits for its sequence.

So in our sampler we have:
```console
(gdb) p ggml_data->logits->ne
$9 = {32000, 1, 1, 1}
```
These are the logits for the last token in the sequence, so we have 32000 tokens.

Going through this step by step, top_k will look like this:
```console
(gdb) p top_k->ne
$3 = {8, 1, 1, 1}
```
This is just one row with the indices of the top 8 logits.

Then we reshape the logits to become 32000 rows, each with one element.
And we use ggml_get_rows to select those values using the indices, which produces:
```console
(gdb) p top_k_rows->ne
$6 = {1, 8, 1, 1}
```
And each of these rows corresponds to one of the selected tokens, with the first
row being the top selection, followed by the second, etc.
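Putting this together, a sketch of what the apply function for this could look
like (following the structure of the other GPU samplers in these notes; the
context struct names here are assumptions):
```c++
static void llama_sampler_gpu_top_k_apply_ggml(
        struct llama_sampler           * smpl,
        struct ggml_context            * ctx,
        struct ggml_cgraph             * gf,
        struct llama_sampler_ggml_data * ggml_data) {
    auto * ctx_data = (llama_sampler_gpu_top_k_ctx *) smpl->ctx;

    // indices of the k largest logits, shape {k, 1, 1, 1}
    struct ggml_tensor * top_k = ggml_top_k(ctx, ggml_data->logits, ctx_data->k);
    ggml_set_name(top_k, "top_k");

    // view the logits as one row per token so ggml_get_rows can gather them
    struct ggml_tensor * rows = ggml_reshape_2d(ctx, ggml_data->logits,
                                                1, ggml_data->logits->ne[0]);

    // gather the top-k logit values, shape {1, k, 1, 1}
    struct ggml_tensor * top_k_rows = ggml_get_rows(ctx, rows, top_k);
    ggml_set_name(top_k_rows, "top_k_rows");

    ggml_data->logits = top_k_rows;
    ggml_build_forward_expand(gf, ggml_data->logits);
}
```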

_wip_

### Dist GPU sampling
To implement dist sampling on the GPU we need to be able to generate uniform
random numbers on the GPU. I don't think that GGML currently has support for
generating random numbers, nor that the GPU backends have such an operation.
But instead what we could do is enable the GPU sampler's apply_ggml function
to create a tensor in the sampler's context. We then add a new function to the
sampler interface named set_input_ggml. This function will be called after the
graph has been built and scheduled, but before it has been executed. This way
samplers like this one can generate the random numbers on the CPU and then
upload them to the GPU before the graph is executed. This involves some data
transfer, but only of a relatively small tensor of random numbers.

I naively just created the tensor in the apply_ggml function like this:
```c++
// Create the uniform random scalar input tensor. This will be set by
// llama_sampler_gpu_dist_set_input_ggml after this graph is built, but
// before it is executed.
struct ggml_tensor * uniform = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
sctx->uniform = uniform;
ggml_set_name(uniform, "uniform");
ggml_set_input(uniform);
ggml_set_output(uniform);
```
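The set_input_ggml counterpart could then look something like the following
sketch (the rng and uniform members of the sampler context, and the exact
signature, are assumptions here):
```c++
#include <random>

// Called after the graph has been built and scheduled, but before it is
// executed: draw a uniform random number on the CPU and upload it to
// wherever the backend placed the "uniform" tensor.
static void llama_sampler_gpu_dist_set_input_ggml(struct llama_sampler * smpl) {
    auto * sctx = (llama_sampler_gpu_dist_ctx *) smpl->ctx;

    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    const float u = dist(sctx->rng); // sctx->rng assumed to be a std::mt19937

    ggml_backend_tensor_set(sctx->uniform, &u, 0, sizeof(u));
}
```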
Now, if we think about how samplers are used, they are added to the end of the
model's graph, after the output tensors. So it makes sense to create the sampler
tensors on the same backend as those tensors so that there is no copying between
the backends.

So we need to do something like the following when creating tensors in GPU
samplers:
```c++
ggml_backend_sched_set_tensor_backend(sched, uniform, target_backend);
```
Notice that we can inspect which device the model's output tensors live on:
```console
(gdb) p *model->pimpl->dev_output.dev
```
```c++
GGML_API void            ggml_backend_sched_set_tensor_backend(ggml_backend_sched_t sched,
                                                               struct ggml_tensor * node,
                                                               ggml_backend_t backend);
GGML_API ggml_backend_t  ggml_backend_sched_get_tensor_backend(ggml_backend_sched_t sched,
                                                               struct ggml_tensor * node);
```
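So one option might be to look up the backend of the logits tensor with the
getter and pin the sampler's own tensors to it, something like this sketch (how
the sampler gets hold of sched is the open question):
```c++
// Place the sampler's input tensor on the same backend as the logits it
// consumes, so that no copy between backends is needed.
ggml_backend_t target_backend = ggml_backend_sched_get_tensor_backend(sched, ggml_data->logits);
ggml_backend_sched_set_tensor_backend(sched, uniform, target_backend);
```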

So perhaps we can add a function to the sampler interface that sets this
information, that is the scheduler and the target backend to use. I'll try this
out and see if it works and how it "feels". The samplers that need to maintain
state or create tensors would need to implement this function and also add
members for the scheduler and the target backend.
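A rough sketch of what that addition could look like (the callback name and
signature are just assumptions):
```c++
struct llama_sampler_i {
    // ... existing callbacks (name, accept, apply, reset, clone, free) ...

    // optional: hand GPU samplers the scheduler and the backend on which to
    // place any tensors they create
    void (*set_backend_ggml)(struct llama_sampler * smpl,
                             ggml_backend_sched_t   sched,
                             ggml_backend_t         backend);
};
```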

### llama-server
Now, GPU sampling for llama-cli was pretty straightforward as there is basically
just one GPU sampler chain that needs to be configured. And recall that the
samplers need to be configured before the context is created, as the GPU
samplers add to the model's computation graph and are not something that is
processed after the model graph like the CPU samplers.

It is possible to have a sampler per sequence in llama.cpp, and llama-server
supports multiple slots/sequences. It is also possible for requests to specify a
specific slot to be used. So we could provide a configuration option for
llama-server to have different GPU sampler chains per slot.

Something like the following options could be added:
```console
--gpu-sampling                  enable GPU sampling (default: disabled)
--gpu-top-k N                   GPU top-k sampling (default: 40, <= 0 = disabled)
--gpu-top-p-approx-k N          GPU top-p approximation using top-k (default: 0, 0 = disabled)
--gpu-temp N                    GPU temperature (default: 0.80, 0.0 = disabled, greedy sampling)
--gpu-softmax                   add GPU softmax to sampling chain (default: disabled)
--gpu-dist                      add GPU dist (final sampling) to sampling chain (default: disabled)
--gpu-slot SLOT_ID:CONFIG       configure GPU sampling for a specific slot (server only)
                                format:  SLOT_ID:top_k=N,temp=F,dist=BOOL
                                example: --gpu-slot 0:top_k=20,temp=0.8,dist=true --gpu-slot 1:top_k=40,temp=0.5
```
And this could then be used as follows:
```console
./build-gpu-sampling/bin/llama-server \
    -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
    --gpu-sampling \
    --gpu-top-k 20 \
    --gpu-temp 0.8 \
    --gpu-dist \
    -ngl 99 \
    -v
```
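For the per-slot configuration, the server could then be started with a
different chain per slot, and a request can pin itself to a slot with the
id_slot request field that llama-server already supports (the --gpu-slot syntax
is the proposal above, not something that exists yet):
```console
./build-gpu-sampling/bin/llama-server \
    -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
    --gpu-sampling \
    --gpu-slot 0:top_k=20,temp=0.8,dist=true \
    --gpu-slot 1:top_k=40,temp=0.5 \
    -ngl 99

curl http://localhost:8080/completion \
    -d '{"prompt": "What is the capital of Sweden?", "n_predict": 32, "id_slot": 0}'
```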