@@ -61,180 +61,86 @@ any way (like also if the size of the array is modified).
6161To be processed by a GGML graph, the data in `cur_p` needs to be
6262in the form of a `ggml_tensor` (or multiple).
6363
64+ ### Suggested approach
6465One way could be to store the tensors in a struct like this:
6566``` c++
6667 struct llama_sampler_ggml_data {
67- struct ggml_tensor * ids; // [ n_vocab] - GGML_TYPE_I32
68- struct ggml_tensor * logits; // [ n_vocab] - GGML_TYPE_F32
69- struct ggml_tensor * probs; // [ n_vocab] - GGML_TYPE_F32
70- int64_t size; // number of valid tokens in the arrays (<= n_vocab)
71- int64_t selected ; // index in the array (-1 if not yet selected )
72- bool sorted; // whether data is sorted by logits/probs
68+     struct ggml_tensor * ids;      // [n_vocab] - GGML_TYPE_I32
69+     struct ggml_tensor * logits;   // [n_vocab] - GGML_TYPE_F32
70+     struct ggml_tensor * probs;    // [n_vocab] - GGML_TYPE_F32
71+     struct ggml_tensor * selected; // [1]       - GGML_TYPE_I32 index in the array (-1 if not yet selected)
72+     struct ggml_tensor * size;     // [1]       - GGML_TYPE_I32 number of valid tokens in the arrays (<= n_vocab)
73+     struct ggml_tensor * sorted;   // [1]       - GGML_TYPE_I32 whether data is sorted by logits/probs
7374 };
7576```
76- The token ids can be stored as `I32` type tensors, and the logits and probabilities as `F32`.
77- Having separate tensors instead of perhaps packing data into a single tensor makes enables
78- easier operations on all of the logits or probabilities. Also if the types are different
79- this might also make sense to have them separate and not packed as well.
77+ The token ids can be stored as `I32` type tensors, and the logits and
78+ probabilities as `F32`.
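
As a rough sketch (assuming a `ggml_context * ctx` and an `n_vocab` value obtained from the vocabulary, e.g. via `llama_vocab_n_tokens`), the tensors could be created like this:
```c++
// Sketch only: allocate the tensors described by llama_sampler_ggml_data.
// ctx is an existing ggml_context and n_vocab the vocabulary size (assumed here).
llama_sampler_ggml_data data;
data.ids      = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_vocab);
data.logits   = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_vocab);
data.probs    = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_vocab);
data.selected = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 1);
data.size     = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 1);
data.sorted   = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 1);
```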
8079
81- This would allow a function declaration to look something like this:
80+ This would allow function declarations to look something like this:
8281```c++
8382 void (*apply_ggml)( struct llama_sampler * smpl,
8483 ggml_context * ctx,
8584 ggml_cgraph * gf,
8685 llama_sampler_ggml_data * ggml_data);
86+
87+ void (*accept_ggml)( struct llama_sampler * smpl,
88+ ggml_context * ctx,
89+ ggml_cgraph * gf,
90+ struct ggml_tensor * selected_token);
8791```
8892This way multiple GPU samplers can be chained together and they can all update
8993the graph with the operations they need to perform.
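
For example, a minimal sketch of what a single GPU sampler's `apply_ggml` could look like; the temperature sampler shown here (its name, context struct and field are assumptions, not existing llama.cpp code) only rescales the logits and registers the result in the graph:
```c++
// Hypothetical temperature sampler: scales the logits as part of the graph.
struct llama_sampler_gpu_temp_ctx {
    float temp;
};

static void llama_sampler_gpu_temp_apply_ggml(
        struct llama_sampler    * smpl,
        ggml_context            * ctx,
        ggml_cgraph             * gf,
        llama_sampler_ggml_data * ggml_data) {
    const auto * sctx = (const llama_sampler_gpu_temp_ctx *) smpl->ctx;

    // divide the logits by the temperature and make the scaled tensor the new logits
    ggml_data->logits = ggml_scale(ctx, ggml_data->logits, 1.0f / sctx->temp);
    ggml_build_forward_expand(gf, ggml_data->logits);
}
```
A top-k or dist sampler would add its own operations in the same way, each one consuming the tensors left in `llama_sampler_ggml_data` by the previous sampler in the chain.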
9094
91- And ` llama_sampler_chain ` would then apply all the samplers calling ` apply_ggml ` for
92- each sampler in the chain, passing in a ` ggml_cgraph ` that is built up with all the
93- operations from each sampler.
94-
95- While we want to avoid intermixing CPU and GPU samplers in the samplers chain, as this
96- would require converting and copying data between system memory to device memory, we should
97- support having GPU samplers at the start of the sampling chain. This way we can take
98- advantage of the logits already being on the GPU and perform some of the sampling
99- operations on the GPU before copying the data back to the CPU for any CPU samplers
100- to process later in the chain.
101-
95+ The GPU sampling operations are performed in a similar manner to how pooling
96+ is currently applied. To enable this, a `llama_sampler` has been added to
97+ `llama_context_params`:
10298``` c++
103- llama_token llama_sampler_sample (struct llama_sampler * smpl, struct llama_context * ctx, int32_t idx) {
104- const auto * logits = llama_get_logits_ith(ctx, idx);
105-
106- const llama_model * model = llama_get_model(ctx);
107- const llama_vocab * vocab = llama_model_get_vocab(model);
108-
109- const int n_vocab = llama_vocab_n_tokens(vocab);
110- ...
111-
112- // check the smpl chain and store the GPU samplers in a vector
113- // Process the GPU samplers and afterwards create a llama_token_data_array
114- // which can then be passed to the remaining CPU samplers in the chain.
115- }
116- ```
117-
118- ### GPU Sampler parameters and state
119- Some samplers need to be able to accept parameters and also maintain state.
120- The tensors for the parameters and state need to be pre allocated made accessible
121- to the samplers.
122-
123- If we take a top-k sampler as an example. This sampler needs to be initialized with
124- the 'k' value. This is possible in much the same way as the CPU implementation:
12599```c++
126- static struct llama_sampler * llama_sampler_gpu_init_top_k(int32_t k) {
127- static const llama_sampler_i iface = {
128- /*.name =*/ llama_sampler_gpu_top_k_name,
129- /*.accept =*/ nullptr,
130- /*.apply =*/ nullptr,
131- /*.reset =*/ nullptr,
132- /*.clone =*/ nullptr,
133- /*.free =*/ llama_sampler_gpu_top_k_free,
134- /*.apply_ggml =*/ llama_sampler_gpu_top_k_apply_ggml
135- };
136-
137- auto * ctx_data = new llama_sampler_gpu_top_k_ctx {
138- /*.k =*/ k,
139- };
100+ struct llama_sampler_chain_params gpu_sampler_params = llama_sampler_chain_default_params();
101+ struct llama_sampler * gpu_sampler_chain = llama_sampler_chain_init(gpu_sampler_params);
102+ llama_sampler_chain_add(gpu_sampler_chain, llama_sampler_gpu_init_greedy());
140103
141- auto * sampler = new llama_sampler {
142- /*.iface =*/ &iface,
143- /*.ctx =*/ ctx_data,
144- };
104+ llama_context_params cparams = llama_context_default_params();
105+ cparams.sampler = gpu_sampler_chain; // attach the GPU sampler chain to the context
145106
146- return sampler;
147- }
107+ auto ctx = llama_init_from_model(model, cparams);
148108```
149- Currently, the GPU sampler are initialized before calling ` llama_sampler_sample ` so the
150- above works fine. In llama_sampler_sample is where we currently have the GPU sampler
151- processsing:
109+ When the model's graph is built, we will then also build the sampling graph:
152110```c++
153- llama_token llama_sampler_sample (struct llama_sampler * smpl, struct llama_context * ctx, int32_t idx) {
111+ ggml_cgraph * llama_model::build_graph(const llm_graph_params & params) const {
112+ std::unique_ptr<llm_graph_context> llm;
154113 ...
155- struct ggml_init_params params = {
156- // TODO: need to take into account any tensors that GPU sampler may need.
157- /* .mem_size =* / (ggml_tensor_overhead() * 5) + GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead(),
158- /* .mem_buffer =* / nullptr,
159- /* .no_alloc =* / true,
160- };
161- struct ggml_context * ctx_sample = ggml_init(params);
162-
163- struct ggml_tensor * logits_t = ggml_new_tensor_1d(ctx_sample, GGML_TYPE_F32, n_vocab);
164- struct ggml_tensor * ids = ggml_new_tensor_1d(ctx_sample, GGML_TYPE_I32, n_vocab);
165- struct ggml_tensor * probs = ggml_new_tensor_1d(ctx_sample, GGML_TYPE_F32, n_vocab);
166- struct ggml_tensor * selected = ggml_new_tensor_1d(ctx_sample, GGML_TYPE_I32, 1);
167-
168- // Select a GPU backend.
169- // TODO: perhaps this should be configurable as to which GPU to use
170- ggml_backend_t backend = nullptr;
171- ggml_backend_buffer_type_t buft = nullptr;
172- for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
173- auto * dev = ggml_backend_dev_get(i);
174- if (ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_GPU) {
175- backend = ggml_backend_dev_init(dev, nullptr);
176- buft = ggml_backend_dev_buffer_type(dev);
177- printf("Using GPU device '%s' for sampling\n", ggml_backend_dev_name(dev));
178- break;
179- }
180- }
181- ...
182-
183- struct ggml_cgraph * gf = ggml_new_graph(ctx_sample);
184-
185- struct llama_sampler_ggml_data ggml_data = {
186- /*.ids =*/ ids,
187- /*.logits =*/ logits_t,
188- /*.probs =*/ probs,
189- /*.selected =*/ selected,
190- /*.size =*/ n_vocab,
191- /*.sorted =*/ false,
192- };
193-
194- // Apply GPU samplers (add sampling operations to the graph)
195- for (auto & smpl : gpu_samplers) {
196- smpl.iface->apply_ggml(&smpl, ctx_sample, gf, &ggml_data);
197- }
198114
199- ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors_from_buft(ctx_sample, buft);
200- ...
115+ // add on pooling layer
116+ llm->build_pooling(cls, cls_b, cls_out, cls_out_b);
117+
118+ // add GPU sampling layers (if any)
119+ llm->build_sampling(*this);
120+ ...
201121}
202122```
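
As a sketch of what `build_sampling` might do (the member names `ctx0` and `gf`, and the way the sampler chain is reached, are assumptions here, mirroring how `build_pooling` is wired up), it could simply walk the configured chain and let each GPU sampler append its operations:
```c++
// Sketch only: walk the GPU sampler chain and let each sampler extend the graph.
void llm_graph_context::build_sampling(const llama_model & model) const {
    llama_sampler * chain = cparams.sampler; // assumed: the chain set on the context params
    if (chain == nullptr) {
        return;
    }

    // tensors holding ids/logits/probs etc. would be created here (omitted)
    llama_sampler_ggml_data ggml_data = {};

    for (int i = 0; i < llama_sampler_chain_n(chain); ++i) {
        auto * s = llama_sampler_chain_get(chain, i);
        if (s->iface->apply_ggml) {
            s->iface->apply_ggml(s, ctx0, gf, &ggml_data);
        }
    }
}
```
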
203- A GPU sampler can create the tensors it needs in its apply_ggml function. But notice that the
204- graph is passed to this function, and we have to specify the memory size for the ggml_context
205- before this call happens.
206- So how do we know how much memory to allocate for the samplers tensors?
207- Perhaps adding a callback for the gpu samplers be accaptable where the size of memory needed
208- by a sampler is returned? Something like:
123+ All the samplers that have been configured will be applied. What a sampler
124+ actually does depends on its type: it may just modify the logits (like a
125+ temperature sampler), filter them (like a top-k sampler), or compute
126+ probabilities (like a dist sampler). A sampler can also select a token
127+ directly (like a greedy sampler), in which case the CPU samplers can be
128+ skipped entirely and only GPU samplers are run.
129+
130+ To get the probabilities generated by the chain:
209131``` c++
210- size_t (*size_ggml)(const struct llama_sampler * smpl );
132+ float * probs = llama_get_sampled_probs(ctx);
211133```
212- This could then be called when we gather the GPU samplers :
134+ To get the selected token:
213135``` c++
214- std::vector<llama_sampler> gpu_samplers;
215- size_t gpu_samplers_ggml_size = 0 ;
216- if (smpl->iface->name && strcmp(smpl->iface->name (smpl), "chain") == 0) {
217- for (int i = 0; i < llama_sampler_chain_n(smpl); i++) {
218- auto * s = llama_sampler_chain_get(smpl, i);
219- if (s->iface->apply_ggml) {
220- gpu_samplers.push_back(* s);
221- gpu_samplers_ggml_size += s->iface->size_ggml(s);
222- }
223- }
224- ```
225- We can then use this later when creating the context parameters:
226- ```c++
227- size_t total_ggml_size = gpu_samplers_ggml_size + (ggml_tensor_overhead() * 5) + GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead();
228- printf("Total ggml size for GPU samplers: %zu bytes\n", total_ggml_size);
229- struct ggml_init_params params = {
230- // TODO: need to take into account any tensors that GPU sampler may need.
231- /*.mem_size =*/ total_ggml_size,
232- /*.mem_buffer =*/ nullptr,
233- /*.no_alloc =*/ true,
234- };
235- struct ggml_context * ctx_sample = ggml_init(params);
136+ llama_token id = llama_get_sampled_token(ctx);
236137```
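
Putting the two accessors together, usage from application code could look roughly like this (a sketch assuming the proposed `llama_get_sampled_token`/`llama_get_sampled_probs` functions and a `std::vector<llama_token> tokens` holding the prompt):
```c++
// Sketch: decode a batch and read back the result of the GPU sampling chain.
llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());
if (llama_decode(ctx, batch) != 0) {
    // handle error
}

// token selected on the GPU, e.g. by a greedy/dist sampler at the end of the chain
llama_token id = llama_get_sampled_token(ctx);

// or inspect the probabilities produced by the chain
const float * probs = llama_get_sampled_probs(ctx);
```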
237138
139+ It is also possible to run CPU samplers after the normal llama_decode call,
140+ operating on the logits just like before. This opens up the possibility of
141+ mixing GPU and CPU samplers, for example running temperature scaling and
142+ top-k on the GPU and then dist or greedy on the CPU.
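
For example, a mixed configuration could look roughly like this (the `llama_sampler_gpu_init_*` constructors are the proposed GPU-side counterparts and are assumptions here; the CPU-side samplers are the existing ones):
```c++
// GPU part: temperature + top-k run as part of the decode graph (proposed API).
struct llama_sampler * gpu_chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
llama_sampler_chain_add(gpu_chain, llama_sampler_gpu_init_temp(0.8f)); // assumed constructor
llama_sampler_chain_add(gpu_chain, llama_sampler_gpu_init_top_k(40));  // assumed constructor

llama_context_params cparams = llama_context_default_params();
cparams.sampler = gpu_chain; // proposed field

struct llama_context * ctx = llama_init_from_model(model, cparams);

// CPU part: dist sampler operating on the already filtered logits after llama_decode.
struct llama_sampler * cpu_chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
llama_sampler_chain_add(cpu_chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

// ... llama_decode(ctx, batch) ...
llama_token id = llama_sampler_sample(cpu_chain, ctx, -1);
```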
143+
238144### GPU Sampler state
239145This was brought up in the feedback and is something that we need to consider. The
240146parameters to the GPU samplers can work as they currently do for CPU samplers and
@@ -276,14 +182,6 @@ This way the tensors will have a fixed tensor size.
276182* The conversion from `llama_token_data_array` to `llama_sampler_ggml_data` should not
277183  be performed when we are using only GPU samplers. Otherwise it would incur
278184  significant data transfer that would negate the benefits of GPU sampling.
279- I've updated the suggestion above and we can check the samplers in the chain
280- and if they are all GPU samplers then we can avoid creating the
281- llama_token_data_array altogether.
282-
283- * Originally I have specified that either GPU samplers or CPU samplers would be
284- used in a sampling chain. But there was a suggestion to allow mixing them in the
285- sense that the GPU samplers would be at the start of the chain and CPU samplers
286- after them.
287185
288186* It was also brought up that CPU samplers may require additional tensors for storing
289187  parameters and state. These tensors need to be preallocated and made available to the