any way (like also if the size of the array is modified).
To be processed by a GGML graph, the information in `cur_p` needs to be in the
form of a `ggml_tensor` (or multiple tensors).

### Current implementation
In contrast to CPU sampling, where the sampling operations are performed after
the model's graph has been executed, GPU sampling is part of the same execution
graph. All of the sampling can be done on the GPU, or parts of it can be done
on the GPU and the rest on the CPU.

#### Configuration of GPU samplers
GPU samplers are configured before the context is created, and a GPU sampler
chain can be configured per sequence:
```c++
struct llama_sampler_chain_params params = llama_sampler_chain_default_params();
struct llama_sampler * gpu_sampler_chain = llama_sampler_chain_init(params);

llama_sampler_chain_add(gpu_sampler_chain, llama_sampler_gpu_init_greedy());

std::vector<llama_sampler_seq_config> gpu_sampler_configs = {
    { 0, gpu_sampler_chain }
};
```
The above only shows one sampler, but multiple samplers can be added to the
chain, and an entry can be added to the `gpu_sampler_configs` vector for each
sequence.
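
As a rough sketch of a larger configuration (the `llama_sampler_gpu_init_temp`
constructor below is an assumption, named in the style of the greedy one; it is
not shown in this document), two sequences could be given their own chains like
this:
```c++
// Hypothetical sketch: sequence 0 samples with a temperature based sampler
// (llama_sampler_gpu_init_temp is an assumed constructor), while sequence 1
// uses greedy sampling.
struct llama_sampler * chain_seq0 = llama_sampler_chain_init(llama_sampler_chain_default_params());
llama_sampler_chain_add(chain_seq0, llama_sampler_gpu_init_temp(0.8f));

struct llama_sampler * chain_seq1 = llama_sampler_chain_init(llama_sampler_chain_default_params());
llama_sampler_chain_add(chain_seq1, llama_sampler_gpu_init_greedy());

std::vector<llama_sampler_seq_config> gpu_sampler_configs = {
    { 0, chain_seq0 },  // sequence 0 uses the temperature chain
    { 1, chain_seq1 },  // sequence 1 uses the greedy chain
};
```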

These sampler configurations are then passed into the context parameters when
creating the context:
``` c++
llama_context_params cparams = llama_context_default_params();
cparams.samplers   = gpu_sampler_configs.data();
cparams.n_samplers = gpu_sampler_configs.size();

ctx = llama_init_from_model(model, cparams);
```

When the model graph is built, the GPU samplers will be called so that they can
add their operations to the graph:
``` c++
ggml_cgraph * llama_model::build_graph(const llm_graph_params & params) const {
    std::unique_ptr<llm_graph_context> llm;
    ...

    // add GPU sampling layers (if any)
    llm->build_sampling(*this, params);
```
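
Conceptually, the build step hands each configured sampler chain the logits for
its sequence and lets the chain append its operations. The following is an
illustrative sketch only, not the actual `build_sampling` implementation, and
the helper name is made up:
```c++
// Illustrative only: roughly what building the sampling part of the graph
// amounts to for a single sequence. The real build_sampling in the PR handles
// per-sequence logit views, outputs, etc.
static void build_sampling_for_seq_sketch(
        ggml_context         * ctx,
        ggml_cgraph          * gf,
        struct llama_sampler * chain,        // GPU sampler chain configured for this sequence
        struct ggml_tensor   * seq_logits) { // logits for this sequence
    llama_sampler_ggml_data ggml_data = {};
    ggml_data.logits = seq_logits;

    // The chain's apply_ggml in turn calls apply_ggml on each sampler in the
    // chain, each one adding its operations to the graph.
    chain->iface->apply_ggml(chain, ctx, gf, &ggml_data);

    // The sampled token tensor becomes part of the graph so that it can be
    // read back after the graph has been computed.
    ggml_build_forward_expand(gf, ggml_data.sampled_token);
}
```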
The `llama_sampler_i` interface has been extended with four new methods in the
API, most of them named with a `_ggml` suffix to indicate that they are for GPU
sampling:
```c++
void (*apply_ggml)( struct llama_sampler * smpl,
                    ggml_context * ctx,
                    ggml_cgraph * gf,
                    llama_sampler_ggml_data * ggml_data);

void (*accept_ggml)( struct llama_sampler * smpl,
                     ggml_context * ctx,
                     ggml_cgraph * gf,
                     struct ggml_tensor * selected_token);

void (*set_input_ggml)( struct llama_sampler * smpl,
                        ggml_context * ctx,
                        ggml_cgraph * gf);

void (*set_backend_context)( struct llama_sampler * smpl,
                             ggml_backend_sched_t sched,
                             ggml_backend_t backend);
```
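
Wiring these callbacks up would presumably follow the same pattern as the
existing CPU samplers. The sketch below is an assumption: only
`llama_sampler_gpu_greedy_apply_ggml` appears later in this document, the other
helper names and the exact field order are made up:
```c++
// Sketch of how the greedy GPU sampler might fill in the extended interface;
// the first six fields follow the existing llama_sampler_i convention.
static struct llama_sampler_i llama_sampler_gpu_greedy_i = {
    /* .name                = */ llama_sampler_gpu_greedy_name,
    /* .accept              = */ nullptr,
    /* .apply               = */ nullptr, // sampling happens in the graph, not on the CPU
    /* .reset               = */ nullptr,
    /* .clone               = */ llama_sampler_gpu_greedy_clone,
    /* .free                = */ llama_sampler_gpu_greedy_free,
    /* .apply_ggml          = */ llama_sampler_gpu_greedy_apply_ggml,
    /* .accept_ggml         = */ nullptr, // greedy sampling keeps no state
    /* .set_input_ggml      = */ nullptr, // no extra input tensors needed
    /* .set_backend_context = */ llama_sampler_gpu_greedy_set_backend_context,
};
```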
The `set_backend_context` function is used to let the GPU sampler know which
backend the tensors that it creates/uses should be created on. This is important
so that we avoid splits in the computation graph that would require data
transfers between different backends.
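
A minimal sketch of what an implementation could look like, assuming a
`llama_sampler_gpu_greedy_ctx` struct that holds just the two fields the greedy
example below needs (the real struct in the PR may hold more):
```c++
// Assumed sampler context for the greedy GPU sampler.
struct llama_sampler_gpu_greedy_ctx {
    ggml_backend_sched_t sched;
    ggml_backend_t       backend;
};

static void llama_sampler_gpu_greedy_set_backend_context(
        struct llama_sampler * smpl,
        ggml_backend_sched_t   sched,
        ggml_backend_t         backend) {
    auto * sctx = (llama_sampler_gpu_greedy_ctx *) smpl->ctx;

    // Remember the scheduler and backend so that apply_ggml can pin the
    // tensors it creates to the same backend and avoid graph splits.
    sctx->sched   = sched;
    sctx->backend = backend;
}
```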

`apply_ggml` is where the GPU sampler adds its operations to the graph. For
example, the greedy sampler selects the token with the highest probability:
``` c++
static void llama_sampler_gpu_greedy_apply_ggml(
        struct llama_sampler * smpl,
        struct ggml_context * ctx,
        struct ggml_cgraph * gf,
        struct llama_sampler_ggml_data * ggml_data) {
    (void) gf;
    auto * sctx = (llama_sampler_gpu_greedy_ctx *) smpl->ctx;

    struct ggml_tensor * argmax_result = ggml_argmax(ctx, ggml_data->logits);
    ggml_set_name(argmax_result, "argmax_result");
    ggml_backend_sched_set_tensor_backend(sctx->sched, argmax_result, sctx->backend);
    ggml_data->sampled_token = argmax_result;
}
```
Here we also see how the scheduler and backend are used to ensure that the
tensor is created on the correct backend.
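
Samplers earlier in a chain would typically transform the logits instead of
producing a token. As a hedged sketch (the `llama_sampler_gpu_temp_ctx` struct,
its fields, and the function name are assumptions; `ggml_scale` is a regular
ggml op), a temperature sampler's `apply_ggml` could scale the logits like
this:
```c++
#include <random>

// Assumed context for a temperature sampler; the rng/rng_input fields are used
// by the set_input_ggml sketch further down.
struct llama_sampler_gpu_temp_ctx {
    float                temp;      // temperature applied to the logits
    std::mt19937         rng;       // CPU-side RNG, see set_input_ggml below
    struct ggml_tensor * rng_input; // graph input holding the drawn random value
    ggml_backend_sched_t sched;
    ggml_backend_t       backend;
};

static void llama_sampler_gpu_temp_apply_ggml(
        struct llama_sampler * smpl,
        struct ggml_context * ctx,
        struct ggml_cgraph * gf,
        struct llama_sampler_ggml_data * ggml_data) {
    (void) gf;
    auto * sctx = (llama_sampler_gpu_temp_ctx *) smpl->ctx;

    // Dividing the logits by the temperature is the same as scaling by 1/temp.
    struct ggml_tensor * scaled = ggml_scale(ctx, ggml_data->logits, 1.0f / sctx->temp);
    ggml_set_name(scaled, "temp_scaled_logits");
    ggml_backend_sched_set_tensor_backend(sctx->sched, scaled, sctx->backend);

    // Later samplers in the chain operate on the scaled logits; selecting a
    // token from the resulting distribution is left to the rest of the chain.
    ggml_data->logits = scaled;
}
```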

`accept_ggml` is called after the GPU graph has been executed to allow the GPU
sampler to accept the selected token and update its state. Note that currently
no GPU samplers maintain any state in this way, and this is something that
needs more work.
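
As a purely hypothetical example of what such state handling might look like, a
repetition-penalty style sampler could read the sampled token back and record
it for the next step (none of the names below exist in the PR):
```c++
#include <cstdint>
#include <vector>

// Hypothetical context for a stateful GPU sampler.
struct llama_sampler_gpu_penalty_ctx {
    std::vector<int32_t> prev_tokens; // tokens sampled so far for this sequence
};

static void llama_sampler_gpu_penalty_accept_ggml(
        struct llama_sampler * smpl,
        ggml_context * ctx,
        ggml_cgraph * gf,
        struct ggml_tensor * selected_token) {
    (void) ctx;
    (void) gf;
    auto * sctx = (llama_sampler_gpu_penalty_ctx *) smpl->ctx;

    // Read the sampled token id back from the backend (ggml_argmax produces an
    // I32 tensor) and remember it so that a later apply_ggml/set_input_ggml
    // call could upload the history as an input tensor.
    int32_t token = 0;
    ggml_backend_tensor_get(selected_token, &token, 0, sizeof(token));
    sctx->prev_tokens.push_back(token);
}
```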

`set_input_ggml` is called after the computation graph has been scheduled but
before it is computed. This allows the GPU sampler to set any input tensors. It
is currently used by the temp sampler to set a random number tensor that is
used for sampling.
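
A sketch of how this could look for the temp sampler, reusing the assumed
`llama_sampler_gpu_temp_ctx` from the earlier sketch and assuming that
`apply_ggml` also created a one-element `rng_input` tensor (not shown above):
```c++
static void llama_sampler_gpu_temp_set_input_ggml(
        struct llama_sampler * smpl,
        ggml_context * ctx,
        ggml_cgraph * gf) {
    (void) ctx;
    (void) gf;
    auto * sctx = (llama_sampler_gpu_temp_ctx *) smpl->ctx;

    // Draw a uniform random value on the CPU and upload it into the input
    // tensor that was created during graph building; the graph then uses it
    // to pick a token from the (temperature scaled) distribution.
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    float r = dist(sctx->rng);
    ggml_backend_tensor_set(sctx->rng_input, &r, 0, sizeof(float));
}
```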

Support has been added to llama-cli and llama-server to enable testing of the
GPU sampling features. Even though the implementation might still change,
perhaps significantly, it was valuable to implement that support to see how
this would work, and it uncovered some issues that the tests missed.

The pull request can be found here:
https://github.com/ggml-org/llama.cpp/pull/17004

----

The sections below contain some notes taken during the initial design and
exploration of GPU sampling in llama.cpp and are not really relevant anymore.

### Suggested approach
One way could be to store the tensors in a struct like this:
```c++