@@ -61,180 +61,86 @@ any way (like also if the size of the array is modified).
6161To be processed by a GGML graph, the data in `cur_p` needs to be
6262in the form of a `ggml_tensor` (or multiple).
6363
64+ ### Suggested approach
6465One way could be to store the tensors in a struct like this:
6566``` c++
6667 struct llama_sampler_ggml_data {
67- struct ggml_tensor * ids; // [ n_vocab] - GGML_TYPE_I32
68- struct ggml_tensor * logits; // [ n_vocab] - GGML_TYPE_F32
69- struct ggml_tensor * probs; // [ n_vocab] - GGML_TYPE_F32
70- int64_t size; // number of valid tokens in the arrays (<= n_vocab)
71- int64_t selected ; // index in the array (-1 if not yet selected )
72- bool sorted; // whether data is sorted by logits/probs
68+     struct ggml_tensor * ids;      // [n_vocab] - GGML_TYPE_I32
69+     struct ggml_tensor * logits;   // [n_vocab] - GGML_TYPE_F32
70+     struct ggml_tensor * probs;    // [n_vocab] - GGML_TYPE_F32
71+     struct ggml_tensor * selected; // [1]       - GGML_TYPE_I32 index in the array (-1 if not yet selected)
72+     struct ggml_tensor * size;     // [1]       - GGML_TYPE_I32 number of valid tokens in the arrays (<= n_vocab)
73+     struct ggml_tensor * sorted;   // [1]       - GGML_TYPE_I32 whether data is sorted by logits/probs
7374 };
7576```
76- The token ids can be stored as `I32` type tensors, and the logits and probabilities as `F32`.
77- Having separate tensors instead of perhaps packing data into a single tensor makes enables
78- easier operations on all of the logits or probabilities. Also if the types are different
79- this might also make sense to have them separate and not packed as well.
77+ The token ids can be stored as `I32` type tensors, and the logits and
78+ probabilities as `F32`.
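
As a rough sketch (assuming a `ggml_context * ctx` and an `n_vocab` value obtained from the vocabulary, e.g. via `llama_vocab_n_tokens`), the tensors could be created like this:
```c++
// Sketch only: allocate the tensors described by llama_sampler_ggml_data.
// ctx is an existing ggml_context and n_vocab the vocabulary size (assumed here).
llama_sampler_ggml_data data;
data.ids      = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_vocab);
data.logits   = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_vocab);
data.probs    = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_vocab);
data.selected = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 1);
data.size     = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 1);
data.sorted   = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 1);
```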
8079
81- This would allow a function declaration to look something like this:
80+ This would allow function declarations to look something like this:
8281```c++
8382 void (*apply_ggml)( struct llama_sampler * smpl,
8483 ggml_context * ctx,
8584 ggml_cgraph * gf,
8685 llama_sampler_ggml_data * ggml_data);
86+
87+ void (*accept_ggml)( struct llama_sampler * smpl,
88+ ggml_context * ctx,
89+ ggml_cgraph * gf,
90+ struct ggml_tensor * selected_token);
8791```
8892This way multiple GPU samplers can be chained together and they can all update
8993the graph with the operations they need to perform.
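
For example, a minimal sketch of what a single GPU sampler's `apply_ggml` could look like; the temperature sampler shown here (its name, context struct and field are assumptions, not existing llama.cpp code) only rescales the logits and registers the result in the graph:
```c++
// Hypothetical temperature sampler: scales the logits as part of the graph.
struct llama_sampler_gpu_temp_ctx {
    float temp;
};

static void llama_sampler_gpu_temp_apply_ggml(
        struct llama_sampler    * smpl,
        ggml_context            * ctx,
        ggml_cgraph             * gf,
        llama_sampler_ggml_data * ggml_data) {
    const auto * sctx = (const llama_sampler_gpu_temp_ctx *) smpl->ctx;

    // divide the logits by the temperature and make the scaled tensor the new logits
    ggml_data->logits = ggml_scale(ctx, ggml_data->logits, 1.0f / sctx->temp);
    ggml_build_forward_expand(gf, ggml_data->logits);
}
```
A top-k or dist sampler would add its own operations in the same way, each one consuming the tensors left in `llama_sampler_ggml_data` by the previous sampler in the chain.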
9094
91- And ` llama_sampler_chain ` would then apply all the samplers calling ` apply_ggml ` for
92- each sampler in the chain, passing in a ` ggml_cgraph ` that is built up with all the
93- operations from each sampler.
94-
95- While we want to avoid intermixing CPU and GPU samplers in the samplers chain, as this
96- would require converting and copying data between system memory to device memory, we should
97- support having GPU samplers at the start of the sampling chain. This way we can take
98- advantage of the logits already being on the GPU and perform some of the sampling
99- operations on the GPU before copying the data back to the CPU for any CPU samplers
100- to process later in the chain.
101-
95+ The GPU sampling operations are performed in a similar manner to how pooling
96+ is currently applied. To enable this, a `llama_sampler` has been added to
97+ `llama_context_params`:
10298``` c++
103- llama_token llama_sampler_sample (struct llama_sampler * smpl, struct llama_context * ctx, int32_t idx) {
104- const auto * logits = llama_get_logits_ith(ctx, idx);
105-
106- const llama_model * model = llama_get_model(ctx);
107- const llama_vocab * vocab = llama_model_get_vocab(model);
108-
109- const int n_vocab = llama_vocab_n_tokens(vocab);
110- ...
111-
112- // check the smpl chain and store the GPU samplers in a vector
113- // Process the GPU samplers and afterwards create a llama_token_data_array
114- // which can then be passed to the remaining CPU samplers in the chain.
115- }
116- ```
117-
118- ### GPU Sampler parameters and state
119- Some samplers need to be able to accept parameters and also maintain state.
120- The tensors for the parameters and state need to be pre allocated made accessible
121- to the samplers.
122-
123- If we take a top-k sampler as an example. This sampler needs to be initialized with
124- the 'k' value. This is possible in much the same way as the CPU implementation:
12599```c++
126- static struct llama_sampler * llama_sampler_gpu_init_top_k(int32_t k) {
127- static const llama_sampler_i iface = {
128- /*.name =*/ llama_sampler_gpu_top_k_name,
129- /*.accept =*/ nullptr,
130- /*.apply =*/ nullptr,
131- /*.reset =*/ nullptr,
132- /*.clone =*/ nullptr,
133- /*.free =*/ llama_sampler_gpu_top_k_free,
134- /*.apply_ggml =*/ llama_sampler_gpu_top_k_apply_ggml
135- };
136-
137- auto * ctx_data = new llama_sampler_gpu_top_k_ctx {
138- /*.k =*/ k,
139- };
100+ struct llama_sampler_chain_params gpu_sampler_params = llama_sampler_chain_default_params();
101+ struct llama_sampler * gpu_sampler_chain = llama_sampler_chain_init(gpu_sampler_params);
102+ llama_sampler_chain_add(gpu_sampler_chain, llama_sampler_gpu_init_greedy());
140103
141- auto * sampler = new llama_sampler {
142- /*.iface =*/ &iface,
143- /*.ctx =*/ ctx_data,
144- };
104+ llama_context_params cparams = llama_context_default_params();
105+ cparams.sampler = gpu_sampler_chain; // attach the GPU sampler chain to the context
145106
146- return sampler;
147- }
107+ auto ctx = llama_init_from_model(model, cparams);
148108```
149- Currently, the GPU sampler are initialized before calling ` llama_sampler_sample ` so the
150- above works fine. In llama_sampler_sample is where we currently have the GPU sampler
151- processsing:
109+ When the model's graph is built, we will then also build the sampling graph:
152110```c++
153- llama_token llama_sampler_sample (struct llama_sampler * smpl, struct llama_context * ctx, int32_t idx) {
111+ ggml_cgraph * llama_model::build_graph(const llm_graph_params & params) const {
112+ std::unique_ptr<llm_graph_context> llm;
154113 ...
155- struct ggml_init_params params = {
156- // TODO: need to take into account any tensors that GPU sampler may need.
157- /* .mem_size =* / (ggml_tensor_overhead() * 5) + GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead(),
158- /* .mem_buffer =* / nullptr,
159- /* .no_alloc =* / true,
160- };
161- struct ggml_context * ctx_sample = ggml_init(params);
162-
163- struct ggml_tensor * logits_t = ggml_new_tensor_1d(ctx_sample, GGML_TYPE_F32, n_vocab);
164- struct ggml_tensor * ids = ggml_new_tensor_1d(ctx_sample, GGML_TYPE_I32, n_vocab);
165- struct ggml_tensor * probs = ggml_new_tensor_1d(ctx_sample, GGML_TYPE_F32, n_vocab);
166- struct ggml_tensor * selected = ggml_new_tensor_1d(ctx_sample, GGML_TYPE_I32, 1);
167-
168- // Select a GPU backend.
169- // TODO: perhaps this should be configurable as to which GPU to use
170- ggml_backend_t backend = nullptr;
171- ggml_backend_buffer_type_t buft = nullptr;
172- for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
173- auto * dev = ggml_backend_dev_get(i);
174- if (ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_GPU) {
175- backend = ggml_backend_dev_init(dev, nullptr);
176- buft = ggml_backend_dev_buffer_type(dev);
177- printf("Using GPU device '%s' for sampling\n", ggml_backend_dev_name(dev));
178- break;
179- }
180- }
181- ...
182-
183- struct ggml_cgraph * gf = ggml_new_graph(ctx_sample);
184-
185- struct llama_sampler_ggml_data ggml_data = {
186- /*.ids =*/ ids,
187- /*.logits =*/ logits_t,
188- /*.probs =*/ probs,
189- /*.selected =*/ selected,
190- /*.size =*/ n_vocab,
191- /*.sorted =*/ false,
192- };
193-
194- // Apply GPU samplers (add sampling operations to the graph)
195- for (auto & smpl : gpu_samplers) {
196- smpl.iface->apply_ggml(&smpl, ctx_sample, gf, &ggml_data);
197- }
198114
199- ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors_from_buft(ctx_sample, buft);
200- ...
115+ // add on pooling layer
116+ llm->build_pooling(cls, cls_b, cls_out, cls_out_b);
117+
118+ // add GPU sampling layers (if any)
119+ llm->build_sampling(*this);
120+ ...
201121}
202122```
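
As a sketch of what `build_sampling` might do (the member names `ctx0` and `gf`, and the way the sampler chain is reached, are assumptions here, mirroring how `build_pooling` is wired up), it could simply walk the configured chain and let each GPU sampler append its operations:
```c++
// Sketch only: walk the GPU sampler chain and let each sampler extend the graph.
void llm_graph_context::build_sampling(const llama_model & model) const {
    llama_sampler * chain = cparams.sampler; // assumed: the chain set on the context params
    if (chain == nullptr) {
        return;
    }

    // tensors holding ids/logits/probs etc. would be created here (omitted)
    llama_sampler_ggml_data ggml_data = {};

    for (int i = 0; i < llama_sampler_chain_n(chain); ++i) {
        auto * s = llama_sampler_chain_get(chain, i);
        if (s->iface->apply_ggml) {
            s->iface->apply_ggml(s, ctx0, gf, &ggml_data);
        }
    }
}
```
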
203- A GPU sampler can create the tensors it needs in its apply_ggml function. But notice that the
204- graph is passed to this function, and we have to specify the memory size for the ggml_context
205- before this call happens.
206- So how do we know how much memory to allocate for the samplers tensors?
207- Perhaps adding a callback for the gpu samplers be accaptable where the size of memory needed
208- by a sampler is returned? Something like:
123+ All the samplers that have been configured will be applied. What a sampler
124+ actually does depends on its type: it may just modify the logits (like a
125+ temperature sampler), filter them (like a top-k sampler), or compute
126+ probabilities (like a dist sampler). A sampler can also select a token
127+ directly (like a greedy sampler), in which case the CPU samplers can be
128+ skipped entirely and only GPU samplers are run.
129+
130+ To get the probabilities generated by the chain:
209131``` c++
210- size_t (*size_ggml)(const struct llama_sampler * smpl );
132+ float * probs = llama_get_sampled_probs(ctx);
211133```
212- This could then be called when we gather the GPU samplers :
134+ To get the selected token:
213135``` c++
214- std::vector<llama_sampler> gpu_samplers;
215- size_t gpu_samplers_ggml_size = 0 ;
216- if (smpl->iface->name && strcmp(smpl->iface->name (smpl), "chain") == 0) {
217- for (int i = 0; i < llama_sampler_chain_n(smpl); i++) {
218- auto * s = llama_sampler_chain_get(smpl, i);
219- if (s->iface->apply_ggml) {
220- gpu_samplers.push_back(* s);
221- gpu_samplers_ggml_size += s->iface->size_ggml(s);
222- }
223- }
224- ```
225- We can then use this later when creating the context parameters:
226- ```c++
227- size_t total_ggml_size = gpu_samplers_ggml_size + (ggml_tensor_overhead() * 5) + GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead();
228- printf("Total ggml size for GPU samplers: %zu bytes\n", total_ggml_size);
229- struct ggml_init_params params = {
230- // TODO: need to take into account any tensors that GPU sampler may need.
231- /*.mem_size =*/ total_ggml_size,
232- /*.mem_buffer =*/ nullptr,
233- /*.no_alloc =*/ true,
234- };
235- struct ggml_context * ctx_sample = ggml_init(params);
136+ llama_token id = llama_get_sampled_token(ctx);
236137```
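
Putting the two accessors together, usage from application code could look roughly like this (a sketch assuming the proposed `llama_get_sampled_token`/`llama_get_sampled_probs` functions and a `std::vector<llama_token> tokens` holding the prompt):
```c++
// Sketch: decode a batch and read back the result of the GPU sampling chain.
llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());
if (llama_decode(ctx, batch) != 0) {
    // handle error
}

// token selected on the GPU, e.g. by a greedy/dist sampler at the end of the chain
llama_token id = llama_get_sampled_token(ctx);

// or inspect the probabilities produced by the chain
const float * probs = llama_get_sampled_probs(ctx);
```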
237138
139+ It is also possible to run CPU samplers after the normal llama_decode call,
140+ operating on the logits just like before. This opens up the possibility of
141+ mixing GPU and CPU samplers, for example running temperature scaling and
142+ top-k on the GPU and then dist or greedy on the CPU.
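
For example, a mixed configuration could look roughly like this (the `llama_sampler_gpu_init_*` constructors are the proposed GPU-side counterparts and are assumptions here; the CPU-side samplers are the existing ones):
```c++
// GPU part: temperature + top-k run as part of the decode graph (proposed API).
struct llama_sampler * gpu_chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
llama_sampler_chain_add(gpu_chain, llama_sampler_gpu_init_temp(0.8f)); // assumed constructor
llama_sampler_chain_add(gpu_chain, llama_sampler_gpu_init_top_k(40));  // assumed constructor

llama_context_params cparams = llama_context_default_params();
cparams.sampler = gpu_chain; // proposed field

struct llama_context * ctx = llama_init_from_model(model, cparams);

// CPU part: dist sampler operating on the already filtered logits after llama_decode.
struct llama_sampler * cpu_chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
llama_sampler_chain_add(cpu_chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

// ... llama_decode(ctx, batch) ...
llama_token id = llama_sampler_sample(cpu_chain, ctx, -1);
```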
143+
238144### GPU Sampler state
239145This was brought up in the feedback and is something that we need to consider. The
240146parameters to the GPU samplers can work as they currently do for CPU samplers and
@@ -276,14 +182,6 @@ This way the tensors will have a fixed tensor size.
276182* The conversion from `llama_token_data_array` to `llama_sampler_ggml_data` should not
277183  be performed when we are using only GPU samplers. Otherwise it would incur
278184  significant data transfer that would negate the benefits of GPU sampling.
279- I've updated the suggestion above and we can check the samplers in the chain
280- and if they are all GPU samplers then we can avoid creating the
281- llama_token_data_array altogether.
282-
283- * Originally I have specified that either GPU samplers or CPU samplers would be
284- used in a sampling chain. But there was a suggestion to allow mixing them in the
285- sense that the GPU samplers would be at the start of the chain and CPU samplers
286- after them.
287185
288186* It was also brought up that CPU samplers may require additional tensors for storing
289187  parameters and state. These tensors need to be preallocated and made available to the