Commit 54f381d (1 parent 08fb027)

docs: add some short notes about current GPU sampling impl


notes/llama.cpp/gpu-sampling.md

@@ -61,6 +61,118 @@ any way (like also if the size of the array is modified).
To be processed by a GGML graph, the information in `cur_p` needs to be
in the form of a `ggml_tensor` (or multiple).

### Current implementation
In contrast to CPU sampling, where the sampling operations are performed after
the model's graph has been executed, GPU sampling is part of the same execution
graph. All of the sampling can be done on the GPU, or parts of it can be done on
the GPU and the rest on the CPU.

#### Configuration of GPU samplers
GPU samplers are configured before the context is created, and a GPU sampler
can be configured per sequence:
```c++
struct llama_sampler_chain_params params = llama_sampler_chain_default_params();
struct llama_sampler * gpu_sampler_chain = llama_sampler_chain_init(params);

llama_sampler_chain_add(gpu_sampler_chain, llama_sampler_gpu_init_greedy());

std::vector<llama_sampler_seq_config> gpu_sampler_configs = {
    { 0, gpu_sampler_chain }
};
```
The above shows only one sampler, but multiple samplers can be added to the
gpu_sampler_configs vector.
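
As a sketch of what a multi-sequence setup could look like, each sequence can get
its own chain. Note that this is only an illustration: `llama_sampler_gpu_init_temp`
and the second chain are assumptions here, only the greedy sampler is shown above.
```c++
// Hypothetical sketch: one GPU sampler chain per sequence.
// llama_sampler_gpu_init_temp() is an assumed name based on the temp sampler
// mentioned later in these notes; only llama_sampler_gpu_init_greedy() is shown above.
struct llama_sampler_chain_params params = llama_sampler_chain_default_params();

struct llama_sampler * chain_seq0 = llama_sampler_chain_init(params);
llama_sampler_chain_add(chain_seq0, llama_sampler_gpu_init_greedy());

struct llama_sampler * chain_seq1 = llama_sampler_chain_init(params);
llama_sampler_chain_add(chain_seq1, llama_sampler_gpu_init_temp(0.8f));

std::vector<llama_sampler_seq_config> gpu_sampler_configs = {
    { 0, chain_seq0 },   // sequence id 0 -> greedy chain
    { 1, chain_seq1 },   // sequence id 1 -> temperature chain
};
```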

These samplers are then passed into the context parameters when creating the
context:
```c++
llama_context_params cparams = llama_context_default_params();
cparams.samplers   = gpu_sampler_configs.data();
cparams.n_samplers = gpu_sampler_configs.size();

ctx = llama_init_from_model(model, cparams);
```

When the model graph is built, the GPU samplers are called so that they can
add their operations to the graph:
```c++
ggml_cgraph * llama_model::build_graph(const llm_graph_params & params) const {
    std::unique_ptr<llm_graph_context> llm;
    ...

    // add GPU sampling layers (if any)
    llm->build_sampling(*this, params);
```

The llama_sampler_i interface has been extended with 4 new methods in the API,
and they are currently all named with a `_ggml` suffix to indicate that they
are for GPU sampling:
```c++
void (*apply_ggml)(         struct llama_sampler * smpl,
                            ggml_context * ctx,
                            ggml_cgraph * gf,
                            llama_sampler_ggml_data * ggml_data);

void (*accept_ggml)(        struct llama_sampler * smpl,
                            ggml_context * ctx,
                            ggml_cgraph * gf,
                            struct ggml_tensor * selected_token);

void (*set_input_ggml)(     struct llama_sampler * smpl,
                            ggml_context * ctx,
                            ggml_cgraph * gf);

void (*set_backend_context)(struct llama_sampler * smpl,
                            ggml_backend_sched_t sched,
                            ggml_backend_t backend);
```

The set_backend_context function is used to let the GPU sampler know which
backend the tensors that it creates/uses should be created on. This is important
so that we avoid splits in the computation graph that would require data transfer
between different backends.
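
As a rough sketch of what a sampler might do with this call, it could simply store
the two handles in its context struct so that `apply_ggml` can use them later. The
struct definition below is an assumption, the actual PR code may lay it out differently.
```c++
// Sketch only: the exact layout of llama_sampler_gpu_greedy_ctx is assumed here.
struct llama_sampler_gpu_greedy_ctx {
    ggml_backend_sched_t sched;
    ggml_backend_t       backend;
};

static void llama_sampler_gpu_greedy_set_backend_context(
        struct llama_sampler * smpl,
        ggml_backend_sched_t   sched,
        ggml_backend_t         backend) {
    auto * sctx = (llama_sampler_gpu_greedy_ctx *) smpl->ctx;
    // remember where sampling tensors should live so that apply_ggml can pin
    // them to this backend and avoid graph splits
    sctx->sched   = sched;
    sctx->backend = backend;
}
```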

apply_ggml is where the GPU sampler adds its operations to the graph. For
example, the greedy sampler will select the token with the highest probability:
```c++
static void llama_sampler_gpu_greedy_apply_ggml(
        struct llama_sampler * smpl,
        struct ggml_context * ctx,
        struct ggml_cgraph * gf,
        struct llama_sampler_ggml_data * ggml_data) {
    (void) gf;
    auto * sctx = (llama_sampler_gpu_greedy_ctx *) smpl->ctx;

    struct ggml_tensor * argmax_result = ggml_argmax(ctx, ggml_data->logits);
    ggml_set_name(argmax_result, "argmax_result");
    ggml_backend_sched_set_tensor_backend(sctx->sched, argmax_result, sctx->backend);
    ggml_data->sampled_token = argmax_result;
}
```
And here we also see the usage of the scheduler and backend to ensure that the
tensor is created on the correct backend.
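
For comparison, a sampler that only transforms the logits would not need to select a
token at all. A minimal temperature-style `apply_ggml` might look something like the
sketch below; the `llama_sampler_gpu_temp_ctx` struct and its `temp` field are
assumptions, not the actual PR code.
```c++
// Sketch of a logits-transforming sampler; llama_sampler_gpu_temp_ctx and its
// fields are assumed for illustration.
static void llama_sampler_gpu_temp_apply_ggml(
        struct llama_sampler * smpl,
        struct ggml_context * ctx,
        struct ggml_cgraph * gf,
        struct llama_sampler_ggml_data * ggml_data) {
    (void) gf;
    auto * sctx = (llama_sampler_gpu_temp_ctx *) smpl->ctx;

    // scale the logits by 1/temperature; later samplers in the chain
    // (or the final selection) operate on the scaled tensor
    struct ggml_tensor * scaled = ggml_scale(ctx, ggml_data->logits, 1.0f / sctx->temp);
    ggml_set_name(scaled, "temp_scaled_logits");
    ggml_data->logits = scaled;
}
```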

accept_ggml is called after the GPU graph has been executed, allowing the GPU
sampler to accept the selected token and update its state. Note that currently
no GPU samplers maintain any state in this way, so this is something that needs
more work.
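
If a sampler did want to track state, one possible shape for `accept_ggml` would be
to read the sampled token back from the backend and record it. This is purely
illustrative and not something the current PR implements; the example context struct
and its `prev` member are assumptions.
```c++
// Purely illustrative: no GPU sampler currently keeps state like this.
// Assumes the sampler ctx has a std::vector<llama_token> prev member.
static void llama_sampler_gpu_example_accept_ggml(
        struct llama_sampler * smpl,
        ggml_context * ctx,
        ggml_cgraph * gf,
        struct ggml_tensor * selected_token) {
    (void) ctx;
    (void) gf;
    auto * sctx = (llama_sampler_gpu_example_ctx *) smpl->ctx;

    // the graph has already been executed, so the tensor holds the result;
    // read the sampled token id back from the backend (argmax produces an I32 tensor)
    int32_t token = 0;
    ggml_backend_tensor_get(selected_token, &token, 0, sizeof(token));
    sctx->prev.push_back((llama_token) token);
}
```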

set_input_ggml is called after the computation graph has been scheduled but before
it is computed. This allows the GPU sampler to set any input. This is currently
used by the temp sampler to set a random number tensor that is used for sampling.
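
A sketch of how that could work: the sampler keeps a pointer to the random-input
tensor it created in `apply_ggml` and writes a fresh value into it once the tensor
has been allocated. The context fields (`rnd_input`, `rng`) are assumptions; the
actual temp sampler implementation may differ.
```c++
// Sketch only: assumes apply_ggml stored the random-input tensor and a
// std::mt19937 rng in the sampler ctx (requires <random>).
static void llama_sampler_gpu_temp_set_input_ggml(
        struct llama_sampler * smpl,
        ggml_context * ctx,
        ggml_cgraph * gf) {
    (void) ctx;
    (void) gf;
    auto * sctx = (llama_sampler_gpu_temp_ctx *) smpl->ctx;

    // after scheduling the tensor has a backend buffer, so it can be written to
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    float r = dist(sctx->rng);
    ggml_backend_tensor_set(sctx->rnd_input, &r, 0, sizeof(r));
}
```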

Support has been added to llama-cli and llama-server to enable testing of the GPU
sampling features. Even though the implementation might still change, perhaps
significantly, it was valuable to implement that support to see how this would
work, and it uncovered some issues that the tests missed.

The pull request can be found here:
https://github.com/ggml-org/llama.cpp/pull/17004

----

The sections below contain some notes taken during the initial design and
exploration of GPU sampling in llama.cpp and are not really relevant anymore.

### Suggested approach
One way could be to store the tensors in a struct like this:
```c++
