Commit 08fb027

docs: update llama-server static file generation details
1 parent 68405ed commit 08fb027

1 file changed: +100 −24

notes/llama.cpp/llama-server.md

Lines changed: 100 additions & 24 deletions
@@ -685,8 +685,8 @@ In server.cpp we have the following code:
 });
 ```

-Now, `index_html_gz` gzipped file in `examples/server/public` which is built
-by `examples/server/webui/package.json`:
+Now, `index_html_gz` is a gzipped file in `tools/server/public` which is built
+by `tools/server/webui/package.json`:
 ```console
   "scripts": {
     "dev": "vite",
@@ -698,25 +698,48 @@ by `examples/server/webui/package.json`:
 We can inspect the vite configuration which is in `vite.config.js`:
 ```js
 ...
-writeBundle() {
-  const outputIndexHtml = path.join(config.build.outDir, 'index.html');
-  const content = GUIDE_FOR_FRONTEND + '\n' + fs.readFileSync(outputIndexHtml, 'utf-8');
-  const compressed = zlib.gzipSync(Buffer.from(content, 'utf-8'), { level: 9 });
-
-  // because gzip header contains machine-specific info, we must remove these data from the header
-  // timestamp
-  compressed[0x4] = 0;
-  compressed[0x5] = 0;
-  compressed[0x6] = 0;
-  compressed[0x7] = 0;
-  // OS
-  compressed[0x9] = 0;
-```
-This is reading the `webui/index.html` file and prepending `GUIDE_FOR_FRONTEND`
-warning to it. This is then gzipped and the timestamp and OS fields are zeroed
-out.
+llamaCppBuildPlugin() {
+  ...
+  try {
+    const indexPath = resolve('../public/index.html');
+    const gzipPath = resolve('../public/index.html.gz');
+
+    if (!existsSync(indexPath)) {
+      return;
+    }
+
+    let content = readFileSync(indexPath, 'utf-8');
+
+    const faviconPath = resolve('static/favicon.svg');
+    if (existsSync(faviconPath)) {
+      const faviconContent = readFileSync(faviconPath, 'utf-8');
+      const faviconBase64 = Buffer.from(faviconContent).toString('base64');
+      const faviconDataUrl = `data:image/svg+xml;base64,${faviconBase64}`;
+
+      content = content.replace(/href="[^"]*favicon\.svg"/g, `href="${faviconDataUrl}"`);
+
+      console.log('✓ Inlined favicon.svg as base64 data URL');
+    }
+
+    content = content.replace(/\r/g, '');
+    content = GUIDE_FOR_FRONTEND + '\n' + content;
+
+    const compressed = fflate.gzipSync(Buffer.from(content, 'utf-8'), { level: 9 });
+
+    // because gzip header contains machine-specific info, we must remove these data from the header
+    // timestamp
+    compressed[0x4] = 0;
+    compressed[0x5] = 0;
+    compressed[0x6] = 0;
+    compressed[0x7] = 0;
+    compressed[0x9] = 0;
+```
+This reads the `public/index.html` file, inlines the favicon, prepends the
+`GUIDE_FOR_FRONTEND` warning, and then gzips the result with the timestamp and
+OS header fields zeroed out.
+
 So when we run `npm run build` in the `webui` directory, the `index.html` file
-is built and gzipped and the resulting `index.html.gz` file is copied to the
+is built and gzipped and the resulting `index.html.gz` file is placed in the
+public directory.

 And then when we build `llama-server` using cmake we can see the following
 in `examples/server/CMakeLists.txt`:
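Before the CMake snippet below, a short aside on the header bytes being zeroed
above: per RFC 1952, bytes 0x4–0x7 of a gzip stream hold the MTIME field and
byte 0x9 the OS id, which is why clearing them makes `index.html.gz`
byte-identical across machines. A small standalone check (a sketch only, not
part of the build) that prints those fields:
```c++
// Sketch only: read index.html.gz and print the gzip header fields that the
// vite plugin zeroes out. Offsets follow RFC 1952: bytes 0-1 magic (1f 8b),
// 2 compression method, 3 flags, 4-7 MTIME (little endian), 8 XFL, 9 OS.
#include <cstdint>
#include <cstdio>
#include <fstream>

int main() {
    std::ifstream in("index.html.gz", std::ios::binary);
    unsigned char hdr[10] = {};
    in.read(reinterpret_cast<char *>(hdr), sizeof(hdr));

    uint32_t mtime = (uint32_t) hdr[4]
                   | ((uint32_t) hdr[5] << 8)
                   | ((uint32_t) hdr[6] << 16)
                   | ((uint32_t) hdr[7] << 24);

    std::printf("magic: %02x %02x\n", (unsigned) hdr[0], (unsigned) hdr[1]);
    std::printf("mtime: %u (0 means no timestamp stored)\n", (unsigned) mtime);
    std::printf("os   : %u (zeroed so the output is reproducible)\n", (unsigned) hdr[9]);
    return 0;
}
```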
@@ -738,12 +761,12 @@ foreach(asset ${PUBLIC_ASSETS})
     set_source_files_properties(${output} PROPERTIES GENERATED TRUE)
 endforeach()
 ```
-Notice that this is actually generateing a `.hpp` file from the `.gz` file:
+Notice that this is actually generating a `.hpp` file from the `.gz` file:
 ```console
 /home/danbev/work/ai/llama.cpp-debug/build/examples/server/index.html.gz.hpp
 ```

-Now, this is passed to the script `xxd.cmake`:
+This is passed to the script `xxd.cmake`:
 ```
 # CMake equivalent of `xxd -i ${INPUT} ${OUTPUT}`
 ```
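The effect of `xxd -i` (and hence of the `xxd.cmake` script) is simply to turn
a binary file into a C array plus a length variable. A rough standalone
illustration of that transformation (not the actual script; the file and symbol
names are only for the example):
```c++
// Illustration only: roughly what `xxd -i index.html.gz index.html.gz.hpp`
// (and xxd.cmake) produces: a C array holding the file bytes and an
// unsigned int holding the length.
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

int main() {
    std::ifstream in("index.html.gz", std::ios::binary);
    std::vector<unsigned char> data((std::istreambuf_iterator<char>(in)),
                                     std::istreambuf_iterator<char>());

    std::ofstream out("index.html.gz.hpp");
    out << "unsigned char index_html_gz[] = {";
    for (size_t i = 0; i < data.size(); i++) {
        if (i % 12 == 0) {
            out << "\n  ";
        }
        char byte[8];
        std::snprintf(byte, sizeof(byte), "0x%02x,", data[i]);
        out << byte;
    }
    out << "\n};\nunsigned int index_html_gz_len = " << data.size() << ";\n";
    return 0;
}
```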
@@ -755,6 +778,7 @@ If we look in includes in server.cpp we find:
 #include "index.html.gz.hpp"
 ```

+And in build/tools/server/index.html.gz.hpp we find:
 ```cpp
 unsigned char index_html_gz[] = {0x1f,0x8b,...

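The `0x1f,0x8b` at the start of the array are the gzip magic bytes, so the
embedded data can be treated like any other gzip stream. A small sanity-check
sketch (not part of llama-server) that inflates it with zlib, assuming the
generated `index.html.gz.hpp` is on the include path:
```c++
// Sketch only: verify that the embedded index_html_gz array is a valid gzip
// stream by inflating it with zlib.
#include <cstdio>
#include <vector>
#include <zlib.h>

#include "index.html.gz.hpp" // provides index_html_gz and index_html_gz_len

int main() {
    std::printf("magic: %02x %02x (a gzip stream starts with 1f 8b)\n",
                (unsigned) index_html_gz[0], (unsigned) index_html_gz[1]);

    std::vector<unsigned char> out(8 * 1024 * 1024);

    z_stream zs = {};
    zs.next_in   = index_html_gz;
    zs.avail_in  = index_html_gz_len;
    zs.next_out  = out.data();
    zs.avail_out = (uInt) out.size();

    // 16 + MAX_WBITS tells zlib to expect a gzip wrapper rather than raw deflate.
    if (inflateInit2(&zs, 16 + MAX_WBITS) != Z_OK) {
        return 1;
    }
    int ret = inflate(&zs, Z_FINISH);
    inflateEnd(&zs);

    std::printf("inflate: %s, %lu bytes of HTML\n",
                ret == Z_STREAM_END ? "ok" : "failed", (unsigned long) zs.total_out);
    return 0;
}
```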
@@ -765,5 +789,57 @@ And this is how the `index.html.gz` file is included in the server:
 res.set_content(reinterpret_cast<const char*>(index_html_gz), index_html_gz_len, "text/html; charset=utf-8");
 ```

-### Slots
-This section aims to explain what slots are in the context of llama-server.
+### GPU Sampling with llama-server
+
+Currently the GPU sampling works in a similar manner to how pooling works: it
+is an optional function that is called in build_graph:
+```c++
+// add GPU sampling layers (if any)
+llm->build_sampling(*this, params);
+```
+GPU samplers can be configured by creating sampler chains, where each sampler
+chain is associated with a specific sequence id:
+```c++
+struct llama_sampler_chain_params params = llama_sampler_chain_default_params();
+struct llama_sampler * chain = llama_sampler_chain_init(params);
+llama_sampler_chain_add(chain, llama_sampler_gpu_init_greedy());
+std::vector<llama_sampler_seq_config> sampler_configs = {
+    { 0, chain }
+};
+```
+The struct is defined as:
+```c++
+struct llama_sampler_seq_config {
+    llama_seq_id seq_id;
+    struct llama_sampler * sampler;
+};
+```
+And these sampler configs are then passed in as context params:
+```c++
+llama_context_params cparams = llama_context_default_params();
+cparams.samplers = sampler_configs.data();
+cparams.n_samplers = sampler_configs.size();
+```
+When the graph is built, the configured samplers will be added to the
+computation graph and be part of the computed graph. This is done in the
+sampler's `_apply` function, which allows it to add operations/nodes to the
+computation graph.
+
+This enables the sampling to happen fully, or partially, on the GPU. The
+samplers could sample a single token, in which case that is all that will be
+transferred from device memory to host memory after llama_decode has been
+called. The sampled token can then be retrieved using:
+```c++
+llama_token id = llama_get_sampled_token_ith(test_ctx.ctx, index);
+```
+
+It is also possible to run a GPU sampler that only filters the logits, in which
+case only the filtered logits are transferred back to the host and the sampling
+can proceed on the CPU with the normal (CPU) sampler chain. In this case one
+configures the CPU samplers as usual, but they will now operate on already
+filtered logits.
+
+Similar to the above with logits, it is possible for a GPU sampler to compute
+the full probability distribution and transfer that to the host. And similar
+to the logits filtering, the CPU samplers can then operate on the full
+probability distribution.
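To tie the pieces above together, here is a rough end-to-end sketch of GPU-side
greedy sampling for a single sequence. It assumes the GPU sampling API described
in these notes (`llama_sampler_gpu_init_greedy`, `llama_sampler_seq_config`, the
`samplers`/`n_samplers` context params and `llama_get_sampled_token_ith`), which
is not part of the upstream llama.cpp API; the remaining calls are the standard
llama.cpp context/decode functions.
```c++
// Sketch only: GPU greedy sampling for sequence 0, reading the sampled token
// back on the host. The llama_sampler_gpu_* / llama_sampler_seq_config /
// llama_get_sampled_token_ith parts follow the API described in these notes
// and may differ from (or not exist in) upstream llama.cpp.
#include "llama.h"

static llama_token decode_and_sample(llama_model * model, llama_token * prompt, int32_t n_prompt) {
    // GPU sampler chain that will be associated with sequence id 0.
    llama_sampler_chain_params sparams = llama_sampler_chain_default_params();
    llama_sampler * chain = llama_sampler_chain_init(sparams);
    llama_sampler_chain_add(chain, llama_sampler_gpu_init_greedy());

    llama_sampler_seq_config seq_configs[] = {
        { /*seq_id=*/ 0, chain },
    };

    llama_context_params cparams = llama_context_default_params();
    cparams.samplers   = seq_configs;
    cparams.n_samplers = 1;

    llama_context * ctx = llama_init_from_model(model, cparams);

    // Decode the prompt; the sampler nodes run as part of the same compute graph.
    llama_batch batch = llama_batch_get_one(prompt, n_prompt);
    llama_decode(ctx, batch);

    // Only the sampled token id needs to be copied from device to host
    // (assuming -1 addresses the last output, as with llama_get_logits_ith).
    llama_token id = llama_get_sampled_token_ith(ctx, -1);

    llama_free(ctx);
    return id;
}
```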

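For the filtered-logits variant, the CPU side stays ordinary llama.cpp sampling.
Assuming the context was created with a GPU sampler chain that only filters
(rather than fully samples) the logits for the sequence, a CPU chain can finish
the job with the standard sampler API:
```c++
// Sketch only: the CPU sampler chain is configured as usual and is applied to
// logits that (per the notes above) have already been filtered on the GPU.
// These are the standard llama.cpp CPU sampling calls.
#include "llama.h"

static llama_token sample_on_cpu(llama_context * ctx) {
    llama_sampler_chain_params sparams = llama_sampler_chain_default_params();
    llama_sampler * cpu_chain = llama_sampler_chain_init(sparams);
    llama_sampler_chain_add(cpu_chain, llama_sampler_init_temp(0.8f));
    llama_sampler_chain_add(cpu_chain, llama_sampler_init_dist(1234));

    // Applies the chain to the logits of the last output (-1) and returns the
    // accepted token; the logits it sees are whatever the GPU sampler left in place.
    llama_token id = llama_sampler_sample(cpu_chain, ctx, -1);

    llama_sampler_free(cpu_chain);
    return id;
}
```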