@@ -685,8 +685,8 @@ In server.cpp we have the following code:
685685 });
686686```
687687
688- Now, `index_html_gz` gzipped file in `examples/server/public` which is built
689- by `examples/server/webui/package.json`:
688+ Now, `index_html_gz` is a gzipped file in `tools/server/public` which is built
689+ by `tools/server/webui/package.json`:
690690```console
691691 "scripts": {
692692 "dev": "vite",
@@ -698,25 +698,48 @@ by `examples/server/webui/package.json`:
698698We can inspect the vite configuration which is in `vite.config.js`:
699699```js
700700...
701- writeBundle() {
702-   const outputIndexHtml = path.join(config.build.outDir, 'index.html');
703-   const content = GUIDE_FOR_FRONTEND + '\n' + fs.readFileSync(outputIndexHtml, 'utf-8');
704-   const compressed = zlib.gzipSync(Buffer.from(content, 'utf-8'), { level: 9 });
705-
706-   // because gzip header contains machine-specific info, we must remove these data from the header
707-   // timestamp
708-   compressed[0x4] = 0;
709-   compressed[0x5] = 0;
710-   compressed[0x6] = 0;
711-   compressed[0x7] = 0;
712-   // OS
713-   compressed[0x9] = 0;
714- ```
715- This is reading the `webui/index.html` file and prepending the `GUIDE_FOR_FRONTEND`
716- warning to it. This is then gzipped and the timestamp and OS fields are zeroed
717- out.
701+ llamaCppBuildPlugin() {
702+   ...
703+   try {
704+     const indexPath = resolve('../public/index.html');
705+     const gzipPath = resolve('../public/index.html.gz');
706+
707+     if (!existsSync(indexPath)) {
708+       return;
709+     }
710+
711+     let content = readFileSync(indexPath, 'utf-8');
712+
713+     const faviconPath = resolve('static/favicon.svg');
714+     if (existsSync(faviconPath)) {
715+       const faviconContent = readFileSync(faviconPath, 'utf-8');
716+       const faviconBase64 = Buffer.from(faviconContent).toString('base64');
717+       const faviconDataUrl = `data:image/svg+xml;base64,${faviconBase64}`;
718+
719+       content = content.replace(/href="[^"]*favicon\.svg"/g, `href="${faviconDataUrl}"`);
720+
721+       console.log('✓ Inlined favicon.svg as base64 data URL');
722+     }
723+
724+     content = content.replace(/\r/g, '');
725+     content = GUIDE_FOR_FRONTEND + '\n' + content;
726+
727+     const compressed = fflate.gzipSync(Buffer.from(content, 'utf-8'), { level: 9 });
728+
729+     // because gzip header contains machine-specific info, we must remove these data from the header
730+     // timestamp
731+     compressed[0x4] = 0;
732+     compressed[0x5] = 0;
733+     compressed[0x6] = 0;
734+     compressed[0x7] = 0;
735+     compressed[0x9] = 0;
736+ ```
737+ This reads the `public/index.html` file, inlines the favicon, prepends the `GUIDE_FOR_FRONTEND` warning, and gzips the
738+ result; the timestamp and OS fields of the gzip header are then zeroed out.
739+
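The zeroed offsets correspond to the gzip header layout from RFC 1952: bytes 4-7 hold
the modification time (MTIME) and byte 9 the OS identifier, which are the
machine-specific fields, so clearing them makes the build byte-for-byte reproducible.
A minimal sketch for checking the generated file (the path here is assumed from the
text above):
```c++
#include <cassert>
#include <cstdio>

// Sketch: verify that the machine-specific gzip header fields were zeroed.
int main() {
    FILE * f = fopen("tools/server/public/index.html.gz", "rb");
    if (f == nullptr) return 1;
    unsigned char hdr[10];
    size_t n = fread(hdr, 1, sizeof(hdr), f);
    fclose(f);
    if (n != sizeof(hdr)) return 1;
    assert(hdr[0] == 0x1f && hdr[1] == 0x8b); // gzip magic bytes
    for (int i = 4; i <= 7; i++) {
        assert(hdr[i] == 0);                  // MTIME field
    }
    assert(hdr[9] == 0);                      // OS field
    return 0;
}
```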
718740So when we run `npm run build` in the `webui` directory, the `index.html` file
719- is built and gzipped and the resulting `index.html.gz` file is copied to the
741+ is built and gzipped and the resulting `index.html.gz` file is placed in the
742+ public directory.
720743
721744And then when we build `llama-server` using cmake we can see the following
722745in `examples/server/CMakeLists.txt`:
@@ -738,12 +761,12 @@ foreach(asset ${PUBLIC_ASSETS})
738761  set_source_files_properties(${output} PROPERTIES GENERATED TRUE)
739762endforeach()
740763```
741- Notice that this is actually generateing a `.hpp` file from the `.gz` file:
764+ Notice that this is actually generating a `.hpp` file from the `.gz` file:
742765```console
743766/home/danbev/work/ai/llama.cpp-debug/build/examples/server/index.html.gz.hpp
744767```
745768
746- Now, this is passed to the script `xxd.cmake`:
769+ This is passed to the script `xxd.cmake`:
747770```
748771# CMake equivalent of `xxd -i ${INPUT} ${OUTPUT}`
749772```
@@ -755,6 +778,7 @@ If we look in includes in server.cpp we find:
755778#include "index.html.gz.hpp"
756779```
757780
781+ And in `build/tools/server/index.html.gz.hpp` we find:
758782```cpp
759783unsigned char index_html_gz[] = {0x1f,0x8b,...
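// ...the remaining bytes are elided in this hunk. In `xxd -i` style output the
// header also defines the matching length, unsigned int index_html_gz_len,
// which is what server.cpp passes to set_content() below.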
760784
@@ -765,5 +789,57 @@ And this is how the `index.html.gz` file is included in the server:
765789    res.set_content(reinterpret_cast<const char *>(index_html_gz), index_html_gz_len, "text/html; charset=utf-8");
766790```
767791
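The handler presumably also sets a `Content-Encoding: gzip` header on the response so
that the browser decompresses the payload itself. A minimal cpp-httplib sketch of the
same pattern (an illustration of the technique, not the actual server.cpp handler):
```c++
#include "httplib.h"         // cpp-httplib, the HTTP library llama-server uses
#include "index.html.gz.hpp" // generated: index_html_gz[] and index_html_gz_len

int main() {
    httplib::Server svr;
    svr.Get("/", [](const httplib::Request &, httplib::Response & res) {
        // the body is already gzipped at build time; this header tells the
        // client to decompress it
        res.set_header("Content-Encoding", "gzip");
        res.set_content(reinterpret_cast<const char *>(index_html_gz),
                        index_html_gz_len, "text/html; charset=utf-8");
    });
    svr.listen("127.0.0.1", 8080);
    return 0;
}
```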
768- ### Slots
769- This section aims to explain what slots are in the context of llama-server.
792+ ### GPU Sampling with llama-server
793+
794+ Currently, GPU sampling works in a similar manner to how pooling works: it
795+ is an optional function that is called in `build_graph`:
796+ ```c++
797+ // add GPU sampling layers (if any)
798+ llm->build_sampling(*this, params);
799+ ```
800+ GPU samplers can be configured by creating sampler chains, where each sampler
801+ chain is associated with a specific sequence id:
802+ ```c++
803+ struct llama_sampler_chain_params params = llama_sampler_chain_default_params();
804+ struct llama_sampler * chain = llama_sampler_chain_init(params);
805+ llama_sampler_chain_add(chain, llama_sampler_gpu_init_greedy());
806+ std::vector<llama_sampler_seq_config> sampler_configs = {
807+     { 0, chain }
808+ };
809+ ```
810+ The `llama_sampler_seq_config` struct is defined as:
811+ ```c++
812+ struct llama_sampler_seq_config {
813+ llama_seq_id seq_id;
814+ struct llama_sampler * sampler;
815+ };
816+ ```
817+ And these sampler configs are then passed in as context params:
818+ ```c++
819+ llama_context_params cparams = llama_context_default_params();
820+ cparams.samplers = sampler_configs.data();
821+ cparams.n_samplers = sampler_configs.size();
822+ ```
823+ When the graph is built, the configured samplers are added to the computation
824+ graph and become part of the computed graph. This is done in the samplers'
825+ `_apply` function, which allows them to add operations/nodes to the computation
826+ graph.
827+
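As a hypothetical illustration of what such an apply step can contribute, a greedy
sampler reduces to a single argmax node over the logits. The callback shape below is
made up for the sketch; only `ggml_argmax` is a real ggml operation:
```c++
#include "ggml.h"

// Hypothetical sketch of a greedy GPU sampling step being added to the graph.
static struct ggml_tensor * greedy_sample_apply(
        struct ggml_context * ctx,
        struct ggml_tensor  * logits) { // shape: [n_vocab, n_tokens]
    // argmax over the vocabulary dimension yields one i32 token id per token,
    // so only that id has to cross from device memory to host memory
    return ggml_argmax(ctx, logits);
}
```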
828+ This enables the sampling to happen fully, or partially, on the GPU. The samplers
829+ could sample a single token, in which case only that token is transferred from
830+ device memory to host memory after `llama_decode` has been called.
831+ The sampled token can then be retrieved using:
832+ ```c++
833+ llama_token id = llama_get_sampled_token_ith(test_ctx.ctx, index);
834+ ```
835+
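Putting the pieces together, a decode-and-retrieve flow could look roughly like the
following sketch. `ctx` is the context created with the sampler configs above;
passing `-1` for "last token" is an assumption, mirroring `llama_get_logits_ith`:
```c++
// Sketch: decode with the GPU sampler chain configured for seq_id 0 above,
// then fetch the token that was sampled on the device.
std::vector<llama_token> tokens = { 1, 2, 3 }; // stand-in for a tokenized prompt
llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());

if (llama_decode(ctx, batch) == 0) {
    // only the sampled token id was transferred back from device memory
    llama_token id = llama_get_sampled_token_ith(ctx, -1);
}
```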
836+ It is also possible to run a GPU sampler that only filters the logits; then
837+ only the filtered logits are transferred back to the host, and sampling can
838+ proceed on the CPU with the normal (CPU) sampler chain. In this case one
839+ configures the CPU samplers as usual, but they will now operate on already filtered logits.
840+
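A configuration for this mixed mode could look something like the sketch below;
`llama_sampler_gpu_init_top_k` is a hypothetical GPU-side counterpart (only the
greedy initializer appears above), while the CPU-side calls are the existing
`llama_sampler_init_*` functions:
```c++
// GPU side: runs inside the computation graph and only filters the logits.
// llama_sampler_gpu_init_top_k is hypothetical, named after the greedy variant.
struct llama_sampler * gpu_chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
llama_sampler_chain_add(gpu_chain, llama_sampler_gpu_init_top_k(40));

// CPU side: the normal sampler chain, now operating on pre-filtered logits.
struct llama_sampler * cpu_chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
llama_sampler_chain_add(cpu_chain, llama_sampler_init_temp(0.8f));
llama_sampler_chain_add(cpu_chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
```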
841+ Similar to the above with logits, it is possible for a GPU sampler to compute
842+ the full probability distribution and transfer that to the host. And similar
843+ to the logits filtering, the CPU samplers can then operate on the full
844+ probability distribution.
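Assuming the device-computed distribution lands in the usual per-token output buffer
(the text above does not spell out the mechanism), the host side could read it back
with the standard accessor:
```c++
// Sketch: read back n_vocab values for the last token; whether GPU-computed
// probabilities are exposed through this buffer is an assumption.
const float * probs = llama_get_logits_ith(ctx, -1);
```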
845+