Main Changes
`llama-cli`, `llama-server`, and `llama-bench`. The limitation is that the NPU must run with `-ub 1` for all utilities, which makes the prompt processing time proportional to the input length.

Preliminary Test
For CPU and GPU:
- `llama-simple`, `llama-cli`, and `llama-server` work with default command-line arguments.
- `llama-bench` needs to be run with the flag `-fa 1`.

For NPU:

- `llama-cli` and `llama-server` work with `-ub 1`. For better performance, a smaller context size is recommended (e.g., `-c 512`).
- `llama-simple` does not work, as it does not support setting `-ub`.
- `llama-bench` needs to be run with `-fa 1 -ub 1`. It's also recommended to use a shorter prompt (e.g., `-p 32 -n 32`) for faster results.

Running `llama-cli` on LNL-32GB-Linux:
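The NPU constraints above can be sketched as concrete invocations. This is a minimal example, assuming a standard CMake build (binaries under `build/bin/`) and a hypothetical model path; only the flags named in this PR (`-ub`, `-fa`, `-c`, `-p`, `-n`) are taken from the source.

```shell
# Hypothetical model path; substitute your own GGUF file.
MODEL=models/model-q4_0.gguf

# NPU chat/serve: micro-batch size must be 1; a smaller context
# (e.g., -c 512) is recommended for better performance.
./build/bin/llama-cli -m "$MODEL" -ub 1 -c 512
./build/bin/llama-server -m "$MODEL" -ub 1 -c 512

# NPU benchmark: flash attention enabled and micro-batch 1 are required;
# a short prompt/generation length gives faster results.
./build/bin/llama-bench -m "$MODEL" -fa 1 -ub 1 -p 32 -n 32
```

Because `-ub 1` forces token-by-token prompt processing on the NPU, prompt-processing time grows linearly with input length, which is why short prompts are suggested for benchmarking.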