Main Changes
`llama-cli`, `llama-server`, and `llama-bench`. The limitation is that the NPU must run with `-ub 1` for all utilities, which makes the prompt processing time proportional to the input length.

Preliminary Test
For CPU and GPU:
- `llama-simple`, `llama-cli`, and `llama-server` work with default command-line arguments.
- `llama-bench` needs to be run with the flag `-fa 1`.

For NPU:

- `llama-cli` and `llama-server` work with `-ub 1`. For better performance, a smaller context size is recommended (e.g., `-c 512`).
- `llama-simple` does not work, as it does not support setting `-ub`.
- `llama-bench` needs to be run with `-fa 1 -ub 1`. It's also recommended to use a shorter prompt (e.g., `-p 32 -n 32`) for faster results.

Running `llama-cli` on LNL-32GB-Linux:
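The NPU constraints above can be sketched as concrete invocations. This is a minimal example, assuming a standard CMake build (binaries under `build/bin/`) and a hypothetical model path; only the flags named in this PR (`-ub`, `-fa`, `-c`, `-p`, `-n`) are taken from the source.

```shell
# Hypothetical model path; substitute your own GGUF file.
MODEL=models/model-q4_0.gguf

# NPU chat/serve: micro-batch size must be 1; a smaller context
# (e.g., -c 512) is recommended for better performance.
./build/bin/llama-cli -m "$MODEL" -ub 1 -c 512
./build/bin/llama-server -m "$MODEL" -ub 1 -c 512

# NPU benchmark: flash attention enabled and micro-batch 1 are required;
# a short prompt/generation length gives faster results.
./build/bin/llama-bench -m "$MODEL" -fa 1 -ub 1 -p 32 -n 32
```

Because `-ub 1` forces token-by-token prompt processing on the NPU, prompt-processing time grows linearly with input length, which is why short prompts are suggested for benchmarking.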