
Commit 80cfee9

docs: add more details on OpenVINO backend and NPU issue
1 parent 79f42d2 commit 80cfee9


notes/ggml/openvino-backend.md

Lines changed: 30 additions & 8 deletions
@@ -64,12 +64,36 @@ llm_graph_result * llama_context::process_ubatch(const llama_ubatch & ubatch, ll
```
And this will build the computation graph for the model in question, and this
includes specific sizes for the inputs, like the sequence length, batch size,
etc.
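
Roughly what this means, as a sketch with a made-up helper and dimensions (not
the actual llama.cpp graph-building code): the tensors in the graph are created
with concrete sizes, so a different number of tokens gives a structurally
different graph.
```c++
#include "ggml.h"

// Hypothetical illustration: n_tokens is baked into the tensors of the graph,
// so a prompt with a different number of tokens produces a different graph.
static struct ggml_cgraph * build_example_graph(struct ggml_context * ctx,
                                                struct ggml_tensor  * weights, // [n_embd, n_out]
                                                int n_embd, int n_tokens) {
    // Input embeddings with a concrete shape [n_embd, n_tokens].
    struct ggml_tensor * inp = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_tokens);
    // One matrix multiplication standing in for the real model layers.
    struct ggml_tensor * out = ggml_mul_mat(ctx, weights, inp);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, out);
    return gf;
}
```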

In the case of OpenVINO, which translates the GGML graph to an OpenVINO model,
the backend has to perform the above steps of compilation, firmware generation
and loading to the NPU. This is a slow process and it is not feasible to do for
every inference call. Currently they do some form of caching, which I'm not
exactly sure how it works, and I think it only applies to the NPU case, which I
can't really test as I don't have the hardware.
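
I'm not sure what the backend itself caches, but OpenVINO has a generic model
cache which works something like this (a minimal sketch assuming a single-input
`model.xml` and an available `NPU` device; this is not the ggml-openvino code):
```c++
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // With a cache directory set, compile_model() stores the compiled blob on
    // disk and reuses it on later runs, which mainly helps devices like the
    // NPU/GPU where compilation is expensive.
    core.set_property(ov::cache_dir("ov_cache"));

    auto model    = core.read_model("model.xml");       // hypothetical model file
    auto compiled = core.compile_model(model, "NPU");   // slow first time, cached after
    auto request  = compiled.create_infer_request();
    // ... set input tensors and call request.infer()
    return 0;
}
```
If the input shapes are different on every call this cache presumably can't help
much, which is what the NPU issue below is about.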

It sounds like this is not something unique to Intel:
* Apple Neural Engine: Similar black box through Core ML

### NPU issue
To understand the NPU issue better we can think of the NPU as being like my IoT
device. If I need to update something, like change a configuration in the
program, say the size of a variable, then I actually need to recompile and flash
the device to see that change. This is alright if it seldom changes, but if it
happens frequently it is time consuming.

The same thing happens with the NPU: if you have operations with the same tensor
sizes that are called many times, the program will basically be compiled once
and then called multiple times. But if the tensor sizes change, then it needs to
"flash" the device with a new program. And this is what is happening with
llama.cpp and the OpenVINO backend, where the sequence length and batch size can
change for every inference call. This is also the reason why the NPU has this
issue and not the CPU or GPU, as they are more flexible and can handle changing
shapes. There are solutions like having a fixed sequence length and padding the
input to fit, but this also wastes resources.
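
A sketch of the fixed-sequence-length-plus-padding idea (hypothetical, assuming
a single-input model that takes token ids; not the actual backend code):
```c++
#include <openvino/openvino.hpp>
#include <vector>

int main() {
    constexpr size_t max_seq_len = 512;                // assumed fixed length

    ov::Core core;
    auto model = core.read_model("model.xml");         // hypothetical model file
    // Force a static [batch=1, seq=512] shape (works for a single-input model),
    // so the NPU only has to compile the model once.
    model->reshape(ov::PartialShape{1, 512});
    auto compiled = core.compile_model(model, "NPU");

    std::vector<int64_t> tokens = {101, 2023, 2003};   // actual prompt tokens
    tokens.resize(max_seq_len, 0);                     // pad with an assumed pad id

    auto request = compiled.create_infer_request();
    request.set_input_tensor(
        ov::Tensor(ov::element::i64, ov::Shape{1, max_seq_len}, tokens.data()));
    request.infer();
    return 0;
}
```
Every call then runs the full 512-token graph even when the prompt is only a few
tokens long, which is the wasted work mentioned above.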

So could the CPU handle the prefill of the prompt and then the NPU handle the
decoding of tokens with a fixed sequence length of 1?
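
Just to sketch what that split could look like (purely hypothetical device
assignment, not something I know the backend does): compile one model with a
dynamic sequence length for the CPU to handle prefill, and a second copy
reshaped to a static sequence length of 1 for the NPU to handle decode.
```c++
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;

    // Prefill: dynamic sequence length, compiled for the more flexible CPU.
    auto prefill_model = core.read_model("model.xml");            // hypothetical
    prefill_model->reshape(ov::PartialShape{1, ov::Dimension()}); // [1, dynamic]
    auto prefill = core.compile_model(prefill_model, "CPU");

    // Decode: one token at a time, a single static shape the NPU compiles once.
    auto decode_model = core.read_model("model.xml");
    decode_model->reshape(ov::PartialShape{1, 1});                // [1, 1]
    auto decode = core.compile_model(decode_model, "NPU");

    // Prefill the prompt on the CPU, then loop the decode model on the NPU,
    // feeding back one token per step (KV-cache handling omitted).
    return 0;
}
```
The open question would be how the KV-cache state gets shared between the two
compiled models.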


### Debugging session notes
@@ -125,8 +149,6 @@ ggml_backend_openvino_graph_compute(ggml_backend_t backend, struct ggml_cgraph *
```
And openvino_frontend_compute is defined in ggml/src/ggml-openvino/utils.h:
```c++
enum ggml_status openvino_frontend_compute(ggml_backend_t backend, struct ggml_cgraph* cgraph) {
    static ov::Core core;  // single ov::Core instance, created once and reused across calls
