```
And this will build the computation graph for the model in question, including
specific sizes for the inputs, like the sequence length, batch size, etc.
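
To make the "specific sizes" part concrete, here is a minimal ggml sketch (my
own illustration, not the actual llama.cpp graph-building code): the number of
tokens is baked into the shape of the input tensor when the graph is created,
so a graph built for one n_tokens is effectively a different graph from one
built for another.
```c++
#include "ggml.h"

// Build a tiny graph for a single matrix multiplication. The point is that
// n_tokens is part of the input tensor's shape, so the resulting graph is
// specific to this sequence length/batch size.
static struct ggml_cgraph * build_example_graph(struct ggml_context * ctx,
                                                struct ggml_tensor  * weights,
                                                int64_t               n_embd,
                                                int64_t               n_tokens) {
    // Input activations with a fixed shape of [n_embd, n_tokens].
    struct ggml_tensor * inp = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_tokens);

    // weights is expected to have shape [n_embd, n_out].
    struct ggml_tensor * cur = ggml_mul_mat(ctx, weights, inp);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, cur);
    return gf;
}
```
A backend that simply walks this graph, like the CPU backend, does not care
much about the concrete sizes, but a backend that compiles the graph ahead of
time does.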
In the case of OpenVINO, which translates the GGML graph to an OpenVINO model,
they have to perform the above steps of compilation, firmware generation, and
loading to the NPU. This is a slow process and it is not feasible to do for
every inference call. Currently they do some form of caching, which I'm not
exactly sure how it works, and I think it only applies to the NPU case, which I
can't really test as I don't have the hardware.
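
For reference, this is roughly what OpenVINO's generic model caching looks like
at the API level (a sketch of the ov::cache_dir mechanism, not necessarily what
the ggml-openvino backend actually does):
```c++
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;

    // Enable OpenVINO's built-in model caching: later compile_model calls
    // for the same model/device/config can load a cached blob instead of
    // doing the full compilation again.
    core.set_property(ov::cache_dir("ov_cache"));

    // Hypothetical model path, just for illustration.
    auto model = core.read_model("model.xml");

    // The first call compiles and stores the blob, later calls hit the cache.
    auto compiled = core.compile_model(model, "NPU");

    ov::InferRequest request = compiled.create_infer_request();
    return 0;
}
```
Whether the backend relies on this mechanism or does its own caching is exactly
the part I'm not sure about.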
It sounds like this is not something unique to Intel:
* Apple Neural Engine: Similar black box through Core ML

### NPU issue
To understand the NPU issue better, we can think of the NPU as being like one
of my IoT devices: if I need to update something, like changing a configuration
in the program, say the size of a variable, then I actually need to recompile
and flash the device to see that change. This is fine if it seldom changes, but
if changes are frequent it becomes time consuming.
The same thing happens with the NPU: if an operation with the same tensor sizes
is called many times, it will basically be compiled once and called multiple
times. But if the tensor sizes change, then the device needs to be "flashed"
with a new program. This is what is happening with llama.cpp and the OpenVINO
backend, where the sequence length and batch size can change for every
inference call. It is also the reason why the NPU has this issue and not the
CPU or GPU, as they are more flexible and can handle changing shapes. There are
solutions like having a fixed sequence length and padding the input to fit, but
this also wastes resources.
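
A sketch of this "compile once per shape, reuse many times" pattern (my own
illustration of the idea, not the backend's actual code; the shape-string key
and the compile_for_shape callback are made up for the example):
```c++
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

#include <openvino/openvino.hpp>

// Hypothetical cache keyed by the input shape. A shape that has been seen
// before is a cheap lookup; a new shape pays the full compile ("flash")
// cost once.
ov::CompiledModel & get_compiled(
        std::map<std::string, ov::CompiledModel> & cache,
        const std::vector<int64_t> & shape,
        const std::function<ov::CompiledModel()> & compile_for_shape) {
    std::string key;
    for (int64_t d : shape) {
        key += std::to_string(d) + "x";
    }
    auto it = cache.find(key);
    if (it == cache.end()) {
        // First time we see this shape: compile (and load) a new program.
        it = cache.emplace(key, compile_for_shape()).first;
    }
    return it->second;
}
```
With llama.cpp the prefill shape changes with every prompt, so on the NPU a
cache like this keeps missing for prefill, while the single-token decode shape
would hit it every time.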
So could the CPU handle the prompt prefill and then the NPU handle the decoding
of tokens with a fixed sequence length of 1?
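
One way that split could look (purely a sketch of the idea, with made-up
names): route anything with more than one token to the CPU, and keep a single
fixed-shape program compiled once for the NPU for decoding.
```c++
#include <cstdint>

// Hypothetical dispatch policy: prefill has a variable number of tokens and
// runs on the CPU, while decode is always a single token and can reuse one
// fixed-shape program compiled for the NPU.
enum class Device { CPU, NPU };

inline Device pick_device(int64_t n_tokens) {
    if (n_tokens > 1) {
        // Prefill: n_tokens follows the prompt length and changes per request.
        return Device::CPU;
    }
    // Decode: always exactly one token, so the NPU program never needs to be
    // recompiled ("re-flashed").
    return Device::NPU;
}
```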

### Debugging session notes
```
And openvino_frontend_compute is defined in ggml/src/ggml-openvino/utils.h:
```c++
enum ggml_status openvino_frontend_compute(ggml_backend_t backend, struct ggml_cgraph* cgraph) {
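    // Note: core is a function-local static, so it is created on the first
    // call and then reused for every subsequent graph compute call.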
    static ov::Core core;
