Fix llama-server and llama-bench #11

wine99 · 2025-10-20T05:29:54Z

llama-server runs with default params (`-np` > 1 is not supported)
llama-bench runs with `-fa 1`

cavusmustafa

I wonder if there is another way to identify different sequences rather than reading the kv_cache state. If there is, how about creating a unique inference request for each sequence? Instead of caching inference requests by cgraph pointer, we could build the cache key from cgraph + seq_id. If this is possible, we don't need to read the states, since every inference request will manage its own variables anyway.
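A minimal sketch of the idea, assuming a sequence id can be obtained at the point where the request is looked up (that is the open question in this comment). `InferRequestStub`, `RequestKey`, `RequestKeyHash`, and `RequestCache` are hypothetical names used only for illustration; `InferRequestStub` stands in for the real inference request type, and none of this is code from the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <unordered_map>

// Hypothetical stand-in for an inference request that owns its own state variables.
struct InferRequestStub {
    int kv_len = 0;
};

// Composite cache key: graph identity plus sequence id (both assumed to be available).
struct RequestKey {
    const void * cgraph;
    int32_t      seq_id;

    bool operator==(const RequestKey & other) const {
        return cgraph == other.cgraph && seq_id == other.seq_id;
    }
};

// Combines the two field hashes so that (graph, seq) pairs spread across buckets.
struct RequestKeyHash {
    size_t operator()(const RequestKey & key) const {
        const size_t h1 = std::hash<const void *>{}(key.cgraph);
        const size_t h2 = std::hash<int32_t>{}(key.seq_id);
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};

// One request per (cgraph, seq_id); a request is created lazily on first use.
class RequestCache {
  public:
    InferRequestStub & get_or_create(const void * cgraph, int32_t seq_id) {
        auto & slot = cache_[RequestKey{cgraph, seq_id}];
        if (!slot) {
            slot = std::make_unique<InferRequestStub>();
        }
        return *slot;
    }

    size_t size() const { return cache_.size(); }

  private:
    std::unordered_map<RequestKey, std::unique_ptr<InferRequestStub>, RequestKeyHash> cache_;
};
```

With this layout, two sequences that share a graph get separate requests and therefore separate state. The trade-off is memory: the cache grows with the number of live sequences, so a real implementation would likely need an eviction rule.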

cavusmustafa · 2025-10-27T17:21:20Z

ggml/src/ggml-openvino/utils.cpp

-                break;
-            }
-        }
+    static std::string device = getenv("GGML_OPENVINO_DEVICE") ? getenv("GGML_OPENVINO_DEVICE") : "CPU";


How about doing something like this to avoid multiple getenv calls?

const std::string& getDevice() {
    static const std::string device_str = [] {
        const char* device_env = std::getenv("GGML_OPENVINO_DEVICE");
        return device_env ? std::string(device_env) : "CPU";
    }();
    return device_str;
}
. . .
static std::string device = getDevice();

cavusmustafa · 2025-10-27T20:43:05Z

ggml/src/ggml-openvino/utils.cpp

    ov::AnyMap config;
+    if (device == "GPU") {
+        auto * disable_sdpa_optimization = getenv("GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION");
+        if (disable_sdpa_optimization && std::string(disable_sdpa_optimization) != "0") {


Why do we need to check this for every iteration instead of checking once before compile_model?
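One way to do this, sketched under the assumption that the flag does not need to change while the process is running: read the variable once through a function-local static and reuse the cached result when building the compile config. `env_flag_enabled` and `sdpa_optimization_disabled` are hypothetical helper names, not functions from the PR; the "set and not equal to `0`" rule is taken from the diff above.

```cpp
#include <cstdlib>
#include <string>

// True when the variable is set to anything other than "0" (same rule as the diff).
inline bool env_flag_enabled(const char * name) {
    const char * value = std::getenv(name);
    return value != nullptr && std::string(value) != "0";
}

// Evaluated on the first call only; later calls return the cached result.
inline bool sdpa_optimization_disabled() {
    static const bool disabled = env_flag_enabled("GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION");
    return disabled;
}
```

The call site would then check `sdpa_optimization_disabled()` once, right before the model is compiled, instead of calling `getenv` on every iteration.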

cavusmustafa · 2025-10-27T20:56:38Z

ggml/src/ggml-openvino/utils.cpp

+
+        // outdated if:
+        // 1. kv_len != kv_len_in_state
+        // 2. last row has different values


In this case are we deleting the previous kv_cache completely?
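For reference, the two conditions in the quoted comment can be read as the check sketched below. This is an illustration only: `KvView` and `kv_state_outdated` are hypothetical names, the flat row-major `float` layout is an assumption, and the sketch says nothing about what the PR does once the state is found to be outdated.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical flat, row-major KV buffer: kv_len rows of row_size floats each.
// Assumes data.size() >= kv_len * row_size.
struct KvView {
    std::vector<float> data;
    size_t             kv_len;
    size_t             row_size;
};

// Outdated if (1) the lengths differ, or (2) the last row holds different values.
inline bool kv_state_outdated(const KvView & expected, const KvView & state) {
    if (expected.kv_len != state.kv_len || expected.row_size != state.row_size) {
        return true;
    }
    if (expected.kv_len == 0) {
        return false;  // both empty: nothing to compare
    }
    const size_t offset = (expected.kv_len - 1) * expected.row_size;
    for (size_t i = 0; i < expected.row_size; ++i) {
        if (expected.data[offset + i] != state.data[offset + i]) {
            return true;
        }
    }
    return false;
}
```

Because only the last row is compared, a difference confined to earlier rows goes undetected, so the check is a heuristic rather than a full equality test.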

YangleiZouIntel and others added 30 commits October 15, 2025 13:03

Add ggml-openvino base files

675085b

add openvino as optional backend for Llama.cpp ggml

5b28b4f

* Configure the device(default CPU) that uses OpenVINO to compile th…

8a54dfd

…e model * Add OpenVINO ADD operator to Llama.cpp. The output is somewhat abnormal and needs further debugging.

Solve the issue of abnormal model output caused by using OpenVINO ADD…

0fc7124

… operator

Add OpenVINO MUL operator to GGML of Llama.cpp.

15b6a4f

Add compile options

f246d24

add OpenVINO frontend convert process steps

f7189b5

add get openvino available ops function

d2a306f

Add PoC of integration of openvino frontend. Main changes: ggml-ov-fr…

746bd53

…ontend-utils, GraphIterator, Decoder

Implement GgmlOvDecoder. Add dump functions.

047a771

Convert subgraph with add, sub, mul, div op to ov model and do infer …

21aaa5f

…on openvino device

Add GGML_OV_FRONTEND option. Add readme.

d3f5b62

Change output for infer request to set output tensor. Support scale, …

eedde64

…view op.

add GET_ROWS operator of OpenVINO to GGML of llama.cpp

6ee5ee4

Update build.md and add operation mapping(GGML to OpenVINO)

0515ab8

add the rms_norm operator implemented using OpenVINO to the GGML back…

63327ea

…end of llama.cpp

Fix issue for output memory copy of infer request

8037eb3

Change to implementation following pytorch frontend

b111e6a

Add support for UNARY SILU op . Fix pytorch impl bugs.

4b91d1f

Support Softmax op

923166f

Support Softmax op

f783c2a

Support ROPE op.

aedb88a

Add support for RMS_NORM OP

67d51bd

Add MUL_MAT,CPY,CONT as operators implemented in OpenVINO for GGML ba…

33cf85f

…ckend

Move CPY from GGML OV Backend to OV Frontend

cba097b

add implementation of MUL_MAT, CPY, CONT of GGML ops using OV ops

9b74f0f

add implementation of CPY when the output tensor is non-contiguous

68d53bb

add tmp source code files

1dc4ec6

Execute singel CONT operator is OK

d941a56

Execute CONT & VIEW operators in OV Frontend is OK

f670c64

wine99 and others added 16 commits October 15, 2025 16:06

Add Q5_K to support phi-3-q4_k_m

96254f9

Requantize Q6_K (gs16) to gs32 on GPU

f785c3d

Fix after rebasing

e6dca1b

Always apply Eliminate_ZP to fix GPU compile issue on some platforms

2413170

kvcachefusion support

e444b88

env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION added

c40b9fe

Fix for Phi3

c21d664

Fix llama-cli (need to run with --no-warmup)

3370915

Fix add_sliced_mask; Revert mulmat, softmax; Remove input attention_s…

47843b6

…ize, iSWA model not working

fix after rebasing

63f6bba

Fix llama-3-8b and phi3-mini q4_0 NPU

b41f494

Update to OV-2025.3 and CMakeLists.txt

d4eb8ec

Add OV CI cache

f89292d

Apply CISC review and update CI to OV2025.3

89b8212

Update CI to run OV dep install before build

09ead55

Update OV dockerfile to use OV2025.3 and update build docs

7d8ea73

github-actions bot added the ggml label Oct 20, 2025

wine99 force-pushed the reset_variable_state branch from abf2454 to 94934fa Compare October 21, 2025 03:21

Style: use switch in supports_ops

6cac650

wine99 force-pushed the reset_variable_state branch from 94934fa to 8f93fe6 Compare October 21, 2025 05:32

wine99 added 5 commits October 21, 2025 14:45

Style: middle ptr and ref align, omit optional struct keyword

9e02b3f

Fix llama-cli; WIP llama-server

ac57c17

Fix: llama-server works with --cache-ram 0; llama-bench works with …

affee8d

…`-fa 1`

Minor udpates

886d418

Fix llama-server

5f25e52

wine99 force-pushed the reset_variable_state branch from 8f93fe6 to 5f25e52 Compare October 21, 2025 06:59

Style: clang-format

1b55304

cavusmustafa reviewed Oct 27, 2025

wine99 force-pushed the dev_backend_openvino branch from 956dbf7 to d5038aa Compare November 4, 2025 08:51

wine99 closed this Nov 6, 2025


8 participants
