Fix llama-server and llama-bench #11
…e model * Add OpenVINO ADD operator to Llama.cpp. The output is somewhat abnormal and needs further debugging.
…ontend-utils, GraphIterator, Decoder
…on openvino device
…ize, iSWA model not working
abf2454 to 94934fa
94934fa to 8f93fe6
8f93fe6 to 5f25e52
cavusmustafa
left a comment
I wonder if there is another way to identify different sequences rather than reading the kv_cache state. If there is, how about creating a unique inference request for each sequence? Instead of caching inference requests by cgraph pointer, we could build an id out of cgraph + seq_id. If this is possible, we don't need to read the states, since every inference request will manage its own variables anyway.
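To make the suggestion concrete, here is a minimal sketch of a cache keyed by the graph pointer plus a sequence id. Everything in it is hypothetical: `RequestKey`, `RequestKeyHash`, and `RequestCache` are illustrative names, `Request` stands in for whatever object owns the per-sequence state, and it assumes a `seq_id` is available at the point where the request is looked up, which is exactly the open question above.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <unordered_map>

// Hypothetical key: one cached entry per (graph, sequence) pair.
struct RequestKey {
    const void * cgraph;  // address of the compute graph (assumed stable across calls)
    int32_t      seq_id;  // sequence served by the cached request

    bool operator==(const RequestKey & other) const {
        return cgraph == other.cgraph && seq_id == other.seq_id;
    }
};

// Combines the two field hashes so that (graph, 0) and (graph, 1) land in different buckets.
struct RequestKeyHash {
    std::size_t operator()(const RequestKey & k) const {
        const std::size_t h1 = std::hash<const void *>{}(k.cgraph);
        const std::size_t h2 = std::hash<int32_t>{}(k.seq_id);
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};

// Request is a placeholder type; the cache itself has no OpenVINO dependency.
template <typename Request>
class RequestCache {
public:
    // Returns the request cached for `key`, calling `make` only on a miss.
    template <typename Factory>
    Request & get_or_create(const RequestKey & key, Factory && make) {
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            it = cache_.emplace(key, std::make_unique<Request>(make())).first;
        }
        return *it->second;
    }

    std::size_t size() const { return cache_.size(); }

private:
    // unique_ptr keeps returned references valid when the map rehashes.
    std::unordered_map<RequestKey, std::unique_ptr<Request>, RequestKeyHash> cache_;
};
```

With a layout like this, each sequence would keep its own request and its own state variables, at the cost of one request per active sequence.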
break;
}
}
static std::string device = getenv("GGML_OPENVINO_DEVICE") ? getenv("GGML_OPENVINO_DEVICE") : "CPU";
How about doing something like this to avoid multiple getenv calls?
const std::string& getDevice() {
    static const std::string device_str = [] {
        const char* device_env = std::getenv("GGML_OPENVINO_DEVICE");
        return device_env ? std::string(device_env) : "CPU";
    }();
    return device_str;
}
.
.
.
static std::string device = getDevice();
ov::AnyMap config;
if (device == "GPU") {
auto * disable_sdpa_optimization = getenv("GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION");
if (disable_sdpa_optimization && std::string(disable_sdpa_optimization) != "0") {
Why do we need to check this on every iteration instead of checking once before compile_model?
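One way to do the check once is to cache the parsed flag in a function-local static. The sketch below is only an illustration: `env_flag_enabled` and `sdpa_optimization_disabled` are hypothetical helper names, not functions from the PR, and the "set and not equal to 0" test simply mirrors the condition in the snippet above.

```cpp
#include <cstdlib>
#include <string>

// True when the variable is set to anything other than "0"
// (the same test the reviewed snippet applies).
inline bool env_flag_enabled(const char * name) {
    const char * value = std::getenv(name);
    return value != nullptr && std::string(value) != "0";
}

// Hypothetical accessor: the environment is read on the first call only,
// and every later call returns the cached result.
inline bool sdpa_optimization_disabled() {
    static const bool disabled = env_flag_enabled("GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION");
    return disabled;
}
```

The trade-off is that changes to the variable after the first call are ignored, which is usually acceptable for a setting that only matters at model compile time.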
// outdated if:
// 1. kv_len != kv_len_in_state
// 2. last row has different values
In this case, are we deleting the previous kv_cache completely?
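For reference, the two "outdated" conditions from the code comment can be written as a small predicate. This is a sketch under assumptions, not the PR's code: `KvStateView` and `kv_state_outdated` are hypothetical names, the rows are assumed to be `float`, and what happens to the state after the check returns true (full reset or partial reuse) is the question being asked here and is not modeled.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical snapshot of what the inference request's state currently holds.
struct KvStateView {
    std::size_t        kv_len;    // number of rows cached in the state
    std::vector<float> last_row;  // values of the most recently cached row
};

// Returns true when the state no longer matches the expected cache:
//   1. the cached length differs from the expected length, or
//   2. the last cached row holds different values.
inline bool kv_state_outdated(const KvStateView &        state,
                              std::size_t                kv_len,
                              const std::vector<float> & expected_last_row) {
    if (state.kv_len != kv_len) {
        return true;
    }
    // Exact element-wise comparison; a tolerance may be preferable for real data.
    return state.last_row != expected_last_row;
}
```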
956dbf7 to d5038aa
llama-server runs with default params (does not support -np >1)
llama-bench runs with -fa 1