Feature/expose metal flashattn kv #97
base: main
@@ -63,13 +63,24 @@ public final class LlamaClient: LLMClient
            context.clear()
            try context.decode(text: text)
        case .chatTemplate(let messages):
+           // Match the `.plain` path: reset the prefill batch + KV cache
+           // before each new generation. Without this, a previous
+           // generation that was cut short by an external stop condition
+           // (consumer-level stop sequence or maxTokens break) leaves
+           // stale `batch.n_tokens > 0` and the next prefill walks past
+           // the end of `seq_id`, crashing on the force-unwrap at
+           // Batch.swift:20. The asymmetry between `.plain` (cleared)
+           // and `.chat`/`.chatTemplate` (not cleared) was the root
+           // cause of the residual crash in d71786a.
+           context.clear()
            try messageProcessor.process(
                templateMessages: messages,
                context: context,
                multimodal: multimodal,
                tools: tools
            )
        case .chat(let messages):
+           context.clear()
            try messageProcessor.process(
                messages: messages,
                context: context,

Contributor (inline comment on the added `context.clear()`, truncated): Calling …
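For context on why a stale batch is dangerous, here is a minimal, hypothetical sketch of the kind of reset `context.clear()` is expected to perform. The wrapper type `GenerationContext` is invented for this example; the repository's real `Context` type may differ, and the memory-API calls are the ones current llama.cpp exposes (older builds use `llama_kv_self_clear` / `llama_kv_cache_clear` instead):

```swift
import llama  // llama.cpp's SwiftPM module; the module name is an assumption

// Hypothetical wrapper showing a combined "reset the prefill batch +
// wipe the KV cache" step, as described in the diff comment above.
final class GenerationContext {
    private var batch: llama_batch
    private let context: OpaquePointer  // llama_context *

    init(context: OpaquePointer, maxBatchTokens: Int32) {
        self.context = context
        // Token batch, no embeddings, one sequence id per token slot.
        self.batch = llama_batch_init(maxBatchTokens, 0, 1)
    }

    func clear() {
        // Drop tokens left over from a generation that was cut short by a
        // stop sequence or a maxTokens break. If n_tokens stays > 0, the
        // next prefill keeps appending and eventually indexes past the end
        // of `seq_id`, which is the force-unwrap crash described above.
        batch.n_tokens = 0

        // Wipe the model's KV cache so the next prefill starts from
        // position 0 instead of continuing from stale history.
        llama_memory_clear(llama_get_memory(context), true)
    }

    deinit {
        llama_batch_free(batch)
    }
}
```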
@@ -11,10 +11,23 @@ final class Model
        llama_model_get_vocab(model)
    }

-   init(url: URL) throws(LLMError) {
+   init(url: URL, parameter: LlamaClient.Parameter = .default) throws(LLMError) {
        var model_params = llama_model_default_params()

+       // GPU layer offload. On Apple Silicon (real device + Mac) the GPU has
+       // unified memory access so offloading "all" layers is the desired
+       // setting. On the iOS Simulator there is no Metal device available
+       // for llama.cpp, so we force CPU-only regardless of the requested
+       // value to avoid runtime failures.
+       //
+       // We use 999 as the "all layers" sentinel (the same value used
+       // throughout the llama.cpp examples). `Int32.max` was tried first
+       // but appears to trigger internal arithmetic edge cases in
+       // `llama_batch` allocation paths on b8851; 999 sidesteps that.
+       #if targetEnvironment(simulator)
+       model_params.n_gpu_layers = 0
+       #else
+       model_params.n_gpu_layers = parameter.nGpuLayers == -1 ? 999 : Int32(parameter.nGpuLayers)
+       #endif
        model_params.use_mmap = true

Contributor (inline comment with a suggested change on the `n_gpu_layers` assignment, truncated): Using …
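The same offload mapping can be pulled out into a small pure function, which makes the Simulator/device split and the -1 → 999 sentinel easy to unit-test. This is only a sketch; the helper name is invented, and the PR applies the logic inline in `Model.init`:

```swift
// Sketch only: mirrors the mapping in the diff above. `-1` means "offload
// every layer"; 999 is the all-layers sentinel used in llama.cpp examples;
// the iOS Simulator has no Metal device, so it always runs CPU-only.
func resolvedGPULayerCount(requested nGpuLayers: Int) -> Int32 {
    #if targetEnvironment(simulator)
    return 0
    #else
    return nGpuLayers == -1 ? 999 : Int32(nGpuLayers)
    #endif
}

// Example (device / macOS build):
//   resolvedGPULayerCount(requested: -1)  // 999 — offload all layers
//   resolvedGPULayerCount(requested: 12)  // 12  — partial offload
```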
Reviewer comment: The `clear()` method resets the underlying KV cache using `llama_memory_clear`, but it does not clear the `promptCaches` array. This will lead to a state mismatch where subsequent calls to `textStream` might skip processing prompt chunks that are no longer in the KV cache, resulting in incorrect model output. You should clear the Swift-side cache tracking whenever the KV cache is wiped.
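A hedged sketch of the fix this comment asks for: the `promptCaches` name comes from the comment itself, but the surrounding type and its properties are invented for illustration and will differ from the repository's real `Context`:

```swift
import llama  // llama.cpp's SwiftPM module; the module name is an assumption

// Illustration only: keep the Swift-side prompt-cache bookkeeping in sync
// with the llama.cpp KV cache, so clearing one always clears the other.
final class PromptTrackingContext {
    private let context: OpaquePointer       // llama_context *
    private var promptCaches: [String] = []  // prompt chunks already decoded

    init(context: OpaquePointer) {
        self.context = context
    }

    func markDecoded(_ chunk: String) {
        promptCaches.append(chunk)
    }

    func isAlreadyDecoded(_ chunk: String) -> Bool {
        promptCaches.contains(chunk)
    }

    func clear() {
        // Wipe the KV cache ...
        llama_memory_clear(llama_get_memory(context), true)
        // ... and the Swift-side record of what was decoded, so textStream
        // cannot skip chunks it wrongly believes are still cached.
        promptCaches.removeAll()
    }
}
```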