llama: automatically set runtime parameters such as --n-gpu-layers to fit VRAM #14067
base: master
Conversation
Looking forward to this. I've been setting this as 999 when I wanted GPU acceleration, and users have complained it's too high.
Since the VRAM required to run the model depends mainly on 1) model size and 2) allocated context, please consider adding the following flags:
Maybe the following flags could be useful as well. Not sure.
In the past, I've been using https://github.com/3Simplex/Llama.Cpp-Toolbox, which features automatic determination of context and layers, with an Nvidia GeForce 1060 3GB, but the solution there is imperfect, so I've had my fair share of pondering how to get this right. At the time of writing I am not proficient at coding in C++, so please excuse me; all of these problems can probably be solved in a better way.
My intent is to make the targeted VRAM margin and the minimum context size configurable (and to only adjust runtime parameters not explicitly set by the user). That should cover most use cases. In my opinion the logic for optimizing runtime parameters should be kept simple, since it's not feasible to cover all possible use cases and hardware setups anyway. If someone wants to squeeze out the last few % of performance for their setup, they should determine the optimal parameters manually and save them somewhere.
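A minimal sketch of that "only touch what the user did not set" rule, assuming hypothetical field names, sentinels, and defaults (none of these come from the PR):

```cpp
// Sketch only: parameters keep an "unset" sentinel, and auto-tuning only
// overwrites fields still holding that sentinel. All names are illustrative.
#include <cstddef>
#include <cstdint>

struct tuning_params {
    int32_t n_gpu_layers = -1;          // -1: not set by user -> eligible for auto-tuning
    int32_t n_ctx        =  0;          //  0: not set by user -> eligible for auto-tuning
    size_t  vram_margin  = 512u << 20;  // target amount of VRAM to leave free (bytes)
    int32_t min_ctx      = 4096;        // never auto-shrink the context below this
};

void apply_auto_tuning(tuning_params & p, int32_t ngl_auto, int32_t ctx_auto) {
    if (p.n_gpu_layers < 0) {
        p.n_gpu_layers = ngl_auto;                             // user did not set it
    }
    if (p.n_ctx == 0) {
        p.n_ctx = ctx_auto < p.min_ctx ? p.min_ctx : ctx_auto; // respect the floor
    }
}
```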
Will this logic be backend-agnostic? Is it possible that different backends would require different amounts of VRAM (e.g. Vulkan vs. CUDA) even with the exact same layers and generation params?
It will work for all GPU backends, but the required VRAM margin will not be the same. One problem is that right now the CUDA backend, for example, allocates temporary buffers for data conversion that are not part of the compute graph and are therefore not considered in the expected VRAM use. Long-term I think we should move these conversions to the compute graph anyway, since that would also have the benefits of lower memory use and reusability when the compute graph splits.
(I have not tried or checked the code)
For multiple GPUs and
A somewhat related issue is being explored here: sometimes users set -ngl 999 to enable acceleration and -ngl 0 to disable it, which is very boolean, kind of a big-hammer approach. It is also being claimed that ggml-cpu (no Vulkan built in) performs significantly faster than ggml-vulkan with -ngl 0; is this correct? (I had assumed in the past that they should be the same, without looking into the details.) I also wonder whether we should by default auto-select -ngl 0 when a CPU is detected in Vulkan:
I was worried about this, but it's only supposed to affect things that aren't explicitly set, and using -ngl 99 -ot exps=CPU creates an expected, desirable effect, so if this broke it somehow it would be a bug to fix.
Would --auto-max-context default to the model config's full context size? |
My idea was letting the user set a value there. Even if the model supports 32768 context, if the user sets it to 8192, then allocated VRAM should first be maxed out to accommodate as many layers of the model as possible, then maxed out up to 8192 context, and then stop there. If there is not enough space left in VRAM to fill up to 8192, then go as high as possible and max out VRAM, but not context. Since users don't know how many layers of the model will fit into VRAM for a given context, allocating the layers automatically is nice. The number and thickness of layers differs between models. So does the VRAM required for the context (e.g. flash attention requires less VRAM than the default llama.cpp setting), so IMHO both context and layers are important variables to account for, and whichever of the two is maxed out first will determine the value of the other in case of limited VRAM. The order is important here. I am also operating under the assumption that automatically set runtime parameters are imperfect and max out VRAM in cases when it is not desired to max out VRAM, and argue for a tiny "empty" margin to prevent fully maxing out VRAM, hence the
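To make that fill order concrete, here is a minimal sketch (layers first, then context, always leaving a margin). This is illustrative only, not llama.cpp code; the estimate callback stands in for whatever dry-run estimation the PR ends up providing:

```cpp
// Sketch of the "offload layers first, then grow context toward the user's cap" order.
#include <cstddef>
#include <cstdint>
#include <functional>

struct fit_result { int32_t n_gpu_layers; uint32_t n_ctx; };

fit_result fit_layers_then_context(
        const std::function<size_t(int32_t /*n_gpu_layers*/, uint32_t /*n_ctx*/)> & estimate,
        size_t free_vram, size_t margin,
        int32_t n_layers_total, uint32_t ctx_min, uint32_t ctx_max) {
    const size_t budget = free_vram > margin ? free_vram - margin : 0;

    // 1) offload as many layers as fit while assuming only the minimum context
    int32_t ngl = 0;
    while (ngl < n_layers_total && estimate(ngl + 1, ctx_min) <= budget) {
        ngl++;
    }

    // 2) with the layer count fixed, grow the context toward the user's cap
    uint32_t ctx = ctx_min;
    const uint32_t step = 1024; // arbitrary granularity for the sketch
    while (ctx + step <= ctx_max && estimate(ngl, ctx + step) <= budget) {
        ctx += step;
    }
    return {ngl, ctx};
}
```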
You have a cool idea, but you're trying to juggle too many variables when designing the user experience and it's turning into an inconsistent mess. Set sane defaults assuming the user doesn't even know the switches exist, then let people fine-tune from there. (Though at some point, what's the effort difference between what you're making and just manually setting context and testing VRAM use? Now we have to set two or three different context numbers to 'shoot the gap' and have the dry runner figure out the difference.) Also, shouldn't there just be a way to estimate cache use mathematically? I know it'll vary if you use flash attention or SWA or such, but theoretically there should be no reason to need to estimate to that degree. In exllamav3, for instance (ref), you can get in the ballpark with
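For reference, the kind of back-of-envelope math being alluded to is simple for the KV cache itself. A hedged example using the standard formula (not llama.cpp's exact accounting; it ignores padding, SWA, and quantized cache types, and the model numbers are just an illustration):

```cpp
// bytes ≈ 2 (K and V) * n_layer * n_ctx * n_head_kv * head_dim * bytes_per_element
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t n_layer   = 32;   // example: a Llama-3-8B-like model
    const uint64_t n_ctx     = 8192;
    const uint64_t n_head_kv = 8;    // GQA KV heads
    const uint64_t head_dim  = 128;
    const uint64_t elt_size  = 2;    // f16 cache

    const uint64_t kv_bytes = 2 * n_layer * n_ctx * n_head_kv * head_dim * elt_size;
    printf("estimated KV cache: %.2f MiB\n", kv_bytes / (1024.0 * 1024.0)); // ~1024 MiB
    return 0;
}
```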
Just so there is no misunderstanding: @ThiloteE is not writing any of the code in this PR; he merely made suggestions regarding the interface. I am the one working on this feature, and my response was:
See #13860.
This PR aims to add code for setting runtime parameters such as the number of GPU layers automatically, given the available memory. As of right now only the number of GPU layers is being adjusted, and this is done unconditionally. Implementation details:

- A new function llama_expected_memory_use can be used to retrieve the expected memory use per device without doing any actual allocations. A function common_fit_to_free_memory in common.cpp then repeatedly tries configurations of runtime parameters until the allocation would fit in free memory (plus some margin). I think the code for determining the optimal parameters should work in such a way that only parameters that the user does not explicitly set are modified. To me the most natural way to do this would be in common.cpp, though it could also be done in llama.cpp.
- llama_model and llama_context are extended with a flag dry_run which, when set, prevents the allocation of memory during initialization (see the sketch after this list). dry_run cannot be set by user code.
- llama_model has been extended with a method total_size that returns the size of the weights. llama_context has been extended with a method total_size that internally calls the same method on the memory, thus returning the size of the KV cache.
- There is a new function ggml_backend_alloc_ctx_tensors_from_buft_size which returns the amount of memory that would be needed for a call to ggml_backend_alloc_ctx_tensors_from_buft. Both functions internally use the same code, but a new flag dry_run controls whether the memory is actually being allocated.
- Because the dry_run flag in llama_model and llama_context results in the creation of dummy backend buffers with size 0, ggml_backend_buffer_get_size cannot be used to retrieve the expected memory use in total_size. Instead, ggml_backend_alloc_ctx_tensors_from_buft_size is used. This makes the corresponding methods for the memory awkward: right now they retrieve the expected memory use of the KV cache even if actual, physical buffers have been allocated. I'm not sure what the best course of action here is; maybe use the expected size with dry_run and the actually allocated size without dry_run, and assert consistency in debug mode?
- llama_context has a new vector backends_exp_max_size to store the expected max. memory use given the worst-case compute graphs that are already being pre-allocated on master. If dry_run is set, a new function ggml_backend_sched_reserve_size is used to retrieve the expected allocation size of the scheduler instead of an actual call to ggml_backend_sched_reserve. Independently of the automatic determination of runtime parameters, I think it would be useful to track the max. expected memory use of the scheduler and to assert that it was not exceeded in the destructor of llama_context.
- ggml_backend_sched_reserve_size calls ggml_gallocr_reserve_n with a new flag dry_run and afterwards calls a new function ggml_gallocr_get_max_size to retrieve the max. sizes of the internally stored ggml_dyn_tallocrs.
- The expected memory use is filtered by a ggml_backend_dev_t; I'm not sure whether I should be filtering by ggml_backend_buffer_type_t instead.
- llama_context::backends vs. llama_context::backend_ptrs.
- The existing vocab_only flag does very similar things to dry_run. Maybe we should unify the logic using an enum?
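As a rough illustration of the dry_run idea described in the list above (this is not the PR's code, and all names here are made up): the same initialization path can either allocate real buffers or merely add up the sizes it would have allocated, so callers can query the expected memory use without committing any VRAM.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct mock_model {
    std::vector<std::vector<uint8_t>> buffers; // stays empty in a dry run
    size_t total_size = 0;                     // what a total_size()-style method would report
};

mock_model init_model(const std::vector<size_t> & tensor_sizes, bool dry_run) {
    mock_model m;
    for (size_t sz : tensor_sizes) {
        m.total_size += sz;             // always tracked, allocation or not
        if (!dry_run) {
            m.buffers.emplace_back(sz); // real allocation only when not a dry run
        }
    }
    return m;
}
```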
For the finished PR I envision the following behavior: