## Description

### Background Description
`llama_tensor_get_type()` in src/llama.cpp is nearly 300 lines of conditions and has somewhat inconsistent logic (and formatting). Some parts are dense (what exactly is `use_more_bits()`'s purpose?) and take a while to truly understand. The chance of unintended side effects when changing the code is high.
There is no way to get the effective bits per weight (bpw) for the variable-quantization Qn_K forms, so it is hardcoded in `llama_model_ftype_name()` in src/llama.cpp and in examples/quantize/quantize.cpp instead of being calculated.
While this is not a problem for the Qn_0 and Qn_1 flavors, it is for the Qn_K flavors, where there is a judgement call on how deeply each tensor is quantized. Furthermore, the Qn_K flavors are underspecified, and the function is the only source of truth for what the end result (e.g. Q5_K_M vs Q5_K_L) should look like.
As quantization techniques continue to evolve, this discrepancy will become more difficult to manage; it is essentially an entropy accumulator. An example of the desire for flexibility is #6844.
Questions:
- Would it be valuable to be able to calculate the bpw instead of hardcoding it?
- Are the entropy accumulation and the risk of unintended consequences in `llama_model_ftype_name()` considered a problem?
- Do we want to provide flexibility for each tensor's quantization level without having to recompile? e.g. Custom quantization schemes #6844

If the answer to all of these is "no" for the maintainers, then we can close this request.
### Possible Refactor Approaches
I don't have a strong personal opinion on what the end state of `llama_tensor_get_type()` should look like. Here are possible options (some far-fetched); maybe a combination of these would be a path forward:
- Extract the code into a standalone, reusable function (its own .cpp file) so it can be built as a stand-alone executable to be called by tools.
- Create a table-driven unit test covering all existing known combinations to ensure new modifications do not have unintended side effects.
- Automate the generation of the table-driven unit test for ease of development. (I like this one)
- Convert the function into a formal finite-state machine with explicit internal states.
- Convert the function into a table-driven machine, i.e. the source of truth becomes pure data, where the conditions are encoded as data. (I like this one) This is a fully generalized form instead of the one-off done in Custom quantization schemes #6844.
- Generate the function via another language. Python or Starlark would be potential choices.
I do not believe this is a performance-critical function; its run time only has to be negligible, which could color the option(s) chosen.