## Description

### Background Description
`llama_tensor_get_type()` in src/llama.cpp is nearly 300 lines of conditions and has somewhat inconsistent logic (and formatting). Some parts are dense (what exactly is `use_more_bits()`'s purpose?) and take a while to truly understand. The chance of unintended side effects when changing the code is high.
There is no way to get the effective bits per weight (bpw) for the variable-quantization Qn_K forms, so it is hardcoded in `llama_model_ftype_name()` in src/llama.cpp and in examples/quantize/quantize.cpp instead of being calculated.
While this is not a problem for the Qn_0 and Qn_1 flavors, it is for the Qn_K flavors, where there is a judgement call on how deeply each tensor is quantized. Furthermore, the Qn_K flavors are underspecified, and the function is the only source of truth for what the end result (e.g. Q5_K_M vs Q5_K_L) should look like.
As quantization techniques continue to evolve, this discrepancy will become more difficult to manage; it is essentially an entropy accumulator. An example of the desire for flexibility is #6844.
Questions:
- Would it be valuable to be able to calculate the bpw instead of hardcoding it?
- Are the entropy accumulation and the risk of unintended consequences in `llama_model_ftype_name()` considered a problem?
- Do we want to provide flexibility for each tensor's quantization level without having to recompile? e.g. Custom quantization schemes #6844

If the answer to all of these is "no" for the maintainers, then we can close this request.
### Possible Refactor Approaches
I don't have a strong personal opinion on what the end state of `llama_tensor_get_type()` should look like. Here are possible options (some far-fetched); maybe a combination of these would be a path forward:
- Extract the code into a standalone, reusable function (its own .cpp file) so it can be built as a stand-alone executable to be called by tools.
- Create a table-driven unit test covering all existing known combinations to ensure new modifications do not have unintended side effects.
- Automate the generation of the table-driven unit test for ease of development. (I like this one)
- Convert the function into a formal finite-state machine with explicit internal states.
- Convert the function into a table-driven machine, i.e. the source of truth becomes pure data, where the conditions are encoded as data. (I like this one) This is a fully generalized form instead of the one-off done in Custom quantization schemes #6844.
- Generate the function via another language. Python or Starlark would be potential choices.
I do not believe this is a performance-critical function; its run time only has to be negligible, which could color the option(s) chosen.