Refactor: decide the future of llama_tensor_get_type() #8736

Closed
@maruel

Background Description

llama_tensor_get_type() in src/llama.cpp is nearly 300 lines of conditions with somewhat inconsistent logic (and formatting). Some parts are dense (what exactly is use_more_bits()'s purpose?) and take a while to truly understand. The chance of unintended side effects when changing the code is high.

There's no way to get the effective bits per weight (bpw) for the variable quantization forms Qn_K, so it is hardcoded in llama_model_ftype_name() in src/llama.cpp and in examples/quantize/quantize.cpp instead of being calculated.

While this is not a problem for the Qn_0 and Qn_1 flavors, it is a problem for the Qn_K flavors, where there is a judgement call on how deeply each tensor is quantized. Furthermore, the Qn_K flavors are underspecified, and the function is the only source of truth for what the end result (e.g. Q5_K_M vs Q5_K_L) should look like.
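As a rough illustration of the "calculate instead of hardcode" idea, here is a minimal sketch that derives the effective bpw from the tensors themselves, assuming the caller has access to the model's quantized tensors at that point. `effective_bpw` is a hypothetical helper, not an existing API; it only relies on ggml's existing size accessors:

```cpp
// Hypothetical helper: derive the effective bits per weight of a quantized
// model from the tensors themselves rather than from a hardcoded table.
#include <cstdint>
#include <vector>

#include "ggml.h"

static double effective_bpw(const std::vector<const ggml_tensor *> & tensors) {
    uint64_t total_bytes    = 0;
    uint64_t total_elements = 0;
    for (const ggml_tensor * t : tensors) {
        total_bytes    += ggml_nbytes(t);    // bytes actually stored for this tensor
        total_elements += ggml_nelements(t); // number of weights in this tensor
    }
    return total_elements == 0 ? 0.0 : 8.0 * (double) total_bytes / (double) total_elements;
}
```

Something along these lines would let llama_model_ftype_name() and examples/quantize/quantize.cpp report the actual ratio for Qn_K mixes instead of a hardcoded estimate.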

As quantization techniques continue to evolve, this discrepancy will become more difficult to manage; it's essentially an entropy accumulator. An example of the desire for flexibility is #6844.

Questions:

  • Would it be valuable to be able to calculate the bpw instead of hardcoding it?
  • Are the entropy accumulation and the risk of unintended consequences in llama_model_ftype_name() considered a problem?
  • Do we want to provide flexibility for each tensor's quantization level without having to recompile? e.g. Custom quantization schemes #6844

If the maintainers answer "no" to all of these, then we can close this request.

Possible Refactor Approaches

I don't have a strong personal opinion on what the end state of llama_tensor_get_type() should look like. Here are some possible options (some far-fetched); perhaps a combination of these would be a path forward:

  • Extract the code into a standalone reusable function (its own .cpp file) so it can be built as a stand-alone executable to be called by tools.
  • Create a table-driven unit test covering all existing known combinations to ensure new modifications do not have unintended side effects.
    • Automate the generation of the table-driven unit test for ease of development. (I like this one)
  • Convert the function into a formal finite-state machine with explicit internal states.
  • Convert the function into a table-driven machine, i.e. the source of truth becomes pure data, where conditions are encoded as data (see the sketch after this list). (I like this one) This is a fully generalized form instead of a one-off as done in Custom quantization schemes #6844.
  • Generate the function via another language; Python or Starlark would be potential choices.
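To make the table-driven option more concrete, here is a rough sketch of what "conditions encoded as data" could look like. Everything below is illustrative: the struct fields, the suffix-matching criterion, and the rule values are made up for the example, and the real llama_tensor_get_type() also depends on layer index, expert count, and fallback logic that a complete table format would need to express.

```cpp
// Illustrative sketch of "conditions encoded as data": the if/else cascade in
// llama_tensor_get_type() becomes rows in a table that is scanned in order,
// first match wins. The rule contents here are made up, not the actual mapping.
#include <cstring>
#include <string>
#include <vector>

#include "ggml.h"
#include "llama.h"

struct quant_rule {
    llama_ftype  ftype;       // requested mix, e.g. LLAMA_FTYPE_MOSTLY_Q5_K_M
    const char * name_suffix; // tensor name suffix to match, e.g. "attn_v.weight"
    ggml_type    new_type;    // quantization type to use for matching tensors
};

// The source of truth becomes data that tools and tests can also consume.
static const std::vector<quant_rule> k_quant_rules = {
    { LLAMA_FTYPE_MOSTLY_Q5_K_M, "attn_v.weight",   GGML_TYPE_Q6_K },
    { LLAMA_FTYPE_MOSTLY_Q5_K_M, "ffn_down.weight", GGML_TYPE_Q6_K },
    { LLAMA_FTYPE_MOSTLY_Q4_K_M, "output.weight",   GGML_TYPE_Q6_K },
};

static bool ends_with(const std::string & s, const char * suffix) {
    const size_t n = strlen(suffix);
    return s.size() >= n && s.compare(s.size() - n, n, suffix) == 0;
}

static ggml_type tensor_type_for(llama_ftype ftype, const std::string & name, ggml_type default_type) {
    for (const quant_rule & r : k_quant_rules) {
        if (r.ftype == ftype && ends_with(name, r.name_suffix)) {
            return r.new_type;
        }
    }
    return default_type; // no rule matched: keep the ftype's base type
}
```

The same record format, or a generated dump of the current function's outputs over all known ftype/tensor combinations, could double as the input to the table-driven unit test mentioned above, so behavior changes show up as data diffs rather than surprises.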

I do not believe this is a performance-critical function; its run time only has to be negligible, which could color the option(s) chosen.
