
Custom quantization schemes #6844

Open · wants to merge 7 commits into base: master

Conversation

jubruckne

This is not ready to merge, but I wanted to get your opinion on whether it's something you'd be interested in including. If so, I can clean it up and improve it a little.

The idea is to allow creating a custom quantization mix by reading the per-layer quant type from a config file, by specifying CUSTOM as the type, like so:

./quantize --allow-requantize ../models/Meta-Llama-3-8B-Instruct.Q8_0.gguf ./llama3-q.gguf CUSTOM

The config file is currently hardcoded to read quant.cfg from the current directory (a sample cfg is included). In the config file I allow specifying a default type for tensors that are not explicitly overridden, plus tensor name / type pairs for the overrides.
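
For illustration, reading such a config could be as simple as the sketch below (the function name read_quant_config and the string-based override list are made up for this example, not taken from the PR):

#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Read "tensor_name=type" pairs from a quant.cfg-style file, skipping
// comments and blank lines. A "ftype=..." line sets the default type.
static bool read_quant_config(const std::string & path,
                              std::string & default_ftype,
                              std::vector<std::pair<std::string, std::string>> & overrides) {
    std::ifstream in(path);
    if (!in) {
        return false;
    }
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') {
            continue;
        }
        const size_t pos = line.find('=');
        if (pos == std::string::npos) {
            continue;
        }
        const std::string key   = line.substr(0, pos);
        const std::string value = line.substr(pos + 1);
        if (key == "ftype") {
            default_ftype = value;
        } else {
            overrides.emplace_back(key, value);
        }
    }
    return true;
}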

Possible improvements would be:

  • specifying types as strings instead of enum values ("Q3_K" instead of 11)
  • wildcards or regex to specify tensor names (like "blk\d{2}.ffn_down.weight")
  • allowing relative types (like -1, +2): if the default is Q8_K and quant.cfg says -1 for a tensor, you get Q6_K
  • making the quant.cfg filename configurable via a command-line switch.

@askmyteapot

This would be handy, as I like to experiment with different custom quants, and it's a little clunky having to modify and rebuild llama.cpp every time I want to change something.
For example, I found that with Mixtral, having token_embed and attn_v/k/q/output as Q6_K with iq4_xs weights typically scores better than the standard iq4_xs. Weirdly, it even slightly outperforms the same mix with those tensors at Q8_0 and iq4_xs weights.

@ggerganov
Member

Yes, this functionality is welcome

@Nexesenex
Contributor

Nexesenex commented Apr 23, 2024

Excellent idea. I wanted to see such a feature but am unable to do it myself.
I will use it... a lot!

All the possible improvements you mention are pertinent.

Also, this tool should ideally feature variable quantization.
For example, it can be useful to quantize a fraction of a given weight with one quant and the other half with another.
Example: the ffn.down.weight is usually the "lead" of the 3 FFN weights in terms of influence over perplexity. Simply quantizing half of the ffn.down.weight with the immediately superior quant gives a very good perplexity shrink on most models, not to speak of other benches like ARC.

Moreover, and that's a bit more complex, the ideal combination might be a customizable form of the "more_bits" feature (search for it in the llama.cpp file) to make that partial quant, and to select a layer range of a given weight to quantize with a higher quant.
Example: take a 70b model, with 80 layers, with a LLAMA_FTYPE IQ2_S.
I'd like to quantize the ffn.down.weight as such without recompiling LlamaCPP (see the sketch after this list):

  • 10 (or any number) first layers in IQ3_XXS.
  • one every x layers in IQ3_XXS between layers 11 and 70 (for example).
  • 10 (or any number) last layers in IQ3_XXS.
  • the rest in IQ2_S.
    Of course, these numbers are arbitrary, and I'd be curious to know which layers are actually the most influential in a model, and thus deserve the higher bitrate of a variable quant.
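
As a rough sketch of what such a rule could look like in code (illustrative only: the edge size, the x step, and the use of plain strings in place of GGML types are all assumptions, not anything from llama.cpp):

// Illustrative only: pick a quant for ffn_down by layer index, following the
// "first N layers / every x-th middle layer / last N layers" idea above.
static const char * ffn_down_quant_for_layer(int i_layer, int n_layer) {
    const int n_edge = 10; // number of first/last layers to boost (arbitrary)
    const int x      = 4;  // boost every x-th layer in the middle (arbitrary)
    if (i_layer < n_edge || i_layer >= n_layer - n_edge) {
        return "IQ3_XXS";
    }
    if ((i_layer - n_edge) % x == 0) {
        return "IQ3_XXS";
    }
    return "IQ2_S";
}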

I'm currently toying with the code in the llama.cpp file, and it's quite indigestible and not practical, especially because 2 approaches were used to define the quant strategies:

  • The IQ1 and IQ2 quant strategies are a tree, the weights being branches.
  • The other quants (IQ and Q) are branches in per-weight trees.
    That coexistence of 2 approaches is confusing to me, and should ideally be harmonized into either one (by weight) or the other (by quant strategy).

@jubruckne
Author

I'd like to quantize the ffn.down.weight as such without recompiling LlamaCPP

Yeah, that's the idea. I actually explained my intentions slightly incorrectly in the first post above. It's actually about allowing individual quantisation for each tensor (not layer). So you can have a config file like this:

# use default quantisation of Q8_0
ftype=7

# override tensors matching a pattern with a specific quant:
blk.10.ffn_up.weight=7
blk.1?.ffn_up.weight=10
blk.2?.ffn_up.weight=10
blk.1?.attn*=23
blk.2?.attn*=23
*down*=14
*gate*=12
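
For reference, a minimal recursive matcher for ?/* patterns like the ones above could look like the sketch below (illustrative; not necessarily how the PR's matching is implemented):

#include <string>

// Match a tensor name against a pattern where '?' matches one character
// and '*' matches any (possibly empty) run of characters.
static bool name_matches(const std::string & name, const std::string & pattern,
                         size_t si = 0, size_t pi = 0) {
    while (pi < pattern.size()) {
        if (pattern[pi] == '*') {
            // try to match the rest of the pattern at every remaining position
            for (size_t k = si; k <= name.size(); ++k) {
                if (name_matches(name, pattern, k, pi + 1)) {
                    return true;
                }
            }
            return false;
        }
        if (si >= name.size()) {
            return false;
        }
        if (pattern[pi] != '?' && pattern[pi] != name[si]) {
            return false;
        }
        ++si;
        ++pi;
    }
    return si == name.size();
}

For example, name_matches("blk.12.ffn_up.weight", "blk.1?.ffn_up.weight") returns true, while name_matches("blk.3.ffn_up.weight", "blk.1?.ffn_up.weight") does not.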

@jubruckne
Author

Moreover, and that's a bit more complex, the ideal combination might be to be able to use a customizable form "more_bits feature" (query it in the llama.cpp file) to make that partial quant, and to select a layer range of a given weight to quantize with a higher quant.

Exactly my plan :) The idea here would be that instead of setting a specific quant type, increments of +1, +2, -1, ... relative to the default could be used. For example:

# use default quantisation of Q4_K_S
ftype=14

# override tensors matching a pattern with a specific quant:
*ffn_up.weight=+1
*ffn_down.weight=-1

The challenge lies in defining what the sequences of quant types should be. One possibility is to establish a sequence that transitions between similar quant types of different "bit" rates, such as from x_K to x+1_K or from IQx_S to IQx-1_S. For example:

  1. IQ1_S, IQ1_M
  2. IQ2_XXS, IQ2_XS, IQ2_S, Q2_K
  3. IQ3_XXS, IQ3_S, Q3_K
  4. Q4_0, Q4_1, Q4_K, IQ4_XS, IQ4_NL
  5. Q5_0, Q5_1, Q5_K
  6. Q6_K
  7. Q8_K

Using this sequence, a default of Q4_K would transition to Q5_K with a +1 adjustment and to Q3_K with a -1.
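
To illustrate, the relative lookup could just index into such a sequence (a sketch under the assumption of a single flat ladder; the ladder below is a simplified stand-in with one representative type per step, not a proposal):

#include <algorithm>
#include <string>
#include <vector>

// Step up or down a (simplified) ladder of quant types by a relative offset,
// clamping at both ends.
static std::string apply_offset(const std::string & base, int offset) {
    static const std::vector<std::string> ladder = {
        "IQ1_S", "Q2_K", "Q3_K", "Q4_K", "Q5_K", "Q6_K", "Q8_0",
    };
    const auto it = std::find(ladder.begin(), ladder.end(), base);
    if (it == ladder.end()) {
        return base; // unknown type: leave unchanged
    }
    const int idx     = (int)(it - ladder.begin()) + offset;
    const int clamped = std::max(0, std::min((int)ladder.size() - 1, idx));
    return ladder[clamped];
}

With this ladder, apply_offset("Q4_K", +1) gives Q5_K and apply_offset("Q4_K", -1) gives Q3_K, matching the example above.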

A more detailed sequence might look like this:

  1. IQ1_S
  2. IQ1_M
  3. IQ2_XXS
  4. IQ2_XS
  5. IQ2_S
  6. Q2_K
  7. IQ3_XXS
  8. IQ3_S
  9. Q3_K
  10. Q4_0, Q4_1, Q4_K
  11. IQ4_XS, IQ4_NL
  12. Q5_0, Q5_1, Q5_K
  13. Q6_K
  14. Q8_K

However, now that I've tried to write down sensible sequences, I realize that defining one that is universally applicable is challenging due to the varying nature of the quant types, and it probably doesn't make sense in most cases.

Any thoughts?

@Nexesenex
Contributor

Nexesenex commented Apr 24, 2024

Well, ideally the whole pattern would be definable so the system can be applied universally. There's no premade recipe that has consensus, nor should there be, because we are still empirically discovering the effects of particular quantization strategies as we try them.

Here's a reformulation of my idea, compatible with your plans:

  • Optionally define, for each tensor, an offset, either relative (+1, -1) or as an absolute GGML_TYPE.
  • Optionally define, within a tensor, one or several layer ranges (relative or absolute) to be quantized either with the baseline quant, with a relative offset to the baseline quant, with a GGML_TYPE, or with a mix of 2 quants over a given layer interval.

Example in plain words, for a tensor chosen for customized quantization away from a base quantization strategy; here a base Q4_K defined on a 70b L2 model with 80 layers, on which we want to customize the ffn.down without even using Q4_K for the sake of the example:
ffn.down -> layers 1:15 (or first 20%): Q5_K (or +1); layers 16:65: Q5_K (or +1) every x layers, the rest Q3_K (or -1); layers 66:80 (or last 20%): Q5_K (or +1)
The "every x layers" pattern could of course also be applied to the first or last range of layers, and not only the intermediate one.

That might require a slight overhaul of the quant strategy part of llama.cpp, and potentially a harmonization of its hierarchical tree with respect to the IQ1 and IQ2 groups, but if possible, that'd offer the widest range of possibilities.

I'm sorry for my lack of coding proficiency; I have no background in coding beyond mimicking what I see, understanding and adapting a few formatting tricks, and changing values.

@jubruckne
Author

jubruckne commented Apr 25, 2024

I think this should be ready. I added parsing of enum values (so that friendly names like Q8_0 can be used instead of their numeric values), wildcards for tensor names, and the possibility to specify the cfg file to use.

To use, specify the new CUSTOM type on ./quantize like so:
./quantize ../models/Phi-3-mini-4k-instruct-fp16.gguf ./phi3-q.gguf CUSTOM:quant.cfg

The quant.cfg should be pretty self-explanatory:

# Defines the default ftype (the quantization mix code 
# that you pass to quantize if you're not using a custom mix).
# Tensors that are not overridden below will be quantized 
# according to this mix.
#
# Must be one of
#    Q4_0, Q4_1, Q5_0, Q5_1, IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M, 
#    IQ1_S, IQ1_M, Q2_K, Q2_K_S, IQ3_XXS, IQ3_S, IQ3_M, Q3_K,
#    IQ3_XS, Q3_K_S, Q3_K_M, Q3_K_L, IQ4_NL, IQ4_XS, Q4_K, 
#    Q4_K_S, Q4_K_M, Q5_K, Q5_K_S, Q5_K_M, Q6_K, Q8_0, F16

ftype=Q6_K

# Defines overrides for tensors with names matching a given 
# string. Filters are processed in order given, the first 
# matching will be used. 
#
# Wildcards are allowed:
#     ? single character
#     * multiple characters
#
# Type must be one of 
#     F16, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1, Q2_K, Q3_K, 
#     Q4_K, Q5_K, Q6_K, Q8_K, IQ2_XXS, IQ2_XS, IQ3_XXS, 
#     IQ1_S, IQ4_NL, IQ3_S, IQ2_S, IQ4_XS, IQ1_M

blk.10.ffn_up.weight=Q5_K
blk.1?.ffn_up.weight=Q4_K
blk.23.*=Q2_K
blk.24.*=Q2_K
blk.25.*=Q2_K
blk.2?.ffn_up.weight=Q4_K
*_gate*=Q4_K
*.attn*=IQ4_XS
*_down*=IQ3_S
output.weight=Q5_K

@jubruckne jubruckne marked this pull request as ready for review April 25, 2024 09:50
Contributor

github-actions bot commented Apr 25, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 556 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8437.61ms p(95)=19797.62ms fails=, finish reason: stop=488 truncated=68
  • Prompt processing (pp): avg=93.93tk/s p(95)=352.4tk/s
  • Token generation (tg): avg=33.41tk/s p(95)=49.02tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=20b22433f0cf941c1b43e27c086e2ef71798fd57

[benchmark charts: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing]

if (pos != std::string::npos) {
    std::string tensor_name = line.substr(0, pos);
    std::string type_name = line.substr(pos + 1);
    ggml_type type = parse_ggml_type(type_name.c_str());
Collaborator


Shouldn't the configuration describe tensors by data type (from enum ggml_type), not file type (from enum llama_ftype)? E.g. Q3_K_S, Q3_K_M, and Q3_K_L are all file types, whereas Q3_K is a data type.

Author


The idea was that you set the ftype (llama_ftype), and that gives you, as a base, the built-in mixing logic that llama_tensor_get_type() determines. I figured that's a good default since it also does some architecture-specific optimizations. Only then, on top of it, do you override specific tensors with a different ggml_type.

An alternative would be to completely get rid of the built-in llama_ftype logic and specify the quant mixes entirely from cfg files. Then we could ship pre-built cfg files for each of the mixes currently supported by quantize. This would be nice on the one hand, as it would move all the special casing out of the code base, but on the other hand it's a bigger endeavour, as we'd need to figure out how to handle the different architectures, models with varying layer counts, and other logic that is not so easy to express in a declarative configuration.
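
Put differently, the selection order described above boils down to something like this sketch (not the PR's exact code; name_matches is the wildcard helper sketched earlier in the thread, and the override list holds pattern/type pairs in config order):

#include <string>
#include <utility>
#include <vector>

// declaration of the ?/* wildcard helper sketched earlier in this thread
static bool name_matches(const std::string & name, const std::string & pattern,
                         size_t si = 0, size_t pi = 0);

// First matching override wins; otherwise keep the type that the built-in
// ftype logic (llama_tensor_get_type) already picked for this tensor.
static std::string choose_type(const std::string & tensor_name,
                               const std::vector<std::pair<std::string, std::string>> & overrides,
                               const std::string & ftype_default) {
    for (const auto & ov : overrides) {
        if (name_matches(tensor_name, ov.first)) {
            return ov.second;
        }
    }
    return ftype_default;
}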

@BarfingLemurs
Contributor

Hi, I'm noticing we don't have good perplexity/quality scores for llama3 8b at Q4_0:

F16: Final estimate: PPL = 6.7647
Q8_0: Final estimate: PPL = 6.7646
Q4_0: Final estimate: PPL = 7.2904
Q5_1: Final estimate: PPL = 6.8849

This is a 7.7% difference, but these numbers are even worse earlier on in evaluation.

Mistral PPL:
F16: Final estimate: PPL = 5.6925
Q8_0: Final estimate: PPL = 5.6918
Q4_0: Final estimate: PPL = 5.8192

Only 2.2% difference for mistral.

Different UIs using the Q4_0 series would see higher quality degradation for llama3 than for llama2 or mistral.

This isn't a llama.cpp issue; most GPU quantizations will get similar results. Is there a pre-existing quantization sweet spot suitable as the de facto choice for llama3?

Member

@ggerganov ggerganov left a comment


Would be nice to get some usage feedback from people before merging

llama.h Outdated
Comment on lines 125 to 127
LLAMA_FTYPE_CUSTOM = 32, // except 1d tensors

LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
Member


Suggested change
- LLAMA_FTYPE_CUSTOM = 32, // except 1d tensors
- LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
+ LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
+ LLAMA_FTYPE_CUSTOM = 1025,

Comment on lines +283 to +286
enum llama_ftype default_ftype; // default type if not overriden
uint32_t count; // number of overrides
const char ** names; // tensor names
enum ggml_type * types; // tensor type override
Member


Suggested change
enum llama_ftype default_ftype; // default type if not overriden
uint32_t count; // number of overrides
const char ** names; // tensor names
enum ggml_type * types; // tensor type override
enum llama_ftype default_ftype; // default type if not overriden
uint32_t count; // number of overrides
const char ** names; // tensor names
enum ggml_type * types; // tensor type override

@@ -14886,7 +14925,8 @@ struct llama_model_quantize_params llama_model_quantize_default_params() {
/*.only_copy =*/ false,
/*.pure =*/ false,
/*.imatrix =*/ nullptr,
/*.kv_overrides =*/ nullptr,
/*.kv_overrides =*/ nullptr,
/*.override_ftype =*/ nullptr
Member


Suggested change
/*.override_ftype =*/ nullptr
/*.override_ftype =*/ nullptr,

@@ -14417,6 +14444,18 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
new_type = params->output_tensor_type;
}

// look up tensor name in type override map, if not found use default
// type as determined by the ftype.
if(params->override_ftype) {
Member


Suggested change
if(params->override_ftype) {
if (params->override_ftype) {

@@ -14279,7 +14306,7 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
// copy the KV pairs from the input file
gguf_set_kv (ctx_out, ml.meta);
gguf_set_val_u32(ctx_out, "general.quantization_version", GGML_QNT_VERSION);
gguf_set_val_u32(ctx_out, "general.file_type", ftype);
gguf_set_val_u32(ctx_out, "general.file_type", params->ftype);
Member


Shouldn't we keep ftype here instead of params->ftype?

std::string tensor_name = line.substr(0, pos);
std::string type_name = line.substr(pos + 1);
ggml_type type = parse_ggml_type(type_name.c_str());
if(type < 0 || type >= GGML_TYPE_COUNT) {
Member


Suggested change
if(type < 0 || type >= GGML_TYPE_COUNT) {
if (type < 0 || type >= GGML_TYPE_COUNT) {

std::string ftype_name;
std::string custom_quant_config_filename;
llama_ftype ftype;
if(!try_parse_ftype(ftype_str, ftype, ftype_name, custom_quant_config_filename)) {
Member


Suggested change
if(!try_parse_ftype(ftype_str, ftype, ftype_name, custom_quant_config_filename)) {
if (!try_parse_ftype(ftype_str, ftype, ftype_name, custom_quant_config_filename)) {

@ggerganov ggerganov added the need feedback (Testing and feedback with results are needed) label May 9, 2024
@mofosyne mofosyne added the enhancement (New feature or request) label May 9, 2024
@Green-Sky
Collaborator

Green-Sky commented May 9, 2024

Hi, I'm noticing we don't have good perplexity/quality scores for llama3 8b at Q4_0:

F16: Final estimate: PPL = 6.7647
Q8_0: Final estimate: PPL = 6.7646
Q4_0: Final estimate: PPL = 7.2904
Q5_1: Final estimate: PPL = 6.8849

This is a 7.7% difference, but these numbers are even worse earlier on in evaluation.

Mistral PPL:
F16: Final estimate: PPL = 5.6925
Q8_0: Final estimate: PPL = 5.6918
Q4_0: Final estimate: PPL = 5.8192

Only 2.2% difference for mistral.

llama3 reacts more strongly to quantization, probably because it makes more use of the bits/precision it was trained on.

Someone should use MAP to find the frontier of best ppl vs. size (or any other 2-dimensional metric).

@@ -224,13 +246,119 @@ static ggml_type parse_ggml_type(const char * arg) {
for (int j = 0; j < GGML_TYPE_COUNT; ++j) {
auto type = ggml_type(j);
const auto * name = ggml_type_name(type);
if (name && strcmp(arg, name) == 0) {
if (name && strcasecmp(arg, name) == 0) {
Member


strcasecmp is not available on every platform.
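
For what it's worth, a portable alternative (just a sketch, not part of the PR) could avoid the POSIX call entirely:

#include <cctype>

// Case-insensitive comparison without strcasecmp; returns true on match.
static bool str_equal_nocase(const char * a, const char * b) {
    for (; *a && *b; ++a, ++b) {
        if (std::tolower((unsigned char) *a) != std::tolower((unsigned char) *b)) {
            return false;
        }
    }
    return *a == *b; // only a match if both strings end together
}

The check above could then read name && str_equal_nocase(arg, name).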

@@ -14570,11 +14573,35 @@ static size_t llama_tensor_quantize_internal(enum ggml_type new_type, const floa
return new_size;
}

static bool match_string(const std::string& str, const std::string& pattern, uint32_t string_index = 0, uint32_t pattern_index = 0) {
Member


Suggested change
static bool match_string(const std::string& str, const std::string& pattern, uint32_t string_index = 0, uint32_t pattern_index = 0) {
static bool match_string(const std::string & str, const std::string & pattern, uint32_t string_index = 0, uint32_t pattern_index = 0) {

result = type; break;
}
}
return result;
}

static bool parse_kv_override(const char * data, std::vector<llama_model_kv_override> & overrides) {
Member


This function was moved to common.cpp; this definition should be removed from here.

@jukofyork
Collaborator

jukofyork commented Jun 27, 2024

Is this PR still working? I'd be interested to try it on the new deepseek-v2 models to see if using lower quants for the later layers is feasible.

@jukofyork
Collaborator

jukofyork commented Jun 27, 2024

This PR seems dead, but I have found where you can hack this in: llama.cpp::llama_tensor_get_type().

Interestingly, there look to be a lot of hard-coded tests for n_expert == 8 that might be hurting the quantization of some of the newer MoE models that use more experts, like dbrx, deepseek-v2, Qwen-MoE, etc:

The new "shared experts" might need thinking about now too:

  "n_routed_experts": 160,
  "n_shared_experts": 2,
  "num_experts_per_tok": 6

as they will be disproportionately affected by quantization (i.e. used 100% of the time vs 4/160 = 2.5% of the time, for the example deepseek-v2 config above).

@jukofyork
Collaborator

jukofyork commented Jun 27, 2024

static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
    const std::string name = ggml_get_name(tensor);

    // TODO: avoid hardcoded tensor names - use the TN_* constants
    const llm_arch arch = qs.model.arch;
    const auto       tn = LLM_TN(arch);

    auto use_more_bits = [](int i_layer, int num_layers) -> bool {
        return i_layer < num_layers/8 || i_layer >= 7*num_layers/8 || (i_layer - num_layers/8)%3 == 2;
    };
    const int n_expert = std::max(1, (int)qs.model.hparams.n_expert);
    auto layer_info = [n_expert] (int i_layer, int n_layer, const char * name) {
        if (n_expert > 1) {
            // Believe it or not, "experts" in the FFN of Mixtral-8x7B are not consecutive, but occasionally randomly
            // sprinkled in the model. Hence, simply dividing i_ffn_down by n_expert does not work
            // for getting the current layer as I initially thought, and we need to resort to parsing the
            // tensor name.
            if (sscanf(name, "blk.%d.", &i_layer) != 1) {
                throw std::runtime_error(format("Failed to determine layer for tensor %s", name));
            }
            if (i_layer < 0 || i_layer >= n_layer) {
                throw std::runtime_error(format("Bad layer %d for tensor %s. Must be in [0, %d)", i_layer, name, n_layer));
            }
        }
        return std::make_pair(i_layer, n_layer);
    };

    // for arches that share the same tensor between the token embeddings and the output, we quantize the token embeddings
    // with the quantization of the output tensor
    if (name == tn(LLM_TENSOR_OUTPUT, "weight") || (!qs.has_output && name == tn(LLM_TENSOR_TOKEN_EMBD, "weight"))) {
        if (qs.params->output_tensor_type < GGML_TYPE_COUNT) {
            new_type = qs.params->output_tensor_type;
        } else {
            int nx = tensor->ne[0];
            if (arch == LLM_ARCH_FALCON || nx % QK_K != 0) {
                new_type = GGML_TYPE_Q8_0;
            }
            else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS ||
                     ftype == LLAMA_FTYPE_MOSTLY_IQ1_S   || ftype == LLAMA_FTYPE_MOSTLY_IQ2_S  || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M   ||
                     ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
                new_type = GGML_TYPE_Q5_K;
            }
            else if (new_type != GGML_TYPE_Q8_0) {
                new_type = GGML_TYPE_Q6_K;
            }
        }
    } else if (name == "token_embd.weight") {
        if (qs.params->token_embedding_type < GGML_TYPE_COUNT) {
            new_type = qs.params->token_embedding_type;
        } else {
            if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS ||
                ftype == LLAMA_FTYPE_MOSTLY_IQ1_S   || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
                new_type = GGML_TYPE_Q2_K;
            }
            else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M) {
                new_type = GGML_TYPE_IQ3_S;
            }
            else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) {
                new_type = GGML_TYPE_IQ3_S;
            }
        }
    } else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ1_S ||
               ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M    || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
        if (name.find("attn_v.weight") != std::string::npos) {
            if (qs.model.hparams.n_gqa() >= 4 || qs.model.hparams.n_expert >= 4) new_type = GGML_TYPE_Q4_K;
            else new_type = ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M ? GGML_TYPE_IQ3_S : GGML_TYPE_Q2_K;
            ++qs.i_attention_wv;
        }
        else if (qs.model.hparams.n_expert == 8 && name.find("attn_k.weight") != std::string::npos) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (name.find("ffn_down") != std::string::npos) {
            if (qs.i_ffn_down < qs.n_ffn_down/8) {
                new_type = ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M ? GGML_TYPE_IQ3_S : GGML_TYPE_Q2_K;
            }
            ++qs.i_ffn_down;
        }
        else if (name.find("attn_output.weight") != std::string::npos) {
            if (qs.model.hparams.n_expert == 8) {
                new_type = GGML_TYPE_Q5_K;
            } else {
                if (ftype == LLAMA_FTYPE_MOSTLY_IQ1_S || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) new_type = GGML_TYPE_IQ2_XXS;
                else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M) new_type = GGML_TYPE_IQ3_S;
            }
        }
    } else if (name.find("attn_v.weight") != std::string::npos) {
        if      (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) {
            new_type = qs.model.hparams.n_gqa() >= 4 ? GGML_TYPE_Q4_K : GGML_TYPE_Q3_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S && qs.model.hparams.n_gqa() >= 4) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) {
            new_type = qs.model.hparams.n_gqa() >= 4 ? GGML_TYPE_Q4_K : !qs.has_imatrix ? GGML_TYPE_IQ3_S : GGML_TYPE_IQ3_XXS;
        }
        else if ((ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ3_S) && qs.model.hparams.n_gqa() >= 4) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_M) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M) {
            new_type = qs.i_attention_wv < 2 ? GGML_TYPE_Q5_K : GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) new_type = GGML_TYPE_Q5_K;
        else if ((ftype == LLAMA_FTYPE_MOSTLY_IQ4_NL || ftype == LLAMA_FTYPE_MOSTLY_IQ4_XS) && qs.model.hparams.n_gqa() >= 4) {
            new_type = GGML_TYPE_Q5_K;
        }
        else if ((ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) &&
                use_more_bits(qs.i_attention_wv, qs.n_attention_wv)) new_type = GGML_TYPE_Q6_K;
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S && qs.i_attention_wv < 4) new_type = GGML_TYPE_Q5_K;
        if (qs.model.type == MODEL_70B) {
            // In the 70B model we have 8 heads sharing the same attn_v weights. As a result, the attn_v.weight tensor is
            // 8x smaller compared to attn_q.weight. Hence, we can get a nice boost in quantization accuracy with
            // nearly negligible increase in model size by quantizing this tensor with more bits:
            if (new_type == GGML_TYPE_Q3_K || new_type == GGML_TYPE_Q4_K) new_type = GGML_TYPE_Q5_K;
        }
        if (qs.model.hparams.n_expert == 8) {
            // for the 8-expert model, bumping this to Q8_0 trades just ~128MB
            // TODO: explore better strategies
            new_type = GGML_TYPE_Q8_0;
        }
        ++qs.i_attention_wv;
    } else if (name.find("attn_k.weight") != std::string::npos) {
        if (qs.model.hparams.n_expert == 8) {
            // for the 8-expert model, bumping this to Q8_0 trades just ~128MB
            // TODO: explore better strategies
            new_type = GGML_TYPE_Q8_0;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) {
            new_type = GGML_TYPE_IQ2_S;
        }
    } else if (name.find("attn_q.weight") != std::string::npos) {
        if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) {
            new_type = GGML_TYPE_IQ2_S;
        }
    } else if (name.find("ffn_down") != std::string::npos) {
        auto info = layer_info(qs.i_ffn_down, qs.n_ffn_down, name.c_str());
        int i_layer = info.first, n_layer = info.second;
        if      (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) new_type = GGML_TYPE_Q3_K;
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S) {
            if (i_layer < n_layer/8) new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS && !qs.has_imatrix) {
            new_type = i_layer < n_layer/8 ? GGML_TYPE_Q4_K : GGML_TYPE_Q3_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M) {
            new_type = i_layer < n_layer/16 ? GGML_TYPE_Q5_K
                     : arch != LLM_ARCH_FALCON || use_more_bits(i_layer, n_layer) ? GGML_TYPE_Q4_K
                     : GGML_TYPE_Q3_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_M && (i_layer < n_layer/8 ||
                    (qs.model.hparams.n_expert == 8 && use_more_bits(i_layer, n_layer)))) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) {
            new_type = arch == LLM_ARCH_FALCON ? GGML_TYPE_Q4_K : GGML_TYPE_Q5_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            if (arch == LLM_ARCH_FALCON) {
                new_type = i_layer < n_layer/16 ? GGML_TYPE_Q6_K :
                           use_more_bits(i_layer, n_layer) ? GGML_TYPE_Q5_K : GGML_TYPE_Q4_K;
            } else {
                if (use_more_bits(i_layer, n_layer)) new_type = GGML_TYPE_Q6_K;
            }
        }
        else if (i_layer < n_layer/8 && (ftype == LLAMA_FTYPE_MOSTLY_IQ4_NL || ftype == LLAMA_FTYPE_MOSTLY_IQ4_XS) && !qs.has_imatrix) {
            new_type = GGML_TYPE_Q5_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M && use_more_bits(i_layer, n_layer)) new_type = GGML_TYPE_Q6_K;
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S && arch != LLM_ARCH_FALCON && i_layer < n_layer/8) {
            new_type = GGML_TYPE_Q5_K;
        }
        else if ((ftype == LLAMA_FTYPE_MOSTLY_Q4_0 || ftype == LLAMA_FTYPE_MOSTLY_Q5_0)
                && qs.has_imatrix && i_layer < n_layer/8) {
            // Guard against craziness in the first few ffn_down layers that can happen even with imatrix for Q4_0/Q5_0.
            // We only do it when an imatrix is provided because a) we want to make sure that one can always get the
            // same quantization as before imatrix stuff, and b) Q4_1/Q5_1 do go crazy on ffn_down without an imatrix.
            new_type = ftype == LLAMA_FTYPE_MOSTLY_Q4_0 ? GGML_TYPE_Q4_1 : GGML_TYPE_Q5_1;
        }
        ++qs.i_ffn_down;
    } else if (name.find("attn_output.weight") != std::string::npos) {
        if (arch != LLM_ARCH_FALCON) {
            if (qs.model.hparams.n_expert == 8) {
                if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K   || ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS ||
                    ftype == LLAMA_FTYPE_MOSTLY_Q3_K_S || ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M  || ftype == LLAMA_FTYPE_MOSTLY_IQ4_NL  ||
                    ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S || ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M  || ftype == LLAMA_FTYPE_MOSTLY_IQ3_S  ||
                    ftype == LLAMA_FTYPE_MOSTLY_IQ3_M  || ftype == LLAMA_FTYPE_MOSTLY_IQ4_XS) {
                    new_type = GGML_TYPE_Q5_K;
                }
            } else {
                if      (ftype == LLAMA_FTYPE_MOSTLY_Q2_K   ) new_type = GGML_TYPE_Q3_K;
                else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) new_type = GGML_TYPE_IQ3_S;
                else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M ) new_type = GGML_TYPE_Q4_K;
                else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L ) new_type = GGML_TYPE_Q5_K;
                else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_M  ) new_type = GGML_TYPE_Q4_K;
            }
        } else {
            if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) new_type = GGML_TYPE_Q4_K;
        }
    }
    else if (name.find("attn_qkv.weight") != std::string::npos) {
        if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L || ftype == LLAMA_FTYPE_MOSTLY_IQ3_M) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) new_type = GGML_TYPE_Q5_K;
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) new_type = GGML_TYPE_Q6_K;
    }
    else if (name.find("ffn_gate") != std::string::npos) {
        auto info = layer_info(qs.i_ffn_gate, qs.n_ffn_gate, name.c_str());
        int i_layer = info.first, n_layer = info.second;
        if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS && (i_layer >= n_layer/8 && i_layer < 7*n_layer/8)) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        ++qs.i_ffn_gate;
    }
    else if (name.find("ffn_up") != std::string::npos) {
        auto info = layer_info(qs.i_ffn_up, qs.n_ffn_up, name.c_str());
        int i_layer = info.first, n_layer = info.second;
        if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS && (i_layer >= n_layer/8 && i_layer < 7*n_layer/8)) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        ++qs.i_ffn_up;
    }

    //    if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) new_type = GGML_TYPE_Q3_K;
    //}
    // IK: let's remove this, else Q2_K is almost the same as Q3_K_S
    //else if (name.find("ffn_gate") != std::string::npos || name.find("ffn_up") != std::string::npos) {
    //    if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) new_type = GGML_TYPE_Q3_K;
    //}
    // This can be used to reduce the size of the Q5_K_S model.
    // The associated PPL increase is fully in line with the size reduction
    //else {
    //    if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_S) new_type = GGML_TYPE_Q4_K;
    //}
    bool convert_incompatible_tensor = false;
    if (new_type == GGML_TYPE_Q2_K || new_type == GGML_TYPE_Q3_K || new_type == GGML_TYPE_Q4_K ||
        new_type == GGML_TYPE_Q5_K || new_type == GGML_TYPE_Q6_K || new_type == GGML_TYPE_IQ4_XS ||
        new_type == GGML_TYPE_IQ2_XS || new_type == GGML_TYPE_IQ2_XXS || new_type == GGML_TYPE_IQ2_S ||
        new_type == GGML_TYPE_IQ3_XXS || new_type == GGML_TYPE_IQ1_S || new_type == GGML_TYPE_IQ3_S ||
        new_type == GGML_TYPE_IQ1_M) {
        int nx = tensor->ne[0];
        int ny = tensor->ne[1];
        if (nx % QK_K != 0) {
            LLAMA_LOG_WARN("\n\n%s : tensor cols %d x %d are not divisible by %d, required for %s", __func__, nx, ny, QK_K, ggml_type_name(new_type));
            convert_incompatible_tensor = true;
        } else {
            ++qs.n_k_quantized;
        }
    }
    if (convert_incompatible_tensor) {
        switch (new_type) {
            case GGML_TYPE_IQ2_XXS:
            case GGML_TYPE_IQ2_XS:
            case GGML_TYPE_IQ2_S:
            case GGML_TYPE_IQ3_XXS:
            case GGML_TYPE_IQ3_S:
            case GGML_TYPE_IQ1_S:
            case GGML_TYPE_IQ1_M:
            case GGML_TYPE_Q2_K:
            case GGML_TYPE_Q3_K:
            case GGML_TYPE_IQ4_XS: new_type = GGML_TYPE_IQ4_NL; break;
            case GGML_TYPE_Q4_K:   new_type = GGML_TYPE_Q5_0;   break;
            case GGML_TYPE_Q5_K:   new_type = GGML_TYPE_Q5_1;   break;
            case GGML_TYPE_Q6_K:   new_type = GGML_TYPE_Q8_0;   break;
            default: throw std::runtime_error("\nUnsupported tensor size encountered\n");
        }
        LLAMA_LOG_WARN(" - using fallback quantization %s\n", ggml_type_name(new_type));
        ++qs.n_fallback;
    }

    return new_type;
}

What criterion was used to find these combinations originally?

If we can test each configuration in a reasonable amount of time then it would be quite feasible to optimize this automatically using the Cross-Entropy Method (there is another version [for optimization] not shown on the Wikipedia page that optimizes discrete Bernoulli and/or categorical / "multinoulli" distributions [see Chapter 5 of Rubinstein's book]).

The dimensions are likely to be almost independent and it might even be nearly as easy to optimize "layer-index specific" quant schemes.

From previous experience using CEM on a highly independent set of variables like this, you would need to be able to perform a minimum of 10-20 evaluations per variable to be optimized (you need much, much more though if you need to assume a non-diagonal covariance matrix [or conditional dependence for the discrete case] - which I don't think this would need, and CMA-ES would be more suitable in that case anyway...).

It's very robust to noise so a noisy/quick evaluation criterion like perplexity will be preferable to a slow/precise criterion like KL-divergence.

One potential problem is if the optimization boundaries are hard to set due to say perplexity returning nan causing lots of samples to be discarded.

@slaren
Member

slaren commented Jun 27, 2024

What criterion was used to find these combinations originally?

The original combinations were created by @ikawrakow from his own tests on llama and llama2 models, as far as I know. I think it would be very good to be able to automate this process, I would expect that different models will benefit from different quantization schemes.

One potential problem is if the optimization boundaries are hard to set due to say perplexity returning nan causing lots of samples to be discarded.

This should never happen unless there is a bug in llama.cpp, in which case it needs to be fixed rather than ignored.

@jukofyork
Collaborator

jukofyork commented Jun 27, 2024

What criterion was used to find these combinations originally?

The original combinations were created by @ikawrakow from his own tests on llama and llama2 models, as far as I know. I think it would be very good to be able to automate this process, I would expect that different models will benefit from different quantization schemes.

One potential problem is if the optimization boundaries are hard to set due to say perplexity returning nan causing lots of samples to be discarded.

This should never happen unless there is a bug in llama.cpp, in which case it needs to be fixed rather than ignored.

I can't promise how soon I can look at it, but it is definitely possible even without understanding any of the original logic:

new_type = GGML_TYPE_XXX;

It would just need a modified version of the function that can select this from a categorical distribution (what people in the ML community have started calling "multinoulli").

The name "Cross Entropy Method" might sound intimidating, but it is actually super-simple:

  1. Randomly initialize the distribution for each variable (intelligently if possible).
  2. Take N samples (usually 100) from the distribution and evaluate each sample.
  3. Rank the samples and choose the top 0.1 * N of the samples.
  4. Calculate the new Maximum Likelihood distribution to use from these samples.
  5. Go to step 2.
  • For (1) you could set the initial categorical distribution to be weighted heavily towards @ikawrakow's choices and possibly also set hard boundaries on what you think are sensible for the memory size budget you are looking at.
  • For (2a), since we are assuming independence of the variables it will just be a simple "weighted roulette wheel" selection process.
  • For (2b), since we have a memory size budget this will have to be incorporated as a constraint into the evaluation via a penalty (a soft penalty preferably so as not to discard too many samples... You can progressively "harden" the penalty during the run to enforce the constraint though).
  • For (4), this just comes down to the empirical fraction of counts in each bin for the discrete case. You have to be slightly careful that none of the bins get set to zero (this is easily solved via Additive smoothing and IIRC explained in Rubinstein's book).

Just eyeballing the function there looks to be maybe 5-10 choices for a given model, so using a population of 100 and assuming 10-20 evaluations per variable: 5*100*10 = 5000 .. 10*100*20 = 20000 evaluations per model to be optimized (minimum), but it is likely a lot could be learnt from small models and used to constrain the search for larger models.

A week has 7*24*60 = 10080 minutes, so it would need to take no longer than 2-5 minutes per evaluation to be feasible IMO. It is very easy to parallelize using MPI though so could be run on a cluster of machines if needed.
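
To make that loop concrete, here is a toy, self-contained sketch of CEM over independent categorical variables (everything here is illustrative: the stand-in score() would in reality be a quantize + perplexity run, and the sizes are arbitrary placeholders, not anything from the PR):

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Toy sketch of the Cross-Entropy Method over independent categorical
// variables. Each "variable" would be one tensor-type choice; score()
// is a stand-in objective where lower is better.
int main() {
    const int n_vars = 8, n_choices = 4, pop = 100, elite = 10, iters = 30;
    std::mt19937 rng(42);

    // start from a uniform categorical distribution per variable
    std::vector<std::vector<double>> prob(n_vars, std::vector<double>(n_choices, 1.0 / n_choices));

    // stand-in objective: pretend choice 2 is best for every variable
    auto score = [](const std::vector<int> & x) {
        double s = 0.0;
        for (int v : x) s += (v - 2) * (v - 2);
        return s;
    };

    for (int it = 0; it < iters; ++it) {
        std::vector<std::vector<int>> samples(pop, std::vector<int>(n_vars));
        std::vector<double> scores(pop);
        for (int s = 0; s < pop; ++s) {
            for (int v = 0; v < n_vars; ++v) {
                std::discrete_distribution<int> d(prob[v].begin(), prob[v].end());
                samples[s][v] = d(rng); // "weighted roulette wheel" per variable
            }
            scores[s] = score(samples[s]);
        }
        // rank the population and keep the elite fraction
        std::vector<int> order(pop);
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(), [&](int a, int b) { return scores[a] < scores[b]; });
        // re-estimate each categorical from the elites, with additive smoothing
        for (int v = 0; v < n_vars; ++v) {
            std::vector<double> counts(n_choices, 0.1); // smoothing keeps bins non-zero
            for (int e = 0; e < elite; ++e) counts[samples[order[e]][v]] += 1.0;
            const double total = std::accumulate(counts.begin(), counts.end(), 0.0);
            for (int c = 0; c < n_choices; ++c) prob[v][c] = counts[c] / total;
        }
        std::printf("iter %d best score %.2f\n", it, scores[order[0]]);
    }
    return 0;
}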

@slaren
Member

slaren commented Jun 27, 2024

That sounds very interesting. I am not sure what parameter you would use to optimize a quantization scheme. I guess that what we really want to find is the Pareto front of all the quantization schemes that are not worse than any other scheme in both file size and perplexity at the same time. Not sure if that affects your estimation of the number of samples.

@jukofyork
Collaborator

jukofyork commented Jun 27, 2024

That sounds very interesting. I am not sure what parameter you would use to optimize a quantization scheme. I guess that what we really would want to find is the pareto front of all the quantization schemes that are not worse than any other scheme in both file size and perplexity at the same time. Not sure if that affects your estimation of the number of samples.

Generally, the more you constrain it and the more the independence assumption is broken, the more samples you need (i.e. it will waste samples trying to pass over the constraints and have trouble navigating non-orthogonal "valleys" otherwise).

If the independence assumption is very wrong then it's almost certainly better to use CMA-ES instead (CEM does have a version using a non-diagonal covariance matrix, but it requires a Cholesky Factorization to sample from and suffers from needing many more samples to reliably estimate the covariance matrix compared to CMA-ES's incremental method).

There are likely other things like using a clipped-Gaussian instead of a categorical distribution (as the choices are ordered) that can be tried to reduce the number of samples needed.

It works really well in practice and can often find solutions a human could not, due to the human getting stuck in a local optimum which they can't escape by tuning a single variable alone.


If the optimization landscape is very smooth and "nice" there are other methods that can use way fewer samples. Somebody with an OR background would likely be able to suggest even better ways of tackling this - I've just had success in the past using CEM for problems almost exactly like this (and SPSA for problems with homogeneous variables and low-noise evaluations available).

@jukofyork
Collaborator

jukofyork commented Jun 27, 2024

Sorry I missed this part of your question:

I guess that what we really would want to find is the pareto front of all the quantization schemes that are not worse than any other scheme in both file size and perplexity at the same time.

This can likely be found in a single run by starting with the maximum allowable memory budget, converging to a (fairly) stable solution, and then reducing the budget constraint/penalty downwards (or vice versa).

If you search for "L1 regularization path" you'll see plots like this found all in 1 run:

[image: L1 regularization path plot]

These plots are basically doing the same thing: reducing (or increasing) the penalty during a single run of the optimization algorithm.

@jukofyork
Collaborator

The new "shared experts" might need thinking about now too:

  "n_routed_experts": 160,
  "n_shared_experts": 2,
  "num_experts_per_tok": 6

as they will be disproportionately affected by quantization (i.e. used 100% of the time vs 4/160 = 2.5% of the time, for the example deepseek-v2 config above).

Out of luck trying to do anything with the "shared_experts":

[  28/ 959]           blk.1.ffn_down_exps.weight - [ 1536,  5120,   160,     1], type =    f16, converting to iq3_xxs .. size =  2400.00 MiB ->   459.38 MiB
[  29/ 959]           blk.1.ffn_gate_exps.weight - [ 5120,  1536,   160,     1], type =    f16, converting to iq3_xxs .. size =  2400.00 MiB ->   459.38 MiB
[  30/ 959]             blk.1.ffn_up_exps.weight - [ 5120,  1536,   160,     1], type =    f16, converting to iq3_xxs .. size =  2400.00 MiB ->   459.38 MiB

@jukofyork
Collaborator

jukofyork commented Jun 27, 2024

If I can get this PR working, or figure out how to hack the llama.cpp::llama_tensor_get_type() code, I'm going to try using bigger quants for the early layers' expert tensors and smaller ones for the later layers.

I can't find it now, but I read a paper that hypothesised the later layers don't do all that much and mostly just do averaging (link me if you know this paper, please!). This paper (which came later IIRC) also shows this:

https://arxiv.org/pdf/2403.17887

[image: figure from the paper]

Starting around the 60th-percentile layer in.

"Across models, the deeper layers tend to be very similar, though the deepest blocks that include the final layer (squares along the outer diagonal) are (near-)maximally dissimilar."


Charles Goddard (the Mergekit creator) tried the above method here:

https://huggingface.co/chargoddard/llama3-42b-v0

but I think it's got a much better chance keeping the layers and just having them more heavily quantized... Deepseek-v2 looks like the perfect model to try this on as it's 90% MLP.

@jukofyork
Collaborator

jukofyork commented Jun 28, 2024

The new "shared experts" might need thinking about now too:

  "n_routed_experts": 160,
  "n_shared_experts": 2,
  "num_experts_per_tok": 6

as they will be disproportionately affected by quantization (i.e. used 100% of the time vs 4/160 = 2.5% of the time, for the example deepseek-v2 config above).

Out of luck trying to do anything with the "shared_experts":

[  28/ 959]           blk.1.ffn_down_exps.weight - [ 1536,  5120,   160,     1], type =    f16, converting to iq3_xxs .. size =  2400.00 MiB ->   459.38 MiB
[  29/ 959]           blk.1.ffn_gate_exps.weight - [ 5120,  1536,   160,     1], type =    f16, converting to iq3_xxs .. size =  2400.00 MiB ->   459.38 MiB
[  30/ 959]             blk.1.ffn_up_exps.weight - [ 5120,  1536,   160,     1], type =    f16, converting to iq3_xxs .. size =  2400.00 MiB ->   459.38 MiB

Actually I've found they are in separate tensors and are named differently: ffn_up_shexp.weight, ffn_gate_shexp.weight, and ffn_down_shexp.weight.

I've also found the low-rank attn_q_a.weight, attn_q_b.weight, attn_kv_a_mqa.weight and attn_kv_b.weight tensors were falling through and getting quantized using the lowest default... This is very bad, as these are actually tiny compared to the rest of the giant MLP tensors, and the W.W^T products this creates will likely have O(((w-q)^2)^2) rate-distortion (i.e. 4th-power quantization error!).

So I've tried to look through the function to distil what @ikawrakow obviously must have spent hours figuring out, and have come up with this:

    // ### JUK'S DEEPSEEK V2 CUSTOM CONFIG (Use: 'llama-quantize --imatrix ... ... ... Q5_K_M') ###
    if (name == tn(LLM_TENSOR_OUTPUT, "weight")) {
         new_type = GGML_TYPE_Q6_K;
    } else if (name == "token_embd.weight") {
         new_type = GGML_TYPE_Q5_K;
    } else if (name.find("attn_q_a.weight") != std::string::npos || name.find("attn_q_b.weight") != std::string::npos) {
        new_type = GGML_TYPE_Q8_0;
    } else if (name.find("attn_kv_a_mqa.weight") != std::string::npos || name.find("attn_kv_b.weight") != std::string::npos) {
        new_type = GGML_TYPE_Q8_0;
        // ++qs.i_attention_wv; @@@ Looks to be used for 'use_more_bits' tests and not outside this function... @@@
    } else if (name.find("attn_output.weight") != std::string::npos) {
        new_type = GGML_TYPE_Q5_K;
    } else if (name.find("shexp.weight") != std::string::npos) {
        new_type = GGML_TYPE_Q8_0;
    } else if (name.find("ffn_down_exps.weight") != std::string::npos) {
        auto info = layer_info(qs.i_ffn_down, qs.n_ffn_down, name.c_str());
        int i_layer = info.first, n_layer = info.second;
        if (i_layer < n_layer/8 || i_layer >= 7*n_layer/8) {
            new_type = GGML_TYPE_IQ4_XS;
        }
        else {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        ++qs.i_ffn_down;
    } else if (name.find("ffn_gate_exps.weight") != std::string::npos) {
        auto info = layer_info(qs.i_ffn_gate, qs.n_ffn_gate, name.c_str());
        int i_layer = info.first, n_layer = info.second;
        if (i_layer < n_layer/8 || i_layer >= 7*n_layer/8) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        else {
            new_type = GGML_TYPE_IQ2_S;
        }
        ++qs.i_ffn_gate;
    } else if (name.find("ffn_up_exps.weight") != std::string::npos) {
        auto info = layer_info(qs.i_ffn_up, qs.n_ffn_up, name.c_str());
        int i_layer = info.first, n_layer = info.second;
        if (i_layer < n_layer/8 || i_layer >= 7*n_layer/8) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        else {
            new_type = GGML_TYPE_IQ2_S;
        }
        ++qs.i_ffn_up;
    } else
    // ### JUK ###

It needs to be copied right before this line:

if (name == tn(LLM_TENSOR_OUTPUT, "weight") || (!qs.has_output && name == tn(LLM_TENSOR_TOKEN_EMBD, "weight")))

The mix of GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS and GGML_TYPE_IQ2_S is just my attempt at getting this to fit in 96GB of VRAM...
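
As a standalone illustration of the first/last-eighth split used in the block above (this is not llama.cpp code; the 60-layer count and the two type names are just assumptions for the example):

// sketch of the "first and last n_layer/8 get the bigger type" rule
#include <cstdio>

static const char * pick_type(int i_layer, int n_layer) {
    // outer band: the first eighth and last eighth of the layers get the larger quant
    if (i_layer < n_layer/8 || i_layer >= 7*n_layer/8) {
        return "IQ4_XS";
    }
    return "IQ3_XXS";  // middle layers get the smaller quant
}

int main() {
    const int n_layer = 60;  // assumed layer count, purely for illustration
    for (int i = 0; i < n_layer; ++i) {
        printf("blk.%d.ffn_down_exps.weight -> %s\n", i, pick_type(i, n_layer));
    }
    return 0;
}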

Hopefully this helps, because the IQ3_XXS version I made with the stock settings and the problems outlined above (which let me fit a whopping 1K of context in 96GB VRAM!) was as dumb as a post... 😦


I will also try just leaving all of these as f16 later, since they are tiny compared to everything else, and the ffn_gate_inp.weight routing tensors are already left as f32 for the same reason:

[  16/ 959]          blk.1.ffn_down_shexp.weight - [ 3072,  5120,     1,     1], type =    f16, converting to q8_0 .. size =    30.00 MiB ->    15.94 MiB
[  17/ 959]          blk.1.ffn_gate_shexp.weight - [ 5120,  3072,     1,     1], type =    f16, converting to q8_0 .. size =    30.00 MiB ->    15.94 MiB
[  18/ 959]            blk.1.ffn_up_shexp.weight - [ 5120,  3072,     1,     1], type =    f16, converting to q8_0 .. size =    30.00 MiB ->    15.94 MiB
[  20/ 959]           blk.1.attn_kv_a_mqa.weight - [ 5120,   576,     1,     1], type =    f16, converting to q8_0 .. size =     5.62 MiB ->     2.99 MiB
[  21/ 959]               blk.1.attn_kv_b.weight - [  512, 32768,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  24/ 959]                blk.1.attn_q_a.weight - [ 5120,  1536,     1,     1], type =    f16, converting to q8_0 .. size =    15.00 MiB ->     7.97 MiB
[  25/ 959]                blk.1.attn_q_b.weight - [ 1536, 24576,     1,     1], type =    f16, converting to q8_0 .. size =    72.00 MiB ->    38.25 MiB

@jukofyork
Collaborator

Yeah, I think quantizing the low-rank attention weights was absolutely killing the model... I've put in a PR to fix this: #8194.

@HaroldBenoit

Giving some usage feedback so that this gets merged.

This PR works almost out of the box (you just need to specify the quant types in lowercase instead of uppercase in the config, plus a few tweaks to adapt it to the latest llama.cpp state).

It introduces very useful functionality. Props to @jubruckne !
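
For anyone else trying it, a hypothetical quant.cfg along these lines illustrates the lowercase convention mentioned above (the exact keys and syntax here are my assumption and depend on the state of the PR's parser):

# hypothetical quant.cfg sketch (syntax is an assumption, not taken from the PR)
default=q5_k
token_embd.weight=q8_0
output.weight=q6_k
blk.0.ffn_down.weight=q6_k
blk.1.ffn_down.weight=q6_k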

@ThomasBaruzier

Hello,

I'd like to share a few findings and explain why this PR could be very beneficial for quant quality.
I ran quantization and perplexity tests on around 20 models across many different architectures, and I noticed that the damage caused by quantization varies a lot from architecture to architecture.

For example, the best candidate is the Gemma 2 series, especially at 9B (imat):

| Quant | Size (MB) | Perplexity (PPL) | Size (%) | Accuracy (%) | PPL error rate |
|---|---|---|---|---|---|
| IQ1_S | 2269 | 16.0064 | 12.87 | 54.91 | 0.12077 |
| IQ1_M | 2429 | 13.7255 | 13.77 | 64.03 | 0.10272 |
| IQ2_XXS | 2695 | 11.269 | 15.28 | 77.99 | 0.08345 |
| IQ2_XS | 2926 | 10.5628 | 16.59 | 83.2 | 0.07809 |
| IQ2_S | 3063 | 10.3671 | 17.37 | 84.77 | 0.07772 |
| IQ2_M | 3276 | 9.7973 | 18.58 | 89.7 | 0.07298 |
| Q2_K_S | 3388 | 9.9206 | 19.21 | 88.59 | 0.07247 |
| IQ3_XXS | 3621 | 9.3955 | 20.53 | 93.54 | 0.06962 |
| Q2_K | 3630 | 9.421 | 20.58 | 93.29 | 0.0683 |
| IQ3_XS | 3953 | 9.2545 | 22.42 | 94.96 | 0.06868 |
| IQ3_S | 4137 | 9.2127 | 23.46 | 95.39 | 0.06866 |
| Q3_K_S | 4137 | 9.083 | 23.46 | 96.76 | 0.06618 |
| IQ3_M | 4287 | 8.9791 | 24.31 | 97.88 | 0.06614 |
| Q3_K_M | 4542 | 9.0172 | 25.76 | 97.46 | 0.06684 |
| Q3_K_L | 4895 | 8.9965 | 27.76 | 97.69 | 0.06675 |
| IQ4_XS | 4943 | 8.8286 | 28.03 | 99.54 | 0.06504 |
| IQ4_NL | 5191 | 8.8235 | 29.44 | 99.6 | 0.06496 |
| Q4_0 | 5207 | 8.834 | 29.53 | 99.48 | 0.0648 |
| Q4_K_S | 5226 | 8.829 | 29.63 | 99.54 | 0.06513 |
| Q4_K_M | 5495 | 8.8069 | 31.16 | 99.79 | 0.06493 |
| Q4_1 | 5688 | 8.8395 | 32.25 | 99.42 | 0.06526 |
| Q5_K_S | 6184 | 8.8011 | 35.07 | 99.86 | 0.06504 |
| Q5_0 | 6199 | 8.7668 | 35.15 | 100.25 | 0.06455 |
| Q5_K_M | 6340 | 8.7993 | 35.95 | 99.88 | 0.06506 |
| Q5_1 | 6680 | 8.7888 | 37.88 | 100 | 0.06493 |
| Q6_K | 7238 | 8.7863 | 41.04 | 100.02 | 0.06497 |
| Q8_0 | 9372 | 8.7858 | 53.14 | 100.03 | 0.06497 |
| F16 | 17635 | 8.7884 | 100 | 100 | 0.06501 |

Note that IQ1_S is 12.87% of the size of F16, while scoring 16.0064 PPL (54.91% accuracy relative to F16).
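
As a side note, the Size (%) and Accuracy (%) columns look like simple ratios against the F16 row; this is my reading of the tables, not the author's script:

// sketch of how the ratio columns appear to be derived (an assumption on my part)
#include <cstdio>

int main() {
    const double size_f16 = 17635.0, ppl_f16 = 8.7884;  // F16 row, Gemma 2 9B table
    const double size_q   = 2269.0,  ppl_q   = 16.0064; // IQ1_S row
    printf("Size (%%)     = %.2f\n", 100.0 * size_q / size_f16); // ~12.87
    printf("Accuracy (%%) = %.2f\n", 100.0 * ppl_f16 / ppl_q);   // ~54.91
    return 0;
}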

On the other end, here is the same table for Llama 3.1 8B (imat):

| Quant | Size (MB) | Perplexity (PPL) | Size (%) | Accuracy (%) | PPL error rate |
|---|---|---|---|---|---|
| IQ1_S | 1927 | 63.4502 | 12.57 | 11.54 | 0.42632 |
| IQ1_M | 2062 | 26.8862 | 13.46 | 27.24 | 0.18531 |
| IQ2_XXS | 2289 | 15.1538 | 14.94 | 48.32 | 0.10214 |
| IQ2_XS | 2486 | 11.771 | 16.22 | 62.21 | 0.07778 |
| IQ2_S | 2631 | 10.5231 | 17.17 | 69.59 | 0.06974 |
| IQ2_M | 2812 | 9.5838 | 18.35 | 76.41 | 0.0623 |
| Q2_K_S | 2851 | 10.6111 | 18.6 | 69.01 | 0.07128 |
| Q2_K | 3032 | 9.857 | 19.78 | 74.29 | 0.0636 |
| IQ3_XXS | 3124 | 8.4925 | 20.38 | 86.22 | 0.05406 |
| IQ3_XS | 3356 | 8.1907 | 21.9 | 89.4 | 0.05164 |
| Q3_K_S | 3495 | 8.9094 | 22.81 | 82.19 | 0.05599 |
| IQ3_S | 3512 | 7.973 | 22.92 | 91.84 | 0.05015 |
| IQ3_M | 3610 | 7.9137 | 23.56 | 92.53 | 0.04941 |
| Q3_K_M | 3833 | 7.8494 | 25.01 | 93.29 | 0.05019 |
| Q3_K_L | 4122 | 7.7578 | 26.9 | 94.39 | 0.0497 |
| IQ4_XS | 4242 | 7.5211 | 27.68 | 97.36 | 0.04819 |
| Q4_0 | 4460 | 7.6204 | 29.1 | 96.09 | 0.04851 |
| IQ4_NL | 4462 | 7.5144 | 29.12 | 97.45 | 0.04819 |
| Q4_K_S | 4476 | 7.5198 | 29.21 | 97.38 | 0.048 |
| Q4_K_M | 4693 | 7.4975 | 30.62 | 97.67 | 0.04794 |
| Q4_1 | 4893 | 7.5353 | 31.93 | 97.18 | 0.04808 |
| Q5_K_S | 5340 | 7.3903 | 34.85 | 99.08 | 0.04728 |
| Q5_0 | 5354 | 7.3982 | 34.94 | 98.98 | 0.04728 |
| Q5_K_M | 5468 | 7.3962 | 35.68 | 99 | 0.04739 |
| Q5_1 | 5788 | 7.3843 | 37.77 | 99.16 | 0.04721 |
| Q6_K | 6291 | 7.3538 | 41.05 | 99.58 | 0.04696 |
| Q8_0 | 8146 | 7.3279 | 53.15 | 99.93 | 0.04677 |
| F16 | 15325 | 7.3226 | 100 | 100 | 0.04674 |

Note that IQ1_S is 12.57% of the size of F16, while scoring 63.4502 PPL (11.54% accuracy relative to F16).

One could argue that smaller models are simply more prone to quantization damage. But even so, Qwen 2.5 14B (imat), despite having 5B more parameters, suffers more than Gemma 2 9B:

| Quant | Size (MB) | Perplexity (PPL) | Size (%) | Accuracy (%) | PPL error rate |
|---|---|---|---|---|---|
| IQ1_S | 3441 | 22.0082 | 12.21 | 27.14 | 0.16818 |
| IQ1_M | 3693 | 15.079 | 13.11 | 39.62 | 0.1106 |
| IQ2_XXS | 4114 | 9.6047 | 14.6 | 62.2 | 0.06625 |
| IQ2_XS | 4487 | 8.3649 | 15.92 | 71.41 | 0.05574 |
| IQ2_S | 4772 | 8.1942 | 16.93 | 72.9 | 0.0548 |
| IQ2_M | 5109 | 7.7261 | 18.13 | 77.32 | 0.05177 |
| Q2_K_S | 5148 | 8.0641 | 18.27 | 74.08 | 0.0549 |
| Q2_K | 5504 | 7.6005 | 19.53 | 78.6 | 0.05146 |
| IQ3_XXS | 5672 | 6.9285 | 20.13 | 86.22 | 0.04547 |
| IQ3_XS | 6088 | 6.721 | 21.6 | 88.88 | 0.04329 |
| Q3_K_S | 6352 | 6.8697 | 22.54 | 86.96 | 0.04576 |
| IQ3_S | 6383 | 6.6246 | 22.65 | 90.17 | 0.04285 |
| IQ3_M | 6597 | 6.6359 | 23.41 | 90.02 | 0.04256 |
| Q3_K_M | 7000 | 6.5281 | 24.84 | 91.51 | 0.043 |
| Q3_K_L | 7558 | 6.4323 | 26.82 | 92.87 | 0.04211 |
| IQ4_XS | 7744 | 6.2005 | 27.48 | 96.34 | 0.04022 |
| Q4_0 | 8149 | 6.2928 | 28.92 | 94.93 | 0.04095 |
| IQ4_NL | 8154 | 6.208 | 28.94 | 96.23 | 0.04032 |
| Q4_K_S | 8177 | 6.163 | 29.02 | 96.93 | 0.03976 |
| Q4_K_M | 8572 | 6.1311 | 30.42 | 97.43 | 0.03957 |
| Q4_1 | 8958 | 6.1674 | 31.79 | 96.86 | 0.03981 |
| Q5_K_S | 9791 | 6.0411 | 34.75 | 98.88 | 0.03886 |
| Q5_0 | 9817 | 6.0504 | 34.84 | 98.73 | 0.03895 |
| Q5_K_M | 10023 | 6.0389 | 35.57 | 98.92 | 0.03888 |
| Q5_1 | 10625 | 6.0366 | 37.71 | 98.96 | 0.03885 |
| Q6_K | 11564 | 6.0004 | 41.04 | 99.56 | 0.0386 |
| Q8_0 | 14975 | 5.9821 | 53.14 | 99.86 | 0.03842 |
| F16 | 28179 | 5.9737 | 100 | 100 | 0.03835 |

Note that IQ1_S is 12.21% of the size of F16, while scoring 22.0082 PPL (27.14% accuracy relative to F16).

Conclusion

IQ1_S is admittedly a poor example in terms of practical use cases; however, it shows how differently quantization can affect models depending on their architecture, and how much custom quants could improve quality.

I'd like to try this PR and "brute-force" my way down to the lowest perplexity achievable on the models that suffer the most from quantization. I really hope this gets merged soon.

@Djip007
Contributor

Djip007 commented Oct 18, 2024

I like this idea of having configurable/custom quantization. But wouldn't it be simpler to use std::regex rather than hand-coding a match_string?

And I think a JSON library is already used in llama.cpp, so maybe we could use it for the "quant.cfg" file?
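
A minimal sketch of what the std::regex approach could look like (the pattern and tensor names here are illustrative, not taken from the PR):

// minimal std::regex name-matching sketch (illustrative pattern and names)
#include <cstdio>
#include <regex>
#include <string>

int main() {
    // override every blk.N.ffn_down_exps.weight tensor, leave the rest on the default type
    const std::regex pattern(R"(blk\.\d+\.ffn_down_exps\.weight)");
    const std::string names[] = {
        "blk.1.ffn_down_exps.weight",
        "blk.12.ffn_gate_exps.weight",
        "output.weight",
    };
    for (const auto & name : names) {
        printf("%-30s -> %s\n", name.c_str(),
               std::regex_match(name, pattern) ? "override" : "default");
    }
    return 0;
}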

@ddh0
Contributor

ddh0 commented Oct 18, 2024

> I like this idea of having configurable/custom quantization. But wouldn't it be simpler to use std::regex rather than hand-coding a match_string?
>
> And I think a JSON library is already used in llama.cpp, so maybe we could use it for the "quant.cfg" file?

I agree, JSON would be more accessible for me and probably many others.

@ddh0
Contributor

ddh0 commented Oct 30, 2024

I've just tried to build jubruckne/llama.cpp as of commit 20b2243 and it's failing with these errors:

examples/quantize/quantize.cpp:256:13: error: static declaration of 'parse_kv_override' follows non-static declaration
  256 | static bool parse_kv_override(const char * data, std::vector<llama_model_kv_override> & overrides) {
      |             ^
common/common.h:180:6: note: previous declaration is here
  180 | bool parse_kv_override(const char * data, std::vector<llama_model_kv_override> & overrides);
      |      ^
examples/quantize/quantize.cpp:269:13: error: no member named 'int_value' in 'llama_model_kv_override'
  269 |         kvo.int_value = std::atol(sep);
      |         ~~~ ^
examples/quantize/quantize.cpp:273:13: error: no member named 'float_value' in 'llama_model_kv_override'
  273 |         kvo.float_value = std::atof(sep);
      |         ~~~ ^
examples/quantize/quantize.cpp:278:17: error: no member named 'bool_value' in 'llama_model_kv_override'
  278 |             kvo.bool_value = true;
      |             ~~~ ^
examples/quantize/quantize.cpp:280:17: error: no member named 'bool_value' in 'llama_model_kv_override'
  280 |             kvo.bool_value = false;
      |             ~~~ ^
examples/quantize/quantize.cpp:390:39: error: call to 'parse_kv_override' is ambiguous
  390 |             if (arg_idx == argc-1 || !parse_kv_override(argv[++arg_idx], kv_overrides)) {
      |                                       ^~~~~~~~~~~~~~~~~
common/common.h:180:6: note: candidate function
  180 | bool parse_kv_override(const char * data, std::vector<llama_model_kv_override> & overrides);
      |      ^
examples/quantize/quantize.cpp:256:13: note: candidate function
  256 | static bool parse_kv_override(const char * data, std::vector<llama_model_kv_override> & overrides) {
      |             ^
6 errors generated.
make: *** [quantize] Error 1
make: *** Waiting for unfinished jobs....

I'd really love to see this functionality make its way into master. @jubruckne, do you still have plans to work on this?

To @slaren and @cebtenzzre: other than the errors above, what else needs to be done before this is ready for review?

Thank you everyone.

@slaren
Member

slaren commented Oct 31, 2024

I think it is important that the implementation of custom quantization schemes can be used to replace the current logic. That is to say, it should be possible to remove the code for the current quantization schemes and express them instead as inputs to the custom quantization mechanism. Otherwise, we would just be adding more complexity on top of already too complex code. I don't know what the current state of this PR is.
