
Custom quantization schemes #6844

Open · wants to merge 7 commits into base: master

Conversation

jubruckne

This is not ready to merge, but I wanted to get your opinion on whether it's something you'd be interested in including. If so, I can clean it up and improve it a little.

The idea is to allow creating a custom quantization mix by reading the per-layer quant type from a config file, by specifying CUSTOM as the type, like so:

./quantize --allow-requantize ../models/Meta-Llama-3-8B-Instruct.Q8_0.gguf ./llama3-q.gguf CUSTOM

The config file is currently hardcoded to read quant.cfg from the current directory (a sample cfg is included). In the config file I allow specifying a default type for tensors that are not explicitly overridden, plus tensor name / type pairs for the overrides.
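
For illustration, reading such a config could be as simple as the sketch below (the function name read_quant_config and the string-based override list are made up for this example, not taken from the PR):

#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Read "tensor_name=type" pairs from a quant.cfg-style file, skipping
// comments and blank lines. A "ftype=..." line sets the default type.
static bool read_quant_config(const std::string & path,
                              std::string & default_ftype,
                              std::vector<std::pair<std::string, std::string>> & overrides) {
    std::ifstream in(path);
    if (!in) {
        return false;
    }
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') {
            continue;
        }
        const size_t pos = line.find('=');
        if (pos == std::string::npos) {
            continue;
        }
        const std::string key   = line.substr(0, pos);
        const std::string value = line.substr(pos + 1);
        if (key == "ftype") {
            default_ftype = value;
        } else {
            overrides.emplace_back(key, value);
        }
    }
    return true;
}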

Possible improvements would be:

  • specifying types as strings instead of enum values ("Q3_K" instead of 11)
  • wildcards or regex to specify tensor names (like "blk\d{2}.ffn_down.weight")
  • allowing relative types (like -1, +2): if the default is Q8_K and quant.cfg says -1 for a tensor, you get Q6_K
  • making the quant.cfg filename configurable via a command-line switch.

@askmyteapot

This would be handy, as I like to experiment with different custom quants, and it's a little clunky having to modify and rebuild llama.cpp every time I want to change something.
For example, I found that with Mixtral, having token_embed and attn_v/k/q/output as Q6_K with iq4_xs weights typically scores better than the standard iq4_xs. Weirdly, it even slightly outperforms the same mix with those tensors at Q8_0 and iq4_xs weights.

@ggerganov
Member

Yes, this functionality is welcome

@Nexesenex
Contributor

Nexesenex commented Apr 23, 2024

Excellent idea. I wanted to see such a feature but am unable to do it myself.
I will use it... a lot!

All the possible improvements you mention are pertinent.

Also, this tool should ideally feature variable quantization.
For example, it can be useful to quantize a fraction of a given weight with one quant and the other half with another.
Example: the ffn.down.weight is usually the "lead" of the 3 FFN weights in terms of influence over perplexity. Simply quantizing half of the ffn.down.weight with the immediately superior quant gives a very good perplexity shrink on most models, not to speak of other benches like ARC.

Moreover, and that's a bit more complex, the ideal combination might be a customizable form of the "more_bits" feature (search for it in the llama.cpp file) to make that partial quant, and to select a layer range of a given weight to quantize with a higher quant.
Example: take a 70b model, with 80 layers, with a LLAMA_FTYPE IQ2_S.
I'd like to quantize the ffn.down.weight as such without recompiling LlamaCPP (see the sketch after this list):

  • 10 (or any number) first layers in IQ3_XXS.
  • one every x layers in IQ3_XXS between layers 11 and 70 (for example).
  • 10 (or any number) last layers in IQ3_XXS.
  • the rest in IQ2_S.
    Of course, these numbers are arbitrary, and I'd be curious to know which layers are actually the most influential in a model, and thus deserve the higher bitrate of a variable quant.
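
As a rough sketch of what such a rule could look like in code (illustrative only: the edge size, the x step, and the use of plain strings in place of GGML types are all assumptions, not anything from llama.cpp):

// Illustrative only: pick a quant for ffn_down by layer index, following the
// "first N layers / every x-th middle layer / last N layers" idea above.
static const char * ffn_down_quant_for_layer(int i_layer, int n_layer) {
    const int n_edge = 10; // number of first/last layers to boost (arbitrary)
    const int x      = 4;  // boost every x-th layer in the middle (arbitrary)
    if (i_layer < n_edge || i_layer >= n_layer - n_edge) {
        return "IQ3_XXS";
    }
    if ((i_layer - n_edge) % x == 0) {
        return "IQ3_XXS";
    }
    return "IQ2_S";
}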

I'm currently toying with the code in the llama.cpp file, and it's quite indigestible and not practical, especially because 2 approaches were used to define the quant strategies:

  • The IQ1 and IQ2 quant strategies are a tree, the weights being branches.
  • The other quants (IQ and Q) are branches in per-weight trees.
    That coexistence of 2 approaches is confusing to me, and should ideally be harmonized into either one (by weight) or the other (by quant strategy).

@jubruckne
Author

I'd like to quantize the ffn.down.weight as such without recompiling LlamaCPP

Yeah, that's the idea. I actually explained my intentions slightly incorrectly in the first post above. It's actually about allowing individual quantisation for each tensor (not layer). So you can have a config file like this:

# use default quantisation of Q8_0
ftype=7

# override tensors matching a pattern with a specific quant:
blk.10.ffn_up.weight=7
blk.1?.ffn_up.weight=10
blk.2?.ffn_up.weight=10
blk.1?.attn*=23
blk.2?.attn*=23
*down*=14
*gate*=12
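
For reference, a minimal recursive matcher for ?/* patterns like the ones above could look like the sketch below (illustrative; not necessarily how the PR's matching is implemented):

#include <string>

// Match a tensor name against a pattern where '?' matches one character
// and '*' matches any (possibly empty) run of characters.
static bool name_matches(const std::string & name, const std::string & pattern,
                         size_t si = 0, size_t pi = 0) {
    while (pi < pattern.size()) {
        if (pattern[pi] == '*') {
            // try to match the rest of the pattern at every remaining position
            for (size_t k = si; k <= name.size(); ++k) {
                if (name_matches(name, pattern, k, pi + 1)) {
                    return true;
                }
            }
            return false;
        }
        if (si >= name.size()) {
            return false;
        }
        if (pattern[pi] != '?' && pattern[pi] != name[si]) {
            return false;
        }
        ++si;
        ++pi;
    }
    return si == name.size();
}

For example, name_matches("blk.12.ffn_up.weight", "blk.1?.ffn_up.weight") returns true, while name_matches("blk.3.ffn_up.weight", "blk.1?.ffn_up.weight") does not.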

@jubruckne
Author

Moreover, and that's a bit more complex, the ideal combination might be to be able to use a customizable form "more_bits feature" (query it in the llama.cpp file) to make that partial quant, and to select a layer range of a given weight to quantize with a higher quant.

Exactly my plan :) The idea here would be that instead of setting a specific quant type, increments of +1, +2, -1, ... relative to the default could be used. For example:

# use default quantisation of Q4_K_S
ftype=14

# override tensors matching a pattern with a specific quant:
*ffn_up.weight=+1
*ffn_down.weight=-1

The challenge lies in defining what the sequences of quant types should be. One possibility is to establish a sequence that transitions between similar quant types of different "bit" rates, such as from x_K to x+1_K or from IQx_S to IQx-1_S. For example:

  1. IQ1_S, IQ1_M
  2. IQ2_XXS, IQ2_XS, IQ2_S, Q2_K
  3. IQ3_XXS, IQ3_S, Q3_K
  4. Q4_0, Q4_1, Q4_K, IQ4_XS, IQ4_NL
  5. Q5_0, Q5_1, Q5_K
  6. Q6_K
  7. Q8_K

Using this sequence, a default of Q4_K would transition to Q5_K with a +1 adjustment and to Q3_K with a -1.
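
To illustrate, the relative lookup could just index into such a sequence (a sketch under the assumption of a single flat ladder; the ladder below is a simplified stand-in with one representative type per step, not a proposal):

#include <algorithm>
#include <string>
#include <vector>

// Step up or down a (simplified) ladder of quant types by a relative offset,
// clamping at both ends.
static std::string apply_offset(const std::string & base, int offset) {
    static const std::vector<std::string> ladder = {
        "IQ1_S", "Q2_K", "Q3_K", "Q4_K", "Q5_K", "Q6_K", "Q8_0",
    };
    const auto it = std::find(ladder.begin(), ladder.end(), base);
    if (it == ladder.end()) {
        return base; // unknown type: leave unchanged
    }
    const int idx     = (int)(it - ladder.begin()) + offset;
    const int clamped = std::max(0, std::min((int)ladder.size() - 1, idx));
    return ladder[clamped];
}

With this ladder, apply_offset("Q4_K", +1) gives Q5_K and apply_offset("Q4_K", -1) gives Q3_K, matching the example above.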

A more detailed sequence might look like this:

  1. IQ1_S
  2. IQ1_M
  3. IQ2_XXS
  4. IQ2_XS
  5. IQ2_S
  6. Q2_K
  7. IQ3_XXS
  8. IQ3_S
  9. Q3_K
  10. Q4_0, Q4_1, Q4_K
  11. IQ4_XS, IQ4_NL
  12. Q5_0, Q5_1, Q5_K
  13. Q6_K
  14. Q8_K

However, now that I've tried to write down sensible sequences, I realize that defining one that is universally applicable is challenging due to the varying nature of the quant types, and it probably doesn't make sense in most cases.

Any thoughts?

@Nexesenex
Contributor

Nexesenex commented Apr 24, 2024

Well, ideally the whole pattern would be definable so the system can be applied universally. There's no premade recipe that has consensus, nor should there be, because we are still empirically discovering the effects of particular quantization strategies as we try them.

Here's a reformulation of my idea, compatible with your plans:

  • Optionally define, for each tensor, an offset, either relative (+1, -1) or as an absolute GGML_TYPE.
  • Optionally define, within a tensor, one or several layer ranges (relative or absolute) to be quantized either with the baseline quant, with a relative offset to the baseline quant, with a GGML_TYPE, or with a mix of 2 quants over a given layer interval.

Example in plain words, for a tensor chosen for customized quantization away from a base quantization strategy; here a base Q4_K defined on a 70b L2 model with 80 layers, on which we want to customize the ffn.down without even using Q4_K for the sake of the example:
ffn.down -> layers 1:15 (or first 20%): Q5_K (or +1); layers 16:65: Q5_K (or +1) every x layers, the rest Q3_K (or -1); layers 66:80 (or last 20%): Q5_K (or +1)
The "every x layers" pattern could of course also be applied to the first or last range of layers, and not only the intermediate one.

That might require a slight overhaul of the quant strategy part of llama.cpp, and potentially a harmonization of its hierarchical tree with respect to the IQ1 and IQ2 groups, but if possible, that'd offer the widest range of possibilities.

I'm sorry for my lack of coding proficiency; I have no background in coding beyond mimicking what I see, understanding and adapting a few formatting tricks, and changing values.

@jubruckne
Author

jubruckne commented Apr 25, 2024

I think this should be ready. I added parsing of enum values (so that friendly names like Q8_0 can be used instead of their numeric values), wildcards for tensor names, and the possibility to specify the cfg file to use.

To use, specify the new CUSTOM type on ./quantize like so:
./quantize ../models/Phi-3-mini-4k-instruct-fp16.gguf ./phi3-q.gguf CUSTOM:quant.cfg

The quant.cfg should be pretty self-explanatory:

# Defines the default ftype (the quantization mix code 
# that you pass to quantize if you're not using a custom mix).
# Tensors that are not overridden below will be quantized 
# according to this mix.
#
# Must be one of
#    Q4_0, Q4_1, Q5_0, Q5_1, IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M, 
#    IQ1_S, IQ1_M, Q2_K, Q2_K_S, IQ3_XXS, IQ3_S, IQ3_M, Q3_K,
#    IQ3_XS, Q3_K_S, Q3_K_M, Q3_K_L, IQ4_NL, IQ4_XS, Q4_K, 
#    Q4_K_S, Q4_K_M, Q5_K, Q5_K_S, Q5_K_M, Q6_K, Q8_0, F16

ftype=Q6_K

# Defines overrides for tensors with names matching a given 
# string. Filters are processed in order given, the first 
# matching will be used. 
#
# Wildcards are allowed:
#     ? single character
#     * multiple characters
#
# Type must be one of 
#     F16, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1, Q2_K, Q3_K, 
#     Q4_K, Q5_K, Q6_K, Q8_K, IQ2_XXS, IQ2_XS, IQ3_XXS, 
#     IQ1_S, IQ4_NL, IQ3_S, IQ2_S, IQ4_XS, IQ1_M

blk.10.ffn_up.weight=Q5_K
blk.1?.ffn_up.weight=Q4_K
blk.23.*=Q2_K
blk.24.*=Q2_K
blk.25.*=Q2_K
blk.2?.ffn_up.weight=Q4_K
*_gate*=Q4_K
*.attn*=IQ4_XS
*_down*=IQ3_S
output.weight=Q5_K

@jubruckne jubruckne marked this pull request as ready for review April 25, 2024 09:50
Contributor

github-actions bot commented Apr 25, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 556 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8437.61ms p(95)=19797.62ms fails=, finish reason: stop=488 truncated=68
  • Prompt processing (pp): avg=93.93tk/s p(95)=352.4tk/s
  • Token generation (tg): avg=33.41tk/s p(95)=49.02tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=20b22433f0cf941c1b43e27c086e2ef71798fd57

[benchmark charts: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing]

if (pos != std::string::npos) {
    std::string tensor_name = line.substr(0, pos);
    std::string type_name = line.substr(pos + 1);
    ggml_type type = parse_ggml_type(type_name.c_str());
Collaborator


Shouldn't the configuration describe tensors by data type (from enum ggml_type), not file type (from enum llama_ftype)? E.g. Q3_K_S, Q3_K_M, and Q3_K_L are all file types, whereas Q3_K is a data type.

Author


The idea was that you set the ftype (llama_ftype), and that gives you, as a base, the built-in mixing logic that llama_tensor_get_type() determines. I figured that's a good default since it also does some architecture-specific optimizations. Only then, on top of it, do you override specific tensors with a different ggml_type.

An alternative would be to completely get rid of the built-in llama_ftype logic and specify the quant mixes entirely from cfg files. Then we could ship pre-built cfg files for each of the mixes currently supported by quantize. This would be nice on the one hand, as it would move all the special casing out of the code base, but on the other hand it's a bigger endeavour, as we'd need to figure out how to handle the different architectures, models with varying layer counts, and other logic that is not so easy to express in a declarative configuration.
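
Put differently, the selection order described above boils down to something like this sketch (not the PR's exact code; name_matches is the wildcard helper sketched earlier in the thread, and the override list holds pattern/type pairs in config order):

#include <string>
#include <utility>
#include <vector>

// declaration of the ?/* wildcard helper sketched earlier in this thread
static bool name_matches(const std::string & name, const std::string & pattern,
                         size_t si = 0, size_t pi = 0);

// First matching override wins; otherwise keep the type that the built-in
// ftype logic (llama_tensor_get_type) already picked for this tensor.
static std::string choose_type(const std::string & tensor_name,
                               const std::vector<std::pair<std::string, std::string>> & overrides,
                               const std::string & ftype_default) {
    for (const auto & ov : overrides) {
        if (name_matches(tensor_name, ov.first)) {
            return ov.second;
        }
    }
    return ftype_default;
}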

@BarfingLemurs
Contributor

Hi, I'm noticing we don't have good perplexity/quality scores for llama3 8b at Q4_0:

F16: Final estimate: PPL = 6.7647
Q8_0: Final estimate: PPL = 6.7646
Q4_0: Final estimate: PPL = 7.2904
Q5_1: Final estimate: PPL = 6.8849

This is a 7.7% difference, but these numbers are even worse earlier on in evaluation.

Mistral PPL:
F16: Final estimate: PPL = 5.6925
Q8_0: Final estimate: PPL = 5.6918
Q4_0: Final estimate: PPL = 5.8192

Only 2.2% difference for mistral.

Different UIs using the Q4_0 series would see higher quality degradation for llama3 than for llama2 or mistral.

This isn't a llama.cpp issue; most GPU quantizations will get similar results. Is there a pre-existing quantization sweet spot suitable as the de facto choice for llama3?

Member

@ggerganov ggerganov left a comment


Would be nice to get some usage feedback from people before merging

llama.h Outdated
Comment on lines 125 to 127
LLAMA_FTYPE_CUSTOM = 32, // except 1d tensors

LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
Member


Suggested change
- LLAMA_FTYPE_CUSTOM = 32, // except 1d tensors
- LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
+ LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
+ LLAMA_FTYPE_CUSTOM = 1025,

Comment on lines +283 to +286
enum llama_ftype default_ftype; // default type if not overriden
uint32_t count; // number of overrides
const char ** names; // tensor names
enum ggml_type * types; // tensor type override
Member


Suggested change
enum llama_ftype default_ftype; // default type if not overriden
uint32_t count; // number of overrides
const char ** names; // tensor names
enum ggml_type * types; // tensor type override
enum llama_ftype default_ftype; // default type if not overriden
uint32_t count; // number of overrides
const char ** names; // tensor names
enum ggml_type * types; // tensor type override

@@ -14886,7 +14925,8 @@ struct llama_model_quantize_params llama_model_quantize_default_params() {
/*.only_copy =*/ false,
/*.pure =*/ false,
/*.imatrix =*/ nullptr,
/*.kv_overrides =*/ nullptr,
/*.kv_overrides =*/ nullptr,
/*.override_ftype =*/ nullptr
Member


Suggested change
/*.override_ftype =*/ nullptr
/*.override_ftype =*/ nullptr,

@@ -14417,6 +14444,18 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
new_type = params->output_tensor_type;
}

// look up tensor name in type override map, if not found use default
// type as determined by the ftype.
if(params->override_ftype) {
Member


Suggested change
if(params->override_ftype) {
if (params->override_ftype) {

@@ -14279,7 +14306,7 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
// copy the KV pairs from the input file
gguf_set_kv (ctx_out, ml.meta);
gguf_set_val_u32(ctx_out, "general.quantization_version", GGML_QNT_VERSION);
gguf_set_val_u32(ctx_out, "general.file_type", ftype);
gguf_set_val_u32(ctx_out, "general.file_type", params->ftype);
Member


Shouldn't we keep ftype here instead of params->ftype?

std::string tensor_name = line.substr(0, pos);
std::string type_name = line.substr(pos + 1);
ggml_type type = parse_ggml_type(type_name.c_str());
if(type < 0 || type >= GGML_TYPE_COUNT) {
Member


Suggested change
if(type < 0 || type >= GGML_TYPE_COUNT) {
if (type < 0 || type >= GGML_TYPE_COUNT) {

std::string ftype_name;
std::string custom_quant_config_filename;
llama_ftype ftype;
if(!try_parse_ftype(ftype_str, ftype, ftype_name, custom_quant_config_filename)) {
Member


Suggested change
if(!try_parse_ftype(ftype_str, ftype, ftype_name, custom_quant_config_filename)) {
if (!try_parse_ftype(ftype_str, ftype, ftype_name, custom_quant_config_filename)) {

@ggerganov ggerganov added the need feedback (Testing and feedback with results are needed) label May 9, 2024
@mofosyne mofosyne added the enhancement (New feature or request) label May 9, 2024
@Green-Sky
Collaborator

Green-Sky commented May 9, 2024

Hi, I'm noticing we don't have good perplexity/quality scores for llama3 8b at Q4_0:

F16: Final estimate: PPL = 6.7647
Q8_0: Final estimate: PPL = 6.7646
Q4_0: Final estimate: PPL = 7.2904
Q5_1: Final estimate: PPL = 6.8849

This is a 7.7% difference, but these numbers are even worse earlier on in evaluation.

Mistral PPL:
F16: Final estimate: PPL = 5.6925
Q8_0: Final estimate: PPL = 5.6918
Q4_0: Final estimate: PPL = 5.8192

Only 2.2% difference for mistral.

llama3 reacts more strongly to quantization, probably because it makes more use of the bits/precision it was trained on.

Someone should use MAP to find the frontier of best ppl vs. size (or any other 2-dimensional metric).

@@ -224,13 +246,119 @@ static ggml_type parse_ggml_type(const char * arg) {
for (int j = 0; j < GGML_TYPE_COUNT; ++j) {
auto type = ggml_type(j);
const auto * name = ggml_type_name(type);
if (name && strcmp(arg, name) == 0) {
if (name && strcasecmp(arg, name) == 0) {
Member


strcasecmp is not available on every platform.
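
For what it's worth, a portable alternative (just a sketch, not part of the PR) could avoid the POSIX call entirely:

#include <cctype>

// Case-insensitive comparison without strcasecmp; returns true on match.
static bool str_equal_nocase(const char * a, const char * b) {
    for (; *a && *b; ++a, ++b) {
        if (std::tolower((unsigned char) *a) != std::tolower((unsigned char) *b)) {
            return false;
        }
    }
    return *a == *b; // only a match if both strings end together
}

The check above could then read name && str_equal_nocase(arg, name).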

@@ -14570,11 +14573,35 @@ static size_t llama_tensor_quantize_internal(enum ggml_type new_type, const floa
return new_size;
}

static bool match_string(const std::string& str, const std::string& pattern, uint32_t string_index = 0, uint32_t pattern_index = 0) {
Member


Suggested change
static bool match_string(const std::string& str, const std::string& pattern, uint32_t string_index = 0, uint32_t pattern_index = 0) {
static bool match_string(const std::string & str, const std::string & pattern, uint32_t string_index = 0, uint32_t pattern_index = 0) {

result = type; break;
}
}
return result;
}

static bool parse_kv_override(const char * data, std::vector<llama_model_kv_override> & overrides) {
Member


This function was moved to common.cpp; this definition should be removed from here.

@jukofyork
Collaborator

jukofyork commented Jun 27, 2024

Is this PR still working? I'd be interested to try it on the new deepseek-v2 models to see if using lower quants for the later layers is feasible.

@jukofyork
Collaborator

jukofyork commented Jun 27, 2024

This PR seems dead, but I have found where you can hack this in: llama.cpp::llama_tensor_get_type().

Interestingly, there look to be a lot of hard-coded tests for n_expert == 8 that might be hurting the quantization of some of the newer MoE models that use more experts, like dbrx, deepseek-v2, Qwen-MoE, etc:

The new "shared experts" might need thinking about now too:

  "n_routed_experts": 160,
  "n_shared_experts": 2,
  "num_experts_per_tok": 6

as they will be disproportionately affected by quantization (i.e. used 100% of the time vs 4/160 = 2.5% of the time, for the example deepseek-v2 config above).

@jukofyork
Collaborator

jukofyork commented Jun 27, 2024

static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
    const std::string name = ggml_get_name(tensor);

    // TODO: avoid hardcoded tensor names - use the TN_* constants
    const llm_arch arch = qs.model.arch;
    const auto       tn = LLM_TN(arch);

    auto use_more_bits = [](int i_layer, int num_layers) -> bool {
        return i_layer < num_layers/8 || i_layer >= 7*num_layers/8 || (i_layer - num_layers/8)%3 == 2;
    };
    const int n_expert = std::max(1, (int)qs.model.hparams.n_expert);
    auto layer_info = [n_expert] (int i_layer, int n_layer, const char * name) {
        if (n_expert > 1) {
            // Believe it or not, "experts" in the FFN of Mixtral-8x7B are not consecutive, but occasionally randomly
            // sprinkled in the model. Hence, simply dividing i_ffn_down by n_expert does not work
            // for getting the current layer as I initially thought, and we need to resort to parsing the
            // tensor name.
            if (sscanf(name, "blk.%d.", &i_layer) != 1) {
                throw std::runtime_error(format("Failed to determine layer for tensor %s", name));
            }
            if (i_layer < 0 || i_layer >= n_layer) {
                throw std::runtime_error(format("Bad layer %d for tensor %s. Must be in [0, %d)", i_layer, name, n_layer));
            }
        }
        return std::make_pair(i_layer, n_layer);
    };

    // for arches that share the same tensor between the token embeddings and the output, we quantize the token embeddings
    // with the quantization of the output tensor
    if (name == tn(LLM_TENSOR_OUTPUT, "weight") || (!qs.has_output && name == tn(LLM_TENSOR_TOKEN_EMBD, "weight"))) {
        if (qs.params->output_tensor_type < GGML_TYPE_COUNT) {
            new_type = qs.params->output_tensor_type;
        } else {
            int nx = tensor->ne[0];
            if (arch == LLM_ARCH_FALCON || nx % QK_K != 0) {
                new_type = GGML_TYPE_Q8_0;
            }
            else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS ||
                     ftype == LLAMA_FTYPE_MOSTLY_IQ1_S   || ftype == LLAMA_FTYPE_MOSTLY_IQ2_S  || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M   ||
                     ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
                new_type = GGML_TYPE_Q5_K;
            }
            else if (new_type != GGML_TYPE_Q8_0) {
                new_type = GGML_TYPE_Q6_K;
            }
        }
    } else if (name == "token_embd.weight") {
        if (qs.params->token_embedding_type < GGML_TYPE_COUNT) {
            new_type = qs.params->token_embedding_type;
        } else {
            if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS ||
                ftype == LLAMA_FTYPE_MOSTLY_IQ1_S   || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
                new_type = GGML_TYPE_Q2_K;
            }
            else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M) {
                new_type = GGML_TYPE_IQ3_S;
            }
            else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) {
                new_type = GGML_TYPE_IQ3_S;
            }
        }
    } else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ1_S ||
               ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M    || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
        if (name.find("attn_v.weight") != std::string::npos) {
            if (qs.model.hparams.n_gqa() >= 4 || qs.model.hparams.n_expert >= 4) new_type = GGML_TYPE_Q4_K;
            else new_type = ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M ? GGML_TYPE_IQ3_S : GGML_TYPE_Q2_K;
            ++qs.i_attention_wv;
        }
        else if (qs.model.hparams.n_expert == 8 && name.find("attn_k.weight") != std::string::npos) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (name.find("ffn_down") != std::string::npos) {
            if (qs.i_ffn_down < qs.n_ffn_down/8) {
                new_type = ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M ? GGML_TYPE_IQ3_S : GGML_TYPE_Q2_K;
            }
            ++qs.i_ffn_down;
        }
        else if (name.find("attn_output.weight") != std::string::npos) {
            if (qs.model.hparams.n_expert == 8) {
                new_type = GGML_TYPE_Q5_K;
            } else {
                if (ftype == LLAMA_FTYPE_MOSTLY_IQ1_S || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) new_type = GGML_TYPE_IQ2_XXS;
                else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M) new_type = GGML_TYPE_IQ3_S;
            }
        }
    } else if (name.find("attn_v.weight") != std::string::npos) {
        if      (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) {
            new_type = qs.model.hparams.n_gqa() >= 4 ? GGML_TYPE_Q4_K : GGML_TYPE_Q3_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S && qs.model.hparams.n_gqa() >= 4) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) {
            new_type = qs.model.hparams.n_gqa() >= 4 ? GGML_TYPE_Q4_K : !qs.has_imatrix ? GGML_TYPE_IQ3_S : GGML_TYPE_IQ3_XXS;
        }
        else if ((ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ3_S) && qs.model.hparams.n_gqa() >= 4) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_M) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M) {
            new_type = qs.i_attention_wv < 2 ? GGML_TYPE_Q5_K : GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) new_type = GGML_TYPE_Q5_K;
        else if ((ftype == LLAMA_FTYPE_MOSTLY_IQ4_NL || ftype == LLAMA_FTYPE_MOSTLY_IQ4_XS) && qs.model.hparams.n_gqa() >= 4) {
            new_type = GGML_TYPE_Q5_K;
        }
        else if ((ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) &&
                use_more_bits(qs.i_attention_wv, qs.n_attention_wv)) new_type = GGML_TYPE_Q6_K;
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S && qs.i_attention_wv < 4) new_type = GGML_TYPE_Q5_K;
        if (qs.model.type == MODEL_70B) {
            // In the 70B model we have 8 heads sharing the same attn_v weights. As a result, the attn_v.weight tensor is
            // 8x smaller compared to attn_q.weight. Hence, we can get a nice boost in quantization accuracy with
            // nearly negligible increase in model size by quantizing this tensor with more bits:
            if (new_type == GGML_TYPE_Q3_K || new_type == GGML_TYPE_Q4_K) new_type = GGML_TYPE_Q5_K;
        }
        if (qs.model.hparams.n_expert == 8) {
            // for the 8-expert model, bumping this to Q8_0 trades just ~128MB
            // TODO: explore better strategies
            new_type = GGML_TYPE_Q8_0;
        }
        ++qs.i_attention_wv;
    } else if (name.find("attn_k.weight") != std::string::npos) {
        if (qs.model.hparams.n_expert == 8) {
            // for the 8-expert model, bumping this to Q8_0 trades just ~128MB
            // TODO: explore better strategies
            new_type = GGML_TYPE_Q8_0;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) {
            new_type = GGML_TYPE_IQ2_S;
        }
    } else if (name.find("attn_q.weight") != std::string::npos) {
        if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) {
            new_type = GGML_TYPE_IQ2_S;
        }
    } else if (name.find("ffn_down") != std::string::npos) {
        auto info = layer_info(qs.i_ffn_down, qs.n_ffn_down, name.c_str());
        int i_layer = info.first, n_layer = info.second;
        if      (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) new_type = GGML_TYPE_Q3_K;
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S) {
            if (i_layer < n_layer/8) new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS && !qs.has_imatrix) {
            new_type = i_layer < n_layer/8 ? GGML_TYPE_Q4_K : GGML_TYPE_Q3_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M) {
            new_type = i_layer < n_layer/16 ? GGML_TYPE_Q5_K
                     : arch != LLM_ARCH_FALCON || use_more_bits(i_layer, n_layer) ? GGML_TYPE_Q4_K
                     : GGML_TYPE_Q3_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_M && (i_layer < n_layer/8 ||
                    (qs.model.hparams.n_expert == 8 && use_more_bits(i_layer, n_layer)))) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) {
            new_type = arch == LLM_ARCH_FALCON ? GGML_TYPE_Q4_K : GGML_TYPE_Q5_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            if (arch == LLM_ARCH_FALCON) {
                new_type = i_layer < n_layer/16 ? GGML_TYPE_Q6_K :
                           use_more_bits(i_layer, n_layer) ? GGML_TYPE_Q5_K : GGML_TYPE_Q4_K;
            } else {
                if (use_more_bits(i_layer, n_layer)) new_type = GGML_TYPE_Q6_K;
            }
        }
        else if (i_layer < n_layer/8 && (ftype == LLAMA_FTYPE_MOSTLY_IQ4_NL || ftype == LLAMA_FTYPE_MOSTLY_IQ4_XS) && !qs.has_imatrix) {
            new_type = GGML_TYPE_Q5_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M && use_more_bits(i_layer, n_layer)) new_type = GGML_TYPE_Q6_K;
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S && arch != LLM_ARCH_FALCON && i_layer < n_layer/8) {
            new_type = GGML_TYPE_Q5_K;
        }
        else if ((ftype == LLAMA_FTYPE_MOSTLY_Q4_0 || ftype == LLAMA_FTYPE_MOSTLY_Q5_0)
                && qs.has_imatrix && i_layer < n_layer/8) {
            // Guard against craziness in the first few ffn_down layers that can happen even with imatrix for Q4_0/Q5_0.
            // We only do it when an imatrix is provided because a) we want to make sure that one can always get the
            // same quantization as before imatrix stuff, and b) Q4_1/Q5_1 do go crazy on ffn_down without an imatrix.
            new_type = ftype == LLAMA_FTYPE_MOSTLY_Q4_0 ? GGML_TYPE_Q4_1 : GGML_TYPE_Q5_1;
        }
        ++qs.i_ffn_down;
    } else if (name.find("attn_output.weight") != std::string::npos) {
        if (arch != LLM_ARCH_FALCON) {
            if (qs.model.hparams.n_expert == 8) {
                if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K   || ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS ||
                    ftype == LLAMA_FTYPE_MOSTLY_Q3_K_S || ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M  || ftype == LLAMA_FTYPE_MOSTLY_IQ4_NL  ||
                    ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S || ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M  || ftype == LLAMA_FTYPE_MOSTLY_IQ3_S  ||
                    ftype == LLAMA_FTYPE_MOSTLY_IQ3_M  || ftype == LLAMA_FTYPE_MOSTLY_IQ4_XS) {
                    new_type = GGML_TYPE_Q5_K;
                }
            } else {
                if      (ftype == LLAMA_FTYPE_MOSTLY_Q2_K   ) new_type = GGML_TYPE_Q3_K;
                else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) new_type = GGML_TYPE_IQ3_S;
                else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M ) new_type = GGML_TYPE_Q4_K;
                else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L ) new_type = GGML_TYPE_Q5_K;
                else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_M  ) new_type = GGML_TYPE_Q4_K;
            }
        } else {
            if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) new_type = GGML_TYPE_Q4_K;
        }
    }
    else if (name.find("attn_qkv.weight") != std::string::npos) {
        if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L || ftype == LLAMA_FTYPE_MOSTLY_IQ3_M) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) new_type = GGML_TYPE_Q5_K;
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) new_type = GGML_TYPE_Q6_K;
    }
    else if (name.find("ffn_gate") != std::string::npos) {
        auto info = layer_info(qs.i_ffn_gate, qs.n_ffn_gate, name.c_str());
        int i_layer = info.first, n_layer = info.second;
        if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS && (i_layer >= n_layer/8 && i_layer < 7*n_layer/8)) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        ++qs.i_ffn_gate;
    }
    else if (name.find("ffn_up") != std::string::npos) {
        auto info = layer_info(qs.i_ffn_up, qs.n_ffn_up, name.c_str());
        int i_layer = info.first, n_layer = info.second;
        if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS && (i_layer >= n_layer/8 && i_layer < 7*n_layer/8)) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        ++qs.i_ffn_up;
    }

    //    if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) new_type = GGML_TYPE_Q3_K;
    //}
    // IK: let's remove this, else Q2_K is almost the same as Q3_K_S
    //else if (name.find("ffn_gate") != std::string::npos || name.find("ffn_up") != std::string::npos) {
    //    if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) new_type = GGML_TYPE_Q3_K;
    //}
    // This can be used to reduce the size of the Q5_K_S model.
    // The associated PPL increase is fully in line with the size reduction
    //else {
    //    if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_S) new_type = GGML_TYPE_Q4_K;
    //}
    bool convert_incompatible_tensor = false;
    if (new_type == GGML_TYPE_Q2_K || new_type == GGML_TYPE_Q3_K || new_type == GGML_TYPE_Q4_K ||
        new_type == GGML_TYPE_Q5_K || new_type == GGML_TYPE_Q6_K || new_type == GGML_TYPE_IQ4_XS ||
        new_type == GGML_TYPE_IQ2_XS || new_type == GGML_TYPE_IQ2_XXS || new_type == GGML_TYPE_IQ2_S ||
        new_type == GGML_TYPE_IQ3_XXS || new_type == GGML_TYPE_IQ1_S || new_type == GGML_TYPE_IQ3_S ||
        new_type == GGML_TYPE_IQ1_M) {
        int nx = tensor->ne[0];
        int ny = tensor->ne[1];
        if (nx % QK_K != 0) {
            LLAMA_LOG_WARN("\n\n%s : tensor cols %d x %d are not divisible by %d, required for %s", __func__, nx, ny, QK_K, ggml_type_name(new_type));
            convert_incompatible_tensor = true;
        } else {
            ++qs.n_k_quantized;
        }
    }
    if (convert_incompatible_tensor) {
        switch (new_type) {
            case GGML_TYPE_IQ2_XXS:
            case GGML_TYPE_IQ2_XS:
            case GGML_TYPE_IQ2_S:
            case GGML_TYPE_IQ3_XXS:
            case GGML_TYPE_IQ3_S:
            case GGML_TYPE_IQ1_S:
            case GGML_TYPE_IQ1_M:
            case GGML_TYPE_Q2_K:
            case GGML_TYPE_Q3_K:
            case GGML_TYPE_IQ4_XS: new_type = GGML_TYPE_IQ4_NL; break;
            case GGML_TYPE_Q4_K:   new_type = GGML_TYPE_Q5_0;   break;
            case GGML_TYPE_Q5_K:   new_type = GGML_TYPE_Q5_1;   break;
            case GGML_TYPE_Q6_K:   new_type = GGML_TYPE_Q8_0;   break;
            default: throw std::runtime_error("\nUnsupported tensor size encountered\n");
        }
        LLAMA_LOG_WARN(" - using fallback quantization %s\n", ggml_type_name(new_type));
        ++qs.n_fallback;
    }

    return new_type;
}

What criterion was used to find these combinations originally?

If we can test each configuration in a reasonable amount of time then it would be quite feasible to optimize this automatically using the Cross-Entropy Method (there is another version [for optimization] not shown on the Wikipedia page that optimizes discrete Bernoulli and/or categorical / "multinoulli" distributions [see Chapter 5 of Rubinstein's book]).

The dimensions are likely to be almost independent and it might even be nearly as easy to optimize "layer-index specific" quant schemes.

From previous experience using CEM on a highly independent set of variables like this, you would need to be able to perform a minimum of 10-20 evaluations per variable to be optimized (you need much, much more though if you need to assume a non-diagonal covariance matrix [or conditional dependence for the discrete case] - which I don't think this would need, and CMA-ES would be more suitable in that case anyway...).

It's very robust to noise so a noisy/quick evaluation criterion like perplexity will be preferable to a slow/precise criterion like KL-divergence.

One potential problem is if the optimization boundaries are hard to set due to say perplexity returning nan causing lots of samples to be discarded.

@slaren
Member

slaren commented Jun 27, 2024

What criterion was used to find these combinations originally?

The original combinations were created by @ikawrakow from his own tests on llama and llama2 models, as far as I know. I think it would be very good to be able to automate this process, I would expect that different models will benefit from different quantization schemes.

One potential problem is if the optimization boundaries are hard to set due to say perplexity returning nan causing lots of samples to be discarded.

This should never happen unless there is a bug in llama.cpp, in which case it needs to be fixed rather than ignored.

@jukofyork
Collaborator

jukofyork commented Jun 27, 2024

What criterion was used to find these combinations originally?

The original combinations were created by @ikawrakow from his own tests on llama and llama2 models, as far as I know. I think it would be very good to be able to automate this process, I would expect that different models will benefit from different quantization schemes.

One potential problem is if the optimization boundaries are hard to set due to say perplexity returning nan causing lots of samples to be discarded.

This should never happen unless there is a bug in llama.cpp, in which case it needs to be fixed rather than ignored.

I can't promise how soon I can look at it, but it is definitely possible even without understanding any of the original logic:

new_type = GGML_TYPE_XXX;

It would just need a modified version of the function that can select this from a categorical distribution (what people in the ML community have started calling "multinoulli").

The name "Cross Entropy Method" might sound intimidating, but it is actually super-simple:

  1. Randomly initialize the distribution for each variable (intelligently if possible).
  2. Take N samples (usually 100) from the distribution and evaluate each sample.
  3. Rank the samples and choose the top 0.1 * N of the samples.
  4. Calculate the new Maximum Likelihood distribution to use from these samples.
  5. Go to step 2.
  • For (1) you could set the initial categorical distribution to be weighted heavily towards @ikawrakow's choices and possibly also set hard boundaries on what you think are sensible for the memory size budget you are looking at.
  • For (2a), since we are assuming independence of the variables it will just be a simple "weighted roulette wheel" selection process.
  • For (2b), since we have a memory size budget this will have to be incorporated as a constraint into the evaluation via a penalty (a soft penalty preferably so as not to discard too many samples... You can progressively "harden" the penalty during the run to enforce the constraint though).
  • For (4), this just comes down to the empirical fraction of counts in each bin for the discrete case. You have to be slightly careful that none of the bins get set to zero (this is easily solved via Additive smoothing and IIRC explained in Rubinstein's book).

Just eyeballing the function there looks to be maybe 5-10 choices for a given model, so using a population of 100 and assuming 10-20 evaluations per variable: 5*100*10 = 5000 .. 10*100*20 = 20000 evaluations per model to be optimized (minimum), but it is likely a lot could be learnt from small models and used to constrain the search for larger models.

A week has 7*24*60 = 10080 minutes, so it would need to take no longer than 2-5 minutes per evaluation to be feasible IMO. It is very easy to parallelize using MPI though so could be run on a cluster of machines if needed.
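
To make that loop concrete, here is a toy, self-contained sketch of CEM over independent categorical variables (everything here is illustrative: the stand-in score() would in reality be a quantize + perplexity run, and the sizes are arbitrary placeholders, not anything from the PR):

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Toy sketch of the Cross-Entropy Method over independent categorical
// variables. Each "variable" would be one tensor-type choice; score()
// is a stand-in objective where lower is better.
int main() {
    const int n_vars = 8, n_choices = 4, pop = 100, elite = 10, iters = 30;
    std::mt19937 rng(42);

    // start from a uniform categorical distribution per variable
    std::vector<std::vector<double>> prob(n_vars, std::vector<double>(n_choices, 1.0 / n_choices));

    // stand-in objective: pretend choice 2 is best for every variable
    auto score = [](const std::vector<int> & x) {
        double s = 0.0;
        for (int v : x) s += (v - 2) * (v - 2);
        return s;
    };

    for (int it = 0; it < iters; ++it) {
        std::vector<std::vector<int>> samples(pop, std::vector<int>(n_vars));
        std::vector<double> scores(pop);
        for (int s = 0; s < pop; ++s) {
            for (int v = 0; v < n_vars; ++v) {
                std::discrete_distribution<int> d(prob[v].begin(), prob[v].end());
                samples[s][v] = d(rng); // "weighted roulette wheel" per variable
            }
            scores[s] = score(samples[s]);
        }
        // rank the population and keep the elite fraction
        std::vector<int> order(pop);
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(), [&](int a, int b) { return scores[a] < scores[b]; });
        // re-estimate each categorical from the elites, with additive smoothing
        for (int v = 0; v < n_vars; ++v) {
            std::vector<double> counts(n_choices, 0.1); // smoothing keeps bins non-zero
            for (int e = 0; e < elite; ++e) counts[samples[order[e]][v]] += 1.0;
            const double total = std::accumulate(counts.begin(), counts.end(), 0.0);
            for (int c = 0; c < n_choices; ++c) prob[v][c] = counts[c] / total;
        }
        std::printf("iter %d best score %.2f\n", it, scores[order[0]]);
    }
    return 0;
}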

@slaren
Member

slaren commented Jun 27, 2024

That sounds very interesting. I am not sure what parameter you would use to optimize a quantization scheme. I guess that what we really want to find is the Pareto front of all the quantization schemes that are not worse than any other scheme in both file size and perplexity at the same time. Not sure if that affects your estimation of the number of samples.

@jukofyork
Collaborator

jukofyork commented Jun 27, 2024

That sounds very interesting. I am not sure what parameter you would use to optimize a quantization scheme. I guess that what we really would want to find is the pareto front of all the quantization schemes that are not worse than any other scheme in both file size and perplexity at the same time. Not sure if that affects your estimation of the number of samples.

Generally, the more you constrain it and the more the independence assumption is broken, the more samples you need (i.e. it will waste samples trying to pass over the constraints and have trouble navigating non-orthogonal "valleys" otherwise).

If the independence assumption is very wrong then it's almost certainly better to use CMA-ES instead (CEM does have a version using a non-diagonal covariance matrix, but it requires a Cholesky Factorization to sample from and suffers from needing many more samples to reliably estimate the covariance matrix compared to CMA-ES's incremental method).

There are likely other things like using a clipped-Gaussian instead of a categorical distribution (as the choices are ordered) that can be tried to reduce the number of samples needed.

It works really well in practice and can often find solutions a human could not, due to the human getting stuck in a local optimum which they can't escape by tuning a single variable alone.


If the optimization landscape is very smooth and "nice" there are other methods that can use way fewer samples. Somebody with an OR background would likely be able to suggest even better ways of tackling this - I've just had success in the past using CEM for problems almost exactly like this (and SPSA for problems with homogeneous variables and low-noise evaluations available).

@jukofyork
Collaborator

jukofyork commented Jun 27, 2024

Sorry I missed this part of your question:

I guess that what we really would want to find is the pareto front of all the quantization schemes that are not worse than any other scheme in both file size and perplexity at the same time.

This can likely be found in a single run by starting with the maximum allowable memory budget, converging to a (fairly) stable solution, and then reducing the budget constraint/penalty downwards (or vice versa).

If you search for "L1 regularization path" you'll see plots like this found all in 1 run:

[image: L1 regularization path plot]

These plots are basically doing the same thing: reducing (or increasing) the penalty during a single run of the optimization algorithm.

@jukofyork
Collaborator

The new "shared experts" might need thinking about now too:

  "n_routed_experts": 160,
  "n_shared_experts": 2,
  "num_experts_per_tok": 6

as they will be disproportionately affected by quantization (i.e. used 100% of the time vs 4/160 = 2.5% of the time, for the example deepseek-v2 config above).

Out of luck trying to do anything with the "shared_experts":

[  28/ 959]           blk.1.ffn_down_exps.weight - [ 1536,  5120,   160,     1], type =    f16, converting to iq3_xxs .. size =  2400.00 MiB ->   459.38 MiB
[  29/ 959]           blk.1.ffn_gate_exps.weight - [ 5120,  1536,   160,     1], type =    f16, converting to iq3_xxs .. size =  2400.00 MiB ->   459.38 MiB
[  30/ 959]             blk.1.ffn_up_exps.weight - [ 5120,  1536,   160,     1], type =    f16, converting to iq3_xxs .. size =  2400.00 MiB ->   459.38 MiB

@jukofyork
Collaborator

jukofyork commented Jun 27, 2024

If I can get this PR working, or figure out how to hack the llama.cpp::llama_tensor_get_type() code, I'm going to try using bigger quants for the early layers' expert tensors and smaller ones for the later layers.

I can't find it now, but I read a paper that hypothesised the later layers don't do all that much and mostly just do averaging (link me if you know this paper, please!). This paper (which came later IIRC) also shows this:

https://arxiv.org/pdf/2403.17887

[image: figure from the paper]

Starting around the 60th-percentile layer in.

"Across models, the deeper layers tend to be very similar, though the deepest blocks that include the final layer (squares along the outer diagonal) are (near-)maximally dissimilar."


Charles Goddard (the Mergekit creator) tried the above method here:

https://huggingface.co/chargoddard/llama3-42b-v0

but I think it's got a much better chance keeping the layers and just having them more heavily quantized... Deepseek-v2 looks like the perfect model to try this on as it's 90% MLP.

@jukofyork
Collaborator

jukofyork commented Jun 28, 2024

The new "shared experts" might need thinking about now too:

  "n_routed_experts": 160,
  "n_shared_experts": 2,
  "num_experts_per_tok": 6

as they will be disproportionately affected by quantization (i.e. used 100% of the time vs 4/160 = 2.5% of the time, for the example deepseek-v2 config above).

Out of luck trying to do anything with the "shared_experts":

[  28/ 959]           blk.1.ffn_down_exps.weight - [ 1536,  5120,   160,     1], type =    f16, converting to iq3_xxs .. size =  2400.00 MiB ->   459.38 MiB
[  29/ 959]           blk.1.ffn_gate_exps.weight - [ 5120,  1536,   160,     1], type =    f16, converting to iq3_xxs .. size =  2400.00 MiB ->   459.38 MiB
[  30/ 959]             blk.1.ffn_up_exps.weight - [ 5120,  1536,   160,     1], type =    f16, converting to iq3_xxs .. size =  2400.00 MiB ->   459.38 MiB

Actually I've found they are in separate tensors and are named differently: ffn_up_shexp.weight, ffn_gate_shexp.weight, and ffn_down_shexp.weight.

I've also found the low-rank attn_q_a.weight, attn_q_b.weight, attn_kv_a_mqa.weight and attn_kv_b.weight tensors were falling through and getting quantized using the lowest default... This is very bad, as these are actually tiny compared to the rest of the giant MLP tensors, and the W.W^T products this creates will likely have O(((w-q)^2)^2) rate-distortion (i.e. 4th-power quantization error!).

So I've tried to look through the function to distil what @ikawrakow obviously must have spent hours figuring out, and have come up with this:

    // ### JUK'S DEEPSEEK V2 CUSTOM CONFIG (Use: 'llama-quantize --imatrix ... ... ... Q5_K_M') ###
    if (name == tn(LLM_TENSOR_OUTPUT, "weight")) {
         new_type = GGML_TYPE_Q6_K;
    } else if (name == "token_embd.weight") {
         new_type = GGML_TYPE_Q5_K;
    } else if (name.find("attn_q_a.weight") != std::string::npos || name.find("attn_q_b.weight") != std::string::npos) {
        new_type = GGML_TYPE_Q8_0;
    } else if (name.find("attn_kv_a_mqa.weight") != std::string::npos || name.find("attn_kv_b.weight") != std::string::npos) {
        new_type = GGML_TYPE_Q8_0;
        // ++qs.i_attention_wv; @@@ Looks to be used for 'use_more_bits' tests and not outside this function... @@@
    } else if (name.find("attn_output.weight") != std::string::npos) {
        new_type = GGML_TYPE_Q5_K;
    } else if (name.find("shexp.weight") != std::string::npos) {
        new_type = GGML_TYPE_Q8_0;
    } else if (name.find("ffn_down_exps.weight") != std::string::npos) {
        auto info = layer_info(qs.i_ffn_down, qs.n_ffn_down, name.c_str());
        int i_layer = info.first, n_layer = info.second;
        if (i_layer < n_layer/8 || i_layer >= 7*n_layer/8) {
            new_type = GGML_TYPE_IQ4_XS;
        }
        else {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        ++qs.i_ffn_down;
    } else if (name.find("ffn_gate_exps.weight") != std::string::npos) {
        auto info = layer_info(qs.i_ffn_gate, qs.n_ffn_gate, name.c_str());
        int i_layer = info.first, n_layer = info.second;
        if (i_layer < n_layer/8 || i_layer >= 7*n_layer/8) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        else {
            new_type = GGML_TYPE_IQ2_S;
        }
        ++qs.i_ffn_gate;
    } else if (name.find("ffn_up_exps.weight") != std::string::npos) {
        auto info = layer_info(qs.i_ffn_up, qs.n_ffn_up, name.c_str());
        int i_layer = info.first, n_layer = info.second;
        if (i_layer < n_layer/8 || i_layer >= 7*n_layer/8) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        else {
            new_type = GGML_TYPE_IQ2_S;
        }
        ++qs.i_ffn_up;
    } else
    // ### JUK ###

It needs to be copied right before this line:

if (name == tn(LLM_TENSOR_OUTPUT, "weight") || (!qs.has_output && name == tn(LLM_TENSOR_TOKEN_EMBD, "weight")))

The mix of GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS and GGML_TYPE_IQ2_S is just my attempt at getting this to fit in 96GB of VRAM...
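
As a standalone illustration of the first/last-eighth split used in the block above (this is not llama.cpp code; the 60-layer count and the two type names are just assumptions for the example):

// sketch of the "first and last n_layer/8 get the bigger type" rule
#include <cstdio>

static const char * pick_type(int i_layer, int n_layer) {
    // outer band: the first eighth and last eighth of the layers get the larger quant
    if (i_layer < n_layer/8 || i_layer >= 7*n_layer/8) {
        return "IQ4_XS";
    }
    return "IQ3_XXS";  // middle layers get the smaller quant
}

int main() {
    const int n_layer = 60;  // assumed layer count, purely for illustration
    for (int i = 0; i < n_layer; ++i) {
        printf("blk.%d.ffn_down_exps.weight -> %s\n", i, pick_type(i, n_layer));
    }
    return 0;
}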

Hopefully this helps, because the IQ3_XXS version I made with the stock settings and the problems outlined above (which let me fit a whopping 1K of context in 96GB VRAM!) was as dumb as a post... 😦


I will also try just leaving all of these as f16 later, since they are tiny compared to everything else, and the ffn_gate_inp.weight routing tensors are already left as f32 for the same reason:

[  16/ 959]          blk.1.ffn_down_shexp.weight - [ 3072,  5120,     1,     1], type =    f16, converting to q8_0 .. size =    30.00 MiB ->    15.94 MiB
[  17/ 959]          blk.1.ffn_gate_shexp.weight - [ 5120,  3072,     1,     1], type =    f16, converting to q8_0 .. size =    30.00 MiB ->    15.94 MiB
[  18/ 959]            blk.1.ffn_up_shexp.weight - [ 5120,  3072,     1,     1], type =    f16, converting to q8_0 .. size =    30.00 MiB ->    15.94 MiB
[  20/ 959]           blk.1.attn_kv_a_mqa.weight - [ 5120,   576,     1,     1], type =    f16, converting to q8_0 .. size =     5.62 MiB ->     2.99 MiB
[  21/ 959]               blk.1.attn_kv_b.weight - [  512, 32768,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  24/ 959]                blk.1.attn_q_a.weight - [ 5120,  1536,     1,     1], type =    f16, converting to q8_0 .. size =    15.00 MiB ->     7.97 MiB
[  25/ 959]                blk.1.attn_q_b.weight - [ 1536, 24576,     1,     1], type =    f16, converting to q8_0 .. size =    72.00 MiB ->    38.25 MiB

@jukofyork
Collaborator

Yeah, I think quantizing the low-rank attention weights was absolutely killing the model... I've put in a PR to fix this: #8194.

@HaroldBenoit

Giving some usage feedback so that this gets merged.

This PR works almost out of the box (you just need to specify the quant types in lowercase instead of uppercase in the config, plus a few tweaks to adapt it to the latest llama.cpp state).

It introduces very useful functionality. Props to @jubruckne !
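
For anyone else trying it, a hypothetical quant.cfg along these lines illustrates the lowercase convention mentioned above (the exact keys and syntax here are my assumption and depend on the state of the PR's parser):

# hypothetical quant.cfg sketch (syntax is an assumption, not taken from the PR)
default=q5_k
token_embd.weight=q8_0
output.weight=q6_k
blk.0.ffn_down.weight=q6_k
blk.1.ffn_down.weight=q6_k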

@ThomasBaruzier

Hello,

I'd like to share a few findings and explain why this PR could be very beneficial for quant quality.
I ran quantization and perplexity tests on around 20 models across many different architectures, and I noticed that the damage caused by quantization varies a lot from architecture to architecture.

For example, the best candidate is the Gemma 2 series, especially at 9B (imat):

| Quant | Size (MB) | Perplexity (PPL) | Size (%) | Accuracy (%) | PPL error rate |
|---|---|---|---|---|---|
| IQ1_S | 2269 | 16.0064 | 12.87 | 54.91 | 0.12077 |
| IQ1_M | 2429 | 13.7255 | 13.77 | 64.03 | 0.10272 |
| IQ2_XXS | 2695 | 11.269 | 15.28 | 77.99 | 0.08345 |
| IQ2_XS | 2926 | 10.5628 | 16.59 | 83.2 | 0.07809 |
| IQ2_S | 3063 | 10.3671 | 17.37 | 84.77 | 0.07772 |
| IQ2_M | 3276 | 9.7973 | 18.58 | 89.7 | 0.07298 |
| Q2_K_S | 3388 | 9.9206 | 19.21 | 88.59 | 0.07247 |
| IQ3_XXS | 3621 | 9.3955 | 20.53 | 93.54 | 0.06962 |
| Q2_K | 3630 | 9.421 | 20.58 | 93.29 | 0.0683 |
| IQ3_XS | 3953 | 9.2545 | 22.42 | 94.96 | 0.06868 |
| IQ3_S | 4137 | 9.2127 | 23.46 | 95.39 | 0.06866 |
| Q3_K_S | 4137 | 9.083 | 23.46 | 96.76 | 0.06618 |
| IQ3_M | 4287 | 8.9791 | 24.31 | 97.88 | 0.06614 |
| Q3_K_M | 4542 | 9.0172 | 25.76 | 97.46 | 0.06684 |
| Q3_K_L | 4895 | 8.9965 | 27.76 | 97.69 | 0.06675 |
| IQ4_XS | 4943 | 8.8286 | 28.03 | 99.54 | 0.06504 |
| IQ4_NL | 5191 | 8.8235 | 29.44 | 99.6 | 0.06496 |
| Q4_0 | 5207 | 8.834 | 29.53 | 99.48 | 0.0648 |
| Q4_K_S | 5226 | 8.829 | 29.63 | 99.54 | 0.06513 |
| Q4_K_M | 5495 | 8.8069 | 31.16 | 99.79 | 0.06493 |
| Q4_1 | 5688 | 8.8395 | 32.25 | 99.42 | 0.06526 |
| Q5_K_S | 6184 | 8.8011 | 35.07 | 99.86 | 0.06504 |
| Q5_0 | 6199 | 8.7668 | 35.15 | 100.25 | 0.06455 |
| Q5_K_M | 6340 | 8.7993 | 35.95 | 99.88 | 0.06506 |
| Q5_1 | 6680 | 8.7888 | 37.88 | 100 | 0.06493 |
| Q6_K | 7238 | 8.7863 | 41.04 | 100.02 | 0.06497 |
| Q8_0 | 9372 | 8.7858 | 53.14 | 100.03 | 0.06497 |
| F16 | 17635 | 8.7884 | 100 | 100 | 0.06501 |

Note that IQ1_S is 12.87% of the size of F16, while scoring 16.0064 PPL (54.91% accuracy relative to F16).
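
As a side note, the Size (%) and Accuracy (%) columns look like simple ratios against the F16 row; this is my reading of the tables, not the author's script:

// sketch of how the ratio columns appear to be derived (an assumption on my part)
#include <cstdio>

int main() {
    const double size_f16 = 17635.0, ppl_f16 = 8.7884;  // F16 row, Gemma 2 9B table
    const double size_q   = 2269.0,  ppl_q   = 16.0064; // IQ1_S row
    printf("Size (%%)     = %.2f\n", 100.0 * size_q / size_f16); // ~12.87
    printf("Accuracy (%%) = %.2f\n", 100.0 * ppl_f16 / ppl_q);   // ~54.91
    return 0;
}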

On the other end, here is the same table for Llama 3.1 8B (imat):

| Quant | Size (MB) | Perplexity (PPL) | Size (%) | Accuracy (%) | PPL error rate |
|---|---|---|---|---|---|
| IQ1_S | 1927 | 63.4502 | 12.57 | 11.54 | 0.42632 |
| IQ1_M | 2062 | 26.8862 | 13.46 | 27.24 | 0.18531 |
| IQ2_XXS | 2289 | 15.1538 | 14.94 | 48.32 | 0.10214 |
| IQ2_XS | 2486 | 11.771 | 16.22 | 62.21 | 0.07778 |
| IQ2_S | 2631 | 10.5231 | 17.17 | 69.59 | 0.06974 |
| IQ2_M | 2812 | 9.5838 | 18.35 | 76.41 | 0.0623 |
| Q2_K_S | 2851 | 10.6111 | 18.6 | 69.01 | 0.07128 |
| Q2_K | 3032 | 9.857 | 19.78 | 74.29 | 0.0636 |
| IQ3_XXS | 3124 | 8.4925 | 20.38 | 86.22 | 0.05406 |
| IQ3_XS | 3356 | 8.1907 | 21.9 | 89.4 | 0.05164 |
| Q3_K_S | 3495 | 8.9094 | 22.81 | 82.19 | 0.05599 |
| IQ3_S | 3512 | 7.973 | 22.92 | 91.84 | 0.05015 |
| IQ3_M | 3610 | 7.9137 | 23.56 | 92.53 | 0.04941 |
| Q3_K_M | 3833 | 7.8494 | 25.01 | 93.29 | 0.05019 |
| Q3_K_L | 4122 | 7.7578 | 26.9 | 94.39 | 0.0497 |
| IQ4_XS | 4242 | 7.5211 | 27.68 | 97.36 | 0.04819 |
| Q4_0 | 4460 | 7.6204 | 29.1 | 96.09 | 0.04851 |
| IQ4_NL | 4462 | 7.5144 | 29.12 | 97.45 | 0.04819 |
| Q4_K_S | 4476 | 7.5198 | 29.21 | 97.38 | 0.048 |
| Q4_K_M | 4693 | 7.4975 | 30.62 | 97.67 | 0.04794 |
| Q4_1 | 4893 | 7.5353 | 31.93 | 97.18 | 0.04808 |
| Q5_K_S | 5340 | 7.3903 | 34.85 | 99.08 | 0.04728 |
| Q5_0 | 5354 | 7.3982 | 34.94 | 98.98 | 0.04728 |
| Q5_K_M | 5468 | 7.3962 | 35.68 | 99 | 0.04739 |
| Q5_1 | 5788 | 7.3843 | 37.77 | 99.16 | 0.04721 |
| Q6_K | 6291 | 7.3538 | 41.05 | 99.58 | 0.04696 |
| Q8_0 | 8146 | 7.3279 | 53.15 | 99.93 | 0.04677 |
| F16 | 15325 | 7.3226 | 100 | 100 | 0.04674 |

Note that IQ1_S is 12.57% of the size of F16, while scoring 63.4502 PPL (11.54% accuracy relative to F16).

One could argue that smaller models are simply more prone to quantization damage. But even so, Qwen 2.5 14B (imat), despite having 5B more parameters, suffers more than Gemma 2 9B:

| Quant | Size (MB) | Perplexity (PPL) | Size (%) | Accuracy (%) | PPL error rate |
|---|---|---|---|---|---|
| IQ1_S | 3441 | 22.0082 | 12.21 | 27.14 | 0.16818 |
| IQ1_M | 3693 | 15.079 | 13.11 | 39.62 | 0.1106 |
| IQ2_XXS | 4114 | 9.6047 | 14.6 | 62.2 | 0.06625 |
| IQ2_XS | 4487 | 8.3649 | 15.92 | 71.41 | 0.05574 |
| IQ2_S | 4772 | 8.1942 | 16.93 | 72.9 | 0.0548 |
| IQ2_M | 5109 | 7.7261 | 18.13 | 77.32 | 0.05177 |
| Q2_K_S | 5148 | 8.0641 | 18.27 | 74.08 | 0.0549 |
| Q2_K | 5504 | 7.6005 | 19.53 | 78.6 | 0.05146 |
| IQ3_XXS | 5672 | 6.9285 | 20.13 | 86.22 | 0.04547 |
| IQ3_XS | 6088 | 6.721 | 21.6 | 88.88 | 0.04329 |
| Q3_K_S | 6352 | 6.8697 | 22.54 | 86.96 | 0.04576 |
| IQ3_S | 6383 | 6.6246 | 22.65 | 90.17 | 0.04285 |
| IQ3_M | 6597 | 6.6359 | 23.41 | 90.02 | 0.04256 |
| Q3_K_M | 7000 | 6.5281 | 24.84 | 91.51 | 0.043 |
| Q3_K_L | 7558 | 6.4323 | 26.82 | 92.87 | 0.04211 |
| IQ4_XS | 7744 | 6.2005 | 27.48 | 96.34 | 0.04022 |
| Q4_0 | 8149 | 6.2928 | 28.92 | 94.93 | 0.04095 |
| IQ4_NL | 8154 | 6.208 | 28.94 | 96.23 | 0.04032 |
| Q4_K_S | 8177 | 6.163 | 29.02 | 96.93 | 0.03976 |
| Q4_K_M | 8572 | 6.1311 | 30.42 | 97.43 | 0.03957 |
| Q4_1 | 8958 | 6.1674 | 31.79 | 96.86 | 0.03981 |
| Q5_K_S | 9791 | 6.0411 | 34.75 | 98.88 | 0.03886 |
| Q5_0 | 9817 | 6.0504 | 34.84 | 98.73 | 0.03895 |
| Q5_K_M | 10023 | 6.0389 | 35.57 | 98.92 | 0.03888 |
| Q5_1 | 10625 | 6.0366 | 37.71 | 98.96 | 0.03885 |
| Q6_K | 11564 | 6.0004 | 41.04 | 99.56 | 0.0386 |
| Q8_0 | 14975 | 5.9821 | 53.14 | 99.86 | 0.03842 |
| F16 | 28179 | 5.9737 | 100 | 100 | 0.03835 |

Note that IQ1_S is 12.21% of the size of F16, while scoring 22.0082 PPL (27.14% accuracy relative to F16).

Conclusion

IQ1_S is admittedly a poor example in terms of practical use cases; however, it shows how differently quantization can affect models depending on their architecture, and how much custom quants could improve quality.

I'd like to try this PR and "brute-force" my way down to the lowest perplexity achievable on the models that suffer the most from quantization. I really hope this gets merged soon.

@Djip007
Contributor

Djip007 commented Oct 18, 2024

I like this idea of having configurable/custom quantization. But wouldn't it be simpler to use std::regex rather than hand-coding a match_string?

And I think a JSON library is already used in llama.cpp, so maybe we could use it for the "quant.cfg" file?
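
A minimal sketch of what the std::regex approach could look like (the pattern and tensor names here are illustrative, not taken from the PR):

// minimal std::regex name-matching sketch (illustrative pattern and names)
#include <cstdio>
#include <regex>
#include <string>

int main() {
    // override every blk.N.ffn_down_exps.weight tensor, leave the rest on the default type
    const std::regex pattern(R"(blk\.\d+\.ffn_down_exps\.weight)");
    const std::string names[] = {
        "blk.1.ffn_down_exps.weight",
        "blk.12.ffn_gate_exps.weight",
        "output.weight",
    };
    for (const auto & name : names) {
        printf("%-30s -> %s\n", name.c_str(),
               std::regex_match(name, pattern) ? "override" : "default");
    }
    return 0;
}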

@ddh0
Contributor

ddh0 commented Oct 18, 2024

> I like this idea of having configurable/custom quantization. But wouldn't it be simpler to use std::regex rather than hand-coding a match_string?
>
> And I think a JSON library is already used in llama.cpp, so maybe we could use it for the "quant.cfg" file?

I agree, JSON would be more accessible for me and probably many others.

@ddh0
Contributor

ddh0 commented Oct 30, 2024

I've just tried to build jubruckne/llama.cpp as of commit 20b2243 and it's failing with these errors:

examples/quantize/quantize.cpp:256:13: error: static declaration of 'parse_kv_override' follows non-static declaration
  256 | static bool parse_kv_override(const char * data, std::vector<llama_model_kv_override> & overrides) {
      |             ^
common/common.h:180:6: note: previous declaration is here
  180 | bool parse_kv_override(const char * data, std::vector<llama_model_kv_override> & overrides);
      |      ^
examples/quantize/quantize.cpp:269:13: error: no member named 'int_value' in 'llama_model_kv_override'
  269 |         kvo.int_value = std::atol(sep);
      |         ~~~ ^
examples/quantize/quantize.cpp:273:13: error: no member named 'float_value' in 'llama_model_kv_override'
  273 |         kvo.float_value = std::atof(sep);
      |         ~~~ ^
examples/quantize/quantize.cpp:278:17: error: no member named 'bool_value' in 'llama_model_kv_override'
  278 |             kvo.bool_value = true;
      |             ~~~ ^
examples/quantize/quantize.cpp:280:17: error: no member named 'bool_value' in 'llama_model_kv_override'
  280 |             kvo.bool_value = false;
      |             ~~~ ^
examples/quantize/quantize.cpp:390:39: error: call to 'parse_kv_override' is ambiguous
  390 |             if (arg_idx == argc-1 || !parse_kv_override(argv[++arg_idx], kv_overrides)) {
      |                                       ^~~~~~~~~~~~~~~~~
common/common.h:180:6: note: candidate function
  180 | bool parse_kv_override(const char * data, std::vector<llama_model_kv_override> & overrides);
      |      ^
examples/quantize/quantize.cpp:256:13: note: candidate function
  256 | static bool parse_kv_override(const char * data, std::vector<llama_model_kv_override> & overrides) {
      |             ^
6 errors generated.
make: *** [quantize] Error 1
make: *** Waiting for unfinished jobs....

I'd really love to see this functionality make its way into master. @jubruckne, do you still have plans to work on this?

To @slaren and @cebtenzzre: other than the errors above, what else needs to be done before this is ready for review?

Thank you everyone.

@slaren
Member

slaren commented Oct 31, 2024

I think it is important that the implementation of custom quantization schemes can be used to replace the current logic. That is to say, it should be possible to remove the code for the current quantization schemes and express them instead as inputs to the custom quantization mechanism. Otherwise, we would just be adding more complexity on top of already too complex code. I don't know what the current state of this PR is.
