Custom quantization schemes #6844
Conversation
This would be handy, as I like to experiment with different custom quants, and it's a little clunky having to modify and rebuild llama.cpp every time I want to change something.
Yes, this functionality is welcome.
Excellent idea, I wanted to see such a feature but am unable to do it myself. All the possible improvements you mention are pertinent. Also, this tool should ideally feature variable quantization. Moreover, and that's a bit more complex, the ideal combination might be to be able to use a customizable form of the "more_bits" feature (see it in the llama.cpp file) to make that partial quant, and to select a layer range of a given weight to quantize with a higher quant.
I'm currently toying with the code in the llama.cpp file, and it's quite indigestible and not practical, especially because 2 approaches were used to define the quant strategies.
Yeah, that's the idea. I actually explained my intentions slightly incorrectly in the first post above. It's actually about allowing individual quantisation for each tensor (not layer). So you can have a config file like this:
Exactly my plan :) The idea here would be that instead of setting a specific quant type, increments of +1, +2, -1, ... relative to the default could be used. For example:
The challenge lies in defining what the sequences of quant types should be. One possibility is to establish a sequence that transitions between similar quant types of different "bit" rates, such as from x_K to x+1_K or from IQx_S to IQx-1_S. For example:
Using this sequence, a default of Q4_K would transition to Q5_K with a +1 adjustment and to Q3_K with a -1. A more detailed sequence might look like this:
However, now that I've tried to write down sensible sequences, I realize that defining one that is universally applicable is challenging due to the varying nature of quant types, and probably doesn't make sense in most cases. Any thoughts?
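For illustration, here is a minimal sketch of how such relative offsets could be applied in code, assuming one hand-picked (and purely hypothetical) ordering of types; the sequence below is not a proposal, just an example:

```cpp
// Sketch: an ordered sequence of quant types and a helper that moves +/- n
// steps from a default, clamping at both ends. Names are illustrative only.
#include <cstdio>

enum sketch_quant { SQ_Q2_K, SQ_Q3_K, SQ_Q4_K, SQ_Q5_K, SQ_Q6_K, SQ_Q8_0, SQ_COUNT };

static const char * sketch_names[SQ_COUNT] = { "Q2_K", "Q3_K", "Q4_K", "Q5_K", "Q6_K", "Q8_0" };

static sketch_quant apply_offset(sketch_quant base, int offset) {
    int idx = (int) base + offset;
    if (idx < 0)         idx = 0;            // clamp at the smallest type
    if (idx >= SQ_COUNT) idx = SQ_COUNT - 1; // clamp at the largest type
    return (sketch_quant) idx;
}

int main() {
    // e.g. ffn_down=+1, attn_q=-1 relative to a Q4_K default
    printf("default %s, +1 -> %s, -1 -> %s\n",
           sketch_names[SQ_Q4_K],
           sketch_names[apply_offset(SQ_Q4_K, +1)],
           sketch_names[apply_offset(SQ_Q4_K, -1)]);
    return 0;
}
```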
Well, ideally the whole pattern would be definable so the system can be applied universally. There's no pre-made recipe that commands consensus, nor should there be, because we are still empirically discovering the effects of particular quantization strategies as we try them. Here's a reformulation of my idea compatible with your plans:
Example in plain words, for each tensor chosen for a customized quantization away from a base quantization strategy: say a base Q4_K defined on a 70b L2 model with 80 layers, on which we want to customize the ffn.down without even using Q4_K, for the sake of the example. That might require a slight overhaul of the quant strategy part of llama.cpp, and potentially a harmonization of its hierarchical tree with respect to the IQ1 and IQ2 groups, but if possible, that'd offer the widest range of possibilities. I'm sorry for my lack of code proficiency; I have no background in coding beyond mimicking what I see, understanding and adapting a few formatting tricks, and changing values.
I think this should be ready. I added parsing of enum values (so that friendly names like Q8_0 can be used instead of their numeric values), wildcards for tensor names, and the possibility to specify the cfg file to use. To use, specify the new CUSTOM type on ./quantize. The quant.cfg should be pretty self-explanatory:
# Defines the default ftype (the quantization mix code
# that you pass to quantize if you're not using a custom mix).
# Tensors that are not overridden below will be quantized
# according to this mix.
#
# Must be one of
# Q4_0, Q4_1, Q5_0, Q5_1, IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M,
# IQ1_S, IQ1_M, Q2_K, Q2_K_S, IQ3_XXS, IQ3_S, IQ3_M, Q3_K,
# IQ3_XS, Q3_K_S, Q3_K_M, Q3_K_L, IQ4_NL, IQ4_XS, Q4_K,
# Q4_K_S, Q4_K_M, Q5_K, Q5_K_S, Q5_K_M, Q6_K, Q8_0, F16
ftype=Q6_K
# Defines overrides for tensors with names matching a given
# string. Filters are processed in order given, the first
# matching will be used.
#
# Wildcards are allowed:
# ? single character
# * multiple characters
#
# Type must be one of
# F16, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1, Q2_K, Q3_K,
# Q4_K, Q5_K, Q6_K, Q8_K, IQ2_XXS, IQ2_XS, IQ3_XXS,
# IQ1_S, IQ4_NL, IQ3_S, IQ2_S, IQ4_XS, IQ1_M
blk.10.ffn_up.weight=Q5_K
blk.1?.ffn_up.weight=Q4_K
blk.23.*=Q2_K
blk.24.*=Q2_K
blk.25.*=Q2_K
blk.2?.ffn_up.weight=Q4_K
*_gate*=Q4_K
*.attn*=IQ4_XS
*_down*=IQ3_S
output.weight=Q5_K
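For reference, the ? / * wildcard semantics described in the comments above can be implemented with a short recursive matcher along these lines (an illustrative sketch, not necessarily identical to the PR's match_string()):

```cpp
// Illustrative recursive matcher for the ? / * wildcards used in the cfg file.
#include <cstdio>
#include <string>

static bool wild_match(const std::string & str, const std::string & pat, size_t si = 0, size_t pi = 0) {
    if (pi == pat.size()) {
        return si == str.size();            // pattern exhausted: match only if string is too
    }
    if (pat[pi] == '*') {
        // '*' matches zero characters, or consumes one more character and tries again
        return wild_match(str, pat, si, pi + 1) ||
               (si < str.size() && wild_match(str, pat, si + 1, pi));
    }
    if (si < str.size() && (pat[pi] == '?' || pat[pi] == str[si])) {
        return wild_match(str, pat, si + 1, pi + 1); // '?' or literal match: advance both
    }
    return false;
}

int main() {
    printf("%d\n", wild_match("blk.12.ffn_up.weight", "blk.1?.ffn_up.weight")); // 1
    printf("%d\n", wild_match("blk.23.attn_k.weight", "blk.23.*"));             // 1
    printf("%d\n", wild_match("output.weight",        "*_down*"));              // 0
    return 0;
}
```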
if (pos != std::string::npos) {
    std::string tensor_name = line.substr(0, pos);
    std::string type_name = line.substr(pos + 1);
    ggml_type type = parse_ggml_type(type_name.c_str());
Shouldn't the configuration describe tensors by data type (from enum ggml_type), not file type (from enum llama_ftype)? E.g. Q3_K_S, Q3_K_M, and Q3_K_L are all file types, whereas Q3_K is a data type.
The idea was that you set the ftype (llama_ftype), and that gives you as a base the built-in mixing logic that llama_tensor_get_type() determines. I figured that's a good default since it also does some architecture-specific optimizations. Only then, on top of it, do you override specific tensors with a different ggml_type.
An alternative would be to completely get rid of the built-in llama_ftype logic and specify the quant mixes entirely from cfg files. Then we could deliver pre-built cfg files for each of the mixes currently supported by quantize. This would be nice on the one hand, as it would move all the special casing out of the code base, but on the other hand it's a bigger endeavour, as we'd need to figure out how to handle the different architectures, models of varying layer counts, and other logic that is not so easy to express in a declarative configuration.
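For clarity, here is a rough, self-contained sketch of that lookup order (built-in default first, then the first matching override wins). All names, types, and the matcher are stand-ins for illustration, not the PR's actual code:

```cpp
// Sketch of "default from the built-in ftype logic, overridden per tensor name".
#include <cstdio>
#include <string>
#include <vector>

enum sketch_type { SK_Q4_K, SK_Q5_K, SK_IQ3_S }; // stand-ins for ggml_type values

struct sketch_overrides {
    std::vector<std::string> names; // tensor name patterns, in priority order
    std::vector<sketch_type> types; // type to force for each pattern
};

static sketch_type pick_type(const sketch_overrides * ov, const std::string & name, sketch_type default_type) {
    if (ov) {
        for (size_t i = 0; i < ov->names.size(); ++i) {
            if (name == ov->names[i]) { // the PR uses a ?/* wildcard match here
                return ov->types[i];    // first match wins
            }
        }
    }
    return default_type; // no override: keep what the built-in logic decided
}

int main() {
    sketch_overrides ov = { { "output.weight" }, { SK_Q5_K } };
    printf("%d\n", pick_type(&ov, "output.weight",       SK_Q4_K)); // overridden -> 1 (SK_Q5_K)
    printf("%d\n", pick_type(&ov, "blk.0.ffn_up.weight", SK_Q4_K)); // default    -> 0 (SK_Q4_K)
    return 0;
}
```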
Hi, I'm noticing we don't have good perplexity/quality scores for llama3 8b at Q4_0.
F16: Final estimate: PPL = 6.7647
This is a 7.7% difference, but these numbers are even worse earlier on in evaluation. Mistral PPL: only a 2.2% difference for mistral. Different UIs using the Q4_0 series would be getting a higher quality degradation for llama3 than for llama2 or mistral. This isn't a llama.cpp issue; most GPU quantizations will get similar results. Is there a pre-existing quantization sweet spot suitable as the de facto for llama3?
Would be nice to get some usage feedback from people before merging
llama.h
Outdated
LLAMA_FTYPE_CUSTOM = 32, // except 1d tensors

LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
Suggested change:
-    LLAMA_FTYPE_CUSTOM = 32,    // except 1d tensors
-    LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
+    LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
+    LLAMA_FTYPE_CUSTOM = 1025,
enum llama_ftype default_ftype; // default type if not overriden
uint32_t count;                 // number of overrides
const char ** names;            // tensor names
enum ggml_type * types;         // tensor type override
@@ -14886,7 +14925,8 @@ struct llama_model_quantize_params llama_model_quantize_default_params() {
        /*.only_copy      =*/ false,
        /*.pure           =*/ false,
        /*.imatrix        =*/ nullptr,
        /*.kv_overrides   =*/ nullptr,
        /*.override_ftype =*/ nullptr
Suggested change:
-        /*.override_ftype =*/ nullptr
+        /*.override_ftype =*/ nullptr,
@@ -14417,6 +14444,18 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
            new_type = params->output_tensor_type;
        }

        // look up tensor name in type override map, if not found use default
        // type as determined by the ftype.
        if(params->override_ftype) {
Suggested change:
-        if(params->override_ftype) {
+        if (params->override_ftype) {
@@ -14279,7 +14306,7 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
    // copy the KV pairs from the input file
    gguf_set_kv     (ctx_out, ml.meta);
    gguf_set_val_u32(ctx_out, "general.quantization_version", GGML_QNT_VERSION);
-   gguf_set_val_u32(ctx_out, "general.file_type", ftype);
+   gguf_set_val_u32(ctx_out, "general.file_type", params->ftype);
Shouldn't we keep ftype here instead of params->ftype?
    std::string tensor_name = line.substr(0, pos);
    std::string type_name = line.substr(pos + 1);
    ggml_type type = parse_ggml_type(type_name.c_str());
    if(type < 0 || type >= GGML_TYPE_COUNT) {
Suggested change:
-    if(type < 0 || type >= GGML_TYPE_COUNT) {
+    if (type < 0 || type >= GGML_TYPE_COUNT) {
std::string ftype_name;
std::string custom_quant_config_filename;
llama_ftype ftype;
if(!try_parse_ftype(ftype_str, ftype, ftype_name, custom_quant_config_filename)) {
Suggested change:
-    if(!try_parse_ftype(ftype_str, ftype, ftype_name, custom_quant_config_filename)) {
+    if (!try_parse_ftype(ftype_str, ftype, ftype_name, custom_quant_config_filename)) {
llama3 reacts more strongly to quantization, probably because it makes more use of the bits/precision it was trained on. Someone should use MAP to find the frontier of best ppl vs. size (or any other 2-dimensional metric).
@@ -224,13 +246,119 @@ static ggml_type parse_ggml_type(const char * arg) {
    for (int j = 0; j < GGML_TYPE_COUNT; ++j) {
        auto type = ggml_type(j);
        const auto * name = ggml_type_name(type);
-       if (name && strcmp(arg, name) == 0) {
+       if (name && strcasecmp(arg, name) == 0) {
strcasecmp is not available on every platform.
@@ -14570,11 +14573,35 @@ static size_t llama_tensor_quantize_internal(enum ggml_type new_type, const floa
    return new_size;
}

static bool match_string(const std::string& str, const std::string& pattern, uint32_t string_index = 0, uint32_t pattern_index = 0) {
Suggested change:
-    static bool match_string(const std::string& str, const std::string& pattern, uint32_t string_index = 0, uint32_t pattern_index = 0) {
+    static bool match_string(const std::string & str, const std::string & pattern, uint32_t string_index = 0, uint32_t pattern_index = 0) {
            result = type; break;
        }
    }
    return result;
}

static bool parse_kv_override(const char * data, std::vector<llama_model_kv_override> & overrides) {
This function was moved to common.cpp; this definition should be removed from here.
Is this PR still working? I'd be interested to try it on the new …
This PR seems dead, but I have found where you can hack this in. Interestingly there looks to be a lot of hard-coded tests of … The new "shared experts" might need thinking about now too:

"n_routed_experts": 160,
"n_shared_experts": 2,
"num_experts_per_tok": 6

as they will be disproportionately affected by quantization (i.e. used 100% of the time vs 4/160 = 2.5% of the time, for the example …)
static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
const std::string name = ggml_get_name(tensor);
// TODO: avoid hardcoded tensor names - use the TN_* constants
const llm_arch arch = qs.model.arch;
const auto tn = LLM_TN(arch);
auto use_more_bits = [](int i_layer, int num_layers) -> bool {
return i_layer < num_layers/8 || i_layer >= 7*num_layers/8 || (i_layer - num_layers/8)%3 == 2;
};
const int n_expert = std::max(1, (int)qs.model.hparams.n_expert);
auto layer_info = [n_expert] (int i_layer, int n_layer, const char * name) {
if (n_expert > 1) {
            // Believe it or not, "experts" in the FFN of Mixtral-8x7B are not consecutive, but occasionally randomly
// sprinkled in the model. Hence, simply dividing i_ffn_down by n_expert does not work
// for getting the current layer as I initially thought, and we need to resort to parsing the
// tensor name.
if (sscanf(name, "blk.%d.", &i_layer) != 1) {
throw std::runtime_error(format("Failed to determine layer for tensor %s", name));
}
if (i_layer < 0 || i_layer >= n_layer) {
throw std::runtime_error(format("Bad layer %d for tensor %s. Must be in [0, %d)", i_layer, name, n_layer));
}
}
return std::make_pair(i_layer, n_layer);
};
// for arches that share the same tensor between the token embeddings and the output, we quantize the token embeddings
// with the quantization of the output tensor
if (name == tn(LLM_TENSOR_OUTPUT, "weight") || (!qs.has_output && name == tn(LLM_TENSOR_TOKEN_EMBD, "weight"))) {
if (qs.params->output_tensor_type < GGML_TYPE_COUNT) {
new_type = qs.params->output_tensor_type;
} else {
int nx = tensor->ne[0];
if (arch == LLM_ARCH_FALCON || nx % QK_K != 0) {
new_type = GGML_TYPE_Q8_0;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS ||
ftype == LLAMA_FTYPE_MOSTLY_IQ1_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M ||
ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
new_type = GGML_TYPE_Q5_K;
}
else if (new_type != GGML_TYPE_Q8_0) {
new_type = GGML_TYPE_Q6_K;
}
}
} else if (name == "token_embd.weight") {
if (qs.params->token_embedding_type < GGML_TYPE_COUNT) {
new_type = qs.params->token_embedding_type;
} else {
if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS ||
ftype == LLAMA_FTYPE_MOSTLY_IQ1_S || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
new_type = GGML_TYPE_Q2_K;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M) {
new_type = GGML_TYPE_IQ3_S;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) {
new_type = GGML_TYPE_IQ3_S;
}
}
} else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ1_S ||
ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
if (name.find("attn_v.weight") != std::string::npos) {
if (qs.model.hparams.n_gqa() >= 4 || qs.model.hparams.n_expert >= 4) new_type = GGML_TYPE_Q4_K;
else new_type = ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M ? GGML_TYPE_IQ3_S : GGML_TYPE_Q2_K;
++qs.i_attention_wv;
}
else if (qs.model.hparams.n_expert == 8 && name.find("attn_k.weight") != std::string::npos) {
new_type = GGML_TYPE_Q4_K;
}
else if (name.find("ffn_down") != std::string::npos) {
if (qs.i_ffn_down < qs.n_ffn_down/8) {
new_type = ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M ? GGML_TYPE_IQ3_S : GGML_TYPE_Q2_K;
}
++qs.i_ffn_down;
}
else if (name.find("attn_output.weight") != std::string::npos) {
if (qs.model.hparams.n_expert == 8) {
new_type = GGML_TYPE_Q5_K;
} else {
if (ftype == LLAMA_FTYPE_MOSTLY_IQ1_S || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) new_type = GGML_TYPE_IQ2_XXS;
else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M) new_type = GGML_TYPE_IQ3_S;
}
}
} else if (name.find("attn_v.weight") != std::string::npos) {
if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) {
new_type = qs.model.hparams.n_gqa() >= 4 ? GGML_TYPE_Q4_K : GGML_TYPE_Q3_K;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S && qs.model.hparams.n_gqa() >= 4) {
new_type = GGML_TYPE_Q4_K;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) {
new_type = qs.model.hparams.n_gqa() >= 4 ? GGML_TYPE_Q4_K : !qs.has_imatrix ? GGML_TYPE_IQ3_S : GGML_TYPE_IQ3_XXS;
}
else if ((ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ3_S) && qs.model.hparams.n_gqa() >= 4) {
new_type = GGML_TYPE_Q4_K;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_M) {
new_type = GGML_TYPE_Q4_K;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M) {
new_type = qs.i_attention_wv < 2 ? GGML_TYPE_Q5_K : GGML_TYPE_Q4_K;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) new_type = GGML_TYPE_Q5_K;
else if ((ftype == LLAMA_FTYPE_MOSTLY_IQ4_NL || ftype == LLAMA_FTYPE_MOSTLY_IQ4_XS) && qs.model.hparams.n_gqa() >= 4) {
new_type = GGML_TYPE_Q5_K;
}
else if ((ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) &&
use_more_bits(qs.i_attention_wv, qs.n_attention_wv)) new_type = GGML_TYPE_Q6_K;
else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S && qs.i_attention_wv < 4) new_type = GGML_TYPE_Q5_K;
if (qs.model.type == MODEL_70B) {
// In the 70B model we have 8 heads sharing the same attn_v weights. As a result, the attn_v.weight tensor is
// 8x smaller compared to attn_q.weight. Hence, we can get a nice boost in quantization accuracy with
// nearly negligible increase in model size by quantizing this tensor with more bits:
if (new_type == GGML_TYPE_Q3_K || new_type == GGML_TYPE_Q4_K) new_type = GGML_TYPE_Q5_K;
}
if (qs.model.hparams.n_expert == 8) {
// for the 8-expert model, bumping this to Q8_0 trades just ~128MB
// TODO: explore better strategies
new_type = GGML_TYPE_Q8_0;
}
++qs.i_attention_wv;
} else if (name.find("attn_k.weight") != std::string::npos) {
if (qs.model.hparams.n_expert == 8) {
// for the 8-expert model, bumping this to Q8_0 trades just ~128MB
// TODO: explore better strategies
new_type = GGML_TYPE_Q8_0;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS) {
new_type = GGML_TYPE_IQ3_XXS;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) {
new_type = GGML_TYPE_IQ2_S;
}
} else if (name.find("attn_q.weight") != std::string::npos) {
if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS) {
new_type = GGML_TYPE_IQ3_XXS;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) {
new_type = GGML_TYPE_IQ2_S;
}
} else if (name.find("ffn_down") != std::string::npos) {
auto info = layer_info(qs.i_ffn_down, qs.n_ffn_down, name.c_str());
int i_layer = info.first, n_layer = info.second;
if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) new_type = GGML_TYPE_Q3_K;
else if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S) {
if (i_layer < n_layer/8) new_type = GGML_TYPE_Q4_K;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS && !qs.has_imatrix) {
new_type = i_layer < n_layer/8 ? GGML_TYPE_Q4_K : GGML_TYPE_Q3_K;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M) {
new_type = i_layer < n_layer/16 ? GGML_TYPE_Q5_K
: arch != LLM_ARCH_FALCON || use_more_bits(i_layer, n_layer) ? GGML_TYPE_Q4_K
: GGML_TYPE_Q3_K;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_M && (i_layer < n_layer/8 ||
(qs.model.hparams.n_expert == 8 && use_more_bits(i_layer, n_layer)))) {
new_type = GGML_TYPE_Q4_K;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) {
new_type = arch == LLM_ARCH_FALCON ? GGML_TYPE_Q4_K : GGML_TYPE_Q5_K;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
if (arch == LLM_ARCH_FALCON) {
new_type = i_layer < n_layer/16 ? GGML_TYPE_Q6_K :
use_more_bits(i_layer, n_layer) ? GGML_TYPE_Q5_K : GGML_TYPE_Q4_K;
} else {
if (use_more_bits(i_layer, n_layer)) new_type = GGML_TYPE_Q6_K;
}
}
else if (i_layer < n_layer/8 && (ftype == LLAMA_FTYPE_MOSTLY_IQ4_NL || ftype == LLAMA_FTYPE_MOSTLY_IQ4_XS) && !qs.has_imatrix) {
new_type = GGML_TYPE_Q5_K;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M && use_more_bits(i_layer, n_layer)) new_type = GGML_TYPE_Q6_K;
else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S && arch != LLM_ARCH_FALCON && i_layer < n_layer/8) {
new_type = GGML_TYPE_Q5_K;
}
else if ((ftype == LLAMA_FTYPE_MOSTLY_Q4_0 || ftype == LLAMA_FTYPE_MOSTLY_Q5_0)
&& qs.has_imatrix && i_layer < n_layer/8) {
// Guard against craziness in the first few ffn_down layers that can happen even with imatrix for Q4_0/Q5_0.
// We only do it when an imatrix is provided because a) we want to make sure that one can always get the
// same quantization as before imatrix stuff, and b) Q4_1/Q5_1 do go crazy on ffn_down without an imatrix.
new_type = ftype == LLAMA_FTYPE_MOSTLY_Q4_0 ? GGML_TYPE_Q4_1 : GGML_TYPE_Q5_1;
}
++qs.i_ffn_down;
} else if (name.find("attn_output.weight") != std::string::npos) {
if (arch != LLM_ARCH_FALCON) {
if (qs.model.hparams.n_expert == 8) {
if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K || ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS ||
ftype == LLAMA_FTYPE_MOSTLY_Q3_K_S || ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M || ftype == LLAMA_FTYPE_MOSTLY_IQ4_NL ||
ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S || ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M || ftype == LLAMA_FTYPE_MOSTLY_IQ3_S ||
ftype == LLAMA_FTYPE_MOSTLY_IQ3_M || ftype == LLAMA_FTYPE_MOSTLY_IQ4_XS) {
new_type = GGML_TYPE_Q5_K;
}
} else {
if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K ) new_type = GGML_TYPE_Q3_K;
else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) new_type = GGML_TYPE_IQ3_S;
else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M ) new_type = GGML_TYPE_Q4_K;
else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L ) new_type = GGML_TYPE_Q5_K;
else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_M ) new_type = GGML_TYPE_Q4_K;
}
} else {
if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) new_type = GGML_TYPE_Q4_K;
}
}
else if (name.find("attn_qkv.weight") != std::string::npos) {
if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L || ftype == LLAMA_FTYPE_MOSTLY_IQ3_M) {
new_type = GGML_TYPE_Q4_K;
}
else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) new_type = GGML_TYPE_Q5_K;
else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) new_type = GGML_TYPE_Q6_K;
}
else if (name.find("ffn_gate") != std::string::npos) {
auto info = layer_info(qs.i_ffn_gate, qs.n_ffn_gate, name.c_str());
int i_layer = info.first, n_layer = info.second;
if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS && (i_layer >= n_layer/8 && i_layer < 7*n_layer/8)) {
new_type = GGML_TYPE_IQ3_XXS;
}
++qs.i_ffn_gate;
}
else if (name.find("ffn_up") != std::string::npos) {
auto info = layer_info(qs.i_ffn_up, qs.n_ffn_up, name.c_str());
int i_layer = info.first, n_layer = info.second;
if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS && (i_layer >= n_layer/8 && i_layer < 7*n_layer/8)) {
new_type = GGML_TYPE_IQ3_XXS;
}
++qs.i_ffn_up;
}
// if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) new_type = GGML_TYPE_Q3_K;
//}
// IK: let's remove this, else Q2_K is almost the same as Q3_K_S
//else if (name.find("ffn_gate") != std::string::npos || name.find("ffn_up") != std::string::npos) {
// if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) new_type = GGML_TYPE_Q3_K;
//}
// This can be used to reduce the size of the Q5_K_S model.
// The associated PPL increase is fully in line with the size reduction
//else {
// if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_S) new_type = GGML_TYPE_Q4_K;
//}
bool convert_incompatible_tensor = false;
if (new_type == GGML_TYPE_Q2_K || new_type == GGML_TYPE_Q3_K || new_type == GGML_TYPE_Q4_K ||
new_type == GGML_TYPE_Q5_K || new_type == GGML_TYPE_Q6_K || new_type == GGML_TYPE_IQ4_XS ||
new_type == GGML_TYPE_IQ2_XS || new_type == GGML_TYPE_IQ2_XXS || new_type == GGML_TYPE_IQ2_S ||
new_type == GGML_TYPE_IQ3_XXS || new_type == GGML_TYPE_IQ1_S || new_type == GGML_TYPE_IQ3_S ||
new_type == GGML_TYPE_IQ1_M) {
int nx = tensor->ne[0];
int ny = tensor->ne[1];
if (nx % QK_K != 0) {
LLAMA_LOG_WARN("\n\n%s : tensor cols %d x %d are not divisible by %d, required for %s", __func__, nx, ny, QK_K, ggml_type_name(new_type));
convert_incompatible_tensor = true;
} else {
++qs.n_k_quantized;
}
}
if (convert_incompatible_tensor) {
switch (new_type) {
case GGML_TYPE_IQ2_XXS:
case GGML_TYPE_IQ2_XS:
case GGML_TYPE_IQ2_S:
case GGML_TYPE_IQ3_XXS:
case GGML_TYPE_IQ3_S:
case GGML_TYPE_IQ1_S:
case GGML_TYPE_IQ1_M:
case GGML_TYPE_Q2_K:
case GGML_TYPE_Q3_K:
case GGML_TYPE_IQ4_XS: new_type = GGML_TYPE_IQ4_NL; break;
case GGML_TYPE_Q4_K: new_type = GGML_TYPE_Q5_0; break;
case GGML_TYPE_Q5_K: new_type = GGML_TYPE_Q5_1; break;
case GGML_TYPE_Q6_K: new_type = GGML_TYPE_Q8_0; break;
default: throw std::runtime_error("\nUnsupported tensor size encountered\n");
}
LLAMA_LOG_WARN(" - using fallback quantization %s\n", ggml_type_name(new_type));
++qs.n_fallback;
}
return new_type;
}

What criterion was used to find these combinations originally? If we can test each configuration in a reasonable amount of time then it would be quite feasible to optimize this automatically using the Cross-Entropy Method (there is another version [for optimization] not shown on the Wikipedia page that optimizes discrete Bernoulli and/or categorical / "multinoulli" distributions [see Chapter 5 of Rubinstein's book]). The dimensions are likely to be almost independent and it might even be nearly as easy to optimize "layer-index specific" quant schemes. From previous experience using CEM on a highly independent set of variables like this, you would need to be able to perform a minimum of 10-20 evaluations per variable to be optimized (you need much, much more though if you need to assume a non-diagonal covariance matrix [or conditional dependence for the discrete case] - which I don't think this would need, and CMA-ES would be more suitable in that case anyway...). It's very robust to noise, so a noisy/quick evaluation criterion like perplexity will be preferable to a slow/precise criterion like KL-divergence. One potential problem is if the optimization boundaries are hard to set due to, say, perplexity returning …
The original combinations were created by @ikawrakow from his own tests on llama and llama2 models, as far as I know. I think it would be very good to be able to automate this process; I would expect that different models will benefit from different quantization schemes.
This should never happen unless there is a bug in llama.cpp, in which case it needs to be fixed rather than ignored.
I can't promise how soon I can look at it, but it is definitely possible even without understanding any of the original logic: every decision ultimately ends in a new_type = GGML_TYPE_XXX; assignment, so it would just need a modified version of the function that can select this from a categorical distribution (what people in the ML community have started calling "multinoulli"). The name "Cross Entropy Method" might sound intimidating, but it is actually super-simple: sample candidate configurations from the current distribution, evaluate them, keep the best-scoring fraction, re-fit the distribution to those elites, and repeat.
Just eyeballing the function, there look to be maybe 5-10 choices for a given model, so using a population of 100 and assuming 10-20 evaluations per variable: a week has …
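For concreteness, here is a minimal, self-contained sketch of CEM over categorical choices. The evaluator is a stub standing in for a real quantize-and-measure-perplexity run, and the variable/choice counts are invented:

```cpp
// Cross-Entropy Method sketch: per-variable categorical distributions are
// sampled, the elite samples are kept, and the distributions are re-fit.
#include <algorithm>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

static double evaluate_scheme(const std::vector<int> & choices) {
    // Stub objective: a real setup would quantize with the chosen per-group
    // types and return perplexity plus a size penalty.
    double score = 0.0;
    for (size_t i = 0; i < choices.size(); ++i) {
        score += (choices[i] - 2) * (choices[i] - 2); // pretend choice 2 is best everywhere
    }
    return score;
}

int main() {
    const int n_vars = 8;    // e.g. tensor groups: attn_v, attn_k, ffn_down, ...
    const int n_choices = 5; // e.g. candidate quant types per group
    const int pop = 100;
    const int elite = 10;
    const double smooth = 0.7;

    std::mt19937 rng(42);
    // p[v][c] = probability of picking choice c for variable v
    std::vector<std::vector<double>> p(n_vars, std::vector<double>(n_choices, 1.0 / n_choices));

    for (int gen = 0; gen < 30; ++gen) {
        std::vector<std::pair<double, std::vector<int>>> scored(pop);
        for (int s = 0; s < pop; ++s) {
            std::vector<int> choices(n_vars);
            for (int v = 0; v < n_vars; ++v) {
                std::discrete_distribution<int> dist(p[v].begin(), p[v].end());
                choices[v] = dist(rng);
            }
            scored[s] = { evaluate_scheme(choices), choices };
        }
        std::sort(scored.begin(), scored.end(),
                  [](const auto & a, const auto & b) { return a.first < b.first; });

        // re-fit each categorical distribution to the elite samples, with smoothing
        for (int v = 0; v < n_vars; ++v) {
            std::vector<double> freq(n_choices, 0.0);
            for (int e = 0; e < elite; ++e) {
                freq[scored[e].second[v]] += 1.0 / elite;
            }
            for (int c = 0; c < n_choices; ++c) {
                p[v][c] = smooth * freq[c] + (1.0 - smooth) * p[v][c];
            }
        }
        printf("gen %2d best score %.3f\n", gen, scored[0].first);
    }
    return 0;
}
```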
That sounds very interesting. I am not sure what parameter you would use to optimize a quantization scheme. I guess what we really want to find is the Pareto front of all the quantization schemes that are not worse than any other scheme in both file size and perplexity at the same time. Not sure if that affects your estimation of the number of samples.
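For illustration, filtering a set of candidate schemes down to that Pareto front takes only a few lines; the candidate names and numbers below are invented:

```cpp
// Keep only schemes that no other scheme beats on both file size and perplexity.
#include <cstdio>
#include <string>
#include <vector>

struct Scheme {
    std::string name;
    double size_gb;
    double ppl;
};

static std::vector<Scheme> pareto_front(const std::vector<Scheme> & all) {
    std::vector<Scheme> front;
    for (const auto & a : all) {
        bool dominated = false;
        for (const auto & b : all) {
            if (b.size_gb <= a.size_gb && b.ppl <= a.ppl &&
                (b.size_gb < a.size_gb || b.ppl < a.ppl)) {
                dominated = true; // b is at least as good everywhere and strictly better somewhere
                break;
            }
        }
        if (!dominated) {
            front.push_back(a);
        }
    }
    return front;
}

int main() {
    std::vector<Scheme> candidates = {
        { "mix-A", 4.7, 6.96 },
        { "mix-B", 4.3, 7.40 },
        { "mix-C", 4.3, 7.10 }, // dominates mix-B
        { "mix-D", 5.5, 6.90 },
    };
    for (const auto & s : pareto_front(candidates)) {
        printf("%s  %.1f GB  ppl %.2f\n", s.name.c_str(), s.size_gb, s.ppl);
    }
    return 0;
}
```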
Generally, the more you constrain it and the more the independence assumption is broken, the more samples you need (i.e. it will waste samples trying to pass over the constraints and have trouble navigating non-orthogonal "valleys" otherwise). If the independence assumption is very wrong then it's almost certainly better to use CMA-ES instead (CEM does have a version using a non-diagonal covariance matrix, but it requires a Cholesky factorization to sample from and needs many more samples to reliably estimate the covariance matrix compared to CMA-ES's incremental method). There are likely other things, like using a clipped Gaussian instead of a categorical distribution (as the choices are ordered), that can be tried to reduce the number of samples needed. It works really well in practice and can often find solutions a human could not, because the human gets stuck in a local optimum they can't escape by tuning a single variable alone. If the optimization landscape is very smooth and "nice" there are other methods that can use far fewer samples. Somebody with an OR background would likely be able to suggest even better ways of tackling this - I've just had success in the past using CEM for problems almost exactly like this (and SPSA for problems with homogeneous variables and low-noise evaluations available).
Sorry I missed this part of your question:
This can likely be found in a single run by starting with the maximum allowable memory budget, converging to a (fairly) stable solution, and then reducing the budget constraint/penalty downwards (or vice versa). If you search for "L1 regularization path" you'll see plots, found all in one run, which are basically doing the same thing by reducing (or increasing) the penalty during a single run of the optimization algorithm.
Out of luck trying to do anything with the "shared_experts":
If I can get this PR working, or figure out how to hack the … I can't find it now, but I read a paper that hypothesised the later layers don't do all that much and mostly just do averaging (link me if you know this paper please!). This paper (which came later IIRC) also shows this: https://arxiv.org/pdf/2403.17887, starting around the 60th percentile layer in.
Charles Goddard (the Mergekit creator) tried the above method here: https://huggingface.co/chargoddard/llama3-42b-v0 but I think it's got a much better chance keeping the layers and just having them more heavily quantized...
Actually I've found they are in separate tensors and are named differently. I've also found the low-rank … So I've tried to look through the function to distil what @ikawrakow obviously must have spent hours figuring out, and have come up with this:

// ### JUK'S DEEPSEEK V2 CUSTOM CONFIG (Use: 'llama-quantize --imatrix ... ... ... Q5_K_M') ###
if (name == tn(LLM_TENSOR_OUTPUT, "weight")) {
new_type = GGML_TYPE_Q6_K;
} else if (name == "token_embd.weight") {
new_type = GGML_TYPE_Q5_K;
} else if (name.find("attn_q_a.weight") != std::string::npos || name.find("attn_q_b.weight") != std::string::npos) {
new_type = GGML_TYPE_Q8_0;
} else if (name.find("attn_kv_a_mqa.weight") != std::string::npos || name.find("attn_kv_b.weight") != std::string::npos) {
new_type = GGML_TYPE_Q8_0;
// ++qs.i_attention_wv; @@@ Looks to be used for 'use_more_bits' tests and not outside this function... @@@
} else if (name.find("attn_output.weight") != std::string::npos) {
new_type = GGML_TYPE_Q5_K;
} else if (name.find("shexp.weight") != std::string::npos) {
new_type = GGML_TYPE_Q8_0;
} else if (name.find("ffn_down_exps.weight") != std::string::npos) {
auto info = layer_info(qs.i_ffn_down, qs.n_ffn_down, name.c_str());
int i_layer = info.first, n_layer = info.second;
if (i_layer < n_layer/8 || i_layer >= 7*n_layer/8) {
new_type = GGML_TYPE_IQ4_XS;
}
else {
new_type = GGML_TYPE_IQ3_XXS;
}
++qs.i_ffn_down;
} else if (name.find("ffn_gate_exps.weight") != std::string::npos) {
auto info = layer_info(qs.i_ffn_gate, qs.n_ffn_gate, name.c_str());
int i_layer = info.first, n_layer = info.second;
if (i_layer < n_layer/8 || i_layer >= 7*n_layer/8) {
new_type = GGML_TYPE_IQ3_XXS;
}
else {
new_type = GGML_TYPE_IQ2_S;
}
++qs.i_ffn_gate;
} else if (name.find("ffn_up_exps.weight") != std::string::npos) {
auto info = layer_info(qs.i_ffn_up, qs.n_ffn_up, name.c_str());
int i_layer = info.first, n_layer = info.second;
if (i_layer < n_layer/8 || i_layer >= 7*n_layer/8) {
new_type = GGML_TYPE_IQ3_XXS;
}
else {
new_type = GGML_TYPE_IQ2_S;
}
++qs.i_ffn_up;
} else
// ### JUK ###

It needs to be copied right before this line:

if (name == tn(LLM_TENSOR_OUTPUT, "weight") || (!qs.has_output && name == tn(LLM_TENSOR_TOKEN_EMBD, "weight")))

The mix of the … Hopefully this helps, as the … I will also try just leaving all these as …
Yeah, I think quantizing the low-rank attention weights was absolutely killing the model... I've put in a PR to fix this: #8194.
Giving some usage feedback so that this gets merged. This PR works almost out-of-the-box (you just need to define quant types in lowercase instead of uppercase in the config, plus a few tweaks to adapt to the latest llama.cpp state). It introduces very useful functionality. Props to @jubruckne!
Hello, I'd like to share a few findings and explain why this PR could be very beneficial for quants. For example, the best candidate is the Gemma 2 series, especially at 9B (imat):
Note that IQ1_S is 12.87% of the size of F16, while having 16.0064 PPL (54.91% of F16).

On the other end, here is the same table for Llama 3.1 8B (imat):
Note that IQ1_S is 12.57% of the size of F16, while having 63.4502 PPL (11.54% of F16).

It's arguable that smaller models are prone to more damage caused by quantization. But even so, Qwen 2.5 14B (imat), while having 5B more parameters, suffers more than Gemma 2 9B:
Note that IQ1_S is 12.21% of the size of F16, while having 22.0082 PPL (27.14% of F16).

Conclusion

IQ1_S might be a poor example regarding its use cases; however, it helps show how quants can be affected differently depending on the architecture of the model, and how much custom quants could benefit quality. I'd like to try this PR and "bruteforce" my way to the lowest perplexity achievable on the models suffering the most from quantization. I really hope this gets merged soon.
I like this idea of having configurable/custom quantisation. But isn't it simpler to use std C++ regex rather than coding a match_string? And I think there is a JSON lib already in use in llama.cpp, so maybe we could use it for the "quant.cfg" file?
I agree, JSON would be more accessible for me and probably many others.
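For what it's worth, a std::regex-based override lookup along the lines suggested above might look like this (illustrative only; the PR itself uses ?/* globs and a plain key=value cfg file):

```cpp
// Sketch: map tensor-name regexes to type names; first matching rule wins.
#include <cstdio>
#include <regex>
#include <string>
#include <vector>

struct override_rule {
    std::regex  pattern; // e.g. R"(blk\.1\d\.ffn_up\.weight)"
    std::string type;    // e.g. "Q4_K"
};

static const std::string * lookup_type(const std::vector<override_rule> & rules, const std::string & tensor_name) {
    for (const auto & r : rules) {
        if (std::regex_match(tensor_name, r.pattern)) {
            return &r.type; // first matching rule wins, like the PR's cfg file
        }
    }
    return nullptr; // no override: fall back to the default ftype logic
}

int main() {
    std::vector<override_rule> rules = {
        { std::regex(R"(blk\.1\d\.ffn_up\.weight)"), "Q4_K"  },
        { std::regex(R"(.*_down.*)"),                "IQ3_S" },
        { std::regex(R"(output\.weight)"),           "Q5_K"  },
    };
    const std::string * t = lookup_type(rules, "blk.12.ffn_up.weight");
    printf("%s\n", t ? t->c_str() : "(default)");
    return 0;
}
```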
I've just tried to build jubruckne/llama.cpp as of commit 20b2243 and it's failing with this error:
I'd really love to see this functionality make its way into master. @jubruckne, do you still have plans to work on this? To @slaren and @cebtenzzre: other than the above error, what else needs to be done before this is ready for review? Thank you everyone.
I think it would be important that the implementation of custom quantization schemes can be used to replace the current logic. That is to say, it should be possible to remove the code of the current quantization schemes and express them as inputs to the custom quantization scheme. Otherwise, we would just be adding more complexity on top of already too complex code. I don't know what the state of this PR is.
This is not ready to merge, but I wanted to get your opinion on whether it's something you'd be interested in including. If so, I can clean it up and improve it a little.
The idea is to allow creating a custom quantization mix by reading the per-layer quant type from a config file, by specifying CUSTOM as the type, like so:
./quantize --allow-requantize ../models/Meta-Llama-3-8B-Instruct.Q8_0.gguf ./llama3-q.gguf CUSTOM
The config file is currently hardcoded to read quant.cfg from the current directory (sample cfg is included). In the config file I allow specifying a default type for tensors that are not explicitly overridden, and the tensor name / type pairs with the requested type.
Possible improvements would be: