Fully deprecate AutoGPTQ and AutoAWQ for GPT-QModel #41567
base: main
Conversation
cc @MekkCyber for quantization
Force-pushed from 5492303 to d839d2b
We have begun AutoAWQ deprecation as well.

Hi @Qubitium! Thanks a lot for working on this! Quick question: what do you mean by AutoAWQ being part of GPT-QModel now? Did you integrate the entire library (including the transformers dependency, like AutoAWQ does), or did you just port over the linear layers, kernels, and related components?
@SunMarc @MekkCyber The PR is now synced to the pending Peft/Optimum PRs. Ready for code review for this portion. All tests pass with the pending gpt-qmodel 5.4.4 release (later today). Notable changes:
SunMarc left a comment:
Thanks, left some minor comments!
```python
# if self.quantization_config.version == AWQLinearVersion.IPEX:
#     from ..integrations import post_init_awq_ipex_modules
#
#     model = post_init_awq_ipex_modules(model)
```
should this be removed?
```diff
 do_fuse (`bool`, *optional*, defaults to `False`):
-    Whether to fuse attention and mlp layers together for faster inference
+    Deprecated, Whether to fuse attention and mlp layers together for faster inference
 fuse_max_seq_len (`int`, *optional*):
-    The Maximum sequence length to generate when using fusing.
+    Deprecated, The Maximum sequence length to generate when using fusing.
 modules_to_fuse (`dict`, *optional*, default to `None`):
-    Overwrite the natively supported fusing scheme with the one specified by the users.
+    Deprecated, Overwrite the natively supported fusing scheme with the one specified by the users.
```
remove it directly since those are not used
```diff
     The version of the quantization algorithm to use. GEMM is better for big batch_size (e.g. >= 8) otherwise,
     GEMV is better (e.g. < 8 ). GEMM models are compatible with Exllama kernels.
-backend (`AwqBackendPackingMethod`, *optional*, defaults to `AwqBackendPackingMethod.AUTOAWQ`):
+backend (`AwqBackendPackingMethod`, *optional*, defaults to `AwqBackendPackingMethod.GPTQMODEL`):
```
update docstring
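For context, a minimal sketch of what the new default means for callers; it assumes `AwqBackendPackingMethod` is importable from `transformers.utils.quantization_config` (as in current transformers) and that the default shown in the diff above has landed:

```python
from transformers import AwqConfig
from transformers.utils.quantization_config import AwqBackendPackingMethod

# With the diff above applied, a config built without an explicit backend
# defaults to the GPT-QModel packing method instead of AutoAWQ.
config = AwqConfig(bits=4, version="gemm")
assert config.backend == AwqBackendPackingMethod.GPTQMODEL
```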
```
    Whether to use zero point quantization.
version (`AWQLinearVersion`, *optional*, defaults to `AWQLinearVersion.GEMM`):
    The version of the quantization algorithm to use. GEMM is better for big batch_size (e.g. >= 8) otherwise,
    GEMV is better (e.g. < 8 ). GEMM models are compatible with Exllama kernels.
```
This is not used anymore, so put it below in a legacy-params section, for example.
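A short sketch of what that suggestion could look like; the section title, the `bits` entry, and the rationale in parentheses are illustrative assumptions, not the actual docstring:

```python
from transformers.utils.quantization_config import QuantizationConfigMixin


class AwqConfig(QuantizationConfigMixin):
    """
    Args:
        bits (`int`, *optional*, defaults to 4):
            The number of bits to quantize to.

    Legacy args (no longer used for kernel selection, kept only so older checkpoints still load):
        version (`AWQLinearVersion`, *optional*, defaults to `AWQLinearVersion.GEMM`):
            The version of the quantization algorithm to use. GEMM is better for big batch_size (e.g. >= 8),
            otherwise GEMV is better (e.g. < 8). GEMM models are compatible with Exllama kernels.
    """
```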
```python
# def test_wrong_backend(self):
#     """
#     Simple test that checks if a user passes a wrong backend an error is raised
#     """
#     # This should work fine
#     _ = AwqConfig(bits=4)
#
#     with self.assertRaises(ValueError):
#         AwqConfig(bits=4, backend="")
#
#     # These should work fine
#     _ = AwqConfig(bits=4, version="GEMM")
#     _ = AwqConfig(bits=4, version="gemm")
#
#     with self.assertRaises(ValueError):
#         AwqConfig(bits=4, backend="unexisting-backend")
#
#     # Only cuda and xpu devices can run this function
#     support_llm_awq = False
#     device_type, major, _ = get_device_properties()
#     if device_type == "cuda" and major >= 8:
#         support_llm_awq = True
#     elif device_type == "xpu":
#         support_llm_awq = True
#
#     if support_llm_awq:
#         # LLMAWQ should work on an A100
#         AwqConfig(bits=4, backend="llm-awq")
#     else:
#         # LLMAWQ does not work on a T4
#         with self.assertRaises(ValueError):
#             AwqConfig(bits=4, backend="llm-awq")
```
update
```python
# @slow
# @require_gptqmodel
# @require_accelerate
# class AwqIPEXTest(unittest.TestCase):
#     def test_quantized_model_ipex(self):
#         """
#         Simple test that checks if the quantized model is working properly with ipex backend
#         """
#         quantization_config = AwqConfig(version="ipex")
#
#         model = AutoModelForCausalLM.from_pretrained(
#             "TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ",
#             quantization_config=quantization_config,
#             device_map="cpu",
#         )
#         tokenizer = AutoTokenizer.from_pretrained("TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ")
#         input_ids = tokenizer.encode("How to make a cake", return_tensors="pt")
#         pad_token_id = tokenizer.eos_token_id
#         output = model.generate(input_ids, do_sample=False, max_length=20, pad_token_id=pad_token_id)
#         print(tokenizer.decode(output[0], skip_special_tokens=True))
#
#         expected_output = (
#             "How to make a cake with a round tin?\nHow to make a cake with a round tin?\n1. Preheat the oven to 180°"
#         )
#         self.assertIn(tokenizer.decode(output[0], skip_special_tokens=True), expected_output)
```
if this doesn't work yet, we can just skip for now
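If skipping is the route taken for now, a minimal sketch using the standard library's `unittest.skip` decorator (the class and test names are taken from the commented-out test above; the skip reason is illustrative):

```python
import unittest


@unittest.skip("AWQ IPEX path is not wired up to the GPT-QModel backend yet")
class AwqIPEXTest(unittest.TestCase):
    def test_quantized_model_ipex(self):
        # Body left as in the commented-out test above; the whole class is
        # reported as skipped until the ipex version is supported again.
        ...
```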
```python
if not is_gptqmodel_available():
    self.skipTest("gptqmodel not available")
```
do not skip the test here, add a decorator instead
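A minimal sketch of the decorator approach, reusing the `require_gptqmodel` decorator that already appears in the commented-out tests above; the test class and method names are placeholders:

```python
import unittest

from transformers.testing_utils import require_gptqmodel


@require_gptqmodel
class AwqQuantizedTest(unittest.TestCase):
    def test_quantized_model(self):
        # No manual `if not is_gptqmodel_available(): self.skipTest(...)` check is
        # needed; the decorator skips the whole class when gptqmodel is missing.
        ...
```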
```python
    tmpdirname, device_map=self.device_map
)
quant_type = "exllamav2"
# if self.quantized_model.config["quantization_config"]["format"] == ""
```
remove?
```python
    post_init_awq_exllama_modules,
    post_init_awq_ipex_modules,
    replace_quantization_scales,
```
remove imports if we deleted them
@bot /style

Style bot fixed some files and pushed the changes.

[For maintainers] Suggested jobs to run (before merge): run-slow: autoawq, gptq
Remove autogptq clutter and autogptq-related configs that are not worth adding backward compat for.

GPTQModel has had a slight project name change to GPT-QModel (the PyPI package and import name stay the same), as we have now added awq/AutoAWQ into our repo and will be making a PR soon to address awq loading using GPT-QModel.

GPTQConfig has the most important changes in this PR: the 3 removed properties are all related to kernel selection. These 3 are a hot-potato mess and legacy from autogptq. GPT-QModel uses the unified, already-existing backend property to select kernels. Compat code was written in 2024 to convert these 3 properties to backend behind the scenes, but that is no longer relevant in 2025.

Note:
- kernel.QUANT_TYPE is a str. GPT-QModel returns the best-performing kernel for each module, and it may differ per module depending on in/out features and other gptq/module properties in relation to device type + dtype + many other factors.
- Tests should only assert kernel.QUANT_TYPE when they pin a specific kernel via backend selection.
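To make the kernel-assertion note concrete, a rough sketch of a test that pins a kernel via `backend` and only then checks `QUANT_TYPE`; the backend string, the checkpoint id, and reading `QUANT_TYPE` directly off each quantized module are assumptions for illustration, not the exact API:

```python
from transformers import AutoModelForCausalLM, GPTQConfig

# `backend` is the single, unified kernel selector after this PR.
# "exllama_v2" is illustrative -- consult the GPT-QModel BACKEND enum for accepted names.
config = GPTQConfig(bits=4, backend="exllama_v2")

model = AutoModelForCausalLM.from_pretrained(
    "org/some-gptq-checkpoint",  # placeholder model id
    quantization_config=config,
    device_map="auto",
)

# Only assert a specific QUANT_TYPE because the kernel was pinned via `backend`.
# Without pinning, GPT-QModel may pick a different best-performing kernel per
# module depending on in/out features, device type, dtype, and other factors.
for name, module in model.named_modules():
    if hasattr(module, "QUANT_TYPE"):
        assert module.QUANT_TYPE == "exllamav2", f"{name} uses {module.QUANT_TYPE}"
```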