Fully deprecate AutoGPTQ and AutoAWQ for GPT-QModel #41567
base: main
Conversation
cc @MekkCyber for quantization
Force-pushed from 5492303 to d839d2b
We have begun AutoAWQ deprecation as well.

Hi @Qubitium! Thanks a lot for working on this! Quick question: what do you mean by AutoAWQ being part of GPT-QModel now? Did you integrate the entire library (including the transformers dependency, like AutoAWQ does), or did you just port over the linear layers, kernels, and related components?
@SunMarc @MekkCyber The PR is now synced to the pending Peft/Optimum PRs. Ready for code review for this portion. All tests pass with the pending gpt-qmodel 5.4.4 release (later today). Notable changes:
SunMarc left a comment:
Thanks, left some minor comments!
```python
# if self.quantization_config.version == AWQLinearVersion.IPEX:
#     from ..integrations import post_init_awq_ipex_modules
#
#     model = post_init_awq_ipex_modules(model)
```
should this be removed?
```diff
 do_fuse (`bool`, *optional*, defaults to `False`):
-    Whether to fuse attention and mlp layers together for faster inference
+    Deprecated, Whether to fuse attention and mlp layers together for faster inference
 fuse_max_seq_len (`int`, *optional*):
-    The Maximum sequence length to generate when using fusing.
+    Deprecated, The Maximum sequence length to generate when using fusing.
 modules_to_fuse (`dict`, *optional*, default to `None`):
-    Overwrite the natively supported fusing scheme with the one specified by the users.
+    Deprecated, Overwrite the natively supported fusing scheme with the one specified by the users.
```
remove it directly since those are not used
```diff
     The version of the quantization algorithm to use. GEMM is better for big batch_size (e.g. >= 8) otherwise,
     GEMV is better (e.g. < 8 ). GEMM models are compatible with Exllama kernels.
-backend (`AwqBackendPackingMethod`, *optional*, defaults to `AwqBackendPackingMethod.AUTOAWQ`):
+backend (`AwqBackendPackingMethod`, *optional*, defaults to `AwqBackendPackingMethod.GPTQMODEL`):
```
update docstring
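For context, a minimal sketch of what the new default means for callers; it assumes `AwqBackendPackingMethod` is importable from `transformers.utils.quantization_config` (as in current transformers) and that the default shown in the diff above has landed:

```python
from transformers import AwqConfig
from transformers.utils.quantization_config import AwqBackendPackingMethod

# With the diff above applied, a config built without an explicit backend
# defaults to the GPT-QModel packing method instead of AutoAWQ.
config = AwqConfig(bits=4, version="gemm")
assert config.backend == AwqBackendPackingMethod.GPTQMODEL
```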
```
    Whether to use zero point quantization.
version (`AWQLinearVersion`, *optional*, defaults to `AWQLinearVersion.GEMM`):
    The version of the quantization algorithm to use. GEMM is better for big batch_size (e.g. >= 8) otherwise,
    GEMV is better (e.g. < 8 ). GEMM models are compatible with Exllama kernels.
```
This is not used anymore, so put it below in a legacy-params section, for example.
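A short sketch of what that suggestion could look like; the section title, the `bits` entry, and the rationale in parentheses are illustrative assumptions, not the actual docstring:

```python
from transformers.utils.quantization_config import QuantizationConfigMixin


class AwqConfig(QuantizationConfigMixin):
    """
    Args:
        bits (`int`, *optional*, defaults to 4):
            The number of bits to quantize to.

    Legacy args (no longer used for kernel selection, kept only so older checkpoints still load):
        version (`AWQLinearVersion`, *optional*, defaults to `AWQLinearVersion.GEMM`):
            The version of the quantization algorithm to use. GEMM is better for big batch_size (e.g. >= 8),
            otherwise GEMV is better (e.g. < 8). GEMM models are compatible with Exllama kernels.
    """
```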
```python
# def test_wrong_backend(self):
#     """
#     Simple test that checks if a user passes a wrong backend an error is raised
#     """
#     # This should work fine
#     _ = AwqConfig(bits=4)
#
#     with self.assertRaises(ValueError):
#         AwqConfig(bits=4, backend="")
#
#     # These should work fine
#     _ = AwqConfig(bits=4, version="GEMM")
#     _ = AwqConfig(bits=4, version="gemm")
#
#     with self.assertRaises(ValueError):
#         AwqConfig(bits=4, backend="unexisting-backend")
#
#     # Only cuda and xpu devices can run this function
#     support_llm_awq = False
#     device_type, major, _ = get_device_properties()
#     if device_type == "cuda" and major >= 8:
#         support_llm_awq = True
#     elif device_type == "xpu":
#         support_llm_awq = True
#
#     if support_llm_awq:
#         # LLMAWQ should work on an A100
#         AwqConfig(bits=4, backend="llm-awq")
#     else:
#         # LLMAWQ does not work on a T4
#         with self.assertRaises(ValueError):
#             AwqConfig(bits=4, backend="llm-awq")
```
update
```python
# @slow
# @require_gptqmodel
# @require_accelerate
# class AwqIPEXTest(unittest.TestCase):
#     def test_quantized_model_ipex(self):
#         """
#         Simple test that checks if the quantized model is working properly with ipex backend
#         """
#         quantization_config = AwqConfig(version="ipex")
#
#         model = AutoModelForCausalLM.from_pretrained(
#             "TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ",
#             quantization_config=quantization_config,
#             device_map="cpu",
#         )
#         tokenizer = AutoTokenizer.from_pretrained("TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ")
#         input_ids = tokenizer.encode("How to make a cake", return_tensors="pt")
#         pad_token_id = tokenizer.eos_token_id
#         output = model.generate(input_ids, do_sample=False, max_length=20, pad_token_id=pad_token_id)
#         print(tokenizer.decode(output[0], skip_special_tokens=True))
#
#         expected_output = (
#             "How to make a cake with a round tin?\nHow to make a cake with a round tin?\n1. Preheat the oven to 180°"
#         )
#         self.assertIn(tokenizer.decode(output[0], skip_special_tokens=True), expected_output)
```
if this doesn't work yet, we can just skip for now
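If skipping is the route taken for now, a minimal sketch using the standard library's `unittest.skip` decorator (the class and test names are taken from the commented-out test above; the skip reason is illustrative):

```python
import unittest


@unittest.skip("AWQ IPEX path is not wired up to the GPT-QModel backend yet")
class AwqIPEXTest(unittest.TestCase):
    def test_quantized_model_ipex(self):
        # Body left as in the commented-out test above; the whole class is
        # reported as skipped until the ipex version is supported again.
        ...
```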
```python
if not is_gptqmodel_available():
    self.skipTest("gptqmodel not available")
```
do not skip the test here, add a decorator instead
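A minimal sketch of the decorator approach, reusing the `require_gptqmodel` decorator that already appears in the commented-out tests above; the test class and method names are placeholders:

```python
import unittest

from transformers.testing_utils import require_gptqmodel


@require_gptqmodel
class AwqQuantizedTest(unittest.TestCase):
    def test_quantized_model(self):
        # No manual `if not is_gptqmodel_available(): self.skipTest(...)` check is
        # needed; the decorator skips the whole class when gptqmodel is missing.
        ...
```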
```python
    tmpdirname, device_map=self.device_map
)
quant_type = "exllamav2"
# if self.quantized_model.config["quantization_config"]["format"] == ""
```
remove?
```python
    post_init_awq_exllama_modules,
    post_init_awq_ipex_modules,
    replace_quantization_scales,
```
remove imports if we deleted them
@bot /style

Style bot fixed some files and pushed the changes.

[For maintainers] Suggested jobs to run (before merge): run-slow: autoawq, gptq
Remove autogptq clutter and autogptq-related configs that are not worth adding backward compat for.

GPTQModel has had a slight project name change to GPT-QModel (the PyPI package and import name stay the same), as we have now added awq/AutoAWQ into our repo and will be making a PR soon to address awq loading using GPT-QModel.

GPTQConfig has the most important changes in this PR: the 3 removed properties are all related to kernel selection. These 3 are a hot-potato mess and legacy from autogptq. GPT-QModel uses the unified, already-existing backend property to select kernels. Compat code was written in 2024 to convert these 3 properties to backend behind the scenes, but that is no longer relevant in 2025.

Note:
- kernel.QUANT_TYPE is a str. GPT-QModel returns the best-performing kernel for each module, and it may differ per module depending on in/out features and other gptq/module properties in relation to device type + dtype + many other factors.
- Tests should only assert kernel.QUANT_TYPE when they pin a specific kernel via backend selection.
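To make the kernel-assertion note concrete, a rough sketch of a test that pins a kernel via `backend` and only then checks `QUANT_TYPE`; the backend string, the checkpoint id, and reading `QUANT_TYPE` directly off each quantized module are assumptions for illustration, not the exact API:

```python
from transformers import AutoModelForCausalLM, GPTQConfig

# `backend` is the single, unified kernel selector after this PR.
# "exllama_v2" is illustrative -- consult the GPT-QModel BACKEND enum for accepted names.
config = GPTQConfig(bits=4, backend="exllama_v2")

model = AutoModelForCausalLM.from_pretrained(
    "org/some-gptq-checkpoint",  # placeholder model id
    quantization_config=config,
    device_map="auto",
)

# Only assert a specific QUANT_TYPE because the kernel was pinned via `backend`.
# Without pinning, GPT-QModel may pick a different best-performing kernel per
# module depending on in/out features, device type, dtype, and other factors.
for name, module in model.named_modules():
    if hasattr(module, "QUANT_TYPE"):
        assert module.QUANT_TYPE == "exllamav2", f"{name} uses {module.QUANT_TYPE}"
```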