[WIP] Fully deprecate AutoGPTQ and AutoAWQ for GPT-QModel #41567
Conversation
Signed-off-by: ZX-ModelCloud <[email protected]>
cc @MekkCyber for quantization
Force-pushed from 5492303 to d839d2b
We have begun AutoAWQ deprecation as well.
Hi @Qubitium! Thanks a lot for working on this! Quick question: what do you mean by AutoAWQ being part of GPT-QModel now? Did you integrate the entire library (including the transformers dependency, like AutoAWQ does), or did you just port over the linear layers, kernels, and related components?
Long story short: we folded AutoAWQ into GPT-QModel in multiple stages over the past few weeks. Stage 1: directly port/copy the AutoAWQ code over. Stage 2: refactor. Stage 3: fix bugs, add new kernels, and do a major refactor to align with the new internal life cycle in GPT-QModel v5.0+. So we are currently post Stage 3, where the GPT-QModel base retains minimal original AutoAWQ code. Most of the AutoAWQ code has been refactored away. Major changes vs AutoAWQ:
HF ecosystem compat: work on [...]. We will hold off [...]. The final goal of the 2 PRs is to remove dead AutoGPTQ code (no one uses it, or should use it, frankly) and the nearly dead AutoAWQ (the repo is read-only and no longer accepting bug fixes or new model support). Compat for loading old models quantized with these two packages will be maintained.
Thanks for working on this @Qubitium. We are still debating whether this is something we should offload to GPT-QModel or whether we should start upstreaming some of the inference code directly into transformers. About [...]
I think badly of this proposal.
I just checked the PR, which has no code. I am not going to waste time arguing vaporware vs. what I have done with AWQ in GPT-QModel over the past 2 months: AWQ inference and quantization, full stack complete, with new kernels, new model support, and full CI kernel and modeling validation. GPT-QModel can be viewed not as an AutoAWQ port but as a full point-release upgrade in every regard. Edit: I have outlined in a prior post why [...]
Our quantized models are more compatible with vLLM/SGLang than ones quantized with Optimum or AutoAWQ
SGLang/vLLM compat has been a number-one design target from day one, so 100% yes.
It's a good chance to deprecate autoawq as it's archived. I suppose the best way is to upstream to the transformers main codebase, just like we did in the AutoGPTQ replacement. For example, the IPEX linear in AutoAWQ is out of date; we need a new implementation for it. The new linear implementation is TorchFusedLinear in gptqmodel.
@jiqing-feng The AWQ version of GPTQ [...]
@SunMarc For the most part, the kernels for AWQ and GPTQ are shared. For example, we do not compile an extra AWQ-only Marlin kernel: the previous gptq-only Marlin kernel was synced from vLLM and now runs AWQ weights as well.
CI passing status using GPT-QModel. @SunMarc The PR is in a working state and ready for preliminary review. Look at the code diffs: we are eliminating 5x more cruft for every line of code we add for the new awq integration.
SunMarc left a comment:
Thanks, left a couple of comments. As I said, I'm happy to see that you are willing to fill the hole left by AutoAWQ, and eager to see this PR merged. However, note that in the future we may add a default working path for GEMM if gptq-model is not installed. As those libraries depend on kernels that require building and distribution for each new version of torch, we never know when this will suddenly stop.
Left a couple of comments. Also, maybe it would be better to split this PR into 2: one for gptq and one for awq?
if self.quantization_config.do_fuse:
    from ..integrations import fuse_awq_modules

    model = fuse_awq_modules(model, self.quantization_config)
    model._awq_is_fused = True  # TODO: consider storing this flag in model.config instead
Fused module code has all been removed. AutoAWQ used to do quant-linear-level fusing, but I do not believe this is maintainable or good: if SGLang/vLLM adopt Transformers v5 for model loading, they will do their own auto fusing, and the quant module should not interfere with that.
Fusing only happens when the user specifies do_fuse in the config when loading the awq model using from_pretrained, so it shouldn't impact SGLang or vLLM at all. Also, you can't serialize the model if we fuse the modules. So I think we should still try to maintain that path if possible.
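For reference, a minimal sketch of how the do_fuse path is opted into today in transformers (the checkpoint name is just a placeholder):

from transformers import AutoModelForCausalLM, AwqConfig

# Fusing only kicks in when explicitly requested in the quantization config.
quant_config = AwqConfig(bits=4, do_fuse=True, fuse_max_seq_len=512)

model = AutoModelForCausalLM.from_pretrained(
    "org/some-awq-checkpoint",  # placeholder AWQ model id
    quantization_config=quant_config,
)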
Fusing has no place in AWQ. QKV/MLP fusing should be done at an upper level, just like what vLLM and SGLang are doing. This is the wrong code/logic for 2025. AutoAWQ did it to squeeze out some inference performance, but if you look at the code, it is a hot mess of static model class mappings and unmaintainable. The number of new models coming out and the emergence of SGLang and vLLM make this obsolete. It is my understanding that HF v5 is going to be used as the model-loading foundation for SGLang/vLLM, which makes it even more so.
Those users that depend on AutoAWQ fusing will need to choose. We are not going to spend valuable energy supporting dead code.
if not is_gptqmodel_available():
    raise ValueError(
-       "AWQ (either `autoawq` or `llmawq`) is not available. Please install it with `pip install autoawq` or check out the installation guide in https://github.com/mit-han-lab/llm-awq"
+       "AWQ (either `llmawq`) is not available. Please install it with `pip install gptqmodel` or check out the installation guide in https://github.com/mit-han-lab/llm-awq"
Just a note, but if this doesn't make it into v5, we will have to slowly deprecate autoawq.
I will try to make this as clean as possible to reach that v5 goal. We are removing 75% of the code, not adding. Other than fusing, the features are not deprecated, only improved, with zero compat issues: new kernels, better hw compat, faster kernels, and even bug fixes. The current awq kernels are fp16-only and failed all our bf16 kernel output quality tests. We will make sure users do not execute in bf16, or at least warn when this happens (loading a model and executing in bf16 when the awq kernels are validated for fp16 only).
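A minimal sketch of the kind of dtype guard described here; this is a hypothetical helper, not the actual GPT-QModel implementation:

import warnings

import torch

def warn_on_unvalidated_awq_dtype(dtype: torch.dtype) -> None:
    # The current awq kernels are only validated for fp16 output quality,
    # so warn (rather than fail) when a model would run in bf16.
    if dtype == torch.bfloat16:
        warnings.warn(
            "AWQ kernels are validated for torch.float16 only; "
            "running in torch.bfloat16 may degrade output quality.",
            UserWarning,
        )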
from ..modeling_utils import PreTrainedModel

-from ..utils import is_auto_gptq_available, is_gptqmodel_available, is_optimum_available, is_torch_available, logging
+from ..utils import is_gptqmodel_available, is_optimum_available, is_torch_available, logging
Can you do two separate PRs, one for gptq and one for awq? I will be able to quickly merge the gptq one.
GPTQ changes are minimal and mostly cosmetic. But this PR is required for huggingface/peft#2917 (comment) due to interdependency.
from gptqmodel.nn_modules.qlinear.awq_gemm import AwqGEMMQuantLinear

-target_cls = WQLinear_GEMM
+target_cls = AwqGEMMQuantLinear
As I said, we might replace this path with one handled by kernels at some point.
That's fine. At that point in the future, HF staff can override the auto kernel selection code from gpt-qmodel and return a specific AwqGEMM from kernels. It will be a clean override and requires no changes to gpt-qmodel. My point is that it is unreasonable to burden our task further by imposing a future optional requirement that does not resolve the issue now.
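A purely hypothetical sketch of what such a clean override could look like, under the assumption that gpt-qmodel keeps auto-selection as the default and a caller can hand it a resolver returning a kernel class (all names are illustrative):

from typing import Callable, Optional, Type

def select_awq_quant_linear(
    auto_select: Callable[[], Type],
    override: Optional[Callable[[], Optional[Type]]] = None,
) -> Type:
    # If a caller-supplied override resolves a kernel class (e.g. an AwqGEMM
    # implementation pulled from the kernels hub), use it as-is; otherwise
    # fall back to the library's own auto kernel selection.
    if override is not None:
        kernel_cls = override()
        if kernel_cls is not None:
            return kernel_cls
    return auto_select()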
    from gptqmodel.nn_modules.qlinear.awq_gemv import AwqGEMVQuantLinear

-    target_cls = WQLinear_GEMV
+    target_cls = AwqGEMVQuantLinear
elif quantization_config.version == AWQLinearVersion.EXLLAMA:
    if quantization_config.exllama_config["version"] == ExllamaVersion.ONE:
-        from awq.modules.linear.exllama import WQLinear_Exllama
+        from gptqmodel.nn_modules.qlinear.awq_exllama import AwqExllamaQuantLinear

-        target_cls = WQLinear_Exllama
+        target_cls = AwqExllamaQuantLinear
    elif quantization_config.exllama_config["version"] == ExllamaVersion.TWO:
Unlike gptq, the version selection is not automatic, is that right?
Just double-checked this code and it is incomplete. It needs to be changed to do auto kernel selection, just like the gptq kernel selection. The reason is that the same format (version) has multiple compatible kernels (GEMM can be mapped to [AwqTorch, AwqTorchFused, AwqGEMM, AwqMarlin]), and for the same reason it is unreasonable to expect users to manually pass backend to select kernels.
After the update, this entire block of manual kernel selection will be replaced by one line of hf_select_awq_kernel or something similar.
Note that we will be removing the old, flawed awq terminology of version (actually format) and backend (no need for this; it is unique to llm-awq, and we will auto-compat during config loading for llm-awq, where there is no quant_method and only a version attribute). Backward compat will be maintained via config load/save mapping, as sketched below.
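A hypothetical sketch of that config load/save mapping, normalizing the legacy version field into format plus quant_method (field names and values are assumptions, not the final implementation):

def normalize_legacy_awq_config(raw: dict) -> dict:
    # llm-awq checkpoints carry only a `version` attribute and no `quant_method`.
    config = dict(raw)
    config.setdefault("quant_method", "awq")
    # The legacy `version` field actually describes the packed weight format.
    if "version" in config and "format" not in config:
        config["format"] = str(config.pop("version")).lower()
    return config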
@SunMarc @MekkCyber Hold off on review. I will ping once ready. I need to remove more code related to fuse and kernel selection.
[For maintainers] Suggested jobs to run (before merge): run-slow: autoawq, gptq
@SunMarc @MekkCyber Update. This PR will be updated once we finish a small refactor and add sync auto kernel selection, just like what we did with gptq in ModelCloud/GPTQModel#2214. Both gptq and awq kernel selection will be folded into a single api. In addition, the original AwqGEMM kernel will be split into effectively 3 distinct kernels: TorchGEMM, CudaGEMM, TritonGEMM. The autoawq gemm kernel was actually 3 kernels in one monolithic one; sounds nice, but terrible for CI/kernel output regression/comparison tests, with zero performance benefit. GPT-QModel will auto-select the kernels based on system env, device_map, and kernel qualifications (method, format, etc). This will knock off another layer of complexity in the existing HF code.

from typing import Dict, Optional, Type, Union

import torch

# FORMAT, METHOD, BACKEND, and BaseQuantLinear are gptqmodel types
# public/stable api exposed to transformers/optimum
def hf_select_quant_linear_v2(
bits: int,
group_size: int,
desc_act: bool,
sym: bool,
format: Union[str, FORMAT], # awq `version` should be pre-mapped to format
quant_method: Union[str, METHOD], # awq llm-awq `version` should be pre-mapped to method
zero_point: Optional[bool] = True, # awq only
dtype: Optional[Union[str, torch.dtype]] = None,
meta: Optional[Dict[str, any]] = None,
pack: Optional[bool] = True,
device_map: Optional[Union[str, dict]] = None,
backend: Optional[Union[str, BACKEND]] = None,
) -> Type[BaseQuantLinear]:
    ...
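A hypothetical usage sketch from the transformers side, assuming the signature above; the import path and argument values are illustrative, not the final gptqmodel API:

import torch

# Assumed import location for the proposed v2 selector.
from gptqmodel.utils.importer import hf_select_quant_linear_v2

target_cls = hf_select_quant_linear_v2(
    bits=4,
    group_size=128,
    desc_act=False,
    sym=True,
    format="gemm",        # awq `version` pre-mapped to a format
    quant_method="awq",   # llm-awq `version` pre-mapped to a method
    zero_point=True,
    dtype=torch.float16,
    device_map="auto",
    pack=False,           # loading pre-quantized weights, not packing
)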
Does it mean we can upstream some specific ops for awq or gptq to the kernels community? In that case, can gptqmodel pull kernels from the community at runtime?
Remove autogptq clutter and autogptq-related configs that are not worth adding backward compat for.
GPTQModel has a slight project name change (the PyPI package and import name stay the same) to GPT-QModel, with a dash, as we have now added awq/AutoAWQ into our repo and will be making a PR soon to address awq loading using GPT-QModel.
GPTQConfig has the most important changes in this PR: the 3 removed properties are all related to kernel selection. These 3 are a hot-potato mess and a legacy from autogptq. GPT-QModel uses the unified backend (existing) property to select kernels. There was compat code written in 2024 to convert these 3 properties to backend behind the scenes, but it is no longer relevant for 2025.
Note: kernel.QUANT_TYPE (str). GPT-QModel will return the best-performing kernel for the relevant module, and it may be different per module due to in/out features and other gptq/module properties in relation to device type, dtype, and many other factors. Tests can check kernel.QUANT_TYPE if the test specifies a specific kernel via backend selection.
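A hypothetical sketch of the kind of CI check described in the note above; the module walk and attribute access are assumptions based on the description, not the exact GPT-QModel test code:

def assert_selected_kernel(model, expected_quant_type: str) -> None:
    # When a test pins a kernel via `backend`, the selected quant-linear
    # modules can be checked through their QUANT_TYPE string.
    for name, module in model.named_modules():
        quant_type = getattr(module, "QUANT_TYPE", None)
        if quant_type is not None:
            assert quant_type == expected_quant_type, (
                f"{name}: expected {expected_quant_type}, got {quant_type}"
            )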