
Conversation

@Qubitium
Contributor

@Qubitium Qubitium commented Oct 14, 2025

Remove AutoGPTQ clutter and AutoGPTQ-related configs that are not worth keeping for backward compatibility.

GPTQModel has had a slight project name change to GPT-QModel, with a hyphen (the PyPI package and import name stay the same), as we have now added AWQ/AutoAWQ into our repo and will soon open a PR to address AWQ loading using GPT-QModel.

GPTQConfig has the most important changes in this PR:

# New GPTQConfig Property. Applicable for sister Peft/Optimum PRs
act_group_aware (`bool`, *optional*, defaults to `True`):
    Use GAR (group-aware activation order) during quantization. Has a measurable positive impact on quantization
    quality. Only applicable when `desc_act = False`. Will be forced to `False` when `desc_act = True`.
    
    
# Removed GPTQConfig Properties:
use_cuda_fp16
use_exllama
exllama_config

The three removed properties all relate to kernel selection. They are a hot-potato mess and legacy from AutoGPTQ. GPT-QModel uses the existing unified backend property to select kernels. Compat code was written in 2024 to convert these three properties to backend behind the scenes, but it is no longer relevant in 2025.
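
For reference, a minimal before/after config sketch (the kernel name passed to `backend` is illustrative only, not a value fixed by this PR):

from transformers import GPTQConfig

# Before (AutoGPTQ-era properties, now removed):
#   GPTQConfig(bits=4, use_cuda_fp16=True, use_exllama=True, exllama_config={"version": 2})

# After: either let GPT-QModel auto-select the kernel (optionally with the new GAR flag)...
auto_config = GPTQConfig(bits=4, act_group_aware=True)

# ...or pin a kernel explicitly via the existing unified `backend` property.
pinned_config = GPTQConfig(bits=4, backend="exllama_v2")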

Note:

  • Transformers/Optimum/Peft CI tests should never check kernel.QUANT_TYPE (str) unconditionally. GPT-QModel will return the best-performing kernel for the relevant module, and it may differ per module depending on in/out features and other GPTQ/module properties, in combination with device type, dtype, and many other factors.
  • CI tests should only assert kernel.QUANT_TYPE when the test pins a specific kernel via backend selection, as in the sketch below.
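
A minimal test-side sketch of that rule (the model id and kernel string are placeholders, not values defined by this PR):

from transformers import AutoModelForCausalLM, GPTQConfig

# The kernel is pinned explicitly, so asserting its type below is meaningful.
config = GPTQConfig(bits=4, backend="exllama_v2")
model = AutoModelForCausalLM.from_pretrained("org/some-gptq-model", quantization_config=config)

# Without the pinned backend, GPT-QModel may pick a different kernel per module
# (in/out features, device type, dtype, ...), and this assertion would be flaky.
for module in model.modules():
    if hasattr(module, "QUANT_TYPE"):
        assert module.QUANT_TYPE == "exllama_v2"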

@Rocketknight1
Member

cc @MekkCyber for quantization

@Qubitium Qubitium changed the title [WIP] Fully deprecate AutoGPTQ for GPT-QModel [WIP] Fully deprecate AutoGPTQ and AutoAWQ for GPT-QModel Nov 20, 2025
@Qubitium
Contributor Author

We have begun AutoAWQ deprecation as well.

  • Fused-module code has all been removed. AutoAWQ used to do quant-linear-level fusing, but I do not believe this is maintainable or desirable: if SGLang/vLLM adopt Transformers v5 for model loading, they will do their own auto fusing, and the quant module should not interfere with that.

  • IPEX is deprecated by Intel, and we have a new AwqTorchFused kernel (based on the same Intel TorchFused kernel used for GPTQ), so any code/unit tests for IPEX now point to the AwqTorchFused kernel.

@MekkCyber
Contributor

Hi @Qubitium! Thanks a lot for working on this! Quick question: what do you mean by AutoAWQ being part of GPT-QModel now? Did you integrate the entire library (including the transformers dependency, like AutoAWQ does), or did you just port over the linear layers, kernels, and related components?

Signed-off-by: ZX-ModelCloud <[email protected]>
@Qubitium Qubitium changed the title [WIP] Fully deprecate AutoGPTQ and AutoAWQ for GPT-QModel Fully deprecate AutoGPTQ and AutoAWQ for GPT-QModel Dec 2, 2025
@Qubitium Qubitium marked this pull request as ready for review December 2, 2025 09:11
@Qubitium
Contributor Author

Qubitium commented Dec 2, 2025

@SunMarc @MekkCyber The PR is now synced to the pending Peft/Optimum PRs. Ready for code review for this portion. All tests pass with the pending gpt-qmodel 5.4.4 release (later today).

Notable changes:

  1. hf_select_quant_linear_v2 now auto-selects the kernel for both GPTQ and AutoAWQ. No more kernel-selection crud in transformers: GPTQ/AWQ kernel selection is merged into a single API used strictly by HF, for future API stability. Let gpt-qmodel decide, as it has the best view to return the best/latest kernel.

  2. AutoAWQ fusing code has been removed. That code is not maintainable (static-map based, model-arch specific) and is not relevant for vLLM/SGLang, as they do their own fusing. Transformers v5, I believe, is also introducing more generic fusing, so any manual, per-model-arch fusing done by the previous AutoAWQ code should be eliminated.

  3. AwqConfig now inherits from GPTQConfig due to shared properties. For GPTQ, the legacy checkpoint_format is remapped to format internally, but for backward compat, until a future deprecation, we also write checkpoint_format on save via to_dict. For AWQ, version is now mapped to format internally, and likewise, for compat, we write version from the format value in to_dict. This is consistent with what gpt-qmodel does, for code clarity while maintaining backward compat (see the sketch below).
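
To make point 3 concrete, a standalone sketch of the remapping (a simplified illustration that folds both legacy keys into one class for brevity, not the actual transformers classes):

class AwqLikeConfig:
    """Sketch: legacy `version` (AWQ) / `checkpoint_format` (GPTQ) normalize to the internal `format` field."""

    def __init__(self, bits=4, format=None, checkpoint_format=None, version=None):
        self.bits = bits
        # Whichever key the checkpoint provides is normalized to the internal `format` field.
        self.format = format or checkpoint_format or version or "gemm"

    def to_dict(self):
        return {
            "bits": self.bits,
            "format": self.format,
            # Legacy mirrors, written on save for backward compat until a future deprecation.
            "checkpoint_format": self.format,  # GPTQ legacy key
            "version": self.format,            # AWQ legacy key
        }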

Member

@SunMarc SunMarc left a comment

Thanks, left some minor comments!

Comment on lines 95 to 98
# if self.quantization_config.version == AWQLinearVersion.IPEX:
# from ..integrations import post_init_awq_ipex_modules
#
# model = post_init_awq_ipex_modules(model)
Member

should this be removed?

Comment on lines 813 to 818
do_fuse (`bool`, *optional*, defaults to `False`):
Whether to fuse attention and mlp layers together for faster inference
Deprecated, Whether to fuse attention and mlp layers together for faster inference
fuse_max_seq_len (`int`, *optional*):
The Maximum sequence length to generate when using fusing.
Deprecated, The Maximum sequence length to generate when using fusing.
modules_to_fuse (`dict`, *optional*, default to `None`):
Overwrite the natively supported fusing scheme with the one specified by the users.
Deprecated, Overwrite the natively supported fusing scheme with the one specified by the users.
Member

remove it directly since those are not used

The version of the quantization algorithm to use. GEMM is better for big batch_size (e.g. >= 8) otherwise,
GEMV is better (e.g. < 8 ). GEMM models are compatible with Exllama kernels.
backend (`AwqBackendPackingMethod`, *optional*, defaults to `AwqBackendPackingMethod.AUTOAWQ`):
backend (`AwqBackendPackingMethod`, *optional*, defaults to `AwqBackendPackingMethod.GPTQMODEL`):
Member

update docstring

Whether to use zero point quantization.
version (`AWQLinearVersion`, *optional*, defaults to `AWQLinearVersion.GEMM`):
The version of the quantization algorithm to use. GEMM is better for big batch_size (e.g. >= 8) otherwise,
GEMV is better (e.g. < 8 ). GEMM models are compatible with Exllama kernels.
Member

this is not used anymore, so put it below in the legacy param section, for example.

Comment on lines 44 to 75
# def test_wrong_backend(self):
# """
# Simple test that checks if a user passes a wrong backend an error is raised
# """
# # This should work fine
# _ = AwqConfig(bits=4)
#
# with self.assertRaises(ValueError):
# AwqConfig(bits=4, backend="")
#
# # These should work fine
# _ = AwqConfig(bits=4, version="GEMM")
# _ = AwqConfig(bits=4, version="gemm")
#
# with self.assertRaises(ValueError):
# AwqConfig(bits=4, backend="unexisting-backend")
#
# # Only cuda and xpu devices can run this function
# support_llm_awq = False
# device_type, major, _ = get_device_properties()
# if device_type == "cuda" and major >= 8:
# support_llm_awq = True
# elif device_type == "xpu":
# support_llm_awq = True
#
# if support_llm_awq:
# # LLMAWQ should work on an A100
# AwqConfig(bits=4, backend="llm-awq")
# else:
# # LLMAWQ does not work on a T4
# with self.assertRaises(ValueError):
# AwqConfig(bits=4, backend="llm-awq")
Member

update

Comment on lines 308 to 332
# @slow
# @require_gptqmodel
# @require_accelerate
# class AwqIPEXTest(unittest.TestCase):
# def test_quantized_model_ipex(self):
# """
# Simple test that checks if the quantized model is working properly with ipex backend
# """
# quantization_config = AwqConfig(version="ipex")
#
# model = AutoModelForCausalLM.from_pretrained(
# "TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ",
# quantization_config=quantization_config,
# device_map="cpu",
# )
# tokenizer = AutoTokenizer.from_pretrained("TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ")
# input_ids = tokenizer.encode("How to make a cake", return_tensors="pt")
# pad_token_id = tokenizer.eos_token_id
# output = model.generate(input_ids, do_sample=False, max_length=20, pad_token_id=pad_token_id)
# print(tokenizer.decode(output[0], skip_special_tokens=True))
#
# expected_output = (
# "How to make a cake with a round tin?\nHow to make a cake with a round tin?\n1. Preheat the oven to 180°"
# )
# self.assertIn(tokenizer.decode(output[0], skip_special_tokens=True), expected_output)
Member

if this doesn't work yet, we can just skip for now

Comment on lines +179 to +180
if not is_gptqmodel_available():
self.skipTest("gptqmodel not available")
Member

do not skip test here, add a decorator instead

tmpdirname, device_map=self.device_map
)
quant_type = "exllamav2"
# if self.quantized_model.config["quantization_config"]["format"] == ""
Member

remove?

Comment on lines 167 to 169
post_init_awq_exllama_modules,
post_init_awq_ipex_modules,
replace_quantization_scales,
Member

remove imports if we deleted them

@SunMarc
Member

SunMarc commented Dec 3, 2025

@bot /style

@github-actions
Contributor

github-actions bot commented Dec 3, 2025

Style bot fixed some files and pushed the changes.

@github-actions
Contributor

github-actions bot commented Dec 3, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: autoawq, gptq
