diff --git a/docs/guides/evaluating-models.md b/docs/guides/evaluating-models.md index e278aef6d..7a95850ee 100644 --- a/docs/guides/evaluating-models.md +++ b/docs/guides/evaluating-models.md @@ -119,6 +119,17 @@ same mechanism `forgather train` uses). Pass an explicit path with `--checkpoint PATH` to pin a specific one, or `--no-checkpoint` to load via `AutoModelForCausalLM.from_pretrained` on the model directory. +**Quantized models** (artifacts produced by `forgather finalize --quantize`) +are autodetected: if `config.json` has a `quantization_config` block, eval +forces the `from_pretrained()` path and ignores both `--checkpoint` and +`--no-checkpoint`. An explicit `--checkpoint PATH` logs a warning before +being overridden. The checkpoint-resume path uses `from_config()` + +`load_state_dict()` which has no quantizer hook and fails on quantized +tensor subclasses; `from_pretrained()` runs HF's `TorchAoHfQuantizer` +pre-process so the right linear modules are in place before the weights +load. See [QAT Training § Evaluating Quantized +Models](../trainers/qat-training.md#evaluating-quantized-models). + The tokenizer is always loaded directly from `--model` via `AutoTokenizer.from_pretrained`. diff --git a/docs/guides/finalize-model.md b/docs/guides/finalize-model.md index 395955efb..47dd73d46 100644 --- a/docs/guides/finalize-model.md +++ b/docs/guides/finalize-model.md @@ -122,6 +122,14 @@ quantized tensor subclasses don't expose a single `.storage().data_ptr()`, which the safetensors writer requires. If `--safetensors` is passed alongside `--quantize`, it is silently disabled with a warning. +Finalize also writes a `quantization_config` block into `config.json` +with the recipe. This makes HF `AutoModelForCausalLM.from_pretrained()` +auto-detect the quantization on reload and run the +`TorchAoHfQuantizer` pre-process path — so `forgather eval` and the +inference server load the artifact correctly without any caller-side +flag. See [Evaluating Quantized +Models](../trainers/qat-training.md#evaluating-quantized-models). + ### Misc | Option | Description | diff --git a/docs/trainers/qat-training.md b/docs/trainers/qat-training.md index 59a9b7217..61befc1f3 100644 --- a/docs/trainers/qat-training.md +++ b/docs/trainers/qat-training.md @@ -117,7 +117,7 @@ Whether that's worth it depends heavily on the recipe: | Recipe | QAT vs PTQ delta (expected) | Recommendation | |--------|---------------------------|----------------| -| `int8-dynamic-act-int4-weight` | Largest QAT benefit — int4 weights are aggressive enough that plain PTQ can drift noticeably. | QAT if you care about the last point of eval loss; PTQ fine for prototyping. | +| `int8-dynamic-act-int4-weight` | Largest QAT benefit — int4 weights are aggressive enough that plain PTQ can drift noticeably. Measured Δ at 4.43M params is small (0.0013, see [Three-Way Comparison](#three-way-comparison-bf16--ptq--qat)); the gap is expected to widen at larger scale. | QAT if you care about the last point of eval loss; PTQ fine for prototyping. | | `int4-weight-only` | Moderate QAT benefit. Per-group int4 + bf16 matmul is already quite robust. | PTQ first; reach for QAT only if eval drops more than you can absorb. | | `float8-dynamic-act-float8-weight` | Minimal — fp8 is already near-lossless. | PTQ. QAT is rarely justified for fp8. | @@ -261,10 +261,57 @@ forgather finalize output_models/my_bf16_run out/my_bf16_run_int8_int4 \ This is the path used for the AMP-baseline / PTQ / QAT three-way comparison: train the same model once in plain bf16, then run finalize twice (one source bf16, one source QAT) with the same `--quantize` -recipe and compare the eval results. PTQ on a model that was not -QAT-trained typically pays a larger accuracy penalty than the -`~+0.017` eval-loss premium QAT does (see [Loss Trajectory](#loss-trajectory-1-chinchilla-tiny-llama) -above for the QAT numbers); how much depends on the model and recipe. +recipe and compare the eval results. See [Three-Way +Comparison](#three-way-comparison-bf16--ptq--qat) below for measured +numbers on Tiny Llama. + +## Evaluating Quantized Models + +`forgather eval` loads `--quantize`-finalized models with no extra flag: + +```bash +# Same invocation as for a bf16 model — eval autodetects the quantized +# artifact and routes through HF `from_pretrained()` so the +# TorchAoHfQuantizer pre-process path installs the right linear modules. +forgather -p examples/tutorials/tiny_llama eval test tinystories \ + -M /path/to/quantized_model +``` + +How it works: at finalize time, `--quantize` writes a +`quantization_config` block into `config.json` with the recipe. At eval +time, `scripts/eval_script.py:resolve_checkpoint()` reads that field +and forces the `from_pretrained()` load path, **ignoring** any +`--checkpoint` / `--no-checkpoint` flag (an explicit `--checkpoint +PATH` logs a warning before being overridden). The normal +checkpoint-resume path uses `from_config()` + `load_state_dict()`, +which does not run any quantizer hook and fails when the saved tensors +are torchao quantized subclasses. Same mechanism applies to the +inference server — no caller-side changes needed. + +The check is purely additive: bf16 models keep using the +checkpoint-resume path. Pass `--no-checkpoint` to opt into +`from_pretrained()` manually for non-quantized models. + +### Three-Way Comparison: bf16 / PTQ / QAT + +Full eval against the Tiny Stories test split for the 4.43M Tiny Llama +trained at 1-Chinchilla on a single RTX 3090 (matching the [Loss +Trajectory](#loss-trajectory-1-chinchilla-tiny-llama) setup above), all +three using `int8-dynamic-act-int4-weight`: + +| Model | eval_loss | perplexity | Δ vs bf16 | +|-------|-----------|------------|-----------| +| bf16 baseline | 1.3656 | 3.918 | — | +| PTQ on bf16 baseline | 1.3917 | 4.022 | +0.0262 | +| QAT-trained + converted | 1.3905 | 4.017 | +0.0249 | + +At this scale (4.43M params, recipe = int4 weights / int8 dynamic +activations), QAT shaves about **0.0013 eval-loss** off PTQ — a real +but small gain. Whether QAT's ~1.67× training-time overhead pays for +that depends on your tolerance for the +0.025 eval-loss premium that +quantization itself imposes (PTQ buys you almost all of the win at zero +training cost). QAT is expected to scale better at larger models and +more aggressive recipes — measure your own setup before committing. ## Programmatic Usage diff --git a/scripts/eval_script.py b/scripts/eval_script.py index 6edfd0fad..61d8ff0b1 100644 --- a/scripts/eval_script.py +++ b/scripts/eval_script.py @@ -129,6 +129,23 @@ def init_model(): return init_model +def _model_is_quantized(model_path: str) -> bool: + """True iff ``/config.json`` carries a ``quantization_config`` block. + + Written by ``forgather finalize --quantize``; HF's + ``from_pretrained()`` consumes it to install the TorchAoHfQuantizer + pre-process path. Unreadable or malformed config.json -> False. + """ + cfg_path = os.path.join(model_path, "config.json") + if not os.path.isfile(cfg_path): + return False + try: + with open(cfg_path) as f: + return "quantization_config" in json.load(f) + except (OSError, ValueError): + return False + + def resolve_checkpoint(args): """Return (checkpoint_arg, use_checkpoint) for the trainer. @@ -136,12 +153,42 @@ def resolve_checkpoint(args): - True: auto-find latest - str path: explicit - False: do not resume + + Quantized models force ``(False, False)`` regardless of flags: only HF + `from_pretrained()` runs the TorchAoHfQuantizer pre-process that + installs the right quantized linear modules before weight load. The + checkpoint-resume path uses `from_config()` + `load_state_dict()`, + which has no quantizer hook and crashes with `'Parameter' object has + no attribute 'tensor_data_names'` on torchao tensor subclasses. + + Cache the result on ``args`` so repeated calls don't re-log. """ - if args.no_checkpoint: - return False, False - if args.checkpoint: - return args.checkpoint, True - return True, True + cached = getattr(args, "_resolved_checkpoint", None) + if cached is not None: + return cached + + if _model_is_quantized(args.model): + if args.checkpoint: + logger.warning( + "Detected quantization_config in model config; ignoring " + "--checkpoint %r and loading via from_pretrained().", + args.checkpoint, + ) + else: + logger.info( + "Detected quantization_config in model config; " + "loading via from_pretrained() (overrides default checkpoint resume)." + ) + result = (False, False) + elif args.no_checkpoint: + result = (False, False) + elif args.checkpoint: + result = (args.checkpoint, True) + else: + result = (True, True) + + args._resolved_checkpoint = result + return result def build_trainer(args, model_init, eval_dataset, data_collator, tokenizer, device): diff --git a/tools/finalize_model/finalize_model.py b/tools/finalize_model/finalize_model.py index 79817573b..e889885ce 100644 --- a/tools/finalize_model/finalize_model.py +++ b/tools/finalize_model/finalize_model.py @@ -432,11 +432,26 @@ def main(argv=None): logger.info( f"Would run quantize step with recipe '{args.quantize_recipe}'" ) + logger.info( + "Would write 'quantization_config' block to config.json" + ) return 0 # ---- 6. Quantize (optional) ---------------------------------------- if args.quantize_recipe: _apply_quantize(model, args.quantize_recipe) + # Record the recipe on the config so HF `from_pretrained()` runs + # the TorchAoHfQuantizer pre-process path on reload — it installs + # the right quantized linear modules before `load_state_dict`, so + # the quantized tensor subclasses land in slots that know how to + # hold them. Without this block, reload via `from_pretrained()` + # fails with `'Parameter' object has no attribute 'tensor_data_names'`. + from transformers import TorchAoConfig + from forgather.ml.qat_recipes import recipe_to_base_config + + config.quantization_config = TorchAoConfig( + quant_type=recipe_to_base_config(args.quantize_recipe) + ) if args.safetensors: # torchao's quantized tensor subclasses wrap multiple inner # tensors and do not expose a single .storage().data_ptr(),