
Commit 35c4675 (parent 5239ce7)

Add pt2e tutorials to torchao doc page

Summary: after we migrated the pt2e quant code from PyTorch to torchao, we now want to migrate the docs as well.

Test Plan: check the generated docs.

File tree: 9 files changed, +2362 −3 lines

docs/source/index.rst

Lines changed: 14 additions & 1 deletion
@@ -21,6 +21,7 @@ for an overall introduction to the library and recent highlight and updates.
    quantization
    sparsity
    contributor_guide
+   pt2e_quant

 .. toctree::
    :glob:
@@ -35,11 +36,23 @@ for an overall introduction to the library and recent highlight and updates.
 .. toctree::
    :glob:
    :maxdepth: 1
-   :caption: Tutorials
+   :caption: Eager Quantization Tutorials

    serialization
    subclass_basic
    subclass_advanced
    static_quantization
    pretraining
    torchao_vllm_integration
+
+.. toctree::
+   :glob:
+   :maxdepth: 1
+   :caption: PT2E Quantization Tutorials
+
+   tutorials_source/pt2e_quant_ptq
+   tutorials_source/pt2e_quant_qat
+   tutorials_source/pt2e_quant_x86_inductor
+   tutorials_source/pt2e_quant_xpu_inductor
+   tutorials_source/pt2e_quantizer
+   tutorials_source/openvino_quantizer

docs/source/pt2e_quant.rst

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
PyTorch 2 Export Quantization
=========================================

PyTorch 2 Export Quantization is a full-graph quantization workflow, used mostly for static quantization. It targets hardware that requires input activations, output activations, and weights to be quantized, and it relies on recognizing operator patterns (such as linear - relu) to make quantization decisions. PT2E quantization produces a graph with quantize and dequantize ops inserted around the operators; during lowering, these quantized operator patterns are fused into real quantized ops. Overall there are two lowering paths: 1. torch.compile, through Inductor lowering, and 2. ExecuTorch, through delegation.

Here we show an example with ``X86InductorQuantizer``.

API Example::
  import torch
  from torch.export import export
  from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, convert_pt2e
  from torchao.quantization.pt2e.quantizer.x86_inductor_quantizer import (
      X86InductorQuantizer,
      get_default_x86_inductor_quantization_config,
  )

  class M(torch.nn.Module):
      def __init__(self):
          super().__init__()
          self.linear = torch.nn.Linear(5, 10)

      def forward(self, x):
          return self.linear(x)

  # initialize a floating point model
  float_model = M().eval()
  example_inputs = (torch.randn(1, 5),)

  # define calibration function
  def calibrate(model, data_loader):
      model.eval()
      with torch.no_grad():
          for image, target in data_loader:
              model(image)

  # Step 1. program capture
  m = export(float_model, example_inputs).module()
  # we get a model with aten ops

  # Step 2. quantization
  # backend developers will write their own Quantizer and expose methods to
  # allow users to express how they want the model to be quantized
  quantizer = X86InductorQuantizer()
  quantizer.set_global(get_default_x86_inductor_quantization_config())

  # or prepare_qat_pt2e for Quantization Aware Training
  m = prepare_pt2e(m, quantizer)

  # run calibration
  # calibrate(m, sample_inference_data)
  m = convert_pt2e(m)

  # Step 3. lowering
  # lower to target backend

  # Optional: use the C++ wrapper instead of the default Python wrapper
  import torch._inductor.config as config
  config.cpp_wrapper = True

  with torch.no_grad():
      optimized_model = torch.compile(m)

  # run some benchmark
  optimized_model(*example_inputs)
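The comment above mentions ``prepare_qat_pt2e``; for reference, a minimal sketch of the QAT variant of the same flow could look like the following (the one-step training loop, loss, and target are placeholder assumptions for illustration, not part of the torchao API)::

  from torchao.quantization.pt2e.quantize_pt2e import prepare_qat_pt2e

  m = export(float_model, example_inputs).module()
  m = prepare_qat_pt2e(m, quantizer)

  # train with fake quantization in the loop (placeholder one-step loop)
  optimizer = torch.optim.SGD(m.parameters(), lr=1e-3)
  target = torch.randn(1, 10)
  loss = torch.nn.functional.mse_loss(m(*example_inputs), target)
  loss.backward()
  optimizer.step()

  m = convert_pt2e(m)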
Please follow these tutorials to get started on PyTorch 2 Export Quantization:

Modeling Users:

- `PyTorch 2 Export Post Training Quantization <https://docs.pytorch.org/ao/stable/tutorial_source/pt2e_quant_ptq.html>`_
- `PyTorch 2 Export Quantization Aware Training <https://docs.pytorch.org/ao/stable/tutorial_source/pt2e_quant_qat.html>`_
- `PyTorch 2 Export Post Training Quantization with X86 Backend through Inductor <https://docs.pytorch.org/ao/stable/tutorial_source/pt2e_quant_x86_inductor.html>`_
- `PyTorch 2 Export Post Training Quantization with XPU Backend through Inductor <https://docs.pytorch.org/ao/stable/tutorial_source/pt2e_quant_xpu_inductor.html>`_
- `PyTorch 2 Export Quantization for OpenVINO torch.compile Backend <https://docs.pytorch.org/ao/stable/tutorial_source/pt2e_quant_openvino.html>`_

Backend Developers (please check out all Modeling Users docs as well):

- `How to Write a Quantizer for PyTorch 2 Export Quantization <https://docs.pytorch.org/ao/stable/tutorial_source/pt2e_quantizer.html>`_
docs/source/tutorials_source/openvino_quantizer.rst

Lines changed: 251 additions & 0 deletions
@@ -0,0 +1,251 @@
PyTorch 2 Export Quantization for OpenVINO torch.compile Backend
===========================================================================

**Authors**: `Daniil Lyakhov <https://github.com/daniil-lyakhov>`_, `Aamir Nazir <https://github.com/anzr299>`_, `Alexander Suslov <https://github.com/alexsu52>`_, `Yamini Nimmagadda <https://github.com/ynimmaga>`_, `Alexander Kozlov <https://github.com/AlexKoff88>`_

Prerequisites
--------------

- `PyTorch 2 Export Post Training Quantization <https://docs.pytorch.org/ao/stable/tutorial_source/pt2e_quant_ptq.html>`_
- `How to Write a Quantizer for PyTorch 2 Export Quantization <https://docs.pytorch.org/ao/stable/tutorial_source/pt2e_quantizer.html>`_

Introduction
--------------

.. note::

   This is an experimental feature, and the quantization API is subject to change.

This tutorial demonstrates how to use ``OpenVINOQuantizer`` from the `Neural Network Compression Framework (NNCF) <https://github.com/openvinotoolkit/nncf/tree/develop>`_ in the PyTorch 2 Export Quantization flow to generate a quantized model customized for the `OpenVINO torch.compile backend <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_, and explains how to lower the quantized model into the `OpenVINO <https://docs.openvino.ai/2024/index.html>`_ representation.
``OpenVINOQuantizer`` unlocks the full potential of low-precision OpenVINO kernels by placing quantizers specifically for OpenVINO.

The PyTorch 2 export quantization flow uses ``torch.export`` to capture the model into a graph and performs quantization transformations on top of the ATen graph.
This approach is expected to have significantly higher model coverage, improved flexibility, and a simplified UX.
The OpenVINO backend compiles the FX Graph generated by TorchDynamo into an optimized OpenVINO model.

The quantization flow mainly includes four steps:

- Step 1: Capture the FX Graph from the eager model based on the `torch export mechanism <https://pytorch.org/docs/main/export.html>`_.
- Step 2: Apply the PyTorch 2 Export Quantization flow with ``OpenVINOQuantizer`` based on the captured FX Graph.
- Step 3: Lower the quantized model into the OpenVINO representation with the `torch.compile <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_ API.
- Optional step 4: Improve quantized model metrics via the `quantize_pt2e <https://openvinotoolkit.github.io/nncf/autoapi/nncf/experimental/torch/fx/index.html#nncf.experimental.torch.fx.quantize_pt2e>`_ method.

The high-level architecture of this flow could look like this:
::

    float_model(Python)                        Example Input
         \                                          /
          \                                        /
    ------------------------------------------------------
    |                       export                       |
    ------------------------------------------------------
                              |
                      FX Graph in ATen
                              |         OpenVINOQuantizer
                              |               /
    ------------------------------------------------------
    |                    prepare_pt2e                    |
    |                         |                          |
    |                     Calibrate                      |
    |                         |                          |
    |                    convert_pt2e                    |
    ------------------------------------------------------
                              |
                       Quantized Model
                              |
    ------------------------------------------------------
    |                 Lower into Inductor                |
    ------------------------------------------------------
                              |
                       OpenVINO model
Post Training Quantization
----------------------------

Now, we will walk you through a step-by-step tutorial on using it with the `torchvision resnet18 model <https://download.pytorch.org/models/resnet18-f37072fd.pth>`_
for post training quantization.

Prerequisite: OpenVINO and NNCF installation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OpenVINO and NNCF can be installed via the `pip distribution <https://docs.openvino.ai/2024/get-started/install-openvino.html>`_:

.. code-block:: bash

   pip install -U pip
   pip install openvino nncf
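As an optional sanity check (not part of the original tutorial), the installed package versions can be queried via ``importlib.metadata``, which works for any pip-installed package:

.. code-block:: python

   from importlib.metadata import version

   # Print the installed versions of both packages
   print("openvino:", version("openvino"))
   print("nncf:", version("nncf"))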
1. Capture FX Graph
^^^^^^^^^^^^^^^^^^^^^

We will start by performing the necessary imports and capturing the FX Graph from the eager module.

.. code-block:: python

   import copy
   import openvino.torch
   import torch
   import torchvision.models as models
   from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e
   from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e

   import nncf.torch

   # Create the Eager Model
   model_name = "resnet18"
   model = models.__dict__[model_name](pretrained=True)

   # Set the model to eval mode
   model = model.eval()

   # Create the data, using dummy data here as an example
   traced_bs = 50
   x = torch.randn(traced_bs, 3, 224, 224)
   example_inputs = (x,)

   # Capture the FX Graph to be quantized
   with torch.no_grad(), nncf.torch.disable_patching():
       exported_model = torch.export.export(model, example_inputs).module()
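If you want to inspect which ATen operators were captured (an optional step, not part of the original flow), the exported module is a regular FX ``GraphModule`` whose graph can be printed directly:

.. code-block:: python

   # Each node corresponds to an ATen op, e.g. aten.convolution.default
   print(exported_model.graph)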
2. Apply Quantization
^^^^^^^^^^^^^^^^^^^^^^^

After we capture the FX Module to be quantized, we will import the ``OpenVINOQuantizer``.

.. code-block:: python

   from nncf.experimental.torch.fx import OpenVINOQuantizer

   quantizer = OpenVINOQuantizer()
``OpenVINOQuantizer`` has several optional parameters that allow tuning the quantization process to get a more accurate model.
Below is a list of the essential parameters and their descriptions:

* ``preset`` - defines the quantization scheme for the model. Two types of presets are available:

  * ``PERFORMANCE`` (default) - defines symmetric quantization of weights and activations

  * ``MIXED`` - weights are quantized with symmetric quantization and the activations are quantized with asymmetric quantization. This preset is recommended for models with non-ReLU and asymmetric activation functions, e.g. ELU, PReLU, GELU, etc.

  .. code-block:: python

     OpenVINOQuantizer(preset=nncf.QuantizationPreset.MIXED)

* ``model_type`` - used to specify the quantization scheme required for a specific type of model. ``Transformer`` is the only supported special quantization scheme, used to preserve accuracy after quantization of Transformer models (BERT, Llama, etc.). ``None`` is the default, i.e. no specific scheme is defined.

  .. code-block:: python

     OpenVINOQuantizer(model_type=nncf.ModelType.Transformer)

* ``ignored_scope`` - this parameter can be used to exclude some layers from the quantization process to preserve model accuracy, for example, when you want to exclude the last layer of the model from quantization. Below are some examples of how to use this parameter:

  .. code-block:: python

     # Exclude by layer name:
     names = ['layer_1', 'layer_2', 'layer_3']
     OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(names=names))

     # Exclude by layer type:
     types = ['Conv2d', 'Linear']
     OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(types=types))

     # Exclude by regular expression:
     regex = '.*layer_.*'
     OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(patterns=regex))

     # Exclude by subgraphs:
     # In this case, all nodes along all simple paths in the graph
     # from input to output nodes will be excluded from the quantization process.
     subgraph = nncf.Subgraph(inputs=['layer_1', 'layer_2'], outputs=['layer_3'])
     OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(subgraphs=[subgraph]))

* ``target_device`` - defines the target device, the specifics of which will be taken into account during optimization. The following values are supported: ``ANY`` (default), ``CPU``, ``CPU_SPR``, ``GPU``, and ``NPU``.

  .. code-block:: python

     OpenVINOQuantizer(target_device=nncf.TargetDevice.CPU)

For further details on ``OpenVINOQuantizer``, please see the `documentation <https://openvinotoolkit.github.io/nncf/autoapi/nncf/experimental/torch/fx/index.html#nncf.experimental.torch.fx.OpenVINOQuantizer>`_.
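These options can be combined in a single constructor call. As a sketch (the ignored layer name ``fc`` is a hypothetical example, assuming a resnet18-like model whose final classifier is named ``fc``), a quantizer tuned for CPU with the final layer excluded might look like:

.. code-block:: python

   # Hypothetical configuration combining several of the options above
   quantizer = OpenVINOQuantizer(
       preset=nncf.QuantizationPreset.MIXED,
       ignored_scope=nncf.IgnoredScope(names=['fc']),
       target_device=nncf.TargetDevice.CPU,
   )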
After we import the backend-specific Quantizer, we will prepare the model for post-training quantization.
``prepare_pt2e`` folds BatchNorm operators into preceding Conv2d operators and inserts observers in appropriate places in the model.

.. code-block:: python

   prepared_model = prepare_pt2e(exported_model, quantizer)

Now, we will calibrate the ``prepared_model`` after the observers are inserted into the model.

.. code-block:: python

   # We use dummy data as an example here
   prepared_model(*example_inputs)
191+
Finally, we will convert the calibrated Model to a quantized Model. ``convert_pt2e`` takes a calibrated model and produces a quantized model.
192+
193+
.. code-block:: python
194+
195+
quantized_model = convert_pt2e(prepared_model, fold_quantize=False)
196+
197+
After these steps, we finished running the quantization flow, and we will get the quantized model.
198+
199+
200+
3. Lower into OpenVINO representation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After that, the FX Graph can utilize OpenVINO optimizations using the `torch.compile(..., backend="openvino") <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_ functionality.

.. code-block:: python

   with torch.no_grad(), nncf.torch.disable_patching():
       optimized_model = torch.compile(quantized_model, backend="openvino")

       # Running some benchmark
       optimized_model(*example_inputs)

The optimized model uses low-level kernels designed specifically for Intel CPUs.
This should significantly speed up inference time in comparison with the eager model.
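To put a rough number on that speedup (a sketch, not part of the original tutorial; a rigorous benchmark would use more iterations and control for CPU frequency scaling), the eager and optimized models can be timed side by side:

.. code-block:: python

   import time

   def bench(fn, inputs, iters=10):
       fn(*inputs)  # warmup call, so compilation time is excluded
       start = time.perf_counter()
       for _ in range(iters):
           fn(*inputs)
       return (time.perf_counter() - start) / iters

   with torch.no_grad(), nncf.torch.disable_patching():
       print(f"eager:     {bench(model, example_inputs):.4f} s/iter")
       print(f"optimized: {bench(optimized_model, example_inputs):.4f} s/iter")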
4. Optional: Improve quantized model metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

NNCF implements advanced quantization algorithms like `SmoothQuant <https://arxiv.org/abs/2211.10438>`_ and `BiasCorrection <https://arxiv.org/abs/1906.04721>`_, which help
to improve the quantized model metrics while minimizing the output discrepancies between the original and compressed models.
These advanced NNCF algorithms can be accessed via the NNCF ``quantize_pt2e`` API:

.. code-block:: python

   from nncf.experimental.torch.fx import quantize_pt2e

   calibration_loader = torch.utils.data.DataLoader(...)


   def transform_fn(data_item):
       images, _ = data_item
       return images


   calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)
   quantized_model = quantize_pt2e(
       exported_model, quantizer, calibration_dataset, smooth_quant=True, fast_bias_correction=False
   )

For further details, please see the `documentation <https://openvinotoolkit.github.io/nncf/autoapi/nncf/experimental/torch/fx/index.html#nncf.experimental.torch.fx.quantize_pt2e>`_
and a complete `example on Resnet18 quantization <https://github.com/openvinotoolkit/nncf/blob/develop/examples/post_training_quantization/torch_fx/resnet18/README.md>`_.

Conclusion
------------

This tutorial introduces how to use torch.compile with the OpenVINO backend and the OpenVINO quantizer.
For more details on NNCF and the NNCF Quantization Flow for PyTorch models, refer to the `NNCF Quantization Guide <https://docs.openvino.ai/2025/openvino-workflow/model-optimization-guide/quantizing-models-post-training/basic-quantization-flow.html>`_.
For additional information, check out the `OpenVINO Deployment via torch.compile Documentation <https://docs.openvino.ai/2024/openvino-workflow/torch-compile.html>`_.
