
v2.18.0

@AlexanderDokuchaev released this 04 Sep 14:37

Post-training Quantization:

  • Features:
    • (OpenVINO) Introduced two new compression data types for weight compression: CB4_F8E4M3 and CODEBOOK. CB4_F8E4M3 is a fixed codebook of 16 fp8 (f8e4m3) values based on the NF4 data type, while CODEBOOK is an arbitrary, user-selectable codebook for experimenting with custom data types. The AWQ and scale estimation algorithms are supported for both data types; see the first sketch after this list.
    • (OpenVINO) Added support for compressing FP8 (f8e4m3 and f8e5m2) weights to 4-bit data types, which is particularly beneficial for models like DeepSeek-R1.
    • Added a group_size_fallback_mode parameter to the advanced weight compression options. It controls how nodes whose channel size is not divisible by the default group size are handled: by default (IGNORE) such nodes are skipped, ERROR raises an exception, and ADJUST attempts to modify the group size so that it becomes valid (see the second sketch after this list).
    • (TorchFX) Added support for external quantizers in the quantize_pt2e API, including XNNPACKQuantizer and CoreMLQuantizer. Users can now quantize their models for the ExecuTorch XNNPACK and CoreML backends via quantize_pt2e, employing SmoothQuant, bias correction, and a wide range of statistic collectors (see the third sketch after this list).
    • (ONNX) Added support for data-aware weight compression in the ONNX backend, including the AWQ and scale estimation algorithms, and provided an example demonstrating a data-aware weight compression pipeline with the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model in ONNX format (see the fourth sketch after this list).
  • Improvements:
    • Added support for weight compression of models with Rotary Positional Embedding blocks.
    • Added support for weight compression of models with stateful self-attention blocks.
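
A minimal sketch of the new codebook modes, assuming they are selected through nncf.compress_weights like the existing 4-bit weight compression modes (the model path is illustrative):

```python
import openvino as ov

import nncf

model = ov.Core().read_model("model.xml")

# Compress weights with the fixed 16-value fp8 codebook (CB4_F8E4M3).
# AWQ and scale estimation can be enabled on top of this mode; they would
# additionally require a calibration `dataset=` argument.
compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.CB4_F8E4M3,
)
ov.save_model(compressed, "model_cb4.xml")
```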
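A hedged sketch of the new group size fallback behavior; the import path of the advanced-parameters dataclass and the location of the GroupSizeFallbackMode enum are assumptions, so check the API reference:

```python
import openvino as ov

import nncf
from nncf.quantization.advanced_parameters import AdvancedCompressionParameters

model = ov.Core().read_model("model.xml")

compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=128,
    advanced_parameters=AdvancedCompressionParameters(
        # ADJUST: adapt the group size for layers whose channel size is not
        # divisible by 128, instead of skipping them (IGNORE, the default)
        # or raising an exception (ERROR).
        group_size_fallback_mode=nncf.GroupSizeFallbackMode.ADJUST,
    ),
)
```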
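A minimal sketch of the external-quantizer flow with XNNPACKQuantizer. The quantizer's import path varies across versions (torch.ao vs. executorch.backends.xnnpack.quantizer), and the smooth_quant/fast_bias_correction flag names are assumptions:

```python
import torch
import torchvision
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

import nncf
from nncf.experimental.torch.fx import quantize_pt2e

model = torchvision.models.mobilenet_v2().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Capture the model in the PT2 export format expected by quantize_pt2e.
exported = torch.export.export(model, example_inputs).module()

# Configure the external ExecuTorch/XNNPACK quantizer.
quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())

# Random tensors stand in for real calibration data here.
calibration = nncf.Dataset([torch.randn(1, 3, 224, 224) for _ in range(4)])

quantized = quantize_pt2e(
    exported,
    quantizer,
    calibration_dataset=calibration,
    smooth_quant=True,           # apply NNCF's SmoothQuant on top
    fast_bias_correction=False,  # use the full bias correction algorithm
)
```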
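A hedged sketch of the ONNX data-aware pipeline; the input name and the random token data are placeholders for the tokenized calibration set used in the shipped TinyLlama example:

```python
import numpy as np
import onnx

import nncf

model = onnx.load("tinyllama-1.1b-chat.onnx")  # path is illustrative

def transform_fn(_):
    # Placeholder calibration sample; a real pipeline would feed tokenized
    # prompts, and the "input_ids" input name is an assumption.
    return {"input_ids": np.random.randint(0, 32000, (1, 128), dtype=np.int64)}

calibration = nncf.Dataset(range(8), transform_fn)

compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    dataset=calibration,    # makes the compression data-aware
    awq=True,               # Activation-aware Weight Quantization
    scale_estimation=True,  # refine scales using calibration activations
)
onnx.save(compressed, "tinyllama-1.1b-chat-int4.onnx")
```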

Compression-aware training:

  • Features:
    • (PyTorch) Enhanced the initialization of "QAT with absorbable LoRA" with data-aware compression methods (AWQ + scale estimation), replacing the previous data-free compression approach. QAT now starts from a more accurate model baseline and achieves superior final accuracy (see the sketch after this list).
  • Improvements:
    • (PyTorch) Streamlined "QAT with absorbable LoRA" by removing checkpoint selection based on the validation set. This significantly reduces overall tuning time and peak allocated memory; results on Wikitext are slightly worse, but the tuning pipeline is faster and more efficient (e.g. reduced from 32 to 25 minutes for SmolLM-1.7B).
  • Tutorials:
    • (TorchFX) Added an example for compression of TinyLlama-1.1B.
    • Updated the example to align with the NPU implementation.
    • Implemented fast evaluation and improved the output in the example.
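
A sketch of the improved "QAT with absorbable LoRA" initialization step, assuming the FQ_LORA compression format from the NNCF LoRA QAT example; torch_model and calibration_dataset are placeholders for objects prepared as in that example:

```python
import nncf

# Data-aware initialization (AWQ + scale estimation) is the new part,
# replacing the previous data-free compression of the baseline model.
compressed = nncf.compress_weights(
    torch_model,                  # assumed: a prepared PyTorch LLM
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    dataset=calibration_dataset,  # assumed: an nncf.Dataset of prompts
    awq=True,
    scale_estimation=True,
    compression_format=nncf.CompressionFormat.FQ_LORA,
)
# Fine-tune `compressed` as usual; QAT now starts from a stronger baseline.
```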

Deprecations/Removals:

  • Removed examples that used the create_compressed_model API.

Requirements:

  • Updated PyTorch (2.8.0) and Torchvision (0.23.0) versions.
  • Require setuptools>=77 to build the package.

Acknowledgements

Thanks for the contributions from the OpenVINO developer community:
@bopeng1234 @jpablomch