
Commit f604c67

[backend] Add ONNX & OpenVINO support for Cross Encoder (reranker) models (#3319)

* Add ONNX & OpenVINO support for Cross Encoder (reranker) models
* Remove accidental leftover breakpoint
* Improve typing for tokenizer
* Improve docs for save_pretrained
* Apply minor improvements/fixes to efficiency docs

1 parent 27c14b9 · commit f604c67
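
In practice, the headline change is that a reranker can now be loaded with a non-default backend, mirroring what SentenceTransformer already supported. A minimal sketch of what this enables (the model name is illustrative; `backend="onnx"` and `backend="openvino"` are the options this commit adds for CrossEncoder):

```python
from sentence_transformers import CrossEncoder

# Load a reranker with the ONNX backend introduced by this commit;
# "openvino" is the other newly supported backend.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", backend="onnx")

# Score (query, passage) pairs; higher scores indicate a better match.
scores = model.predict([
    ("How many people live in Berlin?", "Berlin has about 3.85 million inhabitants."),
    ("How many people live in Berlin?", "Berlin is known for its startup scene."),
])
print(scores)
```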

File tree

8 files changed: +997 −65 lines


docs/cross_encoder/usage/efficiency.rst

Lines changed: 602 additions & 0 deletions
Large diffs are not rendered by default.

docs/cross_encoder/usage/usage.rst

Lines changed: 2 additions & 1 deletion
@@ -73,4 +73,5 @@ Once you have `installed <../../installation.html>`_ Sentence Transformers, you
    :caption: Tasks

    Cross-Encoder vs Bi-Encoder <../../../examples/cross_encoder/applications/README>
-   ../../../examples/sentence_transformer/applications/retrieve_rerank/README
+   ../../../examples/sentence_transformer/applications/retrieve_rerank/README
+   efficiency
Two image files updated (56.7 KB and 54 KB); binary previews not rendered.

docs/sentence_transformer/usage/efficiency.rst

Lines changed: 10 additions & 10 deletions
@@ -132,9 +132,9 @@ Optimizing ONNX Models
 
 .. include:: backend_export_sidebar.rst
 
-ONNX models can be optimized using Optimum, allowing for speedups on CPUs and GPUs alike. To do this, you can use the :func:`~sentence_transformers.backend.export_optimized_onnx_model` function, which saves the optimized model in a directory or model repository that you specify. It expects:
+ONNX models can be optimized using `Optimum <https://huggingface.co/docs/optimum/index>`_, allowing for speedups on CPUs and GPUs alike. To do this, you can use the :func:`~sentence_transformers.backend.export_optimized_onnx_model` function, which saves the optimized model in a directory or model repository that you specify. It expects:
 
-- ``model``: a Sentence Transformer model loaded with the ONNX backend.
+- ``model``: a Sentence Transformer or Cross Encoder model loaded with the ONNX backend.
 - ``optimization_config``: ``"O1"``, ``"O2"``, ``"O3"``, or ``"O4"`` representing optimization levels from :class:`~optimum.onnxruntime.AutoOptimizationConfig`, or an :class:`~optimum.onnxruntime.OptimizationConfig` instance.
 - ``model_name_or_path``: a path to save the optimized model file, or the repository name if you want to push it to the Hugging Face Hub.
 - ``push_to_hub``: (Optional) a boolean to push the optimized model to the Hugging Face Hub.
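
Concretely, optimizing a reranker now follows the same recipe as optimizing an embedding model. A sketch under the assumption of a local output path ("O3" is one of the four levels listed above; pass a Hub repo id plus ``push_to_hub=True`` to upload instead):

```python
from sentence_transformers import CrossEncoder
from sentence_transformers.backend import export_optimized_onnx_model

# The `model` parameter requires a model loaded with the ONNX backend.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", backend="onnx")

export_optimized_onnx_model(
    model,
    optimization_config="O3",
    model_name_or_path="path/to/my/reranker-optimized",  # illustrative local path
)
```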
@@ -204,9 +204,9 @@ Quantizing ONNX Models
 
 .. include:: backend_export_sidebar.rst
 
-ONNX models can be quantized to int8 precision using Optimum, allowing for faster inference on CPUs. To do this, you can use the :func:`~sentence_transformers.backend.export_dynamic_quantized_onnx_model` function, which saves the quantized model in a directory or model repository that you specify. Dynamic quantization, unlike static quantization, does not require a calibration dataset. It expects:
+ONNX models can be quantized to int8 precision using `Optimum <https://huggingface.co/docs/optimum/index>`_, allowing for faster inference on CPUs. To do this, you can use the :func:`~sentence_transformers.backend.export_dynamic_quantized_onnx_model` function, which saves the quantized model in a directory or model repository that you specify. Dynamic quantization, unlike static quantization, does not require a calibration dataset. It expects:
 
-- ``model``: a Sentence Transformer model loaded with the ONNX backend.
+- ``model``: a Sentence Transformer or Cross Encoder model loaded with the ONNX backend.
 - ``quantization_config``: ``"arm64"``, ``"avx2"``, ``"avx512"``, or ``"avx512_vnni"`` representing quantization configurations from :class:`~optimum.onnxruntime.AutoQuantizationConfig`, or a :class:`~optimum.onnxruntime.QuantizationConfig` instance.
 - ``model_name_or_path``: a path to save the quantized model file, or the repository name if you want to push it to the Hugging Face Hub.
 - ``push_to_hub``: (Optional) a boolean to push the quantized model to the Hugging Face Hub.
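
Analogously for dynamic int8 quantization; a sketch assuming an AVX512-VNNI-capable CPU (pick whichever configuration above matches your hardware):

```python
from sentence_transformers import CrossEncoder
from sentence_transformers.backend import export_dynamic_quantized_onnx_model

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", backend="onnx")

# Dynamic quantization needs no calibration dataset.
export_dynamic_quantized_onnx_model(
    model,
    quantization_config="avx512_vnni",
    model_name_or_path="path/to/my/reranker-quantized",  # illustrative local path
)
```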
@@ -329,15 +329,15 @@ Quantizing OpenVINO Models
 
 .. include:: backend_export_sidebar.rst
 
-OpenVINO models can be quantized to int8 precision using Optimum Intel to speed up inference.
+OpenVINO models can be quantized to int8 precision using `Optimum Intel <https://huggingface.co/docs/optimum/main/en/intel/index>`_ to speed up inference.
 To do this, you can use the :func:`~sentence_transformers.backend.export_static_quantized_openvino_model` function,
 which saves the quantized model in a directory or model repository that you specify.
 Post-Training Static Quantization expects:
 
-- ``model``: a Sentence Transformer model loaded with the OpenVINO backend.
+- ``model``: a Sentence Transformer or Cross Encoder model loaded with the OpenVINO backend.
 - ``quantization_config``: (Optional) The quantization configuration. This parameter accepts either:
-``None`` for the default 8-bit quantization, a dictionary representing quantization configurations, or
-an :class:`~optimum.intel.OVQuantizationConfig` instance.
+  ``None`` for the default 8-bit quantization, a dictionary representing quantization configurations, or
+  an :class:`~optimum.intel.OVQuantizationConfig` instance.
 - ``model_name_or_path``: a path to save the quantized model file, or the repository name if you want to push it to the Hugging Face Hub.
 - ``dataset_name``: (Optional) The name of the dataset to load for calibration. If not specified, defaults to the ``sst2`` subset of the ``glue`` dataset.
 - ``dataset_config_name``: (Optional) The specific configuration of the dataset to load.
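
For the OpenVINO path, a sketch relying on the defaults described above (``quantization_config=None`` selects the default 8-bit static quantization, calibrated on the ``sst2`` subset of ``glue``):

```python
from sentence_transformers import CrossEncoder
from sentence_transformers.backend import export_static_quantized_openvino_model

# The `model` parameter requires a model loaded with the OpenVINO backend.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", backend="openvino")

export_static_quantized_openvino_model(
    model,
    quantization_config=None,  # default 8-bit static quantization
    model_name_or_path="path/to/my/reranker-qint8",  # illustrative local path
)
```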
@@ -541,8 +541,8 @@ Based on the benchmarks, this flowchart should help you decide which backend to
     }
     }}%%
     graph TD
-    A(What is your hardware?) -->|GPU| B(Is your text usually smaller than 500 characters?)
-    A -->|CPU| C(Is a 0.4% accuracy loss acceptable?)
+    A(What is your hardware?) -->|GPU| B(Is your text usually smaller<br>than 500 characters?)
+    A -->|CPU| C(Is a 0.4% accuracy loss<br>acceptable?)
     B -->|yes| D[onnx-O4]
     B -->|no| F[float16]
     C -->|yes| G[openvino-qint8]
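
Once the flowchart points at a backend, loading the matching export is a matter of naming the file. A sketch assuming the repository contains an ``onnx/model_O4.onnx`` export (file names depend on the suffix chosen at export time):

```python
from sentence_transformers import SentenceTransformer

# GPU + short texts branch of the flowchart: onnx-O4.
model = SentenceTransformer(
    "all-MiniLM-L6-v2",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_O4.onnx"},
)
embeddings = model.encode(["How big is London?"])
```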

docs/sentence_transformer/usage/usage.rst

Lines changed: 1 addition & 1 deletion
@@ -56,6 +56,6 @@ Once you have `installed <../../installation.html>`_ Sentence Transformers, you
    ../../../examples/sentence_transformer/applications/parallel-sentence-mining/README
    ../../../examples/sentence_transformer/applications/image-search/README
    ../../../examples/sentence_transformer/applications/embedding-quantization/README
-   efficiency
    custom_models
+   efficiency
