docs/source/inference.rst: 6 additions & 2 deletions
@@ -103,7 +103,7 @@ Torchao's sparsity support can be combined with quantization for additional perf
    print(response)

.. note::
-For more information on supported quantization and sparsity configurations, see `HF-Torchao Docs <https://huggingface.co/docs/transformers/main/en/quantization/torchao>`_.
+   For more information on supported quantization and sparsity configurations, see `HF-Torchao Docs <https://huggingface.co/docs/transformers/main/en/quantization/torchao>`_.

Inference with Transformers
---------------------------
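The quantization configurations referenced in the note above boil down to schemes such as weight-only int8. As an editor's illustration only (this is not torchao's API, which applies configs to PyTorch tensors), a minimal sketch of symmetric int8 weight-only quantization:

```python
# Illustrative sketch of symmetric int8 weight-only quantization, the
# simplest of the schemes the linked docs cover. Plain Python for clarity;
# torchao itself operates on PyTorch tensors via config objects.

def quantize_int8(weights):
    """Map floats to int8 values in [-128, 127] plus one scale factor."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # guard scale == 0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.02, 1.27]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)
```

Only the int8 values and the single scale are stored, roughly quartering weight memory relative to fp32 at a small accuracy cost.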
@@ -356,6 +356,10 @@ When using vLLM with torchao:
- **Sparsity Support**: Semi-structured (2:4) sparsity for faster inference (see the `Accelerating Neural Network Training with Semi-Structured (2:4) Sparsity <https://pytorch.org/blog/accelerating-neural-network-training/>`_ blog post)
- **KV Cache Quantization**: Enables long context inference with lower memory (see `KV Cache Quantization <https://github.com/pytorch/ao/blob/main/torchao/_models/llama/README.md>`_)

+.. note::
+
+   For more information on vLLM Integration, please refer to the detailed guide :ref:`torchao_vllm_integration`.
+

Mobile Deployment with ExecuTorch
---------------------------------
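To see why quantizing the KV cache enables longer contexts, a back-of-the-envelope memory estimate helps. The model shape below (32 layers, 8 KV heads, head dimension 128, 128k-token context) is an illustrative assumption of ours, not a figure from the docs:

```python
# Rough KV-cache memory estimate: key and value tensors are cached for
# every layer, so halving bytes per element (fp16 -> int8) halves the cache.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem):
    # factor of 2 accounts for separate key and value tensors per layer
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical Llama-style shape at a 128k-token context:
fp16_cache = kv_cache_bytes(32, 8, 128, 128_000, 2)  # ~16.8 GB
int8_cache = kv_cache_bytes(32, 8, 128, 128_000, 1)  # half of that
```

Because the cache grows linearly with sequence length, the bytes-per-element saving translates directly into a proportionally longer feasible context on the same GPU.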
@@ -518,7 +522,7 @@ The torchao-optimized 8da4w model provides:
- **Accuracy**: Maintained within 5-10% of the original model on most benchmarks

.. note::
-For detailed instructions on testing the ExecuTorch model and reproducing benchmarks, please refer to the `HF Phi-4-mini-instruct-8da4w model <https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w>`_.
+   For detailed instructions on testing the ExecuTorch model and reproducing benchmarks, please refer to the `HF Phi-4-mini-instruct-8da4w model <https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w>`_.
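The "8da4w" name denotes int8 dynamic activations with int4 weights. As a hedged illustration of the "4w" half (the group size and helper below are ours, not the actual ExecuTorch/torchao recipe), group-wise int4 quantization keeps one scale per small group of weights, which limits the accuracy loss of the narrow 4-bit range:

```python
# Illustrative sketch of group-wise int4 weight quantization, the "4w"
# part of the 8da4w scheme. Group size 4 is chosen for readability;
# real recipes typically use larger groups (e.g. 32 or 256 weights).

def quantize_int4_groupwise(weights, group_size=4):
    """Quantize floats to int4 values in [-8, 7] with one scale per group."""
    groups = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7.0 or 1.0  # guard scale == 0
        q = [max(-8, min(7, round(w / scale))) for w in group]
        groups.append((q, scale))
    return groups

# Each entry pairs the int4 values with that group's scale factor.
groups = quantize_int4_groupwise([0.7, -0.7, 0.1, 0.3, 2.8, -0.4, 0.8, 0.0])
```

Per-group scales let an outlier weight in one group (here 2.8) inflate only its own group's quantization step rather than the whole tensor's.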