Inference tutorial - Part 3 of e2e series [WIP] #2343

jainapurva · 2025-06-09T23:18:31Z

No description provided.

pytorch-bot · 2025-06-09T23:18:34Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2343

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6a96697 with merge base 5239ce7:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

docs/source/inference.rst

jerryzh168 · 2025-06-17T17:26:11Z

docs/source/inference.rst

+----------------------------+----------------+------------------------------+
+
+
+Sparsity Integration


I don't think this should be a separate section. It can be merged into the Float8 Dynamic Quantization section, with just a mention that for more quantization/sparsity options, please see https://huggingface.co/docs/transformers/main/en/quantization/torchao

docs/source/inference.rst

jainapurva · 2025-06-17T20:41:57Z

docs/source/inference.rst

+    print("Response:", output_text[0][len(prompt):])
+
+
+[Optional] Float8 Dynamic Quantization + Semi-structured (2:4) sparsity


@jerryzh168 @jcaip Does this look good? Should I keep sparsity as an optional section or just mention it in a note?

Can we just add this to the Hugging Face torchao page?

jerryzh168 · 2025-06-17T20:43:51Z

docs/source/inference.rst

+Memory Benchmarking
+--------------------
+
+**Memory Usage Comparison**:


nit: remove

jerryzh168 · 2025-06-17T20:44:43Z

docs/source/inference.rst

+
+    vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
+
+Inference with vLLM


Should we move this after Inference with Transformers?

jerryzh168 · 2025-06-17T20:45:36Z

docs/source/inference.rst

+
+vLLM automatically leverages torchao's optimized kernels when serving quantized models, providing significant throughput improvements.
+
+Setting up vLLM with Quantized Models


nit: this doesn't have to be a new section I think

jerryzh168 · 2025-06-17T20:51:11Z

docs/source/inference.rst

+Performance Breakdown
+=====================
+
+When using vLLM with torchao:


This is not a comprehensive list, so probably just remove it. Do we have an exhaustive list of all the techniques that we support?

andrewor14 · 2025-06-17T21:48:34Z

Hi @jainapurva, by the way I'm adding a serving.rst here: #2394. It uses the same template as parts 1 and 2. After that's landed, do you mind updating your PR to use that file instead? Right now it's a blank page with the template:

Preliminary structure for tutorial

c0584b4

facebook-github-bot added the CLA Signed label Jun 9, 2025

jainapurva added the topic: documentation label Jun 10, 2025

jainapurva and others added 8 commits June 16, 2025 09:59

Updates

f4e8f2d

Update

7c2332e

Update

942a02b

Update

888fd4c

Update

c200cd2

Merge remote-tracking branch 'origin/main' into inference_tutorial

4f76b23

Update

c52e6f8

Update

de160b1


jainapurva added 2 commits June 17, 2025 12:11

Update

e8f5e53

Update

bbd567d


jainapurva requested review from jerryzh168, andrewor14, drisspg and jcaip June 17, 2025 20:42


Update notes

6a96697
