From c0584b46cf6f874f589a88c7658778fab54857ed Mon Sep 17 00:00:00 2001 From: jainapurva Date: Mon, 9 Jun 2025 16:17:38 -0700 Subject: [PATCH 01/13] Preliminary structure for tutorial --- docs/source/inference.rst | 51 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 51 insertions(+) create mode 100644 docs/source/inference.rst diff --git a/docs/source/inference.rst b/docs/source/inference.rst new file mode 100644 index 0000000000..43d86ead98 --- /dev/null +++ b/docs/source/inference.rst @@ -0,0 +1,51 @@ +Inference +--------- +In continuation to the previous tutorials about pretraining and finetuning, in this tutorial we'll show recipes for post-training quantization and serving the quantized model + +The tutorial focuses on 3 receipes for post-training quantization and serving the quantized model: +1. :ref:`Post-training Quantization and Serving a model on HuggingFace` + + +Post-training Quantization and Serving +###################################### + +Part 3 (inference): Move/duplicate Jerry’s Phi-4 model card instructions to doc page +Part 3: Move code snippets from HF transformers torchao guide to this tutorial + +Post-training Quantization using HuggingFace +------------------------------------------------ + + +Evaluating the model +-------------------- + +Serving it on vLLM +-------------------- + +Sparsify using HuggingFace +########################## + +Part 3: Add sparsity torchao huggingface integration + + +Lower to Executorch +################### + +From the executorch root directory run the following command to lower the model to executorch format: + +.. code:: console + python -m examples.models.llama.export_llama --checkpoint "${LLAMA_QUANTIZED_CHECKPOINT:?}" -p "${LLAMA_PARAMS:?}" -kv --use_sdpa_with_kv_cache -qmode 8da4w --group_size 256 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_id":128001}' --embedding-quantize 4,32 --output_name="llama3_8da4w.pte" + +This will generate a file called ``llama3_8da4w.pte`` in the current directory. This file is the quantized and lowered model that can be used for inference. + +# Evaluate model +# python -m examples.models.llama.eval_llama \ +# -c "${LLAMA_QUANTIZED_CHECKPOINT:?}" \ +# -p "${LLAMA_PARAMS:?}" \ +# -t "${LLAMA_TOKENIZER:?}" \ +# -kv \ +# -d fp32 \ +# --tasks mmlu \ +# --num_fewshot 5 \ +# --max_seq_len 8192 \ +# --max_context_len 8192 From f4e8f2d3b168783fbab491bda8ef93b2009a3e84 Mon Sep 17 00:00:00 2001 From: Apurva Jain Date: Mon, 16 Jun 2025 09:59:24 -0700 Subject: [PATCH 02/13] Updates --- docs/source/inference.rst | 564 +++++++++++++++++++++++++++++++++++--- 1 file changed, 530 insertions(+), 34 deletions(-) diff --git a/docs/source/inference.rst b/docs/source/inference.rst index 43d86ead98..7daaee8e72 100644 --- a/docs/source/inference.rst +++ b/docs/source/inference.rst @@ -1,51 +1,547 @@ -Inference ---------- -In continuation to the previous tutorials about pretraining and finetuning, in this tutorial we'll show recipes for post-training quantization and serving the quantized model +Inference Tutorial: From Quantization to Deployment +=================================================== -The tutorial focuses on 3 receipes for post-training quantization and serving the quantized model: -1. :ref:`Post-training Quantization and Serving a model on HuggingFace` +This tutorial demonstrates how to perform post-training quantization and deploy models for inference using torchao's integration with popular frameworks. 
All quantization techniques shown here use torchao as the underlying optimization engine, seamlessly integrated through HuggingFace Transformers, vLLM, and ExecuTorch. +.. contents:: + :local: + :depth: 2 -Post-training Quantization and Serving -###################################### +Overview +-------- -Part 3 (inference): Move/duplicate Jerry’s Phi-4 model card instructions to doc page -Part 3: Move code snippets from HF transformers torchao guide to this tutorial +This tutorial covers the complete inference pipeline: -Post-training Quantization using HuggingFace ------------------------------------------------- +1. **Post-training Quantization**: Using int4/int8 quantization with HuggingFace integration +2. **Sparsity**: Combining sparsity with quantization for additional speedups +3. **High-throughput Serving**: Deploying quantized models with vLLM +4. **Mobile Deployment**: Lowering to ExecuTorch for on-device inference +All these workflows leverage torchao's optimized kernels and quantization algorithms under the hood. -Evaluating the model --------------------- +Post-training Quantization with HuggingFace +############################################ + +HuggingFace Transformers provides seamless integration with torchao quantization. The ``TorchAoConfig`` automatically applies torchao's optimized quantization algorithms during model loading. + +Int4 Weight-Only Quantization +------------------------------ + +Int4 weight-only quantization reduces model size by 4x with minimal accuracy loss: + +.. code-block:: python + + import torch + from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig + from torchao.quantization import Int4WeightOnlyConfig + + model_id = "meta-llama/Llama-3.1-8B-Instruct" + + # Configure int4 weight-only quantization (torchao under the hood) + quant_config = Int4WeightOnlyConfig() + quantization_config = TorchAoConfig(quant_type=quant_config) + + # Load and quantize model - torchao handles the optimization + model = AutoModelForCausalLM.from_pretrained( + model_id, + torch_dtype="auto", + device_map="auto", + quantization_config=quantization_config + ) + + tokenizer = AutoTokenizer.from_pretrained(model_id) + + # Test inference + messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}] + inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda") + + with torch.no_grad(): + outputs = model.generate(inputs, max_new_tokens=100, do_sample=False) + + response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True) + print(response) + +Int8 Dynamic Quantization +-------------------------- + +Int8 dynamic quantization provides a balance between compression and accuracy: + +.. code-block:: python + + from torchao.quantization import Int8DynamicActivationIntxWeightConfig + + # Configure int8 dynamic quantization with int4 weights + quant_config = Int8DynamicActivationIntxWeightConfig( + weight_dtype=torch.int4, + weight_granularity=torchao.quantization.granularity.PerGroup(32) + ) + quantization_config = TorchAoConfig(quant_type=quant_config) + + model = AutoModelForCausalLM.from_pretrained( + "microsoft/Phi-4-mini-instruct", + quantization_config=quantization_config, + torch_dtype=torch.bfloat16, + device_map="auto" + ) + +Advanced: Per-Layer Quantization Control +---------------------------------------- + +For models where you need different quantization strategies for different layers: + +.. 
code-block:: python + + from torchao.quantization import ( + IntxWeightOnlyConfig, + Int8DynamicActivationIntxWeightConfig, + ModuleFqnToConfig + ) + from torchao.quantization.granularity import PerAxis, PerGroup + + # Different configs for different layer types + embedding_config = IntxWeightOnlyConfig( + weight_dtype=torch.int8, + granularity=PerAxis(0) + ) + + linear_config = Int8DynamicActivationIntxWeightConfig( + weight_dtype=torch.int4, + weight_granularity=PerGroup(32), + weight_scale_dtype=torch.bfloat16 + ) + + # Map specific layers to configs - torchao applies optimizations per layer + quant_config = ModuleFqnToConfig({ + "_default": linear_config, + "model.embed_tokens": embedding_config, + "lm_head": embedding_config + }) + + quantization_config = TorchAoConfig( + quant_type=quant_config, + include_embedding=True + ) + + model = AutoModelForCausalLM.from_pretrained( + model_id, + quantization_config=quantization_config, + torch_dtype=torch.float32, + device_map="auto" + ) + +Sparsity Integration +#################### + +Torchao's sparsity support can be combined with quantization for additional performance gains. The Marlin sparse layout provides optimized kernels for 2:4 structured sparsity. + +Sparse + Quantized Models +------------------------- + +.. code-block:: python + + from torchao.quantization import Int4WeightOnlyConfig + from torchao.dtypes import MarlinSparseLayout + + # Combine sparsity with int4 quantization - both optimized by torchao + quant_config = Int4WeightOnlyConfig(layout=MarlinSparseLayout()) + quantization_config = TorchAoConfig(quant_type=quant_config) + + # Load a pre-sparsified checkpoint + model = AutoModelForCausalLM.from_pretrained( + "nm-testing/Meta-Llama-3.1-8B-Instruct-W4A16-G128-2of4", # 2:4 sparse model + torch_dtype=torch.float16, + device_map="cuda", + quantization_config=quantization_config + ) + + tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct") + + # Use static KV cache for best performance with torchao optimizations + messages = [{"role": "user", "content": "What are the benefits of sparse neural networks?"}] + inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda") + + outputs = model.generate( + inputs, + max_new_tokens=150, + cache_implementation="static", # Optimized for torchao + do_sample=False + ) + + response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True) + print(response) + +High-throughput Serving with vLLM +################################## + +vLLM automatically leverages torchao's optimized kernels when serving quantized models, providing significant throughput improvements. + +Setting up vLLM with Quantized Models +-------------------------------------- + +First, install vLLM with torchao support: + +.. code-block:: bash + + pip install vllm + pip install torchao + +Serving Int4 Quantized Models +----------------------------- + +.. 
code-block:: python + + from vllm import LLM, SamplingParams + + # vLLM automatically uses torchao's optimized int4 kernels + llm = LLM( + model="nm-testing/Meta-Llama-3.1-8B-Instruct-W4A16-G128", + quantization="int4_weight_only", # Uses torchao int4 implementation + max_model_len=4096, + gpu_memory_utilization=0.8 + ) + + sampling_params = SamplingParams( + temperature=0.7, + top_p=0.9, + max_tokens=200 + ) + + prompts = [ + "Explain the concept of machine learning to a 10-year-old.", + "What are the main differences between supervised and unsupervised learning?", + "How does a neural network learn from data?" + ] + + # Generate responses - torchao kernels handle the optimized inference + outputs = llm.generate(prompts, sampling_params) + + for output in outputs: + print(f"Prompt: {output.prompt}") + print(f"Generated text: {output.outputs[0].text}") + print("-" * 50) + +Serving with OpenAI-Compatible API +---------------------------------- + +Launch a server that uses torchao optimizations: + +.. code-block:: bash + + # Start vLLM server with torchao-optimized quantization + python -m vllm.entrypoints.openai.api_server \ + --model nm-testing/Meta-Llama-3.1-8B-Instruct-W4A16-G128 \ + --quantization int4_weight_only \ + --max-model-len 4096 \ + --host 0.0.0.0 \ + --port 8000 + +Client usage: + +.. code-block:: python + + import openai + + client = openai.OpenAI( + base_url="http://localhost:8000/v1", + api_key="token-abc123" # Dummy key for local server + ) + + completion = client.chat.completions.create( + model="nm-testing/Meta-Llama-3.1-8B-Instruct-W4A16-G128", + messages=[ + {"role": "user", "content": "Write a Python function to calculate Fibonacci numbers."} + ], + max_tokens=300, + temperature=0.7 + ) + + print(completion.choices[0].message.content) + +Performance Optimization Notes +------------------------------ + +When using vLLM with torchao: + +- **Int4 quantization**: Provides 3-4x memory reduction with torchao's optimized kernels +- **Sparse models**: Additional 1.5-2x speedup when combined with quantization +- **Static KV cache**: Use ``--kv-cache-dtype fp8`` for additional memory savings +- **Compile optimizations**: Set ``VLLM_DISABLE_COMPILE_CACHE=1`` if encountering issues -Serving it on vLLM +Mobile Deployment with ExecuTorch +################################## + +ExecuTorch enables on-device inference using torchao's mobile-optimized quantization schemes. The 8da4w (8-bit dynamic activation, 4-bit weight) configuration is specifically designed for mobile deployment. + +Preparing Models for Mobile +---------------------------- + +**Step 1: Create Mobile-Optimized Quantization** + +.. 
code-block:: python + + import torch + from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig + from torchao.quantization import ( + IntxWeightOnlyConfig, + Int8DynamicActivationIntxWeightConfig, + ModuleFqnToConfig + ) + from torchao.quantization.granularity import PerAxis, PerGroup + + model_id = "microsoft/Phi-4-mini-instruct" + + # Mobile-optimized quantization scheme using torchao + embedding_config = IntxWeightOnlyConfig( + weight_dtype=torch.int8, + granularity=PerAxis(0) + ) + + linear_config = Int8DynamicActivationIntxWeightConfig( + weight_dtype=torch.int4, + weight_granularity=PerGroup(32), + weight_scale_dtype=torch.bfloat16 + ) + + # 8da4w configuration optimized by torchao for mobile + quant_config = ModuleFqnToConfig({ + "_default": linear_config, + "model.embed_tokens": embedding_config + }) + + quantization_config = TorchAoConfig( + quant_type=quant_config, + include_embedding=True, + untie_embedding_weights=True + ) + + # Load with mobile-optimized settings + model = AutoModelForCausalLM.from_pretrained( + model_id, + torch_dtype=torch.float32, # Required for mobile export + quantization_config=quantization_config, + device_map="cpu" # Export from CPU + ) + + tokenizer = AutoTokenizer.from_pretrained(model_id) + + # Save quantized model + model.save_pretrained("./phi4-mini-8da4w-mobile") + tokenizer.save_pretrained("./phi4-mini-8da4w-mobile") + +**Step 2: Export to ExecuTorch** + +.. code-block:: bash + + # Install ExecuTorch + git clone https://github.com/pytorch/executorch.git + cd executorch + ./install_requirements.sh + + # Convert checkpoint format for ExecuTorch + python -m executorch.examples.models.phi_4_mini.convert_weights \ + ./phi4-mini-8da4w-mobile/pytorch_model.bin \ + ./phi4-mini-8da4w-mobile/pytorch_model_converted.bin + + # Export to PTE format with torchao optimizations preserved + python -m executorch.examples.models.llama.export_llama \ + --model "phi_4_mini" \ + --checkpoint "./phi4-mini-8da4w-mobile/pytorch_model_converted.bin" \ + --params "./phi4-mini-8da4w-mobile/config.json" \ + -kv \ + --use_sdpa_with_kv_cache \ + -X \ + --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}' \ + --max_seq_length 512 \ + --max_context_length 512 \ + --output_name="phi4-mini-8da4w-mobile.pte" + +Mobile Performance Characteristics +---------------------------------- + +The torchao-optimized 8da4w model provides: + +- **Memory**: ~3.2GB on iPhone 15 Pro (vs ~12GB unquantized) +- **Speed**: ~17 tokens/sec on iPhone 15 Pro +- **Accuracy**: Maintained within 5-10% of original model on most benchmarks + +**iOS Integration Example**: + +.. code-block:: objective-c + + // Load the torchao-optimized PTE file + NSString *modelPath = [[NSBundle mainBundle] pathForResource:@"phi4-mini-8da4w-mobile" ofType:@"pte"]; + + // ExecuTorch runtime automatically uses torchao's optimized kernels + torch::executor::Result module_result = + torch::executor::Module::load(modelPath.UTF8String); + +Android integration follows similar patterns using the ExecuTorch Android API. + +Evaluation and Benchmarking +############################ + +Model Quality Assessment +------------------------ + +Evaluate quantized models using lm-evaluation-harness: + +.. 
code-block:: bash + + # Install evaluation framework + pip install lm-eval[all] + + # Evaluate baseline model + lm_eval --model hf \ + --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \ + --tasks mmlu,arc_challenge,hellaswag,winogrande \ + --batch_size 8 + + # Evaluate torchao-quantized model + lm_eval --model hf \ + --model_args pretrained=nm-testing/Meta-Llama-3.1-8B-Instruct-W4A16-G128 \ + --tasks mmlu,arc_challenge,hellaswag,winogrande \ + --batch_size 8 + +Performance Benchmarking +------------------------ + +**Memory Usage Comparison**: + +.. code-block:: python + + import torch + from transformers import AutoModelForCausalLM + import psutil + import os + + def measure_memory_usage(model_id, quantization_config=None): + process = psutil.Process(os.getpid()) + mem_before = process.memory_info().rss / 1024 / 1024 / 1024 # GB + + model = AutoModelForCausalLM.from_pretrained( + model_id, + quantization_config=quantization_config, + torch_dtype=torch.bfloat16, + device_map="auto" + ) + + mem_after = process.memory_info().rss / 1024 / 1024 / 1024 # GB + model_memory = mem_after - mem_before + + return model_memory + + # Compare memory usage + baseline_memory = measure_memory_usage("meta-llama/Llama-3.1-8B-Instruct") + + from transformers import TorchAoConfig + from torchao.quantization import Int4WeightOnlyConfig + quant_config = TorchAoConfig(quant_type=Int4WeightOnlyConfig()) + quantized_memory = measure_memory_usage("meta-llama/Llama-3.1-8B-Instruct", quant_config) + + print(f"Baseline model: {baseline_memory:.2f} GB") + print(f"Int4 quantized: {quantized_memory:.2f} GB") + print(f"Memory reduction: {(1 - quantized_memory/baseline_memory)*100:.1f}%") + +**Latency Benchmarking**: + +.. code-block:: python + + import time + import torch + from transformers import AutoModelForCausalLM, AutoTokenizer + + def benchmark_latency(model, tokenizer, prompt, num_runs=10): + messages = [{"role": "user", "content": prompt}] + inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda") + + # Warmup + for _ in range(3): + with torch.no_grad(): + _ = model.generate(inputs, max_new_tokens=100, do_sample=False) + + # Benchmark + torch.cuda.synchronize() + start_time = time.time() + + for _ in range(num_runs): + with torch.no_grad(): + outputs = model.generate(inputs, max_new_tokens=100, do_sample=False) + + torch.cuda.synchronize() + end_time = time.time() + + avg_latency = (end_time - start_time) / num_runs + tokens_generated = outputs.shape[1] - inputs.shape[1] + throughput = tokens_generated / avg_latency + + return avg_latency, throughput + + # Benchmark both models + prompt = "Explain the theory of relativity in simple terms." 
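+    # NOTE: `baseline_model` and `quantized_model` are assumed to have been loaded
+    # earlier (e.g. via AutoModelForCausalLM.from_pretrained(), with and without the
+    # TorchAoConfig shown above); they are not defined in this snippet.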
+ + baseline_latency, baseline_throughput = benchmark_latency(baseline_model, tokenizer, prompt) + quantized_latency, quantized_throughput = benchmark_latency(quantized_model, tokenizer, prompt) + + print(f"Baseline: {baseline_latency:.3f}s ({baseline_throughput:.1f} tok/s)") + print(f"Quantized: {quantized_latency:.3f}s ({quantized_throughput:.1f} tok/s)") + print(f"Speedup: {baseline_latency/quantized_latency:.2f}x") + +Best Practices and Tips +####################### + +Choosing Quantization Strategies +--------------------------------- + +**For Server Deployment**: +- Use Int4 weight-only for maximum throughput with vLLM +- Consider sparse models for additional speedup if available +- Int8 dynamic activation provides better accuracy if needed + +**For Mobile Deployment**: +- Use 8da4w (8-bit dynamic activation, 4-bit weights) configuration +- Ensure proper weight untying for models with tied embeddings +- Test on target hardware early in the process + +**For Edge Devices**: +- ExecuTorch with XNNPACK delegate provides best performance +- Consider using smaller base models (7B → 3B → 1B) if accuracy allows +- Profile memory usage on target device constraints + +Common Optimizations -------------------- -Sparsify using HuggingFace -########################## +1. **Static KV Cache**: Use ``cache_implementation="static"`` for consistent performance +2. **Compilation**: Enable ``torch.compile`` for additional speedups (disable cache if issues arise) +3. **Mixed Precision**: Use bfloat16 when possible for better performance +4. **Batch Processing**: Group inference requests when serving multiple users + +Troubleshooting +--------------- + +**Memory Issues**: +- Reduce ``max_model_len`` in vLLM +- Use ``device_map="auto"`` for automatic GPU/CPU offloading +- Consider gradient checkpointing for training scenarios -Part 3: Add sparsity torchao huggingface integration +**Performance Issues**: +- Verify torchao kernels are being used (check for CUDA kernel launches) +- Ensure proper tensor shapes for optimal kernel dispatch +- Profile with ``torch.profiler`` to identify bottlenecks +**Accuracy Issues**: +- Compare against baseline model on representative evaluation sets +- Consider higher precision for sensitive layers (embeddings, final layer) +- Use calibration datasets for better quantization if available -Lower to Executorch -################### +Conclusion +########## -From the executorch root directory run the following command to lower the model to executorch format: +This tutorial demonstrated how torchao's quantization and sparsity techniques integrate seamlessly across the entire ML deployment stack: -.. code:: console - python -m examples.models.llama.export_llama --checkpoint "${LLAMA_QUANTIZED_CHECKPOINT:?}" -p "${LLAMA_PARAMS:?}" -kv --use_sdpa_with_kv_cache -qmode 8da4w --group_size 256 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_id":128001}' --embedding-quantize 4,32 --output_name="llama3_8da4w.pte" +- **HuggingFace Transformers** provides easy model loading with torchao quantization +- **vLLM** leverages torchao's optimized kernels for high-throughput serving +- **ExecuTorch** enables mobile deployment with torchao's mobile-optimized schemes -This will generate a file called ``llama3_8da4w.pte`` in the current directory. This file is the quantized and lowered model that can be used for inference. +All these frameworks use torchao as the underlying optimization engine, ensuring consistent performance gains and ease of integration. 
The quantization techniques shown provide significant memory reduction (3-4x) and performance improvements (1.5-2x) while maintaining model quality within acceptable bounds for most applications. -# Evaluate model -# python -m examples.models.llama.eval_llama \ -# -c "${LLAMA_QUANTIZED_CHECKPOINT:?}" \ -# -p "${LLAMA_PARAMS:?}" \ -# -t "${LLAMA_TOKENIZER:?}" \ -# -kv \ -# -d fp32 \ -# --tasks mmlu \ -# --num_fewshot 5 \ -# --max_seq_len 8192 \ -# --max_context_len 8192 +For production deployments, always benchmark on your specific use case and hardware to validate the performance and accuracy trade-offs. From 7c2332e7c4d8386491ea85a216d956a2aa09bb2c Mon Sep 17 00:00:00 2001 From: jainapurva Date: Mon, 16 Jun 2025 10:37:34 -0700 Subject: [PATCH 03/13] Update --- docs/source/inference.rst | 223 +++++++++++--------------------------- 1 file changed, 66 insertions(+), 157 deletions(-) diff --git a/docs/source/inference.rst b/docs/source/inference.rst index 7daaee8e72..a9010f5e82 100644 --- a/docs/source/inference.rst +++ b/docs/source/inference.rst @@ -12,7 +12,7 @@ Overview This tutorial covers the complete inference pipeline: -1. **Post-training Quantization**: Using int4/int8 quantization with HuggingFace integration +1. **Post-training Quantization**: Using float8 dynamic quantization with HuggingFace integration 2. **Sparsity**: Combining sparsity with quantization for additional speedups 3. **High-throughput Serving**: Deploying quantized models with vLLM 4. **Mobile Deployment**: Lowering to ExecuTorch for on-device inference @@ -24,65 +24,51 @@ Post-training Quantization with HuggingFace HuggingFace Transformers provides seamless integration with torchao quantization. The ``TorchAoConfig`` automatically applies torchao's optimized quantization algorithms during model loading. -Int4 Weight-Only Quantization +Float8 Dynamic Quantization ------------------------------ -Int4 weight-only quantization reduces model size by 4x with minimal accuracy loss: +Float8 dynamic quantization shows 36% reduction in model size with minimal accuracy loss: .. 
code-block:: python import torch - from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig - from torchao.quantization import Int4WeightOnlyConfig + from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline - model_id = "meta-llama/Llama-3.1-8B-Instruct" + torch.random.manual_seed(0) - # Configure int4 weight-only quantization (torchao under the hood) - quant_config = Int4WeightOnlyConfig() - quantization_config = TorchAoConfig(quant_type=quant_config) + model_path = "pytorch/Phi-4-mini-instruct-float8dq" - # Load and quantize model - torchao handles the optimization model = AutoModelForCausalLM.from_pretrained( - model_id, - torch_dtype="auto", + model_path, device_map="auto", - quantization_config=quantization_config + torch_dtype="auto", + trust_remote_code=True, ) + tokenizer = AutoTokenizer.from_pretrained(model_path) - tokenizer = AutoTokenizer.from_pretrained(model_id) - - # Test inference - messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}] - inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda") - - with torch.no_grad(): - outputs = model.generate(inputs, max_new_tokens=100, do_sample=False) - - response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True) - print(response) - -Int8 Dynamic Quantization --------------------------- + messages = [ + {"role": "system", "content": "You are a helpful AI assistant."}, + {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"}, + {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."}, + {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"}, + ] -Int8 dynamic quantization provides a balance between compression and accuracy: + pipe = pipeline( + "text-generation", + model=model, + tokenizer=tokenizer, + ) -.. code-block:: python + generation_args = { + "max_new_tokens": 500, + "return_full_text": False, + "temperature": 0.0, + "do_sample": False, + } - from torchao.quantization import Int8DynamicActivationIntxWeightConfig + output = pipe(messages, **generation_args) + print(output[0]['generated_text']) - # Configure int8 dynamic quantization with int4 weights - quant_config = Int8DynamicActivationIntxWeightConfig( - weight_dtype=torch.int4, - weight_granularity=torchao.quantization.granularity.PerGroup(32) - ) - quantization_config = TorchAoConfig(quant_type=quant_config) - - model = AutoModelForCausalLM.from_pretrained( - "microsoft/Phi-4-mini-instruct", - quantization_config=quantization_config, - torch_dtype=torch.bfloat16, - device_map="auto" - ) Advanced: Per-Layer Quantization Control ---------------------------------------- @@ -182,90 +168,61 @@ First, install vLLM with torchao support: .. code-block:: bash - pip install vllm + pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly pip install torchao -Serving Int4 Quantized Models ------------------------------ +Inference with vLLM +------------------- .. 
code-block:: python from vllm import LLM, SamplingParams - # vLLM automatically uses torchao's optimized int4 kernels - llm = LLM( - model="nm-testing/Meta-Llama-3.1-8B-Instruct-W4A16-G128", - quantization="int4_weight_only", # Uses torchao int4 implementation - max_model_len=4096, - gpu_memory_utilization=0.8 - ) - - sampling_params = SamplingParams( - temperature=0.7, - top_p=0.9, - max_tokens=200 - ) - + # Sample prompts. prompts = [ - "Explain the concept of machine learning to a 10-year-old.", - "What are the main differences between supervised and unsupervised learning?", - "How does a neural network learn from data?" + "Hello, my name is", + "The president of the United States is", + "The capital of France is", + "The future of AI is", ] - - # Generate responses - torchao kernels handle the optimized inference - outputs = llm.generate(prompts, sampling_params) - - for output in outputs: - print(f"Prompt: {output.prompt}") - print(f"Generated text: {output.outputs[0].text}") - print("-" * 50) - -Serving with OpenAI-Compatible API ----------------------------------- - -Launch a server that uses torchao optimizations: + # Create a sampling params object. + sampling_params = SamplingParams(temperature=0.8, top_p=0.95) + + + if __name__ == '__main__': + # Create an LLM. + llm = LLM(model="pytorch/Phi-4-mini-instruct-float8dq") + # Generate texts from the prompts. + # The output is a list of RequestOutput objects + # that contain the prompt, generated text, and other information. + outputs = llm.generate(prompts, sampling_params) + # Print the outputs. + print("\nGenerated Outputs:\n" + "-" * 60) + for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}") + print(f"Output: {generated_text!r}") + print("-" * 60) + + +Serving Quantized Models +----------------------------- .. code-block:: bash - # Start vLLM server with torchao-optimized quantization - python -m vllm.entrypoints.openai.api_server \ - --model nm-testing/Meta-Llama-3.1-8B-Instruct-W4A16-G128 \ - --quantization int4_weight_only \ - --max-model-len 4096 \ - --host 0.0.0.0 \ - --port 8000 + vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3 -Client usage: - -.. 
code-block:: python - - import openai - - client = openai.OpenAI( - base_url="http://localhost:8000/v1", - api_key="token-abc123" # Dummy key for local server - ) - - completion = client.chat.completions.create( - model="nm-testing/Meta-Llama-3.1-8B-Instruct-W4A16-G128", - messages=[ - {"role": "user", "content": "Write a Python function to calculate Fibonacci numbers."} - ], - max_tokens=300, - temperature=0.7 - ) - - print(completion.choices[0].message.content) Performance Optimization Notes ------------------------------ When using vLLM with torchao: -- **Int4 quantization**: Provides 3-4x memory reduction with torchao's optimized kernels -- **Sparse models**: Additional 1.5-2x speedup when combined with quantization -- **Static KV cache**: Use ``--kv-cache-dtype fp8`` for additional memory savings -- **Compile optimizations**: Set ``VLLM_DISABLE_COMPILE_CACHE=1`` if encountering issues +- **Float8 dynamic quantization**: Provides 36% memory reduction with torchao's optimized kernels +- **Sparse models**: Additional ---- speedup speedup when combined with quantization +- **KV cache**: +- **Compile optimizations**: Mobile Deployment with ExecuTorch ################################## @@ -338,9 +295,7 @@ Preparing Models for Mobile ./install_requirements.sh # Convert checkpoint format for ExecuTorch - python -m executorch.examples.models.phi_4_mini.convert_weights \ - ./phi4-mini-8da4w-mobile/pytorch_model.bin \ - ./phi4-mini-8da4w-mobile/pytorch_model_converted.bin + .. Add code here.. # Export to PTE format with torchao optimizations preserved python -m executorch.examples.models.llama.export_llama \ @@ -486,52 +441,6 @@ Performance Benchmarking print(f"Quantized: {quantized_latency:.3f}s ({quantized_throughput:.1f} tok/s)") print(f"Speedup: {baseline_latency/quantized_latency:.2f}x") -Best Practices and Tips -####################### - -Choosing Quantization Strategies ---------------------------------- - -**For Server Deployment**: -- Use Int4 weight-only for maximum throughput with vLLM -- Consider sparse models for additional speedup if available -- Int8 dynamic activation provides better accuracy if needed - -**For Mobile Deployment**: -- Use 8da4w (8-bit dynamic activation, 4-bit weights) configuration -- Ensure proper weight untying for models with tied embeddings -- Test on target hardware early in the process - -**For Edge Devices**: -- ExecuTorch with XNNPACK delegate provides best performance -- Consider using smaller base models (7B → 3B → 1B) if accuracy allows -- Profile memory usage on target device constraints - -Common Optimizations --------------------- - -1. **Static KV Cache**: Use ``cache_implementation="static"`` for consistent performance -2. **Compilation**: Enable ``torch.compile`` for additional speedups (disable cache if issues arise) -3. **Mixed Precision**: Use bfloat16 when possible for better performance -4. 
**Batch Processing**: Group inference requests when serving multiple users - -Troubleshooting ---------------- - -**Memory Issues**: -- Reduce ``max_model_len`` in vLLM -- Use ``device_map="auto"`` for automatic GPU/CPU offloading -- Consider gradient checkpointing for training scenarios - -**Performance Issues**: -- Verify torchao kernels are being used (check for CUDA kernel launches) -- Ensure proper tensor shapes for optimal kernel dispatch -- Profile with ``torch.profiler`` to identify bottlenecks - -**Accuracy Issues**: -- Compare against baseline model on representative evaluation sets -- Consider higher precision for sensitive layers (embeddings, final layer) -- Use calibration datasets for better quantization if available Conclusion ########## From 942a02b487b03bab95f7bce0cec0471ea59a0d69 Mon Sep 17 00:00:00 2001 From: jainapurva Date: Mon, 16 Jun 2025 14:00:00 -0700 Subject: [PATCH 04/13] Update --- docs/source/inference.rst | 317 ++++++++++++++++++++------------------ 1 file changed, 170 insertions(+), 147 deletions(-) diff --git a/docs/source/inference.rst b/docs/source/inference.rst index a9010f5e82..2b7c7959ed 100644 --- a/docs/source/inference.rst +++ b/docs/source/inference.rst @@ -69,52 +69,6 @@ Float8 dynamic quantization shows 36% reduction in model size with minimal accur output = pipe(messages, **generation_args) print(output[0]['generated_text']) - -Advanced: Per-Layer Quantization Control ----------------------------------------- - -For models where you need different quantization strategies for different layers: - -.. code-block:: python - - from torchao.quantization import ( - IntxWeightOnlyConfig, - Int8DynamicActivationIntxWeightConfig, - ModuleFqnToConfig - ) - from torchao.quantization.granularity import PerAxis, PerGroup - - # Different configs for different layer types - embedding_config = IntxWeightOnlyConfig( - weight_dtype=torch.int8, - granularity=PerAxis(0) - ) - - linear_config = Int8DynamicActivationIntxWeightConfig( - weight_dtype=torch.int4, - weight_granularity=PerGroup(32), - weight_scale_dtype=torch.bfloat16 - ) - - # Map specific layers to configs - torchao applies optimizations per layer - quant_config = ModuleFqnToConfig({ - "_default": linear_config, - "model.embed_tokens": embedding_config, - "lm_head": embedding_config - }) - - quantization_config = TorchAoConfig( - quant_type=quant_config, - include_embedding=True - ) - - model = AutoModelForCausalLM.from_pretrained( - model_id, - quantization_config=quantization_config, - torch_dtype=torch.float32, - device_map="auto" - ) - Sparsity Integration #################### @@ -232,60 +186,122 @@ ExecuTorch enables on-device inference using torchao's mobile-optimized quantiza Preparing Models for Mobile ---------------------------- -**Step 1: Create Mobile-Optimized Quantization** +**Step 1: Untie Embedding Weights** +We want to quantize the embedding and lm_head differently. Since those layers are tied, we first need to untie the model: .. 
code-block:: python + from transformers import ( + AutoModelForCausalLM, + AutoProcessor, + AutoTokenizer, + ) import torch - from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig - from torchao.quantization import ( + + model_id = "microsoft/Phi-4-mini-instruct" + untied_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto") + tokenizer = AutoTokenizer.from_pretrained(model_id) + + print(untied_model) + from transformers.modeling_utils import find_tied_parameters + print("tied weights:", find_tied_parameters(untied_model)) + if getattr(untied_model.config.get_text_config(decoder=True), "tie_word_embeddings"): + setattr(untied_model.config.get_text_config(decoder=True), "tie_word_embeddings", False) + + untied_model._tied_weights_keys = [] + untied_model.lm_head.weight = torch.nn.Parameter(untied_model.lm_head.weight.clone()) + + print("tied weights:", find_tied_parameters(untied_model)) + + USER_ID = "YOUR_USER_ID" + MODEL_NAME = model_id.split("/")[-1] + save_to = f"{USER_ID}/{MODEL_NAME}-untied-weights" + + untied_model.push_to_hub(save_to) + tokenizer.push_to_hub(save_to) + + # or save locally + save_to_local_path = f"{MODEL_NAME}-untied-weights" + untied_model.save_pretrained(save_to_local_path) + tokenizer.save_pretrained(save_to) + +**Step 2: Create Mobile-Optimized Quantization** + +Quantizing the model for mobile deployment using torchao's Int8DynamicActivationIntxWeightConfig configuration: +.. code-block:: python + + from transformers import ( + AutoModelForCausalLM, + AutoProcessor, + AutoTokenizer, + TorchAoConfig, + ) + from torchao.quantization.quant_api import ( IntxWeightOnlyConfig, Int8DynamicActivationIntxWeightConfig, - ModuleFqnToConfig + ModuleFqnToConfig, + quantize_, ) - from torchao.quantization.granularity import PerAxis, PerGroup + from torchao.quantization.granularity import PerGroup, PerAxis + import torch + # we start from the model with untied weights model_id = "microsoft/Phi-4-mini-instruct" + USER_ID = "YOUR_USER_ID" + MODEL_NAME = model_id.split("/")[-1] + untied_model_id = f"{USER_ID}/{MODEL_NAME}-untied-weights" + untied_model_local_path = f"{MODEL_NAME}-untied-weights" - # Mobile-optimized quantization scheme using torchao embedding_config = IntxWeightOnlyConfig( weight_dtype=torch.int8, - granularity=PerAxis(0) + granularity=PerAxis(0), ) - linear_config = Int8DynamicActivationIntxWeightConfig( weight_dtype=torch.int4, weight_granularity=PerGroup(32), - weight_scale_dtype=torch.bfloat16 + weight_scale_dtype=torch.bfloat16, ) + quant_config = ModuleFqnToConfig({"_default": linear_config, "model.embed_tokens": embedding_config}) + quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True, untie_embedding_weights=True, modules_to_not_convert=[]) - # 8da4w configuration optimized by torchao for mobile - quant_config = ModuleFqnToConfig({ - "_default": linear_config, - "model.embed_tokens": embedding_config - }) + # either use `untied_model_id` or `untied_model_local_path` + quantized_model = AutoModelForCausalLM.from_pretrained(untied_model_id, torch_dtype=torch.float32, device_map="auto", quantization_config=quantization_config) + tokenizer = AutoTokenizer.from_pretrained(model_id) - quantization_config = TorchAoConfig( - quant_type=quant_config, - include_embedding=True, - untie_embedding_weights=True - ) + # Push to hub + MODEL_NAME = model_id.split("/")[-1] + save_to = f"{USER_ID}/{MODEL_NAME}-8da4w" + quantized_model.push_to_hub(save_to, safe_serialization=False) + 
tokenizer.push_to_hub(save_to) - # Load with mobile-optimized settings - model = AutoModelForCausalLM.from_pretrained( - model_id, - torch_dtype=torch.float32, # Required for mobile export - quantization_config=quantization_config, - device_map="cpu" # Export from CPU + # Manual testing + prompt = "Hey, are you conscious? Can you talk to me?" + messages = [ + { + "role": "system", + "content": "", + }, + {"role": "user", "content": prompt}, + ] + templated_prompt = tokenizer.apply_chat_template( + messages, + tokenize=False, + add_generation_prompt=True, ) + print("Prompt:", prompt) + print("Templated prompt:", templated_prompt) + inputs = tokenizer( + templated_prompt, + return_tensors="pt", + ).to("cuda") + generated_ids = quantized_model.generate(**inputs, max_new_tokens=128) + output_text = tokenizer.batch_decode( + generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False + ) + print("Response:", output_text[0][len(prompt):]) - tokenizer = AutoTokenizer.from_pretrained(model_id) - - # Save quantized model - model.save_pretrained("./phi4-mini-8da4w-mobile") - tokenizer.save_pretrained("./phi4-mini-8da4w-mobile") -**Step 2: Export to ExecuTorch** +**Step 3: Export to ExecuTorch** .. code-block:: bash @@ -295,20 +311,22 @@ Preparing Models for Mobile ./install_requirements.sh # Convert checkpoint format for ExecuTorch - .. Add code here.. + python -m executorch.examples.models.phi_4_mini.convert_weights pytorch_model.bin pytorch_model_converted.bin # Export to PTE format with torchao optimizations preserved + PARAMS="executorch/examples/models/phi_4_mini/config.json" python -m executorch.examples.models.llama.export_llama \ --model "phi_4_mini" \ - --checkpoint "./phi4-mini-8da4w-mobile/pytorch_model_converted.bin" \ - --params "./phi4-mini-8da4w-mobile/config.json" \ + --checkpoint "pytorch_model_converted.bin" \ + --params "$PARAMS" \ -kv \ --use_sdpa_with_kv_cache \ -X \ --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}' \ - --max_seq_length 512 \ - --max_context_length 512 \ - --output_name="phi4-mini-8da4w-mobile.pte" + --max_seq_length 128 \ + --max_context_length 128 \ + --output_name="phi4-mini-8da4w.pte" + Mobile Performance Characteristics ---------------------------------- @@ -343,19 +361,14 @@ Evaluate quantized models using lm-evaluation-harness: .. code-block:: bash # Install evaluation framework - pip install lm-eval[all] + # Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install # Evaluate baseline model - lm_eval --model hf \ - --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \ - --tasks mmlu,arc_challenge,hellaswag,winogrande \ - --batch_size 8 + lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8 + + # Evaluate torchao-quantized model (float8dq) + lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8 - # Evaluate torchao-quantized model - lm_eval --model hf \ - --model_args pretrained=nm-testing/Meta-Llama-3.1-8B-Instruct-W4A16-G128 \ - --tasks mmlu,arc_challenge,hellaswag,winogrande \ - --batch_size 8 Performance Benchmarking ------------------------ @@ -365,81 +378,91 @@ Performance Benchmarking .. 
code-block:: python import torch - from transformers import AutoModelForCausalLM - import psutil - import os + from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig - def measure_memory_usage(model_id, quantization_config=None): - process = psutil.Process(os.getpid()) - mem_before = process.memory_info().rss / 1024 / 1024 / 1024 # GB + # use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-float8dq" + model_id = "pytorch/Phi-4-mini-instruct-float8dq" + quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16) + tokenizer = AutoTokenizer.from_pretrained(model_id) - model = AutoModelForCausalLM.from_pretrained( - model_id, - quantization_config=quantization_config, - torch_dtype=torch.bfloat16, - device_map="auto" - ) + torch.cuda.reset_peak_memory_stats() - mem_after = process.memory_info().rss / 1024 / 1024 / 1024 # GB - model_memory = mem_after - mem_before + prompt = "Hey, are you conscious? Can you talk to me?" + messages = [ + { + "role": "system", + "content": "", + }, + {"role": "user", "content": prompt}, + ] + templated_prompt = tokenizer.apply_chat_template( + messages, + tokenize=False, + add_generation_prompt=True, + ) + print("Prompt:", prompt) + print("Templated prompt:", templated_prompt) + inputs = tokenizer( + templated_prompt, + return_tensors="pt", + ).to("cuda") + generated_ids = quantized_model.generate(**inputs, max_new_tokens=128) + output_text = tokenizer.batch_decode( + generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False + ) + print("Response:", output_text[0][len(prompt):]) - return model_memory + mem = torch.cuda.max_memory_reserved() / 1e9 + print(f"Peak Memory Usage: {mem:.02f} GB") - # Compare memory usage - baseline_memory = measure_memory_usage("meta-llama/Llama-3.1-8B-Instruct") - from transformers import TorchAoConfig - from torchao.quantization import Int4WeightOnlyConfig - quant_config = TorchAoConfig(quant_type=Int4WeightOnlyConfig()) - quantized_memory = measure_memory_usage("meta-llama/Llama-3.1-8B-Instruct", quant_config) +| Benchmark | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq | +|-----------|----------------|------------------------------| +| Peak Memory (GB) | 8.91 | 5.70 (36% reduction) | - print(f"Baseline model: {baseline_memory:.2f} GB") - print(f"Int4 quantized: {quantized_memory:.2f} GB") - print(f"Memory reduction: {(1 - quantized_memory/baseline_memory)*100:.1f}%") **Latency Benchmarking**: -.. code-block:: python +.. code-block:: bash - import time - import torch - from transformers import AutoModelForCausalLM, AutoTokenizer + # baseline + python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1 - def benchmark_latency(model, tokenizer, prompt, num_runs=10): - messages = [{"role": "user", "content": prompt}] - inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda") + # float8dq + VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1 - # Warmup - for _ in range(3): - with torch.no_grad(): - _ = model.generate(inputs, max_new_tokens=100, do_sample=False) +**Serving Benchmarking**: - # Benchmark - torch.cuda.synchronize() - start_time = time.time() +We benchmarked the throughput in a serving environment. - for _ in range(num_runs): - with torch.no_grad(): - outputs = model.generate(inputs, max_new_tokens=100, do_sample=False) +.. 
code-block:: bash + # Download sharegpt dataset: + wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json - torch.cuda.synchronize() - end_time = time.time() + # Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks + # Note: you can change the number of prompts to be benchmarked with --num-prompts argument for benchmark_serving script. - avg_latency = (end_time - start_time) / num_runs - tokens_generated = outputs.shape[1] - inputs.shape[1] - throughput = tokens_generated / avg_latency + # For baseline + # Server: + vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3 + # Client: + python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1 - return avg_latency, throughput + # For float8dq + # Server: + VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3 + # Client: + python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1 - # Benchmark both models - prompt = "Explain the theory of relativity in simple terms." - baseline_latency, baseline_throughput = benchmark_latency(baseline_model, tokenizer, prompt) - quantized_latency, quantized_throughput = benchmark_latency(quantized_model, tokenizer, prompt) +**Results (H100 machine)** - print(f"Baseline: {baseline_latency:.3f}s ({baseline_throughput:.1f} tok/s)") - print(f"Quantized: {quantized_latency:.3f}s ({quantized_throughput:.1f} tok/s)") - print(f"Speedup: {baseline_latency/quantized_latency:.2f}x") +| Benchmark | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq | +|-----------|----------------|------------------------------| +| latency (batch_size=1) | 1.64s | 1.41s (1.16x speedup) | +| latency (batch_size=128) | 3.1s | 2.72s (1.14x speedup) | +| serving (num_prompts=1) | 1.35 req/s | 1.57 req/s (1.16x speedup) | +| serving (num_prompts=1000) | 66.68 req/s | 80.53 req/s (1.21x speedup) | Conclusion From 888fd4c7e08d9312960822f670c3a0d5f5790ecc Mon Sep 17 00:00:00 2001 From: jainapurva Date: Mon, 16 Jun 2025 14:09:47 -0700 Subject: [PATCH 05/13] Update --- docs/source/inference.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/inference.rst b/docs/source/inference.rst index 2b7c7959ed..5e26da799f 100644 --- a/docs/source/inference.rst +++ b/docs/source/inference.rst @@ -1,4 +1,4 @@ -Inference Tutorial: From Quantization to Deployment +Inference Tutorial: From Quantization to Serving =================================================== This tutorial demonstrates how to perform post-training quantization and deploy models for inference using torchao's integration with popular frameworks. All quantization techniques shown here use torchao as the underlying optimization engine, seamlessly integrated through HuggingFace Transformers, vLLM, and ExecuTorch. 
From c200cd25fa9e755c03d53cff78dfd20f534ccf0c Mon Sep 17 00:00:00 2001 From: jainapurva Date: Mon, 16 Jun 2025 14:11:12 -0700 Subject: [PATCH 06/13] Update --- docs/source/index.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/source/index.rst b/docs/source/index.rst index 9df40131cf..70a265da2b 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -41,3 +41,4 @@ for an overall introduction to the library and recent highlight and updates. subclass_basic subclass_advanced pretraining + inference From c52e6f8f22d2632ef53125a55d04ed177ddb859f Mon Sep 17 00:00:00 2001 From: jainapurva Date: Mon, 16 Jun 2025 18:25:44 -0700 Subject: [PATCH 07/13] Update --- docs/source/inference.rst | 47 ++++++++++++++++++--------------------- 1 file changed, 22 insertions(+), 25 deletions(-) diff --git a/docs/source/inference.rst b/docs/source/inference.rst index 5e26da799f..0c90e52ff7 100644 --- a/docs/source/inference.rst +++ b/docs/source/inference.rst @@ -174,9 +174,9 @@ Performance Optimization Notes When using vLLM with torchao: - **Float8 dynamic quantization**: Provides 36% memory reduction with torchao's optimized kernels -- **Sparse models**: Additional ---- speedup speedup when combined with quantization -- **KV cache**: -- **Compile optimizations**: +- **Sparse models**: Additional [x%] speedup when combined with quantization +- **KV cache**: Add text here +- **Compile optimizations**: Add text here Mobile Deployment with ExecuTorch ################################## @@ -189,6 +189,7 @@ Preparing Models for Mobile **Step 1: Untie Embedding Weights** We want to quantize the embedding and lm_head differently. Since those layers are tied, we first need to untie the model: + .. code-block:: python from transformers import ( @@ -228,6 +229,7 @@ We want to quantize the embedding and lm_head differently. Since those layers ar **Step 2: Create Mobile-Optimized Quantization** Quantizing the model for mobile deployment using torchao's Int8DynamicActivationIntxWeightConfig configuration: + .. code-block:: python from transformers import ( @@ -337,19 +339,6 @@ The torchao-optimized 8da4w model provides: - **Speed**: ~17 tokens/sec on iPhone 15 Pro - **Accuracy**: Maintained within 5-10% of original model on most benchmarks -**iOS Integration Example**: - -.. code-block:: objective-c - - // Load the torchao-optimized PTE file - NSString *modelPath = [[NSBundle mainBundle] pathForResource:@"phi4-mini-8da4w-mobile" ofType:@"pte"]; - - // ExecuTorch runtime automatically uses torchao's optimized kernels - torch::executor::Result module_result = - torch::executor::Module::load(modelPath.UTF8String); - -Android integration follows similar patterns using the ExecuTorch Android API. - Evaluation and Benchmarking ############################ @@ -416,9 +405,11 @@ Performance Benchmarking print(f"Peak Memory Usage: {mem:.02f} GB") -| Benchmark | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq | -|-----------|----------------|------------------------------| -| Peak Memory (GB) | 8.91 | 5.70 (36% reduction) | ++-------------------+----------------+------------------------------+ +| Benchmark | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq | ++===================+================+==============================+ +| Peak Memory (GB) | 8.91 | 5.70 (36% reduction) | ++-------------------+----------------+------------------------------+ **Latency Benchmarking**: @@ -436,6 +427,7 @@ Performance Benchmarking We benchmarked the throughput in a serving environment. .. 
code-block:: bash + # Download sharegpt dataset: wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json @@ -457,12 +449,17 @@ We benchmarked the throughput in a serving environment. **Results (H100 machine)** -| Benchmark | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq | -|-----------|----------------|------------------------------| -| latency (batch_size=1) | 1.64s | 1.41s (1.16x speedup) | -| latency (batch_size=128) | 3.1s | 2.72s (1.14x speedup) | -| serving (num_prompts=1) | 1.35 req/s | 1.57 req/s (1.16x speedup) | -| serving (num_prompts=1000) | 66.68 req/s | 80.53 req/s (1.21x speedup) | ++----------------------------+----------------+------------------------------+ +| Benchmark | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq | ++============================+================+==============================+ +| latency (batch_size=1) | 1.64s | 1.41s (1.16x speedup) | ++----------------------------+----------------+------------------------------+ +| latency (batch_size=128) | 3.1s | 2.72s (1.14x speedup) | ++----------------------------+----------------+------------------------------+ +| serving (num_prompts=1) | 1.35 req/s | 1.57 req/s (1.16x speedup) | ++----------------------------+----------------+------------------------------+ +| serving (num_prompts=1000) | 66.68 req/s | 80.53 req/s (1.21x speedup) | ++----------------------------+----------------+------------------------------+ Conclusion From de160b1274c55ccbaaf50474ea3c997bf011539b Mon Sep 17 00:00:00 2001 From: jainapurva Date: Tue, 17 Jun 2025 10:20:35 -0700 Subject: [PATCH 08/13] Update --- docs/source/inference.rst | 269 ++++++++++++++++++-------------------- 1 file changed, 128 insertions(+), 141 deletions(-) diff --git a/docs/source/inference.rst b/docs/source/inference.rst index 0c90e52ff7..2e80f505c8 100644 --- a/docs/source/inference.rst +++ b/docs/source/inference.rst @@ -1,23 +1,15 @@ +################################################## Inference Tutorial: From Quantization to Serving -=================================================== +################################################## -This tutorial demonstrates how to perform post-training quantization and deploy models for inference using torchao's integration with popular frameworks. All quantization techniques shown here use torchao as the underlying optimization engine, seamlessly integrated through HuggingFace Transformers, vLLM, and ExecuTorch. +This tutorial demonstrates how to perform post-training quantization and deploy models for inference: -.. contents:: - :local: - :depth: 2 +1. :ref:`Post-training Quantization with HuggingFace`: Using float8 dynamic quantization with HuggingFace integration +2. :ref:`Sparsity Integration`: Combining sparsity with quantization for additional speedups +3. :ref:`High-throughput Serving with vLLM`: Deploying quantized models with vLLM +4. :ref:`Mobile Deployment with Executorch`: Lowering to ExecuTorch for on-device inference -Overview --------- - -This tutorial covers the complete inference pipeline: - -1. **Post-training Quantization**: Using float8 dynamic quantization with HuggingFace integration -2. **Sparsity**: Combining sparsity with quantization for additional speedups -3. **High-throughput Serving**: Deploying quantized models with vLLM -4. **Mobile Deployment**: Lowering to ExecuTorch for on-device inference - -All these workflows leverage torchao's optimized kernels and quantization algorithms under the hood. 
+All techniques shown here use torchao as the underlying optimization engine, seamlessly integrated through HuggingFace Transformers, vLLM, and ExecuTorch. Post-training Quantization with HuggingFace ############################################ @@ -69,6 +61,126 @@ Float8 dynamic quantization shows 36% reduction in model size with minimal accur output = pipe(messages, **generation_args) print(output[0]['generated_text']) +Model Quality Assessment +------------------------ + +Evaluate quantized models using lm-evaluation-harness: + +.. code-block:: bash + + # Install evaluation framework + # Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install + + # Evaluate baseline model + lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8 + + # Evaluate torchao-quantized model (float8dq) + lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8 + + +Performance Benchmarking +------------------------ + +**Memory Usage Comparison**: + +.. code-block:: python + + import torch + from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig + + # use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-float8dq" + model_id = "pytorch/Phi-4-mini-instruct-float8dq" + quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16) + tokenizer = AutoTokenizer.from_pretrained(model_id) + + torch.cuda.reset_peak_memory_stats() + + prompt = "Hey, are you conscious? Can you talk to me?" + messages = [ + { + "role": "system", + "content": "", + }, + {"role": "user", "content": prompt}, + ] + templated_prompt = tokenizer.apply_chat_template( + messages, + tokenize=False, + add_generation_prompt=True, + ) + print("Prompt:", prompt) + print("Templated prompt:", templated_prompt) + inputs = tokenizer( + templated_prompt, + return_tensors="pt", + ).to("cuda") + generated_ids = quantized_model.generate(**inputs, max_new_tokens=128) + output_text = tokenizer.batch_decode( + generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False + ) + print("Response:", output_text[0][len(prompt):]) + + mem = torch.cuda.max_memory_reserved() / 1e9 + print(f"Peak Memory Usage: {mem:.02f} GB") + + ++-------------------+----------------+------------------------------+ +| Benchmark | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq | ++===================+================+==============================+ +| Peak Memory (GB) | 8.91 | 5.70 (36% reduction) | ++-------------------+----------------+------------------------------+ + + +**Latency Benchmarking**: + +.. code-block:: bash + + # baseline + python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1 + + # float8dq + VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1 + +**Serving Benchmarking**: + +We benchmarked the throughput in a serving environment. + +.. 
code-block:: bash + + # Download sharegpt dataset: + wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json + + # Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks + # Note: you can change the number of prompts to be benchmarked with --num-prompts argument for benchmark_serving script. + + # For baseline + # Server: + vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3 + # Client: + python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1 + + # For float8dq + # Server: + VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3 + # Client: + python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1 + + +**Results (H100 machine)** + ++----------------------------+----------------+------------------------------+ +| Benchmark | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq | ++============================+================+==============================+ +| latency (batch_size=1) | 1.64s | 1.41s (1.16x speedup) | ++----------------------------+----------------+------------------------------+ +| latency (batch_size=128) | 3.1s | 2.72s (1.14x speedup) | ++----------------------------+----------------+------------------------------+ +| serving (num_prompts=1) | 1.35 req/s | 1.57 req/s (1.16x speedup) | ++----------------------------+----------------+------------------------------+ +| serving (num_prompts=1000) | 66.68 req/s | 80.53 req/s (1.21x speedup) | ++----------------------------+----------------+------------------------------+ + + Sparsity Integration #################### @@ -183,9 +295,6 @@ Mobile Deployment with ExecuTorch ExecuTorch enables on-device inference using torchao's mobile-optimized quantization schemes. The 8da4w (8-bit dynamic activation, 4-bit weight) configuration is specifically designed for mobile deployment. -Preparing Models for Mobile ----------------------------- - **Step 1: Untie Embedding Weights** We want to quantize the embedding and lm_head differently. Since those layers are tied, we first need to untie the model: @@ -339,128 +448,6 @@ The torchao-optimized 8da4w model provides: - **Speed**: ~17 tokens/sec on iPhone 15 Pro - **Accuracy**: Maintained within 5-10% of original model on most benchmarks -Evaluation and Benchmarking -############################ - -Model Quality Assessment ------------------------- - -Evaluate quantized models using lm-evaluation-harness: - -.. code-block:: bash - - # Install evaluation framework - # Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install - - # Evaluate baseline model - lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8 - - # Evaluate torchao-quantized model (float8dq) - lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8 - - -Performance Benchmarking ------------------------- - -**Memory Usage Comparison**: - -.. 
code-block:: python - - import torch - from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig - - # use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-float8dq" - model_id = "pytorch/Phi-4-mini-instruct-float8dq" - quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16) - tokenizer = AutoTokenizer.from_pretrained(model_id) - - torch.cuda.reset_peak_memory_stats() - - prompt = "Hey, are you conscious? Can you talk to me?" - messages = [ - { - "role": "system", - "content": "", - }, - {"role": "user", "content": prompt}, - ] - templated_prompt = tokenizer.apply_chat_template( - messages, - tokenize=False, - add_generation_prompt=True, - ) - print("Prompt:", prompt) - print("Templated prompt:", templated_prompt) - inputs = tokenizer( - templated_prompt, - return_tensors="pt", - ).to("cuda") - generated_ids = quantized_model.generate(**inputs, max_new_tokens=128) - output_text = tokenizer.batch_decode( - generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False - ) - print("Response:", output_text[0][len(prompt):]) - - mem = torch.cuda.max_memory_reserved() / 1e9 - print(f"Peak Memory Usage: {mem:.02f} GB") - - -+-------------------+----------------+------------------------------+ -| Benchmark | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq | -+===================+================+==============================+ -| Peak Memory (GB) | 8.91 | 5.70 (36% reduction) | -+-------------------+----------------+------------------------------+ - - -**Latency Benchmarking**: - -.. code-block:: bash - - # baseline - python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1 - - # float8dq - VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1 - -**Serving Benchmarking**: - -We benchmarked the throughput in a serving environment. - -.. code-block:: bash - - # Download sharegpt dataset: - wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json - - # Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks - # Note: you can change the number of prompts to be benchmarked with --num-prompts argument for benchmark_serving script. 
- - # For baseline - # Server: - vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3 - # Client: - python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1 - - # For float8dq - # Server: - VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3 - # Client: - python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1 - - -**Results (H100 machine)** - -+----------------------------+----------------+------------------------------+ -| Benchmark | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq | -+============================+================+==============================+ -| latency (batch_size=1) | 1.64s | 1.41s (1.16x speedup) | -+----------------------------+----------------+------------------------------+ -| latency (batch_size=128) | 3.1s | 2.72s (1.14x speedup) | -+----------------------------+----------------+------------------------------+ -| serving (num_prompts=1) | 1.35 req/s | 1.57 req/s (1.16x speedup) | -+----------------------------+----------------+------------------------------+ -| serving (num_prompts=1000) | 66.68 req/s | 80.53 req/s (1.21x speedup) | -+----------------------------+----------------+------------------------------+ - Conclusion ########## From e8f5e533b57e018868b9db01df83ba6c6c201f2d Mon Sep 17 00:00:00 2001 From: jainapurva Date: Tue, 17 Jun 2025 12:11:06 -0700 Subject: [PATCH 09/13] Update --- docs/source/inference.rst | 253 ++++++++++++++++++++++++-------------- 1 file changed, 162 insertions(+), 91 deletions(-) diff --git a/docs/source/inference.rst b/docs/source/inference.rst index 2e80f505c8..7e8c97363a 100644 --- a/docs/source/inference.rst +++ b/docs/source/inference.rst @@ -5,9 +5,8 @@ Inference Tutorial: From Quantization to Serving This tutorial demonstrates how to perform post-training quantization and deploy models for inference: 1. :ref:`Post-training Quantization with HuggingFace`: Using float8 dynamic quantization with HuggingFace integration -2. :ref:`Sparsity Integration`: Combining sparsity with quantization for additional speedups -3. :ref:`High-throughput Serving with vLLM`: Deploying quantized models with vLLM -4. :ref:`Mobile Deployment with Executorch`: Lowering to ExecuTorch for on-device inference +2. :ref:`High-throughput Serving with vLLM`: Deploying quantized models with vLLM +3. :ref:`Mobile Deployment with Executorch`: Lowering to ExecuTorch for on-device inference All techniques shown here use torchao as the underlying optimization engine, seamlessly integrated through HuggingFace Transformers, vLLM, and ExecuTorch. @@ -21,6 +20,102 @@ Float8 Dynamic Quantization Float8 dynamic quantization shows 36% reduction in model size with minimal accuracy loss: +.. 
code-block:: python + + import torch + from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig + + model_id = "microsoft/Phi-4-mini-instruct" + + from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow + quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()) + quantization_config = TorchAoConfig(quant_type=quant_config) + quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config) + tokenizer = AutoTokenizer.from_pretrained(model_id) + + # Push to hub + USER_ID = "YOUR_USER_ID" + MODEL_NAME = model_id.split("/")[-1] + save_to = f"{USER_ID}/{MODEL_NAME}-float8dq" + quantized_model.push_to_hub(save_to, safe_serialization=False) + tokenizer.push_to_hub(save_to) + + # Manual Testing + prompt = "Hey, are you conscious? Can you talk to me?" + messages = [ + { + "role": "system", + "content": "", + }, + {"role": "user", "content": prompt}, + ] + templated_prompt = tokenizer.apply_chat_template( + messages, + tokenize=False, + add_generation_prompt=True, + ) + print("Prompt:", prompt) + print("Templated prompt:", templated_prompt) + inputs = tokenizer( + templated_prompt, + return_tensors="pt", + ).to("cuda") + generated_ids = quantized_model.generate(**inputs, max_new_tokens=128) + output_text = tokenizer.batch_decode( + generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False + ) + print("Response:", output_text[0][len(prompt):]) + + +[Optional] Float8 Dynamic Quantization + Semi-structured (2:4) sparsity +-------------------------------------------------------------------------- + +Torchao's sparsity support can be combined with quantization for additional performance gains. The Marlin sparse layout provides optimized kernels for 2:4 structured sparsity. + +.. code-block:: python + + from torchao.quantization import Float8DynamicActivationFloat8SemiSparseWeightConfig + + # Combine sparsity with int4 quantization - both optimized by torchao + quant_config = Float8DynamicActivationFloat8SemiSparseWeightConfig() + quantization_config = TorchAoConfig(quant_type=quant_config) + + # Load a pre-sparsified checkpoint + model = AutoModelForCausalLM.from_pretrained( + "nm-testing/Meta-Llama-3.1-8B-Instruct-W4A16-G128-2of4", # 2:4 sparse model + torch_dtype=torch.float16, + device_map="cuda", + quantization_config=quantization_config + ) + + tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct") + + # Use static KV cache for best performance with torchao optimizations + messages = [{"role": "user", "content": "What are the benefits of sparse neural networks?"}] + inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda") + + outputs = model.generate( + inputs, + max_new_tokens=150, + cache_implementation="static", # Optimized for torchao + do_sample=False + ) + + response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True) + print(response) + +For more information on supported quantization and sparsity configurations, see https://huggingface.co/docs/transformers/main/en/quantization/torchao. + +Inference with Transformers +--------------------------- + +Install the required packages: +.. code-block:: bash + pip install git+https://github.com/huggingface/transformers@main + pip install torchao + pip install torch + pip install accelerate + .. 
code-block:: python import torch @@ -79,7 +174,31 @@ Evaluate quantized models using lm-evaluation-harness: Performance Benchmarking ------------------------- +------------------------------ + +**Latency Benchmarking**: + +.. code-block:: bash + + # baseline + python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1 + + # float8dq + VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1 + +**Results (H100 machine)** + ++----------------------------+----------------+------------------------------+ +| Benchmark | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq | ++============================+================+==============================+ +| latency (batch_size=1) | 1.64s | 1.41s (1.16x speedup) | ++----------------------------+----------------+------------------------------+ +| latency (batch_size=128) | 3.1s | 2.72s (1.14x speedup) | ++----------------------------+----------------+------------------------------+ + + +Memory Benchmarking +-------------------- **Memory Usage Comparison**: @@ -130,16 +249,39 @@ Performance Benchmarking | Peak Memory (GB) | 8.91 | 5.70 (36% reduction) | +-------------------+----------------+------------------------------+ +Performance Breakdown +------------------------------ +When using vLLM with torchao: -**Latency Benchmarking**: +- **Float8 dynamic quantization**: Provides 36% VRAM reduction, 1.15x-1.2x speedup and little to no accuracy impact on H100 +- **Sparsity Support**: Semi-structured (2:4) sparsity for faster inference (see [Accelerating Neural Network Training with Semi-Structured (2:4) Sparsity](https://pytorch.org/blog/accelerating-neural-network-training/) blog post) +- **KV Cache Quantization**: Enables long context inference with lower memory (see [KV Cache Quantization](https://github.com/pytorch/ao/blob/main/torchao/_models/llama/README.md)) + +High-throughput Serving with vLLM +################################## + +vLLM automatically leverages torchao's optimized kernels when serving quantized models, providing significant throughput improvements. + +Setting up vLLM with Quantized Models +-------------------------------------- + +First, install vLLM with torchao support: .. code-block:: bash - # baseline - python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1 + pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly + pip install torchao - # float8dq - VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1 +Serving Quantized Models +----------------------------- + +.. code-block:: bash + + vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3 + + +Serving Performance Benchmarking +-------------------------------- **Serving Benchmarking**: @@ -147,6 +289,14 @@ We benchmarked the throughput in a serving environment. .. code-block:: bash + # Setup: Get vllm source code + git clone git@github.com:vllm-project/vllm.git + + # Install vllm + VLLM_USE_PRECOMPILED=1 pip install --editable . 
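+
+    # (The editable install above assumes you are inside the cloned vllm directory;
+    #  VLLM_USE_PRECOMPILED=1 reuses prebuilt binaries, so no local CUDA compilation is needed.)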
+ + # Run the benchmarks under vllm root folder: + # Download sharegpt dataset: wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json @@ -165,78 +315,16 @@ We benchmarked the throughput in a serving environment. # Client: python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1 - **Results (H100 machine)** +----------------------------+----------------+------------------------------+ | Benchmark | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq | +============================+================+==============================+ -| latency (batch_size=1) | 1.64s | 1.41s (1.16x speedup) | -+----------------------------+----------------+------------------------------+ -| latency (batch_size=128) | 3.1s | 2.72s (1.14x speedup) | -+----------------------------+----------------+------------------------------+ | serving (num_prompts=1) | 1.35 req/s | 1.57 req/s (1.16x speedup) | +----------------------------+----------------+------------------------------+ | serving (num_prompts=1000) | 66.68 req/s | 80.53 req/s (1.21x speedup) | +----------------------------+----------------+------------------------------+ - -Sparsity Integration -#################### - -Torchao's sparsity support can be combined with quantization for additional performance gains. The Marlin sparse layout provides optimized kernels for 2:4 structured sparsity. - -Sparse + Quantized Models -------------------------- - -.. code-block:: python - - from torchao.quantization import Int4WeightOnlyConfig - from torchao.dtypes import MarlinSparseLayout - - # Combine sparsity with int4 quantization - both optimized by torchao - quant_config = Int4WeightOnlyConfig(layout=MarlinSparseLayout()) - quantization_config = TorchAoConfig(quant_type=quant_config) - - # Load a pre-sparsified checkpoint - model = AutoModelForCausalLM.from_pretrained( - "nm-testing/Meta-Llama-3.1-8B-Instruct-W4A16-G128-2of4", # 2:4 sparse model - torch_dtype=torch.float16, - device_map="cuda", - quantization_config=quantization_config - ) - - tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct") - - # Use static KV cache for best performance with torchao optimizations - messages = [{"role": "user", "content": "What are the benefits of sparse neural networks?"}] - inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda") - - outputs = model.generate( - inputs, - max_new_tokens=150, - cache_implementation="static", # Optimized for torchao - do_sample=False - ) - - response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True) - print(response) - -High-throughput Serving with vLLM -################################## - -vLLM automatically leverages torchao's optimized kernels when serving quantized models, providing significant throughput improvements. - -Setting up vLLM with Quantized Models --------------------------------------- - -First, install vLLM with torchao support: - -.. code-block:: bash - - pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly - pip install torchao - Inference with vLLM ------------------- @@ -271,25 +359,6 @@ Inference with vLLM print(f"Output: {generated_text!r}") print("-" * 60) - -Serving Quantized Models ------------------------------ - -.. 
code-block:: bash - - vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3 - - -Performance Optimization Notes ------------------------------- - -When using vLLM with torchao: - -- **Float8 dynamic quantization**: Provides 36% memory reduction with torchao's optimized kernels -- **Sparse models**: Additional [x%] speedup when combined with quantization -- **KV cache**: Add text here -- **Compile optimizations**: Add text here - Mobile Deployment with ExecuTorch ################################## @@ -444,10 +513,11 @@ Mobile Performance Characteristics The torchao-optimized 8da4w model provides: -- **Memory**: ~3.2GB on iPhone 15 Pro (vs ~12GB unquantized) +- **Memory**: ~3.2GB on iPhone 15 Pro - **Speed**: ~17 tokens/sec on iPhone 15 Pro - **Accuracy**: Maintained within 5-10% of original model on most benchmarks +For detailed instructions on testing the executorch model and reproducing benchmarks please refer to the [HF Phi-4-mini-instruct-8da4w model](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w). Conclusion ########## @@ -457,6 +527,7 @@ This tutorial demonstrated how torchao's quantization and sparsity techniques in - **HuggingFace Transformers** provides easy model loading with torchao quantization - **vLLM** leverages torchao's optimized kernels for high-throughput serving - **ExecuTorch** enables mobile deployment with torchao's mobile-optimized schemes +- **lm-evaluation-harness** provides model quality assessment All these frameworks use torchao as the underlying optimization engine, ensuring consistent performance gains and ease of integration. The quantization techniques shown provide significant memory reduction (3-4x) and performance improvements (1.5-2x) while maintaining model quality within acceptable bounds for most applications. From bbd567d246f71ccf60595c18b34a7f31070b996b Mon Sep 17 00:00:00 2001 From: jainapurva Date: Tue, 17 Jun 2025 13:37:25 -0700 Subject: [PATCH 10/13] Update --- docs/source/inference.rst | 171 +++++++++++++++++++------------------- 1 file changed, 86 insertions(+), 85 deletions(-) diff --git a/docs/source/inference.rst b/docs/source/inference.rst index 7e8c97363a..e6bc4a19e1 100644 --- a/docs/source/inference.rst +++ b/docs/source/inference.rst @@ -2,13 +2,11 @@ Inference Tutorial: From Quantization to Serving ################################################## -This tutorial demonstrates how to perform post-training quantization and deploy models for inference: +This tutorial demonstrates how to perform post-training quantization and deploy models for inference using torchao as the underlying optimization engine, seamlessly integrated through HuggingFace Transformers, vLLM, and ExecuTorch. -1. :ref:`Post-training Quantization with HuggingFace`: Using float8 dynamic quantization with HuggingFace integration -2. :ref:`High-throughput Serving with vLLM`: Deploying quantized models with vLLM -3. :ref:`Mobile Deployment with Executorch`: Lowering to ExecuTorch for on-device inference - -All techniques shown here use torchao as the underlying optimization engine, seamlessly integrated through HuggingFace Transformers, vLLM, and ExecuTorch. +.. 
contents:: + :local: + :depth: 2 Post-training Quantization with HuggingFace ############################################ @@ -70,7 +68,7 @@ Float8 dynamic quantization shows 36% reduction in model size with minimal accur [Optional] Float8 Dynamic Quantization + Semi-structured (2:4) sparsity -------------------------------------------------------------------------- -Torchao's sparsity support can be combined with quantization for additional performance gains. The Marlin sparse layout provides optimized kernels for 2:4 structured sparsity. +Torchao's sparsity support can be combined with quantization for additional performance gains, using optimized kernels for 2:4 structured sparsity. .. code-block:: python @@ -104,13 +102,16 @@ Torchao's sparsity support can be combined with quantization for additional perf response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True) print(response) -For more information on supported quantization and sparsity configurations, see https://huggingface.co/docs/transformers/main/en/quantization/torchao. +.. note:: +For more information on supported quantization and sparsity configurations, see `HF-Torchao Docs `_. Inference with Transformers --------------------------- Install the required packages: + .. code-block:: bash + pip install git+https://github.com/huggingface/transformers@main pip install torchao pip install torch @@ -156,6 +157,9 @@ Install the required packages: output = pipe(messages, **generation_args) print(output[0]['generated_text']) +Evaluation +########### + Model Quality Assessment ------------------------ @@ -172,31 +176,6 @@ Evaluate quantized models using lm-evaluation-harness: # Evaluate torchao-quantized model (float8dq) lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8 - -Performance Benchmarking ------------------------------- - -**Latency Benchmarking**: - -.. 
code-block:: bash - - # baseline - python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1 - - # float8dq - VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1 - -**Results (H100 machine)** - -+----------------------------+----------------+------------------------------+ -| Benchmark | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq | -+============================+================+==============================+ -| latency (batch_size=1) | 1.64s | 1.41s (1.16x speedup) | -+----------------------------+----------------+------------------------------+ -| latency (batch_size=128) | 3.1s | 2.72s (1.14x speedup) | -+----------------------------+----------------+------------------------------+ - - Memory Benchmarking -------------------- @@ -242,48 +221,28 @@ Memory Benchmarking mem = torch.cuda.max_memory_reserved() / 1e9 print(f"Peak Memory Usage: {mem:.02f} GB") ++-------------------+---------------------+------------------------------+ +| Benchmark | Phi-4 mini-instruct | Phi-4-mini-instruct-float8dq | ++===================+=====================+==============================+ +| Peak Memory (GB) | 8.91 | 5.70 (36% reduction) | ++-------------------+---------------------+------------------------------+ -+-------------------+----------------+------------------------------+ -| Benchmark | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq | -+===================+================+==============================+ -| Peak Memory (GB) | 8.91 | 5.70 (36% reduction) | -+-------------------+----------------+------------------------------+ - -Performance Breakdown +Performance Benchmarking ------------------------------ -When using vLLM with torchao: - -- **Float8 dynamic quantization**: Provides 36% VRAM reduction, 1.15x-1.2x speedup and little to no accuracy impact on H100 -- **Sparsity Support**: Semi-structured (2:4) sparsity for faster inference (see [Accelerating Neural Network Training with Semi-Structured (2:4) Sparsity](https://pytorch.org/blog/accelerating-neural-network-training/) blog post) -- **KV Cache Quantization**: Enables long context inference with lower memory (see [KV Cache Quantization](https://github.com/pytorch/ao/blob/main/torchao/_models/llama/README.md)) - -High-throughput Serving with vLLM -################################## -vLLM automatically leverages torchao's optimized kernels when serving quantized models, providing significant throughput improvements. - -Setting up vLLM with Quantized Models --------------------------------------- - -First, install vLLM with torchao support: - -.. code-block:: bash - - pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly - pip install torchao - -Serving Quantized Models ------------------------------ +**Latency Benchmarking**: +========================= .. 
code-block:: bash - vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3 - + # baseline + python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1 -Serving Performance Benchmarking --------------------------------- + # float8dq + VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1 **Serving Benchmarking**: +========================= We benchmarked the throughput in a serving environment. @@ -315,18 +274,48 @@ We benchmarked the throughput in a serving environment. # Client: python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1 -**Results (H100 machine)** +**Results (H100 machine)**: +============================ -+----------------------------+----------------+------------------------------+ -| Benchmark | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq | -+============================+================+==============================+ -| serving (num_prompts=1) | 1.35 req/s | 1.57 req/s (1.16x speedup) | -+----------------------------+----------------+------------------------------+ -| serving (num_prompts=1000) | 66.68 req/s | 80.53 req/s (1.21x speedup) | -+----------------------------+----------------+------------------------------+ ++----------------------------+---------------------+------------------------------+ +| Benchmark | Phi-4-mini-instruct | Phi-4-mini-instruct-float8dq | ++============================+=====================+==============================+ +| latency (batch_size=1) | 1.64s | 1.41s (1.16x speedup) | ++----------------------------+---------------------+------------------------------+ +| latency (batch_size=128) | 3.1s | 2.72s (1.14x speedup) | ++----------------------------+---------------------+------------------------------+ +| serving (num_prompts=1) | 1.35 req/s | 1.57 req/s (1.16x speedup) | ++----------------------------+---------------------+------------------------------+ +| serving (num_prompts=1000) | 66.68 req/s | 80.53 req/s (1.21x speedup) | ++----------------------------+---------------------+------------------------------+ + +Serving +####### + +High-throughput Serving with vLLM +--------------------------------- + +vLLM automatically leverages torchao's optimized kernels when serving quantized models, providing significant throughput improvements. + +Setting up vLLM with Quantized Models +===================================== + +First, install vLLM with torchao support: + +.. code-block:: bash + + pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly + pip install torchao + +Serving Quantized Models +======================== + +.. code-block:: bash + + vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3 Inference with vLLM -------------------- +=================== .. code-block:: python @@ -342,7 +331,6 @@ Inference with vLLM # Create a sampling params object. sampling_params = SamplingParams(temperature=0.8, top_p=0.95) - if __name__ == '__main__': # Create an LLM. 
llm = LLM(model="pytorch/Phi-4-mini-instruct-float8dq") @@ -359,12 +347,22 @@ Inference with vLLM print(f"Output: {generated_text!r}") print("-" * 60) +Performance Breakdown +===================== + +When using vLLM with torchao: + +- **Float8 dynamic quantization**: Provides 36% VRAM reduction, 1.15x-1.2x speedup and little to no accuracy impact on H100 +- **Sparsity Support**: Semi-structured (2:4) sparsity for faster inference (see `Accelerating Neural Network Training with Semi-Structured (2:4) Sparsity `_ blog post) +- **KV Cache Quantization**: Enables long context inference with lower memory (see `KV Cache Quantization `_) + Mobile Deployment with ExecuTorch -################################## +--------------------------------- ExecuTorch enables on-device inference using torchao's mobile-optimized quantization schemes. The 8da4w (8-bit dynamic activation, 4-bit weight) configuration is specifically designed for mobile deployment. -**Step 1: Untie Embedding Weights** +Step 1: Untie Embedding Weights +=============================== We want to quantize the embedding and lm_head differently. Since those layers are tied, we first need to untie the model: @@ -404,9 +402,10 @@ We want to quantize the embedding and lm_head differently. Since those layers ar untied_model.save_pretrained(save_to_local_path) tokenizer.save_pretrained(save_to) -**Step 2: Create Mobile-Optimized Quantization** +Step 2: Create Mobile-Optimized Quantization +============================================ -Quantizing the model for mobile deployment using torchao's Int8DynamicActivationIntxWeightConfig configuration: +Quantizing the model for mobile deployment using TorchAO's **Int8DynamicActivationIntxWeightConfig** configuration: .. code-block:: python @@ -481,7 +480,8 @@ Quantizing the model for mobile deployment using torchao's Int8DynamicActivation print("Response:", output_text[0][len(prompt):]) -**Step 3: Export to ExecuTorch** +Step 3: Export to ExecuTorch +============================ .. code-block:: bash @@ -509,7 +509,7 @@ Quantizing the model for mobile deployment using torchao's Int8DynamicActivation Mobile Performance Characteristics ----------------------------------- +==================================== The torchao-optimized 8da4w model provides: @@ -517,10 +517,11 @@ The torchao-optimized 8da4w model provides: - **Speed**: ~17 tokens/sec on iPhone 15 Pro - **Accuracy**: Maintained within 5-10% of original model on most benchmarks -For detailed instructions on testing the executorch model and reproducing benchmarks please refer to the [HF Phi-4-mini-instruct-8da4w model](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w). +.. note:: +For detailed instructions on testing the executorch model and reproducing benchmarks please refer to the `HF Phi-4-mini-instruct-8da4w model `_. 
-Conclusion -########## +**Conclusion** +============== This tutorial demonstrated how torchao's quantization and sparsity techniques integrate seamlessly across the entire ML deployment stack: From 6a966971b36a10756ac7cfbb62c9a6bbbb1eb9b3 Mon Sep 17 00:00:00 2001 From: jainapurva Date: Tue, 17 Jun 2025 13:46:31 -0700 Subject: [PATCH 11/13] Update notes --- docs/source/inference.rst | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/docs/source/inference.rst b/docs/source/inference.rst index e6bc4a19e1..08324d573b 100644 --- a/docs/source/inference.rst +++ b/docs/source/inference.rst @@ -103,7 +103,7 @@ Torchao's sparsity support can be combined with quantization for additional perf print(response) .. note:: -For more information on supported quantization and sparsity configurations, see `HF-Torchao Docs `_. + For more information on supported quantization and sparsity configurations, see `HF-Torchao Docs `_. Inference with Transformers --------------------------- @@ -356,6 +356,10 @@ When using vLLM with torchao: - **Sparsity Support**: Semi-structured (2:4) sparsity for faster inference (see `Accelerating Neural Network Training with Semi-Structured (2:4) Sparsity `_ blog post) - **KV Cache Quantization**: Enables long context inference with lower memory (see `KV Cache Quantization `_) +.. note:: + For more information on vLLM Integration, please refer to the detailed guide :ref:`torchao_vllm_integration`. + + Mobile Deployment with ExecuTorch --------------------------------- @@ -518,7 +522,7 @@ The torchao-optimized 8da4w model provides: - **Accuracy**: Maintained within 5-10% of original model on most benchmarks .. note:: -For detailed instructions on testing the executorch model and reproducing benchmarks please refer to the `HF Phi-4-mini-instruct-8da4w model `_. + For detailed instructions on testing the executorch model and reproducing benchmarks please refer to the `HF Phi-4-mini-instruct-8da4w model `_. **Conclusion** ============== From 06612d3e002c7a3f61bd68fdfaa418d55f7f785d Mon Sep 17 00:00:00 2001 From: jainapurva Date: Wed, 18 Jun 2025 12:11:08 -0700 Subject: [PATCH 12/13] Updates --- docs/source/inference.rst | 80 ++++++++++++++++++--------------------- 1 file changed, 36 insertions(+), 44 deletions(-) diff --git a/docs/source/inference.rst b/docs/source/inference.rst index 08324d573b..b2ee0741b6 100644 --- a/docs/source/inference.rst +++ b/docs/source/inference.rst @@ -80,7 +80,7 @@ Torchao's sparsity support can be combined with quantization for additional perf # Load a pre-sparsified checkpoint model = AutoModelForCausalLM.from_pretrained( - "nm-testing/Meta-Llama-3.1-8B-Instruct-W4A16-G128-2of4", # 2:4 sparse model + "RedHatAI/Sparse-Llama-3.1-8B-2of4", # 2:4 sparse model torch_dtype=torch.float16, device_map="cuda", quantization_config=quantization_config @@ -105,8 +105,41 @@ Torchao's sparsity support can be combined with quantization for additional perf .. note:: For more information on supported quantization and sparsity configurations, see `HF-Torchao Docs `_. -Inference with Transformers ---------------------------- +Inference with vLLM +------------------- + +.. code-block:: python + + from vllm import LLM, SamplingParams + + # Sample prompts. + prompts = [ + "Hello, my name is", + "The president of the United States is", + "The capital of France is", + "The future of AI is", + ] + # Create a sampling params object. + sampling_params = SamplingParams(temperature=0.8, top_p=0.95) + + if __name__ == '__main__': + # Create an LLM. 
+ llm = LLM(model="pytorch/Phi-4-mini-instruct-float8dq") + # Generate texts from the prompts. + # The output is a list of RequestOutput objects + # that contain the prompt, generated text, and other information. + outputs = llm.generate(prompts, sampling_params) + # Print the outputs. + print("\nGenerated Outputs:\n" + "-" * 60) + for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}") + print(f"Output: {generated_text!r}") + print("-" * 60) + +[Optional] Inference with Transformers +-------------------------------------- Install the required packages: @@ -179,8 +212,6 @@ Evaluate quantized models using lm-evaluation-harness: Memory Benchmarking -------------------- -**Memory Usage Comparison**: - .. code-block:: python import torch @@ -297,9 +328,6 @@ High-throughput Serving with vLLM vLLM automatically leverages torchao's optimized kernels when serving quantized models, providing significant throughput improvements. -Setting up vLLM with Quantized Models -===================================== - First, install vLLM with torchao support: .. code-block:: bash @@ -307,46 +335,10 @@ First, install vLLM with torchao support: pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly pip install torchao -Serving Quantized Models -======================== - .. code-block:: bash vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3 -Inference with vLLM -=================== - -.. code-block:: python - - from vllm import LLM, SamplingParams - - # Sample prompts. - prompts = [ - "Hello, my name is", - "The president of the United States is", - "The capital of France is", - "The future of AI is", - ] - # Create a sampling params object. - sampling_params = SamplingParams(temperature=0.8, top_p=0.95) - - if __name__ == '__main__': - # Create an LLM. - llm = LLM(model="pytorch/Phi-4-mini-instruct-float8dq") - # Generate texts from the prompts. - # The output is a list of RequestOutput objects - # that contain the prompt, generated text, and other information. - outputs = llm.generate(prompts, sampling_params) - # Print the outputs. - print("\nGenerated Outputs:\n" + "-" * 60) - for output in outputs: - prompt = output.prompt - generated_text = output.outputs[0].text - print(f"Prompt: {prompt!r}") - print(f"Output: {generated_text!r}") - print("-" * 60) - Performance Breakdown ===================== From ce675b8b042d0155a1cc02489d1dc8794bdb8243 Mon Sep 17 00:00:00 2001 From: jainapurva Date: Wed, 18 Jun 2025 14:05:25 -0700 Subject: [PATCH 13/13] Updates --- docs/source/serving.rst | 528 +++++++++++++++++++++++++++++++++++++++- 1 file changed, 527 insertions(+), 1 deletion(-) diff --git a/docs/source/serving.rst b/docs/source/serving.rst index cb61b159c4..ec5ceb4761 100644 --- a/docs/source/serving.rst +++ b/docs/source/serving.rst @@ -9,4 +9,530 @@ serving step. .. image:: ../static/e2e_flow_part3.png -(Coming soon!) +This tutorial demonstrates how to perform post-training quantization and deploy models for inference using torchao as the underlying optimization engine, seamlessly integrated through HuggingFace Transformers, vLLM, and ExecuTorch. + +.. contents:: + :local: + :depth: 2 + +Post-training Quantization with HuggingFace +############################################ + +HuggingFace Transformers provides seamless integration with torchao quantization. The ``TorchAoConfig`` automatically applies torchao's optimized quantization algorithms during model loading. 
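+
+The general pattern is sketched below. This is a condensed version of the fuller examples later on this page, so the model ids and the float8 configuration are the ones used there:
+
+.. code-block:: python
+
+    import torch
+    from transformers import AutoModelForCausalLM, TorchAoConfig
+    from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
+
+    # Option 1: quantize a base checkpoint at load time with a torchao config
+    quant_config = TorchAoConfig(quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))
+    model = AutoModelForCausalLM.from_pretrained(
+        "microsoft/Phi-4-mini-instruct",
+        torch_dtype=torch.bfloat16,
+        device_map="auto",
+        quantization_config=quant_config,
+    )
+
+    # Option 2: load a checkpoint that was already quantized and pushed to the Hub;
+    # the quantization config stored with the checkpoint is picked up automatically
+    model = AutoModelForCausalLM.from_pretrained(
+        "pytorch/Phi-4-mini-instruct-float8dq",
+        torch_dtype=torch.bfloat16,
+        device_map="auto",
+    )
+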
+ +Float8 Dynamic Quantization +------------------------------ + +Float8 dynamic quantization shows 36% reduction in model size with minimal accuracy loss: + +.. code-block:: python + + import torch + from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig + + model_id = "microsoft/Phi-4-mini-instruct" + + from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow + quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()) + quantization_config = TorchAoConfig(quant_type=quant_config) + quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config) + tokenizer = AutoTokenizer.from_pretrained(model_id) + + # Push to hub + USER_ID = "YOUR_USER_ID" + MODEL_NAME = model_id.split("/")[-1] + save_to = f"{USER_ID}/{MODEL_NAME}-float8dq" + quantized_model.push_to_hub(save_to, safe_serialization=False) + tokenizer.push_to_hub(save_to) + + # Manual Testing + prompt = "Hey, are you conscious? Can you talk to me?" + messages = [ + { + "role": "system", + "content": "", + }, + {"role": "user", "content": prompt}, + ] + templated_prompt = tokenizer.apply_chat_template( + messages, + tokenize=False, + add_generation_prompt=True, + ) + print("Prompt:", prompt) + print("Templated prompt:", templated_prompt) + inputs = tokenizer( + templated_prompt, + return_tensors="pt", + ).to("cuda") + generated_ids = quantized_model.generate(**inputs, max_new_tokens=128) + output_text = tokenizer.batch_decode( + generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False + ) + print("Response:", output_text[0][len(prompt):]) + + +[Optional] Float8 Dynamic Quantization + Semi-structured (2:4) sparsity +-------------------------------------------------------------------------- + +Torchao's sparsity support can be combined with quantization for additional performance gains, using optimized kernels for 2:4 structured sparsity. + +.. code-block:: python + + from torchao.quantization import Float8DynamicActivationFloat8SemiSparseWeightConfig + + # Combine sparsity with int4 quantization - both optimized by torchao + quant_config = Float8DynamicActivationFloat8SemiSparseWeightConfig() + quantization_config = TorchAoConfig(quant_type=quant_config) + + # Load a pre-sparsified checkpoint + model = AutoModelForCausalLM.from_pretrained( + "RedHatAI/Sparse-Llama-3.1-8B-2of4", # 2:4 sparse model + torch_dtype=torch.float16, + device_map="cuda", + quantization_config=quantization_config + ) + + tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct") + + # Use static KV cache for best performance with torchao optimizations + messages = [{"role": "user", "content": "What are the benefits of sparse neural networks?"}] + inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda") + + outputs = model.generate( + inputs, + max_new_tokens=150, + cache_implementation="static", # Optimized for torchao + do_sample=False + ) + + response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True) + print(response) + +.. note:: + For more information on supported quantization and sparsity configurations, see `HF-Torchao Docs `_. + +Inference with vLLM +------------------- + +.. code-block:: python + + from vllm import LLM, SamplingParams + + # Sample prompts. 
+ prompts = [ + "Hello, my name is", + "The president of the United States is", + "The capital of France is", + "The future of AI is", + ] + # Create a sampling params object. + sampling_params = SamplingParams(temperature=0.8, top_p=0.95) + + if __name__ == '__main__': + # Create an LLM. + llm = LLM(model="pytorch/Phi-4-mini-instruct-float8dq") + # Generate texts from the prompts. + # The output is a list of RequestOutput objects + # that contain the prompt, generated text, and other information. + outputs = llm.generate(prompts, sampling_params) + # Print the outputs. + print("\nGenerated Outputs:\n" + "-" * 60) + for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}") + print(f"Output: {generated_text!r}") + print("-" * 60) + +[Optional] Inference with Transformers +-------------------------------------- + +Install the required packages: + +.. code-block:: bash + + pip install git+https://github.com/huggingface/transformers@main + pip install torchao + pip install torch + pip install accelerate + +.. code-block:: python + + import torch + from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline + + torch.random.manual_seed(0) + + model_path = "pytorch/Phi-4-mini-instruct-float8dq" + + model = AutoModelForCausalLM.from_pretrained( + model_path, + device_map="auto", + torch_dtype="auto", + trust_remote_code=True, + ) + tokenizer = AutoTokenizer.from_pretrained(model_path) + + messages = [ + {"role": "system", "content": "You are a helpful AI assistant."}, + {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"}, + {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."}, + {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"}, + ] + + pipe = pipeline( + "text-generation", + model=model, + tokenizer=tokenizer, + ) + + generation_args = { + "max_new_tokens": 500, + "return_full_text": False, + "temperature": 0.0, + "do_sample": False, + } + + output = pipe(messages, **generation_args) + print(output[0]['generated_text']) + +Evaluation +########### + +Model Quality Assessment +------------------------ + +Evaluate quantized models using lm-evaluation-harness: + +.. code-block:: bash + + # Install evaluation framework + # Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install + + # Evaluate baseline model + lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8 + + # Evaluate torchao-quantized model (float8dq) + lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8 + +Memory Benchmarking +-------------------- + +.. code-block:: python + + import torch + from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig + + # use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-float8dq" + model_id = "pytorch/Phi-4-mini-instruct-float8dq" + quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16) + tokenizer = AutoTokenizer.from_pretrained(model_id) + + torch.cuda.reset_peak_memory_stats() + + prompt = "Hey, are you conscious? 
Can you talk to me?" + messages = [ + { + "role": "system", + "content": "", + }, + {"role": "user", "content": prompt}, + ] + templated_prompt = tokenizer.apply_chat_template( + messages, + tokenize=False, + add_generation_prompt=True, + ) + print("Prompt:", prompt) + print("Templated prompt:", templated_prompt) + inputs = tokenizer( + templated_prompt, + return_tensors="pt", + ).to("cuda") + generated_ids = quantized_model.generate(**inputs, max_new_tokens=128) + output_text = tokenizer.batch_decode( + generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False + ) + print("Response:", output_text[0][len(prompt):]) + + mem = torch.cuda.max_memory_reserved() / 1e9 + print(f"Peak Memory Usage: {mem:.02f} GB") + ++-------------------+---------------------+------------------------------+ +| Benchmark | Phi-4 mini-instruct | Phi-4-mini-instruct-float8dq | ++===================+=====================+==============================+ +| Peak Memory (GB) | 8.91 | 5.70 (36% reduction) | ++-------------------+---------------------+------------------------------+ + +Performance Benchmarking +------------------------------ + +**Latency Benchmarking**: +========================= + +.. code-block:: bash + + # baseline + python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1 + + # float8dq + VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1 + +**Serving Benchmarking**: +========================= + +We benchmarked the throughput in a serving environment. + +.. code-block:: bash + + # Setup: Get vllm source code + git clone git@github.com:vllm-project/vllm.git + + # Install vllm + VLLM_USE_PRECOMPILED=1 pip install --editable . + + # Run the benchmarks under vllm root folder: + + # Download sharegpt dataset: + wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json + + # Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks + # Note: you can change the number of prompts to be benchmarked with --num-prompts argument for benchmark_serving script. 
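+
+    # The "Server" command in each pair below must already be running (in a separate
+    # terminal) before the corresponding "Client" benchmark command is launched.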
+ + # For baseline + # Server: + vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3 + # Client: + python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1 + + # For float8dq + # Server: + VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3 + # Client: + python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1 + +**Results (H100 machine)**: +============================ + ++----------------------------+---------------------+------------------------------+ +| Benchmark | Phi-4-mini-instruct | Phi-4-mini-instruct-float8dq | ++============================+=====================+==============================+ +| latency (batch_size=1) | 1.64s | 1.41s (1.16x speedup) | ++----------------------------+---------------------+------------------------------+ +| latency (batch_size=128) | 3.1s | 2.72s (1.14x speedup) | ++----------------------------+---------------------+------------------------------+ +| serving (num_prompts=1) | 1.35 req/s | 1.57 req/s (1.16x speedup) | ++----------------------------+---------------------+------------------------------+ +| serving (num_prompts=1000) | 66.68 req/s | 80.53 req/s (1.21x speedup) | ++----------------------------+---------------------+------------------------------+ + +Serving +####### + +High-throughput Serving with vLLM +--------------------------------- + +vLLM automatically leverages torchao's optimized kernels when serving quantized models, providing significant throughput improvements. + +First, install vLLM with torchao support: + +.. code-block:: bash + + pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly + pip install torchao + +.. code-block:: bash + + vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3 + +Performance Breakdown +===================== + +When using vLLM with torchao: + +- **Float8 dynamic quantization**: Provides 36% VRAM reduction, 1.15x-1.2x speedup and little to no accuracy impact on H100 +- **Sparsity Support**: Semi-structured (2:4) sparsity for faster inference (see `Accelerating Neural Network Training with Semi-Structured (2:4) Sparsity `_ blog post) +- **KV Cache Quantization**: Enables long context inference with lower memory (see `KV Cache Quantization `_) + +.. note:: + For more information on vLLM Integration, please refer to the detailed guide :ref:`torchao_vllm_integration`. + + +Mobile Deployment with ExecuTorch +--------------------------------- + +ExecuTorch enables on-device inference using torchao's mobile-optimized quantization schemes. The 8da4w (8-bit dynamic activation, 4-bit weight) configuration is specifically designed for mobile deployment. + +Step 1: Untie Embedding Weights +=============================== + +We want to quantize the embedding and lm_head differently. Since those layers are tied, we first need to untie the model: + +.. 
code-block:: python + + from transformers import ( + AutoModelForCausalLM, + AutoProcessor, + AutoTokenizer, + ) + import torch + + model_id = "microsoft/Phi-4-mini-instruct" + untied_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto") + tokenizer = AutoTokenizer.from_pretrained(model_id) + + print(untied_model) + from transformers.modeling_utils import find_tied_parameters + print("tied weights:", find_tied_parameters(untied_model)) + if getattr(untied_model.config.get_text_config(decoder=True), "tie_word_embeddings"): + setattr(untied_model.config.get_text_config(decoder=True), "tie_word_embeddings", False) + + untied_model._tied_weights_keys = [] + untied_model.lm_head.weight = torch.nn.Parameter(untied_model.lm_head.weight.clone()) + + print("tied weights:", find_tied_parameters(untied_model)) + + USER_ID = "YOUR_USER_ID" + MODEL_NAME = model_id.split("/")[-1] + save_to = f"{USER_ID}/{MODEL_NAME}-untied-weights" + + untied_model.push_to_hub(save_to) + tokenizer.push_to_hub(save_to) + + # or save locally + save_to_local_path = f"{MODEL_NAME}-untied-weights" + untied_model.save_pretrained(save_to_local_path) + tokenizer.save_pretrained(save_to) + +Step 2: Create Mobile-Optimized Quantization +============================================ + +Quantizing the model for mobile deployment using TorchAO's **Int8DynamicActivationIntxWeightConfig** configuration: + +.. code-block:: python + + from transformers import ( + AutoModelForCausalLM, + AutoProcessor, + AutoTokenizer, + TorchAoConfig, + ) + from torchao.quantization.quant_api import ( + IntxWeightOnlyConfig, + Int8DynamicActivationIntxWeightConfig, + ModuleFqnToConfig, + quantize_, + ) + from torchao.quantization.granularity import PerGroup, PerAxis + import torch + + # we start from the model with untied weights + model_id = "microsoft/Phi-4-mini-instruct" + USER_ID = "YOUR_USER_ID" + MODEL_NAME = model_id.split("/")[-1] + untied_model_id = f"{USER_ID}/{MODEL_NAME}-untied-weights" + untied_model_local_path = f"{MODEL_NAME}-untied-weights" + + embedding_config = IntxWeightOnlyConfig( + weight_dtype=torch.int8, + granularity=PerAxis(0), + ) + linear_config = Int8DynamicActivationIntxWeightConfig( + weight_dtype=torch.int4, + weight_granularity=PerGroup(32), + weight_scale_dtype=torch.bfloat16, + ) + quant_config = ModuleFqnToConfig({"_default": linear_config, "model.embed_tokens": embedding_config}) + quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True, untie_embedding_weights=True, modules_to_not_convert=[]) + + # either use `untied_model_id` or `untied_model_local_path` + quantized_model = AutoModelForCausalLM.from_pretrained(untied_model_id, torch_dtype=torch.float32, device_map="auto", quantization_config=quantization_config) + tokenizer = AutoTokenizer.from_pretrained(model_id) + + # Push to hub + MODEL_NAME = model_id.split("/")[-1] + save_to = f"{USER_ID}/{MODEL_NAME}-8da4w" + quantized_model.push_to_hub(save_to, safe_serialization=False) + tokenizer.push_to_hub(save_to) + + # Manual testing + prompt = "Hey, are you conscious? Can you talk to me?" 
+ messages = [ + { + "role": "system", + "content": "", + }, + {"role": "user", "content": prompt}, + ] + templated_prompt = tokenizer.apply_chat_template( + messages, + tokenize=False, + add_generation_prompt=True, + ) + print("Prompt:", prompt) + print("Templated prompt:", templated_prompt) + inputs = tokenizer( + templated_prompt, + return_tensors="pt", + ).to("cuda") + generated_ids = quantized_model.generate(**inputs, max_new_tokens=128) + output_text = tokenizer.batch_decode( + generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False + ) + print("Response:", output_text[0][len(prompt):]) + + +Step 3: Export to ExecuTorch +============================ + +.. code-block:: bash + + # Install ExecuTorch + git clone https://github.com/pytorch/executorch.git + cd executorch + ./install_requirements.sh + + # Convert checkpoint format for ExecuTorch + python -m executorch.examples.models.phi_4_mini.convert_weights pytorch_model.bin pytorch_model_converted.bin + + # Export to PTE format with torchao optimizations preserved + PARAMS="executorch/examples/models/phi_4_mini/config.json" + python -m executorch.examples.models.llama.export_llama \ + --model "phi_4_mini" \ + --checkpoint "pytorch_model_converted.bin" \ + --params "$PARAMS" \ + -kv \ + --use_sdpa_with_kv_cache \ + -X \ + --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}' \ + --max_seq_length 128 \ + --max_context_length 128 \ + --output_name="phi4-mini-8da4w.pte" + + +Mobile Performance Characteristics +==================================== + +The torchao-optimized 8da4w model provides: + +- **Memory**: ~3.2GB on iPhone 15 Pro +- **Speed**: ~17 tokens/sec on iPhone 15 Pro +- **Accuracy**: Maintained within 5-10% of original model on most benchmarks + +.. note:: + For detailed instructions on testing the executorch model and reproducing benchmarks please refer to the `HF Phi-4-mini-instruct-8da4w model `_. + +**Conclusion** +============== + +This tutorial demonstrated how torchao's quantization and sparsity techniques integrate seamlessly across the entire ML deployment stack: + +- **HuggingFace Transformers** provides easy model loading with torchao quantization +- **vLLM** leverages torchao's optimized kernels for high-throughput serving +- **ExecuTorch** enables mobile deployment with torchao's mobile-optimized schemes +- **lm-evaluation-harness** provides model quality assessment + +All these frameworks use torchao as the underlying optimization engine, ensuring consistent performance gains and ease of integration. The quantization techniques shown provide significant memory reduction (3-4x) and performance improvements (1.5-2x) while maintaining model quality within acceptable bounds for most applications. + +For production deployments, always benchmark on your specific use case and hardware to validate the performance and accuracy trade-offs.
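+
+As a starting point for such a check, a rough latency measurement can be taken directly with ``transformers`` before moving to the vLLM benchmark scripts. The snippet below is a minimal sketch rather than a rigorous benchmark: the model id and token counts are illustrative, and it assumes a single CUDA device.
+
+.. code-block:: python
+
+    import time
+
+    import torch
+    from transformers import AutoModelForCausalLM, AutoTokenizer
+
+    model_id = "pytorch/Phi-4-mini-instruct-float8dq"
+    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
+    tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+    inputs = tokenizer("The future of AI is", return_tensors="pt").to("cuda")
+
+    # Warm-up run so CUDA initialization is not counted in the timing
+    model.generate(**inputs, max_new_tokens=8)
+
+    torch.cuda.synchronize()
+    start = time.perf_counter()
+    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
+    torch.cuda.synchronize()
+    elapsed = time.perf_counter() - start
+
+    new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
+    print(f"Generated {new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tok/s)")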