
Commit 82231d6

Merge pull request #1106 from aws-neuron/minor-post221-cps

Misc documentation updates/fixes

2 parents: c797fce + ba1c08b

17 files changed: +878 additions, -120 deletions

general/models/inference-inf2-trn1-samples.rst
1 addition, 0 deletions

@@ -103,6 +103,7 @@ Decoders
   * - meta-llama/Llama-3.1-405b
     - neuronx-distributed-inference
     - * :ref:`Tutorial for deploying Llama-3.1-405B on Trn2 <nxdi-trn2-llama3.1-405b-tutorial>`
+      * :ref:`nxdi-trn2-llama3.1-405b-speculative-tutorial`

   * - meta-llama/Llama-3.1-405b
     - transformers-neuronx

libraries/nxd-inference/developer_guides/dev-guide.txt
2 additions, 0 deletions

@@ -4,3 +4,5 @@
 * :ref:`nxdi-vllm-user-guide`
 * :ref:`nxd-examples-migration-guide`
 * :ref:`nxdi_migrate_from_tnx`
+* :ref:`llm-inference-benchmarking`
+

libraries/nxd-inference/developer_guides/feature-guide.rst
7 additions, 7 deletions

@@ -466,7 +466,7 @@ smaller *draft* LLM model predicts the next tokens, and the larger *target*
 LLM model verifies those predictions. NxD Inference supports
 the following speculative decoding implementations:

-1. :ref:`Vanilla speculative decoding<nxd-vanilla-speculative-decoding>`,
+1. :ref:`Speculative decoding with a draft model <nxd-vanilla-speculative-decoding>`,
    where a separate draft model predicts the next *n* tokens for the target
    model. Each model is compiled independently.
 2. :ref:`Medusa speculative decoding<nxd-medusa-speculative-decoding>`,
@@ -479,17 +479,17 @@ the following speculative decoding implementations:

 .. _nxd-vanilla-speculative-decoding:

-Vanilla Speculative Decoding
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Speculative Decoding with a Draft model
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-To use vanilla speculative decoding, you configure, compile, and load a
-draft model in addition to the main target model. To enable vanilla
-speculative decoding, set ``speculation_length`` and
+To use speculative decoding with a draft model, you configure, compile, and load a
+draft model in addition to the main target model. To enable
+speculative decoding with a draft model, set ``speculation_length`` and
 ``trace_tokengen_model=False`` in the target model's NeuronConfig. The
 draft model's NeuronConfig should use the same configuration but with
 these additional attributes reset to their defaults.

-Vanilla speculative decoding currently supports only batch sizes of 1.
+Speculative decoding with a draft model currently supports only batch sizes of 1.

 .. _example-2:


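The feature-guide change above describes enabling speculative decoding by setting ``speculation_length`` and ``trace_tokengen_model=False`` on the target model's NeuronConfig, while the draft model keeps those attributes at their defaults. A minimal sketch of that pairing, using an illustrative stand-in dataclass (the real ``NeuronConfig`` comes from ``neuronx-distributed-inference`` and has many more fields; only the two attributes named in the guide are taken from the source):

```python
from dataclasses import dataclass

@dataclass
class NeuronConfig:
    # Illustrative stand-in for NxD Inference's NeuronConfig; only the
    # attributes named in the feature guide are modeled here.
    batch_size: int = 1
    speculation_length: int = 0       # 0 means speculative decoding is off
    trace_tokengen_model: bool = True

# Target model: enable speculative decoding. Per the guide, this mode
# currently supports only batch sizes of 1.
target_config = NeuronConfig(batch_size=1,
                             speculation_length=5,
                             trace_tokengen_model=False)

# Draft model: same base configuration, but the speculative-decoding
# attributes stay at their defaults.
draft_config = NeuronConfig(batch_size=target_config.batch_size)

print(target_config)
print(draft_config)
```

The draft value of 5 for ``speculation_length`` is an arbitrary example; the appropriate value depends on the draft model's accuracy and the workload.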
libraries/nxd-inference/developer_guides/index.rst
2 additions, 0 deletions

@@ -14,6 +14,8 @@ Developer Guides
    /libraries/nxd-inference/developer_guides/vllm-user-guide
    /libraries/nxd-inference/developer_guides/nxd-examples-migration-guide
    /libraries/nxd-inference/developer_guides/migrate-from-tnx-to-nxdi
+   /libraries/nxd-inference/developer_guides/llm-inference-benchmarking-guide
+


 Use the NxD Inference (``neuronx-distributed-inference``) Developer Guides to learn how to use NxD Inference.
(new file) 77 additions, 0 deletions

.. _llm-inference-benchmarking:

LLM Inference Benchmarking guide
================================

This guide gives an overview of the metrics tracked for LLM inference and guidelines for using the LLMPerf library
to benchmark LLM inference.

.. contents:: Table of contents
   :local:
   :depth: 2

.. _llm_inference_metrics:

LLM Inference metrics
---------------------

The following are the essential metrics for monitoring LLM inference server performance.

.. list-table::
   :widths: 20 70
   :header-rows: 1
   :align: left
   :class: table-smaller-font-size

   * - Metric
     - Description

   * - Time To First Token (TTFT)
     - Average time taken for the LLM to process the prompt and return the first output token to the user. This is typically measured in milliseconds.

   * - Time per Output Token (TPOT)
     - Average time taken for the LLM to generate an output token for an inference request. This is typically measured in milliseconds. This metric is also referred to as Inter Token Latency (ITL) or Per Token Latency (PTL).

   * - End-to-End Response Latency
     - Time taken for the LLM to generate the entire response, including all output tokens. This metric is computed as
       end-to-end latency = (TTFT) + (TPOT) * (number of output tokens).

   * - Output Token Throughput
     - Number of output tokens generated per second by the inference server across all concurrent users and requests.

.. _llm_perf_patch_changes:

Using LLMPerf to benchmark LLM Inference performance
----------------------------------------------------

`LLMPerf <https://github.com/ray-project/llmperf>`_ is an open source library for benchmarking LLM inference performance. However, a few changes need to be applied to LLMPerf
to accurately benchmark and reproduce the metrics that Neuron publishes.

All the changes outlined below are provided as a patch file that you can download and apply.
We will work on upstreaming these changes to public LLMPerf in the future.

Using the relevant HF tokenizer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Public LLMPerf uses the ``hf-internal-testing`` tokenizer by default for all models, which can affect the accuracy of the reported performance.
The patch instead passes the Hugging Face tokenizer config of the model being benchmarked for Neuron.

Excluding TTFT from the TPOT calculation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

LLMPerf includes TTFT in the Time per Output Token (or Inter Token Latency) calculation. Because TPOT and TTFT are two different metrics, the patch changes LLMPerf
to exclude TTFT from the TPOT calculation, consistent with how other industry-standard performance benchmarks are computed.

Follow these instructions to apply the patch to the LLMPerf library.

* Step 1: Get the Neuron git patch file

  Download the ``neuron_perf.patch`` :download:`file </src/benchmark/helper_scripts/neuron_perf.patch>` into the ``llmperf`` directory.

* Step 2: Apply the git patch

  Run ``git apply neuron_perf.patch``. Confirm the changes with ``git diff``.
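The metric relationships in the benchmarking guide above can be expressed as a short, self-contained sketch. The function names and the example numbers are illustrative, not part of any Neuron or LLMPerf API; the formulas follow the guide's definitions (end-to-end latency = TTFT + TPOT * number of output tokens, with TTFT excluded from TPOT):

```python
def end_to_end_latency_ms(ttft_ms: float, tpot_ms: float, n_output_tokens: int) -> float:
    """End-to-end response latency = TTFT + TPOT * (number of output tokens)."""
    return ttft_ms + tpot_ms * n_output_tokens

def tpot_excluding_ttft_ms(e2e_ms: float, ttft_ms: float, n_output_tokens: int) -> float:
    """TPOT with TTFT excluded, as the Neuron patch computes it: prompt
    processing time is subtracted before averaging over output tokens."""
    return (e2e_ms - ttft_ms) / n_output_tokens

# Illustrative numbers: TTFT of 2442 ms and TPOT of 37.9 ms over 100 tokens.
e2e = end_to_end_latency_ms(ttft_ms=2442.0, tpot_ms=37.9, n_output_tokens=100)
print(f"{e2e:.1f}")                                         # 6232.0
print(f"{tpot_excluding_ttft_ms(e2e, 2442.0, 100):.1f}")    # 37.9
```

The second call shows the round trip: recovering TPOT from an end-to-end measurement once TTFT is excluded, which is the correction the LLMPerf patch applies.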

libraries/nxd-inference/misc.rst
4 additions, 3 deletions

@@ -1,11 +1,12 @@
 .. _nxdi-misc-index:

 NxD Inference Misc
-==========
+===================

 .. toctree::
    :maxdepth: 1

-   /release-notes/neuronx-distributed-inference/neuronx-distributed-inference-rn
+   /release-notes/neuronx-distributed-inference/neuronx-distributed-inference

-* :ref:`neuronx-distributed-inference-rn`
+* :ref:`neuronx-distributed-inference-rn`
+
libraries/nxd-inference/tutorials/index.rst
3 additions, 0 deletions

@@ -13,6 +13,9 @@ Tutorials
    /libraries/nxd-inference/tutorials/trn2-llama3.1-405b-tutorial
    /libraries/nxd-inference/tutorials/llama3.2-multimodal-tutorial
    /libraries/nxd-inference/tutorials/trn2-llama3.3-70b-tutorial
+   /libraries/nxd-inference/tutorials/trn2-llama3.1-405b-speculative-tutorial.rst
+   /libraries/nxd-inference/tutorials/run_llmperf.rst
+

(new file) 3 additions, 0 deletions

@@ -0,0 +1,3 @@
+Scenario (all using BF16),TTFT (P50 in ms),TPOT (P50 in ms),Output token Throughput (per second)
+No speculative decoding,2442,37.9,25.46
+Fused speculative decoding + rescaled weights (Llama 3.2 1B Draft),2255,8.27,102.41
2 additions, 2 deletions

@@ -1,3 +1,3 @@
 Scenario (all using BF16),TTFT (P50 in ms),TPOT (P50 in ms),Output token Throughput (per second)
-No speculative decoding,575.6,19.8,47
-Fused speculative decoding (Llama 3.2 1B Draft),612.4,5.6,143
+No speculative decoding,814.2,19.6,36
+Fused speculative decoding (Llama 3.2 1B Draft),870.1,5.3,144
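Benchmark CSVs like the one above can be summarized with a few lines of standard-library Python. This is a sketch, not part of the commit; the data string simply copies the updated rows from the diff, and the derived ratios (per-token speedup and throughput gain from fused speculative decoding) follow from them:

```python
import csv
import io

# Rows copied verbatim from the updated benchmark CSV above.
data = """Scenario (all using BF16),TTFT (P50 in ms),TPOT (P50 in ms),Output token Throughput (per second)
No speculative decoding,814.2,19.6,36
Fused speculative decoding (Llama 3.2 1B Draft),870.1,5.3,144
"""

baseline, fused = list(csv.DictReader(io.StringIO(data)))

# Per-token latency speedup and total throughput gain from speculation.
tpot_speedup = float(baseline["TPOT (P50 in ms)"]) / float(fused["TPOT (P50 in ms)"])
throughput_gain = (float(fused["Output token Throughput (per second)"])
                   / float(baseline["Output token Throughput (per second)"]))

print(f"TPOT speedup: {tpot_speedup:.1f}x")        # TPOT speedup: 3.7x
print(f"Throughput gain: {throughput_gain:.1f}x")  # Throughput gain: 4.0x
```

Note that TTFT rises slightly under fused speculative decoding (870.1 ms vs 814.2 ms) while per-token latency and overall throughput improve substantially; that trade-off is visible directly in the two rows.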
