
Commit 82231d6

Merge pull request #1106 from aws-neuron/minor-post221-cps

Misc documentation updates/fixes

2 parents: c797fce + ba1c08b

17 files changed: +878 additions, -120 deletions

general/models/inference-inf2-trn1-samples.rst
1 addition, 0 deletions

@@ -103,6 +103,7 @@ Decoders
   * - meta-llama/Llama-3.1-405b
     - neuronx-distributed-inference
     - * :ref:`Tutorial for deploying Llama-3.1-405B on Trn2 <nxdi-trn2-llama3.1-405b-tutorial>`
+      * :ref:`nxdi-trn2-llama3.1-405b-speculative-tutorial`

   * - meta-llama/Llama-3.1-405b
     - transformers-neuronx

libraries/nxd-inference/developer_guides/dev-guide.txt
2 additions, 0 deletions

@@ -4,3 +4,5 @@
 * :ref:`nxdi-vllm-user-guide`
 * :ref:`nxd-examples-migration-guide`
 * :ref:`nxdi_migrate_from_tnx`
+* :ref:`llm-inference-benchmarking`
+

libraries/nxd-inference/developer_guides/feature-guide.rst
7 additions, 7 deletions

@@ -466,7 +466,7 @@ smaller *draft* LLM model predicts the next tokens, and the larger *target*
 LLM model verifies those predictions. NxD Inference supports
 the following speculative decoding implementations:

-1. :ref:`Vanilla speculative decoding<nxd-vanilla-speculative-decoding>`,
+1. :ref:`Speculative decoding with a draft model <nxd-vanilla-speculative-decoding>`,
    where a separate draft model predicts the next *n* tokens for the target
    model. Each model is compiled independently.
 2. :ref:`Medusa speculative decoding<nxd-medusa-speculative-decoding>`,
@@ -479,17 +479,17 @@ the following speculative decoding implementations:

 .. _nxd-vanilla-speculative-decoding:

-Vanilla Speculative Decoding
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Speculative Decoding with a Draft model
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-To use vanilla speculative decoding, you configure, compile, and load a
-draft model in addition to the main target model. To enable vanilla
-speculative decoding, set ``speculation_length`` and
+To use speculative decoding with a draft model, you configure, compile, and load a
+draft model in addition to the main target model. To enable
+speculative decoding with a draft model, set ``speculation_length`` and
 ``trace_tokengen_model=False`` in the target model's NeuronConfig. The
 draft model's NeuronConfig should use the same configuration but with
 these additional attributes reset to their defaults.

-Vanilla speculative decoding currently supports only batch sizes of 1.
+Speculative decoding with a draft model currently supports only batch sizes of 1.

 .. _example-2:


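The feature-guide change above describes enabling speculative decoding by setting ``speculation_length`` and ``trace_tokengen_model=False`` on the target model's NeuronConfig, while the draft model keeps those attributes at their defaults. A minimal sketch of that pairing, using an illustrative stand-in dataclass (the real ``NeuronConfig`` comes from ``neuronx-distributed-inference`` and has many more fields; only the two attributes named in the guide are taken from the source):

```python
from dataclasses import dataclass

@dataclass
class NeuronConfig:
    # Illustrative stand-in for NxD Inference's NeuronConfig; only the
    # attributes named in the feature guide are modeled here.
    batch_size: int = 1
    speculation_length: int = 0       # 0 means speculative decoding is off
    trace_tokengen_model: bool = True

# Target model: enable speculative decoding. Per the guide, this mode
# currently supports only batch sizes of 1.
target_config = NeuronConfig(batch_size=1,
                             speculation_length=5,
                             trace_tokengen_model=False)

# Draft model: same base configuration, but the speculative-decoding
# attributes stay at their defaults.
draft_config = NeuronConfig(batch_size=target_config.batch_size)

print(target_config)
print(draft_config)
```

The draft value of 5 for ``speculation_length`` is an arbitrary example; the appropriate value depends on the draft model's accuracy and the workload.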
libraries/nxd-inference/developer_guides/index.rst
2 additions, 0 deletions

@@ -14,6 +14,8 @@ Developer Guides
    /libraries/nxd-inference/developer_guides/vllm-user-guide
    /libraries/nxd-inference/developer_guides/nxd-examples-migration-guide
    /libraries/nxd-inference/developer_guides/migrate-from-tnx-to-nxdi
+   /libraries/nxd-inference/developer_guides/llm-inference-benchmarking-guide
+


 Use the NxD Inference (``neuronx-distributed-inference``) Developer Guides to learn how to use NxD Inference.
(new file) 77 additions, 0 deletions

.. _llm-inference-benchmarking:

LLM Inference Benchmarking guide
================================

This guide gives an overview of the metrics tracked for LLM inference and guidelines for using the LLMPerf library
to benchmark LLM inference.

.. contents:: Table of contents
   :local:
   :depth: 2

.. _llm_inference_metrics:

LLM Inference metrics
---------------------

The following are the essential metrics for monitoring LLM inference server performance.

.. list-table::
   :widths: 20 70
   :header-rows: 1
   :align: left
   :class: table-smaller-font-size

   * - Metric
     - Description

   * - Time To First Token (TTFT)
     - Average time taken for the LLM to process the prompt and return the first output token to the user. This is typically measured in milliseconds.

   * - Time per Output Token (TPOT)
     - Average time taken for the LLM to generate an output token for an inference request. This is typically measured in milliseconds. This metric is also referred to as Inter Token Latency (ITL) or Per Token Latency (PTL).

   * - End-to-End Response Latency
     - Time taken for the LLM to generate the entire response, including all output tokens. This metric is computed as
       end-to-end latency = (TTFT) + (TPOT) * (number of output tokens).

   * - Output Token Throughput
     - Number of output tokens generated per second by the inference server across all concurrent users and requests.

.. _llm_perf_patch_changes:

Using LLMPerf to benchmark LLM Inference performance
----------------------------------------------------

`LLMPerf <https://github.com/ray-project/llmperf>`_ is an open source library for benchmarking LLM inference performance. However, a few changes need to be applied to LLMPerf
to accurately benchmark and reproduce the metrics that Neuron publishes.

All the changes outlined below are provided as a patch file that you can download and apply.
We will work on upstreaming these changes to public LLMPerf in the future.

Using the relevant HF tokenizer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Public LLMPerf uses the ``hf-internal-testing`` tokenizer by default for all models, which can affect the accuracy of the reported performance.
The patch instead passes the Hugging Face tokenizer config of the model being benchmarked for Neuron.

Excluding TTFT from the TPOT calculation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

LLMPerf includes TTFT in the Time per Output Token (or Inter Token Latency) calculation. Because TPOT and TTFT are two different metrics, the patch changes LLMPerf
to exclude TTFT from the TPOT calculation, consistent with how other industry-standard performance benchmarks are computed.

Follow these instructions to apply the patch to the LLMPerf library.

* Step 1: Get the Neuron git patch file

  Download the ``neuron_perf.patch`` :download:`file </src/benchmark/helper_scripts/neuron_perf.patch>` into the ``llmperf`` directory.

* Step 2: Apply the git patch

  Run ``git apply neuron_perf.patch``. Confirm the changes with ``git diff``.
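The metric relationships in the benchmarking guide above can be expressed as a short, self-contained sketch. The function names and the example numbers are illustrative, not part of any Neuron or LLMPerf API; the formulas follow the guide's definitions (end-to-end latency = TTFT + TPOT * number of output tokens, with TTFT excluded from TPOT):

```python
def end_to_end_latency_ms(ttft_ms: float, tpot_ms: float, n_output_tokens: int) -> float:
    """End-to-end response latency = TTFT + TPOT * (number of output tokens)."""
    return ttft_ms + tpot_ms * n_output_tokens

def tpot_excluding_ttft_ms(e2e_ms: float, ttft_ms: float, n_output_tokens: int) -> float:
    """TPOT with TTFT excluded, as the Neuron patch computes it: prompt
    processing time is subtracted before averaging over output tokens."""
    return (e2e_ms - ttft_ms) / n_output_tokens

# Illustrative numbers: TTFT of 2442 ms and TPOT of 37.9 ms over 100 tokens.
e2e = end_to_end_latency_ms(ttft_ms=2442.0, tpot_ms=37.9, n_output_tokens=100)
print(f"{e2e:.1f}")                                         # 6232.0
print(f"{tpot_excluding_ttft_ms(e2e, 2442.0, 100):.1f}")    # 37.9
```

The second call shows the round trip: recovering TPOT from an end-to-end measurement once TTFT is excluded, which is the correction the LLMPerf patch applies.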

libraries/nxd-inference/misc.rst
4 additions, 3 deletions

@@ -1,11 +1,12 @@
 .. _nxdi-misc-index:

 NxD Inference Misc
-==========
+===================

 .. toctree::
    :maxdepth: 1

-   /release-notes/neuronx-distributed-inference/neuronx-distributed-inference-rn
+   /release-notes/neuronx-distributed-inference/neuronx-distributed-inference

-* :ref:`neuronx-distributed-inference-rn`
+* :ref:`neuronx-distributed-inference-rn`
+
libraries/nxd-inference/tutorials/index.rst
3 additions, 0 deletions

@@ -13,6 +13,9 @@ Tutorials
    /libraries/nxd-inference/tutorials/trn2-llama3.1-405b-tutorial
    /libraries/nxd-inference/tutorials/llama3.2-multimodal-tutorial
    /libraries/nxd-inference/tutorials/trn2-llama3.3-70b-tutorial
+   /libraries/nxd-inference/tutorials/trn2-llama3.1-405b-speculative-tutorial.rst
+   /libraries/nxd-inference/tutorials/run_llmperf.rst
+

(new file) 3 additions, 0 deletions

@@ -0,0 +1,3 @@
+Scenario (all using BF16),TTFT (P50 in ms),TPOT (P50 in ms),Output token Throughput (per second)
+No speculative decoding,2442,37.9,25.46
+Fused speculative decoding + rescaled weights (Llama 3.2 1B Draft),2255,8.27,102.41
2 additions, 2 deletions

@@ -1,3 +1,3 @@
 Scenario (all using BF16),TTFT (P50 in ms),TPOT (P50 in ms),Output token Throughput (per second)
-No speculative decoding,575.6,19.8,47
-Fused speculative decoding (Llama 3.2 1B Draft),612.4,5.6,143
+No speculative decoding,814.2,19.6,36
+Fused speculative decoding (Llama 3.2 1B Draft),870.1,5.3,144
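Benchmark CSVs like the one above can be summarized with a few lines of standard-library Python. This is a sketch, not part of the commit; the data string simply copies the updated rows from the diff, and the derived ratios (per-token speedup and throughput gain from fused speculative decoding) follow from them:

```python
import csv
import io

# Rows copied verbatim from the updated benchmark CSV above.
data = """Scenario (all using BF16),TTFT (P50 in ms),TPOT (P50 in ms),Output token Throughput (per second)
No speculative decoding,814.2,19.6,36
Fused speculative decoding (Llama 3.2 1B Draft),870.1,5.3,144
"""

baseline, fused = list(csv.DictReader(io.StringIO(data)))

# Per-token latency speedup and total throughput gain from speculation.
tpot_speedup = float(baseline["TPOT (P50 in ms)"]) / float(fused["TPOT (P50 in ms)"])
throughput_gain = (float(fused["Output token Throughput (per second)"])
                   / float(baseline["Output token Throughput (per second)"]))

print(f"TPOT speedup: {tpot_speedup:.1f}x")        # TPOT speedup: 3.7x
print(f"Throughput gain: {throughput_gain:.1f}x")  # Throughput gain: 4.0x
```

Note that TTFT rises slightly under fused speculative decoding (870.1 ms vs 814.2 ms) while per-token latency and overall throughput improve substantially; that trade-off is visible directly in the two rows.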
