Commit 215b421

Neuron SDK Release 2.19.0 (#919)
Neuron SDK Release 2.19.0 - Release Notes
1 parent 78169c6 commit 215b421

File tree

103 files changed

+2682
-596
lines changed

CODEOWNERS

Lines changed: 10 additions & 10 deletions
@@ -8,18 +8,18 @@
88
# review when someone opens a pull request.
99
# * @global-owner1 @global-owner2
1010

11-
* @aws-maens @aws-mesharma @rgrandhiamzn
11+
* @aws-maens @micwade-aws @musunita @aws-sadaf @natemail-aws @rgrandhiamzn @eshalakhotia @jluntamazon @jeffhataws @aws-rhsoln @hannanjgaws @aws-trsharma @PrashantSaraf @shadis @aws-donkrets @aws-singhada @gsnaws @awsjoshir @sidjoshiaws @pinak-p @vikas-paliwal-aws
1212

13-
src/examples/mxnet/ @aws-rhsoln @aws-sadaf @aws-maens
14-
neuron-guide/neuron-frameworks/mxnet-neuron/ @aws-rhsoln @aws-maens
15-
neuron-guide/neuron-frameworks/mxnet-neuron/tutorials/ @kct22aws @musunita @aws-rhsoln @aws-maens
13+
src/examples/mxnet/ @aws-rhsoln @aws-sadaf @aws-maens @vikas-paliwal-aws @rgrandhiamzn @eshalakhotia
14+
neuron-guide/neuron-frameworks/mxnet-neuron/ @aws-rhsoln @aws-maens @vikas-paliwal-aws @rgrandhiamzn @eshalakhotia
15+
neuron-guide/neuron-frameworks/mxnet-neuron/tutorials/ @musunita @aws-rhsoln @aws-maens @vikas-paliwal-aws @rgrandhiamzn @eshalakhotia
1616

17-
src/examples/tensorflow/ @awshaichen @aws-sadaf @aws-maens
18-
neuron-guide/neuron-frameworks/tensorflow-neuron/ @awshaichen @aws-maens
19-
neuron-guide/neuron-frameworks/tensorflow-neuron/tutorials/ @kct22aws @musunita @awshaichen @aws-maens
17+
src/examples/tensorflow/ @awshaichen @aws-sadaf @aws-maens @vikas-paliwal-aws @rgrandhiamzn @eshalakhotia
18+
neuron-guide/neuron-frameworks/tensorflow-neuron/ @awshaichen @aws-maens @vikas-paliwal-aws @rgrandhiamzn @eshalakhotia
19+
neuron-guide/neuron-frameworks/tensorflow-neuron/tutorials/ @musunita @awshaichen @aws-maens @vikas-paliwal-aws @rgrandhiamzn @eshalakhotia
2020

2121

22-
src/examples/pytorch/ @jluntamazon @aws-sadaf @aws-maens
23-
neuron-guide/neuron-frameworks/pytorch-neuron/ @jluntamazon @aws-maens
24-
neuron-guide/neuron-frameworks/pytorch-neuron/tutorials/ @kct22aws @musunita @jluntamazon @aws-maens
22+
src/examples/pytorch/ @jluntamazon @aws-sadaf @aws-maens @vikas-paliwal-aws @rgrandhiamzn @eshalakhotia
23+
neuron-guide/neuron-frameworks/pytorch-neuron/ @jluntamazon @aws-maens @vikas-paliwal-aws @rgrandhiamzn @eshalakhotia
24+
neuron-guide/neuron-frameworks/pytorch-neuron/tutorials/ @musunita @jluntamazon @aws-maens @vikas-paliwal-aws @rgrandhiamzn @eshalakhotia
2525

conf.py

Lines changed: 7 additions & 4 deletions
@@ -81,6 +81,7 @@
8181
'sphinx.ext.autodoc',
8282
'local_documenter',
8383
'archive',
84+
"sphinx_copybutton",
8485
]
8586

8687

@@ -97,6 +98,10 @@
9798
exclude_patterns = ['_build','**.ipynb_checkpoints','.venv']
9899
html_extra_path = ['static']
99100

101+
# remove bash/python/ipython/jupyter prompts and continuations
102+
copybutton_prompt_text = r">>> |\.\.\. |\$ |In \[\d*\]: | {2,5}\.\.\.: | {5,8}: "
103+
copybutton_prompt_is_regexp = True
104+
100105
# nbsphinx_allow_errors = True
101106
nbsphinx_execute = 'never'
102107

@@ -141,9 +146,7 @@
141146

142147
#top_banner_message="<span>&#9888;</span><a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/setup-troubleshooting.html#gpg-key-update'> Neuron repository GPG key for Ubuntu installation has expired, see instructions how to update! </a>"
143148

144-
145-
top_banner_message="Neuron 2.18.2 is released! check <a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#latest-neuron-release'> What's New </a> and <a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/announcements/index.html'> Announcements </a>"
146-
149+
top_banner_message="Neuron 2.19.0 is released! check <a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#latest-neuron-release'> What's New </a> and <a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/announcements/index.html'> Announcements </a>"
147150

148151
html_theme = "sphinx_book_theme"
149152
html_theme_options = {
@@ -234,6 +237,6 @@
234237
,r'https://github.com/awslabs/multi-model-server/blob/master/docs/management_api.md',r'https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/training/dp_bert_hf_pretrain/run_dp_bert_large_hf_pretrain_bf16_s128.sh',r' https://github.com/pytorch/xla/blob/master/test/test_train_mp_mnist.py',r'https://github.com/pytorch/xla/blob/v1.10.0/TROUBLESHOOTING.md'
235238
,r'https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/saved_model.md',r'https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/g3doc/index.md',r'https://github.com/pytorch/xla/blob/master/test/test_train_mp_mnist.py',r'https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb'
236239
,r'https://github.com/aws-neuron/aws-neuron-sdk/blob/master/src/examples/pytorch/torch-neuronx/t5-inference-tutorial.ipynb',r'https://github.com/aws-neuron/aws-neuron-parallelcluster-samples/blob/master/examples/jobs/neuronx-nemo-megatron-llamav2-job.md',r'https://github.com/pytorch/PiPPy/blob/main/pippy/IR.py#L697', r'https://github.com/pytorch/pytorch/blob/main/torch/fx/_symbolic_trace.py#L241', r'https://github.com/pytorch/xla/blob/master/torch_xla/utils/checkpoint.py#L129', r'https://github.com/aws-neuron/neuronx-distributed/blob/main/src/neuronx_distributed/parallel_layers/layer_norm.py#L32', r'https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain.py#L273C1-L289C55'
237-
,r'https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-install.html#pytorch-neuronx-install',r'https://github.com/google-research/bert#user-content-pre-trained-models',r'https://github.com/google-research/bert#user-content-sentence-and-sentence-pair-classification-tasks', r'https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-retirement.html', r'https://repost.aws/knowledge-center/eventbridge-notification-scheduled-events', r'https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/modeling_gpt_neox_nxd.py',r'https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain.py']
240+
,r'https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-install.html#pytorch-neuronx-install',r'https://github.com/google-research/bert#user-content-pre-trained-models',r'https://github.com/google-research/bert#user-content-sentence-and-sentence-pair-classification-tasks', r'https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-retirement.html', r'https://repost.aws/knowledge-center/eventbridge-notification-scheduled-events', r'https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/modeling_gpt_neox_nxd.py',r'https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain.py',r'https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/llama-3-8b-32k-sampling.ipynb']
238241
linkcheck_exclude_documents = [r'src/examples/.*', 'general/announcements/neuron1.x/announcements', r'release-notes/.*',r'containers/.*',r'general/.*']
239242
nitpicky = True
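
The `copybutton_prompt_text` pattern added to conf.py above is easier to review against concrete lines. The following is a minimal Python sketch, not part of the commit. It assumes sphinx-copybutton applies the pattern at the start of each code-block line and copies only the text after the match; `strip_prompt` is a hypothetical helper name used for illustration.

```python
import re

# Pattern copied verbatim from the conf.py hunk above.
COPYBUTTON_PROMPT_TEXT = r">>> |\.\.\. |\$ |In \[\d*\]: | {2,5}\.\.\.: | {5,8}: "

# Assumption: the pattern is anchored at the start of each line.
_ANCHORED = re.compile(r"^(" + COPYBUTTON_PROMPT_TEXT + r")(.*)")


def strip_prompt(line):
    """Return the line without a leading prompt, or unchanged if none matches."""
    match = _ANCHORED.match(line)
    return match.group(2) if match else line


if __name__ == "__main__":
    for sample in [
        "$ kubectl get pods",   # bash prompt
        ">>> import torch",     # Python prompt
        "In [3]: x = 1",        # IPython input prompt
        "   ...: y = 2",        # IPython continuation
        "neuron-ls",            # no prompt: left as-is
    ]:
        print(repr(strip_prompt(sample)))
```

A dollar sign that is not followed by a space (for example `$USER` at the start of a line) does not match the `\$ ` alternative, so shell variables at the start of a line are left intact.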

containers/getting-started.txt

Lines changed: 41 additions & 42 deletions
@@ -60,7 +60,7 @@
6060
sudo yum install -y docker.io
6161
sudo usermod -aG docker $USER
6262

63-
Logout and log back in to refresh membership.
63+
Logout and log back in to refresh membership.
6464

6565
.. dropdown:: Verify Docker
6666
:class-title: sphinx-design-class-title-small
@@ -97,32 +97,32 @@
9797
https://docs.docker.com/get-started/
9898

9999
.. dropdown:: Verify Neuron Component
100-
:class-title: sphinx-design-class-title-small
101-
:class-body: sphinx-design-class-body-small
102-
:animate: fade-in
100+
:class-title: sphinx-design-class-title-small
101+
:class-body: sphinx-design-class-body-small
102+
:animate: fade-in
103103

104-
Once the environment is setup, a container can be started with
105-
--device=/dev/neuron# to specify desired set of Inferentia/Trainium devices to be
106-
exposed to the container. To find out the available neuron devices on
107-
your instance, use the command ``ls /dev/neuron*``.
104+
Once the environment is setup, a container can be started with
105+
--device=/dev/neuron# to specify desired set of Inferentia/Trainium devices to be
106+
exposed to the container. To find out the available neuron devices on
107+
your instance, use the command ``ls /dev/neuron*``.
108108

109-
When running neuron-ls inside a container, you will only see the set of
110-
exposed Trainiums. For example:
109+
When running neuron-ls inside a container, you will only see the set of
110+
exposed Trainiums. For example:
111111

112-
.. code:: bash
112+
.. code:: bash
113113

114-
docker run --device=/dev/neuron0 neuron-test neuron-ls
114+
docker run --device=/dev/neuron0 neuron-test neuron-ls
115115

116-
Would produce the following output in trn1.32xlarge:
116+
Would produce the following output in trn1.32xlarge:
117117

118-
::
118+
::
119119

120-
+--------+--------+--------+---------+
121-
| NEURON | NEURON | NEURON | PCI |
122-
| DEVICE | CORES | MEMORY | BDF |
123-
+--------+--------+--------+---------+
124-
| 0 | 2 | 32 GB | 10:1c.0 |
125-
+--------+--------+--------+---------+
120+
+--------+--------+--------+---------+
121+
| NEURON | NEURON | NEURON | PCI |
122+
| DEVICE | CORES | MEMORY | BDF |
123+
+--------+--------+--------+---------+
124+
| 0 | 2 | 32 GB | 10:1c.0 |
125+
+--------+--------+--------+---------+
126126

127127
.. dropdown:: Build and Run Docker Image
128128
:class-title: sphinx-design-class-title-small
@@ -146,8 +146,7 @@
146146
:class-title: sphinx-design-class-title-small
147147
:class-body: sphinx-design-class-body-small
148148
:animate: fade-in
149-
150-
.. include:: /general/setup/install-templates/launch-inf1.txt
149+
.. include:: /general/setup/install-templates/launch-inf1.txt
151150

152151
.. dropdown:: Install Drivers
153152
:class-title: sphinx-design-class-title-small
@@ -195,7 +194,7 @@
195194
sudo yum install -y docker.io
196195
sudo usermod -aG docker $USER
197196

198-
Logout and log back in to refresh membership.
197+
Logout and log back in to refresh membership.
199198

200199
.. dropdown:: Verify Docker
201200
:class-title: sphinx-design-class-title-small
@@ -233,32 +232,32 @@
233232

234233

235234
.. dropdown:: Verify Neuron Component
236-
:class-title: sphinx-design-class-title-small
237-
:class-body: sphinx-design-class-body-small
238-
:animate: fade-in
235+
:class-title: sphinx-design-class-title-small
236+
:class-body: sphinx-design-class-body-small
237+
:animate: fade-in
239238

240-
Once the environment is setup, a container can be started with
241-
--device=/dev/neuron# to specify desired set of Inferentia/Trainium devices to be
242-
exposed to the container. To find out the available neuron devices on
243-
your instance, use the command ``ls /dev/neuron*``.
239+
Once the environment is setup, a container can be started with
240+
--device=/dev/neuron# to specify desired set of Inferentia/Trainium devices to be
241+
exposed to the container. To find out the available neuron devices on
242+
your instance, use the command ``ls /dev/neuron*``.
244243

245-
When running neuron-ls inside a container, you will only see the set of
246-
exposed Inferentias. For example:
244+
When running neuron-ls inside a container, you will only see the set of
245+
exposed Inferentias. For example:
247246

248-
.. code:: bash
247+
.. code:: bash
249248

250-
docker run --device=/dev/neuron0 neuron-test neuron-ls
249+
docker run --device=/dev/neuron0 neuron-test neuron-ls
251250

252-
Would produce the following output in inf1.xlarge:
251+
Would produce the following output in inf1.xlarge:
253252

254-
::
253+
::
255254

256-
+--------------+---------+--------+-----------+-----------+------+------+
257-
| PCI BDF | LOGICAL | NEURON | MEMORY | MEMORY | EAST | WEST |
258-
| | ID | CORES | CHANNEL 0 | CHANNEL 1 | | |
259-
+--------------+---------+--------+-----------+-----------+------+------+
260-
| 0000:00:1f.0 | 0 | 4 | 4096 MB | 4096 MB | 0 | 0 |
261-
+--------------+---------+--------+-----------+-----------+------+------+
255+
+--------------+---------+--------+-----------+-----------+------+------+
256+
| PCI BDF | LOGICAL | NEURON | MEMORY | MEMORY | EAST | WEST |
257+
| | ID | CORES | CHANNEL 0 | CHANNEL 1 | | |
258+
+--------------+---------+--------+-----------+-----------+------+------+
259+
| 0000:00:1f.0 | 0 | 4 | 4096 MB | 4096 MB | 0 | 0 |
260+
+--------------+---------+--------+-----------+-----------+------+------+
262261

263262
.. dropdown:: Run Tutorial
264263
:class-title: sphinx-design-class-title-small
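
The "Verify Neuron Component" text above describes exposing devices with `--device=/dev/neuron#` and listing them with `ls /dev/neuron*`. As a hedged illustration, not part of the commit, the sketch below builds such a `docker run` command line in Python. The function names are hypothetical, and the image name `neuron-test` and the `neuron-ls` command are taken from the example in the diff.

```python
import glob
import re


def neuron_device_paths(pattern="/dev/neuron*"):
    """Return Neuron device nodes matching the pattern, sorted by device index."""
    paths = [p for p in glob.glob(pattern) if re.fullmatch(r".*/neuron\d+", p)]
    return sorted(paths, key=lambda p: int(re.search(r"(\d+)$", p).group(1)))


def docker_run_command(devices, image, command):
    """Build a `docker run` argument list that exposes only the given devices."""
    args = ["docker", "run"]
    for dev in devices:
        # One --device flag per device node to expose to the container.
        args.append(f"--device={dev}")
    args.append(image)
    args.extend(command)
    return args


if __name__ == "__main__":
    # Reproduces the single-device command shown in the documentation.
    print(" ".join(docker_run_command(["/dev/neuron0"], "neuron-test", ["neuron-ls"])))
    # On a Neuron instance, this lists every device node that could be exposed.
    print(neuron_device_paths())
```

Passing the list from `neuron_device_paths()` as `devices` would expose every device on the host; passing a subset limits what `neuron-ls` reports inside the container.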
Lines changed: 5 additions & 0 deletions
@@ -1,5 +1,10 @@
11
Containers - Kubernetes - Getting Started
22
=========================================
33

4+
The Neuron device plugin is a DaemonSet run on all Inferentia and Trainium nodes that enables the containers in your Kubernetes cluster to request and use Neuron cores or devices.
5+
The Neuron scheduler extension is required for containers in your Kubernetes cluster that request multiple Neuron resources.
6+
It helps find optimal sets of Neuron resources to minimize inter-resource communication costs.
7+
Below are directions for installing and using the Neuron device plugin and scheduler extension.
8+
49

510
.. include:: /containers/kubernetes-getting-started.txt

containers/kubernetes-getting-started.txt

Lines changed: 22 additions & 1 deletion
@@ -5,6 +5,13 @@
55

66
.. include:: /containers/tutorials/k8s-prerequisite.rst
77

8+
.. dropdown:: Prerequisite for Neuron Problem Detector Plugin
9+
:class-title: sphinx-design-class-title-small
10+
:class-body: sphinx-design-class-body-small
11+
:animate: fade-in
12+
13+
.. include:: /containers/tutorials/k8s-neuron-problem-detector-and-recovery-irsa.rst
14+
815
.. dropdown:: Deploy Neuron Device Plugin
916
:class-title: sphinx-design-class-title-small
1017
:class-body: sphinx-design-class-body-small
@@ -17,4 +24,18 @@
1724
:class-body: sphinx-design-class-body-small
1825
:animate: fade-in
1926

20-
.. include:: /containers/tutorials/k8s-neuron-scheduler.rst
27+
.. include:: /containers/tutorials/k8s-neuron-scheduler.rst
28+
29+
.. dropdown:: Deploy Neuron Problem Detector And Recovery
30+
:class-title: sphinx-design-class-title-small
31+
:class-body: sphinx-design-class-body-small
32+
:animate: fade-in
33+
34+
.. include:: /containers/tutorials/k8s-neuron-problem-detector-and-recovery.rst
35+
36+
.. dropdown:: Deploy Neuron Monitor Daemonset
37+
:class-title: sphinx-design-class-title-small
38+
:class-body: sphinx-design-class-body-small
39+
:animate: fade-in
40+
41+
.. include:: /containers/tutorials/k8s-neuron-monitor.rst

containers/tutorials/inference/tutorial-infer.rst

Lines changed: 0 additions & 1 deletion
@@ -23,7 +23,6 @@ Setup Environment
2323
-----------------
2424

2525
1. Launch an Inf1 Instance
26-
.. include:: /general/setup/install-templates/launch-inf1.txt
2726

2827
2. Set up docker environment according to :ref:`tutorial-docker-env-setup`
2928

containers/tutorials/k8s-neuron-device-plugin.rst

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
11
.. _k8s-neuron-device-plugin:
22

3-
Neuron device plugin exposes Neuron cores & devices to kubernetes as a resource. aws.amazon.com/neuroncore, aws.amazon.com/neurondevice, aws.amazon.com/neuron are the resources that the neuron device plugin registers with the kubernetes. aws.amazon.com/neuroncore is used for allocating neuron cores to the container. aws.amazon.com/neurondevice is used for allocating neuron devices to the container. When neurondevice is used all the cores belonging to the device will be allocated to container. aws.amazon.com/neuron also allocates neurondevices and this exists just to be backward compatible with already existing installations. aws.amazon.com/neurondevice is the recommended resource for allocating devices to the container.
3+
The Neuron device plugin exposes Neuron cores and devices to Kubernetes as resources. It registers three resource names with Kubernetes: aws.amazon.com/neuroncore, aws.amazon.com/neurondevice, and aws.amazon.com/neuron. aws.amazon.com/neuroncore allocates Neuron cores to the container. aws.amazon.com/neurondevice allocates Neuron devices to the container; when it is used, all cores belonging to the device are allocated to the container. aws.amazon.com/neuron also allocates Neuron devices and is the recommended resource name for allocating devices to the container. Neuron will end support for the resource name 'neurondevice' in a future release. Please check announcements for updates.
44

55
* Make sure the :ref:`prerequisites <k8s-prerequisite>` are satisfied
66
* Download the neuron device plugin yaml file. :download:`k8s-neuron-device-plugin.yml </src/k8/k8s-neuron-device-plugin.yml>`
@@ -49,4 +49,4 @@ Neuron device plugin exposes Neuron cores & devices to kubernetes as a resource.
4949
5050
NAME NeuronDevice
5151
ip-192-168-65-41.us-west-2.compute.internal 16
52-
ip-192-168-87-81.us-west-2.compute.internal 16
52+
ip-192-168-87-81.us-west-2.compute.internal 16
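
The device plugin text above names `aws.amazon.com/neuron` as the recommended resource for allocating devices. As a hedged sketch, not part of the commit, the Python below builds a minimal Pod manifest that requests that resource and prints it as JSON, which `kubectl apply -f -` accepts on standard input. The pod name, the image, and the helper name `neuron_pod_manifest` are hypothetical placeholders.

```python
import json


def neuron_pod_manifest(name, image, devices=1):
    """Return a Pod manifest requesting `devices` Neuron devices.

    Extended resources such as aws.amazon.com/neuron are requested under
    resources.limits as whole integers.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,  # hypothetical image name
                    "resources": {
                        "limits": {"aws.amazon.com/neuron": devices},
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = neuron_pod_manifest("neuron-demo", "example.com/neuron-test:latest")
    print(json.dumps(manifest, indent=2))
```

Replacing the key with `aws.amazon.com/neuroncore` would request individual cores instead of whole devices, following the resource names listed in the text.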
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
1+
.. _k8s-neuron-monitor:
2+
3+
Neuron monitor Container
4+
========================
5+
6+
Neuron monitor is the primary observability tool for Neuron devices. For details, refer to the `neuron monitor guide <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html>`_. This tutorial describes deploying neuron monitor as a DaemonSet on a Kubernetes cluster.
7+
8+
9+
* Download the neuron monitor yaml file. :download:`k8s-neuron-monitor-daemonset.yml </src/k8/k8s-neuron-monitor-daemonset.yml>`
10+
* Apply the Neuron monitor yaml to create a daemonset on the cluster with the following command
11+
12+
.. code:: bash
13+
14+
kubectl apply -f k8s-neuron-monitor-daemonset.yml
15+
16+
* Verify that neuron monitor daemonset is running
17+
18+
.. code:: bash
19+
20+
kubectl get ds neuron-monitor --namespace neuron-monitor
21+
22+
Expected result (with 2 nodes in cluster):
23+
24+
.. code:: bash
25+
26+
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
27+
neuron-monitor 2 2 2 2 2 <none> 27h
28+
29+
30+
* Get the neuron-monitor pod names
31+
.. code:: bash
32+
33+
kubectl get pods
34+
35+
Expected result
36+
37+
.. code:: bash
38+
39+
NAME READY STATUS RESTARTS AGE
40+
neuron-monitor-slsxf 1/1 Running 0 17m
41+
neuron-monitor-wc4f5 1/1 Running 0 17m
42+
43+
44+
* Verify the prometheus endpoint is available
45+
.. code:: bash
46+
47+
kubectl exec neuron-monitor-wc4f5 -- wget -q --output-document - http://127.0.0.1:8000
48+
49+
Expected result
50+
51+
.. code:: bash
52+
53+
# HELP python_gc_objects_collected_total Objects collected during gc
54+
# TYPE python_gc_objects_collected_total counter
55+
python_gc_objects_collected_total{generation="0"} 362.0
56+
python_gc_objects_collected_total{generation="1"} 0.0
57+
python_gc_objects_collected_total{generation="2"} 0.0
58+
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
59+
# TYPE python_gc_objects_uncollectable_total counter
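
The expected result above is in the Prometheus text exposition format. As a hedged sketch, not part of the commit, the Python below parses a few of those sample lines into (name, labels, value) tuples. The parser is deliberately simplified (no escaped quotes in label values, no timestamps), and `parse_metrics` is a hypothetical helper name.

```python
import re

# Sample lines copied from the expected result above.
SAMPLE = """\
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 362.0
python_gc_objects_collected_total{generation="1"} 0.0
python_gc_objects_collected_total{generation="2"} 0.0
"""

# metric_name{label="value",...} value
_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse metric samples, skipping # HELP / # TYPE comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

In practice the input would be the text returned by the `wget` command shown earlier rather than the embedded sample.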
