Commit 3ab9d96

ivashkst, geetasg, aws-eddy, jeffhataws, and rgrandhiamzn committed
Neuron SDK Release 2.20.1
---------
Co-authored-by: Geeta Gharpure <[email protected]>
Co-authored-by: Eddy Varela <[email protected]>
Co-authored-by: Jeffrey Huynh <[email protected]>
Co-authored-by: roopgran <[email protected]>
Co-authored-by: Esha Lakhotia <[email protected]>
Co-authored-by: Roopnath <[email protected]>
Co-authored-by: musunita <[email protected]>
Co-authored-by: mounchin <[email protected]>
1 parent 9c301c9 commit 3ab9d96

17 files changed (+179, -99 lines changed)

conf.py

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -157,7 +157,7 @@
157157

158158
#top_banner_message="<span>&#9888;</span><a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/setup-troubleshooting.html#gpg-key-update'> Neuron repository GPG key for Ubuntu installation has expired, see instructions how to update! </a>"
159159

160-
top_banner_message="Neuron 2.20.0 is released! check <a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#latest-neuron-release'> What's New </a> and <a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/announcements/index.html'> Announcements </a>"
160+
top_banner_message="Neuron 2.20.1 is released! check <a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#latest-neuron-release'> What's New </a> and <a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/announcements/index.html'> Announcements </a>"
161161

162162
html_theme = "sphinx_book_theme"
163163
html_theme_options = {

containers/tutorials/k8s-default-scheduler.rst

Lines changed: 9 additions & 13 deletions
@@ -2,32 +2,28 @@
22
.. _k8s-default-scheduler:
33

44
* Make sure :ref:`Neuron device plugin<k8s-neuron-device-plugin>` is running
5-
* Download the scheduler config map :download:`k8s-neuron-scheduler-configmap.yml </src/k8/k8s-neuron-scheduler-configmap.yml>`
6-
* Download the scheduler extension :download:`k8s-neuron-scheduler.yml </src/k8/k8s-neuron-scheduler.yml>`
75
* Enable the kube-scheduler with option to use configMap for scheduler policy. In your cluster.yml Please update the spec section with the following
86

9-
::
7+
.. code:: bash
108
119
spec:
1210
kubeScheduler:
1311
usePolicyConfigMap: true
1412
1513
* Launch the cluster
1614

17-
::
15+
.. code:: bash
1816
1917
kops create -f cluster.yml
2018
kops create secret --name neuron-test-1.k8s.local sshpublickey admin -i ~/.ssh/id_rsa.pub
2119
kops update cluster --name neuron-test-1.k8s.local --yes
2220
23-
* Apply the k8s-neuron-scheduler-configmap.yml [Registers neuron-scheduler-extension with kube-scheduler]
21+
* Install the neuron-scheduler-extension [Registers neuron-scheduler-extension with kube-scheduler]
2422

25-
::
23+
.. code:: bash
2624
27-
kubectl apply -f k8s-neuron-scheduler-configmap.yml
28-
29-
* Launch the neuron-scheduler-extension
30-
31-
::
32-
33-
kubectl apply -f k8s-neuron-scheduler.yml
25+
helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
26+
--set "scheduler.enabled=true" \
27+
--set "scheduler.customScheduler.enabled=false" \
28+
--set "scheduler.defaultScheduler.enabled=true" \
29+
--set "npd.enabled=false"

containers/tutorials/k8s-multiple-scheduler.rst

Lines changed: 11 additions & 22 deletions
@@ -5,35 +5,28 @@ In cluster environments where there is no access to default scheduler, the neuro
55
use this new scheduler. Neuron scheduler extension is added to this new scheduler. EKS natively does not yet support the neuron scheduler extension and so in the EKS environment this is the only way to add the neuron scheduler extension.
66

77
* Make sure :ref:`Neuron device plugin<k8s-neuron-device-plugin>` is running
8-
* Download the my scheduler :download:`my-scheduler.yml </src/k8/my-scheduler.yml>`
9-
* Download the scheduler extension :download:`k8s-neuron-scheduler-eks.yml </src/k8/k8s-neuron-scheduler-eks.yml>`
10-
* Apply the neuron-scheduler-extension
8+
* Install the neuron-scheduler-extension
119

12-
::
10+
.. code:: bash
1311
14-
kubectl apply -f k8s-neuron-scheduler-eks.yml
15-
16-
17-
* Apply the my-scheduler.yml
18-
19-
::
20-
21-
kubectl apply -f my-scheduler.yml
12+
helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
13+
--set "scheduler.enabled=true" \
14+
--set "npd.enabled=false"
2215
2316
* Check there are no errors in the my-scheduler pod logs and the k8s-neuron-scheduler pod is bound to a node
2417

25-
::
18+
.. code:: bash
2619
2720
kubectl logs -n kube-system my-scheduler-79bd4cb788-hq2sq
2821
29-
::
22+
.. code:: bash
3023
3124
I1012 15:30:21.629611 1 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/k8s-neuron-scheduler-5d9d9d7988-xcpqm" node="ip-192-168-2-25.ec2.internal" evaluatedNodes=1 feasibleNodes=1
3225
3326
3427
* When running new pod's that need to use the neuron scheduler extension, make sure it uses the my-scheduler as the scheduler. Sample pod spec is below
3528

36-
::
29+
.. code:: bash
3730
3831
apiVersion: v1
3932
kind: Pod
@@ -57,20 +50,19 @@ use this new scheduler. Neuron scheduler extension is added to this new schedule
5750
5851
* Once the neuron workload pod is run, make sure logs in the k8s neuron scheduler has successfull filter/bind request
5952

60-
61-
::
53+
.. code:: bash
6254
6355
kubectl logs -n kube-system k8s-neuron-scheduler-5d9d9d7988-xcpqm
6456
6557
66-
::
58+
.. code:: bash
6759
6860
2022/10/12 15:41:16 POD nrt-test-5038 fits in Node:ip-192-168-2-25.ec2.internal
6961
2022/10/12 15:41:16 Filtered nodes: [ip-192-168-2-25.ec2.internal]
7062
2022/10/12 15:41:16 Failed nodes: map[]
7163
2022/10/12 15:41:16 Finished Processing Filter Request...
7264
73-
::
65+
.. code:: bash
7466
7567
2022/10/12 15:41:16 Executing Bind Request!
7668
2022/10/12 15:41:16 Determine if the pod %v is NeuronDevice podnrt-test-5038
@@ -96,6 +88,3 @@ use this new scheduler. Neuron scheduler extension is added to this new schedule
9688
2022/10/12 15:41:16 Return aws.amazon.com/neuroncore
9789
2022/10/12 15:41:16 Succesfully updated the DevUsageMap [true true true true true true true true true false false false false false false false] and otherDevUsageMap [true true true false] after alloc for node ip-192-168-2-25.ec2.internal
9890
2022/10/12 15:41:16 Finished executing Bind Request...
99-
100-
101-
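The tutorial above asks that new pods needing the Neuron scheduler extension use `my-scheduler` as their scheduler. A minimal sketch of a local pre-check follows, assuming the pod manifest sets the standard Kubernetes `schedulerName` field on its own line. The function name `uses_scheduler` and the file name `pod.yml` are hypothetical and not part of the Neuron tooling.

```shell
# Hypothetical helper (not part of the Neuron tooling): succeeds when the
# manifest file ($1) contains a line "schedulerName: <name>" for the
# scheduler name given in $2.
uses_scheduler() {
    manifest="$1"
    scheduler="$2"
    grep -Eq "^[[:space:]]*schedulerName:[[:space:]]*${scheduler}[[:space:]]*\$" "$manifest"
}
```

It could be used as a guard before applying a manifest, for example `uses_scheduler pod.yml my-scheduler && kubectl apply -f pod.yml`. The check is purely textual, so it does not validate the rest of the pod spec.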

containers/tutorials/k8s-neuron-device-plugin.rst

Lines changed: 5 additions & 7 deletions
@@ -6,27 +6,25 @@ Deploy Neuron Device Plugin
66
~~~~~~~~~~~~~~~~~~~~~~~~~~~
77

88
* Make sure :ref:`prequisite<k8s-prerequisite>` are satisified
9-
* Download the neuron device plugin yaml file. :download:`k8s-neuron-device-plugin.yml </src/k8/k8s-neuron-device-plugin.yml>`
10-
* Download the neuron device plugin rbac yaml file. This enables permissions for device plugin to update the node and Pod annotations. :download:`k8s-neuron-device-plugin-rbac.yml </src/k8/k8s-neuron-device-plugin-rbac.yml>`
119
* Apply the Neuron device plugin as a daemonset on the cluster with the following command
1210

1311
.. code:: bash
1412
15-
kubectl apply -f k8s-neuron-device-plugin-rbac.yml
16-
kubectl apply -f k8s-neuron-device-plugin.yml
13+
helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
14+
--set "npd.enabled=false"
1715
1816
* Verify that neuron device plugin is running
1917

2018
.. code:: bash
2119
22-
kubectl get ds neuron-device-plugin-daemonset --namespace kube-system
20+
kubectl get ds neuron-device-plugin -n kube-system
2321
2422
Expected result (with 2 nodes in cluster):
2523

2624
.. code:: bash
2725
28-
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
29-
neuron-device-plugin-daemonset 2 2 2 2 2 <none> 27h
26+
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
27+
neuron-device-plugin 2 2 2 2 2 <none> 18h
3028
3129
* Verify that the node has allocatable neuron cores and devices with the following command
3230

containers/tutorials/k8s-neuron-problem-detector-and-recovery.rst

Lines changed: 8 additions & 14 deletions
@@ -4,30 +4,24 @@ Neuron node problem detector and recovery artifact checks the health of Neuron d
44

55
* The Neuron node problem detector and recovery requires Neuron driver 2.15+, and it requires the runtime to be at SDK 2.18 or later.
66
* Make sure prerequisites are satisfied. This includes prerequisites for getting started with Kubernetes containers and prerequisites for the Neuron node problem detector and recovery.
7-
* Download the Neuron node problem detector and recovery YAML file: :download:`k8s-neuron-problem-detector-and-recovery.yml </src/k8/neuron-problem-detector/k8s-neuron-problem-detector-and-recovery.yml>`.
7+
* Install the Neuron node problem detector and recovery as a DaemonSet on the cluster with the following command:
88

99
.. note::
1010

11-
This YAML pulls the container image from the upstream repository for node problem detector registry.k8s.io/node-problem-detector.
12-
13-
* Download the Neuron node problem detector and recovery configuration file: :download:`k8s-neuron-problem-detector-and-recovery-config.yml </src/k8/neuron-problem-detector/k8s-neuron-problem-detector-and-recovery-config.yml>`.
14-
* Download the Neuron node problem detector and recovery RBAC YAML file. This enables permissions for the Neuron node problem detector and recovery to update the node condition: :download:`k8s-neuron-problem-detector-and-recovery-rbac.yml </src/k8/neuron-problem-detector/k8s-neuron-problem-detector-and-recovery-rbac.yml>`.
15-
* By default, the Neuron node problem detector and recovery has monitor only mode enabled. To enable the recovery functionality, update the environment variable in the YAML file:
11+
The installation pulls the container image from the upstream repository for node problem detector registry.k8s.io/node-problem-detector.
1612

1713
.. code:: bash
1814
19-
- name: ENABLE_RECOVERY
20-
value: "true"
15+
helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart
2116
22-
Apply the Neuron node problem detector and recovery as a DaemonSet on the cluster with the following command:
17+
* By default, the Neuron node problem detector and recovery has monitor only mode enabled. To enable the recovery functionality:
2318

2419
.. code:: bash
2520
26-
kubectl apply -f k8s-neuron-problem-detector-and-recovery-rbac.yml
27-
kubectl apply -f k8s-neuron-problem-detector-and-recovery-config.yml
28-
kubectl apply -f k8s-neuron-problem-detector-and-recovery.yml
21+
helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
22+
--set "npd.nodeRecovery.enabled=true"
2923
30-
Verify that the Neuron device plugin is running:
24+
* Verify that the Neuron device plugin is running:
3125

3226
.. code:: bash
3327
@@ -44,4 +38,4 @@ Verify that the Neuron device plugin is running:
4438
node-problem-detector-vpjtk 1/1 Running 0 59s
4539
4640
47-
When any unrecoverable error occurs, Neuron node problem detector and recovery publishes a metric under the CloudWatch namespace NeuronHealthCheck. It also reflects in NodeCondition and can be seen with kubectl describe node.
41+
* When any unrecoverable error occurs, Neuron node problem detector and recovery publishes a metric under the CloudWatch namespace NeuronHealthCheck. It also reflects in NodeCondition and can be seen with kubectl describe node.
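The Helm invocations introduced in the Kubernetes tutorials of this commit differ only in their `--set` values. The sketch below collects them in one place and prints the command rather than running it. The chart location and every flag are copied from the diffs above; the function name `neuron_helm_cmd` and the mode labels are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: prints (does not execute) the Helm command that the
# tutorials in this commit use for each component. Mode labels are made up;
# the chart URL and --set values are taken from the diffs above.
neuron_helm_cmd() {
    base='helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart'
    case "$1" in
        device-plugin)
            flags='--set npd.enabled=false' ;;
        default-scheduler)
            flags='--set scheduler.enabled=true --set scheduler.customScheduler.enabled=false --set scheduler.defaultScheduler.enabled=true --set npd.enabled=false' ;;
        multiple-scheduler)
            flags='--set scheduler.enabled=true --set npd.enabled=false' ;;
        npd)
            flags='' ;;
        npd-recovery)
            flags='--set npd.nodeRecovery.enabled=true' ;;
        *)
            echo "unknown mode: $1" >&2
            return 1 ;;
    esac
    # Append the flags only when there are any, separated by one space.
    printf '%s%s\n' "$base" "${flags:+ $flags}"
}
```

Printing the command keeps the sketch safe to run on a machine without cluster access; piping the output to `sh` would execute it.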

containers/tutorials/k8s-neuron-scheduler.rst

Lines changed: 5 additions & 2 deletions
@@ -25,13 +25,16 @@ could be assigned to a container given a request for 2 devices.
2525

2626
Devices on Trn1.32xlarge and Trn1n.32xlarge nodes are connected via a 2D torus topology. On Trn1 nodes
2727
containers can request 1, 4, 8, or all 16 devices. In the case you request an invalid number of devices, such as 7,
28-
your pod will not be scheduled and you will receive a warning
29-
``Instance type trn1.32xlarge does not support requests for device: 7. Please request a different number of devices.```.
28+
your pod will not be scheduled and you will receive a warning:
29+
30+
``Instance type trn1.32xlarge does not support requests for device: 7. Please request a different number of devices.``
3031

3132
When requesting 4 devices, your container will be allocated one of the following sets of devices if they are available.
33+
3234
|eks-trn1-device-set4|
3335

3436
When requesting 8 devices, your container will be allocated one of the following sets of devices if they are available.
37+
3538
|eks-trn1-device-set8|
3639

3740
For all instance types, requesting one or all Neuron cores or devices is valid.
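The Trn1 rule above (requests of 1, 4, 8, or all 16 devices are valid) can be mirrored in a small pre-check before submitting a pod. This is a sketch of the stated rule only, not the scheduler extension's own logic; the function name `trn1_device_request_ok` is hypothetical.

```shell
# Hypothetical pre-check mirroring the documented Trn1 rule:
# a container may request 1, 4, 8, or all 16 Neuron devices.
trn1_device_request_ok() {
    case "$1" in
        1|4|8|16) return 0 ;;
        *)        return 1 ;;
    esac
}
```

For example, `trn1_device_request_ok 7 || echo "choose 1, 4, 8, or 16 devices"` flags the invalid request from the text before the pod is ever created.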

dlami/index.rst

Lines changed: 6 additions & 27 deletions
@@ -37,6 +37,9 @@ Multi Framework DLAMIs supported
3737
* - Ubuntu 22.04
3838
- Inf1, Inf2, Trn1, Trn1n
3939
- Deep Learning AMI Neuron (Ubuntu 22.04)
40+
* - Amazon Linux 2023
41+
- Inf1, Inf2, Trn1, Trn1n
42+
- Deep Learning AMI Neuron (Amazon Linux 2023)
4043

4144

4245

@@ -154,23 +157,10 @@ Virtual Environments pre-installed
154157
- torch-neuron
155158
- /opt/aws_neuron_venv_pytorch_inf1
156159

157-
* - Deep Learning AMI Neuron PyTorch 1.13 (Amazon Linux 2)
158-
- torch-neuronx, neuronx-distributed
159-
- /opt/aws_neuron_venv_pytorch
160-
161-
* - Deep Learning AMI Neuron PyTorch 1.13 (Amazon Linux 2)
162-
- torch-neuron
163-
- /opt/aws_neuron_venv_pytorch_inf1
164-
165160
* - Deep Learning AMI Neuron TensorFlow 2.10 (Ubuntu 20.04)
166161
- tensorflow-neuronx
167162
- /opt/aws_neuron_venv_tensorflow
168163

169-
* - Deep Learning AMI Neuron TensorFlow 2.10 (Amazon Linux 2)
170-
- tensorflow-neuronx
171-
- /opt/aws_neuron_venv_tensorflow
172-
173-
174164
You can easily get started with the single framework DLAMI through AWS console by following one of the corresponding setup guides . If you are looking to
175165
use the Neuron DLAMI in your cloud automation flows , Neuron also supports :ref:`SSM parameters <ssm-parameter-neuron-dlami>` to easily retrieve the latest DLAMI id.
176166

@@ -203,11 +193,6 @@ Base DLAMIs supported
203193
- Inf1, Inf2, Trn1, Trn1n
204194
- Deep Learning Base Neuron AMI (Ubuntu 20.04)
205195

206-
* - Amazon Linux 2
207-
- Inf1, Inf2, Trn1, Trn1n
208-
- Deep Learning Base Neuron AMI (Amazon Linux 2)
209-
210-
211196

212197
.. _ssm-parameter-neuron-dlami:
213198

@@ -251,6 +236,9 @@ SSM Parameter Prefix
251236

252237
* - Deep Learning AMI Neuron (Ubuntu 22.04)
253238
- /aws/service/neuron/dlami/multi-framework/ubuntu-22.04
239+
240+
* - Deep Learning AMI Neuron (Amazon Linux 2023)
241+
- /aws/service/neuron/dlami/multi-framework/amazon-linux-2023
254242

255243
* - Deep Learning AMI Neuron PyTorch 2.1 (Ubuntu 22.04)
256244
- /aws/service/neuron/dlami/pytorch-2.1/ubuntu-22.04
@@ -261,18 +249,9 @@ SSM Parameter Prefix
261249
* - Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04)
262250
- /aws/service/neuron/dlami/pytorch-1.13/ubuntu-20.04
263251

264-
* - Deep Learning AMI Neuron PyTorch 1.13 (Amazon Linux 2)
265-
- /aws/service/neuron/dlami/pytorch-1.13/amazon-linux-2
266-
267252
* - Deep Learning AMI Neuron TensorFlow 2.10 (Ubuntu 20.04)
268253
- /aws/service/neuron/dlami/tensorflow-2.10/ubuntu-20.04
269254

270-
* - Deep Learning AMI Neuron TensorFlow 2.10 (Amazon Linux 2)
271-
- /aws/service/neuron/dlami/tensorflow-2.10/amazon-linux-2
272-
273-
* - Deep Learning Base Neuron AMI (Amazon Linux 2)
274-
- /aws/service/neuron/dlami/base/amazon-linux-2
275-
276255
* - Deep Learning Base Neuron AMI (Ubuntu 22.04)
277256
- /aws/service/neuron/dlami/base/ubuntu-22.04
278257

general/devflows/plugins/npd-ecs-flows.txt

Lines changed: 1 addition & 1 deletion
@@ -77,7 +77,7 @@ Follow these steps to create a task definition for NPD and recovery:
7777
},
7878
{
7979
"name": "recovery",
80-
"image": "public.ecr.aws/neuron/neuron-node-recovery:1.2.0",
80+
"image": "public.ecr.aws/neuron/neuron-node-recovery:1.3.0",
8181
"cpu": 0,
8282
"portMappings": [],
8383
"essential": true,
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
1+
.. _neuron-dlami-release-notes:
2+
3+
Neuron DLAMI Release Notes
4+
===============================
5+
6+
.. contents:: Table of contents
7+
:local:
8+
:depth: 1
9+
10+
11+
Neuron 2.20.1
12+
-------------
13+
14+
Date: 10/25/2024
15+
16+
- Added support for Amazon Linux 2023 to Neuron Multi Framework DLAMI. Customers will have two operating system options when using the multi framework DLAMI. See :ref:`neuron-dlami-overview`.

release-notes/containers/neuron-dlc.rst

Lines changed: 7 additions & 0 deletions
@@ -8,6 +8,13 @@ Neuron DLC Release Notes
88
:depth: 1
99

1010

11+
Neuron 2.20.1
12+
-------------
13+
14+
Date: 10/25/2024
15+
- Neuron 2.20.1 DLC includes prerequisites for `Neuronx Distributed Training framework <https://github.com/aws-neuron/neuronx-distributed-training/blob/main/docs/general/installation_guide.rst#building-apex>`. Customers can expect to use NxDT out of the box.
16+
17+
1118
Neuron 2.20.0
1219
-------------
1320
