
Commit 3e264bd

aws-qieqingy authored and aws-mesharma committed
Miscellaneous Updates -

* Fix Incorrect SD2 Inpainting Artifact & Broken Link to SDXL benchmark link
* Fix incorrect SD inpainting artifact
* Refactoring Performance Data
* Improve neuron-profile documentation
* Adding additional dependency for AL2023
* Update n2-helper.py
* Adding changes for Llama2 Training using aws batch
* Addressed PR comments
* Update training-trn1-samples.rst
* Formating fixes
* Adding Notes for Broken Tutorials
1 parent 1351ee1 commit 3e264bd

29 files changed (+87 / -212 lines)

frameworks/torch/torch-neuronx/programming-guide/training/index.rst

Lines changed: 0 additions & 1 deletion
@@ -8,7 +8,6 @@ Developer Guide (``torch-neuronx``)
 /frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-programming-guide
 /frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-debug
 /frameworks/torch/torch-neuronx/programming-guide/torch-neuronx-profiling-dev-guide
-/frameworks/torch/torch-neuronx/programming-guide/training/neuron-distributed-programming-guide


 .. dropdown:: Developer Guide
Lines changed: 0 additions & 1 deletion
@@ -1,4 +1,3 @@
 * :ref:`pytorch-neuronx-programming-guide`
 * :ref:`pytorch-neuronx-debug`
 * :ref:`torch-neuronx-dev-guide`
-* :ref:`neuronx-distributed`

frameworks/torch/torch-neuronx/training-troubleshooting.rst

Lines changed: 3 additions & 3 deletions
@@ -61,17 +61,17 @@ On Ubuntu, if Apport is not running, core dump file name is by default "core" in

 echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern

-For containers, install appropriate dependencies during docker build ("apt-get update && apt-get -y install build-essential gdb") and start the container with "--ulimit core=-1" to enable core dump and "-v /tmp/:/tmp/" to ensure core dumps to /tmp are preserved when container is stopped or deleted. Dependencies can also be installed after container is started.
+For containers, install appropriate dependencies during docker build ("apt-get update && apt-get -y install build-essential gdb") and start the container with ``--ulimit core=-1`` to enable core dump and ``-v /tmp/:/tmp/`` to ensure core dumps to ``/tmp`` are preserved when container is stopped or deleted. Dependencies can also be installed after container is started.

-On Ubuntu, core dumps can also handled by Apport which is disabled by default. To enable Apport, run ``sudo service apport start``. The ``/proc/sys/kernel/core_pattern`` is updated by Apport service. After a crash, look in /var/log/apport.log for the core dump file name, which should be in located in /var/lib/apport/coredump/.
+On Ubuntu, core dumps can also handled by Apport which is disabled by default. To enable Apport, run ``sudo service apport start``. The ``/proc/sys/kernel/core_pattern`` is updated by Apport service. After a crash, look in ``/var/log/apport.log`` for the core dump file name, which should be in located in ``/var/lib/apport/coredump/``.

 Once you have the core dump, you can use gdb to debug further (for Python applications, <executable> is ``python`` or ``python3``):

 .. code:: bash

 gdb <executable> <core file>

-If some process (i.e. XRT server) is killed due to out-of-memory on host (i.e. you see "Out of memory: Killed process <PID>" in syslog or dmesg), there won't be any core dump generated. However, you can change to it to kernel panic mode to trigger core dump by setting ``/proc/sys/vm/panic_on_oom`` to value of 1 on the host or from inside container.
+If some process (i.e. XRT server) is killed due to out-of-memory on host (i.e. you see ``Out of memory: Killed process <PID>`` in ``/var/log/syslog`` or output of ``dmesg``), there won't be any core dump generated. However, you can change to it to kernel panic mode to trigger core dump by setting ``/proc/sys/vm/panic_on_oom`` to value of 1 on the host or from inside container.

 On the host where you need ``sudo`` (this change will be reflected inside the container also):

frameworks/torch/torch-neuronx/tutorials/training/finetune_t5.rst

Lines changed: 3 additions & 0 deletions
@@ -3,6 +3,9 @@
 Fine-tune T5 model on Trn1
 ================================

+.. note::
+Update 01/03/24: This tutorial is currently broken and the AWS Neuron team is working on the fix.
+

 In this tutorial, we show how to fine-tune a Hugging Face (HF) T5 model
 using HF trainer API. This example fine-tunes a `T5 model for

frameworks/torch/torch-neuronx/tutorials/training/zero1_gpt2.rst

Lines changed: 8 additions & 7 deletions
@@ -3,6 +3,9 @@
 ZeRO-1 Tutorial
 ===============

+.. note::
+Update 01/03/24: This tutorial is currently broken and the AWS Neuron team is working on the fix.
+
 What is ZeRO-1?
 ---------------

@@ -89,15 +92,13 @@ these arguments to the wrapper constructor:

 GPT2-XL Pretraining Tutorial
 ----------------------------
-Update 10/02:This tutorial is currently broken and the AWS Neuron team is working on the fix.

-Table of Contents:
-///
-- Setup
-- Dataset
-- Training
+.. note::
+Update 01/03/24: This tutorial is currently broken and the AWS Neuron team is working on the fix.

---------------
+.. contents:: Table of contents
+:local:
+:depth: 2

 Setup
 ~~~~~

general/benchmarks/inf1/neuronperf_nlp_latency_optimized.csv

Lines changed: 0 additions & 6 deletions
This file was deleted.

general/benchmarks/inf1/neuronperf_nlp_throughput_optimized.csv

Lines changed: 0 additions & 6 deletions
This file was deleted.

general/benchmarks/inf2/latency_data_LLM.csv

Lines changed: 0 additions & 6 deletions
This file was deleted.

general/benchmarks/inf2/latency_data_language.csv

Lines changed: 0 additions & 18 deletions
This file was deleted.

general/benchmarks/inf2/latency_data_vision.csv

Lines changed: 0 additions & 22 deletions
This file was deleted.
