
Commit 9c301c9

Merge pull request #917 from jimburtoft/patch-1
spelling nit
2 parents fb2dad2 + 953e044 commit 9c301c9

File tree

1 file changed, +1 -1 lines changed


general/faq/training/neuron-training.rst

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ For simplicity, you should consider each NeuronCore within your instances as an
 What are the time to train advantages of Trn1?
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-While the answer is largely model defendant, training performance on Trn1 is fast due thanks for multiple system wide optimizations working in concert. Dependent on the data type, you should expect between 1.4-5X higher throughput on Trn1 as compared to the latest GPUs instances (P4d). For distributed workloads, 800Gbps EFA gives customers lower latency, and 2x the throughput as compared to P4d. (a Trn1n 1.6Tb option is coming soon). Each Trainium also has a dedicated collective compute (CC) engine, which enables running the CC ops in parallel to the NeuronCores compute. This enables another 10-15% acceleration of the overall workload. Finally, stochastic rounding enables running at half precision speeds (BF16) while maintaining accuracy at near full precision, this is not only simplifying model development (no need for mixed precision) it also helps the loss function converge faster and reduce memory footprint.
+While the answer is largely model dependent, training performance on Trn1 is fast due thanks for multiple system wide optimizations working in concert. Dependent on the data type, you should expect between 1.4-5X higher throughput on Trn1 as compared to the latest GPUs instances (P4d). For distributed workloads, 800Gbps EFA gives customers lower latency, and 2x the throughput as compared to P4d. (a Trn1n 1.6Tb option is coming soon). Each Trainium also has a dedicated collective compute (CC) engine, which enables running the CC ops in parallel to the NeuronCores compute. This enables another 10-15% acceleration of the overall workload. Finally, stochastic rounding enables running at half precision speeds (BF16) while maintaining accuracy at near full precision, this is not only simplifying model development (no need for mixed precision) it also helps the loss function converge faster and reduce memory footprint.
 
 What are some of the training performance results for Trn1?
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
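The stochastic rounding claim in the changed paragraph is easier to see with a small numeric sketch. The snippet below is purely illustrative NumPy code written for this note; it is not part of the Neuron SDK and not how Trainium implements rounding in hardware. It rounds float32 values to bfloat16 precision (top 16 bits of the float32 pattern) by adding random noise to the discarded bits before truncation, so in expectation no small update is lost.

import numpy as np

def bf16_round_stochastic(x: np.ndarray, rng=np.random.default_rng()) -> np.ndarray:
    """Stochastically round float32 values to bfloat16 precision.

    bfloat16 keeps the top 16 bits of the float32 bit pattern. Instead of
    truncating (or rounding to nearest), add a random value in [0, 2^16)
    to the low bits before masking, so a value rounds up with probability
    equal to its discarded fraction. The expected result equals the input.
    """
    bits = x.astype(np.float32).view(np.uint32)
    noise = rng.integers(0, 1 << 16, size=bits.shape, dtype=np.uint32)
    rounded = (bits + noise) & np.uint32(0xFFFF0000)  # keep top 16 bits
    return rounded.view(np.float32)

# Accumulate a tiny update many times: plain truncation to bfloat16 would
# drop the update entirely (the sum stays near 1.0), while stochastic
# rounding keeps the running sum close to the true value on average.
acc = np.float32(1.0)
for _ in range(10000):
    acc = bf16_round_stochastic(np.array([acc + 1e-4], dtype=np.float32))[0]
print(acc)  # close to 2.0 on average

This unbiasedness in expectation is the reason the paragraph can claim near full-precision accuracy while training in BF16 without a separate mixed-precision setup.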
