
Commit db61000

indiediyessi authored and committed

Fix broken link and update details about mxnet (#2684)

* Update MXNet bridge info page
* Fix warning from docbuild due to heading
* Update page on distributed training
1 parent 199ec73 commit db61000

6 files changed, +61 -89 lines changed

doc/sphinx/source/distr/index.rst

Lines changed: 30 additions & 77 deletions
@@ -1,102 +1,55 @@
-.. distr/index.rst:
+
+.. distr/index.rst:
 
 ################################
 Distributed training with nGraph
 ################################
 
 
-.. important:: Distributed training is not officially supported in version
-   |version|; however, some configuration options have worked for nGraph devices
-   with mixed or limited success in testing environments.
-
-
-Why distributed training?
-=========================
+.. important:: Distributed training is not officially supported as of version
+   |version|; however, some configuration options have worked for nGraph
+   devices in testing environments.
 
-A tremendous amount of data is required to train DNNs in diverse areas -- from
-computer vision to natural language processing. Meanwhile, computation used in
-AI training has been increasing exponentially. And even though significant
-improvements have been made in algorithms and hardware, using one machine to
-train a very large :term:`NN` is usually not optimal. The use of multiple nodes,
-then, becomes important for making deep learning training feasible with large
-datasets.
-
-Data parallelism is the most popular parallel architecture to accelerate deep
-learning with large datasets. The first algorithm we support is `based on the
-synchronous`_ :term:`SGD` method, and partitions the dataset among workers
-where each worker executes the same neural network model. For every iteration,
-nGraph backend computes the gradients in back-propagation, aggregates the gradients
-across all workers, and then update the weights.
 
 How? (Generic frameworks)
 =========================
 
 * :doc:`../core/constructing-graphs/distribute-train`
 
-To synchronize gradients across all workers, the essential operation for data
-parallel training, due to its simplicity and scalability over parameter servers,
-is ``allreduce``. The AllReduce op is one of the nGraph Library’s core ops. To
-enable gradient synchronization for a network, we simply inject the AllReduce op
-into the computation graph, connecting the graph for the autodiff computation
-and optimizer update (which then becomes part of the nGraph graph). The
-nGraph Backend will handle the rest.
-
-Data scientists with locally-scalable rack or cloud-based resources will likely
-find it worthwhile to experiment with different modes or variations of
-distributed training. Deployments using nGraph Library with supported backends
-can be configured to train with data parallelism and will soon work with model
-parallelism. Distributing workloads is increasingly important, as more data and
-bigger models mean the ability to :doc:`../core/constructing-graphs/distribute-train`
-work with larger and larger datasets, or to work with models having many layers
-that aren't designed to fit to a single device.
-
-Distributed training with data parallelism splits the data and each worker
-node has the same model; during each iteration, the gradients are aggregated
-across all workers with an op that performs "allreduce", and applied to update
+To synchronize gradients across all workers, the essential operation for data
+parallel training, due to its simplicity and scalability over parameter servers,
+is ``allreduce``. The AllReduce op is one of the nGraph Library’s core ops. To
+enable gradient synchronization for a network, we simply inject the AllReduce op
+into the computation graph, connecting the graph for the autodiff computation
+and optimizer update (which then becomes part of the nGraph graph). The
+nGraph Backend will handle the rest.
+
+Data scientists with locally-scalable rack or cloud-based resources will likely
+find it worthwhile to experiment with different modes or variations of
+distributed training. Deployments using nGraph Library with supported backends
+can be configured to train with data parallelism and will soon work with model
+parallelism. Distributing workloads is increasingly important, as more data and
+bigger models mean the ability to :doc:`../core/constructing-graphs/distribute-train`
+work with larger and larger datasets, or to work with models having many layers
+that aren't designed to fit to a single device.
+
+Distributed training with data parallelism splits the data and each worker
+node has the same model; during each iteration, the gradients are aggregated
+across all workers with an op that performs "allreduce", and applied to update
 the weights.
 
 Using multiple machines helps to scale and speed up deep learning. With large
-mini-batch training, one could train ResNet-50 with Imagenet-1k data to the
-*Top 5* classifier in minutes using thousands of CPU nodes. See
-`arxiv.org/abs/1709.05011`_.
-
-
-MXNet
-=====
-
-We implemented a KVStore in MXNet\* (KVStore is unique to MXNet) to modify
-the SGD update op so the nGraph graph will contain the allreduce op and generate
-corresponding collective communication kernels for different backends. We are
-using `Intel MLSL`_ for CPU backends.
-
-The figure below shows a bar chart with preliminary results from a Resnet-50
-I1K training in MXNet 1, 2, 4, (and 8 if available) nodes, x-axis is the number
-of nodes while y-axis is the throughput (images/sec).
-
-
-.. TODO add figure graphics/distributed-training-ngraph-backends.png
-
-
-TensorFlow
-==========
-
-We plan to support the same in nGraph-TensorFlow. It is still work in progress.
-Meanwhile, users could still use Horovod and the current nGraph TensorFlow,
-where allreduce op is placed on CPU instead of on nGraph device.
-Figure: a bar chart shows preliminary results Resnet-50 I1K training in TF 1,
-2, 4, (and 8 if available) nodes, x-axis is the number of nodes while y-axis
-is the throughput (images/sec).
+mini-batch training, one could train ResNet-50 with Imagenet-1k data to the
+*Top 5* classifier in minutes using thousands of CPU nodes. See
+`arxiv.org/abs/1709.05011`_.
 
 
 Future work
 ===========
 
-Model parallelism with more communication ops support is in the works. For
-more general parallelism, such as model parallel, we plan to add more
-communication collective ops such as allgather, scatter, gather, etc. in
-the future.
+More communication ops support is in the works. See also:
+:doc:`../../core/passes/list-of-passes`.
 
 
 .. _arxiv.org/abs/1709.05011: https://arxiv.org/format/1709.05011
 .. _based on the synchronous: https://arxiv.org/format/1602.06709
-.. _Intel MLSL: https://github.com/intel/MLSL/releases
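
The hunk above keeps the core explanation of how generic frameworks get data-parallel training: an AllReduce op is injected into the computation graph between the autodiff computation and the optimizer update, and the nGraph backend generates the collective communication. The C++ sketch below is illustrative only; it is not part of this commit, it assumes the historical nGraph core API (op::Parameter, op::AllReduce, op::Constant, runtime::Backend::create), and it uses a hypothetical weight/gradient shape.

    // Minimal sketch of AllReduce injection for data-parallel SGD
    // (illustrative; not from this commit). Assumes the historical
    // nGraph C++ API and a distributed-enabled build.
    #include <memory>
    #include <vector>
    #include <ngraph/ngraph.hpp>

    using namespace ngraph;

    int main()
    {
        // Hypothetical weight and locally computed gradient of the same shape.
        Shape shape{256, 256};
        auto weight = std::make_shared<op::Parameter>(element::f32, shape);
        auto grad = std::make_shared<op::Parameter>(element::f32, shape);

        // Inject AllReduce so each worker sees the sum of all workers' gradients.
        auto grad_sum = std::make_shared<op::AllReduce>(grad);

        // SGD-style update: w = w - lr * allreduce(grad).
        auto lr = op::Constant::create(
            element::f32, shape, std::vector<float>(shape_size(shape), 0.01f));
        auto updated = std::make_shared<op::Subtract>(
            weight, std::make_shared<op::Multiply>(grad_sum, lr));

        // Compiling on a backend produces the collective communication kernels.
        auto f = std::make_shared<Function>(NodeVector{updated},
                                            ParameterVector{weight, grad});
        auto backend = runtime::Backend::create("CPU");
        auto exec = backend->compile(f);
        return 0;
    }

In a real deployment the gradient node would come from nGraph's autodiff rather than a Parameter, and the backend would need to be built with distributed support (for example, the Intel MLSL path mentioned in the removed MXNet section).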
doc/sphinx/source/frameworks/mxnet_integ.rst

Lines changed: 8 additions & 8 deletions
@@ -1,18 +1,18 @@
-.. mxnet_integ.rst:
+.. frameworks/mxnet_integ.rst:
 
 MXNet\* bridge
 ===============
 
-* See the `README`_ on nGraph-MXNet repo.
+* See the nGraph-MXNet `Integration Guide`_ on the nGraph-MXNet repo.
 
 * **Testing inference latency**: See the :doc:`validated/testing-latency`
   doc for a fully-documented example how to compile and test latency with an
-  MXNet-supported model.
+  MXNet-supported model.
 
-* **Training**: For experimental or alternative approaches to distributed
-  training methodologies, including data parallel training, see the
-  MXNet-relevant sections of the docs on :doc:`../distr/index` and
-  :doc:`How to <../core/constructing-graphs/index>` topics like :doc:`../core/constructing-graphs/distribute-train`.
+.. note:: The nGraph-MXNet bridge is designed to be used with trained models
+   only; it does not support distributed training.
 
+
 
-.. _README: https://github.com/NervanaSystems/ngraph-mxnet/blob/master/README.md
+
+.. _Integration Guide: https://github.com/NervanaSystems/ngraph-mxnet/blob/master/NGRAPH_README.md

doc/sphinx/source/ops/broadcast_distributed.rst

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ Inputs
 
 
 Outputs (in place)
--------
+------------------
 
 +-----------------+-------------------------+--------------------------------+
 | Name            | Element Type            | Shape                          |

doc/sphinx/source/ops/index.rst

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ Not currently a comprehensive list.
 * :doc:`batch_norm_training`
 * :doc:`batch_norm_training_backprop`
 * :doc:`broadcast`
-* :doc:`broadcastdistributed`
+* :doc:`broadcast_distributed`
 * :doc:`ceiling`
 * :doc:`concat`
 * :doc:`constant`

doc/sphinx/source/python_api/_autosummary/ngraph.exceptions.rst

Lines changed: 13 additions & 0 deletions
@@ -3,6 +3,19 @@ ngraph.exceptions
 
 .. automodule:: ngraph.exceptions
 
+
+
+
+
+
+
+
+
+
+
+
+
+
 .. rubric:: Exceptions
 
 .. autosummary::

doc/sphinx/source/python_api/_autosummary/ngraph.ops.rst

Lines changed: 8 additions & 2 deletions
@@ -14,31 +14,37 @@ ngraph.ops
 absolute
 acos
 add
+argmax
+argmin
 asin
 atan
 avg_pool
 batch_norm
 broadcast
+broadcast_to
 ceiling
 concat
 constant
 convert
 convolution
+convolution_backprop_data
 cos
 cosh
 divide
 dot
 equal
 exp
 floor
-function_call
 get_output_element
 greater
 greater_eq
 less
 less_eq
 log
+logical_and
 logical_not
+logical_or
+lrn
 max
 max_pool
 maximum
@@ -52,7 +58,6 @@ ngraph.ops
 parameter
 power
 prod
-reduce
 relu
 replace_slice
 reshape
@@ -68,6 +73,7 @@ ngraph.ops
 sum
 tan
 tanh
+topk
 
 
 
0 commit comments
