|
1 | | -.. distr/index.rst: |
| 1 | + |
| 2 | +.. distr/index.rst: |
2 | 3 |
|
3 | 4 | ################################ |
4 | 5 | Distributed training with nGraph |
5 | 6 | ################################ |
6 | 7 |
|
7 | 8 |
|
8 | | -.. important:: Distributed training is not officially supported in version |
9 | | - |version|; however, some configuration options have worked for nGraph devices |
10 | | - with mixed or limited success in testing environments. |
11 | | - |
12 | | - |
13 | | -Why distributed training? |
14 | | -========================= |
| 9 | +.. important:: Distributed training is not officially supported as of version |
| 10 | + |version|; however, some configuration options have worked for nGraph |
| 11 | + devices in testing environments. |
15 | 12 |
|
16 | | -A tremendous amount of data is required to train DNNs in diverse areas -- from |
17 | | -computer vision to natural language processing. Meanwhile, computation used in |
18 | | -AI training has been increasing exponentially. And even though significant |
19 | | -improvements have been made in algorithms and hardware, using one machine to |
20 | | -train a very large :term:`NN` is usually not optimal. The use of multiple nodes, |
21 | | -then, becomes important for making deep learning training feasible with large |
22 | | -datasets. |
23 | | - |
24 | | -Data parallelism is the most popular parallel architecture to accelerate deep |
25 | | -learning with large datasets. The first algorithm we support is `based on the |
26 | | -synchronous`_ :term:`SGD` method, and partitions the dataset among workers |
27 | | -where each worker executes the same neural network model. For every iteration, |
28 | | -the nGraph backend computes the gradients in back-propagation, aggregates the gradients |
29 | | -across all workers, and then updates the weights. |
30 | 13 |
|
31 | 14 | How? (Generic frameworks) |
32 | 15 | ========================= |
33 | 16 |
|
34 | 17 | * :doc:`../core/constructing-graphs/distribute-train` |
35 | 18 |
|
36 | | -To synchronize gradients across all workers, the essential operation for |
37 | | -data-parallel training is ``allreduce``, thanks to its simplicity and scalability |
38 | | -over parameter servers. The AllReduce op is one of the nGraph Library’s core ops. To |
39 | | -enable gradient synchronization for a network, we simply inject the AllReduce op |
40 | | -into the computation graph, connecting the graph for the autodiff computation |
41 | | -and optimizer update (which then becomes part of the nGraph graph). The |
42 | | -nGraph Backend will handle the rest. |
43 | | - |
44 | | -Data scientists with locally-scalable rack or cloud-based resources will likely |
45 | | -find it worthwhile to experiment with different modes or variations of |
46 | | -distributed training. Deployments using nGraph Library with supported backends |
47 | | -can be configured to train with data parallelism and will soon work with model |
48 | | -parallelism. Distributing workloads is increasingly important, as more data and |
49 | | -bigger models mean that the ability to :doc:`../core/constructing-graphs/distribute-train` |
50 | | -is needed to work with larger and larger datasets, or with models having many layers |
51 | | -that aren't designed to fit on a single device. |
52 | | - |
53 | | -Distributed training with data parallelism splits the data across workers while each |
54 | | -worker node holds the same model; during each iteration, the gradients are aggregated |
55 | | -across all workers with an op that performs "allreduce" and then applied to update |
| 19 | +To synchronize gradients across all workers, the essential operation for |
| 20 | +data-parallel training is ``allreduce``, thanks to its simplicity and scalability |
| 21 | +over parameter servers. The AllReduce op is one of the nGraph Library’s core ops. To |
| 22 | +enable gradient synchronization for a network, we simply inject the AllReduce op |
| 23 | +into the computation graph, connecting the graph for the autodiff computation |
| 24 | +and optimizer update (which then becomes part of the nGraph graph). The |
| 25 | +nGraph Backend will handle the rest. |
| 26 | + |
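As a rough illustration of the injection described above, here is a minimal C++ sketch of an SGD-style update subgraph with the gradient wrapped in ``AllReduce``. It is a sketch only, assuming the v0.x-era nGraph C++ op constructors; the function and variable names are ours for illustration and are not taken from the library's examples.

.. code-block:: cpp

   // Sketch: wrap the locally computed gradient in AllReduce so that every
   // worker applies the same aggregated gradient in its weight update.
   // NOTE: assumes the v0.x C++ API; adapt to your nGraph version.
   #include <memory>
   #include <vector>
   #include <ngraph/ngraph.hpp>

   using namespace ngraph;

   std::shared_ptr<Function> make_sgd_update(const Shape& shape, float lr)
   {
       auto weights = std::make_shared<op::Parameter>(element::f32, shape);
       auto grad    = std::make_shared<op::Parameter>(element::f32, shape);

       // Gradient synchronization: reduce the gradient across all workers.
       auto grad_sum = std::make_shared<op::AllReduce>(grad);

       // Plain SGD step, w = w - lr * g, with the learning rate as a constant.
       auto lr_const = op::Constant::create(
           element::f32, shape, std::vector<float>(shape_size(shape), lr));
       auto step    = std::make_shared<op::Multiply>(lr_const, grad_sum);
       auto updated = std::make_shared<op::Subtract>(weights, step);

       return std::make_shared<Function>(NodeVector{updated},
                                         ParameterVector{weights, grad});
   }

Because the ``AllReduce`` output feeds the update, the backend can emit the corresponding collective-communication kernel as part of the compiled graph; this is what "the nGraph Backend will handle the rest" refers to.
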
| 27 | +Data scientists with locally-scalable rack or cloud-based resources will likely |
| 28 | +find it worthwhile to experiment with different modes or variations of |
| 29 | +distributed training. Deployments using nGraph Library with supported backends |
| 30 | +can be configured to train with data parallelism and will soon work with model |
| 31 | +parallelism. Distributing workloads is increasingly important, as more data and |
| 32 | +bigger models mean that the ability to :doc:`../core/constructing-graphs/distribute-train` |
| 33 | +is needed to work with larger and larger datasets, or with models having many layers |
| 34 | +that aren't designed to fit on a single device. |
| 35 | + |
| 36 | +Distributed training with data parallelism splits the data across workers while each |
| 37 | +worker node holds the same model; during each iteration, the gradients are aggregated |
| 38 | +across all workers with an op that performs "allreduce" and then applied to update |
56 | 39 | the weights. |
57 | 40 |
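
To make the per-iteration contract concrete, the following is a toy, framework-free C++ sketch (the gradient values and worker count are invented for illustration): every replica starts from the same weights, contributes a locally computed gradient, and applies the identical aggregated update, so the replicas never diverge. Whether the aggregated gradient is summed or averaged is a framework-level choice; it is averaged here.

.. code-block:: cpp

   // Toy, single-process simulation of one synchronous data-parallel step.
   #include <cstddef>
   #include <iostream>
   #include <vector>

   int main()
   {
       const std::size_t num_workers = 4;   // hypothetical worker count
       const double lr = 0.1;               // hypothetical learning rate
       double weight = 1.0;                 // every replica starts identical

       // Each worker computes a gradient on its own shard of the mini-batch.
       std::vector<double> local_grads = {0.20, 0.10, 0.40, 0.30};

       // "allreduce": every worker ends up holding the same reduced value.
       double grad_sum = 0.0;
       for (double g : local_grads)
       {
           grad_sum += g;
       }

       // The identical update is applied on every replica.
       weight -= lr * (grad_sum / num_workers);
       std::cout << "weight after one synchronized step: " << weight << "\n";
       return 0;
   }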
|
58 | 41 | Using multiple machines helps to scale and speed up deep learning. With large |
59 | | -mini-batch training, one could train ResNet-50 with ImageNet-1k data to the |
60 | | -target *Top 5* accuracy in minutes using thousands of CPU nodes. See |
61 | | -`arxiv.org/abs/1709.05011`_. |
62 | | - |
63 | | - |
64 | | -MXNet |
65 | | -===== |
66 | | - |
67 | | -We implemented a KVStore in MXNet\* (KVStore is unique to MXNet) to modify |
68 | | -the SGD update op so that the nGraph graph contains the allreduce op and generates |
69 | | -the corresponding collective communication kernels for different backends. We are |
70 | | -using `Intel MLSL`_ for CPU backends. |
71 | | - |
72 | | -The figure below shows a bar chart with preliminary results from ResNet-50 |
73 | | -I1K training in MXNet on 1, 2, 4 (and 8, if available) nodes; the x-axis is the |
74 | | -number of nodes and the y-axis is the throughput (images/sec). |
75 | | - |
76 | | - |
77 | | -.. TODO add figure graphics/distributed-training-ngraph-backends.png |
78 | | - |
79 | | - |
80 | | -TensorFlow |
81 | | -========== |
82 | | - |
83 | | -We plan to support the same in nGraph-TensorFlow. It is still a work in progress. |
84 | | -Meanwhile, users can still use Horovod with the current nGraph TensorFlow, |
85 | | -where the allreduce op is placed on the CPU instead of on the nGraph device. |
86 | | -Figure: a bar chart showing preliminary results from ResNet-50 I1K training in TF |
87 | | -on 1, 2, 4 (and 8, if available) nodes; the x-axis is the number of nodes and |
88 | | -the y-axis is the throughput (images/sec). |
| 42 | +mini-batch training, one could train ResNet-50 with ImageNet-1k data to the |
| 43 | +target *Top 5* accuracy in minutes using thousands of CPU nodes. See |
| 44 | +`arxiv.org/abs/1709.05011`_. |
89 | 45 |
|
90 | 46 |
|
91 | 47 | Future work |
92 | 48 | =========== |
93 | 49 |
|
94 | | -Model parallelism, along with support for more communication ops, is in the works. |
95 | | -For more general parallelism, such as model parallelism, we plan to add more |
96 | | -collective communication ops such as allgather, scatter, gather, etc. in |
97 | | -the future. |
| 50 | +Support for more communication ops is in the works. See also: |
| 51 | +:doc:`../../core/passes/list-of-passes`. |
98 | 52 |
|
99 | 53 |
|
100 | 54 | .. _arxiv.org/abs/1709.05011: https://arxiv.org/format/1709.05011 |
101 | 55 | .. _based on the synchronous: https://arxiv.org/format/1602.06709 |
102 | | -.. _Intel MLSL: https://github.com/intel/MLSL/releases |