|
1 | | -.. distr/index.rst: |
| 1 | + |
| 2 | +.. distr/index.rst: |
2 | 3 |
|
3 | 4 | ################################ |
4 | 5 | Distributed training with nGraph |
5 | 6 | ################################ |
6 | 7 |
|
7 | 8 |
|
8 | | -.. important:: Distributed training is not officially supported in version |
9 | | - |version|; however, some configuration options have worked for nGraph devices |
10 | | - with mixed or limited success in testing environments. |
11 | | - |
12 | | - |
13 | | -Why distributed training? |
14 | | -========================= |
| 9 | +.. important:: Distributed training is not officially supported as of version |
| 10 | + |version|; however, some configuration options have worked for nGraph |
| 11 | + devices in testing environments. |
15 | 12 |
|
16 | | -A tremendous amount of data is required to train DNNs in diverse areas -- from |
17 | | -computer vision to natural language processing. Meanwhile, computation used in |
18 | | -AI training has been increasing exponentially. And even though significant |
19 | | -improvements have been made in algorithms and hardware, using one machine to |
20 | | -train a very large :term:`NN` is usually not optimal. The use of multiple nodes, |
21 | | -then, becomes important for making deep learning training feasible with large |
22 | | -datasets. |
23 | | - |
24 | | -Data parallelism is the most popular parallel architecture to accelerate deep |
25 | | -learning with large datasets. The first algorithm we support is `based on the |
26 | | -synchronous`_ :term:`SGD` method, and partitions the dataset among workers |
27 | | -where each worker executes the same neural network model. For every iteration, |
28 | | -the nGraph backend computes the gradients in back-propagation, aggregates the gradients |
29 | | -across all workers, and then updates the weights. |
30 | 13 |
|
31 | 14 | How? (Generic frameworks) |
32 | 15 | ========================= |
33 | 16 |
|
34 | 17 | * :doc:`../core/constructing-graphs/distribute-train` |
35 | 18 |
|
36 | | -To synchronize gradients across all workers, the essential operation for |
37 | | -data-parallel training is ``allreduce``, thanks to its simplicity and scalability |
38 | | -over parameter servers. The AllReduce op is one of the nGraph Library’s core ops. To |
39 | | -enable gradient synchronization for a network, we simply inject the AllReduce op |
40 | | -into the computation graph, connecting the graph for the autodiff computation |
41 | | -and optimizer update (which then becomes part of the nGraph graph). The |
42 | | -nGraph Backend will handle the rest. |
43 | | - |
44 | | -Data scientists with locally-scalable rack or cloud-based resources will likely |
45 | | -find it worthwhile to experiment with different modes or variations of |
46 | | -distributed training. Deployments using nGraph Library with supported backends |
47 | | -can be configured to train with data parallelism and will soon work with model |
48 | | -parallelism. Distributing workloads is increasingly important, as more data and |
49 | | -bigger models mean that the ability to :doc:`../core/constructing-graphs/distribute-train` |
50 | | -is needed to work with larger and larger datasets, or with models having many layers |
51 | | -that aren't designed to fit on a single device. |
52 | | - |
53 | | -Distributed training with data parallelism splits the data across workers while each |
54 | | -worker node holds the same model; during each iteration, the gradients are aggregated |
55 | | -across all workers with an op that performs "allreduce" and then applied to update |
| 19 | +To synchronize gradients across all workers, the essential operation for |
| 20 | +data-parallel training is ``allreduce``, thanks to its simplicity and scalability |
| 21 | +over parameter servers. The AllReduce op is one of the nGraph Library’s core ops. To |
| 22 | +enable gradient synchronization for a network, we simply inject the AllReduce op |
| 23 | +into the computation graph, connecting the graph for the autodiff computation |
| 24 | +and optimizer update (which then becomes part of the nGraph graph). The |
| 25 | +nGraph Backend will handle the rest. |
| 26 | + |
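As a rough illustration of the injection described above, here is a minimal C++ sketch of an SGD-style update subgraph with the gradient wrapped in ``AllReduce``. It is a sketch only, assuming the v0.x-era nGraph C++ op constructors; the function and variable names are ours for illustration and are not taken from the library's examples.

.. code-block:: cpp

   // Sketch: wrap the locally computed gradient in AllReduce so that every
   // worker applies the same aggregated gradient in its weight update.
   // NOTE: assumes the v0.x C++ API; adapt to your nGraph version.
   #include <memory>
   #include <vector>
   #include <ngraph/ngraph.hpp>

   using namespace ngraph;

   std::shared_ptr<Function> make_sgd_update(const Shape& shape, float lr)
   {
       auto weights = std::make_shared<op::Parameter>(element::f32, shape);
       auto grad    = std::make_shared<op::Parameter>(element::f32, shape);

       // Gradient synchronization: reduce the gradient across all workers.
       auto grad_sum = std::make_shared<op::AllReduce>(grad);

       // Plain SGD step, w = w - lr * g, with the learning rate as a constant.
       auto lr_const = op::Constant::create(
           element::f32, shape, std::vector<float>(shape_size(shape), lr));
       auto step    = std::make_shared<op::Multiply>(lr_const, grad_sum);
       auto updated = std::make_shared<op::Subtract>(weights, step);

       return std::make_shared<Function>(NodeVector{updated},
                                         ParameterVector{weights, grad});
   }

Because the ``AllReduce`` output feeds the update, the backend can emit the corresponding collective-communication kernel as part of the compiled graph; this is what "the nGraph Backend will handle the rest" refers to.
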
| 27 | +Data scientists with locally-scalable rack or cloud-based resources will likely |
| 28 | +find it worthwhile to experiment with different modes or variations of |
| 29 | +distributed training. Deployments using nGraph Library with supported backends |
| 30 | +can be configured to train with data parallelism and will soon work with model |
| 31 | +parallelism. Distributing workloads is increasingly important, as more data and |
| 32 | +bigger models mean that the ability to :doc:`../core/constructing-graphs/distribute-train` |
| 33 | +is needed to work with larger and larger datasets, or with models having many layers |
| 34 | +that aren't designed to fit on a single device. |
| 35 | + |
| 36 | +Distributed training with data parallelism splits the data across workers while each |
| 37 | +worker node holds the same model; during each iteration, the gradients are aggregated |
| 38 | +across all workers with an op that performs "allreduce" and then applied to update |
56 | 39 | the weights. |
57 | 40 |
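
To make the per-iteration contract concrete, the following is a toy, framework-free C++ sketch (the gradient values and worker count are invented for illustration): every replica starts from the same weights, contributes a locally computed gradient, and applies the identical aggregated update, so the replicas never diverge. Whether the aggregated gradient is summed or averaged is a framework-level choice; it is averaged here.

.. code-block:: cpp

   // Toy, single-process simulation of one synchronous data-parallel step.
   #include <cstddef>
   #include <iostream>
   #include <vector>

   int main()
   {
       const std::size_t num_workers = 4;   // hypothetical worker count
       const double lr = 0.1;               // hypothetical learning rate
       double weight = 1.0;                 // every replica starts identical

       // Each worker computes a gradient on its own shard of the mini-batch.
       std::vector<double> local_grads = {0.20, 0.10, 0.40, 0.30};

       // "allreduce": every worker ends up holding the same reduced value.
       double grad_sum = 0.0;
       for (double g : local_grads)
       {
           grad_sum += g;
       }

       // The identical update is applied on every replica.
       weight -= lr * (grad_sum / num_workers);
       std::cout << "weight after one synchronized step: " << weight << "\n";
       return 0;
   }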
|
58 | 41 | Using multiple machines helps to scale and speed up deep learning. With large |
59 | | -mini-batch training, one could train ResNet-50 with ImageNet-1k data to the |
60 | | -target *Top 5* accuracy in minutes using thousands of CPU nodes. See |
61 | | -`arxiv.org/abs/1709.05011`_. |
62 | | - |
63 | | - |
64 | | -MXNet |
65 | | -===== |
66 | | - |
67 | | -We implemented a KVStore in MXNet\* (KVStore is unique to MXNet) to modify |
68 | | -the SGD update op so that the nGraph graph contains the allreduce op and generates |
69 | | -the corresponding collective communication kernels for different backends. We are |
70 | | -using `Intel MLSL`_ for CPU backends. |
71 | | - |
72 | | -The figure below shows a bar chart with preliminary results from ResNet-50 |
73 | | -I1K training in MXNet on 1, 2, 4 (and 8, if available) nodes; the x-axis is the |
74 | | -number of nodes and the y-axis is the throughput (images/sec). |
75 | | - |
76 | | - |
77 | | -.. TODO add figure graphics/distributed-training-ngraph-backends.png |
78 | | - |
79 | | - |
80 | | -TensorFlow |
81 | | -========== |
82 | | - |
83 | | -We plan to support the same in nGraph-TensorFlow. It is still a work in progress. |
84 | | -Meanwhile, users can still use Horovod with the current nGraph TensorFlow, |
85 | | -where the allreduce op is placed on the CPU instead of on the nGraph device. |
86 | | -Figure: a bar chart showing preliminary results from ResNet-50 I1K training in TF |
87 | | -on 1, 2, 4 (and 8, if available) nodes; the x-axis is the number of nodes and |
88 | | -the y-axis is the throughput (images/sec). |
| 42 | +mini-batch training, one could train ResNet-50 with ImageNet-1k data to the |
| 43 | +target *Top 5* accuracy in minutes using thousands of CPU nodes. See |
| 44 | +`arxiv.org/abs/1709.05011`_. |
89 | 45 |
|
90 | 46 |
|
91 | 47 | Future work |
92 | 48 | =========== |
93 | 49 |
|
94 | | -Model parallelism, along with support for more communication ops, is in the works. |
95 | | -For more general parallelism, such as model parallelism, we plan to add more |
96 | | -collective communication ops such as allgather, scatter, gather, etc. in |
97 | | -the future. |
| 50 | +Support for more communication ops is in the works. See also: |
| 51 | +:doc:`../../core/passes/list-of-passes`. |
98 | 52 |
|
99 | 53 |
|
100 | 54 | .. _arxiv.org/abs/1709.05011: https://arxiv.org/format/1709.05011 |
101 | 55 | .. _based on the synchronous: https://arxiv.org/format/1602.06709 |
102 | | -.. _Intel MLSL: https://github.com/intel/MLSL/releases |