This repository was archived by the owner on Aug 5, 2025. It is now read-only.

[Bug?] Gradient Synchronization for DDP #1133

@jianweif

Description

According to the `no_sync` description in https://github.com/pytorch/pytorch/blob/main/torch/nn/parallel/distributed.py#L1424:

.. warning::
    The forward pass should be included inside the context manager, or
    else gradients will still be synchronized.

The current code runs the forward pass outside the `no_sync` context manager and only wraps the backward pass, so gradient synchronization will still be triggered (see the sketch below).

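For illustration, here is a minimal sketch of the usage the warning calls for, contrasted with the problematic pattern. The function and variable names (`accumulate_without_sync`, `batches`, `loss_fn`) are hypothetical and not from this repository; only `DistributedDataParallel.no_sync()` itself is the real PyTorch API.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical helper: `model` is already wrapped in DDP, `batches` is a list
# of (inputs, target) micro-batches, `loss_fn` is any loss callable.
def accumulate_without_sync(model: DDP, batches, loss_fn):
    # Problematic pattern described in this issue: forward OUTSIDE no_sync,
    # so DDP still schedules the gradient all-reduce during backward.
    #
    #   output = model(inputs)                    # forward outside no_sync
    #   with model.no_sync():
    #       loss_fn(output, target).backward()    # sync still fires
    #
    # Pattern the warning asks for: forward AND backward inside no_sync.
    for inputs, target in batches[:-1]:
        with model.no_sync():
            output = model(inputs)                # forward inside no_sync
            loss_fn(output, target).backward()    # grads accumulate locally
    # Final micro-batch outside the context manager performs the sync.
    inputs, target = batches[-1]
    output = model(inputs)
    loss_fn(output, target).backward()
```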