
Commit d49e9f0

Igor Shilov authored and facebook-github-bot committed
Functorch added to grad sample README (#497)
Summary: As we've landed the functorch-backed GradSampleModule, we also want to update the README that helps people navigate the different grad samplers. For the content of this README I've also run benchmarks for all the options. Some results are surprising and hard to interpret, but the overall picture is mostly consistent.

## tl;dr

* There's no difference on CPU.
* functorch performance depends on the exact GPU setup: the same benchmarks can come out up to 4x slower or 2x faster than the baseline depending on the GPU.
* ExpandedWeights is consistently 25-30% faster for linear layers, but not for conv layers.

## benchmarks

| device | benchmark | hooks | functorch | ExpandedWeights |
|:-------:|:-------:|:-------:|:-------:|:-------:|
| cpu | nn.Conv2d | 1x | 0.9x | 1x |
| cpu | nn.Linear | 1x | 1x | 0.9x |
| cpu | full epoch on CIFAR10 example | 1x | 1.5x | 1x |
| Tesla T4 (Google Colab) | nn.Conv2d | 1x | 4x | 0.9x |
| Tesla T4 (Google Colab) | nn.Linear | 1x | 1.25x | 0.75x |
| A100 (AWS) | nn.Conv2d | 1x | 0.5x | 1x |
| A100 (AWS) | nn.Linear | 1x | 1.5x | 0.75x |
| A100 (AWS) | full epoch on CIFAR10 example | 1x | 1.1x | 0.75x |

FYI samdow

Pull Request resolved: #497

Reviewed By: karthikprasad

Differential Revision: D39352067

Pulled By: ffuuugor

fbshipit-source-id: 19b4fff80fe3c1963fab24e1292ae625200bc749
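For reference, the three columns above correspond to the `grad_sample_mode` values accepted by ``PrivacyEngine.make_private()``. A minimal sketch of how one would switch between them (the toy model, optimizer, and data loader below are illustrative placeholders, not part of this change):

```
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy setup, used only to show where grad_sample_mode is selected.
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data_loader = DataLoader(
    TensorDataset(torch.randn(64, 16), torch.randint(2, (64,))),
    batch_size=8,
)

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
    grad_sample_mode="hooks",  # or "functorch" / "ew", as benchmarked above
)
```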
1 parent 5eb3ae0 commit d49e9f0

File tree

2 files changed (+44, -15 lines)

examples/char-lstm_README.md

Lines changed: 2 additions & 2 deletions

@@ -9,14 +9,14 @@ Download the training zip from https://download.pytorch.org/tutorial/data.zip an
 Run with dp:

 ```
-python char-lstm-classification.py --epochs=50 --learning-rate=2.0 --hidden-size=128 --delta=8e-5 --sample-rate=0.05 --n-lstm-layers=1 --sigma=1.0 --max-per-sample-grad-norm=1.5 --device=cuda:0 --data-root="/my/folder/data/names/" --test-every 5
+python char-lstm-classification.py --epochs=50 --learning-rate=2.0 --hidden-size=128 --delta=8e-5 --batch-size 64 --n-layers=1 --sigma=1.0 --max-per-sample-grad-norm=1.5 --device=cuda:0 --data-root="/my/folder/data/names/" --test-every 5
 ```

 You should get something like this: Test Accuracy: 0.739542 (ε = 11.83, δ = 8e-05) for α = 2.7

 Run without dp:

 ```
-python char-lstm-classification.py --epochs=50 --learning-rate=0.5 --hidden-size=128 --sample-rate=0.05 --n-lstm-layers=1 --disable-dp --device=cuda:1 --data-root="/my/folder/data/names/" --test-every 5
+python char-lstm-classification.py --epochs=50 --learning-rate=0.5 --hidden-size=128 --batch-size 64 --n-layers=1 --disable-dp --device=cuda:1 --data-root="/my/folder/data/names/" --test-every 5
 ```

 You should get something like this: Test Accuracy: 0.760716

opacus/grad_sample/README.md

Lines changed: 42 additions & 13 deletions

@@ -14,8 +14,11 @@ which one to use.
 improves upon ``GradSampleModule`` on performance and functionality.

 **TL;DR:** If you want a stable implementation, use ``GradSampleModule`` (`grad_sample_mode="hooks"`).
-If you want to experiment with the new functionality - try ``GradSampleModuleExpandedWeights``(`grad_sample_mode="ew"`)
-and switch back to ``GradSampleModule`` if you encounter strange errors or unexpexted behaviour.
+If you want to experiment with the new functionality, you have two options: try
+``GradSampleModuleExpandedWeights`` (`grad_sample_mode="ew"`) for better performance, or `grad_sample_mode="functorch"`
+if your model is not supported by ``GradSampleModule``.
+
+Please switch back to ``GradSampleModule`` (`grad_sample_mode="hooks"`) if you encounter strange errors or unexpected behaviour.
 We'd also appreciate it if you report these to us

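The same choice can also be made by wrapping a model directly with the classes listed in the sections below, bypassing ``PrivacyEngine``. A hedged sketch (the toy model is an assumption; ExpandedWeights requires PyTorch 1.13+ and `force_functorch` requires functorch to be installed):

```
import copy

from torch import nn
from opacus.grad_sample import GradSampleModule
from opacus.grad_sample.gsm_exp_weights import GradSampleModuleExpandedWeights

# Toy model; deep copies avoid wrapping the same instance more than once.
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))

# Stable default: hooks-based grad sampler.
gsm_hooks = GradSampleModule(copy.deepcopy(model))

# Experimental: ExpandedWeights-backed grad sampler.
gsm_ew = GradSampleModuleExpandedWeights(copy.deepcopy(model))

# Experimental: functorch used for every layer, not only unsupported ones.
gsm_functorch = GradSampleModule(copy.deepcopy(model), force_functorch=True)
```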
 ## Hooks-based approach

@@ -26,6 +29,23 @@ Computes per-sample gradients for a model using backward hooks. It requires cust
 trainable layer in the model. We provide such methods for most popular PyTorch layers. Additionally, client can
 provide their own grad sampler for any new unsupported layer (see [tutorial](https://github.com/pytorch/opacus/blob/main/tutorials/guide_to_grad_sampler.ipynb))
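As a hedged illustration of such a client-provided grad sampler (the `MyLinear` layer and its gradient formula below are assumptions made for the sake of the example; the decorator signature may differ slightly between Opacus versions, so treat the linked tutorial as authoritative):

```
from typing import Dict

import torch
from torch import nn
from opacus.grad_sample import register_grad_sampler


class MyLinear(nn.Module):
    # Hypothetical custom layer: y = x @ W.T
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x):
        return x.matmul(self.weight.t())


@register_grad_sampler(MyLinear)
def compute_my_linear_grad_sample(
    layer: MyLinear, activations: torch.Tensor, backprops: torch.Tensor
) -> Dict[nn.Parameter, torch.Tensor]:
    # Per-sample weight gradient: outer product of backpropagated gradients
    # and input activations, computed independently for each sample n.
    return {layer.weight: torch.einsum("n...i,n...j->nij", backprops, activations)}
```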

+## Functorch approach
+- Model wrapping class: ``opacus.grad_sample.grad_sample_module.GradSampleModule`` (with `force_functorch=True`)
+- Keyword argument for ``PrivacyEngine.make_private()``: `grad_sample_mode="functorch"`
+
+[functorch](https://pytorch.org/functorch/stable/) provides JAX-like composable function transforms for PyTorch.
+With functorch we can compute per-sample gradients efficiently by using function transforms. With the efficient
+parallelization provided by `vmap`, we can obtain per-sample gradients for any function (i.e. any model) by
+doing essentially `vmap(grad(f(x)))`.
+
+Our experiments show that in most cases `vmap` computations are as fast as the manually written grad samplers used in
+the hooks-based approach.
+
+With the current implementation, `GradSampleModule` will use manual grad samplers for known modules (i.e. maintain the
+old behaviour for all previously supported models) and will only use functorch for unknown modules.
+
+With `force_functorch=True` passed to the constructor, `GradSampleModule` will rely exclusively on functorch.
+
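A hedged sketch of the `vmap(grad(...))` recipe described above, using functorch directly rather than through `GradSampleModule` (the model, loss, and data here are assumptions):

```
import torch
from torch import nn
from functorch import make_functional, vmap, grad

model = nn.Linear(10, 2)
fmodel, params = make_functional(model)  # stateless (functional) version of the model

def compute_loss(params, sample, target):
    # Loss for a single sample; vmap adds the batch dimension back.
    prediction = fmodel(params, sample.unsqueeze(0))
    return nn.functional.cross_entropy(prediction, target.unsqueeze(0))

# grad differentiates w.r.t. params (argnums=0 by default);
# vmap vectorizes over the batch dimension of data and targets.
per_sample_grads_fn = vmap(grad(compute_loss), in_dims=(None, 0, 0))

data, targets = torch.randn(32, 10), torch.randint(2, (32,))
per_sample_grads = per_sample_grads_fn(params, data, targets)
# per_sample_grads is a tuple with one tensor per parameter,
# each with a leading batch dimension of 32.
```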
 ## ExpandedWeights approach
 - Model wrapping class: ``opacus.grad_sample.gsm_exp_weights.GradSampleModuleExpandedWeights``
 - Keyword argument for ``PrivacyEngine.make_private()``: `grad_sample_mode="ew"`

@@ -42,14 +62,23 @@ is roughly the same.
 Please note that these are known limitations and we plan to improve Expanded Weights and bridge the gap in feature completeness

-| xxx | Hooks | Expanded Weights |
-|:-----:|:-------:|:------------------:|
-| Required PyTorch version | 1.8+ | 1.13+ |
-| Development status | Underlying mechanism deprecated | Beta |
-| Performance | - | ✅ Likely up to 2.5x faster |
-| torchscript models | Not supported | ✅ Supported |
-| Client-provided grad sampler | ✅ Supported | Not supported |
-| `batch_first=False` | ✅ Supported | Not supported |
-| Most popular nn.* layers | ✅ Supported | ✅ Supported |
-| Recurrent networks | ✅ Supported | Not supported |
-| Padding `same` in Conv | ✅ Supported | Not supported |
+|                              | Hooks                           | Expanded Weights | Functorch            |
+|:----------------------------:|:-------------------------------:|:----------------:|:--------------------:|
+| Required PyTorch version     | 1.8+                            | 1.13+            | 1.12 (to be updated) |
+| Development status           | Underlying mechanism deprecated | Beta             | Beta                 |
+| Runtime Performance†         | baseline                        | ~25% faster      | 🟨 0-50% slower      |
+| Any DP-allowed†† layers      | Not supported                   | Not supported    | ✅ Supported         |
+| Most popular nn.* layers     | ✅ Supported                    | ✅ Supported     | ✅ Supported         |
+| torchscripted models         | Not supported                   | ✅ Supported     | Not supported        |
+| Client-provided grad sampler | ✅ Supported                    | Not supported    | ✅ Not needed        |
+| `batch_first=False`          | ✅ Supported                    | Not supported    | ✅ Supported         |
+| Recurrent networks           | ✅ Supported                    | Not supported    | ✅ Supported         |
+| Padding `same` in Conv       | ✅ Supported                    | Not supported    | ✅ Supported         |
+
+† Note that performance differences are unstable and can vary a lot depending on the exact model and batch size.
+Numbers above are averaged over benchmarks with small models consisting of convolutional and linear layers.
+Also note that performance differences are only observed in GPU training; CPU performance seems to be almost identical
+for all approaches.
+
+†† Layers that produce joint computations on batch samples (e.g. BatchNorm) are not allowed under any approach.
