Currently, if BatchNorm is performed on the GPU we assert that the parameters must be trainable and that statistics must be tracked. This is a reasonable requirement, since CUDNN expects an explicit mean and variance during inference.
However, there are quite a few cases where we might want to disable these (for example, we typically don't set track_stats=true inside a Deep Equilibrium Model). Given that, I feel that if either of these options is disabled we should fall back to the CPU implementation, which relies on broadcasting and simple linear-algebra operations. (We already use that approach for GroupNorm and LayerNorm, so we might as well use it for BatchNorm.)
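For illustration, here is a minimal language-agnostic sketch (in NumPy) of what the broadcasting-based fallback computes when statistics are not tracked: normalize with the current batch's mean and variance, and only apply scale/shift if the affine parameters exist. The function name `batchnorm_fallback` and the `(batch, features)` layout are assumptions for this sketch, not the library's actual API.

```python
import numpy as np

def batchnorm_fallback(x, gamma=None, beta=None,
                       running_mean=None, running_var=None, eps=1e-5):
    """Sketch of a broadcasting-based BatchNorm (hypothetical helper).

    x is assumed to have shape (batch, features). When running statistics
    are not tracked (track_stats disabled), fall back to batch statistics.
    """
    if running_mean is None or running_var is None:
        # No tracked statistics: compute them from the current batch.
        mean = x.mean(axis=0, keepdims=True)
        var = x.var(axis=0, keepdims=True)
    else:
        mean, var = running_mean, running_var

    # Plain broadcasting; no CUDNN-specific requirements.
    y = (x - mean) / np.sqrt(var + eps)

    # Affine parameters are optional (non-trainable / absent is fine).
    if gamma is not None:
        y = gamma * y
    if beta is not None:
        y = y + beta
    return y
```

The same broadcasting pattern is what GroupNorm and LayerNorm already use, just with different reduction axes, which is why reusing it for BatchNorm is cheap.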