Description
I'm in the process of moving a model from PyTorch to Flux, and I'm going to catalog the challenges I've found in migrating, perhaps for a doc page or to improve docs generally. If others wish to add anything, please do!
Weight initialization:
In PyTorch, the default weight init method is `kaiming_uniform`, aka He initialization. In Flux, the default weight init method is `glorot_uniform`, aka Xavier initialization.
PyTorch chooses a gain for the init function based on the type of nonlinearity specified, which defaults to `leaky_relu`, and uses this:

```julia
a = √5 # argument passed into PyTorch's `kaiming_uniform_` function, the "negative slope" of the rectifier
gain = √(2 / (1 + a ^ 2))
```
Flux defaults the `kaiming_uniform` gain to `√2`, which is what PyTorch would use if `relu` was specified rather than `leaky_relu`.
To replicate the PyTorch default, `kaiming_uniform(gain = √(2 / (1 + a ^ 2)))` can be provided for the `init` keyword argument of the layer constructor.
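For example, something like this should reproduce the PyTorch default when constructing a layer (the layer sizes here are just illustrative):

```julia
using Flux

# PyTorch's default assumes a leaky_relu with negative slope a = √5
a = √5
pytorch_init = Flux.kaiming_uniform(gain = √(2 / (1 + a ^ 2)))

layer = Dense(64 => 32; init = pytorch_init)
```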
Bias initialization:
PyTorch initializes bias parameters with uniformly random values between `±1 / √(fan_in)`, where `fan_in` in Flux terms is `first(nfan(filter..., cin÷groups, cout))` for Conv layers and `last(nfan(out, in))` for Dense layers. Flux initializes all biases to zero.
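A rough sketch of replicating the PyTorch bias behavior for a Dense layer (the helper name and sizes here are mine, nothing built-in):

```julia
using Flux

# uniform values in ±1/√(fan_in); for Dense, fan_in is the input size
function pytorch_bias(fan_in, out)
    bound = 1f0 / sqrt(Float32(fan_in))
    return (rand(Float32, out) .* 2f0 .- 1f0) .* bound
end

layer = Dense(128 => 64; bias = pytorch_bias(128, 64))
```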
Layers:
In PyTorch, there are separate layer types for different dimensionalities (e.g. `Conv1d`, `Conv2d`, `Conv3d`). In Flux, the dimensionality is specified by the tuple provided for the kernel of `Conv`.
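For example (channel counts are illustrative):

```julia
using Flux

conv1 = Conv((3,), 3 => 16, relu)       # ≈ nn.Conv1d(3, 16, 3) + ReLU
conv2 = Conv((3, 3), 3 => 16, relu)     # ≈ nn.Conv2d(3, 16, 3) + ReLU
conv3 = Conv((3, 3, 3), 3 => 16, relu)  # ≈ nn.Conv3d(3, 16, 3) + ReLU
```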
In PyTorch, activation functions are inserted as separate steps in the chain, on equal footing with layers. In Flux, they are passed as an argument to the layer constructor (see the sketch after the list below).
`Sequential` => `Chain`
`Linear` => `Dense`
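Putting the mapping and the activation point together, a rough equivalent of `nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1))` would be:

```julia
using Flux

model = Chain(
    Dense(10 => 5, relu),  # activation is an argument, not a separate step
    Dense(5 => 1),
)
```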
`Upsample` in Flux (via NNlib) is equivalent to `align_corners=True` with PyTorch's `Upsample`, but the default there is `False`. Note that this makes the gradients depend on image size.
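For reference, a bilinear upsample in Flux looks something like this (the scale and array sizes are illustrative):

```julia
using Flux

# corresponds to nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
up = Upsample(:bilinear, scale = (2, 2))
x = rand(Float32, 8, 8, 3, 1)  # WHCN layout
y = up(x)                      # size (16, 16, 3, 1)
```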
When building a custom layer, the `(::MyLayer)(input) =` method is the equivalent of `def forward(self, input):`.
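A minimal sketch of what that looks like (the names and sizes are made up):

```julia
using Flux

struct MyLayer{W, B}
    weight::W
    bias::B
end

MyLayer(in::Int, out::Int) = MyLayer(randn(Float32, out, in), zeros(Float32, out))

Flux.@layer MyLayer  # make the fields visible as parameters (older Flux: Flux.@functor)

# this callable method plays the role of `def forward(self, input):`
(m::MyLayer)(x) = m.weight * x .+ m.bias
```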
Often if PyTorch has a method, Flux (many via MLUtils.jl) has the same method, e.g. `unsqueeze` to insert a 1-length dimension, or `erf` to compute the error function. Note that the inverse of `unsqueeze` is actually `Base.dropdims`.
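For example:

```julia
using MLUtils

x = rand(3, 4)
y = unsqueeze(x; dims = 2)  # size (3, 1, 4), like torch.unsqueeze(x, 1)
z = dropdims(y; dims = 2)   # back to size (3, 4)
```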
Training:
A single step in Flux is simply `gradient` followed by `update!`. In PyTorch there are more steps: the optimizer's gradients must be zeroed with `.zero_grad()`, then the loss is calculated, then the tensor returned from the loss function is backpropagated with `.backward()` to compute the gradients, and finally the optimizer is stepped forward and the model parameters are updated with `.step()`. In Flux, both parts can be combined with the `train!` function, which can also iterate over a set of paired training inputs and outputs.
In Flux, an optimizer state object is first obtained by `setup`, and this state is passed to the training loop. In PyTorch, the optimizer object itself is manipulated in the training loop.
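A rough sketch of the whole Flux workflow (the model, loss, and data here are just placeholders):

```julia
using Flux

model = Dense(4 => 1)
opt_state = Flux.setup(Adam(1e-3), model)  # optimizer state from `setup`

x, y = rand(Float32, 4, 16), rand(Float32, 1, 16)

# one step: take the gradient (withgradient also returns the loss), then update!
loss, grads = Flux.withgradient(m -> Flux.mse(m(x), y), model)
Flux.update!(opt_state, model, grads[1])

# or let train! do both over a collection of (input, output) pairs
data = [(x, y)]
Flux.train!((m, xb, yb) -> Flux.mse(m(xb), yb), model, data, opt_state)
```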
Added 10/13/24:
`Upsample` in PyTorch is actually deprecated in favor of `nn.functional.interpolate`, but the former just relies on the latter anyway.
`clip_grad_norm_` in PyTorch (side note: they've adopted a trailing underscore to indicate a modifying function) can be accomplished by creating an `Optimisers.OptimiserChain(ClipNorm(___), optimizer)`.
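For example, something like this (the threshold and optimizer settings are illustrative):

```julia
using Flux, Optimisers

opt = OptimiserChain(ClipNorm(1.0), Adam(1e-3))
opt_state = Flux.setup(opt, model)  # `model` as in the training sketch above
```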
Added 10/25/24:
`torch.where`, which produces an array whose elements are taken from one of two inputs depending on a mask, can be accomplished with a broadcasted `ifelse`.
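For example:

```julia
mask = rand(Bool, 3, 3)
a, b = rand(3, 3), rand(3, 3)
c = ifelse.(mask, a, b)  # picks from `a` where mask is true, else from `b`
```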
The `AdamW` optimizer is implemented differently. In PyTorch, the weight decay is moderated by the learning rate; in Flux, it is not. See FluxML/Optimisers.jl#182 for a workaround until Flux makes something built-in available.
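One possible workaround sketch, assuming the decay in Flux's chain is applied after Adam and not scaled by the learning rate (see the linked issue for the details and caveats; the numbers are illustrative): fold the learning rate into the decay coefficient yourself.

```julia
using Optimisers

eta, lambda = 1e-3, 1e-2  # learning rate and PyTorch-style weight decay

# decay of eta * lambda per step mimics PyTorch's lr-scaled decoupled decay
rule = OptimiserChain(Adam(eta), WeightDecay(eta * lambda))
```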