Wav2Vec2 pipeline feature extractor normalizes input over batch dimension, is it a feature or bug in design?

I'm trying to understand the intuition behind normalizing the input with layer norm like this:

`waveforms = nn.functional.layer_norm(waveforms, waveforms.shape)` [link](https://github.com/pytorch/audio/blob/d60ce09e2c532d5bf2e05619e700ab520543465e/src/torchaudio/pipelines/_wav2vec2/utils.py#L53)

If the input has shape [B, L], this code normalizes it across batch elements: the mean is computed over all B × L values regardless of which batch element they belong to, and the same goes for the variance. Is it really the intended behaviour that one batch element can influence another?
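To make the concern concrete, here is a minimal sketch of the effect. The two waveforms (`quiet` and `loud`), their length, and their amplitudes are made up for illustration and are not taken from the torchaudio code; only the `layer_norm` call mirrors the linked line.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Two hypothetical waveforms with very different amplitudes, batched as [B, L] = [2, 16000].
quiet = 0.1 * torch.randn(16000)
loud = 1.0 * torch.randn(16000)
batch = torch.stack([quiet, loud])

# Same call as in the linked code: one mean and one variance for the whole batch.
joint = F.layer_norm(batch, batch.shape)

# The quiet waveform normalized on its own, as a batch of one.
single = quiet.unsqueeze(0)
alone = F.layer_norm(single, single.shape)

# Inside the batch the quiet waveform is scaled by the shared statistics,
# so its standard deviation ends up well below 1; on its own it is close to 1.
print("std inside the batch:", joint[0].std().item())
print("std when normalized alone:", alone[0].std().item())
print("same output either way:", torch.allclose(joint[0], alone[0], atol=1e-3))
```

Under these assumptions, the output for the quiet waveform depends on what else happens to be in the batch.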

The original [paper](https://arxiv.org/pdf/2006.11477) states: "The raw waveform input to the encoder is normalized to zero mean and unit variance." It says nothing about normalizing across the batch.

I think the right way is to normalize each batch element independently, so the code should be changed to: `waveforms = nn.functional.layer_norm(waveforms, waveforms.shape[1:])`
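A quick sketch to check that the proposed call behaves as described. The random input data is hypothetical, and the sketch assumes a plain [B, L] tensor without padding.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical batch [B, L] = [2, 16000] with different scale and offset per row.
batch = torch.stack([0.1 * torch.randn(16000), torch.randn(16000) + 0.5])

# Proposed change: normalize over every dimension except the batch dimension.
per_sample = F.layer_norm(batch, batch.shape[1:])

# Each row should now have roughly zero mean and unit variance on its own.
row_means = per_sample.mean(dim=1)
row_vars = per_sample.var(dim=1, unbiased=False)

# Replacing the second row should leave the first row's output untouched.
other = batch.clone()
other[1] = 10.0 * torch.randn(16000)
per_sample_other = F.layer_norm(other, other.shape[1:])
unchanged = torch.allclose(per_sample[0], per_sample_other[0])

print(row_means, row_vars, unchanged)
```

With `waveforms.shape[1:]` the statistics are computed per row, so batch composition no longer affects an individual waveform's normalized values.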




Wav2Vec2 pipeline feature extractor normalizes input over batch dimension, is it a feature or bug in design? #5609
