Whisper normalization for evals #47

The `transformers` implementation of the Whisper tokenizer includes an [`EnglishTextNormalizer`](https://github.com/huggingface/transformers/blob/d9deddb4c18410a14952537a91099319ecedb869/src/transformers/models/whisper/tokenization_whisper.py#L529) that is initialized with the contents of [this file](https://github.com/huggingface/transformers/blob/d9deddb4c18410a14952537a91099319ecedb869/src/transformers/models/whisper/tokenization_whisper.py#L42). There is also a `BasicTextNormalizer`, plus a few related helpers.

These normalizers are not applied during regular use of the tokenizer; they have to be enabled explicitly by [passing custom flags to `decode`](https://github.com/huggingface/transformers/blob/d9deddb4c18410a14952537a91099319ecedb869/src/transformers/models/whisper/tokenization_whisper.py#L662). This is usually done during quality evaluation, as [explained in this PR](https://github.com/huggingface/transformers/pull/28136), or as seen in the [Open ASR leaderboard](https://github.com/huggingface/open_asr_leaderboard/blob/main/normalizer/normalizer.py), which contains a [hardcoded version of the English normalization file](https://github.com/huggingface/open_asr_leaderboard/blob/main/normalizer/english_abbreviations.py).
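
To illustrate why normalization matters for evals, here is a rough, self-contained sketch of what a basic normalizer does before a reference and a hypothesis are compared. It is a simplified stand-in written for this issue, not the `transformers` code: the function names `basic_normalize` and `postprocess`, and the `normalize` parameter, are hypothetical, and the real `EnglishTextNormalizer` does considerably more (it relies on the normalization file linked above).

```python
import re
import unicodedata


def basic_normalize(text: str) -> str:
    """Simplified stand-in for a basic text normalizer (not the library code)."""
    text = text.lower()
    # Drop bracketed or parenthesized asides such as "[laughter]" or "(applause)".
    text = re.sub(r"[<\[][^>\]]*[>\]]", "", text)
    text = re.sub(r"\(([^)]+?)\)", "", text)
    # Replace marks, symbols and punctuation (Unicode categories M*, S*, P*) with spaces.
    text = "".join(
        " " if unicodedata.category(ch)[0] in "MSP" else ch
        for ch in unicodedata.normalize("NFKC", text)
    )
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


def postprocess(text: str, normalize: bool = False) -> str:
    """Hypothetical opt-in hook: raw text by default, normalized only on request."""
    return basic_normalize(text) if normalize else text


reference = "Hello, world! [laughter] How are you?"
hypothesis = "hello world how are you"

# Without normalization the strings differ, which would be counted as errors.
print(postprocess(reference) == postprocess(hypothesis))
# With normalization on both sides, only real transcription differences remain.
print(postprocess(reference, normalize=True) == postprocess(hypothesis, normalize=True))
```

Keeping normalization off by default mirrors the behavior described above: regular decoding is untouched, and an evaluation script opts in for both the reference and the prediction.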

It'd be interesting to add these features as opt-in capabilities, but they are really not required until we want to run evaluations in Swift. Opening this issue for future reference.

h/t @ZachNagengast for his help diving into this.
