Whisper normalization for evals #47

Open
@pcuenca

Description

The transformers version of the Whisper tokenizer has an EnglishTextNormalizer (https://github.com/huggingface/transformers/blob/d9deddb4c18410a14952537a91099319ecedb869/src/transformers/models/whisper/tokenization_whisper.py#L529) that is initialized with the contents of this file. There's also a BasicTextNormalizer and some additional normalization-related code.

These normalizers are not applied during regular use of the tokenizer. They can be enabled by passing custom flags to decode. This usually happens during quality evaluation, as explained in this PR, or as seen in the Open ASR leaderboard, which contains a hardcoded version of the English normalization file.
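To illustrate why normalization matters for evaluation, here is a minimal, stdlib-only sketch. It is not the transformers implementation: `basic_normalize` and `word_error_rate` are hypothetical helpers written for this example, and the normalizer only approximates the general idea of a basic text normalizer (lowercasing, dropping bracketed annotations, replacing punctuation with spaces). The English normalizer does considerably more and is not reproduced here.

```python
import re
import unicodedata


def basic_normalize(text: str) -> str:
    """Simplified, illustrative normalizer (an assumption, not the library code)."""
    text = text.lower()
    # Drop bracketed annotations such as "[laughter]" or "<unk>".
    text = re.sub(r"[<\[][^>\]]*[>\]]", "", text)
    # Drop parenthesized asides such as "(applause)".
    text = re.sub(r"\(([^)]+?)\)", "", text)
    # Apply NFKC, then replace marks, symbols and punctuation with spaces.
    text = "".join(
        " " if unicodedata.category(ch)[0] in "MSP" else ch
        for ch in unicodedata.normalize("NFKC", text)
    )
    # Collapse runs of whitespace left behind by the removals above.
    return re.sub(r"\s+", " ", text).strip()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # deletion
                    curr[j - 1] + 1,  # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / max(len(ref), 1)


reference = "Hello, world! (applause) This is a test."
hypothesis = "hello world this is a test"

raw_wer = word_error_rate(reference, hypothesis)
normalized_wer = word_error_rate(
    basic_normalize(reference), basic_normalize(hypothesis)
)
print(f"raw WER: {raw_wer:.3f}, normalized WER: {normalized_wer:.3f}")
```

Without normalization, casing and punctuation differences count as word errors even though the transcription is arguably correct; after normalization the two strings compare equal. That gap is why evaluation pipelines tend to normalize both the reference and the prediction before scoring.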

It'd be interesting to add these features as opt-in capabilities, but they are really not required until we want to run evaluations in Swift. Opening this issue for future reference.

h/t @ZachNagengast for his help diving into this.
