Description
The transformers version of the Whisper tokenizer has an EnglishTextNormalizer (https://github.com/huggingface/transformers/blob/d9deddb4c18410a14952537a91099319ecedb869/src/transformers/models/whisper/tokenization_whisper.py#L529) that is initialized with the contents of this file. There's also a BasicTextNormalizer and some additional stuff.
These normalizers are not applied during regular use of the tokenizer. They can be enabled by passing custom flags to decode. This usually happens during quality evaluation, as explained in this PR, or as seen in the Open ASR leaderboard, which contains a hardcoded version of the English normalization file.
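To make the opt-in behaviour concrete, here is a minimal Python sketch of a decode step that only normalizes when asked to. It is not the transformers implementation: `basic_normalize`, this toy `decode`, and its `normalize` flag are hypothetical names for illustration, and the rules (lowercase, drop bracketed annotations, replace punctuation with spaces, collapse whitespace) are a simplified approximation of what a basic normalizer does.

```python
import re
import unicodedata


def basic_normalize(text: str) -> str:
    """Simplified, hypothetical stand-in for a basic text normalizer."""
    text = text.lower()
    # Drop bracketed annotations such as "[music]", "<noise>" or "(laughs)".
    text = re.sub(r"[<\[][^>\]]*[>\]]", "", text)
    text = re.sub(r"\(([^)]+?)\)", "", text)
    # Replace punctuation (P*) and symbols (S*) with spaces; keep everything else.
    text = "".join(
        " " if unicodedata.category(ch)[0] in "PS" else ch
        for ch in unicodedata.normalize("NFKC", text)
    )
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def decode(pieces, normalize=False):
    """Hypothetical decode: joins already-decoded text pieces.

    Normalization is off by default, so regular use is unaffected.
    """
    text = "".join(pieces)
    return basic_normalize(text) if normalize else text


if __name__ == "__main__":
    pieces = [" Hello", ", world!", " [music]"]
    print(repr(decode(pieces)))
    print(repr(decode(pieces, normalize=True)))
```

Keeping the flag off by default mirrors the point above: normalization only matters when comparing transcripts during evaluation, so regular decoding should return the text unchanged.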
It'd be interesting to add these features as opt-in capabilities, but they are really not required until we want to run evaluations in Swift. Opening this issue for future reference.
h/t @ZachNagengast for his help diving into this.