Hi, I’ve been working with the F5-TTS model, and I’m interested in integrating a Byte Pair Encoding (BPE) tokenizer instead of the character-level tokenizer currently used. I’m familiar with how BPE tokenization works and the general steps for implementing it, but I’m unsure of the specific files and methods in the F5-TTS codebase that need to be modified to support this change. Could you kindly guide me on:
Thanks in advance!
Core code with the model is here: https://github.com/SWivid/F5-TTS/tree/main/src/f5_tts/model. Basically, changing the text format is fine; see F5-TTS/src/f5_tts/model/cfm.py, line 214 and lines 227 to 233 (commit c2cf31e). The inference process is the same; you could make a func, e.g. char_to_bpe(), and pass the result in as `text`.
What was the reason you shifted to a BPE tokenizer? Just curious, and it might help me because I am also facing issues in vocab learning.
Ohh, gotcha.
@Alykasym can you share your settings.json file? I also tried to train F5-TTS on open-source data, especially for Hindi, but my results are not as good as they should be, especially the pronunciation. What is your data size, and how many epochs did you train? I was training on a 4090 and trained for a week, but the results were still not up to the mark!
Core code with the model is here: https://github.com/SWivid/F5-TTS/tree/main/src/f5_tts/model
Basically, changing the text format is fine.
F5-TTS/src/f5_tts/model/cfm.py, line 214 (commit c2cf31e)
F5-TTS/src/f5_tts/model/cfm.py, lines 227 to 233 (commit c2cf31e)
`text` is currently input like: [['h', 'o', 'w', ' ', 'a', 'r', 'e', ' ', 'y', 'o', 'u', '?'], ['i', "'", 'm', ' ', 'f', 'i', '…
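The suggestion above can be sketched as follows. This is a minimal, self-contained illustration of what a char_to_bpe() helper could look like: the function name comes from the reply, but the merge-table format and the greedy pair-merge loop below are assumptions for illustration, not F5-TTS code. It takes the character list `text` is currently fed with and returns merged BPE tokens you could pass in instead:

```python
def char_to_bpe(chars, merges):
    """Greedily apply BPE merges to a list of characters.

    chars:  list of single characters, e.g. ['h', 'o', 'w']
    merges: ordered list of pairs learned during BPE training,
            e.g. [('h', 'o'), ('ho', 'w')]; earlier pairs have
            higher priority (hypothetical format, for illustration)
    Returns the merged token list, to be passed in as `text`.
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    tokens = list(chars)
    while len(tokens) > 1:
        # Find the highest-priority (lowest-rank) adjacent pair present.
        best = min(
            (pair for pair in zip(tokens, tokens[1:]) if pair in ranks),
            key=ranks.get,
            default=None,
        )
        if best is None:
            break  # no more applicable merges
        # Merge every occurrence of that pair, left to right.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Example: two merges collapse ['h', 'o', 'w'] into a single 'how' token,
# while characters with no learned merge stay as-is.
merges = [("h", "o"), ("ho", "w")]
print(char_to_bpe(list("how are you?"), merges))
# → ['how', ' ', 'a', 'r', 'e', ' ', 'y', 'o', 'u', '?']
```

In practice you would learn the merge table with a proper BPE trainer and extend the model's vocab accordingly; the point here is only the shape of the conversion step that sits between the raw text and the `text` argument.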