Can anyone help?
---
**Guide to Fine-Tuning F5-TTS for Polish**

Fine-tuning F5-TTS for Polish requires a substantial dataset, careful configuration adjustments, and attention to language-specific challenges like character handling and alignment. Based on community experiences, successful trainings have used 90-142 hours of Polish audio, achieving intelligible results after anywhere from 2,500 to 350,000 steps depending on whether you fine-tune or train from scratch. Here's a detailed step-by-step guide focused on the key training aspects, drawing from GitHub discussions, YouTube tutorials, and Hugging Face resources (basic setup is skipped).

**Dataset Optimization for Polish:** Gather 90+ hours of high-quality Polish speech, e.g. from Common Voice Polish or custom recordings, as mono WAV at 44kHz for BigVGAN (preferred for fidelity) or 24kHz for Vocos. Split it into 3-15 second segments to cover varied speech patterns and improve generalization. Create a metadata.csv with audio_name and text columns, where text holds accurate Polish transcriptions; pay special attention to the diacritics (ą, ć, ę, ł, ń, ó, ś, ź, ż) to avoid garbled output. Community tip: use diverse speakers and accents for robustness; one user trained from scratch with 90 hours on an A100, converging at 350,000 steps with clear Polish speech. Hold out 10% of the data as a validation set to monitor overfitting. (A preparation sketch appears at the end of this guide.)

**Tokenizer and Vocabulary Customization:** The base F5-TTS tokenizer is optimized for English and Chinese, so extend vocab.txt to fully support Polish phonetics. Extract and modify the vocabulary from an existing multilingual model like Gregniuki/F5-tts_English_German_Polish, which already includes the Polish characters. Add entries for any missing symbols and test tokenization on sample texts to prevent "unknown token" errors during training (see the vocabulary sketch below). For better prosody, include Polish-specific punctuation and emphasis markers in the transcripts. The model: https://huggingface.co/Gregniuki/F5-tts_English_German_Polish

**Model Selection and Fine-Tuning Strategy:** Fine-tune the pre-trained F5TTS_v1_Base or the Polish-adapted Gregniuki model rather than training from scratch; fine-tuning needs fewer steps (100,000-150,000) and less data. Use Hugging Face Accelerate for distributed training, or the Gradio finetune app for visual monitoring. Set the model to 22 layers, 16 attention heads, embedding dimension 1024, and FFN dimension 2048, with 0.1 dropout on the attention and feed-forward layers to reduce overfitting on the Polish data. Demo: https://huggingface.co/spaces/Gregniuki/f5-tts_Polish_English_German

**Training Hyperparameters Tailored for Polish:** Configure a batch size of 4,000 frames with 4-10 gradient-accumulation steps (effective ~16,000-40,000) to fit an RTX 4090's 24GB VRAM without crashes. Use AdamW with a peak learning rate of 7.5e-5 and a linear warm-up over 20,000 steps, then decay. Enable data augmentation such as 70% mel-spectrogram masking for better fill-in-the-blank learning. For the vocoder, choose BigVGAN at 44kHz with hop length 512; this improves alignment for Polish's complex consonant clusters and longer words, as noted in community tests where a hop length of 256 caused mismatches in sequences over 6 seconds. Train for 100,000-150,000 steps (2-5 days on an RTX 4090), evaluating every 5,000 steps by generating samples and comparing mel spectrograms against ground truth for clarity. (These settings are collected into a config sketch below.)
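To make the dataset step concrete, here is a minimal preparation sketch. It assumes a folder of paired `clip.wav`/`clip.txt` files, a pipe-delimited metadata.csv, and the Vocos sample rate; the paths, delimiter, and folder layout are illustrative, so match them to whatever your training scripts expect.

```python
# Hypothetical preparation script: resamples clips to 24 kHz mono WAV
# (Vocos rate; use 44.1 kHz for BigVGAN) and builds metadata.csv with
# audio_name|text pairs. Paths and the pipe delimiter are assumptions.
import csv
from pathlib import Path

import librosa
import soundfile as sf

SRC_DIR = Path("raw_clips")      # assumed layout: clip.wav + clip.txt pairs
OUT_DIR = Path("wavs")
TARGET_SR = 24_000               # 44_100 for BigVGAN
MIN_SEC, MAX_SEC = 3.0, 15.0     # segment-length window from the guide

OUT_DIR.mkdir(exist_ok=True)
rows = []
for wav_path in sorted(SRC_DIR.glob("*.wav")):
    txt_path = wav_path.with_suffix(".txt")
    if not txt_path.exists():
        continue                              # skip clips without transcripts
    audio, _ = librosa.load(wav_path, sr=TARGET_SR, mono=True)
    duration = len(audio) / TARGET_SR
    if not MIN_SEC <= duration <= MAX_SEC:
        continue                              # enforce 3-15 s segments
    out_path = OUT_DIR / wav_path.name
    sf.write(out_path, audio, TARGET_SR)
    text = txt_path.read_text(encoding="utf-8").strip()  # keeps ą, ć, ę, ...
    rows.append((out_path.name, text))

with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="|")
    writer.writerow(["audio_name", "text"])
    writer.writerows(rows)
print(f"kept {len(rows)} clips")
```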
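For the vocabulary step, a small sketch like the following appends missing Polish characters to vocab.txt and sanity-checks a test sentence. It assumes the one-token-per-line layout of the F5-TTS character vocab; verify against the vocab file shipped with your base checkpoint.

```python
# Minimal sketch, assuming vocab.txt holds one character/token per line.
from pathlib import Path

POLISH_CHARS = list("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")

vocab_path = Path("vocab.txt")
vocab = vocab_path.read_text(encoding="utf-8").splitlines()
known = set(vocab)

missing = [ch for ch in POLISH_CHARS if ch not in known]
if missing:
    # append at the end so existing token ids keep their positions
    vocab_path.write_text("\n".join(vocab + missing) + "\n", encoding="utf-8")
    print("added:", " ".join(missing))

# quick tokenization check: report any character the vocab does not cover
sample = "Zażółć gęślą jaźń."
unknown = sorted({ch for ch in sample if ch not in known | set(missing)})
print("unknown tokens:", unknown or "none")
```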
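Pulling the architecture and hyperparameter numbers above into one place, a config sketch might look like this. The key names loosely follow the YAML configs in the F5-TTS repository but are not guaranteed to match your version; treat it as a checklist rather than a drop-in file.

```yaml
# polish_config.yaml -- illustrative sketch only; key names may differ
# from your F5-TTS version.
model:
  dim: 1024                     # embedding dimension
  depth: 22                     # transformer layers
  heads: 16                     # attention heads
  ff_mult: 2                    # FFN dim 2048 = 2 x 1024
  dropout: 0.1                  # attention + feed-forward dropout
mel:
  target_sample_rate: 44100     # BigVGAN; 24000 for Vocos
  hop_length: 512               # 256 caused alignment drift on clips > 6 s
optim:
  learning_rate: 7.5e-5         # AdamW peak LR
  num_warmup_updates: 20000     # linear warm-up, then decay
train:
  batch_size_per_gpu: 4000      # in mel frames, not utterances
  batch_size_type: frame
  grad_accumulation_steps: 4    # 4-10 -> effective ~16k-40k frames
  max_grad_norm: 1.0
  total_updates: 150000         # 100k-150k for fine-tuning
```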
**Monitoring, Troubleshooting, and Polish-Specific Challenges:** Track validation loss and audio intelligibility. Early checkpoints (e.g., 2,500 steps) may sound garbled, but by 10,000-20,000 steps Polish words become understandable, per user reports. Address muffled output by ensuring clean input audio and training longer; if fidelity drops, compare generated vs. reference spectrograms (a sketch follows below). For alignment issues in longer Polish sentences, include varied utterance lengths in the dataset and raise max_length to 15 seconds. If the run crashes (e.g., from VRAM overflow), lower the batch size or the accumulation steps. Community insights from YouTube tutorials (e.g., https://www.youtube.com/watch?v=UO4usaOojys on training a new language) emphasize testing with short Polish phrases early on to tweak the config. For more tips, check the GitHub discussions.
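For that spectrogram comparison, a rough sketch along these lines plots the reference and generated mel spectrograms side by side; muffled output typically shows up as smeared or missing high-frequency bands. The librosa parameters are generic assumptions rather than the exact F5-TTS mel settings, and the file names are placeholders.

```python
# Side-by-side mel comparison of a reference clip and a generated clip.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def mel_db(path, sr=24_000, n_mels=100, hop=256):
    """Load audio and return its mel spectrogram in dB."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop)
    return librosa.power_to_db(mel, ref=np.max)

fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
for ax, (title, path) in zip(
    axes,
    [("reference", "sample_reference.wav"),    # placeholder file names
     ("generated", "sample_generated.wav")],
):
    img = librosa.display.specshow(
        mel_db(path), x_axis="time", y_axis="mel", sr=24_000, hop_length=256, ax=ax
    )
    ax.set_title(title)
fig.colorbar(img, ax=axes, format="%+2.0f dB")
plt.savefig("mel_comparison.png", dpi=150)
```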
---
I’m preparing to fine-tune the F5-TTS model for Polish, since there isn’t one available yet. I did find one person who created a Polish model using about 90 hours of recordings, trained on an A100 80GB for around 24 hours. Unfortunately, he didn’t share that model.
https://www.youtube.com/watch?v=K6vY9Je4ufQ
That’s why I decided to give it a try myself. There isn’t much information online about TTS training configurations, unlike with photo or video models. Based on what I managed to gather so far:
My dataset contains 142 hours of correct Polish speech. The dataset has been split into smaller files with transcripts (the transcription process is still ongoing).
As for the configuration, I’m not entirely sure if it’s correct, but I plan to start training with the following settings:
polish_config.yaml
I don’t know if it will work, and I also don’t know how long it will take on an RTX 4090. Possibly a few days! XD
So, if anyone here has done a similar training and could help me out with tips or suggestions, I’d really appreciate it.
Yesterday, I ran a very short test training with just a 2-hour dataset. Unfortunately, the process crashed during the night, but it managed to reach 2500 steps. I saved sample outputs every 500 steps, so I have five of them.
I must say, at 500 steps the difference between the reference wav and the generated file was huge – as a native Polish speaker, I couldn’t understand a single word from the generated one. But at 2500 steps, it was already intelligible. Lots of mistakes, but at least I could understand the speech.
I could share the 2500-step sample here, but since it’s in Polish, I’m not sure if any of you would understand it.
update_2500_generated.wav
update_2500_reference.wav
Anyway, if someone can help, I’d be very grateful for any advice.