Can anyone help?
---
**Guide to Fine-Tuning F5-TTS for Polish**

Fine-tuning F5-TTS for Polish requires a substantial dataset, careful configuration adjustments, and attention to language-specific challenges like character handling and alignment. Based on community experiences, successful trainings have used 90-142 hours of Polish audio, achieving intelligible results after anywhere from 2,500 to 350,000 steps depending on whether you fine-tune or train from scratch. Here's a detailed step-by-step guide focused on the key training aspects, drawing from GitHub discussions, YouTube tutorials, and Hugging Face resources (basic setup is skipped).

**Dataset Optimization for Polish:** Gather 90+ hours of high-quality Polish speech, e.g. from Common Voice Polish or custom recordings, as mono WAV at 44kHz for BigVGAN (preferred for fidelity) or 24kHz for Vocos. Split it into 3-15 second segments to cover varied speech patterns and improve generalization. Create a metadata.csv with audio_name and text columns, where text holds accurate Polish transcriptions; pay special attention to the diacritics (ą, ć, ę, ł, ń, ó, ś, ź, ż) to avoid garbled output. Community tip: use diverse speakers and accents for robustness; one user trained from scratch with 90 hours on an A100, converging at 350,000 steps with clear Polish speech. Hold out 10% of the data as a validation set to monitor overfitting. (A preparation sketch appears at the end of this guide.)

**Tokenizer and Vocabulary Customization:** The base F5-TTS tokenizer is optimized for English and Chinese, so extend vocab.txt to fully support Polish phonetics. Extract and modify the vocabulary from an existing multilingual model like Gregniuki/F5-tts_English_German_Polish, which already includes the Polish characters. Add entries for any missing symbols and test tokenization on sample texts to prevent "unknown token" errors during training (see the vocabulary sketch below). For better prosody, include Polish-specific punctuation and emphasis markers in the transcripts. The model: https://huggingface.co/Gregniuki/F5-tts_English_German_Polish

**Model Selection and Fine-Tuning Strategy:** Fine-tune the pre-trained F5TTS_v1_Base or the Polish-adapted Gregniuki model rather than training from scratch; fine-tuning needs fewer steps (100,000-150,000) and less data. Use Hugging Face Accelerate for distributed training, or the Gradio finetune app for visual monitoring. Set the model to 22 layers, 16 attention heads, embedding dimension 1024, and FFN dimension 2048, with 0.1 dropout on the attention and feed-forward layers to reduce overfitting on the Polish data. Demo: https://huggingface.co/spaces/Gregniuki/f5-tts_Polish_English_German

**Training Hyperparameters Tailored for Polish:** Configure a batch size of 4,000 frames with 4-10 gradient-accumulation steps (effective ~16,000-40,000) to fit an RTX 4090's 24GB VRAM without crashes. Use AdamW with a peak learning rate of 7.5e-5 and a linear warm-up over 20,000 steps, then decay. Enable data augmentation such as 70% mel-spectrogram masking for better fill-in-the-blank learning. For the vocoder, choose BigVGAN at 44kHz with hop length 512; this improves alignment for Polish's complex consonant clusters and longer words, as noted in community tests where a hop length of 256 caused mismatches in sequences over 6 seconds. Train for 100,000-150,000 steps (2-5 days on an RTX 4090), evaluating every 5,000 steps by generating samples and comparing mel spectrograms against ground truth for clarity. (These settings are collected into a config sketch below.)
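To make the dataset step concrete, here is a minimal preparation sketch. It assumes a folder of paired `clip.wav`/`clip.txt` files, a pipe-delimited metadata.csv, and the Vocos sample rate; the paths, delimiter, and folder layout are illustrative, so match them to whatever your training scripts expect.

```python
# Hypothetical preparation script: resamples clips to 24 kHz mono WAV
# (Vocos rate; use 44.1 kHz for BigVGAN) and builds metadata.csv with
# audio_name|text pairs. Paths and the pipe delimiter are assumptions.
import csv
from pathlib import Path

import librosa
import soundfile as sf

SRC_DIR = Path("raw_clips")      # assumed layout: clip.wav + clip.txt pairs
OUT_DIR = Path("wavs")
TARGET_SR = 24_000               # 44_100 for BigVGAN
MIN_SEC, MAX_SEC = 3.0, 15.0     # segment-length window from the guide

OUT_DIR.mkdir(exist_ok=True)
rows = []
for wav_path in sorted(SRC_DIR.glob("*.wav")):
    txt_path = wav_path.with_suffix(".txt")
    if not txt_path.exists():
        continue                              # skip clips without transcripts
    audio, _ = librosa.load(wav_path, sr=TARGET_SR, mono=True)
    duration = len(audio) / TARGET_SR
    if not MIN_SEC <= duration <= MAX_SEC:
        continue                              # enforce 3-15 s segments
    out_path = OUT_DIR / wav_path.name
    sf.write(out_path, audio, TARGET_SR)
    text = txt_path.read_text(encoding="utf-8").strip()  # keeps ą, ć, ę, ...
    rows.append((out_path.name, text))

with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="|")
    writer.writerow(["audio_name", "text"])
    writer.writerows(rows)
print(f"kept {len(rows)} clips")
```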
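For the vocabulary step, a small sketch like the following appends missing Polish characters to vocab.txt and sanity-checks a test sentence. It assumes the one-token-per-line layout of the F5-TTS character vocab; verify against the vocab file shipped with your base checkpoint.

```python
# Minimal sketch, assuming vocab.txt holds one character/token per line.
from pathlib import Path

POLISH_CHARS = list("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")

vocab_path = Path("vocab.txt")
vocab = vocab_path.read_text(encoding="utf-8").splitlines()
known = set(vocab)

missing = [ch for ch in POLISH_CHARS if ch not in known]
if missing:
    # append at the end so existing token ids keep their positions
    vocab_path.write_text("\n".join(vocab + missing) + "\n", encoding="utf-8")
    print("added:", " ".join(missing))

# quick tokenization check: report any character the vocab does not cover
sample = "Zażółć gęślą jaźń."
unknown = sorted({ch for ch in sample if ch not in known | set(missing)})
print("unknown tokens:", unknown or "none")
```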
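Pulling the architecture and hyperparameter numbers above into one place, a config sketch might look like this. The key names loosely follow the YAML configs in the F5-TTS repository but are not guaranteed to match your version; treat it as a checklist rather than a drop-in file.

```yaml
# polish_config.yaml -- illustrative sketch only; key names may differ
# from your F5-TTS version.
model:
  dim: 1024                     # embedding dimension
  depth: 22                     # transformer layers
  heads: 16                     # attention heads
  ff_mult: 2                    # FFN dim 2048 = 2 x 1024
  dropout: 0.1                  # attention + feed-forward dropout
mel:
  target_sample_rate: 44100     # BigVGAN; 24000 for Vocos
  hop_length: 512               # 256 caused alignment drift on clips > 6 s
optim:
  learning_rate: 7.5e-5         # AdamW peak LR
  num_warmup_updates: 20000     # linear warm-up, then decay
train:
  batch_size_per_gpu: 4000      # in mel frames, not utterances
  batch_size_type: frame
  grad_accumulation_steps: 4    # 4-10 -> effective ~16k-40k frames
  max_grad_norm: 1.0
  total_updates: 150000         # 100k-150k for fine-tuning
```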
**Monitoring, Troubleshooting, and Polish-Specific Challenges:** Track validation loss and audio intelligibility. Early checkpoints (e.g., 2,500 steps) may sound garbled, but by 10,000-20,000 steps Polish words become understandable, per user reports. Address muffled output by ensuring clean input audio and training longer; if fidelity drops, compare generated vs. reference spectrograms (a sketch follows below). For alignment issues in longer Polish sentences, include varied utterance lengths in the dataset and raise max_length to 15 seconds. If the run crashes (e.g., from VRAM overflow), lower the batch size or the accumulation steps. Community insights from YouTube tutorials (e.g., https://www.youtube.com/watch?v=UO4usaOojys on training a new language) emphasize testing with short Polish phrases early on to tweak the config. For more tips, check the GitHub discussions.
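For that spectrogram comparison, a rough sketch along these lines plots the reference and generated mel spectrograms side by side; muffled output typically shows up as smeared or missing high-frequency bands. The librosa parameters are generic assumptions rather than the exact F5-TTS mel settings, and the file names are placeholders.

```python
# Side-by-side mel comparison of a reference clip and a generated clip.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def mel_db(path, sr=24_000, n_mels=100, hop=256):
    """Load audio and return its mel spectrogram in dB."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop)
    return librosa.power_to_db(mel, ref=np.max)

fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
for ax, (title, path) in zip(
    axes,
    [("reference", "sample_reference.wav"),    # placeholder file names
     ("generated", "sample_generated.wav")],
):
    img = librosa.display.specshow(
        mel_db(path), x_axis="time", y_axis="mel", sr=24_000, hop_length=256, ax=ax
    )
    ax.set_title(title)
fig.colorbar(img, ax=axes, format="%+2.0f dB")
plt.savefig("mel_comparison.png", dpi=150)
```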
---
I’m preparing to fine-tune the F5-TTS model for Polish, since there isn’t one available yet. I did find one person who created a Polish model using about 90 hours of recordings, trained on an A100 80GB for around 24 hours. Unfortunately, he didn’t share that model.
https://www.youtube.com/watch?v=K6vY9Je4ufQ
That’s why I decided to give it a try myself. There isn’t much information online about TTS training configurations, unlike with photo or video models. Based on what I managed to gather so far:
My dataset contains 142 hours of correct Polish speech. The dataset has been split into smaller files with transcripts (the transcription process is still ongoing).
As for the configuration, I’m not entirely sure if it’s correct, but I plan to start training with the following settings:
polish_config.yaml
I don’t know if it will work, and I also don’t know how long it will take on an RTX 4090. Possibly a few days! XD
So, if anyone here has done a similar training and could help me out with tips or suggestions, I’d really appreciate it.
Yesterday, I ran a very short test training with just a 2-hour dataset. Unfortunately, the process crashed during the night, but it managed to reach 2500 steps. I saved sample outputs every 500 steps, so I have five of them.
I must say, at 500 steps the difference between the reference wav and the generated file was huge – as a native Polish speaker, I couldn’t understand a single word from the generated one. But at 2500 steps, it was already intelligible. Lots of mistakes, but at least I could understand the speech.
I could share the 2500-step sample here, but since it’s in Polish, I’m not sure if any of you would understand it.
update_2500_generated.wav
update_2500_reference.wav
Anyway, if someone can help, I’d be very grateful for any advice.