-
I was able to answer some of the questions myself: on my machine, any value for max samples above 8 leads to horribly slow training, so I wouldn't go higher than that. I trained for 320 epochs, but unfortunately the resulting model doesn't perform any better than the base model in my opinion; there's no discernible difference in the generated audio files.
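As an aside on why that cap bites so hard: with frame-based batching, each batch is padded to its longest clip, so allowing more samples per batch inflates the number of frames the GPU actually processes. Below is a toy illustration of that interaction; it is not F5-TTS's actual sampler code, and the frame budget and clip lengths are made-up numbers.

```python
# Toy illustration (not F5-TTS's actual sampler): with a fixed frame
# budget per batch, a higher max-samples cap packs more short clips
# into each batch; since a batch is padded to its longest clip, the
# frames the GPU actually processes grow with the cap.
import random

random.seed(0)
clip_frames = [random.randint(100, 1000) for _ in range(2000)]  # fake dataset

def epoch_padded_frames(frame_budget, max_samples):
    total, batch = 0, []
    for f in clip_frames:
        if batch and (sum(batch) + f > frame_budget or len(batch) == max_samples):
            total += len(batch) * max(batch)  # batch padded to its longest clip
            batch = []
        batch.append(f)
    return total + len(batch) * max(batch)

actual = sum(clip_frames)  # frames with no padding at all
for cap in (2, 4, 8, 16, 32):
    print(cap, round(epoch_padded_frames(6400, cap) / actual, 2))
```

The real slowdown on a given GPU also involves VRAM pressure once padded batches outgrow memory, so treat this only as intuition for how the max-samples cap and the frame budget interact.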
-
Hello, could you share the parameters you used and any other insights from your training, such as how the 8-bit Adam optimizer affected it (if you found out)? I'm currently running a very similar training with a similar setup and getting some weird artefacts (a wet sound), and I can't tell whether the model is undertrained or my reference audio files are bad and it's picking up too much background noise. Also, on my machine (4070 laptop) it takes about a day for around 20k-25k steps depending on parameters, and I can't figure out whether that is normal or slow (I see people here training for >100k steps). I would very much appreciate some insights and results from your training. My Parameters:
I actually trained my model for only 150 epochs, 30k steps.
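For comparing runs like these, it helps to convert between epochs and updates explicitly, since the two are linked by dataset size and effective batch size. A back-of-the-envelope sketch, where every number is a hypothetical placeholder rather than a measurement from either setup:

```python
# Back-of-the-envelope conversion between epochs and optimizer updates.
# Every number here is a hypothetical placeholder; plug in your own.
dataset_hours = 1.0        # e.g. ~60 min of audio for one speaker
frames_per_sec = 100       # ~100 mel frames/s is a typical hop size
batch_frames = 6400        # "batch size per GPU", counted in frames
grad_accum = 1
num_gpus = 1

dataset_frames = dataset_hours * 3600 * frames_per_sec
updates_per_epoch = dataset_frames / (batch_frames * grad_accum * num_gpus)
print(f"~{updates_per_epoch:.0f} updates per epoch")   # ~56 with these inputs
print(f"150 epochs ≈ {150 * updates_per_epoch:,.0f} updates")
```

If 150 epochs worked out to 30k steps, that implies roughly 200 updates per epoch, i.e. more data or a smaller effective batch than these placeholders. Similarly, 20k-25k steps per day is about 0.23-0.29 updates per second; whether that is slow depends on how many frames each update carries, so frames of audio processed per second is the fairer cross-machine comparison.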
-
I've been getting great results with F5-TTS, but my first attempt at fine-tuning trained from scratch instead of starting from the pretrained model: the output started as noise and is only slowly becoming speech.
How do I correctly fine-tune instead of starting from scratch?
Do I need to set "Tokenizer File" and "Path to Pretrained Checkpoint" manually? If so, what should I put?
Does "Download corresponding dataset first, and fill in the path in scripts" (repo link) refer to this?
Project Details:
I'm working on generating voices for characters from an old game. I have 10 to 60 minutes of clean audio samples per character. Language: English.
Hardware:
GPU: NVIDIA RTX 4080 Laptop (12 GB VRAM)
I'm looking for advice on the best values to set for finetuning, given my hardware. Here’s what I’ve gathered so far, but I’d love some expert input:
Parameter Questions
Batch Size per GPU: I assume 6400 should work with 12GB VRAM, but would appreciate confirmation.
Max Samples: Not sure, but I read that 2 might be fine (reference).
Gradient Accumulation Steps & Max Gradient Norm: No idea; should I just leave them at 1?
Epochs: How many would be reasonable for my dataset size?
Warmup Updates: Not sure what value is appropriate (see the sketch after this list).
Save per Updates: I assume setting this high is better, as frequent saving would slow down training?
Last per Updates: Not sure what value to use here either.
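One way to choose Epochs, Warmup Updates, and Save per Updates together is to derive them from the total number of updates the run will perform rather than guessing each in isolation. A sketch with hypothetical numbers; the percentages are generic training heuristics, not documented F5-TTS recommendations:

```python
# Derive warmup and checkpoint cadence from the planned total updates.
# Hypothetical numbers; the percentages are generic heuristics, not
# F5-TTS recommendations.
updates_per_epoch = 200          # dataset frames / effective batch frames
epochs = 150
total_updates = epochs * updates_per_epoch          # 30,000 here

num_warmup_updates = int(0.05 * total_updates)      # ~5% warmup heuristic
save_per_updates = max(1000, total_updates // 20)   # ~20 checkpoints per run
print(total_updates, num_warmup_updates, save_per_updates)
```

On the save-frequency intuition: checkpoint I/O only hurts if it happens often relative to step time, so something on the order of a few dozen saves per run is usually negligible.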
Other Options:
Use 8-bit Adam optimizer – Should I enable this? (See the sketch after this list for what it does.)
Mixed Precision – Any recommendations based on my GPU?
Logger – Not sure what’s best here.
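For context on those first two toggles: "8-bit Adam" usually means swapping in bitsandbytes' AdamW8bit, which stores optimizer state in 8 bits and cuts its memory roughly 4x, and on an RTX 40-series GPU mixed precision is typically bf16, which the hardware supports natively. A minimal sketch of what the options do under the hood, assuming a plain PyTorch setup rather than F5-TTS's actual trainer code:

```python
# Sketch of what "8-bit Adam" and "mixed precision" typically mean,
# assuming a plain PyTorch loop (not F5-TTS's actual trainer code).
import torch
import bitsandbytes as bnb       # requires a CUDA GPU
from accelerate import Accelerator

model = torch.nn.Linear(512, 512)  # stand-in for the real model

# 8-bit Adam: optimizer states kept in 8 bits, ~4x less optimizer memory.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-5)

# Mixed precision: bf16 is the usual choice on RTX 40-series (Ada) GPUs.
accelerator = Accelerator(mixed_precision="bf16")
model, optimizer = accelerator.prepare(model, optimizer)
```

Whether 8-bit Adam changes output quality is exactly the open question in the replies above; its main draw on a 12 GB card is freeing memory for a larger batch.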
Finetuning Duration
How long should I expect finetuning to take per character? Just so I can compare against my actual training times and check if my machine is underperforming due to driver or config issues.
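One practical way to answer the underperformance question yourself is to measure update throughput directly and extrapolate, rather than comparing wall-clock days across different configs. A small timing sketch, where `train_step` is a placeholder for one optimizer update in whatever loop you run:

```python
# Measure update throughput, then extrapolate total wall-clock time.
# `train_step` is a placeholder for one optimizer update in your loop.
import time

def benchmark(train_step, warmup=10, measure=50):
    for _ in range(warmup):          # let autotuning and caches settle
        train_step()
    start = time.perf_counter()
    for _ in range(measure):
        train_step()
    return measure / (time.perf_counter() - start)   # updates per second

# e.g. with ups = benchmark(train_step):
# hours_for_30k_updates = 30_000 / ups / 3600
```

Multiplying the planned total updates by 1/throughput gives a wall-clock estimate that is comparable across parameter sets.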
Any guidance would be highly appreciated! 🚀
Thanks in advance!