Finetune practice #57

SWivid · 2024-10-14T01:49:59Z

SWivid
Oct 14, 2024
Maintainer

Full fine-tuning is currently supported; LoRA and adapter fine-tuning are not yet.

Set checkpoint_path to the pretrained model dir in test_train.py; model/trainer.py will load from there to resume. Reuse the vocab.txt under data/Emilia_ZH_EN_pinyin (the directory name Emilia_ZH_EN_pinyin comes from the tokenizer = "pinyin" and dataset_name = "Emilia_ZH_EN" settings in test_train.py).
For preparing fine-tuning data, see model/dataset.py. Each sample just needs the audio path, the text (tokenized; use the convert_char_to_pinyin func in model/utils.py, see script/prepare_xxxx.py), and the duration of the audio in seconds.

def __getitem__(self, index):
    row = self.data[index]
    audio_path = row["audio_path"]
    text = row["text"]
    duration = row["duration"]
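
A minimal sketch of building one such row, assuming WAV input and using only the standard library's wave module to measure the duration. The names wav_duration_seconds, make_row and tokenize are hypothetical; in the repo the tokenization step would be convert_char_to_pinyin from model/utils.py, which is replaced here by an identity placeholder.

```python
import wave
from typing import Any, Callable, Dict


def wav_duration_seconds(audio_path: str) -> float:
    """Duration of a PCM WAV file: number of frames divided by the sample rate."""
    with wave.open(audio_path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def make_row(audio_path: str, text: str, tokenize: Callable[[str], Any] = lambda t: t) -> Dict[str, Any]:
    """Build one row with the three fields read in __getitem__ (tokenize is a placeholder)."""
    return {
        "audio_path": audio_path,
        "text": tokenize(text),
        "duration": wav_duration_seconds(audio_path),
    }
```

Other audio formats would need a different duration reader; the rest of the row stays the same.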

Set a smaller batch size according to your GPU memory. grad_accumulation_steps can be used to simulate a larger batch size. Also adjust other settings, e.g. few warmup steps, a 1e-4 lr, etc.

We didn't specifically experiment with fine-tuning, so if you get positive results, you're welcome to share :)

Some helpful issues: #16, #27

You're welcome to share your successful fine-tuning results, and maybe also start a new tutorial doc to help others get started.
Many thanks!

acul3 · 2024-10-14T08:03:05Z

acul3
Oct 14, 2024

Hello @SWivid, do you think it is possible to fine-tune the pretrained model on a new language?

Planning to add another language: Italian + English (to avoid catastrophic forgetting).

52 replies

yiwei0730 Dec 4, 2024

def is_japanese(c):
    return (
        "\u3040" <= c <= "\u309f"  # Hiragana
        or "\u30a0" <= c <= "\u30ff"  # Katakana
        or "\uff66" <= c <= "\uff9f"  # Half-width Katakana
    )

# ... then, in the branch of convert_char_to_pinyin that handles mixed text:
            else:  # if mixed chinese characters, alphabets and symbols
                for c in seg:
                    if ord(c) < 256:
                        char_list.extend(c)
                    elif is_japanese(c):
                        char_list.append(c)
                    else:
                        if c not in "。，、；：？！《》【】—…":
                            if not char_list or not is_japanese(char_list[-1]):
                                char_list.append(" ")
                            char_list.extend(lazy_pinyin(c, style=Style.TONE3, tone_sandhi=True))
                        else:  # if is zh punc
                            char_list.append(c)

Thanks for your repo, I see. So first I need to use pykakasi to preprocess my dataset and add this conversion code; then I can use the original vocab for fine-tuning. Is that correct? If I'm missing something, please tell me. Thanks!

yiwei0730 Dec 4, 2024

Yeah, the model should only see hiragana, whether that be at inference or training just because you'd probably need a LOT of data to generalize all those tokens in kanji (other readings etc, also make it suboptimal IMO).

You might find this useful if you format data in the same way the GUI does to convert train text files

https://github.com/JarodMica/tortoise_dataset_tools/blob/master/japanese_tools/hiragana_train_file.py

Thanks, this helps me a lot!

JarodMica Dec 4, 2024
Collaborator

About right, however, a small clarification is that you do not need that conversion code for training, you only need it for inference.

The steps would be:

1. Convert all kanji in your dataset to hiragana with pykakasi.
2. Train/fine-tune with the original code.

Then after training, you can implement the is_japanese function when you use convert to pinyin in inference
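
Before training, it can help to confirm that the converted transcripts really contain no kanji. A minimal sketch, reusing the kana ranges from the is_japanese snippet earlier in the thread. is_kana and find_kanji are hypothetical helpers, and treating U+4E00 to U+9FFF (the basic CJK Unified Ideographs block) as "kanji" is an assumption that ignores the rarer extension blocks.

```python
def is_kana(c: str) -> bool:
    """True for hiragana, katakana and half-width katakana (same ranges as is_japanese)."""
    return (
        "\u3040" <= c <= "\u309f"
        or "\u30a0" <= c <= "\u30ff"
        or "\uff66" <= c <= "\uff9f"
    )


def find_kanji(text: str) -> list:
    """Return the characters of text that fall in the basic CJK ideograph block."""
    # Assumption: U+4E00..U+9FFF is a good enough definition of "kanji" for a sanity check.
    return [c for c in text if "\u4e00" <= c <= "\u9fff"]


print(find_kanji("こんにちは世界"))  # ['世', '界']
print(find_kanji("こんにちは"))      # []
```

Any transcript for which find_kanji returns a non-empty list was probably not converted completely.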

yiwei0730 Dec 4, 2024

Yeah, got it!
Just preprocess the Japanese dataset yourself, run the fine-tune, and finally use the is_japanese function when running inference.

Demonmasterlqx Apr 17, 2025

Yeah, the model should only see hiragana, whether that be at inference or training just because you'd probably need a LOT of data to generalize all those tokens in kanji (other readings etc, also make it suboptimal IMO).

You might find this useful if you format data in the same way the GUI does to convert train text files

https://github.com/JarodMica/tortoise_dataset_tools/blob/master/japanese_tools/hiragana_train_file.py

Hi @JarodMica,
First off, thank you for your great work on the f5-tts Japanese model! I'm really impressed by it and very interested in what you've achieved.
I'm currently trying to fine-tune my own Japanese model based on f5-tts, but I've run into a few challenging issues. I was hoping you, or anyone else in the community with experience, might be able to offer some guidance.
Here are my specific questions:

Dataset Cleaning and Preparation: When preparing the dataset for fine-tuning, does the vocabulary list need to be identical to the one used in the base model? Also, are there any publicly available or recommended preprocessing scripts you could point me towards?
Reference Audio for Inference: To get good results during inference with the Japanese model, what kind of preprocessing or specific qualities should the reference audio ideally have? (For context, I downloaded your published model and tried running inference using the official code, but the generated audio quality was not ideal and sounded a bit garbled/chaotic).
Fine-tuning Data Volume: Approximately how much data (in terms of total duration, perhaps?) is generally required to achieve a reasonably good result when fine-tuning the model?

I'd be incredibly grateful for any insights or advice from @JarodMica or anyone else in the community who might have experience with these aspects of f5-tts for Japanese.
Thank you for your time and consideration, and thanks in advance for any help!

kunibald413 · 2024-10-16T11:52:34Z

kunibald413
Oct 16, 2024

create a dataset:

audio files of maybe 3 - 12s duration, i'm not sure what's good, and their transcripts

/your_dataset
|-- metadata.csv
|-- wavs/
|   |-- audio_0001.wav
|   |-- audio_0002.wav
|   `-- ...

metadata.csv contents:

<relative_path_to_wav>|<transcript>

audio_file|text
wavs/audio_0001.wav|Yo! Hello? Hello?
wavs/audio_0002.wav|Hi, how are you doing today? I want to go shopping and buy me some lemons.
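
A minimal sketch of writing metadata.csv in this pipe-delimited layout with the standard csv module. write_metadata is a hypothetical helper, not part of the repo; it only assumes the header row and column order shown above.

```python
import csv
from pathlib import Path


def write_metadata(dataset_dir, pairs):
    """Write metadata.csv: an 'audio_file|text' header, then one pipe-delimited row per clip."""
    dataset_dir = Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    out_path = dataset_dir / "metadata.csv"
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow(["audio_file", "text"])
        for rel_wav_path, transcript in pairs:
            writer.writerow([rel_wav_path, transcript])
    return out_path
```

Using csv.writer rather than string concatenation means a transcript that itself contains a `|` gets quoted, so a csv.reader with the same delimiter can still split it correctly.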

call script to prepare dataset
it doesn't handle other tokenizers, always assumes english dataset and pinyin, can adjust to your liking

python scripts/prepare_csv_wavs.py <path_to_your_dataset> <F5-TTS_repo_data_path>/<dataset_name>_pinyin

example:

python scripts/prepare_csv_wavs.py /my_pc/your_dataset /my_pc/F5-TTS/data/your_dataset_pinyin

adjust hyperparams in train.py

set dataset name to name of your dataset in f5-tts data folder

dataset_name = "your_dataset"

play around with these parameters and see what give the best results:

set max samples to 2, or whatever you see fit

max_samples = 2

also play around with learning rate, don't know which one is best

learning_rate = 5e-06

change epochs and warmup to whatever you see fit for your dataset
maybe for 100 audio files 10 epochs and 20 warmup steps is fine, i have no clue

epochs = 10  # use linear decay, thus epochs control the slope
num_warmup_updates = 20  # warmup steps

adjust this to your dataset size, eg for 100 audio files and 2 max samples, maybe 500
or add code to trainer to save final checkpoint after training is done

last_per_steps = 500  # save last checkpoint per steps
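
One way to read the "maybe 500" suggestion: if each batch holds at most max_samples clips, 100 clips need at least 50 batches per epoch, so 10 epochs come to at least 500 steps. A small sketch of that arithmetic; min_total_steps is a hypothetical helper, and it assumes grad_accumulation_steps = 1. Frame-based batching may fit fewer clips per batch, so the real step count can be higher.

```python
import math


def min_total_steps(num_clips: int, max_samples: int, epochs: int) -> int:
    """Lower bound on training steps when every batch holds at most max_samples clips."""
    batches_per_epoch = math.ceil(num_clips / max_samples)
    return batches_per_epoch * epochs


print(min_total_steps(100, 2, 10))  # 500
```

Setting last_per_steps at or below this bound makes it likely that a checkpoint is written before a short run ends.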

python train.py

hopefully we find good hyperparams for good finetuning results

could put prepare_csv_wavs.py into scripts folder @SWivid

it doesn't handle other tokenizers, always assumes english dataset and pinyin, can adjust to your liking

import sys, os
sys.path.append(os.getcwd())

from pathlib import Path
import json
import shutil
import argparse

from tqdm import tqdm
from datasets.arrow_writer import ArrowWriter

from model.utils import (
    convert_char_to_pinyin,
)

PRETRAINED_VOCAB_PATH = Path(__file__).parent.parent / "data/Emilia_ZH_EN_pinyin/vocab.txt"

def is_csv_wavs_format(input_dataset_dir):
    fpath = Path(input_dataset_dir)
    metadata = fpath / "metadata.csv"
    wavs = fpath / 'wavs'
    return metadata.exists() and metadata.is_file() and wavs.exists() and wavs.is_dir()


def prepare_csv_wavs_dir(input_dir):
    assert is_csv_wavs_format(input_dir), f"not csv_wavs format: {input_dir}"
    input_dir = Path(input_dir)
    metadata_path = input_dir / "metadata.csv"
    audio_path_text_pairs = read_audio_text_pairs(metadata_path.as_posix())

    sub_result, durations = [], []
    vocab_set = set()
    polyphone = True
    for audio_path, text in audio_path_text_pairs:
        if not Path(audio_path).exists():
            print(f"audio {audio_path} not found, skipping")
            continue
        audio_duration = get_audio_duration(audio_path)
        # assume tokenizer = "pinyin"  ("pinyin" | "char")
        text = convert_char_to_pinyin([text], polyphone=polyphone)[0]
        sub_result.append({"audio_path": audio_path, "text": text, "duration": audio_duration})
        durations.append(audio_duration)
        vocab_set.update(list(text))

    return sub_result, durations, vocab_set

def get_audio_duration(audio_path):
    import torchaudio
    # torchaudio.load returns a tensor of shape [channels, frames]
    audio, sample_rate = torchaudio.load(audio_path)
    # duration depends only on the frame count, not on the number of channels
    return audio.shape[1] / sample_rate

def read_audio_text_pairs(csv_file_path):
    import csv
    audio_text_pairs = []

    parent = Path(csv_file_path).parent
    with open(csv_file_path, mode='r', newline='', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile, delimiter='|')
        next(reader)  # Skip the header row
        for row in reader:
            if len(row) >= 2:
                audio_file = row[0].strip()  # First column: audio file path
                text = row[1].strip()          # Second column: text
                audio_file_path = parent / audio_file
                audio_text_pairs.append((audio_file_path.as_posix(), text))

    return audio_text_pairs


def save_prepped_dataset(out_dir, result, duration_list, text_vocab_set, is_finetune):
    out_dir = Path(out_dir)
    # save preprocessed dataset to disk
    out_dir.mkdir(exist_ok=True, parents=True)
    print(f"\nSaving to {out_dir} ...")

    # dataset = Dataset.from_dict({"audio_path": audio_path_list, "text": text_list, "duration": duration_list})  # oom
    # dataset.save_to_disk(f"data/{dataset_name}/raw", max_shard_size="2GB")
    raw_arrow_path = out_dir / "raw.arrow"
    with ArrowWriter(path=raw_arrow_path.as_posix(), writer_batch_size=1) as writer:
        for line in tqdm(result, desc="Writing to raw.arrow ..."):
            writer.write(line)

    # dup a json separately saving duration in case for DynamicBatchSampler ease
    dur_json_path = out_dir / "duration.json"
    with open(dur_json_path.as_posix(), 'w', encoding='utf-8') as f:
        json.dump({"duration": duration_list}, f, ensure_ascii=False)

    # vocab map, i.e. tokenizer
    # add alphabets and symbols (optional, if plan to ft on de/fr etc.)
    # if tokenizer == "pinyin":
    #     text_vocab_set.update([chr(i) for i in range(32, 127)] + [chr(i) for i in range(192, 256)])
    voca_out_path = out_dir / "vocab.txt"
    if is_finetune:
        # fine-tune: reuse the pretrained vocab.txt instead of the dataset's own
        file_vocab_finetune = PRETRAINED_VOCAB_PATH.as_posix()
        shutil.copy2(file_vocab_finetune, voca_out_path)
    else:
        # pretrain: write the vocab collected from this dataset
        with open(voca_out_path, "w", encoding="utf-8") as f:
            for vocab in sorted(text_vocab_set):
                f.write(vocab + "\n")

    dataset_name = out_dir.stem
    print(f"\nFor {dataset_name}, sample count: {len(result)}")
    print(f"For {dataset_name}, vocab size is: {len(text_vocab_set)}")
    print(f"For {dataset_name}, total {sum(duration_list)/3600:.2f} hours")


def prepare_and_save_set(inp_dir, out_dir, is_finetune: bool = True):
    if is_finetune:
        assert PRETRAINED_VOCAB_PATH.exists(), f"pretrained vocab.txt not found: {PRETRAINED_VOCAB_PATH}"
    sub_result, durations, vocab_set = prepare_csv_wavs_dir(inp_dir)
    save_prepped_dataset(out_dir, sub_result, durations, vocab_set, is_finetune)


def cli():
    # finetune: python script.py /path/to/input_dir /path/to/output_dir
    # pretrain: python script.py /path/to/input_dir /path/to/output_dir --pretrain
    parser = argparse.ArgumentParser(description="Prepare and save dataset.")
    parser.add_argument('inp_dir', type=str, help="Input directory containing the data.")
    parser.add_argument('out_dir', type=str, help="Output directory to save the prepared data.")
    parser.add_argument('--pretrain', action='store_true', help="Enable for new pretrain, otherwise is a fine-tune")

    args = parser.parse_args()

    prepare_and_save_set(args.inp_dir, args.out_dir, is_finetune=not args.pretrain)

if __name__ == "__main__":
    cli()

9 replies

kostum123 Nov 5, 2024

prepare_csv_wavs.py

While using the prepare_csv_wavs.py file, if our language characters are contained inside PRETRAINED_VOCAB_PATH = files("f5_tts").joinpath("../../data/Emilia_ZH_EN_pinyin/vocab.txt"), do we need to edit the prepare_csv_wavs code to make it work with languages other than English and Latin-based languages? Should we change the tokenizer to a character-based one, or is it acceptable to keep it as is and run python src/f5_tts/train/datasets/prepare_csv_wavs.py?
@SWivid @kunibald413

kunibald413 Nov 5, 2024

the csv_wavs script is mostly copy paste from prepare_emilia.py

this is how they deal with different tokenizers:

F5-TTS/src/f5_tts/train/datasets/prepare_emilia.py

Line 140 in 4a69e6b

if tokenizer == "pinyin":

i assume at least this line would need to change, it always assumes pinyin tokenizer:

F5-TTS/src/f5_tts/train/datasets/prepare_csv_wavs.py

Line 47 in 4a69e6b

# assume tokenizer = "pinyin"  ("pinyin" | "char")
text = convert_char_to_pinyin([text], polyphone=polyphone)[0]

i'm not sure about the implications and maybe there's more that needs to change.
currently I don't have much time looking into it, preferably SWivid fixes it up if he has time.

if you know what's going on there feel free to adjust accordingly and make a merge request @kostum123

TEJASAMA-007 Nov 8, 2024

Hi @kunibald413, I'm currently finetuning F5-TTS with gradio interface, it's really amazing. But while I'm trying to finetune it from the previous checkpoint, I'm facing the below issue:

finetune_cli.py: error: unrecognized arguments: --file_checkpoint_train /home/teja/voice-cloning/F5-TTS/ckpts/test_en/model_last.pt

Can you please help me with this. thank you.

HuuHuy227 Nov 8, 2024

Try changing --file_checkpoint_train into --pretrain in finetune_cli.py (Line 454). There is confusion about the argument between finetune_cli.py and finetune_gradio.py

Seigfried-PYM Mar 31, 2025

I have the same error.
Have you solved this problem? Please tell me how. Thank you very much.

lpscr · 2024-10-16T18:56:17Z

lpscr
Oct 16, 2024

@kunibald413 Thank you for the script. I've already created something similar here. #62 (comment)
2 days ago ...

Can you update this part? I think it would be nice to have it like this:

def format_seconds_to_hms(seconds):
    hours = int(seconds / 3600)
    minutes = int((seconds % 3600) / 60)
    seconds = seconds % 60
    return "{:02d}:{:02d}:{:02d}".format(hours, minutes, int(seconds))

# ... then, at the end of save_prepped_dataset, replace the summary prints with:
    print(f"\nFor {dataset_name}, sample count: {len(result)}")
    print(f"For {dataset_name}, vocab size is: {len(text_vocab_set)}")
    print(f"For {dataset_name}, total {format_seconds_to_hms(sum(duration_list))}")
    print(f"For {dataset_name}, min {min(duration_list)} sec")
    print(f"For {dataset_name}, max {max(duration_list)} sec")

before

For , sample count: 242
For , vocab size is: 53
For , total 0.20 hours

after

For , sample count: 242
For , vocab size is: 53
For , total 00:12:17 
For , min 1.519 sec
For , max 8.294 sec

1 reply

kunibald413 Oct 16, 2024

it's merged into main repo, feel free to adjust to your liking

mhenrichsen · 2024-10-16T20:52:16Z

mhenrichsen
Oct 16, 2024

The code @kunibald413 has provided works.

However, when training it seems to initialize from a model with random weights. Can we initialize from the trained model weights instead?

1 reply

thunn Oct 16, 2024

By default, the training script will look for an existing model in ckpts/<exp_name>/model_last.pt (exp_name set in the script)

If you place a model at that path, it will be loaded in as the base model
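
A minimal sketch of placing a pretrained checkpoint at that path before starting a fine-tune. seed_checkpoint is a hypothetical helper, and the ckpts/<exp_name>/model_last.pt layout is taken from the description above rather than checked against the trainer code.

```python
import shutil
from pathlib import Path


def seed_checkpoint(pretrained_ckpt, exp_name, ckpts_root="ckpts"):
    """Copy a pretrained checkpoint to <ckpts_root>/<exp_name>/model_last.pt unless one exists."""
    target = Path(ckpts_root) / exp_name / "model_last.pt"
    if target.exists():
        # Keep an existing checkpoint so a run in progress is not overwritten.
        return target
    src = Path(pretrained_ckpt)
    if not src.is_file():
        raise FileNotFoundError(f"pretrained checkpoint not found: {src}")
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, target)
    return target
```

Returning early when model_last.pt already exists means calling the helper again after an interrupted run resumes from the fine-tuned weights instead of resetting to the pretrained ones.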

jpgallegoar · 2024-10-16T22:44:44Z

jpgallegoar
Oct 16, 2024
Collaborator

Just started my Spanish fine-tune on the Facebook LibriSpeech dataset. Single 4090, so it will take a while.

5 replies

MithrilMan Oct 17, 2024

can you report back the time it takes?
I'm interested in an italian version and I've a 3090ti

jpgallegoar Oct 17, 2024
Collaborator

Unfortunately I did it wrong and have to start over. But I ran it for 1 epoch overnight, which took 6h 50m 53s for a batch size of 4,000 and 93,000 batches, reaching a loss of around 0.75. I will start over and report back when I have something to show for it.
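
For anyone comparing hardware, these figures work out to a little under 4 batches per second. A small sketch of the arithmetic; to_seconds is a hypothetical helper and the numbers are the ones reported in this reply.

```python
def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


elapsed = to_seconds(6, 50, 53)  # 24653 seconds for the epoch
batches = 93_000

print(f"{batches / elapsed:.2f} batches/s")            # 3.77 batches/s
print(f"{elapsed / batches * 1000:.0f} ms per batch")  # 265 ms per batch
```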

jpgallegoar Oct 18, 2024
Collaborator

Small update on the spanish finetune.

original.wav (in training data):
https://vocaroo.com/19oXO8sJm0WH

finetuned.wav (same input voice):
https://voca.ro/1aKKzX7pBhf3

Still much to go

anarucu Nov 19, 2024

Hello @jpgallegoar , thank you for the model in Spanish. I tried cloning a voice with a Cuban accent, and it’s not bad at all, even though it’s an accent you didn’t use during fine-tuning. I wonder if it’s possible to do fine-tuning starting from the model you trained... I only have 12 hours of Spanish with a Cuban accent.

lapc506 Feb 19, 2025

I would like to help you adding Costa Rican accent. How can I provide training data?

bensonbs · 2024-10-17T03:38:55Z

bensonbs
Oct 17, 2024

I am using a Chinese dataset (about 33hr) to fine-tune my model. The loss is continuously decreasing, and the generated voice tone is getting closer to the target. However, as the training steps increase, the pronunciation of words is becoming increasingly unclear.
It's like the following audio file:

model_12620.pt
https://voca.ro/11Ny6egSZ7zf
model_126200.pt
https://voca.ro/1cXdiNNM0zRt

params

exp_name = "F5TTS_Base"
learning_rate = 7.5e-5
batch_size_per_gpu = 38400/8
batch_size_type = "frame"
max_samples = 64
grad_accumulation_steps = 1
max_grad_norm = 1.

15 replies

bensonbs Oct 21, 2024

@jpgallegoar There are different accents in Chinese, and fine-tuning can also make the model's voice closer to the dataset.

bensonbs Oct 21, 2024

@charleypeng

I'm sorry, but my training dataset is private and cannot be provided.

jpgallegoar Oct 21, 2024
Collaborator

@jpgallegoar There are different accents in Chinese, and fine-tuning can also make the model's voice closer to the dataset.

Ah yes, thank you for the explanation

yc930401 Dec 2, 2024

Hello friend, I also want to fine-tune a Chinese model, because I think the voice generated by the current model is sometimes not very similar to the target voice. I want to train a generic model that, given a 5-10s reference clip from any person, can mimic that person's voice. May I ask whether your fine-tuned model gets better results at mimicking the reference voice? And how many speakers did you use, and how much audio for each? Thanks a lot!

tujie-jiangye May 26, 2025

Hello, may I inquire whether you have identified a solution to the issue? I have also encountered the same challenge and would greatly appreciate

lpscr · 2024-10-17T12:09:02Z

lpscr
Oct 17, 2024

Hi, I just created a Gradio interface that is user-friendly and accessible for beginners; you can see it here:

#143

Features

Transcription Tab: Easily transcribe audio files to create a dataset.
Dataset Preparation Tab: Prepare your dataset for training.
Training Tab:
    Select fine-tuning options.
    Automatically calculate settings, with the option to manually adjust them.
Reduction Tab: Convert your model from 5GB to 1.3GB.
Check Vocab: Check if it is possible to fine-tune in another language

0 replies

acul3 · 2024-10-17T18:02:11Z

acul3
Oct 17, 2024

Can confirm training also works.
After almost 3 days I finally got it to work.

I am training 3 languages: Indonesian, Italian, English.

eng: https://vocaroo.com/1mGEFlRNgouY
(you are likely overfitting. Impossible to know without evaluating. Bigger dataset of same quality is always better.)

italian: https://voca.ro/1l6SYplhnSxz

(Quattro imperdibili appuntamenti con l’Orchestra da Camera di Caserta e solisti internazionali.)

indonesia:
https://vocaroo.com/11e5OQQucQDY
(Joan Laporta mengumumkan Barcelona kini akhirnya meraih laba positif, sebesar dua belas juta euro.)

it even can do code switching (eng-indonesia):
https://vocaroo.com/1iZkXBo6vII5
(Sebenarnya sih gak juga, There's always something there which is a little bit different.)

using same config as train

19 replies

paulovasconcellos-hotmart Oct 28, 2024

Just to confirm: you used a dataset composed of English, Italian and Indonesian audio to fine-tune this model, right? Can you share how many hours you used for each language, and how long the clips are in seconds?

luterz Nov 4, 2024

Very good , when was integrated italian language in main f5-tts?

MithrilMan Nov 4, 2024

@leoiania finetuning after 3 days,

@leoiania i cannot release the weight because i use company data and hardware, but now i training from scratch using data and rent some gpu

will share the weight once complete

@acul3
did you have any news on the training?

sw-els Nov 11, 2024

@acul3 can you tell me how many epochs you used, especially for indo voice?

SyamsQ Dec 15, 2024

Could I have the fine-tuned model, please? @acul3

lpscr · 2024-10-17T23:35:52Z

lpscr
Oct 17, 2024

Hi, I was just wondering why you don't try training on a small dataset first instead of starting with a large one. I trained on only 40 hours of Greek plus 20 hours of English (LibriTTS-R), and it works fine and speaks very well. It took about half a day on a 4090, and after about 100k to 150k steps the model can speak both Greek and English very well, with great zero-shot results.

Try it and see if this works for you; I hope this helps.

3 replies

jpgallegoar Oct 17, 2024
Collaborator

Can you please elaborate? Why did you train in English again?

lpscr Oct 17, 2024

Because I want it to speak both English and Greek; for example, I give it English and it speaks Greek. As far as I can see, this method works fine; if I train only on Greek, it doesn't speak English well.

ppc2017 Oct 19, 2024

Can you share the greek model?

lpscr · 2024-10-17T23:55:29Z

lpscr
Oct 17, 2024

here the setting i use

learning_rate = 1e-5

batch_size_per_gpu = 1618  # 8 GPUs, 8 * 38400 = 307200
batch_size_type = "frame"  # "frame" or "sample"
max_samples = 64  # max sequences per batch if use frame-wise batch_size. we set 32 for small models, 64 for base models
grad_accumulation_steps = 1  # note: updates = steps / grad_accumulation_steps
max_grad_norm = 1.

epochs = 11  # use linear decay, thus epochs control the slope
num_warmup_updates = 500 # warmup steps
save_per_updates = 10000 # save checkpoint per steps
last_per_steps = 20000  # save last checkpoint per steps
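
With batch_size_type = "frame", the batch size counts mel frames rather than clips, so it can be converted to seconds of audio. A small sketch, assuming 24 kHz audio and a mel hop length of 256 samples; neither value is stated in this thread, so check your own config. frames_to_seconds is a hypothetical helper.

```python
def frames_to_seconds(num_frames: int, sample_rate: int = 24000, hop_length: int = 256) -> float:
    """Seconds of audio covered by num_frames mel frames (defaults are assumptions)."""
    return num_frames * hop_length / sample_rate


print(f"{frames_to_seconds(1618):.1f} s")   # 17.3 s
print(f"{frames_to_seconds(38400):.1f} s")  # 409.6 s
```

Under these assumptions, a 1618-frame batch holds about 17 seconds of audio in total, compared with roughly 410 seconds for a 38400-frame batch.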

3 replies

lpscr Oct 18, 2024

Here is a sample in Greek and English.
It first says in Greek: Καλώς ήρθατε όλοι στην μεγαλύτερη πόλη ψυχαγωγίας του κόσμου!
Then it says the same in English: welcome one and all to the world's greatest entertainment city!
Thank you @SWivid for supporting Greek symbols!

https://voca.ro/1kn7SqhCQjis

You see, my method works great ;)

BTW: I am working on fine-tuning and plan to support easy fine-tuning for other languages. I have some ideas and tips to add; you can see them in #143, already merged into the main repo.

justinjohn0306 Oct 18, 2024

@lpscr how did you load the model for inference? Since you changed the language from Chinese to Greek, the vocab.txt must have changed, so with the default config it would error out, right?

lpscr Oct 18, 2024

I have created my own API script class to make loading the model and other things easy.

But if you want something simple for now, just change this in inference-cli.py:

def load_model(repo_name, exp_name, model_cls, model_cfg, ckpt_step):
    ckpt_path = f"ckpts/{exp_name}/model_{ckpt_step}.pt" # .pt | .safetensors

Set ckpt_path to the .pt file of the model you fine-tuned; otherwise it takes the pretrained one!

lpscr · 2024-10-18T08:27:46Z

lpscr
Oct 18, 2024

Hi all, this is very important and might be confusing for some. You need to copy the original model
F5TTS_Base/model_1200000.pt into the folder where you are training for fine-tuning.

If you start training without copying this model, it will train from scratch!

I’ve created a script called finetune-cli.py that can automate this process. However, before running the script, you need to update all the settings accordingly.

Please make sure to do this before you start.

Or you can simply run the following:

https://github.com/SWivid/F5-TTS/blob/182b0f08e4cde7280996c4018575b4a80425754b/finetune-cli.py#L41C1-L59C86

To run it, change only the dataset name (my_speak here). On a 3090 with a dataset of about 60-80 hours this works well:
accelerate launch finetune-cli.py --exp_name F5TTS_Base --learning_rate 0.00001 --batch_size_per_gpu 1618 --batch_size_type frame --max_samples 64 --grad_accumulation_steps 1 --max_grad_norm 1 --epochs 10 --num_warmup_updates 500 --save_per_updates 10000 --last_per_steps 20000 --dataset_name my_speak --finetune True

For a 4090, as @JarodMica says, these settings work very well, also with a very big dataset:
batch_size_per_gpu = 4000
grad_accumulation_steps = 78
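
These two numbers roughly reproduce the multi-GPU batch mentioned in the config comment earlier in the thread (8 GPUs, 8 * 38400 = 307200 frames) on a single card. A small sketch of the arithmetic; effective_frames and accumulation_steps_for are hypothetical helpers, and the only assumption is that gradient accumulation multiplies the per-step batch.

```python
import math


def effective_frames(batch_size_per_gpu: int, grad_accumulation_steps: int, num_gpus: int = 1) -> int:
    """Frames contributing to one optimizer update."""
    return batch_size_per_gpu * grad_accumulation_steps * num_gpus


def accumulation_steps_for(target_frames: int, batch_size_per_gpu: int, num_gpus: int = 1) -> int:
    """Smallest number of accumulation steps that reaches target_frames per update."""
    return math.ceil(target_frames / (batch_size_per_gpu * num_gpus))


print(effective_frames(38400, 1, num_gpus=8))  # 307200
print(effective_frames(4000, 78))              # 312000
print(accumulation_steps_for(307200, 4000))    # 77
```

So 78 accumulation steps at 4000 frames land slightly above the 307200-frame reference. Note that with accumulation, the number of optimizer updates is the step count divided by grad_accumulation_steps.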

About the vocab: I didn't replace anything, because it supports all the symbols in the language I trained.

Make sure it supports all the symbols in the language you want to train; if symbols are missing, it will not work correctly.
I think you can replace unused symbols with the missing ones; is this correct @SWivid? And which symbols are unused, so they can be replaced safely?

Another idea, in case symbols are missing: you can simply convert all symbols to English characters.
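
A minimal sketch of such a check, assuming vocab.txt holds one token per line and that the transcripts are tokenized character by character. That holds for alphabetic scripts; Chinese text is first converted to pinyin tokens by the repo's tokenizer, which this sketch does not reproduce. load_vocab and missing_symbols are hypothetical helpers.

```python
def load_vocab(vocab_path):
    """Read vocab.txt into a set, one token per line (only the newline is stripped)."""
    with open(vocab_path, encoding="utf-8") as f:
        # rstrip("\n") rather than strip() so that a space token survives
        return {line.rstrip("\n") for line in f}


def missing_symbols(vocab, texts):
    """Return the sorted characters that occur in texts but not in vocab."""
    seen = set()
    for text in texts:
        seen.update(text)
    return sorted(seen - vocab)
```

An empty result suggests the pretrained vocab can be reused as-is; anything it returns would need to be mapped to existing tokens or handled some other way before fine-tuning.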

Here is how to check the vocab in finetune_gradio.py:

Make sure data/project_name/ contains the metadata.csv with all the text.
You also need to enter the same project name in Gradio.

That's why I made finetune_gradio.py, so that beginners don't get confused.
Also, like I said, I plan to make all of this automatic soon.

I hope this helps.

23 replies

JarodMica Oct 18, 2024
Collaborator

@jpgallegoar Ah! My bad, yes, I'm pretty much topped out on RAM if you had 32GB, but Chrome takes up 2 GB 💀 and VS Code another 2 GB in this screenshot.

However, I just stopped training to check and I'm idling at 20 GB used... I don't know what's taking it up exactly, but possibly some data wasn't cleared.

Anyway, you can try to reduce the num_workers being used; the default is 16.

Looking at lpscr's code, it isn't set here, so you can pass it:

trainer.train(train_dataset,
                  resumable_with_seed=666,  # seed for shuffling dataset
                  num_workers=1
                  )

This should help lower RAM usage, I think, as it should reduce how much data is prepared beforehand.

One more note: you don't want too few workers, or else your GPU won't saturate and it'll train a little slower, so it's a balancing game.

HuuHuy227 Oct 21, 2024

In my case, I encountered a problem with missing symbols. As you suggested, maybe changing out unused symbols could help, so does this mean I should add these missing symbols to the unused symbols (and what are they in the vocab.txt file)? I have my own tokenizer for my language, and if I use it, does that mean I need to train from scratch rather than fine-tune?

ApexArtist Oct 23, 2024

youtube tutorial

lpscr Oct 27, 2024

@jonnytracker check the gradio finetune i have also video tutorial #143

atlonxp Nov 12, 2024

@JarodMica I have the same issue of running out of RAM. Have you found the cause, and is there any solution for fixing it?

In my case it is on a supercomputer with 500GB of RAM; it still runs out of memory every time.

lpscr · 2024-10-18T14:13:53Z

lpscr
Oct 18, 2024

@jpgallegoar I'm trying to train in Spanish as an experiment. Let's see; this will take some hours. I just hope the dataset I'm using is okay, since I don't speak Spanish. I'll let you know soon.

16 replies

lpscr Oct 18, 2024

I just left it at 60k; I don't want to burn my GPU any more... I also need to run other tests. I tried training Spanish only to test things. As I see, you say it is now working for you too, that's great.

jpgallegoar Oct 18, 2024
Collaborator

Can you send me your model? Perhaps I can keep training it

lpscr Oct 18, 2024

lol, you have 1k of sound and I have only 20 hours. There is no point in comparing or training...

jpgallegoar Oct 18, 2024
Collaborator

yeah but end result is what matters no?

cristianosoy Oct 28, 2024

Hello, I am a native Spanish speaker, I would be honored to help you if you need it.

henriklied · 2024-10-18T18:57:15Z

henriklied
Oct 18, 2024

Given a large dataset, how important is it that the transcription is 1-1 with the source audio? The reason I ask, most of my datasets are built using a Whisper model, and they often do some text compression and correct misspoken words or stutter. Is this TTS-architecture forgiving for those kinds of variations or inconsistencies in transcription, or should I consider using a more verbose Whisper model for creating this dataset?

13 replies

jpgallegoar Oct 18, 2024
Collaborator

83C 388W

lpscr Oct 18, 2024

For me it's the same, 83C here. @jpgallegoar do you have a 4090?
I just wonder if it is safe to keep it like this when training for days... For example, I see the image @JarodMica posted shows 78C; is this the max you get on a 4090?

BTW:
There is an app called Afterburner by MSI where you can control the temperature. It makes training slower, because you change the power of your GPU, but this way it is healthier for the GPU to stay at around 75C. I haven't tested it yet; has anyone tested this?

jpgallegoar Oct 18, 2024
Collaborator

I read up to 90 is safe, so 83 should be fine for a few days. I will not do this forever but keep in mind some people mine crypto 24/7 for YEARS and it doesn't break. Same thing

lpscr Oct 18, 2024

Maybe I'll put this in Gradio, so you can see your GPU's memory and temperature. Right now the GPU is idle, that's why you see 46C.

jpgallegoar Oct 18, 2024
Collaborator

Oh yes, I didn't answer: I have a 4090.

@lpscr That looks cool, PR!

justinjohn0306 · 2024-10-18T22:11:19Z

justinjohn0306
Oct 18, 2024

Has anyone here tried finetuning the base model on a single speaker dataset? I tried finetuning with a 6 hr English dataset, but I don’t hear any difference after the training.

12 replies

jpgallegoar Oct 18, 2024
Collaborator

Change the line I told you to, you're not using the correct model yet

Oh, I thought you meant 6 hours of different speakers. If you train on 6 hours of a single person, that's only useful for generating audio of that specific person, and make it sound closer to them.

Yes but the issue is, it doesn't make any difference after the training which I find really odd.

justinjohn0306 Oct 18, 2024

Change the line I told you to, you're not using the correct model yet

Oh, I thought you meant 6 hours of different speakers. If you train on 6 hours of a single person, that's only useful for generating audio of that specific person, and make it sound closer to them.

Yes but the issue is, it doesn't make any difference after the training which I find really odd.

Actually it did use my finetuned model and yeah...I don't hear much difference:

Here's the audio generated using the base F5-TTS model: https://voca.ro/1hfgaoFTKcIi

Here's the audio generated using my finetuned model: https://voca.ro/1gT2MaNzZXW0

The reference audio: https://voca.ro/15snOJ8WmHF5

jpgallegoar Oct 18, 2024
Collaborator

The audio should be closer to 15 seconds, at least 10-12. The first mhm part does not get transcribed so for the model, it's a long silence. You're giving a 6 second audio which is 20% silence that's why it's so unnatural. You should use longer audio and splice together the sentences so there's not much silences in between.

Either way, the finetuned audio does sound close to the original voice to me, even if the silences didn't disappear (that can be fixed with a better input audio).
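To get a rough feel for how much of a reference clip is silence, a crude sketch like the following could be used. It assumes the audio is already loaded as a list of float samples in [-1, 1]; the `silence_ratio` helper, the frame size and the threshold are hypothetical choices, not values from F5-TTS:

```python
def silence_ratio(samples, frame_size=4, threshold=0.01):
    """Fraction of fixed-size frames whose peak amplitude stays below threshold."""
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    if not frames:
        return 0.0
    silent = sum(1 for frame in frames if max(abs(s) for s in frame) < threshold)
    return silent / len(frames)


# Toy clip: 12 silent samples followed by 48 "speech" samples.
clip = [0.0] * 12 + [0.5, -0.5] * 24
print(silence_ratio(clip))
```

For real audio the frame size would be on the order of tens of milliseconds, and a clip that scores high is a candidate for trimming or for splicing the spoken parts together.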

GUUser91 Oct 19, 2024

@jpgallegoar
Thanks for the tip. I've been trying to finetune a cartoon character with a Scottish accent.

The prompt is:
Baby coughing on a bus right as a needed tae cough so a nearly exploded hawdin it in cos a didny wanty look like the guy who copies babies.

Output files are from the finetune model

Old reference input audio(6 Seconds)
https://vocaroo.com/1iEAymADCOma
Output:
https://vocaroo.com/1jz9TcKrXHUH

Multiple Reference Input Audio Files Merged (13 Seconds)
https://vocaroo.com/16wuhY0yGHxg
Output
https://vocaroo.com/19yKnWO2byNu

S-T-K Feb 7, 2025

@justinjohn0306

I'm facing the same issue at the moment. Finetuning makes no discernable difference. Did you eventually find a way to make it work? I mean, the quality is fine as is, but I was hoping finetuning would make it even better.

jpgallegoar · 2024-10-18T22:24:49Z

jpgallegoar
Oct 18, 2024
Collaborator

After much testing, I'm gonna have to give up on the spanish finetune for now.
280k samples in and, although I have gotten a decent result which says 85% of the words correctly, it's unusable, since you need much more for an acceptable result. I'm attributing the failure in part to the model's poor transcription quality (I expect more from facebook) and in part to my own lack of skill in this regard. I am eager to give it another try if a better method is found, after careful revision of the dataset. Another mistake is that now the model has lost the capability to speak English (and I assume Chinese too).

Anyway, if anyone wants it, here is the model: Link

4 replies

zephirusgit Oct 30, 2024

Thanks for sharing, I'll see how it works. I was just looking into whether it would be possible to do a finetune in Spanish. I was wondering whether audio generated by Bark, for example, which one can "create", would be useful for that. In any case, I don't really understand the finetuning process yet; I would be fumbling in the dark, and I don't know whether my 12 GB RTX 2060 is good enough for this process. If you make new progress, I would be happy to try it, and likewise if there is anything I can do to help make one. Regards.

zephirusgit Oct 30, 2024

jpgallegoar, a question from absolute ignorance: how do you use the model_last.pt file? I see that in Pinokio the models are sent to the hub cache under an unreadable name.

jpgallegoar Oct 30, 2024
Collaborator

jpgallegoar, a question from absolute ignorance: how do you use the model_last.pt file? I see that in Pinokio the models are sent to the hub cache under an unreadable name.

For now it has to be hardcoded directly in load_model() in utils_infer.py. If we have several decent finetunes in the future, we can integrate them into the application.

jpgallegoar Oct 30, 2024
Collaborator

Thanks for sharing, I'll see how it works. I was just looking into whether it would be possible to do a finetune in Spanish. I was wondering whether audio generated by Bark, for example, which one can "create", would be useful for that. In any case, I don't really understand the finetuning process yet; I would be fumbling in the dark, and I don't know whether my 12 GB RTX 2060 is good enough for this process. If you make new progress, I would be happy to try it, and likewise if there is anything I can do to help make one. Regards.

For now I think the biggest problem is the data. It would be possible to train the model with Bark-generated audio, but keep in mind that the variability of that data would not be very high, so the finetune would not be flexible.

Finetuning needs more VRAM, but you can contribute by collecting high-quality datasets, for example.

jpgallegoar · 2025-01-16T10:25:19Z

jpgallegoar
Jan 16, 2025
Collaborator

Has anyone tested training on fp32 vs fp16 vs bf16? Is there a noticeable quality dropoff? Which is the best?

8 replies

jpgallegoar Jan 16, 2025
Collaborator

I have only trained with fp32 with 100h dataset.

I'm planning to increase my dataset and train on fp32. May I know how was the quality? For example:

1. Does it clone reference audio well?

2. Is the pronunciation accurate?

3. Does it skip words?

The truth is that 2. and 3. depend only on the quality of your transcriptions. If your transcriptions are 100% accurate and your reference audio, reference text and generated text are in the domain of your dataset (not training on a normal voice and then using a cartoon voice with fast speech, or things like that), the generated audio will be perfect. On 100h and 1.2 million steps, it's really, really good.

And number 1 depends on the variety of speech patterns of your dataset. If you have only 5 speakers, it won't be able to clone the voices very well, but if you have 100+ speakers with different voices or something like that, it will learn to generalize and clone any voice. (Again, if you train only on normal voices, don't expect the model to clone a very high pitched cartoon voice, for example)

Alykasym Jan 22, 2025

Hi! Just wanted to share some findings from my recent experiments.

I tested the impact of different batch sizes on the quality of speech synthesis. The dataset I used is a 35-hour, high-quality, multi-speaker dataset with accurate annotations. I fine-tuned two models with identical configurations, only changing the batch size: one with a batch size of 3000 and the other with 500. Both were trained for 50 epochs using a learning rate of 1e-5.

Surprisingly, the model trained with a batch size of 500 produced more accurate speech compared to the one with a batch size of 3000. The 500 batch size run left more than half of my VRAM unused during fine-tuning and took slightly longer, but the improved results made it worth the trade-off.

I’m not entirely sure why this happened, as everything else about the models was identical. I plan to dig into the code and experiment further when I have more time, but for now, I thought I’d share these results in case anyone finds them useful.

jpgallegoar Jan 22, 2025
Collaborator

I just think the run with the higher batch size needs longer to learn. I am absolutely certain the ceiling of a higher batch size is higher than that of a lower batch size (I spent hundreds renting H100 / H200, and locally a 4090). If you give it enough time, it will improve. For reference, my 50h dataset was trained for 1300 epochs at a batch size of 12000. I'm pretty sure it's 99-100% accurate on a new language.

hcsolakoglu Jan 30, 2025

@Alykasym smaller batch size provides more optimization opportunities because the number of steps increases. This is partly why you may see better results compared to a larger batch size.

goranskular Feb 5, 2025

Surprisingly, the model trained with a batch size of 500 produced more accurate speech compared to the one with a batch size of 3000.... Both were trained for 50 epochs using a learning rate of 1e-5....

When increasing the batch size, you should typically increase the learning rate as well. A common rule is linear scaling: multiply the learning rate by the ratio of the new batch size to the old one. This compensates for the reduced gradient noise in larger batches. If you don't adjust it, training might slow down or become unstable. That may be why...
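The linear scaling rule mentioned above can be sketched as follows. The `scaled_learning_rate` helper is hypothetical, and the rule is a heuristic starting point rather than something verified for F5-TTS finetuning:

```python
def scaled_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: the learning rate grows with the batch-size ratio."""
    return base_lr * new_batch_size / base_batch_size


# Numbers from the experiment above: 1e-5 was used at both 500 and 3000.
# Under linear scaling, the 3000 run would use roughly 6e-5 instead.
print(scaled_learning_rate(1e-5, 500, 3000))
```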

holycowdude · 2025-01-17T12:18:50Z

holycowdude
Jan 17, 2025

I'm finetuning models with F5-TTS via Pinokio but i'm struggling to identify how to use the models i've trained
Please can someone help?

Would a kind person possibly update the Gradio UI for Pinokio and add an ability to automatically find and be able to select any of the finetuned models that have been trained / created to make it easy please? 😊

1 reply

sarpba Jan 22, 2025

@holycowdude The easiest way is to use ComfyUI with this custom node: https://github.com/niknah/ComfyUI-F5-TTS
or my batch script from here: https://github.com/sarpba/F5-TTS_scripts

firstpixel · 2025-01-19T15:08:00Z

firstpixel
Jan 19, 2025

I'm training on 200 hrs of pt-br data, reaching 1M steps, using Google Colab, half with an A100 and half with a T4, but it is still not perfect. It is actually doing some inference, but with some misspellings, and numbers just do not work.
It also seems to be worse if the sample is longer than 6 s; if the sample is longer than 10 s, it becomes a mess.
The numbers issue is easy: I can just use a Python script to convert numbers to words, and that will work. But for the misspellings, I think a finetune is needed.

Is it possible to finetune it with a new dataset containing only numbers and the misspelled words? Will it destroy the previous training?
Has anyone tried a finetune to fix such issues?
Or should I keep training it longer on the same dataset, just adding more samples for the corner cases?

12 replies

lumpidu Jan 19, 2025

No need to regenerate the data. You could just concatenate your audios if spoken by the same speaker. It doesn't matter, if the audio contains 3 spoken sentences or 1 spoken long sentence. Just make the silence padding consistent between sentences when concatenating. You could also aim for a normal distribution of samples between 3-30 secs.
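A minimal sketch of concatenating same-speaker utterances with a consistent silence gap, assuming each utterance is already a list of float samples at the same sample rate. The `concat_with_padding` helper and the 0.3 s gap are hypothetical choices:

```python
def concat_with_padding(utterances, sample_rate, pad_seconds=0.3):
    """Join utterances, inserting the same amount of silence between each pair."""
    gap = [0.0] * int(round(pad_seconds * sample_rate))
    joined = []
    for i, utterance in enumerate(utterances):
        if i > 0:
            joined.extend(gap)  # identical gap between every pair of utterances
        joined.extend(utterance)
    return joined


# Toy example at a sample rate of 10 Hz: three utterances of 5, 4 and 6 samples.
utterances = [[0.1] * 5, [0.2] * 4, [0.3] * 6]
merged = concat_with_padding(utterances, sample_rate=10)
print(len(merged))
```

The matching transcript would be the individual texts joined in the same order.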

jpgallegoar Jan 19, 2025
Collaborator

Yes but if he has the long audios, and the transcriptions were for the long audios, it's better to resplit them to avoid unnatural timings and artifacts when rejoining them.

lumpidu Jan 20, 2025

The single splits should be "logical units", i.e. independent utterances that make sense standalone. But it's okay to have 3 such utterances concatenated together, like:

"The weather seemed fine"
"Current stocks went down by 15%".
"We just put the blame on that region where bad stuff happens"

firstpixel Jan 30, 2025

Another question: if I want to make it pt-br + en, can I use a single dataset with both languages? Will it be able to speak both, or is multi-language training different? For pt-br I used the same vocab.txt as the original. I want to add English to it, as many words in Portuguese are English words, especially in the tech industry.

lumpidu Feb 7, 2025

Yes, but add a good amount (e.g. LJSpeech). It will have a Portuguese accent, though.

sarpba · 2025-01-21T19:41:07Z

sarpba
Jan 21, 2025

Hello, I would like to re-finetune the Hungarian model from the original. I collected about 2600 hours of ultra-clear audio from about 50 speakers. Unfortunately, the average audio length is around 5 seconds for me as well. Is there an ideal curve for the distribution of audio lengths? The amount of data is abundant, so I can select a dataset corresponding to an ideal distribution curve.

edit: I think i need to reroll my dataset, it's worse than I thought.

20 replies

jpgallegoar Jan 22, 2025
Collaborator

The question is whether the higher batch size achieved with gradient accumulation is equivalent to the large batch_size achieved in 1 step with more VRAM.

Unfortunately I don't think anyone has made that test yet.

sarpba Jan 22, 2025

@jpgallegoar @sch0ngut Here are the scripts: https://github.com/sarpba/ADCS I don't have more time right now. They work, but they are not very polished (GPT-translated), and the last scripts are missing (trash_dropout, numbers_drouout, csv_maker). I'll continue Friday night. Under Windows you need to use WSL.

I ran a quick test: I processed 20 hours of audio in 35 minutes, which yielded about 15 hours of usable data (2x3090).

sch0ngut Jan 23, 2025

Awesome, thank you!

campar Feb 8, 2025

@sarpba What software/library did you use to create that kind of graph for distribution of audio durations?

sarpba Feb 8, 2025

@campar It's a python script from my Audio Database Creator Scripts https://github.com/sarpba/ADCS/blob/main/statistics_scripts/statistics_histogram.py

kdcyberdude · 2025-01-24T19:32:03Z

kdcyberdude
Jan 24, 2025

Has anyone been able to train this on a multi-4090 GPU setup (2 or more)?

I am getting this - #728 (comment)

1 reply

jpgallegoar Feb 3, 2025
Collaborator

I was able to do it, but it was from Replicate

jpgallegoar · 2025-02-03T09:41:15Z

jpgallegoar
Feb 3, 2025
Collaborator

Has anyone tried parallelized training with multi GPUs? I mean getting parallel performance, not only more VRAM. Is it even possible?

11 replies

hcsolakoglu Feb 4, 2025

Since NVIDIA's consumer GPUs, such as the 4090 and 3090, do not support P2P communication, they may not provide significant speedup in multi-GPU training. Instead, you could try training with NVIDIA's data center GPUs. Rather than using 4x 4090, you might consider renting a single H100 or A100. @jpgallegoar

jpgallegoar Feb 4, 2025
Collaborator

@hcsolakoglu Thank you for your answer, that is in fact what I ended up doing. I even rented H200 because I calculated it was the most efficient.

I want to spend the same amount of money (or a little bit more) and train in less time via parallelization.

Do you think it would work with these server GPUs?

sarpba Feb 4, 2025

@hcsolakoglu @jpgallegoar
I don't understand why you are waiting for a speedup.

Theoretically:
1xGPU - 3200 batch size -> 5 updates/s
4xGPU - 3200 batch size per GPU, overall batch size 12800 -> 5 updates/s, but with a 4x batch
If you want a speedup, then use:
4xGPU - 800 batch size per GPU, overall batch size 3200 -> around 20 updates/s

So there is an increase in speed, just in a different way.
But I noticed that at low batch sizes it throws away most of the training data, leaving barely anything.

My training run with a 3200 batch size on 2x3090 GPUs (NVLink connected), 6400 overall batch size -> I get 7-8 updates/s (power-limited to 280 W)

jpgallegoar Feb 4, 2025
Collaborator

Thanks for the answer, the speedup would be nice for commercial purposes. I did 9600 batch size and the results were very good indeed. Perhaps I am confused and it was speeding up, I will have to test again.

mame82 Feb 19, 2025

@hcsolakoglu @jpgallegoar I don't understand why you are waiting for a speedup.

Theoretically:
1xGPU - 3200 batch size -> 5 updates/s
4xGPU - 3200 batch size per GPU, overall batch size 12800 -> 5 updates/s, but with a 4x batch
If you want a speedup, then use:
4xGPU - 800 batch size per GPU, overall batch size 3200 -> around 20 updates/s

So there is an increase in speed, just in a different way. But I noticed that at low batch sizes it throws away most of the training data, leaving barely anything.

My training run with a 3200 batch size on 2x3090 GPUs (NVLink connected), 6400 overall batch size -> I get 7-8 updates/s (power-limited to 280 W)

Makes sense, because (if you use frames for the batch limit):

batch length in seconds = num frames per batch * mel_hop_length / sample_rate

For batch size 800, your maximum sample length per batch is 800 * 256 / 24000 = 8.53 seconds.

Every sample not fitting in a batch (>8.5 s) would be dropped. Also, two samples of 5 s duration would need two separate batches, as they don't fit into one.

In other words: to account for training samples up to 30 s, the minimum batch size would be about 2813 (30 * 24000 / 256 = 2812.5). On low-VRAM hardware, this could still be achieved with AdamW8 (BNB). I assume that's also the reason why many folks complain that finetunes struggle with longer audio output: the training batch size is too low, so relevant training data is disregarded.

For large-VRAM setups with small training sets (~1k hours) I tried to keep batch size and updates per epoch balanced. With AdamW8, batch sizes >15k are possible with 24 GB VRAM, but there will be far fewer updates per epoch with small sample sets. The sweet spot should be to keep updates per epoch and batch size balanced.
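The frame arithmetic above can be written out as a small sketch. The two helper names are hypothetical; the defaults of hop length 256 and sample rate 24000 are the values quoted in this thread:

```python
import math


def max_sample_seconds(batch_frames, hop_length=256, sample_rate=24000):
    """Longest audio clip (in seconds) that fits into one frame-based batch."""
    return batch_frames * hop_length / sample_rate


def min_batch_frames(max_seconds, hop_length=256, sample_rate=24000):
    """Smallest frame batch size that still fits a clip of max_seconds."""
    return math.ceil(max_seconds * sample_rate / hop_length)


print(round(max_sample_seconds(800), 2))   # batch size 800
print(round(max_sample_seconds(1600), 2))  # batch size 1600
print(min_batch_frames(30))                # clips up to 30 s
```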

mechasoul · 2025-02-08T15:15:55Z

mechasoul
Feb 8, 2025

quick note that might be relevant for anyone trying to finetune with low VRAM: if you're using frame batching, then batch size acts as a maximum frame length for any samples in your dataset. that is, any audio samples with frame length greater than batch size are dropped altogether and are not used during the finetuning.

this doesn't matter with like batch size >= 3200 since training samples shouldn't be over 30s anyway, but if you're running, say, 1600 or 2000 batch size (hence probably most relevant to people with low VRAM), you're probably dropping some stuff from your dataset.

i suppose this should be obvious in retrospect, but as someone who isn't knowledgeable about what some of these parameters actually mean, i think it's an easy oversight to make (i didn't even notice until i was experimenting with batch size <1000 and suddenly got a division by zero exception somewhere as a consequence of my entire dataset being greater than the batch size...)
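A quick way to see how much of a dataset a given frame batch size would drop is sketched below. It assumes a clip is kept only if its frame count is at most the batch size, which is an assumption about the sampler, not a copy of the F5-TTS code. The `split_by_frame_budget` helper is hypothetical:

```python
def split_by_frame_budget(durations, batch_frames, hop_length=256, sample_rate=24000):
    """Split clip durations (seconds) into those that fit a batch and those that don't."""
    kept, dropped = [], []
    for seconds in durations:
        frames = seconds * sample_rate / hop_length
        if frames <= batch_frames:
            kept.append(seconds)
        else:
            dropped.append(seconds)
    return kept, dropped


# Hypothetical clip durations in seconds.
durations = [4.0, 9.0, 12.5, 18.0, 25.0]
for batch_frames in (800, 1600, 3200):
    kept, dropped = split_by_frame_budget(durations, batch_frames)
    print(batch_frames, "kept:", kept, "dropped:", dropped)
```

If the kept list comes back empty, every clip exceeds the batch size, which matches the situation described above where training failed with a division by zero.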

2 replies

sarpba Feb 8, 2025

@mechasoul The question is: are these vocos frames, or something else? I ask just because, with 256 as the default vocos sample rate, that would be 3840 frames / 15 sec.

mame82 Feb 19, 2025

Explained it there: #57 (reply in thread)

miria00 · 2025-03-03T03:13:48Z

miria00
Mar 3, 2025

Is it possible to finetune this model from the F5TTS_Base checkpoint with around 166 samples of training data? Each clip around 10sec long, normalized, re-sampled to 24000, noise-reduced. At update 1100 the gen still sounds like white noise, and loss is not decreasing. Samples uploaded. Yaml file looks like:

hydra:
  run:
    dir: blahblah

datasets:
  name: outputPF5 # Custom
  path: dataset path
  batch_size_per_gpu: 4
  batch_size_type: sample
  max_samples: 4
  num_workers: 4

optim:
  epochs: 20000
  learning_rate: 5e-9
  num_warmup_updates: 10
  grad_accumulation_steps: 4
  max_grad_norm: 0.5
  bnb_optimizer: False

model:
  name: F5TTS_Base
  tokenizer: custom
  tokenizer_path: path to tokenizer ---- vocab.txt
  arch:
    dim: 1024
    depth: 22
    heads: 16
    ff_mult: 2
    text_dim: 512
    conv_layers: 4
    checkpoint_activations: False
  mel_spec:
    target_sample_rate: 24000
    n_mel_channels: 100
    hop_length: 256
    win_length: 1024
    n_fft: 1024
    mel_spec_type: vocos # 'vocos' or 'bigvgan'
  vocoder:
    is_local: False
    local_path: None

ckpts:
  logger: wandb #
  save_per_updates: 50
  keep_last_n_checkpoints: 8
  last_per_updates: 5000
  save_dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}

The custom training samples and the metadata CSV file were generated with prepare_csv_wavs.py. Are there any improvements I can make in the YAML file so that this finetuned model can sound as good as the zero-shot examples? Thanks a lot!

0 replies

ohsisi · 2025-03-26T20:59:55Z

ohsisi
Mar 26, 2025

Hi @jpgallegoar, I tried downloading your Spanish .safetensors and vocab files and overwriting the ones in the 1.08 release with them. (I changed the file name from model_1200000 to model_1250000.) It loaded and ran a Spanish test, but I only got gibberish as the audio result. Do you know what the reason could be? Thanks in advance.

2 replies

SWivid Mar 27, 2025
Maintainer Author

check inference readme, especially for custom config setting

ohsisi Mar 27, 2025

I found the config. It's working now. Thank you very much @SWivid .

Amirmohammadpiran · 2025-05-10T11:51:34Z

Amirmohammadpiran
May 10, 2025

I tried finetuning the model for Persian data. I changed vocab.txt to include persian characters and used a 'custom' tokenizer with this new vocab.txt. My results weren't promising and it could be because of the lack of data (only 105 hours) or insufficient training (1.2 million steps).
The parameters that I used:

batch_size = 4200
grad_accum=1,
epochs=150
learning_rate = 7.5e-5

Am I missing something else?

2 replies

eingrid May 13, 2025

What is the length distribution of your dataset? I believe with 105 hours you should get somewhat acceptable results .

I am currently training on +- 140 hrs of data, but my data is skewed to lower durations and on lower duration it generates okay (up to 8 seconds no hallucinations, 10-12 sometimes hallucinates), but on longer it struggles.

Here is my distribution :

🎵 Total audio files: 60242
🕒 Total duration: 137:40:51

📊 Duration statistics:

15-20 sec       → 5548 files, 26:26:27
10–15 sec      → 16504 files, 54:37:17
6–10 sec       → 16520 files, 34:46:15
< 6 sec        → 21670 files, 21:50:51

Currently I think about stitching multiple smaller audios by the same speaker into single one and using this in dataset to get more samples with longer durations. However I am very limited with VRAM so my batch_size is maximum 1600, so samples longer than ~17 seconds aren't really used, but that is not issue for you.
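A duration breakdown like the one above can be produced with a few lines of Python. This is a sketch assuming the clip durations in seconds are already known; the `bucket_durations` helper is hypothetical and the bucket edges simply mirror the table above:

```python
def bucket_durations(durations, edges=(6, 10, 15, 20)):
    """Count clips and total seconds per duration bucket."""
    labels = [f"< {edges[0]} s"]
    labels += [f"{lo}-{hi} s" for lo, hi in zip(edges, edges[1:])]
    labels.append(f">= {edges[-1]} s")
    stats = {label: [0, 0.0] for label in labels}  # label -> [file count, total seconds]
    for seconds in durations:
        index = sum(seconds >= edge for edge in edges)  # number of edges reached
        stats[labels[index]][0] += 1
        stats[labels[index]][1] += seconds
    return stats


# Hypothetical durations in seconds.
durations = [2.5, 5.9, 6.0, 9.2, 12.0, 14.9, 17.3]
for label, (count, total) in bucket_durations(durations).items():
    print(f"{label:>10} -> {count} files, {total:.1f} s")
```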

Amirmohammadpiran May 14, 2025

I analyzed my dataset and realized my dataset is also skewed on lower durations. Considering 151254 samples in general, I have around 80000 samples with 1-2 sec length, around 56000 samples with 2-4 sec length and the rest of the samples are longer and are 6 sec long at most. My GPU is 4090 RTX with 22GB VRAM which allows for larger batch sizes as you mentioned. Either way, Should I also stitch some samples together or is it okay the way it is?

powerumble · 2025-05-14T08:18:41Z

powerumble
May 14, 2025

How do you structure the dataset folder or metadata.csv when you have multiple speakers and want to train a new language? There is no documentation regarding that. Should I just put all the training data into the wavs folder with all the speakers and proceed with the training? Like mentioned in #57 (comment)

There are a lot of training data for many different languages on huggingface, they all show a speaker_ID in their database (sql browser), but when it comes to training with those data, there is no information about how to structure the folder or the metadata.csv

1 reply

eingrid May 14, 2025

-Dataset/
-- wavs/
-- metadata.csv

metadata.csv must have path to audio and text, speaker_id is not mandatory and as I understand is not used even if provided.
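A minimal sketch of writing such a metadata.csv for a multi-speaker dataset. The pipe delimiter, the `audio_file|text` header and the file names are assumptions for illustration; check them against the preparation script you actually use. All speakers simply go into the same file:

```python
import csv
import io

# Hypothetical rows: (path to audio, transcript). No speaker column is needed.
rows = [
    ("wavs/speaker_a_0001.wav", "The weather seemed fine."),
    ("wavs/speaker_b_0001.wav", "Current stocks went down."),
]


def write_metadata(rows, fileobj, delimiter="|"):
    """Write a header line followed by one line per audio clip."""
    writer = csv.writer(fileobj, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["audio_file", "text"])  # assumed header names
    writer.writerows(rows)


buffer = io.StringIO()  # stands in for open("Dataset/metadata.csv", "w", newline="")
write_metadata(rows, buffer)
print(buffer.getvalue())
```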

isolveit-aps · 2025-05-16T07:26:43Z

isolveit-aps
May 16, 2025

I think this model trains pretty well on multi-speaker datasets - even with only 100 hours of data.
I have trained a model that performs quite well, but there's not too much data (around 110 hours), and some of that training data is actually rubbish (pauses, wrong transcripts or repeated audio that is not in the transcript, etc.).

I have recorded 10s of hours of my own speech, quality microphone, no errors in transcripts/audio, and lengths are from 5-20 seconds. I also have some other speakers that have a few hours of audio also, and then I have 400 speakers that have around 20-50 minutes each.

In theory, would training on my own speech skew the model away from performing well on the other speakers (my voice would then be maybe 30% of the overall audio)? Or would it just improve more if I add my 10s of hours to the mix? And does shuffling the dataset give any better results when training, when my own recordings are shuffled in with the other 500 speakers?

I don't care as much about the voice-cloning part of the tts - I just really want a clear pronunciation in my language, so I don't mind the model being skewed towards my voice, if only it has good pronunciation.

0 replies

rheadoshi · 2025-05-17T09:49:23Z

rheadoshi
May 17, 2025

So I am trying to finetune on my dataset which contains the Indian English accent. I have around 60 hours of audio data. Although the inference audio is clear, no noise, the audio that is generated does not resemble English at all. Any ideas so as to why this is happening?

I made minimal changes to the hyperparameters while finetuning. And followed the procedure exactly as provided in the documentation.

0 replies

shekharmeena2896 · 2025-05-27T07:38:51Z

shekharmeena2896
May 27, 2025

Hi, I was trying to finetune the pretrained weights of a Hindi model. I changed the path in train.py (checkpoint_paths) to the pretrained directory, and my config file is:

hydra:
  run:
    dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}/${now:%Y-%m-%d}/${now:%H-%M-%S}

datasets:
  name: hindi_dataset_custom
  batch_size_per_gpu: 3200
  batch_size_type: frame
  max_samples: 64
  num_workers: 8 # Reduced from 16 to be safe

optim:
  epochs: 500
  learning_rate: 1e-5
  num_warmup_updates: 500
  grad_accumulation_steps: 1
  max_grad_norm: 1.0
  bnb_optimizer: False

model:
  name: F5TTS_Hindi_Finetune
  tokenizer: custom # Using custom for Hindi
  tokenizer_path: /home/ubuntu/F5-TTS/data/hindi_dataset_custom/vocab.txt # Path to your Hindi vocab
  backbone: DiT # This was missing in your config!
  arch:
    dim: 768
    depth: 18
    heads: 12
    ff_mult: 2
    text_dim: 512
    text_mask_padding: False
    conv_layers: 4
    pe_attn_head: 1
    checkpoint_activations: True
  mel_spec:
    target_sample_rate: 24000
    n_mel_channels: 100
    hop_length: 256
    win_length: 1024
    n_fft: 1024
    mel_spec_type: vocos
  vocoder:
    is_local: False
    local_path: null

ckpts:
  logger: wandb # Set to null for simpler setup, or use wandb/tensorboard if desired
  log_samples: True
  save_per_updates: 5000 # Reduced from 50000 for more frequent saves
  keep_last_n_checkpoints: 3 # Keep only the last 3 checkpoints to save space
  last_per_updates: 500
  save_dir: ckpts/my_hindi_finetune_run # Path to your pretrained model
(f5tts) root@t1-le-45-gra7:/home/ubuntu/F5-TTS/src/f5_tts/configs#

The training finishes instantly. I am not able to understand what mistake I am making. Below is the output:

(f5tts) root@t1-le-45-gra7:/home/ubuntu/F5-TTS/src/f5_tts/configs# cd ../../..
(f5tts) root@t1-le-45-gra7:/home/ubuntu/F5-TTS# tail -f nohup.out
Loading dataset ...
Download Vocos from huggingface charactr/vocos-mel-24khz
Sorting with sampler... if slow, check whether dataset is provided with duration: 100%|██████████| 5900/5900 [00:00<00:00, 1259166.21it/s]
Creating dynamic batches with 3200 audio frames per gpu: 100%|██████████| 5900/5900 [00:00<00:00, 1693798.33it/s]
wandb:
wandb: 🚀 View run F5TTS_Hindi_Finetune_vocos_custom_hindi_dataset_custom at: https://wandb.ai/devteeoff-teeoff-technologies/CFM-TTS/runs/vtuhvwdi
wandb: ⭐️ View project at: https://wandb.ai/devteeoff-teeoff-technologies/CFM-TTS
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20250527_073442-vtuhvwdi/logs
Saved last checkpoint at update 2500000

1 reply

sagaradjay Sep 17, 2025

Your old checkpoint gets loaded automatically, remember to delete it before rerunning.

odunola499 · 2025-10-31T22:27:50Z

odunola499
Oct 31, 2025

Hello! Super awesome model and repo!
I was able to achieve good results for voice cloning and voice effects with my LoRA implementation for the model. This works for the english pretrained checkpoint and adds low-rank parameters to all linear layers in the model. Please check it out and let me know what you think!

f5lora github repo

1 reply

SWivid Nov 1, 2025
Maintainer Author

cc @ZhikangNiu @Jerrister @lzlyz

Replies: 97 comments · 840 replies