I am only getting babbling and stutterd voice after fine tuning process. #2298

hjkaddict · 2023-01-18T16:06:45Z

hjkaddict
Jan 18, 2023

Hello everyone,
even though Vall-E is about to dominate the world, I would not follow their trend, rather keep playing with coquiTTS.

I posted this one a few days ago but no one answered.. I think the question itself was a full of ignorance. In the meantime, I found tensorboard to monitor the progress. Now I would like to give you more information about my practice.

I set up the Coqui TTS training environment on Jetson Xavier AGX and my goal is to clone my voice at this point. Because of a lack of my voice recording data, I first wanted to do the fine-tuning to see the performance. So I started to finetune the ljspeech(english, female) glow-tts model with my total 70 minutes of recordings (english, korean accent, male), 1100 segments where each duration ranges between 2 to 8 seconds. I put the configuration at the bottom of this post.

However the result I got is quite depressive. One good news is that it kind of mimiking a timbre of my voice, but no one can recognize any single word, it is just me babbling sound.. I also want to share the tensorboard information below.

I can think of two reasons.

1. not enough amount of recordings in my voice dataset? or training epochs?
A total length of my dataset is around 1h 10m. Of course it is not enough but I saw that 1h-2h of recordings is sufficient for fine tuning. Another assumption is that if the training time is long enough. I did 1000 epochs and get the value of -0.03133 in "avg_loss" (well, this negative value means something bad?), which for me seems quite low. Do you think do I need to train more from there?

2. unmatched audio configuration between ljspeech dataset and my dataset
what i found out is that this pre-trained ljspeech model is trained with the recordings that has 22050 sampling rate. But the my voice dataset I fine tuned on are all 16000. Do you think that would be critical reason? I also wonder if I should match all the audio configuration parameters same as the one of pre-trained model.

It would be appreciated if you could just throw me any kinds of tips or advices.
Thanks for your help in advance!

{
"output_path": "/data/TTS",
"logger_uri": null,
"run_name": "ft-on-hjk-silence-reading",
"project_name": null,
"run_description": "\ud83d\udc38Coqui trainer run.",
"print_step": 25,
"plot_step": 100,
"model_param_stats": false,
"wandb_entity": null,
"dashboard_logger": "tensorboard",
"log_model_step": 10000,
"save_step": 10000,
"save_n_checkpoints": 5,
"save_checkpoints": true,
"save_all_best": false,
"save_best_after": 10000,
"target_loss": null,
"print_eval": false,
"test_delay_epochs": -1,
"run_eval": true,
"run_eval_steps": null,
"distributed_backend": "nccl",
"distributed_url": "tcp://localhost:54321",
"mixed_precision": false,
"epochs": 1000,
"batch_size": 32,
"eval_batch_size": 16,
"grad_clip": 5.0,
"scheduler_after_epoch": true,
"lr": 1e-05,
"optimizer": "RAdam",
"optimizer_params": {
"betas": [
0.9,
0.998
],
"weight_decay": 1e-06
},
"lr_scheduler": "NoamLR",
"lr_scheduler_params": {
"warmup_steps": 4000
},
"use_grad_scaler": false,
"cudnn_enable": true,
"cudnn_deterministic": false,
"cudnn_benchmark": false,
"training_seed": 54321,
"model": "glow_tts",
"num_loader_workers": 0,
"num_eval_loader_workers": 4,
"use_noise_augment": false,
"audio": {
"fft_size": 1024,
"win_length": 1024,
"hop_length": 256,
"frame_shift_ms": null,
"frame_length_ms": null,
"stft_pad_mode": "reflect",
"sample_rate": 16000,
"resample": false,
"preemphasis": 0.98,
"ref_level_db": 20,
"do_sound_norm": false,
"log_func": "np.log10",
"do_trim_silence": true,
"trim_db": 45,
"do_rms_norm": false,
"db_level": null,
"power": 1.5,
"griffin_lim_iters": 60,
"num_mels": 80,
"mel_fmin": 0.0,
"mel_fmax": 8000.0,
"spec_gain": 20,
"do_amp_to_db_linear": true,
"do_amp_to_db_mel": true,
"pitch_fmax": 640.0,
"pitch_fmin": 1.0,
"signal_norm": true,
"min_level_db": -100,
"symmetric_norm": true,
"max_norm": 4.0,
"clip_norm": true,
"stats_path": null
},
"use_phonemes": true,
"phonemizer": "gruut",
"phoneme_language": "en-us",
"compute_input_seq_cache": false,
"text_cleaner": "phoneme_cleaners",
"enable_eos_bos_chars": false,
"test_sentences_file": "",
"phoneme_cache_path": "/data/TTS/phoneme_cache",
"characters": {
"characters_class": "TTS.tts.utils.text.characters.IPAPhonemes",
"vocab_dict": null,
"pad": "",
"eos": "",
"bos": "",
"blank": "",
"characters": "iy\u0268\u0289\u026fu\u026a\u028f\u028ae\u00f8\u0258\u0259\u0275\u0264o\u025b\u0153\u025c\u025e\u028c\u0254\u00e6\u0250a\u0276\u0251\u0252\u1d7b\u0298\u0253\u01c0\u0257\u01c3\u0284\u01c2\u0260\u01c1\u029bpbtd\u0288\u0256c\u025fk\u0261q\u0262\u0294\u0274\u014b\u0272\u0273n\u0271m\u0299r\u0280\u2c71\u027e\u027d\u0278\u03b2fv\u03b8\u00f0sz\u0283\u0292\u0282\u0290\u00e7\u029dx\u0263\u03c7\u0281\u0127\u0295h\u0266\u026c\u026e\u028b\u0279\u027bj\u0270l\u026d\u028e\u029f\u02c8\u02cc\u02d0\u02d1\u028dw\u0265\u029c\u02a2\u02a1\u0255\u0291\u027a\u0267\u02b2\u025a\u02de\u026b",
"punctuations": "!'(),-.:;? ",
"phonemes": null,
"is_unique": false,
"is_sorted": true
},
"add_blank": false,
"batch_group_size": 0,
"loss_masking": null,
"min_audio_len": 1,
"max_audio_len": Infinity,
"min_text_len": 1,
"max_text_len": Infinity,
"compute_f0": false,
"compute_linear_spec": false,
"precompute_num_workers": 0,
"start_by_longest": false,
"shuffle": false,
"drop_last": false,
"datasets": [
{
"formatter": "deepspeech",
"dataset_name": "",
"path": "/data/TTS/recipes/hjkvoice/recording_segs_tts_all/",
"meta_file_train": "metadata.csv",
"ignored_speakers": null,
"language": "",
"phonemizer": "",
"meta_file_val": "",
"meta_file_attn_mask": ""
}
],
"test_sentences": [
"It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
"Be a voice, not an echo.",
"I'm sorry Dave. I'm afraid I can't do that.",
"This cake is great. It's so delicious and moist.",
"Prior to November 22, 1963."
],
"eval_split_max_size": null,
"eval_split_size": 0.01,
"use_speaker_weighted_sampler": false,
"speaker_weighted_sampler_alpha": 1.0,
"use_language_weighted_sampler": false,
"language_weighted_sampler_alpha": 1.0,
"use_length_weighted_sampler": false,
"length_weighted_sampler_alpha": 1.0,
"num_chars": 131,
"encoder_type": "rel_pos_transformer",
"encoder_params": {
"kernel_size": 3,
"dropout_p": 0.1,
"num_layers": 6,
"num_heads": 2,
"hidden_channels_ffn": 768,
"input_length": null
},
"use_encoder_prenet": true,
"hidden_channels_enc": 192,
"hidden_channels_dec": 192,
"hidden_channels_dp": 256,
"dropout_p_dp": 0.1,
"dropout_p_dec": 0.05,
"mean_only": true,
"out_channels": 80,
"num_flow_blocks_dec": 12,
"inference_noise_scale": 0.0,
"kernel_size_dec": 5,
"dilation_rate": 1,
"num_block_layers": 4,
"num_speakers": 0,
"c_in_channels": 0,
"num_splits": 4,
"num_squeeze": 2,
"sigmoid_scale": false,
"d_vector_dim": 0,
"data_dep_init_steps": 10,
"style_wav_for_test": null,
"length_scale": 1.0,
"use_speaker_embedding": false,
"speakers_file": null,
"use_d_vector_file": false,
"d_vector_file": false,
"min_seq_len": 3,
"max_seq_len": 500,
"r": 1
}

nlpander · 2023-07-03T22:25:13Z

nlpander
Jul 3, 2023

Hey hjkaddict -

I've just experienced the same thing - trained on 587 segments - comprising around 30 mins of audio maybe too short - also sampled at 16Khz rather than 22.05 KHz.

Also I was using the VITS model - did you also try YourTTS ? I've seen some debate between these two.

What did you do in the end ? Was another model / training procedure / data prep superior ?

0 replies

erogol · 2023-07-05T09:56:21Z

erogol
Jul 5, 2023
Maintainer

I'd rather go with VITS. It is more robust and tends to give better results.

What is your sampling rate of your audio?

Are you sure the data is formatted and loaded correctly while training?

Too me it looks like a simple false value in the config that doesn't match your dataset but hard to tell exactly what.

3 replies

nlpander Jul 6, 2023

Hey

So the sampling rate of the audio is 22050 in wav format and it is very clean - one speaker talking.
The audio chunks range from 1 second to 20 seconds I have around 1380 samples (total audio which it was sampled from is 1.5 hours)

I've also tried disabling the phoneme cleaner but this was worse. Quite a low batch size 1 - could this be the issue ?

Here are my configs:

num_epochs = 30
batch_size = 1
eval_batch_size = 1
batch_group_size = 1

audio_config = VitsAudioConfig(
sample_rate=22050, win_length=1024, hop_length=256, num_mels=80, mel_fmin=0, mel_fmax=None
)

config = VitsConfig(
audio=audio_config,
run_name="vits_test_longer_220500Hz_w_phonemes",
batch_size=batch_size,
eval_batch_size=eval_batch_size,
batch_group_size=batch_group_size,
num_loader_workers=2,
num_eval_loader_workers=2,
run_eval=True,
test_delay_epochs=-1,
epochs=num_epochs,
text_cleaner="phoneme_cleaners",
use_phonemes=True,
phoneme_language="en",
phoneme_cache_path=os.path.join(directory, "phoneme_cache"),
compute_input_seq_cache=True,
print_step=100,
print_eval=True,
mixed_precision=False,
output_path=directory,
datasets=[dataset_config],
cudnn_benchmark=True,
test_sentences=test_sentences,
)

nlpander Jul 6, 2023

Also used the ljspeech formatter

nlpander Jul 22, 2023

@erogol any guidance on this you could provide ? Could it be the batch size ? Should I denoise the audio stream with rnnoise as per #2507 ?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

I am only getting babbling and stutterd voice after fine tuning process. #2298

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

I am only getting babbling and stutterd voice after fine tuning process. #2298

Uh oh!

Uh oh!

hjkaddict Jan 18, 2023

Replies: 2 comments · 3 replies

Uh oh!

nlpander Jul 3, 2023

Uh oh!

erogol Jul 5, 2023 Maintainer

Uh oh!

nlpander Jul 6, 2023

Uh oh!

nlpander Jul 6, 2023

Uh oh!

Uh oh!

nlpander Jul 22, 2023

hjkaddict
Jan 18, 2023

Replies: 2 comments 3 replies

nlpander
Jul 3, 2023

erogol
Jul 5, 2023
Maintainer