VITS sounds drunk on German #2834

cschaefer26 · 2023-08-02T09:33:35Z

Hi, first of all, thanks for your hard work on taking on the proprietary TTS systems, love it. I am currently trying to train some German VITS models with Coqui and I am finding that the prosody is really weird. Here is a model output after about 200k steps:

Sentence:

Es ist schade, dass die EU als die humanste und moralischste aller Ländergruppierungen angesehen wird, aber sie wollen die Menschenrechte nicht aufrechterhalten und den Magnitsky Act nicht nutzen.

Phonemes:

ɛs ɪst ʃaːdə, das diː eːʔuː als diː humaːnstə ʊnt moʁaːlɪʃstə alɐ lɛndɐɡʁʊpiːʁʊŋən anɡəzeːən vɪʁt, aːbɐ ziː vɔlən diː mɛnʃn̩ʁɛçtə nɪçt aʊfʁɛçtʔɛɐhaltn̩ ʊnt deːn maɡnɪt͡ski ɛkt nɪçt nʊt͡sn̩.

noise=0.8

audio_noise_0.8.mp4

noise=0

audio_noise_0.mp4

For comparison here is our trained ForwardTacotron model (100k steps) + a modified MelGAN:

forward_taco_melgan.mp4

Any idea what could be the problem? I switched off phonemization (the training text is already phonemized) and use IPAPhonemes as the character set; the rest of the config is default. Any help would be appreciated :) If you need them, I can of course post TensorBoard graphs, configs, etc.
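Since the input here is pre-phonemized text fed straight into a fixed character set, one thing that may be worth ruling out is symbols in the data that the character set does not contain (for example tie bars or syllabic marks). Below is a minimal sketch of such a coverage check. The function name and the toy vocabulary are hypothetical and deliberately incomplete; the vocabulary is not the actual IPAPhonemes set, so substitute the real one from your config.

```python
import unicodedata
from collections import Counter


def find_uncovered_symbols(texts, vocabulary):
    """Count characters in `texts` that do not occur in `vocabulary`."""
    vocab = set(vocabulary)
    missing = Counter()
    for text in texts:
        for ch in text:
            if ch not in vocab:
                missing[ch] += 1
    return missing


# Hypothetical, deliberately incomplete vocabulary for illustration only.
# It is NOT the real IPAPhonemes character set.
toy_vocabulary = "abdefhiklmnoprstuvzçŋɐɔəɛɡɪʁʃʊː ,."

# Fragment of the phoneme string from the post above.
sample = "maɡnɪt͡ski ɛkt nɪçt nʊt͡sn̩."

report = find_uncovered_symbols([sample], toy_vocabulary)
for ch, count in report.most_common():
    # Print the code point and Unicode name, since combining marks are
    # hard to see on their own.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}: {count}")
```

Running this over the whole metadata file would show whether anything is being lost before the text reaches the model. If the report is empty, the character set is not the culprit.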

thorstenMueller · 2023-08-20T16:19:07Z

Hi @cschaefer26, I have no idea what the problem might be either, but I want to congratulate you on your great forward_taco_melgan model 👏. It sounds really good.

2 replies

thorstenMueller Aug 20, 2023

I've played around with my TTS voice and your phrase. It doesn't pronounce "EU" and the English "Act" as well as your last model does. How did you get the English words pronounced in English?
https://huggingface.co/spaces/Thorsten-Voice/demo

Es ist schade, dass die EU als die humanste und moralischste aller Ländergruppierungen angesehen wird, aber sie wollen die Menschenrechte nicht aufrechterhalten und den Magnitsky Act nicht nutzen.

cschaefer26 Aug 21, 2023
Author

Hey Thorsten, thanks for your answer. Did you train your model with the standard VITS parameters? I did, and saw that loss_1 starts going up fairly early, indicating overfitting...
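To make "going up fairly early" concrete, here is a minimal sketch of how one might locate the point where an eval loss stops improving. The function name, the `patience` parameter, and the loss values are all hypothetical; the numbers are not taken from the actual run.

```python
def best_step_before_rise(steps, losses, patience=3):
    """Return the step with the lowest eval loss if the loss then failed to
    improve for `patience` consecutive evaluations; otherwise return None."""
    best_loss = float("inf")
    best_step = None
    stale = 0  # evaluations since the last improvement
    for step, loss in zip(steps, losses):
        if loss < best_loss:
            best_loss, best_step, stale = loss, step, 0
        else:
            stale += 1
            if stale >= patience:
                return best_step
    return None


# Hypothetical eval-loss curve, for illustration only.
steps = [10_000, 20_000, 30_000, 40_000, 50_000, 60_000]
losses = [31.2, 29.8, 29.1, 29.6, 30.4, 31.0]

print(best_step_before_rise(steps, losses))  # 30000
```

A checkpoint from around the reported step is then a reasonable candidate to listen to and compare against the latest one.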

As for the English: for our production models we use a custom phonemizer that performs much better than the standard espeak (or gruut) phonemizers. It is based on https://github.com/as-ideas/DeepPhonemizer. Unfortunately, I cannot share our proprietary German model, but I can tell you that it is trained on quite a big dataset of grapheme-phoneme pairs, including the most common English inclusions. Also, we have an NER tagger that identifies English inclusions in German text so that we can apply English phonemization to those entities.
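The routing idea described above (tag each token with a language, then send it to the matching phonemizer) can be sketched roughly as follows. This is only an illustration, not the proprietary pipeline: the `Token` class, the `LEXICONS` table, and the `phonemize` function are hypothetical, the language tags are hard-coded where a real tagger would predict them, and the tiny lexicons stand in for trained grapheme-to-phoneme models. The phoneme strings are copied from the example in the first post.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    lang: str  # "de" or "en", as a (hypothetical) tagger might label it


# Tiny lookup tables standing in for real per-language G2P models.
LEXICONS = {
    "de": {"den": "deːn", "nicht": "nɪçt", "nutzen": "nʊt͡sn̩"},
    "en": {"magnitsky": "maɡnɪt͡ski", "act": "ɛkt"},
}


def phonemize(tokens):
    """Phonemize each token with the lexicon that matches its language tag."""
    phonemes = []
    for tok in tokens:
        lexicon = LEXICONS[tok.lang]
        # Unknown words are marked rather than guessed.
        phonemes.append(lexicon.get(tok.text.lower(), f"<unk:{tok.text}>"))
    return " ".join(phonemes)


# Tags are hard-coded here; in practice they would come from the tagger.
sentence = [
    Token("den", "de"),
    Token("Magnitsky", "en"),
    Token("Act", "en"),
    Token("nicht", "de"),
    Token("nutzen", "de"),
]
print(phonemize(sentence))
```

The point of the design is that the German model never sees the English inclusions, so it cannot apply German letter-to-sound rules to them.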

erogol · 2023-08-21T08:23:39Z

Maintainer

I am not sure, but you can try disabling the blank_token in the config. It might make the output more fluent. How large is your dataset?
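For readers unfamiliar with the blank token: VITS-style front ends commonly interleave an extra "blank" symbol with the input IDs before they reach the text encoder. A minimal sketch of that operation, independent of any particular library, is shown here. The token IDs and the blank ID are hypothetical.

```python
def intersperse(sequence, blank_id):
    """Insert `blank_id` before, between, and after every element."""
    result = [blank_id] * (len(sequence) * 2 + 1)
    result[1::2] = sequence  # original items land on the odd positions
    return result


token_ids = [12, 7, 33]  # hypothetical phoneme IDs
BLANK = 0                # hypothetical blank ID

print(intersperse(token_ids, BLANK))  # [0, 12, 0, 7, 0, 33, 0]
```

With blanks enabled the input sequence roughly doubles in length, and every blank also receives a predicted duration, which is one reason toggling the option can change rhythm.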

3 replies

cschaefer26 Aug 21, 2023
Author

Yeah, gonna try that. The dataset is about 20 hrs (about 9k utterances) and pretty high quality. I use my own phonemized text, but I also tested it with espeak enabled on the original characters, with the same result. Here are the eval losses going up quite early:

erogol Aug 21, 2023
Maintainer

Do you use SDP or regular duration predictor?

cschaefer26 Aug 21, 2023
Author

I used SDP, as is the default. I briefly tried a training run with the regular one but didn't follow through on it. AFAIR it was more consistent but still a bit off.
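As a rough intuition for why a stochastic duration predictor can sound less consistent than a regular one, here is a toy sketch. It assumes, as in VITS-style inference, that per-phoneme frame counts are obtained as ceil(exp(log_duration) * length_scale). The Gaussian perturbation is a loud simplification standing in for the flow-based sampler of a real SDP, and the function name and all values are hypothetical.

```python
import math
import random


def durations_from_log(log_durations, length_scale=1.0, noise_scale=0.0, seed=None):
    """Convert predicted log-durations into integer frame counts.

    noise_scale=0.0 mimics a deterministic predictor; larger values mimic
    sampling from a stochastic one (toy Gaussian noise, not a real flow).
    """
    rng = random.Random(seed)
    frames = []
    for logw in log_durations:
        noisy = logw + noise_scale * rng.gauss(0.0, 1.0)
        frames.append(math.ceil(math.exp(noisy) * length_scale))
    return frames


log_durations = [1.0, 2.0, 0.5]  # hypothetical predictions for three phonemes

print(durations_from_log(log_durations))                             # [3, 8, 2]
print(durations_from_log(log_durations, noise_scale=0.8, seed=0))    # varies with the seed
```

With the noise at zero the rhythm is identical on every run; with noise, each phoneme can come out noticeably longer or shorter than predicted.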

cschaefer26 · 2023-08-21T08:36:20Z

Author

Here is the train script if it helps:

import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsAudioConfig, VitsArgs
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
from TTS.tts.utils.text.characters import IPAPhonemes

# Write checkpoints and logs next to this script.
output_path = os.path.dirname(os.path.abspath(__file__))

# The dataset path is absolute, so joining it with output_path would have no effect.
dataset_config = BaseDatasetConfig(
    formatter="ljspeech",
    meta_file_train="metadata.csv",
    path="/data/datasets/welt_resampled",
)
audio_config = VitsAudioConfig(
    sample_rate=22050, win_length=1024, hop_length=256, num_mels=80, mel_fmin=0, mel_fmax=None
)

# Default model arguments.
model_args = VitsArgs()

config = VitsConfig(
    audio=audio_config,
    model_args=model_args,
    run_name="vits_welt",
    batch_size=32,
    eval_batch_size=16,
    batch_group_size=5,
    num_loader_workers=8,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    # The metadata already contains phonemized text, so cleaning and phonemization are off.
    text_cleaner=None,
    use_phonemes=False,
    phoneme_language="de",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache_welt"),
    compute_input_seq_cache=True,
    print_step=25,
    print_eval=True,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
    cudnn_benchmark=False,
    # use_sdp=False,
    test_sentences=[['ɪn aɪnɐ vɛlt, ɪn deːɐ deːɐ kliːmavandl̩ ɪmɐ dʁɛŋəndɐ vɪʁt, hat diː ɔɪʁopɛːɪʃə unioːn aɪnən eːɐɡaɪt͡sɪɡn̩ plaːn foːɐɡəʃlaːɡn̩: bɪs t͡svaɪtaʊzn̩t-fʏnfʊntdʁaɪsɪç zɔlən kaɪnə aʊtos meːɐ aʊf deːn maʁkt kɔmən, diː t͡seː-ʔoː- t͡svaɪ oːdɐ andəʁə ʃaːtʃtɔfə aʊsʃtoːsn̩.']],
)

# Build the tokenizer on the IPA phoneme set instead of graphemes.
tokenizer, config = TTSTokenizer.init_from_config(config, IPAPhonemes())

ap = AudioProcessor.init_from_config(config)

train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

model = Vits(config, ap, tokenizer, speaker_manager=None)

# Resume training from an earlier run.
trainer = Trainer(
    TrainerArgs(continue_path='vits_welt-August-01-2023_12+09PM-dc04baa1'),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()

0 replies
