Benchmarking LJSpeech English Pre-trained models #1698

iprovalo · 2022-06-28T05:18:26Z

I want to share some tests I ran on a Raspberry Pi (RPi; 8 GB memory, only a single CPU used in this test) as well as on a cloud instance. The setup is:

Raspberry Pi: "PyTorch_version": "1.11.0a0+gitbc2c6ed", "TTS": "0.7.1", "numpy": "1.21.6", "python": "3.9.2",
Cloud: "PyTorch_version": "1.11.0+cu115", "TTS": "0.7.0", "numpy": "1.21.6", "python": "3.7.13",

I measured processing speed and quality on both. I also wanted to specifically compare quality relative to the compatibility of the models with the vocoders. I tested seven models and three vocoders. Quality had three gradations: "Clear", "Metallic, noisy", and "Unintelligible", all judged by myself by ear. In both cases I used the git clone installation of Coqui-AI TTS and the pre-trained models/vocoders:

git clone https://github.com/coqui-ai/TTS
cd TTS/
python3 -m pip install -r requirements.txt
python3 setup.py develop

I had to disable torchaudio in order to run on Raspberry Pi as mentioned here.

I used configurations based on the English ljspeech dataset: six models and three vocoders, plus vits (a combined model and vocoder).

Here are some observations:

Each model had one or two compatible (clear quality) vocoders, never all three.
Each vocoder gave clear quality with three out of the six models.
The quality results were consistent across the two environments (RPi vs. the cloud instance).
The best model/vocoder combo for both quality and speed on both platforms is glow-tts+multiband-melgan (Clear quality, RPi: 1.73 sec).
Vits didn't work on RPi: I had only modified base_encoder.py to get it to work without torchaudio, and vits relies on torchaudio to load the audio item (and to resample it if required).
On RPi, the single-CPU tests were consistently faster than the 4-CPU tests. On the cloud instance it was the opposite in most cases, with the 4-CPU tests being faster. The three tests that didn't follow this pattern on the cloud instance were glow-tts+hifigan_v2, speedy-speech+hifigan_v2, and speedy-speech+multiband-melgan.

Full spreadsheet with results, including the commands I ran. The spreadsheet is open for comments.
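For readers who want to script a similar timing run, here is a minimal sketch. It is not the set of commands from the spreadsheet. It assumes the `tts` command-line tool installed by Coqui TTS with its `--text`, `--model_name`, `--vocoder_name` and `--out_path` options, and it measures wall-clock time for the whole process (including model loading), which may differ from how the numbers above were taken. The helper names `build_tts_command` and `time_command` are made up for illustration.

```python
import subprocess
import time


def build_tts_command(text, model_name, vocoder_name, out_path):
    """Build the argument list for one `tts` CLI synthesis run."""
    return [
        "tts",
        "--text", text,
        "--model_name", model_name,
        "--vocoder_name", vocoder_name,
        "--out_path", out_path,
    ]


def time_command(cmd, runner=subprocess.run):
    """Run `cmd` and return the elapsed wall-clock time in seconds.

    `runner` is injectable so the timing logic can be exercised
    without having TTS installed.
    """
    start = time.perf_counter()
    runner(cmd, check=True)
    return time.perf_counter() - start


# Example (hypothetical text and output path; requires TTS to be installed):
# cmd = build_tts_command(
#     "Hello world.",
#     "tts_models/en/ljspeech/glow-tts",
#     "vocoder_models/en/ljspeech/multiband-melgan",
#     "out.wav",
# )
# print(f"{time_command(cmd):.2f} sec")
```

How the number of CPUs is limited (for example through the operating system or thread-count environment variables) is left out here, since the thread does not say how it was done.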

Output wav files.

I hope this is helpful.

iprovalo · 2022-06-30T22:44:07Z

iprovalo
Jun 30, 2022
Author

Here is a model to vocoder compatibility matrix based on my testing (English, ljspeech released models):

5 replies

iprovalo Jul 3, 2022
Author

@erogol what are some of the reasons for the model and vocoder compatibility issues?

It's a practical question when training for a new language - which model/vocoder pair to choose for training?

Thank you!

erogol Jul 5, 2022
Maintainer

Because they are not meant to be used together. They use different audio processing params.
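One way to check this for a given pair is to compare the `audio` sections of the model and vocoder configs. The sketch below is only an illustration: it assumes each config is a dict with an `audio` section using key names such as `sample_rate`, `num_mels`, `fft_size`, `hop_length`, `win_length`, `mel_fmin` and `mel_fmax`, and the values in the example are invented, not taken from any released model. The function name `audio_param_mismatches` is hypothetical.

```python
# Audio settings that should normally agree between a TTS model and its vocoder.
# This list is an assumption for illustration; extend it as needed.
AUDIO_KEYS = (
    "sample_rate",
    "num_mels",
    "fft_size",
    "hop_length",
    "win_length",
    "mel_fmin",
    "mel_fmax",
)


def audio_param_mismatches(tts_config, vocoder_config, keys=AUDIO_KEYS):
    """Return {key: (tts_value, vocoder_value)} for every differing audio setting."""
    tts_audio = tts_config.get("audio", {})
    vocoder_audio = vocoder_config.get("audio", {})
    mismatches = {}
    for key in keys:
        tts_value = tts_audio.get(key)
        vocoder_value = vocoder_audio.get(key)
        if tts_value != vocoder_value:
            mismatches[key] = (tts_value, vocoder_value)
    return mismatches


# Invented example values, only to show the shape of the output.
example_tts = {"audio": {"sample_rate": 22050, "num_mels": 80, "mel_fmin": 0.0, "mel_fmax": 8000.0}}
example_vocoder = {"audio": {"sample_rate": 22050, "num_mels": 80, "mel_fmin": 50.0, "mel_fmax": 7600.0}}
print(audio_param_mismatches(example_tts, example_vocoder))
```

With real configs, the two dicts could be loaded with `json.load` from the `config.json` files that ship with the model and the vocoder.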

iprovalo Jul 5, 2022
Author

@erogol, are there any more details somewhere about the audio processing params? I am training a glow-tts and mb-melgan for the Russian language. What should I be using for the audio parameters?

Thank you

iprovalo Jul 6, 2022
Author

@erogol just to clarify, I am not sure I am using the right parameters for the training of the glow_tts and mb_melgan combo. I am capturing the results here for the training of the Russian model.

Basically, the training audio results of the model and a vocoder on the RUSLAN dataset are sounding great separately (tensorboard). Together, there is a distinct metallic sound.

Glow_TTS:

MB_Melgan:

iprovalo Jul 8, 2022
Author

@erogol an update, somewhere after 500K steps, the glow_tts+mb_melgan started sounding decent together. I posted the best model audio in the discussion.

king-dahmanus · 2022-06-30T23:05:43Z

Any blind-compatible summaries, please? I'm one of the blind people who are interested in coqui tts, so I hope the developers and contributors can agree on a rule to summarize any graphical or image content, except the tts spectrograms. The images to summarize include benchmarks like this one, errors, etc.

1 reply

iprovalo Jul 3, 2022
Author

@king-dahmanus

The detail page in the spreadsheet is here.

Summary page is here.

king-dahmanus · 2022-07-03T16:01:50Z

I don't know how to work with spreadsheets on the web, but that falls on me. Other blind people encounter spreadsheets and use them often enough, but I've never needed to so far. In any case, thanks for the effort; who knows, someone else may benefit from this.

5 replies

iprovalo Jul 3, 2022
Author

@king-dahmanus how does this one work:

Detail Table:

Model-Short	Vocoder-Short	Quality Status - RPi	Quality Status - Cloud Instance	Time (sec), Single CPU, RPi	Time(sec), 4 CPU, RPi	Time(sec), Single CPU, Cloud	Time(sec), 4 CPU, Cloud
fast_pitch	hifigan_v2	Clear	Clear	2.40	3.52	1.06	0.53
fast_pitch	multiband-melgan	Metallic, noisy	Metallic, noisy	1.55	2.22	0.95	0.65
fast_pitch	univnet	Unintelligible	Unintelligible	3.40	4.02	1.25	0.68
glow-tts	hifigan_v2	Metallic, noisy	Metallic, noisy	2.72	4.09	0.62	0.68
glow-tts	multiband-melgan	Clear	Clear	1.73	2.35	0.63	0.52
glow-tts	univnet	Clear	Clear	3.65	4.23	0.99	0.60
speedy-speech	hifigan_v2	Clear	Clear	2.09	2.98	0.48	0.66
speedy-speech	multiband-melgan	Metallic, noisy	Metallic, noisy	1.21	1.61	0.39	0.46
speedy-speech	univnet	Unintelligible	Unintelligible	2.85	3.33	0.66	0.44
tacotron2-DCA	hifigan_v2	Metallic, noisy	Metallic, noisy	6.18	8.94	1.53	0.87
tacotron2-DCA	multiband-melgan	Clear	Clear	5.55	8.23	1.42	0.80
tacotron2-DCA	univnet	Clear	Clear	6.96	8.39	1.55	0.92
tacotron2-DDC	hifigan_v2	Clear	Clear	11.32	16.53	1.98	1.23
tacotron2-DDC	multiband-melgan	Metallic, noisy	Metallic, noisy	10.86	15.51	2.39	1.12
tacotron2-DDC	univnet	Unintelligible	Unintelligible	12.24	16.39	1.63	1.00
tacotron2-DDC_ph	hifigan_v2	Metallic, noisy	Metallic, noisy	6.78	9.94	1.36	0.74
tacotron2-DDC_ph	multiband-melgan	Clear	Clear	6.05	8.09	1.46	0.59
tacotron2-DDC_ph	univnet	Clear	Clear	7.62	10.30	2.25	1.12
vits	NA	Torch Audio Init Error	Clear	NA	NA	1.60	0.61
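The single-CPU vs. 4-CPU observation can be checked directly from the timing columns of the detail table. The sketch below copies the timings of the 18 model/vocoder pairs (the vits row is left out because it has no RPi timings); the helper names are made up for illustration.

```python
# Columns: model, vocoder, RPi 1 CPU, RPi 4 CPU, cloud 1 CPU, cloud 4 CPU (seconds),
# copied from the detail table above.
ROWS = """
fast_pitch hifigan_v2 2.40 3.52 1.06 0.53
fast_pitch multiband-melgan 1.55 2.22 0.95 0.65
fast_pitch univnet 3.40 4.02 1.25 0.68
glow-tts hifigan_v2 2.72 4.09 0.62 0.68
glow-tts multiband-melgan 1.73 2.35 0.63 0.52
glow-tts univnet 3.65 4.23 0.99 0.60
speedy-speech hifigan_v2 2.09 2.98 0.48 0.66
speedy-speech multiband-melgan 1.21 1.61 0.39 0.46
speedy-speech univnet 2.85 3.33 0.66 0.44
tacotron2-DCA hifigan_v2 6.18 8.94 1.53 0.87
tacotron2-DCA multiband-melgan 5.55 8.23 1.42 0.80
tacotron2-DCA univnet 6.96 8.39 1.55 0.92
tacotron2-DDC hifigan_v2 11.32 16.53 1.98 1.23
tacotron2-DDC multiband-melgan 10.86 15.51 2.39 1.12
tacotron2-DDC univnet 12.24 16.39 1.63 1.00
tacotron2-DDC_ph hifigan_v2 6.78 9.94 1.36 0.74
tacotron2-DDC_ph multiband-melgan 6.05 8.09 1.46 0.59
tacotron2-DDC_ph univnet 7.62 10.30 2.25 1.12
"""


def parse_rows(text):
    """Turn the whitespace-separated rows into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        model, vocoder, *times = line.split()
        rpi_1, rpi_4, cloud_1, cloud_4 = map(float, times)
        rows.append({
            "pair": f"{model}+{vocoder}",
            "rpi_1cpu": rpi_1, "rpi_4cpu": rpi_4,
            "cloud_1cpu": cloud_1, "cloud_4cpu": cloud_4,
        })
    return rows


def slower_with_four_cpus(rows, platform):
    """List the pairs whose 4-CPU run took longer than the single-CPU run."""
    return [r["pair"] for r in rows if r[f"{platform}_4cpu"] > r[f"{platform}_1cpu"]]


rows = parse_rows(ROWS)
print("RPi pairs slower with 4 CPUs:", len(slower_with_four_cpus(rows, "rpi")), "of", len(rows))
print("Cloud pairs slower with 4 CPUs:", slower_with_four_cpus(rows, "cloud"))
```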

Summary Table:

Model-Short	hifigan_v2	multiband-melgan	univnet	Grand Total
fast_pitch	CLEAR	NOISY	NOISY	1
glow-tts	NOISY	CLEAR	CLEAR	2
speedy-speech	CLEAR	NOISY	NOISY	1
tacotron2-DCA	NOISY	CLEAR	CLEAR	2
tacotron2-DDC	CLEAR	NOISY	NOISY	1
tacotron2-DDC_ph	NOISY	CLEAR	CLEAR	2
Grand Total	3	3	3	9
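The "Grand Total" figures are counts of Clear combinations. As a small sketch, they can be recomputed from the quality column of the detail table (the RPi and cloud ratings are identical, so one column is enough). The variable names below are made up for illustration.

```python
from collections import Counter

# (model, vocoder, quality) copied from the detail table; vits has no separate vocoder.
QUALITY = [
    ("fast_pitch", "hifigan_v2", "Clear"),
    ("fast_pitch", "multiband-melgan", "Metallic, noisy"),
    ("fast_pitch", "univnet", "Unintelligible"),
    ("glow-tts", "hifigan_v2", "Metallic, noisy"),
    ("glow-tts", "multiband-melgan", "Clear"),
    ("glow-tts", "univnet", "Clear"),
    ("speedy-speech", "hifigan_v2", "Clear"),
    ("speedy-speech", "multiband-melgan", "Metallic, noisy"),
    ("speedy-speech", "univnet", "Unintelligible"),
    ("tacotron2-DCA", "hifigan_v2", "Metallic, noisy"),
    ("tacotron2-DCA", "multiband-melgan", "Clear"),
    ("tacotron2-DCA", "univnet", "Clear"),
    ("tacotron2-DDC", "hifigan_v2", "Clear"),
    ("tacotron2-DDC", "multiband-melgan", "Metallic, noisy"),
    ("tacotron2-DDC", "univnet", "Unintelligible"),
    ("tacotron2-DDC_ph", "hifigan_v2", "Metallic, noisy"),
    ("tacotron2-DDC_ph", "multiband-melgan", "Clear"),
    ("tacotron2-DDC_ph", "univnet", "Clear"),
]

# Count Clear combinations per model (table rows) and per vocoder (table columns).
clear_per_model = Counter(model for model, _, quality in QUALITY if quality == "Clear")
clear_per_vocoder = Counter(vocoder for _, vocoder, quality in QUALITY if quality == "Clear")

for model, count in clear_per_model.items():
    print(model, count)
print(dict(clear_per_vocoder), "total:", sum(clear_per_vocoder.values()))
```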

erogol Jul 5, 2022
Maintainer

Every model comes with a default vocoder set for it. You don't need to try a different vocoder. Doesn't make sense.

koayst Aug 8, 2022

Every model comes with a default vocoder set for it. You don't need to try a different vocoder. Doesn't make sense.

Hi,

I'm trying to learn more about Coqui TTS. How do I find out what the default vocoder is for each model? Is there some documentation I can refer to?

iprovalo Aug 8, 2022
Author

I think this .models.json file is a good place to start.
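As a rough sketch of how that file could be read: the code below assumes `.models.json` is nested as model type, then language, then dataset, then model name, and that each model entry may carry a `default_vocoder` field. The small `sample` dict only illustrates that assumed layout and is not a copy of the real file; the function name `default_vocoders` is made up.

```python
import json


def default_vocoders(models, model_type="tts_models"):
    """Map 'type/language/dataset/model' to its default_vocoder (None if absent)."""
    result = {}
    for language, datasets in models.get(model_type, {}).items():
        for dataset, entries in datasets.items():
            for name, entry in entries.items():
                key = f"{model_type}/{language}/{dataset}/{name}"
                result[key] = entry.get("default_vocoder")
    return result


# Illustrative stand-in for the real file contents (assumed layout).
sample = {
    "tts_models": {
        "en": {
            "ljspeech": {
                "glow-tts": {"default_vocoder": "vocoder_models/en/ljspeech/multiband-melgan"},
                "vits": {"default_vocoder": None},
            }
        }
    }
}

for model, vocoder in default_vocoders(sample).items():
    print(model, "->", vocoder)

# With a checkout of the repository, the real file could be loaded instead
# (the path below is an assumption about where the file lives):
# with open("TTS/.models.json") as f:
#     print(default_vocoders(json.load(f)))
```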

koayst Aug 9, 2022

Thank you @iprovalo for the heads-up.

king-dahmanus · 2022-08-09T13:11:34Z

@iprovalo perfect. Thanks!

0 replies
