I think the fifth line of Tacotron1-2 should be:
A short review of TTS models has also been written by PaddlePaddle/Parakeet.
@erogol Any updates on this? Thank you
The text below is a work in progress, but I thought it would be good to collect feedback as I write it.
Text-to-Mel Models
Tacotron1-2
Tacotron is one of the first successful DL-based text-to-mel models and opened up the whole TTS field to more DL research.
Tacotron is mainly an encoder-decoder model with attention. The encoder takes input tokens (characters or phonemes) and the decoder outputs mel-spectrogram* frames. The attention module in between learns to align the input tokens with the output mel-spectrograms.
Tacotron1 and 2 are both built on the same encoder-decoder architecture but use different layers in each component. Additionally, Tacotron1 uses a Postnet module to convert mel-spectrograms to linear spectrograms to achieve higher resolution.
Vanilla Tacotron models are slow at inference due to the auto-regressive* structure, which prevents the model from processing all the inputs in parallel. One trick is to use a higher “reduction rate*”, which lets the model predict multiple frames at once. For example, a reduction rate of 2 halves the number of decoder iterations.
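To make the effect of the reduction rate concrete, here is a minimal sketch; the helper name and frame counts are made up for illustration:

```python
import math

def decoder_steps(n_mel_frames: int, reduction_rate: int) -> int:
    """Number of autoregressive decoder iterations needed to emit
    n_mel_frames when each step predicts reduction_rate frames
    (hypothetical helper, not an actual Tacotron API)."""
    return math.ceil(n_mel_frames / reduction_rate)

# e.g. an 800-frame utterance:
print(decoder_steps(800, 1))  # 800 decoder iterations at r=1
print(decoder_steps(800, 2))  # 400 iterations: r=2 halves the loop
```

Since each iteration is a sequential dependency, cutting the iteration count directly cuts inference latency, at some cost in output quality.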
Another notorious issue with Tacotron is attention misalignment. Especially at inference, the alignment can fail for some input sequences and cause the model to produce pure noise.
Tacotron also uses a Prenet module with Dropout that projects the model’s previous output before feeding it back to the decoder. The paper and most of the implementations keep the Dropout layer active even at inference, and they report that the attention fails or the voice quality degrades otherwise. The downside is that you get slightly different output speech every time you run the model.
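The behaviour is easy to see in a toy sketch; the projection below is a stand-in, not the real Prenet layers, the point is only that dropout has no train/eval switch:

```python
import random

def prenet(frame, dropout_p=0.5, rng=None):
    """Toy sketch of the Prenet's inference-time behaviour:
    dropout stays active, so repeated runs over the same
    input give slightly different outputs."""
    rng = rng or random.Random()
    # stand-in for "linear projection + ReLU"
    h = [max(0.0, 0.5 * v) for v in frame]
    # dropout applied unconditionally -- no train/eval switch
    return [v if rng.random() > dropout_p else 0.0 for v in h]

frame = [1.0] * 8
print(prenet(frame))  # output changes from run to run
print(prenet(frame))
```

Because each decoder step consumes this stochastic projection, the sampled mel-spectrogram (and hence the speech) varies across runs even for identical text.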
Review
GlowTTS
GlowTTS uses normalizing flows [REF] to learn a distribution that projects the input text sequence to output mel-spectrograms. It also learns character durations from the data with “Monotonic Alignment Search” (MAS), a greedy Viterbi search. The model learns the distribution transformation needed to project the mel-spectrograms onto an isotropic Gaussian distribution. It also learns a projection of the input sequence onto this Gaussian to find the affinity between each input character and each mel-spectrogram frame. These affinity values are used by MAS to compute durations, which are then learned by a duration predictor. The duration predictor network is used at inference, and the MAS algorithm is skipped.
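The MAS step can be written as a small dynamic program. In the sketch below, value[i][j] stands for the learned log-affinity between text token i and mel frame j; this is an illustrative reimplementation, not GlowTTS’s actual code:

```python
def monotonic_alignment_search(value):
    """Find per-token durations that maximize the total affinity
    along a monotonic alignment: each mel frame maps to exactly one
    token, tokens appear in order, and each token gets >= 1 frame."""
    n_text, n_mel = len(value), len(value[0])
    NEG = float("-inf")
    # Q[i][j] = best total affinity aligning frames 0..j with tokens 0..i
    Q = [[NEG] * n_mel for _ in range(n_text)]
    for j in range(n_mel):
        for i in range(min(j + 1, n_text)):  # token i needs >= i prior frames
            if i == 0:
                prev = Q[0][j - 1] if j > 0 else 0.0
            else:
                # stay on token i, or advance from token i-1
                prev = max(Q[i][j - 1], Q[i - 1][j - 1])
            Q[i][j] = value[i][j] + prev
    # backtrack greedily from the last token/frame to recover durations
    durations = [0] * n_text
    i = n_text - 1
    for j in range(n_mel - 1, 0, -1):
        durations[i] += 1
        if i > 0 and Q[i - 1][j - 1] >= Q[i][j - 1]:
            i -= 1
    durations[0] += 1  # the first frame always aligns to the first token
    return durations

# token 0 matches frames 0-1, token 1 matches frame 2
print(monotonic_alignment_search([[1.0, 1.0, 0.0],
                                  [0.0, 0.0, 1.0]]))  # -> [2, 1]
```

At training time these durations supervise the duration predictor, which then replaces MAS entirely at inference.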
GlowTTS is built from a Transformer-based encoder network and a non-causal WaveNet-based decoder network. The duration predictor is just a stack of convolutional layers. In 🐸TTS we also provide different preset encoder and decoder alternatives that you can try out.
Review
AlignTTS
AlignTTS is very similar to GlowTTS but without normalizing flows. It uses normal feed-forward learning with a simple MSE loss on the predicted mel-spectrograms. It uses an additional Mixture Density Network that projects the encoder outputs to Gaussian distribution parameters, which are used to learn the affinity between each input character and each mel-spectrogram frame. Then it uses a Baum-Welch-like algorithm [REF] to find the expected text-to-spec alignment path, which is used to train a duration predictor network. Another important difference is that AlignTTS uses a multi-phase training algorithm that trains different segments of the model at different stages of training.
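The affinity computation can be sketched as follows; this is a toy diagonal-Gaussian version where each token i is assigned parameters (means[i], log_stds[i]) that would come from the encoder, and the function name and shapes are made up for illustration:

```python
import math

def gaussian_log_affinity(mel_frames, means, log_stds):
    """For each text token i and mel frame j, score the frame under
    the token's diagonal Gaussian: aff[i][j] = log N(frame_j; mu_i, sigma_i).
    Higher scores mean the frame more likely belongs to that token."""
    aff = []
    for mu, ls in zip(means, log_stds):
        row = []
        for frame in mel_frames:
            ll = sum(
                -0.5 * math.log(2 * math.pi) - s
                - (x - m) ** 2 / (2 * math.exp(2 * s))
                for x, m, s in zip(frame, mu, ls)
            )
            row.append(ll)
        aff.append(row)
    return aff

# frame near a token's mean scores higher than a distant one
aff = gaussian_log_affinity([[0.0], [5.0]], [[0.0], [5.0]], [[0.0], [0.0]])
print(aff[0][0] > aff[1][0])  # True: frame 0 belongs to token 0
```

A table like aff is exactly what the alignment search (here the Baum-Welch-like forward pass) consumes to produce the expected durations.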
AlignTTS uses Transformer layers for the encoder and decoder networks. The duration predictor is a stack of Transformer layers too.
Review
FastSpeech
SpeedySpeech
FastPitch
Vocoder Models
WaveRNN
WaveGrad
ParallelWaveGAN
MelGAN
HifiGAN
UnivNet
End-to-End Models
VITS
CoquiTTS
Glossary