I think the fifth line of Tacotron1-2 should be:
A short review of TTS models has also been written by PaddlePaddle/Parakeet.
@erogol Any updates on this? Thank you
The text below is a work in progress, but I thought it would be good to collect feedback as I write it.
Text-to-Mel Models
Tacotron1-2
Tacotron is one of the first successful DL-based text-to-mel models and opened up the whole TTS field to more DL research.
Tacotron is mainly an encoder-decoder model with attention. The encoder takes input tokens (characters or phonemes) and the decoder outputs mel-spectrogram* frames. The attention module in between learns to align the input tokens with the output mel-spectrograms.
Tacotron1 and 2 are both built on the same encoder-decoder architecture but use different layers in each component. Additionally, Tacotron1 uses a Postnet module to convert mel-spectrograms to linear spectrograms to achieve higher resolution.
Vanilla Tacotron models are slow at inference due to the auto-regressive* structure, which prevents the model from processing all the inputs in parallel. One trick is to use a higher “reduction rate*”, which lets the model predict multiple frames at once. For example, a reduction rate of 2 halves the number of decoder iterations.
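To make the effect of the reduction rate concrete, here is a minimal sketch; the helper name and frame counts are made up for illustration:

```python
import math

def decoder_steps(n_mel_frames: int, reduction_rate: int) -> int:
    """Number of autoregressive decoder iterations needed to emit
    n_mel_frames when each step predicts reduction_rate frames
    (hypothetical helper, not an actual Tacotron API)."""
    return math.ceil(n_mel_frames / reduction_rate)

# e.g. an 800-frame utterance:
print(decoder_steps(800, 1))  # 800 decoder iterations at r=1
print(decoder_steps(800, 2))  # 400 iterations: r=2 halves the loop
```

Since each iteration is a sequential dependency, cutting the iteration count directly cuts inference latency, at some cost in output quality.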
Another notorious issue with Tacotron is attention misalignment. Especially at inference, the alignment can fail for some input sequences and cause the model to produce pure noise.
Tacotron also uses a Prenet module with Dropout that projects the model’s previous output before feeding it back to the decoder. The paper and most of the implementations keep the Dropout layer active even at inference, and they report that the attention fails or the voice quality degrades otherwise. The downside is that you get slightly different output speech every time you run the model.
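The behaviour is easy to see in a toy sketch; the projection below is a stand-in, not the real Prenet layers, the point is only that dropout has no train/eval switch:

```python
import random

def prenet(frame, dropout_p=0.5, rng=None):
    """Toy sketch of the Prenet's inference-time behaviour:
    dropout stays active, so repeated runs over the same
    input give slightly different outputs."""
    rng = rng or random.Random()
    # stand-in for "linear projection + ReLU"
    h = [max(0.0, 0.5 * v) for v in frame]
    # dropout applied unconditionally -- no train/eval switch
    return [v if rng.random() > dropout_p else 0.0 for v in h]

frame = [1.0] * 8
print(prenet(frame))  # output changes from run to run
print(prenet(frame))
```

Because each decoder step consumes this stochastic projection, the sampled mel-spectrogram (and hence the speech) varies across runs even for identical text.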
Review
GlowTTS
GlowTTS uses normalizing flows [REF] to learn a distribution that projects the input text sequence to output mel-spectrograms. It also learns character durations from the data with “Monotonic Alignment Search” (MAS), a greedy Viterbi search. The model learns the distribution transformation needed to project the mel-spectrograms onto an isotropic Gaussian distribution. It also learns a projection of the input sequence onto this Gaussian to find the affinity between each input character and each mel-spectrogram frame. These affinity values are used by MAS to compute durations, which are then learned by a duration predictor. The duration predictor network is used at inference, and the MAS algorithm is skipped.
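The MAS step can be written as a small dynamic program. In the sketch below, value[i][j] stands for the learned log-affinity between text token i and mel frame j; this is an illustrative reimplementation, not GlowTTS’s actual code:

```python
def monotonic_alignment_search(value):
    """Find per-token durations that maximize the total affinity
    along a monotonic alignment: each mel frame maps to exactly one
    token, tokens appear in order, and each token gets >= 1 frame."""
    n_text, n_mel = len(value), len(value[0])
    NEG = float("-inf")
    # Q[i][j] = best total affinity aligning frames 0..j with tokens 0..i
    Q = [[NEG] * n_mel for _ in range(n_text)]
    for j in range(n_mel):
        for i in range(min(j + 1, n_text)):  # token i needs >= i prior frames
            if i == 0:
                prev = Q[0][j - 1] if j > 0 else 0.0
            else:
                # stay on token i, or advance from token i-1
                prev = max(Q[i][j - 1], Q[i - 1][j - 1])
            Q[i][j] = value[i][j] + prev
    # backtrack greedily from the last token/frame to recover durations
    durations = [0] * n_text
    i = n_text - 1
    for j in range(n_mel - 1, 0, -1):
        durations[i] += 1
        if i > 0 and Q[i - 1][j - 1] >= Q[i][j - 1]:
            i -= 1
    durations[0] += 1  # the first frame always aligns to the first token
    return durations

# token 0 matches frames 0-1, token 1 matches frame 2
print(monotonic_alignment_search([[1.0, 1.0, 0.0],
                                  [0.0, 0.0, 1.0]]))  # -> [2, 1]
```

At training time these durations supervise the duration predictor, which then replaces MAS entirely at inference.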
GlowTTS is built from a Transformer-based encoder network and a non-causal WaveNet-based decoder network. The duration predictor is just a stack of convolutional layers. In 🐸TTS we also provide different preset encoder and decoder alternatives that you can try out.
Review
AlignTTS
AlignTTS is very similar to GlowTTS but without normalizing flows. It uses normal feed-forward learning with a simple MSE loss on the predicted mel-spectrograms. It uses an additional Mixture Density Network that projects the encoder outputs to Gaussian distribution parameters, which are used to learn the affinity between each input character and each mel-spectrogram frame. Then it uses a Baum-Welch-like algorithm [REF] to find the expected text-to-spec alignment path, which is used to train a duration predictor network. Another important difference is that AlignTTS uses a multi-phase training algorithm that trains different segments of the model at different stages of training.
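The affinity computation can be sketched as follows; this is a toy diagonal-Gaussian version where each token i is assigned parameters (means[i], log_stds[i]) that would come from the encoder, and the function name and shapes are made up for illustration:

```python
import math

def gaussian_log_affinity(mel_frames, means, log_stds):
    """For each text token i and mel frame j, score the frame under
    the token's diagonal Gaussian: aff[i][j] = log N(frame_j; mu_i, sigma_i).
    Higher scores mean the frame more likely belongs to that token."""
    aff = []
    for mu, ls in zip(means, log_stds):
        row = []
        for frame in mel_frames:
            ll = sum(
                -0.5 * math.log(2 * math.pi) - s
                - (x - m) ** 2 / (2 * math.exp(2 * s))
                for x, m, s in zip(frame, mu, ls)
            )
            row.append(ll)
        aff.append(row)
    return aff

# frame near a token's mean scores higher than a distant one
aff = gaussian_log_affinity([[0.0], [5.0]], [[0.0], [5.0]], [[0.0], [0.0]])
print(aff[0][0] > aff[1][0])  # True: frame 0 belongs to token 0
```

A table like aff is exactly what the alignment search (here the Baum-Welch-like forward pass) consumes to produce the expected durations.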
AlignTTS uses Transformer layers for the encoder and decoder networks. The duration predictor is a stack of Transformer layers too.
Review
FastSpeech
SpeedySpeech
FastPitch
Vocoder Models
WaveRNN
WaveGrad
ParallelWaveGAN
MelGAN
HifiGAN
UnivNet
End-to-End Models
VITS
CoquiTTS
Glossary