Hello, I am reading the VITS paper now, which builds on the GlowTTS paper that I discussed earlier. #1494
I am just trying to understand the math correctly, so please guide me if I am wrong. Thanks.
TLDR:
Rather than generating a predicted mel-spectrogram from text and then converting it into an audio waveform with a vocoder, this paper generates the waveform directly and minimises the waveform reconstruction loss indirectly, by comparing the predicted waveform's mel-spectrogram with the ground-truth waveform's mel-spectrogram. Since getting a mel-spectrogram from a waveform is a deterministic step (essentially just an STFT plus a mel filterbank), why do we need adversarial training?
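For reference, the waveform-to-mel step really is a fixed transform with no learned parameters. A minimal sketch with torchaudio (the hyperparameters here are illustrative, not necessarily the paper's exact configuration):

```python
import torch
import torchaudio

# Deterministic waveform -> mel-spectrogram transform: no learned parameters,
# so the same waveform always produces the same mel-spectrogram.
wav_to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80
)

y = torch.randn(1, 22050)   # stand-in for a 1-second waveform
x_mel = wav_to_mel(y)       # shape: (1, 80, frames)
```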
OBJECTIVE of DECODER:
x_lin is the linear spectrogram of the target audio. We also know c_text, the input text corresponding to x_lin.
Find z from x_lin; the relation z <-> x_lin is learned using variational inference.
There are two components to eq. 1 (also known as the ELBO): (i) a reconstruction loss and (ii) a KL divergence.
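Written out, the ELBO has the familiar form (notation as in the paper, with x the target and c the conditioning text/alignment):

```latex
\log p_\theta(x \mid c) \;\ge\;
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{(i) reconstruction}}
\;-\;
\underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid c)\big)}_{\text{(ii) KL divergence}}
```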
(i) Reconstruction loss:
From z, instead of predicting a spectrogram directly, we upsample z to ŷ, which is the audio waveform, convert ŷ into a mel-spectrogram, and compare it with the ground-truth mel-spectrogram. This is the reconstruction loss for generating x given z.
(The figure shows x_mel being used for this loss; the linear spectrogram x_lin is used as the posterior encoder's input instead of the mel-spectrogram, since it carries more information.)
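A minimal sketch of this term, assuming an L1 loss between mel-spectrograms; the decoder below is a toy stand-in upsampler, not the paper's HiFi-GAN-style decoder, and the latent/waveform tensors are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio

# Same deterministic transform as in the sketch above.
wav_to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80
)

# Toy stand-in decoder: upsample a latent sequence by the hop length to waveform rate.
decoder = nn.ConvTranspose1d(192, 1, kernel_size=256, stride=256)
z = torch.randn(1, 192, 86)               # stand-in latent (~1 s at hop 256)

y_hat = decoder(z).squeeze(1)             # z -> predicted waveform
y = torch.randn_like(y_hat)               # stand-in ground-truth waveform of the same length
mel_hat = wav_to_mel(y_hat)               # predicted waveform -> mel-spectrogram
mel_gt = wav_to_mel(y)                    # ground-truth waveform -> mel-spectrogram
recon_loss = F.l1_loss(mel_hat, mel_gt)   # L1 between mel-spectrograms
```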
(ii)KL divergence:
We have two distributions: the posterior q(z|x_lin) and the prior p(z|c_text, A).
Let's say we pick q(z|x_lin) to be a Gaussian, N(z; μ(x_lin), σ(x_lin)), whose statistics the VAE's posterior encoder learns.
For p(z|c_text, A), we want z to keep a complex distribution (instead of the traditionally simple ones, to increase expressiveness). So we apply a normalizing flow f and model f(z) ~ N(f(z); μ(c_text, A), σ(c_text, A)), with the usual change-of-variables correction (written out below).
Notice how cleverly the distribution of f(z) is made to depend on the input text and the alignment.
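Concretely, by the change-of-variables formula the flow-based prior is:

```latex
p_\theta(z \mid c_{\text{text}}, A) \;=\;
N\!\big(f_\theta(z);\ \mu_\theta(c_{\text{text}}, A),\ \sigma_\theta(c_{\text{text}}, A)\big)
\left|\det \frac{\partial f_\theta(z)}{\partial z}\right|
```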
MAS and ENCODER:
In GlowTTS, MAS was used to find the alignment during training. Here the objective is to maximise the ELBO, and only the prior term p(z|c_text, A) (the denominator inside the KL's log-ratio) depends on A and c_text. So if we maximise that term, we maximise the ELBO.
To maximise it, i.e. N(f(z); μ(c_text, A), σ(c_text, A)), we train the text encoder to produce the statistics μ(c_text, A) and σ(c_text, A), and use MAS to search over monotonic alignments A.
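In equation form (my reading, following the GlowTTS formulation applied to f(z)): MAS picks the monotonic, non-skipping alignment that maximises the log-likelihood of the flowed latent under the prior, solved with dynamic programming over text tokens i and latent frames j:

```latex
A = \arg\max_{\hat{A}} \; \log N\!\big(f_\theta(z);\ \mu_\theta(c_{\text{text}}, \hat{A}),\ \sigma_\theta(c_{\text{text}}, \hat{A})\big),
\qquad
Q_{i,j} = \max\big(Q_{i-1,j-1},\, Q_{i,j-1}\big) + \log N\!\big(f_\theta(z_j);\ \mu_i,\ \sigma_i\big)
```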
Also, similar to GlowTTS, we need a duration predictor to tell us how many latent/spectrogram frames each text token should span at inference time. However, GlowTTS was trained with a deterministic predictor (same sentence → same durations every time, which sounds unnatural), while here they train a stochastic, flow-based duration predictor that yields more natural durations (same sentence, different durations each time).
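A toy sketch of why a stochastic predictor varies its output: random noise, conditioned on the text hidden states, is mapped to log-durations, so each sample gives different durations. The module below is a placeholder stand-in, not the paper's flow-based architecture:

```python
import torch
import torch.nn as nn

h_text = torch.randn(1, 192, 12)                       # stand-in text hidden states, 12 tokens
proj = nn.Conv1d(192 + 1, 1, kernel_size=1)            # placeholder for the (inverse) flow

for _ in range(2):                                     # same sentence, two samples
    noise = torch.randn(1, 1, 12)                      # fresh noise each time
    log_dur = proj(torch.cat([h_text, noise], dim=1))  # noise + text -> log-durations
    durations = torch.ceil(torch.exp(log_dur))         # different noise -> different durations
    print(durations.squeeze())
```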
Adversarial training:
Hoping that the encoder was trained well, we can feed some text, generate the stats for f(z), and get the alignment from the duration predictor.
Using the inverse of the flow, we can recover z.
This z can then be decoded directly into ŷ (the waveform). (Note: during training, the z fed to the decoder actually comes from the posterior q(z|x_lin); the text → f(z) → z path is the inference path described below.)
To make things more interesting, we train a discriminator D that distinguishes the output of the decoder G (ŷ, the generated waveform) from the ground-truth waveform y. [Still reading up on the feature-matching loss and the discriminator layers; will update ASAP.]
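For what it's worth, here is a minimal sketch of how such adversarial and feature-matching losses are usually set up (a least-squares GAN formulation, as in HiFi-GAN/VITS); the discriminator below is a toy stand-in, not the paper's multi-period/multi-scale discriminators:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDisc(nn.Module):
    """Toy discriminator returning a score and its intermediate feature maps."""
    def __init__(self):
        super().__init__()
        self.c1 = nn.Conv1d(1, 16, kernel_size=15, stride=4)
        self.c2 = nn.Conv1d(16, 1, kernel_size=15, stride=4)

    def forward(self, wav):
        f1 = F.leaky_relu(self.c1(wav))
        return self.c2(f1), [f1]

disc = ToyDisc()
y = torch.randn(1, 1, 8192)       # stand-in ground-truth waveform segment
y_hat = torch.randn(1, 1, 8192)   # stand-in generated waveform segment

# Discriminator loss: push real scores towards 1 and fake scores towards 0.
score_real, feats_real = disc(y)
score_fake, _ = disc(y_hat.detach())
loss_d = torch.mean((score_real - 1) ** 2) + torch.mean(score_fake ** 2)

# Generator side: push its own scores towards 1, plus L1 feature matching on intermediate layers.
score_gen, feats_gen = disc(y_hat)
loss_adv = torch.mean((score_gen - 1) ** 2)
loss_fm = sum(F.l1_loss(fg, fr.detach()) for fg, fr in zip(feats_gen, feats_real))
```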
INFERENCE:
Feed some text, generate the stats for f(z), and get the alignment from the duration predictor.
Using the inverse of the flow, find z.
This z is decoded directly into ŷ (the predicted waveform), hence end-to-end (see the sketch below).
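A toy numerical sketch of this sampling path, under the big assumption that the flow inverse and the decoder are replaced by trivial stand-ins (mu/sigma here stand for the frame-level prior stats after duration expansion):

```python
import torch
import torch.nn.functional as F

mu = torch.zeros(1, 192, 86)                   # stand-in frame-level prior means
sigma = torch.ones(1, 192, 86)                 # stand-in frame-level prior std-devs

f_z = mu + sigma * torch.randn_like(mu)        # sample f(z) ~ N(mu, sigma)

def flow_inverse(x):
    # Stand-in for inverting the normalizing flow; any invertible map works for the sketch.
    return x * 0.5 + 0.1

z = flow_inverse(f_z)                          # recover z from f(z)

# Stand-in "decoder": upsample the latent by the hop length straight to a waveform.
weight = torch.randn(192, 1, 256)
y_hat = torch.tanh(F.conv_transpose1d(z, weight, stride=256))   # text stats -> waveform, end-to-end
```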
MINOR DOUBT:
Why do we need adversarial training when we already use a reconstruction loss in the VAE? Doesn't that take care of generating the right waveform from z?
Thanks for reading and please help if you can. Sorry for the long post.
EDIT: added TLDR