Hello, I am reading the VITS paper now, which builds on the GlowTTS paper that I discussed earlier. #1494
I am just trying to understand the math correctly, so please guide me if I am wrong. Thanks.
TLDR:
Rather than generating a predicted mel-spectrogram from text and then converting it into an audio waveform with a vocoder, this paper generates the waveform directly and minimises the waveform reconstruction loss indirectly, by comparing the predicted waveform's mel-spectrogram with the ground-truth waveform's mel-spectrogram. Since getting a mel-spectrogram from a waveform is a deterministic step (essentially just an STFT plus a mel filterbank), why do we need adversarial training?
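For reference, the waveform-to-mel step really is a fixed transform with no learned parameters. A minimal sketch with torchaudio (the hyperparameters here are illustrative, not necessarily the paper's exact configuration):

```python
import torch
import torchaudio

# Deterministic waveform -> mel-spectrogram transform: no learned parameters,
# so the same waveform always produces the same mel-spectrogram.
wav_to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80
)

y = torch.randn(1, 22050)   # stand-in for a 1-second waveform
x_mel = wav_to_mel(y)       # shape: (1, 80, frames)
```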
OBJECTIVE of DECODER:
x_lin is the linear spectrogram of the target audio. We also know c_text, the input text corresponding to x_lin.
Find z from x_lin; the relation z <-> x_lin is learned using variational inference.
There are two components to eq. 1 (also known as the ELBO): (i) a reconstruction loss and (ii) a KL divergence.
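Written out, the ELBO has the familiar form (notation as in the paper, with x the target and c the conditioning text/alignment):

```latex
\log p_\theta(x \mid c) \;\ge\;
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{(i) reconstruction}}
\;-\;
\underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid c)\big)}_{\text{(ii) KL divergence}}
```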
(i) Reconstruction loss:
From z, instead of predicting a spectrogram directly, we upsample z to ŷ, which is the audio waveform, convert ŷ into a mel-spectrogram, and compare it with the ground-truth mel-spectrogram. This is the reconstruction loss for generating x given z.
(The figure shows x_mel being used for this loss; the linear spectrogram x_lin is used as the posterior encoder's input instead of the mel-spectrogram, since it carries more information.)
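A minimal sketch of this term, assuming an L1 loss between mel-spectrograms; the decoder below is a toy stand-in upsampler, not the paper's HiFi-GAN-style decoder, and the latent/waveform tensors are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio

# Same deterministic transform as in the sketch above.
wav_to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80
)

# Toy stand-in decoder: upsample a latent sequence by the hop length to waveform rate.
decoder = nn.ConvTranspose1d(192, 1, kernel_size=256, stride=256)
z = torch.randn(1, 192, 86)               # stand-in latent (~1 s at hop 256)

y_hat = decoder(z).squeeze(1)             # z -> predicted waveform
y = torch.randn_like(y_hat)               # stand-in ground-truth waveform of the same length
mel_hat = wav_to_mel(y_hat)               # predicted waveform -> mel-spectrogram
mel_gt = wav_to_mel(y)                    # ground-truth waveform -> mel-spectrogram
recon_loss = F.l1_loss(mel_hat, mel_gt)   # L1 between mel-spectrograms
```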
(ii)KL divergence:
We have two distributions: the posterior q(z|x_lin) and the prior p(z|c_text, A).
Let's say we pick q(z|x_lin) to be a Gaussian, N(z; μ(x_lin), σ(x_lin)), whose statistics the VAE's posterior encoder learns.
For p(z|c_text, A), we want z to keep a complex distribution (instead of the traditionally simple ones, to increase expressiveness). So we apply a normalizing flow f and model f(z) ~ N(f(z); μ(c_text, A), σ(c_text, A)), with the usual change-of-variables correction (written out below).
Notice how cleverly the distribution of f(z) is made to depend on the input text and the alignment.
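Concretely, by the change-of-variables formula the flow-based prior is:

```latex
p_\theta(z \mid c_{\text{text}}, A) \;=\;
N\!\big(f_\theta(z);\ \mu_\theta(c_{\text{text}}, A),\ \sigma_\theta(c_{\text{text}}, A)\big)
\left|\det \frac{\partial f_\theta(z)}{\partial z}\right|
```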
MAS and ENCODER:
In GlowTTS, MAS was used to find the alignment during training. Here the objective is to maximise the ELBO, and only the prior term p(z|c_text, A) (the denominator inside the KL's log-ratio) depends on A and c_text. So if we maximise that term, we maximise the ELBO.
To maximise it, i.e. N(f(z); μ(c_text, A), σ(c_text, A)), we train the text encoder to produce the statistics μ(c_text, A) and σ(c_text, A), and use MAS to search over monotonic alignments A.
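In equation form (my reading, following the GlowTTS formulation applied to f(z)): MAS picks the monotonic, non-skipping alignment that maximises the log-likelihood of the flowed latent under the prior, solved with dynamic programming over text tokens i and latent frames j:

```latex
A = \arg\max_{\hat{A}} \; \log N\!\big(f_\theta(z);\ \mu_\theta(c_{\text{text}}, \hat{A}),\ \sigma_\theta(c_{\text{text}}, \hat{A})\big),
\qquad
Q_{i,j} = \max\big(Q_{i-1,j-1},\, Q_{i,j-1}\big) + \log N\!\big(f_\theta(z_j);\ \mu_i,\ \sigma_i\big)
```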
Also, similar to GlowTTS, we need a duration predictor to tell us how many latent/spectrogram frames each text token should span at inference time. However, GlowTTS was trained with a deterministic predictor (same sentence → same durations every time, which sounds unnatural), while here they train a stochastic, flow-based duration predictor that yields more natural durations (same sentence, different durations each time).
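A toy sketch of why a stochastic predictor varies its output: random noise, conditioned on the text hidden states, is mapped to log-durations, so each sample gives different durations. The module below is a placeholder stand-in, not the paper's flow-based architecture:

```python
import torch
import torch.nn as nn

h_text = torch.randn(1, 192, 12)                       # stand-in text hidden states, 12 tokens
proj = nn.Conv1d(192 + 1, 1, kernel_size=1)            # placeholder for the (inverse) flow

for _ in range(2):                                     # same sentence, two samples
    noise = torch.randn(1, 1, 12)                      # fresh noise each time
    log_dur = proj(torch.cat([h_text, noise], dim=1))  # noise + text -> log-durations
    durations = torch.ceil(torch.exp(log_dur))         # different noise -> different durations
    print(durations.squeeze())
```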
Adversarial training:
Hoping that the encoder was trained well, we can feed some text, generate the stats for f(z), and get the alignment from the duration predictor.
Using the inverse of the flow, we can recover z.
This z can then be decoded directly into ŷ (the waveform). (Note: during training, the z fed to the decoder actually comes from the posterior q(z|x_lin); the text → f(z) → z path is the inference path described below.)
To make things more interesting, we train a discriminator D that distinguishes the output of the decoder G (ŷ, the generated waveform) from the ground-truth waveform y. [Still reading up on the feature-matching loss and the discriminator layers; will update ASAP.]
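For what it's worth, here is a minimal sketch of how such adversarial and feature-matching losses are usually set up (a least-squares GAN formulation, as in HiFi-GAN/VITS); the discriminator below is a toy stand-in, not the paper's multi-period/multi-scale discriminators:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDisc(nn.Module):
    """Toy discriminator returning a score and its intermediate feature maps."""
    def __init__(self):
        super().__init__()
        self.c1 = nn.Conv1d(1, 16, kernel_size=15, stride=4)
        self.c2 = nn.Conv1d(16, 1, kernel_size=15, stride=4)

    def forward(self, wav):
        f1 = F.leaky_relu(self.c1(wav))
        return self.c2(f1), [f1]

disc = ToyDisc()
y = torch.randn(1, 1, 8192)       # stand-in ground-truth waveform segment
y_hat = torch.randn(1, 1, 8192)   # stand-in generated waveform segment

# Discriminator loss: push real scores towards 1 and fake scores towards 0.
score_real, feats_real = disc(y)
score_fake, _ = disc(y_hat.detach())
loss_d = torch.mean((score_real - 1) ** 2) + torch.mean(score_fake ** 2)

# Generator side: push its own scores towards 1, plus L1 feature matching on intermediate layers.
score_gen, feats_gen = disc(y_hat)
loss_adv = torch.mean((score_gen - 1) ** 2)
loss_fm = sum(F.l1_loss(fg, fr.detach()) for fg, fr in zip(feats_gen, feats_real))
```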
INFERENCE:
Feed some text, generate the stats for f(z), and get the alignment from the duration predictor.
Using the inverse of the flow, find z.
This z is decoded directly into ŷ (the predicted waveform), hence end-to-end (see the sketch below).
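A toy numerical sketch of this sampling path, under the big assumption that the flow inverse and the decoder are replaced by trivial stand-ins (mu/sigma here stand for the frame-level prior stats after duration expansion):

```python
import torch
import torch.nn.functional as F

mu = torch.zeros(1, 192, 86)                   # stand-in frame-level prior means
sigma = torch.ones(1, 192, 86)                 # stand-in frame-level prior std-devs

f_z = mu + sigma * torch.randn_like(mu)        # sample f(z) ~ N(mu, sigma)

def flow_inverse(x):
    # Stand-in for inverting the normalizing flow; any invertible map works for the sketch.
    return x * 0.5 + 0.1

z = flow_inverse(f_z)                          # recover z from f(z)

# Stand-in "decoder": upsample the latent by the hop length straight to a waveform.
weight = torch.randn(192, 1, 256)
y_hat = torch.tanh(F.conv_transpose1d(z, weight, stride=256))   # text stats -> waveform, end-to-end
```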
MINOR DOUBT:
Why do we need adversarial training when we already use a reconstruction loss in the VAE? Doesn't that take care of generating the right waveform from z?
Thanks for reading and please help if you can. Sorry for the long post.
EDIT: added TLDR