Commit 9503595

add some more toucantts documentation

1 parent 0a250f0
File tree

2 files changed: +16 −10 lines changed

README.md

Lines changed: 12 additions & 5 deletions
```diff
@@ -47,6 +47,12 @@ names.
 reduced, in practice, the vocoder produces much fewer artifacts.
 - Lots of quality of life changes: Finetuning your own model from the provided pretrained checkpoints is easier than
 ever! Just follow the `finetuning_example.py` pipeline.
+- We now have the option of choosing between Avocodo and [BigVGAN](https://arxiv.org/abs/2206.04658), which has an improved
+generator over HiFiGAN. It is significantly slower on CPU, but the quality is the best I have heard at the time of writing
+this. The speed on GPU is fine; just for CPU inference you might want to stick with Avocodo.
+- We compile a bunch of quality enhancements from all our previous works so far into one very stable and nice-sounding
+architecture, which we call **ToucanTTS**. We submitted a system based on this architecture to the Blizzard Challenge 2023;
+you can try out our system [speaking French here](https://huggingface.co/spaces/Flux9665/Blizzard2023IMS).

 ### 2022

```
```diff
@@ -149,16 +155,16 @@ appropriate names.

 ## Creating a new Pipeline 🦆

-### Build an Avocodo Pipeline
+### Build an Avocodo/BigVGAN Pipeline

 This should not be necessary, because we provide a pretrained model and one of the key benefits of vocoders in general
 is how incredibly speaker independent they are. But in case you want to train your own anyway, here are the
 instructions: You will need a function to return the list of all the absolute paths to each of
-the audio files in your dataset as strings. If you already have a *path_to_transcript_dict* of your data for FastSpeech
-2 training, you can simply take the keys of the dict and transform them into a list.
+the audio files in your dataset as strings. If you already have a *path_to_transcript_dict* of your data for ToucanTTS training,
+you can simply take the keys of the dict and transform them into a list.

 Then go to the directory
-*TrainingInterfaces/TrainingPipelines*. In there, make a copy of any existing pipeline that has Avocodo in its name. We
+*TrainingInterfaces/TrainingPipelines*. In there, make a copy of any existing pipeline that has Avocodo or BigVGAN in its name. We
 will use this as reference and only make the necessary changes to use the new dataset. Look out for a variable called
 *model_save_dir*. This is the default directory that checkpoints will be saved into, unless you specify another one when
 calling the training script. Change it to whatever you like. Then pass the list of paths to the instantiation of the
```
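The file-list requirement described in the README text above can be sketched as follows. This is a minimal illustration; `get_file_list` and the corpus path are hypothetical names, not part of the toolkit:

```python
from pathlib import Path

def get_file_list(corpus_root):
    # Collect the absolute paths to all audio files in the dataset
    # as strings, which is the format the vocoder pipeline expects.
    return sorted(str(p.resolve()) for p in Path(corpus_root).glob("**/*.wav"))

# Alternatively, if a path_to_transcript_dict already exists for
# ToucanTTS training, its keys are exactly the paths needed:
# file_list = list(path_to_transcript_dict.keys())
```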
```diff
@@ -173,7 +179,8 @@ Now you need to add your newly created pipeline to the pipeline dictionary in th

 What we call ToucanTTS is actually mostly FastSpeech 2, but with a couple of changes, such as the normalizing flow
 based PostNet that was introduced in PortaSpeech. We found the VAE used in PortaSpeech too unstable for low-resource
-cases, so we continue experimenting with those in experimental branches of the toolkit.
+cases, so we continue experimenting with those in experimental branches of the toolkit. There are a bunch of other
+changes that mostly relate to low-resource scenarios. For more info, have a look at the ToucanTTS docstring.

 In the directory called
 *Utility* there is a file called
```
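Since the PostNet mentioned in the hunk above is a normalizing flow, a minimal sketch of the core building block may help: an affine coupling step, shown here in plain NumPy purely as an illustration of the idea, not as the toolkit's actual implementation.

```python
import numpy as np

def coupling_forward(x, scale_net, shift_net):
    # Affine coupling: keep one half of the channels untouched and
    # transform the other half conditioned on the kept half.
    a, b = np.split(x, 2, axis=-1)
    s, t = scale_net(a), shift_net(a)
    return np.concatenate([a, b * np.exp(s) + t], axis=-1)

def coupling_inverse(y, scale_net, shift_net):
    # The transform is exactly invertible: recompute s and t from the
    # untouched half and undo the affine map.
    a, b = np.split(y, 2, axis=-1)
    s, t = scale_net(a), shift_net(a)
    return np.concatenate([a, (b - t) * np.exp(-s)], axis=-1)
```

In a real flow, `scale_net` and `shift_net` are small neural networks and several coupling steps are stacked with permutations in between; the exact invertibility is what lets the model compute likelihoods during training and sample during inference.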

TrainingInterfaces/Text_to_Spectrogram/StochasticToucanTTS/StochasticToucanTTS.py

Lines changed: 4 additions & 5 deletions
```diff
@@ -17,7 +17,7 @@

 class StochasticToucanTTS(torch.nn.Module):
     """
-    ToucanTTS module, which is mostly just a FastSpeech 2 module,
+    StochasticToucanTTS module, which is mostly just a FastSpeech 2 module,
     but with lots of designs from different architectures accumulated
     and some major components added to put a large focus on multilinguality.
```
```diff
@@ -27,16 +27,15 @@ class StochasticToucanTTS(torch.nn.Module):
     - Speaker embedding conditioning is derived from GST and Adaspeech 4
     - Responsiveness of variance predictors to utterance embedding is increased through conditional layer norm
     - The final output receives a GAN discriminator feedback signal
+    - Stochastic Duration Prediction through a normalizing flow
+    - Stochastic Pitch Prediction through a normalizing flow
+    - Stochastic Energy prediction through a normalizing flow

     Contributions inspired from elsewhere:
     - The PostNet is also a normalizing flow, like in PortaSpeech
     - Pitch and energy values are averaged per-phone, as in FastPitch to enable great controllability
     - The encoder and decoder are Conformers

-    Things that were tried, but showed inferior performance:
-    - Stochastic Duration Prediction
-    - Stochastic Pitch Prediction
-    - Stochastic Energy prediction
     """

     def __init__(self,
```
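The per-phone averaging of pitch and energy values mentioned in the docstring (the FastPitch-style contribution) can be illustrated with a small sketch. The helper name and shapes here are hypothetical, not taken from the toolkit:

```python
import numpy as np

def average_per_phone(frame_values, durations):
    # Collapse frame-level pitch/energy curves to one value per phone
    # by averaging over each phone's frames, as in FastPitch. This is
    # what makes per-phone pitch/energy controllable at inference time.
    averaged = []
    start = 0
    for dur in durations:
        segment = frame_values[start:start + dur]
        averaged.append(segment.mean() if dur > 0 else 0.0)
        start += dur
    return np.array(averaged)
```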
