Finetuning single speaker model for custom voice #1269

Simopich · 2022-02-19T19:29:13Z

Simopich
Feb 19, 2022

Hi, I'm fairly new to the AI world in general and I have a pretty basic question.
Would it be possible to fine-tune the Italian TTS model released by @nicolalandro in #1148, which is based on a single-speaker dataset, so that it speaks with my own voice? And how could I achieve that?
Thanks in advance! :)

windowshopr · 2022-09-15T06:33:21Z

windowshopr
Sep 15, 2022

I would love to get a tutorial made on this very subject too! Transfer learning seems to be (on average) regarded as a faster, more accurate way of achieving great results, but it’s tough to find detailed instructions on how to do it!

In addition to the Italian TTS model mentioned above, another model to include in this potential tutorial would be one based on the LJSpeech dataset, where all we want to do is change the voice using our own .wav and transcription files.

How would this be done? Has anyone tried it with success yet?
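For anyone preparing their own .wav and transcription files, here is a minimal sketch of writing an LJSpeech-style metadata file. It assumes the usual LJSpeech layout: a `wavs/` folder plus a pipe-separated `metadata.csv` with clip ID, transcription, and normalized transcription. The function name `write_ljspeech_metadata` and the choice to reuse the raw text as the normalized text are illustrative assumptions, not something taken from this repo.

```python
from pathlib import Path


def write_ljspeech_metadata(dataset_dir, transcripts):
    """Write dataset_dir/metadata.csv as id|text|normalized_text lines.

    `transcripts` maps a clip ID (wav filename without extension) to its text.
    Returns the clip IDs that were skipped because their wav file is missing.
    """
    dataset_dir = Path(dataset_dir)
    missing = []
    rows = []
    for clip_id, text in sorted(transcripts.items()):
        if not (dataset_dir / "wavs" / f"{clip_id}.wav").is_file():
            missing.append(clip_id)
            continue
        # "|" is the field separator, so drop it from the text,
        # then collapse runs of whitespace and newlines.
        text = " ".join(text.replace("|", " ").split())
        # Assumption: no separate normalization step, so the text is reused.
        rows.append((clip_id, text, text))
    with open(dataset_dir / "metadata.csv", "w", encoding="utf-8") as f:
        for row in rows:
            f.write("|".join(row) + "\n")
    return missing
```

Calling `write_ljspeech_metadata("my_voice", {"clip_0001": "Hello world"})` would write one line per clip that has a matching file under `my_voice/wavs/`. Whether a given training recipe expects exactly this layout is something to check against its own dataset loader.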


king-dahmanus · 2022-09-15T07:11:27Z

king-dahmanus
Sep 15, 2022

You want to re-record LJSpeech with your own voice? That will take time, even if you have the cash and use Descript to fake your voice so you can easily export the rest of the data.


nicolalandro · 2022-09-15T07:26:59Z

nicolalandro
Sep 15, 2022

For voice style transfer (take two audio clips and get the sentence from the first spoken in the voice of the second), I tried this code: https://github.com/nicolalandro/autovc (it works well on English but very badly on any other language 😅). You can try to replicate that work with an Italian pretrained model; otherwise you must train your model from scratch with your custom dataset.


windowshopr · 2022-09-15T16:05:39Z

windowshopr
Sep 15, 2022

Thanks! I will definitely check out that repo and see how it does.

I was thinking that, conceptually, if the model was trained well on a big dataset and sounded natural, etc., you could “freeze” the model except for the vocoder and en(de)coder to make use of the new voice only, even if the new voice’s dataset is not as large as the one it was originally trained on. All of the inflections and other subtle spoken-language nuances would theoretically have been learned during training, so we would just want to change the vocoder to the new voice?
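As a rough illustration of the “freeze” idea, here is a minimal PyTorch sketch. `ToyTTS` is a made-up stand-in and its submodule names (`text_encoder`, `decoder`, `vocoder`) are hypothetical; a real TTS model will have different names and shapes, and which parts to leave trainable is itself an assumption to test, not a recommendation.

```python
import torch
from torch import nn


class ToyTTS(nn.Module):
    """Stand-in model; the submodule names are hypothetical."""

    def __init__(self):
        super().__init__()
        self.text_encoder = nn.Linear(16, 32)
        self.decoder = nn.Linear(32, 80)
        self.vocoder = nn.Linear(80, 256)

    def forward(self, x):
        return self.vocoder(self.decoder(self.text_encoder(x)))


def freeze_except(model, trainable_prefixes):
    """Keep gradients only for parameters whose name starts with a given prefix."""
    prefixes = tuple(trainable_prefixes)
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)
    return [n for n, p in model.named_parameters() if p.requires_grad]


model = ToyTTS()
trainable = freeze_except(model, ["decoder.", "vocoder."])

# Only the still-trainable parameters are handed to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```

Printing `trainable` lists the parameter names that will still be updated, which is a quick way to confirm the frozen part really is frozen before starting a fine-tuning run.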

2 replies

nicolalandro Sep 15, 2022

If you stay with English, you only need a short audio clip of the voice you want to transfer.

nicolalandro Sep 15, 2022

If you want to train for another language, you need a new encoder and you also need to train autovc... Training only autovc on another language did not give me any good results.

windowshopr · 2022-09-15T17:09:13Z

windowshopr
Sep 15, 2022

This is all assuming the same language is used, yes. Sorry for the confusion.

So hypothetically, let’s say we have trained a model on 100 hours of English dialogue, and we want to use our own voice, for which we have, say, 3 hours of dialogue. Could I use autovc to transfer-learn the base model to my new voice? Is that the correct way to think of it?
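To check how much audio a custom recording set actually contains (the "3 hours" figure above), here is a small sketch that uses only Python's standard library. The helper name `total_hours` is made up for illustration, and it assumes plain PCM `.wav` files sitting directly in one folder, since the `wave` module cannot read compressed formats.

```python
import wave
from pathlib import Path


def total_hours(wav_dir):
    """Sum the duration of every .wav file directly inside wav_dir, in hours."""
    seconds = 0.0
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            # Duration of one clip = number of frames / frames per second.
            seconds += wav.getnframes() / wav.getframerate()
    return seconds / 3600.0
```

For example, `print(f"{total_hours('my_voice/wavs'):.2f} h")` would report the total for a hypothetical `my_voice/wavs` folder.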

2 replies

nicolalandro Sep 15, 2022

After training, you can also use just a few seconds of audio, as you can see in this Colab: https://colab.research.google.com/github/nicolalandro/autovc/blob/master/AutoVCDemoColab.ipynb

You only need two short audio clips: one that says what you need and another with the voice that you want to imitate... You can just try the notebook with the audio you want in English. (It should also be zero-shot, so it should work with unseen voices too.)

windowshopr Sep 15, 2022

Awesome! Thank you! I’ll give it a shot this weekend!
