Hello p0p4k,
I'm reaching out to you again with a question.
Thanks to your great help, I've successfully trained a Korean pflow model and run inference with it. I noticed a few limitations during inference, but I confirmed that it synthesizes satisfactory voices for seen speakers.
I used data from about 3,000 male and female speakers, keeping only audio files longer than 4.1 seconds. I ran distributed training with a batch size of 64 on four NVIDIA A100 40GB GPUs, completing 160 epochs (500k steps).
However, when I synthesize speech using unseen speakers' recordings as prompts, the spoken content comes out well, but the prompt speaker's voice is not reflected in the synthesized audio.
This happens for both male and female speakers. My inference code was written by referring to synthesis.ipynb and is almost identical to it.
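For context, the zero-shot path I am following looks roughly like the sketch below. This is only a minimal outline, not the actual synthesis.ipynb code: the mel parameters, the `prompt` keyword, and the `model.synthesise` signature are assumptions or placeholders for whatever the notebook really defines.

```python
import torch
import torchaudio


def load_prompt_mel(wav_path: str, sample_rate: int = 22050, prompt_frames: int = 264) -> torch.Tensor:
    """Load a reference utterance from an unseen speaker and return a
    fixed-length mel slice to use as the speaker prompt.
    The mel settings here are assumptions; they should match training."""
    wav, sr = torchaudio.load(wav_path)
    if sr != sample_rate:
        wav = torchaudio.functional.resample(wav, sr, sample_rate)
    mel_fn = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=80
    )
    mel = mel_fn(wav)                 # (1, n_mels, T)
    return mel[:, :, :prompt_frames]  # truncate to a fixed-length prompt


def zero_shot_synthesise(model, text_ids: torch.Tensor, prompt_wav: str) -> torch.Tensor:
    """Sketch of the zero-shot call. `model.synthesise` and its arguments are
    placeholders for whatever synthesis.ipynb actually exposes."""
    prompt = load_prompt_mel(prompt_wav)                # prompt from the unseen speaker
    x = text_ids[None] if text_ids.dim() == 1 else text_ids
    x_lengths = torch.tensor([x.shape[-1]])
    with torch.no_grad():
        out = model.synthesise(x, x_lengths, prompt=prompt, n_timesteps=10)
    return out["mel"]                                   # decoded with a vocoder afterwards
```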
I'm looking into why the speaker's voice characteristics are not applied in zero-shot inference.
If you have experienced the same issue or know anything about it, I would appreciate your help. If there's any additional information I should provide, please comment below.
Thank you.