
about zero-shot inference #37

@0913ktg


Hello p0p4k,

I'm reaching out to you again with a question.

Thanks to your great help, I've successfully trained the Korean pflow model and run inference with it. During inference I observed a few limitations, but I confirmed that satisfactory voices are synthesized for seen speakers.

I used data from about 3,000 male and female speakers, keeping only audio files longer than 4.1 seconds. I ran distributed training with a batch size of 64 on 4 NVIDIA A100 40GB GPUs, completing 160 epochs (500k steps).

However, when I synthesize speech using an unseen speaker's audio as the prompt, the text content is rendered well, but the prompt speaker's voice characteristics are not transferred to the synthesized audio.

This happens for both male and female prompt speakers. My inference code was written by following synthesis.ipynb (it is almost identical); a rough sketch of what it does is shown below.
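
For context, this is roughly the path my script takes (a minimal sketch, paraphrased from memory; the mel settings and the `synthesise` keyword names are what I use on my side and are assumptions, not a quote of the repo's exact API):

```python
import torch
import torchaudio


def zero_shot_synthesise(model, text_tokens: torch.Tensor, prompt_wav_path: str,
                         sample_rate: int = 22050, device: str = "cuda"):
    """Condition pflow generation on an unseen speaker's clip.

    `model` is the pflow checkpoint loaded as in synthesis.ipynb; the
    keyword names passed to `synthesise` below are assumptions on my part.
    """
    # Reference clip from the unseen speaker -> mel prompt.
    # Mel settings must match the ones used during training (values assumed here).
    wav, sr = torchaudio.load(prompt_wav_path)
    wav = torchaudio.functional.resample(wav, sr, sample_rate)
    to_mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=80
    )
    prompt_mel = to_mel(wav).to(device)          # (1, n_mels, T_prompt)

    # Target text already converted to token ids by the text frontend.
    x = text_tokens.to(device).unsqueeze(0)      # (1, T_text)
    x_lengths = torch.tensor([x.shape[-1]], device=device)

    # The unseen speaker is injected only through the prompt mel;
    # there is no separate speaker-id embedding in my setup.
    return model.synthesise(
        x, x_lengths,
        prompt=prompt_mel,
        n_timesteps=10,
        temperature=0.7,
        length_scale=1.0,
    )
```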

I'm looking into why the speaker's voice characteristics are not applied in zero-shot inference.

If you have experienced the same issue or know anything about it, I would appreciate your help. If there's any additional information I should provide, please comment below.

Thank you.
