
about zero-shot inference #37

@0913ktg


Hello p0p4k,

I'm reaching out to you again with a question.

Thanks to your great help, I've successfully trained the Korean pflow model and run inference with it. During inference I observed a few limitations, but I confirmed that satisfactory voices are synthesized for seen speakers.

I used data from about 3,000 male and female speakers, keeping only audio files longer than 4.1 seconds. I ran distributed training with a batch size of 64 on 4 NVIDIA A100 40GB GPUs, completing 160 epochs (500k steps).

However, when I synthesize speech using an unseen speaker's audio as the prompt, the text content is rendered well, but the prompt speaker's voice characteristics are not transferred to the synthesized audio.

This happens for both male and female prompt speakers. My inference code was written by following synthesis.ipynb (it is almost identical); a rough sketch of what it does is shown below.
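
For context, this is roughly the path my script takes (a minimal sketch, paraphrased from memory; the mel settings and the `synthesise` keyword names are what I use on my side and are assumptions, not a quote of the repo's exact API):

```python
import torch
import torchaudio


def zero_shot_synthesise(model, text_tokens: torch.Tensor, prompt_wav_path: str,
                         sample_rate: int = 22050, device: str = "cuda"):
    """Condition pflow generation on an unseen speaker's clip.

    `model` is the pflow checkpoint loaded as in synthesis.ipynb; the
    keyword names passed to `synthesise` below are assumptions on my part.
    """
    # Reference clip from the unseen speaker -> mel prompt.
    # Mel settings must match the ones used during training (values assumed here).
    wav, sr = torchaudio.load(prompt_wav_path)
    wav = torchaudio.functional.resample(wav, sr, sample_rate)
    to_mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=80
    )
    prompt_mel = to_mel(wav).to(device)          # (1, n_mels, T_prompt)

    # Target text already converted to token ids by the text frontend.
    x = text_tokens.to(device).unsqueeze(0)      # (1, T_text)
    x_lengths = torch.tensor([x.shape[-1]], device=device)

    # The unseen speaker is injected only through the prompt mel;
    # there is no separate speaker-id embedding in my setup.
    return model.synthesise(
        x, x_lengths,
        prompt=prompt_mel,
        n_timesteps=10,
        temperature=0.7,
        length_scale=1.0,
    )
```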

I'm looking into why the speaker's voice characteristics are not applied in zero-shot inference.

If you have experienced the same issue or know anything about it, I would appreciate your help. If there's any additional information I should provide, please comment below.

Thank you.
