Text-to-token LLM from CosyVoice-300M-SFT #232

@Lalaramarya

How can I generate target semantic tokens for my own data using the CosyVoice-300M-SFT model? I cannot find any code for this in CosyVoice.

Did you generate the semantic tokens from the target text, or from the target speech only?
Would it make any difference if we generated the target discrete tokens from the target speech using HuBERT and then fine-tuned the model?
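
To make the HuBERT option concrete: discrete units of that kind are usually obtained by assigning each frame-level feature vector to its nearest centroid in a k-means codebook. The sketch below shows only that assignment step, in plain Python with toy values. The function name `quantize_frames`, the 2-dimensional features, and the 3-entry codebook are all hypothetical. In practice the features would come from a pretrained HuBERT layer and the codebook from k-means fitted on such features. This is not CosyVoice's own tokenizer.

```python
from typing import List, Sequence


def quantize_frames(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Map each feature frame to the index of its nearest codebook centroid.

    Distance is squared Euclidean. Both arguments are illustrative stand-ins
    for real HuBERT features and a trained k-means codebook.
    """
    if not codebook:
        raise ValueError("codebook must contain at least one centroid")

    tokens: List[int] = []
    for frame in frames:
        best_idx, best_dist = 0, float("inf")
        for idx, centroid in enumerate(codebook):
            dist = sum((f - c) ** 2 for f, c in zip(frame, centroid))
            if dist < best_dist:
                best_idx, best_dist = idx, dist
        tokens.append(best_idx)
    return tokens


# Toy example: three frames, three centroids (assumed values).
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
toy_frames = [[0.1, -0.1], [0.9, 1.2], [4.0, 6.0]]
toy_tokens = quantize_frames(toy_frames, toy_codebook)
```

Tokens produced this way depend on the chosen feature layer and codebook, so they would not share a vocabulary with the tokens the pretrained model was trained on unless the same tokenizer is used.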

Currently, I get output like this:

Generating: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 256/256 [00:22<00:00, 11.37it/s]
[2025-05-15 15:23:14][root][INFO] - Generated Text: hoteleot lot lot lot lot Lot lot lotlot lot lotLot lot lot lot lotlota lote le lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot litotal lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lotit lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot
[2025-05-15 15:23:14][root][WARNING] - Audio token is too long, skipping.
Generating: 0%| | 0/256 [00:00<?, ?it/s]torch.Size([3000, 128])
300
torch.Size([3, 83])
83
Generating: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 256/256 [00:23<00:00, 10.70it/s]
[2025-05-15 15:23:38][root][INFO] - Generated Text: hote hote hotte hot te tete te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te teet ete te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te Te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te
[2025-05-15 15:23:38][root][WARNING] - Audio token is too long, skipping.
torch.Size([3000, 128])
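
Both generations above collapse into a single repeated word ("lot", "te") until the length limit is hit. A small check like the following could flag such outputs automatically while debugging. It is a hypothetical helper, not part of CosyVoice or of the script that produced these logs, and the 0.5 threshold is an arbitrary assumption.

```python
from collections import Counter


def dominant_token_ratio(text: str) -> float:
    """Return the share of whitespace-separated tokens taken by the most common one."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    _, count = Counter(tokens).most_common(1)[0]
    return count / len(tokens)


def looks_degenerate(text: str, threshold: float = 0.5) -> bool:
    """Flag text in which one token makes up at least `threshold` of all tokens."""
    return dominant_token_ratio(text) >= threshold


# Shortened excerpt in the style of the second log line above.
sample = "hote hote hotte hot te te te te te te te te"
flagged = looks_degenerate(sample)
```

A high ratio only indicates a repetition loop. It does not say whether the cause is the token vocabulary, the prompt format, or the decoding settings.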
