Description
How can I generate target semantic tokens for my own data using the CosyVoice-300M-SFT model? I am unable to locate any code for this in CosyVoice.
Did you generate the semantic tokens from the target text, or from the target speech only?
Would it make any difference if I generated the target discrete tokens from the target speech using HuBERT and then fine-tuned the model?
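To make the HuBERT option concrete, here is a minimal sketch of what I mean by discrete tokens from speech: each frame-level feature vector is mapped to the index of its nearest k-means centroid. This is not CosyVoice's tokenizer. The function name `quantize_frames`, the 2-dimensional features, and the three centroids are hypothetical toy placeholders standing in for real HuBERT features and a trained codebook.

```python
from typing import List, Sequence


def quantize_frames(
    frames: Sequence[Sequence[float]],
    centroids: Sequence[Sequence[float]],
) -> List[int]:
    """Map each frame vector to the index of its nearest centroid (squared L2)."""
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every centroid.
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in centroids
        ]
        # The token is the index of the closest centroid.
        tokens.append(distances.index(min(distances)))
    return tokens


# Toy placeholders (assumption): real features would come from a speech
# encoder such as HuBERT, and real centroids from k-means fitted on them.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [3.8, 0.1], [0.2, 0.0]]

tokens = quantize_frames(frames, centroids)
print(tokens)
```

Tokens produced this way live in a different codebook from the one the pretrained model was trained on, which is why I am asking whether fine-tuning on them is expected to work.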
Currently, I am getting results like this:
Generating: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 256/256 [00:22<00:00, 11.37it/s]
[2025-05-15 15:23:14][root][INFO] - Generated Text: hoteleot lot lot lot lot Lot lot lotlot lot lotLot lot lot lot lotlota lote le lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot litotal lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lotit lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot lot
[2025-05-15 15:23:14][root][WARNING] - Audio token is too long, skipping.
Generating: 0%| | 0/256 [00:00<?, ?it/s]
torch.Size([3000, 128])
300
torch.Size([3, 83])
83
Generating: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 256/256 [00:23<00:00, 10.70it/s]
[2025-05-15 15:23:38][root][INFO] - Generated Text: hote hote hotte hot te tete te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te teet ete te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te Te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te te
[2025-05-15 15:23:38][root][WARNING] - Audio token is too long, skipping.
torch.Size([3000, 128])