VoiceCraft-X is an autoregressive neural codec language model that unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish.
conda create -n voicecraftx python=3.10
conda activate voicecraftx
# montreal-forced-aligner is only needed for speech editing; if you find it difficult to install, you can skip it.
conda install -c conda-forge montreal-forced-aligner==3.2.1
pip install -r requirements.txt
We have uploaded the pretrained models to HuggingFace. You can download them from here.
cd VoiceCraft-X
git clone https://huggingface.co/zhisheng01/VoiceCraft-X pretrained_models
Note: The default multilingual checkpoint is voicecraftx.ckpt.
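If you only need a single checkpoint rather than the whole repo, a minimal sketch of a direct download is below. It uses HuggingFace's standard `resolve/main` download endpoint with the repo id and default checkpoint name from this README; the `pretrained_models` output directory mirrors the clone command above.

```python
# Minimal sketch: fetch one checkpoint from the HuggingFace repo without
# cloning everything. The repo id and checkpoint filename come from this
# README; the URL pattern is HuggingFace's standard "resolve" endpoint.
import urllib.request
from pathlib import Path

REPO = "zhisheng01/VoiceCraft-X"
CKPT = "voicecraftx.ckpt"

def ckpt_url(repo: str, filename: str) -> str:
    """Build the direct-download URL for a file on the main branch."""
    return f"https://huggingface.co/{repo}/resolve/main/{filename}"

def download(repo: str, filename: str, out_dir: str = "pretrained_models") -> Path:
    """Download the file into out_dir if it is not already present."""
    dest = Path(out_dir) / filename
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(ckpt_url(repo, filename), dest)
    return dest

# Usage: download(REPO, CKPT) -> pretrained_models/voicecraftx.ckpt
```

Alternatively, the `huggingface_hub` library's `hf_hub_download` function does the same with caching and resume support.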
The monolingual checkpoints can be downloaded from the following links (with more languages coming soon):
Check out speech_editing.ipynb and speech_synthesize.ipynb.
- Environment setup
- Inference code for TTS and speech editing
- Upload monolingual (Japanese, French, Spanish, Indian Languages, ...) checkpoints to HuggingFace
- HuggingFace Spaces demo
- Colab notebooks
- Command line
- Improve efficiency
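Speech editing relies on word-level timestamps from a forced aligner (such as the Montreal Forced Aligner installed above) to locate the region of audio to regenerate. The sketch below is not the repo's actual API; it only illustrates, with a hypothetical `edit_span` helper, how alignment output maps an edited word range to a time span.

```python
# Hedged sketch (not VoiceCraft-X's actual API): given word-level
# alignments from a forced aligner, find the audio span covered by the
# words being edited, so only that region needs to be resynthesized.
def edit_span(alignments, first_word, last_word):
    """alignments: list of (word, start_sec, end_sec) tuples, in order.
    Returns (start_sec, end_sec) covering words first_word..last_word
    (inclusive indices)."""
    start = alignments[first_word][1]
    end = alignments[last_word][2]
    return start, end

words = [("the", 0.00, 0.12), ("quick", 0.12, 0.45),
         ("brown", 0.45, 0.80), ("fox", 0.80, 1.05)]
# Editing "quick brown" means resynthesizing audio between 0.12s and 0.80s:
print(edit_span(words, 1, 2))  # → (0.12, 0.8)
```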
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Copyright (c) 2025 Zhisheng Zheng/The University of Texas at Austin
For commercial use, please contact the authors.
We acknowledge the following open-source projects that made this work possible:
- AudioCraft for open-sourcing EnCodec
- Montreal Forced Aligner for speech alignment
- CosyVoice for text-preprocessing and speaker embedding extraction
- VoiceCraft for token ordering and sampling strategy
If you use this work in your research, please cite:
@inproceedings{zheng2025voicecraft,
title={VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing},
author={Zheng, Zhisheng and Peng, Puyuan and Diwan, Anuj and Huynh, Cong Phuoc and Sun, Xiaohang and Liu, Zhu and Bhat, Vimal and Harwath, David},
booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
pages={2737--2756},
year={2025}
}
Any organization or individual is prohibited from using any technology mentioned in this paper to generate or edit someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply may constitute a violation of copyright law.