VoiceCraft-X is an autoregressive neural codec language model that unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish.
conda create -n voicecraftx python=3.10
conda activate voicecraftx
# montreal-forced-aligner is only needed for speech editing; if you find it difficult to install, you can skip it.
conda install -c conda-forge montreal-forced-aligner==3.2.1
pip install -r requirements.txt
We have uploaded the pretrained models to HuggingFace. You can download them from here.
cd VoiceCraft-X
git clone https://huggingface.co/zhisheng01/VoiceCraft-X pretrained_models
Note: The default multilingual checkpoint is voicecraftx.ckpt.
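If you only need a single checkpoint rather than the whole repo, a minimal sketch of a direct download is below. It uses HuggingFace's standard `resolve/main` download endpoint with the repo id and default checkpoint name from this README; the `pretrained_models` output directory mirrors the clone command above.

```python
# Minimal sketch: fetch one checkpoint from the HuggingFace repo without
# cloning everything. The repo id and checkpoint filename come from this
# README; the URL pattern is HuggingFace's standard "resolve" endpoint.
import urllib.request
from pathlib import Path

REPO = "zhisheng01/VoiceCraft-X"
CKPT = "voicecraftx.ckpt"

def ckpt_url(repo: str, filename: str) -> str:
    """Build the direct-download URL for a file on the main branch."""
    return f"https://huggingface.co/{repo}/resolve/main/{filename}"

def download(repo: str, filename: str, out_dir: str = "pretrained_models") -> Path:
    """Download the file into out_dir if it is not already present."""
    dest = Path(out_dir) / filename
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(ckpt_url(repo, filename), dest)
    return dest

# Usage: download(REPO, CKPT) -> pretrained_models/voicecraftx.ckpt
```

Alternatively, the `huggingface_hub` library's `hf_hub_download` function does the same with caching and resume support.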
The monolingual checkpoints can be downloaded from the following links (with more languages coming soon):
Check out speech_editing.ipynb and speech_synthesize.ipynb.
- Environment setup
- Inference code for TTS and speech editing
- Upload monolingual (Japanese, French, Spanish, Indian Languages, ...) checkpoints to HuggingFace
- HuggingFace Spaces demo
- Colab notebooks
- Command line
- Improve efficiency
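Speech editing relies on word-level timestamps from a forced aligner (such as the Montreal Forced Aligner installed above) to locate the region of audio to regenerate. The sketch below is not the repo's actual API; it only illustrates, with a hypothetical `edit_span` helper, how alignment output maps an edited word range to a time span.

```python
# Hedged sketch (not VoiceCraft-X's actual API): given word-level
# alignments from a forced aligner, find the audio span covered by the
# words being edited, so only that region needs to be resynthesized.
def edit_span(alignments, first_word, last_word):
    """alignments: list of (word, start_sec, end_sec) tuples, in order.
    Returns (start_sec, end_sec) covering words first_word..last_word
    (inclusive indices)."""
    start = alignments[first_word][1]
    end = alignments[last_word][2]
    return start, end

words = [("the", 0.00, 0.12), ("quick", 0.12, 0.45),
         ("brown", 0.45, 0.80), ("fox", 0.80, 1.05)]
# Editing "quick brown" means resynthesizing audio between 0.12s and 0.80s:
print(edit_span(words, 1, 2))  # → (0.12, 0.8)
```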
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Copyright (c) 2025 Zhisheng Zheng/The University of Texas at Austin
For commercial use, please contact the authors.
We acknowledge the following open-source projects that made this work possible:
- AudioCraft for open-sourcing EnCodec
- Montreal Forced Aligner for speech alignment
- CosyVoice for text-preprocessing and speaker embedding extraction
- VoiceCraft for token ordering and sampling strategy
If you use this work in your research, please cite:
@inproceedings{zheng2025voicecraft,
title={VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing},
author={Zheng, Zhisheng and Peng, Puyuan and Diwan, Anuj and Huynh, Cong Phuoc and Sun, Xiaohang and Liu, Zhu and Bhat, Vimal and Harwath, David},
booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
pages={2737--2756},
year={2025}
}
Any organization or individual is prohibited from using any technology mentioned in this paper to generate or edit someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply may constitute a violation of copyright law.