
VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing


TL;DR

VoiceCraft-X is an autoregressive neural codec language model that unifies multilingual speech editing and zero-shot text-to-speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish.

Installation

conda create -n voicecraftx python=3.10
conda activate voicecraftx

# montreal-forced-aligner is only needed for speech editing; if it is difficult to install, you can skip it.
conda install -c conda-forge montreal-forced-aligner==3.2.1
pip install -r requirements.txt

Pretrained Models

We have uploaded the pretrained models to HuggingFace (zhisheng01/VoiceCraft-X). You can download them as follows:

cd VoiceCraft-X
git clone https://huggingface.co/zhisheng01/VoiceCraft-X pretrained_models

Note: the default multilingual checkpoint is voicecraftx.ckpt. Monolingual checkpoints are available from the following links (with more languages coming soon):

Inference

Check out speech_editing.ipynb and speech_synthesize.ipynb.

TODO

  • Environment setup
  • Inference code for TTS and speech editing
  • Upload monolingual (Japanese, French, Spanish, Indian Languages, ...) checkpoints to HuggingFace
  • HuggingFace Spaces demo
  • Colab notebooks
  • Command line
  • Improve efficiency

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Copyright (c) 2025 Zhisheng Zheng/The University of Texas at Austin

For commercial use, please contact the authors.

Acknowledgement

We acknowledge the following open-source projects that made this work possible:

Citation

If you use this work in your research, please cite:

@inproceedings{zheng2025voicecraft,
  title={VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing},
  author={Zheng, Zhisheng and Peng, Puyuan and Diwan, Anuj and Huynh, Cong Phuoc and Sun, Xiaohang and Liu, Zhu and Bhat, Vimal and Harwath, David},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages={2737--2756},
  year={2025}
}

Disclaimer

Any organization or individual is prohibited from using any technology described in this paper to generate or edit a person's speech without their consent; this applies to everyone, including but not limited to government leaders, political figures, and celebrities. Failure to comply may place you in violation of copyright laws.
