# SLAM-Omni
[Python 3.10](https://www.python.org/downloads/release/python-3100/) [arXiv](https://arxiv.org/abs/2412.15649) [Demo](https://slam-omni.github.io/) [License: MIT](https://opensource.org/licenses/MIT)

(*Reproduction of the paper [SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training](https://arxiv.org/abs/2412.15649).*)

## Environment Setup
After preparing the SLAM-LLM environment, install the dependencies with the following command:
```bash
pip install -r ./examples/s2s/requirements.txt
```

Alternatively, you can use our provided Docker image:
```bash
docker pull worstchan/slam-omni:v0
docker run -it --gpus all --name slam-omni worstchan/slam-omni:v0 /bin/bash
```

## Data Preparation

Our project supports two data formats: **Parquet** and **JSONL**. The open-source datasets are available on the Hugging Face Hub in **Parquet** format. Example usage is provided in [this notebook](./demo/demo_data/demo.ipynb).

### Supported Datasets
We provide three re-synthesized datasets for SLAM-Omni training:
- [VoiceAssistant-400K](https://huggingface.co/datasets/worstchan/VoiceAssistant-400K-SLAM-Omni): Single-round English dialogue dataset.
- [UltraChat-300K](https://huggingface.co/datasets/worstchan/UltraChat-300K-SLAM-Omni): Multi-round English dialogue dataset.
- [Belle_1.4M](https://huggingface.co/datasets/worstchan/Belle_1.4M-SLAM-Omni): Multi-round Chinese dialogue dataset.

#### Usage
You can load any of these datasets using the following code:
```python
from datasets import load_dataset

# Replace "DATASET_NAME" with one of the following:
# - "worstchan/VoiceAssistant-400K-SLAM-Omni"
# - "worstchan/UltraChat-300K-SLAM-Omni"
# - "worstchan/Belle_1.4M-SLAM-Omni"

ds = load_dataset("DATASET_NAME")
```
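
The three datasets do not necessarily share an identical column layout, so it can help to peek at whichever split you pulled before writing a loader. A minimal sketch that makes no assumptions about column names:

```python
from datasets import load_dataset

# Stream the dataset to inspect it without downloading every Parquet shard.
ds = load_dataset("worstchan/VoiceAssistant-400K-SLAM-Omni", streaming=True)

split = next(iter(ds))          # name of the first available split, e.g. "train"
sample = next(iter(ds[split]))  # first record of that split as a plain dict
print("splits:", list(ds.keys()))
print("fields:", list(sample.keys()))
```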

### JSONL
We also support the JSONL format for its concise structure. Below is an example:
```jsonl
{"key": "1", "source_wav": "/xxx/1.wav", "source_text": "Can you recommend some Chinese food for me?", "target_wav": "/xxx/1.wav", "target_text": "Sure! I recommend trying dumplings, Peking duck, and mapo tofu for a mix of flavors and textures in Chinese cuisine. These dishes offer a good balance of savory, spicy, and crispy elements."}
```
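
If you are assembling such a manifest from your own recordings, writing one JSON object per line is all that is needed. A minimal sketch (the paths and texts below are placeholders):

```python
import json

# Each entry mirrors the fields shown above: a unique key, the user (source)
# utterance, and the assistant (target) response.
samples = [
    {
        "key": "1",
        "source_wav": "/path/to/user_question.wav",
        "source_text": "Can you recommend some Chinese food for me?",
        "target_wav": "/path/to/assistant_answer.wav",
        "target_text": "Sure! I recommend trying dumplings, Peking duck, and mapo tofu.",
    },
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```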

## Checkpoints
We reproduced the single-stage fine-tuning results of SLAM-Omni with a group size of **3**. The following checkpoints are available for download:
- [Single-Round Dialogue (English)](https://drive.google.com/drive/folders/1ZmM1h5ZTvS-piuN-msmctmZdi51GWLAu?usp=sharing): Trained on VoiceAssistant-400K.
- [Multi-Round Dialogue (English)](https://drive.google.com/drive/folders/1xBNrqR2LWC0uEjezjx4aUgdsbstisboS?usp=sharing): Trained on VoiceAssistant-400K and UltraChat-300K.
- [Multi-Round Dialogue (Chinese)](https://drive.google.com/drive/folders/1sExIp-UDdL37gb-mh9YlhuDIib0-wUVP?usp=sharing): Trained on Belle_1.4M.
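
If you prefer fetching a checkpoint folder from the command line rather than the browser, `gdown` can download a Google Drive folder. This is only a convenience sketch and assumes a recent `gdown` release that provides `download_folder`; the output directory is arbitrary:

```python
# pip install gdown
import gdown

# Replace the URL with one of the checkpoint folders listed above.
url = "https://drive.google.com/drive/folders/1ZmM1h5ZTvS-piuN-msmctmZdi51GWLAu"

# Downloads the folder contents into the given output directory.
gdown.download_folder(url=url, output="./checkpoints/slam-omni-single-round-en", quiet=False)
```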

## Training

You can pre-train the S2S model using TTS or ASR tasks with our provided scripts, though we recommend proceeding directly to fine-tuning. Alternatively, you may directly train a TTS or ASR model under the SLAM-Omni framework. For detailed instructions, refer to the [pre-training README](./scripts/pretrain/README.md).

### Fine-tuning
We provide two primary fine-tuning options for **SLAM-Omni** modeling:
```bash
# Fine-tune with grouping strategy (Recommended)
bash ./examples/s2s/scripts/finetune/finetune_s2s_group.sh

# Fine-tune without grouping
bash ./examples/s2s/scripts/finetune/finetune_s2s.sh
```
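
For intuition about the grouping strategy: it packs consecutive speech codec tokens into fixed-size groups (size 3 for our released checkpoints), so the language model emits several audio tokens per decoding step and the target sequence becomes correspondingly shorter. The sketch below is purely illustrative and is not the repository's implementation; the token IDs and padding value are invented:

```python
from typing import List

def group_tokens(tokens: List[int], group_size: int = 3, pad_id: int = 0) -> List[List[int]]:
    """Pack a flat stream of codec token IDs into fixed-size groups.

    Illustrative only: with group_size=3 the model predicts one group
    (three audio tokens) per autoregressive step.
    """
    # Pad so the stream length is a multiple of group_size.
    remainder = len(tokens) % group_size
    if remainder:
        tokens = tokens + [pad_id] * (group_size - remainder)
    return [tokens[i:i + group_size] for i in range(0, len(tokens), group_size)]

# 7 codec tokens -> 3 groups of 3, with the last group padded.
print(group_tokens([11, 12, 13, 14, 15, 16, 17]))
# [[11, 12, 13], [14, 15, 16], [17, 0, 0]]
```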

We also include scripts for reproducing [Mini-Omni](https://github.com/gpt-omni/mini-omni). Note that this requires the original [VoiceAssistant-400K](https://huggingface.co/datasets/gpt-omni/VoiceAssistant-400K) dataset for training:
```bash
bash ./examples/s2s/scripts/finetune/mini-omni/finetune_s2s.sh
```

#### Note💫
Our framework theoretically supports training **any codec-based spoken dialogue model**: simply re-synthesize the target speech tokens (e.g., CosyVoice2 tokens) during training for compatibility.
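
As a hedged illustration of what "re-synthesize the target tokens" means in practice: each target waveform in the manifest would be passed through your codec's tokenizer, and the resulting discrete IDs stored alongside the record. Both `extract_codec_tokens` and the `target_token` field below are hypothetical placeholders, not the repository's schema:

```python
import json
from typing import List

def extract_codec_tokens(wav_path: str) -> List[int]:
    """Hypothetical placeholder: run your codec's tokenizer
    (e.g., CosyVoice2's) on the waveform and return its discrete token IDs."""
    raise NotImplementedError

def resynthesize_manifest(in_path: str, out_path: str) -> None:
    """Rewrite a JSONL manifest, attaching codec token IDs for every target wav."""
    with open(in_path, "r", encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            record = json.loads(line)
            # "target_token" is an illustrative field name, not the repo's schema.
            record["target_token"] = extract_codec_tokens(record["target_wav"])
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
```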

## Inference
We provide scripts for both **online** and **batch** inference. You can use your trained model or the provided checkpoints. For detailed guidance, refer to the [inference README](./scripts/inference/README.md).

### Online Inference
Run the following commands for real-time inference:

```bash
# Multi-turn (Recommended)
bash ./examples/s2s/scripts/inference/inference_s2s_online_multi-round.sh

# Single-turn
bash ./examples/s2s/scripts/inference/inference_s2s_online.sh
```

For Mini-Omni modeling, use the following commands:
```bash
# Single-turn non-streaming
bash ./examples/s2s/scripts/inference/mini-omni/inference_s2s_online.sh

# Single-turn streaming
bash ./examples/s2s/scripts/inference/mini-omni/inference_s2s_online_stream.sh
```

### Batch Inference

For batch inference, ensure the data format matches the training format (**Parquet** or **JSONL**). Use the following commands:

```bash
# SLAM-Omni framework
bash ./examples/s2s/scripts/inference/inference_s2s_batch.sh

# Mini-Omni framework
bash ./examples/s2s/scripts/inference/mini-omni/inference_s2s_batch.sh
```

## TODO
- [ ] Add evaluation scripts.
- [ ] Add streaming inference scripts for SLAM-Omni.

<!-- ## Gradio Demo -->

## Citation
SLAM-Omni:
```bibtex
@article{chen2024slam,
  title={SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training},
  author={Chen, Wenxi and Ma, Ziyang and Yan, Ruiqi and Liang, Yuzhe and Li, Xiquan and Xu, Ruiyang and Niu, Zhikang and Zhu, Yanqiao and Yang, Yifan and Liu, Zhanxun and others},
  journal={arXiv preprint arXiv:2412.15649},
  year={2024}
}
```
Mini-Omni:
```bibtex
@article{xie2024mini,
  title={Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming},
  author={Xie, Zhifei and Wu, Changqiao},
  journal={arXiv preprint arXiv:2408.16725},
  year={2024}
}
```

## Acknowledgement
- We borrow some code from [Mini-Omni](https://github.com/gpt-omni/mini-omni) for SNAC-based modeling.
- We borrow some code from [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) for the vocoder.

## License
Our code is released under the MIT License. The Chinese dialogue model is licensed under GPL-3.0 due to its use of Belle data and is intended for research purposes only.