
Commit 43b5293

Merge pull request #190 from X-LANCE/dev-slam-omni
Add reproduction for SLAM-Omni
2 parents ad0be72 + c9b7d9e commit 43b5293

File tree: 219 files changed (+81661 / -24 lines)


.gitignore

Lines changed: 6 additions & 0 deletions
```diff
@@ -3,14 +3,20 @@ __pycache__
 .ipynb_checkpoints
 .vscode
 debug.py
+debug.ipynb
+debug.sh
 .idea/*
 transformers
 wandb/
 log/
 *.log
 outputs/
 data/
+jobs/
+debug/
+audio/

+examples/s2s/scripts/debug
 examples/vsr_LRS3/scripts/decode_avhubert_vo_vicuna_7b_noself.sh
 examples/asr_librispeech/scripts/decode_hubert_xtralarge_linear_vicuna_7b_copy.sh
 examples/vsr_LRS3/scripts/decode_avhubert_vo_vicuna_7b_copy.sh
```

README.md

Lines changed: 36 additions & 7 deletions
````diff
@@ -28,6 +28,22 @@ developers to train custom multimodal large language model (MLLM), focusing on <
 6. [Citation](#citation)

 # News
+- [Update Jan. 22, 2025] 🔥🔥🔥 Full reproduction (including all data preparation, model training, and inference) for [SLAM-Omni](examples/s2s/README.md) has been supported.
+![](docs/slam-omni-model.png)
+- SLAM-Omni is a **timbre-controllable** voice interaction system that requires only **single-stage training** and minimal resources to achieve high-quality, end-to-end speech dialogue, supporting multi-turn conversations in both Chinese and English. ([paper](https://arxiv.org/abs/2412.15649), [demo](https://slam-omni.github.io))
+- We have fully reproduced the **training and inference** processes of SLAM-Omni and open-sourced all related training datasets. The provided code framework theoretically supports all codec-based spoken dialogue models. Additionally, we offer the reproduction code for [Mini-Omni](https://github.com/gpt-omni/mini-omni).
+
+<table class="center">
+<tr>
+<td width=50% style="border: none">
+<video controls autoplay loop src="https://github.com/user-attachments/assets/73597edb-0d66-453b-b10c-8cf8dd3cae18" muted="false"></video>
+</td>
+<td width=50% style="border: none">
+<video controls autoplay loop src="https://github.com/user-attachments/assets/7a797491-0509-4da8-8662-f2107bd8856a" muted="false"></video>
+</td>
+</tr>
+</table>
+
 - [Update Nov. 17, 2024] Recipes for [LLM-Based Contextual ASR](examples/contextual_asr/README.md) have been supported.
 - [Update Nov. 5, 2024] Recipes for [speech emotion captioning (SEC)](examples/sec_emotioncaps/README.md) with [emotion2vec](https://github.com/ddlBoJack/emotion2vec) as the encoder has been supported.
 - [Update Oct. 12, 2024] Recipes for [SLAM-AAC](examples/slam_aac/README.md) with [EAT](https://github.com/cwx-worst-one/EAT) as the encoder have been supported.
@@ -94,13 +110,17 @@ We provide reference implementations of various LLM-based speech, audio, and mus
 - Text-to-Speech (TTS)
 - [VALL-E-X](examples/vallex/README.md)
 - [Speech Emotion Captioning (SEC)](examples/sec_emotioncaps/README.md)
+- Voice Interaction System
+- [SLAM-Omni](examples/s2s/README.md)

 - **Audio Task**
 - [Automated Audio Captioning (AAC)](examples/aac_audiocaps/README.md)
 - [SLAM-AAC](examples/slam_aac/README.md)
 - [DRCap](examples/drcap_zeroshot_aac/README.md)
+
 - Spatial Audio Understanding
 - [BAT](examples/seld_spatialsoundqa/README.md)
+
 - **Music Task**
 - [Music Caption (MC)](examples/mc_musiccaps/README.md)

@@ -163,24 +183,33 @@ CoT-ST:
 }
 ```

+SLAM-Omni:
+```
+@article{chen2024slam,
+title={SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training},
+author={Chen, Wenxi and Ma, Ziyang and Yan, Ruiqi and Liang, Yuzhe and Li, Xiquan and Xu, Ruiyang and Niu, Zhikang and Zhu, Yanqiao and Yang, Yifan and Liu, Zhanxun and others},
+journal={arXiv preprint arXiv:2412.15649},
+year={2024}
+}
+```

 ## Audio Task
 SLAM-AAC:
 ```
-@article{chen2024slam,
+@article{chen2025slam,
 title={SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs},
 author={Chen, Wenxi and Ma, Ziyang and Li, Xiquan and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Yu, Kai and Chen, Xie},
-journal={arXiv preprint arXiv:2410.09503},
-year={2024}
+journal={Proc. ICASSP},
+year={2025}
 }
 ```
 DRCap:
 ```
-@article{li2024drcap,
+@article{li2025drcap,
 title={DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning},
 author={Li, Xiquan and Chen, Wenxi and Ma, Ziyang and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Kong, Qiuqiang and Chen, Xie},
-journal={arXiv preprint arXiv:2410.09472},
-year={2024}
+journal={Proc. ICASSP},
+year={2025}
 }
 ```
 BAT:
@@ -191,4 +220,4 @@ BAT:
 journal={Proc. ICML},
 year={2024}
 }
-```
+```
````

docs/slam-omni-model.png

196 KB (image, not shown)

examples/s2s/README.md

Lines changed: 147 additions & 0 deletions

# SLAM-Omni
[![Python 3.10](https://img.shields.io/badge/Python-3.10-blue.svg)](https://www.python.org/downloads/release/python-3100/) [![arXiv](https://img.shields.io/badge/arXiv-2412.15649-B31B1B.svg)](https://arxiv.org/abs/2412.15649) [![GitHub Demo Page](https://img.shields.io/badge/Github-Demo%20Page-orange.svg)](https://slam-omni.github.io/) [![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

(*Reproduction of the [paper](https://arxiv.org/abs/2412.15649) SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training.*)

## Environment Setup
Set up the environment using the following commands after preparing the SLAM-LLM environment:
```bash
pip install -r ./examples/s2s/requirements.txt
```

Alternatively, you can use our provided Docker image:
```bash
docker pull worstchan/slam-omni:v0
docker run -it --gpus all --name slam-omni worstchan/slam-omni:v0 /bin/bash
```

## Data Preparation

Our project supports two data formats: **Parquet** and **JSONL**. The open-source datasets are available on the Hugging Face Hub in **Parquet** format. Example usage is provided in [this notebook](./demo/demo_data/demo.ipynb).

### Supported Datasets
We provide three re-synthesized datasets for SLAM-Omni training:
- [VoiceAssistant-400K](https://huggingface.co/datasets/worstchan/VoiceAssistant-400K-SLAM-Omni): Single-round English dialogue dataset.
- [UltraChat-300K](https://huggingface.co/datasets/worstchan/UltraChat-300K-SLAM-Omni): Multi-round English dialogue dataset.
- [Belle_1.4M](https://huggingface.co/datasets/worstchan/Belle_1.4M-SLAM-Omni): Multi-round Chinese dialogue dataset.

#### Usage
You can load any of these datasets using the following code:
```python
from datasets import load_dataset

# Replace "DATASET_NAME" with one of the following:
# - "worstchan/VoiceAssistant-400K-SLAM-Omni"
# - "worstchan/UltraChat-300K-SLAM-Omni"
# - "worstchan/Belle_1.4M-SLAM-Omni"

ds = load_dataset("DATASET_NAME")
```
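
For a quick look at what a downloaded split contains, the snippet below is a minimal sketch; the `train` split name and the use of streaming mode are assumptions for illustration, not part of the recipe above.

```python
from datasets import load_dataset

# Sketch only: the split name and streaming flag are assumptions about the
# Hugging Face datasets listed above, not part of the original recipe.
ds = load_dataset("worstchan/VoiceAssistant-400K-SLAM-Omni",
                  split="train", streaming=True)

sample = next(iter(ds))   # fetch the first dialogue sample
print(sample.keys())      # inspect the available Parquet columns
```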

### JSONL
We also support the JSONL format for its concise structure. Below is an example:
```jsonl
{"key": "1", "source_wav": "/xxx/1.wav", "source_text": "Can you recommend some Chinese food for me?", "target_wav": "/xxx/1.wav", "target_text": "Sure! I recommend trying dumplings, Peking duck, and mapo tofu for a mix of flavors and textures in Chinese cuisine. These dishes offer a good balance of savory, spicy, and crispy elements."}
```
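
As a rough sketch of how such a manifest can be written and read back with the Python standard library (the `train.jsonl` path and this helper code are illustrative assumptions, not part of the recipe), each line is simply one JSON object:

```python
import json

# Hypothetical manifest path; field names follow the example entry above.
manifest_path = "train.jsonl"

entry = {
    "key": "1",
    "source_wav": "/xxx/1.wav",
    "source_text": "Can you recommend some Chinese food for me?",
    "target_wav": "/xxx/1.wav",
    "target_text": "Sure! I recommend trying dumplings, Peking duck, and mapo tofu.",
}

# Write one JSON object per line.
with open(manifest_path, "w", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")

# Read the manifest back, one sample per line.
with open(manifest_path, encoding="utf-8") as f:
    samples = [json.loads(line) for line in f]
print(samples[0]["source_text"])
```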

## Checkpoints
We reproduced the single-stage fine-tuning results of SLAM-Omni with a group size of **3**. The following checkpoints are available for download:
- [Single-Round Dialogue (English)](https://drive.google.com/drive/folders/1ZmM1h5ZTvS-piuN-msmctmZdi51GWLAu?usp=sharing): Trained on VoiceAssistant-400K.
- [Multi-Round Dialogue (English)](https://drive.google.com/drive/folders/1xBNrqR2LWC0uEjezjx4aUgdsbstisboS?usp=sharing): Trained on VoiceAssistant-400K and UltraChat-300K.
- [Multi-Round Dialogue (Chinese)](https://drive.google.com/drive/folders/1sExIp-UDdL37gb-mh9YlhuDIib0-wUVP?usp=sharing): Trained on Belle_1.4M.

## Training

You can pre-train the S2S model using TTS or ASR tasks with our provided scripts, though we recommend proceeding directly to fine-tuning. Alternatively, you may directly train a TTS or ASR model under the SLAM-Omni framework. For detailed instructions, refer to the [pre-training README](./scripts/pretrain/README.md).

### Fine-tuning
We provide two primary fine-tuning options for **SLAM-Omni** modeling:
```bash
# Fine-tune with grouping strategy (Recommended)
bash ./examples/s2s/scripts/finetune/finetune_s2s_group.sh

# Fine-tune without grouping
bash ./examples/s2s/scripts/finetune/finetune_s2s.sh
```

We also include scripts for reproducing [Mini-Omni](https://github.com/gpt-omni/mini-omni). Note that this requires the original [VoiceAssistant-400K](https://huggingface.co/datasets/gpt-omni/VoiceAssistant-400K) dataset for training:
```bash
bash ./examples/s2s/scripts/finetune/mini-omni/finetune_s2s.sh
```

#### Note💫
Our framework theoretically supports **all codec-based spoken dialogue model training**. Simply re-synthesize the target tokens (e.g., CosyVoice2 tokens) during training for compatibility.

## Inference
We provide scripts for both **online** and **batch** inference. You can use the trained model or the provided checkpoints for inference. For detailed guidance, refer to the [inference README](./scripts/inference/README.md).

### Online Inference
Run the following commands for real-time inference:

```bash
# Multi-turn (Recommended)
bash ./examples/s2s/scripts/inference/inference_s2s_online_multi-round.sh

# Single-turn
bash ./examples/s2s/scripts/inference/inference_s2s_online.sh
```

For Mini-Omni modeling, use the following commands:
```bash
# Single-turn non-streaming
bash ./examples/s2s/scripts/inference/mini-omni/inference_s2s_online.sh

# Single-turn streaming
bash ./examples/s2s/scripts/inference/mini-omni/inference_s2s_online_stream.sh
```

### Batch Inference

For batch inference, ensure the data format matches the training format (**Parquet** or **JSONL**). Use the following commands:

```bash
# SLAM-Omni framework
bash ./examples/s2s/scripts/inference/inference_s2s_batch.sh

# Mini-Omni framework
bash ./examples/s2s/scripts/inference/mini-omni/inference_s2s_batch.sh
```

## TODO
- [ ] Add evaluation scripts.
- [ ] Add streaming inference scripts for SLAM-Omni.

<!-- ## Gradio Demo -->

## Citation
SLAM-Omni:
```bibtex
@article{chen2024slam,
  title={SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training},
  author={Chen, Wenxi and Ma, Ziyang and Yan, Ruiqi and Liang, Yuzhe and Li, Xiquan and Xu, Ruiyang and Niu, Zhikang and Zhu, Yanqiao and Yang, Yifan and Liu, Zhanxun and others},
  journal={arXiv preprint arXiv:2412.15649},
  year={2024}
}
```
Mini-Omni:
```bibtex
@article{xie2024mini,
  title={Mini-omni: Language models can hear, talk while thinking in streaming},
  author={Xie, Zhifei and Wu, Changqiao},
  journal={arXiv preprint arXiv:2408.16725},
  year={2024}
}
```

## Acknowledgement
- We borrow some code from [Mini-Omni](https://github.com/gpt-omni/mini-omni) for SNAC-based modeling.
- We borrow some code from [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) for the vocoder.

## License
Our code is released under the MIT License. The Chinese dialogue model is licensed under GPL-3.0 due to its use of Belle data and is intended for research purposes only.
6 binary files changed (296 KB, 301 KB, 296 KB, 312 KB, 305 KB, 327 KB): contents not shown.
