Commit 918668f

Merge pull request #198 from X-LANCE/dev-slam-omni
Fix: Update support for jsonl format

2 parents: 43b5293 + 5cb725b

17 files changed: +57, -32 lines

examples/s2s/README.md

Lines changed: 5 additions & 2 deletions

````diff
@@ -41,9 +41,12 @@ ds = load_dataset("DATASET_NAME")

 ### JSONL

 We also support JSONL format for its concise structure. Below is an example:

 ```jsonl
-{"key": "1", "source_wav": "/xxx/1.wav", "source_text": "Can you recommend some Chinese food for me?", "target_wav": "/xxx/1.wav", "target_text": "Sure! I recommend trying dumplings, Peking duck, and mapo tofu for a mix of flavors and textures in Chinese cuisine. These dishes offer a good balance of savory, spicy, and crispy elements."}
+{"key": "1", "source_wav": "/xxx/1.wav", "source_text": "Can you recommend some Chinese food for me?", "target_token": [742, 383, 455, 619, 180], "target_text": "Sure! I recommend trying dumplings, Peking duck, and mapo tofu for a mix of flavors and textures in Chinese cuisine. These dishes offer a good balance of savory, spicy, and crispy elements."}
 ```

+🔔**Update**: We now use `target_token` to replace the `target_wav` field. When using your own data, you need to generate the corresponding audio response tokens yourself (e.g., using [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) tokens in SLAM-Omni).
+
 ## Checkpoints

 We reproduced the single-stage fine-tuning results of SLAM-Omni with a group size of **3**. The following checkpoints are available for download:

 - [Single-Round Dialogue (English)](https://drive.google.com/drive/folders/1ZmM1h5ZTvS-piuN-msmctmZdi51GWLAu?usp=sharing): Trained on VoiceAssistant-400K.

@@ -144,4 +147,4 @@ Mini-Omni:

 ## License

-Our code is released under MIT License. The Chinese dialogue model is licensed under GPL-3.0 due to its use of Belle data and is intended for research purposes only.
+Our code is released under MIT License. The Chinese dialogue model is licensed under GPL-3.0 due to its use of Belle data and is intended for research purposes only.
````
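Since the update swaps the precomputed `target_wav` path for a `target_token` list, existing old-format manifests need a one-time migration. Below is a minimal sketch of such a conversion; `migrate_manifest` and its `wav_to_tokens` argument are illustrative names, not part of the repo, and the tokenizer itself (e.g. one backed by CosyVoice) must be supplied by the user:

```python
import json

def migrate_manifest(old_path, new_path, wav_to_tokens):
    """Rewrite old-format records (target_wav) into the new format (target_token).

    wav_to_tokens: user-supplied callable mapping a wav path to a list of
    audio-response token ids (e.g. produced by a CosyVoice tokenizer).
    """
    with open(old_path) as src, open(new_path, "w") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            wav = record.pop("target_wav")               # old field, removed
            record["target_token"] = wav_to_tokens(wav)  # new field
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```

All other fields (`key`, `source_wav`, `source_text`, `target_text`) pass through unchanged.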

examples/s2s/demo/demo_data/jsonl_demo-en.jsonl

Lines changed: 10 additions & 0 deletions (new file; contents not rendered)

examples/s2s/demo/demo_data/jsonl_demo-zh.jsonl

Lines changed: 10 additions & 0 deletions (new file; contents not rendered)

examples/s2s/demo/demo_data/jsonl_demo.jsonl

Lines changed: 0 additions & 6 deletions
This file was deleted.

examples/s2s/s2s_config.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -189,7 +189,7 @@ class DataConfig:
         "help": "whether input is normalized, used for models such as wavlm"
     })
     seed: int = 42
-    manifest_format: str = field(default="datasets", metadata={ "help": "alternative: jsonl" })
+    manifest_format: str = field(default="parquet", metadata={ "help": "alternative: jsonl" })
     split_size: float = 0.1

     vocab_config: VocabConfig = field(default_factory=VocabConfig)
```
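The config change above only flips the default from the old `datasets` value to `parquet`; `jsonl` remains selectable via the metadata hint. A minimal sketch of how a loader might branch on this field (`load_manifest` is an illustrative name, not the repo's actual API):

```python
import json
from dataclasses import dataclass, field

@dataclass
class DataConfig:
    seed: int = 42
    # default is now "parquet"; the pre-commit default was "datasets"
    manifest_format: str = field(default="parquet", metadata={"help": "alternative: jsonl"})
    split_size: float = 0.1

def load_manifest(cfg: DataConfig, path: str):
    """Illustrative loader that dispatches on manifest_format."""
    if cfg.manifest_format == "jsonl":
        # one JSON record per line, blank lines skipped
        with open(path) as f:
            return [json.loads(line) for line in f if line.strip()]
    if cfg.manifest_format == "parquet":
        # the real pipeline would route this through the HF datasets library
        raise NotImplementedError("parquet loading omitted from this sketch")
    raise ValueError(f"unknown manifest_format: {cfg.manifest_format}")
```

Keeping the format as a config field (rather than hardcoding it in each script) is what lets the shell scripts below pass `++dataset_config.manifest_format=$manifest_format` through Hydra.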

examples/s2s/scripts/finetune/finetune_s2s.sh

Lines changed: 3 additions & 2 deletions

```diff
@@ -32,9 +32,10 @@ num_latency_tokens=0 # number of delay tokens (in front of the ge
 do_layershift=false # if false, tokens in each layers use the same codebook, otherwise, use different codebooks

 # dataset settings
+manifest_format=parquet # parquet or jsonl
 train_data_path=worstchan/VoiceAssistant-400K-SLAM-Omni
 val_data_path=worstchan/VoiceAssistant-400K-SLAM-Omni
-load_from_cache_file=true # set to true if you have already generated the cache file, otherwise set to false
+load_from_cache_file=true # set to true if you have already generated the cache file, otherwise set to false

 # training settings
 batch_size_training=6
@@ -89,7 +90,7 @@ hydra.run.dir=$output_dir \
 ++dataset_config.input_type=mel \
 ++dataset_config.mel_size=$mel_size \
 ++dataset_config.seed=42 \
-++dataset_config.manifest_format=datasets \
+++dataset_config.manifest_format=$manifest_format \
 ++dataset_config.split_size=$split_size \
 ++dataset_config.load_from_cache_file=$load_from_cache_file \
 ++dataset_config.task_type=$task_type \
```

examples/s2s/scripts/finetune/finetune_s2s_group.sh

Lines changed: 3 additions & 2 deletions

```diff
@@ -32,9 +32,10 @@ num_latency_tokens=0 # number of delay tokens (in front of the ge
 do_layershift=false # if false, tokens in each layers use the same codebook, otherwise, use different codebooks

 # dataset settings
+manifest_format=parquet # parquet or jsonl
 train_data_path=worstchan/VoiceAssistant-400K-SLAM-Omni
 val_data_path=worstchan/VoiceAssistant-400K-SLAM-Omni
-load_from_cache_file=true # set to true if you have already generated the cache file, otherwise set to false
+load_from_cache_file=true # set to true if you have already generated the cache file, otherwise set to false

 # training settings
 batch_size_training=6
@@ -96,7 +97,7 @@ hydra.run.dir=$output_dir \
 ++dataset_config.input_type=mel \
 ++dataset_config.mel_size=$mel_size \
 ++dataset_config.seed=42 \
-++dataset_config.manifest_format=datasets \
+++dataset_config.manifest_format=$manifest_format \
 ++dataset_config.split_size=$split_size \
 ++dataset_config.load_from_cache_file=$load_from_cache_file \
 ++dataset_config.task_type=$task_type \
```

examples/s2s/scripts/finetune/mini-omni/finetune_s2s.sh

Lines changed: 2 additions & 1 deletion

```diff
@@ -20,6 +20,7 @@ mel_size=80 # 80 128 ( only whisper-large-v3 supports 128 )
 llm_dim=896 # 896 1536 2048 3584 -> 0.5B 1.5B 3B 7B

 # dataset settings
+manifest_format=parquet # parquet or jsonl
 train_data_path="/valleblob/v-wenxichen/data/s2s/VoiceAssistant-400K"
 val_data_path="/valleblob/v-wenxichen/data/s2s/VoiceAssistant-400K"
 load_from_cache_file=false # set to true if you have already generated the cache file, otherwise set to false
@@ -75,7 +76,7 @@ hydra.run.dir=$output_dir \
 ++dataset_config.input_type=mel \
 ++dataset_config.mel_size=$mel_size \
 ++dataset_config.seed=42 \
-++dataset_config.manifest_format=datasets \
+++dataset_config.manifest_format=$manifest_format \
 ++dataset_config.split_size=$split_size \
 ++dataset_config.load_from_cache_file=$load_from_cache_file \
 ++dataset_config.task_type=$task_type \
```

examples/s2s/scripts/inference/inference_s2s_batch.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -39,7 +39,7 @@ ckpt_path=/valleblob/v-wenxichen/exp/s2s/s2s_train_v3-gpu16-btz3-lr5e-4-fp16-epo
 # val_data_path=/home/v-wenxichen/SLAM-LLM/examples/s2s/demo/data/${split}.jsonl

 # huggingface dataset
-manifest_format=datasets
+manifest_format=parquet
 val_data_path="/valleblob/v-wenxichen/data/s2s/VoiceAssistant-400K-v1/test"
 load_from_cache_file=false
 dataset_sample_seed=777
```

examples/s2s/scripts/inference/mini-omni/inference_s2s_batch.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -29,7 +29,7 @@ split=test
 # val_data_path=/home/v-wenxichen/SLAM-LLM/examples/s2s/demo/data/${split}.jsonl

 # huggingface dataset
-manifest_format=datasets
+manifest_format=parquet
 val_data_path="gpt-omni/VoiceAssistant-400K"
 load_from_cache_file=true
 dataset_sample_seed=777
```

0 commit comments