Commit 6c26585

Merge pull request #176 from X-LANCE/yxdu
Yxdu
2 parents f42716d + 33b84ed commit 6c26585

20 files changed

+125016
-109284
lines changed

examples/st_covost2/README.md

File mode changed: 100755 → 100644
Lines changed: 50 additions & 20 deletions
@@ -2,13 +2,49 @@
22

33

44
## Model Structure
5-
<img src="image/framework.jpg" alt="示例图片" style="width:75%;">
5+
<img src="image/framework.jpg" alt="Photo" style="width:75%;">
66

77

88
## Multitask
9-
<img src="image/prompt.png" alt="示例图片" style="width:50%;">
9+
<img src="image/prompt.png" alt="Photo" style="width:50%;">
1010

1111

12+
## Installation
13+
```
14+
conda create -n cotst python=3.10
15+
conda activate cotst
16+
17+
git clone https://github.com/ddlBoJack/SLAM-LLM.git
18+
cd SLAM-LLM
19+
20+
pip install -e .
21+
sudo apt install ffmpeg
22+
pip install -U openai-whisper
23+
pip install wandb
24+
pip install soundfile
25+
pip install evaluate
26+
pip install transformers
27+
pip install datasets
28+
pip install sacrebleu
29+
pip install jiwer
30+
pip install librosa
31+
pip install torch==2.4.0
32+
pip install torchaudio==2.4.0
33+
pip install torchvision==0.19.0
34+
```
35+
36+
## Infer Demo
37+
For the first run, a single GPU is recommended. Afterwards, remove CUDA_VISIBLE_DEVICES=0 and the script will automatically use all available GPUs.
38+
39+
This demo automatically downloads the model and dataset from Hugging Face, approximately 100GB in total. Each GPU requires 128GB of RAM and 24GB of GPU memory.
40+
41+
Supported translation languages are Chinese (zh), German (de), and Japanese (ja).
42+
43+
44+
```
45+
CUDA_VISIBLE_DEVICES=0 bash examples/st_covost2/scripts/infer_enzh.sh zh
46+
```
47+
1248

1349
## Download Model
1450
We only train the q-former projector in this recipe.
@@ -46,31 +82,25 @@ You can find the test jsonl in "test_st.jsonl"
4682
Here, we have designed a three-step training process, where each training session uses the checkpoint obtained from the previous training session.
4783
```
4884
#In this step, we perform ASR pretraining to acquire speech recognition capabilities.
49-
bash asr_pretrain.sh
85+
bash examples/st_covost2/scripts/asr_pretrain.sh
5086
51-
#In this phase, we conduct multimodal machine translation training to enhance the final performance.
52-
bash mmt.sh
5387
54-
#monolingual SRT training and multitask training.
55-
bash srt.sh
56-
bash zsrt.sh
88+
#Monolingual MMT and SRT training, and multitask training.
89+
#You can change the task type by modifying the value of **source** in the script.
90+
bash examples/st_covost2/scripts/all.sh
5791
```
5892

5993

60-
## Infer Stage
61-
You can try our pre-trained model.
62-
63-
```
64-
bash infer_enzh.sh
65-
```
66-
6794
## Citation
6895
You can refer to the paper for more results.
6996
```
70-
@article{du2024cot,
71-
title={CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought},
72-
author={Yexing Du, Ziyang Ma, Yifan Yang, Keqi Deng, Xie Chen, Bo Yang, Yang Xiang, Ming Liu, Bing Qin},
73-
journal={arXiv preprint arXiv:2409.19510},
74-
year={2024}
97+
@misc{du2024cotstenhancingllmbasedspeech,
98+
title={CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought},
99+
author={Yexing Du and Ziyang Ma and Yifan Yang and Keqi Deng and Xie Chen and Bo Yang and Yang Xiang and Ming Liu and Bing Qin},
100+
year={2024},
101+
eprint={2409.19510},
102+
archivePrefix={arXiv},
103+
primaryClass={cs.CL},
104+
url={https://arxiv.org/abs/2409.19510},
75105
}
76106
```
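
The Infer Demo section of the README above runs the script with an optional `CUDA_VISIBLE_DEVICES` prefix and a language code. A minimal sketch of the same invocation from Python follows. The helper names `build_infer_command` and `run_infer` are hypothetical and not part of the repository; the script path and language codes are taken from the README, and the command is assumed to be run from the SLAM-LLM repository root.

```python
import os
import subprocess

# Language codes listed in the README's Infer Demo section.
SUPPORTED_LANGS = ("zh", "de", "ja")
# Script path as given in the README, relative to the repository root.
INFER_SCRIPT = "examples/st_covost2/scripts/infer_enzh.sh"


def build_infer_command(lang, gpu_id=None):
    """Return (argv, env) for the demo; gpu_id=None leaves all GPUs visible."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(
            f"unsupported language {lang!r}; expected one of {SUPPORTED_LANGS}"
        )
    env = dict(os.environ)
    if gpu_id is None:
        # Mirrors "remove CUDA_VISIBLE_DEVICES=0": no device restriction.
        env.pop("CUDA_VISIBLE_DEVICES", None)
    else:
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return ["bash", INFER_SCRIPT, lang], env


def run_infer(lang, gpu_id=None):
    """Hypothetical wrapper that launches the demo script."""
    argv, env = build_infer_command(lang, gpu_id)
    return subprocess.run(argv, env=env, check=True)
```

Calling `run_infer("zh", gpu_id=0)` would correspond to the single-GPU command in the README, and `run_infer("zh")` to the multi-GPU variant.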

examples/st_covost2/asr_config.py

File mode changed: 100755 → 100644
Lines changed: 5 additions & 3 deletions
@@ -24,8 +24,10 @@ class ModelConfig:
2424
encoder_type: str = field(default="finetune", metadata={
2525
"help": "whether model is only pretrained or finetuned, used for models such as hubert"
2626
})
27-
ckpt_path: Optional[str] = None
2827
query_len: Optional[str] = None
28+
qformer_layers: int = 8
29+
30+
2931

3032

3133
@dataclass
@@ -93,7 +95,7 @@ class DataConfig:
9395
train_data_path: Optional[str] = None
9496
val_data_path: Optional[str] = None
9597
train_split: str = "train"
96-
test_split:str = "validation"
98+
test_split:str = "test"
9799
prompt: Optional[str] = None
98100
data_path: Optional[str] = None
99101
max_words: Optional[int] = None
@@ -127,7 +129,7 @@ class FSDPConfig:
127129
class LogConfig:
128130
use_wandb: bool = False
129131
wandb_dir: str = "test_wandb"
130-
wandb_entity_name: str = "SLAM"
132+
wandb_entity_name: str = "sdinger"
131133
wandb_project_name: str = "project_name"
132134
wandb_exp_name: str = "exp_name"
133135
log_file: str = "./test.log"
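
The hunks above add a `qformer_layers` field and change the default `test_split`. As a rough illustration of how such dataclass defaults behave, the sketch below reproduces only the fields visible in the diff (the real classes contain more) and overrides one value with the standard-library `dataclasses.replace`. The override value `4` is a made-up example, and the repository's own override mechanism may differ.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass
class ModelConfig:
    # Only the fields visible in the hunk above; the real class has more.
    encoder_type: str = field(default="finetune", metadata={
        "help": "whether model is only pretrained or finetuned, used for models such as hubert"
    })
    query_len: Optional[str] = None
    qformer_layers: int = 8  # field added by this commit


@dataclass
class DataConfig:
    train_split: str = "train"
    test_split: str = "test"  # was "validation" before this commit


default_model = ModelConfig()
# Hypothetical override: a copy with a shallower Q-Former, original untouched.
shallow_model = replace(default_model, qformer_layers=4)
```

Because `replace` returns a new instance, `default_model` keeps the committed default of 8 layers.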

examples/st_covost2/change_dir.py

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
1+
import os
2+
import json
3+
4+
# Input folder path
5+
folder_path = ""
6+
7+
# Keyword replacement rule
8+
old_keyword = ""  # keyword to be replaced
9+
new_keyword = "/code_dir"  # replacement keyword
10+
11+
# Walk the folder and its subfolders
12+
for root, _, files in os.walk(folder_path):
13+
    for file_name in files:
14+
        if file_name.endswith(".jsonl"):
15+
            file_path = os.path.join(root, file_name)
16+
17+
            # Read and process the JSONL file
18+
            with open(file_path, "r", encoding="utf-8") as file:
19+
                lines = file.readlines()
20+
21+
            updated_lines = []
22+
            for line in lines:
23+
                data = json.loads(line)
24+
                if "audio" in data and old_keyword in data["audio"]:
25+
                    data["audio"] = data["audio"].replace(old_keyword, new_keyword)
26+
                updated_lines.append(json.dumps(data, ensure_ascii=False))
27+
28+
            # Write the modified content back to the original file
29+
            with open(file_path, "w", encoding="utf-8") as file:
30+
                file.write("\n".join(updated_lines))
31+
32+
            print(f"Keyword replacement complete; changes written back to file: {file_path}")
33+
34+
print("All files processed.")
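
The per-line rewrite in `change_dir.py` can be sketched as a small standalone function, which makes it easier to try on one record before touching files. This is an illustrative sketch, not the committed code: `rewrite_audio_path` is a hypothetical helper, the sample record and the `/old_root` prefix are made up, and the empty-keyword guard is an added assumption (the committed script leaves `old_keyword` as an empty placeholder, and `str.replace` with an empty search string inserts the replacement between every character).

```python
import json


def rewrite_audio_path(line, old_keyword, new_keyword):
    """Return one JSONL line with old_keyword replaced in its "audio" field."""
    if not old_keyword:
        # Guard added for this sketch: an empty search string matches everywhere.
        raise ValueError("old_keyword must be a non-empty string")
    data = json.loads(line)
    audio = data.get("audio")
    if isinstance(audio, str) and old_keyword in audio:
        data["audio"] = audio.replace(old_keyword, new_keyword)
    # Records without a matching "audio" value are re-serialized unchanged.
    return json.dumps(data, ensure_ascii=False)


# Hypothetical record; only the "audio" key is taken from the script.
sample = '{"audio": "/old_root/clips/a.wav", "prompt": "<en>"}'
print(rewrite_audio_path(sample, "/old_root", "/code_dir"))
```

Keeping the rewrite separate from the file walk also lets the guard fail early, before any file is overwritten.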

examples/st_covost2/conf/prompt.yaml

File mode changed: 100755 → 100644
Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
11
dataset_config:
22
# we put prompt here, because the hydra override in shell script only support a small subset of chars
33
# prompt: "Transcribe speech to text. Output the transcription directly without redundant content. Ensure that the output is not duplicated. "
4-
prompt: "<en>"
4+
# prompt: "<en>"
