|
2 | 2 |
|
3 | 3 |
|
4 | 4 | ## Model Structure |
5 | | -<img src="image/framework.jpg" alt="示例图片" style="width:75%;"> |
| 5 | +<img src="image/framework.jpg" alt="Model framework" style="width:75%;">
6 | 6 |
|
7 | 7 |
|
8 | 8 | ## Multitask |
9 | | -<img src="image/prompt.png" alt="示例图片" style="width:50%;"> |
| 9 | +<img src="image/prompt.png" alt="Multitask prompts" style="width:50%;">
10 | 10 |
|
11 | 11 |
|
| 12 | +## Installation |
| 13 | +``` |
| 14 | +conda create -n cotst python=3.10 |
| 15 | +conda activate cotst |
| 16 | +
|
| 17 | +git clone https://github.com/ddlBoJack/SLAM-LLM.git |
| 18 | +cd SLAM-LLM |
| 19 | +
|
| 20 | +pip install -e . |
| 21 | +sudo apt install ffmpeg |
| 22 | +pip install -U openai-whisper |
| 23 | +pip install wandb |
| 24 | +pip install soundfile |
| 25 | +pip install evaluate |
| 26 | +pip install transformers |
| 27 | +pip install datasets |
| 28 | +pip install sacrebleu |
| 29 | +pip install jiwer |
| 30 | +pip install librosa |
| 31 | +pip install torch==2.4.0 |
| 32 | +pip install torchaudio==2.4.0 |
| 33 | +pip install torchvision==0.19.0 |
| 34 | +``` |
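After installing, a quick sanity check (an optional sketch, not part of the official recipe) can confirm that ffmpeg and the core Python packages resolved correctly:
```
#Optional: verify ffmpeg and the main dependencies are importable and CUDA is visible.
ffmpeg -version | head -n 1
python -c "import torch, torchaudio, whisper, transformers, datasets; print(torch.__version__, torch.cuda.is_available())"
```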
| 35 | + |
| 36 | +## Infer Demo |
| 37 | +It is recommended to run on a single GPU for the first execution. Afterwards, remove `CUDA_VISIBLE_DEVICES=0` and the script will automatically use all available GPUs.
| 38 | + |
| 39 | +This demo automatically downloads the model and dataset from Hugging Face, totaling approximately 100GB. Each card requires about 128GB of system RAM and 24GB of GPU memory.
| 40 | + |
| 41 | +Supported translation languages are Chinese (zh), German (de), and Japanese (ja).
| 42 | + |
| 43 | + |
| 44 | +``` |
| 45 | +CUDA_VISIBLE_DEVICES=0 bash examples/st_covost2/scripts/infer_enzh.sh zh |
| 46 | +``` |
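Per the notes above, the same script can also target the other supported languages and use every GPU once the first run has finished. A sketch, assuming the script takes the target-language code as its only argument (as in the zh example) and that dropping CUDA_VISIBLE_DEVICES enables all GPUs:
```
#Translate into German on all available GPUs; "ja" works the same way.
bash examples/st_covost2/scripts/infer_enzh.sh de
```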
| 47 | + |
12 | 48 |
|
13 | 49 | ## Download Model |
14 | 50 | We only train the Q-Former projector in this recipe.
@@ -46,31 +82,25 @@ You can find the test jsonl in "test_st.jsonl" |
46 | 82 | We use a three-step training process, in which each stage starts from the checkpoint produced by the previous stage.
47 | 83 | ``` |
48 | 84 | #In this step, we perform ASR pretraining to acquire speech recognition capabilities. |
49 | | -bash asr_pretrain.sh |
| 85 | +bash examples/st_covost2/scripts/asr_pretrain.sh |
50 | 86 |
|
51 | | -#In this phase, we conduct multimodal machine translation training to enhance the final performance. |
52 | | -bash mmt.sh |
53 | 87 |
|
54 | | -#monolingual SRT training and multitask training. |
55 | | -bash srt.sh |
56 | | -bash zsrt.sh |
| 88 | +#Monolingual MMT, SRT training, and multitask training.
| 89 | +#You can change the task type by modifying the value of "source" in the script.
| 90 | +bash examples/st_covost2/scripts/all.sh |
57 | 91 | ``` |
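For reference, the full sequence run in order (a sketch; the concrete values of "source" are not shown in this README, so the task names in the comment are illustrative assumptions):
```
#Stage 1: ASR pretraining, producing the checkpoint used by the later stages.
bash examples/st_covost2/scripts/asr_pretrain.sh

#Stages 2-3: set "source" in examples/st_covost2/scripts/all.sh to the desired task
#(e.g. MMT, SRT, or multitask; illustrative names only), then run:
bash examples/st_covost2/scripts/all.sh
```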
58 | 92 |
|
59 | 93 |
|
60 | | -## Infer Stage |
61 | | -You can try our pre-trained model. |
62 | | - |
63 | | -``` |
64 | | -bash infer_enzh.sh |
65 | | -``` |
66 | | - |
67 | 94 | ## Citation |
68 | 95 | You can refer to the paper for more results. |
69 | 96 | ``` |
70 | | -@article{du2024cot, |
71 | | - title={CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought}, |
72 | | - author={Yexing Du, Ziyang Ma, Yifan Yang, Keqi Deng, Xie Chen, Bo Yang, Yang Xiang, Ming Liu, Bing Qin}, |
73 | | - journal={arXiv preprint arXiv:2409.19510}, |
74 | | - year={2024} |
| 97 | +@misc{du2024cotstenhancingllmbasedspeech, |
| 98 | + title={CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought}, |
| 99 | + author={Yexing Du and Ziyang Ma and Yifan Yang and Keqi Deng and Xie Chen and Bo Yang and Yang Xiang and Ming Liu and Bing Qin}, |
| 100 | + year={2024}, |
| 101 | + eprint={2409.19510}, |
| 102 | + archivePrefix={arXiv}, |
| 103 | + primaryClass={cs.CL}, |
| 104 | + url={https://arxiv.org/abs/2409.19510}, |
75 | 105 | } |
76 | 106 | ``` |