
Commit a000404

Merge pull request #184 from X-LANCE/revert-183-main
Revert "merge latest main branch"
2 parents c1e26c1 + 42514c2 commit a000404

64 files changed, 109315 insertions(+), 4874 deletions(-)

README.md

Lines changed: 3 additions & 66 deletions
@@ -20,17 +20,15 @@ developers to train custom multimodal large language model (MLLM), focusing on <
 # Table of Contents
 1. [News](#news)
 2. [Installation](#installation)
-3. [Usage](#usage)
+3. [Uasge](#uasge)
 - [List of Recipes](#list-of-recipes)
 - [Configuration Priority](#configuration-priority)
 4. [Features](#features)
 5. [Acknowledge](#acknowledge)
 6. [Citation](#citation)
 
 # News
-- [Update Nov. 17, 2024] Recipes for [LLM-Based Contextual ASR](examples/contextual_asr/README.md) have been supported.
-- [Update Nov. 5, 2024] Recipes for [speech emotion captioning (SEC)](examples/sec_emotioncaps/README.md) with [emotion2vec](https://github.com/ddlBoJack/emotion2vec) as the encoder has been supported.
-- [Update Oct. 12, 2024] Recipes for [SLAM-AAC](examples/slam_aac/README.md) with [EAT](https://github.com/cwx-worst-one/EAT) as the encoder have been supported.
+- [Update Oct. 12, 2024] Recipes for [SLAM-AAC](examples/slam_aac/README.md) have been supported.
 - [Update Sep. 28, 2024] Recipes for [CoT-ST](examples/st_covost2/README.md) have been supported.
 - [Update Sep. 25, 2024] Recipes for [DRCap](examples/drcap_zeroshot_aac/README.md) have been supported.
 - [Update Jun. 12, 2024] Recipes for [MaLa-ASR](examples/mala_asr_slidespeech/README.md) have been supported.
@@ -85,15 +83,13 @@ We provide reference implementations of various LLM-based speech, audio, and mus
 
 - Contextual Automatic Speech Recognition (CASR)
 - [ Mala-ASR](examples/mala_asr_slidespeech/README.md)
-- [LLM-Based Contextual ASR](examples/contextual_asr/README.md)
 
 - [Visual Speech Recognition (VSR)](examples/vsr_LRS3/README.md)
 - Speech-to-Text Translation (S2TT)
 - [CoT-ST](examples/st_covost2/README.md)
 
 - Text-to-Speech (TTS)
 - [VALL-E-X](examples/vallex/README.md)
-- [Speech Emotion Captioning (SEC)](examples/sec_emotioncaps/README.md)
 
 - **Audio Task**
 - [Automated Audio Captioning (AAC)](examples/aac_audiocaps/README.md)
@@ -122,10 +118,7 @@ command-line (shell file) > Hydra configuration (yaml file) > dataclass configur
 - We borrow code from [Fairseq](https://github.com/facebookresearch/fairseq) for deepspeed configuration.
 - We thank the contributors for providing diverse recipes.
 
-# Citation
-
-## Speech Task
-
+## Citation
 SLAM-ASR:
 ```
 @article{ma2024embarrassingly,
@@ -135,60 +128,4 @@ SLAM-ASR:
 year={2024}
 }
 ```
-Mala-ASR:
-```
-@article{yang2024mala,
-title={MaLa-ASR: Multimedia-Assisted LLM-Based ASR},
-author={Yang, Guanrou and Ma, Ziyang and Yu, Fan and Gao, Zhifu and Zhang, Shiliang and Chen, Xie},
-journal={Proc. INTERSPEECH},
-year={2024}
-}
-```
-LLM-Based Contextual ASR:
-```
-@article{yang2024ctc,
-title={CTC-Assisted LLM-Based Contextual ASR},
-author={Yang, Guanrou and Ma, Ziyang and Gao, Zhifu and Zhang, Shiliang and Chen, Xie},
-journal={Proc. SLT},
-year={2024}
-}
-```
-CoT-ST:
-```
-@article{du2024cot,
-title={CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought},
-author={Du, Yexing and Ma, Ziyang and Yang, Yifan and Deng, Keqi and Chen, Xie and Yang, Bo and Xiang, Yang and Liu, Ming and Qin, Bing},
-journal={arXiv preprint arXiv:2409.19510},
-year={2024}
-}
-```
-
 
-## Audio Task
-SLAM-AAC:
-```
-@article{chen2024slam,
-title={SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs},
-author={Chen, Wenxi and Ma, Ziyang and Li, Xiquan and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Yu, Kai and Chen, Xie},
-journal={arXiv preprint arXiv:2410.09503},
-year={2024}
-}
-```
-DRCap:
-```
-@article{li2024drcap,
-title={DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning},
-author={Li, Xiquan and Chen, Wenxi and Ma, Ziyang and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Kong, Qiuqiang and Chen, Xie},
-journal={arXiv preprint arXiv:2410.09472},
-year={2024}
-}
-```
-BAT:
-```
-@article{zheng2024bat,
-title={BAT: Learning to Reason about Spatial Sounds with Large Language Models},
-author={Zheng, Zhisheng and Peng, Puyuan and Ma, Ziyang and Chen, Xie and Choi, Eunsol and Harwath, David},
-journal={Proc. ICML},
-year={2024}
-}
-```
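
Note on the README hunk context: one hunk header quotes the README's configuration priority rule (command-line > Hydra YAML > dataclass defaults). The following is a minimal sketch of that precedence using plain dictionaries rather than Hydra itself. `TrainConfig`, its fields, and `merge_config` are hypothetical names chosen for illustration and do not come from the repository.

```python
from dataclasses import dataclass, asdict


@dataclass
class TrainConfig:
    # Hypothetical fields, used only to illustrate precedence.
    batch_size: int = 4
    lr: float = 1e-4
    use_fp16: bool = False


def merge_config(yaml_values: dict, cli_values: dict) -> dict:
    """Later sources win: dataclass defaults < YAML values < command-line values."""
    merged = asdict(TrainConfig())  # lowest priority: dataclass defaults
    merged.update(yaml_values)      # middle priority: values from a YAML file
    merged.update(cli_values)       # highest priority: command-line overrides
    return merged


config = merge_config(
    yaml_values={"batch_size": 8, "lr": 5e-5},
    cli_values={"batch_size": 16},
)
print(config)
```

Here `batch_size` takes the command-line value, `lr` takes the YAML value, and `use_fp16` falls back to the dataclass default. The real project resolves this through Hydra, so treat this only as a picture of the ordering.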

examples/aac_audiocaps/aac_config.py

Lines changed: 1 addition & 5 deletions
@@ -1,9 +1,5 @@
 from dataclasses import dataclass, field
 from typing import Optional, List
-
-from torch.distributed.fsdp import ShardingStrategy
-
-
 @dataclass
 class ModelConfig:
     file: str = "examples/aac_audiocaps/model/slam_model_aac.py:model_factory"
@@ -118,7 +114,7 @@ class FSDPConfig:
     mixed_precision: bool = True
     use_fp16: bool = False
     # sharding_strategy = "FULL_SHARD" #ShardingStrategy = ShardingStrategy.FULL_SHARD
-    sharding_strategy: ShardingStrategy = "NO_SHARD" #ShardingStrategy.NO_SHARD #MZY: set NO_SHARD when use DDP
+    sharding_strategy: str = "NO_SHARD" #ShardingStrategy.NO_SHARD #MZY: set NO_SHARD when use DDP
     checkpoint_type: str = "SHARDED_STATE_DICT" # alternatively can use SHARDED_STATE_DICT save one file per rank, and can resize the world-size.
     fsdp_activation_checkpointing: bool = True
     fsdp_cpu_offload: bool = False

examples/asr_librispeech/asr_config.py

Lines changed: 1 addition & 5 deletions
@@ -1,9 +1,5 @@
 from dataclasses import dataclass, field
 from typing import Optional, List
-
-from torch.distributed.fsdp import ShardingStrategy
-
-
 @dataclass
 class ModelConfig:
     file: str = "examples/asr_librispeech/model/slam_model_asr.py:model_factory"
@@ -112,7 +108,7 @@ class FSDPConfig:
     mixed_precision: bool = True
     use_fp16: bool = False
     # sharding_strategy = "FULL_SHARD" #ShardingStrategy = ShardingStrategy.FULL_SHARD
-    sharding_strategy: ShardingStrategy = "NO_SHARD" #ShardingStrategy.NO_SHARD #MZY: set NO_SHARD when use DDP
+    sharding_strategy: str = "NO_SHARD" #ShardingStrategy.NO_SHARD #MZY: set NO_SHARD when use DDP
     checkpoint_type: str = "SHARDED_STATE_DICT" # alternatively can use SHARDED_STATE_DICT save one file per rank, and can resize the world-size.
     fsdp_activation_checkpointing: bool = True
     fsdp_cpu_offload: bool = False
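
Note on the two config diffs: both revert `sharding_strategy` to a plain `str` field and drop the `ShardingStrategy` import, so any consumer that needs the enum has to convert the string itself. The sketch below shows one way such a conversion could look. It is not taken from the repository: `resolve_sharding_strategy` is a hypothetical helper, and the local `ShardingStrategy` enum is a stand-in for `torch.distributed.fsdp.ShardingStrategy` so the example runs without PyTorch installed.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ShardingStrategy(Enum):
    """Stand-in for torch.distributed.fsdp.ShardingStrategy (illustration only)."""
    FULL_SHARD = auto()
    SHARD_GRAD_OP = auto()
    NO_SHARD = auto()


@dataclass
class FSDPConfig:
    # Mirrors the post-commit field: a plain string instead of an enum annotation.
    sharding_strategy: str = "NO_SHARD"


def resolve_sharding_strategy(cfg: FSDPConfig) -> ShardingStrategy:
    """Hypothetical helper: look up the enum member by name and fail loudly on typos."""
    try:
        return ShardingStrategy[cfg.sharding_strategy]
    except KeyError:
        valid = ", ".join(member.name for member in ShardingStrategy)
        raise ValueError(
            f"Unknown sharding_strategy {cfg.sharding_strategy!r}; expected one of: {valid}"
        ) from None


strategy = resolve_sharding_strategy(FSDPConfig())
print(strategy.name)
```

Looking the member up by name keeps the config file serializable as plain YAML while still catching misspelled values at startup instead of deep inside model wrapping.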

examples/contextual_asr/README.md

Lines changed: 0 additions & 62 deletions
This file was deleted.

examples/contextual_asr/conf/ds_config.json

Lines changed: 0 additions & 19 deletions
This file was deleted.

examples/contextual_asr/conf/prompt.yaml

Lines changed: 0 additions & 4 deletions
This file was deleted.

examples/contextual_asr/contextual_asr_config.py

Lines changed: 0 additions & 135 deletions
This file was deleted.
