test #152 (Merged)

Commits (43)
- `e5dc39c` initial commit for drcap (xiquan-li, Sep 21, 2024)
- `80ea224` test (cwx-worst-one, Sep 22, 2024)
- `a430200` Update SLAM-AAC README.md (cwx-worst-one, Sep 22, 2024)
- `0e68d94` Update (cwx-worst-one, Sep 22, 2024)
- `2c8cddd` commit for exmaples/drcap (xiquan-li, Sep 23, 2024)
- `4897187` commit for exmaples/drcap (xiquan-li, Sep 23, 2024)
- `eedefb5` minor fix (xiquan-li, Sep 23, 2024)
- `402a9da` add recipes for DRCap (xiquan-li, Sep 23, 2024)
- `ffb640f` add recipes for DRCap (xiquan-li, Sep 23, 2024)
- `72e5680` add custom dataset for drcap, keep original audio_dataset unchanged (xiquan-li, Sep 25, 2024)
- `e7a03c3` Merge pull request #133 from Andreas-Xi/lxq-drcap (ddlBoJack, Sep 26, 2024)
- `4fb8d82` Merge pull request #134 from X-LANCE/main (cwx-worst-one, Sep 26, 2024)
- `9aa603e` st (yxduir, Sep 27, 2024)
- `cdac921` md (yxduir, Sep 27, 2024)
- `a4a3846` md (yxduir, Sep 27, 2024)
- `57795bc` md (yxduir, Sep 27, 2024)
- `0abf7c4` update slam-aac (cwx-worst-one, Sep 27, 2024)
- `12d694c` update SLAM-AAC (cwx-worst-one, Sep 27, 2024)
- `389234b` merge (yxduir, Sep 27, 2024)
- `a55cee9` md (yxduir, Sep 27, 2024)
- `de8d4d5` md (yxduir, Sep 27, 2024)
- `a48f11c` md (yxduir, Sep 27, 2024)
- `8a0a3dc` md (yxduir, Sep 27, 2024)
- `f5cd6f3` 0928 (yxduir, Sep 28, 2024)
- `3a7c195` Merge pull request #137 from X-LANCE/yxdu (ddlBoJack, Sep 28, 2024)
- `cd72be0` md (yxduir, Sep 29, 2024)
- `1a4f26d` Merge pull request #138 from X-LANCE/main (ddlBoJack, Oct 1, 2024)
- `b304f0a` update README (ddlBoJack, Oct 1, 2024)
- `b93bbf3` update README (ddlBoJack, Oct 1, 2024)
- `fbe3b65` Merge pull request #139 from X-LANCE/dev-mzy (ddlBoJack, Oct 1, 2024)
- `a21507e` Update README.md (yxduir, Oct 2, 2024)
- `c1706d6` md (HITCSzwx, Oct 2, 2024)
- `e1be609` md (HITCSzwx, Oct 2, 2024)
- `12b8772` md (HITCSzwx, Oct 2, 2024)
- `d599ce4` Merge pull request #140 from X-LANCE/yxdu (ddlBoJack, Oct 2, 2024)
- `752b96e` 10.11 (cwx-worst-one, Oct 11, 2024)
- `56fb822` 10.11 (cwx-worst-one, Oct 11, 2024)
- `b8dbc12` SLAM-AAC (cwx-worst-one, Oct 11, 2024)
- `2ab898e` Merge pull request #147 from X-LANCE/cwx_slam_aac (ddlBoJack, Oct 12, 2024)
- `db55bea` update README (ddlBoJack, Oct 12, 2024)
- `be67304` BAT update: fix type, upload checkpoint; finish inference code; add d… (zszheng147, Oct 12, 2024)
- `8c05584` seld: add checkpoint link to readme (Oct 13, 2024)
- `38d8c66` Merge pull request #150 from X-LANCE/seld (ddlBoJack, Oct 13, 2024)
40 changes: 28 additions & 12 deletions README.md
@@ -28,15 +28,18 @@ ...developers to train custom multimodal large language model (MLLM), focusing on Speech, Language, Audio, Music processing.
6. [Citation](#citation)

# News
- [Update Oct. 12, 2024] Recipes for [SLAM-AAC](examples/slam_aac/README.md) have been supported.
- [Update Sep. 28, 2024] Recipes for [CoT-ST](examples/st_covost2/README.md) have been supported.
- [Update Sep. 25, 2024] Recipes for [DRCap](examples/drcap_zeroshot_aac/README.md) have been supported.
- [Update Jun. 12, 2024] Recipes for [MaLa-ASR](examples/mala_asr_slidespeech/README.md) have been supported.
- **[CALL FOR EXAMPLE]** We sincerely invite developers and researchers to build new applications and conduct academic research based on SLAM-LLM, and to submit pull requests with your examples! We also welcome engineering PRs (such as improving and speeding up multi-node training).
- [Update May. 22, 2024] Please join [Slack](https://join.slack.com/t/slam-llm/shared_invite/zt-2mc0pkhhs-5jjOi8Cwc8R1Xc8IQmykDA) or the [WeChat group](./docs/Wechat.jpg). We will sync our updates and Q&A there.
- [Update May. 21, 2024] Recipes for [Spatial Audio Understanding](examples/seld_spatialsoundqa/README.md) have been supported.
- [Update May. 20, 2024] Recipes for [music caption (MC)](examples/mc_musiccaps/README.md) have been supported.
- [Update May. 8, 2024] Recipes for [visual speech recognition (VSR)](examples/vsr_LRS3/README.md) have been supported.
- [Update May. 4, 2024] Recipes for [zero-shot text-to-speech (TTS)](examples/vallex/README.md) have been supported.
- [Update Apr. 28, 2024] Recipes for [automated audio captioning (AAC)](examples/aac_audiocaps/README.md) have been supported.
- [Update Mar. 31, 2024] Recipes for [automatic speech recognition (ASR)](examples/asr_librispeech/README.md) have been supported.

# Installation
```bash
...
```
@@ -75,12 +78,25 @@ docker run -it --gpus all --name slam --shm-size=256g slam-llm:latest /bin/bash
## List of Recipes
We provide reference implementations of various LLM-based speech, audio, and music tasks:
- **Speech Task**
    - Automatic Speech Recognition (ASR)
        - [SLAM-ASR](examples/asr_librispeech/README.md)
    - Contextual Automatic Speech Recognition (CASR)
        - [MaLa-ASR](examples/mala_asr_slidespeech/README.md)
    - [Visual Speech Recognition (VSR)](examples/vsr_LRS3/README.md)
    - Speech-to-Text Translation (S2TT)
        - [CoT-ST](examples/st_covost2/README.md)
    - Text-to-Speech (TTS)
        - [VALL-E-X](examples/vallex/README.md)
- **Audio Task**
    - [Automated Audio Captioning (AAC)](examples/aac_audiocaps/README.md)
        - [SLAM-AAC](examples/slam_aac/README.md)
        - [DRCap](examples/drcap_zeroshot_aac/README.md)
    - Spatial Audio Understanding
        - [BAT](examples/seld_spatialsoundqa/README.md)
- **Music Task**
    - [Music Caption (MC)](examples/mc_musiccaps/README.md)

@@ -103,7 +119,7 @@ command-line (shell file) > Hydra configuration (yaml file) > dataclass configuration (python file)
- We thank the contributors for providing diverse recipes.

## Citation

SLAM-ASR:
```
@article{ma2024embarrassingly,
title={An Embarrassingly Simple Approach for LLM with Strong ASR Capacity},
...
```
45 changes: 45 additions & 0 deletions examples/drcap_zeroshot_aac/README.md
@@ -0,0 +1,45 @@
# DRCap_Zeroshot_Audio_Captioning

## Introduction
DRCap is a data-efficient and flexible audio captioning system that requires only text data for training and can quickly adapt to new domains without additional fine-tuning.

![](assets/model.png)

## Pretrained models
You can download our pretrained CLAP model and linear mapping network from Google Drive:
* [CLAP](https://drive.google.com/drive/folders/1d5RqM2OTxO8PD7qBUAyXXJHjS96XIauw?usp=sharing) pretrained on [SoundVECaps](https://yyua8222.github.io/Sound-VECaps-demo/) and [WavCaps](https://huggingface.co/datasets/cvssp/WavCaps) (~1.9M audio-text pairs in total)

* [Linear mapping network](https://drive.google.com/drive/folders/1d5RqM2OTxO8PD7qBUAyXXJHjS96XIauw?usp=sharing) trained on AudioCaps and Clotho_v2 via CLAP latent decoding and text-to-text retrieval augmentation.

* LLM [vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5)

## Inference
Modify the variables `run_dir`, `audio_encoder_dir`, `output_dir`, and `llm_path` in `scripts/inference_drcap.sh` to match the paths of the downloaded checkpoints. Additionally, update the `source` field in `data/audiocaps_test.jsonl` so that the audio paths point to your audio files, then run:

```shell
bash scripts/inference_drcap.sh
```
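If there are many entries to update, a small script can rewrite the `source` field for you. The snippet below is only a sketch: it assumes every line of `data/audiocaps_test.jsonl` carries a `source` audio path, and the local audio directory and output file name are placeholders.

```python
# Sketch: re-point the "source" field of each jsonl entry at local audio files.
# AUDIO_DIR and the output path are assumptions -- adjust them to your setup.
import json
from pathlib import Path

AUDIO_DIR = Path("/path/to/audiocaps/test")  # hypothetical location of your .wav files

in_path = Path("data/audiocaps_test.jsonl")
out_path = Path("data/audiocaps_test_local.jsonl")

with in_path.open() as fin, out_path.open("w") as fout:
    for line in fin:
        entry = json.loads(line)
        # Keep the original file name, but re-root it under AUDIO_DIR
        entry["source"] = str(AUDIO_DIR / Path(entry["source"]).name)
        fout.write(json.dumps(entry) + "\n")
```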


## Data preparation
Prepare your `jsonl` data file in the following format:
```json
{"key": "Y7fmOlUlwoNg_1", "target": "Constant rattling noise and sharp vibrations", "text": "Constant rattling noise and sharp vibrations"}
{"key": "Y6BJ455B1aAs_1", "target": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle", "text": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle"}
```
Please note that only textual data is required for training. However, for zero-shot inference, audio files are also necessary. You can find an example jsonl file in `data/audiocaps_test.jsonl`.
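If you start from a plain list of captions, a sketch along the following lines produces training data in the format above. The input file `captions.txt` and the key naming scheme are assumptions for illustration, not part of the recipe.

```python
# Sketch: build a text-only training jsonl from a plain list of captions.
import json

with open("captions.txt") as fin, open("train_text_only.jsonl", "w") as fout:
    for i, caption in enumerate(fin):
        caption = caption.strip()
        if not caption:
            continue
        record = {
            "key": f"caption_{i:06d}",   # any unique id works
            "target": caption,           # caption used as the training target
            "text": caption,             # caption used as the text input
        }
        fout.write(json.dumps(record) + "\n")
```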

Run the following command to perform retrieval augmentation and build the text-embedding support for evaluation:
```shell
bash scripts/data_preprocess.sh
```
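For intuition, text-to-text retrieval augmentation boils down to embedding every caption with a text encoder and, for each caption, retrieving its nearest neighbours from a caption databank by cosine similarity. The sketch below illustrates only that idea with random placeholder vectors; in DRCap the embeddings come from the CLAP text encoder, and `scripts/data_preprocess.sh` remains the authoritative implementation.

```python
# Sketch: top-k text-to-text retrieval by cosine similarity.
# Random vectors stand in for CLAP text embeddings; shapes and names are assumptions.
import numpy as np

def top_k_neighbours(query_emb: np.ndarray, support_emb: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k most similar support embeddings for each query."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    s = support_emb / np.linalg.norm(support_emb, axis=1, keepdims=True)
    sim = q @ s.T  # cosine similarity matrix
    return np.argsort(-sim, axis=1)[:, :k]

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 1024))    # e.g. embeddings of training captions
support = rng.normal(size=(100, 1024))  # e.g. embeddings of the caption databank
print(top_k_neighbours(queries, support, k=3))
```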

## Model Training
Run the following command to train the model:
```bash
bash scripts/finetune_drcap.sh
```
To train only the linear mapping network (without LoRA or other PEFT methods), set `use_peft=false` and `freeze_llm=true`. To turn off RAG, set `use_arg=false` and `rag_first=false`.

## Acknowledgement
The code for training the CLAP model is based on the [WavCaps](https://github.com/XinhaoMei/WavCaps) repo; we thank the contributors for open-sourcing their work.
Binary file added examples/drcap_zeroshot_aac/assets/model.png
68 changes: 68 additions & 0 deletions examples/drcap_zeroshot_aac/conf/clap_config.yaml
@@ -0,0 +1,68 @@
device: "cuda"
seed: 20
embed_size: 1024
temp: 0.07
queue_size: 5120
json_files: [
'../data/json_files/BBC_Sound_Effects/bbc_final.json',
'../data/json_files/FreeSound/fsd_final.json',
'../data/json_files/SoundBible/sb_final.json',
"../data/json_files/AudioSet_SL/as_final.json",
"data/AudioCaps/json_files/train.json",
"data/Clotho/json_files/train.json"
]

resume: false
blacklist: "../data/json_files/blacklist/" # path to blacklist file
embed_regularization: true

arch_version: 0


dist_args:
world_size: 1

audio_args:
sr: 32000
n_fft: 1024
hop_length: 320
f_min: 50
f_max: 14000
n_mels: 64
max_length: 30
mono: True
use_torchaudio: True


audio_encoder_args:
type: "transformer"
model: "Cnn14"
pretrained: False
freeze: False


data_args:
batch_size: 128
num_workers: 8


text_encoder_args:
type: 'roberta-base'
freeze: False


optim_args:
lr: !!float 5e-5
warmup_steps: 0
optimizer_name: "adam"
betas: [0.9, 0.999]
eps: !!float 1e-8
momentum: 0.9
warmup_epochs: 2


training:
spec_augmentation: True
epochs: 15
clip_grad: 2
dropout: 0.2
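For reference, a config in this format can be read with PyYAML. The sketch below only demonstrates accessing a few nested fields; it is not necessarily how the repository's CLAP training entry point consumes the file.

```python
# Sketch: load the CLAP training config and read a few nested fields.
import yaml  # pip install pyyaml

with open("examples/drcap_zeroshot_aac/conf/clap_config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["embed_size"])        # 1024, dimensionality of the joint embedding
print(cfg["audio_args"]["sr"])  # 32000 Hz input sample rate
print(cfg["optim_args"]["lr"])  # 5e-5 learning rate
print(len(cfg["json_files"]))   # number of caption json files used for training
```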
19 changes: 19 additions & 0 deletions examples/drcap_zeroshot_aac/conf/ds_config.json
@@ -0,0 +1,19 @@
{
"train_micro_batch_size_per_gpu": 4,
"gradient_accumulation_steps": 1,
"optimizer": {
"type": "Adam",
"params": {
"lr": 1e-4
}
},
"fp16": {
"enabled": true
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu"
}
}
}
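This is a standard DeepSpeed configuration (ZeRO stage 3 with CPU optimizer offload and fp16). As a hedged illustration of how such a file is typically consumed, the sketch below hands it to `deepspeed.initialize`; the model class is a placeholder, and the actual SLAM-LLM training scripts may wire DeepSpeed in through their own launcher instead.

```python
# Sketch: initialize a model with DeepSpeed using this config file.
# MyModel is a stand-in; SLAM-LLM's own scripts may integrate DeepSpeed differently.
import deepspeed
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(1024, 1024)

    def forward(self, x):
        return self.proj(x)

model = MyModel()
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="examples/drcap_zeroshot_aac/conf/ds_config.json",
)
```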
3 changes: 3 additions & 0 deletions examples/drcap_zeroshot_aac/conf/prompt.yaml
@@ -0,0 +1,3 @@
dataset_config:
  # The prompt lives here because Hydra overrides in the shell script only support a limited set of characters
prompt: "Describe the audio you hear."