
Commit ad0be72

Merge pull request #186 from X-LANCE/slam-aac-dev0
docs: update README for clarity and consistency; add encoder fairseq …
2 parents 87e1449 + 2e4068b commit ad0be72

8 files changed (+32 −8 lines)

examples/slam_aac/README.md

Lines changed: 11 additions & 8 deletions
````diff
@@ -1,17 +1,18 @@
 # SLAM-AAC
 
-SLAM-AAC is a LLM-based model for Automated Audio Captioning (AAC) task. Inspired by techniques in machine translation and ASR, the model enhances audio captioning by incorporating paraphrasing augmentation and a plug-and-play CLAP-Refine strategy. For more details, please refer to the [paper](https://arxiv.org/abs/2410.09503).
+SLAM-AAC is an LLM-based framework for the Automated Audio Captioning (AAC) task. Inspired by techniques in machine translation and ASR, the model enhances audio captioning by incorporating **paraphrasing augmentation** and a plug-and-play **CLAP-Refine** strategy. For more details, please refer to the [paper](https://arxiv.org/abs/2410.09503).
 
 ## Model Architecture
-SLAM-AAC uses EAT as the audio encoder and Vicuna-7B as the LLM decoder. During training, only the Linear Projector and LoRA modules are trainable. For inference, multiple candidates are generated using different beam sizes, which are then refined using the CLAP-Refine strategy.
+SLAM-AAC uses **EAT** as the audio encoder and **Vicuna-7B** as the LLM decoder. During training, only the Linear Projector and LoRA modules are trainable. For inference, multiple candidates are generated using different beam sizes, which are then refined using the CLAP-Refine strategy.
 
 ![](./docs/model.png)
 
 ## Performance and checkpoints
-We have released the pre-trained checkpoint of SLAM-AAC, as well as the fine-tuned checkpoints for the Clotho and AudioCaps datasets. The provided checkpoints include the model's Linear Projector and LoRA modules. Please note that when using each component, be sure to set up the corresponding environments according to the instructions provided in the respective repositories (e.g., for [EAT](https://github.com/cwx-worst-one/EAT)).
+Pre-trained and fine-tuned checkpoints for the **Clotho** and **AudioCaps** datasets are available. These checkpoints include the Linear Projector and LoRA modules. Ensure proper setup of the corresponding environments (e.g., [EAT](https://github.com/cwx-worst-one/EAT)) before use.
+
 
 ### Pre-training
-SLAM-AAC was pre-trained on a combination of AudioCaps, Clotho, WavCaps, and MACS datasets. For more information on these datasets, you can refer to [this repository](https://github.com/Labbeti/aac-datasets). Additionally, the Clotho dataset was augmented using a back-translation-based paraphrasing technique.
+SLAM-AAC was pre-trained on the AudioCaps, Clotho, WavCaps, and MACS datasets. For more information on these datasets, you can refer to [this repository](https://github.com/Labbeti/aac-datasets). Additionally, the Clotho dataset was augmented using a back-translation-based paraphrasing technique.
 Audio Encoder | LLM | Checkpoint | Pre-training Dataset|
 |:---:|:---:|:---:|:---:|
 [EAT-base (fine-tuned)](https://drive.google.com/file/d/1aCYiQmoZv_Gh1FxnR-CCWpNAp6DIJzn6/view?usp=sharing) |[vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5) | [link](https://drive.google.com/drive/folders/10kOjB112AeGYA_0mIUr8f1-i5rSg08_O?usp=sharing) | AudioCaps, Clotho, WavCaps, MACS |
@@ -25,7 +26,7 @@ Dataset | Audio Encoder | LLM | Checkpoint | METEOR | CIDEr | SPICE | SPIDEr | S
 
 
 ## Data preparation
-Ensure your `jsonl` data follows the structure outlined below:
+Ensure your `jsonl` data follows this format:
 ```json
 {"key": "Y7fmOlUlwoNg_1", "source": "/root/data/AudioCaps/waveforms/test/Y7fmOlUlwoNg.wav", "target": "Constant rattling noise and sharp vibrations"}
 {"key": "Y6BJ455B1aAs_1", "source": "/root/data/AudioCaps/waveforms/test/Y6BJ455B1aAs.wav", "target": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle"}
@@ -57,7 +58,7 @@ You can also fine-tune the model without loading any pre-trained weights, though
 - Due to differences in dependency versions, there may be slight variations in the performance of the SLAM-AAC model.
 
 ## Inference
-To perform inference with the trained models, you can use the following commands to decode using the common beam search method:
+To perform inference with the trained models using beam search:
 ```bash
 # Inference on AudioCaps (Beam Search)
 bash scripts/inference_audiocaps_bs.sh
@@ -66,7 +67,9 @@ bash scripts/inference_audiocaps_bs.sh
 bash scripts/inference_clotho_bs.sh
 ```
 
-For improved inference results, you can use the CLAP-Refine strategy, which utilizes multiple beam search decoding. To use this method, you need to download and use our pre-trained [CLAP](https://drive.google.com/drive/folders/1X4NYE08N-kbOy6s_Itb0wBR_3X8oZF56?usp=sharing) model. Note that CLAP-Refine may take longer to run, but it can provide better quality outputs. You can execute the following commands:
+To generate better captions, use the CLAP-Refine strategy with multiple beam search decoding. This method leverages our pre-trained [CLAP](https://drive.google.com/drive/folders/1X4NYE08N-kbOy6s_Itb0wBR_3X8oZF56?usp=sharing) model. Though it takes more time to run, it generally produces higher-quality results. Use the following commands to apply it:
+
+
 ```bash
 # Inference on AudioCaps (CLAP-Refine)
 bash scripts/inference_audiocaps_CLAP_Refine.sh
@@ -81,7 +84,7 @@ bash scripts/clap_refine.sh
 ```
 
 ## Citation
-You can refer to the paper for more results.
+If you find SLAM-AAC useful, please cite the following paper:
 ```
 @article{chen2024slam,
 title={SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs},
````

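The `jsonl` manifests referenced in the Data preparation hunk above follow a simple `key`/`source`/`target` schema. Below is a minimal sketch of how such a manifest could be assembled; the `captions.tsv` input, the output filename, and the assumption that a key's trailing `_N` suffix maps onto the wav stem are all illustrative and not part of the repository.

```bash
# Hypothetical helper: build a key/source/target jsonl manifest in the format
# shown above from a tab-separated "key<TAB>caption" file. Assumes captions
# contain no double quotes and that the key minus its trailing _N is the wav stem.
wav_dir=/root/data/AudioCaps/waveforms/test
while IFS=$'\t' read -r key caption; do
    printf '{"key": "%s", "source": "%s/%s.wav", "target": "%s"}\n' \
        "$key" "$wav_dir" "${key%_*}" "$caption"
done < captions.tsv > aac_test.jsonl
```
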
examples/slam_aac/scripts/finetune_audiocaps.sh

Lines changed: 3 additions & 0 deletions
```diff
@@ -9,6 +9,8 @@ run_dir=/data/wenxi.chen/SLAM-LLM
 cd $run_dir
 code_dir=examples/slam_aac
 
+encoder_fairseq_dir=/fairseq/EAT # path to the fairseq directory of the encoder model
+
 audio_encoder_path=/data/xiquan.li/models/EAT-base_epoch30_ft.pt
 llm_path=/data/xiquan.li/models/vicuna-7b-v1.5
 
@@ -38,6 +40,7 @@ hydra.run.dir=$output_dir \
 ++model_config.encoder_path=$audio_encoder_path \
 ++model_config.encoder_dim=768 \
 ++model_config.encoder_projector=linear \
+++model_config.encoder_fairseq_dir=$encoder_fairseq_dir \
 ++dataset_config.encoder_projector_ds_rate=${encoder_projector_ds_rate} \
 ++dataset_config.dataset=audio_dataset \
 ++dataset_config.train_data_path=$train_jsonl_path \
```

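Every script touched by this commit gains the same `encoder_fairseq_dir` variable, which tells the model loader where the fairseq code for the EAT encoder lives. The commit does not specify how that directory is laid out; the sketch below is only one plausible setup, assuming you clone fairseq and EAT yourself and keep the default `/fairseq/EAT` path from the diff (clone destinations are illustrative).

```bash
# Illustrative layout only: fetch fairseq and the EAT repository, then point
# encoder_fairseq_dir at the directory the scripts should load the encoder from.
git clone https://github.com/facebookresearch/fairseq.git /fairseq
git clone https://github.com/cwx-worst-one/EAT.git /fairseq/EAT
encoder_fairseq_dir=/fairseq/EAT   # default value used throughout these scripts
```
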
examples/slam_aac/scripts/finetune_clotho.sh

Lines changed: 3 additions & 0 deletions
```diff
@@ -9,6 +9,8 @@ run_dir=/data/wenxi.chen/SLAM-LLM
 cd $run_dir
 code_dir=examples/slam_aac
 
+encoder_fairseq_dir=/fairseq/EAT # path to the fairseq directory of the encoder model
+
 audio_encoder_path=/data/xiquan.li/models/EAT-base_epoch30_ft.pt
 llm_path=/data/xiquan.li/models/vicuna-7b-v1.5
 
@@ -38,6 +40,7 @@ hydra.run.dir=$output_dir \
 ++model_config.encoder_path=$audio_encoder_path \
 ++model_config.encoder_dim=768 \
 ++model_config.encoder_projector=linear \
+++model_config.encoder_fairseq_dir=$encoder_fairseq_dir \
 ++dataset_config.encoder_projector_ds_rate=${encoder_projector_ds_rate} \
 ++dataset_config.dataset=audio_dataset \
 ++dataset_config.train_data_path=$train_jsonl_path \
```

examples/slam_aac/scripts/inference_audiocaps_CLAP_Refine.sh

Lines changed: 3 additions & 0 deletions
```diff
@@ -10,6 +10,8 @@ audio_encoder_path=/data/xiquan.li/models/EAT-base_epoch30_ft.pt
 llm_path=/data/xiquan.li/models/vicuna-7b-v1.5
 clap_dir=/data/xiquan.li/models/clap
 
+encoder_fairseq_dir=/fairseq/EAT # path to the fairseq directory of the encoder model
+
 encoder_projector_ds_rate=5
 
 inference_data_path=/data/wenxi.chen/data/audiocaps/new_test.jsonl
@@ -41,6 +43,7 @@ for num_beams in "${beam_range[@]}"; do
 ++model_config.encoder_projector=linear \
 ++model_config.encoder_projector_ds_rate=$encoder_projector_ds_rate \
 ++model_config.normalize=true \
+++model_config.encoder_fairseq_dir=$encoder_fairseq_dir \
 ++dataset_config.encoder_projector_ds_rate=$encoder_projector_ds_rate \
 ++dataset_config.dataset=audio_dataset \
 ++dataset_config.val_data_path=$inference_data_path \
```

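The CLAP-Refine inference scripts wrap decoding in a loop over beam sizes (visible in the `for num_beams in "${beam_range[@]}"` hunk context), and the README then runs `scripts/clap_refine.sh` to select the best candidate with the pre-trained CLAP model. The outline below is only a sketch of that control flow with illustrative beam sizes, not a replacement for the actual scripts.

```bash
# Sketch of the CLAP-Refine flow; beam sizes are illustrative.
beam_range=(2 3 4 5)
for num_beams in "${beam_range[@]}"; do
    # each pass decodes one candidate caption set via inference_aac_batch.py,
    # using the ++model_config / ++dataset_config overrides shown in the diff
    echo "decoding candidates with num_beams=$num_beams"
done
# clap_refine.sh then scores every candidate against its audio clip with the
# pre-trained CLAP model and keeps the highest-similarity caption per clip
bash scripts/clap_refine.sh
```
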
examples/slam_aac/scripts/inference_audiocaps_bs.sh

Lines changed: 3 additions & 0 deletions
```diff
@@ -6,6 +6,8 @@ run_dir=/data/wenxi.chen/SLAM-LLM
 cd $run_dir
 code_dir=examples/slam_aac
 
+encoder_fairseq_dir=/fairseq/EAT # path to the fairseq directory of the encoder model
+
 audio_encoder_path=/data/xiquan.li/models/EAT-base_epoch30_ft.pt
 llm_path=/data/xiquan.li/models/vicuna-7b-v1.5
 
@@ -31,6 +33,7 @@ python $code_dir/inference_aac_batch.py \
 ++model_config.encoder_projector=linear \
 ++model_config.encoder_projector_ds_rate=$encoder_projector_ds_rate \
 ++model_config.normalize=true \
+++model_config.encoder_fairseq_dir=$encoder_fairseq_dir \
 ++dataset_config.encoder_projector_ds_rate=$encoder_projector_ds_rate \
 ++dataset_config.dataset=audio_dataset \
 ++dataset_config.val_data_path=$inference_data_path \
```

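In each script, the new variable reaches the Python entry point as a Hydra command-line override (the `++model_config.encoder_fairseq_dir=$encoder_fairseq_dir \` line). As a reminder of Hydra's override prefixes, here is an abbreviated, illustrative invocation; a real run needs the remaining overrides shown in the full script.

```bash
# Hydra override grammar used by these scripts:
#   key=value    override an entry that already exists in the config
#   +key=value   add an entry that is not in the config yet
#   ++key=value  add the entry, or override it if it already exists
python $code_dir/inference_aac_batch.py \
    ++model_config.encoder_fairseq_dir=$encoder_fairseq_dir \
    ++dataset_config.val_data_path=$inference_data_path
```
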
examples/slam_aac/scripts/inference_clotho_CLAP_Refine.sh

Lines changed: 3 additions & 0 deletions
```diff
@@ -6,6 +6,8 @@ run_dir=/data/wenxi.chen/SLAM-LLM
 cd $run_dir
 code_dir=examples/slam_aac
 
+encoder_fairseq_dir=/fairseq/EAT # path to the fairseq directory of the encoder model
+
 audio_encoder_path=/data/xiquan.li/models/EAT-base_epoch30_ft.pt
 llm_path=/data/xiquan.li/models/vicuna-7b-v1.5
 clap_dir=/data/xiquan.li/models/clap
@@ -41,6 +43,7 @@ for num_beams in "${beam_range[@]}"; do
 ++model_config.encoder_projector=linear \
 ++model_config.encoder_projector_ds_rate=$encoder_projector_ds_rate \
 ++model_config.normalize=true \
+++model_config.encoder_fairseq_dir=$encoder_fairseq_dir \
 ++dataset_config.encoder_projector_ds_rate=$encoder_projector_ds_rate \
 ++dataset_config.dataset=audio_dataset \
 ++dataset_config.val_data_path=$inference_data_path \
```

examples/slam_aac/scripts/inference_clotho_bs.sh

Lines changed: 3 additions & 0 deletions
```diff
@@ -6,6 +6,8 @@ run_dir=/data/wenxi.chen/SLAM-LLM
 cd $run_dir
 code_dir=examples/slam_aac
 
+encoder_fairseq_dir=/fairseq/EAT # path to the fairseq directory of the encoder model
+
 audio_encoder_path=/data/xiquan.li/models/EAT-base_epoch30_ft.pt
 llm_path=/data/xiquan.li/models/vicuna-7b-v1.5
 
@@ -31,6 +33,7 @@ python $code_dir/inference_aac_batch.py \
 ++model_config.encoder_projector=linear \
 ++model_config.encoder_projector_ds_rate=$encoder_projector_ds_rate \
 ++model_config.normalize=true \
+++model_config.encoder_fairseq_dir=$encoder_fairseq_dir \
 ++dataset_config.encoder_projector_ds_rate=$encoder_projector_ds_rate \
 ++dataset_config.dataset=audio_dataset \
 ++dataset_config.val_data_path=$inference_data_path \
```

examples/slam_aac/scripts/pretrain.sh

Lines changed: 3 additions & 0 deletions
```diff
@@ -9,6 +9,8 @@ run_dir=/data/wenxi.chen/SLAM-LLM
 cd $run_dir
 code_dir=examples/slam_aac
 
+encoder_fairseq_dir=/fairseq/EAT # path to the fairseq directory of the encoder model
+
 audio_encoder_path=/data/xiquan.li/models/EAT-base_epoch30_ft.pt
 llm_path=/data/xiquan.li/models/vicuna-7b-v1.5
 
@@ -34,6 +36,7 @@ hydra.run.dir=$output_dir \
 ++model_config.encoder_path=$audio_encoder_path \
 ++model_config.encoder_dim=768 \
 ++model_config.encoder_projector=linear \
+++model_config.encoder_fairseq_dir=$encoder_fairseq_dir \
 ++dataset_config.encoder_projector_ds_rate=${encoder_projector_ds_rate} \
 ++dataset_config.dataset=audio_dataset \
 ++dataset_config.train_data_path=$train_jsonl_path \
```
