
Commit f32b8a2

Merge pull request #170 from Andreas-Xi/lxq-drcap
Lxq drcap
2 parents 68c5fb9 + 135f69b commit f32b8a2

File tree

5 files changed: +2023, -9 lines


examples/drcap_zeroshot_aac/README.md

Lines changed: 18 additions & 6 deletions
@@ -1,7 +1,7 @@
 # DRCap_Zeroshot_Audio_Captioning
 
 ## Introduction
-DRCap is a data-efficient and flexible audio captioning system requiring text-only data for training and can quickly adapt to new domains without additional fine-tuning.
+[DRCap](https://www.arxiv.org/abs/2410.09472) is a data-efficient and flexible audio captioning system that requires only textual data for training and can quickly adapt to new domains without additional fine-tuning. It uses projection decoding and retrieval-augmented generation to perform zero-shot audio captioning.
 
 ![](assets/model.png)
 
@@ -14,7 +14,7 @@ You could download our pretrained CLAP model and linear mapping network through
 * LLM [vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5)
 
 ## Inference
-You could modify the variables `run_dir`, `audio_encoder_dir`, `output_dir`, `llm_path` in `scripts/inference_drcap.sh` to match the paths where the downloaded checkpoints are located. Additionally, update the `source` in `data/audiocaps_test.jsonl` to ensure the audio paths point to your audio files, and then run:
+You could modify the variables `run_dir`, `audio_encoder_dir`, `output_dir`, `llm_path` in `scripts/inference_drcap.sh` to match the paths where the downloaded checkpoints are located. Additionally, update the `source` in `data_examples/audiocaps_test.jsonl` to ensure the audio paths point to your audio files, and then run:
 
 ```shell
 bash scripts/inference_drcap.sh
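
The `+` line above describes the path setup but does not show it. Below is a minimal sketch of what the edited variables in `scripts/inference_drcap.sh` might look like; the placeholder paths (and the assumption that the script defines them as plain shell variables) are illustrative only, so point them at wherever you saved the downloaded checkpoints.

```shell
# Minimal sketch of the variables to edit in scripts/inference_drcap.sh.
# All paths are hypothetical placeholders -- replace them with your own.
run_dir=/path/to/SLAM-LLM                   # directory the script expects to run from (assumption)
audio_encoder_dir=/path/to/CLAP_checkpoint  # downloaded pretrained CLAP model
output_dir=/path/to/linear_mapping_network  # downloaded linear mapping network checkpoint
llm_path=/path/to/vicuna-7b-v1.5            # local copy of lmsys/vicuna-7b-v1.5
```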
@@ -24,10 +24,10 @@ bash scripts/inference_drcap.sh
 ## Data preparation
 Prepare your `jsonl` data file in the following format:
 ```json
-{"key": "Y7fmOlUlwoNg_1", "target": "Constant rattling noise and sharp vibrations", "text": "Constant rattling noise and sharp vibrations"}
-{"key": "Y6BJ455B1aAs_1", "target": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle", "text": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle"}
+{"key": "Y7fmOlUlwoNg_1", "target": "Constant rattling noise and sharp vibrations", "text": "Constant rattling noise and sharp vibrations", "similar_captions": ["The engine of a small machine pulling chains", "A market vendor is producing a rhythmic sound with metal forceps.", "A masonry machine is in operation at a fair."]}
+{"key": "Y6BJ455B1aAs_1", "target": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle", "text": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle", "similar_captions": ["An engine is revving, with fire and an explosion.", "An explosion is heard after an engine cuts out.", "A car speeding past with a large boom"]}
 ```
-Please note that only textual data is required for training. However, for zero-shot inference, audio files are also necessary. You could find an example of the jsonl file in `data/audiocaps_test.jsonl`
+Please note that only textual data is required for training. However, for zero-shot inference, audio files are also necessary. You can find an example of the jsonl file in `data_examples/audiocaps_test.jsonl`.
 
 Run the following command to do the retrieval-augmentation and create the text embedding support for evaluation:
 ```shell
@@ -42,4 +42,16 @@ bash scripts/finetune_drcap.sh
 For training only the linear layer (without using LoRA or other PEFT methods), you can set the following parameters: `use_peft=false` and `freeze_llm=true`. To turn off the RAG, you could set `use_arg=false` and `rag_first=false`.
 
 ## Acknowledgement
-The code for training the CLAP model is based on the [WavCaps](https://github.com/XinhaoMei/WavCaps) repo; we thank the contributors for open-sourcing their work.
+The code for training the CLAP model is based on the [WavCaps](https://github.com/XinhaoMei/WavCaps) repo; we thank the contributors for open-sourcing their work.
+
+
+## Citation
+You can refer to our paper for more results:
+```
+@article{li2024drcap,
+  title={DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning},
+  author={Li, Xiquan and Chen, Wenxi and Ma, Ziyang and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Kong, Qiuqiang and Chen, Xie},
+  journal={arXiv preprint arXiv:2410.09472},
+  year={2024}
+}
+```
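
As a closing note on the fine-tuning hunk above: the README names the flags `use_peft`, `freeze_llm`, `use_arg`, and `rag_first` but does not show where they are set. The sketch below assumes, by analogy with `scripts/inference_drcap.sh`, that `scripts/finetune_drcap.sh` exposes them as shell variables; how the script actually consumes them may differ, so check the script itself.

```shell
# Hypothetical sketch of the linear-only, no-RAG configuration inside
# scripts/finetune_drcap.sh (the exact mechanism is an assumption).
use_peft=false    # do not use LoRA or other PEFT methods
freeze_llm=true   # keep the LLM frozen so only the linear layer is trained
use_arg=false     # flag name as written in the README; turns RAG off
rag_first=false   # set together with use_arg=false when disabling RAG
```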
