
Commit 90bf0ec

Author: 蒄骰
Merge branch 'main' of github.com:ddlBoJack/SLAM-LLM into ygr_pr2
2 parents 819ce1f + f32b8a2

File tree

6 files changed: +2068 -13 lines

README.md

Lines changed: 45 additions & 4 deletions
@@ -28,7 +28,8 @@ developers to train custom multimodal large language model (MLLM), focusing on <
 6. [Citation](#citation)
 
 # News
-- [Update Oct. 12, 2024] Recipes for [SLAM-AAC](examples/slam_aac/README.md) have been supported.
+- [Update Nov. 5, 2024] Recipes for [speech emotion captioning (SEC)](examples/sec_emotioncaps/README.md) with [emotion2vec](https://github.com/ddlBoJack/emotion2vec) as the encoder have been supported.
+- [Update Oct. 12, 2024] Recipes for [SLAM-AAC](examples/slam_aac/README.md) with [EAT](https://github.com/cwx-worst-one/EAT) as the encoder have been supported.
 - [Update Sep. 28, 2024] Recipes for [CoT-ST](examples/st_covost2/README.md) have been supported.
 - [Update Sep. 25, 2024] Recipes for [DRCap](examples/drcap_zeroshot_aac/README.md) have been supported.
 - [Update Jun. 12, 2024] Recipes for [MaLa-ASR](examples/mala_asr_slidespeech/README.md) have been supported.
@@ -90,6 +91,7 @@ We provide reference implementations of various LLM-based speech, audio, and mus
 
 - Text-to-Speech (TTS)
 - [VALL-E-X](examples/vallex/README.md)
+- [Speech Emotion Captioning (SEC)](examples/sec_emotioncaps/README.md)
 
 - **Audio Task**
 - [Automated Audio Captioning (AAC)](examples/aac_audiocaps/README.md)
@@ -118,7 +120,10 @@ command-line (shell file) > Hydra configuration (yaml file) > dataclass configur
 - We borrow code from [Fairseq](https://github.com/facebookresearch/fairseq) for deepspeed configuration.
 - We thank the contributors for providing diverse recipes.
 
-## Citation
+# Citation
+
+## Speech Task
+
 SLAM-ASR:
 ```
 @article{ma2024embarrassingly,
@@ -128,7 +133,27 @@ SLAM-ASR:
 year={2024}
 }
 ```
+MaLa-ASR:
+```
+@article{yang2024mala,
+title={MaLa-ASR: Multimedia-Assisted LLM-Based ASR},
+author={Yang, Guanrou and Ma, Ziyang and Yu, Fan and Gao, Zhifu and Zhang, Shiliang and Chen, Xie},
+journal={Proc. INTERSPEECH},
+year={2024}
+}
+```
+CoT-ST:
+```
+@article{du2024cot,
+title={CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought},
+author={Du, Yexing and Ma, Ziyang and Yang, Yifan and Deng, Keqi and Chen, Xie and Yang, Bo and Xiang, Yang and Liu, Ming and Qin, Bing},
+journal={arXiv preprint arXiv:2409.19510},
+year={2024}
+}
+```
 
+
+## Audio Task
 SLAM-AAC:
 ```
 @article{chen2024slam,
@@ -138,5 +163,21 @@ SLAM-AAC:
 year={2024}
 }
 ```
-
-
+DRCap:
+```
+@article{li2024drcap,
+title={DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning},
+author={Li, Xiquan and Chen, Wenxi and Ma, Ziyang and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Kong, Qiuqiang and Chen, Xie},
+journal={arXiv preprint arXiv:2410.09472},
+year={2024}
+}
+```
+BAT:
+```
+@article{zheng2024bat,
+title={BAT: Learning to Reason about Spatial Sounds with Large Language Models},
+author={Zheng, Zhisheng and Peng, Puyuan and Ma, Ziyang and Chen, Xie and Choi, Eunsol and Harwath, David},
+journal={Proc. ICML},
+year={2024}
+}
+```

examples/drcap_zeroshot_aac/README.md

Lines changed: 18 additions & 6 deletions
@@ -1,7 +1,7 @@
 # DRCap_Zeroshot_Audio_Captioning
 
 ## Introduction
-DRCap is a data-efficient and flexible audio captioning system requiring text-only data for training and can quickly adapt to new domains without additional fine-tuning.
+[DRCap](https://www.arxiv.org/abs/2410.09472) is a data-efficient and flexible audio captioning system requiring text-only data for training and can quickly adapt to new domains without additional fine-tuning. It uses projection decoding and retrieval-augmented generation to perform zero-shot audio captioning.
 
 ![](assets/model.png)
 
@@ -14,7 +14,7 @@ You could download our pretrained CLAP model and linear mapping network through
 * LLM [vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5)
 
 ## Inference
-You could modify the variables `run_dir`, `audio_encoder_dir`, `output_dir`, `llm_path` in `scripts/inference_drcap.sh` to match the paths where the downloaded checkpoints are located. Additionally, update the `source` in `data/audiocaps_test.jsonl` to ensure the audio paths point to your audio files, and then run:
+You could modify the variables `run_dir`, `audio_encoder_dir`, `output_dir`, `llm_path` in `scripts/inference_drcap.sh` to match the paths where the downloaded checkpoints are located. Additionally, update the `source` in `data_examples/audiocaps_test.jsonl` to ensure the audio paths point to your audio files, and then run:
 
 ```shell
 bash scripts/inference_drcap.sh
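
In practice, the change in this hunk only renames the example data directory; the setup it describes is still a matter of pointing a few script variables and the jsonl audio paths at local files. Below is a minimal sketch of that step: every path is a placeholder, the actual variable block in `scripts/inference_drcap.sh` may be organized differently, and the `sed` pattern is purely illustrative of rewriting the `source` entries in `data_examples/audiocaps_test.jsonl`.

```shell
# Hedged sketch: placeholder assignments for the variables the README asks you
# to edit in scripts/inference_drcap.sh (the real script may differ).
run_dir=/path/to/SLAM-LLM
audio_encoder_dir=/path/to/pretrained_clap
output_dir=/path/to/drcap_linear_mapping_checkpoint
llm_path=/path/to/vicuna-7b-v1.5

# Illustrative only: rewrite the "source" audio paths in the test jsonl so they
# point at your local AudioCaps audio; adjust both patterns to your layout.
sed -i 's#/old/audio/root#/your/audiocaps/audio#g' data_examples/audiocaps_test.jsonl
```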
@@ -24,10 +24,10 @@ bash scripts/inference_drcap.sh
 ## Data preparation
 Prepare your `jsonl` data file in the following format:
 ```json
-{"key": "Y7fmOlUlwoNg_1", "target": "Constant rattling noise and sharp vibrations", "text": "Constant rattling noise and sharp vibrations"}
-{"key": "Y6BJ455B1aAs_1", "target": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle", "text": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle"}
+{"key": "Y7fmOlUlwoNg_1", "target": "Constant rattling noise and sharp vibrations", "text": "Constant rattling noise and sharp vibrations", "similar_captions": ["The engine of a small machine pulling chains", "A market vendor is producing a rhythmic sound with metal forceps.", "A masonry machine is in operation at a fair."]}
+{"key": "Y6BJ455B1aAs_1", "target": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle", "text": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle", "similar_captions": ["An engine is revving, with fire and an explosion.", "An explosion is heard after an engine cuts out.", "A car speeding past with a large boom"]}
 ```
-Please note that only textual data is required for training. However, for zero-shot inference, audio files are also necessary. You could find an example of the jsonl file in `data/audiocaps_test.jsonl`
+Please note that only textual data is required for training. However, for zero-shot inference, audio files are also necessary. You could find an example of the jsonl file in `data_examples/audiocaps_test.jsonl`
 
 Run the following command to do the retrieval-augmentation and create the text embedding support for evaluation:
 ```shell
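
Since only textual data is required for training, a starting jsonl can be produced from captions alone in the `key`/`target`/`text` layout shown in this hunk. The sketch below does exactly that; the file name and captions are invented for illustration, and the `similar_captions` field added by this commit is presumably what the retrieval-augmentation command mentioned above populates.

```shell
# Hedged sketch: a minimal text-only training jsonl in the key/target/text format.
# File name and captions are made up; "similar_captions" is omitted here on the
# assumption that the retrieval-augmentation step fills it in.
cat > data_examples/my_text_only_train.jsonl << 'EOF'
{"key": "example_0001", "target": "A dog barks while rain falls steadily", "text": "A dog barks while rain falls steadily"}
{"key": "example_0002", "target": "A car engine idles and then revs loudly", "text": "A car engine idles and then revs loudly"}
EOF
```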
@@ -42,4 +42,16 @@ bash scripts/finetune_drcap.sh
 For training only the linear layer (without using LoRA or other PEFT methods), you can set the following parameters: `use_peft=false` and `freeze_llm=true`. To turn off the RAG, you could set `use_arg=false` and `rag_first=false`
 
 ## Acknowledgement
-The code of training the CLAP model is based on the [WavCaps](https://github.com/XinhaoMei/WavCaps) repo, we thank the contributors for open-sourcing their work.
+The code of training the CLAP model is based on the [WavCaps](https://github.com/XinhaoMei/WavCaps) repo, we thank the contributors for open-sourcing their work.
+
+
+## Citation
+You can refer to our paper for more results
+```
+@article{li2024drcap,
+title={DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning},
+author={Li, Xiquan and Chen, Wenxi and Ma, Ziyang and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Kong, Qiuqiang and Chen, Xie},
+journal={arXiv preprint arXiv:2410.09472},
+year={2024}
+}
+```
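
The linear-only training setup mentioned in this hunk's context (`use_peft=false`, `freeze_llm=true`) would typically be expressed as Hydra-style overrides. The sketch below assumes those flags live under a `train_config` group and that `scripts/finetune_drcap.sh` forwards extra overrides; both are assumptions, and if the script does not forward arguments the same overrides can be edited inside it.

```shell
# Hedged sketch: Hydra-style overrides for training only the linear layer.
# The train_config prefix is an assumption about the recipe's config layout;
# edit the overrides inside scripts/finetune_drcap.sh if args are not forwarded.
bash scripts/finetune_drcap.sh \
    ++train_config.use_peft=false \
    ++train_config.freeze_llm=true
```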
