test #152 (Merged)

Commits (43)
- `e5dc39c` initial commit for drcap (xiquan-li, Sep 21, 2024)
- `80ea224` test (cwx-worst-one, Sep 22, 2024)
- `a430200` Update SLAM-AAC README.md (cwx-worst-one, Sep 22, 2024)
- `0e68d94` Update (cwx-worst-one, Sep 22, 2024)
- `2c8cddd` commit for exmaples/drcap (xiquan-li, Sep 23, 2024)
- `4897187` commit for exmaples/drcap (xiquan-li, Sep 23, 2024)
- `eedefb5` minor fix (xiquan-li, Sep 23, 2024)
- `402a9da` add recipes for DRCap (xiquan-li, Sep 23, 2024)
- `ffb640f` add recipes for DRCap (xiquan-li, Sep 23, 2024)
- `72e5680` add custom dataset for drcap, keep original audio_dataset unchanged (xiquan-li, Sep 25, 2024)
- `e7a03c3` Merge pull request #133 from Andreas-Xi/lxq-drcap (ddlBoJack, Sep 26, 2024)
- `4fb8d82` Merge pull request #134 from X-LANCE/main (cwx-worst-one, Sep 26, 2024)
- `9aa603e` st (yxduir, Sep 27, 2024)
- `cdac921` md (yxduir, Sep 27, 2024)
- `a4a3846` md (yxduir, Sep 27, 2024)
- `57795bc` md (yxduir, Sep 27, 2024)
- `0abf7c4` update slam-aac (cwx-worst-one, Sep 27, 2024)
- `12d694c` update SLAM-AAC (cwx-worst-one, Sep 27, 2024)
- `389234b` merge (yxduir, Sep 27, 2024)
- `a55cee9` md (yxduir, Sep 27, 2024)
- `de8d4d5` md (yxduir, Sep 27, 2024)
- `a48f11c` md (yxduir, Sep 27, 2024)
- `8a0a3dc` md (yxduir, Sep 27, 2024)
- `f5cd6f3` 0928 (yxduir, Sep 28, 2024)
- `3a7c195` Merge pull request #137 from X-LANCE/yxdu (ddlBoJack, Sep 28, 2024)
- `cd72be0` md (yxduir, Sep 29, 2024)
- `1a4f26d` Merge pull request #138 from X-LANCE/main (ddlBoJack, Oct 1, 2024)
- `b304f0a` update README (ddlBoJack, Oct 1, 2024)
- `b93bbf3` update README (ddlBoJack, Oct 1, 2024)
- `fbe3b65` Merge pull request #139 from X-LANCE/dev-mzy (ddlBoJack, Oct 1, 2024)
- `a21507e` Update README.md (yxduir, Oct 2, 2024)
- `c1706d6` md (HITCSzwx, Oct 2, 2024)
- `e1be609` md (HITCSzwx, Oct 2, 2024)
- `12b8772` md (HITCSzwx, Oct 2, 2024)
- `d599ce4` Merge pull request #140 from X-LANCE/yxdu (ddlBoJack, Oct 2, 2024)
- `752b96e` 10.11 (cwx-worst-one, Oct 11, 2024)
- `56fb822` 10.11 (cwx-worst-one, Oct 11, 2024)
- `b8dbc12` SLAM-AAC (cwx-worst-one, Oct 11, 2024)
- `2ab898e` Merge pull request #147 from X-LANCE/cwx_slam_aac (ddlBoJack, Oct 12, 2024)
- `db55bea` update README (ddlBoJack, Oct 12, 2024)
- `be67304` BAT update: fix type, upload checkpoint; finish inference code; add d… (zszheng147, Oct 12, 2024)
- `8c05584` seld: add checkpoint link to readme (Oct 13, 2024)
- `38d8c66` Merge pull request #150 from X-LANCE/seld (ddlBoJack, Oct 13, 2024)
40 changes: 28 additions & 12 deletions README.md
@@ -28,15 +28,18 @@ ...developers to train custom multimodal large language model (MLLM), focusing on Speech, Language, Audio, Music processing.
6. [Citation](#citation)

# News
- [Update Oct. 12, 2024] Recipes for [SLAM-AAC](examples/slam_aac/README.md) have been supported.
- [Update Sep. 28, 2024] Recipes for [CoT-ST](examples/st_covost2/README.md) have been supported.
- [Update Sep. 25, 2024] Recipes for [DRCap](examples/drcap_zeroshot_aac/README.md) have been supported.
- [Update Jun. 12, 2024] Recipes for [MaLa-ASR](examples/mala_asr_slidespeech/README.md) have been supported.
- **[CALL FOR EXAMPLE]** We sincerely invite developers and researchers to build new applications and conduct academic research based on SLAM-LLM, and to submit pull requests with your examples! We also welcome engineering PRs (such as improving and speeding up multi-node training).
- [Update May. 22, 2024] Please join [Slack](https://join.slack.com/t/slam-llm/shared_invite/zt-2mc0pkhhs-5jjOi8Cwc8R1Xc8IQmykDA) or the [WeChat group](./docs/Wechat.jpg). We will sync our updates and Q&A there.
- [Update May. 21, 2024] Recipes for [Spatial Audio Understanding](examples/seld_spatialsoundqa/README.md) have been supported.
- [Update May. 20, 2024] Recipes for [music caption (MC)](examples/mc_musiccaps/README.md) have been supported.
- [Update May. 8, 2024] Recipes for [visual speech recognition (VSR)](examples/vsr_LRS3/README.md) have been supported.
- [Update May. 4, 2024] Recipes for [zero-shot text-to-speech (TTS)](examples/vallex/README.md) have been supported.
- [Update Apr. 28, 2024] Recipes for [automated audio captioning (AAC)](examples/aac_audiocaps/README.md) have been supported.
- [Update Mar. 31, 2024] Recipes for [automatic speech recognition (ASR)](examples/asr_librispeech/README.md) have been supported.

# Installation
```bash
...
```
@@ -75,12 +78,25 @@ docker run -it --gpus all --name slam --shm-size=256g slam-llm:latest /bin/bash
## List of Recipes
We provide reference implementations of various LLM-based speech, audio, and music tasks:
- **Speech Task**
    - Automatic Speech Recognition (ASR)
        - [SLAM-ASR](examples/asr_librispeech/README.md)
    - Contextual Automatic Speech Recognition (CASR)
        - [MaLa-ASR](examples/mala_asr_slidespeech/README.md)
    - [Visual Speech Recognition (VSR)](examples/vsr_LRS3/README.md)
    - Speech-to-Text Translation (S2TT)
        - [CoT-ST](examples/st_covost2/README.md)
    - Text-to-Speech (TTS)
        - [VALL-E-X](examples/vallex/README.md)
- **Audio Task**
    - [Automated Audio Captioning (AAC)](examples/aac_audiocaps/README.md)
        - [SLAM-AAC](examples/slam_aac/README.md)
        - [DRCap](examples/drcap_zeroshot_aac/README.md)
    - Spatial Audio Understanding
        - [BAT](examples/seld_spatialsoundqa/README.md)
- **Music Task**
    - [Music Caption (MC)](examples/mc_musiccaps/README.md)

@@ -103,7 +119,7 @@ command-line (shell file) > Hydra configuration (yaml file) > dataclass configuration (python file)
- We thank the contributors for providing diverse recipes.

## Citation

SLAM-ASR:
```
@article{ma2024embarrassingly,
title={An Embarrassingly Simple Approach for LLM with Strong ASR Capacity},
...
```
45 changes: 45 additions & 0 deletions examples/drcap_zeroshot_aac/README.md
@@ -0,0 +1,45 @@
# DRCap_Zeroshot_Audio_Captioning

## Introduction
DRCap is a data-efficient and flexible audio captioning system that requires only text data for training and can quickly adapt to new domains without additional fine-tuning.

![](assets/model.png)

## Pretrained models
You can download our pretrained CLAP model and linear mapping network from Google Drive:
* [CLAP](https://drive.google.com/drive/folders/1d5RqM2OTxO8PD7qBUAyXXJHjS96XIauw?usp=sharing) pretrained on [SoundVECaps](https://yyua8222.github.io/Sound-VECaps-demo/) and [WavCaps](https://huggingface.co/datasets/cvssp/WavCaps) (~1.9M audio-text pairs in total)

* [Linear mapping network](https://drive.google.com/drive/folders/1d5RqM2OTxO8PD7qBUAyXXJHjS96XIauw?usp=sharing) trained on AudioCaps and Clotho_v2 via CLAP latent decoding and text-to-text retrieval augmentation.

* LLM [vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5)

## Inference
Modify the variables `run_dir`, `audio_encoder_dir`, `output_dir`, and `llm_path` in `scripts/inference_drcap.sh` to match the paths of the downloaded checkpoints. Additionally, update the `source` field in `data/audiocaps_test.jsonl` so that the audio paths point to your audio files, then run:

```shell
bash scripts/inference_drcap.sh
```
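If there are many entries to update, a small script can rewrite the `source` field for you. The snippet below is only a sketch: it assumes every line of `data/audiocaps_test.jsonl` carries a `source` audio path, and the local audio directory and output file name are placeholders.

```python
# Sketch: re-point the "source" field of each jsonl entry at local audio files.
# AUDIO_DIR and the output path are assumptions -- adjust them to your setup.
import json
from pathlib import Path

AUDIO_DIR = Path("/path/to/audiocaps/test")  # hypothetical location of your .wav files

in_path = Path("data/audiocaps_test.jsonl")
out_path = Path("data/audiocaps_test_local.jsonl")

with in_path.open() as fin, out_path.open("w") as fout:
    for line in fin:
        entry = json.loads(line)
        # Keep the original file name, but re-root it under AUDIO_DIR
        entry["source"] = str(AUDIO_DIR / Path(entry["source"]).name)
        fout.write(json.dumps(entry) + "\n")
```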


## Data preparation
Prepare your `jsonl` data file in the following format:
```json
{"key": "Y7fmOlUlwoNg_1", "target": "Constant rattling noise and sharp vibrations", "text": "Constant rattling noise and sharp vibrations"}
{"key": "Y6BJ455B1aAs_1", "target": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle", "text": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle"}
```
Please note that only textual data is required for training. However, for zero-shot inference, audio files are also necessary. You can find an example jsonl file in `data/audiocaps_test.jsonl`.
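If you start from a plain list of captions, a sketch along the following lines produces training data in the format above. The input file `captions.txt` and the key naming scheme are assumptions for illustration, not part of the recipe.

```python
# Sketch: build a text-only training jsonl from a plain list of captions.
import json

with open("captions.txt") as fin, open("train_text_only.jsonl", "w") as fout:
    for i, caption in enumerate(fin):
        caption = caption.strip()
        if not caption:
            continue
        record = {
            "key": f"caption_{i:06d}",   # any unique id works
            "target": caption,           # caption used as the training target
            "text": caption,             # caption used as the text input
        }
        fout.write(json.dumps(record) + "\n")
```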

Run the following command to perform retrieval augmentation and build the text-embedding support for evaluation:
```shell
bash scripts/data_preprocess.sh
```
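For intuition, text-to-text retrieval augmentation boils down to embedding every caption with a text encoder and, for each caption, retrieving its nearest neighbours from a caption databank by cosine similarity. The sketch below illustrates only that idea with random placeholder vectors; in DRCap the embeddings come from the CLAP text encoder, and `scripts/data_preprocess.sh` remains the authoritative implementation.

```python
# Sketch: top-k text-to-text retrieval by cosine similarity.
# Random vectors stand in for CLAP text embeddings; shapes and names are assumptions.
import numpy as np

def top_k_neighbours(query_emb: np.ndarray, support_emb: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k most similar support embeddings for each query."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    s = support_emb / np.linalg.norm(support_emb, axis=1, keepdims=True)
    sim = q @ s.T  # cosine similarity matrix
    return np.argsort(-sim, axis=1)[:, :k]

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 1024))    # e.g. embeddings of training captions
support = rng.normal(size=(100, 1024))  # e.g. embeddings of the caption databank
print(top_k_neighbours(queries, support, k=3))
```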

## Model Training
Run the following command to train the model:
```bash
bash scripts/finetune_drcap.sh
```
To train only the linear mapping network (without LoRA or other PEFT methods), set `use_peft=false` and `freeze_llm=true`. To turn off RAG, set `use_arg=false` and `rag_first=false`.

## Acknowledgement
The code for training the CLAP model is based on the [WavCaps](https://github.com/XinhaoMei/WavCaps) repo; we thank the contributors for open-sourcing their work.
Binary file added examples/drcap_zeroshot_aac/assets/model.png
68 changes: 68 additions & 0 deletions examples/drcap_zeroshot_aac/conf/clap_config.yaml
@@ -0,0 +1,68 @@
device: "cuda"
seed: 20
embed_size: 1024
temp: 0.07
queue_size: 5120
json_files: [
'../data/json_files/BBC_Sound_Effects/bbc_final.json',
'../data/json_files/FreeSound/fsd_final.json',
'../data/json_files/SoundBible/sb_final.json',
"../data/json_files/AudioSet_SL/as_final.json",
"data/AudioCaps/json_files/train.json",
"data/Clotho/json_files/train.json"
]

resume: false
blacklist: "../data/json_files/blacklist/" # path to blacklist file
embed_regularization: true

arch_version: 0


dist_args:
world_size: 1

audio_args:
sr: 32000
n_fft: 1024
hop_length: 320
f_min: 50
f_max: 14000
n_mels: 64
max_length: 30
mono: True
use_torchaudio: True


audio_encoder_args:
type: "transformer"
model: "Cnn14"
pretrained: False
freeze: False


data_args:
batch_size: 128
num_workers: 8


text_encoder_args:
type: 'roberta-base'
freeze: False


optim_args:
lr: !!float 5e-5
warmup_steps: 0
optimizer_name: "adam"
betas: [0.9, 0.999]
eps: !!float 1e-8
momentum: 0.9
warmup_epochs: 2


training:
spec_augmentation: True
epochs: 15
clip_grad: 2
dropout: 0.2
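For reference, a config in this format can be read with PyYAML. The sketch below only demonstrates accessing a few nested fields; it is not necessarily how the repository's CLAP training entry point consumes the file.

```python
# Sketch: load the CLAP training config and read a few nested fields.
import yaml  # pip install pyyaml

with open("examples/drcap_zeroshot_aac/conf/clap_config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["embed_size"])        # 1024, dimensionality of the joint embedding
print(cfg["audio_args"]["sr"])  # 32000 Hz input sample rate
print(cfg["optim_args"]["lr"])  # 5e-5 learning rate
print(len(cfg["json_files"]))   # number of caption json files used for training
```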
19 changes: 19 additions & 0 deletions examples/drcap_zeroshot_aac/conf/ds_config.json
@@ -0,0 +1,19 @@
{
"train_micro_batch_size_per_gpu": 4,
"gradient_accumulation_steps": 1,
"optimizer": {
"type": "Adam",
"params": {
"lr": 1e-4
}
},
"fp16": {
"enabled": true
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu"
}
}
}
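This is a standard DeepSpeed configuration (ZeRO stage 3 with CPU optimizer offload and fp16). As a hedged illustration of how such a file is typically consumed, the sketch below hands it to `deepspeed.initialize`; the model class is a placeholder, and the actual SLAM-LLM training scripts may wire DeepSpeed in through their own launcher instead.

```python
# Sketch: initialize a model with DeepSpeed using this config file.
# MyModel is a stand-in; SLAM-LLM's own scripts may integrate DeepSpeed differently.
import deepspeed
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(1024, 1024)

    def forward(self, x):
        return self.proj(x)

model = MyModel()
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="examples/drcap_zeroshot_aac/conf/ds_config.json",
)
```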
3 changes: 3 additions & 0 deletions examples/drcap_zeroshot_aac/conf/prompt.yaml
@@ -0,0 +1,3 @@
dataset_config:
  # The prompt lives here because Hydra overrides in the shell script only support a limited set of characters
prompt: "Describe the audio you hear."