
Commit b8dbc12: SLAM-AAC
1 parent: 56fb822

3 files changed: +7 additions, −9 deletions


examples/slam_aac/README.md (3 additions, 5 deletions)

@@ -54,8 +54,8 @@ You can also fine-tune the model without loading any pre-trained weights, though
 
 
 ### Note
-In the current version of SLAM-LLM, the `peft_ckpt` parameter is no longer required. However, if you are using the checkpoint provided by us, which was trained with an earlier version, please keep the `peft_ckpt` parameter in your configuration to ensure compatibility.
-
+- In the current version of SLAM-LLM, the `peft_ckpt` parameter is no longer required. However, if you are using the checkpoint provided by us, which was trained with an earlier version, please keep the `peft_ckpt` parameter in your configuration to ensure compatibility.
+- Due to differences in dependency versions, there may be slight variations in the performance of the SLAM-AAC model.
 
 ## Inference
 To perform inference with the trained models, you can use the following commands to decode using the common beam search method:
@@ -67,7 +67,7 @@ bash scripts/inference_audiocaps_bs.sh
 bash scripts/inference_clotho_bs.sh
 ```
 
-For improved inference results, you can use the CLAP-Refine strategy, which utilizes multiple beam search decoding. Note that this method may take longer to run, but it can provide better quality outputs. You can execute the following commands:
+For improved inference results, you can use the CLAP-Refine strategy, which utilizes multiple beam search decoding. To use this method, you need to download and use our pre-trained [CLAP](https://drive.google.com/drive/folders/1X4NYE08N-kbOy6s_Itb0wBR_3X8oZF56?usp=sharing) model. Note that CLAP-Refine may take longer to run, but it can provide better quality outputs. You can execute the following commands:
 ```bash
 # Inference on AudioCaps (CLAP-Refine)
 bash scripts/inference_audiocaps_CLAP_Refine.sh
@@ -86,5 +86,3 @@ You can refer to the paper for more results.
 ```
 
 ``` -->
-
-<!-- [CLAP](https://drive.google.com/drive/folders/1X4NYE08N-kbOy6s_Itb0wBR_3X8oZF56?usp=sharing) model for post-processing (CLAP-refine) -->
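The CLAP-Refine strategy in the README diff reranks candidates from multiple beam-search runs using a CLAP model: the candidate caption whose text embedding is closest to the audio embedding wins. Below is a minimal conceptual sketch of that reranking step, not the repository's implementation; the `clap_refine` function, the cosine scorer, and the hand-made embeddings are illustrative stand-ins for the pre-trained CLAP audio/text encoders.

```python
# Conceptual sketch of CLAP-Refine reranking (illustrative only):
# given an audio embedding and candidate captions from several beam
# widths, keep the caption whose text embedding is most similar to
# the audio embedding. In the real pipeline both embeddings come
# from the downloaded pre-trained CLAP model.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def clap_refine(audio_emb: np.ndarray,
                candidates: list,
                text_embs: list) -> str:
    # Score each candidate caption against the audio embedding and
    # return the best-matching caption.
    scores = [cosine(audio_emb, t) for t in text_embs]
    return candidates[int(np.argmax(scores))]

# Toy example: the second caption's embedding points in nearly the
# same direction as the audio embedding, so it is selected.
audio = np.array([1.0, 0.0, 0.0])
caps = ["a dog barks", "rain falls on a roof", "a car passes by"]
embs = [np.array([0.0, 1.0, 0.0]),
        np.array([0.9, 0.1, 0.0]),
        np.array([0.2, 0.2, 0.9])]
print(clap_refine(audio, caps, embs))  # rain falls on a roof
```

This is also why the method costs more time than plain beam search: every extra beam width adds a full decoding pass before the single reranking step.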

examples/slam_aac/scripts/clap_refine.sh (2 additions, 2 deletions)

@@ -6,8 +6,8 @@ cd $run_dir
 code_dir=examples/slam_aac
 
 clap_dir=/data/xiquan.li/models/clap
-inference_data_path=/data/wenxi.chen/data/clotho/evaluation_single.jsonl
-output_dir=/data/wenxi.chen/cp/wavcaps_pt_v7_epoch4-clotho_ft-seed10086_btz4_lr8e-6-short_prompt_10w/aac_epoch_1_step_4500
+inference_data_path=/data/wenxi.chen/data/audiocaps/new_test.jsonl
+output_dir=/data/wenxi.chen/cp/aac_epoch_2_step_182_audiocaps_seed42
 
 echo "Running CLAP-Refine"
 

src/slam_llm/models/CLAP/feature_extractor.py (2 additions, 2 deletions)

@@ -27,10 +27,10 @@ def __init__(self, audio_config):
             fmin=audio_config["f_min"],
             fmax=audio_config["f_max"],
             ref=1.0,
-            amin=1e-6,
+            amin=audio_config.get("amin", 1e-6),
             top_db=None,
             freeze_parameters=True)
-
+
    def forward(self, input):
        # input: waveform [bs, wav_length]
        mel_feats = self.mel_trans(input)
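The `amin` change above makes the dB-conversion floor configurable through `audio_config` instead of hard-coding `1e-6`. The floor matters because the log-mel transform takes log10 of the mel power, and a silent frame would otherwise hit log10(0) = -inf. The sketch below shows the effect under a simplified librosa-style power-to-dB conversion; it is not the torchlibrosa implementation the repo actually uses.

```python
# Minimal sketch of why the amin floor matters in log-mel extraction
# (simplified; not the repo's torchlibrosa spectrogram code).
import numpy as np

def power_to_db(power: np.ndarray, ref: float = 1.0,
                amin: float = 1e-6) -> np.ndarray:
    # Clamp the mel power at `amin` before taking the log so that
    # silent frames never produce log10(0) = -inf.
    power = np.maximum(power, amin)
    return 10.0 * np.log10(power / ref)

silent = np.zeros(4)  # an all-silent frame of mel power
print(power_to_db(silent))               # every bin floored at -60.0 dB
print(power_to_db(silent, amin=1e-10))   # -100.0 dB: amin sets the floor
```

Making `amin` a config key lets the floor (and hence the feature dynamic range) match whatever the CLAP checkpoint was trained with, rather than being fixed at -60 dB.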
