Commit 6fb784b

Merge pull request #160 from X-LANCE/cwx_slam_aac
update SLAM-AAC citation and fix typo
2 parents 0045773 + 8373500

File tree

3 files changed (+26, −7 lines)


README.md

Lines changed: 12 additions & 1 deletion
````diff
@@ -20,7 +20,7 @@ developers to train custom multimodal large language model (MLLM), focusing on <
 # Table of Contents
 1. [News](#news)
 2. [Installation](#installation)
-3. [Uasge](#uasge)
+3. [Usage](#usage)
 - [List of Recipes](#list-of-recipes)
 - [Configuration Priority](#configuration-priority)
 4. [Features](#features)
@@ -129,3 +129,14 @@ SLAM-ASR:
 }
 ```
 
+SLAM-AAC:
+```
+@article{chen2024slam,
+title={SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs},
+author={Chen, Wenxi and Ma, Ziyang and Li, Xiquan and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Yu, Kai and Chen, Xie},
+journal={arXiv preprint arXiv:2410.09503},
+year={2024}
+}
+```
+
+
````
examples/slam_aac/README.md

Lines changed: 9 additions & 5 deletions
````diff
@@ -1,7 +1,6 @@
 # SLAM-AAC
 
-SLAM-AAC is a LLM-based model for Automated Audio Captioning (AAC) task. Inspired by techniques in machine translation and ASR, the model enhances audio captioning by incorporating paraphrasing augmentation and a plug-and-play CLAP-Refine strategy.
-<!-- For more details, please refer to the [paper](). -->
+SLAM-AAC is a LLM-based model for Automated Audio Captioning (AAC) task. Inspired by techniques in machine translation and ASR, the model enhances audio captioning by incorporating paraphrasing augmentation and a plug-and-play CLAP-Refine strategy. For more details, please refer to the [paper](https://arxiv.org/abs/2410.09503).
 
 ## Model Architecture
 SLAM-AAC uses EAT as the audio encoder and Vicuna-7B as the LLM decoder. During training, only the Linear Projector and LoRA modules are trainable. For inference, multiple candidates are generated using different beam sizes, which are then refined using the CLAP-Refine strategy.
@@ -81,8 +80,13 @@ If you already have the generated candidates and want to directly refine them us
 bash scripts/clap_refine.sh
 ```
 
-<!-- ## Citation
+## Citation
 You can refer to the paper for more results.
 ```
-
-``` -->
+@article{chen2024slam,
+title={SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs},
+author={Chen, Wenxi and Ma, Ziyang and Li, Xiquan and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Yu, Kai and Chen, Xie},
+journal={arXiv preprint arXiv:2410.09503},
+year={2024}
+}
+```
````
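The README above describes SLAM-AAC's inference scheme: generate several candidate captions with different beam sizes, then pick one with the plug-and-play CLAP-Refine strategy. The selection step can be sketched as below; note that `clap_similarity` here is a toy word-overlap placeholder standing in for a real CLAP audio-text similarity model, and all names and inputs are illustrative, not SLAM-AAC's actual API.

```python
# Sketch of CLAP-Refine candidate selection: score each caption candidate
# against the audio and keep the highest-scoring one. The similarity
# function below is a toy word-overlap stand-in, NOT a real CLAP model.

def clap_similarity(audio_tags: set[str], caption: str) -> float:
    """Placeholder audio-text similarity score in [0, 1]."""
    words = set(caption.lower().split())
    return len(words & audio_tags) / max(len(words), 1)

def clap_refine(audio_tags: set[str], candidates: list[str]) -> str:
    """Return the candidate caption with the highest similarity score."""
    return max(candidates, key=lambda c: clap_similarity(audio_tags, c))

candidates = [
    "a dog barks loudly",           # e.g. decoded with beam size 2
    "music plays in the distance",  # e.g. decoded with beam size 4
    "a dog barks near traffic",     # e.g. decoded with beam size 8
]
best = clap_refine({"dog", "barks", "traffic"}, candidates)
print(best)  # "a dog barks near traffic" has the highest overlap score
```

In the real pipeline the scoring model is a trained CLAP encoder pair and the candidates come from the beam-search decoding runs driven by the repo's scripts; only the argmax-over-candidates shape of the refinement step is shown here.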

examples/slam_aac/aac_config.py

Lines changed: 5 additions & 1 deletion
```diff
@@ -1,5 +1,9 @@
 from dataclasses import dataclass, field
 from typing import Optional, List
+
+from torch.distributed.fsdp import ShardingStrategy
+
+
 @dataclass
 class ModelConfig:
     file: str = "examples/slam_aac/model/slam_model_aac.py:model_factory"
@@ -125,7 +129,7 @@ class FSDPConfig:
     mixed_precision: bool = True
     use_fp16: bool = False
     # sharding_strategy = "FULL_SHARD" #ShardingStrategy = ShardingStrategy.FULL_SHARD
-    sharding_strategy: str = "NO_SHARD" #ShardingStrategy.NO_SHARD #MZY: set NO_SHARD when use DDP
+    sharding_strategy: ShardingStrategy = "NO_SHARD" #ShardingStrategy.NO_SHARD #MZY: set NO_SHARD when use DDP
     checkpoint_type: str = "SHARDED_STATE_DICT" # alternatively can use SHARDED_STATE_DICT save one file per rank, and can resize the world-size.
     fsdp_activation_checkpointing: bool = True
     fsdp_cpu_offload: bool = False
```
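The `aac_config.py` change above tightens the annotation of `sharding_strategy` from `str` to the FSDP `ShardingStrategy` enum (dataclasses do not enforce the annotation, so the string default still loads). A minimal, torch-free sketch of the underlying pattern, assuming the launcher resolves a string config value to an enum member by name; the enum here is a stand-in so the example does not depend on torch, and `resolve_strategy` is a hypothetical helper, not part of this repo:

```python
# Sketch: resolving a string config value such as "NO_SHARD" to an enum
# member, mirroring the str -> ShardingStrategy tightening in aac_config.py.
# This ShardingStrategy is a stand-in for torch.distributed.fsdp.ShardingStrategy.
from dataclasses import dataclass
from enum import Enum, auto

class ShardingStrategy(Enum):
    FULL_SHARD = auto()     # shard params, grads, and optimizer state
    SHARD_GRAD_OP = auto()  # shard grads and optimizer state only
    NO_SHARD = auto()       # no sharding; behaves like DDP
    HYBRID_SHARD = auto()   # shard within a node, replicate across nodes

@dataclass
class FSDPConfig:
    # NO_SHARD is the DDP-like default, matching the config in the diff.
    sharding_strategy: ShardingStrategy = ShardingStrategy.NO_SHARD

def resolve_strategy(name: str) -> ShardingStrategy:
    """Look an enum member up by name, e.g. from a CLI override string."""
    return ShardingStrategy[name]

cfg = FSDPConfig(sharding_strategy=resolve_strategy("FULL_SHARD"))
print(cfg.sharding_strategy)  # ShardingStrategy.FULL_SHARD
```

Enum lookup by name (`ShardingStrategy[name]`) raises `KeyError` on an unknown string, which catches misspelled strategy names at config-parse time instead of deep inside FSDP setup.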
