|
# <img src="assets/bat.png" alt="SELD_SpatialSoundQA" width="25" height="25"> SELD_SpatialSoundQA

This repo hosts the code and models of "[BAT: Learning to Reason about Spatial Sounds with Large Language Models](https://arxiv.org/abs/2402.01591)" [ICML 2024 [bib](https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/seld_spatialsoundqa#citation)].

Check out our [demo page](https://zhishengzheng.com/BAT/) and enjoy a QA game with spatial audio.

## Performance evaluation on **SpatialSoundQA**
We use [Spatial-AST](https://huggingface.co/datasets/zhisheng01/SpatialAudio/blob/main/SpatialAST/finetuned.pth) as the audio encoder and [llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b) as the LLM backbone. We finetune the model by adding a Q-Former and applying LoRA. To calculate mAP, refer to [calculate_map.py](https://github.com/X-LANCE/SLAM-LLM/blob/main/examples/seld_spatialsoundqa/scripts/calculate_map.py).
<img src="assets/performance.png" alt="Performance on SpatialSoundQA">
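For orientation, the mAP reported for the classification task is the mean of per-class average precision over a multi-label prediction matrix. The sketch below is an illustration of that metric, not necessarily the exact logic of the repo's `calculate_map.py`; the function names are ours.

```python
import numpy as np

def average_precision(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """AP for one class: mean precision at the rank of each positive sample."""
    order = np.argsort(-y_score)               # rank predictions, highest score first
    rel = y_true[order]                        # 0/1 relevance in ranked order
    hits = np.cumsum(rel)
    precision_at_k = hits / np.arange(1, len(rel) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """mAP over all classes that have at least one positive sample."""
    aps = [
        average_precision(y_true[:, c], y_score[:, c])
        for c in range(y_true.shape[1])
        if y_true[:, c].sum() > 0
    ]
    return float(np.mean(aps))
```

`y_true` is a `(num_samples, num_classes)` binary ground-truth matrix and `y_score` holds the model's per-class scores; classes with no positives are skipped because AP is undefined for them.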

## Checkpoints
| Encoder | Projector | LLM |
|---|---|---|
| [Spatial-AST](https://huggingface.co/datasets/zhisheng01/SpatialAudio/blob/main/SpatialAST/finetuned.pth) | [Q-Former](https://huggingface.co/datasets/zhisheng01/SpatialAudio/blob/main/BAT/model.pt) (~73.56M) | [llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b) |

## Demo (Spatial Audio Inference)
Try [`inference.ipynb`](https://github.com/X-LANCE/SLAM-LLM/blob/main/examples/seld_spatialsoundqa/inference.ipynb).

## Data preparation
You need to prepare a data manifest in JSONL format; two example entries are shown below.
You can download the SpatialSoundQA dataset from [SpatialAudio](https://huggingface.co/datasets/zhisheng01/SpatialAudio).
```json
{
  "audio_id": "eval/audio/YI-HlrcP6Qg4",
  "reverb_id": "q9vSo1VnCiC/0.npy",
  "audio_id2": null,
  "reverb_id2": null,
  "question_id": 0,
  "question_type": "CLASSIFICATION",
  "question": "Enumerate the sound occurrences in the audio clip.",
  "answer": "accelerating, revving, vroom; car; vehicle"
}

...

{
  "audio_id": "eval/audio/YZX2fVPmUidA",
  "reverb_id": "q9vSo1VnCiC/32.npy",
  "audio_id2": "eval/audio/YjNjUU01quLs",
  "reverb_id2": "q9vSo1VnCiC/31.npy",
  "question_id": 58,
  "question_type": "MIXUP_NONBINARY_DISTANCE",
  "question": "How far away is the sound of the banjo from the sound of the whack, thwack?",
  "answer": "2m"
}
```
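In the manifest each record is one JSON object per line (the pretty-printed examples above are expanded for readability). A minimal loader sketch, with an illustrative function name of our choosing:

```python
import json

def load_spatialsoundqa(jsonl_path: str) -> list[dict]:
    """Read a SpatialSoundQA split: one JSON object per non-empty line."""
    entries = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries
```

Note that mixup questions (e.g. `MIXUP_NONBINARY_DISTANCE`) carry a second source in `audio_id2`/`reverb_id2`, which are `null` for single-source questions.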

## Train a new model
```bash
cd examples/seld_spatialsoundqa/
bash scripts/finetune_spatial-ast_qformer_llama_2_7b.sh
```

## Decoding with checkpoints
```bash
cd examples/seld_spatialsoundqa/
bash scripts/decode_spatial-ast_qformer_llama_2_7b.sh
```


## TODO
- [x] Decode with checkpoints
- [x] Upload SpatialSoundQA dataset
- [x] Upload pretrained checkpoints
- [x] Update model performance

## Citation
```