Skip to content

midas-research/Optimizing-Multimodal-LLMs-for-Scientific-VQA-using-Caption-Aware-SFT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

24 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Enhancing Scientific Visual Question Answering via Vision-Caption aware Supervised Fine-Tuning


πŸ“˜ Overview

This work focuses on improving Scientific Visual Question Answering (SciVQA) by incorporating Vision-Caption aware Supervised Fine-Tuning (VCASFT). This approach leverages both visual and caption information from scientific figures to enhance reasoning and answer generation.

🧩 Framework

The proposed VCASFT framework integrates caption along with visual features during fine-tuning, enriching the model’s understanding of scientific figures and textual annotations.

VCASFT Framework

πŸš€ Key Highlights

βœ… HiSciVQA Dataset - A high-quality Multimodal Hindi Physics Question Answering Dataset

βœ… VCASFT - A novel training paradigm that jointly leverages visual and caption information to improve reasoning in scientific VQA.

βœ… Performance Gains - VCASFT Demonstrates substantial gains on HiSciVQA benchmarks across reasoning and answer quality metrics.

πŸ“‚ HiSciVQA Dataset

Data Files

data/
β”œβ”€β”€ HiSciVQA_Train_Data.json
└── HiSciVQA_Test_Data.json

Images

images/
β”œβ”€β”€ train_images.zip
└── test_images.zip

πŸ“¬ Contact

For any questions or feedback, please feel free to reach out:

Janak Kapuriya [email protected]

πŸ“‘ Citation

If you use this dataset or build upon this work, please cite:

@inproceedings{kapuriya2025enhancing,
  title={Enhancing Scientific Visual Question Answering via Vision-Caption aware Supervised Fine-Tuning},
  author={Kapuriya, Janak and Shaikh, Anwar and Goel, Arnav and Hira, Medha and Singh, Apoorv and Saraf, Jay and Sanjana and Nauriyal, Vaibhav and Anand, Avinash and Wang, Zhengkui and others},
  booktitle={Proceedings of the 2nd International Workshop on Large Vision-Language Model Learning and Applications},
  pages={13--30},
  year={2025}
}

About

[LAVA Workshop @ ACM Multimedia 2025] Enhancing Scientific Visual Question Answering via Vision-Caption aware Supervised Fine-Tuning

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •