Zhenyi Liao, Qingsong Xie, Yanhao Zhang, Zijian Kong, Haonan Lu, Zhenyu Yang, Zhijie Deng
- 🚀 [04/02/2025] We release VSI-100k on Hugging Face.
- 🚀 [04/02/2025] We release our paper on arXiv.
🔔 We incorporate GRPO training for improved visual-spatial reasoning, using our carefully curated VSI-100k dataset.
🔔 With GRPO training, our vsGRPO-2B outperforms GPT-4o, and the vsGRPO-7B demonstrates performance comparable to the best open-source model, LLaVA-Video-Next-72B.
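At the heart of GRPO is a critic-free, group-relative advantage: several responses are sampled per prompt, and each response's reward is normalized against its group's mean and standard deviation. A minimal sketch of that normalization step (illustrative only; function name and reward values are hypothetical, not the project's actual training code):

```python
# Minimal sketch of GRPO's group-relative advantage estimate.
# Rewards here are assumed scalars, e.g. from a verifier that checks
# a model answer against the ground-truth label.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std,
    as in GRPO's critic-free advantage estimate."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses: two correct (reward 1), two wrong (reward 0).
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Correct responses get positive advantage, wrong ones negative;
# advantages within a group sum to zero.
```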
To combat data scarcity, we build VSI-100k. Specifically, using the ScanNet 3D annotation information, we construct approximately 100k question-answer pairs.
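To give a flavor of how spatial QA pairs can be derived from 3D annotations, here is a hypothetical sketch that turns ScanNet-style object labels and bounding-box centers into an object-distance question; the actual VSI-100k construction pipeline may differ:

```python
# Hypothetical illustration: build a spatial QA pair from ScanNet-style
# 3D annotations (object label + bounding-box center). Not the actual
# VSI-100k pipeline.
import math

def distance_question(obj_a, obj_b):
    """Build an object-distance QA pair from two annotated objects,
    each given as a (label, (x, y, z) center) tuple."""
    (label_a, center_a), (label_b, center_b) = obj_a, obj_b
    dist = math.dist(center_a, center_b)  # Euclidean distance in meters
    question = f"How far apart are the {label_a} and the {label_b} (in meters)?"
    answer = f"{dist:.1f}"
    return question, answer

q, a = distance_question(("chair", (1.0, 0.0, 0.0)),
                         ("table", (4.0, 4.0, 0.0)))
# q: "How far apart are the chair and the table (in meters)?", a: "5.0"
```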
Our vsGRPO-2B outperforms GPT-4o, and the vsGRPO-7B demonstrates performance comparable to the best open-source model, LLaVA-Video-Next-72B.
If you find our work and the dataset useful, please cite:
@article{liao2025improved,
title={Improved Visual-Spatial Reasoning via R1-Zero-Like Training},
author={Liao, Zhenyi and Xie, Qingsong and Zhang, Yanhao and Kong, Zijian and Lu, Haonan and Yang, Zhenyu and Deng, Zhijie},
journal={arXiv preprint arXiv:2504.00883},
year={2025}
}
Usage and License Notices: The data and code are intended and licensed for research use only.
License: Attribution-NonCommercial 4.0 International. Use should also abide by the OpenAI terms of use: https://openai.com/policies/terms-of-use
We sincerely thank the R1-V and ScanNet projects, on which our project is built. We also thank trl, Qwen2-VL, and vllm for their open-source contributions.