Dear Spatial-MLLM team,
Thank you for the great work.
I’m a bit confused about the connector configuration. Is the connector trained from scratch, or is it partially initialized from the pre-trained Qwen2.5-VL? If it is trained from scratch, the reported dataset size seems smaller than what is typical in mainstream MLLM settings, given that VLMs usually require large amounts of data to align vision and language features through the connector layer.
Could you please provide more details about how the connector is trained (e.g., initialization strategy, data scale, objectives, and training procedure)?
Best regards