Code will be available soon.
Weakly supervised visual grounding aims to locate the image region described by an input query sentence, without access to the mapping between image regions and queries during training. Current methods treat spatial grounding as an object-retrieval task, relying on cross-modal similarity scores for proposal selection; however, they fail to address the model overfitting caused by unreliable cross-modal similarity scores. To overcome this, we first propose the Confidence-aware Pseudo-label Learning (CPL) framework, which generates diverse pseudo queries for region proposals and then establishes reliable associations for model training based on uni-modal similarity scores. Second, we propose a cross-modal verification module, built on a pretrained vision-language model, to verify these associations. However, the verification module is isolated from the grounding model, so it can only assess associations statically and cannot correct suspicious ones. Finally, we introduce CPL++, which makes two improvements. First, it upgrades the verification process to use the model's grounding loss, identifying suspicious associations dynamically and selectively leveraging them during training. Second, it adds a self-supervised association correction module that rectifies suspicious associations, thereby mitigating the risk of error propagation. Experimental results on five datasets demonstrate the superiority of our approach.
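As a rough illustration of the two core ideas above — associating each query with a proposal via uni-modal similarity, and dynamically flagging suspicious associations by grounding loss — the following is a minimal NumPy sketch. All names, shapes, and the fixed loss threshold are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def select_associations(uni_modal_sim, losses, loss_threshold=1.0):
    """Illustrative sketch of confidence-aware association selection.

    uni_modal_sim : (num_queries, num_proposals) array of similarities
        between each training query and the pseudo query generated for
        each region proposal (hypothetical precomputed scores).
    losses : (num_queries,) current grounding-loss value per query,
        used as a dynamic reliability signal.
    Returns the selected proposal index per query and a boolean mask
    marking which associations are treated as reliable.
    """
    # Associate each query with the proposal whose pseudo query is most
    # similar in the uni-modal (text-to-text) space.
    selected = uni_modal_sim.argmax(axis=1)
    # Associations whose grounding loss exceeds the threshold are
    # flagged as suspicious; these would be down-weighted or passed to
    # a correction module rather than trusted directly.
    reliable = losses <= loss_threshold
    return selected, reliable

sim = np.array([[0.9, 0.2, 0.1],
                [0.3, 0.8, 0.4]])
losses = np.array([0.5, 2.0])
sel, rel = select_associations(sim, losses)
# sel → [0, 1]; rel → [True, False]
```

The split into "reliable" and "suspicious" associations mirrors CPL++'s selective use of training pairs; the actual system derives both signals inside the grounding model rather than from fixed arrays.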
