Zero-shot urban function inference with street view images through prompting a pre-trained vision-language model

This project accompanies a paper accepted by the International Journal of Geographical Information Science: Zero-shot urban function inference with street view images through prompting a pretrained vision-language model. In this study, we develop the prompting framework UrbanCLIP, which builds upon the pre-trained vision-language model CLIP, to carry out zero-shot urban function inference using street view images (SVIs).
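At its core, the inference follows the standard CLIP zero-shot classification recipe: urban function classes are turned into natural-language prompts, both prompts and SVIs are embedded with CLIP, and each image is assigned the class whose prompt embedding is most similar to its image embedding. The following is a minimal sketch of that idea using the openai CLIP package; the class names, prompt template, and image path are illustrative placeholders, not the taxonomy and prompts used in the paper (those are stored under ./Utils).

# Minimal sketch of CLIP zero-shot classification for a street view image.
# Class names, the prompt template, and the image path are illustrative
# placeholders, not the exact taxonomy/prompts used in UrbanCLIP.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

functions = ["residential", "commercial", "industrial", "public service"]
prompts = [f"a street view photo of a {f} area" for f in functions]

image = preprocess(Image.open("example_svi.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(text)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_emb @ text_emb.T).softmax(dim=-1)

print(functions[probs.argmax().item()])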

Quick start

With this repository, you can

  • reproduce the results in the paper
  • use the shared urban scene dataset for further studies
  • potentially carry out zero-shot urban function inference using your own SVIs

Structure

The structure of this repository is as follows:

  • Zeroshot_UrbanCLIP.py: the main entry point for zero-shot urban function inference with SVIs
  • ./Data: the folder to store the data, including the annotated urban scenes (SVIs) in Shenzhen, Singapore, and London.
  • ./Emb: the folder to store the SVI embeddings.
  • ./Utils: the folder to store the urban taxonomy and urban function prompts, as well as some utility functions.

Zeroshot_UrbanCLIP.py

Zeroshot_UrbanCLIP.py is used to carry out zero-shot urban function inference with SVIs. You can specify several arguments when running the script:

  • --task: can be "primary", "multi", "transfer-singapore", or "transfer-london", to reproduce the corresponding results in the paper.
  • --taxomony: indicates whether the developed urban taxonomy is used, with "UrbanCLIP" indicating the use of the urban taxonomy and "function_name" indicating otherwise.
  • --prompt_template: the prompt template set to use, which can be UrbanCLIP, Wu, Photo, CLIP80, no_template, UrbanCLIP_SC, or Wu_without_SC; please refer to the paper for more details.
  • --ensemble: the prompt template ensembling method, which can be "mean" or "zpe" (a sketch of the "mean" option follows the example command below).
  • --device: the device to use, which can be cpu, cuda, or cuda:{}.

For example, you could run the command:

python Zeroshot_UrbanCLIP.py --task=primary --device=cuda:0

to reproduce the results in the paper on zero-shot primary function inference.
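For the --ensemble option, "mean" corresponds to the common CLIP-style ensembling in which each class embedding is the average of its normalized prompt embeddings over all templates ("zpe" weights the templates instead; see the paper for details). Below is a minimal sketch of the "mean" variant, with illustrative templates and class names rather than the ones shipped in ./Utils.

# Minimal sketch of "mean" prompt-template ensembling: each urban function is
# represented by the average of its normalized prompt embeddings.
# Templates and class names are illustrative placeholders.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

templates = ["a street view photo of a {} area.", "a photo of a {} scene."]
functions = ["residential", "commercial", "industrial"]

class_embs = []
with torch.no_grad():
    for f in functions:
        tokens = clip.tokenize([t.format(f) for t in templates]).to(device)
        embs = model.encode_text(tokens)
        embs = embs / embs.norm(dim=-1, keepdim=True)
        mean_emb = embs.mean(dim=0)
        class_embs.append(mean_emb / mean_emb.norm())
class_embs = torch.stack(class_embs)  # one ensembled text embedding per function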

Citation

If you use the code in this project, please cite the paper in IJGIS.

@article{huang2024urbanclip,
  title={Zero-shot urban function inference with street view images through prompting a pretrained vision-language model},
  author={Huang, Weiming and Wang, Jing and Cong, Gao},
  journal={International Journal of Geographical Information Science},
  pages={1--29},
  year={2024},
  publisher={Taylor \& Francis}
}

Contact

Weiming Huang

Email: [email protected]
