
[2022 CVPR] LiT: Zero-Shot Transfer with Locked-image text Tuning #225


CLIP and ALIGN, which train on web-sourced paired image-text data with contrastive learning, show strong zero-shot transfer performance.

Because CLIP and ALIGN are trained from scratch on large datasets, they are neither data- nor compute-efficient.

So what if we start from pre-trained weights instead? Going further, what if we keep those pre-trained weights frozen?

The authors experiment with contrastive-tuning in various setups, and the setup that freezes the pre-trained image encoder and trains only the text encoder gives the best zero-shot image classification performance
(locked pre-trained image models with unlocked text models work best)

This method is called Locked-image Tuning (LiT), and LiT is data- and compute-efficient.

So why does LiT work so well?
→ Because it decouples learning an image descriptor from learning vision-language alignment.

The goal of contrastive learning is to align vision and language, not to learn a representation well suited to image classification.

Since the pre-trained image encoder, already well suited to image classification, is frozen and only the text encoder is trained to align with it, zero-shot image classification performance is strong.

Of course, because the image encoder is frozen, performance on cross-modal retrieval tasks falls short.

In other words, whether LiT is the right choice depends on which downstream task you want to do well on
(or it might be fine if you pick a good pre-trained image encoder that matches the downstream task you care about?)

A brief summary of only the parts I consider important.

1. Methods


The model is trained with a global contrastive loss, i.e. the contrastive loss is computed across all GPU devices rather than per device.
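
A minimal PyTorch-style sketch of such a symmetric contrastive (InfoNCE) loss, assuming `img_emb` / `txt_emb` have already been gathered across devices (e.g. with `torch.distributed.all_gather`); the fixed temperature is purely for illustration (contrastive models typically learn it):

```python
import torch
import torch.nn.functional as F

def global_contrastive_loss(img_emb, txt_emb, temperature=0.1):
    # Assumes img_emb and txt_emb are the globally gathered batches,
    # i.e. embeddings from all devices concatenated together.
    img_emb = F.normalize(img_emb, dim=-1)  # L2-normalize so dot product = cosine similarity
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    # Matching image/text pairs sit on the diagonal.
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, labels)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), labels)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```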

L : pre-trained weights, frozen (locked)
U : pre-trained weights, trained (unlocked)
u : random initialization, trained (unlocked)

notation examples (a minimal code sketch of the Lu setup follows below)
LU : locked pre-trained image model, unlocked pre-trained language model
Lu : locked pre-trained image model, unlocked randomly initialized language model
UL : unlocked pre-trained image model, locked pre-trained language model
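
For concreteness, a sketch of what the Lu setup means in code, with hypothetical `image_tower` / `text_tower` modules standing in for the actual encoders:

```python
import torch

# Hypothetical stand-ins for the two towers (not the actual architectures).
image_tower = torch.nn.Linear(2048, 512)  # "L": pre-trained image encoder
text_tower = torch.nn.Linear(768, 512)    # "u": randomly initialized text encoder

# Lock the image tower: no gradient updates; eval mode for any normalization layers.
for p in image_tower.parameters():
    p.requires_grad = False
image_tower.eval()

# Only the text tower's parameters are optimized during contrastive tuning.
optimizer = torch.optim.Adam(text_tower.parameters(), lr=1e-3)
```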

2. Experiments

2.1. Image-text datasets

CC12M : Conceptual 12M dataset, 10 million image-text pairs
YFCC100m : Yahoo Flickr Creative Commons dataset, 99.2 million images with metadata
YFCC100m-CLIP : subset of YFCC100m, 15 million images, filtered for high-quality English text
Our dataset : 4 billion (image, alt-text) pairs, collected following the same process as ALIGN

2.2. Comparison to previous state-of-the-art

pre-trained weights for the image tower : ViT-g/14 pre-trained on JFT-3B


Table 1 - zero-shot classification results
contrastive tuning with our dataset
ImageNet : our model significantly outperforms the previous state-of-the-art methods
OOD dataset (v2, R, A, ReaL, ObjectNet) : our model consistently outperforms the previous models
7 VTAB-natural tasks : LiT models achieve promising zero-shot results

contrastive tuning with public data sources
we achieve state-of-the-art results among methods that use only public data sources
we also obtain strong results on a wide range of robustness datasets and the VTAB-natural tasks


Figure 1
LiT setup converges significantly faster than the standard from-scratch setups reported in the literature
LiT provides a way to reuse the already pre-trained models in the literature, amortizing the computational resources used to re-generate the image models

2.3. Evaluation of design choices

Figure 3
locking the image tower almost always works best and using a pre-trained image tower significantly helps across the board
using a pre-trained text tower only marginally improves performance, and locking the text tower does not work well

On cross-modal retrieval tasks, Lu shows no benefit;
with longer training, Uu and UU reportedly reach better performance.


Table 2
initializing the image tower from a pre-trained model provides better performance
the frozen setup Lu achieves even better results


Figure 4
why is locked (L) better than unlocked (U)?
Lu's training and validation losses are higher than those of the other configs
(substantially worse contrastive loss),
yet its image representation quality is better.

That is, the pre-trained image model's representation already generalizes well,
and contrastive fine-tuning reduces the generality of the visual representation.
→ LiT leads to a text model that is well aligned to an already strong and general image representation, as opposed to an image-text model that is well aligned but specialized to the dataset used for alignment

2.4. LiT works better for more generally pre-trained models

Table 3
models which are pre-trained in a generic way (e.g. on large amounts of data, or in an unsupervised way) and have similar representation quality, become similarly good image-text models after locked-image tuning (LiT)
a narrowly pre-trained model (AugReg-IN and AugReg-Places) may perform misleadingly well on its narrow task (0-shot IN for AugReg-IN), but falls significantly behind on more general image-text tasks (MSCOCO captions)

2.5. Which text model to use?

Table 4
The trends on the small dataset (YFCC100m-CLIP) and on the large dataset (Ours) differ.
On the small dataset:
initializing from pre-trained weights always improves performance
the BERT model performs best
why? → small differences in architecture (initialization, LayerNorm placement)

On the large dataset:
initializing from pre-trained weights barely affects performance
the BERT model actually performs worse
why? → the BERT model is less stable during training
the ViT, which trains more stably, works better

2.6. Do duplicate examples matter for LiT?

If data from the downstream dataset also appears in the upstream dataset, what effect does that have?
(role of duplicate examples between upstream datasets and downstream datasets)


Table 5
duplication of examples does not influence the results strongly
with a large upstream dataset, the model may not memorize those duplicate examples

2.7. Preliminary multi-lingual experiments

Figure 5 : translate the ImageNet prompts into the target language with an online translation service and run zero-shot classification (a minimal sketch of this zero-shot step follows below)
Figure 6 : T → I retrieval on the Wikipedia based Image Text (WIT) dataset

training on full dataset improves performance on non-English languages
using a multi-lingual tokenizer (mT5) significantly helps languages that do not use the Latin script
starting from a pre-trained multi-lingual text model can further help
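
A sketch of the zero-shot classification step itself, assuming a hypothetical `text_encoder` that maps a list of (possibly machine-translated) class prompts to a `(num_classes, dim)` tensor of embeddings:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_prompts, text_encoder):
    # Embed one prompt per class (e.g. ImageNet prompts translated into
    # the target language) and L2-normalize both modalities.
    txt_emb = F.normalize(text_encoder(class_prompts), dim=-1)  # (num_classes, dim)
    img_emb = F.normalize(image_emb, dim=-1)                    # (batch, dim)
    scores = img_emb @ txt_emb.t()   # cosine similarity to every class prompt
    return scores.argmax(dim=-1)     # predicted class index per image
```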

3. Appendix

Is this specific to ViT image models?

Table 6
LiT works for different model families
but ViT models do seem more amenable to learning image-text mappings than other architectures of similar size

Larger model capacity yields better results

Figure 7
increasing the capacity of the pre-trained models improves performance
increasing the capacity of the image tower helps more than increasing the capacity of the text tower
