Description
CLIP and ALIGN, trained with contrastive learning on web-sourced paired image-text data, show strong zero-shot transfer performance.
However, because CLIP and ALIGN train on large datasets from scratch, they are not data- and compute-efficient.
So what if we train starting from pre-trained weights? Going further, what if we keep the pre-trained weights frozen?
The authors experiment with contrastive-tuning under various setups and find that
the setup that freezes the pre-trained image encoder and trains only the text encoder gives the best zero-shot image classification performance
(locked pre-trained image models with unlocked text models work best)
This method is called Locked-image Tuning (LiT), and LiT is data- and compute-efficient.
So why does LiT work well?
→ Because it decouples learning an image descriptor from learning vision-language alignment.
The goal of contrastive learning is to align vision and language, not to learn a representation well suited to image classification.
Zero-shot image classification works well because a pre-trained image encoder that already suits image classification is frozen, and only the text encoder is trained to align vision and language (see the zero-shot classification sketch below).
Of course, since the image encoder is frozen, performance on cross-modal retrieval tasks falls short.
In other words, depending on which downstream task you care about, LiT may or may not be the right choice.
(Or it might be fine to pick a good pre-trained image encoder that matches the downstream task you care about?)
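As a reference point for why aligning only the text tower suffices, here is a minimal sketch of CLIP/LiT-style zero-shot classification. The `image_encoder`, `text_encoder`, and prompts are placeholders (any aligned encoder pair returning same-dimension embeddings), not the paper's actual code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, images, class_prompts):
    """Classify images by cosine similarity to text-prompt embeddings.

    Assumes both encoders return embeddings of the same dimension:
    text_encoder(list of C strings) -> (C, D), image_encoder(images) -> (B, D).
    """
    # Embed each class prompt once; these act as the classifier weights.
    text_emb = F.normalize(text_encoder(class_prompts), dim=-1)  # (C, D)
    img_emb = F.normalize(image_encoder(images), dim=-1)         # (B, D)
    logits = img_emb @ text_emb.T                                 # (B, C)
    return logits.argmax(dim=-1)                                  # predicted class per image
```

Because the classifier weights are just text embeddings, a frozen image tower only needs a text tower that maps prompts into its embedding space.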
A brief summary of only the parts I found important
1. Methods
The model is trained with a global contrastive loss, i.e. the contrastive loss is computed across all GPU devices (a minimal sketch follows the notation list below)
L: pre-trained weights, frozen (locked)
U: pre-trained weights, trained (unlocked)
u: random initialization, trained (unlocked)
Notation examples (first letter = image tower, second letter = text tower)
LU: locked pre-trained image model, unlocked pre-trained language model
Lu: locked pre-trained image model, unlocked random init language model
UL: unlocked pre-trained image model, locked pre-trained language model
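Below is a minimal single-device sketch of the contrastive objective under the Lu setup (locked pre-trained image tower, unlocked randomly initialized text tower). The encoders and the scalar `temperature` are placeholders; the paper's global contrastive loss additionally gathers embeddings from all devices before computing the same symmetric loss.

```python
import torch
import torch.nn.functional as F

def lit_contrastive_loss(image_encoder, text_encoder, images, texts, temperature=0.07):
    """Symmetric image-text contrastive loss with a locked image tower.

    In the L* setups the image tower receives no gradient; only the text
    tower (and, in practice, a learnable temperature) is updated.
    """
    with torch.no_grad():                                     # locked (L) image tower
        img_emb = F.normalize(image_encoder(images), dim=-1)  # (B, D)
    txt_emb = F.normalize(text_encoder(texts), dim=-1)        # (B, D)

    # Pairwise similarities; matching image-text pairs lie on the diagonal.
    logits = img_emb @ txt_emb.T / temperature                # (B, B)
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, labels)                # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, labels)              # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```

Since the locked tower never changes, its embeddings can in principle be precomputed and cached, which contributes to LiT's compute efficiency.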
2. Experiments
2.1. Image-text datasets
CC12M: Conceptual Captions dataset, 10 million image-text pairs
YFCC100m: Yahoo Flickr Creative Commons dataset, 99.2 million images with metadata
YFCC100m-CLIP: subset of YFCC100m, 15 million images, filtered for high-quality English text
Our dataset: 4 billion (image, alt-text) pairs, collected following the same process as ALIGN
2.2. Comparison to previous state-of-the-art
pre-trained weights for the image tower: ViT-g/14 pre-trained on JFT-3B
Table 1 - zero-shot classification results
contrastive tuning with our dataset
ImageNet : our model significantly outperforms the previous state-of-the-art methods
OOD datasets (ImageNet-v2, -R, -A, ReaL, ObjectNet): our model consistently outperforms the previous models
7 VTAB-natural tasks: LiT models achieve promising zero-shot results
contrastive tuning with public data sources
we achieve state-of-the-art results using only public data sources
we also obtain strong results on a wide range of robustness datasets and the VTAB-natural tasks
Figure 1
LiT setup converges significantly faster than the standard from-scratch setups reported in the literature
LiT provides a way to reuse already pre-trained models from the literature, amortizing the computational resources that would otherwise be spent re-training the image models
2.3. Evaluation of design choices
Figure 3
locking the image tower almost always works best and using a pre-trained image tower significantly helps across the board
using a pre-trained text tower only marginally improves performance, and locking the text tower does not work well
on cross-modal retrieval tasks, Lu shows no benefit
with longer training, Uu and UU are reported to perform better
Table 2
initializing the image tower from a pre-trained model provides better performance
the frozen setup, Lu, achieves even better results
Figure 4
why is locked (L) better than unlocked (U)?
Lu's training and validation losses are higher than those of the other configs
(substantially worse contrastive loss)
however, its image representation quality is better (a linear-probe sketch of measuring this follows below)
in other words, the pre-trained image model's representation already generalizes well
contrastive fine-tuning reduces the generality of the visual representation
→ LiT leads to a text model that is well aligned to an already strong and general image representation, as opposed to an image-text model that is well aligned but specialized to the dataset used for alignment
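The representation-quality claim can be made concrete with a linear probe on frozen image embeddings; the snippet below is only an illustrative stand-in (using scikit-learn), not the paper's exact few-shot evaluation protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_emb, train_labels, test_emb, test_labels):
    """Fit a linear classifier on frozen image embeddings, report test accuracy.

    Inputs are (N, D) float arrays and (N,) integer label arrays. Higher
    accuracy = a more linearly separable image representation, independent
    of how well it is aligned with any text tower.
    """
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_emb, train_labels)
    return probe.score(test_emb, test_labels)
```

Under this kind of metric, the locked tower keeps the generality of its pre-trained representation, while contrastive fine-tuning trades some of it away for a lower alignment loss.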
2.4. LiT works better for more generally pre-trained models
Table 3
models which are pre-trained in a generic way (e.g. on large amounts of data, or in an unsupervised way) and have similar representation quality, become similarly good image-text models after locked-image tuning (LiT)
narrowly pre-trained models (AugReg-IN and AugReg-Places) perform misleadingly well on their narrow task (0-shot IN for AugReg-IN), but fall significantly behind on more general image-text tasks (MSCOCO captions)
2.5. Which text model to use?
Table 4
the trends differ between the small dataset (YFCC100m-CLIP) and the large dataset (Ours)
on the small dataset
initializing with pre-trained weights always improves performance
the BERT model performs best
why? → due to small differences in architecture (initialization, LayerNorm placement)
on the large dataset
initializing with pre-trained weights makes little difference to performance
the BERT model actually performs worse
why? → because the BERT model is less stable during training
the ViT text model, whose training is stable, works better
2.6. Do duplicate examples matter for LiT?
if data from the downstream dataset also appears in the upstream dataset, what effect does that have?
(role of duplicate examples between upstream datasets and downstream datasets)
Table 5
duplication of examples does not influence the results strongly
with a large upstream dataset, the model may not memorize those duplicate examples
2.7. Preliminary multi-lingual experiments
Figure 5: translate the ImageNet prompts into the target language with an online translation service, then run zero-shot classification (sketched below)
Figure 6: T → I retrieval on the Wikipedia-based Image Text (WIT) dataset
training on full dataset improves performance on non-English languages
using a multi-lingual tokenizer (mT5) significantly helps languages that do not use the Latin script
starting from a pre-trained multi-lingual text model can further help
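A rough sketch of the multilingual zero-shot setup behind Figure 5, assuming machine-translated class names and prompt templates are already available per language (the paper uses an online translation service). The encoder and the prompt-ensembling by averaging are illustrative choices, not necessarily the paper's exact protocol.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_multilingual_classifiers(text_encoder, class_names_by_lang, templates_by_lang):
    """Build one zero-shot classifier matrix of shape (C, D) per language.

    class_names_by_lang maps a language code to its C translated class names;
    templates_by_lang maps it to prompt templates such as "a photo of a {}."
    in that language. The text encoder is assumed to accept any of these
    languages (e.g. via a multilingual mT5 tokenizer, as in the paper).
    """
    classifiers = {}
    for lang, class_names in class_names_by_lang.items():
        class_embs = []
        for name in class_names:
            prompts = [t.format(name) for t in templates_by_lang[lang]]
            emb = F.normalize(text_encoder(prompts), dim=-1)          # (T, D)
            class_embs.append(F.normalize(emb.mean(dim=0), dim=-1))   # average the T prompts
        classifiers[lang] = torch.stack(class_embs)                   # (C, D)
    return classifiers
```

The resulting per-language matrix can be plugged into the zero-shot classification sketch above in place of the English prompt embeddings.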

