
[2022 CVPR] LiT: Zero-Shot Transfer with Locked-image text Tuning #225


CLIP and ALIGN, which train on web-sourced paired image-text data with contrastive learning, show strong zero-shot transfer performance.

Because CLIP and ALIGN are trained from scratch on large datasets, they are neither data- nor compute-efficient.

So what if we start from pre-trained weights instead? Going further, what if we keep those pre-trained weights frozen?

The authors experiment with contrastive-tuning in various setups, and the setup that freezes the pre-trained image encoder and trains only the text encoder gives the best zero-shot image classification performance
(locked pre-trained image models with unlocked text models work best)

This method is called Locked-image Tuning (LiT), and LiT is data- and compute-efficient.

So why does LiT work so well?
→ Because it decouples learning an image descriptor from learning vision-language alignment.

The goal of contrastive learning is to align vision and language, not to learn a representation well suited to image classification.

Since the pre-trained image encoder, already well suited to image classification, is frozen and only the text encoder is trained to align with it, zero-shot image classification performance is strong.

Of course, because the image encoder is frozen, performance on cross-modal retrieval tasks falls short.

In other words, whether LiT is the right choice depends on which downstream task you want to do well on
(or it might be fine if you pick a good pre-trained image encoder that matches the downstream task you care about?)

A brief summary of only the parts I consider important.

1. Methods


The model is trained with a global contrastive loss, i.e. the contrastive loss is computed across all GPU devices rather than per device.
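
A minimal PyTorch-style sketch of such a symmetric contrastive (InfoNCE) loss, assuming `img_emb` / `txt_emb` have already been gathered across devices (e.g. with `torch.distributed.all_gather`); the fixed temperature is purely for illustration (contrastive models typically learn it):

```python
import torch
import torch.nn.functional as F

def global_contrastive_loss(img_emb, txt_emb, temperature=0.1):
    # Assumes img_emb and txt_emb are the globally gathered batches,
    # i.e. embeddings from all devices concatenated together.
    img_emb = F.normalize(img_emb, dim=-1)  # L2-normalize so dot product = cosine similarity
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    # Matching image/text pairs sit on the diagonal.
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, labels)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), labels)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```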

L : pre-trained weights, frozen (locked)
U : pre-trained weights, trained (unlocked)
u : random initialization, trained (unlocked)

notation examples (a minimal code sketch of the Lu setup follows below)
LU : locked pre-trained image model, unlocked pre-trained language model
Lu : locked pre-trained image model, unlocked randomly initialized language model
UL : unlocked pre-trained image model, locked pre-trained language model
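
For concreteness, a sketch of what the Lu setup means in code, with hypothetical `image_tower` / `text_tower` modules standing in for the actual encoders:

```python
import torch

# Hypothetical stand-ins for the two towers (not the actual architectures).
image_tower = torch.nn.Linear(2048, 512)  # "L": pre-trained image encoder
text_tower = torch.nn.Linear(768, 512)    # "u": randomly initialized text encoder

# Lock the image tower: no gradient updates; eval mode for any normalization layers.
for p in image_tower.parameters():
    p.requires_grad = False
image_tower.eval()

# Only the text tower's parameters are optimized during contrastive tuning.
optimizer = torch.optim.Adam(text_tower.parameters(), lr=1e-3)
```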

2. Experiments

2.1. Image-text datasets

CC12M : Conceptual 12M dataset, 10 million image-text pairs
YFCC100m : Yahoo Flickr Creative Commons dataset, 99.2 million images with metadata
YFCC100m-CLIP : subset of YFCC100m, 15 million images, filtered for high-quality English text
Our dataset : 4 billion (image, alt-text) pairs, collected following the same process as ALIGN

2.2. Comparison to previous state-of-the-art

pre-trained weights for the image tower : ViT-g/14 pre-trained on JFT-3B


Table 1 - zero-shot classification results
contrastive tuning with our dataset
ImageNet : our model significantly outperforms the previous state-of-the-art methods
OOD dataset (v2, R, A, ReaL, ObjectNet) : our model consistently outperforms the previous models
7 VTAB-natural tasks : LiT models achieve promising zero-shot results

contrastive tuning with public data sources
we achieve state-of-the-art results among methods that use only public data sources
we also obtain strong results on a wide range of robustness datasets and the VTAB-natural tasks


Figure 1
LiT setup converges significantly faster than the standard from-scratch setups reported in the literature
LiT provides a way to reuse the already pre-trained models in the literature, amortizing the computational resources used to re-generate the image models

2.3. Evaluation of design choices

Figure 3
locking the image tower almost always works best and using a pre-trained image tower significantly helps across the board
using a pre-trained text tower only marginally improves performance, and locking the text tower does not work well

On cross-modal retrieval tasks, Lu shows no benefit;
with longer training, Uu and UU reportedly reach better performance.


Table 2
initializing the image tower from a pre-trained model provides better performance
the frozen setup Lu achieves even better results


Figure 4
why is locked (L) better than unlocked (U)?
Lu's training and validation losses are higher than those of the other configs
(substantially worse contrastive loss),
yet its image representation quality is better.

That is, the pre-trained image model's representation already generalizes well,
and contrastive fine-tuning reduces the generality of the visual representation.
→ LiT leads to a text model that is well aligned to an already strong and general image representation, as opposed to an image-text model that is well aligned but specialized to the dataset used for alignment

2.4. LiT works better for more generally pre-trained models

Table 3
models which are pre-trained in a generic way (e.g. on large amounts of data, or in an unsupervised way) and have similar representation quality, become similarly good image-text models after locked-image tuning (LiT)
a narrowly pre-trained model (AugReg-IN and AugReg-Places) may perform misleadingly well on its narrow task (0-shot IN for AugReg-IN), but falls significantly behind on more general image-text tasks (MSCOCO captions)

2.5. Which text model to use?

Table 4
The trends on the small dataset (YFCC100m-CLIP) and on the large dataset (Ours) differ.
On the small dataset:
initializing from pre-trained weights always improves performance
the BERT model performs best
why? → small differences in architecture (initialization, LayerNorm placement)

On the large dataset:
initializing from pre-trained weights barely affects performance
the BERT model actually performs worse
why? → the BERT model is less stable during training
the ViT, which trains more stably, works better

2.6. Do duplicate examples matter for LiT?

If data from the downstream dataset also appears in the upstream dataset, what effect does that have?
(role of duplicate examples between upstream datasets and downstream datasets)


Table 5
duplication of examples does not influence the results strongly
with a large upstream dataset, the model may not memorize those duplicate examples

2.7. Preliminary multi-lingual experiments

Figure 5 : translate the ImageNet prompts into the target language with an online translation service and run zero-shot classification (a minimal sketch of this zero-shot step follows below)
Figure 6 : T → I retrieval on the Wikipedia based Image Text (WIT) dataset

training on full dataset improves performance on non-English languages
using a multi-lingual tokenizer (mT5) significantly helps languages that do not use the Latin script
starting from a pre-trained multi-lingual text model can further help
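
A sketch of the zero-shot classification step itself, assuming a hypothetical `text_encoder` that maps a list of (possibly machine-translated) class prompts to a `(num_classes, dim)` tensor of embeddings:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_prompts, text_encoder):
    # Embed one prompt per class (e.g. ImageNet prompts translated into
    # the target language) and L2-normalize both modalities.
    txt_emb = F.normalize(text_encoder(class_prompts), dim=-1)  # (num_classes, dim)
    img_emb = F.normalize(image_emb, dim=-1)                    # (batch, dim)
    scores = img_emb @ txt_emb.t()   # cosine similarity to every class prompt
    return scores.argmax(dim=-1)     # predicted class index per image
```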

3. Appendix

Is this specific to ViT image models?

Table 6
LiT works for different model families
but ViT models do seem more amenable to learning image-text mappings than other architectures of similar size

Larger model capacity yields better results

Figure 7
increasing the capacity of the pre-trained models improves performance
increasing the capacity of the image tower helps more than increasing the capacity of the text tower
