To learn visual recognition from images, prior work has largely followed two directions:
- supervised learning on human-annotated image-label data
Pro : learns discriminative representations for the categories seen during training
Con : building image-label data is expensive → poor scalability → hard to cover diverse visual concepts
- language-image contrastive learning on webly-crawled image-text pairs
Pro : noisy, free-form but diverse image-text pairs cover a wide range of visual concepts
Con : the authors find that such models lack the strong discriminative ability required by transfer learning
(we find in our experiments that they usually lack the strong discriminative ability required by transfer learning)
This naturally raises the question: can we get the advantages of both approaches at once?
(can we have one model for both discriminative representations and broad visual concept coverage?)
One would guess that jointly training on image-label and image-text data captures the best of both, and working that out well is essentially the whole of this paper.
The three main contributions claimed in the paper are:
- new perspective : image-text-label space
Existing image-label and image-text data can be unified under this view.
- new learning paradigm : Unified Contrastive Learning (UniCL)
It can be trained on any data in the image-text-label space,
i.e., it can be used on image-label data alone, on image-text data alone, or on both at the same time.
- UniCL can leverage both types of data effectively and achieve superior performance universally
A model trained with UniCL on image-label + image-text data learns discriminative and semantic-rich representations
(image-label → discriminative representations, image-text → semantic-rich representations)
Below is a brief summary of the parts I found important.
1. Method
Convert each label into a text description with a prompt template, then train the model contrastively.
Apart from accounting for multiple positive pairs, it is the same as CLIP.
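As a concrete illustration, here is a minimal sketch of the label-to-text conversion with prompt templates; the template strings and the train/eval behavior are assumptions based on the CLIP prompt strategy mentioned below, not the authors' exact code.

```python
import random

# Hypothetical subset of CLIP-style prompt templates (the paper reuses CLIP's prompt set).
TEMPLATES = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a photo of the large {}.",
]

def label_to_text(label_name, training=True):
    """Convert a class label into one (train) or all (eval) text descriptions."""
    if training:
        # During training, one template is randomly sampled per example.
        return random.choice(TEMPLATES).format(label_name)
    # At evaluation, all templates are used and their text embeddings are averaged later.
    return [t.format(label_name) for t in TEMPLATES]

print(label_to_text("golden retriever"))                  # e.g. "a photo of a golden retriever."
print(label_to_text("golden retriever", training=False))  # list of all template variants
```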
Unified Image-Text-Label Contrast
v : visual feature vector
u : text feature vector
Compute u and v with the encoders, L2-normalize them, and compute the similarities.
Compared with CLIP, the difference is that multiple positives are taken into account
(the inner sigma sums the log probabilities over all positives).
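A minimal PyTorch sketch of a label-aware bidirectional contrastive loss in this spirit; this is not the authors' implementation, and the temperature handling and the averaging over positives are my assumptions.

```python
import torch
import torch.nn.functional as F

def unicl_loss(v, u, labels, temperature=0.07):
    """Label-aware bidirectional contrastive loss (sketch, not the official code).

    v: (B, D) image features, u: (B, D) text features, labels: (B,) integer label ids.
    Giving every image-text pair its own unique label id recovers the CLIP objective
    for that data, while shared ids (image-label data) create multiple positives.
    """
    v = F.normalize(v, dim=-1)
    u = F.normalize(u, dim=-1)
    logits = v @ u.t() / temperature                      # (B, B): rows = images, cols = texts

    # positives[i, j] = 1 if image i and text j share the same label id
    positives = (labels[:, None] == labels[None, :]).float()

    # image-to-text: softmax over texts, then sum log-probabilities over the positives
    log_p_i2t = F.log_softmax(logits, dim=1)
    loss_i2t = -(positives * log_p_i2t).sum(1) / positives.sum(1)

    # text-to-image: softmax over images (column-wise), same positive-aware sum
    log_p_t2i = F.log_softmax(logits, dim=0)
    loss_t2i = -(positives * log_p_t2i).sum(0) / positives.sum(0)

    return (loss_i2t.mean() + loss_t2i.mean()) / 2
```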
2. Experiments
Datasets
ratio of images/concepts clearly illustrates the different trade-off between image diversity and semantic-richness over different datasets
Training
batch size 4096
Uses the same tokenizer and prompt strategy as CLIP
train → randomly sample one of the prompt templates
validation → average the text embeddings over all 80 templates (see the sketch below)
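For reference, a sketch of the standard CLIP-style template averaging used to build zero-shot classifier weights; `text_encoder` and `tokenizer` are hypothetical stand-ins for the model's text tower, not names from the paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_zeroshot_weights(class_names, templates, text_encoder, tokenizer):
    """Average the normalized text embeddings of all templates per class (sketch)."""
    weights = []
    for name in class_names:
        texts = [t.format(name) for t in templates]        # e.g. 80 prompts per class
        emb = text_encoder(tokenizer(texts))               # (num_templates, D)
        emb = F.normalize(emb, dim=-1).mean(0)              # average over templates
        weights.append(F.normalize(emb, dim=-1))             # re-normalize the mean
    return torch.stack(weights, dim=0)                        # (num_classes, D)
```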
There is a severe data imbalance between the image-label and image-text data
→ a balanced sampler is used during training
The Appendix mentions that this balanced sampling strategy was very important for performance.
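The notes do not spell out the exact sampling strategy; below is one simple way to balance the two sources with PyTorch's WeightedRandomSampler, as an assumption-laden sketch rather than the paper's method.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def make_balanced_loader(image_label_ds, image_text_ds, batch_size=4096):
    """Sample the two data sources at roughly equal rates despite their size gap (sketch)."""
    combined = ConcatDataset([image_label_ds, image_text_ds])
    # Weight each sample inversely to the size of its source dataset (~50/50 mixing).
    weights = torch.cat([
        torch.full((len(image_label_ds),), 1.0 / len(image_label_ds)),
        torch.full((len(image_text_ds),), 1.0 / len(image_text_ds)),
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
    return DataLoader(combined, batch_size=batch_size, sampler=sampler)
```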
2.1. Results of UniCL on image classification
Table 2
UniCL achieves comparable if not better performance across all datasets and model architectures
Looking only at UniCL's image-side loss term, it is not much different from CE (cross entropy).
So what role does the text-side loss term play?
It outperforms CE in cases prone to overfitting
(CIFAR - ResNet, ImageNet - Swin)
→ bidirectional alignment between images and category names, which imposes an additional regularization term
In other words, the text loss term acts as a kind of regularizer.
Ablations
Table 3
ablation of language encoders
A Transformer performs better than a simple linear embedding layer
→ we suspect this is due to its ability to capture the semantics behind the 1K category names
ablation of training objectives
Using only the image loss term hurts performance.
Table 4: effect of training batch size
UniCL is robust to the variation of batch size, regardless of which language encoder is employed
this is probably because...
- one of the two views is the embeddings of category names in our UniCL, which are consistently used
with high overlap across different mini-batches, which make the learning less vulnerable to the batch size
- the label information provides a consistent and strong guidance
2.2. Results on data unification of image-text-label
2.2.1. Benefit of image-text to image-label
Table 5
adding image-text pairs can generally improve the performance across all metrics
ImageNet + GCC 15M > ImageNet + GCC 3M : concept richness is important
ImageNet + GCC 15M > ImageNet + YFCC 14M : quality is important
In other words, the more diverse the concepts in the image-text dataset and the higher its quality, the better the performance.
Figure 4
given a query concept from ImageNet-1K, search for the closest target concept among the remaining 21K concepts of ImageNet-22K in the feature space
model trained on ImageNet-1K → hardly generalizes to the other 21K concepts
model trained on ImageNet-1K + GCC-15M → its understanding improves significantly, as the retrieved targets become more semantically similar to the queries from ImageNet-1K
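A minimal sketch of this kind of concept retrieval: encode the category names with the text tower and take cosine-similarity nearest neighbors; `text_encoder` and `tokenizer` are placeholders, not functions from the paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def nearest_concepts(query_names, target_names, text_encoder, tokenizer, k=5):
    """For each query concept name, retrieve the k closest target concepts (sketch)."""
    q = F.normalize(text_encoder(tokenizer(query_names)), dim=-1)    # (Q, D)
    t = F.normalize(text_encoder(tokenizer(target_names)), dim=-1)   # (T, D)
    topk = (q @ t.t()).topk(k, dim=-1).indices                       # cosine-similarity top-k
    return [[target_names[j] for j in row] for row in topk.tolist()]
```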
2.2.2. Benefit of image-label to image-text
Table 6, Figure 5
CLIP : image-text contrastive learning
Multi-task : image-text contrastive learning + image-label supervised learning
(for image-label training, a linear classification layer is attached on top of the encoder)
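A rough sketch of how such a Multi-task objective could be combined (CLIP-style contrastive loss on image-text pairs plus cross entropy through the linear head); this is my reading of the baseline, not code from the paper, and the loss weight is a placeholder.

```python
import torch
import torch.nn.functional as F

def multitask_loss(it_v, it_u, il_feats, il_labels, linear_head, tau=0.07, w=1.0):
    """Multi-task baseline (sketch): CLIP contrastive on image-text + CE on image-label.

    it_v / it_u: (N, D) image / text features of the image-text portion.
    il_feats: (M, D) image features of the image-label portion; linear_head: nn.Linear(D, C).
    """
    v = F.normalize(it_v, dim=-1)
    u = F.normalize(it_u, dim=-1)
    logits = v @ u.t() / tau
    targets = torch.arange(logits.size(0), device=logits.device)
    clip_loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

    # The image-label data is supervised only through this separate linear head,
    # i.e., the two data types never interact inside one contrastive objective.
    ce_loss = F.cross_entropy(linear_head(il_feats), il_labels)
    return clip_loss + w * ce_loss
```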
Table 6
image-label data is arguably another good source of learning visual-semantic representations
Multi-task isolates image-label and image-text pairs
→ it cannot learn a feature space as discriminative and semantic-rich as UniCL's
Figure 5
data unification boosts performance almost on all metrics
Figure 6
model trained on image-text → dogs with fine-grained breeds are heavily mixed together
model trained on image-text + image-label → dogs with fine-grained breeds are clearly grouped, even though the image-label data contains none of those dog-breed concepts
3. Appendix
Results with larger vision backbone
combining the two types of data can significantly improve zero-shot recognition performance
→ our method is agnostic to different model sizes and thus a generic learning paradigm for visual-semantic representations
Transfer to object detection
train Mask R-CNN with pre-trained vision backbones
combining the two data types in similar amounts clearly improves object detection performance
→ adding image-text pairs to image-label data, and the other way around, universally helps learn better visual representations compared with the individual counterparts
adding image-text pairs : enriches and smooths the semantic space
adding image-label data : directly imposes the pressure to learn more discriminative representations