To learn visual recognition from images, prior work has largely followed two directions:
- supervised learning on human-annotated image-label data
Pro : learns discriminative representations for the categories seen during training
Con : building image-label data is expensive → poor scalability → hard to cover diverse visual concepts
- language-image contrastive learning on webly-crawled image-text pairs
Pro : noisy, free-form but diverse image-text pairs cover a wide range of visual concepts
Con : the authors find that such models lack the strong discriminative ability required by transfer learning
(we find in our experiments that they usually lack the strong discriminative ability required by transfer learning)
This naturally raises the question: can we get the advantages of both approaches at once?
(can we have one model for both discriminative representations and broad visual concept coverage?)
One would guess that jointly training on image-label and image-text data captures the best of both, and working that out well is essentially the whole of this paper.
The three main contributions claimed in the paper are:
- new perspective : image-text-label space
Existing image-label and image-text data can be unified under this view.
- new learning paradigm : Unified Contrastive Learning (UniCL)
It can be trained on any data in the image-text-label space,
i.e., it can be used on image-label data alone, on image-text data alone, or on both at the same time.
- UniCL can leverage both types of data effectively and achieve superior performance universally
A model trained with UniCL on image-label + image-text data learns discriminative and semantic-rich representations
(image-label → discriminative representations, image-text → semantic-rich representations)
Below is a brief summary of the parts I found important.
1. Method
Convert each label into a text description with a prompt template, then train the model contrastively.
Apart from accounting for multiple positive pairs, it is the same as CLIP.
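As a concrete illustration, here is a minimal sketch of the label-to-text conversion with prompt templates; the template strings and the train/eval behavior are assumptions based on the CLIP prompt strategy mentioned below, not the authors' exact code.

```python
import random

# Hypothetical subset of CLIP-style prompt templates (the paper reuses CLIP's prompt set).
TEMPLATES = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a photo of the large {}.",
]

def label_to_text(label_name, training=True):
    """Convert a class label into one (train) or all (eval) text descriptions."""
    if training:
        # During training, one template is randomly sampled per example.
        return random.choice(TEMPLATES).format(label_name)
    # At evaluation, all templates are used and their text embeddings are averaged later.
    return [t.format(label_name) for t in TEMPLATES]

print(label_to_text("golden retriever"))                  # e.g. "a photo of a golden retriever."
print(label_to_text("golden retriever", training=False))  # list of all template variants
```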
Unified Image-Text-Label Contrast
v : visual feature vector
u : text feature vector
Compute u and v with the encoders, L2-normalize them, and compute the similarities.
Compared with CLIP, the difference is that multiple positives are taken into account
(the inner sigma sums the log probabilities over all positives).
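A minimal PyTorch sketch of a label-aware bidirectional contrastive loss in this spirit; this is not the authors' implementation, and the temperature handling and the averaging over positives are my assumptions.

```python
import torch
import torch.nn.functional as F

def unicl_loss(v, u, labels, temperature=0.07):
    """Label-aware bidirectional contrastive loss (sketch, not the official code).

    v: (B, D) image features, u: (B, D) text features, labels: (B,) integer label ids.
    Giving every image-text pair its own unique label id recovers the CLIP objective
    for that data, while shared ids (image-label data) create multiple positives.
    """
    v = F.normalize(v, dim=-1)
    u = F.normalize(u, dim=-1)
    logits = v @ u.t() / temperature                      # (B, B): rows = images, cols = texts

    # positives[i, j] = 1 if image i and text j share the same label id
    positives = (labels[:, None] == labels[None, :]).float()

    # image-to-text: softmax over texts, then sum log-probabilities over the positives
    log_p_i2t = F.log_softmax(logits, dim=1)
    loss_i2t = -(positives * log_p_i2t).sum(1) / positives.sum(1)

    # text-to-image: softmax over images (column-wise), same positive-aware sum
    log_p_t2i = F.log_softmax(logits, dim=0)
    loss_t2i = -(positives * log_p_t2i).sum(0) / positives.sum(0)

    return (loss_i2t.mean() + loss_t2i.mean()) / 2
```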
2. Experiments
Datasets
ratio of images/concepts clearly illustrates the different trade-off between image diversity and semantic-richness over different datasets
Training
batch size 4096
Uses the same tokenizer and prompt strategy as CLIP
train → randomly sample one of the prompt templates
validation → average the text embeddings over all 80 templates (see the sketch below)
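For reference, a sketch of the standard CLIP-style template averaging used to build zero-shot classifier weights; `text_encoder` and `tokenizer` are hypothetical stand-ins for the model's text tower, not names from the paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_zeroshot_weights(class_names, templates, text_encoder, tokenizer):
    """Average the normalized text embeddings of all templates per class (sketch)."""
    weights = []
    for name in class_names:
        texts = [t.format(name) for t in templates]        # e.g. 80 prompts per class
        emb = text_encoder(tokenizer(texts))               # (num_templates, D)
        emb = F.normalize(emb, dim=-1).mean(0)              # average over templates
        weights.append(F.normalize(emb, dim=-1))             # re-normalize the mean
    return torch.stack(weights, dim=0)                        # (num_classes, D)
```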
There is a severe data imbalance between the image-label and image-text data
→ a balanced sampler is used during training
The Appendix mentions that this balanced sampling strategy was very important for performance.
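The notes do not spell out the exact sampling strategy; below is one simple way to balance the two sources with PyTorch's WeightedRandomSampler, as an assumption-laden sketch rather than the paper's method.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def make_balanced_loader(image_label_ds, image_text_ds, batch_size=4096):
    """Sample the two data sources at roughly equal rates despite their size gap (sketch)."""
    combined = ConcatDataset([image_label_ds, image_text_ds])
    # Weight each sample inversely to the size of its source dataset (~50/50 mixing).
    weights = torch.cat([
        torch.full((len(image_label_ds),), 1.0 / len(image_label_ds)),
        torch.full((len(image_text_ds),), 1.0 / len(image_text_ds)),
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
    return DataLoader(combined, batch_size=batch_size, sampler=sampler)
```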
2.1. Results of UniCL on image classification
Table 2
UniCL achieves comparable if not better performance across all datasets and model architectures
Looking only at UniCL's image-side loss term, it is not much different from CE (cross entropy).
So what role does the text-side loss term play?
It outperforms CE in cases prone to overfitting
(CIFAR - ResNet, ImageNet - Swin)
→ bidirectional alignment between images and category names, which imposes an additional regularization term
In other words, the text loss term acts as a kind of regularizer.
Ablations
Table 3
ablation of language encoders
A Transformer performs better than a simple linear embedding layer
→ we suspect this is due to its ability to capture the semantics behind the 1K category names
ablation of training objectives
Using only the image loss term hurts performance.
Table 4: effect of training batch size
UniCL is robust to the variation of batch size, regardless of which language encoder is employed
this is probably because...
- one of the two views is the embeddings of category names in our UniCL, which are consistently used
with high overlap across different mini-batches, which make the learning less vulnerable to the batch size
- the label information provides a consistent and strong guidance
2.2. Results on data unification of image-text-label
2.2.1. Benefit of image-text to image-label
Table 5
adding image-text pairs can generally improve the performance across all metrics
ImageNet + GCC 15M > ImageNet + GCC 3M : concept richness is important
ImageNet + GCC 15M > ImageNet + YFCC 14M : quality is important
In other words, the more diverse the concepts in the image-text dataset and the higher its quality, the better the performance.
Figure 4
given a query concept from ImageNet-1K, search for the closest target concept among the remaining 21K concepts of ImageNet-22K in the feature space
model trained on ImageNet-1K → hardly generalizes to the other 21K concepts
model trained on ImageNet-1K + GCC-15M → its understanding improves significantly, as the retrieved targets become more semantically similar to the queries from ImageNet-1K
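A minimal sketch of this kind of concept retrieval: encode the category names with the text tower and take cosine-similarity nearest neighbors; `text_encoder` and `tokenizer` are placeholders, not functions from the paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def nearest_concepts(query_names, target_names, text_encoder, tokenizer, k=5):
    """For each query concept name, retrieve the k closest target concepts (sketch)."""
    q = F.normalize(text_encoder(tokenizer(query_names)), dim=-1)    # (Q, D)
    t = F.normalize(text_encoder(tokenizer(target_names)), dim=-1)   # (T, D)
    topk = (q @ t.t()).topk(k, dim=-1).indices                       # cosine-similarity top-k
    return [[target_names[j] for j in row] for row in topk.tolist()]
```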
2.2.2. Benefit of image-label to image-text
Table 6, Figure 5
CLIP : image-text contrastive learning
Multi-task : image-text contrastive learning + image-label supervised learning
(for image-label training, a linear classification layer is attached on top of the encoder)
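A rough sketch of how such a Multi-task objective could be combined (CLIP-style contrastive loss on image-text pairs plus cross entropy through the linear head); this is my reading of the baseline, not code from the paper, and the loss weight is a placeholder.

```python
import torch
import torch.nn.functional as F

def multitask_loss(it_v, it_u, il_feats, il_labels, linear_head, tau=0.07, w=1.0):
    """Multi-task baseline (sketch): CLIP contrastive on image-text + CE on image-label.

    it_v / it_u: (N, D) image / text features of the image-text portion.
    il_feats: (M, D) image features of the image-label portion; linear_head: nn.Linear(D, C).
    """
    v = F.normalize(it_v, dim=-1)
    u = F.normalize(it_u, dim=-1)
    logits = v @ u.t() / tau
    targets = torch.arange(logits.size(0), device=logits.device)
    clip_loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

    # The image-label data is supervised only through this separate linear head,
    # i.e., the two data types never interact inside one contrastive objective.
    ce_loss = F.cross_entropy(linear_head(il_feats), il_labels)
    return clip_loss + w * ce_loss
```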
Table 6
image-label data is arguably another good source of learning visual-semantic representations
Multi-task isolates image-label and image-text pairs
→ it cannot learn a feature space as discriminative and semantic-rich as UniCL's
Figure 5
data unification boosts performance almost on all metrics
Figure 6
model trained on image-text → dogs with fine-grained breeds are heavily mixed together
model trained on image-text + image-label → dogs with fine-grained breeds are clearly grouped, even though the image-label data contains none of those dog-breed concepts
3. Appendix
Results with larger vision backbone
combining the two types of data can significantly improve zero-shot recognition performance
→ our method is agnostic to different model sizes and thus a generic learning paradigm for visual-semantic representations
Transfer to object detection
train Mask R-CNN with pre-trained vision backbones
combining the two data types in similar amounts clearly improves object detection performance
→ adding image-text pairs to image-label data, and the other way around, universally helps learn better visual representations compared with the individual counterparts
adding image-text pairs : enriches and smooths the semantic space
adding image-label data : directly imposes the pressure to learn more discriminative representations