
[2022 CVPR Oral] Grounded Language-Image Pre-training #224

@Jasonlee1995

CLIP makes zero-shot image classification possible.

However, CLIP does not learn object-level visual representations, so it lacks a fine-grained understanding of images.

The authors' goal: a zero-shot object detection model.

How can we learn object-level, language-aware, semantic-rich visual representations?

As with CLIP, the model must be trained on large-scale image-text data to learn diverse concepts.
→ pre-train with the phrase grounding task

About the phrase grounding task

phrase grounding task (= word-region matching task)
input: image, text
the task of finding the fine-grained correspondence between phrases in the text and objects in the image

Very similar to the object detection task, but different in that the model must detect the objects corresponding to the phrases in the given text
(the input is image + text rather than image alone, and classification is not over a fixed class set)

The paper's 4 main contributions:

  1. Reformulate the object detection task as a phrase grounding task
    convert all candidate categories into a text prompt
  2. Propose Grounded Language-Image Pre-training (GLIP), a way to pre-train on the phrase grounding task
    unlike CLIP, it uses deep cross-modality fusion
  3. Augment image-text data and use it as GLIP pre-training data
    detect noun phrases with an NLP parser, then generate pseudo labels with a teacher grounding model
  4. GLIP performs well on various object-level recognition tasks
    training on phrase grounding data → learns diverse visual concepts
    training on object detection data → learns more bounding boxes

Below is a brief summary of the parts I consider most important.

1. Grounded Language-Image Pre-training

1.1. Unified Formulation

Background: object detection

Equation 1 - object detector loss
localization loss: predict bounding boxes on the image
classification loss: predict the object class of each bounding box region
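For reference, my transcription of Equation 1 from the paper:

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \mathcal{L}_{\mathrm{loc}}
```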

Equation 2 - object detector classification loss
extract N object features with the image encoder,
predict what each of the N objects is with a linear classifier,
and compute the loss using the target matching information T
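My transcription of Equation 2 (with $O \in \mathbb{R}^{N \times d}$ the region features, $W \in \mathbb{R}^{c \times d}$ the classifier weights, and $T$ the target matching):

```latex
O = \mathrm{Enc}_I(\mathrm{Img}), \qquad
S_{\mathrm{cls}} = O W^{\top}, \qquad
\mathcal{L}_{\mathrm{cls}} = \mathrm{loss}(S_{\mathrm{cls}};\, T)
```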

Object detection as phrase grounding

We want to reformulate the object detection task as a phrase grounding task:
instead of predicting one of c classes for a given bounding box,
predict which of the c phrases in the text prompt the bounding box matches.

Concatenate the detection object classes into a text prompt
ex.
detection object classes : [person, bicycle, car, toothbrush]
text prompt : "person. bicycle. car. toothbrush"

It may be impossible to fit all category names into one prompt:
training: randomly downsample the categories into one prompt
(but always include the positive classes)
inference: split the category names into multiple prompts
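A minimal sketch of this prompt construction (function names, `max_classes`, and the exact splitting scheme are my own illustration, not from the paper):

```python
import random

def build_training_prompt(all_classes, positive_classes, max_classes=4, seed=0):
    """Randomly downsample categories into one prompt, always keeping positives."""
    rng = random.Random(seed)
    negatives = [c for c in all_classes if c not in positive_classes]
    n_neg = max(0, max_classes - len(positive_classes))
    sampled = list(positive_classes) + rng.sample(negatives, min(n_neg, len(negatives)))
    rng.shuffle(sampled)
    return ". ".join(sampled)  # "person. bicycle. car."-style prompt

def build_inference_prompts(all_classes, max_classes=4):
    """Split the full category list into multiple prompts for inference."""
    return [". ".join(all_classes[i:i + max_classes])
            for i in range(0, len(all_classes), max_classes)]

classes = ["person", "bicycle", "car", "toothbrush", "dog", "cat"]
prompt = build_training_prompt(classes, positive_classes=["car"])
```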

Equation 3
extract N object features with the image encoder,
extract M sub-word features with the text encoder,
and compute the classification logits via a matrix product between the object features and the sub-word features
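My transcription of Equation 3; compared with Equation 2, the classifier weights $W$ are replaced by sub-word features $P \in \mathbb{R}^{M \times d}$ from the text encoder:

```latex
O = \mathrm{Enc}_I(\mathrm{Img}), \qquad
P = \mathrm{Enc}_L(\mathrm{Prompt}), \qquad
S_{\mathrm{ground}} = O P^{\top}
```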

The classification targets are phrases, not sub-words.
Then how is the model trained?
Build the classification targets by treating every sub-word that makes up a phrase as positive.
ex.
phrase : "traffic light"
sub-words : "traffic", "light"
label "traffic" and "light" as 1 and everything else as 0
Train the model with a binary sigmoid loss.

Then how does inference work?
Average the sub-word probabilities of a phrase to get the phrase probability,
and use the phrase probabilities to predict what each object is.
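The alignment logits, sub-word targets, and phrase-probability averaging can be sketched with toy shapes (all sizes, the random features, and the "traffic light" column indices are made up):

```python
import numpy as np

N, M, d = 3, 5, 8                     # N regions, M sub-words, feature dim d
rng = np.random.default_rng(0)
O = rng.normal(size=(N, d))           # region features from the image encoder
P = rng.normal(size=(M, d))           # sub-word features from the text encoder

S_ground = O @ P.T                    # word-region alignment logits, shape (N, M)

# Target: every sub-word of the matched phrase is positive.
# Say region 0 matches "traffic light", whose sub-words sit at columns 1 and 2.
T = np.zeros((N, M))
T[0, [1, 2]] = 1.0

# Binary sigmoid loss over all word-region pairs
probs = 1.0 / (1.0 + np.exp(-S_ground))
loss = -(T * np.log(probs) + (1 - T) * np.log(1 - probs)).mean()

# Inference: a phrase's probability is the mean of its sub-word probabilities
phrase_prob = probs[:, [1, 2]].mean(axis=1)   # P("traffic light") per region
```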

1.2. Language-Aware Deep Fusion

Two advantages of deep fusion:

  1. improves the phrase grounding performance
  2. makes the learned visual features language-aware
    → the model's prediction is conditioned on the text prompt

Deep fusion details:

image encoder : DyHead
text encoder : BERT (base-uncased, max input length 256)
deep fusion layer : DyHead module, BERT layer, X-MHA

Equations 4, 5, 6
use the cross-modality multi-head attention module (X-MHA) for cross-modality communication,
then feed the fused features back into each single-modality layer
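My transcription of Equations 4–6 ($O^0$, $P^0$ are the initial image and text features, $L$ the number of fusion layers):

```latex
O^{i}_{t2i},\; P^{i}_{i2t} = \text{X-MHA}(O^{i}, P^{i}), \qquad i \in \{0, 1, \dots, L-1\}
O^{i+1} = \mathrm{DyHeadModule}(O^{i} + O^{i}_{t2i}), \qquad O = O^{L}
P^{i+1} = \mathrm{BERTLayer}(P^{i} + P^{i}_{i2t}), \qquad P = P^{L}
```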

Cross-modality multi-head attention can be thought of as very similar to cross attention:
compute one attention map from the image queries and text queries,
then apply it to the text values (and its transpose to the image values) to obtain the fused image and text features.
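A single-head NumPy sketch of this attention pattern (the real X-MHA is multi-head with learned projections; every weight here is a random stand-in):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def x_mha(O, P, d, seed=0):
    """Single-head sketch of X-MHA for N image and M text features of dim d."""
    rng = np.random.default_rng(seed)
    Wq_I, Wq_L = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    Wv_I, Wv_L = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    Oq, Pq = O @ Wq_I, P @ Wq_L            # image / text queries
    Ov, Pv = O @ Wv_I, P @ Wv_L            # image / text values
    attn = Oq @ Pq.T / np.sqrt(d)          # one shared attention map, (N, M)
    O_t2i = softmax(attn, axis=-1) @ Pv    # image attends to text values
    P_i2t = softmax(attn.T, axis=-1) @ Ov  # text attends to image values
    return O_t2i, P_i2t

O_t2i, P_i2t = x_mha(np.ones((3, 8)), np.ones((5, 8)), d=8)
```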

1.3. Pre-training with Scalable Semantic-Rich Data

Pseudo-label web-collected image-text data to increase the amount of grounding data and thereby improve performance.

self-training details
  1. pre-train a teacher GLIP on gold (human-annotated) detection and grounding data
  2. pseudo-label web-collected image-text data with the teacher GLIP
    extract noun phrases with an NLP parser and predict bounding boxes for them
  3. train a student GLIP on the gold data plus the generated pseudo grounding data

When pre-training on the pseudo data, use an augmentation that mixes a few negative captions with the positive caption:
mix in 19 negative captions with probability 0.3
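A hypothetical sketch of this caption mixing (names and structure are my own; the paper's exact formatting of the pseudo data may differ):

```python
import random

def augment_pseudo_sample(positive_caption, negative_pool,
                          p=0.3, n_negatives=19, seed=0):
    """With probability p, add n_negatives captions drawn from other samples."""
    rng = random.Random(seed)
    captions = [positive_caption]
    if rng.random() < p:
        captions += rng.sample(negative_pool, min(n_negatives, len(negative_pool)))
    return captions
```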

Why does self-training work well?

The student model outperforms the teacher model.
Why?
There are concepts the teacher never saw during training and therefore does not know.
But the teacher can make an "educated guess" using the rich language context.
In Figure 3, for example, because it can localize a small vial, it can localize the "small vial of vaccine".
The student learns the teacher's "educated guesses" and thus picks up more diverse concepts, improving performance.

2. Transfer to Established Benchmarks

2.1. Zero-Shot and Supervised Transfer on COCO

GLIP models achieve strong zero-shot and supervised performance

Three factors affect GLIP's zero-shot performance:

  1. the close domain overlap between Objects365 and COCO
  2. deep fusion
  3. grounding data
2.2. Zero-Shot Transfer on LVIS

GLIP exhibits strong zero-shot performance on all the categories

2.3. Phrase Grounding on Flickr30K Entities

the addition of detection data helps grounding
→ shows the synergy between the two tasks and the effectiveness of the unified loss

2.4. Analysis

adding grounding data brings consistent improvement with different detection data
→ grounding data are more semantic-rich and a promising alternative to scaling up detection data

3. Object Detection in the Wild

3.1. Data Efficiency

freeze the bottom 2 layers of the backbone and fine-tune

Figure 4
GLIP exhibits transformative data efficiency
unified grounding reformulation, deep fusion, grounding data, and model scale-up all contribute to the improved data efficiency

Figure 5
introduction of grounding data brings significant improvement on certain tasks that test novel concepts

3.2. One Model for All Tasks

Figure 6 - manual prompt tuning
for any novel categories, the user can use expressive descriptions in the text prompt, adding attributes or language context, to inject domain knowledge and help GLIP transfer
simple prompt change improves performance

Figure 7 - prompt tuning
get prompt embeddings from the language backbone and only fine-tune prompt embeddings as the task-specific output
prompt tuning almost matches the full-tuning results, without changing any of the grounding model parameters
as the model and data size grow larger, the gap between full-model tuning and prompt tuning becomes smaller
