
[2022 CVPR Oral] Grounded Language-Image Pre-training #224

@Jasonlee1995

CLIP makes zero-shot image classification possible.

However, CLIP does not learn object-level visual representations, so it lacks a fine-grained understanding of images.

The authors' goal: a zero-shot object detection model.

How can we learn object-level, language-aware, semantic-rich visual representations?

As with CLIP, the model must be trained on large-scale image-text data to learn diverse concepts.
→ pre-train with the phrase grounding task

About the phrase grounding task

phrase grounding task (= word-region matching task)
input: image, text
the task of finding the fine-grained correspondence between phrases in the text and objects in the image

Very similar to the object detection task, but different in that the model must detect the objects corresponding to the phrases in the given text
(the input is image + text rather than image alone, and classification is not over a fixed class set)

The paper's 4 main contributions:

  1. Reformulate the object detection task as a phrase grounding task
    convert all candidate categories into a text prompt
  2. Propose Grounded Language-Image Pre-training (GLIP), a way to pre-train on the phrase grounding task
    unlike CLIP, it uses deep cross-modality fusion
  3. Augment image-text data and use it as GLIP pre-training data
    detect noun phrases with an NLP parser, then generate pseudo labels with a teacher grounding model
  4. GLIP performs well on various object-level recognition tasks
    training on phrase grounding data → learns diverse visual concepts
    training on object detection data → learns more bounding boxes

Below is a brief summary of the parts I consider most important.

1. Grounded Language-Image Pre-training

1.1. Unified Formulation

Background: object detection

Equation 1 - object detector loss
localization loss: predict bounding boxes on the image
classification loss: predict the object class of each bounding box region
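For reference, my transcription of Equation 1 from the paper:

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \mathcal{L}_{\mathrm{loc}}
```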

Equation 2 - object detector classification loss
extract N object features with the image encoder,
predict what each of the N objects is with a linear classifier,
and compute the loss using the target matching information T
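My transcription of Equation 2 (with $O \in \mathbb{R}^{N \times d}$ the region features, $W \in \mathbb{R}^{c \times d}$ the classifier weights, and $T$ the target matching):

```latex
O = \mathrm{Enc}_I(\mathrm{Img}), \qquad
S_{\mathrm{cls}} = O W^{\top}, \qquad
\mathcal{L}_{\mathrm{cls}} = \mathrm{loss}(S_{\mathrm{cls}};\, T)
```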

Object detection as phrase grounding

We want to reformulate the object detection task as a phrase grounding task:
instead of predicting one of c classes for a given bounding box,
predict which of the c phrases in the text prompt the bounding box matches.

Concatenate the detection object classes into a text prompt
ex.
detection object classes : [person, bicycle, car, toothbrush]
text prompt : "person. bicycle. car. toothbrush"

It may be impossible to fit all category names into one prompt:
training: randomly downsample the categories into one prompt
(but always include the positive classes)
inference: split the category names into multiple prompts
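A minimal sketch of this prompt construction (function names, `max_classes`, and the exact splitting scheme are my own illustration, not from the paper):

```python
import random

def build_training_prompt(all_classes, positive_classes, max_classes=4, seed=0):
    """Randomly downsample categories into one prompt, always keeping positives."""
    rng = random.Random(seed)
    negatives = [c for c in all_classes if c not in positive_classes]
    n_neg = max(0, max_classes - len(positive_classes))
    sampled = list(positive_classes) + rng.sample(negatives, min(n_neg, len(negatives)))
    rng.shuffle(sampled)
    return ". ".join(sampled)  # "person. bicycle. car."-style prompt

def build_inference_prompts(all_classes, max_classes=4):
    """Split the full category list into multiple prompts for inference."""
    return [". ".join(all_classes[i:i + max_classes])
            for i in range(0, len(all_classes), max_classes)]

classes = ["person", "bicycle", "car", "toothbrush", "dog", "cat"]
prompt = build_training_prompt(classes, positive_classes=["car"])
```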

Equation 3
extract N object features with the image encoder,
extract M sub-word features with the text encoder,
and compute the classification logits via a matrix product between the object features and the sub-word features
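My transcription of Equation 3; compared with Equation 2, the classifier weights $W$ are replaced by sub-word features $P \in \mathbb{R}^{M \times d}$ from the text encoder:

```latex
O = \mathrm{Enc}_I(\mathrm{Img}), \qquad
P = \mathrm{Enc}_L(\mathrm{Prompt}), \qquad
S_{\mathrm{ground}} = O P^{\top}
```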

The classification targets are phrases, not sub-words.
Then how is the model trained?
Build the classification targets by treating every sub-word that makes up a phrase as positive.
ex.
phrase : "traffic light"
sub-words : "traffic", "light"
label "traffic" and "light" as 1 and everything else as 0
Train the model with a binary sigmoid loss.

Then how does inference work?
Average the sub-word probabilities of a phrase to get the phrase probability,
and use the phrase probabilities to predict what each object is.
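The alignment logits, sub-word targets, and phrase-probability averaging can be sketched with toy shapes (all sizes, the random features, and the "traffic light" column indices are made up):

```python
import numpy as np

N, M, d = 3, 5, 8                     # N regions, M sub-words, feature dim d
rng = np.random.default_rng(0)
O = rng.normal(size=(N, d))           # region features from the image encoder
P = rng.normal(size=(M, d))           # sub-word features from the text encoder

S_ground = O @ P.T                    # word-region alignment logits, shape (N, M)

# Target: every sub-word of the matched phrase is positive.
# Say region 0 matches "traffic light", whose sub-words sit at columns 1 and 2.
T = np.zeros((N, M))
T[0, [1, 2]] = 1.0

# Binary sigmoid loss over all word-region pairs
probs = 1.0 / (1.0 + np.exp(-S_ground))
loss = -(T * np.log(probs) + (1 - T) * np.log(1 - probs)).mean()

# Inference: a phrase's probability is the mean of its sub-word probabilities
phrase_prob = probs[:, [1, 2]].mean(axis=1)   # P("traffic light") per region
```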

1.2. Language-Aware Deep Fusion

Two advantages of deep fusion:

  1. improves the phrase grounding performance
  2. makes the learned visual features language-aware
    → the model's prediction is conditioned on the text prompt

Deep fusion details:

image encoder : DyHead
text encoder : BERT (base-uncased, max input length 256)
deep fusion layer : DyHead module, BERT layer, X-MHA

Equations 4, 5, 6
use the cross-modality multi-head attention module (X-MHA) for cross-modality communication,
then feed the fused features back into each single-modality layer
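My transcription of Equations 4–6 ($O^0$, $P^0$ are the initial image and text features, $L$ the number of fusion layers):

```latex
O^{i}_{t2i},\; P^{i}_{i2t} = \text{X-MHA}(O^{i}, P^{i}), \qquad i \in \{0, 1, \dots, L-1\}
O^{i+1} = \mathrm{DyHeadModule}(O^{i} + O^{i}_{t2i}), \qquad O = O^{L}
P^{i+1} = \mathrm{BERTLayer}(P^{i} + P^{i}_{i2t}), \qquad P = P^{L}
```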

Cross-modality multi-head attention can be thought of as very similar to cross attention:
compute one attention map from the image queries and text queries,
then apply it to the text values (and its transpose to the image values) to obtain the fused image and text features.
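A single-head NumPy sketch of this attention pattern (the real X-MHA is multi-head with learned projections; every weight here is a random stand-in):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def x_mha(O, P, d, seed=0):
    """Single-head sketch of X-MHA for N image and M text features of dim d."""
    rng = np.random.default_rng(seed)
    Wq_I, Wq_L = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    Wv_I, Wv_L = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    Oq, Pq = O @ Wq_I, P @ Wq_L            # image / text queries
    Ov, Pv = O @ Wv_I, P @ Wv_L            # image / text values
    attn = Oq @ Pq.T / np.sqrt(d)          # one shared attention map, (N, M)
    O_t2i = softmax(attn, axis=-1) @ Pv    # image attends to text values
    P_i2t = softmax(attn.T, axis=-1) @ Ov  # text attends to image values
    return O_t2i, P_i2t

O_t2i, P_i2t = x_mha(np.ones((3, 8)), np.ones((5, 8)), d=8)
```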

1.3. Pre-training with Scalable Semantic-Rich Data

Pseudo-label web-collected image-text data to increase the amount of grounding data and thereby improve performance.

self-training details
  1. pre-train a teacher GLIP on gold (human-annotated) detection and grounding data
  2. pseudo-label web-collected image-text data with the teacher GLIP
    extract noun phrases with an NLP parser and predict bounding boxes for them
  3. train a student GLIP on the gold data plus the generated pseudo grounding data

When pre-training on the pseudo data, use an augmentation that mixes a few negative captions with the positive caption:
mix in 19 negative captions with probability 0.3
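A hypothetical sketch of this caption mixing (names and structure are my own; the paper's exact formatting of the pseudo data may differ):

```python
import random

def augment_pseudo_sample(positive_caption, negative_pool,
                          p=0.3, n_negatives=19, seed=0):
    """With probability p, add n_negatives captions drawn from other samples."""
    rng = random.Random(seed)
    captions = [positive_caption]
    if rng.random() < p:
        captions += rng.sample(negative_pool, min(n_negatives, len(negative_pool)))
    return captions
```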

Why does self-training work well?

The student model outperforms the teacher model.
Why?
There are concepts the teacher never saw during training and therefore does not know.
But the teacher can make an "educated guess" using the rich language context.
In Figure 3, for example, because it can localize a small vial, it can localize the "small vial of vaccine".
The student learns the teacher's "educated guesses" and thus picks up more diverse concepts, improving performance.

2. Transfer to Established Benchmarks

2.1. Zero-Shot and Supervised Transfer on COCO

GLIP models achieve strong zero-shot and supervised performance

Three factors affect GLIP's zero-shot performance:

  1. the close domain overlap between Objects365 and COCO
  2. deep fusion
  3. grounding data
2.2. Zero-Shot Transfer on LVIS

GLIP exhibits strong zero-shot performance on all the categories

2.3. Phrase Grounding on Flickr30K Entities

the addition of detection data helps grounding
→ shows the synergy between the two tasks and the effectiveness of the unified loss

2.4. Analysis

adding grounding data brings consistent improvement with different detection data
→ grounding data are more semantic-rich and a promising alternative to scaling up detection data

3. Object Detection in the Wild

3.1. Data Efficiency

freeze the bottom 2 layers of the backbone and fine-tune

Figure 4
GLIP exhibits transformative data efficiency
unified grounding reformulation, deep fusion, grounding data, and model scale-up all contribute to the improved data efficiency

Figure 5
introduction of grounding data brings significant improvement on certain tasks that test novel concepts

3.2. One Model for All Tasks

Figure 6 - manual prompt tuning
for any novel categories, the user can use expressive descriptions in the text prompt, adding attributes or language context, to inject domain knowledge and help GLIP transfer
simple prompt change improves performance

Figure 7 - prompt tuning
get prompt embeddings from the language backbone and only fine-tune prompt embeddings as the task-specific output
prompt tuning almost matches the full-tuning results, without changing any of the grounding model parameters
as the model and data size grow larger, the gap between full-model tuning and prompt tuning becomes smaller
