[2024 ICLR Spotlight] Ferret: Refer and Ground Anything Anywhere at Any Granularity

<img width="90%" alt="image" src="https://github.com/user-attachments/assets/1deb7e1d-9e5d-4abe-9720-2c4254f18fb8">

<details><summary>motivation</summary>

How can we measure whether a model has good spatial understanding?

The two most representative tasks are referring and grounding.

- referring
  image + region → text
  Does the model understand a specific region of the image?
- grounding
  image + text → region
  Given a text description, localize the image region that matches it.

Most prior work trained a separate model per task
(one model for referring, another for grounding).

However, a model that truly understands the alignment between spatial information and semantics should do well at both referring and grounding.

A human who learns one of the two tasks would be able to do the other as well.

Based on this, the authors set three goals:

1. Unify referring and grounding in one framework
2. Accept diverse forms of region input
3. Build a model that is open-vocabulary, instruction-following, and robust


Goal 3 → Multimodal Large Language Model (MLLM) + instruction-tuning dataset

Goal 1 → represent region coordinates in natural-language numerical form
(plain numbers written as text, not special tokens)

Goal 2 → propose a spatial-aware visual sampler

</details>


The paper makes three main contributions:

1. Proposes Ferret, a refer-and-ground MLLM
  According to the authors, Ferret is the first MLLM that can handle free-form region input
2. Releases the Ground-and-Refer Instruction-Tuning (GRIT) dataset
  existing vision(-language) datasets
  refer-and-ground instruction-tuning conversation data (ChatGPT/GPT-4)
  spatial negative data mining
3. Proposes Ferret-Bench
  3 new types of tasks : Referring Description, Referring Reasoning, Grounding in Conversation

Ferret outperforms prior work and hallucinates less
(it hallucinates less because it was trained on spatial negative samples)

Below is a brief summary of only the parts I consider important.


## 1. Method
<img width="90%" alt="image" src="https://github.com/user-attachments/assets/733c66b2-736f-415e-9a3a-dbb0373afed7">


<details><summary>1.1. Hybrid Region Representation</summary>

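Boxes and points can reasonably be represented as coordinates.

Free-form shapes, in contrast, are computationally expensive to represent as coordinates.

Even if they are written as coordinates, the resulting complexity makes training difficult.

In other words, representing the user's region input purely as text has its limits.

It is therefore natural to handle it on the image side.

→ Sample the part of the image feature map that corresponds to the region and compute a fixed-size feature vector
(for a point, the circle of radius 5 centered on the point is treated as the region)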
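
As a rough illustration of how a point can be treated as a region, the sketch below rasterizes a disk of radius 5 around a clicked point into a binary mask. The function name `point_to_mask` and the nested-list mask are assumptions made for this example, not the paper's code.

```python
def point_to_mask(cx, cy, width, height, radius=5):
    """Return a height x width binary mask (nested lists) of the disk around (cx, cy)."""
    mask = [[0] * width for _ in range(height)]
    # Only visit the bounding square of the disk, clipped to the image.
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                mask[y][x] = 1
    return mask


mask = point_to_mask(cx=10, cy=10, width=32, height=32)
print(sum(map(sum, mask)))  # number of pixels inside the disk
```

With this, points, boxes, and free-form shapes can all be handed to the same mask-based sampling step.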

To be invariant to image resolution, coordinates are quantized into 1000 discrete bins
(e.g. (100, 50) in a 500x200 image becomes (200, 250) on the 1000x1000 grid)
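
The quantization can be written in a few lines. This is a minimal sketch that assumes integer pixel coordinates, floor rounding, and clamping to the last bin; those details and the function names are my assumptions, since the notes only state that 1000 discrete bins are used.

```python
NUM_BINS = 1000


def quantize_point(x, y, width, height, num_bins=NUM_BINS):
    """Map integer pixel coordinates to integer bins in [0, num_bins - 1]."""
    qx = min(x * num_bins // width, num_bins - 1)
    qy = min(y * num_bins // height, num_bins - 1)
    return qx, qy


def dequantize_point(qx, qy, width, height, num_bins=NUM_BINS):
    """Map bins back to (approximate) pixel coordinates."""
    return qx * width / num_bins, qy * height / num_bins


print(quantize_point(100, 50, width=500, height=200))  # (200, 250)
```

The printed value reproduces the 500x200 example: 100 / 500 and 50 / 200 scale to 200 and 250 on the 1000-bin grid.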

</details>


<details><summary>1.2. Model Architecture</summary>

> **Input** (`Figure 3 (Right)`)
> Image embeddings are extracted with a pre-trained visual encoder (CLIP-ViT-L/14)
> The text sequence is tokenized with the pre-trained LLM tokenizer and then projected into text embeddings
> For a region, a special token (`<SPE>`) is appended to the coordinates as a placeholder


<img width="90%" alt="image" src="https://github.com/user-attachments/assets/a6664e85-6c4c-4ea0-b63d-e212a19ff1d1">

> **Spatial-aware visual sampler** (`Figure 3 (Left)`)
> One block performs the following steps
> 1. Randomly sample $N$ positive points from the binary region mask $M$
> 2. Obtain each point's feature by bilinear interpolation
> 3. Sample $N/r$ points with farthest point sampling (FPS)
> (FPS reduces the number of points while still guaranteeing sufficient coverage)
> 4. For each sampled point $x_i$, find its $k$ nearest neighbors among the $N$ points
> (that is, for every one of the $N/r$ points, the $k$ nearest neighbors are searched among the original $N$ points)
> 5. Fuse the features of the sampled point $x_i$ and each neighbor point to obtain the neighbor point feature (`Equation 1`)
> 6. Max-pool the $k$ neighbor features to fuse them into one feature
> 
> Two blocks are cascaded
> ($N = 512, r = 4, k = 24$) block + ($N = 128, r = 4, k = 24$) block
> → outputs 32 point features
>
> The 32 point features are flattened into a single vector and then projected to the LLM embedding dimension
> The `<SPE>` token is replaced with the projected feature
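
The steps above can be sketched in pure Python on toy data. This is only an illustration of the data flow (512 → 128 → 32 points), not the authors' implementation: the feature map is random, the mask-to-feature-map scaling is an assumption, FPS starts from index 0 by my choice, the fusion in step 5 is a simple placeholder that stands in for `Equation 1`, and the learned projection to the LLM dimension is omitted.

```python
import math
import random


def bilinear(feat, x, y):
    """Interpolate an H x W x C feature map (nested lists) at a continuous (x, y)."""
    h, w = len(feat), len(feat[0])
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    corners = zip(feat[y0][x0], feat[y0][x1], feat[y1][x0], feat[y1][x1])
    return [(1 - ax) * (1 - ay) * a + ax * (1 - ay) * b
            + (1 - ax) * ay * c + ax * ay * d for a, b, c, d in corners]


def farthest_point_sampling(points, m):
    """Greedily pick m indices; each new point is the farthest from those already picked."""
    chosen = [0]
    dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return chosen


def knn(points, query, k):
    """Indices of the k points closest to query."""
    return sorted(range(len(points)), key=lambda i: math.dist(points[i], query))[:k]


def sampler_block(points, feats, r, k):
    """One block: FPS to N/r points, k neighbors among the N points, fuse, max-pool."""
    out_points, out_feats = [], []
    for i in farthest_point_sampling(points, len(points) // r):
        fused = []
        for j in knn(points, points[i], k):
            # Placeholder fusion (assumption): element-wise sum of center and neighbor.
            fused.append([a + b for a, b in zip(feats[i], feats[j])])
        # Channel-wise max over the k fused neighbor features.
        out_feats.append([max(channel) for channel in zip(*fused)])
        out_points.append(points[i])
    return out_points, out_feats


random.seed(0)
MASK_SIZE, FEAT_SIZE, CHANNELS = 48, 24, 4
feature_map = [[[random.random() for _ in range(CHANNELS)]
                for _ in range(FEAT_SIZE)] for _ in range(FEAT_SIZE)]
# Toy binary region mask M: a filled rectangle at mask resolution.
mask = [[int(8 <= x < 40 and 12 <= y < 36) for x in range(MASK_SIZE)]
        for y in range(MASK_SIZE)]
positives = [(x, y) for y in range(MASK_SIZE) for x in range(MASK_SIZE) if mask[y][x]]

# Step 1: randomly sample N = 512 positive points, mapped to feature-map coordinates.
scale = (FEAT_SIZE - 1) / (MASK_SIZE - 1)
points = [(x * scale, y * scale) for x, y in random.choices(positives, k=512)]
# Step 2: bilinear interpolation of the feature map at each point.
feats = [bilinear(feature_map, x, y) for x, y in points]

# Steps 3-6, twice: two cascaded blocks reduce 512 -> 128 -> 32 points.
points, feats = sampler_block(points, feats, r=4, k=24)
points, feats = sampler_block(points, feats, r=4, k=24)
region_vector = [v for f in feats for v in f]  # flatten the 32 point features
print(len(points), len(region_vector))  # 32 128
```

In the real model the flattened vector would then pass through a learned projection before replacing the `<SPE>` token.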

> **Output**
> For grounding, the box coordinates are emitted right after the corresponding noun
> (e.g. There is a dog [100, 150, 300, 200] in the figure)
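
Because boxes appear as plain numbers in the generated text, they can be recovered with a regular expression. The sketch below is a hypothetical post-processing helper: the pattern assumes four comma-separated integers in square brackets, and pairing each box with the single word right before it is a simplification for illustration.

```python
import re

BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def extract_boxes(text):
    """Return (preceding_word, box) for every bracketed 4-integer box in text."""
    results = []
    for match in BOX_PATTERN.finditer(text):
        words = text[:match.start()].split()
        noun = words[-1] if words else ""
        results.append((noun, tuple(int(v) for v in match.groups())))
    return results


print(extract_boxes("There is a dog [100, 150, 300, 200] in the figure"))
# [('dog', (100, 150, 300, 200))]
```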

> **LLM**
> Vicuna is used

</details>


## 2. GRIT: Ground-and-Refer Instruction-Tuning dataset
<img width="90%" alt="image" src="https://github.com/user-attachments/assets/7e7548c2-d44b-4f57-a8af-8918febff346">

> GRIT consists of three types of data (`Figure 4`)
> 1. public datasets that are converted into an instruction-following format
> 2. instruction-tuning data generated via ChatGPT and GPT-4
> 3. additional data from spatial negative mining for enhancing model robustness


<details><summary>2.1. Hierarchy</summary>
<img width="90%" alt="image" src="https://github.com/user-attachments/assets/56fd8cf6-15b3-442d-9ae8-bf98371c1388">

> The authors organize the data along two axes
> 
> 1. granularity
> individual objects → object detection, visual grounding datasets
> relationships among objects → select from Visual Genome (data with object relationships)
> descriptions of specific regions → select from Visual Genome (data with region captions)
> region-based complex reasoning → generated with ChatGPT/GPT-4
> 
> 2. task format
> region-in text-out format → object detection dataset, Visual Genome
> (to understand free-form shapes, apply SAM to obtain a segmentation mask for each object)
> text-in region-out format → visual grounding dataset
> text-region combined format

</details>


<details><summary>2.2. Data generation using GPT</summary>

> Dialogue instruction-tuning data has been shown to be important for MLLMs
> (dialogue instruction tuning data is critical for MLLM to understand human intention and generate fluent, natural, long-form responses)
> Existing instruction-tuning data does not specify spatial information and is built around describing the entire image
> (entire image - global caption)

<img width="90%" alt="image" src="https://github.com/user-attachments/assets/24c460e9-4fa3-4e7e-b4f1-814521b0b0cf">

> So that the model learns region-based spatial knowledge better, the authors generate 34k dialogues with the following three considerations
> `Table 2` makes this easy to follow
> 
> 1. Use coordinates to state the relationships between objects and region captions explicitly
> 2. Append coordinates after groundable regions or objects
> 3. Generated dialogues may not follow the rules and patterns given in the system prompts and few-shot examples
> → use ChatGPT/GPT-4 to refine the initially generated dialogues
> To reduce cost, the data is reportedly generated with ChatGPT first and then refined with GPT-4

> To reuse existing instruction-tuning data, additional data is produced with an open-vocabulary object detector
> Specifically, GLIPv2 is run on LLaVA-158k to generate pseudo-grounded LLaVA instruction data
> (the matching bounding box is appended after each groundable noun)
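
The "append the box after the noun" step might look like the sketch below. The `detections` dictionary is a hypothetical stand-in for detector output (it is not the GLIPv2 API), and plain substring matching on the first occurrence is a simplification.

```python
def add_pseudo_grounding(text, detections):
    """Append ' [x1, y1, x2, y2]' after the first occurrence of each detected phrase."""
    for phrase, box in detections.items():
        idx = text.find(phrase)
        if idx == -1:
            continue  # the detector found something the sentence never mentions
        end = idx + len(phrase)
        text = text[:end] + " [" + ", ".join(map(str, box)) + "]" + text[end:]
    return text


detections = {"dog": (100, 150, 300, 200), "bench": (320, 140, 600, 260)}
print(add_pseudo_grounding("A dog sits next to a bench.", detections))
# A dog [100, 150, 300, 200] sits next to a bench [320, 140, 600, 260].
```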

</details>


<details><summary>2.3. Negative</summary>

> MLLMs tend to hallucinate on yes/no questions
> (when asked to localize an object, an MLLM answers as if it exists even though it is not in the image)
> How can the model learn to say that an object is absent when it is not in the image?
> → build negative data and train on it
> 
> The authors mine negative samples in two ways
> 1. Image-conditioned Category Localization
> Uses Object365 data
> Negative data is created by randomly sampling from the object classes that do not appear in the image
> 2. Semantics-conditioned Category Localization
> Uses Flickr30k data
> Negative data is created by using ChatGPT/GPT-4 to find entities similar to the original class
> (e.g. man - woman, blue - yellow, two - three, ...)
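
The image-conditioned variant is easy to sketch: take the classes annotated in the image, subtract them from the full vocabulary, and sample from what is left. The function names and the question/answer wording below are hypothetical; the paper's actual templates are not reproduced here.

```python
import random


def sample_negative_categories(present, vocabulary, num_negatives, rng=random):
    """Pick categories from the vocabulary that are absent from the image."""
    absent = sorted(set(vocabulary) - set(present))
    return rng.sample(absent, min(num_negatives, len(absent)))


def make_negative_example(category):
    """Build a hypothetical question/answer pair for an absent category."""
    return {
        "question": f"Is there a {category} in the image?",
        "answer": f"No, there is no {category} in the image.",
    }


vocabulary = ["dog", "cat", "bench", "car", "umbrella", "bicycle"]
present = ["dog", "bench"]
negatives = sample_negative_categories(present, vocabulary, 2, rng=random.Random(0))
for category in negatives:
    print(make_negative_example(category))
```

The semantics-conditioned variant would replace the random sampling with an LLM call that proposes a similar but wrong entity, which is not sketched here.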

</details>


## 3. Experiments

<details><summary>3.1. Training details</summary>

> image encoder : CLIP-ViT-L/14@336p
> projection layer : LLaVA's first-stage weights
> visual sampler : random init
> LLM : Vicuna

> Trained on GRIT for 3 epochs
> When the input is a region, its format is chosen at random among center point, bounding box, and segmentation mask during training

</details>


<details><summary>3.2. Input Referring</summary>
<img width="100%" alt="image" src="https://github.com/user-attachments/assets/b45ca82e-3bce-451c-b98a-c1081b26d509">

> **Referring Object Classification**
> Object classification for a specific region in the image
> Because an MLLM generates free-form text, matching the predicted class against the ground-truth class is not straightforward
> The task is therefore reformatted as a binary-choice question, and responses are matched with rules
> (e.g. `Is the object <location> a <class A> or a <class B>?`)
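
A minimal sketch of this evaluation loop is shown below. The prompt follows the template above, but shuffling the two class names and the specific matching rule (the response must mention the true class and not the distractor) are my assumptions about what "rule-based matching" could look like.

```python
import random


def build_question(location, true_class, distractor, rng=random):
    """Format the binary-choice prompt with the two classes in random order."""
    a, b = rng.sample([true_class, distractor], 2)
    return f"Is the object {location} a {a} or a {b}?"


def is_correct(response, true_class, distractor):
    """Rule-based match: mention the true class and not the distractor."""
    text = response.lower()
    return true_class.lower() in text and distractor.lower() not in text


question = build_question("[200, 250, 400, 600]", "dog", "cat", rng=random.Random(0))
print(question)
print(is_correct("It is a dog.", "dog", "cat"))
```

Substring matching is crude (for example, "cat" also matches "category"), so a real evaluator would likely need word-boundary checks.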

<img width="40%" alt="image" src="https://github.com/user-attachments/assets/fbe5cc1b-adb3-45ef-a9f8-123843a358bb">

> `Table 3`
> Ferret significantly outperforms previous models and handles all types of referring

</details>


<details><summary>3.3. Output Grounding</summary>

<img width="90%" alt="image" src="https://github.com/user-attachments/assets/0a5a3109-3679-4425-94a4-50a2a2a2c1f1">

> **Visual Grounding**
> There are two types of tasks
> 1. Referring Expression Comprehension (REC) : given a query (question or description) about a specific area in the image, find the single bounding box that matches it
> 2. Phrase Grounding : find the bounding boxes for all noun phrases in the input sentence and the word-box connections
> 
> The same prompt is used regardless of the task
> `What are the locations of <query>/<phrases>?`
> Noun phrases are separated by commas, and the output uses the `<query>[box]` format
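
As a small illustration of the shared prompt, the helper below fills the template with either one query or several comma-separated noun phrases. The function name is hypothetical.

```python
def grounding_prompt(phrases):
    """Fill the shared template with one query or several comma-separated phrases."""
    return f"What are the locations of {', '.join(phrases)}?"


print(grounding_prompt(["the man holding an umbrella"]))  # REC: a single query
print(grounding_prompt(["a man", "a red umbrella"]))      # phrase grounding
```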

> **Grounded Captioning**
> Generate a caption for the image, then perform phrase grounding on the generated noun phrases
> This yields three outputs
> → text caption, visual regions as boxes, grounding alignments between words and boxes


<img width="90%" alt="image" src="https://github.com/user-attachments/assets/6e505ebe-8119-45b2-844a-b53fe44305d6">

> `Table 5`
> Ferret achieves outstanding performance on visual grounding
> Ferret achieves state-of-the-art results on grounded captioning

</details>


<details><summary>3.4. Ferret-Bench: Multimodal chatting with referring and grounding</summary>

> No existing dataset evaluates multimodal chatting that involves referring and grounding actions
> To address this, the authors release Ferret-Bench, which evaluates three types of region-based questions
> 1. Referring Description : describe a referred region based on its interaction with surrounding objects
> 2. Referring Reasoning : reason on top of one or more referred regions correctly
> 3. Grounding in Conversation : reason correctly and accurately ground/localize the objects/regions necessary for the reasoning

> How is it evaluated?
> 
> predicted answer : the prediction produced by the MLLM
> pseudo answer : GPT-4 output based on the ground-truth textual description
> 
> GPT-4 rates the predicted answer and the pseudo answer on three aspects
> → referring understanding, object grounding, correctness of semantics
> 
> The MLLM's performance is the ratio of the predicted answer's score to the score of the GPT-4 (pseudo) answer
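
The final metric is simple arithmetic. The sketch below assumes the ratio is taken over scores summed across questions and reported as a percentage; the aggregation choice and the scores themselves are made up for illustration.

```python
def relative_score(predicted_scores, reference_scores):
    """Percentage ratio of total predicted-answer score to total reference-answer score."""
    return 100.0 * sum(predicted_scores) / sum(reference_scores)


# Made-up GPT-4 ratings for three questions.
predicted = [6, 7, 8]    # ratings of the MLLM's answers
reference = [8, 9, 10]   # ratings of the GPT-4 pseudo answers
print(round(relative_score(predicted, reference), 1))  # 77.8
```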

<img width="90%" alt="Table 7" src="https://github.com/user-attachments/assets/fe2bc505-a931-41b9-a52f-00cc606652c0">

> `Table 7`
> Ferret achieves superior performance in all types of tasks

<img width="90%" alt="Table 6" src="https://github.com/user-attachments/assets/5268736a-249f-4b8a-9a6c-1eb9f31f58a4">

> `Table 6`
> Ferret demonstrates strong spatial understanding and commonsense reasoning capability

</details>


<details><summary>3.5. Ablation</summary>
<img width="40%" alt="Table 8" src="https://github.com/user-attachments/assets/bc82efe7-9d0b-4caa-af80-54e889065cbf">

> **Mutual benefits of grounding and referring** (`Table 8`)
> grounding and referring can actually benefit each other


<img width="45%" alt="Table 9" src="https://github.com/user-attachments/assets/4fbcd51c-071e-42e1-b715-0ae427af025f">

> **Spatial-aware Visual Sampler** (`Table 9`)
> To measure the effectiveness of the spatial-aware visual sampler, it is replaced with the visual sampler from SEEM
> The proposed sampler outperforms the previous visual sampler on all three referring tasks


> **LLM model size**
> Ferret 13B performs better than Ferret 7B
> → a larger LLM backbone helps

</details>


<details><summary>3.6. Object Hallucination</summary>
<img width="90%" alt="image" src="https://github.com/user-attachments/assets/f2e3cbf1-ff2a-4f5f-96c1-c7447f3663a3">

> `Table 10`
> Ferret exhibits strong power against the hallucination problem

</details>


<details><summary>3.7. Ferret vs. GPT-4V</summary>

> GPT-4V's referring and grounding capabilities are compared with Ferret's

> Referring with GPT-4V is tested in 2 ways
> 1. Mark a red circle/outline on the image and ask what the red circle/outline corresponds to
> 2. Provide the image, the image size, and the coordinates of the object to refer to, and ask what they correspond to

> grounding with GPT-4V
> `Localize <class> in the image using bounding boxes. The image size is (width, height).`

<img width="90%" alt="image" src="https://github.com/user-attachments/assets/13a8f7c6-15ef-4380-a7a3-430f6a19457c">

> `Figure 6`
> referring
> GPT-4V can understand colored regions in the image and coordinates given as text
> However, it does not refer well to small regions
> That said, GPT-4V is stronger in terms of commonsense knowledge
> 
> grounding
> Tested on CAPTCHAs
> Ferret excels at accurately identifying most traffic lights even in cluttered scenes

> Conclusion
> Ferret shines especially when precise bounding boxes are needed for grounding, and it suits applications that require pinpoint accuracy in small regions

</details>
