From 2939ddb8e0834b492155e195eb3272ae5c73ea02 Mon Sep 17 00:00:00 2001 From: Dana Aubakirova <118912928+danaaubakirova@users.noreply.github.com> Date: Wed, 24 Jul 2024 18:33:25 +0200 Subject: [PATCH 01/22] Create zero-shot-vqa This is the blogpost about trying VLM for zero-shot VQA on Docmatix --- zero-shot-vqa | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 107 insertions(+) create mode 100644 zero-shot-vqa diff --git a/zero-shot-vqa b/zero-shot-vqa new file mode 100644 index 0000000000..5a8db1101f --- /dev/null +++ b/zero-shot-vqa @@ -0,0 +1,107 @@ +# LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning? + +
+ Figure 1: t-SNE visualization of Zero-Shot Generated and Reference Answers from Docmatix dataset +
+
+## Method
+
+[Docmatix](https://huggingface.co/blog/docmatix) is the largest synthetic DocVQA dataset, generated from the curated document dataset, [PDFA](https://huggingface.co/datasets/pixparse/pdfa-eng-wds). It is 100x larger than previously available datasets. Its human-curated counterpart is DocVQA, which serves as an evaluation benchmark for VQA models for document understanding. In this post, we use **a subset of Docmatix** consisting of around 1,700 train and 200 test samples, which can be downloaded here [FIXME: add the link to the dataset].
+
+Although the content of the question and answer pairs in Docmatix and DocVQA is similar, their styles differ significantly. Traditional metrics like CIDEr, ANLS, and BLEU can be overly restrictive for zero-shot evaluation in this context. Motivated by the similarity of the embeddings observed in the t-SNE plot (Figure 1), we decided to use a different evaluation metric. In this post, we consider the LAVE metric to better assess generalization on this unseen but semantically similar dataset.
+
+For our evaluation, we chose [MPLUGDocOwl1.5](https://arxiv.org/pdf/2403.12895) as a baseline model. It achieves an 84% ANLS score on the test subset of the original DocVQA dataset. We then ran zero-shot generation on a 200-image subset of Docmatix and used [Llama-2-Chat-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) to rate the generated answers.
+
+## About LAVE
+
+We followed the procedure outlined in the [paper](https://arxiv.org/html/2310.02567v2). The VQA evaluation is framed as an answer-rating task suitable for in-context learning with LLMs. We used a rating scale from 1 to 3 to account for ambiguous questions or incomplete answers. The prompt included a task description, several demonstrations of input/output, and the input for a test example.
+
+We structured our task description and included the instruction **"Give the rationale before rating"** so that the model justifies the rating it assigns. Each demonstration comprised a question, a set of reference answers, the candidate answer, the answer rating, and an explanation for the rating. We also included the instruction **"Provide only one rating"** to avoid a sentence-by-sentence analysis, which sometimes produced several ratings.
+
+```py
+# Task description and a few-shot demonstration used to prompt the judge LLM.
+task_description = """You are given a question, a set of gold-standard reference answers written by
+experts, and a candidate answer. Please rate the accuracy of the candidate answer for the question
+considering the reference answers. Use a scale of 1-3, with 1 indicating an incorrect or irrelevant
+answer, 2 indicating an ambiguous or incomplete answer, and 3 indicating a correct answer.
+Give the rationale before rating. Provide only one rating.
+THIS IS VERY IMPORTANT:
+A binary question should only be answered with 'yes' or 'no',
+otherwise the candidate answer is incorrect."""
+
+demonstrations = [
+    {
+        "question": "What's the weather like?",
+        "reference_answer": ["sunny", "clear", "bright", "sunny", "sunny"],
+        "generated_answer": "cloudy"  # disagrees with every reference, so it should be rated 1
+    }
+]
+```
+
+#### Scoring Function
+
+Given the LLM's generated text for the test example, we extracted the rating from the last character (either 1, 2, or 3) and mapped it to a score in the range [0, 1]: $$ s = \frac{r - 1}{2} $$
+
+#### Table of Results
+
+The results of our evaluation are summarized in the table below:
+
Metric | CIDEr | BLEU | ANLS | LAVE |
---|---|---|---|---|
Score | 0.1411 | 0.0032 | 0.002 | 0.58 |
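To make the scoring step concrete, here is a minimal sketch of how a rating can be pulled out of the judge's output and turned into the LAVE number reported above. The regex fallback and the toy `judge_outputs` list are illustrative assumptions on our part, not the exact evaluation script.

```py
import re

def lave_score(judge_output: str) -> float:
    """Map a 1-3 rating found at the end of the judge's text to a score in [0, 1]."""
    # The rating is expected to be the last character; fall back to the last digit
    # found in the text in case the model appends punctuation or whitespace.
    digits = re.findall(r"[123]", judge_output)
    rating = int(digits[-1]) if digits else 1  # treat a missing rating as "incorrect"
    return (rating - 1) / 2  # 1 -> 0.0, 2 -> 0.5, 3 -> 1.0

# Averaging the per-sample scores over the 200 test examples gives the LAVE column above.
judge_outputs = [
    "The candidate contradicts the references. Rating: 1",
    "The candidate is incomplete. Rating: 2",
    "The candidate matches the references. Rating: 3",
]
print(sum(lave_score(o) for o in judge_outputs) / len(judge_outputs))  # 0.5
```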
+ Figure 2: Llama rating and rationale. +
+ ++ Figure 3: Llama rating and rationale. +
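The ratings and rationales shown in Figures 2 and 3 come from prompting the judge LLM with the pieces defined above. Below is a rough sketch of that call using the `transformers` text-generation pipeline. It reuses `task_description` and `demonstrations` from the earlier snippet, the demonstration rationale text is invented for illustration, and the exact prompt formatting and generation settings used for this post may differ.

```py
from transformers import pipeline

# Assumes `task_description` and `demonstrations` from the code block above are in scope.
# The Llama 2 checkpoint is gated, so a Hugging Face access token is required.
judge = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf", device_map="auto")

def build_prompt(question, references, candidate):
    demo = demonstrations[0]
    demo_block = (
        f"Question: {demo['question']}\n"
        f"Reference answers: {', '.join(demo['reference_answer'])}\n"
        f"Candidate answer: {demo['generated_answer']}\n"
        "Rationale: the candidate contradicts every reference answer. Rating: 1\n\n"
    )
    test_block = (
        f"Question: {question}\n"
        f"Reference answers: {', '.join(references)}\n"
        f"Candidate answer: {candidate}\n"
        "Rationale:"
    )
    return f"{task_description}\n\n{demo_block}{test_block}"

prompt = build_prompt("What is the invoice number?", ["INV-0042"], "The invoice number is INV-0042.")
completion = judge(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]
print(completion[len(prompt):])  # a rationale followed by a single rating, as in Figures 2 and 3
```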
+ + +## Are we too strict in evaluating VQA systems and do we need finetuning? + +We have approximately 50% accuracy when using LLMs to evaluate responses, indicating that answers can be correct despite not adhering to a strict format. This suggests that our current evaluation metrics may be too rigid. It’s important to note that this is not a comprehensive research paper, and more ablation studies are needed to fully understand the effectiveness of different metrics on the evaluation of zero-shot performance on synthetic dataset. We hope this work serves as a starting point to broaden the current research focus on improving the evaluation of zero-shot vision-language models within the context of synthetic datasets and to explore more efficient approaches beyond prompt learning. + +## References + +[FIXME: add bibtex refs] From 16c384572a6730f6523859ee500da430f8e57f36 Mon Sep 17 00:00:00 2001 From: Dana Aubakirova <118912928+danaaubakirova@users.noreply.github.com> Date: Wed, 24 Jul 2024 18:34:59 +0200 Subject: [PATCH 02/22] Delete zero-shot-vqa --- zero-shot-vqa | 107 -------------------------------------------------- 1 file changed, 107 deletions(-) delete mode 100644 zero-shot-vqa diff --git a/zero-shot-vqa b/zero-shot-vqa deleted file mode 100644 index 5a8db1101f..0000000000 --- a/zero-shot-vqa +++ /dev/null @@ -1,107 +0,0 @@ -# LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning? - -- Figure 1: t-SNE visualization of Zero-Shot Generated and Reference Answers from Docmatix dataset -
-
-## Method
-
-[Docmatix](https://huggingface.co/blog/docmatix) is the largest synthetic DocVQA dataset, generated from the curated document dataset, [PDFA](https://huggingface.co/datasets/pixparse/pdfa-eng-wds). It is 100x larger than previously available datasets. Its human-curated counterpart is DocVQA, which serves as an evaluation benchmark for VQA models for document understanding. In this post, we use **a subset of Docmatix** consisting of around 1,700 train and 200 test samples, which can be downloaded here [FIXME: add the link to the dataset].
-
-Although the content of the question and answer pairs in Docmatix and DocVQA is similar, their styles differ significantly. Traditional metrics like CIDEr, ANLS, and BLEU can be overly restrictive for zero-shot evaluation in this context. Motivated by the similarity of the embeddings observed in the t-SNE plot (Figure 1), we decided to use a different evaluation metric. In this post, we consider the LAVE metric to better assess generalization on this unseen but semantically similar dataset.
-
-For our evaluation, we chose [MPLUGDocOwl1.5](https://arxiv.org/pdf/2403.12895) as a baseline model. It achieves an 84% ANLS score on the test subset of the original DocVQA dataset. We then ran zero-shot generation on a 200-image subset of Docmatix and used [Llama-2-Chat-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) to rate the generated answers.
-
-## About LAVE
-
-We followed the procedure outlined in the [paper](https://arxiv.org/html/2310.02567v2). The VQA evaluation is framed as an answer-rating task suitable for in-context learning with LLMs. We used a rating scale from 1 to 3 to account for ambiguous questions or incomplete answers. The prompt included a task description, several demonstrations of input/output, and the input for a test example.
-
-We structured our task description and included the instruction **"Give the rationale before rating"** so that the model justifies the rating it assigns. Each demonstration comprised a question, a set of reference answers, the candidate answer, the answer rating, and an explanation for the rating. We also included the instruction **"Provide only one rating"** to avoid a sentence-by-sentence analysis, which sometimes produced several ratings.
-
-```py
-# Task description and a few-shot demonstration used to prompt the judge LLM.
-task_description = """You are given a question, a set of gold-standard reference answers written by
-experts, and a candidate answer. Please rate the accuracy of the candidate answer for the question
-considering the reference answers. Use a scale of 1-3, with 1 indicating an incorrect or irrelevant
-answer, 2 indicating an ambiguous or incomplete answer, and 3 indicating a correct answer.
-Give the rationale before rating. Provide only one rating.
-THIS IS VERY IMPORTANT:
-A binary question should only be answered with 'yes' or 'no',
-otherwise the candidate answer is incorrect."""
-
-demonstrations = [
-    {
-        "question": "What's the weather like?",
-        "reference_answer": ["sunny", "clear", "bright", "sunny", "sunny"],
-        "generated_answer": "cloudy"  # disagrees with every reference, so it should be rated 1
-    }
-]
-```
-
-#### Scoring Function
-
-Given the LLM's generated text for the test example, we extracted the rating from the last character (either 1, 2, or 3) and mapped it to a score in the range [0, 1]: $$ s = \frac{r - 1}{2} $$
-
-#### Table of Results
-
-The results of our evaluation are summarized in the table below:
-
Metric | CIDEr | BLEU | ANLS | LAVE |
---|---|---|---|---|
Score | 0.1411 | 0.0032 | 0.002 | 0.58 |
- Figure 2: Llama rating and rationale. -
- -- Figure 3: Llama rating and rationale. -
- - -## Are we too strict in evaluating VQA systems and do we need finetuning? - -We have approximately 50% accuracy when using LLMs to evaluate responses, indicating that answers can be correct despite not adhering to a strict format. This suggests that our current evaluation metrics may be too rigid. It’s important to note that this is not a comprehensive research paper, and more ablation studies are needed to fully understand the effectiveness of different metrics on the evaluation of zero-shot performance on synthetic dataset. We hope this work serves as a starting point to broaden the current research focus on improving the evaluation of zero-shot vision-language models within the context of synthetic datasets and to explore more efficient approaches beyond prompt learning. - -## References - -[FIXME: add bibtex refs] From 964c55447c2271ae65043b8fd81a8f3ac504fca5 Mon Sep 17 00:00:00 2001 From: Dana Aubakirova <118912928+danaaubakirova@users.noreply.github.com> Date: Wed, 24 Jul 2024 18:35:40 +0200 Subject: [PATCH 03/22] Create zero-shot-vqa.md --- zero-shot-vqa.md | 107 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 107 insertions(+) create mode 100644 zero-shot-vqa.md diff --git a/zero-shot-vqa.md b/zero-shot-vqa.md new file mode 100644 index 0000000000..5a8db1101f --- /dev/null +++ b/zero-shot-vqa.md @@ -0,0 +1,107 @@ +# LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning? + ++ Figure 1: t-SNE visualization of Zero-Shot Generated and Reference Answers from Docmatix dataset +
+
+## Method
+
+[Docmatix](https://huggingface.co/blog/docmatix) is the largest synthetic DocVQA dataset, generated from the curated document dataset, [PDFA](https://huggingface.co/datasets/pixparse/pdfa-eng-wds). It is 100x larger than previously available datasets. Its human-curated counterpart is DocVQA, which serves as an evaluation benchmark for VQA models for document understanding. In this post, we use **a subset of Docmatix** consisting of around 1,700 train and 200 test samples, which can be downloaded here [FIXME: add the link to the dataset].
+
+Although the content of the question and answer pairs in Docmatix and DocVQA is similar, their styles differ significantly. Traditional metrics like CIDEr, ANLS, and BLEU can be overly restrictive for zero-shot evaluation in this context. Motivated by the similarity of the embeddings observed in the t-SNE plot (Figure 1), we decided to use a different evaluation metric. In this post, we consider the LAVE metric to better assess generalization on this unseen but semantically similar dataset.
+
+For our evaluation, we chose [MPLUGDocOwl1.5](https://arxiv.org/pdf/2403.12895) as a baseline model. It achieves an 84% ANLS score on the test subset of the original DocVQA dataset. We then ran zero-shot generation on a 200-image subset of Docmatix and used [Llama-2-Chat-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) to rate the generated answers.
+
+## About LAVE
+
+We followed the procedure outlined in the [paper](https://arxiv.org/html/2310.02567v2). The VQA evaluation is framed as an answer-rating task suitable for in-context learning with LLMs. We used a rating scale from 1 to 3 to account for ambiguous questions or incomplete answers. The prompt included a task description, several demonstrations of input/output, and the input for a test example.
+
+We structured our task description and included the instruction **"Give the rationale before rating"** so that the model justifies the rating it assigns. Each demonstration comprised a question, a set of reference answers, the candidate answer, the answer rating, and an explanation for the rating. We also included the instruction **"Provide only one rating"** to avoid a sentence-by-sentence analysis, which sometimes produced several ratings.
+
+```py
+# Task description and a few-shot demonstration used to prompt the judge LLM.
+task_description = """You are given a question, a set of gold-standard reference answers written by
+experts, and a candidate answer. Please rate the accuracy of the candidate answer for the question
+considering the reference answers. Use a scale of 1-3, with 1 indicating an incorrect or irrelevant
+answer, 2 indicating an ambiguous or incomplete answer, and 3 indicating a correct answer.
+Give the rationale before rating. Provide only one rating.
+THIS IS VERY IMPORTANT:
+A binary question should only be answered with 'yes' or 'no',
+otherwise the candidate answer is incorrect."""
+
+demonstrations = [
+    {
+        "question": "What's the weather like?",
+        "reference_answer": ["sunny", "clear", "bright", "sunny", "sunny"],
+        "generated_answer": "cloudy"  # disagrees with every reference, so it should be rated 1
+    }
+]
+```
+
+#### Scoring Function
+
+Given the LLM's generated text for the test example, we extracted the rating from the last character (either 1, 2, or 3) and mapped it to a score in the range [0, 1]: $$ s = \frac{r - 1}{2} $$
+
+#### Table of Results
+
+The results of our evaluation are summarized in the table below:
+
Metric | CIDEr | BLEU | ANLS | LAVE |
---|---|---|---|---|
Score | 0.1411 | 0.0032 | 0.002 | 0.58 |
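As a point of comparison with LAVE, the ANLS column in this table is easy to recompute by hand: ANLS is one minus the normalized Levenshtein distance between the prediction and a reference, with similarities below 0.5 clamped to zero. The sketch below is a minimal reimplementation for illustration, not the official DocVQA evaluation code.

```py
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def anls(prediction: str, references: list, threshold: float = 0.5) -> float:
    # Best normalized Levenshtein similarity over the references, clamped below the threshold.
    best = 0.0
    for ref in references:
        pred, gold = prediction.strip().lower(), ref.strip().lower()
        similarity = 1 - levenshtein(pred, gold) / max(len(pred), len(gold), 1)
        best = max(best, similarity)
    return best if best >= threshold else 0.0

print(anls("cloudy", ["sunny", "clear"]))   # 0.0, no reference is close enough
print(anls("INV-0042.", ["INV-0042"]))      # about 0.89, minor formatting noise is tolerated
```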
+ Figure 2: Llama rating and rationale. +
+ ++ Figure 3: Llama rating and rationale. +
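Figures 2 and 3 show individual judge outputs; the aggregate picture in Figure 1 comes from embedding generated and reference answers and projecting them with t-SNE. Here is a small sketch of how such a plot can be produced. The sentence-embedding model and the toy answer pairs are assumptions on our part, not necessarily what was used for the figure.

```py
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder; any sentence embedder works

generated = ["The invoice total is $1,200.", "The report was filed in 2019.", "Yes, the form is signed."]
references = ["$1,200", "2019", "yes"]

embeddings = encoder.encode(generated + references)
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(embeddings)

n = len(generated)
plt.scatter(coords[:n, 0], coords[:n, 1], label="zero-shot generated")
plt.scatter(coords[n:, 0], coords[n:, 1], label="reference")
plt.legend()
plt.savefig("tsne_answers.png")
```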
+ + +## Are we too strict in evaluating VQA systems and do we need finetuning? + +We have approximately 50% accuracy when using LLMs to evaluate responses, indicating that answers can be correct despite not adhering to a strict format. This suggests that our current evaluation metrics may be too rigid. It’s important to note that this is not a comprehensive research paper, and more ablation studies are needed to fully understand the effectiveness of different metrics on the evaluation of zero-shot performance on synthetic dataset. We hope this work serves as a starting point to broaden the current research focus on improving the evaluation of zero-shot vision-language models within the context of synthetic datasets and to explore more efficient approaches beyond prompt learning. + +## References + +[FIXME: add bibtex refs] From 0a8c409f70190978c4ddc85893fb7efed253aa20 Mon Sep 17 00:00:00 2001 From: Dana Aubakirova <118912928+danaaubakirova@users.noreply.github.com> Date: Wed, 24 Jul 2024 18:36:25 +0200 Subject: [PATCH 04/22] Delete zero-shot-vqa.md --- zero-shot-vqa.md | 107 ----------------------------------------------- 1 file changed, 107 deletions(-) delete mode 100644 zero-shot-vqa.md diff --git a/zero-shot-vqa.md b/zero-shot-vqa.md deleted file mode 100644 index 5a8db1101f..0000000000 --- a/zero-shot-vqa.md +++ /dev/null @@ -1,107 +0,0 @@ -# LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning? - -- Figure 1: t-SNE visualization of Zero-Shot Generated and Reference Answers from Docmatix dataset -
-
-## Method
-
-[Docmatix](https://huggingface.co/blog/docmatix) is the largest synthetic DocVQA dataset, generated from the curated document dataset, [PDFA](https://huggingface.co/datasets/pixparse/pdfa-eng-wds). It is 100x larger than previously available datasets. Its human-curated counterpart is DocVQA, which serves as an evaluation benchmark for VQA models for document understanding. In this post, we use **a subset of Docmatix** consisting of around 1,700 train and 200 test samples, which can be downloaded here [FIXME: add the link to the dataset].
-
-Although the content of the question and answer pairs in Docmatix and DocVQA is similar, their styles differ significantly. Traditional metrics like CIDEr, ANLS, and BLEU can be overly restrictive for zero-shot evaluation in this context. Motivated by the similarity of the embeddings observed in the t-SNE plot (Figure 1), we decided to use a different evaluation metric. In this post, we consider the LAVE metric to better assess generalization on this unseen but semantically similar dataset.
-
-For our evaluation, we chose [MPLUGDocOwl1.5](https://arxiv.org/pdf/2403.12895) as a baseline model. It achieves an 84% ANLS score on the test subset of the original DocVQA dataset. We then ran zero-shot generation on a 200-image subset of Docmatix and used [Llama-2-Chat-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) to rate the generated answers.
-
-## About LAVE
-
-We followed the procedure outlined in the [paper](https://arxiv.org/html/2310.02567v2). The VQA evaluation is framed as an answer-rating task suitable for in-context learning with LLMs. We used a rating scale from 1 to 3 to account for ambiguous questions or incomplete answers. The prompt included a task description, several demonstrations of input/output, and the input for a test example.
-
-We structured our task description and included the instruction **"Give the rationale before rating"** so that the model justifies the rating it assigns. Each demonstration comprised a question, a set of reference answers, the candidate answer, the answer rating, and an explanation for the rating. We also included the instruction **"Provide only one rating"** to avoid a sentence-by-sentence analysis, which sometimes produced several ratings.
-
-```py
-# Task description and a few-shot demonstration used to prompt the judge LLM.
-task_description = """You are given a question, a set of gold-standard reference answers written by
-experts, and a candidate answer. Please rate the accuracy of the candidate answer for the question
-considering the reference answers. Use a scale of 1-3, with 1 indicating an incorrect or irrelevant
-answer, 2 indicating an ambiguous or incomplete answer, and 3 indicating a correct answer.
-Give the rationale before rating. Provide only one rating.
-THIS IS VERY IMPORTANT:
-A binary question should only be answered with 'yes' or 'no',
-otherwise the candidate answer is incorrect."""
-
-demonstrations = [
-    {
-        "question": "What's the weather like?",
-        "reference_answer": ["sunny", "clear", "bright", "sunny", "sunny"],
-        "generated_answer": "cloudy"  # disagrees with every reference, so it should be rated 1
-    }
-]
-```
-
-#### Scoring Function
-
-Given the LLM's generated text for the test example, we extracted the rating from the last character (either 1, 2, or 3) and mapped it to a score in the range [0, 1]: $$ s = \frac{r - 1}{2} $$
-
-#### Table of Results
-
-The results of our evaluation are summarized in the table below:
-
Metric | CIDEr | BLEU | ANLS | LAVE |
---|---|---|---|---|
Score | 0.1411 | 0.0032 | 0.002 | 0.58 |
- Figure 2: Llama rating and rationale. -
- -- Figure 3: Llama rating and rationale. -
- - -## Are we too strict in evaluating VQA systems and do we need finetuning? - -We have approximately 50% accuracy when using LLMs to evaluate responses, indicating that answers can be correct despite not adhering to a strict format. This suggests that our current evaluation metrics may be too rigid. It’s important to note that this is not a comprehensive research paper, and more ablation studies are needed to fully understand the effectiveness of different metrics on the evaluation of zero-shot performance on synthetic dataset. We hope this work serves as a starting point to broaden the current research focus on improving the evaluation of zero-shot vision-language models within the context of synthetic datasets and to explore more efficient approaches beyond prompt learning. - -## References - -[FIXME: add bibtex refs] From 7de22236489a5dca78286a327356dd10e79bebfd Mon Sep 17 00:00:00 2001 From: Dana Aubakirova <118912928+danaaubakirova@users.noreply.github.com> Date: Wed, 24 Jul 2024 18:39:40 +0200 Subject: [PATCH 05/22] Create zero-shot-vqa-docmatix.md --- zero-shot-vqa-docmatix.md | 107 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 107 insertions(+) create mode 100644 zero-shot-vqa-docmatix.md diff --git a/zero-shot-vqa-docmatix.md b/zero-shot-vqa-docmatix.md new file mode 100644 index 0000000000..5a8db1101f --- /dev/null +++ b/zero-shot-vqa-docmatix.md @@ -0,0 +1,107 @@ +# LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning? + ++ Figure 1: t-SNE visualization of Zero-Shot Generated and Reference Answers from Docmatix dataset +
+
+## Method
+
+[Docmatix](https://huggingface.co/blog/docmatix) is the largest synthetic DocVQA dataset, generated from the curated document dataset, [PDFA](https://huggingface.co/datasets/pixparse/pdfa-eng-wds). It is 100x larger than previously available datasets. Its human-curated counterpart is DocVQA, which serves as an evaluation benchmark for VQA models for document understanding. In this post, we use **a subset of Docmatix** consisting of around 1,700 train and 200 test samples, which can be downloaded here [FIXME: add the link to the dataset].
+
+Although the content of the question and answer pairs in Docmatix and DocVQA is similar, their styles differ significantly. Traditional metrics like CIDEr, ANLS, and BLEU can be overly restrictive for zero-shot evaluation in this context. Motivated by the similarity of the embeddings observed in the t-SNE plot (Figure 1), we decided to use a different evaluation metric. In this post, we consider the LAVE metric to better assess generalization on this unseen but semantically similar dataset.
+
+For our evaluation, we chose [MPLUGDocOwl1.5](https://arxiv.org/pdf/2403.12895) as a baseline model. It achieves an 84% ANLS score on the test subset of the original DocVQA dataset. We then ran zero-shot generation on a 200-image subset of Docmatix and used [Llama-2-Chat-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) to rate the generated answers.
+
+## About LAVE
+
+We followed the procedure outlined in the [paper](https://arxiv.org/html/2310.02567v2). The VQA evaluation is framed as an answer-rating task suitable for in-context learning with LLMs. We used a rating scale from 1 to 3 to account for ambiguous questions or incomplete answers. The prompt included a task description, several demonstrations of input/output, and the input for a test example.
+
+We structured our task description and included the instruction **"Give the rationale before rating"** so that the model justifies the rating it assigns. Each demonstration comprised a question, a set of reference answers, the candidate answer, the answer rating, and an explanation for the rating. We also included the instruction **"Provide only one rating"** to avoid a sentence-by-sentence analysis, which sometimes produced several ratings.
+
+```py
+# Task description and a few-shot demonstration used to prompt the judge LLM.
+task_description = """You are given a question, a set of gold-standard reference answers written by
+experts, and a candidate answer. Please rate the accuracy of the candidate answer for the question
+considering the reference answers. Use a scale of 1-3, with 1 indicating an incorrect or irrelevant
+answer, 2 indicating an ambiguous or incomplete answer, and 3 indicating a correct answer.
+Give the rationale before rating. Provide only one rating.
+THIS IS VERY IMPORTANT:
+A binary question should only be answered with 'yes' or 'no',
+otherwise the candidate answer is incorrect."""
+
+demonstrations = [
+    {
+        "question": "What's the weather like?",
+        "reference_answer": ["sunny", "clear", "bright", "sunny", "sunny"],
+        "generated_answer": "cloudy"  # disagrees with every reference, so it should be rated 1
+    }
+]
+```
+
+#### Scoring Function
+
+Given the LLM's generated text for the test example, we extracted the rating from the last character (either 1, 2, or 3) and mapped it to a score in the range [0, 1]: $$ s = \frac{r - 1}{2} $$
+
+#### Table of Results
+
+The results of our evaluation are summarized in the table below:
+
Metric | CIDEr | BLEU | ANLS | LAVE |
---|---|---|---|---|
Score | 0.1411 | 0.0032 | 0.002 | 0.58 |
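The near-zero BLEU score in this table is easy to reproduce: Docmatix references are short spans, while zero-shot answers tend to be full sentences, so n-gram overlap all but vanishes even when the answer is right. A quick check with the `evaluate` library on two made-up pairs illustrates this; the example answers are invented.

```py
import evaluate

bleu = evaluate.load("bleu")

predictions = ["The invoice total is $1,200.", "The report was filed in 2019."]  # verbose but correct
references = [["$1,200"], ["2019"]]                                              # short reference spans

result = bleu.compute(predictions=predictions, references=references)
print(result["bleu"])  # essentially zero, even though both answers are semantically correct
```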
+ Figure 2: Llama rating and rationale. +
+ ++ Figure 3: Llama rating and rationale. +
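For completeness, the zero-shot generation step that produced the answers rated in Figures 2 and 3 boils down to a loop like the one below. Treat it as a sketch under stated assumptions: the Docmatix subset name and column layout are guesses to be checked against the dataset card, and a small open VQA checkpoint stands in for MPLUGDocOwl1.5, which ships with its own inference code.

```py
from datasets import load_dataset
from transformers import pipeline

# The dataset config and split names are assumptions; substitute the subset linked in this post.
ds = load_dataset("HuggingFaceM4/Docmatix", "zero-shot-exp", split="test")

# Stand-in model for illustration only; the post's actual baseline is MPLUGDocOwl1.5.
vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")

records = []
for sample in ds.select(range(3)):    # the post evaluates 200 test images
    qa = sample["texts"][0]           # column names may differ; inspect ds.features first
    answer = vqa(image=sample["images"][0], question=qa["user"])[0]["answer"]
    records.append({"question": qa["user"], "reference": qa["assistant"], "generated": answer})

print(records[0])  # these (question, reference, generated) triples are what the judge LLM rates
```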
+ + +## Are we too strict in evaluating VQA systems and do we need finetuning? + +We have approximately 50% accuracy when using LLMs to evaluate responses, indicating that answers can be correct despite not adhering to a strict format. This suggests that our current evaluation metrics may be too rigid. It’s important to note that this is not a comprehensive research paper, and more ablation studies are needed to fully understand the effectiveness of different metrics on the evaluation of zero-shot performance on synthetic dataset. We hope this work serves as a starting point to broaden the current research focus on improving the evaluation of zero-shot vision-language models within the context of synthetic datasets and to explore more efficient approaches beyond prompt learning. + +## References + +[FIXME: add bibtex refs] From 74bde5c17b1e87b540921befaaf4ff43e4361919 Mon Sep 17 00:00:00 2001 From: Dana Aubakirova <118912928+danaaubakirova@users.noreply.github.com> Date: Thu, 25 Jul 2024 10:17:36 +0200 Subject: [PATCH 06/22] Update zero-shot-vqa-docmatix.md --- zero-shot-vqa-docmatix.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/zero-shot-vqa-docmatix.md b/zero-shot-vqa-docmatix.md index 5a8db1101f..a4cabe75af 100644 --- a/zero-shot-vqa-docmatix.md +++ b/zero-shot-vqa-docmatix.md @@ -1,12 +1,12 @@ # LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?+ Figure 2: The examples of Q&A pairs from Docmatix and DocVQA test set. Note: the corresponding images are not shown here. +
+ Although the content of the question and answer pairs in Docmatix and DocVQA is similar, their styles differ significantly. Traditional metrics like CIDER, ANLS, and BLEU can be overly restrictive for zero-shot evaluation in this context. Motivated by the similarity of the embeddings observed in t-SNE (Figure 1), we decided to use a different evaluation metric. In this post, we consider the LAVE metric to better assess generalization on this unseen but semantically similar dataset. For our evaluation, we chose [MPLUGDocOwl1.5](https://arxiv.org/pdf/2403.12895) as a baseline model. This model achieves an 84% ANLS score on the test subset of the original DocVQA dataset. We then ran a zero-shot generation on a subset of Docmatix, consisting of 200 images. We used [Llama-2-Chat-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) for rating the answers. @@ -78,6 +87,15 @@ The results of our evaluation are summarized in the table below: ++ Figure 5: t-SNE visualization of Question, Answer and Image features from Docmatix and DocVQA datasets +
## Qualitative Examples From ba639f4e4aacd467678eff9d208d5101068705d6 Mon Sep 17 00:00:00 2001 From: Dana Aubakirova <118912928+danaaubakirova@users.noreply.github.com> Date: Thu, 25 Jul 2024 10:11:26 +0000 Subject: [PATCH 09/22] added refs --- zero-shot-vqa-docmatix.md | 58 +++++++++++++++++++++++++++++++++++---- 1 file changed, 53 insertions(+), 5 deletions(-) diff --git a/zero-shot-vqa-docmatix.md b/zero-shot-vqa-docmatix.md index c5733c03f8..d334429794 100644 --- a/zero-shot-vqa-docmatix.md +++ b/zero-shot-vqa-docmatix.md @@ -88,9 +88,9 @@ The results of our evaluation are summarized in the table below:
@@ -118,8 +118,56 @@ The results of our evaluation are summarized in the table below:
## Are we too strict in evaluating VQA systems and do we need finetuning?
-We have approximately 50% accuracy when using LLMs to evaluate responses, indicating that answers can be correct despite not adhering to a strict format. This suggests that our current evaluation metrics may be too rigid. It’s important to note that this is not a comprehensive research paper, and more ablation studies are needed to fully understand the effectiveness of different metrics on the evaluation of zero-shot performance on synthetic dataset. We hope this work serves as a starting point to broaden the current research focus on improving the evaluation of zero-shot vision-language models within the context of synthetic datasets and to explore more efficient approaches beyond prompt learning.
+We observe an accuracy gain of approximately 50% when using LLMs to evaluate the responses, indicating that the answers are often correct even though they do not adhere to a strict format. This suggests that our current evaluation metrics may be too rigid. It is important to note that this is not a comprehensive research paper, and more ablation studies are needed to fully understand the effectiveness of different metrics for evaluating zero-shot performance on synthetic datasets. We hope this work serves as a starting point for broadening the current research focus on improving the evaluation of zero-shot vision-language models in the context of synthetic datasets, and for exploring more efficient approaches beyond prompt learning.
## References
-[FIXME: add bibtex refs]
+```
+@inproceedings{cascante2022simvqa,
+ title={{SimVQA}: Exploring simulated environments for visual question answering},
+ author={Cascante-Bonilla, Paola and Wu, Hui and Wang, Letao and Feris, Rogerio S and Ordonez, Vicente},
+ booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+ pages={5056--5066},
+ year={2022}
+}
+
+@article{hu2024mplug,
+ title={{mPLUG-DocOwl} 1.5: Unified structure learning for {OCR}-free document understanding},
+ author={Hu, Anwen and Xu, Haiyang and Ye, Jiabo and Yan, Ming and Zhang, Liang and Zhang, Bo and Li, Chen and Zhang, Ji and Jin, Qin and Huang, Fei and others},
+ journal={arXiv preprint arXiv:2403.12895},
+ year={2024}
+}
+
+@article{agrawal2022reassessing,
+ title={Reassessing evaluation practices in visual question answering: A case study on out-of-distribution generalization},
+ author={Agrawal, Aishwarya and Kaji{\'c}, Ivana and Bugliarello, Emanuele and Davoodi, Elnaz and Gergely, Anita and Blunsom, Phil and Nematzadeh, Aida},
+ journal={arXiv preprint arXiv:2205.12191},
+ year={2022}
+}
+
+@inproceedings{li2023blip,
+ title={{BLIP-2}: Bootstrapping language-image pre-training with frozen image encoders and large language models},
+ author={Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven},
+ booktitle={International conference on machine learning},
+ pages={19730--19742},
+ year={2023},
+ organization={PMLR}
+}
+@inproceedings{manas2024improving,
+ title={Improving automatic {VQA} evaluation using large language models},
+ author={Ma{\~n}as, Oscar and Krojer, Benno and Agrawal, Aishwarya},
+ booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
+ volume={38},
+ number={5},
+ pages={4171--4179},
+ year={2024}
+}
+
+@article{li2023scigraphqa,
+ title={{SciGraphQA}: A large-scale synthetic multi-turn question-answering dataset for scientific graphs},
+ author={Li, Shengzhi and Tajbakhsh, Nima},
+ journal={arXiv preprint arXiv:2308.03349},
+ year={2023}
+}
+
+```
From f70530e90236f7cd55d6256ff3746fd5351f1509 Mon Sep 17 00:00:00 2001
From: Dana Aubakirova <118912928+danaaubakirova@users.noreply.github.com>
Date: Thu, 25 Jul 2024 10:23:10 +0000
Subject: [PATCH 10/22] resolved major content related comment
---
zero-shot-vqa-docmatix.md | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)
diff --git a/zero-shot-vqa-docmatix.md b/zero-shot-vqa-docmatix.md
index d334429794..40d24a9262 100644
--- a/zero-shot-vqa-docmatix.md
+++ b/zero-shot-vqa-docmatix.md
@@ -4,13 +4,9 @@
-## Introduction
-Our community has recently focused on out-of-distribution (OOD) evaluation, utilizing methods like zero-shot transfer to unseen VQA tasks or fine-tuning on one VQA dataset and evaluating on another. This shift is increasingly relevant with the rise of synthetic datasets such as Docmatix, SciGraphQA, SimVQA used to fine-tune Vision Language Models (VLMs).
+What happens when we run zero-shot VQA on a synthetic dataset? As Figure 1 illustrates, the generated answers often align semantically with the reference answers, yet most traditional VQA metrics rate them in the 'very low' range. This raises the question: should we fine-tune the models, or should we develop new metrics that account for distribution shifts and capture the core meaning of answers?
-Traditionally, VQA Accuracy has been the main metric for evaluating model performance. It relies on exact string matching between a model's predicted answer and a set of reference answers annotated by humans. This metric worked well because VQA evaluation followed an independent and identically distributed (IID) paradigm, where training and testing data distributions were similar, allowing models to adapt effectively [See details here](https://arxiv.org/pdf/2205.12191).
-
-In OOD settings, generated answers might not match reference answers despite being correct due to differences in format, specificity, or interpretation. This paradigm is perfectly illustrated in the Figure 1, where we compare the zero-shot generated captions vs the reference captions from the synthetic dataset. This is particularly true for instruction-generated datasets and their human-curated counterparts. Some [methods](https://proceedings.mlr.press/v202/li23q.html) have attempted to align answer formats with references, but this only addresses the symptom, not the root cause of flawed evaluation metrics. While human evaluation is reliable, it is costly and not scalable, highlighting the need for metrics that better align with human judgment.
From 6c8685312dc908e286fd1754e985a9ed81d29f1d Mon Sep 17 00:00:00 2001 From: Dana Aubakirova <118912928+danaaubakirova@users.noreply.github.com> Date: Thu, 25 Jul 2024 10:29:09 +0000 Subject: [PATCH 11/22] formatting --- zero-shot-vqa-docmatix.md | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/zero-shot-vqa-docmatix.md b/zero-shot-vqa-docmatix.md index 40d24a9262..920c88b7f0 100644 --- a/zero-shot-vqa-docmatix.md +++ b/zero-shot-vqa-docmatix.md @@ -1,3 +1,12 @@ +--- +title: "LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?" +authors: +- user: danaaubakirova +- user: andito + guest: true + +--- + # LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?
+ Figure 5: t-SNE visualization of Question, Answer and Image features from Docmatix and DocVQA datasets +
+ For our evaluation, we chose [MPLUGDocOwl1.5](https://arxiv.org/pdf/2403.12895) as a baseline model. This model achieves an 84% ANLS score on the test subset of the original DocVQA dataset. We then ran a zero-shot generation on a subset of Docmatix, consisting of 200 images. We used [Llama-2-Chat-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) for rating the answers. ## About LAVE @@ -101,16 +112,6 @@ The results of our evaluation are summarized in the table below: -- Figure 5: t-SNE visualization of Question, Answer and Image features from Docmatix and DocVQA datasets -
- ## Qualitative Examples