Multimodal Visual Question Answering (VQA) using ABO Dataset

Overview

This project explores Visual Question Answering (VQA) using the Amazon-Berkeley Objects (ABO) dataset. The task involves curating a single-word answer VQA dataset with multimodal tools, evaluating pretrained models, fine-tuning them using Low-Rank Adaptation (LoRA), and benchmarking performance using accuracy metrics.

GitHub repo: https://github.com/DyuthiVivek/VR_Project2_IMT2022018_523_572/

Team members:

Swetha Murali - IMT2022018
Dyuthi Vivek - IMT2022523
PVS Sukeerthi - IMT2022572

Files

dataset_preparation.ipynb: Full data curation process.
dataset.csv: CSV file with image path, question and answer.
baseline_evaluation.ipynb: Baseline and fine-tuned model evaluation.
fine_tuning.ipynb: Fine-tuning using LoRA.
inference.py: Inference script to run on new images.
requirements.txt: Dependencies to run the code.
README.md: Project report.

Data Curation

Source: Amazon-Berkeley Objects (ABO) dataset (small variant, ~3 GB)
Images: 15k catalog images (256x256), with 2-3 questions generated per image
Metadata: Product-level metadata from images.csv.gz

We used the Google Gemini 2.0 API (google.generativeai) to create questions.

The given metadata was cleaned, and the product name, features, and tags were extracted for each product. The cleaned metadata and the image were passed to Gemini with the prompt template below. The model's response was parsed and the question-answer pairs were extracted as JSON.

Final dataset size: 37,468 questions.

Prompt Template:

Given a product image, generate a set of diverse questions that can be answered solely by looking at the image.
Each question must have a single-word answer (e.g., "red", "shoes", "five", "yes").
Cover a range of question types, including:

- Object recognition (e.g., What product is shown?)
- Attribute detection (e.g., What color is the product?)
- Material/texture recognition (e.g., What material is the product made of?)
- Size/shape recognition (e.g., What is the shape of the product?)
- Brand recognition (e.g., What brand is the product?)
- Counting (e.g., How many items are in the image?)
- Yes/No questions (e.g., Is the product a smartphone?)

Mix easy and challenging questions. Avoid subjective or ambiguous questions.

Use this product information for reference but only generate questions that can be answered from the image alone.

Product Information:
[METADATA]

Output Format:
For each product image, return a list of 3-4 questions and their single-word answers. Respond in this format (do not include any extra information):

[
    {
        "question": "What product is shown?",
        "answer": "Laptop"
    },
    {
        "question": "What color is the product?",
        "answer": "Black"
    },
    {
        "question": "Is the logo visible?",
        "answer": "Yes"
    },
    {
        "question": "What material is the product made of?",
        "answer": "Plastic"
    },
    {
        "question": "Is the product in a box?",
        "answer": "No"
    }
]
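The parsing step described above can be sketched as follows. This is a minimal illustration, not the code from dataset_preparation.ipynb: the function name parse_qa_response and the choice to discard multi-word answers are assumptions made here to match the single-word constraint in the prompt.

```python
import json


def parse_qa_response(text):
    """Extract a list of {"question", "answer"} dicts from a model reply.

    Hypothetical helper: replies that are not valid JSON yield an empty list.
    """
    # The reply may carry extra text around the JSON; keep the outermost [...] span.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    pairs = []
    for item in items:
        if not isinstance(item, dict):
            continue
        question, answer = item.get("question"), item.get("answer")
        if not isinstance(question, str) or not isinstance(answer, str):
            continue
        question, answer = question.strip(), answer.strip()
        # Assumption: enforce the single-word answer rule stated in the prompt.
        if question and len(answer.split()) == 1:
            pairs.append({"question": question, "answer": answer})
    return pairs
```

Filtering at parse time keeps malformed or multi-word answers out of dataset.csv, at the cost of silently dropping some generated questions.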

Baseline Evaluation

We used a subset of the created questions for baseline evaluation.

Train data size - 30523 questions

Test data size - 6945 questions

We evaluated both the baseline and fine-tuned models on only the test data to ensure that the model had not seen the questions during training.
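One way to keep test questions unseen is to split by image rather than by question, so that all questions about one image land on the same side. The report does not say how the split was made, so the sketch below is an assumption; split_by_image, the 0.2 test fraction, and the seed are illustrative names and values only.

```python
import random
from collections import defaultdict


def split_by_image(rows, test_fraction=0.2, seed=0):
    """Split (image_path, question, answer) rows into train and test lists.

    Assumption: rows sharing an image_path are kept together so that no
    image appears in both splits.
    """
    by_image = defaultdict(list)
    for row in rows:
        by_image[row[0]].append(row)

    # Sort before shuffling so the result depends only on the seed.
    images = sorted(by_image)
    random.Random(seed).shuffle(images)

    n_test = int(round(len(images) * test_fraction))
    test = [row for image in images[:n_test] for row in by_image[image]]
    train = [row for image in images[n_test:] for row in by_image[image]]
    return train, test
```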

Models Used

BLIP (Bootstrapping Language-Image Pre-training): A transformer-based vision-language model that combines a vision encoder (ViT) with a text decoder, trained on large-scale image-text pairs for tasks such as image captioning, visual question answering, and retrieval. Nearly 385M parameters.
ViLT (Vision-and-Language Transformer): A lightweight model that removes the convolutional visual backbone and processes image patches directly with a transformer. Nearly 86M parameters.

Evaluation metrics

We compared predicted answers with the ground truth using:
- Exact Match Accuracy: Proportion of predictions that exactly match the ground truth word.
- BERTScore: Semantic similarity metric using contextual embeddings from BERT.
- BLEU Score: Measures the overlap of n-grams between predicted and reference answers.
- METEOR Score: Considers synonymy, stemming, and word order for better alignment with human judgment.
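Exact match accuracy, the first metric in the list above, can be sketched in a few lines. The normalization (lowercasing and stripping whitespace and punctuation) is an assumption added here; the notebook may compare raw strings instead.

```python
import string

_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalize(answer):
    """Lowercase, trim, and drop punctuation so 'Black.' matches 'black'."""
    return answer.strip().lower().translate(_PUNCTUATION_TABLE)


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Because every answer is a single word, this metric is strict: a synonym such as "sofa" for "couch" counts as a miss, which is why the softer metrics are reported alongside it.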

Model	Exact Match Accuracy	BERTScore	BLEU Score	METEOR Score
BLIP baseline	0.49	0.9792	0.4953	0.2636
ViLT baseline	0.471	0.977	0.47	0.2528

Observation

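The BLIP baseline outperformed ViLT on all four metrics, although the margins are small (for example, 0.49 vs. 0.471 exact match accuracy). A likely reason is that BLIP is the larger model (nearly 385M vs. 86M parameters).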

Fine-Tuning using LoRA

Model Selection

Based on the baseline evaluation results, we chose to fine-tune BLIP since it performed better than ViLT. We chose not to fine-tune BLIP-2 because it is a much larger model, with 2.7B parameters.
We experimented with two LoRA configurations.

Training Details

We trained and tested two LoRA configurations.
We also varied other hyperparameters, such as the batch size.
The baseline and fine-tuned models were evaluated only on the test data.

Best model hyperparameters:

Parameter	Value
Model	BLIP
LoRA Rank	16
Epochs	8
Batch Size	12
Learning Rate	5e-5 with learning rate scheduler

LoRA configurations used

Config	Rank (`r`)	Alpha	Dropout	Target Modules	Bias
Config 1	8	16	0.1	`["query", "value"]`	none
Config 2	16	32	0.2	`["query", "value"]`	none

Config 2 gave a better accuracy.
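The two configurations in the table can be written down as plain keyword dictionaries. The key names follow the LoraConfig arguments of the peft library; the hidden size of 768 used for the parameter count is an assumption about the BLIP base attention layers, not a figure from this report.

```python
# Values copied from the configuration table above.
LORA_CONFIGS = {
    "config_1": {"r": 8, "lora_alpha": 16, "lora_dropout": 0.1,
                 "target_modules": ["query", "value"], "bias": "none"},
    "config_2": {"r": 16, "lora_alpha": 32, "lora_dropout": 0.2,
                 "target_modules": ["query", "value"], "bias": "none"},
}


def lora_scaling(cfg):
    """Standard LoRA scales the low-rank update by lora_alpha / r."""
    return cfg["lora_alpha"] / cfg["r"]


def lora_params_per_matrix(cfg, d_in, d_out):
    """Trainable weights added to one linear layer: A (r x d_in) plus B (d_out x r)."""
    return cfg["r"] * (d_in + d_out)


for name, cfg in LORA_CONFIGS.items():
    # 768 is an assumed hidden size, used only for illustration.
    print(name, lora_scaling(cfg), lora_params_per_matrix(cfg, 768, 768))

# With peft installed, the same dictionaries would typically be used as:
#   from peft import LoraConfig, get_peft_model
#   peft_model = get_peft_model(base_model, LoraConfig(**LORA_CONFIGS["config_2"]))
# where base_model is the loaded BLIP model (not shown here).
```

Both configurations have the same alpha-to-rank ratio, so they differ only in adapter capacity (rank) and dropout, not in the scale of the update.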

Results after fine-tuning

Model	LoRA config	Batch Size	Exact Match Accuracy	BERTScore	BLEU Score	METEOR Score
BLIP fine-tuned	config 1	12	0.777	0.9753	0.7776	0.3925
BLIP fine-tuned	config 2	12	0.779	0.9753	0.779	0.3933
BLIP fine-tuned	config 2	8	0.7758	0.9753	0.775	0.3921

Observations

Exact Match Accuracy improved by nearly 29 percentage points (from 0.49 to 0.779), showing that the model was better at generating the exact expected word after fine-tuning.
BERTScore was already high before fine-tuning and changed little afterwards (0.9792 to 0.9753). It evaluates the semantic similarity between the predicted and reference answers using contextual embeddings from BERT. This shows that both the baseline and fine-tuned models produced semantically relevant answers, even when the exact token differed. BERTScore is less sensitive to small token-level changes, so it is not a strong indicator of improvement after fine-tuning.
BLEU Score and METEOR Score, which reward n-gram overlap and partial or synonym matches, increased substantially (BLEU from 0.4953 to 0.779, METEOR from 0.2636 to 0.3933), indicating closer agreement with the expected answers.
LoRA Config 2 slightly outperformed Config 1 on Exact Match Accuracy, BLEU, and METEOR, while BERTScore was identical. Config 2 increases the rank (more adapter capacity) and the dropout (stronger regularization), which may explain the small gain. However, the improvements are marginal.
Changing the batch size from 12 to 8 did not change the accuracy by much (0.779 vs. 0.7758).
After training for 1 epoch, the model reached 75% exact match accuracy; after 8 epochs, it reached about 78% (0.779). The small gain is consistent with the training loss, which decreased only slightly after the first few epochs. After 8 epochs, the validation loss started increasing.
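To make the size of these changes concrete, the short sketch below computes the absolute and relative change for each metric, using the BLIP baseline row and the best fine-tuned row (config 2, batch size 12) from the tables above. The helper name gains is illustrative.

```python
# Scores copied from the results tables above.
baseline = {"exact_match": 0.49, "bertscore": 0.9792, "bleu": 0.4953, "meteor": 0.2636}
finetuned = {"exact_match": 0.779, "bertscore": 0.9753, "bleu": 0.779, "meteor": 0.3933}


def gains(before, after):
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = (after - before) * 100
    relative = (after - before) / before * 100
    return round(absolute, 1), round(relative, 1)


for metric in baseline:
    print(metric, gains(baseline[metric], finetuned[metric]))
```

The distinction matters when reading the results: the exact match gain is about 29 points in absolute terms, which corresponds to a much larger relative improvement over the baseline.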

The plot below shows the accuracy of the model with different types of questions. This shows that VQA performance is highly dependent on clarity of features relevant to the question type:
- Color is not ambiguous, and directly learnable.
- Brand recognition is harder. The model might benefit from including OCR modules for better brand and material recognition.
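A per-question-type breakdown like the one described above can be computed by bucketing questions and averaging exact matches per bucket. The report does not state how questions were categorized, so the keyword rules below are purely hypothetical.

```python
from collections import defaultdict

# Hypothetical rules: a yes/no check on the first word, then keyword matching.
YES_NO_STARTERS = {"is", "are", "does", "do", "can", "has"}
KEYWORD_RULES = [
    ("color", ("color", "colour")),
    ("brand", ("brand", "logo")),
    ("material", ("material", "made of")),
    ("counting", ("how many",)),
    ("shape", ("shape",)),
]


def question_type(question):
    """Assign a coarse category to a question string."""
    q = question.lower().strip()
    words = q.split()
    if words and words[0] in YES_NO_STARTERS:
        return "yes/no"
    for label, keywords in KEYWORD_RULES:
        if any(keyword in q for keyword in keywords):
            return label
    return "other"


def accuracy_by_type(records):
    """records: iterable of (question, prediction, reference) triples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for question, prediction, reference in records:
        label = question_type(question)
        totals[label] += 1
        hits[label] += int(prediction.strip().lower() == reference.strip().lower())
    return {label: hits[label] / totals[label] for label in totals}
```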

We initially evaluated on a smaller subset of the test data. The accuracy increased when we evaluated on the entire test set.

Inference time speedup

We tried two inference optimizations: KV caching and reduced floating-point precision (FP16).

Basic fine-tuned BLIP (FP32): The original fine-tuned BLIP model took 0.127 seconds/image, using standard 32-bit floating-point precision with no inference optimizations.
With KV Cache: Inference took 0.090 seconds/image. Key-value caching lets the model reuse previously computed attention keys and values during generation, reducing redundant computation and speeding up inference.
With FP16 Precision (half-precision inference): Inference took 0.095 seconds/image. Switching to 16-bit floating point (FP16) reduces memory usage and takes advantage of faster tensor operations on GPUs, which improves inference time.
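The per-image timings above can be turned into speedup factors, and measured with a small harness like the one below. This is a sketch under assumptions: seconds_per_item and speedup are illustrative helpers, and the report does not describe how its timings were taken.

```python
import time


def seconds_per_item(fn, items, warmup=1):
    """Average wall-clock seconds per call of fn(item), after a few warmup calls."""
    for item in items[:warmup]:
        fn(item)  # warmup calls are not timed
    start = time.perf_counter()
    for item in items:
        fn(item)
    return (time.perf_counter() - start) / len(items)


def speedup(baseline_seconds, optimized_seconds):
    """How many times faster the optimized setting is than the baseline."""
    return baseline_seconds / optimized_seconds


# Timings reported above, in seconds per image.
print("KV cache:", round(speedup(0.127, 0.090), 2))
print("FP16:", round(speedup(0.127, 0.095), 2))

# Notes for timing a real model (assumed setup, not shown here):
#  - On a GPU, call torch.cuda.synchronize() before reading the clock,
#    since CUDA kernels run asynchronously.
#  - In Hugging Face Transformers, KV caching is controlled by the
#    use_cache argument of model.generate(...), and model.half()
#    converts a PyTorch model to FP16.
```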

Challenges

We initially used only the metadata, without the image, for dataset curation. We observed that the model hallucinated and created questions that could not be answered by simply looking at the image, and baseline evaluation on that data gave low accuracy. We resolved this by also giving the image as input during dataset curation.
We faced API rate limits while using Gemini, especially since we gave the image as an input. We resolved this by using several API keys in rotation.
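Rotating through several API keys can be sketched as below. This is an illustration only: RateLimitError stands in for whatever exception the API client raises, and call_with_key_rotation and request_fn are hypothetical names, not part of the Gemini SDK or of this repository.

```python
import itertools
import time


class RateLimitError(Exception):
    """Stand-in for the rate-limit exception raised by the API client."""


def call_with_key_rotation(request_fn, api_keys, max_attempts=6, backoff_s=0.0):
    """Call request_fn(key), moving to the next key whenever a rate limit is hit.

    request_fn is a hypothetical callable that performs one API request
    with the given key and raises RateLimitError when throttled.
    """
    if not api_keys:
        raise ValueError("at least one API key is required")
    keys = itertools.cycle(api_keys)
    last_error = None
    for attempt in range(max_attempts):
        key = next(keys)
        try:
            return request_fn(key)
        except RateLimitError as err:
            last_error = err
            # Optional linear backoff before trying the next key.
            time.sleep(backoff_s * (attempt + 1))
    raise RuntimeError("all attempts were rate limited") from last_error
```

A small backoff between attempts helps when every key is throttled at once, since cycling immediately would only repeat the failures.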
