Forked from SalesForce's LAVIS repository, this improved version implements Reinforcement Learning to bolster image captioning abilities in the specific domain of remote sensing. On top of optimization through Cross-Entropy loss minimization, a few supplementary Reinforcement Learning epochs are run to guide the model towards more desirable outputs, using well-crafted learning signals. More precisely, Self-Critical Sequence Training (https://arxiv.org/abs/1612.00563), a variant of the REINFORCE algorithm similar in spirit to PPO or GRPO, is used to enforce these learning signals.
Note that SCST can be made compatible with PPO/GRPO, with the caveat that there are no intermediate rewards during the generation of a caption (the full generated caption is required to compute the learning signals).
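As a rough illustration of the idea (not the exact code of this repository), a self-critical update can be sketched as follows; `model.sample` and `model.greedy` are hypothetical helpers, and `reward_fn` stands for any of the learning signals described below:

```python
import torch

def scst_loss(model, images, reward_fn):
    """Minimal SCST sketch: the reward of a sampled caption is baselined by the
    reward of the greedy caption, and the difference scales the log-probability
    of the sampled tokens (REINFORCE with a self-critical baseline)."""
    # Sampled caption (exploration) and its per-token log-probabilities
    sample_ids, sample_logprobs = model.sample(images)   # hypothetical helper
    # Greedy caption used as the "self-critical" baseline (no gradient needed)
    with torch.no_grad():
        greedy_ids, _ = model.greedy(images)              # hypothetical helper

    # Sequence-level rewards: the full caption is needed, there are no intermediate rewards
    r_sample = reward_fn(sample_ids)    # shape: (batch,)
    r_greedy = reward_fn(greedy_ids)    # shape: (batch,)
    advantage = r_sample - r_greedy     # positive if sampling beat the greedy baseline

    # Push up the log-probabilities of advantaged samples (sample_logprobs: (batch, seq_len))
    loss = -(advantage.unsqueeze(1) * sample_logprobs).mean()
    return loss
```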
- SalesForce's LAVIS - Core vision-language model, easily adaptable to RL;
- FACTUAL Scene Graph Extractor - One of the most impactful reward functions is obtained by measuring the closeness between generated captions and ground-truth (human-annotated) captions. FACTUAL extracts scene graphs more accurately than SPICE does, and this reward is computed by comparing the two graphs (a sketch of this reward follows the list below). The difference between the graphs also highlights the objects the model missed and the hallucinations it made.
- Alibaba's Qwen2.5-VL - Core vision-language model, with performance gains similar to LAVIS;
- Applicable to any differentiable model producing logits as output.
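As a hedged sketch (not the repository's exact reward code), a scene-graph-overlap reward could be computed by running the FACTUAL FLAN-T5 parser through Hugging Face transformers and comparing the extracted graphs. The checkpoint id and prompt format below are assumptions taken from the FACTUAL project page, and the F1-style overlap is a simplified stand-in for the SDE/Count signals described later:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed checkpoint id; see https://github.com/zhuang-li/FactualSceneGraph for the released models.
CKPT = "lizhuang144/flan-t5-base-VG-factual-sg"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSeq2SeqLM.from_pretrained(CKPT)

def extract_graph(caption: str) -> set[str]:
    """Parse a caption into a set of scene-graph triplet strings (assumed prompt format)."""
    inputs = tokenizer("Generate Scene Graph: " + caption, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=128)
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    # Rough split of the generated triplet list; good enough for illustration
    return {t.strip(" (),") for t in text.split("),") if t.strip(" (),")}

def graph_overlap_reward(generated: str, reference: str) -> float:
    """F1 overlap between the two graphs; the difference exposes oversights and hallucinations."""
    g, r = extract_graph(generated), extract_graph(reference)
    tp = len(g & r)
    if tp == 0 or not g or not r:
        return 0.0
    precision, recall = tp / len(g), tp / len(r)
    return 2 * precision * recall / (precision + recall)
```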
Examples extracted from RSICD's test split
Caption generated by our best model
- three tennis courts are next to a road and some green trees
Human-created captions
- Several orange and green tennis courts sit side by side beside the road.
- Three tennis courts are surrounded by several buildings and green trees.
- Three tennis courts are surrounded by several buildings and trees.
- Three tennis courts semi-surrounded by some trees and buildings is next to a road.
- Three tennis courts are surrounded by several buildings and trees.
Our model fails to mention the buildings.
Caption generated by our best model
- six white storage tanks are near a road and some green meadows
Human-created captions
- There are seven white columnar tanks near a road with two tracks.
- Seven same cylinder storage tanks are placed on the grass between a forest and a road.
- Seven same cylinder storage tanks are placed on the grass between a forest and a road.
- Seven storage tanks stand alongside the straight road where two trucks are running.
- Seven white storage tanks in two lines are near some green trees.
Our model doesn't count the storage tanks correctly, likely because two of them are only partially visible in the picture. It also fails to mention the two trucks, and mistakes a forest for meadows.
Our model is first evaluated on standard captioning metrics, including BLEU, METEOR, CIDEr, and SPICE. Among these, SPICE is the most correlated with human judgement.
Experiments were conducted on RSICD, UCM-Captions, and on VRSBench.
When evaluated on RSICD using these metrics, our method demonstrates state-of-the-art (SOTA) performance.
CE = Cross-Entropy loss training.
Four image captioning datasets (RemoteCLIP, VRSBench, UCM-Captions, NWPU-Captions) are augmented with the detections of an object detector, the Large Selective Kernel Network, and with the count of each class of detected object.
A new entry, the "count dictionary", in the {"object": count, ...} format, is added to the samples of all four datasets:
- The FACTUAL Scene Graph Extractor is employed to detect objects from ground-truth captions and, when possible, their counts;
- The Large Selective Kernel Network is employed to detect objects in the aerial images and count each class of detected object;
- The dictionaries from both models are fused without redundancy (sketched below). Counts from the ground-truth captions take priority over the object detector's counts.
This improves the SDE learning signal, rooting it firmly in remote sensing, and introduces a Count learning signal.
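A minimal sketch of the fusion step, with hypothetical per-sample dictionaries; counts extracted from the ground-truth captions win over the detector's counts:

```python
def fuse_count_dicts(caption_counts: dict, detector_counts: dict) -> dict:
    """Fuse FACTUAL-derived counts and object-detector counts without redundancy.
    Ground-truth (caption) counts take priority when both sources mention an object."""
    fused = dict(detector_counts)   # start from the detector's counts
    fused.update(caption_counts)    # ground-truth counts overwrite on conflict
    return fused

# Hypothetical example
caption_counts  = {"tennis court": 3, "road": 1}       # from FACTUAL on the captions
detector_counts = {"tennis court": 2, "building": 4}   # from the object detector
print(fuse_count_dicts(caption_counts, detector_counts))
# {'tennis court': 3, 'building': 4, 'road': 1}
```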
WIP
NKL: Negative Kullback-Leibler Divergence. Using a small language model (a pretrained BERT model from spaCy), we compute embeddings for every token of the ground-truth captions and of the generated captions. This yields two distributions of embeddings, which we try to bring closer by minimizing the KL divergence between them. To prevent the learned distribution from collapsing onto the other, which would cause overfitting on the training dataset, we measure this KL divergence only over the last 10,000 token embeddings, and we update this learning signal once every 10,000 tokens (Exponential Moving Average WIP). This can be seen as a soft trust region method.
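As an illustrative sketch only (the diagonal-Gaussian summary and buffer handling below are assumptions, not the repository's exact implementation), the two sets of token embeddings can be summarized as Gaussians and the KL divergence between them computed in closed form:

```python
import torch

def gaussian_kl(gen_embeds: torch.Tensor, ref_embeds: torch.Tensor) -> torch.Tensor:
    """KL( N(mu_g, diag(var_g)) || N(mu_r, diag(var_r)) ) between diagonal Gaussians
    fitted to the generated-token and reference-token embedding buffers
    (e.g. the last 10,000 embeddings of each). Inputs: (num_tokens, dim)."""
    mu_g, var_g = gen_embeds.mean(0), gen_embeds.var(0) + 1e-6
    mu_r, var_r = ref_embeds.mean(0), ref_embeds.var(0) + 1e-6
    kl = 0.5 * (torch.log(var_r / var_g) + (var_g + (mu_g - mu_r) ** 2) / var_r - 1.0)
    return kl.sum()

# The NKL learning signal is the negative of this divergence (higher is better),
# recomputed only every N tokens so it behaves like a soft trust region.
```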
CIDEr: A classic captioning metric that relies on the resemblance between the TF-IDF n-gram vectors of the two sentences being compared.
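For reference, CIDEr can be computed with the pycocoevalcap package commonly used for COCO-style caption evaluation (the image id and captions below are hypothetical, and captions are usually pre-tokenized before scoring):

```python
from pycocoevalcap.cider.cider import Cider

# Per-image dicts: image id -> list of captions
references = {"img_1": ["three tennis courts are surrounded by several buildings and trees"]}
candidates = {"img_1": ["three tennis courts are next to a road and some green trees"]}

corpus_score, per_image_scores = Cider().compute_score(references, candidates)
print(corpus_score)  # CIDEr score; per-image scores can be used directly as SCST rewards
```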
Length: Number of tokens in the generated caption (since the policy loss is minimized, we minimize the negative of the length in order to maximize it).
SDE: Scene Description Exhaustiveness, the proportion of entities from the ground-truth count dictionary(ies) that appear in the generated caption. Its purpose is to get ground-truth objects and counts into the generated captions, aligning mostly, but not entirely, with the expert human annotators and the object detector results; aligning entirely would cause exposure bias, fitting moderately noisy ground-truth captions and omitting crucial elements. When the scene graph of a generated caption is compared with the count dictionary of a ground-truth caption, extracted objects are lemmatized and fuzzy matching is applied to account for synonyms.
SDE computation example:
• Generated caption: There is a forest. (object: forest)
• Ground-truth caption 1: There is a forest and a river. (objects: forest, river)
• Ground-truth caption 2: There is a forest, a river and a road. (objects: forest, river, road)
Objects detected in the human-annotated (ground-truth) captions: forest, river, road (3 objects)
Object detected in the model's output caption: forest (1 object)
Therefore, the SDE score is 1/3 in this example.
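A minimal sketch of the SDE computation on the example above; the fuzzy-matching choice here (difflib ratio with a 0.8 threshold) is an illustrative assumption, and lemmatization is assumed to happen upstream:

```python
from difflib import SequenceMatcher

def sde_score(generated_objects: list, ground_truth_counts: dict, threshold: float = 0.8) -> float:
    """Proportion of ground-truth objects that are (fuzzily) mentioned in the generated caption."""
    def match(a: str, b: str) -> bool:
        # Fuzzy matching to tolerate synonyms and inflections
        return SequenceMatcher(None, a, b).ratio() >= threshold

    covered = sum(
        any(match(gt_obj, gen_obj) for gen_obj in generated_objects)
        for gt_obj in ground_truth_counts
    )
    return covered / len(ground_truth_counts) if ground_truth_counts else 0.0

# Example from above: the generated caption mentions only "forest", the ground truth has 3 objects
print(sde_score(["forest"], {"forest": 1, "river": 1, "road": 1}))  # 0.333...
```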
Counting Accuracy: Once objects from the generated and ground-truth captions are aligned, and if they have an associated count, the average (over objects) absolute and signed errors (whether an object's count is above, equal to, or below the ground-truth count) are computed and used as learning signals.
Example:
• Generated caption: There are four planes and three buildings.
• Ground-truth caption 1: There are five planes and one building.
The absolute counting error is (|5 - 4| + |3 - 1|) / 2 = 1.5. The signed counting error is ((4 - 5) + (3 - 1)) / 2 = 0.5.
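A minimal sketch of the absolute and signed count errors on the example above (object alignment is assumed to have already been done):

```python
def count_errors(generated_counts: dict, ground_truth_counts: dict):
    """Average absolute and signed count errors over objects present in both dictionaries."""
    common = [obj for obj in generated_counts if obj in ground_truth_counts]
    if not common:
        return 0.0, 0.0
    abs_err = sum(abs(ground_truth_counts[o] - generated_counts[o]) for o in common) / len(common)
    signed_err = sum(generated_counts[o] - ground_truth_counts[o] for o in common) / len(common)
    return abs_err, signed_err

# Example from above
print(count_errors({"plane": 4, "building": 3}, {"plane": 5, "building": 1}))  # (1.5, 0.5)
```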
RSICD Dataset
The up and down arrows next to the scores indicate the direction in which a score should move to improve. For instance, -2.62% in the "oversights" column on the first line follows the arrow's direction, meaning it is moving the right way. However, the code fails to address hallucinations: this is probably caused by the relatively short length of the captions. VRSBench has longer, more expressive captions than the other datasets and an elaborate vocabulary, making it the ideal candidate for testing hallucination reduction; this does not work on datasets with shorter sentences, as the amount of hallucinations tends to increase with caption length.
UCM Dataset
Our method seems even more effective on the UCM dataset. This might be because the dataset is quite small and contains many duplicate captions.
VRSBench
VRSBench captions are particularly long, which allows us to demonstrate the effectiveness of our approach in decreasing hallucinations without affecting its overall ability to decrease oversights.
Another loss term, termed V/E for Varentropy/Entropy, is jointly minimized with the policy loss. Inspired by Entropix, the point is to balance diverse vocabulary usage (high entropy) against consistent token distributions (low varentropy). This significantly limits degenerate generated token distributions while encouraging vocabulary exploration, which expands the model's vocabulary by taking inspiration from the human-annotated captions.
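A sketch of how entropy and varentropy can be computed from the per-step token logits; combining them into an auxiliary loss is the idea described above, though the exact weighting used in this repository is not shown here and the weights below are illustrative:

```python
import torch
import torch.nn.functional as F

def varentropy_entropy_loss(logits: torch.Tensor, alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """logits: (batch, seq_len, vocab). Encourage high entropy (diverse vocabulary)
    while penalizing high varentropy (inconsistent token distributions)."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Shannon entropy of each generation step's token distribution
    entropy = -(probs * log_probs).sum(dim=-1)                                  # (batch, seq_len)
    # Varentropy: variance of the surprisal (-log p) under the same distribution
    varentropy = (probs * (-log_probs - entropy.unsqueeze(-1)) ** 2).sum(dim=-1)
    # Minimizing this term pushes entropy up and varentropy down
    return (beta * varentropy - alpha * entropy).mean()
```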
A simple ablation study, in which two models are trained under the same conditions but one with CIDEr only and the other with CIDEr and V/E, demonstrates a slight improvement in terms of oversight reduction, BLEU, METEOR, and CIDEr.
- Clone the present repository (installing the original LAVIS repository instead would require multiple precise code modifications, which have already been made in this very repository).
- After cloning this repository, create an environment, activate it, and install the libraries from requirements.txt. PYTHON 3.9+ REQUIRED
conda create --name lavis_rl python=3.9
conda activate lavis_rl
pip install -r requirements.txt

The following is crucial for the "Object Proportion" (SDE) learning signal to work:

pip install FactualSceneGraph

OR choose a pretrained model from Hugging Face: https://github.com/zhuang-li/FactualSceneGraph
The training configuration for captioning can be found here: lavis/projects/blip2/train/caption_rs_ft.yaml
Alternative frozen vision encoders can be used with BLIP2. They can be found in lavis/configs/models/blip2.
The .yaml file for dataset configuration may be found here: lavis/configs/datasets/rs/defaults_cap.yaml. The image folder must contain every image from the dataset, regardless of the split they belong to. The JSON files containing the captions for the train, val and test splits must be in COCO format.
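For reference, the standard COCO captions layout expected in each split JSON looks roughly like the following (shown as a Python dict for readability; file names, ids, and captions are hypothetical, and additional fields may be present in real files):

```python
# Sketch of the COCO-style captions structure; the actual files are JSON with the same layout.
coco_split = {
    "images": [
        {"id": 1, "file_name": "airport_1.jpg"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "caption": "many planes are parked next to a long building"},
    ],
}
```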
Object-detector-based "pseudo-captioning" can be activated by editing lines 48 and 72 of lavis/datasets/datasets/rs.py. This can slightly improve performance.
In case you need to modify the dataset config, edit this code: lavis/datasets/builders/rs_caption.py.
Finally, set the paths to your val and test JSON files in lavis/tasks/captioning.py, lines 138-139.
Once everything is correctly installed and configured, run the following command:
python train.py --cfg-path your_main_folder/LAVIS/lavis/projects/blip2/train/caption_rs_ft.yaml --model_name eva_clip_g_plus

Weights for the best InstructBLIP model we have obtained: https://huggingface.co/tdujardin/InstructBLIP_RS_RL/tree/main
The "rewards.py" registry of learning signals may be found in InstructBLIP_SCST/lavis/tasks/rewards.py
This repository contains code derived from SalesForce's LAVIS which is licensed under the BSD 3-Clause License, and code from FACTUAL which is licensed under the MIT License.
- New contributions to this repository are licensed under the MIT License.
- Portions derived from LAVIS remain under the BSD 3-Clause License.
- The sentence object extractor, FACTUAL, is licensed under the MIT License.
We extend our gratitude to SalesForce for developing the LAVIS repository, which provides an easy-to-use library of Vision-Language models. Implementing Reinforcement Learning was made significantly easier by their work.
Additionally, one of our main learning signals for RL was based on FACTUAL, a finetuned FLAN-T5 model that extracts scene graphs.







