Forked from SalesForce's LAVIS repository, this improved version implements Reinforcement Learning to bolster image captioning abilities in the specific domain of remote sensing. On top of optimization through Cross-Entropy loss minimization, a few supplementary Reinforcement Learning epochs are run to guide the model towards more desirable outputs, using well-crafted learning signals. More precisely, Self-Critical Sequence Training (https://arxiv.org/abs/1612.00563), a variant of the REINFORCE algorithm similar in spirit to PPO or GRPO, is used to enforce these learning signals.
Note that SCST can be made compatible with PPO/GRPO, with the caveat that there are no intermediate rewards during the generation of a caption (the full generated caption is required to compute the learning signals).
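As a rough illustration of the idea (not the exact code of this repository), a self-critical update can be sketched as follows; `model.sample` and `model.greedy` are hypothetical helpers, and `reward_fn` stands for any of the learning signals described below:

```python
import torch

def scst_loss(model, images, reward_fn):
    """Minimal SCST sketch: the reward of a sampled caption is baselined by the
    reward of the greedy caption, and the difference scales the log-probability
    of the sampled tokens (REINFORCE with a self-critical baseline)."""
    # Sampled caption (exploration) and its per-token log-probabilities
    sample_ids, sample_logprobs = model.sample(images)   # hypothetical helper
    # Greedy caption used as the "self-critical" baseline (no gradient needed)
    with torch.no_grad():
        greedy_ids, _ = model.greedy(images)              # hypothetical helper

    # Sequence-level rewards: the full caption is needed, there are no intermediate rewards
    r_sample = reward_fn(sample_ids)    # shape: (batch,)
    r_greedy = reward_fn(greedy_ids)    # shape: (batch,)
    advantage = r_sample - r_greedy     # positive if sampling beat the greedy baseline

    # Push up the log-probabilities of advantaged samples (sample_logprobs: (batch, seq_len))
    loss = -(advantage.unsqueeze(1) * sample_logprobs).mean()
    return loss
```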
- SalesForce's LAVIS - Core vision-language model, easily adaptable to RL;
- FACTUAL Scene Graph Extractor - One of the most impactful reward functions is obtained by measuring the closeness between generated captions and ground-truth (human-annotated) captions. FACTUAL extracts scene graphs more accurately than SPICE does, and this reward is computed by comparing the two graphs (a sketch of this reward follows the list below). The difference between the graphs also highlights the objects the model missed and the hallucinations it made.
- Alibaba's Qwen2.5-VL - Core vision-language model, with performance gains similar to LAVIS;
- Applicable to any differentiable model producing logits as output.
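As a hedged sketch (not the repository's exact reward code), a scene-graph-overlap reward could be computed by running the FACTUAL FLAN-T5 parser through Hugging Face transformers and comparing the extracted graphs. The checkpoint id and prompt format below are assumptions taken from the FACTUAL project page, and the F1-style overlap is a simplified stand-in for the SDE/Count signals described later:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed checkpoint id; see https://github.com/zhuang-li/FactualSceneGraph for the released models.
CKPT = "lizhuang144/flan-t5-base-VG-factual-sg"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSeq2SeqLM.from_pretrained(CKPT)

def extract_graph(caption: str) -> set[str]:
    """Parse a caption into a set of scene-graph triplet strings (assumed prompt format)."""
    inputs = tokenizer("Generate Scene Graph: " + caption, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=128)
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    # Rough split of the generated triplet list; good enough for illustration
    return {t.strip(" (),") for t in text.split("),") if t.strip(" (),")}

def graph_overlap_reward(generated: str, reference: str) -> float:
    """F1 overlap between the two graphs; the difference exposes oversights and hallucinations."""
    g, r = extract_graph(generated), extract_graph(reference)
    tp = len(g & r)
    if tp == 0 or not g or not r:
        return 0.0
    precision, recall = tp / len(g), tp / len(r)
    return 2 * precision * recall / (precision + recall)
```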
Examples extracted from RSICD's test split
Caption generated by our best model
- three tennis courts are next to a road and some green trees
Human-created captions
- Several orange and green tennis courts sit side by side beside the road.
- Three tennis courts are surrounded by several buildings and green trees.
- Three tennis courts are surrounded by several buildings and trees.
- Three tennis courts semi-surrounded by some trees and buildings is next to a road.
- Three tennis courts are surrounded by several buildings and trees.
Our model fails to mention the buildings.
Caption generated by our best model
- six white storage tanks are near a road and some green meadows
Human-created captions
- There are seven white columnar tanks near a road with two tracks.
- Seven same cylinder storage tanks are placed on the grass between a forest and a road.
- Seven same cylinder storage tanks are placed on the grass between a forest and a road.
- Seven storage tanks stand alongside the straight road where two trucks are running.
- Seven white storage tanks in two lines are near some green trees.
Our model doesn't count the storage tanks correctly, likely because two of them are only partially visible in the picture. It also fails to mention the two trucks, and mistakes a forest for meadows.
Our model is first evaluated on standard captioning metrics, including BLEU, METEOR, CIDEr, and SPICE. Among these, SPICE is the most correlated with human judgement.
Experiments were conducted on RSICD, UCM-Captions, and on VRSBench.
When evaluated on RSICD using these metrics, our method demonstrates state-of-the-art (SOTA) performance.
CE = Cross-Entropy loss training.
Four image captioning datasets (RemoteCLIP, VRSBench, UCM-Captions, NWPU-Captions) are augmented with the detections of an object detector, the Large Selective Kernel Network, and with the count of each class of detected object.
A new entry, the "count dictionary", in the {"object": count, ...} format, is added to the samples of all four datasets:
- The FACTUAL Scene Graph Extractor is employed to detect objects from ground-truth captions and, when possible, their counts;
- The Large Selective Kernel Network is employed to detect objects in the aerial images and count each class of detected object;
- The dictionaries from both models are fused without redundancy (sketched below). Counts from the ground-truth captions take priority over the object detector's counts.
This improves the SDE learning signal, rooting it firmly in remote sensing, and introduces a Count learning signal.
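A minimal sketch of the fusion step, with hypothetical per-sample dictionaries; counts extracted from the ground-truth captions win over the detector's counts:

```python
def fuse_count_dicts(caption_counts: dict, detector_counts: dict) -> dict:
    """Fuse FACTUAL-derived counts and object-detector counts without redundancy.
    Ground-truth (caption) counts take priority when both sources mention an object."""
    fused = dict(detector_counts)   # start from the detector's counts
    fused.update(caption_counts)    # ground-truth counts overwrite on conflict
    return fused

# Hypothetical example
caption_counts  = {"tennis court": 3, "road": 1}       # from FACTUAL on the captions
detector_counts = {"tennis court": 2, "building": 4}   # from the object detector
print(fuse_count_dicts(caption_counts, detector_counts))
# {'tennis court': 3, 'building': 4, 'road': 1}
```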
WIP
NKL: Negative Kullback-Leibler Divergence. Using a small language model (a pretrained BERT model from spaCy), we compute embeddings for every token of the ground-truth captions and of the generated captions. This yields two distributions of embeddings, which we try to bring closer by minimizing the KL divergence between them. To prevent the learned distribution from collapsing onto the other, which would cause overfitting on the training dataset, we measure this KL divergence only over the last 10,000 token embeddings, and we update this learning signal once every 10,000 tokens (Exponential Moving Average WIP). This can be seen as a soft trust region method.
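As an illustrative sketch only (the diagonal-Gaussian summary and buffer handling below are assumptions, not the repository's exact implementation), the two sets of token embeddings can be summarized as Gaussians and the KL divergence between them computed in closed form:

```python
import torch

def gaussian_kl(gen_embeds: torch.Tensor, ref_embeds: torch.Tensor) -> torch.Tensor:
    """KL( N(mu_g, diag(var_g)) || N(mu_r, diag(var_r)) ) between diagonal Gaussians
    fitted to the generated-token and reference-token embedding buffers
    (e.g. the last 10,000 embeddings of each). Inputs: (num_tokens, dim)."""
    mu_g, var_g = gen_embeds.mean(0), gen_embeds.var(0) + 1e-6
    mu_r, var_r = ref_embeds.mean(0), ref_embeds.var(0) + 1e-6
    kl = 0.5 * (torch.log(var_r / var_g) + (var_g + (mu_g - mu_r) ** 2) / var_r - 1.0)
    return kl.sum()

# The NKL learning signal is the negative of this divergence (higher is better),
# recomputed only every N tokens so it behaves like a soft trust region.
```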
CIDEr: A classic captioning metric that relies on the resemblance between the TF-IDF n-gram vectors of the two sentences being compared.
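For reference, CIDEr can be computed with the pycocoevalcap package commonly used for COCO-style caption evaluation (the image id and captions below are hypothetical, and captions are usually pre-tokenized before scoring):

```python
from pycocoevalcap.cider.cider import Cider

# Per-image dicts: image id -> list of captions
references = {"img_1": ["three tennis courts are surrounded by several buildings and trees"]}
candidates = {"img_1": ["three tennis courts are next to a road and some green trees"]}

corpus_score, per_image_scores = Cider().compute_score(references, candidates)
print(corpus_score)  # CIDEr score; per-image scores can be used directly as SCST rewards
```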
Length: Number of tokens in the generated caption (since the policy loss is minimized, we minimize the negative of the length in order to maximize it).
SDE: Scene Description Exhaustiveness, the proportion of entities from the ground-truth count dictionary(ies) that appear in the generated caption. Its purpose is to get ground-truth objects and counts into the generated captions, aligning mostly, but not entirely, with the expert human annotators and the object detector results; aligning entirely would cause exposure bias, fitting moderately noisy ground-truth captions and omitting crucial elements. When the scene graph of a generated caption is compared with the count dictionary of a ground-truth caption, extracted objects are lemmatized and fuzzy matching is applied to account for synonyms.
SDE computation example:
• Generated caption: There is a forest. (object: forest)
• Ground-truth caption 1: There is a forest and a river. (objects: forest, river)
• Ground-truth caption 2: There is a forest, a river and a road. (objects: forest, river, road)
Objects detected in the human-annotated (ground-truth) captions: forest, river, road (3 objects)
Object detected in the model's output caption: forest (1 object)
Therefore, the SDE score is 1/3 in this example.
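A minimal sketch of the SDE computation on the example above; the fuzzy-matching choice here (difflib ratio with a 0.8 threshold) is an illustrative assumption, and lemmatization is assumed to happen upstream:

```python
from difflib import SequenceMatcher

def sde_score(generated_objects: list, ground_truth_counts: dict, threshold: float = 0.8) -> float:
    """Proportion of ground-truth objects that are (fuzzily) mentioned in the generated caption."""
    def match(a: str, b: str) -> bool:
        # Fuzzy matching to tolerate synonyms and inflections
        return SequenceMatcher(None, a, b).ratio() >= threshold

    covered = sum(
        any(match(gt_obj, gen_obj) for gen_obj in generated_objects)
        for gt_obj in ground_truth_counts
    )
    return covered / len(ground_truth_counts) if ground_truth_counts else 0.0

# Example from above: the generated caption mentions only "forest", the ground truth has 3 objects
print(sde_score(["forest"], {"forest": 1, "river": 1, "road": 1}))  # 0.333...
```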
Counting Accuracy: Once objects from the generated and ground-truth captions are aligned, and if they have an associated count, the average (over objects) absolute and signed errors (whether an object's count is above, equal to, or below the ground-truth count) are computed and used as learning signals.
Example:
• Generated caption: There are four planes and three buildings.
• Ground-truth caption 1: There are five planes and one building.
The absolute counting error is (|5 - 4| + |3 - 1|) / 2 = 1.5. The signed counting error is ((4 - 5) + (3 - 1)) / 2 = 0.5.
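A minimal sketch of the absolute and signed count errors on the example above (object alignment is assumed to have already been done):

```python
def count_errors(generated_counts: dict, ground_truth_counts: dict):
    """Average absolute and signed count errors over objects present in both dictionaries."""
    common = [obj for obj in generated_counts if obj in ground_truth_counts]
    if not common:
        return 0.0, 0.0
    abs_err = sum(abs(ground_truth_counts[o] - generated_counts[o]) for o in common) / len(common)
    signed_err = sum(generated_counts[o] - ground_truth_counts[o] for o in common) / len(common)
    return abs_err, signed_err

# Example from above
print(count_errors({"plane": 4, "building": 3}, {"plane": 5, "building": 1}))  # (1.5, 0.5)
```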
RSICD Dataset
The up and down arrows next to the scores indicate the direction in which a score should move to improve. For instance, -2.62% in the "oversights" column on the first line follows the arrow's direction, meaning it is moving the right way. However, the code fails to address hallucinations: this is probably caused by the relatively short length of the captions. VRSBench has longer, more expressive captions than the other datasets and an elaborate vocabulary, making it the ideal candidate for testing hallucination reduction; this does not work on datasets with shorter sentences, as the amount of hallucinations tends to increase with caption length.
UCM Dataset
Our method seems even more effective on the UCM dataset. This might be because the dataset is quite small and contains many duplicate captions.
VRSBench
VRSBench captions are particularly long, which allows us to demonstrate the effectiveness of our approach in decreasing hallucinations without affecting its overall ability to decrease oversights.
Another loss term, termed V/E for Varentropy/Entropy, is jointly minimized with the policy loss. Inspired by Entropix, the point is to balance diverse vocabulary usage (high entropy) against consistent token distributions (low varentropy). This significantly limits degenerate generated token distributions while encouraging vocabulary exploration, which expands the model's vocabulary by taking inspiration from the human-annotated captions.
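A sketch of how entropy and varentropy can be computed from the per-step token logits; combining them into an auxiliary loss is the idea described above, though the exact weighting used in this repository is not shown here and the weights below are illustrative:

```python
import torch
import torch.nn.functional as F

def varentropy_entropy_loss(logits: torch.Tensor, alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """logits: (batch, seq_len, vocab). Encourage high entropy (diverse vocabulary)
    while penalizing high varentropy (inconsistent token distributions)."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Shannon entropy of each generation step's token distribution
    entropy = -(probs * log_probs).sum(dim=-1)                                  # (batch, seq_len)
    # Varentropy: variance of the surprisal (-log p) under the same distribution
    varentropy = (probs * (-log_probs - entropy.unsqueeze(-1)) ** 2).sum(dim=-1)
    # Minimizing this term pushes entropy up and varentropy down
    return (beta * varentropy - alpha * entropy).mean()
```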
A simple ablation study, in which two models are trained under the same conditions but one with CIDEr only and the other with CIDEr and V/E, demonstrates a slight improvement in terms of oversight reduction, BLEU, METEOR, and CIDEr.
- Clone the present repository (installing the original LAVIS repository instead would require multiple precise code modifications, which have already been made in this very repository).
- After cloning this repository, create an environment, activate it, and install the libraries from requirements.txt. PYTHON 3.9+ REQUIRED
conda create --name lavis_rl python=3.9
conda activate lavis_rl
pip install -r requirements.txt

The following is crucial for the "Object Proportion" (SDE) learning signal to work:

pip install FactualSceneGraph

OR choose a pretrained model from Hugging Face: https://github.com/zhuang-li/FactualSceneGraph
The training configuration for captioning can be found here: lavis/projects/blip2/train/caption_rs_ft.yaml
Alternative frozen vision encoders can be used with BLIP2. They can be found in lavis/configs/models/blip2.
The .yaml file for dataset configuration may be found here: lavis/configs/datasets/rs/defaults_cap.yaml. The image folder must contain every image from the dataset, regardless of the split they belong to. The JSON files containing the captions for the train, val and test splits must be in COCO format.
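For reference, the standard COCO captions layout expected in each split JSON looks roughly like the following (shown as a Python dict for readability; file names, ids, and captions are hypothetical, and additional fields may be present in real files):

```python
# Sketch of the COCO-style captions structure; the actual files are JSON with the same layout.
coco_split = {
    "images": [
        {"id": 1, "file_name": "airport_1.jpg"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "caption": "many planes are parked next to a long building"},
    ],
}
```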
Object-detector-based "pseudo-captioning" can be activated by editing lines 48 and 72 of lavis/datasets/datasets/rs.py. This can slightly improve performance.
In case you need to modify the dataset config, edit this code: lavis/datasets/builders/rs_caption.py.
Finally, set the paths to your val and test JSON files in lavis/tasks/captioning.py, lines 138-139.
Once everything is correctly installed and configured, run the following command:
python train.py --cfg-path your_main_folder/LAVIS/lavis/projects/blip2/train/caption_rs_ft.yaml --model_name eva_clip_g_plus

Weights for the best InstructBLIP model we have obtained: https://huggingface.co/tdujardin/InstructBLIP_RS_RL/tree/main
The "rewards.py" registry of learning signals may be found in InstructBLIP_SCST/lavis/tasks/rewards.py
This repository contains code derived from SalesForce's LAVIS which is licensed under the BSD 3-Clause License, and code from FACTUAL which is licensed under the MIT License.
- New contributions to this repository are licensed under the MIT License.
- Portions derived from LAVIS remain under the BSD 3-Clause License.
- The sentence object extractor, FACTUAL, is licensed under the MIT License.
We extend our gratitude to SalesForce for developing the LAVIS repository, which provides an easy-to-use library of Vision-Language models. Implementing Reinforcement Learning was made significantly easier by their work.
Additionally, one of our main learning signals for RL was based on FACTUAL, a finetuned FLAN-T5 model that extracts scene graphs.







