We distribute models pretrained on Conceptual Captions. We share ViLBERT, LXMERT and VL-BERT pretrained as originally presented in their papers, as well as the weights for ViLBERT, LXMERT, VL-BERT, VisualBERT and UNITER pretrained in our controlled setup. For the latter, we distribute the weights that lead to the highest average downstream performance when fine-tuned once.
| Model | VQAv2 | RefCOCO+ | NLVR2 | Flickr30k IR | Flickr30k TR |
|---|---|---|---|---|---|
| ViLBERT | 66.68 | 70.49 | 74.26 | 58.90 | 75.50 |
| LXMERT | 67.98 | 71.58 | | | |
| VL-BERT | 67.44 | 71.00 | | | |
| ViLBERT (CTRL) | 68.97 | 70.53 | 72.24 | 60.34 | 78.80 |
| LXMERT (CTRL) | 67.52 | 70.49 | 71.09 | 58.62 | 74.90 |
| VL-BERT (CTRL) | 68.23 | 71.23 | 73.22 | 57.62 | 70.90 |
| VisualBERT (CTRL) | 69.03 | 70.02 | 72.70 | 61.48 | 75.20 |
| UNITER (CTRL) | 68.67 | 71.45 | 73.73 | 60.54 | 76.40 |
All the models pretrained with 10 random seeds in our controlled setup can be downloaded from here.
| Model | Source |
|---|---|
| LXMERT (Original) | airsplay/lxmert |
| Model | XVNLI | xGQA | MaRVL | xFlickr&CO IR | xFlickr&CO TR | WIT IR | WIT TR |
|---|---|---|---|---|---|---|---|
| mUNITER | 53.69 | 9.97 | 53.72 | 8.06 | 8.86 | 9.16 | 10.48 |
| xUNITER | 58.48 | 21.72 | 54.59 | 14.04 | 13.51 | 8.72 | 9.81 |
| UC2 | 62.05 | 29.35 | 57.28 | 20.31 | 17.89 | 7.83 | 9.09 |
| M3P | 58.25 | 28.17 | 56.00 | 12.91 | 11.90 | 8.12 | 9.98 |
Models are defined in configuration files (see `config/` for some examples). Rather than using whole Transformer layers, we specify attention and feed-forward sub-layers for each modality, which makes it quick to extend proposed architectures. In particular, the following sub-layers are defined:
- `tt_attn_sublayers`: text-text attention sub-layers
- `tv_attn_sublayers`: text-vision attention sub-layers (text used as query, vision as context)
- `vt_attn_sublayers`: vision-text attention sub-layers (vision used as query, text as context)
- `vv_attn_sublayers`: vision-vision attention sub-layers
- `t_ff_sublayers`: feed-forward sub-layers for the text modality
- `v_ff_sublayers`: feed-forward sub-layers for the vision modality
In addition, the following parameters control parameter sharing across modalities:
- `shared_sublayers`: sub-layers that share parameters between modalities
- `single_ln_sublayers`: sub-layers in which text and vision tensors are concatenated and fed into a single LayerNorm (LN) layer
Finally, `bert_layer2attn_sublayer` and `bert_layer2ff_sublayer` are used to load text-only BERT layers into VOLTA ones.
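To make the sub-layer fields more concrete, here is a minimal Python sketch of how such a configuration could be inspected. It rests on several assumptions that are not confirmed by this README: that each field holds a list of integer sub-layer indices, that the two `bert_layer2*` fields map BERT layer numbers to sub-layer indices, and that those indices should point to text sub-layers. The index values in `toy_config` are made up for illustration and are not copied from any file in `config/`; `describe_sublayers` and `check_bert_mapping` are hypothetical helpers, not part of the VOLTA codebase.

```python
# Hypothetical toy configuration. The index values are illustrative only
# and do not reproduce any real model definition from config/.
toy_config = {
    "tt_attn_sublayers": [0, 2, 4],
    "tv_attn_sublayers": [4],
    "vt_attn_sublayers": [4],
    "vv_attn_sublayers": [4],
    "t_ff_sublayers": [1, 3, 5],
    "v_ff_sublayers": [5],
    "shared_sublayers": [],
    "single_ln_sublayers": [],
    # Assumed format: BERT layer number (as a string) -> sub-layer index.
    "bert_layer2attn_sublayer": {"0": 0, "1": 2, "2": 4},
    "bert_layer2ff_sublayer": {"0": 1, "1": 3, "2": 5},
}

# The six sub-layer fields listed above, in the same order.
SUBLAYER_KEYS = [
    "tt_attn_sublayers",
    "tv_attn_sublayers",
    "vt_attn_sublayers",
    "vv_attn_sublayers",
    "t_ff_sublayers",
    "v_ff_sublayers",
]


def describe_sublayers(config):
    """Return {sub-layer index: list of fields that include that index}."""
    layout = {}
    for key in SUBLAYER_KEYS:
        for idx in config.get(key, []):
            layout.setdefault(idx, []).append(key)
    return dict(sorted(layout.items()))


def check_bert_mapping(config):
    """List BERT layers whose target is not a text sub-layer (assumed rule)."""
    problems = []
    for layer, idx in config["bert_layer2attn_sublayer"].items():
        if idx not in config["tt_attn_sublayers"]:
            problems.append(f"BERT layer {layer}: attention target {idx}")
    for layer, idx in config["bert_layer2ff_sublayer"].items():
        if idx not in config["t_ff_sublayers"]:
            problems.append(f"BERT layer {layer}: feed-forward target {idx}")
    return problems


if __name__ == "__main__":
    for idx, keys in describe_sublayers(toy_config).items():
        print(idx, keys)
    print("mapping problems:", check_bert_mapping(toy_config))
```

In this toy layout, the first sub-layers are text-only, while the later ones list both modalities, which is one way a dual-stream model with late cross-modal attention might be expressed. The actual files in `config/` remain the authoritative reference for field formats and values.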
The following figure shows how these sub-layers are used to construct ViLBERT:
