From 00c4fb4d796e543b731126687b0142e8c9f28f59 Mon Sep 17 00:00:00 2001 From: ChengchaoShen Date: Fri, 27 Jun 2025 14:46:15 +0800 Subject: [PATCH] add two mllm papers --- README.md | 96 ++++++++++++++++++++++++++++--------------------------- 1 file changed, 49 insertions(+), 47 deletions(-) diff --git a/README.md b/README.md index 5751a78..f988f2f 100644 --- a/README.md +++ b/README.md @@ -116,6 +116,7 @@ This is the first work to correct hallucination in multimodal large language mod ## Multimodal Instruction Tuning | Title | Venue | Date | Code | Demo | |:--------|:--------:|:--------:|:--------:|:--------:| +| ![Star](https://img.shields.io/github/stars/visresearch/LLaVA-STF.svg?style=social&label=Star)
[**Learning Compact Vision Tokens for Efficient Large Multimodal Models**](https://arxiv.org/abs/2506.07138)
| arXiv | 2025-06-27 | [Github](https://github.com/visresearch/LLaVA-STF) | - | | ![Star](https://img.shields.io/github/stars/EvolvingLMMs-Lab/multimodal-search-r1.svg?style=social&label=Star)
[**MMSearch-R1: Incentivizing LMMs to Search**](https://arxiv.org/pdf/2506.20670)
| arXiv | 2025-06-25 | [Github](https://github.com/EvolvingLMMs-Lab/multimodal-search-r1) | - | | ![Star](https://img.shields.io/github/stars/showlab/Show-o.svg?style=social&label=Star)
[**Show-o2: Improved Native Unified Multimodal Models**](https://arxiv.org/pdf/2506.15564)
| arXiv | 2025-06-18 | [Github](https://github.com/showlab/Show-o) | - | | [**Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities**](https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf) | Google | 2025-06-17 | - | - | @@ -140,7 +141,7 @@ This is the first work to correct hallucination in multimodal large language mod | [**Addendum to GPT-4o System Card: Native image generation**](https://cdn.openai.com/11998be9-5319-4302-bfbf-1167e093f1fb/Native_Image_Generation_System_Card.pdf) | OpenAI | 2025-03-25 | - | - | | ![Star](https://img.shields.io/github/stars/VITA-MLLM/Sparrow.svg?style=social&label=Star)
[**Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation**](https://arxiv.org/pdf/2411.19951)
| arXiv | 2025-03-17 | [Github](https://github.com/VITA-MLLM/Sparrow) | - | | [**Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision**](https://arxiv.org/pdf/2503.01879) | arXiv | 2025-03-07 | - | - | -| [**Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs**](https://arxiv.org/pdf/2503.01743) | arXiv | 2025-03-03 | [Hugging Face](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) | [Demo](https://huggingface.co/spaces/microsoft/phi-4-multimodal) | +| [**Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs**](https://arxiv.org/pdf/2503.01743) | arXiv | 2025-03-03 | [Hugging Face](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) | [Demo](https://huggingface.co/spaces/microsoft/phi-4-multimodal) | | ![Star](https://img.shields.io/github/stars/VITA-MLLM/Long-VITA.svg?style=social&label=Star)
[**Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy**](https://arxiv.org/pdf/2502.05177)
| arXiv | 2025-02-19 | [Github](https://github.com/VITA-MLLM/Long-VITA) | - | | ![Star](https://img.shields.io/github/stars/QwenLM/Qwen2.5-VL.svg?style=social&label=Star)
[**Qwen2.5-VL Technical Report**](https://arxiv.org/pdf/2502.13923)
| arXiv | 2025-02-19 | [Github](https://github.com/QwenLM/Qwen2.5-VL) | [Demo](https://huggingface.co/spaces/Qwen/Qwen2.5-VL) | | ![Star](https://img.shields.io/github/stars/baichuan-inc/Baichuan-Omni-1.5.svg?style=social&label=Star)
[**Baichuan-Omni-1.5 Technical Report**](https://github.com/baichuan-inc/Baichuan-Omni-1.5/blob/main/baichuan_omni_1_5.pdf)
| Tech Report | 2025-01-26 | [Github](https://github.com/baichuan-inc/Baichuan-Omni-1.5) | Local Demo | @@ -157,20 +158,20 @@ This is the first work to correct hallucination in multimodal large language mod | ![Star](https://img.shields.io/github/stars/NVlabs/VILA.svg?style=social&label=Star)
[**NVILA: Efficient Frontier Visual Language Models**](https://arxiv.org/pdf/2412.04468)
| arXiv | 2024-12-05 | [Github](https://github.com/NVlabs/VILA) | [Demo](https://vila.mit.edu) | | ![Star](https://img.shields.io/github/stars/inst-it/inst-it.svg?style=social&label=Star)
[**Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning**](https://arxiv.org/pdf/2412.03565)
| arXiv | 2024-12-04 | [Github](https://github.com/inst-it/inst-it) | - | | ![Star](https://img.shields.io/github/stars/TimeMarker-LLM/TimeMarker.svg?style=social&label=Star)
[**TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability**](https://arxiv.org/pdf/2411.18211)
| arXiv | 2024-11-27 | [Github](https://github.com/TimeMarker-LLM/TimeMarker/) | - | -| ![Star](https://img.shields.io/github/stars/IDEA-Research/ChatRex.svg?style=social&label=Star)
[**ChatRex: Taming Multimodal LLM for Joint Perception and Understanding**](https://arxiv.org/pdf/2411.18363)
| arXiv | 2024-11-27 | [Github](https://github.com/IDEA-Research/ChatRex) | Local Demo | +| ![Star](https://img.shields.io/github/stars/IDEA-Research/ChatRex.svg?style=social&label=Star)
[**ChatRex: Taming Multimodal LLM for Joint Perception and Understanding**](https://arxiv.org/pdf/2411.18363)
| arXiv | 2024-11-27 | [Github](https://github.com/IDEA-Research/ChatRex) | Local Demo | | ![Star](https://img.shields.io/github/stars/Vision-CAIR/LongVU.svg?style=social&label=Star)
[**LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding**](https://arxiv.org/pdf/2410.17434)
| arXiv | 2024-10-22 | [Github](https://github.com/Vision-CAIR/LongVU) | [Demo](https://huggingface.co/spaces/Vision-CAIR/LongVU) | | ![Star](https://img.shields.io/github/stars/shikiw/Modality-Integration-Rate.svg?style=social&label=Star)
[**Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate**](https://arxiv.org/pdf/2410.07167)
| arXiv | 2024-10-09 | [Github](https://github.com/shikiw/Modality-Integration-Rate) | - | | ![Star](https://img.shields.io/github/stars/rese1f/aurora.svg?style=social&label=Star)
[**AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark**](https://arxiv.org/pdf/2410.03051)
| arXiv | 2024-10-04 | [Github](https://github.com/rese1f/aurora) | Local Demo | -| ![Star](https://img.shields.io/github/stars/emova-ollm/EMOVA.svg?style=social&label=Star)
[**EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions**](https://arxiv.org/pdf/2409.18042)
| CVPR | 2024-09-26 | [Github](https://github.com/emova-ollm/EMOVA) | [Demo](https://huggingface.co/spaces/Emova-ollm/EMOVA-demo) | +| ![Star](https://img.shields.io/github/stars/emova-ollm/EMOVA.svg?style=social&label=Star)
[**EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions**](https://arxiv.org/pdf/2409.18042)
| CVPR | 2024-09-26 | [Github](https://github.com/emova-ollm/EMOVA) | [Demo](https://huggingface.co/spaces/Emova-ollm/EMOVA-demo) | | [**Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models**](https://arxiv.org/pdf/2409.17146) | arXiv | 2024-09-25 | [Huggingface](https://huggingface.co/allenai/MolmoE-1B-0924) | [Demo](https://molmo.allenai.org) | | ![Star](https://img.shields.io/github/stars/QwenLM/Qwen2-VL.svg?style=social&label=Star)
[**Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution**](https://arxiv.org/pdf/2409.12191)
| arXiv | 2024-09-18 | [Github](https://github.com/QwenLM/Qwen2-VL) | [Demo](https://huggingface.co/spaces/Qwen/Qwen2-VL) | | ![Star](https://img.shields.io/github/stars/IDEA-FinAI/ChartMoE.svg?style=social&label=Star)
[**ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding**](https://arxiv.org/pdf/2409.03277)
| ICLR | 2024-09-05 | [Github](https://github.com/IDEA-FinAI/ChartMoE) | Local Demo | -| ![Star](https://img.shields.io/github/stars/FreedomIntelligence/LongLLaVA.svg?style=social&label=Star)
[**LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture**](https://arxiv.org/pdf/2409.02889)
| arXiv | 2024-09-04 | [Github](https://github.com/FreedomIntelligence/LongLLaVA) | - | +| ![Star](https://img.shields.io/github/stars/FreedomIntelligence/LongLLaVA.svg?style=social&label=Star)
[**LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture**](https://arxiv.org/pdf/2409.02889)
| arXiv | 2024-09-04 | [Github](https://github.com/FreedomIntelligence/LongLLaVA) | - | | ![Star](https://img.shields.io/github/stars/NVlabs/Eagle.svg?style=social&label=Star)
[**EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders**](https://arxiv.org/pdf/2408.15998)
| arXiv | 2024-08-28 | [Github](https://github.com/NVlabs/Eagle) | [Demo](https://huggingface.co/spaces/NVEagle/Eagle-X5-13B-Chat) | | ![Star](https://img.shields.io/github/stars/shufangxun/LLaVA-MoD.svg?style=social&label=Star)
[**LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation**](https://arxiv.org/pdf/2408.15881)
| arXiv | 2024-08-28 | [Github](https://github.com/shufangxun/LLaVA-MoD) | - | | ![Star](https://img.shields.io/github/stars/X-PLUG/mPLUG-Owl.svg?style=social&label=Star)
[**mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models**](https://www.arxiv.org/pdf/2408.04840)
| arXiv | 2024-08-09 | [Github](https://github.com/X-PLUG/mPLUG-Owl) | - | -| ![Star](https://img.shields.io/github/stars/VITA-MLLM/VITA.svg?style=social&label=Star)
[**VITA: Towards Open-Source Interactive Omni Multimodal LLM**](https://arxiv.org/pdf/2408.05211)
| arXiv | 2024-08-09 | [Github](https://github.com/VITA-MLLM/VITA) | - | -| ![Star](https://img.shields.io/github/stars/LLaVA-VL/LLaVA-NeXT.svg?style=social&label=Star)
[**LLaVA-OneVision: Easy Visual Task Transfer**](https://arxiv.org/pdf/2408.03326)
| arXiv | 2024-08-06 | [Github](https://github.com/LLaVA-VL/LLaVA-NeXT) | [Demo](https://llava-onevision.lmms-lab.com) | +| ![Star](https://img.shields.io/github/stars/VITA-MLLM/VITA.svg?style=social&label=Star)
[**VITA: Towards Open-Source Interactive Omni Multimodal LLM**](https://arxiv.org/pdf/2408.05211)
| arXiv | 2024-08-09 | [Github](https://github.com/VITA-MLLM/VITA) | - | +| ![Star](https://img.shields.io/github/stars/LLaVA-VL/LLaVA-NeXT.svg?style=social&label=Star)
[**LLaVA-OneVision: Easy Visual Task Transfer**](https://arxiv.org/pdf/2408.03326)
| arXiv | 2024-08-06 | [Github](https://github.com/LLaVA-VL/LLaVA-NeXT) | [Demo](https://llava-onevision.lmms-lab.com) | | ![Star](https://img.shields.io/github/stars/OpenBMB/MiniCPM-V.svg?style=social&label=Star)
[**MiniCPM-V: A GPT-4V Level MLLM on Your Phone**](https://arxiv.org/pdf/2408.01800)
| arXiv | 2024-08-03 | [Github](https://github.com/OpenBMB/MiniCPM-V) | [Demo](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5) | | [**VILA^2: VILA Augmented VILA**](https://arxiv.org/pdf/2407.17453) | arXiv | 2024-07-24 | - | - | | [**SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models**](https://arxiv.org/pdf/2407.15841) | arXiv | 2024-07-22 | - | - | @@ -185,7 +186,7 @@ This is the first work to correct hallucination in multimodal large language mod | ![Star](https://img.shields.io/github/stars/ByungKwanLee/TroL.svg?style=social&label=Star)
[**TroL: Traversal of Layers for Large Language and Vision Models**](https://arxiv.org/pdf/2406.12246)
| EMNLP | 2024-06-18 | [Github](https://github.com/ByungKwanLee/TroL) | Local Demo | | ![Star](https://img.shields.io/github/stars/baaivision/EVE.svg?style=social&label=Star)
[**Unveiling Encoder-Free Vision-Language Models**](https://arxiv.org/pdf/2406.11832)
| arXiv | 2024-06-17 | [Github](https://github.com/baaivision/EVE) | Local Demo | | ![Star](https://img.shields.io/github/stars/showlab/VideoLLM-online.svg?style=social&label=Star)
[**VideoLLM-online: Online Video Large Language Model for Streaming Video**](https://arxiv.org/pdf/2406.11816)
| CVPR | 2024-06-17 | [Github](https://github.com/showlab/VideoLLM-online) | Local Demo | -| ![Star](https://img.shields.io/github/stars/wentaoyuan/RoboPoint.svg?style=social&label=Star)
[**RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics**](https://arxiv.org/pdf/2406.10721)
| CoRL | 2024-06-15 | [Github](https://github.com/wentaoyuan/RoboPoint) | [Demo](https://007e03d34429a2517b.gradio.live/) | +| ![Star](https://img.shields.io/github/stars/wentaoyuan/RoboPoint.svg?style=social&label=Star)
[**RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics**](https://arxiv.org/pdf/2406.10721)
| CoRL | 2024-06-15 | [Github](https://github.com/wentaoyuan/RoboPoint) | [Demo](https://007e03d34429a2517b.gradio.live/) | | ![Star](https://img.shields.io/github/stars/wlin-at/CaD-VI)
[**Comparison Visual Instruction Tuning**](https://arxiv.org/abs/2406.09240)
| arXiv | 2024-06-13 | [Github](https://wlin-at.github.io/cad_vi) | Local Demo | | ![Star](https://img.shields.io/github/stars/yfzhang114/SliME.svg?style=social&label=Star)
[**Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models**](https://arxiv.org/pdf/2406.08487)
| arXiv | 2024-06-12 | [Github](https://github.com/yfzhang114/SliME) | - | | ![Star](https://img.shields.io/github/stars/DAMO-NLP-SG/VideoLLaMA2.svg?style=social&label=Star)
[**VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs**](https://arxiv.org/pdf/2406.07476)
| arXiv | 2024-06-11 | [Github](https://github.com/DAMO-NLP-SG/VideoLLaMA2) | Local Demo | @@ -193,7 +194,7 @@ This is the first work to correct hallucination in multimodal large language mod | ![Star](https://img.shields.io/github/stars/AIDC-AI/Ovis.svg?style=social&label=Star)
[**Ovis: Structural Embedding Alignment for Multimodal Large Language Model**](https://arxiv.org/pdf/2405.20797)
| arXiv | 2024-05-31 | [Github](https://github.com/AIDC-AI/Ovis/) | - | | ![Star](https://img.shields.io/github/stars/gordonhu608/MQT-LLaVA.svg?style=social&label=Star)
[**Matryoshka Query Transformer for Large Vision-Language Models**](https://arxiv.org/pdf/2405.19315)
| arXiv | 2024-05-29 | [Github](https://github.com/gordonhu608/MQT-LLaVA) | [Demo](https://huggingface.co/spaces/gordonhu/MQT-LLaVA) | | ![Star](https://img.shields.io/github/stars/alibaba/conv-llava.svg?style=social&label=Star)
[**ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models**](https://arxiv.org/pdf/2405.15738)
| arXiv | 2024-05-24 | [Github](https://github.com/alibaba/conv-llava) | - | -| ![Star](https://img.shields.io/github/stars/ByungKwanLee/Meteor.svg?style=social&label=Star)
[**Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models**](https://arxiv.org/pdf/2405.15574)
| arXiv | 2024-05-24 | [Github](https://github.com/ByungKwanLee/Meteor) | [Demo](https://huggingface.co/spaces/BK-Lee/Meteor) | +| ![Star](https://img.shields.io/github/stars/ByungKwanLee/Meteor.svg?style=social&label=Star)
[**Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models**](https://arxiv.org/pdf/2405.15574)
| arXiv | 2024-05-24 | [Github](https://github.com/ByungKwanLee/Meteor) | [Demo](https://huggingface.co/spaces/BK-Lee/Meteor) | | ![Star](https://img.shields.io/github/stars/YifanXu74/Libra.svg?style=social&label=Star)
[**Libra: Building Decoupled Vision System on Large Language Models**](https://arxiv.org/pdf/2405.10140)
| ICML | 2024-05-16 | [Github](https://github.com/YifanXu74/Libra) | Local Demo | | ![Star](https://img.shields.io/github/stars/SHI-Labs/CuMo.svg?style=social&label=Star)
[**CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts**](https://arxiv.org/pdf/2405.05949)
| arXiv | 2024-05-09 | [Github](https://github.com/SHI-Labs/CuMo) | Local Demo | | ![Star](https://img.shields.io/github/stars/OpenGVLab/InternVL.svg?style=social&label=Star)
[**How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites**](https://arxiv.org/pdf/2404.16821)
| arXiv | 2024-04-25 | [Github](https://github.com/OpenGVLab/InternVL) | [Demo](https://internvl.opengvlab.com) | @@ -216,7 +217,7 @@ This is the first work to correct hallucination in multimodal large language mod | ![Star](https://img.shields.io/github/stars/DCDmllm/Momentor.svg?style=social&label=Star)
[**Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning**](https://arxiv.org/pdf/2402.11435.pdf)
| arXiv | 2024-02-18 | [Github](https://github.com/DCDmllm/Momentor) | - | | ![Star](https://img.shields.io/github/stars/FreedomIntelligence/ALLaVA.svg?style=social&label=Star)
[**ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model**](https://arxiv.org/pdf/2402.11684.pdf)
| arXiv | 2024-02-18 | [Github](https://github.com/FreedomIntelligence/ALLaVA) | [Demo](https://huggingface.co/FreedomIntelligence/ALLaVA-3B) | | ![Star](https://img.shields.io/github/stars/ByungKwanLee/CoLLaVO-Crayon-Large-Language-and-Vision-mOdel.svg?style=social&label=Star)
[**CoLLaVO: Crayon Large Language and Vision mOdel**](https://arxiv.org/pdf/2402.11248.pdf)
| arXiv | 2024-02-17 | [Github](https://github.com/ByungKwanLee/CoLLaVO-Crayon-Large-Language-and-Vision-mOdel) | - | -| ![Star](https://img.shields.io/github/stars/TRI-ML/prismatic-vlms.svg?style=social&label=Star)
[**Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models**](https://arxiv.org/pdf/2402.07865)
| ICML | 2024-02-12 | [Github](https://github.com/TRI-ML/prismatic-vlms) | - | +| ![Star](https://img.shields.io/github/stars/TRI-ML/prismatic-vlms.svg?style=social&label=Star)
[**Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models**](https://arxiv.org/pdf/2402.07865)
| ICML | 2024-02-12 | [Github](https://github.com/TRI-ML/prismatic-vlms) | - | | ![Star](https://img.shields.io/github/stars/THUDM/CogCoM.svg?style=social&label=Star)
[**CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations**](https://arxiv.org/pdf/2402.04236.pdf)
| arXiv | 2024-02-06 | [Github](https://github.com/THUDM/CogCoM) | - | | ![Star](https://img.shields.io/github/stars/Meituan-AutoML/MobileVLM.svg?style=social&label=Star)
[**MobileVLM V2: Faster and Stronger Baseline for Vision Language Model**](https://arxiv.org/pdf/2402.03766.pdf)
| arXiv | 2024-02-06 | [Github](https://github.com/Meituan-AutoML/MobileVLM) | - | | ![Star](https://img.shields.io/github/stars/WEIYanbin1999/GITA.svg?style=social&label=Star)
[**GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning**](https://arxiv.org/pdf/2402.02130)
| NeurIPS | 2024-02-03 | [Github](https://github.com/WEIYanbin1999/GITA/) | - | @@ -226,22 +227,22 @@ This is the first work to correct hallucination in multimodal large language mod | ![Star](https://img.shields.io/github/stars/InternLM/InternLM-XComposer.svg?style=social&label=Star)
[**InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model**](https://arxiv.org/pdf/2401.16420.pdf)
| arXiv | 2024-01-29 | [Github](https://github.com/InternLM/InternLM-XComposer) | [Demo](https://openxlab.org.cn/apps/detail/WillowBreeze/InternLM-XComposer) | | ![Star](https://img.shields.io/github/stars/01-ai/Yi.svg?style=social&label=Star)
[**Yi-VL**](https://github.com/01-ai/Yi/tree/main/VL)
| - | 2024-01-23 | [Github](https://github.com/01-ai/Yi/tree/main/VL) | Local Demo | | [**SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities**](https://arxiv.org/pdf/2401.12168.pdf) | arXiv | 2024-01-22 | - | - | -| ![Star](https://img.shields.io/github/stars/OpenGVLab/ChartAst.svg?style=social&label=Star)
[**ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning**](https://arxiv.org/pdf/2401.02384)
| ACL | 2024-01-04 | [Github](https://github.com/OpenGVLab/ChartAst) | Local Demo | -| ![Star](https://img.shields.io/github/stars/Meituan-AutoML/MobileVLM.svg?style=social&label=Star)
[**MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices**](https://arxiv.org/pdf/2312.16886.pdf)
| arXiv | 2023-12-28 | [Github](https://github.com/Meituan-AutoML/MobileVLM) | - | +| ![Star](https://img.shields.io/github/stars/OpenGVLab/ChartAst.svg?style=social&label=Star)
[**ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning**](https://arxiv.org/pdf/2401.02384)
| ACL | 2024-01-04 | [Github](https://github.com/OpenGVLab/ChartAst) | Local Demo | +| ![Star](https://img.shields.io/github/stars/Meituan-AutoML/MobileVLM.svg?style=social&label=Star)
[**MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices**](https://arxiv.org/pdf/2312.16886.pdf)
| arXiv | 2023-12-28 | [Github](https://github.com/Meituan-AutoML/MobileVLM) | - | | ![Star](https://img.shields.io/github/stars/OpenGVLab/InternVL.svg?style=social&label=Star)
[**InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks**](https://arxiv.org/pdf/2312.14238.pdf)
| CVPR | 2023-12-21 | [Github](https://github.com/OpenGVLab/InternVL) | [Demo](https://internvl.opengvlab.com) | | ![Star](https://img.shields.io/github/stars/CircleRadon/Osprey.svg?style=social&label=Star)
[**Osprey: Pixel Understanding with Visual Instruction Tuning**](https://arxiv.org/pdf/2312.10032.pdf)
| CVPR | 2023-12-15 | [Github](https://github.com/CircleRadon/Osprey) | [Demo](http://111.0.123.204:8000/) | | ![Star](https://img.shields.io/github/stars/THUDM/CogVLM.svg?style=social&label=Star)
[**CogAgent: A Visual Language Model for GUI Agents**](https://arxiv.org/pdf/2312.08914.pdf)
| arXiv | 2023-12-14 | [Github](https://github.com/THUDM/CogVLM) | [Coming soon]() | | [**Pixel Aligned Language Models**](https://arxiv.org/pdf/2312.09237.pdf) | arXiv | 2023-12-14 | [Coming soon]() | - | | ![Star](https://img.shields.io/github/stars/NVlabs/VILA.svg?style=social&label=Star)
[**VILA: On Pre-training for Visual Language Models**](https://arxiv.org/pdf/2312.07533)
| CVPR | 2023-12-13 | [Github](https://github.com/NVlabs/VILA) | Local Demo | -| [**See, Say, and Segment: Teaching LMMs to Overcome False Premises**](https://arxiv.org/pdf/2312.08366.pdf) | arXiv | 2023-12-13 | [Coming soon]() | - | +| [**See, Say, and Segment: Teaching LMMs to Overcome False Premises**](https://arxiv.org/pdf/2312.08366.pdf) | arXiv | 2023-12-13 | [Coming soon]() | - | | ![Star](https://img.shields.io/github/stars/Ucas-HaoranWei/Vary.svg?style=social&label=Star)
[**Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models**](https://arxiv.org/pdf/2312.06109.pdf)
| ECCV | 2023-12-11 | [Github](https://github.com/Ucas-HaoranWei/Vary) | [Demo](http://region-31.seetacloud.com:22701/) | | ![Star](https://img.shields.io/github/stars/kakaobrain/honeybee.svg?style=social&label=Star)
[**Honeybee: Locality-enhanced Projector for Multimodal LLM**](https://arxiv.org/pdf/2312.06742.pdf)
| CVPR | 2023-12-11 | [Github](https://github.com/kakaobrain/honeybee) | - | | [**Gemini: A Family of Highly Capable Multimodal Models**](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) | Google | 2023-12-06 | - | - | | ![Star](https://img.shields.io/github/stars/csuhan/OneLLM.svg?style=social&label=Star)
[**OneLLM: One Framework to Align All Modalities with Language**](https://arxiv.org/pdf/2312.03700.pdf)
| arXiv | 2023-12-06 | [Github](https://github.com/csuhan/OneLLM) | [Demo](https://huggingface.co/spaces/csuhan/OneLLM) | -| ![Star](https://img.shields.io/github/stars/Meituan-AutoML/Lenna.svg?style=social&label=Star)
[**Lenna: Language Enhanced Reasoning Detection Assistant**](https://arxiv.org/pdf/2312.02433.pdf)
| arXiv | 2023-12-05 | [Github](https://github.com/Meituan-AutoML/Lenna) | - | +| ![Star](https://img.shields.io/github/stars/Meituan-AutoML/Lenna.svg?style=social&label=Star)
[**Lenna: Language Enhanced Reasoning Detection Assistant**](https://arxiv.org/pdf/2312.02433.pdf)
| arXiv | 2023-12-05 | [Github](https://github.com/Meituan-AutoML/Lenna) | - | | [**VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding**](https://arxiv.org/pdf/2312.02310.pdf) | arXiv | 2023-12-04 | - | - | -| ![Star](https://img.shields.io/github/stars/RenShuhuai-Andy/TimeChat.svg?style=social&label=Star)
[**TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding**](https://arxiv.org/pdf/2312.02051.pdf)
| arXiv | 2023-12-04 | [Github](https://github.com/RenShuhuai-Andy/TimeChat) | Local Demo | -| ![Star](https://img.shields.io/github/stars/mu-cai/vip-llava.svg?style=social&label=Star)
[**Making Large Multimodal Models Understand Arbitrary Visual Prompts**](https://arxiv.org/pdf/2312.00784.pdf)
| CVPR | 2023-12-01 | [Github](https://github.com/mu-cai/vip-llava) | [Demo](https://pages.cs.wisc.edu/~mucai/vip-llava.html) | +| ![Star](https://img.shields.io/github/stars/RenShuhuai-Andy/TimeChat.svg?style=social&label=Star)
[**TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding**](https://arxiv.org/pdf/2312.02051.pdf)
| arXiv | 2023-12-04 | [Github](https://github.com/RenShuhuai-Andy/TimeChat) | Local Demo | +| ![Star](https://img.shields.io/github/stars/mu-cai/vip-llava.svg?style=social&label=Star)
[**Making Large Multimodal Models Understand Arbitrary Visual Prompts**](https://arxiv.org/pdf/2312.00784.pdf)
| CVPR | 2023-12-01 | [Github](https://github.com/mu-cai/vip-llava) | [Demo](https://pages.cs.wisc.edu/~mucai/vip-llava.html) | | ![Star](https://img.shields.io/github/stars/vlm-driver/Dolphins.svg?style=social&label=Star)
[**Dolphins: Multimodal Language Model for Driving**](https://arxiv.org/pdf/2312.00438.pdf)
| arXiv | 2023-12-01 | [Github](https://github.com/vlm-driver/Dolphins) | - | | ![Star](https://img.shields.io/github/stars/Open3DA/LL3DA.svg?style=social&label=Star)
[**LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning**](https://arxiv.org/pdf/2311.18651.pdf)
| arXiv | 2023-11-30 | [Github](https://github.com/Open3DA/LL3DA) | [Coming soon]() | | ![Star](https://img.shields.io/github/stars/huangb23/VTimeLLM.svg?style=social&label=Star)
[**VTimeLLM: Empower LLM to Grasp Video Moments**](https://arxiv.org/pdf/2311.18445.pdf)
| arXiv | 2023-11-30 | [Github](https://github.com/huangb23/VTimeLLM/) | Local Demo | @@ -258,20 +259,20 @@ This is the first work to correct hallucination in multimodal large language mod | ![Star](https://img.shields.io/github/stars/Alpha-VLLM/LLaMA2-Accessory.svg?style=social&label=Star)
[**SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models**](https://arxiv.org/pdf/2311.07575.pdf)
| arXiv | 2023-11-13 | [Github](https://github.com/Alpha-VLLM/LLaMA2-Accessory) | [Demo](http://imagebind-llm.opengvlab.com/) | | ![Star](https://img.shields.io/github/stars/Yuliang-Liu/Monkey.svg?style=social&label=Star)
[**Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models**](https://arxiv.org/pdf/2311.06607.pdf)
| CVPR | 2023-11-11 | [Github](https://github.com/Yuliang-Liu/Monkey) | [Demo](http://27.17.184.224:7681/) | | ![Star](https://img.shields.io/github/stars/LLaVA-VL/LLaVA-Plus-Codebase.svg?style=social&label=Star)
[**LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents**](https://arxiv.org/pdf/2311.05437.pdf)
| arXiv | 2023-11-09 | [Github](https://github.com/LLaVA-VL/LLaVA-Plus-Codebase) | [Demo](https://llavaplus.ngrok.io/) | -| ![Star](https://img.shields.io/github/stars/NExT-ChatV/NExT-Chat.svg?style=social&label=Star)
[**NExT-Chat: An LMM for Chat, Detection and Segmentation**](https://arxiv.org/pdf/2311.04498.pdf)
| arXiv | 2023-11-08 | [Github](https://github.com/NExT-ChatV/NExT-Chat) | Local Demo | +| ![Star](https://img.shields.io/github/stars/NExT-ChatV/NExT-Chat.svg?style=social&label=Star)
[**NExT-Chat: An LMM for Chat, Detection and Segmentation**](https://arxiv.org/pdf/2311.04498.pdf)
| arXiv | 2023-11-08 | [Github](https://github.com/NExT-ChatV/NExT-Chat) | Local Demo | | ![Star](https://img.shields.io/github/stars/X-PLUG/mPLUG-Owl.svg?style=social&label=Star)
[**mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration**](https://arxiv.org/pdf/2311.04257.pdf)
| arXiv | 2023-11-07 | [Github](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2) | [Demo](https://modelscope.cn/studios/damo/mPLUG-Owl2/summary) | | ![Star](https://img.shields.io/github/stars/Luodian/Otter.svg?style=social&label=Star)
[**OtterHD: A High-Resolution Multi-modality Model**](https://arxiv.org/pdf/2311.04219.pdf)
| arXiv | 2023-11-07 | [Github](https://github.com/Luodian/Otter) | - | | [**CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding**](https://arxiv.org/pdf/2311.03354.pdf) | arXiv | 2023-11-06 | [Coming soon]() | - | | ![Star](https://img.shields.io/github/stars/mbzuai-oryx/groundingLMM.svg?style=social&label=Star)
[**GLaMM: Pixel Grounding Large Multimodal Model**](https://arxiv.org/pdf/2311.03356.pdf)
| CVPR | 2023-11-06 | [Github](https://github.com/mbzuai-oryx/groundingLMM) | [Demo](https://glamm.mbzuai-oryx.ngrok.app/) | | ![Star](https://img.shields.io/github/stars/RUCAIBox/ComVint.svg?style=social&label=Star)
[**What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning**](https://arxiv.org/pdf/2311.01487.pdf)
| arXiv | 2023-11-02| [Github](https://github.com/RUCAIBox/ComVint) | - | -| ![Star](https://img.shields.io/github/stars/Vision-CAIR/MiniGPT-4.svg?style=social&label=Star)
[**MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning**](https://arxiv.org/pdf/2310.09478.pdf)
| arXiv | 2023-10-14 | [Github](https://github.com/Vision-CAIR/MiniGPT-4) | Local Demo | +| ![Star](https://img.shields.io/github/stars/Vision-CAIR/MiniGPT-4.svg?style=social&label=Star)
[**MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning**](https://arxiv.org/pdf/2310.09478.pdf)
| arXiv | 2023-10-14 | [Github](https://github.com/Vision-CAIR/MiniGPT-4) | Local Demo | | ![Star](https://img.shields.io/github/stars/bytedance/SALMONN.svg?style=social&label=Star)
[**SALMONN: Towards Generic Hearing Abilities for Large Language Models**](https://arxiv.org/pdf/2310.13289)
| ICLR | 2023-10-20 | [Github](https://github.com/bytedance/SALMONN) | - | | ![Star](https://img.shields.io/github/stars/apple/ml-ferret.svg?style=social&label=Star)
[**Ferret: Refer and Ground Anything Anywhere at Any Granularity**](https://arxiv.org/pdf/2310.07704.pdf)
| arXiv | 2023-10-11 | [Github](https://github.com/apple/ml-ferret) | - | -| ![Star](https://img.shields.io/github/stars/THUDM/CogVLM.svg?style=social&label=Star)
[**CogVLM: Visual Expert For Large Language Models**](https://arxiv.org/pdf/2311.03079.pdf)
| arXiv | 2023-10-09 | [Github](https://github.com/THUDM/CogVLM) | [Demo](http://36.103.203.44:7861/) | +| ![Star](https://img.shields.io/github/stars/THUDM/CogVLM.svg?style=social&label=Star)
[**CogVLM: Visual Expert For Large Language Models**](https://arxiv.org/pdf/2311.03079.pdf)
| arXiv | 2023-10-09 | [Github](https://github.com/THUDM/CogVLM) | [Demo](http://36.103.203.44:7861/) | | ![Star](https://img.shields.io/github/stars/haotian-liu/LLaVA.svg?style=social&label=Star)
[**Improved Baselines with Visual Instruction Tuning**](https://arxiv.org/pdf/2310.03744.pdf)
| arXiv | 2023-10-05 | [Github](https://github.com/haotian-liu/LLaVA) | [Demo](https://llava.hliu.cc/) | -| ![Star](https://img.shields.io/github/stars/PKU-YuanGroup/LanguageBind.svg?style=social&label=Star)
[**LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment**](https://arxiv.org/pdf/2310.01852.pdf)
| ICLR | 2023-10-03 | [Github](https://github.com/PKU-YuanGroup/LanguageBind) | [Demo](https://huggingface.co/spaces/LanguageBind/LanguageBind) | -![Star](https://img.shields.io/github/stars/SY-Xuan/Pink.svg?style=social&label=Star)
[**Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs**](https://arxiv.org/pdf/2310.00582.pdf) | arXiv | 2023-10-01 | [Github](https://github.com/SY-Xuan/Pink) | - | -| ![Star](https://img.shields.io/github/stars/thunlp/Muffin.svg?style=social&label=Star)
[**Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants**](https://arxiv.org/pdf/2310.00653.pdf)
| arXiv | 2023-10-01 | [Github](https://github.com/thunlp/Muffin) | Local Demo | +| ![Star](https://img.shields.io/github/stars/PKU-YuanGroup/LanguageBind.svg?style=social&label=Star)
[**LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment**](https://arxiv.org/pdf/2310.01852.pdf)
| ICLR | 2023-10-03 | [Github](https://github.com/PKU-YuanGroup/LanguageBind) | [Demo](https://huggingface.co/spaces/LanguageBind/LanguageBind) | +|![Star](https://img.shields.io/github/stars/SY-Xuan/Pink.svg?style=social&label=Star)
[**Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs**](https://arxiv.org/pdf/2310.00582.pdf) | arXiv | 2023-10-01 | [Github](https://github.com/SY-Xuan/Pink) | - | +| ![Star](https://img.shields.io/github/stars/thunlp/Muffin.svg?style=social&label=Star)
[**Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants**](https://arxiv.org/pdf/2310.00653.pdf)
| arXiv | 2023-10-01 | [Github](https://github.com/thunlp/Muffin) | Local Demo | | [**AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model**](https://arxiv.org/pdf/2309.16058.pdf) | arXiv | 2023-09-27 | - | - | | ![Star](https://img.shields.io/github/stars/InternLM/InternLM-XComposer.svg?style=social&label=Star)
[**InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition**](https://arxiv.org/pdf/2309.15112.pdf)
| arXiv | 2023-09-26 | [Github](https://github.com/InternLM/InternLM-XComposer) | Local Demo | | ![Star](https://img.shields.io/github/stars/RunpeiDong/DreamLLM.svg?style=social&label=Star)
[**DreamLLM: Synergistic Multimodal Comprehension and Creation**](https://arxiv.org/pdf/2309.11499.pdf)
| ICLR | 2023-09-20 | [Github](https://github.com/RunpeiDong/DreamLLM) | [Coming soon]() | @@ -280,58 +281,58 @@ This is the first work to correct hallucination in multimodal large language mod | ![Star](https://img.shields.io/github/stars/NExT-GPT/NExT-GPT.svg?style=social&label=Star)
[**NExT-GPT: Any-to-Any Multimodal LLM**](https://arxiv.org/pdf/2309.05519.pdf)
| arXiv | 2023-09-11 | [Github](https://github.com/NExT-GPT/NExT-GPT) | [Demo](https://fc7a82a1c76b336b6f.gradio.live/) | | ![Star](https://img.shields.io/github/stars/UCSC-VLAA/Sight-Beyond-Text.svg?style=social&label=Star)
[**Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics**](https://arxiv.org/pdf/2309.07120.pdf)
| arXiv | 2023-09-13 | [Github](https://github.com/UCSC-VLAA/Sight-Beyond-Text) | - | | ![Star](https://img.shields.io/github/stars/OpenGVLab/LLaMA-Adapter.svg?style=social&label=Star)
[**ImageBind-LLM: Multi-modality Instruction Tuning**](https://arxiv.org/pdf/2309.03905.pdf)
| arXiv | 2023-09-07 | [Github](https://github.com/OpenGVLab/LLaMA-Adapter) | [Demo](http://imagebind-llm.opengvlab.com/) | -| [**Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning**](https://arxiv.org/pdf/2309.02591.pdf) | arXiv | 2023-09-05 | - | - | +| [**Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning**](https://arxiv.org/pdf/2309.02591.pdf) | arXiv | 2023-09-05 | - | - | | ![Star](https://img.shields.io/github/stars/OpenRobotLab/PointLLM.svg?style=social&label=Star)
[**PointLLM: Empowering Large Language Models to Understand Point Clouds**](https://arxiv.org/pdf/2308.16911.pdf)
| arXiv | 2023-08-31 | [Github](https://github.com/OpenRobotLab/PointLLM) | [Demo](http://101.230.144.196/) | | ![Star](https://img.shields.io/github/stars/HYPJUDY/Sparkles.svg?style=social&label=Star)
[**✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models**](https://arxiv.org/pdf/2308.16463.pdf)
| arXiv | 2023-08-31 | [Github](https://github.com/HYPJUDY/Sparkles) | Local Demo | | ![Star](https://img.shields.io/github/stars/opendatalab/MLLM-DataEngine.svg?style=social&label=Star)
[**MLLM-DataEngine: An Iterative Refinement Approach for MLLM**](https://arxiv.org/pdf/2308.13566.pdf)
| arXiv | 2023-08-25 | [Github](https://github.com/opendatalab/MLLM-DataEngine) | - | -| ![Star](https://img.shields.io/github/stars/PVIT-official/PVIT.svg?style=social&label=Star)
[**Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models**](https://arxiv.org/pdf/2308.13437.pdf)
| arXiv | 2023-08-25 | [Github](https://github.com/PVIT-official/PVIT) | [Demo](https://huggingface.co/spaces/PVIT/pvit) | -| ![Star](https://img.shields.io/github/stars/QwenLM/Qwen-VL.svg?style=social&label=Star)
[**Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities**](https://arxiv.org/pdf/2308.12966.pdf)
| arXiv | 2023-08-24 | [Github](https://github.com/QwenLM/Qwen-VL) | [Demo](https://modelscope.cn/studios/qwen/Qwen-VL-Chat-Demo/summary) | -| ![Star](https://img.shields.io/github/stars/OpenBMB/VisCPM.svg?style=social&label=Star)
[**Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages**](https://arxiv.org/pdf/2308.12038.pdf)
| ICLR | 2023-08-23 | [Github](https://github.com/OpenBMB/VisCPM) | [Demo](https://huggingface.co/spaces/openbmb/viscpm-chat) | +| ![Star](https://img.shields.io/github/stars/PVIT-official/PVIT.svg?style=social&label=Star)
[**Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models**](https://arxiv.org/pdf/2308.13437.pdf)
| arXiv | 2023-08-25 | [Github](https://github.com/PVIT-official/PVIT) | [Demo](https://huggingface.co/spaces/PVIT/pvit) | +| ![Star](https://img.shields.io/github/stars/QwenLM/Qwen-VL.svg?style=social&label=Star)
[**Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities**](https://arxiv.org/pdf/2308.12966.pdf)
| arXiv | 2023-08-24 | [Github](https://github.com/QwenLM/Qwen-VL) | [Demo](https://modelscope.cn/studios/qwen/Qwen-VL-Chat-Demo/summary) | +| ![Star](https://img.shields.io/github/stars/OpenBMB/VisCPM.svg?style=social&label=Star)
[**Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages**](https://arxiv.org/pdf/2308.12038.pdf)
| ICLR | 2023-08-23 | [Github](https://github.com/OpenBMB/VisCPM) | [Demo](https://huggingface.co/spaces/openbmb/viscpm-chat) | | ![Star](https://img.shields.io/github/stars/icoz69/StableLLAVA.svg?style=social&label=Star)
[**StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data**](https://arxiv.org/pdf/2308.10253.pdf)
| arXiv | 2023-08-20 | [Github](https://github.com/icoz69/StableLLAVA) | - | | ![Star](https://img.shields.io/github/stars/mlpc-ucsd/BLIVA.svg?style=social&label=Star)
[**BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions**](https://arxiv.org/pdf/2308.09936.pdf)
| arXiv | 2023-08-19 | [Github](https://github.com/mlpc-ucsd/BLIVA) | [Demo](https://huggingface.co/spaces/mlpc-lab/BLIVA) | | ![Star](https://img.shields.io/github/stars/DCDmllm/Cheetah.svg?style=social&label=Star)
[**Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions**](https://arxiv.org/pdf/2308.04152.pdf)
| arXiv | 2023-08-08 | [Github](https://github.com/DCDmllm/Cheetah) | - | -| ![Star](https://img.shields.io/github/stars/OpenGVLab/All-Seeing.svg?style=social&label=Star)
[**The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World**](https://arxiv.org/pdf/2308.01907.pdf)
| ICLR | 2023-08-03 | [Github](https://github.com/OpenGVLab/All-Seeing) | [Demo](https://huggingface.co/spaces/OpenGVLab/all-seeing) | +| ![Star](https://img.shields.io/github/stars/OpenGVLab/All-Seeing.svg?style=social&label=Star)
[**The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World**](https://arxiv.org/pdf/2308.01907.pdf)
| ICLR | 2023-08-03 | [Github](https://github.com/OpenGVLab/All-Seeing) | [Demo](https://huggingface.co/spaces/OpenGVLab/all-seeing) | | ![Star](https://img.shields.io/github/stars/dvlab-research/LISA.svg?style=social&label=Star)
[**LISA: Reasoning Segmentation via Large Language Model**](https://arxiv.org/pdf/2308.00692.pdf)
| arXiv | 2023-08-01 | [Github](https://github.com/dvlab-research/LISA) | [Demo](http://103.170.5.190:7860) | | ![Star](https://img.shields.io/github/stars/rese1f/MovieChat.svg?style=social&label=Star)
[**MovieChat: From Dense Token to Sparse Memory for Long Video Understanding**](https://arxiv.org/pdf/2307.16449.pdf)
| arXiv | 2023-07-31 | [Github](https://github.com/rese1f/MovieChat) | Local Demo | -| ![Star](https://img.shields.io/github/stars/UMass-Foundation-Model/3D-LLM.svg?style=social&label=Star)
[**3D-LLM: Injecting the 3D World into Large Language Models**](https://arxiv.org/pdf/2307.12981.pdf)
| arXiv | 2023-07-24 | [Github](https://github.com/UMass-Foundation-Model/3D-LLM) | - | +| ![Star](https://img.shields.io/github/stars/UMass-Foundation-Model/3D-LLM.svg?style=social&label=Star)
[**3D-LLM: Injecting the 3D World into Large Language Models**](https://arxiv.org/pdf/2307.12981.pdf)
| arXiv | 2023-07-24 | [Github](https://github.com/UMass-Foundation-Model/3D-LLM) | - | | [**ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning**](https://arxiv.org/pdf/2307.09474.pdf)
| arXiv | 2023-07-18 | - | [Demo](https://chatspot.streamlit.app/) | | ![Star](https://img.shields.io/github/stars/magic-research/bubogpt.svg?style=social&label=Star)
[**BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs**](https://arxiv.org/pdf/2307.08581.pdf)
| arXiv | 2023-07-17 | [Github](https://github.com/magic-research/bubogpt) | [Demo](https://huggingface.co/spaces/magicr/BuboGPT) | | ![Star](https://img.shields.io/github/stars/BAAI-DCAI/Visual-Instruction-Tuning.svg?style=social&label=Star)
[**SVIT: Scaling up Visual Instruction Tuning**](https://arxiv.org/pdf/2307.04087.pdf)
| arXiv | 2023-07-09 | [Github](https://github.com/BAAI-DCAI/Visual-Instruction-Tuning) | - | | ![Star](https://img.shields.io/github/stars/jshilong/GPT4RoI.svg?style=social&label=Star)
[**GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest**](https://arxiv.org/pdf/2307.03601.pdf)
| arXiv | 2023-07-07 | [Github](https://github.com/jshilong/GPT4RoI) | [Demo](http://139.196.83.164:7000/) | -| ![Star](https://img.shields.io/github/stars/bytedance/lynx-llm.svg?style=social&label=Star)
[**What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?**](https://arxiv.org/pdf/2307.02469.pdf)
| arXiv | 2023-07-05 | [Github](https://github.com/bytedance/lynx-llm) | - | -| ![Star](https://img.shields.io/github/stars/X-PLUG/mPLUG-DocOwl.svg?style=social&label=Star)
[**mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding**](https://arxiv.org/pdf/2307.02499.pdf)
| arXiv | 2023-07-04 | [Github](https://github.com/X-PLUG/mPLUG-DocOwl) | [Demo](https://modelscope.cn/studios/damo/mPLUG-DocOwl/summary) | +| ![Star](https://img.shields.io/github/stars/bytedance/lynx-llm.svg?style=social&label=Star)
[**What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?**](https://arxiv.org/pdf/2307.02469.pdf)
| arXiv | 2023-07-05 | [Github](https://github.com/bytedance/lynx-llm) | - | +| ![Star](https://img.shields.io/github/stars/X-PLUG/mPLUG-DocOwl.svg?style=social&label=Star)
[**mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding**](https://arxiv.org/pdf/2307.02499.pdf)
| arXiv | 2023-07-04 | [Github](https://github.com/X-PLUG/mPLUG-DocOwl) | [Demo](https://modelscope.cn/studios/damo/mPLUG-DocOwl/summary) | | ![Star](https://img.shields.io/github/stars/ChenDelong1999/polite_flamingo.svg?style=social&label=Star)
[**Visual Instruction Tuning with Polite Flamingo**](https://arxiv.org/pdf/2307.01003.pdf)
| arXiv | 2023-07-03 | [Github](https://github.com/ChenDelong1999/polite_flamingo) | [Demo](http://clever_flamingo.xiaoice.com/) | | ![Star](https://img.shields.io/github/stars/SALT-NLP/LLaVAR.svg?style=social&label=Star)
[**LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding**](https://arxiv.org/pdf/2306.17107.pdf)
| arXiv | 2023-06-29 | [Github](https://github.com/SALT-NLP/LLaVAR) | [Demo](https://eba470c07c805702b8.gradio.live/) | | ![Star](https://img.shields.io/github/stars/shikras/shikra.svg?style=social&label=Star)
[**Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic**](https://arxiv.org/pdf/2306.15195.pdf)
| arXiv | 2023-06-27 | [Github](https://github.com/shikras/shikra) | [Demo](http://demo.zhaozhang.net:7860/) | -| ![Star](https://img.shields.io/github/stars/OpenMotionLab/MotionGPT.svg?style=social&label=Star)
[**MotionGPT: Human Motion as a Foreign Language**](https://arxiv.org/pdf/2306.14795.pdf)
| arXiv | 2023-06-26 | [Github](https://github.com/OpenMotionLab/MotionGPT) | - | +| ![Star](https://img.shields.io/github/stars/OpenMotionLab/MotionGPT.svg?style=social&label=Star)
[**MotionGPT: Human Motion as a Foreign Language**](https://arxiv.org/pdf/2306.14795.pdf)
| arXiv | 2023-06-26 | [Github](https://github.com/OpenMotionLab/MotionGPT) | - | | ![Star](https://img.shields.io/github/stars/lyuchenyang/Macaw-LLM.svg?style=social&label=Star)
[**Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration**](https://arxiv.org/pdf/2306.09093.pdf)
| arXiv | 2023-06-15 | [Github](https://github.com/lyuchenyang/Macaw-LLM) | [Coming soon]() | -| ![Star](https://img.shields.io/github/stars/OpenLAMM/LAMM.svg?style=social&label=Star)
[**LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark**](https://arxiv.org/pdf/2306.06687.pdf)
| arXiv | 2023-06-11 | [Github](https://github.com/OpenLAMM/LAMM) | [Demo](https://huggingface.co/spaces/openlamm/LAMM) | +| ![Star](https://img.shields.io/github/stars/OpenLAMM/LAMM.svg?style=social&label=Star)
[**LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark**](https://arxiv.org/pdf/2306.06687.pdf)
| arXiv | 2023-06-11 | [Github](https://github.com/OpenLAMM/LAMM) | [Demo](https://huggingface.co/spaces/openlamm/LAMM) | | ![Star](https://img.shields.io/github/stars/mbzuai-oryx/Video-ChatGPT.svg?style=social&label=Star)
[**Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models**](https://arxiv.org/pdf/2306.05424.pdf)
| arXiv | 2023-06-08 | [Github](https://github.com/mbzuai-oryx/Video-ChatGPT) | [Demo](https://www.ival-mbzuai.com/video-chatgpt) | | ![Star](https://img.shields.io/github/stars/Luodian/Otter.svg?style=social&label=Star)
[**MIMIC-IT: Multi-Modal In-Context Instruction Tuning**](https://arxiv.org/pdf/2306.05425.pdf)
| arXiv | 2023-06-08 | [Github](https://github.com/Luodian/Otter) | [Demo](https://otter.cliangyu.com/) | -| [**M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning**](https://arxiv.org/pdf/2306.04387.pdf) | arXiv | 2023-06-07 | - | - | +| [**M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning**](https://arxiv.org/pdf/2306.04387.pdf) | arXiv | 2023-06-07 | - | - | | ![Star](https://img.shields.io/github/stars/DAMO-NLP-SG/Video-LLaMA.svg?style=social&label=Star)
[**Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding**](https://arxiv.org/pdf/2306.02858.pdf)
| arXiv | 2023-06-05 | [Github](https://github.com/DAMO-NLP-SG/Video-LLaMA) | [Demo](https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA) | | ![Star](https://img.shields.io/github/stars/microsoft/LLaVA-Med.svg?style=social&label=Star)
[**LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day**](https://arxiv.org/pdf/2306.00890.pdf)
| arXiv | 2023-06-01 | [Github](https://github.com/microsoft/LLaVA-Med) | - | -| ![Star](https://img.shields.io/github/stars/StevenGrove/GPT4Tools.svg?style=social&label=Star)
[**GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction**](https://arxiv.org/pdf/2305.18752.pdf)
| arXiv | 2023-05-30 | [Github](https://github.com/StevenGrove/GPT4Tools) | [Demo](https://huggingface.co/spaces/stevengrove/GPT4Tools) | -| ![Star](https://img.shields.io/github/stars/yxuansu/PandaGPT.svg?style=social&label=Star)
[**PandaGPT: One Model To Instruction-Follow Them All**](https://arxiv.org/pdf/2305.16355.pdf)
| arXiv | 2023-05-25 | [Github](https://github.com/yxuansu/PandaGPT) | [Demo](https://huggingface.co/spaces/GMFTBY/PandaGPT) | -| ![Star](https://img.shields.io/github/stars/joez17/ChatBridge.svg?style=social&label=Star)
[**ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst**](https://arxiv.org/pdf/2305.16103.pdf)
| arXiv | 2023-05-25 | [Github](https://github.com/joez17/ChatBridge) | - | +| ![Star](https://img.shields.io/github/stars/StevenGrove/GPT4Tools.svg?style=social&label=Star)
[**GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction**](https://arxiv.org/pdf/2305.18752.pdf)
| arXiv | 2023-05-30 | [Github](https://github.com/StevenGrove/GPT4Tools) | [Demo](https://huggingface.co/spaces/stevengrove/GPT4Tools) | +| ![Star](https://img.shields.io/github/stars/yxuansu/PandaGPT.svg?style=social&label=Star)
[**PandaGPT: One Model To Instruction-Follow Them All**](https://arxiv.org/pdf/2305.16355.pdf)
| arXiv | 2023-05-25 | [Github](https://github.com/yxuansu/PandaGPT) | [Demo](https://huggingface.co/spaces/GMFTBY/PandaGPT) | +| ![Star](https://img.shields.io/github/stars/joez17/ChatBridge.svg?style=social&label=Star)
[**ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst**](https://arxiv.org/pdf/2305.16103.pdf)
| arXiv | 2023-05-25 | [Github](https://github.com/joez17/ChatBridge) | - | | ![Star](https://img.shields.io/github/stars/luogen1996/LaVIN.svg?style=social&label=Star)
[**Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models**](https://arxiv.org/pdf/2305.15023.pdf)
| arXiv | 2023-05-24 | [Github](https://github.com/luogen1996/LaVIN) | Local Demo | -| ![Star](https://img.shields.io/github/stars/OptimalScale/DetGPT.svg?style=social&label=Star)
[**DetGPT: Detect What You Need via Reasoning**](https://arxiv.org/pdf/2305.14167.pdf)
| arXiv | 2023-05-23 | [Github](https://github.com/OptimalScale/DetGPT) | [Demo](https://d3c431c0c77b1d9010.gradio.live/) | +| ![Star](https://img.shields.io/github/stars/OptimalScale/DetGPT.svg?style=social&label=Star)
[**DetGPT: Detect What You Need via Reasoning**](https://arxiv.org/pdf/2305.14167.pdf)
| arXiv | 2023-05-23 | [Github](https://github.com/OptimalScale/DetGPT) | [Demo](https://d3c431c0c77b1d9010.gradio.live/) | | ![Star](https://img.shields.io/github/stars/microsoft/Pengi.svg?style=social&label=Star)
[**Pengi: An Audio Language Model for Audio Tasks**](https://arxiv.org/pdf/2305.11834.pdf)
| NeurIPS | 2023-05-19 | [Github](https://github.com/microsoft/Pengi) | - | | ![Star](https://img.shields.io/github/stars/OpenGVLab/VisionLLM.svg?style=social&label=Star)
[**VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks**](https://arxiv.org/pdf/2305.11175.pdf)
| arXiv | 2023-05-18 | [Github](https://github.com/OpenGVLab/VisionLLM) | - | | ![Star](https://img.shields.io/github/stars/YuanGongND/ltu.svg?style=social&label=Star)
[**Listen, Think, and Understand**](https://arxiv.org/pdf/2305.10790.pdf)
| arXiv | 2023-05-18 | [Github](https://github.com/YuanGongND/ltu) | [Demo](https://github.com/YuanGongND/ltu) | | ![Star](https://img.shields.io/github/stars/THUDM/VisualGLM-6B.svg?style=social&label=Star)
**VisualGLM-6B**
| - | 2023-05-17 | [Github](https://github.com/THUDM/VisualGLM-6B) | Local Demo | -| ![Star](https://img.shields.io/github/stars/xiaoman-zhang/PMC-VQA.svg?style=social&label=Star)
[**PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering**](https://arxiv.org/pdf/2305.10415.pdf)
| arXiv | 2023-05-17 | [Github](https://github.com/xiaoman-zhang/PMC-VQA) | - | +| ![Star](https://img.shields.io/github/stars/xiaoman-zhang/PMC-VQA.svg?style=social&label=Star)
[**PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering**](https://arxiv.org/pdf/2305.10415.pdf)
| arXiv | 2023-05-17 | [Github](https://github.com/xiaoman-zhang/PMC-VQA) | - | | ![Star](https://img.shields.io/github/stars/salesforce/LAVIS.svg?style=social&label=Star)
[**InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning**](https://arxiv.org/pdf/2305.06500.pdf)
| arXiv | 2023-05-11 | [Github](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip) | Local Demo | | ![Star](https://img.shields.io/github/stars/OpenGVLab/Ask-Anything.svg?style=social&label=Star)
[**VideoChat: Chat-Centric Video Understanding**](https://arxiv.org/pdf/2305.06355.pdf)
| arXiv | 2023-05-10 | [Github](https://github.com/OpenGVLab/Ask-Anything) | [Demo](https://ask.opengvlab.com/) | | ![Star](https://img.shields.io/github/stars/open-mmlab/Multimodal-GPT.svg?style=social&label=Star)
[**MultiModal-GPT: A Vision and Language Model for Dialogue with Humans**](https://arxiv.org/pdf/2305.04790.pdf)
| arXiv | 2023-05-08 | [Github](https://github.com/open-mmlab/Multimodal-GPT) | [Demo](https://mmgpt.openmmlab.org.cn/) | -| ![Star](https://img.shields.io/github/stars/phellonchen/X-LLM.svg?style=social&label=Star)
[**X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages**](https://arxiv.org/pdf/2305.04160.pdf)
| arXiv | 2023-05-07 | [Github](https://github.com/phellonchen/X-LLM) | - | +| ![Star](https://img.shields.io/github/stars/phellonchen/X-LLM.svg?style=social&label=Star)
[**X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages**](https://arxiv.org/pdf/2305.04160.pdf)
| arXiv | 2023-05-07 | [Github](https://github.com/phellonchen/X-LLM) | - | | ![Star](https://img.shields.io/github/stars/YunxinLi/LingCloud.svg?style=social&label=Star)
[**LMEye: An Interactive Perception Network for Large Language Models**](https://arxiv.org/pdf/2305.03701.pdf)
| arXiv | 2023-05-05 | [Github](https://github.com/YunxinLi/LingCloud) | Local Demo | -| ![Star](https://img.shields.io/github/stars/OpenGVLab/LLaMA-Adapter.svg?style=social&label=Star)
[**LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model**](https://arxiv.org/pdf/2304.15010.pdf)
| arXiv | 2023-04-28 | [Github](https://github.com/OpenGVLab/LLaMA-Adapter) | [Demo](http://llama-adapter.opengvlab.com/) | +| ![Star](https://img.shields.io/github/stars/OpenGVLab/LLaMA-Adapter.svg?style=social&label=Star)
[**LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model**](https://arxiv.org/pdf/2304.15010.pdf)
| arXiv | 2023-04-28 | [Github](https://github.com/OpenGVLab/LLaMA-Adapter) | [Demo](http://llama-adapter.opengvlab.com/) | | ![Star](https://img.shields.io/github/stars/X-PLUG/mPLUG-Owl.svg?style=social&label=Star)
[**mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality**](https://arxiv.org/pdf/2304.14178.pdf)
| arXiv | 2023-04-27 | [Github](https://github.com/X-PLUG/mPLUG-Owl) | [Demo](https://huggingface.co/spaces/MAGAer13/mPLUG-Owl) | | ![Star](https://img.shields.io/github/stars/Vision-CAIR/MiniGPT-4.svg?style=social&label=Star)
[**MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models**](https://arxiv.org/pdf/2304.10592.pdf)
| arXiv | 2023-04-20 | [Github](https://github.com/Vision-CAIR/MiniGPT-4) | - | | ![Star](https://img.shields.io/github/stars/haotian-liu/LLaVA.svg?style=social&label=Star)
[**Visual Instruction Tuning**](https://arxiv.org/pdf/2304.08485.pdf)
| NeurIPS | 2023-04-17 | [GitHub](https://github.com/haotian-liu/LLaVA) | [Demo](https://llava.hliu.cc/) | | ![Star](https://img.shields.io/github/stars/OpenGVLab/LLaMA-Adapter.svg?style=social&label=Star)
| ![Star](https://img.shields.io/github/stars/OpenGVLab/LLaMA-Adapter.svg?style=social&label=Star)<br>[**LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention**](https://arxiv.org/pdf/2303.16199.pdf)<br>| ICLR | 2023-03-28 | [Github](https://github.com/OpenGVLab/LLaMA-Adapter) | [Demo](https://huggingface.co/spaces/csuhan/LLaMA-Adapter) |
-| ![Star](https://img.shields.io/github/stars/VT-NLP/MultiInstruct.svg?style=social&label=Star)<br>[**MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning**](https://arxiv.org/pdf/2212.10773.pdf)<br>| ACL | 2022-12-21 | [Github](https://github.com/VT-NLP/MultiInstruct) | - |
+| ![Star](https://img.shields.io/github/stars/VT-NLP/MultiInstruct.svg?style=social&label=Star)<br>[**MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning**](https://arxiv.org/pdf/2212.10773.pdf)<br>| ACL | 2022-12-21 | [Github](https://github.com/VT-NLP/MultiInstruct) | - |

## Multimodal Hallucination
| Title | Venue | Date | Code | Demo |
|:--------|:--------:|:--------:|:--------:|:--------:|

@@ -474,18 +475,19 @@ This is the first work to correct hallucination in multimodal large language mod
 ## Foundation Models
| Title | Venue | Date | Code | Demo |
|:--------|:--------:|:--------:|:--------:|:--------:|
+| ![Star](https://img.shields.io/github/stars/visresearch/DGMR.svg?style=social&label=Star)<br>[**Diversity-Guided MLP Reduction for Efficient Large Vision Transformers**](https://arxiv.org/abs/2506.08591)<br>| arXiv | 2025-06-27 | [Github](https://github.com/visresearch/DGMR) | - |
| ![Star](https://img.shields.io/github/stars/DAMO-NLP-SG/VideoLLaMA3.svg?style=social&label=Star)<br>[**VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding**](https://arxiv.org/pdf/2501.13106)<br>| arXiv | 2025-01-22 | [Github](https://github.com/DAMO-NLP-SG/VideoLLaMA3) | [Demo](https://huggingface.co/spaces/lixin4ever/VideoLLaMA3) |
| ![Star](https://img.shields.io/github/stars/baaivision/Emu3.svg?style=social&label=Star)<br>[**Emu3: Next-Token Prediction is All You Need**](https://arxiv.org/pdf/2409.18869)<br>| arXiv | 2024-09-27 | [Github](https://github.com/baaivision/Emu3) | Local Demo |
-| [**Llama 3.2: Revolutionizing edge AI and vision with open, customizable models**](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) | Meta | 2024-09-25 | - | [Demo](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) |
+| [**Llama 3.2: Revolutionizing edge AI and vision with open, customizable models**](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) | Meta | 2024-09-25 | - | [Demo](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) |
| [**Pixtral-12B**](https://mistral.ai/news/pixtral-12b/) | Mistral | 2024-09-17 | - | - |
| ![Star](https://img.shields.io/github/stars/salesforce/LAVIS.svg?style=social&label=Star)<br>[**xGen-MM (BLIP-3): A Family of Open Large Multimodal Models**](https://arxiv.org/pdf/2408.08872)<br>| arXiv | 2024-08-16 | [Github](https://github.com/salesforce/LAVIS/tree/xgen-mm) | - |
| [**The Llama 3 Herd of Models**](https://arxiv.org/pdf/2407.21783) | arXiv | 2024-07-31 | - | - |
| [**Chameleon: Mixed-Modal Early-Fusion Foundation Models**](https://arxiv.org/pdf/2405.09818) | arXiv | 2024-05-16 | - | - |
-| [**Hello GPT-4o**](https://openai.com/index/hello-gpt-4o/) | OpenAI | 2024-05-13 | - | - |
+| [**Hello GPT-4o**](https://openai.com/index/hello-gpt-4o/) | OpenAI | 2024-05-13 | - | - |
| [**The Claude 3 Model Family: Opus, Sonnet, Haiku**](https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf) | Anthropic | 2024-03-04 | - | - |
| [**Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context**](https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf) | Google | 2024-02-15 | - | - |
| [**Gemini: A Family of Highly Capable Multimodal Models**](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) | Google | 2023-12-06 | - | - |
-| [**Fuyu-8B: A Multimodal Architecture for AI Agents**](https://www.adept.ai/blog/fuyu-8b) | blog | 2023-10-17 | [Huggingface](https://huggingface.co/adept/fuyu-8b) | [Demo](https://huggingface.co/adept/fuyu-8b)
+| [**Fuyu-8B: A Multimodal Architecture for AI Agents**](https://www.adept.ai/blog/fuyu-8b) | blog | 2023-10-17 | [Huggingface](https://huggingface.co/adept/fuyu-8b) | [Demo](https://huggingface.co/adept/fuyu-8b) |
| ![Star](https://img.shields.io/github/stars/mshukor/UnIVAL.svg?style=social&label=Star)<br>[**Unified Model for Image, Video, Audio and Language Tasks**](https://arxiv.org/pdf/2307.16184.pdf)<br>| arXiv | 2023-07-30 | [Github](https://github.com/mshukor/UnIVAL) | [Demo](https://huggingface.co/spaces/mshukor/UnIVAL) |
| [**PaLI-3 Vision Language Models: Smaller, Faster, Stronger**](https://arxiv.org/pdf/2310.09199.pdf) | arXiv | 2023-10-13 | - | - |
| [**GPT-4V(ision) System Card**](https://cdn.openai.com/papers/GPTV_System_Card.pdf) | OpenAI | 2023-09-25 | - | - |
@@ -494,13 +496,13 @@ This is the first work to correct hallucination in multimodal large language mod
| ![Star](https://img.shields.io/github/stars/yiren-jian/BLIText.svg?style=social&label=Star)<br>[**Bootstrapping Vision-Language Learning with Decoupled Language Pre-training**](https://arxiv.org/pdf/2307.07063.pdf)<br>| NeurIPS | 2023-07-13 | [Github](https://github.com/yiren-jian/BLIText) | - |
| ![Star](https://img.shields.io/github/stars/baaivision/Emu.svg?style=social&label=Star)<br>[**Generative Pretraining in Multimodality**](https://arxiv.org/pdf/2307.05222.pdf)<br>| arXiv | 2023-07-11 | [Github](https://github.com/baaivision/Emu) | [Demo](http://218.91.113.230:9002/) |
| ![Star](https://img.shields.io/github/stars/microsoft/unilm.svg?style=social&label=Star)<br>[**Kosmos-2: Grounding Multimodal Large Language Models to the World**](https://arxiv.org/pdf/2306.14824.pdf)<br>| arXiv | 2023-06-26 | [Github](https://github.com/microsoft/unilm/tree/master/kosmos-2) | [Demo](https://aka.ms/kosmos-2-demo) |
-| ![Star](https://img.shields.io/github/stars/VPGTrans/VPGTrans.svg?style=social&label=Star)<br>[**Transfer Visual Prompt Generator across LLMs**](https://arxiv.org/pdf/2305.01278.pdf)<br>| arXiv | 2023-05-02 | [Github](https://github.com/VPGTrans/VPGTrans) | [Demo](https://3fc7715dbc44234a7f.gradio.live/) |
+| ![Star](https://img.shields.io/github/stars/VPGTrans/VPGTrans.svg?style=social&label=Star)<br>[**Transfer Visual Prompt Generator across LLMs**](https://arxiv.org/pdf/2305.01278.pdf)<br>| arXiv | 2023-05-02 | [Github](https://github.com/VPGTrans/VPGTrans) | [Demo](https://3fc7715dbc44234a7f.gradio.live/) |
| [**GPT-4 Technical Report**](https://arxiv.org/pdf/2303.08774.pdf) | arXiv | 2023-03-15 | - | - |
-| [**PaLM-E: An Embodied Multimodal Language Model**](https://arxiv.org/pdf/2303.03378.pdf) | arXiv | 2023-03-06 | - | [Demo](https://palm-e.github.io/#demo) |
+| [**PaLM-E: An Embodied Multimodal Language Model**](https://arxiv.org/pdf/2303.03378.pdf) | arXiv | 2023-03-06 | - | [Demo](https://palm-e.github.io/#demo) |
| ![Star](https://img.shields.io/github/stars/NVlabs/prismer.svg?style=social&label=Star)<br>[**Prismer: A Vision-Language Model with An Ensemble of Experts**](https://arxiv.org/pdf/2303.02506.pdf)<br>| arXiv | 2023-03-04 | [Github](https://github.com/NVlabs/prismer) | [Demo](https://huggingface.co/spaces/lorenmt/prismer) |
| ![Star](https://img.shields.io/github/stars/microsoft/unilm.svg?style=social&label=Star)<br>[**Language Is Not All You Need: Aligning Perception with Language Models**](https://arxiv.org/pdf/2302.14045.pdf)<br>| arXiv | 2023-02-27 | [Github](https://github.com/microsoft/unilm) | - |
-| ![Star](https://img.shields.io/github/stars/salesforce/LAVIS.svg?style=social&label=Star)<br>[**BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models**](https://arxiv.org/pdf/2301.12597.pdf)<br>| arXiv | 2023-01-30 | [Github](https://github.com/salesforce/LAVIS/tree/main/projects/blip2) | [Demo](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/examples/blip2_instructed_generation.ipynb) |
-| ![Star](https://img.shields.io/github/stars/vimalabs/VIMA.svg?style=social&label=Star)<br>[**VIMA: General Robot Manipulation with Multimodal Prompts**](https://arxiv.org/pdf/2210.03094.pdf)<br>| ICML | 2022-10-06 | [Github](https://github.com/vimalabs/VIMA) | Local Demo |
+| ![Star](https://img.shields.io/github/stars/salesforce/LAVIS.svg?style=social&label=Star)<br>[**BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models**](https://arxiv.org/pdf/2301.12597.pdf)<br>| arXiv | 2023-01-30 | [Github](https://github.com/salesforce/LAVIS/tree/main/projects/blip2) | [Demo](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/examples/blip2_instructed_generation.ipynb) |
+| ![Star](https://img.shields.io/github/stars/vimalabs/VIMA.svg?style=social&label=Star)<br>[**VIMA: General Robot Manipulation with Multimodal Prompts**](https://arxiv.org/pdf/2210.03094.pdf)<br>| ICML | 2022-10-06 | [Github](https://github.com/vimalabs/VIMA) | Local Demo |
| ![Star](https://img.shields.io/github/stars/MineDojo/MineDojo.svg?style=social&label=Star)<br>[**MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge**](https://arxiv.org/pdf/2206.08853.pdf)<br>| NeurIPS | 2022-06-17 | [Github](https://github.com/MineDojo/MineDojo) | - |
| ![Star](https://img.shields.io/github/stars/shizhediao/DaVinci.svg?style=social&label=Star)<br>[**Write and Paint: Generative Vision-Language Models are Unified Modal Learners**](https://arxiv.org/pdf/2206.07699.pdf)<br>| ICLR | 2022-06-15 | [Github](https://github.com/shizhediao/DaVinci) | - |
| ![Star](https://img.shields.io/github/stars/microsoft/unilm.svg?style=social&label=Star)<br>[**Language Models are General-Purpose Interfaces**](https://arxiv.org/pdf/2206.06336.pdf)<br>| arXiv | 2022-06-13 | [Github](https://github.com/microsoft/unilm) | - |