diff --git a/README.md b/README.md
index 5751a78..f988f2f 100644
--- a/README.md
+++ b/README.md
@@ -116,6 +116,7 @@ This is the first work to correct hallucination in multimodal large language mod
## Multimodal Instruction Tuning
| Title | Venue | Date | Code | Demo |
|:--------|:--------:|:--------:|:--------:|:--------:|
+| [**Learning Compact Vision Tokens for Efficient Large Multimodal Models**](https://arxiv.org/abs/2506.07138) | arXiv | 2025-06-27 | [Github](https://github.com/visresearch/LLaVA-STF) | - |
| [**MMSearch-R1: Incentivizing LMMs to Search**](https://arxiv.org/pdf/2506.20670) | arXiv | 2025-06-25 | [Github](https://github.com/EvolvingLMMs-Lab/multimodal-search-r1) | - |
| [**Show-o2: Improved Native Unified Multimodal Models**](https://arxiv.org/pdf/2506.15564) | arXiv | 2025-06-18 | [Github](https://github.com/showlab/Show-o) | - |
| [**Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities**](https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf) | Google | 2025-06-17 | - | - |
@@ -140,7 +141,7 @@ This is the first work to correct hallucination in multimodal large language mod
| [**Addendum to GPT-4o System Card: Native image generation**](https://cdn.openai.com/11998be9-5319-4302-bfbf-1167e093f1fb/Native_Image_Generation_System_Card.pdf) | OpenAI | 2025-03-25 | - | - |
| [**Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation**](https://arxiv.org/pdf/2411.19951) | arXiv | 2025-03-17 | [Github](https://github.com/VITA-MLLM/Sparrow) | - |
| [**Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision**](https://arxiv.org/pdf/2503.01879) | arXiv | 2025-03-07 | - | - |
-| [**Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs**](https://arxiv.org/pdf/2503.01743) | arXiv | 2025-03-03 | [Hugging Face](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) | [Demo](https://huggingface.co/spaces/microsoft/phi-4-multimodal) |
+| [**Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs**](https://arxiv.org/pdf/2503.01743) | arXiv | 2025-03-03 | [Hugging Face](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) | [Demo](https://huggingface.co/spaces/microsoft/phi-4-multimodal) |
| [**Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuray**](https://arxiv.org/pdf/2502.05177) | arXiv | 2025-02-19 | [Github](https://github.com/VITA-MLLM/Long-VITA) | - |
| [**Qwen2.5-VL Technical Report**](https://arxiv.org/pdf/2502.13923) | arXiv | 2025-02-19 | [Github](https://github.com/QwenLM/Qwen2.5-VL) | [Demo](https://huggingface.co/spaces/Qwen/Qwen2.5-VL) |
| [**Baichuan-Omni-1.5 Technical Report**](https://github.com/baichuan-inc/Baichuan-Omni-1.5/blob/main/baichuan_omni_1_5.pdf) | Tech Report | 2025-01-26 | [Github](https://github.com/baichuan-inc/Baichuan-Omni-1.5) | Local Demo |
@@ -157,20 +158,20 @@ This is the first work to correct hallucination in multimodal large language mod
| [**NVILA: Efficient Frontier Visual Language Models**](https://arxiv.org/pdf/2412.04468) | arXiv | 2024-12-05 | [Github](https://github.com/NVlabs/VILA) | [Demo](https://vila.mit.edu) |
| [**Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning**](https://arxiv.org/pdf/2412.03565) | arXiv | 2024-12-04 | [Github](https://github.com/inst-it/inst-it) | - |
| [**TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability**](https://arxiv.org/pdf/2411.18211) | arXiv | 2024-11-27 | [Github](https://github.com/TimeMarker-LLM/TimeMarker/) | - |
-| [**ChatRex: Taming Multimodal LLM for Joint Perception and Understanding**](https://arxiv.org/pdf/2411.18363) | arXiv | 2024-11-27 | [Github](https://github.com/IDEA-Research/ChatRex) | Local Demo |
+| [**ChatRex: Taming Multimodal LLM for Joint Perception and Understanding**](https://arxiv.org/pdf/2411.18363) | arXiv | 2024-11-27 | [Github](https://github.com/IDEA-Research/ChatRex) | Local Demo |
| [**LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding**](https://arxiv.org/pdf/2410.17434) | arXiv | 2024-10-22 | [Github](https://github.com/Vision-CAIR/LongVU) | [Demo](https://huggingface.co/spaces/Vision-CAIR/LongVU) |
| [**Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate**](https://arxiv.org/pdf/2410.07167) | arXiv | 2024-10-09 | [Github](https://github.com/shikiw/Modality-Integration-Rate) | - |
| [**AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark**](https://arxiv.org/pdf/2410.03051) | arXiv | 2024-10-04 | [Github](https://github.com/rese1f/aurora) | Local Demo |
-| [**EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions**](https://arxiv.org/pdf/2409.18042) | CVPR | 2024-09-26 | [Github](https://github.com/emova-ollm/EMOVA) | [Demo](https://huggingface.co/spaces/Emova-ollm/EMOVA-demo) |
+| [**EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions**](https://arxiv.org/pdf/2409.18042) | CVPR | 2024-09-26 | [Github](https://github.com/emova-ollm/EMOVA) | [Demo](https://huggingface.co/spaces/Emova-ollm/EMOVA-demo) |
| [**Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models**](https://arxiv.org/pdf/2409.17146) | arXiv | 2024-09-25 | [Huggingface](https://huggingface.co/allenai/MolmoE-1B-0924) | [Demo](https://molmo.allenai.org) |
| [**Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution**](https://arxiv.org/pdf/2409.12191) | arXiv | 2024-09-18 | [Github](https://github.com/QwenLM/Qwen2-VL) | [Demo](https://huggingface.co/spaces/Qwen/Qwen2-VL) |
| [**ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding**](https://arxiv.org/pdf/2409.03277) | ICLR | 2024-09-05 | [Github](https://github.com/IDEA-FinAI/ChartMoE) | Local Demo |
-| [**LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture**](https://arxiv.org/pdf/2409.02889) | arXiv | 2024-09-04 | [Github](https://github.com/FreedomIntelligence/LongLLaVA) | - |
+| [**LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture**](https://arxiv.org/pdf/2409.02889) | arXiv | 2024-09-04 | [Github](https://github.com/FreedomIntelligence/LongLLaVA) | - |
| [**EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders**](https://arxiv.org/pdf/2408.15998) | arXiv | 2024-08-28 | [Github](https://github.com/NVlabs/Eagle) | [Demo](https://huggingface.co/spaces/NVEagle/Eagle-X5-13B-Chat) |
| [**LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation**](https://arxiv.org/pdf/2408.15881) | arXiv | 2024-08-28 | [Github](https://github.com/shufangxun/LLaVA-MoD) | - |
| [**mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models**](https://www.arxiv.org/pdf/2408.04840) | arXiv | 2024-08-09 | [Github](https://github.com/X-PLUG/mPLUG-Owl) | - |
-| [**VITA: Towards Open-Source Interactive Omni Multimodal LLM**](https://arxiv.org/pdf/2408.05211) | arXiv | 2024-08-09 | [Github](https://github.com/VITA-MLLM/VITA) | - |
-| [**LLaVA-OneVision: Easy Visual Task Transfer**](https://arxiv.org/pdf/2408.03326) | arXiv | 2024-08-06 | [Github](https://github.com/LLaVA-VL/LLaVA-NeXT) | [Demo](https://llava-onevision.lmms-lab.com) |
+| [**VITA: Towards Open-Source Interactive Omni Multimodal LLM**](https://arxiv.org/pdf/2408.05211) | arXiv | 2024-08-09 | [Github](https://github.com/VITA-MLLM/VITA) | - |
+| [**LLaVA-OneVision: Easy Visual Task Transfer**](https://arxiv.org/pdf/2408.03326) | arXiv | 2024-08-06 | [Github](https://github.com/LLaVA-VL/LLaVA-NeXT) | [Demo](https://llava-onevision.lmms-lab.com) |
| [**MiniCPM-V: A GPT-4V Level MLLM on Your Phone**](https://arxiv.org/pdf/2408.01800) | arXiv | 2024-08-03 | [Github](https://github.com/OpenBMB/MiniCPM-V) | [Demo](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5) |
| [**VILA^2: VILA Augmented VILA**](https://arxiv.org/pdf/2407.17453) | arXiv | 2024-07-24 | - | - |
| [**SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models**](https://arxiv.org/pdf/2407.15841) | arXiv | 2024-07-22 | - | - |
@@ -185,7 +186,7 @@ This is the first work to correct hallucination in multimodal large language mod
| [**TroL: Traversal of Layers for Large Language and Vision Models**](https://arxiv.org/pdf/2406.12246) | EMNLP | 2024-06-18 | [Github](https://github.com/ByungKwanLee/TroL) | Local Demo |
| [**Unveiling Encoder-Free Vision-Language Models**](https://arxiv.org/pdf/2406.11832) | arXiv | 2024-06-17 | [Github](https://github.com/baaivision/EVE) | Local Demo |
| [**VideoLLM-online: Online Video Large Language Model for Streaming Video**](https://arxiv.org/pdf/2406.11816) | CVPR | 2024-06-17 | [Github](https://github.com/showlab/VideoLLM-online) | Local Demo |
-| [**RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics**](https://arxiv.org/pdf/2406.10721) | CoRL | 2024-06-15 | [Github](https://github.com/wentaoyuan/RoboPoint) | [Demo](https://007e03d34429a2517b.gradio.live/) |
+| [**RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics**](https://arxiv.org/pdf/2406.10721) | CoRL | 2024-06-15 | [Github](https://github.com/wentaoyuan/RoboPoint) | [Demo](https://007e03d34429a2517b.gradio.live/) |
| [**Comparison Visual Instruction Tuning**](https://arxiv.org/abs/2406.09240) | arXiv | 2024-06-13 | [Github](https://wlin-at.github.io/cad_vi) | Local Demo |
| [**Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models**](https://arxiv.org/pdf/2406.08487) | arXiv | 2024-06-12 | [Github](https://github.com/yfzhang114/SliME) | - |
| [**VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs**](https://arxiv.org/pdf/2406.07476) | arXiv | 2024-06-11 | [Github](https://github.com/DAMO-NLP-SG/VideoLLaMA2) | Local Demo |
@@ -193,7 +194,7 @@ This is the first work to correct hallucination in multimodal large language mod
| [**Ovis: Structural Embedding Alignment for Multimodal Large Language Model**](https://arxiv.org/pdf/2405.20797) | arXiv | 2024-05-31 | [Github](https://github.com/AIDC-AI/Ovis/) | - |
| [**Matryoshka Query Transformer for Large Vision-Language Models**](https://arxiv.org/pdf/2405.19315) | arXiv | 2024-05-29 | [Github](https://github.com/gordonhu608/MQT-LLaVA) | [Demo](https://huggingface.co/spaces/gordonhu/MQT-LLaVA) |
| [**ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models**](https://arxiv.org/pdf/2405.15738) | arXiv | 2024-05-24 | [Github](https://github.com/alibaba/conv-llava) | - |
-| [**Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models**](https://arxiv.org/pdf/2405.15574) | arXiv | 2024-05-24 | [Github](https://github.com/ByungKwanLee/Meteor) | [Demo](https://huggingface.co/spaces/BK-Lee/Meteor) |
+| [**Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models**](https://arxiv.org/pdf/2405.15574) | arXiv | 2024-05-24 | [Github](https://github.com/ByungKwanLee/Meteor) | [Demo](https://huggingface.co/spaces/BK-Lee/Meteor) |
| [**Libra: Building Decoupled Vision System on Large Language Models**](https://arxiv.org/pdf/2405.10140) | ICML | 2024-05-16 | [Github](https://github.com/YifanXu74/Libra) | Local Demo |
| [**CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts**](https://arxiv.org/pdf/2405.05949) | arXiv | 2024-05-09 | [Github](https://github.com/SHI-Labs/CuMo) | Local Demo |
| [**How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites**](https://arxiv.org/pdf/2404.16821) | arXiv | 2024-04-25 | [Github](https://github.com/OpenGVLab/InternVL) | [Demo](https://internvl.opengvlab.com) |
@@ -216,7 +217,7 @@ This is the first work to correct hallucination in multimodal large language mod
| [**Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning**](https://arxiv.org/pdf/2402.11435.pdf) | arXiv | 2024-02-18 | [Github](https://github.com/DCDmllm/Momentor) | - |
| [**ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model**](https://arxiv.org/pdf/2402.11684.pdf) | arXiv | 2024-02-18 | [Github](https://github.com/FreedomIntelligence/ALLaVA) | [Demo](https://huggingface.co/FreedomIntelligence/ALLaVA-3B) |
| [**CoLLaVO: Crayon Large Language and Vision mOdel**](https://arxiv.org/pdf/2402.11248.pdf) | arXiv | 2024-02-17 | [Github](https://github.com/ByungKwanLee/CoLLaVO-Crayon-Large-Language-and-Vision-mOdel) | - |
-| [**Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models**](https://arxiv.org/pdf/2402.07865) | ICML | 2024-02-12 | [Github](https://github.com/TRI-ML/prismatic-vlms) | - |
+| [**Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models**](https://arxiv.org/pdf/2402.07865) | ICML | 2024-02-12 | [Github](https://github.com/TRI-ML/prismatic-vlms) | - |
| [**CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations**](https://arxiv.org/pdf/2402.04236.pdf) | arXiv | 2024-02-06 | [Github](https://github.com/THUDM/CogCoM) | - |
| [**MobileVLM V2: Faster and Stronger Baseline for Vision Language Model**](https://arxiv.org/pdf/2402.03766.pdf) | arXiv | 2024-02-06 | [Github](https://github.com/Meituan-AutoML/MobileVLM) | - |
| [**GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning**](https://arxiv.org/pdf/2402.02130) | NeurIPS | 2024-02-03 | [Github](https://github.com/WEIYanbin1999/GITA/) | - |
@@ -226,22 +227,22 @@ This is the first work to correct hallucination in multimodal large language mod
| [**InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model**](https://arxiv.org/pdf/2401.16420.pdf) | arXiv | 2024-01-29 | [Github](https://github.com/InternLM/InternLM-XComposer) | [Demo](https://openxlab.org.cn/apps/detail/WillowBreeze/InternLM-XComposer) |
| [**Yi-VL**](https://github.com/01-ai/Yi/tree/main/VL) | - | 2024-01-23 | [Github](https://github.com/01-ai/Yi/tree/main/VL) | Local Demo |
| [**SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities**](https://arxiv.org/pdf/2401.12168.pdf) | arXiv | 2024-01-22 | - | - |
-| [**ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning**](https://arxiv.org/pdf/2401.02384) | ACL | 2024-01-04 | [Github](https://github.com/OpenGVLab/ChartAst) | Local Demo |
-| [**MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices**](https://arxiv.org/pdf/2312.16886.pdf) | arXiv | 2023-12-28 | [Github](https://github.com/Meituan-AutoML/MobileVLM) | - |
+| [**ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning**](https://arxiv.org/pdf/2401.02384) | ACL | 2024-01-04 | [Github](https://github.com/OpenGVLab/ChartAst) | Local Demo |
+| [**MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices**](https://arxiv.org/pdf/2312.16886.pdf) | arXiv | 2023-12-28 | [Github](https://github.com/Meituan-AutoML/MobileVLM) | - |
| [**InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks**](https://arxiv.org/pdf/2312.14238.pdf) | CVPR | 2023-12-21 | [Github](https://github.com/OpenGVLab/InternVL) | [Demo](https://internvl.opengvlab.com) |
| [**Osprey: Pixel Understanding with Visual Instruction Tuning**](https://arxiv.org/pdf/2312.10032.pdf) | CVPR | 2023-12-15 | [Github](https://github.com/CircleRadon/Osprey) | [Demo](http://111.0.123.204:8000/) |
| [**CogAgent: A Visual Language Model for GUI Agents**](https://arxiv.org/pdf/2312.08914.pdf) | arXiv | 2023-12-14 | [Github](https://github.com/THUDM/CogVLM) | [Coming soon]() |
| [**Pixel Aligned Language Models**](https://arxiv.org/pdf/2312.09237.pdf) | arXiv | 2023-12-14 | [Coming soon]() | - |
| [**VILA: On Pre-training for Visual Language Models**](https://arxiv.org/pdf/2312.07533) | CVPR | 2023-12-13 | [Github](https://github.com/NVlabs/VILA) | Local Demo |
-| [**See, Say, and Segment: Teaching LMMs to Overcome False Premises**](https://arxiv.org/pdf/2312.08366.pdf) | arXiv | 2023-12-13 | [Coming soon]() | - |
+| [**See, Say, and Segment: Teaching LMMs to Overcome False Premises**](https://arxiv.org/pdf/2312.08366.pdf) | arXiv | 2023-12-13 | [Coming soon]() | - |
| [**Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models**](https://arxiv.org/pdf/2312.06109.pdf) | ECCV | 2023-12-11 | [Github](https://github.com/Ucas-HaoranWei/Vary) | [Demo](http://region-31.seetacloud.com:22701/) |
| [**Honeybee: Locality-enhanced Projector for Multimodal LLM**](https://arxiv.org/pdf/2312.06742.pdf) | CVPR | 2023-12-11 | [Github](https://github.com/kakaobrain/honeybee) | - |
| [**Gemini: A Family of Highly Capable Multimodal Models**](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) | Google | 2023-12-06 | - | - |
| [**OneLLM: One Framework to Align All Modalities with Language**](https://arxiv.org/pdf/2312.03700.pdf) | arXiv | 2023-12-06 | [Github](https://github.com/csuhan/OneLLM) | [Demo](https://huggingface.co/spaces/csuhan/OneLLM) |
-| [**Lenna: Language Enhanced Reasoning Detection Assistant**](https://arxiv.org/pdf/2312.02433.pdf) | arXiv | 2023-12-05 | [Github](https://github.com/Meituan-AutoML/Lenna) | - |
+| [**Lenna: Language Enhanced Reasoning Detection Assistant**](https://arxiv.org/pdf/2312.02433.pdf) | arXiv | 2023-12-05 | [Github](https://github.com/Meituan-AutoML/Lenna) | - |
| [**VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding**](https://arxiv.org/pdf/2312.02310.pdf) | arXiv | 2023-12-04 | - | - |
-| [**TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding**](https://arxiv.org/pdf/2312.02051.pdf) | arXiv | 2023-12-04 | [Github](https://github.com/RenShuhuai-Andy/TimeChat) | Local Demo |
-| [**Making Large Multimodal Models Understand Arbitrary Visual Prompts**](https://arxiv.org/pdf/2312.00784.pdf) | CVPR | 2023-12-01 | [Github](https://github.com/mu-cai/vip-llava) | [Demo](https://pages.cs.wisc.edu/~mucai/vip-llava.html) |
+| [**TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding**](https://arxiv.org/pdf/2312.02051.pdf) | arXiv | 2023-12-04 | [Github](https://github.com/RenShuhuai-Andy/TimeChat) | Local Demo |
+| [**Making Large Multimodal Models Understand Arbitrary Visual Prompts**](https://arxiv.org/pdf/2312.00784.pdf) | CVPR | 2023-12-01 | [Github](https://github.com/mu-cai/vip-llava) | [Demo](https://pages.cs.wisc.edu/~mucai/vip-llava.html) |
| [**Dolphins: Multimodal Language Model for Driving**](https://arxiv.org/pdf/2312.00438.pdf) | arXiv | 2023-12-01 | [Github](https://github.com/vlm-driver/Dolphins) | - |
| [**LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning**](https://arxiv.org/pdf/2311.18651.pdf) | arXiv | 2023-11-30 | [Github](https://github.com/Open3DA/LL3DA) | [Coming soon]() |
| [**VTimeLLM: Empower LLM to Grasp Video Moments**](https://arxiv.org/pdf/2311.18445.pdf) | arXiv | 2023-11-30 | [Github](https://github.com/huangb23/VTimeLLM/) | Local Demo |
@@ -258,20 +259,20 @@ This is the first work to correct hallucination in multimodal large language mod
| [**SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models**](https://arxiv.org/pdf/2311.07575.pdf) | arXiv | 2023-11-13 | [Github](https://github.com/Alpha-VLLM/LLaMA2-Accessory) | [Demo](http://imagebind-llm.opengvlab.com/) |
| [**Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models**](https://arxiv.org/pdf/2311.06607.pdf) | CVPR | 2023-11-11 | [Github](https://github.com/Yuliang-Liu/Monkey) | [Demo](http://27.17.184.224:7681/) |
| [**LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents**](https://arxiv.org/pdf/2311.05437.pdf) | arXiv | 2023-11-09 | [Github](https://github.com/LLaVA-VL/LLaVA-Plus-Codebase) | [Demo](https://llavaplus.ngrok.io/) |
-| [**NExT-Chat: An LMM for Chat, Detection and Segmentation**](https://arxiv.org/pdf/2311.04498.pdf) | arXiv | 2023-11-08 | [Github](https://github.com/NExT-ChatV/NExT-Chat) | Local Demo |
+| [**NExT-Chat: An LMM for Chat, Detection and Segmentation**](https://arxiv.org/pdf/2311.04498.pdf) | arXiv | 2023-11-08 | [Github](https://github.com/NExT-ChatV/NExT-Chat) | Local Demo |
| [**mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration**](https://arxiv.org/pdf/2311.04257.pdf) | arXiv | 2023-11-07 | [Github](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2) | [Demo](https://modelscope.cn/studios/damo/mPLUG-Owl2/summary) |
| [**OtterHD: A High-Resolution Multi-modality Model**](https://arxiv.org/pdf/2311.04219.pdf) | arXiv | 2023-11-07 | [Github](https://github.com/Luodian/Otter) | - |
| [**CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding**](https://arxiv.org/pdf/2311.03354.pdf) | arXiv | 2023-11-06 | [Coming soon]() | - |
| [**GLaMM: Pixel Grounding Large Multimodal Model**](https://arxiv.org/pdf/2311.03356.pdf) | CVPR | 2023-11-06 | [Github](https://github.com/mbzuai-oryx/groundingLMM) | [Demo](https://glamm.mbzuai-oryx.ngrok.app/) |
| [**What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning**](https://arxiv.org/pdf/2311.01487.pdf) | arXiv | 2023-11-02| [Github](https://github.com/RUCAIBox/ComVint) | - |
-| [**MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning**](https://arxiv.org/pdf/2310.09478.pdf) | arXiv | 2023-10-14 | [Github](https://github.com/Vision-CAIR/MiniGPT-4) | Local Demo |
+| [**MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning**](https://arxiv.org/pdf/2310.09478.pdf) | arXiv | 2023-10-14 | [Github](https://github.com/Vision-CAIR/MiniGPT-4) | Local Demo |
| [**SALMONN: Towards Generic Hearing Abilities for Large Language Models**](https://arxiv.org/pdf/2310.13289) | ICLR | 2023-10-20 | [Github](https://github.com/bytedance/SALMONN) | - |
| [**Ferret: Refer and Ground Anything Anywhere at Any Granularity**](https://arxiv.org/pdf/2310.07704.pdf) | arXiv | 2023-10-11 | [Github](https://github.com/apple/ml-ferret) | - |
-| [**CogVLM: Visual Expert For Large Language Models**](https://arxiv.org/pdf/2311.03079.pdf) | arXiv | 2023-10-09 | [Github](https://github.com/THUDM/CogVLM) | [Demo](http://36.103.203.44:7861/) |
+| [**CogVLM: Visual Expert For Large Language Models**](https://arxiv.org/pdf/2311.03079.pdf) | arXiv | 2023-10-09 | [Github](https://github.com/THUDM/CogVLM) | [Demo](http://36.103.203.44:7861/) |
| [**Improved Baselines with Visual Instruction Tuning**](https://arxiv.org/pdf/2310.03744.pdf) | arXiv | 2023-10-05 | [Github](https://github.com/haotian-liu/LLaVA) | [Demo](https://llava.hliu.cc/) |
-| [**LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment**](https://arxiv.org/pdf/2310.01852.pdf) | ICLR | 2023-10-03 | [Github](https://github.com/PKU-YuanGroup/LanguageBind) | [Demo](https://huggingface.co/spaces/LanguageBind/LanguageBind) |
-[**Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs**](https://arxiv.org/pdf/2310.00582.pdf) | arXiv | 2023-10-01 | [Github](https://github.com/SY-Xuan/Pink) | - |
-| [**Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants**](https://arxiv.org/pdf/2310.00653.pdf) | arXiv | 2023-10-01 | [Github](https://github.com/thunlp/Muffin) | Local Demo |
+| [**LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment**](https://arxiv.org/pdf/2310.01852.pdf) | ICLR | 2023-10-03 | [Github](https://github.com/PKU-YuanGroup/LanguageBind) | [Demo](https://huggingface.co/spaces/LanguageBind/LanguageBind) |
+| [**Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs**](https://arxiv.org/pdf/2310.00582.pdf) | arXiv | 2023-10-01 | [Github](https://github.com/SY-Xuan/Pink) | - |
+| [**Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants**](https://arxiv.org/pdf/2310.00653.pdf) | arXiv | 2023-10-01 | [Github](https://github.com/thunlp/Muffin) | Local Demo |
| [**AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model**](https://arxiv.org/pdf/2309.16058.pdf) | arXiv | 2023-09-27 | - | - |
| [**InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition**](https://arxiv.org/pdf/2309.15112.pdf) | arXiv | 2023-09-26 | [Github](https://github.com/InternLM/InternLM-XComposer) | Local Demo |
| [**DreamLLM: Synergistic Multimodal Comprehension and Creation**](https://arxiv.org/pdf/2309.11499.pdf) | ICLR | 2023-09-20 | [Github](https://github.com/RunpeiDong/DreamLLM) | [Coming soon]() |
@@ -280,58 +281,58 @@ This is the first work to correct hallucination in multimodal large language mod
| [**NExT-GPT: Any-to-Any Multimodal LLM**](https://arxiv.org/pdf/2309.05519.pdf) | arXiv | 2023-09-11 | [Github](https://github.com/NExT-GPT/NExT-GPT) | [Demo](https://fc7a82a1c76b336b6f.gradio.live/) |
| [**Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics**](https://arxiv.org/pdf/2309.07120.pdf) | arXiv | 2023-09-13 | [Github](https://github.com/UCSC-VLAA/Sight-Beyond-Text) | - |
| [**ImageBind-LLM: Multi-modality Instruction Tuning**](https://arxiv.org/pdf/2309.03905.pdf) | arXiv | 2023-09-07 | [Github](https://github.com/OpenGVLab/LLaMA-Adapter) | [Demo](http://imagebind-llm.opengvlab.com/) |
-| [**Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning**](https://arxiv.org/pdf/2309.02591.pdf) | arXiv | 2023-09-05 | - | - |
+| [**Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning**](https://arxiv.org/pdf/2309.02591.pdf) | arXiv | 2023-09-05 | - | - |
| [**PointLLM: Empowering Large Language Models to Understand Point Clouds**](https://arxiv.org/pdf/2308.16911.pdf) | arXiv | 2023-08-31 | [Github](https://github.com/OpenRobotLab/PointLLM) | [Demo](http://101.230.144.196/) |
| [**✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models**](https://arxiv.org/pdf/2308.16463.pdf) | arXiv | 2023-08-31 | [Github](https://github.com/HYPJUDY/Sparkles) | Local Demo |
| [**MLLM-DataEngine: An Iterative Refinement Approach for MLLM**](https://arxiv.org/pdf/2308.13566.pdf) | arXiv | 2023-08-25 | [Github](https://github.com/opendatalab/MLLM-DataEngine) | - |
-| [**Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models**](https://arxiv.org/pdf/2308.13437.pdf) | arXiv | 2023-08-25 | [Github](https://github.com/PVIT-official/PVIT) | [Demo](https://huggingface.co/spaces/PVIT/pvit) |
-| [**Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities**](https://arxiv.org/pdf/2308.12966.pdf) | arXiv | 2023-08-24 | [Github](https://github.com/QwenLM/Qwen-VL) | [Demo](https://modelscope.cn/studios/qwen/Qwen-VL-Chat-Demo/summary) |
-| [**Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages**](https://arxiv.org/pdf/2308.12038.pdf) | ICLR | 2023-08-23 | [Github](https://github.com/OpenBMB/VisCPM) | [Demo](https://huggingface.co/spaces/openbmb/viscpm-chat) |
+| [**Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models**](https://arxiv.org/pdf/2308.13437.pdf) | arXiv | 2023-08-25 | [Github](https://github.com/PVIT-official/PVIT) | [Demo](https://huggingface.co/spaces/PVIT/pvit) |
+| [**Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities**](https://arxiv.org/pdf/2308.12966.pdf) | arXiv | 2023-08-24 | [Github](https://github.com/QwenLM/Qwen-VL) | [Demo](https://modelscope.cn/studios/qwen/Qwen-VL-Chat-Demo/summary) |
+| [**Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages**](https://arxiv.org/pdf/2308.12038.pdf) | ICLR | 2023-08-23 | [Github](https://github.com/OpenBMB/VisCPM) | [Demo](https://huggingface.co/spaces/openbmb/viscpm-chat) |
| [**StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data**](https://arxiv.org/pdf/2308.10253.pdf) | arXiv | 2023-08-20 | [Github](https://github.com/icoz69/StableLLAVA) | - |
| [**BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions**](https://arxiv.org/pdf/2308.09936.pdf) | arXiv | 2023-08-19 | [Github](https://github.com/mlpc-ucsd/BLIVA) | [Demo](https://huggingface.co/spaces/mlpc-lab/BLIVA) |
| [**Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions**](https://arxiv.org/pdf/2308.04152.pdf) | arXiv | 2023-08-08 | [Github](https://github.com/DCDmllm/Cheetah) | - |
-| [**The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World**](https://arxiv.org/pdf/2308.01907.pdf) | ICLR | 2023-08-03 | [Github](https://github.com/OpenGVLab/All-Seeing) | [Demo](https://huggingface.co/spaces/OpenGVLab/all-seeing) |
+| [**The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World**](https://arxiv.org/pdf/2308.01907.pdf) | ICLR | 2023-08-03 | [Github](https://github.com/OpenGVLab/All-Seeing) | [Demo](https://huggingface.co/spaces/OpenGVLab/all-seeing) |
| [**LISA: Reasoning Segmentation via Large Language Model**](https://arxiv.org/pdf/2308.00692.pdf) | arXiv | 2023-08-01 | [Github](https://github.com/dvlab-research/LISA) | [Demo](http://103.170.5.190:7860) |
| [**MovieChat: From Dense Token to Sparse Memory for Long Video Understanding**](https://arxiv.org/pdf/2307.16449.pdf) | arXiv | 2023-07-31 | [Github](https://github.com/rese1f/MovieChat) | Local Demo |
-| [**3D-LLM: Injecting the 3D World into Large Language Models**](https://arxiv.org/pdf/2307.12981.pdf) | arXiv | 2023-07-24 | [Github](https://github.com/UMass-Foundation-Model/3D-LLM) | - |
+| [**3D-LLM: Injecting the 3D World into Large Language Models**](https://arxiv.org/pdf/2307.12981.pdf) | arXiv | 2023-07-24 | [Github](https://github.com/UMass-Foundation-Model/3D-LLM) | - |
| [**ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning**](https://arxiv.org/pdf/2307.09474.pdf) | arXiv | 2023-07-18 | - | [Demo](https://chatspot.streamlit.app/) |
| [**BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs**](https://arxiv.org/pdf/2307.08581.pdf) | arXiv | 2023-07-17 | [Github](https://github.com/magic-research/bubogpt) | [Demo](https://huggingface.co/spaces/magicr/BuboGPT) |
| [**SVIT: Scaling up Visual Instruction Tuning**](https://arxiv.org/pdf/2307.04087.pdf) | arXiv | 2023-07-09 | [Github](https://github.com/BAAI-DCAI/Visual-Instruction-Tuning) | - |
| [**GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest**](https://arxiv.org/pdf/2307.03601.pdf) | arXiv | 2023-07-07 | [Github](https://github.com/jshilong/GPT4RoI) | [Demo](http://139.196.83.164:7000/) |
-| [**What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?**](https://arxiv.org/pdf/2307.02469.pdf) | arXiv | 2023-07-05 | [Github](https://github.com/bytedance/lynx-llm) | - |
-| [**mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding**](https://arxiv.org/pdf/2307.02499.pdf) | arXiv | 2023-07-04 | [Github](https://github.com/X-PLUG/mPLUG-DocOwl) | [Demo](https://modelscope.cn/studios/damo/mPLUG-DocOwl/summary) |
+| [**What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?**](https://arxiv.org/pdf/2307.02469.pdf) | arXiv | 2023-07-05 | [Github](https://github.com/bytedance/lynx-llm) | - |
+| [**mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding**](https://arxiv.org/pdf/2307.02499.pdf) | arXiv | 2023-07-04 | [Github](https://github.com/X-PLUG/mPLUG-DocOwl) | [Demo](https://modelscope.cn/studios/damo/mPLUG-DocOwl/summary) |
| [**Visual Instruction Tuning with Polite Flamingo**](https://arxiv.org/pdf/2307.01003.pdf) | arXiv | 2023-07-03 | [Github](https://github.com/ChenDelong1999/polite_flamingo) | [Demo](http://clever_flamingo.xiaoice.com/) |
| [**LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding**](https://arxiv.org/pdf/2306.17107.pdf) | arXiv | 2023-06-29 | [Github](https://github.com/SALT-NLP/LLaVAR) | [Demo](https://eba470c07c805702b8.gradio.live/) |
| [**Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic**](https://arxiv.org/pdf/2306.15195.pdf) | arXiv | 2023-06-27 | [Github](https://github.com/shikras/shikra) | [Demo](http://demo.zhaozhang.net:7860/) |
-| [**MotionGPT: Human Motion as a Foreign Language**](https://arxiv.org/pdf/2306.14795.pdf) | arXiv | 2023-06-26 | [Github](https://github.com/OpenMotionLab/MotionGPT) | - |
+| [**MotionGPT: Human Motion as a Foreign Language**](https://arxiv.org/pdf/2306.14795.pdf) | arXiv | 2023-06-26 | [Github](https://github.com/OpenMotionLab/MotionGPT) | - |
| [**Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration**](https://arxiv.org/pdf/2306.09093.pdf) | arXiv | 2023-06-15 | [Github](https://github.com/lyuchenyang/Macaw-LLM) | [Coming soon]() |
-| [**LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark**](https://arxiv.org/pdf/2306.06687.pdf) | arXiv | 2023-06-11 | [Github](https://github.com/OpenLAMM/LAMM) | [Demo](https://huggingface.co/spaces/openlamm/LAMM) |
+| [**LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark**](https://arxiv.org/pdf/2306.06687.pdf) | arXiv | 2023-06-11 | [Github](https://github.com/OpenLAMM/LAMM) | [Demo](https://huggingface.co/spaces/openlamm/LAMM) |
| [**Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models**](https://arxiv.org/pdf/2306.05424.pdf) | arXiv | 2023-06-08 | [Github](https://github.com/mbzuai-oryx/Video-ChatGPT) | [Demo](https://www.ival-mbzuai.com/video-chatgpt) |
| [**MIMIC-IT: Multi-Modal In-Context Instruction Tuning**](https://arxiv.org/pdf/2306.05425.pdf) | arXiv | 2023-06-08 | [Github](https://github.com/Luodian/Otter) | [Demo](https://otter.cliangyu.com/) |
-| [**M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning**](https://arxiv.org/pdf/2306.04387.pdf) | arXiv | 2023-06-07 | - | - |
+| [**M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning**](https://arxiv.org/pdf/2306.04387.pdf) | arXiv | 2023-06-07 | - | - |
| [**Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding**](https://arxiv.org/pdf/2306.02858.pdf) | arXiv | 2023-06-05 | [Github](https://github.com/DAMO-NLP-SG/Video-LLaMA) | [Demo](https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA) |
| [**LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day**](https://arxiv.org/pdf/2306.00890.pdf) | arXiv | 2023-06-01 | [Github](https://github.com/microsoft/LLaVA-Med) | - |
-| [**GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction**](https://arxiv.org/pdf/2305.18752.pdf) | arXiv | 2023-05-30 | [Github](https://github.com/StevenGrove/GPT4Tools) | [Demo](https://huggingface.co/spaces/stevengrove/GPT4Tools) |
-| [**PandaGPT: One Model To Instruction-Follow Them All**](https://arxiv.org/pdf/2305.16355.pdf) | arXiv | 2023-05-25 | [Github](https://github.com/yxuansu/PandaGPT) | [Demo](https://huggingface.co/spaces/GMFTBY/PandaGPT) |
-| [**ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst**](https://arxiv.org/pdf/2305.16103.pdf) | arXiv | 2023-05-25 | [Github](https://github.com/joez17/ChatBridge) | - |
+| [**GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction**](https://arxiv.org/pdf/2305.18752.pdf) | arXiv | 2023-05-30 | [Github](https://github.com/StevenGrove/GPT4Tools) | [Demo](https://huggingface.co/spaces/stevengrove/GPT4Tools) |
+| [**PandaGPT: One Model To Instruction-Follow Them All**](https://arxiv.org/pdf/2305.16355.pdf) | arXiv | 2023-05-25 | [Github](https://github.com/yxuansu/PandaGPT) | [Demo](https://huggingface.co/spaces/GMFTBY/PandaGPT) |
+| [**ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst**](https://arxiv.org/pdf/2305.16103.pdf) | arXiv | 2023-05-25 | [Github](https://github.com/joez17/ChatBridge) | - |
| [**Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models**](https://arxiv.org/pdf/2305.15023.pdf) | arXiv | 2023-05-24 | [Github](https://github.com/luogen1996/LaVIN) | Local Demo |
-| [**DetGPT: Detect What You Need via Reasoning**](https://arxiv.org/pdf/2305.14167.pdf) | arXiv | 2023-05-23 | [Github](https://github.com/OptimalScale/DetGPT) | [Demo](https://d3c431c0c77b1d9010.gradio.live/) |
+| [**DetGPT: Detect What You Need via Reasoning**](https://arxiv.org/pdf/2305.14167.pdf) | arXiv | 2023-05-23 | [Github](https://github.com/OptimalScale/DetGPT) | [Demo](https://d3c431c0c77b1d9010.gradio.live/) |
| [**Pengi: An Audio Language Model for Audio Tasks**](https://arxiv.org/pdf/2305.11834.pdf) | NeurIPS | 2023-05-19 | [Github](https://github.com/microsoft/Pengi) | - |
| [**VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks**](https://arxiv.org/pdf/2305.11175.pdf) | arXiv | 2023-05-18 | [Github](https://github.com/OpenGVLab/VisionLLM) | - |
| [**Listen, Think, and Understand**](https://arxiv.org/pdf/2305.10790.pdf) | arXiv | 2023-05-18 | [Github](https://github.com/YuanGongND/ltu) | [Demo](https://github.com/YuanGongND/ltu) |
| **VisualGLM-6B** | - | 2023-05-17 | [Github](https://github.com/THUDM/VisualGLM-6B) | Local Demo |
-| [**PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering**](https://arxiv.org/pdf/2305.10415.pdf) | arXiv | 2023-05-17 | [Github](https://github.com/xiaoman-zhang/PMC-VQA) | - |
+| [**PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering**](https://arxiv.org/pdf/2305.10415.pdf) | arXiv | 2023-05-17 | [Github](https://github.com/xiaoman-zhang/PMC-VQA) | - |
| [**InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning**](https://arxiv.org/pdf/2305.06500.pdf) | arXiv | 2023-05-11 | [Github](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip) | Local Demo |
| [**VideoChat: Chat-Centric Video Understanding**](https://arxiv.org/pdf/2305.06355.pdf) | arXiv | 2023-05-10 | [Github](https://github.com/OpenGVLab/Ask-Anything) | [Demo](https://ask.opengvlab.com/) |
| [**MultiModal-GPT: A Vision and Language Model for Dialogue with Humans**](https://arxiv.org/pdf/2305.04790.pdf) | arXiv | 2023-05-08 | [Github](https://github.com/open-mmlab/Multimodal-GPT) | [Demo](https://mmgpt.openmmlab.org.cn/) |
-| [**X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages**](https://arxiv.org/pdf/2305.04160.pdf) | arXiv | 2023-05-07 | [Github](https://github.com/phellonchen/X-LLM) | - |
+| [**X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages**](https://arxiv.org/pdf/2305.04160.pdf) | arXiv | 2023-05-07 | [Github](https://github.com/phellonchen/X-LLM) | - |
| [**LMEye: An Interactive Perception Network for Large Language Models**](https://arxiv.org/pdf/2305.03701.pdf) | arXiv | 2023-05-05 | [Github](https://github.com/YunxinLi/LingCloud) | Local Demo |
-| [**LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model**](https://arxiv.org/pdf/2304.15010.pdf) | arXiv | 2023-04-28 | [Github](https://github.com/OpenGVLab/LLaMA-Adapter) | [Demo](http://llama-adapter.opengvlab.com/) |
+| [**LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model**](https://arxiv.org/pdf/2304.15010.pdf) | arXiv | 2023-04-28 | [Github](https://github.com/OpenGVLab/LLaMA-Adapter) | [Demo](http://llama-adapter.opengvlab.com/) |
| [**mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality**](https://arxiv.org/pdf/2304.14178.pdf) | arXiv | 2023-04-27 | [Github](https://github.com/X-PLUG/mPLUG-Owl) | [Demo](https://huggingface.co/spaces/MAGAer13/mPLUG-Owl) |
| [**MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models**](https://arxiv.org/pdf/2304.10592.pdf) | arXiv | 2023-04-20 | [Github](https://github.com/Vision-CAIR/MiniGPT-4) | - |
| [**Visual Instruction Tuning**](https://arxiv.org/pdf/2304.08485.pdf) | NeurIPS | 2023-04-17 | [GitHub](https://github.com/haotian-liu/LLaVA) | [Demo](https://llava.hliu.cc/) |
| [**LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention**](https://arxiv.org/pdf/2303.16199.pdf) | ICLR | 2023-03-28 | [Github](https://github.com/OpenGVLab/LLaMA-Adapter) | [Demo](https://huggingface.co/spaces/csuhan/LLaMA-Adapter) |
-| [**MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning**](https://arxiv.org/pdf/2212.10773.pdf) | ACL | 2022-12-21 | [Github](https://github.com/VT-NLP/MultiInstruct) | - |
+| [**MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning**](https://arxiv.org/pdf/2212.10773.pdf) | ACL | 2022-12-21 | [Github](https://github.com/VT-NLP/MultiInstruct) | - |
## Multimodal Hallucination
| Title | Venue | Date | Code | Demo |
@@ -474,18 +475,19 @@ This is the first work to correct hallucination in multimodal large language mod
## Foundation Models
| Title | Venue | Date | Code | Demo |
|:--------|:--------:|:--------:|:--------:|:--------:|
+| [**Diversity-Guided MLP Reduction for Efficient Large Vision Transformers**](https://arxiv.org/abs/2506.08591) | arXiv | 2025-06-27 | [Github](https://github.com/visresearch/DGMR) | - |
| [**VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding**](https://arxiv.org/pdf/2501.13106) | arXiv | 2025-01-22 | [Github](https://github.com/DAMO-NLP-SG/VideoLLaMA3) | [Demo](https://huggingface.co/spaces/lixin4ever/VideoLLaMA3) |
| [**Emu3: Next-Token Prediction is All You Need**](https://arxiv.org/pdf/2409.18869) | arXiv | 2024-09-27 | [Github](https://github.com/baaivision/Emu3) | Local Demo |
-| [**Llama 3.2: Revolutionizing edge AI and vision with open, customizable models**](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) | Meta | 2024-09-25 | - | [Demo](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) |
+| [**Llama 3.2: Revolutionizing edge AI and vision with open, customizable models**](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) | Meta | 2024-09-25 | - | [Demo](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) |
| [**Pixtral-12B**](https://mistral.ai/news/pixtral-12b/) | Mistral | 2024-09-17 | - | - |
| [**xGen-MM (BLIP-3): A Family of Open Large Multimodal Models**](https://arxiv.org/pdf/2408.08872) | arXiv | 2024-08-16 | [Github](https://github.com/salesforce/LAVIS/tree/xgen-mm) | - |
| [**The Llama 3 Herd of Models**](https://arxiv.org/pdf/2407.21783) | arXiv | 2024-07-31 | - | - |
| [**Chameleon: Mixed-Modal Early-Fusion Foundation Models**](https://arxiv.org/pdf/2405.09818) | arXiv | 2024-05-16 | - | - |
-| [**Hello GPT-4o**](https://openai.com/index/hello-gpt-4o/) | OpenAI | 2024-05-13 | - | - |
+| [**Hello GPT-4o**](https://openai.com/index/hello-gpt-4o/) | OpenAI | 2024-05-13 | - | - |
| [**The Claude 3 Model Family: Opus, Sonnet, Haiku**](https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf) | Anthropic | 2024-03-04 | - | - |
| [**Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context**](https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf) | Google | 2024-02-15 | - | - |
| [**Gemini: A Family of Highly Capable Multimodal Models**](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf) | Google | 2023-12-06 | - | - |
-| [**Fuyu-8B: A Multimodal Architecture for AI Agents**](https://www.adept.ai/blog/fuyu-8b) | blog | 2023-10-17 | [Huggingface](https://huggingface.co/adept/fuyu-8b) | [Demo](https://huggingface.co/adept/fuyu-8b)
+| [**Fuyu-8B: A Multimodal Architecture for AI Agents**](https://www.adept.ai/blog/fuyu-8b) | blog | 2023-10-17 | [Huggingface](https://huggingface.co/adept/fuyu-8b) | [Demo](https://huggingface.co/adept/fuyu-8b) |
| [**Unified Model for Image, Video, Audio and Language Tasks**](https://arxiv.org/pdf/2307.16184.pdf) | arXiv | 2023-07-30 | [Github](https://github.com/mshukor/UnIVAL) | [Demo](https://huggingface.co/spaces/mshukor/UnIVAL) |
| [**PaLI-3 Vision Language Models: Smaller, Faster, Stronger**](https://arxiv.org/pdf/2310.09199.pdf) | arXiv | 2023-10-13 | - | - |
| [**GPT-4V(ision) System Card**](https://cdn.openai.com/papers/GPTV_System_Card.pdf) | OpenAI | 2023-09-25 | - | - |
@@ -494,13 +496,13 @@ This is the first work to correct hallucination in multimodal large language mod
| [**Bootstrapping Vision-Language Learning with Decoupled Language Pre-training**](https://arxiv.org/pdf/2307.07063.pdf) | NeurIPS | 2023-07-13 | [Github](https://github.com/yiren-jian/BLIText) | - |
| [**Generative Pretraining in Multimodality**](https://arxiv.org/pdf/2307.05222.pdf) | arXiv | 2023-07-11 | [Github](https://github.com/baaivision/Emu) | [Demo](http://218.91.113.230:9002/) |
| [**Kosmos-2: Grounding Multimodal Large Language Models to the World**](https://arxiv.org/pdf/2306.14824.pdf) | arXiv | 2023-06-26 | [Github](https://github.com/microsoft/unilm/tree/master/kosmos-2) | [Demo](https://aka.ms/kosmos-2-demo) |
-| [**Transfer Visual Prompt Generator across LLMs**](https://arxiv.org/pdf/2305.01278.pdf) | arXiv | 2023-05-02 | [Github](https://github.com/VPGTrans/VPGTrans) | [Demo](https://3fc7715dbc44234a7f.gradio.live/) |
+| [**Transfer Visual Prompt Generator across LLMs**](https://arxiv.org/pdf/2305.01278.pdf) | arXiv | 2023-05-02 | [Github](https://github.com/VPGTrans/VPGTrans) | [Demo](https://3fc7715dbc44234a7f.gradio.live/) |
| [**GPT-4 Technical Report**](https://arxiv.org/pdf/2303.08774.pdf) | arXiv | 2023-03-15 | - | - |
-| [**PaLM-E: An Embodied Multimodal Language Model**](https://arxiv.org/pdf/2303.03378.pdf) | arXiv | 2023-03-06 | - | [Demo](https://palm-e.github.io/#demo) |
+| [**PaLM-E: An Embodied Multimodal Language Model**](https://arxiv.org/pdf/2303.03378.pdf) | arXiv | 2023-03-06 | - | [Demo](https://palm-e.github.io/#demo) |
| [**Prismer: A Vision-Language Model with An Ensemble of Experts**](https://arxiv.org/pdf/2303.02506.pdf) | arXiv | 2023-03-04 | [Github](https://github.com/NVlabs/prismer) | [Demo](https://huggingface.co/spaces/lorenmt/prismer) |
| [**Language Is Not All You Need: Aligning Perception with Language Models**](https://arxiv.org/pdf/2302.14045.pdf) | arXiv | 2023-02-27 | [Github](https://github.com/microsoft/unilm) | - |
-| [**BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models**](https://arxiv.org/pdf/2301.12597.pdf) | arXiv | 2023-01-30 | [Github](https://github.com/salesforce/LAVIS/tree/main/projects/blip2) | [Demo](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/examples/blip2_instructed_generation.ipynb) |
-| [**VIMA: General Robot Manipulation with Multimodal Prompts**](https://arxiv.org/pdf/2210.03094.pdf) | ICML | 2022-10-06 | [Github](https://github.com/vimalabs/VIMA) | Local Demo |
+| [**BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models**](https://arxiv.org/pdf/2301.12597.pdf) | arXiv | 2023-01-30 | [Github](https://github.com/salesforce/LAVIS/tree/main/projects/blip2) | [Demo](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/examples/blip2_instructed_generation.ipynb) |
+| [**VIMA: General Robot Manipulation with Multimodal Prompts**](https://arxiv.org/pdf/2210.03094.pdf) | ICML | 2022-10-06 | [Github](https://github.com/vimalabs/VIMA) | Local Demo |
| [**MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge**](https://arxiv.org/pdf/2206.08853.pdf) | NeurIPS | 2022-06-17 | [Github](https://github.com/MineDojo/MineDojo) | - |
| [**Write and Paint: Generative Vision-Language Models are Unified Modal Learners**](https://arxiv.org/pdf/2206.07699.pdf) | ICLR | 2022-06-15 | [Github](https://github.com/shizhediao/DaVinci) | - |
| [**Language Models are General-Purpose Interfaces**](https://arxiv.org/pdf/2206.06336.pdf) | arXiv | 2022-06-13 | [Github](https://github.com/microsoft/unilm) | - |