
⚡️ VisionTrim: Unified Vision Token Compression for
Training-Free MLLM Acceleration

[ICLR 2026]

This is an official repository for the paper "VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration".


Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration, integrating two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues.

📰 News

  • Feb. 2nd, 2026: Paper is now available at arXiv.
  • Jan. 26th, 2026: VisionTrim is accepted by ICLR 2026!

⚙️ Setup

🏝️ Environment

  1. Clone this repository.
git clone https://github.com/hanxunyu/VisionTrim.git
cd VisionTrim
  2. Install the necessary packages.
conda create -n visiontrim python=3.10 -y
conda activate visiontrim
pip install -e .
pip install protobuf
  3. (Optional) Install FlashAttention for further inference acceleration.
pip install flash-attn --no-build-isolation
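After installation, a quick sanity check can confirm that the core dependencies resolve in the active environment before running any evaluation script. This is a sketch, not part of the official setup; the package names below are the usual LLaVA requirements, so adjust them if your setup differs.

```shell
# Report whether each package resolves in the active Python environment.
check_pkg() {
    if python -c "import importlib.util, sys; sys.exit(0 if importlib.util.find_spec('$1') else 1)"; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

for pkg in torch transformers; do
    check_pkg "$pkg"
done
```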

📦️ Model

Download corresponding LLaVA checkpoints from Hugging Face 🤗:

| Version | LLM | Checkpoint |
| --- | --- | --- |
| LLaVA-1.5 | Vicuna-7B | liuhaotian/llava-v1.5-7b |
| LLaVA-1.5 | Vicuna-13B | liuhaotian/llava-v1.5-13b |
| LLaVA-1.6 (LLaVA-NeXT) | Vicuna-7B | liuhaotian/llava-v1.6-vicuna-7b |
| LLaVA-1.6 (LLaVA-NeXT) | Vicuna-13B | liuhaotian/llava-v1.6-vicuna-13b |
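One way to fetch a checkpoint locally is to clone its Hugging Face repository with git-lfs. The helper below is a sketch (`hf_clone_cmd` and the `./checkpoints` directory are our own conventions, not part of this repository); it maps a repo ID from the table above to a clone command.

```shell
# Build the git-lfs clone command for a Hugging Face repo ID
# (e.g. liuhaotian/llava-v1.5-7b) targeting ./checkpoints/<name>.
hf_clone_cmd() {
    repo_id="$1"
    name="${repo_id##*/}"   # strip the namespace prefix
    echo "git clone https://huggingface.co/${repo_id} ./checkpoints/${name}"
}

hf_clone_cmd liuhaotian/llava-v1.5-7b
```

Run `git lfs install` once beforehand so the weight files are actually downloaded rather than left as LFS pointers.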

📊 Data

Download each dataset according to EVAL.md.

🔬 Analysis

To analyze the inaccurate text-visual attention in VLMs, first download the visual instruction tuning data for LLaVA, which we use for attention computation. We provide a 1K subset for attention analysis in ./playground/data/analysis/llava_v1_5_mix1k.jsonl.

🛹 Attention Shift

To analyze the attention shift in VLMs, run the script ./scripts/analyze_attn_shift.sh.

bash scripts/v1_5/analyze_attn_shift.sh

🪩 Attention Dispersion

To analyze the attention dispersion in VLMs, run the script ./scripts/analyze_attn_dispersion.sh.

bash scripts/v1_5/analyze_attn_dispersion.sh
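The two analyses above can be chained in one wrapper that first checks the 1K subset is in place. This is a convenience sketch, not a script shipped with the repository:

```shell
# Return "present" or "absent" for a file path.
check_subset() {
    [ -f "$1" ] && echo present || echo absent
}

subset=playground/data/analysis/llava_v1_5_mix1k.jsonl
if [ "$(check_subset "$subset")" = present ]; then
    # Run both attention analyses back to back.
    bash scripts/v1_5/analyze_attn_shift.sh
    bash scripts/v1_5/analyze_attn_dispersion.sh
else
    echo "missing analysis subset: $subset" >&2
fi
```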

📋️ Evaluation

The core implementation of VisionTrim lives in llava_llama.py, clip_encoder.py, llava_arch.py, model_vqa.py, and model_vqa_loader.py.

We provide evaluation scripts for each benchmark under ./scripts/v1_5/eval; set DVTS_token_num and TGVC_token_num as the bash arguments. Detailed guidance on the evaluation commands and the online submission for each benchmark can be found in EVAL.md.

For evaluation with the 13B LLM, simply replace the CKPT argument in each script from llava-v1.5-7b to llava-v1.5-13b. For evaluation with LLaVA-NeXT, use the scripts in ./scripts/v1_6/eval.
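Since every eval script takes the same two positional token-budget arguments, sweeping budgets is a short loop. In this sketch, `run_eval` only echoes each command so the sweep can be previewed (drop the echo to execute), and the budget values 64/128/192 and 16 are illustrative, not numbers from the paper:

```shell
method=VisionTrim

# Print the eval command for a benchmark and a (DVTS, TGVC) token-budget pair.
run_eval() {
    echo "bash scripts/v1_5/eval/$1.sh $2 $3"
}

# Sweep the DVTS budget for GQA with a fixed TGVC complement of 16 tokens.
for token_num in 64 128 192; do
    run_eval gqa "$token_num" 16
done
```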

GQA

  1. Download the data and evaluation scripts following the official instructions and put them under ../data/gqa/data. You may need to modify eval.py due to missing assets in the GQA v1.2 release.
  2. Single-GPU or Multi-GPU inference and evaluation.
method=VisionTrim
bash scripts/v1_5/eval/gqa.sh $token_num $token_complement

ScienceQA

  1. Under ../data/scienceqa, download images, pid_splits.json, and problems.json from the data/scienceqa folder of the ScienceQA repository.
  2. Single-GPU or Multi-GPU inference and evaluation.
method=VisionTrim
bash scripts/v1_5/eval/sqa.sh $token_num $token_complement

TextVQA

  1. Download TextVQA_0.5.1_val.json and images and extract to ../data/textvqa.
  2. Single-GPU or Multi-GPU inference and evaluation.
method=VisionTrim
bash scripts/v1_5/eval/textvqa.sh $token_num $token_complement

POPE

  1. Download coco from POPE and put under ../data.
  2. Single-GPU or Multi-GPU inference and evaluation.
method=VisionTrim
bash scripts/v1_5/eval/pope.sh $token_num $token_complement

MMBench

  1. Download mmbench_dev_20230712.tsv and put under ../data/mmbench.
  2. Single-GPU or Multi-GPU inference and evaluation.
method=VisionTrim
bash scripts/v1_5/eval/mmbench.sh $token_num $token_complement
  3. Submit the results to the evaluation server: ../data/eval/mmbench/answers_upload/mmbench_dev_20230712.

MMBench-CN

  1. Download mmbench_dev_cn_20231003.tsv and put under ../data/mmbench.
  2. Single-GPU or Multi-GPU inference and evaluation.
method=VisionTrim
bash scripts/v1_5/eval/mmbench_cn.sh $token_num $token_complement
  3. Submit the results to the evaluation server: ../data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003.

SEED-Bench

  1. Follow the official instructions to download the images and the videos. Put the images under ../data/seed_bench/SEED-Bench-image. Note that we only use the image subset for evaluation.
  2. Single-GPU or Multi-GPU inference and evaluation.
method=VisionTrim
bash scripts/v1_5/eval/seed.sh $token_num $token_complement

MM-Vet

  1. Extract mm-vet.zip to ../data/mmvet.
  2. Single-GPU or Multi-GPU inference and evaluation.
method=VisionTrim
bash scripts/v1_5/eval/mmvet.sh $token_num $token_complement
  3. Evaluate the predictions in ../data/eval/mmvet/results using the official Jupyter notebook.

😊 Acknowledgement

We are grateful for the open-source contributions of other projects.

🖊️ Citation

If you find our VisionTrim useful for your research, please consider giving this repository a star and citing our paper as follows:

@article{yu2026visiontrim,
  title={VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration},
  author={Yu, Hanxun and Li, Wentong and Qu, Xuan and Wang, Song and Chen, Junbo and Zhu, Jianke},
  journal={arXiv preprint arXiv:2601.22674},
  year={2026}
}
