This is the official repository for the paper "VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration".
Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration, integrating two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues.
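At a high level, the two modules can be pictured with a minimal NumPy sketch. Everything below is an illustrative assumption rather than the paper's exact formulation: the importance scores, the cosine-similarity merging rule, the single merged complement token, and the 64/16 budgets are all placeholders.

```python
import numpy as np

def dvts_select(vision_tokens, attn_scores, keep_num):
    """Illustrative dominant-token selection: keep the top-`keep_num`
    vision tokens ranked by an attention-based importance score."""
    keep_idx = np.argsort(attn_scores)[::-1][:keep_num]
    drop_idx = np.setdiff1d(np.arange(len(vision_tokens)), keep_idx)
    return keep_idx, drop_idx

def tgvc_merge(vision_tokens, text_embed, keep_idx, drop_idx, merge_num):
    """Illustrative text-guided complement: average the `merge_num` pruned
    tokens most similar to a text embedding into one extra token."""
    dropped = vision_tokens[drop_idx]
    sim = dropped @ text_embed / (
        np.linalg.norm(dropped, axis=1) * np.linalg.norm(text_embed) + 1e-6)
    top = np.argsort(sim)[::-1][:merge_num]
    complement = dropped[top].mean(axis=0, keepdims=True)
    return np.concatenate([vision_tokens[keep_idx], complement], axis=0)

# toy example: 576 patch tokens of dim 8, keep 64, merge 16 pruned tokens into 1
rng = np.random.default_rng(0)
tokens = rng.standard_normal((576, 8))
scores = rng.random(576)
keep_idx, drop_idx = dvts_select(tokens, scores, keep_num=64)
out = tgvc_merge(tokens, rng.standard_normal(8), keep_idx, drop_idx, merge_num=16)
print(out.shape)  # (65, 8): 64 kept tokens plus 1 merged complement
```

The actual selection and merging criteria live in the repository's modified model files; this sketch only conveys the prune-then-complement shape of the pipeline.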
- Feb. 2nd, 2026: Paper is now available at arXiv.
- Jan. 26th, 2026: VisionTrim is accepted by ICLR 2026!
- Clone this repository.
```shell
git clone https://github.com/hanxunyu/VisionTrim.git
cd VisionTrim
```
- Install necessary packages.
```shell
conda create -n visiontrim python=3.10 -y
conda activate visiontrim
pip install -e .
pip install protobuf
```
- (Optional) Install FlashAttention for further inference acceleration.
```shell
pip install flash-attn --no-build-isolation
```

Download the corresponding LLaVA checkpoints from Hugging Face 🤗:
| Version | LLM | Checkpoint |
|---|---|---|
| LLaVA-1.5 | Vicuna-7B | liuhaotian/llava-v1.5-7b |
| LLaVA-1.5 | Vicuna-13B | liuhaotian/llava-v1.5-13b |
| LLaVA-1.6 (LLaVA-NeXT) | Vicuna-7B | liuhaotian/llava-v1.6-vicuna-7b |
| LLaVA-1.6 (LLaVA-NeXT) | Vicuna-13B | liuhaotian/llava-v1.6-vicuna-13b |
Download each dataset according to EVAL.md.
To analyze the inaccurate text-visual attention in VLMs, first download the visual instruction tuning data for LLaVA, which we use for attention computation. We provide a 1K subset for attention analysis in ./playground/data/analysis/llava_v1_5_mix1k.jsonl.
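The subset is in JSON-Lines format (one JSON object per line), so it can be inspected with a few lines of standard-library Python. This is a generic sketch, not part of the repository's analysis code:

```python
import json

def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# usage with the provided subset:
# samples = load_jsonl("playground/data/analysis/llava_v1_5_mix1k.jsonl")
# print(len(samples))  # the 1K subset should yield 1000 samples
```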
To analyze the attention shift in VLMs, run the script ./scripts/v1_5/analyze_attn_shift.sh.
```shell
bash scripts/v1_5/analyze_attn_shift.sh
```
To analyze the attention dispersion in VLMs, run the script ./scripts/v1_5/analyze_attn_dispersion.sh.
```shell
bash scripts/v1_5/analyze_attn_dispersion.sh
```
The main implementation of VisionTrim is in llava_llama.py, clip_encoder.py, llava_arch.py, model_vqa.py, and model_vqa_loader.py.
We provide evaluation scripts for each benchmark under ./scripts/v1_5/eval; set DVTS_token_num and TGVC_token_num as bash arguments. Detailed guidance for the evaluation commands and the online submission of each benchmark can be found in EVAL.md.
For evaluation with the 13B LLM, simply replace the CKPT argument llava-v1.5-7b with llava-v1.5-13b in each script. For evaluation with LLaVA-NeXT, use the scripts in ./scripts/v1_6/eval.
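When sweeping token budgets from Python (e.g. a grid over DVTS_token_num and TGVC_token_num), the invocations can be built programmatically. The helper below is a hypothetical convenience, and the (64, 8) budget pair is only an example; the argument order follows the eval scripts described above:

```python
def eval_command(benchmark, token_num, token_complement, version="v1_5"):
    """Build the argument list for one eval run: the script path followed by
    the DVTS token budget and the TGVC complement budget."""
    return ["bash", f"scripts/{version}/eval/{benchmark}.sh",
            str(token_num), str(token_complement)]

# pass the list to subprocess.run(cmd, check=True) to launch a run
cmd = eval_command("gqa", 64, 8)
print(cmd)  # ['bash', 'scripts/v1_5/eval/gqa.sh', '64', '8']
```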
- Download the data and evaluation scripts following the official instructions and put them under `../data/gqa/data`. You may need to modify `eval.py` due to the missing assets in the GQA v1.2 release.
- Single-GPU or multi-GPU inference and evaluation.
```shell
method=VisionTrim
bash scripts/v1_5/eval/gqa.sh $token_num $token_complement
```
- Under `../data/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of ScienceQA.
- Single-GPU or multi-GPU inference and evaluation.
```shell
method=VisionTrim
bash scripts/v1_5/eval/sqa.sh $token_num $token_complement
```
- Download `TextVQA_0.5.1_val.json` and the images, and extract them to `../data/textvqa`.
- Single-GPU or multi-GPU inference and evaluation.
```shell
method=VisionTrim
bash scripts/v1_5/eval/textvqa.sh $token_num $token_complement
```
- Download `coco` from POPE and put it under `../data`.
- Single-GPU or multi-GPU inference and evaluation.
```shell
method=VisionTrim
bash scripts/v1_5/eval/pope.sh $token_num $token_complement
```
- Download `mmbench_dev_20230712.tsv` and put it under `../data/mmbench`.
- Single-GPU or multi-GPU inference and evaluation.
```shell
method=VisionTrim
bash scripts/v1_5/eval/mmbench.sh $token_num $token_complement
```
- Submit the results to the evaluation server: `../data/eval/mmbench/answers_upload/mmbench_dev_20230712`.
- Download `mmbench_dev_cn_20231003.tsv` and put it under `../data/mmbench`.
- Single-GPU or multi-GPU inference and evaluation.
```shell
method=VisionTrim
bash scripts/v1_5/eval/mmbench_cn.sh $token_num $token_complement
```
- Submit the results to the evaluation server: `../data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003`.
- Follow the official instructions to download the images and the videos. Put the images under `../data/seed_bench/SEED-Bench-image`. Note that we only use the image subset for evaluation.
- Single-GPU or multi-GPU inference and evaluation.
```shell
method=VisionTrim
bash scripts/v1_5/eval/seed.sh $token_num $token_complement
```
- Extract `mm-vet.zip` to `../data/mmvet`.
- Single-GPU or multi-GPU inference and evaluation.
```shell
method=VisionTrim
bash scripts/v1_5/eval/mmvet.sh $token_num $token_complement
```
- Evaluate the predictions in `../data/eval/mmvet/results` using the official Jupyter notebook.
We are grateful for the open-source contributions of other projects:
If you find our VisionTrim useful for your research, please consider giving this repository a star and citing our paper as follows:
```bibtex
@article{yu2026visiontrim,
  title={VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration},
  author={Yu, Hanxun and Li, Wentong and Qu, Xuan and Wang, Song and Chen, Junbo and Zhu, Jianke},
  journal={arXiv preprint arXiv:2601.22674},
  year={2026}
}
```