OmniBal: Towards Fast Instruction-Tuning for Vision-Language Models via Omniverse Computation Balance
Yongqiang Yao*, Jingru Tan*, Feizhao Zhang*, Jiahao Hu, Yazhe Niu, Xin Jin, Bo Li, Pengfei Liu, Ruihao Gong📧, Dahua Lin, Ningyi Xu📧 (* denotes equal contribution, 📧 denotes corresponding author.)
This is the official implementation of our paper OmniBal, an omniverse balanced training framework for large-scale 3D parallel training of vision-language models.
End-to-end experiments on open-source VLMs show a 1.8x training speed-up, and the method is model-, dataset-, and hardware-agnostic, ready to plug into existing training pipelines with minimal changes.
May 1, 2025: 🌟 Our paper has been accepted by ICML 2025! 🎉 Cheers!
Large-scale vision-language instruction tuning often suffers from severe load imbalance across GPUs because the vision and language branches differ drastically in data distribution and network structure. OmniBal rebalances computation from three tightly coupled angles:
- Data: regrouping samples into mini-batches that equalize per-GPU FLOPs (see the grouping sketch below).
- Model: a search-based partitioner that assigns vision and language layers to devices for a near-uniform workload (see the partition sketch further below).
- Memory: adaptive, per-partition re-compute policies that squeeze the most out of available memory without stalling kernels.
Together, these modules form an “omniverse” training framework that delivers a ~1.8x end-to-end speed-up on InternVL-Chat and consistently accelerates other VL models and datasets, all while maintaining accuracy.
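To make the data angle concrete, here is a minimal Python sketch of cost-balanced grouping. It is not the algorithm this repo actually ships (see test_balanced_dynamic_batch.py for that); the linear cost model, its weights, and the greedy longest-processing-time packing are illustrative assumptions.

```python
# A toy sketch of cost-balanced grouping, NOT the exact algorithm shipped in
# this repo. The linear cost model and its weights are made up for illustration.
from heapq import heapify, heappush, heappop

def estimate_cost(sample, vit_cost=100.0, token_cost=1.0):
    """Rough per-sample compute proxy built from the offline statistics."""
    return vit_cost * sample["vit_num"] + token_cost * sample["token_num"]

def balanced_groups(samples, num_gpus):
    """Greedy longest-processing-time packing: heaviest sample first,
    always into the currently lightest group (one group per GPU)."""
    heap = [(0.0, rank, []) for rank in range(num_gpus)]  # (load, rank, group)
    heapify(heap)
    for sample in sorted(samples, key=estimate_cost, reverse=True):
        load, rank, group = heappop(heap)  # lightest group so far
        group.append(sample)
        heappush(heap, (load + estimate_cost(sample), rank, group))
    return [group for _, _, group in sorted(heap, key=lambda item: item[1])]

if __name__ == "__main__":
    samples = [{"vit_num": 5, "token_num": 811},
               {"vit_num": 3, "token_num": 831},
               {"vit_num": 1, "token_num": 310},
               {"vit_num": 1, "token_num": 920}]
    for rank, group in enumerate(balanced_groups(samples, num_gpus=2)):
        print(rank, sum(estimate_cost(s) for s in group))
```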
OmniBal distinguishes two kinds of computation imbalance:
- Inter-Stage: computation imbalance across different pipeline-parallel stages.
- Intra-Stage: computation imbalance within the same stage, across time and across devices.
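To make inter-stage balance concrete, here is a toy sketch that searches for a pipeline partition: split an ordered list of per-layer costs into contiguous stages so the heaviest stage is as light as possible. The per-layer costs and the brute-force boundary search are illustrative assumptions, standing in for the paper's actual search procedure.

```python
# A toy stand-in for the search-based model partitioner; brute force is
# fine for small layer counts, and the costs below are made up.
import itertools

def best_partition(layer_costs, num_stages):
    """Exhaustively try all stage boundaries and keep the split whose
    bottleneck (max per-stage cost) is smallest."""
    n = len(layer_costs)
    best_bounds, best_bottleneck = None, float("inf")
    for cuts in itertools.combinations(range(1, n), num_stages - 1):
        bounds = (0, *cuts, n)
        bottleneck = max(sum(layer_costs[a:b]) for a, b in zip(bounds, bounds[1:]))
        if bottleneck < best_bottleneck:
            best_bounds, best_bottleneck = bounds, bottleneck
    return best_bounds, best_bottleneck

# e.g. 4 cheap ViT blocks followed by 6 expensive LLM blocks, 2 stages:
print(best_partition([1, 1, 1, 1, 4, 4, 4, 4, 4, 4], num_stages=2))
```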
We need to calculate offline statistics for all data, including the number of images and the text token count of each sample.
We have already prepared the internvl-1.2M length information and placed it in the dataset; see test_balanced_dynamic_batch.py. "internvl_sft_1.2M.json" is our simulated input, containing real statistical lengths.
The "Token_length" information consists of a list in this data format. "vit_num" represents the vision image batch size number in the current sample, "token_num" indicates the final text token length, and "image_flag" refers to the actual number of images in a sample. (Some plain text might generate fake images as dummy inputs to ensure training stability.)
[
  {
    "vit_num": 5,
    "token_num": 811,
    "image_flag": 3
  },
  {
    "vit_num": 3,
    "token_num": 831,
    "image_flag": 3
  },
  {
    "vit_num": 1,
    "token_num": 310,
    "image_flag": 1
  },
  {
    "vit_num": 1,
    "token_num": 920,
    "image_flag": 0
  }
]
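If you need to produce such a file for your own data, a minimal sketch could look like the following. The count_text_tokens placeholder, PATCHES_PER_IMAGE, and the input sample format are assumptions for illustration; substitute your model's real tokenizer and image-tiling logic.

```python
# A minimal sketch of producing the length statistics above for your own
# dataset; all names here are placeholders, not this repo's API.
import json

PATCHES_PER_IMAGE = 1  # >1 if each image is tiled into multiple ViT inputs
DUMMY_VIT_NUM = 1      # plain-text samples still carry one dummy image

def count_text_tokens(text):
    """Placeholder tokenizer: replace with your model's actual tokenizer."""
    return len(text.split())

def collect_stats(samples):
    stats = []
    for sample in samples:
        num_images = len(sample.get("images", []))
        stats.append({
            # vision batch size fed to the ViT for this sample
            "vit_num": num_images * PATCHES_PER_IMAGE or DUMMY_VIT_NUM,
            # final text token length
            "token_num": count_text_tokens(sample["text"]),
            # actual number of images (0 for plain text with a dummy input)
            "image_flag": num_images,
        })
    return stats

if __name__ == "__main__":
    samples = [{"text": "describe the image", "images": ["cat.jpg"]},
               {"text": "a plain-text sample", "images": []}]
    with open("my_length_stats.json", "w") as f:
        json.dump(collect_stats(samples), f, indent=2)
```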
Run the balanced dynamic batching test:
python test_balanced_dynamic_batch.py
If you want to use the fast version:
cd fast_isf
sh build.sh && cd ..
python test_balanced_dynamic_batch.py
The example implementation we provide is based on a simulated dataset. For actual use, replace it with statistics from your own dataset.
If you find this repository helpful, please cite the paper below.
@article{yao2024omnibal,
  title={OmniBal: Towards Fast Instruction-tuning for Vision-Language Models via Omniverse Computation Balance},
  author={Yao, Yongqiang and Tan, Jingru and Hu, Jiahao and Zhang, Feizhao and Jin, Xin and Li, Bo and Gong, Ruihao and Liu, Pengfei},
  journal={arXiv e-prints},
  pages={arXiv--2407},
  year={2024}
}
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. The content of this project itself is licensed under the Apache License 2.0.
We built our project based on: