Vidi2: Large Multimodal Models for Video Understanding and Creation

Homepage: https://bytedance.github.io/vidi-website/

We introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing (VUE) scenarios. The first release focuses on temporal retrieval (TR), i.e., identifying the time ranges in input videos corresponding to a given text query. The second release evolves toward a foundation model with state-of-the-art spatio-temporal grounding (STG) and temporal retrieval capability while maintaining basic open-ended video QA performance.

Release

[11/25/2025] 🔥 Vidi2 released: Report, GitHub, and Homepage are available; demo coming very soon.
[08/29/2025] 🔥 Vidi1.5-9B demo released at https://vidi.byteintl.com/ with new UI design.
[06/06/2025] 🔥 Vidi-7B demo released at https://vidi.byteintl.com/. Follow the instructions in the demo section to run the demo.
[04/21/2025] 🔥 The first release of Vidi consists of the tech report and the VUE-TR evaluation benchmark. The 7B model demo and weights are coming soon.

Content

Demo Coming Very Soon
Installation
Evaluation (VUE-STG)
Evaluation (VUE-TR-V2)
Model and Inference

Installation

Run the install.sh script from the repository root:

bash install.sh

Evaluation (VUE-STG)

We release the video IDs, ground-truth annotations, and evaluation results as CSV files. Follow the instructions in VUE_STG/README.md to run the evaluation:

cd VUE_STG
python3 evaluate.py

To evaluate your own model:

First, download the videos listed by ID in "VUE_STG/vue-stg-benchmark/video.csv" from YouTube (e.g., with yt-dlp).
Then generate your results following the format in VUE_STG/results/vidi2/tubes.csv and run the evaluation script.
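
The download step can be scripted. The following is a minimal sketch, not part of the repository: it reads video IDs from CSV text and builds one yt-dlp command per ID. The column name `video_id`, the output folder `videos`, and both helper names are assumptions for illustration; check the header of video.csv before using it.

```python
import csv
import io


def read_video_ids(csv_text, id_column="video_id"):
    """Return unique video IDs, in file order, from CSV text with a header row.

    `id_column` is an assumed column name; adjust it to match video.csv.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    seen, ids = set(), []
    for row in reader:
        vid = (row.get(id_column) or "").strip()
        if vid and vid not in seen:
            seen.add(vid)
            ids.append(vid)
    return ids


def build_download_commands(video_ids, out_dir="videos"):
    """Build one yt-dlp argument list per video ID (nothing is executed here)."""
    return [
        [
            "yt-dlp",
            "-o", f"{out_dir}/%(id)s.%(ext)s",  # name each file after its video ID
            f"https://www.youtube.com/watch?v={vid}",
        ]
        for vid in video_ids
    ]


if __name__ == "__main__":
    sample = "video_id\n6Qv-LrXJjSM\n6Qv-LrXJjSM\n"
    for cmd in build_download_commands(read_video_ids(sample)):
        print(" ".join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` once yt-dlp is installed. Using the full watch URL rather than the bare ID avoids IDs that begin with a dash being parsed as command-line options.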

Evaluation (VUE-TR-V2)

We release the ground-truth annotations and evaluation results in five JSON files. Run the script for a standalone evaluation:

cd VUE_TR_V2
python3 -u qa_eval.py --pred_path results_Vidi.json

The result figures will be saved in the output folder ('./results' by default).

To evaluate a new model, first download the videos listed by ID in VUE_TR_V2/video_id.txt from YouTube (e.g., with yt-dlp). Then run inference and save the results in the following format:

[
    {
        "query_id": 0,
        "video_id": "6Qv-LrXJjSM",
        "duration": 3884.049,
        "query": "The slide showcases Taco Bell's purple ang pow for Chinese New Year, while a woman explains that purple symbolizes royalty in the Chinese tradition.",
        "answer": [
            [
                913.1399199,
                953.5340295
            ]
        ],
        "task": "temporal_retrieval"
    },
    ...
]
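
As a hedged illustration of producing this file, the sketch below assembles entries with the keys shown in the example above and runs a basic sanity check before writing JSON. The helper names `make_record` and `check_record` are hypothetical, and the rule that each range must lie within [0, duration] is an assumption of this sketch, not a documented requirement of qa_eval.py.

```python
import json

REQUIRED_KEYS = {"query_id", "video_id", "duration", "query", "answer", "task"}


def make_record(query_id, video_id, duration, query, ranges,
                task="temporal_retrieval"):
    """Assemble one result entry with the keys used in the example format."""
    return {
        "query_id": query_id,
        "video_id": video_id,
        "duration": float(duration),
        "query": query,
        # Each predicted range is a [start, end] pair in seconds.
        "answer": [[float(start), float(end)] for start, end in ranges],
        "task": task,
    }


def check_record(record):
    """Return a list of problems found in one entry (empty if none)."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    for start, end in record["answer"]:
        # Assumed sanity rule: 0 <= start <= end <= duration.
        if not (0 <= start <= end <= record["duration"]):
            problems.append(
                f"range [{start}, {end}] outside [0, {record['duration']}]"
            )
    return problems


if __name__ == "__main__":
    rec = make_record(0, "6Qv-LrXJjSM", 3884.049, "example query",
                      [(913.1399199, 953.5340295)])
    assert check_record(rec) == []
    print(json.dumps([rec], indent=4))
```

Writing the list with `json.dump(records, f, indent=4)` gives a file that can then be passed to the evaluation script through `--pred_path`.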

The instructions and data for the previous version (VUE-TR) are in the VUE_TR folder.

Model and Inference

We release the 7B model weights to reproduce the Vidi results in the 2025/04/15 tech report.

First, download the checkpoint (link coming very soon).

Then run install.sh in "./Vidi_7B":

cd Vidi_7B
bash install.sh

For a given video and text query, run the following command to get the results:

python3 -u inference.py --video-path [video path] --query [query] --model-path [model path]
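
To run several queries against one video, the command above can be wrapped in a small driver. This is a sketch under assumptions: the helper names are hypothetical, it assumes the working directory is the one containing inference.py, and it relies only on the three flags shown above.

```python
import subprocess


def build_inference_command(video_path, query, model_path):
    """Return the argument list for one inference.py call."""
    return [
        "python3", "-u", "inference.py",
        "--video-path", video_path,
        "--query", query,  # passed as a single argument, so no shell quoting is needed
        "--model-path", model_path,
    ]


def run_queries(video_path, queries, model_path, runner=subprocess.run):
    """Invoke inference once per query; `runner` can be swapped for a dry run."""
    for query in queries:
        runner(build_inference_command(video_path, query, model_path), check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    run_queries(
        "example.mp4",
        ["a person opens a door", "a dog runs on the beach"],
        "path/to/checkpoint",
        runner=lambda cmd, check: print(cmd),
    )
```

Because each call starts a new process, the model is reloaded for every query; this keeps the sketch simple but is slow for large batches.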

Citation

If you find Vidi useful for your research and applications, please cite using this BibTeX:

@article{Vidi2025vidi2,
    title={Vidi2: Large Multimodal Models for Video 
            Understanding and Creation},
    author={Vidi Team, Celong Liu, Chia-Wen Kuo, Chuang Huang, 
            Dawei Du, Fan Chen, Guang Chen, Haoji Zhang, 
            Haojun Zhao, Lingxi Zhang, Lu Guo, Lusha Li, 
            Longyin Wen, Qihang Fan, Qingyu Chen, Rachel Deng,
            Sijie Zhu, Stuart Siew, Tong Jin, Weiyan Tao,
            Wen Zhong, Xiaohui Shen, Xin Gu, Zhenfang Chen, Zuhua Lin},
    journal={arXiv preprint arXiv:2511.19529},
    year={2025}
}
@article{Vidi2025vidi,
    title={Vidi: Large Multimodal Models for Video 
            Understanding and Editing},
    author={Vidi Team, Celong Liu, Chia-Wen Kuo, Dawei Du, 
            Fan Chen, Guang Chen, Jiamin Yuan, Lingxi Zhang,
            Lu Guo, Lusha Li, Longyin Wen, Qingyu Chen, 
            Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, 
            Wei Lu, Wen Zhong, Xiaohui Shen, Xin Gu, Xing Mei, 
            Xueqiong Qu, Zhenfang Chen},
    journal={arXiv preprint arXiv:2504.15681},
    year={2025}
}
