*Equal contribution. †Corresponding author.
• 2025.12: 🔥 Our paper, training code, and project page are released.
TL;DR: We propose WorldWander, an in-context learning framework for translating between egocentric and exocentric worlds in video generation. We also release EgoExo-8K, a large-scale dataset containing synchronized egocentric–exocentric triplets. The teaser is shown below:

Video diffusion models have recently achieved remarkable progress in realism and controllability. However, achieving seamless video translation across different perspectives, such as first-person (egocentric) and third-person (exocentric), remains underexplored. Bridging these perspectives is crucial for filmmaking, embodied AI, and world models.
Motivated by this, we present WorldWander, an in-context learning framework tailored for translating between egocentric and exocentric worlds in video generation. Building upon advanced video diffusion transformers, WorldWander integrates (i) In-Context Perspective Alignment and (ii) Collaborative Position Encoding to efficiently model cross-view synchronization.
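To convey the intuition of in-context cross-view conditioning only, here is an illustrative sketch (all names and shapes are assumptions for exposition, not the actual WorldWander implementation): tokens of the given view and of the view to be generated share one sequence inside the diffusion transformer, while shared frame indices plus a view index keep the two perspectives synchronized.

```python
# Conceptual sketch of in-context cross-view conditioning (illustrative only;
# tensor names and shapes are assumptions, not the actual WorldWander code).
import torch

B, F, N, D = 1, 8, 16, 64                  # batch, frames, tokens per frame, channels
source_tokens = torch.randn(B, F, N, D)    # tokens of the given view (e.g., exocentric)
target_tokens = torch.randn(B, F, N, D)    # noisy tokens of the view to generate (e.g., egocentric)

# In-context conditioning: both views live in one token sequence for the transformer.
context = torch.cat([source_tokens, target_tokens], dim=2).flatten(1, 2)  # (B, F*2N, D)

# Shared frame indices keep the two views aligned in time; a view index tells them apart.
frame_ids = torch.arange(F).repeat_interleave(2 * N).expand(B, -1)        # (B, F*2N)
view_ids = torch.tensor([0] * N + [1] * N).repeat(F).expand(B, -1)        # (B, F*2N)
```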
The overall framework is shown below:

To further support our task, we curate EgoExo-8K, a large-scale dataset containing synchronized egocentric–exocentric triplets from both synthetic and real-world scenarios.
We show some examples below:

git clone https://github.com/showlab/WorldWander.git
# Option 1: installation with requirements.txt
conda create -n WorldWander python=3.10
conda activate WorldWander
pip install -r requirements.txt
# Option 2: installation with environment.yml
conda env create -f environment.yml
conda activate WorldWander
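After installation, a quick sanity check can confirm that PyTorch sees your GPU (a minimal sketch; it assumes requirements.txt installs a CUDA-enabled PyTorch build):

```python
# Minimal environment check (assumes a CUDA-enabled PyTorch from requirements.txt).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```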
WorldWander is trained on the wan2.2-TI2V-5B model using 4 H200 GPUs, with a batch size of 4 per GPU. To make it easier to get started, we provide the following checkpoints for different tasks:
| Models | Links | Configs |
|---|---|---|
| wan2.2-TI2V-5B_three2one_synthetic | 🤗 Huggingface | configs/wan2-2_lora_three2one_synthetic.yaml |
| wan2.2-TI2V-5B_one2three_synthetic | 🤗 Huggingface | configs/wan2-2_lora_one2three_synthetic.yaml |
| wan2.2-TI2V-5B_three2one_realworld | 🤗 Huggingface | configs/wan2-2_lora_three2one_realworld.yaml |
| wan2.2-TI2V-5B_one2three_realworld | 🤗 Huggingface | configs/wan2-2_lora_one2three_realworld.yaml |
You can download any of the checkpoints above and specify the corresponding config file for inference. For convenience, we provide the following example script:
bash scripts/inference_wan2.sh
Note that the ckpt_path parameter needs to be updated to the path of the checkpoint you downloaded.
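For example, a checkpoint can be fetched programmatically with huggingface_hub (a minimal sketch; the repo ID and local directory below are placeholders, so substitute the actual Hugging Face link from the table above):

```python
# Minimal download sketch using huggingface_hub (pip install huggingface_hub).
# NOTE: the repo_id below is a placeholder; replace it with the repo linked in the table above.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="your-org/wan2.2-TI2V-5B_three2one_synthetic",   # placeholder repo ID
    local_dir="./checkpoints/wan2.2-TI2V-5B_three2one_synthetic",
)
print("Checkpoint downloaded to:", ckpt_dir)
# Point ckpt_path in scripts/inference_wan2.sh to this directory.
```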
We recommend running inference on a GPU with at least 80 GB of VRAM to avoid out-of-memory errors.
You can also train on your custom dataset. To do so, first adjust first_video_root, third_video_root, ref_image_root, and the other parameters in the corresponding config file. You may also need to modify the CustomTrainDataset class in dataset/custom_dataset.py to match the attributes of your own dataset, as sketched below.
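As a rough reference, the sketch below shows the kind of triplet interface such a dataset class is expected to expose (an illustrative assumption, not the actual CustomTrainDataset implementation; adapt the file layout and decoding to your data):

```python
# Illustrative sketch of a paired ego/exo dataset; not the actual CustomTrainDataset.
import os
from torch.utils.data import Dataset

class MyEgoExoDataset(Dataset):  # hypothetical class name for a custom dataset
    def __init__(self, first_video_root, third_video_root, ref_image_root):
        # Assumes matching basenames across the three roots, e.g. 0001.mp4 / 0001.mp4 / 0001.png.
        self.first_video_root = first_video_root
        self.third_video_root = third_video_root
        self.ref_image_root = ref_image_root
        self.names = sorted(os.path.splitext(f)[0] for f in os.listdir(first_video_root))

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        # Return file paths here; replace with your own video/image decoding as needed.
        return {
            "first_video": os.path.join(self.first_video_root, name + ".mp4"),
            "third_video": os.path.join(self.third_video_root, name + ".mp4"),
            "ref_image": os.path.join(self.ref_image_root, name + ".png"),
        }
```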
For convenience, we have also provided the following training script:
bash scripts/train_wan2.sh
🙏 This codebase borrows parts from DiffSynth-Studio and Wan2.2. Many thanks to them for their open-source contributions. I also want to thank my co-first author for his trust and support, and to anonymously thank the senior who taught me PyTorch Lightning, enabling me to build the training code from scratch on my own.
👋 If you find this code useful for your research, we would appreciate it if you could cite:
@article{song2025worldwander,
title={WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation},
author={Song, Quanjian and Song, Yiren and Peng, Kelly and Gao, Yuan and Shou, Mike Zheng},
journal={arXiv preprint arXiv:2511.22098},
year={2025}
}