Mitty: Diffusion-based Human-to-Robot Video Generation
Yiren Song, Cheng Liu, Weijia Mao, and Mike Zheng Shou
Show Lab, National University of Singapore
conda create -n mitty python=3.10 -y
conda activate mitty
pip install -r requirements.txt

The fine-tuned Mitty models will be available at:
- Model: https://huggingface.co/showlab/Mitty_Model
The paired human–robot dataset will be released as a HuggingFace dataset:
- Dataset: https://huggingface.co/datasets/showlab/Mitty_Dataset
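Once the model and dataset are released, they can be fetched with the standard huggingface_hub API. The sketch below is illustrative, not part of the released code; the local directory names (checkpoints/Mitty_Model, dataset) are placeholders you can change freely.

# Sketch: download the released Mitty weights and paired dataset from the Hub.
# Assumes huggingface_hub is installed; local_dir values are placeholder paths.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="showlab/Mitty_Model",       # fine-tuned Mitty weights
    local_dir="checkpoints/Mitty_Model",
)

data_dir = snapshot_download(
    repo_id="showlab/Mitty_Dataset",     # paired human-robot videos
    repo_type="dataset",
    local_dir="dataset",
)

print(model_dir, data_dir)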
A recommended format is:
dataset/
├── human/
│   ├── xxx_00001.mp4
│   ├── xxx_00001.txt   # prompt
│   └── ...
├── robot/
│   ├── xxx_00001.mp4
│   └── ...
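Assuming this layout, human and robot clips are paired by shared file stems, with the prompt stored next to the human clip. The helper below (list_pairs) is a hypothetical illustration of that pairing, not part of the released training code.

# Sketch: enumerate (human clip, prompt, robot clip) triples from the layout above.
# list_pairs is an illustrative helper; it assumes human/ and robot/ share file stems.
from pathlib import Path

def list_pairs(root="dataset"):
    root = Path(root)
    pairs = []
    for human_clip in sorted((root / "human").glob("*.mp4")):
        robot_clip = root / "robot" / human_clip.name
        prompt_file = human_clip.with_suffix(".txt")
        if not robot_clip.exists():
            continue  # skip clips without a robot counterpart
        prompt = prompt_file.read_text().strip() if prompt_file.exists() else ""
        pairs.append((human_clip, prompt, robot_clip))
    return pairs

if __name__ == "__main__":
    for human_clip, prompt, robot_clip in list_pairs()[:3]:
        print(human_clip.name, "->", robot_clip.name, "|", prompt[:60])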
We provide simple shell scripts to launch training and inference.
Edit _scripts/train.sh to set your dataset paths, output directory, and training hyperparameters.
Then run:
_scripts/train.sh
This will start training Mitty on the paired human–robot dataset.
Edit _scripts/inference.sh to point to your trained checkpoint (or the released pretrained model) and specify the input human video / prompt and output directory.
Then run:
_scripts/inference.sh
This will generate the corresponding robot videos from the human inputs using the Mitty model.
If you use this codebase or the released models/dataset in your research, please cite the Mitty paper:
@article{mitty2025,
  title  = {Mitty: Diffusion-based Human-to-Robot Video Generation},
  author = {Yiren Song and Cheng Liu and Weijia Mao and Mike Zheng Shou},
  year   = {2025},
}