H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos
Hai Ci, Xiaokang Liu, Pei Yang, Yiren Song, Mike Zheng Shou*
Show Lab, National University of Singapore
*Corresponding author
📄 Paper (arXiv): coming soon
🌐 Project Page: https://showlab.github.io/H2R-Grounder/
H2R-Grounder converts third-person human interaction videos into frame-aligned robot manipulation videos, without using any paired human–robot data for training.
Figure: H2R-Grounder pipeline. We extract pose and background to form H2Rep, then use a diffusion-based in-context model to generate physically grounded robot videos aligned with human actions.
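The official code is not yet released, so the sketch below only illustrates how the described pipeline might be wired up. Every name in it (`extract_pose`, `extract_background`, `InContextDiffusion`, `h2r_translate`) is a hypothetical placeholder, not the authors' actual API.

```python
from typing import Any, Sequence

# Hypothetical helpers -- illustrative placeholders only, since the
# official H2R-Grounder code and model interfaces are not yet public.

def extract_pose(frame: Any) -> Any:
    """Placeholder: estimate the human pose in a single frame."""
    raise NotImplementedError

def extract_background(frame: Any) -> Any:
    """Placeholder: recover the scene background for a single frame."""
    raise NotImplementedError

class InContextDiffusion:
    """Placeholder for a diffusion-based in-context video generator."""
    def generate(self, condition: Sequence[Any]) -> Any:
        raise NotImplementedError

def h2r_translate(human_video: Sequence[Any], model: InContextDiffusion) -> Any:
    """Sketch of the pipeline: human video -> H2Rep -> robot video.

    Per frame, pose and background are extracted and combined into the
    H2Rep conditioning signal; the in-context diffusion model then
    generates a frame-aligned, physically grounded robot video.
    """
    h2rep = [(extract_pose(f), extract_background(f)) for f in human_video]
    return model.generate(condition=h2rep)
```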
Visit our project page for full videos, comparisons, ablations, and failure case analysis:
👉 https://showlab.github.io/H2R-Grounder/
Code and models will be released soon.
If you find H2R-Grounder useful in your research, please cite:

@article{ci2025h2rgrounder,
  title={H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos},
  author={Ci, Hai and Liu, Xiaokang and Yang, Pei and Song, Yiren and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:XXXXX},
  year={2025}
}