RoboTube: Learning Household Manipulation from Human Videos with Simulated Twin Environments
- Haoyu Xiong*1,2
- Haoyuan Fu*3
- Jieyi Zhang3
- Chen Bao3
- Qiang Zhang4
- Yongxi Huang3
- Wenqiang Xu1,3
- Animesh Garg5
- Huazhe Xu1,4
- Cewu Lu1,3
*Equal contribution
TL;DR: Human videos + digital twins (for the objects in the videos).
Abstract: We aim to build a useful, reproducible platform for learning from human videos. This goal requires a diverse, high-quality human video dataset curated specifically for robot manipulation. To evaluate learning progress, a simulated twin environment that resembles the appearance and dynamics of the physical world helps roboticists and AI researchers validate their algorithms convincingly and efficiently before testing on a real robot. Hence, we present RoboTube, a human video dataset together with its digital twins, for robot learning from human videos.
Human Video Dataset
We build a diverse, high-quality human video dataset that spans a wide range of complementary settings:
- Success and failure videos
- First-person and third-person views
- RGB and depth
- Structured table-top and in-the-wild scenes
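The four axes above can be read as per-clip metadata. The following is a minimal sketch of how such metadata might be represented and filtered; the `ClipMeta` schema, its field names, and the `select` helper are hypothetical illustrations and do not describe the actual RoboTube release format or API.

```python
from dataclasses import dataclass
from enum import Enum


class View(Enum):
    FIRST_PERSON = "first_person"
    THIRD_PERSON = "third_person"


class Scene(Enum):
    TABLE_TOP = "table_top"
    IN_THE_WILD = "in_the_wild"


@dataclass(frozen=True)
class ClipMeta:
    """Hypothetical per-clip metadata; not the RoboTube release format."""
    clip_id: str
    success: bool    # True for a success video, False for a failure video
    view: View       # first-person or third-person camera
    has_depth: bool  # True if a depth stream accompanies the RGB stream
    scene: Scene     # structured table-top or in-the-wild


def select(clips, *, success=None, view=None, has_depth=None, scene=None):
    """Return the clips that match every criterion that is not None."""
    wanted = {
        "success": success,
        "view": view,
        "has_depth": has_depth,
        "scene": scene,
    }
    return [
        clip for clip in clips
        if all(v is None or getattr(clip, k) == v for k, v in wanted.items())
    ]


# Toy entries for illustration only; these are not real dataset records.
clips = [
    ClipMeta("a", True, View.FIRST_PERSON, True, Scene.TABLE_TOP),
    ClipMeta("b", False, View.THIRD_PERSON, False, Scene.IN_THE_WILD),
    ClipMeta("c", True, View.THIRD_PERSON, True, Scene.TABLE_TOP),
]

rgbd_successes = select(clips, success=True, has_depth=True)
```

Treating each setting as an independent field keeps the combinations explicit, so a learner could, for example, train only on third-person success videos that include depth.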
Simulated Digital Twins
We construct a suite of simulated digital twins for the objects in the human videos. These digital twins enable fair comparisons with baseline methods and let researchers validate algorithms convincingly and efficiently before conducting more complex experiments on real robots.
Citation
To cite this work, please use the following BibTeX entry:
@inproceedings{xiong2022robotube,
title={RoboTube: Learning Household Manipulation from Human Videos
with Simulated Twin Environments},
author={Haoyu Xiong and Haoyuan Fu and Jieyi Zhang and Chen Bao
and Qiang Zhang and Yongxi Huang and Wenqiang Xu
and Animesh Garg and Cewu Lu},
booktitle={6th Annual Conference on Robot Learning},
year={2022},
url={https://openreview.net/forum?id=VD0nXUG5Qk}
}