RoboTube: Learning Household Manipulation from Human Videos with Simulated Twin Environments
¹Shanghai Qizhi Institute   ²Carnegie Mellon University   ³SJTU   ⁴Tsinghua University   ⁵University of Toronto
*Equal contribution
CoRL 2022 Oral Presentation

TL;DR: Human videos + digital twins (for the objects in the videos).

Abstract: We aim to build a useful, reproducible platform for learning from human videos. This requires a diverse, high-quality human video dataset curated specifically for robot manipulation. To evaluate learning progress, a simulated twin environment that resembles the appearance and dynamics of the physical world helps roboticists and AI researchers validate their algorithms convincingly and efficiently before testing on a real robot. Hence, we present RoboTube, a human video dataset together with its digital twins for robot learning from human videos.

Human Video Dataset

We build a diverse, high-quality human video dataset that spans a wide range of complementary settings:

  • Success and failure videos
  • First-person and third-person views
  • RGB and depth
  • Structured table-top and in-the-wild scenes

Simulated Digital Twins

We construct a suite of simulated digital twins of the objects in the human videos. These digital twins enable fair comparisons with baseline methods and let researchers validate algorithms convincingly and efficiently before running more complex experiments on real robots.

Citation

To cite this work, please use the following BibTeX entry:

@inproceedings{xiong2022robotube,
  title={RoboTube: Learning Household Manipulation from Human Videos
         with Simulated Twin Environments},
  author={Haoyu Xiong and Haoyuan Fu and Jieyi Zhang and Chen Bao
          and Qiang Zhang and Yongxi Huang and Wenqiang Xu
          and Animesh Garg and Cewu Lu},
  booktitle={6th Annual Conference on Robot Learning},
  year={2022},
  url={https://openreview.net/forum?id=VD0nXUG5Qk}
}