RoboTube

TL;DR: Human videos + digital twins (for the objects in the videos).

Abstract: We aim to build a useful, reproducible platform for learning from human videos. Realizing this goal requires a diverse, high-quality human video dataset curated specifically for robot manipulation. To evaluate learning progress, a simulated twin environment that resembles the appearance and dynamics of the physical world would help roboticists and AI researchers validate their algorithms convincingly and efficiently before testing on a real robot. Hence, we present RoboTube, a human video dataset and its digital twins for robot learning from human videos.

Human Video Dataset

We build a diverse, high-quality human video dataset that spans a wide range of complementary settings:

Success and failure videos
First-person and third-person views
RGB and depth
Structured table-top and in-the-wild scenes

You can download the dataset from Hugging Face: RoboTube_human_videos
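
For readers who prefer a scripted download, the following is a minimal sketch rather than an official loader. It assumes the videos are hosted as a dataset repository on the Hugging Face Hub and that the `huggingface_hub` package is installed. The dataset name is taken from the text above, but the owning namespace is not stated here, so it is left as a caller-supplied argument; the helper names `dataset_repo_id` and `download_videos` and the default `local_dir` are hypothetical.

```python
from pathlib import Path

# The dataset name comes from the text above. The owning namespace is not
# stated there, so the caller must supply it (assumption).
DATASET_NAME = "RoboTube_human_videos"


def dataset_repo_id(namespace: str) -> str:
    """Build a Hub repo id of the form '<namespace>/<dataset name>'."""
    namespace = namespace.strip().strip("/")
    if not namespace:
        raise ValueError("namespace must be a non-empty string")
    return f"{namespace}/{DATASET_NAME}"


def download_videos(namespace: str, local_dir: str = "robotube_videos") -> Path:
    """Download every file in the dataset repo into local_dir (hypothetical helper)."""
    # Imported lazily so dataset_repo_id works without the package installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=dataset_repo_id(namespace),
        repo_type="dataset",  # assumption: hosted as a dataset repo, not a model repo
        local_dir=local_dir,
    )
    return Path(path)
```

Calling `download_videos` with the namespace shown on the dataset page should fetch the full repository; for a dataset of this size it may be worth checking available disk space first.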

Simulated Digital Twins

We construct a suite of simulated digital twins for the objects in the human videos. These digital twins enable fair comparisons with baseline methods and let researchers validate algorithms convincingly and efficiently before running more complex experiments on real robots.

You can find the digital twins on GitHub: robotube

Citation

To cite this work, please use the following BibTeX entry:

@inproceedings{xiong2022robotube,
  title={RoboTube: Learning Household Manipulation from Human Videos
         with Simulated Twin Environments},
  author={Haoyu Xiong and Haoyuan Fu and Jieyi Zhang and Chen Bao
          and Qiang Zhang and Yongxi Huang and Wenqiang Xu
          and Animesh Garg and Cewu Lu},
  booktitle={6th Annual Conference on Robot Learning},
  year={2022},
  url={https://openreview.net/forum?id=VD0nXUG5Qk}
}