General Flow as Foundation Affordance for
Scalable Robot Learning

Institute for Interdisciplinary Information Sciences, Tsinghua University
Shanghai Artificial Intelligence Laboratory
Shanghai Qi Zhi Institute



General Flow is a Foundation Affordance that provides scalability, universality, rich geometric guidance, and a small inference domain gap.
With general flow as the affordance, we achieve zero-shot human-to-robot skill transfer with an 81% success rate across a total of 18 tasks.

Abstract

We address the challenge of acquiring real-world manipulation skills with a scalable framework. Inspired by the success of large-scale auto-regressive prediction in Large Language Models (LLMs), we believe that identifying an appropriate prediction target capable of leveraging large-scale datasets is crucial for efficient and universal learning. Therefore, we propose to use flow, i.e., the future trajectories of 3D points on objects of interest, as an ideal prediction target in robot learning. To exploit scalable data resources, we turn our attention to cross-embodiment datasets. We develop, for the first time, a language-conditioned prediction model trained directly on large-scale RGBD human video datasets. Our predicted flow offers actionable geometric and physics guidance, thus facilitating stable zero-shot skill transfer in real-world scenarios. We deploy our method with a policy based on closed-loop flow prediction. Remarkably, without any additional training, our method achieves an impressive 81% success rate in human-to-robot skill transfer, covering 18 tasks in 6 scenes. Our framework features the following benefits: (1) scalability: leveraging cross-embodiment data resources; (2) universality: multiple object categories, including rigid, articulated, and soft bodies; (3) stable skill transfer: providing actionable guidance with a small inference domain gap. Together, these open a new pathway towards scalable general robot learning. Data, code, and model weights will be made publicly available.
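For concreteness, here is a minimal Python sketch of the interface such a general flow predictor exposes: a point cloud lifted from an RGBD frame, query points sampled on the object of interest, and a language instruction go in; a future 3D trajectory for each query point comes out. The class name GeneralFlowPredictor and its static placeholder prediction are hypothetical illustrations of the data shapes involved, not our released API.

# Hypothetical interface sketch; GeneralFlowPredictor is a placeholder, not the released model.
import numpy as np

class GeneralFlowPredictor:
    """Stand-in that predicts future 3D trajectories (general flow) for query points."""

    def __init__(self, horizon: int = 8):
        self.horizon = horizon  # number of future waypoints per query point

    def predict(self, scene_points: np.ndarray, query_points: np.ndarray,
                instruction: str) -> np.ndarray:
        # scene_points: (N, 3) point cloud lifted from an RGBD frame.
        # query_points: (M, 3) points sampled on the object of interest.
        # Returns flow of shape (M, horizon, 3): future positions of each query point.
        # Placeholder behaviour: repeat the current positions ("static" flow).
        return np.repeat(query_points[:, None, :], self.horizon, axis=1)

# Example usage with random geometry standing in for a real observation.
scene = np.random.rand(4096, 3).astype(np.float32)
queries = np.random.rand(128, 3).astype(np.float32)
flow = GeneralFlowPredictor().predict(scene, queries, "open the Safe")
print(flow.shape)  # (128, 8, 3)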

General Flow



We propose General Flow as a Foundation Affordance and analyze its properties and applications. We design a scale-aware algorithm for general flow prediction and achieve stable zero-shot cross-embodiment skill transfer in the real world. These findings highlight the potential of general flow to drive scalable general robot learning.
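To illustrate how predicted flow can be turned into robot actions, below is a minimal sketch of one closed-loop control step, under the assumption that the policy tracks points on the grasped part, queries the flow model, and fits a rigid motion to the first predicted waypoints using a standard Kabsch/SVD alignment. Function names are illustrative assumptions, not our released implementation.

import numpy as np

def fit_rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Least-squares rigid transform (R, t) mapping src (M, 3) onto dst (M, 3)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # correct an improper rotation (reflection)
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def closed_loop_step(predictor, scene_points, query_points, instruction):
    """One control step: predict flow, fit a rigid motion, return the delta pose."""
    flow = predictor.predict(scene_points, query_points, instruction)  # (M, T, 3)
    next_points = flow[:, 0, :]  # first predicted waypoint of each tracked point
    return fit_rigid_transform(query_points, next_points)

In deployment this step repeats after every end-effector motion: the scene is re-observed, the query points are re-tracked, and the flow model is queried again, which is what makes the policy closed-loop.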



Zero-Shot Real World Execution


For all 18 tasks, we provide 4 example human videos and 3 robot trial demos.
We keep the gripper position, grasping manner, initial state, scene setting, and policy behaviour as diverse as possible.
Please check out our paper for the detailed method and deployment settings.

Video Demo for the selected task and action.

Video Examples from Human Datasets

Real World Trial-1

Real World Trial-2

Real World Trial-3




Zero-Shot General Flow Prediction

We provide visualizations of general flow prediction during zero-shot execution.
25 trajectories are selected for clarity.
General flow prediction is robust to embodiment transfer and segmentation errors to some extent.

Visualization for the selected task and action.

Scene Reference


General Flow Prediction




More Visualization from Human Videos

We provide more visualizations of predictions on human videos. 25 or 50 trajectories are selected for clarity.

Visualization



Emergent Properties of General Flow

When trained on the general flow prediction task at scale, our model acquires multiple emergent properties that are infeasible in a small-scale imitation learning setting.

  • Semantic Richness & Controllability [(a)(b)]: The semantics of the flow can be easily altered by switching language instructions, showcasing the model's flexibility.

  • Robustness to Label Noise [(c)(d)]: Despite severe noise in labels, such as significant deviation in 'open Safe' or almost static labels in 'pickup Toy Car', our model consistently predicts the correct trend. In the figure, red indicates the label and green represents the prediction.

  • Spatial Commonsense Acquisition [(e)(f)]: For instance, in the 'putdown Mug' task, the model adjusts its prediction scale to accurately reflect the spatial relationships of objects, ensuring both ends are on the table and adapting to longer distances with a larger scale.

Visualization

Failure Case

Trajectory Deviation

Gripper Drop

Robot Stuck

Please check out our paper for more details. Have fun :)

BibTeX

If you find this repository useful, please kindly cite our work:
@article{yuan2024general,
  title={General Flow as Foundation Affordance for Scalable Robot Learning},
  author={Yuan, Chengbo and Wen, Chuan and Zhang, Tong and Gao, Yang},
  journal={arXiv preprint arXiv:2401.11439},
  year={2024}
}