What Matters in Learning from Large-Scale Datasets for Robot Manipulation

1Georgia Institute of Technology, 2The University of Texas at Austin, 3NVIDIA Research
*equal contribution, equal advising

ICLR 2025

Abstract

Imitation learning from large multi-task demonstration datasets has emerged as a promising path for building generally-capable robots. As a result, thousands of hours have been spent on building such large-scale datasets around the globe. Despite the continuous growth of such efforts, we still lack a systematic understanding of what data should be collected to improve the utility of a robotics dataset and facilitate downstream policy learning. In this work, we conduct a large-scale dataset composition study to answer this question. We develop a data generation framework to procedurally emulate common sources of diversity in existing datasets (such as sensor placements and object types and arrangements), and use it to generate large-scale robot datasets with controlled compositions, enabling a suite of dataset composition studies that would be prohibitively expensive in the real world. We focus on two practical settings: (1) what types of diversity should be emphasized when future researchers collect large-scale datasets for robotics, and (2) how should current practitioners retrieve relevant demonstrations from existing datasets to maximize downstream policy performance on tasks of interest. Our study yields several critical insights -- for example, we find that camera poses and spatial arrangements are crucial dimensions for both diversity in collection and alignment in retrieval. In real-world robot learning settings, we find that not only do our insights from simulation carry over, but our retrieval strategies on existing datasets such as DROID allow us to consistently outperform existing training strategies by up to 70%.

Data Composition Study - Collector and Retriever Perspectives

We study dataset composition for imitation learning through two lenses: that of a dataset collector and that of a dataset retriever.

MimicLabs Dataset

To understand the effects of dataset composition along various dimensions of variation (DVs), we create a large robotic manipulation dataset of ~1M demonstration trajectories, from which we can retrieve demonstrations by camera pose, background and texture, spatial arrangement of objects and receptacles, and the motions that carry out each task.
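As a rough illustration of what this kind of structured retrieval looks like, here is a minimal Python sketch. The DemoMeta fields and the retrieve helper are hypothetical stand-ins for the MimicLabs metadata schema, not the released API.

from dataclasses import dataclass

# Hypothetical per-demonstration metadata; the fields are illustrative
# stand-ins for the MimicLabs DVs, not the actual dataset schema.
@dataclass
class DemoMeta:
    task: str
    camera_pose: tuple        # e.g., (azimuth, elevation, distance) of the workspace camera
    texture_id: str           # background / table texture identifier
    object_region: str        # coarse spatial bin of the object's initial pose
    receptacle_region: str    # coarse spatial bin of the receptacle
    skill: str                # e.g., "grasp-carrot", "open-microwave"

def retrieve(demos, **filters):
    """Return the demos whose metadata matches every given filter."""
    return [d for d in demos
            if all(getattr(d, key) == val for key, val in filters.items())]

# Example: all demos sharing the target task's camera pose and object
# placement, regardless of texture.
# cotrain = retrieve(all_demos, camera_pose=target.camera_pose,
#                    object_region=target.object_region)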

The Collector's Perspective

We design an experiment to understand how variation along different DVs enables or hampers skill transfer when co-training with a large dataset. We create multiple target distributions with differing variation along each DV, and experiment with co-training under misaligned variation in each DV.
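The sketch below shows one way such aligned and misaligned co-training conditions could be enumerated. The DV names echo the paper's dimensions of variation, but build_cotrain_split and the metadata attributes are hypothetical, matching the DemoMeta sketch above.

# One co-training split per (DV, alignment) condition.
DVS = ["camera_pose", "texture_id", "object_region", "receptacle_region"]

def build_cotrain_split(prior_demos, target_meta, dv, aligned):
    """Keep prior demos whose value along `dv` matches the target task's
    (aligned=True) or deliberately differs from it (aligned=False)."""
    return [d for d in prior_demos
            if (getattr(d, dv) == getattr(target_meta, dv)) == aligned]

# conditions = {(dv, a): build_cotrain_split(prior, target, dv, a)
#               for dv in DVS for a in (True, False)}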


On the clear table task, this experiment shows that (1) misaligned camera poses and spatial arrangements hamper skill transfer, and (2) diverse camera poses enable transfer in the presence of differing textures.

The Retriever's Perspective - in the MimicLabs Dataset

Below are some demonstrations from the MimicLabs dataset for the target tasks we consider in this study.


Bin carrot

Bin bowl

Clear table

Microwave teapot

Make coffee

We summarize our findings in the table below, which shows success rates on all five target tasks above when co-training on different dataset splits of the MimicLabs dataset. Our structured demonstration generation pipeline also allows for counterfactual retrieval: removing demonstrations that contain the skill required for grasping the target object or accessing the receptacle.
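A minimal sketch of such a counterfactual split, assuming per-demo metadata fields like those in the earlier sketch (all names hypothetical):

def counterfactual_split(demos, target_meta, ablated_skill):
    """Retrieve demos aligned with the target's camera pose and object
    placement, then drop every demo exercising the ablated skill."""
    aligned = [d for d in demos
               if d.camera_pose == target_meta.camera_pose
               and d.object_region == target_meta.object_region]
    return [d for d in aligned if d.skill != ablated_skill]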

Real-robot Experiments - Retrieval from DROID

Below are some videos showing rollouts for different tasks, with different co-training datasets retrieved from the DROID dataset.


Wipe Board

Target only

DROID co-training

Retrieve object

Retrieve campose

Retrieve spatial

Pour Bowl

Target only

DROID co-training

Retrieve object

Retrieve campose

Retrieve spatial

Stack Block

Target only

DROID co-training

Retrieve object

Retrieve campose

Retrieve spatial

Snack

Target only

DROID co-training

Retrieve object

Retrieve campose

Retrieve spatial

Overall, we see a significant boost in task success when retrieving along the proposed dimensions of variation, as summarized in the table below. Success rates are calculated over 20 rollouts on the real robot. A rough sketch of the camera-pose retrieval follows the table.


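To make the camera-pose retrieval concrete, below is a minimal Python sketch, assuming each episode carries a 4x4 world-from-camera extrinsics matrix. The translation-plus-rotation scoring rule and all names here are illustrative assumptions, not necessarily the paper's exact retrieval metric.

import numpy as np

def campose_distance(T_a, T_b, rot_weight=0.5):
    """Distance between two 4x4 camera extrinsics: translation offset
    plus a weighted geodesic angle between the two rotations."""
    t_dist = np.linalg.norm(T_a[:3, 3] - T_b[:3, 3])
    cos_ang = (np.trace(T_a[:3, :3].T @ T_b[:3, :3]) - 1.0) / 2.0
    ang = np.arccos(np.clip(cos_ang, -1.0, 1.0))
    return t_dist + rot_weight * ang

def retrieve_campose(episodes, target_extrinsics, k=1000):
    """Return the k episodes whose camera pose is closest to the target's,
    assuming each episode dict stores a 4x4 'extrinsics' matrix."""
    ranked = sorted(episodes,
                    key=lambda e: campose_distance(e["extrinsics"],
                                                   target_extrinsics))
    return ranked[:k]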
BibTeX

@inproceedings{saxena2025what,
  title={What Matters in Learning from Large-Scale Datasets for Robot Manipulation},
  author={Vaibhav Saxena and Matthew Bronars and Nadun Ranawaka Arachchige and Kuancheng Wang and Woo Chul Shin and Soroush Nasiriany and Ajay Mandlekar and Danfei Xu},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=LqhorpRLIm}
}