Data Soup

One robot body. A thousand different teachers. Thirteen data source types mixed into a single training stream.


The Idea

A robot that only learns from its own body learns slowly. A robot that learns from every body learns the universal language of movement.

Following the XVLA/Agibot-World approach, Neon trains on a data soup — a weighted mixture of diverse datasets, each contributing different skills, different environments, different embodiments:

graph LR
    subgraph "Source Types"
        LR["lerobot"]
        AG["agibot"]
        DG["dreamgen"]
        COS["cosmos_dreamgen"]
        OXE["oxe"]
        VC["🗣️ voice_commands"]
        S4D["stereo4d"]
        G1T["🦿 g1_teleop"]
        GRT["groot_teleop"]
        NV3["neon_v3"]
        BON["🦴 bones_seed"]
        KIM["🎮 kimodo"]
        MOL["🧪 molmobot"]
    end

    subgraph "Pipeline"
        MAP["ActionMapper<br/>Cross-embodiment"]
        REL["Relative Actions<br/>Cosmos EE deltas"]
        NORM["Normalize to G1"]
        MIX["Weighted Mixing"]
    end

    LR --> MAP
    AG --> MAP
    DG --> MAP
    OXE --> MAP
    G1T --> MAP
    GRT --> MAP
    NV3 --> MAP
    BON --> MAP
    KIM --> MAP
    MOL --> MAP
    COS --> REL
    MAP --> NORM
    REL --> NORM
    VC --> MIX
    S4D --> MIX
    NORM --> MIX

    MIX --> DS["NeonEpisode[]"]

    style MIX fill:#e65100,color:#fff

Thirteen Sources

| Type | What It Is | What It Teaches | Has Actions |
|---|---|---|---|
| lerobot | HuggingFace LeRobot (Bridge, DROID) | Tabletop manipulation fundamentals | ✅ |
| agibot | Agibot-World bimanual data | Two-armed coordination, 1M+ episodes | ✅ |
| dreamgen | GR00T-Dreams synthetic videos | Synthetic demonstrations via IDM extraction | ✅ (via IDM) |
| cosmos_dreamgen | Cosmos-Predict2.5 style data | Relative EE actions — cross-embodiment gold | ✅ |
| oxe | Open X-Embodiment (multi-robot) | Breadth — many robots, many tasks | ✅ |
| voice_commands | 50K natural language instructions | Language diversity across 10 categories | ❌ (text only) |
| stereo4d | Stereo kitchen video pairs | Scene understanding, depth from stereo | ❌ (visual only) |
| g1_teleop | Unitree G1 teleoperation data | Native G1 whole-body movement | ✅ |
| groot_teleop | GR00T teleoperation recordings | GR1-style humanoid manipulation | ✅ |
| neon_v3 / neon_native | Neon's own recorded episodes | Self-collected demonstrations | ✅ |
| bones_seed | BONES skeletal seed demonstrations | Skeletal-level motion primitives | ✅ |
| kimodo | Kimodo generator output | Kinematic motion from procedural generation | ✅ |
| molmobot | MolmoBot-Data simulation trajectories | Sim pick-and-place + flow matching insights | ✅ |

Cross-Embodiment Mapping

Different robots have different bodies. The ActionMapper translates them all into G1 space:

graph TD
    subgraph "Source Bodies"
        FR["Franka (7 DoF)"]
        SO["SO-100 (7 DoF)"]
        AB["Agibot (14 DoF)"]
        GR["GR-1 (14 DoF)"]
        G1S["G1 (29 DoF)"]
    end

    MAP["ActionMapper<br/>Index remapping + zero-pad"]

    FR --> MAP
    SO --> MAP
    AB --> MAP
    GR --> MAP
    G1S --> MAP

    MAP --> G1["G1 Action Space<br/>(normalized, unified)"]

    style MAP fill:#e65100,color:#fff
    style G1 fill:#1b5e20,color:#fff

Franka's 7 DoF maps to the G1's right arm (indices [7-13]). Agibot's 14 DoF maps directly to both arms. Any G1 joint with no source counterpart is zero-padded. The mapper handles it.
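
A minimal sketch of that remapping, assuming a plain index map per embodiment — the map contents and the helper below are illustrative, not the actual ActionMapper code:

import numpy as np

G1_DOF = 29

# Hypothetical index maps: source action index -> G1 action index.
# Franka -> right arm follows the [7-13] range described above; the
# Agibot layout (both arms at indices 0-13) is an assumption.
EMBODIMENT_MAPS = {
    "franka": {i: 7 + i for i in range(7)},
    "agibot": {i: i for i in range(14)},
}

def map_to_g1(actions: np.ndarray, embodiment: str) -> np.ndarray:
    """Remap (T, source_dim) actions into (T, 29) G1 space.

    G1 joints with no source counterpart stay zero-padded.
    """
    index_map = EMBODIMENT_MAPS[embodiment]
    out = np.zeros((actions.shape[0], G1_DOF), dtype=actions.dtype)
    for src, dst in index_map.items():
        out[:, dst] = actions[:, src]
    return out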


Configuration

from neon.data.data_soup import DataSoupConfig, DataSourceConfig

config = DataSoupConfig(
    sources=[
        DataSourceConfig(
            name="bridge",
            type="lerobot",
            path="lerobot/bridge_v2",
            weight=1.0,
        ),
        DataSourceConfig(
            name="agibot",
            type="agibot",
            path="lerobot/xvla-agibot-world",
            weight=2.0,          # 2× more agibot — it's our best bimanual data
            max_episodes=10000,
        ),
        DataSourceConfig(
            name="cosmos-synth",
            type="cosmos_dreamgen",
            path="nvidia/GR1-100",
            weight=1.5,
            use_relative_actions=True,
            action_scaler=20.0,
        ),
        DataSourceConfig(
            name="stereo-kitchen",
            type="stereo4d",
            path="cagataydev/strands-kitchen-stereo4d",
            weight=0.5,
            use_stereo_pair=True,
        ),
        DataSourceConfig(
            name="voice-cmds",
            type="voice_commands",
            path="cagataydev/vlm-voice-commands",
            weight=0.3,
        ),
    ],
    chunk_size=16,
    fps=15,
    shuffle=True,
)

The weight parameter controls the mixing ratio. Higher weight = more samples from that source during training. Tune this to bias the model toward the skills you care about most.
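A minimal sketch of the weighted draw, assuming per-source weights like those in the config above (the helper is illustrative, not the actual DataSoup sampler):

import numpy as np

def sample_source(sources: list[dict], rng: np.random.Generator) -> dict:
    """Pick a source with probability proportional to its weight."""
    weights = np.array([s["weight"] for s in sources], dtype=np.float64)
    return sources[rng.choice(len(sources), p=weights / weights.sum())]

With the weights above (1.0, 2.0, 1.5, 0.5, 0.3), agibot is drawn 2.0 / 5.3 ≈ 38% of the time.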


The Unified Episode

All source types normalize to a common structure:

@dataclass
class NeonEpisode:
    images: List[np.ndarray]              # (T, H, W, 3) RGB frames
    actions: np.ndarray                   # (T, action_dim) normalized [-1, 1]
    language: str                         # Natural language instruction
    proprioception: Optional[np.ndarray]  # (T, joint_dim)
    audio: Optional[np.ndarray]           # (T, samples) 16kHz waveform per step
    lidar: Optional[np.ndarray]           # (T, N, 4) point clouds per step
    eef_state: Optional[np.ndarray]       # (T, 14) bimanual EE pos+quat
    metadata: Dict[str, Any]              # Source info, episode flags

All modalities are optional — a dataset with only cameras and joints will have audio=None, lidar=None, eef_state=None. The trainer's collation function handles mixed batches gracefully, zero-padding missing modalities.
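A sketch of that zero-padding, assuming fixed per-step shapes for the optional modalities (the shapes below are assumptions, not the trainer's actual values):

import numpy as np

# Assumed per-step shapes for the optional modalities.
PAD_SHAPES = {
    "audio": (1067,),    # ~16 kHz / 15 fps samples per step (assumed)
    "lidar": (1024, 4),  # N points x (x, y, z, intensity) (assumed)
    "eef_state": (14,),  # bimanual EE pos + quat, as above
}

def zero_pad_missing(episode: NeonEpisode) -> NeonEpisode:
    """Fill None modalities with zeros so mixed batches stack cleanly."""
    T = len(episode.images)
    for field, shape in PAD_SHAPES.items():
        if getattr(episode, field) is None:
            setattr(episode, field, np.zeros((T, *shape), dtype=np.float32))
    return episode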

Episode Flags

Two flags tell the training loop what kind of loss to compute (see the sketch after this list):

  • language_only=True — Voice command episodes. No images or actions. Contributes to language conditioning only.
  • visual_only=True — Stereo4D / DreamGen. Video without action labels. Contributes to visual understanding only.
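
In sketch form, the branch could read the flags from NeonEpisode.metadata, which is documented above as holding the episode flags (the loss-term names here are hypothetical):

def active_losses(episode: NeonEpisode) -> set[str]:
    """Which loss terms an episode contributes to."""
    if episode.metadata.get("language_only"):
        return {"language"}           # no images or actions to supervise
    if episode.metadata.get("visual_only"):
        return {"vision"}             # video without action labels
    return {"action", "vision", "language"}  # fully labeled episode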

Adding a New Dataset

Five steps:

  1. Add a DataSourceConfig in neon/data/data_soup.py
  2. Implement a _load_<type> method if it's a new source format (see the sketch after this list)
  3. Add embodiment mapping in ActionMapper.EMBODIMENT_MAPS
  4. Add to a training config preset in neon/training/config.py
  5. Test: pytest tests/test_data.py -v
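
For step 2, a loader skeleton might look like the following — read_raw_episodes and normalize_actions are hypothetical helpers for your format; only the NeonEpisode fields come from this page:

import numpy as np

def _load_my_source(self, config: "DataSourceConfig") -> list[NeonEpisode]:
    """Hypothetical skeleton: normalize a new format into NeonEpisode."""
    episodes = []
    for raw in read_raw_episodes(config.path):  # your format-specific reader
        episodes.append(NeonEpisode(
            images=[np.asarray(f) for f in raw["frames"]],
            actions=normalize_actions(raw["actions"]),  # scale to [-1, 1]
            language=raw.get("instruction", ""),
            proprioception=raw.get("qpos"),
            audio=None,        # modalities this source lacks stay None
            lidar=None,
            eef_state=None,
            metadata={"source": config.name},
        ))
    return episodes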

New Sources (v2)

MolmoBot — Sim-to-Real Flow Matching

From Allen AI's MolmoBot paper (arXiv:2603.16861). Large-scale simulated manipulation data with key insights for flow matching training:

  • Joint position actions (not deltas) work better with flow matching heads
  • Exo + wrist camera pairs give robust sim-to-real transfer
  • Prompt randomization (case, punctuation, synonyms) prevents overfitting (sketched after this list)
  • Auxiliary dataset mixing with weighted sampling improves generalization
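
A toy sketch of that prompt randomization — the synonym table and jitter probabilities are illustrative, not from the paper:

import random

SYNONYMS = {"pick up": ["grab", "lift", "pick up"], "place": ["put", "set", "place"]}

def randomize_prompt(prompt: str, rng: random.Random) -> str:
    """Jitter case, punctuation, and phrasing of an instruction."""
    for phrase, options in SYNONYMS.items():
        if phrase in prompt:
            prompt = prompt.replace(phrase, rng.choice(options))
    if rng.random() < 0.3:
        prompt = prompt.lower()
    if rng.random() < 0.3:
        prompt = prompt.rstrip(".")
    return prompt

The source itself is configured like any other: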
DataSourceConfig(
    name="molmobot",
    type="molmobot",
    path="allenai/MolmoBot-Data",     # HuggingFace or local H5 files
    weight=1.0,
    max_episodes=5000,
)

H5 format: obs/agent/qpos/arm (T,7), actions/joint_pos/arm (T,7), obs/cameras/<cam>/video (T,H,W,3).
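
A minimal h5py read of that layout (the camera name "exo" is an assumption):

import h5py
import numpy as np

def read_molmobot_h5(path: str, camera: str = "exo"):
    """Read one episode from the H5 layout described above."""
    with h5py.File(path, "r") as f:
        qpos = np.asarray(f["obs/agent/qpos/arm"])             # (T, 7)
        actions = np.asarray(f["actions/joint_pos/arm"])       # (T, 7)
        video = np.asarray(f[f"obs/cameras/{camera}/video"])   # (T, H, W, 3)
    return qpos, actions, video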

G1 Teleoperation

Native Unitree G1 whole-body teleoperation recordings. No cross-embodiment mapping needed — actions map directly to G1 joints.

DataSourceConfig(
    name="g1-teleop",
    type="g1_teleop",
    path="cagataydev/g1-teleop-v1",
    weight=3.0,        # High weight — native embodiment data is gold
)

GR00T Teleoperation

GR1 humanoid teleoperation from NVIDIA's data collection pipeline. 44-DoF actions mapped to G1's 29-DoF via joint group alignment.

DataSourceConfig(
    name="groot-teleop",
    type="groot_teleop",
    path="nvidia/GR1-teleop-100k",
    weight=1.5,
)

BONES Seed

Skeletal-level motion primitives — minimal demonstrations that capture the essence of a movement. Used for bootstrapping training before scaling to full datasets.

DataSourceConfig(
    name="bones",
    type="bones_seed",
    path="cagataydev/bones-seed-v1",
    weight=0.5,
)

Kimodo

Procedurally generated kinematic motions from the Kimodo motion generator. Synthetic but physically plausible whole-body trajectories.

DataSourceConfig(
    name="kimodo",
    type="kimodo",
    path="cagataydev/kimodo-g1-v1",
    weight=0.8,
)

Next: Audio In/Out — give the robot ears and a voice