Data Soup¶
One robot body. A thousand different teachers. Thirteen data source types mixed into a single training stream.
The Idea¶
A robot that only learns from its own body learns slowly. A robot that learns from every body learns the universal language of movement.
Following the XVLA/Agibot-World approach, Neon trains on a data soup — a weighted mixture of diverse datasets, each contributing different skills, different environments, different embodiments:
graph LR
subgraph "Source Types"
LR["lerobot"]
AG["agibot"]
DG["dreamgen"]
COS["cosmos_dreamgen"]
OXE["oxe"]
VC["🗣️ voice_commands"]
S4D["stereo4d"]
G1T["🦿 g1_teleop"]
GRT["groot_teleop"]
NV3["neon_v3"]
BON["🦴 bones_seed"]
KIM["🎮 kimodo"]
MOL["🧪 molmobot"]
end
subgraph "Pipeline"
MAP["ActionMapper<br/>Cross-embodiment"]
REL["Relative Actions<br/>Cosmos EE deltas"]
NORM["Normalize to G1"]
MIX["Weighted Mixing"]
end
LR --> MAP
AG --> MAP
DG --> MAP
OXE --> MAP
G1T --> MAP
GRT --> MAP
NV3 --> MAP
BON --> MAP
KIM --> MAP
MOL --> MAP
COS --> REL
MAP --> NORM
REL --> NORM
VC --> MIX
S4D --> MIX
NORM --> MIX
MIX --> DS["NeonEpisode[]"]
style MIX fill:#e65100,color:#fff
Thirteen Sources¶
| Type | What It Is | What It Teaches | Has Actions |
|---|---|---|---|
| `lerobot` | HuggingFace LeRobot (Bridge, DROID) | Tabletop manipulation fundamentals | ✅ |
| `agibot` | Agibot-World bimanual data | Two-armed coordination, 1M+ episodes | ✅ |
| `dreamgen` | GR00T-Dreams synthetic videos | Synthetic demonstrations via IDM extraction | ✅ |
| `cosmos_dreamgen` | Cosmos-Predict2.5 style data | Relative EE actions — cross-embodiment gold | ✅ |
| `oxe` | Open X-Embodiment (multi-robot) | Breadth — many robots, many tasks | ✅ |
| `voice_commands` | 50K natural language instructions | Language diversity across 10 categories | ❌ (text only) |
| `stereo4d` | Stereo kitchen video pairs | Scene understanding, depth from stereo | ❌ (visual only) |
| `g1_teleop` | Unitree G1 teleoperation data | Native G1 whole-body movement | ✅ |
| `groot_teleop` | GR00T teleoperation recordings | GR1-style humanoid manipulation | ✅ |
| `neon_v3` / `neon_native` | Neon's own recorded episodes | Self-collected demonstrations | ✅ |
| `bones_seed` | BONES skeletal seed demonstrations | Skeletal-level motion primitives | ✅ |
| `kimodo` | Kimodo generator output | Kinematic motion from procedural generation | ✅ |
| `molmobot` | MolmoBot-Data simulation trajectories | Sim pick-and-place + flow matching insights | ✅ |
Cross-Embodiment Mapping¶
Different robots have different bodies. The ActionMapper translates them all into G1 space:
graph TD
subgraph "Source Bodies"
FR["Franka (7 DoF)"]
SO["SO-100 (7 DoF)"]
AB["Agibot (14 DoF)"]
GR["GR-1 (14 DoF)"]
G1S["G1 (29 DoF)"]
end
MAP["ActionMapper<br/>Index remapping + zero-pad"]
FR --> MAP
SO --> MAP
AB --> MAP
GR --> MAP
G1S --> MAP
MAP --> G1["G1 Action Space<br/>(normalized, unified)"]
style MAP fill:#e65100,color:#fff
style G1 fill:#1b5e20,color:#fff
Franka's 7 DoF maps to the G1's right arm (indices [7-13]). Agibot's 14 DoF maps directly to both arms. Everything else gets zero-padded. The mapper handles it.
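A minimal sketch of that remapping, assuming per-joint index tables (the real lookup lives in `ActionMapper.EMBODIMENT_MAPS`; the table contents below are illustrative, not the actual values):

import numpy as np

# Illustrative source-joint -> G1-joint index tables. The real tables live in
# ActionMapper.EMBODIMENT_MAPS; the entries here are assumptions for the sketch.
EMBODIMENT_MAPS = {
    "franka": {i: 7 + i for i in range(7)},  # 7 DoF -> G1 right arm [7-13]
    "agibot": {i: i for i in range(14)},     # 14 DoF -> both G1 arms [0-13]
}

def map_to_g1(actions: np.ndarray, embodiment: str, g1_dim: int = 29) -> np.ndarray:
    """Remap (T, source_dim) actions into (T, 29) G1 space, leaving every
    G1 joint the source embodiment doesn't cover zero-padded."""
    out = np.zeros((actions.shape[0], g1_dim), dtype=actions.dtype)
    for src_idx, g1_idx in EMBODIMENT_MAPS[embodiment].items():
        out[:, g1_idx] = actions[:, src_idx]
    return out

One dict per embodiment keeps the scheme extensible: supporting a new body is just another table entry, which is why step 3 of "Adding a New Dataset" below is so short.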
Configuration¶
from neon.data.data_soup import DataSoupConfig, DataSourceConfig

config = DataSoupConfig(
    sources=[
        DataSourceConfig(
            name="bridge",
            type="lerobot",
            path="lerobot/bridge_v2",
            weight=1.0,
        ),
        DataSourceConfig(
            name="agibot",
            type="agibot",
            path="lerobot/xvla-agibot-world",
            weight=2.0,  # 2× more agibot — it's our best bimanual data
            max_episodes=10000,
        ),
        DataSourceConfig(
            name="cosmos-synth",
            type="cosmos_dreamgen",
            path="nvidia/GR1-100",
            weight=1.5,
            use_relative_actions=True,
            action_scaler=20.0,
        ),
        DataSourceConfig(
            name="stereo-kitchen",
            type="stereo4d",
            path="cagataydev/strands-kitchen-stereo4d",
            weight=0.5,
            use_stereo_pair=True,
        ),
        DataSourceConfig(
            name="voice-cmds",
            type="voice_commands",
            path="cagataydev/vlm-voice-commands",
            weight=0.3,
        ),
    ],
    chunk_size=16,
    fps=15,
    shuffle=True,
)
The `weight` parameter controls the mixing ratio. Higher weight = more samples from that source during training. Tune this to bias the model toward the skills you care about most.
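In spirit, the mixing is nothing more than drawing each training sample from a source with probability proportional to its weight. A sketch, assuming each entry exposes its `weight` (the real sampler sits inside the data soup loader):

import numpy as np

def sample_source_index(sources, rng=None):
    """Draw a source index with probability proportional to its weight."""
    rng = rng or np.random.default_rng()
    weights = np.array([s.weight for s in sources], dtype=np.float64)
    return int(rng.choice(len(sources), p=weights / weights.sum()))

With the weights in the config above (1.0, 2.0, 1.5, 0.5, 0.3), agibot is drawn roughly 38% of the time (2.0 / 5.3).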
The Unified Episode¶
All source types normalize to a common structure:
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

import numpy as np

@dataclass
class NeonEpisode:
    images: List[np.ndarray]              # (T, H, W, 3) RGB frames
    actions: np.ndarray                   # (T, action_dim) normalized [-1, 1]
    language: str                         # Natural language instruction
    proprioception: Optional[np.ndarray]  # (T, joint_dim)
    audio: Optional[np.ndarray]           # (T, samples) 16 kHz waveform per step
    lidar: Optional[np.ndarray]           # (T, N, 4) point clouds per step
    eef_state: Optional[np.ndarray]       # (T, 14) bimanual EE pos+quat
    metadata: Dict[str, Any]              # Source info, episode flags
All modalities are optional — a dataset with only cameras and joints will have `audio=None`, `lidar=None`, `eef_state=None`. The trainer's collation function handles mixed batches gracefully, zero-padding missing modalities.
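A stripped-down version of that padding logic, assuming fixed-length chunks (the helper name and mask convention are illustrative, not the trainer's actual API):

import numpy as np

def stack_optional(batch, attr, pad_shape, dtype=np.float32):
    """Stack one optional modality across a batch of NeonEpisodes,
    substituting zeros where an episode has None and returning a
    presence mask so the loss can ignore the padded entries."""
    arrays, present = [], []
    for ep in batch:
        value = getattr(ep, attr)
        present.append(value is not None)
        arrays.append(value if value is not None else np.zeros(pad_shape, dtype=dtype))
    return np.stack(arrays), np.array(present)

# e.g. audio, audio_mask = stack_optional(batch, "audio", pad_shape=(16, 16000))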
Episode Flags¶
Two flags tell the training loop what kind of loss to compute:
- `language_only=True` — Voice command episodes. No images or actions. Contributes to language conditioning only.
- `visual_only=True` — Stereo4D / DreamGen. Video without action labels. Contributes to visual understanding only.
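Conceptually the routing is a three-way switch on those flags. A sketch over `NeonEpisode.metadata` (the term names are placeholders for the trainer's actual loss heads):

def active_loss_terms(metadata: dict) -> set:
    """Which loss terms an episode supervises, per its flags (sketch)."""
    if metadata.get("language_only"):
        return {"language"}  # no images or actions to supervise
    if metadata.get("visual_only"):
        return {"vision"}    # video without action labels
    return {"language", "vision", "action"}

assert active_loss_terms({"visual_only": True}) == {"vision"}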
Adding a New Dataset¶
Five steps:
1. Add a `DataSourceConfig` in `neon/data/data_soup.py`
2. Implement a `_load_<type>` method if it's a new source format (a loader skeleton follows below)
3. Add embodiment mapping in `ActionMapper.EMBODIMENT_MAPS`
4. Add to a training config preset in `neon/training/config.py`
5. Test: `pytest tests/test_data.py -v`
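For step 2, a loader only has to return `NeonEpisode` objects; the rest of the pipeline is format-agnostic. A skeleton with placeholder data (the dummy shapes, instruction string, and `cfg` fields echo the examples above; the body is yours to fill in):

import numpy as np

# Assumes the NeonEpisode dataclass from "The Unified Episode" is in scope.
def _load_mysource(self, cfg) -> list:
    """Skeleton for a new source loader. Replace the dummy arrays with
    real decoding of whatever lives at cfg.path."""
    episodes = []
    for _ in range(cfg.max_episodes or 1):
        episodes.append(NeonEpisode(
            images=[np.zeros((224, 224, 3), dtype=np.uint8)],  # (T, H, W, 3)
            actions=np.zeros((1, 29), dtype=np.float32),       # normalized [-1, 1]
            language="pick up the cup",
            proprioception=None,
            audio=None,
            lidar=None,
            eef_state=None,
            metadata={"source": cfg.name},
        ))
    return episodes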
New Sources (v2)¶
MolmoBot — Sim-to-Real Flow Matching¶
From Allen AI's MolmoBot paper (arXiv:2603.16861). Large-scale simulated manipulation data with key insights for flow matching training:
- Joint position actions (not deltas) work better with flow matching heads
- Exo + wrist camera pairs give robust sim-to-real transfer
- Prompt randomization (case, punctuation, synonyms) prevents overfitting
- Auxiliary dataset mixing with weighted sampling improves generalization
DataSourceConfig(
    name="molmobot",
    type="molmobot",
    path="allenai/MolmoBot-Data",  # HuggingFace or local H5 files
    weight=1.0,
    max_episodes=5000,
)
H5 format: `obs/agent/qpos/arm` (T, 7), `actions/joint_pos/arm` (T, 7), `obs/cameras/<cam>/video` (T, H, W, 3).
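To make the prompt-randomization insight concrete: varying case, punctuation, and verb choice is enough to keep the policy from latching onto one canonical instruction string. A sketch (the helper and synonym table are illustrative, not MolmoBot's code):

import random

SYNONYMS = {"pick up": ["grab", "lift", "take"], "place": ["put", "set"]}

def randomize_prompt(prompt: str, rng: random.Random) -> str:
    """Randomly perturb verbs, case, and punctuation in an instruction."""
    for phrase, alts in SYNONYMS.items():
        if phrase in prompt and rng.random() < 0.5:
            prompt = prompt.replace(phrase, rng.choice(alts))
    if rng.random() < 0.3:
        prompt = prompt.lower()
    if rng.random() < 0.3:
        prompt = prompt.rstrip(".") if prompt.endswith(".") else prompt + "."
    return prompt

# randomize_prompt("Pick up the mug.", random.Random(0)) -> e.g. "grab the mug"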
G1 Teleoperation¶
Native Unitree G1 whole-body teleoperation recordings. No cross-embodiment mapping needed — actions map directly to G1 joints.
DataSourceConfig(
    name="g1-teleop",
    type="g1_teleop",
    path="cagataydev/g1-teleop-v1",
    weight=3.0,  # High weight — native embodiment data is gold
)
GR00T Teleoperation¶
GR1 humanoid teleoperation from NVIDIA's data collection pipeline. 44-DoF actions mapped to G1's 29-DoF via joint group alignment.
DataSourceConfig(
    name="groot-teleop",
    type="groot_teleop",
    path="nvidia/GR1-teleop-100k",
    weight=1.5,
)
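Joint group alignment is block-wise index copying rather than the per-joint lookup sketched earlier. A sketch with made-up group boundaries (the real 44-to-29 tables live in the ActionMapper; these slices are assumptions):

import numpy as np

# Hypothetical (source_slice, g1_slice) pairs; real boundaries differ.
GR1_TO_G1_GROUPS = [
    (slice(0, 7), slice(0, 7)),      # left arm  -> left arm
    (slice(7, 14), slice(7, 14)),    # right arm -> right arm
    (slice(14, 26), slice(14, 26)),  # legs      -> legs
]

def align_gr1_to_g1(actions: np.ndarray) -> np.ndarray:
    """Copy matching joint groups from (T, 44) GR1 actions into (T, 29) G1
    space; source joints without a G1 counterpart are dropped, uncovered
    G1 joints stay zero."""
    out = np.zeros((actions.shape[0], 29), dtype=actions.dtype)
    for src, dst in GR1_TO_G1_GROUPS:
        out[:, dst] = actions[:, src]
    return out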
BONES Seed¶
Skeletal-level motion primitives — minimal demonstrations that capture the essence of a movement. Used for bootstrapping training before scaling to full datasets.
Kimodo¶
Procedurally generated kinematic motions from the Kimodo motion generator. Synthetic but physically plausible whole-body trajectories.
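Neither of these two sources ships a config snippet above; wiring them in follows the same pattern as every other source. A sketch (the paths and weights are placeholders, not published dataset IDs):

DataSourceConfig(
    name="bones-seed",
    type="bones_seed",
    path="local/bones_seed",  # placeholder path
    weight=0.5,               # small seed set for bootstrapping
)

DataSourceConfig(
    name="kimodo",
    type="kimodo",
    path="local/kimodo_gen",  # placeholder path
    weight=0.8,
)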
→ Next: Audio In/Out — give the robot ears and a voice