Data Soup¶
One robot body. A thousand different teachers. Thirteen data source types mixed into a single training stream.
The Idea¶
A robot that only learns from its own body learns slowly. A robot that learns from every body learns the universal language of movement.
Following the XVLA/Agibot-World approach, Neon trains on a data soup — a weighted mixture of diverse datasets, each contributing different skills, different environments, different embodiments:
graph LR
subgraph "Source Types"
LR["lerobot"]
AG["agibot"]
DG["dreamgen"]
COS["cosmos_dreamgen"]
OXE["oxe"]
VC["🗣️ voice_commands"]
S4D["stereo4d"]
G1T["🦿 g1_teleop"]
GRT["groot_teleop"]
NV3["neon_v3"]
BON["🦴 bones_seed"]
KIM["🎮 kimodo"]
MOL["🧪 molmobot"]
end
subgraph "Pipeline"
MAP["ActionMapper<br/>Cross-embodiment"]
REL["Relative Actions<br/>Cosmos EE deltas"]
NORM["Normalize to G1"]
MIX["Weighted Mixing"]
end
LR --> MAP
AG --> MAP
DG --> MAP
OXE --> MAP
G1T --> MAP
GRT --> MAP
NV3 --> MAP
BON --> MAP
KIM --> MAP
MOL --> MAP
COS --> REL
MAP --> NORM
REL --> NORM
VC --> MIX
S4D --> MIX
NORM --> MIX
MIX --> DS["NeonEpisode[]"]
style MIX fill:#e65100,color:#fff
Thirteen Sources¶
| Type | What It Is | What It Teaches | Has Actions |
|---|---|---|---|
| `lerobot` | HuggingFace LeRobot (Bridge, DROID) | Tabletop manipulation fundamentals | ✅ |
| `agibot` | Agibot-World bimanual data | Two-armed coordination, 1M+ episodes | ✅ |
| `dreamgen` | GR00T-Dreams synthetic videos | Synthetic demonstrations via IDM extraction | ✅ |
| `cosmos_dreamgen` | Cosmos-Predict2.5 style data | Relative EE actions — cross-embodiment gold | ✅ |
| `oxe` | Open X-Embodiment (multi-robot) | Breadth — many robots, many tasks | ✅ |
| `voice_commands` | 50K natural language instructions | Language diversity across 10 categories | ❌ (text only) |
| `stereo4d` | Stereo kitchen video pairs | Scene understanding, depth from stereo | ❌ (visual only) |
| `g1_teleop` | Unitree G1 teleoperation data | Native G1 whole-body movement | ✅ |
| `groot_teleop` | GR00T teleoperation recordings | GR1-style humanoid manipulation | ✅ |
| `neon_v3` / `neon_native` | Neon's own recorded episodes | Self-collected demonstrations | ✅ |
| `bones_seed` | BONES skeletal seed demonstrations | Skeletal-level motion primitives | ✅ |
| `kimodo` | Kimodo generator output | Kinematic motion from procedural generation | ✅ |
| `molmobot` | MolmoBot-Data simulation trajectories | Sim pick-and-place + flow matching insights | ✅ |
Cross-Embodiment Mapping¶
Different robots have different bodies. The ActionMapper translates them all into G1 space:
graph TD
subgraph "Source Bodies"
FR["Franka (7 DoF)"]
SO["SO-100 (7 DoF)"]
AB["Agibot (14 DoF)"]
GR["GR-1 (14 DoF)"]
G1S["G1 (29 DoF)"]
end
MAP["ActionMapper<br/>Index remapping + zero-pad"]
FR --> MAP
SO --> MAP
AB --> MAP
GR --> MAP
G1S --> MAP
MAP --> G1["G1 Action Space<br/>(normalized, unified)"]
style MAP fill:#e65100,color:#fff
style G1 fill:#1b5e20,color:#fff
Franka's 7 DoF maps to the G1's right arm (indices [7-13]). Agibot's 14 DoF maps directly to both arms. Everything else gets zero-padded. The mapper handles it.
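A minimal sketch of that remapping, assuming per-joint index tables (the real lookup lives in `ActionMapper.EMBODIMENT_MAPS`; the table contents below are illustrative, not the actual values):

import numpy as np

# Illustrative source-joint -> G1-joint index tables. The real tables live in
# ActionMapper.EMBODIMENT_MAPS; the entries here are assumptions for the sketch.
EMBODIMENT_MAPS = {
    "franka": {i: 7 + i for i in range(7)},  # 7 DoF -> G1 right arm [7-13]
    "agibot": {i: i for i in range(14)},     # 14 DoF -> both G1 arms [0-13]
}

def map_to_g1(actions: np.ndarray, embodiment: str, g1_dim: int = 29) -> np.ndarray:
    """Remap (T, source_dim) actions into (T, 29) G1 space, leaving every
    G1 joint the source embodiment doesn't cover zero-padded."""
    out = np.zeros((actions.shape[0], g1_dim), dtype=actions.dtype)
    for src_idx, g1_idx in EMBODIMENT_MAPS[embodiment].items():
        out[:, g1_idx] = actions[:, src_idx]
    return out

One dict per embodiment keeps the scheme extensible: supporting a new body is just another table entry, which is why step 3 of "Adding a New Dataset" below is so short.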
Configuration¶
from neon.data.data_soup import DataSoupConfig, DataSourceConfig

config = DataSoupConfig(
    sources=[
        DataSourceConfig(
            name="bridge",
            type="lerobot",
            path="lerobot/bridge_v2",
            weight=1.0,
        ),
        DataSourceConfig(
            name="agibot",
            type="agibot",
            path="lerobot/xvla-agibot-world",
            weight=2.0,  # 2× more agibot — it's our best bimanual data
            max_episodes=10000,
        ),
        DataSourceConfig(
            name="cosmos-synth",
            type="cosmos_dreamgen",
            path="nvidia/GR1-100",
            weight=1.5,
            use_relative_actions=True,
            action_scaler=20.0,
        ),
        DataSourceConfig(
            name="stereo-kitchen",
            type="stereo4d",
            path="cagataydev/strands-kitchen-stereo4d",
            weight=0.5,
            use_stereo_pair=True,
        ),
        DataSourceConfig(
            name="voice-cmds",
            type="voice_commands",
            path="cagataydev/vlm-voice-commands",
            weight=0.3,
        ),
    ],
    chunk_size=16,
    fps=15,
    shuffle=True,
)
The `weight` parameter controls the mixing ratio. Higher weight = more samples from that source during training. Tune this to bias the model toward the skills you care about most.
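In spirit, the mixing is nothing more than drawing each training sample from a source with probability proportional to its weight. A sketch, assuming each entry exposes its `weight` (the real sampler sits inside the data soup loader):

import numpy as np

def sample_source_index(sources, rng=None):
    """Draw a source index with probability proportional to its weight."""
    rng = rng or np.random.default_rng()
    weights = np.array([s.weight for s in sources], dtype=np.float64)
    return int(rng.choice(len(sources), p=weights / weights.sum()))

With the weights in the config above (1.0, 2.0, 1.5, 0.5, 0.3), agibot is drawn roughly 38% of the time (2.0 / 5.3).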
The Unified Episode¶
All source types normalize to a common structure:
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

import numpy as np

@dataclass
class NeonEpisode:
    images: List[np.ndarray]              # (T, H, W, 3) RGB frames
    actions: np.ndarray                   # (T, action_dim) normalized [-1, 1]
    language: str                         # Natural language instruction
    proprioception: Optional[np.ndarray]  # (T, joint_dim)
    audio: Optional[np.ndarray]           # (T, samples) 16 kHz waveform per step
    lidar: Optional[np.ndarray]           # (T, N, 4) point clouds per step
    eef_state: Optional[np.ndarray]       # (T, 14) bimanual EE pos+quat
    metadata: Dict[str, Any]              # Source info, episode flags
All modalities are optional — a dataset with only cameras and joints will have `audio=None`, `lidar=None`, `eef_state=None`. The trainer's collation function handles mixed batches gracefully, zero-padding missing modalities.
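A stripped-down version of that padding logic, assuming fixed-length chunks (the helper name and mask convention are illustrative, not the trainer's actual API):

import numpy as np

def stack_optional(batch, attr, pad_shape, dtype=np.float32):
    """Stack one optional modality across a batch of NeonEpisodes,
    substituting zeros where an episode has None and returning a
    presence mask so the loss can ignore the padded entries."""
    arrays, present = [], []
    for ep in batch:
        value = getattr(ep, attr)
        present.append(value is not None)
        arrays.append(value if value is not None else np.zeros(pad_shape, dtype=dtype))
    return np.stack(arrays), np.array(present)

# e.g. audio, audio_mask = stack_optional(batch, "audio", pad_shape=(16, 16000))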
Episode Flags¶
Two flags tell the training loop what kind of loss to compute:
- `language_only=True` — Voice command episodes. No images or actions. Contributes to language conditioning only.
- `visual_only=True` — Stereo4D / DreamGen. Video without action labels. Contributes to visual understanding only.
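Conceptually the routing is a three-way switch on those flags. A sketch over `NeonEpisode.metadata` (the term names are placeholders for the trainer's actual loss heads):

def active_loss_terms(metadata: dict) -> set:
    """Which loss terms an episode supervises, per its flags (sketch)."""
    if metadata.get("language_only"):
        return {"language"}  # no images or actions to supervise
    if metadata.get("visual_only"):
        return {"vision"}    # video without action labels
    return {"language", "vision", "action"}

assert active_loss_terms({"visual_only": True}) == {"vision"}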
Adding a New Dataset¶
Five steps:
1. Add a `DataSourceConfig` in `neon/data/data_soup.py`
2. Implement a `_load_<type>` method if it's a new source format (a loader skeleton follows below)
3. Add embodiment mapping in `ActionMapper.EMBODIMENT_MAPS`
4. Add to a training config preset in `neon/training/config.py`
5. Test: `pytest tests/test_data.py -v`
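For step 2, a loader only has to return `NeonEpisode` objects; the rest of the pipeline is format-agnostic. A skeleton with placeholder data (the dummy shapes, instruction string, and `cfg` fields echo the examples above; the body is yours to fill in):

import numpy as np

# Assumes the NeonEpisode dataclass from "The Unified Episode" is in scope.
def _load_mysource(self, cfg) -> list:
    """Skeleton for a new source loader. Replace the dummy arrays with
    real decoding of whatever lives at cfg.path."""
    episodes = []
    for _ in range(cfg.max_episodes or 1):
        episodes.append(NeonEpisode(
            images=[np.zeros((224, 224, 3), dtype=np.uint8)],  # (T, H, W, 3)
            actions=np.zeros((1, 29), dtype=np.float32),       # normalized [-1, 1]
            language="pick up the cup",
            proprioception=None,
            audio=None,
            lidar=None,
            eef_state=None,
            metadata={"source": cfg.name},
        ))
    return episodes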
New Sources (v2)¶
MolmoBot — Sim-to-Real Flow Matching¶
From Allen AI's MolmoBot paper (arXiv:2603.16861). Large-scale simulated manipulation data with key insights for flow matching training:
- Joint position actions (not deltas) work better with flow matching heads
- Exo + wrist camera pairs give robust sim-to-real transfer
- Prompt randomization (case, punctuation, synonyms) prevents overfitting
- Auxiliary dataset mixing with weighted sampling improves generalization
DataSourceConfig(
    name="molmobot",
    type="molmobot",
    path="allenai/MolmoBot-Data",  # HuggingFace or local H5 files
    weight=1.0,
    max_episodes=5000,
)
H5 format: `obs/agent/qpos/arm` (T, 7), `actions/joint_pos/arm` (T, 7), `obs/cameras/<cam>/video` (T, H, W, 3).
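To make the prompt-randomization insight concrete: varying case, punctuation, and verb choice is enough to keep the policy from latching onto one canonical instruction string. A sketch (the helper and synonym table are illustrative, not MolmoBot's code):

import random

SYNONYMS = {"pick up": ["grab", "lift", "take"], "place": ["put", "set"]}

def randomize_prompt(prompt: str, rng: random.Random) -> str:
    """Randomly perturb verbs, case, and punctuation in an instruction."""
    for phrase, alts in SYNONYMS.items():
        if phrase in prompt and rng.random() < 0.5:
            prompt = prompt.replace(phrase, rng.choice(alts))
    if rng.random() < 0.3:
        prompt = prompt.lower()
    if rng.random() < 0.3:
        prompt = prompt.rstrip(".") if prompt.endswith(".") else prompt + "."
    return prompt

# randomize_prompt("Pick up the mug.", random.Random(0)) -> e.g. "grab the mug"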
G1 Teleoperation¶
Native Unitree G1 whole-body teleoperation recordings. No cross-embodiment mapping needed — actions map directly to G1 joints.
DataSourceConfig(
    name="g1-teleop",
    type="g1_teleop",
    path="cagataydev/g1-teleop-v1",
    weight=3.0,  # High weight — native embodiment data is gold
)
GR00T Teleoperation¶
GR1 humanoid teleoperation from NVIDIA's data collection pipeline. 44-DoF actions mapped to G1's 29-DoF via joint group alignment.
DataSourceConfig(
    name="groot-teleop",
    type="groot_teleop",
    path="nvidia/GR1-teleop-100k",
    weight=1.5,
)
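Joint group alignment is block-wise index copying rather than the per-joint lookup sketched earlier. A sketch with made-up group boundaries (the real 44-to-29 tables live in the ActionMapper; these slices are assumptions):

import numpy as np

# Hypothetical (source_slice, g1_slice) pairs; real boundaries differ.
GR1_TO_G1_GROUPS = [
    (slice(0, 7), slice(0, 7)),      # left arm  -> left arm
    (slice(7, 14), slice(7, 14)),    # right arm -> right arm
    (slice(14, 26), slice(14, 26)),  # legs      -> legs
]

def align_gr1_to_g1(actions: np.ndarray) -> np.ndarray:
    """Copy matching joint groups from (T, 44) GR1 actions into (T, 29) G1
    space; source joints without a G1 counterpart are dropped, uncovered
    G1 joints stay zero."""
    out = np.zeros((actions.shape[0], 29), dtype=actions.dtype)
    for src, dst in GR1_TO_G1_GROUPS:
        out[:, dst] = actions[:, src]
    return out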
BONES Seed¶
Skeletal-level motion primitives — minimal demonstrations that capture the essence of a movement. Used for bootstrapping training before scaling to full datasets.
Kimodo¶
Procedurally generated kinematic motions from the Kimodo motion generator. Synthetic but physically plausible whole-body trajectories.
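Neither of these two sources ships a config snippet above; wiring them in follows the same pattern as every other source. A sketch (the paths and weights are placeholders, not published dataset IDs):

DataSourceConfig(
    name="bones-seed",
    type="bones_seed",
    path="local/bones_seed",  # placeholder path
    weight=0.5,               # small seed set for bootstrapping
)

DataSourceConfig(
    name="kimodo",
    type="kimodo",
    path="local/kimodo_gen",  # placeholder path
    weight=0.8,
)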
→ Next: Audio In/Out — give the robot ears and a voice