# Synthetic Data Factory — Design Document
Goal: Generate sim datasets indistinguishable from real G1 teleop data, at scale.
## The Core Insight
Real teleop data (the S3 pick-basket dataset) has these properties:

- 43-DOF joint states + 14-DOF EEF at 20 Hz
- Ego-view 640×480 video at 20 fps
- Natural task instructions ("pick up the basket")
- Locomotion commands (vx, vy, omega)
- Physics-consistent grasping, contact, collisions
Your idea: if the sim environment looks real AND the physics are real, the data IS real.
```
Marble (3D room) + Objaverse/NVIDIA 3D objects + Newton physics
    + Cosmos Transfer + LiDAR sim + Audio synth
                    ↓
LeRobot v3 parquet + video — SAME SCHEMA as S3 teleop dataset
                    ↓
Neon VLA training — model can't tell sim from real
```
## Pipeline Architecture
```
┌──────────────────────────────────────────────────────────────┐
│ STAGE 1: World Generation │
│ │
│ Marble API ──→ 3D room (PLY/GLB/SPZ) │
│ ↓ │
│ 3DGrut: PLY → USDZ ──→ room mesh │
│ + │
│ HuggingFace 3D objects (Objaverse, NVIDIA kitchen assets) │
│ ↓ │
│ Scene Composer: room + table + objects + G1 → scene.xml │
└──────────────────────┬───────────────────────────────────────┘
│
┌──────────────────────▼───────────────────────────────────────┐
│ STAGE 2: Physics Simulation (Newton GPU) │
│ │
│ Newton ModelBuilder → replicate × 4096 parallel envs │
│ ↓ │
│ G1 with SONIC RL policy (locomotion) + scripted arm tasks │
│ ↓ │
│ Per-step collection: │
│ ✅ joint_state (43-DOF) — from newton state │
│ ✅ action (43-DOF + 3 loco) — from policy + scripted │
│ ✅ eef_state (14-DOF) — from FK on wrist sites │
│ ✅ ego_view RGB (640×480) — SensorTiledCamera │
│ ✅ lidar (4096×4) — raycast from torso body │
│ ✅ audio (32000 samples) — synthesized from task text │
│ ✅ language_instruction — from task template │
└──────────────────────┬───────────────────────────────────────┘
│
┌──────────────────────▼───────────────────────────────────────┐
│ STAGE 3: Visual Augmentation │
│ │
│ Cosmos Transfer 2.5 (depth + edge control): │
│ sim RGB video → photorealistic video │
│ Robot pose preserved (ControlNet conditioning) │
│ Background looks like real room from Marble prompt │
│ │
│ Domain randomization per-world: │
│ - Lighting (direction, color, intensity) │
│ - Object textures (random materials) │
│ - Camera noise (gaussian, motion blur) │
│ - Physics params (friction ±20%, mass ±30%) │
└──────────────────────┬───────────────────────────────────────┘
│
┌──────────────────────▼───────────────────────────────────────┐
│ STAGE 4: Export to LeRobot v3 │
│ │
│ NeonLeRobotWriter outputs: │
│ data/chunk-000/episode_XXXXXX.parquet │
│ videos/chunk-000/observation.image.camera_head/ep.mp4 │
│ meta/info.json + episodes.jsonl + tasks.jsonl │
│ │
│ SAME SCHEMA as real teleop dataset ← this is the key │
└──────────────────────┬───────────────────────────────────────┘
│
┌──────────────────────▼───────────────────────────────────────┐
│ STAGE 5: Neon VLA Training │
│ │
│ DataSoup mixes: │
│ - synthetic_factory (weight=2.0) — this pipeline │
│ - g1_teleop (weight=3.0) — real S3 data │
│ - groot_teleop (weight=1.5) — NVIDIA real G1 data │
│ - voice_commands (weight=0.3) — language conditioning │
│ │
│ → NeonVLA with all 6 modalities active │
│ → Backbone frozen, ~6-44M trainable heads │
└─────────────────────────────────────────────────────────────┘
```
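A minimal sketch of how `pipeline.py` could chain Stages 1-4 (all class and method names here are illustrative placeholders, not the final `neon/synth` API):

```python
# Orchestration skeleton for neon/synth/pipeline.py (illustrative names only).
from typing import List


class NeonSynthPipeline:
    """Chains world generation -> Newton rollout -> Cosmos -> LeRobot export."""

    def __init__(self, config) -> None:
        self.config = config

    def generate_worlds(self) -> List[str]:
        """Stage 1: Marble rooms + object placement -> scene.xml paths."""
        raise NotImplementedError  # port from strands_robots/marble

    def collect_episodes(self, scene_xml: str) -> list:
        """Stage 2: parallel Newton envs -> per-step modality dicts per episode."""
        raise NotImplementedError  # port from scripts/newton_groot/pipeline.py

    def augment_visuals(self, episodes: list) -> list:
        """Stage 3: Cosmos Transfer (depth + edge control) on ego-view frames."""
        raise NotImplementedError  # port from strands_robots/cosmos_transfer

    def export_lerobot(self, episodes: list) -> None:
        """Stage 4: NeonLeRobotWriter -> parquet + mp4 + meta, teleop schema."""
        raise NotImplementedError

    def run(self) -> None:
        # Stage 5 (training) happens downstream via DataSoup.
        for scene_xml in self.generate_worlds():
            episodes = self.collect_episodes(scene_xml)
            episodes = self.augment_visuals(episodes)
            self.export_lerobot(episodes)
```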
## Why This Works (the physics argument)
The gap between sim and real teleop is:
| Property | Real Teleop | Naive Sim | This Pipeline |
|---|---|---|---|
| Visual realism | ✅ Real camera | ❌ Flat shading | ✅ Cosmos Transfer |
| Room geometry | ✅ Real room | ❌ Flat plane | ✅ Marble 3D scan |
| Object geometry | ✅ Real objects | ⚠️ Primitives | ✅ 3D mesh datasets |
| Physics | ✅ Real physics | ✅ MuJoCo/Newton | ✅ Same solver |
| Contact dynamics | ✅ Real friction | ✅ Tuned params | ✅ SONIC PD gains |
| Joint states | ✅ Encoder readback | ✅ Perfect state | ✅ Same 43-DOF space |
| EEF tracking | ✅ FK from encoders | ✅ FK from sim | ✅ Same representation |
| LiDAR | ✅ Unitree L1 | ❌ Missing | ✅ Raycast sim |
| Audio | ✅ Real microphone | ❌ Missing | ⚠️ Synthesized |
| Task diversity | ❌ Limited by human | ❌ Scripted only | ✅ Procedural + Marble |
The only remaining gap is audio — synthetic speech patterns ≠ real speech. But the audio encoder is Whisper (frozen), and the language instruction text IS identical. The audio channel is a nice-to-have, not critical.
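If the channel is filled at all, a minimal generator could look like the sketch below, assuming 16 kHz mono (so 32,000 samples is 2 s, matching Whisper's input rate) and a placeholder `tts` callable; both are assumptions, not part of the current pipeline:

```python
import numpy as np

SAMPLE_RATE = 16_000    # assumed: 32000 samples = 2 s at Whisper's 16 kHz
CLIP_SAMPLES = 32_000


def synthesize_instruction_audio(text: str, tts=None) -> np.ndarray:
    """Return a fixed-length waveform for a task instruction.

    `tts` is a placeholder callable (text -> float32 waveform at 16 kHz);
    if none is provided we fall back to silence, which still exercises the
    audio head's input path during training.
    """
    wave = tts(text) if tts is not None else np.zeros(0, dtype=np.float32)
    # Pad with silence or trim so every episode has exactly 32000 samples.
    if wave.shape[0] < CLIP_SAMPLES:
        wave = np.pad(wave, (0, CLIP_SAMPLES - wave.shape[0]))
    return wave[:CLIP_SAMPLES].astype(np.float32)
```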
## Scale Math
| Parameter | Value | Notes |
|---|---|---|
| Newton parallel envs | 4096 | On single A100/H100 |
| Steps per episode | 400 (8s at 50Hz) | Matches real teleop episode length |
| Render FPS | 20 | Matches real dataset |
| Worlds per batch | 4096 | All parallel |
| Batches for 1M episodes | 245 | 1M / 4096 ≈ 245 batches |
| Time per batch (physics) | ~2 min | Newton GPU is fast |
| Time per batch (render) | ~5 min | TiledCamera is bottleneck |
| Total 1M episodes | ~28 hours (physics + render) | On single GPU |
| Cosmos Transfer per video | ~30s on A100 | 25 frames per episode |
| Cosmos 1M episodes | ~347 GPU-days | This is the bottleneck |
| Cosmos 10K episodes | ~3.5 GPU-days | Realistic first target |
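The headline numbers in the table reduce to a few lines of arithmetic:

```python
# Back-of-envelope check of the table above.
EPISODES = 1_000_000
NUM_ENVS = 4096

batches = -(-EPISODES // NUM_ENVS)           # ceil(1e6 / 4096) = 245
sim_hours = batches * (2 + 5) / 60           # physics + render: ~28.6 h
cosmos_gpu_days = EPISODES * 30 / 86_400     # 30 s/video -> ~347 GPU-days
cosmos_10k_days = 10_000 * 30 / 86_400       # ~3.5 GPU-days

print(batches, round(sim_hours, 1), round(cosmos_gpu_days), round(cosmos_10k_days, 1))
# -> 245 28.6 347 3.5
```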
Practical first target: 10K synthetic + 526 real teleop + 1244 NVIDIA GR00T teleop = ~12K episodes.
## Implementation Plan (files to create/modify in neon/)
### New Files
```
neon/
├── synth/                    # NEW: Synthetic data factory
│   ├── __init__.py
│   ├── pipeline.py           # NeonSynthPipeline — orchestrator
│   ├── world_generator.py    # Marble integration (port from strands-gtc-nvidia)
│   ├── object_placer.py      # 3D object placement on generated scenes
│   ├── task_generator.py     # Procedural task + instruction generation
│   ├── newton_collector.py   # Newton GPU parallel data collection
│   ├── cosmos_augmentor.py   # Cosmos Transfer 2.5 visual augmentation
│   └── config.py             # SynthConfig with all pipeline params
```
### Modified Files
```
neon/data/data_soup.py              # Add type="synthetic_factory" source
neon/data/lerobot_v3.py             # Already handles all modalities ✅
neon/sim/newton/newton_backend.py   # Already has parallel envs ✅
neon/sim/env.py                     # Already has raycast LiDAR ✅
neon/training/config.py             # Add synth_omnimodal_config() preset
```
## Key Code to Port from strands-gtc-nvidia
| Source (strands-gtc-nvidia) | Destination (neon) | What |
|---|---|---|
| `strands_robots/marble/__init__.py` | `neon/synth/world_generator.py` | MarblePipeline, MarbleConfig, scene presets |
| `strands_robots/cosmos_transfer/__init__.py` | `neon/synth/cosmos_augmentor.py` | CosmosTransferPipeline, depth/edge/seg generation |
| `scripts/newton_groot/pipeline.py` | `neon/synth/newton_collector.py` | Newton parallel sim → LeRobot data collection loop |
| `scripts/newton_groot/thor_e2e_pipeline.py` | `neon/synth/pipeline.py` | Full E2E orchestration pattern |
## Task Generation Strategy
The GTC pipeline only does "walk_around_room". For Neon manipulation, we need:
```python
TASK_TEMPLATES = {
    "pick_place": [
        "pick up the {object} and place it on the {target}",
        "move the {object} to the {location}",
        "grab the {object} from the table",
    ],
    "push": [
        "push the {object} to the {direction}",
        "slide the {object} towards {target}",
    ],
    "pour": [
        "pour from the {container} into the {target}",
    ],
    "navigate": [
        "walk to the {location}",
        "move to the {object}",
        "go to the {room_area}",
    ],
    "combined": [
        "walk to the table, pick up the {object}, and bring it to {target}",
    ],
}
```
Each task template generates:

1. Language instruction (for the VLA)
2. Scripted arm trajectory (IK-based reach → grasp → transport)
3. Locomotion commands (navigate to workspace)
4. Success condition (object at target ± tolerance)
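A sketch of binding one template to a concrete scene (the `TaskSpec` fields and `sample_task` helper are illustrative, reusing `TASK_TEMPLATES` from above):

```python
import random
from dataclasses import dataclass


@dataclass
class TaskSpec:
    instruction: str   # fed to the VLA as language_instruction
    category: str      # selects the scripted policy (pick_place, push, ...)
    obj: str           # scene object the arms interact with
    target: str        # placement target for the success check


def sample_task(scene_objects, scene_surfaces, rng=random):
    """Draw a random template and bind it to objects present in the scene."""
    category = rng.choice(sorted(TASK_TEMPLATES))
    template = rng.choice(TASK_TEMPLATES[category])
    # Bind every slot that any template might reference; unused keys are
    # simply ignored by str.format.
    slots = {
        "object": rng.choice(scene_objects),
        "target": rng.choice(scene_surfaces),
        "location": rng.choice(scene_surfaces),
        "direction": rng.choice(["left", "right"]),
        "room_area": rng.choice(scene_surfaces),
        "container": rng.choice(scene_objects),
    }
    instruction = template.format(**slots)
    return TaskSpec(instruction, category, slots["object"], slots["target"])
```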
## Scripted Manipulation Policies
For data collection, we DON'T need a trained manipulation policy. We use:
- SONIC RL policy for locomotion (already trained, from Newton assets)
- IK-based scripted policy for arms:
  - Compute target EEF position from object position
  - Newton `solve_ik()` → joint trajectory
  - Smooth interpolation with startup ramp
  - Add Gaussian noise for diversity (±2° per joint)
This is the same pattern NVIDIA uses in GR00T-Dreams: scripted expert → sim recording → train VLA on recordings.
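A sketch of the interpolation-plus-noise step, with the IK solve left as a placeholder since the exact Newton entry point is not pinned down here:

```python
import numpy as np

ARM_DOF = 14                        # assumed: 7 per arm on the G1
NOISE_STD_RAD = np.deg2rad(2.0)     # ±2° jitter for diversity


def scripted_reach(q_start, q_goal, steps=100, rng=None):
    """Interpolate joints from q_start to q_goal with a smooth startup ramp.

    q_goal would come from an IK solve on the target EEF pose, e.g.
    q_goal = solve_ik(wrist_site, target_pos)  # placeholder for Newton IK
    """
    rng = rng or np.random.default_rng()
    # Smoothstep time profile: zero velocity at both ends, gentle startup.
    t = np.linspace(0.0, 1.0, steps)
    s = 3 * t**2 - 2 * t**3
    traj = q_start[None, :] + s[:, None] * (q_goal - q_start)[None, :]
    # One Gaussian offset per episode so 4096 parallel envs don't all
    # execute the identical trajectory.
    traj += rng.normal(0.0, NOISE_STD_RAD, size=(1, traj.shape[1]))
    return traj.astype(np.float32)


# Example: q_traj = scripted_reach(np.zeros(ARM_DOF), np.full(ARM_DOF, 0.3))
```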
## Output Schema Compatibility
The synthetic data MUST match the real teleop schema exactly:
```
Real teleop (S3):                Synthetic (this pipeline):
─────────────────                ──────────────────────────
observation.state: (43,)      ←→ observation.state: (43,)        ✅ same
observation.eef_state: (14,)  ←→ observation.eef_state: (14,)    ✅ same
action: (43,)                 ←→ action: (43,)                   ✅ same
teleop.navigate_command: (3,) ←→ action[-3:] locomotion          ✅ mapped
observation.images.ego_view  ←→ observation.image.camera_head    ✅ Cosmos makes it real
                             ←→ observation.lidar: (4096, 4)     🆕 synthetic bonus
                             ←→ observation.audio: (32000,)      🆕 synthetic bonus
language_instruction         ←→ language_instruction             ✅ same
```
The synthetic data has MORE modalities than real data. This is fine — DataSoup handles missing modalities gracefully (zero-fills). But when a real G1 has L1 LiDAR + microphone, the model already knows what to do with them.
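A sketch of the zero-fill behavior this relies on (the `pad_missing_modalities` hook is illustrative; the shapes match the schema above):

```python
import numpy as np

# Expected shapes per frame; zero-filled when a source lacks the modality.
MODALITY_SHAPES = {
    "observation.lidar": (4096, 4),
    "observation.audio": (32000,),
}


def pad_missing_modalities(sample: dict) -> dict:
    """Zero-fill modalities a real-teleop sample doesn't carry."""
    for key, shape in MODALITY_SHAPES.items():
        if key not in sample:
            sample[key] = np.zeros(shape, dtype=np.float32)
    return sample
```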
## First Milestone: 10K Kitchen Episodes
```python
# neon/synth/config.py
from dataclasses import dataclass, field
from typing import List


@dataclass
class SynthConfig:
    # World generation
    marble_presets: List[str] = field(
        default_factory=lambda: ["kitchen", "office_desk", "workshop", "living_room"]
    )
    worlds_per_preset: int = 25       # 100 unique rooms
    objects_per_scene: int = 5

    # Newton simulation
    num_envs: int = 256               # Parallel on A100
    episodes_per_world: int = 100     # 100 tasks per room
    episode_length: int = 400         # 8 seconds at 50Hz
    sim_hz: int = 200
    control_hz: int = 50
    render_hz: int = 20

    # Cosmos Transfer
    cosmos_enabled: bool = True
    cosmos_control_types: List[str] = field(default_factory=lambda: ["depth", "edge"])
    cosmos_guidance: float = 3.0

    # Output
    output_repo_id: str = "cagataydev/neon-g1-synth-kitchen-10k"
    fps: int = 20
    push_to_hub: bool = True

# Total: 100 rooms × 100 episodes = 10,000 episodes
# At 8s × 20fps = 160 frames each → 1.6M total frames
# Joint data: 10K × 400 steps × 46 floats × 4 bytes ≈ 736 MB raw (less after parquet compression)
# Video: 10K × 8s ≈ 22 hours of 640×480 video
```
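Generation for this milestone would then kick off with something like the following (hypothetical entry point, mirroring the orchestrator sketch earlier):

```python
from neon.synth.config import SynthConfig
from neon.synth.pipeline import NeonSynthPipeline  # hypothetical; see sketch above

pipeline = NeonSynthPipeline(SynthConfig())
pipeline.run()   # Stages 1-4: worlds -> Newton rollouts -> Cosmos -> LeRobot v3
```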
## What Makes This "One of a Kind"
No existing VLA has ALL of these simultaneously:
- 6 modalities (vision, language, joints, EEF, LiDAR, audio)
- Photorealistic synthetic environments (Marble + Cosmos Transfer)
- Physics-accurate training data (Newton GPU with SONIC PD gains)
- Cross-embodiment grounding (relative actions + ActionMapper)
- Real + synthetic data mixing (DataSoup weighted sampling)
- Scalable generation (4096 parallel Newton envs)
GR00T-Dreams covers Stages 2-3 of this pipeline but without LiDAR, audio, or Marble world diversity. Cosmos-Predict covers Stage 3 but without physics-grounded joint states. Neon's synth pipeline does all of them, outputting the exact same LeRobot v3 format as real teleop.