# Synthetic Data Factory — Design Document
Goal: Generate sim datasets indistinguishable from real G1 teleop data, at scale.
## The Core Insight
Real teleop data (the S3 pick-basket dataset) has these properties:

- 43-DOF joint states + 14-DOF EEF at 20 Hz
- Ego-view 640×480 video at 20 fps
- Natural task instructions ("pick up the basket")
- Locomotion commands (vx, vy, omega)
- Physics-consistent grasping, contact, collisions
Your idea: if the sim environment looks real AND the physics are real, the data IS real.
```
Marble (3D room) + Objaverse/NVIDIA 3D objects + Newton physics
    + Cosmos Transfer + LiDAR sim + Audio synth
                    ↓
LeRobot v3 parquet + video — SAME SCHEMA as S3 teleop dataset
                    ↓
Neon VLA training — model can't tell sim from real
```
## Pipeline Architecture
```
┌──────────────────────────────────────────────────────────────┐
│ STAGE 1: World Generation │
│ │
│ Marble API ──→ 3D room (PLY/GLB/SPZ) │
│ ↓ │
│ 3DGrut: PLY → USDZ ──→ room mesh │
│ + │
│ HuggingFace 3D objects (Objaverse, NVIDIA kitchen assets) │
│ ↓ │
│ Scene Composer: room + table + objects + G1 → scene.xml │
└──────────────────────┬───────────────────────────────────────┘
│
┌──────────────────────▼───────────────────────────────────────┐
│ STAGE 2: Physics Simulation (Newton GPU) │
│ │
│ Newton ModelBuilder → replicate × 4096 parallel envs │
│ ↓ │
│ G1 with SONIC RL policy (locomotion) + scripted arm tasks │
│ ↓ │
│ Per-step collection: │
│ ✅ joint_state (43-DOF) — from newton state │
│ ✅ action (43-DOF + 3 loco) — from policy + scripted │
│ ✅ eef_state (14-DOF) — from FK on wrist sites │
│ ✅ ego_view RGB (640×480) — SensorTiledCamera │
│ ✅ lidar (4096×4) — raycast from torso body │
│ ✅ audio (32000 samples) — synthesized from task text │
│ ✅ language_instruction — from task template │
└──────────────────────┬───────────────────────────────────────┘
│
┌──────────────────────▼───────────────────────────────────────┐
│ STAGE 3: Visual Augmentation │
│ │
│ Cosmos Transfer 2.5 (depth + edge control): │
│ sim RGB video → photorealistic video │
│ Robot pose preserved (ControlNet conditioning) │
│ Background looks like real room from Marble prompt │
│ │
│ Domain randomization per-world: │
│ - Lighting (direction, color, intensity) │
│ - Object textures (random materials) │
│ - Camera noise (gaussian, motion blur) │
│ - Physics params (friction ±20%, mass ±30%) │
└──────────────────────┬───────────────────────────────────────┘
│
┌──────────────────────▼───────────────────────────────────────┐
│ STAGE 4: Export to LeRobot v3 │
│ │
│ NeonLeRobotWriter outputs: │
│ data/chunk-000/episode_XXXXXX.parquet │
│ videos/chunk-000/observation.image.camera_head/ep.mp4 │
│ meta/info.json + episodes.jsonl + tasks.jsonl │
│ │
│ SAME SCHEMA as real teleop dataset ← this is the key │
└──────────────────────┬───────────────────────────────────────┘
│
┌──────────────────────▼───────────────────────────────────────┐
│ STAGE 5: Neon VLA Training │
│ │
│ DataSoup mixes: │
│ - synthetic_factory (weight=2.0) — this pipeline │
│ - g1_teleop (weight=3.0) — real S3 data │
│ - groot_teleop (weight=1.5) — NVIDIA real G1 data │
│ - voice_commands (weight=0.3) — language conditioning │
│ │
│ → NeonVLA with all 6 modalities active │
│ → Backbone frozen, ~6-44M trainable heads │
└─────────────────────────────────────────────────────────────┘
```
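A minimal sketch of how `pipeline.py` could chain Stages 1-4 (all class and method names here are illustrative placeholders, not the final `neon/synth` API):

```python
# Orchestration skeleton for neon/synth/pipeline.py (illustrative names only).
from typing import List


class NeonSynthPipeline:
    """Chains world generation -> Newton rollout -> Cosmos -> LeRobot export."""

    def __init__(self, config) -> None:
        self.config = config

    def generate_worlds(self) -> List[str]:
        """Stage 1: Marble rooms + object placement -> scene.xml paths."""
        raise NotImplementedError  # port from strands_robots/marble

    def collect_episodes(self, scene_xml: str) -> list:
        """Stage 2: parallel Newton envs -> per-step modality dicts per episode."""
        raise NotImplementedError  # port from scripts/newton_groot/pipeline.py

    def augment_visuals(self, episodes: list) -> list:
        """Stage 3: Cosmos Transfer (depth + edge control) on ego-view frames."""
        raise NotImplementedError  # port from strands_robots/cosmos_transfer

    def export_lerobot(self, episodes: list) -> None:
        """Stage 4: NeonLeRobotWriter -> parquet + mp4 + meta, teleop schema."""
        raise NotImplementedError

    def run(self) -> None:
        # Stage 5 (training) happens downstream via DataSoup.
        for scene_xml in self.generate_worlds():
            episodes = self.collect_episodes(scene_xml)
            episodes = self.augment_visuals(episodes)
            self.export_lerobot(episodes)
```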
## Why This Works (the physics argument)
The gap between sim and real teleop is:
| Property | Real Teleop | Naive Sim | This Pipeline |
|---|---|---|---|
| Visual realism | ✅ Real camera | ❌ Flat shading | ✅ Cosmos Transfer |
| Room geometry | ✅ Real room | ❌ Flat plane | ✅ Marble 3D scan |
| Object geometry | ✅ Real objects | ⚠️ Primitives | ✅ 3D mesh datasets |
| Physics | ✅ Real physics | ✅ MuJoCo/Newton | ✅ Same solver |
| Contact dynamics | ✅ Real friction | ✅ Tuned params | ✅ SONIC PD gains |
| Joint states | ✅ Encoder readback | ✅ Perfect state | ✅ Same 43-DOF space |
| EEF tracking | ✅ FK from encoders | ✅ FK from sim | ✅ Same representation |
| LiDAR | ✅ Unitree L1 | ❌ Missing | ✅ Raycast sim |
| Audio | ✅ Real microphone | ❌ Missing | ⚠️ Synthesized |
| Task diversity | ❌ Limited by human | ❌ Scripted only | ✅ Procedural + Marble |
The only remaining gap is audio — synthetic speech patterns ≠ real speech. But the audio encoder is Whisper (frozen), and the language instruction text IS identical. The audio channel is a nice-to-have, not critical.
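If the channel is filled at all, a minimal generator could look like the sketch below, assuming 16 kHz mono (so 32,000 samples is 2 s, matching Whisper's input rate) and a placeholder `tts` callable; both are assumptions, not part of the current pipeline:

```python
import numpy as np

SAMPLE_RATE = 16_000    # assumed: 32000 samples = 2 s at Whisper's 16 kHz
CLIP_SAMPLES = 32_000


def synthesize_instruction_audio(text: str, tts=None) -> np.ndarray:
    """Return a fixed-length waveform for a task instruction.

    `tts` is a placeholder callable (text -> float32 waveform at 16 kHz);
    if none is provided we fall back to silence, which still exercises the
    audio head's input path during training.
    """
    wave = tts(text) if tts is not None else np.zeros(0, dtype=np.float32)
    # Pad with silence or trim so every episode has exactly 32000 samples.
    if wave.shape[0] < CLIP_SAMPLES:
        wave = np.pad(wave, (0, CLIP_SAMPLES - wave.shape[0]))
    return wave[:CLIP_SAMPLES].astype(np.float32)
```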
## Scale Math
| Parameter | Value | Notes |
|---|---|---|
| Newton parallel envs | 4096 | On single A100/H100 |
| Steps per episode | 400 (8s at 50Hz) | Matches real teleop episode length |
| Render FPS | 20 | Matches real dataset |
| Worlds per batch | 4096 | All parallel |
| Batches for 1M episodes | 245 | 1M / 4096 ≈ 245 batches |
| Time per batch (physics) | ~2 min | Newton GPU is fast |
| Time per batch (render) | ~5 min | TiledCamera is bottleneck |
| Total 1M episodes | ~28 hours (physics + render) | On single GPU |
| Cosmos Transfer per video | ~30s on A100 | 25 frames per episode |
| Cosmos 1M episodes | ~347 GPU-days | This is the bottleneck |
| Cosmos 10K episodes | ~3.5 GPU-days | Realistic first target |
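The headline numbers in the table reduce to a few lines of arithmetic:

```python
# Back-of-envelope check of the table above.
EPISODES = 1_000_000
NUM_ENVS = 4096

batches = -(-EPISODES // NUM_ENVS)           # ceil(1e6 / 4096) = 245
sim_hours = batches * (2 + 5) / 60           # physics + render: ~28.6 h
cosmos_gpu_days = EPISODES * 30 / 86_400     # 30 s/video -> ~347 GPU-days
cosmos_10k_days = 10_000 * 30 / 86_400       # ~3.5 GPU-days

print(batches, round(sim_hours, 1), round(cosmos_gpu_days), round(cosmos_10k_days, 1))
# -> 245 28.6 347 3.5
```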
Practical first target: 10K synthetic + 526 real teleop + 1244 NVIDIA GR00T teleop = ~12K episodes.
## Implementation Plan (files to create/modify in neon/)
### New Files
```
neon/
├── synth/                    # NEW: Synthetic data factory
│   ├── __init__.py
│   ├── pipeline.py           # NeonSynthPipeline — orchestrator
│   ├── world_generator.py    # Marble integration (port from strands-gtc-nvidia)
│   ├── object_placer.py      # 3D object placement on generated scenes
│   ├── task_generator.py     # Procedural task + instruction generation
│   ├── newton_collector.py   # Newton GPU parallel data collection
│   ├── cosmos_augmentor.py   # Cosmos Transfer 2.5 visual augmentation
│   └── config.py             # SynthConfig with all pipeline params
```
### Modified Files
```
neon/data/data_soup.py              # Add type="synthetic_factory" source
neon/data/lerobot_v3.py             # Already handles all modalities ✅
neon/sim/newton/newton_backend.py   # Already has parallel envs ✅
neon/sim/env.py                     # Already has raycast LiDAR ✅
neon/training/config.py             # Add synth_omnimodal_config() preset
```
## Key Code to Port from strands-gtc-nvidia
| Source (strands-gtc-nvidia) | Destination (neon) | What |
|---|---|---|
| `strands_robots/marble/__init__.py` | `neon/synth/world_generator.py` | MarblePipeline, MarbleConfig, scene presets |
| `strands_robots/cosmos_transfer/__init__.py` | `neon/synth/cosmos_augmentor.py` | CosmosTransferPipeline, depth/edge/seg generation |
| `scripts/newton_groot/pipeline.py` | `neon/synth/newton_collector.py` | Newton parallel sim → LeRobot data collection loop |
| `scripts/newton_groot/thor_e2e_pipeline.py` | `neon/synth/pipeline.py` | Full E2E orchestration pattern |
## Task Generation Strategy
The GTC pipeline only does "walk_around_room". For Neon manipulation, we need:
```python
TASK_TEMPLATES = {
    "pick_place": [
        "pick up the {object} and place it on the {target}",
        "move the {object} to the {location}",
        "grab the {object} from the table",
    ],
    "push": [
        "push the {object} to the {direction}",
        "slide the {object} towards {target}",
    ],
    "pour": [
        "pour from the {container} into the {target}",
    ],
    "navigate": [
        "walk to the {location}",
        "move to the {object}",
        "go to the {room_area}",
    ],
    "combined": [
        "walk to the table, pick up the {object}, and bring it to {target}",
    ],
}
```
Each task template generates:

1. Language instruction (for the VLA)
2. Scripted arm trajectory (IK-based reach → grasp → transport)
3. Locomotion commands (navigate to workspace)
4. Success condition (object at target ± tolerance)
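A sketch of binding one template to a concrete scene (the `TaskSpec` fields and `sample_task` helper are illustrative, reusing `TASK_TEMPLATES` from above):

```python
import random
from dataclasses import dataclass


@dataclass
class TaskSpec:
    instruction: str   # fed to the VLA as language_instruction
    category: str      # selects the scripted policy (pick_place, push, ...)
    obj: str           # scene object the arms interact with
    target: str        # placement target for the success check


def sample_task(scene_objects, scene_surfaces, rng=random):
    """Draw a random template and bind it to objects present in the scene."""
    category = rng.choice(sorted(TASK_TEMPLATES))
    template = rng.choice(TASK_TEMPLATES[category])
    # Bind every slot that any template might reference; unused keys are
    # simply ignored by str.format.
    slots = {
        "object": rng.choice(scene_objects),
        "target": rng.choice(scene_surfaces),
        "location": rng.choice(scene_surfaces),
        "direction": rng.choice(["left", "right"]),
        "room_area": rng.choice(scene_surfaces),
        "container": rng.choice(scene_objects),
    }
    instruction = template.format(**slots)
    return TaskSpec(instruction, category, slots["object"], slots["target"])
```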
## Scripted Manipulation Policies
For data collection, we DON'T need a trained manipulation policy. We use:
- SONIC RL policy for locomotion (already trained, from Newton assets)
- IK-based scripted policy for arms:
  - Compute target EEF position from object position
  - Newton `solve_ik()` → joint trajectory
  - Smooth interpolation with startup ramp
  - Add Gaussian noise for diversity (±2° per joint)
This is the same pattern NVIDIA uses in GR00T-Dreams: scripted expert → sim recording → train VLA on recordings.
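A sketch of the interpolation-plus-noise step, with the IK solve left as a placeholder since the exact Newton entry point is not pinned down here:

```python
import numpy as np

ARM_DOF = 14                        # assumed: 7 per arm on the G1
NOISE_STD_RAD = np.deg2rad(2.0)     # ±2° jitter for diversity


def scripted_reach(q_start, q_goal, steps=100, rng=None):
    """Interpolate joints from q_start to q_goal with a smooth startup ramp.

    q_goal would come from an IK solve on the target EEF pose, e.g.
    q_goal = solve_ik(wrist_site, target_pos)  # placeholder for Newton IK
    """
    rng = rng or np.random.default_rng()
    # Smoothstep time profile: zero velocity at both ends, gentle startup.
    t = np.linspace(0.0, 1.0, steps)
    s = 3 * t**2 - 2 * t**3
    traj = q_start[None, :] + s[:, None] * (q_goal - q_start)[None, :]
    # One Gaussian offset per episode so 4096 parallel envs don't all
    # execute the identical trajectory.
    traj += rng.normal(0.0, NOISE_STD_RAD, size=(1, traj.shape[1]))
    return traj.astype(np.float32)


# Example: q_traj = scripted_reach(np.zeros(ARM_DOF), np.full(ARM_DOF, 0.3))
```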
## Output Schema Compatibility
The synthetic data MUST match the real teleop schema exactly:
```
Real teleop (S3):                Synthetic (this pipeline):
─────────────────                ──────────────────────────
observation.state: (43,)      ←→ observation.state: (43,)        ✅ same
observation.eef_state: (14,)  ←→ observation.eef_state: (14,)    ✅ same
action: (43,)                 ←→ action: (43,)                   ✅ same
teleop.navigate_command: (3,) ←→ action[-3:] locomotion          ✅ mapped
observation.images.ego_view  ←→ observation.image.camera_head    ✅ Cosmos makes it real
                             ←→ observation.lidar: (4096, 4)     🆕 synthetic bonus
                             ←→ observation.audio: (32000,)      🆕 synthetic bonus
language_instruction         ←→ language_instruction             ✅ same
```
The synthetic data has MORE modalities than real data. This is fine — DataSoup handles missing modalities gracefully (zero-fills). But when a real G1 has L1 LiDAR + microphone, the model already knows what to do with them.
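A sketch of the zero-fill behavior this relies on (the `pad_missing_modalities` hook is illustrative; the shapes match the schema above):

```python
import numpy as np

# Expected shapes per frame; zero-filled when a source lacks the modality.
MODALITY_SHAPES = {
    "observation.lidar": (4096, 4),
    "observation.audio": (32000,),
}


def pad_missing_modalities(sample: dict) -> dict:
    """Zero-fill modalities a real-teleop sample doesn't carry."""
    for key, shape in MODALITY_SHAPES.items():
        if key not in sample:
            sample[key] = np.zeros(shape, dtype=np.float32)
    return sample
```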
## First Milestone: 10K Kitchen Episodes
```python
# neon/synth/config.py
from dataclasses import dataclass, field
from typing import List


@dataclass
class SynthConfig:
    # World generation
    marble_presets: List[str] = field(
        default_factory=lambda: ["kitchen", "office_desk", "workshop", "living_room"]
    )
    worlds_per_preset: int = 25       # 100 unique rooms
    objects_per_scene: int = 5

    # Newton simulation
    num_envs: int = 256               # Parallel on A100
    episodes_per_world: int = 100     # 100 tasks per room
    episode_length: int = 400         # 8 seconds at 50Hz
    sim_hz: int = 200
    control_hz: int = 50
    render_hz: int = 20

    # Cosmos Transfer
    cosmos_enabled: bool = True
    cosmos_control_types: List[str] = field(default_factory=lambda: ["depth", "edge"])
    cosmos_guidance: float = 3.0

    # Output
    output_repo_id: str = "cagataydev/neon-g1-synth-kitchen-10k"
    fps: int = 20
    push_to_hub: bool = True

# Total: 100 rooms × 100 episodes = 10,000 episodes
# At 8s × 20fps = 160 frames each → 1.6M total frames
# Joint data: 10K × 400 steps × 46 floats × 4 bytes ≈ 736 MB raw (less after parquet compression)
# Video: 10K × 8s ≈ 22 hours of 640×480 video
```
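Generation for this milestone would then kick off with something like the following (hypothetical entry point, mirroring the orchestrator sketch earlier):

```python
from neon.synth.config import SynthConfig
from neon.synth.pipeline import NeonSynthPipeline  # hypothetical; see sketch above

pipeline = NeonSynthPipeline(SynthConfig())
pipeline.run()   # Stages 1-4: worlds -> Newton rollouts -> Cosmos -> LeRobot v3
```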
## What Makes This "One of a Kind"
No existing VLA has ALL of these simultaneously:
- 6 modalities (vision, language, joints, EEF, LiDAR, audio)
- Photorealistic synthetic environments (Marble + Cosmos Transfer)
- Physics-accurate training data (Newton GPU with SONIC PD gains)
- Cross-embodiment grounding (relative actions + ActionMapper)
- Real + synthetic data mixing (DataSoup weighted sampling)
- Scalable generation (4096 parallel Newton envs)
GR00T-Dreams covers Stages 2-3 of this pipeline but without LiDAR, audio, or Marble world diversity. Cosmos-Predict covers Stage 3 but without physics-grounded joint states. Neon's synth pipeline does all of them, outputting the exact same LeRobot v3 format as real teleop.