
Architecture

How Neon is built. Every module, every data flow, every design decision — laid bare.


Package Structure

neon/
├── model/
│   ├── neon_vla.py             # Complete VLA + PointCloudEncoder + EEFEncoder + GPS/Depth/Seg/Tactile/IMU/Force encoders
│   ├── video_backbone.py       # Qwen2.5-Omni / Cosmos-Reason2 adapter
│   ├── action_heads.py         # MLP, ActionChunking, FlowMatching, DiT, StateRelative, Ensemble, G1ActionHead
│   ├── audio.py                # AudioEncoder (Whisper/Mel/Omni) + PersonaPlex TTS
│   ├── brain_encoder.py        # TRIBEv2 techniques: TemporalSmoothing, ModalityDropout, EmbodimentLayers, BrainFusion
│   ├── tribe_backbone.py       # TRIBEv2 native multimodal: V-JEPA2 + Wav2Vec-BERT + LLaMA fusion
│   ├── photon.py               # Photon-inspired: SpeculativeActionDecoder, AdaptiveComputeRouter, CUDAGraphWrapper
│   └── guidance.py             # AirVLA physics-aware guidance: Payload, Torque, Collision, Smooth, Composite
├── data/
│   ├── action_space.py         # G1 29-DoF joint definitions, normalization, modes
│   ├── data_soup.py            # 15-source mixing, NeonEpisode w/ all modalities
│   ├── lerobot_v3.py           # LeRobot v3 omni-modal writer/reader (14 modalities)
│   └── relative_actions.py     # Cosmos-Predict2.5 relative EE actions
├── training/
│   ├── config.py               # TrainConfig + 24 presets
│   ├── train.py                # NeonTrainer — omni-modal collation + loss
│   ├── self_learner.py         # Online self-supervised adaptation engine
│   └── distill.py              # Knowledge distillation utilities
├── eval/
│   ├── neonbench.py            # NeonBench — unseen env × unseen task benchmark
│   ├── libero_eval.py          # LIBERO 4-suite benchmark (vs GR00T N1.6, π₀.₅)
│   ├── robocasa_eval.py        # RoboCasa zero-shot kitchen evaluation
│   ├── simplerenv_eval.py      # SimplerEnv (Google RT-2 benchmark environments)
│   ├── mujoco_eval.py          # MuJoCo simulation evaluation
│   └── dreamgen_eval.py        # DreamGen synthetic evaluation
├── inference/
│   ├── server.py               # HTTP server — all modalities + /health
│   └── g1_controller.py        # Unitree G1 SDK, safety limits, closed-loop
├── sim/
│   ├── env.py                  # NeonSimEnv — MuJoCo G1 sim with LiDAR, EEF, teleop
│   ├── simulation.py           # Self-contained MuJoCo Simulation class (URDF, MJCF, multi-robot)
│   ├── scene.py                # Procedural scene generation (tables, objects)
│   ├── morphology.py           # GulaMannen modular humanoid morphology system (6 configs)
│   ├── policies.py             # Policy provider interface (NeonPolicy, Mock)
│   ├── dataset_recorder.py     # LeRobotDataset recording bridge (sim + real)
│   ├── newton/
│   │   ├── newton_backend.py   # Newton GPU backend — 4096+ parallel envs, warp solver, diffsim
│   │   └── newton_gym_env.py   # Gym-compatible Newton environment
│   └── isaac/
│       ├── isaac_sim_backend.py    # Isaac Sim backend
│       ├── isaac_lab_env.py        # Isaac Lab environment
│       ├── isaac_lab_trainer.py    # Isaac Lab RL trainer
│       ├── isaac_gym_env.py        # Isaac Gym environment
│       ├── isaac_sim_bridge.py     # Isaac Sim bridge
│       └── asset_converter.py      # MJCF → USD converter
├── synth/
│   ├── pipeline.py             # SynthPipeline — master orchestrator
│   ├── config.py               # SynthConfig with presets: kitchen_10k, diverse_50k, fast_debug
│   ├── world_generator.py      # Marble 3D room generation + procedural scene composition
│   ├── task_generator.py       # Procedural task + language + scripted arm trajectory
│   ├── newton_collector.py     # Newton GPU parallel data collection (4096+ envs)
│   ├── cosmos_augmentor.py     # Cosmos Transfer 2.5 sim→real visual augmentation
│   ├── kimodo_generator.py     # Kimodo text→G1 motion generation (NVIDIA, 700h mocap)
│   └── idm_extractor.py        # Inverse Dynamics Model action extraction from video
├── streams/
│   ├── channels.py             # Typed data channels (Camera, Joint, LiDAR, Audio, GPS, Depth, Seg, Tactile, IMU, Force)
│   ├── recorder.py             # StreamRecorder → LeRobot dataset on HF
│   └── session.py              # StreamSession — full robot control loop
├── collect/
│   └── g1_data_collector.py    # G1 teleoperation data collection (Quest 3)
├── dashboard/
│   └── bridge.py               # WebSocket dashboard (camera, joints, LiDAR viz)
├── tools/
│   └── neon_tool.py            # Strands agent tool wrapper
├── policy.py                   # NeonPolicy — strands-robots integration (HTTP/ZMQ)
└── tests/                      # 639 tests across 21 files (all CPU)

Component Hierarchy

graph TD
    NV["NeonVLA<br/><em>Complete model</em>"]
    NV --> VB["VideoBackbone<br/><em>Qwen2.5-Omni / Cosmos</em>"]
    NV --> TB["TribeMultimodalBackbone<br/><em>V-JEPA2 + Wav2Vec + LLaMA</em>"]
    NV --> AE["AudioEncoder<br/><em>Whisper / Mel / Omni</em>"]
    NV --> PE["ProprioceptionEncoder<br/><em>MLP: joints → features</em>"]
    NV --> LE["PointCloudEncoder<br/><em>PointNet-style: LiDAR → features</em>"]
    NV --> EE["EEFEncoder<br/><em>MLP: bimanual EE → features</em>"]
    NV --> SE["Sensor Encoders<br/><em>GPS, Depth, Seg, Tactile, IMU, Force</em>"]
    NV --> FUS["Fusion Layer<br/><em>Linear + ReLU²</em>"]
    NV --> AH["G1ActionHead"]
    NV --> SH["SpeechResponseHead"]

    AH --> ACH["ActionChunkingHead<br/><em>Arms (14 DoF)</em>"]
    AH --> FMH["FlowMatchingHead<br/><em>Multi-modal actions</em>"]
    AH --> DIT["DiTActionHead<br/><em>Diffusion Transformer</em>"]
    AH --> ENS["EnsembleHead<br/><em>Gated MLP+Flow+DiT</em>"]
    AH --> SRH["StateRelativeHead<br/><em>Δ-from-state wrapper</em>"]

    style NV fill:#e65100,color:#fff
    style VB fill:#1565c0,color:#fff
    style TB fill:#1565c0,color:#fff
    style AE fill:#1565c0,color:#fff
    style AH fill:#1b5e20,color:#fff
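
A minimal sketch of how this composition reads in code; the class, constructor arguments, and layer choices here are illustrative stand-ins for the real modules in neon_vla.py, and the feature widths follow the training diagram below.

import torch
import torch.nn as nn

class ReLUSquared(nn.Module):
    """The fusion activation: relu(x) squared."""
    def forward(self, x):
        return torch.relu(x).square()

class NeonVLASketch(nn.Module):
    """Illustrative composition only; not the real NeonVLA class."""
    def __init__(self, vis_dim=2048, proprio_dim=256, fused_dim=2048,
                 chunk_size=16, dof=17):
        super().__init__()
        self.proprio_encoder = nn.Sequential(nn.Linear(29, proprio_dim), nn.ReLU())
        self.fusion = nn.Sequential(
            nn.Linear(vis_dim + proprio_dim, fused_dim), ReLUSquared())
        self.action_head = nn.Linear(fused_dim, chunk_size * dof)  # stand-in for G1ActionHead
        self.chunk_size, self.dof = chunk_size, dof

    def forward(self, visual_features, joint_states):
        # Visual features come from the frozen video backbone; joints from the robot SDK.
        fused = self.fusion(torch.cat(
            [visual_features, self.proprio_encoder(joint_states)], dim=-1))
        return self.action_head(fused).view(-1, self.chunk_size, self.dof)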

Data Flow — Training

sequenceDiagram
    participant DS as DataSoupDataset
    participant VB as Video Backbone (frozen)
    participant PE as Proprio Encoder
    participant AE as Audio Encoder
    participant LE as LiDAR Encoder
    participant EE as EEF Encoder
    participant SE as Sensor Encoders
    participant FUS as Fusion (ReLU²)
    participant AH as Action Heads (v2)
    participant OPT as Adam (β₁=0.85)

    DS->>VB: images + text
    VB->>FUS: visual-language features (2048)
    DS->>PE: joint states
    PE->>FUS: proprio features (256)
    DS->>AE: audio waveform (optional)
    AE->>FUS: audio features (2048)
    DS->>LE: LiDAR point cloud (optional)
    LE->>FUS: spatial features (256)
    DS->>EE: EEF state (optional)
    EE->>FUS: EE features (128)
    DS->>SE: tactile, IMU, force, GPS, depth, seg (optional)
    SE->>FUS: sensor features
    FUS->>AH: fused features (2048)
    AH->>AH: RMSNorm → ReLU² → Skip → SoftCap
    Note over AH: MSE / Flow / Diffusion loss
    AH->>OPT: gradients (clip=0.3)
    OPT->>AH: parameter update

Gradients never reach the backbone. The frozen 7B parameters serve as a fixed feature extractor — all learning happens in the ~6M trainable parameters downstream.
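
A sketch of the freeze-and-train setup the diagram implies, using the hyperparameters shown above (Adam β₁ = 0.85, gradient clip 0.3); the model.backbone attribute and the learning rate are assumptions, not the actual NeonTrainer code.

import torch

def build_optimizer(model, lr=1e-4):
    # Freeze the video backbone so it stays a fixed feature extractor.
    for p in model.backbone.parameters():
        p.requires_grad = False
    # Only the small downstream stack (encoders, fusion, heads) is optimized.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr, betas=(0.85, 0.999)), trainable

def training_step(model, optimizer, trainable, batch, loss_fn):
    optimizer.zero_grad()
    loss = loss_fn(model(**batch["inputs"]), batch["actions"])
    loss.backward()                                            # gradients stop at the frozen backbone
    torch.nn.utils.clip_grad_norm_(trainable, max_norm=0.3)    # aggressive clipping for the tiny heads
    optimizer.step()
    return loss.item()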


Data Flow — Inference

sequenceDiagram
    participant CAM as Camera
    participant MIC as Microphone
    participant SDK as Joint States
    participant NV as NeonVLA
    participant PHO as Photon Engine
    participant PP as PersonaPlex
    participant G1 as G1 Robot

    CAM->>NV: camera frame
    MIC->>NV: spoken command (16kHz)
    SDK->>NV: current joint positions
    NV->>PHO: adaptive compute routing
    PHO->>PHO: speculative decode (fast MLP → verify with Flow/DiT)
    PHO->>G1: action chunk (16 steps × 17 DoF)
    NV->>PP: speech text
    PP->>G1: audio playback
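
A sketch of the speculative-decode idea in the Photon step: draft with the cheap MLP head, verify with the slow Flow/DiT head only occasionally. The cadence and tolerance below are illustrative, not SpeculativeActionDecoder's actual policy.

import torch

@torch.no_grad()
def speculative_action_step(fast_head, slow_head, fused, step,
                            verify_every=4, tol=0.05):
    draft = fast_head(fused)              # cheap draft chunk, e.g. [B, 16, 17]
    if step % verify_every != 0:
        return draft                      # trust the draft between verification steps
    verified = slow_head(fused)           # expensive Flow / DiT pass
    err = (draft - verified).abs().mean()
    return draft if err < tol else verified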

Data Pipeline

graph LR
    subgraph "15 Source Types"
        LR["LeRobot"]
        AG["Agibot"]
        DG["DreamGen"]
        COS["Cosmos DreamGen"]
        VC["Voice Commands"]
        S4D["Stereo4D"]
        G1T["G1 Teleop"]
        GRT["GR00T Teleop"]
        NV3["Neon v3"]
        BS["BONES-SEED"]
        KIM["Kimodo"]
        OXE["OXE"]
        CUS["Custom"]
    end

    subgraph "Processing"
        MAP["ActionMapper<br/>Cross-embodiment"]
        REL["Relative Actions<br/>Cosmos EE deltas"]
        IDM["IDM Extractor<br/>Video → actions"]
        NORM["Normalize<br/>to G1 space"]
    end

    LR --> MAP
    AG --> MAP
    DG --> MAP
    COS --> REL
    OXE --> MAP
    G1T --> NORM
    GRT --> MAP
    NV3 --> NORM
    BS --> MAP
    KIM --> MAP
    MAP --> NORM
    REL --> NORM
    IDM --> NORM
    VC --> SOUP
    S4D --> SOUP
    NORM --> SOUP["Data Soup<br/>Weighted mixing"]

    style SOUP fill:#e65100,color:#fff
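
A sketch of the weighted mixing at the soup stage; the source names and weights below are placeholders, not the shipped DataSoupConfig defaults.

import random

def sample_episode(sources, weights, rng=random):
    """Pick a source in proportion to its weight, then draw one episode from it."""
    names = list(sources)
    name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    return name, rng.choice(sources[name])

# Placeholder weights: real G1 teleop counts more than synthetic video.
weights = {"g1_teleop": 3.0, "lerobot": 1.0, "dreamgen": 0.5}
sources = {"g1_teleop": ["ep_000", "ep_001"], "lerobot": ["ep_002"], "dreamgen": ["ep_003"]}
print(sample_episode(sources, weights))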

Configuration Cascade

graph LR
    TC["TrainConfig"] --> NC["NeonConfig"]
    TC --> DSC["DataSoupConfig"]
    TC --> HYP["adam_β₁=0.85<br/>grad_clip=0.3"]
    NC --> BC["BackboneConfig"]
    NC --> TBC["TribeBackboneConfig"]
    NC --> AHC["ActionHeadConfig<br/>MLP / Flow / DiT / Ensemble"]
    NC --> AC["AudioConfig"]
    DSC --> SRC["DataSourceConfig[]<br/>15 source types"]

    style TC fill:#333,color:#fff
    style AHC fill:#e65100,color:#fff
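
A sketch of how the cascade can nest as dataclasses; field names loosely follow the diagram, and the real config.py fields may differ.

from dataclasses import dataclass, field

@dataclass
class ActionHeadConfig:
    head_type: str = "ensemble"     # "mlp" | "flow" | "dit" | "ensemble"
    chunk_size: int = 16

@dataclass
class NeonConfig:
    backbone: str = "qwen2.5-omni"
    action_head: ActionHeadConfig = field(default_factory=ActionHeadConfig)

@dataclass
class DataSourceConfig:
    source_type: str = "lerobot"
    weight: float = 1.0

@dataclass
class TrainConfig:
    model: NeonConfig = field(default_factory=NeonConfig)
    sources: list = field(default_factory=list)   # list of DataSourceConfig
    adam_beta1: float = 0.85
    grad_clip: float = 0.3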

Five Design Principles

1. Frozen Backbone, Trainable Heads

The video backbone has 3-7B parameters pre-trained on millions of hours of video. We freeze it and train only ~6M parameters. Train with 100× less data. No catastrophic forgetting. Every byte of the action heads matters — hence Parameter Golf.
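
A quick audit of that split; this helper works on any torch module with a frozen backbone and is not part of the Neon codebase.

def parameter_golf_report(model):
    """Count frozen vs trainable parameters; the trainable side should stay near ~6M."""
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return {"frozen": frozen, "trainable": trainable}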

2. Progressive Scaling

Start with arms_only (17 DoF). Scale to upper_body (20 DoF). Then whole_body (32 DoF). Each mode adds joints and heads without changing existing ones. Your first model doesn't become obsolete — it becomes the foundation.
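
A sketch of the mode-to-DoF mapping those numbers imply; the constant name and derived head width are illustrative, and the real joint groupings live in action_space.py.

# DoF per progressive-scaling mode, per the numbers above (illustrative constant).
ACTION_MODE_DOF = {"arms_only": 17, "upper_body": 20, "whole_body": 32}

def head_output_dim(mode: str, chunk_size: int = 16) -> int:
    """An action chunk is chunk_size steps of `dof` joint targets."""
    return chunk_size * ACTION_MODE_DOF[mode]

assert head_output_dim("arms_only") == 16 * 17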

3. Cross-Embodiment Actions

Cosmos relative actions (rel_xyz = R₁ᵀ·(xyz₂-xyz₁)) mean data from any robot can train a G1. Same physical movement = same numbers. A Franka picking up a cup teaches a G1 to pick up a cup.
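
A worked numpy sketch of that formula; the rotation delta in the last line is the natural companion to rel_xyz, added here for illustration rather than quoted from relative_actions.py.

import numpy as np

def relative_ee_action(xyz1, R1, xyz2, R2):
    """Express the step-to-step end-effector motion in the frame of the first pose,
    so the same physical movement yields the same numbers on any robot."""
    rel_xyz = R1.T @ (xyz2 - xyz1)   # translation seen from the local EE frame
    rel_rot = R1.T @ R2              # rotation from pose 1 to pose 2
    return rel_xyz, rel_rot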

4. Modular Multimodality

Audio is optional. Speech output is optional. Proprioception is optional. Every modality can be independently enabled, disabled, or swapped. The fusion layer handles variable input dimensionality gracefully. The TRIBEv2 backbone gives a dedicated SOTA encoder per modality when you need maximum quality.
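
A sketch of how the fusion input width can be assembled from whichever encoders are enabled; the feature widths come from the training diagram above, and the builder itself is illustrative.

import torch.nn as nn

# Per-modality feature widths from the training data-flow diagram.
MODALITY_DIMS = {"vision_language": 2048, "proprio": 256, "audio": 2048,
                 "lidar": 256, "eef": 128}

def build_fusion(enabled, fused_dim=2048):
    """The fusion input width is simply the sum of the enabled modalities."""
    in_dim = sum(MODALITY_DIMS[m] for m in enabled)
    return nn.Sequential(nn.Linear(in_dim, fused_dim), nn.ReLU())  # the real model uses ReLU²

fusion = build_fusion(["vision_language", "proprio"])  # audio, LiDAR, EEF switched off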

5. Safety by Default

The G1 controller enforces joint limits, velocity limits, and locomotion limits before every command. Safety checks cannot be bypassed in the default configuration. Physics-aware guidance functions (from AirVLA) steer actions toward feasible trajectories without retraining. A robot that hurts someone teaches nothing.
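
A sketch of the pre-command clamp that principle describes; the limit arguments and the control period are placeholders, not the G1's real limits.

import numpy as np

def enforce_limits(target, current, pos_low, pos_high, max_vel, dt=0.02):
    """Clamp a joint command to position limits and a per-tick velocity cap
    before it is ever sent to the robot."""
    target = np.clip(target, pos_low, pos_high)              # joint position limits
    max_step = max_vel * dt                                   # largest allowed move this tick
    return current + np.clip(target - current, -max_step, max_step)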