Architecture¶
How Neon is built. Every module, every data flow, every design decision — laid bare.
Package Structure¶
neon/
├── model/
│ ├── neon_vla.py # Complete VLA + PointCloudEncoder + EEFEncoder + GPS/Depth/Seg/Tactile/IMU/Force encoders
│ ├── video_backbone.py # Qwen2.5-Omni / Cosmos-Reason2 adapter
│ ├── action_heads.py # MLP, ActionChunking, FlowMatching, DiT, StateRelative, Ensemble, G1ActionHead
│ ├── audio.py # AudioEncoder (Whisper/Mel/Omni) + PersonaPlex TTS
│ ├── brain_encoder.py # TRIBEv2 techniques: TemporalSmoothing, ModalityDropout, EmbodimentLayers, BrainFusion
│ ├── tribe_backbone.py # TRIBEv2 native multimodal: V-JEPA2 + Wav2Vec-BERT + LLaMA fusion
│ ├── photon.py # Photon-inspired: SpeculativeActionDecoder, AdaptiveComputeRouter, CUDAGraphWrapper
│ └── guidance.py # AirVLA physics-aware guidance: Payload, Torque, Collision, Smooth, Composite
├── data/
│ ├── action_space.py # G1 29-DoF joint definitions, normalization, modes
│ ├── data_soup.py # 15-source mixing, NeonEpisode w/ all modalities
│ ├── lerobot_v3.py # LeRobot v3 omni-modal writer/reader (14 modalities)
│ └── relative_actions.py # Cosmos-Predict2.5 relative EE actions
├── training/
│ ├── config.py # TrainConfig + 24 presets
│ ├── train.py # NeonTrainer — omni-modal collation + loss
│ ├── self_learner.py # Online self-supervised adaptation engine
│ └── distill.py # Knowledge distillation utilities
├── eval/
│ ├── neonbench.py # NeonBench — unseen env × unseen task benchmark
│ ├── libero_eval.py # LIBERO 4-suite benchmark (vs GR00T N1.6, π₀.₅)
│ ├── robocasa_eval.py # RoboCasa zero-shot kitchen evaluation
│ ├── simplerenv_eval.py # SimplerEnv (Google RT-2 benchmark environments)
│ ├── mujoco_eval.py # MuJoCo simulation evaluation
│ └── dreamgen_eval.py # DreamGen synthetic evaluation
├── inference/
│ ├── server.py # HTTP server — all modalities + /health
│ └── g1_controller.py # Unitree G1 SDK, safety limits, closed-loop
├── sim/
│ ├── env.py # NeonSimEnv — MuJoCo G1 sim with LiDAR, EEF, teleop
│ ├── simulation.py # Self-contained MuJoCo Simulation class (URDF, MJCF, multi-robot)
│ ├── scene.py # Procedural scene generation (tables, objects)
│ ├── morphology.py # GulaMannen modular humanoid morphology system (6 configs)
│ ├── policies.py # Policy provider interface (NeonPolicy, Mock)
│ ├── dataset_recorder.py # LeRobotDataset recording bridge (sim + real)
│ ├── newton/
│ │ ├── newton_backend.py # Newton GPU backend — 4096+ parallel envs, warp solver, diffsim
│ │ └── newton_gym_env.py # Gym-compatible Newton environment
│ └── isaac/
│ ├── isaac_sim_backend.py # Isaac Sim backend
│ ├── isaac_lab_env.py # Isaac Lab environment
│ ├── isaac_lab_trainer.py # Isaac Lab RL trainer
│ ├── isaac_gym_env.py # Isaac Gym environment
│ ├── isaac_sim_bridge.py # Isaac Sim bridge
│ └── asset_converter.py # MJCF → USD converter
├── synth/
│ ├── pipeline.py # SynthPipeline — master orchestrator
│ ├── config.py # SynthConfig with presets: kitchen_10k, diverse_50k, fast_debug
│ ├── world_generator.py # Marble 3D room generation + procedural scene composition
│ ├── task_generator.py # Procedural task + language + scripted arm trajectory
│ ├── newton_collector.py # Newton GPU parallel data collection (4096+ envs)
│ ├── cosmos_augmentor.py # Cosmos Transfer 2.5 sim→real visual augmentation
│ ├── kimodo_generator.py # Kimodo text→G1 motion generation (NVIDIA, 700h mocap)
│ └── idm_extractor.py # Inverse Dynamics Model action extraction from video
├── streams/
│ ├── channels.py # Typed data channels (Camera, Joint, LiDAR, Audio, GPS, Depth, Seg, Tactile, IMU, Force)
│ ├── recorder.py # StreamRecorder → LeRobot dataset on HF
│ └── session.py # StreamSession — full robot control loop
├── collect/
│ └── g1_data_collector.py # G1 teleoperation data collection (Quest 3)
├── dashboard/
│ └── bridge.py # WebSocket dashboard (camera, joints, LiDAR viz)
├── tools/
│ └── neon_tool.py # Strands agent tool wrapper
├── policy.py # NeonPolicy — strands-robots integration (HTTP/ZMQ)
└── tests/ # 639 tests across 21 files (all CPU)
Component Hierarchy¶
graph TD
NV["NeonVLA<br/><em>Complete model</em>"]
NV --> VB["VideoBackbone<br/><em>Qwen2.5-Omni / Cosmos</em>"]
NV --> TB["TribeMultimodalBackbone<br/><em>V-JEPA2 + Wav2Vec + LLaMA</em>"]
NV --> AE["AudioEncoder<br/><em>Whisper / Mel / Omni</em>"]
NV --> PE["ProprioceptionEncoder<br/><em>MLP: joints → features</em>"]
NV --> LE["PointCloudEncoder<br/><em>PointNet-style: LiDAR → features</em>"]
NV --> EE["EEFEncoder<br/><em>MLP: bimanual EE → features</em>"]
NV --> SE["Sensor Encoders<br/><em>GPS, Depth, Seg, Tactile, IMU, Force</em>"]
NV --> FUS["Fusion Layer<br/><em>Linear + ReLU²</em>"]
NV --> AH["G1ActionHead"]
NV --> SH["SpeechResponseHead"]
AH --> ACH["ActionChunkingHead<br/><em>Arms (14 DoF)</em>"]
AH --> FMH["FlowMatchingHead<br/><em>Multi-modal actions</em>"]
AH --> DIT["DiTActionHead<br/><em>Diffusion Transformer</em>"]
AH --> ENS["EnsembleHead<br/><em>Gated MLP+Flow+DiT</em>"]
AH --> SRH["StateRelativeHead<br/><em>Δ-from-state wrapper</em>"]
style NV fill:#e65100,color:#fff
style VB fill:#1565c0,color:#fff
style TB fill:#1565c0,color:#fff
style AE fill:#1565c0,color:#fff
style AH fill:#1b5e20,color:#fff
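To make the hierarchy concrete, here is a minimal sketch of how the pieces compose. `ReLUSquared` and `ToyNeonVLA` are toy stand-ins, not the real classes in `neon/model/`: a small proprioception encoder, a fusion layer with the ReLU² activation from the diagram, and a plain MLP chunking head in place of the full G1ActionHead.

```python
# Toy stand-in for the NeonVLA composition above (illustrative names and sizes).
import torch
import torch.nn as nn

class ReLUSquared(nn.Module):
    """ReLU² activation used by the fusion layer: relu(x) ** 2."""
    def forward(self, x):
        return torch.relu(x) ** 2

class ToyNeonVLA(nn.Module):
    def __init__(self, vis_dim=2048, proprio_dim=256, fused_dim=2048, action_dim=17, chunk=16):
        super().__init__()
        # Small trainable encoders; the frozen video backbone is not modelled here.
        self.proprio_encoder = nn.Sequential(nn.Linear(29, proprio_dim), nn.GELU())
        self.fusion = nn.Sequential(nn.Linear(vis_dim + proprio_dim, fused_dim), ReLUSquared())
        # One of the interchangeable action heads (a simple chunking MLP).
        self.action_head = nn.Linear(fused_dim, chunk * action_dim)
        self.chunk, self.action_dim = chunk, action_dim

    def forward(self, visual_features, joint_states):
        p = self.proprio_encoder(joint_states)
        fused = self.fusion(torch.cat([visual_features, p], dim=-1))
        return self.action_head(fused).view(-1, self.chunk, self.action_dim)

model = ToyNeonVLA()
actions = model(torch.randn(2, 2048), torch.randn(2, 29))
print(actions.shape)  # torch.Size([2, 16, 17])
```

The real model adds the frozen video backbone plus the optional audio, LiDAR, EEF, and sensor encoders feeding the same fusion layer.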
Data Flow — Training¶
sequenceDiagram
participant DS as DataSoupDataset
participant VB as Video Backbone (frozen)
participant PE as Proprio Encoder
participant AE as Audio Encoder
participant LE as LiDAR Encoder
participant EE as EEF Encoder
participant SE as Sensor Encoders
participant FUS as Fusion (ReLU²)
participant AH as Action Heads (v2)
participant OPT as Adam (β₁=0.85)
DS->>VB: images + text
VB->>FUS: visual-language features (2048)
DS->>PE: joint states
PE->>FUS: proprio features (256)
DS->>AE: audio waveform (optional)
AE->>FUS: audio features (2048)
DS->>LE: LiDAR point cloud (optional)
LE->>FUS: spatial features (256)
DS->>EE: EEF state (optional)
EE->>FUS: EE features (128)
DS->>SE: tactile, IMU, force, GPS, depth, seg (optional)
SE->>FUS: sensor features
FUS->>AH: fused features (2048)
AH->>AH: RMSNorm → ReLU² → Skip → SoftCap
Note over AH: MSE / Flow / Diffusion loss
AH->>OPT: gradients (clip=0.3)
OPT->>AH: parameter update
Gradients never reach the backbone. The frozen 7B parameters serve as a fixed feature extractor — all learning happens in the ~6M trainable parameters downstream.
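A sketch of that head-only update, reusing the `ToyNeonVLA` stand-in from the component hierarchy. The β₁ = 0.85 and grad-clip 0.3 values come from the diagram; everything else (learning rate, batch, MSE head) is illustrative.

```python
# Head-only training step; the frozen backbone runs under no_grad in the real
# trainer, so its output is treated here as a precomputed feature tensor.
import torch

model = ToyNeonVLA()
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.Adam(trainable, lr=1e-4, betas=(0.85, 0.999))

visual = torch.randn(8, 2048)    # frozen-backbone features
joints = torch.randn(8, 29)      # proprioception
target = torch.randn(8, 16, 17)  # 16-step action chunk, 17 DoF

pred = model(visual, joints)
loss = torch.nn.functional.mse_loss(pred, target)  # MSE head; Flow/DiT heads use their own losses
opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(trainable, max_norm=0.3)
opt.step()
```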
Data Flow — Inference¶
sequenceDiagram
participant CAM as Camera
participant MIC as Microphone
participant SDK as Joint States
participant NV as NeonVLA
participant PHO as Photon Engine
participant PP as PersonaPlex
participant G1 as G1 Robot
CAM->>NV: camera frame
MIC->>NV: spoken command (16kHz)
SDK->>NV: current joint positions
NV->>PHO: adaptive compute routing
PHO->>PHO: speculative decode (fast MLP → verify with Flow/DiT)
PHO->>G1: action chunk (16 steps × 17 DoF)
NV->>PP: speech text
PP->>G1: audio playback
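A minimal in-process sketch of one inference step with the same toy model. The real path runs through inference/server.py and g1_controller.py, Photon's speculative decode is omitted, and `send_to_g1` is a placeholder rather than the real controller API.

```python
# One receding-horizon control step (illustrative).
import torch

def send_to_g1(action: torch.Tensor) -> None:
    """Placeholder for the safety-checked g1_controller call."""
    print("sending", [round(x, 3) for x in action[:3].tolist()])

model = ToyNeonVLA().eval()
camera_features = torch.randn(1, 2048)  # frozen-backbone output for the current frame
joint_states = torch.randn(1, 29)       # current joint positions from the SDK

with torch.no_grad():
    chunk = model(camera_features, joint_states)[0]  # (16, 17) action chunk

# Execute a short prefix of the chunk, then observe and re-plan.
for action in chunk[:4]:
    send_to_g1(action)
```

Executing only a prefix of the chunk before re-planning is the usual way chunked action heads are deployed on hardware.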
Data Pipeline¶
graph LR
subgraph "15 Source Types"
LR["LeRobot"]
AG["Agibot"]
DG["DreamGen"]
COS["Cosmos DreamGen"]
VC["Voice Commands"]
S4D["Stereo4D"]
G1T["G1 Teleop"]
GRT["GR00T Teleop"]
NV3["Neon v3"]
BS["BONES-SEED"]
KIM["Kimodo"]
OXE["OXE"]
CUS["Custom"]
end
subgraph "Processing"
MAP["ActionMapper<br/>Cross-embodiment"]
REL["Relative Actions<br/>Cosmos EE deltas"]
IDM["IDM Extractor<br/>Video → actions"]
NORM["Normalize<br/>to G1 space"]
end
LR --> MAP
AG --> MAP
DG --> MAP
COS --> REL
OXE --> MAP
G1T --> NORM
GRT --> MAP
NV3 --> NORM
BS --> MAP
KIM --> MAP
MAP --> NORM
REL --> NORM
IDM --> NORM
VC --> SOUP
S4D --> SOUP
NORM --> SOUP["Data Soup<br/>Weighted mixing"]
style SOUP fill:#e65100,color:#fff
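The "Data Soup" box boils down to weighted sampling over sources before normalization into the G1 action space. A toy sketch, with made-up source names and weights rather than the shipped presets:

```python
# Weighted source mixing (illustrative weights, not the DataSoupConfig presets).
import random

SOURCES = {"lerobot": 0.4, "g1_teleop": 0.3, "dreamgen": 0.2, "oxe": 0.1}

def sample_source(rng: random.Random) -> str:
    names, weights = zip(*SOURCES.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in SOURCES}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly proportional to the weights
```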
Configuration Cascade¶
graph LR
TC["TrainConfig"] --> NC["NeonConfig"]
TC --> DSC["DataSoupConfig"]
TC --> HYP["adam_β₁=0.85<br/>grad_clip=0.3"]
NC --> BC["BackboneConfig"]
NC --> TBC["TribeBackboneConfig"]
NC --> AHC["ActionHeadConfig<br/>MLP / Flow / DiT / Ensemble"]
NC --> AC["AudioConfig"]
DSC --> SRC["DataSourceConfig[]<br/>15 source types"]
style TC fill:#333,color:#fff
style AHC fill:#e65100,color:#fff
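The cascade is easiest to read as nested dataclasses. The fields below are illustrative stand-ins, not the real TrainConfig/NeonConfig/DataSoupConfig definitions in training/config.py:

```python
# Shape of the configuration cascade (illustrative field names).
from dataclasses import dataclass, field

@dataclass
class ActionHeadConfig:
    kind: str = "flow"  # "mlp" | "flow" | "dit" | "ensemble"

@dataclass
class NeonConfig:
    backbone: str = "qwen2.5-omni"
    action_head: ActionHeadConfig = field(default_factory=ActionHeadConfig)

@dataclass
class DataSourceConfig:
    name: str
    weight: float = 1.0

@dataclass
class DataSoupConfig:
    sources: list[DataSourceConfig] = field(default_factory=list)

@dataclass
class TrainConfig:
    model: NeonConfig = field(default_factory=NeonConfig)
    data: DataSoupConfig = field(default_factory=DataSoupConfig)
    adam_beta1: float = 0.85
    grad_clip: float = 0.3

cfg = TrainConfig()
print(cfg.model.action_head.kind, cfg.adam_beta1)  # flow 0.85
```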
Five Design Principles¶
1. Frozen Backbone, Trainable Heads¶
The video backbone has 3-7B parameters pre-trained on millions of hours of video. We freeze it and train only ~6M parameters downstream. The result: roughly 100× less data needed, and no catastrophic forgetting. Every byte of the action heads matters, hence Parameter Golf.
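A quick way to see this principle in numbers is to split parameter counts by `requires_grad`. On the toy model above everything is trainable; a real NeonVLA reports a few million trainable against billions frozen.

```python
# Split parameter counts by trainability (illustrative helper).
import torch.nn as nn

def param_split(model: nn.Module) -> tuple[int, int]:
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen

t, f = param_split(ToyNeonVLA())
print(f"trainable: {t / 1e6:.2f}M, frozen: {f / 1e6:.2f}M")
```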
2. Progressive Scaling¶
Start with arms_only (17 DoF). Scale to upper_body (20 DoF). Then whole_body (32 DoF). Each mode adds joints and heads without changing existing ones. Your first model doesn't become obsolete — it becomes the foundation.
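A sketch of what mode scaling looks like on the action vector, assuming (as "adds joints without changing existing ones" implies) that each larger mode appends joints after the smaller mode's; the real joint orderings live in data/action_space.py.

```python
# Progressive action-space modes (assumption: larger modes append joints).
MODES = {"arms_only": 17, "upper_body": 20, "whole_body": 32}

def action_slice(mode: str) -> slice:
    return slice(0, MODES[mode])

full_action = list(range(MODES["whole_body"]))
print(len(full_action[action_slice("arms_only")]))  # 17: the original head's indices are untouched
```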
3. Cross-Embodiment Actions¶
Cosmos relative actions (rel_xyz = R₁ᵀ·(xyz₂-xyz₁)) mean data from any robot can train a G1. Same physical movement = same numbers. A Franka picking up a cup teaches a G1 to pick up a cup.
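A worked example of that formula, assuming R₁ is the rotation matrix of the first end-effector pose expressed in the world frame:

```python
# rel_xyz = R1ᵀ · (xyz2 − xyz1): the displacement expressed in the first pose's frame.
import numpy as np

def relative_xyz(R1: np.ndarray, xyz1: np.ndarray, xyz2: np.ndarray) -> np.ndarray:
    return R1.T @ (xyz2 - xyz1)

# The EE moves 10 cm along its own x-axis while the pose itself is yawed 90°
# in the world frame; the relative action is [0.10, 0, 0] regardless of the yaw.
theta = np.pi / 2
R1 = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
xyz1 = np.array([0.3, 0.2, 0.5])
xyz2 = xyz1 + R1 @ np.array([0.10, 0.0, 0.0])
print(relative_xyz(R1, xyz1, xyz2))  # ≈ [0.1 0.  0. ]
```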
4. Modular Multimodality¶
Audio is optional. Speech output is optional. Proprioception is optional. Every modality can be independently enabled, disabled, or swapped. The fusion layer handles variable input dimensionality gracefully. The TRIBEv2 backbone gives a dedicated SOTA encoder per modality when you need maximum quality.
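One way a fusion layer can stay robust to missing modalities is to encode whatever is present and substitute a zero placeholder for what is not, so the fused width never changes. The dimensions below mirror the training diagram; the class and the zero-fill strategy are an illustrative sketch, not the real fusion layer.

```python
# Modality-optional fusion sketch (illustrative; dims follow the training diagram).
import torch
import torch.nn as nn

DIMS = {"vision": 2048, "proprio": 256, "audio": 2048, "lidar": 256, "eef": 128}

class OptionalFusion(nn.Module):
    def __init__(self, fused_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(sum(DIMS.values()), fused_dim)

    def forward(self, feats: dict) -> torch.Tensor:
        batch = next(iter(feats.values())).shape[0]
        # Missing modalities get a zero placeholder so the concatenated width is fixed.
        parts = [feats.get(name, torch.zeros(batch, dim)) for name, dim in DIMS.items()]
        return torch.relu(self.proj(torch.cat(parts, dim=-1))) ** 2  # ReLU²

fusion = OptionalFusion()
out = fusion({"vision": torch.randn(2, 2048), "proprio": torch.randn(2, 256)})
print(out.shape)  # torch.Size([2, 2048]) even with audio/lidar/eef absent
```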
5. Safety by Default¶
The G1 controller enforces joint limits, velocity limits, and locomotion limits before every command. Safety checks cannot be bypassed in the default configuration. Physics-aware guidance functions (from AirVLA) steer actions toward feasible trajectories without retraining. A robot that hurts someone teaches nothing.
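A sketch of the kind of clamping such a controller applies before every command; the limits here are placeholders, the real per-joint values live in inference/g1_controller.py.

```python
# Position and velocity clamping before a command is sent (placeholder limits).
import numpy as np

JOINT_MIN, JOINT_MAX = -2.6, 2.6  # rad, placeholder joint range
MAX_STEP = 0.05                   # rad per control tick, placeholder velocity limit

def safe_command(target: np.ndarray, current: np.ndarray) -> np.ndarray:
    # Velocity limit: cap how far a single tick may move each joint.
    step = np.clip(target - current, -MAX_STEP, MAX_STEP)
    # Position limit: never command a position outside the joint range.
    return np.clip(current + step, JOINT_MIN, JOINT_MAX)

current = np.zeros(17)
target = np.full(17, 3.0)                  # deliberately out of range
print(safe_command(target, current)[:3])   # [0.05 0.05 0.05]: clamped to one safe step
```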