# Neon Is Not a VLA. It's the Layer That Makes Any Foundation Model Into One.
Every VLA today is a monolith. Backbone and decoder, welded together.
Neon inverts this. The backbone is a plug. The action head is the product.
We take any frozen foundation model — Qwen2.5-Omni (video+audio), Cosmos-Reason2 (physics), TRIBEv2 (V-JEPA2+Wav2Vec+LLaMA) — and train a tiny action decoder on top. ~6 million parameters. 0.08% of the total. That decoder translates the backbone's understanding into 29 joint commands for a humanoid body, 16 timesteps into the future.
Swap the eyes, keep the hands.
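In code, the contract is simple: backbone features in, a joint-trajectory chunk out. A minimal numpy sketch of such a head follows; all sizes and layer shapes here are illustrative, not Neon's actual decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, HID, DOF, HORIZON = 2048, 512, 29, 16   # illustrative sizes

# A tiny two-layer head on top of frozen-backbone features.
W1 = rng.standard_normal((FEAT, HID)) * 0.02
W2 = rng.standard_normal((HID, DOF * HORIZON)) * 0.02

def decode(feats):                              # feats: (B, FEAT), from the backbone
    h = np.maximum(feats @ W1, 0.0)             # hidden activations (ReLU)
    return (h @ W2).reshape(-1, HORIZON, DOF)   # a 16-step chunk of 29 joint targets

actions = decode(rng.standard_normal((2, FEAT)))
print(actions.shape)   # (2, 16, 29): batch, timesteps, joints
```

Everything above the head stays frozen; only `W1` and `W2` (and their real-world equivalents) ever see a gradient.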
## The Architecture
```mermaid
graph TD
    subgraph "Backbone (frozen, swappable)"
        Q["Qwen2.5-Omni<br/>Video + Audio + Language"]
        C["Cosmos-Reason2<br/>Physics + World Model"]
        T["TRIBEv2<br/>V-JEPA2 + Wav2Vec + LLaMA"]
    end
    subgraph "Sensor Encoders (trainable)"
        PE["Proprio<br/>MLP"]
        LE["LiDAR<br/>PointNet"]
        EE["EEF<br/>MLP"]
    end
    subgraph "Neon Action Layer (trainable, ~6M)"
        FUS["Feature Fusion"]
        AH["Action Heads<br/>MLP / Flow / DiT / StateRelative / Ensemble"]
    end
    Q --> FUS
    C --> FUS
    T --> FUS
    PE --> FUS
    LE --> FUS
    EE --> FUS
    FUS --> AH
    AH --> OUT["29 DoF × 16 steps + Speech"]
    style Q fill:#0097a7,color:#fff
    style C fill:#6a1b9a,color:#fff
    style T fill:#1b5e20,color:#fff
    style AH fill:#e65100,color:#fff
    style FUS fill:#333,color:#fff
```
## Why an Action Head Framework?
| | Traditional VLA | Neon |
|---|---|---|
| New backbone drops | Retrain everything | Swap config, retrain heads (~hours) |
| Physics understanding | Hope the model learned it | Plug in Cosmos, physics is pre-trained |
| Audio | Separate pipeline | Native — Qwen hears, TRIBEv2 hears, already fused |
| Edge vs cloud | Ship the whole thing | Same heads, smaller backbone |
| Self-improvement | Offline retraining | Online RL on action heads, backbone frozen |
| Trainable params | Often billions | 6M — 0.08% |
## Three Ways In
## 5 Action Head Types
The action head is the product. Choose the right one for the job:
| Head | Params | Speed | Accuracy | Best For |
|---|---|---|---|---|
| MLP | ~2M | ⚡⚡⚡ | ★★★ | Edge deployment |
| FlowMatching | ~4M | ⚡⚡ | ★★★★ | Smooth trajectories |
| DiT | ~6M | ⚡ | ★★★★★ | Maximum precision |
| StateRelative | ~3M | ⚡⚡⚡ | ★★★★ | Cross-embodiment |
| Ensemble | ~8M | ⚡ | ★★★★★ | Gated routing |
Built with Parameter Golf v2 — ReLU², RMSNorm, soft-capping, U-Net skips, speculative decoding.
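Most of those tricks are one-liners. Minimal numpy versions of three of them are below; these are the common forms from the literature, and Neon's exact variants may differ.

```python
import numpy as np

def relu2(x):
    """ReLU-squared: like ReLU, but smooth-ish growth above zero."""
    return np.maximum(x, 0.0) ** 2

def rmsnorm(x, eps=1e-6):
    """RMSNorm: scale by root-mean-square, no mean subtraction, no bias."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def soft_cap(logits, cap=30.0):
    """Logit soft-capping: smoothly bound values to (-cap, cap) via tanh."""
    return cap * np.tanh(logits / cap)

print(relu2(np.array([-2.0, 0.5, 3.0])))      # [0.   0.25 9.  ]
print(soft_cap(np.array([100.0])))            # stays below 30
```

Soft-capping in particular keeps extreme logits from dominating without the hard cliff of clipping, which matters when the head's outputs feed directly into joint commands.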
## Self-Learning Loop
Because the backbone is frozen, action heads can learn online — no human labels needed:
```mermaid
graph LR
    A["Observe"] --> B["Predict"]
    B --> C["Act (16 steps)"]
    C --> D["Observe Again"]
    D --> E["Compute Loss"]
    E --> F["Update Heads Only"]
    F --> A
    style F fill:#e65100,color:#fff
```
Three self-supervised losses: outcome prediction, temporal coherence, action consistency. EWC prevents forgetting. Deep dive →
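The EWC side of that loop can be sketched in a few lines. The three loss terms are named above; their exact forms, and the Fisher weights here, are assumptions for illustration.

```python
import numpy as np

theta_star = np.array([0.5, -0.2])          # head params after pretraining (toy)
theta = theta_star + np.array([0.3, -0.1])  # params drifted by online updates
fisher = np.array([1.0, 0.5])               # per-param importance (assumed)
lam, lr = 0.5, 0.1

def ewc_penalty(th):
    """Quadratic pull toward the pretrained params, weighted by importance."""
    return 0.5 * lam * np.sum(fisher * (th - theta_star) ** 2)

before = ewc_penalty(theta)
grad = lam * fisher * (theta - theta_star)  # gradient of the penalty
theta = theta - lr * grad                   # task-loss gradients omitted for brevity
after = ewc_penalty(theta)
print(before, after)   # penalty shrinks: the heads stay anchored near pretraining
```

In the real loop this penalty is added to the three self-supervised losses, so online updates improve the heads without drifting away from what supervised pretraining established.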
## What's Inside
### Swappable Backbones
Qwen2.5-Omni, Cosmos-Reason2, TRIBEv2. Frozen. 3-10B parameters. When a new SOTA drops, swap the config, retrain 6M params.
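The swap is a config edit, not a rewrite. A hypothetical sketch of the idea, where the keys are illustrative and not Neon's actual schema:

```python
# Toy config: the backbone entry changes, the trainable head stack does not.
config = {
    "backbone": {"name": "qwen2.5-omni", "frozen": True},
    "heads": ["mlp", "flow_matching"],          # the ~6M trainable params live here
}

def swap_backbone(cfg, name):
    """Return a config with a new backbone; heads are reused as-is."""
    return {**cfg, "backbone": {**cfg["backbone"], "name": name}}

upgraded = swap_backbone(config, "cosmos-reason2")
print(upgraded["backbone"]["name"], upgraded["heads"])
```

The retraining cost after a swap is then bounded by the head parameters, not the backbone's billions.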
### 5 Action Head Types
MLP, FlowMatching, DiT, StateRelative, Ensemble — from fast edge inference to maximum accuracy with gated routing.
### Self-Learning Engine
Online RL adaptation. The backbone watches, the heads learn. Robot gets better with every action. Zero human labels.
### Parameter Golf v2
ReLU², RMSNorm, learnable residual scales, U-Net skip connections, logit soft-capping, speculative decoding.
### 29 DoF Humanoid
Unitree G1. Three progressive control modes. Normalization, safety limits, per-embodiment adaptation.
### Data Soup (15 sources)
LeRobot, Agibot, Cosmos DreamGen, GR00T, BONES, Kimodo — cross-embodiment mapping via relative actions.
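Relative actions are the standard trick for making trajectories transfer across bodies: store each command as a delta from the current joint state rather than an absolute target. A toy sketch (the exact scheme Neon uses may differ):

```python
import numpy as np

def to_relative(abs_actions, state):
    """Convert absolute joint targets (T, D) to deltas from the current state (D,)."""
    return abs_actions - state

def to_absolute(rel_actions, state):
    """Replay a stored delta trajectory from a (possibly different) start state."""
    return rel_actions + state

state_a = np.zeros(29)                              # robot A's current pose
traj = np.cumsum(np.full((16, 29), 0.01), axis=0)   # toy absolute trajectory on A
rel = to_relative(traj, state_a)

state_b = np.full(29, 0.2)                          # robot B, differently posed
replayed = to_absolute(rel, state_b)
print(np.round(replayed[0][0], 2))                  # 0.21: same motion, new start
```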
### Physics Guidance
AirVLA-style gradient corrections at inference. Inject physics constraints without retraining. Works with any flow head.
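The core move is plain gradient descent on a differentiable physics penalty, applied to the sampled actions at inference time, with no weights touched. A sketch of the idea; the constraint, bound, and step size here are illustrative.

```python
import numpy as np

LIMIT = 1.0                                  # joint-velocity bound (assumed)

def penalty_grad(actions):
    """Gradient of sum(max(|a| - LIMIT, 0)^2): zero inside the limit."""
    excess = np.abs(actions) - LIMIT
    return 2.0 * np.sign(actions) * np.maximum(excess, 0.0)

actions = np.array([0.5, 1.4, -1.8])         # raw head output (toy)
for _ in range(50):                          # a few correction steps at inference
    actions = actions - 0.1 * penalty_grad(actions)
print(np.round(actions, 2))                  # violators pulled inside the limit
```

Actions already inside the limit get a zero gradient and pass through untouched, which is why this composes cleanly with any flow head.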
### Latent World Model
LeWM for imagination rollouts. 48× faster planning than foundation-model world models. MPC in latent space.
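Latent MPC amounts to rolling candidate action sequences through a small learned dynamics model and keeping the cheapest. A toy version with a made-up linear latent dynamics (LeWM itself is a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.eye(4) * 0.9                          # latent transition (assumed)
B = rng.standard_normal((4, 2)) * 0.1        # action effect on the latent (assumed)
goal = np.ones(4)                            # target latent state

def rollout_cost(z, actions):
    """Imagine a plan: step the latent forward, score distance to goal."""
    for a in actions:                        # actions: (H, 2)
        z = A @ z + B @ a
    return np.sum((z - goal) ** 2)

z0 = np.zeros(4)
candidates = rng.standard_normal((64, 8, 2))           # 64 plans, horizon 8
costs = [rollout_cost(z0, c) for c in candidates]
best = candidates[int(np.argmin(costs))]
print(best.shape)                            # (8, 2): the chosen action sequence
```

Because every imagination step is a small matrix multiply rather than a foundation-model forward pass, sampling many plans per control tick becomes affordable.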
## By The Numbers
- **3** backbones (swappable)
- **5** action head types
- **~6M** trainable params (0.08%)
- **29** DoF
- **16**-step action chunking
- **6** input modalities
- **50 ms** on Jetson
- **639** tests
- **24** training presets
- **15** data sources
## Navigate
- Installation — pip install in 10 seconds
- Quickstart — from zero to robot actions
- Why Video Models — the temporal insight
- Backbones — Qwen, Cosmos, TRIBEv2
- Action Heads — 5 types, all tricks
- Training — 24 presets, any GPU
- Self-Learning — online RL adaptation
- Action Space — 29 joints, explained