Why Action Head, Not VLA?¶
The Key Insight¶
Neon is not a Vision-Language-Action model. It's a universal action head — a thin, trainable decoder that sits on top of any frozen foundation model and translates its understanding into robot joint commands.
Why does this matter? Because foundation models are improving faster than any robotics lab can retrain. When Qwen 3.0 drops, when Meta releases V-JEPA3, when NVIDIA ships Cosmos-Predict3 — a VLA team starts over. A Neon user swaps a config line and retrains 6M parameters in hours.
The backbone can be:
- Qwen2.5-Omni — video + audio + language, native temporal understanding
- Cosmos-Reason2 — physics, world model, causal reasoning
- TRIBEv2 — V-JEPA2 (video) + Wav2Vec-BERT (audio) + LLaMA (text), per-modality SOTA
All frozen. All swappable. The action heads don't change.
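The "swap a config line" claim can be made concrete with a minimal sketch. The dictionary keys, backbone identifiers, and the `swap_backbone` helper below are illustrative assumptions, not Neon's actual configuration schema:

```python
# Hypothetical config sketch: swapping backbones changes one entry,
# never the action head. Key names are illustrative, not Neon's API.
CONFIGS = {
    "qwen":   {"backbone": "Qwen2.5-Omni-7B",  "frozen": True},
    "cosmos": {"backbone": "Cosmos-Reason2-8B", "frozen": True},
    "tribe":  {"backbone": "TRIBEv2",           "frozen": True},
}

def swap_backbone(config: dict, name: str) -> dict:
    """Return a new run config with a different frozen backbone."""
    new = dict(config)
    new.update(CONFIGS[name])
    return new

run = {"action_head": "v2", "trainable_params": 6_000_000, **CONFIGS["qwen"]}
run = swap_backbone(run, "cosmos")   # one change; the action head is untouched
```

After the swap only the decoder's ~6M parameters need retraining, because the head never depended on which frozen backbone produced its input features.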
Why Video Models Specifically?¶
The Blind Spot in Robotics¶
Every major Vision-Language-Action model today — RT-2, Octo, GR00T N1, OpenVLA — uses an image encoder. CLIP. SigLIP. DINOv2. Models trained on photographs. Models that see a frozen moment.
Show an image encoder a robot arm reaching toward a cup. It sees a robot arm near a cup. What it does not see:
- The direction the arm is moving
- The speed of approach
- Whether the cup is sliding or stationary
- What happened 100 milliseconds ago
- What is about to happen next
These models try to compensate — stacking frames, adding recurrence, bolting on positional embeddings. But they're teaching temporal reasoning to a model that has never seen time. It's like teaching a photographer to direct a film by showing them more photographs, faster.
What Video Models Already Know¶
Video foundation models — Qwen2.5-Omni, Cosmos-Reason2 — were trained on millions of hours of the world in motion. They didn't learn to classify objects in isolation. They learned how the world moves:
| What They Learned | Why a Robot Needs It |
|---|---|
| Object permanence | The cup is still behind the arm, even when occluded |
| Physics intuition | Cups fall when pushed. Water pours down. Stacks topple |
| Motion prediction | "What happens next" is literally the training objective |
| Temporal coherence | Smooth trajectories, not jerky frame-by-frame guessing |
| Depth from motion | 3D understanding from parallax — no depth sensor needed |
This is exactly the knowledge a robot needs to act in the world. It was sitting there, pre-trained, waiting for someone to connect it to a body.
```mermaid
graph TD
    subgraph "Image Encoder (the status quo)"
        I1["Frame t"] --> IE["CLIP / SigLIP<br/><i>sees a photograph</i>"]
        IE --> F1["Static features<br/><i>where things are</i>"]
    end
    subgraph "Video Model (what Neon uses)"
        V1["Frame t-3"] --> VE["Qwen2.5-Omni<br/><i>sees time</i>"]
        V2["Frame t-2"] --> VE
        V3["Frame t-1"] --> VE
        V4["Frame t"] --> VE
        VE --> F2["Temporal features<br/><i>where things are going</i>"]
    end
    style IE fill:#555,color:#fff
    style VE fill:#e65100,color:#fff
```
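The diagram's contrast comes down to input shape: an image encoder consumes one frame, a video model consumes a clip, and only the clip carries motion. A shape-only sketch (tensor dimensions are illustrative; real encoders add patching and tokenization on top):

```python
import numpy as np

# Illustrative shapes: batch, time, channels, height, width.
B, T, C, H, W = 1, 4, 3, 224, 224

image_input = np.zeros((B, C, H, W))      # image encoder: one frozen moment
video_input = np.zeros((B, T, C, H, W))   # video model: frames t-3 .. t

# A single frame carries zero motion information; differencing
# consecutive frames recovers direction and speed of change.
motion = np.diff(video_input, axis=1)     # (B, T-1, C, H, W)
```

Frame stacking feeds the extra frames to a model that was never trained to relate them; a video backbone's attention was trained across the time axis from the start.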
Three Backbones, One Action Head¶
Qwen2.5-Omni (Default)¶
Alibaba's multimodal model. Processes interleaved images and video with native temporal attention. Also understands audio — the robot can hear your voice and respond.
| Variant | Params | VRAM (4-bit) | Best For |
|---|---|---|---|
| 3B | 3B | ~4 GB | Jetson Orin, rapid iteration, edge deployment |
| 7B | 7B | ~8 GB | Production quality, rich temporal features |
Cosmos-Reason2¶
NVIDIA's Physical AI reasoning model. Fine-tuned specifically for understanding the physical world — gravity, collisions, materials, forces. When the task involves physics, this is the backbone.
| Variant | Params | VRAM (4-bit) | Best For |
|---|---|---|---|
| 2B | 2B | ~3 GB | Edge, lightweight physical reasoning |
| 8B | 8B | ~10 GB | Complex physics scenarios, sim-to-real |
TRIBEv2¶
Meta's native multimodal foundation model. Instead of one omni-model, TRIBEv2 uses dedicated SOTA encoders per modality: V-JEPA2 (video, self-supervised physics), Wav2Vec-BERT (raw audio), LLaMA 3.2 (language). Each encoder is the best at its modality, and their features are fused via a cross-modal Transformer.
| Component | Model | Strength |
|---|---|---|
| Video | V-JEPA2 (ViT-G) | Self-supervised physics, no text supervision needed |
| Audio | Wav2Vec-BERT 2.0 | Raw waveforms — hears urgency, collisions, motor strain |
| Language | LLaMA 3.2 (3B) | Instruction following, large context |
Key advantage: modality dropout during training → the robot keeps working when the camera is occluded or the audio is noisy.
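Modality dropout is simple to sketch: during training, randomly zero out an entire modality's features so the fusion layer cannot become dependent on any single sensor. The function below is an illustrative sketch under that assumption, not TRIBEv2's actual training code; feature widths are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def modality_dropout(features: dict, p: float = 0.3) -> dict:
    """Zero whole modalities with probability p each, so the fusion
    layer learns robust behavior when a sensor drops out."""
    out = {}
    for name, feat in features.items():
        if rng.random() < p:
            out[name] = np.zeros_like(feat)   # e.g. camera occluded
        else:
            out[name] = feat
    return out

batch = {
    "video": rng.normal(size=(2, 1024)),   # illustrative feature widths
    "audio": rng.normal(size=(2, 512)),
    "text":  rng.normal(size=(2, 768)),
}
dropped = modality_dropout(batch)          # same keys and shapes, some zeroed
```

At deployment, a missing sensor then looks like a condition the model has already seen thousands of times in training.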
Freeze the Brain, Train the Spine¶
This is the second key insight. The backbone already understands the world. We don't need to teach it vision or physics or temporal reasoning. We need to teach it what actions to take.
```text
Camera + Voice + Text
          │
          ▼
┌──────────────────────┐
│ Frozen Backbone      │  ← Pre-trained. Already understands motion.
│ 3–7B parameters      │    We touch none of these.
└──────────┬───────────┘
           │
    Pooled features
     (batch, 2048)
           │
           ▼
┌──────────────────────┐
│ Trainable Decoder    │  ← These are trained. ~6M parameters.
│ Action Heads (v2)    │    Arms. Locomotion. Grippers.
└──────────────────────┘
```
Why freeze?
- 6M vs 7B — Train under 0.1% of the parameters instead of 100%. Overnight, not over months
- Less data — Fewer trainable parameters converge with orders of magnitude fewer demonstrations
- Faster training — No gradients backpropagating through 7 billion parameters
- No forgetting — The backbone's understanding of physics survives training intact
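The split can be sketched as a small trainable head on top of frozen pooled features. Only the 2048-wide feature vector and the 29 joint outputs come from this document; the hidden width is an illustrative assumption (this toy head has ~2.1M parameters, where the real action heads total ~6M):

```python
import numpy as np

FEATURE_DIM = 2048   # pooled backbone features: (batch, 2048)
HIDDEN = 1024        # illustrative hidden width, not Neon's actual value
NUM_JOINTS = 29      # joint commands (see Action Space)

# The only trainable tensors: a two-layer MLP head. The backbone's
# billions of parameters are frozen and never enter this count.
W1 = np.zeros((FEATURE_DIM, HIDDEN)); b1 = np.zeros(HIDDEN)
W2 = np.zeros((HIDDEN, NUM_JOINTS)); b2 = np.zeros(NUM_JOINTS)

trainable = sum(p.size for p in (W1, b1, W2, b2))

def action_head(features: np.ndarray) -> np.ndarray:
    """Map pooled features (batch, 2048) to joint commands (batch, 29)."""
    h = np.maximum(features @ W1 + b1, 0.0)   # ReLU
    return h @ W2 + b2

commands = action_head(np.zeros((8, FEATURE_DIM)))   # shape (8, 29)
```

Gradients stop at the pooled features, so training cost and memory scale with the head alone, and the backbone's pre-trained weights cannot be degraded.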
Where Neon Sits¶
| Approach | Visual Encoder | Temporal? | Audio? | Backbone Swap? | Trainable |
|---|---|---|---|---|---|
| RT-2 | ViT | Frame stacking | No | ❌ | Billions |
| Octo | ResNet | Transformer history | No | ❌ | Millions |
| GR00T N1 | Eagle-2 | Dual-system | No | ❌ | Millions |
| OpenVLA | SigLIP | Single frame | No | ❌ | Billions |
| Neon | Any (Qwen/Cosmos/TRIBEv2) | Native temporal | Native | ✅ Swap in hours | 6M |
→ Next: Action Space — the 29 joints that receive these predictions