
Why Action Head, Not VLA?


The Key Insight

Neon is not a Vision-Language-Action model. It's a universal action head — a thin, trainable decoder that sits on top of any frozen foundation model and translates its understanding into robot joint commands.

Why does this matter? Because foundation models are improving faster than any robotics lab can retrain. When Qwen 3.0 drops, when Meta releases V-JEPA3, when NVIDIA ships Cosmos-Predict3 — a VLA team starts over. A Neon user swaps a config line and retrains 6M parameters in hours.

The backbone can be:

  • Qwen2.5-Omni — video + audio + language, native temporal understanding
  • Cosmos-Reason2 — physics, world model, causal reasoning
  • TRIBEv2 — V-JEPA2 (video) + Wav2Vec-BERT (audio) + LLaMA (text), per-modality SOTA

All frozen. All swappable. The action heads don't change.
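To make the swap concrete, here is a minimal sketch of what a backbone-as-config-choice setup could look like. The `PolicyConfig` and `BackboneConfig` names, fields, and model identifiers below are illustrative assumptions, not Neon's actual API; the point is that only the backbone line changes while the action-head spec stays put.

```python
# Hypothetical sketch: the backbone is a config choice, the action head is unchanged.
from dataclasses import dataclass

@dataclass
class BackboneConfig:
    name: str            # which frozen foundation model to wrap
    frozen: bool = True  # never trained by Neon

@dataclass
class PolicyConfig:
    backbone: BackboneConfig
    action_head: str = "v2"   # arms, locomotion, grippers
    action_dim: int = 29      # joint commands (see Action Space)

# Today: Qwen2.5-Omni 7B.
cfg = PolicyConfig(backbone=BackboneConfig(name="qwen2.5-omni-7b"))

# Tomorrow a better backbone ships: swap one line, retrain only the ~6M head parameters.
cfg = PolicyConfig(backbone=BackboneConfig(name="cosmos-reason2-8b"))
```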


Why Video Models Specifically?


The Blind Spot in Robotics

Every major Vision-Language-Action model today — RT-2, Octo, GR00T N1, OpenVLA — uses an image encoder. CLIP. SigLIP. DINOv2. Models trained on photographs. Models that see a frozen moment.

Show an image encoder a robot arm reaching toward a cup. It sees a robot arm near a cup. What it does not see:

  • The direction the arm is moving
  • The speed of approach
  • Whether the cup is sliding or stationary
  • What happened 100 milliseconds ago
  • What is about to happen next

These models try to compensate — stacking frames, adding recurrence, bolting on positional embeddings. But they're teaching temporal reasoning to a model that has never seen time. It's like teaching a photographer to direct a film by showing them more photographs, faster.


What Video Models Already Know

Video foundation models — Qwen2.5-Omni, Cosmos-Reason2 — were trained on millions of hours of the world in motion. They didn't learn to classify objects in isolation. They learned how the world moves:

| What They Learned   | Why a Robot Needs It                                     |
|---------------------|----------------------------------------------------------|
| Object permanence   | The cup is still behind the arm, even when occluded      |
| Physics intuition   | Cups fall when pushed. Water pours down. Stacks topple   |
| Motion prediction   | "What happens next" is literally the training objective  |
| Temporal coherence  | Smooth trajectories, not jerky frame-by-frame guessing   |
| Depth from motion   | 3D understanding from parallax — no depth sensor needed  |

This is exactly the knowledge a robot needs to act in the world. It was sitting there, pre-trained, waiting for someone to connect it to a body.

graph TD
    subgraph "Image Encoder (the status quo)"
        I1["Frame t"] --> IE["CLIP / SigLIP<br/><i>sees a photograph</i>"]
        IE --> F1["Static features<br/><i>where things are</i>"]
    end

    subgraph "Video Model (what Neon uses)"
        V1["Frame t-3"] --> VE["Qwen2.5-Omni<br/><i>sees time</i>"]
        V2["Frame t-2"] --> VE
        V3["Frame t-1"] --> VE
        V4["Frame t"] --> VE
        VE --> F2["Temporal features<br/><i>where things are going</i>"]
    end

    style IE fill:#555,color:#fff
    style VE fill:#e65100,color:#fff
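The practical difference shows up at the input: instead of encoding the latest frame in isolation, the policy keeps a short rolling window and hands the whole clip to the backbone. Below is a rough sketch with assumed tensor shapes and the 4-frame window from the diagram above; the `on_new_frame` helper is hypothetical.

```python
import collections
from typing import Optional

import torch

WINDOW = 4  # frames t-3 .. t, as in the diagram above
frame_buffer: collections.deque = collections.deque(maxlen=WINDOW)

def on_new_frame(frame: torch.Tensor) -> Optional[torch.Tensor]:
    """frame: (3, H, W). Returns a clip tensor once the window is full."""
    frame_buffer.append(frame)
    if len(frame_buffer) < WINDOW:
        return None  # still warming up
    # (T, 3, H, W) -> (1, T, 3, H, W): the video backbone sees motion across
    # the whole window, not an isolated photograph.
    return torch.stack(list(frame_buffer)).unsqueeze(0)
```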

Three Backbones, One Action Head

Qwen2.5-Omni (Default)

Alibaba's multimodal model. Processes interleaved images and video with native temporal attention. Also understands audio — the robot can hear your voice and respond.

| Variant | Params | VRAM (4-bit) | Best For                                        |
|---------|--------|--------------|-------------------------------------------------|
| 3B      | 3B     | ~4 GB        | Jetson Orin, rapid iteration, edge deployment   |
| 7B      | 7B     | ~8 GB        | Production quality, rich temporal features      |

Cosmos-Reason2

NVIDIA's Physical AI reasoning model. Fine-tuned specifically for understanding the physical world — gravity, collisions, materials, forces. When the task involves physics, this is the backbone.

| Variant | Params | VRAM (4-bit) | Best For                                 |
|---------|--------|--------------|------------------------------------------|
| 2B      | 2B     | ~3 GB        | Edge, lightweight physical reasoning     |
| 8B      | 8B     | ~10 GB       | Complex physics scenarios, sim-to-real   |

TRIBEv2

A multimodal backbone built from Meta's foundation models. Instead of one omni-model, TRIBEv2 uses a dedicated SOTA encoder per modality: V-JEPA2 (video, self-supervised physics), Wav2Vec-BERT (raw audio), and LLaMA 3.2 (language). Each encoder is the best at its modality, and their features are fused via a cross-modal Transformer.

| Component | Model             | Strength                                                  |
|-----------|-------------------|-----------------------------------------------------------|
| Video     | V-JEPA2 (ViT-G)   | Self-supervised physics, no text supervision needed       |
| Audio     | Wav2Vec-BERT 2.0  | Raw waveforms — hears urgency, collisions, motor strain   |
| Language  | LLaMA 3.2 (3B)    | Instruction following, large context                      |

Key advantage: Modality dropout during training → robot works when camera is occluded or audio is noisy.
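Here is a minimal sketch of what modality dropout can look like in training code. The function, the drop probability, and the zero-masking scheme are assumptions for illustration, not TRIBEv2's published recipe.

```python
import torch

def modality_dropout(video_feat, audio_feat, text_feat, p=0.2, training=True):
    """Randomly zero out whole modalities during training so the fused
    representation never depends on any single sensor.
    The probability p and zero-masking are illustrative assumptions."""
    feats = [video_feat, audio_feat, text_feat]
    if not training:
        return feats
    dropped = []
    for f in feats:
        keep = torch.rand(()) >= p  # drop this modality with probability p
        dropped.append(f if keep else torch.zeros_like(f))
    # Guard: never drop every modality in the same step.
    if all((d == 0).all() for d in dropped):
        dropped[0] = video_feat
    return dropped
```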


Freeze the Brain, Train the Spine

This is the second key insight. The backbone already understands the world. We don't need to teach it vision or physics or temporal reasoning. We need to teach it what actions to take.

Camera + Voice + Text
┌──────────────────────┐
│   Frozen Backbone    │  ← Pre-trained. Already understands motion.
│   3–7B parameters    │     We touch none of these.
└──────────┬───────────┘
           │ Pooled features
           │ (batch, 2048)
┌──────────┴───────────┐
│  Trainable Decoder   │  ← These are trained. ~6M parameters.
│  Action Heads (v2)   │     Arms. Locomotion. Grippers.
└──────────────────────┘
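A minimal PyTorch sketch of that split, assuming the pooled feature size of 2048 shown above and a 29-dimensional action vector; the module names and hidden width are illustrative, not Neon's actual head definition.

```python
import torch
import torch.nn as nn

BACKBONE_DIM = 2048  # pooled features from the frozen backbone: (batch, 2048)
ACTION_DIM = 29      # joint commands (see Action Space)

class ActionHead(nn.Module):
    """Small trainable decoder: pooled features -> joint commands."""
    def __init__(self, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(BACKBONE_DIM, hidden),
            nn.GELU(),
            nn.Linear(hidden, ACTION_DIM),
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.net(pooled)

def freeze(backbone: nn.Module) -> nn.Module:
    """The backbone contributes features only; no gradients flow into it."""
    for p in backbone.parameters():
        p.requires_grad = False
    return backbone.eval()

head = ActionHead()
trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
print(f"trainable head parameters: {trainable:,}")  # millions, not billions
```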

Why freeze?

  1. 6M vs 7B — Train 0.08% of the parameters instead of 100%. Overnight, not over months
  2. Less data — Fewer trainable parameters mean faster convergence with orders of magnitude less data
  3. Faster training — No gradients backpropagating through 7 billion parameters
  4. No forgetting — The backbone's understanding of physics survives training intact

Where Neon Sits

| Approach | Visual Encoder            | Temporal?           | Audio? | Backbone Swap?   | Trainable |
|----------|---------------------------|---------------------|--------|------------------|-----------|
| RT-2     | ViT                       | Frame stacking      | No     | No               | Billions  |
| Octo     | ResNet                    | Transformer history | No     | No               | Millions  |
| GR00T N1 | Eagle-2                   | Dual-system         | No     | No               | Millions  |
| OpenVLA  | SigLIP                    | Single frame        | No     | No               | Billions  |
| Neon     | Any (Qwen/Cosmos/TRIBEv2) | Native temporal     | Native | ✅ Swap in hours | 6M        |

Next: Action Space — the 29 joints that receive these predictions