
Why Action Head, Not VLA?


The Key Insight

Neon is not a Vision-Language-Action model. It's a universal action head — a thin, trainable decoder that sits on top of any frozen foundation model and translates its understanding into robot joint commands.

Why does this matter? Because foundation models are improving faster than any robotics lab can retrain. When Qwen 3.0 drops, when Meta releases V-JEPA3, when NVIDIA ships Cosmos-Predict3 — a VLA team starts over. A Neon user swaps a config line and retrains 6M parameters in hours.

The backbone can be:

  • Qwen2.5-Omni — video + audio + language, native temporal understanding
  • Cosmos-Reason2 — physics, world model, causal reasoning
  • TRIBEv2 — V-JEPA2 (video) + Wav2Vec-BERT (audio) + LLaMA (text), per-modality SOTA

All frozen. All swappable. The action heads don't change.
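To make the swap concrete, here is a minimal sketch of what a backbone-as-config-choice setup could look like. The `PolicyConfig` and `BackboneConfig` names, fields, and model identifiers below are illustrative assumptions, not Neon's actual API; the point is that only the backbone line changes while the action-head spec stays put.

```python
# Hypothetical sketch: the backbone is a config choice, the action head is unchanged.
from dataclasses import dataclass

@dataclass
class BackboneConfig:
    name: str            # which frozen foundation model to wrap
    frozen: bool = True  # never trained by Neon

@dataclass
class PolicyConfig:
    backbone: BackboneConfig
    action_head: str = "v2"   # arms, locomotion, grippers
    action_dim: int = 29      # joint commands (see Action Space)

# Today: Qwen2.5-Omni 7B.
cfg = PolicyConfig(backbone=BackboneConfig(name="qwen2.5-omni-7b"))

# Tomorrow a better backbone ships: swap one line, retrain only the ~6M head parameters.
cfg = PolicyConfig(backbone=BackboneConfig(name="cosmos-reason2-8b"))
```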


Why Video Models Specifically?


The Blind Spot in Robotics

Every major Vision-Language-Action model today — RT-2, Octo, GR00T N1, OpenVLA — uses an image encoder. CLIP. SigLIP. DINOv2. Models trained on photographs. Models that see a frozen moment.

Show an image encoder a robot arm reaching toward a cup. It sees a robot arm near a cup. What it does not see:

  • The direction the arm is moving
  • The speed of approach
  • Whether the cup is sliding or stationary
  • What happened 100 milliseconds ago
  • What is about to happen next

These models try to compensate — stacking frames, adding recurrence, bolting on positional embeddings. But they're teaching temporal reasoning to a model that has never seen time. It's like teaching a photographer to direct a film by showing them more photographs, faster.


What Video Models Already Know

Video foundation models — Qwen2.5-Omni, Cosmos-Reason2 — were trained on millions of hours of the world in motion. They didn't learn to classify objects in isolation. They learned how the world moves:

| What They Learned   | Why a Robot Needs It                                     |
|---------------------|----------------------------------------------------------|
| Object permanence   | The cup is still behind the arm, even when occluded      |
| Physics intuition   | Cups fall when pushed. Water pours down. Stacks topple   |
| Motion prediction   | "What happens next" is literally the training objective  |
| Temporal coherence  | Smooth trajectories, not jerky frame-by-frame guessing   |
| Depth from motion   | 3D understanding from parallax — no depth sensor needed  |

This is exactly the knowledge a robot needs to act in the world. It was sitting there, pre-trained, waiting for someone to connect it to a body.

graph TD
    subgraph "Image Encoder (the status quo)"
        I1["Frame t"] --> IE["CLIP / SigLIP<br/><i>sees a photograph</i>"]
        IE --> F1["Static features<br/><i>where things are</i>"]
    end

    subgraph "Video Model (what Neon uses)"
        V1["Frame t-3"] --> VE["Qwen2.5-Omni<br/><i>sees time</i>"]
        V2["Frame t-2"] --> VE
        V3["Frame t-1"] --> VE
        V4["Frame t"] --> VE
        VE --> F2["Temporal features<br/><i>where things are going</i>"]
    end

    style IE fill:#555,color:#fff
    style VE fill:#e65100,color:#fff
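The practical difference shows up at the input: instead of encoding the latest frame in isolation, the policy keeps a short rolling window and hands the whole clip to the backbone. Below is a rough sketch with assumed tensor shapes and the 4-frame window from the diagram above; the `on_new_frame` helper is hypothetical.

```python
import collections
from typing import Optional

import torch

WINDOW = 4  # frames t-3 .. t, as in the diagram above
frame_buffer: collections.deque = collections.deque(maxlen=WINDOW)

def on_new_frame(frame: torch.Tensor) -> Optional[torch.Tensor]:
    """frame: (3, H, W). Returns a clip tensor once the window is full."""
    frame_buffer.append(frame)
    if len(frame_buffer) < WINDOW:
        return None  # still warming up
    # (T, 3, H, W) -> (1, T, 3, H, W): the video backbone sees motion across
    # the whole window, not an isolated photograph.
    return torch.stack(list(frame_buffer)).unsqueeze(0)
```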

Three Backbones, One Action Head

Qwen2.5-Omni (Default)

Alibaba's multimodal model. Processes interleaved images and video with native temporal attention. Also understands audio — the robot can hear your voice and respond.

| Variant | Params | VRAM (4-bit) | Best For                                        |
|---------|--------|--------------|-------------------------------------------------|
| 3B      | 3B     | ~4 GB        | Jetson Orin, rapid iteration, edge deployment   |
| 7B      | 7B     | ~8 GB        | Production quality, rich temporal features      |

Cosmos-Reason2

NVIDIA's Physical AI reasoning model. Fine-tuned specifically for understanding the physical world — gravity, collisions, materials, forces. When the task involves physics, this is the backbone.

| Variant | Params | VRAM (4-bit) | Best For                                 |
|---------|--------|--------------|------------------------------------------|
| 2B      | 2B     | ~3 GB        | Edge, lightweight physical reasoning     |
| 8B      | 8B     | ~10 GB       | Complex physics scenarios, sim-to-real   |

TRIBEv2

A multimodal backbone built from Meta's foundation models. Instead of one omni-model, TRIBEv2 uses a dedicated SOTA encoder per modality: V-JEPA2 (video, self-supervised physics), Wav2Vec-BERT (raw audio), and LLaMA 3.2 (language). Each encoder is the best at its modality, and their features are fused via a cross-modal Transformer.

| Component | Model             | Strength                                                  |
|-----------|-------------------|-----------------------------------------------------------|
| Video     | V-JEPA2 (ViT-G)   | Self-supervised physics, no text supervision needed       |
| Audio     | Wav2Vec-BERT 2.0  | Raw waveforms — hears urgency, collisions, motor strain   |
| Language  | LLaMA 3.2 (3B)    | Instruction following, large context                      |

Key advantage: Modality dropout during training → robot works when camera is occluded or audio is noisy.
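Here is a minimal sketch of what modality dropout can look like in training code. The function, the drop probability, and the zero-masking scheme are assumptions for illustration, not TRIBEv2's published recipe.

```python
import torch

def modality_dropout(video_feat, audio_feat, text_feat, p=0.2, training=True):
    """Randomly zero out whole modalities during training so the fused
    representation never depends on any single sensor.
    The probability p and zero-masking are illustrative assumptions."""
    feats = [video_feat, audio_feat, text_feat]
    if not training:
        return feats
    dropped = []
    for f in feats:
        keep = torch.rand(()) >= p  # drop this modality with probability p
        dropped.append(f if keep else torch.zeros_like(f))
    # Guard: never drop every modality in the same step.
    if all((d == 0).all() for d in dropped):
        dropped[0] = video_feat
    return dropped
```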


Freeze the Brain, Train the Spine

This is the second key insight. The backbone already understands the world. We don't need to teach it vision or physics or temporal reasoning. We need to teach it what actions to take.

Camera + Voice + Text
┌──────────────────────┐
│   Frozen Backbone    │  ← Pre-trained. Already understands motion.
│   3–7B parameters    │     We touch none of these.
└──────────┬───────────┘
           │ Pooled features
           │ (batch, 2048)
┌──────────┴───────────┐
│  Trainable Decoder   │  ← These are trained. ~6M parameters.
│  Action Heads (v2)   │     Arms. Locomotion. Grippers.
└──────────────────────┘
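A minimal PyTorch sketch of that split, assuming the pooled feature size of 2048 shown above and a 29-dimensional action vector; the module names and hidden width are illustrative, not Neon's actual head definition.

```python
import torch
import torch.nn as nn

BACKBONE_DIM = 2048  # pooled features from the frozen backbone: (batch, 2048)
ACTION_DIM = 29      # joint commands (see Action Space)

class ActionHead(nn.Module):
    """Small trainable decoder: pooled features -> joint commands."""
    def __init__(self, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(BACKBONE_DIM, hidden),
            nn.GELU(),
            nn.Linear(hidden, ACTION_DIM),
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.net(pooled)

def freeze(backbone: nn.Module) -> nn.Module:
    """The backbone contributes features only; no gradients flow into it."""
    for p in backbone.parameters():
        p.requires_grad = False
    return backbone.eval()

head = ActionHead()
trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
print(f"trainable head parameters: {trainable:,}")  # millions, not billions
```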

Why freeze?

  1. 6M vs 7B — Train 0.08% of the parameters instead of 100%. Overnight, not over months
  2. Less data — Fewer trainable parameters mean faster convergence with orders of magnitude less data
  3. Faster training — No gradients backpropagating through 7 billion parameters
  4. No forgetting — The backbone's understanding of physics survives training intact

Where Neon Sits

| Approach | Visual Encoder            | Temporal?           | Audio? | Backbone Swap?   | Trainable |
|----------|---------------------------|---------------------|--------|------------------|-----------|
| RT-2     | ViT                       | Frame stacking      | No     | No               | Billions  |
| Octo     | ResNet                    | Transformer history | No     | No               | Millions  |
| GR00T N1 | Eagle-2                   | Dual-system         | No     | No               | Millions  |
| OpenVLA  | SigLIP                    | Single frame        | No     | No               | Billions  |
| Neon     | Any (Qwen/Cosmos/TRIBEv2) | Native temporal     | Native | ✅ Swap in hours | 6M        |

Next: Action Space — the 29 joints that receive these predictions