# Neon Is Not a VLA. It's the Layer That Makes Any Foundation Model Into One.
Every VLA today is a monolith. Backbone and decoder, welded together.
Neon inverts this. The backbone is a plug. The action head is the product.
We take any frozen foundation model — Qwen2.5-Omni (video+audio), Cosmos-Reason2 (physics), TRIBEv2 (V-JEPA2+Wav2Vec+LLaMA) — and train a tiny action decoder on top. ~6 million parameters. 0.08% of the total. That decoder translates the backbone's understanding into 29 joint commands for a humanoid body, 16 timesteps into the future.
Swap the eyes, keep the hands.
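In code, the contract is simple: backbone features in, a joint-trajectory chunk out. A minimal numpy sketch of such a head follows; all sizes and layer shapes here are illustrative, not Neon's actual decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, HID, DOF, HORIZON = 2048, 512, 29, 16   # illustrative sizes

# A tiny two-layer head on top of frozen-backbone features.
W1 = rng.standard_normal((FEAT, HID)) * 0.02
W2 = rng.standard_normal((HID, DOF * HORIZON)) * 0.02

def decode(feats):                              # feats: (B, FEAT), from the backbone
    h = np.maximum(feats @ W1, 0.0)             # hidden activations (ReLU)
    return (h @ W2).reshape(-1, HORIZON, DOF)   # a 16-step chunk of 29 joint targets

actions = decode(rng.standard_normal((2, FEAT)))
print(actions.shape)   # (2, 16, 29): batch, timesteps, joints
```

Everything above the head stays frozen; only `W1` and `W2` (and their real-world equivalents) ever see a gradient.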
## The Architecture
```mermaid
graph TD
    subgraph "Backbone (frozen, swappable)"
        Q["Qwen2.5-Omni<br/>Video + Audio + Language"]
        C["Cosmos-Reason2<br/>Physics + World Model"]
        T["TRIBEv2<br/>V-JEPA2 + Wav2Vec + LLaMA"]
    end
    subgraph "Sensor Encoders (trainable)"
        PE["Proprio<br/>MLP"]
        LE["LiDAR<br/>PointNet"]
        EE["EEF<br/>MLP"]
    end
    subgraph "Neon Action Layer (trainable, ~6M)"
        FUS["Feature Fusion"]
        AH["Action Heads<br/>MLP / Flow / DiT / StateRelative / Ensemble"]
    end
    Q --> FUS
    C --> FUS
    T --> FUS
    PE --> FUS
    LE --> FUS
    EE --> FUS
    FUS --> AH
    AH --> OUT["29 DoF × 16 steps + Speech"]
    style Q fill:#0097a7,color:#fff
    style C fill:#6a1b9a,color:#fff
    style T fill:#1b5e20,color:#fff
    style AH fill:#e65100,color:#fff
    style FUS fill:#333,color:#fff
```
## Why an Action Head Framework?
| | Traditional VLA | Neon |
|---|---|---|
| New backbone drops | Retrain everything | Swap config, retrain heads (~hours) |
| Physics understanding | Hope the model learned it | Plug in Cosmos, physics is pre-trained |
| Audio | Separate pipeline | Native — Qwen hears, TRIBEv2 hears, already fused |
| Edge vs cloud | Ship the whole thing | Same heads, smaller backbone |
| Self-improvement | Offline retraining | Online RL on action heads, backbone frozen |
| Trainable params | Often billions | 6M — 0.08% |
## Three Ways In
## 5 Action Head Types
The action head is the product. Choose the right one for the job:
| Head | Params | Speed | Accuracy | Best For |
|---|---|---|---|---|
| MLP | ~2M | ⚡⚡⚡ | ★★★ | Edge deployment |
| FlowMatching | ~4M | ⚡⚡ | ★★★★ | Smooth trajectories |
| DiT | ~6M | ⚡ | ★★★★★ | Maximum precision |
| StateRelative | ~3M | ⚡⚡⚡ | ★★★★ | Cross-embodiment |
| Ensemble | ~8M | ⚡ | ★★★★★ | Gated routing |
Built with Parameter Golf v2 — ReLU², RMSNorm, soft-capping, U-Net skips, speculative decoding.
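Most of those tricks are one-liners. Minimal numpy versions of three of them are below; these are the common forms from the literature, and Neon's exact variants may differ.

```python
import numpy as np

def relu2(x):
    """ReLU-squared: like ReLU, but smooth-ish growth above zero."""
    return np.maximum(x, 0.0) ** 2

def rmsnorm(x, eps=1e-6):
    """RMSNorm: scale by root-mean-square, no mean subtraction, no bias."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def soft_cap(logits, cap=30.0):
    """Logit soft-capping: smoothly bound values to (-cap, cap) via tanh."""
    return cap * np.tanh(logits / cap)

print(relu2(np.array([-2.0, 0.5, 3.0])))      # [0.   0.25 9.  ]
print(soft_cap(np.array([100.0])))            # stays below 30
```

Soft-capping in particular keeps extreme logits from dominating without the hard cliff of clipping, which matters when the head's outputs feed directly into joint commands.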
## Self-Learning Loop
Because the backbone is frozen, action heads can learn online — no human labels needed:
```mermaid
graph LR
    A["Observe"] --> B["Predict"]
    B --> C["Act (16 steps)"]
    C --> D["Observe Again"]
    D --> E["Compute Loss"]
    E --> F["Update Heads Only"]
    F --> A
    style F fill:#e65100,color:#fff
```
Three self-supervised losses: outcome prediction, temporal coherence, action consistency. EWC prevents forgetting. Deep dive →
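The EWC side of that loop can be sketched in a few lines. The three loss terms are named above; their exact forms, and the Fisher weights here, are assumptions for illustration.

```python
import numpy as np

theta_star = np.array([0.5, -0.2])          # head params after pretraining (toy)
theta = theta_star + np.array([0.3, -0.1])  # params drifted by online updates
fisher = np.array([1.0, 0.5])               # per-param importance (assumed)
lam, lr = 0.5, 0.1

def ewc_penalty(th):
    """Quadratic pull toward the pretrained params, weighted by importance."""
    return 0.5 * lam * np.sum(fisher * (th - theta_star) ** 2)

before = ewc_penalty(theta)
grad = lam * fisher * (theta - theta_star)  # gradient of the penalty
theta = theta - lr * grad                   # task-loss gradients omitted for brevity
after = ewc_penalty(theta)
print(before, after)   # penalty shrinks: the heads stay anchored near pretraining
```

In the real loop this penalty is added to the three self-supervised losses, so online updates improve the heads without drifting away from what supervised pretraining established.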
## What's Inside
### Swappable Backbones
Qwen2.5-Omni, Cosmos-Reason2, TRIBEv2. Frozen. 3-10B parameters. When a new SOTA drops, swap the config, retrain 6M params.
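The swap is a config edit, not a rewrite. A hypothetical sketch of the idea, where the keys are illustrative and not Neon's actual schema:

```python
# Toy config: the backbone entry changes, the trainable head stack does not.
config = {
    "backbone": {"name": "qwen2.5-omni", "frozen": True},
    "heads": ["mlp", "flow_matching"],          # the ~6M trainable params live here
}

def swap_backbone(cfg, name):
    """Return a config with a new backbone; heads are reused as-is."""
    return {**cfg, "backbone": {**cfg["backbone"], "name": name}}

upgraded = swap_backbone(config, "cosmos-reason2")
print(upgraded["backbone"]["name"], upgraded["heads"])
```

The retraining cost after a swap is then bounded by the head parameters, not the backbone's billions.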
### 5 Action Head Types
MLP, FlowMatching, DiT, StateRelative, Ensemble — from fast edge inference to maximum accuracy with gated routing.
### Self-Learning Engine
Online RL adaptation. The backbone watches, the heads learn. Robot gets better with every action. Zero human labels.
### Parameter Golf v2
ReLU², RMSNorm, learnable residual scales, U-Net skip connections, logit soft-capping, speculative decoding.
### 29 DoF Humanoid
Unitree G1. Three progressive control modes. Normalization, safety limits, per-embodiment adaptation.
### Data Soup (15 sources)
LeRobot, Agibot, Cosmos DreamGen, GR00T, BONES, Kimodo — cross-embodiment mapping via relative actions.
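Relative actions are the standard trick for making trajectories transfer across bodies: store each command as a delta from the current joint state rather than an absolute target. A toy sketch (the exact scheme Neon uses may differ):

```python
import numpy as np

def to_relative(abs_actions, state):
    """Convert absolute joint targets (T, D) to deltas from the current state (D,)."""
    return abs_actions - state

def to_absolute(rel_actions, state):
    """Replay a stored delta trajectory from a (possibly different) start state."""
    return rel_actions + state

state_a = np.zeros(29)                              # robot A's current pose
traj = np.cumsum(np.full((16, 29), 0.01), axis=0)   # toy absolute trajectory on A
rel = to_relative(traj, state_a)

state_b = np.full(29, 0.2)                          # robot B, differently posed
replayed = to_absolute(rel, state_b)
print(np.round(replayed[0][0], 2))                  # 0.21: same motion, new start
```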
### Physics Guidance
AirVLA-style gradient corrections at inference. Inject physics constraints without retraining. Works with any flow head.
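The core move is plain gradient descent on a differentiable physics penalty, applied to the sampled actions at inference time, with no weights touched. A sketch of the idea; the constraint, bound, and step size here are illustrative.

```python
import numpy as np

LIMIT = 1.0                                  # joint-velocity bound (assumed)

def penalty_grad(actions):
    """Gradient of sum(max(|a| - LIMIT, 0)^2): zero inside the limit."""
    excess = np.abs(actions) - LIMIT
    return 2.0 * np.sign(actions) * np.maximum(excess, 0.0)

actions = np.array([0.5, 1.4, -1.8])         # raw head output (toy)
for _ in range(50):                          # a few correction steps at inference
    actions = actions - 0.1 * penalty_grad(actions)
print(np.round(actions, 2))                  # violators pulled inside the limit
```

Actions already inside the limit get a zero gradient and pass through untouched, which is why this composes cleanly with any flow head.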
### Latent World Model
LeWM for imagination rollouts. 48× faster planning than foundation-model world models. MPC in latent space.
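Latent MPC amounts to rolling candidate action sequences through a small learned dynamics model and keeping the cheapest. A toy version with a made-up linear latent dynamics (LeWM itself is a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.eye(4) * 0.9                          # latent transition (assumed)
B = rng.standard_normal((4, 2)) * 0.1        # action effect on the latent (assumed)
goal = np.ones(4)                            # target latent state

def rollout_cost(z, actions):
    """Imagine a plan: step the latent forward, score distance to goal."""
    for a in actions:                        # actions: (H, 2)
        z = A @ z + B @ a
    return np.sum((z - goal) ** 2)

z0 = np.zeros(4)
candidates = rng.standard_normal((64, 8, 2))           # 64 plans, horizon 8
costs = [rollout_cost(z0, c) for c in candidates]
best = candidates[int(np.argmin(costs))]
print(best.shape)                            # (8, 2): the chosen action sequence
```

Because every imagination step is a small matrix multiply rather than a foundation-model forward pass, sampling many plans per control tick becomes affordable.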
## By The Numbers
- **3** backbones (swappable)
- **5** action head types
- **~6M** trainable params (0.08%)
- **29** DoF
- **16**-step action chunking
- **6** input modalities
- **50 ms** on Jetson
- **639** tests
- **24** training presets
- **15** data sources
## Navigate
- Installation — pip install in 10 seconds
- Quickstart — from zero to robot actions
- Why Video Models — the temporal insight
- Backbones — Qwen, Cosmos, TRIBEv2
- Action Heads — 5 types, all tricks
- Training — 24 presets, any GPU
- Self-Learning — online RL adaptation
- Action Space — 29 joints, explained