
Neon Is Not a VLA. It's the Layer That Makes Any Foundation Model Into One.

Every VLA today is a monolith. Backbone and decoder, welded together.

Neon inverts this. The backbone is a plug. The action head is the product.

We take any frozen foundation model — Qwen2.5-Omni (video+audio), Cosmos-Reason2 (physics), TRIBEv2 (V-JEPA2+Wav2Vec+LLaMA) — and train a tiny action decoder on top. ~6 million parameters. 0.08% of the total. That decoder translates the backbone's understanding into 29 joint commands for a humanoid body, 16 timesteps into the future.

Swap the eyes, keep the hands.
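To make the scale concrete, here is a minimal sketch of the idea — not the neon-vla implementation, and `feat_dim` is a placeholder for the backbone's hidden size — showing a small trainable decoder that maps frozen backbone features to a 16-step chunk of 29 joint commands:

```python
import torch
import torch.nn as nn

# Illustrative sketch only: a tiny trainable decoder on top of a frozen
# backbone. feat_dim and hidden are assumed values, not neon-vla's.
class TinyActionDecoder(nn.Module):
    def __init__(self, feat_dim=3584, hidden=512, dof=29, horizon=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dof * horizon),
        )
        self.dof, self.horizon = dof, horizon

    def forward(self, features):
        # features: (batch, feat_dim), produced by the frozen backbone
        out = self.net(features)
        return out.view(-1, self.horizon, self.dof)  # (batch, 16, 29)
```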

```bash
pip install neon-vla
```

The Architecture

```mermaid
graph TD
    subgraph "Backbone (frozen, swappable)"
        Q["Qwen2.5-Omni<br/>Video + Audio + Language"]
        C["Cosmos-Reason2<br/>Physics + World Model"]
        T["TRIBEv2<br/>V-JEPA2 + Wav2Vec + LLaMA"]
    end

    subgraph "Sensor Encoders (trainable)"
        PE["Proprio<br/>MLP"]
        LE["LiDAR<br/>PointNet"]
        EE["EEF<br/>MLP"]
    end

    subgraph "Neon Action Layer (trainable, ~6M)"
        FUS["Feature Fusion"]
        AH["Action Heads<br/>MLP / Flow / DiT / StateRelative / Ensemble"]
    end

    Q --> FUS
    C --> FUS
    T --> FUS
    PE --> FUS
    LE --> FUS
    EE --> FUS
    FUS --> AH
    AH --> OUT["29 DoF × 16 steps + Speech"]

    style Q fill:#0097a7,color:#fff
    style C fill:#6a1b9a,color:#fff
    style T fill:#1b5e20,color:#fff
    style AH fill:#e65100,color:#fff
    style FUS fill:#333,color:#fff
```
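A rough sketch of the fusion stage in the diagram, assuming concatenation plus a linear projection (the dimensions here are placeholders, not neon-vla's actual values):

```python
import torch
import torch.nn as nn

# Assumed fusion scheme for illustration: concatenate frozen backbone
# features with trainable sensor-encoder outputs, project to one
# embedding that the action heads consume.
class FeatureFusion(nn.Module):
    def __init__(self, backbone_dim=3584, proprio_dim=64,
                 lidar_dim=128, eef_dim=32, fused_dim=512):
        super().__init__()
        in_dim = backbone_dim + proprio_dim + lidar_dim + eef_dim
        self.proj = nn.Linear(in_dim, fused_dim)

    def forward(self, backbone_feat, proprio, lidar, eef):
        fused = torch.cat([backbone_feat, proprio, lidar, eef], dim=-1)
        return self.proj(fused)  # (batch, fused_dim) -> action heads
```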

Why an Action Head Framework?

| | Traditional VLA | Neon |
|---|---|---|
| New backbone drops | Retrain everything | Swap config, retrain heads (~hours) |
| Physics understanding | Hope the model learned it | Plug in Cosmos; physics is pre-trained |
| Audio | Separate pipeline | Native: Qwen hears, TRIBEv2 hears, already fused |
| Edge vs. cloud | Ship the whole thing | Same heads, smaller backbone |
| Self-improvement | Offline retraining | Online RL on action heads, backbone frozen |
| Trainable params | Often billions | ~6M (0.08%) |

Three Ways In

As a Python library:

```python
from neon.model.neon_vla import NeonVLA, NeonConfig

config = NeonConfig(
    backbone="Qwen/Qwen2.5-Omni-7B",  # or Cosmos, TRIBEv2
    action_head_type="flow",          # or mlp, dit, ensemble
)
model = NeonVLA(config)
model.load_backbone()
output = model.predict(image=frame, instruction="Pick up the cup")
```

As a server:

```bash
neon-serve --model cagataydev/neon-g1-v1 --port 8300
```

Through an agent:

```python
from strands import Agent
from strands_robots import Robot

robot = Robot("g1")
agent = Agent(tools=[robot])
agent("Pick up the red cube")
```

5 Action Head Types

The action head is the product. Choose the right one for the job:

| Head | Params | Speed | Accuracy | Best For |
|---|---|---|---|---|
| MLP | ~2M | ⚡⚡⚡ | ★★★ | Edge deployment |
| FlowMatching | ~4M | ⚡⚡ | ★★★★ | Smooth trajectories |
| DiT | ~6M | | ★★★★★ | Maximum precision |
| StateRelative | ~3M | ⚡⚡⚡ | ★★★★ | Cross-embodiment |
| Ensemble | ~8M | | ★★★★★ | Gated routing |

Built with Parameter Golf v2 — ReLU², RMSNorm, soft-capping, U-Net skips, speculative decoding.
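For readers unfamiliar with these components, here are minimal sketches of three of them; the exact formulations inside neon-vla may differ:

```python
import torch
import torch.nn as nn

# Illustrative sketches of Parameter Golf-style components; the library's
# actual implementations may use different details.
class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by RMS only: no mean subtraction, no bias
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

def relu_squared(x):
    return torch.relu(x) ** 2  # ReLU²: squared rectifier activation

def soft_cap(logits, cap=30.0):
    # Soft-capping: squash values into (-cap, cap) without a hard clamp
    return cap * torch.tanh(logits / cap)
```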


Self-Learning Loop

Because the backbone is frozen, action heads can learn online — no human labels needed:

```mermaid
graph LR
    A["Observe"] --> B["Predict"]
    B --> C["Act (16 steps)"]
    C --> D["Observe Again"]
    D --> E["Compute Loss"]
    E --> F["Update Heads Only"]
    F --> A

    style F fill:#e65100,color:#fff
```

Three self-supervised losses: outcome prediction, temporal coherence, action consistency. EWC (elastic weight consolidation) prevents catastrophic forgetting. Deep dive →
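A hypothetical sketch of one loop iteration, assuming the head exposes an outcome-prediction method (`predict_outcome` is an invented name, and the loss forms and weights are placeholders, not neon-vla's):

```python
import torch
import torch.nn.functional as F

# Sketch only: one head-only update step. The backbone is frozen, so
# gradients flow into the action head's parameters alone.
def self_supervised_step(head, optimizer, feats_t, actions, feats_t1):
    # Outcome prediction: did the action lead to the observed next features?
    outcome = F.mse_loss(head.predict_outcome(feats_t, actions), feats_t1)
    # Temporal coherence: penalize jerky step-to-step changes in the chunk
    coherence = (actions[:, 1:] - actions[:, :-1]).pow(2).mean()
    # Action consistency: the head should re-predict the actions it took
    consistency = F.mse_loss(head(feats_t), actions)
    loss = outcome + 0.1 * coherence + 0.1 * consistency
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # updates head params only
    return loss.item()
```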


What's Inside

Swappable Backbones

Qwen2.5-Omni, Cosmos-Reason2, TRIBEv2. Frozen. 3-10B parameters. When a new SOTA drops, swap the config, retrain 6M params.

Backbones · TRIBEv2

5 Action Head Types

MLP, FlowMatching, DiT, StateRelative, Ensemble — from fast edge inference to maximum accuracy with gated routing.

Action Heads

Self-Learning Engine

Online RL adaptation. The backbone watches, the heads learn. Robot gets better with every action. Zero human labels.

Self-Learning

Parameter Golf v2

ReLU², RMSNorm, learnable residual scales, U-Net skip connections, logit soft-capping, speculative decoding.

Parameter Golf

29 DoF Humanoid

Unitree G1. Three progressive control modes. Normalization, safety limits, per-embodiment adaptation.

Action Space

Data Soup (15 sources)

LeRobot, Agibot, Cosmos DreamGen, GR00T, BONES, Kimodo — cross-embodiment mapping via relative actions.

Data Soup
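The relative-action idea is simple enough to sketch; this illustrates the concept, not neon-vla's mapping code:

```python
import numpy as np

# Cross-embodiment mapping via relative actions: commands become per-step
# deltas from the current joint state, so trajectories from robots with
# different rest poses can share one action space.
def to_relative(abs_traj):
    # abs_traj: (T, dof) absolute joint targets -> (T, dof) deltas
    return np.diff(abs_traj, axis=0, prepend=abs_traj[:1])

def to_absolute(deltas, q0):
    # Invert: integrate deltas starting from current joint state q0
    return q0 + np.cumsum(deltas, axis=0)
```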

Physics Guidance

AirVLA-style gradient corrections at inference. Inject physics constraints without retraining. Works with any flow head.

Brain Fusion
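As a sketch of the concept (a joint-limit penalty chosen purely for illustration, not AirVLA's actual objective), inference-time guidance amounts to a few gradient steps on the sampled action chunk:

```python
import torch

# Illustrative gradient guidance: nudge a sampled action chunk down the
# gradient of a physics penalty, with no retraining of the model.
def guide(actions, lower, upper, steps=5, lr=0.05):
    actions = actions.clone().requires_grad_(True)  # (horizon, dof)
    for _ in range(steps):
        violation = (torch.relu(actions - upper) ** 2
                     + torch.relu(lower - actions) ** 2).sum()
        (grad,) = torch.autograd.grad(violation, actions)
        actions = (actions - lr * grad).detach().requires_grad_(True)
    return actions.detach()
```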

Latent World Model

LeWM for imagination rollouts. 48× faster planning than foundation-model world models. MPC in latent space.

Architecture
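A toy sketch of what MPC over imagined latent rollouts can look like; `encode` and `step` are assumed interfaces for illustration, not LeWM's API:

```python
import torch

# Random-shooting MPC in latent space: sample candidate action chunks,
# roll each through the latent dynamics model, execute the best one.
def latent_mpc(wm, obs, samples=64, horizon=16, dof=29):
    z = wm.encode(obs).expand(samples, -1)   # hypothetical encoder, (1, d)
    chunks = torch.randn(samples, horizon, dof)
    returns = torch.zeros(samples)
    for t in range(horizon):
        z, reward = wm.step(z, chunks[:, t])  # imagined one-step rollout
        returns += reward
    return chunks[returns.argmax()]           # best 16-step chunk
```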


By The Numbers

- 3 backbones (swappable)
- 5 action head types
- ~6M trainable params (0.08%)
- 29 DoF
- 16-step chunking
- 6 input modalities
- 50 ms on Jetson
- 639 tests
- 24 training presets
- 15 data sources