Training

Everything you need to train your own Neon model — from a single command on HuggingFace to a fully customized multi-source data soup on your own GPU.


What Trains, What Doesn't

The video backbone is frozen. Only the small stuff trains — and that's the entire point:

graph TD
    subgraph "Frozen (3-7B params)"
        BB["Video Backbone<br/>Qwen2.5-Omni / Cosmos<br/>4-bit quantized"]
    end

    subgraph "Trainable"
        FUS["Fusion Layer"]
        PE["Proprioception Encoder"]
        AH["Action Heads<br/>Parameter Golf v2"]
        AE["Audio Encoder Projection"]
        LE["LiDAR PointCloud Encoder"]
        EE["EEF State Encoder"]
        SH["Speech Response Head"]
    end

    BB --> FUS
    PE --> FUS
    AE --> FUS
    LE --> FUS
    EE --> FUS
    FUS --> AH
    FUS --> SH

    style BB fill:#1565c0,color:#fff
    style AH fill:#e65100,color:#fff

We don't teach the model to see — the video backbone already sees. We teach it to act.
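Concretely, the split comes down to which parameters keep requires_grad. A minimal PyTorch sketch, not Neon's trainer code (the "backbone." prefix is an assumption about parameter naming):

import torch

def freeze_backbone(model: torch.nn.Module) -> None:
    """Freeze everything under the video backbone; leave fusion, encoders, and heads trainable."""
    for name, param in model.named_parameters():
        # Assumption: backbone parameters live under a "backbone." prefix.
        param.requires_grad = not name.startswith("backbone.")

def count_trainable(model: torch.nn.Module) -> int:
    """The number reported as 'Trainable' in the preset tables below."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)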


Choose Your Size

Neon ships with 24 training presets at two scales:

Standard (~7M trainable)

Fast to train, good for iteration and edge deployment.

| Preset | Backbone | Mode | Trainable | GPU | Time |
|---|---|---|---|---|---|
| default_arms_only | Qwen2.5-Omni-7B | arms_only | 7.3M | L4 / RTX 4090 | ~2h |
| default_wholebody | Qwen2.5-Omni-7B | whole_body | ~10M | RTX 4090 | ~3h |
| cosmos_physics | Cosmos-Reason2-8B | arms_only | ~8M | A100 | ~4h |
| edge_3b | Qwen2.5-Omni-3B | arms_only | ~2M | L4 / Jetson | ~1h |

Large (~44M trainable) — GR00T-Dreams Scale

Publication quality. Matches NVIDIA GR00T-Dreams' 42M-parameter action head.

| Preset | Backbone | Mode | Trainable | GPU | Time |
|---|---|---|---|---|---|
| large_arms | Qwen2.5-Omni-7B | arms_only | 44M | A100 40GB | ~6h |
| large_cosmos | Cosmos-Reason2-8B | arms_only | 44M | A100 40GB | ~6h |
| large_wholebody | Qwen2.5-Omni-7B | whole_body | 55M | A100 80GB | ~10h |

The large configs use mlp_hidden=2048, action_head_layers=8, and proprioception_hidden=512 — 4× wider and 2.7× deeper than the standard heads.
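In config terms that looks roughly like the snippet below. The attribute paths follow the TrainConfig/NeonConfig layout shown later on this page; treat this as a sketch rather than the preset source:

from neon.training.config import large_arms_config

config = large_arms_config()

# The large presets widen and deepen the action heads relative to the standard ones:
#   mlp_hidden:            512 -> 2048  (4x wider)
#   action_head_layers:    3   -> 8     (~2.7x deeper)
#   proprioception_hidden:        512
print(config.model.mlp_hidden, config.model.action_head_layers)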

Omni-Modal — All Sensors Enabled

| Preset | Backbone | Mode | Sensors | GPU | Notes |
|---|---|---|---|---|---|
| g1_omnimodal | Qwen2.5-Omni-7B | whole_body | Camera + Audio + LiDAR + EEF + Proprio | A100 40GB+ | Full G1 teleop data |

Enables use_lidar=True, use_eef=True, audio, and all sensor encoders. Use with G1 teleoperation datasets that include LiDAR point clouds and end-effector states.
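Sketched as a config (use_lidar and use_eef come from the description above; where exactly they sit in the config tree, and how audio is toggled, are assumptions):

from neon.model.neon_vla import NeonConfig

# Rough sketch of what g1_omnimodal turns on (flag placement assumed, defaults elsewhere)
omni_model = NeonConfig(
    control_mode="whole_body",
    use_lidar=True,   # LiDAR point-cloud encoder
    use_eef=True,     # end-effector state encoder
)
# The preset also enables the audio encoder projection for spoken commands.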


Train on HuggingFace Jobs

The fastest path from zero to a trained model. No local GPU required.

Prerequisites

  1. A HuggingFace account
  2. A HuggingFace access token with write permissions
  3. The huggingface_hub CLI: pip install huggingface_hub
  4. Login: huggingface-cli login

One Command

# Standard size — Omni-7B backbone, arms only, A100
hf jobs uv run \
    --flavor a100-large \
    --secrets HF_TOKEN \
    --timeout 8h \
    scripts/train_neon.py \
    --backbone Qwen/Qwen2.5-Omni-7B \
    --mode arms_only \
    --audio \
    --output YOUR_HF_USERNAME/neon-g1-v1

That's it. HuggingFace provisions an A100, installs dependencies (declared inline in the script via PEP 723), trains the model, and pushes the checkpoint to your Hub. You can close your laptop.
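PEP 723 inline metadata is a comment block at the top of the script; uv reads it and installs the listed packages before running. Illustrative only; the real dependency list in scripts/train_neon.py may differ:

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "torch",
#     "transformers",
#     "bitsandbytes",
# ]
# ///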

All Launch Commands

# Default: Omni-7B backbone with audio
hf jobs uv run --flavor a100-large --secrets HF_TOKEN --timeout 8h \
    scripts/train_neon.py \
    --backbone Qwen/Qwen2.5-Omni-7B \
    --audio \
    --output YOUR_USERNAME/neon-g1-v1

# Cosmos-Reason2 backbone (physics-heavy tasks)
hf jobs uv run --flavor a100-large --secrets HF_TOKEN --timeout 8h \
    scripts/train_neon.py \
    --backbone nvidia/Cosmos-Reason2-8B \
    --output YOUR_USERNAME/neon-g1-cosmos-v1

# Edge: Omni-3B on an L4
hf jobs uv run --flavor l4x1 --secrets HF_TOKEN --timeout 4h \
    scripts/train_neon.py \
    --backbone Qwen/Qwen2.5-Omni-3B \
    --audio \
    --output YOUR_USERNAME/neon-g1-edge-v1

# Longer large-style run: more epochs, lower LR
# (for the full large presets, use the Python API below)
hf jobs uv run --flavor a100-large --secrets HF_TOKEN --timeout 12h \
    scripts/train_neon.py \
    --backbone Qwen/Qwen2.5-Omni-7B \
    --audio \
    --epochs 5 \
    --batch-size 2 \
    --lr 1e-4 \
    --output YOUR_USERNAME/neon-g1-large-v1

GPU Flavors and Cost

| Flavor | GPU | VRAM | Cost/hr | Best For |
|---|---|---|---|---|
| l4x1 | L4 | 24 GB | ~$1 | 3B backbone, standard heads |
| a10g-large | A10G | 24 GB | ~$1.50 | 3B-7B backbone, standard heads |
| a100-large | A100 | 40 GB | ~$4 | 7B backbone, large heads |
| a100-80gb | A100 | 80 GB | ~$6 | Whole-body large, LoRA |

Always push_to_hub=True

HuggingFace Jobs have ephemeral storage. When the job finishes (or times out), everything on disk is gone. The model must be pushed to the Hub during training. This is the default — don't disable it.


Train Locally

Quick Start

git clone https://github.com/cagataycali/neon.git
cd neon
pip install -e ".[train]"

python scripts/train_neon.py \
    --backbone Qwen/Qwen2.5-VL-3B-Instruct \
    --mode arms_only \
    --epochs 3 \
    --batch-size 4 \
    --dataset lerobot/xvla-agibot-world \
    --output YOUR_USERNAME/neon-g1-v1

Python API (Full Control)

For large configs or custom data soup, use the Python API directly:

from neon.training.config import large_arms_config
from neon.training.train import NeonTrainer

# Load preset (44M trainable, Omni-7B, arms_only)
config = large_arms_config()

# Customize
config.hub_model_id = "YOUR_USERNAME/neon-g1-large-v1"
config.epochs = 5
config.batch_size = 2

# Train
trainer = NeonTrainer(config)
stats = trainer.train()
print(f"Best loss: {stats['best_loss']:.4f}")

Custom Data Soup

Mix multiple datasets with different weights:

from neon.training.config import TrainConfig
from neon.model.neon_vla import NeonConfig
from neon.model.video_backbone import BackboneConfig
from neon.data.data_soup import DataSoupConfig, DataSourceConfig
from neon.training.train import NeonTrainer

config = TrainConfig(
    model=NeonConfig(
        backbone=BackboneConfig(
            model_id="Qwen/Qwen2.5-Omni-7B",
            load_in_4bit=True,
            freeze_backbone=True,
        ),
        control_mode="arms_only",
        num_action_steps=16,
        mlp_hidden=2048,       # Large heads
        action_head_layers=8,
    ),
    data=DataSoupConfig(
        sources=[
            DataSourceConfig(
                name="agibot",
                type="lerobot",
                path="lerobot/xvla-agibot-world",
                weight=2.0,         # Bimanual manipulation — high weight
            ),
            DataSourceConfig(
                name="bridge",
                type="lerobot",
                path="lerobot/bridge_v2",
                weight=1.0,         # Tabletop fundamentals
            ),
            DataSourceConfig(
                name="cosmos-synth",
                type="cosmos_dreamgen",
                path="nvidia/GR1-100",
                weight=1.5,         # Synthetic + relative actions
                use_relative_actions=True,
                action_scaler=20.0,
            ),
            DataSourceConfig(
                name="voice-cmds",
                type="voice_commands",
                path="cagataydev/vlm-voice-commands",
                weight=0.3,         # Language diversity
            ),
        ],
    ),
    epochs=5,
    batch_size=2,
    learning_rate=1e-4,
    push_to_hub=True,
    hub_model_id="YOUR_USERNAME/neon-g1-custom-v1",
)

trainer = NeonTrainer(config)
trainer.train()
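The weight field sets how often each source is drawn relative to the others: the sampling probability is weight / sum(weights). A standalone sketch of the idea (not the DataSoup implementation):

import random

# Weights from the config above
sources = {"agibot": 2.0, "bridge": 1.0, "cosmos-synth": 1.5, "voice-cmds": 0.3}

def pick_source() -> str:
    """Draw a source with probability proportional to its weight."""
    return random.choices(list(sources), weights=list(sources.values()), k=1)[0]

# agibot is sampled ~2.0 / 4.8 ≈ 42% of the time, voice-cmds ~6%.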

Backbone Selection Guide

                    ┌──────────────────────────────────────────────────────────┐
                    │          Which backbone should I use?                     │
                    └────────────────────────┬─────────────────────────────────┘
                              ┌──────────────▼──────────────┐
                              │ Do you need audio input?     │
                              └──────┬───────────────┬──────┘
                                     │ Yes           │ No
                              ┌──────▼──────┐ ┌──────▼──────────────┐
                              │ Use Omni    │ │ Physics-heavy task?  │
                              │ (3B or 7B)  │ └──────┬────────┬─────┘
                              └─────────────┘        │ Yes    │ No
                                              ┌──────▼─────┐ ┌▼──────────┐
                                              │ Cosmos     │ │ Qwen2.5-VL│
                                              │ Reason2    │ │ (3B or 7B)│
                                              └────────────┘ └───────────┘

| Backbone | Params | VRAM (4-bit) | Audio | Physics | Best For |
|---|---|---|---|---|---|
| Qwen2.5-Omni-7B | 7B | ~8 GB | Native | Learned | Spoken commands, production |
| Qwen2.5-Omni-3B | 3B | ~4 GB | Native | Learned | Edge deployment, Jetson |
| Cosmos-Reason2-8B | 8B | ~10 GB | Whisper | Pre-trained | Sim2real, physical reasoning |
| Cosmos-Reason2-2B | 2B | ~3 GB | Whisper | Pre-trained | Edge + physics |
| Qwen2.5-VL-7B | 7B | ~8 GB | Whisper | Learned | Text instructions, most stable |
| Qwen2.5-VL-3B | 3B | ~4 GB | Whisper | Learned | Fast iteration |

The Training Loop — What Happens Inside

sequenceDiagram
    participant DS as Data Soup
    participant VB as Backbone (frozen)
    participant FUS as Fusion
    participant AH as Action Heads
    participant OPT as Optimizer

    DS->>VB: images + text
    VB->>FUS: visual-language features (2048-3584)
    DS->>FUS: proprioception features
    DS->>FUS: audio features (optional)
    FUS->>AH: fused features
    AH->>AH: RMSNorm → ReLU² → Skip → SoftCap
    Note over AH: MSE loss vs target action chunks
    AH->>OPT: gradients (clip=0.3)
    OPT->>AH: AdamW-8bit (β₁=0.85)

  1. Build model — Create NeonVLA, freeze backbone, initialize action heads
  2. Load backbone — Download from HuggingFace, quantize to 4-bit, pin in memory
  3. Build data loader — Load data soup, weighted sampling across sources
  4. For each batch:
    • Forward: backbone encode → fusion → action heads → predicted 16-step chunk
    • Loss: MSE between predicted and target action chunks
    • Backward: gradients flow through action heads and fusion (not backbone)
    • Optimize: AdamW-8bit, gradient clipping (0.3), cosine LR schedule
  5. Checkpoint — Save best model weights (~25-100 MB), upload to HuggingFace Hub
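
The RMSNorm → ReLU² → Skip → SoftCap block from the diagram, sketched in PyTorch. This is a plausible reading of those names, not the verbatim Neon module:

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm (recent PyTorch also ships this as nn.RMSNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms

class ActionHeadBlock(nn.Module):
    """One residual block: RMSNorm -> Linear -> squared ReLU -> Linear -> skip."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.up(self.norm(x))) ** 2  # squared ReLU activation
        return x + self.down(h)                     # skip connection

def soft_cap(x: torch.Tensor, cap: float = 1.0) -> torch.Tensor:
    """Smoothly bound the final action outputs to [-cap, cap] (the diagram's SoftCap)."""
    return cap * torch.tanh(x / cap)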

Key Hyperparameters

| Parameter | Standard | Large | Why |
|---|---|---|---|
| learning_rate | 2e-4 | 1e-4 | Lower for larger heads — more stable convergence |
| batch_size | 4 | 2 | Each sample needs a backbone forward pass |
| gradient_accumulation | 8 | 16 | Effective batch = batch_size × accum |
| num_action_steps | 16 | 16 | Action chunk size; match model and data config |
| mlp_hidden | 512 | 2048 | Action head width (4× for large) |
| action_head_layers | 3 | 8 | Action head depth |
| optim | paged_adamw_8bit | same | Saves ~30% VRAM vs standard AdamW |
| adam_β₁ | 0.85 | 0.85 | Lower than 0.9 — faster adaptation (Parameter Golf) |
| max_grad_norm | 0.3 | 0.3 | Tight clipping for small heads (Parameter Golf) |
| bf16 | True | True | BFloat16 mixed precision |
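
Wired together, the optimizer rows of that table look roughly like this, using bitsandbytes' paged 8-bit AdamW (a sketch, not NeonTrainer verbatim; β₂ is left at its default):

import bitsandbytes as bnb
import torch

def build_optimizer(model: torch.nn.Module, total_steps: int, large: bool = False):
    """Optimizer and LR schedule matching the table above."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    opt = bnb.optim.PagedAdamW8bit(
        trainable,
        lr=1e-4 if large else 2e-4,
        betas=(0.85, 0.999),  # beta_1 = 0.85 per Parameter Golf
    )
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps)
    return opt, sched

# Per optimization step, clip before the update:
#   torch.nn.utils.clip_grad_norm_(trainable, max_norm=0.3)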

Saving and Loading

What Gets Saved

model.save_pretrained("/path/to/model")
# Saves:
#   config.json          — NeonConfig (full architecture definition)
#   action_space.json    — G1ActionSpace (joint limits, control mode)
#   neon_weights.pt      — Action heads + fusion + proprio encoder

Checkpoint size: ~25 MB (standard) or ~100 MB (large). The backbone is NOT saved — it loads from HuggingFace at inference time. This keeps checkpoints tiny and portable.

Load a Trained Model

from neon.model.neon_vla import NeonVLA

model = NeonVLA.from_pretrained("YOUR_USERNAME/neon-g1-v1")
model.load_backbone()  # Downloads backbone separately

output = model.predict(image=frame, instruction="Pick up the cup")

Push to Hub Manually

from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="/tmp/neon_output/best",
    repo_id="YOUR_USERNAME/neon-g1-v1",
    repo_type="model",
)

Evaluate Your Model

After training, measure performance with per-group MSE, baseline comparisons, and trajectory plots:

# Quick eval (10 trajectories)
python scripts/eval_neon.py \
    --model YOUR_USERNAME/neon-g1-v1 \
    --dataset lerobot/xvla-agibot-world \
    --trajs 10

# Full eval with baselines and plots
python scripts/eval_neon.py \
    --model YOUR_USERNAME/neon-g1-v1 \
    --dataset lerobot/xvla-agibot-world \
    --trajs 500 \
    --baselines \
    --plot

# Eval on HuggingFace GPU
hf jobs uv run --flavor l4x1 --secrets HF_TOKEN --timeout 2h \
    scripts/eval_neon.py \
    --model YOUR_USERNAME/neon-g1-large-v1 \
    --trajs 500 \
    --baselines

The eval script reports:

  • Total MSE — overall prediction accuracy
  • Per-group MSE — arms, locomotion, torso, head, legs separately
  • Baselines — zero-action and random-action reference points
  • Improvement — percentage better than zero-action baseline
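
The improvement number is the percentage reduction in MSE relative to the zero-action baseline. A sketch of the formula (the eval script may round or report it slightly differently):

def improvement_over_zero(model_mse: float, zero_action_mse: float) -> float:
    """Percent reduction in MSE versus always predicting zero actions."""
    return 100.0 * (1.0 - model_mse / zero_action_mse)

# Example: model MSE 0.012 vs zero-action MSE 0.048 -> 75.0% improvement
print(improvement_over_zero(0.012, 0.048))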

See the Evaluation Guide for full details on metrics, visualization, and benchmarking.


Lessons Burned Into Silicon

First step is slow

With MoE backbones, the first training step takes 10–20 minutes. Gradient checkpointing recomputes all activations. This is normal — subsequent steps are fast.

flash_attention_2 isn't always there

HF GPU Jobs may not have flash-attn installed. The code auto-detects and falls back to eager attention. If you want FA2 speed: pip install flash-attn --no-build-isolation.
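The detection pattern is essentially an importability check, with the result passed through transformers' attn_implementation argument (a sketch of the pattern, not the exact Neon code):

import importlib.util

# Use flash-attention 2 when the package is importable, otherwise fall back to eager attention.
attn_impl = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "eager"

# Forwarded to the backbone load, e.g.:
#   AutoModel.from_pretrained(model_id, attn_implementation=attn_impl, ...)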

Always push_to_hub=True

HF Jobs have ephemeral storage. If the job times out, your model is gone unless it was pushed to the Hub. This lesson cost real GPU hours. Don't learn it twice.

Start standard, scale to large

Train a standard-size model first (7M params, ~2 hours). Verify the data pipeline works, the loss decreases, the predictions look reasonable. Then scale to large (44M, ~6 hours) with confidence.


Next: Evaluation — measure your model's performance