Training¶
Everything you need to train your own Neon model — from a single command on HuggingFace to a fully customized multi-source data soup on your own GPU.
What Trains, What Doesn't¶
The video backbone is frozen. Only the small stuff trains — and that's the entire point:
graph TD
subgraph "Frozen (3-7B params)"
BB["Video Backbone<br/>Qwen2.5-Omni / Cosmos<br/>4-bit quantized"]
end
subgraph "Trainable"
FUS["Fusion Layer"]
PE["Proprioception Encoder"]
AH["Action Heads<br/>Parameter Golf v2"]
AE["Audio Encoder Projection"]
LE["LiDAR PointCloud Encoder"]
EE["EEF State Encoder"]
SH["Speech Response Head"]
end
BB --> FUS
PE --> FUS
AE --> FUS
LE --> FUS
EE --> FUS
FUS --> AH
FUS --> SH
style BB fill:#1565c0,color:#fff
style AH fill:#e65100,color:#fff
We don't teach the model to see — the video backbone already sees. We teach it to act.
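You can check the split directly; a minimal sketch, assuming a loaded NeonVLA behaves like a standard torch.nn.Module with the backbone's parameters marked non-trainable:
from neon.model.neon_vla import NeonVLA

model = NeonVLA.from_pretrained("YOUR_USERNAME/neon-g1-v1")
model.load_backbone()  # backbone weights come from HuggingFace, not the checkpoint

# Only the action heads, fusion layer, and sensor encoders should require gradients
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"trainable: {trainable / 1e6:.1f}M | frozen: {frozen / 1e9:.2f}B")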
Choose Your Size¶
Neon ships with 24 training presets at two scales:
Standard (~7M trainable)¶
Fast to train, good for iteration and edge deployment.
| Preset | Backbone | Mode | Trainable | GPU | Time |
|---|---|---|---|---|---|
| default_arms_only | Qwen2.5-Omni-7B | arms_only | 7.3M | L4 / RTX 4090 | ~2h |
| default_wholebody | Qwen2.5-Omni-7B | whole_body | ~10M | RTX 4090 | ~3h |
| cosmos_physics | Cosmos-Reason2-8B | arms_only | ~8M | A100 | ~4h |
| edge_3b | Qwen2.5-Omni-3B | arms_only | ~2M | L4 / Jetson | ~1h |
Large (~44M trainable) — GR00T-Dreams Scale¶
Publication quality. Matches NVIDIA GR00T-Dreams' 42M-parameter action head.
| Preset | Backbone | Mode | Trainable | GPU | Time |
|---|---|---|---|---|---|
| large_arms | Qwen2.5-Omni-7B | arms_only | 44M | A100 40GB | ~6h |
| large_cosmos | Cosmos-Reason2-8B | arms_only | 44M | A100 40GB | ~6h |
| large_wholebody | Qwen2.5-Omni-7B | whole_body | 55M | A100 80GB | ~10h |
The large configs use mlp_hidden=2048, action_head_layers=8, and proprioception_hidden=512 — 4× wider and 2.7× deeper than the standard heads.
Omni-Modal — All Sensors Enabled¶
| Preset | Backbone | Mode | Sensors | GPU | Notes |
|---|---|---|---|---|---|
| g1_omnimodal | Qwen2.5-Omni-7B | whole_body | Camera + Audio + LiDAR + EEF + Proprio | A100 40GB+ | Full G1 teleop data |
Enables use_lidar=True, use_eef=True, audio, and all sensor encoders. Use with G1 teleoperation datasets that include LiDAR point clouds and end-effector states.
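If you are building a config by hand instead of using the preset, the same switches are plain fields; a minimal sketch, assuming use_lidar and use_eef sit on NeonConfig next to the control mode:
from neon.model.neon_vla import NeonConfig

config = NeonConfig(
    control_mode="whole_body",
    num_action_steps=16,
    use_lidar=True,  # turns on the LiDAR point-cloud encoder
    use_eef=True,    # turns on the end-effector state encoder
)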
Train on HuggingFace (Recommended)¶
The fastest path from zero to a trained model. No local GPU required.
Prerequisites¶
- A HuggingFace account
- A HuggingFace access token with write permissions
- The huggingface_hub CLI: pip install huggingface_hub
- Login: huggingface-cli login
One Command¶
# Standard size — Omni-7B backbone, arms only, A100
hf jobs uv run \
--flavor a100-large \
--secrets HF_TOKEN \
--timeout 8h \
scripts/train_neon.py \
--backbone Qwen/Qwen2.5-Omni-7B \
--mode arms_only \
--audio \
--output YOUR_HF_USERNAME/neon-g1-v1
That's it. HuggingFace provisions an A100, installs dependencies (declared inline in the script via PEP 723), trains the model, and pushes the checkpoint to your Hub. You can close your laptop.
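Those inline dependencies live in a PEP 723 metadata block at the top of scripts/train_neon.py. The format looks like this (illustrative dependency list, not the script's actual one):
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "torch",
#     "transformers",
#     "huggingface_hub",
# ]
# ///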
All Launch Commands¶
GPU Flavors and Cost¶
| Flavor | GPU | VRAM | Cost/hr | Best For |
|---|---|---|---|---|
| l4x1 | L4 | 24 GB | ~$1 | 3B backbone, standard heads |
| a10g-large | A10G | 24 GB | ~$1.50 | 3B-7B backbone, standard heads |
| a100-large | A100 | 40 GB | ~$4 | 7B backbone, large heads |
| a100-80gb | A100 | 80 GB | ~$6 | Whole-body large, LoRA |
Always push_to_hub=True
HuggingFace Jobs have ephemeral storage. When the job finishes (or times out), everything on disk is gone. The model must be pushed to the Hub during training. This is the default — don't disable it.
Train Locally¶
Quick Start¶
git clone https://github.com/cagataycali/neon.git
cd neon
pip install -e ".[train]"
python scripts/train_neon.py \
--backbone Qwen/Qwen2.5-VL-3B-Instruct \
--mode arms_only \
--epochs 3 \
--batch-size 4 \
--dataset lerobot/xvla-agibot-world \
--output YOUR_USERNAME/neon-g1-v1
Python API (Full Control)¶
For large configs or custom data soup, use the Python API directly:
from neon.training.config import large_arms_config
from neon.training.train import NeonTrainer
# Load preset (44M trainable, Omni-7B, arms_only)
config = large_arms_config()
# Customize
config.hub_model_id = "YOUR_USERNAME/neon-g1-large-v1"
config.epochs = 5
config.batch_size = 2
# Train
trainer = NeonTrainer(config)
stats = trainer.train()
print(f"Best loss: {stats['best_loss']:.4f}")
Custom Data Soup¶
Mix multiple datasets with different weights:
from neon.training.config import TrainConfig
from neon.model.neon_vla import NeonConfig
from neon.model.video_backbone import BackboneConfig
from neon.data.data_soup import DataSoupConfig, DataSourceConfig
config = TrainConfig(
model=NeonConfig(
backbone=BackboneConfig(
model_id="Qwen/Qwen2.5-Omni-7B",
load_in_4bit=True,
freeze_backbone=True,
),
control_mode="arms_only",
num_action_steps=16,
mlp_hidden=2048, # Large heads
action_head_layers=8,
),
data=DataSoupConfig(
sources=[
DataSourceConfig(
name="agibot",
type="lerobot",
path="lerobot/xvla-agibot-world",
weight=2.0, # Bimanual manipulation — high weight
),
DataSourceConfig(
name="bridge",
type="lerobot",
path="lerobot/bridge_v2",
weight=1.0, # Tabletop fundamentals
),
DataSourceConfig(
name="cosmos-synth",
type="cosmos_dreamgen",
path="nvidia/GR1-100",
weight=1.5, # Synthetic + relative actions
use_relative_actions=True,
action_scaler=20.0,
),
DataSourceConfig(
name="voice-cmds",
type="voice_commands",
path="cagataydev/vlm-voice-commands",
weight=0.3, # Language diversity
),
],
),
epochs=5,
batch_size=2,
learning_rate=1e-4,
push_to_hub=True,
hub_model_id="YOUR_USERNAME/neon-g1-custom-v1",
)
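The config then goes through the same trainer as in the Python API example above:
from neon.training.train import NeonTrainer

trainer = NeonTrainer(config)
stats = trainer.train()
print(f"Best loss: {stats['best_loss']:.4f}")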
Backbone Selection Guide¶
┌──────────────────────────────────────────────────────────┐
│ Which backbone should I use? │
└────────────────────────┬─────────────────────────────────┘
│
┌──────────────▼──────────────┐
│ Do you need audio input? │
└──────┬───────────────┬──────┘
│ Yes │ No
┌──────▼──────┐ ┌──────▼──────────────┐
│ Use Omni │ │ Physics-heavy task? │
│ (3B or 7B) │ └──────┬────────┬─────┘
└─────────────┘ │ Yes │ No
┌──────▼─────┐ ┌▼──────────┐
│ Cosmos │ │ Qwen2.5-VL│
│ Reason2 │ │ (3B or 7B)│
└────────────┘ └───────────┘
| Backbone | Params | VRAM (4-bit) | Audio | Physics | Best For |
|---|---|---|---|---|---|
| Qwen2.5-Omni-7B | 7B | ~8 GB | Native | Learned | Spoken commands, production |
| Qwen2.5-Omni-3B | 3B | ~4 GB | Native | Learned | Edge deployment, Jetson |
| Cosmos-Reason2-8B | 8B | ~10 GB | Whisper | Pre-trained | Sim2real, physical reasoning |
| Cosmos-Reason2-2B | 2B | ~3 GB | Whisper | Pre-trained | Edge + physics |
| Qwen2.5-VL-7B | 7B | ~8 GB | Whisper | Learned | Text instructions, most stable |
| Qwen2.5-VL-3B | 3B | ~4 GB | Whisper | Learned | Fast iteration |
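Whichever branch you land on, the switch is one field in the backbone config; a minimal sketch, reusing BackboneConfig from the data-soup example (the Cosmos Hub ID shown is a placeholder, check the actual repository name):
from neon.model.video_backbone import BackboneConfig

# Physics-heavy task, no audio input → Cosmos-Reason2
backbone = BackboneConfig(
    model_id="nvidia/Cosmos-Reason2-8B",  # placeholder Hub ID
    load_in_4bit=True,                    # ~10 GB VRAM per the table above
    freeze_backbone=True,
)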
The Training Loop — What Happens Inside¶
sequenceDiagram
participant DS as Data Soup
participant VB as Backbone (frozen)
participant FUS as Fusion
participant AH as Action Heads
participant OPT as Optimizer
DS->>VB: images + text
VB->>FUS: visual-language features (2048-3584)
DS->>FUS: proprioception features
DS->>FUS: audio features (optional)
FUS->>AH: fused features
AH->>AH: RMSNorm → ReLU² → Skip → SoftCap
Note over AH: MSE loss vs target action chunks
AH->>OPT: gradients (clip=0.3)
OPT->>AH: AdamW-8bit (β₁=0.85)
- Build model — Create NeonVLA, freeze backbone, initialize action heads
- Load backbone — Download from HuggingFace, quantize to 4-bit, pin in memory
- Build data loader — Load data soup, weighted sampling across sources
- For each batch:
    - Forward: backbone encode → fusion → action heads → predicted 16-step chunk
    - Loss: MSE between predicted and target action chunks
    - Backward: gradients flow through action heads and fusion (not backbone)
    - Optimize: AdamW-8bit, gradient clipping (0.3), cosine LR schedule
- Checkpoint — Save best model weights (~25-100 MB), upload to HuggingFace Hub
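A stripped-down sketch of the per-batch step, with plain AdamW standing in for the paged 8-bit optimizer and dummy linear layers in place of the real backbone and heads:
import torch
import torch.nn as nn
import torch.nn.functional as F

CHUNK, ACTION_DIM = 16, 14  # ACTION_DIM is illustrative; the real value comes from G1ActionSpace

backbone = nn.Linear(2048, 2048)             # stand-in for the frozen video backbone
heads = nn.Linear(2048, CHUNK * ACTION_DIM)  # stand-in for fusion + action heads
for p in backbone.parameters():
    p.requires_grad = False                  # frozen: gradients never reach the backbone

optimizer = torch.optim.AdamW(heads.parameters(), lr=2e-4, betas=(0.85, 0.999))

# dummy batches in place of the weighted data-soup loader
loader = [(torch.randn(4, 2048), torch.randn(4, CHUNK * ACTION_DIM)) for _ in range(3)]

for features, target_chunk in loader:
    with torch.no_grad():
        visual = backbone(features)                    # backbone forward only, no grad
    pred = heads(visual)                               # predicted 16-step action chunk
    loss = F.mse_loss(pred, target_chunk)              # MSE against the target chunk
    loss.backward()
    nn.utils.clip_grad_norm_(heads.parameters(), 0.3)  # tight gradient clipping
    optimizer.step()
    optimizer.zero_grad()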
Key Hyperparameters¶
| Parameter | Standard | Large | Why |
|---|---|---|---|
| learning_rate | 2e-4 | 1e-4 | Lower for larger heads — more stable convergence |
| batch_size | 4 | 2 | Each sample needs a backbone forward pass |
| gradient_accumulation | 8 | 16 | Effective batch = batch_size × accum |
| num_action_steps | 16 | 16 | Action chunk size. Match model and data config |
| mlp_hidden | 512 | 2048 | Action head width (4× for large) |
| action_head_layers | 3 | 8 | Action head depth |
| optim | paged_adamw_8bit | same | Saves ~30% VRAM vs standard AdamW |
| adam_β₁ | 0.85 | 0.85 | Lower than 0.9 — faster adaptation (Parameter Golf) |
| max_grad_norm | 0.3 | 0.3 | Tight clipping for small heads (Parameter Golf) |
| bf16 | True | True | BFloat16 mixed precision |
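Presets set these for you, but most are plain attributes on TrainConfig; a sketch of overriding a couple on top of a preset (field names not shown in the examples above, such as gradient_accumulation, are assumptions):
from neon.training.config import large_arms_config

config = large_arms_config()
config.learning_rate = 5e-5        # gentler than the large default
config.batch_size = 1              # fits a smaller GPU
config.gradient_accumulation = 32  # assumed field name; keeps the effective batch at 32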
Saving and Loading¶
What Gets Saved¶
model.save_pretrained("/path/to/model")
# Saves:
# config.json — NeonConfig (full architecture definition)
# action_space.json — G1ActionSpace (joint limits, control mode)
# neon_weights.pt — Action heads + fusion + proprio encoder
Checkpoint size: ~25 MB (standard) or ~100 MB (large). The backbone is NOT saved — it loads from HuggingFace at inference time. This keeps checkpoints tiny and portable.
Load a Trained Model¶
from neon.model.neon_vla import NeonVLA
model = NeonVLA.from_pretrained("YOUR_USERNAME/neon-g1-v1")
model.load_backbone() # Downloads backbone separately
output = model.predict(image=frame, instruction="Pick up the cup")
Push to Hub Manually¶
from huggingface_hub import HfApi
api = HfApi()
api.upload_folder(
folder_path="/tmp/neon_output/best",
repo_id="YOUR_USERNAME/neon-g1-v1",
repo_type="model",
)
Evaluate Your Model¶
After training, measure performance with per-group MSE, baseline comparisons, and trajectory plots:
# Quick eval (10 trajectories)
python scripts/eval_neon.py \
--model YOUR_USERNAME/neon-g1-v1 \
--dataset lerobot/xvla-agibot-world \
--trajs 10
# Full eval with baselines and plots
python scripts/eval_neon.py \
--model YOUR_USERNAME/neon-g1-v1 \
--dataset lerobot/xvla-agibot-world \
--trajs 500 \
--baselines \
--plot
# Eval on HuggingFace GPU
hf jobs uv run --flavor l4x1 --secrets HF_TOKEN --timeout 2h \
scripts/eval_neon.py \
--model YOUR_USERNAME/neon-g1-large-v1 \
--trajs 500 \
--baselines
The eval script reports:
- Total MSE — overall prediction accuracy
- Per-group MSE — arms, locomotion, torso, head, legs separately
- Baselines — zero-action and random-action reference points
- Improvement — percentage better than zero-action baseline
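The improvement figure is the relative MSE reduction against the zero-action baseline; a quick worked example with illustrative numbers (not the script's exact formula):
mse_model = 0.012  # model's total MSE (illustrative)
mse_zero = 0.048   # MSE of always predicting zero actions (illustrative)

improvement = (1 - mse_model / mse_zero) * 100
print(f"{improvement:.1f}% better than the zero-action baseline")  # 75.0%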
→ Evaluation Guide for full details on metrics, visualization, and benchmarking.
Lessons Burned Into Silicon¶
First step is slow
With MoE backbones, the first training step takes 10–20 minutes. Gradient checkpointing recomputes all activations. This is normal — subsequent steps are fast.
flash_attention_2 isn't always there
HF GPU Jobs may not have flash-attn installed. The code auto-detects and falls back to eager attention. If you want FA2 speed: pip install flash-attn --no-build-isolation.
Always push_to_hub=True
HF Jobs have ephemeral storage. If the job times out, your model is gone unless it was pushed to the Hub. This lesson cost real GPU hours. Don't learn it twice.
Start standard, scale to large
Train a standard-size model first (7M params, ~2 hours). Verify the data pipeline works, the loss decreases, the predictions look reasonable. Then scale to large (44M, ~6 hours) with confidence.
→ Next: Evaluation — measure your model's performance