Training¶
Everything you need to train your own Neon model — from a single command on HuggingFace to a fully customized multi-source data soup on your own GPU.
What Trains, What Doesn't¶
The video backbone is frozen. Only the small stuff trains — and that's the entire point:
graph TD
subgraph "Frozen (3-7B params)"
BB["Video Backbone<br/>Qwen2.5-Omni / Cosmos<br/>4-bit quantized"]
end
subgraph "Trainable"
FUS["Fusion Layer"]
PE["Proprioception Encoder"]
AH["Action Heads<br/>Parameter Golf v2"]
AE["Audio Encoder Projection"]
LE["LiDAR PointCloud Encoder"]
EE["EEF State Encoder"]
SH["Speech Response Head"]
end
BB --> FUS
PE --> FUS
AE --> FUS
LE --> FUS
EE --> FUS
FUS --> AH
FUS --> SH
style BB fill:#1565c0,color:#fff
style AH fill:#e65100,color:#fff
We don't teach the model to see — the video backbone already sees. We teach it to act.
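You can check the split directly; a minimal sketch, assuming a loaded NeonVLA behaves like a standard torch.nn.Module with the backbone's parameters marked non-trainable:
from neon.model.neon_vla import NeonVLA

model = NeonVLA.from_pretrained("YOUR_USERNAME/neon-g1-v1")
model.load_backbone()  # backbone weights come from HuggingFace, not the checkpoint

# Only the action heads, fusion layer, and sensor encoders should require gradients
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"trainable: {trainable / 1e6:.1f}M | frozen: {frozen / 1e9:.2f}B")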
Choose Your Size¶
Neon ships with 24 training presets at two scales:
Standard (~7M trainable)¶
Fast to train, good for iteration and edge deployment.
| Preset | Backbone | Mode | Trainable | GPU | Time |
|---|---|---|---|---|---|
| default_arms_only | Qwen2.5-Omni-7B | arms_only | 7.3M | L4 / RTX 4090 | ~2h |
| default_wholebody | Qwen2.5-Omni-7B | whole_body | ~10M | RTX 4090 | ~3h |
| cosmos_physics | Cosmos-Reason2-8B | arms_only | ~8M | A100 | ~4h |
| edge_3b | Qwen2.5-Omni-3B | arms_only | ~2M | L4 / Jetson | ~1h |
Large (~44M trainable) — GR00T-Dreams Scale¶
Publication quality. Matches NVIDIA GR00T-Dreams' 42M-parameter action head.
| Preset | Backbone | Mode | Trainable | GPU | Time |
|---|---|---|---|---|---|
| large_arms | Qwen2.5-Omni-7B | arms_only | 44M | A100 40GB | ~6h |
| large_cosmos | Cosmos-Reason2-8B | arms_only | 44M | A100 40GB | ~6h |
| large_wholebody | Qwen2.5-Omni-7B | whole_body | 55M | A100 80GB | ~10h |
The large configs use mlp_hidden=2048, action_head_layers=8, and proprioception_hidden=512 — 4× wider and 2.7× deeper than the standard heads.
Omni-Modal — All Sensors Enabled¶
| Preset | Backbone | Mode | Sensors | GPU | Notes |
|---|---|---|---|---|---|
| g1_omnimodal | Qwen2.5-Omni-7B | whole_body | Camera + Audio + LiDAR + EEF + Proprio | A100 40GB+ | Full G1 teleop data |
Enables use_lidar=True, use_eef=True, audio, and all sensor encoders. Use with G1 teleoperation datasets that include LiDAR point clouds and end-effector states.
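If you are building a config by hand instead of using the preset, the same switches are plain fields; a minimal sketch, assuming use_lidar and use_eef sit on NeonConfig next to the control mode:
from neon.model.neon_vla import NeonConfig

config = NeonConfig(
    control_mode="whole_body",
    num_action_steps=16,
    use_lidar=True,  # turns on the LiDAR point-cloud encoder
    use_eef=True,    # turns on the end-effector state encoder
)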
Train on HuggingFace (Recommended)¶
The fastest path from zero to a trained model. No local GPU required.
Prerequisites¶
- A HuggingFace account
- A HuggingFace access token with write permissions
- The huggingface_hub CLI: pip install huggingface_hub
- Login: huggingface-cli login
One Command¶
# Standard size — Omni-7B backbone, arms only, A100
hf jobs uv run \
--flavor a100-large \
--secrets HF_TOKEN \
--timeout 8h \
scripts/train_neon.py \
--backbone Qwen/Qwen2.5-Omni-7B \
--mode arms_only \
--audio \
--output YOUR_HF_USERNAME/neon-g1-v1
That's it. HuggingFace provisions an A100, installs dependencies (declared inline in the script via PEP 723), trains the model, and pushes the checkpoint to your Hub. You can close your laptop.
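Those inline dependencies live in a PEP 723 metadata block at the top of scripts/train_neon.py. The format looks like this (illustrative dependency list, not the script's actual one):
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "torch",
#     "transformers",
#     "huggingface_hub",
# ]
# ///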
All Launch Commands¶
GPU Flavors and Cost¶
| Flavor | GPU | VRAM | Cost/hr | Best For |
|---|---|---|---|---|
| l4x1 | L4 | 24 GB | ~$1 | 3B backbone, standard heads |
| a10g-large | A10G | 24 GB | ~$1.50 | 3B-7B backbone, standard heads |
| a100-large | A100 | 40 GB | ~$4 | 7B backbone, large heads |
| a100-80gb | A100 | 80 GB | ~$6 | Whole-body large, LoRA |
Always push_to_hub=True
HuggingFace Jobs have ephemeral storage. When the job finishes (or times out), everything on disk is gone. The model must be pushed to the Hub during training. This is the default — don't disable it.
Train Locally¶
Quick Start¶
git clone https://github.com/cagataycali/neon.git
cd neon
pip install -e ".[train]"
python scripts/train_neon.py \
--backbone Qwen/Qwen2.5-VL-3B-Instruct \
--mode arms_only \
--epochs 3 \
--batch-size 4 \
--dataset lerobot/xvla-agibot-world \
--output YOUR_USERNAME/neon-g1-v1
Python API (Full Control)¶
For large configs or custom data soup, use the Python API directly:
from neon.training.config import large_arms_config
from neon.training.train import NeonTrainer
# Load preset (44M trainable, Omni-7B, arms_only)
config = large_arms_config()
# Customize
config.hub_model_id = "YOUR_USERNAME/neon-g1-large-v1"
config.epochs = 5
config.batch_size = 2
# Train
trainer = NeonTrainer(config)
stats = trainer.train()
print(f"Best loss: {stats['best_loss']:.4f}")
Custom Data Soup¶
Mix multiple datasets with different weights:
from neon.training.config import TrainConfig
from neon.model.neon_vla import NeonConfig
from neon.model.video_backbone import BackboneConfig
from neon.data.data_soup import DataSoupConfig, DataSourceConfig
config = TrainConfig(
model=NeonConfig(
backbone=BackboneConfig(
model_id="Qwen/Qwen2.5-Omni-7B",
load_in_4bit=True,
freeze_backbone=True,
),
control_mode="arms_only",
num_action_steps=16,
mlp_hidden=2048, # Large heads
action_head_layers=8,
),
data=DataSoupConfig(
sources=[
DataSourceConfig(
name="agibot",
type="lerobot",
path="lerobot/xvla-agibot-world",
weight=2.0, # Bimanual manipulation — high weight
),
DataSourceConfig(
name="bridge",
type="lerobot",
path="lerobot/bridge_v2",
weight=1.0, # Tabletop fundamentals
),
DataSourceConfig(
name="cosmos-synth",
type="cosmos_dreamgen",
path="nvidia/GR1-100",
weight=1.5, # Synthetic + relative actions
use_relative_actions=True,
action_scaler=20.0,
),
DataSourceConfig(
name="voice-cmds",
type="voice_commands",
path="cagataydev/vlm-voice-commands",
weight=0.3, # Language diversity
),
],
),
epochs=5,
batch_size=2,
learning_rate=1e-4,
push_to_hub=True,
hub_model_id="YOUR_USERNAME/neon-g1-custom-v1",
)
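The config then goes through the same trainer as in the Python API example above:
from neon.training.train import NeonTrainer

trainer = NeonTrainer(config)
stats = trainer.train()
print(f"Best loss: {stats['best_loss']:.4f}")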
Backbone Selection Guide¶
┌──────────────────────────────────────────────────────────┐
│ Which backbone should I use? │
└────────────────────────┬─────────────────────────────────┘
│
┌──────────────▼──────────────┐
│ Do you need audio input? │
└──────┬───────────────┬──────┘
│ Yes │ No
┌──────▼──────┐ ┌──────▼──────────────┐
│ Use Omni │ │ Physics-heavy task? │
│ (3B or 7B) │ └──────┬────────┬─────┘
└─────────────┘ │ Yes │ No
┌──────▼─────┐ ┌▼──────────┐
│ Cosmos │ │ Qwen2.5-VL│
│ Reason2 │ │ (3B or 7B)│
└────────────┘ └───────────┘
| Backbone | Params | VRAM (4-bit) | Audio | Physics | Best For |
|---|---|---|---|---|---|
| Qwen2.5-Omni-7B | 7B | ~8 GB | Native | Learned | Spoken commands, production |
| Qwen2.5-Omni-3B | 3B | ~4 GB | Native | Learned | Edge deployment, Jetson |
| Cosmos-Reason2-8B | 8B | ~10 GB | Whisper | Pre-trained | Sim2real, physical reasoning |
| Cosmos-Reason2-2B | 2B | ~3 GB | Whisper | Pre-trained | Edge + physics |
| Qwen2.5-VL-7B | 7B | ~8 GB | Whisper | Learned | Text instructions, most stable |
| Qwen2.5-VL-3B | 3B | ~4 GB | Whisper | Learned | Fast iteration |
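Whichever branch you land on, the switch is one field in the backbone config; a minimal sketch, reusing BackboneConfig from the data-soup example (the Cosmos Hub ID shown is a placeholder, check the actual repository name):
from neon.model.video_backbone import BackboneConfig

# Physics-heavy task, no audio input → Cosmos-Reason2
backbone = BackboneConfig(
    model_id="nvidia/Cosmos-Reason2-8B",  # placeholder Hub ID
    load_in_4bit=True,                    # ~10 GB VRAM per the table above
    freeze_backbone=True,
)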
The Training Loop — What Happens Inside¶
sequenceDiagram
participant DS as Data Soup
participant VB as Backbone (frozen)
participant FUS as Fusion
participant AH as Action Heads
participant OPT as Optimizer
DS->>VB: images + text
VB->>FUS: visual-language features (2048-3584)
DS->>FUS: proprioception features
DS->>FUS: audio features (optional)
FUS->>AH: fused features
AH->>AH: RMSNorm → ReLU² → Skip → SoftCap
Note over AH: MSE loss vs target action chunks
AH->>OPT: gradients (clip=0.3)
OPT->>AH: AdamW-8bit (β₁=0.85)
- Build model — Create NeonVLA, freeze backbone, initialize action heads
- Load backbone — Download from HuggingFace, quantize to 4-bit, pin in memory
- Build data loader — Load data soup, weighted sampling across sources
- For each batch:
    - Forward: backbone encode → fusion → action heads → predicted 16-step chunk
    - Loss: MSE between predicted and target action chunks
    - Backward: gradients flow through action heads and fusion (not backbone)
    - Optimize: AdamW-8bit, gradient clipping (0.3), cosine LR schedule
- Checkpoint — Save best model weights (~25-100 MB), upload to HuggingFace Hub
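A stripped-down sketch of the per-batch step, with plain AdamW standing in for the paged 8-bit optimizer and dummy linear layers in place of the real backbone and heads:
import torch
import torch.nn as nn
import torch.nn.functional as F

CHUNK, ACTION_DIM = 16, 14  # ACTION_DIM is illustrative; the real value comes from G1ActionSpace

backbone = nn.Linear(2048, 2048)             # stand-in for the frozen video backbone
heads = nn.Linear(2048, CHUNK * ACTION_DIM)  # stand-in for fusion + action heads
for p in backbone.parameters():
    p.requires_grad = False                  # frozen: gradients never reach the backbone

optimizer = torch.optim.AdamW(heads.parameters(), lr=2e-4, betas=(0.85, 0.999))

# dummy batches in place of the weighted data-soup loader
loader = [(torch.randn(4, 2048), torch.randn(4, CHUNK * ACTION_DIM)) for _ in range(3)]

for features, target_chunk in loader:
    with torch.no_grad():
        visual = backbone(features)                    # backbone forward only, no grad
    pred = heads(visual)                               # predicted 16-step action chunk
    loss = F.mse_loss(pred, target_chunk)              # MSE against the target chunk
    loss.backward()
    nn.utils.clip_grad_norm_(heads.parameters(), 0.3)  # tight gradient clipping
    optimizer.step()
    optimizer.zero_grad()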
Key Hyperparameters¶
| Parameter | Standard | Large | Why |
|---|---|---|---|
| learning_rate | 2e-4 | 1e-4 | Lower for larger heads — more stable convergence |
| batch_size | 4 | 2 | Each sample needs a backbone forward pass |
| gradient_accumulation | 8 | 16 | Effective batch = batch_size × accum |
| num_action_steps | 16 | 16 | Action chunk size. Match model and data config |
| mlp_hidden | 512 | 2048 | Action head width (4× for large) |
| action_head_layers | 3 | 8 | Action head depth |
| optim | paged_adamw_8bit | same | Saves ~30% VRAM vs standard AdamW |
| adam_β₁ | 0.85 | 0.85 | Lower than 0.9 — faster adaptation (Parameter Golf) |
| max_grad_norm | 0.3 | 0.3 | Tight clipping for small heads (Parameter Golf) |
| bf16 | True | True | BFloat16 mixed precision |
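Presets set these for you, but most are plain attributes on TrainConfig; a sketch of overriding a couple on top of a preset (field names not shown in the examples above, such as gradient_accumulation, are assumptions):
from neon.training.config import large_arms_config

config = large_arms_config()
config.learning_rate = 5e-5        # gentler than the large default
config.batch_size = 1              # fits a smaller GPU
config.gradient_accumulation = 32  # assumed field name; keeps the effective batch at 32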
Saving and Loading¶
What Gets Saved¶
model.save_pretrained("/path/to/model")
# Saves:
# config.json — NeonConfig (full architecture definition)
# action_space.json — G1ActionSpace (joint limits, control mode)
# neon_weights.pt — Action heads + fusion + proprio encoder
Checkpoint size: ~25 MB (standard) or ~100 MB (large). The backbone is NOT saved — it loads from HuggingFace at inference time. This keeps checkpoints tiny and portable.
Load a Trained Model¶
from neon.model.neon_vla import NeonVLA
model = NeonVLA.from_pretrained("YOUR_USERNAME/neon-g1-v1")
model.load_backbone() # Downloads backbone separately
output = model.predict(image=frame, instruction="Pick up the cup")
Push to Hub Manually¶
from huggingface_hub import HfApi
api = HfApi()
api.upload_folder(
folder_path="/tmp/neon_output/best",
repo_id="YOUR_USERNAME/neon-g1-v1",
repo_type="model",
)
Evaluate Your Model¶
After training, measure performance with per-group MSE, baseline comparisons, and trajectory plots:
# Quick eval (10 trajectories)
python scripts/eval_neon.py \
--model YOUR_USERNAME/neon-g1-v1 \
--dataset lerobot/xvla-agibot-world \
--trajs 10
# Full eval with baselines and plots
python scripts/eval_neon.py \
--model YOUR_USERNAME/neon-g1-v1 \
--dataset lerobot/xvla-agibot-world \
--trajs 500 \
--baselines \
--plot
# Eval on HuggingFace GPU
hf jobs uv run --flavor l4x1 --secrets HF_TOKEN --timeout 2h \
scripts/eval_neon.py \
--model YOUR_USERNAME/neon-g1-large-v1 \
--trajs 500 \
--baselines
The eval script reports:
- Total MSE — overall prediction accuracy
- Per-group MSE — arms, locomotion, torso, head, legs separately
- Baselines — zero-action and random-action reference points
- Improvement — percentage better than zero-action baseline
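The improvement figure is the relative MSE reduction against the zero-action baseline; a quick worked example with illustrative numbers (not the script's exact formula):
mse_model = 0.012  # model's total MSE (illustrative)
mse_zero = 0.048   # MSE of always predicting zero actions (illustrative)

improvement = (1 - mse_model / mse_zero) * 100
print(f"{improvement:.1f}% better than the zero-action baseline")  # 75.0%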
→ Evaluation Guide for full details on metrics, visualization, and benchmarking.
Lessons Burned Into Silicon¶
First step is slow
With MoE backbones, the first training step takes 10–20 minutes. Gradient checkpointing recomputes all activations. This is normal — subsequent steps are fast.
flash_attention_2 isn't always there
HF GPU Jobs may not have flash-attn installed. The code auto-detects and falls back to eager attention. If you want FA2 speed: pip install flash-attn --no-build-isolation.
Always push_to_hub=True
HF Jobs have ephemeral storage. If the job times out, your model is gone unless it was pushed to the Hub. This lesson cost real GPU hours. Don't learn it twice.
Start standard, scale to large
Train a standard-size model first (7M params, ~2 hours). Verify the data pipeline works, the loss decreases, the predictions look reasonable. Then scale to large (44M, ~6 hours) with confidence.
→ Next: Evaluation — measure your model's performance