Evaluation

How to measure whether your Neon model actually learned to control a robot — and how to prove it.


Quick Eval

python scripts/eval_neon.py \
    --model YOUR_USERNAME/neon-g1-v1 \
    --dataset lerobot/xvla-agibot-world \
    --trajs 100 \
    --baselines \
    --plot

Output:

══════════════════════════════════════════════════════════════
NEON VLA EVALUATION RESULTS
══════════════════════════════════════════════════════════════
Model:   YOUR_USERNAME/neon-g1-v1
Dataset: lerobot/xvla-agibot-world
Mode:    arms_only
Samples: 1500
Time:    245.3s (6.1 samples/sec)
──────────────────────────────────────────────────────────────
  Total MSE:     0.023451 ± 0.015832
  left_arm        0.028123 ± 0.018541
  right_arm       0.031245 ± 0.020103
  locomotion      0.004521 ± 0.003210
──────────────────────────────────────────────────────────────
  [baseline] Zero:   0.142301
  [baseline] Random: 0.334521
  Improvement over zero: 83.5%
══════════════════════════════════════════════════════════════

What Gets Measured

Per-Group MSE

Every joint group is evaluated independently:

Group                   Joints                                 What It Tells You
left_arm                7 (shoulder, elbow, wrist, gripper)    Left arm manipulation accuracy
right_arm               7                                      Right arm manipulation accuracy
locomotion              3 (vx, vy, ω)                          Walking / base movement quality
torso                   1 (waist yaw)                          Upper body orientation (upper_body+ modes)
head                    2 (pitch, yaw)                         Gaze direction (upper_body+ modes)
left_leg / right_leg    6 each                                 Walking gait quality (whole_body mode)
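
Each line of the results block reports a group as mean ± standard deviation of the per-sample MSE. A minimal sketch of that aggregation, using placeholder arrays rather than real eval results:

import numpy as np

# Placeholder per-sample MSE arrays, standing in for values collected during eval.
rng = np.random.default_rng(0)
per_group_mse = {
    "left_arm": rng.normal(0.028, 0.018, size=1500).clip(min=0),
    "right_arm": rng.normal(0.031, 0.020, size=1500).clip(min=0),
    "locomotion": rng.normal(0.0045, 0.003, size=1500).clip(min=0),
}

for group, values in per_group_mse.items():
    # Each report line is the mean ± std over all evaluated samples.
    print(f"{group:12s} {values.mean():.6f} ± {values.std():.6f}")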

Baselines

Two baselines establish reference points:

  • Zero-action: Always predict zero (standing still). If your model doesn't beat this, it hasn't learned anything useful.
  • Random-action: Predict random values in [-1, 1]. Absolute floor — any trained model should beat this trivially.

The improvement-over-zero percentage is the most meaningful single number. A model above 80% improvement is producing useful, directional predictions.
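
The figure is simply the fraction of the zero-baseline error that the model removes. A quick check using the numbers from the example output above:

model_mse = 0.023451   # Total MSE from the example output above
zero_mse = 0.142301    # Zero-action baseline from the same run

# Fraction of the zero-baseline error removed by the model, as a percentage.
improvement = (1 - model_mse / zero_mse) * 100
print(f"Improvement over zero: {improvement:.1f}%")   # -> 83.5%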

Trajectory Visualization

With --plot, the eval script generates per-group time series comparing predicted (dashed) vs ground truth (solid) actions:

python scripts/eval_neon.py \
    --model YOUR_USERNAME/neon-g1-v1 \
    --trajs 10 \
    --plot \
    --plot-dir /tmp/neon_eval_plots

This creates /tmp/neon_eval_plots/eval_trajectory.png — visually inspect whether the predicted trajectories track the ground truth in shape, timing, and amplitude.
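
The plotting code inside the script may differ, but a minimal sketch of the same kind of figure looks like this; gt and pred are placeholder arrays standing in for one group's ground-truth and predicted trajectories:

import matplotlib.pyplot as plt
import numpy as np

# Placeholder trajectories for one 7-joint group over 150 steps.
rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(scale=0.02, size=(150, 7)), axis=0)   # ground truth (solid)
pred = gt + rng.normal(scale=0.02, size=gt.shape)               # predictions (dashed)

steps = np.arange(gt.shape[0])
fig, ax = plt.subplots(figsize=(8, 4))
for j in range(gt.shape[1]):
    ax.plot(steps, gt[:, j], linestyle="-", linewidth=1)
    ax.plot(steps, pred[:, j], linestyle="--", linewidth=1)
ax.set_xlabel("step")
ax.set_ylabel("normalized action")
ax.set_title("left_arm: ground truth (solid) vs predicted (dashed)")
fig.savefig("/tmp/neon_eval_plots/eval_trajectory_sketch.png", dpi=150)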


Evaluation on HuggingFace

Run eval on cloud GPUs (useful when the model needs a GPU backbone):

hf jobs uv run --flavor l4x1 --secrets HF_TOKEN --timeout 2h \
    scripts/eval_neon.py \
    --model YOUR_USERNAME/neon-g1-large-v1 \
    --dataset lerobot/xvla-agibot-world \
    --trajs 500 \
    --baselines

If the model repo is under your HuggingFace account, results are automatically pushed to the model card as eval_results.json.
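
To pull those results back down for analysis, one option is hf_hub_download from huggingface_hub; this sketch assumes the file sits at the root of the model repo under the name shown above:

import json
from huggingface_hub import hf_hub_download

# Assumes eval_results.json was pushed to the root of the model repo.
path = hf_hub_download(
    repo_id="YOUR_USERNAME/neon-g1-large-v1",
    filename="eval_results.json",
)
with open(path) as f:
    results = json.load(f)

print(results["mse"]["total"]["mean"])   # same schema as the comparison snippet below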


CLI Reference

python scripts/eval_neon.py [OPTIONS]

Options:
  --model MODEL          Model path (HuggingFace ID or local dir)
  --dataset DATASET      Evaluation dataset (HuggingFace ID)
  --mode MODE            Control mode: arms_only, upper_body, whole_body
  --trajs N              Number of trajectories to evaluate
  --steps N              Max steps per trajectory (default: 150)
  --action-horizon N     Action chunk size (default: 16)
  --baselines            Compute zero-action and random baselines
  --plot                 Generate trajectory comparison plots
  --plot-dir DIR         Directory for plots (default: /tmp/neon_eval_plots)
  --output FILE          Path for JSON results (default: /tmp/neon_eval_results.json)
  --backbone MODEL_ID    Override backbone model ID

Python API

For custom evaluation pipelines:

from neon.model.neon_vla import NeonVLA
from neon.data.action_space import G1ActionSpace
import numpy as np

# Load model
model = NeonVLA.from_pretrained("YOUR_USERNAME/neon-g1-v1", load_backbone=True)
model.eval()

# Predict
output = model.predict(
    image=camera_frame,
    instruction="Pick up the red cup",
    proprioception=joint_states,
)

# Compare against ground truth
mse = np.mean((output.actions - target_actions) ** 2)

# Per-group
split_pred = model.action_space.split_action(output.actions)
split_gt = model.action_space.split_action(target_actions)
for group in split_pred:
    group_mse = np.mean((split_pred[group] - split_gt[group]) ** 2)
    print(f"{group}: MSE={group_mse:.6f}")

Comparing Models

Evaluate multiple models on the same dataset, then compare:

# Standard size
python scripts/eval_neon.py --model YOUR_USERNAME/neon-g1-v1 \
    --output /tmp/eval_standard.json --trajs 200 --baselines

# Large size
python scripts/eval_neon.py --model YOUR_USERNAME/neon-g1-large-v1 \
    --output /tmp/eval_large.json --trajs 200 --baselines

# Cosmos backbone
python scripts/eval_neon.py --model YOUR_USERNAME/neon-g1-cosmos-v1 \
    --output /tmp/eval_cosmos.json --trajs 200 --baselines

Then compare the JSON results:

import json

for name in ["standard", "large", "cosmos"]:
    with open(f"/tmp/eval_{name}.json") as f:
        r = json.load(f)
    total = r["mse"]["total"]["mean"]
    zero = r["baselines"]["zero_action"]
    improvement = (1 - total / zero) * 100
    print(f"{name:12s}  MSE={total:.6f}  vs_zero={improvement:.1f}%")

What Good Looks Like

Metric                  Untrained                  Standard (7M)   Large (44M)    Target
Total MSE               ~0.14 (≈ zero baseline)    0.02–0.04       0.01–0.02      < 0.01
Improvement over zero   0%                         70–85%          85–95%         > 95%
Arms MSE                ~0.16                      0.03–0.05       0.01–0.03      < 0.01
Locomotion MSE          ~0.08                      0.005–0.01      0.002–0.005    < 0.002

MSE is a starting point, not the finish line

Low MSE on held-out trajectories means the model predicts actions that match demonstrations. It doesn't guarantee the robot will succeed in the real world — sim2real transfer, temporal smoothing, and closed-loop robustness matter too. But if the MSE isn't good, nothing else will be either.


Next: Video Backbone — customize which model sees the world