Evaluation

How to measure whether your Neon model actually learned to control a robot — and how to prove it.


Quick Eval

python scripts/eval_neon.py \
    --model YOUR_USERNAME/neon-g1-v1 \
    --dataset lerobot/xvla-agibot-world \
    --trajs 100 \
    --baselines \
    --plot

Output:

══════════════════════════════════════════════════════════════
NEON VLA EVALUATION RESULTS
══════════════════════════════════════════════════════════════
Model:   YOUR_USERNAME/neon-g1-v1
Dataset: lerobot/xvla-agibot-world
Mode:    arms_only
Samples: 1500
Time:    245.3s (6.1 samples/sec)
──────────────────────────────────────────────────────────────
  Total MSE:     0.023451 ± 0.015832
  left_arm        0.028123 ± 0.018541
  right_arm       0.031245 ± 0.020103
  locomotion      0.004521 ± 0.003210
──────────────────────────────────────────────────────────────
  [baseline] Zero:   0.142301
  [baseline] Random: 0.334521
  Improvement over zero: 83.5%
══════════════════════════════════════════════════════════════

What Gets Measured

Per-Group MSE

Every joint group is evaluated independently:

Group                   Joints                                 What It Tells You
left_arm                7 (shoulder, elbow, wrist, gripper)    Left arm manipulation accuracy
right_arm               7                                      Right arm manipulation accuracy
locomotion              3 (vx, vy, ω)                          Walking / base movement quality
torso                   1 (waist yaw)                          Upper body orientation (upper_body+ modes)
head                    2 (pitch, yaw)                         Gaze direction (upper_body+ modes)
left_leg / right_leg    6 each                                 Walking gait quality (whole_body mode)
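
Each line of the results block reports a group as mean ± standard deviation of the per-sample MSE. A minimal sketch of that aggregation, using placeholder arrays rather than real eval results:

import numpy as np

# Placeholder per-sample MSE arrays, standing in for values collected during eval.
rng = np.random.default_rng(0)
per_group_mse = {
    "left_arm": rng.normal(0.028, 0.018, size=1500).clip(min=0),
    "right_arm": rng.normal(0.031, 0.020, size=1500).clip(min=0),
    "locomotion": rng.normal(0.0045, 0.003, size=1500).clip(min=0),
}

for group, values in per_group_mse.items():
    # Each report line is the mean ± std over all evaluated samples.
    print(f"{group:12s} {values.mean():.6f} ± {values.std():.6f}")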

Baselines

Two baselines establish reference points:

  • Zero-action: Always predict zero (standing still). If your model doesn't beat this, it hasn't learned anything useful.
  • Random-action: Predict random values in [-1, 1]. Absolute floor — any trained model should beat this trivially.

The improvement-over-zero percentage is the most meaningful single number. A model above 80% improvement is producing useful, directional predictions.
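
The figure is simply the fraction of the zero-baseline error that the model removes. A quick check using the numbers from the example output above:

model_mse = 0.023451   # Total MSE from the example output above
zero_mse = 0.142301    # Zero-action baseline from the same run

# Fraction of the zero-baseline error removed by the model, as a percentage.
improvement = (1 - model_mse / zero_mse) * 100
print(f"Improvement over zero: {improvement:.1f}%")   # -> 83.5%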

Trajectory Visualization

With --plot, the eval script generates per-group time series comparing predicted (dashed) vs ground truth (solid) actions:

python scripts/eval_neon.py \
    --model YOUR_USERNAME/neon-g1-v1 \
    --trajs 10 \
    --plot \
    --plot-dir /tmp/neon_eval_plots

This creates /tmp/neon_eval_plots/eval_trajectory.png — visually inspect whether the predicted trajectories track the ground truth in shape, timing, and amplitude.
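
The plotting code inside the script may differ, but a minimal sketch of the same kind of figure looks like this; gt and pred are placeholder arrays standing in for one group's ground-truth and predicted trajectories:

import matplotlib.pyplot as plt
import numpy as np

# Placeholder trajectories for one 7-joint group over 150 steps.
rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(scale=0.02, size=(150, 7)), axis=0)   # ground truth (solid)
pred = gt + rng.normal(scale=0.02, size=gt.shape)               # predictions (dashed)

steps = np.arange(gt.shape[0])
fig, ax = plt.subplots(figsize=(8, 4))
for j in range(gt.shape[1]):
    ax.plot(steps, gt[:, j], linestyle="-", linewidth=1)
    ax.plot(steps, pred[:, j], linestyle="--", linewidth=1)
ax.set_xlabel("step")
ax.set_ylabel("normalized action")
ax.set_title("left_arm: ground truth (solid) vs predicted (dashed)")
fig.savefig("/tmp/neon_eval_plots/eval_trajectory_sketch.png", dpi=150)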


Evaluation on HuggingFace

Run eval on cloud GPUs (useful when the model needs a GPU backbone):

hf jobs uv run --flavor l4x1 --secrets HF_TOKEN --timeout 2h \
    scripts/eval_neon.py \
    --model YOUR_USERNAME/neon-g1-large-v1 \
    --dataset lerobot/xvla-agibot-world \
    --trajs 500 \
    --baselines

If the model repo is under your HuggingFace account, results are automatically pushed to the model card as eval_results.json.
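
To pull those results back down for analysis, one option is hf_hub_download from huggingface_hub; this sketch assumes the file sits at the root of the model repo under the name shown above:

import json
from huggingface_hub import hf_hub_download

# Assumes eval_results.json was pushed to the root of the model repo.
path = hf_hub_download(
    repo_id="YOUR_USERNAME/neon-g1-large-v1",
    filename="eval_results.json",
)
with open(path) as f:
    results = json.load(f)

print(results["mse"]["total"]["mean"])   # same schema as the comparison snippet below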


CLI Reference

python scripts/eval_neon.py [OPTIONS]

Options:
  --model MODEL          Model path (HuggingFace ID or local dir)
  --dataset DATASET      Evaluation dataset (HuggingFace ID)
  --mode MODE            Control mode: arms_only, upper_body, whole_body
  --trajs N              Number of trajectories to evaluate
  --steps N              Max steps per trajectory (default: 150)
  --action-horizon N     Action chunk size (default: 16)
  --baselines            Compute zero-action and random baselines
  --plot                 Generate trajectory comparison plots
  --plot-dir DIR         Directory for plots (default: /tmp/neon_eval_plots)
  --output FILE          Path for JSON results (default: /tmp/neon_eval_results.json)
  --backbone MODEL_ID    Override backbone model ID

Python API

For custom evaluation pipelines:

from neon.model.neon_vla import NeonVLA
from neon.data.action_space import G1ActionSpace
import numpy as np

# Load model
model = NeonVLA.from_pretrained("YOUR_USERNAME/neon-g1-v1", load_backbone=True)
model.eval()

# Predict
output = model.predict(
    image=camera_frame,
    instruction="Pick up the red cup",
    proprioception=joint_states,
)

# Compare against ground truth
mse = np.mean((output.actions - target_actions) ** 2)

# Per-group
split_pred = model.action_space.split_action(output.actions)
split_gt = model.action_space.split_action(target_actions)
for group in split_pred:
    group_mse = np.mean((split_pred[group] - split_gt[group]) ** 2)
    print(f"{group}: MSE={group_mse:.6f}")

Comparing Models

Evaluate multiple models on the same dataset, then compare:

# Standard size
python scripts/eval_neon.py --model YOUR_USERNAME/neon-g1-v1 \
    --output /tmp/eval_standard.json --trajs 200 --baselines

# Large size
python scripts/eval_neon.py --model YOUR_USERNAME/neon-g1-large-v1 \
    --output /tmp/eval_large.json --trajs 200 --baselines

# Cosmos backbone
python scripts/eval_neon.py --model YOUR_USERNAME/neon-g1-cosmos-v1 \
    --output /tmp/eval_cosmos.json --trajs 200 --baselines

Then compare the JSON results:

import json

for name in ["standard", "large", "cosmos"]:
    with open(f"/tmp/eval_{name}.json") as f:
        r = json.load(f)
    total = r["mse"]["total"]["mean"]
    zero = r["baselines"]["zero_action"]
    improvement = (1 - total / zero) * 100
    print(f"{name:12s}  MSE={total:.6f}  vs_zero={improvement:.1f}%")

What Good Looks Like

Metric                  Untrained                  Standard (7M)   Large (44M)    Target
Total MSE               ~0.14 (≈ zero baseline)    0.02–0.04       0.01–0.02      < 0.01
Improvement over zero   0%                         70–85%          85–95%         > 95%
Arms MSE                ~0.16                      0.03–0.05       0.01–0.03      < 0.01
Locomotion MSE          ~0.08                      0.005–0.01      0.002–0.005    < 0.002

MSE is a starting point, not the finish line

Low MSE on held-out trajectories means the model predicts actions that match demonstrations. It doesn't guarantee the robot will succeed in the real world — sim2real transfer, temporal smoothing, and closed-loop robustness matter too. But if the MSE isn't good, nothing else will be either.


Next: Video Backbone — customize which model sees the world