Evaluation¶
How to measure whether your Neon model actually learned to control a robot — and how to prove it.
Quick Eval¶
python scripts/eval_neon.py \
--model YOUR_USERNAME/neon-g1-v1 \
--dataset lerobot/xvla-agibot-world \
--trajs 100 \
--baselines \
--plot
Output:
══════════════════════════════════════════════════════════════
NEON VLA EVALUATION RESULTS
══════════════════════════════════════════════════════════════
Model: YOUR_USERNAME/neon-g1-v1
Dataset: lerobot/xvla-agibot-world
Mode: arms_only
Samples: 1500
Time: 245.3s (6.1 samples/sec)
──────────────────────────────────────────────────────────────
Total MSE: 0.023451 ± 0.015832
left_arm 0.028123 ± 0.018541
right_arm 0.031245 ± 0.020103
locomotion 0.004521 ± 0.003210
──────────────────────────────────────────────────────────────
[baseline] Zero: 0.142301
[baseline] Random: 0.334521
Improvement over zero: 83.5%
══════════════════════════════════════════════════════════════
What Gets Measured¶
Per-Group MSE¶
Every joint group is evaluated independently:
| Group | Joints | What It Tells You |
|---|---|---|
| `left_arm` | 7 (shoulder, elbow, wrist, gripper) | Left arm manipulation accuracy |
| `right_arm` | 7 | Right arm manipulation accuracy |
| `locomotion` | 3 (vx, vy, ω) | Walking / base movement quality |
| `torso` | 1 (waist yaw) | Upper body orientation (upper_body+ modes) |
| `head` | 2 (pitch, yaw) | Gaze direction (upper_body+ modes) |
| `left_leg` / `right_leg` | 6 each | Walking gait quality (whole_body mode) |
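Under the hood the split is just index bookkeeping over the flat action vector. A minimal sketch of the per-group computation, using made-up index ranges for illustration (the real layout comes from the model's action space, e.g. G1ActionSpace):
import numpy as np

# Hypothetical index ranges, for illustration only; the real layout is
# defined by the action space shipped with the model.
GROUPS = {
    "left_arm": slice(0, 7),
    "right_arm": slice(7, 14),
    "locomotion": slice(14, 17),
}

def per_group_mse(pred, target):
    """MSE between predicted and ground-truth actions, reported per joint group."""
    return {
        name: float(np.mean((pred[..., idx] - target[..., idx]) ** 2))
        for name, idx in GROUPS.items()
    }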
Baselines¶
Two baselines establish reference points:
- Zero-action: Always predict zero (standing still). If your model doesn't beat this, it hasn't learned anything useful.
- Random-action: Predict random values in [-1, 1]. Absolute floor — any trained model should beat this trivially.
The improvement over zero percentage is the most meaningful single number. A model at 80%+ improvement is producing meaningful, directional predictions.
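Both baselines are cheap to reproduce in a few lines of NumPy. A sketch under the assumption that actions are normalized arrays of shape (N, action_dim); this mirrors the idea, not necessarily the script's exact implementation:
import numpy as np

def baseline_mses(targets, seed=0):
    """Zero-action and random-action MSE against ground-truth actions."""
    rng = np.random.default_rng(seed)
    zero_mse = float(np.mean(targets ** 2))                    # predicting all zeros
    random_pred = rng.uniform(-1.0, 1.0, size=targets.shape)   # predicting uniform noise
    random_mse = float(np.mean((random_pred - targets) ** 2))
    return zero_mse, random_mse

def improvement_over_zero(model_mse, zero_mse):
    """The headline number: how much better than standing still, in percent."""
    return (1.0 - model_mse / zero_mse) * 100.0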
Trajectory Visualization¶
With --plot, the eval script generates per-group time series comparing predicted (dashed) vs ground truth (solid) actions:
python scripts/eval_neon.py \
--model YOUR_USERNAME/neon-g1-v1 \
--trajs 10 \
--plot \
--plot-dir /tmp/neon_eval_plots
This creates /tmp/neon_eval_plots/eval_trajectory.png — visually inspect whether the predicted trajectories track the ground truth in shape, timing, and amplitude.
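For custom plots outside the script, a minimal matplotlib sketch with the same solid-vs-dashed convention (array shapes and names here are assumptions):
import matplotlib.pyplot as plt
import numpy as np

def plot_group(pred, gt, group_name, out_path):
    """Overlay predicted (dashed) and ground-truth (solid) actions for one joint group.

    Assumes pred and gt are (T, num_joints) arrays for a single trajectory."""
    t = np.arange(pred.shape[0])
    fig, ax = plt.subplots(figsize=(8, 3))
    for j in range(pred.shape[1]):
        ax.plot(t, gt[:, j], linewidth=1.0)                    # ground truth: solid
        ax.plot(t, pred[:, j], linestyle="--", linewidth=1.0)  # prediction: dashed
    ax.set_title(group_name)
    ax.set_xlabel("timestep")
    ax.set_ylabel("normalized action")
    fig.tight_layout()
    fig.savefig(out_path)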
Evaluation on HuggingFace¶
Run eval on cloud GPUs (useful when the model needs a GPU backbone):
hf jobs uv run --flavor l4x1 --secrets HF_TOKEN --timeout 2h \
scripts/eval_neon.py \
--model YOUR_USERNAME/neon-g1-large-v1 \
--dataset lerobot/xvla-agibot-world \
--trajs 500 \
--baselines
If the model repo is under your HuggingFace account, results are automatically pushed to the model card as eval_results.json.
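If you run eval somewhere else and still want the results on the Hub, the upload is a few lines with huggingface_hub (repo ID and file paths are placeholders; you need write access via HF_TOKEN):
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment
api.upload_file(
    path_or_fileobj="/tmp/neon_eval_results.json",
    path_in_repo="eval_results.json",
    repo_id="YOUR_USERNAME/neon-g1-large-v1",
    commit_message="Add Neon eval results",
)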
CLI Reference¶
python scripts/eval_neon.py [OPTIONS]
Options:
--model MODEL Model path (HuggingFace ID or local dir)
--dataset DATASET Evaluation dataset (HuggingFace ID)
--mode MODE Control mode: arms_only, upper_body, whole_body
--trajs N Number of trajectories to evaluate
--steps N Max steps per trajectory (default: 150)
--action-horizon N Action chunk size (default: 16)
--baselines Compute zero-action and random baselines
--plot Generate trajectory comparison plots
--plot-dir DIR Directory for plots (default: /tmp/neon_eval_plots)
--output FILE Path for JSON results (default: /tmp/neon_eval_results.json)
--backbone MODEL_ID Override backbone model ID
Python API¶
For custom evaluation pipelines:
from neon.model.neon_vla import NeonVLA
from neon.data.action_space import G1ActionSpace
import numpy as np
# Load model
model = NeonVLA.from_pretrained("YOUR_USERNAME/neon-g1-v1", load_backbone=True)
model.eval()
# Predict
output = model.predict(
    image=camera_frame,
    instruction="Pick up the red cup",
    proprioception=joint_states,
)
# Compare against ground truth
mse = np.mean((output.actions - target_actions) ** 2)
# Per-group
split_pred = model.action_space.split_action(output.actions)
split_gt = model.action_space.split_action(target_actions)
for group in split_pred:
    group_mse = np.mean((split_pred[group] - split_gt[group]) ** 2)
    print(f"{group}: MSE={group_mse:.6f}")
Comparing Models¶
Evaluate multiple models on the same dataset, then compare:
# Standard size
python scripts/eval_neon.py --model YOUR_USERNAME/neon-g1-v1 \
--output /tmp/eval_standard.json --trajs 200 --baselines
# Large size
python scripts/eval_neon.py --model YOUR_USERNAME/neon-g1-large-v1 \
--output /tmp/eval_large.json --trajs 200 --baselines
# Cosmos backbone
python scripts/eval_neon.py --model YOUR_USERNAME/neon-g1-cosmos-v1 \
--output /tmp/eval_cosmos.json --trajs 200 --baselines
Then compare the JSON results:
import json
for name in ["standard", "large", "cosmos"]:
    with open(f"/tmp/eval_{name}.json") as f:
        r = json.load(f)
    total = r["mse"]["total"]["mean"]
    zero = r["baselines"]["zero_action"]
    improvement = (1 - total / zero) * 100
    print(f"{name:12s} MSE={total:.6f} vs_zero={improvement:.1f}%")
What Good Looks Like¶
| Metric | Untrained | Standard (7M) | Large (44M) | Target |
|---|---|---|---|---|
| Total MSE | ~0.14 (≈ zero baseline) | 0.02–0.04 | 0.01–0.02 | < 0.01 |
| Improvement over zero | 0% | 70–85% | 85–95% | > 95% |
| Arms MSE | ~0.16 | 0.03–0.05 | 0.01–0.03 | < 0.01 |
| Locomotion MSE | ~0.08 | 0.005–0.01 | 0.002–0.005 | < 0.002 |
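To check a results file against these targets programmatically, a small sketch that assumes the JSON layout used in the comparison snippet above (and requires the run to have used --baselines):
import json

with open("/tmp/neon_eval_results.json") as f:
    r = json.load(f)

total = r["mse"]["total"]["mean"]
improvement = (1 - total / r["baselines"]["zero_action"]) * 100
print(f"total MSE: {total:.4f} (target < 0.01)")
print(f"improvement over zero: {improvement:.1f}% (target > 95%)")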
MSE is a starting point, not the finish line
Low MSE on held-out trajectories means the model predicts actions that match demonstrations. It doesn't guarantee the robot will succeed in the real world — sim2real transfer, temporal smoothing, and closed-loop robustness matter too. But if the MSE isn't good, nothing else will be either.
→ Next: Video Backbone — customize which model sees the world