Benchmarks

How Neon compares against GR00T N1, OmniVLA, and Cosmos-Predict2.5 — and how to reproduce every number.


The Comparison Table

This is the table we're filling in. The TBD values are measured by running benchmarks/run_benchmarks.py.

| Metric | GR00T N1 | OmniVLA | Cosmos 2.5 | Neon (7M) | Neon (44M) |
|---|---|---|---|---|---|
| Checkpoint size | ~2 GB | ~14 GB | N/A | ~25 MB | ~100 MB |
| Trainable params | 42M | ~7B | N/A | 7M | 44M |
| Training time | Days | Days | N/A | 18h | 45h |
| Training cost | $$$ | $$$ | N/A | $45 | $111 |
| Inference (Jetson) | ~80ms | ❌ OOM | | TBD | TBD |
| Inference (4090) | ~30ms | ~200ms | ~500ms | TBD | TBD |
| SIMPLER avg | TBD | TBD | N/A | TBD | TBD |
| Agibot MSE | TBD | TBD | N/A | TBD | TBD |
| Unseen zero-shot | ❌ needs FT | Limited | ✅ video only | TBD | TBD |
| Audio commands | | | | ✅ | ✅ |
| RTC blending | | | N/A | ✅ | ✅ |
| Open source | Partial | Partial | | ✅ | ✅ |

Five Benchmark Suites

1. SIMPLER — Simulated Manipulation

Cross-embodiment evaluation built on Open X-Embodiment setups. Tests whether data-soup training generalizes to different robot bodies.

```bash
python benchmarks/run_benchmarks.py --model cagataydev/neon-g1-v1 --suite simpler
```
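
Under the hood the suite is a rollout loop. Here is a minimal sketch using SimplerEnv's gym-style interface, with a random action standing in for the model's prediction (run_benchmarks.py handles model loading and scoring):

```python
# Minimal SIMPLER rollout sketch. Assumes the SimplerEnv package is installed;
# the random action below is a stand-in for Neon's action prediction.
import simpler_env
from simpler_env.utils.env.observation_utils import get_image_from_maniskill2_obs_dict

env = simpler_env.make("google_robot_pick_coke_can")
obs, reset_info = env.reset()
instruction = env.get_language_instruction()

done, truncated = False, False
while not (done or truncated):
    image = get_image_from_maniskill2_obs_dict(env, obs)  # RGB frame for the policy
    action = env.action_space.sample()  # replace with model(image, instruction)
    obs, reward, done, truncated, info = env.step(action)

print("episode stats:", info.get("episode_stats", {}))
```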

2. LIBERO — 130-Task Evaluation

Compositional reasoning, spatial understanding, long-horizon planning.

```bash
python benchmarks/run_benchmarks.py --model cagataydev/neon-g1-v1 --suite libero
```

3. Agibot-World — Bimanual Manipulation

In-distribution sanity check on held-out data. Per-group MSE for left arm, right arm, locomotion.

```bash
python benchmarks/run_benchmarks.py --model cagataydev/neon-g1-v1 --suite agibot_eval
```
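
The per-group numbers are just MSEs over disjoint slices of the action vector. A minimal sketch; the dimension ranges below are hypothetical, so match them to the actual Agibot action layout:

```python
# Per-group action MSE sketch. GROUPS is hypothetical; the real
# dimension layout comes from the Agibot-World action spec.
import numpy as np

GROUPS = {"left_arm": slice(0, 7), "right_arm": slice(7, 14), "locomotion": slice(14, 17)}

def per_group_mse(pred: np.ndarray, target: np.ndarray) -> dict:
    """pred/target: (N, T, D) predicted and ground-truth action chunks."""
    return {name: float(np.mean((pred[..., s] - target[..., s]) ** 2))
            for name, s in GROUPS.items()}

pred = np.random.randn(64, 16, 17)    # stand-in predictions
target = np.random.randn(64, 16, 17)  # stand-in ground truth
print(per_group_mse(pred, target))
```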

4. Unseen Zero-Shot — The Killer Benchmark

12 tasks deliberately excluded from training data. Tests whether the video foundation model backbone transfers general world knowledge to action prediction.

| Category | Example Task | Why It's Hard |
|---|---|---|
| Novel objects | "Pick up the rubber duck" | Not in bridge/agibot data |
| Novel instructions | "Stack blocks by color" | Compositional reasoning |
| Novel physics | "Pour water into the cup" | Fluid dynamics |
| Novel environments | "Open the microwave" | Unseen affordances |

```bash
python benchmarks/run_benchmarks.py --model cagataydev/neon-g1-v1 --suite unseen_zeroshot
```
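
Results are reported per category as well as overall. A sketch of the aggregation, with illustrative task names (the real 12-task list lives in the benchmark configs):

```python
# Per-category success aggregation sketch; tuples are illustrative placeholders.
from collections import defaultdict

results = [
    ("novel_objects", "pick_rubber_duck", 0.45),
    ("novel_physics", "pour_water", 0.10),
    # ... one entry per rollout set, 12 tasks total
]

by_category = defaultdict(list)
for category, task, success in results:
    by_category[category].append(success)

for category, scores in sorted(by_category.items()):
    print(f"{category}: {sum(scores) / len(scores):.2f}")
```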

5. Efficiency — Latency, VRAM, Size

Practical deployment metrics that differentiate Neon from larger VLAs.

```bash
python benchmarks/run_benchmarks.py --model cagataydev/neon-g1-v1 --suite efficiency
```
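
For context, latency and peak VRAM can be measured with standard PyTorch utilities. A minimal sketch, with a linear layer standing in for the actual Neon forward pass:

```python
# Latency / peak-VRAM measurement sketch (CUDA). The model below is a
# stand-in; the efficiency suite measures the real policy.
import time
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()
x = torch.randn(1, 1024, device="cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    for _ in range(10):  # warmup iterations
        model(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()
latency_ms = (time.perf_counter() - t0) / 100 * 1e3

print(f"latency: {latency_ms:.2f} ms")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
```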

Run Everything

```bash
# All benchmarks, all metrics
python benchmarks/run_benchmarks.py \
    --model cagataydev/neon-g1-v1 \
    --suite all \
    --output benchmarks/results/

# On HuggingFace GPU
hf jobs uv run --flavor a100-large --secrets HF_TOKEN --timeout 6h \
    benchmarks/run_benchmarks.py \
    --model cagataydev/neon-g1-v1 \
    --suite all
```

Results are saved as JSON in benchmarks/results/ and automatically pushed to the HuggingFace model card.
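
The push itself can be done with huggingface_hub. A sketch of the mechanism (the payload and in-repo path are illustrative; run_benchmarks.py does this automatically):

```python
# Sketch: write a results JSON and upload it to the model repo.
# Payload and path_in_repo are illustrative, not the script's actual layout.
import json
from huggingface_hub import HfApi

results = {"suite": "simpler", "avg_success": None}  # placeholder payload
with open("benchmarks/results/simpler.json", "w") as f:
    json.dump(results, f, indent=2)

HfApi().upload_file(
    path_or_fileobj="benchmarks/results/simpler.json",
    path_in_repo="benchmarks/simpler.json",
    repo_id="cagataydev/neon-g1-v1",
)
```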


Training Tiers

Three training configs from fast iteration to maximum scale:

| Tier | Config | Data | Head params | Time | Cost |
|---|---|---|---|---|---|
| v1-dev | benchmarks/configs/v1_dev.yaml | 171K samples | 7M | 18h | $45 |
| v1-release | benchmarks/configs/v1_release.yaml | 171K samples | 44M | 45h | $111 |
| v1-full | benchmarks/configs/v1_full.yaml | 2.15M samples | 44M | 336h | $1,680 |

Train v1-dev first. Evaluate. If benchmarks pass, train v1-release. Details in benchmarks/PLAN.md.
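
To sanity-check a tier before launching it, the config is plain YAML. A sketch assuming PyYAML; the field names depend on the actual schema:

```python
# Inspect a tier config before training. The schema is whatever
# benchmarks/configs/v1_dev.yaml actually contains; nothing is assumed here.
import yaml

with open("benchmarks/configs/v1_dev.yaml") as f:
    cfg = yaml.safe_load(f)

print(yaml.safe_dump(cfg, sort_keys=False))  # e.g. data mix, head size, steps
```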


Neon's Thesis

The benchmarks are designed to test five claims:

  1. Smallest usable VLA — 25 MB checkpoint, runs on Jetson → Efficiency suite
  2. Video backbone = zero-shot generalization — trained on bridge, works on unseen tasks → Unseen Zero-Shot suite
  3. Audio-native — speak to your robot → Qualitative demo
  4. Train in a day, not a week — frozen backbone, only heads train → Training time measurement
  5. RTC-enabled — smooth real-time control → Latency + trajectory smoothness (see the blending sketch below)
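
On claim 5: RTC blending smooths the seam between successive action chunks. A generic linear cross-fade sketch (not necessarily the exact scheme Neon uses):

```python
# Generic chunk-blending sketch: linearly cross-fade the overlapping steps
# of the previous and newly predicted action chunks.
import numpy as np

def blend_chunks(prev_chunk: np.ndarray, new_chunk: np.ndarray, overlap: int) -> np.ndarray:
    """prev_chunk/new_chunk: (T, D) chunks; the first `overlap` steps of
    new_chunk coincide in time with the last `overlap` steps of prev_chunk."""
    w = np.linspace(0.0, 1.0, overlap)[:, None]  # 0 -> old chunk, 1 -> new chunk
    blended = (1 - w) * prev_chunk[-overlap:] + w * new_chunk[:overlap]
    return np.concatenate([blended, new_chunk[overlap:]], axis=0)

prev = np.zeros((16, 7))
new = np.ones((16, 7))
print(blend_chunks(prev, new, overlap=4).shape)  # (16, 7)
```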

The benchmarks prove or disprove these claims. We write the paper second.


Back to the Evaluation Guide for detailed eval metrics and visualization.