# Benchmarks
How Neon compares against GR00T N1, OmniVLA, and Cosmos-Predict2.5 — and how to reproduce every number.
## The Comparison Table
This is the table we're filling. TBD cells are measured by running `benchmarks/run_benchmarks.py`.
| Metric | GR00T N1 | OmniVLA | Cosmos 2.5 | Neon (7M) | Neon (44M) |
|---|---|---|---|---|---|
| Checkpoint size | ~2 GB | ~14 GB | N/A | ~25 MB | ~100 MB |
| Trainable params | 42M | ~7B | N/A | 7M | 44M |
| Training time | Days | Days | N/A | 18h | 45h |
| Training cost | $$$ | $$$ | N/A | $45 | $111 |
| Inference (Jetson) | ~80ms | ❌ OOM | ❌ | TBD | TBD |
| Inference (4090) | ~30ms | ~200ms | ~500ms | TBD | TBD |
| SIMPLER avg | TBD | TBD | N/A | TBD | TBD |
| Agibot MSE | TBD | TBD | N/A | TBD | TBD |
| Unseen zero-shot | ❌ needs FT | Limited | ✅ video only | TBD | TBD |
| Audio commands | ❌ | ❌ | ❌ | ✅ | ✅ |
| RTC blending | ❌ | ❌ | N/A | ✅ | ✅ |
| Open source | Partial | ✅ | Partial | ✅ | ✅ |
## Five Benchmark Suites
### 1. SIMPLER — Simulated Manipulation
Cross-embodiment evaluation from Open X-Embodiment. Tests whether data soup training generalizes to different robot bodies.
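As a rough sketch of how the "SIMPLER avg" cell in the table above is derived, per-task success rates are averaged; the task names and result layout below are placeholders, not the schema actually written by `benchmarks/run_benchmarks.py`.

```python
# Hypothetical per-task success rates; the real suite writes its numbers
# to benchmarks/results/ via run_benchmarks.py.
simpler_success = {
    "put_spoon_on_towel": 0.42,
    "put_carrot_on_plate": 0.35,
    "stack_green_block_on_yellow_block": 0.28,
}

simpler_avg = sum(simpler_success.values()) / len(simpler_success)
print(f"SIMPLER avg: {simpler_avg:.3f}")
```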
### 2. LIBERO — 130-Task Evaluation
Compositional reasoning, spatial understanding, long-horizon planning.
### 3. Agibot-World — Bimanual Manipulation
In-distribution sanity check on held-out data. Per-group MSE for left arm, right arm, locomotion.
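A minimal sketch of the per-group MSE, assuming the action vector is split into named index ranges; the slices below (7 dims per arm, 3 for locomotion) are illustrative, not the actual Agibot-World action layout.

```python
import numpy as np

# Illustrative action-dimension groups; the real layout may differ.
GROUPS = {
    "left_arm": slice(0, 7),
    "right_arm": slice(7, 14),
    "locomotion": slice(14, 17),
}

def per_group_mse(pred, target):
    """MSE over each named slice of the action vector.

    pred, target: arrays of shape (num_samples, action_dim).
    """
    return {
        name: float(np.mean((pred[:, idx] - target[:, idx]) ** 2))
        for name, idx in GROUPS.items()
    }

# Random stand-ins for predicted and ground-truth actions.
pred = np.random.randn(64, 17)
target = np.random.randn(64, 17)
print(per_group_mse(pred, target))
```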
### 4. Unseen Zero-Shot — The Killer Benchmark
12 tasks deliberately excluded from training data. Tests whether the video foundation model backbone transfers general world knowledge to action prediction (the exclusion check is sketched after the table below).
| Category | Example Task | Why It's Hard |
|---|---|---|
| Novel objects | "Pick up the rubber duck" | Not in bridge/agibot data |
| Novel instructions | "Stack blocks by color" | Compositional reasoning |
| Novel physics | "Pour water into the cup" | Fluid dynamics |
| Novel environments | "Open the microwave" | Unseen affordances |
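A minimal sketch of the exclusion check, assuming instructions are compared as plain strings; the task list and training instructions below are placeholders, not the actual 12 held-out tasks or the real training set.

```python
# Held-out zero-shot instructions (placeholders drawn from the table above).
ZERO_SHOT_TASKS = [
    "pick up the rubber duck",
    "stack blocks by color",
    "pour water into the cup",
    "open the microwave",
]

# Placeholder stand-in for the instructions seen during training.
training_instructions = {
    "pick up the spoon",
    "put the carrot on the plate",
}

leaked = [t for t in ZERO_SHOT_TASKS if t in training_instructions]
assert not leaked, f"zero-shot tasks found in training data: {leaked}"
```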
### 5. Efficiency — Latency, VRAM, Size
Practical deployment metrics that differentiate Neon from larger VLAs.
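A minimal sketch of how latency and peak VRAM can be measured, using a tiny stand-in module where the real benchmark would load the Neon policy; checkpoint size is just the file size on disk.

```python
import time
import torch

# Stand-ins for the policy and one observation; Neon's real model and
# input format differ.
model = torch.nn.Linear(512, 32).cuda().eval()
obs = torch.randn(1, 512, device="cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    for _ in range(10):          # warm-up passes
        model(obs)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):         # timed passes
        model(obs)
    torch.cuda.synchronize()

latency_ms = (time.perf_counter() - start) / 100 * 1000
vram_mb = torch.cuda.max_memory_allocated() / 1024**2
# Checkpoint size would be os.path.getsize() on the saved weights file.
print(f"{latency_ms:.2f} ms/step, peak {vram_mb:.0f} MB VRAM")
```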
## Run Everything
```bash
# All benchmarks, all metrics
python benchmarks/run_benchmarks.py \
    --model cagataydev/neon-g1-v1 \
    --suite all \
    --output benchmarks/results/

# On a HuggingFace GPU
hf jobs uv run --flavor a100-large --secrets HF_TOKEN --timeout 6h \
    benchmarks/run_benchmarks.py \
    --model cagataydev/neon-g1-v1 \
    --suite all
```
Results are saved as JSON in `benchmarks/results/` and automatically pushed to the HuggingFace model card.
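The push happens automatically inside `run_benchmarks.py`; if you need to upload a results file by hand, a minimal sketch with `huggingface_hub` looks like this (the file name is a placeholder).

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="benchmarks/results/simpler.json",  # placeholder file
    path_in_repo="benchmarks/simpler.json",
    repo_id="cagataydev/neon-g1-v1",
    repo_type="model",
)
```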
## Training Tiers
Three training configs from fast iteration to maximum scale:
| Tier | Config | Data | Heads | Time | Cost |
|---|---|---|---|---|---|
| v1-dev | `benchmarks/configs/v1_dev.yaml` | 171K samples | 7M | 18h | $45 |
| v1-release | `benchmarks/configs/v1_release.yaml` | 171K samples | 44M | 45h | $111 |
| v1-full | `benchmarks/configs/v1_full.yaml` | 2.15M samples | 44M | 336h | $1,680 |
Train v1-dev first and evaluate. If the benchmarks pass, train v1-release (a minimal gating sketch follows). Details in `benchmarks/PLAN.md`.
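A minimal gating sketch for that workflow: read the v1-dev results JSON and decide whether to launch v1-release. The file name, result key, and 0.30 threshold are illustrative assumptions, not targets from `benchmarks/PLAN.md`.

```python
import json

# Placeholder results file written by run_benchmarks.py for the v1-dev model.
with open("benchmarks/results/v1_dev.json") as f:
    results = json.load(f)

# Hypothetical key and threshold; adjust to the actual metrics and targets.
if results.get("simpler_avg", 0.0) >= 0.30:
    print("v1-dev passes; launch v1-release with benchmarks/configs/v1_release.yaml")
else:
    print("v1-dev below threshold; iterate before scaling up")
```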
## Neon's Thesis
The benchmarks are designed to test five claims:
- Smallest usable VLA — 25 MB checkpoint, runs on Jetson → Efficiency suite
- Video backbone = zero-shot generalization — trained on bridge, works on unseen tasks → Unseen Zero-Shot suite
- Audio-native — speak to your robot → Qualitative demo
- Train in a day, not a week — frozen backbone, only heads train → Training time measurement
- RTC-enabled — smooth real-time control → Latency + trajectory smoothness
The benchmarks prove or disprove these claims. We write the paper second.
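For claim 4, a minimal sketch of the frozen-backbone setup using stand-in modules; Neon's actual backbone and head classes differ, but the pattern is the same: freeze the backbone, give the optimizer only head parameters.

```python
import torch

# Stand-ins: the real model is a video foundation backbone plus action heads.
backbone = torch.nn.Linear(512, 512)
action_head = torch.nn.Linear(512, 32)

for p in backbone.parameters():
    p.requires_grad_(False)  # frozen: no gradients, no optimizer state

optimizer = torch.optim.AdamW(action_head.parameters(), lr=1e-4)
trainable = sum(p.numel() for p in action_head.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")
```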
→ Back to Evaluation Guide for detailed eval metrics and visualization